The main component in computing sentence relevance in most
summarization systems, including SUMMaR, is term relevance.
Each sentence in the text is assigned a weight based on the terms occurring in it.
We experimented with various forms of terms: word forms, stems, and character-based n-grams of length n=4 and n=5. We observed the best results with the n-grams, which seem well suited to a highly inflected language like German.
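As a rough illustration of character-based n-grams, the following Python sketch splits each word into overlapping substrings of length n. It is not SUMMaR's implementation: the function name `char_ngrams`, the lowercasing, the restriction of n-grams to single words, and the handling of words no longer than n are all assumptions made for this example.

```python
import re


def char_ngrams(text, n=4):
    """Return the character n-grams of every word in `text`.

    Assumptions (not taken from SUMMaR): text is lowercased, n-grams
    never cross word boundaries, and words of length <= n are kept whole.
    """
    words = re.findall(r"\w+", text.lower())
    grams = []
    for word in words:
        if len(word) <= n:
            grams.append(word)
        else:
            # Slide a window of width n over the word.
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams
```

With this sketch, inflected forms such as "Haus" and "Häuser" still differ in some n-grams, but related forms like "Häuser" and "Häusern" share most of theirs, which is one plausible reason n-grams work well for German.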
For each term occurring in the text, a text-specific weight is calculated. The weight of a term depends on the frequency of the term in the text (tf) and on its document frequency in a corpus (df). This form of weighting is called TF-IDF.
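The exact weighting formula is not given here, so the following Python sketch uses one common TF-IDF variant as an assumption: the raw term frequency multiplied by a smoothed logarithmic inverse document frequency. The names `tfidf_weights`, `doc_freq` and `num_docs` are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(text_terms, doc_freq, num_docs):
    """Compute a text-specific TF-IDF weight for each term.

    text_terms: list of terms occurring in the text (with repetitions)
    doc_freq:   dict mapping a term to the number of corpus documents containing it
    num_docs:   total number of documents in the corpus

    Assumed formula: tf * log((num_docs + 1) / (df + 1)). The +1 smoothing
    avoids division by zero for terms that are absent from the corpus.
    """
    tf = Counter(text_terms)
    return {
        term: count * math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))
        for term, count in tf.items()
    }
```

Under this assumed formula, a term that appears in nearly every corpus document receives a weight close to zero, while a term that is frequent in the text but rare in the corpus receives a high weight.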
SUMMaR distinguishes between different text types. For each of these text types (general news, news commentaries, film reviews, hotel reviews, political speeches, press releases in the pharmaceutical industry) we provided a corpus of document frequencies, as well as one general corpus for German and one for English.
To compute the relevance of a sentence we experimented with several
variants. These include the similarity between a sentence vector and a
text vector built from the weighted terms, average measures, and others.
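Two of these variants can be sketched in Python as follows. This is only an illustration under stated assumptions, not the measures evaluated for SUMMaR: the vector similarity is assumed to be cosine similarity, the sentence vector is assumed to hold the weight of each distinct term in the sentence, and the function names and the `weights` dictionary (term to text-specific weight) are hypothetical.

```python
import math


def cosine_relevance(sentence_terms, weights):
    """Cosine similarity between a sentence vector and the text vector.

    Assumption: the text vector is the full `weights` dict, and the
    sentence vector keeps only the weights of terms in the sentence.
    """
    sent_vec = {t: weights.get(t, 0.0) for t in set(sentence_terms)}
    dot = sum(w * weights.get(t, 0.0) for t, w in sent_vec.items())
    norm_s = math.sqrt(sum(w * w for w in sent_vec.values()))
    norm_t = math.sqrt(sum(w * w for w in weights.values()))
    if norm_s == 0 or norm_t == 0:
        return 0.0
    return dot / (norm_s * norm_t)


def average_relevance(sentence_terms, weights):
    """Average weight of the terms in a sentence (0.0 for an empty sentence)."""
    if not sentence_terms:
        return 0.0
    return sum(weights.get(t, 0.0) for t in sentence_terms) / len(sentence_terms)
```

The cosine variant favors sentences that cover many of the highly weighted terms of the text, while the average variant favors sentences whose terms are highly weighted on average, regardless of sentence length.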
A detailed evaluation of sentence relevance calculation, varying both the form of the terms and the method of calculation, is published in our paper "Measures for Term and Sentence Relevances: an Evaluation for German".