In the current SUMMaR pipeline we have implemented the text segmentation
algorithm by Masao Utiyama and Hitoshi Isahara.
The basic idea of this algorithm is to calculate the cost of each
possible span within the text and then find the minimum-cost path
through the text. The main factors in the cost of a span are the number
of word repetitions and the number of different words. In our
implementation we use character-based 4-grams instead of word forms and
the tf-idf value instead of the absolute term frequency.
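
The span-cost and minimum-cost-path idea can be illustrated with a small dynamic program. This is only a minimal sketch, not the SUMMaR code: the function names (`char_ngrams`, `segment`) are hypothetical, the cost uses plain 4-gram counts with add-one smoothing plus a fixed per-segment penalty, and the tf-idf weighting mentioned above is left out for brevity.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of a whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def segment(units, n=4):
    """Split a list of text units into contiguous spans of minimal total cost.

    Boundaries are only allowed between units. Returns (start, end) index
    pairs with `end` exclusive.
    """
    grams = [char_ngrams(u, n) for u in units]
    vocab = len({g for gs in grams for g in gs})
    total = sum(len(gs) for gs in grams)
    # Assumed prior: every additional segment costs log(total number of n-grams).
    penalty = math.log(total) if total > 1 else 0.0

    def span_cost(i, j):
        # Repeated n-grams make a span cheap, many different n-grams make it expensive.
        counts = Counter(g for gs in grams[i:j] for g in gs)
        size = sum(counts.values())
        return penalty - sum(
            c * math.log((c + 1) / (size + vocab)) for c in counts.values()
        )

    # best[j] is the cost of the cheapest segmentation of units[0:j].
    best = [0.0] + [math.inf] * len(units)
    back = [0] * (len(units) + 1)
    for j in range(1, len(units) + 1):
        for i in range(j):
            cost = best[i] + span_cost(i, j)
            if cost < best[j]:
                best[j], back[j] = cost, i

    # Follow the back pointers from the end of the text to recover the path.
    spans = []
    j = len(units)
    while j > 0:
        spans.append((back[j], j))
        j = back[j]
    return spans[::-1]
```

Calling `segment(paragraphs)` on a list of paragraph strings returns spans that cover the whole list without gaps. The sketch recomputes the counts for every span, which is fine for short texts but would need cumulative counts for long ones.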
In the first step we find the main topics of the text, using the
identified paragraph boundaries as possible segment boundaries. In the
next step we split the larger paragraphs into subtopics, using the
sentence boundaries as possible segment boundaries.
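
The two passes can be sketched as one function that applies the same segmenter at both levels. This is an illustration under assumptions: `two_level_segments`, the `min_sentences` threshold for what counts as a larger paragraph, and the stand-in segmenter `halves` are all hypothetical, and any function that maps a list of strings to (start, end) spans can be plugged in.

```python
def two_level_segments(paragraphs, segmenter, min_sentences=6):
    """Find topics over paragraphs, then subtopics inside larger paragraphs.

    paragraphs: list of paragraphs, each given as a list of sentence strings.
    segmenter:  function mapping a list of strings to (start, end) spans.
    """
    topics = []
    # Pass 1: paragraph boundaries are the only candidate topic boundaries.
    for start, end in segmenter([" ".join(p) for p in paragraphs]):
        subtopics = []
        for p in range(start, end):
            sentences = paragraphs[p]
            # Pass 2: only larger paragraphs are split, at sentence boundaries.
            if len(sentences) >= min_sentences:
                subtopics.append((p, segmenter(sentences)))
        topics.append({"paragraphs": (start, end), "subtopics": subtopics})
    return topics


def halves(units):
    """Stand-in segmenter for demonstration: cut the input in the middle."""
    mid = len(units) // 2
    return [(0, mid), (mid, len(units))] if mid else [(0, len(units))]
```

With `halves` as the segmenter, three paragraphs of 2, 6 and 1 sentences yield two topics, and only the six-sentence paragraph receives subtopic spans.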
For each topic and subtopic we calculate keywords on the basis of the
tf-idf values of the capitalized words (German nouns) occurring in it.
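
A minimal sketch of such a keyword ranking follows. The names `capitalized_words` and `topic_keywords`, the `top_k` parameter and the exact tf-idf formula (raw count times log of inverse document frequency, with each topic treated as one document) are assumptions for illustration. Note that sentence-initial words are also capitalized and are not filtered out here.

```python
import math
import re
from collections import Counter

# Runs of letters, including umlauts and other non-ASCII letters.
WORD = re.compile(r"[^\W\d_]+")


def capitalized_words(text):
    """Return all words that start with an upper-case letter."""
    return [w for w in WORD.findall(text) if w[0].isupper()]


def topic_keywords(topics, top_k=3):
    """Rank the capitalized words of each topic by tf-idf across all topics."""
    counts = [Counter(capitalized_words(t)) for t in topics]
    # Document frequency: in how many topics does each word occur?
    df = Counter(w for c in counts for w in c)
    n = len(topics)
    result = []
    for c in counts:
        scores = {w: tf * math.log(n / df[w]) for w, tf in c.items()}
        # Words that occur in every topic score 0 and are dropped.
        ranked = sorted(
            (w for w in scores if scores[w] > 0),
            key=lambda w: (-scores[w], w),
        )
        result.append(ranked[:top_k])
    return result
```

For the two topics "Der Hund jagt die Katze. Der Hund bellt." and "Der Vogel singt. Der Vogel fliegt.", the word "Der" occurs in both topics and is dropped, so the keywords are `["Hund", "Katze"]` and `["Vogel"]`.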
Example output