Topic Segmentation

In the current SUMMaR pipeline we implemented the algorithm by Masao Utiyama and Hitoshi Isahara. The basic idea of this algorithm is to compute a cost for every possible span of the text and then find the minimum-cost path through the text, i.e. the sequence of adjacent spans that covers the whole text at the lowest total cost. The cost of a span depends mainly on the number of word repetitions and the number of distinct words in it. In our implementation we use character-based 4-grams instead of word forms and tf-idf values instead of absolute term frequencies.
In a first step we find the main topics of the text, using the identified paragraph boundaries as candidate segment boundaries. In a second step we split the larger paragraphs into subtopics, using the sentence boundaries as candidate segment boundaries.
For each topic and subtopic we compute keywords on the basis of the tf-idf values of the capitalized words occurring in it (in German, nouns are capitalized).
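The minimum-cost path idea described above can be illustrated with a small dynamic program. This is only a sketch under stated assumptions, not the SUMMaR code: the function names `char_ngrams` and `segment` are hypothetical, the span cost is a Laplace-smoothed negative log-likelihood with a per-segment penalty of log(total token count), as we understand the published model, and raw 4-gram counts are used in place of the tf-idf weighting, whose exact form is not specified here.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def segment(units, n=4):
    """Split a list of text units (e.g. paragraphs) into minimum-cost segments.

    Returns a list of (start, end) index pairs into `units`, end exclusive.
    Assumption: raw n-gram counts stand in for the tf-idf weighting.
    """
    grams = [char_ngrams(u, n) for u in units]
    vocab_size = len({g for gs in grams for g in gs})
    total = sum(len(gs) for gs in grams)
    # Each segment pays a fixed penalty, which discourages over-segmentation.
    penalty = math.log(total) if total > 1 else 0.0

    def span_cost(counts, length):
        # Repeated n-grams are cheap, many distinct n-grams are expensive.
        likelihood = sum(
            c * math.log((c + 1) / (length + vocab_size))
            for c in counts.values()
        )
        return -likelihood + penalty

    m = len(units)
    best = [0.0] + [math.inf] * m   # best[j]: cheapest cost of covering units[:j]
    back = [0] * (m + 1)            # back[j]: start of the last span in that cover
    for i in range(m):
        counts, length = Counter(), 0
        for j in range(i + 1, m + 1):
            # Extend the span units[i:j] by one unit and update its statistics.
            counts.update(grams[j - 1])
            length += len(grams[j - 1])
            cost = best[i] + span_cost(counts, length)
            if cost < best[j]:
                best[j], back[j] = cost, i

    # Follow the back pointers from the end of the text to recover the path.
    spans, j = [], m
    while j > 0:
        spans.append((back[j], j))
        j = back[j]
    return spans[::-1]
```

Calling `segment(paragraphs)` on a list of paragraph strings yields the index ranges of the detected segments. The same function can be applied to the sentences of a single paragraph to obtain subtopics.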
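The keyword step can be sketched as follows. The names `capitalized_words` and `topic_keywords` and the parameter `top_k` are hypothetical, and the sketch assumes that the inverse document frequency is computed over the topics of the same text. The reference collection actually used in SUMMaR may differ. Note also that this simple pattern picks up sentence-initial words, which are capitalized without being nouns.

```python
import math
import re
from collections import Counter


def capitalized_words(text):
    """Return all words that start with an upper-case letter (incl. umlauts)."""
    return re.findall(r"\b[A-ZÄÖÜ][a-zäöüß]+\b", text)


def topic_keywords(topics, top_k=3):
    """Rank the capitalized words of each topic by tf-idf.

    Assumption: idf is computed over the given topics themselves.
    Returns one list of up to `top_k` keywords per topic.
    """
    counts = [Counter(capitalized_words(t)) for t in topics]
    # Document frequency: in how many topics does each word occur?
    df = Counter(word for c in counts for word in c)
    n = len(topics)
    result = []
    for c in counts:
        scores = {w: tf * math.log(n / df[w]) for w, tf in c.items()}
        # Highest score first, ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:top_k])
    return result
```

A word that occurs in every topic receives an idf of zero and therefore ranks below words that are specific to a single topic.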

Example output

topics [txt]

subtopics [txt]
