Sentence Selection

The input for this module is the merged PAULA inline file, which contains all annotations produced by the previous modules.
In a first step, we assign a final weight to each sentence. This step is parametrized: for each text type, the basic sentence weights and their influence factors for the summary can be defined. The calculation is straightforward. First, each weight is normalized so that the sum of the weights over all sentences is the same for each weighting module. The final weight of each sentence is then calculated as the sum of the factored weights. In the current implementation, only the weights for term relevance and layout structure have an impact on this final weight, with the term relevance weight given a factor of 2.
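
The weighting step can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function names, the module keys and the normalization target of 1.0 are assumptions made for illustration, and the factor of 1 for layout structure is likewise assumed, since only the factor of 2 for term relevance is stated above.

```python
from typing import Dict, List


def normalize(weights: List[float], total: float = 1.0) -> List[float]:
    """Scale the weights of one module so that they sum to `total`."""
    s = sum(weights)
    if s == 0:
        # A module that assigned no weight at all contributes nothing.
        return [0.0 for _ in weights]
    return [w * total / s for w in weights]


def final_weights(module_weights: Dict[str, List[float]],
                  factors: Dict[str, float]) -> List[float]:
    """Sum the normalized, factored weights of all modules per sentence."""
    n = len(next(iter(module_weights.values())))
    result = [0.0] * n
    for module, weights in module_weights.items():
        factor = factors.get(module, 0.0)  # modules without a factor are ignored
        for i, w in enumerate(normalize(weights)):
            result[i] += factor * w
    return result


# Hypothetical raw weights for three sentences.
module_weights = {
    "term_relevance": [2.0, 1.0, 1.0],
    "layout_structure": [0.0, 3.0, 1.0],
}
# Factor 2 for term relevance is from the text; factor 1 for layout is assumed.
factors = {"term_relevance": 2.0, "layout_structure": 1.0}

weights = final_weights(module_weights, factors)
print(weights)
```

Because every module is normalized to the same sum, the factors alone determine how much each module contributes to the final weight.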

In the next step, we use the text structure information to decide whether each sentence is extracted. Parameters can be specified for each zone type. They define the maximum and minimum number (or percentage) of sentences to be taken from all text of the zone type in question. For example, in a film review we want to extract exactly 1 title, at least 1 sentence from a comment zone but at most 90% of all comment zones, and no text from any legal notice zone. Using these parameters, each sentence is tagged as "obligatory", "optional" or "not" to extract.
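
One possible way to turn such zone parameters into tags is sketched below in Python. The sketch rests on assumptions that the description above does not confirm: sentences are ranked by final weight within each zone type, the top-ranked ones up to the minimum become "obligatory", the following ones up to the maximum become "optional", and zone types without a rule are treated as "optional". The names `ZoneRule`, `Sentence` and `tag_sentences` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Union

# An int is an absolute sentence count, a float in [0, 1] is a fraction.
Limit = Union[int, float]


@dataclass
class ZoneRule:
    minimum: Limit
    maximum: Limit


@dataclass
class Sentence:
    text: str
    zone: str
    weight: float
    tag: str = "not"


def resolve(limit: Limit, n: int) -> int:
    """Convert a count or fraction into a sentence count, capped at n."""
    if isinstance(limit, float):
        return min(n, math.floor(limit * n))
    return min(n, limit)


def tag_sentences(sentences: List[Sentence], rules: Dict[str, ZoneRule]) -> None:
    """Tag each sentence in place as 'obligatory', 'optional' or 'not'."""
    by_zone: Dict[str, List[Sentence]] = {}
    for s in sentences:
        by_zone.setdefault(s.zone, []).append(s)
    for zone, group in by_zone.items():
        rule = rules.get(zone)
        if rule is None:
            for s in group:  # assumption: unconfigured zones stay optional
                s.tag = "optional"
            continue
        ranked = sorted(group, key=lambda s: s.weight, reverse=True)
        lo = resolve(rule.minimum, len(ranked))
        hi = max(lo, resolve(rule.maximum, len(ranked)))
        for rank, s in enumerate(ranked):
            s.tag = "obligatory" if rank < lo else "optional" if rank < hi else "not"


# Rules mirroring the film review example from the text.
rules = {
    "title": ZoneRule(1, 1),
    "comment": ZoneRule(1, 0.9),
    "legal_notice": ZoneRule(0, 0),
}
sentences = [
    Sentence("A Film Title", "title", 0.3),
    Sentence("The plot is thin.", "comment", 0.5),
    Sentence("The acting is superb.", "comment", 0.9),
    Sentence("It runs two hours.", "comment", 0.2),
    Sentence("All rights reserved.", "legal_notice", 0.4),
]
tag_sentences(sentences, rules)
tags = [s.tag for s in sentences]
print(tags)
```

With three comment sentences, 90% rounds down to two, so the highest weighted comment sentence is obligatory, the second is optional and the third is not extracted.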

In the current pipeline implementation, these results are only highlighted, without any further processing.
To present a final summary, the sentences marked as obligatory should be extracted first. The anaphora and discourse markers they contain should then be evaluated to decide whether further sentences must be included to keep the summary readable. If the maximum compression rate has not yet been reached, more highly weighted optional sentences (and possibly the sentences that resolve their antecedents) can be included. Instead of including additional sentences for anaphora resolution, one could of course try to replace the anaphora.
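
The proposed assembly procedure, which the current pipeline does not implement, might look roughly like the following Python sketch. Everything in it is an assumption for illustration: the function name `assemble_summary`, the representation of anaphora as a mapping from a sentence index to the index of the sentence holding its antecedent, and the use of a maximum sentence count as a stand-in for the compression rate.

```python
from typing import Dict, List, Optional, Set


def assemble_summary(tags: List[str], weights: List[float],
                     antecedent: Dict[int, int],
                     max_sentences: int) -> List[int]:
    """Return the indices of the selected sentences in document order."""
    selected: Set[int] = set()

    def chain(i: int) -> List[int]:
        """Sentence i plus the not yet selected sentences its anaphora need."""
        needed: List[int] = []
        current: Optional[int] = i
        while current is not None and current not in selected and current not in needed:
            needed.append(current)
            current = antecedent.get(current)
        return needed

    # 1. Obligatory sentences and their antecedents are always included.
    #    Assumption: an antecedent sentence is pulled in regardless of its tag.
    for i, tag in enumerate(tags):
        if tag == "obligatory":
            selected.update(chain(i))

    # 2. Fill the remaining budget with optional sentences, highest weight first.
    optional = sorted((i for i, t in enumerate(tags) if t == "optional"),
                      key=lambda i: weights[i], reverse=True)
    for i in optional:
        needed = chain(i)
        if len(selected) + len(needed) <= max_sentences:
            selected.update(needed)

    return sorted(selected)


# Hypothetical input: sentence 4 contains an anaphor resolved by sentence 3.
tags = ["obligatory", "not", "optional", "optional", "obligatory"]
weights = [1.0, 0.1, 0.8, 0.6, 0.9]
antecedent = {4: 3}

summary = assemble_summary(tags, weights, antecedent, max_sentences=4)
short_summary = assemble_summary(tags, weights, antecedent, max_sentences=3)
print(summary, short_summary)
```

The sketch covers only the inclusion strategy. Replacing the anaphora instead of adding sentences would require rewriting the sentence text and is not covered here.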

Example output [xml]

Pipeline overview