The input for this module is the merged PAULA inline file, which
contains all annotations produced by the previous modules.
In a first step, we assign a final weight to each sentence. This step is
parameterized: for each text type, the basic sentence weights and their
influence factors for the summary can be defined. The calculation is
straightforward. First, each weight is normalized so that the sum of the
weights over all sentences is the same for every weighting module. The
final weight of each sentence is then calculated as the sum of its
factored weights.
In the current implementation, only the weights for term relevance and layout structure affect
this final weight, and the term relevance weight is given a factor of 2.
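The weighting step described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the module names used as dictionary keys, the normalization target of 1.0 and the factor 1.0 for layout structure are all assumptions. Only the factor 2 for term relevance comes from the description.

```python
def normalize(weights, target=1.0):
    """Scale the weights of one module so that they sum to `target`."""
    total = sum(weights)
    if total == 0:
        return [0.0 for _ in weights]
    return [w * target / total for w in weights]


def final_weights(module_weights, factors):
    """Sum the normalized, factored weights of all modules per sentence."""
    n = len(next(iter(module_weights.values())))
    result = [0.0] * n
    for name, weights in module_weights.items():
        factor = factors.get(name, 0.0)  # modules without a factor are ignored
        for i, w in enumerate(normalize(weights)):
            result[i] += factor * w
    return result


# Hypothetical raw weights for three sentences.
module_weights = {
    "term_relevance": [2.0, 1.0, 1.0],
    "layout_structure": [0.0, 3.0, 1.0],
}
# Factor 2 for term relevance as described; 1.0 for layout is an assumption.
factors = {"term_relevance": 2.0, "layout_structure": 1.0}

final = final_weights(module_weights, factors)
```

Because every module is normalized to the same sum before the factors are applied, the factors alone determine how much each module contributes to the final weight.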
In the next step, we use the text structure information to decide whether each sentence is extracted. Parameters can be specified for each zone type; they define the minimum and maximum number (or percentage) of sentences to take from all text of the zone type in question. For example, in a film review we want to extract exactly 1 title, at least 1 sentence from a comment zone but at most 90% of all comment zones, and no text from any legal notice zone. Using these parameters, each sentence is tagged as "obligatory", "optional" or "not" to extract.
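One possible way to derive the three tags from such zone parameters is sketched below. The description does not say how sentences are ranked within a zone, so this sketch assumes that the highest weighted sentences fill the minimum first (obligatory) and the next ones fill the range up to the maximum (optional). The names `resolve`, `tag_zone` and `params`, the rounding down of percentages, and the sample weights are hypothetical.

```python
import math


def resolve(limit, n):
    """Turn a limit into a sentence count.

    An int is taken as an absolute count, a float as a fraction of the
    n sentences in the zone type (rounded down, which is an assumption).
    """
    if isinstance(limit, float):
        return min(n, math.floor(limit * n))
    return min(n, limit)


def tag_zone(weights, minimum, maximum):
    """Tag the sentences of one zone type by rank of their final weight."""
    n = len(weights)
    lo = resolve(minimum, n)
    hi = max(lo, resolve(maximum, n))
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    tags = ["not"] * n
    for rank, i in enumerate(order):
        if rank < lo:
            tags[i] = "obligatory"
        elif rank < hi:
            tags[i] = "optional"
    return tags


# Parameters for the film review example: (minimum, maximum) per zone type.
params = {"title": (1, 1), "comment": (1, 0.9), "legal_notice": (0, 0)}

title_tags = tag_zone([1.0], *params["title"])
comment_tags = tag_zone([0.2, 0.9, 0.5, 0.1], *params["comment"])
legal_tags = tag_zone([0.3, 0.4], *params["legal_notice"])
```

With four comment sentences, the 90% limit allows at most three of them, so the lowest weighted comment sentence is tagged "not".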
In the current pipeline implementation, these results are only highlighted,
without any further processing.
To present a final summary, the sentences marked as obligatory should be
extracted first. The anaphora and discourse markers they contain should
then be evaluated, so that further sentences can be included to keep
the summary readable. If the maximum compression rate has not been
reached, more highly weighted optional sentences (and possibly the
sentences containing their antecedents) can be included. Instead of
including more sentences for anaphora resolution, one could of course
try to replace the anaphora.
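Since this final step is not part of the current pipeline, the following is only a sketch of how the proposed procedure might look. It assumes that each sentence carries its tag, its final weight and a list of indices of sentences holding its antecedents, and that the compression rate is expressed as a maximum number of sentences. The field names and the function `assemble` are hypothetical, and discourse markers are not handled here.

```python
def assemble(sentences, max_sentences):
    """Return the indices of the sentences chosen for the summary."""
    # 1. All obligatory sentences are extracted.
    chosen = {i for i, s in enumerate(sentences) if s["tag"] == "obligatory"}
    # 2. Sentences that hold antecedents of the chosen ones are added.
    for i in list(chosen):
        chosen.update(sentences[i].get("antecedents", []))
    # 3. Optional sentences are added by descending weight while they fit.
    optional = sorted(
        (i for i, s in enumerate(sentences)
         if s["tag"] == "optional" and i not in chosen),
        key=lambda i: sentences[i]["weight"],
        reverse=True,
    )
    for i in optional:
        needed = {i, *sentences[i].get("antecedents", [])} - chosen
        if len(chosen) + len(needed) <= max_sentences:
            chosen |= needed
    return sorted(chosen)  # keep the original text order


sentences = [
    {"tag": "obligatory", "weight": 0.9, "antecedents": []},
    {"tag": "optional", "weight": 0.4, "antecedents": []},
    {"tag": "obligatory", "weight": 0.7, "antecedents": [1]},
    {"tag": "optional", "weight": 0.8, "antecedents": []},
    {"tag": "not", "weight": 0.1, "antecedents": []},
    {"tag": "optional", "weight": 0.6, "antecedents": [3]},
]
summary = assemble(sentences, max_sentences=4)
```

In this example, sentence 1 is pulled in only because obligatory sentence 2 refers back to it, and sentence 5 is left out because the budget of four sentences is already used up.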
Example output [xml]