Text Structure Identification

This module is based on a diploma thesis (including rule implementations) by Annika Neumann.

Based on a corpus study, we have developed an inventory of zone labels for the genre “film review” and implemented a system for automatically identifying these zones, i.e., for breaking up a document into its content structure.
Our approach is hybrid: it utilizes both symbolic rules and a statistical classifier.
The symbolic rules identify the formal zones. Formal zones generally contain metadata about the film and the review, for example the title of the film or the name of the review's author. The rules are implemented in LAPIS.
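To illustrate the idea of rule-based formal zone detection, here is a minimal sketch in Python. It is not the LAPIS implementation: the rule list, the patterns, the "date" label, and the function name match_formal_zone are all hypothetical and chosen only to show how a paragraph could be tested against a small set of symbolic rules.

```python
import re

# Hypothetical rule set of (zone label, pattern) pairs. Neither the labels
# nor the patterns are taken from the actual LAPIS rules.
FORMAL_ZONE_RULES = [
    ("author", re.compile(r"^(by|von)\s+\S+", re.IGNORECASE)),
    ("date", re.compile(r"^\d{1,2}\.\d{1,2}\.\d{4}$")),
]

def match_formal_zone(paragraph):
    """Return the label of the first rule that matches, or None."""
    text = paragraph.strip()
    for label, pattern in FORMAL_ZONE_RULES:
        if pattern.search(text):
            return label
    return None
```

A paragraph for which no rule fires returns None, so it can be handed on to a different component.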

[Table: Overview of formal zones in film reviews]

A statistical classifier is used to identify the functional zones. There are two types of functional zones:

Zone identification operates on the paragraphs of the text, which are identified by the layout identification module (xdoc). The structure identification module (idoc) simply adds the XML attribute "zone" to each <div> that xdoc has identified. idoc has been implemented for German and English.
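The annotation step described above can be sketched as follows. This is an illustration under assumptions, not the idoc code: the function names annotate_zones and toy_classify, the sample document, and the zone labels are hypothetical, and the classification function is a stand-in for the actual rules and classifier. Only the use of <div> elements and the "zone" attribute comes from the description.

```python
import xml.etree.ElementTree as ET

def annotate_zones(xdoc_xml, classify):
    """Set a "zone" attribute on every <div>, using classify(text) -> label."""
    root = ET.fromstring(xdoc_xml)
    for div in root.iter("div"):
        # Collect the paragraph text, including text of nested elements.
        text = "".join(div.itertext()).strip()
        div.set("zone", classify(text))
    return ET.tostring(root, encoding="unicode")

# Hypothetical stand-in for the symbolic rules and the trained classifier.
def toy_classify(text):
    return "author" if text.lower().startswith("by ") else "unlabeled"

# Hypothetical xdoc-like input; the real xdoc format may differ.
sample = "<doc><div>By Jane Doe</div><div>The plot is thin.</div></doc>"
annotated = annotate_zones(sample, toy_classify)
```

Passing the labeling function in as a parameter keeps the XML handling separate from the decision of which zone a paragraph belongs to.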

View example output of this module [xml file]
View PAULA standoff layer created from this module [xml file]
