Layout Structure and Metadata Extraction

SUMMaR accepts several input formats (plain text, HTML, XML). The first step in the pipeline converts any input into a general layout structure representation (xdoc).

The current implementation of our logical structure extractor is written in Python, with optional assistance from an XSL transformation engine for XML inputs. Along with the layout structure, the extractor also tries to extract the document's metadata.

For plain text documents, the only layout information is horizontal and vertical spacing, plus some creative typography such as surrounding words with asterisks for *emphasis*. A central task is the interpretation of line breaks (CR/LF), which may or may not represent a paragraph boundary. As a heuristic, we first determine the longest line in the document and, from its length, guess whether consecutive lines of equal length are likely to be intended as one continuous paragraph. Here are some example mapping rules from the program:

- If lines are very long, each line is regarded as a single paragraph (div), and a linebreak as a paragraph boundary. Otherwise, adjacent lines are grouped into one paragraph, and paragraph boundaries are recognized from double linebreaks (i.e. blank lines).
- If a line consists of capital letters only, or begins with indicative numerals, and if its last character is not a punctuation sign (other than ‘?’ and ‘!’), it is taken to be a heading. If a paragraph boundary has been identified as a linebreak, a heading boundary is preceded by two linebreaks and followed by one.
- A line followed by a line of the same length, consisting of ‘-’ or ‘=’, is interpreted as a heading.
- A character sequence surrounded by asterisks (with no intervening blanks) is mapped to italic; one surrounded by ‘_’ is mapped to underline.
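
The rules above can be sketched in Python as follows. This is a minimal illustration and not the actual SUMMaR code: the threshold `LONG_LINE`, the function names, the regular expressions, and the `<i>`/`<u>` output tags are all assumptions made for the example, as is reading the underline delimiter as an underscore.

```python
import re

# Hypothetical threshold: the source does not say how long "very long" is.
LONG_LINE = 100

# "Indicative numerals" are assumed to look like "1", "2.", or "3.1".
HEADING_NUMERAL = re.compile(r"^\d+(\.\d+)*\.?\s+\S")
UNDERLINE_ROW = re.compile(r"^(-+|=+)$")
ASTERISKS = re.compile(r"\*([^\s*]+)\*")
UNDERSCORES = re.compile(r"(?<!\w)_([^\s_]+)_(?!\w)")


def split_paragraphs(text, long_line=LONG_LINE):
    """Split plain text into paragraphs using the longest-line heuristic."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = text.split("\n")
    longest = max((len(line) for line in lines), default=0)
    if longest > long_line:
        # Very long lines: every non-empty line is a paragraph of its own.
        return [line.strip() for line in lines if line.strip()]
    # Otherwise blank lines mark boundaries and adjacent lines are joined.
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def is_heading(line):
    """All capitals or a leading numeral, and no closing punctuation
    other than '?' and '!'."""
    line = line.strip()
    if not line or line[-1] in ".,;:":
        return False
    letters = [c for c in line if c.isalpha()]
    all_caps = bool(letters) and all(c.isupper() for c in letters)
    return all_caps or bool(HEADING_NUMERAL.match(line))


def is_underlined_heading(line, next_line):
    """A line followed by an equally long row of '-' or '='."""
    line, next_line = line.rstrip(), next_line.rstrip()
    return (bool(line) and len(line) == len(next_line)
            and bool(UNDERLINE_ROW.match(next_line)))


def map_inline(text):
    """Map *word* to italic and _word_ to underline (tag names assumed)."""
    text = ASTERISKS.sub(r"<i>\1</i>", text)
    return UNDERSCORES.sub(r"<u>\1</u>", text)
```

For instance, `split_paragraphs("First line\nsecond line\n\nNew paragraph\n")` groups the first two lines into one paragraph because no line exceeds the threshold, while `is_heading("This is a sentence.")` is rejected on account of the closing full stop.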

For ‘familiar’ XML documents, we use a library of XSL sheets and have the transformation engine extract all layout structure as well as the metadata from the document. In the case of ‘unfamiliar’ XML and HTML documents, heuristics are used for extracting a generic xdoc by selectively inspecting a set of elements and attributes.
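
A heuristic pass over an ‘unfamiliar’ document might look like the sketch below. It assumes well-formed XML or XHTML input and uses only the Python standard library; the tag sets, the `heading` element, and the `title` attribute on the root are hypothetical choices for illustration (only `xdoc` and `div` are names taken from the text above), and the real extractor may inspect quite different elements and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sets of element names to inspect.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "head", "heading"}
PARAGRAPH_TAGS = {"p", "para", "paragraph"}


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}p' and lowercase."""
    return tag.rsplit("}", 1)[-1].lower()


def to_xdoc(xml_text):
    """Build a generic xdoc tree from well-formed XML of unknown vocabulary."""
    root = ET.fromstring(xml_text)
    xdoc = ET.Element("xdoc")
    for elem in root.iter():  # document order, root included
        name = local_name(elem.tag)
        # Flatten nested inline markup and normalise whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if not text:
            continue
        if name == "title" and xdoc.get("title") is None:
            # Treat the first <title> as document metadata.
            xdoc.set("title", text)
        elif name in HEADING_TAGS:
            ET.SubElement(xdoc, "heading").text = text
        elif name in PARAGRAPH_TAGS:
            ET.SubElement(xdoc, "div").text = text
    return xdoc


sample = ("<article><title>Demo</title><sec><h1>Intro</h1>"
          "<p>Some <b>bold</b> text.</p></sec></article>")
xdoc = to_xdoc(sample)
print(ET.tostring(xdoc, encoding="unicode"))
```

The sketch deliberately ignores everything it does not recognise, which mirrors the idea of selectively inspecting a set of elements. Nested paragraph elements would be emitted twice, and real-world HTML that is not well-formed would need a more tolerant parser.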
