SUMMaR accepts several input formats (plain text, HTML, XML). The first step in the pipeline converts any input into a general layout structure representation (xdoc).
The current implementation of our logical structure extractor is written in Python, with optional assistance from an XSL transformation engine for XML inputs. Along with the layout structure, the extractor also tries to extract the document's metadata. For plain text documents, the only layout information available is the use of vertical and horizontal spacing, plus some creative typography such as surrounding words with asterisks for **emphasis**. A central task is the interpretation of line breaks (CR/LF), each of which may or may not represent a paragraph boundary. As a heuristic, we first determine the length of the longest line in the document and use it to guess whether subsequent lines of equal length are likely to be intended as one continuous paragraph. Here are some example mapping rules from the program:
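The following is not taken from the program itself; it is only a minimal sketch of how the longest-line heuristic and the asterisk convention described above might look in Python. The function names `split_paragraphs` and `mark_emphasis`, the `fill_ratio` threshold of 0.8, and the `<em>` output tag are all assumptions made for illustration.

```python
import re


def split_paragraphs(text, fill_ratio=0.8):
    """Guess paragraph boundaries in hard-wrapped plain text (illustrative only)."""
    lines = [line.rstrip() for line in text.splitlines()]
    # The longest line is taken as an estimate of the wrap width.
    width = max((len(line) for line in lines), default=0)
    paragraphs, current = [], []
    for line in lines:
        if not line.strip():
            # A blank line is always treated as a paragraph boundary.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(line.strip())
        # Assumption: a line much shorter than the wrap width ends a paragraph,
        # while a "full" line is continued by the line that follows it.
        if len(line) < fill_ratio * width:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def mark_emphasis(paragraph):
    """Map *word* or **word** to a hypothetical <em> element."""
    return re.sub(r"\*+([^*\s](?:[^*]*[^*\s])?)\*+", r"<em>\1</em>", paragraph)


if __name__ == "__main__":
    sample = (
        "The quick brown fox jumps over the lazy\n"
        "dog and then runs away.\n"
        "A second paragraph starts here and goes\n"
        "on a bit.\n"
    )
    for para in split_paragraphs(sample):
        print(mark_emphasis(para))
```

A heuristic of this kind has an obvious limitation: a paragraph whose last line happens to fill the full width is merged with the paragraph that follows it, unless a blank line separates the two.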
For ‘familiar’ XML documents, we use a library of XSL stylesheets and have the transformation engine extract all layout structure as well as the metadata from the document. For ‘unfamiliar’ XML and HTML documents, heuristics are used to extract a generic xdoc by selectively inspecting a set of elements and attributes.
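As a rough illustration of the heuristic route for ‘unfamiliar’ input, the sketch below walks a well-formed XML or XHTML document with Python's standard `xml.etree.ElementTree` module and keeps only a small set of elements and attributes. The xdoc schema is not specified here, so the output element names (`xdoc`, `meta`, `field`, `body`, `heading`, `para`), the tag sets, and the function names `local_name` and `to_xdoc` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed sets of element names worth inspecting (not SUMMaR's actual lists).
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "heading"}
PARAGRAPH_TAGS = {"p", "para", "paragraph"}
TITLE_TAGS = {"title"}


def local_name(tag):
    """Strip an optional '{namespace}' prefix and lowercase the tag name."""
    return tag.rsplit("}", 1)[-1].lower()


def to_xdoc(source):
    """Build a generic, hypothetical xdoc tree from well-formed XML/XHTML text."""
    root = ET.fromstring(source)
    xdoc = ET.Element("xdoc")
    meta = ET.SubElement(xdoc, "meta")
    body = ET.SubElement(xdoc, "body")
    for elem in root.iter():  # document order
        name = local_name(elem.tag)
        # Metadata carried in attributes, e.g. <meta name="author" content="..."/>.
        if name == "meta" and elem.get("name") and elem.get("content"):
            ET.SubElement(meta, "field", name=elem.get("name")).text = elem.get("content")
            continue
        # Flatten inline markup and normalize whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if not text:
            continue
        if name in TITLE_TAGS:
            ET.SubElement(meta, "field", name="title").text = text
        elif name in HEADING_TAGS:
            ET.SubElement(body, "heading").text = text
        elif name in PARAGRAPH_TAGS:
            ET.SubElement(body, "para").text = text
    return xdoc


if __name__ == "__main__":
    sample = (
        '<html><head><title>Demo</title>'
        '<meta name="author" content="A. Writer"/></head>'
        '<body><h1>Intro</h1><p>Some <b>bold</b> text.</p></body></html>'
    )
    print(ET.tostring(to_xdoc(sample), encoding="unicode"))
```

`ElementTree` only parses well-formed markup, so real-world HTML would first need a more lenient parser; the sketch also treats every `title` element as document metadata, which a fuller heuristic would refine.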
Example output (XML)