Gerlof Bouma, Uni Potsdam

QPL
Linguistic corpus querying with Prolog

This page is a work in progress, but it will contain Prolog code and conversion scripts for querying linguistically annotated corpora. The general ideas are explained and illustrated in two papers, Bouma (2010a, LAW) and Bouma (2010b, KONVENS). For now, the code is organized by the corpus it is meant to be used with, although there is considerable overlap between the packages.

Note that the corpora themselves are (generally) not included: you will have to acquire them yourself and convert them to Prolog with the conversion scripts provided. Also, the size and type of annotated data mentioned below refer to the data made available for querying with Prolog.

There is also the beginning of a bibliography of papers using Prolog in similar ways.

Code

Spoken Dutch Corpus CGN

Language: Dutch
Mod./genre: Spoken, mixed genres
Size: ~1 million tokens in 130k segments with syntactic annotation in v2.0
Annotation: Syntax (discontinuous phrase structure and edge labels).
Distribution: at Centrale voor Taal- en Spraaktechnologie.
Note: Includes code to (almost) re-create the data used in Bouma (2008).
Code: [as gzipped tar archive]

TIGER Corpus

Language: German
Mod./genre: Newspaper text
Size: 50k sentences in v2.1
Annotation: Syntax (discontinuous phrase structure and edge labels).
Distribution: at IMS/Uni Stuttgart.
Code: [as gzipped tar archive]

Tüba-D/Z Corpus

Language: German
Mod./genre: Newspaper text
Size: 45k sentences in v5
Annotation: Syntax (topological fields, phrase structure and edge labels); anaphora.
Distribution: at Uni Tübingen.
Note: This is the code used in Bouma (2010a, 2010b).
Code: [as gzipped tar archive]

Four languages from Europarl v3

Language: Dutch, English, German & Swedish
Mod./genre: (Translated?) Minutes of parliamentary sessions.
Size: Up to 1.5 million sentences per language.
Annotation: Syntax (dependency structure).
Distribution: at statmt.org for the underlying corpus. The re-tokenization used here is available from my site.
Note: This is the corpus and code used in Bouma et al. (2010). The re-tokenized and parsed corpus itself is also available here.
Code & data: [as bzip2-ed tar archive, 800 MB]

Bibliography

I'm also compiling a bibliography of papers that describe, or significantly mention the use of, Prolog as a corpus querying and transformation tool.

Bouma, Gerlof. 2008. Starting a sentence in Dutch: A corpus study of subject- and object-fronting. Groningen Dissertations in Linguistics 66, Center for Language and Cognition, University of Groningen.

Bouma, Gerlof. 2010a. Syntactic tree queries in Prolog. In: Proceedings of the Fourth Linguistic Annotation Workshop, ACL 2010, pp212–216, Uppsala. [pdf in ACL anthology]

Bouma, Gerlof. 2010b. Querying Linguistic Corpora with Prolog. In: Pinkal, Rehbein, Schulte im Walde & Storrer (eds), Semantic Approaches in Natural Language Processing: Proceedings of the Conference on Natural Language Processing 2010 (KONVENS 2010), Saarbrücken, Universaar. [final draft with a slightly spacier layout]

There's a small bug in the anaphora annotation transformation code in the paper. It is fixed in the online code.

Bouma, Gerlof, Lilja Øvrelid & Jonas Kuhn. 2010. Towards a Large Parallel Corpus of Cleft Constructions. In: Proceedings of LREC 2010, pp3585–3592. Malta. [abstract & link to pdf]

Lally, Adam & Paul Fodor. 2011. Natural Language Processing With Prolog in the IBM Watson System. ALP Newsletter, March 2011.

See the notes at the bottom; a longer paper is supposedly in the works.

Schneiker, Christian, Dietmar Seipel & Werner Wegstein. 2009. Schema and Variation: Digitizing Printed Dictionaries. In: Proceedings of the Third Linguistic Annotation Workshop, ACL-IJCNLP 2009, pp82–89, Singapore. [pdf in ACL anthology]

Witt, Andreas. 2005. Multiple hierarchies: New aspects of an old solution. In: Dipper, Götze & Stede (eds), Heterogeneity in Focus: Creating and Using Linguistic Databases (ISIS 2), pp55–86, Potsdam: Universitätsverlag Potsdam. [pdf of ISIS 2]