QPL
Linguistic corpus querying with Prolog
This page is a work in progress; it will contain Prolog code and conversion scripts for querying linguistically annotated corpora. The general ideas are explained and illustrated in two papers, Bouma (2010a, LAW) and Bouma (2010b, KONVENS). For now, the code is organized by the corpus it is meant to be used with, although the code for the different corpora overlaps considerably.
Note that the corpora themselves are (generally) not included: you will have to acquire them yourself and convert them to Prolog with the conversion scripts provided. Also, the size and type of annotated data mentioned below refer to the data made available for querying with Prolog.
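As a rough illustration of what querying a converted corpus can look like, here is a minimal sketch. The fact format (`node/3`, `edge/4`), the sentence and node identifiers, and the edge labels are assumptions made up for this example; the conversion scripts provided here may well produce a different representation.

```prolog
% Hypothetical fact format, for illustration only.
% node(SentenceId, NodeId, Label): a node with its category label.
% edge(SentenceId, ParentId, ChildId, EdgeLabel): a labelled edge.

node(s1, n0, s).
node(s1, n1, np).
node(s1, n2, vp).
node(s1, n3, np).

edge(s1, n0, n1, sb).   % assumed label: subject
edge(s1, n0, n2, hd).   % assumed label: head
edge(s1, n2, n3, oa).   % assumed label: accusative object

% dominates(S, A, D): node A properly dominates node D in sentence S.
dominates(S, A, D) :-
    edge(S, A, D, _).
dominates(S, A, D) :-
    edge(S, A, M, _),
    dominates(S, M, D).

% object_np(S, N): N is an NP attached to its parent with the label oa.
object_np(S, N) :-
    node(S, N, np),
    edge(S, _, N, oa).
```

With these facts loaded, a query such as `?- dominates(s1, n0, D).` enumerates the nodes below `n0` on backtracking, and `?- object_np(S, N).` finds the object NP. The point of the sketch is only that a tree query becomes an ordinary Prolog goal over corpus facts.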
There is also the beginning of a bibliography of papers using Prolog in similar ways.
Code
Spoken Dutch Corpus CGN
| Language | Dutch |
| Mod./genre | Spoken, mixed genres |
| Size | ∼1mln tokens in 130k segments with syntactic annotation in v2.0 |
| Annotation | Syntax: discontinuous phrase structure and edge labels. |
| Distribution | at Centrale voor Taal- en Spraaktechnologie. |
| Note | Includes code to (almost) re-create the data used in Bouma (2008). |
| Code | [as gzipped tar archive] |
TIGER Corpus
| Language | German |
| Mod./genre | Newspaper text |
| Size | 50k sentences in v2.1 |
| Annotation | Syntax: discontinuous phrase structure and edge labels. |
| Distribution | at IMS/Uni Stuttgart. |
| Code | [as gzipped tar archive] |
Tüba-D/Z Corpus
| Language | German |
| Mod./genre | Newspaper text |
| Size | 45k sentences in v5 |
| Annotation | Syntax: topological fields, phrase structure and edge labels. Anaphora. |
| Distribution | at Uni Tübingen. |
| Note | This is the code used in Bouma (2010a, 2010b) |
| Code | [as gzipped tar archive] |
Four languages from Europarl v3
| Language | Dutch, English, German & Swedish |
| Mod./genre | (Translated?) Minutes of parliamentary sessions. |
| Size | Up to 1.5mln sentences per language. |
| Annotation | Syntax: dependency structure. |
| Distribution | at statmt.org for the underlying corpus. The re-tokenization used here is available from my site. |
| Note | This is the corpus and code used in Bouma et al. (2010). The retokenized and parsed corpus itself is also available here. |
| Code & data | [as bzip2-ed tar archive - 800MBytes] |
Bibliography
I'm also compiling a bibliography of papers that describe, or significantly mention the use of, Prolog as a corpus querying and transformation tool.
Bouma, Gerlof. 2008. Starting a sentence in Dutch: A corpus study of subject- and object-fronting. Groningen Dissertations in Linguistics 66, Center for Language and Cognition, University of Groningen.
Bouma, Gerlof. 2010a. Syntactic tree queries in Prolog. In: Proceedings of the Fourth Linguistic Annotation Workshop, ACL 2010, pp212–216, Uppsala. [pdf in ACL anthology]
Bouma, Gerlof. 2010b. Querying Linguistic Corpora with Prolog. In: Pinkal, Rehbein, Schulte im Walde & Storrer (eds), Semantic Approaches in Natural Language Processing: Proceedings of the Conference on Natural Language Processing 2010 (KONVENS 2010), Saarbrücken, Universaar. [final draft with a slightly spacier layout]
Bouma, Gerlof, Lilja Øvrelid & Jonas Kuhn. 2010. Towards a Large Parallel Corpus of Cleft Constructions. In: Proceedings of LREC 2010, pp3585–3592. Malta. [abstract & link to pdf]
Lally, Adam & Paul Fodor. 2011. Natural Language Processing With Prolog in the IBM Watson System. ALP Newsletter, March 2011.
Schneiker, Christian, Dietmar Seipel & Werner Wegstein. 2009. Schema and Variation: Digitizing Printed Dictionaries. In: Proceedings of the Third Linguistic Annotation Workshop, ACL-IJCNLP 2009, pp82–89, Singapore. [pdf in ACL anthology]
Witt, Andreas. 2005. Multiple hierarchies: New aspects of an old solution. In: Dipper, Götze & Stede (eds), Heterogeneity in Focus: Creating and Using Linguistic Databases (ISIS 2), pp55–86, Potsdam: Universitätsverlag Potsdam. [pdf of ISIS 2]