The PTOLEMAIOS research project

The PTOLEMAIOS project on grammar learning from parallel corpora is funded by the DFG (Deutsche Forschungsgemeinschaft / German Science Foundation) as part of the Emmy Noether program.

The project started in April 2005 and is expected to run until the end of 2008.

From April 2005 to September 2006, the research group was based at the Department for Computational Linguistics and Phonetics at Saarland University in Saarbrücken, Germany. (Physically, the group was located in the Zentrum für Sprachforschung und Sprachtechnologie, the Center for Language Research and Language Technology, as seen in the picture.)

As of October 2006, Jonas Kuhn is a professor at the University of Potsdam (Department of Linguistics; see new homepage). The PTOLEMAIOS project is being continued in Potsdam.



People

Members of the PTOLEMAIOS research group (from left to right):
  • Michael Jellinghaus (former project member)
  • Jonas Kuhn (principal investigator)
  • Andreas Eisele (former project member, still associated)
  • Mark Hopkins


Project summary

One of the key challenges for Computational Linguistics today is the development of advanced language technology for a truly multilingual application context. Providing high-quality language processing tools for more than just English and a few other "big" languages will be crucial for preserving linguistic and cultural diversity. The critical language-specific components for most higher-level language technology are the grammar and lexicon used for morphological, syntactic and semantic analysis or generation. For English, high-quality grammars and lexicons have been developed over many years, and large collections of text have been labelled with the correct linguistic structure in order to train statistical grammars. Similar efforts are under way to develop such resources for a number of other languages.

Our project aims to establish an alternative technique for building grammars, which will also be applicable to languages (or domain-specific sublanguages) for which there is neither sufficient commercial interest in language technology nor sufficient government funding to pay for the labor-intensive manual resource development. The technique is inspired by linguistic considerations about how humans learn language and at the same time uses state-of-the-art machine learning techniques. The only resource required is a parallel corpus, i.e., a collection of original and translated text in various languages. A sufficient amount of translated text is available even for most minority languages, especially since the advent of the internet. Our technique exploits the implicit information about the structure of a language X that is contained in translations from X into other languages Y and Z (or from Y or Z into X). Hence, rather than labelling a large text corpus for X with structural representations in order to train a grammar for X, the grammar is induced from a largely unlabelled parallel corpus of X, Y and Z, using "bootstrapping" techniques and knowledge from linguistic theory about the crucial structural features that distinguish types of languages.

An additional project goal for PTOLEMAIOS has developed over the first year of the project: we are working on the development of statistical machine translation techniques that are able to exploit syntactic information (obtained by parsing the training corpus and the test material with an induced grammar or an existing parser).


Some background

The key idea of the PTOLEMAIOS project (for "Parallel-Text-based Optimization for Language learning--Exploiting Multilingual Alignment for the Induction Of Syntactic grammars") is the following: sentence-aligned parallel corpora contain significant implicit information about the syntactic structure of the sentences, and thus about the grammars of the languages involved. With appropriate learning techniques, one should be able to access this information and induce grammars from parallel corpora (compare the successful pilot study in Kuhn 2004a/b; the ACL 2005 paper is available for download as a .pdf file).
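To make the idea of "implicit information" more concrete, here is a small Python sketch. It is not the project's algorithm, only a toy illustration under stated assumptions: we assume a word alignment between a sentence and its translation is already given (here it is written by hand), and we treat a span of words as a candidate constituent if its translation forms one contiguous block that is not linked to any word outside the span. The function name consistent_spans, the example sentence pair and the alignment are all hypothetical.

```python
from typing import List, Set, Tuple

# A word alignment as (source index, target index) pairs (assumed given).
Alignment = Set[Tuple[int, int]]


def consistent_spans(n_src: int, alignment: Alignment) -> List[Tuple[int, int]]:
    """Return source spans (start, end), end exclusive, of at least two words
    (excluding the whole sentence) whose aligned target words form a
    contiguous block that is linked to no source word outside the span."""
    spans = []
    for start in range(n_src):
        for end in range(start + 2, n_src + 1):
            if end - start == n_src:
                continue  # the full sentence is trivially consistent
            targets = {j for (i, j) in alignment if start <= i < end}
            if not targets:
                continue  # span has no translation evidence at all
            lo, hi = min(targets), max(targets)
            # Every link into the target block [lo, hi] must start inside the span.
            if all(start <= i < end for (i, j) in alignment if lo <= j <= hi):
                spans.append((start, end))
    return spans


# Hand-written toy example (an assumption, not project data).
english = ["I", "have", "read", "the", "book"]
german = ["Ich", "habe", "das", "Buch", "gelesen"]
alignment = {(0, 0), (1, 1), (2, 4), (3, 2), (4, 3)}

spans = consistent_spans(len(english), alignment)
for start, end in spans:
    print(" ".join(english[start:end]))
```

On this example the sketch keeps "the book", "read the book" and "have read the book", and rejects "read the", because the German verb "gelesen" is separated from "das" by "Buch". It also keeps "I have", which is not a constituent, so evidence of this kind is at best a noisy cue that a learning procedure would have to weigh against other information.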

The motivation behind the project is twofold: (1) understanding learnability properties of grammar models is crucial for a better theoretical grasp of the language faculty; and (2) being able to train grammars from corpus data is an important step for multilingual Natural Language Processing applications (such as information extraction, question answering and machine translation).

The PTOLEMAIOS project will address questions about formal representation, computational properties of grammar formalisms, formulation of linguistic constraints, algorithms for analysis and learning, (probabilistic) learning models based on corpus data, bootstrapping techniques, etc.

For an outline of the envisaged PTOLEMAIOS system architecture, see "An Architecture for Parallel Corpus-based Grammar Learning" (.pdf file). For readers with an Optimality Theory background, there is a special overview: "Optimality in Analysis, Generation, and Learning: Towards a Robust Computational Architecture for Corpus-based Studies of Syntax" (.pdf file).


Project Publications


March 2005 / June 2007