Project summary
One of the key challenges for Computational Linguistics today is the
development of advanced language technology for a truly multilingual
application context. Providing high-quality language processing tools
for more than just English and a few other "big" languages will be crucial
for preserving linguistic and cultural diversity. The critical
language-specific components for most higher-level language technology
are the grammar and lexicon used for morphological, syntactic and
semantic analysis or generation. For English, high-quality grammars
and lexicons have been developed over many years, and large
collections of text have been labelled with the correct linguistic
structure in order to train statistical grammars. Similar efforts are
under way to develop such resources for a number of other languages.
Our project aims to establish an alternative technique for building
grammars, which will also be applicable to languages (or
domain-specific sublanguages) for which there is neither sufficient
commercial interest in language technology nor sufficient
government funding to pay for labor-intensive manual resource
development. The technique is inspired by linguistic considerations
about how humans learn language and at the same time uses
state-of-the-art machine learning techniques. The only resource
required is a parallel corpus, i.e., a collection of original and
translated text in various languages. A sufficient amount of
translated text is available even for most minority languages,
especially since the advent of the internet. Our technique exploits
the implicit information about the structure of a language X that is
included in translations from X into other languages Y and Z (or from
Y or Z into X). Hence, rather than labelling a large text corpus for X
with structural representations in order to train a grammar for X, the
grammar is induced from a largely unlabelled parallel corpus of X, Y
and Z, using "bootstrapping" techniques and knowledge from linguistic
theory about the crucial structural features differentiating between
types of languages.
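
To make the intuition concrete, here is a minimal sketch in Python. It is an illustration only, not the project's actual induction method. It assumes a word alignment between a sentence in X and its translation in Y is already given, and it treats a span of X words as a candidate constituent when the words it is aligned to form a block in Y that no X word outside the span aligns into. The function names, the alignment representation and the toy sentence pair are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a set of (index in X sentence, index in Y sentence) links.
Alignment = Set[Tuple[int, int]]


def projected_span_is_contiguous(start: int, end: int, alignment: Alignment) -> bool:
    """True if X words start..end (inclusive) project onto a block in Y
    that no X word outside the span is aligned into."""
    targets = {j for (i, j) in alignment if start <= i <= end}
    if not targets:
        return False  # unaligned span: the translation gives no evidence
    lo, hi = min(targets), max(targets)
    # An outside X word aligned into [lo, hi] interrupts the block.
    return not any(
        lo <= j <= hi and not (start <= i <= end) for (i, j) in alignment
    )


def candidate_constituents(n_words: int, alignment: Alignment) -> List[Tuple[int, int]]:
    """All multi-word spans of the X sentence whose projection into Y is a block."""
    return [
        (s, e)
        for s in range(n_words)
        for e in range(s + 1, n_words)
        if projected_span_is_contiguous(s, e, alignment)
    ]


# Toy example: X = "the dog barks", Y = a made-up verb-initial rendering
# "barks the dog" (Y positions 0, 1, 2).
toy_alignment = {(0, 1), (1, 2), (2, 0)}
print(candidate_constituents(3, toy_alignment))
```

In the toy example, "the dog" stays together in the translation and is kept as a candidate, while "dog barks" is split by "the" in Y and is rejected. Evidence of this kind, collected over many sentence pairs and several languages, is the sort of implicit structural information the paragraph above refers to.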
An additional project goal for PTOLEMAIOS has emerged during the first
year of the project: we are developing statistical machine translation
techniques that can exploit syntactic information (obtained by parsing
the training corpus and the test material with an induced grammar or an
existing parser).
Some background
The key idea of the PTOLEMAIOS project (for "Parallel-Text-based
Optimization for Language learning--Exploiting Multilingual Alignment
for the Induction Of Syntactic grammars") is
the following: sentence-aligned parallel corpora contain significant
implicit information about the syntactic structure of the sentences, and
thus about the grammars of the languages involved. With appropriate
learning techniques, one should be able to access this information and
induce grammars from parallel corpora (compare the successful pilot
study in Kuhn 2004a/b; the ACL 2005 paper is available for download as
a .pdf file).
The motivation behind the project is twofold: (1) understanding
learnability properties of grammar models is crucial for a better
theoretical grasp of the language faculty; and (2) being able
to train grammars from corpus data is an important step for
multilingual Natural Language Processing applications (such as
information extraction, question answering and machine translation).
The PTOLEMAIOS project will address questions about formal
representation, computational properties of grammar formalisms,
formulation of linguistic constraints, algorithms for analysis and
learning, (probabilistic) learning models based on corpus data,
bootstrapping techniques, etc.
For an outline of the envisaged PTOLEMAIOS system architecture, see
"An Architecture for Parallel Corpus-based Grammar Learning" (.pdf file).
For readers with an Optimality Theory background, there is a special
overview: "Optimality in Analysis, Generation, and Learning: Towards a
Robust Computational Architecture for Corpus-based Studies of Syntax"
(.pdf file).