Background

Overview

Word order typology has found several robust correlations between the order of different types of heads and their dependents (e.g., Dryer 2007). Among the most widely discussed features are those that correlate with the serialization of a transitive verb (V) and its object (O) (e.g., Dryer 2013; Haider 2020). Many of the correlating features are instances of harmony in the serialization of head and dependent, such as the order of adpositions (head) and noun phrases (dependent). The present database, however, which is part of the project entitled Consequences of Head Argument Order for Syntax (CHAOS), focuses on features that are known or suspected to correlate with OV/VO but go beyond the linearization of head and dependent. An example of this is the position of oblique elements (X) such as adverbial nominal or adpositional phrases in a transitive clause. With respect to this feature, OV order shows more flexibility than VO: the latter excludes, among other things, VXO as a dominant order, a restriction that cannot be described in terms of head and dependent ordering (Dryer & Gensler 2013). Many theoretical approaches addressing such correlations were based on very small or phylogenetically biased samples (e.g., Haider 2020 on Germanic), which necessitates further research with larger, cross-linguistic samples. The aim of the project was to collect high-quality cross-linguistic data that can be used to test whether there are such systematic correlations, to analyze the type and strength of these correlations, and to investigate whether there could be alternative explanations, including the linear and structural position of the subject within the clause (Dryer 2013; Fanselow 2020; Adam & Hölzl 2024). As such, the project is a contribution both to language documentation and to the cross-linguistic study of syntactic phenomena. The following briefly sketches the methodological approach taken in the construction of this database.

Data elicitation

A major concern for any cross-linguistic research is the collection of the necessary data. Often, typological studies are based on available grammar books and descriptions. While unavoidable for large language samples, this method is known to lead to distortions in some of the resulting typologies. For instance, Dryer & Gensler (2013) argue that this may have led to the overrepresentation of OVX in their typology with respect to XOV and OXV:

"The frequency on the map is a methodological artifact, and emerges from the fact that many grammars state the ordering of O and V and of X and V but not of O and X. In such cases, if the language is of the type OV&XV, nothing can be concluded about the ordering of O and X, and so such languages had to be omitted from the map. By contrast, if the language is of the type OV&VX, then the ordering OX follows automatically, and so the language is always included on the map."

To avoid this type of distortion, this project is not primarily based on reference grammars but on data elicited under controlled conditions. While this has the disadvantage of being time-consuming and costly, it offers the clear advantage of providing the necessary data for answering the questions. It also offers the possibility to include properties that are not usually found in reference grammars, including negative data, preferences, information about specific types of obliques, or the influence of information structure on the range of possible word orders. Unlike databases like WALS or Grambank, this database, therefore, contains extensive first-hand data.
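The inference behind the artifact described by Dryer & Gensler (2013) can be made explicit with a small enumeration. The following Python sketch is an illustration only: the function name `compatible_orders` and the string encoding of orders are assumptions made here, not part of WALS or of this database. It lists which of the six orders of O, X, and V remain compatible with a given set of pairwise statements of the kind found in grammars.

```python
from itertools import permutations


def compatible_orders(*pairs):
    """Return the orders of O, X, V in which, for every pair 'AB', A precedes B."""
    orders = ["".join(p) for p in permutations("OXV")]
    return sorted(
        order
        for order in orders
        if all(order.index(a) < order.index(b) for a, b in pairs)
    )


# OV & VX: a single order remains, so the order of O and X follows automatically.
print(compatible_orders("OV", "VX"))  # ['OVX']

# OV & XV: two orders remain, so the order of O and X stays undetermined.
print(compatible_orders("OV", "XV"))  # ['OXV', 'XOV']
```

A grammar that reports only OV and XV therefore cannot be assigned to a single type, which is the gap that direct elicitation of all six orders is meant to close.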

The data were collected in three different ways. First, some of the data were elicited from native speakers by the team or by colleagues from the University of Potsdam (Akan, Ancash Quechua, Cochabamba Quechua, Czech, English, Jejueo, Kangle Chinese, Kannada, Linxia Chinese, Mandarin, Turkish, Udmurt, Vietnamese). Second, some of the questionnaires were filled in by native speakers with linguistic training and access to further native speakers for verification (Amharic, Bernese German, Bosnian-Croatian-Serbian, Bwamu, Finnish, German, Huarong Chinese, Hungarian, Indonesian, Italian, Jula, Kazakh, Marathi, Mongolian, Mooré, Nepali, Oromo, Polish, Slovenian, Thai, Upper Sorbian). Third, some data were collected by linguists through fieldwork (e.g., Gagauz, Ika, Kurux, Mopan Maya, Newar, Ninkaré, Tunen). The project offered financial compensation for the latter two types of data collection. On occasion, remaining gaps could be filled with the help of further speakers and experts or available reference grammars (e.g., Kobayashi & Tirkey 2017 on Kurux). See the Team page for further details on the people involved in the data collection and the construction of this database.

Questionnaire

The data assembled in this database have been collected by means of elicitation with an extensive questionnaire that is available on this homepage. It was originally designed by Gisbert Fanselow (1959-2022) and extended by Andreas Hölzl, Nina Adam, and Andreas Pregla. It serves two main purposes: First, similar to the Lingua Descriptive Studies Questionnaire (Comrie & Smith 1977), it can function as a guideline for describing aspects of the syntax of individual languages. Second, for the purpose of this database, it allowed for the elicitation of specific information concerning the OV / VO alternation. The result of the data collection was a set of extensive language reports for the individual languages (on average approximately 120 pages). These were manually or semiautomatically coded and transferred to a uniform file format adequate for long-term storage (CSV). These individual files can be either accessed through the interface of this homepage or downloaded for further use. Almost all examples are glossed, generally adhering to the Leipzig Glossing Rules.

The following briefly illustrates one example from the questionnaire (question 6) and the resulting data with the help of data from Nepali (Indo-Iranian, Indo-European) provided by Prof. Dubi Nanda Dhakal. While previous research collapsed the oblique elements mentioned above into one broad category, the questionnaire distinguishes between three different types of obliques: instruments (question 6a), places (question 6b), and directions (question 6c). Importantly, this database not only presents the possible word orders and preferences but also explicitly includes grammaticality judgements and negative data.

In the case of Nepali, both XOV and OXV order are equally possible in all three questions. Other word orders are considered ungrammatical in a wide focus context and are therefore marked with an asterisk. (For simplicity, the glossing is omitted in these ungrammatical examples here.)

(1) Instrument

a) meri-le burush-le nəjã tsitrə bəna-i.

Mary-ERG brush-INST new picture make-PST.3SG.F.NH

'Mary painted (made) a new picture with a brush.' (XOV)

b) meri-le nəjã tsitrə burush-le bəna-i.

Mary-ERG new picture brush-INST make-PST.3SG.F.NH

'Mary painted (made) a new picture with a brush.' (OXV)

c) *meri-le nəjã tsitrə bəna-i burush-le. (OVX)

d) *meri-le burush-le bəna-i nəjã tsitrə. (XVO)

e) *meri-le bəna-i burush-le nəjã tsitrə. (VXO)

f) *meri-le bəna-i nəjã tsitrə burush-le. (VOX)

(2) Location

a) meri-le park-ma nəjã tsitrə bəna-i.

Mary-ERG park-LOC new picture make-PST.3SG.F.NH

'Mary painted (made) a new picture in the park.' (XOV)

b) meri-le nəjã tsitrə park-ma bəna-i.

Mary-ERG new picture park-LOC make-PST.3SG.F.NH

'Mary painted (made) a new picture in the park.' (OXV)

c) *meri-le nəjã tsitrə bəna-i park-ma. (OVX)

d) *meri-le park-ma bəna-i nəjã tsitrə. (XVO)

e) *meri-le bəna-i park-ma nəjã tsitrə. (VXO)

f) *meri-le bəna-i nəjã tsitrə park-ma. (VOX)

(3) Direction

a) meri-le bədzar-ma nəjã tsitrə ləg-i.

Mary-ERG market-LOC new picture take-PST.3SG.F.NH

'Mary carried a new picture to the market.' (XOV)

b) meri-le nəjã tsitrə bədzar-ma ləg-i.

Mary-ERG new picture market-LOC take-PST.3SG.F.NH

'Mary carried a new picture to the market.' (OXV)

c) *meri-le nəjã tsitrə ləg-i bədzar-ma. (OVX)

d) *meri-le bədzar-ma ləg-i nəjã tsitrə. (XVO)

e) *meri-le ləg-i bədzar-ma nəjã tsitrə. (VXO)

f) *meri-le ləg-i nəjã tsitrə bədzar-ma. (VOX)

The inclusion of three different types of oblique elements allows for a more fine-grained classification of languages than the typology discussed above. While there is no difference in Nepali, other languages exhibit different word orders depending on the type of oblique element. For instance, in the lowland East Cushitic language Oromo, for which the data were provided by Wakweya Olani Gobena, XOV is the natural order for instruments and locations, while OXV is also possible. For directions, however, OXV is the only possible order.

(4) Afaan Oromo (Cushitic, Afro-Asiatic)

a) meerii-n bruuʃii-ɗaan fakkii haaraa botʃ'-t-e.

Mary-NOM brush-INSTR portrait new paint-3SF-PFV

'Mary painted a new portrait with a brush.'

b) meerii-n paark-ittʃa keessa-tti fakkii botʃ'-t-e.

Mary-NOM park-DEF inside-LOC portrait paint-3SF-PFV

'Mary painted a portrait in the park.'

c) meerii-n fakkii gara gabaa-tti geessi-t-e.

Mary-NOM portrait towards market-LOC take-3SF-PFV

'Mary took the portrait to the market.'

In many cases, additional questions tested for the effects of information structure. For instance, question 15 further investigates word order possibilities for oblique elements under information focus (i.e., as answers to different kinds of content questions). The exact contexts are described and illustrated in detail in the questionnaire.

Coding

To make the data accessible for computational approaches, hypothesis testing, and comparisons with already available databases, the individual datapoints needed to be transformed into a suitable format. For the sake of this coding procedure, individual questions in the questionnaire were split into different lines corresponding to the number of logical possibilities (the limits of variability). In the case of oblique elements, each of the three sub-questions on different types of adjuncts was split into six different lines corresponding to O-X-V, X-O-V, V-O-X, V-X-O, X-V-O, and O-V-X, respectively. In this example, there is thus a total of 18 (3 × 6) different lines. The 238 (sub)questions in the questionnaire resulted in 840 lines per CSV file, each of which ideally contains examples, values assigned to these data, and an indication of preferences (see below). The CSV files for the 40 languages in the sample thus amount to a total of 33,600 lines, although there are some gaps where the relevant data are either missing or the questions were not applicable to a given language. Since every line contains multiple datapoints (value, preference, comment, examples, glossings, translations), the total number of datapoints is much higher. Comments provide additional information on contexts, grammaticality judgements, and further background information relevant for the individual datapoints.
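The line counts given in this section can be retraced with a short Python sketch. The labels for the oblique types and the variable names below are illustrative assumptions, not the identifiers used in the actual CSV files.

```python
from itertools import permutations, product

# Hypothetical labels for the three sub-questions 6a-6c.
OBLIQUE_TYPES = ["instrument", "location", "direction"]

# The six logical possibilities for ordering O, X, and V.
ORDERS = ["-".join(p) for p in permutations("OXV")]

# One coding line per combination of oblique type and linearization.
lines = list(product(OBLIQUE_TYPES, ORDERS))
print(len(ORDERS), len(lines))  # 6 18

# Figures stated in the text: 840 lines per CSV file, 40 languages.
LINES_PER_FILE = 840
LANGUAGES = 40
print(LINES_PER_FILE * LANGUAGES)  # 33600
```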

For the purpose of coding, one of the following six values was used for each line. The values 0 and 1 are assigned if a given linearization or feature is absent or present in the language under investigation, respectively. In the case of Nepali, in each of the three sub-questions the value 1 is assigned to X-O-V and O-X-V and the value 0 to the remaining four alternatives. The value 0.5 is assigned if a certain order is potentially available in a language but its grammaticality is uncertain. If a given datapoint is missing, the value ND is used. If a given question cannot be applied to a language (e.g., if it lacks a certain category), the value NA is used. If a datapoint is available but for some reason cannot be assigned to any of the other values or remained uncertain, the value U is employed. Additionally, if more than one option is grammatical in a language (the value 1 occurs more than once) but one of them is more natural or preferred, this is indicated in a separate column, using P for preferred and S for secondary (Table 1).

Table 1: Overview of the coding system employed in this database

Category     Coding               Meaning
Examples     (no marking)         grammatical (in this context)
Examples     #                    infelicitous in this context
Examples     ?                    grammaticality unclear
Examples     *                    ungrammatical
Values       0                    impossible, ungrammatical ("*" or "#")
Values       1                    possible, grammatical
Values       0.5                  grammaticality unclear ("?")
Values       ND                   no data
Values       NA                   not applicable
Values       U                    unclear/uncertain classification
Preferences  P                    primary/preferred choice (in this context)
Preferences  S                    secondary choice (in this context)
Preferences  ND (or left empty)   no data
Preferences  NA                   not applicable
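As a minimal sketch of how the scheme in Table 1 could be applied, the following Python snippet maps the judgement marks on examples to values and codes the Nepali instrument data from (1). The function and variable names are hypothetical, and the snippet is not the project's actual coding script; values are kept as strings because that is how they would appear in a CSV cell.

```python
# Judgement mark on an example -> coded value (following Table 1).
MARK_TO_VALUE = {"": "1", "*": "0", "#": "0", "?": "0.5"}
VALID_VALUES = {"0", "1", "0.5", "ND", "NA", "U"}


def code_value(example):
    """Code one example; None stands for a missing datapoint (ND)."""
    if example is None:
        return "ND"
    mark = example[0] if example.startswith(("*", "#", "?")) else ""
    return MARK_TO_VALUE[mark]


# Nepali, question 6a (instrument), as given in example (1).
nepali_instrument = {
    "X-O-V": "meri-le burush-le nəjã tsitrə bəna-i.",
    "O-X-V": "meri-le nəjã tsitrə burush-le bəna-i.",
    "O-V-X": "*meri-le nəjã tsitrə bəna-i burush-le.",
    "X-V-O": "*meri-le burush-le bəna-i nəjã tsitrə.",
    "V-X-O": "*meri-le bəna-i burush-le nəjã tsitrə.",
    "V-O-X": "*meri-le bəna-i nəjã tsitrə burush-le.",
}

values = {order: code_value(ex) for order, ex in nepali_instrument.items()}
grammatical = sorted(order for order, v in values.items() if v == "1")
print(grammatical)  # ['O-X-V', 'X-O-V']
```

Since both grammatical orders are reported as equally possible in Nepali, no P/S preference would be recorded for these lines.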

Language sample

Given the large amount of data collected per language, the language sample is comparatively small. The database currently contains data from 40 languages from around the world. The choice of the languages was based on multiple considerations. First, we aimed for a balanced sample in terms of OV / VO order. Currently, the database includes 20 languages with OV order, 19 with VO order, and one without basic word order (Hungarian). Here, the sample aimed for maximal cross-linguistic variability, covering languages with V2 order in addition to OV (Bernese German, Standard German) or the AuxOV order typical for West Africa (Jula, Tunen). Second, we included languages from as many different world regions as possible. The dataset currently covers six macro-areas: Africa, Asia, Europe, Mesoamerica, South America, and Southeast Asia. Third, we sought to maximize the phylogenetic diversity. The 40 languages represent 13 separate language families, and altogether 24 different branches (Table 2). Given previous claims that Slavic languages do not represent (S)VO order but are underlyingly underspecified (Haider & Szucsich 2022), altogether five languages from this branch of Indo-European were included.

Table 2: Phylogenetic diversity in the language sample

Family                           Branches  Languages
Afroasiatic                      2         2
Atlantic-Congo                   4         6
Austroasiatic                    1         1
Austronesian                     1         1
Dravidian                        2         2
Indo-European                    4         11
Khitano-Mongolic                 1         1
Koreanic                         1         1
Kra-Dai                          1         1
Mayan                            1         1
Quechuan                         2         2
Trans-Himalayan (Sino-Tibetan)   2         5
Turkic                           2         3

However, the languages were also chosen on the basis of further considerations: (A) To avoid excessive overlap with previous samples, the project includes at least 12 understudied, endangered, or minority languages, namely Bwamu, Gagauz, Huarong Chinese, Ika, Jejueo, Jula, Kangle Chinese, Linxia Chinese, Mopan Maya, Nepali Kurux, Ninkaré, and Upper Sorbian. Some of these had not been described in detail before (e.g., Huarong Chinese) or were only partly described in Chinese (e.g., Kangle Chinese) or French (e.g., Bwamu).

(B) Additionally, following the example of Haider (2020), multiple languages from the same language family but representing either OV or VO were chosen. Systematic structural differences between such related languages can lend additional support to the cross-linguistic correlations. Examples from the database are listed in Table 3. Additionally, this also includes more distantly related languages like Mandarin (VO) vs. Newar (OV) within Trans-Himalayan (Sino-Tibetan), Italian (VO) vs. Nepali (OV) within Indo-European, etc.

Table 3: OV / VO within a single subfamily or branch

Subfamily / branch   VO                                          OV
(Northern) Sinitic   Kangle Chinese, Mandarin, Huarong Chinese   Linxia Chinese
(Oghuz) Turkic       Gagauz                                      Turkish, Kazakh
(West) Germanic      English                                     German (+ V2), Bernese German (+ V2)
(West) Slavic        Czech, Polish, Serbian, Slovene             Upper Sorbian
Finno-Ugric          Finnish                                     Hungarian (OV / VO)

For instance, Standard Mandarin, like most other Sinitic languages, typically has SVO order. Additionally, under certain conditions, there is marked SOV order in combination with flagging by a preposition derived from verbs (question 66). However, several Sinitic languages, such as Linxia Chinese, have undergone language change to SOV order, possibly due to language contact (with Mongolic, Tibetic, and Turkic). Among other features, these languages have developed a postnominal case marking or flagging system involving suffixes or enclitics, which is otherwise highly untypical for Sinitic (Hölzl preprint). Further changes in language structure include the word order in copula constructions (question 79) and comparative constructions (question 82), among others.

Another relevant group is the Slavic branch of the Indo-European languages, which is largely uniform in displaying basic VO order. However, under the influence of German, Sorbian (and especially Upper Sorbian) has developed neutral OV order. In contrast to German, this applies to both embedded and main clauses. Arguably also due to the contact situation, colloquial Upper Sorbian has developed other features untypical for Slavic, such as the use of expletives with weather expressions (question 23), a feature common in Northwestern Europe (Eriksen et al. 2010: 575):

(5) Colloquial Upper Sorbian (Slavic, Indo-European)

To so děšćika dźo. To so hrimoce.

it REFL rain go.PRS.3SG it REFL thunder.PRS.3SG

'It is raining. It is thundering.' (Scholze 2007: 322; our glossing)

(6) Czech (Slavic, Indo-European)

Prší. Hřmí.

rain.PRS.3SG thunder.PRS.3SG

'It is raining. It is thundering.' (example provided by Nina Adam)

Another example is the availability of the “wh-scope marking” construction in Sorbian, see question 19 (Fanselow 2025). A systematic comparison of a range of features in Sorbian, other West Slavic languages such as Czech and Polish, and German can shed light on the relationship between contact-induced changes and changes that might be tied to a more fundamental shift in underlying structure. Examples of this sort need careful qualitative evaluation to differentiate between structural and areal factors. The large amount of data on each individual language will allow for the detailed study of even minute similarities and differences.

(C) Finally, the language sample also includes languages with rare word orders, such as XVO in Sinitic languages, or rare constructions, such as the wh-scope marking construction (question 19), the relative-correlative construction (question 40), or the mermaid construction (question 59).

Aims and unique properties

This database is a contribution not only to the cross-linguistic study of syntax, but also to the documentation and description of languages from around the world, including some that have not been described in much detail before (e.g., Bwamu, Huarong Chinese, Kangle Chinese, Nepali Kurux, Linxia Chinese, Ninkaré). Additionally, some of the languages included in the study have only a few remaining native speakers (e.g., Ika, Jejueo, Mopan Maya, Upper Sorbian). Some of the features, such as wh-scope marking, are cross-linguistically rare or have not been addressed in global surveys before (Fanselow 2025). The documentation will allow such underdescribed or endangered languages and rare features to be included in future cross-linguistic studies.

Summarizing the points made above, there are several properties that differentiate this database from existing databases like WALS or Grambank:

  1. the data were collected primarily from native speakers rather than taken from grammars
  2. it includes negative data as well as grammaticality judgements and preferences
  3. almost all examples are glossed and freely available on this homepage
  4. many of the questions address very specific features, some of which are not easily found in grammatical descriptions

Limitations

The methodological background and reliance on elicitation lead to a slight overrepresentation of some areas (Europe, Asia) and language families (Indo-European). Currently, this database has no data from languages of North America, New Guinea, or Australia. Because of differences in the accessibility of the languages, a reduced version of the questionnaire was used for some languages (resulting in some gaps being indicated with ND). Given differences in language structure, some questions were not applicable (leading to gaps indicated with NA).

References

Adam, Nina & Andreas Hölzl. 2024. Subjects in word order and alignment typology. In Artemis Alexiadou, Doreen Georgi, Fabian Heck, Gereon Müller & Florian Schäfer (eds.), Gisbert Fanselow's contributions to syntactic theory (Linguistische Arbeits Berichte 96), 311-322. Leipzig: Universität Leipzig.
Comrie, Bernard & Norval Smith. 1977. Lingua Descriptive Studies: Questionnaire. Lingua 42(1): 1-72.
Dryer, Matthew S. 1997. On the six-way word order typology. Studies in Language 21: 69-103.
Dryer, Matthew S. 2007. Word order. In Timothy Shopen (ed.), Language typology and syntactic description, vol. 1, 61-131. 2nd edn. Cambridge: Cambridge University Press.
Dryer, Matthew S. & Orin D. Gensler. 2013. Order of object, oblique, and verb. In Matthew S. Dryer & Martin Haspelmath (eds.), The world atlas of language structures online. Leipzig: MPI for Evolutionary Anthropology. https://wals.info/chapter/84
Eriksen, Pål Kristian, Seppo Kittilä & Leena Kolehmainen. 2010. The linguistics of weather. Studies in Language 34: 565-601.
Fanselow, Gisbert. 2020. Is the OV-VO distinction due to a macroparameter? In Masatoshi Tanaka, Tomoya Tsutsui & Masashi Hashimoto (eds.), Linguistic Research as an Interdisciplinary Science, 1-26. Tokyo: Hitsuji Publishers.
Fanselow, Gisbert. 2025. Remarks on the distribution of wh-scope marking. In Łukasz Jędrzejowski, Uwe Junghanns, Kerstin Schwabe & Carla Umbach (eds.), Syntax, semantics, and lexicon: Papers by and in honor of Ilse Zimmermann (Open Slavic Linguistics 9), 183-209. Berlin: Language Science Press.
Haider, Hubert. 2020. VO/OV-base ordering. In Michael T. Putnam & B. Richard Page (eds.), Cambridge handbook of Germanic linguistics, 339-364. Cambridge: Cambridge University Press.
Haider, Hubert & Luka Szucsich. 2022. Slavic languages - SVO languages without SVO qualities? Theoretical Linguistics 48(1/2): 1-39.
Hölzl, Andreas. Preprint. Postnominal flagging and OV in Sinitic: Areal and typological perspectives. Studies in Language 49.
Kobayashi, Masato & Bablu Tirkey. 2017. The Kurux language: Grammar, texts, and lexicon. Leiden: Brill.
Scholze, Lenka. 2007. Das grammatische System der obersorbischen Umgangssprache. University of Konstanz. (Doctoral dissertation.)