From: Stefan Benus
Sent: Friday, January 11, 2008 3:33 am
To: Adamantios Gafos
Subject: [Fwd: Re: electronic database of Hungarian]
Attachments: vCard(sb513) 1K

--
Štefan Benuš
http://www1.cs.columbia.edu/~sbenus/
Department of English and American Studies
Constantine the Philosopher University
Štefánikova 67
94974 Nitra, Slovakia

----- Original Message -----
From: andras@kornai.com
Date: Fri, 29 Mar 2002 21:50:05 +0000 (GMT)
To: sb513@nyu.edu (Stefan Benus)
Cc: andras@kornai.com
Subject: Re: electronic database of Hungarian

Stefan Benus writes:
>
> Dear Andras,
> yeah, I did get it fine, thank you very much. If it was bigger than 5-6 Mb, then I
> would have to make room for it but this is fine.
> Again, thanks a lot,
> Stefan
> PS: Do I need some special software to be able to search the database?

Stefan,

no, this is a flat file. The easiest way to search it is by regular expressions (grep, egrep, etc. if you use Unix, or perl regexps if you only have Win). Here are some basics.

Accents are in "Pro1sze1ky" code, so that a1 is long a, o1 is long o, u1 is long u, e1 is long e, i1 is long i -- the values follow the orthography rather than actual phonetic length, though in 95% of the cases the two are identical. 2 means rounding, so that o2 is o umlaut and u2 is u umlaut, and 3 = 1+2, i.e. length plus rounding (the double acute accent peculiar to Hungarian orthography).

A typical entry is like

a1bra1ndoza1s n CVvvccvvccvcvvc F1 O9 D1 T01 A03 PL02 PO01

The first field is the word, in this case "daydreaming". The second is the part of speech, in this case "n" for noun. The third field begins with CV and contains the CV-skeleton of the word. The fourth field begins with F and contains a single-digit frequency indicator: the higher the number, the more frequent the word.
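
As a rough illustration of the entry format described so far, here is a minimal Python sketch that splits one entry into its fields and turns the digit-coded accents back into accented letters. It assumes one entry per line with whitespace-separated fields, and that every field after the part of speech is an uppercase tag followed by its value. The names `PROSZEKY`, `decode_word`, and `parse_entry` are made up for this example and are not part of the database. Only lowercase vowels are handled, and the O, D, and A fields are kept as raw strings.

```python
import re

# Digit codes as described in the message: 1 = long, 2 = rounded (umlaut),
# 3 = long + rounded (double acute).
PROSZEKY = {
    "a1": "á", "e1": "é", "i1": "í", "o1": "ó", "u1": "ú",
    "o2": "ö", "u2": "ü",
    "o3": "ő", "u3": "ű",
}

def decode_word(word):
    """Replace vowel+digit pairs with the accented letter; leave the rest alone."""
    return re.sub(r"[aeiou][123]",
                  lambda m: PROSZEKY.get(m.group(0), m.group(0)),
                  word)

def parse_entry(line):
    """Split an entry into word, part of speech, and a tag -> value dict."""
    word, pos, *rest = line.split()
    fields = {}
    for item in rest:
        # Assumption: each remaining field is an uppercase tag plus a value,
        # e.g. "CVvvccvvccvcvvc" -> ("CV", "vvccvvccvcvvc"), "PL02" -> ("PL", "02").
        m = re.fullmatch(r"([A-Z]+)(.*)", item)
        if m:
            fields[m.group(1)] = m.group(2)
    return {"word": word, "pos": pos, "fields": fields}

entry = parse_entry("a1bra1ndoza1s n CVvvccvvccvcvvc F1 O9 D1 T01 A03 PL02 PO01")
print(decode_word(entry["word"]))  # ábrándozás
print(entry["fields"]["F"])        # 1
```

The same field tags can be used directly in a regular expression search, for instance matching ` T01 ` to pull out every entry with that accusative code.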

For vowel harmony, you are probably most interested in the T, PL, and PO fields, which code the accusative, plural, and possessive endings. The T codes are:

01 at
02 ot
03 t
04 et
05 o2t
11 rare at

and a code like T12 would mean free variation between -at and -ot.

The PL codes are as follows:

00 indeclinable/
01 ak/an
02 ok/on
03 k/lag
04 ek/ul
05 o2k/n
06 /l
07 /en
08 /leg
09 /u2l
10 /indeclinable back
11 rare ak/an
20 /front
22 rare ok/on
30 /mixed
33 rare k/lag
44 rare ek/ul
55 rare o2k/n
66 rare /l
77 rare /en
88 rare /leg
99 other

(as you can see, the fields are overloaded with some adverbial suffix info)

PO fields are:

00 does not exist
01 a/abb
02 ja/bb
03 /ebb
04 e
05 je
11 rare a/abb
22 rare ja/bb
33 rare ebb
44 rare e
55 rare je
99 other

You should get hold of Ferenc Papp's Reverse-Alphabetized Dictionary of the Hungarian Language (Szo1ve1gmutato1 Szo1ta1r) because a lot of what's going into the codes is explained in the foreword (there is an English foreword as well).

Eventually I'll prepare a fuller help file for this, but I don't have the time now; I hope this is enough to get you started.

Best of luck,
Andras
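
As an illustration of how the two-digit T codes in the message might be read, here is a small Python sketch. It relies on assumptions that go beyond what the message states: a leading 0 is taken to mean a single regular ending, a doubled digit (as in 11) a rare ending, and two different nonzero digits (as in T12) free variation between the two endings. Extending the "rare" reading to 22 through 55 is by analogy with the PL and PO tables, not something the message says about T. The names `T_ENDINGS` and `describe_t_code` are hypothetical.

```python
# Accusative endings by digit, taken from the T code list (o2 = o umlaut).
T_ENDINGS = {"1": "at", "2": "ot", "3": "t", "4": "et", "5": "o2t"}

def describe_t_code(code):
    """Return (status, endings) for a two-digit T code such as "01", "11" or "12".

    The interpretation of the digit pattern is an assumption (see lead-in).
    """
    if len(code) != 2:
        return ("unknown", [])
    first, second = code[0], code[1]
    if first == "0" and second in T_ENDINGS:
        return ("regular", [T_ENDINGS[second]])
    if first in T_ENDINGS and second in T_ENDINGS:
        if first == second:
            return ("rare", [T_ENDINGS[first]])
        return ("free variation", [T_ENDINGS[first], T_ENDINGS[second]])
    # Codes not covered by the list in the message are left uninterpreted.
    return ("unknown", [])

print(describe_t_code("01"))  # ('regular', ['at'])
print(describe_t_code("12"))  # ('free variation', ['at', 'ot'])
```

The PL and PO tables do not decompose this way (codes such as 10, 20, and 30 have their own meanings), so a plain lookup table would be the safer choice for those fields.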