
Building a French-Comorian parallel corpus using French-Swahili MT

Moneim Abdourahamane^{1, 2} Christian Boitet^{1, 2} Valérie Bellynck^{1, 3} Lingxiao Wang^{1, 2}
Hervé Blanchon^{1, 2}

(1) LIG, campus, 38041 Grenoble cedex 9, France

(2) UGA, adresse, 38401 Saint Martin-d’Hères, France

(3) G-INP, 47 av. Félix Viallet, 38000 Grenoble, France

Abstract

Comorian, or Shikomori, is a macro-language made up of four closely related dialects (Ngazidja, Maore, Mweli, Ndzuani) and is quite close to Swahili. It is severely under-resourced in terms of computerized linguistic resources: it has no corpora, no dictionaries, and no correction or machine translation (MT) tools. It therefore seems a priori impossible to build a parallel corpus efficiently in the way we know how, namely by MT followed by online post-editing (PE): for French-Chinese, this takes 17 min/page with Google Translate (GT) and 12 min/page with the MosesLIG.fr-zh MT system and SECTra/iMAG. We are nevertheless on the way to achieving it by post-editing Swahili "pre-translations" produced by GT. Swahili is used here not as a pivot language but as an auxiliary language. We now have a good-quality French-Ngazidja corpus containing 14 articles from the Alwatwan newspaper (366 segments, 6754 words, 27 standard pages). In parallel, we extract bilingual lexical correspondences. The first application envisaged is active reading of French for Comorian speakers; it will use the dictionary derived from the lexical database and the MT system derived from the growing bilingual corpus.

Keywords: parallel corpus building, French-Comorian, Swahili, auxiliary language