Building a French-Comorian parallel corpus using French-Swahili MT

Moneim Abdourahamane1,2 Christian Boitet1,2 Valérie Bellynck1,3 Lingxiao Wang1,2
Hervé Blanchon1,2

(1) LIG, campus, 38041 Grenoble cedex 9, France

(2) UGA, adresse, 38401 Saint-Martin-d'Hères, France

(3) G-INP, 47 av. Félix Viallet, 38000 Grenoble, France

prenom.nom@imag.fr

 

Abstract

Comorian, or Shikomori, is a macro-language made up of four dialects that are very close to one another (Ngazidja, Maore, Mweli, Ndzuani) and quite close to Swahili. It is severely under-resourced in terms of computerized linguistic resources, having no corpora, no dictionaries, and no correction or machine translation (MT) tools. It is therefore not possible, a priori, to build a parallel corpus efficiently in the way we already know, that is, by MT followed by online post-editing (PE): for French-Chinese, 17 min/page with Google Translate (GT), and 12 min/page with the MosesLIG.fr-zh MT system and SECTra/iMAG. We are nevertheless on the way to achieving this by post-editing Swahili "pre-translations" produced by GT. Swahili is used here not as a pivot language but as an auxiliary language. We now have a good-quality French-Ngazidja corpus containing 14 articles from the Alwatwan newspaper (366 segments, 6754 words, 27 standard pages). In parallel, we extract bilingual lexical correspondences. The first application envisaged is active reading of French for Comorian speakers; it will use the dictionary and the MT system derived, respectively, from the lexical database and the growing bilingual corpus.

 

Keywords: parallel corpus building, French-Comorian, Swahili, auxiliary language