Building a French-Comorian parallel corpus using French-Swahili MT
Moneim Abdourahamane (1,2), Christian Boitet (1,2), Valérie Bellynck (1,3), Lingxiao Wang (1,2)
(1) LIG, campus, 38041 Grenoble cedex 9, France
(2) UGA, adresse, 38401 Saint Martin-d'Hères, France
(3) G-INP, 47 av. Félix Viallet, 38000 Grenoble, France
Abstract
Comorian, or Shikomori, is a macro-language made up of four closely related dialects (Ngazidja,
Maore, Mweli, Ndzuani), and
is quite close to Swahili. It is severely under-resourced as far as computerized
linguistic resources are concerned, having neither corpora, nor dictionaries,
nor spelling correction or machine translation (MT) tools. It is hence a priori not possible to build a parallel corpus efficiently
in the way we know how to build one, namely by MT followed by online post-editing (PE): for
French-Chinese, this takes 17 min/page with Google Translate (GT), and 12 min/page with
the MosesLIG.fr-zh MT system and SECTra/iMAG. We are
nevertheless on the way to achieving it by post-editing Swahili "pre-translations" produced
by GT. Swahili is used here not as a pivot language, but as an auxiliary language. We
now have a good-quality French-Ngazidja corpus containing 14 articles from the
Alwatwan newspaper (366 segments, 6754 words, 27 standard pages). In parallel, we
extract bilingual lexical correspondences. The first application envisaged is
active reading of French for Comorian speakers; it will use the dictionary and
the MT system derived, respectively, from the lexical database and the growing bilingual
corpus.
Keywords: parallel corpus building, French-Comorian, Swahili, auxiliary language