The European Commission is offering translation software developers free access to around one million sentences translated between 22 of the European Union's 23 official languages. It hopes the data will help improve the quality of a variety of language tools, including grammar and spelling checkers, online dictionaries and machine translators -- particularly in less well-served languages such as Latvian or Romanian.
The sentences are mostly drawn from the "Acquis Communautaire," the body of law that must be implemented by all new E.U. member states, and include the treaties, directives and regulations adopted by the E.U., and rulings from the European Court of Justice.
Translated by professional translators, they cover topics such as IT, telecommunications, labor law, agriculture and fishing.
The translations form part of the "translation memory" used by the Commission's permanent staff of 1,750 translators. They are matched up, sentence by sentence, in each of the 22 languages and tagged with subject classifications.
The matching and tagging make the sentences especially useful for developers of statistical machine translation software, who must amass a corpus of thousands of matched sentences in the languages between which they wish to translate, so that they can calculate the most likely translation for any given expression. Since the matching of sentences has already been done, they will save time -- and the immense size of the Acquis Communautaire will help them make their calculations more accurate.
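To give a rough idea of how a "most likely translation" can be calculated from matched sentences, here is a minimal sketch in Python. It uses a textbook word-alignment method (IBM Model 1, trained with expectation-maximization) on a three-sentence toy corpus. The corpus, the function names and the choice of method are illustrative assumptions, not a description of the Commission's data or of any particular developer's software.

```python
from collections import defaultdict

# Hypothetical toy corpus of matched sentence pairs (English, German).
TOY_CORPUS = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]

def train_word_probabilities(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] with IBM Model 1 EM."""
    target_vocab = {w for _, tgt in pairs for w in tgt}
    # Start from a uniform guess: every target word is equally likely.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected co-translation counts
        total = defaultdict(float)   # expected counts per source word
        for src, tgt in pairs:
            for tw in tgt:
                # Share each target word among the source words
                # of the same matched sentence.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    share = t[(tw, sw)] / norm
                    count[(tw, sw)] += share
                    total[sw] += share
        # Re-estimate the probabilities from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t

def most_likely_translation(t, source_word):
    """Return the target word with the highest estimated probability."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)

if __name__ == "__main__":
    table = train_word_probabilities(TOY_CORPUS)
    for word in ("the", "house", "book", "a"):
        print(word, "->", most_likely_translation(table, word))
```

With only three sentence pairs the estimates are crude; the point of a large, pre-matched corpus is that the same kind of counting becomes far more reliable.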
Until now, developers have typically resorted to scouring the Web for texts translated into several languages, and using other software tools to make a guess at where sentences start and end in order to match them up.
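The guesswork involved can be illustrated with a deliberately naive Python sketch. It assumes a sentence ends at ".", "!" or "?" followed by whitespace and a capital letter, and it pairs sentences by position only when both texts yield the same number of sentences. The function names and the splitting rule are hypothetical; real alignment tools use more robust methods.

```python
import re

# Naive boundary guess: ., ! or ? followed by whitespace and an uppercase letter.
# Abbreviations such as "Mr. Smith" will be split wrongly, which is why
# this is only a guess.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÀ-ÖØ-Þ])")

def split_sentences(text):
    """Split text into sentences at the guessed boundaries."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]

def align_one_to_one(source_text, target_text):
    """Pair sentences by position; return [] if the counts do not match."""
    src = split_sentences(source_text)
    tgt = split_sentences(target_text)
    if len(src) != len(tgt):
        return []  # the guess failed; the texts need a closer look
    return list(zip(src, tgt))

if __name__ == "__main__":
    english = "The law applies. It enters into force today."
    german = "Das Gesetz gilt. Es tritt heute in Kraft."
    for pair in align_one_to_one(english, german):
        print(pair)
```

A corpus whose sentences are already matched by professional translators removes this step, along with the errors it introduces.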
While the release of the data will benefit software developers, the Commission is not being entirely altruistic: it hopes that the availability of better, cheaper automated translation software will help speakers of the E.U.'s minority languages by giving them access to online information currently available only in the more widely spoken languages.