The term parallel corpora is typically used in linguistic circles to refer to texts that are translations of each other, while the term comparable corpora refers to texts in two languages that are similar in content but are not translations. To exploit a parallel text, some kind of text alignment, which identifies equivalent text segments (approximately sentences), is a prerequisite for analysis. (Some researchers take the next step and aim for lexical alignment.)
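To make the idea of alignment concrete, here is a minimal Python sketch of what an aligned text looks like as data and how a parallel search can use it. It assumes the alignment has already been done and is available as a list of (source segment, target segment) pairs; the function name parallel_concordance and the sample sentences are invented for illustration and are not taken from ParaConc or any other tool.

```python
import re

def parallel_concordance(aligned, term):
    """Return the aligned pairs whose source segment contains `term` as a whole word."""
    needle = term.lower()
    hits = []
    for source, target in aligned:
        # Tokenise crudely on word characters so punctuation does not block a match.
        words = re.findall(r"\w+", source.lower())
        if needle in words:
            hits.append((source, target))
    return hits

# Hypothetical sentence-aligned English-French data.
aligned = [
    ("The house is old.", "La maison est vieille."),
    ("I like the garden.", "J'aime le jardin."),
    ("The old house has a garden.", "La vieille maison a un jardin."),
]

for source, target in parallel_concordance(aligned, "house"):
    print(source, "|", target)
```

The search only looks at the source side; the translation is shown because the alignment says which target segment belongs to each hit, which is why alignment has to come before analysis.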


Some projects associated with ParaConc are described here.

European Parliament

Aligned files created by Philipp Koehn. Available in Danish-English, German-English, Greek-English, Spanish-English, Finnish-English, French-English, Italian-English, Dutch-English, Portuguese-English, Swedish-English. Each corpus is about 100 MB. Note that the alignment has been done pairwise, so you cannot straightforwardly obtain an English-French-Spanish corpus.
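One possible workaround for the pairwise limitation is to pivot on the English side: keep only those segments whose English text appears in both the French-English and the Spanish-English data. The Python sketch below illustrates the idea under strong assumptions: each corpus has already been loaded as a list of (English, other language) pairs, and the English segments match exactly. The function name pivot_on_english and the sample data are hypothetical, and this is not a procedure documented for these files.

```python
def pivot_on_english(en_fr, en_es):
    """Build (English, French, Spanish) triples from two pairwise-aligned corpora."""
    # Index the Spanish side by its English segment; keep the first occurrence
    # when the same English segment appears more than once.
    es_by_en = {}
    for en, es in en_es:
        es_by_en.setdefault(en.strip(), es)

    triples = []
    for en, fr in en_fr:
        key = en.strip()
        if key in es_by_en:
            triples.append((key, fr, es_by_en[key]))
    return triples

# Hypothetical pairwise data; the second English segment differs between corpora.
en_fr = [
    ("The sitting is open.", "La séance est ouverte."),
    ("Thank you, Mr President.", "Merci, Monsieur le Président."),
]
en_es = [
    ("The sitting is open.", "Se abre la sesión."),
    ("Thank you, President.", "Gracias, señor Presidente."),
]

for triple in pivot_on_english(en_fr, en_es):
    print(triple)
```

Segments whose English side was split or worded differently in the two pairings are silently dropped, and short formulaic segments that recur many times are matched to their first occurrence only, so the result is a subset of the data rather than a full trilingual corpus.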

Czech National Corpus Czech-Other Languages

Professor Frantisek Cermak and his colleagues are using ParaConc to analyse translations of texts in the Czech National Corpus.

Centre for Corpus Linguistics, Birmingham University Chinese-English
Pernilla Danielsson.

INTERSECT: a Parallel Corpus Project
Raphael Salkie

The INTERSECT (International Sample of English Contrastive Texts) Project at Brighton University began in the Spring of 1994. The aim is to construct and analyse a parallel bilingual corpus of French and English written texts, adding other languages later if resources permit.

The Contrastive Grammar Research Group, University of Gent.

A project involving the construction of multilingual corpora for English, French, Greek and some others, for use in language pedagogy.

Building tools for multilingual corpus access, and also a bunch of sample corpora. Contact

Parallel and comparable corpora in Eastern European languages.

A Scandinavian project to build multilingual (English/Swedish/Norwegian/Finnish) parallel corpora. Contact

English-Norwegian Parallel Corpus Project
ENPC: information on the English-Norwegian Parallel Corpus (University of Oslo); includes an on-line search facility.

Knut Hofland has also set up an interesting web-based search engine for some English-French texts.

TRIPTIC: TRIlingual Parallel Text Information Corpus
TRIPTIC is a trilingual corpus developed for the analysis of prepositions in English, French and Dutch. The corpus forms part of the empirical data used for research on the contrastive analysis of prepositions (PhD thesis). The object of the study, which assumes the cognitive linguistic framework, is to examine how languages converge and diverge in the semantic structure of so-called function words.

The corpus consists of 2,000,000 words, one half fiction, the other half non-fiction material. All paragraphs are aligned, allowing automatic selection of the n-th paragraph in the 3 languages.

The original text files have been converted into a database structure (4th Dimension on Macintosh), in order to facilitate the description of the prepositions under study.
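The practical benefit of paragraph alignment can be shown with a small Python sketch: if each language version is stored as an ordered list of paragraphs, the n-th paragraph can be pulled out in all three languages at once. This is only an illustration of the principle. TRIPTIC itself is held in a 4th Dimension database, and the names split_paragraphs and nth_paragraph, the blank-line paragraph convention, and the sample texts are all assumptions made for the example.

```python
def split_paragraphs(text):
    """Split raw text into paragraphs, assuming blank lines separate them."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def nth_paragraph(corpus, n):
    """Return paragraph number n (counting from 1) in every language.

    `corpus` maps a language name to its list of paragraphs.
    """
    lengths = {len(paragraphs) for paragraphs in corpus.values()}
    if len(lengths) != 1:
        raise ValueError("language versions differ in paragraph count; not aligned")
    if not 1 <= n <= lengths.pop():
        raise IndexError("paragraph number out of range")
    return {lang: paragraphs[n - 1] for lang, paragraphs in corpus.items()}

# Hypothetical three-language sample with two aligned paragraphs each.
corpus = {
    "English": split_paragraphs("He sat on the chair.\n\nShe walked into the room."),
    "French": split_paragraphs("Il s'assit sur la chaise.\n\nElle entra dans la pièce."),
    "Dutch": split_paragraphs("Hij zat op de stoel.\n\nZij liep de kamer in."),
}

for lang, paragraph in nth_paragraph(corpus, 2).items():
    print(lang + ":", paragraph)
```

The check on paragraph counts is a cheap sanity test: position-based selection is only meaningful when every language version has the same number of paragraphs in the same order.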

Translation Corpus of English and German

Prof. Schmied at the Technical University of Chemnitz-Zwickau is compiling a translation corpus of English and German.

The corpus at present includes EC material, academic textbooks, modern fiction and tourist brochures (approx. 500,000 words altogether). The researchers are currently looking at aspects such as culture-specific problems in translation and translationese.

Corpora projects Språkteknologi, University of Uppsala

Erik Tjong Kim Sang sent email about a project in Sweden that is currently structuring two multilingual text corpora and integrating them with the lexical resources the group has available. The prime goal is to use the resulting corpus for research in Machine Translation. Anna Sågvall-Hein is the project leader.

Thai On-Line Library (TOLL) of parallel Thai/English texts.


A. Barentsen. Holland. Slavic Languages
Working on Slavic constructions that are more or less equivalent to English constructions such as "let me ..." or "let us ...". Also interested in "taxis", i.e. the expression of temporal relations between events.

Monika Szirmai. Japan.

Sources of texts

European Language Resources Association

Canadian Hansard. Web-searchable.

Canadian Embassy English texts
Canadian Embassy French texts

LDC material: tends to be expensive if priced with respect to individual corpora. Includes Canadian Hansard and EC materials.

European Corpus Initiative (ECI) have produced a cheap CD-ROM which contains a wide variety of corpora, including some non-aligned parallel texts.

Cathrine Fabricius-Hansen
Germanistisk institutt
P.b. 1004, Blindern
N-0315 Oslo