Text Corpora and Corpus Linguistics


Introduction

If you are new to corpus investigations, I should point out that it takes a little time playing with corpora to see how they can best be exploited to find out about different aspects of language usage. Some people find the sheer amount of data overwhelming and uninterpretable at first, but with some practice you will develop the ability to pick out interesting patterns in the data. If you'd like to see how an expert interprets corpus data, you can consult Susan Hunston's Corpora in Applied Linguistics or John Sinclair's Reading Concordances.

Some corpora are very expensive and others are a little difficult to get hold of, but if you are starting out and are investigating English, then I'd recommend you get the ICAME CDROM, which contains a good variety of commonly used corpora; the British National Corpus; the American National Corpus; and, if you are interested in spoken American English, the Santa Barbara corpus.

In terms of software, our own MonoConc Pro is generally considered to be powerful and easy to use, and it is used in many university courses on Corpus Linguistics. There are others, however, including free concordancers. There are also online concordance/corpus sites where you can try concordancing.

Why use corpora?

How do we know about language? If we are preparing to teach a course on Scientific English, how would we know what to include in such a course? We can consult grammars and dictionaries, but then we would want to know how the authors of these reference texts obtained their knowledge about language in use. We can always use our intuitions: perhaps scientists use the passive a lot when writing. But then we would still want to know when and how the passive is used, and which verbs tend to be used in the passive. And we would want to make sure that our intuition is, in fact, correct. What are your intuitions concerning the use of conditional clauses in scientific writing?

It is clear that some facts about language usage would be very helpful. To get started, we just need an appropriate corpus and some software to help us look for patterns in the language data.
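
To give a rough idea of what such software does, here is a minimal sketch of a keyword-in-context (KWIC) concordancer in Python. It is only an illustration and not how MonoConc Pro or any other concordancer actually works; the function name `kwic`, the `width` parameter, and the sample sentences are all made up for this example. It finds each whole-word occurrence of a search word and prints it with a fixed amount of text on either side, so that the hits line up in a column.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword.

    `width` is the number of characters of context kept on each side
    (a hypothetical parameter chosen for this sketch).
    """
    lines = []
    # \b restricts matches to whole words; the search ignores case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords form a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Invented sample text standing in for a real corpus of scientific writing.
sample = (
    "The samples were heated to 90 degrees. The results were recorded "
    "after each trial. We were surprised that the solution was stirred."
)

for line in kwic(sample, "were", width=25):
    print(line)
```

Reading down the column of hits for "were" is one simple way to start checking an intuition such as "scientists use the passive a lot": you can see at a glance which verbs follow the search word, and which hits (like "were surprised") need a closer look.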