Corpus of Spoken Professional American-English -- Untagged Version

Info on tagged version

Description of corpus

The corpus, which has been constructed from a selection of existing transcripts of interactions in professional settings, contains two main sub-corpora of a million words each. One sub-corpus consists mainly of academic discussions such as faculty council meetings and committee meetings related to testing. The second sub-corpus contains transcripts of White House press conferences, which are almost exclusively question-and-answer sessions.

The transcripts that make up the corpus were selected because they are relatively unedited. However, since they were not produced by linguists, they lack some of the features one might wish for. For further information on the corpus, you can read the more detailed description, examine or download a sample of the corpus (below), or contact Michael Barlow.

Price: $49 (individual user); $179 (site licence)


Sample of corpus

You can examine or download a sample of the corpus. The sample differs from the actual corpus in that some sections have been deleted so that several text types fit in one file. In addition, the tags that mark speaker names, <SP> and </SP>, will probably not be displayed by your web browser, so speaker names will simply appear untagged.
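If you download the file rather than view it in a browser, the speaker tags remain in the text and can be used directly. The following Python sketch shows one way to count tagged speaker names. It assumes that each name is wrapped as <SP>NAME</SP> on a single line; the function name and the example text are made up for illustration and are not taken from the corpus.

```python
import re
from collections import Counter

# Assumption: a speaker name appears as <SP>NAME</SP> on one line.
SP_TAG = re.compile(r"<SP>(.*?)</SP>")

def speaker_counts(text):
    """Count how often each tagged speaker name occurs in the text."""
    return Counter(name.strip() for name in SP_TAG.findall(text))

# Invented example text, not an excerpt from the corpus.
example = (
    "<SP>SMITH:</SP> Thank you all for coming.\n"
    "<SP>JONES:</SP> Are there any questions?\n"
    "<SP>SMITH:</SP> Yes, one.\n"
)

for name, count in speaker_counts(example).most_common():
    print(name, count)
```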

Each section starts with (Sample n). You can use this marking if you want to download the file and separate the different text types.
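As a rough illustration, a downloaded copy could be split on these markers with a short Python script. The sketch below assumes that n is a whole number and that the marker is written exactly as "(Sample 1)", "(Sample 2)", and so on; the function name and the example text are hypothetical.

```python
import re

# Assumption: markers look like "(Sample 1)", "(Sample 2)", ...
SAMPLE_MARKER = re.compile(r"\(Sample\s+(\d+)\)")

def split_samples(text):
    """Map each sample number to the text that follows its marker."""
    # Splitting on a pattern with one capturing group yields
    # [preamble, n1, body1, n2, body2, ...].
    parts = SAMPLE_MARKER.split(text)
    samples = {}
    for i in range(1, len(parts) - 1, 2):
        samples[int(parts[i])] = parts[i + 1].strip()
    return samples

# Invented example text, not an excerpt from the corpus.
example = "(Sample 1)\nfirst text type\n(Sample 2)\nsecond text type\n"

for number, body in split_samples(example).items():
    print(number, len(body.split()), "words")
```

Any text before the first marker is ignored, and each section could then be written to its own file if the text types need to be kept apart.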

The file is around 300 KB.