CSPA Description

Description of corpus

The Corpus of Spoken, Professional American-English (CSPA) includes transcripts of conversations of various types occurring between 1994 and 1998. The corpus consists primarily of short interchanges by approximately 400 speakers that are centered on professional activities broadly tied to academics and politics, including academic politics. The seventeen files comprising the corpus contain over 2 million words and require 12 MB of space on the hard disk. For comparison, the Brown Corpus and the LOB Corpus, popular corpora from the sixties, each contain around 1 million words.

The corpus does not contain any searching software. I assume that the corpus will be analysed using a concordance program such as MonoConc, and in these notes I illustrate the content of the corpus using MonoConc. Let me emphasize, however, that the concordance program is sold separately from the corpus and is not a part of CSPA.

Coding of the transcripts

The transcripts have been coded in a minimal but consistent way. The speaker is indicated by last name when it is known. Thus if President Clinton says "Hi.", then that would be presented in the corpus as:

<SP>CLINTON:</SP> Hi.

An unknown speaker is represented as <SP>VOICE:</SP>.

Just about everything not enclosed between <SP> and </SP> represents the utterances themselves. Exceptions are limited to information about events in the speech situation such as applause or laughter. These non-language occurrences are presented between parentheses, as in (laughter).

A typical portion of the transcript is shown below:

<SP>STRICKLAND:</SP> Okay. Connie Juel has joined us. Connie, would you introduce yourself?

<SP>JUEL:</SP> Okay. My name is Connie Juel from the University of Virginia.

And I missed the opening question because my immediate concern was getting here.

(Laughter)

<SP>JUEL:</SP> The traffic and no room. So I'm here. Sorry for being late.
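The coding scheme described above is simple enough to process with a few lines of code. The following is a minimal sketch in Python, not part of CSPA or MonoConc; the function names and the list of event words (laughter, applause, pause) are assumptions made for illustration, and the real files may contain other parenthesised notes.

```python
import re

# Assumed markup, as described above: <SP>NAME:</SP> opens a speaker turn.
SPEAKER_TAG = re.compile(r"<SP>([^<:]+):</SP>")
# Assumed (incomplete) list of non-language events written in parentheses.
EVENT = re.compile(r"\((laughter|applause|pause)\)", re.IGNORECASE)


def parse_turns(text):
    """Return a list of (speaker, utterance) pairs from CSPA-style text."""
    # re.split with one capturing group yields
    # [preamble, speaker1, speech1, speaker2, speech2, ...].
    parts = SPEAKER_TAG.split(text)
    turns = []
    for speaker, speech in zip(parts[1::2], parts[2::2]):
        # Collapse the transcribers' blank lines into single spaces.
        turns.append((speaker.strip(), " ".join(speech.split())))
    return turns


def find_events(text):
    """Return the non-language events, lower-cased, in order of occurrence."""
    return [m.group(1).lower() for m in EVENT.finditer(text)]
```

Applied to the portion of transcript shown above, parse_turns would return three turns (one for STRICKLAND and two for JUEL). Note that in this sketch a note such as (Laughter) stays inside the text of the turn that precedes it; find_events reports such notes separately.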

If you wish to know more about the background of the speakers, the documentation provides some minimal information in entries of the following kind:

Dorothy S. STRICKLAND, State of New Jersey Professor of Reading, Graduate School of Education, Rutgers University, New Jersey, Reading Comm. [F]

Connie JUEL, Professor of Education, Curry School of Education, University of Virginia, Charlottesville, Virginia, Reading Comm. [F]

The information typically provided is name, affiliation, relevant sub-corpus, and sex. The sub-corpus options are UNC (Faculty meetings of the University of North Carolina), White House (Press conferences held at the White House and at other locations), Reading Comm (National meetings on Reading tests), and Math Comm (National meetings on Mathematics tests).
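Entries of this kind can be split into their parts automatically. The sketch below is only an illustration in Python and is based on the two sample entries above: it assumes that the name is the first comma-separated field, that the sub-corpus is the last field before the bracketed sex code, and that the codes are F and M (only F appears in the samples). The function name parse_entry is hypothetical.

```python
import re

# Assumed layout: name, affiliation (may contain commas), sub-corpus [sex]
ENTRY = re.compile(
    r"^(?P<name>[^,]+),\s*(?P<affiliation>.*),\s*"
    r"(?P<subcorpus>[^,]+?)\s*\[(?P<sex>[FM])\]\s*$"
)


def parse_entry(line):
    """Split one speaker entry into a dict, or return None if it does not fit."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    # The upper-case surname is what appears inside <SP>...</SP> in the transcripts.
    entry["tag"] = entry["name"].split()[-1]
    return entry
```

For the second sample entry, this would give the name Connie JUEL, the tag JUEL, the sub-corpus Reading Comm., and the sex code F, with everything in between treated as the affiliation.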

Reliability of transcripts

These transcripts have been chosen because they appear to be relatively unedited and thus include hesitations, false starts, and so on. You can use the corpus to search for features of spoken discourse such as utterance-initial well. However, since these transcriptions were not produced with the needs of discourse analysts in mind, they lack information that we might wish to find, and we cannot be sure that no normalisation of the data has occurred. Nevertheless, these transcriptions are a very useful resource and can provide insights into the lexis and structure of language associated with professional situations. The fact that the genre is professional discourse means that the form of the interactions is more similar to written discourse than more casual conversations would be.

Paragraph breaks in the form of blank lines have been added by the transcribers. While their significance is difficult to assess, they have been retained (but not coded with <p>) because they make the transcript much easier to read.

Basic investigations

Using a concordance program such as MonoConc, it is possible to look for non-language descriptions. Thus a search for (laughter) will lead to the following KWIC display:

1. ...t your mikes are on. And you are on. (Laughter) <SP>DOSSEY:</SP> So with that, I'd l...
2. ...tucky and do as little as possible. (Laughter) <SP>GRAMPP:</SP> I'm Joan Grampp. I'...
3. ...tion K-12, all areas for 27 schools. (Laughter) <SP>BASS:</SP> What do you do in you...
4. ... What do you do in your spare time? (Laughter) <SP>BASS:</SP> Hyman Bass. I'm a M...
5. ...too. <SP>SEELEY:</SP> Thank you. (Laughter) <SP>MANDEL:</SP> I'm David Mandel. I...
6. ...ll give your name and local address. (Laughter) <SP>DOSSEY:</SP> But there will be I...

In cases such as this in which more context is needed (to see what is causing the laughter), you can click on any concordance line in MfW to display the larger context.
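For readers without a concordance program, a KWIC display of this kind can be approximated in a few lines of Python. This is a rough sketch only and does not reproduce how MonoConc works; the function name kwic and the default context width are assumptions.

```python
import re


def kwic(text, query, width=40):
    """Yield (left, match, right) context triples for a literal query string."""
    flat = " ".join(text.split())  # treat line breaks as ordinary spaces
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        yield left, m.group(0), right
```

A call such as kwic(text, "(laughter)") would yield one triple per occurrence; increasing width plays the role of displaying the larger context.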

As the only tagging in the corpus is for the speaker name, it is always necessary to search for particular words or phrases. We cannot look for the function "introductions" or the function "apology", for example. Thus to investigate the language of introductions, we can search for the word name (or for other strings). Editing the results leads to the concordance lines displayed below.

1. ...of Wisconsin. <SP>PHILLIPS:</SP> My name is Gary Phillips. I'm in OERI. And I'm...
2. ...national test. <SP>FERRARA:</SP> My name is Steve Ferrara. I'm the State Directo...
3. ...then to proceed. <SP>SADLER:</SP>My name is Glenda Sadler. And I'm the Math Proj...
4. ...imony? (Pause) <SP>PUTNAM:</SP>My name is John Putnam. I am a retired junior h...
5. ...Good morning, committee members. My name is Joseph Jaramillo. I'm an education...
6. ...rch on Learning. <SP>GREENO:</SP> My name is James Greeno. I'm a Research Fellow...
7. ...ositive effect. <SP>KNUDSEN:</SP> My name is Jennifer Knudsen. I work at the Inst...

Other patterns can be found in the corpus, such as I'm Gail Burrill, and in some cases speakers introducing themselves simply give their name.

Similarly, if we want to look at apologies, we can search for both sorry and apolog*. A small selection of the results is presented here.

1. ...TE:</SP> What are we talking about? I'm sorry. <SP>MANDEL:</SP> This....
2. ...u know. They are busting some -- I'm sorry. But you know. And I'm not against livi...
3. ...is were on a -- <SP>SEELEY:</SP> I'm sorry, which number, Gail? <SP>BURRILL:
4. ...<SP>SEELEY:</SP> Which section? I'm sorry. D? <SP>BURRILL:</SP> D.
5. ...- algebra, computational skills. I'm sorry I can't think of them right off the top...
6. ...lography or -- <SP>BEAVERS:</SP> No, sorry. <SP>BASS:</SP> Or transformations
7. ...einwand. <SP>LEINWAND:</SP> Yes. I'm sorry I'm late. <SP>DOSSEY:</SP> And w...
8. ...'s helpful here? <SP>VOICE:</SP> I'm sorry. Can you say that again, Pat? <SP>WI...
9. ...ltiple choice. <SP>BURRILL:</SP> I'm sorry, but I have -- to make an inference fro...
10. ...MARTIN:</SP> Good morning. I'm late. I apologize. I'm Wayne Martin. I'm from the Cou...
11. ... Oh, I thought I saw one over there. I apologize. <SP>HORTON:</SP> No. They've read t...
12. ...like to say. And I would have said -- I apologize for coming back late from the break....
13. ... <SP>STRICKLAND:</SP> Gloria, I apologize. <SP>JOHNSTON:</SP> I think we're j...
14. ...KLAND:</SP> Now, I think I've got it. I apologize. I'm thinking of separate little ones...
15. ...Jane, I'd just to say one last thing. I apologize that we had some breakdowns, and I hop...
16. ...all that. <SP>BROWN:</SP> I want to apologize to Rachel Windham. I introduced her er...
17. ...t to this year. One is I would like to apologize to Joy Kasson for a brusque response t...
18. ...you've simply run out of lab space, I apologize. And finally, a regret that the Kenan...
19. .../SP> Yes. Anything else? Well, again, I apologize for going on so long. Thank you. (T...
20. ...hird part of our mission, service. The apology is for the fact that faculty were not i...
21. ...have an answer for you, I'm sorry. I apologize. I will take the question. <SP>VOIC...
22. ...s somewhat -- and it's our fault and I apologize for that. It was not intentional. I t...
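A wildcard search term such as apolog* can be imitated with a regular expression. The sketch below, in Python, assumes that * stands for any run of word characters and that terms match whole words only; this is one possible reading of the wildcard and may differ from MonoConc's own behaviour. The function names are hypothetical.

```python
import re


def wildcard_to_regex(term):
    """Compile a search term in which * stands for any run of word characters."""
    body = re.escape(term).replace(r"\*", r"\w*")
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


def find_terms(text, terms):
    """Return every word form in text matched by any of the terms, in text order."""
    hits = []
    for pattern in map(wildcard_to_regex, terms):
        hits.extend((m.start(), m.group(0)) for m in pattern.finditer(text))
    return [word for _, word in sorted(hits)]
```

With the terms sorry and apolog*, this picks up sorry, apologize, and apology alike, which corresponds to the mixture of forms in the lines above.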

These searches are presented here simply to provide a feel for the content of this particular corpus. For the final sample of results from the corpus, I have searched for the verb speak and sorted the concordance lines by alphabetical order of the word following the search word. When we do this, we discover basic information about usage and find, for example, that there are a variety of prepositions used with speak, not just the expected to.

1. ...have to go beyond what we have here to speak a little bit about what range of calcul...
2. ...be very clear in what we do and how we speak about what we want to be on the test....
3. ...scannable things which really start to speak about the accuracy with which students...
4. ...xactly. <SP>WARLOE:</SP> I'd like to speak as a teacher here. You want it two days...
5. ...te student that said that she wants to speak at the hearing. So our luncheon is a...
6. ...unhappy. And the teachers are afraid to speakbecause they're afraid they're going t...
7. ...this committee is responsible for can speak directly to a variety of things thats...
8. ...parents, including parents who do not speak English. Perhaps, the committee shou...
9. ...of students. I -- you know, I can't speak for it, but that's my clear feeling of...
10. ...at normal distribution. I'm going to speak for the students. And I don't know if y...
11. .... I think that's the thing I know I can speak from. When we linked the NAEP togeth...
12. ...I'm not going to go up there. I'll just speak here. But I do want to go over these i...
13. ...Ed, Marjorie, Kris, who would like to speak. I actually have their problems there f...
14. ...e things. <SP>SILVER:</SP> I want to speak in favor of this idea because, you know...
15. ...a committee recommendation, to have it speak in one voice, then Ithink we should g...
16. ...tI should mention is to make sure you speak into the mike because we don't have qu...
17. ...at the table. Again, I remind you to speak into the mike, but probably also to spe...
18. ...hearing out of the horse's mouth, so to speak, is a little bit different from what I...
19. ...t she agreed with Kris, but she didn't speak out. I was picking on her. You know,...
20. ...I'm hearing. And I want to say, to speak personally and not for the committee, t...
21. ...t oftentimes the parents possibly don't speak the dominant language. The child ma...
22. ...th the student, even though you do not speak the student's primary language. And...
23. ...th math. A lot of them are afraid to speak. They're afraid there's going to be rep...
24. ...LL:</SP> It's Ed. Ed, would you like to speak to the assumption issue? <SP>SILVER:...
25. ...ples in our specifications, we need to speak to that. <SP>SEELEY:</SP> When you e...
26. ...ou can -- that the committee itself can speak to and say that we think it's importan...
27. ...g that somebody like Bob Miesel has to speak to from a psychometric standpoint. I...
28. ...-- <SP>DOSSEY:</SP> Yes. Could you speak to the mike? <SP>PETIT:</SP>Do you w...
29. ...s two, if it's math equals, you want to speakto equity. If it's the Society for...
30. ...ome. And maybe, Wayne and David need to speak tothis. But if this is really an o...
31. ...And I think that Kris and Jim ought to speak to this as well. <SP>WARLOE:</SP> Th...
32. ...very pleased to have the opportunity to speak to this committee around the voluntary...
33. ...-- I need to be able as a district tospeak to my communities, describing who these...
34. ...t, maybe Holly or Naomi may be ableto speak to this, is that there are really two k...
35. ...nel. <SP>KIFER:</SP> I would like to speak to that, the issue of machine-scorable...
36. ...Joan. <SP>GRAMPP:</SP> I want to speak up from the teachers' perspective since...
37. .... The science teachers are afraid to speak up. The science teachers get pressure i...
38. ...science teachers get pressure if they speak up. And I think you will succeed. I...
39. ...edure, so that in our example items, we speak very clearly to what we feel is being...
40. ...of it, on the one hand, it can credibly speak with some authorityas representing th...
41. ...that, especially at the low end, we can speak with some surety about things they can...
42. ...eak into the mike, but probably also to speak with a little bit of volume in order t...
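Sorting concordance lines by the word following the search word can be sketched as follows in Python. The sketch assumes, as the ordering of the lines above suggests, that punctuation directly after the search word is skipped when the next word is identified. The function names and the context width of 30 characters are assumptions. Because the search matches whole words only, forms that are run together in the transcripts, such as speakbecause, would be missed.

```python
import re


def next_word(right_context):
    """Return the first word in the right context, lower-cased, skipping punctuation."""
    m = re.search(r"[A-Za-z']+", right_context)
    return m.group(0).lower() if m else ""


def sorted_right(text, word, width=30):
    """Return (left, right) contexts of word, sorted by the word that follows it."""
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = [
        (flat[max(0, m.start() - width):m.start()], flat[m.end():m.end() + width])
        for m in pattern.finditer(flat)
    ]
    return sorted(hits, key=lambda hit: next_word(hit[1]))
```

Grouping the sorted hits by next_word then gives a quick count of how often each preposition follows the search word.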

Text Types

CSPA can be divided into two main sub-corpora. The first sub-corpus is made up of press conference transcripts from the White House. These contain some policy statements by politicians and White House officials, but consist mainly of question and answer sessions. The second sub-corpus is a record of faculty meetings at UNC and Committee Meetings held at various locations around the country to discuss the creation of different kinds of national tests. In this second sub-corpus the interactions include questions, but also involve statements and discussion of issues.

In the following table I compare the two sub-corpora with each other and also with a written corpus based on a million words from the Independent newspaper.

                          White House               Meetings               Newspaper
Words                     0.9 million               1.1 million            1.1 million
Number of question marks  10,644                    6,579                  649
Most common verbs         think, know, said, say    think, know, get, say  said, ...
Most common pronouns      I, we, you, it, he, they  I, we, you, it, they   it, he, they, I, we, us
Words 14 chars or more    2965                      3646                   4369
Frequency of why          0.08%                     0.07%                  0.02%

This small selection of measures provides some sense of the two sub-corpora and how they compare with a written corpus. The frequency of I in the Meetings sub-corpus is 1.99%; in the White House sub-corpus it is 1.20%. In contrast, the most common pronoun, it, in the Newspaper corpus has a frequency of 0.69%. There are a variety of other dimensions that we could look at, but the ones presented in the table provide some indications regarding the text type of the two sub-corpora comprising CSPA.
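Measures of this kind are straightforward to compute. The Python sketch below shows one way to do so and is not the procedure used to produce the figures above: the tokenisation (runs of letters with internal apostrophes, so that I'm is a single token and is not counted as I), the exclusion of speaker labels, and the counting of long words as running tokens are all assumptions, so the results may differ somewhat from those reported.

```python
import re
from collections import Counter

SPEAKER_TAG = re.compile(r"<SP>[^<]*</SP>")
# Assumed tokenisation: runs of letters, with internal apostrophes kept.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def profile(text, long_word_length=14):
    """Return simple counts for a (sub-)corpus held in one string."""
    speech = SPEAKER_TAG.sub(" ", text)  # speaker labels are not utterances
    words = WORD.findall(speech)
    return {
        "words": len(words),
        "question_marks": speech.count("?"),
        "long_words": sum(1 for w in words if len(w) >= long_word_length),
        "counts": Counter(w.lower() for w in words),
    }


def percent(stats, word):
    """Relative frequency of word, as a percentage of all word tokens."""
    if stats["words"] == 0:
        return 0.0
    return 100.0 * stats["counts"][word.lower()] / stats["words"]
```

Running profile on each sub-corpus and then calling percent(stats, "why") or percent(stats, "I") gives figures that can be compared across sub-corpora in the same way as the percentages quoted above.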