Language resource #: 3330
-
C-001281: Switchboard Cellular Part 1 Transcription
*Introduction*
Switchboard Cellular Part 1 Transcription was developed by the Linguistic Data Consortium (LDC) and consists of transcripts of approximately 24 hours of English telephone conversations collected by LDC between 1999 and 2000. The corresponding audio files are contained in Switchboard Cellular Part 1 Transcribed Audio (LDC2001S15).
*Data*
This release consists of 250 talker pairs (250 speakers total) with one transcript (session) per talker pair, for a total of 250 conversations. The documentation in this release describes how calls were selected for transcription and the specification used to transcribe the audio files.
*Sample*
For an example transcript please click here.
*Updates*
There are no updates at this time.
- references: David Graff, Kevin Walker, and David Miller 2001 Switchboard Cellular Part 1 Transcription Linguistic Data Consortium, Philadelphia
- hasVersion: C-001279: Switchboard Cellular Part 1 Audio
- hasVersion: C-001280: Switchboard Cellular Part 1 Transcribed Audio
- hasVersion: C-001282: Switchboard Cellular Part 2 Audio
-
C-001282: Switchboard Cellular Part 2 Audio
*Introduction*
Switchboard Cellular Part 2 Audio was developed by the Linguistic Data Consortium (LDC) and consists of approximately 200 hours of English telephone conversations collected by LDC in 2000. The Switchboard cellular collection focused primarily on cellular phone technology of all service types. The goal was to recruit 200 subjects, balanced by gender, to participate in at least ten five- to six-minute conversations on cellular phones. The speech data was collected for research, development, and evaluation of automatic systems for speech-to-text conversion, talker identification, language identification and speech signal detection.
During the study period, LDC collected a total of 2,020 calls, or 4,040 sides (2,950 cellular), from 419 participants under varied environmental conditions. Of the 4,040 sides, 2,405 had female speakers and 1,635 had male speakers.
*Data*
This release contains speech data files with documentation describing speaker information (sex, age, education, city and state where raised), call information (date, time, call duration, Personal Identification Numbers, topic), and audit information (channel quality, background noise). The documentation also contains reports on clipped files.
Each speech file consists of a 1,024-byte ASCII-formatted Sphere header, followed by two-channel interleaved mu-law sample data. The mu-law samples represent the actual digital data transmission from the telephone service provider (MCI), as captured separately for each side of the telephone conversation by LDC's telephone collection platform. The header also indicates the caller_pin, callee_pin, topic_id, cellular service/handset information and speaker demographic information. The data files are not compressed.
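The file layout described above (a fixed-size ASCII Sphere header followed by two-channel interleaved one-byte mu-law samples) can be read with a few lines of code. The following is a minimal sketch, not LDC tooling: the function names are hypothetical, it assumes the usual Sphere header layout of one "name -type value" entry per line terminated by "end_head", and it leaves the samples as raw mu-law bytes without decoding them.

```python
import io


def read_sphere_header(stream):
    """Parse the ASCII Sphere header and leave the stream at the first sample.

    Returns (header_size, fields). Assumes the header starts with the
    'NIST_1A' magic line followed by a line holding the header size.
    """
    magic = stream.readline().strip()
    if magic != b"NIST_1A":
        raise ValueError("not a NIST Sphere file")
    header_size = int(stream.readline().decode("ascii").strip())

    # Re-read the whole header block so the stream ends up at the sample data.
    stream.seek(0)
    text = stream.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in text.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, kind, value = parts
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        else:  # "-sN" string fields
            fields[name] = value
    return header_size, fields


def split_channels(payload):
    """De-interleave two-channel, one-byte-per-sample data into two sides."""
    return payload[0::2], payload[1::2]


def demo():
    """Build a tiny synthetic file in memory and read it back."""
    header = (b"NIST_1A\n   1024\nchannel_count -i 2\n"
              b"sample_count -i 4\nend_head\n").ljust(1024, b" ")
    stream = io.BytesIO(header + bytes(range(8)))
    size, fields = read_sphere_header(stream)
    side_a, side_b = split_channels(stream.read())
    return size, fields, side_a, side_b
```

For a real file, open it in binary mode and pass the file object in place of the in-memory stream; header entries such as caller_pin or topic_id would then appear as keys in the returned dictionary.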
Other releases in this series include:
Switchboard Cellular Part 1 Audio (LDC2001S13)
Switchboard Cellular Part 1 Transcribed Audio (LDC2001S15)
Switchboard Cellular Part 1 Transcription (LDC2001T14)
*Sample*
Please examine this example audio file to review a sample of this corpus.
*Updates*
There are no updates available at this time.
- references: David Graff, Kevin Walker, and David Miller 2004 Switchboard Cellular Part 2 Audio Linguistic Data Consortium, Philadelphia
- hasVersion: C-001279: Switchboard Cellular Part 1 Audio
- hasVersion: C-001280: Switchboard Cellular Part 1 Transcribed Audio
- hasVersion: C-001281: Switchboard Cellular Part 1 Transcription
-
C-001283: Switchboard-1 Release 2
*Introduction*
The Switchboard-1 Telephone Speech Corpus (LDC97S62) was originally collected by Texas Instruments in 1990-91, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by the LDC in 1992-93. Since that release, a number of corrections have been made to the data files as presented on the original CD-ROM set, and all copies of the first pressing have been distributed.
Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven robot operator system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.
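The two selection constraints above can be expressed as a simple check. This is an illustrative sketch only, not the logic of the original robot operator: the function names, the use of string speaker identifiers, and the integer topic numbers are all assumptions made for the example.

```python
def can_pair(caller, callee, topic, past_pairs, topics_by_speaker):
    """Return True if a call satisfies both Switchboard selection constraints.

    past_pairs        -- set of frozenset({speaker_a, speaker_b}) already recorded
    topics_by_speaker -- dict mapping speaker id to the set of topics already used
    """
    if caller == callee:
        return False
    # Constraint 1: no two speakers converse together more than once.
    if frozenset((caller, callee)) in past_pairs:
        return False
    # Constraint 2: no one speaks more than once on a given topic.
    for speaker in (caller, callee):
        if topic in topics_by_speaker.get(speaker, set()):
            return False
    return True


def record_call(caller, callee, topic, past_pairs, topics_by_speaker):
    """Update the bookkeeping after a call has taken place."""
    past_pairs.add(frozenset((caller, callee)))
    for speaker in (caller, callee):
        topics_by_speaker.setdefault(speaker, set()).add(topic)
```

Storing each pair as a frozenset makes the check independent of who placed the call, which matches the wording of the first constraint.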
*Data*
In this release, assembled and published by the LDC, all known errors affecting the original publication of speech files were corrected. In addition, modifications have been made to the contents of the NIST Sphere headers of all speech files, to identify each file as being part of the new release and to make the usage of the sample_count header field consistent with standard Sphere usage. (In particular, the sample_count field should reflect the number of samples on each channel in the file. In the initial release, this field was improperly set to the total number of samples in both channels of the file; this has been corrected in the new release.)
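The difference between the two sample_count conventions is a factor equal to the number of channels. The sketch below shows the conversion and the resulting duration; the function names are hypothetical, and the 8,000 Hz rate in the usage note is an assumption typical of telephone speech rather than a figure stated in this entry.

```python
def per_channel_sample_count(total_samples, channel_count):
    """Convert a both-channels total into the per-channel count Sphere expects."""
    if total_samples % channel_count != 0:
        raise ValueError("total is not a multiple of the channel count")
    return total_samples // channel_count


def duration_seconds(sample_count, sample_rate):
    """Duration implied by a per-channel sample_count at the given rate."""
    return sample_count / sample_rate
```

For example, a two-channel file whose old header reported 4,800,000 samples holds 2,400,000 samples per channel, which would correspond to 300 seconds at an assumed 8,000 Hz. Reading the old value as a per-channel count would double the apparent duration.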
Since the 1997 release, the Switchboard transcripts have been carefully revised at The Institute for Signal and Information Processing (ISIP) and additional problems have been discovered and patched. Three speech files, part of the original release, were inadvertently left off the 1997 revision. After corpus users noted some problems in the original speaker attribution table, LDC audited the problem calls and corrected the attributions. The latest version of ISIP transcriptions, the ISIP update of the ICSI phonetic transcriptions, and corrected word alignments are all available at ISIP. The LDC makes the transcript summaries available via http. Researchers have used SWB-1 data for various annotation projects including discourse annotation/speech acts, part-of-speech tagging and parsing, up-to-date orthographic transcriptions, and phonetic transcriptions. This summary documents which files have been used for the various annotations. In addition to the index of these file characteristics, there is also a table detailing speaker attributes.
*Samples*
Please view this audio sample.
*Updates*
08/11/2015: The three files from the 03/26/2013 update were converted into unshortened sphere. File tables and documentation were updated to reflect the conversion of these files. The corpus is also now available as a web download. All copies of this corpus obtained after the above date include this update.
03/26/2013: Three previously missing files (sw02289.sph, sw04361.sph, sw04379.sph) were added to this release. File tables and documentation were updated to reflect the addition of these files. Please contact ldc@ldc.upenn.edu to obtain this update. All copies of this corpus obtained after the above date already include this update.
09/29/2011: Added a file list, available through online docs, to reflect its release on DVD. Also, an updated readme reflects these changes.
11/12/2007: Updated and corrected speaker and call tables are now available online in the corpus documentation directory at https://catalog.ldc.upenn.edu/docs/LDC97S62/
09/2008: The Switchboard Dialog Act Corpus is a version of Switchboard-1 Release 2 tagged with a shallow discourse tagset of approximately 60 basic dialog act tags and combinations. The discourse tag-set used is an augmentation of the Discourse Annotation and Markup System of Labeling (DAMSL) tag-set and is referred to as the SWBD-DAMSL labels. These annotations were created in 1997 at the University of Colorado at Boulder, with the goal of building better language models for automatic speech recognition of the Switchboard domain. To that end, the label-set incorporates both traditional sociolinguistic and discourse-theoretic rhetorical relations/adjacency-pairs as well as some more form-based models. This corpus contains labels for 1155 5-minute conversations comprising 205,000 utterances and 1.4 million words. The Switchboard Dialog Act Corpus is available as a free download via the online documentation folder.
- references: John J. Godfrey and Edward Holliman 1997 Switchboard-1 Release 2 Linguistic Data Consortium, Philadelphia
- hasVersion: C-001283: Switchboard-1 Release 2
-
C-001284: Switchboard-2 Phase I
*Introduction*
Switchboard-2 Phase I consists of 3,638 5-minute telephone conversations involving 657 participants. This corpus was collected by the Linguistic Data Consortium (LDC), in support of a project on Speaker Recognition sponsored by the U.S. Department of Defense. This release consists of speech files only; these calls were not transcribed.
*Data*
Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements) and personal contacts. Potential participants responded from all areas of the United States, although the majority of the subjects were from the Mid-Atlantic area: (PA=303), (NJ=116), (NY=53), (DE=13), (CT=12), (MD=14), (OH=13) and (MA=8). Most of the participants in SWB-2 Phase I were college students from the following universities: Penn State University, University of Delaware, University of Pennsylvania, Drexel University and Rutgers University. Of the 657 participants, 358 were female and 299 were male. An LDC recruiter asked all participants for the following demographic information: age, sex, years of completed education, country of birth, city and state where raised.
Each recruit was asked to participate in at least ten five-minute phone calls. Ideally each participant would receive five calls at a designated number and make five calls from phones with different telephone numbers (ANI codes). The average subject participated in 11 conversations; however, one gentleman participated in 64 calls. A suggested topic of discussion was given (read by the automated operator), although participants could chat about whatever they preferred.
Each of the 657 participants placed their calls via a toll-free robot operator maintained by the LDC. Access to the robot operator was possible via a unique Personal Identification Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in the project.
Upon conclusion of the study all calls were audited by LDC staff members. Particular attention was paid to PIN verification (matching speaker with PIN), checking call duration and call quality. Upon completion of this process checks were issued and mailed to participants.
*Updates*
09/29/2011: A file list and updated readme were added to reflect the data set's release on DVD.
- references: David Graff, Alexandra Canavan, and George Zipperlen 1998 Switchboard-2 Phase I Linguistic Data Consortium, Philadelphia
- hasVersion: C-000738: Switchboard-2 Phase II
- hasVersion: C-001285: Switchboard-2 Phase III Audio
-
C-001285: Switchboard-2 Phase III Audio
*Introduction*
The Switchboard-2 Phase III Audio corpus was produced by the Linguistic Data Consortium; catalog number LDC2002S06 and ISBN number 1-58563-222-8. This release contains speech data files ONLY, along with documentation describing speaker information (sex, age, education, city and state where raised), call information (date, time, call duration, Personal Identification Numbers, topic), and audit information (channel quality, background noise). The data files are not compressed.
The Switchboard-2 Phase III collection focused primarily on the American South. The collection commenced on October 21, 1997 and was completed on January 1, 1998. The project's goal was to recruit native speakers of English in the American South, balanced by gender, to participate in at least ten five- to six-minute conversations on a variety of telephone (land line) handsets.
*Data*
The speech data was collected for research, development, and evaluation of automatic systems for speech-to-text conversion, talker identification, language identification and speech signal detection purposes.
During the collection period, the LDC collected a total of 2,728 calls, or 5,456 sides, from 640 participants (292 Male, 348 Female), under varied environmental conditions.
Each speech file consists of a 1,024-byte ASCII-formatted Sphere header, followed by two-channel interleaved mu-law sample data. The mu-law samples represent the actual digital data transmission from the telephone service provider (MCI), as captured separately for each side of the telephone conversation by the LDC's telephone collection platform. The header also indicates the caller_pin, callee_pin, and topic_id.
The speech files are named according to the following pattern:
sw_NNNNN.sph
where the five-digit string "NNNNN" represents the conversation-id; this string is used to identify all speech files and to identify the calls in the associated database tables that provide information about the calls and participants (e.g. callstat.tbl, master.tbl).
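The naming pattern can be matched with a short regular expression to recover the conversation-id from a file name. This is a minimal sketch; the function name is hypothetical, and the id is kept as a string so that any leading zeros survive.

```python
import re

# sw_NNNNN.sph, where NNNNN is the five-digit conversation-id.
_SPEECH_FILE = re.compile(r"^sw_(\d{5})\.sph$")


def conversation_id(filename):
    """Return the five-digit conversation-id, or None if the name does not match."""
    match = _SPEECH_FILE.match(filename)
    return match.group(1) if match else None
```

The returned string can then be used as the lookup key when joining a speech file against the call and participant tables.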
Other documentation files available on the publication are:
0readme.1st -- Field information for all database tables
swb_callaudit.tbl -- Audit results for each channel
swb_callaudit.txt -- Document describing audit table
swb_callstats.tbl -- Information about recorded calls
swb_callstats.txt -- Document describing callstats table
swb_callsubjects.tbl -- Demographic information
swb_callsubjects.txt -- Document describing callsubjects table
topics.txt -- List of proposed call topics
There are a total of 2,657 data files (approximately 222 hours of audio).
*Updates*
No updates are available at this time.
- references: David Graff, David Miller, and Kevin Walker 2002 Switchboard-2 Phase III Audio Linguistic Data Consortium, Philadelphia
- hasVersion: C-001284: Switchboard-2 Phase I
- hasVersion: C-000738: Switchboard-2 Phase II
-
C-001286: Syllable-Final /s/ Lenition
*Introduction*
This publication, produced by the Linguistic Data Consortium (LDC), represents a study of lenition of syllable-final /s/ in Latin American Spanish. The data used in this study came from three other LDC corpora: the CALLHOME Spanish Speech corpus, the CALLHOME Spanish Transcripts, and the CALLHOME Spanish Lexicon. Syllable-final /s/ is subject to lenition in many Latin American Spanish dialects. Lenition of -/s/ is a variable phonological process in which an -/s/ may be aspirated (pronounced [h]) or deleted altogether. Lenition of -/s/ has been widely studied by sociolinguists, who have identified various linguistic and extralinguistic factors that favor the process. Since syllable-final /s/ is frequent in Spanish, lenition has a great effect on overall pronunciation.
*Data*
Please see file.tbl for the directory structure of this publication, as well as a complete list of files. The primary data file consists of data stored in the following fields:
* Token id
* Code
* Confidence level
* Speaker id
* Header of the line in the transcript
* Words from the transcript
* Location of word in the speaker's turn
* Location of /s/ in the word
* Preceding segment
* Following segment
* Word stress pattern
* Following word stress pattern
* Word start time
* Word end time
* Length of pause following word
* Coder
* Speaker's dialect
* Speaker's sex
* Speaker's age
* Corrected following word
* Comment
* Morphological information
Approximately 3,000 to 4,000 occurrences of syllable-final /s/ are missing from the encodings. These omissions occur for two main reasons: changes in the transcriptions after the list of all syllable-final /s/ occurrences was generated, and the failure of some transcript lines to be automatically aligned.
For a more detailed description of this publication see the researcher's description in HTML or Microsoft Word format.
- references: Michelle A. Fox 2001 Syllable-Final /s/ Lenition Linguistic Data Consortium, Philadelphia
- isPartOf: G-000663: CALLHOME Spanish Lexicon
- isPartOf: C-000665: CALLHOME Spanish Transcripts
- isPartOf: C-000664: CALLHOME Spanish Speech
-
C-001287: TDT Pilot Study Corpus
*Introduction*
The TDT Pilot Study corpus was created to support an initiative in "topic detection and tracking." This initiative is directed toward computer processing of language data, both text and speech. The objective is to explore techniques for detecting the appearance of new and unexpected topics and for tracking their reappearance and evolution.
*Data*
The TDT corpus comprises a set of stories that includes both newswire (text) and broadcast news (speech). Each story is represented as a stream of text, in which the text is either taken directly from the newswire (Reuters) or is a manual transcription of the broadcast news speech (CNN). The corpus spans the period from July 1, 1994 to June 30, 1995. It contains approximately 16,000 stories, with about half taken from Reuters newswire and half from CNN broadcast news transcripts.
A key part of the corpus is its annotation in terms of the events discussed in the stories. Twenty-five events were defined that span a variety of event types and that cover a subset of the events discussed in the corpus stories. Annotation data for these events are included in the corpus and provide a basis for training TDT systems.
*Updates*
There are no updates at this time.
- references: James Allan, et al. 1998 TDT Pilot Study Corpus Linguistic Data Consortium, Philadelphia
- hasVersion: C-001288: TDT2 Careful Transcription Audio
- hasVersion: C-001289: TDT2 Careful Transcription Text
- hasVersion: C-001290: TDT2 English Audio
- hasVersion: C-001291: TDT2 Mandarin Audio Corpus
- hasVersion: C-001292: TDT2 Multilanguage Text Version 4.0
- hasVersion: C-001293: TDT3 English Audio
- hasVersion: C-001294: TDT3 Mandarin Audio
- hasVersion: C-001295: TDT3 Multilanguage Text Version 2.0
- hasVersion: C-001296: TDT4 Multilingual Broadcast News Speech Corpus
- hasVersion: C-001297: TDT4 Multilingual Text and Annotations
- hasVersion: C-001298: TDT5 Multilingual Text
- hasVersion: C-001299: TDT5 Topics and Annotations
-
C-001288: TDT2 Careful Transcription Audio
*Introduction*
This file contains documentation on the Topic Detection and Tracking (TDT) 2 Careful Transcription Audio Corpus, Linguistic Data Consortium (LDC) catalog number LDC2000S92 and ISBN 1-58563-167-1. This corpus contains the recordings of broadcast news audio. The transcriptions to these recordings are available in the Topic Detection and Tracking (TDT) 2 Careful Transcription Text Corpus, Linguistic Data Consortium (LDC) catalog number LDC2000T44 and ISBN 1-58563-166-3.
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection), and track the reoccurrence of old or new events (tracking). For further information on TDT2 please visit our TDT2 Information Pages.
*Data*
This publication contains 1998 broadcasts from the following sources: ABC News, Cable News Network (CNN), Public Radio International (PRI) and Voice of America (VOA).
*Samples*
For an example of the data in this corpus, please review this audio sample.
*Updates*
There are no updates at this time.
- references: John Garofalo and David Graff 2000 TDT2 Careful Transcription Audio Linguistic Data Consortium, Philadelphia
- hasVersion: C-001287: TDT Pilot Study Corpus
- hasVersion: C-001289: TDT2 Careful Transcription Text
- hasVersion: C-001290: TDT2 English Audio
- hasVersion: C-001291: TDT2 Mandarin Audio Corpus
- hasVersion: C-001292: TDT2 Multilanguage Text Version 4.0
- hasVersion: C-001293: TDT3 English Audio
- hasVersion: C-001294: TDT3 Mandarin Audio
- hasVersion: C-001295: TDT3 Multilanguage Text Version 2.0
- hasVersion: C-001296: TDT4 Multilingual Broadcast News Speech Corpus
- hasVersion: C-001297: TDT4 Multilingual Text and Annotations
- hasVersion: C-001298: TDT5 Multilingual Text
- hasVersion: C-001299: TDT5 Topics and Annotations
-
C-001289: TDT2 Careful Transcription Text
*Introduction*
TDT2 (Topic Detection and Tracking) Careful Transcription was developed by the Linguistic Data Consortium (LDC) and contains transcripts of English broadcast news audio recordings collected by LDC in 1998. The corresponding audio data is available in TDT2 Careful Transcription Audio LDC2000S92.
Topic Detection and Tracking refers to automatic techniques for finding topically-related material in streams of data such as newswire and broadcast news. This corpus was created to support three TDT2 tasks: to find topically homogeneous sections (segmentation), to detect the occurrence of new events (detection) and to track the reoccurrence of old or new events (tracking).
*Data*
The broadcast data was collected from the following sources: ABC News, Cable News Network, Public Radio International and Voice of America.
Please look at this sample transcript.
*Updates*
There are no updates at this time.
- references: Stephanie Strassel and Nii Martey 2000 TDT2 Careful Transcription Text Linguistic Data Consortium, Philadelphia
- hasVersion: C-001287: TDT Pilot Study Corpus
- hasVersion: C-001288: TDT2 Careful Transcription Audio
- hasVersion: C-001290: TDT2 English Audio
- hasVersion: C-001291: TDT2 Mandarin Audio Corpus
- hasVersion: C-001292: TDT2 Multilanguage Text Version 4.0
- hasVersion: C-001293: TDT3 English Audio
- hasVersion: C-001294: TDT3 Mandarin Audio
- hasVersion: C-001295: TDT3 Multilanguage Text Version 2.0
- hasVersion: C-001296: TDT4 Multilingual Broadcast News Speech Corpus
- hasVersion: C-001297: TDT4 Multilingual Text and Annotations
- hasVersion: C-001298: TDT5 Multilingual Text
- hasVersion: C-001299: TDT5 Topics and Annotations
-
C-001290: TDT2 English Audio
PRICING:
* 1999 Commercial Members: $0
* 1999 Non-Profit Members: $1,460
* Non-Members: $14,600
*Introduction*
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection), and track the reoccurrence of old or new events (tracking).
*Data*
The TDT2 Audio Corpus contains a total of 1,036 waveform files. Each file is a complete single-channel recording of a 30- or 60-minute broadcast, digitized at a sample rate of 16 kHz using 16-bit samples.
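The stated recording parameters imply a fixed amount of uncompressed sample data per broadcast. The sketch below works through that arithmetic; the function name is hypothetical, and the result counts raw samples only, ignoring any file header or compression that the distributed files may use.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, as stated for this corpus
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # single-channel recordings


def raw_pcm_bytes(minutes):
    """Uncompressed sample data, in bytes, for a broadcast of the given length."""
    seconds = minutes * 60
    return seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
```

Under these assumptions a 30-minute broadcast comes to 57,600,000 bytes of sample data and a 60-minute broadcast to 115,200,000 bytes.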
The four broadcast sources represented in the corpus, with their format and programming frequency, are as follows:
ABC World News Tonight -- "traditional" network news, 30 minutes/day
CNN Headline News -- continuous news summaries, up to 4 30-minute samples/day
PRI The World -- "in-depth" radio news, 60 minutes/weekday
VOA -- varied 60-minute news programs, up to 2/day
*Updates*
There are no updates at this time.
- references: David Graff 1999 TDT2 English Audio Linguistic Data Consortium, Philadelphia
- hasVersion: C-001287: TDT Pilot Study Corpus
- hasVersion: C-001288: TDT2 Careful Transcription Audio
- hasVersion: C-001289: TDT2 Careful Transcription Text
- hasVersion: C-001291: TDT2 Mandarin Audio Corpus
- hasVersion: C-001292: TDT2 Multilanguage Text Version 4.0
- hasVersion: C-001293: TDT3 English Audio
- hasVersion: C-001294: TDT3 Mandarin Audio
- hasVersion: C-001295: TDT3 Multilanguage Text Version 2.0
- hasVersion: C-001296: TDT4 Multilingual Broadcast News Speech Corpus
- hasVersion: C-001297: TDT4 Multilingual Text and Annotations
- hasVersion: C-001298: TDT5 Multilingual Text
- hasVersion: C-001299: TDT5 Topics and Annotations