-
C-000591: 2002 NIST Speaker Recognition Evaluation
*Introduction*
The 2002 NIST Speaker Recognition Evaluation corpus was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004S04 and ISBN 1-58563-293-7.
The 2002 NIST Speaker Recognition Evaluation is part of an ongoing series of yearly evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition. To this end, the evaluation was designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible.
The 2002 NIST Speaker Recognition Evaluation main data was extracted from the Switchboard Cellular Part 2. The extended data task used two phases of Switchboard II, Phases 2 and 3. This evaluation also included the first multi-modal task, using data from the FBI voice database.
Supporting documentation for this evaluation may be found on the 2002 NIST Speaker Recognition Evaluation website. Please consult the NIST evaluation plan for detailed instructions on using this evaluation material.
*Data*
There are a total of 9,153 speech files (6,098 at 8 kHz and 3,055 at 16 kHz), all of which are in SPHERE format, for a total of ~156 hours.
The data was initially distributed by NIST on 13 CD-ROMs (r81_1_1 through r81_13_1). This corpus consists of training and test data and replicates exactly the content and structure of the 13 CD-ROMs.
*Updates*
There are no updates available at this time.
- references: C-001282: Switchboard Cellular Part 2 Audio
- references: C-000738: Switchboard-2 Phase II
- references: Switchboard-2 Phase III
- references: Forensic Voice Database for Automated Speaker Recognition (FBI Voice Database)
- references: CALLHOME American English
- references: CALLFRIEND American English
- hasVersion: 1996 Speaker Recognition Benchmark
- hasVersion: C-000575: 1997 Speaker Recognition Benchmark
- hasVersion: C-000579: 1998 Speaker Recognition Benchmark
- hasVersion: C-000581: 1999 Speaker Recognition Benchmark
- hasVersion: C-000590: 2001 NIST Speaker Recognition Evaluation Corpus
- hasVersion: C-000584: 2000 NIST Speaker Recognition Evaluation
- hasVersion: C-001249: 2004 NIST Speaker Recognition Evaluation
- isReferencedBy: (Online Documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC2004S04/
- isReferencedBy: Alvin Martin and Mark Przybocki 2004 2002 NIST Speaker Recognition Evaluation Linguistic Data Consortium, Philadelphia
-
C-000592: 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
*Introduction*
2002 Rich Transcription Broadcast News and Conversational Telephone Speech was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004S11 and ISBN 1-58563-311-9.
This corpus contains the test material used in the 2002 Rich Transcription (RT-02) Evaluation of Broadcast News and Conversational Telephone Speech, administered by the NIST Speech Group in the Spring of 2002. The RT-02 Meeting Recognition Evaluation material is available in a separate distribution. For complete up-to-date information, see the RT-02 Evaluation Website.
The RT-02 Evaluation supported two main evaluation tasks:
* Speech-To-Text (STT) Tasks -- included three processing speeds (1x real time, 10x real time, and unlimited time) for both the Broadcast News (BN) and Conversational Telephone Speech (CTS) domains.
* Metadata Extraction (MDE) Task -- consisted of a speaker diarization task for the BN and CTS domains.
*Data*
This distribution of the RT-02 Evaluation Data contains only Broadcast News and Conversational Telephone Speech data. Meeting data used in the RT-02 Evaluation is not included in this distribution and is packaged in a separate distribution. All recordings are in English.
The BN data is composed of six approximately 10-minute excerpts from six different broadcasts. Each waveform is a SPHERE-headered, single-channel, 16-bit PCM file. The broadcasts were selected from programs from MNB, PRI, NBC, CNN, VOA and ABC, all collected in 1998. The evaluation excerpts were transcribed to the nearest story boundary.
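As a rough illustration of what "SPHERE-headered" means in practice, the sketch below parses a NIST SPHERE header into a dictionary. It assumes the commonly documented layout: a `NIST_1A` magic line, a line giving the header size in bytes, then `name -type value` fields terminated by `end_head`. The function name `parse_sphere_header` and the synthetic header bytes are illustrative only and are not taken from this corpus.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a SPHERE file (sketch)."""
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("missing NIST_1A magic line")
    # The second 8-byte line holds the total header size, e.g. "   1024\n".
    header_size = int(data[8:16].strip())
    fields = {}
    text = data[16:header_size].decode("ascii", errors="replace")
    for line in text.splitlines():
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # string field, e.g. "-s3"
            fields[name] = value
    return fields


# Synthetic header for illustration only (hypothetical values).
example = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024, b" ")

header = parse_sphere_header(example)
print(header)
```

The audio samples start immediately after the header, at the byte offset given by the header size line.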
The CTS data is composed of 60 approximately five-minute excerpts from 60 different conversations: 20 from Switchboard-1 data, 20 from Switchboard-2 data, and 20 from Switchboard Cellular-2 data. Evaluation excerpts were transcribed to the nearest turn. Unlike the BN audio files where the full broadcasts were provided, the CTS audio files contain only the evaluation excerpts. Each audio excerpt is a SPHERE-headered, two channel interleaved 8-bit mulaw file.
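A minimal sketch of handling two-channel interleaved 8-bit mu-law audio is given below. It uses the standard G.711 mu-law expansion and assumes that the payload after the SPHERE header alternates one byte per channel (channel A, channel B, channel A, ...). The helper names and the four sample bytes are hypothetical.

```python
def mulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed linear PCM sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_channels(payload: bytes):
    """De-interleave a two-channel mu-law payload into two PCM lists."""
    channel_a = [mulaw_to_linear(b) for b in payload[0::2]]
    channel_b = [mulaw_to_linear(b) for b in payload[1::2]]
    return channel_a, channel_b


# Illustrative payload: A0, B0, A1, B1 (not real corpus data).
payload = bytes([0xFF, 0x00, 0x80, 0x7F])
channel_a, channel_b = split_channels(payload)
print(channel_a, channel_b)
```

Keeping the two sides of the conversation as separate lists makes it straightforward to process each speaker independently.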
The reference transcripts are also provided in this corpus. The official format for STT reference data is STM (files with the extension 'stm'), while the official format for MDE reference data is RTTM (files with the extension 'rttm'). Files with the extensions 'txt' or 'utf' are the original reference transcripts before any format conversions or added annotations, and are included for completeness.
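For readers unfamiliar with STM, the sketch below parses one segment line. It assumes the field order commonly documented for NIST scoring tools (file, channel, speaker, begin time, end time, optional `<...>` label, transcript) and treats lines starting with `;;` as comments. The exact conventions used in this corpus should be checked against its documentation; the example line and the names `StmSegment` and `parse_stm_line` are hypothetical.

```python
from typing import NamedTuple, Optional


class StmSegment(NamedTuple):
    filename: str
    channel: str
    speaker: str
    begin: float
    end: float
    label: Optional[str]
    transcript: str


def parse_stm_line(line: str) -> Optional[StmSegment]:
    """Parse one STM line; return None for blank or comment lines."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    parts = line.split(None, 5)
    if len(parts) < 5:
        raise ValueError(f"too few fields: {line!r}")
    filename, channel, speaker, begin, end = parts[:5]
    rest = parts[5] if len(parts) > 5 else ""
    label = None
    if rest.startswith("<"):
        # Optional label field such as "<o,f0,male>" precedes the words.
        label, _, rest = rest.partition(">")
        label += ">"
        rest = rest.strip()
    return StmSegment(filename, channel, speaker,
                      float(begin), float(end), label, rest)


# Hypothetical example line, not taken from the corpus.
segment = parse_stm_line(
    "sw_12345 A sw_12345_A 10.25 14.80 <o,f0,male> hello how are you"
)
print(segment)
```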
*Samples*
Please examine this example to review a sample of this corpus.
*Updates*
There are no updates available at this time.
The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
- references: C-001283: Switchboard-1 Release 2
- references: C-001285: Switchboard-2 Phase III Audio
- references: C-001282: Switchboard Cellular Part 2 Audio
- references: HUB4 Broadcast News Evaluation English Test Material (not sure of which year)
- hasVersion: C-003109: 2003 NIST Rich Transcription Evaluation Data
- hasVersion: C-003110: 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data
- isReferencedBy: (Online Documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC2004S11/
- isReferencedBy: RT-2002 Evaluation Plan (Version 1.0): http://www.nist.gov/speech/tests/rt/rt2002/docs/rt02_eval_plan_v3.pdf
- isReferencedBy: John S. Garofolo, Jonathan Fiscus, and Audrey Le 2004 2002 Rich Transcription Broadcast News and Conversational Telephone Speech Linguistic Data Consortium, Philadelphia
-
C-000593: ACE 2004 Multilingual Training Corpus
*Introduction*
This file contains documentation on the ACE 2004 Multilingual Training Corpus, Linguistic Data Consortium (LDC) catalog number LDC2005T09 and ISBN 1-58563-334-8.
This publication contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations and was created by Linguistic Data Consortium with support from the ACE Program, with additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. This data was previously distributed as an e-corpus (LDC2004E17) to participants in the 2004 ACE evaluation.
The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.
The current publication consists of the official training data for these evaluation tasks. A seventh evaluation area, Timex Detection and Recognition, is supported by the ACE Time Normalization (TERN) 2004 English Training Data Corpus (LDC2005T07). The TERN corpus source data largely overlaps with the English source data contained in the current release.
A complete description of the ACE 2004 Evaluation can be found on the ACE Program website maintained by the National Institute of Standards and Technology (NIST): http://www.nist.gov/speech/tests/ace/
For more information about linguistic resources for the ACE program, including annotation guidelines, task definitions, free annotation tools and other documentation, please visit LDC's ACE website.
*Samples*
The files listed below are samples from the English data. They should provide a good example of the material in this corpus.
* Chinese Treebank
* Fisher Transcripts
* Broadcast News
- replaces: ACE/TIDES Extraction 2004 Training Data V1.1 (LDC2004E17)
- hasVersion: C-000594: ACE 2005 Multilingual Training Corpus
- references: C-000595: ACE Time Normalization (TERN) 2004 English Training Data v 1.0
- hasVersion: C-001301: TIDES Extraction (ACE) 2003 Multilingual Training Data
- isReferencedBy: (On-line documentation): http://www.ldc.upenn.edu/Catalog/docs/LDC2005T09/
- isReferencedBy: Alexis Mitchell, et al. 2005 ACE 2004 Multilingual Training Corpus Linguistic Data Consortium, Philadelphia
-
C-000594: ACE 2005 Multilingual Training Corpus
*Introduction*
This publication contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events, and was created by the Linguistic Data Consortium with support from the ACE Program, with additional assistance from LDC. This data was previously distributed as an e-corpus (LDC2005E18) to participants in the 2005 ACE evaluation.
The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form.
In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation and event mention detection were also offered as diagnostic tasks. All tasks with the exception of event tasks were performed for three languages, English, Chinese and Arabic. Events tasks were evaluated in English and Chinese only. The current publication comprises the official training data for these evaluation tasks.
A complete description of the ACE 2005 Evaluation can be found on the ACE Program website maintained by the National Institute of Standards and Technology (NIST).
For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions, free annotation tools and other documentation, please visit LDC's ACE website.
Below is information about the amount of data included in the current release and its annotation status.
* 1P: data subject to first pass (complete) annotation
* DUAL: data also subject to dual first pass (complete) annotation
* ADJ: data also subject to discrepancy resolution/adjudication
* NORM: data also subject to TIMEX2 normalization
English
Source  Words 1P  Words DUAL  Words ADJ  Words NORM  Files 1P  Files DUAL  Files ADJ  Files NORM
NW      60658     57807       33459      48399       128       124         81         106
BN      59239     58144       52444      55967       239       234         217        226
BC      46612     46110       33874      40415       68        67          52         60
WL      45210     43648       35529      37897       127       122         114        119
UN      45161     44473       26371      37366       58        57          37         49
CTS     47003     47003       34868      39845       46        46          34         39
Total   303833    297185      216545     259889      666       650         535        599
Chinese
Note: Chinese data is expressed in terms of characters. We assume a correspondence of roughly 1.5 characters/word.
Source  Chars 1P  Chars DUAL  Chars ADJ  Files 1P  Files DUAL  Files ADJ
NW      127319    124175      121797     248       242         238
BN      134963    133696      120513     332       328         298
WL      71839     68063       65681      107       101         97
Total   334121    325834      307991     687       671         633
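To compare the Chinese figures with the word counts given for the other languages, the roughly 1.5 characters/word correspondence noted for the Chinese data can be applied directly. The sketch below does this for the first-pass (1P) character counts; the resulting word counts are estimates only, and the variable names are illustrative.

```python
CHARS_PER_WORD = 1.5  # approximate ratio stated in the corpus notes

# First-pass (1P) character counts for the Chinese data, by source.
chinese_chars_1p = {"NW": 127319, "BN": 134963, "WL": 71839}

# Estimated word counts, rounded to the nearest whole word.
approx_words = {
    source: round(chars / CHARS_PER_WORD)
    for source, chars in chinese_chars_1p.items()
}
approx_total = round(sum(chinese_chars_1p.values()) / CHARS_PER_WORD)

for source, words in approx_words.items():
    print(f"{source}: ~{words} words")
print(f"Total: ~{approx_total} words")
```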
Arabic
Source  Words 1P  Words DUAL  Words ADJ  Files 1P  Files DUAL  Files ADJ
NW      61287     56158       53026      239       226         221
BN      29259     27165       26907      134       128         127
WL      21687     20181       20181      60        55          55
Total   112233    103504      100114     433       409         403
*Samples*
For examples of the data in this publication, please review the following samples:
* English
* Arabic
* Chinese
- replaces: ACE 2005 Multilingual Training Data V2.0 (LDC2005E18)
- references: C-001297: TDT4 Multilingual Text and Annotations
- references: C-001407: English Gigaword Second Edition
- references: EARS Fisher 2004 Telephone Speech Collection Supplement
- hasVersion: C-000593: ACE 2004 Multilingual Training Corpus
- hasVersion: C-001301: TIDES Extraction (ACE) 2003 Multilingual Training Data
- isReferencedBy: (On-line documentation): http://www.ldc.upenn.edu/Catalog/docs/LDC2006T06/
- isReferencedBy: Christopher Walker, et al. 2006 ACE 2005 Multilingual Training Corpus Linguistic Data Consortium, Philadelphia
-
C-000595: ACE Time Normalization (TERN) 2004 English Training Data v 1.0
*Introduction*
This file contains documentation on the ACE Time Normalization (TERN) 2004 English Training Data v 1.0, Linguistic Data Consortium (LDC) catalog number LDC2005T07 and ISBN 1-58563-331-3.
This release contains the English training data prepared for the 2004 Time Expression Recognition and Normalization (TERN) Evaluation, sponsored by the Automatic Content Extraction (ACE) program. The evaluation was held in August 2004, followed by a workshop in September 2004. Evaluation participants received this data for training purposes, and it is now being released for general use.
The annotation specifications for this corpus were developed under DARPA's Translingual Information Detection, Extraction and Summarization (TIDES) program, with continuing support from ACE.
The purpose of this corpus and the TERN evaluation is to advance the state of the art in the automatic recognition and normalization of natural language temporal expressions. In most language contexts such expressions are indexical. For example, with "Monday," "last week," or "three months starting October 1," one must know the narrative reference time in order to pinpoint the time interval being conveyed by the expression. In addition, for data exchange purposes, it is essential that the identified interval be rendered according to an established standard, i.e., normalized. Accurate identification and normalization of temporal expressions is in turn essential for the temporal reasoning being demanded by advanced NLP applications such as question answering, information extraction, and summarization.
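The dependence on a reference time can be made concrete with a small sketch. The function below resolves two of the expressions mentioned above against a given reference date and renders the result in ISO 8601 style. It is not an implementation of the corpus annotation specification: the function name is hypothetical, and the rule that a bare weekday name means the most recent such day on or before the reference date is an assumption made purely for illustration.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize(expression: str, reference: date) -> str:
    """Resolve a simple indexical time expression against a reference date."""
    text = expression.strip().lower()
    if text == "last week":
        # ISO week containing the day seven days before the reference.
        year, week, _ = (reference - timedelta(weeks=1)).isocalendar()
        return f"{year}-W{week:02d}"
    if text in WEEKDAYS:
        # Assumption: pick the most recent matching day, including today.
        offset = (reference.weekday() - WEEKDAYS.index(text)) % 7
        return (reference - timedelta(days=offset)).isoformat()
    raise ValueError(f"unsupported expression: {expression!r}")


reference_time = date(2004, 8, 18)  # a Wednesday, chosen for illustration
last_week = normalize("last week", reference_time)
monday = normalize("Monday", reference_time)
print(last_week, monday)
```

The same expression yields a different normalized value for every reference date, which is why the reference time must be known before normalization.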
*Samples*
Please examine this sample to see an example of the corpus.
*Updates*
Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2005T07.
- references: Topic Detection and Tracking - Phase 4 (TDT4)
- references: Arabic Treebank: Part 1
- references: Chinese Treebank English Parallel Text Corpus
- isReferencedBy: C-000593: ACE 2004 Multilingual Training Corpus
- isReferencedBy: "ACE Time Normalization (TERN) 2004 English Training Data V1.0"(http://timex2.mitre.org/corpora/README_TERN_English_Data.txt)
- isReferencedBy: Lisa Ferro, et al. 2005 ACE Time Normalization (TERN) 2004 English Training Data v 1.0 Linguistic Data Consortium, Philadelphia
-
C-000596: ACE-2 Version 1.0
*Introduction*
ACE-2 Version 1.0 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003T11 and ISBN 1-58563-270-8.
This release contains Version 1.0 of the ACE-2 corpus, created and distributed by the LDC to support the Automatic Content Extraction (ACE) program. The objective of the ACE program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from ASR and OCR). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data. Thus the ACE program requires the development of technologies that automatically detect and characterize this meaning. The ACE research objectives are viewed as the detection and characterization of Entities, Relations, and Events. There are three main ACE tasks: Entity Detection and Tracking, Relation Detection and Characterization, and Event Detection and Characterization.
Annotations for the ACE-2 corpus were produced by Linguistic Data Consortium to support the following two research tasks: Entity Detection and Tracking (EDT) and Relation Detection and Characterization (RDC).
For information regarding the ACE program and ACE technology evaluations administered by the National Institute of Standards and Technology (NIST), please visit the NIST website.
For information about ACE annotation and ongoing ACE corpus development, including annotation guidelines, task definitions, annotation tools and other project documentation, please visit the ACE Project page at the LDC.
*Data*
This publication contains two sets of data: training and devtest. Each of these sets is further divided by source: broadcast news, newspaper, and newswire.
The training set contains data originally developed as training material for the February 2002 evaluation and used again for the September 2002 evaluation. The devtest set contains data originally developed as test data for the February 2002 evaluation and later used as devtest data for the September 2002 evaluation.
The broadcast and newswire source data is drawn from a subset of the TDT2 Multilanguage Text Version 4.0 (LDC2001T57); this has been supplemented with additional newspaper data from the Washington Post. A portion of the training broadcast data was drawn from the 1997 English Broadcast News Transcripts (HUB4) corpus (LDC98T28).
All material comes from the first half of 1998. The sources for the broadcast, newswire, and newspaper data are listed below.
Newswire
* New York Times Newswire Service (NYT)
* Associated Press Worldstream Service (APW)
Broadcast News
* Cable News Network, "Headline News" (CNN for TDT2, ed for Hub-4)
* American Broadcasting Co., "World News Tonight" (ABC for TDT2, ea for Hub-4)
* Public Radio International, "The World" (PRI)
* Voice of America, English news programs (VOA)
* MSNBC, "The News With Brian Williams" (MNB)
* National Broadcasting Company, "Nightly News" (NBC)
Newspaper
* Washington Post (WAP)
This publication includes the source data files in .sgm format, the annotation files in ACE Pilot Format (APF), supporting documentation, and version 2.0.1 of the ACE DTD, which was used for the September 2002 ACE Evaluation.
There are 179,007 words of source data, or 519 files, broken down as follows:
Source  # Words train  # Words devtest  # Files train  # Files devtest
NYT     32892          7487             48             9
APW     29144          7037             82             20
CNN     2290           2653             69             11
ABC     1588           2687             24             10
PRI     1272           5284             43             9
VOA     594            2611             24             7
MNB     0              2539             0              6
NBC     0              2633             0              8
WAP     60247          15070            76             17
ea      2019           0                31             0
ed      1094           0                25             0
Total   131023         47984            422            97
*Updates*
There are no updates available at this time.
- references: C-001292: TDT2 Multilanguage Text Version 4.0
- references: C-000564: 1997 English Broadcast News Transcripts (HUB4)
- isReferencedBy: (On-line documentation): http://www.ldc.upenn.edu/Catalog/docs/LDC2003T11/
- isReferencedBy: Alexis Mitchell, et al. 2003 ACE-2 Version 1.0 Linguistic Data Consortium, Philadelphia
-
C-000597: ACL/DCI
ACL Data Collection Initiative contains text from the Wall Street Journal, the Collins English Dictionary, scientific abstracts provided by the U.S. Department of Energy and a variety of grammatically tagged and parsed materials from the Treebank project at the University of Pennsylvania. The total amount of uncompressed text is 620 Mbytes.
The many formats of the original texts have been mapped into a markup language consistent with the SGML standard (ISO 8879).
The format of the material from the Wall Street Journal uses a labelled bracketing, expressed in the style of SGML, although no formal SGML DTD is provided. The tag set has been modified by turning the Dow Jones header categories into tags and by creating ad hoc tags such as "". The original datelines are presented as separate text units; the text is divided and tagged into paragraphs and sentences with each sentence presented on a single line. Nothing has been done to modify the typographical methods used to subdivide headlines and stories into sections, nor are any of the text features within sentences (quotes, ellipsis, etc.) normalized.
The Collins English Dictionary is present in two forms. One form was approximately parsed into fielded records as an exercise in learning a language called "FIT", by a student working under the direction of Lloyd Nakatani at AT&T Bell Laboratories during the summer of 1990. The original digital image of the typographer's tape from which the database version was prepared had serious flaws that were not detected and corrected until later; the corrected version, a clean typographer's tape, is presented in a separate directory. A properly analyzed database version will be provided in the future. The documentation includes notes developed during the new attempt to analyze the tape from scratch.
The Department of Energy abstracts reside in files that are approximately one megabyte each. The original 950 separators have been replaced with newlines and space padding between articles was removed. An acronym dictionary that was extracted from the database as an indication of the material's topic areas has been included in a separate directory.
Provisional material from the Penn Treebank project is divided into two subdirectories on this disk. The subdirectory "postext" contains text with part-of-speech annotations; "parstext" contains text with syntactic bracketing.
- references: Penn Treebank
- isReferencedBy: (Online documentation) "Documentation for ACL_DCI" (http://www.ldc.upenn.edu/Catalog/docs/LDC93T1/)
- isReferencedBy: 1993 ACL/DCI Linguistic Data Consortium, Philadelphia
-
C-000598: ARL Urdu Speech Database, Training Data
*Introduction*
This file contains documentation for ARL Urdu Speech Database, Training Data, Linguistic Data Consortium (LDC) catalog number LDC2007S03 and ISBN 1-58563-421-3.
The recordings in this release were collected by Appen Pty Ltd, Sydney, Australia in 2006. The U.S. Army Research Laboratory (ARL) provided this corpus to the LDC for distribution.
Urdu is an Indo-Aryan language spoken throughout South Asia that developed under the Mughal Empire and Delhi Sultanate between 1200 AD and 1800 AD. It has Persian, Turkish and Arabic influences, but in fact is a dialect of Hindustani. The word "Urdu" refers to the standardized register of Hindustani, but there are many non-standard idiolects as well. Urdu is the twentieth most spoken language in the world. It is the native language of over 60 million people, the official language of Pakistan, and one of India's national languages. Urdu is also spoken in Afghanistan.
The ARL Urdu Speech Database is a collection of recorded speech from 200 adult native Urdu speakers from Pakistan and Northern India. The distribution of speaker dialects is as follows:
Accent              Number of Speakers
South Sindh         29
North Sindh         30
South Punjab        27
North Punjab        29
Capital Area        29
North West Regions  30
Baluchistan         26
The database is divided into two parts: a training set containing approximately 80% of the data and a test set comprising the remaining 20%. This release consists of approximately 80% of the complete dataset (training and test).
*Data*
Each speaker was presented with 400 prompts to read: sentences, place names, and person names. Two microphones set at different distances to the speaker were used for the recordings. The recorded speech was stored in raw format files with headers stored in separate directories.
Each utterance is transcribed in the corresponding label file for each recording. The transcriptions were encoded in UTF-8. Punctuation was omitted and numbers were written out in full.
*Update*
Earlier versions were missing the content list file. This is now available as a download. Please contact the LDC membership office to receive instructions for download.
*Samples*
For an example of the data in this corpus, please listen to the following audio sample (.wav format).
- isReferencedBy: (online documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC2007S03/
- isReferencedBy: Appen Pty Ltd, Sydney, Australia 2007 ARL Urdu Speech Database, Training Data Linguistic Data Consortium, Philadelphia
-
C-000599: ATIS0 Read
LDC93S4A - Complete ATIS0 corpus
LDC93S4B - ATIS0 Pilot
LDC93S4B-2 - ATIS0 Read
LDC93S4B-3 - ATIS0 SD-Read
The ATIS0 Corpus totals six CD-ROMs: one with spontaneous data from 36 speakers; one with read versions of the data from 20 of those speakers, along with some adaptation material; and four with extensive speaker dependent material from the ATIS domain, read by ten of the same speakers.
All ATIS speech data is recorded at 16kHz sample rate, 16-bit quantization, from two different microphones, a close-talking (Sennheiser HMD414) and a desk-top (Crown PCC-160) model.
The first disc (ATIS0 Pilot) contains spontaneous utterances elicited in a "Wizard-of-Oz" simulation, along with the relational database containing the travel information (excluding connecting flights). 36 speakers produced a total of 912 utterances.
The second disc (ATIS0 Read) contains "read" versions of the spontaneous utterances for 20 of the 36 speakers above, for a total of 478 productions. This is supplemented by a set of 40 "adaptation" sentences read by each of the 20 speakers.
The third through the sixth discs (ATIS0 SD-Read) contain "read" speech in the ATIS domain for ten of the speakers on the first disc. They read a total of 3,171 utterances, or approximately 317 utterances per speaker. This data was collected for the purpose of training speaker-dependent speech recognition systems for the ATIS0 domain. Two of these four discs contain the close-talking (Sennheiser) microphone data and the other two contain corresponding data for the desk-top (Crown PCC-160) microphone. Thus there are 6,342 waveform files on the four discs.
- references: Charles T. Hemphill, et al. 1993 ATIS0 Read Linguistic Data Consortium, Philadelphia
- hasVersion: C-000599: ATIS0 Read
-
C-000600: ATIS0 SD Read
LDC93S4A - Complete ATIS0 corpus
LDC93S4B - ATIS0 Pilot
LDC93S4B-2 - ATIS0 Read
LDC93S4B-3 - ATIS0 SD-Read
The ATIS0 Corpus totals six CD-ROMs: one with spontaneous data from 36 speakers; one with read versions of the data from 20 of those speakers, along with some adaptation material; and four with extensive speaker dependent material from the ATIS domain, read by ten of the same speakers.
All ATIS speech data is recorded at 16kHz sample rate, 16-bit quantization, from two different microphones, a close-talking (Sennheiser HMD414) and a desk-top (Crown PCC-160) model.
The first disc (ATIS0 Pilot) contains spontaneous utterances elicited in a "Wizard-of-Oz" simulation, along with the relational database containing the travel information (excluding connecting flights). Thirty-six speakers produced a total of 912 utterances.
The second disc (ATIS0 Read) contains "read" versions of the spontaneous utterances for 20 of the 36 speakers above, for a total of 478 productions. This is supplemented by a set of 40 "adaptation" sentences read by each of the 20 speakers.
The third through the sixth discs (ATIS0 SD-Read) contain "read" speech in the ATIS domain for ten of the speakers on the first disc. They read a total of 3,171 utterances, or approximately 317 utterances per speaker. This data was collected for the purpose of training speaker-dependent speech recognition systems for the ATIS0 domain. Two of these four discs contain the close-talking (Sennheiser) microphone data and the other two contain corresponding data for the desk-top (Crown PCC-160) microphone. Thus there are 6,342 waveform files on the four discs.
*Update*
This publication has been condensed from four CD-ROM discs to a single DVD-ROM. The contents of each CD reside in separate directories that are organized identically to the original version.
- references: Charles T. Hemphill, et al. 1993 ATIS0 SD Read Linguistic Data Consortium, Philadelphia
- hasVersion: C-000599: ATIS0 Read