C-001083: RST Discourse Treebank
RST Discourse Treebank contains a selection of 385 Wall Street Journal articles from the Penn Treebank which have been annotated with discourse structure in the framework of Rhetorical Structure Theory (RST). In addition, the corpus includes a number of humanly-generated extracts and abstracts associated with the original documents.
-
C-001084: RT-03 MDE Training Data Speech
*Introduction*
The MDE RT-03 Training Data Speech corpus was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004S08, ISBN 1-58563-300-3.
This data was originally created to support the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program in Metadata Extraction (MDE). The goal of EARS MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes.
The data in this release consists of English Conversational Telephone Speech (CTS) and Broadcast News (BN) audio data. The corresponding transcripts and annotations are available as MDE RT-03 Training Data Text and Annotations.
*Data*
There are 633 files, totaling approximately 5.39 GB (uncompressed) and representing over 60 hours of recorded speech: approximately 20 hours of Broadcast News and over 40 hours of Conversational Telephone Speech. The annotated data was originally developed to support the DARPA EARS Metadata Extraction (MDE) Program and was distributed as training data for the RT-03F evaluation cycle.
The CTS data was drawn from the Switchboard-1 Release 2 corpus.
The BN speech data was drawn from the 1997 English Broadcast News Speech (HUB4) corpus, from four distinct sources:
American Broadcasting Company (ABC) (1998, 2001) National Broadcasting Company (NBC) (1998, 2001) Public Radio International (PRI) (1998) Cable News Network (CNN) (2001)
*Data Format*
The audio data in this corpus conforms to the following technical specifications:

    Type  Format  Encoding    Channels  Sample Rate
    CTS   WAVE    u-Law       2         8000/sec
    BN    WAVE    16-bit PCM  1         16000/sec

The data is distributed in WAVE format because that is the audio file format supported by the open-source MDE annotation tool (MDE Tool), with which the annotation data is best explored.
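A file's conformance to these specifications can be checked by reading its WAVE header directly. The Python standard library's `wave` module only accepts linear PCM, so the sketch below parses the RIFF "fmt " chunk with `struct` instead; this is an illustrative, minimal parser, not part of the corpus tooling.

```python
import struct

def wav_format(path):
    """Return (format_code, channels, sample_rate) from a RIFF/WAVE header.

    Format code 1 is linear PCM (the BN files); 7 is u-law (the CTS files).
    Minimal parser: assumes a little-endian RIFF file whose 'fmt ' chunk
    precedes the 'data' chunk.
    """
    with open(path, "rb") as f:
        riff, _, wave_id = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError("no fmt chunk found")
            chunk_id, size = struct.unpack("<4sI", header)
            if chunk_id == b"fmt ":
                fmt_code, channels, rate = struct.unpack("<HHI", f.read(8))
                return fmt_code, channels, rate
            f.seek(size + (size & 1), 1)  # chunks are word-aligned
```

A BN file meeting the table above would return `(1, 1, 16000)`; a CTS file would return `(7, 2, 8000)`.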
*Annotations*
The transcripts corresponding to this speech have been annotated for various kinds of metadata. The goal of MDE is to enable technology that can take raw Speech-To-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. To this end, LDC has defined a SimpleMDE annotation task. Under SimpleMDE, annotators identify four types of fillers: filled pauses like "uh" and "um," discourse markers like "you know," asides and parentheticals, and editing terms like "sorry" and "I mean." Edit disfluencies are also identified; the full extent of the disfluency (or string of adjacent disfluencies) and interruption points are tagged. Annotators further identify SUs (alternately semantic units, sense units, syntactic units, slash units or sentence units); that is, units within the discourse that function to express a complete thought or idea on the part of a speaker. As with disfluency annotation, the goal of SU labeling is to improve transcript readability, here by creating a transcript in which information is presented in small, structured, coherent chunks rather than long turns or stories. There are four types of sentence-level SUs: statements, questions, backchannels and incomplete SUs. To enhance inter-annotator consistency, the annotation task also identifies a number of sub-sentence SU boundaries (coordination and clausal SUs).
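The category inventory described above can be summarized in code. The following is an illustrative data model of my own devising, not LDC's annotation-file schema:

```python
from enum import Enum

class FillerType(Enum):
    """The four filler types identified under SimpleMDE."""
    FILLED_PAUSE = "filled pause"              # e.g. "uh", "um"
    DISCOURSE_MARKER = "discourse marker"      # e.g. "you know"
    ASIDE_PARENTHETICAL = "aside/parenthetical"
    EDITING_TERM = "editing term"              # e.g. "sorry", "I mean"

class SUType(Enum):
    """The four sentence-level SU types."""
    STATEMENT = "statement"
    QUESTION = "question"
    BACKCHANNEL = "backchannel"
    INCOMPLETE = "incomplete"
```

Edit disfluencies (extents plus interruption points) and the sub-sentence coordination/clausal SU boundaries would sit alongside these categories in a fuller model.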
General information about the EARS MDE Annotation effort, including free annotation tools, annotation guidelines and additional information can be found at LDC's main EARS MDE Project Page.
*Updates*
There are no updates available at this time.
Portions (c) 1998 American Broadcasting Company, Inc., (c) 1997-98 Cable News Network, Inc., (c) 1997 Public Radio International, (c) 1997 National Cable Satellite Corporation, (c) 2004 Trustees of the University of Pennsylvania
isPartOf: MDE RT-03 Training Data Text and Annotations
-
C-001085: RT-03 MDE Training Data Text and Annotations
*Introduction*
The MDE RT-03 Training Data Text and Annotations corpus was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T12, ISBN 1-58563-301-1.
This data was originally created to support the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program in Metadata Extraction (MDE). The goal of EARS MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes.
The data in this release consists of English Conversational Telephone Speech (CTS) and Broadcast News (BN) transcripts and annotations. The corresponding speech data is available as MDE RT-03 Training Data Speech.
*Data*
There are 633 files, totaling approximately 747 MB, with a total of 764,978 tokens. The transcripts and annotations cover approximately 20 hours of Broadcast News and over 40 hours of Conversational Telephone Speech data. The annotated data was originally developed to support the DARPA EARS Metadata Extraction (MDE) Program and was distributed as training data for the RT-03F evaluation cycle.
The CTS data was drawn from the Switchboard-1 Release 2 corpus.
The BN speech data was drawn from the 1997 English Broadcast News Speech (HUB4) corpus, from four distinct sources:
American Broadcasting Company (ABC) (1998, 2001) National Broadcasting Company (NBC) (1998, 2001) Public Radio International (PRI) (1998) Cable News Network (CNN) (2001)
*Annotations*
The transcripts within this corpus have been annotated for various kinds of metadata. The goal of MDE is to enable technology that can take raw Speech-To-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. To this end, LDC has defined a SimpleMDE annotation task. Under SimpleMDE, annotators identify four types of fillers: filled pauses like "uh" and "um," discourse markers like "you know," asides and parentheticals, and editing terms like "sorry" and "I mean." Edit disfluencies are also identified; the full extent of the disfluency (or string of adjacent disfluencies) and interruption points are tagged. Annotators further identify SUs (alternately semantic units, sense units, syntactic units, slash units or sentence units); that is, units within the discourse that function to express a complete thought or idea on the part of a speaker. As with disfluency annotation, the goal of SU labeling is to improve transcript readability, here by creating a transcript in which information is presented in small, structured, coherent chunks rather than long turns or stories. There are four types of sentence-level SUs: statements, questions, backchannels and incomplete SUs. To enhance inter-annotator consistency, the annotation task also identifies a number of sub-sentence SU boundaries (coordination and clausal SUs). The docs directory contains the complete set of SimpleMDE annotation guidelines used to create this data.
*Data Format*
The data appears in two formats. The AG Atlas (ag.xml) format represents the native annotation format, and utilizes the Annotation Graph Library. This data is best explored using the LDC MDE Toolkit, which is freely available at http://www.ldc.upenn.edu/Projects/MDE/Tools.
The data is also provided in the RTTM format developed by NIST to support the EARS Program. The RTTM format labels each token in the reference transcript according to the properties it displays: lexeme vs. non-lexeme, edit, filler, SU, etc.
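Reading such files can be sketched as follows, assuming the standard nine-field NIST RTTM layout (type, file, channel, begin time, duration, orthography, subtype, speaker, confidence). The field names and the sample line are based on that convention, not taken from this corpus's documentation:

```python
from collections import namedtuple

# One record per RTTM data line, following the nine-field NIST convention.
RttmRecord = namedtuple(
    "RttmRecord",
    ["type", "file", "channel", "tbeg", "tdur",
     "ortho", "subtype", "speaker", "conf"])

def parse_rttm(lines):
    """Yield one RttmRecord per data line, skipping ';;' comments and blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) != 9:
            raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
        yield RttmRecord(*fields)

# Hypothetical example line (file ID and values are made up):
rec = next(parse_rttm(["LEXEME sw4519 1 5.27 0.32 okay lex spkr1 0.95"]))
```

Fields are kept as strings here; a fuller reader would convert `tbeg`, `tdur`, and `conf` to floats and map `type`/`subtype` onto the MDE categories.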
General information about the EARS MDE Annotation effort, including free annotation tools, annotation guidelines and additional information can be found at LDC's EARS MDE Project Page.
*Updates*
There are no updates available at this time.
Portions (c) 1998 American Broadcasting Company, Inc., (c) 1997-98 Cable News Network, Inc., (c) 1997 Public Radio International, (c) 1997 National Cable Satellite Corporation, (c) 2004 Trustees of the University of Pennsylvania
isPartOf: MDE RT-03 Training Data Speech
-
C-001086: RT-04 MDE Training Data Speech
*Introduction*
This file contains documentation on the MDE RT-04 Training Data Speech, Linguistic Data Consortium (LDC) catalog number LDC2005S16 and ISBN 1-58563-355-0.
This corpus was created and distributed by the Linguistic Data Consortium to provide training data for the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. This data was previously released to the EARS MDE community as LDC2004E31.
The goal of MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: flagging non-content words like filled pauses and discourse markers for optional removal; marking sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation and standardized spelling, plus sensible conventions for representing speaker turns and identity are further elements in the readable transcript. LDC has defined a SimpleMDE annotation task specification and has annotated English telephone and broadcast news data to provide training data for MDE.
*Samples*
For an example of the data in this publication, please review the broadcast news and telephone conversation samples included with the corpus. The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
isReplacedBy: LDC2004E31
-
C-001087: RT-04 MDE Training Data Text/Annotations
*Introduction*
This corpus was created and distributed by the Linguistic Data Consortium to provide training data for the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. This data was previously released to the EARS MDE community as LDC2004E31.
The goal of MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: flagging non-content words like filled pauses and discourse markers for optional removal; marking sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation and standardized spelling, plus sensible conventions for representing speaker turns and identity are further elements in the readable transcript. LDC has defined a SimpleMDE annotation task specification and has annotated English telephone and broadcast news data to provide training data for MDE.
In this release, some original annotations contained in LDC2004E31 have been re-mapped to new MDE elements to support better annotation consistency. In particular, the mapping affects Discourse Responses (DR), Discourse Markers (DM) and Backchannel SUs (BC). A description of the original mapping proposed by ICSI appears in 3) below, with complete documentation of the mapping rules contained in the docs/drmap-discussion directory. The scripts used to apply the mapping can be found in the docs/scripts/drmap directory.
*Samples*
For an example of this corpus, please review the following XML samples:
* Broadcast News Annotations
* Telephone Conversation Annotation
The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
replaces: LDC2004E31
-
C-001088: Resource Management Complete Set 2.0
LDC93S3A - Resource Management Complete Set 2.0
LDC93S3B - Resource Management (RM1) 2.0
LDC93S3C - Resource Management (RM2) 2.0
The DARPA Resource Management corpora (RM) consist of digitized and transcribed speech for use in designing and evaluating continuous speech recognition systems. There are two main parts, often referred to as RM1 and RM2. RM1 contains three sections, Speaker-Dependent (SD) training data, Speaker-Independent (SI) training data and test and evaluation data. RM2 has an additional and larger SD data set, including test material. Resource Management Complete Set 2.0 contains RM1 and RM2.
All RM material consists of read sentences modeled after a naval resource management task. The complete corpus contains over 25,000 utterances from more than 160 speakers representing a variety of American dialects. The material was recorded at 16 kHz, with 16-bit resolution, using a Sennheiser HMD-414 headset microphone. All discs conform to the ISO-9660 data format.
Resource Management SD and SI Training and Test Data (RM1)
The Speaker-Dependent (SD) Training Data contains 12 subjects, each reading a set of 600 "training sentences," two "dialect" sentences and ten "rapid adaptation" sentences, for a total of 7,344 recorded sentence utterances. The 600 sentences designated as training cover 97% of the lexical items in the corpus.
The Speaker-Independent (SI) Training Data contains 80 speakers, each reading two "dialect" sentences plus 40 sentences from the Resource Management text corpus, for a total of 3,360 recorded sentence utterances. Any given sentence from a set of 1,600 Resource Management sentence texts was recorded by two subjects, while no sentence was read twice by the same subject.
RM1 contains all SD and SI system test material used in five DARPA benchmark tests conducted in March and October of 1987, June 1988, and February and October 1989, along with scoring and diagnostic software and documentation for those tests. Documentation is also provided outlining use of the Resource Management training and test material at CMU in development of the SPHINX system. Example output and scored results for state-of-the-art speaker-dependent and speaker-independent systems (i.e. the BBN BYBLOS and CMU SPHINX systems) for the October 1989 benchmark tests are included.
Extended Resource Management Speaker-Dependent Corpus (RM2)
This set forms a speaker-dependent extension to the Resource Management (RM1) corpus. The corpus consists of a total of 10,508 sentence utterances (two male and two female speakers each speaking 2,652 sentence texts). These include the 600 "standard" Resource Management speaker-dependent training sentences, two dialect calibration sentences, ten rapid adaptation sentences, 1,800 newly-generated extended training sentences, 120 newly-generated development-test sentences and 120 newly-generated evaluation-test sentences. The evaluation-test material on this disc was used as the test set for the June 1990 DARPA SLS Resource Management Benchmark Tests (see the Proceedings).
The RM2 corpus was recorded at Texas Instruments. The NIST speech recognition scoring software originally distributed on the RM1 "Test" Disc was adapted for RM2 sentences and is included in this publication.
-
C-001089: Resource Management RM1 2.0
LDC93S3A - Resource Management Complete Set 2.0
LDC93S3B - Resource Management (RM1) 2.0
LDC93S3C - Resource Management (RM2) 2.0
The DARPA Resource Management corpora (RM) consist of digitized and transcribed speech for use in designing and evaluating continuous speech recognition systems. There are two main parts, often referred to as RM1 and RM2. RM1 contains three sections, Speaker-Dependent (SD) training data, Speaker-Independent (SI) training data and test and evaluation data. RM2 has an additional and larger SD data set, including test material. Resource Management Complete Set 2.0 contains RM1 and RM2.
All RM material consists of read sentences modeled after a naval resource management task. The complete corpus contains over 25,000 utterances from more than 160 speakers representing a variety of American dialects. The material was recorded at 16 kHz, with 16-bit resolution, using a Sennheiser HMD-414 headset microphone.
Resource Management SD and SI Training and Test Data (RM1)
The Speaker-Dependent (SD) Training Data contains 12 subjects, each reading a set of 600 "training sentences," two "dialect" sentences and ten "rapid adaptation" sentences, for a total of 7,344 recorded sentence utterances. The 600 sentences designated as training cover 97% of the lexical items in the corpus.
The Speaker-Independent (SI) Training Data contains 80 speakers, each reading two "dialect" sentences plus 40 sentences from the Resource Management text corpus, for a total of 3,360 recorded sentence utterances. Any given sentence from a set of 1,600 Resource Management sentence texts was recorded by two subjects, while no sentence was read twice by the same subject.
RM1 contains all SD and SI system test material used in five DARPA benchmark tests conducted in March and October of 1987, June 1988, and February and October 1989, along with scoring and diagnostic software and documentation for those tests. Documentation is also provided outlining use of the Resource Management training and test material at CMU in development of the SPHINX system. Example output and scored results for state-of-the-art speaker-dependent and speaker-independent systems (i.e. the BBN BYBLOS and CMU SPHINX systems) for the October 1989 benchmark tests are included.
Extended Resource Management Speaker-Dependent Corpus (RM2)
This set forms a speaker-dependent extension to the Resource Management (RM1) corpus. The corpus consists of a total of 10,508 sentence utterances (two male and two female speakers each speaking 2,652 sentence texts). These include the 600 "standard" Resource Management speaker-dependent training sentences, two dialect calibration sentences, ten rapid adaptation sentences, 1,800 newly-generated extended training sentences, 120 newly-generated development-test sentences and 120 newly-generated evaluation-test sentences. The evaluation-test material was used as the test set for the June 1990 DARPA SLS Resource Management Benchmark Tests (see the Proceedings).
The RM2 corpus was recorded at Texas Instruments. The NIST speech recognition scoring software originally distributed on the RM1 "Test" Disc was adapted for RM2 sentences and is included in this publication.
-
C-001090: Resource Management RM2 2.0
LDC93S3A - Resource Management Complete Set 2.0
LDC93S3B - Resource Management (RM1) 2.0
LDC93S3C - Resource Management (RM2) 2.0
The DARPA Resource Management corpora (RM) consist of digitized and transcribed speech for use in designing and evaluating continuous speech recognition systems. There are two main parts, often referred to as RM1 and RM2. RM1 contains three sections, Speaker-Dependent (SD) training data, Speaker-Independent (SI) training data and test and evaluation data. RM2 has an additional and larger SD data set, including test material. Resource Management Complete Set 2.0 contains RM1 and RM2.
All RM material consists of read sentences modeled after a naval resource management task. The complete corpus contains over 25,000 utterances from more than 160 speakers representing a variety of American dialects. The material was recorded at 16 kHz, with 16-bit resolution, using a Sennheiser HMD-414 headset microphone. All discs conform to the ISO-9660 data format.
Resource Management SD and SI Training and Test Data (RM1)
The Speaker-Dependent (SD) Training Data contains 12 subjects, each reading a set of 600 "training sentences," two "dialect" sentences and ten "rapid adaptation" sentences, for a total of 7,344 recorded sentence utterances. The 600 sentences designated as training cover 97% of the lexical items in the corpus.
The Speaker-Independent (SI) Training Data contains 80 speakers, each reading two "dialect" sentences plus 40 sentences from the Resource Management text corpus, for a total of 3,360 recorded sentence utterances. Any given sentence from a set of 1,600 Resource Management sentence texts was recorded by two subjects, while no sentence was read twice by the same subject.
RM1 contains all SD and SI system test material used in five DARPA benchmark tests conducted in March and October of 1987, June 1988 and February and October 1989, along with scoring and diagnostic software and documentation for those tests. Documentation is also provided outlining use of the Resource Management training and test material at CMU in development of the SPHINX system. Example output and scored results for state-of-the-art speaker-dependent and speaker-independent systems (i.e. the BBN BYBLOS and CMU SPHINX systems) for the October 1989 benchmark tests are included.
Extended Resource Management Speaker-Dependent Corpus (RM2)
This set forms a speaker-dependent extension to the RM1 corpus. The corpus consists of a total of 10,508 sentence utterances (two male and two female speakers each speaking 2,652 sentence texts). These include the 600 "standard" Resource Management speaker-dependent training sentences, two dialect calibration sentences, ten rapid adaptation sentences, 1,800 newly-generated extended training sentences, 120 newly-generated development-test sentences and 120 newly-generated evaluation-test sentences. The evaluation-test material on this disc was used as the test set for the June 1990 DARPA SLS Resource Management Benchmark Tests (see the Proceedings).
The RM2 corpus was recorded at Texas Instruments. The NIST speech recognition scoring software originally distributed on the RM1 "Test" Disc was adapted for RM2 sentences and is included in this publication.
-
C-001091: Road Rally
The Road Rally corpus was designed for the development and testing of word-spotting systems and was collected in a conversational domain using a road rally planning task as the topic. The corpus consists of two sub-corpora, "Stonehenge" and "Waterloo." The Stonehenge corpus contains road rally planning conversations as well as some read speech collected using high-quality microphones and a telephone-simulating filter. The Waterloo corpus contains read speech in the road rally planning domain, collected over actual telephone lines.
*Stonehenge*
The Stonehenge corpus was collected from subjects using telephone handsets which were modified to contain a high-quality microphone. To gather conversational data, two talkers were located in separate rooms, given a road map and asked to participate in a road rally planning task. Their objective was to form a path between two locations on the map which would maximize their road rally point score. They were also given a time limit in which to complete the task, to increase their responsiveness. Their speech was recorded on a stereo tape recorder with each subject's speech on a separate track. The tracks were digitized and the speech was edited to remove silences longer than about a second, resulting in approximately three minutes of continuous speech per subject. The speech was filtered using a 300 Hz to 3300 Hz PCM FIR bandpass filter to simulate telephone bandwidth quality. The Stonehenge corpus consists of 80 speakers: 28 female and 52 male.
*Waterloo*
The Waterloo corpus was collected as an extension to Stonehenge to provide similar domain speech under different conditions. The corpus was collected from subjects using conventional telephones and dialed-up telephone lines in the Massachusetts area. Unlike the Stonehenge speech, the Waterloo speech is naturally band-limited by the telephones and lines, but for consistency it was also filtered using the Stonehenge 300 Hz to 3300 Hz PCM FIR bandpass filter. The corpus consists of 56 speakers (28 male and 28 female), each reading aloud a paragraph of road rally domain speech.
-
C-001092: Russian through Switched Telephone Network (RuSTeN)
*Introduction*
This file contains documentation on the Russian through Switched Telephone Network (RuSTeN), Linguistic Data Consortium (LDC) catalog number LDC2006S34 and ISBN 1-58563-388-7.
This corpus was developed as part of "Trawl" (Automatic Voice Identification System in Telephone Channel), a project whose purpose was to develop software for automatic identification of speakers based on voice samples acquired through telephone channels. The training of the system was performed with the RuSTeN telephone speech corpus.
*Data*
The RuSTeN (Russian through Switched Telephone Network) database was recorded between March 2001 and February 2003 by Speech Technology Center (STC) using the "forget-me-not" professional telephone recording and archiving software package developed by STC. The files were recorded at a sampling rate of 11025 Hz, single-channel, 16-bit linear.
Each of the speakers made at least five calls from different locations and/or telephone sets. Most of the calls were made from a home or office environment with an uncontrolled noise level. Additionally, one call per speaker was made from a public telephone (with either street or metro station noise in the background). The recordings are spontaneous conversations (sometimes guided by the near-end speaker) between the caller and the speech database collector on various subjects (the weather, the caller's biography, hobbies, etc.) and include approximately 150 seconds of the far-end speaker and at least five seconds of the near-end speaker. In addition, on each call the caller was asked to utter the digits 0-9 and the words "yes" and "no."
The time interval between two successive sessions is at least two days. The database contains 125 far-end speakers, 58 male and 67 female. Each far-end speaker is represented by at least five speech files. The sound files are in WAV format. The speech filenames encode the following information: FFF (far-end speaker number) and SS (session number).
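The filename convention can be decoded mechanically. The sketch below assumes a hypothetical name shape like "125_03.wav" (three-digit FFF speaker number, two-digit SS session number); the exact separator and extension are assumptions, since the corpus documentation here does not spell them out.

```python
import re

# Assumed pattern: FFF = three-digit far-end speaker number,
# SS = two-digit session number, optionally separated by "_" or "-".
FILENAME_RE = re.compile(r"(?P<speaker>\d{3})[_-]?(?P<session>\d{2})\.wav$")

def parse_rusten_name(filename):
    """Return (speaker_number, session_number) decoded from a filename."""
    m = FILENAME_RE.search(filename)
    if m is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return int(m.group("speaker")), int(m.group("session"))
```

With these assumptions, "125_03.wav" would decode to speaker 125, session 3.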
*Samples*
For an example of the data in this corpus, please review the audio sample included with the corpus.