-
C-000718: ICSI Meeting Transcripts
*Introduction*
ICSI Meeting Transcripts was produced by the Linguistic Data Consortium (LDC) and has catalog number LDC2004T04 and ISBN 1-58563-286-4.
The ICSI Meeting corpus is a collection of 75 meetings collected at the International Computer Science Institute in Berkeley during the years 2000-2002. The meetings included are "natural" meetings in the sense that they would have occurred anyway: they are generally regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. In recording meetings of this type, we hoped to capture meeting dynamics and speaking styles that are as natural as possible given that speakers are wearing close-talking microphones and are fully cognizant of the recording process. The speech files range in length from 17 to 103 minutes, but generally run just under an hour each. The speech files are available as ICSI Meeting Speech.
*Data*
This corpus consists of 75 word-level transcripts (one transcript file per meeting), time-synchronized to digitized audio recordings. There are approximately 795,000 word tokens and 13,000 unique words in the transcripts.
The meetings were recorded with close-talking and far-field microphones. The transcripts were based mostly on the close-talking microphones, either separately or blended together in a so-called "mixed" channel. The focus of the transcripts was on capturing the flow of audible events, especially the words which were spoken, and who spoke them.
Transcripts were prepared by means of the "Channeltrans" interface. Channeltrans is an extension of the "Transcriber" interface.
There are a total of 53 unique speakers in the corpus. Meetings involved anywhere from three to 10 participants, averaging six. The corpus contains a significant proportion of non-native English speakers, varying in fluency from nearly-native to challenging-to-transcribe.
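Token and type counts like those reported for this corpus can be recomputed from transcript text. The sketch below uses a naive tokenizer, so its numbers would only approximate the official figures, which were presumably produced with the corpus's own conventions:

```python
import re

def token_type_counts(text):
    """Count word tokens and unique word types in a transcript string.

    Tokenization here is a naive lowercase split on alphabetic runs;
    exact counts will differ from those produced by the corpus's own
    tokenization rules.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(tokens), len(set(tokens))

print(token_type_counts("so we hoped to capture meeting dynamics , yeah yeah"))
# → (9, 8): nine word tokens, eight unique types
```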
*Sponsorship*
The collection and preparation of this corpus was made possible in large part through funding from DARPA, both through the Communicator project and through a ROAR "seedling," the Swiss IM2 project (National Centre of Competence in Research, sponsored by the Swiss National Science Foundation), and a supplementary award from IBM.
*Updates*
There are no updates available at this time. More information is available at http://www.ICSI.Berkeley.EDU/Speech/mr.
- references: Adam Janin, et al. 2004. ICSI Meeting Transcripts. Linguistic Data Consortium, Philadelphia.
- isReferencedBy: C-000717: ICSI Meeting Speech
-
C-000719: ISI Arabic-English Automatically Extracted Parallel Text
This distribution contains a corpus of Arabic-English parallel sentences, extracted automatically from two monolingual corpora: Arabic Gigaword Second Edition (LDC2006T02) and English Gigaword Second Edition (LDC2005T12). The data were extracted from news articles published by Xinhua News Agency and Agence France Presse using the automatic parallel sentence identification method described in the following publication: Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving Machine Translation Performance by Exploiting Non-Parallel Corpora. Computational Linguistics, 31(4):477-504.
The corpus contains 1,124,609 sentence pairs; the word count on the English side is approximately 31M words. The sentences in the parallel corpus preserve the form and encoding of the texts in the original Gigaword corpora.
For each sentence pair in the corpus the authors provide the names of the documents from which the two sentences were extracted, as well as a confidence score (between 0.5 and 1.0), which is indicative of their degree of parallelism. The parallel sentence identification approach is designed to judge sentence pairs in isolation from their contexts, and can therefore find parallel sentences within document pairs which are not parallel. The fact that two documents share several parallel sentences does not necessarily mean the documents are parallel.
In order to make this resource useful for research in Machine Translation (MT), the authors made efforts to detect potential overlaps between this data and the standard test and development data sets used by the MT community. The NIST 2002-2005 MT evaluation data sets contain several articles from Xinhua News Agency and Agence France Presse. Sentence pairs in this distribution that have a 7-gram overlap with a sentence pair in a NIST MT evaluation set or sentence pairs coming from documents whose names are similar to those in the NIST MT sets are marked with a negative confidence score.
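The confidence scores described above lend themselves to threshold-based filtering. The sketch below assumes a hypothetical in-memory representation of (Arabic, English, score) tuples; the actual on-disk layout of the release may differ:

```python
def select_pairs(pairs, min_confidence=0.8):
    """Return sentence pairs judged sufficiently parallel.

    `pairs` is assumed to be an iterable of
    (arabic_sentence, english_sentence, score) tuples. Negative
    scores mark suspected overlap with NIST MT evaluation data,
    so they can never pass a positive threshold.
    """
    return [p for p in pairs if p[2] >= min_confidence]

pairs = [
    ("...", "a high-confidence pair", 0.97),
    ("...", "a borderline pair", 0.55),
    ("...", "overlaps a NIST evaluation set", -0.91),
]
print(len(select_pairs(pairs)))                      # → 1
print(len(select_pairs(pairs, min_confidence=0.5)))  # → 2
```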
*Samples*
For an example of the data in this publication, please examine this image of text data.
- references: C-000612: Arabic Gigaword Second Edition
- references: C-001407: English Gigaword Second Edition
- hasVersion: C-003329: ISI Chinese-English Automatically Extracted Parallel Text
-
C-000720: ISL Meeting Speech Part 1
*Introduction*
ISL Meeting Speech Part 1 was produced by the Linguistic Data Consortium (LDC) and has catalog number LDC2004S05 and ISBN 1-58563-294-5.
ISL Meeting Speech Part 1 is the first subset of the ISL Meeting Corpus (112 meetings). It contains 18 meetings collected at the Interactive Systems Laboratories at Carnegie Mellon University in Pittsburgh, PA during the years 2000-2001. The recorded meetings were either natural meetings, where participants needed to meet in the real world, or artificial meetings, which were designed explicitly for the purposes of data collection but still had real topics and tasks. The duration of the meetings in this corpus ranges from eight to 64 minutes and averages 34 minutes. Word-level orthographic transcriptions are available as ISL Meeting Transcripts Part 1.
*Data*
The collection includes 105 speech files, for a total of approximately 10 hours of meeting speech. The speech for each meeting consists of wave files for each channel and a wave file containing a mix of all channels.
The audio was collected at a 16 kHz sample-rate. Audio files for each meeting are provided as separate time-synchronous recordings for each channel, encoded as 16-bit (little-endian) wave files.
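The documented audio format (16 kHz, 16-bit little-endian WAV) can be verified with Python's standard `wave` module, which handles the byte order for you. The file name below is hypothetical; real names follow the corpus's own scheme:

```python
import wave

def check_meeting_audio(path):
    """Verify a channel recording matches the documented format
    (16 kHz sample rate, 16-bit samples) and return its duration
    in seconds."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000, "expected 16 kHz audio"
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        return w.getnframes() / w.getframerate()

# Hypothetical usage:
# duration = check_meeting_audio("m001_speaker1.wav")
```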
During meeting recordings, each speaker wore an individual lapel microphone and was recorded via an Alesis 8-channel mix board and an ECHO Layla 8-channel sound card. This setup was designed to obtain a consumer- or application-style sound quality. All meetings were recorded in the same instrumented meeting area.
There are a total of 31 unique speakers in the corpus. Meetings involved anywhere from three to nine participants, averaging five. The corpus contains a significant proportion of non-native English speakers, varying in fluency.
*Sponsorship*
The collection and preparation of this corpus was made possible in large part through funding from DARPA, both through the GENOA project and through ROAR.
*Updates*
Additional information, updates, and bug fixes may be available on the ISL Meeting Room project page.
- references: Susanne Burger, Victoria MacLaren, and Alex Waibel. 2004. ISL Meeting Speech Part 1. Linguistic Data Consortium, Philadelphia.
- isReferencedBy: ISL Meeting Corpus
- hasVersion: C-000721: ISL Meeting Transcripts Part 1
-
C-000721: ISL Meeting Transcripts Part 1
*Introduction*
ISL Meeting Transcripts Part 1 was produced by the Linguistic Data Consortium (LDC) and has catalog number LDC2004T10 and ISBN 1-58563-295-3.
ISL Meeting Transcripts Part 1 is the first subset of the ISL Meeting Corpus (112 meetings). It contains 18 meetings collected at the Interactive Systems Laboratories at Carnegie Mellon University in Pittsburgh, PA during the years 2000-2001. The recorded meetings were either natural meetings, where participants needed to meet in the real world, or artificial meetings, which were designed explicitly for the purposes of data collection but still had real topics and tasks. The duration of the meetings in this corpus ranges from eight to 64 minutes and averages 34 minutes. The audio files are available as ISL Meeting Speech Part 1.
*Data*
This corpus consists of 19 word-level transcripts of 18 meetings (one transcription file per meeting; meeting m039 has two parts, m039a and m039b), time-synchronized to digitized audio recordings. There are approximately 116,200 word tokens and 5,850 unique word types in the transcripts.
The meetings were recorded with lapel microphones. The transcriptions were based on the lapel microphones recordings. The focus of the transcriptions was on capturing the flow of audible events, especially the words which were spoken, and who spoke them. The transcriptions contain additional annotations for spontaneous speech events and disfluencies.
Transcriptions were prepared with the TransEdit transcription application, which was developed for transcribing multi-channel recordings. It displays a synchronized multi-track view of all channels in a meeting, with listening and segmentation functions for each individual channel.
There are a total of 31 unique speakers in the corpus. Meetings involved anywhere from three to nine participants, averaging five. The corpus contains a significant proportion of non-native English speakers, varying in fluency.
*Sponsorship*
The collection and preparation of this corpus was made possible in large part through funding from DARPA, both through the GENOA project and through ROAR.
*Updates*
Additional information, updates, and bug fixes may be available on the ISL Meeting Room project page.
- references: Susanne Burger, Victoria MacLaren, and Alex Waibel. 2004. ISL Meeting Transcripts Part 1. Linguistic Data Consortium, Philadelphia.
- hasVersion: C-000720: ISL Meeting Speech Part 1
-
C-000722: Iraqi Arabic Conversational Telephone Speech, Transcripts
*Introduction*
This database contains 474 Iraqi Arabic speakers taking part in spontaneous telephone conversations in Colloquial Iraqi Arabic. A total of 478 conversation sides are provided (most speakers appear only once); most calls include both sides of the conversation (202 two-channel recordings plus 74 single-channel recordings). The average duration per call is about six minutes, so each call side contains about three minutes of speech, on average.
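The side counts and per-side speech estimate above can be cross-checked with a few lines of arithmetic:

```python
# Sanity-check the reported call-side counts and per-side speech time.
two_channel_calls = 202    # both sides of the conversation recorded
single_channel_calls = 74  # one side only
sides = two_channel_calls * 2 + single_channel_calls
print(sides)  # → 478, the stated total of conversation sides

minutes_per_call = 6         # approximate average call duration
print(minutes_per_call / 2)  # → 3.0 minutes of speech per side, on average
```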
This corpus was collected and transcribed in 2003 and 2004 by Appen Pty Ltd, Sydney, Australia.
*Samples*
For an example of the transcripts in this release, please review this sample.
- references: Appen Pty Ltd, Sydney, Australia. 2006. Iraqi Arabic Conversational Telephone Speech, Transcripts. Linguistic Data Consortium, Philadelphia.
-
C-000723: JEIDA/JCSD-Channel 0 City Names
*Introduction*
The Japanese Electronic Industry Development Association's (JEIDA) Common Speech Data Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of the Institute for Signal and Information Processing at Mississippi State University.
*Data*
This collection consists of high-fidelity recordings of 150 native speakers of Japanese; each speaker produces four repetitions of 323 short prompts, including city names, control words, monosyllabic words, isolated digits and strings of four digits. Each reading session was recorded with two microphones, yielding two channels that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data recorded with a standard dynamic microphone, a Sanken MU-2C. Channel 1 (LDC96S65), available separately, contains data recorded simultaneously with a condenser microphone that presumably varied from site to site.
A summary of the size and content of the corpus is given below:
number of speakers: 150 (75 male, 75 female)
range of speaker age: 10 to 70 years
number of items per speaker: 323
  - isolated digits: 15
  - four-digit sequences: 35
  - city names: 100
  - monosyllables: 110
  - control words (set A): 13
  - control words (set B): 24
  - control words (set C): 26
number of repetitions per item: 4
total number of utterances: 193,763 (per channel)
sample frequency: 16 kHz
sample type: 16-bit linear
number of microphones: 2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs; the partitioning of the data files has been done primarily by channel (20 CD-ROMs each for channel 0 and channel 1) and secondarily by category of prompt. These prompts include:
Control Words:
  - Banking Services: 13 items
  - Word Processors: 24 items
  - Home Electronic Equipment: 26 items
Digits:
  - Isolated Digits: 15 items
  - Four-Digit Sequences: 35 items
City Names: 100 items (a phonetically rich subset of common Japanese city names)
Monosyllables: 110 items (all Japanese monosyllables, plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets. Components of the corpus can also be purchased as outlined below:
Price  Set of  Description                           Catalog ID
2000   20      JEIDA/JCSD-Channel 0 (Complete)       LDC96S64
600    6       JEIDA/JCSD-Channel 0 City Names       LDC96S64-1
400    4       JEIDA/JCSD-Channel 0 Control Words    LDC96S64-2
100    1       JEIDA/JCSD-Channel 0 Isolated Digits  LDC96S64-3
300    3       JEIDA/JCSD-Channel 0 Four Digit Seq.  LDC96S64-4
600    6       JEIDA/JCSD-Channel 0 Monosyllables    LDC96S64-5
2000   20      JEIDA/JCSD-Channel 1 (Complete)       LDC96S65
600    6       JEIDA/JCSD-Channel 1 City Names       LDC96S65-1
500    4       JEIDA/JCSD-Channel 1 Control Words    LDC96S65-2
100    1       JEIDA/JCSD-Channel 1 Isolated Digits  LDC96S65-3
300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.  LDC96S65-4
600    6       JEIDA/JCSD-Channel 1 Monosyllables    LDC96S65-5
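The utterance total reported in this record can be compared against the nominal maximum implied by the speaker, item, and repetition counts:

```python
speakers = 150
items_per_speaker = 323
repetitions = 4

nominal = speakers * items_per_speaker * repetitions
print(nominal)           # → 193800
print(nominal - 193763)  # → 37: the reported per-channel total falls
                         #   slightly short of the nominal maximum
```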
*Updates*
There are no updates at this time.
- references: Jonathan Hamaker, et al. 1996. JEIDA/JCSD-Channel 0 City Names. Linguistic Data Consortium, Philadelphia.
- hasVersion: C-001022: JEIDA/JCSD-Channel 0 Complete
- hasVersion: C-001027: JEIDA/JCSD-Channel 1 Complete
- hasVersion: C-001023: JEIDA/JCSD-Channel 0 Control Words
- hasVersion: C-000726: JEIDA/JCSD-Channel 0 Isolated Digits
- hasVersion: C-001024: JEIDA/JCSD-Channel 0 Four Digit Sequences
- hasVersion: C-001025: JEIDA/JCSD-Channel 0 Mono Syllables
- hasVersion: C-001026: JEIDA/JCSD-Channel 1 City Names
- hasVersion: C-001028: JEIDA/JCSD-Channel 1 Control Words
- hasVersion: C-001030: JEIDA/JCSD-Channel 1 Isolated Digits
- hasVersion: C-001029: JEIDA/JCSD-Channel 1 Four Digit Sequences
- hasVersion: C-001031: JEIDA/JCSD-Channel 1 Mono Syllables
-
C-000726: JEIDA/JCSD-Channel 0 Isolated Digits
*Introduction*
The Japanese Electronic Industry Development Association's (JEIDA) Common Speech Data Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of the Institute for Signal and Information Processing at Mississippi State University.
*Data*
This collection consists of high-fidelity recordings of 150 native speakers of Japanese; each speaker produces four repetitions of 323 short prompts, including city names, control words, monosyllabic words, isolated digits and strings of four digits. Each reading session was recorded with two microphones, yielding two channels that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data recorded with a standard dynamic microphone, a Sanken MU-2C. Channel 1 (LDC96S65), available separately, contains data recorded simultaneously with a condenser microphone that presumably varied from site to site.
A summary of the size and content of the corpus is given below:
number of speakers: 150 (75 male, 75 female)
range of speaker age: 10 to 70 years
number of items per speaker: 323
  - isolated digits: 15
  - four-digit sequences: 35
  - city names: 100
  - monosyllables: 110
  - control words (set A): 13
  - control words (set B): 24
  - control words (set C): 26
number of repetitions per item: 4
total number of utterances: 193,763 (per channel)
sample frequency: 16 kHz
sample type: 16-bit linear
number of microphones: 2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs; the partitioning of the data files has been done primarily by channel (20 CD-ROMs each for channel 0 and channel 1) and secondarily by category of prompt. These prompts include:
Control Words:
  - Banking Services: 13 items
  - Word Processors: 24 items
  - Home Electronic Equipment: 26 items
Digits:
  - Isolated Digits: 15 items
  - Four-Digit Sequences: 35 items
City Names: 100 items (a phonetically rich subset of common Japanese city names)
Monosyllables: 110 items (all Japanese monosyllables, plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets. Components of the corpus can also be purchased as outlined below:
Price  Set of  Description                           Catalog ID
2000   20      JEIDA/JCSD-Channel 0 (Complete)       LDC96S64
600    6       JEIDA/JCSD-Channel 0 City Names       LDC96S64-1
400    4       JEIDA/JCSD-Channel 0 Control Words    LDC96S64-2
100    1       JEIDA/JCSD-Channel 0 Isolated Digits  LDC96S64-3
300    3       JEIDA/JCSD-Channel 0 Four Digit Seq.  LDC96S64-4
600    6       JEIDA/JCSD-Channel 0 Monosyllables    LDC96S64-5
2000   20      JEIDA/JCSD-Channel 1 (Complete)       LDC96S65
600    6       JEIDA/JCSD-Channel 1 City Names       LDC96S65-1
500    4       JEIDA/JCSD-Channel 1 Control Words    LDC96S65-2
100    1       JEIDA/JCSD-Channel 1 Isolated Digits  LDC96S65-3
300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.  LDC96S65-4
600    6       JEIDA/JCSD-Channel 1 Monosyllables    LDC96S65-5
*Updates*
There are no updates at this time.
- hasVersion: C-000723: JEIDA/JCSD-Channel 0 City Names
- hasVersion: C-001022: JEIDA/JCSD-Channel 0 Complete
- hasVersion: C-001023: JEIDA/JCSD-Channel 0 Control Words
- hasVersion: C-001024: JEIDA/JCSD-Channel 0 Four Digit Sequences
- hasVersion: C-001025: JEIDA/JCSD-Channel 0 Mono Syllables
- hasVersion: C-001026: JEIDA/JCSD-Channel 1 City Names
- hasVersion: C-001027: JEIDA/JCSD-Channel 1 Complete
- hasVersion: C-001028: JEIDA/JCSD-Channel 1 Control Words
- hasVersion: C-001029: JEIDA/JCSD-Channel 1 Four Digit Sequences
- hasVersion: C-001030: JEIDA/JCSD-Channel 1 Isolated Digits
- hasVersion: C-001031: JEIDA/JCSD-Channel 1 Mono Syllables
-
C-000734: OGI Multilanguage Corpus
The corpus consists of responses to prompts spoken over commercial telephone lines by speakers of English, Farsi (Persian), French, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. It contains a total of 1,927 calls, an average of 175 calls per language. Speech was collected using an automated system that answered the telephone, played digitized prompts in the appropriate language to request the speech samples and digitized the callers' responses for a designated period of time.
Log files are included that provide a set of automatic measurements made on each utterance. In addition, some utterances were automatically segmented into broad phonetic categories. The speech data are compressed, with NIST SPHERE headers.
- references: Ron Cole and Yeshwant Muthusamy. 1994. OGI Multilanguage Corpus. Linguistic Data Consortium, Philadelphia.
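The NIST SPHERE headers mentioned above are plain ASCII and can be inspected without special tools. A minimal parser sketch (the file name in the usage comment is hypothetical):

```python
def read_sphere_header(path):
    """Parse the ASCII header of a NIST SPHERE file into a dict.

    A SPHERE file begins with the line "NIST_1A", then a line giving
    the total header size in bytes, then "name -type value" records
    terminated by "end_head". Field values are returned as strings.
    """
    fields = {}
    with open(path, "rb") as f:
        assert f.readline().strip() == b"NIST_1A", "not a SPHERE file"
        header_size = int(f.readline())
        # The first two lines occupy 16 bytes in the standard layout.
        for line in f.read(header_size - 16).splitlines():
            parts = line.decode("ascii", "replace").split(None, 2)
            if parts and parts[0] == "end_head":
                break
            if len(parts) == 3:
                fields[parts[0]] = parts[2]
    return fields

# Hypothetical usage:
# rate = read_sphere_header("call_001.sph")["sample_rate"]
```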
-
C-000736: SPIDRE
This is a two-CD subset of the SWITCHBOARD collection (see above), selected for speaker ID research and with special attention to telephone instrument variation. It contains training and testing data for experiments in closed- or open-set recognition or verification. Combining the two sides of the conversations also permits speaker change detection, or speaker monitoring, experiments. There are 45 "target" speakers; four conversations from each target are included, of which two are from the same handset. There are also 100 calls in which no target appears. Since all conversations are two-sided, this yields 180 target sides and 180 + 200 = 380 nontarget sides.
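The side counts above follow directly from the design and can be reproduced as:

```python
targets = 45
conversations_per_target = 4
no_target_calls = 100

target_sides = targets * conversations_per_target  # 45 x 4 = 180
# Each target conversation contributes one nontarget partner side,
# and each of the 100 target-free calls contributes two nontarget sides.
nontarget_sides = target_sides + no_target_calls * 2
print(target_sides, nontarget_sides)  # → 180 380
```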
Except for the truncation of a few longer calls at five minutes, the calls themselves are as described under SWITCHBOARD.
- references: Alvin Martin, Jack Godfrey, Ed Holliman and Mark Przybocki. 1994. SPIDRE. Linguistic Data Consortium, Philadelphia.
-
C-000738: Switchboard-2 Phase II
*Introduction*
SWB-2 Phase II consists of 4,472 five-minute telephone conversations involving 679 participants. This corpus was collected by the Linguistic Data Consortium (LDC) in support of a project on Speaker Recognition sponsored by the U.S. Department of Defense.
*Data*
Participants in SWB-2 Phase II were recruited from the following midwestern college campuses: Iowa State University, Michigan State University, University of Michigan, University of Minnesota, University of Wisconsin at Madison, Northwestern University, and Ohio State University. Solicitation methods included the Internet, newspaper advertisements and personal contacts. The majority of the participants resided in Minnesota, Wisconsin, Ohio, Iowa, Michigan and Illinois as follows:
Minnesota: 156 speakers
Wisconsin: 105 speakers
Ohio: 70 speakers
Iowa: 64 speakers
Michigan: 41 speakers
Illinois: 37 speakers
Each recruit was asked to participate in at least ten five-minute phone calls. Ideally, each participant would receive five calls at a designated number and make five calls from telephones with different ANI (Automatic Number Identification) codes. Participants were asked to discuss a specific topic (read by the automated operator) and not to provide personal information during their call.
Each of the 679 participants placed their calls via a toll-free robot operator maintained by LDC. Access to the robot operator was possible via a unique Personal Identification Number (PIN) issued by the recruiting staff at LDC when the caller enrolled in the project.
Upon conclusion of the study, all calls were audited by LDC staff members. Particular attention was paid to verifying PINs (matching speakers with PINs), checking call duration, and assessing call quality. Upon completion of this process, checks were issued and mailed to participants. The conversations have not been transcribed.
*Updates*
09/29/2011: The file table and readme were updated to reflect that this data set was made available on DVD.
- isReferencedBy: C-000579: 1998 Speaker Recognition Benchmark
- hasVersion: C-001284: Switchboard-2 Phase I
- hasVersion: C-001285: Switchboard-2 Phase III Audio
- isReferencedBy: (Online Documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC99S79/
- isReferencedBy: David Graff, Kevin Walker, and Alexandra Canavan 1999 Switchboard-2 Phase II Linguistic Data Consortium, Philadelphia