-
C-000623: Articulation Index
*Introduction*
Articulation Index was developed by the Linguistic Data Consortium (LDC) and was partly inspired by the work of Harvey Fletcher, who performed a number of perceptual experiments involving English syllables during the first half of the 20th century. His term articulation index meant something like perceptual index of syllables, where those syllables were not necessarily words, and reflected how well speakers could correctly identify syllables in the presence of noise. This corpus was created to facilitate similar experiments, as well as to potentially facilitate new methods in speech recognition research.
The basic concept behind the corpus was to record speakers pronouncing syllables of English, some of which might be real words, but most of which are nonsense syllables. The goal was to have each speaker say a set of 2,000 syllables common to all speakers, as well as a set of 20 syllables unique to that speaker.
LDC has also released Articulation Index LSCP (LDC2015S12).
*Data*
This release contains recordings of 20 American English speakers (12 males, 8 females) saying 2005 common syllables, 1845 of which are common to all speakers, and 400 unique syllables (20 syllables per speaker).
The recordings were made in a small, sound-treated anechoic room at LDC. The speakers wore two microphones: a Sennheiser 410 headset and a Nortel Liberator wireless phone headset. The Sennheiser's signal traveled through a Symetrix 302 Dual Microphone Preamp, Sony PCM-R300 DAT deck and Townshend Datlink to a Sun Sparcserver 20, where it was written to disk as 16 kHz, 16-bit PCM data. The Nortel's signal was transmitted to a wireless base station at a telephone connected via the network to LDC's telephone recording platform, where it was captured to disk as 8 kHz, 8-bit u-law data.
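The 8 kHz, 8-bit u-law format stores each sample as a single companded byte, so it usually has to be expanded before it can be compared with the 16-bit PCM channel. The following is a minimal sketch of the standard G.711 u-law expansion in Python. It is not taken from the corpus documentation: the names `ulaw_to_linear` and `decode` are assumptions made for this example, and the sketch assumes headerless u-law bytes, which may not match how the corpus files are actually packaged.

```python
from typing import List

BIAS = 0x84  # 132, the bias used by G.711 u-law companding


def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit u-law code to a signed linear PCM value."""
    u = ~byte & 0xFF            # u-law codes are stored bit-inverted
    sign = u & 0x80             # top bit carries the sign
    exponent = (u >> 4) & 0x07  # 3-bit segment number
    mantissa = u & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode(data: bytes) -> List[int]:
    """Expand a buffer of raw u-law bytes (assumed headerless)."""
    return [ulaw_to_linear(b) for b in data]
```

The expanded values fit in a signed 16-bit range, so the result can be stored alongside the 16-bit PCM recordings once the two channels are brought to a common sample rate.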
The speakers were prompted via a computer interface that displayed one prompt at a time, allowing them to iterate through the prompts by pressing a "next" button. Each recording session lasted approximately 15 minutes.
*Samples*
For an example of this corpus, please review this audio sample.
- references: Jonathan Wright 2005 Articulation Index Linguistic Data Consortium, Philadelphia
-
C-000624: BBN Pronoun Coreference and Entity Type Corpus
*Introduction*
This file contains documentation on the BBN Pronoun Coreference and Entity Type Corpus, Linguistic Data Consortium (LDC) catalog number LDC2005T33 and ISBN 1-58563-362-3.
This publication supplements the one million word Penn Treebank corpus of Wall Street Journal texts (LDC95T7). The corpus contains stand-off annotation of pronoun coreference, indicated by sentence and token numbers, as well as annotation of a variety of entity and numeric types. All annotation was done by hand at BBN using proprietary annotation tools. This corpus was developed by BBN to support the ACE and AQUAINT programs.
The corpus contains two components:
* Pronoun coreference. Stand-off annotation of pronoun coreference of the WSJ corpus is provided in a single file. Pronouns and antecedents are indexed by sentence and token numbers.
* Entity types. The corpus includes annotation of 12 named entity types (Person, Facility, Organization, GPE, Location, Nationality, Product, Event, Work of Art, Law, Language, and Contact-Info), nine nominal entity types (Person, Facility, Organization, GPE, Product, Plant, Animal, Substance, Disease and Game), and seven numeric types (Date, Time, Percent, Money, Quantity, Ordinal and Cardinal). Several of these types are further divided into subtypes. Annotation for a total of 64 subtypes is provided.
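Stand-off annotation keeps the labels separate from the text and points into it by position, here by sentence and token numbers. The sketch below illustrates the general idea in Python. It does not reproduce the actual file layout of this corpus, which is not described here: the `Mention` type, the `resolve` function, the 0-based indexing and the two example sentences are all hypothetical.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    """A span identified by position only (hypothetical layout)."""
    sentence: int  # 0-based sentence index (assumed convention)
    start: int     # 0-based index of the first token
    end: int       # 0-based index of the last token, inclusive


def resolve(sentences: List[List[str]], mention: Mention) -> str:
    """Look up the text a stand-off mention points to."""
    tokens = sentences[mention.sentence]
    return " ".join(tokens[mention.start:mention.end + 1])


# Invented example text, not drawn from the WSJ corpus.
sentences = [
    ["The", "company", "reported", "a", "profit", "."],
    ["It", "raised", "its", "forecast", "."],
]

antecedent = Mention(sentence=0, start=0, end=1)
pronoun = Mention(sentence=1, start=0, end=0)

# A coreference link is then just a pair of positions.
link = (pronoun, antecedent)
```

Because the link stores only positions, the underlying Treebank text does not need to be modified or redistributed with the annotation; the cost is that the indices are meaningful only against the exact tokenization they were created for.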
*Samples*
For an example of the data in this corpus, please examine the following samples:
* LDC2005T33.qa
* LDC2005T33_pron.txt
* LDC2005T33_sent.txt
- references: C-001546: Treebank-2
-
C-000625: BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
*Introduction*
BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts consists of transcribed, spontaneous speech recorded from subjects speaking Levantine colloquial Arabic. Levantine Arabic is the dialect of Arabic spoken in Lebanon, Jordan, Syria, and Palestine. It is significantly different from Modern Standard Arabic. It is a spoken rather than a written language, and includes different words and pronunciations from Modern Standard Arabic.
The corpus was developed with funding from the Defense Advanced Research Projects Agency (DARPA), as part of the Babylon program. The Babylon program was intended to advance the state of the art in speech-to-speech translation systems by creating new technology and by developing systems for field use. BBN was funded under Babylon to develop a limited English/Arabic refugee/medical speech translation system for a handheld computer, and it collected this corpus as part of its work. The corpus may be useful for speech recognition in Levantine colloquial Arabic, including for speech translation and spoken dialog systems.
*Samples*
To see an example of this corpus, we have provided an audio sample and transcription.
- references: BBN Technologies (with American University of Beirut as a subcontractor): John Makhoul, et al. 2005 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts Linguistic Data Consortium, Philadelphia
-
C-000626: BLLIP 1987-89 WSJ Corpus Release 1
*Introduction*
Brown Laboratory for Linguistic Information Processing (BLLIP) 1987-89 WSJ Corpus Release 1 contains a complete, Treebank-style part-of-speech (POS) tagged and parsed version of the three-year Wall Street Journal (WSJ) collection from ACL/DCI (LDC93T1), approximately 30 million words. The annotation was performed using statistically-based methods developed by BLLIP researchers Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson.
This corpus both overlaps and supplements the million-word Penn Treebank (PTB) collection of parsed and POS-tagged WSJ texts.
*Data*
The PTB project selected 2,499 stories from a three-year WSJ collection of 98,732 stories for syntactic annotation. These 2,499 stories are distributed in Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42), both of which include the raw text for each story.
*Updates*
There are no updates at this time.
- references: C-001546: Treebank-2
- references: C-001547: Treebank-3
-
C-000628: Boston University Radio Speech Corpus
The Boston University Radio Speech Corpus was collected primarily to support research in text-to-speech synthesis, particularly generation of prosodic patterns. The corpus consists of professionally read radio news data, including speech and accompanying annotations, suitable for speech and language research.
The corpus includes speech from seven (four male, three female) FM radio news announcers associated with WBUR, a public radio station. The main radio news portion of the corpus consists of over seven hours of news stories recorded in the WBUR radio studio during broadcasts over a two-year period. In addition, the announcers were recorded in a laboratory at Boston University. In this lab news portion, the announcers read a total of 24 stories from the radio news portion. The announcers were first asked to read the stories in their non-radio style and then, 30 minutes later, to read the same stories in their radio style.
Each story read by an announcer was digitized in paragraph-sized units, which typically include several sentences. The files were digitized at a 16 kHz sample rate using a 16-bit A/D. The paragraphs were annotated with the orthographic transcription, phonetic alignments, part-of-speech tags and prosodic markers. The orthographic transcripts were generated by hand and include indication of where the speaker took a breath. The phonetic alignments and part-of-speech tags were generated automatically and hand corrected. The prosodic labels were marked by hand and are available only for a subset of the corpus.
A zipped sample file, example.zip, is available. Please be aware that this file is slightly larger than 1 MB (1,278,998 bytes). An additional sample file, LDC1996.tgz, and a WAV sample are also available.
- isReferencedBy: Mari Ostendorf, Patti Price, and Stefanie Shattuck-Hufnagel 1996 Boston University Radio Speech Corpus Linguistic Data Consortium, Philadelphia
- isReferencedBy: Mari Ostendorf, Patti Price, and Stefanie Shattuck-Hufnagel, "The Boston University Radio News Corpus"
-
C-000631: CALLFRIEND American English-Non-Southern Dialect
*Introduction*
The CALLFRIEND project supports the development of language identification technology.
*Data*
The corpus consists of 60 unscripted telephone conversations, each lasting between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).
For each conversation, both the caller and callee are native speakers of non-Southern dialects of American English. All calls are domestic and were placed inside the continental United States, Canada, Puerto Rico, or the Dominican Republic.
Callers in the "non-Southern" (or "general") collection of CALLFRIEND American English appear to come from a wide geographic range, based on their own reports of where they were raised (some identified their origins as being in the southeastern U.S.). Regardless of their geographic or ethnic backgrounds, the feature they share is the clear absence of a vowel quality pattern that would distinguish them as speakers of a "Southern" dialect.
Some information was inadvertently left out of the speaker information table and the call information table. Copies of these files are available as CALLINFO.TBL and SPKRINFO.TBL.
*Updates*
There are no updates at this time.
- hasVersion: C-000632: CALLFRIEND American English-Southern Dialect
- hasVersion: C-000633: CALLFRIEND Canadian French
- hasVersion: C-000634: CALLFRIEND Egyptian Arabic
- hasVersion: C-000635: CALLFRIEND Farsi
- hasVersion: C-000636: CALLFRIEND German
- hasVersion: C-000637: CALLFRIEND Hindi
- hasVersion: C-000638: CALLFRIEND Japanese
- hasVersion: C-000639: CALLFRIEND Korean
- hasVersion: C-000640: CALLFRIEND Mandarin Chinese-Mainland Dialect
- hasVersion: C-000641: CALLFRIEND Mandarin Chinese-Taiwan Dialect
- hasVersion: C-000642: CALLFRIEND Spanish-Caribbean Dialect
- hasVersion: C-000643: CALLFRIEND Spanish-Non-Caribbean Dialect
- hasVersion: C-000644: CALLFRIEND Tamil
- hasVersion: C-000645: CALLFRIEND Vietnamese
- isReferencedBy: "Description of the CallFriend telephone speech corpus for American English" (http://www.ldc.upenn.edu/Catalog/docs/LDC96S46/CF_ENG_N.TXT)
- isReferencedBy: Alexandra Canavan and George Zipperlen 1996 CALLFRIEND American English-Non-Southern Dialect Linguistic Data Consortium, Philadelphia
-
C-000632: CALLFRIEND American English-Southern Dialect
*Introduction*
The CALLFRIEND project supports the development of language identification technology.
*Data*
The corpus consists of 60 unscripted telephone conversations, each lasting between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).
For each conversation, both the caller and callee are native speakers of Southern American English. All calls are domestic and were placed inside the continental United States, Canada, Puerto Rico or the Dominican Republic.
Callers in the "Southern" collection of CALLFRIEND American English were identified primarily on the basis of vowel quality patterns that are common among native speakers raised in the southeastern United States (from Texas eastward to the Atlantic coast and from Virginia and Kentucky southward to the Gulf of Mexico). This category also includes a small number of African-American speakers, whose geographic origins may be more dispersed, but who share some of the vowel quality patterns distinctive of Southern white speakers. (Of course, other dialect features involving phonology, syntax and prosody serve to differentiate these two subgroups within the "Southern" collection.)
*Updates*
There are no updates at this time.
- hasVersion: C-000631: CALLFRIEND American English-Non-Southern Dialect
- hasVersion: C-000633: CALLFRIEND Canadian French
- hasVersion: C-000634: CALLFRIEND Egyptian Arabic
- hasVersion: C-000635: CALLFRIEND Farsi
- hasVersion: C-000636: CALLFRIEND German
- hasVersion: C-000637: CALLFRIEND Hindi
- hasVersion: C-000638: CALLFRIEND Japanese
- hasVersion: C-000639: CALLFRIEND Korean
- hasVersion: C-000640: CALLFRIEND Mandarin Chinese-Mainland Dialect
- hasVersion: C-000641: CALLFRIEND Mandarin Chinese-Taiwan Dialect
- hasVersion: C-000642: CALLFRIEND Spanish-Caribbean Dialect
- hasVersion: C-000643: CALLFRIEND Spanish-Non-Caribbean Dialect
- hasVersion: C-000644: CALLFRIEND Tamil
- hasVersion: C-000645: CALLFRIEND Vietnamese
- isReferencedBy: "Description of the CallFriend telephone speech corpus for American English" (http://www.ldc.upenn.edu/Catalog/docs/LDC96S47/CF_ENG_S.TXT)
- isReferencedBy: Alexandra Canavan and George Zipperlen 1996 CALLFRIEND American English-Southern Dialect Linguistic Data Consortium, Philadelphia
-
C-000633: CALLFRIEND Canadian French
*Introduction*
The CALLFRIEND project supports the development of language identification technology.
*Data*
The corpus consists of 60 unscripted telephone conversations, each lasting between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).
For each conversation, both the caller and callee are native Canadian speakers of French. All calls are domestic and were placed inside the continental United States and Canada.
*Updates*
There are no updates at this time.
- hasVersion: C-000631: CALLFRIEND American English-Non-Southern Dialect
- hasVersion: C-000632: CALLFRIEND American English-Southern Dialect
- hasVersion: C-000634: CALLFRIEND Egyptian Arabic
- hasVersion: C-000635: CALLFRIEND Farsi
- hasVersion: C-000636: CALLFRIEND German
- hasVersion: C-000637: CALLFRIEND Hindi
- hasVersion: C-000638: CALLFRIEND Japanese
- hasVersion: C-000639: CALLFRIEND Korean
- hasVersion: C-000640: CALLFRIEND Mandarin Chinese-Mainland Dialect
- hasVersion: C-000641: CALLFRIEND Mandarin Chinese-Taiwan Dialect
- hasVersion: C-000642: CALLFRIEND Spanish-Caribbean Dialect
- hasVersion: C-000643: CALLFRIEND Spanish-Non-Caribbean Dialect
- hasVersion: C-000644: CALLFRIEND Tamil
- hasVersion: C-000645: CALLFRIEND Vietnamese
- isReferencedBy: "Description of the CallFriend telephone speech corpus for French" (http://www.ldc.upenn.edu/Catalog/docs/LDC96S48/CF_FRE.TXT)
- isReferencedBy: Alexandra Canavan and George Zipperlen 1996 CALLFRIEND Canadian French Linguistic Data Consortium, Philadelphia
-
C-000634: CALLFRIEND Egyptian Arabic
*Introduction*
The CALLFRIEND project supports the development of language identification technology.
*Data*
The corpus consists of 60 unscripted telephone conversations, each lasting between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).
For each conversation, both the caller and callee are native speakers of the Egyptian dialect of Arabic. All calls are domestic and were placed inside the continental United States and Canada.
*Updates*
There are no updates at this time.
- hasVersion: C-000631: CALLFRIEND American English-Non-Southern Dialect
- hasVersion: C-000632: CALLFRIEND American English-Southern Dialect
- hasVersion: C-000633: CALLFRIEND Canadian French
- hasVersion: C-000635: CALLFRIEND Farsi
- hasVersion: C-000636: CALLFRIEND German
- hasVersion: C-000637: CALLFRIEND Hindi
- hasVersion: C-000638: CALLFRIEND Japanese
- hasVersion: C-000639: CALLFRIEND Korean
- hasVersion: C-000640: CALLFRIEND Mandarin Chinese-Mainland Dialect
- hasVersion: C-000641: CALLFRIEND Mandarin Chinese-Taiwan Dialect
- hasVersion: C-000642: CALLFRIEND Spanish-Caribbean Dialect
- hasVersion: C-000643: CALLFRIEND Spanish-Non-Caribbean Dialect
- hasVersion: C-000644: CALLFRIEND Tamil
- hasVersion: C-000645: CALLFRIEND Vietnamese
- isReferencedBy: "Description of the CallFriend telephone speech corpus for Arabic" (http://www.ldc.upenn.edu/Catalog/docs/LDC96S49/CF_ARA.TXT)
- isReferencedBy: Alexandra Canavan and George Zipperlen 1996 CALLFRIEND Egyptian Arabic Linguistic Data Consortium, Philadelphia
-
C-000635: CALLFRIEND Farsi
*Introduction*
The CALLFRIEND project supports the development of language identification technology.
*Data*
The corpus consists of 60 unscripted telephone conversations, each lasting between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).
For each conversation, both the caller and callee are native speakers of Farsi. All calls are domestic and were placed inside the continental United States and Canada.
*Updates*
There are no updates at this time.
- hasVersion: C-000631: CALLFRIEND American English-Non-Southern Dialect
- hasVersion: C-000632: CALLFRIEND American English-Southern Dialect
- hasVersion: C-000633: CALLFRIEND Canadian French
- hasVersion: C-000634: CALLFRIEND Egyptian Arabic
- hasVersion: C-000636: CALLFRIEND German
- hasVersion: C-000637: CALLFRIEND Hindi
- hasVersion: C-000638: CALLFRIEND Japanese
- hasVersion: C-000639: CALLFRIEND Korean
- hasVersion: C-000640: CALLFRIEND Mandarin Chinese-Mainland Dialect
- hasVersion: C-000641: CALLFRIEND Mandarin Chinese-Taiwan Dialect
- hasVersion: C-000642: CALLFRIEND Spanish-Caribbean Dialect
- hasVersion: C-000643: CALLFRIEND Spanish-Non-Caribbean Dialect
- hasVersion: C-000644: CALLFRIEND Tamil
- hasVersion: C-000645: CALLFRIEND Vietnamese
- isReferencedBy: "Description of the CallFriend telephone speech corpus for Farsi" (http://www.ldc.upenn.edu/Catalog/docs/LDC96S50/CF_FAR.TXT)
- isReferencedBy: Alexandra Canavan and George Zipperlen 1996 CALLFRIEND Farsi Linguistic Data Consortium, Philadelphia