  • C-000673: CSLU: Speaker Recognition Version 1.1
    *Introduction*

    This file contains documentation on the CSLU Speaker Recognition Corpus, Version 1.1, Linguistic Data Consortium (LDC) catalog number LDC2006S26 and ISBN 1-58563-382-8.

    The Speaker Recognition corpus (formerly known as Speaker Verification) consists of telephone speech from 91 participants. Each participant recorded speech in twelve sessions over a two-year period, answering questions like "what is your eye color" or responding to prompts like "describe a typical day in your life." Most of the utterances in this release of the corpus have corresponding non-time-aligned word-level transcriptions.

    In most of the CSLU data collections, each participant calls a toll-free telephone number and answers a few questions. CSLU records the speech, transcribes it, then packages it as a released corpus.

    The Speaker Recognition data collection was considerably more complicated. The goal was to collect speech from each participant over a two-year period. Each participant was to call the data collection system 12 times over the two-year period and say the same utterances each time.

    Some of the recording sessions were only a few days apart and others several weeks apart. Participants followed this calling schedule: during the first month, they called twice in one week. No calls were made in the second and third months. In the fourth month they made one call. No calls were made in the fifth and sixth months. This pattern repeated three more times, for a total of 12 calls per participant.

    In order to balance the workload required to remind participants to call and to avoid large data collection bursts on the system, the participants were divided into 12 groups. Each group began the two-year schedule on subsequent months. The first group started in September 1996. The second group started in October 1996. And so on.
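    The schedule above can be sketched in a few lines of code; this is a minimal illustration assuming the six-month pattern restarts exactly on schedule (the function and variable names are hypothetical, not part of the corpus documentation):

```python
# Hypothetical sketch of the calling schedule described above: within
# each six-month cycle, two calls in the first month and one in the fourth.
CYCLE_PATTERN = {0: 2, 3: 1}  # month offset within a cycle -> number of calls

def call_counts(start_year, start_month, cycles=4):
    """Return (year, month, calls) entries over the two-year schedule."""
    schedule = []
    for cycle in range(cycles):
        for offset, calls in sorted(CYCLE_PATTERN.items()):
            months = cycle * 6 + offset
            y, m = divmod(start_month - 1 + months, 12)
            schedule.append((start_year + y, m + 1, calls))
    return schedule

# The first group began in September 1996; later groups started one month apart.
group1 = call_counts(1996, 9)
assert sum(calls for _, _, calls in group1) == 12  # 12 calls per participant
```

    For the first group this yields calls in September and December 1996, March, June, September, and December 1997, and March and June 1998, matching the staggered two-year design.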

    *Samples*

    For an example of the data in this corpus, please listen to the following audio sample.
    • replaces: Speaker Recognition Corpus Release 1.0
    • replaces: Speaker Verification Corpus
    • isReferencedBy: (Online Documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC2006S26/
    • isReferencedBy: CSLU 2006 CSLU: Speaker Recognition Version 1.1 Linguistic Data Consortium, Philadelphia
  • C-000674: CSLU: Spelled and Spoken Words
    *Introduction*

    This file contains documentation on the Spelled and Spoken Words Corpus, Linguistic Data Consortium (LDC) catalog number LDC2006S15 and ISBN 1-58563-382-8.

    The Spelled and Spoken Words corpus consists of spelled and spoken words from 3,647 callers, who were prompted to say and spell their first and last names, to say what city they grew up in and what city they were calling from, and to answer two yes/no questions. In order to collect sufficient instances of each letter, 1,371 callers also recited the English alphabet with pauses between the letters. Each call was transcribed by two people, and all differences were resolved. In addition, a subset of 2,648 calls has been phonetically labeled.

    *Samples*

    For an example of the data in this corpus, please listen to this audio sample.
    • replaces: Spelled and Spoken Words Corpus Release 1.1
    • isReferencedBy: (Online Documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC2006S15/
    • isReferencedBy: R. A. Cole, M. Fanty, and K. Roginski 2006 CSLU: Spelled and Spoken Words Linguistic Data Consortium, Philadelphia
  • C-000675: CSLU: Spoltech Brazilian Portuguese Version 1.0
    *Introduction*

    CSLU: Spoltech Brazilian Portuguese Version 1.0, Linguistic Data Consortium (LDC) catalog number LDC2006S16 and ISBN 1-58563-383-6, contains microphone speech from a variety of regions in Brazil with phonetic and orthographic transcriptions. The utterances consist of both read speech (for phonetic coverage) and responses to questions (for spontaneous speech). The corpus contains 477 speakers and 8,080 separate utterances. A total of 2,540 utterances have been transcribed at the word level (without time alignments), and 5,479 utterances have been transcribed at the phoneme level (with time alignments). Protocol design, recording and transcription were performed by the Universidade Federal do Rio Grande do Sul and the Universidade de Caxias do Sul.

    *Data*

    The data has been recorded at 44.1 kHz (mono, 16-bit) and stored in RIFF format. The recording was conducted with a direct connection from the microphone to the sound card. The sound card was SoundBlaster-compatible. For the prompted sentences, the sentence was hidden from view when recording began, so that the speaker might utter the sentence more naturally. Verification of the recording quality was performed immediately after each utterance recording; the data-collection software allowed the speaker to re-record utterances in case the recording was not of sufficient quality. The acoustic environment was not controlled, in order to allow for background conditions that would occur in application environments.
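    Since the files are standard RIFF audio, the stated parameters can be verified programmatically. A minimal sketch using Python's standard `wave` module (this helper is my own illustration, not part of the corpus tooling):

```python
import wave

# Check that a RIFF file matches the stated recording parameters:
# 44.1 kHz sample rate, mono, 16-bit (2-byte) samples.
def check_format(path):
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1
                and w.getsampwidth() == 2
                and w.getframerate() == 44100)
```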

    *Samples*

    For an example of the data in this corpus, please listen to this audio sample and examine its transcript.
    • isReferencedBy: Mauricio C. Schramm, et al. 2006 CSLU: Spoltech Brazilian Portuguese Version 1.0 Linguistic Data Consortium, Philadelphia
  • C-000676: CSLU: Voices
    The Voices Corpus was created by Alexander Kain for his Ph.D. dissertation work on high resolution voice transformation. The corpus contains 12 speakers reading 50 phonetically rich sentences. The recording procedure involved a "mimicking" approach which resulted in a high degree of natural time-alignment between different speakers. The acoustic wave and the concurrent laryngograph signal were recorded for one "free" and two "mimicked" renditions of each sentence. Pitch marks, calculated from the laryngograph signal, and time marks, the output of a forced-alignment algorithm, have been added to the corpus.

    *Samples*

    For an example of the data contained in this publication, please review the following samples.
    * Concurrent laryngograph.
    * Pitch marks derived from laryngograph signal.
    * Transcription.
    * Wave file of speech.
    • isReferencedBy: (Online Documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC2006S01/
    • isReferencedBy: Alexander Kain 2006 CSLU: Voices Linguistic Data Consortium, Philadelphia
  • C-000677: CSR-I (WSJ0) Complete
    *Introduction*

    LDC93S6A - Complete CSR-I corpus

    LDC93S6B - CSR-I Sennheiser speech

    LDC93S6C - CSR-I other speech

    During 1991, the DARPA Spoken Language Program initiated efforts to build a new corpus to support research on large-vocabulary Continuous Speech Recognition (CSR) systems.

    The first two CSR Corpora consist primarily of read speech with texts drawn from a machine-readable corpus of Wall Street Journal news text and are thus often known as WSJ0 and WSJ1. (Later sections of the CSR set of corpora, however, will consist of read texts from other sources of North American business news and eventually from other news domains).

    The texts to be read were selected to fall within either a 5,000-word or a 20,000-word subset of the WSJ text corpus. (See the documentation for details). Some spontaneous dictation is included in addition to the read speech. The dictation portion was collected using journalists who dictated hypothetical news articles.

    Two microphones are used throughout: a close-talking Sennheiser HMD414 and a secondary microphone, which may vary. The corpora are thus offered in three configurations: the speech from the Sennheiser, the speech from the other microphone and the speech from both; all three sets include all transcriptions, tests, documentation, etc.

    In general, transcriptions of the speech, test data from ARPA evaluations, scores achieved by various speech recognition systems and software used in scoring are included on separate discs from the waveform data.

    *Samples*

    Please listen to this audio sample.
  • C-000678: CSR-I (WSJ0) Other
    LDC93S6A - Complete CSR-I corpus

    LDC93S6B - CSR-I Sennheiser speech

    LDC93S6C - CSR-I other speech

    During 1991, the DARPA Spoken Language Program initiated efforts to build a new corpus to support research on large-vocabulary Continuous Speech Recognition (CSR) systems.

    The first two CSR Corpora consist primarily of read speech with texts drawn from a machine-readable corpus of Wall Street Journal news text and are thus often known as WSJ0 and WSJ1. (Later sections of the CSR set of corpora, however, will consist of read texts from other sources of North American business news and eventually from other news domains).

    The texts to be read were selected to fall within either a 5,000-word or a 20,000-word subset of the WSJ text corpus. (See the documentation for details). Some spontaneous dictation is included in addition to the read speech. The dictation portion was collected using journalists who dictated hypothetical news articles.

    Two microphones are used throughout: a close-talking Sennheiser HMD414 and a secondary microphone, which may vary. The corpora are thus offered in three configurations: the speech from the Sennheiser, the speech from the other microphone and the speech from both; all three sets include all transcriptions, tests, documentation, etc.

    In general, transcriptions of the speech, test data from ARPA evaluations, scores achieved by various speech recognition systems and software used in scoring are included on separate discs from the waveform data.
  • C-000679: CSR-I (WSJ0) Sennheiser
    LDC93S6A - Complete CSR-I corpus

    LDC93S6B - CSR-I Sennheiser speech

    LDC93S6C - CSR-I other speech

    During 1991, the DARPA Spoken Language Program initiated efforts to build a new corpus to support research on large-vocabulary Continuous Speech Recognition (CSR) systems.

    The first two CSR Corpora consist primarily of read speech with texts drawn from a machine-readable corpus of Wall Street Journal news text and are thus often known as WSJ0 and WSJ1. (Later sections of the CSR set of corpora, however, will consist of read texts from other sources of North American business news and eventually from other news domains).

    The texts to be read were selected to fall within either a 5,000-word or a 20,000-word subset of the WSJ text corpus. (See the documentation for details). Some spontaneous dictation is included in addition to the read speech. The dictation portion was collected using journalists who dictated hypothetical news articles.

    Two microphones are used throughout: a close-talking Sennheiser HMD414 and a secondary microphone, which may vary. The corpora are thus offered in three configurations: the speech from the Sennheiser, the speech from the other microphone and the speech from both; all three sets include all transcriptions, tests, documentation, etc.

    In general, transcriptions of the speech, test data from ARPA evaluations, scores achieved by various speech recognition systems and software used in scoring are included on separate discs from the waveform data.

    Please note that this corpus has been updated from its original disc release to a web download; some of the documentation may still reflect its original disc state. However, all data is still present.
  • C-000680: CSR-II (WSJ1) Other
    LDC94S13A - Complete CSR-II corpus

    LDC94S13B - CSR-II Sennheiser speech

    LDC94S13C - CSR-II Other speech

    *Data*

    The complete WSJ1 corpus contains approximately 78,000 training utterances (73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus contains approximately 8,200 "conventional" development test utterances (eight hours of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus, the entire corpus was collected using two microphones, so the amount of speech in the entire corpus is about 162 hours.
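    The hour figures above are consistent with simple arithmetic: training plus development-test speech, doubled for the two microphone channels.

```python
# Training speech (73 h) plus development-test speech (8 h), each
# captured by two microphones, gives the corpus total quoted above.
train_hours, devtest_hours, microphones = 73, 8, 2
total_hours = (train_hours + devtest_hours) * microphones
assert total_hours == 162
```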

    In early 1993, a "Hub and Spoke" test paradigm was designed, calling for eleven test sets, each a specific variation of the basic or "hub" condition. The eleven Hub and Spoke Development and Evaluation Test sets each contain approximately 7,500 waveforms (eleven hours of speech).

    WSJ1 waveforms have been compressed by about 2:1 using the SPHERE-embedded "Shorten" compression algorithm developed at Cambridge University.
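    The compressed waveforms sit inside NIST SPHERE files, whose fixed-size ASCII header records, among other fields, the sample coding. As a hedged illustration (the helper below is hypothetical, not LDC software; it assumes the conventional "NIST_1A" header layout of "name -type value" lines terminated by "end_head"), the header can be parsed like this:

```python
# Hypothetical helper: parse the ASCII header of a NIST SPHERE file.
# SPHERE files start with "NIST_1A", then the header size in bytes, then
# "name -type value" lines ending at "end_head"; Shorten-compressed data
# is flagged in the sample_coding field.
def read_sphere_header(data: bytes) -> dict:
    magic, size_line, _ = data.split(b"\n", 2)
    assert magic == b"NIST_1A"
    header_size = int(size_line)
    fields = {}
    for line in data[:header_size].decode("ascii").splitlines()[2:]:
        if line.strip() == "end_head":
            break
        name, _type, value = line.split(None, 2)
        fields[name] = value
    return fields
```

    In practice, LDC distributes tools for decompressing such files; the sketch only shows how a reader could detect the compression before attempting to process the samples.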

    *Updates*

    The CD-ROM labeled "Evaluation Test Data, Part 1" (NIST Speech Disk 13-32.1) contains the file wsj1/doc/lng_modl/base_lm/tcb20onp.z ("WSJ1/DOC/LNG_MODL/BASE_LM/TCB20ONP.Z" on a Windows OS). Please note that even though this file has the ".z" extension, it is not a compressed file. To use the file, simply ignore the ".z" extension.
  • C-000681: CSR-II (WSJ1) Sennheiser
    LDC94S13A - Complete CSR-II corpus

    LDC94S13B - CSR-II Sennheiser speech

    LDC94S13C - CSR-II Other speech

    *Data*

    The complete WSJ1 corpus contains approximately 78,000 training utterances (73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus contains approximately 8,200 conventional development test utterances (eight hours of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus, the entire corpus was collected using two microphones, so the amount of speech in the entire corpus is about 162 hours.

    In early 1993, a Hub and Spoke test paradigm was designed, calling for eleven test sets, each a specific variation of the basic or hub condition. The eleven Hub and Spoke Development and Evaluation Test sets each contain approximately 7,500 waveforms (eleven hours of speech).

    WSJ1 waveforms have been compressed by about 2:1 using the SPHERE-embedded Shorten compression algorithm developed at Cambridge University.

    *Updates*

    Please note that even though the file wsj1/doc/lng_modl/base_lm/tcb20onp.z (WSJ1/DOC/LNG_MODL/BASE_LM/TCB20ONP.Z on a Windows OS) has the .z extension, it is not a compressed file. To use the file, simply ignore the .z extension.
    • references: 1994 CSR-II (WSJ1) Sennheiser Linguistic Data Consortium, Philadelphia
    • hasVersion: CSR-II (WSJ1) Complete
    • hasVersion: C-000680: CSR-II (WSJ1) Other
  • C-000682: CSR-III Speech
    The third ARPA Continuous Speech Recognition (CSR) Benchmark Speech Test Collection is a three CD-ROM set that contains complete development test and evaluation test suites for speaker-independent, large-vocabulary speech recognition systems. The development and evaluation tests share a common structure, consisting of two core test components ("hubs") and seven specialized test components ("spokes"). The hub tests, which were mandatory for all ARPA CSR participants in the November '94 evaluations, provide a base-line for ASR performance, while the spokes provide the means for assessing the impact of particular speaking conditions or processing strategies in relation to base-line performance. Participants were free to take any combination of spoke tests according to their research interests. Taken together, the collection encompasses 180 speakers, each producing 20-40 sentences. These are organized into two complete development test sets and one evaluation set.

    The collection also includes complete documentation on the test specifications, data collection procedures, transcriptions and scoring protocols, together with the latest available version of NIST software for scoring ASR results and managing SPHERE waveform files. All speech data is accompanied by both the prompting texts and the detailed orthographic transcriptions of the utterances.

    This was the first ARPA CSR Benchmark Test in which prompting texts were drawn from a variety of news sources. Whereas earlier benchmarks were based on Wall Street Journal excerpts (from the period 1987-89), CSR-III prompts come from a variety of North American Business News services: Reuters News Service, New York Times, Washington Post and Los Angeles Times, as well as WSJ; all texts are drawn from financial news articles written during the period of April through June, 1994. (NAB stands for "North American Business," in contrast to earlier benchmarks and training collections labeled "WSJ").

    An important companion to the 1994 Benchmark Speech data collection is the four-disk CSR-III Text Collection (LDC95T6), which includes the ARPA CSR 1994 Standard Language Model. This corpus is also available from the LDC as a 1995 release.

    Because of restrictions imposed by the copyright holders of much of the NAB text, both the speech and text collections are available to LDC members only. For more information on how to join, send email to ldc@ldc.upenn.edu.

    *Pricing*

    The Reduced Licensing Fee for this corpus is US$200.