Registered language resources: 3,330 (showing items 941-950 of 2,023)
  • C-001584: Voice of America (VOA) Czech Broadcast News Audio
    *Introduction*

    Voice of America (VOA) Czech Broadcast News Audio was developed by the Linguistic Data Consortium (LDC). Corresponding transcripts are contained in Voice of America (VOA) Czech Broadcast News Transcripts (LDC2000T53), the documentation for which is included with this release.

    *Data*

    Between February 9 and May 28, 1999, LDC collected approximately 30 hours of Czech broadcast audio from the Voice of America news service. The 62 data files presented in this corpus represent the audio of the daily broadcasts of 30-minute news programs.

    Due to technical limitations in the hardware at LDC that was used to receive the VOA broadcasts via a satellite downlink, a number of files contain brief portions where the audio signal was interrupted. These interruptions typically yielded regions of complete silence that lasted less than two seconds and were scattered sparsely throughout an affected audio file. Additional markup was provided in the transcription texts to isolate the regions where these interruptions occurred.

    The 62 audio files in this corpus are single-channel, 16 kHz, 16-bit linear SPHERE files.

    *Samples*

    For an example of the data in this corpus, please review this audio sample.

    *Updates*

    There are no updates at this time.
  • C-001585: Voicemail Corpus Part I
    *Introduction*

    This corpus was created by: M. Padmanabhan, G. Ramaswamy, B. Ramabhadran, P. S. Gopalakrishnan and C. Dunn

    *Data*

    This corpus consists of 1,801 messages, collected from volunteers at various IBM sites in the United States, comprising the training data set, and 42 messages in the development test set. The average voicemail message is 31 seconds in duration and has about 100 words. Approximately 38% of the messages correspond to male speakers; the remainder correspond to female speakers. All messages were transcribed by IBM.

    *Updates*

    There are no updates at this time.

    *Pricing*

    The Reduced Licensing Fee for this corpus is US$150.
  • C-001586: Voicemail Corpus Part II
    *Introduction*

    Voicemail Corpus Part II was produced by the Linguistic Data Consortium (LDC), catalog number LDC2002S35 and ISBN 1-58563-242-2. Voicemail Corpus Part II is a continuation of Voicemail Corpus Part I (LDC98S77).

    *Data*

    This publication comprises speech and script files and is organized into training and evaluation data. The training data consists of 2,048 voicemail messages and the corresponding script files. The speech and script files are organized in 41 directories, each of which contains up to 50 messages. The evaluation data consists of 50 voicemail messages and 50 scripts.

    The speech data is provided in SPHERE format; it is sampled at 8 kHz and recorded in 8-bit u-law, totalling approximately 14 hours (406 MB) for training and 23 minutes (11 MB) for evaluation.

    In addition to the individual script files, there are three files that represent a concatenation of the individual scripts: train_scripts.all and eval_scripts.all concatenate the training and evaluation script files respectively, one message per line, each line beginning with the fileID. eval_scripts_filtered.all is a filtered version of eval_scripts.all, produced by eliminating the tagged elements and the proper-noun markers.
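As a minimal sketch of working with this layout (assuming one message per line, with the fileID separated from the transcript by whitespace, as described above), a concatenated script file such as train_scripts.all could be read like this; `read_scripts` is a hypothetical helper name, not part of the corpus distribution:

```python
# Hypothetical sketch: read a concatenated script file such as
# train_scripts.all, assuming each line is "<fileID> <transcript>".
def read_scripts(path):
    """Return a dict mapping fileID -> transcript text."""
    scripts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split on the first whitespace: fileID, then the transcript.
            file_id, _, text = line.partition(" ")
            scripts[file_id] = text
    return scripts
```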

    *Updates*

    A more recent version of the paper Automatic Speech Recognition Performance on a Voicemail Transcription Task (M. Padmanabhan, G. Saon, J. Huang, B. Kingsbury and L. Mangu, IEEE Transactions on Speech and Audio Processing, vol 10, number 7, pp 433-442, October 2002) is available in both PDF and PS format by email request.

    *Pricing*

    The Reduced Licensing Fee for this corpus is US$150.
  • C-001587: WEBCOMMAND
    Desktop/Microphone
    WEBCOMMAND contains recording sessions of 49 native speakers from France and Great Britain, recorded in two different quiet office rooms. In each session one speaker reads a list of 130 prompts from a screen. There are two prompt lists of 130 items each per language; most speakers therefore read 260 items in two different rooms. Speakers were recorded with two microphones: a high-quality headset and a high-quality microphone fixed to a 'webpad' held on the lap. The corpus contains a total of 15,600 two-channel recordings in 120 sessions.
    - Length of each recording: 5.7 sec
    - Total: 49.4 h
    - Formats and distribution: SpeechDat Exchange Format
    - Transcription: SpeechDat
  • C-001588: WSJCAM0 Cambridge Read News
    A British English Speech Corpus for Large Vocabulary Continuous Speech Recognition (The Cambridge University Version of the ARPA CSR Corpus WSJ0).

    This release of WSJCAM0 represents version 1.1 of the corpus, which was initially released on tape by Cambridge University as of August 31, 1994. This collection was modelled directly on the ARPA CSR Corpus released by LDC in 1993: it used the same dual-microphone recording paradigm and a subset of prompting texts drawn from the Wall Street Journal.

    There are two key differences between WSJ0 and WSJCAM0: (1) the subjects in WSJCAM0 were native speakers of British English and (2) in addition to standard orthographic transcripts, WSJCAM0 also has information on the time alignment between the sampled waveform and both the words and the phonetic segments.

    The contents of the publication consist of the following:

    * Training data from head-mounted microphone
    * Development test data from head-mounted microphone, plus first set of evaluation test data
    * Training data from desk-mounted microphone
    * Development test data from desk-mounted microphone, plus second set of evaluation test data

    There are 90 utterances from each of 92 speakers that are designated as training material for speech recognition algorithms. An additional 48 speakers each read 40 sentences containing only words from a fixed 5,000 word vocabulary and another 40 sentences using a 64,000 word vocabulary, to be used as testing material. Each of the total of 140 speakers also recorded a common set of 18 adaptation sentences. Recordings were made from two microphones: a far-field desk microphone and a head-mounted close-talking microphone.

    Within the train and test sets, speech data are organized by speaker; prompting texts, detailed transcriptions, and speaker information are included in each speaker directory.

    All waveform files have NIST SPHERE headers. Waveform data are compressed using the Shorten algorithm developed by Tony Robinson at Cambridge University, as adapted for use in the NIST SPHERE software package.
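Since all waveform files carry NIST SPHERE headers, a minimal sketch of reading such a header may be useful. It assumes the standard SPHERE layout (an ASCII header beginning with `NIST_1A`, the header size on the second line, `name type value` field lines, and an `end_head` terminator); decoding the Shorten-compressed sample data itself would still require the NIST SPHERE tools.

```python
# Minimal sketch: parse the plain-text NIST SPHERE header that precedes
# the waveform data. Field names such as sample_rate and channel_count
# are conventional SPHERE fields; verify against the corpus documentation.
def read_sphere_header(path):
    """Return the SPHERE header fields as a dict of strings."""
    fields = {}
    with open(path, "rb") as f:
        magic = f.readline().strip()        # first line: b"NIST_1A"
        if magic != b"NIST_1A":
            raise ValueError("not a SPHERE file")
        header_size = int(f.readline())     # second line: header size, e.g. 1024
        f.seek(0)
        header = f.read(header_size).decode("ascii", errors="replace")
    for line in header.splitlines()[2:]:    # skip the magic and size lines
        if line.startswith("end_head"):
            break
        parts = line.split(None, 2)         # name, type flag (e.g. -i), value
        if len(parts) == 3:
            fields[parts[0]] = parts[2]
    return fields
```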

    *Samples*

    Please view the following samples:

    * Head Mounted Mic
    * Desk Mounted Mic
    * Phoneme Alignments
    * Word Alignments

    *Updates*

    On October 1, 2015 the corpus was modified to be released as a web download. Documentation was modified to reflect this.
    • references: Tony Robinson, et al. 1995 WSJCAM0 Cambridge Read News Linguistic Data Consortium, Philadelphia
  • C-001590: West Point Arabic Speech
    *Introduction*

    West Point Arabic Speech was produced by the Linguistic Data Consortium (LDC), catalog number LDC2002S02 and ISBN 1-58563-199-x.

    West Point Arabic Speech contains speech data that was collected and processed by members of the Department of Foreign Languages at the United States Military Academy at West Point and the Center for Technology Enhanced Language Learning (CTELL) as part of an effort called "Project Santiago." The original purpose of this corpus was to train acoustic models for automatic speech recognition that could be used as an aid in teaching Arabic to West Point cadets.

    *Data*

    The corpus consists of 8,516 speech files, totaling 1.7 gigabytes or 11.42 hours of speech data. Each speech file represents one person reciting one prompt from one of four prompt scripts. The utterances were recorded using a Shure SM10A microphone and a RANE Model MS1 pre-amplifier. The files were recorded as 16-bit PCM low-byte-first ("little-endian") raw audio files, with a sampling rate of 22.05 kHz. They were then converted to NIST SPHERE format.

    Approximately 7,200 of the recordings are from native informants and 1,200 files are from non-native informants. The following tables show the breakdown of corpus content in terms of male, female, native and non-native speakers.

    Number of speakers:

                   male   female   total
      native         41       34      75
      non-native     25       10      35
      totals         66       44     110

    Hours of data:

                   male   female   total
      native        6.0      4.4    10.4
      non-native    0.74     0.28    1.02
      totals        6.74     4.68   11.42

    Megabytes of data:

                    male   female    total
      native       918      667     1585
      non-native   111.9     42.8    154.7
      totals      1029.9    709.8   1739.7

    Number of speech files:

                   male   female   total
      native       4107     3163    7270
      non-native    883      363    1246
      totals       4990     3526    8516

    Some of the recording sessions include a handful of utterances that were cut short due to pronunciation mistakes or unexpected interruptions (e.g. phones ringing, doors slamming, etc.). These partial utterances have been retained in the waveform directories and are distinguished from the full-sentence recordings by a trailing "-u" in the filename, before the extension (e.g. "s1_080-u.sph" instead of "s1_080.sph"). The above tables describe all data; both the complete and partial utterances are accounted for. 168 of the 8,516 speech files are partial utterances, and the remaining 8,348 are complete.

    *Updates*

    There are no updates at this time.
    • references: Col. Stephen A. LaRocca and Rajaa Chouairi 2002 West Point Arabic Speech Corpus Linguistic Data Consortium, Philadelphia
  • C-001591: West Point Croatian Speech
    *Introduction*

    This file contains documentation on West Point Croatian Speech, Linguistic Data Consortium (LDC) catalog number LDC2005S28 and ISBN 1-58563-359-3.

    West Point Croatian Speech is a database of digital recordings of spoken Croatian. It was collected by staff and faculty of the Department of Foreign Languages (DFL) and the Center for Technology Enhanced Language Learning (CTELL) to develop acoustic models for speech recognition systems. The US government uses these systems to provide speech-recognition enhanced language learning courseware to government linguists and students enrolled in various government language programs. In addition, parts of this corpus were designed to model question-answer dialogues for use in domain-specific speech-to-speech translation systems.

    The corpus consists of two subcorpora collected in 2000 and 2001 in Zagreb, Croatia. Informants were recruited from the English department at the University of Zagreb and the Croatian Military Academy. The 2000 subcorpus consists entirely of read speech, while the 2001 subcorpus includes free-response answers to questions in addition to read speech.

    The read speech in the two subcorpora was elicited from two different prompt scripts. Each informant in 2000 attempted to read 100 sentences from a total of 200 carefully designed sentences. These sentences were written by Christine Tomei; Dr. Tomei's design analysis can be found in the file design-2000.txt. Informants in 2001 read short text passages extracted from Croatian-language webpages. In all, the scripts used to record read speech contain a total of 6,329 distinct sentences. The read speech prompts are listed in the files read-200[01].txt in the transcripts directory. Each line of these files has two fields separated by a tab, the first denoting the base name of the waveform file, and the second the prompt used in recording the utterance. The read speech data are stored under the Recordings Croatian directory.

    The script used to elicit free response answers contains 143 questions. The text that was actually presented to the informants is in the file named questions.txt in the transcripts directory. Data recorded from these prompts are stored in the Answers Croatian directory.

    The human-performed transcriptions of the informants' answers are listed in the answers.txt file in the transcripts directory. Again, each line of this file has two fields separated by a tab; the first field contains two numbers separated by a slash. The first number is an identification index for the speaker, and the second is an index to the question. The second field contains a word-level transcription of the informant's answer to the question indexed by the second number in the first field. So, for example, in the line:

    1/15	eh rođena je u splitu

    "eh rođena je u splitu" is a transcription of the response speaker one gave to question 15. The corresponding waveform file is stored in the file 15.wav in the directory Answers Croatian1. These recordings were transcribed by Milan Sokolich. Mr. Sokolich also wrote a pronouncing dictionary that includes grammatical tags. His work is stored in the file named raw-lexicon.txt. The file lexicon.txt contains a processed version of the raw-lexicon.txt file.
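A minimal sketch of parsing this two-field layout (a speaker/question key, a tab, then the transcription) might look as follows; `parse_answers` is a hypothetical helper name, not part of the corpus distribution:

```python
# Hypothetical sketch: parse answers.txt-style lines of the form
# "<speaker>/<question>\t<transcription>", as described above.
def parse_answers(lines):
    """Yield (speaker_id, question_id, transcription) triples."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, text = line.partition("\t")        # "1/15" and the answer text
        speaker, _, question = key.partition("/")  # split speaker from question
        yield int(speaker), int(question), text
```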

    Each speaker in the 2001 subcorpus attempted to record 105 utterances by reading 75 sentences and giving 35 free response answers to 35 questions.

    Speech data was collected using 450 MHz Pentium laptop computers running Windows 2000, with a 16-bit data size and a sampling rate of 22,050 Hz. The recording script presented a visual display of the sentence to be recorded. The informant pressed a key and spoke the sentence. The recording was played back for review, allowing the utterance to be re-recorded. A member of the data collection team was on hand during the recording session to verify recordings and provide technical assistance in case of malfunctioning equipment.

    *Samples*

    For an example of the speech in this corpus, please listen to this audio sample.
    • references: Stephen LaRocca, Christine Tomei, and Milan Sokolich 2005 West Point Croatian Speech Corpus Linguistic Data Consortium, Philadelphia
  • C-001592: West Point Heroico Spanish Speech
    *Introduction*

    This file contains documentation on West Point Heroico Spanish Speech, Linguistic Data Consortium (LDC) catalog number LDC2006S37 and ISBN 1-58563-391-7.

    West Point Heroico Spanish Speech is a database of digital recordings of spoken Spanish. It was designed and collected by staff and faculty of the Department of Foreign Languages (DFL) and Center for Technology Enhanced Language Learning (CTELL) to develop acoustic models for speech recognition systems. The U.S. government uses these systems to provide speech-recognition enhanced language learning courseware to government linguists and students enrolled in various government language programs. Additionally, parts of this corpus were designed to model question/answer dialogues for use in domain-specific speech-to-speech translation systems. The corpus consists of two subcorpora, one collected in September 2001 at El Heroico Colegio Militar (HEROICO), the Mexican Military Academy in Mexico City, and the other at USMA at different times since 1997. The USMA subcorpus includes data from non-native speakers and data collected through a throat microphone.

    *Data*

    Two kinds of prompt scripts were used, one to elicit read speech and one for free-response answers to questions. The read speech prompts are also divided into two groups, one designed to elicit speech typical of language learning scenarios and the other for speech from educated native speakers. The scripts used to record read speech have a total of 724 distinct sentences. This number includes 205 short, simple sentences used in typical language learning scenarios. The other 519 sentences were extracted from lecture notes used at USMA in a military readings course. All of the read speech prompts are listed in two files in the transcripts directory: HEROICO-Recordings.txt and USMA-prompts.txt, containing the sentences read by informants at the Mexican Military Academy and USMA, respectively. Each line of these files has two fields separated by a tab, the first denoting the base name of the waveform file, and the second the prompt used in recording the utterance.

    The read speech data collected from informants at HEROICO are stored in the HEROICO/Recordings Spanish directory. The script used to elicit free-response answers contains 143 questions. The text that was actually presented to the informants is in the file named questions.txt in the transcripts directory. Data recorded from these prompts are stored in the HEROICO/Answers Spanish directory.

    The human-performed transcriptions of the informants' answers are listed in the HEROICO-Answers.txt file in the transcripts directory. Again, each line of this file has two fields separated by a tab; the first field contains two numbers separated by a slash. The first number is an identification index for the speaker, and the second is an index to the question. The second field contains a word-level transcription of the informant's answer to the question indexed by the second number in the first field. So, for example, in the line:

    100/10	no ella no tiene barba ni bigote

    "no ella no tiene barba ni bigote" is a transcription of the response speaker 100 gave to question 10. The corresponding waveform file is stored in the file 10.wav in the directory HEROICOAnswers Spanish100. Each speaker in the HEROICO subcorpus attempted to record 100 utterances by reading 75 sentences and giving 25 free-response answers to questions.

    Both native and non-native USMA informants read from the list of 205 simple sentences. The prompts used in the USMA subcorpus are listed in the file USMA-prompts.txt in the transcripts directory. This file has the same two-field format as the above transcription files. Some of the USMA informants wore an additional throat microphone; that data was recorded in a separate stream and stored in files whose names begin with the letter t. Data collected at USMA are stored under the USMA directory. The names of the directories under the USMA directory indicate whether the speaker was native or non-native. The speaker's native country is also indicated in the case of native speakers.

    Speech data was collected at HEROICO using 450 MHz Pentium laptop computers running Windows 2000, with a 16-bit data size and a sampling rate of 22,050 Hz. The recording script presented a visual display of the sentence to be recorded. The informant pressed a key and spoke the sentence. The recording was played back for review, allowing the utterance to be re-recorded. A member of the data collection team was on hand during the recording session to verify recordings and provide technical assistance in case of malfunctioning equipment.

    The data from USMA was collected using several different microphones and formats. Most of the data were recorded on Pentium computers running Linux through a Shure SM-10 head-mounted microphone. Entropic ESPS programs were used in most cases, especially when both head-mounted and throat microphones were used.

    *Samples*

    For an example of the data in this corpus, please listen to this audio sample.
    • references: John Morgan 2006 West Point Heroico Spanish Speech Linguistic Data Consortium, Philadelphia
  • C-001593: West Point Korean Speech
    *Introduction*

    This file contains documentation on West Point Korean Speech, Linguistic Data Consortium (LDC) catalog number LDC2006S36 and ISBN 1-58563-360-7.

    West Point Korean Speech is a database of digital recordings of spoken Korean. Corpus design and data collection were carried out by staff and faculty of the Department of Foreign Languages (DFL) and Center for Technology Enhanced Language Learning (CTELL), located at the United States Military Academy (USMA), West Point, New York. The corpus was designed to develop speech recognition systems that would be used by the US government for speech-recognition enhanced language learning courseware. The prompt scripts were created from 20,000 distinct sentences, along with a subset of prompts designed to elicit free-response answers to questions for use in domain-specific speech-to-speech translation systems. Each speaker attempted to record 100 utterances. Three data collection scripts were designed by Ms. Jennifer Son, a native speaker of Korean, under contract with the Department of Foreign Languages.

    *Samples*

    For an example of the data in this corpus, please listen to the following audio sample.
    • references: John Morgan 2006 West Point Korean Speech Linguistic Data Consortium, Philadelphia
  • C-001594: West Point Russian Speech
    *Introduction*

    West Point Russian Speech was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003S05 and ISBN 1-58563-277-5.

    The West Point Russian Speech corpus was developed at the Department of Foreign Languages (DFL) and the Center for Technology Enhanced Language Learning (CTELL) at the United States Military Academy at West Point. The purpose of the corpus is to provide a set of recordings for the training and development of speaker-independent speech recognition systems for use by West Point cadets enrolled in the Russian language program.

    *Data*

    The corpus consists of 4,181 speech files in SPHERE format, totalling approximately four hours of speech. Approximately 2,290 files are from native informants and 1,891 are from non-native informants.

    The following tables show the breakdown of corpus content in terms of male, female, native and non-native speakers.

    Number of speakers:

                   male   female   total
      native         13       16      29
      non-native     16       10      26
      totals         29       26      55

    Number of speech files:

                   male   female   total
      native       1027     1263    2290
      non-native   1103      788    1891
      totals       2130     2050    4181

    The speech data was collected using laptop computers running Windows NT. Recordings were captured as 16-bit PCM at a sampling rate of 22,050 Hz using a Shure SM10A microphone and a RANE Model MS1 pre-amplifier. A visual display of the sentence, along with a digital recording of the sentence as read by a native speaker, was presented. The informant pressed the Enter key to record the utterance. The informant's recording was played back for review and the utterance was re-recorded if necessary.

    The collection script consists of 96 sentences with a total of 528 tokens and 351 types.

    Each waveform file has a monophone and word level master label file transcription in HTK-format. A concatenated version of the master label files at both the word level and the phone level is provided.
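Assuming the standard HTK master label file layout (a `#!MLF!#` header, a quoted label-file name opening each entry, time-aligned `start end label` lines, and a lone `.` terminator), the concatenated word- or phone-level MLFs could be read with a sketch like this:

```python
# Minimal sketch: read a time-aligned HTK master label file (MLF).
# Times in HTK label files are in 100 ns units; phone-level lines may
# carry an extra score column, which split()[:3] simply ignores.
def read_mlf(lines):
    """Return a dict mapping label-file name -> list of (start, end, label)."""
    labels, current = {}, None
    for line in lines:
        line = line.strip()
        if not line or line == "#!MLF!#":
            continue
        if line.startswith('"'):          # quoted name opens a new entry
            name = line.strip('"')
            current = labels[name] = []
        elif line == ".":                 # lone dot closes the entry
            current = None
        else:
            start, end, label = line.split()[:3]
            current.append((int(start), int(end), label))
    return labels
```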

    The lexicon contains 690 distinct orthographic word forms, including all words found in the collection script.

    *Updates*

    There are no updates available at this time.
    • references: Col. Stephen A. LaRocca and Christine Tomei 2003 West Point Russian Speech Linguistic Data Consortium, Philadelphia