Language Resource Search - SHACHI: Language Resource Metadata Database

Language resource #: 3330 Results 581 - 590 of 2023

C-001062: Multiple-Translation Arabic (MTA) Part 1
Multiple-Translation Arabic (MTA) Part 1 was produced by the Linguistic Data Consortium (LDC); its catalog number is LDC2003T18 and its ISBN is 1-58563-276-7.
- references: C-000615: Arabic Newswire Part 1
- hasVersion: C-001063: Multiple-Translation Arabic (MTA) Part 2
- hasVersion: C-001064: Multiple-Translation Chinese (MTC) Part 2
- hasVersion: C-001065: Multiple-Translation Chinese (MTC) Part 3
- hasVersion: C-001066: Multiple-Translation Chinese (MTC) Part 4
- hasVersion: ,Multiple-Translation Chinese Corpus
C-001063: Multiple-Translation Arabic (MTA) Part 2
Multiple-Translation Arabic (MTA) Part 2 was produced by the Linguistic Data Consortium (LDC); its catalog number is LDC2005T05 and its ISBN is 1-58563-328-3.
- hasVersion: C-001062: Multiple-Translation Arabic (MTA) Part 1
- hasVersion: S-001067,Multiple-Translation Chinese Corpus
- hasVersion: C-001064: Multiple-Translation Chinese (MTC) Part 2
- hasVersion: C-001065: Multiple-Translation Chinese (MTC) Part 3
- hasVersion: C-001066: Multiple-Translation Chinese (MTC) Part 4
C-001064: Multiple-Translation Chinese (MTC) Part 2
- hasVersion: C-001062: Multiple-Translation Arabic (MTA) Part 1
- hasVersion: C-001063: Multiple-Translation Arabic (MTA) Part 2
- hasVersion: C-001065: Multiple-Translation Chinese (MTC) Part 3
- hasVersion: C-001066: Multiple-Translation Chinese (MTC) Part 4
- hasVersion: ,Multiple-Translation Chinese Corpus
C-001065: Multiple-Translation Chinese (MTC) Part 3
Multiple-Translation Chinese (MTC) Part 3 was produced by the Linguistic Data Consortium (LDC); its catalog number is LDC2004T07 and its ISBN is 1-58563-289-9.
- hasVersion: C-001062: Multiple-Translation Arabic (MTA) Part 1
- hasVersion: C-001063: Multiple-Translation Arabic (MTA) Part 2
- hasVersion: C-001064: Multiple-Translation Chinese (MTC) Part 2
- hasVersion: C-001066: Multiple-Translation Chinese (MTC) Part 4
- hasVersion: S-001067,Multiple-Translation Chinese Corpus
C-001066: Multiple-Translation Chinese (MTC) Part 4
Multiple-Translation Chinese (MTC) Part 4 was produced by the Linguistic Data Consortium (LDC); its catalog number is LDC2006T04 and its ISBN is 1-58563-375-5.
- hasVersion: C-001062: Multiple-Translation Arabic (MTA) Part 1
- hasVersion: C-001063: Multiple-Translation Arabic (MTA) Part 2
- hasVersion: C-001064: Multiple-Translation Chinese (MTC) Part 2
- hasVersion: C-001065: Multiple-Translation Chinese (MTC) Part 3
- hasVersion: S-001067,Multiple-Translation Chinese Corpus
C-001068: N4 NATO Native and Non-Native Speech
*Introduction*

This file contains documentation on the N4 NATO Native and Non-Native Speech Corpus, Linguistic Data Consortium (LDC) catalog number LDC2006S13 and ISBN 1-58563-344-5.

The N4 NATO Native and Non-Native Speech corpus was developed by the NATO research group on Speech and Language Technology in order to provide a military-oriented database for multilingual and non-native speech processing studies.

Speech data was recorded in the naval transmission training centers of four countries (Germany, the Netherlands, the United Kingdom, and Canada). The material consists of native and non-native speakers using NATO English procedure between ships and reading from a text, "The North Wind and the Sun," in both English and the speaker's native language.

Speech technology is covering an increasing number of languages, and systems are becoming more robust with regard to speech variability such as speaking style and accent. However, for real applications, especially in a multilingual and multinational context, further robustness to regional and even non-native accents is necessary. Among the numerous corpora available for speech research, few have specifically addressed this issue.

The NATO Speech and Language Technology group decided to create a corpus geared towards the study of non-native accents. The group chose naval communications as the common task because it naturally includes a great deal of non-native speech and because there were training facilities where data could be collected in several countries.

*Data*

The database was collected in four countries (Germany, the Netherlands, the United Kingdom, and Canada) during naval communication training sessions in 2000-2002. For each country, the main part of the recordings consists of a NATO Naval procedure in English, where a typical sentence sounds like "This is alpha, whiskey, roger. I make two seven zero six hostile, two seven zero six. Out." In addition, each speaker read a text, "The North Wind and the Sun," in English and in his or her native language.

The audio material was recorded on DAT and downsampled to 16 kHz, 16-bit. All the audio files have been manually transcribed and annotated with speaker identities using the Transcriber tool. Navy procedure recordings and text readings have been stored in different files. The first digit in the filename indicates the type of speech.

Among speech segments, the durations of the Navy procedure recordings range from 1.3 h to 2.3 h per country, for a total of 7.5 h. The durations of the native-language text readings range from 1.5 min to 22.9 min, for a total of around one hour.

Recording durations per country, in hours:

| | CA | GE | NL | UK | All |
|---|---|---|---|---|---|
| Signal | 5.30 | 3.20 | 5.00 | 6.30 | 19.80 |
| Silence | 3.00 | 0.56 | 2.00 | 4.70 | 10.26 |
| Speech | 2.30 | 2.64 | 3.00 | 1.60 | 9.54 |
| Navy proc | 2.00 | 1.90 | 2.30 | 1.30 | 7.50 |
| Read text | 0.30 | 0.74 | 0.70 | 0.30 | 2.04 |
| Non-native | 0.27 | 0.37 | 0.32 | 0.00 | 0.96 |
| Native | 0.03 | 0.37 | 0.38 | 0.30 | 1.08 |

The database contains the following information about each speaker: gender, age, weight, length, possible speaking or hearing disorders, education level, living area, accent, second language, and the year English was learned (for non-native speakers). Speaker accents vary widely from country to country. The speakers' average age was 22.6 years. Nineteen women participated, accounting for 18% of the study participants. There were a total of 115 speakers.

| | CA | GE | NL | UK | All |
|---|---|---|---|---|---|
| #Speakers | 22 | 51 | 31 | 11 | 115 |
| #Women | 5 | 0 | 9 | 5 | 19 |
| Age | 22-35 | 17-23 | 17-61 | 19-62 | 17-62 |
| Age mean | 28.3 | 20.1 | 21 | 27.5 | 22.6 |
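
The duration figures above appear to be additive: signal splits into silence and speech, speech into Navy procedure and read text, and read text into non-native and native readings. The following is a minimal Python sketch that checks this against the published numbers. The dictionary layout and the helper names `rows_add_up` and `all_column_matches` are illustrative choices, not part of the corpus documentation.

```python
# Hours per country (CA, GE, NL, UK) followed by the "All" column,
# copied from the duration table above.
DURATIONS = {
    "Signal":     [5.30, 3.20, 5.00, 6.30, 19.80],
    "Silence":    [3.00, 0.56, 2.00, 4.70, 10.26],
    "Speech":     [2.30, 2.64, 3.00, 1.60, 9.54],
    "Navy proc":  [2.00, 1.90, 2.30, 1.30, 7.50],
    "Read text":  [0.30, 0.74, 0.70, 0.30, 2.04],
    "Non-native": [0.27, 0.37, 0.32, 0.00, 0.96],
    "Native":     [0.03, 0.37, 0.38, 0.30, 1.08],
}

# Speaker counts per country followed by the "All" column.
SPEAKERS = {
    "#Speakers": [22, 51, 31, 11, 115],
    "#Women":    [5, 0, 9, 5, 19],
}


def rows_add_up(total, *parts, tol=0.005):
    """True if each column of `total` equals the column-wise sum of `parts`."""
    return all(abs(t - sum(col)) <= tol for t, col in zip(total, zip(*parts)))


def all_column_matches(row, tol=0.005):
    """True if the last entry ("All") equals the sum of the country entries."""
    return abs(row[-1] - sum(row[:-1])) <= tol


hierarchy_ok = (
    rows_add_up(DURATIONS["Signal"], DURATIONS["Silence"], DURATIONS["Speech"])
    and rows_add_up(DURATIONS["Speech"], DURATIONS["Navy proc"], DURATIONS["Read text"])
    and rows_add_up(DURATIONS["Read text"], DURATIONS["Non-native"], DURATIONS["Native"])
)
totals_ok = all(all_column_matches(r) for r in DURATIONS.values()) and all(
    all_column_matches(r) for r in SPEAKERS.values()
)
print(hierarchy_ok, totals_ok)
```

The tolerance of 0.005 hours only absorbs floating-point rounding of the two-decimal figures.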

*Samples*

For an example of the speech data in this corpus, please listen to this audio sample.
C-001069: 2003 NIST Language Recognition Evaluation
*Introduction*

The goal of the NIST Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. The series had its first evaluation in 1996. The 2003 NIST Language Recognition Evaluation (LRE-03) was part of this ongoing series of evaluations of language recognition technology.

Further information regarding this evaluation may be found on the 2003 NIST Language Recognition Evaluation website and in the NIST 2003 evaluation plan.

The task evaluated was the detection of a given target language. Given a test segment of speech, a target language was assigned as a test hypothesis, and the task was to determine whether this test hypothesis was true or false. This release contains both the 1996 and 2003 NIST Language Recognition Evaluations.
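
A detection task of this kind is commonly summarized by two error rates: how often a true target-language hypothesis is rejected (a miss) and how often a false one is accepted (a false alarm). The sketch below is a generic Python illustration of that bookkeeping, not the official NIST scoring procedure, which is defined in the evaluation plan. The scores, the threshold, and the function name `detection_error_rates` are all hypothetical.

```python
def detection_error_rates(trials, threshold):
    """Return (miss rate, false-alarm rate) for a list of detection trials.

    Each trial is a (score, is_target) pair: `score` is a system's confidence
    that the segment is in the hypothesized language, and `is_target` says
    whether the hypothesis is actually true. The hypothesis is accepted when
    the score reaches the threshold.
    """
    misses = false_alarms = targets = non_targets = 0
    for score, is_target in trials:
        accepted = score >= threshold
        if is_target:
            targets += 1
            misses += not accepted          # true hypothesis rejected
        else:
            non_targets += 1
            false_alarms += accepted        # false hypothesis accepted
    p_miss = misses / targets if targets else 0.0
    p_fa = false_alarms / non_targets if non_targets else 0.0
    return p_miss, p_fa


# Hypothetical scores, for illustration only.
example_trials = [(0.9, True), (0.4, True), (0.7, False), (0.1, False)]
print(detection_error_rates(example_trials, threshold=0.5))
```

Raising the threshold trades false alarms for misses, which is why both rates are reported rather than a single accuracy figure.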

*Data*

Each speech file is one side of a "four wire" telephone conversation represented as 8-bit, 8 kHz mu-law data. There are 11,830 speech files in SPHERE (.sph) format for a total of around forty-six hours of speech. The speech data was compiled from the LDC's CALLFRIEND, CALLHOME, and Switchboard-2 corpora. Each file contains one test segment. The test segments are divided into three-second, ten-second, and thirty-second tests, each in its own directory.
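
Mu-law audio stores each sample in one byte, so it is usually expanded to 16-bit linear PCM before further processing. The following Python sketch shows the standard G.711 mu-law expansion applied to raw sample bytes. It assumes the SPHERE header has already been stripped, which this sketch does not do, and the function names are illustrative.

```python
BIAS = 0x84          # bias used by the G.711 mu-law encoding
SAMPLE_RATE_HZ = 8000  # sampling rate stated for this corpus


def mulaw_byte_to_linear(byte):
    """Expand one 8-bit mu-law code to a signed 16-bit linear PCM value."""
    byte = ~byte & 0xFF                      # mu-law codes are stored inverted
    magnitude = ((byte & 0x0F) << 3) + BIAS  # mantissa plus bias
    magnitude <<= (byte & 0x70) >> 4         # scale by the 3-bit exponent
    return BIAS - magnitude if byte & 0x80 else magnitude - BIAS


def decode_mulaw(data):
    """Decode a bytes object of mu-law samples into a list of integers."""
    return [mulaw_byte_to_linear(b) for b in data]


def duration_seconds(data):
    """Duration of a headerless mu-law buffer: one byte per sample."""
    return len(data) / SAMPLE_RATE_HZ


samples = decode_mulaw(bytes([0xFF, 0x7F, 0x00, 0x80]))
print(samples)
```

Because each sample occupies one byte at 8 kHz, a three-second test segment corresponds to about 24,000 bytes of audio payload.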

*Samples*

For an example of the data in this corpus, please listen to this audio sample.

*Updates*

A typo was fixed in the index.html file. There are 11,830 sphere files, not 11,839. The updated index file is available in the online docs folder.
- hasVersion: 1996 NIST Language Recognition Evaluation
- hasVersion: 2005 NIST Language Recognition Evaluation
- hasVersion: 2007 NIST Language Recognition Evaluation
C-001070: NIST Meeting Pilot Corpus Speech
*Introduction*

NIST Meeting Pilot Corpus Speech consists of approximately 15 hours of English meeting speech and was collected in the NIST Meeting Data Collection Laboratory for the NIST Automatic Meeting Recognition Project. The corresponding transcripts are available as the NIST Meeting Pilot Corpus Transcripts and Metadata, while the video files will be published later as NIST Meeting Pilot Corpus Video.

For more information regarding the data collection conditions, meeting scenarios, transcripts, speaker information, recording logs, errata, and other ancillary data for the corpus, please consult the NIST project website for this corpus.

*Data*

The data in this corpus consists of 369 SPHERE audio files generated from 19 meetings (comprising about 15 hours of meeting room data and amounting to about 32 GB) recorded between November 2001 and December 2003.

Each meeting was recorded using two wireless "personal" mics attached to each meeting participant: a close-talking noise-cancelling boom mic and an omni-directional lapel mic. Each meeting was also recorded using three omni-directional table mics and a four-channel directional table mic covering 360 degrees (each channel is recorded in a separate file). Each individual channel was converted from its 48 kHz, 24-bit linear PCM source format to 16 kHz, 16-bit linear PCM audio in SPHERE-formatted files.
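
The conversion described above reduces the data rate of each channel, and the stated corpus size can be roughly cross-checked against the file count. The Python sketch below works through that arithmetic. It assumes decimal gigabytes, ignores SPHERE header overhead, and assumes each file spans roughly one full meeting; none of these assumptions is stated in the corpus description.

```python
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample):
    """Data rate of one channel of uncompressed linear PCM."""
    return sample_rate_hz * bits_per_sample // 8


source_rate = pcm_bytes_per_second(48_000, 24)  # source format
target_rate = pcm_bytes_per_second(16_000, 16)  # distributed format
reduction = source_rate / target_rate

# Rough cross-check of the published figures (assumptions noted above).
corpus_bytes = 32 * 10**9     # "about 32 GB", taken as decimal gigabytes
file_count = 369              # SPHERE files in the corpus
meeting_count = 19            # meetings recorded
meeting_hours = 15            # "about 15 hours" of meeting room data

minutes_per_file = corpus_bytes / target_rate / file_count / 60
minutes_per_meeting = meeting_hours * 60 / meeting_count

print(source_rate, target_rate, reduction)
print(round(minutes_per_file, 1), round(minutes_per_meeting, 1))
```

Under these assumptions the average file length and the average meeting length come out within a few minutes of each other, which is consistent with one file per channel per meeting.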

*Updates*

There are no updates available at this time.
- hasVersion: C-001071: NIST Meeting Pilot Corpus Transcripts and Metadata
C-001071: NIST Meeting Pilot Corpus Transcripts and Metadata
This corpus contains the full speech transcripts created by the Linguistic Data Consortium for the NIST Automatic Meeting Recognition Project as well as a metadata database with useful information about the meeting forums, topics, participants and recording conditions and equipment. The corresponding speech files are available as the NIST Meeting Pilot Corpus Speech, while the video files will be published later as NIST Meeting Pilot Corpus Video.
- hasVersion: C-001070: NIST Meeting Pilot Corpus Speech
C-001072: NTIMIT
The NTIMIT corpus was developed by the NYNEX Science and Technology Speech Communication Group to provide a telephone bandwidth adjunct to TIMIT.

NTIMIT was collected by transmitting all 6,300 original TIMIT recordings through a telephone handset and over various channels in the NYNEX telephone network and redigitizing them. The recordings were transmitted through ten Local Access and Transport Areas, half of which required the use of long-distance carriers.

In order to calibrate the transmission characteristics of the various channels, stationary 1 kHz and frequency-sweeping tones were also recorded for each of the transmission channels. These are found on Disc 2.

The re-recorded waveforms were time-aligned with the original TIMIT waveforms so that the TIMIT time-aligned transcriptions can be used with the NTIMIT corpus as well. In addition to the documentation on the disc, see Jankowski et al., "NTIMIT: A Phonetically Balanced, Continuous Speech, Telephone Bandwidth Speech Database," Proc. ICASSP-90, April 1990. NYNEX retains full copyright on the corpus and all associated materials.

*Updates*

(02/08/2016) All SPHERE files were converted to FLAC and the corpus was made a web download. Documentation was edited to reflect these changes; please note that some documentation may still refer to the corpus as being on CD-ROM and containing SPHERE files.
