Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing items 681 - 690 of 2023.

C-001249: 2004 NIST Speaker Recognition Evaluation
*Introduction*

The 2004 NIST Speaker Recognition evaluation is part of an ongoing series of yearly evaluations conducted by NIST (National Institute of Standards and Technology). These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition. To this end the evaluation was designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible.

NIST has been coordinating Speaker Recognition Evaluations since 1996. Each evaluation begins with the announcement of the official evaluation plan, which clearly states the rules and tasks involved in the evaluation. The evaluation culminates in a follow-up workshop, where NIST reports the official results and researchers share their findings.

The data consists of conversational telephone speech collected by the LDC.

Additional documentation is available from the NIST website at http://www.itl.nist.gov/iad/mig/tests/sre/2004/index.html.

*Samples*

This audio sample and its transcript provide an example of the data contained in this corpus.
C-001250: ATIS0 Complete
*Introduction*

The ATIS0 Corpus is comprised of spontaneous data from 36 speakers; read versions of the data from 20 of those speakers, along with some adaptation material; and extensive speaker dependent material from the ATIS domain, read by ten of the same speakers.

LDC also released: LDC93S4B - ATIS0 Pilot, LDC93S4B-2 - ATIS0 Read, and LDC93S4B-3 - ATIS0 SD-Read

*Data*

All ATIS speech data is recorded at a 16 kHz sample rate with 16-bit quantization, from two different microphones: a close-talking model (Sennheiser HMD414) and a desk-top model (Crown PCC-160).

ATIS0 Pilot contains spontaneous utterances elicited in a "Wizard-of-Oz" simulation, along with the relational database containing the travel information (excluding connecting flights). 36 speakers produced a total of 912 utterances.

ATIS0 Read contains "read" versions of the spontaneous utterances for 20 of the 36 speakers above, for a total of 478 productions. This is supplemented by a set of 40 "adaptation" sentences read by each of the 20 speakers.

ATIS0 SD-Read contains "read" speech in the ATIS domain for ten of the speakers on ATIS0 Pilot. They read a total of 3,171 utterances, or approximately 317 utterances per speaker. This data was collected for the purpose of training speaker-dependent speech recognition systems for the ATIS0 domain. This section also contains the close-talking (Sennheiser) microphone data and corresponding data for the desk-top (Crown PCC-160) microphone. Thus there are 6,342 waveform files in this section.
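
The SD-Read figures above can be cross-checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in the description; the variable names are illustrative and are not part of the corpus documentation.

```python
# Figures quoted in the ATIS0 SD-Read description.
SPEAKERS = 10
UTTERANCES = 3171
MICROPHONES = 2  # close-talking (Sennheiser) and desk-top (Crown PCC-160)

# Each utterance was captured once per microphone.
utterances_per_speaker = UTTERANCES / SPEAKERS
waveform_files = UTTERANCES * MICROPHONES

print(f"{utterances_per_speaker:.1f} utterances per speaker")
print(f"{waveform_files} waveform files")
```

The per-speaker average rounds to the "approximately 317" stated above, and doubling the utterance count for the two microphones gives the 6,342 waveform files.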

*Samples*

Please view this audio sample and transcript sample.

*Updates*

None at this time.
- hasVersion: ATIS0 Pilot
- hasVersion: C-000599: ATIS0 Read
- hasVersion: ATIS0 SD-Read
C-001251: Arabic Broadcast News Speech
*Introduction*

Arabic Broadcast News Speech consists of 10 hours of speech recorded by the Linguistic Data Consortium (LDC) from Voice of America satellite radio news broadcasts in Arabic transmitted between June 2000 and January 2001. The corresponding transcripts are available as Arabic Broadcast News Transcripts (LDC2006T20).

This work was undertaken in the Networking Data Centers (NetDC) project (MLIS-5017, NSF IIS-9982201) in conjunction with the European Language Resources Association (ELRA). ELRA collected 22.5 hours of Arabic broadcast data from Radio Orient (France) that is available in NetDC Arabic BNSC (Broadcast News Speech Corpus) ELRA-S0157. The goal of the NetDC project was to improve the infrastructure for language resources by designing and implementing new modes of cooperation between LDC and ELRA.

*Data*

The recordings were captured from a dedicated satellite receiver and stored as 16-bit PCM, 16-kHz, single-channel, in NIST SPHERE format. The duration of each recording is either 60 minutes or 120 minutes, depending on the VOA broadcast schedule; the date (YYYYMMDD), start-time and end-time (HHMM EST) for each recording are indicated in the file names. The sample data are not compressed.
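
As a rough guide to storage requirements, the audio payload of an uncompressed recording follows directly from the format parameters above. The sketch below is an estimate only: `pcm_bytes` is a hypothetical helper, and the result ignores the SPHERE file header.

```python
def pcm_bytes(minutes, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Size in bytes of uncompressed PCM audio of the given duration.

    Defaults match the format described above (16-bit, 16 kHz, mono).
    The file header is not included in the estimate.
    """
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return bytes_per_second * minutes * 60


# The two recording durations mentioned in the description.
for minutes in (60, 120):
    size = pcm_bytes(minutes)
    print(f"{minutes} min: {size:,} bytes (~{size / 1e6:.1f} MB)")
```

Under these assumptions a 60-minute recording holds 115,200,000 bytes of samples and a 120-minute recording twice that.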

*Samples*

For an example of the speech in this corpus, please listen to this audio sample (wav format).
C-001253: CSLU: Stories v 1.2
*Introduction*

This file contains documentation on CSLU: Stories V1.2, Linguistic Data Consortium (LDC) catalog number LDC2006S14 and ISBN 1-58563-366-6.

CSLU: Stories contains extemporaneous speech collected from English speakers in the CSLU Multilanguage Telephone Speech data collection. Each speaker was asked to speak on a topic of his or her choice for one minute. Those utterances are collected in the Stories corpus.

*Data*

The Stories corpus comprises:

* Speech files for the 702 calls
* Time-aligned word level transcriptions (and corresponding comment files) for approximately 322 stories
* Word transcriptions (not time aligned) for 702 stories
* Time-aligned phonetic labels for 702 stories

*Samples*

For an example of the data in this corpus, please listen to this audio sample.
C-001255: Chinese Treebank Final Release
This publication contains the Chinese Penn Treebank Project Corpus Final Release, produced by:
Principal Investigators:
Martha Palmer, Mitch Marcus, Tony Kroch
Consultants:
Martha Palmer, Mitch Marcus, Tony Kroch, Shizhe Huang, Mary Ellen Okurowski, John Kovarik, Boyan A. Onyshkevyc
Project Managers and Guideline Designers:
Fei Xia, Nianwen Xue
Annotators:
Fu-Dong Chiou, Nianwen Xue
Programming support:
Zhibiao Wu
Published by the Linguistic Data Consortium (LDC), catalog number LDC2000T48, ISBN 1-58563-187-6. The Chinese Penn Treebank Project started in Summer 1998. The goal is the creation of a 100,000-word corpus of Chinese with syntactic bracketing. More information is available at The Chinese Treebank Project.
C-001256: CSR-II (WSJ1) Complete
LDC94S13A - Complete CSR-II corpus

LDC94S13B - CSR-II Sennheiser speech

LDC94S13C - CSR-II Other speech

*Data*

The complete WSJ1 corpus contains approximately 78,000 training utterances (73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus contains approximately 8,200 "conventional" development test utterances (eight hours of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus, the entire corpus was collected using two microphones, so the amount of speech in the entire corpus is about 162 hours.
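
The 162-hour total follows from the training and development test durations once both microphone channels are counted. The lines below are just a sanity check on the approximate figures quoted above, with illustrative variable names.

```python
# Approximate durations quoted in the WSJ1 description (single microphone).
TRAINING_HOURS = 73
DEV_TEST_HOURS = 8
MICROPHONES = 2  # the entire corpus was collected using two microphones

single_microphone_hours = TRAINING_HOURS + DEV_TEST_HOURS
total_hours = single_microphone_hours * MICROPHONES

print(f"{single_microphone_hours} hours per microphone, {total_hours} hours in total")
```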

In early 1993, a "Hub and Spoke" test paradigm was designed, calling for eleven test sets, each a specific variation of the basic or "hub" condition. The eleven Hub and Spoke Development and Evaluation Test sets each contain approximately 7,500 waveforms (eleven hours of speech).

WSJ1 waveforms have been compressed by about 2:1 using the SPHERE-embedded "Shorten" compression algorithm developed at Cambridge University.

*Samples*

Please listen to this audio sample.

*Updates*

The CD-ROM labeled "Evaluation Test Data, Part 1" (NIST Speech Disk 13-32.1) contains the file wsj1/doc/lng_modl/base_lm/tcb20onp.z ("WSJ1/DOC/LNG_MODL/BASE_LM/TCB20ONP.Z" on a Windows OS). Please note that even though this file has the ".z" extension, it is not a compressed file. In order to use the file, simply ignore the ".z" extension.
C-001257: FFMTIMIT
The FFMTIMIT corpus contains the previously unreleased secondary microphone waveforms for the TIMIT Acoustic-Phonetic Continuous Speech corpus. The primary microphone waveforms, which were recorded using a close-talking noise-cancelling head-mounted Sennheiser microphone (model HMD-414), are available from the LDC on NIST Speech Disc 1-1.1 (LDC93S1). The secondary microphone used in the recording of the TIMIT corpus was a Bruel & Kjaer 1/2" free-field microphone (model 4165). While the Sennheiser microphone recordings are relatively "clean" with respect to non-speech noise, the FFMTIMIT recordings include significant low-frequency noise, which was due to the HVAC system and mechanical vibration transmitted through the floor of the double-walled sound booth used in recording. Because it is noisier than its TIMIT counterpart, the FFMTIMIT data may be used in the development of more noise-robust speech recognition systems. In addition, this data may be of value to researchers involved in vocal tract modeling because the B&K microphone has an extremely flat free-field frequency response and calibration tones are provided.

Note that the B&K TIMIT data contained in this release has not been processed through any highpass filter (e.g., the 1,581-point filter described in the paper "The DARPA Speech Recognition Research Database" by Fisher, Doddington and Goudie-Marshall in "DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM," NISTIR 4930 / NTIS Order No. PB93-173938).
- isRequiredBy: C-001303: TIMIT Acoustic-Phonetic Continuous Speech Corpus
C-001258: Gulf Arabic Conversational Telephone Speech
*Introduction*

This database contains 975 Gulf Arabic speakers taking part in spontaneous telephone conversations in Colloquial Gulf Arabic. A total of 976 conversation sides are provided (one speaker appears on two distinct calls). The average duration per side is about 5.7 minutes.

This corpus was collected and transcribed in 2004 by Appen Pty Ltd (Appen), Sydney, Australia.

*Data*

The single-channel files represent just one side of a normal conversation. The "devtest" set represents a relatively balanced (representative) sample drawn from the total pool of collected calls, based on a test-set selection process applied by the National Institute of Standards and Technology (NIST) and based on demographic, phone and audit information as provided by Appen.

*Samples*

For an example of the data contained in this corpus, please listen to this audio sample (wav).
C-001259: Iraqi Arabic Conversational Telephone Speech
*Introduction*

This database contains 474 Iraqi Arabic speakers taking part in spontaneous telephone conversations in Colloquial Iraqi Arabic. A total of 478 conversation sides are provided (most speakers appear only once), and most of these sides belong to recordings that capture both sides of a conversation (that is, 202 two-channel recordings plus 74 single-channel recordings). The average duration per call is about 6 minutes, so each call side contains about 3 minutes of speech, on average.
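
The count of conversation sides can be reconciled with the recording counts as follows. This is a small check on the figures given above, not part of the corpus documentation; the variable names are illustrative.

```python
# Recording counts quoted in the description.
TWO_CHANNEL_RECORDINGS = 202     # both sides of the conversation captured
SINGLE_CHANNEL_RECORDINGS = 74   # only one side captured

recordings = TWO_CHANNEL_RECORDINGS + SINGLE_CHANNEL_RECORDINGS
conversation_sides = 2 * TWO_CHANNEL_RECORDINGS + SINGLE_CHANNEL_RECORDINGS

print(f"{recordings} recordings, {conversation_sides} conversation sides")
```

Two sides for each two-channel recording plus one side for each single-channel recording gives the 478 sides stated above.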

This corpus was collected and transcribed in 2003 and 2004 by Appen Pty Ltd, Sydney, Australia.

*Samples*

For an example of the speech contained in this corpus, please listen to this sample audio file in wav format.
- hasVersion: C-000722: Iraqi Arabic Conversational Telephone Speech, Transcripts
C-001260: Penn Chinese Treebank
Penn's Chinese Language Processing program is anchored by linguistic corpora annotated with morphological, syntactic, semantic and discourse structures. The Penn Chinese Treebank is a segmented, part-of-speech tagged, and fully bracketed corpus that currently has 500 thousand words (over 824K Chinese characters).
Task: Building a segmented, POS tagged and bracketed Chinese corpus. The data consists of Xinhua newswire, Hong Kong news and articles from Sinorama news magazine.
Project Status: The Chinese TreeBank (CTB) version 4.0, which has 404K words, has been officially released via the Linguistic Data Consortium. CTB 5.0, which will have 507K words, is also in the LDC data release pipeline. It will be available at the end of 2004.
Workshops and meetings
1st CLP Workshop (6-7/98), Philadelphia, USA
meeting during ACL-98, Montreal, Canada (8/98)
meeting during ICCIP-98, Beijing, China (11/98)
meeting during ACL-99, Maryland, USA (6/99)
2nd CLP Workshop (10/00), Hong Kong, China
- isReplacedBy: the Chinese Penn Treebank Project
