Number of registered language resources: 3,330
Items 1821 - 1830 of 2,023
-
C-004617: REPERE Evaluation Package
Multimodal/Multimedia Resources
The REPERE project (REconnaissance de PERsonnes dans des Emissions audiovisuelles) consists of a series of three evaluation campaigns for multimedia information processing systems. The project was funded by the DGA (Délégation Générale de l'Armement, France).
The REPERE Evaluation Package contains the visual annotation of 60 hours of French news TV shows, for the purpose of person recognition within TV programs. This annotation concerns both persons and written information appearing on screen. The aim for the evaluated systems is to answer the following questions automatically:
- What is, at each moment, the identity of the persons appearing on screen?
- What is, at each moment, the identity of the persons who are speaking?
- Who are, at each moment, the persons whose names are spoken?
- Who are, at each moment, the persons whose names appear on screen?
In order to measure the quality of the answers given by the evaluated systems, a manual description of the videos has been produced. The speech of the persons was segmented, transcribed and annotated. Visual information was also annotated manually: detection of heads, description of embedded text and person identification.
Provided data consists of:
- video files with indexes and manual transcriptions in XGTF format (Viper),
- audio files in WAV format with transcriptions in TRS format (Transcriber).
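The TRS files produced by Transcriber are XML. A minimal sketch of extracting speaker turns from one, assuming the standard Transcriber layout of Turn elements carrying speaker/startTime/endTime attributes (the inline sample is illustrative, not taken from the corpus):

```python
import xml.etree.ElementTree as ET

def read_trs_turns(xml_text):
    """Extract (speaker, start, end, text) tuples from Transcriber XML."""
    root = ET.fromstring(xml_text)
    turns = []
    for turn in root.iter("Turn"):
        # Collect all text inside the Turn, including text after <Sync/> marks.
        words = " ".join(t.strip() for t in turn.itertext() if t.strip())
        turns.append((turn.get("speaker", ""),
                      float(turn.get("startTime")),
                      float(turn.get("endTime")),
                      words))
    return turns

sample = """<Trans><Episode><Section type="report" startTime="0" endTime="4.2">
<Turn speaker="spk1" startTime="0" endTime="4.2"><Sync time="0"/>bonjour a tous</Turn>
</Section></Episode></Trans>"""
print(read_trs_turns(sample))
```

Real Transcriber files carry more markup (events, comments, overlapping speakers); this sketch only flattens each turn's text.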
This package includes the material that was used for the REPERE evaluation campaign. It includes resources, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of this evaluation package is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself. -
C-004620: TDT2 Careful Transcription Audio
*Introduction*
TDT2 (Topic Detection and Tracking) Careful Transcription Audio was developed by the Linguistic Data Consortium (LDC) and contains English broadcast news audio recordings collected by LDC in 1998. Corresponding transcripts are available in TDT2 Careful Transcription Text LDC2000T44.
Topic Detection and Tracking refers to automatic techniques for finding topically-related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection) and track the reoccurrence of old or new events (tracking).
*Data*
This publication contains 1998 broadcasts from the following sources: ABC News, Cable News Network, Public Radio International and Voice of America.
*Samples*
For an example of the data in this corpus, please review this audio sample.
*Updates*
There are no updates at this time. -
C-004628: 2000 HUB5 English Evaluation Speech
*Introduction*
2000 HUB5 English Evaluation was developed by the Linguistic Data Consortium (LDC) and consists of English conversational telephone speech used in the 2000 HUB5 evaluation sponsored by NIST (National Institute of Standards and Technology).
The Hub5 evaluation series focused on conversational speech over the telephone with the particular task of transcribing conversational speech into text. Its goals were to explore promising new areas in the recognition of conversational speech, to develop advanced technology incorporating those ideas and to measure the performance of new technology. Further information about the evaluation can be found on the NIST HUB5 website.
*Data*
The source data consists of conversational telephone speech collected by LDC: (1) 20 unreleased telephone conversations from the Switchboard studies in which recruited speakers were connected through a robot operator to carry on casual conversations about a daily topic announced by the robot operator at the start of the call; and (2) 20 telephone conversations from CALLHOME American English Speech which consists of unscripted telephone conversations between native English speakers.
The audio files are two-channel interleaved mu-law in NIST SPHERE format. The SPHERE headers have been modified from the original evaluation data by the addition of sample checksums to the CALLHOME data files. A documentation table contains information on the speech segments.
Corresponding transcripts are available in 2000 HUB5 English Evaluation Transcripts (LDC2003T43).
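A SPHERE file begins with a plain-text header ("NIST_1A", the header size, then "field -type value" lines up to "end_head") followed by the raw audio bytes; in a two-channel interleaved mu-law file the one-byte samples alternate between channels. A minimal sketch, with an illustrative synthetic header (real evaluation files carry more fields than shown here):

```python
def read_sphere_header(raw: bytes):
    """Parse a NIST SPHERE header: 'NIST_1A', the header size in bytes,
    then 'field -type value' lines terminated by 'end_head'."""
    lines = raw[:1024].decode("ascii", "replace").splitlines()
    assert lines[0] == "NIST_1A"
    hdr_size = int(lines[1])
    fields = {}
    for line in lines[2:]:
        if line.strip() == "end_head":
            break
        name, ftype, value = line.split(None, 2)
        # '-i' marks integer fields; everything else is kept as a string.
        fields[name] = int(value) if ftype.startswith("-i") else value
    return hdr_size, fields

def split_channels(ulaw_bytes: bytes):
    """Two-channel interleaved mu-law: even bytes are channel A, odd are B."""
    return ulaw_bytes[0::2], ulaw_bytes[1::2]

sample = (b"NIST_1A\n   1024\n"
          b"channel_count -i 2\nsample_rate -i 8000\n"
          b"sample_coding -s4 ulaw\nend_head\n")
print(read_sphere_header(sample))
```

After parsing, the audio itself would start at offset `hdr_size` in the file; decoding mu-law to linear PCM is a separate step not shown here.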
*Samples*
Please listen to this audio sample.
*Updates*
There are no updates at this time.
- hasFormat: N-004726: 2000 HUB5 English Evaluation Transcripts
- references: C-000647: CALLHOME American English Speech
-
C-004629: 1997 HUB5 English Evaluation
*Introduction*
1997 HUB5 English Evaluation was developed by the Linguistic Data Consortium (LDC) and consists of English conversational telephone speech and associated transcripts used in the 1997 HUB5 evaluation sponsored by NIST (National Institute of Standards and Technology).
The Hub5 evaluation series focused on conversational speech over the telephone with the particular task of transcribing conversational speech into text. Its goals were to explore promising new areas in the recognition of conversational speech, to develop advanced technology incorporating those ideas and to measure the performance of new technology. Further information about the evaluation can be found on the NIST HUB5 website and in The 1997 HUB-5E Evaluation Plan for Recognition of Conversational Speech over the Telephone in English, included in this release.
*Data*
The source data consists of conversational telephone speech collected by LDC: (1) 20 telephone conversations from the Switchboard-2 studies (LDC98S75, LDC98S79) in which recruited speakers were connected through a robot operator to carry on casual conversations about a daily topic announced by the robot operator at the start of the call; and (2) 20 telephone conversations from CALLHOME American English Speech which consists of unscripted telephone conversations between native English speakers.
The audio files are in sphere format. The sphere headers have been modified from the original evaluation data by the addition of sample checksums to the CALLHOME data files. The corresponding transcripts are presented in text format.
*Samples*
Please listen to this audio sample and view this transcript sample.
*Updates*
There are no updates at this time.
*Licensing*
This is a members' only release.
- references: C-000647: CALLHOME American English Speech
- references: C-001284: Switchboard-2 Phase I
- references: C-000738: Switchboard-2 Phase II
-
C-004640: 2008 NIST Speaker Recognition Evaluation Training Set Part 1
*Introduction*
2008 NIST Speaker Recognition Evaluation Training Set Part 1 was developed by LDC and NIST (National Institute of Standards and Technology). It contains 640 hours of multilingual telephone speech and English interview speech along with transcripts and other materials used as training data in the 2008 NIST Speaker Recognition Evaluation (SRE).
SRE is part of an ongoing series of evaluations conducted by NIST. These evaluations are an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text independent speaker recognition. To this end the evaluation is designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible to those wishing to participate.
The 2008 evaluation was distinguished from prior evaluations, in particular those in 2005 and 2006, by including not only conversational telephone speech data but also conversational speech data of comparable duration recorded over a microphone channel involving an interview scenario.
*Data*
The speech data in this release was collected in 2007 by LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the International Computer Science Institute (ICSI) at the University of California, Berkeley. This collection was part of the Mixer 5 project, which was designed to support the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. Mixer participants were native English and bilingual English speakers. The telephone speech in this corpus is predominantly English but also includes other languages. All interview segments are in English. Telephone speech represents approximately 565 hours of the data, whereas microphone speech represents the other 75 hours.
The telephone speech segments include excerpts of approximately 8-12 seconds and of 5 minutes taken from longer original conversations. The interview material includes short conversational interview segments of approximately 3 minutes taken from a longer interview session. As in prior evaluations, intervals of silence were not removed. Also, two separate conversation channels are provided (to aid systems in echo cancellation, dialog analysis, etc.). Approximately six files distributed as part of SRE08 consist of a 1024-byte header with no audio; these files were not included in the trials or keys distributed in the SRE08 aggregate corpus.
English language transcripts in .cfm format were produced using an automatic speech recognition (ASR) system.
*Samples*
For an example of the data contained in this corpus, review this audio sample. -
C-004641: 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set
*Introduction*
2005 Spring NIST Rich Transcription (RT-05S) Conference Meeting Evaluation Set was developed by LDC and NIST (National Institute of Standards and Technology). It contains approximately 78 hours of English meeting speech, reference transcripts and other material used in the RT Spring 2005 evaluation. Rich Transcription (RT) is broadly defined as a fusion of speech-to-text (STT) technology and metadata extraction technologies, providing the basis for the generation of more usable transcriptions of human-human speech in meetings. LDC has also released 2004 Spring NIST Rich Transcription (RT-04S) Development Data LDC2007S11 and 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data LDC2007S12.
RT-05S included the following tasks in the meeting domain:
* Speech-To-Text (STT) - convert spoken words into streams of text
* Speaker Diarization (SPKR) - find the segments of time within a meeting in which each meeting participant is talking
* Speech Activity Detection (SAD) - detect when someone in a meeting space is talking
Further information about the evaluation is available on the RT-05Spring Evaluation Website. Please note the lecture meeting data is not included in this release.
*Data Description*
The data in this release consists of portions of meeting speech collected between 2001 and 2005 by the IDIAP Research Institute's Augmented Multi-Party Interaction (AMI) project, Martigny, Switzerland; the International Computer Science Institute (ICSI) at the University of California, Berkeley; the Interactive Systems Laboratories (ISL) at Carnegie Mellon University (CMU), Pittsburgh, PA; NIST; and Virginia Polytechnic Institute and State University (VT), Blacksburg, VA. Each meeting excerpt contains a head-mic recording for each subject and one or more distant microphone recordings.
Reference transcripts for the evaluation excerpts were prepared by LDC according to its Meeting Recording Careful Transcription Guidelines. Those specifications are designed to provide an accurate, verbatim (word-for-word) transcription, time-aligned with the audio file and including the identification of additional audio and speech signals with special mark-up.
*Samples*
For an example of the data contained in this corpus, review this audio sample. -
C-004649: CSC Deceptive Speech
*Introduction*
CSC Deceptive Speech was developed by Columbia University, SRI International and University of Colorado Boulder. It consists of 32 hours of audio interviews from 32 native speakers of Standard American English (16 male, 16 female) recruited from the Columbia University student population and the community. The purpose of the study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on extracted features from the corpus.
The participants were told that they were participating in a communication experiment which sought to identify people who fit the profile of the top entrepreneurs in America. To this end, the participants performed tasks and answered questions in six areas. They were later told that they had received low scores in some of those areas and did not fit the profile. The subjects then participated in an interview where they were told to convince the interviewer that they had actually achieved high scores in all areas and that they did indeed fit the profile. The task of the interviewer was to determine how he thought the subjects had actually performed, and he was allowed to ask them any questions other than those that were part of the performed tasks. For each question from the interviewer, subjects were asked to indicate whether the reply was true or contained any false information by pressing one of two pedals hidden from the interviewer under a table.
*Data*
Interviews were conducted in a double-walled sound booth and recorded to digital audio tape on two channels using Crown CM311A Differoid headworn close-talking microphones, then downsampled to 16kHz before processing.
The interviews were orthographically transcribed by hand using the NIST EARS transcription guidelines. Labels for local lies were obtained automatically from the pedal-press data and hand-corrected for alignment, and labels for global lies were annotated during transcription based on the known scores of the subjects versus their reported scores. The orthographic transcription was force-aligned using the SRI telephone speech recognizer adapted for full-bandwidth recordings. There are several segmentations associated with the corpus: the implicit segmentation of the pedal presses, semi-automatically derived sentence-like units (EARS SLASH-UNITS or SUs) which were hand labeled, intonational phrase units and the units corresponding to each topic of the interview.
Transcript files are in .trs format and audio files are .wav presented in flac-compressed form for this release.
*Samples*
Please view these audio and transcript samples for the interviewer side of a conversation.
*Updates*
On May 22, 2014 an additional documentation file was added to explain the questions participants were asked. -
C-004653: ATIS0 Pilot
LDC93S4A - Complete ATIS0 corpus
LDC93S4B - ATIS0 Pilot
LDC93S4B-2 - ATIS0 Read
LDC93S4B-3 - ATIS0 SD-Read
The ATIS0 Corpus totals six CD-ROMs: one with spontaneous data from 36 speakers; one with read versions of the data from 20 of those speakers, along with some adaptation material; and four with extensive speaker-dependent material from the ATIS domain, read by ten of the same speakers.
All ATIS speech data is recorded at a 16kHz sample rate with 16-bit quantization, from two different microphones: a close-talking (Sennheiser HMD414) model and a desk-top (Crown PCC-160) model.
The first disc (ATIS0 Pilot) contains spontaneous utterances elicited in a "Wizard-of-Oz" simulation, along with the relational database containing the travel information (excluding connecting flights). 36 speakers produced a total of 912 utterances.
The second disc (ATIS0 Read) contains "read" versions of the spontaneous utterances for 20 of the 36 speakers above, for a total of 478 productions. This is supplemented by a set of 40 "adaptation" sentences read by each of the 20 speakers.
The third through the sixth discs (ATIS0 SD-Read) contain "read" speech in the ATIS domain for ten of the speakers on the first disc. They read a total of 3,171 utterances, or approximately 317 utterances per speaker. This data was collected for the purpose of training speaker-dependent speech recognition systems for the ATIS0 domain. Two of these four discs contain the close-talking (Sennheiser) microphone data and the other two contain corresponding data for the desk-top (Crown PCC-160) microphone. Thus there are 6,342 waveform files on the four discs. -
C-004657: RM Isolated and Spelled Word Data
*Introduction*
This release contains previously unreleased isolated-word and spell-mode (spelled-out words) speech data from the (D)ARPA Resource Management (RM1) Corpus. The data is based on a 600-word subset of the 991-word RM1 vocabulary and contains spoken and spelled words pertaining to the RM1 naval resource management task. This corpus was collected simultaneously with the RM1 Continuous Speech Corpus (NIST Speech Discs 2-1 through 2-4) and contains speech from the same sets of subjects used in RM1.
*Data*
The speech data has been segmented into separate spelled and spoken-word waveform files for each subject-word utterance. Time-aligned word and phonetic transcriptions have been generated automatically using forced recognition and are included. The time-aligned transcriptions employ the same format and phone set as the TIMIT Acoustic-Phonetic Continuous Speech Corpus (NIST Speech Disc 1-1). See the TIMIT CD-ROM companion booklet, NISTIR 4930, pp. 29-31, for a description of the phone set.
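TIMIT-style time-aligned transcriptions are plain-text files with one segment per line: a begin sample index, an end sample index and a label. A minimal parser sketch (the sample lines below are illustrative, not drawn from this corpus):

```python
def read_timit_alignment(text):
    """Parse TIMIT-style time-aligned lines: 'begin end label',
    with begin/end given as sample indices into the waveform."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue
        begin, end, label = line.split()
        segments.append((int(begin), int(end), label))
    return segments

sample = "0 2400 h#\n2400 3600 sh\n3600 5200 iy\n"
print(read_timit_alignment(sample))
```

Dividing the sample indices by the 16kHz sample rate converts each boundary to seconds.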
As with the continuous speech portion of RM1, this data is divided into speaker-independent and speaker-dependent partitions. These data sets are further partitioned into training, development-test and evaluation-test subsets. See the "readme.doc" file in the top-level directory for more information about the data.
Texas Instruments recruited the subjects and collected the speech. The National Institute of Standards and Technology (NIST) segmented the waveforms, generated the time-aligned transcriptions and produced this release.
*Updates*
RM Isolated and Spelled Word Data is no longer available as catalog number LDC97S39; it has been incorporated into Resource Management RM1 2.0, and it is currently available in both Resource Management RM1 2.0 (LDC93S3B), and Resource Management Complete Set 2.0 (LDC93S3A). -
C-004658: 1996 Speaker Recognition Benchmark
*Introduction*
This corpus, which is a subset of the Switchboard-1 (LDC93S7) corpus, was used in NIST's 1996 Speaker Recognition Evaluation. The focus of this evaluation was on detection of the presence of a hypothesized target speaker, given a segment of conversational speech over the telephone.
*Data*
The corpus consists of one Development Data disc and two Evaluation Data discs. Both sets include training and test segments.
The Development Data includes both training and test segments for about 45 male and 45 female speakers. The training data consists of about four one-minute segments of speech data for each target speaker. The test data contains shorter segments of speech data (3, 10 and 30 seconds) that were taken from different conversations for each speaker.
The Evaluation Data includes about 20 male and 20 female target speakers and 200 male and 200 female non-target speakers. All of these speakers are different from the speakers in the Development Data set. Training data is supplied for each of the target speakers, in the same manner as the Development Data. Test data is supplied for both the target and the non-target speakers, in the same manner as the Development Data.
*Updates*
There are no updates at this time.
- references: C-001283: Switchboard-1 Release 2