Registered language resources: 3,330 | Showing results 1001-1010 of 2,023
  • C-001722: CSLU: Apple Words and Phrases
    *Introduction*

    Apple Words and Phrases Version 1.3 contains approximately 69.5 hours of speech from 3008 telephone calls placed on analog and digital phone systems. Apple Computer, Inc. supported the development of this data and also supplied the list of words and phrases collected. Callers responded to questions and repeated a list of phrases as they were prompted.

    *Data*

    Subjects calling the analog system (998 callers) were employees of Apple Computer, Inc. and were solicited through interoffice email within the company. Subjects calling the digital system (2010 callers) were responding to Usenet postings or newspaper advertisements placed in several cities across the United States. Each subject called the CSLU data collection system by dialing a toll-free number. The analog data were collected via a Worldport Pod on an Apple Quadra A/V. The digital data were collected with the CSLU T1 digital data collection system.

    Callers were prompted to answer certain questions, including: "What is your native language?", "In which city and state did you spend most of your childhood?", "What time is it now?" and "What day is today?" Callers were also instructed to repeat various command-and-control type phrases, including "play previous message again", "make a meeting for today", "quit", "who is at work", "what is the area code for this state", "hello, what are my messages", "help", "please send a car from the city", "delete my email tomorrow", "read this text", "erase all information", "record extended phonebook", "transfer all calls to home at twelve o'clock", "record urgent message" and "find the operator".

    Each recorded utterance was listened to by a human verifier to determine if the speaker adequately followed the directions. If an utterance contained extraneous words or excessive noise, it was not included in the corpus.

    *Samples*

    * Analog
    * Digital
    • isReferencedBy: Mike Noel. 2007. CSLU: Apple Words and Phrases. Linguistic Data Consortium, Philadelphia.
  • C-001726: TC-STAR 2006 Evaluation Package - ASR Spanish - EPPS
    Desktop/Microphone
    TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.

    The second TC-STAR evaluation campaign took place in March 2006.
    Three core technologies were evaluated during the campaign:
    • Automatic Speech Recognition (ASR),
    • Spoken Language Translation (SLT),
    • Text to Speech (TTS).

    Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.

    This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for Spanish. The same packages are available for English (ELRA-E0011), Mandarin (ELRA-E0013), and for the CORTES task for Spanish (ELRA-E0012/01), for ASR, and for SLT in 3 directions, English-to-Spanish (ELRA-E0014), Spanish-to-English (ELRA-E0015), Chinese-to-English (ELRA-E0016).

    To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: EPPS (European Parliament Plenary Sessions) task, CORTES (Spanish Parliament Sessions) task and VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.

    This package was used within the EPPS task and consists of 2 data sets:
    - Development data set: consists of audio recordings of Parliament’s sessions from 6 June to 7 July 2005, manually transcribed. 3 hours of recordings were selected and transcribed, corresponding to approximately 30,000 running words in Spanish.
    - Test data set: consists of audio recordings of Parliament’s sessions from September to November 2005, manually transcribed. As with the development set, the test data set consists of 3 hours of recordings (approximately 30,000 running words).
  • C-001727: TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - EPPS
    Desktop/Microphone
    TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.

    The second TC-STAR evaluation campaign took place in March 2006.
    Three core technologies were evaluated during the campaign:
    • Automatic Speech Recognition (ASR),
    • Spoken Language Translation (SLT),
    • Text to Speech (TTS).

    Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.

    This package includes the material used for the TC-STAR 2006 Spoken Language Translation (SLT) second evaluation campaign for Spanish-to-English translation within the EPPS task. The same packages are available for English (ELRA-E0011), Spanish (ELRA-E0012) and Mandarin Chinese (ELRA-E0013) for ASR, and for SLT in 2 other directions, English-to-Spanish (ELRA-E0014) and Chinese-to-English (ELRA-E0016), as well as for the CORTES task for Spanish-to-English (ELRA-E0015/01).

    To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: EPPS (European Parliament Plenary Sessions) task, CORTES (Spanish Parliament Sessions) task and VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.

    This package was used within the EPPS task and consists of 2 data sets:
    - Development data set: built upon the ASR development data set, in order to enable end-to-end evaluation. Subsets of 25,000 words were selected from the EPPS verbatim transcriptions and from the EPPS Final Text Edition documents. The source texts were then translated into English by two independent translation agencies. All source text sets and reference translations were formatted using the same SGML DTD that has been used for the NIST Machine Translation evaluations.
    - Test data set: the same procedure as for the development set was followed to produce the test data, i.e., subsets of 25,000 words were selected from the test data set (Parliament’s sessions from 5 September to 17 November 2005), both from the manual transcriptions and from the Final Text Edition documents. The source data were then translated into English by two independent agencies.
  • C-003102: 2001 Topic Annotated Enron Email Data Set
    *Introduction*

    The 2001 Topic Annotated Enron Email Data Set contains approximately 5,000 emails (4,936 exactly) from Enron Corporation (Enron) manually indexed into 32 topics. It is a subset of the original Enron Email Data Set of 1.5 million emails that was posted on the Federal Energy Regulatory Commission website as a matter of public record during the investigation of Enron. The original set suffered from document integrity problems; attempts were made to improve the quality of the data and to remove some sensitive and private information. Dr. William Cohen of Carnegie Mellon University took the lead in distributing the improved corpus, consisting of 517,431 Enron employee emails that covered the period 1999-2002.

    This corpus is a subset of the Carnegie Mellon data set and covers the period from January 2001 to December 2001. The email topics reflect the business activities and interests of Enron employees in that year: California energy problems and the subsequent state and Federal investigations, Enron's downfall (newsfeeds and interoffice communications), Enron's venture with the Dabhol India Power Company, EnronOnline (Enron's trading infrastructure), competitors (Dynegy, El Paso Pipeline) and even fantasy football and college football. Eliminated from this data set are duplicates, emails that are too small and emails that are not really topics but are types (personnel memos and personal quips). The manual indexing was performed in the summer of 2006 by two people who worked closely together: a research associate familiar with the Enron saga and a junior in economics at the University of Tennessee.

    The original Enron Email Data Set is the first large email set made available to researchers, but until now there has been no ability to assess the performance of topic detection and tracking algorithms with the email set. Having an annotated subset such as this one should provide text mining researchers with a way to evaluate the accuracy of new algorithms for clustering and classification. This data set can also be used to provide communication context for researchers using the Enron Email Data Set in social network analysis. Previous annotations such as the one developed at UC Berkeley have been primarily based on email type rather than the specific topic(s) of discussion. This annotation can be used to qualify the discussion topics between individuals and groups comprising a social network of Enron employees.

    Due to the complexity of this corpus' directory structure, it will be distributed as a compressed tar file on a CD. Most compression utilities will uncompress the package.

    *Updates*

    As of Aug 13, 2007, an update corrects a small error in the subject annotation file. Those members and licensees who received this publication prior to Aug 13, 2007 should re-download the corpus. All copies issued since that date have been corrected.
    • references: Carnegie Mellon Enron Email Dataset (www.cs.cmu.edu/~enron)
    • isReferencedBy: (On-line documentation): http://www.ldc.upenn.edu/Catalog/docs/LDC2007T22/
    • isReferencedBy: Dr. Michael W. Berry, Murray Browne and Ben Signer. 2007. 2001 Topic Annotated Enron Email Data Set. Linguistic Data Consortium, Philadelphia.
  • C-003103: CSLU: Foreign Accented English Release 1.2
    *Introduction*

    This file contains documentation on CSLU: Foreign Accented English Release 1.2, Linguistic Data Consortium (LDC) catalog number LDC2006S38 and ISBN 1-58563-392-5.

    CSLU: Foreign Accented English Release 1.2 consists of continuous speech in English by native speakers of 22 different languages: Arabic, Cantonese, Czech, Farsi, French, German, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Mandarin Chinese, Malay, Polish, Portuguese (Brazilian and Iberian), Russian, Swedish, Spanish, Swahili, Tamil and Vietnamese. The corpus contains 4925 telephone-quality utterances, information about the speakers' linguistic backgrounds and perceptual judgments about the accents in the utterances. The speakers were asked to speak about themselves in English for 20 seconds. Three native speakers of American English independently listened to each utterance and judged the speakers' accents on a 4-point scale: negligible/no accent, mild accent, strong accent and very strong accent. This corpus is intended to support the study of the underlying characteristics of foreign accent and to enable research, development and evaluation of algorithms for the identification and understanding of accented speech. Some of the files in this corpus are also contained in CSLU: 22 Languages Corpus, LDC2005S26.

    *Samples*

    For an example of the data in this corpus, please listen to this audio sample.
    • replaces: CSLU: Foreign Accented English v1.1
    • isPartOf: C-000670: CSLU: 22 Languages Corpus
    • isReferencedBy: (Online Documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC2007S08/
    • isReferencedBy: T. Lander. 2007. CSLU: Foreign Accented English Release 1.2. Linguistic Data Consortium, Philadelphia.
  • C-003106: Web日本語Nグラム第1版
    The n-grams were extracted from publicly available Japanese web pages crawled by Google. Pages that require special authentication to view, or whose meta tags specify noarchive, noindex, etc., were excluded. Approximately 20 billion sentences were processed, and the data contains 1- to 7-grams with an observed frequency of 20 or more.
  • C-003107: Web English N-gram Data
    This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
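    As a hedged illustration of how such frequency counts feed a statistical language model (the toy corpus and function name below are invented for this sketch, not part of the release):

    ```python
    from collections import Counter

    def ngram_counts(tokens, n):
        """Count all n-grams of length n in a token sequence."""
        return Counter(zip(*(tokens[i:] for i in range(n))))

    # Toy corpus standing in for web-scale text (illustration only).
    tokens = "the cat sat on the mat".split()
    unigrams = ngram_counts(tokens, 1)
    bigrams = ngram_counts(tokens, 2)

    # Maximum-likelihood estimate P(cat | the) = c(the cat) / c(the).
    p = bigrams[("the", "cat")] / unigrams[("the",)]
    print(p)  # 0.5: "the" occurs twice, once followed by "cat"
    ```

    A release like this one ships the counts precomputed, so the counting step above is replaced by reading the distributed count tables; the probability estimation is the same.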
  • C-003109: 2003 NIST Rich Transcription Evaluation Data
    *Introduction*

    2003 NIST Rich Transcription Evaluation Data contains the test material used in the 2003 Rich Transcription Spring and Fall evaluations administered by the NIST (National Institute of Standards and Technology) Speech Group. The Spring evaluation (RT-03S), implemented in March-April 2003, focused on Speech-To-Text (STT) tasks for broadcast news speech and conversational telephone speech in three languages: English, Mandarin Chinese and Arabic. That evaluation also included one Metadata Extraction (MDE) task, speaker diarization for broadcast news speech and conversational telephone speech in English. The Fall evaluation (RT-03F), implemented in October 2003, focused on MDE tasks including speaker diarization, speaker-attributed STT, SU (sentence/semantic unit) detection and disfluency detection for broadcast news speech and conversational telephone speech in English. For complete information about the evaluations, see the RT-03 Spring Evaluation Website and the RT-03 Fall Evaluation Website.

    *Data*

    The BN datasets were selected from TDT-4 sources collected in February 2001. The evaluation excerpts were transcribed to the nearest story boundary. The English BN dataset is approximately three hours long and is composed of 30-minute excerpts from six different broadcasts. The Mandarin Chinese BN dataset is approximately one hour long, consisting of 12-minute excerpts from five different broadcasts. The Arabic BN dataset is also approximately one hour long and contains 30-minute excerpts from two different broadcasts.

    The CTS datasets consist of material from various LDC telephone speech data. All evaluation excerpts were transcribed to the nearest turn. The English CTS set is approximately 6 hours long and is composed of 5-minute excerpts from 72 different conversations: 36 from the Switchboard Cellular collection and 36 from the Fisher collection. The Mandarin Chinese CTS dataset is approximately one hour long and consists of 5-minute excerpts from 12 different conversations from the CallFriend Mandarin Chinese data. The Arabic CTS set is also approximately one hour long and contains 5-minute excerpts from 12 different conversations from the CallHome Egyptian Arabic data.

    No manual (human-annotated) segmentations were provided. Sites were required to generate their own segmentations automatically.

    Unlike the BN audio files, where the full broadcasts were provided, the CTS audio files contain only the evaluation excerpts. Each audio excerpt is a SPHERE-headered, two-channel interleaved 8-bit mu-law file.
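    As a minimal sketch of handling that sample format: a SPHERE file begins with a plain-text header whose size is declared in the header itself; the sketch below skips header parsing and shows only the standard G.711 mu-law expansion and channel de-interleaving, applied to an invented toy payload rather than real excerpt data.

    ```python
    def ulaw_to_linear(u: int) -> int:
        """Expand one 8-bit mu-law code to a 16-bit linear PCM sample (G.711)."""
        u = ~u & 0xFF                     # mu-law bytes are stored bit-complemented
        t = ((u & 0x0F) << 3) + 0x84      # mantissa plus bias (0x84 = 132)
        t <<= (u & 0x70) >> 4             # apply the 3-bit exponent
        return (0x84 - t) if (u & 0x80) else (t - 0x84)

    # Two-channel interleaved data alternates one byte per channel,
    # so byte striding separates the channels.
    raw = bytes([0xFF, 0x00, 0xFF, 0x00])           # toy payload, not real data
    left = [ulaw_to_linear(b) for b in raw[0::2]]   # 0xFF encodes silence (0)
    right = [ulaw_to_linear(b) for b in raw[1::2]]  # 0x00 encodes full scale
    ```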

    *Samples*

    * English Broadcast News Audio
    * Indices
    * Transcriptions
    The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
  • C-003110: 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data
    *Introduction*

    2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data contains the test material (meeting speech and reference transcripts) used in the RT-04S evaluation administered by the NIST (National Institute of Standards and Technology) Speech Group. Rich Transcription (RT) is broadly defined as a fusion of speech-to-text technology and metadata extraction technologies designed to provide the basis for a generation of more usable transcriptions of human-human meeting speech.

    The data in this release consists of portions of meeting speech collected and/or transcribed by the International Computer Science Institute (ICSI) at Berkeley, the Interactive Systems Laboratories (ISL) at Carnegie Mellon University, NIST and LDC. The complete meeting speech and corresponding transcript data sets are available from LDC's catalog as follows: ICSI Meeting Speech (LDC2004S02), ICSI Meeting Transcripts (LDC2004T04), ISL Meeting Speech Part 1 (LDC2004S05), ISL Meeting Transcripts Part 1 (LDC2004T10), NIST Meeting Pilot Corpus Speech (LDC2004S09) and NIST Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13).

    RT-04S included the following tasks in the meeting domain:

    Speech-to-Text Transcription (STT) tasks
    Microphone conditions:
    * Multiple distant microphones
    * Single distant microphone
    * Individual head microphone
    Processing time conditions:
    * Unlimited time STT
    * Less than or equal to twenty times realtime
    * Less than or equal to ten times realtime
    * Less than or equal to one time realtime

    Diarization (SPKR) task ("who spoke when")
    Microphone conditions:
    * Multiple distant microphones
    * Single distant microphone
    Input conditions:
    * Speech input only
    * Speech plus reference transcript input
    Processing time conditions:
    * Unlimited time
    * Less than or equal to twenty times realtime
    * Less than or equal to ten times realtime
    * Less than or equal to one time realtime
    Further information about the evaluation is available on the RT-04 Spring Evaluation Website.

    *Samples*

    For an example of the data in this corpus, please review this audio sample.
  • C-003111: CSLU: Kids' Speech Version 1.1
    *Introduction*

    CSLU: Kids' Speech Version 1.1 , Linguistic Data Consortium (LDC) catalog number LDC2007S18 and isbn 1-58563-395-X, is a collection of spontaneous and prompted speech from 1100 children between Kindergarten and Grade 10 in the Forest Grove School District in Oregon. All children -- approximately 100 children at each grade level -- read approximately 60 items from a total list of 319 phonetically-balanced but simple words, sentences or digit strings. Each utterance of spontaneous speech begins with a recitation of the alphabet and contains a monologue of about one minute in duration. This release consists of 1017 files containing approximately 8-10 minutes of speech per speaker. Corresponding word-level transcriptions are also included.

    This corpus was developed to facilitate research on the characteristics of children's speech at different ages and to train and evaluate recognizers for use in language training and other interactive tasks involving children, including recognizers used in language development with deaf children.

    *Data*

    Data collection was performed using the CSLU Speech Toolkit and two computers running Windows NT 4.0. Each computer was manned by a CSLU staff member who monitored progress and helped the child with any difficulties. The average time at the computer was 20 minutes, yielding approximately 8-10 minutes of speech digitized at 16 bits and 16kHz using Soundblaster 16 PnP audio cards with head-mounted microphones.

    The prompted speech, consisting of 200 isolated words and 10 numeric strings, was presented as text appearing below an animated character that produced accurate visible speech synchronized with recorded prompts. A text prompt was also displayed. The child then repeated the prompted item. Once the prompted speech collection was completed, the experimenter asked the subject a series of questions designed to elicit spontaneous speech (e.g., "Tell me about your favorite movie"). Information about the subject's age, gender, languages spoken and physical conditions affecting speech was also collected.

    *Samples*

    For an example of the speech in this corpus, please listen to this sample of spontaneous speech.
    • isReferencedBy: (Online Documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC2007S18/
    • isReferencedBy: Khaldoun Shobaki, John-Paul Hosom, and Ronald Cole. 2007. CSLU: Kids' Speech Version 1.1. Linguistic Data Consortium, Philadelphia.