Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing results 721 - 730 of 2023.

C-001291: TDT2 Mandarin Audio Corpus
*Introduction*

Topic Detection and Tracking (TDT) 2 Mandarin Audio Corpus contains recordings of broadcast news audio. The transcriptions to these recordings are available in the Topic Detection and Tracking (TDT) 2 Multilanguage Text Version 4.0, LDC2001T57.

Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection), and track the reoccurrence of old or new events (tracking).

*Data*

Please see file.tbl for the directory structure of this publication, as well as a complete list of files.

The data files are recordings of Voice of America (VOA) news broadcasts. The data were collected daily over a period of six months (February-June 1998). The audio files in this corpus are single channel, 16 kHz, 16-bit linear SPHERE files.
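
As a rough illustration of how the audio parameters above could be checked, the sketch below parses a SPHERE-style header. It assumes the usual NIST SPHERE layout (an ASCII header that starts with `NIST_1A`, then the header size, then `name -type value` lines ending in `end_head`). The sample header is synthetic and was not taken from this corpus, and the field names in it are assumptions about what such a file typically carries.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header of a NIST SPHERE file (illustrative sketch)."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(lines[1].strip())

    fields = {}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        name, type_code, value = line.split(None, 2)
        if type_code == "-i":
            fields[name] = int(value)      # integer field
        elif type_code == "-r":
            fields[name] = float(value)    # real-valued field
        else:
            fields[name] = value           # string field, e.g. "-s3"
    return {"header_size": header_size, "fields": fields}


# Synthetic header for illustration only (hypothetical values).
sample_header = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024, b" ")

info = parse_sphere_header(sample_header)
print(info["fields"])
```

For a real file, one would read the first 1024 bytes (or whatever size the second header line states) and pass them to `parse_sphere_header`.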

*Updates*

There are no updates at this time.
- references: David Graff 2001 TDT2 Mandarin Audio Corpus Linguistic Data Consortium, Philadelphia
- hasVersion: C-001287: TDT Pilot Study Corpus
- hasVersion: C-001288: TDT2 Careful Transcription Audio
- hasVersion: C-001289: TDT2 Careful Transcription Text
- hasVersion: C-001290: TDT2 English Audio
- hasVersion: C-001292: TDT2 Multilanguage Text Version 4.0
- hasVersion: C-001293: TDT3 English Audio
- hasVersion: C-001294: TDT3 Mandarin Audio
- hasVersion: C-001295: TDT3 Multilanguage Text Version 2.0
- hasVersion: C-001296: TDT4 Multilingual Broadcast News Speech Corpus
- hasVersion: C-001297: TDT4 Multilingual Text and Annotations
- hasVersion: C-001298: TDT5 Multilingual Text
- hasVersion: C-001299: TDT5 Topics and Annotations
C-001292: TDT2 Multilanguage Text Version 4.0
*Introduction*

Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection), and track the reoccurrence of old or new events (tracking).

*Data*

TDT2 Multilanguage Text Corpus Version 4.0 contains news data collected daily from nine news sources in two languages (American English and Mandarin Chinese), over a period of six months (January - June 1998). Both manually-created reference text and automatically-generated text (ASR and/or machine translation) are provided for all broadcast and all Mandarin data.

This version has been prepared to complement the first general release of the TDT3 Multilanguage Text Corpus, providing new enhancements to make the data content more accessible to a broader research community. The news sources and approximate number of stories per source (in thousands) are as follows:

English sources (thousands of stories)

New York Times Newswire Service 11.8

Associated Press Worldstream Service 12.8

Cable News Network, Headline News 15.8

American Broadcasting Co., World News Tonight 2.1

Public Radio International, The World 2.9

Voice of America (news programs) 8.2

Total English stories: 53.6 thousand

Mandarin sources (thousands of stories)

Xinhua News Agency 11.3

Zaobao News Agency 5.2

Voice of America (news programs) 2.3

Total Mandarin stories: 18.8 thousand

This release consists of the English and Mandarin text components of the TDT2 corpus. The data was collected daily over a period of six months (January-June 1998) from the following sources.

* American Broadcasting Company (ABC)
* Associated Press
* Cable News Network, Inc. (CNN)
* New York Times
* Public Radio International (PRI)
* Voice of America (VOA)
* Xinhua News Agency
* ZaoBao News

The data is provided in the following formats.

* .sgm: Reference true-text, with markup providing story boundaries and descriptive information
* .tkn: Tokenized version of SGML data, with all descriptive and boundary information removed
* .as0: Output of the Dragon ASR system in tokenized form with information on timing, speaker clusters, and confidence
* .as1: Output of the BBN ASR system in tokenized form with timing information (English only)
* .mttkn: SYSTRAN output from .tkn (Mandarin only)
* .mtas0: SYSTRAN output from .as0 (Mandarin only)
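
The format list above can be turned into a small lookup for sorting files by type. This is only a sketch: it assumes the format names appear as filename extensions, and the file names used here are hypothetical, not taken from the release.

```python
from pathlib import PurePosixPath

# Descriptions paraphrased from the format list above.
FORMATS = {
    ".sgm": "reference text with story-boundary markup",
    ".tkn": "tokenized reference text, markup removed",
    ".as0": "Dragon ASR output (tokenized, with timing, speaker clusters, confidence)",
    ".as1": "BBN ASR output (tokenized, with timing; English only)",
    ".mttkn": "SYSTRAN output from .tkn (Mandarin only)",
    ".mtas0": "SYSTRAN output from .as0 (Mandarin only)",
}


def describe(filename: str) -> str:
    """Return the format description for a file name, based on its extension."""
    suffix = PurePosixPath(filename).suffix
    return FORMATS.get(suffix, "unknown format")


# Hypothetical file names, for illustration only.
for name in ["example_story.sgm", "example_story.mtas0", "notes.txt"]:
    print(name, "->", describe(name))
```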

The corpus also includes topic relevance tables as well as tables for locating story boundaries.

*Updates*

7/21/16 - Topic tables were added to the release and the online documentation folder.

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
- references: Charles Wayne, et al. 2001 TDT2 Multilanguage Text Version 4.0 Linguistic Data Consortium, Philadelphia
- hasVersion: C-001287: TDT Pilot Study Corpus
- hasVersion: C-001288: TDT2 Careful Transcription Audio
- hasVersion: C-001289: TDT2 Careful Transcription Text
- hasVersion: C-001290: TDT2 English Audio
- hasVersion: C-001291: TDT2 Mandarin Audio Corpus
- hasVersion: C-001293: TDT3 English Audio
- hasVersion: C-001294: TDT3 Mandarin Audio
- hasVersion: C-001295: TDT3 Multilanguage Text Version 2.0
- hasVersion: C-001296: TDT4 Multilingual Broadcast News Speech Corpus
- hasVersion: C-001297: TDT4 Multilingual Text and Annotations
- hasVersion: C-001298: TDT5 Multilingual Text
- hasVersion: C-001299: TDT5 Topics and Annotations
C-001293: TDT3 English Audio
PRICING: 2001 Commercial Members: $0 2001 Non-Profit Members: $1,100 Non-Members: $11,000

*Introduction*

Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT3 corpus was created to support three TDT3 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection), and track the reoccurrence of old or new events (tracking).

The goal of Topic Detection and Tracking - Phase 3 (TDT3) is to create core technology to monitor multiple streams of news in multiple languages and media (newswire, radio, television, web sites or some future combination or innovation), segmenting the streams into individual stories, detecting new topics and tracking all stories discussing them. In addition to the TDT2 tasks of segmentation, detection and tracking, TDT3 adds the tasks of first story detection and story-link detection. The goal of the latter is to detect links between stories that discuss the same topic even though the topic has not been defined in advance.

*Data*

The TDT3 English Audio Corpus contains the audio (in compressed sphere format) of news broadcasts collected daily from six news sources in American English, over a three-month collection period (October - December 1998). The sources and amounts are as follows:

Code | Source | Hours | CDs
--- | --- | --- | ---
CNN_HDL | Cable News Network, "Headline News" | 174.6 | 19
ABC_WNT | American Broadcasting Co., "World News Tonight" | 38.6 | 5
NBC_NNW | National Broadcasting Co., "NBC Nightly News" | 44.6 | 6
MNB_NBW | MS-NBC, "News with Brian Williams" | 51.8 | 6
PRI_TWD | Public Radio International, "The World" | 63.9 | 7
VOA_ENG | Voice of America, English news programs | 102.2 | 12
Total | | 475.7 | 55

The files in this publication are complete single-channel recordings of the (30- or 60-minute) broadcasts listed above. Each one has been digitized at a sample rate of 16 kHz using 16-bit samples, and compressed using the "shorten" algorithm.
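
To give a sense of scale, the per-source hours and the digitization parameters stated above (single channel, 16 kHz, 16-bit) can be combined to estimate the uncompressed size of the audio. This is a back-of-the-envelope sketch: it ignores file headers, and the size after "shorten" compression is not stated in the source, so it is not estimated here.

```python
# Hours per source, copied from the source table for this corpus.
HOURS = {
    "CNN_HDL": 174.6,
    "ABC_WNT": 38.6,
    "NBC_NNW": 44.6,
    "MNB_NBW": 51.8,
    "PRI_TWD": 63.9,
    "VOA_ENG": 102.2,
}

SAMPLE_RATE_HZ = 16_000   # 16 kHz
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # single-channel recordings


def uncompressed_bytes(hours: float) -> int:
    """Raw PCM size in bytes for the given duration, ignoring headers."""
    seconds = hours * 3600
    return round(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


total_hours = round(sum(HOURS.values()), 1)
total_gb = uncompressed_bytes(total_hours) / 1e9

print(f"total hours: {total_hours}")
print(f"uncompressed size: about {total_gb:.1f} GB")
```

The summed hours agree with the 475.7-hour total given in the table.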

(The audio CD-ROMs are grouped into subsets by broadcast source and the LDC will support the option of purchasing one or more subsets, e.g. just the ABC data. We regret that we cannot provide "customized" subsets.)

Tools for decompression can be found here.

*Updates*

There are no updates at this time.
- references: David Graff 2001 TDT3 English Audio Linguistic Data Consortium, Philadelphia
- hasVersion: C-001287: TDT Pilot Study Corpus
- hasVersion: C-001288: TDT2 Careful Transcription Audio
- hasVersion: C-001289: TDT2 Careful Transcription Text
- hasVersion: C-001290: TDT2 English Audio
- hasVersion: C-001291: TDT2 Mandarin Audio Corpus
- hasVersion: C-001292: TDT2 Multilanguage Text Version 4.0
- hasVersion: C-001294: TDT3 Mandarin Audio
- hasVersion: C-001295: TDT3 Multilanguage Text Version 2.0
- hasVersion: C-001296: TDT4 Multilingual Broadcast News Speech Corpus
- hasVersion: C-001297: TDT4 Multilingual Text and Annotations
- hasVersion: C-001298: TDT5 Multilingual Text
- hasVersion: C-001299: TDT5 Topics and Annotations
C-001294: TDT3 Mandarin Audio
*Introduction*

This publication contains the TDT3 Broadcast News Mandarin Corpus (Audio), produced by the Linguistic Data Consortium (LDC), catalog number LDC2001S95 and ISBN number 1-58563-186-8. The contents of this publication were recorded from various 60-minute, twice daily Mandarin news programs from VOA. The transcripts of these broadcasts will be published in the TDT3 Mandarin Text and TDT3 Multilanguage Text Corpora.

*Data*

Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT3 corpus was created to support three TDT3 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection), and track the reoccurrence of old or new events (tracking).

The goal of Topic Detection and Tracking - Phase 3 (TDT3) is to create core technology to monitor multiple streams of news in multiple languages and media (newswire, radio, television, web sites or some future combination or innovation), segmenting the streams into individual stories, detecting new topics and tracking all stories discussing them. In addition to the TDT2 tasks of segmentation, detection and tracking, TDT3 adds the tasks of first story detection and story-link detection. The goal of the latter is to detect links between stories that discuss the same topic even though the topic has not been defined in advance.

Please see file.tbl for the directory structure of this publication, as well as a complete list of files.

The data files are recordings of Voice of America (VOA) news broadcasts. The data were collected daily over a period of three months (October-December 1998). The audio files in this corpus are single channel, 16 kHz, 16-bit linear SPHERE files.

*Updates*

There are no updates at this time.
- references: David Graff 2001 TDT3 Mandarin Audio Linguistic Data Consortium, Philadelphia
- hasVersion: C-001287: TDT Pilot Study Corpus
- hasVersion: C-001288: TDT2 Careful Transcription Audio
- hasVersion: C-001289: TDT2 Careful Transcription Text
- hasVersion: C-001290: TDT2 English Audio
- hasVersion: C-001291: TDT2 Mandarin Audio Corpus
- hasVersion: C-001292: TDT2 Multilanguage Text Version 4.0
- hasVersion: C-001293: TDT3 English Audio
- hasVersion: C-001295: TDT3 Multilanguage Text Version 2.0
- hasVersion: C-001296: TDT4 Multilingual Broadcast News Speech Corpus
- hasVersion: C-001297: TDT4 Multilingual Text and Annotations
- hasVersion: C-001298: TDT5 Multilingual Text
- hasVersion: C-001299: TDT5 Topics and Annotations
C-001295: TDT3 Multilanguage Text Version 2.0
*Introduction*

Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT3 corpus was created to support three TDT3 tasks: to find topically homogeneous sections (segmentation), to detect the occurrence of new events (detection) and to track the reoccurrence of old or new events (tracking).

*Data*

TDT3 Multilanguage Text Corpus Version 2.0 is the first general release of this collection (Version 1.0 was made available only to participants in the TDT 1999 and 2000 evaluation tests). It contains data from the same nine sources found in TDT2, plus two additional English television sources. Like TDT2, it provides both manually-created and automatically-generated text for most sources.

For TDT3, the daily collection took place over a period of three months (October - December 1998). The sources and approximate number of stories per source are as follows:

English sources Thousands of stories

New York Times Newswire Service 6.9

Associated Press Worldstream Service 7.3

Cable News Network, "Headline News" 9.0

American Broadcasting Co., "World News Tonight" 1.0

Public Radio International, "The World" 1.6

Voice of America, English news programs 3.9

MS-NBC, "News with Brian Williams" 0.7

National Broadcasting Co., "NBC Nightly News" 0.8

Total English stories: 31.2 thousand

Mandarin sources Thousands of stories

Xinhua News Agency 5.2

Zaobao News Agency 3.8

Voice of America, Mandarin Chinese news programs 3.8

Total Mandarin stories: 12.8 thousand

The goal of Topic Detection and Tracking - Phase 3 (TDT3) is to create core technology to monitor multiple streams of news in multiple languages and media (newswire, radio, television, web sites or some future combination or innovation), segmenting the streams into individual stories, detecting new topics and tracking all stories discussing them. In addition to the TDT2 tasks of segmentation, detection and tracking, TDT3 adds the tasks of first story detection and story-link detection. The goal of the latter is to detect links between stories that discuss the same topic even though the topic has not been defined in advance.

There are two types of files in this publication:

asr_sgm -- text data output from automatic speech recognition (ASR) systems in English and Mandarin, formatted in "TIPSTER-style" SGML, derived from the audio recordings of radio and TV broadcasts.

tkn_sgm -- reference text data (newswire, closed captions and manual transcripts), formatted in "TIPSTER-style" SGML

*Samples*

Please view this asr_sgm sample and tkn_sgm sample.

*Updates*

7/21/16 - Topic tables added.
- references: David Graff, Chris Cieri, and Stephanie Strassel 2001 TDT3 Multilanguage Text Version 2.0 Linguistic Data Consortium, Philadelphia
- hasVersion: C-001287: TDT Pilot Study Corpus
- hasVersion: C-001288: TDT2 Careful Transcription Audio
- hasVersion: C-001289: TDT2 Careful Transcription Text
- hasVersion: C-001290: TDT2 English Audio
- hasVersion: C-001291: TDT2 Mandarin Audio Corpus
- hasVersion: C-001292: TDT2 Multilanguage Text Version 4.0
- hasVersion: C-001293: TDT3 English Audio
- hasVersion: C-001294: TDT3 Mandarin Audio
- hasVersion: C-001296: TDT4 Multilingual Broadcast News Speech Corpus
- hasVersion: C-001297: TDT4 Multilingual Text and Annotations
- hasVersion: C-001298: TDT5 Multilingual Text
- hasVersion: C-001299: TDT5 Topics and Annotations
C-001296: TDT4 Multilingual Broadcast News Speech Corpus
*Introduction*

This file contains documentation about the TDT4 Multilingual Broadcast News Speech Corpus; Linguistic Data Consortium (LDC) catalog number LDC2005S11, ISBN number 1-58563-338-0.

This corpus was created by Linguistic Data Consortium with support from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. This release contains the complete set of American English, Modern Standard Arabic and Mandarin Chinese broadcast news audio used in the 2002 and 2003 Topic Detection and Tracking technology evaluations. The transcripts corresponding to the audio contained in this release, along with newswire data and topic relevance annotations, can be found in LDC Publication LDC2005T16, TDT4 Multilingual Text and Annotations.

Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. Evaluation tasks in 2002 and 2003 included the segmentation of a news source into stories, the tracking of known topics, the detection of unknown topics, the detection of initial stories on unknown topics, and the detection of pairs of stories on the same topic.

*Samples*

Please examine these wave files for an example of this corpus.

* Arabic
* English
* Mandarin

*Updates*

The initial July 2005 release contained an error in which a small number of files were misnamed. That error has been corrected. Users who received the original release should have also received the correction. Anyone who ordered this publication before Nov. 1, 2005 and has not received a correction should contact LDC's Membership Office at ldc@ldc.upenn.edu.
- references: Junbo Kong and David Graff 2005 TDT4 Multilingual Broadcast News Speech Corpus Linguistic Data Consortium, Philadelphia
- hasVersion: C-001287: TDT Pilot Study Corpus
- hasVersion: C-001288: TDT2 Careful Transcription Audio
- hasVersion: C-001289: TDT2 Careful Transcription Text
- hasVersion: C-001290: TDT2 English Audio
- hasVersion: C-001291: TDT2 Mandarin Audio Corpus
- hasVersion: C-001292: TDT2 Multilanguage Text Version 4.0
- hasVersion: C-001293: TDT3 English Audio
- hasVersion: C-001294: TDT3 Mandarin Audio
- hasVersion: C-001295: TDT3 Multilanguage Text Version 2.0
- hasVersion: C-001297: TDT4 Multilingual Text and Annotations
- hasVersion: C-001298: TDT5 Multilingual Text
- hasVersion: C-001299: TDT5 Topics and Annotations
C-001297: TDT4 Multilingual Text and Annotations
*Introduction*

This page contains documentation on the TDT4 Multilingual Text and Annotations, Linguistic Data Consortium (LDC) catalog number LDC2005T16 and ISBN 1-58563-339-9.

The TDT4 corpora were created by Linguistic Data Consortium with support from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. This release contains the complete set of English, Arabic and Chinese news text (broadcast news transcripts and newswire data) used in the 2002 and 2003 Topic Detection and Tracking technology evaluations, along with topic annotations created for those evaluations. The audio corresponding to the broadcast news transcripts contained in this release can be found in LDC Publication LDC2005S11, TDT4 Multilingual Broadcast News Speech Corpus.

Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. Evaluation tasks in 2002 and 2003 included the segmentation of a news source into stories, the tracking of known topics, the detection of unknown topics, the detection of initial stories on unknown topics, and the detection of pairs of stories on the same topic.

*Samples*

To see an example of this corpus, please examine this sample. This sample is an English translation from an Arabic news broadcast. The translation is the product of the IBM Arabic to English translation engine.

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
- references: Stephanie Strassel, Junbo Kong, and David Graff 2005 TDT4 Multilingual Text and Annotations Linguistic Data Consortium, Philadelphia
- hasVersion: C-001287: TDT Pilot Study Corpus
- hasVersion: C-001288: TDT2 Careful Transcription Audio
- hasVersion: C-001289: TDT2 Careful Transcription Text
- hasVersion: C-001290: TDT2 English Audio
- hasVersion: C-001291: TDT2 Mandarin Audio Corpus
- hasVersion: C-001292: TDT2 Multilanguage Text Version 4.0
- hasVersion: C-001293: TDT3 English Audio
- hasVersion: C-001294: TDT3 Mandarin Audio
- hasVersion: C-001295: TDT3 Multilanguage Text Version 2.0
- hasVersion: C-001296: TDT4 Multilingual Broadcast News Speech Corpus
- hasVersion: C-001298: TDT5 Multilingual Text
- hasVersion: C-001299: TDT5 Topics and Annotations
C-001298: TDT5 Multilingual Text
*Introduction*

The TDT5 corpora were created by Linguistic Data Consortium with support from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. This release contains the complete set of English, Arabic and Chinese newswire text used in the 2004 Topic Detection and Tracking technology evaluations. The topic relevance annotations corresponding to this publication can be found in LDC Publication LDC2006T19, TDT5 Topics and Annotations.

Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news.

There were four TDT tasks defined for the 2004 evaluation: the tracking of known topics, the detection of unknown topics, the detection of initial stories on unknown topics, and the detection of pairs of stories on the same topic (links). Of these four tasks, the topic tracking task and the link detection task are considered to be "primary." Previous TDT evaluations also included a story segmentation task. This task applied only to broadcast news. Since TDT5 does not include broadcast news, there is no story segmentation task in the 2004 TDT Evaluation.

*Samples*

The images below are samples of the text data contained in this corpus.

* Arabic
* English
* Chinese
- references: David Graff, et al. 2006 TDT5 Multilingual Text Linguistic Data Consortium, Philadelphia
- hasVersion: C-001287: TDT Pilot Study Corpus
- hasVersion: C-001288: TDT2 Careful Transcription Audio
- hasVersion: C-001289: TDT2 Careful Transcription Text
- hasVersion: C-001290: TDT2 English Audio
- hasVersion: C-001291: TDT2 Mandarin Audio Corpus
- hasVersion: C-001292: TDT2 Multilanguage Text Version 4.0
- hasVersion: C-001293: TDT3 English Audio
- hasVersion: C-001294: TDT3 Mandarin Audio
- hasVersion: C-001295: TDT3 Multilanguage Text Version 2.0
- hasVersion: C-001296: TDT4 Multilingual Broadcast News Speech Corpus
- hasVersion: C-001297: TDT4 Multilingual Text and Annotations
- hasVersion: C-001299: TDT5 Topics and Annotations
C-001299: TDT5 Topics and Annotations
*Introduction*

This file contains documentation on the TDT5 Topics and Annotations, Linguistic Data Consortium (LDC) catalog number LDC2006T19 and ISBN 1-58563-418-2.

This release includes topic relevance judgments and associated information for the TDT5 2004 evaluation topics. This release contains complete relevance judgments, including the results of adjudication, in which discrepancies between system submissions and LDC annotations are reviewed and relevance judgments updated. This release also contains answer keys for the link detection task.

The TDT5 corpora were created by Linguistic Data Consortium with support from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. The multilingual news text corresponding to this publication can be found in LDC Publication LDC2006T18, TDT5 Multilingual News Text.

*Data*

A total of 250 topics, numbered 55001 - 55250, were annotated by LDC using a search-guided annotation technique. Details of the annotation process are described in the annotation task definition. Approximately 25% of the topics are monolingual English (ENG), 25% are monolingual Mandarin Chinese (MAN), 25% are monolingual Arabic (ARB), and 25% are multilingual:

* 63 ENG
* 62 MAN
* 62 ARB
* 35 ARB ENG MAN
* 21 ENG MAN
* 7 ARB ENG
* 250 total

Broken down by language and counting both mono- and multi-lingual topics:

* 126 ENG
* 118 MAN
* 104 ARB
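
The per-language figures follow from the topic counts by adding each multilingual group to every language it covers. A minimal sketch of that arithmetic, using only the numbers given above (the variable names are illustrative):

```python
from collections import Counter

# Number of topics for each language combination, as listed above.
TOPIC_COUNTS = {
    ("ENG",): 63,
    ("MAN",): 62,
    ("ARB",): 62,
    ("ARB", "ENG", "MAN"): 35,
    ("ENG", "MAN"): 21,
    ("ARB", "ENG"): 7,
}

total_topics = sum(TOPIC_COUNTS.values())

# A multilingual topic counts once toward each of its languages.
per_language = Counter()
for languages, count in TOPIC_COUNTS.items():
    for language in languages:
        per_language[language] += count

multilingual = sum(n for langs, n in TOPIC_COUNTS.items() if len(langs) > 1)

print(total_topics)
print(dict(per_language))
print(f"multilingual share: {multilingual / total_topics:.1%}")
```

The totals match the stated 250 topics and the 126 ENG, 118 MAN and 104 ARB breakdown.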

*Samples*

For an example of the data in this corpus, please review this sample from the link detection files.
- references: Meghan Glenn, et al. 2006 TDT5 Topics and Annotations Linguistic Data Consortium, Philadelphia
- hasVersion: C-001287: TDT Pilot Study Corpus
- hasVersion: C-001288: TDT2 Careful Transcription Audio
- hasVersion: C-001289: TDT2 Careful Transcription Text
- hasVersion: C-001290: TDT2 English Audio
- hasVersion: C-001291: TDT2 Mandarin Audio Corpus
- hasVersion: C-001292: TDT2 Multilanguage Text Version 4.0
- hasVersion: C-001293: TDT3 English Audio
- hasVersion: C-001294: TDT3 Mandarin Audio
- hasVersion: C-001295: TDT3 Multilanguage Text Version 2.0
- hasVersion: C-001296: TDT4 Multilingual Broadcast News Speech Corpus
- hasVersion: C-001297: TDT4 Multilingual Text and Annotations
- hasVersion: C-001298: TDT5 Multilingual Text
C-001300: TI 46-Word
*Introduction*

This release contains a corpus of speech which was originally designed and collected at Texas Instruments, Inc. (TI) in 1980 and used initially in performance assessment tests of isolated-word speaker-dependent technology. (See "Speech Recognition: Turning Theory to Practice" by G. R. Doddington and T. B. Schalk, in IEEE Spectrum, Vol. 18, No. 9, September 1981.)

The 46-word vocabulary consists of two sub-vocabularies: (1) the TI 20-word vocabulary (consisting of the digits zero through nine plus the words "enter," "erase," "go," "help," "no," "rubout," "repeat," "stop," "start," and "yes") as well as (2) the TI 26-word "alphabet set" (consisting of the letters "a" through "z").

*Data*

The corpus contains read utterances from 16 speakers (eight males and eight females) each speaking 26 utterances of the 46-word vocabulary: 16 tokens designated as training and ten as test. Note these numbers reflect the aim of the collection and for various reasons, the full number of utterances was not reached for some speakers. See the included documentation for more information.
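
The collection targets described above imply the following utterance counts. This sketch computes the design targets only; as the text notes, the actual file inventory falls short of them for some speakers.

```python
SPEAKERS = 16               # eight male, eight female
VOCABULARY_SIZE = 46        # 20-word set plus 26-letter alphabet set
TRAIN_TOKENS_PER_WORD = 16  # tokens designated as training
TEST_TOKENS_PER_WORD = 10   # tokens designated as test

tokens_per_word = TRAIN_TOKENS_PER_WORD + TEST_TOKENS_PER_WORD

# Target counts if every speaker had completed the full design.
target_total = SPEAKERS * VOCABULARY_SIZE * tokens_per_word
target_train = SPEAKERS * VOCABULARY_SIZE * TRAIN_TOKENS_PER_WORD
target_test = SPEAKERS * VOCABULARY_SIZE * TEST_TOKENS_PER_WORD

print(f"target utterances: {target_total}")
print(f"training: {target_train}, test: {target_test}")
```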

The corpus was collected at Texas Instruments in a quiet acoustic enclosure using an Electro-Voice RE-16 Dynamic Cardioid microphone at 12.5 kHz sample rate with 12-bit quantization. The files are in NIST SPHERE format and have a ".wav" filename extension.

*Updates*

As of October 5, 2016 the documentation was updated to more closely reflect the file inventory.
- references: Mark Liberman, et al. 1993 TI 46-Word Linguistic Data Consortium, Philadelphia
