Number of registered language resources: 3,330. Showing items 731-740 of 2,023.
  • C-001301: TIDES Extraction (ACE) 2003 Multilingual Training Data
    *Introduction*

    TIDES Extraction (ACE) 2003 Multilingual Training Data was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T09 and ISBN 1-58563-292-9.

    This corpus was created and previously distributed by Linguistic Data Consortium as an e-corpus (catalog number LDC2003E18) to support the September 2003 TIDES Extraction (ACE) program evaluation. For information regarding the ACE program and ACE technology evaluations administered by the National Institute of Standards and Technology, please visit the NIST website. For more information about ACE annotation and ongoing ACE corpus development, including annotation guidelines, task definitions, annotation tools and other project documentation, please visit LDC's ACE Project page.

    The source material for this corpus consists of broadcast and newswire data drawn from October 2000 through the end of December 2000. The sources are listed below.

    Newswire:

    * Arabic
      * Agence France Presse (AFA)
      * Al Hayat (ALH)
      * An-Nahar (ANN)

    * Chinese
      * Xinhua Newswire (XIN)
      * Zaobao (ZBN)

    * English
      * New York Times Newswire Service (NYT)
      * Associated Press Worldstream Service (APW)

    Broadcast News:

    * Arabic
      * Voice of America, Arabic news programs (VAR)
      * Nile TV (NTV)

    * Chinese
      * China National Radio (CNR)
      * China Television System (CTS)
      * Voice of America, Chinese news programs (VOM)
      * China TV Program Agency (CTV)
      * China Broadcasting System (CBS)

    * English
      * Cable News Network, "Headline News" (CNN)
      * American Broadcasting Co., "World News Tonight" (ABC)
      * Public Radio International, "The World" (PRI)
      * Voice of America, English news programs (VOA)
      * MSNBC, "The News With Brian Williams" (MNB)
      * National Broadcasting Company, "Nightly News" (NBC)

    *Data*

    Annotations for this corpus were produced by Linguistic Data Consortium to support the following tasks broken down by language:

    * Arabic
      * Entity Detection and Tracking (EDT)

    * Chinese
      * Entity Detection and Tracking (EDT)
      * Relation Detection and Characterization (RDC)

    * English
      * Entity Detection and Tracking (EDT)
      * Relation Detection and Characterization (RDC)

    This publication includes both the source data files in .sgm format and the annotation files in ACE Pilot Format (APF), as well as the ACE DTD and supporting documentation.

    The data files for each language are divided by source type (bnews, nwire). For Chinese, the annotation files (.apf.xml) are encoded in UTF-8, and the source files (.sgm) are included in both GB and UTF-8 encodings. The following tables give word (or character) and file counts by language and source.
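    The APF annotation files are plain XML, so listing the annotated entity mentions takes only the standard library. A minimal sketch, assuming the ACE APF element names (`entity`, `entity_mention`, `extent`, `charseq`); the sample document used for illustration is hypothetical:

```python
import xml.etree.ElementTree as ET

def list_entity_mentions(apf_xml: str):
    """Collect (entity ID, mention text) pairs from an APF document.

    Assumes the ACE APF layout: <entity> elements carrying an ID
    attribute, each containing <entity_mention>/<extent>/<charseq>
    nodes whose text is the mention string.
    """
    root = ET.fromstring(apf_xml)
    mentions = []
    for entity in root.iter("entity"):
        ent_id = entity.get("ID")
        for charseq in entity.iter("charseq"):
            mentions.append((ent_id, (charseq.text or "").strip()))
    return mentions
```

    Real APF files also carry character offsets and entity types; this sketch only pulls out the mention strings.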

    Arabic

    Source    Words   Files
    AFA       11154      66
    ALH        7437      20
    ANN        7734      20
    VAR        8360      57
    NTV        7512      43
    Total     42197     206

    Chinese

    Source    Characters   Files
    XIN            28157      57
    ZBN            25591      42
    CNR             4758      21
    CTS             7160      22
    VOM            18160      42
    CTV             6017      18
    CBS             8130      19
    Total          97973     221

    English

    Source    Words   Files
    NYT       18983      24
    APW       38222      81
    CNN        5706      54
    ABC        4453      15
    PRI        9785      27
    VOA        4203      28
    MNB        4356       8
    NBC        4976      15
    Total     90684     252

    *Updates*

    There are no updates available at this time.

    © 2000 American Broadcasting Corporation
    © 2000 Cable News Network, Inc.
    © 2000 Press Association, Inc.
    © 2000 New York Times
    © 2000 National Broadcasting Company, Inc.
    © 2000 Public Radio International
    © 2000 Agence France Presse
    © 2000 Al Hayat
    © 2000 An-Nahar
    © 2000 Nile TV
    © 2000 Xinhua News
    © 2000 SPH AsiaOne Ltd.
    © 2000 China National Radio
    © 2000 China Television System
    © 2000 China TV Program Agency
    © 2000 China Broadcasting System
    • references: Alexis Mitchell, et al. 2004. TIDES Extraction (ACE) 2003 Multilingual Training Data. Linguistic Data Consortium, Philadelphia.
    • isReplacedBy: LDC2003E18
  • C-001302: TIDIGITS
    This corpus contains speech which was originally designed and collected at Texas Instruments, Inc. (TI) for the purpose of designing and evaluating algorithms for speaker-independent recognition of connected digit sequences. There are 326 speakers (111 men, 114 women, 50 boys and 51 girls) each pronouncing 77 digit sequences. Each speaker group is partitioned into test and training subsets.

    The corpus was collected at TI in 1982 in a quiet acoustic enclosure using an Electro-Voice RE-16 Dynamic Cardioid microphone and digitized at 20 kHz. The waveform files are in the NIST SPHERE format.
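    A SPHERE file opens with a fixed ASCII header: the line `NIST_1A`, a line giving the header size in bytes, then `key -type value` fields terminated by `end_head`. A small header parser, sketched under those assumptions rather than taken from the official NIST tools:

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header that opens a NIST SPHERE file."""
    lines = raw.decode("ascii", errors="replace").splitlines()
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a SPHERE file")
    header = {"header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        key, type_tag, value = line.split(None, 2)
        # "-i" marks integer fields; everything else is kept as a string.
        header[key] = int(value) if type_tag.startswith("-i") else value
    return header
```

    For TIDIGITS one would expect fields such as `sample_rate` (20000) in the header, matching the 20 kHz digitization described above.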

    *Updates*

    As of April 2015, TIDIGITS is also available in FLAC-compressed format; this package is available to licensees as an additional download. This version does not include the folders relating to the shortened SPHERE files of the original corpus.
    • references: R. Gary Leonard and George Doddington. 1993. TIDIGITS. Linguistic Data Consortium, Philadelphia.
  • C-001303: TIMIT Acoustic-Phonetic Continuous Speech Corpus
    *Introduction*

    The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation.
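    Because the time-aligned transcription files give segment boundaries as sample offsets into the 16 kHz waveform, converting them to seconds is a one-line division. A sketch, assuming the usual `.phn`/`.wrd` line format of `<begin-sample> <end-sample> <label>`:

```python
SAMPLE_RATE = 16000  # TIMIT waveforms are 16-bit, 16 kHz

def read_aligned_labels(text: str):
    """Turn TIMIT-style aligned label lines into (start_s, end_s, label)."""
    segments = []
    for line in text.strip().splitlines():
        begin, end, label = line.split()
        segments.append((int(begin) / SAMPLE_RATE,
                         int(end) / SAMPLE_RATE,
                         label))
    return segments
```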

    *Samples*

    * phonemes
    * transcripts
    * audio
    * word list
  • C-001304: TIPSTER Complete
    LDC93T3A - Complete TIPSTER corpus

    LDC93T3B - Volume 1 of the TIPSTER corpus

    LDC93T3C - Volume 2 of the TIPSTER corpus

    LDC93T3D - Volume 3 of the TIPSTER corpus

    TIPSTER is sometimes also called the Text Research Collection Volume or TREC.

    The TIPSTER project was sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections.

    The detection data comprises a test collection built at NIST for the TIPSTER project and the related TREC project. The TREC project has many other participating information retrieval research groups, working on the same task as the TIPSTER groups, but meeting once a year in a workshop to compare results (similar to MUC). The test collection consists of three CD-ROMs of SGML encoded documents distributed by LDC plus queries and answers (relevant documents) distributed by NIST.

    Source (vol)               Year     Approx. # Words (Millions)
    Associated Press (1)       1989     40
    Associated Press (2)       1988     37
    Associated Press (3)       1990     37
    Wall Street Journal (1)    1987     20
    Wall Street Journal (1)    1988     17
    Wall Street Journal (1)    1989     6
    Wall Street Journal (2)    1990     11
    Wall Street Journal (2)    1991     22
    Wall Street Journal (2)    1992     5
    Dept. of Energy (1)        --       28
    Federal Register (1)       1989     38
    Federal Register (2)       1988     30
    Ziff/Davis (1)             --       36
    Ziff/Davis (2)             1989-90  26
    Ziff/Davis (3)             1991-92  50
    San Jose Mercury News (3)  1991     45

    The documents in the test collection are varied in style, size and subject domain. The first disk contains material from the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal Register (1989), information from Computer Select disks (Ziff-Davis Publishing) and short abstracts from the Department of Energy. The second disk contains information from the same sources, but from different years. The third disk contains more information from the Computer Select disks, plus material from the San Jose Mercury News (1991), more AP newswire (1990) and about 250 megabytes of formatted U.S. Patents. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging or breakdown into individual sentences or paragraphs as the purpose of this collection is to test retrieval against real-world data.
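    Because the SGML-like tags are flat (one `<DOC>` block per document, never nested), iterating over a collection file needs no real SGML parser. A sketch assuming the common TREC `<DOC>`/`<DOCNO>` layout; the document IDs used for illustration are invented:

```python
import re

def iter_trec_docs(collection: str):
    """Yield (docno, body) for each <DOC> block in a TREC-style file."""
    for block in re.findall(r"<DOC>(.*?)</DOC>", collection, re.S):
        m = re.search(r"<DOCNO>\s*(.*?)\s*</DOCNO>", block, re.S)
        yield (m.group(1) if m else None), block
```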

    The three Tipster discs released have been re-issued with updates and corrections and all recipients of the earlier versions should have received these replacements free of charge. If you think you have the unrevised original, contact LDC for confirmation.
  • C-001305: TIPSTER Volume 1
    LDC93T3A - Complete TIPSTER corpus

    LDC93T3B - Volume 1 of the TIPSTER corpus

    LDC93T3C - Volume 2 of the TIPSTER corpus

    LDC93T3D - Volume 3 of the TIPSTER corpus

    TIPSTER 1 is sometimes also called the Text Research Collection Volume 1 or TREC-1.

    The TIPSTER project was sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections.

    The detection data comprises a test collection built at NIST for the TIPSTER project and the related TREC project. The TREC project has many other participating information retrieval research groups, working on the same task as the TIPSTER groups, but meeting once a year in a workshop to compare results (similar to MUC). The test collection consists of three CD-ROMs of SGML encoded documents distributed by LDC plus queries and answers (relevant documents) distributed by NIST.

    Source               Year  Approx. # Words (Millions)
    Associated Press     1989  40
    Wall Street Journal  1987  20
    Wall Street Journal  1988  17
    Wall Street Journal  1989  6
    Dept. of Energy      --    28
    Federal Register     1989  38
    Ziff/Davis           --    36

    The documents in the test collection are varied in style, size and subject domain. The first disk contains material from the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal Register (1989), information from Computer Select disks (Ziff-Davis Publishing) and short abstracts from the Department of Energy. The second disk contains information from the same sources, but from different years. The third disk contains more information from the Computer Select disks, plus material from the San Jose Mercury News (1991), more AP newswire (1990) and about 250 megabytes of formatted U.S. Patents. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging or breakdown into individual sentences or paragraphs as the purpose of this collection is to test retrieval against real-world data.

    The three Tipster discs so far released have been re-issued with updates and corrections and all recipients of the earlier versions should have received these replacements free of charge. If you think you have the unrevised original, contact LDC for confirmation.
  • C-001306: TIPSTER Volume 2
    LDC93T3A - Complete TIPSTER corpus

    LDC93T3B - Volume 1 of the TIPSTER corpus

    LDC93T3C - Volume 2 of the TIPSTER corpus

    LDC93T3D - Volume 3 of the TIPSTER corpus

    TIPSTER 2 is sometimes also called the Text Research Collection Volume 2 or TREC-2.

    The TIPSTER project was sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections.

    The detection data comprises a test collection built at NIST for the TIPSTER project and the related TREC project. The TREC project has many other participating information retrieval research groups, working on the same task as the TIPSTER groups, but meeting once a year in a workshop to compare results (similar to MUC). The test collection consists of three CD-ROMs of SGML encoded documents distributed by LDC plus queries and answers (relevant documents) distributed by NIST.

    Source               Year     Approx. # Words (Millions)
    Associated Press     1988     37
    Wall Street Journal  1990     11
    Wall Street Journal  1991     22
    Wall Street Journal  1992     5
    Federal Register     1988     30
    Ziff/Davis           1989-90  26

    The documents in the test collection are varied in style, size and subject domain. The first disk contains material from the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal Register (1989), information from Computer Select disks (Ziff-Davis Publishing) and short abstracts from the Department of Energy. The second disk contains information from the same sources, but from different years. The third disk contains more information from the Computer Select disks, plus material from the San Jose Mercury News (1991), more AP newswire (1990) and about 250 megabytes of formatted U.S. Patents. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging or breakdown into individual sentences or paragraphs as the purpose of this collection is to test retrieval against real-world data.

    The three Tipster discs so far released have been re-issued with updates and corrections and all recipients of the earlier versions should have received these replacements free of charge. If you think you have the unrevised original, contact LDC for confirmation.
  • C-001307: TIPSTER Volume 3
    LDC93T3A - Complete TIPSTER corpus

    LDC93T3B - Volume 1 of the TIPSTER corpus

    LDC93T3C - Volume 2 of the TIPSTER corpus

    LDC93T3D - Volume 3 of the TIPSTER corpus

    TIPSTER 3 is sometimes also called the Text Research Collection Volume 3 or TREC-3.

    The TIPSTER project was sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections.

    The detection data comprises a test collection built at NIST for the TIPSTER project and the related TREC project. The TREC project has many other participating information retrieval research groups, working on the same task as the TIPSTER groups, but meeting once a year in a workshop to compare results (similar to MUC). The test collection consists of three CD-ROMs of SGML encoded documents distributed by LDC plus queries and answers (relevant documents) distributed by NIST.

    Source            Year     Approx. # Words (Millions)
    Associated Press  1990     37
    San Jose Mercury  1991     45
    Ziff/Davis        1991-92  50

    The documents in the test collection are varied in style, size and subject domain. The first disk contains material from the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal Register (1989), information from Computer Select disks (Ziff-Davis Publishing) and short abstracts from the Department of Energy. The second disk contains information from the same sources, but from different years. The third disk contains more information from the Computer Select disks, plus material from the San Jose Mercury News (1991), more AP newswire (1990) and about 250 megabytes of formatted U.S. Patents. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging or breakdown into individual sentences or paragraphs as the purpose of this collection is to test retrieval against real-world data.

    The three Tipster discs so far released have been re-issued with updates and corrections and all recipients of the earlier versions should have received these replacements free of charge. If you think you have the unrevised original, contact LDC for confirmation.
  • C-001308: TRAINS Spoken Dialog Corpus
    This release contains a corpus of task-oriented spoken dialogs. These dialogs were collected in 1993 at the University of Rochester Department of Computer Science as part of the TRAINS project, a project to develop a conversationally proficient planning assistant, which helps a user construct a plan to achieve some task involving the manufacturing and shipment of goods in a railroad freight system. The collection procedure was designed to make the setting as close to human-computer interaction as possible, but was not a "wizard" scenario, where one person pretends to be a computer. Thus these dialogs provide a snapshot into an ideal human-computer interface that would be able to engage in fluent conversations.

    Altogether, this corpus includes 98 dialogs, collected using 20 different tasks and 34 different speakers. This amounts to six and a half hours of speech, about 5,900 speaker turns and 55,000 transcribed words.
    • references: James Allen and Peter A. Heeman. 1995. TRAINS Spoken Dialog Corpus. Linguistic Data Consortium, Philadelphia.
  • C-001309: TREC Mandarin
    This publication contains the TREC ("Text REtrieval Conference") Mandarin Corpus used for the Chinese task in TRECs 5-6. It consists of approximately 170 megabytes of articles drawn from the People's Daily newspaper and the Xinhua newswire, formatted to include TREC document IDs. The text is Mandarin Chinese and is encoded using the GB encoding scheme. The topics (questions) and relevance judgments (right answers) are not included in this publication but can be downloaded from the Data/Non-English section of the TREC web site.

    The Mandarin Chinese text data is from the Xinhua News Agency and the People's Daily News Service (both from mainland China). This collection of text was originally gathered by the Linguistic Data Consortium (LDC) and then adapted by the National Institute of Standards and Technology (NIST) for use in the TREC Mandarin evaluation program.
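    Since the text is GB-encoded, converting it for modern UTF-8 pipelines is a single re-encoding step. A Python sketch; decoding with GB18030 (a superset of GB2312/GBK) is an assumption made here for robustness, not something the corpus documentation specifies:

```python
def gb_to_utf8(data: bytes) -> bytes:
    """Re-encode GB-encoded bytes as UTF-8.

    GB18030 decodes any valid GB2312/GBK byte stream, so it is a safe
    choice when the exact GB variant is not known.
    """
    return data.decode("gb18030").encode("utf-8")
```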
    • references: Willie Rogers. 2000. TREC Mandarin. Linguistic Data Consortium, Philadelphia.
    • references: N-001310: TREC Spanish
  • C-001311: TRECVID 2005 Keyframes & Transcripts
    *Introduction*

    This file contains documentation for TRECVID 2005 Keyframes & Transcripts, Linguistic Data Consortium (LDC) catalog number LDC2007V01 and ISBN 1-58563-437-9.

    TREC Video Retrieval Evaluation (TRECVID) is sponsored by the National Institute of Standards and Technology (NIST) to promote progress in content-based retrieval from digital video via open, metrics-based evaluation. The keyframes in this release were extracted for use in the NIST TRECVID 2005 Evaluation.

    TRECVID is a laboratory-style evaluation that attempts to model real world situations or significant component tasks involved in such situations. In 2005 there were four main tasks with associated tests:

    * shot boundary determination
    * low-level feature extraction
    * high-level feature extraction
    * search (interactive, manual, and automatic)
    For a detailed description of the TRECVID Evaluation Tasks, please refer to the NIST TRECVID 2005 Evaluation Description.

    *Data*

    The source data is Arabic, Chinese and English language broadcast programming collected in November 2004 from the following sources: Lebanese Broadcasting Corp. (Arabic); China Central TV and New Tang Dynasty TV (Chinese); and CNN and MSNBC/NBC (English).

    Shots are fundamental units of video, useful for higher-level processing. To create the master list of shots, the video was segmented. The results of this pass are called subshots. Because the master shot reference is designed for use in manual assessment, a second pass over the segmentation was made to create master shots of at least 2 seconds in length. These master shots are the ones used in submitting results for the feature and search tasks in the evaluation. In the second pass, starting at the beginning of each file, the subshots were aggregated, if necessary, until the current shot was at least 2 seconds in duration, at which point the aggregation began anew with the next subshot.
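    The second-pass aggregation described above can be sketched as a simple accumulation over contiguous (start, end) subshots; how a short trailing subshot is handled is an assumption, since the description does not say:

```python
def aggregate_shots(subshots, min_len=2.0):
    """Merge contiguous subshots into master shots of >= min_len seconds."""
    shots, current = [], None
    for start, end in subshots:
        if current is None:
            current = [start, end]       # open a new master shot
        else:
            current[1] = end             # absorb the next subshot
        if current[1] - current[0] >= min_len:
            shots.append(tuple(current))
            current = None               # aggregation begins anew
    if current is not None:              # short tail (assumed behaviour)
        shots.append(tuple(current))
    return shots
```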

    The keyframes were selected by taking the middle frame of the shot, then searching left and right of that frame to locate the nearest I-frame. This frame became the keyframe and was extracted. Keyframes have been provided at both the subshot (NRKF) and master shot (RKF) levels.

    In a small number of cases (all of them subshots) there was no I-frame within the subshot boundaries. When this occurred, the middle frame was selected. There is one anomaly: at the end of the first video in the test collection, a subshot occurs outside a master shot.
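    The selection rule, including the middle-frame fallback, can be sketched as follows; the frame numbers and I-frame list are illustrative inputs, not the evaluation's actual data:

```python
def pick_keyframe(shot_start: int, shot_end: int, iframes) -> int:
    """Pick a keyframe: the I-frame nearest the shot's middle frame,
    or the middle frame itself if the shot contains no I-frame."""
    middle = (shot_start + shot_end) // 2
    inside = [f for f in iframes if shot_start <= f <= shot_end]
    if not inside:
        return middle
    return min(inside, key=lambda f: abs(f - middle))
```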

    The emphasis in the common shot boundary reference is on the shots, not the transitions. The shots are contiguous. There are no gaps between them. They do not overlap. The media time format is based on the Gregorian day time (ISO 8601) norm. Fractions are defined by counting pre-specified fractions of a second.

    *Samples*

    The Keyframe below is a sample of the data contained in this corpus.

    For information about this frame, please examine this annotation file.
    • references: Georges Quenot, Christian Petersohn, Kevin Walker. 2007. TRECVID 2005 Keyframes & Transcripts. Linguistic Data Consortium, Philadelphia.
    • references: TRECVID 2003 Keyframes & Transcripts