Registered language resources: 3330. Showing items 281 - 290 of 2023.
  • C-000591: 2002 NIST Speaker Recognition Evaluation
    *Introduction*

    The 2002 NIST Speaker Recognition Evaluation corpus was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004S04, ISBN 1-58563-293-7.

    The 2002 NIST Speaker Recognition Evaluation is part of an ongoing series of yearly evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition. To this end, the evaluation was designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible.

    The 2002 NIST Speaker Recognition Evaluation main data was extracted from the Switchboard Cellular Part 2. The extended data task used two phases of Switchboard II, Phases 2 and 3. This evaluation also included the first multi-modal task, using data from the FBI voice database.

    Supporting documentation for this evaluation may be found on the 2002 NIST Speaker Recognition Evaluation website. Please consult the NIST evaluation plan for detailed instructions on using this evaluation material.

    *Data*

    There are a total of 9,153 speech files (6,098 at 8 kHz and 3,055 at 16 kHz), all in SPHERE format, for a total of approximately 156 hours.

    The data was initially distributed by NIST on 13 CD-ROMs (r81_1_1 through r81_13_1). This corpus consists of training and test data and replicates exactly the content and structure of the 13 CD-ROMs.

    *Updates*

    There are no available updates at this time.
  • C-000592: 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
    *Introduction*

    2002 Rich Transcription Broadcast News and Conversational Telephone Speech was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004S11, ISBN 1-58563-311-9.

    This corpus contains the test material used in the 2002 Rich Transcription (RT-02) Evaluation of Broadcast News and Conversational Telephone Speech, administered by the NIST Speech Group in the Spring of 2002. The RT-02 Meeting Recognition Evaluation material is available in a separate distribution. For complete up-to-date information, see the RT-02 Evaluation Website.

    The RT-02 Evaluation supported two main evaluation tasks:

    * Speech-To-Text (STT) Tasks -- included three processing speeds (1x real time, 10x real time, and unlimited time) for both the Broadcast News (BN) and Conversational Telephone Speech (CTS) domains.
    * Metadata Extraction (MDE) Task -- consisted of a speaker diarization task for the BN and CTS domains.

    *Data*

    This distribution of the RT-02 Evaluation Data contains only Broadcast News and Conversational Telephone Speech data. Meeting data used in the RT-02 Evaluation is not included in this distribution and is packaged in a separate distribution. All recordings are in English.

    The BN data is composed of six approximately 10-minute excerpts from six different broadcasts. Each waveform is a SPHERE-headered, single-channel, 16-bit PCM file. The broadcasts were selected from programs from MNB, PRI, NBC, CNN, VOA and ABC, all collected in 1998. The evaluation excerpts were transcribed to the nearest story boundary.

    The CTS data is composed of 60 approximately five-minute excerpts from 60 different conversations: 20 from Switchboard-1 data, 20 from Switchboard-2 data, and 20 from Switchboard Cellular-2 data. Evaluation excerpts were transcribed to the nearest turn. Unlike the BN audio files, where the full broadcasts were provided, the CTS audio files contain only the evaluation excerpts. Each audio excerpt is a SPHERE-headered, two-channel interleaved 8-bit mu-law file.
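The two-channel interleaved 8-bit mu-law layout described above can be illustrated with a minimal Python sketch. It assumes standard G.711 mu-law coding, that the SPHERE header has already been stripped from the payload, and that samples alternate one byte per channel. The function names and the demo bytes are hypothetical and are not part of the corpus tooling.

```python
def mulaw_byte_to_pcm16(code: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    code = ~code & 0xFF                      # mu-law bytes are stored inverted
    mantissa = code & 0x0F
    exponent = (code & 0x70) >> 4
    magnitude = ((mantissa << 3) + 0x84) << exponent
    return (0x84 - magnitude) if (code & 0x80) else (magnitude - 0x84)


def split_channels(payload: bytes) -> tuple[list[int], list[int]]:
    """De-interleave a two-channel payload and decode each channel.

    Assumption: even byte offsets belong to the first channel and odd
    offsets to the second.
    """
    first = [mulaw_byte_to_pcm16(b) for b in payload[0::2]]
    second = [mulaw_byte_to_pcm16(b) for b in payload[1::2]]
    return first, second


# Hypothetical four-byte payload: two sample frames, two channels each.
demo = bytes([0xFF, 0x00, 0x7F, 0x80])
ch_a, ch_b = split_channels(demo)
print(ch_a)  # [0, 0]
print(ch_b)  # [-32124, 32124]
```

Each conversation side lands in its own list of linear samples, which is the form most signal-processing code expects.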

    The reference transcripts are also provided in this corpus. The official format for STT reference data is STM (files with the extension 'stm'), while the official format for MDE reference data is RTTM (files with the extension 'rttm'). Files with the extensions 'txt' or 'utf' are the original reference transcripts before any format conversions, additions of annotations, etc., and are included for completeness.

    *Samples*

    Please examine this example to review a sample of this corpus.

    *Updates*

    There are no updates available at this time.

    "The World" is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
  • C-000593: ACE 2004 Multilingual Training Corpus
    *Introduction*

    This file contains documentation on the ACE 2004 Multilingual Training Corpus, Linguistic Data Consortium (LDC) catalog number LDC2005T09 and ISBN 1-58563-334-8.

    This publication contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations and was created by Linguistic Data Consortium with support from the ACE Program, with additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. This data was previously distributed as an e-corpus (LDC2004E17) to participants in the 2004 ACE evaluation.

    The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.

    The current publication consists of the official training data for these evaluation tasks. A seventh evaluation area, Timex Detection and Recognition, is supported by the ACE Time Normalization (TERN) 2004 English Training Data Corpus (LDC2005T07). The TERN corpus source data largely overlaps with the English source data contained in the current release.

    A complete description of the ACE 2004 Evaluation can be found on the ACE Program website maintained by the National Institute of Standards and Technology (NIST): http://www.nist.gov/speech/tests/ace/

    For more information about linguistic resources for the ACE program, including annotation guidelines, task definitions, free annotation tools and other documentation, please visit LDC's ACE website.

    *Samples*

    The files listed below are samples from the English data. They should provide a good example of the material in this corpus.

    * Chinese Treebank
    * Fisher Transcripts
    * Broadcast News

  • C-000594: ACE 2005 Multilingual Training Corpus
    *Introduction*

    This publication contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events, and was created by the Linguistic Data Consortium with support from the ACE Program, with additional assistance from LDC. This data was previously distributed as an e-corpus (LDC2005E18) to participants in the 2005 ACE evaluation.

    The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form.

    In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation and event mention detection were also offered as diagnostic tasks. All tasks except the event tasks were performed for three languages: English, Chinese and Arabic. Event tasks were evaluated in English and Chinese only. The current publication comprises the official training data for these evaluation tasks.

    A complete description of the ACE 2005 Evaluation can be found on the ACE Program website maintained by the National Institute of Standards and Technology (NIST).

    For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions, free annotation tools and other documentation, please visit LDC's ACE website.

    Below is information about the amount of data included in the current release and its annotation status.

    * 1P: data subject to first pass (complete) annotation
    * DUAL: data also subject to dual first pass (complete) annotation
    * ADJ: data also subject to discrepancy resolution/adjudication
    * NORM: data also subject to TIMEX2 normalization

    English

    Source   Words                             Files
             1P      DUAL    ADJ     NORM      1P    DUAL  ADJ   NORM
    NW       60658   57807   33459   48399     128   124   81    106
    BN       59239   58144   52444   55967     239   234   217   226
    BC       46612   46110   33874   40415     68    67    52    60
    WL       45210   43648   35529   37897     127   122   114   119
    UN       45161   44473   26371   37366     58    57    37    49
    CTS      47003   47003   34868   39845     46    46    34    39
    Total    303833  297185  216545  259889    666   650   535   599

    Chinese

    Note: Chinese data is expressed in terms of characters. We assume a correspondence of roughly 1.5 characters per word.

    Source   Characters                Files
             1P      DUAL    ADJ       1P    DUAL  ADJ
    NW       127319  124175  121797    248   242   238
    BN       134963  133696  120513    332   328   298
    WL       71839   68063   65681     107   101   97
    Total    334121  325834  307991    687   671   633
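To compare the Chinese figures with the word counts given for the other languages, the stated ratio of roughly 1.5 characters per word can be applied to the first-pass (1P) character counts. The short Python sketch below does only that arithmetic. The variable names are illustrative, and the results are rough estimates rather than official word counts.

```python
CHARS_PER_WORD = 1.5  # approximate ratio stated in the corpus note

# First-pass (1P) character counts for the Chinese data, by source.
chinese_1p_chars = {"NW": 127319, "BN": 134963, "WL": 71839}

# Estimated word counts per source, rounded to the nearest word.
approx_words = {
    source: round(chars / CHARS_PER_WORD)
    for source, chars in chinese_1p_chars.items()
}

total_chars = sum(chinese_1p_chars.values())
total_words = round(total_chars / CHARS_PER_WORD)

print(approx_words)   # {'NW': 84879, 'BN': 89975, 'WL': 47893}
print(total_chars)    # 334121
print(total_words)    # 222747
```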

    Arabic

    Source   Words                     Files
             1P      DUAL    ADJ       1P    DUAL  ADJ
    NW       61287   56158   53026     239   226   221
    BN       29259   27165   26907     134   128   127
    WL       21687   20181   20181     60    55    55
    Total    112233  103504  100114    433   409   403

    *Samples*

    For examples of the data in this publication, please review the following samples:

    * English
    * Arabic
    * Chinese
  • C-000595: ACE Time Normalization (TERN) 2004 English Training Data v 1.0
    *Introduction*

    This file contains documentation on the ACE Time Normalization (TERN) 2004 English Training Data v 1.0, Linguistic Data Consortium (LDC) catalog number LDC2005T07 and ISBN 1-58563-331-3.

    This release contains the English training data prepared for the 2004 Time Expression Recognition and Normalization (TERN) Evaluation, sponsored by the Automatic Content Extraction (ACE) program. The evaluation was held in August 2004, followed by a workshop in September 2004. Evaluation participants received this data for training purposes, and it is now being released for general use.

    The annotation specifications for this corpus were developed under DARPA's Translingual Information Detection Extraction and Summarization (TIDES) program, with continuing support from ACE.

    The purpose of this corpus and the TERN evaluation is to advance the state of the art in the automatic recognition and normalization of natural language temporal expressions. In most language contexts such expressions are indexical. For example, with "Monday," "last week," or "three months starting October 1," one must know the narrative reference time in order to pinpoint the time interval being conveyed by the expression. In addition, for data exchange purposes, it is essential that the identified interval be rendered according to an established standard, i.e., normalized. Accurate identification and normalization of temporal expressions is in turn essential for the temporal reasoning being demanded by advanced NLP applications such as question answering, information extraction, and summarization.
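The dependence of indexical expressions on a reference time can be made concrete with a small Python sketch. It resolves "last week" and "Monday" against a hypothetical reference date and renders the results as ISO 8601 style strings. The function names, the reference date, and the choice to read "Monday" as the most recent Monday on or before the reference date are all assumptions made for illustration. This is not the annotation scheme or any system used in the TERN evaluation.

```python
from datetime import date, timedelta


def resolve_last_week(reference: date) -> str:
    """Return an ISO 8601 week label for the week before `reference`."""
    iso = (reference - timedelta(weeks=1)).isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def resolve_most_recent_weekday(reference: date, weekday: int) -> str:
    """Return the ISO date of the latest `weekday` on or before `reference`.

    `weekday` follows datetime conventions: Monday is 0, Sunday is 6.
    """
    days_back = (reference.weekday() - weekday) % 7
    return (reference - timedelta(days=days_back)).isoformat()


# Hypothetical narrative reference time: Wednesday, 18 August 2004.
ref = date(2004, 8, 18)

print(resolve_last_week(ref))               # 2004-W33
print(resolve_most_recent_weekday(ref, 0))  # 2004-08-16
```

The same expression yields a different interval for every reference date, which is why recognition alone is not enough and a normalized value is needed for data exchange.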

    *Samples*

    Please examine this sample to see an example of the corpus.

    *Updates*

    Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus (LDC2005T07).
    • references: Topic Detection and Tracking - Phase 4 (TDT4)
    • references: Arabic Treebank: Part 1
    • references: Chinese Treebank English Parallel Text Corpus
    • isReferencedBy: C-000593: ACE 2004 Multilingual Training Corpus
    • isReferencedBy: "ACE Time Normalization (TERN) 2004 English Training Data V1.0"(http://timex2.mitre.org/corpora/README_TERN_English_Data.txt)
    • isReferencedBy: Lisa Ferro, et al. 2005 ACE Time Normalization (TERN) 2004 English Training Data v 1.0 Linguistic Data Consortium, Philadelphia
  • C-000596: ACE-2 Version 1.0
    *Introduction*

    ACE-2 Version 1.0 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003T11, ISBN 1-58563-270-8.

    This release contains Version 1.0 of the ACE-2 corpus, created and distributed by the LDC to support the Automatic Content Extraction (ACE) program. The objective of the ACE program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from ASR and OCR). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data. Thus the ACE program requires the development of technologies that automatically detect and characterize this meaning. The ACE research objectives are viewed as the detection and characterization of Entities, Relations, and Events. There are three main ACE tasks: Entity Detection and Tracking, Relation Detection and Characterization, and Event Detection and Characterization.

    Annotations for the ACE-2 corpus were produced by Linguistic Data Consortium to support the following two research tasks: Entity Detection and Tracking (EDT) and Relation Detection and Characterization (RDC).

    For information regarding the ACE program and ACE technology evaluations administered by the National Institute of Standards and Technology (NIST), please visit the NIST website.

    For information about ACE annotation and ongoing ACE corpus development, including annotation guidelines, task definitions, annotation tools and other project documentation, please visit the ACE Project page at the LDC.

    *Data*

    This publication contains two sets of data: training and devtest. Each of these sets is further divided by source: broadcast news, newspaper, and newswire.

    The training set contains data originally developed as training material for the February 2002 evaluation and used again for the September 2002 evaluation. The devtest set contains data originally developed as test data for the February 2002 evaluation and later used as devtest data for the September 2002 evaluation.

    The broadcast and newswire source data is drawn from a subset of the TDT2 Multilanguage Text Version 4.0 (LDC2001T57); this has been supplemented with additional newspaper data from the Washington Post. A portion of the training broadcast data was drawn from the 1997 English Broadcast News Transcripts (HUB4) corpus (LDC98T28).

    All material comes from the first half of 1998. The sources for the broadcast, newswire, and newspaper data are listed below.

    Newswire
    * New York Times Newswire Service (NYT)
    * Associated Press Worldstream Service (APW)

    Broadcast News
    * Cable News Network, "Headline News" (CNN for TDT2, ed for Hub-4)
    * American Broadcasting Co., "World News Tonight" (ABC for TDT2, ea for Hub-4)
    * Public Radio International, "The World" (PRI)
    * Voice of America, English news programs (VOA)
    * MSNBC, "The News With Brian Williams" (MNB)
    * National Broadcasting Company, "Nightly News" (NBC)

    Newspaper
    * Washington Post (WAP)

    This publication includes both the source data files in .sgm format and the annotation files in ACE Pilot Format (APF), supporting documentation, and version 2.0.1 of the ACE DTD which was used for the September 2002 ACE Evaluation.

    There are 179,007 words of source data, or 519 files, broken down as follows:

    Source   # Words train   # Words devtest   # Files train   # Files devtest
    NYT      32892           7487              48              9
    APW      29144           7037              82              20
    CNN      2290            2653              69              11
    ABC      1588            2687              24              10
    PRI      1272            5284              43              9
    VOA      594             2611              24              7
    MNB      0               2539              0               6
    NBC      0               2633              0               8
    WAP      60247           15070             76              17
    ea       2019            0                 31              0
    ed       1094            0                 25              0
    Total    131023          47984             422             97

    *Updates*

    There are no updates available at this time.

    "The World" is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
  • C-000597: ACL/DCI
    ACL Data Collection Initiative contains text from the Wall Street Journal, the Collins English Dictionary, scientific abstracts provided by the U.S. Department of Energy and a variety of grammatically tagged and parsed materials from the Treebank project at the University of Pennsylvania. The total amount of uncompressed text is 620 Mbytes.

    The many formats of the original texts have been mapped into a markup language consistent with the SGML standard (ISO 8879).

    The format of the material from the Wall Street Journal uses a labelled bracketing, expressed in the style of SGML, although no formal SGML DTD is provided. The tag set has been modified by turning the Dow Jones header categories into tags and by creating ad hoc tags such as "". The original datelines are presented as separate text units; the text is divided and tagged into paragraphs and sentences with each sentence presented on a single line. Nothing has been done to modify the typographical methods used to subdivide headlines and stories into sections, nor are any of the text features within sentences (quotes, ellipsis, etc.) normalized.

    The Collins English Dictionary is present in two forms. One form was approximately parsed into fielded records as an exercise in learning a language called "FIT", by a student working under the direction of Lloyd Nakatani at AT&T Bell Laboratories during the summer of 1990. The original digital image of the typographer's tape from which the database version was prepared had serious flaws that were not detected and corrected until later; the corrected version, a clean typographer's tape, is presented in a separate directory. A properly analyzed database version will be provided in the future. The documentation includes notes developed during the new attempt to analyze the tape from scratch.

    The Department of Energy abstracts reside in files that are approximately one megabyte each. The original 950 separators have been replaced with newlines and space padding between articles was removed. An acronym dictionary that was extracted from the database as an indication of the material's topic areas has been included in a separate directory.

    Provisional material from the Penn Treebank project is divided into two subdirectories on this disk. The subdirectory "postext" contains text with part-of-speech annotations; "parstext" contains text with syntactic bracketing.
    • references: Penn Treebank
    • isReferencedBy: (Online documentation) "Documentation for ACL_DCI" (http://www.ldc.upenn.edu/Catalog/docs/LDC93T1/)
    • isReferencedBy: 1993 ACL/DCI Linguistic Data Consortium, Philadelphia
  • C-000598: ARL Urdu Speech Database, Training Data
    *Introduction*

    This file contains documentation for ARL Urdu Speech Database, Training Data, Linguistic Data Consortium (LDC) catalog number LDC2007S03 and ISBN 1-58563-421-3.

    The recordings in this release were collected by Appen Pty Ltd, Sydney, Australia in 2006. The U.S. Army Research Laboratory (ARL) provided this corpus to the LDC for distribution.

    Urdu is an Indo-Aryan language spoken throughout South Asia that developed under the Mughal Empire and Delhi Sultanate between 1200 AD and 1800 AD. It has Persian, Turkish and Arabic influences, but in fact is a dialect of Hindustani. The word "Urdu" refers to the standardized register of Hindustani, but there are many non-standard idiolects as well. Urdu is the twentieth most spoken language in the world. It is the native language of over 60 million people, it is the official language of Pakistan, and it is one of India's national languages. Urdu is also spoken in Afghanistan.

    The ARL Urdu Speech Database is a collection of recorded speech from 200 adult native Urdu speakers from Pakistan and Northern India. The distribution of speaker dialects is as follows:

    Accent               Number of Speakers
    South Sindh          29
    North Sindh          30
    South Punjab         27
    North Punjab         29
    Capital Area         29
    North West Regions   30
    Baluchistan          26

    The database is divided into two parts: a training set containing approximately 80% of the data and a test set comprising the remaining 20%. This release, the training set, consists of approximately 80% of the complete dataset (training plus test).

    *Data*

    Each speaker was presented with 400 prompts to read: sentences, place names, and person names. Two microphones set at different distances to the speaker were used for the recordings. The recorded speech was stored in raw format files with headers stored in separate directories.
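As a rough sense of scale, the speaker distribution and the prompt count can be combined in a few lines of Python. This is a back-of-the-envelope sketch only. It covers the complete dataset rather than this training release, and it assumes every speaker read all 400 prompts, which the documentation does not state. The variable names are illustrative.

```python
# Speakers per accent region, as listed in the corpus documentation.
speakers_by_accent = {
    "South Sindh": 29,
    "North Sindh": 30,
    "South Punjab": 27,
    "North Punjab": 29,
    "Capital Area": 29,
    "North West Regions": 30,
    "Baluchistan": 26,
}
PROMPTS_PER_SPEAKER = 400

total_speakers = sum(speakers_by_accent.values())

# Upper bound, assuming every speaker completed every prompt.
max_prompted_utterances = total_speakers * PROMPTS_PER_SPEAKER

print(total_speakers)           # 200
print(max_prompted_utterances)  # 80000
```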

    Each utterance is transcribed in the corresponding label file for each recording. The transcriptions were encoded in UTF-8. Punctuation was omitted and numbers were written out in full.

    *Update*

    Earlier versions were missing the content list file. This is now available as a download. Please contact the LDC membership office to receive instructions for download.

    *Samples*

    For an example of the data in this corpus, please listen to the following audio sample (.wav format).
    • isReferencedBy: (online documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC2007S03/
    • isReferencedBy: Appen Pty Ltd, Sydney, Australia 2007 ARL Urdu Speech Database, Training Data Linguistic Data Consortium, Philadelphia
  • C-000599: ATIS0 Read
    * LDC93S4A - Complete ATIS0 corpus
    * LDC93S4B - ATIS0 Pilot
    * LDC93S4B-2 - ATIS0 Read
    * LDC93S4B-3 - ATIS0 SD-Read

    The ATIS0 Corpus totals six CD-ROMs: one with spontaneous data from 36 speakers; one with read versions of the data from 20 of those speakers, along with some adaptation material; and four with extensive speaker-dependent material from the ATIS domain, read by ten of the same speakers.

    All ATIS speech data is recorded at a 16 kHz sample rate with 16-bit quantization, from two different microphones: a close-talking (Sennheiser HMD414) and a desk-top (Crown PCC-160) model.

    The first disc (ATIS0 Pilot) contains spontaneous utterances elicited in a "Wizard-of-Oz" simulation, along with the relational database containing the travel information (excluding connecting flights). Thirty-six speakers produced a total of 912 utterances.

    The second disc (ATIS0 Read) contains "read" versions of the spontaneous utterances for 20 of the 36 speakers above, for a total of 478 productions. This is supplemented by a set of 40 "adaptation" sentences read by each of the 20 speakers.

    The third through the sixth discs (ATIS0 SD-Read) contain "read" speech in the ATIS domain for ten of the speakers on the first disc. They read a total of 3,171 utterances, or approximately 317 utterances per speaker. This data was collected for the purpose of training speaker-dependent speech recognition systems for the ATIS0 domain. Two of these four discs contain the close-talking (Sennheiser) microphone data and the other two contain corresponding data for the desk-top (Crown PCC-160) microphone. Thus there are 6,342 waveform files on the four discs.
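The counts quoted for the SD-Read discs follow from simple arithmetic, sketched below in Python. The utterance, speaker and microphone figures come from the description above. The bytes-per-second value assumes uncompressed single-channel PCM at the stated 16 kHz, 16-bit format, which is an assumption about storage rather than something the description states.

```python
utterances = 3171        # read utterances on the SD-Read discs
speakers = 10
microphones = 2          # close-talking and desk-top

# Each utterance was captured by both microphones.
waveform_files = utterances * microphones
utterances_per_speaker = utterances / speakers

# Assumed uncompressed mono PCM: 16,000 samples/s at 16 bits per sample.
bytes_per_second = 16_000 * 16 // 8

print(waveform_files)          # 6342
print(utterances_per_speaker)  # 317.1
print(bytes_per_second)        # 32000
```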
  • C-000600: ATIS0 SD Read
    * LDC93S4A - Complete ATIS0 corpus
    * LDC93S4B - ATIS0 Pilot
    * LDC93S4B-2 - ATIS0 Read
    * LDC93S4B-3 - ATIS0 SD-Read

    The ATIS0 Corpus totals six CD-ROMs: one with spontaneous data from 36 speakers; one with read versions of the data from 20 of those speakers, along with some adaptation material; and four with extensive speaker-dependent material from the ATIS domain, read by ten of the same speakers.

    All ATIS speech data is recorded at a 16 kHz sample rate with 16-bit quantization, from two different microphones: a close-talking (Sennheiser HMD414) and a desk-top (Crown PCC-160) model.

    The first disc (ATIS0 Pilot) contains spontaneous utterances elicited in a "Wizard-of-Oz" simulation, along with the relational database containing the travel information (excluding connecting flights). Thirty-six speakers produced a total of 912 utterances.

    The second disc (ATIS0 Read) contains "read" versions of the spontaneous utterances for 20 of the 36 speakers above, for a total of 478 productions. This is supplemented by a set of 40 "adaptation" sentences read by each of the 20 speakers.

    The third through the sixth discs (ATIS0 SD-Read) contain "read" speech in the ATIS domain for ten of the speakers on the first disc. They read a total of 3,171 utterances, or approximately 317 utterances per speaker. This data was collected for the purpose of training speaker-dependent speech recognition systems for the ATIS0 domain. Two of these four discs contain the close-talking (Sennheiser) microphone data and the other two contain corresponding data for the desk-top (Crown PCC-160) microphone. Thus there are 6,342 waveform files on the four discs.

    *Update*

    This publication has been condensed from four CD-ROM discs to a single DVD-ROM. The contents of each CD reside in separate directories that are organized identically to the original version.