Language resource #: 3330
-
C-000683: CSR-III Text
The third ARPA Continuous Speech Recognition (CSR) Language Model Training Data is a set for speaker-independent, large-vocabulary speech recognition systems. This corpus is an important companion to the 1994 Benchmark Speech data collection (LDC95S23).
The text collection comprises both source text data (prepared by LDC and BBN) and derived statistical tables (compiled by CMU) of unigram, bigram and trigram word frequencies. The sources include all available WSJ texts, spanning 1987 through March 1994, and all AP and San Jose Mercury news data from the three TIPSTER volumes. (Some of the WSJ data, from 1992 through 1994, appears here for research use for the first time.) This corpus is also available from the LDC as a 1995 release.
Because of restrictions imposed by the copyright holders of much of the NAB text, both the speech and text collections are available to LDC members only. For more information on how to join, send email to ldc@ldc.upenn.edu.
*Pricing*
The Reduced Licensing Fee for this corpus is US$200.
- hasVersion: C-000682: CSR-III Speech
-
C-000684: CSR-IV HUB3
This set of CD-ROMs contains all of the speech data provided to sites participating in the DARPA CSR November 1995 HUB3 Multi-Microphone tests. The data consists of digitized waveforms collected simultaneously with eight different microphones from 40 subjects reading 15-sentence articles drawn from various North American business news publications. The data is partitioned into development-test and evaluation-test sets. The test sets were collected with different subjects, prompts and microphones. No training data was collected for this corpus, since a substantial amount of NAB acoustic training data was already available. Index files have been included that specify the exact subset of the evaluation-test recordings used in the November 1995 tests. The software NIST used to process and score the output of the test systems is also included. The data is organized as follows:
CD26-3: Development-Test Data-Location 1, Adaptation and NAB recordings, Subjects: 703-705, 707-70a, 70c, 70f, 70g
CD26-4: Development-Test Data-Location 2, NAB recordings, Subjects: 70k, 70m, 70o, 70q-70s, 70u-70w
CD26-5: Development-Test Data-Location 2, Adaptation recordings, Subjects: 70k, 70m-70o, 70q-70s, 70u-70w
CD26-3: Development-Test Data-NAB recordings, Subjects: 710-71j
As of September 2007, this publication has been condensed to fit on a single DVD. The data from each CD resides in its own directory, labeled with the NIST labels above.
*Pricing*
The Reduced Licensing Fee for this corpus is US$200.
- references: Jonathan Fiscus, John Garofolo, and David Pallett 1996 CSR-IV HUB3 Linguistic Data Consortium, Philadelphia
-
C-000685: CSR-IV HUB4
This release contains all of the speech data provided to sites participating in the DARPA CSR November 1995 HUB4 (Radio) Broadcast News tests. The data consists of digitized waveforms of MarketPlace (tm) business news radio shows provided by KUSC through an agreement with the Linguistic Data Consortium and detailed transcriptions of those broadcasts. The software NIST used to process and score the output of the test systems is also included.
The data is organized as follows:
CD26-1: Training Data-Ten complete half-hour broadcasts with minimal-verified transcripts. The transcripts are time aligned with the waveforms at the story-boundary level.
CD26-2: Development-Test Data-Six complete half-hour broadcasts with verified transcripts. The transcripts are time aligned with the waveforms at the story- and turn-boundary level. Index files have been included which specify how the data may be partitioned into 2 test sets.
CD26-6: Evaluation-Test Data-Five complete half-hour broadcasts with verified/adjudicated transcripts. The transcripts are time aligned with the waveforms at the story-, turn- and music-boundary level. An index file has been included which specifies how the data was partitioned into the test set used in the CSR 1995 HUB4 tests.
*Samples*
* Audio
* Transcripts
* Speaker
- references: John Garofolo, et al. 1996 CSR-IV HUB4 Linguistic Data Consortium, Philadelphia
-
C-000686: CTIMIT
The CTIMIT corpus is a cellular-bandwidth adjunct to the TIMIT Acoustic Phonetic Continuous Speech Corpus (NIST Speech Disc CD1-1.1/NTIS Pb91-505065, October 1990). The corpus was contributed by Lockheed-Martin Sanders to the LDC for distribution on CD-ROM media. The CTIMIT read speech corpus has been designed to provide a large phonetically labeled database for use in the design and evaluation of speech processing systems operating in diverse, often hostile, cellular telephone environments. CTIMIT was collected by members of the Voice Communication Initiative (VCI) at Lockheed-Martin Sanders' Signal Processing Center of Technology (SPCOT) as part of internal R&D efforts, with additional sponsorship from the Wireless Communications Group in the company's Advanced Engineering and Technology (AE&T) Division.
Like NTIMIT, CTIMIT is based on the original TIMIT recordings, which were passed through a sample of actual telephone circuits--cellular circuits in the case of CTIMIT. Thus the original phonetic segmentation and labeling of TIMIT continue to be applicable to CTIMIT as well as NTIMIT.
- references: E. Bryan George, et al. 1996 CTIMIT Linguistic Data Consortium, Philadelphia
- hasPart: the TIMIT Acoustic Phonetic Continuous Speech Corpus (NIST Speech Disc CD1-1.1/NTIS Pb91-505065, October 1990).
-
C-000687: Chinese <-> English Name Entity Lists v 1.0
*Introduction*
This file contains documentation for Chinese English Name Entity Lists, v.1.0, Linguistic Data Consortium (LDC) catalog number LDC2005T34 and ISBN 1-58563-368-2.
These Chinese-English bi-directional name entity lists are compiled from Xinhua News Agency newswire texts. Not every irregularity in the original source has been detected and normalized. Some Chinese characters are not encoded in the source, and brackets are used to describe their composition. Except for the person name lists, most instances were left untouched in the created lists. An effort was made to replace GB-encoded characters (such as Roman numerals) in the English translation with ASCII characters. However, no attempt has been made to do the opposite for Chinese names. The use of slashes as delimiters presents another problem, because some names have internal slashes. Initially, double quotes ("") were used to enclose a name with an internal slash to avoid confusion, without realizing that there is just one " in ASCII (as opposed to a set of enclosing " in GB). Later it was decided to use &slash; instead. In future releases, some lists will be changed for greater consistency. Finally, most of the English names in the source are in lowercase throughout. An effort was made to capitalize the initial letter (and possibly some middle ones) for person names, but not for any other kind of name, as most other names consist of multiple words, some of which may be articles or prepositions.
The word "English" is somewhat misleading here. Although most of the foreign words are English or can appear in English texts, there are also many non-English words written in the Roman alphabet, some of which have English equivalents while others do not. No effort has been made to eliminate those non-English names for which English equivalents are available.
The entire set consists of nine pairs of lists. The English->Chinese version of each pair was created by reversing the Chinese->English, both sorted by the Unix built-in sort function.
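The reversal step described above can be sketched as follows. This is only an illustration under stated assumptions: the entry layout (`source /target1/target2/`), the helper names and the sample entries are hypothetical and not taken from the corpus documentation; only the slash delimiter and the `&slash;` escape come from the description. Python's `sorted` orders strings by code point, which may differ from the locale-dependent order of the Unix `sort` actually used.

```python
from collections import defaultdict

def unescape(name):
    # "&slash;" stands for a slash inside a name (per the description above).
    return name.replace("&slash;", "/")

def escape(name):
    return name.replace("/", "&slash;")

def parse_entry(line):
    """Split an entry assumed to look like 'source /target1/target2/'."""
    source, _, rest = line.strip().partition(" /")
    targets = [unescape(t) for t in rest.split("/") if t]
    return unescape(source), targets

def reverse_list(lines):
    """Build the opposite-direction list and sort it by code point."""
    reversed_map = defaultdict(set)
    for line in lines:
        source, targets = parse_entry(line)
        for target in targets:
            reversed_map[target].add(source)
    out = []
    for target, sources in reversed_map.items():
        joined = "/".join(escape(s) for s in sorted(sources))
        out.append(f"{escape(target)} /{joined}/")
    return sorted(out)

# Invented sample entries, for illustration only.
sample = ["北京 /Beijing/Peking/", "京 /Beijing/"]
english_to_chinese = reverse_list(sample)
```

Because one name can map to several translations, the two directions of a pair need not contain the same number of entries.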
The contents are as follows:
* Place Names, Chinese to English: 276,382
* Place Names, English to Chinese: 298,993
* Organization Names, Chinese to English: 30,800
* Organization Names, English to Chinese: 37,145
* Corporate Names, Chinese to English: 54,747
* Corporate Names, English to Chinese: 58,468
* Press Organization Names, Chinese to English: 29,757
* Press Organization Names, English to Chinese: 32,922
* Intl. Organization Names, Chinese to English: 7,040
* Intl. Organization Names, English to Chinese: 7,040
*Samples*
For an example of the data in this publication, please view this screen capture of the corporate names list.
*Pricing*
The Reduced Licensing Fee for this corpus is US$100.
- references: Shudong Huang 2005 Chinese <-> English Name Entity Lists v 1.0 Linguistic Data Consortium, Philadelphia
-
C-000688: Chinese English News Magazine Parallel Text
*Introduction*
This file contains documentation on the Chinese English News Magazine Parallel Text, Linguistic Data Consortium (LDC) catalog number LDC2005T10 and ISBN 1-58563-333-X.
This corpus contains Chinese news stories and their English translations LDC collected via Sinorama Magazine, Taiwan, from 1976 to 2004. It totals 6,366 story pairs, 365,568 sentence pairs, 20M Chinese characters and 9M English words. The corpus is aligned at sentence level.
*Data*
Sinorama Magazine is published monthly in several languages, including Chinese, English, and Japanese. LDC received its 1976 to 2000 publications on a single CD, and its 2001 to 2004 publications via Sinorama's website.
The Sinorama Chinese text was encoded in Big5. The data came story-aligned but lacked sentence-level alignment. The sentence alignment was done at the LDC using Champollion v 1.1.
The final data is put in the data directory, which contains subdirectories for Chinese documents, English documents, and the sentence level alignment, identified as "Chinese," "English," and "alignment."
The English and Chinese files may contain one or more documents, with each document formatted in SGML as follows:
[English or Chinese text]
[English or Chinese text] [English or Chinese text] ...
Notes:
* the and tags are always assigned sequential numeric IDs, starting at one.
* the tags are always placed on the same line with their contents, and are always separated from the contents by a space.
* if an English file and a Chinese file share the same file name, they contain the same documents.
* all Chinese text is encoded in Big5.
Each alignment file contains the sentence level alignment of multiple documents, each being formatted in SGML as follows: ...
Notes:
* the docid in an English file, its Chinese translation and the ALIGNMENT are the same.
* EnglishSegId and ChineseSegId may have none, one, or more than one segment IDs.
*Samples*
The following files provide an example of this corpus:
* Chinese
* English
* Alignment
Portions © 2005 Trustees of the University of Pennsylvania
- references: Xiaoyi Ma 2005 Chinese English News Magazine Parallel Text Linguistic Data Consortium, Philadelphia
-
C-000689: Chinese Gigaword Second Edition
*Introduction*
Chinese Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC), with catalog number LDC2005T14 and ISBN 1-58563-353-4. It is a comprehensive archive of newswire text data in Chinese that has been acquired over several years by the LDC.
This edition includes all of the contents in the first release of the Chinese Gigaword corpus (LDC2003T09) as well as new data collected after the publication of the first edition. Also, a limited number of articles from a new newspaper source (Zaobao) have been added in this edition.
The three distinct international sources of Chinese newswire included in this edition are the following:
* Central News Agency, Taiwan (cna_cmn)
* Xinhua News Agency (xin_cmn)
* Zaobao Newspaper (zbn_cmn)
The seven-character abbreviations shown above represent both the source name and the language ID ("cmn" for Mandarin Chinese).
*New In Second Edition*
New documents (Xinhua from October 2002 through December 2004 and CNA from January 2003 through December 2004) have been added.
A new newspaper source (Lianhe Zaobao) has been added.
*Samples*
For an example of this corpus, please review this screen capture displaying some of the text included.
- references: C-000690: Chinese Gigaword
- hasVersion: C-003304: Chinese Gigaword Third Edition
- hasVersion: C-003305: Tagged Chinese Gigaword
-
C-000690: Chinese Gigaword
*Introduction*
Chinese Gigaword was produced by the Linguistic Data Consortium (LDC), with catalog number LDC2003T09 and ISBN 1-58563-230-9. It is a comprehensive archive of newswire text data that has been acquired from Chinese news sources by the LDC over several years.
Two distinct international sources of Chinese newswire are represented here:
* Central News Agency of Taiwan (cna)
* Xinhua News Agency of Beijing (xin)
Some of the Xinhua content in this collection has been published previously by the LDC in other, older corpora, particularly Mandarin Chinese News Text (LDC95T13), TREC Mandarin (LDC2000T52), and the various TDT Multilanguage Text corpora. But all of the CNA data and a significant amount of Xinhua material is being released here for the first time.
*Data*
There are 286 files, totalling approximately 1.5GB in compressed form.
The table below lists, for each source, the number of files (#Files), the total compressed file size in MB (Gzip-MB), the total uncompressed file size in MB (Totl-MB; nearly four gigabytes overall), the number of Chinese characters in thousands (K-wrds; there is no notion of "space-separated word tokens" in Chinese), and the number of documents (#DOCs).
Source  #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
CNA        144     1018     2606   735499  1649492
XIE        142      548     1331   382881   817348
TOTAL      286     1566     3937  1118380  2466840
The original data archives received by the LDC from Xinhua were encoded in GB-2312, whereas those from CNA were encoded in Big-5. To avoid the problems and confusion that could result from differences in character-set specifications, all text files in this corpus have been converted to UTF-8 character encoding. With some exceptions described in the 0readme.txt file, all characters in the text are either single-byte ASCII or multi-byte Chinese.
Each data file name consists of a three-letter prefix, followed by a six-digit date (representing the year and month during which the file contents were generated by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). So, each file contains all the usable data received by LDC for the given month from the given news source.
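As a rough illustration of this naming scheme, the sketch below splits a file name into source, year and month. It assumes the three parts are simply concatenated, that the prefix is lowercase, and that the six digits are ordered YYYYMM; none of these details is spelled out above, and the names `GigawordFile`, `parse_name` and `open_text` are hypothetical.

```python
import gzip
import re
from typing import NamedTuple

class GigawordFile(NamedTuple):
    source: str   # three-letter source prefix, e.g. "cna" or "xin"
    year: int
    month: int

# Assumed layout: <3 letters><YYYY><MM>.gz with no separators.
_NAME = re.compile(r"^([a-z]{3})(\d{4})(\d{2})\.gz$")

def parse_name(name):
    """Split a data file name into source, year and month."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    source, year, month = match.group(1), int(match.group(2)), int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {name!r}")
    return GigawordFile(source, year, month)

def open_text(path):
    """Open a compressed data file as UTF-8 text (the corpus encoding)."""
    return gzip.open(path, "rt", encoding="utf-8")
```

For example, `parse_name("cna200301.gz")` would describe the CNA data for January 2003 under these assumptions.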
All text data are presented in SGML form, using a very simple, minimal markup structure. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using a DTD file provided in the corpus.
Unlike older corpora, the present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs).
All sources have received uniform treatment in terms of quality control, and their DOCs have been categorized into four distinct "types":
* story: this type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences
* multi: this type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on
* advis: these are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users"
* other: these DOCs clearly do not fall into any of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on
The general strategy for categorizing DOCs into these four classes was, for each source, to discover the most common and frequent clues in the text stream that correlated with the three "non-story" types. When none of the known clues was in evidence, the DOC was classified as a "story."
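The clue-based strategy can be pictured as a rule list with "story" as the fallback. The sketch below is not the LDC's actual procedure: the function name and the three example patterns are invented placeholders, since the real per-source clues are not given here.

```python
import re

def classify_doc(text, clues):
    """Return the first non-story type whose clue matches, else 'story'."""
    for doc_type, pattern in clues:
        if pattern.search(text):
            return doc_type
    return "story"

# Hypothetical clues for one source; real clues would be discovered per source.
EXAMPLE_CLUES = [
    ("advis", re.compile(r"^EDITORS:", re.MULTILINE)),
    ("multi", re.compile(r"news briefs", re.IGNORECASE)),
    ("other", re.compile(r"(?:\d+\s+){5,}")),  # long runs of bare numbers
]
```

Keeping the clues in an ordered list per source mirrors the description: each source gets its own set, and anything that matches none of them defaults to "story".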
*Updates*
There are no updates at this time.
- references: C-001692: Mandarin Chinese News Text
- references: C-001309: TREC Mandarin
- hasVersion: C-000689: Chinese Gigaword Second Edition
- hasVersion: C-003304: Chinese Gigaword Third Edition
- hasVersion: C-003305: Tagged Chinese Gigaword
-
C-000691: Chinese News Translation Text Part 1
*Introduction*
Chinese News Translation Text Part 1 was produced by the Linguistic Data Consortium (LDC), with catalog number LDC2005T06 and ISBN 1-58563-329-1.
To support the development of automatic machine translation systems, the LDC was sponsored to solicit English translations for a single set of Chinese source materials.
The source Chinese text and its English translations were selected and translated in different LDC projects during the time period of February 2003 to January 2005. A total of about 474K Chinese characters were selected from two sources, namely Xinhua and AFP, and translation services were provided by seven translation agencies. Each Chinese news story was translated once.
All stories and their translations were created as training data for TIDES Machine Translation, following roughly the same guidelines and procedures.
*Samples*
To see an example of this corpus, please examine this translation file.
- references: Xiaoyi Ma 2005 Chinese News Translation Text Part 1 Linguistic Data Consortium, Philadelphia
-
C-000692: Chinese Proposition Bank 1.0
*Introduction*
Chinese Proposition Bank 1.0 was produced by the Linguistic Data Consortium (LDC), with catalog number LDC2005T23 and ISBN 1-58563-354-2.
Chinese Proposition Bank 1.0 is the first public release of the Penn Chinese Proposition Bank project, which aims to create a corpus of text annotated with information about basic semantic propositions. Specifically, predicate-argument relations have been added to the syntactic trees of the first update to Chinese Treebank 5.0 as an additional layer of annotation.
*Data*
Chinese Proposition Bank 1.0 includes annotations for files chtb_001.fid to chtb_931.fid, or the first 250K words of the first update of Chinese Treebank 5.0. There is a total of 37,183 propositions. Auxiliary verbs are not annotated. Some verbs have both light-verb and non-light-verb uses; in these cases only the non-light-verb uses are annotated. All the annotations in this release are the result of double-blind annotation followed by adjudication of differences.
The following table summarizes the framesets in CPB 1.0:
Total verbs framed             4,865
Total framesets                5,298
Verbs with multiple framesets    351
Average framesets per verb      1.09
*Annotation Format*
Each P-A (predicate-argument) structure is represented on one line of space-separated columns. The columns are as follows:
ctb-filename sentence terminal tagger frameset inflection arglabel arglabel ...
The content of each column is described in detail below.
* ctb-filename: the name of the file in the Penn Chinese TreeBank 5.0 update 1
* sentence: the number of the sentence in the file (starting with 0)
* terminal: the number of the terminal in the sentence that is the location of the verb. Note that the terminal number counts empty constituents as terminals and starts with 0. This holds for all references to terminal numbers in this description. An example:
(IP (NP-SBJ (DNP (NP (NN 货币)(NN 回笼))(DEG 的))(NP (NN 增加)))(PU ,) (VP (PP-BNF (P 为)(IP (NP-SBJ (-NONE- *PRO*))(VP (VV 平抑)(NP-OBJ (NP (DP (DT 全)) (NP (NN 区)))(NP (NN 物价))))))(VP (VV 发挥)(AS 了)(NP-OBJ (NN 作用)))) (PU 。))
The terminal numbers:
货币 0, 回笼 1, 的 2, 增加 3, , 4, 为 5, *PRO* 6, 平抑 7, 全 8, 区 9, 物价 10, 发挥 11, 了 12, 作用 13, 。 14
* tagger: the name of the annotator, or "gold" if it's been double annotated and adjudicated.
* frameset: the frameset identifier from the frames file of the verb. For example, '发挥.01' refers to the frameset ID "f1" in the frame file for the verb '发挥' (frames/0930-fa-hui.xml). The names of the frame files are composed of a numerical id plus the pinyin of the verb. The numerical ids can be found in the enclosed verb list (verbs.txt).
* inflection: the inflection field is a carry-over from the Penn English Proposition Bank, and is set to '-----', meaning no annotation in the Chinese Proposition Bank.
* arglabel: a string representing the annotation associated with a particular argument or adjunct of the proposition. Each arglabel is dash ('-') delimited and has the following columns:
1) column for the address of a constituent
The address of the constituent is in one of the following forms.
form 1: terminal number:height
A single node in the syntactic tree of the sentence in question, identified by the first terminal the node spans together with the height from that terminal to the syntax node (a height of 0 represents a terminal). For example, in the sentence
(IP (NP-TPC (DP (DT 这些))(CP (WHNP-1 (-NONE- *OP*)) (CP (IP (NP-SBJ (-NONE- *T*-1)) (VP (ADVP (AD 已))(VP (VV 开业))))(DEC 的)))(NP (NN 外商)(NN 投资)(NN 企业))) (NP-ADV (NN 绝大部分))(NP-SBJ (NN 生产)(NN 经营)(NN 状况))(VP (ADVP (AD 较)) (VP (VA 好)))(PU 。))
the address of "1:3" represents the top IP node and "2:2" represents the CP node.
form 2: terminal number:height*terminal number:height*...
A trace chain identifying coreference within sentence boundaries. For example, in the sentence
(IP (NP-TPC (DP (DT 这些))(CP (WHNP-1 (-NONE- *OP*)) (CP (IP (NP-SBJ (-NONE- *T*-1)) (VP (ADVP (AD 已))(VP (VV 开业))))(DEC 的)))(NP (NN 外商)(NN 投资)(NN 企业))) (NP-ADV (NN 绝大部分))(NP-SBJ (NN 生产)(NN 经营)(NN 状况))(VP (ADVP (AD 较)) (VP (VA 好)))(PU 。))
the address of "2:0*1:0*6:1" represents the fact that nodes '2:0' (-NONE- *T*-1), '1:0' (-NONE- *OP*) and '6:1' (NP (NN 外商)(NN 投资)(NN 企业)) are coreferential.
form 3: terminal number:height,terminal number:height,...
This represents a collection of different pieces of one argument. This form is rarely used in the annotation of the verbs, since most discontinuous constituents have well-defined relations between their components. Therefore the components of a discontinuous constituent are assigned the same label with a secondary tag representing their semantic relations. For example, if a constituent is marked as ARG0-CRD, it means that there is another constituent having the same label and together they fill the ARG0 role of the verb.
2) column for the 'label'
The argument label is one of {rel, ARGM} + { ARG0, ARG1, ARG2, ... }. The argument labels correspond to the argument labels in the frames files (see ./frames). ARGM is for adjuncts of various sorts, and 'rel' refers to the surface string of the verb.
3) column for 'functional tag' (optional for numbered arguments; required for ARGM)
Functional tags for "split" numbered arguments:
* PSR - possessor
* PSE - possessee
* CRD - coordinator
* PRD - predicate
* QTY - quantity
Propositional tags for numbered arguments: AT, AS, INTO, TOWARDS, TO, ONTO
Functional tags for ARGM:
* ADV - adverbial, default tag
* BNF - beneficiary
* CND - conditional
* DIR - directional
* DIS - discourse
* DGR - degree
* EXT - extent
* FRQ - frequency
* LOC - location
* MNR - manner
* NEG - negation
* PRP - purpose and reason
* TMP - temporal
* TPC - topic
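To make the column layout and the terminal:height addressing more concrete, here is a minimal sketch. It is not the LDC's tooling: all function names are hypothetical, and the annotation line in `EXAMPLE_LINE` is made up for illustration (it pairs the first example tree above with an invented file name, sentence number and argument set). Height 0 is taken to be the POS-tagged terminal itself, which matches the '6:1' example above.

```python
import re

def parse_proposition(line):
    """Split one annotation line into its named columns."""
    cols = line.split()
    args = []
    for arglabel in cols[6:]:
        # address-label[-functional tag]; addresses never contain '-'
        address, label, *tag = arglabel.split("-", 2)
        # '*' links nodes of a trace chain, ',' joins pieces of one argument
        nodes = [tuple(int(n) for n in part.split(":"))
                 for part in re.split(r"[*,]", address)]
        args.append({"nodes": nodes, "label": label,
                     "tag": tag[0] if tag else None})
    return {"file": cols[0], "sentence": int(cols[1]),
            "terminal": int(cols[2]), "tagger": cols[3],
            "frameset": cols[4], "inflection": cols[5], "args": args}

def parse_tree(text):
    """Parse a bracketed tree into (label, children); leaves hold one word."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    stack, i = [("ROOT", [])], 0
    while i < len(tokens):
        if tokens[i] == "(":
            node = (tokens[i + 1], [])
            stack[-1][1].append(node)
            stack.append(node)
            i += 2
        elif tokens[i] == ")":
            stack.pop()
            i += 1
        else:
            stack[-1][1].append(tokens[i])
            i += 1
    return stack[0][1][0]

def terminal_paths(tree, path=()):
    """Yield (labels from root to terminal, word), empty categories included."""
    label, children = tree
    path = path + (label,)
    if all(isinstance(child, str) for child in children):
        yield path, children[0]
    else:
        for child in children:
            yield from terminal_paths(child, path)

def resolve(tree, terminal, height):
    """Label of the node `height` levels above 0-based terminal `terminal`."""
    path, _word = list(terminal_paths(tree))[terminal]
    return path[-1 - height]

TREE = ("(IP (NP-SBJ (DNP (NP (NN 货币)(NN 回笼))(DEG 的))(NP (NN 增加)))(PU ,) "
        "(VP (PP-BNF (P 为)(IP (NP-SBJ (-NONE- *PRO*))(VP (VV 平抑)(NP-OBJ (NP (DP (DT 全)) "
        "(NP (NN 区)))(NP (NN 物价))))))(VP (VV 发挥)(AS 了)(NP-OBJ (NN 作用)))) (PU 。))")
# Invented line, for illustration only (not copied from the corpus).
EXAMPLE_LINE = "chtb_001.fid 0 11 gold 发挥.01 ----- 11:0-rel 0:3-ARG0 5:1-ARGM-BNF 13:1-ARG1"

tree = parse_tree(TREE)
words = [word for _path, word in terminal_paths(tree)]
prop = parse_proposition(EXAMPLE_LINE)
```

With these definitions, `words[prop["terminal"]]` recovers the verb at terminal 11, and `resolve(tree, 13, 1)` walks one level up from terminal 13 to its enclosing NP-OBJ node.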
*Samples*
For an example of this corpus, please examine this sample xml file.
- references: C-000695: Chinese Treebank 5.0