-
C-003312: THE LANCASTER-OSLO/BERGEN CORPUS
The preparation of a corpus cannot be seen in isolation from its intended uses. These dictate the selection of texts, the amount of material, the coding system, etc. The aim of our project has been to assemble a British English equivalent to the Brown University Corpus of American English. Both sources of data, rather than concentrating on limited types of texts to be used for specific purposes, aim at a general representation of text types for use in research on a broad range of aspects of the language. To facilitate a combined use of the two corpora, an attempt has been made to match the British English material as closely as possible with the American corpus.
-
C-003313: The LUCY Corpus
The LUCY Corpus is now freely available for downloading via the Resources page on this site. The LUCY Corpus is a structurally-annotated sample (“treebank”) of present-day British written English, representing not only the polished writing of published documents, but also the less-skilled or unskilled writing of young adults at the end of secondary and beginning of tertiary education, and of children aged nine to twelve in various types of school and parts of the country. For a detailed statement of the contents of the LUCY Corpus, see its documentation file.
The LUCY Corpus is named after St Lucy, patron saint of authors.
- hasVersion: SUSANNE Corpus http://www.grsampson.net/RSue.html
- hasVersion: CHRISTINE Corpus http://www.grsampson.net/RChristine.html
-
C-003314: OntoNotes Release 1.0
*Introduction*
Natural language applications like machine translation, question answering, and summarization currently are forced to depend on impoverished text models like bags of words or n-grams, while the decisions that they are making ought to be based on the meanings of those words in context. That lack of semantics causes problems throughout the applications. Misinterpreting the meaning of an ambiguous word results in failing to extract data, incorrect alignments for translation, and ambiguous language models. Incorrect coreference resolution results in missed information (because a connection is not made) or incorrectly conflated information (due to false connections). Some richer semantic representation is badly needed.
The OntoNotes project is a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute to produce such a resource. It aims to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, Usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate-argument structure) and shallow semantics (word sense linked to an ontology and coreference). OntoNotes builds on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation will include word sense disambiguation for nouns and verbs, with each word sense connected to an ontology, and coreference. The current goals call for annotation of over a million words each of English and Chinese, and half a million words of Arabic over five years.
The authors wish to make this resource available to the natural language research community so that decoders for these phenomena can be trained to generate the same structure in new documents. Lessons learned over the years have shown that the quality of annotation is crucial if it is going to be used for training machine learning algorithms. Taking this cue, we ensure that each layer of annotation in OntoNotes will have at least 90% inter-annotator agreement. Our pilot studies have shown that predicate structure, word sense, ontology linking, and coreference can all be annotated rapidly and with better than 90% consistency.
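The 90% threshold above refers to the fraction of annotation decisions on which independent annotators agree. As a minimal illustration (not the project's actual scoring tool), observed pairwise agreement can be computed like this:

```python
def pairwise_agreement(labels_a, labels_b):
    """Observed inter-annotator agreement: the fraction of items on
    which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotators must label the same set of items")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Two annotators sense-tagging three tokens; they agree on two of them.
score = pairwise_agreement(["run.01", "bank.1", "bank.2"],
                           ["run.01", "bank.1", "bank.1"])
```

In practice, chance-corrected measures (e.g. Cohen's kappa) are often reported alongside raw agreement, since raw agreement is inflated when one label dominates.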
*Samples*
The following screen captures provide examples of the data contained in this corpus.
* English tree.
* English sense predicate structure.
* Chinese tree and sense predicate structure.
*Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-C-0022. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.
- isReferencedBy: Ralph Weischedel, et al., 2007, "OntoNotes v 1.0," Linguistic Data Consortium, Philadelphia
- isReplacedBy: C-003322: OntoNotes Release 2.0
- references: C-001546: Treebank-2
- references: C-000695: Chinese Treebank 5.0
- conformsTo: C-001546: Treebank-2
-
C-003315: MICASE
In late 1997, the English Language Institute (ELI) at the University of Michigan started a major research project to create a resource for studying academic speech. The goal of the project was to record and transcribe close to 200 hours (approximately 1.8 million words) of academic speech from across the university. In June 2001, we finished the recording goal, with over 190 total hours recorded. In April 2002, we completed transcribing and proofreading all the transcripts. (The digital sound recordings were transcribed with the help of a computer program called SoundScriber, developed by former research assistant Eric Breck.)
-
C-003316: The SUSANNE Analytic Scheme
Many findings of corpus linguistics shed new light on the nature of language as a human ability. But corpus analysis is crucial also for enabling computers to process human language. For that purpose, we need corpora annotated to show their structural features, as a source of information and statistics to guide the development of language-processing algorithms. This in turn requires some set of categories to be explicitly defined, so that researchers exchanging language data can be confident that they are using the annotations in the same way. Computational linguistics needs something like the Linnaean taxonomy created for botany in the 18th century, which for the first time enabled naturalists everywhere to exchange information about plants secure in the knowledge that when they used the same names they were talking about the same things.
- hasVersion: C-003307: CHRISTINE Corpus
-
C-003322: OntoNotes Release 2.0
*Introduction*
The OntoNotes project is a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, Usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate-argument structure) and shallow semantics (word sense linked to an ontology and coreference). OntoNotes Release 2.0 is a continuation of the OntoNotes project and is supported by the Defense Advanced Research Projects Agency, GALE Program Contract No. HR0011-06-C-0022.
OntoNotes Release 1.0 (LDC2007T21) contains 400k words of Chinese newswire data (from Xinhua News Agency and Sinorama Magazine) and 300k words of English newswire data (from the Wall Street Journal). OntoNotes Release 2.0 adds the following to the corpus: 274k words of Chinese broadcast news data (from China Broadcasting System, China Central TV, China National Radio, China Television System and Voice of America); and 200k words of English broadcast news data (from ABC, CNN, NBC, Public Radio International and Voice of America).
Natural language applications like machine translation, question answering, and summarization currently are forced to depend on impoverished text models like bags of words or n-grams, while the decisions that they are making ought to be based on the meanings of those words in context. That lack of semantics causes problems throughout the applications. Misinterpreting the meaning of an ambiguous word results in failing to extract data, incorrect alignments for translation, and ambiguous language models. Incorrect coreference resolution results in missed information (because a connection is not made) or incorrectly conflated information (due to false connections). OntoNotes builds on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation will include word sense disambiguation for nouns and verbs, with each word sense connected to an ontology, and coreference. The current goals call for annotation of over a million words each of English and Chinese, and half a million words of Arabic over five years.
The authors wish to make this resource available to the natural language research community so that decoders for these phenomena can be trained to generate the same structure in new documents. Lessons learned over the years have shown that the quality of annotation is crucial if it is going to be used for training machine learning algorithms. Taking this cue, each layer of annotation in OntoNotes will have at least 90% inter-annotator agreement. Pilot studies have shown that predicate structure, word sense, ontology linking, and coreference can all be annotated rapidly and with better than 90% consistency.
*Samples*
For examples of the data in this corpus, please examine the following samples:
* Chinese
* English
*Sponsorship*
This work is supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.
The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
- isReferencedBy: Ralph Weischedel, et al., 2008, "OntoNotes Release 2.0," Linguistic Data Consortium, Philadelphia
- references: C-001297: TDT4 Multilingual Text and Annotations
- references: C-001546: Treebank-2
- references: C-000695: Chinese Treebank 5.0
- replaces: C-003314: OntoNotes v 1.0
- conformsTo: C-001546: Treebank-2
-
C-003326: Global English Monitor Corpus
The Global English Monitor Corpus is an electronic archive of the world's leading English-language newspapers. Sophisticated search procedures will show, at a fingertip, how the meaning of terrorism changed after September 11, 2001. The corpus will show how the world will never be the same by finding what is being said now but has never been said before, and it will show whether English-language discourse in Britain, the United States, Australia, Pakistan and South Africa has changed in the same way or differently.
-
C-003327: GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
*Introduction*
This release is Part 1 of the three-part GALE Phase 1 Arabic Broadcast News Parallel Text, which, along with other corpora, was used as training data in year 1 (Phase 1) of the DARPA-funded GALE program. This corpus contains transcripts and English translations of 17 hours of Arabic broadcast news programming selected from a variety of sources.
This corpus does not contain the audio files from which the transcripts and translations were generated. The audio files will be released by the LDC at a future date.
LDC has released the following GALE Phase 1 & 2 Arabic Parallel Text data sets:
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
* GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
* GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
* GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
* GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
*Source Data*
A total of 17 hours of Arabic broadcast news recordings was selected from six sources and seven different programs.
A manual selection procedure was used to choose data appropriate for the GALE program, namely, news and conversation programs focusing on current events. Stories on topics such as sports, entertainment news, and stock market reports were excluded from the data set. The following table is a summary of the files included in this release.
Source              Program         Epoch (YYYY.MM)     #hours  #words
Al Hurra            News 10         2005.11             0.2     959
Al Hurra            News 13         2005.04 - 2005.11   3.9     24,430
Dubai TV            Dubai News      2005.01 - 2005.02   1.9     10,842
Lebanese Broadcast  Naharkum Saiid  2005.01 - 2005.02   2.0     13,979
Nile TV             News            2000.10             0.6     3,671
Voice of America    News            2000.06 - 2000.11   5.7     36,925
*Transcription*
The selected audio snippets were then carefully transcribed by LDC annotators and professional transcription agencies following LDC's Quick Rich Transcription guidelines. Manual sentence unit/segment (SU) annotation was also performed as part of the transcription task. Three types of end-of-sentence SU are identified:
- statement SU
- question SU
- incomplete SU
*Translation*
After transcription and SU annotation, the files were reformatted into a human-readable translation format and were then assigned to professional translators for careful translation. Translators followed LDC's GALE translation guidelines, which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features (such as names and speech disfluencies), and quality control procedures applied to completed translations.
TDF Format
All final data are in Tab Delimited Format (TDF). TDF is compatible with other transcription formats, such as the Transcriber format and AG format, and it is easy to process.
Each line of a TDF file corresponds to a speech segment and contains 13 tab delimited fields (the 13th field "suType" might be empty):
field  name            data_type
-----  ----            ---------
1      file            unicode
2      channel         int
3      start           float
4      end             float
5      speaker         unicode
6      speakerType     unicode
7      speakerDialect  unicode
8      transcript      unicode
9      section         int
10     turn            int
11     segment         int
12     sectionType     unicode
13     suType          unicode
A source TDF file and its translation are the same except that the transcript in the source TDF is replaced by its English translation.
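Since TDF is just tab-delimited text, a segment line can be parsed with a few lines of code. The following is a minimal sketch based on the 13-field layout described above (the function name and dict-based representation are illustrative, not part of the release):

```python
def parse_tdf_line(line):
    """Parse one speech segment from a Tab Delimited Format (TDF) line
    into a dict, following the 13-field layout (file, channel, start,
    end, speaker, speakerType, speakerDialect, transcript, section,
    turn, segment, sectionType, suType)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 12:
        fields.append("")  # the trailing suType field may be empty
    if len(fields) != 13:
        raise ValueError(f"expected 13 tab-delimited fields, got {len(fields)}")
    return {
        "file": fields[0],
        "channel": int(fields[1]),
        "start": float(fields[2]),   # segment start time in seconds
        "end": float(fields[3]),     # segment end time in seconds
        "speaker": fields[4],
        "speakerType": fields[5],
        "speakerDialect": fields[6],
        "transcript": fields[7],
        "section": int(fields[8]),
        "turn": int(fields[9]),
        "segment": int(fields[10]),
        "sectionType": fields[11],
        "suType": fields[12],
    }
```

Because a translation TDF differs from its source only in the transcript field, the same parser works for both sides of the parallel data.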
Encoding
All data are encoded in UTF-8.
*Samples*
For examples of this data, please examine these screen captures of the original text and its translation.
*Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
- isReferencedBy: Xiaoyi Ma, Dalal Zakhary, 2007, "GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1," Linguistic Data Consortium, Philadelphia
- hasVersion: C-003328: GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
- hasVersion: C-003348: GALE Phase 1 Arabic Blog Parallel Text
-
C-003328: GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
*Introduction*
This release is Part 1 of the three-part GALE Phase 1 Chinese Broadcast News Parallel Text, which, along with other corpora, was used as training data in year 1 (Phase 1) of the DARPA-funded GALE program. This corpus contains transcripts and English translations of 23.3 hours of Chinese broadcast news programming. As indicated below, a small number of audio files corresponding to the text in this corpus have been previously released.
*Source Data*
A total of 23.3 hours of Chinese broadcast news programming was selected from two sources, China Central TV (CCTV) (a broadcaster from Mainland China) and Phoenix TV (a Hong Kong-based satellite TV station). The transcripts and translations represent recordings of five different programs.
A manual selection procedure was used to choose data appropriate for the GALE program, namely, news programs focusing on current events. Stories on topics such as sports, entertainment news, and stock markets were excluded from the data set. The following table is a summary of the files included in this release.
*Samples*
* Jpeg screen capture of the Chinese source text
* Original Chinese source text.
* English Translation
*Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
- isReferencedBy: Xiaoyi Ma, 2007, "GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1," Linguistic Data Consortium, Philadelphia
- hasVersion: C-003327: GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
- references: C-001246: 1997 Mandarin Broadcast News Speech (HUB4-NE)
- hasVersion: C-003348: GALE Phase 1 Arabic Blog Parallel Text
-
C-003329: ISI Chinese-English Automatically Extracted Parallel Text
*Introduction*
This file contains documentation for ISI Chinese-English Automatically Extracted Parallel Text, Linguistic Data Consortium (LDC) catalog number LDC2007T09 and ISBN 1-58563-422-0.
This distribution contains a corpus of Chinese-English parallel sentences, which were extracted automatically from two monolingual corpora: Chinese Gigaword Second Edition (LDC2006T02) and English Gigaword Second Edition (LDC2005T14). The data was extracted from news articles published by Xinhua News Agency and was obtained using the automatic parallel sentence identification method described in the following publication: Dragos Stefan Munteanu, Daniel Marcu, 2005. Improving Machine Translation Performance by Exploiting Non-parallel Corpora, Computational Linguistics, 31(4):477-504
The corpus contains 558,567 sentence pairs; the word count on the English side is approximately 16 million words. The sentences in the parallel corpus preserve the form and encoding of the texts in the original Gigaword corpora.
For each sentence pair in the corpus, the authors provide the names of the documents from which the two sentences were extracted, as well as a confidence score (between 0.5 and 1.0), which is indicative of their degree of parallelism. The parallel sentence identification approach is designed to judge sentence pairs in isolation from their contexts, and can therefore find parallel sentences within document pairs which are not parallel. The fact that two documents share several parallel sentences does not necessarily mean the documents are parallel.
In order to make this resource useful for research in Machine Translation (MT), the authors made efforts to detect potential overlaps between this data and the standard test and development data sets used by the MT community. The NIST 2002-2005 MT evaluation data sets contain several articles from Xinhua News Agency. Sentence pairs in this distribution that have a 7-gram overlap with a sentence pair in a NIST MT evaluation set or sentence pairs coming from documents whose names are similar to those in the NIST MT sets are marked with a negative confidence score.
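A consumer of this corpus would typically threshold on the confidence score and discard the negatively scored pairs flagged above. The sketch below assumes a simple in-memory representation of (source, translation, confidence) triples; the function name and tuple layout are illustrative, not part of the release format:

```python
def select_training_pairs(pairs, min_confidence=0.8):
    """Select sentence pairs for MT training.

    Each pair is a (src_sentence, tgt_sentence, confidence) triple.
    Pairs with a negative confidence were flagged as overlapping the
    NIST MT evaluation sets and are always excluded; the rest are kept
    only if the extractor's confidence (0.5-1.0) meets the threshold.
    """
    kept = []
    for src, tgt, conf in pairs:
        if conf < 0:
            continue  # overlaps a NIST MT eval set; never use for training
        if conf >= min_confidence:
            kept.append((src, tgt))
    return kept
```

Raising `min_confidence` trades corpus size for precision: the extractor's scores are only indicative of parallelism, so noisy pairs concentrate near the 0.5 lower bound.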
*Samples*
Please view the following samples:
* Chinese Sample
* English Sample
* Parallel Sample
- references: C-000689: Chinese Gigaword Second Edition
- references: C-001407: English Gigaword Second Edition
- hasVersion: C-000719: ISI Arabic-English Automatically Extracted Parallel Text
- conformsTo: Improving Machine Translation Performance by Exploiting Non-parallel Corpora, Computational Linguistics, 31(4):477-504
- isReferencedBy: Dragos Stefan Munteanu, Daniel Marcu, 2007, "ISI Chinese-English Automatically Extracted Parallel Text," Linguistic Data Consortium, Philadelphia