C-004342: Asahi Shimbun Article Data (for Academic and Research Use), 2009 Edition
A newspaper article dataset containing approximately 140,000 articles from the 2009 head-office edition of the Asahi Shimbun. Each article is annotated with one of 13 article types and with thematic categories drawn from a set of 75.
- hasVersion: C-003594: Asahi Shimbun Article Data Collection for Academic Research, 2006
- hasVersion: C-003593: Asahi Shimbun Article Data Collection for Academic Research, 2007
- hasVersion: C-004341: Asahi Shimbun Article Data (for Academic and Research Use), 2008 Edition
- hasVersion: C-004343: Asahi Shimbun Article Data (for Academic and Research Use), 2010 Edition
- hasVersion: C-004344: Asahi Shimbun Article Data (for Academic and Research Use), 2011 Edition
- hasVersion: C-005019: Asahi Shimbun Article Data (for Academic and Research Use), 2012 Edition
- hasVersion: C-005020: Asahi Shimbun Article Data (for Academic and Research Use), 2013 Edition
- hasVersion: C-005021: Asahi Shimbun Article Data (for Academic and Research Use), 2014 Edition
- hasVersion: C-005022: Asahi Shimbun Article Data (for Academic and Research Use), 2015 Edition
- hasVersion: C-005023: Asahi Shimbun Article Data (for Academic and Research Use), 2016 Edition
-
C-004343: Asahi Shimbun Article Data (for Academic and Research Use), 2010 Edition
A newspaper article dataset containing approximately 140,000 articles from the 2010 head-office edition of the Asahi Shimbun. Each article is annotated with one of 13 article types and with thematic categories drawn from a set of 75.
- hasVersion: C-003594: Asahi Shimbun Article Data Collection for Academic Research, 2006
- hasVersion: C-003593: Asahi Shimbun Article Data Collection for Academic Research, 2007
- hasVersion: C-004341: Asahi Shimbun Article Data (for Academic and Research Use), 2008 Edition
- hasVersion: C-004342: Asahi Shimbun Article Data (for Academic and Research Use), 2009 Edition
- hasVersion: C-004344: Asahi Shimbun Article Data (for Academic and Research Use), 2011 Edition
- hasVersion: C-005019: Asahi Shimbun Article Data (for Academic and Research Use), 2012 Edition
- hasVersion: C-005020: Asahi Shimbun Article Data (for Academic and Research Use), 2013 Edition
- hasVersion: C-005021: Asahi Shimbun Article Data (for Academic and Research Use), 2014 Edition
- hasVersion: C-005022: Asahi Shimbun Article Data (for Academic and Research Use), 2015 Edition
- hasVersion: C-005023: Asahi Shimbun Article Data (for Academic and Research Use), 2016 Edition
-
C-004344: Asahi Shimbun Article Data (for Academic and Research Use), 2011 Edition
A newspaper article dataset containing approximately 140,000 articles from the 2011 head-office edition of the Asahi Shimbun. Each article is annotated with one of 13 article types and with thematic categories drawn from a set of 75.
- hasVersion: C-003594: Asahi Shimbun Article Data Collection for Academic Research, 2006
- hasVersion: C-003593: Asahi Shimbun Article Data Collection for Academic Research, 2007
- hasVersion: C-004341: Asahi Shimbun Article Data (for Academic and Research Use), 2008 Edition
- hasVersion: C-004342: Asahi Shimbun Article Data (for Academic and Research Use), 2009 Edition
- hasVersion: C-004343: Asahi Shimbun Article Data (for Academic and Research Use), 2010 Edition
- hasVersion: C-005019: Asahi Shimbun Article Data (for Academic and Research Use), 2012 Edition
- hasVersion: C-005020: Asahi Shimbun Article Data (for Academic and Research Use), 2013 Edition
- hasVersion: C-005021: Asahi Shimbun Article Data (for Academic and Research Use), 2014 Edition
- hasVersion: C-005022: Asahi Shimbun Article Data (for Academic and Research Use), 2015 Edition
- hasVersion: C-005023: Asahi Shimbun Article Data (for Academic and Research Use), 2016 Edition
-
C-004346: Audiovisual Database of Spoken American English
*Introduction*
The Audiovisual Database of Spoken American English, Linguistic Data Consortium (LDC) catalog number LDC2009V01 and ISBN 1-58563-496-4, was developed at Butler University, Indianapolis, IN, in 2007 for use by a variety of researchers to evaluate speech production and speech recognition. It contains approximately seven hours of audiovisual recordings of fourteen American English speakers producing syllables, word lists and sentences used in both academic and clinical settings.
All talkers were from the North Midland dialect region -- roughly defined as Indianapolis and north within the state of Indiana -- and had lived in that region for the majority of the time from birth to 18 years of age. Each participant read 238 different words and 166 different sentences. The sentences spoken were drawn from the following sources:
* Central Institute for the Deaf (CID) Everyday Sentences (Lists A-J)
* Northwestern University Auditory Test No. 6 (Lists I-IV)
* Vowels in /hVd/ context (separate words)
* Texas Instruments/Massachusetts Institute of Technology (TIMIT) sentences
The CID Everyday Sentences were created in the 1950s from a sample developed by the Armed Forces National Research Committee on Hearing and Bio-Acoustics. They are considered to represent everyday American speech and have the following characteristics: the vocabulary is appropriate to adults; the words appear with high frequency in one or more of the well-known word counts of the English language; proper names and proper nouns are not used; common non-slang idioms and contractions are used freely; phonetic loading and "tongue-twisting" are avoided; redundancy is high; the level of abstraction is low; and grammatical structure varies freely.
Northwestern University Auditory Test No. 6 is a phonemically-balanced set of monosyllabic English words used clinically to test speech perception in adults with hearing loss.
The /hVd/ vowel list was created to elicit all of the vowel sounds of American English.
The TIMIT sentences are a subset (34 sentences) of the 2342 phonetically-rich sentences read by speakers in the TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. TIMIT was designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT speakers were from eight dialect regions of the United States.
The Audiovisual Database of Spoken American English will be of interest in various disciplines: to linguists for studies of phonetics, phonology, and prosody of American English; to speech scientists for investigations of motor speech production and auditory-visual speech perception; to engineers and computer scientists for investigations of machine audio-visual speech recognition (AVSR); and to speech and hearing scientists for clinical purposes, such as the examination and improvement of speech perception by listeners with hearing loss.
*Data*
Participants were recorded individually during a single session. A participant first completed a statement of informed consent and a questionnaire to gather biographical data and then was asked by the experimenter to mark his or her Indiana hometown on a state map. The experimenter and participant then moved to a small, sound-treated studio where the participant was seated in front of three navy blue baffles. A laptop computer was elevated to eye-level on a speaker stand and placed approximately 50-60 cm in front of the participant. Prompts were presented to the participant in a Microsoft PowerPoint presentation. The experimenter was seated directly next to the participant, but outside the camera angle, and advanced the PowerPoint slides at a comfortable pace.
Participants were recorded with a Panasonic DVC-80 digital video camera to miniDV digital video cassette tapes. All participants wore a Sennheiser MKE-2060 directional/cardioid lapel microphone throughout the recordings.
Each speaker produced a total of 94 segmented files, which were converted from Final Cut Express to QuickTime (.mov) files and then saved in the appropriately marked folder. If a speaker mispronounced a sentence or word during the recording process, the mispronunciations were edited out of the segments to be archived. The remaining parts of the recording, including the correct repetition of each prompt, were then sequenced together to create a continuous and complete segment.
The fourteen participants were between 19 and 61 years of age (with a mean age of 30 years) and native speakers of American English.
*Samples*
For an example of the data in this corpus, please view this video sample (QuickTime, .mov).
-
C-004347: BioProp Version 1.0
*Introduction*
BioProp Version 1.0 was developed by researchers at Academia Sinica, Taipei, Taiwan. It consists of proposition bank-style annotations for approximately 500 English biomedical journal abstracts. The source abstracts, annotated in accordance with Penn Treebank II guidelines, are contained in the GENIA Treebank (GTB). The GTB was developed at the Tsujii Laboratory at the University of Tokyo.
The purpose of the GENIA Project is to develop tools and resources for automatic extraction of biomedical information. One result of that work is the GENIA corpus, a collection of 2,000 biomedical journal abstracts containing semantic class annotation for biomedical terms, part-of-speech (POS) tags and coreferences. The GTB is a subset of that corpus. BioProp Version 1.0 adds a proposition bank to the GTB.
Proposition Bank (PropBank) contains annotations of predicate argument structures and semantic roles in a treebank schema in the newswire domain. To construct BioProp Version 1.0, a semantic role labeling (SRL) system trained on PropBank was used to annotate the GTB. SRL, also called shallow semantic parsing, is a popular semantic analysis technique. In SRL, sentences are represented by one or more predicate-argument structures (PAS), also known as propositions. Each PAS is composed of a predicate (e.g., a verb) and several arguments (e.g., noun phrases) that have different semantic roles, including main arguments such as agent and patient, and adjunct arguments, such as time, manner and location. The term "argument" refers to a syntactic constituent of the sentence related to the predicate, and the term "semantic role" refers to the semantic relationship between a sentence's predicate and argument.
To suit the needs of the biomedical domain, the PropBank annotation guidelines were modified to characterize semantic roles as components of biological events. Specifically, thirty verbs were selected according to their frequency of use or importance in biomedical texts. Since the targets of information extraction are relations between named entities, only sentences containing protein or gene names were used to count each verb's frequency. Verbs of general usage were filtered out in order to keep the focus on biomedical verbs. Some verbs that do not have a high frequency but play important roles in describing biomedical relations, such as "phosphorylate" and "transactivate," were also selected. The BioProp annotation was based on Levin's verb classes as defined in the VerbNet lexicon. In VerbNet, the arguments of each verb are represented at the semantic level, and thus have associated semantic roles. However, since some verbs may have different usages in biomedical and newswire texts, it was necessary to customize the framesets of biomedical verbs. After selecting the predicate verbs, a semi-automatic method was used to annotate BioProp. The annotation process consisted of the following steps:
* Identification of predicate candidates
* Automatic annotation of the biomedical semantic roles using a newswire SRL system
* Transformation of automatic tagging results into WordFreak format
* Review by human annotators
*Data*
BioProp Version 1.0 consists of approximately 150,000 words. Each line in the corpus provides a PAS annotation that can be mapped to a sentence in the GTB.
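As a rough illustration of how such standoff lines can be consumed, the Python sketch below parses a single record, assuming the whitespace-separated layout visible in the Samples section (document ID, sentence number, predicate span, predicate, then start:end-ROLE argument spans); this layout is inferred from the sample records, not from the official documentation.
# Sketch: parse one BioProp-style standoff line into a dictionary.
# Field layout assumed from the sample records below: doc id, sentence
# number, predicate character span, predicate, "start:end-ROLE" arguments.
def parse_pas_line(line):
    tokens = line.split()
    pred_start, pred_end = map(int, tokens[2].split(":"))
    arguments = []
    for item in tokens[4:]:
        span, _, role = item.partition("-")  # first "-" separates span from role
        start, end = map(int, span.split(":"))
        arguments.append((start, end, role))
    return {
        "doc_id": tokens[0],
        "sentence": int(tokens[1]),
        "predicate": tokens[3],
        "pred_span": (pred_start, pred_end),
        "arguments": arguments,
    }

parse_pas_line("91079577 4 74:82 induce 0:65-ARG0 74:82-rel 83:99-ARG1 100:113-ARGM-LOC")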
*Samples*
91079577 4 74:82 induce 0:65-ARG0 74:82-rel 83:99-ARG1 100:113-ARGM-LOC
91094881 3 142:152 stimulate 0:46-ARG0 49:139-ARGM-TMP 142:152-rel 153:166-ARG1 167:217-ARGM-LOC
91094881 6 88:98 stimulate 0:55-ARGM-ADV 58:87-ARG0 88:98-rel 99:112-ARG1 113:168-ARGM-LOC
91094881 8 217:222 bind 160:183-ARG1 184:210-C-ARG1 211:216-R-ARG1 223:247-ARG2 217:222-rel 248:275-ARGM-ADV
91094881 9 45:53 suppress 0:13-ARGM-ADV 16:38-ARG0 54:78-ARG1 39:44-ARGM-MOD 45:53-rel 79:105-C-ARG1 106:135-ARGM-LOC
91094881 10 49:56 block 0:8-ARGM-DIS 11:44-ARG1 49:56-rel 57:82-ARG0 83:115-ARGM-LOC
91101115 2 99:108 increase 0:98-ARG1 99:108-rel 109:152-ARGM-CAU
91101115 3 159:163 bind 119:153-ARG1 164:191-ARG2 154:158-R-ARG1 159:163-rel
- references: C-001701: GENIA Corpus
-
C-004348: Chinese Gigaword Fourth Edition
*Introduction*
Chinese Gigaword Fourth Edition, Linguistic Data Consortium (LDC) catalog number LDC2009T27 and ISBN 1-58563-527-8, is a comprehensive archive of newswire text data that has been acquired over several years by the LDC. This edition includes all of the content of Chinese Gigaword Third Edition (LDC2007T38) as well as newly collected data. In addition, four entirely new sources have been added in the fourth edition: Central News Service, Guangming Daily, People's Liberation Army Daily, and People's Daily.
The eight distinct international sources of Chinese newswire included in this edition are the following:
* Agence France Presse (afp_cmn)
* Central News Agency, Taiwan (cna_cmn)
* Central News Service (cns_cmn)
* Guangming Daily (gmw_cmn)
* People's Daily (pda_cmn)
* People's Liberation Army Daily (pla_cmn)
* Xinhua News Agency (xin_cmn)
* Zaobao Newspaper (zbn_cmn)
The seven-letter codes in the parentheses above are used for the directory names and data files for each source, and are also used (in ALL_CAPS) as part of the unique DOC id string assigned to each news article.
*Data*
The original data received by the LDC from AFP, People's Liberation Army Daily, Xinhua, and Zaobao were encoded in GB-2312; those from CNA were in Big-5; and those from GMW, CNS, and People's Daily were in a combination of GB-2312 and GB-18030. To avoid the problems and confusion that could result from differences in character-set specifications, all text files in this corpus have been converted to UTF-8 character encoding.
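As a minimal sketch of that kind of normalization (for illustration only; the file names below are hypothetical, and the corpus as distributed is already UTF-8), the conversion amounts to decoding with the legacy codec and re-encoding as UTF-8:
# Hypothetical example: re-encode a legacy-encoded text file as UTF-8.
# Python's "gb18030" codec decodes a superset of GB-2312, so it safely
# handles both; use "big5" for the CNA data.
def to_utf8(path_in, path_out, source_encoding="gb18030"):
    with open(path_in, "r", encoding=source_encoding) as f:
        text = f.read()
    with open(path_out, "w", encoding="utf-8") as f:
        f.write(text)

to_utf8("xin_example.gb", "xin_example.utf8", "gb18030")  # hypothetical file names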
*New in the Fourth Edition*
* Two years' worth of new articles (January 2007 through December 2008) have been added to the Xinhua, Agence France Presse, and CNA data sets.
* Four new data sources have been added - Guangming Daily, Central News Service, People's Daily and People's Liberation Army Daily - covering a timespan from November 2006 through December 2008.
*Samples*
Please view this sample.
*Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
- replaces: C-003304: Chinese Gigaword Third Edition
-
C-004349: FactBank 1.0
*Introduction*
FactBank 1.0, Linguistic Data Consortium (LDC) catalog number LDC2009T23 and ISBN 1-58563-522-7, consists of 208 documents (over 77,000 tokens) from newswire and broadcast news reports in which event mentions are annotated with their degree of factuality, that is, the degree to which they correspond to actual situations in the world. FactBank 1.0 was built on top of TimeBank 1.2 and a fragment of the AQUAINT TimeML Corpus, both of which used the TimeML specification language. This resulted in a double-layered annotation of event factuality: TimeBank 1.2 and AQUAINT TimeML encode most of the basic structural elements expressing factuality information, while FactBank 1.0 represents the resulting factuality interpretation. The combination of the factuality values in FactBank with the structural information in TimeML-annotated corpora facilitates the development of tools aimed at automatically identifying the factuality values of events, a fundamental component in tasks requiring some degree of text understanding, such as Textual Entailment, Question Answering, or Narrative Understanding.
FactBank annotations indicate whether the event mention describes actual situations in the world, situations that have not happened, or situations of uncertain interpretation. Event factuality is not an inherent feature of events but a matter of perspective. Different discourse participants may present divergent views about the factuality of the very same event. Consequently, in FactBank, the factuality degree of events is assigned relative to the relevant sources at play. In this way, it can adequately reflect the divergence of opinions regarding the factual status of events, as is common in news reports.
The annotation language is grounded in established linguistic analyses of the phenomenon, which facilitated the creation of a battery of discriminatory tests for distinguishing between factuality values. Furthermore, the annotation procedure was carefully designed and divided into basic, sequential annotation tasks. This made it possible for hard tasks to be built on top of simpler ones, while at the same time allowing annotators to become incrementally familiar with the complexity of the problem. As a result, FactBank annotation achieved a relatively high inter-annotator agreement, kappa=0.81, a positive result when considered against similar annotation efforts.
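For reference, the figure reported here is the standard chance-corrected agreement coefficient (Cohen's kappa), kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement between annotators and p_e is the proportion of agreement expected by chance; a value of 0.81 therefore indicates agreement well above chance.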
*Data*
All FactBank markup is standoff and is represented through a set of 20 tables which can be easily loaded into a database. Each table resides in an independent text file, where fields are separated by three consecutive vertical bars (|||). Data in fields of string type are enclosed in single quotation marks (').
Because FactBank 1.0 was built on top of TimeBank 1.2 and AQUAINT TimeML, both of which are marked up with inline XML-based annotation, this release contains the TimeBank 1.2 and AQUAINT TimeML annotation in standoff, table-based format as well.
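As a small sketch of how such a table can be read (the example record is invented for illustration; actual field layouts are defined per table in the corpus documentation):
# Sketch: split a FactBank-style standoff record on the "|||" field
# separator and strip the single quotes around string-typed fields.
# The example record is invented, not taken from the release.
def parse_record(line):
    fields = []
    for raw in line.rstrip("\n").split("|||"):
        raw = raw.strip()
        if len(raw) >= 2 and raw.startswith("'") and raw.endswith("'"):
            raw = raw[1:-1]  # string-typed field
        fields.append(raw)
    return fields

parse_record("'wsj_0073.tml'|||12|||'said'|||'CT+'")  # -> ['wsj_0073.tml', '12', 'said', 'CT+']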
*Samples*
* Source
* Event Coreference
* Event Annotations
The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
- references: C-001318: TimeBank 1.2 (not registered)
- references: AQUAINT TimeML Corpus
-
C-004350: GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
*Introduction*
GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1, Linguistic Data Consortium (LDC) catalog number LDC2009T03 and ISBN 1-58563-506-5, was prepared by LDC and contains a total of 178,000 words (264 files) of Arabic newsgroup text and its translation selected from thirty-five sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums. This release was used as training data in Phase 1 (year 1) of the DARPA-funded GALE program.
LDC has released the following GALE Phase 1 & 2 Arabic Parallel Text data sets:
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
* GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
* GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
* GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
* GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
*Source Data*
Preparing the source data involved four stages of work: data scouting, data harvesting, formatting and data selection.
Data scouting involved manually searching the web for suitable newsgroup text. Data scouts were assigned particular topics and genres along with a production target in order to focus their web search. Formal annotation guidelines and a customized annotation toolkit helped data scouts to manage the search process and to track progress.
Data scouts logged their decisions about potential text of interest (sites, threads and posts) to a database. A nightly process queried the annotation database and harvested all designated URLs. Whenever possible, the entire site was downloaded, not just the individual thread or post located by the data scout.
Once the text was downloaded, its format was standardized (by running various scripts) so that the data could be more easily integrated into downstream annotation processes. Original-format versions of each document were also preserved. Typically, a new script was required for each new domain name that was identified. After scripts were run, an optional manual process corrected any remaining formatting problems. The selected documents were then reviewed for content-suitability using a semi-automatic process. A statistical approach was used to rank a document's relevance to a set of already-selected documents labeled as "good." An annotator then reviewed the list of relevance-ranked documents and selected those which were suitable for a particular annotation task or for annotation in general. These newly-judged documents in turn provided additional input for the generation of new ranked lists.
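The statistical ranking model is not specified in this description; purely to illustrate the general idea, the sketch below scores candidate documents by cosine similarity between unigram frequency vectors and the centroid of the already-selected "good" documents.
# Illustration only: rank candidate texts by cosine similarity of
# unigram frequency vectors to the centroid of a "good" seed set.
# The actual method used in GALE data selection is not documented here.
from collections import Counter
import math

def unit_vector(counts):
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}

def cosine(u, v):
    return sum(u[w] * v[w] for w in u.keys() & v.keys())

def rank_candidates(candidates, good_docs):
    centroid = Counter()
    for doc in good_docs:
        centroid.update(unit_vector(Counter(doc.split())))
    centroid = unit_vector(centroid)
    return sorted(candidates,
                  key=lambda d: cosine(unit_vector(Counter(d.split())), centroid),
                  reverse=True)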
Manual sentence unit/segment (SU) annotation was also performed on a subset of files following LDC's Quick Rich Transcription specification. Three types of end-of-sentence SU were identified:
* statement SU
* question SU
* incomplete SU
*Translation*
After files were selected, they were reformatted into a human-readable translation format and assigned to professional translators for careful translation. Translators followed LDC's GALE translation guidelines, which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features (such as names and speech disfluencies) and quality control procedures applied to completed translations.
TDF Format
All final data are presented in Tab Delimited Format (TDF). TDF is compatible with other transcription formats, such as the Transcriber format and AG format, making it easy to process.
Each line of a TDF file corresponds to a speech segment and contains 13 tab delimited fields:
 #  field           data_type
 1  file            unicode
 2  channel         int
 3  start           float
 4  end             float
 5  speaker         unicode
 6  speakerType     unicode
 7  speakerDialect  unicode
 8  transcript      unicode
 9  section         int
10  turn            int
11  segment         int
12  sectionType     unicode
13  suType          unicode
A source TDF file and its translation are the same except that the transcript in the source TDF is replaced by its English translation.
Some fields are inapplicable to newsgroup text. Those include the channel, start time, end time and speaker dialect fields. Those fields are either empty or contain placeholder values.
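A minimal reader for this layout might look as follows; it is a sketch assuming exactly the 13 tab-separated fields listed above, and it skips lines (such as headers or comments) with a different field count.
# Sketch: read a TDF file into a list of dictionaries, one per segment.
TDF_FIELDS = ["file", "channel", "start", "end", "speaker", "speakerType",
              "speakerDialect", "transcript", "section", "turn", "segment",
              "sectionType", "suType"]

def read_tdf(path):
    segments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            values = line.rstrip("\n").split("\t")
            if len(values) != len(TDF_FIELDS):
                continue  # skip header/comment lines
            segments.append(dict(zip(TDF_FIELDS, values)))
    return segments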
Encoding
All data are encoded in UTF-8.
*Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
*Samples*
For an example of the data in this corpus, please examine this screen capture of a source and its translation.
-
C-004351: GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
*Introduction*
GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 was prepared by the Linguistic Data Consortium (LDC) and contains a total of 145,000 words (263 files) of Arabic newsgroup text and its translation selected from thirty-five sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums. This release was used as training data in Phase 1 (year 1) of the DARPA-funded GALE program. This is the second of a two-part release. GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 was released in early 2009.
LDC has released the following GALE Phase 1 & 2 Arabic Parallel Text data sets:
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
* GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
* GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
* GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
* GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
*Source Data*
Preparing the source data involved four stages of work: data scouting, data harvesting, formatting and data selection.
Data scouting involved manually searching the web for suitable newsgroup text. Data scouts were assigned particular topics and genres along with a production target in order to focus their web search. Formal annotation guidelines and a customized annotation toolkit helped data scouts to manage the search process and to track progress. The data scouting process is described in the GALE task specification.
Data scouts logged their decisions about potential text of interest (sites, threads and posts) to a database. A nightly process queried the annotation database and harvested all designated URLs. Whenever possible, the entire site was downloaded, not just the individual thread or post located by the data scout.
Once the text was downloaded, its format was standardized (by running various scripts) so that the data could be more easily integrated into downstream annotation processes. Original-format versions of each document were also preserved. Typically, a new script was required for each new domain name that was identified. After scripts were run, an optional manual process corrected any remaining formatting problems. The selected documents were then reviewed for content-suitability using a semi-automatic process. A statistical approach was used to rank a document's relevance to a set of already-selected documents labeled as "good." An annotator then reviewed the list of relevance-ranked documents and selected those which were suitable for a particular annotation task or for annotation in general. These newly-judged documents in turn provided additional input for the generation of new ranked lists.
Manual sentence unit/segment (SU) annotation was also performed on a subset of files following LDC's Quick Rich Transcription guidelines. Three types of end-of-sentence SU were identified:
* statement SU
* question SU
* incomplete SU
*Translation*
After files were selected, they were reformatted into a human-readable translation format and assigned to professional translators for careful translation. Translators followed LDC's GALE translation guidelines, which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features (such as names and speech disfluencies) and quality control procedures applied to completed translations.
*Final Data*
A source file and its translation share the same file name across directories.
TDF Format
All final data are presented in Tab Delimited Format (TDF). TDF is compatible with other transcription formats, such as the Transcriber format and AG format, making it easy to process.
Each line of a TDF file corresponds to a speech segment and contains 13 tab delimited fields:
 #  field           data_type
 1  file            unicode
 2  channel         int
 3  start           float
 4  end             float
 5  speaker         unicode
 6  speakerType     unicode
 7  speakerDialect  unicode
 8  transcript      unicode
 9  section         int
10  turn            int
11  segment         int
12  sectionType     unicode
13  suType          unicode
A source TDF file and its translation are the same except that the transcript in the source TDF is replaced by its English translation.
Some fields are inapplicable to newsgroup text. Those include the channel, start time, end time and speaker dialect fields. Those fields are either empty or contain placeholder values.
Encoding
All data are encoded in UTF-8.
*Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
*Samples*
For an example of the data in this corpus, please examine these images of a source document and its translation.
* Original posting
* Translation
-
C-004352: GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
*Introduction*
GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 was prepared by LDC and contains 240,000 characters (112 files) of Chinese newsgroup text and its translation selected from twenty-five sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums. This release was used as training data in Phase 1 (year 1) of the DARPA-funded GALE program.
*Source Data*
Preparing the source data involved four stages of work: data scouting, data harvesting, formatting and data selection.
Data scouting involved manually searching the web for suitable newsgroup text. Data scouts were assigned particular topics and genres along with a production target in order to focus their web search. Formal annotation guidelines and a customized annotation toolkit helped data scouts to manage the search process and to track progress.
Data scouts logged their decisions about potential text of interest (sites, threads and posts) to a database. A nightly process queried the annotation database and harvested all designated URLs. Whenever possible, the entire site was downloaded, not just the individual thread or post located by the data scout.
Once the text was downloaded, its format was standardized (by running various scripts) so that the data could be more easily integrated into downstream annotation processes. Original-format versions of each document were also preserved. Typically, a new script was required for each new domain name that was identified. After scripts were run, an optional manual process corrected any remaining formatting problems.
The selected documents were then reviewed for content-suitability using a semi-automatic process. A statistical approach was used to rank a document's relevance to a set of already-selected documents labeled as "good." An annotator then reviewed the list of relevance-ranked documents and selected those which were suitable for a particular annotation task or for annotation in general. These newly-judged documents in turn provided additional input for the generation of new ranked lists.
Manual sentence unit/segment (SU) annotation was also performed on a subset of files following LDC's Quick Rich Transcription specification. Three types of end-of-sentence SU were identified: statement SU, question SU and incomplete SU.
*Translation*
After files were selected, they were reformatted into a human-readable translation format, and the files were then assigned to professional translators for careful translation. Translators followed GALE Translation guidelines which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features (such as names and speech disfluencies), and quality control procedures applied to completed translations.
TDF Format
All final data are in Tab Delimited Format (TDF). TDF is compatible with other transcription formats, such as the Transcriber format and AG format, and it is easy to process.
Each line of a TDF file corresponds to a speech segment and contains 13 tab delimited fields:
 #  field           data_type
 1  file            unicode
 2  channel         int
 3  start           float
 4  end             float
 5  speaker         unicode
 6  speakerType     unicode
 7  speakerDialect  unicode
 8  transcript      unicode
 9  section         int
10  turn            int
11  segment         int
12  sectionType     unicode
13  suType          unicode
A source TDF file and its translation are the same except that the transcript in the source TDF is replaced by its English translation.
Some fields are inapplicable to newsgroup text. Those include the channel, start time, end time and speaker dialect fields. These fields are either empty or contain placeholder values.
Encoding
All data are encoded in UTF-8.
*Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
*Samples*
* Source
* Translation