Language resource #: 3330
-
C-003302: Chinese Affect Recognition
The corpus contains 1,815 Chinese speech datasets from 16 speakers. Each one consists of affective readings of sentences with embedded affective words that represent the emotions in the sentences.
-
C-003303: Chinese Treebank 6.0
*Introduction*
This file contains documentation for Chinese Treebank 6.0, Linguistic Data Consortium (LDC) catalog number LDC2007T36 and ISBN 1-58563-450-6.
The Chinese Treebank project began at the University of Pennsylvania in 1998 and continues at Penn and the University of Colorado. Chinese Treebank 6.0 is the latest version produced from this effort, consisting of 780,000 words (over 1.28 million Chinese characters) that are segmented, part-of-speech tagged and fully bracketed. The data sources include newswire from Xinhua News Agency, articles from Sinorama Magazine, news from the website of the Hong Kong Special Administrative Region and transcripts from various broadcast news programs.
The LDC published Chinese Treebank 1.0 in 2000; it was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11) and consisted of approximately 100,000 words. The LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000 words, in 2004. A year later, the LDC published the 500,000 word Chinese Treebank 5.0 (LDC2005T01).
For information about Chinese Treebank methodology and guidelines, consult the attached documentation files and the Penn-CU Chinese Treebank Project website.
This release encompasses 2,036 text files, containing 28,295 sentences, 781,351 words and 1,285,149 hanzi (Chinese characters). The data is provided in two encodings: GBK and UTF-8, and the annotation has Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the enclosed segmentation, POS-tagging and bracketing guidelines. The data is provided in four different formats: raw text, word segmented, word segmented and POS-tagged, and syntactically bracketed.
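As a rough illustration of what "Penn Treebank-style labeled brackets" look like to a program, the sketch below reads one bracketed tree into nested tuples and lists its words with their part-of-speech tags. The example string is a hypothetical tree written for this illustration, not a line from the corpus, and the function names `parse_bracketed` and `leaves` are assumptions; the enclosed bracketing guidelines remain the authority on the actual file layout.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())  # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree):
    """Return (word, POS tag) pairs in left-to-right order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(leaves(child))
        else:
            pairs.append((child, label))
    return pairs


# Hypothetical example tree, for illustration only.
example = "(IP (NP (NN 经济)) (VP (VV 发展)) (PU 。))"
tree = parse_bracketed(example)
words_with_tags = leaves(tree)
```

Dropping the tags from `words_with_tags` gives the word-segmented view, and joining the words gives the raw text, which mirrors the four formats listed above.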
*Samples*
For an example of the data in this publication, please examine this sample of the bracketed data.
- isReferencedBy: Martha Palmer, et al., 2007, "Chinese Treebank 6.0 (CTB6.0)," Linguistic Data Consortium, Philadelphia
- replaces: C-000696: Chinese Treebank 5.1
- conformsTo: C-001546: Treebank-2
- isReplacedBy: C-004360: Chinese Treebank 7.0
-
C-003304: Chinese Gigaword Third Edition
*Introduction*
Chinese Gigaword Third Edition is a comprehensive archive of newswire text data that has been acquired over several years by the LDC. This edition includes all of the contents in Chinese Gigaword Second Edition (LDC2005T14) as well as new data collected after the publication of that edition. Also, an archive of articles from a new newswire source (Agence France Presse) has been added in the third edition.
The four distinct international sources of Chinese newswire included in this edition are the following:
* Agence France Presse (afp_cmn)
* Central News Agency, Taiwan (cna_cmn)
* Xinhua News Agency (xin_cmn)
* Zaobao Newspaper (zbn_cmn)
The seven-character codes in the parentheses above are used for the directory names and data files for each source, and are also used (in ALL_CAPS) as part of the unique DOC "id" string assigned to each news article.
*Data*
The original data archives received by the LDC from Agence France Presse, Xinhua News Agency and Zaobao were encoded in GB-2312, whereas those from Central News Agency (CNA) were encoded in Big-5. To avoid the problems and confusion that could result from differences in character-set specifications, all text files in this corpus have been converted to UTF-8 character encoding.
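A conversion of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LDC's actual conversion procedure: the `to_utf8` helper and the `SOURCE_ENCODINGS` mapping are hypothetical names, and the mapping simply restates the encodings described above using Python's standard codec names `gb2312` and `big5`.

```python
# Assumed mapping from source code to the original encoding described above.
SOURCE_ENCODINGS = {
    "afp_cmn": "gb2312",
    "cna_cmn": "big5",
    "xin_cmn": "gb2312",
    "zbn_cmn": "gb2312",
}


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them as UTF-8.

    Decoding is strict, so bytes that are invalid in the source encoding
    raise UnicodeDecodeError instead of being silently replaced.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Small round-trip demonstration with made-up sample text.
gb_bytes = "新闻".encode("gb2312")    # simplified characters, GB-2312
big5_bytes = "新聞".encode("big5")    # traditional characters, Big-5

utf8_from_gb = to_utf8(gb_bytes, SOURCE_ENCODINGS["xin_cmn"])
utf8_from_big5 = to_utf8(big5_bytes, SOURCE_ENCODINGS["cna_cmn"])
```

Once both sources are in UTF-8, the same text-processing code can read them without knowing where each file came from.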
*New in the Third Edition*
* Over six years' worth of articles (October 2000 through December 2006) from Agence France Presse are being released for the first time.
* Two years' worth of new articles (January 2005 through December 2006) have been added to the Xinhua data set.
* Nearly two years' worth of content was added to the CNA data set. There was a gap in the LDC's collection from this source during 2006: no CNA Chinese content was collected between July 27 and December 17, 2006, inclusive, so there are no data files for August through November of that year, and the December data file is about half its expected size.
* A small set of older stories (October through December 1998) has been added from Zaobao; these were previously published by LDC as part of TDT3 Multilanguage Text Version 2.0 (LDC2001T58) and are being included in Gigaword for the first time.
*Samples*
Please examine this sample (JPEG) for an example of the data in this corpus.
*Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
- isReferencedBy: Dave Graff, 2007, "Chinese Gigaword Third Edition," Linguistic Data Consortium, Philadelphia
- replaces: C-000689: Chinese Gigaword Second Edition
- references: C-001295: TDT3 Multilanguage Text Version 2.0
- isReplacedBy: C-004348: Chinese Gigaword Fourth Edition
-
C-003305: Tagged Chinese Gigaword
*Introduction*
Tagged Chinese Gigaword, created by scholars at Academia Sinica, Taipei, Taiwan, is the part-of-speech tagged version of the LDC's Chinese Gigaword Second Edition (LDC2005T14). It contains all of the data in Chinese Gigaword Second Edition -- from Central News Agency (Taiwan), Xinhua News Agency and Lianhe Zaobao -- annotated with full part-of-speech tags.
In order to avoid any problems or confusion that could result from differences in character-set specifications in the source data, all text files in this corpus have been converted to UTF-8 character encoding. With some exceptions described in the readme file, all characters in the text are either single-byte ASCII or multi-byte Chinese.
All sources have been categorized into four distinct "types":
* story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
* multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
* advis: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
* other: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.
*Data*
The table below lists the number of files, their compressed and uncompressed sizes, the number of words and the number of documents, divided by source. #Files = number of files. Rzip-MB = compressed size in megabytes. Totl-MB = uncompressed size in megabytes. K-wrds = number of words in thousands. #DOCs = number of documents.
Source    #Files  Rzip-MB  Totl-MB   K-wrds    #DOCs
CNA_CMN      168      994     7363   792195  1769953
XIN_CMN      168      615     4535   471110   992261
ZBN_CMN       10       40      223    28066    41418
TOTAL        346     1648    12121  1291371  2803632

The following tables present the quantity of "K-wrds" and "#DOCs", divided by source and DOC type:

type="advis"    #DOCs   K-wrds
CNA_CMN          8160      751
XIN_CMN          6553      711
ZBN_CMN             0        0
TOTAL           14713     1462

type="multi"    #DOCs   K-wrds
CNA_CMN         30552    23429
XIN_CMN         11329     7516
ZBN_CMN            55       41
TOTAL           41936    30986

type="other"    #DOCs   K-wrds
CNA_CMN        100758    40258
XIN_CMN         31255     9999
ZBN_CMN           279      130
TOTAL          132292    50387

type="story"    #DOCs   K-wrds
CNA_CMN       1630483   727748
XIN_CMN        943132   452878
ZBN_CMN         41084    27898
TOTAL         2614691  1208524

The performance of CKIP Segmentation and POS tagging system has been tested in Bakeoff 2005 and Bakeoff 2006.
The test result is shown as follows:

              Doc#  RefWord#  TestWord#  MatchWord#  Recall (%)  Precision (%)  F-Score (%)
Bakeoff 2005   190    116509     116443      112091        96.2           96.3         96.2
Bakeoff 2006   148     90405      90327       87332        96.6           96.7         96.6

Note:
Recall=MatchWord# / RefWord#
Precision=MatchWord# / TestWord#
F-Score=2 * Recall * Precision / (Recall + Precision)
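The three formulas can be checked directly against the word counts reported for the two bakeoffs. The sketch below is only a worked example of that arithmetic; the function name `scores` and the `bakeoff` dictionary are illustrative names, and the counts are copied from the test results above.

```python
def scores(ref_words, test_words, match_words):
    """Return (recall, precision, f_score) as fractions between 0 and 1."""
    recall = match_words / ref_words
    precision = match_words / test_words
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score


# (RefWord#, TestWord#, MatchWord#) as reported in the test results.
bakeoff = {
    "Bakeoff 2005": (116509, 116443, 112091),
    "Bakeoff 2006": (90405, 90327, 87332),
}

for name, counts in bakeoff.items():
    recall, precision, f_score = scores(*counts)
    print(f"{name}: recall={100 * recall:.1f}% "
          f"precision={100 * precision:.1f}% F-score={100 * f_score:.1f}%")
```

Rounded to one decimal place, the computed values agree with the reported percentages (96.2, 96.3, 96.2 for Bakeoff 2005 and 96.6, 96.7, 96.6 for Bakeoff 2006).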
*Samples*
For an example of the data contained in this corpus, please view this screen capture (JPEG) of the annotated text.
- references: C-000689: Chinese Gigaword Second Edition
- isReferencedBy: Chu-Ren Huang, 2007, "Tagged Chinese Gigaword," Linguistic Data Consortium, Philadelphia
-
C-003306: CELT - Corpus of Electronic Texts
CELT, the Corpus of Electronic Texts, brings the wealth of Irish literary and historical culture to the Internet, for the use and benefit of everyone worldwide. It has a searchable online textbase consisting of 954 contemporary and historical documents from many areas, including literature and the other arts.
- hasVersion: Multitext
-
C-003307: CHRISTINE Corpus
The CHRISTINE Corpus offers structural analyses of a cross-section of 1990s spontaneous speech from all British regions, social classes, etc. For details on its current location and how to download it, see the author's resources page. The CHRISTINE documentation file is also available as a Web page (250 KB).
- hasFormat: The CHRISTINE documentation file
- hasPart: SUSANNE analytic scheme and Corpus
-
C-003308: Mandarin Affective Speech
*Introduction*
Mandarin Affective Speech is a database of emotional speech consisting of audio recordings and corresponding transcripts collected in 2005 at the Advanced Computing and System Laboratory, College of Computer Science and Technology, Zhejiang University, Hangzhou, People's Republic of China. This corpus was designed with two goals: first, to serve as a tool for investigating the linguistic and prosodic features of emotional expression in Mandarin Chinese; and second, to provide a source of training and test data to support research in speaker recognition with affective speech. The speech database was recorded by prompting speakers to express different emotional states in response to stimuli. The speakers read scenarios designed to elicit an emotional response, such as a colleague's mistake for anger, a pleasant trip for elation, a hurry-up scene for panic and a puppy's death for sadness. The five emotional states recorded are characterized as follows:
* Neutral - Simple statements without any emotion.
* Anger - A strong feeling of displeasure or hostility.
* Elation - Be glad or happy because of praise.
* Panic - A sudden, overpowering terror, often affecting many people at once.
* Sadness - Affected or characterized by sorrow or unhappiness.
*Data*
Over 100 speakers participated in the data collection. After screening, recordings from 68 speakers (23 females, 45 males) were used in this corpus. Most of the speakers were in their twenties at the time of collection. Information about the speakers is contained in "SpeakerInfo.doc."
Subjects were given a text to read that consisted of five phrases, twenty sentences and two paragraphs designed to generate the emotional speech. The material included all the phonemes in Mandarin. Each subject read the phrases, sentences and paragraphs portraying the five emotional states: neutral (unemotional), anger, elation, panic and sadness. Altogether this database contains 25,636 utterances. The read material was constructed as follows:
* 5 phrases - "yes", "no" and three nouns: "apple", "train" and "tennis ball". In Chinese, these words contain many different basic vowels and consonants.
* 20 sentences - These sentences include all the phonemes and most common consonant clusters in Mandarin. The types of sentences are: simple statements, declarative sentences with an enumeration, general questions (yes/no questions), alternative questions, imperative sentences, exclamatory sentences and special questions (wh-questions).
* 2 paragraphs - Two readings, one selected from a famous Chinese novel, and the other stating a normal fact.
All the data were recorded in a quiet office on an OLYMPUS DM-20 digital voice recorder with a sampling rate of 22,050 Hz. Afterwards, the recorded voice files were transferred to a personal computer by USB (Universal Serial Bus). The recordings were then converted into monophonic Windows PCM format at an 8 kHz sampling frequency and 16-bit resolution.
Further information about the data and methodology in this corpus is contained in the authors' paper, "MASC: A Speech Corpus in Mandarin for Emotional Analysis and Affective Speaker Recognition," in "MASC.pdf."
*Samples*
For an example of the data in this corpus, please listen to the following examples:
* Neutral
* Anger
* Elation
- isReferencedBy: Yingchun Yang, Zhaohui Wu, Tian Wu, Dongdong Li, 2007, "Mandarin Affective Speech," Linguistic Data Consortium, Philadelphia
-
C-003309: Nationwide Speech Project
*Introduction*
This corpus represents part of the work of the Nationwide Speech Project (NSP) conducted by the authors at Indiana University. The purpose of the NSP was to collect a large amount of speech produced by male and female talkers representing the primary regional varieties of American English: New England, Mid-Atlantic, North, Midland, South and West. This release contains approximately 60 hours of speech, or nearly one hour from each of 60 white American English speakers -- five male and five female talkers from each of the six dialect regions -- reading words and sentences. The corpus can be used for perceptual and acoustic experiments designed to explore the role of variation in spoken language processing. Such applications include speech science experiments and sociolinguistic or sociophonetic research.
*Data*
The speakers were recruited from the Indiana University community; they were all 18-25 years old at the time of recording, had lived exclusively in one region prior to age 18, and both parents of each speaker were also raised in the same region. Further demographic information about the speakers is provided in the file talkers.txt. The materials include 102 high predictability sentences and five repetitions of each of 10 hVd words. The high predictability sentences are 5-8 words in length and the final word in each sentence is highly predictable based on the preceding semantic context. The 10 hVd words are: heed, hid, hayed, head, had, hod, hud, hoes, hood and who'd.
Participants were recorded one at a time by an experimenter in a sound attenuated booth (IAC Audiometric Testing Room, Model 402). Both the experimenter and the participant sat in the sound booth during testing. During the recording session, the participant was seated in front of a ViewSonic LCD flatscreen monitor (ViewPanel VG151) which mirrored the screen of a Macintosh Powerbook G3 laptop. The participant wore a Shure head-mounted microphone (SM10A) that was positioned approximately one inch from the left corner of the talker's mouth. The microphone output was fed to an Applied Research Technology microphone tube pre-amplifier. The output gain on the pre-amplifier was adjusted by the experimenter while the participant read the Grandfather Passage as a warm-up before recording began. The output of the microphone pre-amplifier was connected to a Roland UA-30 USB audio interface, which digitized the signal and transmitted it via USB ports to the laptop, where each utterance was recorded in an individual AIFF 16-bit digital sound file at a sampling rate of 44.1 kHz (converted to .wav format files for this release). The experimenter held the laptop on her lap and wore headphones connected to the Roland device so that she could hear the same audio signal that was being fed into the laptop for recording.
*Samples*
* hpspin
* vowel
- isReferencedBy: Cynthia G. Clopper and David B. Pisoni, 2007, "Nationwide Speech Project," Linguistic Data Consortium, Philadelphia
-
C-003310: Corpus of Spoken, Professional American-English
The corpus, which has been constructed from a selection of existing transcripts of interactions in professional settings, contains two main sub-corpora of a million words each. One sub-corpus consists mainly of academic discussions such as faculty council meetings and committee meetings related to testing. The second sub-corpus contains transcripts of White House press conferences, which are almost exclusively question-and-answer sessions.
-
C-003311: ELISA English Language Interview Corpus as a Second-Language Application
The ELISA corpus is being developed at the University of Tuebingen (Dept of Applied English Linguistics, AEL) and the University of Surrey (Dept of Languages and Translation Studies, LTS) as a resource for language learning and teaching, and interpreter training. It contains interviews with native speakers of English, who talk about their professional careers (e.g. in tourism, politics, the media or environmental education). We are very grateful to all speakers for their kind contributions. This demo website contains selected materials from the ELISA corpus.