Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing items 371 - 380 of 2023.

C-000693: Chinese Treebank 2.0
The Chinese Treebank 2.0 was produced by:

Principal Investigators: Martha Palmer, Mitch Marcus, Tony Kroch

Consultants: Martha Palmer, Mitch Marcus, Tony Kroch, Shizhe Huang, Mary Ellen Okurowski, John Kovarik, Boyan A. Onyshkevyc

Project Managers and Guideline Designers: Fei Xia, Nianwen Xue

Annotators: Fu-Dong Chiou, Nianwen Xue

Programming support: Zhibiao Wu

*Introduction*

Published by the Linguistic Data Consortium (LDC), catalog number LDC2001T11 and ISBN 1-58563-204-X.

The Chinese Penn Treebank Project started in the summer of 1998. The goal is the creation of a 100,000-word corpus of Chinese with syntactic bracketing. More information is available at The Chinese Treebank Project. Chinese Treebank 2.0 supersedes and replaces the Chinese Penn Treebank Final Release (LDC2000T48, ISBN 1-58563-187-6).

*Data*

Size: About 100K words, 325 data files

Source: 325 articles from Xinhua newswire between 1994 and 1998

Coding: GB code

Format: Same as the UPenn English Treebank, except that some original file information, such as "SRCID" and "DATE", was retained in the data files.

Annotation: All the files are annotated at least twice: the first pass is done by one annotator, and the resulting files are checked by a second annotator (second pass).

SGML: All data files validate against chtb.dtd using nsgmls.

The files are located in the data subdirectory and are sequentially named as follows: chtb_nnn.fid, where nnn is the sequential file number. There is a cross reference in file.tbl which provides some annotator and historical information.

More extensive documentation, including samples of the annotated data, can be found at http://www.cis.upenn.edu/~chinese.
- references: Martha Palmer, et al. 2001 Chinese Treebank 2.0 Linguistic Data Consortium, Philadelphia
- hasVersion: C-000694: Chinese Treebank 4.0
- hasVersion: C-000695: Chinese Treebank 5.0
- hasVersion: C-000696: Chinese Treebank 5.1
C-000694: Chinese Treebank 4.0
*Introduction*

Chinese Treebank 4.0 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T05 and ISBN 1-58563-287-2.

The Penn Chinese Treebank is an ongoing project that started in the summer of 1998. The goal of the project is to create a 500,000-word corpus of Chinese text with syntactic bracketing. Chinese Treebank 1.0 was first published in 2000. It was later corrected and released in 2001 as Chinese Treebank 2.0. More information about the project is available on the Penn Chinese Treebank website.

The content used in this corpus comes from the following newswire sources:

698 articles
Xinhua (1994-1998)

55 articles
Information Services Department of HKSAR (1997)

80 articles
Sinorama magazine, Taiwan (1996-1998 & 2000-2001)

*Data*

Chinese Treebank 4.0 contains 404,156 words, 664,633 Hanzi, 15,162 sentences, and 838 data files.

All files are GB encoded. The format of Chinese Treebank 4.0 is the same as the Penn English Treebank. All files have been annotated at least twice. The first pass was done by one annotator, and the resulting files were checked by a second annotator (second pass). The corpus also provides seven files intended to serve as the gold standard annotation.

The corpus provides four versions of files: bracketed, raw, segmented and postagged. The raw, segmented and postagged versions are generated from the bracketed version and so do not reflect the previous annotation stages.

*Updates*

Additional information, updates, and bug fixes will be posted on the Penn Chinese Treebank website.

*Sponsorship*

This corpus was funded in part through the DARPA-TIDES grant number N66001-00-1-8915.
- references: Martha Palmer, et al. 2004 Chinese Treebank 4.0 Linguistic Data Consortium, Philadelphia
C-000695: Chinese Treebank 5.0
*Introduction*

Chinese Treebank 5.0 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2005T01 and ISBN 1-58563-323-2.

The Penn Chinese Treebank is an ongoing project that started in the summer of 1998. The goal of the project is to create a 500,000-word corpus of Chinese text with syntactic bracketing. Chinese Treebank 1.0 was first published in 2000, and it was later corrected and released in 2001 as Chinese Treebank 2.0. Another updated version was released in 2004 as Chinese Treebank 4.0. More information about the project is available on the Penn Chinese Treebank website.

The content used in this corpus comes from the following newswire sources:

698 articles
Xinhua (1994-1998)

55 articles
Information Services Department of HKSAR (1997)

132 articles
Sinorama magazine, Taiwan (1996-1998 & 2000-2001)

*Data*

Chinese Treebank 5.0 contains 507,222 words, 824,983 Hanzi, 18,782 sentences, and 890 data files.

All files are GB encoded. The format of Chinese Treebank 5.0 is the same as the Penn English Treebank. All files have been annotated at least twice. The first pass was done by one annotator, and the resulting files were checked by a second annotator (second pass). Some files were also double-blind annotated and then adjudicated to create gold standard files.

The corpus provides four versions of files: bracketed, raw, segmented and postagged. The raw, segmented and postagged versions are generated from the bracketed version and so do not reflect the previous annotation stages. The bracketed files are sequentially named as follows: chtb_nnnn.fid, where nnnn is a sequential file number.
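
The naming scheme and encoding described above can be handled in a few lines of Python. This is only a minimal sketch under two assumptions: that "GB encoded" means GB2312 (which Python's built-in `gb2312` codec reads), and that the file number always has exactly four digits. The helper names `file_number` and `decode_gb` are hypothetical and are not part of the corpus distribution.

```python
import re
from typing import Optional

# Bracketed files are named chtb_nnnn.fid, where nnnn is a sequential number.
FID_PATTERN = re.compile(r"^chtb_(\d{4})\.fid$")

def file_number(name: str) -> Optional[int]:
    """Return the sequential file number, or None if the name does not match."""
    match = FID_PATTERN.match(name)
    return int(match.group(1)) if match else None

def decode_gb(raw: bytes) -> str:
    """Decode file contents, assuming 'GB encoded' means GB2312 (an assumption)."""
    return raw.decode("gb2312")

# Invented sample bytes (two Hanzi), not taken from the corpus.
sample_bytes = "\u4e2d\u6587".encode("gb2312")
print(file_number("chtb_0001.fid"), len(decode_gb(sample_bytes)))
```

If some files turn out to use characters outside GB2312, a broader codec such as `gbk` or `gb18030` may be needed instead.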

*Samples*

To see an example of a Gold Standard file, please examine this sample.

*Updates*

The 5.1 update contains corrections to errors found in the earlier version. Specifically, sentences which had more than one top-level node have been modified. Additionally, some GB-encoded white spaces have been converted to ASCII. The 5.1 package is available as an additional download to all those who have licensed CTB5.0.
- references: Martha Palmer, et al. 2005 Chinese Treebank 5.0 Linguistic Data Consortium, Philadelphia
- hasVersion: C-000693: Chinese Treebank 2.0
- hasVersion: C-000694: Chinese Treebank 4.0
- references: C-000696: Chinese Treebank 5.1
C-000696: Chinese Treebank 5.1
- references: Martha Palmer, et al. 2005 Chinese Treebank 5.1 Linguistic Data Consortium, Philadelphia
- hasVersion: C-000693: Chinese Treebank 2.0
- hasVersion: C-000694: Chinese Treebank 4.0
- hasVersion: C-000695: Chinese Treebank 5.0
C-000698: Czech Broadcast News Speech
*Introduction*

Czech Broadcast News Speech contains audio recordings collected from three Czech radio stations (Cesky rozhlas 1 Radiozurnal - CRo1, Cesky rozhlas 2 Praha - CRo2, Cesky rozhlas 3 Vltava - CRo3) and two TV channels (Ceska televize - CTV and Prima TV - Prima). The audio was recorded between February 1 and April 22, 2000, at the Department of Cybernetics, University of West Bohemia in Pilsen.

The corpus was created to support the development of large vocabulary speaker independent speech recognition systems for Czech.

*Data*

There are 286 audio files, totaling approximately 50 hours of broadcast news. The news does not contain weather forecasts, sports news, or traffic announcements. The audio files are single-channel, 22.05 kHz, 16 bit linear wav files. The stations, channels, number of files and number of hours are listed below:

Radio

Source  Files  Hours
CRo1    138    30.8
CRo2    90     7.8
CRo3    14     2

TV

Source  Files  Hours
CTV     22     5.1
Prima   22     4.2
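
The per-source figures listed above can be cross-checked against the stated totals of 286 files and approximately 50 hours. The following is a small sketch of that arithmetic; the row tuples are copied from the listing, and the function name `totals` is only illustrative.

```python
# (source, files, hours) rows copied from the listing above.
RADIO = [("CRo1", 138, 30.8), ("CRo2", 90, 7.8), ("CRo3", 14, 2.0)]
TV = [("CTV", 22, 5.1), ("Prima", 22, 4.2)]

def totals(rows):
    """Return (total files, total hours rounded to one decimal) for the rows."""
    files = sum(row[1] for row in rows)
    hours = round(sum(row[2] for row in rows), 1)
    return files, hours

all_files, all_hours = totals(RADIO + TV)
print(all_files, all_hours)
```

The rows add up to 286 files and 49.9 hours, which agrees with the description of 286 audio files totaling approximately 50 hours.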

The corresponding transcripts are available as Czech Broadcast News Transcripts. The transcripts were created by native Czech speakers working at the Department of Cybernetics, University of West Bohemia in Pilsen, under the direction of Vlasta Radova. The transcription was done using software provided by the LDC (Transcriber 1.4.1). Those parts of the audio recordings that do not contain speech or where the signal was disrupted were not transcribed. As a consequence, the corpus contains about 23 hours of transcribed speech. The transcriptions are provided in both the ISO-8859-2 and Windows-1250 character sets.

*Samples*

Please listen to this audio sample.

*Sponsorship*

The completion of this corpus was facilitated by funding provided by the Ministry of Education of the Czech Republic (Grants No. MSM235200004 and LN00A063) and by the National Science Foundation (NSF) project no. IIS-9820687 entitled "1999 Language Engineering Workshop for Students and Professionals: Integrating Research and Education (WS99)" under the agreement no. 8004-48231 between the Johns Hopkins University, Baltimore, Maryland, and the University of West Bohemia in Pilsen, Czech Republic.

*Updates*

There are no updates available at this time.
- hasFormat: C-000699: Czech Broadcast News Transcripts
- isReferencedBy: Vlasta Radova, et al. 2004 Czech Broadcast News Speech Linguistic Data Consortium, Philadelphia
C-000699: Czech Broadcast News Transcripts
*Introduction*

Czech Broadcast News Transcripts contains the transcripts corresponding to the Czech broadcast news audio published as Czech Broadcast News Speech.

The audio recordings were collected from three Czech radio stations (Cesky rozhlas 1 Radiozurnal - CRo1, Cesky rozhlas 2 Praha - CRo2, Cesky rozhlas 3 Vltava - CRo3) and two TV channels (Ceska televize - CTV and Prima TV - Prima). The audio was recorded between February 1 and April 22, 2000, at the Department of Cybernetics, University of West Bohemia in Pilsen.

The corpus was created to support the development of large vocabulary speaker independent speech recognition systems for Czech.

*Data*

There are 286 transcripts, corresponding to the 286 audio files (approximately 50 hours of broadcast news). The transcripts contain approximately 196K words and 27K unique words. The news does not contain weather forecasts, sports news, or traffic announcements. The stations, channels, number of files and number of hours are listed below:

Radio

Source  Files  Hours
CRo1    138    30.8
CRo2    90     7.8
CRo3    14     2

TV

Source  Files  Hours
CTV     22     5.1
Prima   22     4.2

The transcripts were created by native Czech speakers working at the Department of Cybernetics, University of West Bohemia in Pilsen, under the direction of Vlasta Radova. The transcription was done using software provided by the LDC (Transcriber 1.4.1). Those parts of the audio recordings that do not contain speech or where the signal was disrupted were not transcribed. As a consequence, the corpus contains about 23 hours of transcribed speech. The transcriptions are provided in both the ISO-8859-2 and Windows-1250 character sets.
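
Because the two character sets assign different byte values to some Czech letters, a transcript has to be decoded with the codec that matches the copy being read. The sketch below illustrates this with Python's built-in `iso8859_2` and `cp1250` codecs. The sample letters are chosen for illustration and are not taken from the corpus, and the function names are hypothetical.

```python
# Python ships codecs for both character sets named in the text.
CODECS = ("iso8859_2", "cp1250")

def encode_both(text: str) -> dict:
    """Encode the same text in each character set so the bytes can be compared."""
    return {codec: text.encode(codec) for codec in CODECS}

def decode_transcript(raw: bytes, codec: str) -> str:
    """Decode transcript bytes with the codec matching the distributed copy."""
    return raw.decode(codec)

s_caron = "\u0161"  # the letter s with caron: different byte in each set
c_caron = "\u010d"  # the letter c with caron: same byte in both sets
print(encode_both(s_caron))
print(encode_both(c_caron))
```

Decoding with the wrong codec usually does not raise an error; it silently yields a different letter, so it is worth checking a few accented words after loading.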

For an example transcript please click on this example.

*Sponsorship*

The completion of this corpus was facilitated by funding provided by the Ministry of Education of the Czech Republic (Grants No. MSM235200004 and LN00A063) and by the National Science Foundation (NSF) project no. IIS-9820687 entitled "1999 Language Engineering Workshop for Students and Professionals: Integrating Research and Education (WS99)" under the agreement no. 8004-48231 between the Johns Hopkins University, Baltimore, Maryland, and the University of West Bohemia in Pilsen, Czech Republic.

*Updates*

There are no updates available at this time.
- isFormatOf: C-000698: Czech Broadcast News Speech
- isReferencedBy: Vlasta Radova, et al. 2004 Czech Broadcast News Transcripts Linguistic Data Consortium, Philadelphia
C-000700: DCIEM/HCRC
*Introduction*

This set of CD-ROMs contains the materials used to collect all 216 spoken dialogues, digital audio, orthographic transcriptions, documentation, and source code for tools. The dialogues were selected to provide balanced representation at different points in a sleep deprivation experiment.

*Data*

The materials have been designed to be easily accessible to users with different equipment and a variety of needs, from those who merely wish to generate hardcopies of the orthographic transcriptions to those who require computational analyses of the speech material. All the text files (transcriptions and documentation) should be readable and printable via most systems that can be connected to a CD-ROM reader. The maps are intended for printing via PostScript printers, and the speech files are provided with human-readable standard headers, enabling them to be played in a wide range of environments for processing sampled speech.

*Updates*

There are no updates at this time.
- references: Martin Taylor, et al. 1996 DCIEM/HCRC Linguistic Data Consortium, Philadelphia
C-000704: Grassfields Bantu Fieldwork: Dschang Tone Paradigms
*Introduction*

Grassfields Bantu Fieldwork: Dschang Tone Paradigms was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003S02 and ISBN 1-58563-254-6.

The data contains tone paradigms of the language Yémba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language spoken by 300,000+ people in Southwestern Cameroon.

*Data*

There are 45 paradigm pages in HTML format. Each page lists 32 utterances, varying across subject, verb, and object. Each utterance has one to three links to recordings in .wav format, as well as a laryngograph recording (also in .wav format). Phonetic transcription has been done for every utterance; tonological transcription has been done for a little more than half.

Recorded: June 1997, Dschang, Western Province, and recording studio of SIL Cameroon, Yaoundé

Digitized, Labelled and Segmented: 1997-1998 Phonetics Laboratory, University of Edinburgh

Transcribed and Annotated: 1998-2002 LDC, University of Pennsylvania

Sponsorship:

SIL Cameroon

Economic and Social Research Council (UK) Grant R000235540

National Science Foundation (US) Grant 9983258

National Science Foundation (US) TalkBank Project Grant BCS-998009, KDI, SBE

Linguistic Data Consortium

*Updates*

There are no updates available at this time.

Because the Windows operating system renders a number of the images incorrectly, viewing the CD under the Unix operating system, which displays the icons correctly, is recommended.

*Note*

The cost of the first 100 copies of this publication (not counting the copies distributed to LDC members) is covered by NSF Grant Number 9983258, and therefore free of charge to qualified researchers. A $30 shipping and handling fee applies. After these first 100 copies are distributed, additional copies will be available under the normal terms and costs.
- references: Steven Bird 2003 Grassfields Bantu Fieldwork: Dschang Tone Paradigms Linguistic Data Consortium, Philadelphia
- hasVersion: G-000703: Grassfields Bantu Fieldwork: Dschang Lexicon
- hasVersion: C-000705: Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
C-000705: Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
*Introduction*

Grassfields Bantu Fieldwork: Ngomba Tone Paradigms was produced by the Linguistic Data Consortium (LDC), catalog number LDC2001S16 and ISBN 1-58563-216-3. Please see below for information regarding its collection, processing, and contents.

The data contains tone paradigms of the language Ngomba, a Bamileke (Grassfields Bantu) language spoken by some 63,000 people in the Western Province of Cameroon. Ngomba's tone system is undescribed, but it has many similarities with the closely related Yémba language (also known as Bamileke Dschang).

*Data*

This publication contains 755 audio files. The files in rawdata are 21 extended audio and laryngograph recordings with ESPS xlabel files; each one of the raw sound files contains the complete recording of one of the tenses. The files in paradigms are HTML indexes linked to 734 one- to three-second audio clips in .wav format. Each HTML page lists 32 utterances, varying across subject, verb, and object. Transcriptions are provided for the audio clips using the IPA-based orthography, and using phonetic and tonological transcription systems.

Recorded: June 21, 1997, Recording Studio of SIL Cameroon, Yaoundé

Digitized, Labelled and Segmented: 1997-1998 Phonetics Laboratory, University of Edinburgh

Transcribed and Annotated: 1998-2001 LDC, University of Pennsylvania

Sponsorship:

SIL Cameroon

Economic and Social Research Council (UK) Grant R000235540

National Science Foundation (US) Grant 9983258

National Science Foundation (US) TalkBank Project Grant BCS-998009, KDI, SBE

Linguistic Data Consortium

*Updates*

There are no updates available at this time.

*Note*

The cost of the first 100 copies of this publication (not counting the copies distributed to LDC members) is covered by NSF Grant Number 9983258, and therefore free of charge to qualified researchers. A $30 shipping and handling fee applies. After these first 100 copies are distributed, additional copies will be available under the normal terms and costs.
- references: Steven Bird and John Bell 2002 Grassfields Bantu Fieldwork: Ngomba Tone Paradigms Linguistic Data Consortium, Philadelphia
- hasVersion: G-000703: Grassfields Bantu Fieldwork: Dschang Lexicon
- hasVersion: C-000704: Grassfields Bantu Fieldwork: Dschang Tone Paradigms
C-000706: Gulf Arabic Conversational Telephone Speech, Transcripts
*Introduction*

Gulf Arabic Conversational Telephone Speech, Transcripts is a database containing transcripts of 975 Gulf Arabic speakers taking part in spontaneous telephone conversations in Colloquial Gulf Arabic. A total of 976 conversation sides are provided (one speaker appears on two distinct calls). The average duration per side is about 5.7 minutes.

The data was collected and transcribed in 2004 by Appen Pty Ltd., Sydney, Australia.

Each transcript file is a tab-delimited flat table, where each line contains information and text for a single contiguous utterance, presented via the following fields:

* beginning time stamp in seconds, in square brackets ("[5.7189]")
* ending time stamp in seconds, in square brackets
* channel/speaker-ID ("A:" or "B:")
* "consonant skeleton" orthography for the utterance, in UTF-8
* "diacritized" orthography for the utterance, in ASCII
The ASCII field is the Buckwalter transliteration of the fully "vowelized" (pronunciation) form of the utterance. Within fields 4 and 5, word boundaries are marked by space characters in the normal way, following common practices of Arabic orthographic convention (e.g. all definite articles and many conjunctions and prepositions are attached as prefixes to the following word).
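
The five fields described above map naturally onto a small record type. The sketch below assumes that fields are separated by exactly one tab and that every line has all five fields. The sample line is invented for illustration and is not taken from the corpus, and the names `Utterance` and `parse_line` are hypothetical.

```python
from typing import NamedTuple

class Utterance(NamedTuple):
    start: float     # beginning time stamp, in seconds
    end: float       # ending time stamp, in seconds
    channel: str     # "A" or "B"
    skeleton: str    # "consonant skeleton" orthography (Arabic script)
    vowelized: str   # "diacritized" orthography (Buckwalter, ASCII)

def parse_line(line: str) -> Utterance:
    """Split one tab-delimited transcript line into its five fields."""
    start, end, channel, skeleton, vowelized = line.rstrip("\n").split("\t")
    return Utterance(
        start=float(start.strip("[]")),   # "[5.7189]" -> 5.7189
        end=float(end.strip("[]")),
        channel=channel.rstrip(":"),      # "A:" -> "A"
        skeleton=skeleton,
        vowelized=vowelized,
    )

# Invented sample line; real lines may differ in detail.
sample = "[5.7189]\t[7.2500]\tA:\t\u0645\u0631\u062d\u0628\u0627\tmarHabA\n"
utt = parse_line(sample)
print(utt.channel, utt.start, utt.end, utt.vowelized)
```

Unpacking into exactly five names makes a malformed line fail loudly with a `ValueError` rather than being silently misread.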

Transcript tokens enclosed in single parentheses -- e.g. "(DHk)" -- represent annotation marks for non-speech events or conditions, such as laughter, noise, etc. Multi-token strings within single parentheses involve words in some other language (typically English) or some other Arabic dialect.

Double parentheses, either with or without tokens enclosed within them -- e.g. "(())", "((word))" or "((word1 word2))" -- represent regions where the transcriber was unable to tell for sure what was said.
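
The parenthesis conventions above can be told apart with one regular expression. This is a rough sketch that assumes parentheses are never nested beyond the double form. The category labels (`non_speech`, `other_language_or_dialect`, `uncertain`) and the function name `classify_marks` are hypothetical, chosen only to mirror the description.

```python
import re

# Try double parentheses first so "((word))" is not read as a single-parenthesis mark.
MARK = re.compile(r"\(\((?P<unsure>[^()]*)\)\)|\((?P<single>[^()]*)\)")

def classify_marks(text: str) -> list:
    """Return (kind, content) pairs for each parenthesized region in an utterance."""
    marks = []
    for m in MARK.finditer(text):
        if m.group("unsure") is not None:
            # Double parentheses: the transcriber was unsure what was said.
            marks.append(("uncertain", m.group("unsure").strip()))
        elif len(m.group("single").split()) > 1:
            # Multi-token string in single parentheses: other language or dialect.
            marks.append(("other_language_or_dialect", m.group("single").strip()))
        else:
            # Single token in single parentheses: non-speech event or condition.
            marks.append(("non_speech", m.group("single").strip()))
    return marks

for kind, content in classify_marks("(DHk) ((word1 word2)) (()) (thank you)"):
    print(kind, repr(content))
```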

The "consonant skeleton" orthography is intended to reflect common orthographic practice in written Arabic (i.e. Modern Standard Arabic (MSA)), but without being bound strictly by the specific spellings of MSA words. That is, there may be novel (dialect-specific) words and changes of consonant quality (hence altered spelling) in words that are cognate between MSA and Gulf Arabic.

The "vowelized" orthography is restricted to a character set that allows words to be rendered coherently in Arabic script (with all diacritics present as needed to represent short vowels, etc.), but is intended to reflect the perceived pronunciation of each token. As a result, a given word (type), having multiple occurrences in the text with identical "skeletal" spellings, may have multiple distinct "vowelized" spellings. In some cases, these different spellings simply reflect pronunciation variants, while in other cases, they represent distinct morphological forms (with distinct contextual meanings) where the semantic differences are conveyed solely by the short vowels (i.e. the diacritics).

*Samples*

For an example of the data in this publication, please view this screen capture.
- references: Appen Pty Ltd, Sydney, Australia 2006 Gulf Arabic Conversational Telephone Speech, Transcripts Linguistic Data Consortium, Philadelphia
- hasVersion: C-001258: Gulf Arabic Conversational Telephone Speech
