Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing results 341 - 350 of 2023.

C-000660: CALLHOME Mandarin Chinese Speech
The CALLHOME Mandarin Chinese corpus of telephone speech consists of 120 unscripted telephone conversations between native speakers of Mandarin Chinese. All calls, which lasted up to 30 minutes, originated in North America and were placed to locations overseas. Most participants called family members or close friends.

This corpus contains speech data files only, along with documentation that describes the contents and format of the speech files and the software packages needed to uncompress the speech data. The transcripts and documentation (LDC96T16) are available separately, as is an associated lexicon (LDC96L15).
- hasFormat: C-000661: CALLHOME Mandarin Chinese Transcripts
- hasVersion: C-000647: CALLHOME American English Speech
- hasVersion: C-000650: CALLHOME Egyptian Arabic Speech
- hasVersion: C-000654: CALLHOME German Speech
- hasVersion: C-000657: CALLHOME Japanese Speech
- hasVersion: C-000664: CALLHOME Spanish Speech
- isReferencedBy: (Online documentation) "Documentation for CALLHOME_Mandarin_Chinese_Speech" (http://www.ldc.upenn.edu/Catalog/docs/LDC96S34/)
- isReferencedBy: (Online documentation 2) (http://www.ldc.upenn.edu/Catalog/docs/LDC96T16/)
- isReferencedBy: Alexandra Canavan and George Zipperlen 1996 CALLHOME Mandarin Chinese Speech Linguistic Data Consortium, Philadelphia
C-000661: CALLHOME Mandarin Chinese Transcripts
*Introduction*

The text component of the CALLHOME Mandarin Chinese package includes transcripts and documentation files.

*Data*

The transcripts cover a contiguous five- or ten-minute segment taken from each of 120 unscripted telephone conversations between native speakers of Mandarin Chinese. The transcripts are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography.

In addition to transcript files, this corpus contains full documentation on the transcription conventions and format. Auditing and demographic information on the speakers represented in the transcripts (including gender, channel quality and so on) are also included. The data is encoded as "gb2312" (a.k.a. "euc-cn").
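
Because the data is stated to be encoded as "gb2312", a transcript has to be decoded explicitly rather than read with a platform default. The following is a minimal sketch, assuming Python's built-in `gb2312` codec is an adequate match for the files; the helper name `read_transcript` is hypothetical and not part of the corpus documentation.

```python
from pathlib import Path


def read_transcript(path, encoding="gb2312"):
    """Return the text of one transcript file, decoded with the given encoding.

    Hypothetical helper: the default encoding follows the corpus description.
    """
    return Path(path).read_bytes().decode(encoding)


# Round trip on an in-memory sample instead of a real corpus file.
sample_bytes = "你好".encode("gb2312")   # two characters, two bytes each
sample_text = sample_bytes.decode("gb2312")
```

If a file contains characters outside GB2312, decoding with `errors="replace"` is one way to surface the problem spots without aborting.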

The corpus of telephone speech (LDC96S34) is available separately, as is an associated lexicon (LDC96L15).

*Updates*

There are no updates at this time.
- isFormatOf: C-000660: CALLHOME Mandarin Chinese Speech
- isReferencedBy: G-000659: CALLHOME Mandarin Chinese Lexicon
- hasVersion: C-000648: CALLHOME American English Transcripts
- hasVersion: C-000652: CALLHOME Egyptian Arabic Transcripts
- hasVersion: C-000655: CALLHOME German Transcripts
- hasVersion: C-000658: CALLHOME Japanese Transcripts
- hasVersion: C-000665: CALLHOME Spanish Transcripts
- isReferencedBy: (Online documentation) "Documentation for CALLHOME_Mandarin_Chinese_Transcripts" (http://www.ldc.upenn.edu/Catalog/docs/LDC96T16/index.html)
- isReferencedBy: Barbara Wheatley 1996 CALLHOME Mandarin Chinese Transcripts Linguistic Data Consortium, Philadelphia
- hasFormat: C-004388: CALLHOME Mandarin Chinese Transcripts - XML version
C-000662: CALLHOME Spanish Dialogue Act Annotation
*Introduction*

The CALLHOME Spanish Dialogue Act Annotation Corpus, Linguistic Data Consortium (LDC) catalog number LDC2001T61 and ISBN 1-58563-197-3, was developed under Project CLARITY. The goal of CLARITY was to glean discourse information from unrestricted conversational speech using shallow, corpus-based analysis. The annotation was carried out at Interactive Systems Labs at Carnegie Mellon University.

*Data*

This publication used a three-level coding scheme to manually tag the LDC CALLHOME Spanish Transcripts. The three levels of the coding scheme are:

* a dialogue act level consisting of a tag set extended from DAMSL and Switchboard;
* a dialogue game level featuring short sequences of dialogue acts;
* a genre level similar to topical segments.

All 120 available CALLHOME Spanish dialogues have been annotated. The dialogue act annotation scheme is a further development of the Switchboard DAMSL tag set. Dialogue games are short sequences of dialogue acts such as question/answer pairs. Genres can be storytelling, discussion, planning, etc., and the segmentation takes topics into account as well. Genres, games, and dialogue acts are annotated by type. Genres are additionally annotated for activities and topics (on a 0-5 scale) and for the central object or person being discussed (who or what category), and each contains a short synopsis of the segment.

An example of the tagging from one conversation is presented below.

<?xml version="1.0" encoding="iso-8859-1"?> Sí, eso es para eso, de seguro. No importa. No importa. Bueno aquí, la Zaida está estudiando también en la universidad con la Liana. Y qué estudia, mamá, qué están estudiando. [background speech] Están estudiando Sociales. Ciencias Sociales. Ah, para maes- para maestra de Sociales. Sí

*Updates*

There are no updates at this time.
- references: C-000665: CALLHOME Spanish Transcripts
- isReferencedBy: Klaus Ries, Lori Levin, Liza Valle, Alon Lavie, Alex Waibel, "Shallow discourse genre annotation in CallHome Spanish" (http://isl.ira.uka.de/fileadmin/publication-files/LREC2000-klausr.pdf)
- isReferencedBy: Alex Waibel, et al. 2001 CALLHOME Spanish Dialogue Act Annotation Linguistic Data Consortium, Philadelphia
C-000664: CALLHOME Spanish Speech
The CALLHOME Spanish corpus of telephone speech consists of 120 unscripted telephone conversations between native speakers of Spanish. All calls, which lasted up to 30 minutes, originated in North America and were placed to international locations. Most participants called family members or close friends.

This corpus contains speech data files ONLY, along with the minimal amount of documentation needed to describe the contents and format of the speech files and the software packages needed to uncompress the speech data. The transcripts and documentation (LDC96T17) are available separately, as is an associated lexicon (LDC96L16).

*Updates*

The "shorten" and "sphere" directories have been removed.

The sphere directory contained NIST "SPeech HEader REsources" (SPHERE): C-language source code libraries and utilities for manipulating NIST SPHERE-format waveform files.

The shorten directory contained files for Tony Robinson's "shorten" software for speech compression.

A more recent version of the SPHERE utilities is now available on the NIST web site; additional utilities for converting from SPHERE to other waveform file formats are also available at the LDC web site.

10.10.2003: It has been brought to our attention that 16 SPHERE files (from both the train and devtest directories) were corrupted; the problem becomes apparent when trying to decompress the files using the w_decode utility. The correct version of these files is now available on a third CD-ROM, containing the 16 speech files and a readme.txt listing the contents of the disc. If you purchased the corpus, please request the CD by writing to ldc@ldc.upenn.edu. New orders will receive the two CDs and the third disc with the corrected files.
- hasFormat: C-000665: CALLHOME Spanish Transcripts
- isReferencedBy: G-000663: CALLHOME Spanish Lexicon
- hasVersion: C-000647: CALLHOME American English Speech
- hasVersion: C-000650: CALLHOME Egyptian Arabic Speech
- hasVersion: C-000654: CALLHOME German Speech
- hasVersion: C-000657: CALLHOME Japanese Speech
- hasVersion: C-000660: CALLHOME Mandarin Chinese Speech
- isReferencedBy: C-000571: 1997 HUB5 Spanish Evaluation
- isReferencedBy: (Online documentation) "Documentation for CALLHOME_Spanish_Speech" (http://www.ldc.upenn.edu/Catalog/docs/LDC96S35/)
- isReferencedBy: (Online documentation including speakers' information) "Documentation for CALLHOME_Spanish_Transcripts" (http://www.ldc.upenn.edu/Catalog/docs/LDC96T17/)
- isReferencedBy: Alexandra Canavan and George Zipperlen 1996 CALLHOME Spanish Speech Linguistic Data Consortium, Philadelphia
C-000665: CALLHOME Spanish Transcripts
The CALLHOME Spanish Transcripts corpus includes transcripts and documentation files for CALLHOME Spanish Speech, which contains 120 unscripted telephone conversations between native speakers of Spanish. The transcripts cover a contiguous five- or ten-minute segment of each call. The transcripts are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography.
- isFormatOf: C-000664: CALLHOME Spanish Speech
- isReferencedBy: G-000663: CALLHOME Spanish Lexicon
- isReferencedBy: C-000662: CALLHOME Spanish Dialogue Act Annotation
- hasVersion: C-000648: CALLHOME American English Transcripts
- hasVersion: C-000652: CALLHOME Egyptian Arabic Transcripts
- hasVersion: C-000655: CALLHOME German Transcripts
- hasVersion: C-000658: CALLHOME Japanese Transcripts
- hasVersion: C-000661: CALLHOME Mandarin Chinese Transcripts
- isReferencedBy: C-000572: 1997 HUB5 Spanish Transcripts
- isReferencedBy: (Online documentation) "Documentation for CALLHOME_Spanish_Transcripts" (http://www.ldc.upenn.edu/Catalog/docs/LDC96T17/)
- isReferencedBy: Barbara Wheatley 1996 CALLHOME Spanish Transcripts Linguistic Data Consortium, Philadelphia
C-000666: CCGbank
*Introduction*

CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations. It pairs syntactic derivations with sets of word-word dependencies which approximate the underlying predicate-argument structure.

*Data*

CCGbank contains 99.44% of the sentences in the Penn Treebank, for which it corrects a number of inconsistencies and errors in the original annotation.

*Samples*

For an example of this corpus, please examine this sample.

*Update*

The current version, 1.1, is a bug fix that supersedes the old package. It is available for download.
- isReferencedBy: Julia Hockenmaier and Mark Steedman, "CCGbank: User's Manual" http://www.cis.upenn.edu/departmental/reports/CCGbankManual.pdf
- isReferencedBy: "THE COMBINATORY CATEGORIAL GRAMMAR SITE" http://groups.inf.ed.ac.uk/ccg/index.html
- isReferencedBy: Julia Hockenmaier and Mark Steedman 2005 CCGbank Linguistic Data Consortium, Philadelphia
C-000667: CETEMpublico
*Introduction*

CETEMPublico Version 1.7 (Corpus de Extractos de Textos Electronicos MCT/Publico), produced by the Linguistic Data Consortium (LDC) as catalog number LDC2001S04 with ISBN 1-58563-206-6, is a corpus of newspaper texts from the Portuguese daily newspaper Publico, compiled for purposes of research and development in natural language processing (NLP) by the Computational Processing of Portuguese Project, under an agreement between Publico and the Portuguese Ministry of Science and Technology (MCT).

*Data*

The corpus includes the text of approximately 2,600 editions of Publico, produced between 1991 and 1998, and amounting to approximately 180 million words. CETEMPublico Version 1.7 contains 1,504,258 extracts (CETEMPublico Version 1.0 had 1,567,625). Version 1.7 was created in Oslo on August 6, 2001 and uses SGML tagging. The corpus is in 196 compressed text files, with names in the form cetemXXX.gz, from cetem001.gz to cetem196.gz.
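
The file naming scheme above (cetem001.gz to cetem196.gz) can be enumerated programmatically. The sketch below is illustrative only: the helper names are hypothetical, and the `latin-1` text encoding is an assumption, since the description does not state how the files are encoded.

```python
import gzip


def cetem_filenames(first=1, last=196):
    """Build the zero-padded names cetem001.gz ... cetem196.gz."""
    return [f"cetem{n:03d}.gz" for n in range(first, last + 1)]


def iter_lines(path, encoding="latin-1"):
    """Yield decoded lines from one compressed corpus file.

    The encoding default is an assumption, not a documented fact.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


names = cetem_filenames()
```

Streaming line by line avoids holding a whole decompressed file in memory, which matters for a corpus of roughly 180 million words.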

This corpus was designed to assist researchers who develop computer programs processing the Portuguese language and who would need raw material for their work. In addition, the authors wished for the corpus to be useful to everyone who studies the Portuguese language and wishes to verify their hypotheses in previously organized text material. The online and the CQP versions are meant for such users, who are, in any case, also welcome to get it on CD in order to process the corpus locally, possibly by means of the corpus processing system of their choice.

More detailed information is available at http://www.linguateca.pt/cetempublico.

*Updates*

There are no updates at this time.
- references: Diana Santos and Paulo Rocha 2001 CETEMpublico Linguistic Data Consortium, Philadelphia
C-000670: CSLU: 22 Languages Corpus
*Introduction*

This file contains documentation on the CSLU: 22 Languages v 1.2, Linguistic Data Consortium (LDC) catalog number LDC2005S26 and ISBN 1-58563-361-5.

Produced by the Center for Spoken Language Understanding and distributed by the Linguistic Data Consortium, the 22 Languages corpus consists of telephone speech from 21 languages: Eastern Arabic, Cantonese, Czech, Farsi, German, Hindi, Hungarian, Japanese, Korean, Malay, Mandarin, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Swahili, Tamil, Vietnamese, and English. The corpus contains fixed vocabulary utterances (e.g. days of the week) as well as fluent continuous speech. Each of the 50,191 utterances is verified by a native speaker to determine if the caller followed instructions when answering the prompts. For this release, approximately 19,758 utterances have corresponding orthographic transcriptions.

*Samples*

For an example of this corpus, please listen to these Arabic and English audio samples.

*Updates and Contact*

Questions regarding this corpus and about the Center for Spoken Language Understanding should be directed to Jan van Santen.
- replaces: CSLU: 22 Languages Corpus Version 1.1
- hasPart: C-003103: CSLU: Foreign Accented English Release 1.2
- isReferencedBy: (Online Documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC2005S26/
- isReferencedBy: T. Lander 2005 CSLU: 22 Languages Corpus Linguistic Data Consortium, Philadelphia
C-000671: CSLU: Multilanguage Telephone Speech Version 1.2
*Introduction*

The Multilanguage Telephone Speech corpus consists of telephone speech from 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. The corpus contains fixed vocabulary utterances (e.g. days of the week) as well as fluent continuous speech. The current release includes recorded utterances from about 2,052 speakers, for a total of about 38.5 hours of speech. Time-aligned phonetic transcriptions for 619 of the utterances are also included.

*Data*

Each subject called the CSLU data collection system by dialing a toll-free number. An analog telephone line was connected to a Gradient Technologies box. Data from incoming calls were recorded by the Gradient box. The sampling rate was 8 kHz and the files were stored in 16-bit linear format on a UNIX file system. Each utterance was recorded as a separate file.
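
The recording parameters give a rough idea of the raw data volume. The following back-of-the-envelope sketch assumes single-channel audio and ignores any file headers; neither point is stated in the description, so the result is an estimate rather than a documented size.

```python
SAMPLE_RATE_HZ = 8_000        # 8 kHz sampling rate, from the description
BYTES_PER_SAMPLE = 16 // 8    # 16-bit linear samples
CHANNELS = 1                  # assumption: one channel per recording


def raw_audio_bytes(duration_seconds):
    """Estimate the uncompressed sample data size for a given duration."""
    return int(duration_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


bytes_per_second = raw_audio_bytes(1)        # 16,000 bytes per second
total_bytes = raw_audio_bytes(38.5 * 3600)   # about 38.5 hours of speech
```

Under these assumptions the roughly 38.5 hours of speech correspond to about 2.2 GB of sample data.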

*Samples*

For an example of the data in this corpus, please listen to these audio samples in Tamil and English.
- replaces: Multilanguage Telephone Corpus Release Version 1.1
- isReferencedBy: (Online Documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC2006S35/
- isReferencedBy: Yeshwant Muthusamy, Ron Cole, and Beatrice Oshika 2006 CSLU: Multilanguage Telephone Speech Version 1.2 Linguistic Data Consortium, Philadelphia
C-000672: CSLU: Names Release 1.3
A common problem in training and developing speech recognition systems is scarcity of data, especially for particular phonemic contexts. The Center for Spoken Language Understanding is attempting to address this problem with the Names Corpus. The Names Corpus is a collection of name utterances, both first and last names, from several thousand different speakers over the telephone. Name utterances are "spontaneous" in that the subject is not reading from a word list.

Another area of active research is the development of name recognition systems. The Names Corpus is a useful resource for addressing this problem.

The utterances in this corpus were taken from many other telephone speech data collections that have been completed at the CSLU. In most data collections, the callers were asked to leave their name at some point. Also, the callers would occasionally leave their name in the midst of another utterance. The names in these situations were extracted out of the host utterance and added to the Names Corpus.

Each file in the Names Corpus has an orthographic transcription following the CSLU Labeling Conventions. Also, to take advantage of the phonemic variability, many of the utterances have been phonetically transcribed. The selection of files to phonetically transcribe was constrained by a process that selected files that were suspected to contain phonetic contexts that had not yet been transcribed.

Release 1.3 of this corpus contains 24,245 files. All of these have been phonetically labeled. Approximately 40% of the bigram phonemic contexts possible, without regard to language constraints, are represented.
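
One plausible reading of the coverage figure is the number of distinct adjacent phoneme pairs observed in the transcriptions, divided by all ordered pairs the inventory allows. The sketch below illustrates that reading with a toy inventory; the function name, the toy data, and the interpretation itself are assumptions, not the procedure documented for the corpus.

```python
def bigram_coverage(transcriptions, inventory):
    """Fraction of all ordered phoneme pairs that occur in the transcriptions.

    Assumes every symbol in the transcriptions belongs to the inventory and
    that every ordered pair counts as possible (no language constraints).
    """
    possible = len(inventory) ** 2
    seen = set()
    for phones in transcriptions:
        # Pair each phoneme with its successor within one utterance.
        seen.update(zip(phones, phones[1:]))
    return len(seen) / possible


# Toy data, not taken from the corpus.
toy_inventory = ["a", "b", "c"]
toy_transcriptions = [["a", "b", "c"], ["a", "b", "a"]]
coverage = bigram_coverage(toy_transcriptions, toy_inventory)
```

In the toy case three distinct pairs are observed out of nine possible, so the coverage is one third.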

*Samples*

For an example of the data in this publication, please review this audio sample and its transcription.
- replaces: Names Corpus Release Version 1.2
- isReferencedBy: (Online Documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC2006S39/
- isReferencedBy: Yeshwant Muthusamy, Ron Cole, and Beatrice Oshika 2006 CSLU: Names Release 1.3 Linguistic Data Consortium, Philadelphia
