Language Resource Search - SHACHI: Language Resource Metadata Database

C-001531: Switchboard Credit Card
This release contains 35 conversations on the topic of "Credit Card Use." The conversations can be used in training and testing wordspotting systems. In addition to two-channel mu-law encoded audio waveform files, the disc contains transcriptions, time-alignments and wordspotting targets.
- references: John J. Godfrey and Ed Holliman 1993 Switchboard Credit Card Linguistic Data Consortium, Philadelphia
C-001533: TAXI - Multilingual telephone dialog database
Telephone
TAXI was produced by BAS in collaboration with the German Research Centre for Artificial Intelligence (DFKI). This speech database contains 94 spontaneous-speech dialogues between a German-speaking cab dispatcher and his clients, who always answered in English. To prevent overlap and to allow automatic segmentation by the recording server, each party pressed a button on the phone to signal to the other that the turn was over. The dialogues were recorded over the telephone network. Each dialogue part is translated into the other language. Noise markers are included in the transcripts (not in the translations). The database was annotated following the SpeechDat specifications and validated to assess its compliance with the SpeechDat format. The files are stored as BAS Partitur Format files.
C-001534: TECTRA Corpus of English-Galician literary texts
- hasPart: C-001354: CLUVI Parallel Corpus
C-001540: The EMILLE Lancaster Corpus
Written Corpora
The EMILLE Lancaster Corpus consists of three components: monolingual, parallel and annotated corpora.
There are monolingual corpora for seven South Asian languages: Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil, Urdu.
The EMILLE monolingual corpora contain approximately 58,880,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu).
The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu.
The annotated component includes the Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.

References: Xiao, Z., McEnery, A., Baker, P. and Hardie, A. 2004. Developing Asian language corpora: standards and practice. In Sornlertlamvanich, V., Tokunaga, T. and Huang, C. (eds.) Proceedings of the Fourth Workshop on Asian Language Resources, pp. 1-8. March 25, Sanya.

This database is available only for commercial use. For research use by academic organisations, a more complete set of the EMILLE Lancaster Corpus is available under the reference ELRA-W0037 The EMILLE/CIIL Corpus.
C-001541: The Lancaster Corpus of Mandarin Chinese (LCMC)
Written Corpora
The Lancaster Corpus of Mandarin Chinese (LCMC) is designed as a Chinese match for the FLOB and FROWN corpora for modern British and American English.
The corpus is suitable for both monolingual research into modern Mandarin Chinese and cross-linguistic contrast of Chinese and British/American English. It samples 15 written text categories, including news, literary texts, academic prose and official documents, published in P. R. China in the early 1990s, for a total of approximately 1 million words. The same sampling frame and period as FLOB/FROWN were used in LCMC.
The corpus is marked up for text categories, sample file numbers, paragraphs, sentences and tokens. Linguistic annotations undertaken on the corpus include tokenization and part-of-speech tagging. The whole corpus is annotated at the word level and includes orthographic and morphological annotations. The tagging system used was the Chinese Lexical Analysis System (ICTCLAS), produced by the Institute of Computing Technology, Chinese Academy of Sciences. The corpus is encoded in Unicode (UTF-8) and marked up in XML.
The corpus comes with a User Manual detailing corpus design specifications and part-of-speech tags. The XML structure of the corpus was validated using the parser built into Xaira. Part-of-speech tagging of all aspect markers was manually checked.

References:
McEnery, A., Xiao, Z. and Mo, L. 2003. Aspect marking in English and Chinese: using the Lancaster Corpus of Mandarin Chinese for contrastive language study. Literary and Linguistic Computing 18/4: 361-378.
Xiao, Z., McEnery, A., Baker, P. and Hardie, A. 2004. Developing Asian language corpora: standards and practice. In Sornlertlamvanich, V., Tokunaga, T. and Huang, C. (eds.) Proceedings of the Fourth Workshop on Asian Language Resources, pp. 1-8. March 25, Sanya.
McEnery, A. and Xiao, Z. 2004. The Lancaster Corpus of Mandarin Chinese: A Corpus for Monolingual and Contrastive Language Study. Paper presented at LREC 2004. May 2004, Lisbon.

For more information on the LCMC: http://www.ling.lancs.ac.uk/corplang/lcmc
C-001543: Translanguage English Database (TED) Speech
*Introduction*

Translanguage English Database (TED) Speech consists of recordings of presentations made by native English and non-native English speakers at the Third European Conference on Speech Communication and Technology, EUROSPEECH 1993 in Berlin, Germany. This is a joint publication with the European Language Resources Association (ELRA) sponsored in part by National Science Foundation Grant No. IIS-9982201. The data set is released by ELRA as TED Translanguage English Database (ELRA-S0031).

*Data*

The audio recordings contain 188 speakers presenting academic papers for approximately 15 minutes each. Transcripts for 39 of the recordings are available in Translanguage English Database (TED) Transcripts LDC2002T03 and in Translanguage English Database (TED) Transcripts database ELRA-S0120.

*Updates*

There are no updates available at this time.
- references: A. Kipp, et al. 2002 Translanguage English Database (TED) Speech Linguistic Data Consortium, Philadelphia
C-001544: Translanguage English Database (TED) Transcripts database
Desktop/Microphone
LDC reference: http://www.ldc.upenn.edu/Catalog/LDC2002T03.html

The Translanguage English Database (TED) Transcripts corpus contains transcriptions of thirty-nine of the 188 speeches of the TED Corpus made at Eurospeech'93 in Berlin. The thirty-nine transcripts in this publication are in Universal Transcription Format (UTF) and were prepared by the LDC. All UTF files in the transcript publication were validated against an included utf.dtd. Tables containing speaker demographic information and a cross-reference of file names from the TED audio corpus are included.
C-001545: Translanguage English Database (TED) Transcripts
*Introduction*

Translanguage English Database (TED) Transcripts consists of transcripts of presentations by 39 native English and non-native English speakers at the Third European Conference on Speech Communication and Technology, EUROSPEECH 1993 in Berlin, Germany. This is a joint publication with the European Language Resources Association (ELRA) sponsored in part by National Science Foundation Grant No. IIS-9982201. The data set is released by ELRA as Translanguage English Database (TED) Transcripts database (ELRA-S0120).

*Data*

The transcripts in this release were developed by the Linguistic Data Consortium and are a subset of the speech recordings in Translanguage English Database (TED) Speech LDC2002S04 and ELRA publication ELRA-S0031.

The transcripts are in Universal Transcription Format (UTF). All UTF files were validated against a utf.dtd. Tables containing speaker demographic information and cross-references of file names from the TED audio corpus are included in this release.

*Updates*

There are no updates at this time.
- references: A. Kipp, et al. 2002 Translanguage English Database (TED) Transcripts Linguistic Data Consortium, Philadelphia
C-001546: Treebank-2
Original release was: LDC Catalog No.: LDC94T4B-3.1 NIST Catalog No.: NA LDC Release date: 4/94 (MY94)

Original Treebank Release

This release contains over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional one million words tagged for part-of-speech. This material is a subset of the language model corpus for the DARPA CSR large-vocabulary speech recognition project.

It also contains the first fully parsed version of the Brown Corpus, which has also been completely retagged using the Penn Treebank (PTB) tag set. Also included are tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3 and ATIS.

In addition, the release includes source code for programs that were used by the PTB project in creating portions of the data. Source code is also included for "tgrep," a program that permits the user to search for specific constituents in tree structures. All software is provided "as is." (We have learned since publication that the tgrep source code provided on the CD-ROM is not readily portable, and compiling tgrep requires modification of the source files. Also included is a pre-compiled program file for tgrep, built for use on Sun SPARC systems.)

Release - 2

The PTB Project Release 2 features the new PTB-2 bracketing style, which is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied, along with a complete style manual explaining the bracketing and new versions of tools for searching and treating bracketed data. This release also contains all the annotated text material from the earlier Treebank Preliminary Release, including the Brown Corpus. While these materials have not all been converted to the newer bracketing style, they have been cleaned up to remove problems that had appeared in the earlier release.
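
As an illustration of how a bracketed tree can be read for simple predicate/argument structure, the following is a minimal Python sketch. The sample sentence is invented for this example and is not taken from the corpus, the function names are hypothetical, and the parser handles only the simplified single-tree layout shown here rather than every convention of the distributed files.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "a constituent must start with '('"
        pos += 1
        label = tokens[pos]          # category label, e.g. NP-SBJ or VBD
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                     # consume the closing ')'
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    words = []
    for child in node[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def find(node, label):
    """Yield every subtree whose label is exactly `label`."""
    if node[0] == label:
        yield node
    for child in node[1]:
        if not isinstance(child, str):
            yield from find(child, label)


# Invented sample in a Treebank-like bracketing (an assumption, not corpus data).
SAMPLE = ("(S (NP-SBJ (DT The) (NN board)) "
          "(VP (VBD approved) (NP (DT the) (NN plan))) (. .))")

tree = parse_tree(SAMPLE)
subjects = [" ".join(leaves(n)) for n in find(tree, "NP-SBJ")]
verbs = [" ".join(leaves(n)) for n in find(tree, "VBD")]
objects = [" ".join(leaves(n)) for n in find(tree, "NP")]
print(subjects, verbs, objects)
```

Searching by exact label, as `find` does here, is a much cruder operation than the constituent queries a tree-searching tool supports; it is meant only to show why labelled brackets make subject, verb and object recoverable.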

The contents of Treebank Release 2 are as follows:

* One million words of 1989 Wall Street Journal material annotated in Treebank-2 style.
* A small sample of ATIS-3 material annotated in Treebank-2 style.
* 300-page style manual for Treebank-2 bracketing, as well as the part-of-speech tagging guidelines.
* The contents of the previous Treebank release (Version 0.5), with cleaner versions of the WSJ, Brown Corpus, and ATIS material (annotated in Treebank-1 style).
* Tools for processing Treebank data, including "tgrep," a tree-searching and manipulation package (note that usability of this release of tgrep is limited: users of Sun sparc systems should have no problem, but others may find the software to be difficult or impossible to port).

In addition, the PTB Project has provided some updates, announcements and a discussion forum for users. A file of updates and further information is available via anonymous FTP from ftp.cis.upenn.edu, in pub/treebank/doc/update.cd2.

The PTB project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 and Treebank-3 both include the raw text for each story. Three "map" files are available in a compressed file (pennTB_tipster_wsj_map.tar.gz) as an additional download for users who have licensed Treebank-2; they provide the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.
- references: Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz 1995 Treebank-2 Linguistic Data Consortium, Philadelphia
C-001547: Treebank-3
*Introduction*

This release contains the following Treebank-2 Material:

* One million words of 1989 Wall Street Journal material annotated in Treebank II style.
* A small sample of ATIS-3 material annotated in Treebank II style.
* A fully tagged version of the Brown Corpus.

and the following new material:

* Switchboard tagged, dysfluency-annotated, and parsed text
* Brown parsed text

The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.

*Data*

The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files are available in a compressed file (pennTB_tipster_wsj_map.tar.gz) as an additional download for users who have licensed Treebank-2; they provide the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.

*Updates*

After publication, it was discovered that not all of the postscript (*.ps) files had been converted to pdfs and that some of the converted pdfs contained errors. For pdf copies of the documentation files, please go to addenda for a list of the files available.

As of October 5, 2016, 252 WSJ files from Treebank-2 that were previously missing were added.

As of February, 2017, 2,499 "raw" wsj files were added from Treebank-2 (LDC95T7).

Corpus downloads after these dates will include these missing files.
- references: Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz and Ann Taylor 1999 Treebank-3 Linguistic Data Consortium
