Language Resource Search - SHACHI: Language Resource Metadata Database


C-001049: Levantine Arabic Conversational Telephone Speech, Transcripts
This database contains 982 Levantine Arabic speakers taking part in spontaneous telephone conversations in Colloquial Levantine Arabic. A total of 985 conversation sides are provided (three speakers each appear in two distinct conversations). The average duration per side is between 5 and 6 minutes.
- isFormatOf: C-001050: Levantine Arabic Conversational Telephone Speech
C-001050: Levantine Arabic Conversational Telephone Speech
*Introduction*

This database contains 982 Levantine Arabic speakers taking part in spontaneous telephone conversations in Colloquial Levantine Arabic. A total of 985 conversation sides are provided (three speakers each appear in two distinct conversations). The average duration per side is between 5 and 6 minutes.

This corpus was collected and transcribed in 2004 by Appen Pty Ltd, Sydney, Australia.

*Samples*

For an example of the data in this corpus, please listen to this audio sample (wav format).
- hasFormat: C-001050: Levantine Arabic Conversational Telephone Speech
C-001052: Levantine Arabic QT Training Data Set 5, Speech
*Introduction*

Levantine Arabic QT Training Data Set 5, Speech contains 1,660 calls totalling approximately 250 hours of telephone conversation in Levantine Arabic. These calls were collected between 2003 and 2005. Corresponding transcriptions may be found in LDC2006T07.

*Data*

This corpus is the combination of four former training data sets: LDC2004E21 and LDC2004E22; LDC2004E65 and LDC2004E66; LDC2005S07 and LDC2005T03; and LDC2005S14 (Speech and Transcripts). More than half of the speakers are Lebanese; the others are Jordanian, Palestinian, and Syrian. The list below shows the distribution of the speakers' national origin:

* 559 Jordanian
* 1,853 Lebanese
* 355 Palestinian
* 67 Syrian
* 484 Levantine speakers whose national origin could not be determined.

*Samples*

For an example of this corpus, please listen to this audio sample in sphere format.
- isFormatOf: C-001052: Levantine Arabic QT Training Data Set 5, Speech
- isFormatOf: C-001053: Levantine Arabic QT Training Data Set 5, Transcripts
- hasPart: C-001328: Arabic CTS Levantine Fisher Training Data Set 3, Speech
- hasPart: C-000610: Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
- hasPart: Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
C-001053: Levantine Arabic QT Training Data Set 5, Transcripts
This release contains transcripts of 1,660 calls, covering approximately 250 hours of telephone conversation in Levantine Arabic. These calls were collected between 2003 and 2005.
- replaces: C-001053: Levantine Arabic QT Training Data Set 5, Transcripts
- hasVersion: C-001328: Arabic CTS Levantine Fisher Training Data Set 3, Speech
- hasVersion: C-000610: Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
- hasVersion: Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
C-001054: MACROPHONE
MACROPHONE consists of approximately 200,000 utterances by 5,000 speakers. It is designed to provide material sufficient and suitable for research, development and evaluation of automatic speech recognition technology for common telephone applications, such as shopping, transportation, database access and autodialing. In addition to application-oriented phrases and numerous digit strings, seven sentences are spoken by each talker to provide ensemble phoneme, diphone and triphone coverage of the language. The spoken material also refers to times, locations, monetary amounts, spellings and interactive operations. The utterances were collected automatically over the telephone network by recording directly from a T1 connection in 8 kHz, 8-bit mu-law format. The participants, roughly equal numbers of males and females, were solicited by a marketing firm from all regions of the United States. They ranged in age from the teens to the seventies and represented a broad range of educations and incomes as well. Each recorded utterance is accompanied by an orthographic transcription which also notes any unusual acoustic events or anomalies. Macrophone is the American English contribution to an international database of telephone speech corpora called POLYPHONE. Similar data sets are expected for major languages of the world and at least some of these will be made available through LDC. Prospects are currently good for American Spanish (by early 1995), Dutch, Standard French, Standard German, Japanese, Mandarin Chinese, Swiss French and Danish versions of POLYPHONE, all with basically the same structure and methods of collection.
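
The recordings are described above as 8 kHz, 8-bit mu-law. As a rough illustration of what that encoding implies for anyone reading the samples, the sketch below expands mu-law bytes to signed linear samples using the standard G.711-style expansion. The function names are hypothetical, and the sketch assumes raw, headerless sample bytes; it says nothing about how the distributed files are actually packaged.

```python
def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code word to a signed linear sample."""
    u = ~byte & 0xFF            # mu-law code words are stored bit-inverted
    sign = u & 0x80             # top bit set means a negative sample
    exponent = (u >> 4) & 0x07  # 3-bit segment number
    mantissa = u & 0x0F         # 4-bit step within the segment
    # Rebuild the magnitude: add the bias (0x84), scale by segment, remove bias.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode(payload: bytes) -> list:
    """Decode a run of raw mu-law bytes into a list of linear samples."""
    return [ulaw_to_linear(b) for b in payload]
```

At 8 kHz, one second of audio corresponds to 8,000 such bytes, so `decode` on a one-second buffer would return 8,000 samples.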

MACROPHONE was collected at SRI under LDC sponsorship. A paper describing it was presented at ICASSP-94: "Macrophone: An American English Telephone Speech Corpus for the POLYPHONE Project," by Jared Bernstein, Kelsey Taussig and Jack Godfrey.
- isRequiredBy: POLYPHONE
C-001057: Message Understanding Conference (MUC) 6 Additional News Text
*Introduction*

Message Understanding Conference (MUC) 6 Additional News Text was produced by the Linguistic Data Consortium (LDC), catalog number LDC96T10, ISBN 1-58563-105-1.

In the 1990s, the MUC evaluations funded the development of metrics and statistical algorithms to support government evaluations of emerging information extraction technologies. Additional information from NIST can be found at http://www.itl.nist.gov/iaui/894.02/related_projects/muc.

*Data*

This corpus contains additional training data, which had been tagged, but not annotated. Both the MUC 6 and the MUC 6 Additional News Text are necessary in order to replicate the evaluation. All the materials are published as received from the corpus creators, without any quality control being done at the LDC (the only difference is that the files have been uncompressed).

*Updates*

August 20th, 2003: This corpus was formerly published under the name "MUC VI Text Collection." The more suitable name of "Message Understanding Conference (MUC) 6 Additional News Text" was adopted when MUC 6 (LDC2003T13), the main corpus containing the evaluation materials, was published in 2003.

*Restricted Rights*

RESTRICTED RIGHTS LEGEND: INFORMATION FROM THE WALL STREET JOURNAL AND/OR THE DOW JONES NEWS SERVICE CONTAINED HEREIN IS THE PROPERTY OF DOW JONES & COMPANY, INC. AND IS PROTECTED BY COPYRIGHT. USE, DUPLICATION OR DISCLOSURE BY YOU IS SUBJECT TO THE RESTRICTIONS SET FORTH IN THE USER AGREEMENT DELIVERED TO YOU BY THE LINGUISTIC DATA CONSORTIUM OF THE UNIVERSITY OF PENNSYLVANIA. COPYRIGHT 1993-1994 DOW JONES & COMPANY, INC. ALL RIGHTS RESERVED.
- hasVersion: C-001058: Message Understanding Conference (MUC) 6
- isReplacedBy: MUC VI Text Collection
- isReferencedBy: http://www.itl.nist.gov/iaui/894.02/related_projects/muc.
C-001058: Message Understanding Conference (MUC) 6
*Introduction*

Message Understanding Conference (MUC) 6 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003T13, ISBN 1-58563-239-2.

In the 1990s, the MUC evaluations funded the development of metrics and statistical algorithms to support government evaluations of emerging information extraction technologies. Additional information from NIST can be found at http://www.itl.nist.gov/iaui/894.02/related_projects/muc.

*Data*

This corpus contains the 318 annotated Wall Street Journal articles, the scoring software and the corresponding documentation used in the MUC6 evaluation. Both the MUC 6 Additional News Text and the MUC 6 corpus are necessary in order to replicate the evaluation. All the materials are published as received from the corpus creators, without any quality control being done at the LDC (the only difference is that the files have been uncompressed).

*Updates*

August 20th, 2003: What was formerly published as MUC VI Text Collection (LDC96T10) was renamed MUC 6 Additional News Text, because LDC96T10 consists only of additional training materials.

*Restricted Rights*

RESTRICTED RIGHTS LEGEND: INFORMATION FROM THE WALL STREET JOURNAL AND/OR THE DOW JONES NEWS SERVICE CONTAINED HEREIN IS THE PROPERTY OF DOW JONES & COMPANY, INC. AND IS PROTECTED BY COPYRIGHT. USE, DUPLICATION OR DISCLOSURE BY YOU IS SUBJECT TO THE RESTRICTIONS SET FORTH IN THE USER AGREEMENT DELIVERED TO YOU BY THE LINGUISTIC DATA CONSORTIUM OF THE UNIVERSITY OF PENNSYLVANIA. COPYRIGHT 1986-1994 DOW JONES & COMPANY, INC. ALL RIGHTS RESERVED.
- hasVersion: MUC 6 Additional News Text
- isReferencedBy: http://www.itl.nist.gov/iaui/894.02/related_projects/muc.
C-001059: Message Understanding Conference (MUC) 7
*Introduction*

Message Understanding Conference (MUC) 7 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2001T02, ISBN 1-58563-205-8.

In the 1990s, the MUC evaluations funded the development of metrics and statistical algorithms to support government evaluations of emerging information extraction technologies. Additional information from NIST can be found at http://www.itl.nist.gov/iaui/894.02/related_projects/muc/.

*Data*

The following list shows the correspondence between versions of the IE task definition and stages of the MUC-7 evaluation.

Version # and stage:

* 4.1: training and dryrun
* 4.2: formalrun
* 5.1: final

The dryrun and formalrun have different domains: the dryrun (and training) consists of air crash scenarios, and the formalrun consists of missile launch scenarios. The final version mainly updates the Template Relations portion of the guidelines.

Normally, for each scenario, two datasets are provided: training and test. When the evaluation cycle begins, the label for the scenario dataset is training. Then the corresponding test dataset for that same scenario is used for the dryrun testing. For the formal run, a formal training set is given out four weeks before the test answers are due. The formal test is given out one week before the test answers are due. After the entire evaluation and meeting have been held, final edits are made if necessary.

*Updates*

August 22, 2001: This publication was inadvertently released without the guidelines documentation and the scoring software. These documents and programs have now been added to the publication. If you previously purchased this corpus and would like to download a complete copy, please contact ldc@ldc.upenn.edu.
- isReferencedBy: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/
- isReferencedBy: ldc@ldc.upenn.edu.
C-001060: Middle East Technical University Turkish Microphone Speech v 1.0
*Introduction*

Middle East Technical University Turkish Microphone Speech v 1.0 was developed at Middle East Technical University (METU) as part of a collaborative work between METU's Department of Electrical and Electronics Engineering and the Center for Spoken Language Research (CSLR) at the University of Colorado at Boulder. The collaboration was supported by TUBITAK, the Scientific and Technical Research Council of Turkey, through a combined doctoral scholarship program. The corpus was used to port CSLR's speech recognition system, SONIC, to Turkish.

The corpus contains text, speech and alignment files and is approximately 600 MB in size. 120 speakers (60 male and 60 female) speak 40 sentences each (approximately 300 words per speaker), which makes approximately 500 minutes of speech in total. The 40 sentences are selected randomly for each speaker from a triphone-balanced set of 2,462 Turkish sentences. The speakers are selected from students, faculty and staff at METU, and all are native speakers of Turkish. The age range is from 19 to 50 years, with an average of 23.9 years.

The data has been digitally recorded with a Sound Blaster sound card on a PC at a 16 kHz sampling rate.

*Samples*

Please listen to this audio sample and examine its companion transcript for an example of the data contained in this publication.
C-001061: Morphologically Annotated Korean Text
*Introduction*

Morphologically Annotated Korean Text was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T03, ISBN 1-58563-284-8.

This is a collection of Korean text with annotated morphological analysis and part-of-speech tags. The source text was extracted from the Korean Newswire corpus. The newswire corpus is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. The portion included in this release consists of a small number of hand-picked articles.

The corpus is part of the Korean Treebank Phase 2. Between 2001 and 2002, the project was conducted under subcontract from Cogentex Inc., sponsor number Cogentex 5-33436. The text was tokenized and then automatically analyzed using Klex. Since there can be multiple possible morphological analyses, the output was fed through a statistical ranking system in order to select the best possible analysis for the word in the text environment. The part-of-speech tagged result was then manually corrected by Seung-yun Yang and Na-Rae Han, graduate students in the University of Pennsylvania Linguistics Department.

*Data*

The data consists of one single file, totalling approximately 880KB in uncompressed form. The text contains 1,574 sentences with 41,024 words and 77,173 morphemes in total. The text file is in ksc-5601 encoding. Characters in Hangul (Korean alphabet) can be displayed with Korean X-terminals such as hanterm, or by selecting Korean encoding in common web browsers such as Netscape or Internet Explorer.

The data is formatted as follows: there is one head word per line, and the word and its morphologically analyzed output are separated by a tab. Each morpheme is followed by "/" and its part-of-speech; morphemes are separated by "+". ^EOS is a special symbol denoting the end of a sentence.
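
The format described above can be read with a few lines of code. The following is a minimal sketch in Python, not a tool shipped with the corpus: the function names are hypothetical, it assumes ^EOS stands alone on its own line, and it splits naively on "+" and "/", so it would mis-handle any morpheme that itself contains one of those characters.

```python
from typing import Iterable, List, Tuple

Morpheme = Tuple[str, str]         # (morpheme, part-of-speech tag)
Word = Tuple[str, List[Morpheme]]  # (head word, its analysis)


def parse_analysis(analysis: str) -> List[Morpheme]:
    """Split 'm1/TAG1+m2/TAG2' into [(m1, TAG1), (m2, TAG2)]."""
    morphemes = []
    for chunk in analysis.split("+"):
        # Split on the last "/" so the tag is whatever follows it.
        form, _, tag = chunk.rpartition("/")
        morphemes.append((form, tag))
    return morphemes


def parse_sentences(lines: Iterable[str]) -> List[List[Word]]:
    """Group tab-separated word lines into sentences ended by ^EOS."""
    sentences: List[List[Word]] = []
    current: List[Word] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.strip() == "^EOS":
            sentences.append(current)
            current = []
            continue
        word, _, analysis = line.partition("\t")
        current.append((word, parse_analysis(analysis)))
    if current:  # keep a trailing sentence that lacks a final ^EOS
        sentences.append(current)
    return sentences


# Illustrative input only: romanized placeholder words and tags,
# not lines taken from the corpus.
sample = [
    "hakkyoey\thakkyo/NNC+ey/PAD\n",
    "kata\tka/VV+ta/EFN\n",
    "^EOS\n",
]
parsed = parse_sentences(sample)
```

To read the actual file, opening it with `open(path, encoding="euc_kr")` is a reasonable starting point, since Python's `euc_kr` codec covers KS C 5601; `path` here stands for wherever the file has been unpacked.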

Morphologically analyzed and part-of-speech tagged data can be useful for training statistical morphological analyzers and part-of-speech taggers, and for evaluating pre-existing morphological analyzers and part-of-speech taggers.

The morphologically tagged output is compatible with Klex: Finite-State Lexical Transducer for Korean. It also conforms to the Korean Treebank POS annotation standards.

*Updates*

There are no updates available at this time.

*Sponsorship*

The Morphologically Annotated Korean Text corpus was funded in part through a 5-year grant (BCS-998009, KDI, SBE) from the National Science Foundation via TalkBank, an interdisciplinary project to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data. Additional funding was provided by Linguistic Data Consortium.

*Note*

The cost of the first 50 copies of this publication (not counting the copies distributed to LDC members) is covered by NSF Grant Number BCS-998009, so these copies are free of charge. After the first 50 copies are distributed, additional copies will be available for $300.
- hasVersion: C-001036: Klex: Finite-State Lexical Transducer for Korean
- hasVersion: Korean Treebank POS annotation standards. ftp://ftp.cis.upenn.edu/pub/ircs/tr/01-09/
