-
C-004384: Malto Speech and Transcripts
*Introduction*
Malto Speech and Transcripts was developed by Masato Kobayashi, Associate Professor in Linguistics at the University of Tokyo (Japan), and Bablu Tirkey, research scholar at the Tribal and Regional Languages Department, Ranchi University (India). It contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Speakers were asked to talk about themselves, their lives, rituals and folklore; elicitation interviews were then conducted. The goal of the work was to document the current state and dialectal variation of Malto.
Malto is a Dravidian language spoken in northeastern India (principally the states of Bihar, Jharkhand and West Bengal) and Bangladesh by people called the Pahariyas. Indian census data places the number of Malto speakers at between 100,000 and 200,000. Most Malto speakers live in the three northeastern districts of Jharkhand, i.e., Sahebganj, Godda and Pakur; the fieldwork that resulted in this corpus was conducted in those districts. Of the Pahariyas in that area, three subtribes, the Sawriya Pahariyas, the Mal Pahariyas and the Kumarbhag Pahariyas, primarily speak Malto. (Kobayashi 3)
Pahariya villages or hamlets are located on hilly tracts and in the lowlands, and are often separated by non-Pahariya villages. As a result, Malto varies from village to village, and it may be more accurate to consider Malto a continuum of dialects rather than a unitary language. The three major dialects -- Sawriya Pahariya, Mal Pahariya and Kumarbhag Pahariya -- correspond to the principal sub-tribal communities. (Kobayashi 14)
For further reading on Malto, consult Texts and Grammar of Malto (2012) by Masato Kobayashi, published by Kotoba Books, Vizianagaram, India, and sold by the book distributor Mary Martin Booksellers, 123 Third Street, Tatabad, Coimbatore 641012, India. The distributor can be contacted at info@marymartin.com or at books.kotoba@gmail.com.
*Data*
The transcribed data accounts for 6 hours of the collection and contains 21 speakers (17 male, 4 female). The untranscribed data accounts for 2 hours of the collection and contains 10 speakers (9 male, 1 female). Four of the male speakers are present in both groups.
All audio is presented in .wav format. Each audio file name includes a subject number, village name, speaker name and the topic discussed. The transcripts and glossary are UTF-8 text files. Because of ambiguities that arise when writing Malto in Devanagari script, the transcripts were developed using Roman script with symbols adapted from the International Phonetic Alphabet (IPA); they are not, however, phonetic transcripts.
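Since each audio file name encodes a subject number, village name, speaker name and topic, the fields can be recovered by splitting the name. A minimal sketch, assuming an underscore-separated naming scheme (the separator and the example name below are illustrative assumptions, not taken from the corpus; consult readme.txt for the actual convention):

```python
def parse_malto_filename(name):
    """Split an assumed file name of the form
    '<subject>_<village>_<speaker>_<topic>.wav' into its fields.
    The underscore separator is a guess; check readme.txt."""
    stem = name.rsplit(".", 1)[0]          # drop the .wav extension
    subject, village, speaker, topic = stem.split("_", 3)
    return {"subject": subject, "village": village,
            "speaker": speaker, "topic": topic}

# hypothetical example name
info = parse_malto_filename("07_Amjora_Ramesh_folklore.wav")
```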
Consult readme.txt and untran_speaker.txt for further information about the corpus, its collection and the speakers. The transcription and glosses are split into three text files; consult the readme to determine which audio files are covered by each transcript.
*Sample*
For a sample from this corpus, please listen to this audio file.
*Updates*
Some minor updates were made to the index file. An updated version is available in the online documentation folder, as well as an updated file table.
*Works Cited*
Kobayashi, Masato. Texts and Grammar of Malto. Vizianagaram: Kotoba Books, 2012. Print.
-
C-004385: Turkish Broadcast News Speech and Transcripts
*Introduction*
Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi University, Istanbul, Turkey and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts. This is part of a larger corpus of Turkish broadcast news data collected and transcribed with the goal of facilitating research in Turkish automatic speech recognition and its applications, such as speech retrieval.
The VOA material was collected between December 2006 and June 2009 using a PC and TV/radio card setup. The data collected during the period 2006-2008 was recorded from analog FM radio; the 2009 broadcasts were recorded from digital satellite transmissions. A quick manual segmentation and transcription approach was followed.
Speech recognition and retrieval experiments using the larger corpus are described in the following journal article: Ebru Arisoy, Dogan Can, Siddika Parlak, Hasim Sak, and Murat Saraclar, "Turkish Broadcast News Transcription and Retrieval," IEEE Transactions on Audio, Speech and Language Processing, 17(5):874-883, July 2009.
For more information please visit http://busim.ee.boun.edu.tr/~speech or contact the principal investigator, Murat Saraçlar.
*Data*
The data was recorded at 32 kHz and resampled to 16 kHz. After screening for recording quality, the files were segmented, transcribed and verified. The segmentation occurred in two steps: an initial automatic segmentation followed by manual correction and annotation, which included information such as background conditions and speaker boundaries.
The transcription guidelines were adapted from the LDC HUB4 and quick transcription guidelines. An English version of the adapted guidelines is provided with the data here. The manual segmentations and transcripts were created by native Turkish speakers at Boğaziçi University using Transcriber. The transcriptions are provided in the ISO-8859-9 (Latin5) character set.
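Because the transcripts ship in ISO-8859-9 rather than UTF-8, tools that expect Unicode need to decode them first. A minimal sketch in Python, whose standard library registers this encoding under the codec name `iso8859_9` (the sample string is illustrative, not from the corpus):

```python
def latin5_to_text(data: bytes) -> str:
    """Decode ISO-8859-9 (Latin5) transcript bytes to a Python
    string; re-encoding with .encode('utf-8') then yields UTF-8."""
    return data.decode("iso8859_9")

# Turkish characters such as ğ and ç round-trip correctly
sample = "Boğaziçi Üniversitesi".encode("iso8859_9")
text = latin5_to_text(sample)
```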
*Samples*
Please follow the links below for samples:
* Audio
* Transcript
*Sponsorship*
Funding for this corpus collection effort came from TÜBİTAK Project 105E102 and Boğaziçi University Research Fund Project 05HA202.
*Updates*
None at this time.
-
C-004386: USC-SFI MALACH Interviews and Transcripts English
*Introduction*
USC-SFI MALACH Interviews and Transcripts English, LDC Catalog Number LDC2012S05 and ISBN 1-58563-602-9, was developed by The University of Southern California Shoah Foundation Institute (USC-SFI), the University of Maryland, IBM and Johns Hopkins University as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 375 hours of interviews from 784 interviewees along with transcripts and other documentation.
Inspired by his experience making Schindler's List, Steven Spielberg established the Survivors of the Shoah Visual History Foundation in 1994 to gather video testimonies from survivors and other witnesses of the Holocaust. While most of those who gave testimony were Jewish survivors, the Foundation also interviewed homosexual survivors, Jehovah's Witness survivors, liberators and liberation witnesses, political prisoners, rescuers and aid providers, Roma and Sinti (Gypsy) survivors, survivors of eugenics policies, and war crimes trials participants. Within several years, the Foundation's Visual History Archive held nearly 52,000 video testimonies in 32 languages representing 56 countries, making it the largest archive of its kind in the world. In 2006, the Foundation became part of the Dana and David Dornsife College of Letters, Arts and Sciences at the University of Southern California in Los Angeles and was renamed the USC Shoah Foundation Institute for Visual History and Education.
The goal of the MALACH project was to develop methods for improved access to large multinational spoken archives. The focus was advancing the state of the art of automatic speech recognition (ASR) and information retrieval. The characteristics of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching and emotional speech -- were considered well-suited for that task. The work centered on five languages: English, Czech, Russian, Polish and Slovak. USC-SFI MALACH Interviews and Transcripts English was developed for the English speech recognition experiments.
*Data*
The speech data in this release was collected beginning in 1994 under a wide variety of conditions ranging from quiet to noisy (e.g., airplane overflights, wind noise, background conversations and highway noise). Original interviews were recorded on Sony Beta SP tapes, then digitized into a 3 MB/s MPEG-1 stream with 128 kb/s (44 kHz) stereo audio. The sound files in this release are compressed in MP3 format at a sampling frequency of 44.1 kHz.
Approximately 25,000 of the interviews collected by USC-SFI are in English, averaging approximately 2.5 hours each. The 784 interviews included in this release are each a 30-minute section of the corresponding larger interview. Due to the way the original interviews were arranged on the tapes, some interviews are clipped and have a duration of less than 30 minutes. Certain interviews include speech from family members in addition to that of the subject and the interviewer; accordingly, the corpus contains speech from more than 784 speakers, roughly equally distributed between males and females. The interviews also include a wide range of accented speech (e.g., Hungarian, Italian, Yiddish, German and Polish accents).
This release includes transcripts in .trs format of the first 15 minutes of each interview. The transcripts were created using Transcriber 1.5.1 and later modified.
*Samples*
For a sample of the audio in this release, use this link.
*Updates*
None at this time.
-
C-004387: Tagged and Cleaned Wikipedia
TC Wikipedia is a collection of cleaned-up and tagged Wikipedia articles. Most of the Wikipedia pages that are not articles (e.g., meta-pages, navigational texts and user discussions) have been eliminated from the corpus. Tagged parts include the headword, headlines and infobox of each article, named entities, POS, lemma, categories and article variations. Note that the files are provided as is: they are not tagged with 100% accuracy and are not 100% cleaned. (http://nlp.cs.nyu.edu/wikipedia-data/)
-
C-004388: CALLHOME Mandarin Chinese Transcripts - XML version
*Introduction*
CALLHOME Mandarin Chinese Transcripts - XML Version, Linguistic Data Consortium (LDC) catalog number LDC2008T17 and ISBN 1-58563-485-7, was developed at Lancaster University, United Kingdom.
LDC's CALLHOME Mandarin Chinese collection includes telephone speech, associated transcripts and a lexicon. CALLHOME Mandarin Chinese Speech consists of 120 unscripted telephone conversations between native speakers of Mandarin Chinese. All calls, which lasted up to thirty minutes, originated in North America and were placed to locations overseas; most participants called family members or close friends. CALLHOME Mandarin Chinese Transcripts covers a contiguous five- or ten-minute segment from each of the telephone speech files. The transcripts are in tab-delimited format with GB2312 encoding, are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography. CALLHOME Mandarin Chinese Lexicon comprises over 40,000 words from twenty CALLHOME Mandarin transcripts.
CALLHOME Mandarin Chinese Transcripts - XML Version, the latest addition to this collection, presents the entire original corpus of 120 transcripts in XML format with UTF-8 encoding, retokenization and part-of-speech (POS) tagging. The retokenization and POS information were supplied using the Chinese Lexical Analysis System (ICTCLAS) developed by the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. ICTCLAS aims to incorporate Chinese word segmentation, POS tagging, disambiguation and unknown words recognition into a single theoretical framework using multi-layered hierarchical hidden Markov models.
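Working with the XML version amounts to walking the UTF-8 markup and collecting token/POS pairs. A minimal sketch using Python's standard library; the element and attribute names below (`turn`, `w`, `pos`) are invented for illustration only, as the release's own documentation defines the actual schema:

```python
import xml.etree.ElementTree as ET

# hypothetical markup; the corpus's real element names may differ
xml_text = """<turn speaker="A" start="12.30" end="15.87">
  <w pos="v">说</w>
  <w pos="u">了</w>
</turn>"""

turn = ET.fromstring(xml_text)
# collect (token, POS-tag) pairs from every word element
tokens = [(w.text, w.get("pos")) for w in turn.iter("w")]
```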
In addition to the original applications for Mandarin Chinese CALLHOME data (e.g., speech recognition), CALLHOME Mandarin Chinese Transcripts - XML Version will be useful in the grammatical study of spoken Mandarin.
*Data*
This XML corpus retains all of the linguistic analyses (e.g., timestamps, spoken features and proper nouns) from the original transcripts release, but the mnemonics used in the original release were migrated into XML markup following the mapping rules described below:
All analyses in the original release were retained at the sacrifice of tokenization and part-of-speech tagging accuracy (e.g., some mnemonics encoding spoken features may split a word, which can affect the tagging accuracy). However, the results of the automated processing were substantially post-edited. For example, four aspect markers in Chinese (-le, -guo, -zhe and zai) were disambiguated and corrected by hand; all of the classifiers (also called "measure words") were re-tagged using a more fine-grained annotation scheme developed on the Lancaster University project. In addition, a large number of obvious typographical errors in the original release were corrected in the process of post-editing.
Number of unique words: 6,895
Total number of words: 300,767
*Samples*
- isFormatOf: C-000661: CALLHOME Mandarin Chinese Transcripts
- isFormatOf: C-000660: CALLHOME Mandarin Chinese Speech
-
C-004389: Annotated English Gigaword
*Introduction*
Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds automatically-generated syntactic and discourse structure annotation to English Gigaword Fifth Edition (LDC2011T07) and also contains an API and tools for reading the dataset's XML files. The goal of the annotation is to provide a standardized corpus for knowledge extraction and distributional semantics, enabling broader involvement in large-scale knowledge-acquisition efforts by researchers.
*Data*
Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition from seven news sources:
* Agence France-Presse, English Service (afp_eng)
* Associated Press Worldstream, English Service (apw_eng)
* Central News Agency of Taiwan, English Service (cna_eng)
* Los Angeles Times/Washington Post Newswire Service (ltw_eng)
* Washington Post/Bloomberg Newswire Service (wpb_eng)
* New York Times Newswire Service (nyt_eng)
* Xinhua News Agency, English Service (xin_eng)
The following layers of annotation were added:
* Tokenized and segmented sentences
* Treebank-style constituent parse trees
* Syntactic dependency trees
* Named entities
* In-document coreference chains
The annotation was performed in a three-step process: (1) the data was preprocessed and sentences were selected for annotation (sentences with more than 100 tokens were excluded); (2) syntactic parses were derived; and (3) the parsed output was post-processed to derive syntactic dependencies, named entities and coreference chains. Over 183 million sentences were parsed.
The data is stored in a form similar to the Gigaword SGML format, with XML annotations containing the additional markup. The included API provides object representations for the contents of the XML files.
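A consumer of the annotated files would iterate over documents and read the annotation layers out of the XML. A minimal sketch with Python's standard library; the element and attribute names below (`DOC`, `token`, `pos`, `ner`) are illustrative assumptions, and real use should follow the release's actual markup and the API it provides:

```python
import xml.etree.ElementTree as ET

# hypothetical fragment mimicking an annotated document
doc = ET.fromstring("""<DOC id="nyt_eng_0001">
  <sentences>
    <sentence id="1">
      <token pos="NNP" ner="PERSON">Smith</token>
      <token pos="VBD" ner="O">spoke</token>
    </sentence>
  </sentences>
</DOC>""")

# keep only tokens carrying a named-entity label
entities = [t.text for t in doc.iter("token") if t.get("ner") != "O"]
```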
*Samples*
Please follow the link for a sample.
*Additional Licensing Information*
Any 2011 member organization that licensed English Gigaword Fifth Edition (LDC2011T07) may request a no-cost copy of Annotated English Gigaword. Any non-member organization that licensed English Gigaword Fifth Edition may request a copy of Annotated English Gigaword for a $250 media fee. Please contact ldc@ldc.upenn.edu for licensing or with any additional questions.
*Updates*
None at this time.
- references: English Gigaword Fifth Edition
-
C-004390: CD-毎日新聞2012データ集
A collection of full-text article data (tagged text data) from the Mainichi Shimbun for 2012, covering the final morning and evening editions published by the newspaper's Tokyo and Osaka head offices.
- isPartOf: C-004391: CD-毎日新聞2012データ集プラス
- hasVersion: C-001598: CD-Maichichi Shimbun '91 data collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-000838: DCS-毎日新聞1991~2006データファイル
- hasVersion: C-004211: CD-毎日新聞2008データ集
- hasVersion: C-004212: CD-毎日新聞2009データ集
- hasVersion: C-004213: CD-毎日新聞2010データ集
- hasVersion: C-004214: CD-毎日新聞2011データ集
- hasVersion: C-003585: CD-毎日新聞2007データ集
- hasVersion: C-003590: CD-毎日新聞2005データ集プラス
- hasVersion: C-003589: CD-毎日新聞2006データ集プラス
- hasVersion: C-003588: CD-毎日新聞2007データ集プラス
- hasVersion: C-004215: CD-毎日新聞2008データ集プラス
- hasVersion: C-004216: CD-毎日新聞2009データ集プラス
- hasVersion: C-004217: CD-毎日新聞2010データ集プラス
- hasVersion: C-004218: CD-毎日新聞2011データ集プラス
-
C-004391: CD-毎日新聞2012データ集プラス
A collection of full-text article data (tagged text data) from the Mainichi Shimbun for 2012 that combines the final morning and evening editions of the Tokyo and Osaka head offices with the regional editions, covering articles from Hokkaido to Kagoshima.
- hasPart: C-004390: CD-毎日新聞2012データ集
- hasVersion: C-001598: CD-Maichichi Shimbun '91 data collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-000838: DCS-毎日新聞1991~2006データファイル
- hasVersion: C-003585: CD-毎日新聞2007データ集
- hasVersion: C-004211: CD-毎日新聞2008データ集
- hasVersion: C-004212: CD-毎日新聞2009データ集
- hasVersion: C-004213: CD-毎日新聞2010データ集
- hasVersion: C-004214: CD-毎日新聞2011データ集
- hasVersion: C-003590: CD-毎日新聞2005データ集プラス
- hasVersion: C-003589: CD-毎日新聞2006データ集プラス
- hasVersion: C-003588: CD-毎日新聞2007データ集プラス
- hasVersion: C-004215: CD-毎日新聞2008データ集プラス
- hasVersion: C-004216: CD-毎日新聞2009データ集プラス
- hasVersion: C-004217: CD-毎日新聞2010データ集プラス
- hasVersion: C-004218: CD-毎日新聞2011データ集プラス
-
C-004393: Speech database of Aragusuku Dialect
The corpus contains words, short sentences and songs of the Shimoji dialect, one of the dialects spoken on the Aragusuku Islands within the Yaeyama Islands. The words and short sentences were chosen from the viewpoint of phonemic and grammatical studies. The songs, called ‘Jiraba’, are sung during seasonal festivals to wish for a rich harvest. The Aragusuku Islands are composed of Shimoji and Kamiji Islands. At present, Shimoji Island is uninhabited and the island's dialect is dying. The Shimoji dialect is valuable in that it preserves relatively ancient forms indicating its relation to the Miyako dialects.
-
C-004395: Speech database of Oogami Dialect
The corpus documents the actual state of the Oogami-jima dialect, spoken in the north of the Miyako main island. The data items are arranged with consideration for phonemic parallels between the Southern Ryukyu and Miyako dialects. Individual idiosyncrasies and generational differences are also taken into account. The Karimata dialect, spoken on the coast facing Oogami-jima, is also included as reference material because it shares phonemic characteristics with the Oogami-jima dialect. Both are valuable dialects that are expected to disappear in the near future.