-
C-004903: GALE Phase 4 Chinese Broadcast Conversation Speech
*Introduction*
GALE Phase 4 Chinese Broadcast Conversation Speech was developed by the Linguistic Data Consortium (LDC) and comprises approximately 172 hours of Mandarin Chinese broadcast conversation speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast Conversation Transcripts (LDC2016T12).
Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.
LDC’s local broadcast collection system is highly automated, easily extensible and robust, and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high-bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory is contained in the Guidelines for Broadcast Audio Collection Version 3.0, included in this release.
LDC designed a portable platform for remote broadcast collection. This is a TiVO-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.
HKUST collected Chinese broadcast programming using its internal recording system and a portable broadcast collection platform designed by LDC and installed at HKUST in 2006.
*Data*
The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Beijing TV, a national television station in Mainland China; China Central TV (CCTV), a national and international broadcaster in Mainland China; Hubei TV, a regional television station in Hubei Province, Mainland China; Phoenix TV, a Hong Kong-based satellite television station; and Voice of America (VOA), a U.S. government-funded broadcast programmer.
This release contains 236 audio files presented in FLAC-compressed Waveform Audio File format (.flac): 16,000 Hz, single-channel, 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.
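The stated parameters (16,000 samples per second, one channel, two bytes per sample) fix the relationship between decoded PCM size and playing time, which is a convenient sanity check when unpacking the FLAC files. A minimal sketch, not part of the release tooling:

```python
def pcm_duration_seconds(pcm_bytes: int,
                         sample_rate: int = 16000,
                         channels: int = 1,
                         bytes_per_sample: int = 2) -> float:
    """Playing time of raw PCM audio with the corpus format parameters."""
    return pcm_bytes / (sample_rate * channels * bytes_per_sample)

# One second of 16 kHz / 16-bit / mono audio decodes to 32,000 bytes,
# so an hour occupies 115,200,000 bytes before FLAC compression.
```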
*Samples*
Please listen to this sample.
*Updates*
None at this time.
*Acknowledgment*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
-
C-004904: GALE Phase 4 Chinese Broadcast Conversation Transcripts
*Introduction*
GALE Phase 4 Chinese Broadcast Conversation Transcripts was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 172 hours of Chinese broadcast conversation speech collected in 2008 by LDC and Hong Kong University of Science and Technology, Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding audio data is released as GALE Phase 4 Chinese Broadcast Conversation Speech (LDC2016S03).
The broadcast conversation recordings feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Beijing TV, a national television station in Mainland China; China Central TV (CCTV), a national and international broadcaster in Mainland China; Hubei TV, a regional television station in Hubei Province, Mainland China; Phoenix TV, a Hong Kong-based satellite television station; and Voice of America (VOA), a U.S. government-funded broadcast programmer.
*Data*
The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 2,259,952 tokens. The transcripts were created with the LDC-developed transcription tool XTrans, a multi-platform, multilingual, multi-channel tool that supports manual transcription and annotation of audio recordings. XTrans is available at https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up; it does not include sentence unit annotation. QRTR annotation adds structural information, such as topic boundaries and manual sentence unit annotation, to the core components of a quick transcript. Files with QTR in the filename were developed using QTR transcription; files with QRTR in the filename indicate QRTR transcription.
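Because the transcripts are plain-text, tab-delimited (TDF) files in UTF-8, they can be read with ordinary string handling. The column names below are illustrative placeholders; the authoritative TDF schema is defined in the documentation accompanying the release:

```python
# Illustrative TDF reader. The column names are assumptions for this
# sketch; consult the release documentation for the actual schema.
COLUMNS = ["file", "channel", "start", "end", "speaker", "transcript"]

def parse_tdf(text: str) -> list[dict]:
    rows = []
    for line in text.splitlines():
        # Skip blank lines and ";;"-style comment/metadata lines.
        if not line.strip() or line.startswith(";;"):
            continue
        rows.append(dict(zip(COLUMNS, line.split("\t"))))
    return rows
```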
*Samples*
Please view this sample.
*Updates*
None at this time.
*Acknowledgement*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
-
C-004905: FoxPersonTracks: a Benchmark for Person Re-Identification from TV Broadcast Shows
FoxPersonTracks is a person track dataset dedicated to person re-identification. The dataset is built from a set of real-life TV shows broadcast on the French channels BFMTV and LCP, provided during the REPERE challenge (see REPERE Evaluation Package, ELRA catalogue: http://catalog.elra.info, ISLRN: 360-758-359-485-0, ELRA ID: ELRA-E0044). It contains a total of 4,604 persontracks (short video sequences featuring an individual with no background) from 266 persons. The dataset was built from the REPERE dataset through several automated processing and manual selection/filtering steps. It is meant to serve as a benchmark for person re-identification from images/videos. The dataset also provides re-identification results using space-time histograms as a baseline, together with an evaluation tool to ease comparison with other re-identification methods.
- isFormatOf: C-004617: REPERE Evaluation Package
-
C-004906: TRAD Pashto Broadcast News Speech Corpus
This corpus contains transcribed broadcast news recordings in Pashto. Recordings are collected from 5 sources: Ashna TV, Azadi Radio, Deewa Radio, Mashaal Radio and Shamshad TV.
The corpus contains 108 hours of recordings covering more than 1,000 speakers. Transcriptions are provided together with the audio files and include about 46,000 segments and 1.1M words.
Pashto is an Indo-Iranian language spoken by the Pashtun people, mainly in Pakistan and Afghanistan.
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA).
- isReferencedBy: C-005001: TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data
- isReferencedBy: D-005044: Pashto phonetic lexicon
-
C-004907: MoveOn Speech and Noise Corpus
Desktop/Microphone
The MoveOn Speech and Noise Corpus is a corpus recorded under the extreme conditions of the motorcycle environment within the MoveOn project. The speech utterances are in British English and address command-and-control and template-driven dialog systems, with a focus on (but not limited to) the police domain. The major part of the corpus comprises noisy speech and environmental noise recorded on a motorcycle. Several clean speech recording sessions with the same recording setup (including the motorcycle helmet) in an office environment complete the corpus. The corpus development focused on distortion-free recordings and accurate descriptions of both recorded speech and noise.
In addition to an orthographical transcription of the speech segments, annotations of the background noise for both speech and pure noise segments are available.
The corpus is a small-sized speech corpus with up to 6 hours of clean and noisy speech utterances per channel and about 30 hours of segments with environmental noise only (without any speech). Recordings were performed simultaneously on three microphone channels: two helmet close-talk microphones and one throat microphone.
-
C-004908: GVLEX tales corpus
GV-LEX (Geste et Voix pour une Lecture Expressive – "Gesture and voice for an expressive reading") is a project funded by the French ANR within the call "Contenu et Interaction" in 2009.
GVLEX tales corpus was built to carry out research and development studies on automatic analysis of (written or spoken) tales for expressive voice and gesture synthesis.
The corpus consists of:
• 89 written tales, manually annotated for structure, speech turns, speakers and phrases, 7 of which were annotated by 2 human annotators (96 annotated texts in total)
• 12 tales read by a professional. These tales were recorded and annotated manually to tag elements that indicate expressivity, together with the signal/transcription alignment that made acoustic analyses possible. The provided data include audio files. Forced alignments between the signal and the manual transcriptions (produced within GV-LEX) are also provided in the TextGrid format
• Annotation and viewing software developed within the GV-LEX project
-
C-004909: GlobalPhone Swahili
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.
The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).
In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and non-verbal effects like laughing and hesitations. Speaker information such as age, gender and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.
Data is compressed with the shorten program written by Tony Robinson, available from Softsound's web page (http://www.softsound.com/) for Linux distributions or simulated environments such as Cygwin. Alternatively, the data can be delivered unshortened.
The GlobalPhone Swahili corpus contains 7,728 utterances spoken by 70 speakers. Native speakers of Swahili were asked to read prompted sentences from newspaper articles. The entire collection took place in Nairobi, Kenya.
Swahili Newspaper source:
http://www.voaswahili.com
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa
- hasVersion: C-004910: GlobalPhone Ukrainian
-
C-004910: GlobalPhone Ukrainian
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.
The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).
In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and non-verbal effects like laughing and hesitations. Speaker information such as age, gender and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.
Data is compressed with the shorten program written by Tony Robinson, available from Softsound's web page (http://www.softsound.com/) for Linux distributions or simulated environments such as Cygwin. Alternatively, the data can be delivered unshortened.
The GlobalPhone Ukrainian corpus contains 12,814 utterances spoken by 119 speakers. Native speakers of Ukrainian were asked to read prompted sentences from newspaper articles. The entire collection took place in Donetsk, Ukraine.
Ukrainian Newspaper sources:
http://umoloda.kiev.ua,
http://day.kiev.ua,
http://ukurier.com.ua,
http://pravda.com.ua,
http://chornomorka.com,
http://tsn.ua,
http://champion.com.ua,
http://ukrslovo.org.ua,
http://epravda.com.ua.
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa
- hasVersion: C-004909: GlobalPhone Swahili
-
C-004911: Large Farsdat
Large Farsdat (L-FARSDAT) is a Persian (Farsi) speech database containing about 73 hours of read speech from formal Farsi texts (newspapers), recorded by 100 speakers through a unidirectional desktop microphone. Each speaker uttered 20-25 pages of text on various subjects, and recording was conducted in a noiseless environment. The average SNR of the desktop microphone recordings is about 28 dB. The sampling rate is 22050 Hz for the whole corpus.
The whole database is segmented and labelled at word and sentence levels with byte count alignment and each word is transcribed according to the 29 standard Persian phonemes.
There are also three labels indicating silence (sil), breathy voice (br) and non-speech sounds (ns).
-
C-004913: IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c
*Introduction*
IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 215 hours of Cantonese conversational and scripted telephone speech collected in 2011 along with corresponding transcripts.
The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.
*Data*
The Cantonese speech in this release represents that spoken in the Chinese provinces of Guangdong and Guangxi, and within those provinces, among five dialect groups. The gender distribution among speakers is approximately even; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.
All audio data is presented as 8 kHz, 8-bit A-law encoded audio in NIST SPHERE format. Transcripts are available in two versions: simplified Chinese characters and a romanization scheme based on the Yale system, both encoded in UTF-8. Further information about transcription methodology is contained in the documentation accompanying this release.
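G.711 A-law is a fixed 8-bit companding of linear PCM, so the audio can be expanded without a codec library once the SPHERE headers have been skipped. A minimal sketch of the standard A-law expansion (header parsing is omitted; output values are on the conventional 16-bit scale):

```python
def alaw_to_linear(alaw_byte: int) -> int:
    """Expand one G.711 A-law byte to a signed linear PCM value (16-bit scale)."""
    a = alaw_byte ^ 0x55              # A-law is transmitted with even bits inverted
    magnitude = (a & 0x0F) << 4       # 4-bit mantissa
    segment = (a & 0x70) >> 4         # 3-bit exponent (segment number)
    if segment:
        magnitude = (magnitude + 0x108) << (segment - 1)
    else:
        magnitude += 8
    # In A-law, a set sign bit denotes a positive sample.
    return magnitude if a & 0x80 else -magnitude

# 0xD5 is the A-law "silence" byte; it decodes to the smallest positive value.
```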
Evaluation data is available from NIST in support of OpenKWS.
*Samples*
Please view the following samples:
* audio sample
* transcription
* romanized transcription
*Updates*
None at this time.
- hasVersion: C-004923: IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a
- hasVersion: C-004924: IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b
- hasVersion: C-004930: IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY
- hasVersion: C-004932: IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5
- hasVersion: C-004934: IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a
- hasVersion: C-004938: IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g
- hasVersion: C-004943: IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7
- hasVersion: C-004950: IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b
- hasVersion: C-004977: IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d
- hasVersion: C-005035: IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a