  • C-004863: The Subglottal Resonances Database
    *Introduction*

    The Subglottal Resonances Database was developed by Washington University and University of California Los Angeles and consists of 45 hours of simultaneous microphone and subglottal accelerometer recordings of 25 adult male and 25 adult female speakers of American English between 22 and 25 years of age.

    The subglottal system is composed of the airways of the tracheobronchial tree and the surrounding tissues. It powers airflow through the larynx and vocal tract, allowing for the generation of most of the sound sources used in languages around the world. The subglottal resonances (SGRs) are the natural frequencies of the subglottal system. During speech, the subglottal system is acoustically coupled to the vocal tract via the larynx. SGRs can be measured from recordings of the vibration of the skin of the neck during phonation by an accelerometer, much like speech formants are measured through microphone recordings.

    SGRs have received attention in studies of speech production, perception and technology. They affect voice production, divide vowels and consonants into discrete categories, affect vowel perception and can be useful in automatic speech recognition.

    *Data*

    Speakers were recruited by Washington University's Psychology Department. The majority of the participants were Washington University students who represented a wide range of American English dialects, although most were speakers of the mid-American English dialect.

    The corpus consists of 35 monosyllables in a phonetically neutral carrier phrase (“I said a ____ again”), with 10 repetitions of each word by each speaker, resulting in 17,500 individual microphone (and accelerometer) waveforms. The monosyllables comprised 14 hVd words and 21 CVb words, where C was b, d, or g and V included all American English monophthongs and diphthongs.

    The target vowel in each utterance was hand-labeled to indicate the start, stop, and steady-state parts of the vowel. For diphthongs, the steady-state refers to the diphthong nucleus which occurs early in the vowel.

    The height and age of each speaker is included in the corpus metadata.

    Audio files are presented as single-channel, 16-bit FLAC-compressed WAV files with sample rates of 48 kHz or 16 kHz. Image files are bitmap (BMP) images, and plain text is UTF-8.
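Once the FLAC layer is decompressed, those parameters are easy to verify with Python's standard `wave` module. This self-contained sketch writes and inspects a small synthetic file rather than actual corpus data:

```python
import io
import struct
import wave

# Build a tiny synthetic file in the stated target format
# (16 kHz, single channel, 16-bit PCM), then read the header back.
# None of this touches real corpus files.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 2 bytes per sample = 16-bit
    w.setframerate(16000)
    w.writeframes(struct.pack("<8h", *range(8)))  # 8 near-silent frames

buf.seek(0)
with wave.open(buf, "rb") as w:
    params = (w.getframerate(), w.getnchannels(), w.getsampwidth())

assert params == (16000, 1, 2)
```

The same three header fields would confirm whether a given corpus file is a 48 kHz or 16 kHz recording.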

    *Samples*

    Please view the following samples:

    * Image Sample
    * Audio Sample
    * Text Sample

    *Acknowledgment*

    This work was supported in part by National Science Foundation Grant No. 0905250.

    *Updates*

    None at this time.
  • C-004868: GALE Phase 3 Chinese Broadcast Conversation Speech Part 2
    *Introduction*

    GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 112 hours of Mandarin Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and the Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

    Corresponding transcripts are released as GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 (LDC2015T09). Part 1 of this release is GALE Phase 3 Chinese Broadcast Conversation Speech Part 1 (LDC2014S09). The corresponding part one transcripts are released as GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1 (LDC2014T28).

    Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

    LDC’s local broadcast collection system is highly automated, easily extensible and robust and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory are contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.

    LDC designed a portable platform for remote broadcast collection. This is a TiVo-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.

    HKUST collected Chinese broadcast programming using its internal recording system and a portable broadcast collection platform designed by LDC and installed at HKUST in 2006.

    *Data*

    The broadcast conversation recordings in this release feature interviews, call-in programs, and roundtable discussions focusing principally on current events from the following sources: Beijing TV, a national television station in Mainland China; China Central TV, a national and international broadcaster in Mainland China; Hubei TV, a regional television station in Mainland China, Hubei Province; Phoenix TV, a Hong Kong-based satellite television station; and Voice of America, a U.S. government-funded broadcast programmer.

    This release contains 209 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.

    *Samples*

    Please listen to this audio sample.

    *Updates*

    None at this time.

    *Acknowledgment*

    This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
  • C-004869: CIEMPIESS
    *Introduction*

    CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 18 hours of Mexican Spanish radio speech, associated transcripts, pronouncing dictionaries and language models. The goal of this work was to create acoustic models for automatic speech recognition.

    For more information and documentation see the CIEMPIESS-UNAM Project website.

    *Data*

    The speech recordings are from 43 one-hour FM radio programs broadcast by Radio IUS, a UNAM radio station. They are comprised of spontaneous conversations between a radio moderator and guests, principally about legal issues. Approximately 78% of the speakers were males, and 22% of the speakers were females.

    The audio was recorded in MP3 stereo format, using a 44.1 kHz sample rate and a bit rate of 128 kbps or higher. Only "clean" utterances were selected from the raw data, meaning that each utterance was made by only one person, with no background noise, whispers, music, foreign accents, white noise or static. The audio files were converted to 16 kHz, 16-bit PCM WAV format for this release.
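The channel-downmix half of that conversion can be sketched in a few lines of pure Python; `downmix_to_mono` is an illustrative helper, and a real pipeline would also use a proper resampler (e.g. SoX or ffmpeg) for the 44.1 kHz to 16 kHz step:

```python
import struct

def downmix_to_mono(stereo_bytes):
    """Average interleaved 16-bit L/R samples into one mono channel.
    A toy stand-in for the stereo-to-mono part of the corpus's
    MP3 -> 16 kHz mono WAV conversion; not the actual tool used."""
    n = len(stereo_bytes) // 2
    samples = struct.unpack("<%dh" % n, stereo_bytes)
    mono = [(l + r) // 2 for l, r in zip(samples[::2], samples[1::2])]
    return struct.pack("<%dh" % len(mono), *mono)

# Two stereo frames: (100, 300) and (-200, -400) -> mono 200, -300.
frames = struct.pack("<4h", 100, 300, -200, -400)
assert downmix_to_mono(frames) == struct.pack("<2h", 200, -300)
```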

    The recordings were transcribed using Praat, a tool designed for phonetics research. The transcripts are in Mexbet, a phonetic alphabet designed for Mexican Spanish based on Worldbet (Hieronymus, 1994). Plain-text transcripts, TextGrid-format time labels and files useful for performing experiments with the SPHINX3 recognition software are also included.

    *Samples*

    Please view this audio sample and transcript sample.

    *Updates*

    None at this time.
  • C-004878: LDC Spoken Language Sampler - Third Release
    *Introduction*

    LDC (Linguistic Data Consortium) Spoken Language Sampler - Third Release contains samples from 20 different corpora published by LDC between 1996 and 2015.

    LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily-available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community that include: maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.

    Resources available from LDC include speech, text, video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.

    The sampler is available as a free download.

    *Data*

    The LDC Spoken Language Sampler - Third Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:

    * Most excerpts are truncated to be much shorter than the original files, typically between 1.5 and 2 minutes.
    * Signal amplitude has been adjusted where necessary to normalize playback volume.
    * Some corpora are published in compressed form, but all samples here are uncompressed.
    * Some text files are presented as images to ensure foreign character sets display properly.
    * In some publications, NIST SPHERE file format is used for audio data, but the audio files in this sampler are MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities. FLAC files have been expanded into their wav form as well.
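The amplitude adjustment above can be pictured as simple peak normalization. The sketch below is one plausible reading, not LDC's documented procedure; `normalize_peak` and its parameters are illustrative:

```python
def normalize_peak(samples, target=0.9, full_scale=32767):
    """Scale 16-bit PCM samples so the loudest sample sits at `target`
    of full scale. Illustrative only: the sampler's actual volume
    adjustment is not specified in this description."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)      # all-silent input: nothing to scale
    gain = target * full_scale / peak
    return [round(s * gain) for s in samples]

# A quiet 3-sample signal scaled so its peak reaches ~90% of full scale.
out = normalize_peak([1000, -2000, 500])
assert max(abs(s) for s in out) == round(0.9 * 32767)
```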

    The link for the catalog number takes you to the catalog entry.

    LDC2014S06
    2009 NIST Language Recognition Evaluation Test Set
    The 2009 evaluation contains approximately 215 hours of conversational telephone speech and radio broadcast conversation collected by LDC in the following 23 languages and dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American), English (Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu and Vietnamese.

    LDC2014S01
    CALLFRIEND Farsi Second Edition Speech
    CALLFRIEND Farsi Second Edition Speech was developed by LDC and consists of approximately 42 hours of telephone conversation (100 recordings) among native Farsi speakers. The CALLFRIEND project supported the development of language identification technology. Each CALLFRIEND corpus consists of unscripted telephone conversations lasting between 5 and 30 minutes.

    LDC96S37
    CALLHOME Japanese
    A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.

    LDC2013S09
    CSC Deceptive Speech
    CSC Deceptive Speech was developed by Columbia University, SRI International and University of Colorado Boulder. It consists of 32 hours of audio interviews from 32 native speakers of Standard American English (16 male, 16 female) recruited from the Columbia University student population and the community. The purpose of the study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on extracted features from the corpus.

    LDC2007S18
    CSLU Kids' Speech
    Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1100 children from Kindergarten through Grade 10.

    LDC2010S01
    Fisher Spanish Speech
    Fisher Spanish Speech consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers.

    LDC2014S02
    King Saud University Arabic Speech Database
    King Saud University Arabic Speech Database contains 590 hours of recorded Arabic speech from 269 male and female Saudi and non-Saudi speakers. The utterances include read and spontaneous speech recorded in quiet and noisy environments. The recordings were collected via different microphones and a mobile phone and averaged between 16 and 19 minutes in length.

    LDC2003S07
    Korean Telephone Conversations Complete
    The Korean telephone conversations were originally recorded as part of the CALLFRIEND project. Korean Telephone Conversations Speech consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unexposed calls. Korean Telephone Conversations Transcripts (LDC2003T08) consists of 100 text files, totaling approximately 190K words and 25K unique words. All files are in Korean orthography: Korean characters are written in Hangul, encoded in the KSC-5601 (Wansung) system. The complete data set also includes a lexicon (LDC2003L02).

    LDC2012S04
    Malto Speech and Transcripts
    Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Malto is principally spoken in northeastern India and Bangladesh.

    LDC2015S05
    Mandarin Chinese Phonetic Segmentation and Tone
    Mandarin Chinese Phonetic Segmentation and Tone was developed by LDC and contains 7,849 Mandarin Chinese "utterances" and their phonetic segmentation and tone labels separated into training and test sets. The utterances were derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73 and LDC98T24, respectively). That collection consists of approximately 30 hours of Chinese broadcast news recordings from Voice of America, China Central TV and KAZN-AM, a commercial radio station based in Los Angeles, CA. This corpus was developed to investigate the use of phone boundary models on forced alignment in Mandarin Chinese.

    LDC2015S04
    Mandarin-English Code-Switching in South-East Asia
    Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia and includes approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts.

    LDC2013S03
    Mixer 6 Speech
    Mixer 6 Speech was developed by LDC and is comprised of 15,863 hours of telephone speech, interviews and transcript readings from 594 distinct native English speakers. This material was collected by LDC in 2009 and 2010 as part of the Mixer project, specifically phase 6, the focus of which was on native American English speakers local to the Philadelphia area.

    LDC2014S03
    Multi-Channel WSJ Audio
    Multi-Channel WSJ Audio was developed by the Centre for Speech Technology Research at The University of Edinburgh and contains approximately 100 hours of recorded speech from 45 British English speakers. Participants read Wall Street Journal texts published in 1987-1989 in three recording scenarios: a single stationary speaker, two stationary overlapping speakers and one single moving speaker.

    LDC2004S09
    NIST Meeting Pilot Corpus Speech
    This data set contains speech and transcriptions from topical discussions in meeting settings, including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place.

    LDC2015S02
    RATS Speech Activity Detection
    RATS Speech Activity Detection was developed by LDC and is comprised of approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Speech Activity Detection (SAD) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

    LDC2015S03
    The Subglottal Resonances Database
    The Subglottal Resonances Database was developed by Washington University and University of California Los Angeles and consists of 45 hours of simultaneous microphone and subglottal accelerometer recordings of 25 adult male and 25 adult female speakers of American English between 22 and 25 years of age.

    LDC2012S02
    TORGO Database of Dysarthric Articulation
    TORGO contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.

    LDC2012S06
    Turkish Broadcast News Speech and Transcripts
    Turkish Broadcast News Speech and Transcripts contains approximately 130 hours of Voice of America Turkish radio broadcasts and corresponding transcripts.

    LDC2014S08
    United Nations Proceedings Speech
    United Nations Proceedings Speech was developed by the United Nations (UN) and contains approximately 8,500 hours of recorded proceedings in the six official UN languages: Arabic, Chinese, English, French, Russian and Spanish. The data was recorded in 2009-2012 from sessions 64-66 of the General Assembly and First Committee (Disarmament and International Security), and meetings 6434-6763 of the Security Council.

    LDC2014S04
    USC-SFI MALACH Interviews and Transcripts Czech
    USC-SFI MALACH Interviews and Transcripts Czech was developed by The University of Southern California Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420 interviewees along with transcripts and other documentation.
  • C-004879: Arabic Learner Corpus
    *Introduction*

    Arabic Learner Corpus was developed at the University of Leeds and consists of written essays and spoken recordings by Arabic learners collected in Saudi Arabia in 2012 and 2013. The corpus includes 282,732 words in 1,585 materials, produced by 942 students of 67 nationalities studying at pre-university and university levels. The average length of an essay is 178 words.

    *Data*

    Two tasks were used to collect the written data, and participants could choose to do one or both. In each task, learners were asked to write a narrative about a vacation trip and a discussion of the participant's study interests. Those choosing the first task produced a 40-minute timed essay without the use of any language reference materials. In the second task, participants completed the writing as a take-home assignment over two days and were permitted to use language reference materials.

    The audio recordings were developed by allowing students a limited amount of time to talk about the topics above without using language reference materials.

    The original handwritten essays were transcribed into electronic text format. The corpus data consists of three types: (1) handwritten sheets scanned in PDF format; (2) audio recordings in MP3 format; and (3) Unicode text data in plain text and XML formats (including the transcribed audio and transcripts of the handwritten essays). The audio files are either 44.1 kHz two-channel or 16 kHz single-channel MP3 files.

    *Samples*

    Please view the following samples:

    * Audio sample
    * Arabic Header Text
    * English Header XML

    *Updates*

    None at this time.
  • C-004880: GALE Phase 3 Arabic Broadcast Conversation Speech Part 1
    *Introduction*

    GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 123 hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

    Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 (LDC2015T16).

    Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University of Science and Technology, Hong Kong (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

    LDC’s local broadcast collection system is highly automated, easily extensible and robust and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory are contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.

    LDC designed a portable platform for remote broadcast collection. This is a TiVo-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.

    Medianet collected Arabic programming from across the Gulf region using its internal system and LDC's portable broadcast collection platform installed in 2008. The portable platform deployed at the Medianet Tunisian collection facility collected multiple streams of regional Arabic programming from various sources. MTC collected Arabic programming using its internal collection system.

    *Data*

    The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Al Arabiya, a news television station based in Dubai; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Dubai TV, a broadcast station in the United Arab Emirates; Lebanese Broadcasting Corporation, a Lebanese television station; Oman TV, a national broadcaster located in the Sultanate of Oman; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.

    This release contains 149 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.

    *Samples*

    Please listen to this audio sample.

    *Updates*

    None at this time.

    *Acknowledgment*

    This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
  • C-004887: Articulation Index LSCP
    *Introduction*

    Articulation Index LSCP was developed by researchers at Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP), Ecole Normale Supérieure. It revises and enhances a subset of Articulation Index (AIC) (LDC2005S22), a corpus of persons speaking English syllables. Changes include the addition of forced alignment to sound files, time alignment of syllable utterances and format conversions.

    AIC consists of 20 American English speakers (12 males, 8 females) pronouncing syllables, some of which form actual words, but most of which are nonsense syllables. All possible Consonant-Vowel (CV) and Vowel-Consonant (VC) combinations were recorded for each speaker twice, once in isolation and once within a carrier-sentence, for a total of 25768 recorded syllables.

    *Data*

    Articulation Index LSCP alters AIC in the following ways.

    * Time-alignments for the onset and offset of each word and syllable were generated through forced-alignment with a standard HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) ASR system.
    * The time-alignments for the beginning and end of the syllables (whether in isolation or within a carrier sentence) were manually adjusted. The time-alignments for the other words in carrier sentences were not manually adjusted.
    * The recordings of isolated syllables were cut according to the manual time-alignments to remove the silent portions at the beginning and end, and the time-alignments were altered to correspond to the cut recordings.
    * The file naming scheme was slightly altered for compatibility with the Kaldi speech recognition toolkit.
    * AIC contains a wide-band (16 kHz, 16-bit PCM) and a narrow-band (8 kHz, 8-bit µ-law) version of the recordings, distributed in NIST SPHERE format. The LSCP version contains only the wide-band version, distributed as WAV files.
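The second and third changes above amount to slicing each waveform at the manually adjusted boundaries and re-expressing the labels relative to the cut. A rough sketch, with hypothetical names and times in seconds:

```python
def cut_to_alignment(samples, onset, offset, labels, rate=16000):
    """Trim leading/trailing silence from an isolated-syllable recording
    and shift its time labels to match the cut, mirroring the LSCP
    processing described above. Names and label format are illustrative,
    not the corpus's actual file conventions."""
    lo, hi = int(onset * rate), int(offset * rate)
    cut = samples[lo:hi]
    # Re-express each (name, start, end) label relative to the new origin.
    shifted = [(name, start - onset, end - onset)
               for name, start, end in labels]
    return cut, shifted

samples = list(range(16000))             # one second of fake audio
cut, labs = cut_to_alignment(samples, 0.25, 0.75, [("syl", 0.25, 0.75)])
assert len(cut) == 8000                  # half a second survives the cut
assert labs == [("syl", 0.0, 0.5)]       # label now starts at the origin
```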

    This release does not include certain AIC triphone recordings (CVC, CCV or VCC).

    Audio data is presented as 16 kHz, 16-bit FLAC-compressed .wav files. The FLAC compression was added for distribution, so documentation may refer to the files as .wav files.

    *Samples*

    Please listen to this audio sample.

    *Updates*

    None at this time.
  • C-004891: GALE Phase 3 Chinese Broadcast News Speech
    *Introduction*

    GALE Phase 3 Chinese Broadcast News Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 150 hours of Mandarin Chinese broadcast news speech collected in 2007 and 2008 by LDC and the Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

    Corresponding transcripts are released as GALE Phase 3 Chinese Broadcast News Transcripts (LDC2015T25).

    Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

    LDC’s local broadcast collection system is highly automated, easily extensible and robust and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory are contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.

    LDC designed a portable platform for remote broadcast collection. This is a TiVo-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.

    HKUST collected Chinese broadcast programming using its internal recording system and a portable broadcast collection platform designed by LDC and installed at HKUST in 2006.

    *Data*

    The broadcast news recordings in this release feature news broadcasts focusing principally on current events from the following sources: Anhui TV, a regional television station in Mainland China, Anhui Province; China Central TV (CCTV), a national and international broadcaster in Mainland China; Phoenix TV, a Hong Kong-based satellite television station; and Voice of America (VOA), a U.S. government-funded broadcast programmer.

    This release contains 279 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.

    *Samples*

    Please listen to this audio sample.

    *Updates*

    None at this time.

    *Acknowledgment*

    This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
  • C-004897: GALE Phase 3 Arabic Broadcast Conversation Speech Part 2
    *Introduction*

    GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 was developed by the Linguistic Data Consortium (LDC) and comprises approximately 129 hours of Arabic broadcast conversation speech collected in 2007 and 2008 by LDC, MediaNet (Tunis, Tunisia), and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

    Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 (LDC2016T06).

    Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University of Science and Technology, Hong Kong (Chinese); MediaNet, Tunis, Tunisia (Arabic); and MTC, Rabat, Morocco (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.

    LDC’s local broadcast collection system is highly automated, easily extensible, and robust, capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high-bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory is contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.

    LDC designed a portable platform for remote broadcast collection. This is a TiVo-style digital video recorder (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.

    MediaNet collected Arabic programming from across the Gulf region using its internal system and LDC's portable broadcast collection platform, installed at its Tunis facility in 2008; the portable platform captured multiple streams of regional Arabic programming from various sources. MTC collected Arabic programming using its internal collection system.

    *Data*

    The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Al Arabiya, a news television station based in Dubai; Al Baghdadya, an Iraqi broadcast programmer based in Egypt; Al Fayha, an Iraqi television channel; Al Hiwar, a regional broadcast station based in the United Kingdom; Alhurra, a U.S. government-funded regional broadcaster; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Bahrain TV, a television station in the Kingdom of Bahrain; Dubai TV, a broadcast station in the United Arab Emirates; Kuwait TV, a national broadcast station in Kuwait; Oman TV, a national broadcaster located in the Sultanate of Oman; Qatar TV, a broadcast programmer in Qatar; Saudi TV, a national television station based in Saudi Arabia; Syria TV, the national television station in Syria; and Tunisian National TV, a national television station in Tunisia.

    This release contains 142 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.
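    As a rough illustration (not part of the release documentation), the stated format — 16,000 Hz, single-channel, 16-bit PCM — makes it straightforward to estimate the uncompressed size of the audio before FLAC compression:

    ```python
    # Back-of-envelope sizing for audio in the stated format:
    # 16,000 Hz sample rate, 1 channel, 16-bit (2-byte) samples.
    SAMPLE_RATE_HZ = 16_000
    CHANNELS = 1
    BYTES_PER_SAMPLE = 2  # 16-bit PCM

    def pcm_bytes(duration_seconds: float) -> int:
        """Uncompressed PCM size in bytes for a clip of the given duration."""
        return int(duration_seconds * SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE)

    one_hour = pcm_bytes(3600)            # 115,200,000 bytes, about 110 MiB
    full_release = pcm_bytes(129 * 3600)  # the full ~129 hours, before compression
    ```

    FLAC is lossless, so the delivered files decode back to exactly this PCM; the on-disk size of the release is smaller, with the actual compression ratio varying by recording.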

    *Samples*

    Please listen to this audio sample.

    *Updates*

    None at this time.

    *Acknowledgment*

    This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
  • C-004901: 千葉大学3人会話コーパス (Chiba University Three-Party Conversation Corpus)
    This corpus, recorded at Chiba University, contains casual conversations among 12 groups of three same-sex friends (six male and six female groups, aged 18 to 33 at the time of recording). Few constraints were placed on the content or flow of the conversations in order to preserve a high degree of spontaneity, while an individual headset microphone for each speaker ensured high-quality recordings.

    This release covers one conversation per group, approximately two hours in total, and includes the following data (video data is not included):

    Audio data
    Transcripts
    Morphological information
    ELAN files integrating the transcripts and morphological information
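    For readers unfamiliar with ELAN: an .eaf file is plain XML, so its tiers and time-aligned annotations can be inspected with standard tools. The sketch below parses a minimal, hypothetical EAF fragment (not taken from this corpus; real files carry more attributes) using only Python's standard library:

    ```python
    import xml.etree.ElementTree as ET

    # A minimal, hypothetical ELAN (.eaf) fragment -- NOT from the corpus itself.
    # The core layout of an EAF file: a TIME_ORDER of millisecond time slots,
    # and TIER elements holding time-aligned annotations.
    EAF = """<ANNOTATION_DOCUMENT>
      <TIME_ORDER>
        <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
        <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
      </TIME_ORDER>
      <TIER TIER_ID="speakerA">
        <ANNOTATION>
          <ALIGNABLE_ANNOTATION TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
            <ANNOTATION_VALUE>hello</ANNOTATION_VALUE>
          </ALIGNABLE_ANNOTATION>
        </ANNOTATION>
      </TIER>
    </ANNOTATION_DOCUMENT>"""

    root = ET.fromstring(EAF)

    # Map time-slot IDs to their millisecond values.
    times = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")}

    # Collect (tier, start_ms, end_ms, text) for every aligned annotation.
    annotations = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            annotations.append((
                tier.get("TIER_ID"),
                times[ann.get("TIME_SLOT_REF1")],
                times[ann.get("TIME_SLOT_REF2")],
                ann.findtext("ANNOTATION_VALUE"),
            ))

    # annotations == [("speakerA", 0, 1500, "hello")]
    ```

    The same traversal applies to the corpus's real ELAN files, though their tier names and annotation structure will differ from this toy fragment.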