-
C-004785: Multi-Channel WSJ Audio
*Introduction*
Multi-Channel WSJ Audio (MCWSJ) was developed by the Centre for Speech Technology Research at The University of Edinburgh and contains approximately 100 hours of recorded speech from 45 British English speakers. Participants read Wall Street Journal texts published in 1987-1989 in three recording scenarios: a single stationary speaker, two stationary overlapping speakers and a single moving speaker.
This corpus was designed to address the challenges of speech recognition in meetings, which often occur in rooms with non-ideal acoustic conditions and significant background noise, and may contain large sections of overlapping speech. Using headset microphones represents one approach, but meeting participants may be reluctant to wear them. Microphone arrays are another option. MCWSJ supports research in large vocabulary tasks using microphone arrays. The news sentences read by speakers are taken from WSJCAM0 Cambridge Read News, a corpus originally developed for large vocabulary continuous speech recognition experiments, which in turn was based on CSR-1 (WSJ0) Complete, made available by LDC to support large vocabulary continuous speech recognition initiatives.
*Data*
Speakers reading news text from prompts were recorded using a headset microphone, a lapel microphone and an eight-channel microphone array. In the single-speaker scenario, participants read from six fixed positions. In the overlapping scenario, each speaker was assigned a fixed position for the entire recording. In the moving scenario, participants moved from one position to the next while reading.
Fifteen speakers were recorded for the single scenario, nine pairs for the overlapping scenario and nine individuals for the moving scenario. Each read approximately 90 sentences.
The audio data are presented as single-channel 16 kHz FLAC-compressed WAV files.
*Samples*
Please listen to the samples below.
* Overlapping Sample
* Stationary Sample
* Moving Sample
*Updates*
None at this time.
- references: C-001588: WSJCAM0 Cambridge Read News
-
C-004795: GALE Phase 2 Arabic Broadcast Conversation Speech Part 1
*Introduction*
GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 123 hours of Arabic broadcast conversation speech collected in 2006 and 2007 by LDC as part of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding transcripts are released as GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1 (LDC2013T04).
Broadcast audio for the DARPA GALE program was collected at LDC's Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University of Science and Technology, Hong Kong, China (Chinese); Medianet, Tunis, Tunisia (Arabic); and MTC, Rabat, Morocco (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.
The LDC local broadcast collection system is highly automated, easily extensible and robust and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory are contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.
LDC designed a portable platform for remote broadcast collection. This is a TiVO-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.
Medianet collected Arabic programming from across the Gulf region using its internal system and LDC's portable broadcast collection platform installed in 2008. The portable platform deployed at the Medianet Tunisian collection facility collected multiple streams of regional Arabic programming from various sources. MTC collected Arabic programming using its internal collection system.
*Data*
The broadcast conversation recordings in this release feature interviews, call-in programs and round table discussions focusing principally on current events from the following sources: Al Alam News Channel, based in Iran; Al Arabiya, a news television station based in Dubai; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Lebanese Broadcasting Corporation, a Lebanese television station; Nile TV, a broadcast programmer based in Egypt; Oman TV, a national broadcaster located in the Sultanate of Oman; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria. A table showing the number of programs and hours recorded from each source is contained in the readme file.
This release contains 143 audio files presented in Waveform Audio File format (.wav), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings, as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded, and as a guide for data selection by retaining information about program genre, data type and topic.
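As an illustrative sketch, the stated audio parameters can be checked with Python's standard-library `wave` module. The function below is hypothetical, not part of the corpus tooling, and the path passed to it is a placeholder:

```python
import wave

def check_gale_wav(path):
    """Verify a file matches the stated format: 16000 Hz, 1 channel, 16-bit PCM.

    Returns the duration in seconds. Hypothetical helper for illustration only.
    """
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000, "expected 16 kHz sample rate"
        assert w.getnchannels() == 1, "expected single-channel audio"
        assert w.getsampwidth() == 2, "expected 16-bit (2-byte) samples"
        return w.getnframes() / w.getframerate()
```

The same check applies unchanged to the other GALE speech releases described below, which share the 16000 Hz single-channel 16-bit PCM format.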
*Samples*
Please listen to this sample.
*Updates*
None at this time.
*Acknowledgement*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. -
C-004800: Mixer 6 Speech
*Introduction*
Mixer 6 Speech was developed by the Linguistic Data Consortium (LDC) and comprises 15,863 hours of audio recordings of interviews, transcript readings and conversational telephone speech involving 594 distinct native English speakers. This material was collected by LDC in 2009 and 2010 as part of the Mixer project, specifically phase 6, the focus of which was on native American English speakers local to the Philadelphia area.
The speech data in this release was collected by LDC at its Human Subjects Collection facilities in Philadelphia. The telephone collection protocol was similar to other LDC telephone studies (e.g., Switchboard-2 Phase III Audio - LDC2002S06): recruited speakers were connected through a robot operator to carry on casual conversations lasting up to 10 minutes, usually about a daily topic announced by the robot operator at the start of the call. The raw digital audio content for each call side was captured as a separate channel, and each full conversation was presented as a 2-channel interleaved audio file, with 8000 samples/second and u-law sample encoding. Each speaker was asked to complete 15 calls.
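As a rough sketch, the parameters above fix the size of a full-length call's raw audio payload; the figures below are simple arithmetic from those parameters, not numbers taken from the corpus documentation:

```python
# Raw audio payload of a maximum-length Mixer 6 telephone call.
SAMPLE_RATE = 8000      # samples per second, per channel
BYTES_PER_SAMPLE = 1    # u-law encodes each sample in one byte
CHANNELS = 2            # one channel per call side, interleaved
CALL_SECONDS = 10 * 60  # calls lasted up to 10 minutes

payload_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * CALL_SECONDS
print(payload_bytes)  # 9600000 bytes, roughly 9.2 MiB before any file header
```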
The multi-microphone portion of the collection utilized 14 distinct microphones installed identically in two multi-channel audio recording rooms at LDC. Each session was guided by collection staff using prompting and recording software to conduct the following activities: (1) repeat questions (less than one minute), (2) informal conversation (typically 15 minutes), (3) transcript reading (approximately 15 minutes) and (4) telephone call (generally 10 minutes). Speakers recorded up to three 45-minute sessions on distinct days. The 14 channels were recorded synchronously into separate single-channel files, using 16-bit PCM sample encoding at 16000 samples/second.
Certain demographic information about the speakers was collected, including date of birth, level of education, native language, other language capability, place of birth, place of residence and occupation.
The recordings in this corpus were used in NIST Speaker Recognition Evaluation (SRE) test sets for 2010 and 2012. Researchers interested in applying those benchmark test sets should consult the respective NIST Evaluation Plans for guidelines on allowable training data for those tests.
*Data*
The collection contains 4,410 recordings made via the public telephone network and 1,425 sessions of multiple microphone recordings in office-room settings. The telephone recordings are presented as 8-kHz 2-channel NIST SPHERE files, and the microphone recordings are 16-kHz 1-channel flac/ms-wav files. All audio file names indicate the date and time when the recording began, along with other identifying information, as follows:
Telephone: {yyyymmdd}_{hrmnsc}_{callid}.sph
Microphone: {yyyymmdd}_{hrmnsc}_{room}_{subjid}_CH{nn}.flac
* yyyymmdd is the year, month and day of recording
* hrmnsc is the hour, minute and second when recording began
* callid is a unique, incremental number assigned to each call
* room is either LDC or HRM, indicating which office was used
* subjid is a numeric identifier assigned to the speaker
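A hypothetical helper for splitting these names into their fields; the patterns follow the naming scheme described above, and the example file names are invented for illustration:

```python
import re

# Telephone: {yyyymmdd}_{hrmnsc}_{callid}.sph
TEL = re.compile(r"(?P<date>\d{8})_(?P<time>\d{6})_(?P<callid>\d+)\.sph$")
# Microphone: {yyyymmdd}_{hrmnsc}_{room}_{subjid}_CH{nn}.flac
MIC = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_(?P<room>LDC|HRM)"
    r"_(?P<subjid>\d+)_CH(?P<channel>\d{2})\.flac$"
)

def parse_name(name):
    """Return the fields encoded in a Mixer 6 audio file name."""
    for pattern in (TEL, MIC):
        m = pattern.match(name)
        if m:
            return m.groupdict()
    raise ValueError(f"unrecognized file name: {name}")

print(parse_name("20091231_235959_HRM_1234_CH03.flac")["room"])  # HRM
print(parse_name("20090401_101500_5678.sph")["callid"])          # 5678
```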
When the flac files are uncompressed, they become ms-wav/RIFF files (flac compression does not presently support SPHERE file format).
The telephone audio is presented in SPHERE format because (a) this is consistent with other telephone audio releases from LDC, and (b) flac does not support ulaw sample encoding. The current release of the open-source SoX utility is able to handle both formats as input. Other utilities are available for both flac and SPHERE formats.
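SPHERE files begin with a plain ASCII header: the line "NIST_1A", a line giving the total header size in bytes, and then "name -type value" records terminated by "end_head". The sketch below is an illustrative parser of that header, not LDC's or NIST's tooling, and the sample header bytes are invented:

```python
def read_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file.

    Integer fields are declared with "-i"; other fields are kept as strings.
    Minimal sketch for illustration; real files pad the header to the stated size.
    """
    lines = data.decode("ascii", errors="replace").splitlines()
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a SPHERE file")
    fields = {}
    for line in lines[2:]:  # skip the magic line and the header-size line
        line = line.strip()
        if line == "end_head":
            break
        name, ftype, value = line.split(None, 2)
        fields[name] = int(value) if ftype == "-i" else value
    return fields

header = (b"NIST_1A\n   1024\n"
          b"channel_count -i 2\n"
          b"sample_rate -i 8000\n"
          b"sample_coding -s4 ulaw\n"
          b"end_head\n")
print(read_sphere_header(header)["sample_rate"])  # 8000
```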
*Samples*
Please listen to this audio sample.
*Updates*
None at this time. -
C-004802: GALE Phase 2 Chinese Broadcast Conversation Speech
*Introduction*
GALE Phase 2 Chinese Broadcast Conversation Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 120 hours of Chinese broadcast conversation speech collected in 2006 and 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding transcripts are released as GALE Phase 2 Chinese Broadcast Conversation Transcripts (LDC2013T08).
Broadcast audio for the GALE program was collected at the Philadelphia, PA USA facilities of LDC and at three remote collection sites: HKUST (Chinese); Medianet, Tunis, Tunisia (Arabic); and MTC, Rabat, Morocco (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.
The LDC local broadcast collection system is highly automated, easily extensible and robust and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory are contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.
LDC designed a portable platform for remote broadcast collection. This is a TiVO-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.
HKUST collected Chinese broadcast programming using its internal recording system and a portable broadcast collection platform designed by LDC and installed at HKUST in 2006.
*Data*
The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Anhui TV, a regional television station in Mainland China, Anhui Province, China Central TV (CCTV), a national and international broadcaster in Mainland China, Hubei TV, a regional broadcaster in Mainland China, Hubei Province, and Phoenix TV, a Hong Kong-based satellite television station. A table showing the number of programs and hours recorded from each source is contained in the readme file.
This release contains 202 audio files presented in Waveform Audio File format (.wav), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: (1) as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings, (2) as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded and (3) as a guide for data selection by retaining information about the genre, data type and topic of a program.
*Samples*
Please listen to this audio sample.
*Updates*
February 1st, 2016: All wav files were converted to flac.
*Acknowledgement*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. -
C-004805: Greybeard
*Introduction*
Greybeard was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 590 hours of English telephone conversation speech collected in October and November 2008 by LDC. The goal was to record new telephone conversations among subjects who had participated in one or more previous LDC telephone collections, from Switchboard-1 (1991) through the Mixer studies (2006).
A total of 172 subjects were enrolled in the Greybeard collection, all of whom had participated in one of the following:
* Switchboard-1 (LDC97S62) 1991-1992: 2 subjects
* Switchboard-2 (LDC98S75, LDC99S79, LDC2002S06) 1996-1997: 16 subjects
* Mixer 1 and 2 2003-2005: 103 subjects
* Mixer 3 2006: 51 subjects
Most Greybeard participants completed 12 calls. Some subjects completed up to 24 calls. Calls were made or received via an automatic operator system at LDC which connected two participants and announced a topic for discussion.
*Data*
This release consists of 4,680 calls -- the complete set of calls recorded during the Greybeard collection (1,098 calls) as well as all calls from the legacy collections that involved the Greybeard speakers.
The audio from each call was captured digitally by the operator system and stored in a separate file as raw mu-law sample data. As the recordings were uploaded daily from the robot operator to network disk storage, automated processes reformatted the audio into a 2-channel SPHERE-format file for each conversation and queued the recordings for manual audit to verify speaker identification and to check other aspects of the recording. Auditors provided impressionistic judgments on overall audio quality, presence of background noise and cross-channel echo and any other technical difficulty with the call, in addition to confirming the speaker-ID on each channel. These auditor decisions are provided in the call_info tables, described in more detail in the included documentation.
For this release, each 2-channel recording was converted from SPHERE to MS-WAV file format and compressed using FLAC. All audio files are 2-channel, 8 kHz, 16-bit PCM sample data, in FLAC-compressed form (http://flac.sourceforge.net). When uncompressed, they have MS-WAV/RIFF headers.
*Samples*
Please listen to the following audio sample.
*Updates*
None at this time. -
C-004808: LDC Spoken Language Sampler - Second Release
*Introduction*
LDC (Linguistic Data Consortium) Spoken Language Sampler - Second Release contains samples from 20 different corpora published by LDC between 1996 and 2013.
LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily-available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and sharing of resources. With the support of its members, LDC is able to provide critical services to the language research community. These services include: maintaining the data archives, producing and distributing data via media or web downloads, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.
Resources available from LDC include speech, text, video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.
This sampler is available as a free download.
*Data*
The LDC Spoken Language Sampler - Second Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:
* Most excerpts are truncated to be much shorter than the original files, typically between 1.5 and 2 minutes.
* Signal amplitude has been adjusted where necessary to normalize playback volume.
* Some corpora are published in compressed form, but all samples here are uncompressed.
* Some text files are presented as images to ensure foreign character sets display properly.
* In some publications, the NIST SPHERE file format is used for audio data, but the audio files in this sampler are in MS-WAV (RIFF) format for compatibility with typical browser audio utilities. FLAC files have likewise been expanded into their WAV form.
Each entry below is listed by its LDC catalog number.
LDC2013S05
Greybeard
Greybeard is comprised of approximately 590 hours of English telephone conversation speech collected in October and November 2008 by LDC. The goal was to record new telephone conversations among subjects who had participated in one or more previous LDC telephone collections, from Switchboard-1 (1991) through the Mixer studies (2006).
LDC2013S04
GALE Phase 2 Chinese Broadcast Conversation Speech
GALE Phase 2 Chinese Broadcast Conversation Speech is comprised of approximately 120 hours of Chinese speech from current events programming featuring interviews, call-in programs and roundtable discussions.
LDC2012S06
Turkish Broadcast News Speech and Transcripts
Turkish Broadcast News Speech and Transcripts contains approximately 130 hours of Voice of America Turkish radio broadcasts and corresponding transcripts.
LDC2012S05
USC-SFI MALACH Interviews and Transcripts English
USC-SFI MALACH Interviews and Transcripts English contains approximately 375 hours of interviews from 784 survivors of the Holocaust along with transcripts and other documentation.
LDC2012S04
Malto Speech and Transcripts
Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Malto is principally spoken in northeastern India and Bangladesh.
LDC2012S03
Digital Archive of Southern Speech
Digital Archive of Southern Speech contains approximately 370 hours of American English speech data from 30 female speakers and 34 male speakers, along with associated metadata about the speakers and the recordings and maps in .jpeg format relating to the recording locations in the southern United States.
LDC2012S02
TORGO Database of Dysarthric Articulation
TORGO contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.
LDC2011S08
2008 NIST Speaker Recognition Evaluation Test Set
2008 NIST Speaker Recognition Evaluation Test Set contains 942 hours of multilingual telephone speech and English interview speech along with transcripts and other materials used as test data in the 2008 NIST Speaker Recognition Evaluation.
LDC2010S05
Asian Elephant Vocalizations
Asian Elephant Vocalizations consists of 57.5 hours of audio recordings of vocalizations by Asian Elephants (Elephas maximus) in the Uda Walawe National Park, Sri Lanka, of which 31.25 hours have been annotated.
LDC2010S01
Fisher Spanish Speech
Fisher Spanish Speech consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers.
LDC2007S18
CSLU Kids Speech
Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1,100 children from Kindergarten through Grade 10.
LDC2007S15
Nationwide Speech Project
A database of speech representing regional accents and dialects of the United States.
LDC2007S02
Fisher Levantine Arabic
A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities.
LDC2006S43
Gulf Arabic Conversational Telephone Speech
Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions.
LDC2004S09
NIST Meeting Pilot Corpus Speech
Collects speech and transcriptions from topical discussions in meeting settings including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place.
LDC2003S05
West Point Russian Speech
Utterances of sentences in Russian from 1,891 native and non-native speakers.
LDC2003S03
Korean Telephone Speech
Collection of 100 telephone conversations between native Korean speakers and their transcriptions.
LDC2003S02
Grassfields Bantu Fieldwork: Dschang Tone Paradigms
Tone paradigms from Yemba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language spoken by 300,000+ people in Southwestern Cameroon.
LDC96S50
CALLFRIEND Farsi
A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi.
LDC96S37
CALLHOME Japanese
A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.
- references: C-004805: Greybeard
- references: C-004802: GALE Phase 2 Chinese Broadcast Conversation Speech
- references: C-004385: Turkish Broadcast News Speech and Transcripts
- references: C-004386: USC-SFI MALACH Interviews and Transcripts English
- references: C-004384: Malto Speech and Transcripts
- references: C-004774: Digital Archive of Southern Speech
- references: C-004773: TORGO Database of Dysarthric Articulation
- references: C-004766: 2008 NIST Speaker Recognition Evaluation Test Set
- references: C-004735: Asian Elephant Vocalizations
- references: C-004723: Fisher Spanish Speech
- references: C-003111: CSLU: Kids' Speech Version 1.1
- references: C-003309: Nationwide Speech Project
- references: C-001421: Fisher Levantine Arabic Conversational Telephone Speech
- references: C-001258: Gulf Arabic Conversational Telephone Speech
- references: C-001070: NIST Meeting Pilot Corpus Speech
- references: C-001594: West Point Russian Speech
- references: C-001044: Korean Telephone Conversations Speech
- references: C-000704: Grassfields Bantu Fieldwork: Dschang Tone Paradigms
- references: C-000635: CALLFRIEND Farsi
- references: C-000657: CALLHOME Japanese Speech
-
C-004811: GALE Phase 2 Arabic Broadcast Conversation Speech Part 2
*Introduction*
GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 128 hours of Arabic broadcast conversation speech collected in 2007 by LDC; MediaNet, Tunis, Tunisia; and MTC, Rabat, Morocco, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding transcripts are released as GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2 (LDC2013T17).
Broadcast audio for the GALE program was collected at LDC's Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University of Science and Technology, Hong Kong, China (Chinese); Medianet (Tunis, Tunisia) (Arabic); and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.
LDC's local broadcast collection system is highly automated, easily extensible and robust and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory are contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.
LDC designed a portable platform for remote broadcast collection. This is a TiVO-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.
Medianet collected Arabic programming from across the Gulf region using its internal system and LDC's portable broadcast collection platform installed in 2008. The portable platform deployed at the Medianet Tunisian collection facility collected multiple streams of regional Arabic programming from various sources. MTC collected Arabic programming using its internal collection system.
LDC has also released GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 (LDC2013S02) and GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1 (LDC2013T04).
*Data*
The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Abu Dhabi TV, based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Al Arabiya, a news television station based in Dubai; Aljazeera, a regional broadcaster located in Doha, Qatar; Lebanese Broadcasting Corporation, a Lebanese television station; Oman TV, a national broadcaster located in the Sultanate of Oman; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria. A table showing the number of programs and hours recorded from each source is contained in the readme file.
This release contains 141 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program's genre, data type and topic.
*Samples*
Please listen to this audio sample.
*Updates*
None at this time.
*Acknowledgement*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. -
C-004815: GALE Phase 2 Chinese Broadcast News Speech
*Introduction*
GALE Phase 2 Chinese Broadcast News Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 126 hours of Mandarin Chinese broadcast news speech collected in 2006 and 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding transcripts are released as GALE Phase 2 Chinese Broadcast News Transcripts (LDC2013T20).
Broadcast audio for the GALE program was collected at the Philadelphia, PA USA facilities of LDC and at three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.
The LDC local broadcast collection system is highly automated, easily extensible and robust and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory are contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.
LDC designed a portable platform for remote broadcast collection. This is a TiVo-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.
HKUST collected Chinese broadcast programming using its internal recording system and a portable broadcast collection platform designed by LDC and installed at HKUST in 2006.
*Data*
The broadcast recordings in this release feature news programs focusing principally on current events from the following sources: Anhui TV, a regional television station in Anhui Province, Mainland China; China Central TV (CCTV), a national and international broadcaster in Mainland China; and Phoenix TV, a Hong Kong-based satellite television station.
This release contains 248 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about program genre, data type and topic.
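As a quick sanity check on files with these parameters, the stated format (single-channel, 16000 Hz, 16-bit PCM) can be verified with Python's standard-library `wave` module once a .flac file has been decompressed to WAV (for example with the `flac` command-line tool; FLAC itself is not readable by the stdlib). This is a minimal sketch that uses a synthetic in-memory file in place of a real recording:

```python
import io
import wave

# Build a tiny WAV in memory with the parameters the release specifies.
# In practice, buf would be a decompressed copy of one of the 248 .flac files.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)        # single channel
    w.setsampwidth(2)        # 16-bit PCM -> 2 bytes per sample
    w.setframerate(16000)    # 16000 Hz
    w.writeframes(b"\x00\x00" * 16000)  # one second of silence

buf.seek(0)
with wave.open(buf, "rb") as w:
    assert w.getnchannels() == 1
    assert w.getsampwidth() == 2
    assert w.getframerate() == 16000
    duration = w.getnframes() / w.getframerate()

print(f"{duration:.1f} s")  # 1.0 s
```

The same checks applied across all files would quickly flag any recording that deviates from the documented format.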
*Samples*
Please listen to this audio sample.
*Updates*
None at this time.
*Acknowledgment*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
-
C-004818: CALLFRIEND Farsi Second Edition Speech
*Introduction*
CALLFRIEND Farsi Second Edition Speech was developed by the Linguistic Data Consortium (LDC) and consists of approximately 42 hours of telephone conversation (100 recordings) among native Farsi speakers. The calls were recorded in 1995 and 1996 as part of the CALLFRIEND collection, a project designed primarily to support research in automatic language identification. One hundred native Farsi speakers living in the continental United States each made a single telephone call, lasting up to 30 minutes, to a family member or friend living in the United States.
This release represents all calls from the collection. LDC released recordings of 60 calls without transcripts in 1996 as CALLFRIEND Farsi (LDC96S50) after 20 of those calls were used as evaluation data in the first NIST Language Recognition Evaluation (LRE). Seven of those original 60 calls were deemed unsuitable for transcription; thus 53 of the original CALLFRIEND Farsi files are included here along with 47 new files.
Corresponding transcripts are available in CALLFRIEND Farsi Second Edition Speech Transcripts (LDC2014T01).
*Data*
All recordings involved domestic calls routed through LDC's automated telephone collection platform and were stored as 2-channel (4-wire), 8 kHz mu-law samples taken directly from the public telephone network via a T-1 circuit. Each audio file is a FLAC-compressed MS-WAV (RIFF) file containing 2-channel, 8 kHz, 16-bit PCM sample data.
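The 8-bit mu-law samples captured from the telephone network follow G.711 companding, and the distributed files hold the 16-bit linear PCM expansion of those samples. Purely as an illustration of that transform (not LDC's actual conversion pipeline), here is a pure-Python sketch of G.711 mu-law encode and decode:

```python
BIAS = 0x84   # 132, the G.711 mu-law bias
CLIP = 32635  # clip level before companding

def linear_to_ulaw(sample: int) -> int:
    """Compress one 16-bit linear PCM sample to an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    if sample < 0:
        sample = -sample
    sample = min(sample, CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (sample & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # G.711 inverts bits

def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law byte back to 16-bit linear PCM."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -sample if sign else sample
```

Because mu-law is an 8-bit logarithmic code, the round trip is lossy: decoded values land within the quantization step of each segment, which is what makes the format adequate for 8 kHz telephone speech at half the bit rate of linear PCM.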
This release also provides call metadata: speaker gender, the number of speakers on each channel and call duration.
*Samples*
Please listen to this audio sample.
*Updates*
None at this time.
- hasFormat: N-004817: CALLFRIEND Farsi Second Edition Transcripts
- hasVersion: C-000635: CALLFRIEND Farsi
-
C-004821: King Saud University Arabic Speech Database
*Introduction*
King Saud University Arabic Speech Database was developed by the Speech Group (SG) at King Saud University and contains 590 hours of recorded Arabic speech from 269 male and female speakers. The utterances include read and spontaneous speech. The recordings were conducted in varied environments representing quiet and noisy settings.
*Data*
The corpus was designed principally for speaker recognition research. However, other possible applications include first language recognition, mobile channel effects, multichannel effects and the use of different types of microphones. The speech sources are word lists, sentence lists, paragraphs and question and answer sessions. Read speech text includes the following:
* Sets of sentences devised to cover allophones of each phoneme, phonetic balance, and differentiation of accents.
* Word lists developed to minimize missing phonemes and to represent nasals, fricatives, commonly used words and numbers.
* Two paragraphs selected because they included all letters of the alphabet and were easy to read.
Spontaneous speech was captured through question and answer sessions in which speakers answered questions displayed on a screen. The questions covered general topics, such as the weather and food, and included the speaker's name or number.
The speakers were Saudis and non-Saudis. Among the non-Saudi participants were Arabs and non-Arabs. All female speakers were either Saudis or non-Saudi Arabs. Male speakers included non-Arabs from the Indian subcontinent, Africa, South East Asia and Eastern Europe. Non-Arab participants were required to be able to read Arabic at an acceptable level. Most of the non-Arab speakers were from the fourth level in the Arabic Linguistics Institute at King Saud University. The non-Saudi participants represented 28 nationalities and were chosen from clusters of areas or countries.
Each speaker was recorded in three different environments: in a soundproof room, in an office and in a cafeteria. The recordings were collected via different microphones and a mobile phone and averaged 16 to 19 minutes in length. The recordings were made in three sessions separated by gaps of approximately six weeks.
The data were verified for missing recordings, problems with the recording system and errors in the recording process. All files are presented as two-channel, 48 kHz, 16-bit FLAC-compressed PCM WAV files. Note that the sizes and file names in the documentation refer to the uncompressed WAV files.
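Because the documented sizes describe the uncompressed files, the expected size of a recording can be estimated from its duration and the stated parameters (two channels, 48 kHz, 16-bit). The helper below is a hypothetical sketch for that arithmetic; the 44-byte figure assumes a canonical RIFF/WAVE header, and real files may differ slightly:

```python
def uncompressed_wav_bytes(seconds: float, rate: int = 48000,
                           channels: int = 2, bits: int = 16,
                           header: int = 44) -> int:
    """Estimate the size of an uncompressed PCM WAV file.

    Defaults match the corpus parameters (48 kHz, two channels, 16-bit);
    `header` assumes a canonical 44-byte RIFF/WAVE header.
    """
    return int(seconds * rate * channels * (bits // 8)) + header

# An 18-minute recording, mid-range of the 16-19 minute average:
size = uncompressed_wav_bytes(18 * 60)
print(f"{size / 1024 / 1024:.1f} MiB")  # 197.8 MiB
```

At roughly 200 MiB per uncompressed recording, the motivation for FLAC compression in the distribution is clear.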
*Samples*
Please view this male sample and female sample.
*Updates*
None at this time.