  • C-004918: CHM150
    *Introduction*

    CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican Spanish speech, associated transcripts, and speaker metadata. The goal of this work was to support spoken term detection and forensic speaker identification.

    *Data*

    This corpus comprises Mexican Spanish microphone speech from 75 male and 75 female speakers, recorded in a quiet office environment. Speakers could either answer pre-selected open questions or describe a particular painting shown to them on a computer monitor.

    Speaker metadata in this release includes age, gender, place of birth, place of residence and parents' nationalities.

    The audio files are presented as 16 kHz, 16-bit PCM flac compressed wav.

    *Samples*

    Please view this audio sample and text sample.

    *Updates*

    None at this time.
  • C-004922: Digital Archive of Southern Speech - NLP Version
    *Introduction*

    Digital Archive of Southern Speech - NLP Version (DASS-NLP) was developed by LDC as an alternate version of Digital Archive of Southern Speech (DASS) (LDC2012S03) suitable for natural language processing and human language technology applications. Specifically, the original audio files have been converted to 16kHz 16-bit flac compressed wav and file names have been normalized to facilitate automatic processing.

    DASS was developed by the University of Georgia. It is a subset of the Linguistic Atlas of the Gulf States (LAGS), which is in turn part of the Linguistic Atlas Project (LAP). DASS-NLP contains approximately 366 hours of English speech data from 30 female speakers and 34 male speakers in flac compressed wav format, along with associated metadata about the speakers and the recordings, and maps in .jpeg format relating to the recording locations.

    LAP consists of a set of survey research projects about the words and pronunciation of everyday American English, the largest project of its kind in the United States. Interviews with thousands of native speakers across the country have been carried out since 1929. LAGS surveyed the everyday speech of Georgia, Tennessee, Florida, Alabama, Mississippi, Arkansas, Louisiana, and Texas in a series of 914 audio-taped interviews conducted from 1968 to 1983. Interviews average approximately six hours in length; the systematic LAGS tape archive amounts to 5500 hours of sound recordings. DASS is a collection of 64 interviews from LAGS selected to cover a range of speech across the region and to represent multiple education levels and ethnic backgrounds.
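
    The rounded figures in this description can be cross-checked with simple arithmetic. The sketch below is illustrative only: the variable names are hypothetical, and the inputs are the approximate values quoted above, so the results are estimates rather than official statistics.

```python
# Rounded figures taken from the description above.
lags_interviews = 914
avg_interview_hours = 6      # "approximately six hours"
dass_interviews = 64
dass_nlp_hours = 366         # approximate size of DASS-NLP

# Estimated size of the full LAGS tape archive, in hours.
lags_estimate = lags_interviews * avg_interview_hours

# Average interview length within the DASS subset, in hours.
dass_avg = dass_nlp_hours / dass_interviews

print(lags_estimate)         # 5484
print(round(dass_avg, 2))    # 5.72
```

    Both values agree with the stated archive size of about 5500 hours and the stated average of about six hours per interview.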

    *Data*

    The DASS-NLP speakers' average age is 61 years; there are 30 women and 34 men from the Gulf States region represented in this release. The interviews cover common topics such as family, the weather, household articles and activities, agriculture and social connections.

    The interviews were originally recorded in the field on reel-to-reel audio tape. A digital version of every reel of tape was then made, one .wav file per reel, usually about one hour of sound. Each interview thus consists of a set of 3 to 13 reels, or roughly 3 to 13 interview hours. Personally identifying or sensitive information in the files was replaced with a tone to protect the privacy of the speakers and to ensure their ethical treatment.

    *Samples*

    Please listen to this sample.

    *Updates*

    None at this time.

    *Authorship*

    The following people were involved with the DASS project:

    William A. Kretzschmar, Jr., Paulina Bounds, Jacqueline Hettel and Steven Coats
    University of Georgia

    Lee Pederson
    Emory University

    Lisa Lena Opas-Hänninen, Ilkka Juuso and Tapio Seppänen
    University of Oulu (Finland)

    *Sponsorship*

    The Atlas Data contained herein comprises information collected in the period spanning from the 1930s to 2010 and has been compiled from diverse sources, by, and under the direction of, Dr. William A. Kretzschmar, Harry and Jane Wilson Professor in Humanities at the Department of English of The University of Georgia.

    Compilation and digitalization of this work was funded, in part, by the US National Science Foundation and by the US National Endowment for the Humanities.

    Additional information about the Atlas Project can be obtained at http://www.lap.uga.edu/.
  • C-004923: IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a
    *Introduction*

    IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 205 hours of Assamese conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

    The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

    *Data*

    The speech in this release represents three dialects spoken in Assam, a state in northeastern India. The gender distribution among speakers is approximately even; speakers' ages range from 16 years to 66 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

    All audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format. Transcripts are available in two versions: Assamese script and a romanization scheme developed by Appen Butler Hill, both encoded in UTF-8. Further information about transcription methodology is contained in the documentation accompanying this release.
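
    For readers unfamiliar with A-law, each 8-bit code expands to one signed linear PCM sample. The sketch below follows the commonly used G.711 reference expansion; it is not taken from the release documentation, the function names are hypothetical, and it works on raw sample bytes only (reading the sphere header is not shown).

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a signed linear sample (16-bit range)."""
    code ^= 0x55                     # undo the bit inversion applied on encoding
    mantissa = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4     # 3-bit segment (exponent)
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # After the XOR, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(payload: bytes) -> list[int]:
    """Decode a run of raw A-law bytes into linear samples."""
    return [alaw_to_linear(b) for b in payload]


print(decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A])))  # [8, -8, 32256, -32256]
```

    The four test bytes cover the smallest and largest magnitudes of each sign, which makes the companded range easy to verify by hand.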

    Evaluation data is available from NIST in support of OpenKWS.

    *Samples*

    Please view the following samples:

    * Audio Sample
    * Transcription Sample
    * Romanized Transcription Sample

    *Updates*

    None at this time.
  • C-004924: IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b
    *Introduction*

    IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 215 hours of Bengali conversational and scripted telephone speech collected in 2011 and 2012 along with corresponding transcripts.

    The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

    *Data*

    The Bengali speech in this release represents that spoken in India by native speakers of Bengali born in India. The gender distribution among speakers is approximately even; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

    All audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format. Transcripts are available in two versions: the Bengali script and a romanization scheme developed by Appen Butler Hill, both encoded in UTF-8. Further information about transcription methodology is contained in the documentation accompanying this release.

    Evaluation data is available from NIST in support of OpenKWS.

    *Samples*

    Please view the following samples:

    * Audio Sample
    * Transcript Sample
    * Romanized Transcript Sample

    *Updates*

    None at this time.
  • C-004925: GALE Phase 3 Arabic Broadcast News Transcripts Part 1
    *Introduction*

    GALE Phase 3 Arabic Broadcast News Transcripts Part 1 was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 132 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

    Corresponding audio data is released as GALE Phase 3 Arabic Broadcast News Speech Part 1 (LDC2016S07).

    The transcribed recordings feature news broadcasts focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi; Al Alam News Channel, based in Iran; Al Arabiya, a news television station based in Dubai; Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Dubai TV, a broadcast station in the United Arab Emirates; Kuwait TV, a national broadcast station in Kuwait; Lebanese Broadcasting Corporation, a Lebanese television station; Nile TV, a broadcast programmer based in Egypt; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.

    *Data*

    The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 741,689 tokens. The transcripts were created with XTrans, an LDC tool that supports manual transcription and annotation of audio recordings. XTrans is available at https://www.ldc.upenn.edu/language-resources/tools/xtrans.

    The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR in the filename were produced using QTR transcription; files with QRTR in the filename were produced using QRTR transcription.
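
    The filename convention described above lends itself to a simple sorting step. The following is a minimal sketch, not part of the release: the function name and the example filenames are hypothetical, and case-insensitive matching is an assumption about how the markers appear in practice.

```python
def transcription_style(filename: str) -> str:
    """Guess the transcription style from a marker in the filename."""
    name = filename.upper()          # assumption: markers may appear in any case
    if "QRTR" in name:
        return "QRTR"                # quick rich transcription
    if "QTR" in name:
        return "QTR"                 # quick transcription
    return "unknown"


# Hypothetical filenames, for illustration only.
for example in ["example_broadcast.qtr.tdf", "example_broadcast.qrtr.tdf"]:
    print(example, "->", transcription_style(example))
```

    Because the string "QTR" does not occur inside "QRTR", the two checks cannot be confused with one another, whichever order they run in.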

    *Samples*

    Please view this sample.

    *Updates*

    None at this time.

    *Acknowledgement*

    This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
  • C-004926: GALE Phase 3 Arabic Broadcast News Speech Part 1
    *Introduction*

    GALE Phase 3 Arabic Broadcast News Speech Part 1 was developed by the Linguistic Data Consortium (LDC) and comprises approximately 132 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

    Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast News Transcripts Part 1 (LDC2016T17).

    Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA, USA facilities and at three remote collection sites: Hong Kong University of Science and Technology, Hong Kong (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

    LDC’s local broadcast collection system is highly automated, easily extensible, robust, and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory are contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.

    LDC designed a portable platform for remote broadcast collection. This is a TiVO-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.

    Medianet collected Arabic programming from across the Gulf region using its internal system and LDC's portable broadcast collection platform installed in 2008. The portable platform deployed at the Medianet Tunisian collection facility collected multiple streams of regional Arabic programming from various sources. MTC collected Arabic programming using its internal collection system.

    *Data*

    The broadcast news recordings in this release feature news broadcasts focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi; Al Alam News Channel, based in Iran; Al Arabiya, a news television station based in Dubai; Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Dubai TV, a broadcast station in the United Arab Emirates; Kuwait TV, a national broadcast station in Kuwait; Lebanese Broadcasting Corporation, a Lebanese television station; Nile TV, a broadcast programmer based in Egypt; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.

    This release contains 175 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.
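
    The stated audio format allows a rough estimate of the uncompressed data volume and the average recording length. This is a back-of-the-envelope sketch using the approximate 132-hour figure from the introduction; the variable names are hypothetical, and the actual FLAC-compressed size on disk will be smaller.

```python
SAMPLE_RATE_HZ = 16_000      # 16000 Hz
BYTES_PER_SAMPLE = 2         # 16-bit PCM
CHANNELS = 1                 # single channel

total_hours = 132            # approximate, from the introduction
file_count = 175

# Data rate of the decoded audio, in bytes per second.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS

# Size of the whole collection once decompressed to raw PCM.
uncompressed_bytes = total_hours * 3600 * bytes_per_second

# Average recording length per file, in minutes.
avg_minutes_per_file = total_hours * 60 / file_count

print(bytes_per_second)                  # 32000
print(uncompressed_bytes)                # 15206400000
print(round(avg_minutes_per_file, 1))    # 45.3
```

    In other words, the decoded audio occupies roughly 15 GB, and a typical file holds about three quarters of an hour of programming.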

    *Samples*

    Please listen to the following audio sample.

    *Updates*

    None at this time.

    *Acknowledgment*

    This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
  • C-004930: IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY
    *Introduction*

    IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 214 hours of Pashto conversational and scripted telephone speech collected in 2011 and 2012 along with corresponding transcripts.

    The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

    *Data*

    The Pashto speech in this release represents that spoken in four dialect regions of Afghanistan and Pakistan. The gender distribution among speakers is approximately 30% female, 70% male; speakers' ages range from 17 years to 70 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

    All audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format. Transcripts are available in two versions: an extended Arabic script and a modified Buckwalter transliteration scheme, both encoded in UTF-8. Further information about transcription methodology is contained in the documentation accompanying this release.

    Evaluation data is available from NIST in support of OpenKWS.

    *Samples*

    Please view the following samples:

    * Audio Sample
    * Transcript Sample
    * Romanized Transcript Sample

    *Updates*

    None at this time.
  • C-004932: IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5
    *Introduction*

    IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5 was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 213 hours of Turkish conversational and scripted telephone speech collected in 2012 along with corresponding transcripts.

    The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

    *Data*

    The Turkish speech in this release represents that spoken in seven dialect regions in Turkey. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 70 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

    All audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format. Transcripts are encoded in UTF-8. Further information about transcription methodology is contained in the documentation accompanying this release.

    Evaluation data is available from NIST in support of OpenKWS.

    *Updates*

    None at this time.

    *Samples*

    Please view this audio sample and transcript sample.
  • C-004934: IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a
    *Introduction*

    IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 190 hours of Georgian conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.

    The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

    *Data*

    The Georgian speech in this release represents that spoken in the Eastern and Western dialect regions in Georgia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 73 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

    Audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format or in 48kHz 24-bit PCM wav format. Transcripts are encoded in UTF-8 using a romanization scheme developed by Appen. Further information about transcription methodology is contained in the documentation accompanying this release.

    Evaluation data is available from NIST in support of OpenKWS.

    *Samples*

    Please view these samples:

    * Audio Sample
    * Transcription Sample
    * Romanized Transcription Sample

    *Updates*

    None at this time.
  • C-004935: Multi-Language Conversational Telephone Speech 2011 -- Slavic Group
    *Introduction*

    Multi-Language Conversational Telephone Speech 2011 -- Slavic Group was developed by the Linguistic Data Consortium (LDC) and comprises approximately 60 hours of telephone speech in total across three distinct Slavic languages: Polish, Russian and Ukrainian.

    The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on language pair discrimination for 24 languages/dialects, some of which could be considered mutually intelligible or closely related.

    *Data*

    Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. The data was collected using LDC's telephone collection infrastructure, comprised of three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise. Demographic information about the participants was not collected.

    All audio data are presented in FLAC-compressed MS-WAV (RIFF) file format (*.flac); when uncompressed, each file is 2 channels, recorded at 8000 samples/second with samples stored as 16-bit signed integers, representing a lossless conversion from the original mu-law sample data as captured digitally from the public telephone network. The following table summarizes the total number of calls, total number of hours of recorded audio, and the total size of compressed data:

    group   language  calls  hours  size (MB)
    slavic  pol         124   28.3       1457
    slavic  rus          71   13.1        577
    slavic  ukr          89   19.0        932
    slavic  Totals      284   60.4       2966
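
    The totals row can be verified directly from the per-language rows, and the same numbers give an average call length. The sketch below simply re-enters the table values by hand; the variable names are hypothetical.

```python
# language code -> (calls, hours, size in MB), copied from the table above
rows = {
    "pol": (124, 28.3, 1457),
    "rus": (71, 13.1, 577),
    "ukr": (89, 19.0, 932),
}

total_calls = sum(calls for calls, _, _ in rows.values())
total_hours = round(sum(hours for _, hours, _ in rows.values()), 1)
total_mb = sum(mb for _, _, mb in rows.values())
print(total_calls, total_hours, total_mb)    # 284 60.4 2966

# Average recorded minutes per call, by language.
avg_minutes = {
    lang: round(hours * 60 / calls, 1)
    for lang, (calls, hours, _) in rows.items()
}
print(avg_minutes)    # {'pol': 13.7, 'rus': 11.1, 'ukr': 12.8}
```

    The averages fall below the 15-minute limit on call length mentioned in the collection description.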

    *Samples*

    Please listen to this sample.

    *Updates*

    None at this time.