C-004757: 2006 NIST Spoken Term Detection Evaluation Set
*Introduction*
2006 NIST Spoken Term Detection Evaluation Set, Linguistic Data Consortium (LDC) catalog number LDC2011S03 and ISBN 1-58563-584-7, was compiled by researchers at NIST (National Institute of Standards and Technology) and contains approximately eighteen hours of Arabic, Chinese and English broadcast news, English conversational telephone speech and English meeting room speech used in NIST's 2006 Spoken Term Detection (STD) evaluation. The STD initiative is designed to facilitate research and development of technology for retrieving information from archives of speech data, with the goals of exploring promising new ideas in spoken term detection, developing advanced technology incorporating these ideas, measuring the performance of this technology and establishing a community for the exchange of research results and technical insights.
The 2006 STD task was to find all of the occurrences of a specified term (a sequence of one or more words) in a given corpus of speech data. The evaluation was intended to develop technology for rapidly searching very large quantities of audio data. Although the evaluation used modest amounts of data, it was structured to simulate the very large data situation and to make it possible to extrapolate the speed measurements to much larger data sets. Therefore, systems were implemented in two phases: indexing and searching. In the indexing phase, the system processes the speech data without knowledge of the terms. In the searching phase, the system uses the terms, the index, and optionally the audio to detect term occurrences.
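As a rough illustration of that two-phase design (not the evaluation's actual tooling), the sketch below builds a word-level inverted index from hypothetical position-marked transcripts in an indexing phase, then resolves multi-word terms against it in a separate searching phase; the data layout and function names are invented for the example.

```python
from collections import defaultdict

# Toy sketch of the STD indexing/searching split. Transcript format is
# hypothetical: (file_id, position, word) triples, where position is the
# word's ordinal index within the file.

def build_index(entries):
    """Indexing phase: runs without knowledge of the search terms."""
    index = defaultdict(list)          # word -> [(file_id, position), ...]
    for file_id, pos, word in entries:
        index[word.lower()].append((file_id, pos))
    return index

def search(index, term):
    """Searching phase: find every occurrence of a term (one or more words)."""
    words = term.lower().split()
    candidates = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        following = set(index.get(word, []))
        # Keep only starting positions whose next word is adjacent.
        candidates = {(f, p) for f, p in candidates
                      if (f, p + offset) in following}
    return sorted(candidates)

entries = [("bn_001", 0, "the"), ("bn_001", 1, "weather"),
           ("bn_001", 2, "report"), ("bn_002", 0, "weather")]
idx = build_index(entries)
print(search(idx, "weather report"))   # [('bn_001', 1)]
```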
The development data is available in 2006 NIST Spoken Term Detection Development Set LDC2011S02.
*Data*
The evaluation corpus consists of three data genres: broadcast news (BNews), conversational telephone speech (CTS) and conference room meetings (CONFMTG). The broadcast news material was collected in 2003 and 2004 by LDC's broadcast collection system from the following sources: ABC (English), Aljazeera (Arabic), China Central TV (Chinese), CNN (English), CNBC (English), Dubai TV (Arabic), New Tang Dynasty TV (Chinese), Public Radio International (English) and Radio Free Asia (Chinese). The CTS data was taken from the Switchboard data sets (e.g., Switchboard-2 Phase 1 LDC98S75, Switchboard-2 Phase 2 LDC99S79) and the Fisher corpora (e.g., Fisher English Training Speech Part 1 LDC2004S13), also collected by LDC. The conference room meeting material consists of goal-oriented, small group roundtable meetings and was collected in 2004 and 2005 by NIST, the International Computer Science Institute (Berkeley, California), Carnegie Mellon University (Pittsburgh, PA), TNO (The Netherlands) and Virginia Polytechnic Institute and State University (Blacksburg, VA) as part of the AMI corpus project.
This evaluation corpus includes scoring software, which uses the inputs described in the STD Evaluation Plan to complete the evaluation of a system.
Each BNews recording is a single-channel, pcm-encoded, 16 kHz SPHERE-formatted file. CTS recordings are two-channel, u-law encoded, 8 kHz SPHERE-formatted files. The CONFMTG files contain a single recorded channel.
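For readers unfamiliar with the SPHERE format: each file begins with a plain-text header (first line NIST_1A, second line the header size in bytes, then name/type/value triples terminated by end_head). The following is a minimal sketch of extracting a few standard fields using only the Python standard library; the filename is hypothetical and the type handling is simplified.

```python
# Minimal NIST SPHERE header reader (sketch; assumes an ASCII header).

def read_sphere_header(path):
    """Parse the plain-text header at the start of a .sph file."""
    with open(path, "rb") as f:
        assert f.readline().strip() == b"NIST_1A"
        header_size = int(f.readline().decode("ascii").strip())  # usually 1024
        f.seek(0)
        text = f.read(header_size).decode("ascii", errors="replace")
    fields = {}
    for line in text.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line:
            continue
        # Field lines look like: name -type value  (e.g. "sample_rate -i 16000")
        name, ftype, value = line.split(None, 2)
        fields[name] = int(value) if ftype == "-i" else value
    return fields

hdr = read_sphere_header("example.sph")  # hypothetical filename
print(hdr.get("channel_count"), hdr.get("sample_rate"), hdr.get("sample_coding"))
```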
*Samples*
For an example of the audio data in this corpus, please examine this audio sample.
- references: C-001418: Fisher English Training Speech Part 1 Speech
- references: C-000738: Switchboard-2 Phase II
- references: C-001284: Switchboard-2 Phase I
-
C-004759: 2005 NIST Speaker Recognition Evaluation Test Data
*Introduction*
2005 NIST Speaker Recognition Evaluation Test Data, Linguistic Data Consortium (LDC) catalog number LDC2011S04 and ISBN 1-58563-586-3, was developed at LDC and NIST (National Institute of Standards and Technology). It consists of 525 hours of conversational telephone speech in English, Arabic, Mandarin Chinese, Russian and Spanish and associated English transcripts used as test data in the NIST-sponsored 2005 Speaker Recognition Evaluation (SRE). The ongoing series of yearly SRE evaluations conducted by NIST is intended to be of interest to researchers working on the general problem of text independent speaker recognition. To that end, the evaluations are designed to be simple, to focus on core technology issues, to be fully supported and to be accessible.
The task of the 2005 SRE evaluation was speaker detection, that is, to determine whether a specified speaker is speaking during a given segment of conversational speech. The task was divided into 20 distinct tests involving one of five training conditions and one of four test conditions. Further information about the task conditions is contained in The NIST Year 2005 Speaker Recognition Evaluation Plan.
The training data for the 2005 evaluation is available in NIST 2005 Speaker Recognition Evaluation Training Data LDC2011S01.
*Data*
The speech data consists of conversational telephone speech, with multi-channel data collected by LDC simultaneously from a number of auxiliary microphones. The files are organized into two types: 10-second two-channel excerpts (continuous segments from single conversations that are estimated to contain approximately 10 seconds of actual speech in the channel of interest) and 5-minute two-channel conversations.
The data are stored as 8-bit u-law speech signals in NIST SPHERE format. In addition to the standard header fields, the SPHERE header for each file contains some auxiliary information that includes the language of the conversation and whether the data was recorded over a telephone line.
English language word transcripts in .cmt format were produced using an automatic speech recognition (ASR) system, with error rates in the range of 15-30%.
*Samples*
For an example of the data contained in this corpus, review this audio sample.
*Updates*
Trial files for the 2005 SRE evaluation were added on Jan. 05, 2012 in the LDC2011S04U01 update. These files define the various evaluation tests, with one file for each trial in the evaluation. The index.html and file.tbl files were also updated. All copies of LDC2011S04 ordered after Jan. 05, 2012 should contain those updates. Please contact ldc@ldc.upenn.edu with any questions.
-
C-004761: 2008 NIST Speaker Recognition Evaluation Training Set Part 2
*Introduction*
2008 NIST Speaker Recognition Evaluation Training Set Part 2, Linguistic Data Consortium (LDC) catalog number LDC2011S07 and ISBN 1-58563-591-X, was developed by LDC and NIST (National Institute of Standards and Technology). It contains 950 hours of multilingual telephone speech and English interview speech along with transcripts and other materials used as training data in the 2008 NIST Speaker Recognition Evaluation (SRE).
SRE is part of an ongoing series of evaluations conducted by NIST. These evaluations are an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text independent speaker recognition. To this end the evaluation is designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible to those wishing to participate.
The 2008 evaluation was distinguished from prior evaluations, in particular those in 2005 and 2006, by including not only conversational telephone speech data but also conversational speech data of comparable duration recorded over a microphone channel involving an interview scenario.
Additional documentation is available at the NIST web site for the 2008 SRE and within the 2008 SRE Evaluation Plan.
*Data*
The speech data in this release was collected in 2007 by LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the International Computer Science Institute (ICSI) at the University of California, Berkeley. This collection was part of the Mixer 5 project, which was designed to support the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. Mixer participants were native English speakers and bilingual English speakers. The telephone speech in this corpus is predominantly English but also includes other languages spoken by the bilingual participants. All interview segments are in English. Telephone speech represents approximately 523 hours of the data, and microphone speech represents the other 427 hours.
The telephone speech segments include summed-channel excerpts of approximately 5 minutes taken from longer original conversations. The interview material includes single-channel interview segments of at least 8 minutes taken from longer interview sessions. As in prior evaluations, intervals of silence were not removed.
English language transcripts in .cfm format were produced using an automatic speech recognition (ASR) system. Approximately six files distributed as part of SRE08 consist only of a 1024-byte header with no audio; these files were not included in the trials or keys distributed in the SRE08 aggregate corpus.
*Samples*
For an example of the data contained in this corpus, review this audio sample.
-
C-004766: 2008 NIST Speaker Recognition Evaluation Test Set
*Introduction*
2008 NIST Speaker Recognition Evaluation Test Set, Linguistic Data Consortium (LDC) catalog number LDC2011S08 and ISBN 1-58563-594-4, was developed by LDC and NIST (National Institute of Standards and Technology). It contains 942 hours of multilingual telephone speech and English interview speech along with transcripts and other materials used as test data in the 2008 NIST Speaker Recognition Evaluation (SRE).
NIST SRE is part of an ongoing series of evaluations conducted by NIST. These evaluations are an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text independent speaker recognition. To this end the evaluation is designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible to those wishing to participate.
The 2008 evaluation was distinguished from prior evaluations, in particular those in 2005 and 2006, by including not only conversational telephone speech data but also conversational speech data of comparable duration recorded over a microphone channel involving an interview scenario.
LDC previously released the 2008 NIST SRE Training Set in two parts as LDC2011S05 and LDC2011S07.
Additional documentation is available at the NIST web site for the 2008 SRE and within the 2008 SRE Evaluation Plan.
*Data*
The speech data in this release was collected in 2007 by LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the International Computer Science Institute (ICSI) at the University of California, Berkeley. This collection was part of the Mixer 5 project, which was designed to support the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. Mixer participants were native English speakers and bilingual English speakers. The telephone speech in this corpus is predominantly English but also includes other languages spoken by the bilingual participants. All interview segments are in English. Telephone speech represents approximately 368 hours of the data, whereas microphone speech represents the other 574 hours.
The telephone speech segments include two-channel excerpts of approximately 10 seconds and of approximately 5 minutes. There are also summed-channel excerpts of approximately 5 minutes. The microphone excerpts are either 3 or 8 minutes in length. As in prior evaluations, intervals of silence were not removed. Approximately six files distributed as part of SRE08 consist only of a 1024-byte header with no audio; these files were not included in the trials or keys distributed in the SRE08 aggregate corpus.
English language transcripts in .cfm format were produced using an automatic speech recognition (ASR) system.
*Samples*
For an example of the data contained in this corpus, review this audio sample.
*Updates*
None at this time.
-
C-004769: 2006 NIST Speaker Recognition Evaluation Training Set
*Introduction*
2006 NIST Speaker Recognition Evaluation Training Set was developed by LDC and NIST (National Institute of Standards and Technology). It contains 595 hours of conversational telephone speech in English, Arabic, Bengali, Chinese, Hindi, Korean, Russian, Thai and Urdu and associated English transcripts used as training data in the NIST-sponsored 2006 Speaker Recognition Evaluation (SRE). The ongoing series of yearly SRE evaluations conducted by NIST is intended to be of interest to researchers working on the general problem of text independent speaker recognition. To this end, the evaluations are designed to be simple, to focus on core technology issues, to be fully supported and to be accessible to those wishing to participate.
The task of the 2006 SRE evaluation was speaker detection, that is, to determine whether a specified speaker is speaking during a given segment of conversational telephone speech. The task was divided into 15 distinct tests involving one of five training conditions and one of four test conditions. Further information about the test conditions and additional documentation is available at the NIST web site for the 2006 SRE and within the 2006 SRE Evaluation Plan.
*Data*
The speech data in this release was collected by LDC as part of the Mixer project, in particular Mixer Phases 1, 2 and 3. The Mixer project supports the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. The data is mostly English speech, but includes some speech in Arabic, Bengali, Chinese, Hindi, Korean, Russian, Thai and Urdu.
The telephone speech segments are multi-channel data collected simultaneously from a number of auxiliary microphones. The files are organized into three types: two-channel excerpts of approximately 10 seconds, two-channel conversations of approximately 5 minutes and summed-channel conversations also of approximately 5 minutes.
The speech files are stored as 8-bit u-law speech signals in separate SPHERE files. In addition to the standard header fields, the SPHERE header for each file contains some auxiliary information that includes the language of the conversation and whether the data was recorded over a telephone line.
English language transcripts in .ctm format were produced using an automatic speech recognition (ASR) system.
*Samples*
For an example of the data contained in this corpus, review this audio sample.
*Updates*
None at this time.
-
C-004771: 2006 NIST Speaker Recognition Evaluation Test Set Part 1
*Introduction*
2006 NIST Speaker Recognition Evaluation Test Set Part 1 was developed by LDC and NIST (National Institute of Standards and Technology). It contains 437 hours of conversational telephone and microphone speech in English, Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu and associated English transcripts used as test data in the NIST-sponsored 2006 Speaker Recognition Evaluation (SRE).
The ongoing series of yearly SRE evaluations conducted by NIST is intended to be of interest to researchers working on the general problem of text independent speaker recognition. To this end, the evaluations are designed to be simple, to focus on core technology issues, to be fully supported and to be accessible to those wishing to participate.
The task of the 2006 SRE evaluation was speaker detection, that is, to determine whether a specified speaker is speaking during a given segment of conversational telephone speech. The task was divided into 15 distinct tests involving one of five training conditions and one of four test conditions. Further information about the test conditions and additional documentation is available at the NIST web site for the 2006 SRE and within the 2006 SRE Evaluation Plan.
LDC previously released the 2006 NIST Speaker Recognition Evaluation Training Set.
*Data*
The speech data in this release was collected by LDC as part of the Mixer project, in particular Mixer Phases 1, 2 and 3. The Mixer project supports the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. The data is mostly English speech, but includes some speech in Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu.
The telephone speech segments are multi-channel data collected simultaneously from a number of auxiliary microphones. The files are organized into four types: two-channel excerpts of approximately 10 seconds, two-channel conversations of approximately 5 minutes, summed-channel conversations also of approximately 5 minutes and a two-channel conversation with the usual telephone speech replaced by auxiliary microphone data in the putative target speaker channel. The auxiliary microphone conversations are also of approximately five minutes in length.
The speech files are stored as 8-bit u-law speech signals in separate SPHERE files. In addition to the standard header fields, the SPHERE header for each file contains some auxiliary information such as the language of the conversation.
English language transcripts in .ctm format were produced using an automatic speech recognition (ASR) system.
*Samples*
For an example of the data contained in this corpus, review this audio sample.
*Updates*
None at this time.
-
C-004772: 2008 NIST Speaker Recognition Evaluation Supplemental Set
*Introduction*
2008 NIST Speaker Recognition Evaluation Supplemental Set, Linguistic Data Consortium (LDC) catalog number LDC2011S11 and ISBN 1-58563-601-0, was developed by LDC and NIST (National Institute of Standards and Technology) and contains additional data distributed after the main 2008 Speaker Recognition Evaluation (SRE). Specifically, the corpus consists of 770 hours of English microphone speech along with transcripts and other materials used as supplemental data in the 2008 NIST Speaker Recognition Evaluation (SRE) and in a follow-up evaluation to SRE08.
NIST SRE is part of an ongoing series of evaluations conducted by NIST. These evaluations are an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text independent speaker recognition. To this end the evaluation is designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible to those wishing to participate.
The 2008 evaluation was distinguished from prior evaluations, in particular those in 2005 and 2006, by including not only conversational telephone speech data but also conversational speech data of comparable duration recorded over a microphone channel involving an interview scenario. The follow-up evaluation focused on speaker detection in the context of conversational interview-type speech and was designed to measure the performance of SRE08 systems in previously unexposed test segment channel conditions.
LDC previously released the main 2008 NIST SRE Evaluation in three parts as 2008 NIST Speaker Recognition Evaluation Training Set Part 1 LDC2011S05, 2008 NIST Speaker Recognition Evaluation Training Set Part 2 LDC2011S07 and 2008 NIST Speaker Recognition Evaluation Test Set LDC2011S08.
Additional documentation is available at the NIST web site for the 2008 SRE and within the 2008 SRE Evaluation Plan and the Plan for Follow-up Evaluation to SRE08.
*Data*
The speech data in this release was collected in 2007 by LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the International Computer Science Institute (ICSI) at the University of California, Berkeley. This collection was part of the Mixer 5 project, which was designed to support the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. Mixer participants were native English speakers and bilingual English speakers. The microphone speech in this corpus is in English and consists of interview excerpts of approximately 3 minutes and 30 minutes.
This supplemental data is split into four different parts which provide:
* new training data distributed to 2008 SRE participants
* additional data distributed to participants in the 2008 SRE follow-up evaluation
* interviewer channel files for the 2008 SRE main test (released after the evaluations)
* supplemental training data (released after the evaluations)
English language transcripts in .cfm format were produced using an automatic speech recognition (ASR) system and are included for some, but not all, speech data.
*Samples*
For an example of the data contained in this corpus, review this audio sample.
*Updates*
None at this time.
-
C-004773: TORGO Database of Dysarthric Articulation
*Introduction*
TORGO Database of Dysarthric Articulation was developed by the University of Toronto's departments of Computer Science and Speech-Language Pathology in collaboration with the Holland Bloorview Kids Rehabilitation Hospital in Toronto, Canada. It contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS) and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.
CP and ALS are among the conditions that cause dysarthria, which results from disruptions in the neuro-motor interface that distort motor commands to the vocal articulators, producing atypical and relatively unintelligible speech in most cases. The TORGO database is primarily a resource for developing advanced automatic speech recognition (ASR) models suited to the needs of people with dysarthria, but it is also applicable to non-dysarthric speech. The inability of modern ASR systems to effectively understand dysarthric speech is a problem, since the more general physical disabilities often associated with the condition can make other forms of computer input, such as keyboards or touch screens, difficult to use.
*Data*
The data consists of aligned acoustics and measured 3D articulatory features from the speakers carried out using the 3D AG500 electro-magnetic articulograph (EMA) system (Carstens Medizinelektronik GmbH, Lenglern, Germany) with fully-automated calibration. This system allows for 3D recordings of articulatory movements inside and outside the vocal tract, thus providing a detailed window on the nature and direction of speech-related activity.
The data was collected between 2008 and 2010 in Toronto, Canada. All subjects read text consisting of non-words, short words and restricted sentences from a 19-inch LCD screen. The restricted sentences included 162 sentences from the sentence intelligibility section of Assessment of Intelligibility of Dysarthric Speech (Yorkston & Beukelman, 1981) and 460 sentences derived from the TIMIT database. The unrestricted sentences were elicited by asking participants to spontaneously describe 30 images of interesting situations taken randomly from Webber Photo Cards - Story Starters (Webber, 2005), a set designed to prompt students to tell or write a story.
Data is organized by speaker and by the session in which each speaker recorded data. Each speaker was assigned a code and given their own file directory. The code for female speakers begins with F, and the code for male speakers begins with M. If the speaker was a member of the control group, the letter C follows the gender code. The last two digits of the code indicate the order in which that subject was recruited. For example, speaker FC02 was the second female speaker without dysarthria recruited. Note that some speakers were intentionally left out of the data, and thus, there are gaps in the numbering.
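Given the code scheme just described, decoding a speaker code is a one-liner with a regular expression; the sketch below is illustrative only, and the function name and output layout are invented.

```python
import re

# Decode TORGO speaker codes as described above, e.g. "FC02" = second
# female control speaker, "M01" = first male dysarthric speaker.

def parse_speaker_code(code):
    m = re.fullmatch(r"(?P<gender>[FM])(?P<control>C?)(?P<order>\d{2})", code)
    if m is None:
        raise ValueError(f"unrecognized speaker code: {code!r}")
    return {
        "gender": "female" if m["gender"] == "F" else "male",
        "group": "control" if m["control"] else "dysarthric",
        "recruitment_order": int(m["order"]),
    }

print(parse_speaker_code("FC02"))
# {'gender': 'female', 'group': 'control', 'recruitment_order': 2}
```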
Each speaker's directory contains Session directories, which encapsulate the data recorded in the respective visit, and occasionally a Notes directory, which can include Frenchay assessments (a test for the measurement, description and diagnosis of dysarthria), notes about sessions (e.g., sensor errors), and other relevant notes.
Each Session directory can, but does not necessarily, contain the following content:
* alignment.txt: This is a text file containing the sample offsets between audio files recorded simultaneously by the array microphone and the head-worn microphone.
* amps: These directories contain raw *.amp and *.ini files produced by the AG500 articulograph.
* phn_*: These directories contain phonemic transcriptions of audio data. Each file is plain text with a *.PHN file extension and a filename referring to the utterance number. These files were generated using the free Wavesurfer tool.
* pos: These directories contain the head-corrected positions, velocities, and orientations of sensor coils for each utterance, as generated by the AG500 articulograph.
* prompts: These directories contain orthographic transcriptions.
* rawpos: These directories are equivalent to the pos directories except that their articulographic content is not head-normalized to a constant upright position.
* wav_*: These directories contain the acoustics. Each file is a RIFF (little-endian) WAVE audio file (Microsoft PCM, 16 bit, mono 16000 Hz).
* wavall: These directories contain stereo recordings in which one channel contains the recorded acoustics and the other channel contains the analog peaks associated with the sweep signal, which is used by the AG500 hardware for synchronization.
Additionally, sessions recorded with the AG500 articulograph are marked with the file EMA, and those recorded with the video-based system are marked with the file VIDEO. Files with a date-based filename and a .txt extension (e.g., april232008cal2.txt, jan28cal3.txt) are the measured responses from calibration. The *.log and *.calset files contain descriptions of the calibration process, but not the final result of calibration.
See the readme file and the AG500 Wiki for more complete descriptions of the possible subfolders and of the AG500-specific files. Also, see session_contents.tsv for a tab-separated table of each session's subfolders and metadata files.
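Under the session layout described above, a minimal inventory pass over one session directory might look like the following sketch; the paths, helper name and returned structure are assumptions, so consult the readme for the authoritative layout.

```python
from pathlib import Path

# Inventory one TORGO session directory using the documented layout.

def inventory_session(session_dir):
    session = Path(session_dir)
    wavs = sorted(p for d in session.glob("wav_*") for p in d.glob("*.wav"))
    phns = sorted(p for d in session.glob("phn_*") for p in d.glob("*.PHN"))
    prompts_dir = session / "prompts"
    prompts = sorted(prompts_dir.iterdir()) if prompts_dir.is_dir() else []
    # Sessions are marked with an EMA or VIDEO file, per the description above.
    modality = "EMA" if (session / "EMA").exists() else (
        "VIDEO" if (session / "VIDEO").exists() else "unknown")
    return {"wav": wavs, "phn": phns, "prompts": prompts, "modality": modality}

info = inventory_session("FC02/Session1")  # hypothetical path
print(len(info["wav"]), "wav files;", info["modality"], "session")
```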
*Samples*
For an example of the data contained in this corpus, review these two audio samples: Dysarthric & Control.
*Updates*
None at this time.
-
C-004774: Digital Archive of Southern Speech
*Introduction*
Digital Archive of Southern Speech (DASS) was developed by the University of Georgia. It is a subset of the Linguistic Atlas of the Gulf States (LAGS), which is in turn part of the Linguistic Atlas Project (LAP). DASS contains approximately 370 hours of English speech data from 30 female speakers and 34 male speakers in .wav and .mp3 formats, along with associated metadata about the speakers and the recordings, and maps in .jpeg format relating to the recording locations.
LAP consists of a set of survey research projects about the words and pronunciation of everyday American English, the largest project of its kind in the United States. Interviews with thousands of native speakers across the country have been carried out since 1929. LAGS surveyed the everyday speech of Georgia, Tennessee, Florida, Alabama, Mississippi, Arkansas, Louisiana, and Texas in a series of 914 audio-taped interviews conducted from 1968 to 1983. Interviews average approximately six hours in length; the systematic LAGS tape archive amounts to 5,500 hours of sound recordings. DASS is a collection of 64 interviews from LAGS selected to cover a range of speech across the region and to represent multiple education levels and ethnic backgrounds.
This release is distributed on an external hard drive and contains instructions for using the media and navigating to the LICHEN program.
Digital Archive of Southern Speech - NLP Version (LDC2016S05), an alternate version suitable for natural language processing and human language technology applications, is also available.
*Data*
The DASS speakers' average age is 61 years; there are 30 women and 34 men from the Gulf States region represented in this release. The interviews cover common topics such as family, the weather, household articles and activities, agriculture and social connections.
The interviews were originally recorded in the field on reel-to-reel audio tape. A digital version of every reel of tape was then made, one .wav file per reel, usually about one hour of sound. Each interview thus consists of a set of 3 to 13 reels, or roughly 3 to 13 interview hours. Personally identifying or sensitive information in the files was replaced with a tone to protect the privacy and assure ethical treatment of speakers. Each .wav file is split into multiple .mp3 files based on the topic of conversation and labeled accordingly. Included spreadsheets provide information about the speakers, the labels used for topics and the sound files.
Also included in this release is a version of the LICHEN software developed at the University of Oulu, Finland. LICHEN allows users to browse and search through the audio data in a more advanced fashion using a graphical interface. Further information and instructions for LICHEN can be found within the docs folder of this release.
*Samples*
For an example of the data contained in this corpus, review this audio sample.
*Updates*
None at this time.
*Authorship*
The following people were involved with the DASS project:
William A. Kretzschmar, Jr., Paulina Bounds, Jacqueline Hettel and Steven Coats, University of Georgia
Lee Pederson, Emory University
Lisa Lena Opas-Hänninen, Ilkka Juuso and Tapio Seppänen, University of Oulu (Finland)
*Sponsorship*
The Atlas Data contained herein comprises information collected from the 1930s to 2010 and has been compiled from diverse sources by, and under the direction of, Dr. William A. Kretzschmar, Harry and Jane Wilson Professor in Humanities at the Department of English of The University of Georgia.
Compilation and digitization of this work was funded, in part, by the US National Science Foundation and by the US National Endowment for the Humanities.
Additional information about the Atlas Project can be obtained at http://www.lap.uga.edu/Home.html.
- isPartOf: Linguistic Atlas of the Gulf States
-
C-004779: GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1
*Introduction*
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 was developed by LDC. Along with other corpora, the parallel text in this release comprised machine translation training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast conversation (BC) data collected by LDC between 2004 and 2007 and transcribed by LDC or under its direction.
LDC has released the following GALE Phase 1 & 2 Arabic Parallel Text data sets:
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
* GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
* GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
* GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
* GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
*Data*
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 includes 36 source-translation document pairs, comprising 169,109 words of Arabic source text and its English translation. Data is drawn from thirteen distinct Arabic programs broadcast between 2004 and 2007 from the following sources: Al Alam News Channel, a broadcaster located in Iran; Aljazeera, a regional broadcast programmer based in Doha, Qatar; Dubai TV, located in Dubai, United Arab Emirates; Oman TV, a national broadcaster located in the Sultanate of Oman; and Radio Sawa, a U.S. government-funded regional broadcaster. Broadcast conversation programming is generally more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtable discussions. The programs in this release focus on current events topics.
The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic-to-English translation guidelines, which are included with this release. Bilingual LDC staff performed quality control procedures on the completed translations.
Source data and translations are distributed in TDF format. TDF files are tab-delimited text files in which each line contains one segment of text along with metadata about that segment. Each field in the TDF file is described in TDF_format.txt. All data are encoded in UTF-8.
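Since TDF is tab-delimited UTF-8 text, a generic reader is enough to get started. The sketch below uses placeholder column names, as the authoritative field layout lives in TDF_format.txt; the filename and any comment-line convention are assumptions.

```python
import csv

# Minimal tab-delimited TDF reader (sketch). Column names are illustrative
# placeholders; the real field layout is documented in TDF_format.txt.

def read_tdf(path, column_names):
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip empty rows and any comment/metadata lines, if present.
            if row and not row[0].startswith(";;"):
                yield dict(zip(column_names, row))

columns = ["file", "channel", "start", "end", "speaker", "text"]  # placeholder
for segment in read_tdf("example.tdf", columns):                  # hypothetical file
    print(segment["start"], segment["text"])
```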
*Samples*
Please follow the links for samples in Arabic and English.
*Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
*Updates*
None at this time.