Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330; showing items 1911-1920 of 2023

C-004938: IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g
*Introduction*

IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 213 hours of Tagalog conversational and scripted telephone speech collected in 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

*Data*

The Tagalog speech in this release represents that spoken in the North, Central and South dialect regions in the Philippines. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format. Transcripts are encoded in UTF-8. Further information about transcription methodology is contained in the documentation accompanying this release.
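
For readers unfamiliar with the encoding, the following is a minimal sketch of how a single 8-bit A-law byte expands to a 16-bit linear PCM sample, following the commonly used ITU-T G.711 expansion logic. It is illustrative only: the function names `alaw_to_linear` and `decode_alaw` are hypothetical, and the sketch does not parse the SPHERE header, which would have to be stripped before the payload bytes are decoded.

```python
def alaw_to_linear(byte: int) -> int:
    """Expand one 8-bit A-law code (0-255) to a signed 16-bit PCM value."""
    byte ^= 0x55                      # undo the bit inversion applied by the encoder
    mantissa = (byte & 0x0F) << 4     # 4-bit mantissa, shifted into position
    segment = (byte & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    # After the XOR, a set top bit marks a positive sample.
    return value if byte & 0x80 else -value


def decode_alaw(payload: bytes) -> list[int]:
    """Decode a headerless A-law byte string into a list of PCM samples."""
    return [alaw_to_linear(b) for b in payload]


# The two smallest-magnitude codes and the two largest-magnitude codes.
print(decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A])))
```

At 8 kHz and one byte per sample, one hour of A-law audio occupies 8,000 x 3,600 = 28,800,000 bytes before any header is counted.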

Evaluation data is available from NIST in support of OpenKWS.

*Samples*

Please view these audio and transcription samples.

*Updates*

None at this time.
- hasVersion: C-004913: IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c
- hasVersion: C-004923: IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a
- hasVersion: C-004924: IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b
- hasVersion: C-004930: IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY
- hasVersion: C-004932: IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5
- hasVersion: C-004934: IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a
- hasVersion: C-004943: IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7
- hasVersion: C-004950: IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b
- hasVersion: C-004977: IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d
- hasVersion: C-005035: IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a
C-004943: IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7
*Introduction*

IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Vietnamese conversational and scripted telephone speech collected in 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

*Data*

The Vietnamese speech in this release represents that spoken in the North, North-Central, Central and Southern dialect regions in Vietnam. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 64 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format. Transcripts are encoded in UTF-8. Further information about transcription methodology is contained in the documentation accompanying this release.

Evaluation data is available from NIST in support of OpenKWS.

*Samples*

Please view this audio sample and transcript sample.

*Updates*

None at this time.
- hasVersion: C-004913: IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c
- hasVersion: C-004923: IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a
- hasVersion: C-004924: IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b
- hasVersion: C-004930: IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY
- hasVersion: C-004932: IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5
- hasVersion: C-004934: IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a
- hasVersion: C-004938: IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g
- hasVersion: C-004950: IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b
- hasVersion: C-004977: IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d
- hasVersion: C-005035: IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a
C-004948: GALE Phase 3 Arabic Broadcast News Speech Part 2
*Introduction*

GALE Phase 3 Arabic Broadcast News Speech Part 2 was developed by the Linguistic Data Consortium (LDC) and comprises approximately 128 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast News Transcripts Part 2 (LDC2017T04).

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University of Science and Technology, Hong Kong (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

LDC’s local broadcast collection system is highly automated, easily extensible and robust and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory are contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.

LDC designed a portable platform for remote broadcast collection. This is a TiVo-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.

Medianet collected Arabic programming from across the Gulf region using its internal system and LDC's portable broadcast collection platform installed in 2008. The portable platform deployed at the Medianet Tunisian collection facility collected multiple streams of regional Arabic programming from various sources. MTC collected Arabic programming using its internal collection system.

*Data*

The recordings in this release feature news broadcasts focusing principally on current events from the following sources: Abu Dhabi TV, United Arab Emirates; Al Alam News Channel, based in Iran; Al Arabiya, a news television station based in Dubai; Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster located in Doha, Qatar; Al-Manar TV, a broadcast programmer located in Lebanon; Al Ordiniyah, a national broadcast station in Jordan; Al Sharqiya, an Iraqi television station; Dubai TV, a broadcast station in the United Arab Emirates; Kuwait TV, a national broadcast station in Kuwait; Nile TV, a broadcast programmer based in Egypt; Oman TV, a national broadcaster located in the Sultanate of Oman; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.

This release contains 175 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.

*Samples*

Please listen to this sample.

*Updates*

None at this time.

*Acknowledgment*

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
- hasVersion: C-004926: GALE Phase 3 Arabic Broadcast News Speech Part 1
- hasFormat: C-004949: GALE Phase 3 Arabic Broadcast News Transcripts Part 2
C-004949: GALE Phase 3 Arabic Broadcast News Transcripts Part 2
*Introduction*

GALE Phase 3 Arabic Broadcast News Transcripts Part 2 was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 128 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 3 Arabic Broadcast News Speech Part 2 (LDC2017S02).

The recordings for transcription feature news broadcasts focusing primarily on current events from the following sources: Abu Dhabi TV, United Arab Emirates; Al Alam News Channel, based in Iran; Al Arabiya, a news television station based in Dubai; Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster located in Doha, Qatar; Al-Manar TV, a broadcast programmer located in Lebanon; Al Ordiniyah, a national broadcast station in Jordan; Al Sharqiya, an Iraqi television station; Dubai TV, a broadcast station in the United Arab Emirates; Kuwait TV, a national broadcast station in Kuwait; Nile TV, a broadcast programmer based in Egypt; Oman TV, a national broadcaster located in the Sultanate of Oman; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.

*Data*

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 721,846 tokens. The transcripts were created with the LDC tool XTrans, which supports manual transcription and annotation of audio recordings. XTrans is available at https://www.ldc.upenn.edu/language-resources/tools/xtrans.
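
As a rough illustration of working with tab-delimited, UTF-8 transcripts, the sketch below counts whitespace-separated tokens in one column of such a file. The column position (`transcript_col`), the convention of skipping lines that start with `;;`, the function name and the sample rows are all assumptions made for this example, not details confirmed by this record. The documentation shipped with the release is the authority on the actual TDF layout, and the published token count may rest on a different tokenization.

```python
import csv
import io


def count_tokens(tdf_text: str, transcript_col: int, comment_prefix: str = ";;") -> int:
    """Count whitespace-separated tokens in one column of tab-delimited text."""
    total = 0
    reader = csv.reader(io.StringIO(tdf_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and (assumed) comment lines.
        if not row or row[0].startswith(comment_prefix):
            continue
        if transcript_col < len(row):
            total += len(row[transcript_col].split())
    return total


# Hypothetical rows: file, channel, start, end, speaker, transcript.
sample = (
    ";; hypothetical comment line\n"
    "fileA\t0\t0.00\t2.50\tspeaker1\tmarhaban bikum\n"
    "fileA\t0\t2.50\t4.00\tspeaker2\tshukran\n"
)
print(count_tokens(sample, transcript_col=5))
```

To process a real file, read it with `open(path, encoding="utf-8", newline="")` and pass the text in. A header row, if present, would need to be skipped as well.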

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR in the filename were developed using QTR transcription; files with QRTR in the filename were developed using QRTR transcription.

*Samples*

Please view this sample.

*Updates*

None at this time.

*Acknowledgment*

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
C-004950: IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b
*Introduction*

IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Haitian Creole conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

*Data*

The Haitian Creole speech in this release represents that spoken in the Northern, Western and Southern dialect regions in Haiti. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 75 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format or 48kHz 24-bit PCM encoded audio in wav format. Transcripts are encoded in UTF-8. Further information about transcription methodology is contained in the documentation accompanying this release.

Evaluation data is available from NIST in support of OpenKWS.

*Samples*

Please view this audio sample and this text sample.

*Updates*

None at this time.
- hasVersion: C-004913: IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c
- hasVersion: C-004923: IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a
- hasVersion: C-004924: IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b
- hasVersion: C-004930: IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY
- hasVersion: C-004932: IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5
- hasVersion: C-004934: IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a
- hasVersion: C-004938: IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g
- hasVersion: C-004943: IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7
- hasVersion: C-004977: IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d
- hasVersion: C-005035: IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a
C-004957: Collins Multilingual database (MLD) – WordBank with audio files
The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, see ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank, see ELRA-T0377).

This version includes the corresponding audio files covering 26 languages of the 32 languages available in the Collins MLD Wordbank: Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese.

The WordBank contains 10,000 words for each language, XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs. An additional dataset of 10,000 headwords is included for 12 languages (Chinese, American and British English, French, German, Italian, Japanese, Korean, Iberian and Brazilian Portuguese, Iberian and Latin American Spanish).

The full database contains 10,000 audio files for each language (26 languages), and 10,000 additional audio files corresponding to the 10,000 additional headwords in 12 languages.

Audio was recorded by native speakers.
- hasVersion: C-004958: Collins Multilingual database (MLD) – PhraseBank with audio files
- hasFormat: D-005015: Collins Multilingual database (MLD) - WordBank
C-004958: Collins Multilingual database (MLD) – PhraseBank with audio files
The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, see ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank, see ELRA-T0377).

This version includes the audio files corresponding to each phrase in the Collins MLD PhraseBank for 28 languages: Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Farsi, Finnish, French, German, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese. Audio was recorded by a native speaker. It contains 2,000 audio files for each language.

The PhraseBank consists of 2,000 phrases in 28 languages. Phrases are organised under 12 topics and 67 subtopics, including talking to people, getting around, accommodation, shopping, leisure, communications, practicalities, health and beauty, eating and drinking.

Romanization is provided for Arabic, Farsi and Hindi.
- hasVersion: C-004957: Collins Multilingual database (MLD) – WordBank with audio files
- hasFormat: D-005014: Collins Multilingual database (MLD) - PhraseBank
C-004959: Arabic Speech Corpus
This speech corpus was developed as part of PhD work carried out by Nawar Halabi at the University of Southampton. The corpus was recorded with a Neumann TLM 103 studio microphone by one male speaker of South Levantine Arabic (Damascian accent) in a professional studio. The transcript was collected from “Aljazeera Learn” (Aljazeera 2015), a language learning website chosen because it contains fully diacritised text, which makes it easier to phonetise. The transcript was split into utterances based on punctuation to make the recording sessions easier for the speaker. Speech synthesized from this corpus has produced a high-quality, natural voice. The corpus consists of 1813 utterances totalling 3.7 hours:
- 2.1 hours of normal utterances,
- 1.6 hours of nonsense utterances (utterances that are not semantically, orthographically or syntactically correct).

This package corresponds to version 2.0 of the corpus and includes:
- 1813 .wav files containing spoken utterances,
- 1813 .lab files containing text utterances,
- 1813 .TextGrid files containing the phoneme labels with time stamps of the boundaries where these occur in the .wav files. These files can be opened using Praat software (see http://www.fon.hum.uva.nl/praat/),
- phonetic transcriptions are gathered in one single text file which has the form "[wav_filename]" "[Phoneme Sequence]" in every line.
- orthographic transcriptions are gathered in one single text file which has the form "[wav_filename]" "[Orthographic Transcript]" in every line. Orthography is in Buckwalter format (see http://www.qamus.org/transliteration.htm), which is more convenient for software that cannot read Arabic script and can easily be converted back to Arabic.
- An extra set of 18 minutes of fully annotated corpus, used to evaluate the corpus, is also provided (separate from above but with the same structure as above).
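
The one-line-per-utterance transcription files described above could be read along the following lines. This is a minimal sketch that assumes each line holds exactly two double-quoted fields separated by whitespace, as the description suggests. The function name and the sample strings are hypothetical, so the actual files should be checked before relying on it.

```python
import re

# Two double-quoted fields per line: "[wav_filename]" "[transcript]"
LINE_RE = re.compile(r'^\s*"([^"]*)"\s+"(.*)"\s*$')


def parse_transcriptions(text: str) -> dict[str, str]:
    """Map each wav filename to its phonetic or orthographic transcript."""
    entries = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {line_no} is not in the expected two-field form")
        name, transcript = match.groups()
        entries[name] = transcript
    return entries


# Hypothetical file names and Buckwalter-style strings, for illustration only.
sample = '"utt-0001" "marHabAF"\n"utt-0002" "$ukrAF"\n'
for name, transcript in parse_transcriptions(sample).items():
    print(name, transcript)
```

Raising an error on malformed lines, rather than skipping them silently, makes it easier to notice if the real files deviate from the assumed layout.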

Arabic Speech Corpus by Nawar Halabi is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
C-004960: Serbian emotional speech database
The database contains recordings from six actors, three of each gender. The following emotions were recorded: neutral, anger, happiness, sadness and fear. The database consists of 32 isolated words, 30 short semantically neutral sentences, 30 long semantically neutral sentences and one passage of 79 words. The database totals 2790 recordings, or approximately 3 hours of speech. Statistical evaluation of the database shows full phonetic balance with respect to the phonetic statistics of the Serbian language, and the statistics of other speech segments (syllables, consonant sets, accents) agree with the overall statistics of the Serbian language. The GEES database was recorded in an anechoic studio at a sampling frequency of 22,050 Hz.
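
The stated total can be cross-checked with a little arithmetic, assuming every actor recorded every item in every emotion (the description implies this but does not say so outright). A short sketch, with variable names chosen for this example:

```python
actors = 6
emotions = ["neutral", "anger", "happiness", "sadness", "fear"]
items = {
    "isolated words": 32,
    "short sentences": 30,
    "long sentences": 30,
    "passage": 1,
}

# Items each actor reads once per emotion.
items_per_emotion = sum(items.values())

# Assumption: a full actor x emotion x item grid with no missing recordings.
total_recordings = actors * len(emotions) * items_per_emotion

print(items_per_emotion, total_recordings)
```

Under that assumption the product matches the 2790 recordings reported in the description.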
C-004961: SecuVoice
SecuVoice is a corpus of single-channel utterances in Spanish containing sequences of isolated digits from zero to nine. The utterances were acquired with two different devices: a mid-range smartphone and a high-end one. For both models, the utterances were stored as uncompressed monophonic WAV files with a sampling frequency of 8000 Hz and 16 bits per sample.

This database is especially suitable for research on biometrics and secure applications that integrate both automatic speech recognition (ASR) and speaker recognition/verification.

SecuVoice contains a total of 7,098 utterances (169 speakers x 42 utt./speaker) with 34,476 digits (204 digits/speaker). Utterances are arranged into two different datasets: (i) the ENROLL dataset contains the 1,014 enrollment utterances (169 speakers x 6 enroll. utt./speaker) with 10,140 digits; (ii) the VERIF dataset contains the 6,084 verification utterances (169 speakers x 36 verif. utt./speaker) with 24,336 digits. Each digit from zero to nine is present 3,380 times, except digits three and five, which are over-represented in the VERIF dataset (2,704 occurrences each, against 2,366 for each of the other digits) and therefore appear 3,718 times each.
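
The counts in this paragraph are mutually consistent, which the following sketch verifies. It uses only the figures quoted above plus one assumption that the record does not state explicitly: that the ENROLL dataset is balanced across the ten digits. Variable names are illustrative.

```python
speakers = 169
enroll_utts_per_speaker, verif_utts_per_speaker = 6, 36
enroll_digits, verif_digits = 10_140, 24_336

total_utts = speakers * (enroll_utts_per_speaker + verif_utts_per_speaker)
total_digits = enroll_digits + verif_digits
digits_per_speaker = total_digits // speakers

# VERIF occurrences per digit: three and five occur more often than the rest.
verif_per_digit = {d: 2_366 for d in range(10)}
verif_per_digit[3] = verif_per_digit[5] = 2_704

# Assumption: ENROLL digits are spread evenly over the ten digits.
enroll_per_digit = enroll_digits // 10

overall_per_digit = {d: enroll_per_digit + n for d, n in verif_per_digit.items()}

print(total_utts, total_digits, digits_per_speaker)
print(sum(verif_per_digit.values()), overall_per_digit[0], overall_per_digit[3])
```

With a balanced ENROLL set, each digit occurs 1,014 times there, which reproduces both the 3,380 and the 3,718 overall figures.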

Along with the WAV files containing the speech utterances, XML annotation files containing detailed information about the speakers and the recorded sequences of digits are provided.
