Language resource #: 3330
-
C-004976: The FAME! Speech Corpus
The components of the Frisian data collection are speech and language resources gathered for building a large vocabulary ASR system for the Frisian language.
Firstly, a new broadcast database is created by collecting recordings from the archives of the regional broadcaster Omrop Fryslân, and annotating them with various information such as the language switches and speaker details.
The second component of this collection is a language model created on a text corpus with diverse vocabulary.
Thirdly, a Frisian phonetic dictionary with the mappings between the Frisian words and phones is built to make the ASR viable for this under-resourced language.
Finally, an ASR recipe is provided which uses all previous resources to perform recognition and present the recognition accuracies.
The corpus consists of 203 audio segments, each approximately 5 minutes long, extracted from various radio programs covering a time span of almost 50 years (1966-2015), which adds a longitudinal dimension to the database.
The content of the recordings is very diverse, including radio programs about culture, history, literature, sports, nature, agriculture, politics, society and languages.
The total duration of the manually annotated radio broadcasts is 18 hours, 33 minutes and 57 seconds. The stereo audio data has a sampling frequency of 48 kHz and 16-bit resolution per sample.
The available meta-information helped the annotators to identify the speakers and mark them either using their names or the same label (if the name is not known). There are 309 identified speakers in the FAME! Speech Corpus, 21 of whom appear at least 3 times in the database. These speakers are mostly program presenters and celebrities appearing multiple times in different recordings over the years. There are 233 unidentified speakers due to lack of meta-information.
The total number of word- and sentence-level code-switching cases in the FAME! Speech Corpus is 3837.
Music portions have been replaced by noise, except where these overlap with speech.
-
C-004977: IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d
*Introduction*
IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 200 hours of Swahili conversational and scripted telephone speech collected between 2012 and 2014, along with corresponding transcripts.
The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.
*Data*
The Swahili speech in this release represents that spoken in the Nairobi dialect region of Kenya. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.
Audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format or 48kHz 24-bit PCM encoded audio in wav format. Transcripts are encoded in UTF-8. Further information about transcription methodology is contained in the documentation accompanying this release.
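For readers unfamiliar with the 8-bit a-law encoding mentioned above, the following is a minimal Python sketch of expanding a-law bytes to 16-bit linear PCM values. It assumes the standard ITU-T G.711 A-law layout, which the release description does not state explicitly, and it works on raw sample bytes only: parsing the sphere header is not covered. The function names are illustrative and are not part of the release.

```python
def alaw_to_pcm16(byte):
    """Expand one A-law byte to a signed 16-bit linear sample (G.711 layout assumed)."""
    byte ^= 0x55                      # undo the even-bit inversion applied by the encoder
    mantissa = (byte & 0x0F) << 4     # 4-bit mantissa, moved into position
    segment = (byte & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # After the XOR, a set top bit marks a positive sample.
    return magnitude if byte & 0x80 else -magnitude


def decode_alaw(data):
    """Decode a bytes object of A-law samples into a list of integers."""
    return [alaw_to_pcm16(b) for b in data]


if __name__ == "__main__":
    # 0xD5 and 0x55 are the two smallest-magnitude codes.
    print(decode_alaw(bytes([0xD5, 0x55, 0xAA])))
```

Each 8-bit code maps to one sample, so one second of 8kHz audio occupies 8000 bytes before expansion.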
Evaluation data is available from NIST in support of OpenKWS.
- hasVersion: C-004913: IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c
- hasVersion: C-004923: IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a
- hasVersion: C-004924: IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b
- hasVersion: C-004930: IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY
- hasVersion: C-004932: IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5
- hasVersion: C-004934: IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a
- hasVersion: C-004938: IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g
- hasVersion: C-004943: IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7
- hasVersion: C-004950: IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b
- hasVersion: C-005035: IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a
-
C-004978: Noisy TIMIT Speech
*Introduction*
Noisy TIMIT Speech was developed by the Florida Institute of Technology and contains approximately 322 hours of speech from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) modified with different additive noise levels. Only the audio has been modified; the original arrangement of the TIMIT corpus is still as described by the TIMIT documentation.
*Data*
The additive noise types are white, pink, blue, red, violet and babble noise, with noise levels varying in 5 dB (decibel) steps over a range of 5 to 50 dB.
The color of noise refers to the power spectrum of a noise signal. Sound waves have two characteristics: frequency, which describes how fast the waveform vibrates per second; and amplitude, the size of the waveform. Colored noises are named in an analogy to the colors of light. For instance, white noise contains all audible frequencies just as white light contains all frequencies in the visible range. Non-white colored noises have more energy concentrated at the high or low end of the sound spectrum. White, pink and blue noise are officially defined in the federal telecommunications standard.
The white, pink, blue, red and violet noise types added to the TIMIT data in this release were generated artificially using MATLAB. For the babble noise, a random segment of recorded babble speech was selected and scaled relative to the power of the original TIMIT audio signal.
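The scaling step described above can be sketched in Python as follows. This is a minimal illustration, not the procedure used to build the corpus: it assumes the stated noise levels are signal-to-noise ratios in dB, represents audio as plain lists of floats, and uses hypothetical function names. The original noise was generated with MATLAB; Gaussian white noise stands in for the recorded babble in the usage example.

```python
import math
import random


def mean_power(samples):
    """Average power of a signal: the mean of its squared samples."""
    return sum(s * s for s in samples) / len(samples)


def random_segment(noise, length, rng):
    """Pick a random contiguous segment of the given length from a longer recording."""
    if len(noise) < length:
        raise ValueError("noise recording is shorter than the speech signal")
    start = rng.randrange(len(noise) - length + 1)
    return noise[start:start + length]


def scale_noise_to_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power equals snr_db (in dB)."""
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise segment is silent")
    target_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_power / noise_power)
    return [gain * n for n in noise]


def add_noise(speech, noise, snr_db, rng):
    """Mix a random, power-scaled noise segment into the speech signal."""
    segment = random_segment(noise, len(speech), rng)
    scaled = scale_noise_to_snr(speech, segment, snr_db)
    return [s + n for s, n in zip(speech, scaled)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in signals: a 440 Hz tone at 16 kHz and Gaussian white noise.
    speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]
    for snr_db in range(5, 55, 5):  # 5, 10, ..., 50 dB
        noisy = add_noise(speech, noise, snr_db, rng)
        print(snr_db, len(noisy))
```

Scaling by the square root of the power ratio is used because power grows with the square of amplitude.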
All audio files are presented as single channel 16kHz 16-bit flac.
-
C-004979: Danish Propbank
The Danish Propbank (DPB) is a multi-layer treebank, annotated not only with morphosyntactic, but also with semantic information, in particular propositions/frames with VerbNet classes and semantic roles for both arguments and satellites. In addition, the corpus has been annotated with 20 Named Entity classes and a 200-category semantic ontology for nouns. The text samples are taken from Korpus 2010, compiled by the Society for Danish Language and Literature (http://korpus.dsl.dk/resources.html), and contain samples of written Danish from a variety of both formal and informal texts, such as newspapers, magazines, blogs, chat fora and parliamentary debates. The treebank consists of about 87,000 tokens. There are over 12,000 frames with 32,000 role instances. It can be regarded as a semantic sister treebank complementing the older Arboretum treebank (see ELRA-W0084). The two data sets also complement each other with regard to time periods and text types, together covering 3 decades of Danish text.
- references: Korpus 2010
-
C-004980: TRAD Chinese-French Email Parallel corpus – Test Set
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in French. The source texts are a selection of emails from the Speechocean King-NLP-001 corpus, a corpus of private emails collected from the daily life and business domains. The translation has been conducted by two different translation teams following a strict protocol aimed at producing high quality translations.
The content has also been translated into English (see ELRA-W0115).
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA). It was used as a test set for an internal MT evaluation campaign.
- references: Speechocean King-NLP-001
- hasVersion: C-004981: TRAD Chinese-English Email Parallel corpus – Test Set
- hasVersion: C-004982: TRAD Chinese-French Email Parallel corpus – Development Set
- hasVersion: C-004983: TRAD Chinese-English Email Parallel corpus – Development Set
-
C-004981: TRAD Chinese-English Email Parallel corpus – Test Set
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in English. The source texts are a selection of emails from the Speechocean King-NLP-001 corpus, a corpus of private emails collected from the daily life and business domains. The translation has been conducted by two different translation teams following a strict protocol aimed at producing high quality translations.
The content has also been translated into French (see ELRA-W0116).
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA). It was used as a test set for an internal MT evaluation campaign.
- references: Speechocean King-NLP-001
- hasVersion: C-004980: TRAD Chinese-French Email Parallel corpus – Test Set
- hasVersion: C-004982: TRAD Chinese-French Email Parallel corpus – Development Set
- hasVersion: C-004983: TRAD Chinese-English Email Parallel corpus – Development Set
-
C-004982: TRAD Chinese-French Email Parallel corpus – Development Set
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and a reference translation in French. The source texts are a selection of emails from the Speechocean King-NLP-001 corpus, a corpus of private emails collected from the daily life and business domains.
The content has also been translated into English (see ELRA-W0113).
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA). It was used as a development set for MT systems.
- references: Speechocean King-NLP-001
- hasVersion: C-004980: TRAD Chinese-French Email Parallel corpus – Test Set
- hasVersion: C-004981: TRAD Chinese-English Email Parallel corpus – Test Set
- hasVersion: C-004983: TRAD Chinese-English Email Parallel corpus – Development Set
-
C-004983: TRAD Chinese-English Email Parallel corpus – Development Set
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and a reference translation in English. The source texts are a selection of emails from the Speechocean King-NLP-001 corpus, a corpus of private emails collected from the daily life and business domains.
The content has also been translated into French (see ELRA-W0114).
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA). It was used as a development set for MT systems.
- references: Speechocean King-NLP-001
- hasVersion: C-004980: TRAD Chinese-French Email Parallel corpus – Test Set
- hasVersion: C-004981: TRAD Chinese-English Email Parallel corpus – Test Set
- hasVersion: C-004982: TRAD Chinese-French Email Parallel corpus – Development Set
-
C-004984: TRAD Chinese-English News Articles Parallel corpus
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in English. The source texts are newspaper articles from the Chinese version of Voice of America. Articles are dated from 2011 and 2012. The translation has been conducted by two different translation teams following a strict protocol aimed at producing high quality translations.
The content has also been translated into French (see ELRA-W0111).
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA). It was used as a test set for an internal MT evaluation campaign.
-
C-004985: TRAD Chinese-French News Articles Parallel corpus
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in French. The source texts are newspaper articles from the Chinese version of Voice of America. Articles are dated from 2011 and 2012. The translation has been conducted by two different translation teams following a strict protocol aimed at producing high quality translations.
The content has also been translated into English (see ELRA-W0112).
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA). It was used as a test set for an internal MT evaluation campaign.