  • C-004332: SmartKom Home
    Multimodal/Multimedia Resources
    The SmartKom corpora were produced at BAS in the years 1999 to 2003 within the SmartKom project which was funded by the German Ministry of Education and Science. The corpus consists of 448 multi-modal recordings (“sessions”) of 224 persons in a Wizard-of-Oz setting.
    Release SKH 1.0 contains 130 recordings in the technical setup (“scenario”) SmartKom Home, which is intended as an intelligent communication assistant for the private environment. Naive users were asked to test a “prototype” for a market study, not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks within 4.5 minutes while left alone with the system. Instructions were kept to a minimum; the user knew only that the system is able to understand speech, gestures and even facial expressions, and should communicate more or less like a human.
    Main technical features of release SKH 1.0
    • Technical setup: Home (scenario)
    • Primary domain “Television”; secondary domain “Scheduling”, “Music Selection”
    • Primary domain “VCR”; secondary domain “Scheduling”, “Music Selection”
    • 65 users
    • 130 recording sessions; size: 490 GB
    • Recorded modalities:
    o Audio in max 10 channels
    o Video of face
    o Video of upper body from the left
    o Infrared video of the display area (to capture the 2D gestures) as input to the SIVIT device (Siemens gesture recognizer)
    o Video of the GUI output
    o Coordinates of graphic tableau (when pen was used)
    o Coordinates of SIVIT device (when finger/hands were used)
    • Annotations:
    o Transliteration
    o 2D Gesture
    o User states in three modalities
    o Turn segmentation
    • Documentation, TechDoks and publications
    • All annotations compatible with the “BAS Partitur Format” (BPF)

    The full database is provided on USB. Single volumes on DVD can be obtained upon demand.
  • C-004333: SmartKom Audio
    Desktop/Microphone
    The SmartKom corpora were produced at BAS in the years 1999 to 2003 within the SmartKom project which was funded by the German Ministry of Education and Science. The corpus consists of multi-modal recordings ('sessions') of 224 persons in a Wizard-of-Oz setting.
    Release SKAUDIO 1.0 contains all audio channel recordings of the SmartKom corpora SmartKom Public (cf. ELRA-S0136), SmartKom Home (cf. ELRA-S0316) and SmartKom Mobil (cf. ELRA-S0317).
    Main technical features of release SKAUDIO 1.0:
    • Technical setup: Public, Home, Mobil
    • 224 users
    • 448 recording sessions
    • Contained modalities:
    o Audio in 10 channels
    • Annotations:
    o Transliteration
    o 2D gestures
    o User states in three modalities
    o Turn segmentation
    • Documentation, TechDoks and publications
    • All annotations compatible with the “BAS Partitur Format” (BPF)
  • C-004334: SmartKom Mobil
    Multimodal/Multimedia Resources
    The SmartKom corpora were produced at BAS in the years 1999 to 2003 within the SmartKom project which was funded by the German Ministry of Education and Science. The corpus consists of multi-modal recordings (“sessions”) of 224 persons in a Wizard-of-Oz setting.
    Release SKM 1.0 contains 146 recordings in the technical setup (“scenario”) SmartKom Mobil, a portable PDA equipped with a net link and additional intelligent communication devices. Naive users were asked to test a “prototype” for a market study, not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks within 4.5 minutes while left alone with the system. Instructions were kept to a minimum; the user knew only that the system is able to understand speech and gestures, and should communicate more or less like a human.
    Experiments were not performed in the field but rather in a studio-like environment. Background noise was played back artificially, and the users did not carry the PDA in their hands but instead used a much smaller version of the SIVIT projection plane (to simulate a PDA display) and a pen as a pointing device. Speakers spoke into a headset microphone.
    Main technical features of release SKM 1.0
    • Technical setup: Mobil (scenario)
    • Primary domain “Tourism”; secondary domain “Telephony”
    • Primary domain “Navigation”; secondary domain “Looking for parking place”
    • 73 users
    • 146 recording sessions; size: 490 GB
    • Recorded modalities:
    o Audio in max 9 channels
    o Video of face
    o Video of upper body from the left
    o Infrared video of the display area (to capture the 2D gestures) as input to the SIVIT device (Siemens gesture recognizer)
    o Video of the GUI output
    o Coordinates of graphic tableau (when pen was used)
    o Coordinates of SIVIT device (when finger/hands were used)
    • Annotations:
    o Transliteration
    o 2D Gesture
    o User states in three modalities
    o Turn segmentation
    • Documentation, TechDoks and publications
    • All annotations compatible with the “BAS Partitur Format” (BPF)

    The full database is provided on USB. Single volumes on DVD can be obtained upon demand.
  • C-004335: European Parliament Interpretation Corpus (EPIC)
    Multimodal/Multimedia Resources
    The EPIC corpus is a parallel corpus of European Parliament speeches and their corresponding simultaneous interpretations. This corpus includes source speeches in Italian, English and Spanish and interpreted speeches in all possible combinations and directions (from English into Italian and Spanish; from Italian into English and Spanish; and from Spanish into Italian and English). It contains a total of 357 speeches (177,295 words).

    The EPIC corpus includes video clips of each source language speaker, audio clips of the corresponding interpreted target speeches and transcripts of all the clips. The corpus has been orthographically transcribed. Annotation includes paralinguistic features (truncated, mispronounced words, ...) and metadata (a header at the beginning of each transcript and information about the speaker and the speech). The transcripts are POS (part-of-speech) tagged and lemmatised. Non-tagged transcripts in text format are also available.

    Size of the nine subcorpora in the EPIC corpus:

    Sub-corpus                  Speeches  Word count  % of EPIC
    ORG-EN (source)                   81      42,705         25
    INT-EN-IT (interpretation)        81      35,765         20
    INT-EN-ES (interpretation)        81      38,066         21
    ORG-IT (source)                   17       6,765          4
    INT-IT-EN (interpretation)        17       6,708          4
    INT-IT-ES (interpretation)        17       7,052          4
    ORG-ES (source)                   21      14,406          8
    INT-ES-IT (interpretation)        21      12,833          7
    INT-ES-EN (interpretation)        21      12,995          7
    TOTAL                            357     177,295        100
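
The totals in the table can be re-derived from the nine sub-corpus rows. The following is a minimal sketch that only re-enters the figures listed above; the variable names are illustrative and not part of the corpus documentation. The listed percentages are rounded, so small deviations from the computed shares are to be expected.

```python
# Rows copied from the table above: (sub-corpus, speeches, words, listed %).
ROWS = [
    ("ORG-EN", 81, 42705, 25),
    ("INT-EN-IT", 81, 35765, 20),
    ("INT-EN-ES", 81, 38066, 21),
    ("ORG-IT", 17, 6765, 4),
    ("INT-IT-EN", 17, 6708, 4),
    ("INT-IT-ES", 17, 7052, 4),
    ("ORG-ES", 21, 14406, 8),
    ("INT-ES-IT", 21, 12833, 7),
    ("INT-ES-EN", 21, 12995, 7),
]

# Column sums, to be compared with the TOTAL row.
total_speeches = sum(speeches for _, speeches, _, _ in ROWS)
total_words = sum(words for _, _, words, _ in ROWS)

# Share of each sub-corpus in the overall word count, in percent.
shares = {name: 100 * words / total_words for name, _, words, _ in ROWS}

print(total_speeches, total_words)
for name, _, _, listed in ROWS:
    print(f"{name:10s} computed {shares[name]:5.2f}%  listed {listed}%")
```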


    The EPIC corpus was developed by a multidisciplinary research group based at the Department of Interdisciplinary Studies in Translation, Languages and Cultures (University of Bologna at Forlì), involving interpreting scholars, corpus linguists and IT technicians: Mariachiara Russo (coordinator), Claudio Bendazzoli, Cristina Monti, Annalisa Sandrelli, Marco Baroni, Silvia Bernardini, Gabriele Mack, Lorenzo Piccioni, Eros Zanchetta, Elio Ballardini, Peter Mead.
  • C-004336: GlobalPhone Thai
    Desktop/Microphone
    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6) and the same recording equipment for all languages. The transcriptions are internally validated and supplemented by special markers for spontaneous effects such as stuttering and false starts, and for non-verbal effects such as laughing and hesitations. Speaker information such as age, gender and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 1900 native adult speakers.
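
As a rough orientation for storage planning, the stated audio format (16-bit, 16 kHz, mono) translates directly into a data rate. The sketch below assumes uncompressed, headerless PCM; the function name `pcm_bytes` is illustrative and not part of any GlobalPhone tool.

```python
def pcm_bytes(duration_s: float, sample_rate_hz: int = 16_000,
              bits_per_sample: int = 16, channels: int = 1) -> int:
    """Size in bytes of uncompressed PCM audio (no header, no compression)."""
    return int(duration_s * sample_rate_hz * (bits_per_sample // 8) * channels)


one_second = pcm_bytes(1)
# "Over 450 hours" is taken as exactly 450 hours, i.e. a lower bound.
whole_corpus = pcm_bytes(450 * 3600)

print(one_second)           # bytes per second of speech
print(whole_corpus / 1e9)   # decimal gigabytes for 450 hours
```

Under these assumptions one second of speech occupies 32,000 bytes, and 450 hours come to roughly 52 GB before compression.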

    The data is compressed with the shorten program written by Tony Robinson, available from Softsound's web page (http://www.softsound.com/) and usable on Linux distributions or in emulated environments such as Cygwin. Alternatively, the data can be delivered uncompressed.

    The Thai part of GlobalPhone was collected between July and August 2003 in Bangkok, Thailand. Data was collected from 98 speakers in total, of whom 65 were female and 27 were male; for six speakers the gender is not documented. The speakers were undergraduate and graduate students aged 18 to 25. Each speaker read about 160 utterances from newspaper articles, corresponding to roughly 20 minutes of speech per person. The speech was recorded using a close-talking Sennheiser HM420 microphone in a push-to-talk scenario. All data were recorded at 16 kHz and 16-bit resolution in PCM format. The data collection took place in two small rooms and one medium-sized room with very low background noise.

    The text data used for recording came mainly from news posted on the newspaper websites listed below. We followed the standard GlobalPhone protocols and focused on national and international politics and economics news (see [Schultz 2002]). In sum, 14,039 utterances were spoken, corresponding to 260,000 words and covering a vocabulary of 7,400 words. The latter numbers depend on the segmentation of Thai script into words, which is necessarily somewhat arbitrary since Thai script does not mark word boundaries. For speech recognition purposes a segmentation into word segments could be provided.

    The Thai data are organized in a training set of 82 speakers, a development set of 8 speakers (spk IDs 023, 025, 028, 037, 045, 061, 073, 085), and an evaluation set of 8 speakers (spk IDs 101-108). More details on corpus statistics, collection scenario, and system building based on the Thai part of GlobalPhone can be found in [Suebvisai et al., 2005].
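
The speaker-based split described for the Thai data can be expressed as a small lookup. This is a sketch only: the function name `thai_split` is hypothetical, speaker IDs are assumed to be zero-padded three-digit strings, and every ID not listed for development or evaluation is assumed to belong to the training set.

```python
# Development and evaluation speaker IDs as listed in the description.
DEV_IDS = {"023", "025", "028", "037", "045", "061", "073", "085"}
EVAL_IDS = {f"{n:03d}" for n in range(101, 109)}  # 101-108 inclusive


def thai_split(speaker_id: str) -> str:
    """Return 'dev', 'eval' or 'train' for a Thai speaker ID."""
    sid = speaker_id.zfill(3)  # tolerate IDs written without leading zeros
    if sid in DEV_IDS:
        return "dev"
    if sid in EVAL_IDS:
        return "eval"
    return "train"  # assumption: all remaining speakers are training data


print(thai_split("023"), thai_split("105"), thai_split("001"))
```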

    Thai Newspaper sources:
    http://www.bangkokbiznews.com
    http://www.dailynews.co.th
    http://www.manager.co.th
    http://www.matichon.co.th
    http://www.naewna.com
    http://www.thairath.co.th

    [Schultz 2002] Tanja Schultz (2002): GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, Proceedings of the International Conference on Spoken Language Processing, ICSLP 2002, Denver, CO, September 2002.
    [Suebvisai et al., 2005] Sinaporn Suebvisai, Paisarn Charoenpornsawat, Alan W Black, Monika Woszczyna, Tanja Schultz (2005): Thai Automatic Speech Recognition. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, Pennsylvania, March 2005.
  • C-004337: GlobalPhone Polish
    Desktop/Microphone
    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6) and the same recording equipment for all languages. The transcriptions are internally validated and supplemented by special markers for spontaneous effects such as stuttering and false starts, and for non-verbal effects such as laughing and hesitations. Speaker information such as age, gender and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 1900 native adult speakers.

    The data is compressed with the shorten program written by Tony Robinson, available from Softsound's web page (http://www.softsound.com/) and usable on Linux distributions or in emulated environments such as Cygwin. Alternatively, the data can be delivered uncompressed.

    The Polish part of GlobalPhone was collected from 102 native speakers in Poland, of whom 48 were female and 54 were male. The majority of speakers are between 20 and 39 years old; the overall age range is 18 to 65 years. Most of the speakers are non-smokers in good health. Each speaker read on average about 100 utterances from newspaper articles. The speech was recorded using a close-talking Sennheiser HM420 microphone in a push-to-talk scenario. All data were recorded at 16 kHz and 16-bit resolution in PCM format. The data collection took place in small and large rooms; about half of the recordings were made under very quiet conditions, the other half with moderate background noise. Information on recording place and environmental noise conditions is provided in a separate speaker session file for each speaker. The text data used for recording came mainly from news posted in the online edition of the national Polish newspaper Dziennik Polski (http://www.dziennik.krakow.pl/). We followed the standard GlobalPhone protocols and focused on national and international politics and economics news (see [Schultz 2002]). In sum, 10,130 utterances were spoken. The transcriptions are provided in Polish script in UTF-8 encoding and are also mapped to Roman script (ASCII). The Polish data are organized in a training set of 82 speakers, a development set of 10 speakers, and an evaluation set of another 10 speakers.

    [Schultz 2002] Tanja Schultz (2002): GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, Proceedings of the International Conference on Spoken Language Processing, ICSLP 2002, Denver, CO, September 2002.
  • C-004338: GlobalPhone Vietnamese
    Desktop/Microphone
    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6) and the same recording equipment for all languages. The transcriptions are internally validated and supplemented by special markers for spontaneous effects such as stuttering and false starts, and for non-verbal effects such as laughing and hesitations. Speaker information such as age, gender and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 1900 native adult speakers.

    The data is compressed with the shorten program written by Tony Robinson, available from Softsound's web page (http://www.softsound.com/) and usable on Linux distributions or in emulated environments such as Cygwin. Alternatively, the data can be delivered uncompressed.

    The Vietnamese part of GlobalPhone was collected in summer 2009. In total 160 speakers were recorded, 140 of them in the cities of Hanoi and Ho Chi Minh City in Vietnam, and an additional 20 in Karlsruhe, Germany. All speakers are native speakers of Vietnamese, covering the main dialectal variants from South and North Vietnam. Of these 160 speakers, 70 were female and 90 were male. The majority of speakers are well educated, being graduated students and engineers. The age distribution of the speakers ranges from 18 to 65 years. Each speaker read between 50 and 200 utterances from newspaper articles, corresponding to roughly 9.5 minutes of speech or 138 utterances per person. The speech was recorded using a close-talking Sennheiser HM420 microphone in a push-to-talk scenario using an in-house laptop-based data collection toolkit. All data were recorded at 16 kHz and 16-bit resolution in PCM format. The data collection took place in small rooms with very low background noise. Information on recording place and environmental noise conditions is provided in a separate speaker session file for each speaker.

    The speech data was recorded in two phases. In the first phase, data was collected from 140 speakers in Hanoi and Ho Chi Minh City. In the second phase, utterances were selected from the text corpus in order to cover rare Vietnamese phonemes; this phase was carried out with 20 Vietnamese graduate students living in Karlsruhe. In sum, 22,112 utterances were spoken, corresponding to 25.25 hours of speech.

    The text data used for recording came mainly from news posted in the online editions of the 15 Vietnamese newspaper websites listed below; the first 12 were used for the training set and the last three for the development and evaluation sets. The text collected from the first 12 websites covers almost 4 million word tokens with a vocabulary of 30,000 words, resulting in an out-of-vocabulary rate of 0% on the development set and 0.067% on the evaluation set. For the text selection we followed the standard GlobalPhone protocols and focused on national and international politics and economics news (see [Schultz 2002]). The transcriptions are provided in Vietnamese-style Roman script, i.e. using several diacritics, encoded in UTF-8.

    The Vietnamese data are organized in a training set of 140 speakers with 22.15 hours of speech, a development set of 10 speakers (6 from North and 4 from South Vietnam) with 1:40 hours of speech, and an evaluation set of 10 speakers with the same gender and dialect distribution as the development set and 1:30 hours of speech. More details on corpus statistics, collection scenario, and system building based on the Vietnamese part of GlobalPhone can be found in [Vu and Schultz, 2009, 2010].
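
For readers unfamiliar with the out-of-vocabulary figures quoted above, the sketch below shows one common way such a rate is computed: the percentage of running word tokens in a test text that are missing from the vocabulary. Whether the GlobalPhone figures were computed per token in exactly this way is an assumption, and the vocabulary and token list are toy data invented for illustration.

```python
from typing import Iterable, Set


def oov_rate(tokens: Iterable[str], vocabulary: Set[str]) -> float:
    """Percentage of tokens that do not occur in the vocabulary."""
    token_list = list(tokens)
    if not token_list:
        raise ValueError("cannot compute an OOV rate for an empty text")
    unknown = sum(1 for token in token_list if token not in vocabulary)
    return 100.0 * unknown / len(token_list)


# Toy example (hypothetical data): one of five tokens is not in the vocabulary.
toy_vocabulary = {"xin", "chào", "việt", "nam"}
toy_tokens = ["xin", "chào", "việt", "nam", "hà"]
print(oov_rate(toy_tokens, toy_vocabulary))  # 20.0
```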

    Vietnamese Newspaper sources:
    http://www.tintuconline.vn
    http://www.nhandan.org.vn
    http://www.tuoitre.org.vn
    http://www.tinmoi.com.vn
    http://www.laodong.com.vn
    http://www.tet.tintuconline.com.vn
    http://www.anninhthudo.vn
    http://www.thanhnien.com.vn
    http://www.baomoi.com
    http://www.ca.cand.com.vn
    http://www.vnn.vn
    http://www.tinthethao.com.vn
    http://www.thethaovanhoa.vn
    http://www.vnexpress.net
    http://www.dantri.com

    [Schultz 2002] Tanja Schultz (2002): GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, Proceedings of the International Conference on Spoken Language Processing, ICSLP 2002, Denver, CO, September 2002.
    [Vu and Schultz, 2010] Ngoc Thang Vu, Tanja Schultz (2010): Optimization On Vietnamese Large Vocabulary Speech Recognition, 2nd Workshop on Spoken Languages Technologies for Under-resourced Languages, SLTU 2010, Penang, Malaysia, May 2010.
    [Vu and Schultz, 2009] Ngoc Thang Vu, Tanja Schultz (2009): Vietnamese Large Vocabulary Continuous Speech Recognition, Automatic Speech Recognition and Understanding, ASRU 2009, Merano.
  • C-004339: GlobalPhone Bulgarian
    Desktop/Microphone
    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6) and the same recording equipment for all languages. The transcriptions are internally validated and supplemented by special markers for spontaneous effects such as stuttering and false starts, and for non-verbal effects such as laughing and hesitations. Speaker information such as age, gender and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 1900 native adult speakers.

    The data is compressed with the shorten program written by Tony Robinson, available from Softsound's web page (http://www.softsound.com/) and usable on Linux distributions or in emulated environments such as Cygwin. Alternatively, the data can be delivered uncompressed.

    The Bulgarian part of GlobalPhone was collected in 2005 in the cities of Sofia and Pazardzhik, Bulgaria. All speakers are native speakers of Bulgarian from the western and central parts of Bulgaria. Data was collected from 77 speakers in total, of whom 45 were female and 32 were male. The majority of speakers are well educated, being graduated students, construction engineers, and teachers. The age distribution of the speakers ranges from 18 to 65 years. Of all speakers, 62 reported being non-smokers and 15 are smokers; no further information about health status is provided. Each speaker read on average about 112 utterances from newspaper articles, corresponding to roughly 16.6 minutes of speech or 1940 words per person. The speech was recorded using a close-talking Sennheiser HM420 microphone in a push-to-talk scenario using an in-house laptop-based data collection toolkit. All data were recorded at 16 kHz and 16-bit resolution in PCM format. The data collection took place in small rooms with low background noise, except for one speaker who was recorded in a public place. Information on recording place and environmental noise conditions is provided in a separate speaker session file for each speaker.

    The text data used for recording came mainly from news posted in the online editions of the three national Bulgarian newspaper websites listed below. About 350 articles with more than 10,000 sentences were downloaded and processed (manually edited to normalize and clean the text and to resolve abbreviations and numbers). We followed the standard GlobalPhone protocols and focused on national and international politics and economics news (see [Schultz 2002]). In sum, 8,674 utterances were spoken, corresponding to 21.4 hours of speech or 150,000 spoken words in total, covering a vocabulary of 23,000 words. The transcriptions are provided in Bulgarian script (Cyrillic) in UTF-8 encoding.

    The Bulgarian data are organized in a training set of 63 speakers, a development set of 7 speakers (spk IDs 051, 055, 058, 084, 090, 100, 106), and an evaluation set of 7 speakers (spk IDs 040, 059, 063, 068, 095, 109, 110).
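
The per-speaker averages quoted for the Bulgarian data follow from the corpus totals. The sketch below re-derives them from the figures in the description, assuming that "21.4 hours" is a decimal hour value; the variable names are illustrative.

```python
# Totals as stated in the description of the Bulgarian data.
speakers = 77
utterances = 8674
speech_hours = 21.4      # assumption: decimal hours, not hours and minutes
spoken_words = 150_000   # rounded figure from the description

utterances_per_speaker = utterances / speakers
minutes_per_speaker = speech_hours * 60 / speakers
words_per_speaker = spoken_words / speakers

# Consistency checks on the speaker counts given in the text.
gender_total = 45 + 32         # female + male
split_total = 63 + 7 + 7       # training + development + evaluation

print(round(utterances_per_speaker, 1),
      round(minutes_per_speaker, 1),
      round(words_per_speaker))
print(gender_total == speakers, split_total == speakers)
```

The computed values land close to the "about 112 utterances", "roughly 16.6 minutes" and "1940 words" per person given in the text; the small differences are consistent with rounded totals.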

    Bulgarian Newspaper sources:
    Banker: http://www.banker.bg
    Kesh: http://www.cash.bg
    Sega: http://www.segabg.com

    [Mircheva 2006] Aneliya Mircheva (2006): Bulgarian Speech Recognition and Multilingual Language Modeling, Project Term (Studienarbeit), Institute for Theoretical Informatics, University Karlsruhe.
    [Schultz 2002] Tanja Schultz (2002): GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, Proceedings of the International Conference on Spoken Language Processing, ICSLP 2002, Denver, CO, September 2002.
  • C-004340: GlobalPhone Hausa
    Desktop/Microphone
    The GlobalPhone pronunciation dictionaries, created within the framework of the multilingual speech and language corpus GlobalPhone, were developed in collaboration with the Karlsruhe Institute of Technology (KIT).

    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 17 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swedish (about 25000 entries), Turkish (31330 entries/31087 words), Vietnamese (38504 entries/29974 words), Chinese-Mandarin (73388 pronunciations), and Korean (3500 syllables).

    1) Dictionary Encoding:
    The pronunciation dictionary entries consist of full word forms. They are given either in the original script of the language, mostly in UTF-8 encoding (Bulgarian, Croatian, Czech, French, Polish, Russian, Spanish, Thai), corresponding to the trl-files of the GlobalPhone transcriptions, or in Romanized script (Arabic, German, Hausa, Japanese, Korean, Mandarin, Portuguese, Swedish, Turkish, Vietnamese), corresponding to the rmn-files of the GlobalPhone transcriptions. In the latter case the documentation mostly provides a mapping from the Romanized to the original script.

    2) Dictionary Phone set:
    The phone sets for each language were derived individually from the literature following best practices for automatic speech processing. Each phone set is explained and described in the documentation using the international standards of the International Phonetic Alphabet (IPA). For most languages a mapping to the language independent GlobalPhone naming conventions (indicated by “M_”) is provided for the purpose of data sharing across languages to build multilingual acoustic models.

    3) Dictionary Generation:
    Whenever the grapheme-to-phoneme relationship allowed, the dictionaries were created semi-automatically in a rule-based fashion using a set of grapheme-to-phoneme mapping rules. The number of rules depends heavily on the language. After the automatic creation process, all dictionaries were manually cross-checked by native speakers to correct potential errors of the automatic pronunciation generation. Most of the dictionaries have been applied to large vocabulary speech recognition. In many cases the GlobalPhone dictionaries were compared to straightforward grapheme-based speech recognition and to alternative sources such as Wiktionary, and were usually shown to be superior in terms of quality, coverage, and accuracy.
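
To illustrate what rule-based grapheme-to-phoneme conversion can look like, the sketch below applies a rule table with greedy longest-match lookup. This is not the GlobalPhone procedure: the matching strategy, the function name `g2p` and the toy rule table are all assumptions made for illustration, and real rule sets are language-specific and usually context-dependent.

```python
from typing import Dict, List

# Hypothetical toy rules: grapheme sequence -> phone symbols.
TOY_RULES: Dict[str, List[str]] = {
    "sch": ["S"],
    "ch": ["x"],
    "s": ["s"],
    "c": ["k"],
    "h": ["h"],
    "u": ["u"],
    "l": ["l"],
    "e": ["e"],
}


def g2p(word: str, rules: Dict[str, List[str]]) -> List[str]:
    """Convert a word to phones, always preferring the longest matching rule."""
    longest = max(len(grapheme) for grapheme in rules)
    phones: List[str] = []
    pos = 0
    while pos < len(word):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(longest, len(word) - pos), 0, -1):
            chunk = word[pos:pos + size]
            if chunk in rules:
                phones.extend(rules[chunk])
                pos += size
                break
        else:
            raise ValueError(f"no rule covers {word[pos]!r} in {word!r}")
    return phones


print(g2p("schule", TOY_RULES))  # ['S', 'u', 'l', 'e']
```

Words that no rule covers raise an error here; in a semi-automatic workflow such cases would be candidates for manual checking by a native speaker.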

    4) Format:
    The format of the dictionaries is the same across languages and is straightforward. Each line consists of one word form and its pronunciation, separated by a blank. The pronunciation is a sequence of phone symbols separated by blanks. Both words and their pronunciations are given in Tcl list format, i.e. enclosed in “{}”, since phones can carry tags indicating the tone and length of a vowel, or the word boundary tag “WB”, which indicates the boundary of a dictionary unit. The WB tag can, for example, be included as a standard question in the decision tree questions for capturing crossword models in context-dependent modeling. Pronunciation variants are indicated by (<n>) with n = 2, 3, 4, … indicating the number of variants per word. The order in which variants occur in the dictionary is not necessarily related to their frequency in the corpus.
    {word} {{w WB} o r {d WB}}
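
A dictionary line in this format can be read with a small brace-aware parser. The following is a minimal sketch rather than an official GlobalPhone tool: the function names are hypothetical, and it assumes that a variant marker such as "(2)" is appended directly to the word form and that words without a marker count as variant 1.

```python
import re
from typing import List, Tuple

Phone = Tuple[str, Tuple[str, ...]]  # (phone symbol, tags such as "WB")

# Assumption: a variant marker "(n)" is appended directly to the word form.
VARIANT_RE = re.compile(r"^(.+)\((\d+)\)$")


def parse_braced(text: str) -> list:
    """Parse blank-separated items with {}-nesting into nested Python lists."""
    root: list = []
    stack = [root]
    token = ""
    for ch in text:
        if ch in "{} \t":
            if token:  # any delimiter ends the current token
                stack[-1].append(token)
                token = ""
            if ch == "{":
                child: list = []
                stack[-1].append(child)
                stack.append(child)
            elif ch == "}":
                if len(stack) == 1:
                    raise ValueError("unbalanced '}' in: " + text)
                stack.pop()
        else:
            token += ch
    if token:
        stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unbalanced '{' in: " + text)
    return root


def parse_entry(line: str) -> Tuple[str, int, List[Phone]]:
    """Return (word, variant number, phones) for one dictionary line."""
    items = parse_braced(line.strip())
    if len(items) != 2:
        raise ValueError("expected a word and a pronunciation: " + line)
    word_item, pron_item = items
    word = " ".join(word_item) if isinstance(word_item, list) else word_item
    variant = 1
    match = VARIANT_RE.match(word)
    if match:
        word, variant = match.group(1), int(match.group(2))
    phones: List[Phone] = []
    for item in (pron_item if isinstance(pron_item, list) else [pron_item]):
        if isinstance(item, list):  # tagged phone such as {w WB}
            phones.append((item[0], tuple(item[1:])))
        else:
            phones.append((item, ()))
    return word, variant, phones


print(parse_entry("{word} {{w WB} o r {d WB}}"))
```

For the example line above this yields the word "word", variant 1, and four phones, of which "w" and "d" carry the WB tag.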

    5) Documentation: The pronunciation dictionary for each language is complemented by documentation that describes the format of the dictionary, the phone set including its mapping to the International Phonetic Alphabet (IPA), and the frequency distribution of the phones in the dictionary. Most of the pronunciation dictionaries have been successfully applied to large vocabulary speech recognition, and references to publications are given when available.
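  • C-004341: Asahi Shimbun Article Data (for Academic and Research Use), 2008 Edition
    A collection of newspaper article data containing about 150,000 articles published in the head-office editions of the Asahi Shimbun in 2008. Each article is labeled using 13 article types and 75 topic categories.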