Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing 621 - 630 of 2023.

C-001107: GlobalPhone Chinese-Shanghai
Desktop/Microphone
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Vietnamese (ELRA-S0322).

In each language, about 100 sentences were read by each of about 100 speakers. The texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6) and the same recording equipment for all languages. The transcriptions are internally validated and supplemented by special markers for spontaneous effects such as stuttering and false starts, and for non-verbal effects such as laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 1900 native adult speakers.
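As a rough illustration of what the stated audio format implies for storage, the sketch below converts between duration and uncompressed size for 16-bit, 16 kHz mono audio. It assumes raw PCM with no file headers and ignores the shorten compression; the function names are hypothetical and are not part of any GlobalPhone tooling.

```python
# Audio parameters taken from the corpus description.
SAMPLE_RATE_HZ = 16_000   # 16 kHz sampling rate
BITS_PER_SAMPLE = 16      # 16-bit samples
CHANNELS = 1              # mono


def bytes_per_second() -> int:
    """Bytes needed for one second of uncompressed PCM audio."""
    return SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS


def uncompressed_size_bytes(duration_seconds: float) -> int:
    """Approximate raw size of a recording of the given duration."""
    return int(duration_seconds * bytes_per_second())


def duration_seconds(num_bytes: int) -> float:
    """Approximate duration of a raw recording of the given size."""
    return num_bytes / bytes_per_second()


if __name__ == "__main__":
    hours = 450  # lower bound on the corpus size given in the description
    size = uncompressed_size_bytes(hours * 3600)
    print(f"{hours} hours of speech is about {size / 1e9:.2f} GB uncompressed")
```

Under these assumptions one second of speech takes 32,000 bytes, so 450 hours corresponds to roughly 52 GB before compression.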

The data is compressed with the shorten program written by Tony Robinson, which is available from Softsound's web page (http://www.softsound.com/), from Linux distributions, or in emulated environments such as Cygwin. Alternatively, the data can be delivered unshortened.

The Chinese-Shanghai corpus was produced using the People's Daily newspaper. It contains recordings of 41 speakers (16 males, 25 females) recorded in Shanghai, China. The following age distribution has been obtained: 1 speaker is below 19, 2 speakers are between 20 and 29, 13 speakers are between 30 and 39, 14 speakers are between 40 and 49, and 11 speakers are over 50.
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa
C-001108: GlobalPhone Croatian
Desktop/Microphone
The Croatian corpus was produced using the HRT and Obzor Nacional newspapers. It contains recordings of 94 speakers (38 males, 56 females) recorded in Zagreb, Croatia, and parts of Bosnia. The following age distribution has been obtained: 21 speakers are below 19, 30 speakers are between 20 and 29, 14 speakers are between 30 and 39, 15 speakers are between 40 and 49, and 13 speakers are over 50 (the age of 1 speaker is unknown).
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa
C-001109: GlobalPhone Czech
Desktop/Microphone
The Czech corpus was produced using the Ceskomoravsky Profit Journal and Lidove Noviny newspapers. It contains recordings of 102 speakers (57 males, 45 females) recorded in Prague, Czech Republic. The following age distribution has been obtained: 16 speakers are below 19, 70 speakers are between 20 and 29, 2 speakers are between 30 and 39, 9 speakers are between 40 and 49, and 5 speakers are over 50.
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa
C-001110: GlobalPhone French
Desktop/Microphone
The French corpus was produced using the Le Monde newspaper. It contains recordings of 100 speakers (49 males, 51 females) recorded in Grenoble, France. The following age distribution has been obtained: 3 speakers are below 19, 52 speakers are between 20 and 29, 16 speakers are between 30 and 39, 13 speakers are between 40 and 49, and 14 speakers are over 50 (the age of 2 speakers is unknown).
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa
C-001111: GlobalPhone German
Desktop/Microphone
The German corpus was produced using the Frankfurter Allgemeine and Sueddeutsche Zeitung newspapers. It contains recordings of 77 speakers (70 males, 7 females) recorded in Karlsruhe, Germany. No age distribution is available.
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa
C-001112: GlobalPhone Japanese
Desktop/Microphone
The Japanese corpus was produced using the Nikkei Shinbun newspaper. It contains recordings of 149 speakers (104 males, 44 females, 1 unspecified) recorded in Tokyo, Japan. The following age distribution has been obtained: 22 speakers are below 19, 90 speakers are between 20 and 29, 5 speakers are between 30 and 39, 2 speakers are between 40 and 49, and 1 speaker is over 50 (the age of 28 speakers is unknown).
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa
C-001113: GlobalPhone Korean
Desktop/Microphone
The Korean corpus was produced using the Hankyoreh Daily News. It contains recordings of 100 speakers (50 males, 50 females) recorded in Seoul, Korea. The following age distribution has been obtained: 7 speakers are below 19, 70 speakers are between 20 and 29, 19 speakers are between 30 and 39, and 3 speakers are between 40 and 49 (the age of 1 speaker is unknown).
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa
C-001114: GlobalPhone Portuguese (Brazilian)
Desktop/Microphone
The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The following age distribution has been obtained: 6 speakers are below 19, 58 speakers are between 20 and 29, 27 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 5 speakers are over 50 (the age of 1 speaker is unknown).
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa
C-001115: GlobalPhone Russian
Desktop/Microphone
The Russian corpus was produced using the Ogonyok Gaseta and Express-Chronika newspapers. It contains recordings of 115 speakers (61 males, 54 females) recorded in Minsk, Belarus. The following age distribution has been obtained: 9 speakers are below 19, 76 speakers are between 20 and 29, 9 speakers are between 30 and 39, 15 speakers are between 40 and 49, and 6 speakers are over 50.
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-004340: GlobalPhone Hausa
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
C-001116: GlobalPhone Spanish (Latin American)
Desktop/Microphone
The Spanish (Latin America) corpus was produced using the La Nacion newspaper. It contains recordings of 100 speakers (44 males, 56 females) recorded in Heredia and San Jose, Costa Rica. The following age distribution has been obtained: 20 speakers are below 19, 54 speakers are between 20 and 29, 13 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 8 speakers are over 50.
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa