Language Resource Search - SHACHI: Language Resource Metadata Database

Language resource #: 3330 Results 11 - 20 of 2023

C-000015: BABEL Bulgarian Database
Desktop/Microphone
The BABEL Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304). The project began in March 1995 and was completed in December 1998. The objective was to create a database of languages of Central and Eastern Europe in parallel to the EUROM1 databases produced by the SAM Project (funded by the ESPRIT programme).
The BABEL consortium included six partners from Central and Eastern Europe (who had the major responsibility of planning and carrying out the recording and labelling) and six from Western Europe (whose role was mainly to advise and in some cases to act as host to BABEL researchers). The five databases collected within the project concern the Bulgarian, Estonian, Hungarian, Polish, and Romanian languages.
The Bulgarian database consists of the basic "common" set which is:
- Many Talker Set: 30 males, 30 females; each to read twice the five blocks of numbers (each of which contains 10 numbers), 3 connected passages and one «filler» passage.
- Few Talker Set: 5 males, 5 females, selected from the above group: each to read 5 times the blocks of numbers, 15 connected passages and 2 «filler» passages, and 5 repetitions of the lists of monosyllables.
- Very Few Talker Set: 1 male, 1 female, selected from Few Talker set: each to read blocks of monosyllables in carrier sentences and five repetitions of the context words.
And the extension part: semi-spontaneous answers to questions: the answers were recorded by the 10 Few Talker Set speakers.
C-000016: BABEL Hungarian Database
Desktop/Microphone
The BABEL Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304). The project began in March 1995 and was completed in December 1998. The objective was to create a database of languages of Central and Eastern Europe in parallel to the EUROM1 databases produced by the SAM Project (funded by the ESPRIT programme).
The BABEL consortium included six partners from Central and Eastern Europe (who had the major responsibility of planning and carrying out the recording and labelling) and six from Western Europe (whose role was mainly to advise and in some cases to act as host to BABEL researchers). The five databases collected within the project concern the Bulgarian, Estonian, Hungarian, Polish, and Romanian languages.
The Hungarian database consists of the basic "common" set which is:
- Many Talker Set: 30 males, 30 females; each to read 50 numbers, 1-2 connected passages, 1 block of "filler" sentences, and 1 block of syllables.
- Few Talker Set: 4 males, 4 females; each to read 50 numbers, 10 connected passages, 1 block of "filler" sentences, and 2-3 blocks of syllables.
- Very Few Talker Set: 1 male, 1 female; each to read 2 blocks of 50 numbers, 40 connected passages, 4 blocks of "filler" sentences, and 9 blocks of syllables.
And the extension part: a short description of the Hungarian sound system.
C-000022: British English SpeechDat(II) FDB-4000
Telephone
The British English SpeechDat(II) FDB-4000 database contains the recordings of 4,000 British English speakers (1,968 males, 2,032 females) recorded over the British fixed telephone network. This database is partitioned into 20 CDs.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
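
As an illustration of the sample format described above, the following minimal sketch decodes 8-bit A-law bytes into signed 16-bit linear PCM values using the standard G.711 expansion. It assumes the signal files are headerless raw A-law, which the entry does not state explicitly; the function names `alaw_to_pcm16` and `duration_seconds` are hypothetical and are not part of any SpeechDat tooling.

```python
SAMPLE_RATE_HZ = 8000  # 8 kHz, as stated in the entry


def alaw_to_pcm16(data: bytes) -> list[int]:
    """Decode G.711 A-law bytes into signed 16-bit linear PCM sample values."""
    samples = []
    for byte in data:
        byte ^= 0x55                   # undo the alternate-bit inversion applied by the encoder
        mantissa = (byte & 0x0F) << 4  # 4-bit quantisation step, shifted into place
        segment = (byte & 0x70) >> 4   # 3-bit segment (chord) number
        if segment == 0:
            magnitude = mantissa + 8
        else:
            magnitude = (mantissa + 0x108) << (segment - 1)
        # After the XOR, a set top bit marks a positive sample.
        samples.append(magnitude if byte & 0x80 else -magnitude)
    return samples


def duration_seconds(data: bytes) -> float:
    """Duration of a raw A-law buffer: one byte per sample at 8 kHz."""
    return len(data) / SAMPLE_RATE_HZ


# Small self-check on the two quietest and two loudest A-law codes.
demo = bytes([0xD5, 0x55, 0xAA, 0x2A])
print(alaw_to_pcm16(demo))
print(duration_seconds(bytes(8000)))
```

In practice the bytes would be read from a signal file with `open(path, "rb").read()`; if the files turn out to carry a header, it would need to be skipped before decoding.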

This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SpeechDat format and content specifications.

Each speaker uttered the following items:

* 1 isolated single digit
* 1 sequence of 10 isolated digits
* 4 connected digits (1 sheet number, 6 digits; 1 telephone number, 9/11 digits; 1 credit card number, 16 digits; 1 PIN code, 6 digits)
* 1 spontaneous phone number
* 1 currency money amount
* 1 natural number
* 3 dates (1 spontaneous e.g. birthday, 1 prompted date, 1 relative or general date expression)
* 2 time phrases (1 spontaneous time of day, 1 word style time phrase)
* 3 spelled words (1 spontaneous e.g. own forename, 1 city name, 1 real word for coverage)
* 5 directory assistance names (1 spontaneous e.g. own forename, 1 city of birth/growing up, 1 frequent city name, 1 frequent company name, 1 common forename and surname)
* 2 yes/no questions (1 predominantly "yes" question, 1 predominantly "no" question)
* 3 application words
* keyword phrase using an embedded application word
* 4 phonetically rich words
* 9 phonetically rich sentences

The following age distribution has been obtained: 1,242 speakers are between 16 and 30, 1,321 speakers are between 31 and 45, 1,298 speakers are between 46 and 60, and the age of 139 speakers is unknown.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000023: British English SpeechDat(II) MDB-1000
Telephone
The British English SpeechDat(II) MDB-1000 database contains the recordings of 1,000 British speakers recorded over the GSM digital mobile network. The MDB-1000 database is partitioned into 5 CDs in ISO 9660 format.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SpeechDat format and content specifications.

Each speaker uttered the following items:

* 1 sequence of 10 isolated digits
* 3 connected digits (1 telephone number, 9/11 digits; 1 credit card number, 14/16 digits; 1 PIN code, 6 digits)
* 3 dates (1 spontaneous e.g. birthday, 1 word style prompted date, 1 relative and general date expression)
* 1 word spotting phrase using an embedded application word
* 2 isolated digits
* 3 spelled words (1 spontaneous name e.g. own forename, 1 city name, 1 real / artificial word for coverage)
* 1 currency money amount
* 1 natural number
* 5 directory assistance names (1 spontaneous name e.g. own forename, 1 city of birth/growing up, 1 city name from a set of the 500 most frequent cities, 1 company/agency name from a set of the 500 most frequent, 1 forename and surname from a set of 150 full names)
* 2 yes/no questions (1 predominantly yes question, 1 predominantly no question)
* 9 phonetically rich sentences
* 2 time phrases (1 spontaneous time of day, 1 word style time phrase)
* 4 phonetically rich words

The following age distribution has been obtained: 329 speakers are between 16 and 30, 340 speakers between 31 and 45, and 331 speakers are between 46 and 60.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000024: CADCC-Chinese Annotated Dialogue and Conversation Corpus
CADCC comprises spontaneous dialogue waveform data and a text corpus. It is suited to research on spontaneous speech, speech recognition projects, and advanced Mandarin teaching.
http://www.chineseldc.org/EN/doc/CLDC-SPC-2005-013/intro.htm
C-000025: CASIA single syllable isolated word speech corpus
Single syllables, words, and sentences, including a female corpus (4 groups in total, 61 times, 61 record time) and a male corpus (5 groups in total, 40 times, 6 record time).
http://www.chineseldc.org/EN/doc/CLDC-SPC-1999-019/intro.htm
C-000026: CASIA-863 Chinese Speech Synthesis Corpus
This corpus was recorded by a female speaker from China National Radio. It contains 6000 sentences (about 15 hours), which cover various kinds of prosodic and acoustic phenomena.
http://www.chineseldc.org/EN/doc/CLDC-SPC-2005-011/intro.htm
C-000028: CASIA-Chinese Question Structures Corpus
The corpus has four speakers: two male and two female. Each speaker utters all 590 questions, which cover various kinds of question structures. Two of the four speakers also utter the statement sentences corresponding to these questions, which supports comparative study of questions and statements. There are 3540 sentences in all.
http://www.chineseldc.org/EN/doc/CLDC-SPC-2004-021/intro.htm
C-000029: CASIA-Mandarin continuous digit speech corpus
Continuous speech of digit strings from 55 male speakers, each of whom spoke 80 digit strings of 1 to 7 digits in length. Each digit occurs almost the same total number of times and is about equally likely to appear at the beginning, middle, and end of a string. Every pair of digits occurs consecutively the same number of times, which yields plenty of co-articulation phenomena. The total length of the data is about 100 minutes.
http://www.chineseldc.org/EN/doc/CLDC-SPC-2004-014/intro.htm
C-000033: CASIA The weather forecast broadcasts the pronunciation storehouse
58 segments from the weather forecast broadcasting domain, about 3 hours of recordings.
http://www.chineseldc.org/EN/doc/CLDC-SPC-1999-018/intro.htm
