Language Resource Search - SHACHI: Language Resource Metadata Database

Language resource #: 3330 Results 481 - 490 of 2023

C-000915: EUROM1f French
Desktop/Microphone
The first truly multilingual speech database produced in Europe. It provides equivalent corpora for each of the European languages: the same number of speakers, selected in the same way and recorded under the same conditions with common file formats. Initially, eight European countries made recordings: Italy, United Kingdom, Germany, Netherlands, Denmark, Sweden, Norway, France. Additional recordings were later completed (thanks to CEE Esprit Project SAM-A) in Greece, Spain and Portugal. The content consists of Numbers, Passages, Sentences and CVC. More than sixty speakers per language.
- hasVersion: C-001403: EUROM1e
- hasVersion: C-000061: EUROM1g
- references: C-003359: EvaSy Evaluation Package
C-000916: EUROM1i
Desktop/Microphone
The first truly multilingual speech database produced in Europe. It provides equivalent corpora for each of the European languages: the same number of speakers, selected in the same way and recorded under the same conditions with common file formats. Initially, eight European countries made recordings: Italy, United Kingdom, Germany, Netherlands, Denmark, Sweden, Norway, France. Additional recordings were later completed (thanks to CEE Esprit Project SAM-A) in Greece, Spain and Portugal. The content consists of Numbers, Passages, Sentences and CVC. More than sixty speakers per language.
C-000918: Eleftherotypia Journal Speech database
Desktop/Microphone
The Eleftherotypia Speech Database (13 CD-ROMs) consists of read material collected for the development of continuous speech recognition systems for the Greek language. All recorded sentences were selected from extracts of the Eleftherotypia journal text corpus and provide a vocabulary of about 40,000 words. The total number of utterances is over 32,000 (approximately 72 hours of speech material from 120 different speakers, male and female).

Detailed orthographic transcription files are also included in the distribution. There are markings for the utterance's orthography and several speech and non-speech events (e.g. mispronunciations, truncation, noise etc).

The recordings were made in three different environments: a soundproof room, a quiet environment and an office environment. Two different microphones were used: a desk microphone and a head-mounted close-talking microphone. The waveform files are in NIST format. Waveforms are PCM-encoded at a 16 kHz sampling rate with 2 bytes per sample.
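As a rough illustration of the waveform description above, the sketch below reads the text header of a NIST-format file. It assumes the files use the NIST SPHERE layout (an ASCII header beginning with "NIST_1A", followed by the header size and "name -type value" lines up to "end_head"); the entry itself only says "NIST", so this is an assumption. The field values in the example header are synthetic and simply mirror the figures quoted in the description.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header of a (assumed) NIST SPHERE file into a dict."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {"header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, type_tag, value = parts
        if type_tag == "-i":      # integer field
            fields[name] = int(value)
        elif type_tag == "-r":    # real-valued field
            fields[name] = float(value)
        else:                     # "-sN": string field of N characters
            fields[name] = value
    return fields


# Synthetic header built from the parameters quoted in the description
# (16 kHz, 2 bytes per sample, PCM); one channel is an assumption.
example = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024, b" ")

info = parse_sphere_header(example)
bytes_per_second = (
    info["sample_rate"] * info["sample_n_bytes"] * info["channel_count"]
)
print(info["sample_coding"], bytes_per_second)
```

With these parameters each second of single-channel audio occupies 32,000 bytes, which gives a quick way to estimate storage from a recording's duration.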
C-000920: English SpeechDat Polyphone database DB1
Telephone
The (polyphone-like) English SpeechDat(M) database was recorded within the framework of the SPEECHDAT(M) Project. It consists of 1,000 speakers, chosen according to their individual demographics, who were recorded over digital telephone lines using fixed telephone sets. The material to be spoken was provided to the caller via a prompt sheet. The database is divided into two sub-sets: the phonetically rich sentences (one CD) known as DB2, and the application-oriented utterances (two CDs) known as DB1.
The recorded material in DB1 comprises immediately usable and relevant speech, including number and letter sequences, common control keywords, dates, times, money amounts, etc. This provides a realistic basis for using these resources for the training and assessment of speaker-independent recognition of both isolated and continuous speech utterances, employing whole-word modeling and/or phoneme-based approaches. The sample rate for speech is 8 kHz, quantisation is 8 bit, and A-law encoding is used. This results in a data rate of 64 kbit/s.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- isPartOf: C-000921: English SpeechDat(M) Polyphone database DB2
- hasVersion: C-001523: Spanish SpeechDat(M) - DB1
- hasVersion: C-000119: Portuguese SpeechDat(M) database
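The data rate quoted for this database follows directly from the sampling parameters: 8,000 samples per second at 8 bits each gives 64,000 bits per second (64 kbit/s), which is 8,000 bytes per second. A minimal sketch of that arithmetic, assuming a single audio channel (the function name is illustrative):

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Return the raw data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


bit_rate = pcm_bit_rate(8000, 8)   # 8 kHz, 8-bit A-law, one channel assumed
byte_rate = bit_rate // 8          # bytes per second

print(bit_rate)   # 64000 bits per second, i.e. 64 kbit/s
print(byte_rate)  # 8000 bytes per second
```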
C-000921: English SpeechDat(M) Polyphone database DB2
Telephone
The (polyphone-like) English SpeechDat(M) database contains the recordings of 1,000 speakers who were recorded over the fixed telephone network. It is divided into two sub-sets: the phonetically rich sentences (one CD) known as DB2, and the application-oriented utterances (two CDs) known as DB1.

It was validated by SPEX (the Netherlands) to assess its compliance with the SpeechDat format and content specifications.

Each speaker uttered the following items: number and letter sequences, common control keywords, dates, times, money amounts, etc.

This provides a realistic basis for using these resources for the training and assessment of speaker-independent recognition of both isolated and continuous speech utterances, employing whole-word modeling and/or phoneme-based approaches.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- references: C-000920: English SpeechDat Polyphone database DB1
- isVersionOf: C-001523: Spanish SpeechDat(M) - DB1
- isVersionOf: C-000119: Portuguese SpeechDat(M) database
C-000922: Erlanger Bahnansage - ERBA
Desktop/Microphone
Over 10,000 utterances read by over 100 German speakers (60 male and 40 female), in the domain of train inquiries. All recordings were made in a quiet office room (4 CD-ROMs).
C-000930: Euskararen Datu-Base Lexikala (EDBL) Lexical Database for Basque
Monolingual Lexicons
EDBL (Lexical Database for Basque) is the lexical basis needed for the automatic treatment of Basque. It was first developed as lexical support for the spelling checker and corrector XUXEN, but over time it has proved to be a multipurpose tool. Today it provides lexical support not only for the speller but also for the morphological analyser MORFEUS and the lemmatiser EUSLEM. In the future, it will also be used for syntactic and semantic analysis.

Being neutral with respect to linguistic formalisms, flexible, open and easy to use, EDBL is, along with corpora, an essential tool for Natural Language Processing. It is made up of about 75,000 entries divided into dictionary entries (the same as those found in a conventional dictionary), verb forms and dependent morphemes, all of them with their respective morphological information.

Currently, it is organized in a hierarchical structure, according to a category-system adapted to Basque. It aims to reflect the general lexicon of standard Basque (Euskara Batua) and it is the essential lexical information-store for Basque NLP.
C-000931: FIXED0IT - DB1
Telephone
DB1 Phonetically rich sentences & application oriented utterances

The Italian Fixed Network Speech SpeechDat(M) Corpus version 1.0 was recorded within the scope of the SpeechDat(M) project (LRE-63314), funded by the European Commission. Recording was done by using a primary rate ISDN interface, yielding 8 kHz, 8 bits per sample, A-law coded signal. The data files are formatted according to the SAM European project. The speech data are compressed with the GNU gzip program. All software needed to use the corpus is provided on the CDs.

The corpus contains the speech of about 1,000 speakers (about 500 males and 500 females) and was designed to support the creation of voice-driven teleservices. The callers spoke at least 39 items, comprising:

* isolated and connected digits
* natural numbers
* money amounts
* spelled words
* time and date phrases
* yes/no questions
* city names
* common application words
* application words in phrases
* phonetically rich sentences

Most items are read, some are spontaneously spoken.

The recordings come with extensive and standardised documentation. All speech is carefully transcribed at the orthographic level; in addition, a number of clearly audible non-speech events are included in the transcription. Moreover, age and regional background of the speakers are provided. A pronunciation dictionary is added, containing all words that occur in the corpus, with a corresponding SAMPA broad-class phonemic transcription.

Validation and premastering of the CD-ROMs were performed by the Speech Processing Expertise Centre (SPEX), Leidschendam, The Netherlands.

DB2 Phonetically rich sentences sub-set (S0053)

See ELRA-S0052 for description. DB2 is a sub-set of DB1; it contains only the phonetically rich sentence items.
- isPartOf: C-000932: FIXED0IT - DB2
- isPartOf: C-000939: Fixed1it Design
- isVersionOf: C-001523: Spanish SpeechDat(M) - DB1
- isVersionOf: C-000119: Portuguese SpeechDat(M) database
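The description above mentions an 8-bit A-law coded signal. As a hedged illustration, assuming the samples follow the standard ITU-T G.711 A-law companding (the entry does not name the standard, so this is an assumption), each byte can be expanded to a 16-bit linear PCM value roughly as follows. The function names are illustrative and are not part of the software shipped with the corpus.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a signed 16-bit linear PCM value."""
    code ^= 0x55                      # undo the alternate-bit inversion
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment number
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        if segment > 1:
            magnitude <<= segment - 1
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data: bytes) -> list[int]:
    """Decode a byte string of A-law samples to linear PCM values."""
    return [alaw_to_linear(b) for b in data]


decoded = decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A]))
print(decoded)  # [8, -8, 32256, -32256]
```

The four test bytes are the smallest and largest positive and negative A-law codes, so the output spans the full decoded range.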
C-000932: FIXED0IT - DB2
Telephone
DB1 Phonetically rich sentences & application oriented utterances

The Italian Fixed Network Speech SpeechDat(M) Corpus version 1.0 was recorded within the scope of the SpeechDat(M) project (LRE-63314), funded by the European Commission. Recording was done by using a primary rate ISDN interface, yielding 8 kHz, 8 bits per sample, A-law coded signal. The data files are formatted according to the SAM European project. The speech data are compressed with the GNU gzip program. All software needed to use the corpus is provided on the CDs.

The corpus contains the speech of about 1,000 speakers (about 500 males and 500 females) and was designed to support the creation of voice-driven teleservices. The callers spoke at least 39 items, comprising:

* isolated and connected digits
* natural numbers
* money amounts
* spelled words
* time and date phrases
* yes/no questions
* city names
* common application words
* application words in phrases
* phonetically rich sentences

Most items are read, some are spontaneously spoken.

The recordings come with extensive and standardised documentation. All speech is carefully transcribed at the orthographic level; in addition, a number of clearly audible non-speech events are included in the transcription. Moreover, age and regional background of the speakers are provided. A pronunciation dictionary is added, containing all words that occur in the corpus, with a corresponding SAMPA broad-class phonemic transcription.

Validation and premastering of the CD-ROMs were performed by the Speech Processing Expertise Centre (SPEX), Leidschendam, The Netherlands.

DB2 Phonetically rich sentences sub-set (S0053)

See ELRA-S0052 for description. DB2 is a sub-set of DB1; it contains only the phonetically rich sentence items.
- hasPart: C-000931: FIXED0IT - DB1
- hasPart: C-000939: Fixed1it Design
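Since the speech data are described as gzip-compressed, 8 kHz, 8 bits per sample, the duration of a recording can be estimated from its decompressed size. The sketch below assumes each compressed file holds only headerless sample data, which the entry does not state; the file name and helper function are hypothetical, and the demonstration file is synthetic.

```python
import gzip
import os
import tempfile

SAMPLE_RATE_HZ = 8000   # from the corpus description
BYTES_PER_SAMPLE = 1    # 8 bits per sample


def gzipped_duration_seconds(path: str) -> float:
    """Estimate the duration of a gzip-compressed, headerless sample file."""
    with gzip.open(path, "rb") as fh:
        payload = fh.read()
    return len(payload) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


# Demonstration with a synthetic file, not real corpus data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_item.gz")  # hypothetical file name
    with gzip.open(demo_path, "wb") as fh:
        fh.write(bytes(16000))  # 16,000 one-byte samples
    demo_duration = gzipped_duration_seconds(demo_path)

print(demo_duration)  # 2.0
```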
C-000933: FRESCO: French Polyphone Database (SpeechDat(M)) DB1
Telephone
FRESCO, a polyphone-like telephone speech database in French, was produced as part of the SpeechDat(M) project. Containing approximately 35,000 utterances recorded from 1,000 callers over the terrestrial telephone network in France, it offers immediately usable and relevant speech for the training, assessment and deployment of speaker-independent speech recognisers based on phoneme models or word models. In addition to a speech and annotation file for every utterance, the database contains a pronunciation lexicon for all 13,000 different words recorded. The database consists of two subsets, DB1 and DB2. DB1 contains the complete set of data (phonetically rich sentences and application-oriented data). DB2 contains only the phonetically rich sentences.
The speaker set is balanced with respect to gender and adheres to a predefined age distribution, while the geographic distribution roughly resembles the demographics of France.

For more information: http://www.elda.org/catalogue/en/speech/doc/fresco.html

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- requires: T-001414: FRESCO: French Polyphone Database (SpeechDat(M)) DB2
- hasVersion: C-001523: Spanish SpeechDat(M) - DB1
- hasVersion: C-000119: Portuguese SpeechDat(M) database
