Language resource #: 3330 Results 481 - 490 of 2023
  • C-000915: EUROM1f French
    Desktop/Microphone
    The first truly multilingual speech database produced in Europe. It provides equivalent corpora for each of the European languages: the same number of speakers, selected in the same way and recorded under the same conditions, with common file formats. Recordings were initially made in eight European countries: Italy, United Kingdom, Germany, Netherlands, Denmark, Sweden, Norway and France. Additional recordings were later completed in Greece, Spain and Portugal (thanks to CEE Esprit Project SAM-A). The content consists of Numbers, Passages, Sentences and CVC. More than sixty speakers per language.
  • C-000916: EUROM1i
    Desktop/Microphone
    The first truly multilingual speech database produced in Europe. It provides equivalent corpora for each of the European languages: the same number of speakers, selected in the same way and recorded under the same conditions, with common file formats. Recordings were initially made in eight European countries: Italy, United Kingdom, Germany, Netherlands, Denmark, Sweden, Norway and France. Additional recordings were later completed in Greece, Spain and Portugal (thanks to CEE Esprit Project SAM-A). The content consists of Numbers, Passages, Sentences and CVC. More than sixty speakers per language.
  • C-000918: Eleftherotypia Journal Speech database
    Desktop/Microphone
    The Eleftherotypia Speech Database (13 CD-ROMs) consists of read material collected for the development of continuous speech recognition systems for the Greek language. All recorded sentences were selected from extracts of the Eleftherotypia journal text corpus and provide a vocabulary of about 40,000 words. The total number of utterances is over 32,000 (approximately 72 hours of speech material from 120 different speakers, male and female).

    Detailed orthographic transcription files are also included in the distribution. There are markings for the utterance's orthography and several speech and non-speech events (e.g. mispronunciations, truncation, noise etc).

    Recording took place in three different environments: a soundproof room, a quiet environment and an office environment. Two different microphones were used: a desk microphone and a head-mounted close-talking microphone. The waveform files are in NIST format. Waveforms are PCM-encoded with a 16,000 Hz sampling rate and 2 bytes per sample.
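
As a rough consistency check on the figures above, the following sketch estimates the raw waveform size from the stated sampling rate, sample width and duration. It assumes one mono channel per recording and ignores file headers and transcription files; neither point is stated in the entry.

```python
# Back-of-the-envelope size estimate for the waveform data.
# Assumption (not stated in the catalogue entry): mono audio, headers ignored.
SAMPLE_RATE_HZ = 16_000   # from the description
BYTES_PER_SAMPLE = 2      # from the description
HOURS_OF_SPEECH = 72      # "approximately 72 hours"
NUM_CDROMS = 13           # from the description

bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
total_bytes = bytes_per_second * HOURS_OF_SPEECH * 3600
bytes_per_disc = total_bytes / NUM_CDROMS

print(f"{bytes_per_second} bytes per second of speech")
print(f"{total_bytes / 1e9:.2f} GB in total")
print(f"{bytes_per_disc / 1e6:.0f} MB per disc on average")
```

Under these assumptions the total comes to a little over 8 GB, which is in line with a 13-disc CD-ROM distribution.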
  • C-000920: English SpeechDat Polyphone database DB1
    Telephone
    The (polyphone-like) English SpeechDat(M) database was recorded within the framework of the SPEECHDAT(M) Project. It consists of 1,000 speakers, chosen according to their individual demographics, who were recorded over digital telephone lines using fixed telephone sets. The material to be spoken was provided to the caller via a prompt sheet. The database is divided into two sub-sets: the phonetically rich sentences (one CD) known as DB2, and the application-oriented utterances (two CDs) known as DB1.
    The recorded material in DB1 comprises immediately usable and relevant speech, including number and letter sequences, common control keywords, dates, times, money amounts, etc. This provides a realistic basis for using these resources for the training and assessment of speaker-independent recognition of both isolated and continuous speech utterances, employing whole-word modeling and/or phoneme-based approaches. The sample rate for speech is 8 kHz, quantisation is 8 bit, and A-law encoding is used. This results in a data rate of 64 kbit/s (8 kB/s).
    A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
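
The data rate follows directly from the sampling parameters. A minimal sketch of the arithmetic, assuming a single channel (the helper name `raw_size_bytes` is illustrative, not part of the database tools):

```python
# Data rate of 8 kHz, 8-bit A-law telephone speech (single channel assumed).
SAMPLE_RATE_HZ = 8_000
BITS_PER_SAMPLE = 8

bits_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE   # 64,000 bit/s = 64 kbit/s
bytes_per_second = bits_per_second // 8              # 8,000 B/s = 8 kB/s


def raw_size_bytes(duration_seconds: float) -> int:
    """Size of the raw (headerless, uncompressed) samples for a given duration."""
    return int(duration_seconds * bytes_per_second)


print(bits_per_second, bytes_per_second)
print(raw_size_bytes(60))  # one minute of speech
```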
  • C-000921: English SpeechDat(M) Polyphone database DB2
    Telephone
    The (polyphone-like) English SpeechDat(M) database contains the recordings of 1,000 speakers who were recorded over the fixed telephone network. It is divided into two sub-sets: the phonetically rich sentences (one CD) known as DB2, and the application-oriented utterances (two CDs) known as DB1.

    It was validated by SPEX (the Netherlands) to assess its compliance with the SpeechDat format and content specifications.

    Each speaker uttered the following items: number and letter sequences, common control keywords, dates, times, money amounts, etc.

    This provides a realistic basis for using these resources for the training and assessment of speaker-independent recognition of both isolated and continuous speech utterances, employing whole-word modeling and/or phoneme-based approaches.

    A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
  • C-000922: Erlanger Bahnansage - ERBA
    Desktop/Microphone
    Over 10,000 utterances read by over 100 German speakers (60 male and 40 female), in the domain of train inquiries. All recordings were made in a quiet office room (4 CD-ROMs).
  • C-000930: Euskararen Datu-Base Lexikala (EDBL) – Lexical Database for Basque
    Monolingual Lexicons
    EDBL (Lexical database for Basque) is the lexical basis needed for the automatic treatment of Basque. It was first developed as lexical support for the spelling checker and corrector XUXEN, but over time it has proved to be a multipurpose tool. Today it provides lexical support not only for the speller but also for the morphological analyser MORFEUS and the lemmatiser EUSLEM. In the future, it will also be used for syntactic and semantic analysis.

    Being neutral with respect to linguistic formalisms, flexible, open and easy to use, EDBL is, along with corpora, an essential tool for Natural Language Processing. It is made up of about 75,000 entries divided into dictionary entries (the same as those found in a conventional dictionary), verb forms and dependent morphemes, all with their respective morphological information.

    Currently, it is organized in a hierarchical structure, according to a category-system adapted to Basque. It aims to reflect the general lexicon of standard Basque (Euskara Batua) and it is the essential lexical information-store for Basque NLP.
  • C-000931: FIXED0IT - DB1
    Telephone
    DB1 Phonetically rich sentences & application oriented utterances

    The Italian Fixed Network Speech SpeechDat(M) Corpus version 1.0 was recorded within the scope of the SpeechDat(M) project (LRE-63314), funded by the European Commission. Recording was done by using a primary rate ISDN interface, yielding 8 kHz, 8 bits per sample, A-law coded signal. The data files are formatted according to the SAM European project. The speech data are compressed with the GNU gzip program. All software needed to use the corpus is provided on the CDs.

    The corpus contains the speech of about 1,000 speakers (about 500 males and 500 females) and was designed to support the creation of voice-driven teleservices. The callers spoke at least 39 items, comprising:

    * isolated and connected digits
    * natural numbers
    * money amounts
    * spelled words
    * time and date phrases
    * yes/no questions
    * city names
    * common application words
    * application words in phrases
    * phonetically rich sentences

    Most items are read, some are spontaneously spoken.

    The recordings come with extensive and standardised documentation. All speech is carefully transcribed at the orthographic level; in addition, a number of clearly audible non-speech events are included in the transcription. Moreover, age and regional background of the speakers are provided. A pronunciation dictionary is added, containing all words that occur in the corpus, with a corresponding SAMPA broad-class phonemic transcription.

    Validation and premastering of the CD-ROMs were performed by the Speech Processing Expertise Centre (SPEX), Leidschendam, The Netherlands.

    DB2 Phonetically rich sentences sub-set (S0053)

    See ELRA-S0052 for description. DB2 is a sub-set of DB1; it contains only the phonetically rich sentence items.
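
For readers who want to inspect such recordings without the supplied software, the following is a minimal sketch of decompressing gzip data and expanding 8-bit A-law codes to 16-bit linear PCM using the standard G.711 expansion. It assumes the decompressed speech file is a headerless stream of A-law bytes; the actual file layout should be checked against the corpus documentation. The function names are illustrative only.

```python
import gzip


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a 16-bit linear PCM value."""
    code ^= 0x55                     # undo the alternate-bit inversion
    magnitude = (code & 0x0F) << 4   # 4-bit mantissa
    segment = (code & 0x70) >> 4     # 3-bit segment number
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # After the inversion is undone, a set top bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_gzipped_alaw(blob: bytes) -> list[int]:
    """Decompress gzip data and decode every byte as an A-law sample.

    Assumption: the payload is raw A-law with no header.
    """
    return [alaw_to_linear(b) for b in gzip.decompress(blob)]


# Tiny synthetic example (not corpus data): smallest and largest magnitudes.
demo = gzip.compress(bytes([0xD5, 0x55, 0xAA, 0x2A]))
samples = decode_gzipped_alaw(demo)
print(samples)
```

The decoded values can then be written out as 16-bit PCM, for example with Python's `wave` module at an 8,000 Hz frame rate.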
  • C-000932: FIXED0IT - DB2
    Telephone
    DB1 Phonetically rich sentences & application oriented utterances

    The Italian Fixed Network Speech SpeechDat(M) Corpus version 1.0 was recorded within the scope of the SpeechDat(M) project (LRE-63314), funded by the European Commission. Recording was done by using a primary rate ISDN interface, yielding 8 kHz, 8 bits per sample, A-law coded signal. The data files are formatted according to the SAM European project. The speech data are compressed with the GNU gzip program. All software needed to use the corpus is provided on the CDs.

    The corpus contains the speech of about 1,000 speakers (about 500 males and 500 females) and was designed to support the creation of voice-driven teleservices. The callers spoke at least 39 items, comprising:

    * isolated and connected digits
    * natural numbers
    * money amounts
    * spelled words
    * time and date phrases
    * yes/no questions
    * city names
    * common application words
    * application words in phrases
    * phonetically rich sentences

    Most items are read, some are spontaneously spoken.

    The recordings come with extensive and standardised documentation. All speech is carefully transcribed at the orthographic level; in addition, a number of clearly audible non-speech events are included in the transcription. Moreover, age and regional background of the speakers are provided. A pronunciation dictionary is added, containing all words that occur in the corpus, with a corresponding SAMPA broad-class phonemic transcription.

    Validation and premastering of the CD-ROMs were performed by the Speech Processing Expertise Centre (SPEX), Leidschendam, The Netherlands.

    DB2 Phonetically rich sentences sub-set (S0053)

    See ELRA-S0052 for description. DB2 is a sub-set of DB1; it contains only the phonetically rich sentence items.
  • C-000933: FRESCO: French Polyphone Database (SpeechDat(M)) DB1
    Telephone
    FRESCO, a polyphone-like telephone speech database in French, was produced as part of the SpeechDat(M) project. Containing approximately 35,000 utterances recorded from 1,000 callers over the terrestrial telephone network in France, it offers immediately usable and relevant speech for the training, assessment and deployment of speaker-independent speech recognisers based on phoneme models or word models. In addition to a speech and annotation file for every utterance, the database contains a pronunciation lexicon for all 13,000 different words recorded. The database consists of two subsets, DB1 and DB2. DB1 contains the complete set of data (phonetically rich sentences and application-oriented data). DB2 contains only the phonetically rich sentences.
    The speaker set is balanced with respect to gender and adheres to a predefined age distribution, while the geographic distribution roughly resembles the demographics of France.

    For more information: http://www.elda.org/catalogue/en/speech/doc/fresco.html

    A pronunciation lexicon with a phonemic transcription in SAMPA is also included.