Language resource #: 3330
-
C-000396: ILSP/ELEFTHEROTYPIA Corpus (Greek corpus)
Written Corpora
The ILSP/ELEFTHEROTYPIA Corpus contains approximately 3 million words classified and annotated according to the common core PAROLE encoding standard. Each file is classified according to the parameters of Medium, Topic and Genre, and structurally annotated at paragraph level (CES Level 1). The corpus is stored as SGML files. The source of the texts is the Greek daily newspaper ELEFTHEROTYPIA.
A subset of the corpus (250,000 words) is morphosyntactically tagged; all the words are also lemmatised and checked. The morphosyntactic annotation followed a four-step procedure: automatic morphosyntactic annotation, automatic disambiguation, manual disambiguation and checking, and conversion to meet the PAROLE format requirements. In certain texts, some passages are written in "katharevoussa", an older version of Greek; these passages are marked as "distinct" and have not been morphosyntactically annotated.
The tagset used for the morphological annotation of the corpus is presented in the "Addendum to TA - Encoding features and values for the morphological layer in the lexicon Merged Tags" (P-WP1.1.-MEMO-ERLI-5).
More information about the PAROLE project: http://www.elda.org/catalogue/fr/text/doc/parole.html
-
C-000397: British English SpeechDat(II) SDB-2400
Telephone
The British English SpeechDat(II) SDB-2400 database is designed for development and assessment of speaker verification and identification systems. It contains the recordings of 120 speakers who uttered 22 items 20 times, and was collected over the fixed and mobile telephone networks in quiet and noisy environments. This database is partitioned into 8 CDs.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
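As an illustration of the sample format described above, the following minimal Python sketch expands 8-bit A-law codes to signed linear values using the standard G.711 expansion. It assumes the signal files are raw, headerless A-law byte streams; the function names `alaw_to_linear` and `decode_alaw` are hypothetical and not part of the database distribution.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code (G.711) to a signed linear sample."""
    code ^= 0x55                      # undo the even-bit inversion applied by the encoder
    mantissa = (code & 0x0F) << 4     # 4-bit mantissa, shifted into position
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data: bytes) -> list[int]:
    """Decode a raw A-law byte string into a list of linear samples."""
    return [alaw_to_linear(b) for b in data]
```

With 8 kHz sampling, the number of decoded samples divided by 8000 gives the utterance length in seconds.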
This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SpeechDat format and content specifications.
Each speaker uttered the following items:
* 1 sequence of 10 isolated digits
* 2 connected digits (1 credit card number of 16 digits, 1 PIN code of 6 digits)
* 2 spelled words (1 fixed "forename surname", 2 "names/words")
* 1 fixed "forename surname"
* 2 "forename surname" out of a set of 10
* 2 application words
* 10 phonetically rich sentences
The following age distribution has been obtained: 7 speakers are under 16, 41 speakers are between 16 and 30, 33 speakers are between 31 and 45, 32 speakers are between 46 and 60, and the age of 7 speakers is unknown.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000398: Greek SpeechDat-Car
Desktop/Microphone
The Greek SpeechDat-Car database comprises 300 Greek speakers (150 males, 150 females) recorded over the GSM telephone network and in a car. This database is partitioned into 11 DVDs. The speech databases made within the SpeechDat-Car project were validated by SPEX, the Netherlands, to assess their compliance with the SpeechDat-Car format and content specifications.
The speech data files are in two formats. Four of the microphones were recorded on the computer in the boot of the car, with the speech data stored as uncompressed 16-bit samples at 16 kHz. The fifth microphone was connected to the GSM phone and was recorded on a remote machine, with the data stored as 8-bit A-law samples at 8 kHz. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
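The two formats above imply different data rates. The sketch below estimates the duration of a signal file from its size, assuming single-channel, headerless files; `duration_seconds` and the two constants are hypothetical names introduced for illustration only.

```python
def duration_seconds(num_bytes: int, sample_rate_hz: int, bits_per_sample: int) -> float:
    """Estimate the duration of a headerless, single-channel signal file."""
    bytes_per_sample = bits_per_sample // 8
    return num_bytes / (sample_rate_hz * bytes_per_sample)


# Bytes needed for one second of audio in each format described above.
CAR_BYTES_PER_SECOND = 16_000 * 2   # 16 kHz, 16-bit uncompressed
GSM_BYTES_PER_SECOND = 8_000 * 1    # 8 kHz, 8-bit A-law
```

Under these assumptions, one second of in-car microphone audio takes four times the storage of one second of the GSM channel.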
Each speaker uttered the following items:
* 2 voice activation keywords
* 1 sequence of 10 isolated digits
* 7 connected digits: 1 sheet number (5+ digits), 1 spontaneous telephone number, 3 read telephone numbers, 1 credit card number (14-16 digits), 1 PIN code (6 digits)
* 3 dates: 1 spontaneous date (e.g. birthday), 1 prompted date, 1 relative or general date expression
* 2 word spotting phrases using an application word (embedded)
* 1 question (extra item)
* 4 isolated digits
* 7 spelled words: 1 spontaneous (own forename or surname), 1 spelling of directory city name, 4 real word/name, 1 artificial name for coverage
* 1 money amount
* 1 natural number
* 7 directory assistance names: 1 spontaneous (own forename or surname), 1 city of birth / growing up (spontaneous), 2 most frequent cities, 2 most frequent company/agency, 1 "forename surname"
* 1 yes question
* 1 no question
* 9 phonetically rich sentences
* 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
* 4 phonetically rich words
* 67 application words: 13 mobile phone application words, 22 IVR function keywords, 32 car products keywords
* 2 additional language dependent keywords
* 10 prompts for spontaneous speech
The following age distribution has been obtained: 185 speakers are between 16 and 30, 79 speakers are between 31 and 45, and 36 speakers are between 46 and 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000402: SALA II Spanish from Costa Rica database
Telephone
The SALA II Spanish from Costa Rica database was collected in Costa Rica within the scope of the SALA II project. It contains the recordings of 1,165 Costa Rican speakers (574 males and 591 females) recorded over the Costa Rican mobile telephone network.
The following acoustic conditions were selected as representative of a mobile user's environment:
* Passenger in moving car, railway, bus, etc. (152 speakers)
* Public place (300 speakers)
* Stationary pedestrian by road side (223 speakers)
* Home/office environment (441 speakers)
* Passenger in moving car using a hands-free kit (49 speakers)
This database is distributed on 1 DVD-ROM. The speech files are stored as sequences of 8-bit, 8 kHz A-law samples and are not compressed, according to the specifications of SALA II. Each prompted utterance is stored within a separate file and has an accompanying ASCII SAM label file.
This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SALA II format and content specifications.
Each speaker uttered the following items:
* 6 application words;
* 1 sequence of 10 isolated digits;
* 4 connected digits (1 sheet number of 6 digits, 1 telephone number of 9/11 digits, 1 credit card number of 14/16 digits, 1 PIN code of 6 digits);
* 3 dates (1 spontaneous date e.g. birthday, 1 word style prompted date, 1 relative and general date expression);
* 1 spotting phrase using an application word (embedded);
* 2 isolated digits;
* 3 spelled words (1 surname, 1 directory assistance city name, 1 real/artificial name for coverage);
* 1 currency money amount;
* 1 natural number;
* 5 directory assistance names (1 surname out of a set of 500, 1 city of birth/growing up, 1 most frequent city out of a set of 500, 1 most frequent company/agency out of a set of 500, 1 forename surname out of a set of 150);
* 2 yes/no questions (1 predominantly yes question, 1 predominantly no question);
* 9 phonetically rich sentences;
* 2 time phrases (1 spontaneous time of day, 1 word style time phrase);
* 4 phonetically rich words.
The following age distribution has been obtained: 3 speakers are under 16, 588 speakers are between 16 and 30, 370 speakers are between 31 and 45, 183 speakers are between 46 and 60, and 21 speakers are over 60.
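The environment and age breakdowns quoted for this database can be cross-checked against the stated total of 1,165 speakers. The short Python sketch below does this with figures copied from the description above; the dictionary keys and the helper `matches_total` are hypothetical names chosen for illustration.

```python
TOTAL_SPEAKERS = 1165  # 574 males + 591 females, as stated in the description

# Speakers per acoustic condition.
environments = {
    "moving vehicle": 152,
    "public place": 300,
    "road side": 223,
    "home/office": 441,
    "car, hands-free kit": 49,
}

# Speakers per age band.
ages = {
    "under 16": 3,
    "16-30": 588,
    "31-45": 370,
    "46-60": 183,
    "over 60": 21,
}


def matches_total(groups: dict[str, int], total: int) -> bool:
    """Return True when the group counts add up to the expected total."""
    return sum(groups.values()) == total
```

Both breakdowns add up to the stated total, so `matches_total` returns `True` for each when given `TOTAL_SPEAKERS`.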
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000404: SALA II Spanish from Argentina database
Telephone
The SALA II Spanish from Argentina database was collected in Argentina within the scope of the SALA II project.
It contains the recordings of 1,076 Argentinian speakers (534 males and 542 females) recorded over the Argentinian mobile telephone network.
The following acoustic conditions were selected as representative of a mobile user's environment:
* Passenger in moving car, railway, bus, etc. (204 speakers)
* Public place (281 speakers)
* Stationary pedestrian by road side (240 speakers)
* Home/office environment (291 speakers)
* Passenger in moving car using a hands-free kit (60 speakers)
This database is distributed on 1 DVD-ROM. The speech files are stored as sequences of 8-bit, 8 kHz A-law samples and are not compressed, according to the specifications of SALA II. Each prompted utterance is stored within a separate file and has an accompanying ASCII SAM label file.
This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SALA II format and content specifications.
Each speaker uttered the following items:
* 6 application words
* 1 sequence of 10 isolated digits
* 4 connected digits (1 sheet number of 6 digits, 1 telephone number of 9/11 digits, 1 credit card number of 14/16 digits, 1 PIN code of 6 digits)
* 3 dates (1 spontaneous date e.g. birthday, 1 word style prompted date, 1 relative and general date expression)
* 1 spotting phrase using an embedded application word
* 2 isolated digits
* 3 spelled words (1 surname, 1 directory assistance city name, 1 real/artificial name for coverage)
* 1 currency money amount
* 1 natural number
* 5 directory assistance names (1 surname out of a set of 500, 1 city of birth/growing up, 1 most frequent city out of a set of 500, 1 most frequent company/agency out of a set of 500, 1 forename surname out of a set of 150)
* 2 yes/no questions (1 predominantly yes question, 1 predominantly no question)
* 9 phonetically rich sentences
* 2 time phrases (1 spontaneous time of day, 1 word style time phrase)
* 4 phonetically rich words
The following age distribution has been obtained: 6 speakers are under 16, 399 speakers are between 16 and 30, 395 speakers are between 31 and 45, 245 speakers are between 46 and 60, and 31 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000405: OrienTel French as spoken in Tunisia database
Telephone
The OrienTel French as spoken in Tunisia database comprises 576 Tunisian speakers of French (290 males, 286 females) recorded over the Tunisian fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
* 1 isolated single digit
* 1 sequence of 10 isolated digits
* 5 connected digits: 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
* 1 currency money amount
* 2 natural numbers
* 3 dates: 1 prompted date, 1 relative or general date expression, 1 prompted date phrase (Western calendar)
* 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
* 3 spelled words: 1 spontaneous (own forename), 1 city name, 1 real word for coverage
* 5 directory assistance utterances: 1 spontaneous (own forename), 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
* 2 yes/no questions: 1 predominantly yes question, 1 predominantly no question
* 6 application keywords/keyphrases
* 1 word spotting phrase using embedded application words
* 4 phonetically rich words
* 9 phonetically rich sentences
* 2+3 spontaneous items (for control)
The following age distribution has been obtained: 2 speakers are under 16, 407 speakers are between 16 and 30, 104 speakers are between 31 and 45, 59 speakers are between 46 and 60, and 4 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000406: OrienTel French as spoken in Morocco database
Telephone
The OrienTel French as spoken in Morocco database comprises 530 Moroccan speakers of French (264 males, 266 females) recorded over the Moroccan fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
* 1 isolated single digit
* 1 sequence of 10 isolated digits
* 5 connected digits: 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
* 1 currency money amount
* 2 natural numbers
* 3+1 dates: 1 prompted date, 1 relative or general date expression, 1 prompted date phrase + 1 additional (Western calendar)
* 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
* 3 spelled words: 1 spontaneous (own forename), 1 city name, 1 real word for coverage
* 5 directory assistance utterances: 1 spontaneous (own forename), 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
* 2 yes/no questions: 1 predominantly yes question, 1 predominantly no question
* 6 application keywords/keyphrases
* 1 word spotting phrase using embedded application words
* 4 phonetically rich words
* 9 phonetically rich sentences
* 2 spontaneous items (for control)
The following age distribution has been obtained: 256 speakers are between 16 and 30, 210 speakers are between 31 and 45, 63 speakers are between 46 and 60, and 1 speaker is over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- hasVersion: C-000100: OrienTel Morocco MCA (Modern Colloquial Arabic) database
- hasVersion: C-000967: OrienTel Morocco MSA (Modern Standard Arabic) database
- hasVersion: C-000407: OrienTel Tunisia MCA (Modern Colloquial Arabic) database
- hasVersion: C-000969: OrienTel Tunisia MSA (Modern Standard Arabic) database
- hasVersion: C-000405: OrienTel French as spoken in Tunisia database
- hasVersion: C-000406: OrienTel French as spoken in Morocco database
-
C-000408: OrienTel Hebrew database
Telephone
The OrienTel Hebrew database comprises 1000 Hebrew speakers (500 males, 500 females) recorded over the Israeli fixed and mobile telephone network. This database is partitioned into 2 DVDs. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
* 1 isolated single digit
* 1 sequence of 10 isolated digits
* 5 connected digits: 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
* 1 currency money amount
* 2 natural numbers
* 3 dates: 1 prompted date, 1 relative or general date expression, 1 prompted date phrase
* 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
* 3 spelled words: 1 spontaneous (own forename), 1 city name, 1 real word for coverage
* 5 directory assistance utterances: 1 spontaneous (own forename), 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
* 2 yes/no questions: 1 predominantly yes question, 1 predominantly no question
* 6 application keywords/keyphrases
* 1 word spotting phrase using embedded application words
* 4 phonetically rich words
* 9 phonetically rich sentences
* 2 spontaneous items (for control)
The following age distribution has been obtained: 616 speakers are between 16 and 30, 246 speakers are between 31 and 45, and 138 speakers are between 46 and 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000409: OrienTel Arabic as spoken in Israel database
Telephone
The OrienTel Arabic as spoken in Israel database comprises 750 Arabic speakers (375 males, 375 females) recorded over the Israeli fixed and mobile telephone network. This database is partitioned into 2 DVDs. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
* 1 isolated single digit
* 1 sequence of 10 isolated digits
* 5 connected digits: 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
* 2 currency money amounts
* 1 natural number
* 4 dates: 1 prompted date, 1 relative or general date expression, 2 prompted date phrases (1 Western calendar, 1 Islamic calendar)
* 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
* 3 spelled words: 1 spontaneous (own forename), 1 city name, 1 real word for coverage
* 5 directory assistance utterances: 1 spontaneous (own forename), 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
* 2 yes/no questions: 1 predominantly yes question, 1 predominantly no question
* 6 application keywords/keyphrases
* 1 word spotting phrase using embedded application words
* 4+ phonetically rich words
* 9 phonetically rich sentences
* 2 spontaneous items (for control)
* 1 free spontaneous speech
The following age distribution has been obtained: 450 speakers are between 16 and 30, 199 speakers are between 31 and 45, and 101 speakers are between 46 and 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000410: ZipTel
Telephone
The ZipTel telephone speech database contains recordings of people applying for a SpeechDat prompt sheet via telephone. For the SpeechDat data collection, calls for participation were published in "phone", the customer magazine of the mobile telephone provider "e-plus", and in numerous newspapers all over Germany. In these calls, a telephone number was given where callers could order a SpeechDat prompt sheet. The calls were recorded by an automatic telephone server; callers were asked to provide address, ZIP code, city and telephone number.
Total number of recordings: 7746
Total duration: 14h
Format: SpeechDat Exchange Format, SAM, BAS Partitur Format (BPF)
Transliteration: SpeechDat Conventions
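From the totals listed for ZipTel (7746 recordings, 14 hours), a rough average recording length can be derived. The Python sketch below does the arithmetic; it assumes the stated 14 hours is a rounded figure, so the result is only approximate, and `mean_recording_seconds` is a hypothetical helper name.

```python
TOTAL_RECORDINGS = 7746      # as listed in the catalogue entry
TOTAL_DURATION_HOURS = 14    # as listed; presumably rounded


def mean_recording_seconds(recordings: int, hours: float) -> float:
    """Average length of one recording in seconds."""
    return hours * 3600 / recordings


average = mean_recording_seconds(TOTAL_RECORDINGS, TOTAL_DURATION_HOURS)
```

Under that assumption the average comes to roughly 6.5 seconds per recording.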