Language resource #: 3330
C-000991: SpeechDat Speaker Verification database
Desktop/Microphone
This subset of PolyVar (cf. ELRA-S0046) consists of 20 speakers who recorded 50 sessions. The format in use is SAM (A-law).
- hasPart: C-000975: PolyVar
-
C-000992: Strange Corpus 1 - SC1 (ACCENTS)
Desktop/Microphone
The story 'Nordwind und Sonne' read by 72 speakers with foreign accents and 16 native German speakers. The recordings of the latter are phonologically segmented by hand (1 CD-ROM). -
C-000993: Strange Corpus 10 - SC10 ('Accents II')
Desktop/Microphone
70 speakers (67 non-native, 3 native German speakers) - 1 dialogue, 1 re-telling of a German story - transliteration, orthography, canonical transcription.
A collection of speech styles spoken by native and non-native German speakers: read texts, numbers, phonetically balanced sentences, a story, free monologue, and dialogue. -
C-000994: Strange Corpus 2 - SC2 (Noises)
Desktop/Microphone
The corpus contains read speech from 10 male speakers who were prompted on screen with 'automobile diagnosis phrases' and recorded under real conditions in two different car maintenance halls. The language is German. All speakers are native Germans who had never participated in such a task before, and all are experts in the field of car diagnosis. Each speaker produced 800 utterances of 3 to 7 words, derived from 100 different sentences, for a total of 8000 utterances. Noises are manually labelled in the data. The data is stored on 1 CD-ROM.
This corpus may be used for several tasks:
- automatic speech recognition under heavy noise
- investigation of Lombard effects under realistic conditions
- test of robustness against different recording conditions in the field -
C-000995: Swedish Speecon database
Desktop/Microphone
The Swedish Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 550 adult Swedish speakers (270 males, 280 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2) The second set comprises the recordings of 50 child Swedish speakers (25 boys, 25 girls), recorded over 4 microphone channels in 1 recording environment (children room).
This database is partitioned into 23 DVDs (first set) and 3 DVDs (second set).
The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.
Each of the four speech channels is recorded at 16 kHz, 16 bit, as uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file has a corresponding ASCII SAM label file that contains the relevant descriptive information.
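As a rough illustration of the signal format described above, the following Python sketch decodes 16-bit samples stored in lo-hi (little-endian) byte order. It assumes the signal files are headerless raw sample streams, which the description does not state explicitly; the function names `decode_samples` and `duration_seconds` are hypothetical and not part of any Speecon tooling.

```python
import struct


def decode_samples(raw: bytes) -> list[int]:
    """Decode 16-bit unsigned samples in lo-hi (little-endian) byte order.

    Assumption: `raw` is a headerless stream of samples. The format code "H"
    follows the catalog's "unsigned" wording; use "h" if the data turns out
    to be signed.
    """
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes for 16-bit samples")
    count = len(raw) // 2
    return list(struct.unpack(f"<{count}H", raw))


def duration_seconds(raw: bytes, sample_rate: int = 16_000) -> float:
    """Duration of a single-channel recording at the given sample rate."""
    return (len(raw) // 2) / sample_rate


# Two samples: bytes 0x34 0x12 form 0x1234, bytes 0xFF 0x00 form 0x00FF.
samples = decode_samples(bytes([0x34, 0x12, 0xFF, 0x00]))
print(samples)
```

The descriptive information in the accompanying SAM label file should be treated as authoritative for the actual sample type and rate of a given file.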
Each speaker uttered the following items (over 290 items for adults and over 210 items for children):
Calibration data:
6 noise recordings
The silence word recording
Free spontaneous items (adults only):
5 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
17 Elicited spontaneous items (adults only):
3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
Read speech:
30 phonetically rich sentences uttered by adults and 60 uttered by children
5 phonetically rich words (adults only)
4 isolated digits
1 isolated digit sequence
4 connected digit sequences
1 telephone number
3 natural numbers
1 money amount
2 time phrases (T1 : analogue, T2 : digital)
3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
3 letter sequences
1 proper name
2 city or street names
2 questions
2 special keyboard characters
1 Web address
1 email address
208 application specific words and phrases per session (adults)
74 toy commands, 14 phone commands and 34 general commands (children)
The following age distribution has been obtained:
Adults: 227 speakers are between 15 and 30, 175 speakers are between 31 and 45, 100 speakers are between 46 and 60, and 48 speakers are over 60.
Children: 15 speakers are between 8 and 10, 35 speakers are between 11 and 14.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included. -
C-000999: Tagged text in French (MEMODATA) with rules of morphological disambiguation
Written Corpora
More than 170 books (classical novels, legal texts...) are tagged with rules of morphological disambiguation. A tagged corpus of 50 books is available for research. It consists of several authors of the 19th century (Balzac, Hugo, Stendhal).
See also W0011.
- references: Tagged text in French (MEMODATA) W0011
-
C-001000: Turkish Speecon database
Desktop/Microphone
The Turkish Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 550 adult Turkish speakers (280 males, 270 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2) The second set comprises the recordings of 50 child Turkish speakers (25 boys, 25 girls), recorded over 4 microphone channels in 1 recording environment (children room).
This database is partitioned into 28 DVDs (first set) and 4 DVDs (second set).
The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.
Each of the four speech channels is recorded at 16 kHz, 16 bit, as uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file has a corresponding ASCII SAM label file that contains the relevant descriptive information.
Each speaker uttered the following items:
Calibration data:
6 noise recordings
The silence word recording
Free spontaneous items (adults only):
3 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
17 Elicited spontaneous items (adults only):
3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
Read speech:
30 phonetically rich sentences uttered by adults and 60 uttered by children
5 phonetically rich words (adults only)
4 isolated digits
1 isolated digit sequence
4 connected digit sequences
1 telephone number
3 natural numbers
1 money amount
2 time phrases (T1 : analogue, T2 : digital)
3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
3 letter sequences
1 proper name
2 city or street names
2 questions
2 special keyboard characters
1 Web address
1 email address
222 application specific words and phrases per session (adults)
74 toy commands, 14 general commands, 31 phone commands and 4 application word synonyms (children)
The following age distribution has been obtained:
Adults: 244 speakers are between 15 and 30, 235 speakers are between 31 and 45, and 71 speakers are over 46.
Children: 25 speakers are between 8 and 10, 25 speakers are between 11 and 15.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included. -
C-001001: Twin database - TWINDB1
Telephone
The Twin database, named TWINDB1, includes recordings of 45 French speakers: 9 pairs of identical twins (8 males and 10 females) with similar voices, and 27 other speakers (13 males and 14 females), including 4 non-twin siblings. Each twin or sibling spoke for a total of 24 to 30 minutes over three sessions, with an interval of at least one week between sessions.
In each session, subjects were asked to read three different one-page texts. These texts consist of one paragraph of about 10 lines extracted from the journal SVM Mac (July 1994), plus some short phrases, digits, credit card numbers, etc., extracted from the Polyphone French database corpus. The speakers called from their office or from their home. Subjects were recorded over the telephone using an OROS AU32 PC board, in 16-bit linear format at a sampling frequency of 8 kHz. -
C-001002: UK English Speecon database
Desktop/Microphone
The UK English Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 606 adult UK English speakers (325 males, 281 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2) The second set comprises the recordings of 51 child UK English speakers (14 boys, 37 girls), recorded over 4 microphone channels in 1 recording environment (children room).
This database is partitioned into 31 DVDs (first set) and 4 DVDs (second set).
The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.
Each of the four speech channels is recorded at 16 kHz, 16 bit, as uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file has a corresponding ASCII SAM label file that contains the relevant descriptive information.
Each speaker uttered the following items (over 290 items for adults and over 210 items for children):
Calibration data:
6 noise recordings
The silence word recording
Free spontaneous items (adults only):
5 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
17 Elicited spontaneous items (adults only):
3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
Read speech:
30 phonetically rich sentences uttered by adults and 60 uttered by children
5 phonetically rich words (adults only)
4 isolated digits
1 isolated digit sequence
4 connected digit sequences
1 telephone number
3 natural numbers
1 money amount
2 time phrases (T1 : analogue, T2 : digital)
3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
3 letter sequences
1 proper name
2 city or street names
2 questions
2 special keyboard characters
1 Web address
1 email address
208 application specific words and phrases per session (adults)
74 toy commands, 14 phone commands and 34 general commands (children)
The following age distribution has been obtained:
Adults: 321 speakers are between 16 and 30, 182 speakers are between 31 and 45, 103 speakers are over 46.
Children: All 51 speakers are between 11 and 14.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included. -
C-001017: Wolverhampton Business English Corpus
Written Corpora
The WBE was created by the Computational Linguistics Group at the University of Wolverhampton with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335).
A survey of electronic language resources in the business domain carried out at Wolverhampton revealed that there are very few business corpora in existence, and almost none of them are widely accessible. There is significant demand for a business corpus, from both the NLP and pedagogic (language, business communication, and linguistics teachers and students) communities.
The Wolverhampton Corpus of Written Business English is:
- A synchronic corpus, including only texts available on the web during a 6-month period in 1999-2000 AD.
- A monolingual English corpus: it comprises only texts written in English; but no restriction was applied as regards the variety of English used. On the contrary, the WBE deliberately tried to capture a wide range of varieties of English, by including documents from websites in Britain, USA, Pakistan, Netherlands, Belgium, Switzerland, Hong Kong, etc.
- A written corpus: it contains only written materials. However, a few of the documents are transcripts of speeches.
- A business corpus: the texts were selected manually, and care was taken to ensure that all the texts were from the business domain.
The corpus consists of 10,186,259 words from 23 different Web sites.
The data can contribute to a wide range of NLP tasks, including information retrieval, information extraction, summarisation, etc.
The WBE was built using materials solely from the Web. However, this does not mean that the corpus gives access only to a restricted range of categories of texts. On the contrary, the amount of information available online allowed us to select from a wide variety of categories. These range from product descriptions, company press releases, and annual financial reports, to business journalism, academic research papers, political speeches and government reports. The texts have been grouped according to the source site.
The corpus is distributed in three formats.
- The first one is the original encoding of the text. The majority of the texts are in HTML and plain text format. There are a few in PDF format or Microsoft Word DOC format.
- The second format is plain text. The files were converted automatically if they were not in plain text format, and manually checked.
- The corpus is also provided as SGML encoded files, using the Corpus Encoding Standard (http://www.cs.vassar.edu/CES/). The header of each file provides information about the title of the file, length in words, etc. The paragraph and sentence boundaries, and part of speech tags for each word are marked using SGML tags.
All the available files were converted to 8-bit ASCII format using ISO 8859-1. Characters with codes from 127 to 255 (also known as Extended ASCII) were manually checked in order to ensure the correct representation of the characters.
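The manual check of codes 127 to 255 could be supported by a small script that lists every such byte with its ISO 8859-1 character for review. This is only a sketch of one possible aid, not the procedure the corpus builders used; the function name `find_extended_chars` is hypothetical.

```python
def find_extended_chars(data: bytes) -> list[tuple[int, int, str]]:
    """Return (offset, byte value, character) for every byte in 127..255.

    ISO 8859-1 maps each byte to exactly one character, so byte offsets and
    character positions in the decoded text coincide.
    """
    text = data.decode("iso-8859-1")
    return [(i, b, text[i]) for i, b in enumerate(data) if 127 <= b <= 255]


# Illustrative input only; not taken from the corpus.
sample = "café £5".encode("iso-8859-1")
for offset, value, char in find_extended_chars(sample):
    print(offset, value, char)
```

Listing offsets alongside the decoded character makes it easy for a reviewer to locate each occurrence in the source file and confirm that it renders as intended.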
The corpus was checked for spelling errors, but special care was taken to ensure that any variant spellings specific to the business domain were not wrongly corrected.
Validation was carried out by an external validator. It consisted of checking the text files, tools, tagging and documentation.