Language resource #: 3330
-
C-000960: Mandarin-5000 database
Telephone
The MANDARIN-5000 database contains the recordings of 4,752 speakers (2,383 males, 2,369 females) of Mandarin as a first or second language (3,222 native speakers), recorded over the fixed and mobile telephone networks in all provinces of mainland China, including Hong Kong. The breakdown by handset is: fixed network, cordless handset: 513 speakers; fixed network, POT (plain old telephone): 3,558 speakers; mobile network: 491 speakers; undetermined (cordless or mobile): 190 speakers. The database design closely follows the SpeechDat(II) conventions, in particular with respect to the content of the database. The database consists of 1 CD containing all documentation files, including the phonetic lexicon, and 3 DVD-Rs containing the data, i.e. speech files and corresponding transcription files.
Speech samples are stored as sequences of 8-bit 8 kHz A-law, uncompressed. Each prompted utterance is stored in a separate file, and each signal file is accompanied by a transcription file encoded in GB-2312 and ASCII, which contains the orthographic representation (i.e. pictograms) and a phonemic transcription in Pinyin with tones and word boundaries.
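A minimal sketch of how such files could be read, assuming the speech files are headerless streams of A-law bytes (the entry does not say whether a header is present) and the transcription files are GB-2312 text. The function names are illustrative and not part of the database documentation; the expansion follows the standard ITU-T G.711 A-law decoding rule.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a 16-bit linear PCM value (G.711)."""
    code ^= 0x55                       # undo the even-bit inversion
    mantissa = (code & 0x0F) << 4      # 4-bit mantissa
    segment = (code & 0x70) >> 4       # 3-bit segment (exponent)
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit (after inversion) marks a positive sample.
    return value if code & 0x80 else -value


def decode_alaw(data: bytes) -> list[int]:
    """Decode a raw A-law byte stream into a list of linear PCM samples."""
    return [alaw_to_linear(b) for b in data]


def decode_transcription(raw: bytes) -> str:
    """Decode a transcription file's bytes, assuming GB-2312 encoding."""
    return raw.decode("gb2312")
```

At 8 kHz with one byte per sample, one second of speech occupies 8,000 bytes, so `len(data) / 8000` gives the duration in seconds under the same headerless assumption.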
Each speaker uttered the following 54 items:
- 6 isolated application words (25 fixed, 5 free)
- 1 additional application command with a parameter (e.g. name dialling)
- 1 sequence of 10 isolated digits (balanced)
- 6 digit strings (in total balanced for digits, letters, dashes and their transitions)
- 3 dates, 1 of them spontaneous
- 2 word spotting phrases using an application word
- 2 handset information items ("mobile phone", "cordless phone")
- 2 isolated digits
- 2 spelled words (letter sequences)
- 1 currency money amount
- 1 natural plain number (balanced for words and transitions)
- 1 natural number with measure word
- 8 names (persons, spelling, cities, companies), 3 of them spontaneous
- 1 spontaneous train schedule request (origin, destination, date, time)
- 1 spontaneous correction
- 1 spontaneous answer to question for time
- 1 spontaneous answer to question for time or day
- 4 spontaneous answers to questions, including fuzzy yes/no
- 8 phonetically rich sentences (read newspaper text) for training or, alternatively, 8 sentences dictated from a newspaper article for testing
- 1 time of day (spontaneous)
- 1 time phrase (read)
The following age distribution has been obtained: 239 speakers are under 16, 2,391 are between 16 and 30, 1,449 are between 31 and 45, 601 are between 46 and 60, and 32 speakers are over 60. (The age of 40 speakers was not determined.)
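The speaker figures quoted in this entry can be cross-checked against the stated total. The short sketch below only re-adds numbers given in the description above; the variable and function names are illustrative.

```python
TOTAL_SPEAKERS = 4752

# Breakdowns as quoted in the description of the Mandarin-5000 database.
BREAKDOWNS = {
    "gender": [2383, 2369],                    # males, females
    "handset": [513, 3558, 491, 190],          # cordless, POT, mobile, undetermined
    "age": [239, 2391, 1449, 601, 32, 40],     # <16, 16-30, 31-45, 46-60, >60, unknown
}


def check_breakdowns(total: int, breakdowns: dict[str, list[int]]) -> dict[str, bool]:
    """Return, for each breakdown, whether its parts add up to the total."""
    return {name: sum(parts) == total for name, parts in breakdowns.items()}


if __name__ == "__main__":
    for name, ok in check_breakdowns(TOTAL_SPEAKERS, BREAKDOWNS).items():
        print(f"{name}: {'consistent' if ok else 'inconsistent'}")
```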
A pronunciation lexicon with orthographic representation (i.e. pictograms), phonemic transcription in Pinyin with tones and frequency of occurrences is also included.
-
C-000961: Modern French Corpus including Anaphors Tagging
Written Corpora
This corpus, which includes anaphor tagging, was created by the CRISTAL-GRESEC team (Stendhal-Grenoble 3 University, France) and XRCE (Xerox Research Centre Europe, France) in the framework of the call launched by the DGLF-LF (the national institution for the French language and the languages spoken in France) for the creation of modern French corpora.
Over 1 million words have been annotated. The corpora have been selected so that they represent a wide sampling of the French language (scientific and human science articles, extracts from newspapers and magazines, legal texts, etc.) and according to the points of interest of the teams working on the project. The processed corpora supplied by ELRA are listed below:
- Two books edited by the CNRS: La protection des oeuvres scientifiques en droit d'auteur français, Xavier Strubel. Paris, CNRS Editions, 1997 (77 591 words) and Cinquante ans de traction à la SNCF. Enjeux politiques, économiques et réponses techniques, Clive Lamming. Paris, CNRS Editions, 1997 (124 990 words).
- 204 articles extracted from CNRS Info, a magazine which contains short popular scientific articles from the CNRS laboratories (201 280 words).
- 14 articles dealing with Hermès Human Sciences (111 886 words).
- 136 articles extracted from "Le Monde", dealing with economics (roughly 180 760 words).
- 13 booklets of the Official Journal of the European Communities (roughly 337 000 words).
The tagged anaphoric elements are listed below:
- Personal pronouns: anaphoric 3rd person pronouns.
- Possessive determiners: 3rd person possessive determiner.
- Demonstrative pronouns: anaphoric pronouns (celui, celle, ceux, celles-ci, celles-là)
- Indefinite pronouns: Aucun(e), chacun(e), certain(e)s, l'un(e), les un(e)s, tout(es), etc., when they are anaphoric.
- Pro-verbs: "le" + "faire".
- Anaphoric and cataphoric adverbs: Dessus, dedans, dessous, when they have an anaphoric function.
- Ellipsis of head nouns: Nominal adjective or quantifier determiner ellipsis.
- Textual headers like "ce dernier": Ce dernier, le premier, etc.
The annotation scheme was defined in XML format. The texts were divided into sections, paragraphs and sentences. The sentence segmentation was carried out with NLP tools developed by XRCE, while the annotation was done manually by two qualified linguists. A large subset of anaphoric phrases was automatically pre-annotated. The antecedents and the tagging of the anaphoric relations were processed manually, with editing tools (emacs, macros from Author/Editor software) used to make the task easier. 5% of the corpora were checked to measure the annotation reliability.
-
C-000963: Offensive Word Filter 1
Monolingual Lexicons
Oxford University Press has developed two lists of offensive words and expressions, specifically developed for filter applications in the contexts of web pages and email.
Each list features a grading system describing vocabulary type and offensive strength for each term, plus collocational information to help identify the terms in context.
Coverage: 4500 words and expressions; 10-category classification system; UK and US usage covered, plus other world English
Features: graded by class (offensive/vulgar), and type (racist, sexist etc); rated by strength (high/moderate/mild); part of speech included; morphological status marked (standalone/fixed collocation etc); collocational information included; practical screening recommendation
Format: tab-delimited ASCII
File Size: 262 kB
- hasPart: C-000964: Offensive Word Filter 2
-
C-000964: Offensive Word Filter 2
Monolingual Lexicons
Oxford University Press has developed two lists of offensive words and expressions, specifically developed for filter applications in the contexts of web pages and email. Each list features a grading system describing vocabulary type and offensive strength for each term, plus collocational information to help identify the terms in context.
Coverage: over 2000 words and expressions; 13-category classification system; US and UK usage covered
Features: graded by category/subcategory (e.g. abusive/sexist etc.); rated by strength (extreme/moderate/mild); collocational information included; regional usage/source labelling; glosses for obscure senses
Format: Excel spreadsheet
File Size: 237 kB
- hasPart: C-000963: Offensive Word Filter 1
-
C-000966: OrienTel French as spoken in Morocco database
Telephone
The OrienTel French as spoken in Morocco database comprises 530 Moroccan speakers of French (264 males, 266 females) recorded over the Moroccan fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
1 isolated single digit
1 sequence of 10 isolated digits
5 connected digits : 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
1 currency money amount
2 natural numbers
3+1 dates : 1 prompted date, 1 relative or general date expression, 1 prompted date phrase + 1 additional (Western calendar)
2 time phrases : 1 time of day (spontaneous), 1 time phrase (word style)
3 spelled words : 1 spontaneous (own forename), 1 city name, 1 real word for coverage
5 directory assistance utterances : 1 own forename (spontaneous), 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
2 yes/no questions : 1 predominantly yes question, 1 predominantly no question
6 application keywords/keyphrases
1 word spotting phrase using embedded application words
4 phonetically rich words
9 phonetically rich sentences
2 spontaneous items (for control)
The following age distribution has been obtained: 256 speakers are between 16 and 30, 210 speakers are between 31 and 45, 63 speakers are between 46 and 60, 1 speaker is over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000967: OrienTel Morocco MSA (Modern Standard Arabic) database
Telephone
The OrienTel Morocco MSA (Modern Standard Arabic) database comprises 530 Moroccan speakers (264 males, 266 females) recorded over the Moroccan fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
1 isolated single digit
2 sequences of 5 isolated digits
7+1 connected digits : 1 prompt sheet number (6 digits), 6 strings of 4 digits in written format, +1 prompt sheet number in digits
2 currency money amounts
2 natural numbers
3 dates : 1 prompted date, 1 relative or general date expression, 1 prompted date phrase (Islamic calendar)
1 time phrase
2 spelled words : string of 4 letter sequences
3 directory assistance utterances : 1 frequent city name, 1 frequent company name, 1 personal name (first name and family name)
2 yes/no questions : 1 predominantly yes question, 1 predominantly no question
6 application keywords/keyphrases
1 word spotting phrase using embedded application words
4 phonetically rich words
9 phonetically rich sentences
4 spontaneous items (for control)
The following age distribution has been obtained: 1 speaker is below 16, 260 speakers are between 16 and 30, 174 speakers are between 31 and 45, 92 speakers are between 46 and 60, 3 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000968: OrienTel Tunisia MCA (Modern Colloquial Arabic) database
This speech database contains the recordings of 792 Tunisian speakers recorded over the Tunisian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
- : C-000407: OrienTel Tunisia MCA (Modern Colloquial Arabic) database
- : C-000969: OrienTel Tunisia MSA (Modern Standard Arabic) database
- : OrienTel French as spoken in Tunisia database B0005
-
C-000969: OrienTel Tunisia MSA (Modern Standard Arabic) database
Telephone
The OrienTel Tunisia MSA (Modern Standard Arabic) database comprises 598 Tunisian speakers (359 males, 239 females) recorded over the Tunisian fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
1 isolated single digit
2 sequences of 5 isolated digits
7+1 connected digits : 1 prompt sheet number (6 digits), 6 strings of 4 digits in written format, +1 prompt sheet number in digits
2 currency money amounts
2 natural numbers
3 dates : 1 prompted date, 1 relative or general date expression, 1 prompted date phrase (Islamic calendar)
1 time phrase
2 spelled words : string of 4 letter sequences
3 directory assistance utterances : 1 frequent city name, 1 frequent company name, 1 personal name (first name and family name)
2 yes/no questions : 1 predominantly yes question, 1 predominantly no question
6 application keywords/keyphrases
1 word spotting phrase using embedded application words
4 phonetically rich words
9 phonetically rich sentences
4+1 spontaneous items (for control)
The following age distribution has been obtained: 2 speakers are below 16, 441 speakers are between 16 and 30, 101 speakers are between 31 and 45, 54 speakers are between 46 and 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- hasVersion: OrienTel Tunisia MCA (Modern Colloquial Arabic) database B0005
-
C-000972: PRESS 65
Written Corpora
Språkdata has made available the first of its many Swedish corpora, PRESS 65. It consists of one million running words taken from Swedish newspapers from the year 1965. It has been categorised according to text type and is annotated down to the sentence level.
-
C-000973: Phonetically Balanced Words (1)
Desktop/Microphone
A large acoustic corpus of read text in Korean. 2 announcers and 70 native speakers (38 males, 32 females), distributed across 4 age classes, were recorded. The speakers read a set of 452 eojeols (Korean terms) twice, and the 2 announcers read a set of 2,000 eojeols once. The 2,000 eojeols include the 452 eojeols above.
Other information, such as the size and the level of studies of the speakers, is also provided. The recordings took place in a soundproof room. The data are stored as 8-bit A-law speech files with a 16 kHz sampling rate. The file standard in use is NIST.
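A minimal sketch of reading the header of such a file, assuming that "NIST" here refers to the NIST SPHERE layout (an ASCII header starting with `NIST_1A`, followed by the header size and `name -type value` lines up to `end_head`). The sample header below is made up to mirror the description (16 kHz, A-law) and is not copied from the corpus; the function name is illustrative.

```python
def parse_nist_header(data: bytes) -> tuple[dict[str, object], int]:
    """Parse a NIST SPHERE header; return (fields, header size in bytes)."""
    lines = data.split(b"\n", 2)
    if len(lines) < 3 or lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(lines[1].strip())
    header_text = data[:header_size].decode("ascii", errors="replace")

    fields: dict[str, object] = {}
    for line in header_text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3 or line.startswith(";"):
            continue                      # skip blank lines and comments
        name, type_flag, value = parts
        if type_flag == "-i":
            fields[name] = int(value)     # integer field
        elif type_flag == "-r":
            fields[name] = float(value)   # real-valued field
        else:
            fields[name] = value          # string field, e.g. "-s4 alaw"
    return fields, header_size


# Hypothetical header for illustration only.
SAMPLE_HEADER = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 16000\n"
    b"sample_coding -s4 alaw\n"
    b"channel_count -i 1\n"
    b"end_head\n"
).ljust(1024, b" ")
```

The speech samples start at the byte offset given by the returned header size, so `data[header_size:]` would be the A-law payload under these assumptions.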