Language resource #: 3330
-
C-000110: PAROLE-SIMPLE-CLIPS PISA Italian Lexicon Phonetic layer
Monolingual Lexicons
This lexicon is subdivided into five different subsets:
L0072-01 Full lexicon
L0072-02 Phonetic layer
L0072-03 Morphological layer
L0072-04 Syntactic layer
L0072-05 Semantic layer
PAROLE-SIMPLE-CLIPS is a four-level, general purpose lexicon that has been elaborated over three different projects. The kernel of the morphological and syntactic lexicons was built in the framework of the LE-PAROLE project. The linguistic model and the core of the semantic lexicon were elaborated in the LE-SIMPLE project, while the phonological level of description and the extension of the lexical coverage were performed in the context of the Italian project Corpora e Lessici dell'Italiano Parlato e Scritto (CLIPS).
The PAROLE-SIMPLE-CLIPS Pisa Italian Lexicon comprises a total of 387,267 phonetic units, 53,044 morphological units (53,044 lemmas), 37,406 syntactic units (28,111 lemmas) and 28,346 semantic units (19,216 lemmas). It was encoded at the semantic level, in full accordance with the international standards set out in the PAROLE-SIMPLE model and based on EAGLES. Syntactic and semantic encoding were performed jointly with Thamus (Consortium for Multilingual Documentary Engineering), which is responsible for 25,000 extra entries (to be released soon).
PAROLE-SIMPLE-CLIPS offers therefore the advantage of being compatible with the other eleven PAROLE-SIMPLE lexicons that were built for European languages and that share a common theoretical model, representation language and building methodology.
A PAROLE-SIMPLE-CLIPS entry gathers together all the phonological, morphological and inherent syntactic and semantic properties of a headword. Its subcategorization pattern(s) are described in terms of optionality, syntactic function, syntagmatic realization as well as morpho-syntactic, syntactic and lexical properties of each slot filler. At the semantic level, the theoretical approach adopted by the SIMPLE model is essentially grounded on a revisited version of some fundamental aspects of the Generative Lexicon.
A SIMPLE-CLIPS semantic unit is richly endowed with a wide range of fine-grained, structured information, most relevant for NLP applications. First among them, the ontological typing: the lexicon is in fact structured in terms of a multidimensional type system based on both hierarchical and non-hierarchical conceptual relations, taking into account the principle of orthogonal inheritance. Other relevant information types in a word entry are its domain of use; type of denoted event; synonymy and morphological derivation relations; membership in a class of regular polysemy as well as any relevant distinctive semantic features. Particularly outstanding is the information encoded in the Extended Qualia Structure (a set of 60 semantic relations that allow modelling both the different meaning dimensions of a word sense and its relationships to other lexical units) and the Predicative Representation which describes the semantic scenario the word sense considered is involved in and characterizes its participants in terms of thematic roles and semantic constraints.
In a word's description, lexical information is interrelated across the four description levels. Syntactic and semantic information, in particular, is related to each other through the projection of the predicate-argument structure onto its syntactic realization(s).
References:
Ruimy N., Corazzari O., Gola E., Spanu A., Calzolari N., Zampolli A. 2003. The PAROLE model and the Italian Syntactic lexicon. In A. Zampolli, N. Calzolari, L. Cignoni, (eds.), Computational Linguistics in Pisa - Linguistica Computazionale a Pisa. Linguistica Computazionale, Special Issue, XVIII-XIX, (2003). Pisa-Roma, IEPI. Tomo II, 793-820.
Lenci A., Busa F., Ruimy N., Gola E., Monachini M., Calzolari N., Zampolli A. et al., 2000. SIMPLE Linguistic Specifications, SIMPLE LE4-8346 EC Project, Deliverable D2.1 & D2.2, WP02, Final version, March 2000, ILC and University of Pisa, 404 pp. (http://www.ub.es/gilcub/SIMPLE/simple.html#Specifications).
Ruimy N., Monachini M., Gola E., Calzolari N., Del Fiorentino M.C., Ulivieri M., Rossi S. 2003. A computational semantic lexicon of Italian: SIMPLE. In A. Zampolli, N. Calzolari, L. Cignoni, (eds.), Computational Linguistics in Pisa - Linguistica Computazionale a Pisa. Linguistica Computazionale, Special Issue, XVIII-XIX, (2003). Pisa-Roma, IEPI. Tomo II, 821-864.
Ruimy N., Monachini M., Distante R., Guazzini E., Molino S., Ulivieri M., Calzolari N., Zampolli A. 2002. CLIPS, A Multi-level Italian Computational Lexicon: a Glimpse to Data. LREC 2002. Las Palmas de Gran Canaria, Spain, 29-31 May 2002. Proceedings, Volume III, Paris, The European Languages Resources Association (ELRA). 792-799.
- isPartOf: PAROLE-SIMPLE-CLIPS PISA Italian Lexicon – Full lexicon
- isPartOf: PAROLE-SIMPLE-CLIPS PISA Italian Lexicon – Morphological layer
- isPartOf: PAROLE-SIMPLE-CLIPS PISA Italian Lexicon – Syntactic layer
- isPartOf: PAROLE-SIMPLE-CLIPS PISA Italian Lexicon – Semantic layer
-
C-000117: Polish SpeechDat(E) Database
Telephone
The Polish SpeechDat(E) Database (Eastern European Speech Databases for Creation of Voice Driven Teleservices) comprises 1000 Polish speakers (488 males, 512 females) recorded over the Polish fixed telephone network. This database is partitioned into 5 CDs, each of which comprises 200 speaker sessions. The speech databases made within the SpeechDat(E) project were validated by SPEX, the Netherlands, to assess their compliance with the SpeechDat(E) format and content specifications.
The speech files are stored as sequences of 8-bit, 8kHz A-law speech files and are not compressed, according to the specifications of SpeechDat(E). Each prompt utterance is stored within a separate file and has an accompanying ASCII SAM label file.
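The storage format described above can be illustrated with a minimal sketch that expands 8-bit A-law codes to linear PCM values using the standard G.711 expansion. It assumes the signal files are headerless (raw) A-law bytes, which the entry does not state explicitly; the function names `alaw_to_linear`, `decode_alaw` and `duration_seconds` are hypothetical and not part of any SpeechDat(E) tooling.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a signed linear PCM value."""
    code ^= 0x55                     # undo the even-bit inversion applied by the encoder
    mantissa = (code & 0x0F) << 4    # 4 quantisation bits
    segment = (code & 0x70) >> 4     # 3 segment (exponent) bits
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit marks a positive sample.
    return value if code & 0x80 else -value


def decode_alaw(raw: bytes) -> list[int]:
    """Decode a buffer of A-law bytes (assumed headerless) to PCM values."""
    return [alaw_to_linear(b) for b in raw]


def duration_seconds(raw: bytes, sample_rate: int = 8000) -> float:
    """One byte per sample at 8 kHz, so duration is simply length / rate."""
    return len(raw) / sample_rate
```

Because each sample occupies exactly one byte at 8 kHz, one second of speech takes 8000 bytes, which makes file sizes easy to sanity-check against utterance lengths.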
Corpus contents:
- 6 application words;
- 1 sequence of 10 isolated digits;
- 4 connected digits: 1 sheet number (5 digits), 1 telephone number (8-11 digits), 1 credit card number (15-16 digits), 1 PIN code (6 digits);
- 3 dates: 1 spontaneous date (birthday), 1 prompted date (word style), 1 relative and general date expression;
- 1 spotting phrase using an application word (embedded);
- 1 isolated digit;
- 3 spelled-out words (letter sequences): 1 spelling of surname; 1 spelling of directory assistance city name; 1 real/artificial name for coverage;
- 2 currency money amounts: 1 Polish money amount, 1 international money amount (USD, EURO);
- 1 natural number;
- 6 directory assistance names: 1 surname (out of 500); 1 city of birth / growing up (spontaneous); 1 most frequent city (out of 500); 1 most frequent company/agency (out of 500); 1 "forename surname" (set of 150); 1 "surname" (set of 150);
- 2 questions, including "fuzzy" yes/no: 1 predominantly "yes" question, 1 predominantly "no" question;
- 12 phonetically rich sentences;
- 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style);
- 4 phonetically rich words.
The following age distribution has been obtained: 9 speakers are below 16 years old, 428 speakers are between 16 and 30, 291 speakers are between 31 and 45, 254 speakers are between 46 and 60, and 18 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000118: Portuguese SpeechDat(II) FDB-4000
Telephone
The Portuguese SpeechDat(II) FDB-4000 comprises 4027 Portuguese speakers (1861 males, 2166 females) recorded over the Portuguese fixed telephone network. This database is partitioned into 11 CDs. The speech databases made within the SpeechDat(II) project were validated by SPEX, the Netherlands, to assess their compliance with the SpeechDat format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
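As a rough illustration of how the descriptive information in a label file might be read, the sketch below parses an ASCII file laid out as one `KEY: value` pair per line. That layout, the function name `parse_sam_label`, and the sample keys and values are all assumptions made for illustration; the SpeechDat format specification should be consulted for the actual mnemonics and their meaning.

```python
def parse_sam_label(text: str) -> dict[str, list[str]]:
    """Collect 'KEY: value' lines into a dict; repeated keys keep every value."""
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition(":")  # split on the first colon only
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


# Hypothetical label content, not taken from the database.
example = "SAM: 8000\nSNB: 1\nLBO: 0,,,yes\n"
label = parse_sam_label(example)
```

Keeping a list per key is a deliberate choice here: it avoids silently dropping information if a key occurs more than once in a file.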
Each speaker uttered the following items:
- 1 isolated single digit
- 1 sequence of 10 isolated digits
- 4 numbers : 1 sheet number (5+ digits), 1 telephone number (9-11 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits)
- 1 currency money amount
- 1 natural number
- 3 dates : 1 spontaneous (date or year of birth), 1 prompted date, 1 relative or general date expression
- 2 time phrases : 1 time of day (spontaneous), 1 time phrase (word style)
- 3 spelled words : 1 spontaneous (own forename), 1 city name, 1 real word for coverage
- 5 directory assistance utterances : 1 spontaneous, own forename, 1 city of birth / growing up (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
- 2 yes/no questions : 1 predominantly "yes" question, 1 predominantly "no" question
- 3 application words
- 1 keyword phrase using an embedded application word
- 4 phonetically rich words
- 9 phonetically rich sentences
The following age distribution has been obtained: 241 speakers are below 16 years old, 1404 speakers are between 16 and 30, 1532 speakers are between 31 and 45, 711 speakers are between 46 and 60, and 139 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000119: Portuguese SpeechDat(M) database
Telephone
The Portuguese SpeechDat(M) database contains the recordings of 1,001 speakers (453 males, 548 females). This speech database was collected by Portugal Telecom within the European SpeechDat project.
Speech signals are stored as sequences of 8 kHz, 8-bit A-law. Files are stored according to the file specifications proposed in the SpeechDat database format specification. The file formats and headers follow the SAM recommendations (header files separated from signal files).
This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SpeechDat format and content specifications.
Each speaker uttered the following items:
* 3 natural numbers
* 1 isolated digit
* 2 connected digits (1 credit card number, 1 telephone number)
* 2 money amounts
* 2 dates
* 1 time phrase
* 6 application words
* 3 spelled-out words
* 3 word spotting phrases
* 9 sentences
* 4 yes/no questions
* 1 spontaneous date
* 1 spontaneous time
* 1 region name
The approach adopted for speaker recruitment involved selecting speakers among the employees of Portugal Telecom (about 20,000) and their relatives. The company has a wide geographical coverage, thus guaranteeing a good representation of many regional accents.
The following age distribution has been obtained: 12 speakers are under 16, 345 speakers are between 17 and 30, 436 speakers are between 31 and 45, 196 speakers are between 46 and 60 and 8 speakers are over 60; the age of two speakers is unknown and two others said they were born in 1996.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- isVersionOf: C-001523: Spanish SpeechDat(M) - DB1
-
C-000120: Portuguese Speecon database
Desktop/Microphone
The Portuguese Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 553 adult Portuguese speakers (266 males, 287 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2) The second set comprises the recordings of 52 child Portuguese speakers (19 boys, 33 girls), recorded over 4 microphone channels in 1 recording environment (children's room).
This database is partitioned into 29 DVDs (first set) and 4 DVDs (second set).
The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.
Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
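A minimal sketch of reading one such channel file follows. It assumes the file is headerless and contains only 16-bit samples in lo-hi (little-endian) byte order; the function name `read_samples` and the `signed` switch are hypothetical. The default follows the description above (unsigned integers); the switch is there only in case a given file turns out to hold signed samples.

```python
import struct


def read_samples(raw: bytes, signed: bool = False) -> list[int]:
    """Unpack 16-bit little-endian samples from a headerless byte buffer."""
    if len(raw) % 2:
        raise ValueError("buffer length must be a multiple of 2 bytes")
    fmt = "<h" if signed else "<H"  # '<' = little-endian (lo-hi byte order)
    return [sample for (sample,) in struct.iter_unpack(fmt, raw)]


def duration_seconds(raw: bytes, sample_rate: int = 16000) -> float:
    """Two bytes per sample at 16 kHz."""
    return len(raw) / 2 / sample_rate
```

At 16 kHz and two bytes per sample, one second of audio on one channel occupies 32,000 bytes.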
Each speaker uttered the following items:
Calibration data:
6 noise recordings
The silence word recording
Free spontaneous items (adults only):
3 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
17 Elicited spontaneous items (adults only):
3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
Read speech:
30 phonetically rich sentences uttered by adults and 60 uttered by children
5 phonetically rich words (adults only)
4 isolated digits
1 isolated digit sequence
4 connected digit sequences
1 telephone number
3 natural numbers
1 money amount
2 time phrases (T1 : analogue, T2 : digital)
3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
3 letter sequences
1 proper name
2 city or street names
2 questions
2 special keyboard characters
1 Web address
1 email address
213 application specific words and phrases per session (adults)
74 toy commands, 34 general commands, 14 phone commands and 26 command word synonyms (children)
The following age distribution has been obtained:
Adults: 270 speakers are between 15 and 30, 193 speakers are between 31 and 45, and 90 speakers are over 45.
Children: 15 speakers are between 8 and 10, 37 speakers are between 11 and 14.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000123: RASC863-annotated 4 regional accent speech corpus(II)
RASC863 consists of two parts: natural spoken language (spoken monologue and answers to familiar questions) and read speech (phonetically balanced sentences, frequently used spoken-language sentences and frequently used dialect vocabulary). The other parts include answers to 15 questions, 23 frequently used spoken-language sentences, a large amount of dialect vocabulary and 110 phonetically balanced sentences. Except for the dialect vocabulary, all sentences are labeled with words and spellings. Names, telephone numbers, internet addresses, dates, etc. are all collected from each speaker's answers to the questions. Each phonetically balanced sentence is at most 30 syllables long, and a portion of them is acquired from transferred speech of chat dialogues.
http://www.chineseldc.org/EN/doc/CLDC-SPC-2004-004/intro.htm
- hasVersion: RASC863-annotated 4 regional accent speech corpus(Ⅰ)
- hasVersion: RASC863-annotated 4 regional accent speech corpus(Ⅲ)
-
C-000124: Russian SpeechDat(E) Database
Telephone
The Russian SpeechDat(E) Database (Eastern European Speech Databases for Creation of Voice Driven Teleservices) comprises 2500 Russian speakers (1242 males, 1258 females) recorded over the Russian fixed telephone network. This database is partitioned into 13 CDs. The speech databases made within the SpeechDat(E) project were validated by SPEX, the Netherlands, to assess their compliance with the SpeechDat(E) format and content specifications.
The speech files are stored as sequences of 8-bit, 8kHz A-law speech files and are not compressed, according to the specifications of SpeechDat(E). Each prompt utterance is stored within a separate file and has an accompanying ASCII SAM label file.
Corpus contents:
- 6 application words;
- 1 sequence of 10 isolated digits;
- 4 connected digits: 1 sheet number (5 digits), 1 telephone number (9-10 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits);
- 3 dates: 1 spontaneous date (birthday), 1 prompted date (word style), 1 relative and general date expression;
- 1 spotting phrase using an application word (embedded);
- 1 isolated digit;
- 3 spelled-out words (letter sequences): 1 spelling of surname; 1 spelling of directory assistance city name; 1 real/artificial name for coverage;
- 2 currency money amounts: 1 Russian money amount, 1 international money amount (USD, EURO);
- 1 natural number;
- 6 directory assistance names: 1 spontaneous (own forename); 1 city of birth / growing up (spontaneous); 1 most frequent city (out of 500); 1 most frequent company/agency (out of 500); 1 "forename surname" (set of 150); 1 "surname" (set of 150);
- 2 questions, including "fuzzy" yes/no: 1 predominantly "yes" question, 1 predominantly "no" question;
- 9 phonetically rich sentences;
- 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style);
- 4 phonetically rich words.
The following age distribution has been obtained: 10 speakers are below 16 years old, 854 speakers are between 16 and 30, 858 speakers are between 31 and 45, 679 speakers are between 46 and 60, 34 speakers are over 60, and 65 speakers are of unknown age.
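The figures quoted for this database can be cross-checked with a few lines of arithmetic. The sketch below is only an illustrative sanity check using the numbers stated in this entry; the helper name `breakdowns_match` is hypothetical.

```python
def breakdowns_match(total: int, *breakdowns: list[int]) -> bool:
    """Return True if every breakdown sums to the stated speaker total."""
    return all(sum(group) == total for group in breakdowns)


total_speakers = 2500
by_gender = [1242, 1258]               # males, females
by_age = [10, 854, 858, 679, 34, 65]   # <16, 16-30, 31-45, 46-60, >60, unknown

consistent = breakdowns_match(total_speakers, by_gender, by_age)
```

Both breakdowns sum to 2500, so the gender and age figures are consistent with the stated number of speakers.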
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000125: SALA II Spanish Mobile Network Database collected in Venezuela
Telephone
The SALA II Spanish Mobile Network Database collected in Venezuela was recorded within the scope of the SALA II project.
The SALA II Spanish Venezuelan database contains the recordings of 1,179 Venezuelan speakers (576 males and 603 females) recorded over the Venezuelan mobile telephone network.
The following acoustic conditions were selected as representative of a mobile user's environment:
* Passenger in moving car (160 speakers)
* Public place (461 speakers)
* Stationary pedestrian by road side (236 speakers)
* Home/Office environment (272 speakers)
* Passenger in moving car using a hands-free kit (50 speakers)
This database is distributed as 1 DVD-ROM. The speech files are stored as sequences of 8-bit, 8kHz A-law speech files and are not compressed, according to the specifications of SALA II. Each prompt utterance is stored within a separate file and has an accompanying ASCII SAM label file. This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SALA II format and content specifications.
Each speaker uttered the following items:
* 6 application words
* 1 sequence of 10 isolated digits
* 4 connected digits (1 sheet number of 6 digits, 1 telephone number of 9-11 digits, 1 credit card number of 14-16 digits, 1 PIN code of 6 digits)
* 3 dates (1 spontaneous date e.g. birthday, 1 word style prompted date, 1 relative and general date expression)
* 1 spotting phrase using an embedded application word
* 1 isolated digit
* 3 spelled words (1 surname, 1 directory assistance city name, 1 real/artificial name for coverage)
* 1 currency money amount
* 1 natural number
* 5 directory assistance names (1 surname out of a set of 500, 1 city of birth/growing up, 1 most frequent city out of a set of 500, 1 most frequent company/agency out of a set of 500, 1 "forename surname" out of a set of 150)
* 2 yes/no questions (1 predominantly yes question, 1 predominantly no question)
* 9 phonetically rich sentences
* 2 time phrases (1 spontaneous time of day, 1 word style time phrase)
* 4 phonetically rich words
The following age distribution has been obtained: 7 speakers are under 16, 624 speakers are between 16 and 30, 368 speakers are between 31 and 45, 160 speakers are between 46 and 60, and 20 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000126: SALA II Spanish from Mexico database
Telephone
The SALA II Spanish from Mexico database collected in Mexico was recorded within the scope of the SALA II project.
The SALA II Spanish from Mexico database contains the recordings of 1,075 Mexican speakers (539 males and 536 females) recorded over the Mexican mobile telephone network.
The following acoustic conditions were selected as representative of a mobile user's environment:
* Passenger in moving car, railway, bus, etc. (155 speakers)
* Public place (279 speakers)
* Stationary pedestrian by road side (223 speakers)
* Home/office environment (364 speakers)
* Passenger in moving car using a hands-free kit (54 speakers)
This database is distributed as 1 DVD-ROM. The speech files are stored as sequences of 8-bit, 8kHz A-law speech files and are not compressed, according to the specifications of SALA II. Each prompt utterance is stored within a separate file and has an accompanying ASCII SAM label file.
This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SALA II format and content specifications.
Each speaker uttered the following items:
* 6 application words
* 1 sequence of 10 isolated digits
* 4 connected digits (1 sheet number of 6 digits, 1 telephone number of 9-11 digits, 1 credit card number of 14-16 digits, 1 PIN code of 6 digits)
* 3 dates (1 spontaneous date e.g. birthday, 1 word style prompted date, 1 relative and general date expression)
* 2 spotting phrases using an embedded application word
* 2 isolated digits
* 3 spelled words (1 surname, 1 directory assistance city name, 1 real/artificial name for coverage)
* 1 currency money amount
* 1 natural number
* 5 directory assistance names (1 surname out of a set of 500, 1 city of birth/growing up, 1 most frequent city out of a set of 500, 1 most frequent company/agency out of a set of 500, 1 "forename surname" out of a set of 150)
* 2 yes/no questions (1 predominantly "yes" question, 1 predominantly "no" question)
* 9 phonetically rich sentences
* 2 time phrases (1 spontaneous time of day, 1 word style time phrase)
* 4 phonetically rich words
The following age distribution has been obtained: 7 speakers are under 16, 643 speakers are between 16 and 30, 248 speakers are between 31 and 45, 169 speakers are between 46 and 60, and 8 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000127: SALA Spanish Mexican Database
Telephone
The SALA Spanish Mexican Database comprises 1260 Mexican speakers (554 males, 706 females) recorded over the Mexican fixed telephone network. This database is partitioned into 7 CD-ROMs. The speech databases made within the SALA project were validated by SPEX, the Netherlands, to assess their compliance with the SALA format and content specifications.
The speech files are stored as sequences of 8-bit, 8kHz A-law speech files and are not compressed, according to the specifications of SALA. Each prompt utterance is stored within a separate file and has an accompanying ASCII SAM label file.
Each speaker uttered the following items:
* 6 application words;
* 1 sequence of 10 isolated digits;
* 4 connected digits: 1 sheet number (6 digits), 1 telephone number (9-11 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits);
* 3 dates: 1 spontaneous date (e.g. birthday), 1 prompted date (word style), 1 relative and general date expression;
* 1 spotting phrase using an application word (embedded);
* 1 isolated digit;
* 3 spelled-out words (letter sequences): 1 spelling of surname; 1 spelling of directory assistance city name; 1 real/artificial name for coverage;
* 1 currency money amount;
* 1 natural number;
* 5 directory assistance names: 1 surname (out of 500); 1 city of birth / growing up (spontaneous); 1 most frequent city (out of 500); 1 most frequent company/agency (out of 500); 1 "forename surname" (set of 150);
* 2 questions, including "fuzzy" yes/no: 1 predominantly "yes" question, 1 predominantly "no" question;
* 9 phonetically rich sentences;
* 9 additional spontaneous items
* 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style);
* 4 phonetically rich words.
The following age distribution has been obtained: 20 speakers are under 16 years old, 801 speakers are between 16 and 30, 291 speakers are between 31 and 45, 124 speakers are between 46 and 60, and 24 speakers are over 60. A phonetic lexicon with canonical transcriptions in SAMPA is also provided.