Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3,330. Showing items 31 - 40 of 2,023.

C-000067: Flemish SpeechDat(II) FDB-1000
Telephone
The Flemish SpeechDat(II) FDB-1000 database contains the recordings of 1,023 Flemish speakers (461 males, 562 females) recorded over the Belgian fixed telephone network. This database is partitioned into 4 CDs, which comprise 250 speaker sessions each.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file, which contains the relevant descriptive information.
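The storage format described above can be illustrated with a minimal Python sketch that expands 8-bit A-law bytes to 16-bit linear PCM using the standard G.711 expansion. It assumes the signal data is a raw, headerless byte string; the function names `alaw_to_linear` and `decode_alaw` are illustrative and are not part of any SpeechDat tooling.

```python
def alaw_to_linear(byte: int) -> int:
    """Decode one G.711 A-law byte to a signed 16-bit linear PCM value."""
    byte ^= 0x55                      # undo the even-bit inversion applied by the encoder
    mantissa = (byte & 0x0F) << 4     # 4-bit mantissa, shifted into position
    segment = (byte & 0x70) >> 4      # 3-bit segment number (exponent)
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    return value if byte & 0x80 else -value   # sign bit set means positive


def decode_alaw(data: bytes, sample_rate: int = 8000):
    """Return (samples, duration in seconds) for headerless A-law data.

    Assumption: one byte per sample, 8000 samples per second.
    """
    samples = [alaw_to_linear(b) for b in data]
    return samples, len(data) / sample_rate


# Synthetic input: 8000 bytes, i.e. one second of audio at 8 kHz.
samples, seconds = decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A]) * 2000)
print(samples[:4], seconds)   # [8, -8, 32256, -32256] 1.0
```

Because each sample occupies one byte, the duration of a signal file in seconds is simply its size in bytes divided by 8000.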

This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SpeechDat format and content specifications.

Each speaker uttered the following items. Each phrase or word was repeated about 5 times.

- 7 application words
- 4 isolated digits
- 1 sequence of 10 isolated digits
- 5 connected digits (1 area code, 1 spontaneous phone number, 1 credit card number 15/16 digits, etc.)
- 3 dates (1 spontaneous date e.g. birthday, 1 prompted date, 1 general and relative date expression)
- 1 embedded application word
- 4 spelled words
- 1 currency money amount
- 1 natural number
- 6 directory assistance names (1 forename, 1 city of birth, 1 most frequent city, 1 city name, 1 company name, 1 "forename surname")
- 2 yes/no questions (1 predominantly "yes" question, 1 predominantly "no" question)
- 10 phonetically rich sentences
- 2 time phrases (1 spontaneous time of day, 1 time phrase)
- 5 phonetically rich words
The following age distribution has been obtained: 22 speakers are under 16, 387 speakers are between 16 and 30, 306 speakers are between 31 and 45, 240 speakers are between 46 and 60, and 68 speakers are over 60.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000072: German SpeechDat(II) FDB-4000
Telephone
The German SpeechDat(II) FDB-4000 consists of 4,000 calls over the German fixed network, stored on 17 CD-ROMs in the final SpeechDat(II) database exchange format. The speech databases made within the SpeechDat(II) project were validated by SPEX, the Netherlands, to assess their compliance with the SpeechDat format and content specifications.

The following items were recorded:
- 1 isolated digit
- 1 sequence of 10 isolated digits
- prompt sheet number = 5
- 9-11 digit telephone number (read)
- 15-16 digit credit card number (read, 150 different credit card numbers were found)
- 6 digit PIN code (read)
- 1 natural number (read)
- 1 money amount (read)
- 2 yes/no questions (spontaneous, not prompted)
- 3 dates (1 spontaneous, e.g. birthday; 1 prompted text form; 1 relative and general date form)
- 1 time of day (spontaneous)
- 1 time phrase (read)
- 3 application words
- 1 word spotting phrase
- 5 directory assistance names (1 spontaneous name, e.g. forename; 1 spontaneous city name; 1 read city name, from a list of 500 most frequent; 1 read company/agency name, from a list of 500 most frequent; 1 read proper name, fore- and surname, from a list of 150 SDB names)
- 3 spellings (1 spontaneous, e.g. forename; 1 directory city name; 1 real/artificial word)
- 4 isolated words
- 9 phonetically rich sentences (read)

The speech files are stored as sequences of 8-bit, 8 kHz A-law samples and are not compressed. Each prompted utterance is stored within a separate file and has an accompanying ASCII SAM label file.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000075: Greek SpeechDat(II) FDB-5000
Telephone
The Greek SpeechDat(II) FDB-5000 database contains the recordings of 5,000 Greek speakers (2,405 males, 2,595 females) recorded over the Greek fixed telephone network. The FDB-5000 database is partitioned into 25 CDs in ISO 9660 format.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SpeechDat format and content specifications.

Each speaker uttered the following items:

* 2 isolated digits
* 1 sequence of 10 isolated digits
* 7 connected digits (1 prompt sheet number -5+ digits, 1 telephone number 9/11 digits, 1 credit card number 14/16 digits, 1 PIN code -6 digits, 1 long number greater than 999999, 1 decimal number, 1 age)
* 3 dates (1 spontaneous date e.g. birthday, 1 word style prompted date, 1 relative and general date expression)
* 1 word spotting phrase using an embedded application word
* 3 application words
* 3 spelled words (1 spontaneous name e.g. own forename, 1 city name, 1 real/artificial word for coverage)
* 1 currency money amount
* 1 natural number
* 7 directory assistance names (1 name e.g. forename, 1 city of birth/growing up, set of 150 SDB full names, 1 most frequent cities, 1 most frequent company/agency, 1 city/region of call, 1 profession)
* 4 yes/no questions
* 1 fuzzy yes/no question that could have either yes/no or something else as an answer
* 9 phonetically rich sentences
* 2 time phrases (1 spontaneous time of day, 1 word style time phrase)
* 4 isolated words
* 1 male/female
* 1 telephone model
* 1 environment of call
* 5 words broken into syllables

The following age distribution has been obtained: 512 speakers are under 16, 2,555 speakers are between 16 and 30, 1,199 speakers are between 31 and 45, 653 speakers are between 46 and 60, 74 speakers are over 60, and the age of 7 speakers is unknown.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000077: IBNC - An Italian Broadcast News Corpus
Broadcast Resources
The Italian Broadcast News Corpus (IBNC) was produced by ITC-IRST (Italy) with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335). RAI, the major Italian broadcast company, supplied studio quality recordings of radio news programs sampled from its internal digital archive. The collection consists of 150 programs, for a total time of about 30 hours, issued on 36 different days between 1992 and 1999.

Recordings were supplied by RAI on Digital Audio Tapes (DAT), with 44 kHz sampling rate and 16-bit resolution. Each DAT was manually processed to transfer each single program issue into a single file. During this operation, the signal was down-sampled to 16 kHz with a resolution of 16 bits, and encoded into the NIST Sphere PCM format. Speech recordings present variations of topic, speaker, acoustic channel, speaking mode, etc.

The corpus has been segmented, labelled and transcribed manually using the tool developed by DGA (Délégation Générale pour l'Armement, France) and LDC (Linguistic Data Consortium, USA), called "Transcriber", with conventions similar to those adopted by LDC for the DARPA HUB-4 corpora. The transcription text consists of mixed-case ASCII characters of the ISO-8859-1 extended set. Validation was carried out by an external validator and consisted of checking audio files, documentation and transcriptions.
C-000078: IDIOLOGOS 1 Bootstrap (NEOLOGOS Project)
Telephone
The IDIOLOGOS 1 Bootstrap database was produced within the French national project NEOLOGOS, as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The databases produced in the framework of the NEOLOGOS project are designed for the development and the assessment of French speech or speaker recognizers and speech synthesizers. They consist of:
1) the IDIOLOGOS databases, which consist of adult voices and are available in 2 subsets:
- the Bootstrap database (catalogue ref. ELRA-S0226-01),
- the Eingenspeakers database (catalogue ref. ELRA-S0226-02);
2) the PAIDIALOGOS database (catalogue ref. ELRA-S0227), which consists of children's and teenagers' voices.

The IDIOLOGOS 1 Bootstrap database contains the recordings of 1000 adult French speakers (470 males and 530 females) recorded over the French fixed telephone network. The speakers uttered 45 phonetically rich sentences. The 45 sentences are the same for all speakers.

This database is distributed as 1 DVD-ROM. The speech files are stored as sequences of 8-bit, 8 kHz A-law samples and are not compressed, according to the specifications of NEOLOGOS. Each prompted utterance is stored within a separate file and has an accompanying ASCII SAM label file.

This speech database was validated by SPEX (the Netherlands) to assess its compliance with the NEOLOGOS format and content specifications.

Each speaker uttered the following items:
- 1 digit sequence (5+ digits)
- 1 telephone number (10 digits)
- 1 credit card number (16 digits)
- 1 spelling of directory assistance city name
- 1 real/artificial word for coverage
- 45 phonetically rich sentences

The following age distribution has been obtained: 288 speakers are between 18 and 30, 264 speakers are between 31 and 45, 247 speakers are between 46 and 61, and 201 speakers are over 61.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- hasPart: ELRA:B0007,IDIOLOGOS 1 “Bootstrap” (NEOLOGOS Project) IDIOLOGOS 2 “Eingenspeakers” (NEOLOGOS Project)
- isVersionOf: ELRA:S0226-02,IDIOLOGOS 2 “Eingenspeakers” (NEOLOGOS Project)
C-000079: ILPho phonetic lexicon
Speech Related
The ILPho database is a phonetic lexicon which contains 39,000 lemmas (319,318 entries). It is distributed in two formats. The first format is compact and is a simple extension of the text format in which the Multext lexicons (ref. ELRA-L0010) (Ide and Veronis, 1994) are distributed, adding a column in which phonetic transcriptions are stored. The second format is XML (see www.xml.org), using a set of mark-ups specifically designed within this project for lexicon representation.
C-000080: ISLE Speech Corpus
Desktop/Microphone
Approx. 20 minutes of speech (per speaker) from 23 German and 23 Italian intermediate learners of English. Each speaker recorded sentences from several blocks of differing types (reading simple sentences, using minimal pairs, giving answers to multiple choice questions). The prompts were of varying perplexities.

About 2/3 of the data for each speaker was annotated by one of a team of linguists. The files were corrected first at the word level, and an automatic recognizer was then used to produce phone-level annotations. The annotator then re-annotated each sentence to mark phone and stress errors (e.g., substitutions, insertions, or deletions).

Corpus details:
* a total of 46 speakers (23 German and 23 Italian)
* 11484 utterances
* 1.92 gigabytes of WAV files (4 CDs)
* 17 hours, 54 minutes, and 44 seconds of speech data
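The corpus figures above can be cross-checked with a little arithmetic. The sketch below only uses the numbers quoted in this entry; the variable names are illustrative.

```python
# Figures quoted in the corpus details above.
hours, minutes, seconds = 17, 54, 44
speakers = 46
utterances = 11484

total_seconds = hours * 3600 + minutes * 60 + seconds
mean_utterance_s = total_seconds / utterances        # average utterance length
minutes_per_speaker = total_seconds / speakers / 60  # average speech per speaker

print(total_seconds)                  # 64484
print(round(mean_utterance_s, 2))     # 5.62
print(round(minutes_per_speaker, 1))  # 23.4
```

The per-speaker average of roughly 23 minutes is in line with the "approx. 20 minutes" stated in the description.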

A much more detailed explanation of the ISLE corpus will be available in the proceedings of LREC 2000. An electronic copy of this paper may be obtained by sending an email to Dr. Wolfgang Menzel at <menzel@nats.informatik.uni-hamburg.de>.

W. Menzel, E. Atwell, P. Bonaventura, D. Herron, P. Howarth, R. Morton, and C. Souter. "The ISLE corpus of non-native spoken English", Proc. Second LREC.
C-000083: LC-STAR Catalan phonetic lexicon
Speech Related
The LC-STAR Catalan phonetic lexicon was created within the scope of the LC-STAR project (IST 2001-32216) which was sponsored by the European Commission and the Spanish Government.

Production was performed at the Technologies and Applications of Language and Speech Center (TALP) of the Universitat Politècnica de Catalunya (UPC) (Spain). The owner of the database is UPC.

The lexicon comprises more than 100,000 words, distributed over three categories:

- a set of 53,225 common word entries. This set is extracted from a corpus of more than 20 million words distributed over 6 different domains (sports/games, news, finance, culture/entertainment, consumer information, personal communications). This was done with the aim of reaching a target of at least 95% self coverage for each domain. In addition to the word lists extracted from the corpus, a list of closed set (function) word classes is included in the final word list.

- a set of 45,306 proper names (including person names, family names, cities, streets, companies and brand names) divided into 3 domains. Multiple word names such as New_York are kept together in all three domains, and they count as one entry. The 3 domains consist of first and last names (21,868 different entries), place names (8,279 different entries), and organisations (16,004 different entries).

- and a list of 7,498 special application words translated from English terms defined by the LC-STAR consortium. This list contains: numbers, letters, abbreviations and specific vocabulary for applications controlled by voice (information retrieval, controlling of consumer devices, etc.).
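The three category sizes above can be summed to check the "more than 100,000 words" claim. The sketch below uses only the counts quoted in this entry; the variable names are illustrative. It also shows that the three proper-name domain counts add up to slightly more than the stated proper-name total, a difference the description does not explain.

```python
# Category sizes quoted in the description above.
common_words = 53_225
proper_names = 45_306
application_words = 7_498

total_entries = common_words + proper_names + application_words
print(total_entries)   # 106029

# Per-domain counts quoted for the proper names.
name_domains = {
    "first and last names": 21_868,
    "place names": 8_279,
    "organisations": 16_004,
}
domain_sum = sum(name_domains.values())
print(domain_sum)                  # 46151
print(domain_sum - proper_names)   # 845
```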

The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA. The database is stored on 1 CD.
- isVersionOf: C-000083: LC-STAR Catalan phonetic lexicon
- isVersionOf: G-000412: LC-STAR Spanish phonetic lexicon
C-000086: "Le Monde Diplomatique" Text corpus in English
Written Corpora
Electronic archiving of "Le Monde Diplomatique" articles in English from 1999. The corpus is available in HTML. Each HTML file contains one article.

Number of articles available per year:
1999: 184 articles (292,908 words)
2000: 199 articles (289,056 words)
2001: 182 articles (266,602 words)
2002: 198 articles (262,682 words)
2003: 197 articles (290,797 words)
2004: 205 articles (282,233 words)
- hasVersion: N-001461: Le Monde Diplomatique Text corpus in Arabic
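The per-year figures in this entry follow a regular "year: N articles (M words)" pattern, so they can be parsed and totalled mechanically. The following is a minimal sketch; the regular expression and the helper name `parse_listing` are illustrative assumptions about this listing's layout, not part of the corpus distribution.

```python
import re

# Per-year lines copied from the English corpus entry above.
LISTING = """\
1999: 184 articles (292,908 words)
2000: 199 articles (289,056 words)
2001: 182 articles (266,602 words)
2002: 198 articles (262,682 words)
2003: 197 articles (290,797 words)
2004: 205 articles (282,233 words)
"""

PATTERN = re.compile(r"^(\d{4}): ([\d,]+) articles \(([\d,]+) words\)$")


def parse_listing(text):
    """Map each year to an (articles, words) tuple of integers."""
    rows = {}
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            year, articles, words = match.groups()
            rows[int(year)] = (
                int(articles.replace(",", "")),
                int(words.replace(",", "")),
            )
    return rows


rows = parse_listing(LISTING)
total_articles = sum(articles for articles, _ in rows.values())
total_words = sum(words for _, words in rows.values())
print(total_articles, total_words)          # 1165 1684278
print(round(total_words / total_articles))  # 1446 words per article on average
```

The same function should work on the French archive listing, since its lines use the same layout.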
C-000087: "Le Monde Diplomatique" Text corpus in French - archives 1980-1998
Written Corpora
Electronic archiving of "Le Monde Diplomatique" articles in French from 1980 to 1998. The corpus is available in HTML. Each HTML file contains one article.

Number of articles available per year:
1980: 575 articles (500,088 words)
1981: 560 articles (462,611 words)
1982: 604 articles (497,383 words)
1983: 643 articles (474,786 words)
1984: 599 articles (485,067 words)
1985: 617 articles (466,953 words)
1986: 766 articles (419,875 words)
1987: 784 articles (446,527 words)
1988: 906 articles (440,871 words)
1989: 859 articles (428,119 words)
1990: 902 articles (403,175 words)
1991: 907 articles (416,662 words)
1992: 879 articles (414,308 words)
1993: 935 articles (402,399 words)
1994: 986 articles (399,892 words)
1995: 1046 articles (391,945 words)
1996: 916 articles (397,254 words)
1997: 814 articles (412,820 words)
1998: 843 articles (394,842 words)
- isVersionOf: C-000088: Le Monde Diplomatique Text corpus in French - archives from 1999
- isVersionOf: C-000086: Le Monde Diplomatique Text corpus in English
- isVersionOf: N-001461: Le Monde Diplomatique Text corpus in Arabic
