Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3,330. Showing items 471 - 480 of 2,023.

C-000891: Basque Spoken Corpus, by Jon Aske (Department of Foreign Languages, Salem State College - Salem, Massachusetts, USA)
Desktop/Microphone
This is a collection of forty-two narratives in the Basque language (Euskara) by native speakers. It includes sound files (MP3 format) and full detailed transcripts. In each narrative, the speaker recounts a short silent movie that he or she has just watched to a friend or acquaintance who has not seen it (no other person was present in the room, just the recording equipment).
Two short silent movies were used to elicit the narratives: twenty-one of the narratives correspond to the 7-minute silent movie The Pear Story (Chafe, ed., 1980) and the other 21 are about a 12-minute collage from Charlie Chaplin's Modern Times.
The recordings were made as a part of a study on Basque word order in 1993 (Aske 1997).
The transcriptions are made following a modified version of the guidelines given in Edwards and Lampert 1993.
The speakers were from different age groups, different dialects, and had differing language abilities. Profiles of the speakers are also included.
In addition to the 42 narratives with transcripts, 53 additional sound tracks of extemporaneous speech and description of still images are also included.
C-000894: COLLECT
Telephone
The COLLECT speech database, supplied by CSELT, was recorded in Italy in 1987 from 500 speakers; half of them called using Turin phones and the other half called from all over the country. They were automatically prompted to utter 15 words (the 10 Italian digits and 5 command words - Yes, No, I accept, I refuse, Yes I accept). Each word was collected by digitally recording the utterances.
C-000895: COST232
Telephone
The COST232 consortium collected a "Multi-English" speech database over the telephone in Europe. Originally, it had been planned to collect data only at FUB (Fondazione Ugo Bordoni) in Rome, but in the event it was also possible to make a collection at BT labs in the UK. A total of 797 "successful" calls were collected.
Two countries received calls - Italy and the UK, using different types of collecting equipment (FUB in Rome used analog lines and BT in the UK used digital ones). Everybody had to repeat the same vocabulary - the "TI (Texas Instrument) words" - which makes this database unique in many respects.
The vocabulary comprised the name of the speaker's laboratory, the digits ("oh", zero, one, two, three, four, five, six, seven, eight and nine) and the words: "yes, no, erase, rubout, stop, start, help, enter, repeat, go". The data was collected from the following countries: Belgium, Czechoslovakia, Denmark, England, Germany, Italy, Norway, Portugal, Slovenia, Spain, Sweden and Switzerland. Each country provided 8 speakers who made 2 calls each from a fixed set and from a mobile to both the Italian and UK collection systems (i.e. a total of 8 calls per speaker). Although the database was intended to aid speech recognition, it is also balanced and can therefore be used for speaker recognition training and testing.
C-000896: CRATER corpus
Written Corpora
The Corpus Resources and Terminology Extraction project (MLAP-93 20) has extended the bilingual annotated English-French International Telecommunications Union corpus to include Spanish, and has also debugged the existing corpus. The offer consists of a multi-lingual aligned corpus of 1,000,000 tokens per language for English, French and Spanish, with morphosyntactic annotations (human-edited).

An extended version of CRATER (ref. ELRA-W0003) is available in CRATER 2 (ref. ELRA-W0033)
- isVersionOf: C-001357: CRATER 2 Corpus
C-000906: Danish SpeechDat(M) database - DB1
Telephone
The Danish SpeechDat(M) database is the speech database collected within the SpeechDat(M) project. It consists of polyphone-like data recorded by 1,523 speakers.
The speech files are stored as sequences of 8 bit 8 kHz A-law samples. Each prompted utterance is stored within a separate file and the associated label files are stored in SAM file format.
An attached ASCII file lists information about each speaker: speaker code, sex, age, region, prompt number.
The lexicon is presented in a TAB delimited ASCII file containing an alphabetically ordered list of distinct lexical items occurring in the database. Each entry contains a frequency count and corresponding pronunciation information.
Example:
WORD FREQUENCY PHONEMIC TRANSCRIPTIONS
åbnede 104 O b n @ D | O b n @ D @
adresseangivelse 97 a d R a s @ a n g i: u l s @
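A minimal sketch of how such a lexicon file might be read, assuming that the fields are separated by TAB characters as described above and that alternative pronunciations are separated by `|` as in the example. The names `LexiconEntry` and `parse_lexicon_line` are hypothetical and do not come from the database documentation.

```python
from dataclasses import dataclass


@dataclass
class LexiconEntry:
    word: str
    frequency: int
    pronunciations: list  # one list of SAMPA symbols per pronunciation variant


def parse_lexicon_line(line):
    """Parse one TAB-delimited lexicon line (assumed layout) into an entry."""
    fields = line.rstrip("\r\n").split("\t")
    word, frequency = fields[0], int(fields[1])
    pronunciations = []
    # Variants may share one field (separated by "|") or occupy several fields.
    for field in fields[2:]:
        for variant in field.split("|"):
            symbols = variant.split()
            if symbols:
                pronunciations.append(symbols)
    return LexiconEntry(word, frequency, pronunciations)


entry = parse_lexicon_line("åbnede\t104\tO b n @ D | O b n @ D @")
print(entry.word, entry.frequency, len(entry.pronunciations))
```

A header line such as the one shown in the example would need to be skipped before calling the parser, since its second field is not a number.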
The complete Danish SpeechDat database consists of 5 CD-ROMs. The first three CD-ROMs contain the application oriented sub-set. The last two CD-ROMs contain the phonetically rich sentences.
The included items are:
· 5 application word phrases (semi spontaneous)
· 12 connected digit strings with 8 digits
· 24 natural numbers (3-4 digits)
· 27 application words
· 3 dates, D3 spontaneous (birthday)
· 3 spelled words
· 2 money amounts, M1 small, M2 large
· City name (spontaneous)
· 3 yes/no questions (spontaneous)
· 22-25 sentences
· T1 time phrase, T2 time of day (spontaneous)

There are 1,523 speakers in the SpeechDat database from 11 linguistic regions of Denmark and five age groups (under 16, 16-30, 31-45, 46-60, over 60). 78% of them are between 16 and 60 years old.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- hasVersion: C-001523: Spanish SpeechDat(M) - DB1
- hasVersion: C-000119: Portuguese SpeechDat(M) database
C-000907: Danish SpeechDat(M) database - DB2
Telephone
The (polyphone-like) Danish SpeechDat(M) database contains the recordings of 1,523 Danish speakers from 11 regions.

Speech samples are stored as sequences of 8 bit 8 kHz A-law. Each prompted utterance is stored in a separate file, and the associated label files are stored in SAM file format.
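As an illustration of the storage format, the sketch below expands 8-bit A-law codes to 16-bit linear PCM. It assumes the standard ITU-T G.711 A-law encoding and raw, headerless signal files (the descriptive information being held in the separate SAM label files); neither assumption is confirmed by this record. The function names are hypothetical.

```python
def alaw_to_linear(code):
    """Decode one 8-bit G.711 A-law code to a signed 16-bit PCM value."""
    code ^= 0x55                      # undo the even-bit inversion
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment number
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # After the XOR, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data):
    """Decode a bytes object of A-law codes to a list of PCM samples."""
    return [alaw_to_linear(byte) for byte in data]


def duration_seconds(num_bytes, sample_rate=8000):
    """One byte per sample, so duration is the byte count over the rate."""
    return num_bytes / sample_rate


print(decode_alaw(bytes([0xD5, 0x55])))
print(duration_seconds(16000))
```

At 8 bits per sample and 8 kHz, each second of speech occupies 8,000 bytes, so a file's duration can be estimated from its size alone under the headerless assumption.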

Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information. It was validated by SPEX (the Netherlands) to assess its compliance with the SpeechDat format and content specifications.

The lexicon is presented in a TAB delimited ASCII file containing an alphabetically ordered list of distinct lexical items occurring in the database. Each entry contains a frequency count and corresponding pronunciation information.

Example:
WORD FREQUENCY PHONEMIC TRANSCRIPTIONS
åbnede 104 O b n @ D | O b n @ D @
adresseangivelse 97 a d R a s @ a n g i: u l s @

The complete Danish SpeechDat database is partitioned into 5 CD-ROMs. The first three CD-ROMs contain the application oriented sub-set. The last two CD-ROMs contain the phonetically rich sentences.

Each speaker uttered the following items:

* 5 semi-spontaneous application word phrases
* 12 connected digit strings with 8 digits
* 24 natural numbers (3-4 digits)
* 27 application words
* 3 dates, including a spontaneous one e.g. birthday
* 3 spelled words
* 2 money amounts, including a small one, and a large one
* 1 spontaneous city name
* 3 spontaneous yes/no questions
* 22-25 sentences
* 2 time phrases, including a time phrase and a spontaneous time of day

The 5 age groups are the following: under 16, 16-30, 31-45, 46-60, over 60. 78% of the speakers are between 16 and 60 years old.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000909: Dutch PAROLE Distributable Corpus
Written Corpora
The Dutch PAROLE Distributable Corpus is a 3-million-word selection from the 20-million-word Dutch PAROLE Reference corpus.

The Dutch corpus annotation and checking were carried out according to the common core PAROLE tagset. The Dutch data were also checked for type.

The Dutch PAROLE Distributable Corpus contains the following texts:

BOOKS:
Van Sterkenburg:
Wdlijst tot wdboek, 1984, 65,344 words
Taal vt Journaal, 1989, 56,215 words
WNT-portret, 1992, 60,133 words

NEWSPAPERS:
Short Newspaper texts:
MN_Collection, 1986-1988, 19,537 words
CVNP(S)-Collection, 1983-1990, 179,220 words

PERIODICAL:
Short texts from
- Local Papers, 1985-1988, 47,019 words
- Magazines, 1985-1989, 164,589 words

MISCELLANEOUS:
Texts to be read out in TV-news broadcasts for:
- General audience, 1992-1995, 1,285,824 words
- Youth, 1991-1995, 1,008,658 words
Short texts from Ephemera, 1985-1986, 131,692 words

TOTAL: 3,018,231 words
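The stated total can be cross-checked against the component counts listed above. The following is a small sketch; the dictionary keys are shorthand labels chosen for illustration, while the figures are copied from the listing.

```python
# Word counts per component, as listed in the corpus description.
components = {
    "Books: Wdlijst tot wdboek (1984)": 65_344,
    "Books: Taal vt Journaal (1989)": 56_215,
    "Books: WNT-portret (1992)": 60_133,
    "Newspapers: MN_Collection": 19_537,
    "Newspapers: CVNP(S)-Collection": 179_220,
    "Periodical: Local Papers": 47_019,
    "Periodical: Magazines": 164_589,
    "Misc: TV news, general audience": 1_285_824,
    "Misc: TV news, youth": 1_008_658,
    "Misc: Ephemera": 131_692,
}

stated_total = 3_018_231
computed_total = sum(components.values())

print(computed_total, computed_total == stated_total)
```

The sum of the ten components equals the stated total of 3,018,231 words.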

Over 250,000 words of corpus texts have been PoS-tagged automatically. A total of 59,798 running words has been manually corrected and checked at least two times with respect to maximal granularity, according to a lexicographer's manual. The extra 9,000 words over the required 50,000 words compensate for the occurrence of ca. 5,300 "keywords" in the original texts. The fully corrected material has been subjected to an automated post-control operation, checking the pertinence relations between the various feature values, and instantiating default values in case a mismatch (indicating a correction error) was found. Ca. 200,000 words have been checked once for PoS and type. In addition to the required PoS, type was checked for reasons of quality. This material has been subjected to an automated correction procedure addressing the feature slots (positions) beyond the first two for PoS and type so as to solve discrepancies between the manually corrected PoS and type, and the possibly erroneous, automatically assigned values of the remaining slots.

More info on the Parole project: http://www.elda.org/catalogue/fr/text/doc/parole.html
C-000910: Dutch Polyphone Database
Telephone
The Dutch Polyphone corpus contains telephone speech from 5,050 speakers. The corpus comprises 222,075 speech files (based on 44 or, in a few cases, 43 items per speaker), all of which have been orthographically transcribed. The data were collected in 8-bit A-law digital form, directly off an ISDN telephone line interface.
The corpus contains both read and extemporaneous items. Items to be read consist of isolated digits, numbers (one telephone number, two bank account or credit card numbers, and the participation number), a postal code, guilder amounts, time, date, amounts, application words, sentences with an application word, phonetically rich sentences, spelled words, and city names. Several questions were asked to elicit the spontaneous part of the speech (questions like Is Dutch your native language?, Did you ever live in another country than the Netherlands?, In which cities did you grow up?, Are you a man or a woman?, Are you calling from your home phone?, etc.).
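The file count allows a rough estimate of how many speakers contributed 43 rather than 44 items. The sketch below assumes that every speaker contributed exactly 43 or 44 items and that each item corresponds to exactly one speech file; both are assumptions made for illustration, not statements from the corpus documentation.

```python
speakers = 5050
total_files = 222_075
full_items, short_items = 44, 43

# If every speaker had 44 items there would be speakers * 44 files;
# each speaker with only 43 items accounts for one missing file.
speakers_with_43 = speakers * full_items - total_files
speakers_with_44 = speakers - speakers_with_43

reconstructed = speakers_with_44 * full_items + speakers_with_43 * short_items
share_with_43 = speakers_with_43 / speakers

print(speakers_with_43, speakers_with_44, reconstructed)
```

Under these assumptions, 125 of the 5,050 speakers (about 2.5%) have 43 items, which is consistent with the "few cases" mentioned in the description.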
C-000911: Dutch SpeechDat(II) MDB-250
Telephone
The Dutch SpeechDat(II) MDB-250 comprises 250 Dutch speakers (125 males, 125 females) recorded over the Dutch mobile telephone network. This database is partitioned into 5 CDs. The speech databases made within the SpeechDat(II) project were validated by SPEX to assess their compliance with the SpeechDat format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
The following items were recorded:
- 8 application words (2 optional); 2 isolated digits; 1 sequence of 10 isolated digits; 3 connected digits: 1 telephone number (1-10 digits), 1 credit card number (1-16 digits), 1 digit PIN code (6 digits); 3 dates: 1 spontaneous date, 1 date, 1 relative date expression; 1 embedded application word; 3 spelled words: 1 forename (spontaneous), 1 city name, 1 word; 1 currency money amount; 1 natural number; 6 directory assistance names: 1 forename (spontaneous), 1 city of birth, 1 most frequent city, 1 city name, 1 company name, 1 forename surname; 2 yes/no questions: 1 predominantly "yes" question, 1 predominantly "no" question; 9 phonetically rich sentences; 2 time phrases: 1 time of day (spontaneous), 1 time phrase; 4 phonetically rich words.
The following age distribution has been obtained: 5 speakers are under 16, 90 are between 16 and 30, 89 between 31 and 45, 56 between 46 and 60, and 10 are over 60. The lexicon was created following the guidelines in SD1.3.1 v4.3.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000914: ECI/MCI (European Corpus Initiative/Multilingual Corpus I)
Written Corpora
The European Corpus Initiative (ECI) was founded to oversee the acquisition and preparation of a large multilingual corpus, and supports existing and projected national and international efforts to carefully design, collect and publish large-scale multilingual written and spoken corpora. ECI has produced the Multilingual Corpus I (ECI/MCI) of over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more. The primary focus in this effort is on textual material of all kinds, including transcriptions of spoken material.

Just a sampling of the contents of the CD-ROM:

German newspaper texts from the Frankfurter Rundschau from July 1992 to March 1993. Provided by Universität Gesamthochschule, Paderborn, Germany. Approximately 34 million words.
French newspaper texts from Le Monde, consisting of material from September 1989, October 1989, and January 1990. Provided by LIMSI CNRS, France. Approximately 4.1 million words.
Extracts from the Leiden Corpus of Dutch, consisting of newspapers, transcribed speech, etc. Provided by Instituut voor Nederlandse Lexicologie, Leiden, Holland. Approximately 5.5 million words.
International Labor Organisation (ILO) "Official Bulletin, B Series". Vols LXVII(1984) - LXXII(1989). Parallel texts in English, French and Spanish provided by the International Labor Organisation. Approximately 5 million words.
