Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing results 501 - 510 of 2023.

C-000946: Hebrew Speecon database
Desktop/Microphone
The Hebrew Speecon database is divided into 2 sets:

1. The first set comprises the recordings of 550 adult Hebrew speakers (273 males, 277 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2. The second set comprises the recordings of 50 child Hebrew speakers (24 boys, 26 girls), recorded over 4 microphone channels in 1 recording environment (children room).

This database is partitioned into 20 DVDs (first set) and 3 DVDs (second set).
The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications. Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
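As an illustration of the signal format described above (16 kHz, 16 bit, lo-hi byte order), the following minimal Python sketch decodes a buffer of such samples. It assumes the signal files are headerless raw sample streams, which the entry does not state explicitly. The names `decode_lohi_samples` and `duration_seconds` are hypothetical helpers, not part of any Speecon tooling. The entry describes the samples as unsigned; the `signed` flag is included only because many audio tools treat 16-bit PCM as signed.

```python
import struct

SAMPLE_RATE_HZ = 16_000  # sampling rate stated in the entry
BYTES_PER_SAMPLE = 2     # 16-bit samples


def decode_lohi_samples(raw: bytes, signed: bool = False) -> list[int]:
    """Decode 16-bit samples stored in lo-hi (little-endian) byte order.

    Assumption: `raw` is a headerless sample stream. The entry describes
    the samples as unsigned, so that is the default here.
    """
    if len(raw) % BYTES_PER_SAMPLE:
        raise ValueError("byte count is not a multiple of the sample size")
    count = len(raw) // BYTES_PER_SAMPLE
    code = "h" if signed else "H"
    # "<" selects little-endian, i.e. the low byte comes first.
    return list(struct.unpack(f"<{count}{code}", raw))


def duration_seconds(raw: bytes) -> float:
    """Length of a single-channel buffer in seconds at 16 kHz."""
    return len(raw) / BYTES_PER_SAMPLE / SAMPLE_RATE_HZ
```

For example, the four bytes `34 12 FF FF` decode to the two samples 0x1234 and 0xFFFF when read as unsigned values.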

Each speaker uttered the following items:

* Calibration data:
o 6 noise recordings
o The silence word recording
* Free spontaneous items (adults only):
o 5 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
* 17 Elicited spontaneous items (adults only):
o 3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
* Read speech:
o 30 phonetically rich sentences uttered by adults and 60 uttered by children
o 5 phonetically rich words (adults only)
o 4 isolated digits
o 1 isolated digit sequence
o 4 connected digit sequences
o 1 telephone number
o 3 natural numbers
o 1 money amount
o 2 time phrases (T1 : analogue, T2 : digital)
o 3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
o 3 letter sequences
o 1 proper name
o 2 city or street names
o 2 questions
o 2 special keyboard characters
o 1 Web address
o 1 email address
o 208 application specific words and phrases per session (adults)
o 74 toy commands and 48 general commands (children)

The following age distribution has been obtained:

* Adults: 313 speakers are between 15 and 30, 174 speakers are between 31 and 45, 63 speakers are over 46.
* Children: 16 speakers are between 8 and 10, 34 speakers are between 11 and 14.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000947: ILC Italian Morphological Lexicon
Monolingual Lexicons
The ILC Italian Morphological Lexicon consists of a set of lemmas/lexical entries (about 60,000) with the corresponding inflected word-forms, and a morphological engine for morphological analysis and generation. Lemmas and word-forms are encoded with grammatical codes compatible with the EAGLES recommendations for lexicon encoding at the morphosyntactic level.
C-000949: Italian Speech Corpus 1 (Appen)
Desktop/Microphone
The Italian Speech Corpus 1 contains the recordings of 202 native Italian speakers (112 males, 90 females) recorded in an office and a closed public place, in a range of low to medium background noise environments, over 4 channels: Plantronics Audio 10 (computer/desk mic), Shure SM58 (desk mounted dynamic mic), Shure Beta 53 (headset mic) and Andrea DA-400 (array mic). The data collection and transcription were performed by Appen (Australia).
Speech samples are stored as sequences of 16-bit 22.05 kHz PCM in uncompressed WAV files.
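A file in this format can be checked against the stated parameters with Python's standard `wave` module. The sketch below is illustrative only: `check_wav_format` is a hypothetical helper name, and the file path is supplied by the caller.

```python
import wave

EXPECTED_RATE_HZ = 22_050   # 22.05 kHz, as stated in the entry
EXPECTED_WIDTH_BYTES = 2    # 16-bit samples


def check_wav_format(path: str) -> dict:
    """Read the WAV header and compare it with the described format."""
    with wave.open(path, "rb") as wav:
        info = {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "frames": wav.getnframes(),
        }
    # True only when both the sample width and the rate match the entry.
    info["matches_description"] = (
        info["sample_width_bytes"] == EXPECTED_WIDTH_BYTES
        and info["sample_rate_hz"] == EXPECTED_RATE_HZ
    )
    return info
```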
Each speaker read the following items (prompted):
- 100 command words
- 100 phonetically rich sentences
The following age distribution has been obtained: 22 speakers are between 18 and 19, 141 are between 20 and 30, 34 are between 31 and 45, and 5 are between 45 and 60.
Information about the speakers' place of birth is included.
The database is provided with orthographic transcriptions in SAMPA, including canonical and alternative pronunciation, and syllable, stress and acoustic events markings. All transcriptions were segmented at the utterance (sentence/command word) level, annotated at the word level and checked manually. A pronunciation lexicon including 7,300 headwords (plus variants) is also available.
This database is intended for use in speech recognition and voice control applications.
C-000950: Italian SpeechDat(II) FDB-3000
Telephone
The Italian SpeechDat(II) FDB-3000 comprises more than 3000 Italian speakers (1494 males, 1546 females) recorded over the Italian fixed telephone network. The FDB-3000 database is partitioned into 12 CDs in ISO 9660 format; each CD contains the recordings of 550 speakers. The speech databases made within the SpeechDat(II) project were validated by SPEX, the Netherlands, to assess their compliance with the SpeechDat format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
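To work with 8-bit A-law samples, each byte is usually expanded to linear PCM first. The sketch below follows the standard ITU-T G.711 A-law expansion. It assumes each signal file is a headerless stream of A-law bytes, which the entry does not state explicitly. The function names are hypothetical.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a linear PCM value (G.711)."""
    code ^= 0x55                     # undo the alternate-bit inversion
    magnitude = (code & 0x0F) << 4   # 4-bit mantissa
    segment = (code & 0x70) >> 4     # 3-bit segment number
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # After the inversion, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(raw: bytes) -> list[int]:
    """Expand a buffer of A-law bytes (assumed headerless) to PCM values."""
    return [alaw_to_linear(byte) for byte in raw]
```

The decoded values fit in a signed 16-bit range, with a maximum magnitude of 32256.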
Each speaker uttered the following items:
- 1 isolated digit
- 1 sequence of 10 isolated digits
- 4 connected digits: 1 prompt sheet number (5+ digits), 1 telephone number (9-11 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits)
- 3 dates: 1 spontaneous date (e.g. birthday), 1 prompted date (word style), 1 relative and general date expression.
- 1 word spotting phrase using an application word (embedded).
- 3 application words
- 3 spelled words: 1 spontaneous name (own forename), 1 city name, 1 real / artificial word for coverage.
- 1 Lira currency money amount.
- 1 natural number.
- 5 directory assistance names: 1 spontaneous name (own forename), 1 city of birth / growing up (spontaneous), 1 most frequent cities, 1 most frequent company / agency, 1 "forename surname".
- 2 questions including "fuzzy" yes / no: 1 predominantly "Yes" question, 1 predominantly "No" question.
- 9 phonetically rich sentences.
- 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style).
- 4 phonetically rich words.
4 more items were added to the Italian corpus.
The following age distribution has been obtained: 133 speakers are below 16 years old, 757 speakers are between 16 and 30, 862 speakers are between 31 and 45, 626 speakers are between 46 and 60, 482 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000951: Italian SpeechDat(II) MDB-250
Telephone
The Italian SpeechDat(II) MDB-250 comprises 375 Italian speakers recorded over the Italian mobile telephone network. The MDB-250 database is partitioned into 6 CDs in ISO 9660 format. The speech databases made within the SpeechDat(II) project were validated by SPEX, the Netherlands, to assess their compliance with the SpeechDat format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
- 2 isolated digits
- 1 sequence of 10 isolated digits
- 4 connected digits: 1 prompt sheet number (5+ digits), 1 telephone number (9-11 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits)
- 3 dates: 1 spontaneous date (e.g. birthday), 1 prompted date (word style), 1 relative and general date expression.
- 1 word spotting phrase using an application word (embedded).
- 6 application words
- 3 spelled words: 1 spontaneous name (own forename), 1 city name, 1 real / artificial word for coverage.
- 1 Lira currency money amount.
- 1 natural number.
- 7 directory assistance names: 1 spontaneous name (own forename), 1 city of birth / growing up (spontaneous), 2 most frequent cities (set of 25), 2 most frequent company / agency (set of 25), 1 "forename surname" (set of 150 "full" names)
- 2 questions including "fuzzy" yes / no: 1 predominantly "Yes" question, 1 predominantly "No" question.
- 9 phonetically rich sentences.
- 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style).
- 4 phonetically rich words.
5 more items were added to the Italian corpus (4 spontaneous, 1 read).
The following age distribution has been obtained: 3 speakers are below 16 years old, 147 speakers are between 16 and 30, 149 speakers are between 31 and 45, 56 speakers are between 46 and 60, 48 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000952: Italian Speecon database
Desktop/Microphone
The Italian Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 550 adult Italian speakers (273 males, 277 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2) The second set comprises the recordings of 50 child Italian speakers (28 boys, 22 girls), recorded over 4 microphone channels in 1 recording environment (children room).

This database is partitioned into 23 DVDs (first set) and 3 DVDs (second set).
The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.
Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

Each speaker uttered the following items (over 290 items for adults and over 210 items for children):
* Calibration data:
o 6 noise recordings
o The silence word recording
* Free spontaneous items (adults only):
o 5 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
* 17 Elicited spontaneous items (adults only):
o 3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
* Read speech:
o 30 phonetically rich sentences uttered by adults and 60 uttered by children
o 5 phonetically rich words (adults only)
o 4 isolated digits
o 1 isolated digit sequence
o 4 connected digit sequences
o 1 telephone number
o 3 natural numbers
o 1 money amount
o 2 time phrases (T1 : analogue, T2 : digital)
o 3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
o 3 letter sequences
o 1 proper name
o 2 city or street names
o 2 questions
o 2 special keyboard characters
o 1 Web address
o 1 email address
o 208 application specific words and phrases per session (adults)
o 74 toy commands, 14 phone commands and 34 general commands (children)

The following age distribution has been obtained:
Adults: 243 speakers are between 15 and 30, 209 speakers are between 31 and 45, and 98 speakers are over 46.
Children: 22 speakers are between 8 and 10, 28 speakers are between 11 and 15.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000953: Italian TTS Speech Corpus (Appen)
Desktop/Microphone
The Italian TTS Speech Corpus contains the recordings of 1 native Italian speaker (male, 50 years old) recorded in a studio over 1 channel (Shure SM15 unidirectional professional head-worn condenser microphone). The data collection and transcription were performed by Appen (Australia).
Speech samples are stored as sequences of 16-bit 22.05 kHz PCM in uncompressed WAV files.
The speaker read 3,300 prompted sentences covering all legal triphones and diphones.
The database is provided with orthographic transcriptions in SAMPA, including canonical and alternative pronunciation, and syllable, stress and acoustic events markings. All transcriptions were segmented at the utterance (sentence/command word) level, annotated at the word level and checked manually. A pronunciation lexicon including 7,300 headwords (plus variants) is also available.
This database is intended for use in text-to-speech and speech synthesis applications.
C-000955: Korean Speecon database
Desktop/Microphone
The Korean Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 568 adult Korean speakers (259 males, 309 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2) The second set comprises the recordings of 58 child Korean speakers (25 boys, 33 girls), recorded over 4 microphone channels in 1 recording environment (children room).

This database is partitioned into 30 DVDs (first set) and 4 DVDs (second set).

The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.
Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

Each speaker uttered the following items:
* Calibration data:
o 6 noise recordings
o The silence word recording
* Free spontaneous items (adults only):
o 3 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
* 17 Elicited spontaneous items (adults only):
o 3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
* Read speech:
o 30 phonetically rich sentences uttered by adults and 60 uttered by children
o 5 phonetically rich words (adults only)
o 4 isolated digits
o 1 isolated digit sequence
o 4 connected digit sequences
o 1 telephone number
o 3 natural numbers
o 1 money amount
o 2 time phrases (T1 : analogue, T2 : digital)
o 3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
o 3 letter sequences
o 1 proper name
o 2 city or street names
o 2 questions
o 2 special keyboard characters
o 1 Web address
o 1 email address
o 220-221 application specific words and phrases per session (adults)
o 74 toy commands, 34 general commands, 14 phone commands and 5 application word synonyms (children)

The following age distribution has been obtained:
Adults: 250 speakers are between 15 and 30, 223 speakers are between 31 and 45, and 95 speakers are over 46.
Children: 25 speakers are between 8 and 10, 33 speakers are between 11 and 15.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000958: MTP Annotated German corpus - untagged version
Written Corpora
This morphosyntactically annotated 500,000 word German corpus was developed as part of the Münster Tagging Project (MTP). It comprises a collection of SGML-formatted texts from two German newspapers, "Die Frankfurter Allgemeine Zeitung" and "Die Zeit", for the years 1990 to 1992. The articles reflect the typical distribution of newspaper topics, including economics, regional, national and international politics, the arts, sport, literature, history, science and modern life.
The text was segmented into sentence units and word tokens, and tagged for morphosyntactic POS markers. Two tagsets, which mainly differed in the granularity of the noun and verb tags, and which comprised 137 and 52 tags respectively, were used. Users may obtain annotated versions using either set, each of which comes with documentation and an instruction manual for tag application. A suite of tools, including the MTP taggers and the Xlex workbench for text handling, textual analysis and lexicography, is also available.
- hasPart: MTP Annotated German corpus - tagged version W008-02
C-000959: MULTEXT JOC Corpus
Written Corpora
This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains approx. 5 million words in English, French, German, Italian and Spanish (approx. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level.

The JOC corpus is delivered in Corpus Encoding Standard conformant format at each level of treatment:

- paragraph annotation level, conformant to the CESDOC specifications (1 M words * 5 languages);
- morpho-syntactic annotation level (PoS Tagging), conformant to CESANA specifications (200,000 words * 4 languages);
- parallel text alignment at sentence level, conformant to CESALIGN specifications (200,000 words * 4 languages).
Additional information: http://www.lpl.univ-aix.fr/projects/multext
