Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing items 41 - 50 of 2023.

C-000088: "Le Monde Diplomatique" Text corpus in French - archives from 1999
Written Corpora
Electronic archiving of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each HTML file contains one article.

Number of articles available per year :
1999: 820 articles (393,813 words)
2000: 765 articles (376,027 words)
2001: 743 articles (368,739 words)
2002: 721 articles (357,076 words)
2003: 704 articles (379,998 words)
2004: 687 articles (315,197 words)
2005: 696 articles (360,000 words)
2006: 734 articles (360,000 words)
- isVersionOf: C-000087: Le Monde Diplomatique Text corpus in French - archives 1980-1998
- isVersionOf: C-000086: Le Monde Diplomatique Text corpus in English
- isVersionOf: N-001461: Le Monde Diplomatique Text corpus in Arabic
C-000090: MICROAES
Desktop/Microphone
The ATLAS Spanish Microphone Database (MICROAES) has been collected in Spain by Applied Technologies on Language and Speech, S.L. (ATLAS). This database comprises microphone recordings from 300 different speakers, who have been selected from five different dialectal areas. Sex and age distribution was also considered for speaker selection.

The corpus has 30 sets of 15 paragraphs, giving a total of 450 paragraphs. Each 15-paragraph set contains at least two allophones from the extended SAMPA symbols. For this purpose, coarticulation effects between words were considered.

The recording platform is based on a laptop using a PCMCIA slot as interface to the audio equipment. Up to four microphones are recorded simultaneously:

* Sennheiser ME 104 (close distance)
* Nokia Lavalier HDC-6D (close distance)
* Sennheiser ME 64 (medium distance)
* Haun MBNM-550 E-L (far distance)

In this database, all recordings were made in an office with no discussion or meeting taking place during the recordings. The signals are stored in a raw file format, i.e. without headers in the signal file. Each of the four speech channels is recorded at 16 kHz with 16-bit quantization.

A description of the sample rate, the quantization, and byte order used is held in the SAM label file that corresponds to each speech file. This label file also contains information about the signal quality value of the speech file.
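Because the signal files carry no header, generic audio tools cannot open them directly. The following is a minimal sketch of wrapping one headerless channel in a WAV container, assuming signed 16-bit mono samples at 16 kHz. The function name `raw_to_wav_bytes` and the `big_endian` flag are hypothetical; the actual byte order should be taken from the corresponding SAM label file, and the signedness is an assumption not stated in this description.

```python
import array
import io
import sys
import wave


def raw_to_wav_bytes(raw: bytes, big_endian: bool, sample_rate: int = 16000) -> bytes:
    """Wrap headerless 16-bit mono PCM in a WAV container and return the WAV bytes."""
    if len(raw) % 2:
        raise ValueError("16-bit PCM data must have an even number of bytes")
    # Python's wave module expects sample data in the host byte order,
    # so swap byte pairs only when the file order differs from the host order.
    if big_endian != (sys.byteorder == "big"):
        pairs = array.array("h")  # 2-byte items on common platforms
        pairs.frombytes(raw)
        pairs.byteswap()
        raw = pairs.tobytes()
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)       # one microphone channel per file (assumption)
        writer.setsampwidth(2)       # 16-bit quantization
        writer.setframerate(sample_rate)
        writer.writeframes(raw)
    return buffer.getvalue()
```

The returned bytes can be written to a `.wav` file or passed to any library that reads WAV from memory.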

The transcription included in this database is an orthographic, lexical transcription with a few details that represent audible acoustic events (speech and non-speech) present in the corresponding waveform files. The transcription includes segment markers that divide each paragraph into portions of less than 10 seconds at speaker pauses.
The lexicon file included in this database has more than 7400 words with the corresponding pronunciation information in SAMPA phonemic notation.

The database contains 30 hours of speech and is distributed in 30 ISO 9660 CD-ROM volumes or 5 ISO 9660 DVD-ROM volumes.
C-000092: Mandarin Chinese Speech Recognition Corpus (desktop) - digit string (119 people)
Desktop/Microphone
This corpus comprises 3,570 speech files uttered by 119 speakers of different dialects, ages and various educational levels, recorded over 3 channels (Mic 1: SHURE Beta53; Mic 2: AKG C4000b; Mic 3: Labtec Axis 002). The database comprises 1,500 digit strings in total. Speech samples are stored as a sequence of 16-bit 48kHz WAV for 7.54 hours of speech per channel. The total capacity of the data is 7.28 Gb.
Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for testing and for use in telephone natural speech recognition systems.
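The stated figures for this corpus can be cross-checked with simple arithmetic. The sketch below estimates the uncompressed audio size from the duration per channel, the number of channels, the sample rate and the bit depth. The function name `pcm_size_bytes` is hypothetical, WAV header overhead is ignored, and reading the catalog's "Gb" as binary gigabytes (2^30 bytes) is an assumption.

```python
def pcm_size_bytes(hours_per_channel: float, channels: int,
                   sample_rate_hz: int, bits_per_sample: int = 16) -> float:
    """Estimate the size of uncompressed PCM audio, ignoring file headers."""
    bytes_per_second = sample_rate_hz * bits_per_sample // 8
    return hours_per_channel * 3600 * bytes_per_second * channels


# Figures from the entry above: 7.54 hours per channel, 3 channels, 16-bit 48 kHz
size = pcm_size_bytes(7.54, channels=3, sample_rate_hz=48000)
print(f"{size / 2**30:.2f}")  # prints 7.28
```

Under these assumptions the estimate agrees with the stated total capacity of 7.28.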
C-000093: Mandarin Chinese Speech Recognition Corpus (desktop) - person name (120 people)
Desktop/Microphone
This corpus comprises 3,586 speech files uttered by 120 speakers of different dialects, ages and various educational levels, recorded over 3 channels (Mic 1: SHURE Beta53; Mic 2: AKG C4000b; Mic 3: Labtec Axis 002). The database comprises 2,250 person names in total. Speech samples are stored as a sequence of 16-bit 48kHz WAV for 6.19 hours of speech per channel. The total capacity of the data is 5.97 Gb.
Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for testing and for use in telephone natural speech recognition systems.
C-000094: Mandarin Chinese Speech Recognition Corpus (in the car) - person name, place name in Beijing, stocks, digit string (20 people)
Desktop/Microphone
This corpus comprises 9,599 speech files uttered by 20 speakers of different dialects, ages and various educational levels, recorded over 2 channels. The database comprises person names, place names in Beijing, stocks, digit strings. Speech samples are stored as a sequence of 16-bit 22.05kHz WAV for 10.45 hours of speech per channel. The total capacity of the data is 3.08 Gb.
Each speaker read 15 items. Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for testing and for use in telephone natural speech recognition systems.
C-000095: Mandarin Chinese Speecon database
Desktop/Microphone
The Mandarin Chinese Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 550 adult Chinese speakers (276 males, 274 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2) The second set comprises the recordings of 50 child Chinese speakers (26 boys, 24 girls), recorded over 4 microphone channels in 1 recording environment (children's room).

This database is partitioned into 26 DVDs (first set) and 3 DVDs (second set).

The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.
Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). To each signal file corresponds an ASCII SAM label file which contains the relevant descriptive information.
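As an illustration of the sample format just described, the sketch below decodes 16-bit unsigned integers in lo-hi (little-endian) byte order and re-centres them around zero. The function name `decode_unsigned_le16` is hypothetical, and treating 32768 as the zero level (offset binary) is an assumption; the SAM label file for each signal should be consulted before relying on it.

```python
import struct


def decode_unsigned_le16(raw: bytes) -> list[int]:
    """Decode little-endian unsigned 16-bit samples into signed values."""
    count = len(raw) // 2
    # "<" selects little-endian (Intel, lo-hi) order; "H" is an unsigned 16-bit integer
    unsigned = struct.unpack(f"<{count}H", raw[: count * 2])
    # Assumption: offset-binary encoding, so subtract the midpoint to centre on zero
    return [value - 32768 for value in unsigned]


print(decode_unsigned_le16(b"\x00\x80\x00\x00\xff\xff"))  # prints [0, -32768, 32767]
```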

Each speaker uttered the following items:
Calibration data:
6 noise recordings
The silence word recording
Free spontaneous items (adults only):
2 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
17 Elicited spontaneous items (adults only):
3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
Read speech:
30 phonetically rich sentences uttered by adults and 60 uttered by children
5 phonetically rich words (adults only)
4 isolated digits
1 isolated digit sequence
4 connected digit sequences
1 telephone number
3 natural numbers
1 money amount
2 time phrases (T1 : analogue, T2 : digital)
3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
3 letter sequences
1 proper name
2 city or street names
2 questions
2 special keyboard characters
1 Web address
1 email address
208 application specific words and phrases per session (adults)
74 toy commands and 48 general commands (children)

The following age distribution has been obtained:
Adults: 224 speakers are between 15 and 30, 220 speakers are between 31 and 45, 106 speakers are between 46 and 60.
Children: 17 speakers are between 8 and 10, 33 speakers are between 11 and 14.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000100: OrienTel Morocco MCA (Modern Colloquial Arabic) database
Telephone
The OrienTel Morocco MCA (Modern Colloquial Arabic) database comprises 772 Moroccan speakers (383 males, 389 females) recorded over the Moroccan fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
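For readers unfamiliar with A-law, the following is a minimal sketch of expanding 8-bit A-law codes into 16-bit linear samples, following the standard G.711 expansion rule. The function names are hypothetical, and the sketch assumes each signal file is a headerless stream of one A-law byte per sample, which this description suggests but does not state explicitly.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code into a signed 16-bit linear sample (G.711)."""
    code ^= 0x55                       # undo the alternate-bit inversion
    magnitude = (code & 0x0F) << 4     # 4-bit mantissa
    segment = (code & 0x70) >> 4       # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # After the inversion, a set sign bit marks a positive sample
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data: bytes) -> list[int]:
    """Decode a byte string of A-law codes into linear samples."""
    return [alaw_to_linear(byte) for byte in data]


print(decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A])))  # prints [8, -8, 32256, -32256]
```

At 8 kHz with one byte per sample, the duration of a file in seconds is its size in bytes divided by 8000.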

Each speaker uttered the following items:
1 isolated single digit
1 sequence of 10 isolated digits
5 connected digits : 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
2 currency money amounts
1 natural number
4 dates : 1 spontaneous (date or year of birth), 1 prompted date, 1 relative or general date expression, 1 prompted date phrase (Islamic calendar)
2 time phrases : 1 time of day (spontaneous), 1 time phrase (word style)
3 spelled words : 1 spontaneous (own forename), 1 city name, 1 real word for coverage
5 directory assistance utterances : 1 spontaneous, own forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
2 yes/no questions : 1 predominantly yes question, 1 predominantly no question
6 application keywords/keyphrases
1 word spotting phrase using embedded application words
4 phonetically rich words
9 phonetically rich sentences
3 spontaneous items (for control)

The following age distribution has been obtained: 381 speakers are between 16 and 30, 262 speakers are between 31 and 45, 129 speakers are between 46 and 60.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-000103: Oxford English phonetics files
Speech Related
Derived from a range of Oxford Dictionaries, these files list word forms together with a representation of their IPA pronunciation. They contain 250,000 words. Pronunciation is based on standard British English. Word forms include dictionary lemmas and inflections or other morphological variations, plus a wide range of proper names and encyclopedic material. The data also includes a large number of common multi-word phrases and compound nouns. The files are provided in XML.
C-000107: PAROLE Portuguese Corpus - complete version
Written Corpora
The PAROLE Portuguese corpus contains approximately 3 million running words of European Portuguese, distributed by medium as follows:

* Newspaper: about 65%, covering the period 1996-1997 of 3 titles;
* Book: about 20%, concerning 12 titles from 3 editing houses;
* Periodical: about 5%, concerning 7 weekly issues of 1 title, 1996;
* Miscellaneous: about 10%, concerning several files distributed by 8 titles.

The corpus was classified and encoded according to the common core PAROLE encoding standard. The file format of this corpus is SGML.

A subcorpus of the PAROLE Portuguese Corpus, which approximately reproduces the whole corpus distribution by medium (Newspaper: about 65%, Book: about 20%, Periodical: about 5%, Miscellaneous: about 10%), is also available.

It has about 250,000 words morpho-syntactically tagged according to the PAROLE common tagset and morpho-syntactic annotation standards. Disambiguation was manually checked.
- isVersionOf: C-000108: PAROLE Portuguese Corpus - tagged subset
C-000108: PAROLE Portuguese Corpus - tagged subset
Written Corpora
The PAROLE Portuguese corpus contains approximately 3 million running words of European Portuguese, distributed by medium as follows:

* Newspaper: about 65%, covering the period 1996-1997 of 3 titles;
* Book: about 20%, concerning 12 titles from 3 editing houses;
* Periodical: about 5%, concerning 7 weekly issues of 1 title, 1996;
* Miscellaneous: about 10%, concerning several files distributed by 8 titles.

The corpus was classified and encoded according to the common core PAROLE encoding standard. The file format of this corpus is SGML.

A subcorpus of the PAROLE Portuguese Corpus, which approximately reproduces the whole corpus distribution by medium (Newspaper: about 65%, Book: about 20%, Periodical: about 5%, Miscellaneous: about 10%), is also available.

It has about 250,000 words morpho-syntactically tagged according to the PAROLE common tagset and morpho-syntactic annotation standards. Disambiguation was manually checked.
