Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3,330. Showing items 1131 - 1140 of 2,023.

C-003363: ESTER Corpus
Broadcast Resources
The ESTER Corpus is a subset of the ESTER Evaluation Package (catalogue ref. ELRA-E0021), which was produced within the French national project ESTER (Evaluation of Broadcast News enriched transcription systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The ESTER project made it possible to carry out an evaluation campaign for Broadcast News enriched transcription systems using French data.

This corpus includes the material that was used for the ESTER evaluation campaign, excluding the textual data (available in this catalogue and referenced ELRA-W0015 and ELRA-W0023):

1) About 100 hours of orthographically transcribed news broadcast, including annotations of named entities.
2) Evaluation tools for evaluating each task defined above.
3) Two guides and manuals, provided in the package distributed by ELDA:
o Guide for the annotation of named entities
o Specifications and evaluation protocol

An extra corpus of 1,700 hours of non-transcribed radio broadcast news recordings can also be provided upon request, on hard disk, as an addition to this package at a cost of 100 euros (plus shipping fee).

A description of the project is available at the following address:
http://www.technolangue.net/article.php3?id_article=60 (in French)
- isPartOf: C-003362: ESTER Evaluation Package
C-003364: Text corpus of "Le Monde"
Written Corpora
Electronic archiving of "Le Monde" articles started on 1 January 1987. Some 200 articles are added every day, and as of October 1997 the database contains more than 500,000 articles, making it the biggest of its kind for all French daily newspapers.

Years 1987 to 2002 are available in an ASCII text format. Years 2003 to 2007 are available in .XML format. Each month consists of some 10 MB of data (circa 120 MB per year).

The number of words available since 2005 is given below:
- 2005: 19 million words
- 2006: 17 million words
- 2007: 21 million words

Years 2008 to 2012 are also available, in an ASCII text format, with no markup.

Data ranging from 1987 until 2012 are available through ELRA.
- isReferencedBy: C-003362: ESTER Evaluation Package
- isReferencedBy: C-003358: EQueR Evaluation Package
- isReferencedBy: C-003359: EvaSy Evaluation Package
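C-003365: Taiyo Corpus (太陽コーパス)
Written modern Japanese can be regarded as having become largely established between the end of the 19th century and the beginning of the 20th century, with the shift from literary-style to colloquial-style writing. The Taiyo Corpus was created as a database that supports research, from a variety of perspectives, on modern Japanese in this formative period.
The Taiyo Corpus is a structured-text version of the monthly magazine Taiyo (太陽, 1895-1928), published by Hakubunkan, with various kinds of information useful for linguistic research embedded in it. Taiyo was the most widely read general-interest magazine of its time and is characterized by its wide range of genres and its diverse authors.
- hasVersion: C-004324: Balanced Corpus of Contemporary Written Japanese (現代日本語書き言葉均衡コーパス)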
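C-003369: Osaka University of Foreign Studies Multilingual Parallel Travel Conversation Collection (sample) (大阪外国語大学多言語平行旅行会話文集)
This resource provides, as sample data, part of a multilingual parallel collection of travel-conversation sentences. From a set of 1,000 sentences, 100 of the more basic conversational sentences were selected and included. Sentences written as Japanese-English pairs were translated into a range of languages, chiefly Asian ones. They are arranged in Microsoft Excel 2003 table format so that the sentences in each language can be compared in parallel. The 12 languages covered are Japanese, Chinese, Korean, Mongolian, Thai, Vietnamese, Hindi, Persian, Arabic, Turkish, Spanish, and English. In addition, spoken audio data corresponding to each sentence of the Japanese, Persian, Turkish, and English conversations is provided as WAVE audio files.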
C-003370: Orientel United Arab Emirates MCA (Modern Colloquial Arabic)
Telephone
The OrienTel United Arab Emirates MCA (Modern Colloquial Arabic) database comprises 880 speakers (432 males, 448 females) recorded over the local fixed and mobile telephone network. This database is distributed on 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
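
As an illustration of the storage format described above, the following is a minimal sketch of expanding 8-bit A-law bytes to 16-bit linear samples using the standard ITU-T G.711 expansion. The function name `alaw_to_pcm16` is hypothetical, and the sketch assumes that a signal file holds only A-law sample bytes with no header, which the description does not state.

```python
def alaw_to_pcm16(data: bytes) -> list[int]:
    """Expand 8-bit A-law bytes to 16-bit linear PCM values (ITU-T G.711)."""
    samples = []
    for byte in data:
        byte ^= 0x55                      # undo the even-bit inversion
        magnitude = (byte & 0x0F) << 4    # 4-bit mantissa
        segment = (byte & 0x70) >> 4      # 3-bit segment number
        if segment == 0:
            magnitude += 8
        else:
            magnitude = (magnitude + 0x108) << (segment - 1)
        # After the inversion, a set top bit marks a positive sample.
        samples.append(magnitude if byte & 0x80 else -magnitude)
    return samples


# Assumption: the bytes would normally come from one headerless signal file.
print(alaw_to_pcm16(bytes([0xD5, 0x55, 0xAA, 0x2A])))
```

At 8 kHz with one byte per sample, one second of speech occupies 8,000 bytes in this format.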

Each speaker uttered the following items:
1 isolated single digit
1 sequence of 10 isolated digits
4 connected digits : 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14/16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
2 currency money amounts
1 natural number
4 dates : 1 spontaneous (date or year of birth), 1 prompted date, 1 relative or general date expression, 1 prompted date phrase (Islamic calendar)
2 time phrases : 1 time of day (spontaneous), 1 time phrase (word style)
3 spelled words : 1 spontaneous (own forename), 1 city name, 1 real/artificial word for coverage
5 directory assistance utterances : 1 spontaneous forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
2 yes/no questions : 1 predominantly yes question, 1 predominantly no question
6 application keywords/keyphrases
1 word spotting phrase using embedded application words
4 phonetically rich words
9 phonetically rich sentences
3 spontaneous items (for control)

The following age distribution has been obtained: 488 speakers are between 16 and 30, 309 speakers are between 31 and 45, 83 speakers are over 46.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- hasVersion: C-003371: Orientel United Arab Emirates MSA (Modern Standard Arabic)
- hasVersion: C-003372: Orientel English as spoken in the United Arab Emirates
- hasVersion: C-000100: OrienTel Morocco MCA (Modern Colloquial Arabic) database
- hasVersion: C-000967: OrienTel Morocco MSA (Modern Standard Arabic) database
- hasVersion: C-000406: OrienTel French as spoken in Morocco database
- hasVersion: C-000407: OrienTel Tunisia MCA (Modern Colloquial Arabic) database
- hasVersion: C-000969: OrienTel Tunisia MSA (Modern Standard Arabic) database
- hasVersion: C-000405: OrienTel French as spoken in Tunisia database
- hasVersion: C-001488: OrienTel Egypt MCA (Modern Colloquial Arabic) database
- hasVersion: C-001489: OrienTel Egypt MSA (Modern Standard Arabic) database
- hasVersion: C-001490: OrienTel English as spoken in Egypt database
- hasVersion: C-000408: OrienTel Hebrew database
- hasVersion: C-000409: OrienTel Arabic as spoken in Israel database
- hasVersion: C-000792: OrienTel Turkish database
- hasVersion: C-001429: German spoken by Turkish OrienTel database
- hasVersion: OrienTel Jordan MCA (Modern Colloquial Arabic)
- hasVersion: OrienTel Jordan MSA (Modern Standard Arabic)
- hasVersion: OrienTel English as spoken in Jordan
- hasVersion: OrienTel Greek as spoken in Cyprus
- hasVersion: OrienTel English as spoken in Cyprus
C-003371: Orientel United Arab Emirates MSA (Modern Standard Arabic)
Telephone
The OrienTel United Arab Emirates MSA (Modern Standard Arabic) database comprises 500 speakers (254 males, 246 females) recorded over the local fixed and mobile telephone network. This database is partitioned into 2 DVDs. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

Each speaker uttered the following items:
1 isolated single digit
2 sequences of 10 isolated digits
7 connected digits : 1 prompt sheet number (6 digits), 6 strings of 4 digits in written format
2 currency money amounts
2 natural numbers
3 dates : 1 prompted date, 1 relative or general date expression, 1 prompted date phrase (Islamic calendar)
1 time phrase (word style)
2 spelled items : string of 4-letter sequences written out as Aleph, Bae, Jim, etc.
3 directory assistance utterances : 1 frequent city name, 1 frequent company name, 1 common forename and surname
2 yes/no questions : 1 predominantly yes question, 1 predominantly no question
6 application keywords/keyphrases
1 word spotting phrase using embedded application words
4 phonetically rich words
9 phonetically rich sentences
4 spontaneous items (for control)

The following age distribution has been obtained: 318 speakers are between 15 and 30, 129 speakers are between 31 and 45, 53 speakers are over 46.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- hasVersion: C-003370: Orientel United Arab Emirates MCA (Modern Colloquial Arabic)
- hasVersion: C-003372: Orientel English as spoken in the United Arab Emirates
- hasVersion: C-000100: OrienTel Morocco MCA (Modern Colloquial Arabic) database
- hasVersion: C-000967: OrienTel Morocco MSA (Modern Standard Arabic) database
- hasVersion: C-000406: OrienTel French as spoken in Morocco database
- hasVersion: C-000407: OrienTel Tunisia MCA (Modern Colloquial Arabic) database
- hasVersion: C-000969: OrienTel Tunisia MSA (Modern Standard Arabic) database
- hasVersion: C-000405: OrienTel French as spoken in Tunisia database
- hasVersion: C-001488: OrienTel Egypt MCA (Modern Colloquial Arabic) database
- hasVersion: C-001489: OrienTel Egypt MSA (Modern Standard Arabic) database
- hasVersion: C-001490: OrienTel English as spoken in Egypt database
- hasVersion: C-000408: OrienTel Hebrew database
- hasVersion: C-000409: OrienTel Arabic as spoken in Israel database
- hasVersion: C-000792: OrienTel Turkish database
- hasVersion: C-001429: German spoken by Turkish OrienTel database
- hasVersion: OrienTel Jordan MCA (Modern Colloquial Arabic)
- hasVersion: OrienTel Jordan MSA (Modern Standard Arabic)
- hasVersion: OrienTel English as spoken in Jordan
- hasVersion: OrienTel Greek as spoken in Cyprus
- hasVersion: OrienTel English as spoken in Cyprus
C-003372: Orientel English as spoken in the United Arab Emirates
Telephone
The OrienTel English as spoken in the United Arab Emirates database comprises 535 speakers (266 males, 269 females) recorded over the local fixed and mobile telephone network. This database is partitioned into 2 DVDs. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

Each speaker uttered the following items:
1 isolated single digit
1 sequence of 10 isolated digits
5 connected digits : 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14/16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
1 currency money amount
2 natural numbers
3 dates : 1 spontaneous (date or year of birth), 1 prompted date, 1 relative or general date expression
2 time phrases : 1 time of day (spontaneous), 1 time phrase (word style)
3 spelled words : 1 first name, 1 city name, 1 real/artificial word for coverage
5 directory assistance utterances : 1 spontaneous forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
4 yes/no questions : 1 predominantly yes question, 1 predominantly no question, 2 fuzzy spontaneous, 1 fuzzy prompted
6 application keywords/keyphrases
1 word spotting phrase using embedded application words
4 phonetically rich words
9 phonetically rich sentences
4 spontaneous items (for control)

The following age distribution has been obtained: 319 speakers are between 15 and 30, 162 speakers are between 31 and 45, 54 speakers are over 46.
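
The speaker figures quoted for the three OrienTel United Arab Emirates entries (C-003370, C-003371, C-003372) can be cross-checked with simple arithmetic: the gender split and the three age bands should each add up to the stated total. The sketch below does this; the `SpeakerStats` class and the dictionary labels are illustrative names, not part of the catalogue.

```python
from dataclasses import dataclass


@dataclass
class SpeakerStats:
    total: int
    males: int
    females: int
    age_bands: tuple[int, int, int]  # youngest band, 31-45, over 46

    def is_consistent(self) -> bool:
        """True when both breakdowns add up to the stated total."""
        return (self.males + self.females == self.total
                and sum(self.age_bands) == self.total)


# Figures copied from the three catalogue entries.
UAE_DATABASES = {
    "C-003370 (MCA)": SpeakerStats(880, 432, 448, (488, 309, 83)),
    "C-003371 (MSA)": SpeakerStats(500, 254, 246, (318, 129, 53)),
    "C-003372 (English)": SpeakerStats(535, 266, 269, (319, 162, 54)),
}

for name, stats in UAE_DATABASES.items():
    print(name, "consistent" if stats.is_consistent() else "inconsistent")
```

All three entries pass this check.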

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- hasVersion: C-003370: Orientel United Arab Emirates MCA (Modern Colloquial Arabic)
- hasVersion: C-003371: Orientel United Arab Emirates MSA (Modern Standard Arabic)
- hasVersion: C-000100: OrienTel Morocco MCA (Modern Colloquial Arabic) database
- hasVersion: C-000967: OrienTel Morocco MSA (Modern Standard Arabic) database
- hasVersion: C-000406: OrienTel French as spoken in Morocco database
- hasVersion: C-000407: OrienTel Tunisia MCA (Modern Colloquial Arabic) database
- hasVersion: C-000969: OrienTel Tunisia MSA (Modern Standard Arabic) database
- hasVersion: C-000405: OrienTel French as spoken in Tunisia database
- hasVersion: C-001488: OrienTel Egypt MCA (Modern Colloquial Arabic) database
- hasVersion: C-001489: OrienTel Egypt MSA (Modern Standard Arabic) database
- hasVersion: C-001490: OrienTel English as spoken in Egypt database
- hasVersion: C-000408: OrienTel Hebrew database
- hasVersion: C-000409: OrienTel Arabic as spoken in Israel database
- hasVersion: C-000792: OrienTel Turkish database
- hasVersion: C-001429: German spoken by Turkish OrienTel database
- hasVersion: OrienTel Jordan MCA (Modern Colloquial Arabic)
- hasVersion: OrienTel Jordan MSA (Modern Standard Arabic)
- hasVersion: OrienTel English as spoken in Jordan
- hasVersion: OrienTel Greek as spoken in Cyprus
- hasVersion: OrienTel English as spoken in Cyprus
C-003373: MEDIA speech database for French
Telephone
The MEDIA speech database for French was produced by ELDA within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT).

It contains 1,258 transcribed dialogues from 250 adult speakers. The method chosen for the corpus construction process is that of a Wizard of Oz (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of tourism and hotel reservation.

The database is formatted following the SpeechDat conventions and it includes the following items:
1,258 recorded sessions for a total of 70 hours of speech. The signals are stored in a stereo wave file format. Each of the two speech channels is recorded at 8 kHz with 16 bit quantization with the least significant byte first (lohi or Intel format) as signed integers.
Manual transcription of each session in XML format. Label files were created with the free transcription tool Transcriber (TRS files).
Phonetic lexicon containing all the words spoken in the database. Column 1 contains the orthography of the French word. Column 2 shows the frequency of the word. Column 3 contains the pronunciation in SAMPA format. Here is a sample entry of the lexicon:
1) agitée 3 A/ Z i t e
Documentation and statistics are also provided with the database.
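
The three-column lexicon layout described above can be read with a few lines of code. The following is a minimal sketch that assumes the columns are separated by whitespace, that the orthography contains no spaces, and that the SAMPA symbols are space-separated; none of this is confirmed by the description, and `LexiconEntry` and `parse_lexicon_line` are hypothetical names. The input is the sample entry without its leading "1)" marker.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    word: str            # column 1: orthography
    frequency: int       # column 2: frequency of the word
    phonemes: list[str]  # column 3: SAMPA symbols


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Split one lexicon line into word, frequency, and SAMPA symbols."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got: {line!r}")
    return LexiconEntry(fields[0], int(fields[1]), fields[2:])


entry = parse_lexicon_line("agitée 3 A/ Z i t e")
print(entry.word, entry.frequency, entry.phonemes)
```

If the real files use a tab between columns, splitting on tabs first and then on spaces within the third column would be the safer choice.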

The semantic annotation of the corpus is available in this catalogue and referenced ELRA-E0024 (MEDIA Evaluation Package).
- hasVersion: C-003361: MEDIA Evaluation Package
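C-003375: JEITA Multimodal Dialogue Corpus (JEITAマルチモーダル対話コーパス)
A corpus of recorded human-to-human task dialogues. It contains 80 minutes of video data from 9 dialogues covering two tasks, a "face task" and a "travel task". It also includes transcriptions of the dialogue speech, as well as data annotated with tags for morphological information, dialogue structure, and prosody.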
C-003376: Japanese Speecon database
Desktop/Microphone
The Japanese Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 556 adult Japanese speakers (268 males, 288 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2) The second set comprises the recordings of 51 child Japanese speakers (25 boys, 26 girls), recorded over 4 microphone channels in 1 recording environment (children room).

This database is partitioned into 29 DVDs (first set) and 4 DVDs (second set).
The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications. Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
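
To make the sample format concrete, here is a minimal sketch of decoding 16-bit samples stored least significant byte first, read as unsigned integers as the description states. It assumes one channel per file and no file header, neither of which is specified here; the function names are hypothetical.

```python
import struct

SAMPLE_RATE_HZ = 16_000  # 16 kHz, as stated in the description


def decode_unsigned_pcm16(raw: bytes) -> list[int]:
    """Decode little-endian (lo-hi) unsigned 16-bit samples."""
    count = len(raw) // 2  # ignore a trailing odd byte, if any
    return list(struct.unpack(f"<{count}H", raw[:count * 2]))


def duration_seconds(raw: bytes) -> float:
    """Duration of a single-channel recording at 16 kHz."""
    return (len(raw) // 2) / SAMPLE_RATE_HZ


# Three samples: low byte first, then high byte.
print(decode_unsigned_pcm16(b"\x01\x00\xff\xff\x00\x80"))
```

If the files turn out to hold signed samples, replacing the `H` format code with `h` is the only change needed.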

Each speaker uttered the following items (over 290 items for adults and over 210 items for children):
Calibration data:
6 noise recordings
The silence word recording
Free spontaneous items (adults only):
5 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
17 Elicited spontaneous items (adults only):
3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
Read speech:
30 phonetically rich sentences uttered by adults and 60 uttered by children
5 phonetically rich words (adults only)
4 isolated digits
1 isolated digit sequence
4 connected digit sequences
1 telephone number
3 natural numbers
1 money amount
2 time phrases (T1 : analogue, T2 : digital)
3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
3 letter sequences
1 proper name
2 city or street names
2 questions
2 special keyboard characters
1 Web address
1 email address
218 application specific words and phrases per session (adults)
74 toy commands, 14 phone commands and 34 general commands (children)

The following age distribution has been obtained:
Adults: 236 speakers are between 15 and 30, 235 speakers are between 31 and 45, 85 speakers are over 46.
Children: 18 speakers are between 8 and 10, and 33 speakers are above 11 (up to voice break).

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

Prices available upon request. Please contact us.
- hasVersion: C-000095: Mandarin Chinese Speecon database
- hasVersion: C-000120: Portuguese Speecon database
- hasVersion: C-000136: Spanish Speecon database
- hasVersion: C-001554: US Spanish Speecon database
- hasVersion: C-000415: German Speecon database
- hasVersion: C-000936: Finnish Speecon database
- hasVersion: C-000941: French Speecon database
- hasVersion: C-000946: Hebrew Speecon database
- hasVersion: C-000952: Italian Speecon database
- hasVersion: C-000955: Korean Speecon database
- hasVersion: C-000974: Polish Speecon database
- hasVersion: C-000977: Russian Speecon database
- hasVersion: C-000995: Swedish Speecon database
- hasVersion: C-001000: Turkish Speecon database
- hasVersion: C-001002: UK English Speecon database
- hasVersion: C-001553: US English Speecon database
- hasVersion: C-001237: Taiwan Mandarin Speecon database
- hasVersion: C-001530: Swiss-German Speecon database
- hasVersion: C-003377: Danish Speecon Database
- hasVersion: C-003380: French-Canadian Speecon database
- hasVersion: C-003379: Dutch from Belgium Speecon Database
- hasVersion: C-003378: Dutch from the Netherlands Speecon Database
