Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing items 671 - 680 of 2023.

C-001237: Taiwan Mandarin Speecon database
Desktop/Microphone
The Taiwan Mandarin Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 550 adult Taiwanese speakers (273 males, 277 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2) The second set comprises the recordings of 50 child Taiwanese speakers (25 boys, 25 girls), recorded over 4 microphone channels in 1 recording environment (children's room).

This database is partitioned into 56 DVDs (first set) and 3 DVDs (second set).
The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.
Each of the four speech channels is recorded at 16 kHz, 16-bit, as uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file has a corresponding ASCII SAM label file which contains the relevant descriptive information.
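The signal format described above can be decoded with a few lines of Python. This is a minimal sketch, assuming the signal files are headerless raw sample streams; the function names and any file path passed to `read_channel` are hypothetical and not part of the Speecon specification.

```python
import struct

SAMPLE_RATE_HZ = 16_000  # 16 kHz, as stated in the description
BYTES_PER_SAMPLE = 2     # 16-bit samples


def decode_samples(raw: bytes) -> list[int]:
    """Decode 16-bit little-endian (lo-hi) unsigned samples from raw bytes."""
    # Ignore a trailing odd byte, if any, so the buffer is a whole number of samples.
    usable = len(raw) - (len(raw) % BYTES_PER_SAMPLE)
    count = usable // BYTES_PER_SAMPLE
    # "<" selects little-endian byte order, "H" an unsigned 16-bit integer.
    return list(struct.unpack(f"<{count}H", raw[:usable]))


def duration_seconds(samples: list[int]) -> float:
    """Length of one channel in seconds at the stated sampling rate."""
    return len(samples) / SAMPLE_RATE_HZ


def read_channel(path: str) -> list[int]:
    """Read one signal file (assumed headerless) and return its samples."""
    with open(path, "rb") as signal_file:
        return decode_samples(signal_file.read())
```

The sketch follows the description in treating samples as unsigned; if the decoded values turn out to be signed in practice, the `struct` format character `h` is the signed 16-bit equivalent of `H`.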

Each speaker uttered the following items:
Calibration data:
6 noise recordings
The silence word recording
Free spontaneous items (adults only):
5 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
17 Elicited spontaneous items (adults only):
3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
Read speech:
30 phonetically rich sentences uttered by adults and 60 uttered by children
5 phonetically rich words (adults only)
4 isolated digits
1 isolated digit sequence
4 connected digit sequences
1 telephone number
3 natural numbers
1 money amount
2 time phrases (T1 : analogue, T2 : digital)
3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
3 letter sequences
1 proper name
2 city or street names
2 questions
2 special keyboard characters
1 Web address
1 email address
46 core word synonyms
208 application specific words and phrases per session (adults)
74 toy commands, 14 phone commands and 34 general commands (children)

The following age distribution has been obtained:
Adults: 246 speakers are between 15 and 30, 235 speakers are between 31 and 45, 63 speakers are between 46 and 60, and 6 speakers are over 60.
Children: 21 speakers are between 7 and 10, 29 speakers are between 11 and 14.
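The age bands above can be cross-checked against the speaker counts given for the two sets (550 adults, 50 children). A minimal sketch of that arithmetic; the dictionary names are illustrative only, and all figures are copied from the description.

```python
# Speaker counts per age band, copied from the description above.
adult_age_bands = {"15-30": 246, "31-45": 235, "46-60": 63, "over 60": 6}
child_age_bands = {"7-10": 21, "11-14": 29}

# Speaker counts per gender, copied from the description of the two sets.
adults_by_gender = {"male": 273, "female": 277}
children_by_gender = {"boys": 25, "girls": 25}

adult_total = sum(adult_age_bands.values())
child_total = sum(child_age_bands.values())

# Both breakdowns should describe the same speaker population.
assert adult_total == sum(adults_by_gender.values()) == 550
assert child_total == sum(children_by_gender.values()) == 50
```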

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-001238: The identifiable speech database of Chinese mandarin-----extract database
Number of speakers recorded: All the audio was recorded by a professional recorder.
Recording content: sentences, number strings, exiguous words, number clusters, measurement units, light tone, Greek words, interrogative sentences, English words, and simulated hotel booking.
http://www.chineseldc.org/EN/doc/CLDC-SPC-2006-032/intro.htm
- hasPart: The identifiable speech database of Chinese mandarin -----wide label
C-001239: The identifiable speech database of Chinese mandarin-----wide label
Number of speakers recorded: All the audio was recorded by a professional recorder.
Recording content: sentences, number strings, exiguous words, number clusters, measurement units, light tone, Greek words, interrogative sentences, English words, and simulated hotel booking.
http://www.chineseldc.org/EN/doc/CLDC-SPC-2006-031/intro.htm
- isPartOf: The identifiable speech database of Chinese mandarin -----extract database
C-001240: The identifiable speech database of tabletop speech--free topic (50 persons)
Number of speakers recorded: The product uses 50 speakers in total (38 males, 32 females). The speakers have different accents, ages, and educational backgrounds.
Recording content: Every speaker talks freely on 12 topics.
Product size: The product data is 242 MB in total, 8 hours.
http://www.chineseldc.org/EN/doc/CLDC-SPC-2006-016/intro.htm
C-001241: The identifiable speech database of tabletop speech--the message (120 persons)
Number of speakers recorded: The product uses 120 speakers in total (59 males, 61 females). The speakers have different accents, ages, and educational backgrounds.
Recording content: 50 speakers with 120 messages per speaker; 70 speakers with 150 messages per speaker.
Product size: The product data is 327 MB in total, 21.7 hours.
http://www.chineseldc.org/EN/doc/CLDC-SPC-2006-012/intro.htm
- isPartOf: The identifiable speech database of tabletop speech--the message (200 persons)
- hasVersion: The identifiable speech database of tabletop speech--the number string (200 persons)
- hasVersion: The identifiable speech database of tabletop speech--the number string (120 persons)
- hasVersion: The identifiable speech database of tabletop speech--the number string (10 persons)
- hasVersion: The identifiable speech database of tabletop speech--the people’s name, the place’ name (120 persons)
- hasVersion: The identifiable speech database of tabletop speech--the stock (70 persons)
- hasVersion: The identifiable speech database of tabletop speech--free topic (50 persons)
C-001242: The identifiable speech database of tabletop speech--the message (200 persons)
Number of speakers recorded: The product uses 200 speakers in total (87 males, 113 females). The speakers have different accents, ages, and educational backgrounds.
Recording content: 120 sentences per speaker.
Product size: The 4 channels are 205 MB, 14.2 hours in total. The single channels are 5376 MB, 35.6 hours in total.
http://www.chineseldc.org/EN/doc/CLDC-SPC-2006-009/intro.htm
- hasVersion: The identifiable speech database of tabletop speech--the number string (200 persons)
- hasVersion: The identifiable speech database of tabletop speech--the number string (120 persons)
- hasVersion: The identifiable speech database of tabletop speech--the number string (10 persons)
- hasVersion: The identifiable speech database of tabletop speech--free topic (50 persons)
- isPartOf: The identifiable speech database of tabletop speech--the message (120 persons)
- hasVersion: The identifiable speech database of tabletop speech--the stock (70 persons)
- hasVersion: The identifiable speech database of tabletop speech--the people’s name, the place’ name (120 persons)
C-001244: Turkish Continuous and Isolated Word Speech Database
Desktop/Microphone
This Turkish speech database was produced by the department of Théorie des Circuits et Traitement de Signal at the Faculté Polytechnique de Mons. The corpus was designed to provide read speech data for speech recognition purposes. The database contains 14 hours of speech (1618 words) from 43 Turkish speakers (adults over 18; 22 males, 21 females) from Belgium, Germany and Turkey (Istanbul, Ankara, Malatya), recorded at 32 kHz on DAT with a Sennheiser MD-441-U microphone. The speech signal was sampled at 16 kHz and digitised with 16 bits. Each speaker read a predetermined text of 215 sentences and 100 isolated words, in quiet conditions. Parts of the corpus were labelled and segmented phonemically. Phonetic and orthographic transcriptions of sentences and isolated words are provided.
C-001246: 1997 Mandarin Broadcast News Speech (HUB4-NE)
*Introduction*

This collection consists of 30 hours of broadcast news recordings from the following sources: Voice of America (VOA), China Central TV (CCTV) and KAZN-AM, a commercial radio station based in Los Angeles, CA.

Of these three sources, the first two comprise the bulk of the collection and are represented in roughly equal amounts. Only a relatively small sample of KAZN-AM recordings is included, owing to the relatively high proportion of unusable material in that source (e.g., commercials, local traffic reports).

Corresponding transcripts are released as 1997 Mandarin Broadcast News Transcripts (HUB4-NE) LDC98T24.

*Data*

All recordings were made using a single channel and a 16 kHz sampling frequency. Most files contain 30 minutes of recordings. There are some larger files consisting of 60 minutes and 120 minutes of programming.

*Updates*

There are no updates at this time.

*Pricing*

The Reduced Licensing Fee for this corpus is US$400.
- isFormatOf: C-001247: 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
C-001247: 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
*Introduction*

This collection consists of 30 hours of transcripts of Mandarin Chinese broadcast news recordings from the following sources: Voice of America (VOA), China Central TV (CCTV) and KAZN-AM, a commercial radio station based in Los Angeles, CA.

Of these three sources, the first two comprise the bulk of the collection and are represented in roughly equal amounts. Only a relatively small sample of KAZN-AM recordings is included, owing to the relatively high proportion of unusable material in that source (e.g., commercials, local traffic reports).

Corresponding audio files are released as 1997 Mandarin Broadcast News Speech (HUB4-NE) LDC98S73.

*Data*

The transcripts were created by native speakers of Mandarin working at LDC. They are in GB-encoded form with SGML tags to identify story boundaries, speaker turn boundaries and phrasal pauses. The tags include time stamps to align the text with the speech data. Word segmentation (white-space between words) is included. A working DTD is provided, and the markup is consistent with that of the 1997 English and Spanish HUB4 collections.
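Reading such a transcript amounts to decoding the GB bytes and separating the markup from the whitespace-segmented words. The sketch below is illustrative only: the description says just "GB-encoded", so it assumes GB2312-compatible bytes and uses Python's `gb18030` codec (a superset), and the tag and attribute names in the sample are hypothetical rather than taken from the corpus DTD.

```python
import re

# Matches any SGML-style tag; the actual tag inventory is defined by the corpus DTD.
TAG_PATTERN = re.compile(r"<[^>]+>")


def decode_transcript(raw: bytes) -> str:
    """Decode GB-encoded transcript bytes (assumption: GB2312-compatible)."""
    return raw.decode("gb18030")


def words_without_markup(text: str) -> list[str]:
    """Drop the tags and return the whitespace-segmented words."""
    return TAG_PATTERN.sub(" ", text).split()


# Hypothetical one-turn sample; real files use the tags defined in the provided DTD.
sample = '<turn speaker="A" startTime="0.00">今天 天气 很 好</turn>'.encode("gb2312")
words = words_without_markup(decode_transcript(sample))
```

Because word segmentation is already encoded as white space, a plain `split()` is enough to recover the words once the tags are removed.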

*Updates*

There are no updates at this time.

*Pricing*

The Reduced Licensing Fee for this corpus is US$100.
- hasVersion: C-001246: 1997 Mandarin Broadcast News Speech (HUB4-NE)
C-001248: 20 Newsgroups
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
20 newsgroups:
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
- hasVersion: 20news-19997.tar.gz
- hasVersion: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-19997.tar.gz
- hasVersion: 20news-bydate.tar.gz
- hasVersion: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
- hasVersion: 20news-18828.tar.gz
- hasVersion: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-18828.tar.gz
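The "(nearly) evenly" partition can be checked on an unpacked copy of any of the archives listed above. A minimal sketch, assuming the archive extracts to one directory per newsgroup with one file per document; the function name and the example path are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_documents(root: str) -> Counter:
    """Count document files in each newsgroup directory under root."""
    counts = Counter()
    for group_dir in Path(root).iterdir():
        if group_dir.is_dir():
            # Each regular file is treated as one newsgroup document.
            counts[group_dir.name] = sum(
                1 for entry in group_dir.iterdir() if entry.is_file()
            )
    return counts


# Example usage (hypothetical path to an unpacked archive):
# for group, n in sorted(count_documents("20news-18828").items()):
#     print(f"{group}: {n}")
```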
