Number of registered language resources: 3330
Showing 1 - 10 of 2023
-
C-000004: 863 program in 2004 name entry identification evaluation data
The corpus contains two categories: simplified characters (241 files, about 400 thousand Chinese characters) and traditional characters (126 files, about 400 thousand Chinese characters). Log in: http://www.863data.org.cn
http://www.chineseldc.org/EN/doc/2004-863-002/intro.htm -
C-000005: 863 program in 2004 speech recognition evaluation data
The corpus includes three parts: Chinese desktop speech, telephone speech, and PDA speech. Log in: http://www.863data.org.cn
http://www.chineseldc.org/EN/doc/2004-863-006/intro.htm -
C-000006: 863 program in 2005 information index evaluation data
CWT100g, a Chinese web corpus containing 5,712,710 web pages. The relevant documents are extracted after pooling the results submitted by the participating systems in the IR evaluation. http://www.863data.org.cn
http://www.chineseldc.org/EN/doc/2005-863-002/intro.htm -
C-000007: 863 program in 2005 machine translation evaluation data
Includes Chinese-English, English-Chinese, Chinese-Japanese, Japanese-Chinese, English-Japanese, and Japanese-English data. Two types: dialogue and writing. Domains: Olympic-related for dialogue and news for writing. http://www.863data.org.cn
http://www.chineseldc.org/EN/doc/2005-863-001/intro.htm
- hasVersion: N-001206: 863 program in 2003 machine translation evaluation data
- hasVersion: 863 program in 2004 machine translation evaluation data
-
C-000008: 863 program in 2005 speech recognition evaluation data
The data consist of desktop PC speech data and telephone speech data. http://www.863data.org.cn
http://www.chineseldc.org/EN/doc/2005-863-003/intro.htm -
C-000009: ASCCD-Annotated Speech Corpus of Chinese Discourse
ASCCD comprises a text corpus, WAV data, and labeling information. It is suited to speech and language research, speech software development, and foundational teaching of Mandarin.
http://www.chineseldc.org/EN/doc/CLDC-SPC-2005-012/intro.htm -
C-000010: AURORA Project Database - Aurora 4a - Evaluation Package
The Aurora project was originally set up to establish a worldwide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system. ETSI formally adopted this activity as work items 007 and 008.
The two work items within ETSI are:
ETSI DES/STQ WI007: DSR - Front-end feature extraction algorithm & compression algorithm
ETSI DES/STQ WI008: DSR - Advanced feature extraction algorithm
The Aurora project has released a number of list files for training and testing on the Wall Street Journal (WSJ0) data at two sampling rates: 8 kHz and 16 kHz. The Aurora 4a database is based on WSJ0, with noise artificially added over a range of signal-to-noise ratios. It contains both clean and multicondition training sets and 14 evaluation sets with different noise types and microphones.
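As a rough illustration of what "artificial addition of noise" at a given signal-to-noise ratio means, the sketch below scales a noise signal so that the mixture reaches a target SNR in dB. It is a generic textbook mix, not the Aurora project's own noise-addition procedure; the function names `mean_power` and `mix_at_snr` are hypothetical, and signals are assumed to be plain lists of float samples.

```python
import math

def mean_power(samples):
    """Average power (mean of squared samples)."""
    return sum(v * v for v in samples) / len(samples)

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the result has the requested SNR in dB.

    Hypothetical helper: assumes `noise` is at least as long as `clean`.
    """
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    p_clean = mean_power(clean)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    # SNR_dB = 10 * log10(P_clean / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]

clean = [1.0, -1.0] * 4
noise = [0.5, 0.5, -0.5, -0.5] * 2
noisy = mix_at_snr(clean, noise, snr_db=10)
```

Repeating the call with different `snr_db` values would give one noisy copy per SNR condition.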
Two original copies of the contract must be sent to ELDA. -
C-000012: Albayzin corpus
Desktop/Microphone
This corpus consists of 3 sub-corpora of 16 kHz, 16-bit signals, recorded by 304 Castilian speakers.
The 3 sub-corpora are:
- Phonetic corpus: 6,800 utterances of phonetically balanced sentences, including 1,000 with phonetic segmentation.
- Geographic corpus: 6,800 utterances of sentences extracted from a Spanish geographic database.
- "Lombard" corpus: 2,000 utterances from various corpora. -
C-000013: Al-Hayat Arabic Corpus
Written Corpora
The corpus was developed in the course of a research project at the University of Essex, in collaboration with the Open University.
The corpus contains Al-Hayat newspaper articles with value added for Language Engineering and Information Retrieval applications development purposes.
The data have been distributed into 7 subject-specific databases, thus following the Al-Hayat subject tags: General, Car, Computer, News, Economics, Science, and Sport.
Mark-up, numbers, special characters and punctuation have been removed. The size of the total file is 268 MB. The dataset contains 18,639,264 distinct tokens in 42,591 articles, organised in 7 domains. -
C-000014: Austrian SpeechDat(AT) FDB-1000 database
Telephone
The SpeechDat(AT) FDB-1000 database contains the recordings of 1,000 Austrian speakers (544 males, 456 females) recorded over the Austrian fixed telephone network. The database is partitioned into 5 CD-ROMs, in ISO 9660 format.
Speech samples are stored as uncompressed sequences of 8-bit, 8 kHz A-law samples. Each prompted utterance is stored in a separate file, and each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
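For readers unfamiliar with the sample format, the following sketch expands 8-bit A-law bytes into 16-bit linear PCM values using the standard G.711 expansion. It assumes the signal bytes are already in memory as a `bytes` object; the function names `alaw_to_linear` and `decode_alaw` are hypothetical and are not part of any SpeechDat tooling.

```python
def alaw_to_linear(code):
    """Expand one 8-bit A-law code to a signed 16-bit linear PCM value (G.711)."""
    code ^= 0x55                      # undo the even-bit inversion applied by the encoder
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude

def decode_alaw(data):
    """Decode a bytes object of A-law samples into a list of linear PCM ints."""
    return [alaw_to_linear(b) for b in data]

pcm = decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A]))
```

At 8 kHz, one second of speech corresponds to 8,000 such bytes.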
This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SpeechDat format and content specifications.
Each speaker uttered the following items:
* 3 isolated digits
* 4 connected digits (prompt sheet number 5 digits, telephone number 9/11 digits, credit card number 15/16 digits, PIN code 6 digits)
* 1 natural number
* 2 money amounts (currency amount, mixed size and units)
* 2 yes/no questions (predominantly "yes", predominantly "no")
* 3 dates (spontaneous date e.g. birthday, prompted date, relative and general date expression)
* 2 times (spontaneous time of day, prompted mixed/analogue digital)
* 6 application words
* 1 word spotting phrase using embedded application words
* 7 directory assistance names (spontaneous names e.g. forenames, city of birth, a name out of a set of 150 SDB full names, most frequent cities, most frequent companies)
* 3 spellings (spontaneous e.g. forename, directory city name, real/artificial city name)
* 4 isolated words
* 12 phonetically rich sentences
* 7 speaker-specific items (speaker gender question, call from fixed or mobile network, speaker region question, today's date, environment of call, native language, educational level)
The following age distribution has been obtained: 15 speakers are under 16, 444 are between 16 and 30, 328 are between 31 and 45, 184 are between 46 and 60, and 29 speakers are over 60.
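The stated figures can be cross-checked with simple arithmetic: both the gender split and the age bands should add up to the 1,000 recorded speakers. The snippet below is only a sanity check on the numbers quoted in this entry; the dictionary keys are illustrative labels.

```python
# Figures copied from the catalogue entry above.
gender_counts = {"male": 544, "female": 456}
age_counts = {
    "under 16": 15,
    "16-30": 444,
    "31-45": 328,
    "46-60": 184,
    "over 60": 29,
}

total_by_gender = sum(gender_counts.values())
total_by_age = sum(age_counts.values())

# Both breakdowns account for all 1,000 speakers.
print(total_by_gender, total_by_age)  # prints "1000 1000"
```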
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.