Registered language resources: 3,330. Showing items 761-770 of 2,023.
  • C-001350: CELEX Dutch lexical database - Complete set
    Monolingual Lexicons
    The Dutch CELEX data is derived from R.H. Baayen, R. Piepenbrock & L. Gulikers, The CELEX Lexical Database (CD-ROM), Release 2, Dutch Version 3.1, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 1995.
    Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For the Dutch data, frequencies have been disambiguated on the basis of the 42.4m Dutch Instituut voor Nederlandse Lexicologie text corpora.
    To make for greater compatibility with other operating systems, the databases have not been tailored to fit any particular database management program. Instead, the information is presented in a series of plain ASCII files, which can be queried with tools such as AWK and ICON. Unique identity numbers allow the linking of information from different files. This database can be divided into different subsets:
    · orthography: with or without diacritics, with or without word division positions, alternative spellings, number of letters/syllables;
    · phonology: phonetic transcriptions with syllable boundaries or primary and secondary stress markers, consonant-vowel patterns, number of phonemes/syllables, alternative pronunciations, frequency per phonetic syllable within words;
    · morphology: division into stems and affixes, flat or hierarchical representations, stems and their inflections;
    · syntax: word class, subcategorisations per word class;
    · frequency of the entries: disambiguated for homographic lemmata.
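    Since the unique identity numbers are the only link between files, a query typically parses each backslash-separated file into a table keyed on the identity number and then joins the tables, much as one would with AWK. A minimal Python sketch (the backslash-separated, IdNum-first layout reflects the CELEX file format; the toy entries and column contents below are invented for illustration):

```python
# Minimal sketch of linking CELEX-style ASCII files on their unique
# identity numbers. Records are backslash-separated, one entry per
# line, with the IdNum in the first field; the sample words and
# frequency values are illustrative, not real CELEX data.

def parse_celex(lines):
    """Map IdNum -> remaining fields for one CELEX-style file."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\\")
        table[fields[0]] = fields[1:]
    return table

# Toy stand-ins for two CELEX files sharing identity numbers.
orthography = ["1\\aanbieden", "2\\huis"]
frequency   = ["1\\1634", "2\\9001"]

ortho = parse_celex(orthography)
freq  = parse_celex(frequency)

# Join the two files on IdNum, AWK-style.
joined = {i: ortho[i] + freq[i] for i in ortho if i in freq}
for idnum, fields in sorted(joined.items()):
    print(idnum, *fields)
```

    In practice the same pattern extends to any pair of the orthography, phonology, morphology, syntax and frequency files, since all of them carry the identity number in the first column.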
  • C-001352: CHIL 2004 Evaluation Package
    Multimodal/Multimedia Resources
    The CHIL 2004 Evaluation Package was produced within the CHIL Project (Computers in the Human Interaction Loop), an Integrated Project (IP 506909) under the European Commission's Sixth Framework Programme. The objective of the project is to create environments in which computers serve humans who are focused on interacting with other humans, rather than having to attend to and be preoccupied with the machines themselves. Instead of computers operating in isolation, with humans thrust into the loop of computers, the project puts computers into the human interaction loop (CHIL).

    In this context, the CHIL project produced the CHIL Seminars: scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. During the talks, videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker's voice and ambient sounds were recorded.

    The CHIL Seminars have been compiled in four different packages, according to the evaluations for which they have been created and used:
    - CHIL 2004 Evaluation Package (catalogue reference ELRA-E0009)
    - CHIL 2005 Evaluation Package (catalogue reference ELRA-E0010)
    - CHIL 2006 Evaluation Package (catalogue reference ELRA-E0017)
    - CHIL 2007 Evaluation Package (catalogue reference ELRA-E0033)

    The CHIL 2004 Evaluation Package consists of the following:

    The whole set of recordings amounts to almost 6 hours of audio and more than 2 hours of video. The language is European English spoken by non-native speakers. The recordings comprise videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker's voice and background sounds.

    The database consists of:
    1) Audio and Video Recordings: 10 seminars (7 seminars recorded from October to December 2003 and 3 seminars recorded in June 2004).
    2) Annotations: Video annotations produced on 1 frame out of every 10 in sequence, for the 4 cameras.
    3) Transcriptions: Transcriptions using both TRS and STMUID formats.
  • C-001353: CHIL 2005 Evaluation Package
    Multimodal/Multimedia Resources
    The CHIL 2005 Evaluation Package was produced within the CHIL Project (Computers in the Human Interaction Loop), an Integrated Project (IP 506909) under the European Commission's Sixth Framework Programme. The objective of the project is to create environments in which computers serve humans who are focused on interacting with other humans, rather than having to attend to and be preoccupied with the machines themselves. Instead of computers operating in isolation, with humans thrust into the loop of computers, the project puts computers into the human interaction loop (CHIL).

    In this context, the CHIL project produced the CHIL Seminars: scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. During the talks, videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker's voice and ambient sounds were recorded.

    The CHIL Seminars have been compiled in four different packages, according to the evaluations for which they have been created and used:
    - CHIL 2004 Evaluation Package (catalogue reference ELRA-E0009)
    - CHIL 2005 Evaluation Package (catalogue reference ELRA-E0010)
    - CHIL 2006 Evaluation Package (catalogue reference ELRA-E0017)
    - CHIL 2007 Evaluation Package (catalogue reference ELRA-E0033)

    The CHIL 2005 Evaluation Package consists of the following:

    1) Contents of the CHIL 2004 Evaluation Package (see catalogue reference ELRA-E0009 for description).
    2) Audio and Video Recordings: 5 seminars recorded in November 2004.
    3) Stereo Video Recordings of 10 subjects that move in the camera’s field of view while performing pointing gestures.
    4) Video annotations.
    5) Transcriptions.
  • C-001354: CLUVI Parallel Corpus
  • C-001356: CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information
  • C-001357: CRATER 2 Corpus
    Written Corpora
    The CRATER corpus was built upon the foundations of an earlier project, ET10/63, which was funded in the final phase of the Eurotra programme. The Corpus Resources and Terminology Extraction project (MLAP-93 20) extended the bilingual annotated English-French International Telecommunications Union corpus produced within ET10/63 to include Spanish.
    The CRATER 2 corpus was produced by the Department of Linguistics & Modern English Language, Lancaster University (United Kingdom) with funding from ELRA. The ELRA funding in turn was provided by the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335). This project enhanced the CRATER corpus, available under the reference ELRA-W0003 in the ELRA catalogue. CRATER 2 significantly expanded the French/English component of the parallel corpus, increasing it from 1,000,000 words per language to approximately 1,500,000 tokens per language.
    The offer consists of 1,500,000 tokens for English and French and of 1,000,000 tokens for Spanish, with morphosyntactic annotations (human-edited).
    CRATER 2 (ref. ELRA-W0033) includes CRATER (ref. ELRA-W0003).
  • C-001358: Cantonese SpeechDat-like MDB-2000
    Telephone
    The Cantonese SpeechDat-like MDB-2000 database contains the recordings of 2,000 Cantonese speakers (996 males, 1,004 females) recorded over the mobile telephone network in China and Hong Kong. The MDB-2000 database is partitioned into 11 CDs in ISO 9660 format. It follows the specifications given in the framework of the SpeechDat(II) project.

    Speech samples are stored as sequences of 8-bit, 8 kHz A-law samples. Each prompted utterance is stored in a separate file, accompanied by an ASCII SAM label file which contains the relevant descriptive information.
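    Reading such files requires expanding each A-law byte back to a linear sample. A pure-Python sketch of the standard G.711 A-law expansion (this is the generic G.711 algorithm, not code taken from the database documentation):

```python
# Decode one G.711 A-law byte to a linear 16-bit sample, as needed
# to read SpeechDat-style 8-bit A-law speech files.

def alaw_decode(byte):
    """Expand one A-law byte to a signed linear sample (16-bit scale)."""
    byte ^= 0x55                      # undo the even-bit inversion
    t = (byte & 0x0F) << 4            # quantisation step (mantissa)
    seg = (byte & 0x70) >> 4          # segment number (exponent)
    if seg == 0:
        t += 8
    else:
        t = (t + 0x108) << (seg - 1)
    return t if byte & 0x80 else -t   # sign bit set means positive

# Usage on a raw utterance file (hypothetical filename):
# samples = [alaw_decode(b) for b in open("utterance.alaw", "rb").read()]
print(alaw_decode(0xD5), alaw_decode(0x55))
```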

    This database was validated by SPEX (the Netherlands) to assess its compliance with the SpeechDat(II) format and content specifications.

    Each speaker uttered the following items:

    * 2 isolated digits
    * 1 sequence of 10 isolated digits
    * 4 connected digits (1 prompt sheet number, 4 digits; 1 telephone number, 9-11 digits; 1 credit card number, 14-16 digits; 1 PIN code, 6 digits)
    * 3 dates (1 spontaneous date e.g. birthday, 1 word style prompted date, 1 relative and general date expression)
    * 1 word spotting phrase using an embedded application word
    * 6 application words

    * 3 spelled words (1 spontaneous name e.g. own forename, 1 city name, 1 real/artificial word for coverage)
    * 1 currency money amount
    * 1 natural number
    * 5 directory assistance names (1 spontaneous name, e.g. own forename; 1 city of birth/growing up; 1 of the 500 most frequent cities; 1 of the 500 most frequent companies/agencies; 1 isolated city name; 1 "forename surname" out of a set of 400 'full' names)
    * 2 yes/no questions (1 predominantly "Yes" question, 1 predominantly "No" question)
    * 9 phonetically rich sentences
    * 2 time phrases (1 spontaneous time of day, 1 word style time phrase)
    * 4 phonetically rich words

    The following age distribution has been obtained: 74 speakers are under 16, 953 speakers are between 16 and 30, 636 speakers are between 31 and 45, 328 speakers are between 46 and 60, 9 speakers are over 60.

    A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
  • C-001359: Mandarin Chinese Desktop Speech Recognition Corpus - Digit String (120 people)
    Desktop/Microphone
    This corpus comprises 1,500 entries uttered by 120 speakers of different dialects, ages and educational levels (59 males and 61 females), recorded through a head-mounted noise-canceling microphone. The database comprises 3,600 digit strings. Speech samples are stored as 16-bit, 22.05 kHz WAV files, for a total of 6.2 hours of speech. The total size of the data is 945 MB.
    Each speaker read 120-150 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciation, channel distortions, words left-out and duplicates.
    The corpus is intended for the testing of natural speech recognition systems.
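    The quoted figures are mutually consistent: 6.2 hours of 16-bit mono audio at 22.05 kHz works out to roughly the stated total size. A quick sanity check, counting the raw sample payload only (WAV headers and the accompanying text files are ignored):

```python
# Back-of-the-envelope check of the stated corpus size from the
# sample rate, bit depth and duration given in the description.

hours = 6.2
sample_rate = 22_050      # Hz
bytes_per_sample = 2      # 16-bit PCM
channels = 1              # mono

total_bytes = int(hours * 3600 * sample_rate * bytes_per_sample * channels)
print(f"{total_bytes / 1024**2:.0f} MiB of raw audio")
```

    This yields about 939 MiB of raw audio, in line with the stated 945 MB once file headers and transcription files are added.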
  • C-001360: Mandarin Chinese Desktop Speech Recognition Corpus - Digit String (200 people)
    Desktop/Microphone
    This corpus comprises 1,500 entries uttered by 200 speakers of different dialects, ages and educational levels (87 males and 113 females), recorded over 4 channels (Mic1: SHURE SM58; Mic2: ANC-700 head-mounted; Mic3: TELEX M-60; Mic4: ACOUSTIC MAGIC). The database comprises 6,000 digit strings per channel. Speech samples are stored as 16-bit, 22.05 kHz WAV files, for 11.5 hours of speech per channel. The total size of the data is 6.82 GB.
    Each speaker read 30 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciation, channel distortions, words left-out and duplicates.
    The corpus is intended for the testing of natural speech recognition systems.
  • C-001361: Mandarin Chinese Desktop Speech Recognition Corpus - Digit String (849 people)
    Desktop/Microphone
    This corpus comprises 750 entries uttered by 849 speakers of different dialects, ages and educational levels (420 males and 429 females), recorded over 2 channels (Mic1: SHURE SM58; Mic2: Labtec Axis-002). The database comprises 12,750 digit strings per channel. Speech samples are stored as 16-bit, 44.1 kHz WAV files, for 21 hours of speech per channel. The total size of the data is 12.9 GB.
    Each speaker read 15 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciation, channel distortions, words left-out and duplicates.
    The corpus is intended for the testing of natural speech recognition systems.