Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing results 831 - 840 of 2023.

C-001435: Hong Kong Parallel Text
Hong Kong Parallel Text was produced by the Linguistic Data Consortium (LDC), with catalog number LDC2004T08 and ISBN 1-58563-290-2.
To support the research and development of automatic machine translation systems, LDC was sponsored to create English-Chinese parallel text collected from the Hong Kong Special Administrative Region (HKSAR).
Hong Kong Parallel Text comprises three sub-corpora: Hong Kong Hansards, Hong Kong Laws and Hong Kong News. Hong Kong Hansards contains excerpts from the Official Record of Proceedings of the Legislative Council of the HKSAR. Hong Kong Laws contains law codes acquired from the Department of Justice of the HKSAR. Hong Kong News contains press releases from the Information Services Department of the HKSAR.
- isPartOf: Hong Kong Hansards
- isPartOf: Hong Kong Laws
- isPartOf: Hong Kong News
C-001436: ICE-GB (British English component of the International Corpus of English)
Written Corpora
ICE-GB is the British component of the International Corpus of English (ICE). ICE began in 1990 with the primary aim of providing material for comparative studies of varieties of English throughout the world. Twenty centres around the world are preparing corpora of their own national or regional variety of English.

ICE-GB is fully grammatically analysed. Like all the ICE corpora, ICE-GB consists of a million words of spoken and written English and adheres to the common corpus design. 200 written and 300 spoken texts make up the million words. Every text is grammatically annotated, allowing complex and detailed searches across the whole corpus.

ICE-GB contains 83,394 parse trees, including 59,640 in the spoken part of the corpus.

ICE-GB has been fully checked. Linguists checked it at several stages of its completion, using both a traditional 'post-checking' strategy and cross-sectional error-based searches.

ICE-GB is distributed with the retrieval software ICECUP (the International Corpus of English Corpus Utility Program). ICECUP supports a variety of query types, including the use of the parse analyses to construct Fuzzy Tree Fragments to search the corpus.
C-001437: ILE: Italian LExicon
Speech Related
ILE is an Italian lexicon of 588,000 entries transcribed in SAMPA notation. It was generated, mainly for speech recognition purposes, by means of a morphological analyzer handling more than 100,000 morphemes, each of them transcribed and manually checked. Each stem was combined with all its possible suffixes to form valid words. Verbal forms do not include clitics.
The morpho-lexicon was obtained by processing an Italian dictionary and adding by hand all possible inflections. This base lexicon was then enriched with names and neologisms found in the 65,000 most frequent words of the newspaper "Il Sole 24 Ore". The most frequent Italian proper names and surnames (from the telephone directory), geographical names, acronyms, company names, and commonly used foreign words were also added to the lexicon.
All words are transcribed using SAMPA units for the Italian language. Where a word has multiple pronunciations, one row is provided for each transcription (about 601,000 transcriptions in total for the 588,000-word lexicon). Stressed vowels are marked with the ASCII character ". Foreign words are also transcribed using only SAMPA units for the Italian language, which leads to some awkward but effective transcriptions, at least for speech recognition purposes.
Some samples of ILE follow.
ANCORA "a n k o r a
ANCORA a n k "o r a
CESSARE tS e ss "a r e
CESSEREBBERO tS e ss e r "E bb e r o
CITTA` tS i tt "a
AIDS "a i d s
AIDS a i d i "E ss e
BABY-SITTER b E b i s "i tt e r
BABY-SITTER b e i b i s "i tt e r
BLUE-JEANS b l u dZ "i n s
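The row layout shown in the samples (a word followed by space-separated SAMPA units, with one row per pronunciation) can be read with a few lines of Python. This is a minimal sketch: the whitespace-separated layout is inferred from the samples above, and the names `parse_lexicon` and `stressed_index` are hypothetical helpers, not part of the ILE distribution.

```python
from collections import defaultdict

# Sample rows copied from the listing above: an orthographic word followed by
# space-separated SAMPA units, with '"' prefixed to the stressed vowel.
SAMPLE = '''\
ANCORA "a n k o r a
ANCORA a n k "o r a
CESSARE tS e ss "a r e
AIDS "a i d s
AIDS a i d i "E ss e
BLUE-JEANS b l u dZ "i n s
'''

def parse_lexicon(text):
    """Map each word to the list of its transcriptions (lists of SAMPA units)."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank rows
        word, *units = line.split()
        lexicon[word].append(units)
    return dict(lexicon)

def stressed_index(units):
    """Return the position of the unit carrying the stress mark, or None."""
    for i, unit in enumerate(units):
        if unit.startswith('"'):
            return i
    return None

lexicon = parse_lexicon(SAMPLE)
for word, variants in lexicon.items():
    for units in variants:
        print(word, stressed_index(units), " ".join(units))
```

Keeping a list of transcriptions per word preserves the multiple-pronunciation rows (for example the two stress patterns of ANCORA) instead of letting a later row overwrite an earlier one.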
C-001441: Japanese Mandarin Speech Recognition Corpus (desktop) single Japanese sentence (200 people)
Desktop/Microphone
This corpus comprises 12,000 Japanese Mandarin sentences uttered by 200 speakers of various dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, totaling 22.12 hours of speech per channel. The total data size is 28.45 GB.
Each speaker read 60 items. Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for use in testing and in telephone natural speech recognition systems.
- hasVersion: Japanese Mandarin Speech Recognition Corpus (desktop) – Japanese person name (200 people)
- hasVersion: Japanese Mandarin Speech Recognition Corpus (desktop) – digit string (200 people)
- hasVersion: Japanese Mandarin Speech Recognition Corpus (desktop) – Japanese place name (200 people)
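The size listed for this corpus can be sanity-checked from the stated recording format. The sketch below assumes uncompressed PCM audio, that the hours figure applies to each of the 4 channels, and that the size is reported in binary gigabytes (2^30 bytes); none of these assumptions is stated in the record, and `pcm_size_bytes` is a hypothetical helper name.

```python
def pcm_size_bytes(hours, sample_rate_hz=48_000, bits_per_sample=16, channels=4):
    """Raw size of uncompressed PCM audio, ignoring WAV header overhead."""
    seconds = hours * 3600
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8)
    return seconds * bytes_per_second * channels

# 22.12 hours per channel, as listed for the single-sentence corpus.
size = pcm_size_bytes(22.12)
print(f"{size / 1e9:.2f} GB (decimal)")
print(f"{size / 2**30:.2f} GiB (binary)")
```

Under these assumptions the binary figure comes out at roughly 28.5, close to the listed 28.45, which suggests the catalog sizes count bytes rather than bits.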
C-001442: Japanese Mandarin Speech Recognition Corpus (desktop) Japanese person name (200 people)
Desktop/Microphone
This corpus comprises 2,000 Japanese Mandarin person names uttered by 200 speakers of various dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, totaling 4.41 hours of speech per channel. The total data size is 5.67 GB.
Each speaker read 10 items. Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for use in testing and in telephone natural speech recognition systems.
- hasVersion: Japanese Mandarin Speech Recognition Corpus (desktop) – single Japanese sentence (200 people)
- hasVersion: Japanese Mandarin Speech Recognition Corpus (desktop) – digit string (200 people)
- hasVersion: Japanese Mandarin Speech Recognition Corpus (desktop) – Japanese place name (200 people)
C-001443: Japanese Mandarin Speech Recognition Corpus (desktop) Japanese place name (200 people)
Desktop/Microphone
This corpus comprises 2,000 Japanese Mandarin place names uttered by 200 speakers of various dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, totaling 3.09 hours of speech per channel. The total data size is 3.96 GB.
Each speaker read 10 items. Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for use in testing and in telephone natural speech recognition systems.
- hasVersion: Japanese Mandarin Speech Recognition Corpus (desktop) – single Japanese sentence (200 people)
- hasVersion: Japanese Mandarin Speech Recognition Corpus (desktop) – digit string (200 people)
- hasVersion: Japanese Mandarin Speech Recognition Corpus (desktop) – Japanese person name (200 people)
C-001444: Japanese Mandarin Speech Recognition Corpus (desktop) digit string (200 people)
Desktop/Microphone
This corpus comprises 8,000 Japanese Mandarin digit strings uttered by 200 speakers of various dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, totaling 16.22 hours of speech per channel. The total data size is 23.23 GB.
Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for use in testing and in telephone natural speech recognition systems.
- hasVersion: Japanese Mandarin Speech Recognition Corpus (desktop) – single Japanese sentence (200 people)
- hasVersion: Japanese Mandarin Speech Recognition Corpus (desktop) – Japanese person name (200 people)
- hasVersion: Japanese Mandarin Speech Recognition Corpus (desktop) – Japanese place name (200 people)
C-001446: Karl May Korpus (KMK)
Written Corpora
The "Karl-May-Korpus" is a monolingual German corpus, available in an SGML-tagged ASCII text format. It contains the works of the German author Karl May (1842-1912) and consists of around 1.6 million words (divided into 9 subcorpora of about 180,000 words each). The corpus was created between 1993 and 1997.

Each word form is tagged with a word class (one of 43 classes) and the appropriate lemma.

File format: Text
Standard in use: SGML
Character set: 8-bit ASCII
C-001447: Korean Mandarin Speech Recognition Corpus (desktop) place name (150 people)
Desktop/Microphone
This corpus comprises 1,500 Korean Mandarin place names uttered by 150 speakers of various dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, totaling 1.53 hours of speech per channel. The total data size is 2 GB.
Each speaker read 10 items. Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for use in testing and in telephone natural speech recognition systems.
C-001448: Korean Mandarin Speech Recognition Corpus (desktop) digit string (110 people)
Desktop/Microphone
This corpus comprises 13,200 Korean Mandarin digit strings uttered by 110 speakers of various dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, totaling 18.87 hours of speech per channel. The total data size is 24.2 GB.
Each speaker read 120 items. Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for use in testing and in telephone natural speech recognition systems.
