Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing 821 - 830 of 2023 results.

C-001425: German Polyphone Database (SpeechDat(M)) DB1
Telephone
The database consists of read speech. A prompt sheet with a unique identification number has been distributed to the potential callers.
The speech data were recorded over digital lines (ISDN), resulting in A-law format (8 bit) at an 8 kHz sampling rate. The data collection comprises 1000 speakers, with particular care taken to balance gender. The age of the callers was to be between 16 and 65 (no controlled distribution).
Callers could call from any kind of acoustic and network environment: home, business, mobile phone, phone booth, wired or cordless phone, etc. (No controlled distribution).
The regional distribution was expected to fit within the following scheme: from each of the 16 German states there were to be 32 speakers. Speakers from Austria, Switzerland and other countries were not controlled. The utterances to be gathered were specified in advance and consisted of several speech sequences, including sentences from different sources (local newspapers, existing corpora, law articles, etc.) to ensure good phonetic coverage, application words from a defined list of command words, digits (isolated digits, connected digits, and natural numbers), currency amounts, quantities, credit card numbers, spelled words (mainly names), time of day (spontaneous) and time phrase (prompted, word style), city of call/birth, etc.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- hasVersion: C-001426: German Polyphone Database (SpeechDat(M)) DB2
C-001426: German Polyphone Database (SpeechDat(M)) DB2
Telephone
German Polyphone Database (SpeechDat(M))
Phonetically rich sentences sub-set. See ELRA-S0018
- hasVersion: C-001425: German Polyphone Database (SpeechDat(M)) DB1
C-001427: German Pronunciation Rules Set - PHONRUL 9.0
Speech Related
PHONRUL is a collection of computer-readable underspecifying pronunciation rules of standard German. The set describes the most common known effects in German pronunciation that deviate from the so-called canonical (citation) form of words. The knowledge in this rule set was derived from empirical analysis of speech corpora as well as from a multitude of publications about German phonetics. The set does not contain any dialect-specific rules; however, the line between Standard German and dialects is indistinct. Presently, this rule set is used at the University of Munich to aid automatic segmentation and labelling of unknown speech utterances.
The rule set, in its present form, consists of approximately 1,500 complex rules which expand to 5,546 simple replacement rules. The rule set was designed for extended German SAM-PA, but can be translated into other alphabets (e.g. Worldbet, IPA) without much effort.
C-001428: German SpeechDat(II) FDB-1000
Telephone
The German SpeechDat(II) FDB 1000 consists of 988 calls over the German fixed network, stored on 4 CD-ROMs in the final SpeechDat(II) database exchange format. The speech databases made within the SpeechDat(II) project were validated by SPEX, the Netherlands, to assess their compliance with the SpeechDat format and content specifications.
The following items were recorded:
- 1 isolated digit (read or prompted)
- 1 sequence of 10 isolated digits
- 4 connected digits
- 4-6 digit number to identify the prompt sheet
- ca. 10 digit telephone number (read)
- 14-16 digit credit card number (read, 150 different credit card numbers were found)
- 6 digit PIN code (read)
- 1 natural number (read)
- 1 money amount (read)
- 3 spelled words (1 spontaneous name spelling, 2 read)
- 1 time of day (spontaneous)
- 1 time phrase (read)
- 1 date (spontaneous)
- 1 date (read)
- 1 relative date (read)
- 2 yes/no questions (spontaneous, not prompted)
- 3/6 common application words (read)
All application words are recorded more than 80 times.
These are:
- 1 application word phrase
- 9 phonetically rich sentences (read)
- 4 phonetically rich words (read)
- 5 directory assistance names (1 spontaneous name (e.g. forename), 1 spontaneous city name, 1 read city name (from a list of 500 most frequent), 1 read company/agency name (from a list of 500 most frequent), 1 read proper name, fore- and surname (from list of 150 SDB names))

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-001429: German spoken by Turkish OrienTel database
Telephone
The German spoken by Turkish OrienTel database comprises 332 Turkish speakers of German (167 males, 165 females) recorded over the German fixed and mobile telephone networks. The database is stored on 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
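
As a rough illustration of the storage format described above, the following Python sketch expands 8-bit A-law code words to 16-bit linear samples and derives a duration from the byte count. It assumes headerless signal files with one byte per sample at 8 kHz; the function names are hypothetical and not part of the OrienTel distribution, and the expansion follows the standard ITU-T G.711 A-law rule rather than any tool shipped with the database.

```python
SAMPLE_RATE_HZ = 8000  # 8 kHz, as stated in the database description


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code word to a signed 16-bit linear sample (G.711)."""
    code ^= 0x55                       # undo the even-bit inversion used by A-law
    mantissa = (code & 0x0F) << 4      # 4-bit mantissa, shifted into position
    segment = (code & 0x70) >> 4       # 3-bit segment (exponent)
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    # In A-law the sign bit is set for positive samples.
    return value if code & 0x80 else -value


def decode_alaw(raw: bytes) -> list[int]:
    """Decode a raw A-law byte string into a list of linear samples."""
    return [alaw_to_linear(byte) for byte in raw]


def duration_seconds(raw: bytes) -> float:
    """Duration of a headerless recording (assumption: one byte per sample)."""
    return len(raw) / SAMPLE_RATE_HZ
```

Under these assumptions, a file of 16,000 bytes corresponds to two seconds of speech. Any header or label information would have to be taken from the accompanying SAM label file, which this sketch does not parse.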

Each speaker uttered the following items:

* 1 isolated single digit
* 1 sequence of 10 isolated digits
* 5 connected digits : 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
* 1 currency money amount
* 2 natural numbers
* 3 dates : 1 spontaneous (date or year of birth), 1 prompted date, 1 relative or general date expression
* 2 time phrases : 1 time of day (spontaneous), 1 time phrase (word style)
* 4 spelled words : 1 spontaneous (own forename), 1 personal first name, 1 city name, 1 real word for coverage
* 8 directory assistance utterances : 1 spontaneous, own forename, 1 city of birth/growing up (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname, 1 Turkish forename and surname, 1 Turkish city name, 1 Turkish company name
* 3 yes/no questions : 1 predominantly yes question, 1 predominantly no question, 1 predominantly fuzzy answer, e.g. "don't know"
* 6 application keywords/keyphrases
* 1 word spotting phrase using embedded application words
* 6 phonetically rich words
* 9 phonetically rich sentences

The following age distribution has been obtained: 4 speakers are less than 16 years old, 179 speakers are between 16 and 30, 115 speakers are between 31 and 45, 29 speakers are between 46 and 60, and 5 are over 60 years old.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-001430: Hansard French/English
The Hansard Corpus consists of parallel texts in English and Canadian French, drawn from official records of the proceedings of the Canadian Parliament. While the content is therefore limited to legislative discourse, it spans a broad assortment of topics and the stylistic range includes spontaneous discussion and written correspondence along with legislative propositions and prepared speeches.

The collection presented here has been assembled by the LDC by way of archives from two distinct secondary sources. Material from one time period of parliamentary proceedings was acquired through the IBM T. J. Watson Research Center, while material from another period was acquired through Bell Communications Research Inc. (Bellcore). The combined collection covers a time span from the mid-1970s through 1988, with no apparent duplication between the two data sources.

Aside from covering different time periods, the two archives have different organization and have undergone different amounts and kinds of processing in being prepared as a parallel language resource. In addition, the Bellcore set itself comprises two distinct types of data -- one appears to be the main parliamentary proceedings (similar in nature to the IBM set), while the other consists of transcripts from committee hearings.

The three sets have been kept distinct in this publication and each is described in greater detail in separate documentation files.

In terms of what the three sets have in common:

* They are rendered here using the 8-bit ISO-Latin1 character encoding standard.
* They use a minimal amount of SGML tagging to identify sentences or paragraphs.
* All sets are organized using a parallel file structure, in which the content of a given English text file is matched by the content of a corresponding French text file.
* The SGML text files for the IBM and the Bellcore committee-hearings data are published in compressed form, using the public-domain GNU-Zip utility (gzip). The Bellcore main-session files are not compressed.

In terms of differences between the three sets:

* The IBM collection is presented as a sequence of parallel sentences (there are nearly 2.87 million parallel sentence pairs in the set).
* The Bellcore data are presented as sequences of paragraphs.
* The Bellcore main-session data are accompanied by mapping files that provide computed paragraph alignments and word-token correspondences; no additional alignment data are provided for the Bellcore committee texts (and none are needed for the IBM sentences).
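
The parallel file structure and encoding described above can be sketched in Python as follows. This is a minimal sketch, not the LDC's own tooling: the function names are hypothetical, and it assumes that each file holds one text unit per line and that the English and French files are line-parallel, which the description above does not confirm. It reads both gzip-compressed and uncompressed files as ISO-Latin1.

```python
import gzip
from pathlib import Path


def read_latin1(path: Path) -> list[str]:
    """Read a text file as ISO-8859-1, handling gzip (.gz) files transparently."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="iso-8859-1") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(english: Path, french: Path) -> list[tuple[str, str]]:
    """Pair lines of an English file with its French counterpart.

    Assumption: both files are line-parallel (one unit per line).
    """
    en_lines = read_latin1(english)
    fr_lines = read_latin1(french)
    if len(en_lines) != len(fr_lines):
        raise ValueError("files are not line-parallel")
    return list(zip(en_lines, fr_lines))
```

For the Bellcore paragraph data, where alignments are supplied separately or not at all, a simple line-by-line pairing like this would not be sufficient.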
C-001431: Hempel
Telephone
This corpus contains 3,909 recordings via public phone lines (fixed network only) of 3,909 German speakers with a total of 184,240 spoken words. The contents are free monologues answering the question: "Was haben Sie in der letzten Stunde gemacht?" (What did you do within the last hour?). 25.5 hours of speech were recorded with a maximum length of 1 minute (24 seconds average) per recording. The transcription is provided in the SpeechDat format. The database is conformant with the SpeechDat Exchange Format.
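
The figures quoted for this corpus can be cross-checked with a few lines of Python. The sketch below uses only the numbers stated in the entry (3,909 recordings, 184,240 words, 25.5 hours); the function and key names are hypothetical.

```python
RECORDINGS = 3909      # number of recordings (one per speaker)
WORDS = 184_240        # total spoken words
TOTAL_HOURS = 25.5     # total recorded speech in hours


def derived_stats(recordings: int, words: int, total_hours: float) -> dict[str, float]:
    """Derive per-recording and per-minute averages from the corpus totals."""
    total_seconds = total_hours * 3600
    return {
        "words_per_recording": words / recordings,
        "seconds_per_recording": total_seconds / recordings,
        "words_per_minute": words / (total_hours * 60),
    }


stats = derived_stats(RECORDINGS, WORDS, TOTAL_HOURS)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

With these totals the implied mean length is about 23.5 seconds per recording, close to the stated 24-second average; the small gap presumably reflects rounding in the catalogue figures.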
C-001432: Hong Kong Hansards Parallel Text
This publication contains the Hong Kong Special Administrative Region (HKSAR) Hansards Corpus produced by the Linguistic Data Consortium (LDC); catalog number LDC2000T50, ISBN 1-58563-175-2. This corpus contains excerpts from the Official Record of Proceedings of the Legislative Council of the HKSAR from October 1995 to April 2000.
- isPartOf:
C-001433: Hong Kong Laws Parallel Text
This FTP publication was obtained during January 1999 from the bilingual website of the Department of Justice of the Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China. The retrieved files have been processed and sentence aligned. LDC wishes to thank the Hong Kong Special Administrative Region of the People's Republic of China for granting the LDC permission to distribute this data to the research community.
- isPartOf:
C-001434: Hong Kong News Parallel Text
This FTP publication was created when the LDC collected parallel Chinese - English news articles from the Information Services Department of Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China.
- isPartOf:
