-
C-001028: JEIDA/JCSD-Channel 1 Control Words
*Introduction*
The Japanese Electronic Industry Development Association's (JEIDA) Common Speech Data Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of the Institute for Signal and Information Processing at Mississippi State University.
*Data*
This collection consists of high-fidelity recordings of 150 native speakers of Japanese; each speaker produces four repetitions of 323 short prompts, including city names, control words, monosyllabic words, isolated digits and strings of four digits. Each reading session was recorded with two microphones, yielding two channels that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data recorded with a standard dynamic microphone, a Sanken MU-2C. Channel 1 (LDC96S65), which is available separately, contains data recorded simultaneously with a condenser microphone that presumably varied from site to site.
A summary of the size and content of the corpus is given below:

| Property | Value |
|---|---|
| number of speakers | 150 speakers |
| males | 75 |
| females | 75 |
| range of speaker age | 10 yrs. to 70 yrs. |
| number of items per speaker | 323 items |
| isolated digits | 15 |
| four digit sequences | 35 |
| city names | 100 |
| monosyllables | 110 |
| control words (set A) | 13 |
| control words (set B) | 24 |
| control words (set C) | 26 |
| number of repetitions per item | 4 repetitions |
| total number of utterances | 193,763 utterances (per channel) |
| sample frequency | 16 kHz |
| sample type | 16-bit linear |
| number of microphones | 2 (dynamic and condenser) |

For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs; the partitioning of the data files has been done primarily by channel (20 CD-ROMs each for channel 0 and channel 1) and secondarily by category of prompts. These prompts include:

| Description | Number of items |
|---|---|
| Control Words: Banking Services | 13 |
| Control Words: Word Processors | 24 |
| Control Words: Home Electronic Equipment | 26 |
| Digits: Isolated Digits | 15 |
| Digits: Four Digit Sequences | 35 |
| City Names: a phonetically-rich subset of common Japanese city names | 100 |
| Monosyllables: all Japanese monosyllables plus several used to pronounce foreign words | 110 |

JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets. Components of the corpus can also be purchased as outlined below:

| Price | Set-of | Description | Catalog ID |
|---|---|---|---|
| 2000 | 20 | JEIDA/JCSD-Channel 0 (Complete) | LDC96S64 |
| 600 | 6 | JEIDA/JCSD-Channel 0 City Names | LDC96S64-1 |
| 400 | 4 | JEIDA/JCSD-Channel 0 Control Words | LDC96S64-2 |
| 100 | 1 | JEIDA/JCSD-Channel 0 Isolated Digits | LDC96S64-3 |
| 300 | 3 | JEIDA/JCSD-Channel 0 Four Digit Seq. | LDC96S64-4 |
| 600 | 6 | JEIDA/JCSD-Channel 0 Monosyllables | LDC96S64-5 |
| 2000 | 20 | JEIDA/JCSD-Channel 1 (Complete) | LDC96S65 |
| 600 | 6 | JEIDA/JCSD-Channel 1 City Names | LDC96S65-1 |
| 500 | 4 | JEIDA/JCSD-Channel 1 Control Words | LDC96S65-2 |
| 100 | 1 | JEIDA/JCSD-Channel 1 Isolated Digits | LDC96S65-3 |
| 300 | 3 | JEIDA/JCSD-Channel 1 Four Digit Seq. | LDC96S65-4 |
| 600 | 6 | JEIDA/JCSD-Channel 1 Monosyllables | LDC96S65-5 |

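As a quick consistency check on the figures listed for this corpus, the following Python sketch sums the per-category item counts and compares the nominal number of utterances per channel (speakers × items × repetitions) with the reported total. The variable names are illustrative only and are not part of the corpus documentation; the listing does not explain why the reported total is slightly below the nominal one.

```python
# Per-category prompt counts, copied from the corpus summary.
items = {
    "isolated digits": 15,
    "four digit sequences": 35,
    "city names": 100,
    "monosyllables": 110,
    "control words (set A)": 13,
    "control words (set B)": 24,
    "control words (set C)": 26,
}
speakers = 150
repetitions = 4
reported_utterances = 193_763  # per channel, as listed

items_per_speaker = sum(items.values())
nominal_utterances = speakers * items_per_speaker * repetitions
shortfall = nominal_utterances - reported_utterances

# CD-ROM counts per component for one channel, from the "Set-of" column.
discs = {
    "City Names": 6,
    "Control Words": 4,
    "Isolated Digits": 1,
    "Four Digit Seq.": 3,
    "Monosyllables": 6,
}
total_discs = sum(discs.values())

print(items_per_speaker, nominal_utterances, shortfall, total_discs)
# prints: 323 193800 37 20
```

The item counts add up to the stated 323 items per speaker, and the component disc counts add up to the stated 20 CD-ROMs per channel.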
*Updates*
There are no updates at this time.
- isPartOf: C-001027: JEIDA/JCSD-Channel 1 Complete
- hasVersion: C-001026: JEIDA/JCSD-Channel 1 City Names
- hasVersion: C-001029: JEIDA/JCSD-Channel 1 Four Digit Sequences
- hasVersion: C-001030: JEIDA/JCSD-Channel 1 Isolated Digits
- hasVersion: C-001031: JEIDA/JCSD-Channel 1 Mono Syllables
- hasVersion: C-001022: JEIDA/JCSD-Channel 0 Complete
- hasVersion: C-000723: JEIDA/JCSD-Channel 0 City Names
- hasVersion: C-001023: JEIDA/JCSD-Channel 0 Control Words
- hasVersion: C-001024: JEIDA/JCSD-Channel 0 Four Digit Sequences
- hasVersion: C-000726: JEIDA/JCSD-Channel 0 Isolated Digits
- hasVersion: C-001025: JEIDA/JCSD-Channel 0 Mono Syllables
-
C-001029: JEIDA/JCSD-Channel 1 Four Digit Sequences
*Introduction*
The Japanese Electronic Industry Development Association's (JEIDA) Common Speech Data Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of the Institute for Signal and Information Processing at Mississippi State University.
*Data*
This collection consists of high-fidelity recordings of 150 native speakers of Japanese; each speaker produces four repetitions of 323 short prompts, including city names, control words, monosyllabic words, isolated digits and strings of four digits. Each reading session was recorded with two microphones, yielding two channels that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data recorded with a standard dynamic microphone, a Sanken MU-2C. Channel 1 (LDC96S65), which is available separately, contains data recorded simultaneously with a condenser microphone that presumably varied from site to site.
A summary of the size and content of the corpus is given below:

| Property | Value |
|---|---|
| number of speakers | 150 speakers |
| males | 75 |
| females | 75 |
| range of speaker age | 10 yrs. to 70 yrs. |
| number of items per speaker | 323 items |
| isolated digits | 15 |
| four digit sequences | 35 |
| city names | 100 |
| monosyllables | 110 |
| control words (set A) | 13 |
| control words (set B) | 24 |
| control words (set C) | 26 |
| number of repetitions per item | 4 repetitions |
| total number of utterances | 193,763 utterances (per channel) |
| sample frequency | 16 kHz |
| sample type | 16-bit linear |
| number of microphones | 2 (dynamic and condenser) |

For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs; the partitioning of the data files has been done primarily by channel (20 CD-ROMs each for channel 0 and channel 1) and secondarily by category of prompts. These prompts include:

| Description | Number of items |
|---|---|
| Control Words: Banking Services | 13 |
| Control Words: Word Processors | 24 |
| Control Words: Home Electronic Equipment | 26 |
| Digits: Isolated Digits | 15 |
| Digits: Four Digit Sequences | 35 |
| City Names: a phonetically-rich subset of common Japanese city names | 100 |
| Monosyllables: all Japanese monosyllables plus several used to pronounce foreign words | 110 |

JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets. Components of the corpus can also be purchased as outlined below:

| Price | Set-of | Description | Catalog ID |
|---|---|---|---|
| 2000 | 20 | JEIDA/JCSD-Channel 0 (Complete) | LDC96S64 |
| 600 | 6 | JEIDA/JCSD-Channel 0 City Names | LDC96S64-1 |
| 400 | 4 | JEIDA/JCSD-Channel 0 Control Words | LDC96S64-2 |
| 100 | 1 | JEIDA/JCSD-Channel 0 Isolated Digits | LDC96S64-3 |
| 300 | 3 | JEIDA/JCSD-Channel 0 Four Digit Seq. | LDC96S64-4 |
| 600 | 6 | JEIDA/JCSD-Channel 0 Monosyllables | LDC96S64-5 |
| 2000 | 20 | JEIDA/JCSD-Channel 1 (Complete) | LDC96S65 |
| 600 | 6 | JEIDA/JCSD-Channel 1 City Names | LDC96S65-1 |
| 500 | 4 | JEIDA/JCSD-Channel 1 Control Words | LDC96S65-2 |
| 100 | 1 | JEIDA/JCSD-Channel 1 Isolated Digits | LDC96S65-3 |
| 300 | 3 | JEIDA/JCSD-Channel 1 Four Digit Seq. | LDC96S65-4 |
| 600 | 6 | JEIDA/JCSD-Channel 1 Monosyllables | LDC96S65-5 |

*Updates*
There are no updates at this time.
- isPartOf: C-001027: JEIDA/JCSD-Channel 1 Complete
- hasVersion: C-001026: JEIDA/JCSD-Channel 1 City Names
- hasVersion: C-001028: JEIDA/JCSD-Channel 1 Control Words
- hasVersion: C-001030: JEIDA/JCSD-Channel 1 Isolated Digits
- hasVersion: C-001031: JEIDA/JCSD-Channel 1 Mono Syllables
- hasVersion: C-001022: JEIDA/JCSD-Channel 0 Complete
- hasVersion: C-000723: JEIDA/JCSD-Channel 0 City Names
- hasVersion: C-001023: JEIDA/JCSD-Channel 0 Control Words
- hasVersion: C-001024: JEIDA/JCSD-Channel 0 Four Digit Sequences
- hasVersion: C-000726: JEIDA/JCSD-Channel 0 Isolated Digits
- hasVersion: C-001025: JEIDA/JCSD-Channel 0 Mono Syllables
-
C-001030: JEIDA/JCSD-Channel 1 Isolated Digits
*Introduction*
The Japanese Electronic Industry Development Association's (JEIDA) Common Speech Data Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of the Institute for Signal and Information Processing at Mississippi State University.
*Data*
This collection consists of high-fidelity recordings of 150 native speakers of Japanese; each speaker produces four repetitions of 323 short prompts, including city names, control words, monosyllabic words, isolated digits and strings of four digits. Each reading session was recorded with two microphones, yielding two channels that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data recorded with a standard dynamic microphone, a Sanken MU-2C. Channel 1 (LDC96S65), which is available separately, contains data recorded simultaneously with a condenser microphone that presumably varied from site to site.
A summary of the size and content of the corpus is given below:

| Property | Value |
|---|---|
| number of speakers | 150 speakers |
| males | 75 |
| females | 75 |
| range of speaker age | 10 yrs. to 70 yrs. |
| number of items per speaker | 323 items |
| isolated digits | 15 |
| four digit sequences | 35 |
| city names | 100 |
| monosyllables | 110 |
| control words (set A) | 13 |
| control words (set B) | 24 |
| control words (set C) | 26 |
| number of repetitions per item | 4 repetitions |
| total number of utterances | 193,763 utterances (per channel) |
| sample frequency | 16 kHz |
| sample type | 16-bit linear |
| number of microphones | 2 (dynamic and condenser) |

For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs; the partitioning of the data files has been done primarily by channel (20 CD-ROMs each for channel 0 and channel 1) and secondarily by category of prompts. These prompts include:

| Description | Number of items |
|---|---|
| Control Words: Banking Services | 13 |
| Control Words: Word Processors | 24 |
| Control Words: Home Electronic Equipment | 26 |
| Digits: Isolated Digits | 15 |
| Digits: Four Digit Sequences | 35 |
| City Names: a phonetically-rich subset of common Japanese city names | 100 |
| Monosyllables: all Japanese monosyllables plus several used to pronounce foreign words | 110 |

JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets. Components of the corpus can also be purchased as outlined below:

| Price | Set-of | Description | Catalog ID |
|---|---|---|---|
| 2000 | 20 | JEIDA/JCSD-Channel 0 (Complete) | LDC96S64 |
| 600 | 6 | JEIDA/JCSD-Channel 0 City Names | LDC96S64-1 |
| 400 | 4 | JEIDA/JCSD-Channel 0 Control Words | LDC96S64-2 |
| 100 | 1 | JEIDA/JCSD-Channel 0 Isolated Digits | LDC96S64-3 |
| 300 | 3 | JEIDA/JCSD-Channel 0 Four Digit Seq. | LDC96S64-4 |
| 600 | 6 | JEIDA/JCSD-Channel 0 Monosyllables | LDC96S64-5 |
| 2000 | 20 | JEIDA/JCSD-Channel 1 (Complete) | LDC96S65 |
| 600 | 6 | JEIDA/JCSD-Channel 1 City Names | LDC96S65-1 |
| 500 | 4 | JEIDA/JCSD-Channel 1 Control Words | LDC96S65-2 |
| 100 | 1 | JEIDA/JCSD-Channel 1 Isolated Digits | LDC96S65-3 |
| 300 | 3 | JEIDA/JCSD-Channel 1 Four Digit Seq. | LDC96S65-4 |
| 600 | 6 | JEIDA/JCSD-Channel 1 Monosyllables | LDC96S65-5 |

*Updates*
There are no updates at this time.
- isPartOf: C-001027: JEIDA/JCSD-Channel 1 Complete
- hasVersion: C-001026: JEIDA/JCSD-Channel 1 City Names
- hasVersion: C-001028: JEIDA/JCSD-Channel 1 Control Words
- hasVersion: C-001029: JEIDA/JCSD-Channel 1 Four Digit Sequences
- hasVersion: C-001031: JEIDA/JCSD-Channel 1 Mono Syllables
- hasVersion: C-001022: JEIDA/JCSD-Channel 0 Complete
- hasVersion: C-000723: JEIDA/JCSD-Channel 0 City Names
- hasVersion: C-001023: JEIDA/JCSD-Channel 0 Control Words
- hasVersion: C-001024: JEIDA/JCSD-Channel 0 Four Digit Sequences
- hasVersion: C-000726: JEIDA/JCSD-Channel 0 Isolated Digits
- hasVersion: C-001025: JEIDA/JCSD-Channel 0 Mono Syllables
-
C-001031: JEIDA/JCSD-Channel 1 Mono Syllables
*Introduction*
The Japanese Electronic Industry Development Association's (JEIDA) Common Speech Data Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of the Institute for Signal and Information Processing at Mississippi State University.
*Data*
This collection consists of high-fidelity recordings of 150 native speakers of Japanese; each speaker produces four repetitions of 323 short prompts, including city names, control words, monosyllabic words, isolated digits and strings of four digits. Each reading session was recorded with two microphones, yielding two channels that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data recorded with a standard dynamic microphone, a Sanken MU-2C. Channel 1 (LDC96S65), which is available separately, contains data recorded simultaneously with a condenser microphone that presumably varied from site to site.
A summary of the size and content of the corpus is given below:

| Property | Value |
|---|---|
| number of speakers | 150 speakers |
| males | 75 |
| females | 75 |
| range of speaker age | 10 yrs. to 70 yrs. |
| number of items per speaker | 323 items |
| isolated digits | 15 |
| four digit sequences | 35 |
| city names | 100 |
| monosyllables | 110 |
| control words (set A) | 13 |
| control words (set B) | 24 |
| control words (set C) | 26 |
| number of repetitions per item | 4 repetitions |
| total number of utterances | 193,763 utterances (per channel) |
| sample frequency | 16 kHz |
| sample type | 16-bit linear |
| number of microphones | 2 (dynamic and condenser) |

For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs; the partitioning of the data files has been done primarily by channel (20 CD-ROMs each for channel 0 and channel 1) and secondarily by category of prompts. These prompts include:

| Description | Number of items |
|---|---|
| Control Words: Banking Services | 13 |
| Control Words: Word Processors | 24 |
| Control Words: Home Electronic Equipment | 26 |
| Digits: Isolated Digits | 15 |
| Digits: Four Digit Sequences | 35 |
| City Names: a phonetically-rich subset of common Japanese city names | 100 |
| Monosyllables: all Japanese monosyllables plus several used to pronounce foreign words | 110 |

JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets. Components of the corpus can also be purchased as outlined below:

| Price | Set-of | Description | Catalog ID |
|---|---|---|---|
| 2000 | 20 | JEIDA/JCSD-Channel 0 (Complete) | LDC96S64 |
| 600 | 6 | JEIDA/JCSD-Channel 0 City Names | LDC96S64-1 |
| 400 | 4 | JEIDA/JCSD-Channel 0 Control Words | LDC96S64-2 |
| 100 | 1 | JEIDA/JCSD-Channel 0 Isolated Digits | LDC96S64-3 |
| 300 | 3 | JEIDA/JCSD-Channel 0 Four Digit Seq. | LDC96S64-4 |
| 600 | 6 | JEIDA/JCSD-Channel 0 Monosyllables | LDC96S64-5 |
| 2000 | 20 | JEIDA/JCSD-Channel 1 (Complete) | LDC96S65 |
| 600 | 6 | JEIDA/JCSD-Channel 1 City Names | LDC96S65-1 |
| 500 | 4 | JEIDA/JCSD-Channel 1 Control Words | LDC96S65-2 |
| 100 | 1 | JEIDA/JCSD-Channel 1 Isolated Digits | LDC96S65-3 |
| 300 | 3 | JEIDA/JCSD-Channel 1 Four Digit Seq. | LDC96S65-4 |
| 600 | 6 | JEIDA/JCSD-Channel 1 Monosyllables | LDC96S65-5 |

*Updates*
There are no updates at this time.
- isPartOf: C-001027: JEIDA/JCSD-Channel 1 Complete
- hasVersion: C-001026: JEIDA/JCSD-Channel 1 City Names
- hasVersion: C-001028: JEIDA/JCSD-Channel 1 Control Words
- hasVersion: C-001029: JEIDA/JCSD-Channel 1 Four Digit Sequences
- hasVersion: C-001030: JEIDA/JCSD-Channel 1 Isolated Digits
- hasVersion: C-001031: JEIDA/JCSD-Channel 1 Mono Syllables
- hasVersion: C-001022: JEIDA/JCSD-Channel 0 Complete
- hasVersion: C-000723: JEIDA/JCSD-Channel 0 City Names
- hasVersion: C-001023: JEIDA/JCSD-Channel 0 Control Words
- hasVersion: C-001024: JEIDA/JCSD-Channel 0 Four Digit Sequences
- hasVersion: C-000726: JEIDA/JCSD-Channel 0 Isolated Digits
- hasVersion: C-001025: JEIDA/JCSD-Channel 0 Mono Syllables
-
C-001032: JURIS
The text data contained on this two-CD-ROM set represent a release of the JURIS (Justice Department Retrieval and Inquiry System) data collection that has been made available to the Linguistic Data Consortium (LDC) by the U.S. Department of Justice. The time span of the text ranges from the 1700s to the early 1990s.
-
C-001033: Japanese Business News Text Supplement
This corpus consists of newswire text from Nihon Keizai Shimbun, Inc. (NIKKEI), the largest Japanese daily financial newspaper, and Telerate, Inc. (formerly known as Dow Jones/Kyodo News Service), published primarily for managers of Japanese-owned corporations or Japanese employees working in North American financial institutions.
-
C-001034: Japanese Business News Text
The Linguistic Data Consortium announces the availability of a Japanese language text corpus composed of business and financial news from two sources.
-
C-001035: KING Speaker Verification
The KING corpus was collected at ITT in 1987 under a US government research contract and, although other contractors have received it, it has not been officially available for public use before now. The version now available from LDC, referred to as KING-92, is based on a 1992 reprocessing of the original recordings (see below).
It contains recorded speech from 51 male speakers in two versions, which differ in channel characteristics: one from a telephone handset and one from a high-quality microphone. The speakers are further subdivided into two groups, 25 in one and 26 in the other, who were recorded at different locations. For each speaker and channel there are ten files, corresponding to sessions of about 30 to 60 seconds' duration each. The interval between sessions varies from a week to a month. The transcripts contain about 54k word tokens (4.8k types).
KING is designed principally for closed-set experiments in text-independent speaker identification or verification over toll-quality telephone lines, although the single-sided collection format does not permit simulation of real telephone traffic. The ten sessions allow for a variety of divisions into training and test data, with the possibility of multiple test sets. For example, one could examine the effects of the amount of training on performance, or examine the variability of performance over several test samples (sessions) given a fixed amount of training (but see below about the "Great Divide").
The collection method used in KING was to establish a call from a laboratory location at ITT (either San Diego, CA or Nutley, NJ) over long distance lines and back to another phone at the same location. The phones used by the test subjects were equipped with an additional microphone, so two parallel recordings were made of that side of the conversation, while the interlocutor's side was not recorded. The two parties either spoke spontaneously or carried out a variety of tasks designed to elicit natural-sounding speech: interpreting a drawing, solving a problem, describing a picture, etc.
There were 25 speakers in Nutley and 26 in San Diego. Speech-to-noise ratios average about 10 dB worse for the Nutley telephone data than for San Diego; in fact the ratio is below 20 dB for over half the Nutley files. Users of this corpus therefore usually run separate experiments, or at least report results separately, according to site. A more subtle difference in the recordings, however, sometimes referred to as the "Great Divide," cuts across the telephone data for the San Diego speakers. This was apparently due to a minor equipment change made during the collection; it results in a slight but consistent change in the average long-term spectrum of the telephone data recorded after the fifth session. Training and testing on data from the same side of this divide gives significantly better results than training and testing across it. Since the discovery of this difference, investigators generally report results on the first and last five sessions of the San Diego telephone KING data separately, or they report within vs. across this boundary. A detailed description of the spectral differences can be found in a report by Thomas Crystal and Ned Neuburg which accompanies the CD-ROM version.
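The within vs. across reporting convention can be sketched in a few lines of Python. This is a minimal illustration, assuming sessions are numbered 1 to 10 and the divide falls between sessions 5 and 6 as described for the San Diego telephone data; the function names are hypothetical and not part of any KING tooling.

```python
from itertools import permutations

DIVIDE_AFTER = 5  # the spectral change appears after the fifth session


def side(session: int) -> str:
    """Return which side of the divide a session number falls on."""
    return "early" if session <= DIVIDE_AFTER else "late"


def label_pairs(sessions=range(1, 11)):
    """Label every ordered (train, test) session pair as within or across the divide."""
    labels = {}
    for train, test in permutations(sessions, 2):
        labels[(train, test)] = "within" if side(train) == side(test) else "across"
    return labels


pairs = label_pairs()
within = sum(1 for v in pairs.values() if v == "within")
across = sum(1 for v in pairs.values() if v == "across")
print(within, across)  # prints: 40 50
```

Of the 90 ordered train/test session pairs, 40 stay on one side of the divide and 50 cross it, so results pooled over all pairs would mix the two conditions.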
Because a number of published papers report results based on the original KING corpus, and two versions of the data now exist, note that the new CD-ROM version, called KING-92, is based on a 1992 re-issue of the data from ITT. It differs from the original corpus in a few details:
* The original data was sampled at 10 kHz, but has now been resampled at 8 kHz;
* Missing segments, most on the order of seconds, have been restored to the data and the alignment between the high quality microphone and the telephone handset data files has been corrected;
* Originally both an orthographic and a phonetic transcription of the data, with time alignments, were part of the corpus, but there were numerous errors; only an unaligned orthographic transcription has been retained;
* Documentation has been changed to reflect these differences and a description of the artifactual division between sessions 1-5 and 6-10 in the San Diego telephone data is included.
- replaces: KING-92
-
C-001036: Klex: Finite-State Lexical Transducer for Korean
Klex is a finite-state lexical transducer for the Korean language, with the lexical string on the upper side and the inflected surface string on the lower side.
- references: C-001061: Morphologically Annotated Korean Text
- references: Korean Treebank POS annotation standards
-
C-001037: Korean Broadcast News Speech
*Introduction*
This data set consists of 18 audio files recorded by LDC in January 2000 and February 2000 from Voice of America (VOA) satellite radio news broadcasts in Korean.
*Data*
The recordings, captured from a dedicated satellite receiver, are stored as 16-bit PCM, 16-kHz, single-channel, in NIST SPHERE format. The duration of each recording is either 30 minutes or 60 minutes, depending on the VOA broadcast schedule. The date (YYYYMMDD), start-time and end-time (HHMM, Eastern Standard Time) for each recording are indicated in its file name. The sample data is not compressed.
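For a sense of scale, the uncompressed audio payload of a recording follows directly from the stated format. The sketch below is a rough calculation only: it ignores the SPHERE header, whose size is not given here, and the constant and function names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000  # 16-kHz sampling
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHANNELS = 1             # single-channel


def payload_bytes(minutes: int) -> int:
    """Size in bytes of the raw sample data for a recording of the given length."""
    return minutes * 60 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


for minutes in (30, 60):
    print(minutes, payload_bytes(minutes))
# prints:
# 30 57600000
# 60 115200000
```

A 30-minute broadcast therefore holds about 57.6 MB of sample data and a 60-minute broadcast about 115.2 MB.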
Transcripts for these recordings are available as a separate corpus from the LDC: Korean Broadcast News Transcripts, LDC2006T14.
*Samples*
For an example of the data contained in this corpus, please listen to this audio sample (wav format).
- hasFormat: C-001038: Korean Broadcast News Transcripts