  • C-001038: Korean Broadcast News Transcripts
    This data set consists of 18 text files containing transcripts prepared by the LDC for Voice of America satellite radio news broadcasts in Korean. The broadcasts were recorded by the LDC at transmission time during a two-week period between January 21, 2000 and February 7, 2000.
  • C-001039: Korean English Treebank Annotations
    This corpus consists of 33 texts originally written in Korean and translated into English for the purpose of language training in a military setting. The conversations are not authentic dialogues but were constructed for pedagogical purposes.
  • C-001040: Korean Newswire
    This corpus is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000.
  • C-001041: Korean Propbank
    Korean Propbank Annotations is a semantic annotation of the Korean English Treebank Annotations and Korean Treebank Version 2.0.
  • C-001042: Korean Telephone Conversations Complete Set
    *Introduction*

    Korean Telephone Conversations Complete Set was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003P01, ISBN 1-58563-267-8.

    The complete set of Korean Telephone Conversations consists of the following:

    * Korean Telephone Conversations Speech
    * Korean Telephone Conversations Transcripts
    * Korean Telephone Conversations Lexicon
    The Korean telephone conversations were originally recorded as part of the CALLFRIEND project. The CALLFRIEND Korean telephone speech was collected by the Linguistic Data Consortium primarily in support of the Language Identification (LID) project, sponsored by the U.S. Department of Defense. The calls were later transcribed for use in other projects.

    Korean Telephone Conversations Speech consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unexposed calls.

    The recorded conversations are between native speakers of Korean and last up to 30 minutes, of which the transcribed speech covers between 15 and 18 minutes. All speakers were aware that they were being recorded. They were given no guidelines concerning what they should talk about. Once a caller was recruited to participate, he/she was given a free choice of whom to call. Most participants called family members or close friends. All calls originated in either the United States or Canada.

    Korean Telephone Conversations Transcripts consists of 100 text files, totalling approximately 190K words and 25K unique words.

    All files are in Korean orthography: the Korean characters are in Hangul, encoded in the KSC5601 (Wansung) system.
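
    Files in this encoding usually need to be decoded before use in Unicode-based tools. The following is a minimal sketch, assuming the transcripts are stored in the EUC-KR byte form commonly used for KS C 5601 (Wansung) text; the sample string is invented for illustration and is not taken from the corpus.

```python
# Minimal sketch: decode KS C 5601 (Wansung) text with Python's built-in
# "euc_kr" codec. ASSUMPTION: the corpus files use the EUC-KR byte form.

# Invented sample text, not taken from the transcripts.
sample_text = "안녕하세요 여보세요"

# Simulate the bytes as they would be stored on disk.
raw_bytes = sample_text.encode("euc_kr")


def decode_transcript(data: bytes) -> str:
    """Decode EUC-KR (KS C 5601 Wansung) bytes into a Python string."""
    return data.decode("euc_kr")


decoded = decode_transcript(raw_bytes)

# Re-encode as UTF-8 for tools that expect Unicode input.
utf8_bytes = decoded.encode("utf-8")

print(decoded)
print(len(raw_bytes), "bytes in EUC-KR,", len(utf8_bytes), "bytes in UTF-8")
```

    For a real file, the same codec name can be passed to `open(path, encoding="euc_kr")`, where `path` is whatever location the transcript was copied to.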

    Korean Telephone Conversations Lexicon covers the tokens occurring in the Korean Telephone Conversations Transcripts.

    The lexicon contains five tab-separated information fields:

    * orthographic form in Hangul (head-word), encoded in the KSC-5601 (Wansung) system
    * orthographic form in Yale romanization
    * pronunciation
    * frequency of the word in Korean Telephone Conversations Transcripts
    * morphological analysis of the word
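
    The five-field layout described above can be read with a few lines of code. This is a sketch only: the field names, the sample line, and the notation used in its pronunciation and morphology fields are hypothetical, chosen to mirror the field order listed above, and the input is assumed to be already decoded to a Python string.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line; field names are hypothetical labels."""
    hangul: str          # orthographic form in Hangul (head-word)
    yale: str            # orthographic form in Yale romanization
    pronunciation: str   # pronunciation field, kept as an opaque string
    frequency: int       # frequency in the transcripts
    morphology: str      # morphological analysis, kept as an opaque string


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Split one tab-separated lexicon line into its five fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    hangul, yale, pronunciation, frequency, morphology = fields
    return LexiconEntry(hangul, yale, pronunciation, int(frequency), morphology)


# Invented example line; the real lexicon's notation may differ.
sample_line = "학교\thakkyo\thak.kyo\t12\t학교/NNC\n"
entry = parse_lexicon_line(sample_line)
print(entry.hangul, entry.yale, entry.frequency)
```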
  • C-001044: Korean Telephone Conversations Speech
    *Introduction*

    Korean Telephone Conversations Speech was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003S03, ISBN 1-58563-263-5.

    The telephone conversations in this corpus were originally recorded as part of the CALLFRIEND project. The CALLFRIEND Korean telephone speech was collected by the Linguistic Data Consortium primarily in support of the Language Identification (LID) project, sponsored by the U.S. Department of Defense. The calls were later transcribed for use in other projects.

    This publication consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unexposed calls.

    All 100 conversations have been transcribed and are published as Korean Telephone Conversations Transcripts.

    The recorded conversations are between native speakers of Korean and last up to 30 minutes, of which the transcribed speech covers between 15 and 18 minutes. All speakers were aware that they were being recorded. They were given no guidelines concerning what they should talk about. Once a caller was recruited to participate, he/she was given a free choice of whom to call. Most participants called family members or close friends. All calls originated in either the United States or Canada.

    *Data*

    There are 100 speech files, totalling approximately 44 hours of audio. All speech files are in SPHERE format (shorten-compressed), recorded as two-channel u-law with a sampling rate of 8 kHz.
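
    As a rough sanity check on these figures, the average call length and the uncompressed audio size can be estimated as follows. This is a back-of-the-envelope sketch: it assumes the approximate 44-hour total is conversation time (not counted per channel) and that u-law stores one byte per sample; the actual shorten-compressed files on the distribution will be smaller.

```python
# Figures taken from the description above (44 hours is approximate).
NUM_FILES = 100
TOTAL_HOURS = 44
SAMPLE_RATE_HZ = 8000
CHANNELS = 2
BYTES_PER_SAMPLE = 1  # ASSUMPTION: u-law uses 8 bits per sample

# Average duration of one conversation, in minutes.
avg_minutes_per_file = TOTAL_HOURS * 60 / NUM_FILES

# Uncompressed data rate and total size before shorten compression.
bytes_per_second = SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE
total_bytes = TOTAL_HOURS * 3600 * bytes_per_second
total_gb = total_bytes / 1e9

print(f"average call length: {avg_minutes_per_file:.1f} minutes")
print(f"uncompressed rate:   {bytes_per_second} bytes/s")
print(f"uncompressed total:  {total_gb:.2f} GB")
```

    The estimated average of about 26 minutes per file is consistent with the statement that conversations last up to 30 minutes.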

    *Updates*

    There are no updates available at this time.
  • C-001045: Korean Telephone Conversations Transcripts
    The telephone conversations on which these transcripts are based were originally recorded as part of the CALLFRIEND project. The CALLFRIEND Korean telephone speech was collected by the Linguistic Data Consortium primarily in support of the Language Identification (LID) project, sponsored by the U.S. Department of Defense. The calls were later transcribed for use in other projects.
  • C-001046: Korean Treebank Annotations Version 2.0
    The Korean Treebank Annotations Version 2.0 is an extension of the Korean English Treebank Annotations corpus, LDC2002T26 (2002). It is essentially an electronic corpus of Korean texts annotated with morphological and syntactic information.
    • replaces: Korean English Treebank Annotations corpus
  • C-001047: LATINO-40 Spanish Read News
    This database provides a set of recordings for training speaker-independent systems that recognize Latin-American Spanish. It was recorded by the Entropic Research Laboratory from July 11 through September 9, 1994 in Palo Alto, California. The database comprises about 5,000 utterance files. These files include about 125 utterances from each of 40 different speakers, 20 male and 20 female.

    The recordings were all made with a high-quality, head-mounted microphone (Shure SM10A) in an office environment, and the utterances were digitized in 16-bit samples at 16 kHz.

    The Linguistic Data Consortium provided 13,000 sentences that had been selected from Latin American newspaper text by people working at Texas Instruments. The sentences are all shorter than 80 characters and are not grouped into larger constituents such as paragraphs or stories. The speech files have NIST SPHERE headers and are presented in compressed format, using the shorten speech compression algorithm developed by Tony Robinson at Cambridge University, as implemented in the NIST SPHERE software package. This software is included with the data.
    • references: Jared Bernstein, et al. LATINO-40 Spanish Read News. Linguistic Data Consortium, Philadelphia
  • C-001048: LLHDB
    *Introduction*

    The LLHDB corpus consists of recordings of people speaking into ten different telephone handsets. The aim was to create a corpus for the study of telephone transducer effects on speech that minimized confounding factors, such as variable telephone channels and background noise. LLHDB was created by having volunteers speak prompted and extemporaneous speech into different transducers in a sound-proof room and directly digitizing the output from the transducers on a SunSparc A/D at an 8 kHz sampling rate and 16-bit resolution.

    *Data*

    There were three types of speech recorded for each handset. First, the speaker read the "rainbow passage" [Nolan 83], a 97-word passage sometimes used in phonetic research. Second, the speaker read ten sentences extracted from the TIMIT corpus. Finally, the speaker was asked to describe a photograph for approximately 40 seconds (a different photograph was used for each handset). LLHDB contains speech from 53 speakers (24 males and 29 females) recruited from the laboratory.

    Because the same handsets are used in both HTIMIT and LLHDB, it is possible to compare the effects of the two different recording methods.

    *Updates*

    Relative to the original CD-ROMs produced in 1998 by the Linguistic Data Consortium, the extension of the audio files was changed from ".wav" to ".sph."