  • C-000571: 1997 HUB5 Spanish Evaluation
    *Introduction*

    The 1997 HUB5 Spanish Evaluation was produced by the Linguistic Data Consortium (LDC), catalog number LDC2002S25, ISBN 1-58563-235-x.

    The 1997 HUB5 Non-English Evaluation is part of an ongoing series of periodic evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of conversational speech recognition. To this end the evaluation was designed to be simple, to focus on core speech technology issues, to be fully supported, and to be accessible.

    The HUB5 Non-English Evaluation focuses on the task of transcribing conversational speech into text. This task is posed in the context of conversational telephone speech. The evaluation is designed to foster research progress, with the goals of:

    * exploring promising new ideas in the recognition of conversational speech
    * developing advanced technology incorporating these ideas
    * measuring the performance of this technology

    The task is to transcribe conversational speech. The speech to be transcribed is presented as a set of conversations collected over the telephone. Each conversation is represented as a "4-wire" recording, that is, with two distinct sides, one from each end of the telephone circuit. Each side is recorded and stored as a standard telephone codec signal (8 kHz sampling, 8-bit mu-law encoding).

    Additional documentation is available on the NIST website.

    *Data*

    This publication contains 20 SPHERE files encoded in two-channel interleaved mu-law with a sampling rate of 8 kHz, for a total of 447,201,280 bytes (426 Mbytes) or seven hours of SPHERE data.
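
    As an illustration of what "two-channel interleaved mu-law" means in practice, the following is a minimal sketch, not part of the corpus tooling. It assumes the sample payload (the bytes that follow the SPHERE header) alternates one byte per conversation side, and it decodes each byte with the standard G.711 mu-law expansion. The function names `ulaw_to_linear` and `split_channels` are hypothetical.

```python
BIAS = 0x84  # G.711 mu-law bias (132)


def ulaw_to_linear(byte: int) -> int:
    """Decode one 8-bit mu-law byte to a signed linear PCM value."""
    u = ~byte & 0xFF             # mu-law bytes are stored bit-inverted
    sign = u & 0x80              # top bit: sign
    exponent = (u >> 4) & 0x07   # next three bits: segment number
    mantissa = u & 0x0F          # low four bits: step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def split_channels(payload: bytes) -> tuple[list[int], list[int]]:
    """De-interleave a two-channel payload (assumed layout: A, B, A, B, ...)."""
    side_a = [ulaw_to_linear(b) for b in payload[0::2]]
    side_b = [ulaw_to_linear(b) for b in payload[1::2]]
    return side_a, side_b


if __name__ == "__main__":
    # Four payload bytes: two samples for each side of the call.
    a, b = split_channels(bytes([0xFF, 0x00, 0x80, 0x7F]))
    print(a, b)
```

    Each returned list holds the samples for one side of the telephone circuit; the caller is responsible for skipping the SPHERE header before passing the payload in.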

    An included documentation table contains information on the speech segments to be processed as follows:

    ...

    *Updates*

    There are no updates at this time.
  • C-000572: 1997 HUB5 Spanish Transcripts
    *Introduction*

    The 1997 HUB5 Spanish Transcripts corpus was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003T04 and ISBN 1-58563-248-1.

    This publication contains transcripts for 20 CALLHOME Spanish telephone conversations. These 20 conversations were used in NIST's 1997 HUB5 Non-English evaluation, and are published as 1997 HUB5 Spanish Evaluation (LDC2002S25).

    *Data*

    There are 20 data files in .txt format.

    The .txt files are transcript files containing the orthographic forms that were used in the original transcription process. These forms also serve as the head-words in the associated CALLHOME Spanish Lexicon (LDC96L16).

    Please follow this link for a sample transcript.

    *Updates*

    There are no updates at this time.
  • C-000573: 1997 Spanish Broadcast News Speech (HUB4-NE)
    LDC98S74 - Speech data; LDC98T29 - Transcripts

    *Introduction*

    This corpus contains a portion of the acoustic data designated as the training set for the 1997 DARPA HUB4 Spanish Benchmark. It contains speech and transcripts of 30 hours of broadcast news from the following sources: Televisa, Univision and VOA.

    *Data*

    All acoustic files are in NIST SPHERE format, without compression. The sample data are 16-bit linear PCM, 16-kHz sample frequency, single channel. Most files contain 30 minutes of recorded material and some contain approximately 60 or 120 minutes; the sampling format requires roughly two megabytes (MB) per minute of recording, so the file sizes are typically around 60 MB, with some files ranging up to 120 or 240 MB.
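
    The "roughly two megabytes per minute" figure follows directly from the sample format. The sketch below works through that arithmetic; it assumes decimal megabytes (1 MB = 1,000,000 bytes) and ignores the SPHERE header, and the helper name `pcm_bytes` is hypothetical.

```python
def pcm_bytes(minutes: float,
              sample_rate_hz: int = 16_000,
              bytes_per_sample: int = 2,   # 16-bit linear PCM
              channels: int = 1) -> int:
    """Uncompressed PCM payload size in bytes for a recording of given length."""
    return int(minutes * 60 * sample_rate_hz * bytes_per_sample * channels)


if __name__ == "__main__":
    for minutes in (1, 30, 60, 120):
        # Decimal megabytes are an assumption made for this illustration.
        print(f"{minutes:>3} min -> {pcm_bytes(minutes) / 1e6:.1f} MB")
```

    Under these assumptions one minute comes to 1,920,000 bytes, and a 30-minute file to 57.6 MB, which matches the "around 60 MB" typical size quoted above.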

    The transcripts are in SGML format, using the same markup conventions that have been applied to the other 1997 Broadcast News speech corpora (in English and Mandarin). They are distributed by FTP rather than on the CD-ROMs that carry the speech data.

    *Updates*

    There are no updates at this time.

    *Pricing*

    The Reduced Licensing Fee for this corpus is US$400.
  • C-000574: 1997 Spanish Broadcast News Transcripts (HUB4-NE)
    *Introduction*

    This corpus contains a portion of the acoustic data designated as the training set for the 1997 DARPA HUB4 Spanish Benchmark. It contains speech and transcripts of 30 hours of broadcast news from the following sources: Televisa, Univision and VOA.

    Corresponding speech data is released as 1997 Spanish Broadcast News Speech (HUB4-NE) (LDC98S74).

    *Data*

    All acoustic files are in NIST SPHERE format, without compression. The sample data are 16-bit linear PCM, 16-kHz sample frequency, single channel. Most files contain 30 minutes of recorded material, and some contain approximately 60 or 120 minutes; the sampling format requires roughly two megabytes (MB) per minute of recording, so the file sizes are typically around 60 MB, with some files ranging up to 120 or 240 MB.

    The transcripts are in SGML format, using the same markup conventions that have been applied to the other 1997 Broadcast News speech corpora (in English and Mandarin).

    *Samples*

    Please view this SGML sample.

    *Updates*

    There are no updates at this time.
  • C-000575: 1997 Speaker Recognition Benchmark
    *Introduction*

    The 1997 speaker recognition evaluation was part of an ongoing series of yearly evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition. To this end, the evaluation was designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible.

    *Data*

    The technical objectives of the 1997 speaker recognition evaluation were:

    1. Exploring promising new ideas in speaker recognition
    2. Developing advanced technology incorporating these ideas
    3. Measuring the performance of this technology

    The evaluation data was drawn from the Switchboard-2 Phase 1 corpus. Both training and test segments were constructed by concatenating consecutive turns for the desired speaker, similar to what was done in 1996. Each segment is stored as a continuous speech signal in a separate SPHERE file. The speech data is stored in 8-bit mulaw format.

    *Updates*

    There are no updates at this time.
  • C-000576: 1998 HUB4 Broadcast News Evaluation English Test Material
    *Introduction*

    This publication contains the evaluation test material used in the 1998 DARPA/NIST Continuous Speech Recognition Broadcast News HUB4 English Benchmark Test administered by the NIST Spoken Natural Language Processing Group and produced by the Linguistic Data Consortium (LDC), catalog number LDC2000S86, ISBN 1-58563-172-8.

    *Data*

    The test material is contained in two SPHERE-formatted waveform files. The file h4e_98_1.sph (set1) contains 1.5 hours of Broadcast News excerpts from 1996. The file h4e_98_2.sph (set2) contains 1.5 hours of Broadcast News excerpts from 1998. Each file should be separately recognized per the HUB4 English Evaluation Specification.
  • C-000577: 1998 HUB5 English Evaluation
    *Introduction*

    1998 HUB5 English Evaluation was developed by the Linguistic Data Consortium (LDC) and consists of English conversational telephone speech used in the 1998 HUB5 evaluation sponsored by NIST (National Institute of Standards and Technology).

    The Hub5 evaluation series focused on conversational speech over the telephone with the particular task of transcribing conversational speech into text. Its goals were to explore promising new areas in the recognition of conversational speech, to develop advanced technology incorporating those ideas and to measure the performance of new technology. Further information about the evaluation can be found on the NIST HUB5 website and in The 1998 HUB-5E Evaluation Plan for Recognition of Conversational Speech over the Telephone in English, included in this release.

    *Data*

    The source data consists of conversational telephone speech collected by LDC: (1) 20 telephone conversations from Switchboard-2 Phase 1 (LDC98S75), in which recruited speakers were connected through a robot operator to carry on casual conversations about a daily topic announced by the robot operator at the start of the call; and (2) 20 telephone conversations from CALLHOME American English Speech, which consists of unscripted telephone conversations between native English speakers.

    The audio files are two-channel interleaved mu-law in SPHERE format. The SPHERE headers have been modified from the original evaluation data by the addition of sample checksums to the CALLHOME data files.
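
    To show where a field such as a sample checksum lives, here is a minimal sketch of reading a SPHERE header. It assumes the usual layout: an ASCII header that begins with the line `NIST_1A`, then a line giving the header size in bytes, then `name type value` lines terminated by `end_head`. The demo header and its field values are invented for illustration and are not taken from the corpus; `parse_sphere_header` and `make_demo_header` are hypothetical names.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header at the start of a SPHERE file into a dict."""
    if not raw.startswith(b"NIST_1A\n"):
        raise ValueError("not a SPHERE header")
    # The second line holds the total header size, e.g. "   1024".
    header_size = int(raw[8:16].decode("ascii").strip())
    text = raw[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)      # name, type flag, value
        if len(parts) != 3:
            continue                     # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)    # integer field
        elif ftype == "-r":
            fields[name] = float(value)  # real-valued field
        else:
            fields[name] = value         # string field, e.g. "-s4 ulaw"
    return fields


def make_demo_header() -> bytes:
    """Build a synthetic 1024-byte header (values are made up)."""
    body = (
        "NIST_1A\n   1024\n"
        "channel_count -i 2\n"
        "sample_rate -i 8000\n"
        "sample_coding -s4 ulaw\n"
        "sample_checksum -i 4242\n"
        "end_head\n"
    )
    return body.encode("ascii").ljust(1024, b" ")


if __name__ == "__main__":
    print(parse_sphere_header(make_demo_header()))
```

    On a real file, passing the first 1024 bytes read in binary mode is usually enough, though the size line should be trusted over that assumption.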

    Corresponding transcripts are available in 1998 HUB5 English Transcripts (LDC2003T02).

    *Samples*

    Please listen to this audio sample.

    *Updates*

    There are no updates at this time.
  • C-000578: 1998 HUB5 English Transcripts
    *Introduction*

    1998 HUB5 English Transcripts was developed by the Linguistic Data Consortium and consists of transcripts of 40 English telephone conversations used in the 1998 HUB5 evaluation sponsored by NIST (National Institute of Standards and Technology).

    The Hub5 evaluation series focused on conversational speech over the telephone with the particular task of transcribing conversational speech into text. Its goals were to explore promising new areas in the recognition of conversational speech, to develop advanced technology incorporating those ideas and to measure the performance of new technology. Further information about the evaluation can be found on the NIST HUB5 website.

    *Data*

    This release contains transcripts in .txt format for the 40 source speech data files used in the evaluation: (1) 20 telephone conversations from Switchboard-2 Phase 1 (LDC98S75), in which recruited speakers were connected through a robot operator to carry on casual conversations about a daily topic announced by the robot operator at the start of the call; and (2) 20 telephone conversations from CALLHOME American English Speech, which consists of unscripted telephone conversations between native English speakers.

    The corresponding speech data is released as 1998 HUB5 English Evaluation (LDC2002S10).

    *Sample*

    Please follow this link for a sample transcript.

    *Updates*

    There are no updates at this time.
  • C-000579: 1998 Speaker Recognition Benchmark
    *Introduction*

    The 1998 speaker recognition evaluation is part of an ongoing series of yearly benchmark tests conducted by NIST. These tests are intended to provide a stable reference point for measuring and comparing the performance of diverse methods for text-independent speaker recognition over the telephone and should be of interest to all researchers working in this area of speech technology development. The test sets and evaluation protocols have been designed to be simple, to focus on core technology issues, to be fully supported and to be accessible.

    *Data*

    In 1996 and 1997, handset variation was featured as a prominent technical challenge to be addressed. While handset variation remains a formidable challenge, the 1998 evaluation directs greatest attention toward speaker recognition performance for the case in which both training and test data are from the same source. The speech data were recorded by the LDC between January and March 1997; most of the speakers recruited for this collection were college students from the Great Lakes (Northern Midwest) region of the U.S.

    *Updates*

    There are no updates at this time.
  • C-000580: 1999 HUB4 Broadcast News Evaluation English Test Material
    *Introduction*

    This publication contains the English evaluation test material used in the 1999 NIST Broadcast News Transcription Evaluation administered by the NIST Spoken Natural Language Processing Group and produced by the Linguistic Data Consortium, catalog number LDC2000S88, ISBN 1-58563-176-0.

    *Data*

    The test material is contained in two SPHERE-formatted waveform files. The file bn99en_1.sph (set1) contains 1.5 hours of Broadcast News excerpts from last year's set2 epoch. The file bn99en_2.sph (set2) contains 1.5 hours of Broadcast News excerpts from the summer of 1998. Each file should be separately recognized per the Broadcast News English Evaluation Specification.

    Additional test material for each set is also included. Test materials include evaluation map files (bn99en_1.uem), automatically generated segmentation files (bn99en_1.seg), transcripts from the evaluation (bn99en_1.utf) and the utf.dtd used to validate the transcripts, reference STM files (bn99en_1.stm), and transcript orthography mapping files (en981118.glm). For more complete information, see the 1998 HUB4 Website.

    *Updates*

    There are no updates at this time. Note that the waveform and transcript data on this disc are licensed through the Linguistic Data Consortium (LDC) and are subject to usage restrictions. Contact the LDC for license agreement information.

    *Pricing*

    The Reduced Licensing Fee for this corpus is US$150.