Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing items 381 - 390 of 2023.

C-000707: HARD 2004 Text
*Introduction*

The HARD 2004 Text Corpus was produced by the Linguistic Data Consortium (LDC) under catalog number LDC2005T28 and ISBN 1-58563-372-0.

This corpus contains source data for the 2004 TREC HARD (High Accuracy Retrieval from Documents) Evaluation. HARD 2004 was a track within the NIST Text REtrieval Conference (TREC), with the objective of achieving high accuracy retrieval from documents by leveraging additional information about the searcher and/or the search context, through techniques like passage retrieval and the use of targeted interaction with the searcher. The current corpus was previously distributed to HARD Participants as LDC2004E30. The topics and annotations that correspond to this release are distributed as LDC2005T29, HARD 2004 Topics and Annotations. This corpus was created with support from the DARPA TIDES Program and LDC.

*Data*

The corpus comprises eight English newswire and web text sources from January-December 2003. The sources are:

* AFE: Agence France Presse - English
* APE: Associated Press Newswire
* CNE: Central News Agency Taiwan - English
* LAT: Los Angeles Times/Washington Post
* NYT: New York Times
* SLN: Salon.com
* UME: Ummah Press - English
* XIE: Xinhua News Agency - English

Volume of data for each source appears in the table below:

Source    Stories    Total Tokens    Avg. Tokens/Story
----------------------------------------------------------
AFE:      226,515      71,829,978      317
APE:      237,067      93,294,584      393
CNE:        3,674         797,194      217
LAT:       18,287      12,576,721      687
NYT:       28,190      16,673,028      591
SLN:        3,321       4,710,500    1,418
UME:        2,607         782,064      299
XIE:      117,854      24,016,670      203
Total:    637,515     224,680,739

Files are organized by source on a daily basis. Each file contains multiple documents identified by unique document IDs, in the form "SRCyyyymmdd.nnnn", where 'nnnn' is a sequential number starting from "0001" for each source/day. In addition, each document has some or all of the following components:

* Keyword (optional), surrounded by tags
* Date/time (optional), surrounded by tags
* Headline, surrounded by tags
* Main part, surrounded by tags.

Tags are used within this part to identify paragraph boundaries.
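
The document ID convention described above can be checked mechanically. The following is a minimal Python sketch, assuming that "SRC" stands for one of the eight three-letter source codes listed for this corpus. The function name `parse_doc_id` and the example ID are hypothetical and are not taken from the corpus documentation.

```python
import re
from datetime import date

# The eight source codes listed for this corpus.
SOURCES = {"AFE", "APE", "CNE", "LAT", "NYT", "SLN", "UME", "XIE"}

# SRCyyyymmdd.nnnn: source code, date, four-digit sequence number.
DOC_ID_PATTERN = re.compile(r"([A-Z]{3})(\d{4})(\d{2})(\d{2})\.(\d{4})")


def parse_doc_id(doc_id):
    """Split a document ID into (source, date, sequence number)."""
    match = DOC_ID_PATTERN.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"not in SRCyyyymmdd.nnnn form: {doc_id!r}")
    source, year, month, day, seq = match.groups()
    if source not in SOURCES:
        raise ValueError(f"unknown source code: {source!r}")
    if int(seq) < 1:
        raise ValueError("sequence numbers start at 0001")
    # date() raises ValueError for impossible dates such as month 13.
    return source, date(int(year), int(month), int(day)), int(seq)


# Hypothetical ID, for illustration only.
print(parse_doc_id("NYT20030115.0001"))
```

Returning a `date` object rather than a string makes it straightforward to group documents by day, which mirrors how the files are organized.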

For more information please visit the HARD Project website.

*Samples*

For an example of the data in this corpus, please review this sample.
- references: Junbo Kong, et al. 2005. HARD 2004 Text. Linguistic Data Consortium, Philadelphia.
C-000708: HARD 2004 Topics and Annotations
*Introduction*

HARD 2004 Topics and Annotations contains topics and annotations (clarification forms, responses and relevance assessments) for the 2004 TREC HARD (High Accuracy Retrieval from Documents) Evaluation. HARD 2004 was a track within the NIST Text REtrieval Conference (TREC), with the objective of achieving high accuracy retrieval from documents by leveraging additional information about the searcher and/or the search context, through techniques like passage retrieval and the use of targeted interaction with the searcher.

The current corpus was previously distributed to HARD Participants as LDC2004E42 and LDC2005E17. The source data that corresponds to this release is distributed as LDC2005T28, HARD 2004 Text. This corpus was created with support from the DARPA TIDES Program and LDC.

*Data*

Three major annotation tasks are represented in this release: Topic Creation, Clarification Form Responses, and Relevance Assessment. Topics include a short title, a query plus context, and a number of limiting parameters known as metadata, which include the targeted geographical region, the target data domain or genre, and the level of searcher expertise. Clarification Forms are brief HTML questionnaires that system developers submitted to LDC searchers to glean additional information about information needs directly from the topic creators. Relevance assessment consisted of adjudication of pooled system responses and included document-level judgments for all topics and passage-level relevance judgments for a subset of topics.

The release is divided into training and evaluation resources. The training set comprises twenty-one topics and 100 document-level relevance judgments per topic. The evaluation set contains fifty topics, clarification forms and responses, document-level relevance assessments for all topics and passage-level judgments for half of the topics. HARD participants received the reference data in stages over the course of the evaluation cycle: (0) training topics, (1) evaluation topic descriptions without metadata, (2) clarification form responses, (3) topic descriptions with metadata, and (4) relevance assessments.

*Samples*

For an example of the data in this publication, please review the following samples:

* Topic
* CF
* Result
- references: Stephanie Strassel and Meghan Glenn. 2005. HARD 2004 Topics and Annotations. Linguistic Data Consortium, Philadelphia.
C-000709: HCRC Map Task Corpus
Originally published as a set of eight CD-ROMs, the Map Task Corpus is now delivered as a web download. The contents of each disc reside in separate directories with the same structure as the original set. The Map Task Corpus contains a total of about 18 hours of spontaneous speech that was recorded from 128 two-person conversations, involving 64 different speakers (32 female, 32 male, all adults, each taking part in four conversations). The 64 speakers were all students at the University of Glasgow, 61 of them being native Scots. The conversations were carried out in an experimental setting in which each participant had a schematic map in front of them, not visible to the other. Each map comprises an outline and roughly a dozen labelled features (e.g. a white cottage, an oak forest, Green Bay). Most features are common to the two maps, but not all. One map has a route drawn in, the other does not. The task is for the participant without the route to draw one on the basis of discussion with the participant with the route. In addition to the conversations, each speaker provides a wordlist reading, consisting of the major vocabulary items contained in the conversations.

The experimental design allows a number of different phonemic, syntactico-semantic and pragmatic contrasts to be explored in a controlled way. In particular, maps and feature names were designed to allow for controlled exploration of phonological reductions of various kinds in a number of different referential contexts and to provide, via varying patterns of matches and mismatches between the two maps, a range of different stimuli for referent negotiation. The conditions of the conversations were also carefully balanced: in half of them the talkers were strangers and in half friends; in half of them the talkers could see each other's faces and in half they could not.

The waveform data are provided in raw (headerless) files (16-bit samples, 20 kHz sample rate, two channels per conversation) and alternative header files are provided for use with software based on either the NIST SPHERE header structure or the European SAM header structure. Text transcriptions are provided for each conversation, along with PostScript files of the map images used in the experiments. Additional materials include full documentation of the experimental design and data collection protocol, resources for using SGML tools on the transcriptions and other text materials and an extensive set of source code for performing basic signal processing functions on the waveform data, such as down-sampling, de-multiplexing, channel summation and D/A conversion for Sun workstations (including playback of segments selected via inspection of transcripts in Emacs).
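
As a rough illustration of the waveform layout described above, the Python sketch below splits a raw two-channel buffer into its channels and computes its duration. It assumes that the two channels are stored as interleaved sample pairs and that the byte order may need swapping on the reading machine. Neither detail is stated here, so the corpus documentation should be consulted before relying on it. The function names are hypothetical.

```python
import array

SAMPLE_RATE_HZ = 20_000   # 20 kHz, as stated above
BYTES_PER_SAMPLE = 2      # 16-bit samples
NUM_CHANNELS = 2          # two channels per conversation


def demultiplex(raw, swap_bytes=False):
    """Split headerless 16-bit audio into two per-channel arrays.

    Assumption: samples are interleaved (ch0, ch1, ch0, ch1, ...).
    Set swap_bytes=True if the file's byte order differs from the
    byte order of the machine running this code.
    """
    samples = array.array("h")  # signed short integers
    if samples.itemsize != BYTES_PER_SAMPLE:
        raise RuntimeError("platform 'h' type is not 16 bits wide")
    samples.frombytes(raw)
    if swap_bytes:
        samples.byteswap()
    return samples[0::NUM_CHANNELS], samples[1::NUM_CHANNELS]


def duration_seconds(raw):
    """Duration of a raw two-channel buffer in seconds."""
    return len(raw) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * NUM_CHANNELS)
```

At these settings one second of conversation occupies 80,000 bytes, so the file size alone gives the recording length.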
- references: 1993. HCRC Map Task Corpus. Linguistic Data Consortium, Philadelphia.
C-000710: HKUST Mandarin Telephone Speech, Part 1
*Introduction*

In 2004, the Hong Kong University of Science and Technology (HKUST) was contracted to collect and transcribe 200 hours of Mandarin Chinese conversational telephone speech from Mandarin speakers in mainland China under the DARPA EARS framework. The first 50 hours of speech and transcripts were released in June 2004 to the EARS community for the RT-04 NIST evaluation. NIST partitioned the remaining 150 hours of collection into training, development and evaluation sets. This release contains the training and development sets with 873 and 24 calls, respectively.

*Data Collection*

Subject recruitment was done in several cities across mainland China. Most subjects did not previously know each other. To encourage more meaningful conversation, topics similar to those in Fisher English were designed. All calls were operator-assisted: an operator would call two participants as scheduled to initiate a call. Subjects were asked demographic questions before they were bridged for normal conversation. Their answers to the demographic questions were recorded in separate files.

Subjects were allowed to talk up to 10 minutes. With a few exceptions, most calls are of the maximum length. Although subjects were allowed to make up to three calls, all subjects made just one call in this release with one exception, where PIN 10683 and PIN 10686 belong to a single individual.

Each side of a call was recorded in a separate .wav file as 8-bit A-law encoded samples at 8 kHz. The two sides were later multiplexed in SPHERE format with the A-law encoding preserved. Where one side was shorter than the other, the shorter side was padded with silence. In the release, the file name of each recorded call has the format date_time_Apin_Bpin.sph, and the corresponding transcript has the same name with a .txt extension.
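
The file-naming convention above pairs each audio file with its transcript. A minimal Python sketch follows, assuming that the "A" and "B" in the pattern are literal prefixes before each side's PIN and that the date and time fields contain no underscores. The function name and the example file name are hypothetical.

```python
from pathlib import PurePosixPath


def parse_call_name(filename):
    """Split a date_time_Apin_Bpin file name into its fields."""
    path = PurePosixPath(filename)
    if path.suffix not in (".sph", ".txt"):
        raise ValueError(f"unexpected extension: {path.suffix!r}")
    parts = path.stem.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected four fields: {filename!r}")
    call_date, call_time, side_a, side_b = parts
    # Assumption: the PIN fields carry literal "A" and "B" prefixes.
    if not (side_a.startswith("A") and side_b.startswith("B")):
        raise ValueError(f"missing A/B side prefixes: {filename!r}")
    return {
        "date": call_date,   # kept as opaque strings; the exact
        "time": call_time,   # date/time layout is not specified here
        "pin_a": side_a[1:],
        "pin_b": side_b[1:],
        "audio": path.with_suffix(".sph").name,
        "transcript": path.with_suffix(".txt").name,
    }


# Hypothetical file name, for illustration only.
info = parse_call_name("20040503_222707_A10683_B10690.sph")
print(info["transcript"])
```

Keeping the PINs as strings preserves any leading zeros, which matters when matching them against the demographic table.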

*Speaker demographics*

Subjects were asked to provide several pieces of demographic information, including gender, age, native language/dialect, birthplace, education, occupation, phone type, etc. Standard Mandarin is not the native dialect in many regions of China but is the official language of education, and speakers may or may not have regional accents when speaking Mandarin. It was therefore decided to divide subjects' birthplaces into Mandarin-dominant and non-Mandarin-dominant regions and to audit all calls and classify them as standard or accented, without further distinctions.

Selected demographics - age, gender, birthplace, phone type and accent for each side of the call and the topic ID for the call - are provided as a tab-delimited, plain-text, tabular file.

*Samples*

To review an example of this corpus, please examine the wav or mp3 audio samples.
- references: Pascale Fung, Shudong Huang, and David Graff. 2005. HKUST Mandarin Telephone Speech, Part 1. Linguistic Data Consortium, Philadelphia.
- hasVersion: C-000711: HKUST Mandarin Telephone Transcript Data, Part 1
C-000711: HKUST Mandarin Telephone Transcript Data, Part 1
*Introduction*

This file contains documentation on the HKUST Mandarin Telephone Transcripts, Part 1, Linguistic Data Consortium (LDC) catalog number LDC2005T32 and ISBN 1-58563-352-6.

In 2004, the Hong Kong University of Science and Technology (HKUST) was contracted to collect and transcribe 200 hours of Mandarin Chinese conversational telephone speech from Mandarin speakers in mainland China under the DARPA EARS framework. The first 50 hours of speech and transcripts were released in June 2004 to the EARS community for the RT-04 NIST evaluation. NIST partitioned the remaining 150 hours of collection into training, development and evaluation sets. This release contains the training and development sets with 873 and 24 calls, respectively.

Subject recruitment was done in several cities across mainland China. Most subjects did not previously know each other. To encourage more meaningful conversation, topics similar to those in Fisher English were designed. All calls were operator-assisted: an operator would call two participants as scheduled to initiate a call. Subjects were asked demographic questions before they were bridged for normal conversation. Their answers to the demographic questions were recorded in separate files.

Subjects were allowed to talk up to 10 minutes. With a few exceptions, most calls are of the maximum length. Although subjects were allowed to make up to three calls, all subjects made just one call in this release with one exception, where PIN 10683 and PIN 10686 belong to a single individual.

Each side of a call was recorded in a separate .wav file as 8-bit A-law encoded samples at 8 kHz. The two sides were later multiplexed in SPHERE format with the A-law encoding preserved. Where one side was shorter than the other, the shorter side was padded with silence. In the release, the file name of each recorded call has the format "date_time_Apin_Bpin.sph", and the corresponding transcript has the same name with a .txt extension.

*Speaker demographics*

Subjects were asked to provide several pieces of demographic information, including gender, age, native language/dialect, birthplace, education, occupation, phone type, etc. Standard Mandarin is not the native dialect in many regions of China but is the official language of education, and speakers may or may not have regional accents when speaking Mandarin. It was therefore decided to divide subjects' birthplaces into Mandarin-dominant and non-Mandarin-dominant regions and to audit all calls and classify them as standard or accented, without further distinctions.

Selected demographics - age, gender, birthplace, phone type and accent for each side of the call and the topic ID for the call - are provided as a tab-delimited, plain-text, tabular file.

*Transcription*

All calls were fully transcribed from the beginning to the end. Standard simplified Chinese characters, encoded in GBK (CP-936), were used. Speech is segmented at natural boundaries wherever possible and each segment is no more than 10 seconds long. HKUST formulated transcription guidelines based on LDC's RT-03 transcription guidelines. For more information, refer to "trans-guidelines.pdf" included in the release.
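
Because the transcripts are GBK-encoded, they need to be decoded explicitly on systems whose default encoding is UTF-8. The Python sketch below decodes a byte string with the standard `gbk` codec and flags segments longer than the 10-second limit. The `(start, end)` tuples are a hypothetical in-memory representation chosen for illustration, not the file format of the release.

```python
MAX_SEGMENT_SECONDS = 10.0  # segment length limit stated above


def decode_transcript(raw):
    """Decode GBK (CP-936) transcript bytes into a Python string."""
    return raw.decode("gbk")


def overlong_segments(segments, limit=MAX_SEGMENT_SECONDS):
    """Return the (start, end) pairs whose duration exceeds the limit.

    `segments` is an iterable of (start_seconds, end_seconds) pairs,
    a hypothetical representation used only for this example.
    """
    return [(start, end) for start, end in segments if end - start > limit]
```

When reading from disk, passing `encoding="gbk"` to `open()` achieves the same decoding without handling bytes directly.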

The transcripts provided by HKUST were XML-formatted with each side of a call in a separate file. LDC multiplexed the two sides into a single file with turns interleaved in temporal order (based on the initial time stamps), and converted the format into the LDC format. All transcripts were checked against RT-04 formatting standards. The following is a list of RT-04 conventions that are different from those in the transcription guidelines.

* Speaker noise: curly brackets, e.g. {laugh}, instead of angle brackets;
* Foreign language: TEXT instead of TEXT.
The Chinese text is not segmented into words, though there are occasional white spaces within some turns.
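
The multiplexing step described above, in which two per-side transcripts are merged into one file with turns ordered by their initial time stamps, can be sketched as follows. This is only an illustration in Python, not LDC's actual tooling. The `Turn` record and its fields are hypothetical, and each side is assumed to be already sorted by start time.

```python
import heapq
from typing import NamedTuple


class Turn(NamedTuple):
    """One transcribed turn (hypothetical structure)."""
    start: float   # initial time stamp, in seconds
    end: float
    speaker: str   # "A" or "B"
    text: str


def interleave(side_a, side_b):
    """Merge two time-sorted turn lists into one, ordered by start time.

    heapq.merge is stable, so when two turns share a start time the
    turn from side_a comes first.
    """
    return list(heapq.merge(side_a, side_b, key=lambda turn: turn.start))


# Hypothetical turns, for illustration only.
a = [Turn(0.0, 2.1, "A", "..."), Turn(5.0, 7.3, "A", "...")]
b = [Turn(2.2, 4.8, "B", "..."), Turn(7.5, 9.0, "B", "...")]
for turn in interleave(a, b):
    print(turn.start, turn.speaker)
```

Ordering by start time alone means overlapping speech is kept as separate turns rather than being split or combined.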

*Samples*

To see an example of the data in this publication, please examine this text sample.
- references: Pascale Fung, Shudong Huang, and David Graff. 2005. HKUST Mandarin Telephone Transcript Data, Part 1. Linguistic Data Consortium, Philadelphia.
- hasVersion: C-000710: HKUST Mandarin Telephone Speech, Part 1
C-000713: HUB5 Mandarin Telephone Speech Corpus
LDC98S69 - Speech data LDC98T26 - Transcripts

*Introduction*

This release of HUB5 Mandarin training data consists of 42 calls derived from the CALLFRIEND Mandarin Chinese Mainland Dialect (Language ID) collection. The transcribed data is intended as additional training data in support of the project on Large Vocabulary Conversational Speech Recognition (LVCSR), also sponsored by the U.S. Department of Defense. The transcripts cover a contiguous 5-30 minute segment taken from a recorded conversation lasting up to 30 minutes.

*Data*

Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements) and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Access to the robot operator was possible via a unique Personal Identification Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in the project. The participants were made aware that their telephone call would be recorded, as were the call recipients. The call was allowed only if both parties agreed to being recorded. Each caller was allowed to talk up to 30 minutes. Upon successful completion of the call, the caller was paid $20 (in addition to making a free long-distance telephone call). Each caller was allowed to place only one telephone call. They were given no guidelines concerning what they should talk about. Once a caller was recruited to participate, he/she was given a free choice of whom to call. Most participants called family members or close friends. All calls originated in North America and were placed to various locations within North America.

*Updates*

There are no updates at this time.
- hasVersion: C-000714: HUB5 Mandarin Transcripts
C-000714: HUB5 Mandarin Transcripts
LDC98S69 - Speech data LDC98T26 - Transcripts

*Introduction*

This release of HUB5 Mandarin training data consists of 42 calls derived from the CALLFRIEND Mandarin Chinese Mainland Dialect (Language ID) collection. The transcribed data is intended as additional training data in support of the project on Large Vocabulary Conversational Speech Recognition (LVCSR), also sponsored by the U.S. Department of Defense. The transcripts cover a contiguous 5-30 minute segment taken from a recorded conversation lasting up to 30 minutes.

*Data*

Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements) and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Access to the robot operator was possible via a unique Personal Identification Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in the project. The participants were made aware that their telephone call would be recorded, as were the call recipients. The call was allowed only if both parties agreed to being recorded. Each caller was allowed to talk up to 30 minutes. Upon successful completion of the call, the caller was paid $20 (in addition to making a free long-distance telephone call). Each caller was allowed to place only one telephone call. They were given no guidelines concerning what they should talk about. Once a caller was recruited to participate, he/she was given a free choice of whom to call. Most participants called family members or close friends. All calls originated in North America and were placed to various locations within North America.

HUB5 Mandarin speech and transcript data may be obtained by emailing ldc@ldc.upenn.edu.

*Updates*

There are no updates at this time.
- hasVersion: C-000588: 2001 HUB5 Mandarin Evaluation
- references: CALLHOME Mandarin Lexicon
- hasVersion: C-000713: HUB5 Mandarin Telephone Speech Corpus
C-000715: HUB5 Spanish Telephone Speech Corpus
LDC98S70 - Speech data LDC98T27 - Transcripts

*Introduction*

This release of HUB5 Spanish training data consists of 106 calls derived from the CALLFRIEND Spanish (Language ID) collection. The transcripts cover a contiguous 10-30 minute segment taken from a recorded conversation lasting up to 30 minutes. These calls were originally collected by the LDC in support of the project on Language Recognition, sponsored by the U.S. Department of Defense. All of these calls are being designated as additional training data for the project on Large Vocabulary Conversational Speech Recognition (LVCSR) in Spanish.

*Data*

Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements) and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Access to the robot operator was possible via a unique Personal Identification Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in the project.

Once a caller was recruited to participate, he/she was given a free choice of whom to call. Recruits were given no guidelines concerning what they should talk about. Most participants called family members or close friends. All calls originated in North America and were placed to various locations within North America, Puerto Rico or the Dominican Republic. The participants were made aware that their telephone call would be recorded, as were the call recipients. The call was allowed only if both parties agreed to being recorded. Each caller was allowed to talk up to 30 minutes. Upon successful completion of the call, the caller was paid $20 (in addition to making a free long-distance telephone call). Each caller was allowed to place only one telephone call.

HUB5 Spanish speech and transcript data may be obtained by contacting the LDC.

*Updates*

There are no updates at this time.
- hasVersion: C-000716: HUB5 Spanish Transcripts
C-000716: HUB5 Spanish Transcripts
LDC98S70 - Speech data LDC98T27 - Transcripts

*Introduction*

This release of HUB5 Spanish training data consists of 106 calls derived from the CALLFRIEND Spanish (Language ID) collection. The transcripts cover a contiguous 10-30 minute segment taken from a recorded conversation lasting up to 30 minutes. These calls were originally collected by the LDC in support of the project on Language Recognition, sponsored by the U.S. Department of Defense. All of these calls are being designated as additional training data for the project on Large Vocabulary Conversational Speech Recognition (LVCSR) in Spanish.

*Data*

Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements) and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Access to the robot operator was possible via a unique Personal Identification Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in the project.

Once a caller was recruited to participate, he/she was given a free choice of whom to call. Recruits were given no guidelines concerning what they should talk about. Most participants called family members or close friends. All calls originated in North America and were placed to various locations within North America, Puerto Rico or the Dominican Republic. The participants were made aware that their telephone call would be recorded, as were the call recipients. The call was allowed only if both parties agreed to being recorded. Each caller was allowed to talk up to 30 minutes. Upon successful completion of the call, the caller was paid $20 (in addition to making a free long-distance telephone call). Each caller was allowed to place only one telephone call.

HUB5 Spanish speech and transcript data may be obtained by contacting the LDC.

*Updates*

There are no updates at this time.
- hasVersion: C-000715: HUB5 Spanish Telephone Speech Corpus
C-000717: ICSI Meeting Speech
*Introduction*

ICSI Meeting Speech was produced by the Linguistic Data Consortium (LDC) under catalog number LDC2004S02 and ISBN 1-58563-285-6.

The ICSI Meeting corpus is a collection of 75 meetings collected at the International Computer Science Institute in Berkeley during the years 2000-2002. The meetings included are "natural" meetings in the sense that they would have occurred anyway: they are generally regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. In recording meetings of this type, we hoped to capture meeting dynamics and speaking styles that are as natural as possible given that speakers are wearing close-talking microphones and are fully cognizant of the recording process. The speech files range in length from 17 to 103 minutes, but generally run just under an hour each. Word-level orthographic transcriptions are available as ICSI Meeting Transcripts.

*Data*

The collection includes 922 speech files, for a total of approximately 72 hours of meeting room speech. The speech is structured as one subdirectory per meeting, containing wavefiles for each channel (and possibly a .blp file specifying any censored intervals).

The audio was collected at a 48 kHz sample rate and downsampled on the fly to 16 kHz. Audio files for each meeting are provided as separate time-synchronous recordings for each channel, encoded as 16-bit linear (big-endian) wavefiles, shorten-compressed in NIST SPHERE format.

The meetings were simultaneously recorded using close-talking microphones for each speaker (generally head-mounted, but early meetings contain some lapel microphones), as well as six table-top microphones: four high-quality omnidirectional PZM microphones arrayed down the center of the conference table, and two inexpensive microphone elements mounted on a mock PDA. All meetings were recorded in the same instrumented meeting room.

In addition to recording the meetings themselves, the participants were also asked to read digit strings, similar to those found in TIDIGITS, at the start or end of the meeting. This small-vocabulary read-speech component of the recordings -- using the same meeting room, speakers, and microphones -- provides a valuable supplement to the natural conversational data, allowing a factorization of the speech challenges offered by the corpus. For all but a dozen of the meetings included in the corpus, at least some of the participants read digit strings; for the great majority of meetings, all participants did. The digit readings are included as part of the wavefiles for the meeting as a whole and are fully transcribed as part of the associated transcripts.

There are a total of 53 unique speakers in the corpus. Meetings involved anywhere from three to 10 participants, averaging six. The corpus contains a significant proportion of non-native English speakers, varying in fluency from nearly-native to challenging-to-transcribe.

*Sponsorship*

The collection and preparation of this corpus was made possible in large part through funding from DARPA, both through the Communicator project and through a ROAR "seedling," the Swiss IM2 project (National Centre of Competence in Research, sponsored by the Swiss National Science Foundation), and a supplementary award from IBM.

*Updates*

There are no updates available at this time. More information is available at http://www.ICSI.Berkeley.EDU/Speech/mr.
- references: Adam Janin, et al. 2004. ICSI Meeting Speech. Linguistic Data Consortium, Philadelphia.
- hasVersion: C-000718: ICSI Meeting Transcripts
