Language Resource Search - SHACHI: Language Resource Metadata Database


C-000602: ATIS3 Test Data
*Introduction*

This release contains a corpus of speech and natural language data collected under the auspices of the Advanced Research Projects Agency Spoken Language Systems (ARPA-SLS) technology development program. The corpus, which contains data in the Air Travel Information Services (ATIS) domain, was designed by the ARPA-SLS Multi-site ATIS Data COllection Working (MADCOW) group and was collected by five sites at locations across the U.S.:

* BBN Systems & Technologies, Cambridge, MA
* Carnegie Mellon University, Pittsburgh, PA
* MIT Laboratory for Computer Science, Cambridge, MA
* National Institute of Standards and Technology, Gaithersburg, MD
* SRI International, Menlo Park, CA

The corpus is part of the third phase of collection of ATIS data (ATIS3) and comprises the development test (NIST Speech Disc 17-4.2) and evaluation test material (NIST Speech Disc 17-5.1) used in the December 1994 ARPA SLS Benchmark Tests. As in the previous ATIS corpora, the speech contained in this corpus was elicited by presenting subjects with various hypothetical travel planning scenarios to solve. The resulting spontaneous spoken queries were recorded as the subjects interacted with partially or completely automated ATIS systems to solve the scenarios. Note that the ATIS3 training data is available on NIST Speech Discs 17-1.1 - 17-3.1.

*Data*

The recorded speech has been transcribed and annotated with categorizations and canonical reference answers. All of the utterances have been recorded using a close-talking, noise-canceling head-mounted Sennheiser microphone. For some subjects, secondary (noisier) microphone data was recorded simultaneously as well.

This release also contains the ATIS3 46-city/52-airport relational database, a revised Principles of Interpretation document, test implementation and scoring instructions, and other general documentation.

The ATIS3 corpus has been verified, collated, and documented by the National Institute of Standards and Technology (NIST) in cooperation with MADCOW and distributed by the Linguistic Data Consortium (LDC).

*Updates*

None at this time.
- references: Deborah A. Dahl, et al. 1995 ATIS3 Test Data Linguistic Data Consortium, Philadelphia
C-000603: ATIS3 Training Data
The ATIS3 corpus, on three CD-ROMs, includes over 774 scenarios completed by 137 subjects, yielding a total of over 7,300 utterances. All utterances are transcribed, and 2,900 of them have been categorized and annotated with canonical reference answers. The relational database for this dataset includes flight information for 46 cities and 52 airports. Data was collected at BBN, CMU, MIT and SRI using their own ATIS systems, and at NIST using systems provided by BBN and SRI.

Two 1,000-utterance test sets were set aside from the data pooled by the collection sites. The first set was used in a December 1993 ARPA test and is included in ATIS3. The second has been reserved for future testing.

*Samples*

* Audio
* Transcripts
- hasVersion: N-000601: ATIS2
- hasVersion: ATIS0
- hasVersion: C-000602: ATIS3 Test Data
- isReferencedBy: (Online Documentation) http://www.ldc.upenn.edu/Catalog/docs/LDC94S19/
- isReferencedBy: (Online Readme Files) http://www.ldc.upenn.edu/Catalog/readme_files/atis3.readme.html
- isReferencedBy: Deborah A. Dahl, et al. 1994 ATIS3 Training Data Linguistic Data Consortium, Philadelphia
C-000604: Air Traffic Control BOS
* LDC94S14A - Complete ATC0 corpus
* LDC94S14B - ATC0 Logan International
* LDC94S14C - ATC0 Washington National
* LDC94S14D - ATC0 Dallas Fort Worth

*Introduction*

The Air Traffic Control Corpus (ATC0) is an eight-disc set of recorded speech for use in supporting research and development activities in the area of robust speech recognition in domains similar to air traffic control (several speakers, noisy channels, relatively small vocabulary, constrained language, etc.). The audio data on these discs is composed of voice communication traffic between various controllers and pilots.

*Data*

The audio files are 8 kHz, 16-bit linear sampled data, representing continuous monitoring, without squelch or silence elimination, of a single FAA frequency for one to two hours. There are also files which indicate the amplitude of the received AM carrier signal at 10 ms intervals. Full transcripts, including the start and end times of each transmission, are provided for each audio file. Each flight is identified by its flight number.
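As a rough illustration of what these figures imply, the sketch below works out how many audio samples fall within one carrier-amplitude reading and how large the raw sample data of a recording would be. It assumes single-channel audio and ignores any file header; the constant and function names are illustrative and do not come from the corpus documentation.

```python
SAMPLE_RATE_HZ = 8_000     # 8 kHz sampling, as stated in the description
BYTES_PER_SAMPLE = 2       # 16-bit linear samples
CARRIER_INTERVAL_MS = 10   # one carrier-amplitude reading every 10 ms


def samples_per_carrier_reading() -> int:
    """Number of audio samples spanned by one carrier-amplitude reading."""
    return SAMPLE_RATE_HZ * CARRIER_INTERVAL_MS // 1000


def raw_audio_bytes(duration_s: int) -> int:
    """Raw sample bytes for a recording (assumption: one channel, no header)."""
    return duration_s * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE


if __name__ == "__main__":
    print(samples_per_carrier_reading())   # 80
    print(raw_audio_bytes(60 * 60))        # 57600000 bytes for one hour
```

Under these assumptions, a one-hour recording holds about 57.6 MB of sample data and a two-hour recording about twice that.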

ATC0 consists of three subcorpora, one for each airport in which the transmissions were collected -- Dallas Fort Worth (DFW), Logan International (BOS) and Washington National (DCA). The complete set contains approximately 70 hours of controller and pilot transmissions collected via antennas and radio receivers which were located in the vicinity of the respective airports.

Detailed information regarding the collection process and the equipment used can be found on each disc in the file, "atc.doc" in the "doc" directory.

The ATC0 Corpus was collected by Texas Instruments under contract to DARPA. It was produced on CD-ROM by the National Institute of Standards and Technology for distribution by the Linguistic Data Consortium.

*Updates*

Relative to the CD-ROMs produced in 1994 by NIST, the sphere files were renamed with the .sph extension, instead of the .wav extension.
- references: John J. Godfrey 1994 Air Traffic Control BOS Linguistic Data Consortium, Philadelphia
- hasPart: C-000605: Air Traffic Control Complete
- isPartOf: C-000606: Air Traffic Control DCA
- isPartOf: C-000607: Air Traffic Control DFW
C-000605: Air Traffic Control Complete
* LDC94S14A - Complete ATC0 corpus
* LDC94S14B - ATC0 Logan International
* LDC94S14C - ATC0 Washington National
* LDC94S14D - ATC0 Dallas Fort Worth

*Introduction*

The Air Traffic Control Corpus (ATC0) is comprised of recorded speech for use in supporting research and development activities in the area of robust speech recognition in domains similar to air traffic control (several speakers, noisy channels, relatively small vocabulary, constrained language, etc.). The audio data is composed of voice communication traffic between various controllers and pilots.

*Data*

The audio files are 8 kHz, 16-bit linear sampled data, representing continuous monitoring, without squelch or silence elimination, of a single FAA frequency for one to two hours. There are also files which indicate the amplitude of the received AM carrier signal at 10 ms intervals.

Full transcripts, including the start and end times of each transmission, are provided for each audio file. Each flight is identified by its flight number.

ATC0 consists of three subcorpora, one for each airport in which the transmissions were collected -- Dallas Fort Worth (DFW), Logan International (BOS) and Washington National (DCA). The complete set contains approximately 70 hours of controller and pilot transmissions collected via antennas and radio receivers which were located in the vicinity of the respective airports.

Detailed information regarding the collection process and the equipment used can be found in the "atc.doc" files in the "doc" directories.

The ATC0 Corpus was collected by Texas Instruments under contract to DARPA. It was produced by the National Institute of Standards and Technology for distribution by the Linguistic Data Consortium.

*Samples*

For an example of the data in this corpus, please examine the following files. The audio sample is in NIST Sphere format. Users should save this file rather than try to display it in the browser.

* Audio
* Carrier Detect
* Transcripts/

*Updates*

Relative to the CD-ROMs produced in 1994 by NIST, the sphere files were renamed with the .sph extension, instead of the .wav extension.
- references: John J. Godfrey 1994 Air Traffic Control Complete Linguistic Data Consortium, Philadelphia
- isPartOf: C-000604: Air Traffic Control BOS
- isPartOf: C-000606: Air Traffic Control DCA
- isPartOf: C-000607: Air Traffic Control DFW
C-000606: Air Traffic Control DCA
* LDC94S14A - Complete ATC0 corpus
* LDC94S14B - ATC0 Logan International
* LDC94S14C - ATC0 Washington National
* LDC94S14D - ATC0 Dallas Fort Worth

*Introduction*

The Air Traffic Control Corpus (ATC0) is an eight-disc set of recorded speech for use in supporting research and development activities in the area of robust speech recognition in domains similar to air traffic control (several speakers, noisy channels, relatively small vocabulary, constrained language, etc.). The audio data on these discs is composed of voice communication traffic between various controllers and pilots.

*Data*

The audio files are 8 kHz, 16-bit linear sampled data, representing continuous monitoring, without squelch or silence elimination, of a single FAA frequency for one to two hours. There are also files which indicate the amplitude of the received AM carrier signal at 10 ms intervals. Full transcripts, including the start and end times of each transmission, are provided for each audio file. Each flight is identified by its flight number.

ATC0 consists of three subcorpora, one for each airport in which the transmissions were collected -- Dallas Fort Worth (DFW), Logan International (BOS) and Washington National (DCA). The complete set contains approximately 70 hours of controller and pilot transmissions collected via antennas and radio receivers which were located in the vicinity of the respective airports.

Detailed information regarding the collection process and the equipment used can be found on each disc in the file, "atc.doc" in the "doc" directory.

The ATC0 Corpus was collected by Texas Instruments under contract to DARPA. It was produced on CD-ROM by the National Institute of Standards and Technology for distribution by the Linguistic Data Consortium.

*Updates*

Relative to the CD-ROMs produced in 1994 by NIST, the sphere files were renamed with the .sph extension, instead of the .wav extension.
- references: John J. Godfrey 1994 Air Traffic Control DCA Linguistic Data Consortium, Philadelphia
- isPartOf: C-000605: Air Traffic Control Complete
- hasVersion: C-000604: Air Traffic Control BOS
- hasVersion: C-000607: Air Traffic Control DFW
C-000607: Air Traffic Control DFW
* LDC94S14A - Complete ATC0 corpus
* LDC94S14B - ATC0 Logan International
* LDC94S14C - ATC0 Washington National
* LDC94S14D - ATC0 Dallas Fort Worth

*Introduction*

The Air Traffic Control Corpus (ATC0) is an eight-disc set of recorded speech for use in supporting research and development activities in the area of robust speech recognition in domains similar to air traffic control (several speakers, noisy channels, relatively small vocabulary, constrained language, etc.). The audio data on these discs is composed of voice communication traffic between various controllers and pilots.

*Data*

The audio files are 8 kHz, 16-bit linear sampled data, representing continuous monitoring, without squelch or silence elimination, of a single FAA frequency for one to two hours. There are also files which indicate the amplitude of the received AM carrier signal at 10 ms intervals. Full transcripts, including the start and end times of each transmission, are provided for each audio file. Each flight is identified by its flight number.

ATC0 consists of three subcorpora, one for each airport in which the transmissions were collected -- Dallas Fort Worth (DFW), Logan International (BOS) and Washington National (DCA). The complete set contains approximately 70 hours of controller and pilot transmissions collected via antennas and radio receivers which were located in the vicinity of the respective airports.

Detailed information regarding the collection process and the equipment used can be found on each disc in the file, "atc.doc" in the "doc" directory.

The ATC0 Corpus was collected by Texas Instruments under contract to DARPA. It was produced on CD-ROM by the National Institute of Standards and Technology for distribution by the Linguistic Data Consortium.

*Updates*

Relative to the CD-ROMs produced in 1994 by NIST, the sphere files were renamed with the .sph extension, instead of the .wav extension.
- references: John J. Godfrey 1994 Air Traffic Control DFW Linguistic Data Consortium, Philadelphia
- hasPart: C-000605: Air Traffic Control Complete
- isPartOf: C-000604: Air Traffic Control BOS
- isPartOf: C-000606: Air Traffic Control DCA
C-000609: Arabic Broadcast News Transcripts
*Introduction*

Arabic Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of ten hours of transcribed speech from Voice of America satellite radio news broadcasts in Arabic recorded by LDC between June 2000 and January 2001. The corresponding speech files are available in Arabic Broadcast News Speech (LDC2006S46).

This work was undertaken in the Networking Data Centers (NetDC) project (MLIS-5017, NSF III-9982201) in conjunction with the European Language Resources Association (ELRA). ELRA transcribed 22.5 hours of Arabic broadcast data from Radio Orient (France) that is available in NetDC Arabic BNSC (Broadcast News Speech Corpus) (ELRA-S0157). The goal of the NetDC project was to improve the infrastructure for language resources by designing and implementing new modes of cooperation between LDC and ELRA.

*Data*

The character encoding is entirely in ASCII; Buckwalter transliteration is used for rendering the Arabic text content. Time alignment and structural markup are rendered via "pseudo-SGML" tags, which are presented one tag per line, with the first character of the line being an open angle bracket.

The lines of transcription text (i.e. the speech and annotation content between the time-stamp tags) all begin with a single space character, and present exactly one token per line. A "token" may be a spoken Arabic word, a punctuation mark, or a single Arabic word enclosed by "(%" and ")", which represents an annotation of a non-speech condition or event (e.g. "music", "noise", "laugh", etc.).
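The line conventions described above lend themselves to a simple line-by-line reader. The following is a minimal sketch under stated assumptions: the function name, the sample tag, and the sample tokens are invented for illustration, since the actual tag names and attributes are not given here.

```python
from typing import Iterable, Iterator, Tuple


def classify_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (kind, value) pairs, where kind is 'tag', 'annotation', 'token' or 'other'."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("<"):
            # Markup lines start with an open angle bracket, one tag per line.
            yield ("tag", line)
        elif line.startswith(" "):
            token = line[1:]  # drop the single leading space
            if token.startswith("(%") and token.endswith(")"):
                # Non-speech annotation: a word enclosed by "(%" and ")".
                yield ("annotation", token[2:-1])
            else:
                yield ("token", token)
        else:
            yield ("other", line)


if __name__ == "__main__":
    # Hypothetical sample; the tag name and the Buckwalter strings are made up.
    sample = ["<sync time=0.00>\n", " mrHbA\n", " .\n", " (%mwsyqY)\n"]
    for kind, value in classify_lines(sample):
        print(kind, value)
```

Because the text is plain ASCII in Buckwalter transliteration, no special decoding step is needed before splitting lines this way.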

*Samples*

For an example of the data contained in this corpus, please examine this screenshot of the transcription.
- references: Mohamed Maamouri, David Graff, Christopher Cieri 2006 Arabic Broadcast News Transcripts Linguistic Data Consortium, Philadelphia
- hasVersion: C-001251: Arabic Broadcast News Speech
C-000610: Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
*Introduction*

Arabic CTS Levantine Fisher Training Data Set 3, Transcripts provides the transcription for the speech contained in Arabic CTS Levantine Fisher Training Data Set 3, Speech (LDC2005S07).

This training speech release consists of 322 conversations, representing a total of approximately 50 hours of Levantine Arabic speech.

The Fisher telephone conversation collection protocol was created at LDC to address a critical need of developers trying to build robust automatic speech recognition (ASR) systems. Previous collection protocols, such as CALLFRIEND and Switchboard-II and the resulting corpora, have been adapted for ASR research but were in fact developed for language and speaker identification respectively. Although the CALLHOME protocol and corpora were developed to support ASR technology, they feature small numbers of speakers making telephone calls of relatively long duration with narrow vocabulary across the collection. CALLHOME conversations are challengingly natural and intimate. Under the Fisher protocol, a very large number of participants each make a few calls of short duration speaking to other participants, whom they typically do not know, about assigned topics. This maximizes inter-speaker variation and vocabulary breadth although it also increases formality.

Previous protocols such as CALLHOME, CALLFRIEND and Switchboard relied upon participant activity to drive the collection. Fisher is unique in being platform driven rather than participant driven. Participants who wish to initiate a call may do so; however the collection platform initiates the majority of calls. Participants need only answer their phones at the times they specified when registering for the study.

To encourage a broad range of vocabulary, Fisher participants are asked to speak on an assigned topic which is selected at random from a list, which changes every 24 hours and which is assigned to all subjects paired on that day. Some topics are inherited or refined from previous Switchboard studies while others were developed specifically for the Fisher protocol.

*Samples*

Please examine this sample for an example of this corpus. The file is UTF-8 encoded text.
- references: Mohamed Maamouri, Tim Buckwalter, and Hubert Jin 2005 Arabic CTS Levantine Fisher Training Data Set 3, Transcripts Linguistic Data Consortium, Philadelphia
- hasVersion: C-001328: Arabic CTS Levantine Fisher Training Data Set 3, Speech
C-000611: Arabic English Parallel News Part 1
This corpus contains Arabic news stories and their English translations that LDC collected via Ummah Press Service from January 2001 to September 2004. It totals 8,439 story pairs, 68,685 sentence pairs, 2M Arabic words and 2.5M English words. The corpus is aligned at the sentence level. All data files are SGML documents.

Please examine this Arabic example and this English example to review a sample of this corpus.
- replaces: Ummah Arabic English Parallel News Text (LDC2002E48)
- isReferencedBy: LDC, et al. 2004 Arabic English Parallel News Part 1 Linguistic Data Consortium, Philadelphia
C-000612: Arabic Gigaword Second Edition
*Introduction*

Arabic Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC); its catalog number is LDC2006T02 and its ISBN is 1-58563-371-2. It is a comprehensive archive of newswire text data that has been acquired from Arabic news sources by LDC at the University of Pennsylvania.

Arabic Gigaword Second Edition includes all of the content of the first edition of Arabic Gigaword (LDC2003T12) as well as new data.

Five distinct sources of Arabic newswire are represented here:

* Agence France Presse (afp_arb; formerly afa)
* Al Hayat News Agency (hyt_arb; formerly alh)
* An Nahar News Agency (nhr_arb; formerly ann)
* Ummah Press (umh_arb)
* Xinhua News Agency (xin_arb; formerly xia)

The seven-letter codes in the parentheses above consist of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_") character. The language code "arb" denotes Standard Arabic in the ISO 639-3 standard. In the first edition of the Arabic Gigaword corpus, a simpler three-character code scheme was used to identify both the source and the language. The new convention makes it easier to distinguish data sets by source and language when a single newswire provider distributes data in multiple languages.

Ummah Press is a new source added to the Second Edition. The following table shows the new data that appear for the first time in the Second Edition.

| Source | Period | Documents |
| --- | --- | --- |
| Agence France Presse | 2003.01-2004.12 | 143,766 |
| Al Hayat News Agency | 2002.01-2003.12 | 64,308 |
| An Nahar News Agency | 2003.01-2004.01 | 16,316 |
| Ummah Press | 2003.01-2004.12 | 4,641 |
| Xinhua News Agency | 2003.06-2004.12 | 106,236 |

*Data*

There are 423 files, totaling approximately 1.4 GB in compressed form (5,359 MB uncompressed), with 481,906 K-words in 1,591,987 documents.

The table below lists, for each source: the number of files; the total compressed file size (Gzip-MB); the total uncompressed file size (Totl-MB, approximately 5.3 gigabytes overall); the number of space-separated tokens in the text, excluding SGML tags, in thousands (K-wrds); and the number of documents (#DOCs).

| Source | #Files | Gzip-MB | Totl-MB | K-wrds | #DOCs |
| --- | --- | --- | --- | --- | --- |
| AFP_ARB | 128 | 355 | 1429 | 123594 | 660621 |
| HYT_ARB | 119 | 524 | 1861 | 169100 | 369555 |
| NHR_ARB | 109 | 457 | 1649 | 151078 | 344084 |
| UMH_ARB | 24 | 4 | 13 | 1201 | 4645 |
| XIN_ARB | 43 | 103 | 407 | 36933 | 213082 |
| TOTAL | 423 | 1443 | 5359 | 481906 | 1591987 |

All text files in this corpus have been converted to UTF-8 character encoding.

Owing to the use of UTF-8, the SGML tagging within each file shows up as lines of single-byte-per-character (ASCII) text, whereas lines of actual text data, including article headlines and datelines, contain a mixture of single-byte and multi-byte characters. In general, single-byte characters in the text data will consist of digits and punctuation marks (where the original source relied on ASCII punctuation codes, rather than Arabic-specific punctuation), whereas multi-byte characters consist of Arabic letters and a small number of special punctuation or other symbols. This variable-width character encoding is intrinsic to UTF-8, and all UTF-8 capable processes will handle the data appropriately.
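The mixture of single-byte and multi-byte characters can be seen by encoding a short string and counting the bytes per character. This is a small sketch only; the sample string (an Arabic word followed by a year) is an illustrative assumption and is not taken from the corpus.

```python
from typing import List, Tuple


def utf8_widths(text: str) -> List[Tuple[str, int]]:
    """Return each character paired with its length in bytes when UTF-8 encoded."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


if __name__ == "__main__":
    # Three Arabic letters (U+0639, U+0627, U+0645), a space, and ASCII digits.
    sample = "\u0639\u0627\u0645 2004"
    for ch, width in utf8_widths(sample):
        print(f"U+{ord(ch):04X} -> {width} byte(s)")
    print(len(sample), "characters,", len(sample.encode("utf-8")), "bytes")
```

In this sample the Arabic letters take two bytes each while the space and digits take one, so character counts and byte counts differ for the same line.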

Each data file name consists of the seven-letter prefix, an underscore character ("_"), and a six-digit date (representing the year and month during which the file contents were generated by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). Therefore, each file contains all the usable data received by LDC for the given month from the given news source.
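The naming convention above can be checked mechanically. The sketch below splits a file name into its source code, language code, year and month; the class and function names are hypothetical, and the example file name is constructed from the stated convention rather than copied from the corpus.

```python
import re
from typing import NamedTuple


class GigawordFile(NamedTuple):
    source: str    # three-character source ID, e.g. "afp"
    language: str  # three-character language code, e.g. "arb"
    year: int
    month: int


# Seven-letter prefix (source_language), underscore, six-digit date, ".gz".
_NAME_PATTERN = re.compile(r"([a-z]{3})_([a-z]{3})_(\d{4})(\d{2})\.gz")


def parse_file_name(name: str) -> GigawordFile:
    """Parse a data file name that follows the convention described above."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognised data file name: {name!r}")
    source, language, year, month = match.groups()
    month_number = int(month)
    if not 1 <= month_number <= 12:
        raise ValueError(f"month out of range in {name!r}")
    return GigawordFile(source, language, int(year), month_number)


if __name__ == "__main__":
    print(parse_file_name("afp_arb_200301.gz"))
```

Since the files are gzip-compressed, Python's standard `gzip.open(path, "rt", encoding="utf-8")` would be one way to read their text content.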

All text data are presented in SGML form, using a very simple, minimal markup structure. The file gigaword_a.dtd in the "dtd" directory provides the formal "Document Type Declaration" for parsing the SGML content. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using this DTD file.

Unlike older corpora, the present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs).

All sources have received a uniform treatment in terms of quality control, and have been categorized into three distinct "types":

* story: this type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences
* multi: this type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ... (some general area like finance or sports)" and so on
* other: these DOCs clearly do not fall into any of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on

The general strategy for categorizing DOCs into these three classes was, for each source, to discover the most common and frequent clues in the text stream that correlated with the "non-story" types. When none of the known clues was in evidence, the DOC was classified as a "story."
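The default-to-story strategy can be sketched as a short rule-based classifier. This is only an illustration of the general idea: the clue patterns below are hypothetical English placeholders loosely based on the examples above, whereas the real clues were source-specific, applied to Arabic text, and are not documented here.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical clue patterns, for illustration only (not the clues actually used).
NON_STORY_CLUES: List[Tuple[str, Pattern[str]]] = [
    ("multi", re.compile(r"news briefs|summaries of today's news", re.IGNORECASE)),
    ("other", re.compile(r"sports scores|stock prices|temperatures", re.IGNORECASE)),
]


def classify_doc(text: str) -> str:
    """Return 'multi' or 'other' if a known clue matches, otherwise 'story'."""
    for doc_type, pattern in NON_STORY_CLUES:
        if pattern.search(text):
            return doc_type
    return "story"  # no known non-story clue was found


if __name__ == "__main__":
    print(classify_doc("News briefs in finance"))
    print(classify_doc("Closing stock prices"))
    print(classify_doc("The minister announced a new policy on Monday."))
```

A consequence of this design is that any DOC whose non-story clue is not yet known falls through to "story", so the quality of the labels depends on how complete the clue list is for each source.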

Other "Gigaword" corpora (in English and Chinese) had a fourth category, "advis" (for "advisory"), which applied to DOCs that contain text intended solely for news service editors, not the news-reading public. In preparing the Arabic data, the task of determining patterns for assigning "non-story" type labels was carried out by a native speaker of Arabic, and (for whatever reason) this person did not find the "advis" category to be applicable to any of the data.

As described in the introduction section, a new naming scheme for file names and document IDs is used in the Second Edition. All of the documents in the first edition of the Arabic Gigaword corpus can be mapped to the same documents in this edition by changing the prefix of DOC IDs and file names as below. The upper case letters are used for the DOC IDs; the lower case letters are used for the file and directory names. The underscore character to connect the seven-letter prefix and the date is included in the following table.

| Old | New |
| --- | --- |
| AFA | AFP_ARB_ |
| ALH | HYT_ARB_ |
| ANN | NHR_ARB_ |
| XIA | XIN_ARB_ |
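The prefix mapping above can be applied with a small lookup. The sketch below assumes that the NHR_ARB entry takes the same trailing underscore as the other three, and that the rest of an ID or file name carries over unchanged; the function name and the example IDs are hypothetical.

```python
# First-edition prefix -> second-edition prefix (trailing underscore included).
OLD_TO_NEW = {
    "AFA": "AFP_ARB_",
    "ALH": "HYT_ARB_",
    "ANN": "NHR_ARB_",  # assumption: same trailing underscore as the others
    "XIA": "XIN_ARB_",
}


def convert_prefix(old_name: str) -> str:
    """Replace a three-letter first-edition prefix, preserving upper or lower case."""
    head, rest = old_name[:3], old_name[3:]
    new_prefix = OLD_TO_NEW.get(head.upper())
    if new_prefix is None:
        raise ValueError(f"unknown first-edition prefix: {head!r}")
    if head.islower():
        # Lower case is used for file and directory names.
        new_prefix = new_prefix.lower()
    return new_prefix + rest


if __name__ == "__main__":
    # Hypothetical DOC ID and file name, shown only to illustrate the mapping.
    print(convert_prefix("AFA20010102.0001"))
    print(convert_prefix("xia200105.gz"))
```

Ummah Press has no entry because it is new in the Second Edition and therefore has no first-edition prefix.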

*Samples*

For an example of the data in this corpus, please examine this screenshot which is an image of the text from a single file.
- references: David Graff, et al. 2006 Arabic Gigaword Second Edition Linguistic Data Consortium, Philadelphia
- hasVersion: C-000613: Arabic Gigaword
- hasVersion: Arabic Gigaword Third Edition
