Number of registered language resources: 3,330. Showing items 1111 - 1120 of 2,023 matches.
  • C-003330: ACE 2005 English SpatialML Annotations
    *Introduction*

    The ACE (Automatic Content Extraction) program focuses on developing automatic content extraction technology to support automatic processing of human language in text form. The kind of information recognized and extracted from text includes entities, values, temporal expressions, relations and events. SpatialML is a mark-up language for representing spatial expressions in natural language documents. SpatialML's focus is primarily on geography and culturally-relevant landmarks, rather than biology, cosmology, geology, or other regions of the spatial language domain. The goal is to allow for potentially better integration of text collections with resources such as databases that provide spatial information about a domain, including gazetteers, physical feature databases and mapping services. In ACE 2005 English SpatialML Annotations, the authors applied SpatialML tags to the English training data (originally annotated for entities, relations and events) in ACE 2005 Multilingual Training Corpus, LDC2006T06. (NOTE: 2005 ACE training data and evaluation data were distributed as e-corpora (LDC2005E18, LDC2005E23) to participants in the 2005 ACE evaluation. Some of the files in ACE 2005 English SpatialML Annotations may originate from one of those e-corpora, not from LDC2006T06).

    The SpatialML annotation scheme is intended to build on earlier work on time expressions, such as TIMEX2, TimeML and the 2005 ACE guidelines.

    The main SpatialML tag is the PLACE tag. The central goal of SpatialML is to map PLACE information in text to data from gazetteers and other databases to the extent possible. Therefore, semantic attributes such as country abbreviations, country subdivision and dependent area abbreviations (e.g., US states), and geo-coordinates are used to help establish such a mapping. LINK and PATH tags express relations between places, such as inclusion relations and trajectories of various kinds. Information in the tag along with the tagged location string should be sufficient to uniquely determine the mapping, when such a mapping is possible. This also means that redundant information is not included in the tag.

    To the extent possible, SpatialML leverages ISO and other standards towards the goal of making the scheme compatible with existing and future corpora. The SpatialML guidelines are compatible with existing guidelines for spatial annotation and existing corpora within the ACE research program. In particular, the English Annotation Guidelines for Entities (Version 5.6.6 2006.08.01) were exploited, specifically the GPE, Location, and Facility entity tags, and the Physical relation tags, all of which are mapped to SpatialML tags. Ideas were also borrowed from Toponym Resolution Markup Language of Leidner (2006), the research of Schilder et al. (2004) and the annotation scheme in Garbin and Mani (2005). Information recorded in the annotation is compatible with the feature types in the Alexandria Digital Library. This corpus also leverages the integrated gazetteer database (IGDB) of Mardis and Burger (2005). Last but not least, this annotation scheme can be related to the Geography Markup Language (GML) defined by the Open Geospatial Consortium (OGC), as well as Google Earth's Keyhole Markup Language (KML), to express geographical features.

    SpatialML goes beyond these schemes, however, in terms of providing a richer markup for natural language that includes semantic features and relationships that allow mapping to existing resources such as gazetteers. Such a markup can be useful for (i) disambiguation, (ii) integration with mapping services, and (iii) spatial reasoning. In relation to (iii), it is possible to use spatial reasoning not only for integration with applications, but for better information extraction, e.g., for disambiguating a place name based on the locations of other place names in the document. SpatialML goes to some length to represent topological relationships among places, derived from the RCC8 Calculus (Randell et al. 1992, Cohn et al. 1997).

    Additional information about SpatialML is contained in the paper "SpatialML: Annotation Scheme for Marking Spatial Expressions in Natural Language," which is included in the online documentation for this corpus.

    Please direct all questions about this corpus to Janet Hitzeman (hitz@mitre.org).

    *Samples*

    For an example of the data in the corpus, please examine this sample.
  • C-003331: CSLU: Portland Cellular Telephone Speech Version 1.3
    *Introduction*

    CSLU: Portland Cellular Telephone Speech Version 1.3 was created by the Center for Spoken Language Understanding (CSLU) at OGI School of Science and Engineering, Oregon Health and Science University, Beaverton, Oregon. It consists of cellular telephone speech and corresponding transcripts, specifically, 7,571 utterances from 515 speakers who made calls in the Portland, Oregon area using cellular telephones.

    Speakers called the CSLU data collection system on cellular telephones, and they were asked to repeat certain phrases and to respond to other prompts. Two prompt protocols were used: an In Vehicle Protocol for speakers calling from inside a vehicle and a Not in Vehicle Protocol for those calling from outside a vehicle. The protocols shared several questions, but each protocol contained distinct queries designed to probe the conditions of the caller's in vehicle/not in vehicle surroundings. Not every caller provided a response to each prompt.

    *Recording Details*

    The speech data was captured digitally from CSLU's T1 connection and saved as 8 kHz, 16-bit linear samples.

    *Transcriptions*

    The text transcriptions in this corpus were produced using the non time-aligned word-level conventions described in The CSLU Labeling Guide, which is included in the documentation for this release. CSLU: Portland Cellular Telephone Speech Version 1.3 contains orthographic and phonetic transcriptions of corresponding speech files. Non time-aligned orthographic transcriptions provide quick access to the content of an utterance; they may contain markers for word boundaries to support access and retrieval at the lexical level. Phonetic/phonemic transcriptions represent the phonetic content of an utterance at a given level of detail that is made explicit by the use of diacritics. Phonetic phenomena transcribed include excessive nasalization, glottalization, frication on a stop, centralization, lateralization, rounding and palatalization.

    *Samples*

    For an example of the data in this corpus, please examine the following audio file and transcript.

    * audio (wav)
    * transcript
  • C-003332: Hungarian-English Parallel Text, Version 1.0
    *Introduction*

    Hungarian-English Parallel Text, Version 1.0 (also known as the "Hunglish Corpus") is a sentence-aligned Hungarian-English parallel corpus consisting of approximately two million sentence pairs. The corpus contains additional language resources for the Hungarian text, including a monolingual corpus, morphological toolset and aligner.

    Hungarian-English Parallel Text, Version 1.0 is a joint work of the Media Research and Education Center at the Budapest University of Technology and Economics (BUTE) and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics.

    Additional information about this release is available from the corpus website maintained by BUTE.

    *File formats, character encoding*

    This publication is issued on CD as a compressed tar file. Commonly available utilities such as GNU zip or StuffIt will readily extract this publication from its compressed form.

    Sentence pair (.bi) files consist of tab-separated, matching sentence pairs. The .bi files do not contain segments where deletion or contraction occurred. They are also filtered based on quality, so the full reconstruction of the raw texts is impossible. Some .bi files were shuffled (sorted alphabetically).

    Alignment "ladder" (.lad) files preserve the whole of both input texts with ordering, even those segments that were not successfully aligned. In .lad files, every line is tab-separated into two columns. The first is a segment of the Hungarian text. The second is a (supposedly corresponding) segment of the English text. Such segments of the source or target text will generally consist of exactly one sentence on both sides, but can also consist of zero, or more than one, sentence. In the latter case, the special separating token " ~~~ " is placed between sentences.

    The encoding of the sentence pair and the alignment files is mixed: ISO Latin-2 on the Hungarian side, and ISO Latin-1 on the English side. The overwhelming majority of the texts use compatible subsets of these two encodings, so for viewing, the files can be considered ISO Latin-2 encoded.

    The hu and en files contain the raw texts used, in ISO Latin-2 and ISO Latin-1 encoding respectively.
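
    As an illustrative sketch only, a .lad file of the layout described above can be read as follows; the example line is invented, and treating the whole file as ISO Latin-2 follows the viewing suggestion above.

```python
# Minimal sketch of reading a Hunglish .lad alignment "ladder" file:
# two tab-separated columns per line, with " ~~~ " joining multiple
# sentences inside one aligned segment.

SEP = " ~~~ "

def parse_lad_line(line):
    """Split one .lad line into (Hungarian sentences, English sentences)."""
    hu_col, _, en_col = line.rstrip("\n").partition("\t")
    return hu_col.split(SEP), en_col.split(SEP)

def read_lad(path):
    # The Hungarian side is ISO Latin-2; per the note above, the file as a
    # whole can be treated as Latin-2 for most purposes.
    with open(path, encoding="iso-8859-2") as f:
        for line in f:
            yield parse_lad_line(line)

# Invented example: two Hungarian sentences aligned to one English segment.
hu, en = parse_lad_line("Jó reggelt. ~~~ Hogy vagy?\tGood morning. How are you?")
```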

    *Samples*

    For an example of the data contained in this corpus, please examine this sample screen capture of bilingual text.
    • isReferencedBy: Dániel Varga, et al., 2008, "Hungarian-English Parallel Text, Version 1.0," Linguistic Data Consortium, Philadelphia
    • references: C-003333: Hungarian National Corpus
  • C-003333: Hungarian National Corpus
    Hungarian National Corpus (HNC) is a collection of written and spoken linguistic data of present-day Hungarian. It contains bibliographical data, marks the structural units (paragraphs, sentences), and every wordform in the texts is annotated with stem, part-of-speech and inflectional information. It aims to be a representative general-purpose corpus of present-day standard Hungarian. HNC is divided into five subcorpora by regional language variant and into five subcorpora by text genre, which makes the HNC an appropriate tool for studying differences not just between text genres but also between language variants.
  • C-003344: SALA II US English database
    Telephone
    The SALA II US English database collected in the United States was recorded within the scope of the SALA II project. It contains the recordings of 4,090 US English speakers (2,017 males and 2,073 females, including some speakers with Hispanic accents) recorded over the United States mobile telephone network.

    The following acoustic conditions were selected as representative of a mobile user's environment (some speakers were recorded in several environments):
    - Passenger in moving car, railway, bus, etc. (607 speakers)
    - Public place (1,238 speakers)
    - Stationary pedestrian by road side (928 speakers)
    - Home/office environment (1,188 speakers)
    - Passenger in moving car using a hands-free kit (161 speakers)

    This database is distributed as 2 DVD-ROMs. The speech files are stored as sequences of 8-bit, 8kHz Mu-law speech files and are not compressed, according to the specifications of SALA II. Each prompt utterance is stored within a separate file and has an accompanying ASCII SAM label file.
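
    As background on the format: G.711 mu-law stores each sample as an inverted 8-bit code word that expands to a linear PCM value. The sketch below implements the standard G.711 expansion formula; it is generic, not anything specific to this database, and the commented filename is hypothetical.

```python
def ulaw_decode(code):
    """Expand one G.711 mu-law code byte to a linear PCM sample."""
    code = ~code & 0xFF               # mu-law bytes are stored inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

# Decoding a whole file's worth of bytes (hypothetical filename):
# with open("a10001s1.ulaw", "rb") as f:
#     pcm = [ulaw_decode(b) for b in f.read()]
```

    A quick sanity check: the codes 0xFF and 0x7F both decode to 0 (positive and negative silence), and 0x00 decodes to -32124, the mu-law extreme.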

    This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SALA II format and content specifications.

    Each speaker uttered the following items:
    - 6 application words (out of a set of 30)
    - 1 sequence of 10 isolated digits
    - 4 connected digits (1 sheet number, 5+ digits; 1 telephone number, 9-11 digits; 1 credit card number, 14-16 digits; 1 PIN code, 6 digits)
    - 3 dates (1 spontaneous date e.g. birthday, 1 word style prompted date, 1 relative and general date expression)
    - 1 spotting phrase using an embedded application word
    - 2 isolated digits
    - 3 spelled words (1 surname, 1 directory assistance city name, 1 real/artificial name for coverage)
    - 1 currency money amount
    - 1 natural number
    - 5 directory assistance names (1 spontaneous, e.g. own surname, 1 city of birth/growing up, 1 most frequent city out of a set of 500, 1 most frequent company/agency out of a set of 500, 1 “forename surname” out of a set of 150)
    - 2 yes/no questions (1 predominantly “yes” question, 1 predominantly “no” question, including fuzzy questions)
    - 9 phonetically rich sentences
    - 2 time phrases (1 spontaneous time of day, 1 word style time phrase)
    - 4 phonetically rich words

    The following age distribution has been obtained: 129 speakers are under 16, 2,456 speakers are between 16 and 30, 832 speakers are between 31 and 45, 610 speakers are between 46 and 60, and 63 speakers are over 60.

    A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
  • C-003345: SALA II Portuguese from Brazil database
    Telephone
    This SALA II Portuguese from Brazil database was recorded within the scope of the SALA II project.
    The database contains the recordings of 1000 speakers (500 males and 500 females) recorded over the local mobile telephone network.

    The following acoustic conditions were selected as representative of a mobile user's environment:
    • Passenger in moving car (176 speakers)
    • Public place (264 speakers)
    • Stationary pedestrian by road side (231 speakers)
    • Home/Office environment (273 speakers)
    • Passenger in moving car using a hands-free kit (56 speakers)

    The database is distributed as 4 DVDs. The speech files are stored as sequences of 8-bit, 8kHz a-law speech files, according to the specifications of SALA II. Each prompt utterance is stored within a separate file and has an accompanying ASCII SAM label file. This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SALA II format and content specifications.

    Each speaker uttered the following items
    • 6 application words
    • 1 sequence of 10 isolated digits
    • 4 connected digits (1 sheet number, 6 digits; 1 telephone number, 9-11 digits; 1 credit card number, 14-16 digits; 1 PIN code, 6 digits)
    • 3 dates (1 spontaneous date e.g. birthday, 1 word style prompted date, 1 relative and general date expression)
    • 1 word spotting phrase using an embedded application word
    • 1 isolated digit
    • 3 spelled words (1 spontaneous name, 1 directory assistance city name, 1 real/artificial name for coverage)
    • 1 currency money amount
    • 1 natural number
    • 5 directory assistance names (1 spontaneous name, 1 spontaneous city of birth/growing up, 1 most frequent city out of a set of 500, 1 most frequent company/agency out of a set of 500, 1 forename-surname out of a set of 150)
    • 2 yes/no questions (1 predominantly yes question, 1 predominantly no question)
    • 9 phonetically rich sentences
    • 2 time phrases (1 spontaneous time of day, 1 word style time phrase)
    • 4 phonetically rich words

    The following age distribution has been obtained:
    6 speakers are under 16, 353 speakers are between 16 and 30, 328 speakers are between 31 and 45, 303 speakers are between 46 and 60, and 10 speakers are over 60.

    A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
  • C-003346: SALA II Spanish from Colombia Database
    Telephone
    The SALA II Spanish from Colombia database comprises 1000 Colombian speakers recorded over the Colombian mobile telephone network.
  • C-003347: SALA II US Spanish West
    Telephone
    The SALA II US Spanish West database comprises 1000 Spanish speakers recorded over the American mobile telephone network.
  • C-003348: GALE Phase 1 Arabic Blog Parallel Text
    *Introduction*

    This file contains the documentation for GALE Phase 1 Arabic Blog Parallel Text, Linguistic Data Consortium (LDC) catalog number LDC2008T02, ISBN 1-58563-462-X.

    Blogs are posts to informal web-based journals of varying topical content. GALE Phase 1 Arabic Blog Parallel Text was prepared by the LDC and consists of 102K words (222 files) of Arabic blog text and its English translation from thirty-three sources. This release was used as training data in Phase 1 of the DARPA-funded GALE program.

    LDC has released the following GALE Phase 1 & 2 Arabic Parallel Text data sets:

    * GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
    * GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
    * GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
    * GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
    * GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
    * GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
    * GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
    * GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
    * GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
    * GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)

    *Source Data*

    The task of preparing this corpus involved four stages of work: data scouting, data harvesting, formatting, and data selection.

    Data scouting involved manually searching the web for suitable blog text. Data scouts were assigned particular topics and genres along with a production target in order to focus their web search. Formal annotation guidelines and a customized annotation toolkit helped data scouts to manage the search process and to track progress.

    Data scouts logged their decisions about potential text of interest (sites, threads and posts) to a database. A nightly process queried the annotation database and harvested all designated URLs. Whenever possible, the entire site was downloaded, not just the individual thread or post located by the data scout.

    Once the text was downloaded, its format was standardized (by running various scripts) so that the data could be more easily integrated into downstream annotation processes. Original-format versions of each document were also preserved. Typically a new script was required for each new domain name that was identified. After scripts were run, an optional manual process corrected any remaining formatting problems.

    The selected documents were then reviewed for content suitability using a semi-automatic process. A statistical approach was used to rank a document's relevance to a set of already-selected documents labeled as good. An annotator then reviewed the list of relevance-ranked documents and selected those which were suitable for a particular annotation task or for annotation in general. Those newly-judged documents in turn provided additional input for the generation of new ranked lists.
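
    The documentation does not name the statistical model behind this ranking. As an illustration only, one common technique of this kind is cosine similarity over TF-IDF vectors; the sketch below uses invented toy documents and is an assumption about the general approach, not LDC's actual tooling.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors (term -> weight) for tokenized docs."""
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy ranking: score candidates against one already-selected "good" document.
good = ["cairo", "travel", "blog"]
candidates = [["cairo", "politics"], ["recipe", "cooking"]]
vecs = tfidf_vectors([good] + candidates)
scores = [cosine(vecs[0], v) for v in vecs[1:]]   # higher = more relevant
```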

    Manual sentence unit/segment (SU) annotation was also performed on a subset of files following LDC's Quick Rich Transcription specification. Three types of end-of-sentence SU are identified:

    - statement SU
    - question SU
    - incomplete SU

    *Translation*

    After files were selected, they were reformatted into a human-readable translation format and assigned to professional translators for careful translation. Translators followed LDC's GALE Translation guidelines, which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features (such as names and speech disfluencies), and quality control procedures applied to completed translations.

    Translators were instructed to return a 50-sentence sample as soon as it was completed. The sample was reviewed by LDC's bilingual language specialists. Subsequent deliveries were subject to the quality controls described in the translation guidelines. Low-quality translations were returned to the translators for revision.

    *TDF Format*

    All final data are in Tab Delimited Format (TDF). TDF is compatible with other transcription formats, such as the Transcriber format and AG format, and it is easy to process.

    Each line of a TDF file corresponds to a speech segment and contains 13 tab delimited fields:

    field  name            data_type
    -----  --------------  ---------
     1     file            unicode
     2     channel         int
     3     start           float
     4     end             float
     5     speaker         unicode
     6     speakerType     unicode
     7     speakerDialect  unicode
     8     transcript      unicode
     9     section         int
    10     turn            int
    11     segment         int
    12     sectionType     unicode
    13     suType          unicode

    A source TDF file and its translation are the same except that the transcript in the source TDF is replaced by its English translation.
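
    A line of this layout can be split back into typed fields. The sketch below assumes the field order and declared types listed above; the sample line itself is invented, not taken from the corpus.

```python
# Parse one TDF line into the 13 tab-delimited fields described above.
TDF_FIELDS = ["file", "channel", "start", "end", "speaker", "speakerType",
              "speakerDialect", "transcript", "section", "turn", "segment",
              "sectionType", "suType"]

def parse_tdf_line(line):
    values = line.rstrip("\n").split("\t")
    if len(values) != len(TDF_FIELDS):
        raise ValueError(f"expected {len(TDF_FIELDS)} fields, got {len(values)}")
    rec = dict(zip(TDF_FIELDS, values))
    # Apply the declared types: channel/section/turn/segment are int,
    # start/end are float, everything else stays unicode.
    for k in ("channel", "section", "turn", "segment"):
        rec[k] = int(rec[k])
    for k in ("start", "end"):
        rec[k] = float(rec[k])
    return rec

# Invented sample line with 13 fields:
sample = "blog01\t0\t0.0\t4.2\tposter1\tNA\tNA\tsome text\t1\t1\t1\tbody\tstatement"
rec = parse_tdf_line(sample)
```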

    *Encoding*

    All data are encoded in UTF-8.

    *Sponsorship*

    This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

    *Samples*

    For an example of the data in this corpus, please examine these screen captures (jpg) of the text:

    * source
    * translation
  • C-003349: STC-TIMIT 1.0
    This file contains documentation for STC-TIMIT 1.0, Linguistic Data Consortium (LDC) catalog number LDC2008S03 and ISBN 1-58563-468-9.

    STC-TIMIT 1.0 is a telephone version of TIMIT Acoustic Phonetic Continuous Speech Corpus, LDC93S1 (TIMIT). TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English reading ten phonetically rich sentences. Created in 1993, TIMIT was designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Since that time, several corpora have been developed using the TIMIT database: NTIMIT, LDC93S2 (transmitting TIMIT recordings through a telephone handset and over various channels in the NYNEX telephone network and redigitizing them); CTIMIT, LDC96S30 (passing TIMIT files through cellular telephone circuits); FFMTIMIT, LDC96S32 (re-recording TIMIT files with a free-field microphone); and HTIMIT, LDC98S67 (re-recording a subset of TIMIT files through different telephone handsets).

    What differentiates STC-TIMIT 1.0 from other TIMIT-derived corpora is that the entire TIMIT database was passed through an actual telephone channel in a single call. Thus, a single type of channel distortion and noise affect the whole database.

    The process was managed using a Dialogic switchboard for the calling and receiving ends. No transducer (microphone) was employed; the original digital signal was converted to analog using the switchboard's D/A converter, transmitted through a telephone channel, and converted back to digital format before recording. As a result, the only distortion introduced is that of the telephone channel itself.

    The STC-TIMIT 1.0 database is organized in the same manner as the original TIMIT corpus: 4,620 files in the training partition and 1,680 files in the test partition. Files were recorded at an 8 kHz sampling frequency with mu-law encoding. Additionally, four sets of two calibration tones were generated; these were passed through the telephone line at approximately the start of each quarter of the database (both the source and recorded calibration tones in each set are provided). The calibration tones are:

    * 2 sec. 1kHz tone
    * 2 sec. sweep tone from 10 Hz to 4000 Hz

    Utterances in STC-TIMIT 1.0 are time-aligned with those of TIMIT with an average precision of 0.125 ms (1 sample), by maximizing the cross-correlation between pairs of files from each corpus. Thus, labels from TIMIT may be used for STC-TIMIT 1.0, and the effects of telephone channels may be studied on a frame-by-frame basis.
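
    The alignment idea can be illustrated in miniature: pick the delay that maximizes the cross-correlation between a reference signal and its recorded copy. This is an assumption-laden toy (pure Python, only non-negative delays searched), not the Matlab xcorr procedure the authors used.

```python
import random

def best_lag(reference, recorded, max_lag=100):
    """Return the delay (in samples) of `recorded` relative to `reference`
    that maximizes their cross-correlation."""
    def corr(lag):
        return sum(reference[i] * recorded[i + lag]
                   for i in range(len(reference))
                   if i + lag < len(recorded))
    return max(range(max_lag + 1), key=corr)

# Toy check: a pseudo-random burst preceded by 25 samples of silence.
random.seed(0)
ref = [random.uniform(-1.0, 1.0) for _ in range(400)]
rec = [0.0] * 25 + ref
lag = best_lag(ref, rec, max_lag=50)
```

    Once the lag is known, the recorded signal can be sliced so that each utterance lines up sample-for-sample with its TIMIT original, which is what makes frame-by-frame channel comparisons possible.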

    *Data*

    Originally a single wav file was created by concatenation of all files in the TIMIT database. This file was downsampled to 8kHz and compressed using muLaw encoding.

    Two telephone lines within the same building were connected to a Dialogic(R) card. One of the lines was used as the calling-end and played the speech file, while the other line was used as the receiving-end and recorded the new signal. The whole recording process was conducted in a single call. Incoming speech was recorded using 8kHz sampling frequency and muLaw encoding.

    After recording, the file was pre-cut according to the lengths of the corresponding TIMIT files. Each resulting file was then aligned to its corresponding file in TIMIT using the xcorr routine in Matlab(R), and the recording was re-sliced using the newly generated alignments. Thus, each file in STC-TIMIT 1.0 is aligned to its equivalent in TIMIT and has the same length.

    *Sample*

    For an example of the data contained in this corpus, please listen to this audio sample.