Number of registered language resources: 3,330
Showing items 741-750 of 2,023
-
C-001312: Tactical Speaker Identification Speech Corpus (TSID)
*Introduction*
The Tactical Speaker Identification Corpus (TSID), which was collected by Douglas Reynolds and Gerald C. O'Leary of MIT Lincoln Labs, contains recordings of 35 speakers (4 female, 31 male) using a variety of different radio transmitters and receivers.
*Data*
The recording sessions were conducted by assembling the speakers into seven groups of five, then having each speaker perform the following tasks:
- read a list of TIMIT sentences
- read a list of digit strings
- give directions for traveling from one point to another using a map (unscripted map task)
Each speaker performed this set of tasks on each of three transmitters (xmtr1-3), and the utterances were recorded simultaneously on DAT recorders attached to each of six receivers (rcvr1-6), which were located at some distance (well out of earshot) from the transmitter. Recordings were also made at the same time on a DAT recorder near the speaker, using a head-mounted microphone, to provide a reference wide-band recording of the speech (refwb).
As a result, the corpus is organized along four dimensions: speaker, transmitter, receiver, and speaking task; this organization can be viewed as a four-dimensional matrix with 35x3x7x3 cells (the six receivers plus the refwb channel make up the seven-way receiver dimension). Due to occasional mishaps and malfunctions during the collection, some cells in this matrix are either empty or only partially full.
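The four-dimensional organization described above can be sketched as a simple cell inventory. The axis labels below follow the naming used in the text (xmtr1-3, rcvr1-6 plus refwb, and the three tasks); the task labels and the inventory routine itself are illustrative, not part of the corpus distribution.

```python
from itertools import product

# Axis labels, following the naming in the corpus description.
speakers = [f"spk{n:02d}" for n in range(1, 36)]           # 35 speakers
transmitters = ["xmtr1", "xmtr2", "xmtr3"]                 # 3 transmitters
receivers = [f"rcvr{n}" for n in range(1, 7)] + ["refwb"]  # 6 receivers + reference channel
tasks = ["timit", "digits", "map"]                         # 3 speaking tasks (labels hypothetical)

# Every potential cell of the 35 x 3 x 7 x 3 matrix; in the actual
# collection some cells are empty or only partially full.
cells = list(product(speakers, transmitters, receivers, tasks))
assert len(cells) == 35 * 3 * 7 * 3  # 2205 potential recording cells
```

A real completeness check would compare this list against the file inventory shipped with the corpus.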
In addition to the tasks listed above, three pairs of speakers also participated in a two-way map task using xmtr3; in this case, one of the speakers in the task gives directions to the other for tracing a route on a map, and both speakers are recorded on a single audio channel at each of the receivers (except for the "refwb" recording: the two speakers were separated by some distance, using radio communication to perform the task, and only one of them used a head-mounted microphone and local DAT recorder for wide-band recording).
*Updates*
There are no updates at this time.

- references: David Graff, Douglas Reynolds, and Gerald C. O'Leary. 1999. Tactical Speaker Identification Speech Corpus (TSID). Linguistic Data Consortium, Philadelphia.
-
C-001313: Taiwanese Putonghua Speech and Transcripts
*Introduction*
This set of data on Taiwanese accented Putonghua (PTH) was gathered by San Duanmu at the University of Michigan. The data was recorded in Taiwan from December 1994 to January 1995. Taiwanese accented PTH refers to PTH spoken by people who were born in Taiwan and whose first language is Taiwanese (Southern Min).
*Data*
A total of 40 speakers, ranging in age, education, birthplace, and family dialect, were recorded. There were five two-speaker dialogues and 30 single-speaker monologues. The dialogues were about 20 minutes each and the monologues were about 10 minutes each. Dialogues were recorded on two tracks, one for each speaker. Monologues were recorded on one track.
The recordings were done in ordinary, but quiet rooms. The speakers were asked in advance to speak in conversation style, without notes, on any topic they chose, or no topic at all. Most speakers spoke spontaneously and the topic drifted freely. Some speakers talked about their professional work in a rather formal way. One speaker (#20, a public health official) used notes. Overall, the corpus provides an informative sampling of variation in speech style.
The recording tools consisted of a portable DAT (Teac) which recorded at a 44.1 kHz sampling rate with 16-bit linear quantization. The microphones were AudioTechnica lapel microphones with a preamp and XLR connection to the DAT. The XLR connection helped produce low-noise recordings, and the AudioTechnica microphone provided wide bandwidth and flat response over the speech range of interest, was unidirectional to minimize cross-talk, and was very light in comparison with standard microphones. Both single-speaker monologues and two-speaker dialogues were recorded using this system on standard DAT tape. For publication on CD-ROM, the original DAT recordings were downsampled to a 16 kHz sample rate.
Before recording, all speakers read and signed the "Informed Consent Form," which was written in Chinese and which largely followed the standard format approved by the Human Subject Committee of the University of Michigan. The form stated that the participation in the recording was entirely voluntary and that the speech may be used for linguistic teaching and research purposes.
The speech data are accompanied by transcripts. The monologues have start and end time stamps. The five dialogues are time stamped by speaker turn.
*Updates*
After the publication of this corpus, some demographic data was made available to the LDC. To access this data, please go to the demographic table.

- references: San Duanmu, et al. 1998. Taiwanese Putonghua Speech and Transcripts. Linguistic Data Consortium, Philadelphia.
-
C-001315: The AQUAINT Corpus of English News Text
*Introduction*
The AQUAINT Corpus, Linguistic Data Consortium (LDC) catalog number LDC2002T31 and ISBN 1-58563-240-6, consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service. It was prepared by the LDC for the AQUAINT Project and will be used in official benchmark evaluations conducted by the National Institute of Standards and Technology (NIST).
*Data*
The data files contain roughly 375 million words, corresponding to about 3 GB of data. The text data are separated into directories by source (apw, nyt, xie); within each source, data files are subdivided by year, and within each year, there is one file per date of collection. Each file is named to reflect the source and date, and contains a stream of SGML-tagged text data presenting the series of news stories reported on the given date as a concatenation of DOC elements (i.e., blocks of text bounded by <DOC> and </DOC> tags).
All data files are published in compressed form, using the GNU "gzip" utility; as such, all files have a ".gz" extension and will have no file name extension when uncompressed in the usual way (i.e., just the base file name, consisting of "YYYYMMDD_SRC").
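The file layout described above lends itself to straightforward batch processing. The sketch below builds a tiny stand-in for one gzipped data file and pulls out its DOC elements with a regular expression; the sample file name and DOCNO values are invented for illustration, and a regex split is a rough but serviceable substitute for full SGML parsing when one only needs to count or extract stories.

```python
import gzip
import re
import tempfile
from pathlib import Path

def iter_docs(path):
    """Return the text of each <DOC>...</DOC> element in one AQUAINT-style file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        data = f.read()
    return re.findall(r"<DOC>.*?</DOC>", data, flags=re.DOTALL)

# Tiny stand-in for a data file such as 19990101_XIE.gz (contents hypothetical).
sample = (
    "<DOC>\n<DOCNO> XIE19990101.0001 </DOCNO>\n<TEXT>story one</TEXT>\n</DOC>\n"
    "<DOC>\n<DOCNO> XIE19990101.0002 </DOCNO>\n<TEXT>story two</TEXT>\n</DOC>\n"
)
path = Path(tempfile.mkdtemp()) / "19990101_XIE.gz"
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(sample)

docs = iter_docs(path)
assert len(docs) == 2
```

On the real corpus one would glob each source/year directory (e.g. `xie/1999/*.gz`) and apply `iter_docs` to each file in turn.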
While all the data files are covered by a single DTD, it is not the case that they all have a single pattern of markup. Rather, all files share a core markup structure, with minor variations in the peripheral regions of each DOC element, and the DTD has been written to accommodate the variations.
*Updates*
19980614_NYT.gz was left off in the conversion from CD to DVD. An update was issued on 09/13/2012. All copies ordered after this date will be complete. Contact ldc@ldc.upenn.edu for more information.

- references: David Graff. 2002. The AQUAINT Corpus of English News Text. Linguistic Data Consortium, Philadelphia.
-
C-001316: The CMU Kids Corpus
*Introduction*
This database comprises sentences read aloud by children. It was originally designed to create a training set of children's speech for the SPHINX II automatic speech recognizer, for use in the LISTEN project at Carnegie Mellon University.
*Data*
The children range in age from six to eleven (see details below) and were in first through third grades (the 11-year-old was in 6th grade) at the time of recording. There were 24 male and 52 female speakers. Although the girls outnumber the boys, we feel that the small difference in vocal tract length between the two at this age should make the effect of this imbalance negligible. There are 5,180 utterances in all.
The speakers come from two separate populations. Since the LISTEN reading coach needed good examples of reading aloud, it was decided that the majority of the speakers should be "good" readers. They were recorded in the summer of 1995 and were enrolled in either the Chatham College Summer Camp or the Mount Lebanon Extended Day Summer Fun program in Pittsburgh. They were recorded on-site. This set will hereafter be called SUM95. There are 44 speakers and 3,333 utterances in this set.

The LISTEN system also needed examples of errorful reading and dialectal variants. The readers who supplied this type of speech come from a school with a high population of children who are at risk of growing up to be poor readers and who could therefore benefit from any reading tutor or other system built upon this database. They come from Fort Pitt School in Pittsburgh and were recorded in April 1996. This subset will be referred to as FP. There are 32 speakers and 1,847 utterances in this set. The list of speakers, the set they are in, and the number of sentences per speaker can be found in the "tables" directory, in the file named "speaker.tbl."
It should be noted that although there will be some dialectal variation in the speech of the SUM95 subset, the speech of the FP subset gives us a very good representation of dialects of the children that may be targeted for the LISTEN system. However, the user should be aware that the speakers' dialect partly reflects what is locally called "Pittsburghese."
*Updates*
There are no updates at this time.

- references: Maxine Eskenazi, Jack Mostow, and David Graff. 1997. The CMU Kids Corpus. Linguistic Data Consortium, Philadelphia.
-
C-001317: West Point Company G3 American English Speech
*Introduction*
During the 2000-2001 academic year, cadets, staff, and faculty members at the United States Military Academy volunteered to participate in a speech data collection project for American English. The goal of the project was to amass recordings from no fewer than 100 adult speakers (50 males and 50 females) to form a substantial corpus of high-quality read speech.
The project was conducted by the Center for Technology Enhanced Language Learning, part of the U.S. Military Academy's Department of Foreign Languages. Many of the 100-plus volunteers who provided the recordings were members of the staff and faculty of the Department of Foreign Languages. Other volunteers were friends and colleagues from other organizations who worked in offices in Washington Hall.
The largest group of volunteers was from Cadet Company G, Third Regiment, United States Corps of Cadets. Cadet Company G3, encouraged by their tactical officer, Major Scott Custer, adopted the speech data collection effort as a community service project. Every female cadet in Company G3 recorded her voice, as did many of the male cadets, including the cadet company commander and Major Custer.
The 185 sentences comprising the data collection script were written to elicit examples of all or most of the possible syllables used in spoken American English.
The G3 Corpus audio data comes from 53 female and 56 male volunteers, each of whom recorded approximately 104 utterances. The recordings are sampled at a 16-bit resolution, 22,050 samples per second. Recordings were made using headset microphones (Shure M10) with preamplifiers attached to the line input jack of desktop computers. The total amount of speech is about 15 hours.
*Samples*
For an example of this corpus, please listen to this audio sample.

- references: John Morgan, et al. 2005. The West Point Company G3 American English Speech Data Corpus. Linguistic Data Consortium, Philadelphia.
-
C-001318: TimeBank 1.2
*Introduction*
TimeBank 1.2 contains 183 news articles that have been annotated with temporal information, adding events, times, and temporal links between events and times. The annotation follows the TimeML 1.2.1 specification, available at www.timeml.org.
*Data*
TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships. For a detailed description of TimeML, see the TimeML 1.2.1 Specification and Guidelines. Here, we give a summary of each tag.
TIMEX3. This tag is used to capture dates, times, durations, and sets of dates and times. All TIMEX3 tags include a type and a value along with some other possible attributes. The value is given according to the ISO 8601 standard. The TIMEX3 tag allows specification of a temporal anchor. This facilitates the use of temporal functions to calculate the value of an underspecified temporal expression. For example, an article might include a document creation time such as "January 3, 2006." Later in the article, the temporal expression "today" may occur. By anchoring the TIMEX3 for "today" to the document creation time, we can determine the exact value of the TIMEX3.
EVENT. The EVENT tag is used to annotate those elements in a text that mark the semantic events described by it. Any event that can be temporally anchored or ordered is captured with this tag. An EVENT includes a class attribute with values such as occurrence, state, or reporting. The class of an EVENT may indicate what relationships the event participates in. In addition to the EVENT tag, events are also annotated with one or more MAKEINSTANCE tags that include information about a particular instance of the event. This includes part of speech, tense, aspect, modality, and polarity. When an event participates in a relationship, it is actually the event instance that is referenced. This is to allow for statements such as "John taught on Monday but not on Tuesday." Here, there are actually two instances of the teaching-event: one that has a positive polarity and one that is negative. Further, each instance participates in its own temporal relationship with respect to "Monday" and "Tuesday."
SIGNAL. The SIGNAL tag is used to annotate temporal function words such as "after," "during," and "when." These signals are then used in the representation of a temporal relationship.
The following three tags are link tags. They capture temporal, subordination, and aspectual relationships found in the text. These tags do not consume any actual text, but they do relate the three tag types above to each other.
TLINK. Temporal links are represented with a TLINK tag. A TLINK can temporally relate two temporal expressions, two event instances, or a temporal expression and an event instance. Along with an identification marker for each of these two elements, a relation type is given such as before, includes, or ended by. When a signal is present that helps to define the relationship, an ID for the SIGNAL is given as well.
SLINK. This tag is used to capture subordination relationships that involve event modality, evidentiality, and factuality. An SLINK includes an event instance ID for the subordinating event and an event instance ID for the subordinated event. Possible relation types for SLINK include modal, evidential, and factive. An SLINK will typically not include a signal ID unless it has the relation type conditional. Three specific EVENT classes interact with SLINK: reporting, i_state, and i_action.
ALINK. An aspectual connection between two event instances is represented with ALINK. As with SLINK, this tag includes two event instance IDs, one that introduces the ALINK and one that is the event argument to that event. The introducing event has the class aspectual. Some possible relation types for ALINK are initiates, terminates, and continues.

TimeBank 1.2 contains 183 articles with just over 61,000 non-punctuation tokens. The count for each TimeML tag is listed below:
EVENT: 7,935
MAKEINSTANCE: 7,940
TIMEX3: 1,414
SIGNAL: 688
ALINK: 265
SLINK: 2,932
TLINK: 6,418
Total: 27,592
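The anchoring mechanism described above can be illustrated with a small TimeML-style fragment. The element and attribute names follow the TimeML 1.2.1 tags discussed in the text (TIMEX3 with anchorTimeID, EVENT, MAKEINSTANCE, TLINK), but the sentence, IDs, and attribute values are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Invented sentence marked up with the tag types summarized above:
# a document creation time, an anchored "today", an EVENT with its
# MAKEINSTANCE, and a TLINK relating the event instance to the time.
fragment = """
<TimeML>
  <TIMEX3 tid="t1" type="DATE" value="2006-01-03"
          functionInDocument="CREATION_TIME">January 3, 2006</TIMEX3>
  Negotiators <EVENT eid="e1" class="OCCURRENCE">met</EVENT>
  <TIMEX3 tid="t2" type="DATE" value="2006-01-03" anchorTimeID="t1">today</TIMEX3>.
  <MAKEINSTANCE eiid="ei1" eventID="e1" tense="PAST" aspect="NONE"
                polarity="POS" pos="VERB"/>
  <TLINK lid="l1" eventInstanceID="ei1" relatedToTime="t2" relType="IS_INCLUDED"/>
</TimeML>
"""

root = ET.fromstring(fragment)
# Resolve the underspecified "today" through its anchor, as described above.
times = {t.get("tid"): t for t in root.iter("TIMEX3")}
today = times["t2"]
anchor = times[today.get("anchorTimeID")]
assert anchor.get("functionInDocument") == "CREATION_TIME"
assert today.get("value") == anchor.get("value") == "2006-01-03"
```

In a real annotation the value of "today" would be computed from the anchor rather than stored directly; here both are written out so the resolution step can be checked.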
*Samples*
For an example of the data in this corpus, please view the following samples. The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

- references: James Pustejovsky, et al. 2006. TimeBank 1.2. Linguistic Data Consortium, Philadelphia.
-
C-001319: Canada Bilingual Spoken Language Corpus
A large-scale corpus of bilingual spoken language (Japanese, English, and French) in Canada. It consists of conversations on everyday topics by 29 families in all. The total word count is 30,000 words.
- hasVersion: C-001320: French (Aix) Multilingual Spoken Language Corpus
- hasVersion: C-001321: French (Paris) Multilingual Spoken Language Corpus
- hasVersion: C-001322: Malay Multilingual Spoken Language Corpus
- hasVersion: C-001323: Spanish Multilingual Spoken Language Corpus, 2004 Edition
- hasVersion: C-004968: Spanish Multilingual Spoken Language Corpus, 2006 Edition
- hasVersion: C-001324: Turkish Multilingual Spoken Language Corpus
- hasVersion: C-004967: Taiwan Mandarin Multilingual Spoken Language Corpus
-
C-001320: French (Aix) Multilingual Spoken Language Corpus
The French spoken-language corpus (1) consists of 21 dialogues in all. The total word count is over 100,000 words, and each dialogue focuses on a different topic.
- hasVersion: C-001319: Canada Bilingual Spoken Language Corpus
- hasVersion: C-001321: French (Paris) Multilingual Spoken Language Corpus
- hasVersion: C-001322: Malay Multilingual Spoken Language Corpus
- hasVersion: C-001323: Spanish Multilingual Spoken Language Corpus, 2004 Edition
- hasVersion: C-004968: Spanish Multilingual Spoken Language Corpus, 2006 Edition
- hasVersion: C-001324: Turkish Multilingual Spoken Language Corpus
- hasVersion: C-004967: Taiwan Mandarin Multilingual Spoken Language Corpus
-
C-001321: French (Paris) Multilingual Spoken Language Corpus
A corpus of the spoken language of young people in Paris on topics related to music.
It consists of seven dialogues in all. At present, transcription has been completed for the conversations of the first group, totalling 56,463 words.
- hasVersion: C-001319: Canada Bilingual Spoken Language Corpus
- hasVersion: C-001320: French (Aix) Multilingual Spoken Language Corpus
- hasVersion: C-001322: Malay Multilingual Spoken Language Corpus
- hasVersion: C-001323: Spanish Multilingual Spoken Language Corpus, 2004 Edition
- hasVersion: C-004968: Spanish Multilingual Spoken Language Corpus, 2006 Edition
- hasVersion: C-001324: Turkish Multilingual Spoken Language Corpus
- hasVersion: C-004967: Taiwan Mandarin Multilingual Spoken Language Corpus
-
C-001322: Malay Multilingual Spoken Language Corpus
Standard colloquial Malay as used in Malaysia.
This corpus contains 32 dialogues, of which 22 have been transcribed. The total word count is 172,855 words, and the total recording time is about 30.5 hours. The dialogues fall into four types based on two parameters: (1) the degree of topic control (free or partially controlled) and (2) the mode of conversation (face-to-face or telephone). Four sample conversations can be heard on this site. In addition to the Malay transcription, English and Japanese translations are also provided.
- hasVersion: C-001319: Canada Bilingual Spoken Language Corpus
- hasVersion: C-001320: French (Aix) Multilingual Spoken Language Corpus
- hasVersion: C-001321: French (Paris) Multilingual Spoken Language Corpus
- hasVersion: C-001323: Spanish Multilingual Spoken Language Corpus, 2004 Edition
- hasVersion: C-004968: Spanish Multilingual Spoken Language Corpus, 2006 Edition
- hasVersion: C-001324: Turkish Multilingual Spoken Language Corpus
- hasVersion: Taiwan Mandarin Multilingual Spoken Language Corpus