Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing items 611 - 620 of 2023.

C-001094: SLX Corpus of Classic Sociolinguistic Interviews
*Introduction*

The SLX Corpus of Classic Sociolinguistic Interviews comprises eight sociolinguistic interviews with a total of nine speakers, conducted in the 1960s and 70s. All of the interviews were conducted by William Labov or by one of his students. Labov notes that these interviews are not classic in the sense that they form part of a systematic sociolinguistic study of the speech community. What makes these interviews classic is that they represent classic solutions to the problems of achieving cross-cultural contact, reducing the effect of the Observer's Paradox and approximating the vernacular of everyday life. Most importantly, they are interviews with extraordinarily gifted, memorable and fluent speakers.

These particular interviews have also been targeted for inclusion in this corpus because of their sound quality and because publication of the audio data and corresponding transcripts and annotations does not violate any agreement the interviewer made with the speakers regarding data distribution.

The corpus includes the complete interview recordings plus time-aligned verbatim transcripts for each speaker. Also included in the publication is a sociolinguistic variable survey that represents an overview of the intra- and inter-speaker variation attested in the corpus, highlighting a broad range of phonological, phonetic, grammatical, lexical and stylistic variables. Finally, the publication includes a number of annotation tools that allow users to listen to each interview while browsing the corresponding transcripts, and to display and hear each token identified in the variable survey. These tools can be extended to create new time-aligned transcripts or tag additional variables within the existing corpus.

The SLX Corpus was developed as part of the Data and Annotations for Sociolinguistics (DASL) Project, an investigation of best practices in the use of digital speech corpora for the study of language variation. Containing classic interview material in the Labovian tradition, it is a valuable teaching tool for linguists. The recordings demonstrate successful interviewing techniques, the sound quality is high, and the digitization, segmentation and transcription of the data represent best practice in these areas. The variable survey highlights over 150 sociolinguistic variables attested in the corpus and suggests avenues for further research. Most importantly, the SLX Corpus provides both an example of a digital speech corpus developed specifically to support sociolinguistic research, and a stable benchmark for training in sociolinguistic data collection, digitization, segmentation, transcription, analysis and publication.

*Data*

The 17 speech files are 22050Hz, 16-bit, single-channel in the MS WAV (RIFF) format, for a total of 575 minutes (~ 1.5GB).
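As a rough consistency check on the figures above, the uncompressed PCM size can be estimated from the sample rate, bit depth, channel count and total duration. The following is a minimal sketch; the function name `pcm_size_bytes` is hypothetical, and the small per-file WAV header overhead is ignored.

```python
def pcm_size_bytes(sample_rate_hz: int, bits_per_sample: int,
                   channels: int, minutes: int) -> int:
    """Estimate the uncompressed PCM payload size (WAV headers ignored)."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return bytes_per_second * minutes * 60


# 22050 Hz, 16-bit, single channel, 575 minutes in total
size = pcm_size_bytes(22050, 16, 1, 575)
print(f"{size / 1e9:.2f} GB")  # prints "1.52 GB", in line with the stated ~1.5GB
```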

The audio data reflects a broad spectrum of speaking styles, including spontaneous speech, narratives, responses and formal linguistic tasks. The interviews touch on a multitude of topics, and corpus users should note that the language of the interviews represents the uncensored opinions of the speakers, reflecting their daily concerns and personal histories.

Taken as a whole, the speakers exemplify a wide variety of regional and social dialects. Demographic information for each main speaker in the corpus is displayed in the table below.

| Speaker | Age | Speech Community | Occupation | Ethnicity | Education |
|---|---|---|---|---|---|
| Adolphus H. | 81 | Near Hillsboro, NC | Farmer | African American | Very little |
| Bobbie A. | 22 | Ayr, Scotland | Saw Doctor | Scottish/Italian | Some technical college |
| Henry G. | 60 | E. Atlanta, GA (Dekalb Co.) | Railroad foreman | European American | High school graduate |
| Jerry T. | 19 | Near Leakey, Texas | Gas station attendant | European American | Some high school |
| Joe D. (interviewed with Eddie M.) | 21 | Liverpool, England | Docker | English | Some high school |
| Eddie M. (interviewed with Joe D.) | 19 | Liverpool, England | Docker | English | Some high school |
| Kathy D. | 15 | Rochester, NY | Student | European American | In 11th grade |
| Louise A. | 53 | Knoxville, TN | Mother | European American | Unknown |
| Rose B. | 43 | New York, NY (Lower East Side) | Factory seamstress | Italian American | Sixth Grade |

The corpus also contains transcripts, annotations, annotation tools and documentation.

The documentation includes the complete segmentation and transcription guidelines, descriptions of the variables and style codes used in the variable survey, demographic information plus Labov's notes about each speaker, and an instruction manual for using the corpus tools.

*Updates*

None at this time.

*Sponsorship*

The SLX corpus was funded in part through a five-year grant (BCS-998009, KDI, SBE) from the National Science Foundation via TalkBank, an interdisciplinary project to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data. Additional funding was provided by the Linguistic Data Consortium.

*Note*

The cost of the first 100 copies of this publication (not counting the copies distributed to LDC members) is covered by NSF Grant Number BCS-998009, and therefore free of charge. After these first 100 copies are distributed, additional copies will be available for the production cost of $100.
- hasPart: Data and Annotations for Sociolinguistics (DASL) Project
C-001095: SUSAS Transcripts
*Introduction*

SUSAS (Speech Under Simulated and Actual Stress) Transcripts was developed by the Linguistic Data Consortium and consists of transcribed English speech by helicopter pilots. The speech data in this release is a subset of the data in SUSAS (LDC99S78), created by the Robust Speech Processing Laboratory at the University of Colorado-Boulder under the direction of Professor John H. L. Hansen and sponsored by the Air Force Research Laboratory.

*Data*

The transcripts in this release cover several speech files in the SUSAS collection, specifically, speech from four Apache helicopter pilots.

The SUSAS speech database is partitioned into four domains, encompassing a wide variety of stresses and emotions. A total of 32 speakers (13 female, 19 male), with ages ranging from 22 to 76 years, were employed to generate in excess of 16,000 utterances.

A common, highly confusable vocabulary set of 35 aircraft communication words makes up the database. All speech tokens were sampled using a 16-bit A/D converter at a sample rate of 8kHz.

*Updates*

There are no updates at this time.
C-001096: SUSAS
*Introduction*

Speech Under Simulated and Actual Stress (SUSAS) was created by the Robust Speech Processing Laboratory at the University of Colorado-Boulder under the direction of Professor John H. L. Hansen and sponsored by the Air Force Research Laboratory.

*Data*

The database is partitioned into four domains, encompassing a wide variety of stresses and emotions. A total of 32 speakers (13 female, 19 male), with ages ranging from 22 to 76 years, were employed to generate in excess of 16,000 utterances.

SUSAS also contains several longer speech files from four Apache helicopter pilots. Those helicopter speech files were transcribed by the Linguistic Data Consortium and are available in SUSAS Transcripts (LDC99T33).

A common, highly confusable vocabulary set of 35 aircraft communication words makes up the database. All speech tokens were sampled using a 16-bit A/D converter at a sample rate of 8kHz.

*Updates*

There are no updates at this time.
- hasFormat: C-001095: SUSAS Transcripts
C-001097: Santa Barbara Corpus of Spoken American English Part I
*Introduction*

The Santa Barbara Corpus of Spoken American English is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.

*Data*

The three CD-ROM volumes in Part I contain 14 speech files of between 15 and 30 minutes each, from the Santa Barbara Corpus of Spoken American English. Collected by: University of California, Santa Barbara Center for the Study of Discourse, Director John W. Du Bois (UCSB), Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB). The Santa Barbara Corpus of Spoken American English is part of the International Corpus of English (Charles W. Meyer, Director), representing the American Component.

Each speech file is accompanied by a transcript in which phrases are time stamped with respect to the audio recording. Personal names, place names, phone numbers, etc., in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances and the audio files have been filtered to make these portions of the recordings unrecognizable.

*Samples*

For an example of the data in this corpus, please examine these samples of the recordings and transcripts:

* Speech
* Transcripts

*Updates*

There are no updates at this time.
- hasPart: the International Corpus of English (Charles W. Meyer, Director), representing the American Component
- hasVersion: C-001098: Santa Barbara Corpus of Spoken American English Part II
- hasVersion: C-001099: Santa Barbara Corpus of Spoken American English Part III
C-001098: Santa Barbara Corpus of Spoken American English Part II
*Introduction*

Santa Barbara Corpus of Spoken American English Part II was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003S06, ISBN 1-58563-272-4.

Santa Barbara Corpus of Spoken American English Part II is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.

The corpus was collected by: University of California, Santa Barbara Center for the Study of Discourse (Director: John W. Du Bois (UCSB), Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB)).

Santa Barbara Corpus of Spoken American English Part II is also part of the International Corpus of English (ICE) (Charles W. Meyer, Director), representing the American Component.

For software and additional data resources, please refer to the following sites: TalkBank, International Corpus of English.

Part I of the Santa Barbara Corpus of Spoken American English is also available as LDC2000S85.

*Data*

The audio data consists of 16 wave format speech files, recorded in two-channel pcm, at 22,050Hz. The speech files total ~six hours of audio (1.8GB), representing over 47K-words and over 5K unique words in transcription.

Each speech file is accompanied by two transcripts in which intonation units are time stamped with respect to the audio recording. The two types of transcripts are distinguished by file extension: ".trn" and ".ca". The text and coding content of the two transcripts for a given speech file are identical. However, the transcripts with the ".ca" extension are in the CHAT format for conversational analysis, formatted for use with the CLAN software, available from TalkBank. The transcripts with the ".trn" extension are structured according to the LDC Callhome format, for use with a variety of annotation tools. (Please also note that transcript coding is not presented as in the ICE standard.)

Personal names, place names, phone numbers, etc., in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances and the audio files have been filtered to make these portions of the recordings unrecognizable. Pitch information is still recoverable from these filtered portions of the recordings, but the amplitude levels in these regions have been reduced relative to the original signal. A separate filter list file (*.flt) associated with each transcript/waveform file pair is provided to list the beginning and ending times of the filtered regions. There are 4 .flt files which are empty because there was no information that needed to be filtered out from the audio files.

The filtering was done using a digital FIR low-pass filter, with the cut-off frequency set at 400 Hz. The effect of the filter was gradually faded in and out at the beginning and end of the regions over a 1,000 sample region, roughly 45 milliseconds, to avoid abrupt transitions in the resulting waveform.
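The masking procedure described above can be sketched as follows. This is an illustrative approximation only, not the code used to produce the corpus: the tap count (101), the Hamming window, the placement of the fade ramps inside the filtered region, and all function names are assumptions. Only the 400 Hz cut-off, the 1,000-sample fade and the 22,050 Hz sample rate come from the description.

```python
import math
from typing import List


def lowpass_taps(cutoff_hz: float, fs_hz: float, num_taps: int = 101) -> List[float]:
    """Windowed-sinc (Hamming) low-pass FIR taps, normalised to unity DC gain."""
    fc = cutoff_hz / fs_hz            # cut-off in cycles per sample
    mid = (num_taps - 1) // 2         # centre tap (num_taps assumed odd)
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(x: List[float], taps: List[float]) -> List[float]:
    """Apply the FIR kernel centred on each sample (zero-padded at the edges)."""
    half = len(taps) // 2
    out = []
    for n in range(len(x)):
        acc = 0.0
        for j, t in enumerate(taps):
            i = n + half - j
            if 0 <= i < len(x):
                acc += t * x[i]
        out.append(acc)
    return out


def mask_region(x: List[float], start: int, end: int,
                taps: List[float], fade: int = 1000) -> List[float]:
    """Replace samples in [start, end) with low-passed audio, cross-fading at both ends."""
    y = fir_filter(x, taps)
    out = list(x)
    for n in range(start, end):
        # Mixing weight: 0 at the region edges, rising linearly to 1 after `fade` samples.
        w = min(1.0, (n - start) / fade, (end - 1 - n) / fade)
        out[n] = (1 - w) * x[n] + w * y[n]
    return out


fs = 22050
print(f"fade length: {1000 / fs * 1000:.1f} ms")  # prints "fade length: 45.4 ms"
```

The fade arithmetic matches the "roughly 45 milliseconds" quoted above: 1,000 samples at 22,050 Hz is about 45.4 ms. A 400 Hz cut-off removes most of the spectral detail needed to recognise words while leaving the fundamental frequency of typical voices in the pass band, which is consistent with the statement that pitch information remains recoverable.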

*Acknowledgements*

The completion and release of this corpus was facilitated by funding extended by the TalkBank project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-998009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.

*Updates*

There are no updates available at this time.

*Note*

The cost of the first 100 copies of this publication (not counting the copies distributed to LDC members) is covered by NSF Grant Number BCS-998009, and therefore free of charge; a $30 shipping and handling fee applies. After these first 100 copies are distributed, additional copies will be available for the production cost of $100 per disc.
- hasPart: the International Corpus of English (ICE) (Charles W. Meyer, Director), representing the American Component.
- : C-001097: Santa Barbara Corpus of Spoken American English Part I
- : C-001099: Santa Barbara Corpus of Spoken American English Part III
C-001099: Santa Barbara Corpus of Spoken American English Part III
Santa Barbara Corpus of Spoken American English Part III was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004S10, ISBN 1-58563-308-9.

Santa Barbara Corpus of Spoken American English Part III is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.

The corpus was collected by: University of California, Santa Barbara Center for the Study of Discourse (Director: John W. Du Bois (UCSB), Authors: John W. Du Bois and Robert Englebretson. Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB)).

Santa Barbara Corpus of Spoken American English Part III is also part of the International Corpus of English (ICE) (Charles W. Meyer, Director), representing the American Component.

For software and additional data resources, please refer to the following sites: TalkBank, International Corpus of English.

Part I of the Santa Barbara Corpus of Spoken American English is available as LDC2000S85.

Part II of the Santa Barbara Corpus of Spoken American English is available as LDC2003S06.

*Data*

The audio data consists of 16 wave format speech files, recorded in two-channel pcm, at 22050Hz. The speech files total ~6 hours of audio (1.8GB), representing over 116K-words and over 9K unique words in transcription.

- segment.txt: explanation of the information in segment.tbl
- segment.tbl: collection information about the recordings
- segment_summaries.txt: brief summaries of audio scenarios
- speaker.txt: explanation of the information in speaker.tbl
- speaker.tbl: speaker ethnographic, demographic information
- table.txt: description of file names and informal titles
- annotations.txt: list of conventions and prosodic annotations

The transcripts are in the following format:

.trn format structure:

2.660 2.805 JOANNE: But,
2.805 4.685 so these slides be real interesting.
6.140 6.325 KEN: ... Yeah.
6.325 7.710 I think it'll be real interesting

A sample transcript file may be found here.
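The time-stamped layout of the .trn sample above (start time, end time, an optional speaker label, then the text of the unit) can be read with a few lines of code. This is a minimal sketch under stated assumptions: fields are taken to be whitespace-separated, speaker labels are taken to be upper-case tokens ending in a colon, and unlabeled lines are attributed to the most recent speaker. The names `Unit`, `LINE_RE` and `parse_trn` are hypothetical, and the distributed files may differ in detail.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Unit(NamedTuple):
    start: float            # start time in seconds
    end: float              # end time in seconds
    speaker: Optional[str]  # carried over from the last labeled line
    text: str


# start time, end time, optional "SPEAKER:" label, remaining text
LINE_RE = re.compile(r"^\s*(\d+\.\d+)\s+(\d+\.\d+)\s+(?:([A-Z][A-Z0-9_]*):\s*)?(.*)$")


def parse_trn(lines: Iterable[str]) -> List[Unit]:
    """Parse time-stamped transcript lines into Unit records."""
    units: List[Unit] = []
    current: Optional[str] = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not follow the assumed layout
        start, end, speaker, text = m.groups()
        if speaker:
            current = speaker
        units.append(Unit(float(start), float(end), current, text.strip()))
    return units


sample = [
    "2.660 2.805 JOANNE: But,",
    "2.805 4.685 so these slides be real interesting.",
    "6.140 6.325 KEN: ... Yeah.",
    "6.325 7.710 I think it'll be real interesting",
]
for u in parse_trn(sample):
    print(f"{u.start:7.3f} {u.end:7.3f} {u.speaker}: {u.text}")
```

Carrying the speaker forward is a design choice made here so that every unit can be attributed without looking back through the file; drop that step if unlabeled lines should stay unattributed.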

Personal names, place names, phone numbers, etc., in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances and the audio files have been filtered to make these portions of the recordings unrecognizable. Pitch information is still recoverable from these filtered portions of the recordings, but the amplitude levels in these regions have been reduced relative to the original signal. A separate filter list file (*.flt) associated with each transcript/waveform file pair is provided to list the beginning and ending times of the filtered regions. The file sbc040.flt is empty indicating there was no personal information to filter out.

The filtering was done using a digital FIR low-pass filter, with the cut-off frequency set at 400 Hz. The effect of the filter was gradually faded in and out at the beginning and end of the regions over a 1,000 sample region, roughly 45 milliseconds, to avoid abrupt transitions in the resulting waveform.

For a complete listing of the files, please see file.tbl in the docs directory.

*Acknowledgements*

The completion and release of this corpus was facilitated by funding extended by the TalkBank project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-998009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.

Produced at the LDC by Nii Martey.

*Updates*

Additional information, updates, bug fixes may be available in the LDC catalog entry for this corpus at LDC2003S06.

*Note*

The cost of the first 100 copies of this publication (not counting the copies distributed to LDC members) is covered by NSF Grant Number BCS-998009, and therefore free of charge to qualified researchers; a $30 shipping and handling fee applies. After these first 100 copies are distributed, additional copies will be available for the production cost of $200 per DVD-ROM.
C-001100: Penn Treebank Online
To use this interface, you need to know the tgrep query syntax and be familiar with the tgrep options. Tgrep search is currently unavailable. This search system works with the Penn Treebank Project (http://www.cis.upenn.edu/~treebank/).
- hasPart: Brown Corpus (Treebank I, with POS tags)
- hasPart: Wall Street Journal (Treebank II, with POS tags)
- hasPart: Wall Street Journal (Treebank II, w/o POS tags)
- hasPart: Wall Street Journal (Treebank I, with POS tags)
- hasPart: Switchboard transcripts (Treebank II, w/o POS tags)
- isRequiredBy: C-001547: Treebank-3
C-001101: athelstan
The source for books and software related to corpus linguistics and computer-assisted language learning.
C-001105: GlobalPhone Arabic
Desktop/Microphone
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Vietnamese (ELRA-S0322).

In each language, about 100 sentences were read by each of about 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6) and the same recording equipment for all languages. The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 1900 native adult speakers.

Data is shortened (compressed) by means of the shorten program written by Tony Robinson, available from Softsound's web page (http://www.softsound.com/), for use on Linux distributions or emulated environments such as Cygwin. Alternatively, the data can be delivered unshortened.

The Arabic corpus was produced using the Assabah newspaper. It contains recordings of 78 speakers (35 males, 43 females) recorded in Tunisia, Palestine and Jordan. The following age distribution has been obtained: 20 speakers are below 19, 35 speakers are between 20 and 29, 13 speakers are between 30 and 39, 6 speakers are between 40 and 49, and 4 speakers are over 50.
- hasVersion: C-001106: GlobalPhone Chinese-Mandarin
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa
C-001106: GlobalPhone Chinese-Mandarin
Desktop/Microphone
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Vietnamese (ELRA-S0322).

In each language, about 100 sentences were read by each of about 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6) and the same recording equipment for all languages. The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 1900 native adult speakers.

Data is shortened (compressed) by means of the shorten program written by Tony Robinson, available from Softsound's web page (http://www.softsound.com/), for use on Linux distributions or emulated environments such as Cygwin. Alternatively, the data can be delivered unshortened.

The Chinese-Mandarin corpus was produced using the People's Daily newspaper. It contains recordings of 132 speakers (64 males, 68 females) recorded in Beijing, Wuhan and Hekou, China. The following age distribution has been obtained: 16 speakers are below 19, 96 speakers are between 20 and 29, 16 speakers are between 30 and 39, and 3 speakers are between 40 and 49 (the age of 1 speaker is unknown).
- hasVersion: C-001105: GlobalPhone Arabic
- hasVersion: C-001116: GlobalPhone Spanish (Latin American)
- hasVersion: C-001107: GlobalPhone Chinese-Shanghai
- hasVersion: C-001108: GlobalPhone Croatian
- hasVersion: C-001109: GlobalPhone Czech
- hasVersion: C-001111: GlobalPhone German
- hasVersion: C-001112: GlobalPhone Japanese
- hasVersion: C-001113: GlobalPhone Korean
- hasVersion: C-001114: GlobalPhone Portuguese (Brazilian)
- hasVersion: C-001115: GlobalPhone Russian
- hasVersion: C-001110: GlobalPhone French
- hasVersion: C-001117: GlobalPhone Swedish
- hasVersion: C-001118: GlobalPhone Tamil
- hasVersion: C-001119: GlobalPhone Turkish
- hasVersion: C-004336: GlobalPhone Thai
- hasVersion: C-004337: GlobalPhone Polish
- hasVersion: C-004338: GlobalPhone Vietnamese
- hasVersion: C-004339: GlobalPhone Bulgarian
- hasVersion: C-004340: GlobalPhone Hausa
