C-004711: CSLU: S4X Release 1.2
*Introduction*
CSLU: S4X Release 1.2, Linguistic Data Consortium (LDC) catalog number LDC2009S03 and ISBN 1-58563-523-5, was created by the Center for Spoken Language Understanding, Oregon Health and Science University (CSLU). The corpus consists of 36 speakers (22 male, 14 female) uttering 11 specified words.
The speakers repeated the following words six times on each of four channels: startrek, supernova, tektronix, generation, nebula, processing, singularity, 71523, abracadabra, sungeeta and computer. The four channels used were office phone, home phone, carbon microphone telephone and speaker phone. Each speech file has a corresponding time-aligned phoneme-level transcription (achieved using automatic forced alignment) and an automatically-generated word-level transcription.
Humans reviewed each utterance in two passes and classified it as good, bad, noisy or different. The results of this verification process are included in the /docs directory.
*Data*
The data was recorded with the CSLU T1 digital data collection system. Each utterance is recorded as a separate file. These files were sampled at 8 kHz, 8-bit, and stored as u-law files. All of the data use the RIFF standard file format, in which the audio is stored as 16-bit linearly encoded samples.
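As a minimal sketch of working with these files, the following Python fragment inspects one utterance with the standard-library wave module, assuming the RIFF/16-bit linear encoding described above; the filename is borrowed from the sample below and the .wav extension is an assumption.

    import wave

    # Open one S4X utterance (hypothetical path) and confirm the format
    # described above: mono, 8000 Hz, 16-bit linear samples in a RIFF file.
    with wave.open("SD-1030-computer-t3-67.wav", "rb") as w:
        print("channels:    ", w.getnchannels())   # expected: 1
        print("sample rate: ", w.getframerate())   # expected: 8000
        print("bytes/sample:", w.getsampwidth())   # expected: 2 (16-bit linear)
        pcm = w.readframes(w.getnframes())         # raw PCM payload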
*Samples*
For an example of the data in this corpus, please listen to this recording of a subject speaking the word 'computer': SD-1030-computer-t3-67.
C-004716: 2007 NIST Language Recognition Evaluation Test Set
*Introduction*
2007 NIST Language Recognition Evaluation Test Set consists of 66 hours of conversational telephone speech segments in the following languages and dialects: Arabic, Bengali, Chinese (Cantonese), Mandarin Chinese (Mainland, Taiwan), Chinese (Min), English (American, Indian), Farsi, German, Hindustani (Hindi, Urdu), Korean, Russian, Spanish (Caribbean, non-Caribbean), Tamil, Thai and Vietnamese.
The goal of the NIST (National Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted three previous language recognition evaluations, in 1996, 2003 and 2005. The most significant differences between those evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation and the variety of evaluation conditions. Thus, in 2007, given a segment of speech and a language of interest to be detected (i.e., a target language), the task was to decide whether that target language was in fact spoken in the given telephone speech segment (yes or no), based on an automated analysis of the data contained in the segment. Further information regarding this evaluation can be found in the evaluation plan which is included in the documentation for this release.
The training data for LRE 2007 consists of the following:
* 2003 NIST Language Recognition Evaluation, LDC2006S31. This material is comprised of: (1) approximately 46 hours of conversational telephone speech segments in the target languages and dialects and (2) the 1996 LRE test data (conversational telephone speech in Arabic (Egyptian colloquial), English (General American, Southern American), Farsi, French, German, Hindi, Japanese, Korean, Mandarin Chinese (Mainland, Taiwan), Spanish (Caribbean, non-Caribbean), Tamil and Vietnamese).
* 2005 NIST Language Recognition Evaluation, LDC2008S05. This release consists of approximately 44 hours of conversational telephone speech in English (American, Indian), Hindi, Japanese, Korean, Mandarin Chinese (Mainland, Taiwan), Spanish (Mexican) and Tamil.
* Supplemental training data to be released by LDC in late 2009, 2007 NIST Language Recognition Evaluation Supplemental Training Data, LDC2009S05.
*Data*
Each speech file in the test data is one side of a 4-wire telephone conversation represented in 8-bit, 8-kHz mu-law format. There are 7530 speech files in SPHERE (.sph) format for a total of 66 hours of speech. The speech data was compiled from LDC's CALLFRIEND, Fisher Spanish and Mixer 3 corpora and from data collected by Oregon Health and Science University, Beaverton, Oregon.
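As a hedged illustration (not an official tool), a NIST SPHERE header of the usual fixed-size ASCII form can be read in Python along these lines; field names such as sample_rate follow the SPHERE convention, and the filename is hypothetical:

    # Read the ASCII header of a .sph file. SPHERE headers begin with the
    # magic "NIST_1A", then the header size in bytes (typically 1024), then
    # "name type value" lines terminated by "end_head".
    def read_sphere_header(path):
        with open(path, "rb") as f:
            f.readline()                       # b"NIST_1A"
            size = int(f.readline().strip())   # total header size in bytes
            f.seek(0)
            text = f.read(size).decode("ascii", errors="replace")
        fields = {}
        for line in text.splitlines()[2:]:
            if line.strip() == "end_head":
                break
            name, ftype, value = line.split(None, 2)
            fields[name] = int(value) if ftype == "-i" else value
        return fields

    # e.g. read_sphere_header("seg_0001.sph")["sample_rate"] -> 8000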
The test segments contain three nominal durations of speech: 3 seconds, 10 seconds and 30 seconds. Actual speech durations vary, but were constrained to be within the ranges of 2-4 seconds, 7-13 seconds and 23-35 seconds, respectively. Non-speech portions were included in each segment so that every segment contained a continuous sample of the source recording. Therefore, the test segments may be significantly longer than the speech duration, depending on how much non-speech was included. Unlike previous evaluations, the nominal duration of each test segment was not identified.
*Samples*
For an example of the data in this corpus, please listen to this audio sample.
C-004717: 2007 NIST Language Recognition Evaluation Supplemental Training Set
*Introduction*
2007 NIST Language Recognition Evaluation Supplemental Training Set consists of 118 hours of conversational telephone speech segments in the following languages and dialects: Arabic (Egyptian colloquial), Bengali, Min Nan Chinese, Wu Chinese, Taiwan Mandarin, Cantonese, Russian, Mexican Spanish, Thai, Urdu and Tamil.
The goal of the NIST (National Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted three previous language recognition evaluations, in 1996, 2003 and 2005. The most significant differences between those evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation and the variety of evaluation conditions. Thus, in 2007, given a segment of speech and a language of interest to be detected (i.e., a target language), the task was to decide whether that target language was in fact spoken in the given telephone speech segment (yes or no), based on an automated analysis of the data contained in the segment.
The supplemental training material in this release consists of the following:
* Approximately 53 hours of conversational telephone speech segments in Arabic (Egyptian colloquial), Bengali, Cantonese, Min Nan Chinese, Wu Chinese, Russian, Thai and Urdu. This material is taken from LDC's CALLHOME, CALLFRIEND and Mixer collections.
* Approximately 65 hours of full telephone conversations in Mandarin Chinese (Taiwan), Spanish (Mexican) and Tamil. This material was collected by Oregon Health and Science University (OHSU), Beaverton, Oregon. The test segments used in the 2005 NIST Language Recognition Evaluation were derived from these full conversations.
In addition to the supplemental material contained in this release, the training data for the 2007 NIST Language Recognition Evaluation consisted of data from previous LRE evaluation test sets, namely, 2003 NIST Language Recognition Evaluation and 2005 NIST Language Recognition Evaluation.
*Samples*
For an example of the data in this corpus, please listen to this sample of the Egyptian Arabic data from the data set.
C-004723: Fisher Spanish Speech
*Introduction*
Fisher Spanish - Speech was developed by the Linguistic Data Consortium (LDC) and consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers. Full orthographic transcripts of these audio files are available in Fisher Spanish - Transcripts (LDC2010T04).
The Fisher telephone conversation collection protocol was created at LDC to address a critical need of developers trying to build robust automatic speech recognition (ASR) systems. Previous collection protocols, such as CALLFRIEND and Switchboard-II, and the resulting corpora have been adapted for ASR research but were in fact developed for language identification and speaker identification, respectively. Although the CALLHOME protocol and corpora were developed to support ASR technology, they feature small numbers of speakers making telephone calls of relatively long duration with narrow vocabulary across the collection. CALLHOME conversations are challengingly natural and intimate. Under the Fisher protocol, a very large number of participants each make a few calls of short duration, speaking to other participants, whom they typically do not know, about assigned topics. This maximizes inter-speaker variation and vocabulary breadth while also increasing formality.
Previous protocols such as CALLHOME, CALLFRIEND and Switchboard relied upon participant activity to drive the collection. Fisher is unique in being platform-driven rather than participant-driven. Participants who wish to initiate a call may do so; however, the collection platform initiates the majority of calls. Participants need only answer their phones at the times they specified when registering for the study.
To encourage a broad range of vocabulary, Fisher participants are asked to speak on an assigned topic which is selected at random from a list, which changes every 24 hours and which is assigned to all subjects paired on that day. Some topics are inherited or refined from previous Switchboard studies while others were developed specifically for the Fisher protocol.
In collecting data for this corpus, attempts were made to provide a representative distribution of subjects across a variety of demographic categories including: gender, age, dialect region, and education level.
This corpus joins other Fisher corpora: Arabic CTS Levantine Fisher Training Data Set 3 (LDC2005S07, LDC2005T03), Fisher English Training Part 2 (LDC2005S13, LDC2005T19), Fisher English Training Speech Part 1 (LDC2004S13, LDC2004T19), and Fisher Levantine Arabic Conversational Telephone Speech (LDC2007S02, LDC2007T04).
*Data*
The speech recordings consist of 819 telephone conversations of 10 to 12 minutes in duration. They are provided as digital audio files in NIST SPHERE format (1024-byte ASCII file headers). The conversations were recorded as 2-channel mu-law sample data with 8000 samples per second (as captured from the public telephone network).
The accompanying transcript files (available in Fisher Spanish - Transcripts (LDC2010T04)) are in plain-text, tab-delimited format (tdf) with UTF-8 character encoding. They were created with the LDC-developed transcription tool XTrans, which allowed for improved handling of multi-channel audio and overlapping speakers. XTrans is available from LDC.
Transcribers followed LDC's Transcription Guidelines (NQTR), which are included with the documentation for this release.
The first line of each transcript file provides the column headings; the next two lines are comments that can be ignored (these are used by XTrans and are distinguished from non-comment lines by an initial semicolon). Actual transcript data, with time stamps, channel number, transcript text and additional information, begins at line 4 of each transcript file.
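As a minimal, non-authoritative sketch, a transcript with the layout just described could be read in Python like this (the filename is hypothetical):

    # Read a Fisher Spanish .tdf transcript: line 1 holds the tab-separated
    # column headings, XTrans comment lines begin with a semicolon, and
    # data rows begin at line 4.
    def read_tdf(path):
        with open(path, encoding="utf-8") as f:
            lines = [ln.rstrip("\n") for ln in f]
        header = lines[0].split("\t")
        rows = [ln.split("\t") for ln in lines[1:]
                if ln and not ln.startswith(";")]
        return header, rows

    header, rows = read_tdf("fsp_example.tdf")   # hypothetical filename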
Native speakers of Caribbean Spanish and non-Caribbean Spanish were recruited from within the continental United States and Puerto Rico. The following tables provide an overview of the demographics of the participants. The Subjects Table file, provided in the documentation, may be used to answer questions about specific combinations of participant characteristics (including level of participation).
Country Raised          Participants
U.S.A.                  47
Argentina               20
Mexico                  14
Colombia                11
Chile                   7
Puerto Rico             6
Spain                   5
Peru                    5
Venezuela               3
Canada                  3
Panama                  3
Guatemala               3
Paraguay                2
Cuba                    1
Honduras                1
Uruguay                 1
Bolivia                 1
Dominican Republic      1
Switzerland             1
Ecuador                 1

Conversation Sides      Participants
1                       6
2                       5
3                       4
4                       3
5                       3
6                       2
7                       2
8                       1
9                       1
10                      13
11                      10
12                      9
13                      8
14                      8
15                      7
16                      7
17                      7
18                      7
19                      7
20                      6
21                      5
22                      5
23                      5
24                      5

Years Education         Participants
2                       1
4                       2
5                       2
6                       1
11                      1
12                      15
13                      7
14                      16
15                      12
16                      25
17                      10
18                      16
19                      3
20                      9
21                      2
22                      6
23                      4
24                      1
25                      2
28                      1

Dialect                 Participants
Non-Caribbean           91
Caribbean               45

Age Group               Participants
Young                   23
Middle                  106
Old                     7

Sex                     Participants
Female                  84
Male                    52
*Samples*
Please examine this excerpt (converted to wav format for simpler online distribution) for an example of the data in this corpus.
- hasFormat: N-004722: Fisher Spanish - Transcripts
C-004725: WTIMIT 1.0
*Introduction*
WTIMIT 1.0 is a wideband mobile telephony derivative of the TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT, LDC93S1). TIMIT contains wideband speech recordings (i.e., sampled at 16 kHz) of 630 speakers of American English from eight major dialect regions, each reading ten phonetically rich sentences. The TIMIT speech corpus was completed in 1993 and was intended for acoustic-phonetic studies as well as for development and evaluation of automatic speech recognition (ASR) systems. Since then, five TIMIT derivatives have been developed: FFMTIMIT, NTIMIT, CTIMIT, HTIMIT, and STC-TIMIT. The FFMTIMIT (LDC96S32) corpus (Free-Field Microphone TIMIT) consists of the original TIMIT material as recorded by a free-field microphone. NTIMIT (LDC93S2) (Network TIMIT) serves as a telephone bandwidth adjunct to TIMIT, containing its speech files transmitted over a telephone handset and the NYNEX telephone network under a large variety of channel conditions. For the cellular bandwidth speech corpus CTIMIT (LDC96S30), the original TIMIT recordings were passed through cellular telephone circuits. The HTIMIT (LDC98S67) corpus (Handset TIMIT) offers a TIMIT subset of 192 male and 192 female speakers recorded through different telephone handsets for the study of telephone transducer effects on speech. For the single-channel telephone corpus STC-TIMIT (LDC2008S03), the TIMIT recordings were sent through a real and, in contrast to NTIMIT, single telephone channel.
While some of these derivative TIMIT corpora consist of wideband speech, others are telephony corpora representing narrowband speech, i.e., sampled at 8 kHz and containing frequency components from about 300 Hz to 3.4 kHz. Until now, no real-world wideband telephony speech corpus has been publicly available. Thanks to wideband speech codecs such as G.722, G.722.1, G.722.2 (i.e., Adaptive Multi-Rate Wideband, AMR-WB), and G.711.1, wideband telephony speech transmission is now feasible, even in an increasing number of mobile networks. Hence, a wideband telephone bandwidth adjunct to TIMIT is desirable for a wide range of scientific investigations, as well as for the development and evaluation of systems, e.g., Interactive Voice Response (IVR) systems. WTIMIT 1.0 (Wideband Mobile TIMIT) contains the recordings of the original TIMIT speech files after transmission over a real 3G AMR-WB mobile network.
WTIMIT 1.0 is organized according to the original TIMIT corpus. The training subset consists of 4620 speech files, while the test subset contains 1680 speech files. The speech format of the WTIMIT corpus is raw (i.e., no header information) and specified as follows:
* 16 kHz sampling rate
* 16 bit, 1-channel linear PCM sampling format
* little-endian byte order
* signed
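Given that specification, a minimal loading sketch in Python (with a hypothetical filename) might look as follows:

    import numpy as np

    # Raw WTIMIT file: headerless 16-bit signed little-endian PCM at 16 kHz.
    samples = np.fromfile("wtimit_example.raw", dtype="<i2")
    duration_seconds = samples.size / 16000.0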
*Data*
Data preparation was conducted by converting the original TIMIT speech files into raw data (i.e., dropping the first 1024 bytes of header information) and concatenating them into 11 signal chunks of at most 30 minutes duration. In order to allow precise de-concatenation after transmission, and in order to be able to examine codec influence and channel distortion, each signal chunk is preceded by a 4 s calibration tone, comprising 2 s of a 1 kHz sine wave followed by 2 s of a linear sweep from 0 to 8 kHz. The prepared speech chunks were stored on a laptop PC for transmission over T-Mobile's AMR-WB-capable 3G mobile network in The Hague, The Netherlands.
At the sending end, the speech chunks were played back by a laptop PC. Via an IEEE 1394 link (FireWire), the data was transmitted digitally to an external DAC (digital-to-analog converter) of type RME Fireface 400. The analog signal was then fed electrically into the microphone input of the transmitting Nokia 6220 mobile phone. For this purpose, an audio quality test cable for Nokia mobile phones was used. Prior to the actual transmission, the output attenuation of the DAC was adjusted so as to prevent analog saturation at the input circuit of the phone while ensuring optimal dynamic range. Furthermore, a call to the phone at the receiving end, a second mobile phone of type Nokia 6220, was established for each speech chunk separately. Using the field test monitoring software of the phones, we confirmed that they were situated in different network cells at all times during transmission; moreover, we verified that the proper speech codec, AMR-WB at a constant data rate of 12.65 kbit/s, was being employed. Note that this bitrate is by far the most widely used one. Furthermore, the internal microphone equalization of the transmitting mobile phone was switched off.
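As an illustrative sketch (not the authors' exact generation code), the calibration tone described above can be synthesized with NumPy:

    import numpy as np

    fs = 16000                                   # 16 kHz sampling rate
    t = np.arange(2 * fs) / fs                   # 2 s time axis
    sine = np.sin(2 * np.pi * 1000.0 * t)        # 2 s of a 1 kHz sine
    # Linear sweep 0 -> 8 kHz over 2 s: f(t) = 4000*t, phase = 2*pi*2000*t^2
    sweep = np.sin(2 * np.pi * 2000.0 * t**2)
    calibration_tone = np.concatenate([sine, sweep])   # 4 s total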
At the receiving end, the analog headphone output of the receiving mobile phone was connected electrically to an ADC (analog-to-digital converter) of type RME Fireface 400. The analog input gain of the latter device was adjusted once initially to exploit the dynamic range of the ADC. Sampling was performed at a rate of 48 kHz, the native sampling rate of the ADC, and with 16 bit precision. The digital speech signals were transferred to a laptop PC again via an IEEE 1394 link and recorded onto a hard drive. The transmitted speech chunks were decimated from 48 kHz to 16 kHz sampling rate using a high-quality lowpass filter. Finally, they were de-concatenated by maximizing the cross-correlation between them and the original speech files. We followed the de-concatenation methodology of STC-TIMIT, as described in STC-TIMIT: Generation of a Single-channel Telephone Corpus, in order to assure a precise sample alignment to the TIMIT speech files. Hence, utterances in WTIMIT 1.0 can be considered to be time-aligned with an average precision of 0.0625 ms (one sample) with those of TIMIT. Basically, TIMIT's original label files (*.TXT, *.WRD, *.PHN) are valid for WTIMIT as well. However, misalignments of about 10 to 20 ms were found to be frequently produced by the channel mainly during speech pauses. Parts of the affected speech files are therefore slightly misaligned against the original label information. These channel effects may be related to the packet switching domain in the UMTS Core Network. Depending on the traffic load in the network, packets are buffered and queued, which results in a variable packet delay (jitter).
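As a hedged sketch of the de-concatenation idea (not the authors' exact procedure), the lag that best aligns a transmitted chunk with its original can be found by cross-correlation:

    import numpy as np
    from scipy.signal import correlate

    # Returns the delay (in samples) of `transmitted` relative to `original`;
    # with mode="full", index k corresponds to lag k - (len(original) - 1).
    def best_lag(original, transmitted):
        corr = correlate(transmitted, original, mode="full")
        return int(np.argmax(corr)) - (len(original) - 1)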
If you have any problems, questions or suggestions concerning WTIMIT, please send a brief email to Tim Fingscheidt (Technische Universität Braunschweig, Braunschweig, Germany): fingscheidt@ifn.ing.tu-bs.de.
*Samples*
Please examine the following samples for an example of the data in this corpus (raw audio has been converted to wav for purposes of demonstration):
* Audio File
* Text
* Words
* Phonemes
*Acknowledgement*
The authors would like to thank Mr. Dirk Kistowski-Cames, Deutsche Telekom AG, Bonn, Germany, for providing general project support and SIM cards, and Mr. Petri Lang, T-Mobile NL, The Hague, The Netherlands, for local support and SIM cards. Thanks also to Mr. Panu Nevala, Nokia, Oulu, Finland, for providing the prepared mobile phones, which in that form are not available on the market.
This work was funded by the German Research Foundation (DFG) under grant no. FI 1494/2-1.
C-004727: 2003 NIST Speaker Recognition Evaluation
*Introduction*
2003 NIST Speaker Recognition Evaluation was developed by researchers at NIST (National Institute of Standards and Technology). It consists of just over 120 hours of English conversational telephone speech used as training data and test data in the 2003 Speaker Recognition Evaluation (SRE), along with evaluation metadata and test set answer keys.
2003 NIST Speaker Recognition Evaluation is part of an ongoing series of yearly evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text independent speaker recognition. To this end the evaluation was designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible to those wishing to participate.
This speaker recognition evaluation focused on the task of 1-speaker and 2-speaker detection, in the context of conversational telephone speech. The evaluation was designed to foster research progress, with the goals of:
* Exploring promising new ideas in speaker recognition.
* Developing advanced technology incorporating these ideas.
* Measuring the performance of this technology.
The original evaluation consisted of three parts: 1-speaker detection "limited data", 2-speaker detection "limited data", and 1-speaker detection "extended data". This corpus contains training and test data and supporting metadata (including answer keys) for only the 1-speaker "limited data" and 2-speaker "limited data" components of the original evaluation. The 1-speaker "extended data" component of the original evaluation (not included in this corpus) provided metadata only, to be used in conjunction with data from Switchboard-2 Phase II (LDC99S79) and Switchboard-2 Phase III Audio (LDC2002S06). The metadata (resources and answer keys) for the 1-speaker "extended data" component of the original 2003 SRE evaluation are available from the NIST Speech Group website for the 2003 Speaker Recognition Evaluation. See the original evaluation plan, included with the documentation for this corpus, for more detailed information.
*Data*
The data in this corpus is a 120-hour subset of data first made available to the public as Switchboard Cellular Part 2 Audio (LDC2004S07), reorganized (as described below) specifically for use in the 2003 NIST SRE. For details on data collection methodology, see the documentation for the above corpus.
In the 1-speaker "limited data" component, concatenated turns of a single side of a conversation were presented. In the 2-speaker "limited data" component, the two sides of a conversation were summed together, and both the model speaker and that speaker's conversation partner were represented in the resulting audio file.
For the 1-speaker "limited data" component, 2 minutes of concatenated turns from a single conversation were used for training, and 15-45 seconds of concatenated turns from a 1-minute excerpt of conversation were used for testing.
For the 2-speaker "limited data" component, three whole conversations per participant (minus some introductory comments) were used for training, and 1-minute conversation excerpts were used for testing. In the two-speaker detection task, the evaluation participant was required to separate the speech of the two speakers and then decide (correctly) which side is the model speaker. To make this challenge feasible, the training conversations were chosen so that all speakers other than the model speaker were represented in only one conversation. Thus the model speaker, who is represented in all three conversations, is the only speaker to be represented in more than one conversation.
*Samples*
For an example of the data in this corpus, please examine this audio excerpt.
*Updates*
No updates have been issued at this time.
C-004735: Asian Elephant Vocalizations
*Introduction*
Asian Elephant Vocalizations, Linguistic Data Consortium (LDC) catalog number LDC2010S05 and ISBN 1-58563-557-X, consists of 57.5 hours of audio recordings of vocalizations by Asian Elephants (Elephas maximus) in the Uda Walawe National Park, Sri Lanka, of which 31.25 hours have been annotated. Voice recording field notes were made by Shermin de Silva and Ashoka Ranjeewa of the Uda Walawe Elephant Research Project. The collection and annotation of the recordings was conducted and overseen by Shermin de Silva through the University of Pennsylvania Department of Biology and Institute for Research in Cognitive Science. The recordings primarily feature adult female and juvenile elephants. Existing knowledge of acoustic communication in elephants is based mostly on African species (Loxodonta africana and Loxodonta cyclotis). There has been comparatively less study of communication in Asian elephants, primarily because the habitat in which Asian elephants typically live makes them more difficult to study than African forest elephants. For other current elephant vocalization research, see ElephantVoices and the Cornell Lab of Ornithology's Elephant Listening Project.
This corpus is intended to enable researchers in acoustic communication to evaluate acoustic features and repertoire diversity of the recorded population. Of particular interest is whether there may be regional dialects that differ among Asian elephant populations in the wild and in captivity. A second interest is in whether structural commonalities exist between this and other species that shed light on underlying social and ecological factors shaping communication systems.
*Methods*
*Study site and subjects*
Uda Walawe National Park (UWNP), Sri Lanka, is located at latitude 6°30′14.0646″N, longitude 80°54′28.1268″E, at an average altitude of 118 m above sea level. It occupies 308 km² and contains tall grassland, dense scrub, riparian forest, secondary forest, rivers and seasonal streams. It also contains several natural and man-made water sources and reservoirs with seasonal floodplains. There are two monsoons per calendar year, separated by dry seasons of variable length. Over 300 adult females have been individually identified in UWNP using characteristics of the ears, tail, and other natural markings (Moss, 1996).
*Data collection*
Data were collected from May 2006 to December 2007. Observations were performed by vehicle during park hours, from 0600 to 1830 h. Most recordings of vocalizations were made using an Earthworks QTC50 microphone shock-mounted inside a Rycote Zeppelin windshield, via a Fostex FR-2 field recorder (24-bit sample size, sampling rate 48 kHz) connected to a 12 V lead acid battery. Recordings were initiated at the start of a call with a 10 s pre-record buffer so that the entire call was captured and loss of rare vocalizations minimized. This was made possible with the pre-record feature of the Fostex, which records continuously but only saves the file, with a 10-second lead, once the record button is depressed. In order to minimize loss of low-frequency or potentially inaudible calls, recording was continued for at least three minutes following the end of vocalization events. During the first two months, hour-long recording sessions were also carried out opportunistically while in close proximity to a group. However, spectrograms showed that few vocalizations were captured; therefore, this practice was discontinued.
*Anomalies*
Some audio files have 1 channel (field recording) and some have 2 channels (field recordings and field notes).
Certain files were recorded at 22050 Hz sample rate:
* asian_elephant_voc_d1/data/20070209/B13h00m34s09feb2007y.flac
* asian_elephant_voc_d1/data/20070209/B13h10m04s09feb2007y.flac
* asian_elephant_voc_d2/data/20070405/B14h56m48s05apr2007y.flac
* asian_elephant_voc_d2/data/20070409/B14h35m11s09apr2007y.flac
* asian_elephant_voc_d2/data/20070409/B14h38m34s09apr2007y.flac
* asian_elephant_voc_d2/data/20070409/B14h39m27s09apr2007y.flac
Certain files were recorded at 16 bits per sample:
* asian_elephant_voc_d1/data/20070209/B13h00m34s09feb2007y.flac
* asian_elephant_voc_d1/data/20070209/B13h10m04s09feb2007y.flac
* asian_elephant_voc_d2/data/20070405/B14h56m48s05apr2007y.flac
* asian_elephant_voc_d2/data/20070409/B14h35m11s09apr2007y.flac
* asian_elephant_voc_d2/data/20070409/B14h38m34s09apr2007y.flac
* asian_elephant_voc_d2/data/20070409/B14h39m27s09apr2007y.flac
* asian_elephant_voc_d3/data/20070507/B08h37m21s07may2007y.flac
* asian_elephant_voc_d4/data/20070822/B08h44m02s22aug2007y.flac
* asian_elephant_voc_d4/data/20070822/B08h48m02s22aug2007y.flac
* asian_elephant_voc_d4/data/20071015/B12h25m22s15oct2007y.flac
* asian_elephant_voc_d4/data/20071015/B12h59m51s15oct2007y.flac
* asian_elephant_voc_d5/data/20071024/B16h12m29s24oct2007y.flac
One file contains audio extracted from a video recording at 16-bit, 32 kHz. This file may overlap with other audio recordings, but was used to aid annotation because of the density of vocalizations and the number of vocalizing individuals:
* asian_elephant_voc_d1/data_from_video/20070724/20070724_g01_vocs.flac
*Audio data annotation*
Certain audio files were manually annotated, to the extent possible, with call type (see below for a list of categories), caller id, and miscellaneous notes. Annotations were made using the Praat TextGrid Editor, which allows spectral analysis and annotation of audio files with overlapping events. Annotations were based on written and audio-recorded field notes, and in some cases video recordings. Miscellaneous notes are free-form, and include such information as distance from source, caller identity certainty, and accompanying behavior. Audio files that are included without a corresponding Praat TextGrid annotation file have not yet been annotated.
*Acoustic features*
There are three main categories of vocalizations: those that show clear fundamental frequencies (periodic), those that do not (a-periodic), and those that show periodic and a-periodic regions as at least two distinct segments. Calls were identified as belonging to one of 14 categories:
Call Type               Abbreviation
Growl                   GRW
Squeak                  SQK
Longroar-rumble         LRM
Longroar                LRR
Rumble                  RUM
Bark-rumble             BRM
Trumpet                 TMP
Roar-rumble             RRM
Roar                    ROR
Bark                    BRK
Squeal                  SQL
Croak-rumble            CRM1
Chirp-rumble            CRM2
Musth chirp-rumble      MCR
*Audio compression (FLAC)*
All audio wav files in this corpus have been compressed using FLAC (Free Lossless Audio Codec). Because FLAC is a lossless compression algorithm, the conversion of the included FLAC files into wav files will result in files that are sample-for-sample identical to the original wav file recordings.
Many standard audio tools (including Praat TextGrid Editor) will transparently decompress FLAC files, so that they may be played, processed, and examined as if they were uncompressed audio. Should you wish to explicitly decompress FLAC files (by converting them into wav files), there are many free audio tools capable of performing this conversion. Some such tools, available for all major operating systems, may be found at http://flac.sourceforge.net/download.html
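As one hedged example among the many possible tools, the official flac command-line decoder can batch-convert the corpus files from Python (the directory name is taken from the anomaly list above):

    import pathlib
    import subprocess

    # "flac -d" decodes each .flac file losslessly to a .wav alongside it.
    for f in pathlib.Path("asian_elephant_voc_d1").rglob("*.flac"):
        subprocess.run(["flac", "-d", str(f)], check=True)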
The data in this corpus were used by the corpus author as the foundation of a paper, Acoustic communication in the Asian elephant, Elephas maximus maximus (S. de Silva, Behaviour, Volume 147, Number 7, 2010, pp. 825-852). If you have trouble accessing the paper through the preceding link, you may contact the corpus author directly for assistance.
*Sample*
A sample of data available in this corpus:
* Audio recording
* Praat TextGrid annotation
*Updates*
No updates are available at this time.
C-004741: TIMIT Acoustic-Phonetic Continuous Speech (MS-WAV version)
*Introduction*
This version of the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) has all the waveform files formatted with MS-WAV/RIFF headers, to make the corpus more accessible to a wider audience.
The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST).
The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation.
*Samples*
* phonemes
* transcripts
* audio
* word list
C-004753: 2005 NIST Speaker Recognition Evaluation Training Data
*Introduction*
2005 NIST Speaker Recognition Evaluation Training Data, Linguistic Data Consortium (LDC) catalog number LDC2011S01 and ISBN 1-58563-579-0, was developed at LDC and NIST (National Institute of Standards and Technology). It consists of 392 hours of conversational telephone speech in English, Arabic, Mandarin Chinese, Russian and Spanish and associated English transcripts used as training data in the NIST-sponsored 2005 Speaker Recognition Evaluation (SRE). The ongoing series of yearly SRE evaluations conducted by NIST is intended to be of interest to researchers working on the general problem of text-independent speaker recognition. To that end the evaluations are designed to be simple, to focus on core technology issues, to be fully supported and to be accessible to those wishing to participate.
The task of the 2005 SRE evaluation was speaker detection, that is, to determine whether a specified speaker is speaking during a given segment of conversational speech. The task was divided into 20 distinct and separate tests involving one of five training conditions and one of four test conditions. Further information about the task conditions is contained in The NIST Year 2005 Speaker Recognition Evaluation Plan.
*Data*
The speech data consists of conversational telephone speech with multi-channel data collected simultaneously from a number of auxiliary microphones. The files are organized into two segment types: 10-second two-channel excerpts (continuous segments from single conversations that are estimated to contain approximately 10 seconds of actual speech in the channel of interest) and 5-minute two-channel conversations.
The speech files are stored as 8-bit u-law speech signals in separate SPHERE files. In addition to the standard header fields, the SPHERE header for each file contains some auxiliary information that includes the language of the conversation and whether the data was recorded over a telephone line.
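For illustration only (a hand-rolled sketch of standard ITU-T G.711 mu-law expansion, not an LDC tool), 8-bit u-law samples such as these can be expanded to 16-bit linear values as follows:

    import numpy as np

    # Expand G.711 u-law codes (uint8 array) to 16-bit linear PCM.
    def ulaw_to_linear(codes):
        u = ~codes.astype(np.uint8)                # codes are stored inverted
        sign = np.where(u & 0x80, -1, 1)
        exponent = ((u >> 4) & 0x07).astype(np.int32)
        mantissa = (u & 0x0F).astype(np.int32)
        magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
        return (sign * magnitude).astype(np.int16)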
English language word transcripts in .cmt format were produced using an automatic speech recognition (ASR) system, with error rates in the range of 15-30%.
*Samples*
For an example of the data contained in this corpus, review this audio sample.
C-004756: 2006 NIST Spoken Term Detection Development Set
*Introduction*
2006 NIST Spoken Term Detection Development Set, Linguistic Data Consortium (LDC) catalog number LDC2011S02 and ISBN 1-58563-583-9, was compiled by researchers at NIST (National Institute of Standards and Technology) and contains approximately eighteen hours of Arabic, Chinese and English broadcast news, English conversational telephone speech and English meeting room speech used in NIST's 2006 Spoken Term Detection (STD) evaluation. The STD initiative is designed to facilitate research and development of technology for retrieving information from archives of speech data, with the goals of exploring promising new ideas in spoken term detection, developing advanced technology incorporating these ideas, measuring the performance of this technology and establishing a community for the exchange of research results and technical insights.
The 2006 STD task was to find all of the occurrences of a specified term (a sequence of one or more words) in a given corpus of speech data. The evaluation was intended to develop technology for rapidly searching very large quantities of audio data. Although the evaluation used modest amounts of data, it was structured to simulate the very large data situation and to make it possible to extrapolate the speed measurements to much larger data sets. Therefore, systems were implemented in two phases: indexing and searching. In the indexing phase, the system processes the speech data without knowledge of the terms. In the searching phase, the system uses the terms, the index, and optionally the audio to detect term occurrences.
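Purely as an illustration of the two-phase structure (in no way NIST's or any participant's implementation), a toy index over time-stamped word hypotheses might look like this:

    from collections import defaultdict

    # Indexing phase: build a term index from (file_id, begin_time, word)
    # hypotheses, without knowledge of the search terms.
    def build_index(hypotheses):
        index = defaultdict(list)
        for file_id, begin_time, word in hypotheses:
            index[word.lower()].append((file_id, begin_time))
        return index

    # Searching phase: look up a single-word term in the prebuilt index.
    def search(index, term):
        return index.get(term.lower(), [])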
*Data*
The development corpus consists of three data genres: broadcast news (BNews), conversational telephone speech (CTS) and conference room meetings (CONFMTG). The broadcast news material was collected in 2001 by LDC's broadcast collection system from the following sources: ABC (English), China Broadcasting System (Chinese), China Central TV (Chinese), China National Radio (Chinese), China Television System (Chinese), CNN (English), MSNBC/NBC (English), Nile TV (Arabic), Public Radio International (English) and Voice of America (Arabic, Chinese, English). The CTS data was taken from the Switchboard data sets (e.g., Switchboard-2 Phase 1 LDC98S75, Switchboard-2 Phase 2 LDC99S79) and the Fisher corpora (e.g., Fisher English Training Speech Part 1 LDC2004S13), also collected by LDC. The conference room meeting material consists of goal-oriented, small group roundtable meetings and was collected in 2001, 2004 and 2005 by NIST, the International Computer Science Institute (Berkeley, California), Carnegie Mellon University (Pittsburgh, PA) and Virginia Polytechnic Institute and State University (Blacksburg, VA) as part of the AMI corpus project.
Each BNews recording is a single-channel, PCM-encoded, 16 kHz, SPHERE-formatted file. CTS recordings are 2-channel, u-law encoded, 8 kHz, SPHERE-formatted files. The CONFMTG files contain a single recorded channel.
*Samples*
For an example of the data in this corpus, please review this audio sample (wav).
- references: C-001418: Fisher English Training Speech Part 1 Speech
- references: C-000738: Switchboard-2 Phase II
- references: C-001284: Switchboard-2 Phase I