C-004665: Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
*Introduction*
This file contains documentation on the Levantine Arabic QT Training Data Set 4 (Speech + Transcripts), Linguistic Data Consortium (LDC) catalog number LDC2005S14 and ISBN 1-58563-342-9.
This release contains 901 calls totaling 133.6 hours of telephone conversation in Levantine Arabic. Both audio and transcription files are included in this package.
The majority of speakers in this corpus are Lebanese. The data is similar to the training data in Set 3 [LDC2005S07, speech and LDC2005T03, transcripts]. The dialects are distributed as follows:
* 171 JOR
* 1373 LEB
* 229 PAL
* 29 SYR
*Samples*
For an example of this corpus, please review this audio sample. -
C-004678: 2004 Spring NIST Rich Transcription (RT-04S) Development Data
*Introduction*
2004 NIST Spring Rich Transcription (RT-04S) Development Data contains the test material (meeting speech and reference transcripts) used in the RT-04S evaluation administered by the NIST (National Institute of Standards and Technology) Speech Group. Rich Transcription (RT) is broadly defined as a fusion of speech-to-text technology and metadata extraction technologies designed to provide the basis for a generation of more usable transcriptions of human-human meeting speech.
The data in this release contains portions of meeting speech collected and/or transcribed by the International Computer Science Institute (ICSI) at Berkeley, the Interactive Systems Laboratories (ISL) at Carnegie Mellon University, NIST and LDC. The complete meeting speech and corresponding transcript data sets are available from LDC's catalog as follows: ICSI Meeting Speech (LDC2004S02), ICSI Meeting Transcripts (LDC2004T04), ISL Meeting Speech Part 1 (LDC2004S05), ISL Meeting Transcripts Part 1 (LDC2004T10), NIST Meeting Pilot Corpus Speech (LDC2004S09) and NIST Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13).
The RT-04S development data consists of the 80-minute test set used in the RT-02 Meeting Recognition Evaluation, specifically, approximately 10 minutes each from recordings of eight meetings held at ICSI, CMU, LDC and NIST. For RT-04S, NIST re-released that data with additional distant mics (if the data collection sites provided them). Although the development data comprises 10-minute excerpts from the same data collection sites which are represented in the RT-04S evaluation data set (2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data, LDC2007S12), it is not completely reflective of the evaluation test data since it contains lapel mics in lieu of head mics for the LDC and CMU data and some different distant mics for LDC data. For more information about the development test data, see NIST's RT-04S Development Data Documentation.
RT-04S included the following tasks in the meeting domain:
Speech-to-Text Transcription (STT) task
Microphone conditions:
* Multiple distant microphones
* Single distant microphone
* Individual head microphone
Processing time conditions:
* Unlimited time STT
* Less than or equal to twenty times realtime
* Less than or equal to ten times realtime
* Less than or equal to one time realtime

Diarization (SPKR) task (who spoke when)
Microphone conditions:
* Multiple distant microphones
* Single distant microphone
Input conditions:
* Speech input only
* Speech plus reference transcript input
Processing time conditions:
* Unlimited time
* Less than or equal to twenty times realtime
* Less than or equal to ten times realtime
* Less than or equal to one time realtime
*Samples*
For an example of the data in this release, please examine this audio sample and its transcript. -
C-004684: West Point Brazilian Portuguese Speech
*Introduction*
West Point Brazilian Portuguese Speech is a database of digital recordings of spoken Brazilian Portuguese designed and collected by staff and faculty of the Department of Foreign Languages (DFL) and Center for Technology Enhanced Language Learning (CTELL) to develop acoustic models for speech recognition systems. The U.S. government uses such systems to provide speech-recognition enhanced language learning courseware to government linguists and students enrolled in various government language programs.
The data in this corpus was collected in March 1999 in Brasilia, Brazil using informants from a Brazilian military academy. The corpus consists of read speech from 60 female and 68 male native and non-native speakers.
The speech was elicited from a prompt script containing 296 sentences and phrases typically used in language learning situations. The prompts are listed in the file prompts.txt. Each line of this file has two fields separated by a tab: the first field denotes the base name of the waveform file; and the second field denotes the prompt used to record the utterance.
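As an informal illustration, the following Python sketch loads that prompt script into a dictionary keyed by waveform base name. It assumes only the tab-separated, two-field layout described above; the function name and the UTF-8 encoding are our own choices, not part of the corpus documentation.

    # Load the prompt script into a dict mapping waveform base names to
    # prompt text; assumes the two tab-separated fields described above.
    def load_prompts(path="prompts.txt"):
        prompts = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                base_name, prompt = line.split("\t", 1)
                prompts[base_name] = prompt
        return prompts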
A pronouncing dictionary developed by Dr. Sheila Ackerlind with help from cadet Sterling Packer is provided in the file SANTIAGO.txt.
The speech was collected using four laptop computers running MS Windows. Three of the computers recorded with a 16-bit data size and a sampling rate of 22050 Hz; the other laptop recorded with an 8-bit data size at a sampling rate of 11025 Hz. The recording script presented a visual display of the sentence to be recorded. The informant pressed a key and spoke the sentence. The recording was played back for review, allowing the utterance to be re-recorded. A member of the data collection team was present during the recording session to verify recordings and to provide technical assistance in case of malfunctioning equipment.
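Because one of the four laptops recorded at a different resolution, a file's collection configuration can be read off its header. The sketch below, a minimal example using Python's standard wave module, reports the sample width and rate of a recording; it assumes the file is an ordinary RIFF WAV, like the sample mentioned under *Samples*.

    import wave

    # Report a recording's sample width, rate and duration, distinguishing
    # the 16-bit/22050 Hz machines from the 8-bit/11025 Hz one.
    def describe_recording(path):
        with wave.open(path, "rb") as w:
            bits = 8 * w.getsampwidth()
            rate = w.getframerate()
            seconds = w.getnframes() / rate
        print(f"{path}: {bits}-bit, {rate} Hz, {seconds:.2f} s")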
*Samples*
For an example of speech contained in this corpus, please listen to this audio sample (MS Wave format). -
C-004687: 2005 NIST Language Recognition Evaluation
*Introduction*
This file contains documentation for 2005 NIST Language Recognition Evaluation, Linguistic Data Consortium (LDC) catalog number LDC2008S05 and ISBN 1-58563-477-8.
The goal of the NIST (National Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted two previous evaluations in 1996 and 2003. For the 2005 LRE, the emphasis was on research directed toward a general base of technology to be ported to various language recognition tasks with minimum effort and the development of the ability to make more difficult discriminations between similar languages and dialects of the same language. That focus augmented the traditional evaluation goals, those being:
* to drive the technology forward
* to measure the state-of-the-art
* to find the most promising algorithmic approaches
The task evaluated was the detection of a given target language or dialect. From a test segment of speech and a target language or dialect, the system to be evaluated determined whether the speech was from the target language or dialect. The evaluation consisted of speech from the following languages and dialects:
* English (American)
* English (Indian)
* Hindi
* Japanese
* Korean
* Mandarin (Mainland)
* Mandarin (Taiwan)
* Spanish (Mexican)
* Tamil
The 2005 NIST Language Recognition Evaluation Plan, which includes a description of the evaluation tasks, is included with this release. Further information regarding this evaluation is also available at the NIST Language Recognition Evaluation website.
*Data*
Each speech file is one side of a "4-wire" telephone conversation represented as 8-bit, 8 kHz mu-law data. There are 11,106 speech files in NIST SPHERE (.sph) format for a total of 73.2 hours of speech. The speech data was compiled from LDC's CALLFRIEND corpora and from data collected by Oregon Health and Science University, Beaverton, Oregon.
Each test segment was prepared using an automatic speech activity detection algorithm to identify areas and durations of speech. The test segments were stored in SPHERE file format, one segment per file. Unlike previous evaluations, areas of silence were not removed from the segments. Segments were chosen to contain a specified approximate duration of actual speech. Auxiliary information was included in the SPHERE headers to document the source file, start time, and duration of all excerpts that were used to construct the segment.
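For readers who want to inspect those auxiliary header fields, here is a minimal Python sketch of a SPHERE header reader. It assumes the standard NIST SPHERE layout (a plain-text header beginning "NIST_1A" followed by its byte size, with one "name -type value" field per line, terminated by "end_head"); the function name is our own, and the exact field names present depend on the file.

    # Read the plain-text NIST SPHERE header into a dict of field values.
    def read_sphere_header(path):
        with open(path, "rb") as f:
            f.readline()                      # magic line, b"NIST_1A"
            size = int(f.readline().strip())  # header size in bytes, e.g. 1024
            f.seek(0)
            header = f.read(size).decode("ascii", errors="replace")
        fields = {}
        for line in header.splitlines()[2:]:
            if not line.strip():
                continue
            if line.startswith("end_head"):
                break
            name, ftype, value = line.split(" ", 2)
            # "-i" marks integer fields; others are kept as strings here.
            fields[name] = int(value) if ftype == "-i" else value
        return fields

    # Hypothetical usage:
    # h = read_sphere_header("segment.sph"); print(h.get("sample_rate"))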
The test segments contain three nominal durations of speech: 3 seconds, 10 seconds, and 30 seconds. Actual speech durations vary, but were constrained to be within the ranges of 2-4 seconds, 7-13 seconds, and 25-35 seconds, respectively. Note that this refers to duration of actual speech contained in segments as determined by the speech activity detection algorithm; signal durations in general are longer due to areas of silence in the segments. Shorter speech duration test segments are subsets of longer speech duration test segments; i.e., each 10-second test segment is a subset of a corresponding 30-second test segment, and each 3-second test segment is a subset of a corresponding 10-second segment. Performance was evaluated separately for test segments of each duration.
NIST recommends using data from the 1996 and 2003 evaluations as development data. This data may be found in 2003 NIST Language Recognition Evaluation, LDC2006S31. Because the 1996 and 2003 evaluations did not cover Indian-accented English, this release includes a development data set of Indian-accented English.
*Samples*
For an example of the data in this corpus, please review the following audio samples (wav format):
* 3 second
* 10 second
* 30 second -
C-004688: CSLU: Alphadigit Version 1.3
*Introduction*
This file contains documentation for CSLU: Alphadigit Version 1.3, Linguistic Data Consortium (LDC) catalog number LDC2008S06 and ISBN 1-58563-478-6.
Alphadigit Version 1.3 is a collection of 78,044 utterances from 3,025 speakers saying six-character strings of letters and digits over the telephone for a total of approximately 82 hours of speech. Each speech file has corresponding orthographic and phonemic transcriptions. This corpus was created by the Center for Spoken Language Understanding (CSLU), Oregon Health & Science University, Beaverton, Oregon.
*Data*
Speakers were recruited using Usenet postings. Respondents registered for the collection by completing an online form. Once registered, they received a list of 18-29 six-character strings (e.g., "a 2 b 4 5 g") and participation instructions. Speakers called the CSLU data collection system by dialing a toll-free number and were prompted for each string; 1102 different strings were used throughout the course of the data collection. The lists were set up to balance for phonetic context between all letter and digit pairs.
The data were recorded directly from a digital phone line without digital-to-analog or analog-to-digital conversion at the recording end using the CSLU T1 digital data collection system. The sampling rate was 8 kHz and the files were stored in 8-bit mu-law format on a UNIX file system. The files have been converted to RIFF standard file format, 16-bit linearly encoded.
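A conversion of this kind can be reproduced in a few lines of Python. The sketch below applies the standard ITU-T G.711 mu-law expansion and writes a 16-bit RIFF WAV file; it assumes headerless 8 kHz mu-law input, and the function names are our own, not part of this release.

    import struct
    import wave

    # Standard ITU-T G.711 mu-law expansion for one 8-bit code.
    def ulaw_to_linear(u):
        u = ~u & 0xFF
        t = ((u & 0x0F) << 3) + 0x84
        t <<= (u & 0x70) >> 4
        return (0x84 - t) if (u & 0x80) else (t - 0x84)

    ULAW_TABLE = [ulaw_to_linear(b) for b in range(256)]

    # Decode a headerless 8 kHz mu-law file and write a 16-bit RIFF WAV.
    def ulaw_file_to_riff(src, dst, rate=8000):
        with open(src, "rb") as f:
            raw = f.read()
        pcm = struct.pack("<%dh" % len(raw), *(ULAW_TABLE[b] for b in raw))
        with wave.open(dst, "wb") as w:
            w.setnchannels(1)      # single channel, as in this collection
            w.setsampwidth(2)      # 16-bit linear samples
            w.setframerate(rate)
            w.writeframes(pcm)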
*Transcription*
All of the files included in this corpus have corresponding non-time-aligned word-level transcriptions and time-aligned phoneme-level transcriptions (automatic forced alignment) that comply with the conventions in the CSLU Labeling Guide. Non-time-aligned orthographic transcriptions provide quick access to the content of an utterance; they may contain markers for word boundaries to support access and retrieval at the lexical level. Phonetic/phonemic transcriptions represent the phonetic content of an utterance at a given level of detail that is made explicit by the use of diacritics. Phonetic phenomena transcribed include excessive nasalization, glottalization, frication on a stop, centralization, lateralization, rounding and palatalization.
*Samples*
For an example of the speech contained in this corpus, please listen to this audio sample (MS wave). -
C-004692: CSLU: ISOLET Spoken Letter Database Version 1.3
*Introduction*
CSLU: ISOLET Spoken Letter Database Version 1.3, Linguistic Data Consortium (LDC) catalog number LDC2008S07 and ISBN 1-58563-488-3, was created by the Center for Spoken Language Understanding (CSLU) at OGI School of Science and Engineering, Oregon Health and Science University, Beaverton, Oregon.
CSLU: ISOLET Spoken Letter Database Version 1.3 is a database of letters of the English alphabet spoken in isolation under quiet laboratory conditions and associated transcripts. The data was collected in 1990 and consists of two productions of each letter by 150 speakers (7800 spoken letters) for approximately 1.25 hours of speech. The subjects were recruited through advertising and consisted of 75 male speakers and 75 female speakers. Each subject received a free dessert at a local restaurant in exchange for his or her participation in the data collection. All speakers reported English as their native language. Their ages varied from 14 to 72 years; the speakers' average age was 35 years.
*Data*
Speech was recorded in the OGI speech recognition laboratory. The room measured 15' by 15' with a tile floor, standard office wall board and drop ceiling and contained two Sun workstations and three disk drives.
The recording equipment was selected to mimic the equipment used to collect the TIMIT database as closely as possible. The speech was recorded with a Sennheiser HMD 224 noise-canceling microphone, low-pass filtered at 7.6 kHz. Data capture was performed using the AT&T DSP32 board installed in a Sun 4/110. The data were sampled at 16 kHz and converted to RIFF (.WAV) format.
The subjects were seated in front of a Sun workstation and prompted with letters in random order. After each prompt, the subject would strike the return key and say the letter. Two seconds of speech were recorded and immediately played back for verification. If the subject spoke too soon or too late and missed the two-second buffer, or if the experimenter or subject decided that the letter was misspoken, the recording was repeated. There was no attempt to elicit ideal speech. A letter was judged to be misspoken only if there was a significant departure from normal pronunciation.
After the recording session, each utterance was verified by a human examiner in two ways. First, the examiner viewed the waveform of the utterance to confirm that the speech was padded with silence. The examiner then listened to the speech and noted any ambiguous or misspoken utterances. All utterances noted by the examiner were examined by two additional human examiners. If a majority of the examiners perceived that an utterance was abnormal, that utterance, and the rest of the utterances from that speaker, were removed from the corpus.
The transcriptions of the recorded speech are time-aligned phonetic transcriptions conforming to the CSLU Labeling standards. Time-aligned word transcriptions are represented in a standard orthography or romanization. Speech and non-speech phenomena are distinguished. The transcriptions are aligned to a waveform by placing boundaries to mark the beginning and ending of words. In addition to the specification of boundaries, this level of transcription includes additional commentary on salient speech and non-speech characteristics, such as glottalization, inhalation, and exhalation.
*Samples*
For an example of the data in this corpus, please listen to this audio sample (.WAV) of a speaker speaking the letter "a". The labeling for this sample is shown below:

MillisecondsPerFrame: 1.000000
END OF HEADER
0 95 .pau
95 285 ^
285 425 .pau
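Such label files are easy to read mechanically. The following minimal Python sketch parses one into (start, end, label) triples in milliseconds; the field meanings are inferred from the excerpt above rather than from a formal specification, and the function name is our own.

    # Parse a CSLU-style label file into (start_ms, end_ms, label) triples.
    def read_labels(path):
        ms_per_frame, segments, in_header = 1.0, [], True
        with open(path) as f:
            for line in f:
                line = line.strip()
                if in_header:
                    if line.startswith("MillisecondsPerFrame:"):
                        ms_per_frame = float(line.split(":", 1)[1])
                    elif line == "END OF HEADER":
                        in_header = False
                    continue
                if line:
                    start, end, label = line.split(None, 2)
                    segments.append((float(start) * ms_per_frame,
                                     float(end) * ms_per_frame, label))
        return segments
-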
C-004698: LDC Spoken Language Sampler
*Introduction*
The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and sharing of resources. As of 2008, LDC is a growing consortium of more than 100 companies, universities and government agencies that has distributed over 50,000 corpora to a global audience. With the support of its members, LDC is able to provide critical services to the language research community. These services include: maintaining the data archives, producing and distributing data via media (DVD-ROM or CD-ROM) or web downloads, negotiating intellectual property agreements with potential information providers and would-be members, and maintaining relations with other like-minded groups around the world.
Resources available from LDC (http://www.ldc.upenn.edu) include speech, text and video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials.
*Data*
The LDC Spoken Language Sampler provides a variety of speech, transcript and lexicon samples and is designed to illustrate the variety and breadth of the resources available from LDC Publication Catalog.
* most excerpts are truncated to be much shorter than the original files, typically one minute and thirty seconds of speech
* signal amplitude has been adjusted where necessary to normalize playback volume
* some corpora are published in compressed form, but all samples here are uncompressed
* LDC typically uses NIST SPHERE file format for audio data, but the audio files in this sampler have been converted to MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities.
The sampler includes samples from the following corpora and lexicons. Audio samples range from 30 seconds to 90 seconds and are accompanied by transcripts.
* An English Dictionary of the Tamil Verb: Contains translations for over 6000 English verbs and defines over 9000 Tamil verbs. Entries include the English word, the Tamil equivalent in transliteration and Tamil script, and audio examples in Spoken Tamil pronunciation.
* CALLFRIEND Farsi: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi.
* CALLFRIEND Tamil: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Tamil.
* CALLHOME Japanese: A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.
* CALLHOME Spanish: A corpus of 120 unscripted telephone conversations between native Spanish speakers and a corpus of associated transcripts.
* CSLU Kids Speech: Developed at the Center for Spoken Language Understanding, Oregon Health & Science University, this corpus is a collection of spontaneous and prompted speech from 1100 children from Kindergarten through Grade 10.
* Fisher Levantine Arabic: A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities.
* Grassfields Bantu Fieldwork: Dschang Tone Paradigms: Tone paradigms from Yémba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language spoken by 300,000+ people in Southwestern Cameroon.
* Gulf Arabic Conversational Telephone Speech: Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions.
* Korean Telephone Speech: A collection of 100 telephone conversations between native Korean speakers and their transcriptions.
* Mawukakan Lexicon: The first publication of an ongoing project aiming to build an electronic dictionary of four Mandekan (Eastern Manding languages of the Mande Group of the Niger-Congo family) languages.
* Nationwide Speech Project: A database of speech representing current regional accents and dialects of the United States.
* NIST Pilot Meeting Speech: Speech and transcriptions from topical discussions in meeting settings, including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place.
* West Point Russian Speech: Utterances of sentences in Russian from 1,891 native and non-native speakers.
*How to Obtain*
The LDC Spoken Language Sampler may be downloaded freely. The sampler is a gzip-compressed tar file; most compression utilities will readily extract it.
Download (74 MB)
- references: D-004670: An English Dictionary of the Tamil Verb
- references: C-000635: CALLFRIEND Farsi
- references: C-000644: CALLFRIEND Tamil
- references: C-000657: CALLHOME Japanese Speech
- references: C-000658: CALLHOME Japanese Transcripts
- references: C-000664: CALLHOME Spanish Speech
- references: C-000665: CALLHOME Spanish Transcripts
- references: CSLU Kids Speech
- references: C-001420: Fisher Levantine Arabic Conversational Telephone Speech, Transcripts
- references: C-001421: Fisher Levantine Arabic Conversational Telephone Speech
- references: C-000704: Grassfields Bantu Fieldwork: Dschang Tone Paradigms
- references: C-000706: Gulf Arabic Conversational Telephone Speech, Transcripts
- references: C-001258: Gulf Arabic Conversational Telephone Speech
- references: C-001044: Korean Telephone Conversations Speech
- references: C-001045: Korean Telephone Conversations Transcripts
- references: G-001056: Mawukakan Lexicon
- references: C-003309: Nationwide Speech Project
- references: NIST Pilot Meeting Speech
- references: C-001594: West Point Russian Speech
-
C-004699: CSLU: Numbers Version 1.3
*Introduction*
CSLU: Numbers Version 1.3, Linguistic Data Consortium (LDC) catalog number LDC2009S01 and ISBN 1-58563-501-4, was created by the Center for Spoken Language Understanding (CSLU) at OGI School of Science and Engineering, Oregon Health and Science University, Beaverton, Oregon. It is a collection of naturally produced numbers taken from utterances in various CSLU telephone speech data collections. The corpus consists of approximately fifteen hours of speech and includes isolated digit strings, continuous digit strings, and ordinal/cardinal numbers.
The numbers come from several sources, among them telephone numbers, street addresses and zip codes, and were uttered by 12,618 speakers in a total of 23,902 files. In most of CSLU's telephone data collections, callers were asked for their phone number, birthdate or zip code. Callers would also occasionally leave numbers in the midst of another utterance. The numbers in those situations were extracted from the host utterance and added to the corpus.
Additional information about this publication is available from the corpus web page at CSLU.
*Data*
The speech data was collected over analog and digital telephone lines. The analog data was recorded using a Gradient Technologies analog-to-digital conversion box; those files were recorded as 16-bit, 8 kHz and stored in a linear format. The digital data was recorded with the CSLU T1 digital data collection system; those files were sampled at 8 kHz, 8-bit and stored as ulaw files. All of the data in this release has been linearly encoded in 16-bit RIFF standard file format.
Each file includes an orthographic transcription following the CSLU Labeling guidelines which are included in the documentation for this publication. Also, many of the utterances have been phonetically labeled.
*Statistics*
CSLU: Numbers Version 1.3 consists of approximately fifteen hours of speech. The following table gives a count of the number of files for each utterance type.

Type      Number
phone       2970
street      7079
zipcode     7076
other       6771
*Samples*
For an example of the data contained in this corpus, please examine the audio files and labels for the following spoken sequences:
* Street Address: one sixteen wav|label
* Zipcode: one oh three one four wav|label -
C-004700: CHAracterizing INdividual Speakers (CHAINS)
*Introduction*
CHAINS was created by researchers at University College Dublin and contains recordings of thirty-six English speakers reading fables and selected sentences in different speaking styles. The data was obtained in two different sessions with a time separation of about two months. The goal of the corpus is to provide a range of speaking styles and voice modifications for speakers sharing the same accent. Other existing corpora, in particular CSLU Speaker Recognition Version 1.1, TIMIT and the IViE corpus (English Intonation in the British Isles), served as referents in the selection of material. This design decision was made to ensure that methods designed and evaluated on the CHAINS corpus might be directly testable on those other corpora, which differ considerably from CHAINS in dialect and channel characteristics.
Additional documentation about the corpus and its methodology is available at the CHAINS website.
*Data*
The data was collected in two recording sessions in a total of six different speaking styles. The first recording session was carried out in a professional recording studio in December 2005. Speakers were recorded in a sound-attenuated booth reading text in the solo, synchronous and retell styles using a Neumann U87 condenser microphone. Additional tracks using other microphones (near and far-field) were also recorded and may be made available upon request to the authors. The second recording session took place from March 2006 to May 2006 in a quiet office environment, using an AKG C420 headset condenser microphone. Speakers read text in the rsi, whisper and fast modes. The six different speaking styles were:
* solo reading
* synchronous reading
* spontaneous speech (retell)
* repetitive synchronous imitation (rsi)
* whispered reading
* fast reading
In two of the speaking conditions adopted, speakers modified their speech in a constrained fashion towards a known target: in the synchronous condition, the speech of the co-speaker served as the target, while in rsi there was an explicit, known, static target. The presence of a known target which speakers aim to copy raises the bar in the discovery and design of procedures for automatic speaker identification, as the target speech provides a potentially highly confusing foil. The whisper and fast speech conditions are also well-defined speaking styles which require substantial voice modification by the speaker.
Participants were recruited through University College Dublin and were paid for their participation. No participant had any known speech or hearing deficit. The speakers were from the United Kingdom, the eastern part of Ireland (Dublin and adjacent counties) and the United States. Further information about the speakers, their gender and dialect is available in the documentation released with this corpus.
*Samples*
For an example of the data in this corpus, please examine this sound file of the fast reading style. -
C-004701: English CTS Treebank with Structural Metadata
*Introduction*
English CTS Treebank with Structural Metadata, Linguistic Data Consortium (LDC) catalog number LDC2009T01 and ISBN 1-58563-476-X, consists of metadata and syntactic structure annotations for 144 English telephone conversations, or 140,000 words, from data used in the EARS (Effective, Affordable, Reusable Speech-to-Text) program. English CTS Treebank with Structural Metadata was created to support EARS work in English. It applies EARS metadata extraction annotations and Penn Treebank methods to conversations from Switchboard-1 Release 2 (LDC97S62) and from data collected for EARS under the Fisher Protocol (released in EARS as LDC2004E16, LDC2004E29 and LDC2005E73).
The purpose of the EARS program was to develop robust speech recognition technology to address a range of languages and speaking styles. LDC provided conversational and broadcast speech and transcripts, annotations, lexicons and texts for language modeling in each of the EARS languages (Arabic, Chinese, English). LDC also supported a metadata extraction (MDE) research evaluation, the goal of which was to enable technology to take raw speech-to-text (STT) output and to refine it into forms of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: removing non-content words like filled pauses and discourse markers from the text; removing sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation and standardized spelling, plus sensible conventions for representing speaker turns and identity are further elements in the readable transcript. Some of the data developed by LDC for the MDE task is contained in the LDC Catalog, i.e., RT-04 MDE Training Data Speech, LDC2005S16 and RT-04 MDE Training Data Text/Annotations, LDC2005T24.
*Data*
Speech
The telephone speech used in English CTS Treebank with Structural Metadata was drawn from Switchboard-1 Release 2 (LDC97S62) and from data collected for EARS under the Fisher Protocol (released in EARS as LDC2004E16, LDC2004E29 and LDC2005E73). The speech for all files was recorded on two channels with a sampling rate of 8000 Hz and was encoded in ulaw format.
The Fisher data was transcribed by LDC staff; for the Switchboard data, transcripts developed at the Institute for Signal and Information Processing at Mississippi State University were used.
Structural Metadata Annotation
The transcribed data was annotated according to SimpleMDE V6.2, an annotation task defined by LDC that consisted of the following elements: Edit Disfluencies (repetitions, revisions, restarts and complex disfluencies), Fillers (including, e.g., filled pauses and discourse markers) and SUs, or syntactic/semantic units. Each of these elements is described below:
* Edit Disfluencies: Edit disfluencies, or speech repairs, occur when speakers correct or alter their utterances or abandon them entirely and start over. Edit disfluencies have a more complex internal structure than fillers, consisting of the original utterance (reparandum), an interruption point, an optional editing phase and a correction. There are four types of disfluencies annotated in SimpleMDE: repetitions; revisions; restarts; and complex disfluencies, which consist of multiple or nested edits. In SimpleMDE, annotators labeled only the deletable region (DELREG) of the disfluency which corresponded to the reparandum. In cases where the reparandum contained multiple disfluent utterances, annotators identified the maximal extent of the disfluent portion, starting with the left edge of the first disfluency and continuing to the right edge (IP) of the final disfluency.
* Fillers: While the term filler has traditionally been synonymous with filled pause, SimpleMDE uses the term to encompass a broad set of vocalized space-fillers: filled pauses (FPs), discourse markers (DMs), explicit editing terms (EETs) and asides/parentheticals (A/Ps). Excepting the last category, fillers can be understood as words that do not alter the propositional content of the material into which they are inserted. For example, FPs include nonlexemes, such as um or ah, that speakers use to indicate hesitation or to maintain control of a conversation. A DM is a word or phrase that functions primarily as a structuring unit of spoken language, such as actually, now, anyway, see, basically, so, I mean, well, let's see, you know, like, you see. DMs often signal the speaker's intention to mark a boundary in discourse, like a change in speaker or the beginning of a new topic. There is no exhaustive list of DMs for a given language due to their wide range of functions, colloquial variations, and the difficulty of defining them precisely. Furthermore, words that are used as discourse markers can be used for other purposes. EETs occur during an edit disfluency and consist of an overt statement (e.g., I'm sorry) from the speaker recognizing the disfluency. Asides and parentheticals (A/Ps) are different from the other filler types in that they convey semantic information in the form of a short side comment before returning to the main topic. This may be either on a new topic (asides) or on the same topic of the larger utterance (parentheticals). Both break up the stream of discourse and are often accompanied by noticeable prosodic features.
* Syntactic Units: One of the goals of MDE annotation is the identification of all units within the discourse that function to express a complete thought or idea on the part of the speaker. Within MDE these elements are called SUs (Syntactic, Semantic or Slash Units). As with disfluency annotation, the goal of SU labeling is to improve transcript readability by presenting information in small, structured, coherent chunks. There are four sentence-level SUs. Statements are complete SUs that function as a declarative statement and are marked with /.; questions are complete SUs that function as an interrogative and are marked with /?. Backchannels are an open class of words uttered by the non-dominant speaker to indicate engagement in the conversation and are marked with /@. Incomplete SUs occur when an utterance does not constitute a grammatically complete sentence, phrase or continuer, and does not express a complete thought; these are marked with /-. To enhance inter-annotator consistency, there are also sentence-internal clausal and coordinating SUs (/, and /&). A small illustrative sketch of these symbols follows this list.
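To make the SU inventory concrete, here is a toy Python sketch that splits a string carrying these symbols into (text, SU type) chunks. The sample sentence is invented, and real SimpleMDE annotation is delivered in structured files rather than inline strings; this is illustration only.

    import re

    # Map the SU symbols described above to their types.
    SU_MARKS = {"/.": "statement", "/?": "question", "/@": "backchannel",
                "/-": "incomplete", "/,": "clausal", "/&": "coordinating"}

    # Split an inline-annotated string into (text, SU type) chunks.
    def split_sus(text):
        pattern = "(" + "|".join(re.escape(m) for m in SU_MARKS) + ")"
        parts = re.split(pattern, text)
        return [(parts[i].strip(), SU_MARKS[parts[i + 1]])
                for i in range(0, len(parts) - 1, 2)]

    # Invented example:
    # split_sus("yeah /@ we went to the shore last summer /. did you /?")
    # -> [('yeah', 'backchannel'),
    #     ('we went to the shore last summer', 'statement'),
    #     ('did you', 'question')]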
Parsing and Treebank Annotation
The existing MDE annotations were converted from RTTM format into a format appropriate for the automatic parser, enabling the generation of accurate parses in a form that would require as little hand modification by the Treebank team as possible. RTTM is a format developed by NIST (National Institute of Standards and Technology) for the EARS program that labeled each token in the reference transcript according to the properties it displays (e.g., lexeme versus non-lexeme, edit, filler, SU). The initial parse trees were produced using an entropy-based parser, which was trained on Switchboard transcripts supplemented with Wall Street Journal data (with a 4:1 ratio). These parses served as the starting point for a manual process which corrected the initial pass for each conversation.
To provide high quality parses, scripts were used to separate the edited material from the fluent part of each SU prior to parsing it using the MDE annotations. The edits were then parsed and reinserted into the tree for presentation to the annotators. Some important issues are listed below:
* Words were tokenized in Syntactic Units using LDC's scripts.
* All of the punctuation provided in the markup was maintained in the SU for parsing because it was likely to enhance parse accuracy and was expected to appear in the final tree annotations.
* For complex edits, contiguous edits were concatenated into one unit for parsing. In a few cases, edits occur across SUs in MDE annotations.
* Special treatment was required in the scripts for regions unannotated for MDE, complex edits, and SUs that were comprised solely of edited material.
* The string "EDITED" was used as the non-terminal tag for edit regions inserted into the fluent parse trees. Additionally, a terminal node for the IP, (DISFL-IP +), was added at the end of the edits in an attempt to make the tree follow the conventions used in the Switchboard Treebank; a toy illustration follows this list.
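The effect of that convention can be sketched with a toy bracketed tree. The Python sketch below parses a small S-expression tree and recovers the fluent word string by skipping EDITED subtrees and DISFL-IP markers; the example tree is invented and greatly simplified relative to actual Treebank output.

    # Read a bracketed tree written as space-separated tokens.
    def parse(tokens):
        tok = tokens.pop(0)
        if tok == "(":
            label = tokens.pop(0)
            children = []
            while tokens[0] != ")":
                children.append(parse(tokens))
            tokens.pop(0)                      # consume ")"
            return (label, children)
        return tok                             # terminal word

    # Collect the fluent words, skipping EDITED subtrees and IP markers.
    def fluent_words(node):
        if isinstance(node, str):
            return [node]
        label, children = node
        if label in ("EDITED", "DISFL-IP"):
            return []
        return [w for c in children for w in fluent_words(c)]

    # Invented, simplified example:
    tree = "( S ( EDITED ( NP I ) ( DISFL-IP + ) ) ( NP I ) ( VP like it ) )"
    print(fluent_words(parse(tree.split())))   # ['I', 'like', 'it']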
Manual treebank annotation was performed in accordance with existing treebank guidelines for conversational telephone speech as well as in accordance with revised general guidelines for treebanking.
*Samples*
For an example of the data in this corpus, please listen to this audio sample (wav) and view its parse tree (PDF). Note that the opening greeting of the conversation has been omitted in the parse tree. Only the discussion on holidays is present. - references: C-001283: Switchboard-1 Release 2