Language resource #: 3330
-
C-001382: Mandarin Chinese Telephone Speech Recognition Corpus Stock (649 people)
Desktop/Microphone
This corpus comprises 1,584 entries uttered by 649 speakers of various dialects, ages and educational levels (340 males and 309 females), recorded over the fixed telephone network. The database comprises 10,400 stocks. Speech samples are stored as 16-bit, 8 kHz WAV files, for a total of 12.99 hours of speech.
Each speaker read 16 items. Text files are stored in Unicode format. All data have been proofread manually.
The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
The corpus is intended for testing telephone natural speech recognition systems.
-
C-001383: Mandarin Chinese Telephone Speech Recognition Corpus - Stock
Desktop/Microphone
This corpus comprises 3,085 entries uttered by 265 speakers of various dialects, ages and educational levels (134 males and 131 females), recorded over the mobile telephone network. The database comprises 6,972 Chinese stocks. Speech samples are stored as 16-bit, 8 kHz WAV files, for a total of 7 hours of speech. The total size of the data is 387 MB.
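The stated duration, sample format and data size can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the corpus documentation: the function name `pcm_size_bytes` is hypothetical, and the calculation covers only the raw PCM payload, ignoring WAV headers and the text files.

```python
def pcm_size_bytes(hours, sample_rate_hz, bits_per_sample, channels=1):
    """Return the raw PCM payload size in bytes (WAV headers not included)."""
    seconds = hours * 3600
    bytes_per_sample = bits_per_sample // 8
    return round(seconds * sample_rate_hz * bytes_per_sample * channels)


# 7 hours of 16-bit, 8 kHz, single-channel speech, as described for this corpus
size = pcm_size_bytes(7, 8000, 16)
print(size, "bytes")                       # 403200000 bytes
print(round(size / 2**20, 1), "MiB")       # binary megabytes
```

The result is roughly 385 binary megabytes, which is close to the listed figure. The small difference would be consistent with a rounded duration plus file headers and transcriptions, although the catalogue does not say how the size was measured.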
Each speaker read 15-30 items. Text files are stored in Unicode format. All data have been proofread manually.
The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
The corpus is intended for testing telephone natural speech recognition systems.
-
C-001384: Mandarin Chinese Telephone Speech Recognition Corpus -Person Name, Place Name
Desktop/Microphone
This corpus comprises 7,298 entries uttered by 285 speakers of various dialects, ages and educational levels (144 males and 141 females), recorded over the fixed telephone network. The database comprises 14,492 Chinese personal names and place names. Speech samples are stored as 16-bit, 8 kHz WAV files, for a total of 17.6 hours of speech. The total size of the data is 968 MB.
Each speaker read 15-30 items. Text files are stored in Unicode format. All data have been proofread manually.
The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
The corpus is intended for testing telephone natural speech recognition systems.
-
C-001385: Mandarin Chinese Speech Recognition Corpus (desktop) - digit string (200 people)
Desktop/Microphone
This corpus comprises 8,000 digit strings uttered by 200 speakers of various dialects, ages and educational levels, recorded over 2 channels. Speech samples are stored as 16-bit, 44.1 kHz WAV files, with 12.35 hours of speech per channel. The total size of the data is 7.3 GB.
Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for testing telephone natural speech recognition systems.
-
C-001386: Mandarin Chinese Speech Recognition Corpus (desktop) - place name (200 people)
Desktop/Microphone
This corpus comprises 8,000 place names uttered by 200 speakers of various dialects, ages and educational levels, recorded over 2 channels. Speech samples are stored as 16-bit, 44.1 kHz WAV files, with 10.49 hours of speech per channel. The total size of the data is 6.2 GB.
Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for testing telephone natural speech recognition systems.
-
C-001387: Colombian Spanish Speech Database
Telephone
The Colombian Spanish speech database contains the recordings of 1,065 speakers (563 males and 502 females) recorded over the fixed telephone network using an E-1 interface.
The speech data were collected in Colombia.
Speech samples are stored as uncompressed sequences of 8-bit, 8 kHz A-law samples (CCITT Recommendation G.711). Each prompted utterance is stored in a separate file. Each speech file has an accompanying ASCII SAM label file. The speech file format and the SAM label files follow the specifications given by the SpeechDat project.
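Since the samples are 8-bit A-law rather than linear PCM, they usually need to be expanded before further processing. The sketch below shows the standard G.711 A-law expansion of a single byte to a signed 16-bit value; the function name `alaw_to_linear` is hypothetical, and the code assumes that the speech files contain raw A-law bytes with no header, which should be checked against the SpeechDat documentation.

```python
def alaw_to_linear(byte):
    """Expand one 8-bit G.711 A-law code to a signed 16-bit linear sample."""
    value = byte ^ 0x55               # undo the even-bit inversion applied by the encoder
    mantissa = value & 0x0F           # 4-bit quantisation step within the segment
    segment = (value & 0x70) >> 4     # 3-bit segment (exponent) number
    sample = mantissa << 4
    if segment == 0:
        sample += 8                   # first segment: half-step offset only
    else:
        sample += 0x108               # add the implicit leading bit and half-step offset
        sample <<= segment - 1
    # In A-law, a set sign bit marks a positive sample.
    return sample if value & 0x80 else -sample


def decode_alaw(data):
    """Decode a bytes object of A-law codes into a list of linear samples."""
    return [alaw_to_linear(b) for b in data]


print(decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A])))  # [8, -8, 32256, -32256]
```

For whole files, a 256-entry lookup table built once with `alaw_to_linear` would avoid repeating the bit manipulation for every sample.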
The recording platform used an ISDN basic access (BRI) interface.
The speakers were mainly recruited from Siemens personnel, students from several Colombian universities, and their relatives.
The following sex and age distribution has been obtained: 56 speakers are under 16 (38 males, 18 females), 542 speakers are between 16 and 30 (277 males, 265 females), 347 speakers are between 31 and 45 (178 males, 169 females), 99 speakers are between 46 and 60 (59 males, 40 females) and 21 speakers are over 60 (11 males, 10 females).
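The age bands above can be checked against the headline speaker counts. This is a small illustrative check using only the figures quoted in this entry; the variable names are arbitrary.

```python
# (males, females) per age band, as listed in the description
age_bands = {
    "under 16": (38, 18),
    "16-30": (277, 265),
    "31-45": (178, 169),
    "46-60": (59, 40),
    "over 60": (11, 10),
}

males = sum(m for m, _ in age_bands.values())
females = sum(f for _, f in age_bands.values())

print(males, females, males + females)  # 563 502 1065
```

The totals agree with the 1,065 speakers (563 males and 502 females) stated for the database.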
The transcription included in this database is an orthographic transcription with a few details that represent audible acoustic events (speech and non speech) present in the corresponding waveform files. A lexicon is also provided.
Non-Speech Acoustic Events have been arranged into 4 categories (filled pause, speaker noise, stationary noise and intermittent noise) and are transcribed.
Type of resource : Speech recordings (Acoustic)
Speech mode : Read
Recording conditions: ISDN telephone interface
Language: Colombian Spanish
Sex and number of speakers: 1,065 speakers (563 males and 502 females)
Linguistic annotation: Orthographic (+ transcription of audible noises)
File format: 8 bits, A-law
Standard in use: SAM
Phoneme set: SAMPA
Sampling rate (kHz): 8 kHz
Distribution media: 1 CD-ROM
Related resources: SpeechDat family. Other languages available.
-
C-001388: Concise Oxford Dictionary - Audio Files
Desktop/Microphone
This "acoustic dictionary" contains 60,000 sound files from the 9th edition of the Concise Oxford Dictionary. The recordings were made by actors in a studio.
It features recordings with British English pronunciation, with accurate coverage of homographs, variant forms and inflections. Full information on parts of speech and subsenses is covered, and the sound files are clearly linked to the related phonetic information. The files are in 22 kHz, 16-bit WAV format.
-
C-001392: DSO Corpus of Sense-Tagged English
*Introduction*
This corpus contains sense-tagged word occurrences for 121 nouns and 70 verbs which are among the most frequently occurring and ambiguous words in English. These occurrences are provided in about 192,800 sentences taken from the Brown corpus and the Wall Street Journal and have been hand tagged by students at the Linguistics Program of the National University of Singapore. WordNet 1.5 sense definitions of these nouns and verbs were used to identify a word sense for each occurrence of each word.
*Data*
In addition to providing the word occurrences in their full sentential context, the corpus includes complete listings of the WordNet 1.5 sense definitions used in the tagging.
The following example illustrates the format of a sentence with a sense tag for the word "action," followed by the corresponding WordNet 1.5 sense definition:
ca01.db #020 `` These >> actions 8
proceeding, legal proceeding, judicial proceeding, proceedings -- (the institution of a legal action)
=> due process, due process of law -- (the administration of justice according to established rules and principles)
=> group action -- (action taken by a group of people)
=> act, human action, human activity -- (something that people do or cause to happen)
(In the actual corpus, all tagged occurrences of a given noun or verb are stored together in one file, with each full sentence on one line; all noun and verb word sense definitions are stored together in two separate files.)
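A tagged sentence line of this kind could be read with a short parser. The sketch below rests on assumptions that the excerpt does not confirm: that each line starts with a source file identifier and a sentence number, and that the tagged word and its sense number are enclosed as `>> word sense <<`. Only the opening `>>` is visible in the excerpt, so the closing `<<` is an assumption, and the sample sentence and the function name `parse_tagged_line` are hypothetical.

```python
import re

# Assumed tag layout: ">> word sense_number <<" (the closing "<<" is an assumption).
TAG = re.compile(r">>\s+(\S+)\s+(\d+)\s+<<")


def parse_tagged_line(line):
    """Split one line into source file, sentence number, tagged word, sense and plain text."""
    source, number, text = line.split(maxsplit=2)
    match = TAG.search(text)
    if match is None:
        raise ValueError("no sense tag found in line")
    return {
        "source": source,
        "sentence": number.lstrip("#"),
        "word": match.group(1),
        "sense": int(match.group(2)),
        # Replace the tag with the bare word to recover the untagged sentence.
        "text": TAG.sub(lambda m: m.group(1), text),
    }


# Hypothetical line, modelled on the excerpt; not taken from the corpus.
sample = "ca01.db #020 `` These >> actions 8 << are hypothetical ."
parsed = parse_tagged_line(sample)
print(parsed["word"], parsed["sense"])
```

The sense number returned here would then be looked up in the separate noun or verb sense definition file mentioned in the description.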
This sense tagged corpus was provided by Hwee Tou Ng of the Defence Science Organisation (DSO) of Singapore. It was first reported in the following paper at ACL-96:
"Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach," by Hwee Tou Ng and Hian Beng Lee, in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40-47, Santa Cruz, California, USA, June 1996. ( http://xxx.lanl.gov/abs/cmp-lg/9606032 )
*Updates*
There are no updates at this time.
References: Hwee Tou Ng and Hian Beng Lee. 1997. DSO Corpus of Sense-Tagged English. Linguistic Data Consortium, Philadelphia.
-
C-001393: Danish SpeechDat-Car - Full database
Desktop/Microphone
The Danish SpeechDat-Car comprises the recordings of 300 Danish speakers from 5 different regions (162 males, 138 females), recorded over the GSM telephone network, and in a car. This database is partitioned into 15 DVDs (53 GB), plus 1 CD-ROM for e.g. non-signal files and documentation. The speech databases made within the SpeechDat-Car project were validated by SPEX, the Netherlands, to assess their compliance with the SpeechDat-Car format and content specifications.
The speech data files are in two formats. Four of the microphones were recorded on a computer in the boot of the car; these speech data are stored as uncompressed 16-bit samples at 16 kHz. The fifth microphone was connected to the cell phone and was recorded on a remote machine, with compressed data stored as 8-bit A-law samples at 8 kHz. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
2 voice activation keywords
1 sequence of 10 isolated digits
7 connected digits : 1 sheet number (5+ digits), 1 spontaneous telephone number, 3 read telephone numbers, 1 credit card number (14-16 digits), 1 PIN code (6 digits)
3 dates : 1 spontaneous date (e.g. birthday), 1 prompted date, 1 relative or general date expression
2 word spotting phrases using an application word (embedded)
4 isolated digits
7 spelled words : 1 spontaneous (own forename or surname), 1 spelling of directory city name, 4 real word/name, 1 artificial name for coverage
1 money amount
1 natural number
7 directory assistance names : 1 spontaneous (own forename or surname), 1 city of birth / growing up (spontaneous), 2 most frequent cities, 2 most frequent company/agency, 1 "forename surname"
9 phonetically rich sentences
2 time phrases : 1 time of day (spontaneous), 1 time phrase (word style)
4 phonetically rich words
67 application words: 13 mobile phone application words, 22 IVR function keywords, 32 car products keywords
2 additional language dependent keywords
Prompts for spontaneous speech
2 additional keywords from a list of 10
The following age distribution has been obtained: 84 speakers are between 18 and 30, 99 speakers are between 31 and 45, 98 speakers are between 46 and 60, and 19 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-001399: Discourse Graphbank
*Introduction*
Developed as part of Florian Wolf's Ph.D. thesis, the Discourse Graphbank aimed to define a descriptively adequate data structure for representing discourse coherence structures. The project also investigated the impact of discourse coherence structures on other linguistic processes and natural language applications (e.g. anaphor resolution, summarization, information retrieval), and developed and tested discourse parsing algorithms.
*Data*
The data consists of 135 texts from AP Newswire and Wall Street Journal, annotated with coherence relations. The source was UPenn TIPSTER.
*Samples*
A screenshot of the output of the annotator tool has been provided as an example of this corpus.
References: Florian Wolf, et al. 2005. Discourse Graphbank. Linguistic Data Consortium, Philadelphia.