Language resource #: 3330
-
C-001323: Multilingual Spoken Language Corpus: Spanish
The present corpus is published as one of the accomplishments of the 21st Century Center of Excellence (COE) Program
- hasVersion: C-001319: Canada Multilingual Spoken Corpus
- hasVersion: C-001320: French (Aix) Multilingual Spoken Corpus
- hasVersion: C-001321: French (Paris) Multilingual Spoken Corpus
- hasVersion: C-001322: Malay Multilingual Spoken Corpus
- hasVersion: C-001324: Multilingual Spoken Language Corpus: Turkish
- hasVersion: C-004967: Taiwanese Mandarin Multilingual Spoken Corpus
- hasVersion: C-004968: Spanish Multilingual Spoken Corpus 2006
-
C-001324: Multilingual Spoken Language Corpus: Turkish
The present corpus is published as one of the accomplishments of the 21st Century Center of Excellence (COE) Program
- hasVersion: C-001319: Canada Multilingual Spoken Corpus
- hasVersion: C-001320: French (Aix) Multilingual Spoken Corpus
- hasVersion: C-001321: French (Paris) Multilingual Spoken Corpus
- hasVersion: C-001322: Malay Multilingual Spoken Corpus
- hasVersion: C-001323: Multilingual Spoken Language Corpus: Spanish
- hasVersion: C-004967: Taiwanese Mandarin Multilingual Spoken Corpus
- hasVersion: C-004968: Spanish Multilingual Spoken Corpus 2006
-
C-001325: ARCADE/ROMANSEVAL corpus
Written Corpora
The ARCADE/ROMANSEVAL corpus was used as a reference corpus in two international competitions:
· ARCADE, an exercise on multilingual text alignment financed by AUPELF-UREF
· ROMANSEVAL, part of the SENSEVAL exercise sponsored by ACL-SIGLEX and EURALEX, on word sense disambiguation.
The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Questions and Answers from the Official Journal of the European Commission).
The annotation concerns all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3700 contexts all together, and comprises:
· semantic tagging of all the occurrences of the test words in the JOC corpus for French and Italian;
· word-level alignment of all the occurrences of the test words between French and English.
Additional information: http://www.lpl.univ-aix.fr/projects/arcade and http://www.lpl.univ-aix.fr/projects/romanseval
-
C-001326: AURORA Project Database 2.0 - Evaluation Package
The Aurora project was originally set up to establish a worldwide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system. ETSI formally adopted this activity as work items 007 and 008. The two work items within ETSI are:
- ETSI DES/STQ WI007 : Distributed Speech Recognition - Front-End Feature Extraction Algorithm & Compression Algorithm
- ETSI DES/STQ WI008 : Distributed Speech Recognition - Advanced Feature Extraction Algorithm.
The Aurora project is releasing a revised version of the Noisy TI digits database to follow on the work of ETSI. This CD set is a replacement for the previous set (version 1.0 consisted of 2 CDs while version 2.0 now consists of 4 CDs).
This database is intended for the evaluation of front-end feature extraction algorithms in background noise but may also be used more widely by speech researchers to evaluate and compare the performance of noise-robust speech recognition algorithms.
Compared to version 1.0, the changes are as follows:
- The files are restored to the energy level of the original speech in the TI digits database.
- One of the noise types added to the speech (the babble noise) has been changed.
- There is an additional test set in which the noises are mismatched to those used in the training set.
- There is a convolutional distortion test.
- There is a clean training set.
The CD-ROM will be used for the next round of ETSI Aurora standards evaluation.
Two original copies of the contract must be sent to ELDA. To be valid these contracts must be initialled and signed. The user should annex to the contract the proof that he obtained the right to use the TI digits from LDC (ref. LDC93S10). This may be a signed licence agreement or a proof of membership payment for 1993.
For further details, please check the following website: http://aurora.hsnr.de/
-
C-001327: Amaryllis Corpus - Evaluation Package
Written Corpora
Launched at the end of 1995, the AMARYLLIS project aimed at evaluating information retrieval software for French text corpora in order to provide a methodology for the evaluation of other similar tools. AMARYLLIS was organised by the Institut de l'Information Scientifique et Technique (INIST) with the support of the Agence francophone pour l'enseignement supérieur et la recherche (AUPELF-UREF) and the French Ministère de l'Education Nationale, de la Recherche et de la Technologie (MERT).
More specifically, the objective was to create document corpora, questions and answers, in the framework of the Action de Recherche Concertée (ARC A1, renamed Amaryllis - Access to text information in French), in order to carry out work similar to the United States TREC project.
All corpora are structured as SGML files with ISO Latin character encoding.
The available corpora were provided by:
- INIST (Institut de l'Information Scientifique et Technique)
- OFIL (Observatoire Français et International des Industries de la Langue)
- ELRA (European Language Resources Association)
Each provider supplied three types of corpora: text documents, search topics and answers to these topics in the corresponding text corpora (with frames of reference for the answers).
1- Text documents in French
The text documents in French comprise:
- Articles (titles and texts) extracted from the newspaper "Le Monde"; each batch contains three months of documents, provided by OFIL (01-01-93/31-03-93, 01-04-93/30-06-93),
- Titles and summaries of scientific articles covering every domain from the Pascal bibliographical databases (from 1984 to 1995) and Francis (from 1992 to 1995), provided by INIST.
The tagging of the documents conforms to a simplified version of a DTD from the TEI, which includes the possibility to manage the logical structure.
2- Multilingual text documents
The multilingual text documents have been provided by ELRA, and comprise documents in 6 languages (French, English, Italian, Spanish, German and Portuguese), extracted from the parallel corpus MLCC, which contains documents translated into the official European languages (from 1992 to 1994). The corpus was divided into two sub-corpora: written questions (10 million words) and debates of the European Parliament (5 to 8 million words per language).
3- Search topics
The topics derive from questions asked by end users and should contain all the information needed to understand the issue they deal with and to estimate relevance. They comprise the following items:
- A domain, which determines the field of knowledge they belong to,
- A topic, which serves as a title defining the subject,
- A question, which matches the question the user may ask,
- Complementary information, which gives details on further documents that should be selected from the corpus,
- Concepts, which are a set of descriptors used to set the limits of the search.
The topics have been built by OFIL, by some documentalists working for Le Monde who used requests from journalists, and by engineers responsible for documentation at INIST (experts in their domain) who used requests from end users. These topics were to cover numerous application fields, and to get a large number of relevant results in each corpus. The topics have been tested on the corpora to control their relevance. The query may have had to be modified, or some further details may have been needed.
4- Frames of reference for the answers
The answer files contain, for each numbered topic, the numbers of all relevant documents. Frames of reference for the answers were established before the participants proceeded to the tests. The answers had been selected by the providers (OFIL and INIST) with an appropriate methodology and adequate tools (initial frames of reference): they pre-selected as broad a set of documents as possible, based not only on titles and summaries but also on the key words and classification codes used in the Pascal and Francis databases. These key words and classification codes cannot be accessed by the participants. The results (a set of documents) are sorted manually so that they match the query as closely as possible.
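As a rough illustration of how such a frame of reference can be used, the sketch below scores a system's output against the relevant documents listed for each topic. The document identifiers, the dictionary layout and the function names are hypothetical; the on-disk format of the answer files is not described here, so no file parsing is attempted, and this is not the TrecEval program itself.

```python
# Hypothetical in-memory form of an answer file: topic number -> relevant document numbers.
# The identifiers are invented for illustration only.
reference = {
    1: {"doc-0012", "doc-0040", "doc-0077"},
    2: {"doc-0003"},
}

# Hypothetical ranked output of a retrieval system for the same topics.
system_run = {
    1: ["doc-0040", "doc-0005", "doc-0012"],
    2: ["doc-0009", "doc-0003"],
}


def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a retrieved list against a set of relevant documents."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(run, frames):
    """Score every topic of the frame of reference; topics missing from the run score zero."""
    return {topic: precision_recall(run.get(topic, []), relevant)
            for topic, relevant in frames.items()}


scores = evaluate(system_run, reference)
for topic, (p, r) in sorted(scores.items()):
    print(f"topic {topic}: precision={p:.2f} recall={r:.2f}")
```

Precision and recall per topic are only the simplest measures; a full evaluation would normally also take the ranking of the retrieved documents into account.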
The initial frames of reference were checked manually by the providers (INIST and OFIL), using the answers given by the participants. These answers were collected when the tests were finished. This allowed us to review and correct the frames of reference for the answers so that they give even more detailed information about their content. The illustration below shows how the review was performed.
The 4 CDs each contain a corpus for the two phases of the two campaigns which took place.
TrecEval is also provided.
-
C-001328: Arabic CTS Levantine Fisher Training Data Set 3, Speech
*Introduction*
Arabic CTS Levantine Fisher Training Data Set 3 Speech consists of 322 conversations, representing a total of about 50 hours of Levantine Arabic speech. The corresponding human annotated transcripts are contained in Arabic CTS Levantine Fisher Training Data Set 3, Transcripts (LDC2005T03).
The Fisher telephone conversation collection protocol was created at LDC to address a critical need of developers trying to build robust automatic speech recognition (ASR) systems. Previous collection protocols, such as CALLFRIEND and Switchboard-II and the resulting corpora, have been adapted for ASR research but were in fact developed for language and speaker identification respectively. Although the CALLHOME protocol and corpora were developed to support ASR technology, they feature small numbers of speakers making telephone calls of relatively long duration with narrow vocabulary across the collection. CALLHOME conversations were challengingly natural and intimate. Under the Fisher protocol, a very large number of participants each made a few calls of short duration speaking to other participants, whom they typically did not know, about assigned topics. This maximized inter-speaker variation and vocabulary breadth although it also increased formality.
Previous protocols such as CALLHOME, CALLFRIEND and Switchboard relied upon participant activity to drive the collection. Fisher was unique in being platform driven rather than participant driven. Participants who wished to initiate a call did so; however, the collection platform initiated the majority of calls. Participants simply answered their phones at the times they specified when registering for the study.
To encourage a broad range of vocabulary, Fisher participants were asked to speak about an assigned topic chosen from a randomly generated list that changed every 24 hours. All participants that day were assigned subjects from that list. Some topics were inherited or refined from previous Switchboard studies while others were developed specifically for the Fisher protocol.
-
C-001329: BAS GEO1
Speech Related
BAS GEO1 is a simple database about the most important location names of Germany, Austria and Switzerland together with their canonical pronunciation coded in SAMPA.
BAS GEO1 may be used as a basis for automatic speech recognition of German postal addresses or to feed a speech synthesis algorithm. Future updates will be distributed to all users automatically (if a valid email address is provided).
BAS GEO1 contains 3 data sets:
o List of all locations with the following fields:
* Location ID
* Gemeinde name
* Gemeinde name pronunciation
* Postal code
* Location name
* Location name pronunciation
* Kreis name
* Kreis name pronunciation
* State name
* State name pronunciation
* Car code
* Phone area code
* Population (in 2003)
o A list of all street names:
* Street ID
* Street name
* Street name pronunciation
o A mapping of Locations to Streets:
* Location ID
* Street ID
-
C-001330: BITS Unit Selection Synthesis Corpus
Desktop/Microphone
BITS stands for "BAS Infrastructures for Technical Speech Processing" and was funded by the German Ministry of Science and Education during 2003-2005.
The BITS synthesis corpus consists of two parts: a set of logatome recordings for controlled diphone synthesis (ELRA-S0217) and a set of sentence recordings for unit selection techniques (ELRA-S0224).
This corpus contains 6,732 recordings spoken by 4 professional German speakers covering all German diphone combinations in different prosodic contexts.
The data is stored on 4 DVDs. Each DVD contains the recordings, the annotation files and the meta data files of one of the four professional speakers, and the entire corpus' documentation. Each speaker was recorded in an insulated room with low reverberation.
Each sentence was recorded in three channels: close microphone, large membrane microphone and laryngographic signal. All recordings are segmented and labelled into phonemic units as well as annotated prosodically.
The same 4 professional speakers also spoke the BITS Logatome Synthesis Corpus (ELRA-S0217) enabling the user to combine diphone and unit selection techniques based on the same speakers.
Total number of recordings: 6,732
Total duration: 813 minutes
Format: WAV 48kHz, 16 bit, Praat TextGrid, BAS Partitur Format (BPF)
Segmentation: extended German SAM-PA
Prosodic Annotation: GTobi 'Light'
-
C-001346: Bizkaifon (Bizkaieraren Fonoteka)
Desktop/Microphone
Bizkaifon contains sound archives and associated information of dialectal varieties of spoken Basque. The database was collected by the Department of Electronics and Telecommunications, University of the Basque Country, with the financial help of the Diputación Foral de Bizkaia. It consists of 21 hours of spontaneous and read speech, recorded over a microphone in a room, for Bizcayan Basque linguistic and phonetic research. The database is stored on 5 CDs. It includes four different types of material:
- 267 archives of popular literature (popular knowledge, songs and folklore).
- 797 texts.
- 1788 grammatically complete sentences.
- 11569 isolated words.
The database is structured as follows: sound files (.WAV) placed in different directories, with, for each sound file, a text TEI SGML file which stores data according to the TEI Consortium P4 recommendations (Text Encoding Initiative, 1994), and a binary file containing the same data as the SGML, stored in a format that grants faster access to the data. Speech is transcribed orthographically into standard and dialectal Basque.
For more information: http://bizkaifon.ehu.es/ahoweb/
-
C-001349: C-ORAL-ROM - Integrated reference corpora for spoken romance languages. Multi-media edition; tools of analysis; standard linguistic measurements for validation in HLT
Desktop/Microphone
Description
The C-ORAL-ROM resource is a multilingual corpus of spontaneous (1) speech for the main Romance languages of around 1,200,000 words (IST 2000-26228). The resource comprises three components:
a) Multimedia corpus;
b) Speech software;
c) Appendix.
The corpus consists of four comparable recording collections of Italian, French, Portuguese and Spanish spontaneous speech sessions (around 300,000 words for each language). The collections are delivered respectively by the following providers:
* Università di Firenze (Dipartimento di Italianistica, LABLITA);
* Université de Provence (Description Linguistique Informatisée sur Corpus);
* Fundação da Universidade de Lisboa/Centro de Linguística da Universidade de Lisboa
* Universidad Autónoma de Madrid (Departamento de Lingüística, Lenguas Modernas, Lógica y F. de la Ciencia, Laboratorio de Lingüística Informática).
The C-ORAL-ROM corpus provides the acoustic source of each session together with the following main annotations:
* The orthographic transcription, in CHAT format, enriched with the tagging of terminal and non-terminal prosodic breaks;
* Session metadata;
* The text-to-speech synchronization, in WIN PITCH CORPUS format, based on the alignment of each transcribed utterance.
The multimedia corpus comes with the speech software Win Pitch Corpus (© Pitch France. Minimum configuration: Pentium III, 1 GHz, 252 MB RAM, S-blaster or compatible sound card, running under Windows 2000 or XP only. GDPLUS.dll must be installed in the same directory as the program). (2) A series of appendices is also provided, containing: a) the purely textual corpus in .TXT and .XML format; b) the PoS tagging of the whole corpus and the corresponding frequency lists of lemma forms in .TXT files; c) a set of linguistic measurements extracted from the main corpus annotations, in Excel files; d) the specifications and validation of the resource; e) corpus metadata.
Package
1. DVDs 1 to 8 contain the multimedia corpus edition (DVDs 1-2 French; DVDs 3-4 Italian; DVDs 5-6 Portuguese; DVDs 7-8 Spanish). All collections have the same folder structure, which directly mirrors the C-ORAL-ROM corpus design (see below). For each session, the folders contain the following:
* the uncompressed .WAV files (Windows PCM: 22,050 Hz; 16 bit);
* the .TXT file of the transcripts;
* the .XML file defining the text to speech alignment in WIN PITCH CORPUS format and its .DTD
2. The CD contains the speech software and the Appendix:
a) Speech software
The speech software Win Pitch Corpus (10 licenses)
b) Appendix
The C-ORAL-ROM transcription files in .TXT and .XML format
The C-ORAL-ROM transcription files with PoS tagging in .TXT files
The frequency list of lemmas for each language collection in TXT files
Measurements of spoken language variability in EXCEL files
The Corpus specifications:
a) Corpus design;
b) Metadata description;
c) Dialogue representation format;
d) Prosodic tagging;
e) Alignment format;
f) XML format;
g) PoS tagging and lemma formats;
h) Glossaries.
Resource Validation reports
Multimedia sample files
Main Features
The resource aims to represent the variety of speech acts performed in everyday language and to enable the induction of prosodic and syntactic structures in the four Romance languages, from a quantitative and qualitative point of view. The resource has been designed for prosodic modeling, test bed procedures in HLT and corpus-based studies of spontaneous speech. C-ORAL-ROM offers significant added value at the following levels:
* Corpus design
* Metadata
* Dialogue representation
* Prosodic annotation
* PoS tagging
* Multimedia storage
* Speech analysis
CORPUS DESIGN
The corpus design of the C-ORAL-ROM resource aims to ensure that a large variety of speech act typologies and natural prosodic contours, which are the most characteristic linguistic feature of spontaneous speech, can occur. To this end the main variation parameters of the spoken domain (channel variation, dialogue structure, sociological domain of use, and semantic domain of application) are represented in a corpus design schema, covering a wide range of semantic and pragmatic domains of application.
The four language collections are considered comparable insofar as they fit the corpus design schema. More specifically, each language collection in the C-ORAL-ROM corpus is consistent with the following average structure (check the documentation for deviations):
INFORMAL/150,000 words from at least 64 texts of 1500 words each and 10 texts of 4500 words each
INFORMAL/Family-Private context/124,500 words
INFORMAL/Family-Private context/Monologues/42,000 words
INFORMAL/Family-Private context/Dialogues-Conversations/82,500 words
INFORMAL/Public context/25,500 words
INFORMAL/Public context/Monologues/6,000 words
INFORMAL/Public context/Dialogues-Conversations/19,500 words
FORMAL/150,000 words
FORMAL/Formal in natural context/2 or 3 samples of 3000 words on average for each of the following typical domains of use, for 65,000 words in total
FORMAL/Formal in natural context/political speech
FORMAL/Formal in natural context/political debate
FORMAL/Formal in natural context/preaching
FORMAL/Formal in natural context/teaching
FORMAL/Formal in natural context/professional explanation
FORMAL/Formal in natural context/conference
FORMAL/Formal in natural context/business
FORMAL/Formal in natural context/law (through media allowed)
FORMAL/Media context/2 or 3 samples of 3000 words on average for each of the following typical domains of use, for 60,000 words in total
FORMAL/Media context/news (small sample)
FORMAL/Media context/meteo (small sample)
FORMAL/Media context/interviews
FORMAL/Media context/reportage
FORMAL/Media context/scientific press
FORMAL/Media context/sport talk shows
FORMAL/Media context/political debate
FORMAL/Media context/talk shows thematic discussions
FORMAL/Media context/talk shows culture
FORMAL/Media context/talk shows science
FORMAL/Telephone/25,000 words (3)
FORMAL/Telephone/private conversations
FORMAL/Telephone/phone calls to services or man-machine interaction (10,000 words) (4)
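The word targets in this schema can be cross-checked with a few lines of code. The sketch below is only an illustration: the nested-dictionary layout, the key names and the helper function are assumptions, and the Public context figure printed as "25.500" is read as 25,500 words.

```python
# Word targets per branch of the corpus design schema, copied from the listing above.
# The nested-dict layout and the key names are illustrative assumptions.
design = {
    "INFORMAL": {
        "Family-Private context": {"Monologues": 42_000, "Dialogues-Conversations": 82_500},
        "Public context": {"Monologues": 6_000, "Dialogues-Conversations": 19_500},
    },
    "FORMAL": {
        "Formal in natural context": 65_000,
        "Media context": 60_000,
        "Telephone": 25_000,
    },
}


def total_words(node):
    """Sum the word targets of a branch: a leaf is an int, an inner node is a dict."""
    if isinstance(node, int):
        return node
    return sum(total_words(child) for child in node.values())


informal = total_words(design["INFORMAL"])
formal = total_words(design["FORMAL"])
print(informal, formal, informal + formal)
```

Under these assumptions the INFORMAL and FORMAL branches each add up to the stated 150,000 words, which is consistent with the figure of around 300,000 words per language collection.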
METADATA
For each session a rich set of metadata is delivered in CHAT format, ensuring multitask exploitation of the resource for linguistics and human language technologies. The metadata contain essential information regarding the speakers, the recording situation, the topic, the acoustic quality and the source of the collected data.
DIALOGUE REPRESENTATION
Corpora are orthographically transcribed in a standard textual format (CHAT format; MacWhinney, 1994) with annotation of speakers' turns. The textual string is divided into utterances. The main non-linguistic and paralinguistic acoustic events in the speech flow are reported in the transcripts.
PROSODIC ANNOTATION
The four Romance collections are completely tagged with respect to prosodic breaks. Terminal and non-terminal breaks are discriminated through perceptual judgments and reported in the transcripts. The level of inter-annotator agreement on prosodic tag assignment has been validated by an external institution.
MULTIMEDIA STORAGE
The multimedia storage ensures a natural and meaningful text/sound correspondence for prosodic modeling, test bed procedures and corpus-based studies of spontaneous speech.
SPEECH SOFTWARE
Win Pitch Corpus is a software program for computer-aided alignment of large corpora. It provides a method for easy and precise selection of alignment units, ranging from syllables to whole sentences, in a hierarchical storage system for aligned data. The method is based on the ability to visually link a moving target with the perception of the corresponding speech sound played back at a rate reduced by 30% or more.
Segments derived from alignment can be defined on 8 independent layers, with automatic generation of the corresponding database, which can be saved directly in both XML and Excel formats. Besides text-to-speech alignment, Win Pitch Corpus, which is Unicode compliant, has numerous features allowing easy and efficient acoustical analysis of speech, such as real-time fundamental frequency tracking, spectrographic display and re-synthesis after editing of prosodic parameters.
For more information: http://www.elda.org/en/proj/coralrom.html
___________________
(1) As defined in C-ORAL-ROM: comprising formal and informal speech.
(2) ELDA does not take responsibility for software products distributed with the resources. Pitch France is fully responsible for this software.
(3) Text length not defined (preferably an upper limit of 1500 words, no lower limit).
(4) Field not present in the Portuguese corpus. The texts in this field are not delivered aligned to the acoustic source.