Language resource #: 3330
-
C-000411: Venice Italian Treebank (VIT)
Written Corpora
The VIT (Venice Italian Treebank) is the result of a collaboration among people working at the Laboratory of Computational Linguistics (LCL) of the University of Venice between 1995 and 2005. It is partly the result of annotation carried out internally, with no specific project in mind and no financial support. This work was partly related to the development of a lexicon, a morphological analyzer, a tagger and a deep parser for Italian. All these resources were ready by the early 1990s, when the LCL became involved in the first national projects.
The VIT contains about 272,000 words distributed over six different domains, which is what makes it so relevant for studying the structure of the Italian language. The following domains were annotated:
Domain Number of words Time span
Bureaucratic 20,000 1986
Politics 40,000 1984
Economic & financial 12,000 1987
Literary 10,000 1984
Scientific 20,000 1985
News 170,000 1994
In addition, some 60,000 tokens of spoken dialogues in different Italian varieties were annotated.
The annotation follows general X-bar criteria with 29 constituency labels and 102 PoS tags. VIT is also made available in a broad annotation version with 10 constituency labels and 22 PoS tags for machine learning purposes.
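A labelled square-bracket tree of this kind can be read with a short recursive parser. The sketch below is illustrative only: the sample sentence and the label names are invented and do not reflect the actual VIT tagset.

```python
import re

def parse_bracketed(text):
    """Parse a labelled square-bracket tree into nested (label, children) tuples.

    Leaves (word tokens) appear as plain strings in the children list.
    """
    tokens = re.findall(r'\[|\]|[^\[\]\s]+', text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume '['
        label = tokens[pos]           # constituent label follows the bracket
        pos += 1
        children = []
        while tokens[pos] != ']':
            if tokens[pos] == '[':
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ']'
        return (label, children)

    return node()

# Hypothetical example (labels are placeholders, not VIT's):
tree = parse_bracketed("[f [sn Maria] [ibar dorme]]")
```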
The format is plain text with square bracketing. However, a UPenn-style version, readable by the open-source query language CorpusSearch, is also provided.
-
C-000415: German Speecon database
Desktop/Microphone
The German Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 562 adult German speakers (272 males, 290 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2) The second set comprises the recordings of 50 child German speakers (25 boys, 25 girls), recorded over 4 microphone channels in 1 recording environment (children room).
This database is partitioned into 29 DVDs (first set) and 3 DVDs (second set).
The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications. Each of the four speech channels is recorded at 16 kHz, 16 bit, as uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file is accompanied by an ASCII SAM label file containing the relevant descriptive information.
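A headerless signal file in this format can be loaded with the standard library alone. This is a minimal sketch under the assumptions stated in the entry (16-bit unsigned samples, lo-hi byte order); the function name and file path are illustrative, not part of the distribution.

```python
import array
import sys

def read_speecon_raw(path):
    """Read a headerless signal file of 16-bit samples in Intel
    (lo-hi, i.e. little-endian) byte order, as described in the
    catalogue entry, and return them as a list of integers."""
    samples = array.array('H')        # 'H' = 16-bit unsigned, per the entry
    with open(path, 'rb') as f:
        samples.frombytes(f.read())
    if sys.byteorder == 'big':        # byte-swap only on big-endian hosts
        samples.byteswap()
    return list(samples)
```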
Each speaker uttered the following items (over 290 items for adults and over 210 items for children):
Calibration data:
6 noise recordings
The silence word recording
Free spontaneous items (adults only):
5 minutes (session time) of free spontaneous, rich-context items (story telling), covering an open number of spontaneous topics out of a set of 30 topics
Elicited spontaneous items (adults only, 17 items):
3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
Read speech:
30 phonetically rich sentences uttered by adults and 60 uttered by children
5 phonetically rich words (adults only)
4 isolated digits
1 isolated digit sequence
4 connected digit sequences
1 telephone number
3 natural numbers
1 money amount
2 time phrases (T1 : analogue, T2 : digital)
3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
3 letter sequences
1 proper name
2 city or street names
2 questions
2 special keyboard characters
1 Web address
1 email address
208 application specific words and phrases per session (adults)
74 toy commands, 14 phone commands and 34 general commands (children)
The following age distribution has been obtained:
Adults: 259 speakers are between 15 and 30, 193 speakers are between 31 and 45, 110 speakers are 46 or older.
Children: 23 speakers are between 8 and 10, and 27 speakers are between 11 and 14.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000416: BITS Logatome Synthesis Corpus BITS-LG
Desktop/Microphone
BITS stands for "BAS Infrastructures for Technical Speech Processing" and was funded by the German Ministry of Science and Education during 2003-2005. The BITS synthesis corpus consists of two parts: a set of logatome recordings for controlled diphone synthesis (ELRA-S0217) and a set of sentence recordings for unit selection techniques (ELRA-S0224).
This corpus contains 11,036 recordings of logatomes spoken by 4 professional German speakers, covering all German diphone combinations as well as the most prominent German-French and German-English combinations (each speaker had at least foreign-language competence in English).
The data is stored on 4 DVDs. Each DVD contains the recordings, the annotation files and the meta data files of one of the four professional speakers, and the entire corpus' documentation. Each speaker was recorded in an insulated room with low reverberation.
Each logatome was recorded in three channels: close microphone, large membrane microphone and laryngographic signal. All diphones are segmented and labelled into phonemic units.
Total number of recordings: 11036
Total duration: 187 minutes
Format: WAV 48 kHz, 16 bit, Praat TextGrid, BAS Partitur Format (BPF)
Segmentation: extended German SAM-PA
- hasPart: BITS Unit Selection Synthesis Corpus
-
C-000418: Italian Syntactic-Semantic Treebank (ISST)
Written Corpora
ISST comprises 89,941 tokens for the financial-domain part and 215,606 tokens for the general part. It is formatted in XML.
ISST has a five-level structure covering orthographic, morpho-syntactic, syntactic and semantic levels of linguistic description. Syntactic annotation is distributed over two different levels: the constituent structure level and the functional relations level. The fifth level deals with lexico-semantic annotation, which is carried out in terms of sense tagging of lexical heads (nouns, verbs and adjectives) augmented with other types of semantic information: ItalWordNet (see ELRA-M0018) is the reference lexical resource used for the sense tagging task. Both syntactic and lexico-semantic annotations refer to the morpho-syntactically annotated text, which in turn is linked to the orthographic file with the text and mark-up of macrotextual organisation (e.g. titles, subtitles, summary, body of article, paragraphs).
The multi-level structure of ISST shows two main novelties with respect to other treebanks:
1) while most treebanks are restricted to syntactic annotation only, ISST includes both syntactic and semantic annotation levels. In this way, the prerequisites are set up for corpus-based investigations on the syntax-semantics interface: the linking of the syntactic and semantic annotation layers permits, for instance, the identification of specific subcategorisation properties associated with a specific word sense, or of the semantic types associated with the functional positions of a given predicate;
2) the other innovative aspect of ISST concerns the distributed approach to syntactic annotation. In this respect, ISST differs from most treebanks, which adopt a unique syntactic representation layer. ISST also differs from multi-level treebanks like the Prague Dependency Treebank (PDT): whereas the PDT annotation levels refer respectively to a) the surface dependency relations and b) the underlying sentence structure, the ISST syntactic annotation levels are intended to provide orthogonal views of the same surface syntax.
The adopted morpho-syntactic annotation scheme conforms to the EAGLES international standard.
ISST constituency annotation departs from other constituency-based syntactic annotation schemes (e.g. the one adopted in the Penn Treebank) in a number of respects, mainly due to the distributed organisation of syntactic annotation: annotation at this level consists in the identification of phrase boundaries together with the labelling of constituent types; because functional relations are handled at a distinct level, ISST tree structures are shallow.
The ISST functional annotation scheme is based on FAME (Lenci et al. 1999, 2000), whose main features can be summarised as follows: a) a hierarchical organisation of functional relations, which makes provision for underspecified representations of highly ambiguous functional analyses; b) a modular coding architecture articulated over different information layers, each factoring out different but possibly interrelated linguistic facets of syntactic annotation. FAME originated as a revision of a de facto standard, namely the functional annotation scheme developed in the framework of the LE-2111 SPARKLE project; the revision was first carried out to better comply with the basic requirements of parsing evaluation (in the framework of the LE-8340 ELSE project), and then to make the scheme suitable for the annotation of unrestricted Italian texts.
References:
Lenci A., Montemagni S., Pirrelli V., Soria C., FAME: a Functional Annotation Meta-scheme for Multimodal and Multi-lingual Parsing Evaluation, in Proceedings of the ACL99 Workshop on Computer-Mediated Language Assessment and Evaluation in NLP, University of Maryland, June 22nd 1999.
Lenci A., Montemagni S., Pirrelli V., Soria C., Where opposites meet. A Syntactic Meta-scheme for Corpus Annotation and Parsing Evaluation, in Proceedings of LREC-2000, 31/5-2/6 2000, Athens, 625-632.
Articles describing ISST:
Simonetta Montemagni, Francesco Barsotti, Marco Battista, Nicoletta Calzolari, Ornella Corazzari, Alessandro Lenci, Antonio Zampolli, Francesca Fanciulli, Maria Massetani, Remo Raffaelli, Roberto Basili, Maria Teresa Pazienza, Dario Saracino, Fabio Zanzotto, Nadia Mana, Fabio Pianesi, Rodolfo Delmonte, Building the Italian Syntactic-Semantic Treebank, in Anne Abeillé (ed.), Building and using Parsed Corpora, Language and Speech series, Kluwer, Dordrecht, pp. 189-210.
Simonetta Montemagni, Francesco Barsotti, Marco Battista, Nicoletta Calzolari, Ornella Corazzari, Alessandro Lenci, Vito Pirrelli, Antonio Zampolli, Francesca Fanciulli, Maria Massetani, Remo Raffaelli, Roberto Basili, Maria Teresa Pazienza, Dario Saracino, Fabio Zanzotto, Nadia Mana, Fabio Pianesi, Rodolfo Delmonte, 2003, The syntactic-semantic treebank of Italian. An overview, Linguistica Computazionale XVI-XVII, pp. 461-492.
- isReferencedBy: D-000928: EuroWordNet Italian
-
C-000419: PAIDIALOGOS (NEOLOGOS Project)
Telephone
The PAIDIALOGOS database was produced within the French national project NEOLOGOS, as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The databases produced in the framework of the NEOLOGOS project are designed for the development and assessment of French speech or speaker recognizers and speech synthesizers. They consist of:
1) the IDIOLOGOS databases, made of adult voices and available in 2 subsets:
- the Bootstrap database (catalogue ref. ELRA-S0226-01),
- the Eigenspeakers database (catalogue ref. ELRA-S0226-02);
2) the PAIDIALOGOS database (catalogue ref. ELRA-S0227), made of children's and teenagers' voices.
The PAIDIALOGOS database contains 37,364 utterances from 1010 child French speakers (510 males and 500 females) recorded over the French fixed telephone network.
This database is distributed on 1 DVD-ROM. The speech files are stored as uncompressed 8-bit, 8 kHz A-law samples, according to the specifications of NEOLOGOS. Each prompt utterance is stored in a separate file and has an accompanying ASCII SAM label file.
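A-law here refers to standard G.711 companding, so each stored byte expands to a signed 16-bit PCM sample. A minimal decoder sketch (the function itself is illustrative and not part of the distribution):

```python
def alaw_to_linear(byte):
    """Decode one G.711 A-law byte to a signed 16-bit PCM sample."""
    b = byte ^ 0x55                   # A-law inverts the even bits on the wire
    t = (b & 0x0F) << 4               # quantisation (mantissa) bits
    seg = (b & 0x70) >> 4             # segment (exponent) bits
    if seg == 0:
        t += 8
    elif seg == 1:
        t += 0x108
    else:
        t = (t + 0x108) << (seg - 1)
    return t if b & 0x80 else -t      # sign bit set means positive
```

The extreme code words decode to the smallest and largest magnitudes, +/-8 and +/-32256.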
This speech database was validated by SPEX (the Netherlands) to assess its compliance with the NEOLOGOS format and content specifications.
Each speaker uttered the following items:
- 3 application words (set of 42)
- 4 connected digits: 2 sequences of 3 isolated digits, 1 sheet number (7 digits), 1 telephone number (10 digits)
- 3 dates (1 spontaneous date e.g. birthday, 1 word style prompted date, 1 relative and general date expression)
- 2 isolated digits
- 3 spelled words (1 surname, 1 directory assistance city name, 1 real/artificial name for coverage)
- 1 currency money amount
- 1 natural number
- 4 directory assistance names (1 spontaneous, e.g. own surname, 1 city where the call is made from, 1 most frequent French city out of a set of 40, 1 forename surname)
- 2 yes/no questions (1 predominantly yes question, 1 predominantly no question)
- 6 phonetically rich sentences
- 2 time phrases (1 time of call, 1 word style time phrase)
- 2 phonetically rich words
The following age distribution has been obtained: 6 speakers are under 7, 541 speakers are between 7 and 11, 308 speakers are between 12 and 14, 154 speakers are between 15 and 16, and 1 speaker is over 16.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
- hasPart: IDIOLOGOS 1 "Bootstrap" (NEOLOGOS Project)
- hasPart: IDIOLOGOS 2 "Eigenspeakers" (NEOLOGOS Project)
-
C-000420: TC-STAR 2006 Evaluation Package - ASR English
Desktop/Microphone
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The second TC-STAR evaluation campaign took place in March 2006.
Three core technologies were evaluated during the campaign:
Automatic Speech Recognition (ASR),
Spoken Language Translation (SLT),
Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for English. The same packages are available for both Spanish (ELRA-E0012) and Mandarin (ELRA-E0013) for ASR and for SLT in 3 directions, English-to-Spanish (ELRA-E0014), Spanish-to-English (ELRA-E0015), Chinese-to-English (ELRA-E0016).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: EPPS (European Parliament Plenary Sessions) task, CORTES (Spanish Parliament Sessions) task and VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the EPPS task and consists of 2 data sets:
- Development data set: consists of audio recordings of Parliament sessions from 6 June to 7 July 2005, manually transcribed. Approximately 3 hours of recordings were selected and transcribed, corresponding to approximately 35,000 running words in English.
- Test data set: consists of audio recordings of Parliament sessions from September to November 2005. As for the development set, the test data set is made of 3 hours (35,000 running words).
- hasVersion: C-000421: TC-STAR 2006 Evaluation Package - ASR Spanish - CORTES
- hasVersion: C-001726: TC-STAR 2006 Evaluation Package - ASR Spanish - EPPS
- hasVersion: C-000422: TC-STAR 2006 Evaluation Package - ASR Mandarin Chinese
- hasVersion: C-001231: TC-STAR 2005 Evaluation Package - ASR English
- hasVersion: C-001233: TC-STAR 2005 Evaluation Package - ASR Spanish
- hasVersion: C-001232: TC-STAR 2005 Evaluation Package - ASR Mandarin Chinese
- hasVersion: TC-STAR 2007 Evaluation Package - ASR English
- hasVersion: TC-STAR 2007 Evaluation Package - ASR Spanish - CORTES
- hasVersion: TC-STAR 2007 Evaluation Package - ASR Spanish - EPPS
- hasVersion: TC-STAR 2007 Evaluation Package - ASR Mandarin Chinese
-
C-000421: TC-STAR 2006 Evaluation Package - ASR Spanish - CORTES
Desktop/Microphone
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The second TC-STAR evaluation campaign took place in March 2006.
Three core technologies were evaluated during the campaign:
Automatic Speech Recognition (ASR),
Spoken Language Translation (SLT),
Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for Spanish. The same packages are available for English (ELRA-E0011), Mandarin (ELRA-E0013), and for the EPPS task for Spanish (ELRA-E0012/02), for ASR and for SLT in 3 directions, English-to-Spanish (ELRA-E0014), Spanish-to-English (ELRA-E0015), Chinese-to-English (ELRA-E0016).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: EPPS (European Parliament Plenary Sessions) task, CORTES (Spanish Parliament Sessions) task and VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the CORTES task and consists of 2 data sets:
- Development data set: consists of audio recordings of CORTES sessions from 1 to 2 December 2004, manually transcribed. 3 hours of recordings were selected and transcribed, corresponding to approximately 30,000 running words in Spanish.
- Test data set: consists of audio recordings of the CORTES session of 24 November 2005. As for the development set, the test data set is made of 3 hours (30,000 running words).
- hasVersion: C-001726: TC-STAR 2006 Evaluation Package - ASR Spanish - EPPS
- hasVersion: C-000420: TC-STAR 2006 Evaluation Package - ASR English
- hasVersion: C-000422: TC-STAR 2006 Evaluation Package - ASR Mandarin Chinese
- hasVersion: C-000423: TC-STAR 2006 Evaluation Package - SLT English-to-Spanish
- hasVersion: TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - CORTES
- hasVersion: TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - EPPS
-
C-000422: TC-STAR 2006 Evaluation Package - ASR Mandarin Chinese
Desktop/Microphone
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The second TC-STAR evaluation campaign took place in March 2006.
Three core technologies were evaluated during the campaign:
Automatic Speech Recognition (ASR),
Spoken Language Translation (SLT),
Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for Chinese. The same packages are available for both English (ELRA-E0011) and Spanish (ELRA-E0012) for ASR and for SLT in 3 directions, English-to-Spanish (ELRA-E0014), Spanish-to-English (ELRA-E0015), Chinese-to-English (ELRA-E0016).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: EPPS (European Parliament Plenary Sessions) task, CORTES (Spanish Parliament Sessions) task and VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the VOA task and consists of 2 data sets:
- Development data set: consists of 3 hours of audio recordings from the broadcast news of Mandarin Voice of America between 1 and 11 December 1998, corresponding to approximately 42,000 Chinese characters.
- Test data set: consists of 3 hours of audio recordings from news broadcasts between 23 and 25 December 1998, corresponding to approximately 44,000 Chinese characters.
- hasVersion: C-000420: TC-STAR 2006 Evaluation Package - ASR English
- hasVersion: C-000421: TC-STAR 2006 Evaluation Package - ASR Spanish - CORTES
- hasVersion: C-001726: TC-STAR 2006 Evaluation Package - ASR Spanish - EPPS
- hasVersion: C-000423: TC-STAR 2006 Evaluation Package - SLT English-to-Spanish
- hasVersion: TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - CORTES
- hasVersion: C-001727: TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - EPPS
- hasVersion: TC-STAR 2006 Evaluation Package - SLT Chinese-to-English
-
C-000423: TC-STAR 2006 Evaluation Package - SLT English-to-Spanish
Desktop/Microphone
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The second TC-STAR evaluation campaign took place in March 2006.
Three core technologies were evaluated during the campaign:
Automatic Speech Recognition (ASR),
Spoken Language Translation (SLT),
Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2006 Spoken Language Translation (SLT) second evaluation campaign for English-to-Spanish translation. The same packages are available for English (ELRA-E0011), Spanish (ELRA-E0012) and Mandarin Chinese (ELRA-E0013) for ASR and for SLT in 2 other directions, Spanish-to-English (ELRA-E0015) and Chinese-to-English (ELRA-E0016).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: EPPS (European Parliament Plenary Sessions) task, CORTES (Spanish Parliament Sessions) task and VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the EPPS task and consists of 2 data sets:
- Development data set: built upon the ASR development data set, in order to enable end-to-end evaluation. Subsets of 50,000 words were selected from the EPPS verbatim transcriptions, and from the Final Text Edition documents. The source texts were then translated into Spanish by two independent translation agencies. All source text sets and reference translations were formatted using the same SGML DTD that has been used for the NIST Machine Translation evaluations.
- Test data set: as for the development set, the same procedure was followed to produce the test data, i.e. subsets of 50,000 words were selected from the test data set (Parliament sessions from 7 to 26 September 2005), both from the manual transcriptions and from the Final Text Edition documents. The source data were then translated into Spanish by two independent agencies.
- hasVersion: C-000420: TC-STAR 2006 Evaluation Package - ASR English
- hasVersion: C-000421: TC-STAR 2006 Evaluation Package - ASR Spanish - CORTES
- hasVersion: C-001726: TC-STAR 2006 Evaluation Package - ASR Spanish - EPPS
- hasVersion: C-000422: TC-STAR 2006 Evaluation Package - ASR Mandarin Chinese
- hasVersion: TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - CORTES
- hasVersion: TC-STAR 2006 Evaluation Package - SLT Spanish-to-English -
- hasVersion: TC-STAR 2006 Evaluation Package - SLT Chinese-to-English
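The NIST MT-evaluation SGML layout mentioned in this entry wraps each sentence in a <seg> element inside <doc> and set-level wrappers; segments can be pulled out with a few lines of code. All identifiers in the sample below (set id, doc id, segment text) are invented placeholders; only the <seg id="...">...</seg> convention is taken from the NIST format.

```python
import re

SAMPLE_SGML = """\
<srcset setid="sample_set" srclang="English">
<doc docid="doc1" genre="speech">
<seg id="1">first source segment</seg>
<seg id="2">second source segment</seg>
</doc>
</srcset>
"""

def extract_segments(sgml):
    """Return (segment id, text) pairs from NIST-style <seg> elements."""
    return [(m.group(1), m.group(2).strip())
            for m in re.finditer(r'<seg id="([^"]+)">(.*?)</seg>', sgml, re.S)]
```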
-
C-000424: TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - CORTES
Desktop/Microphone
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The second TC-STAR evaluation campaign took place in March 2006.
Three core technologies were evaluated during the campaign:
Automatic Speech Recognition (ASR),
Spoken Language Translation (SLT),
Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2006 Spoken Language Translation (SLT) second evaluation campaign for Spanish-to-English translation within the CORTES task. The same packages are available for English (ELRA-E0011), Spanish (ELRA-E0012) and Mandarin Chinese (ELRA-E0013) for ASR, and for SLT in 2 other directions, English-to-Spanish (ELRA-E0014) and Chinese-to-English (ELRA-E0016), as well as for the EPPS task for Spanish-to-English (ELRA-E0015/02).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: EPPS (European Parliament Plenary Sessions) task, CORTES (Spanish Parliament Sessions) task and VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the CORTES task and consists of 2 data sets:
- Development data set: built upon the ASR development data set, in order to enable end-to-end evaluation. Subsets of 25,000 words were selected from the CORTES verbatim transcriptions and from the CORTES Final Text Edition documents. The source texts were then translated into English by two independent translation agencies. All source text sets and reference translations were formatted using the same SGML DTD that has been used for the NIST Machine Translation evaluations.
- Test data set: as for the development set, the same procedure was followed to produce the test data, i.e. subsets of 25,000 words were selected from the test data set (CORTES session of 24 November 2005), both from the manual transcriptions and from the Final Text Edition documents. The source data were then translated into English by two independent agencies.
- hasVersion: C-001727: TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - EPPS
- hasVersion: C-000420: TC-STAR 2006 Evaluation Package - ASR English
- hasVersion: C-000421: TC-STAR 2006 Evaluation Package - ASR Spanish - CORTES
- hasVersion: C-001726: TC-STAR 2006 Evaluation Package - ASR Spanish - EPPS
- hasVersion: C-000422: TC-STAR 2006 Evaluation Package - ASR Mandarin Chinese
- hasVersion: C-000423: TC-STAR 2006 Evaluation Package - SLT English-to-Spanish
- hasVersion: TC-STAR 2006 Evaluation Package - SLT Chinese-to-English