  • C-003350: CSLU: National Cellular Telephone Speech Release 2.3
    *Introduction*

    This file contains documentation for CSLU: National Cellular Telephone Speech Release 2.3, Linguistic Data Consortium (LDC) catalog number LDC2008S02 and ISBN 1-58563-467-0.

    CSLU: National Cellular Telephone Speech Release 2.3 was created by the Center for Spoken Language Understanding (CSLU) at the OGI School of Science and Engineering, Oregon Health and Science University, Beaverton, Oregon. It consists of cellular telephone speech and corresponding transcripts: approximately one minute of speech from each of 2,336 speakers calling from locations throughout the United States. The data collection protocol used for this release is the same protocol used in CSLU: Portland Cellular Telephone Speech Version 1.3 (LDC2008S01).

    Speakers called the CSLU data collection system on cellular telephones, and they were asked a series of questions. Two prompt protocols were used: an In Vehicle Protocol for speakers calling from inside a vehicle and a Not in Vehicle Protocol for those calling from outside a vehicle. The protocols shared several questions, but each protocol contained distinct queries designed to probe the conditions of the caller's in vehicle/not in vehicle surroundings.

    *Recording Details*

    The data were collected with the CSLU T1 digital data collection system. The sampling rate was 8 kHz, and the files were stored in 8-bit mu-law format on a UNIX file system. In this release, the files are provided in 16-bit linearly encoded Windows WAV (RIFF) format.
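
    For readers unfamiliar with the conversion, the sketch below shows how an 8-bit G.711 mu-law sample maps to a 16-bit linear PCM value, and how a raw mu-law stream could be rewrapped as a RIFF WAV file in Python. It is illustrative only (the release already ships converted WAV files), and the input file name is hypothetical.

        import wave

        BIAS = 0x84  # 132, the G.711 mu-law bias

        def mulaw_decode(byte_val: int) -> int:
            """Decode one 8-bit mu-law sample to a 16-bit linear PCM value."""
            u = ~byte_val & 0xFF                      # mu-law bytes are stored bit-inverted
            sign, exponent, mantissa = u & 0x80, (u >> 4) & 0x07, u & 0x0F
            sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
            return -sample if sign else sample

        with open("speaker1.ulaw", "rb") as f:        # hypothetical raw mu-law input
            raw = f.read()

        pcm = b"".join(mulaw_decode(b).to_bytes(2, "little", signed=True) for b in raw)

        with wave.open("speaker1.wav", "wb") as w:    # 16-bit RIFF WAV, 8 kHz mono
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(8000)
            w.writeframes(pcm)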

    *Transcription*

    The text transcriptions in this corpus were produced using the non-time-aligned word-level conventions described in The CSLU Labeling Guide, which is included in the documentation for this release. CSLU: National Cellular Telephone Speech Release 2.3 contains orthographic and phonetic transcriptions of the corresponding speech files. Non-time-aligned orthographic transcriptions provide quick access to the content of an utterance; they may contain markers for word boundaries to support access and retrieval at the lexical level. Phonetic/phonemic transcriptions represent the phonetic content of an utterance at a level of detail made explicit by the use of diacritics. The phonetic phenomena transcribed include excessive nasalization, glottalization, frication on a stop, centralization, lateralization, rounding and palatalization.

  • C-003351: Sports Hochi Article Data
    The data covers all kinds of news articles on sports and the entertainment industry, as well as general articles. It includes newspaper articles from 1998 onward; each article has a title, a body and some keywords, in Japanese.
  • C-003353: ARCADE II Evaluation Package
    Written Corpora
    The ARCADE II Evaluation Package was produced within the French national project ARCADE II (Evaluation of parallel text alignment systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The ARCADE II project made it possible to carry out an evaluation campaign in the field of multilingual alignment, with more ambitious objectives than the ARCADE I project (run within the AUPELF campaigns, Actions de recherche Concertées, 1996-1999): finer-grained alignment and many more languages. Thus, ARCADE II is not only an extension of ARCADE I, but also has innovative and exploratory aspects, for instance the integration of languages distant from French, such as Arabic, Russian and Chinese.

    This package includes the material that was used for the ARCADE II evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system.

    The campaign is distributed over two actions:
    1) Sentence alignment: it consists in evaluating the alignment of French with Latin-script languages on the one hand, and with non-Latin-script languages on the other; a minimal scoring sketch follows this list.
    2) Translation of named entities: it consists in identifying, in the parallel Arabic corpus, the translations corresponding to the named-entity phrases annotated in the French corpus.
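
    As a rough illustration of how sentence-alignment output can be scored against a reference, the following Python sketch computes precision, recall and F-measure over alignment links. The link representation (pairs of source/target sentence indices) is an assumption made for illustration; the actual ARCADE II protocols and scoring tools are those shipped in the package.

        def score_alignment(system_links: set[tuple[int, int]],
                            reference_links: set[tuple[int, int]]) -> tuple[float, float, float]:
            """Precision/recall/F1 over alignment links (source idx, target idx)."""
            correct = len(system_links & reference_links)
            precision = correct / len(system_links) if system_links else 0.0
            recall = correct / len(reference_links) if reference_links else 0.0
            f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
            return precision, recall, f1

        # Example: the system proposes 4 links, 3 of which are in the reference.
        system = {(0, 0), (1, 1), (2, 3), (3, 4)}
        reference = {(0, 0), (1, 1), (2, 2), (3, 4)}
        print(score_alignment(system, reference))  # (0.75, 0.75, 0.75)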

    The ARCADE II evaluation package contains the following data and tools:
    1) The JOC Corpus (Official Journal of the European Community) with Latin-script languages (English, French, German, Italian, Spanish) contains 1 million words per language (5 million words in all). The texts are aligned at the sentence level and produced in XML and UTF-8 format.
    2) The MD Corpus (Le Monde Diplomatique) with non-Latin-script languages (Arabic, Chinese, Greek, Japanese, Persian, Russian) contains texts manually aligned at the sentence level, encoded in XML and UTF-8. The size of the different parts varies according to the language pair. A subset of the Arabic-French part was manually annotated with named entities. The size in words was calculated on the French part; the calculation differs depending on the language (for Arabic, for example, many clitics are agglutinated, which reduces the word count) and is sometimes impossible (for Chinese, for example, there is no graphical separation between words):
    Language pair              Arabic-French  Chinese-French  Greek-French  Japanese-French  Persian-French  Russian-French
    Number of articles         150 x 2        59 x 2          50 x 2        52 x 2           53 x 2          50 x 2
    Number of words in French  316,000        100,000         90,000        100,000          108,000         91,000
    A description of the project is available at the following address:
    http://www.technolangue.net/article.php3?id_article=201 (in French)
  • C-003355: Balanced Corpus of Contemporary Written Japanese (demonstration version)
    The Balanced Corpus of Contemporary Written Japanese (BCCWJ) is probably the most important of all the KOTONOHA component corpora, because the written register of contemporary Japanese is the greatest focus of interest for language researchers as well as the general public, and because the contemporary written language has the greatest applicability to products such as dictionaries and teaching materials. The compilation of BCCWJ started in 2006 as a five-year project, supported in part by a Grant-in-Aid for Scientific Research on Priority Areas from MEXT (the Japanese Ministry of Education), "Japanese Corpus".
    • isPartOf: Language corpus KOTONOHA
  • C-003357: CESTA Evaluation Package
    Written Corpora
    The CESTA Evaluation Package was produced within the French national project CESTA (Evaluation of MT systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The CESTA project made it possible to carry out a campaign for the evaluation of machine translation systems, with English and Arabic texts translated into French.

    This package includes the material that was used for the CESTA evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system.

    The campaign is distributed over two actions:
    1) Evaluation on a restricted vocabulary: an evaluation protocol was introduced, dedicated to two translation directions: English into French and Arabic into French.
    2) Evaluation on a specialised domain (evaluation after terminology enrichment): it consists in observing the impact of adapting the systems to the specialised domain.

    The CESTA evaluation package contains the following data and tools:
    1) Test run data:
    - English-French parallel corpus: 21,590 English words and 23,554 French words extracted from the Official Journal of the European Communities, 1993, Written Questions section of the European Parliament, from the MLCC corpus (catalogue ref. ELRA-W0023).
    - Arabic-French parallel corpus: 15,603 Arabic words and 18,257 French words extracted from Le Monde Diplomatique 2002 (catalogue ref. ELRA-W0036).

    2) First campaign data:
    - English-French parallel corpus: test corpus of 20,658 English words and 22,774 French words extracted from the Official Journal of the European Communities, 1993, Written Questions section of the European Parliament, from the MLCC corpus (catalogue ref. ELRA-W0023). Four translations in French are available.
    - Arabic-French parallel corpus: test corpus of 23,763 Arabic words and 28,664 French words extracted from Le Monde Diplomatique 2002 and 2003 (catalogue ref. ELRA-W0036). Four translations in French are available.

    3) Second campaign data:
    - English-French parallel corpus: adaptation corpus of 19,383 English words and 22,741 French words, extracted from the Santé Canada website. Translation in French is available.
    - Arabic-French parallel corpus: adaptation corpus of 19,560 Arabic words and 22,533 French words extracted from the UNICEF, WHO and FHI websites. Translation in French is available.
    - English-French parallel corpus: test corpus of 18,880 English words and 23,411 French words, extracted from the Santé Canada website. Four translations in French are available.
    - Arabic-French parallel corpus: test corpus of 17,305 Arabic words and 20,885 French words extracted from the UNICEF, WHO and FHI websites. Four translations in French are available.

    4) Anonymised submissions of systems and human judgments with adequacy and fluency annotations.
    5) French corpus of 13,000 words with adequacy and fluency tags.
    6) Evaluation infrastructure for human judgments and for automatic evaluation; a minimal multi-reference scoring sketch follows this list.
    7) Project documentation and publications.
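
    To illustrate how the four reference translations can feed automatic evaluation, here is a minimal BLEU-style multi-reference scorer in Python. It is a simplified reimplementation for illustration only, not the evaluation infrastructure distributed in the package (unlike corpus-level BLEU, it scores a single segment and returns 0 for hypotheses shorter than four words).

        import math
        from collections import Counter

        def ngrams(tokens: list[str], n: int) -> Counter:
            return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

        def bleu(hypothesis: list[str], references: list[list[str]], max_n: int = 4) -> float:
            log_precisions = []
            for n in range(1, max_n + 1):
                hyp_counts = ngrams(hypothesis, n)
                # clip each hypothesis n-gram by its maximum count over all references
                max_ref = Counter()
                for ref in references:
                    for gram, count in ngrams(ref, n).items():
                        max_ref[gram] = max(max_ref[gram], count)
                clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
                total = sum(hyp_counts.values())
                if clipped == 0 or total == 0:
                    return 0.0
                log_precisions.append(math.log(clipped / total))
            # brevity penalty against the reference closest in length
            ref_len = min((len(r) for r in references),
                          key=lambda l: (abs(l - len(hypothesis)), l))
            bp = 1.0 if len(hypothesis) >= ref_len else math.exp(1 - ref_len / len(hypothesis))
            return bp * math.exp(sum(log_precisions) / max_n)

        print(bleu("le chat est sur le tapis".split(),
                   ["le chat est sur le tapis".split(),
                    "un chat est sur le tapis".split()]))  # 1.0 (exact match with one reference)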

    A description of the project is available at the following address:
    http://www.technolangue.net/article.php3?id_article=199 (in French)
  • C-003358: EQueR Evaluation Package
    Written Corpora
    The EQueR Evaluation Package was produced within the French national project EQueR (Evaluation campaign for Question-Answering systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The EQueR project made it possible to carry out a campaign for the evaluation of question-answering systems in French.

    This package includes the material that was used for the EQueR evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    The campaign is distributed over two actions:
    1) Generic task: it consists in evaluating the performance of question-answering systems on a collection of heterogeneous texts.
    2) Specialised task: it consists in evaluating the performance of question-answering systems on a collection of texts from the medical domain.

    The EQueR evaluation package contains the following data and tools:
    1) Two text collections:
    - General corpus: about 1.5 GB of data consisting of several years of news articles from Le Monde and Le Monde Diplomatique, together with press releases and information reports from the French Senate dealing with various subjects.
    - Medical corpus: about 140 MB of data mainly consisting of scientific articles and guidelines for good medical practice, selected by the CISMeF team (Catalogue et Index des Sites Médicaux Francophones) at the University Hospital Centre of Rouen.
    2) Two corpora of questions:
    - 500 questions for the generic task and 200 questions for the specialised task.
    - For each question in the two corpora, the identifiers of the first 100 documents returned by Pertimm's search engine are provided.
    3) Two sub-corpora, created from the document identifiers returned by the Pertimm search engine.
    4) The complete results submitted by the participants.
    5) A software tool to help evaluate the results of question-answering systems (with detailed documentation); a minimal ranking-measure sketch follows this list.
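
    As an illustration of the kind of measure used for ranked question-answering output, the sketch below computes mean reciprocal rank (MRR) in Python. The judgment format (a set of correct answer strings per question) is an assumption; the package's own scoring software and documentation define the official measures.

        def mean_reciprocal_rank(ranked_answers: list[list[str]],
                                 correct: list[set[str]]) -> float:
            """Average of 1/rank of the first correct answer per question (0 if none)."""
            total = 0.0
            for answers, gold in zip(ranked_answers, correct):
                for rank, answer in enumerate(answers, start=1):
                    if answer in gold:
                        total += 1.0 / rank
                        break  # only the first correct answer counts
            return total / len(ranked_answers)

        # Example: question 1 answered at rank 1, question 2 at rank 3, question 3 missed.
        print(mean_reciprocal_rank(
            [["Paris"], ["1799", "1802", "1804"], ["bleu"]],
            [{"Paris"}, {"1804"}, {"rouge"}]))  # (1 + 1/3 + 0) / 3 = 0.444...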

    A description of the project is available at the following address:
    http://www.technolangue.net/article.php3?id_article=195 (in French)
  • C-003359: EvaSy Evaluation Package
    Desktop/Microphone
    The EvaSy Evaluation Package was produced within the French national project EvaSy (Evaluation of speech synthesis systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The EvaSy project made it possible to carry out a campaign for the evaluation of speech synthesis systems using French text data. This project is an extension of the only campaign ever carried out for French in this field, within the AUPELF campaigns (Actions de recherche Concertées, 1996-1999).

    This package includes the material that was used for the EvaSy evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    The campaign is distributed over three actions:
    1) Evaluation of grapheme-to-phoneme conversion: it consists in evaluating the capacity of speech synthesis systems to phonetise text data.
    2) Evaluation of prosody: it consists in evaluating the capacity of speech synthesis systems to predict the prosody of a text (duration and fundamental frequency of phonemes) from the text itself.
    3) Global evaluation of the quality of speech synthesis systems:
    - ACR tests (Absolute Category Rating): they consist in evaluating the overall quality of speech synthesis voices by asking a number of subjects to rate several general characteristics of the voice, such as its naturalness, fluency and intelligibility.
    - SUS tests (Semantically Unpredictable Sentences): they consist in evaluating the intelligibility of the speech synthesis voice using sentences that are syntactically correct but semantically unpredictable (i.e. meaningless).

    The EvaSy evaluation package contains the following data and tools:
    1) For the evaluation of the grapheme-to-phoneme conversion module:
    - About 8,000 proper names (4,115 first name-surname pairs) extracted from Le Monde newspaper, 1992-2000 (over 200 million words), manually phonetised with variants and annotated with linguistic tags. The reference phonetisation was checked and corrected after the adjudication phase.
    - A corpus of emails (about 115,000 words), anonymised, segmented by paragraph and phonetised in SAMPA. The reference phonetisation was not checked, and the evaluation of these data was not carried out within EvaSy.
    - The SCLITE tool (developed by NIST), used to compare the reference phonetisation with that of the evaluated system and to count the mistaken phonemes (insertions, deletions and substitutions); a minimal sketch of this kind of scoring follows this list.
    - The Post-align tool, used to align the reference phonetisation with that of the evaluated system on a word-by-word basis.
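
    The following Python sketch illustrates the kind of computation SCLITE performs: a minimal-cost alignment of reference and system phoneme sequences that counts substitutions, deletions and insertions (the phoneme error rate is then (S+D+I) divided by the reference length). It is a toy reimplementation for illustration; the campaign used SCLITE itself.

        def edit_counts(reference: list[str], system: list[str]) -> tuple[int, int, int]:
            """Minimal-alignment counts of (substitutions, deletions, insertions)."""
            rows, cols = len(reference) + 1, len(system) + 1
            dp = [[None] * cols for _ in range(rows)]  # cell = (cost, subs, dels, inss)
            dp[0][0] = (0, 0, 0, 0)
            for i in range(1, rows):
                dp[i][0] = (i, 0, i, 0)                # delete all reference phonemes
            for j in range(1, cols):
                dp[0][j] = (j, 0, 0, j)                # insert all system phonemes
            for i in range(1, rows):
                for j in range(1, cols):
                    if reference[i - 1] == system[j - 1]:
                        dp[i][j] = dp[i - 1][j - 1]    # exact match, no edit
                        continue
                    (cost, s, d, ins), op = min(
                        [(dp[i - 1][j - 1], "sub"), (dp[i - 1][j], "del"), (dp[i][j - 1], "ins")],
                        key=lambda cell: cell[0][0])
                    dp[i][j] = (cost + 1, s + (op == "sub"),
                                d + (op == "del"), ins + (op == "ins"))
            return dp[-1][-1][1:]

        # SAMPA example: reference /b o~ Z u R/ ("bonjour") vs. system output /b o~ Z y/.
        subs, dels, inss = edit_counts(["b", "o~", "Z", "u", "R"], ["b", "o~", "Z", "y"])
        print(subs, dels, inss)  # 1 substitution (u -> y), 1 deletion (R), 0 insertions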

    2) For the evaluation of the prosodic module:
    - Text data: 7 phonetically-balanced sentences extracted from the BREF corpus (cf. ELRA-S0067), lasting from 4 to 11 seconds.
    - Speech data: the 7 sentences read by one speaker.
    - The Mbroli tool, which converts *.pho prosodic files into *.wav speech files, together with the MBROLA fr1 diphone database; a sample *.pho file is sketched below.
    - The Mbrolign tool, which aligns the phonemes with the signal, extracts the prosodic parameters of the signal and copies them into the MBROLA diphone database.
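
    For readers unfamiliar with the *.pho format consumed by the Mbroli tool, the Python sketch below writes a minimal prosodic file: one phoneme per line, a duration in milliseconds, then optional (position-in-%, pitch-in-Hz) target pairs; lines beginning with ";" are comments. The SAMPA phonemes, durations and pitch values shown are invented for illustration.

        # The word "bonjour" in SAMPA, with illustrative durations and pitch targets.
        phonemes = [
            ("b", 60, []),
            ("o~", 130, [(50, 170)]),   # pitch target of 170 Hz halfway through the vowel
            ("Z", 110, []),
            ("u", 200, [(100, 130)]),   # fall to 130 Hz at the end of the vowel
            ("R", 80, []),
        ]

        with open("bonjour.pho", "w") as f:
            f.write("; minimal example for the MBROLA fr1 voice\n")
            for phoneme, duration_ms, pitch_targets in phonemes:
                targets = " ".join(f"{pos} {hz}" for pos, hz in pitch_targets)
                f.write(f"{phoneme} {duration_ms} {targets}".rstrip() + "\n")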

    3) For the global evaluation of the quality of speech synthesis systems:
    a) For ACR tests (Absolute Category Rating):
    - Text data: 40 passages of 5 sentences each (about 20 seconds in duration), extracted from the EUROM1f French corpus (cf. ELRA-S0014-01).
    - Speech data: the 40 passages read by a EUROM1 speaker.
    b) For SUS tests (Semantically Unpredictable Sentences):
    - Text data: 24 lists of 12 SUS sentences each; the phonemes are also distributed by list.
    - Speech data: 24 lists read by a professional speaker.

    A description of the project is available at the following address:
    http://www.technolangue.net/article.php3?id_article=202 (in French)
  • C-003360: CESART Evaluation Package
    Written Corpora
    The CESART Evaluation Package was produced within the French national project CESART (Evaluation of terminology extraction tools), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The CESART project made it possible to carry out a campaign for the evaluation of terminology extraction tools. This project is an extension of the evaluation campaign on terminology resource acquisition tools for written corpora (ARC A3) carried out within the AUPELF campaigns (Actions de recherche Concertées, 1996-1999).

    This package includes the material that was used for the CESART evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system.

    The campaign is distributed over two actions:
    1) Term extraction for the building of a terminology reference, whose applications are the enrichment of the reference and the free indexing of documents.
    2) Extraction of semantic relations (synonymy) from a list of “focal” terms.

    The CESART evaluation package contains the following data and tools:
    Three domain-specific corpora in French were built: one medical corpus, one educational corpus, and one political corpus. The first two were used as test corpora, while the third one (political corpus) was used as a masking corpus. The corpora were encoded in UTF-8 and XML. They are available in two different versions, one for DOS and one for UNIX.
    1) The medical corpus consists of web pages extracted from Santé Canada (http://www.hc-sc.gc.ca/index_f.html).
    2) The corpus in the educational field contains articles extracted from the SPIRAL magazine specialised in pedagogy and research in education.
    3) The political corpus is composed of texts extracted from the Official Journal of the European Union.

    The table below gives some statistics on the corpora used for the evaluation:
    Corpus (specialised)  Medicine (test corpus)  Education (test corpus)  Politics (masking corpus)
    Number of documents   7,514                   149                      1,477
    Number of segments    255,161                 12,109                   9,024
    Number of words       9,000,000               535,000                  240,000
    Two reference lists were built from two terminology databases in a specialised domain. The list of medical terms, based on the terminology provided by the CISMeF team (www.chu-rouen.fr/terminologiecismef), is available from the IST/Inserm (http://ist.inserm.fr/basismesh/mesh.html) and contains 22,861 entries. For the educational domain, the reference list is based on the Motbis thesaurus (http://www.thesaurus.motbis.cndp.fr/site/) and contains 36,081 entries.
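
    As an illustration of how an extracted term list can be checked against such a reference terminology, here is a minimal Python sketch computing precision (share of extracted terms found in the reference) and recall (share of the reference covered). Exact matching after simple normalisation is an assumption, as are the example terms; the official CESART protocol is described in the package documentation.

        def score_terms(extracted: list[str], reference: set[str]) -> tuple[float, float]:
            """Precision/recall of an extracted term list against a reference list."""
            def norm(term: str) -> str:
                return " ".join(term.lower().split())
            ref = {norm(t) for t in reference}
            hits = sum(1 for t in extracted if norm(t) in ref)
            precision = hits / len(extracted) if extracted else 0.0
            recall = hits / len(ref) if ref else 0.0
            return precision, recall

        # Example: 2 of 3 extracted candidates are reference terms.
        print(score_terms(["infarctus du myocarde", "le patient", "hypertension artérielle"],
                          {"infarctus du myocarde", "hypertension artérielle", "diabète"}))
        # (0.666..., 0.666...)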

    A description of the project is available at the following address:
    http://www.technolangue.net/article.php3?id_article=200 (in French)
    • references: Motbis thesaurus
  • C-003361: MEDIA Evaluation Package
    Telephone
    The MEDIA Evaluation Package was produced within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The MEDIA project made it possible to carry out a campaign for the evaluation of man-machine dialogue systems for French.

    This package includes the material that was used for the MEDIA evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    The campaign is distributed over two actions:
    1) Evaluation not taking into account the dialogue context: it consists in producing a semantic annotation outside the dialogue context for each of the 3,000 test prompts.
    2) Evaluation taking into account the dialogue context: it consists in evaluating the capacity of understanding systems a) from orthographic transcriptions only and b) from transcriptions and reference annotations outside the dialogue context; a simplified scoring sketch follows.
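
    As a simplified illustration of how a system's semantic annotation can be compared with the reference, the Python sketch below scores attribute/value slots with an F-measure. The slot representation and the attribute names are invented for illustration; the official metrics are implemented by the Mediaval tools listed below.

        def slot_f1(system: dict[str, str], reference: dict[str, str]) -> float:
            """F-measure over attribute/value slots, exact match on both parts."""
            if not system or not reference:
                return 0.0
            correct = sum(1 for attr, val in system.items() if reference.get(attr) == val)
            precision, recall = correct / len(system), correct / len(reference)
            if precision + recall == 0:
                return 0.0
            return 2 * precision * recall / (precision + recall)

        # Hypothetical annotation of "je voudrais un hôtel à Paris pour moins de cent euros".
        system = {"command": "reservation", "location-city": "Paris", "payment-amount": "100"}
        reference = {"command": "reservation", "location-city": "Paris",
                     "payment-amount": "100", "comparative-payment": "less-than"}
        print(slot_f1(system, reference))  # 3 correct slots: P=1.0, R=0.75, F=0.857...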

    The MEDIA evaluation package contains the following data and tools:
    1) Corpus of 1,258 dialogues (Wizard-of-Oz) recorded for a tourist information task, with transcriptions and annotations in dialogue acts and in semantic segments, within and outside the dialogue context. The annotations follow an XML formalism. This corpus consists of 18,831 user prompts used for the adaptation of systems and for evaluation.
    2) The text corpus (200 dialogues) is annotated with meta-annotations in order to carry out a diagnostic study of system output. These meta-annotations cover speech-oriented phenomena such as repetitions, self-corrections, parenthetical insertions, etc.
    3) The Semantizer annotation tool.
    4) The Mediaval HC (outside context) evaluation tool.
    5) The Mediaval EC (within context) evaluation tool.

    A description of the project is available at the following address:
    http://www.technolangue.net/article.php3?id_article=62 (in French)
  • C-003362: ESTER Evaluation Package
    Broadcast Resources
    The ESTER Evaluation Package was produced within the French national project ESTER (Evaluation of Broadcast News enriched transcription systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The ESTER project made it possible to carry out a campaign for the evaluation of Broadcast News enriched transcription systems using French data. This project is an extension of the only campaign ever carried out for French in this field, within the AUPELF campaigns (Actions de recherche Concertées, 1996-1999).

    This package includes the material that was used for the ESTER evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    The campaign is distributed over three actions:
    1) Orthographic transcription: it consists in producing an orthographic transcription of radio broadcast news, whose quality is measured by the word error rate (a minimal WER sketch follows this list). There are two distinct tasks, one with and one without a processing-time constraint.
    2) Segmentation: the segmentation tasks consist of segmentation into sound events, speaker tracking and speaker segmentation. For sound event segmentation, the task consists in tracking the parts which contain music (with or without speech) and the parts which contain speech (with or without music). The speaker tracking task consists in detecting the parts of the document that correspond to a given speaker. Speaker segmentation consists in segmenting the document into speakers and grouping together the parts spoken by the same speaker.
    3) Information extraction: an exploratory task on named entity tracking. The objective was to set up and test an evaluation protocol rather than to measure performance. Systems must detect eight classes of entities (person, place, date, organisation, geo-political entity, amount, building and unknown) from either the automatic or the manual transcription.
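
    For the transcription task, the word error rate is the minimum number of substitutions, deletions and insertions needed to turn the reference into the system output, divided by the reference length. A minimal Python sketch, for illustration only (the campaign used its own scoring tools):

        def wer(reference: list[str], hypothesis: list[str]) -> float:
            """Word error rate via a standard edit-distance dynamic programme."""
            prev = list(range(len(hypothesis) + 1))
            for i, ref_word in enumerate(reference, start=1):
                curr = [i] + [0] * len(hypothesis)
                for j, hyp_word in enumerate(hypothesis, start=1):
                    curr[j] = min(prev[j - 1] + (ref_word != hyp_word),  # substitution/match
                                  prev[j] + 1,                           # deletion
                                  curr[j - 1] + 1)                       # insertion
                prev = curr
            return prev[-1] / len(reference)

        # Example: one deletion ("a") and one substitution ("annoncé" -> "annonce").
        print(wer("le premier ministre a annoncé".split(),
                  "le premier ministre annonce".split()))  # 2 edits / 5 words = 0.4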

    The ESTER evaluation package contains the following data and tools:
    1) About 100 hours of orthographically transcribed news broadcast, including annotations of named entities.
    2) The textual resources distributed within the ESTER campaign are mainly based on the archives of Le Monde newspaper, 1987-2003 (ELRA-W0015), and the debates of the European Parliament (ELRA-W0023).
    3) The evaluation tools make it possible to evaluate each of the tasks defined above.
    4) Two guides and manuals, produced and provided in the package distributed by ELDA:
    - Guide for the annotation of named entities
    - Specifications and evaluation protocol

    A description of the project is available at the following address:
    http://www.technolangue.net/article.php3?id_article=60 (in French)

    An extra corpus of 1,700 hours of non-transcribed radio broadcast news recordings can also be provided upon request, on hard disk, as an addition to this package, at a cost of 100 Euro (plus shipping fees).

    For research or commercial use, please refer to ELRA-S0241 ESTER Corpus.