Language Resource Search - SHACHI: Language Resource Metadata Database

Language resource #: 3330 Results 861 - 870 of 2023

C-001481: Multilingual Corpus
Written Corpora
Multilingual parallel corpus produced by KAIST KORTERM containing 60,000 expressions in Korean, Chinese and English.
C-001483: NEMLAR Broadcast News Speech Corpus
Broadcast Resources
This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources, produced within the same project, are also available: NEMLAR Written Corpus (ELRA-W0042) and the NEMLAR Speech Synthesis Corpus (ELRA-S0220).

The NEMLAR Broadcast News Speech Corpus consists of about 40 hours of Standard Arabic news broadcasts. The broadcasts were recorded from four different radio stations: Medi1, Radio Orient, RMC Radio Monte Carlo, RTM Radio Television Maroc.

Each broadcast contains between 25 and 30 minutes of news and interviews. The recordings were carried out at three different periods between 30 June 2002 and 18 July 2005. All files were recorded in linear PCM format, 16 kHz, 16 bit.
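As a rough consistency check, the uncompressed size of the audio can be estimated from the stated duration and format. The sketch below assumes mono recordings and exactly 40 hours, neither of which the description states; the variable names are illustrative.

```python
# Estimate the uncompressed size of linear PCM audio.
HOURS = 40               # "about 40 hours" per the description
SAMPLE_RATE_HZ = 16_000  # 16 kHz
BYTES_PER_SAMPLE = 2     # 16 bit
CHANNELS = 1             # assumption: mono, not stated in the description

total_bytes = HOURS * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
print(f"{total_bytes / 1e9:.2f} GB")
```

Under these assumptions the audio alone comes to roughly 4.6 GB, close to the nominal 4.7 GB capacity of a single-layer DVD.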

The software used for the transcription is Transcriber with the additional patch for Arabic. Thus the transcriptions were done in Arabic characters and the software automatically generated the transliterations. The following annotation levels are included:
Orthographic transcription of speech (in news, not in music, commercials, etc.), including Named Entities
Speakers and speaker turns
Segment markers (portions of maximum 10 seconds)
Topic/story boundaries
Background noises (stationary and instantaneous noise events)
Change of background
Music/Noise
Word boundaries

A lexicon of 62,000 words with transliterations, frequency and SAMPA for Arabic is also included.

The database is distributed on 1 ISO 9660 DVD-ROM volume. It has been validated by an external partner and a validation report is provided.
C-001484: NEMLAR Speech Synthesis Corpus
Desktop/Microphone
This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources, produced within the same project, are also available: NEMLAR Written Corpus (ELRA-W0042) and the NEMLAR Broadcast News Speech Corpus (ELRA-S0219).

The NEMLAR Speech Synthesis Corpus contains the recordings of 2 native Egyptian Arabic speakers (male and female, 35 and 27 years old respectively) recorded in a studio over 2 channels (voice + laryngograph). The recordings comprise more than 10 hours of data with transcriptions.

Speech samples are stored in 96 kHz, 24 bit with the least significant byte first (lohi or Intel format) as (signed) integers.
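Decoding this sample format amounts to reading three bytes at a time as a little-endian signed integer. A minimal sketch, assuming the input is a bytes object holding sample data only; the function name is hypothetical, and the layout of the two channels (voice and laryngograph) is not specified in the description, so the sketch returns a flat sequence.

```python
def decode_pcm24_le(raw: bytes) -> list[int]:
    """Decode 24-bit little-endian signed PCM samples into Python ints."""
    if len(raw) % 3 != 0:
        raise ValueError("length must be a multiple of 3 bytes")
    # Least significant byte first ("lohi" / Intel order), two's complement.
    return [
        int.from_bytes(raw[i:i + 3], byteorder="little", signed=True)
        for i in range(0, len(raw), 3)
    ]


# Four samples: +1, -1, the most negative value, the most positive value.
demo = bytes([0x01, 0x00, 0x00,
              0xFF, 0xFF, 0xFF,
              0x00, 0x00, 0x80,
              0xFF, 0xFF, 0x7F])
print(decode_pcm24_le(demo))
```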

The speakers read 2,032 prompted sentences covering approx. 42,000 words in three categories: transcribed speech (6,600 words - 20%), written text (16,500 words - 50%), and constructed phrases (10,300 words - 30%).

The transcribed speech consists of text from different domains, produced in the Broadcast News task. The written text consists of news excerpts, novels and short stories with short sentences. Each paragraph is presented on a separate prompt sheet.

Constructed phrases consist of frequently used phrases and diphone coverage sentences. The frequently used phrases are derived from written text (articles, newspapers, etc.) and have been divided into six sub-domains:
Frequently used colloquial expressions
Sports/Games
News
Finance
Culture/Entertainment
Consumer Information
The diphone coverage sentences cover the missing and rare diphones in all the data. To cover these diphones, the sentences were extracted from a large corpus of about 150,000 words.

The database is provided with orthographic, prosodic and phonetic transcriptions in SAMPA. All transcriptions are segmented at the utterance (sentence/command word) level, annotated at the word level and checked manually. A pronunciation lexicon including 3,589 headwords with phonetics in SAMPA is also available.

The database is distributed on 3 ISO 9660 DVD-ROM volumes. It has been validated by an external partner and a validation report is provided.
C-001485: NEMLAR Written Corpus
Written Corpora
This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources, produced within the same project, are also available: NEMLAR Broadcast News Speech Corpus (ELRA-S0219) and the NEMLAR Speech Synthesis Corpus (ELRA-S0220).

The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories, aiming to achieve a well-balanced corpus that offers a representation of the variety in syntactic, semantic and pragmatic features of modern Arabic language. The different categories are:
Political news: 48,000 words
Political debate: 30,000 words
Islamic text (Preaching and others): 29,000 words
Phrases of common words: 8,500 words
Text from broadcast news: 5,500 words
Business: 20,000 words
Arabic literature: 30,000 words
General news: 100,000 words
Interviews: 56,000 words
Scientific press: 50,000 words
Sports press: 50,000 words
Dictionary entries explanation: 52,000 words
Legal domain text: 21,000 words
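The per-category figures above can be checked against the stated overall size of about 500,000 words. A small sketch; the dictionary simply restates the list, and the variable names are illustrative.

```python
# Word counts per category, as listed in the corpus description.
categories = {
    "Political news": 48_000,
    "Political debate": 30_000,
    "Islamic text (Preaching and others)": 29_000,
    "Phrases of common words": 8_500,
    "Text from broadcast news": 5_500,
    "Business": 20_000,
    "Arabic literature": 30_000,
    "General news": 100_000,
    "Interviews": 56_000,
    "Scientific press": 50_000,
    "Sports press": 50_000,
    "Dictionary entries explanation": 52_000,
    "Legal domain text": 21_000,
}

total_words = sum(categories.values())
print(len(categories), total_words)
```

The 13 categories add up to 500,000 words, matching the stated corpus size.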

The data span the period from the late 1990s to 2005.

The corpus is provided in 4 different versions:
Raw text
Fully vowelized text
Text with Arabic lexical analysis
Text with Arabic POS-tags

Diacritics, lexical analysis and POS-tags were generated by RDI's tool Fassieh©. The accuracy of the automatic analysis is around 95%. To reach the accuracy rate of about 99% defined for this corpus, the linguists used the visual revision mode of Fassieh©, in which the linguist either approves the most likely analysis (most of the time) or selects another one manually (in a minority of about 4% of the cases).

The database is distributed on 1 ISO 9660 CD-ROM volume. It has been validated by an external partner and a validation report is provided.
C-001486: ONOMASTICA-COPERNICUS DATABASE
Speech Related
The ONOMASTICA project was a European-wide research initiative within the scope of the Linguistic Research and Engineering Programme, the aim of which was the construction of a multi-language pronunciation lexicon of proper names. That project covered eleven European languages: Danish, Dutch, English, French, German, Greek, Italian, Norwegian, Portuguese, Spanish and Swedish.

Although the ONOMASTICA project ended in June 1995, the work continued with the introduction of new partners, addressing names in Eastern and Central European languages: Czech, Estonian, Latvian, Polish, Romanian, Slovakian, Slovenian and Ukrainian, in a new project funded by the European Commission's Copernicus Programme.

The corpus consists of a collection of 1,783,390 transcriptions of 1,705,653 names, broken down as follows:
· Czech: 257,700 entries consisting of 244,025 names prepared by Dr. Pavel Kolar of the Language Institute, Silesian University, Opava, Czech Republic.
· Estonian: 209,515 entries consisting of 208,380 names prepared by Dr. Peeter Päll of the Institute for the Estonian Language, Estonian Academy of Sciences, Tallinn, Estonia.
· Latvian: 258,214 entries consisting of 245,331 names prepared by Dr. Andrejs Spektors of the Institute of Mathematics and Computer Science, University of Latvia, Riga, Latvia.
· Polish: 285,412 entries consisting of 244,632 names prepared by Prof. Wiktor Jassem of the Institute of Fundamental Technological Research, Polish Academy of Sciences, Poznań, Poland.
· Slovak: 228,257 entries consisting of 228,257 names prepared by Dr. Peter Durco of the Department of Foreign Languages, Police Academy of the Slovak Republic, Bratislava, Slovak Republic.
· Slovenian: 285,862 entries consisting of 283,449 names prepared by Dr. Zdravko Kacic of the Faculty of Technical Sciences, University of Maribor, Maribor, Slovenia.
· Ukrainian: 258,430 entries consisting of 251,579 names prepared by Dr. Yevgeniy Ludovik of the Institute of Cybernetics, Ukraine Academy of Sciences, Kiev, Ukraine.
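The per-language breakdown can be cross-checked against the stated totals of 1,783,390 transcriptions and 1,705,653 names. A small sketch; the data structure just restates the list above, with each language mapped to an (entries, names) pair, and the variable names are illustrative.

```python
# (entries, names) per language, as given in the breakdown.
onomastica = {
    "Czech": (257_700, 244_025),
    "Estonian": (209_515, 208_380),
    "Latvian": (258_214, 245_331),
    "Polish": (285_412, 244_632),
    "Slovak": (228_257, 228_257),
    "Slovenian": (285_862, 283_449),
    "Ukrainian": (258_430, 251_579),
}

total_entries = sum(entries for entries, _ in onomastica.values())
total_names = sum(names for _, names in onomastica.values())
print(total_entries, total_names)
```

Both sums agree with the totals quoted in the description.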

The databases are presented in Microsoft Access format and in ASCII text format, together with database browser software prepared by Keith Edwards of the Centre for Communication Interface Research, The University of Edinburgh.
C-001488: OrienTel Egypt MCA (Modern Colloquial Arabic) database
S0142 : This speech database contains the recordings of 750 Egyptian speakers recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
S0143 : This speech database contains the recordings of 500 Egyptian speakers recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
S0185 : This speech database contains the recordings of 500 Egyptian speakers of English recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.
C-001489: OrienTel Egypt MSA (Modern Standard Arabic) database
Telephone
The OrienTel Egypt MSA (Modern Standard Arabic) database comprises 500 Egyptian speakers (254 males, 246 females) recorded over the Egyptian fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
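Working with such files typically means expanding each 8-bit A-law byte to a linear sample. A minimal sketch, assuming the standard ITU-T G.711 A-law companding (the description only says "A-law") and a bytes object containing sample data only; the function names are hypothetical.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a 16-bit linear PCM value."""
    code ^= 0x55                      # undo the even-bit inversion
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # After the XOR, a set top bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(raw: bytes) -> list[int]:
    """Decode a sequence of A-law bytes into linear samples."""
    return [alaw_to_linear(b) for b in raw]


print(decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A])))
```

At 8 kHz with one byte per sample, one second of speech occupies 8,000 bytes before decoding.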

Each speaker uttered the following items:
1 isolated single digit
2 sequences of 5 isolated digits
7 connected digits : 1 prompt sheet number (6 digits), 6 strings of 4 digits in written format
2 currency money amounts
2 natural numbers
3 dates : 1 prompted date, 1 relative or general date expression, 1 prompted date phrase (Islamic calendar)
1 time phrase
2 spelled words : string of 4 letter sequences
3 directory assistance utterances : 1 frequent city name, 1 frequent company name, 1 personal name (first name and family name)
2 yes/no questions : 1 predominantly yes question, 1 predominantly no question
6 application keywords/keyphrases
1 word spotting phrase using embedded application words
4 phonetically rich words
9 phonetically rich sentences
4 spontaneous items (for control)

The following age distribution has been obtained: 329 speakers are between 16 and 30, 121 speakers are between 31 and 45, 50 speakers are between 46 and 60.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-001490: OrienTel English as spoken in Egypt database
Telephone
The OrienTel English as spoken in Egypt database comprises 500 Egyptian speakers of English (251 males, 249 females) recorded over the Egyptian fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

Each speaker uttered the following items:
1 isolated single digit
1 sequence of 10 isolated digits
5 connected digits : 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
1 currency money amount
2 natural numbers
3 dates : 1 prompted date, 1 relative or general date expression, 1 prompted date phrase
2 time phrases : 1 time of day (spontaneous), 1 time phrase (word style)
3 spelled words : 1 spontaneous (own forename), 1 city name, 1 real word for coverage
5 directory assistance utterances : 1 spontaneous, own forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
2 yes/no questions : 1 predominantly yes question, 1 predominantly no question
6 application keywords/keyphrases
1 word spotting phrase using embedded application words
4 phonetically rich words
9 phonetically rich sentences
2 spontaneous items (for control)

The following age distribution has been obtained: 347 speakers are between 16 and 30, 101 speakers are between 31 and 45, 52 speakers are between 46 and 60.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
C-001491: Original Short-Message Data Collation I in Chinese (PinYin)
Written Corpora
This corpus comprises 5,891,275 characters, corresponding to 51,568 short messages (SMS) from radio/TV stations and 213,694 daily life short messages. This subset contains original messages together with PinYin transcription.
All data have been proofread manually with PinYin.
C-001492: Original Short-Message Data Collation I in Chinese (named entities)
Written Corpora
This corpus comprises 5,891,275 characters, corresponding to 51,568 short messages (SMS) from radio/TV stations and 213,694 daily life short messages. This subset contains original messages together with named entities.
All data have been proofread and tagged manually.
