Language Resource Search - SHACHI: Language Resource Metadata Database


C-004319: The EMIME Bilingual Finnish/English German/English Database Version 1.0
The database is a bilingual collection of Finnish/English and German/English data. Bilingual talkers were asked to read the English sentences first and then the sentences in their native language. The accents of the talkers in the database have been rated: English, German and Finnish listeners assessed the English, German and Finnish talkers' degree of foreign accent in English.
- hasVersion: C-004320: The EMIME Mandarin/English Bilingual Database Version 1.1
- references: C-000936: Finnish Speecon database
- references: C-001111: GlobalPhone German
- references: the Wall Street Journal 1 corpus
C-004320: The EMIME Mandarin/English Bilingual Database Version 1.1
This is a Mandarin/English bilingual database recorded at the University of Edinburgh in 2010 in the context of the EMIME project (www.emime.org). It includes the recordings of seven female and seven male speakers of Mandarin. The accents of the talkers in the database have been rated; English and Mandarin listeners assessed the English and Mandarin talkers' degree of foreign accent in English.
- references: the Wall Street Journal 1 corpus
- references: C-000095: Mandarin Chinese Speecon database
- hasVersion: C-004319: The EMIME Bilingual Finnish/English German/English Database Version 1.0
C-004321: The Accents of the British Isles (ABI-1) Speech Corpus
The ABI-1 Corpus consists of about 70 hours of recordings covering 14 distinct accent regions within the British Isles. Each speaker and both of their parents have lived in their particular region all their lives.
- hasVersion: C-004322: The Second Accents of the British Isles Speech Corpus
C-004322: The Second Accents of the British Isles Speech Corpus
The ABI-2 corpus consists of approximately 70 hours of recordings covering 13 accent regions of the British Isles which are not covered in the original ABI-1 corpus. The material recorded and the recording procedure are the same as in the ABI-1 corpus, except that each subject recorded an additional set of 22 SCRIBE sentences and, where possible, a 5-minute telephone conversation. Note that the conversational telephone speech will be released as a separate corpus.
- hasVersion: C-004321: The Accents of the British Isles (ABI-1) Speech Corpus
C-004323: The PF-STAR British English Children's Speech Corpus
The corpus contains speech from 158 children aged 4 to 14 years.
C-004324: Balanced Corpus of Contemporary Written Japanese (現代日本語書き言葉均衡コーパス)
C-004327: EPAC Corpus: orthographic transcriptions
Broadcast Resources
This corpus consists of approx. 100 hours of manual orthographic transcriptions, which were produced from 1,677 hours of non-transcribed recordings from the ESTER Evaluation Campaign (Technolangue programme, see also ELRA-E0021). The corpus also includes automatic transcriptions of the full 1,677 hours.
- references: C-003362: ESTER Evaluation Package
C-004328: ESTER 2 Corpus
Broadcast Resources
The ESTER 2 evaluation campaign (Evaluation of Broadcast News enriched transcription systems) is based, on the one hand, on the full corpus from the first ESTER campaign (see ELRA-E0021 and ELRA-S0241) and, on the other hand, on a training corpus of about one hundred hours specific to ESTER 2, together with quick transcriptions of African radio broadcasts. A 6-hour subset of the corpus is identified as the development corpus. These new data constitute the ESTER 2 Corpus.

The ESTER 2 Corpus consists of:
- a manually transcribed radio broadcast news corpus amounting to about 100 hours,
- quick transcriptions of African radio broadcasts amounting to about 6 hours.

An annotation of named entities is provided within the development data (about 6 hours).

The recorded radio programmes contain news broadcasts, reports linked to current news and more conversation-oriented broadcasts.
C-004330: The CINTIL Corpus International Corpus of Portuguese
Written Corpora
CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portuguese. It is composed of one million annotated tokens, each of which has been verified by human expert annotators. The annotation comprises information on part-of-speech, open class lemma and inflection, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition).

The corpus is built on raw textual materials of several types, of which 30% are spoken materials. This spoken subcorpus includes materials from several registers (ranging from formal to informal) and several communicative situations (e.g. phone calls, media broadcasts, conversations, monologues, formal exposition, etc.). The CINTIL corpus comprises the transcriptions of spoken texts but does not include the sound files with the recorded interviews. The remaining subcorpus is composed of written texts from several genres: newspapers, books, magazines, journals and miscellaneous (proceedings, dissertations, pamphlets, etc.). A detailed overview of the corpus composition is presented below:

Written = 689,124 tokens:
- News: 58.7% - 404,690 tokens
- Fiction: 29% - 200,194 tokens
- Other: 12.2% - 84,240 tokens
Spoken = 502,622 tokens:
- Informal/Private: 43.2% - 217,604 tokens
- Informal/Public: 9.5% - 48,221 tokens
- Informal/Phone: 0.4% - 2,287 tokens
- Formal/Natural: 19.3% - 97,499 tokens
- Formal/Media: 17.6% - 88,727 tokens
- Formal/Phone: 9.6% - 48,284 tokens
Total = 1,191,746 tokens
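
The composition figures above can be cross-checked with a few lines of Python. This is only a sketch: the token counts are copied from the overview, while the variable and function names are illustrative and not part of the corpus distribution.

```python
# Token counts copied from the composition overview above.
written = {"News": 404_690, "Fiction": 200_194, "Other": 84_240}
spoken = {
    "Informal/Private": 217_604,
    "Informal/Public": 48_221,
    "Informal/Phone": 2_287,
    "Formal/Natural": 97_499,
    "Formal/Media": 88_727,
    "Formal/Phone": 48_284,
}


def shares(counts):
    """Return (subtotal, {category: percentage of that subtotal})."""
    subtotal = sum(counts.values())
    return subtotal, {name: 100 * n / subtotal for name, n in counts.items()}


written_total, written_pct = shares(written)
spoken_total, spoken_pct = shares(spoken)
grand_total = written_total + spoken_total

print(written_total, spoken_total, grand_total)  # 689124 502622 1191746
for name, pct in {**written_pct, **spoken_pct}.items():
    print(f"{name}: {pct:.1f}%")
```

The subtotals and the grand total reproduce the listed values. The recomputed percentages can differ from the listed ones in the last digit, depending on whether values are rounded or truncated.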

Linguistic information:
The corpus associates raw text with linguistic information of different kinds and at different levels of sophistication. This information is encoded as tags, checked for accuracy by trained linguists, and covers four levels:
Segmentation: Sentence boundaries are tagged and every token is delimited by blanks. Contractions are expanded, clitics in enclisis and mesoclisis are detached into autonomous tokens, and each punctuation mark is annotated with explicit information about the blanks surrounding it in the raw version. Multi-word expressions from some POS classes (e.g. conjunctions, prepositions) are identified as forming a lexical unit.
POS: Each token is associated with a POS tag indicating its morpho-syntactic category.
Inflection: Every inflected token is associated with its lemma and with explicit values for Mood, Tense, Person and Number if it belongs to a verbal class, or for Number and Gender if it is a nominal. Nominals also carry information about degree, namely superlative for Adjectives and diminutive for both Adjectives and Nouns.
Multiword Lexical Units (MWU) for Named Entity Recognition (NER): Multi-word expressions for named entities are delimited and classified following the usual IOB tagging scheme for NER and the typical classes of Number, Date, Person, Location, etc.
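
To illustrate how IOB tags delimit multi-word named entities, here is a minimal Python sketch that groups tagged tokens into entity spans. The tag labels (`PER`, `LOC`), the `B-`/`I-`/`O` spelling and the example sentence are assumptions chosen for illustration; the actual tag names and file format used in CINTIL are defined in its annotation manual and may differ.

```python
from typing import List, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity text, label) pairs.

    "B-X" opens an entity of class X, "I-X" continues an open entity of
    the same class, and any other tag (including a stray "I-") closes it.
    """
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    label = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


# Hypothetical example sentence and labels, not taken from the corpus.
tokens = ["Fernando", "Pessoa", "nasceu", "em", "Lisboa", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(iob_to_spans(tokens, tags))
# [('Fernando Pessoa', 'PER'), ('Lisboa', 'LOC')]
```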

The annotation manual is provided together with the corpus.

The corpus can be browsed online: http://cintil.ul.pt/
- isReferencedBy: C-005009: NPChunks
C-004331: SIGNUM Database
Multimodal/Multimedia Resources
The SIGNUM Database contains both isolated and continuous utterances of various signers. Since a vision-based approach was used for sign language recognition, the corpus was recorded on video. For quick random access to individual frames, each video clip is stored as a sequence of images.

The vocabulary comprises 450 basic signs in German Sign Language (DGS) representing different word types. Based on this vocabulary, a total of 780 sentences were constructed. Each sentence ranges from two to eleven signs in length. No intentional pauses are placed between signs within a sentence, but the sentences themselves are separated. The entire corpus, i.e. all 450 basic signs and all 780 sentences, was performed once by each of 25 native signers of different sexes and ages. One of them was chosen as the so-called reference signer; his performances were recorded three times.

Corpus Content:
- Language: German Sign Language (DGS)
- Vocabulary size: 450 basic signs
- Number of signers: 25 native signers
- Number of isolated signs: 450
- Number of continuous sentences: 780
- Number of performances:
* Reference signer: 3
* Other signers: 1
- Total number of sequences: 33,210
- Total number of images: 5,970,450
- Equivalent video duration: 55.3 hours
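
The totals above follow from the other figures in the list, as the short Python sketch below shows. All numbers are taken from the list; the variable names are illustrative only.

```python
# Figures from the corpus content list.
isolated_signs = 450
sentences = 780
signers = 25
reference_performances = 3  # the reference signer was recorded three times
other_performances = 1      # every other signer was recorded once
total_images = 5_970_450
frames_per_second = 30

# One full performance covers every isolated sign and every sentence.
sequences_per_performance = isolated_signs + sentences

# 24 signers with one performance each, plus three by the reference signer.
performances = (signers - 1) * other_performances + reference_performances

total_sequences = sequences_per_performance * performances
duration_hours = total_images / frames_per_second / 3600

print(total_sequences)           # 33210
print(round(duration_hours, 1))  # 55.3
```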

Technical Details:
- Image resolution: 776x578, 30fps, 24bpp, color
- Image format: JPEG (8:1 compression)
- Data volume: 920GB (approx.)
- File system: NTFS 3.1
- Medium: 1 hard disk
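
As a rough plausibility check of the stated data volume, the sketch below estimates the storage needed from the image resolution, colour depth and compression ratio. It assumes that the 8:1 ratio is relative to uncompressed 24bpp frames and that every image compresses equally, neither of which is stated in the record; the image count comes from the corpus content list.

```python
width, height = 776, 578
bytes_per_pixel = 24 // 8  # 24bpp colour
compression_ratio = 8      # assumption: 8:1 relative to the raw frame size
total_images = 5_970_450   # from the corpus content list

raw_frame_bytes = width * height * bytes_per_pixel
compressed_frame_bytes = raw_frame_bytes / compression_ratio
estimated_total_bytes = compressed_frame_bytes * total_images

print(f"raw frame: {raw_frame_bytes} bytes")
print(f"estimate: {estimated_total_bytes / 1e9:.0f} GB (decimal)")
print(f"estimate: {estimated_total_bytes / 2**30:.0f} GiB (binary)")
```

Under these assumptions the estimate lands at roughly 1 TB, the same order of magnitude as the listed figure of approx. 920GB. Actual JPEG file sizes vary with image content.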

Further information can also be found on the following website: http://www.phonetik.uni-muenchen.de/forschung/Bas/SIGNUM/
