Language Resource Search - SHACHI: Language Resource Metadata Database


C-001449: Korean Mandarin Speech Recognition Corpus (desktop) person name (150 people)
Desktop/Microphone
This corpus comprises 1,500 Korean Mandarin person names uttered by 150 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, with 1.56 hours of speech per channel. The total size of the data is 2 GB.
Each speaker read 10 items. Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for use in testing and in telephone natural speech recognition systems.
C-001450: Korean Mandarin Speech Recognition Corpus (desktop) single Korean sentences (40 people)
Desktop/Microphone
This corpus comprises 4,800 Korean Mandarin sentences uttered by 40 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, with 7.63 hours of speech per channel. The total size of the data is 9.82 GB.
Each speaker read 120 items. Text files are stored in Unicode format. All data have been proofread manually.
The corpus is intended for use in testing and in telephone natural speech recognition systems.
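The sizes quoted for the two speech corpora above can be cross-checked from the recording parameters. The sketch below assumes uncompressed PCM audio, ignores WAV header overhead, and reads the quoted gigabyte figures as binary gigabytes (2^30 bytes); the function name `pcm_size_bytes` is a hypothetical helper, not part of any corpus tooling.

```python
def pcm_size_bytes(hours, sample_rate_hz=48_000, bits_per_sample=16, channels=4):
    """Approximate size of raw PCM audio in bytes (WAV headers ignored)."""
    seconds = hours * 3600
    bytes_per_sample = bits_per_sample // 8
    return seconds * sample_rate_hz * bytes_per_sample * channels


GIB = 2 ** 30  # assumption: the catalogue's gigabyte is the binary unit

# Hours of speech per channel, as stated in the two entries above.
for corpus_id, hours in [("C-001449", 1.56), ("C-001450", 7.63)]:
    size = pcm_size_bytes(hours)
    print(f"{corpus_id}: {size / GIB:.2f} GiB")
```

Under these assumptions the estimates come out at roughly 2.0 and 9.82 binary gigabytes, which agrees with the totals given in the entries.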
C-001451: LABEL-LEX (MW)
Monolingual Lexicons
LABEL-LEX (MW) is a Portuguese formalized lexicon, containing 88,619 inflected multiword lexical units (formally, sequences of simple words). The units are distributed as follows:
- 85,881 nouns, with information about type, gender, number, inflected forms, irregular inflected forms and subcategorisation frames
- 2,204 adverbs
- 409 adjectives, with information about degree, gender, number, comparison, position, inflected forms, irregular inflected forms and subcategorisation frames
- 125 pronouns, prepositions/postpositions and conjunctions

From a linguistic point of view, multiword lexical units exhibit distributional and selectional constraints; they lack compositionality, and have, most of the time, idiomatic interpretations.
MWUs occur frequently in both everyday language and technical and scientific texts to express ideas and concepts that in general cannot be stated by "free" linguistic structures.
Automatic text analysis is therefore impossible to envisage without adequate identification and treatment of multiword lexical units. The meaning of a text is mostly supplied by frequently occurring multiword units, especially compound nouns.

Other formats and other services may be supplied by the data owner upon request (e.g. conversion into buyer's formalism, selection of subsets of the words missing from your own dictionary).
C-001452: LABEL-LEX (SW)
Monolingual Lexicons
LABEL-LEX (SW) is a Portuguese formalized lexicon, containing 1,545,481 simple inflected words. The words are distributed as follows:
- 142,236 nouns, with information about type, gender, number, inflected forms, and irregular inflected forms
- 3,155 adverbs, with information about degree, polarity and wh-type
- 1,258,913 verbs, with information about type, mood, tense, person, number, transitivity, inflected forms, irregular inflected forms, subcategorisation frames, auxiliarity and clitics
- 139,536 adjectives, with information about degree, gender, number, comparison, position, inflected forms, irregular inflected forms and subcategorisation frames
- 291 pronouns, 443 determiners, 8 articles (sub-class of determiners), 27 prepositions/postpositions, 40 conjunctions, 317 numerals (sub-class of determiners), 205 interjections, 310 contractions (e.g. Prep + Det)
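
The per-category counts listed for both LABEL-LEX lexicons can be tallied against the stated totals. This is only a minimal consistency check; the dictionary names are hypothetical, and it assumes that articles and numerals are counted separately from the 443 determiners even though they are described as sub-classes.

```python
# Counts copied from the LABEL-LEX (MW) entry.
label_lex_mw = {
    "nouns": 85_881,
    "adverbs": 2_204,
    "adjectives": 409,
    "pronouns, prepositions/postpositions and conjunctions": 125,
}

# Counts copied from the LABEL-LEX (SW) entry.
label_lex_sw = {
    "nouns": 142_236,
    "adverbs": 3_155,
    "verbs": 1_258_913,
    "adjectives": 139_536,
    "pronouns": 291,
    "determiners": 443,
    "articles": 8,
    "prepositions/postpositions": 27,
    "conjunctions": 40,
    "numerals": 317,
    "interjections": 205,
    "contractions": 310,
}

mw_total = sum(label_lex_mw.values())
sw_total = sum(label_lex_sw.values())
print(f"MW total: {mw_total:,}")
print(f"SW total: {sw_total:,}")
```

Both sums reproduce the totals quoted in the entries (88,619 multiword units and 1,545,481 simple words), which supports the separate-counting assumption.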

Each dictionary entry is associated with a lemma; information about POS and morphological attributes (such as gender, number, person, case (for personal pronouns), tense, mood, diminutives, augmentatives, and superlatives) is systematically formalized for each lexical entry.
Syntactic and semantic information is being encoded incrementally. For instance, verbs are sub-classified (transitive, intransitive, auxiliary), and adjectives are being refined with information about their syntactic sub-classification.

Other formats and other services may be supplied by the data owner upon request (e.g. conversion into buyer's formalism, selection of subsets of the words missing from your own dictionary).
C-001458: LEGA Corpus of Galician-Spanish legal texts
- hasPart: C-001354: CLUVI Parallel Corpus
C-001459: LEGE-BI Corpus of Basque-Spanish legal texts
- hasPart: C-001354: CLUVI Parallel Corpus
C-001460: LOGALIZA Corpus of English-Galician software localization
- hasPart: C-001354: CLUVI Parallel Corpus
C-001465: MLCC Multilingual and Parallel Corpora
Written Corpora
The MLCC text corpus has two main components - one set to allow comparable studies to be carried out in different languages and one set as the basis for translation studies.

The first set is referred to as the Polylingual Document Collection, a collection of articles from financial newspapers in 6 languages (Dutch, English, French, German, Italian and Spanish). It consists of the following sub-corpora:

Dutch - Het Financieele Dagblad - 1992-1993 (Samples)
The corpus contains articles from the Dutch financial newspaper Het Financieele Dagblad editions of 2nd January 1992 through to 24th December 1993. It contains around 8.5 million words of text.

English - The Financial Times - 1993 (Samples)
The corpus contains articles from the British financial newspaper The Financial Times editions from the year 1993. The corpus contains around 30 million words.

French - Le Monde - 1992-1993 (Samples)
A corpus of articles from the French newspaper Le Monde, consisting of two years' worth (1992-1993) of articles on financial subjects, approximately 10 million words.

German - Handelsblatt - 1986-1988 (Samples)
This subcorpus consists of articles from the period 02.01.1986 to 15.06.1988. It contains some 33 million words. It may be possible to obtain more recent articles from Handelsblatt.

Italian - Il Sole 24 Ore - 1992-1993 (Samples)
The corpus described here contains articles from the Italian financial newspaper Il Sole 24 Ore from the year 1992. This corpus contains some 1.88 million words. The SGML-markup was done by the University of Edinburgh.

Spanish - Expansion - 1994 (Samples)
This subcorpus contains articles from the Spanish financial newspaper Expansion editions from 21.10.1991 to 24.10.1991 and 14.05.1994 to 27.12.1994. It contains some 10 million words.

The second set is a Multilingual Parallel Corpus consisting of translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish. The parallel data, provided by the European Commission, comprises two sub-corpora from the Official Journal of the European Communities:

Official Journal of the European Commission, C Series: Written Questions 1993
Records of questions and answers regarding European Community matters. The data is regularly published as one section of the C Series of the Official Journal of the European Community in all official languages (previously nine). This corpus contains written questions asked by members of the European Parliament and corresponding answers from the European Commission in 9 parallel versions. The total size of the corpus is approximately 10.2 million words (ca. 1.1 million words per language).

Official Journal of the European Commission, Annex: Debates of the European Parliament 1992-1994
This parallel corpus consists of the records of parliamentary sittings published as the annex "Debates of the European Parliament" to the Official Journal of the European Community. The Parliamentary Debates are a record of what was said by members at the sittings, as well as of written input provided to them. The original data from which the translations are produced consist of a transcript of the sittings, each member speaking in the language of his choice. The final version consists of nine parallel versions of the material. The texts delivered comprise the Debates of Parliament from January 1992 to July 1994. This sub-corpus contains some 5 to 8 million words per language.
C-001466: MTP Annotated German corpus - tagged version
Written Corpora
This morphosyntactically annotated 500,000 word German corpus was developed as part of the Münster Tagging Project (MTP). It comprises a collection of SGML-formatted texts from two German newspapers, "Die Frankfurter Allgemeine Zeitung" and "Die Zeit", for the years 1990 to 1992. The articles reflect the typical distribution of newspaper topics, including economics, regional, national and international politics, the arts, sport, literature, history, science and modern life.
The text was segmented into sentence units and word tokens, and tagged for morphosyntactic POS markers. Two tagsets, which mainly differed in the granularity of the noun and verb tags, and which comprised 137 and 52 tags respectively, were used. Users may obtain annotated versions using either set, each of which comes with documentation and an instruction manual for tag application. A suite of tools, including the MTP taggers and the Xlex workbench for text handling, textual analysis and lexicography, is also available.
C-001467: MULTEXT Prosodic database
Desktop/Microphone
This database comprises one CD-ROM for each of five languages (French, English, Italian, German and Spanish), totalling 4 hours and 20 minutes of speech and involving 50 different speakers (5 male and 5 female per language). The recordings on which the corpus is based consist of passages of about five sentences extracted from the EUROM.1 speech corpus (Esprit 2589 project "Multi-lingual Speech Input/output Assessment, Methodology and Standardisation"). The corpus was stylised automatically by an algorithm which factors out microprosodic effects and represents the intonation contour of utterances by a series of target points. Once interpolated by a smooth curve (spline), these points produce a contour indistinguishable from the original when re-synthesised, apart from a few detection errors. A symbolic coding of the 50,000 pitch movements of the corpus is also provided, along with the time-alignment of the orthographic transcription to the signal at word level. The entire corpus was verified and manually corrected by experts for each language.
The CD-ROMs contain for each passage:
· the signal file from EUROM.1,
· the alignment of orthographic transcription to signal at word level,
· the Fo file,
· the stylisation files,
· the re-synthesis using the stylised Fo,
· the symbolic coding file,
· the residual Fo, i.e. the difference between the Fo and the stylised curve,
· a description file for the recording.
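
The relationship between target points, the stylised curve and the residual Fo can be illustrated with a small sketch. It is not the algorithm used to build the database: cosine easing between target points stands in for the spline, and the function names and all numeric values are invented for illustration.

```python
import math


def stylised_f0(t, targets):
    """Smooth contour through (time_s, f0_hz) target points.

    Assumption: cosine easing is a simple stand-in for the spline
    interpolation mentioned in the database description.
    """
    if t <= targets[0][0]:
        return targets[0][1]
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        if t0 <= t <= t1:
            # Weight rises smoothly from 0 at t0 to 1 at t1.
            w = (1 - math.cos(math.pi * (t - t0) / (t1 - t0))) / 2
            return f0 + w * (f1 - f0)
    return targets[-1][1]


def residual_f0(times, measured, targets):
    """Residual = measured Fo minus the stylised curve, sample by sample."""
    return [m - stylised_f0(t, targets) for t, m in zip(times, measured)]


# Hypothetical target points and measured Fo values (Hz).
targets = [(0.0, 120.0), (0.5, 180.0), (1.0, 110.0)]
times = [0.0, 0.25, 0.5, 0.75, 1.0]
measured = [121.0, 152.0, 178.5, 144.0, 110.0]

residual = residual_f0(times, measured, targets)
print([round(r, 2) for r in residual])
```

The stylised curve passes exactly through each target point, so the residual captures only the deviation of the measured values from that smooth contour.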
Additional information: Campione, E., Véronis, J. (1998). A multilingual prosodic database. Proceedings of ICSLP'98, Sydney, Australia.
(download PDF version: http://www.elda.org/catalogue/fr/speech/doc/icslp98_mult.pdf)
