-
C-001401: ECI Multilingual Text
The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI), has 46 subcorpora in 27 (mainly European) languages. The total size of these is roughly 92 million (lexical) words. The corpora are marked up using TEI P2 conformant SGML (to varying levels of detail), with easy access to the source text without markup. Twelve of the component corpora are multilingual parallel corpora, each with two to nine subcorpora. All the alphabetic corpora (there is also some Japanese and Chinese) are encoded in the ISO LATIN family of 8-bit character sets (ISO 8859-1, -5 and -7). The CD-ROM is in High Sierra format (ISO 9660), readable on at least UNIX, MSDOS and Apple systems.
The amount of material per language varies from about 36 million words (German) to about 5 thousand words (Bulgarian). The majority of sources are journalistic in nature (newspapers, magazines, broadcasts); additional sources include dictionaries (Albanian, Gaelic, Turkish, Japanese/English), literature, technical reports, and proceedings or publications of international organizations. The table below lists the languages included, the subcorpus numbers for each language (in parentheses) and the amount of data per language in thousands of lexical words.
Language (Subcorpus #) Kwords Totals
German (70) 34291 (09) 191 (65) 20 (28) 187 (29) 59 (30) 76 (47) 24 (59) 50 (71) 21 (70A) 999 35918
French (31) 4775 (04) 4121 (28) 187 (29) 59 (30) 76 (47) 24 (51) 6 (59) 50 (71) 21 (32) 1667 10986
Spanish (31) 4500 (13) 830 (14) 1041 (15) 447 (47) 24 (32) 1667 (59) 50 (71) 21 8580
English (31) 4222 (36) 1141 (74) 95 (28) 187 (47) 24 (51) 6 (56) 97 (59) 50 (71) 21 (32) 1667 7510
Dutch (03) 5500 (02) 600 (47) 24 (71) 21 6145
Czech (44) 4726 4726
Italian (11) 3518 (42) 303 (58) 13 (29) 59 (30) 76 (47) 24 (71) 21 4014
Chinese (78) 2895 2895
Greek (10) 2515 (47) 24 (59) 50 (71) 21 2610
Norwegian (41) 2226 2226
Swedish (37) 1718 1718
Serb/Croat/Slov (24) 700 (56) 289 989
Tibetan (76) 834 834
Portuguese (60) 675 (47) 24 (71) 21 720
Malay (80) 563 563
Russian (73) 364 364
Japanese (57) 203 203
Turkish (20) 173 (20A) 110 283
Albanian (82) 205 205
Gaelic (55) 141 141
Estonian (39) 100 100
Uzbek (81) 88 88
Latin (74) 75 75
Danish (47) 24 (71) 21 45
Lithuanian (89) 20 20
Bulgarian (84) 5 5
Total 91969
- references: LDC, et al. 1994 ECI Multilingual Text Linguistic Data Consortium, Philadelphia
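Each table row ends with a per-language total that should equal the sum of its subcorpus sizes. The following is a minimal sketch of that check, assuming rows keep the "(subcorpus) Kwords ... total" layout shown above; the function name `check_row` is illustrative and not part of the corpus distribution.

```python
import re

def check_row(row):
    """Return (language, sum of per-subcorpus Kwords, stated total) for one table row."""
    # Each "(id) count" pair is a subcorpus number followed by its size in Kwords.
    pairs = re.findall(r"\(\s*(\w+)\s*\)\s*(\d+)", row)
    computed = sum(int(count) for _, count in pairs)
    # The language name is everything before the first parenthesis.
    language = row.split("(", 1)[0].strip()
    # The stated total is the last number on the line.
    stated = int(row.split()[-1])
    return language, computed, stated

rows = [
    "German (70) 34291 (09) 191 (65) 20 (28) 187 (29) 59 (30) 76 (47) 24 (59) 50 (71) 21 (70A) 999 35918",
    "Dutch (03) 5500 (02) 600 (47) 24 (71) 21 6145",
    "Danish (47) 24 (71) 21 45",
]
for row in rows:
    language, computed, stated = check_row(row)
    print(language, computed, stated, "ok" if computed == stated else "MISMATCH")
```

For the three rows used here, the computed sums match the stated totals.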
-
C-001403: EUROM1e English
Desktop/Microphone
The first truly multilingual speech database produced in Europe. It provides equivalent corpora for each of the European languages: the same number of speakers, selected in the same way and recorded under the same conditions, with common file formats. Initially, eight European countries made recordings: Italy, United Kingdom, Germany, Netherlands, Denmark, Sweden, Norway and France. Additional recordings were later completed in Greece, Spain and Portugal (thanks to CEE Esprit Project SAM-A). The content consists of Numbers, Passages, Sentences and CVC. There are more than sixty speakers per language.
-
C-001404: Egyptian Colloquial Arabic Lexicon
*Introduction*
This lexicon represents the first electronic pronunciation dictionary of Egyptian Colloquial Arabic (ECA), the spoken variety of Arabic found in Egypt. The dialect of ECA that this dictionary represents is Cairene Arabic.
*Data*
The lexicon contains 51,202 entries, drawn from 140 CALLHOME telephone conversations among native speakers of Colloquial Egyptian Arabic, collected and published by the LDC as follows: CALLHOME Egyptian Arabic Speech LDC97S45, CALLHOME Egyptian Arabic Transcripts LDC97T19, CALLHOME Egyptian Arabic Speech Supplement LDC2002S37 and CALLHOME Egyptian Arabic Transcripts Supplement LDC2002T38. The lexicon also contains entries derived manually from the Badawi & Hinds dictionary of Colloquial Egyptian Arabic.
The lexical entries are written one to a line with tab-separated fields, including orthographic representation in both the LDC romanization and Arabic script, as well as morphological, phonological, stress, source, and frequency information for each word.
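A one-entry-per-line, tab-separated file of this kind can be read with a few lines of Python. This is only a sketch: the number and order of fields are not specified in this description, so the sample lines and placeholder field values below are invented for illustration, and `read_lexicon` is a hypothetical helper name.

```python
import csv
import io

def read_lexicon(stream):
    """Yield each lexicon entry as a list of its tab-separated fields."""
    # QUOTE_NONE keeps quote characters inside fields as literal text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if fields:  # skip blank lines
            yield fields

# Invented two-line sample; real entries carry more fields than this.
sample = io.StringIO("word_a\tfield2\tfield3\nword_b\tfield2\tfield3\n")
entries = list(read_lexicon(sample))
field_counts = {len(entry) for entry in entries}
print(len(entries), "entries; field counts seen:", field_counts)
```

Checking that every line yields the same number of fields is a cheap sanity test before assigning meaning to individual columns.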
Relative to earlier versions of the Arabic Pronouncing Lexicon, this release provides not only a significant increase in the number of entries, but also a significant effort to improve the quality and consistency of all entries.
*Updates*
There are no updates at this time.
- references: Kilany, et al. 2002 Egyptian Colloquial Arabic Lexicon Linguistic Data Consortium, Philadelphia
-
C-001405: Emotional Prosody Speech and Transcripts
*Introduction*
This file contains documentation on the 2002 Emotional Prosody Speech and Transcripts, Linguistic Data Consortium (LDC) catalog number LDC2002S28 and ISBN 1-58563-237-6.
This publication contains audio recordings and corresponding transcripts, collected over an eight-month period in 2000-2001 and designed to support research in emotional prosody. The recordings consist of professional actors reading a series of semantically neutral utterances (dates and numbers) spanning fourteen distinct emotional categories, selected after Banse & Scherer's study of vocal emotional expression in German. (Banse, R. & Scherer, K. R. 1996. Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70, 614-636.)
Actor participants were provided with descriptions of each emotional context, including situational examples adapted from those used in the original German study. Flashcards were used to display series of four-syllable dates and numbers to be uttered in the appropriate emotional category.
The Prosody Recordings Project was interested in capturing the aspects of speech (emotion, intonation) that are left out of the written form of a message. In these experiments, simple phrases are expressed in ways that reflect varied contexts. The same phrase might be used to answer different questions, address listeners at different distances from the speaker, or express different emotional states. Actors were used because they are experts at producing this kind of contextual variation in a natural and convincing way.
*Data*
There are 30 data files: 15 recordings in sphere format and their transcripts.
The sphere files are encoded in two-channel interleaved 16-bit PCM, high-byte-first (big-endian) format, for a total of 2,912,067,980 bytes (2777 Mbytes) or nine hours of sphere data.
The utterances were recorded directly into WAVES+ datafiles, on two channels with a sampling rate of 22.05 kHz. The two microphones used were a stand-mounted boom Shure SN94 and a headset Sennheiser HMD 410.
The original session recordings are provided in their entirety, including informal chit-chat and discussion between each emotion category elicitation task. Time alignment is limited to utterances within the formal elicitation tasks, and miscellaneous regions have been marked as such.
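The stated size and duration can be cross-checked from the recording parameters. The sketch below assumes the byte total is essentially all sample data (file headers are ignored as negligible) and that "Mbytes" means units of 2^20 bytes; the function name `audio_hours` is illustrative.

```python
TOTAL_BYTES = 2_912_067_980   # total size of the sphere data, from the text
CHANNELS = 2                  # two-channel interleaved
BYTES_PER_SAMPLE = 2          # 16-bit PCM
SAMPLE_RATE_HZ = 22_050       # 22.05 kHz

def audio_hours(total_bytes, channels, bytes_per_sample, sample_rate_hz):
    """Convert a raw PCM byte count into hours of audio."""
    bytes_per_second = channels * bytes_per_sample * sample_rate_hz
    return total_bytes / bytes_per_second / 3600

megabytes = TOTAL_BYTES / 2**20
hours = audio_hours(TOTAL_BYTES, CHANNELS, BYTES_PER_SAMPLE, SAMPLE_RATE_HZ)
print(f"{megabytes:.0f} Mbytes, {hours:.2f} hours")
```

Under these assumptions the result is about 2777 Mbytes and a little over nine hours, consistent with the figures given.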
*Samples*
* Speech
* Transcripts
*Updates*
There are no updates at this time.
- references: Mark Liberman, et al. 2002 Emotional Prosody Speech and Transcripts Linguistic Data Consortium, Philadelphia
-
C-001406: English Chinese Translation Treebank v 1.0
*Description*
This release of English Chinese Translation Treebank v. 1.0 consists of 146,300 words in 325 files of individual news stories from Xinhua News Agency (corresponding to the Xinhua data in Chinese Treebank 5.0 LDC2005T01) that are translated into English, part-of-speech tagged and treebanked. The files were compressed using gzip.
The source files for the treebank annotation contain the final updated translation of these files. Translation errors that prevented complete treebank annotation have been corrected. This translation and annotation were completed in October 2004 and supersede any earlier translation.
This publication was compiled under National Science Foundation Grant #IIS-0325646.
*Samples*
For an example of the data in this publication, please view this sample.
- references: Ann Bies, et al. 2007 English Chinese Translation Treebank v 1.0 Linguistic Data Consortium, Philadelphia
-
C-001407: English Gigaword Second Edition
*Introduction*
English Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC), catalog number LDC2005T12, ISBN 1-58563-350-X. The English Gigaword corpus is a comprehensive archive of newswire text data in English that has been acquired over several years by the LDC. This is the second edition of the English Gigaword corpus.
This edition includes all of the contents of the first edition of the English Gigaword corpus (LDC2003T05) as well as new data from July 2002 through December 2004. Also, a new newswire source (the Central News Agency of Taiwan, English Service) has been added in this edition.
The five distinct international sources of English newswire included in this release are the following:
* Agence France-Presse, English Service (afp_eng)
* Associated Press Worldstream, English Service (apw_eng)
* Central News Agency of Taiwan, English Service (cna_eng)
* The New York Times Newswire Service (nyt_eng)
* The Xinhua News Agency, English Service (xin_eng)
*What's New In The Second Edition*
* New newswire data contents from July 2002 to December 2004 have been added for all of the four newswire sources that were represented in the first edition.
* A new source, the Central News Agency of Taiwan English Service (CNA_ENG), has been added.
* We have adopted a new naming scheme for filenames and DOC IDs. The new naming scheme represents the source names in a three-letter code and the language name in a three-letter code.
* Minor formatting improvements (mostly line-wrapping) have been made to some of the data contents originally published in the first edition.
*Pricing*
The Reduced Licensing Fee for this corpus is US$400.
- references: David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda 2005 English Gigaword Second Edition Linguistic Data Consortium, Philadelphia
-
C-001408: English Gigaword
*Introduction*
English Gigaword was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003T05, ISBN 1-58563-260-0, and is distributed on DVD. This is a comprehensive archive of newswire text data in English that has been acquired over several years by the LDC.
Four distinct international sources of English newswire are represented here:
* Agence France Presse English Service (afe)
* Associated Press Worldstream English Service (apw)
* The New York Times Newswire Service (nyt)
* The Xinhua News Agency English Service (xie)
*Data*
Much of the content in this collection has been published previously by the LDC in a variety of other, older corpora, particularly the North American News text corpora (LDC95T21, LDC98T30), the various TDT corpora and the AQUAINT text corpus (LDC2002T31). But there is a significant amount of material that is being released here for the first time: all of the Agence France Presse content, the 1995 and 2001 Xinhua content, and the portions of NYT and APW dating from February 2001 forward.
Each data file name consists of the three-letter prefix, followed by a six-digit date (representing the year and month during which the file contents were delivered by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). So, each file contains all the usable data received by LDC for the given month from the given news source.
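The naming scheme just described lends itself to a small parser. The sketch below assumes the six-digit date is ordered as year then month (YYYYMM) and that the prefix is lowercase; the file name `nyt200102.gz` and the helper names are hypothetical, chosen only to illustrate the pattern.

```python
import gzip
import re

# Three-letter source prefix, four-digit year, two-digit month, ".gz" extension.
FILE_NAME = re.compile(r"^(?P<source>[a-z]{3})(?P<year>\d{4})(?P<month>\d{2})\.gz$")

def parse_file_name(name):
    """Split a data file name into (source prefix, year, month)."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    month = int(match["month"])
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in file name: {name!r}")
    return match["source"], int(match["year"]), month

def read_text(path):
    """Return the decompressed contents of one monthly file as a string."""
    # The text is described as printable ASCII and whitespace.
    with gzip.open(path, "rt", encoding="ascii") as handle:
        return handle.read()

print(parse_file_name("nyt200102.gz"))
```

Rejecting names that do not match the pattern keeps stray files (documentation, DTDs) from being treated as monthly data.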
All text data are presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII and whitespace. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using a DTD file which is provided as part of this publication.
The markup structure, common to all data files, can be summarized as follows:
* The Headline element is optional; not all DOCs have one.
* The Dateline element is optional; not all DOCs have one.
* Paragraph tags are used only if the "type" attribute of the DOC is "story".
* All data files use the UNIX-standard "\n" form of line termination, and text lines are generally wrapped to a width of 80 characters or less.
For this release, all sources have received uniform treatment in terms of quality control, and we have applied a rudimentary (and _approximate_) categorization of DOC units into four distinct "types." The classification is indicated by the type="string" attribute that is included in each opening DOC tag. The four types are: story, multi, advis and other.
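Because the type is carried on every opening DOC tag, a tally of document types needs no full SGML parser. The following is a rough sketch that assumes attribute values are double-quoted; the sample text, including its `id` values, is invented for illustration, and `count_doc_types` is a hypothetical helper name.

```python
import re
from collections import Counter

# Capture the value of the type attribute on each opening DOC tag.
DOC_TAG = re.compile(r'<DOC\b[^>]*\btype="([^"]*)"')

def count_doc_types(text):
    """Count DOC units by the value of their type attribute."""
    return Counter(DOC_TAG.findall(text))

# Invented sample; only the opening and closing DOC tags are shown.
sample = (
    '<DOC id="sample-1" type="story">\n</DOC>\n'
    '<DOC id="sample-2" type="multi">\n</DOC>\n'
    '<DOC id="sample-3" type="story">\n</DOC>\n'
)
counts = count_doc_types(sample)
print(dict(counts))
```

A regular expression is enough here only because the markup is described as very simple and minimal; for anything beyond counting, the supplied DTD and an SGML parser are the safer route.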
Statistics regarding the quantities of data for each source are summarized below. Note that the "Totl-MB" numbers show the amount of data you get when the files are not compressed (i.e. nearly 12 gigabytes, total); the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are simply the number of whitespace-separated tokens (of all types) after all SGML tags are eliminated.
Source  #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
AFE         44      417     1216   170969   656269
APW         91     1213     3647   539665  1477466
NYT         96     2104     5906   914159  1298498
XIE         83      320      940   131711   679007
TOTAL      314     4054    11709  1756504  4111240
*Updates*
There are no updates available at this time.
- references: David Graff 2003 English Gigaword Linguistic Data Consortium, Philadelphia
-
C-001409: English-Arabic Treebank v 1.0
*Introduction*
This file contains documentation on the English-Arabic Parallel Treebank v 1.0, Linguistic Data Consortium (LDC) catalog number LDC2006T10, ISBN 1-58563-387-9.
This release of the English-Arabic Treebank consists of 52,238 words in 224 files of individual Agence France Presse (AFP) news stories (corresponding to approximately the first 50K words of the Arabic Treebank: Part 1 v 3.0 -- LDC Catalog No.: LDC2005T02, ISBN: 1-58563-330-5). The English translation was provided by LDC, and was part-of-speech tagged and treebanked for this project.
*Data*
The guidelines followed for both part-of-speech and treebank annotation are essentially Penn Treebank II style, with two notable differences:
* POS: tokenization of hyphenated items ("New York-based" has been replaced by "New York - based" for example), and the addition of HYPH and AFX tags necessitated by this change in tokenization
* TreeBank: the addition of the node label NML for sub-NP nominal constituents (replacing NX and most NP-internal NAC)
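The hyphen tokenization change can be illustrated with a naive rule that separates every word-internal hyphen. This is only a sketch of the idea: the actual annotation guidelines may treat some hyphenated items differently, and `split_hyphens` is a hypothetical helper name.

```python
import re

# A hyphen with a word character on both sides.
HYPHEN_INSIDE_WORD = re.compile(r"(?<=\w)-(?=\w)")

def split_hyphens(text):
    """Surround word-internal hyphens with spaces so they become separate tokens."""
    return HYPHEN_INSIDE_WORD.sub(" - ", text)

tokens = split_hyphens("New York-based").split()
print(tokens)  # ['New', 'York', '-', 'based']
```

Once the hyphen is a token of its own, it needs its own part-of-speech tag, which is why the HYPH tag accompanies this change.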
*Samples*
For an example of the data in this corpus, please review this text sample.
- references: Ann Bies 2006 English-Arabic Treebank v 1.0 Linguistic Data Consortium, Philadelphia
-
C-001410: European Language Newspaper Text
The European Language Newspaper Text corpus is also known as the French Language News Corpus. This corpus includes roughly 100 million words of French, 90 million words of German and 15 million words of Portuguese and has been marked up using SGML. The text is taken from the following sources:
* Approximately 60 million words of text in French and German have been made available from the Associated Press (AP) World Stream. AP World Stream is a compilation of AP news reports produced in 86 bureaus in 68 countries.
The Associated Press Worldstream newswire service provides articles in six languages, interleaved on a single data stream. The data is collected via an Associated Press installed telephone line at the LDC.
* Approximately 110 million words of text in French, German and Portuguese have been made available from Agence France Presse. Each language was supplied in separate data streams collected via a Dateno MKII satellite receiver and associated equipment at the LDC.
* Approximately 20 million words of text in German have been made available from Deutsche Presse Agentur. The text is collected via an AP Datafeatures telephone line installed at the Linguistic Data Consortium.
* A smaller part of the corpus comes from Le Monde newspaper. The Le Monde data covers about 5.6 million words of French. It is quite distinct from the AP and AFP materials in its markup approach, because it has been prepared in compliance with the conventions of the Text Encoding Initiative (TEI), rather than having been based on the model of the TIPSTER collections, which were originally developed prior to the establishment of the TEI conventions.
- references: David Graff 1995 European Language Newspaper Text Linguistic Data Consortium, Philadelphia
-
C-001411: FEGA Corpus of French-Galician literary texts
- hasPart: C-001354: CLUVI Parallel Corpus