-
C-001548: UN Parallel Text (Complete)
LDC94T4A - Complete UN Parallel Text corpus LDC94T4B-1 - English text only LDC94T4B-2 - French text only LDC94T4B-3 - Spanish text only This set of three compact discs contains documents provided to the LDC by the United Nations, for use in research on machine translation technology. The documents come from the Office of Conference Services at the UN in New York and are drawn from archives that span the period between 1988 and 1993.
This publication contains the English, French and Spanish archives, with data from each language stored on a separate disc in the set. Care has been taken to arrange the document files in a parallel directory structure for each language, so that corresponding translations of a document are found directly by means of the directory paths and file names.
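The parallel directory layout described above suggests a simple lookup. The following is a minimal sketch only: it assumes each disc is mounted under its own root directory and that a translation has the same relative path and file name as the English file. The function name `parallel_paths` and the root locations are hypothetical, and the actual naming on the discs should be checked against the included tables.

```python
from pathlib import Path


def parallel_paths(english_file, english_root, other_roots):
    """Return {language: path} for translations that exist under the other roots.

    Assumption (not confirmed by the corpus documentation): a translation
    shares the English file's relative path and file name.
    """
    relative = Path(english_file).relative_to(english_root)
    found = {}
    for language, root in other_roots.items():
        candidate = Path(root) / relative
        if candidate.is_file():
            found[language] = candidate
    return found
```

A caller would pass the English file, the English disc root, and a mapping such as `{"french": french_root, "spanish": spanish_root}`; languages with no matching file are simply absent from the result.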
All parallel files in this corpus are English-based: for every file on the English disc, there will be a corresponding file on either the French or Spanish disc, or both. Tables are included on all discs to assist in determining which parallels are present. The total content by language is summarized below (values are approximate):
Language                                      No. of documents   Millions of words
-----------------------------------------------------------------------------------
English                                       22,000             59
French                                        20,000             58
Spanish                                       14,400             48
French/Spanish parallel data (per language)   12,700             38
-----------------------------------------------------------------------------------
In preparing the text for publication, we have applied SGML (Standard Generalized Markup Language) tagging that preserves all typographic and meta-information that was present in the UN archival files. For those researchers who use SGML, a working DTD (Document Type Definition) is provided on each disc. For those who do not need SGML markup, a simple script is included, for use with the sed (stream editor) utility, that will filter out the SGML-specific material and meta-information, leaving only the plain text. (Sed is a standard utility on Unix systems and is also available as free software for MS-based systems.) The character set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table.
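For readers without sed, the filtering idea can be approximated in a few lines. This is a rough sketch and not the script shipped on the discs: it assumes tags are delimited by angle brackets, does not expand SGML entities, and does not remove the meta-information that the supplied script filters out. The name `plain_text` is hypothetical.

```python
import re

# Anything between "<" and ">" is treated as a tag (an assumption about the markup).
TAG_PATTERN = re.compile(r"<[^>]*>")


def plain_text(raw_bytes):
    """Decode ISO 8859-1 bytes and drop anything that looks like an SGML tag."""
    text = raw_bytes.decode("iso-8859-1")
    without_tags = TAG_PATTERN.sub("", text)
    # Discard lines that are empty once the tags are gone.
    lines = [line.rstrip() for line in without_tags.splitlines()]
    return "\n".join(line for line in lines if line)
```

Decoding explicitly as ISO 8859-1 matters here: bytes in the upper half of the table would be misread or rejected if the files were opened as UTF-8.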
Parallel samples of the three languages in this publication are listed below.
* LDC1994T04 English Sample
* LDC1994T04 French Sample
* LDC1994T04 Spanish Sample
Based on the combined usage of title strings and document numbers, it was possible to identify parallel sets amounting to over 60% of the data in the archive (a total of 56,684 files in 21,986 parallel sets). We have yet to find a reasonable method for doing a more careful search for parallels in the remaining 40%. Part of this residue is due to the fact that this corpus contains only English-based parallel sets; parallel sets that included only French and Spanish versions have not been included in this release.
Users of this corpus must be warned that the parallel sets identified by this automatic method will include errors. We have observed a number of cases (over 700 in the corpus as a whole) where the members of a parallel set show a serious discrepancy in quantity of text. Also, we must expect that at least some of these sets (and perhaps some less obvious cases) constitute a complete mismatch. The reftable files in the tables directory give an indication of the relative consistency among members of a parallel set in terms of overall size. From these tables, the least likely candidates for parallelism can be easily identified.
- isPartOf: C-001549: UN Parallel Text (English)
- isPartOf: C-001550: UN Parallel Text (French)
- isPartOf: C-001551: UN Parallel Text (Spanish)
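The size-consistency check mentioned for the reftable files can be illustrated as follows. This is a hedged sketch only: the reftable format is not described here, so the input is assumed to be a dictionary mapping a set identifier to per-language sizes, and the threshold `max_ratio=2.0` is an arbitrary illustrative value, not one used by the LDC.

```python
def flag_size_mismatches(parallel_sets, max_ratio=2.0):
    """Return IDs of sets whose largest member exceeds max_ratio times the smallest.

    parallel_sets: {set_id: {language: size}} (hypothetical input structure).
    """
    flagged = []
    for set_id, sizes in parallel_sets.items():
        smallest = min(sizes.values())
        largest = max(sizes.values())
        # An empty member is always suspicious; otherwise compare the ratio.
        if smallest == 0 or largest / smallest > max_ratio:
            flagged.append(set_id)
    return flagged
```

Sets flagged this way are candidates for manual inspection rather than confirmed mismatches, since translations legitimately differ in length.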
-
C-001549: UN Parallel Text (English)
LDC94T4A - Complete UN Parallel Text corpus LDC94T4B-1 - English text only LDC94T4B-2 - French text only LDC94T4B-3 - Spanish text only This set of three compact discs contains documents provided to the LDC by the United Nations, for use in research on machine translation technology. The documents come from the Office of Conference Services at the UN in New York and are drawn from archives that span the period between 1988 and 1993.
This publication contains the English, French and Spanish archives, with data from each language stored on a separate disc in the set. Care has been taken to arrange the document files in a parallel directory structure for each language, so that corresponding translations of a document are found directly by means of the directory paths and file names.
All parallel files in this corpus are English-based: for every file on the English disc, there will be a corresponding file on either the French or Spanish disc, or both. Tables are included on all discs to assist in determining which parallels are present. Due to the nature and organization of UN translation services and the original electronic text archives, the process of finding and sorting out parallel documents yielded numerous gaps, with many files in each language having no parallel in other languages.
In preparing the text for publication, we have applied a fully-compliant SGML format (Standard Generalized Markup Language). For those researchers who use SGML, a working DTD (Document Type Definition) is provided on each disc. For those who do not need SGML markup, a simple script is included that can be used to filter out the SGML-specific material and leave only the plain text. The character set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table.
- hasPart: David Graff 1994 UN Parallel Text (Complete), Linguistic Data Consortium, Philadelphia
-
C-001550: UN Parallel Text (French)
LDC94T4A - Complete UN Parallel Text corpus LDC94T4B-1 - English text only LDC94T4B-2 - French text only LDC94T4B-3 - Spanish text only This set of three compact discs contains documents provided to the LDC by the United Nations, for use in research on machine translation technology. The documents come from the Office of Conference Services at the UN in New York and are drawn from archives that span the period between 1988 and 1993.
This publication contains the English, French and Spanish archives, with data from each language stored on a separate disc in the set. Care has been taken to arrange the document files in a parallel directory structure for each language, so that corresponding translations of a document are found directly by means of the directory paths and file names.
All parallel files in this corpus are English-based: for every file on the English disc, there will be a corresponding file on either the French or Spanish disc, or both. Tables are included on all discs to assist in determining which parallels are present. Due to the nature and organization of UN translation services and the original electronic text archives, the process of finding and sorting out parallel documents yielded numerous gaps, with many files in each language having no parallel in other languages.
In preparing the text for publication, we have applied a fully-compliant SGML format (Standard Generalized Markup Language). For those researchers who use SGML, a working DTD (Document Type Definition) is provided on each disc. For those who do not need SGML markup, a simple script is included that can be used to filter out the SGML-specific material and leave only the plain text. The character set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table.
- hasPart: David Graff 1994 UN Parallel Text (Complete), Linguistic Data Consortium, Philadelphia
-
C-001551: UN Parallel Text (Spanish)
LDC94T4A - Complete UN Parallel Text corpus LDC94T4B-1 - English text only LDC94T4B-2 - French text only LDC94T4B-3 - Spanish text only This set of three compact discs contains documents provided to the LDC by the United Nations, for use in research on machine translation technology. The documents come from the Office of Conference Services at the UN in New York and are drawn from archives that span the period between 1988 and 1993.
This publication contains the English, French and Spanish archives, with data from each language stored on a separate disc in the set. Care has been taken to arrange the document files in a parallel directory structure for each language, so that corresponding translations of a document are found directly by means of the directory paths and file names.
All parallel files in this corpus are English-based: for every file on the English disc, there will be a corresponding file on either the French or Spanish disc, or both. Tables are included on all discs to assist in determining which parallels are present. Due to the nature and organization of UN translation services and the original electronic text archives, the process of finding and sorting out parallel documents yielded numerous gaps, with many files in each language having no parallel in other languages.
In preparing the text for publication, we have applied a fully-compliant SGML format (Standard Generalized Markup Language). For those researchers who use SGML, a working DTD (Document Type Definition) is provided on each disc. For those who do not need SGML markup, a simple script is included that can be used to filter out the SGML-specific material and leave only the plain text. The character set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table.
- hasPart: David Graff 1994 UN Parallel Text (Complete), Linguistic Data Consortium, Philadelphia
-
C-001552: UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation
- hasPart: C-001354: CLUVI Parallel Corpus
-
C-001553: US English Speecon database
Desktop/Microphone
The US English Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 550 adult US English speakers (286 males, 264 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2) The second set comprises the recordings of 50 child US English speakers (21 boys, 29 girls), recorded over 4 microphone channels in 1 recording environment (children room).
This database is partitioned into 25 DVDs (first set) and 3 DVDs (second set).
The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.
Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file has a corresponding ASCII SAM label file, which contains the relevant descriptive information.
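The signal format stated above (16 kHz, 16-bit unsigned integers, lo-hi byte order) can be read with the standard library alone. The sketch below assumes the signal files are header-less raw sample streams, which is not stated in this description; the function names are hypothetical.

```python
import struct

SAMPLE_RATE_HZ = 16_000  # sampling rate given in the database description


def read_samples(raw_bytes):
    """Interpret bytes as unsigned 16-bit little-endian (lo-hi) samples."""
    count = len(raw_bytes) // 2
    # "<" selects little-endian with standard sizes; "H" is an unsigned 16-bit integer.
    return list(struct.unpack(f"<{count}H", raw_bytes[: count * 2]))


def duration_seconds(samples):
    """Length of a single-channel recording in seconds."""
    return len(samples) / SAMPLE_RATE_HZ
```

For one channel, `read_samples(open(path, "rb").read())` yields integer samples in the range 0 to 65535; whether any offset or scaling applies should be confirmed against the SAM label file.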
Each speaker uttered the following items (over 290 items for adults and over 210 items for children):
Calibration data:
- 6 noise recordings
- The silence word recording
Free spontaneous items (adults only):
- 5 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
17 Elicited spontaneous items (adults only):
- 3 dates
- 2 times
- 3 proper names
- 2 city names
- 1 letter sequence
- 2 answers to questions
- 3 telephone numbers
- 1 language
Read speech:
- 30 phonetically rich sentences uttered by adults and 60 uttered by children
- 5 phonetically rich words (adults only)
- 4 isolated digits
- 1 isolated digit sequence
- 4 connected digit sequences
- 1 telephone number
- 3 natural numbers
- 1 money amount
- 2 time phrases (T1 : analogue, T2 : digital)
- 3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
- 3 letter sequences
- 1 proper name
- 2 city or street names
- 2 questions
- 2 special keyboard characters
- 1 Web address
- 1 email address
- 208 application specific words and phrases per session (adults)
- 74 toy commands, 14 phone commands and 34 general commands (children)
The following age distribution has been obtained:
Adults: 217 speakers are between 15 and 30, 204 speakers are between 31 and 45, 129 speakers are over 46.
Children: 18 speakers are between 8 and 10, and 32 speakers are between 11 and 14.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-001554: US Spanish Speecon database
Desktop/Microphone
The US Spanish Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 550 adult Spanish speakers in the US (255 males, 295 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
2) The second set comprises the recordings of 50 child Spanish speakers in the US (28 boys, 22 girls), recorded over 4 microphone channels in 1 recording environment (children room).
This database is partitioned into 22 DVDs (first set) and 3 DVDs (second set).
The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.
Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file has a corresponding ASCII SAM label file, which contains the relevant descriptive information.
Each speaker uttered the following items:
Calibration data:
- 6 noise recordings
- The silence word recording
Free spontaneous items (adults only):
- 2 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
17 Elicited spontaneous items (adults only):
- 3 dates
- 2 times
- 3 proper names
- 2 city names
- 1 letter sequence
- 2 answers to questions
- 3 telephone numbers
- 1 language
Read speech:
- 30 phonetically rich sentences uttered by adults and 60 uttered by children
- 5 phonetically rich words (adults only)
- 4 isolated digits
- 1 isolated digit sequence
- 4 connected digit sequences
- 1 telephone number
- 3 natural numbers
- 1 money amount
- 2 time phrases (T1 : analogue, T2 : digital)
- 3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
- 3 letter sequences
- 1 proper name
- 2 city or street names
- 2 questions
- 2 special keyboard characters
- 1 Web address
- 1 email address
- 208 application specific words and phrases per session (adults)
- 74 toy commands, 14 phone commands and 34 general commands (children)
The following age distribution has been obtained:
Adults: 223 speakers are between 15 and 30, 191 speakers are between 31 and 45, and 136 speakers are over 46.
Children: 15 speakers are between 8 and 10, 35 speakers are between 11 and 14.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-001555: USC Marketplace Broadcast News Speech
*Introduction*
The USC Marketplace Broadcast News Corpus contains approximately 40 hours of audio data, recorded daily between May 1, 1996 and September 18, 1996. Corresponding transcript files were created by Federal Document Clearing House and enhanced by the LDC to include story boundaries, disfluency markers, and speaker and gender identification. In keeping with HUB4-style transcription conventions, LDC spelled out all digit strings in standard orthography. Commercial and music segments, while part of the audio publication, were excluded from the transcripts. The timestamps mark the beginning of each speaker turn relative to the beginning of the recording and are precise to a hundredth of a second. Although the transcripts were created using HUB4 conventions, the second- and third-pass quality checks typically required by government-sponsored evaluation projects were skipped.
*Data*
The USC Marketplace recordings from the summer of 1996 were received on digital audio tapes (DATs) from the University of Southern California. LDC excluded from this set the roughly seven hours of broadcast that are currently included in the 1996 English Broadcast News publication.
Marketplace is produced by USC Radio in Los Angeles, a division of the University of Southern California.
*Updates*
There are no updates at this time.
-
C-001556: USC Marketplace Broadcast News Transcripts
*Introduction*
The USC Marketplace Broadcast News Corpus contains approximately 40 hours of audio data, recorded daily between May 1, 1996 and September 18, 1996. Corresponding transcript files were created by Federal Document Clearing House and enhanced by the LDC to include story boundaries, disfluency markers, and speaker and gender identification. In keeping with HUB4-style transcription conventions, LDC spelled out all digit strings in standard orthography. Commercial and music segments, while part of the audio publication, were excluded from the transcripts. The timestamps mark the beginning of each speaker turn relative to the beginning of the recording and are precise to a hundredth of a second. Although the transcripts were created using HUB4 conventions, the second- and third-pass quality checks typically required by government-sponsored evaluation projects were skipped.
*Data*
The USC Marketplace recordings from the summer of 1996 were received on digital audio tapes (DATs) from the University of Southern California. LDC excluded from this set the roughly seven hours of broadcast that are currently included in the 1996 English Broadcast News publication.
Marketplace is produced by USC Radio in Los Angeles, a division of the University of Southern California.
*Updates*
There are no updates at this time.
- references: C-001555: USC Marketplace Broadcast News Speech
-
C-001557: VAHA (POLYPHONE II)
*Introduction*
Voice Across Hispanic America (VAHA) is a corpus of Spanish telephone speech, recorded digitally from 915 native speakers of Spanish in various parts of the United States. With nearly 39,000 recorded and transcribed utterances, VAHA will be useful for a variety of research studies, but it is intended primarily for speech technology research and development in telecommunications applications. It is patterned after Macrophone (1), an American English corpus (LDC94S21) which is widely used for this purpose.
*Data*
This corpus was collected by Texas Instruments in Dallas, TX for the Linguistic Data Consortium.
*Updates*
There are no updates at this time.
- conformsTo: C-001054: MACROPHONE