Language resource #: 3330
-
C-001073: North American News Text Corpus
Both the New York Times and the LA Times/Washington Post services include a range of other newspaper sources in their syndicated newswires.
- hasVersion: C-001074: North American News Text Supplement
-
C-001074: North American News Text Supplement
The previous North American News release included prior materials from both the LA Times/Washington Post and the New York Times; this supplement provides the continuation of those sources.
-
C-001075: OGI Spelled and Spoken Word
The OGI Spelled and Spoken Telephone Corpus consists of speech recordings from over 3,650 telephone calls, each made by a different speaker to an automated prompting/recording system installed at the Oregon Graduate Institute. Speakers were asked to say their name, where they were calling from and where they grew up; they were asked to answer a couple of yes/no questions and to spell their first and last names; many were also asked to repeat a few specific words and to recite the letters of the alphabet. Each response to a prompt is stored as a separate waveform file, and the files are organized according to prompt (response type); all responses from a given call carry a unique caller-index number as part of the file name, so that responses can easily be sorted by speaker. Waveform data are stored in compressed form, using the NIST SPHERE 2.0 software package, which is available separately at no charge to users. SPHERE 2.0 provides the decompression software needed to extract the waveform data, as well as tools for accessing and modifying file headers.
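As a rough illustration of what accessing a SPHERE file header can involve, the sketch below parses the plain-text header without using the SPHERE package itself. It assumes the commonly documented layout: a `NIST_1A` line, a line giving the header size in bytes, then `name -type value` fields terminated by `end_head`. The function name `read_sphere_header` and the field values in the demo are hypothetical and are not taken from the corpus. Decompressing the waveform data still requires the SPHERE tools.

```python
import io


def read_sphere_header(stream):
    """Parse the text header at the start of a NIST SPHERE file (sketch)."""
    magic_line = stream.readline()
    if magic_line.strip() != b"NIST_1A":
        raise ValueError("not a SPHERE file")
    size_line = stream.readline()
    header_size = int(size_line.strip())  # total header length in bytes
    consumed = len(magic_line) + len(size_line)
    rest = stream.read(header_size - consumed)

    fields = {}
    for line in rest.decode("ascii", errors="replace").splitlines():
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type tag, value
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, type_tag, value = parts
        if type_tag == "-i":      # integer field
            fields[name] = int(value)
        elif type_tag == "-r":    # real-valued field
            fields[name] = float(value)
        else:                     # "-sN": string of N characters
            fields[name] = value
    return fields


# Synthetic header for demonstration only (values are made up).
_demo = (b"NIST_1A\n   1024\n"
         b"sample_rate -i 8000\n"
         b"sample_coding -s4 ulaw\n"
         b"end_head\n").ljust(1024, b" ") + b"\x00\x01"
demo_fields = read_sphere_header(io.BytesIO(_demo))
```

With a real file, the same function could be called on a stream opened in binary mode, for example `open(path, "rb")`.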
Time-aligned phonetic transcriptions are provided for a subset of responses, and a complete log of each call (giving speaker sex, quality judgments and orthographic transcriptions of all responses) is included in a form suitable for use as a relational database.
-
C-001076: PhoneBook: NYNEX Isolated Words
PhoneBook is a phonetically-rich, isolated-word, telephone-speech database, created because of (1) the lack of available large-vocabulary isolated-word data, (2) anticipated continued importance of isolated-word and keyword-spotting technology to speech-recognition-based applications over the telephone and (3) findings that continuous-speech training data is inferior to isolated-word training for isolated-word recognition. The goal of PhoneBook is to serve as a large database of American English word utterances incorporating all phonemes in as many segmental/stress contexts as are likely to produce coarticulatory variations, while also spanning a variety of talkers and telephone transmission characteristics. We anticipate that it will be useful in ways analogous to TIMIT/NTIMIT.
The core section of PhoneBook consists of a total of 93,667 isolated-word utterances, totalling 23 hours of speech. This breaks down to 7,979 distinct words, each said by an average of 11.7 talkers, with 1,358 talkers each saying up to 75 words. All data were collected in 8-bit mu-law digital form directly from a T1 telephone line. Talkers were adult native speakers of American English chosen to be demographically representative of the U.S.
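The figures quoted above can be cross-checked with a few lines of arithmetic. This is only a consistency sketch using the numbers stated in the description; the variable names are illustrative.

```python
# Figures as stated in the PhoneBook description.
utterances = 93_667
distinct_words = 7_979
talkers = 1_358
max_words_per_talker = 75
hours_of_speech = 23

# Average number of talkers per word; rounds to the quoted 11.7.
talkers_per_word = utterances / distinct_words

# Average number of words actually recorded per talker (below the cap of 75).
words_per_talker = utterances / talkers

# Upper bound on utterances if every talker had said all 75 words.
max_utterances = talkers * max_words_per_talker

# Mean duration of one isolated-word utterance, in seconds.
seconds_per_utterance = hours_of_speech * 3600 / utterances

print(round(talkers_per_word, 1), round(words_per_talker, 1),
      max_utterances, round(seconds_per_utterance, 2))
```

The stated totals are mutually consistent: the utterance count stays under the 75-words-per-talker ceiling, and the average utterance lasts a little under one second.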
Given the large set of talkers being recruited for the PhoneBook database, it made sense to exploit the opportunity to collect additional utterances. We chose spontaneous numerical utterances because of widespread interest in them and the need for very large numbers of talkers for research into spontaneous-speech effects. We restricted the collection to just three spontaneous digit sequences and one money amount, as the lists for the core of PhoneBook had been designed to approach the limit of reasonable duration for a caller's session. As a result, PhoneBook contains a total of 5,105 spontaneous utterances.
-
C-001077: Portuguese Newswire Text
This corpus builds on the Portuguese data published previously in the European Language Newswire Text Corpus and contains the previously published material, as well as more recent material.
- isReferencedBy: C-001410: European Language Newspaper Text
-
C-001078: Prague Arabic Dependency Treebank 1.0
The Prague Arabic Dependency Treebank (PADT) not only consists of multi-level linguistic annotations of Modern Standard Arabic but also provides a variety of unique software implementations designed for general use in Natural Language Processing (NLP).
-
C-001079: Prague Czech-English Dependency Treebank 1.0
The core part of PCEDT 1.0 is a Czech translation of 21,600 English sentences from the Wall Street Journal, which are part of the Penn Treebank corpus.
- isReferencedBy: C-001547: Treebank-3
- isReferencedBy: C-001080: Prague Dependency Treebank 1.0
-
C-001080: Prague Dependency Treebank 1.0
The Prague Dependency Treebank (PDT) is a long-term project with two major phases. In the first phase (1996-2000), the morphological and analytical (syntactic) layers of annotation were completed and released, together with a preview of the tectogrammatical layer annotation, as PDT 1.0.
-
C-001081: Prague Dependency Treebank 2.0
The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech text with complex and interlinked morphological (2 million words), syntactic (1.5 million words) and complex semantic (0.8 million words) annotation.
-
C-001082: Proposition Bank I