Language resource #: 3330 Results 1621 - 1630 of 2023
  • C-004270: The Thor Corpus
    The Thor Corpus is an Icelandic speech corpus based on a read bi-phonetically balanced text. It is 2 hours in length with 4000 utterances from 20 speakers.
    • references: JUPITER corpus
  • C-004271: The Jensson Corpus
    The Jensson Corpus is an Icelandic speech corpus based on a read bi-phonetically balanced text. The corpus is 3.8 hours in length with 5,612 utterances from 20 speakers. The read text is in the form of questions and contains words that were chosen with the aim of keeping the text as short as possible. All the speakers read the same text.
  • C-004272: The RÚV Corpus
    The RÚV Corpus is an Icelandic speech corpus based on a read bi-phonetically balanced text. The corpus is 46 minutes in length with 400 utterances from 20 speakers and contains read news items that include a large vocabulary. No two speakers read the same text.
  • C-004273: English Web Treebank
    *Introduction*

    English Web Treebank was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure and is designed to allow language technology researchers to develop and evaluate the robustness of parsing methods in those web domains.

    *Data*

    This release contains 254,830 word-level tokens and 16,624 sentence-level tokens of webtext in 1174 files annotated for sentence- and word-level tokenization, part-of-speech, and syntactic structure. The data is roughly evenly divided across five genres: weblogs, newsgroups, email, reviews, and question-answers. The files were manually annotated following the sentence-level tokenization guidelines for web text and the word-level tokenization guidelines developed for English treebanks in the DARPA GALE project. Only text from the subject line and message body of posts, articles, messages and question-answers was collected and annotated.

    Weblogs are interactive web sites that display content as discrete entries or posts and allow viewers to comment on entries and engage in discussions. They are typically managed by individuals and use informal or colloquial language. The weblog data in this release was collected by LDC and covers the period 2003-2006.

    Newsgroups are repositories of online discussions pertaining to a topic or interest area. They consist of threads that in turn contain articles with comments and discussion from group users. The newsgroup data in this release was collected by LDC and covers the period 2003-2006.

    Emails are messages sent to discrete individuals or well-defined groups via the TCP/IP Simple Mail Transfer Protocol (SMTP). The email messages in this corpus are a subset of emails sent by Enron Corporation employees during the period 1999-2002. Specifically, those messages are contained in the Enronsent Corpus, a collection of 96,107 email messages from the sent folders of Enron email users which were processed to remove any content not generated by human users.

    The reviews in this corpus were gleaned from online reviews of businesses and services on various Google web sites written by individuals. This information was provided to LDC by Google in 2011; the dates of individual reviews are not available.

    Question-answers are posts from Yahoo!'s community-driven question-answering web site, Yahoo! Answers, where individuals submit and answer questions which may be on any topic. This data was collected in 2011; the dates of individual question-answers were not collected.

  • C-004274: Enron Email Dataset
    Enron Email Dataset contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages.
  • C-004275: The EnronSent Corpus
    The EnronSent corpus is a special preparation of a portion of the Enron Email Dataset designed specifically for use in corpus linguistics and language analysis, containing 2,205,910 lines and 13,810,266 words in 45 plain text files.
  • C-004277: Audio-Visual Speech Recognition Evaluation Environment
    A common platform for independently evaluating speech recognition accuracy and speech interval detection in noisy environments. It provides an evaluation corpus for audio-visual speech recognition of continuously spoken single digits in Japanese. The digit sequence of each utterance and the pronunciation of Japanese digits are the same as in the CENSREC-1 (AURORA-2J) database. The corpus includes color and infrared mouth images, which were recorded simultaneously with the speech.
  • C-004279: X-ray Film database for speech research
    This corpus offers movies from high-quality x-ray films compiled on CAV laserdisk. Each speaker read phonetically contrastive sentences (about 30 sentences per speaker). The movies yield the best dynamic view of the entire vocal tract and the complex movements of the tongue.
  • C-004281: Priority Areas "Prosody and Speech Processing" Japanese MULTEXT Prosodic Corpus
    The Japanese version of Multilingual Text Tools and Corpora (MULTEXT). The speakers were asked to read aloud the 40 passages (each passage includes 5-6 sentences) in two speaking styles: the reading style and the spontaneous style (in which they were instructed to perform with different emotional attitudes according to the text of each situation).
  • C-004283: Chinese MULTEXT Corpus
    The Chinese version of Multilingual Text Tools and Corpora (MULTEXT). The speakers were asked to read aloud the 40 passages (each passage includes 5-6 sentences) as naturally as possible.