Language Resource Search - SHACHI: Language Resource Metadata Database

Number of registered language resources: 3330. Showing items 1621 - 1630 of 2023.


C-004270: The Thor Corpus
The Thor Corpus is an Icelandic speech corpus based on a read bi-phonetically balanced text. It is 2 hours in length with 4000 utterances from 20 speakers.
- references: JUPITER corpus
C-004271: The Jensson Corpus
The Jensson Corpus is an Icelandic speech corpus based on a read bi-phonetically balanced text. The corpus is 3.8 hours in length with 5,612 utterances from 20 speakers. The read text is in the form of questions and contains words that were chosen with the aim of keeping the text as short as possible. All the speakers read the same text.
C-004272: The RÚV Corpus
The RÚV Corpus is an Icelandic speech corpus based on a read bi-phonetically balanced text. The corpus is 46 minutes in length with 400 utterances from 20 speakers and contains read news items that include a large vocabulary. No two speakers read the same text.
C-004273: English Web Treebank
*Introduction*

English Web Treebank was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure and is designed to allow language technology researchers to develop and evaluate the robustness of parsing methods in those web domains.

*Data*

This release contains 254,830 word-level tokens and 16,624 sentence-level tokens of webtext in 1174 files annotated for sentence- and word-level tokenization, part-of-speech, and syntactic structure. The data is roughly evenly divided across five genres: weblogs, newsgroups, email, reviews, and question-answers. The files were manually annotated following the sentence-level tokenization guidelines for web text and the word-level tokenization guidelines developed for English treebanks in the DARPA GALE project. Only text from the subject line and message body of posts, articles, messages and question-answers was collected and annotated.
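As a rough sanity check on the figures in the paragraph above, the averages they imply can be worked out directly. The sketch below uses only the counts quoted in this entry; the helper name `per_unit` is illustrative, and the per-genre figure assumes an exactly even split, which the description only states approximately.

```python
# Counts quoted in the Data paragraph of this entry.
WORD_TOKENS = 254_830
SENTENCES = 16_624
FILES = 1_174
GENRES = 5  # weblogs, newsgroups, email, reviews, question-answers


def per_unit(total: int, units: int) -> float:
    """Average of `total` spread over `units`, rounded to two decimal places."""
    return round(total / units, 2)


avg_tokens_per_sentence = per_unit(WORD_TOKENS, SENTENCES)
avg_tokens_per_file = per_unit(WORD_TOKENS, FILES)
# Assumption: an exactly even split; the entry only says "roughly evenly divided".
approx_tokens_per_genre = WORD_TOKENS // GENRES

print(f"tokens per sentence: {avg_tokens_per_sentence}")
print(f"tokens per file:     {avg_tokens_per_file}")
print(f"tokens per genre:    about {approx_tokens_per_genre}")
```

These are averages only; the actual sentence and file lengths in the release will vary by genre.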

Weblogs are interactive web sites that display content as discrete entries or posts and allow viewers to comment on entries and engage in discussions. They are typically managed by individuals and use informal or colloquial language. The weblog data in this release was collected by LDC and covers the period 2003-2006.

Newsgroups are repositories of online discussions pertaining to a topic or interest area. They consist of threads that in turn contain articles with comments and discussion from group users. The newsgroup data in this release was collected by LDC and covers the period 2003-2006.

Emails are messages sent to discrete individuals or well-defined groups via the TCP/IP Simple Mail Transfer Protocol (SMTP). The email messages in this corpus are a subset of emails sent by Enron Corporation employees during the period 1999-2002. Specifically, those messages are contained in the EnronSent Corpus, a collection of 96,107 email messages from the sent folders of Enron email users which were processed to remove any content not generated by human users.

The reviews in this corpus were gleaned from online reviews of businesses and services on various Google web sites written by individuals. This information was provided to LDC by Google in 2011; the dates of individual reviews are not available.

Question-answers are posts from Yahoo!'s community-driven question-answering web site, Yahoo! Answers, where individuals submit and answer questions which may be on any topic. This data was collected in 2011; the dates of individual question-answers were not collected.

- references: C-004275: The EnronSent Corpus
C-004274: Enron Email Dataset
Enron Email Dataset contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages.
- isReferencedBy: C-004275: The EnronSent Corpus
C-004275: The EnronSent Corpus
The EnronSent corpus is a special preparation of a portion of the Enron Email Dataset designed specifically for use in corpus linguistics and language analysis, containing 2,205,910 lines and 13,810,266 words in 45 plain text files.
- references: C-004274: Enron Email Dataset
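C-004277: Multimodal Speech Recognition Evaluation Environment (マルチモーダル音声認識評価環境)
Data for bimodal speech recognition using audio and lip video. The utterance content conforms to CENSREC-1. Color video and near-infrared video were recorded together with the audio, and the data includes images obtained by splitting the movies into time-series frames and cropping only the area around the lips.
C-004279: X-Ray Film Database for Speech Research (音声研究用Ｘ線フィルムデータベース) (X-Ray)
High-speed X-ray film recordings that clearly visualize the movements of the vocal tract and tongue during speech. Each speaker reads aloud a different list of about 30 short sentences (one speaker's list also includes nonsense words).
C-004281: Priority Areas Research "Prosody and Speech Processing" Japanese MULTEXT Prosodic Corpus (特定領域研究「韻律と音声処理」日本語MULTEXT韻律コーパス)
The Japanese version of Multilingual Text Tools and Corpora (MULTEXT), which was created in Europe. Forty passages, each consisting of 5 to 6 sentences, were recorded in two speaking styles: read speech and speech acted as if with emotion (simulated spontaneous speech).
- references: C-000959: MULTEXT JOC Corpus
- isReferencedBy: C-004283: Chinese MULTEXT Corpus (中国語MULTEXTコーパス)
C-004283: Chinese MULTEXT Corpus (中国語MULTEXTコーパス)
The Chinese version of Multilingual Text Tools and Corpora (MULTEXT), which was created in Europe. Forty passages, each consisting of 5 to 6 sentences, were recorded with speakers instructed to speak as naturally as possible.
- references: C-000959: MULTEXT JOC Corpus
- references: C-004281: Priority Areas Research "Prosody and Speech Processing" Japanese MULTEXT Prosodic Corpus (特定領域研究「韻律と音声処理」日本語MULTEXT韻律コーパス)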
- hasVersion: C-000959: MULTEXT JOC Corpus
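The `references`, `isReferencedBy`, and `hasVersion` lines attached to the entries on this page form a small link graph between resource IDs. The sketch below is one possible way to hold those links and check whether each `references` link to an entry shown on this page has a matching `isReferencedBy` line on the target. The triples are copied from this page; the function names are hypothetical, and the JUPITER corpus reference under C-004270 is left out because the page gives no ID for it. A reported gap only reflects what this page displays, not necessarily the underlying database.

```python
from collections import defaultdict

# (source ID, relation, target ID) triples copied from the entries on this page.
RELATIONS = [
    ("C-004273", "references", "C-004275"),
    ("C-004274", "isReferencedBy", "C-004275"),
    ("C-004275", "references", "C-004274"),
    ("C-004281", "references", "C-000959"),
    ("C-004281", "isReferencedBy", "C-004283"),
    ("C-004283", "references", "C-000959"),
    ("C-004283", "references", "C-004281"),
    ("C-004283", "hasVersion", "C-000959"),
]

# IDs of the ten entries displayed on this page.
LISTED_IDS = {
    "C-004270", "C-004271", "C-004272", "C-004273", "C-004274",
    "C-004275", "C-004277", "C-004279", "C-004281", "C-004283",
}


def build_index(triples):
    """Map (source, relation) to the set of target IDs."""
    index = defaultdict(set)
    for source, relation, target in triples:
        index[(source, relation)].add(target)
    return index


def missing_inverses(triples, listed_ids):
    """Return (source, target) pairs where source references a listed target
    but the target shows no isReferencedBy line pointing back to source."""
    index = build_index(triples)
    missing = []
    for source, relation, target in triples:
        if relation != "references" or target not in listed_ids:
            continue
        if source not in index.get((target, "isReferencedBy"), set()):
            missing.append((source, target))
    return missing


gaps = missing_inverses(RELATIONS, LISTED_IDS)
for source, target in gaps:
    print(f"{source} references {target}, but no inverse link is shown")
```

Targets that are not displayed on this page, such as C-000959, are skipped because their own relation lines are not visible here.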
