Language resource #: 3330
-
C-004204: Internet Argument Corpus
The Internet Argument Corpus (IAC) consists of discussions extracted from the online debate site 4forums.com. It currently consists of ~11,000 discussions, ~390,000 posts, and ~73,000,000 words from one site.
-
C-004205: Film Corpus
The film corpus consists of 862 film scripts from The Internet Movie Script Database (IMSDb) website (http://www.imsdb.com/), representing 7,400 characters. It contains a total of 664,000 lines of dialogue and 9,599,000 tokens. The corpus has been annotated with ontological features including genre, gender and director name, and linguistic features including number of sentences per turn, sentence polarity, overall polarity and verb strength.
-
C-004206: Uppsala PErsian Corpus
Uppsala PErsian Corpus (UPEC) is a large, freely available Persian corpus consisting of a total of 2,703,257 words, annotated with morpho-syntactic and partly semantic features. It was created from on-line material containing newspaper articles and common text on various topics (e.g. culture, technology, fiction, and art).
- references: C-004208: Bijankhan Corpus
- isReferencedBy: C-004207: Uppsala PErsian Dependency Treebank
-
C-004207: Uppsala PErsian Dependency Treebank
Uppsala PErsian Dependency Treebank (UPEDT) is a dependency-based syntactically annotated corpus which is currently under development. The treebank consists of 1,282 sentences (26,065 tokens) of written text in CoNLL-format. The treebank data is extracted from the open source, validated Uppsala PErsian Corpus (UPEC) created from on-line material containing newspaper articles and common text on various topics (e.g. culture, technology, fiction, and art).
- references: C-004206: Uppsala PErsian Corpus
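As a rough illustration of what "CoNLL-format" data looks like to a program, the following is a minimal sketch of a reader. It assumes the 10-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines between sentences; the entry above does not say which CoNLL variant the treebank uses, so the column positions are an assumption. The names `Token` and `read_conll` are hypothetical, and the sample rows are made-up placeholders, not data from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    index: int   # position of the token in its sentence (1-based)
    form: str    # surface word form
    head: int    # index of the syntactic head, 0 for the root
    deprel: str  # dependency relation to the head


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence (assumed CoNLL-X columns)."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        # Assumed positions: 0 = ID, 1 = FORM, 6 = HEAD, 7 = DEPREL.
        sentence.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if sentence:
        # Flush the last sentence if the input lacks a trailing blank line.
        yield sentence


# Made-up placeholder rows: two sentences, three tokens in total.
sample = (
    "1\tA\ta\tX\tX\t_\t2\tdet\t_\t_\n"
    "2\tB\tb\tX\tX\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tC\tc\tX\tX\t_\t0\troot\t_\t_\n"
)
sentences = list(read_conll(sample.splitlines()))
token_count = sum(len(s) for s in sentences)
```

Counting sentences and tokens this way is one way to check a download against the figures quoted in the description (1,282 sentences, 26,065 tokens), provided the files really follow the assumed layout.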
-
C-004208: Bijankhan Corpus
Bijankhan corpus is a tagged corpus suitable for natural language processing research on the Persian (Farsi) language. The texts were gathered from daily news and common texts and categorized into 4300 different subjects including politics and culture. The corpus contains about 2.6 million manually tagged words with a tag set of 40 Persian POS tags.
- isReferencedBy: C-004206: Uppsala PErsian Corpus
-
C-004209: Tehran English-Persian Parallel Corpus
TEP is a large-scale, sentence-aligned English-Persian parallel corpus. The texts in the corpus were extracted from movie subtitles. The corpus contains 4 million tokens in 612,086 sentences (1,600 subtitle lines) for each language.
-
C-004210: Prague Czech-English Dependency Treebank 2.0
*Introduction*
Prague Czech-English Dependency Treebank (PCEDT) 2.0 was developed by the Institute of Formal and Applied Linguistics at Charles University in Prague, Czech Republic. It is a corpus of Czech-English parallel resources translated, aligned and manually annotated for dependency structure, semantic labeling, argument structure, ellipsis and anaphora resolution. This release updates Prague Czech-English Dependency Treebank 1.0 (LDC2004T25) by adding English newswire texts so that it now contains over two million words in close to 100,000 sentences.
*Data*
The principal new material in PCEDT 2.0 is the inclusion of the entire Wall Street Journal data from Treebank-3 (LDC99T42). Not included from PCEDT 1.0 are the Reader's Digest material, the Czech monolingual corpus, and the English-Czech dictionary.
Each section is enhanced with a comprehensive manual linguistic annotation in the Prague Dependency Treebank style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:
* dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)
* semantic labeling of content words and types of coordinating structures
* argument structure, including an argument structure (valency) lexicon for both languages
* ellipsis and anaphora resolution
This annotation style is called tectogrammatical annotation, and it constitutes the tectogrammatical layer in the corpus.
Please consult the PCEDT website for more information and documentation.
*Samples*
Please follow this link for a sample of the data included.
*Updates*
None at this time.
- replaces: C-001079: Prague Czech-English Dependency Treebank 1.0
- references: C-001547: Treebank-3
-
C-004211: CD-毎日新聞2008データ集
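Full-text article data collection (tagged text data) for the Mainichi Shimbun's 2008 issues, covering the final morning and evening editions of the Tokyo and Osaka head offices.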
- isPartOf: C-004215: CD-毎日新聞2008データ集プラス
- hasVersion: C-000838: DCS-毎日新聞1991~2006データファイル
- hasVersion: C-001598: CD-Mainichi Shimbun '91 data collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-004212: CD-毎日新聞2009データ集
- hasVersion: C-004213: CD-毎日新聞2010データ集
- hasVersion: C-004214: CD-毎日新聞2011データ集
- hasVersion: C-003589: CD-毎日新聞2006データ集プラス
- hasVersion: C-003588: CD-毎日新聞2007データ集プラス
- hasVersion: C-004216: CD-毎日新聞2009データ集プラス
- hasVersion: C-004217: CD-毎日新聞2010データ集プラス
- hasVersion: C-004218: CD-毎日新聞2011データ集プラス
- hasVersion: C-003585: CD-毎日新聞2007データ集
- hasVersion: C-004390: CD-毎日新聞2012データ集
- hasVersion: C-004391: CD-毎日新聞2012データ集プラス
-
C-004212: CD-毎日新聞2009データ集
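Full-text article data collection (tagged text data) for the Mainichi Shimbun's 2009 issues, covering the final morning and evening editions of the Tokyo and Osaka head offices.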
- isPartOf: C-004216: CD-毎日新聞2009データ集プラス
- hasVersion: C-001598: CD-Mainichi Shimbun '91 data collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-000838: DCS-毎日新聞1991~2006データファイル
- hasVersion: C-004211: CD-毎日新聞2008データ集
- hasVersion: C-004213: CD-毎日新聞2010データ集
- hasVersion: C-004214: CD-毎日新聞2011データ集
- hasVersion: C-003589: CD-毎日新聞2006データ集プラス
- hasVersion: C-003588: CD-毎日新聞2007データ集プラス
- hasVersion: C-004215: CD-毎日新聞2008データ集プラス
- hasVersion: C-004217: CD-毎日新聞2010データ集プラス
- hasVersion: C-004218: CD-毎日新聞2011データ集プラス
- hasVersion: C-003585: CD-毎日新聞2007データ集
- hasVersion: C-004390: CD-毎日新聞2012データ集
- hasVersion: C-004391: CD-毎日新聞2012データ集プラス
-
C-004213: CD-毎日新聞2010データ集
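Full-text article data collection (tagged text data) for the Mainichi Shimbun's 2010 issues, covering the final morning and evening editions of the Tokyo and Osaka head offices.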
- isPartOf: C-004217: CD-毎日新聞2010データ集プラス
- hasVersion: C-001598: CD-Mainichi Shimbun '91 data collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-000838: DCS-毎日新聞1991~2006データファイル
- hasVersion: C-004211: CD-毎日新聞2008データ集
- hasVersion: C-004212: CD-毎日新聞2009データ集
- hasVersion: C-004214: CD-毎日新聞2011データ集
- hasVersion: C-003589: CD-毎日新聞2006データ集プラス
- hasVersion: C-003588: CD-毎日新聞2007データ集プラス
- hasVersion: C-004215: CD-毎日新聞2008データ集プラス
- hasVersion: C-004216: CD-毎日新聞2009データ集プラス
- hasVersion: C-004218: CD-毎日新聞2011データ集プラス
- hasVersion: C-003585: CD-毎日新聞2007データ集
- hasVersion: C-004390: CD-毎日新聞2012データ集
- hasVersion: C-004391: CD-毎日新聞2012データ集プラス