Registered language resources: 3330
Items 1561 - 1570 of 2023
-
C-004194: Corpus of American Soap Operas
The corpus contains 100 million words in more than 22,000 transcripts of ten American soap operas from 2001 to 2012. It offers insight into informal, colloquial American speech.
- hasVersion: C-003498: TIME CORPUS
- hasVersion: C-004192: Corpus of Contemporary American English
- hasVersion: C-004193: Corpus of Historical American English
- hasVersion: C-003501: Corpus del Español
- hasVersion: Corpus do Português
-
C-004195: ukWaC
ukWaC is a 2 billion word corpus built by crawling the Web, with the crawl limited to the .uk domain and seeded with medium-frequency words from the BNC. The corpus is POS-tagged and lemmatized.
- hasVersion: C-004196: deWaC
- hasVersion: C-004197: itWaC
- hasVersion: C-004198: frWaC
- isReferencedBy: C-004199: PukWaC
-
C-004196: deWaC
deWaC is a 1.7 billion word corpus built by crawling the Web, with the crawl limited to the .de domain and seeded with medium-frequency words from the SudDeutsche Zeitung corpus and basic German vocabulary lists. The corpus is POS-tagged and lemmatized.
- hasVersion: C-004198: frWaC
- hasVersion: C-004197: itWaC
- hasVersion: C-004195: ukWaC
- references: SudDeutsche Zeitung corpus
-
C-004197: itWaC
itWaC is a 2 billion word corpus built by crawling the Web, with the crawl limited to the .it domain and seeded with medium-frequency words from La Repubblica Corpus and basic Italian vocabulary lists. The corpus is POS-tagged and lemmatized.
- hasVersion: C-004195: ukWaC
- hasVersion: C-004196: deWaC
- hasVersion: C-004198: frWaC
- references: C-004203: La Repubblica Corpus
- references: Morph-it!
-
C-004198: frWaC
frWaC is a 1.6 billion word corpus built by crawling the Web, with the crawl limited to the .fr domain and seeded with medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists. The corpus is POS-tagged and lemmatized.
- hasVersion: C-004195: ukWaC
- hasVersion: C-004196: deWaC
- hasVersion: C-004197: itWaC
- references: Le Monde Diplomatique Text corpus
-
C-004199: PukWaC
PukWaC is a 2 billion word corpus built by syntactically annotating ukWaC. ukWaC was built by crawling the Web, with the crawl limited to the .uk domain and seeded with medium-frequency words from the BNC. The ukWaC corpus is POS-tagged and lemmatized.
- references: C-004195: ukWaC
- hasVersion: C-004201: WaCkypedia_EN
-
C-004200: SdeWaC
SdeWaC is a 0.88 billion word corpus derived from deWaC, a 1.7 billion word corpus built by crawling the Web with the crawl limited to the .de domain. In SdeWaC, duplicate sentences and some noise have been removed. SdeWaC is available in two versions: a POS-tagged and lemmatized version, and a one-sentence-per-line version.
- references: C-004196: deWaC
-
C-004201: WaCkypedia_EN
WaCkypedia_EN is a 2009 dump of the English Wikipedia (about 800 million tokens), annotated with POS, lemma and full dependency information. Dependency parsing was performed with MaltParser (http://maltparser.org/). The texts were extracted from the dump and cleaned using the Wikipedia extractor (http://medialab.di.unipi.it/wiki/Wikipedia_extractor).
- references: D-001674: Wikipedia
- hasVersion: C-004199: PukWaC
-
C-004202: Morph-it! Version 0.48
Morph-it! is a free morphological lexicon for the Italian language, containing 505,074 entries and 35,056 lemmas. Morph-it! can be used as a data source for a lemmatizer / morphological analyzer / morphological generator.
- isReferencedBy: C-004197: itWaC
-
C-004203: La Repubblica Corpus
A collection of newspaper texts. The texts are tokenized, POS-tagged (with TreeTagger trained on ad hoc resources), lemmatized (with Morph-it!) and categorized.
- references: La Repubblica