-
C-001691: Wikimedia Commons
Wikimedia Commons (English homepage) is a central repository for free images, music, sound & video clips, and spoken texts. It was launched on 7 September 2004.
The Commons prevents duplication of uploaded media by providing a single place where media can be uploaded for use by all Wikimedia Foundation projects. For example, one image uploaded to the Commons can be embedded on pages in every language edition of Wikipedia.
The Commons is also multilingual. That is, multiple languages exist on the same page. This can be done since most "articles" are images with short descriptions.
- isReferencedBy: D-001674: Wikipedia
- isReferencedBy: C-001690: Wikiquote
- isReferencedBy: D-001696: Wiktionary
- isReferencedBy: C-001675: Wikibooks
- isReferencedBy: C-001688: Wikisource
- isReferencedBy: C-001687: Wikinews
- isReferencedBy: C-001697: Wikiversity
- isReferencedBy: C-001689: Wikispecies
- isReferencedBy: Mediawiki
- isReferencedBy: N-001698: Meta-Wiki
-
C-001692: Mandarin Chinese News Text
The Linguistic Data Consortium (LDC) announces the availability of a Mandarin Chinese text corpus. This corpus includes about 250 million GB-encoded text characters. The Mandarin News Corpus includes text from various journalistic sources:
* newspaper text from Renmin Ribao (People's Daily)
* radio scripts from China Radio International
* newswire text from Xinhua newswire service
The format of this corpus uses a labeled bracketing, expressed in the style of SGML (Standard Generalized Markup Language). The header fields provided by the sources, which give information such as topic, date and article ID, have been retained. The articles cover a variety of topics, including international and domestic news, sports and culture.
-
C-001693: Reuters Corpus, Volume 1
RCV1 is a large collection of about 810,000 English-language Reuters news stories. It is significantly larger than the older Reuters-21578 collection, which is heavily used in the text classification community.
- hasVersion: C-001694: Reuters Corpus, Volume 2
-
C-001694: Reuters Corpus, Volume 2
RCV2 is a large multilingual collection (thirteen languages) of over 487,000 Reuters news stories.
- hasVersion: C-001693: Reuters Corpus, Volume 1
-
C-001695: 1996 CSR HUB4 Language Model
*Introduction*
This corpus contains data from transcribed news broadcasts, designated for use in the baseline language model (LM) for the 1996 CSR HUB4 Evaluation.
*Data*
The LDC obtained the bulk of the data from broadcast news CD-ROMs produced by Primary Source Media, Inc. This portion covers the period from January 1992 to April 1996 and contains approximately one gigabyte of uncompressed data. This release also includes about 36 megabytes of material received on floppy disks covering the period from late May through June 1996, in a somewhat different format from the bulk of the data.
The text data are presented in two forms: (1) a relatively unprocessed ("raw" or "sentence-tagged") form and (2) a fully processed ("conditioned," "verbalized-punctuation") form. The "raw" form includes the header and footer information accompanying the articles, such as network, show name, headline, copyright, credits and so forth; the text and ancillary data are presented in a fairly consistent (though simple) SGML format. The "processed" form contains only the text content of the articles, together with SGML tags to mark the boundaries of articles, paragraphs and sentences; the text content has been modified by replacing numeric strings (dates, dollar amounts, quantities) with orthographic strings (e.g. "nineteen ninety six"), replacing abbreviations ("Inc.," "Ltd.," "Corp.," etc.) with corresponding full-word forms and replacing punctuation characters with corresponding word tokens (e.g. "," becomes "COMMA"). This release also includes an archive of the tools used to create the "processed" form of the data.
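The conditioning steps described above can be illustrated with a minimal sketch. This is not the tool archive shipped with the release: the lookup tables, the function names, and every punctuation token other than "COMMA" are assumptions made for illustration, and only four-digit years from 1000 to 1999 are verbalized.

```python
import re

# Hypothetical lookup tables; the actual LDC tools are not reproduced here.
ABBREVIATIONS = {"Inc.": "Incorporated", "Ltd.": "Limited", "Corp.": "Corporation"}
# Only "COMMA" is attested in the description; the other tokens are assumed.
PUNCTUATION = {",": "COMMA", ".": "PERIOD", "?": "QUESTION-MARK"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def verbalize_year(match):
    """Spell a four-digit year as two pairs, e.g. 1996 -> nineteen ninety six."""
    year = match.group(0)
    high, low = int(year[:2]), int(year[2:])
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"  # assumed reading
    return f"{two_digits(high)} {two_digits(low)}"


def condition(sentence):
    """Return a verbalized-punctuation version of one sentence."""
    # Expand abbreviations first, while their trailing period is still present.
    for abbr, full in ABBREVIATIONS.items():
        sentence = sentence.replace(abbr, full)
    # Replace numeric strings (here: years 1000-1999 only) with words.
    sentence = re.sub(r"\b1[0-9]{3}\b", verbalize_year, sentence)
    # Replace punctuation characters with word tokens.
    for mark, token in PUNCTUATION.items():
        sentence = sentence.replace(mark, f" {token} ")
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join(sentence.split())
```

Calling condition("Acme Inc., founded in 1996, grew.") on this sketch yields "Acme Incorporated COMMA founded in nineteen ninety six COMMA grew PERIOD". Abbreviations are expanded before punctuation is tokenized so that the period in "Inc." is not turned into a separate token.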
*Updates*
There are no updates at this time.
*Pricing*
The Reduced Licensing Fee for this corpus is US$200.
- isReferencedBy: Robert MacIntyre. 1998. 1996 CSR HUB4 Language Model. Linguistic Data Consortium, Philadelphia.
-
C-001697: Wikiversity
A sister project to Wiktionary that aims to create a community for the creation and use of free learning materials and activities. Wikiversity is a multidimensional social organization dedicated to learning, teaching, research and service. This Wikiversity is written in English. Started in August 2006, it currently contains 5,756 articles. You may read and edit courses in many different languages.
- isVersionOf: D-001674: Wikipedia
- hasVersion: C-001690: Wikiquote
- hasVersion: Wiktionary
- hasVersion: C-001675: Wikibooks
- hasVersion: C-001688: Wikisource
- hasVersion: C-001691: Wikimedia Commons
- hasVersion: C-001687: Wikinews
- hasVersion: Mediawiki
- hasVersion: N-001698: Meta-Wiki
-
C-001699: GENIA Corpus Version 3.02
GENIA Technical Term Corpus is a semantically annotated corpus of abstracts taken from the National Library of Medicine's MEDLINE database. Semantic classification is applied to a subset of the substances and the biological locations involved in reactions of proteins, based on a data model (GENIA ontology) of the biological domain, and is encoded in XML format (GPML). In some cases, "GENIA Corpus" simply refers to "GENIA Technical Term Corpus". The base abstracts are selected from the search results with keywords (MeSH terms) Human, Blood Cells, and Transcription Factors.
- hasVersion: GENIAcorpus3.02p
- hasVersion: C-001702: GENIA Treebank Beta
- replaces: GENIA Corpus Version 1.1
- isPartOf: C-001701: GENIA Corpus
- isReferencedBy: A Tagged Corpus for the Life Sciences: Design and Construction of the GENIA Corpus (in Japanese) (http://www-tsujii.is.s.u-tokyo.ac.jp/~okap/papers/NLP2005_S2-7.pdf)
-
C-001700: GENIA corpus 3.02p
GENIA corpus 3.02p is an annotated corpus of abstracts taken from the National Library of Medicine's MEDLINE database, tagged for part of speech using the PennTreeBank POS tag set. The base abstracts are the same set of texts as GENIA Corpus ver3.02, selected from the search results with keywords (MeSH terms) Human, Blood Cells, and Transcription Factors.
- hasVersion: C-001702: GENIA Treebank Beta
- hasVersion: C-001699: GENIA Corpus Version 3.02
- isPartOf: C-001701: GENIA Corpus
- isReferencedBy: A Tagged Corpus for the Life Sciences: Design and Construction of the GENIA Corpus (in Japanese) (http://www-tsujii.is.s.u-tokyo.ac.jp/~okap/papers/NLP2005_S2-7.pdf)
- isReferencedBy: Tateisi, Yuka and Jun'ichi Tsujii. (2004). Part-of-Speech Annotation of Biology Research Abstracts. In the Proceedings of 4th International Conference on Language Resource and Evaluation (LREC2004). IV. pp. 1267-1270.(http://www-tsujii.is.s.u-tokyo.ac.jp/%7Eyucca/papers/LREC2004-528.pdf)
- conformsTo: C-001546: Treebank-2
-
C-001701: GENIA Corpus
GENIA Corpus is a series of corpora of abstracts taken from the National Library of Medicine's MEDLINE database. In some cases, "GENIA Corpus" simply refers to "GENIA Technical Term Corpus (GENIA Corpus Version 3.02)." It consists of three corpora: a semantically annotated technical term corpus, a part-of-speech corpus, and a syntactically annotated corpus. Semantic classification annotations are applied to a subset of the substances and the biological locations involved in reactions of proteins, based on a data model (GENIA ontology) of the biological domain, and are encoded in XML format (GPML). Part-of-speech and syntactic annotations follow the PennTreeBank specification. The base abstracts are selected from the search results with keywords (MeSH terms) Human, Blood Cells, and Transcription Factors.
- hasPart: C-001702: GENIA Treebank Beta
- hasPart: C-001699: GENIA Corpus Version 3.02
- hasPart: GENIA Corpus Version 3.02p
- isReferencedBy: A Tagged Corpus for the Life Sciences: Design and Construction of the GENIA Corpus (in Japanese) (http://www-tsujii.is.s.u-tokyo.ac.jp/~okap/papers/NLP2005_S2-7.pdf)
- isReferencedBy: Tateisi, Yuka and Jun'ichi Tsujii. (2004). Part-of-Speech Annotation of Biology Research Abstracts. In the Proceedings of 4th International Conference on Language Resource and Evaluation (LREC2004). IV. pp. 1267-1270.(http://www-tsujii.is.s.u-tokyo.ac.jp/%7Eyucca/papers/LREC2004-528.pdf)
- conformsTo: C-001546: Treebank-2
- isReferencedBy: C-004347: BioProp Version 1.0
-
C-001702: GENIA Treebank Beta
GENIA Treebank Beta is a syntactically bracketed corpus of abstracts taken from National Library of Medicine's MEDLINE database and tagged in (almost) PennTreeBank style. The base abstracts are selected from the search results with keywords (MeSH terms) Human, Blood Cells, and Transcription Factors.
- hasVersion: C-001699: GENIA Corpus Version 3.02
- hasVersion: GENIA Corpus Version 3.02p
- isPartOf: C-001701: GENIA Corpus
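- isReferencedBy: A Tagged Corpus for the Life Sciences: Design and Construction of the GENIA Corpus (in Japanese) (http://www-tsujii.is.s.u-tokyo.ac.jp/~okap/papers/NLP2005_S2-7.pdf)
- isReferencedBy: Knowledge Extraction and Systematization from Genome Texts, Tomoko Ohta, Graduate School of Information Science and Technology, The University of Tokyo (in Japanese) (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/WS/PDFfiles/Ohta-Symp.pdf)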
- conformsTo: C-001546: Treebank-2