Language resource #: 3330
-
C-004364: Korean Newswire Second Edition
*Introduction*
Korean Newswire Second Edition, Linguistic Data Consortium (LDC) catalog number LDC2010T19 and ISBN 1-58563-564-2, is an archive of Korean newswire text acquired at LDC from the Korean Press Agency over several years (1994-2009). This release includes all of the content of Korean Newswire LDC2000T45 (June 1994-March 2000) as well as newly collected data.
*New in the Second Edition*
The second edition contains all data collected by LDC from April 2000 through December 2009.
All material, including that from the first release, has been converted to UTF-8 (except for more recent data already in UTF-8 format) and processed into LDC's Gigaword format. The Gigaword format classifies newswire content into three types: story, multi and other. Story refers to an article containing information pertaining to a particular event on a day; multi refers to an article that contains more than one story relating to different topics; and other refers to articles containing lists, tables or numerical data, such as sports scores.
A word break error in the original release and in data collected from January 2002 through February 2005 has been corrected in the second edition with the result that all Korean text should display correctly. The error involved a line break in the middle of a word with the result that an affected word appeared in segments in two lines. This problem was resolved using word histograms and a few common rules based on heuristics from the data and has yielded a 90% - 95% word break correction rate. Further information about the word break correction procedure is available in Word_Break_Correction_Procedure.txt.
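The histogram-based idea described above can be illustrated with a small sketch. This is not the procedure documented in Word_Break_Correction_Procedure.txt; the function names, the `min_count` threshold and the merge rule itself are assumptions made for illustration only. The sketch rejoins the last token of one line with the first token of the next when the joined form is well attested in a word histogram.

```python
from collections import Counter


def build_histogram(tokens):
    """Count how often each space-separated token occurs in reference text."""
    return Counter(tokens)


def repair_line_breaks(lines, histogram, min_count=2):
    """Rejoin words that were split across a line break.

    Hypothetical rule (an assumption, not LDC's documented procedure):
    merge the last token of a line with the first token of the next line
    when the joined token occurs at least `min_count` times in the
    histogram and more often than the rarer of its two fragments.
    """
    repaired = []  # list of token lists, one per output line
    for line in lines:
        tokens = line.split()
        if repaired and repaired[-1] and tokens:
            left, right = repaired[-1][-1], tokens[0]
            joined = left + right
            rarer_fragment = min(histogram[left], histogram[right])
            if histogram[joined] >= min_count and histogram[joined] > rarer_fragment:
                repaired[-1][-1] = joined  # restore the whole word
                tokens = tokens[1:]        # drop the fragment from this line
        repaired.append(tokens)
    return [" ".join(tokens) for tokens in repaired]


if __name__ == "__main__":
    # Toy histogram: the joined form is frequent, its fragments are rare.
    hist = Counter({"한국어": 5, "한국": 1, "뉴스": 3})
    print(repair_line_breaks(["뉴스 한국", "어 뉴스"], hist))
```

A rule of this kind only repairs breaks whose joined form is already frequent elsewhere in the data, which is one plausible reason a histogram-based approach would reach a high but not complete correction rate.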
The following table shows, for each Gigaword classification, the number of documents in the classification (# DOCS), the number of space-separated word tokens in the text in thousands (K-WORDS) and the uncompressed file size in kilobytes (TextKB):
TYPE     # DOCS   K-WORDS   TextKB
story    217052     37546   371722
multi        31        21      239
other      7318      1034     8375
*Data*
The directory structure of the corpus is as follows:
.
|-common_files
|---docs
|---dtd
|-kor_nw_p1v2
|---data
data: This directory contains the corpus files. Each file contains data collected during one month. For example, the file kpa_kor_199406 contains data collected in June 1994. Each document in a file has a fixed SGML structure governed by a DTD; consult the DTD for more information regarding the SGML structure of a single article. Not all articles have information in all the tag fields. The DTD mandates that every article must have a DOC tag and a BODY tag. The HEADLINE, DATELINE and P tags are optional. Within the units, tagging is kept to a minimum, typically consisting only of tags to mark paragraph boundaries.
The unique KPA_KOR_yyyymmdd.nnnn string in the DOC tag is interpreted as follows:
yyyy = Year
mm = Month
dd = Day
nnnn = Sequence Number
For all articles that share the same yyyymmdd docid string, the nnnn substring ensures that the docid is unique in the corpus.
docs: Contains corpus documentation.
dtd: Contains the DTD for the corpus.
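The docid convention described above can be sketched as a small parser. This is a minimal illustration, not part of the corpus distribution: the names `parse_docid` and `monthly_filename` are hypothetical, the example docid is made up, and the sketch assumes that the sequence number is always exactly four digits and that a document's date falls in the month named by its file.

```python
import re
from datetime import date
from typing import NamedTuple

# Pattern for KPA_KOR_yyyymmdd.nnnn (assumes a four-digit sequence number).
DOCID_RE = re.compile(r"KPA_KOR_(\d{4})(\d{2})(\d{2})\.(\d{4})")


class DocId(NamedTuple):
    day: date       # yyyy, mm, dd
    sequence: int   # nnnn, unique among documents sharing the same date


def parse_docid(docid: str) -> DocId:
    """Split a docid string into its date and sequence number."""
    match = DOCID_RE.fullmatch(docid)
    if match is None:
        raise ValueError(f"not a KPA_KOR docid: {docid!r}")
    yyyy, mm, dd, nnnn = match.groups()
    # date() raises ValueError for impossible dates such as month 13.
    return DocId(date(int(yyyy), int(mm), int(dd)), int(nnnn))


def monthly_filename(docid: str) -> str:
    """Name of the monthly file expected to hold this document (assumption)."""
    day = parse_docid(docid).day
    return f"kpa_kor_{day.year:04d}{day.month:02d}"


if __name__ == "__main__":
    example = "KPA_KOR_19940615.0042"  # hypothetical docid for illustration
    print(parse_docid(example))
    print(monthly_filename(example))
```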
*Samples*
For an example of the data in this corpus, please review this sample file.
*Updates*
Additional information, updates and bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T19.
- replaces: C-001040: Korean Newswire
-
C-004365: MASC I
The MASC (Manually Annotated Sub-Corpus) is a collection of manually annotated written materials and spoken transcripts of American English from a broad range of genres. MASC I, the first release of the MASC data, contains about 82000 words and is the most heavily annotated portion, covering several linguistic phenomena including POS, shallow parse and treebank-style syntax.
- hasVersion: C-004366: MASC II
- hasVersion: C-004367: MASC III
- hasPart: C-004369: MINI-MASC
- isPartOf: C-004368: Full MASC
- hasPart: C-004371: MASC-PROPBANK-ORIG
- hasPart: C-004370: MASC-CONLL
- references: C-003405: American National Corpus (ANC) Second Release
- references: C-004372: Language Understanding Annotation Corpus
- conformsTo: Penn Treebank
- conformsTo: T-000767: FrameNet
- conformsTo: D-000825: WordNet
- conformsTo: Propbank
- conformsTo: Timebank
-
C-004366: MASC II
The MASC (Manually Annotated Sub-Corpus) is a collection of manually annotated written materials and spoken transcripts of American English from a broad range of genres. MASC II, the second part of the MASC data, contains about 120000 words of data annotated for tokens, parts of speech, sentence boundaries, noun chunks, verb chunks, named entities and FrameNet, plus WordNet sense annotations. The current download version (as of January 12, 2013) does not seem to contain annotations.
- hasVersion: C-004365: MASC I
- hasVersion: C-004367: MASC III
- isPartOf: C-004368: Full MASC
- references: C-003405: American National Corpus (ANC) Second Release
- conformsTo: Penn Treebank
- conformsTo: T-000767: FrameNet
- conformsTo: D-000825: WordNet
- conformsTo: Propbank
- conformsTo: Timebank
-
C-004367: MASC III
The MASC (Manually Annotated Sub-Corpus) is a collection of manually annotated written materials and spoken transcripts of American English from a broad range of genres. MASC III, the third part of the MASC data, contains about 280000 words, rounding out the genre distribution in the entire MASC corpus. MASC III adds genres such as blogs and tweets to those in MASC I and II. The current download version (as of January 12, 2013) does not seem to contain annotations.
- hasVersion: C-004365: MASC I
- hasVersion: C-004366: MASC II
- isPartOf: C-004368: Full MASC
- references: C-003405: American National Corpus (ANC) Second Release
- conformsTo: Penn Treebank
- conformsTo: T-000767: FrameNet
- conformsTo: D-000825: WordNet
- conformsTo: Propbank
- conformsTo: Timebank
-
C-004368: Full MASC
Full MASC is a 504,299-word collection of manually annotated written and spoken texts of American English from a broad range of genres, from news to more colloquial ones such as Twitter. All the texts are being manually annotated or validated for a variety of linguistic phenomena including tokens, parts of speech, sentence boundaries, shallow parsing (noun chunk, verb chunk) and named entities (person, location, organization, date).
- hasPart: C-004365: MASC I
- hasPart: C-004366: MASC II
- hasPart: C-004367: MASC III
- references: C-003405: American National Corpus (ANC) Second Release
- references: C-004372: Language Understanding Annotation Corpus
- conformsTo: Penn Treebank
- conformsTo: T-000767: FrameNet
- conformsTo: D-000825: WordNet
- conformsTo: Propbank
- conformsTo: Timebank
-
C-004369: MINI-MASC
The MASC (Manually Annotated Sub-Corpus) is a collection of manually annotated written materials and spoken transcripts of American English from a broad range of genres. MINI-MASC contains a selection of four written and four spoken files, each roughly 500 words in length, from the data in MASC I. MASC I, the first release of the MASC data, contains about 82000 words and is the most heavily annotated portion, covering several linguistic phenomena including POS, shallow parse and treebank-style syntax.
- isPartOf: C-004365: MASC I
-
C-004370: MASC-CONLL
The MASC (Manually Annotated Sub-Corpus) is a collection of manually annotated written materials and spoken transcripts of American English from a broad range of genres. MASC-CONLL is a subset of MASC I, the first release of the MASC data, with annotations for Penn Treebank syntactic dependencies and semantic dependencies from NomBank and PropBank in CONLL IOB format.
- isPartOf: C-004365: MASC I
- conformsTo: Penn Treebank
- conformsTo: Propbank
- conformsTo: C-004373: NomBank.1.0
- hasVersion: C-004371: MASC-PROPBANK-ORIG
-
C-004371: MASC-PROPBANK-ORIG
The MASC (Manually Annotated Sub-Corpus) is a collection of manually annotated written materials and spoken transcripts of American English from a broad range of genres. MASC-PROPBANK-ORIG is a subset of MASC I, the first release of the MASC data, with PropBank annotations (annotations for verb propositions and their arguments) in their original format, together with the Penn Treebank annotations upon which they rely.
- isPartOf: C-004365: MASC I
- conformsTo: Penn Treebank
- conformsTo: Propbank
- hasVersion: C-004370: MASC-CONLL
-
C-004372: Language Understanding Annotation Corpus
*Introduction*
The Language Understanding Annotation Corpus, Linguistic Data Consortium (LDC) catalog number LDC2009T10 and ISBN 1-58563-513-8, emerged from a series of interdisciplinary meetings on semantics and pragmatics hosted by the Human Language Technology Center of Excellence at Johns Hopkins University. The participants were researchers from BBN Technologies, Carnegie Mellon University and Columbia University who were developing representations of text semantics, machine translation and summarization systems. The resulting corpus contains over 9000 words of English text (6949 words) and Arabic text (2183 words) annotated for committed belief, event and entity coreference, dialog acts and temporal relations. The source materials were chosen from various genres to represent "informal input," that is, text that contains colloquial forms. The documents in the corpus include excerpts from newswire stories, telephone conversation transcripts, emails, contracts and written instructions.
The problem was modeled as an extended exercise in extracting information elements from a "document" (that is, from discrete language records in written or spoken forms). The goal was to answer two broad questions:
* What are the elements of knowledge that can be derived from a document?
* Can the representation, and hence, the annotation, be laid out in terms of iterative layers, the accumulation of which would represent the sum of the knowledge?
The annotations attempted to resolve these questions in the following ways:
* Belief/Opinion/Confidence. Committed belief annotation distinguishes between statements which assert belief or opinion, those which contain speculation, and statements which convey facts or otherwise do not convey belief. The goal is to be able to determine automatically from a given text what beliefs can be ascribed to the author and with what strength the author holds those beliefs.
* Dialog Acts. Dialog act annotation seeks to determine the forward and backward links between pairs of dialog acts.
* Coreference (entities and events). Event coreferences indicate which events are related to other events at the document level. Entity relations within these related events provide further information about e.g., the main actors, targets and causes of the events.
* Temporal relations. Temporal annotations mark the temporal relationships between the different events and time anchors mentioned in a document; that is, they highlight what the text says about the timeline of time mentions.
- isReferencedBy: C-004365: MASC I
-
C-004373: NomBank.1.0
NomBank is an annotation project at New York University related to the PropBank project at the University of Pennsylvania. The goal of the project is to mark the sets of arguments that co-occur with nouns in the PropBank Corpus, just as PropBank records such information for verbs. The NomBank 1.0 release includes all the markable nouns in the Penn Treebank II Wall Street Journal corpus.
- references: C-001546: Treebank-2
- references: Propbank