Registered language resources: 3,330. Showing items 301-310 of 2,023.
  • C-000613: Arabic Gigaword
    *Introduction*

    Arabic Gigaword was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003T12 and ISBN 1-58563-271-6. It is a comprehensive archive of newswire text data acquired from Arabic news sources by the LDC at the University of Pennsylvania.

    Four distinct sources of Arabic newswire are represented here:

    * Agence France Presse (afa)
    * Al Hayat News Agency (alh)
    * Al Nahar News Agency (ann)
    * Xinhua News Agency (xin)

    Much of the AFP content in this collection has been published previously by the LDC in Arabic Newswire Part 1 (LDC2001T55) and some of this content has also been included in an Arabic supplement to TDT3 and as the Arabic component of TDT4. TDT4 also included a four month sample from Al Hayat and An Nahar (October 2000 - January 2001). Apart from that, all of the Al Hayat, An Nahar and Xinhua Arabic content, as well as AFP content for 2001-2002, is being released here for the first time.

    *Data*

    There are 319 files, totalling approximately 1.1GB in compressed form (4348 MB uncompressed, and 391619 Kwords).

    The table below gives, for each source of the data: the number of files; Gzip-MB, the total compressed file size; Totl-MB, the total uncompressed file size (approximately 4.3 gigabytes overall); K-wrds, the number of space-separated tokens in the text, in thousands, excluding SGML tags; and #DOCs, the number of DOCs.

    Source  #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
    AFA        104      274     1091   94484   516855
    ALH         95      431     1535  139501   305250
    ANN         96      415     1530  140247   327768
    XIA         24       47      192   17387   106846
    TOTAL      319     1167     4348  391619  1256719

    All text files in this corpus have been converted to UTF-8 character encoding.

    Owing to the use of UTF-8, the SGML tagging within each file shows up as lines of single-byte-per-character (ASCII) text, whereas lines of actual text data, including article headlines and datelines, contain a mixture of single-byte and multi-byte characters. In general, single-byte characters in the text data will consist of digits and punctuation marks (where the original source relied on ASCII punctuation codes, rather than Arabic-specific punctuation), whereas multi-byte characters consist of Arabic letters and a small number of special punctuation or other symbols. This variable-width character encoding is intrinsic to UTF-8, and all UTF-8 capable processes will handle the data appropriately.

    Each data file name consists of the three-letter prefix, followed by a six-digit date (representing the year and month during which the file contents were generated by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). So, each file contains all the usable data received by LDC for the given month from the given news source.
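
    The naming scheme described above can be parsed mechanically. The sketch below is a minimal Python illustration, not part of the corpus tooling. It assumes lowercase prefixes and a YYYYMM digit order (the description says only "year and month"), and the file name `afa200101.gz` is a hypothetical example.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: three lowercase letters, six digits (read as YYYYMM), ".gz".
NAME_RE = re.compile(r"^([a-z]{3})(\d{4})(\d{2})\.gz$")


class CorpusFile(NamedTuple):
    source: str  # three-letter source prefix
    year: int
    month: int


def parse_name(name: str) -> Optional[CorpusFile]:
    """Split a data file name into prefix, year and month; None if it does not fit."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    source, year, month = m.group(1), int(m.group(2)), int(m.group(3))
    if not 1 <= month <= 12:
        return None  # six digits that cannot be a year-month pair
    return CorpusFile(source, year, month)


print(parse_name("afa200101.gz"))  # hypothetical file name
```

    A file matched this way could then be opened as text with Python's standard `gzip.open(path, "rt", encoding="utf-8")`.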

    All text data are presented in SGML form, using a very simple, minimal markup structure. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using the DTD file provided in the publication.

    Unlike older corpora, the present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs).

    All sources have received a uniform treatment in terms of quality control, and have been categorized into three distinct "types":

    * story: this type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences
    * multi: this type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ... (some general area like finance or sports)," and so on
    * other: these DOCs clearly do not fall into any of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on

    The general strategy for categorizing DOCs into these three classes was, for each source, to discover the most common and frequent clues in the text stream that correlated with the "non-story" types. When none of the known clues was in evidence, the DOC was classified as a "story."

    Previous "Gigaword" corpora (in English and Chinese) had a fourth category, "advis" (for "advisory"), which applied to DOCs that contain text intended solely for news service editors, not the news-reading public. In preparing the Arabic data, the task of determining patterns for assigning "non-story" type labels was carried out by a native speaker of Arabic. For whatever reason, this person did not find the "advis" category to be applicable to any of the data.

    *Updates*

    This edition of Arabic Gigaword has been superseded by a new edition, LDC2006T02.
  • C-000614: Arabic News Translation Text Part 1
    *Introduction*

    Arabic News Translation Text Part 1 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T17 and ISBN 1-58563-307-0.

    To support the development of automatic machine translation systems, the LDC was sponsored to solicit English translations for a single set of Arabic source materials. The source Arabic text was selected and translated in different LDC projects during the time period of November 2002 to February 2004. A total of about 441K Arabic words were selected from three sources, namely Xinhua, AFP, and An Nahar, and translation services were provided by eight translation agencies who translated each Arabic news story once.

    The Xinhua and An Nahar stories and their translations were created for TIDES Machine Translation, while the AFP stories and their English translations were created for TIDES TDT. The development of all these translations followed roughly the same guidelines and procedures.

    *Data*

    Three sources of journalistic Arabic text were selected to provide the Arabic material:

    - AFP News Service: 250 news stories, October 1998 - December 1998
    - Xinhua News Service: 670 news stories, November 2001 - March 2002
    - An Nahar: 606 news stories, October 2001 to December 2002

    (total: 1,526 stories)

    The overall count of Arabic words by source is shown in the following table:

    AFP          44,193
    Xinhua       99,514
    An Nahar    297,533
    -------------------
    total       441,240

    The Arabic data contain 441K words; the English translations contain approximately 581K words in total, of which 25K are unique.

    Each translation team was provided with translation guidelines. In accordance with the guidelines, each translation team was asked to return the first five stories for quality checking in each project. This was to ensure that each translation team had indeed understood and was following the guidelines, and the translation quality was acceptable. The LDC sent the translations back to the translation team for any deviations from the guidelines or any quality issues detected. Subsequent translation submissions were continuously monitored for conformance and quality. Once the full set of translations was complete, a final pass of reformatting and validation was carried out, to assure alignability of segments, and to convert the translated texts into SGML format. An Arabic-English bilingual LDC employee went through all the source data and English translations, and fixed any problems that had been found.

    For the present release, the corpus content is organized into source and translation directories, containing 1,526 files in source and 1,526 files in translation, one news story per file.
    • references: Xiaoyi Ma, Dalal Zakhary, and Moussa Bamba. 2004. Arabic News Translation Text Part 1. Linguistic Data Consortium, Philadelphia.
  • C-000615: Arabic Newswire Part 1
    *Introduction*

    This publication contains the Arabic Newswire A Corpus, Linguistic Data Consortium (LDC) catalog number LDC2001T55 and ISBN 1-58563-190-6. The Arabic Newswire Corpus is composed of articles from the Agence France Presse (AFP) Arabic Newswire. The source material was tagged using TIPSTER-style SGML and was transcoded to Unicode (UTF-8). The corpus includes articles from May 13, 1994 to December 20, 2000.

    *Data*

    The data is in 2,337 compressed (zipped) Arabic text data files. There are 209 MB of compressed data (869 MB uncompressed) with approximately 383,872 documents containing 76 million tokens over approximately 666,094 unique words.

    A template of the tagging is presented below.

    yyyymmdd_AFP_ARB.dddd
    Arabic Text
    Arabic Text
    One or More Paragraphs of Arabic Text
    Arabic Text
    Arabic Text

    For a sample file of tagged articles, please see this sample.

    *Updates*

    There are no updates at this time.
    • references: David Graff and Kevin Walker. 2001. Arabic Newswire Part 1. Linguistic Data Consortium, Philadelphia.
  • C-000616: Arabic Treebank: Part 1 - 10K-word English Translation
    *Introduction*

    Arabic Treebank: Part 1 - 10K-word English Translation was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003T07 and ISBN 1-58563-262-7. The purpose of this corpus of 10K Arabic words translated into English is to support the development of data-driven approaches to natural language processing, machine translation, human language technologies, cross-lingual information retrieval, and other forms of linguistic research on Modern Standard Arabic in general.

    *Data*

    The project targets the translation of a written Modern Standard Arabic corpus from the Agence France Presse (AFP) newswire archives for July 2000 (the files are dated 20000715*). The corpus consists of 49 source stories, which are a subset of the 734 stories published in Arabic Treebank: Part 1 v 2.0. These 49 source files consist of 418 paragraphs and 9,981 words.

    The source data and the translations are stored in SGML format. The files have been validated using the DTD provided in the corpus. Please follow these links for an example of an Arabic source file and the English translation.

    The stories have been translated at paragraph level and verified/corrected by different annotators. In general, the translation between Arabic and English has been aligned at sentence-to-sentence level. However, we noticed that an Arabic sentence can be translated into multiple English sentences (16 occurrences), and two Arabic sentences can be translated into a single English sentence (two occurrences). For 18 paragraphs out of the total of 418 in the corpus, only paragraph-to-paragraph alignment is provided.

    *Updates*

    There are no updates available at this time.
  • C-000617: Arabic Treebank: Part 1 v 2.0
    *Introduction*

    Arabic Treebank: Part 1 v 2.0 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003T06 and ISBN 1-58563-261-9. This publication is part one of a corpus of one million words of Arabic Treebank, designed to support language research and development of language technology for Modern Standard Arabic.

    *Data*

    The Penn Arabic Treebank, which is part of the DARPA TIDES project, started in the Fall of 2001 with the objective of performing human and computer annotations of a large Arabic machine-readable text corpus (for project background please see POStest.html). As in previous Penn Treebanks, two different kinds of information need to be produced by two different (human and computer) processes. The Arabic Treebank project consists therefore of two distinct phases:

    * Part-of-Speech (POS) tagging - divides the text into lexical tokens, and gives relevant information about each token such as lexical category, inflectional features, and a gloss
    * Arabic Treebanking (ArabicTB) - characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.
    Both tasks started in November 2001 with an initial pilot consisting of 734 files representing roughly 166K words of written Modern Standard Arabic newswire from the Agence France Presse corpus.

    The target of this publication is to provide a description of a written Modern Standard Arabic text corpus. The source data consists of Agence France Presse (AFP) newswire, spanning from July through November of 2000. This publication includes 734 stories representing 140,265 words (168,123 tokens after clitic segmentation in the Treebank).

    *Updates*

    There are no updates available at this time.
  • C-000618: Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
    *Introduction*

    To support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and other forms of linguistic research on Modern Standard Arabic in general, the LDC was sponsored to develop an Arabic Treebank of 1,000,000 words. This corpus is a re-release of part one of that project, with the addition in Version 3.0 of improved morphological/part-of-speech annotation (including full vocalization and case endings).

    *Data*

    The project targets the description of a written Modern Standard Arabic corpus from the Agence France Presse (AFP) newswire archives for July-November 2000 (files dated 2000/7/15 to 2000/11/15). This corpus includes 734 stories representing 145,386 words (166,068 tokens after clitic segmentation in the Treebank; the number of Arabic tokens is 123,796). For this work, annotators must be native speakers of Arabic, and they must understand enough linguistics to check morphosyntactic analysis and build syntactic structures.

    *Samples*

    To see an example of this corpus, please examine the following samples:

    * Source
    * POS
    * Treebank
    • references: Mohamed Maamouri (project head), et al. 2005. Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis). Linguistic Data Consortium, Philadelphia.
  • C-000619: Arabic Treebank: Part 2 v 2.0
    *Introduction*

    Arabic Treebank: Part 2 v 2.0 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T02 and ISBN 1-58563-282-1.

    This publication is the second part of a corpus of 1,000,000 words of Arabic Treebank, designed to support language research and development of language technology for Modern Standard Arabic. Part one was released in 2003 as Arabic Treebank: Part 1 v 2.0, with source data extracted from Agence France Presse stories. The current Arabic Treebank: Part 2 v 2.0 corpus consists of stories from Al-Hayat distributed by Ummah.

    *Data*

    This corpus includes 501 stories from the Ummah Arabic News Text. There are a total of 144,199 words (counting non-Arabic tokens such as numbers and punctuation) in the 501 files - one story per file. New features of annotation include complete vocalization (including case endings), lemma IDs, and more specific POS tags for verbs and particles.

    The corpus contains 125,698 Arabic-only word tokens (prior to the separation of clitics), of which 124,740 (99.24%) were provided with an acceptable morphological analysis and POS tag by the morphological parser, and 958 (0.76%) were items that the morphological parser failed to analyze correctly.
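
    The percentages quoted above follow directly from the token counts. A quick Python check, using only the figures stated in this entry (the helper name `share` is introduced here for illustration):

```python
def share(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


total = 125_698     # Arabic-only word tokens, before clitic separation
analyzed = 124_740  # tokens given an acceptable morphological analysis and POS tag
failed = 958        # tokens the morphological parser failed to analyze

# The two groups account for every token.
assert analyzed + failed == total

print(share(analyzed, total), share(failed, total))
```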

    *Updates*

    There are no updates available at this time.
  • C-000620: Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
    *Introduction*

    This file contains documentation on the Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis), Linguistic Data Consortium (LDC) catalog number LDC2005T20 and ISBN 1-58563-341-0.

    The goal of the Arabic Treebank project is to support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and other forms of linguistic research on Modern Standard Arabic in general. The LDC was sponsored to develop an Arabic POS and Treebank of 1,000,000 words, and this corpus is part three of that project. In this release, we provide both syntactic (treebank) annotation and annotation on part of speech (POS), gloss, and word segmentation.

    Treebanks are language resources that provide annotations of natural languages at various levels of structure: at the word level, the phrase level, and the sentence level. Treebanks have become crucially important for the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and other forms of linguistic research in general.

    This corpus is designed for those who study and use languages either professionally or academically, and who need text corpora in their work. The Penn Arabic Treebank is particularly suitable for language developers, computational linguists and computer scientists who are interested in various aspects of natural language processing.

    The Penn Arabic Treebank, which was part of the DARPA TIDES project, started in the Fall of 2001 with the objective of annotating a large machine-readable Arabic text corpus both manually and automatically. As in previous Penn Treebanks, two different kinds of information need to be produced by two different (human and computer) processes. The Arabic Treebank project consists therefore of two distinct phases: (a) Part-of-Speech (=POS) tagging, which divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, and a gloss (referred to as POS for convenience, although it includes morphological and gloss information not traditionally included with part-of-speech annotation), and (b) Arabic Treebanking (=ArabicTB), which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.

    Both tasks started in November 2001 with an initial pilot consisting of 734 files representing roughly 166K words of written Modern Standard Arabic newswire from the Agence France Presse corpus, which has since been released as "Arabic Treebank: Part 1 v 3.0," LDC Catalog No. LDC2005T02. The second part was released as the 168K-word corpus "Arabic Treebank: Part 2 v 2.0," LDC Catalog No. LDC2004T02.

    The current Arabic Treebank: Part 3 corpus consists of 600 stories from the An Nahar News Agency. This corpus is also referred to as ANNAHAR. The new features include complete vocalization of all Imperfect Verb mood endings: Indicative, Subjunctive, and Jussive.

    The POS only annotation of this ANNAHAR corpus was released in 2004 under the catalog number LDC2004T11 (Arabic Treebank: Part 3 v 1.0). In addition to the treebank annotation, this release (i.e., Arabic Treebank: Part 3 v 2.0) also includes the POS annotation in LDC2004T11.

    *Samples*

    The POS and treebank samples below provide an example of the data contained in this corpus:

    * POS
    * Treebank
  • C-000621: Arabic Treebank: Part 3 v 1.0
    *Introduction*

    Arabic Treebank: Part 3 v 1.0 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T11 and ISBN 1-58563-298-8.

    This publication is the third part of a corpus of 1,000,000 words of Arabic Treebank, designed to support language research and development of language technology for Modern Standard Arabic. Part one was released in 2003 as Arabic Treebank: Part 1 v 2.0, with source data extracted from Agence France Presse stories. Part two was released in 2004 as Arabic Treebank: Part 2 v 2.0, with source data extracted from Al-Hayat distributed by Ummah. The current Arabic Treebank: Part 3 v 1.0 corpus consists of stories from An Nahar News Agency.

    *Data*

    This corpus includes 600 stories from the An Nahar News Text. There are a total of 340,281 words (counting non-Arabic tokens such as numbers and punctuation) in the 600 files - one story per file. New features of annotation include complete vocalization (including case endings), lemma IDs, and more specific POS tags for verbs and particles.

    The corpus contains 293,035 Arabic-only word tokens (prior to the separation of clitics), of which 290,842 (99.25%) were provided with an acceptable morphological analysis and POS tag by the morphological parser, and 2,193 (0.75%) were items that the morphological parser failed to analyze correctly.

    *Samples*

    Please view the following samples:

    * sgm Sample
    * xml Sample
    * txt Sample

    *Updates*

    There are no updates available at this time.
  • C-000622: Arabic Treebank: Part 4 v 1.0 (MPG Annotation)
    *Introduction*

    This file contains documentation on the Arabic Treebank: Part 4 v 1.0 (MPG Annotation), Linguistic Data Consortium (LDC) catalog number LDC2005T30 and ISBN 1-58563-343-7.

    The goal of the Arabic Treebank project is to support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and other forms of linguistic research on Modern Standard Arabic in general. The LDC was sponsored to develop an Arabic POS and Treebank of 1,000,000 words, and this corpus is the fourth part of that project. In this release, we provide annotation on part of speech (POS), gloss, and word segmentation.

    *Samples*

    To view an example of this corpus, please review this sample POS file.