Number of registered language resources: 3,330. Showing items 1681 - 1690 of 2,023 search results.
  • C-004353: GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2
    *Introduction*

    GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2 was prepared by LDC and contains 223,000 characters (98 files) of Chinese newsgroup text and its translation selected from twenty-one sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums. This release was used as training data in Phase 1 (year 1) of the DARPA-funded GALE program.

    *Source Data*

    Preparing the source data involved four stages of work: data scouting, data harvesting, formatting, and data selection.

    Data scouting involved manually searching the web for suitable newsgroup text. Data scouts were assigned particular topics and genres along with a production target in order to focus their web search. Formal annotation guidelines and a customized annotation toolkit helped data scouts to manage the search process and to track progress.

    Data scouts logged their decisions about potential text of interest (sites, threads and posts) to a database. A nightly process queried the annotation database and harvested all designated URLs. Whenever possible, the entire site was downloaded, not just the individual thread or post located by the data scout.

    Once the text was downloaded, its format was standardized (by running various scripts) so that the data could be more easily integrated into downstream annotation processes. Original-format versions of each document were also preserved. Typically, a new script was required for each new domain name that was identified. After scripts were run, an optional manual process corrected any remaining formatting problems.

    The selected documents were then reviewed for content-suitability using a semi-automatic process. A statistical approach was used to rank a document's relevance to a set of already-selected documents labeled as "good." An annotator then reviewed the list of relevance-ranked documents and selected those which were suitable for a particular annotation task or for annotation in general. These newly-judged documents in turn provided additional input for the generation of new ranked lists.
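    The relevance-ranking step can be sketched as a bag-of-words similarity against the centroid of the already-selected "good" documents. This is an illustrative sketch only; the description above does not specify which statistical model LDC actually used.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(good_docs, candidates):
    """Rank candidate documents by similarity to the centroid of
    documents already judged 'good' (most relevant first)."""
    centroid = Counter()
    for doc in good_docs:
        centroid.update(doc.split())
    scored = [(cosine(Counter(c.split()), centroid), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda p: p[0], reverse=True)]
```

    An annotator would then review the top of the ranked list and accept or reject each document, with accepted documents feeding back into the "good" set.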

    Manual sentence unit/segment (SU) annotation was also performed on a subset of files following LDC's Quick Rich Transcription specification. Three types of end-of-sentence SU were identified: statement SU, question SU and incomplete SU.

    *Translation*

    After files were selected, they were reformatted into a human-readable translation format, and the files were then assigned to professional translators for careful translation. Translators followed GALE Translation guidelines which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features (such as names and speech disfluencies), and quality control procedures applied to completed translations.

    *TDF Format*

    All final data are in Tab Delimited Format (TDF). TDF is compatible with other transcription formats, such as the Transcriber format and AG format, and it is easy to process.

    Each line of a TDF file corresponds to a speech segment and contains 13 tab delimited fields:

    #   field           data_type
    1   file            unicode
    2   channel         int
    3   start           float
    4   end             float
    5   speaker         unicode
    6   speakerType     unicode
    7   speakerDialect  unicode
    8   transcript      unicode
    9   section         int
    10  turn            int
    11  segment         int
    12  sectionType     unicode
    13  suType          unicode

    A source TDF file and its translation are the same except that the transcript in the source TDF is replaced by its English translation.
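    A TDF line can be read with a few lines of Python. This is a minimal sketch: the field order follows the 13-field layout described above, and the sample line with its placeholder values is invented for illustration.

```python
# Field names follow the 13-column TDF layout described above.
TDF_FIELDS = [
    "file", "channel", "start", "end", "speaker", "speakerType",
    "speakerDialect", "transcript", "section", "turn", "segment",
    "sectionType", "suType",
]

def parse_tdf_line(line):
    """Split one TDF line on tabs and pair the values with field names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(TDF_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(TDF_FIELDS), len(values)))
    return dict(zip(TDF_FIELDS, values))

# Invented sample line; for newsgroup text the channel/start/end
# fields hold placeholder values.
sample = ("post_001\t-1\t0.0\t0.0\tposter1\tna\tna\t"
          "example transcript\t1\t1\t1\tbody\tstatement")
record = parse_tdf_line(sample)
```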

    Some fields are inapplicable to newsgroup text. Those include the channel, start time, end time and speaker dialect fields. These fields are either empty or contain placeholder values.

    *Encoding*

    All data are encoded in UTF-8.

    *Sponsorship*

    This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

    *Samples*

    * Source
    * Translation
  • C-004354: Czech Broadcast Conversation Speech
    *Introduction*

    Czech Broadcast Conversation Speech was prepared by researchers at the University of West Bohemia, Pilsen, Czech Republic, and consists of 40 hours of speech recorded from Czech Radio 1 in 2003. Transcripts corresponding to the audio files in this corpus are provided in Czech Broadcast Conversation MDE Transcripts (LDC2009T20). These corpora join LDC's other Czech broadcast data sets: Czech Broadcast News Speech (LDC2004S01), Czech Broadcast News Transcripts (LDC2004T01), Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89), and Voice of America (VOA) Czech Broadcast News Transcripts (LDC2000T53).

    Czech Broadcast Conversation Speech consists of 72 single channel recordings of Radioforum, a live talk program broadcast by Czech Radio 1 (CRo1) every weekday evening. Its format consists of invited guests (most often politicians) spontaneously answering topical questions posed by one or two interviewers. The number of interviewees in a single program varies from one to three, but typically, one interviewer and two interviewees appear in the program. The material includes passages of interactive dialogue, but longer stretches of monologue-like speech comprise the majority of the collected data. Radioforum also has an interactive segment where listeners call the studio and ask their own questions. That telephony speech was not transcribed in the current release.

    *Data*

    Individual recordings range from 27 minutes to 36 minutes each. The recordings were collected during the period from February 12, 2003 through June 26, 2003. The signal is mono, sampled at 22.05 kHz with 16-bit resolution, and stored in Windows PCM waveform format. The names of the audio files refer to the broadcast date (rfYYMMDD.wav).
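    The stated signal parameters (mono, 16-bit, 22.05 kHz PCM WAV) can be verified per file with Python's standard `wave` module. A minimal sketch:

```python
import wave

def check_radioforum_wav(path):
    """Verify a recording matches the stated corpus specification:
    mono (1 channel), 16-bit samples (2 bytes), 22,050 Hz."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1
                and w.getsampwidth() == 2
                and w.getframerate() == 22050)
```

    For example, `check_radioforum_wav("rf030212.wav")` would check the file for the February 12, 2003 broadcast (a hypothetical path; the corpus's directory layout is not described here).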

    The table below contains details about the audio files and the transcripts:

    Number of shows                   72
    Number of word tokens             292.6k
    Number of unique words            30.5k
    Duration of transcribed speech    33.0h
    Total number of speakers          128
    Male speakers                     108
    Female speakers                   20

    *Samples*

    * Speech

    *Sponsorship*

    The completion of this corpus was facilitated by funding provided by the Ministry of Education of the Czech Republic under projects No. ME909 and 2C06020.
  • C-004355: Czech Broadcast Conversation MDE Transcripts
    *Introduction*

    Czech Broadcast Conversation MDE Transcripts, Linguistic Data Consortium (LDC) catalog number LDC2009T20 and ISBN 1-58563-520-0, was prepared by researchers at the University of West Bohemia, Pilsen, Czech Republic, and consists of approximately 33 hours of transcribed speech from Radioforum, a talk show broadcast on Czech Radio 1. The audio files corresponding to the transcripts in this corpus are contained in Czech Broadcast Conversation Speech (LDC2009S02). These corpora join LDC's other Czech broadcast data sets: Czech Broadcast News Speech (LDC2004S01), Czech Broadcast News Transcripts (LDC2004T01), Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89), and Voice of America (VOA) Czech Broadcast News Transcripts (LDC2000T53).

    Czech Broadcast Conversation Speech consists of 72 single channel recordings of Radioforum, a live talk program broadcast by Czech Radio 1 (CRo1) every weekday evening. A total of 40 hours of recordings were collected during the period from February 12, 2003 through June 26, 2003. Individual recordings range from 27 minutes to 36 minutes each. Radioforum's format consists of invited guests (most often politicians) spontaneously answering topical questions posed by one or two interviewers. The number of interviewees in a single program varies from one to three, but typically, one interviewer and two interviewees appear in the program. The material includes passages of interactive dialogue, but longer stretches of monologue-like speech comprise the majority of the collected data. Radioforum also has an interactive segment where listeners call the studio and ask their own questions. That telephony speech was not transcribed in the current release.

    Czech Broadcast Conversation MDE Transcripts was created to extend Metadata Extraction (MDE) research to conversational Czech. The goal of MDE is to take raw speech recognition output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: removing non-content words like filled pauses and discourse markers from the text; removing sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation and standardized spelling, plus sensible conventions for representing speaker turns and identity are further elements in the readable transcript.

    The transcripts and annotations in this corpus are stored in three different formats: TRS (Transcriber - http://trans.sourceforge.net), QAn (Quick Annotator - http://www.mde.zcu.cz/qan.html), and RTTM. TRS represents a standard speech transcript. QAn and RTTM contain essentially identical information about structural metadata (MDE); the main difference between them is formatting. Character encoding in all files is ISO-8859-2.

    All filenames have the form rfYYMMDD.format where "rf" stands for Radioforum, the following six digits indicate the date of broadcast, and the extension ".format" corresponds to the data format of the particular file ".trs", ".qan", or ".rttm".
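    The naming convention can be unpacked mechanically; a small sketch (the example filename is hypothetical, and the two-digit year is expanded to 20YY since all broadcasts in this corpus are from 2003):

```python
import re
from datetime import date

FNAME_RE = re.compile(r"^rf(\d{2})(\d{2})(\d{2})\.(trs|qan|rttm)$")

def parse_filename(name):
    """Return (broadcast date, data format) for an rfYYMMDD.format name."""
    m = FNAME_RE.match(name)
    if m is None:
        raise ValueError("unexpected filename: " + name)
    yy, mm, dd, fmt = m.groups()
    return date(2000 + int(yy), int(mm), int(dd)), fmt
```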

    More information can be found on the website Structural Metadata Annotation for Czech.

    *Data*

    The radio programs recorded for this corpus were transcribed with two purposes. First, in order to produce precise time-aligned verbatim transcripts of the audio recordings, manual transcripts were created using guidelines based on those employed in Czech Broadcast News Transcripts (LDC2004T01). Second, the transcripts were annotated with MDE markup to provide structural information about the conversations.

    Manual time-aligned verbatim transcription

    The original guidelines for time-aligned verbatim transcription used for the Czech broadcast news data were adjusted to better accommodate the specifics of the recorded broadcast conversation. Those revised guidelines instructed annotators how to deal with the following phenomena, among others:

    * Speaker turns: a corresponding time stamp and speaker ID are inserted every time there is a speaker change in the audio.
    * Turn-internal breakpoints: to break up long turns, breakpoints roughly corresponding to 'sentence' boundaries within a speaker turn are inserted.
    * Overlapping speech: an overlapping speech region is recognized when more than one speaker talks simultaneously; within this region, each speaker's speech is transcribed separately (if intelligible).
    * Background noises: [NOISE] tags are used to mark noticeable background noises.
    * Speaker noises: speaker-produced noises are identified with one of the following tags: [BREATH], [COUGH], [LAUGH], [LIP-SMACK].
    * Filled pauses: filled pauses produced by a speaker to indicate hesitation or to maintain control of a conversation are transcribed either as [EE-HESITATION] or as [MM-HESITATION], based on their pronunciation.
    * Interjections: certain interjections typically used as back channels or to express speaker's agreement or disagreement are transcribed using the [HM] (agreement) and [MH] (disagreement) tags.
    * Unintelligible speech: regions of unintelligible speech are marked with a special symbol.
    * Numbers: all numerals are transcribed as complete words.
    * Mispronounced words: mispronounced words (reading errors, slips of the tongue) are transcribed in the spelling corresponding to their pronunciation in the audio (i.e., the incorrect pronunciation is represented) and marked with a special symbol.
    * Word fragments: the pronounced part of the word is transcribed, and a single dash is used to indicate the point at which the word was broken off.
    * Punctuation: standard punctuation (limited to commas, periods, and question marks) is used to enhance transcript readability.
    Because the verbatim transcripts were created by a large number of annotators, they were manually revised for maximum correctness and consistency.
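    As a small illustration of how the bracketed tags above might be handled downstream, here is a sketch that strips them to recover plain text. The tag inventory comes from the guidelines above; the sample line is invented, and markers for unintelligible or mispronounced speech are not covered.

```python
import re

# Bracketed tags used in the verbatim transcripts, per the guidelines above.
TAG_RE = re.compile(r"\[(?:NOISE|BREATH|COUGH|LAUGH|LIP-SMACK|"
                    r"EE-HESITATION|MM-HESITATION|HM|MH)\]")

def strip_annotation_tags(line):
    """Remove bracketed annotation tags and collapse leftover whitespace."""
    return re.sub(r"\s+", " ", TAG_RE.sub("", line)).strip()
```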

    MDE annotation

    MDE is an annotation task which marks Edit Disfluencies (repetitions, revisions, restarts and complex disfluencies), Fillers (including, e.g., filled pauses and discourse markers) and SUs, or syntactic/semantic units. Originally, the structural MDE annotation standard was defined for English. When developing structural metadata annotation guidelines for Czech, the guidelines developed by LDC for English were followed to the extent possible. Language-dependent modifications were made based on the description of the syntax of Czech compound and complex sentences. MDE annotation marks the following phenomena:

    * Edit Disfluencies: Edit disfluencies, or speech repairs, occur when speakers correct or alter their utterances or abandon them entirely and start over.
    * Fillers: While the term filler has traditionally been synonymous with filled pause, SimpleMDE uses the term to encompass a broad set of vocalized space-fillers: filled pauses (FPs), discourse markers (DMs), explicit editing terms (EETs) and asides/parentheticals (A/Ps).
    * Sentence-like units: One of the goals of MDE annotation is the identification of all units within the discourse that function to express a complete thought or idea on the part of the speaker. Within MDE these elements are called SUs (Syntactic, Semantic or Slash Units).
    Corpus Statistics

    The table below contains details about the audio files and the transcripts:

    Number of shows                   72
    Number of word tokens             292.6k
    Number of unique words            30.5k
    Duration of transcribed speech    33.0h
    Total number of speakers          128
    Male speakers                     108
    Female speakers                   20

    *Samples*

    Czech Broadcast Conversation MDE Transcripts employs three transcription formats. A sample of each is included below.

    * TRS (Transcriber): provides a basic transcription of the speech.
    * QAN (Quick Annotator): the annotation format used to provide structural metadata.
    * RTTM: provides structural metadata in a format similar to EARS MDE.

    *Sponsorship*

    The completion of this corpus was facilitated by funding provided by the Ministry of Education of the Czech Republic under projects No. ME909 and 2C06020.
  • C-004356: GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
    *Introduction:*

    GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1, Linguistic Data Consortium (LDC) catalog number LDC2009T02 and ISBN 1-58563-499-9, contains transcripts and English translations of 20.4 hours of Chinese broadcast conversation programming from China Central TV (CCTV) and Phoenix TV. It does not contain the audio files from which the transcripts and translations were generated. GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1, along with other corpora, was used as training data in year 1 (Phase 1) of the DARPA-funded GALE program.

    *Source Data:*

    A total of 20.4 hours of Chinese broadcast conversation programming were selected from two sources: CCTV (a broadcaster from Mainland China), and Phoenix TV (a Hong Kong-based satellite TV station). The transcripts and translations represent recordings of eight different programs.

    A manual selection procedure was used to choose data appropriate for the GALE program, namely conversation (talk) programs focusing on current events. Stories on topics such as sports, entertainment and business were excluded from the data set. The following table is a summary of the files included in this release.

    Source      Program               Epoch (YYYY.MM)    #hours  #characters
    CCTV        Across China          2005.08             1.0      9,924
                Today's Focus         2005.11             2.2     33,805
    Phoenix TV  Asian Journal         2005.09             2.2     26,656
                Behind the Headlines  2005.03 - 2005.11   1.5     17,933
                A Date With Lu Yu     2005.09 - 2005.10   7.1     89,987
                News Hacker           2005.03 - 2005.10   2.3     39,388
                Newsline              2005.10 - 2005.11   1.6     15,496
                Social Watch          2005.09 - 2005.11   2.5     29,159

    *Transcription:*

    The selected audio snippets were carefully transcribed by LDC annotators and professional transcription agencies following LDC's Quick Rich Transcription specification. Manual sentence unit/segment (SU) annotation was also performed as part of the transcription task. Three types of end-of-sentence SU are identified:

    * statement SU
    * question SU
    * incomplete SU

    *Translation:*

    After transcription and SU annotation, files were reformatted into a human-readable translation format and assigned to professional translators for careful translation. Translators followed LDC's GALE Translation guidelines which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features (such as names and speech disfluencies) and quality control procedures applied to completed translations.

    *TDF Format:*

    All final data are in Tab Delimited Format (TDF). TDF is compatible with other transcription formats, such as the Transcriber format and AG format, and it is easy to process.

    Each line of a TDF file corresponds to a speech segment and contains 13 tab delimited fields:

    #   Field           Data Type
    1   file            unicode
    2   channel         int
    3   start           float
    4   end             float
    5   speaker         unicode
    6   speakerType     unicode
    7   speakerDialect  unicode
    8   transcript      unicode
    9   section         int
    10  turn            int
    11  segment         int
    12  sectionType     unicode
    13  suType          unicode

    A source TDF file and its translation are the same except that the transcript in the source TDF is replaced by its English translation.

    *Sponsorship*

    This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

    *Samples*

    For an example of the data in this corpus, please examine these images of the source and translation.
  • C-004357: GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
    *Introduction*

    GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2, Linguistic Data Consortium (LDC) catalog number LDC2009T06 and ISBN 1-58563-503-0, contains transcripts and English translations of 24 hours of Chinese broadcast conversation programming from China Central TV (CCTV), Phoenix TV and Voice of America (VOA). It does not contain the audio files from which the transcripts and translations were generated. This release, along with other corpora, was used as training data in Phase 1 (year 1) of the DARPA-funded GALE program. GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1 was released in January 2009.

    *Source Data*

    A total of 24 hours of Chinese broadcast conversation programming was selected from three sources: CCTV (a broadcaster from Mainland China), Phoenix TV (a Hong Kong-based satellite TV station) and VOA. The transcripts and translations represent recordings of seven different programs.

    A manual selection procedure was used to choose data appropriate for the GALE program, namely, conversation (talk) programs focusing on current events. Stories on topics such as sports, entertainment and business were excluded from the data set. The following table is a summary of the files included in this release.

    Source      Program               Epoch (YYYY.MM)    #hours  #characters
    CCTV        Today's Focus         2006.01             2.1     36,398
    Phoenix TV  Asian Journal         2005.09 - 2005.11   2.5     27,876
                Behind the Headlines  2006.01             1.5      7,170
                A Date with Lu Yu     2005.10 - 2006.01  10.9    129,782
                News Hacker           2005.10 - 2005.11   3.0     43,070
                Passion on China      2005.10             2.0     21,482
    VOA         News08                2005.03             2.0     28,621

    *Transcription*

    The selected audio snippets were carefully transcribed by LDC annotators and professional transcription agencies following LDC's Quick Rich Transcription specification. Manual sentence unit/segment (SU) annotation was also performed as part of the transcription task. Three types of end-of-sentence SU are identified:

    * statement SU
    * question SU
    * incomplete SU

    *Translation*

    After transcription and SU annotation, files were reformatted into a human-readable translation format and assigned to professional translators for careful translation. Translators followed LDC's GALE Translation guidelines which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features (such as names and speech disfluencies) and quality control procedures applied to completed translations.

    *TDF Format*

    All final data are in Tab Delimited Format (TDF). TDF is compatible with other transcription formats, such as the Transcriber format and AG format, and it is easy to process.

    Each line of a TDF file corresponds to a speech segment and contains 13 tab delimited fields:

    #   field           data_type
    1   file            unicode
    2   channel         int
    3   start           float
    4   end             float
    5   speaker         unicode
    6   speakerType     unicode
    7   speakerDialect  unicode
    8   transcript      unicode
    9   section         int
    10  turn            int
    11  segment         int
    12  sectionType     unicode
    13  suType          unicode

    A source TDF file and its translation are the same except that the transcript in the source TDF is replaced by its English translation.

    *Encoding*

    All data are encoded in UTF-8.

    *Sponsorship*

    This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

    *Samples*

    For an example of the data in this corpus, please examine these screen captures of the text data:

    * source
    * translation
  • C-004358: ACE 2005 Mandarin SpatialML Annotations
    *Introduction*

    ACE 2005 Mandarin SpatialML Annotations was developed by researchers at The MITRE Corporation (MITRE). ACE 2005 Mandarin SpatialML Annotations applies SpatialML tags to a subset of the source Mandarin training data in ACE 2005 Multilingual Training Corpus (LDC2006T06). Annotations for entities, relations, and events, which were included in ACE 2005 Multilingual Training Corpus, are not included in the current SpatialML release. For SpatialML markup to ACE 2005 English data, see ACE 2005 English SpatialML Annotations (LDC2008T03).

    SpatialML is a mark-up language for representing spatial expressions in natural language documents. SpatialML's focus is on geography and culturally-relevant landmarks, rather than biology, cosmology, geology, or other regions of the spatial language domain. The goal is to allow for better integration of text collections with resources such as databases that provide spatial information about a domain, including gazetteers, physical feature databases and mapping services.

    The ACE (Automatic Content Extraction) Program seeks to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from automatic speech recognition and optical character recognition). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data. Thus the ACE program requires the development of technologies that automatically detect and characterize this meaning. The annotation efforts of the ACE program support the development of automatic content extraction technology for processing human language in text form. The kinds of information recognized and extracted from text include entities, values, temporal expressions, relations and events.

    The SpatialML annotation scheme is intended to emulate earlier progress on time expressions such as TIMEX2, TimeML, and the 2005 ACE guidelines. The main SpatialML tag is the PLACE tag which encodes information about location. The central goal of SpatialML is to map location information in text to data from gazetteers and other databases to the extent possible by defining attributes in the PLACE tag. Therefore, semantic attributes such as country abbreviations, country subdivision and dependent area abbreviations (e.g., US states), and geo-coordinates are used to help establish such a mapping. LINK and PATH tags express relations between places, such as inclusion relations and trajectories of various kinds. Information in the tag along with the tagged location string should be sufficient to uniquely determine the mapping, when such a mapping is possible. This also means that redundant information is not included in the tag. To the extent possible, SpatialML leverages ISO and other standards towards the goal of making the scheme compatible with existing and future corpora. The SpatialML guidelines are compatible with existing guidelines for spatial annotation and existing corpora within the ACE research program.

    *Data*

    This corpus consists of a 298-document subset of broadcast material from the ACE 2005 Multilingual Training Corpus (LDC2006T06) that has been tagged by a native Mandarin speaker according to version 2.3 of the SpatialML annotation guidelines, which are included in the documentation for this release.


    *Updates*

    No updates have been issued at this time.
  • C-004360: Chinese Treebank 7.0
    *Introduction*

    Chinese Treebank 7.0, Linguistic Data Consortium (LDC) catalog number LDC2010T07 and ISBN 1-58563-542-1, consists of over one million words of annotated and parsed text from Chinese newswire, magazine news, various broadcast news and broadcast conversation programs, web newsgroups and weblogs.

    The Chinese Treebank project began at the University of Pennsylvania in 1998, continued at the University of Colorado and is now at Brandeis University. The project's goal is to provide a large, part-of-speech tagged and fully bracketed Chinese language corpus. The first delivery, Chinese Treebank 1.0, contained 100,000 syntactically annotated words from Xinhua News Agency (Xinhua) newswire. It was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11) and consisted of approximately 100,000 words. The LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000 words, in 2004. A year later, the LDC published the 500,000 word Chinese Treebank 5.0 (LDC2005T01). Chinese Treebank 6.0 (LDC2007T36), released in 2007, consisted of 780,000 words. Chinese Treebank 7.0 adds new annotated newswire data, broadcast material and web text to this effort.

    *Data*

    This release consists of 2,448 text files, 51,447 sentences, 1,196,329 words and 1,931,381 hanzi (Chinese characters). The data is provided in UTF-8 encoding, and the annotation has Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the enclosed segmentation, POS-tagging and bracketing guidelines. The data is provided in four different formats: raw text, word-segmented, word-segmented and POS-tagged, and syntactically bracketed.

    Chinese Treebank 7.0 includes text from the following genres and sources:

    * Newswire (Agence France Presse, China News Service, Guangming Daily, People's Daily, Xinhua News Agency): 260,164 words
    * News Magazine (Sinorama): 256,305 words
    * Broadcast News (China Broadcasting System, China Central TV, China National Radio, China Television System, New Tang Dynasty TV, Phoenix TV, Voice of America): 287,442 words
    * Broadcast Conversation (Anhui TV, China Central TV, CNN, MSNBC, New Tang Dynasty TV, Phoenix TV): 184,161 words
    * Newsgroups, Weblogs: 208,257 words
    * Total: 1,196,329 words

    *Sponsorship*

    This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-0022. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

    *Samples*

    For an example of the data in this corpus, please review the sample file.

    *Updates*

    No updates have been issued as of this time.

    Contact: ldc@ldc.upenn.edu © 2010 Linguistic Data Consortium, Trustees of the University of Pennsylvania. All Rights Reserved.
  • C-004361: Indian Language Part-of-Speech Tagset: Bengali
    *Introduction*

    Indian Language Part-of-Speech Tagset: Bengali, Linguistic Data Consortium (LDC) catalog number LDC2010T16 and ISBN 1-58563-561-8, is a corpus developed by Microsoft Research (MSR) India to support the task of Part-of-Speech (POS) tagging and other data-driven linguistic research on Indian Languages in general. It was created as part of the Indian Language Part-of-Speech Tagset (IL-POST) project, a collaborative effort among linguists and computer scientists from MSR India, AU-KBC (Anna University, Chennai), Delhi University, IIT Bombay, Jawaharlal Nehru University (Delhi) and Tamil University (Tamilnadu).

    The goal of the IL-POST project is to provide a common tagset framework for Indian Languages that offers flexibility, cross-linguistic compatibility and reusability across those languages. It supports a three-level hierarchy of Categories, Types and Attributes. The corpus therefore mainly consists of two levels of information for each lexical token: (a) its lexical Category and Type, and (b) a set of morphological attributes and their associated values in context.

    Bengali (also referred to as Bangla) is a member of the Eastern Indo-Aryan language group. It is native to the region of Bengal which consists of Bangladesh, the Indian state of West Bengal, and parts of the Indian states of Tripura and Assam. It is spoken by more than 210 million people as a first or a second language with around 100 million speakers in Bangladesh, about 85 million speakers in India, and others in immigrant communities in the United Kingdom, USA and the Middle East.

    *Data*

    This corpus contains 7,168 sentences (102,933 words) of manually annotated text from modern standard Bengali sources including blogs, Wikipedia, Multikulti and a portion of the EMILLE/CIIL corpus. The annotated data is structured into two folders, Bangla1 (3,684 sentences, 51,091 words) and Bangla2 (3,484 sentences, 51,842 words), which represent the two stages in which the data was annotated. All annotated data is provided in both XML and text files. Each data file contains between 3,000 and 5,000 words. The XML files contain metadata about the material, such as language, encoding and data size.

    *Annotation Procedure*

    The Annotation Guidelines for Bangla included in this release contain a detailed description of the annotation methodology.

    *Updates*

    Additional information, updates, bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T16.

    *Samples*

    Please examine this XML excerpt for an example of the data in this corpus.
  • C-004362: Indian Language Part-of-Speech Tagset: Hindi
    *Introduction*

    Indian Language Part-of-Speech Tagset: Hindi, Linguistic Data Consortium (LDC) catalog number LDC2010T24 and ISBN 1-58563-571-5, is a corpus developed by Microsoft Research (MSR) India to support the task of part-of-speech (POS) tagging and other data-driven linguistic research on Indian languages in general. It was created as part of the Indian Language Part-of-Speech Tagset (IL-POST) project, a collaborative effort among linguists and computer scientists from MSR India, AU-KBC (Anna University, Chennai), Delhi University, IIT Bombay, Jawaharlal Nehru University (Delhi) and Tamil University (Tamilnadu).

    The goal of the IL-POST project is to provide a common tagset framework for Indian languages that offers flexibility, cross-linguistic compatibility and reusability across those languages. It supports a three-level hierarchy of Categories, Types and Attributes. The corpus therefore provides two levels of information for each lexical token: (a) its lexical Category and Type, and (b) a set of morphological attributes and their associated values in context.

    Hindi is the official language of India and a member of the Indo-Aryan language group. It is spoken mainly in the northern states of Rajasthan, Delhi, Haryana, Uttarakhand, Uttar Pradesh, Madhya Pradesh, Chhattisgarh, Himachal Pradesh, Jharkhand and Bihar as well as in much of central India and in communities in Africa, Australia, New Zealand, the Middle East, Europe and North America. Hindi is the first or second language of more than 500 million people.

    *Data*

    This corpus contains 4,859 sentences (98,450 words) of manually annotated Hindi text randomly collected from the Microsoft Hindi Research Corpus, sourced from the publisher WebDunia. All annotated data is provided in both XML and text files. The XML files are contained in the "XML_files" folder and the text files in the "text_files" folder. Each data file contains between 900 and 5,000 words. Each XML file contains metadata about the material, such as language, encoding and data size.

    *Annotation Procedure*

    The Annotation Guidelines for Hindi, included in this release, contain a detailed description of the annotation methodology. The Annotation Tool Guideline 1.0, also included in this publication, describes the annotation interface developed for the IL-POST framework; the tool is not included in this corpus.

    *Updates*

    Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T24.

  • C-004363: Indian Language Part-of-Speech Tagset: Sanskrit
    *Introduction*

    Indian Language Part-of-Speech Tagset: Sanskrit, Linguistic Data Consortium (LDC) catalog number LDC2011T04 and ISBN 1-58563-575-8, is a corpus developed by Microsoft Research (MSR) India to support the task of part-of-speech (POS) tagging and other data-driven linguistic research on Indian languages in general. It was created as part of the Indian Language Part-of-Speech Tagset (IL-POST) project, a collaborative effort among linguists and computer scientists from MSR India, AU-KBC (Anna University, Chennai), Delhi University, IIT Bombay, Jawaharlal Nehru University (Delhi) and Tamil University (Tamilnadu).

    The goal of the IL-POST project is to provide a common tagset framework for Indian languages that offers flexibility, cross-linguistic compatibility and reusability across those languages. It supports a three-level hierarchy of Categories, Types and Attributes. The corpus therefore provides two levels of information for each lexical token: (a) its lexical Category and Type, and (b) a set of morphological attributes and their associated values in context.

    Sanskrit is the classical language of India and the oldest documented language of the Indo-European language family. It is also the liturgical language of Hinduism, Buddhism and Jainism and one of the twenty-two official languages of India. The name Sanskrit means refined, consecrated and sanctified.

    *Data*

    This corpus contains 3,703 sentences (57,218 words) of manually annotated Sanskrit text selected from the Panchatantra stories, a collection of animal fables in verse and prose dating from the third century BCE. All annotated data is provided in both XML and text files. The XML files are contained in the XML_files folder and the text files in the text_files folder. Each data file contains between 12,000 and 45,000 words. Each XML file contains metadata about the material, such as language, encoding and data size.

    *Annotation Procedure*

    The paper, Annotating Sanskrit corpus: adapting IL-POSTS, included in this release, contains a detailed description of the annotation methodology.

    *Updates*

    Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2011T04.