Language Resource Search - SHACHI: Language Resource Metadata Database


C-004374: XLEL-21
XLEL-21 was developed to support the training and evaluation of cross-language linking of named entities from 21 non-English languages into an English knowledge base. It includes over 55,000 queries across 21 non-English languages, plus an English version of each query. (http://hltcoe.jhu.edu/datasets/)
C-004375: NPS Internet Chatroom Conversations, Release 1.0
*Introduction*

NPS Internet Chatroom Conversations, Release 1.0 consists of 10,567 English posts (45,068 tokens) gathered from age-specific chat rooms of various online chat services in October and November 2006. Each file is a text recording from one of these chat rooms for a short period on a particular day. Users should be aware that some of the conversations in this corpus feature subjects and language that some people may find offensive or objectionable, including discussions of a sexual nature. This corpus was developed by researchers at the Department of Computer Science, Naval Postgraduate School, Monterey, California.

Although much work has been accomplished in Natural Language Processing (NLP) in traditional written and spoken language domains, relatively little has been performed in the newer computer-mediated communication (CMC) domains enabled by the Internet, such as text-based chat. One factor inhibiting research in this area has been the dearth of annotated CMC corpora available to the broader research community, despite the increasing use of CMC in a variety of areas and applications. NPS Internet Chatroom Conversations is one of the first text-based chat corpora tagged with lexical and discourse information. This corpus might be used to develop stochastic NLP applications that perform tasks such as conversation thread topic detection, author profiling, entity identification, and social network analysis.

Each post is annotated with a chat dialog-act tag, and individual tokens within each post are annotated with part-of-speech tags. A first set of 3,507 tokenized posts was automatically tagged using a part-of-speech tagger trained on the Penn Treebank corpora, combined with a regular expression that identified privacy-masked user names and emoticons. Similarly, simple regular expression matching was employed to assign an initial chat dialog-act to each post in this subset. This initial tagging was verified by hand (with corrections made where necessary). The remaining 7,060 posts were POS-tagged using a POS tagger that was trained on the newly hand-tagged chat data and the Penn Treebank corpora. Dialog-act tagging on the remaining posts was accomplished using a back-propagation neural network trained on 21 features of the initial dialog-act-labeled posts. The tagging of this second group of posts was also manually verified (and corrected where necessary). Ultimately, all of the 10,567 privacy-masked posts, representing 45,068 tokens, were annotated with manually verified part-of-speech and dialog-act information.

Filenames consist of date, target age group, and number of posts. For example, the file 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on October 19, 2006. The posts have been privacy-masked in two ways. First, all usernames have been changed to generic names of the form "UserN", where N is a unique integer consistently used for each respective poster across all files. The posts were then read by humans to remove other personally identifiable information. Within each file, usernames are prepended with the date and chat room portions of the filename. So in the above filename example, UserN becomes 10-19-20sUserN.
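
The filename convention described above can be unpacked with a short regular expression. The following is a minimal sketch in Python, not part of the corpus distribution: the function name `parse_chat_filename` and the returned field names are illustrative assumptions, and the default year 2006 is taken from the collection period stated in the description.

```python
import re
from datetime import date

# Matches names such as "10-19-20s_706posts.xml":
# month-day-agegroup, an underscore, the post count, then "posts.xml".
FILENAME_RE = re.compile(
    r"^(?P<month>\d{1,2})-(?P<day>\d{1,2})-(?P<age_group>[^_]+)"
    r"_(?P<n_posts>\d+)posts\.xml$"
)


def parse_chat_filename(name, year=2006):
    """Split a corpus filename into date, age group and post count.

    The year is not encoded in the filename; 2006 is assumed here.
    """
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected filename: {name!r}")
    return {
        "date": date(year, int(m["month"]), int(m["day"])),
        "age_group": m["age_group"],
        "n_posts": int(m["n_posts"]),
        # Prefix prepended to "UserN" inside this file.
        "user_prefix": f"{m['month']}-{m['day']}-{m['age_group']}",
    }


info = parse_chat_filename("10-19-20s_706posts.xml")
print(info["date"], info["age_group"], info["n_posts"])
print(info["user_prefix"] + "User5")  # hypothetical user number
```

The `user_prefix` field reproduces the date and chat room portion of the filename, so a masked name inside the example file has the form 10-19-20sUserN.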

*Samples*

Please examine this sample for an example of the data in this corpus.

*References*

[1] Eric N. Forsyth and Craig H. Martell, "Lexical and Discourse Analysis of Online Chat Dialog," Proceedings of the First IEEE International Conference on Semantic Computing (ICSC 2007), pp. 19-26, September 2007.

[2] T. Wu, F. M. Khan, T. A. Fisher, L. A. Shuler and W. M. Pottenger, "Posting act tagging using transformation-based learning," Proceedings of the Workshop on Foundations of Data Mining and Discovery, IEEE International Conference on Data Mining, December 2002.

[3] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. Van Ess-Dykema and M. Meteer, "Dialogue act modeling for automatic tagging and recognition of conversational speech," Computational Linguistics, vol. 26, no. 3, pp. 339-373, 2000.

[4] M. Zitzen and D. Stein, "Chat and conversation: a case of transmedial stability?" Linguistics, vol. 42, no. 5, pp. 983-1021, 2004.
- conformsTo: C-001546: Treebank-2
C-004376: Quranic Arabic Corpus - Version 0.4
Quranic Arabic Corpus provides the Arabic grammar, syntax and morphology for each word in the Holy Quran. The Quran is the central religious text written in Quranic Arabic and contains 6,236 numbered verses. The corpus provides three levels of analysis: morphological annotation, a syntactic treebank and a semantic ontology.
C-004377: RWTH-PHOENIX-Weather Database of German Sign Language
The RWTH-PHOENIX-Weather corpus is a video-based, large vocabulary corpus of German Sign Language. The corpus contains weather forecasts recorded from German public TV. Further, the spoken German weather forecast has been transcribed in a semi-automatic fashion. The corpus contains bilingual data for translation experiments for the language pair German Sign Language - German.
C-004378: The CONCISUS Corpus of Event Summaries
The CONCISUS Corpus is an annotated dataset of comparable Spanish and English event summaries, covering such domains as aviation accidents, train accidents, earthquakes, and terrorist attacks. The dataset contains: comparable summaries, comparable automatic translations, and comparable full documents.
C-004379: NKI-CCRT Corpus
The NKI-CCRT corpus contains recordings of 55 speakers treated for cancer of the head and neck, and the corresponding perceptual evaluations of speech intelligibility over three evaluation moments: before treatment and after treatment.
C-004380: Arabic Treebank - Broadcast News v1.0
*Introduction*

Arabic Treebank - Broadcast News v1.0 was developed at the Linguistic Data Consortium (LDC). It consists of 120 transcribed Arabic broadcast news stories with part-of-speech, morphology, gloss and syntactic tree annotation.

The ongoing Penn Arabic Treebank (PATB) project supports research in Arabic-language natural language processing and human language technology development. The methodology and work leading to the release of this publication are described in detail in the documentation accompanying this corpus.

*Data*

This release contains 432,976 source tokens before clitics were split, and 517,080 tree tokens after clitics were separated for treebank annotation. The source materials are Arabic broadcast news stories collected by LDC during the period 2005-2008 from the following sources: Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Baghdadya TV, Al Fayha, Alhurra, Al Iraqiyah, Aljazeera, Al Ordiniyah, Al Sharqiyah, Dubai TV, Kuwait TV, Lebanese Broadcasting Corp., Oman TV, Radio Sawa, Saudi TV and Syria TV. The transcripts were produced by LDC.

*Samples*

Please follow this link for a sample from this corpus.

*Sponsorship*

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

*Updates*

None at this time.
- conformsTo: Arabic Treebank Part 5 - v1.0
- conformsTo: Arabic Treebank Part 6 - v1.0
- conformsTo: Arabic Treebank Part 1 - v4.1
- conformsTo: Arabic Treebank Part 3 - v3.2
C-004381: Arabic-Dialect/English Parallel Text
*Introduction*

Arabic-Dialect/English Parallel Text was developed by Raytheon BBN Technologies (BBN), LDC and Sakhr Software and contains approximately 3.5 million tokens of Arabic dialect sentences and their English translations.

*Data*

The data in this corpus consists of Arabic web text as follows:

1. Filtered automatically from large Arabic text corpora harvested from the web by LDC. The LDC corpora consisted largely of weblog and online user groups and amounted to around 350 million Arabic words. Documents that contained a large percentage of non-Arabic or Modern Standard Arabic (MSA) words were eliminated. A list of dialect words was manually selected by culling through the Levantine Fisher (LDC2005S07, LDC2005T03, LDC2007S02 and LDC2007T04) and Egyptian CALLHOME speech corpora (LDC97S45, LDC2002S37, LDC97T19 and LDC2002T38) distributed by LDC. That list was then used to retain documents that contained a certain number of matches. The resulting subset of the web corpora contained around four million words. Documents were automatically segmented into passages using formatting information from the raw data.

2. Manually harvested by Sakhr Software from Arabic dialect web sites.
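
The word-list filtering in item 1 can be sketched as a match count per document. This is only an illustration of the idea: the whitespace tokenization, the placeholder word list and the `min_matches` threshold are assumptions, since the description does not state the actual list, tokenizer or cutoff.

```python
def count_dialect_matches(text, dialect_words):
    """Count whitespace-separated tokens that appear in the dialect word list."""
    return sum(1 for token in text.split() if token in dialect_words)


def filter_dialect_documents(documents, dialect_words, min_matches=2):
    """Keep documents with at least `min_matches` dialect-word tokens.

    `min_matches` is a hypothetical threshold; the source only says
    "a certain number of matches".
    """
    dialect_words = set(dialect_words)
    return [
        doc for doc in documents
        if count_dialect_matches(doc, dialect_words) >= min_matches
    ]


# Placeholder tokens stand in for real dialect words and web documents.
word_list = {"dw1", "dw2", "dw3"}
docs = ["dw1 x dw2 y", "x y z", "dw3 x"]
print(filter_dialect_documents(docs, word_list))
```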

Dialect classification and sentence segmentation, as needed, and translation into English were performed by BBN through Amazon's Mechanical Turk. Arabic annotators from Mechanical Turk classified filtered passages as being either MSA or one of four regional dialects: Egyptian, Levantine, Gulf/Iraqi or Maghrebi. An additional General dialect option was allowed for ambiguous passages. The classification was applied to whole passages rather than individual sentences. Only the passages labeled Levantine and Egyptian were further processed. The segmented Levantine and Egyptian sentences were then translated. Annotators were instructed to translate completely and accurately and to transliterate Arabic names. They were also provided with examples. All segments of a passage were presented in the same translation task to provide context.

*Samples*

Please follow this link for a sample of the data in this release.

*Updates*

None at this time.
- references: C-001421: Fisher Levantine Arabic Conversational Telephone Speech
- references: C-000610: Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
- references: C-001328: Arabic CTS Levantine Fisher Training Data Set 3, Speech
- references: C-000649: CALLHOME Egyptian Arabic Speech Supplement
- references: C-000650: CALLHOME Egyptian Arabic Speech
- references: C-000651: CALLHOME Egyptian Arabic Transcripts Supplement
- references: C-000652: CALLHOME Egyptian Arabic Transcripts
C-004382: Chinese Dependency Treebank 1.0
*Introduction*

Chinese Dependency Treebank 1.0 was developed by the Harbin Institute of Technology's Research Center for Social Computing and Information Retrieval (HIT-SCIR). It contains 49,996 Chinese sentences (902,191 words) randomly selected from People's Daily newswire stories published between 1992 and 1996 and annotated with syntactic dependency structures.

*Data*

Ill-formed or short sentences were eliminated from the randomly selected sentences prior to annotation. The data was segmented and annotated for part of speech (POS), syntactic structures, verb subclasses and noun compounds. Word segmentation and POS tagging were accomplished automatically using statistical models trained on a larger, annotated corpus of People's Daily newswire stories. Humans manually annotated the syntactic structures and corrected word segmentation errors. POS tags were not corrected.

The data is provided in the format of CoNLL-X and in UTF-8. One line presents information for one word. An empty line indicates the end of a sentence. Each line contains 10 columns separated with a tab.
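
A file in this layout can be read with a few lines of Python. The sketch below assumes the standard CoNLL-X column order (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the description only states that there are 10 tab-separated columns, so how this corpus fills each column is an assumption, and the two sample rows are invented for illustration.

```python
# Column names follow the CoNLL-X shared task convention.
CONLLX_FIELDS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


def read_conllx(lines):
    """Yield sentences as lists of dicts, one dict per word line."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # An empty line marks the end of a sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}")
        sentence.append(dict(zip(CONLLX_FIELDS, columns)))
    if sentence:
        # Last sentence when the input lacks a trailing empty line.
        yield sentence


# Invented rows; real data would come from open(path, encoding="utf-8").
sample = [
    "1\tA\tA\tn\tn\t_\t2\tSBV\t_\t_\n",
    "2\tB\tB\tv\tv\t_\t0\tHED\t_\t_\n",
    "\n",
]
for sent in read_conllx(sample):
    print([(w["form"], w["head"], w["deprel"]) for w in sent])
```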

*Samples*

Please follow this link for a sample of the data.

*Updates*

None at this time.
C-004383: Chinese-English Semiconductor Parallel Text
*Introduction*

Chinese-English Semiconductor Parallel Text was developed by The MITRE Corporation. It consists of parallel sentences from a collection of abstracts from scientific articles on semiconductors published in Mandarin and translated into English by translators with particular expertise in the technical area. Translators were instructed to err on the side of literal translation if required, but to maintain the technical writing style of the source and to make the resulting English as natural as possible. The translators followed specific guidelines for translation, and those are included in this distribution.

*Data*

There are 2,169 lines of parallel Mandarin and English, with a total of 125,302 characters of Mandarin and 64,851 words of English, presented in a separate UTF-8 plain text file for each language. The sentences were translated in sequential order but are presented in scrambled order. Both files are scrambled identically, so lines with the same line number are translations of each other. For example, the 31st line of the English file is a translation of the 31st line of the Mandarin file. The original line sequence is not provided.
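
Because alignment is by line number alone, sentence pairs can be recovered by reading the two files in lockstep. The following is a minimal sketch; the function name `read_parallel` is illustrative, and the length check is an added safeguard rather than something the distribution prescribes.

```python
from itertools import zip_longest


def read_parallel(zh_lines, en_lines):
    """Pair line n of the Mandarin input with line n of the English input."""
    pairs = []
    for n, (zh, en) in enumerate(zip_longest(zh_lines, en_lines), start=1):
        if zh is None or en is None:
            # One input ran out first, so the alignment cannot be trusted.
            raise ValueError(f"inputs differ in length at line {n}")
        pairs.append((zh.rstrip("\r\n"), en.rstrip("\r\n")))
    return pairs


# Invented lines; real use would pass two files opened with encoding="utf-8".
pairs = read_parallel(["甲\n", "乙\n"], ["first\n", "second\n"])
print(pairs[1])
```

Index 30 of the returned list then holds the pair for the 31st line of each file.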

*Samples*

Follow these links for Chinese and English samples.

*Updates*

None at this time.
