Language Resource Search - SHACHI: Language Resource Metadata Database


C-001704: Academia Sinica Balanced Corpus of Modern Chinese
Sinica Corpus is the first balanced corpus of Modern Chinese, with part-of-speech tagging for 5 million words, and is designed for analyzing modern Chinese. Texts in the corpus are collected from different areas and classified according to five criteria: genre, style, mode, topic, and source. Every text is segmented, and each segmented word is tagged with its part of speech.
- isReferencedBy: C-001705: Sinica Treebank
- isReplacedBy: Sinica 5.0
C-001705: Sinica Treebank
Sinica Treebank is a syntactic, structure-tagged corpus of modern Chinese. It was built by CKIP in 1997 with texts taken from the Sinica Corpus. The structural frame of Sinica Treebank is based on the Head-Driven Principle; that is, a sentence or phrase is composed of a core Head and its arguments, or adjuncts.
- references: C-001704: Academia Sinica Balanced Corpus of Modern Chinese
C-001706: NEGRA Corpus Version 2
The NEGRA corpus version 2 consists of 355,096 tokens (20,602 sentences) of German newspaper text, taken from the Frankfurter Rundschau as contained in the CD "Multilingual Corpus 1" of the European Corpus Initiative. It is based on approx. 60,000 tokens that were tagged for part of speech at the Institut für maschinelle Sprachverarbeitung, Stuttgart. This corpus was extended, tagged with part-of-speech and completely annotated with syntactic structures.
C-001707: Corpus NILC/São Carlos
Corpus NILC is a 40 million-word corpus consisting of newspaper texts including commercial letters and educational texts in Brazilian Portuguese, divided into corrected texts, uncorrected texts and semi-corrected texts.
C-001708: CorpusPE
CorpusPE contains 65 pairs of digitized academic parallel texts (abstracts) on computer science. They were divided into two groups: one with 65 pairs of authentic (non-revised) texts, the other with the same 65 pairs revised by a human translator (pre-edited corpus) to remove grammatical and translation errors. The corpus was processed and divided into three classes of corpora: test corpus, POS-tagged corpus and reference corpus.
- hasVersion: C-001709: CorpusALCA
- hasVersion: C-001710: CorpusNYT
C-001709: CorpusALCA
CorpusALCA contains 4 pairs of digitized official documents of the Free Trade Area of the Americas (FTAA) available on the Web. The corpus was processed and divided into three classes of corpora: test corpus, POS-tagged corpus and reference corpus.
- hasVersion: C-001708: CorpusPE
- hasVersion: C-001710: CorpusNYT
C-001710: CorpusNYT
CorpusNYT contains 7 pairs of parallel, digitized articles from "The New York Times" available on the Web in English and Brazilian Portuguese. The corpus was processed and divided into three classes of corpora: test corpus, POS-tagged corpus and reference corpus.
- hasVersion: C-001708: CorpusPE
- hasVersion: C-001709: CorpusALCA
C-001711: Corpus Gesproken Nederlands
The Spoken Dutch Corpus is a collection of approximately 900 hours of Standard Dutch from Flemish and Dutch speakers. All recordings have been aligned with an orthographic transcription and each word has been given a POS tag and a lemma. Part of the data has been enriched with syntactic, prosodic and/or phonetic information.
- isReferencedBy: CGN final evaluation report (version 1.0)
- isReferencedBy: Online Course Spoken Dutch Corpus
- isReferencedBy: CGN promotional CD (http://ww2.tst.inl.nl/images/stories/docs/Engels/cgndemo_en.pdf)
- isReferencedBy: C-003642: COREA-coreferentiecorpus
C-001714: METU Turkish Corpus
METU Turkish Corpus is a collection of 2 million words of post-1990 written Turkish samples, taken from 10 different genres and XCES tagged at the typographical level.
- isReferencedBy: METU-Sabanci Turkish Treebank (http://www.ii.metu.edu.tr/~corpus/treebank.html)
- isReferencedBy: Bilge Say, Deniz Zeyrek, Kemal Oflazer, Umut Ozge, "Development of a Corpus and a Treebank for Present-day Written Turkish" in Proceedings of the Eleventh International Conference of Turkish Linguistics, August 2002.
C-001715: METU-Sabanci Turkish Treebank
METU-Sabanci Turkish Treebank is a morphologically and syntactically annotated treebank corpus of 7262 grammatical sentences. The sentences are taken from METU Turkish Corpus, a collection of 2 million words of post-1990 written Turkish samples. The percentages of different genres in METU-Sabanci Turkish Treebank and METU Turkish Corpus are similar.
- references: C-001714: METU Turkish Corpus
- isReferencedBy: Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tur, Gokhan Tur, "Building a Turkish Treebank", Invited chapter in "Building and Exploiting Syntactically-annotated Corpora", Anne Abeille Editor, Kluwer Academic Publishers, 2003
- isReferencedBy: Nart B. Atalay, Kemal Oflazer, Bilge Say, "The Annotation Process in the Turkish Treebank", in "Proceedings of the EACL Workshop on Linguistically Interpreted Corpora - LINC", April 13-14, 2003, Budapest, Hungary (http://people.sabanciuniv.edu/oflazer/archives/papers/annotationttbank.pdf)
