Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing 1451 - 1460 of 2023.

C-004063: WPT 05
WPT 05 is a collection of over 10 million documents from the Portuguese web obtained by the crawler of the Tumba! search engine, produced by the XLDB Node of Linguateca. The contents were crawled in 2005.
- replaces: WPT 03
C-004064: Croatian National Corpus
Croatian National Corpus (HNK) is a systematized collection of selected texts mainly written in contemporary Croatian covering different media, genres, styles, fields and topics. The Corpus is accompanied by additional linguistic and non-linguistic data and stored in a database on our server which can be accessed with the search client program Bonito.
- replaces: HNK v 1.0, 30-million corpus of contemporary Croatian
- replaces: HNK v 1.0, Croatian Electronic Text Archive (HETA)
- replaces: HNK v 2.0
C-004065: Czech National Corpus
The Czech National Corpus (CNC) is an academic project focusing on building a large electronic corpus of mainly written Czech. The Institute of the Czech National Corpus, Faculty of Arts, Charles University in Prague, is in charge of the CNC, its expansion, development and other related activities, particularly those associated with teaching and advancing the field of corpus linguistics.
- hasPart: DIAKORP (corpus of the diachronic section of the CNC)
- hasPart: N-002752: The rover, or, The banish'd cavaliers [Electronic resource] / by Aphra Behn
- hasPart: C-004070: SYN2000
- hasPart: C-004069: SYN2005
- hasPart: C-004071: FSC2000
- hasPart: C-004072: KSK-DOPISY
- hasPart: C-004073: ORWELL
- hasPart: C-004074: ORAL2008
- hasPart: C-004075: ORAL2006
- hasPart: C-004076: PMK
- hasPart: C-004077: BMK
- references: C-004079: InterCorp
C-004068: SYN2006PUB
SYN2006PUB is a synchronic corpus of written journalism containing 300 million words (tokens). It contains exclusively journalistic texts from November 1989 to the end of 2004, that is, from the period covered by the corpora SYN2000 and SYN2005. The three corpora are disjoint with respect to the texts used: no text that is part of one corpus is included in the other two. Together, SYN2000, SYN2005 and SYN2006PUB thus contain a total of 500 million text words (tokens).
- hasVersion: C-004069: SYN2005
- hasVersion: C-004070: SYN2000
- hasVersion: C-004071: FSC2000
- isPartOf: C-004065: Czech National Corpus
C-004069: SYN2005
The SYN2005 corpus is a synchronic representative corpus of contemporary written Czech, containing 100 million words (tokens). This basic characteristic is identical to that of its predecessor, the SYN2000 corpus. There are, however, many differences between the two corpora that must be taken into consideration when comparing their data (see below), because a merely mechanical comparison of frequencies can lead to misleading conclusions when these circumstances are not known. We also consider it important to emphasise that none of the SYN2005 texts was previously used in the SYN2000 corpus; the two corpora are therefore disjoint with respect to the texts used, and together they contain 200 million words (tokens).
- hasVersion: C-004068: SYN2006PUB
- isPartOf: C-004065: Czech National Corpus
- hasVersion: C-004070: SYN2000
- hasVersion: C-004071: FSC2000
C-004070: SYN2000
The SYN2000 corpus contains 100 million words and is composed of complete texts only. The criteria for selecting texts were based on research into written language: the texts were to cover the widest possible genre stratification of the Czech language. SYN2000 is a synchronic corpus, which means that it covers contemporary Czech. It therefore contains primarily texts created in 1990-1999. However, important works of Czech literature were also included in the corpus (e.g., Karel Čapek's Krakatit or Josef Škvorecký's Zbabělci (The Cowards)). For older texts, the rule was that the author had to be born after 1880 for the text to be included in this corpus.
- isPartOf: C-004065: Czech National Corpus
- hasVersion: C-004068: SYN2006PUB
- hasVersion: C-004069: SYN2005
- hasPart: C-004071: FSC2000
C-004071: FSC2000
The FSC2000 Corpus is a reference source and a complement to the Frequency Dictionary of Czech (FSČ), which was published at the end of 2004 by NLN. The FSC2000 Corpus is based on the SYN2000 corpus and its development is described on the Czech website. One consequence of this process is that the texts in the FSC2000 corpus are in fact a subset of the texts in the SYN2000 corpus. The exact size of the FSC2000 corpus is 95 854 929 word forms (without punctuation marks); the figure of 114 363 813 corpus positions reported by the corpus manager includes both word forms and punctuation marks.
- isPartOf: C-004065: Czech National Corpus
- isPartOf: C-004070: SYN2000
- hasVersion: C-004068: SYN2006PUB
- hasVersion: C-004069: SYN2005
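The two size figures quoted for FSC2000 can be related by simple subtraction. The following is a minimal Python sketch, assuming that corpus positions consist only of word forms and punctuation marks, as the description states. The function name `punctuation_positions` and the constant names are hypothetical and are not part of any SHACHI or Czech National Corpus tool.

```python
def punctuation_positions(corpus_positions: int, word_forms: int) -> int:
    """Return the number of positions that are not word forms.

    Assumes every corpus position is either a word form or a
    punctuation mark, so the difference is the punctuation count.
    """
    if word_forms > corpus_positions:
        raise ValueError("word forms cannot exceed corpus positions")
    return corpus_positions - word_forms


# Figures taken from the FSC2000 description above.
FSC2000_CORPUS_POSITIONS = 114_363_813
FSC2000_WORD_FORMS = 95_854_929

if __name__ == "__main__":
    print(punctuation_positions(FSC2000_CORPUS_POSITIONS, FSC2000_WORD_FORMS))
```

Under that assumption, the punctuation marks in FSC2000 would number 18 508 884.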
C-004072: KSK-DOPISY
The Private Correspondence Corpus (KSK) offers a look into the language of contemporary private epistolary texts. The KSK, which captures handwritten correspondence, possibly in the last stage of its existence, contains electronic transcriptions of 2000 letters (that is, 942 573 corpus positions) from 1990-2004. The selection of texts satisfies the requirement of idiolect variety, that is, it represents the language of 2000 different people. The collected correspondence includes writers from the entire Czech Republic and from all age and education categories; however, the communication of young people is emphasised most, as it is the best evidence of current developmental tendencies in Czech and of transformations of the correspondence genre and of written expression in general.
- isPartOf: C-004065: Czech National Corpus
C-004073: ORWELL
This corpus was created as part of the EU Multext-East project and consists of the text of George Orwell's novel 1984 (translated from the English original by Eva Šimečková; Prague: Naše vojsko, 1991). The corpus contains c. 80 thousand words and 20 thousand punctuation marks, that is, approximately 100 thousand corpus positions, and it is morphologically tagged. The relatively small size of this corpus allowed mistakes introduced during automatic morphological analysis to be corrected by hand, so it is almost flawlessly tagged.
- isPartOf: C-004065: Czech National Corpus
C-004074: ORAL2008
ORAL2008 is another spoken corpus available within the framework of the Czech National Corpus project. Its aim is an appropriate representation of authentic spoken language. The corpus is built from material recorded throughout Bohemia in 2002 - 2007, using the same repository of recordings and transcriptions as its predecessor, the ORAL2006 corpus.
- isPartOf: C-004065: Czech National Corpus
- hasVersion: C-004075: ORAL2006
- hasVersion: C-004076: PMK
