Language resource #: 3330
-
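C-003853: Cantonese Common Word Database (広東語常用単語データベース)
A database for looking up common Cantonese words by pronunciation, by Japanese or English equivalent, and by category code. It is intended for use as an online dictionary. About 9,000 words frequently used in everyday conversation and when traveling to Hong Kong, Guangzhou, and similar destinations have been extracted from the author's own 72,000-word database and published. Compound (multi-field) search is not currently supported.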
-
C-003857: The JEFLL (Japanese EFL Learner) Corpus
The corpus consists of essays by learners ranging from novice to intermediate level, mainly junior and senior high school students in Japan. The essay task is carefully controlled so that subcorpora are comparable across topics, proficiency levels, school years, and school types, among other variables.
-
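C-003858: BNC Example Collection by Grammar Item (文法項目別BNC用例集)
The BNC Example Collection by Grammar Item is a collection of English example sentences created by Tokyo University of Foreign Studies in a joint project with Shogakukan. More than 200,000 examples extracted from the BNC are classified into 1,320 sentence patterns, which are organized under 144 major grammar items and 14 sub-items selected by surveying junior and senior high school English textbooks and grammar books.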
- isFormatOf: C-001018: British National Corpus 1.0
-
C-003859: ORCHID POS-Tagged Corpus
ORCHID (Open linguistic Resources CHanelled toward InterDisciplinary research) is an initiative project aimed at building linguistic resources to support research in, but not limited to, natural language processing. Based on the concept of an open architecture design, the resources must be fully compatible with similar resources, and software tools must also be made available. The construction of a Thai part-of-speech (POS) tagged corpus is a preliminary stage in the construction of a Thai speech corpus.
-
C-003861: LIVAC Synchronous Corpus
It contains texts from representative Chinese newspapers and electronic media of Hong Kong, Taiwan, Beijing, Shanghai, Macau and Singapore. The collection of materials from the diverse communities is synchronized, and so offers an innovative "Window" approach for a whole variety of comparative studies and useful IT applications.
-
C-003863: AnnCorra
The name AnnCorra, short for "Annotated Corpora", refers to an electronic lexical resource of annotated corpora. The purpose of this effort is to fill the gap in such resources for Indian languages. It will be an important resource for the development of Indian language parsers, machine learning of grammars, lakshancharts (discrimination nets for sense disambiguation), and a host of other tools.
-
C-003864: Corpus Program
The Corpus Program is developed and maintained by the CKIP group at Academia Sinica. Its News Corpus includes 14 million words.
-
C-003865: Sinica Balanced Corpus
The Sinica Balanced Corpus (Sinica Corpus) is the first balanced Chinese corpus with part-of-speech tagging. The corpus (Sinica 3.0) is open to the research community through the WWW (http://www.sinica.edu.tw/SinicaCorpus/). The size of this corpus is 5 million words. Each text in the corpus is classified and marked according to five criteria: genre, style, mode, topic, and source. The feature values of these classifications are assigned in a hierarchy. Subcorpora can be defined with a specific set of attributes to serve different research purposes. Texts in the corpus are segmented according to the word segmentation standard proposed by the ROC Computational Linguistic Society. Each segmented word is tagged with its part-of-speech. Linguistic patterns and language structures can be extracted from the tagged corpus via a corpus inspection program which can filter the data, generate statistics, sort, and identify collocations.
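The idea of defining a subcorpus from a set of attribute values can be illustrated with a minimal sketch. It assumes each text is represented as a dictionary keyed by the five criteria named above; the function name `select_subcorpus`, the dictionary layout, and the attribute values in the usage note are hypothetical, and the hierarchical organization of feature values is not modeled here. It is not the corpus inspection program itself.

```python
from typing import Dict, Iterable, List

# The five classification criteria named in the description.
CRITERIA = ("genre", "style", "mode", "topic", "source")


def select_subcorpus(
    texts: Iterable[Dict[str, str]], **wanted: str
) -> List[Dict[str, str]]:
    """Return the texts whose attributes equal every requested value.

    Hypothetical helper: exact matching only, no attribute hierarchy.
    """
    unknown = set(wanted) - set(CRITERIA)
    if unknown:
        raise KeyError(f"unknown criteria: {sorted(unknown)}")
    return [
        text
        for text in texts
        if all(text.get(key) == value for key, value in wanted.items())
    ]
```

For example, `select_subcorpus(texts, mode="written", topic="science")` would keep only the texts carrying both of those (made-up) attribute values.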
-
C-003866: Word List with Accumulated Word Frequency in Sinica Corpus 3.0
- isPartOf: C-003865: Sinica Balanced Corpus
-
C-003867: Chinese Electronic Dictionary
The CKIP Electronic Dictionary is an electronic lexicon for Mandarin Chinese containing 88,000 entries. Each entry contains:
1. print form (Chinese characters),
2. word frequency (based on a 5-million-word corpus),
3. pronunciation (National Phonetic Alphabet, Zhu4yin1fu2hao4, and Chinese Phonetic Alphabet, Han4Yu3Pin1Yin1),
4. syntactic category (based on the CKIP classification of 198 categories),
5. semantic feature (based on the CKIP classification of 123 concept nodes for nouns).
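
The five fields listed above could be modeled as a simple record. The sketch below is only an illustration: the class name `LexiconEntry`, the field names, and the tab-separated line layout read by `parse_entry` are assumptions, not the dictionary's actual distribution format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconEntry:
    """Hypothetical record mirroring the five fields of a dictionary entry."""

    print_form: str                         # 1. Chinese characters
    frequency: int                          # 2. count in the 5-million-word corpus
    zhuyin: str                             # 3. National Phonetic Alphabet form
    pinyin: str                             # 3. Hanyu Pinyin with tone digits
    syntactic_category: str                 # 4. one of the 198 CKIP categories
    semantic_feature: Optional[str] = None  # 5. concept node (nouns only)


def parse_entry(line: str, sep: str = "\t") -> LexiconEntry:
    """Parse one line in an assumed tab-separated layout (not the real format)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}")
    form, freq, zhuyin, pinyin, category = fields[:5]
    # The semantic feature is optional because it applies to nouns only.
    feature = fields[5] if len(fields) == 6 and fields[5] else None
    return LexiconEntry(form, int(freq), zhuyin, pinyin, category, feature)
```

Making the semantic feature optional reflects the description, which defines concept nodes for nouns only; entries of other categories would simply leave it empty.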