Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing items 2001 - 2010 of 2023.

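C-005052: 読売新聞記事データ＜邦文＞2015年版 (Yomiuri Shimbun Article Data, Japanese text, 2015 edition)
A newspaper article database intended to support research in linguistics, informatics, media studies, and related fields. It contains one year of Japanese-language Yomiuri Shimbun newspaper articles from 2015 (approximately 270,000 articles) in CSV format. Use for purposes other than research is prohibited.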
- references: D-003612: ヨミダス用語辞書
- hasVersion: C-003605: 読売新聞記事データ＜邦文＞2006年版 (*CSV format)
- hasVersion: C-003604: 読売新聞記事データ＜邦文＞2007年版 (*CSV format)
- hasVersion: C-005045: 読売新聞記事データ＜邦文＞2008年版
- hasVersion: C-005046: 読売新聞記事データ＜邦文＞2009年版
- hasVersion: C-005047: 読売新聞記事データ＜邦文＞2010年版
- hasVersion: C-005048: 読売新聞記事データ＜邦文＞2011年版
- hasVersion: C-005049: 読売新聞記事データ＜邦文＞2012年版
- hasVersion: C-005050: 読売新聞記事データ＜邦文＞2013年版
- hasVersion: C-005051: 読売新聞記事データ＜邦文＞2014年版
- hasVersion: C-005053: 読売新聞記事データ＜邦文＞2016年版
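C-005053: 読売新聞記事データ＜邦文＞2016年版 (Yomiuri Shimbun Article Data, Japanese text, 2016 edition)
A newspaper article database intended to support research in linguistics, informatics, media studies, and related fields. It contains one year of Japanese-language Yomiuri Shimbun newspaper articles from 2016 (approximately 260,000 articles) in CSV format. Use for purposes other than research is prohibited.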
- references: D-003612: ヨミダス用語辞書
- hasVersion: C-003605: 読売新聞記事データ＜邦文＞2006年版 (*CSV format)
- hasVersion: C-003604: 読売新聞記事データ＜邦文＞2007年版 (*CSV format)
- hasVersion: C-005045: 読売新聞記事データ＜邦文＞2008年版
- hasVersion: C-005046: 読売新聞記事データ＜邦文＞2009年版
- hasVersion: C-005047: 読売新聞記事データ＜邦文＞2010年版
- hasVersion: C-005048: 読売新聞記事データ＜邦文＞2011年版
- hasVersion: C-005049: 読売新聞記事データ＜邦文＞2012年版
- hasVersion: C-005050: 読売新聞記事データ＜邦文＞2013年版
- hasVersion: C-005051: 読売新聞記事データ＜邦文＞2014年版
- hasVersion: C-005052: 読売新聞記事データ＜邦文＞2015年版
C-005061: CMU_ARCTIC speech synthesis databases
The CMU_ARCTIC databases were constructed at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced, US English single speaker databases designed for unit selection speech synthesis research.

The databases consist of around 1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.

The distributions include 16 kHz waveforms and simultaneous EGG signals. Full phonetic labelling was performed by CMU Sphinx using the FestVox-based labelling scripts. Complete runnable Festival voices are included with the database distributions as examples, though better voices can be made by improving the labelling, etc.
C-005062: The Vera am Mittag German Audio-Visual Spontaneous Speech Database
The VAM corpus consists of 12 hours of recordings of the German TV talk-show “Vera am Mittag” (Vera at noon). They are segmented into broadcasts, dialogue acts and utterances, respectively. This audio-visual speech corpus contains spontaneous and very emotional speech recorded from unscripted, authentic discussions between the guests of the talk-show. Such data may be of great interest to all research groups working on spontaneous speech analysis, emotion recognition in both speech and facial expression, natural language understanding, and robust speech recognition. Linguists may also be interested in the variety of German regional accents present in the data.

In addition to the audio-visual data and the segmented utterances, we provide emotion labels for a great part of the data. This labeling follows state-of-the-art insights from emotion psychology. Thus, the emotion labels are given on a continuous-valued scale for three emotion primitives: valence (negative vs. positive), activation (calm vs. excited) and dominance (weak vs. strong), using a large number of human evaluators.
C-005063: AV16.3
The AV16.3 corpus is an audio-visual corpus of 43 real indoor multispeaker recordings, designed to test algorithms for audio-only, video-only and audio-visual speaker localization and tracking. Real human speakers were used. The variety of recordings was chosen to test algorithms to their limits, and to cover a wide range of application scenarios (meetings, surveillance). The emphasis is on overlapped speech and multiple moving speakers. Recordings include mostly dynamic scenarios, with single and multiple moving speakers. A few meeting scenarios, with mostly seated speakers, are also included.
C-005064: Disco-Annotation
Disco-Annotation is a collection of training and test sets with manually annotated discourse relations for 8 discourse connectives (although, as, however, meanwhile, since, though, while, yet) in Europarl texts (http://www.idiap.ch/dataset/europarl-direct).
For each connective there is a training set and a test set. The relations were annotated by two trained annotators with a translation spotting method. The fixed division into training and test sets also allows for comparison if you train your own models.
C-005065: Mediaparl
Mediaparl is a Swiss-accented bilingual database containing recordings in both French and German as they are spoken in Switzerland. The data were recorded at the Valais Parliament. Valais is a bilingual Swiss canton with many local accents and dialects. Therefore, the database contains data with high variability and is suitable for studying multilingual, accented and non-native speech recognition as well as language identification and language switch detection.
C-005066: MOBIO
The MOBIO database consists of bi-modal (audio and video) data taken from 152 people. The database has a female-male ratio of nearly 1:2 (100 males and 52 females) and was collected from August 2008 until July 2010 at six different sites in five different countries. This led to a diverse bi-modal database with both native and non-native English speakers.

In total, 12 sessions were captured for each client: 6 sessions for Phase I and 6 sessions for Phase II. The Phase I data consists of 21 questions of the following types: Short Response Questions, Short Response Free Speech, Set Speech, and Free Speech. The Phase II data consists of 11 questions of the following types: Short Response Questions, Set Speech, and Free Speech.
C-005067: Tense-Annotation
This dataset contains parallel English and French texts from the Europarl corpus (Koehn, 2005).

The files provide alignments of EN and FR verbs along with information on their position, tense and voice and can therefore be used in translational studies for these languages and/or the training of translation systems that can make use of the labels in this resource.

Although the resource was created semi-automatically, the verb alignment and inferred tenses are of high precision, especially in the second file contained in the package:
Tense-Annotation-full.txt : complete alignment.
Tense-Annotation-gold.txt : alignments only for cases where there is an EN /and/ an FR tense that was inferred from the verbs.
- references: C-000766: European Parliament Proceedings Parallel Corpus
C-005068: Abstract Meaning Representation (AMR) Annotation Release 2.0
*Introduction*

Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.

AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.

LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12).

*Data*

The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:

Dataset                 Training   Dev  Test  Totals
BOLT DF MT                  1061   133   133    1327
Broadcast conversation       214     0     0     214
Weblog and WSJ                 0   100   100     200
BOLT DF English             6455   210   229    6894
DEFT DF English            19558     0     0   19558
Guidelines AMRs              819     0     0     819
2009 Open MT                 204     0     0     204
Proxy reports               6603   826   823    8252
Weblog                       866     0     0     866
Xinhua MT                    741    99    86     926
Totals                     36521  1368  1371   39260

For those interested in utilizing a standard/community partition for AMR research (for instance in development of semantic parsers), data in the "split" directory contains 39,260 AMRs split roughly 93%/3.5%/3.5% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 39,260 AMRs with no train/dev/test partition.
- replaces: N-004832: Abstract Meaning Representation (AMR) Annotation Release 1.0
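
The partition totals and the rough 93%/3.5%/3.5% proportions quoted for the "split" directory can be re-derived from the per-dataset counts in the table. The following is a minimal sketch of that arithmetic only; the dictionary layout and the function names are illustrative assumptions and are not part of the AMR release or any of its tools.

```python
# Per-dataset AMR counts as (training, dev, test), copied from the table
# in the entry above. The dict layout itself is an illustrative assumption.
counts = {
    "BOLT DF MT": (1061, 133, 133),
    "Broadcast conversation": (214, 0, 0),
    "Weblog and WSJ": (0, 100, 100),
    "BOLT DF English": (6455, 210, 229),
    "DEFT DF English": (19558, 0, 0),
    "Guidelines AMRs": (819, 0, 0),
    "2009 Open MT": (204, 0, 0),
    "Proxy reports": (6603, 826, 823),
    "Weblog": (866, 0, 0),
    "Xinhua MT": (741, 99, 86),
}


def partition_totals(table):
    """Sum the training, dev and test columns across all datasets."""
    train = sum(row[0] for row in table.values())
    dev = sum(row[1] for row in table.values())
    test = sum(row[2] for row in table.values())
    return train, dev, test


def partition_shares(table):
    """Return each partition's share of all AMRs, in percent."""
    totals = partition_totals(table)
    grand_total = sum(totals)
    return tuple(100.0 * t / grand_total for t in totals)


totals = partition_totals(counts)
shares = partition_shares(counts)
print("training/dev/test totals:", totals)
print("shares in percent:", [round(s, 2) for s in shares])
```

The column sums reproduce the "Totals" row (36521, 1368 and 1371, together 39260), and the computed shares agree with the approximate proportions stated in the description.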
