Language resource #: 3330 Results 2001 - 2010 of 2023
  • C-005052: Article Data of Yomiuri Shimbun (Japanese) 2015
    The database contains about 270,000 Japanese newspaper articles from Yomiuri Newspaper published in 2015. The database is exclusively for research and academic use and is intended to support development and studies in such fields as linguistics, informatics or media study. The data is provided in CSV format.
  • C-005053: Article Data of Yomiuri Shimbun (Japanese) 2016
    The database contains about 260,000 Japanese newspaper articles from Yomiuri Newspaper published in 2016. The database is exclusively for research and academic use and is intended to support development and studies in such fields as linguistics, informatics or media study. The data is provided in CSV format.
  • C-005061: CMU_ARCTIC speech synthesis databases
    The CMU_ARCTIC databases were constructed at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced, US English single speaker databases designed for unit selection speech synthesis research.

    The databases consist of around 1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.

    The distributions include 16 kHz waveforms and simultaneous EGG signals. Full phonetic labelling was performed by CMU Sphinx using the FestVox-based labelling scripts. Complete runnable Festival voices are included with the database distributions as examples, though better voices can be made by improving the labelling.
  • C-005062: The Vera am Mittag German Audio-Visual Spontaneous Speech Database
    The VAM corpus consists of 12 hours of recordings of the German TV talk-show “Vera am Mittag” (Vera at noon). They are segmented into broadcasts, dialogue acts and utterances, respectively. This audio-visual speech corpus contains spontaneous and very emotional speech recorded from unscripted, authentic discussions between the guests of the talk-show. Such data may be of great interest to all research groups working on spontaneous speech analysis, emotion recognition in both speech and facial expression, natural language understanding, and robust speech recognition. Further interest may arise from a linguist’s viewpoint in the variety of German regional accents present in the data.

    In addition to the audio-visual data and the segmented utterances, we provide emotion labels for a large part of the data. This labeling follows state-of-the-art insights from emotion psychology: the emotion labels, assigned by a large number of human evaluators, are given on a continuous-valued scale for three emotion primitives: valence (negative vs. positive), activation (calm vs. excited) and dominance (weak vs. strong).
  • C-005063: AV16.3
    The AV16.3 corpus is an audio-visual corpus of 43 real indoor multi-speaker recordings, designed to test algorithms for audio-only, video-only and audio-visual speaker localization and tracking. Real human speakers were used. The variety of recordings was chosen to test algorithms to their limits and to cover a wide range of application scenarios (meetings, surveillance). The emphasis is on overlapped speech and multiple moving speakers. The recordings are mostly dynamic scenarios with single and multiple moving speakers; a few meeting scenarios, with mostly seated speakers, are also included.
  • C-005064: Disco-Annotation
    Disco-Annotation is a collection of training and test sets with manually annotated discourse relations for 8 discourse connectives (although, as, however, meanwhile, since, though, while, yet) in Europarl texts (http://www.idiap.ch/dataset/europarl-direct).
    For each connective there is a training set and a test set. The relations were annotated by two trained annotators using a translation spotting method. The division into training and test sets also allows for comparison if you train your own models.
  • C-005065: Mediaparl
    Mediaparl is a Swiss accented bilingual database containing recordings in both French and German as they are spoken in Switzerland. The data were recorded at the Valais Parliament. Valais is a bi-lingual Swiss canton with many local accents and dialects. Therefore, the database contains data with high variability and is suitable to study multilingual, accented and non-native speech recognition as well as language identification and language switch detection.
  • C-005066: MOBIO
    The MOBIO database consists of bi-modal (audio and video) data taken from 152 people. The database has a female-male ratio of nearly 1:2 (100 males and 52 females) and was collected from August 2008 until July 2010 at six different sites in five different countries. This led to a diverse bi-modal database with both native and non-native English speakers.

    In total, 12 sessions were captured for each client: 6 sessions for Phase I and 6 sessions for Phase II. The Phase I data consists of 21 questions, with question types including Short Response Questions, Short Response Free Speech, Set Speech, and Free Speech. The Phase II data consists of 11 questions, with question types including Short Response Questions, Set Speech, and Free Speech.
  • C-005067: Tense-Annotation
    This dataset contains parallel English and French texts from the Europarl corpus (Koehn, 2005).

    The files provide alignments of EN and FR verbs along with information on their position, tense and voice, and can therefore be used in translation studies for these languages and/or for training translation systems that can make use of the labels in this resource.

    Although the resource was created semi-automatically, the verb alignment and inferred tenses are of high precision, especially in the second file contained in the package:
    Tense-Annotation-full.txt: the complete alignment.
    Tense-Annotation-gold.txt: alignments only for cases where both an EN and an FR tense were inferred from the verbs.
  • C-005068: Abstract Meaning Representation (AMR) Annotation Release 2.0
    *Introduction*

    Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.

    AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
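    As an illustration (taken from the public AMR specification, not from this release), the canonical example pairs the sentence "The boy wants to go" with the following graph in PENMAN notation; want-01 and go-01 are PropBank frames, and the reuse of the variable b marks within-sentence coreference:

        (w / want-01
           :ARG0 (b / boy)
           :ARG1 (g / go-01
                    :ARG0 b))

    Here the boy is both the wanter (ARG0 of want-01) and the intended goer (ARG0 of go-01), a relation the graph expresses without reference to the sentence's syntax.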

    LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12).


    *Data*

    The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:

    Dataset                  Training    Dev   Test   Totals
    BOLT DF MT                   1061    133    133     1327
    Broadcast conversation        214      0      0      214
    Weblog and WSJ                  0    100    100      200
    BOLT DF English              6455    210    229     6894
    DEFT DF English             19558      0      0    19558
    Guidelines AMRs               819      0      0      819
    2009 Open MT                  204      0      0      204
    Proxy reports                6603    826    823     8252
    Weblog                        866      0      0      866
    Xinhua MT                     741     99     86      926
    Totals                      36521   1368   1371    39260

    For those interested in utilizing a standard/community partition for AMR research (for instance in development of semantic parsers), data in the "split" directory contains 39,260 AMRs split roughly 93%/3.5%/3.5% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 39,260 AMRs with no train/dev/test partition.