Language resource #: 3330 Results 1511 - 1520 of 2023
  • C-004130: KorpusDK
    KorpusDK aims to integrate Korpus 2000 and Korpus 90 into ordnet.dk and gradually expand them with new text material. With KorpusDK you can investigate language usage by querying a collection of Danish texts totalling 56 million words.
  • C-004131: Leipzig Corpora Collection
    The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. The corpora are ready to use with the Corpus Browser. Moreover, all data are available as plain text and as MySQL database tables for various applications. They are intended both for scientific use by the corpus linguist as well as for applications such as knowledge extraction programs.
  • C-004132: National Corpus of Polish
    These four institutions have started cooperation to build a reference corpus of the Polish language containing hundreds of millions of words. The corpus, which will appear soon on this site, will be searchable by means of advanced tools that analyse Polish inflection and Polish sentence structure.
  • C-004133: Uppsala Corpus
    The Uppsala Corpus (Upsal'skij korpus russkix tekstov) consists of some 600 Russian texts with a total of one million running words (word tokens), equally divided between informative and literary prose. The informative texts are from between 1985 and 1989, while the literary texts, whose vocabulary does not date as quickly, cover a longer period, 1960-88. The corpus does not include poetry or drama.
  • C-004135: Scottish Gaelic corpus
    Corpus contents:

    * conversation.txt - an informal conversation
    * lecture.txt - a university lecture on philosophy
    * sermon.txt - a sermon from a Church of Scotland communion service
    * service.txt - a second sermon
    * talk.txt - an informal educational/historical/religious talk
  • C-004136: Welsh corpus
    Corpus contents:

    * cathedral.txt - sermon from cathedral eucharist
    * chapel.txt - sermon from chapel service
    * chat-1.txt - television talk/magazine show
    * chat-2.txt - television talk/magazine show
    * demog.txt - informal domestic conversation
    * dentist-1.txt - dental appointment
    * dentist-2.txt - dental appointment
    * football.txt - football magazine show
    * rugby.txt - rugby commentary
    * school.txt - school history lesson
  • C-004139: Comparable corpus of English and Russian news texts
    The English corpus is based on a subset of the corpus of Reuters news, a collection of newswires from Reuters for one year from 1996-08-20 to 1997-08-19. The Russian corpus is based on articles from Izvestia, a national broadsheet newspaper, and covers the period from 2000 to 2001.
  • C-004140: The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication
    Desktop/Microphone
    This database has been collected and packaged under the auspices of the IST-EU STREP project HIWIRE (Human Input that Works In Real Environments). The database was designed to be used as a tool for development and test of speech processing and recognition techniques dealing with robust non-native speech recognition.

    The database contains 8,099 English utterances pronounced by non-native speakers (31 French, 20 Greek, 20 Italian, and 10 Spanish speakers). The collected utterances correspond to human input in a command and control aeronautics application. The data was recorded in a studio with a close-talking microphone, and real noise recorded in an airplane cockpit was artificially added to the data. The signals are provided in clean (studio recordings with close-talking microphone), low, mid and high noise conditions. The three noise levels correspond approximately to signal-to-noise ratios of 10 dB, 5 dB and -5 dB respectively.

    Clean audio data has been recorded in different office rooms using a close-talking microphone (Plantronics USB-45) to minimize ambient acoustic effects. The sampling frequency is 16 kHz and the data is stored in Windows PCM WAV format (16 bits, mono).

    Recordings correspond to prompts extracted from an aeronautic command and control application. A total of 8,099 utterances have been recorded corresponding to 81 speakers pronouncing 100 utterances each. The speaker distribution is as follows:

    Country   # Speakers    # Utterances
    France    31 (38.3%)    3100
    Greece    20 (24.7%)    2000
    Italy     20 (24.7%)    2000
    Spain     10 (12.3%)     999
    Total     81            8099

    To generate the noisy utterances, the speech level is maintained and only the noise amplitude is modified to obtain the desired SNR. The noise amplitude is adjusted to obtain three different averaged SNR values of 10 dB, 5 dB and -5 dB, which are referred to as the low noise (LN), mid noise (MN) and high noise (HN) conditions. Within each condition the noise level remains constant.
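
    The noise scaling described above can be sketched as follows. This is only an illustration, not the HIWIRE project's actual tooling: the function names are hypothetical, and it assumes the SNR is computed from the average power over the whole utterance, which the description does not specify.

```python
import math


def mean_power(samples):
    """Average power (mean of squared samples) of a signal."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in dB, from whole-signal average powers."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


def scale_noise_for_snr(speech, noise, target_snr_db):
    """Rescale the noise only (speech is untouched) to reach target_snr_db.

    Assumption: SNR is defined on whole-utterance average power.
    From SNR_dB = 10*log10(Ps / (g^2 * Pn)) the gain is
    g = sqrt(Ps / (Pn * 10^(SNR_dB / 10))).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise signal is silent")
    gain = math.sqrt(
        mean_power(speech) / (noise_power * 10 ** (target_snr_db / 10))
    )
    return [gain * n for n in noise]


def add_noise(speech, noise, target_snr_db):
    """Mix speech with noise rescaled to the requested SNR."""
    scaled = scale_noise_for_snr(speech, noise, target_snr_db)
    return [s + n for s, n in zip(speech, scaled)]
```

    Under these assumptions, calling add_noise with 10, 5 and -5 would produce the LN, MN and HN versions of a clean utterance. The sketch does not handle clipping when the mixed signal is written back to 16-bit samples.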

    The speech data are PCM WAV files (16 kHz / 16 bits / mono) stored on one DVD. The total size is 3.03 Gbytes for 33,053 files.
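
    A file can be checked against the stated format (16 kHz, 16 bits, mono) with Python's standard wave module. This is a minimal sketch; the helper names are hypothetical and are not part of the database distribution.

```python
import wave

# Format stated in the database description: 16 kHz, 16 bits, mono.
EXPECTED_FORMAT = {
    "channels": 1,
    "sample_width_bytes": 2,
    "frame_rate_hz": 16000,
}


def wav_format(source):
    """Read the basic format fields of a WAV file (path or binary file object)."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
        }


def matches_stated_format(source):
    """True if the file is 16 kHz, 16-bit, mono PCM WAV."""
    return wav_format(source) == EXPECTED_FORMAT
```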
  • C-004143: BySoc
    Corpus BySoc consists of transcriptions of 78 (or so) long conversations, conceived as Labovian interviews. The interviews were collected by a team of Danish sociolinguists in the late eighties, in connection with the large-scale Project Urban Sociolinguistics. The informants were all native Danes, most of them born and raised in the suburb Nyboder of central Copenhagen. The BySoc corpus is thus unusually homogeneous with respect to speech style, extra-linguistic conditions, informants' backgrounds, etc. Moreover, the informants were selected according to an elaborate plan, so that all combinations of age, social class, and sex are represented in the material. Detailed records of the informants' personal data are on file. This project deals with the establishment, exchange and utilization of speech corpora. Representatives of all the Nordic countries, as well as Estonia, participate.
    • isPartOf: NordTalk
  • C-004144: amph
    The amph micro-corpus consists of altogether 3404 occurrences of the four most common Finnish THINK lexemes, ajatella, miettiä, pohtia, and harkita 'think, reflect, ponder, consider'.