Language resource #: 3330 Results 1431 - 1440 of 2023
  • C-004028: HKCAC
    An adult language corpus of spoken Hong Kong Cantonese (HKCAC) has recently been developed, consisting of spontaneous speech recorded from phone-in programs and forums on the radio in Hong Kong. The database represents the speech of a total of sixty-nine speakers in addition to the program hosts, and has approximately 170,000 characters. It is believed that HKCAC will be of great value to linguists who are interested in studying Cantonese, and to speech therapists and educators who work with the Cantonese-speaking population.
  • C-004029: HKCPSC
    This study investigated the development of the mental representation of Chinese disyllabic words. Unlike alphabetic languages, Chinese is a logographic system in which the character is the basic unit of meaning. Most Chinese words are composed of two characters. Theoretically, a Chinese compound word can be read either as a whole unit or as its component characters. Subjects were asked to read aloud a list of two-character words, controlled for word and component character frequencies across grades. The percentage of correct responses was analyzed using three two-way analyses of variance. Results indicated that children are able to make use of both levels of reading as early as Grade 1. Lower graders tended to rely more on component-character-level reading processes, while higher graders tended to read words as whole units more.
  • C-004030: NTU Corpus of Formosan Languages
    Most of the Formosan languages lack written records and many have either become extinct or are now seriously endangered. The creation of this linguistic database is an attempt not only to preserve valuable linguistic heritage, but also to provide a systematic recording of these languages for the benefit of linguistic research. This database contains first-hand data transcribed largely according to Du Bois et al. (1993). The Intonation Unit (IU) serves as the basic unit for a detailed recording of linguistic phenomena, including pauses, repetitions, repair, and intonation. Aside from recorded data, the database also contains hundreds of field notes gathered in the course of field research. These pieces of information are precious for the linguistics researcher, as they reveal the structure of a language and show the interface between language and the human cognitive system. http://corpus.linguistics.ntu.edu.tw/index_en.php
  • C-004031: Cleaneval development dataset
    This is a Perl script which takes two arguments: first, the file to be scored, and second, the gold-standard file to compare it with. It calculates two scores: (1) one based on the edit distance between the two files and the extent to which contestant-inserted markup tags indicate blocks of text starting and ending in the same places; and (2) one based on alignment of the text alone, ignoring the contestant-inserted markup tags. Comments in the code provide more detail. It has been well tested for English but not so well tested for Chinese: we hope to publish an amended version for Chinese shortly. Script available at http://cleaneval.sigwac.org.uk/cleaneval_scorer.zip (zipped so that our server does not try to run it).
    • isRequiredBy: Cleaneval
  • C-004032: SCoRE: Singapore Corpus of Research in Education
    SCoRE is a large collection of data on classroom interactions, teaching materials and students' assignments in Singapore primary and secondary schools, drawn from various research projects. The proposed deliverables include a speech subcorpus, a lexical subcorpus, and several multilevel annotated subcorpora at different development stages. Eventually all these subcorpora will be indexed and incorporated into a large corpus database, which will be provided with sophisticated query tools for both online and offline queries.
  • C-004033: Singaporean Preschoolers Oral Competence in Mandarin
    This is a specific-focus project investigating the relationship between Singaporean Chinese children's home language use and their oral Mandarin competence. A random sampling approach was adopted: 1000 boys and girls aged 5 and 6 were recruited from 36 childcare centers and kindergartens (17 public, 10 church and 9 private). In addition to the collection and processing of an equal number of sociolinguistic questionnaires from their parents, the oral production of 600 of the 1000 participants (300 hours of audio recordings) and 24 videotaped classroom observations (12 hours of video recordings) were transcribed and annotated. The ultimate goal of this project is to compile a multi-modal corpus of Singapore preschool children's oral language in Mandarin.
  • C-004034: Hindi Speech Data base
    This resource relates to speech technology. The database is intended to support the development of Automatic Speech Recognition (ASR) systems for Hindi.
  • C-004035: Mandarin Topic-oriented Conversation Corpus
    The annotation system is designed to mark discourse functions in natural conversations. Opening, main discussion and closing are the three main parts of a natural, topic-oriented conversation. The main discussion contains discourse functions intended to start a discussion, to negotiate a topic, to introduce a topic, to talk about a topic, and to end the discussion.
  • C-004036: Mandarin Map Task Corpus
    The Mandarin Map Task Corpus (MMTC) was recorded in 2002, from January to March. There are 30 task-oriented conversations between speakers who were familiar with each other. One speaker with a detailed map had to give oral instructions to the other speaker, who had a simplified map, to reach three destinations on the map. The total length of the conversations is 5 hours. The average length of each conversation is 10 minutes.
  • C-004037: The Mandarin Conversational Dialogue Corpus
    The Mandarin Conversational Dialogue Corpus (MCDC) was recorded in 2001, from March to July. The conversations are natural conversations between two strangers. The conversation partners had to introduce themselves at the beginning of the conversation. The rest of the conversation was completely up to the conversation partners. There are 60 speakers in total. The total length of the 30 conversations is 25.6 hours; the average length of each conversation is 50 minutes.
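
The text-only comparison described under C-004031 (alignment of text alone, ignoring contestant-inserted markup tags) can be illustrated with a small Python sketch. This is not the Cleaneval Perl scorer and makes no claim to reproduce its scores: the function names, the tag pattern, and the normalization by the longer token sequence are all assumptions chosen for illustration.

```python
import re

# Assumption: markup tags are anything of the form <...>; the real scorer
# may recognize a specific tag set instead.
TAG_RE = re.compile(r"<[^>]+>")


def tokens(text):
    """Strip tag-like markup and split the remaining text on whitespace."""
    return TAG_RE.sub(" ", text).split()


def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def text_only_score(candidate, gold):
    """Hypothetical score in [0, 1]: 1 minus the normalized token edit distance."""
    c, g = tokens(candidate), tokens(gold)
    longest = max(len(c), len(g))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(c, g) / longest
```

Called as `text_only_score(open(cleaned).read(), open(gold).read())`, the function mirrors the two-argument interface described for the script (file to be scored first, gold standard second). Whitespace tokenization suits English; Chinese text would need a different segmentation, such as per-character tokens.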