Language resource #: 3330
-
C-000476: The Lancaster Los Angeles Spoken Chinese Corpus
The Lancaster - Los Angeles Spoken Chinese Corpus (LLSCC) is a corpus of spoken Mandarin Chinese, composed of one million words of dialogues and monologues, both spontaneous and scripted. The spoken discourse types covered are: face-to-face conversations, telephone calls, play/movie scripts, TV talk shows, public debates, oral narratives, and edited oral narratives. The corpus is XML-compliant. Each corpus file is composed of a corpus header and a text body. The header gives general information about the corpus file. In the body, utterance units (or paragraphs), sentences and tokens are marked up, with each token also annotated for part of speech.
- isPartOf: The Lancaster Corpus of Mandarin Chinese
-
C-000477: The Machine Readable Spoken English Corpus
The MARSEC corpus of spoken standard southern British English is a development of the Lancaster/IBM Spoken English Corpus (SEC). See http://www.ling.lancs.ac.uk/staff/gerry/SEC.htm for related publications and http://www.hd.uib.no/icame/lanspeks.html for details on ordering the orthographic, prosodic and part-of-speech annotated transcriptions of the SEC on CD-ROM from ICAME.
The SEC edition of the corpus comprises annotated orthographic transcriptions of the spoken material but does not include the acoustic material. The MARSEC edition of the corpus adds the acoustic recordings on a second CD-ROM (see below for ordering details) and includes word-level time-alignment (downloadable below) between the transcripts and the acoustic signal.
The CD-ROM contains digital sample files of the original corpus recordings. Each recording has been divided into samples of no more than one minute in duration. The sample files are raw, headerless, mono, 16-bit PCM (Intel byte order, i.e. little-endian), sampled at 16000 samples per second. This format can easily be imported into many audio applications.
-
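As an illustration of the MARSEC sample format described above (raw, headerless, mono, 16-bit little-endian PCM at 16000 samples per second), the following minimal Python sketch decodes such data and wraps it in a WAV container for tools that expect a header. The function names are hypothetical and are not part of the corpus distribution; only the format parameters come from the description.

```python
import io
import struct
import wave

SAMPLE_RATE = 16000  # samples per second, as stated in the corpus description
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)
CHANNELS = 1         # mono


def decode_samples(raw_bytes):
    """Decode headerless little-endian 16-bit PCM into a list of ints."""
    if len(raw_bytes) % SAMPLE_WIDTH != 0:
        raise ValueError("data length is not a whole number of 16-bit samples")
    count = len(raw_bytes) // SAMPLE_WIDTH
    # "<" forces little-endian (Intel) byte order; "h" is a signed 16-bit int.
    return list(struct.unpack(f"<{count}h", raw_bytes))


def duration_seconds(raw_bytes):
    """Duration implied by the byte count at the stated sample rate."""
    return len(raw_bytes) / (SAMPLE_WIDTH * CHANNELS * SAMPLE_RATE)


def raw_pcm_to_wav(raw_bytes):
    """Return the same audio wrapped in a WAV container (hypothetical helper)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        # WAV PCM data is also little-endian, so the bytes are written as-is.
        wav_file.writeframes(raw_bytes)
    return buffer.getvalue()
```

In practice the bytes would be read from one of the sample files with `open(path, "rb").read()`; the file naming on the CD-ROM is not described here, so no path is assumed.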
C-000478: The PDC2000 Corpus of Chinese News Text
The PDC2000 Corpus of Chinese News Text is built from one year of data (the year 2000) provided by the People's Daily Press, Beijing. The corpus contains approximately 15 million tokens. PDC2000 is encoded in Unicode (UTF-8) and marked up in XML. There are 366 files in the corpus, one per day, each marked up with the month and the date. Each corpus file consists of a corpus header and the corpus text proper. The corpus header applies the ELDA (Evaluations and Language Resources Distribution Agency) Metadata Scheme version 1.40. The corpus text is marked up for paragraphs, sentences and tokens. Sentences are numbered consecutively within each file, while tokens are annotated for part of speech using the Peking University tagset.
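The paragraph/sentence/token markup described above could be read roughly as follows. This is only a sketch: the element and attribute names (`p`, `s` with an `n` attribute, `w` with a `POS` attribute), the sample text and the tag values are assumptions made for illustration, not the documented PDC2000 schema, which should be checked against the actual corpus files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: element names, attributes and tags are assumed.
SAMPLE = """<text>
  <p>
    <s n="1"><w POS="n">新闻</w><w POS="v">报道</w><w POS="w">。</w></s>
    <s n="2"><w POS="r">我们</w><w POS="v">学习</w><w POS="n">语言</w><w POS="w">。</w></s>
  </p>
</text>"""


def read_sentences(xml_text):
    """Return a list of (sentence number, [(token, pos), ...]) pairs."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sentence in root.iter("s"):
        tokens = [(w.text or "", w.get("POS")) for w in sentence.iter("w")]
        sentences.append((sentence.get("n"), tokens))
    return sentences


def pos_counts(sentences):
    """Count how often each part-of-speech tag occurs."""
    return Counter(pos for _, tokens in sentences for _, pos in tokens)


if __name__ == "__main__":
    for number, tokens in read_sentences(SAMPLE):
        print(number, " ".join(f"{word}/{pos}" for word, pos in tokens))
```

Because sentences are numbered consecutively within each file, the `n` value together with the file's date would identify a sentence across the corpus, assuming the numbering is exposed as an attribute in this way.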
-
C-000479: The UCLA Chinese Corpus
The UCLA Chinese Corpus is designed as a Chinese counterpart to the FLOB and Frown corpora of British and American English for contrastive research, as well as a recent update of the Lancaster Corpus of Mandarin Chinese (LCMC) for diachronic studies of possible changes in written Chinese over the past decade. Since this period is of special significance because of the impact of the Internet on language, especially on Chinese, the corpus is an excellent complement to LCMC. The samples in the corpus were all collected from written modern Chinese available on the Internet during the period 2000-2005, though some texts may have been converted from paper-based publications of earlier years. File types are matched as closely as possible to the Brown corpus model, with some variations (e.g. adventure fiction) to accommodate Chinese characteristics, while the proportions of the different text categories may vary from those of the English counterparts and LCMC.
- hasVersion: The Lancaster Corpus of Mandarin Chinese
-
C-000480: The International Corpus of English
The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Fifteen research teams around the world are preparing electronic corpora of their own national or regional variety of English.
- references: Bailey, Richard W. & Manfred Görlach (eds) (1984) English as a World Language. Cambridge: Cambridge University Press.
-
G-000481: The Enabling Minority Language Engineering Corpus
A set of corpora for fifteen languages of South Asia. The corpus includes a re-coded version of the corpus collection of the Central Institute of Indian Languages (CIIL). The data include monolingual written data, monolingual spoken data, and parallel data. The total size is 97 million words.
-
C-000484: WebCorp
However large and up-to-date the electronic text corpora available are, there will always be aspects of the language which are too rare or too new to be evidenced in them. WebCorp is a suite of tools which allows access to the World Wide Web as a corpus - a large collection of texts from which facts about the language can be extracted.
-
C-000485: ASJ Continuous Speech Corpus for Research
Recorded speech waveforms of simulated dialogues for guidance tasks and of dictated text.
-
C-000486: CD-ROM Nikkei Full-text Database - Nikkei Business Daily 1990
This CD-ROM contains the full text of all articles of Nihon Keizai Shimbun. Users can search by keyword. A headline list is included.
- isVersionOf: Nikkei Full-text Database (1990-2006 each)
-
C-000487: CD-ROM Nikkei Full-text Database - Nikkei Business Daily 1991
This CD-ROM contains the full text of all articles of Nihon Keizai Shimbun. Users can search by keyword. A headline list is included.
- isVersionOf: Nikkei Full-text Database (1990-2006 each)