Language resource #: 3330
-
C-000786: Longman/Lancaster Corpus
The Longman/Lancaster Corpus is a large computerised body of language made up of running text from a wide range of sources. It is 30 million words of written language taken from literature, magazines, papers and more ephemeral materials such as leaflets and packaging. The only global corpus that is carefully constructed to be as representative of written language as possible and a true reflection of twentieth century English, the Longman/Lancaster Corpus offers lexicographers, editors and authors all the information they need to know to write top quality dictionaries and EFL materials.
-
C-000791: OPUS
OPUS is an attempt to collect translated texts from the web, to convert and align the entire collection, to add linguistic data, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and is also delivered as an open source package. We used several tools to compile the current corpus. (Manual corrections have not been made.)
-
C-000792: OrienTel Turkish database
Telephone
The Turkish OrienTel database comprises 1700 Turkish speakers (921 males, 779 females) recorded over the Turkish fixed and mobile telephone network. The database is distributed on 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
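As an illustration of the storage format described above, the following minimal sketch expands 8-bit A-law codes to 16-bit linear samples using the standard ITU-T G.711 expansion. It assumes the signal files are headerless, with one A-law byte per sample. The function names and the `read_signal_file` helper are hypothetical and are not part of the OrienTel distribution.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate stated in the database description


def alaw_byte_to_pcm16(code: int) -> int:
    """Expand one 8-bit A-law code to a signed linear sample (G.711)."""
    code ^= 0x55                    # undo the even-bit inversion used by A-law
    magnitude = (code & 0x0F) << 4  # 4-bit mantissa
    segment = (code & 0x70) >> 4    # 3-bit segment number
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # After the inversion, a set top bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(raw: bytes) -> list[int]:
    """Decode a buffer of A-law bytes into linear samples."""
    return [alaw_byte_to_pcm16(b) for b in raw]


def duration_seconds(raw: bytes) -> float:
    """Duration of a buffer, given one byte per sample at 8 kHz."""
    return len(raw) / SAMPLE_RATE_HZ


def read_signal_file(path: str) -> list[int]:
    """Hypothetical helper: assumes the file is raw A-law with no header."""
    with open(path, "rb") as handle:
        return decode_alaw(handle.read())
```

Under these assumptions, a file of 8000 bytes holds exactly one second of speech. The accompanying SAM label file remains the authoritative source for the actual signal parameters.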
Each speaker uttered the following items:
1 isolated single digit
1 sequence of 10 isolated digits
5 connected digits : 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
1 currency money amount
2 natural numbers
3 dates : 1 spontaneous (date or year of birth), 1 prompted date, 1 relative or general date expression
2 time phrases : 1 time of day (spontaneous), 1 time phrase (word style)
3 spelled words : 1 spontaneous (own forename), 1 city name, 1 real word for coverage
5 directory assistance utterances : 1 spontaneous, own forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
2 yes/no questions : 1 predominantly yes question, 1 predominantly no question
6 application keywords/keyphrases
1 word spotting phrase using embedded application words
4 phonetically rich words
9 phonetically rich sentences
The following age distribution has been obtained: 982 speakers are between 16 and 30, 431 speakers are between 31 and 45, 274 speakers are between 46 and 60; the age of 13 speakers is unknown.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
-
C-000797: Polytechnic of Wales Corpus
The corpus was originally collected between 1978 and 1984 for a child language development project to study the use of various syntactico-semantic constructs in children between the ages of six and twelve.
- isPartOf: The Oxford Text Archive
-
C-000801: THE LOB CORPUS
The Lancaster-Oslo/Bergen (LOB) Corpus is a million-word collection of present-day British English texts, compiled under the direction of Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo, in collaboration with Knut Hofland, Norwegian Computing Centre for the Humanities, Bergen. Like its American counterpart, the Brown Corpus (see Francis and Kučera 1979), it contains 500 text samples of approximately 2,000 words distributed over 15 text categories.
In the tagged corpus each word is accompanied by a word-class tag, assigned through a combination of automatic tagging programs and manual pre- and post-editing. There is no syntactic bracketing.
- conformsTo: C-000751: Brown Corpus
-
C-000802: TIGER
The TIGER Treebank (Version 2) consists of approximately 900,000 tokens (50,000 sentences) of German newspaper text, taken from the Frankfurter Rundschau. The corpus was semi-automatically POS-tagged and annotated with syntactic structure. It also contains morphological and lemma information for terminal nodes. For details, see the annotation page.
- replaces: Version 1 of the TIGER Treebank
-
C-000806: The Bank of English
The Bank of English® (part of the Collins Word Web) is a collection of modern English language held on computer for analysis of words, meanings, grammar and usage. In linguistics and lexicography such a collection is called a corpus.
- isPartOf: C-000760: Collins Word Web
- isReferencedBy: Collins COBUILD Advanced Learner's English Dictionary
- requires: C-000826: WordbanksOnline
-
C-000807: The Cambridge International Corpus
The Cambridge International Corpus (CIC) is a very large collection of English texts, stored in a computerised database, which can be searched to see how English is used. It has been built up by Cambridge University Press over the last ten years to help in writing books for learners of English. The English in the CIC comes from newspapers, best-selling novels, non-fiction books on a wide range of topics, websites, magazines, junk mail, TV and radio programmes, recordings of people's everyday conversations and many other sources.
- hasPart: Cambridge and Nottingham Corpus of Discourse in English (CANCODE)
- hasPart: Cambridge and Nottingham Spoken Business English (CANBEC)
- hasPart: N-000755: Cambridge Cornell Corpus of Spoken North American English
- hasPart: Cambridge Corpus of Spoken North American English (CAMSNAE)
- references: C-000452: American National Corpus
- hasPart: N-000756: Cambridge Corpus of Business English
- hasPart: Cambridge Corpus of Legal English
- hasPart: Cambridge Corpus of Financial English
- hasPart: Cambridge Corpus of Academic English
- hasPart: C-000757: Cambridge Learner Corpus
-
C-000808: The East African Component of The International Corpus of English
The East African component of The International Corpus of English (ICE-EA) is a computerized collection of spoken and written texts from Kenya and Tanzania. Two versions are described: the complete version in rich text format (RTF), which, in addition to the texts of 2000 words, contains the full versions of the texts and all tagging; and the reduced version as text only (ASCII), which consists of just the 2000-word texts and the element-attached tagging.
- conformsTo: C-000480: The international corpus of English
-
C-000811: The Helsinki Corpus of English Texts: Diachronic Part
The diachronic part of the Helsinki Corpus includes a basic selection of texts compiled from the Old, Middle and Early Modern (British) English periods, and a supplementary part focusing on regional varieties (Scots now available and early American English in preparation).
- hasPart: