Language Resource Search - SHACHI: Language Resource Metadata Database

Language resource #: 3330 Results 1151 - 1160 of 2023

C-003404: Japanese Dialect Database Volume 2: Iwate/Akita
This is an information packet containing acoustic and written data of Japanese dialect discourse spoken by elderly native speakers. The data were collected during 1977-1985 in a project of the Agency for Cultural Affairs to record and preserve the traditional Japanese dialects, and they will be compiled into 20 volumes in total (available as CD-ROM/CD/books). The volumes provide valuable information on the current status of the Japanese dialects and should contribute to Japanese language research and education.
- hasVersion: C-003390: Japanese Dialect Database Volume 1: Hokkaido/Aomori
- hasVersion: Japanese Dialect Database Vol. 3 Miyagi / Yamagata / Fukushima
- hasVersion: Japanese Dialect Database Vol. 4 Ibaraki / Tochigi
- hasVersion: Japanese Dialect Database Vol. 5 Saitama / Chiba
- hasVersion: Japanese Dialect Database Vol. 6 Tokyo / Kanagawa
- hasVersion: Japanese Dialect Database Vol. 7 Gunma / Niigata
- hasVersion: Japanese Dialect Database Vol. 8 Nagano / Yamanashi / Shizuoka
- hasVersion: Japanese Dialect Database Vol. 9 Gifu / Aichi / Mie
- hasVersion: Japanese Dialect Database Vol. 10 Toyama / Ishikawa / Fukui
- hasVersion: Japanese Dialect Database Vol. 11 Kyoto / Shiga
- hasVersion: Japanese Dialect Database Vol. 12 Nara / Wakayama
- hasVersion: Japanese Dialect Database Vol. 13 Osaka / Hyogo
- hasVersion: Japanese Dialect Database Vol. 14 Tottori / Shimane / Okayama
- hasVersion: Japanese Dialect Database Vol. 15 Hiroshima / Yamaguchi
- hasVersion: Japanese Dialect Database Vol. 16 Kagawa / Tokushima
- hasVersion: Japanese Dialect Database Vol. 17 Ehime / Kochi
- hasVersion: Japanese Dialect Database Vol. 18 Fukuoka / Saga / Ooita
- hasVersion: Japanese Dialect Database Vol. 19 Nagasaki / Kumamoto / Miyazaki
- hasVersion: C-003495: Japanese Dialect Database Vol. 20 Kagoshima / Okinawa
C-003405: American National Corpus (ANC) Second Release
*Introduction*

This file contains documentation on the ANC Second Release, Linguistic Data Consortium (LDC) catalog number LDC2005T35 and ISBN 1-58563-369-0.

The American National Corpus (ANC) project fosters the development of a corpus comparable to the British National Corpus (BNC), covering American English. Corpus-analytic work has demonstrated that the BNC is inappropriate for the study of American English, due to the numerous differences in use of the language.

The availability of a corpus of American English will significantly contribute to language and linguistic research, the development of language understanding computer applications (e.g., language translation and search and retrieval software), and the compilation of reference works such as dictionaries and thesauri. It will also provide a rich national resource for use in education at all levels.

ANC Second Release contains over 20 million words: 10+ million words added in the Second Release, and a new corrected and validated version of the 11 million word ANC First Release. The Second Release also contains software for searching and retrieving multiple stand-off annotations.

ANC Second Release contains texts from the following sources (* denotes new source in the Second Release):

* Transcribed telephone speech (LDC and Project MORE)
* The New York Times
* Berlitz Travel Guides (Langenscheidt Publishers)
* Slate Magazine (Microsoft)
* ICIC Corpus of Fundraising Texts (Indiana Center for Intercultural Communication)*
* The Michigan Corpus of Academic Spoken English (MICASE) (University of Michigan, English Language Institute)*
* Various non-fiction
* Various fiction (Orin Hargraves, Ferd Eggan)*
* Various medical research articles (BioMed Central, Public Library of Science)*
* Anonymized posts to the Phoenix Board/Buffistas.org*

ANC Second Release contains data governed by two types of license, an open license and a restricted license. Both the Open License Agreement and the Restricted License Agreement must be signed in order to receive ANC Second Release, and the data must be used in accordance with the agreement by which it is governed.

The ANC will ultimately contain a core corpus of at least 100 million words, including both written and spoken (transcripts) data comparable across genres to the BNC. The genres in the ANC will be expanded to include new types of language data that have become available in recent years, such as web blogs and web pages, chats, email, and rap music lyrics. In addition to the core 100 million words, the ANC will include an additional component of potentially several hundreds of millions of words, chosen to provide both the broadest and largest selection of data possible.

The American National Corpus is being developed with the help of a consortium of publishers of American English dictionaries and companies with interests in language processing, which was formed in 1999. Consortium members are providing materials for inclusion in the corpus, and provided initial financial support for the project.

Additional documentation and information is available at the ANC web site at http://www.americannationalcorpus.org/SecondRelease/index.html.

*Samples*

For examples of the various types of data in this corpus, please review the files listed below.

* LDC2005T35_let-biber.xml
* LDC2005T35_let-hepple.xml
* LDC2005T35_let-logical.xml
* LDC2005T35_let-np.xml
* LDC2005T35_let-s.xml
* LDC2005T35_let-vp.xml
* LDC2005T35_let.anc
* LDC2005T35_let.txt
* LDC2005T35_sp-biber.xml
* LDC2005T35_sp-hepple.xml
* LDC2005T35_sp-logical.xml
* LDC2005T35_sp-np.xml
* LDC2005T35_sp-s.xml
* LDC2005T35_sp-vp.xml
* LDC2005T35_sp.anc
* LDC2005T35_sp.txt
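
The sample file names above appear to follow a regular pattern: a catalog prefix, a short document name (let, sp), an optional suffix after the hyphen, and an extension. The following is a minimal Python sketch that groups the listed files by document. It assumes, without confirmation from the corpus documentation, that the hyphenated suffix names a stand-off annotation layer; the regular expression and the function name group_by_document are illustrative only. File names are written with the full catalog prefix LDC2005T35.

```python
import re
from collections import defaultdict

# Sample file names from the list above, with the catalog prefix in full.
SAMPLE_FILES = [
    "LDC2005T35_let-biber.xml", "LDC2005T35_let-hepple.xml",
    "LDC2005T35_let-logical.xml", "LDC2005T35_let-np.xml",
    "LDC2005T35_let-s.xml", "LDC2005T35_let-vp.xml",
    "LDC2005T35_let.anc", "LDC2005T35_let.txt",
    "LDC2005T35_sp-biber.xml", "LDC2005T35_sp-hepple.xml",
    "LDC2005T35_sp-logical.xml", "LDC2005T35_sp-np.xml",
    "LDC2005T35_sp-s.xml", "LDC2005T35_sp-vp.xml",
    "LDC2005T35_sp.anc", "LDC2005T35_sp.txt",
]

# Assumed pattern: <catalog>_<document>[-<layer>].<extension>
NAME_PATTERN = re.compile(
    r"^(?P<catalog>[A-Z0-9]+)_(?P<doc>[a-z]+)"
    r"(?:-(?P<layer>[a-z]+))?\.(?P<ext>[a-z]+)$"
)

def group_by_document(names):
    """Map each document name to its suffixed files and its unsuffixed files."""
    groups = defaultdict(lambda: {"layers": [], "other": []})
    for name in names:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # skip names that do not fit the assumed pattern
        entry = groups[match.group("doc")]
        if match.group("layer"):
            # Suffix after the hyphen, assumed to be an annotation layer.
            entry["layers"].append(match.group("layer"))
        else:
            # Unsuffixed file: record only its extension.
            entry["other"].append(match.group("ext"))
    return dict(groups)

groups = group_by_document(SAMPLE_FILES)
for doc, files in groups.items():
    print(doc, files["layers"], files["other"])
```

Under this reading, each sample document comes with six suffixed XML files plus one .anc and one .txt file.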

*Acknowledgements*

The publication of this corpus was facilitated by funding extended by the TalkBank project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-98009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.
- references: C-000648: CALLHOME American English Transcripts
- references: Switchboard
- references: C-003431: ICIC Corpus of Philanthropic Fundraising Discourse
- references: C-003315: MICASE
- replaces: C-000452: American National Corpus
- hasVersion: Slate coreference subset: http://www.americannationalcorpus.org/annotations.html#slate-coref
- hasVersion: CLAWS C5 part-of-speech annotation subset: http://www.americannationalcorpus.org/annotations.html#claws
- hasVersion: CLAWS C7 part-of-speech annotation subset: http://www.americannationalcorpus.org/annotations.html#claws
- isReferencedBy: D-003412: ANC Second Release Frequency Data
- hasPart: C-003410: American National Corpus Second Release - Open Portion
- requires: ANCTool: http://AmericanNationalCorpus.org/tools/index.html
- isReferencedBy: Randi Reppen, Nancy Ide, and Keith Suderman, 2005, American National Corpus (ANC) Second Release, Linguistic Data Consortium, Philadelphia
- conformsTo: C-001546: Treebank-2
C-003410: American National Corpus Second Release - Open Portion
The Open ANC (OANC) includes over 14 million words from the American National Corpus (ANC) Second Release, a massive electronic collection of American English including texts of all genres and transcripts of spoken data produced from 1990 onward. The file organization and encoding conventions for the OANC are the same as in the Second Release. All annotations were originally produced automatically using GATE's ANNIE system and cover structural markup (sections, chapters, etc.), tokens with part-of-speech tags (Penn tagset), noun chunks, and verb chunks. Some of the texts in the OANC include manually validated sentence boundaries that are not included in the Second Release.
- references: C-003431: ICIC Corpus of Philanthropic Fundraising Discourse
- references: Switchboard
- isPartOf: C-003405: American National Corpus (ANC) Second Release
- requires: ANC Tool (http://www.americannationalcorpus.org/tools/anctool-installer.jar)
C-003431: ICIC Corpus of Philanthropic Fundraising Discourse
The ICIC Fundraising Corpus project is an ongoing project to build a corpus of fundraising texts and to study the persuasive use of language in case statements, annual reports, grant proposals, and direct mail letters. The Corpus includes over 900 fundraising documents from 236 organizations and totals over 1 million words. This research project is significant in that it strives to highlight the links between rhetorical and linguistic analysis, and the ways in which these analyses work together to persuade potential donors.
- isReferencedBy: C-003405: American National Corpus (ANC) Second Release
- isReferencedBy: C-003410: American National Corpus Second Release - Open Portion
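
The relation lines of the three ANC-related records above (C-003405, C-003410, C-003431) come in pairs: references/isReferencedBy and hasPart/isPartOf. The following is a minimal Python sketch that checks whether each such link has its counterpart on the other record. Treating these pairs as strict inverses is an assumption made for illustration, and the function name missing_inverses is hypothetical; nothing here describes how the database itself validates its records.

```python
# Relation pairs treated as inverses of each other (an assumption for this sketch).
INVERSE = {
    "references": "isReferencedBy",
    "isReferencedBy": "references",
    "hasPart": "isPartOf",
    "isPartOf": "hasPart",
}

# (subject, relation, object) triples copied from the records above.
# Only links among C-003405, C-003410, and C-003431 are included.
TRIPLES = [
    ("C-003405", "references", "C-003431"),
    ("C-003405", "hasPart", "C-003410"),
    ("C-003410", "references", "C-003431"),
    ("C-003410", "isPartOf", "C-003405"),
    ("C-003431", "isReferencedBy", "C-003405"),
    ("C-003431", "isReferencedBy", "C-003410"),
]

def missing_inverses(triples):
    """Return the inverse triples that are expected but not present."""
    present = set(triples)
    missing = []
    for subject, relation, obj in triples:
        inverse = INVERSE.get(relation)
        if inverse and (obj, inverse, subject) not in present:
            missing.append((obj, inverse, subject))
    return missing

result = missing_inverses(TRIPLES)
print(result)
```

An empty result means every listed link among the three records is mirrored on its target record.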
C-003436: ATR Natural Speech Database (English translation)
The corpus contains 4 subcorpora of speech recordings of 515 speakers, simulating tourism-domain telephone conversations and reading the ATR 503 phonetically balanced sentences (50 sentences/speaker). All utterances and sentences are in the Japanese language. The corpus also includes corresponding orthographic/phonemic transcripts and morphological information data for the spontaneous conversations and transcripts for the phonetically balanced sentences.
- references: ATR 503 Phonetically Balanced Sentences
C-003438: ATR Natural Speech and Language Database (English translation)
The corpus contains speech recordings of Japanese and English speakers, simulating telephone conversations between a Japanese speaker and an English speaker via translator, and reading the ATR 503 phonetically balanced sentences (50 sentences/speaker). For the spontaneous conversations, corresponding orthographic/phonemic transcripts, morphological information data and syntactically annotated data are also included.
- references: ATR 503 Phonetically Balanced Sentences
C-003440: ATR Dialogue Database
The corpus contains transcripts of two types of dialogue (telephone and keyboard dialogues) in Japanese, together with their English translations. Domains of the conversations include international conferences and tourism. All text data have been annotated with segmentation, morphological information, syntactic information (Japanese only), and alignment information.
C-003442: ATR Speech Database of Many Speakers APP (English translation)
The corpus contains speech recordings of 3,700 Japanese speakers simulating telephone conversations. It is a large-scale speech database based on recordings of natural speech by speakers from all over Japan, representing a wide variety of people of different regional origins and ages. The corpus consists of 4 subcorpora (APP3, APP4, APP5 and APP6), and includes corresponding orthographic/phonemic transcripts and morphological information data.
C-003444: ATR Speech Database of Many Speakers APPBLA (English translation)
The corpus contains speech recordings of 3,700 Japanese speakers reading 50 of ATR's phonetically balanced sentences. These speakers are the same as the ones in the APP (Simulated Dialogue) database, a large-scale speech database based on recordings of natural speech by speakers from all over Japan, representing a wide variety of people of different regional origins and ages. The corpus consists of 4 subcorpora, and includes corresponding time-stamped phonemic transcripts and written texts of ATR's 503 Phonetically Balanced Sentences.
- hasVersion: C-003442: ATR Speech Database of Many Speakers APP (English translation)
- hasVersion: C-003446: ATR Speech Database of Many Speakers APPDIC (English translation)
- isPartOf: C-003448: ATR Speech Database of Many Speakers (English translation)
- references: ATR 503 Phonetically Balanced Sentences
C-003446: ATR Speech Database of Many Speakers APPDIC (English translation)
The corpus contains speech recordings of 3,700 Japanese speakers reading sentences and words from Japanese dictionaries, including place names and foreign words. These speakers are the same as the ones in the APP (Simulated Dialogue) database, a large-scale simulated dialogue speech database based on recordings of natural speech by speakers from all over Japan, representing a wide variety of people of different regional origins and ages. The corpus includes corresponding time-stamped phonemic transcripts and orthographic transcripts.
- hasVersion: C-003442: ATR Speech Database of Many Speakers APP (English translation)
- hasVersion: C-003444: ATR Speech Database of Many Speakers APPBLA (English translation)
- isPartOf: C-003448: ATR Speech Database of Many Speakers (English translation)
