Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing results 1151 - 1160 of 2023.

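C-003404: 日本のふるさとことば集成第２巻　岩手・秋田 (Collection of Japanese Hometown Dialects, Vol. 2: Iwate, Akita)
Book: Dialect conversations are transcribed in katakana and accompanied by a standard Japanese translation. The transcription and the translation are set in two parallel columns so that the meaning is easy to follow. The book can also be read while listening to the accompanying CD. CD: Contains the dialect audio of the complete discourse. Even dialects from unfamiliar regions can be followed with the help of the book. CD-ROM: The text and audio data of the dialect discourse can be viewed and played back. The pages of the book are provided as image data so that they can be read on a computer much like the printed book, and the dialect audio is linked so that the conversations can be played page by page. The bundled search software can also be used to search the discourse data. In addition, the dialect transcriptions and standard Japanese translations are included as text files. These digital data may be processed and freely used for research and education.
- hasVersion: C-003390: 日本のふるさとことば集成第１巻　北海道・青森 (Vol. 1: Hokkaido, Aomori)
- hasVersion: C-003478: 日本のふるさとことば集成第３巻　宮城・山形・福島 (Vol. 3: Miyagi, Yamagata, Fukushima)
- hasVersion: C-003479: 日本のふるさとことば集成第４巻　茨城・栃木 (Vol. 4: Ibaraki, Tochigi)
- hasVersion: C-003480: 日本のふるさとことば集成第５巻　埼玉・千葉 (Vol. 5: Saitama, Chiba)
- hasVersion: C-003481: 日本のふるさとことば集成第６巻　東京・神奈川 (Vol. 6: Tokyo, Kanagawa)
- hasVersion: C-003482: 日本のふるさとことば集成第７巻　群馬・新潟 (Vol. 7: Gunma, Niigata)
- hasVersion: C-003483: 日本のふるさとことば集成第８巻　長野・山梨・静岡 (Vol. 8: Nagano, Yamanashi, Shizuoka)
- hasVersion: C-003484: 日本のふるさとことば集成第９巻　岐阜・愛知・三重 (Vol. 9: Gifu, Aichi, Mie)
- hasVersion: C-003485: 日本のふるさとことば集成第10巻　富山・石川・福井 (Vol. 10: Toyama, Ishikawa, Fukui)
- hasVersion: C-003486: 日本のふるさとことば集成第11巻　京都・滋賀 (Vol. 11: Kyoto, Shiga)
- hasVersion: C-003487: 日本のふるさとことば集成第12巻　奈良・和歌山 (Vol. 12: Nara, Wakayama)
- hasVersion: C-003488: 日本のふるさとことば集成第13巻　大阪・兵庫 (Vol. 13: Osaka, Hyogo)
- hasVersion: C-003489: 日本のふるさとことば集成第14巻　鳥取・島根・岡山 (Vol. 14: Tottori, Shimane, Okayama)
- hasVersion: C-003490: 日本のふるさとことば集成第15巻　広島・山口 (Vol. 15: Hiroshima, Yamaguchi)
- hasVersion: C-003491: 日本のふるさとことば集成第16巻　香川・徳島 (Vol. 16: Kagawa, Tokushima)
- hasVersion: C-003492: 日本のふるさとことば集成第17巻　愛媛・高知 (Vol. 17: Ehime, Kochi)
- hasVersion: C-003493: 日本のふるさとことば集成第18巻　福岡・佐賀・大分 (Vol. 18: Fukuoka, Saga, Oita)
- hasVersion: C-003494: 日本のふるさとことば集成第19巻　長崎・熊本・宮崎 (Vol. 19: Nagasaki, Kumamoto, Miyazaki)
- hasVersion: C-003495: 日本のふるさとことば集成第20巻　鹿児島・沖縄 (Vol. 20: Kagoshima, Okinawa)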
C-003405: American National Corpus (ANC) Second Release
*Introduction*

This file contains documentation on the ANC Second Release, Linguistic Data Consortium (LDC) catalog number LDC2005T35 and ISBN 1-58563-369-0.

The American National Corpus (ANC) project fosters the development of a corpus comparable to the British National Corpus (BNC), covering American English. Corpus-analytic work has demonstrated that the BNC is inappropriate for the study of American English, due to the numerous differences in use of the language.

The availability of a corpus of American English will significantly contribute to language and linguistic research, the development of language understanding computer applications (e.g., language translation and search and retrieval software), and the compilation of reference works such as dictionaries and thesauri. It will also provide a rich national resource for use in education at all levels.

ANC Second Release contains over 20 million words: 10+ million words added in the Second Release, and a new corrected and validated version of the 11 million word ANC First Release. The Second Release also contains software for searching and retrieving multiple stand-off annotations.

ANC Second Release contains texts from the following sources (* denotes new source in the Second Release):

* Transcribed telephone speech (LDC and Project MORE)
* The New York Times
* Berlitz Travel Guides (Langenscheidt Publishers)
* Slate Magazine (Microsoft)
* ICIC Corpus of Fundraising Texts (Indiana Center for Intercultural Communication)*
* The Michigan Corpus of Academic Spoken English (MICASE) (University of Michigan, English Language Institute)*
* Various non-fiction
* Various fiction (Orin Hargraves, Ferd Eggan)*
* Various medical research articles (BioMed Central, Public Library of Science)*
* Anonymized posts to the Phoenix Board/Buffistas.org*

ANC Second Release contains data governed under two types of licenses, an open license and a restricted license. Both the Open License Agreement and the Restricted License Agreement need to be signed in order to receive ANC Second Release, and the data must be used in accordance with the agreement by which it is governed.

The ANC will ultimately contain a core corpus of at least 100 million words, including both written and spoken (transcripts) data comparable across genres to the BNC. The genres in the ANC will be expanded to include new types of language data that have become available in recent years, such as web blogs and web pages, chats, email, and rap music lyrics. In addition to the core 100 million words, the ANC will include an additional component of potentially several hundreds of millions of words, chosen to provide both the broadest and largest selection of data possible.

The American National Corpus is being developed with the help of a consortium of publishers of American English dictionaries and companies with interests in language processing, which was formed in 1999. Consortium members are providing materials for inclusion in the corpus and provided initial financial support for the project.

Additional documentation and information are available on the ANC web site at http://www.americannationalcorpus.org/SecondRelease/index.html.

*Samples*

For examples of the various types of data in this corpus, please review the files listed below.

* LDC2005T35_let-biber.xml
* LDC2005T35_let-hepple.xml
* LDC2005T35_let-logical.xml
* LDC2005T35_let-np.xml
* LDC2005T35_let-s.xml
* LDC2005T35_let-vp.xml
* LDC2005T35_let.anc
* LDC2005T35_let.txt
* LDC2005T35_sp-biber.xml
* LDC2005T35_sp-hepple.xml
* LDC2005T35_sp-logical.xml
* LDC2005T35_sp-np.xml
* LDC2005T35_sp-s.xml
* LDC2005T35_sp-vp.xml
* LDC2005T35_sp.anc
* LDC2005T35_sp.txt

*Acknowledgements*

The publication of this corpus was facilitated by funding extended by the TalkBank project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-98009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.
- references: C-000648: CALLHOME American English Transcripts
- references: Switchboard
- references: C-003431: ICIC Corpus of Philanthropic Fundraising Discourse
- references: C-003315: MICASE
- replaces: C-000452: American National Corpus
- hasVersion: Slate coreference subset: http://www.americannationalcorpus.org/annotations.html#slate-coref
- hasVersion: CLAWS C5 part-of-speech annotation subset: http://www.americannationalcorpus.org/annotations.html#claws
- hasVersion: CLAWS C7 part-of-speech annotation subset: http://www.americannationalcorpus.org/annotations.html#claws
- isReferencedBy: D-003412: ANC Second Release Frequency Data
- hasPart: C-003410: American National Corpus Second Release - Open Portion
- requires: ANCTool: http://AmericanNationalCorpus.org/tools/index.html
- isReferencedBy: Randi Reppen, Nancy Ide, and Keith Suderman, 2005, American National Corpus (ANC) Second Release, Linguistic Data Consortium, Philadelphia
- conformsTo: C-001546: Treebank-2
C-003410: American National Corpus Second Release - Open Portion
The Open ANC (OANC) includes over 14 million words from the American National Corpus (ANC) Second Release, a massive electronic collection of American English including texts of all genres and transcripts of spoken data produced from 1990 onward. The file organization and encoding conventions for the OANC are the same as in the Second Release. All annotations were originally produced automatically using GATE's ANNIE system; they cover structural markup (sections, chapters, etc.), tokens with part-of-speech tags (Penn tagset), noun chunks, and verb chunks. Some of the texts in the OANC include manually validated sentence boundaries that are not included in the Second Release.
- references: C-003431: ICIC Corpus of Philanthropic Fundraising Discourse
- references: Switchboard
- isPartOf: C-003405: American National Corpus (ANC) Second Release
- requires: ANC Tool (http://www.americannationalcorpus.org/tools/anctool-installer.jar)
C-003431: ICIC Corpus of Philanthropic Fundraising Discourse
The ICIC Fundraising Corpus project is an ongoing project to build a corpus of fundraising texts and to study the persuasive use of language in case statements, annual reports, grant proposals, and direct mail letters. The Corpus includes over 900 fundraising documents from 236 organizations and totals over 1 million words. This research project is significant in that it strives to highlight the links between rhetorical and linguistic analysis, and the ways in which these analyses work together to explain how fundraising texts persuade potential donors.
- isReferencedBy: C-003405: American National Corpus (ANC) Second Release
- isReferencedBy: C-003410: American National Corpus Second Release - Open Portion
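C-003436: 自然発話音声データベース SDB (Spontaneous Speech Database SDB)
Contains simulated conversations in which two Japanese speakers, not face to face, converse using free expression, mainly in the setting of a telephone conversation between a hotel front-desk clerk and a customer, covering hotel reservations, inquiries about services, and similar topics. Also contains recordings of each speaker reading aloud one set (50 sentences) of the 503 phonetically balanced sentences, which were composed to balance phonemic contexts, with breath positions and similar choices left to the speaker. Four titles in all; each title contains utterances from between 20 and 194 different speakers. Includes Japanese transcriptions, phoneme-level transcriptions, morphological information data, and text files of the phonetically balanced sentences.
- references: ATR音素バランス文503文 (ATR 503 Phonetically Balanced Sentences)
C-003438: 自然発話音声・言語データベース（日英対訳） SLDB (Spontaneous Speech and Language Database, Japanese-English Bilingual, SLDB)
A database of simulated spontaneous conversations about travel, recorded to support the recognition of spontaneous speech and the development of speech translation technology. A Japanese speaker and an English speaker, not face to face and assumed not to understand each other's language, converse using free expression, mainly in the setting of a hotel front-desk clerk and a customer talking over a telephone with an interpreting function. The recorded simulated conversations are Japanese-English bilingual. Also contains recordings of each speaker reading aloud 50 ATR phonetically balanced sentences, with breath positions and similar choices left to the speaker. Includes English and Japanese transcriptions, phoneme-level transcriptions, morphological information data, and Japanese syntactic parse data.
- references: ATR音素バランス文503文 (ATR 503 Phonetically Balanced Sentences)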
C-003440: ATR Dialogue Database
The corpus contains transcripts of two different types of dialogues (telephone and keyboard dialogues) in Japanese and their English translations. Domains of the conversations include international conferences and tourism. All text data have been annotated for segmentation, morphological information, syntactic information (for Japanese only) and alignment information.
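C-003442: 多数話者音声データベース APP (Multi-Speaker Speech Database APP)
A speech database of simulated conversations between Japanese speakers, recorded to improve the performance of speaker-independent continuous speech recognition for spontaneous speech. Simulated conversations by about 3,700 speakers were recorded across Japan, and the speakers' places of origin cover all 47 prefectures. It is a large-scale speech database designed with regional and age diversity in mind. Includes Japanese transcriptions, phoneme-level transcriptions, and morphological information data. Four titles in all (APP3, APP4, APP5, APP6).
- hasVersion: C-003444: 多数話者音声データベース APPBLA (Multi-Speaker Speech Database APPBLA)
- hasVersion: C-003446: 多数話者音声データベース APPDIC (Multi-Speaker Speech Database APPDIC)
- isPartOf: C-003448: 多数話者音声データベース (Multi-Speaker Speech Database)
C-003444: 多数話者音声データベース APPBLA (Multi-Speaker Speech Database APPBLA)
The Multi-Speaker Speech Database is a large-scale speech database of simulated conversations between Japanese speakers, recorded to improve the performance of speaker-independent continuous speech recognition for spontaneous speech. Simulated conversations by about 3,700 speakers were recorded across Japan; the speakers' places of origin cover all 47 prefectures, and the database was designed with regional and age diversity in mind. This corpus contains recordings of speakers who took part in the simulated conversation recordings reading aloud one set (50 sentences) of the 503 phonetically balanced sentences, which were composed to balance phonemic contexts, with breath positions and similar choices left to the speaker. Includes phoneme-level transcriptions and text files of the phonetically balanced sentences. Four titles in all.
- hasVersion: C-003442: 多数話者音声データベース APP (Multi-Speaker Speech Database APP)
- hasVersion: C-003446: 多数話者音声データベース APPDIC (Multi-Speaker Speech Database APPDIC)
- isPartOf: C-003448: 多数話者音声データベース (Multi-Speaker Speech Database)
- references: ATR音素バランス文503文 (ATR 503 Phonetically Balanced Sentences)
C-003446: 多数話者音声データベース APPDIC (Multi-Speaker Speech Database APPDIC)
The Multi-Speaker Speech Database is a large-scale speech database of simulated conversations between Japanese speakers, recorded to improve the performance of speaker-independent continuous speech recognition for spontaneous speech. Simulated conversations by about 3,700 speakers were recorded across Japan; the speakers' places of origin cover all 47 prefectures, and the database was designed with regional and age diversity in mind. This corpus contains recordings of speakers who took part in the simulated conversation recordings reading aloud sentences and words excerpted from Japanese-language dictionaries, place-name dictionaries, loanword dictionaries, and similar sources. Includes phoneme-level transcriptions and dictionary text data.
- hasVersion: C-003442: 多数話者音声データベース APP (Multi-Speaker Speech Database APP)
- hasVersion: C-003444: 多数話者音声データベース APPBLA (Multi-Speaker Speech Database APPBLA)
- isPartOf: C-003448: 多数話者音声データベース (Multi-Speaker Speech Database)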
