Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing results 1031 - 1040 of 2023.

C-003229: Eindhoven Corpus
The Eindhoven Corpus is the first collection of written and transcribed spoken Dutch texts, comprising 720,000 tokens from the period 1960 to 1973. It was originally compiled to produce a frequency list for Dutch and can now be used for all kinds of language technology research. The corpus has been annotated manually and almost faultlessly, which makes it suitable as training and test data for developing part-of-speech taggers.
- isReferencedBy: D-003240: CGN lexicon
C-003230: IFA Spoken Language Corpus v1.0
The IFA Corpus is an open source database of hand-segmented Dutch speech. It contains speech from 8 Dutch speakers. For each speaker, a fixed text was recorded in several "styles", along with a retold version of the fixed text. In addition, each speaker told an informal story face-to-face to an interviewer; this story was the basis of a speaker-specific variable text corpus, which was read and retold by each speaker individually. The corpus is unique in that it has phonemic segmentation and the same speakers were recorded in many styles, features that many currently available speech corpora lack.
- isReferencedBy: [Reference] The IFA Corpus: a Phonemically Segmented Dutch "Open Source" Speech Database (http://www.fon.hum.uva.nl/Service/IFAcorpus/SLcorpus/AdditionalDocuments/IFAcorpusEurospeech2001.html)
- isReferencedBy: [Reference] Structure and access of the open source IFA-corpus (http://www.fon.hum.uva.nl/Service/IFAcorpus/SLcorpus/AdditionalDocuments/IRCS2001paper.html)
C-003235: Spoken Dutch Corpus 2.0
The Spoken Dutch Corpus is a collection of approximately 900 hours of Standard Dutch from Flemish and Dutch speakers. The total number of words included is nearly 9 million. All recordings have been aligned with an orthographic transcription and each word has been given a POS tag and a lemma. A selection of one million words has been annotated syntactically, and for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available. In this release, the CGN lexicon has also been included.
- replaces: C-001711: Corpus Gesproken Nederlands
- hasPart: C-003236: CGN Annotation dvd
- hasPart: D-003240: CGN lexicon
C-003236: CGN Annotation dvd
This DVD contains the written portion of the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN), a collection of approximately 900 hours of standard Dutch from Flemish and Dutch speakers. The total number of words included is nearly 9 million. The DVD includes the fully annotated version of the transcribed corpus. The package also includes COREX, the corpus software used by the CGN. A selection of one million words has been annotated syntactically, and for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available.
- isPartOf: C-003235: Spoken Dutch Corpus 2.0
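C-003244: Priority Areas Research "Spoken Language" / Experimental Research "Speech DB" Continuous Speech Database (重点領域研究「音声言語」・試験研究「音声DB」連続音声データベース)
A speech database consisting of monosyllables and words (101 monosyllables, 9 loanword syllables, and 216 ATR phonetically balanced words) and of short sentences and passages (a list of 7 sentences for speech quality evaluation, 11 interrogative sentences, a list of 70 sentences for Japanese language education, the Aesop fable "The North Wind and the Sun", news text, narration text, and weather forecast text). Each of the 12 speakers (6 male, 6 female) uttered each item once.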
- references: ATR 503 Phonetically Balanced Sentences
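C-003245: University of Tsukuba Multilingual Speech Corpus (筑波大多言語音声コーパス)
Contains speech from a total of 98 speakers, 1 to 7 per language, in 11 languages (English, German, French, Spanish, Russian, Arabic, Indonesian, Thai, Chinese, Korean, and Japanese). The speakers are mainly international students and foreign teachers at the University of Tsukuba. For Thai, the corpus was compiled with the cooperation of the National Electronics and Computer Technology Center of Thailand. It was created in the University of Tsukuba Special Project "Typology of Eastern and Western Languages and Cultures" (fiscal years 1997 to 2001). Utterances with the same content were recorded for every language. As words in wide general use around the world, 50 words were chosen: 14 numerals, 12 month names, 7 days of the week, 4 weather terms, 6 greetings, 3 responses, and 4 time-related terms. As continuous speech, the Aesop fable "The North Wind and the Sun" was chosen, since it is a story of about one minute that is known worldwide and for which material is easy to obtain. The reading text for each language was prepared by speakers of that language (at least two) on the basis of the Japanese (English) version. The speech was recorded on DAT in a soundproof room using a headset microphone. Speaking time averages about 5 minutes per speaker. The data are stored on one CD-ROM. Speakers were asked to fill in their age, sex, place of origin, native language, residence history, parents' places of origin, and similar details as speaker data.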
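C-003246: Tohoku University - Matsushita Word Speech Database (東北大-松下単語音声データベース)
A speech database of spoken words (set1: 212 phonetically balanced words; set2: 3285 railway station and line names). In set1, 60 speakers (30 male, 30 female) each uttered every word once; in set2, 12 speakers (6 male, 6 female) did the same. Phoneme labels are also available for the data of 20 of the set1 speakers.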
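C-003247: Grant-in-Aid for Scientific Research (A) "Regional Differences in Japanese Dialects" Dialect Speech Corpus (基盤研究(A)「日本語方言の地域差」方言音声コーパス)
The dialect speech data used to build this Japanese dialect speech corpus were collected with the support of the Grant-in-Aid for Scientific Research (A) (Development) project "Research on Regional Differences in Japanese Dialect Speech and on the Design and Construction of a Dialect Speech Corpus". The data consist of read text and spontaneous conversation, recorded on DAT at 16 bit, 48 kHz. To cover the major dialect regions of Japan, nine recording sites were chosen: Aomori, Yamagata, Chiba, Aichi, Toyama, Nara, Tottori, Kagawa, and Fukuoka. The speakers were men and women aged 60 or over who grew up in the dialect region and had lived there for a long time. Taking into account issues such as dialect region, number of speakers, age, sex, utterance content, monologue versus dialogue, and acceptance of recording, the recordings were made with attention to the following points: (1) the speech content includes both read text and spontaneous conversation; (2) each conversation lasts at least 1 minute, preferably 3 to 5 minutes; (3) there are 10 to 20 speakers per dialect, roughly balanced between men and women; (4) permission to transcribe the recorded speech is obtained from the speakers; (5) emphasized portions are not specially marked in the transcription; (6) speakers are aged 55 or over, preferably 60 or over in order to collect typical dialect speech; (7) informant data such as name, age, sex, birthplace, and place of upbringing are collected; (8) a consent form for the use of the recorded speech data is requested; (9) recording is preferably done in a room that is as quiet as possible.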
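C-003248: RWCP-SP96 Spoken Dialogue Database (1996 edition)
One-to-one, face-to-face, goal-oriented dialogues between humans (free dialogue in question-and-answer form). The dialogues cover two topics, buying a car and planning an overseas trip, and each is held between a questioner (customer) and a respondent (expert). Each dialogue lasts about 6 to 20 minutes.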
- hasVersion: C-003249: RWCP-SP97 Spoken Dialogue Database (1997 edition)
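C-003249: RWCP-SP97 Spoken Dialogue Database (1997 edition)
One-to-one, face-to-face, goal-oriented dialogues between humans (free dialogue in question-and-answer form). The topic is planning an overseas trip, and each dialogue is held between a questioner (customer) and a respondent (expert).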
- hasVersion: C-003248: RWCP-SP96 Spoken Dialogue Database (1996 edition)
