Language Resource Search - SHACHI: Language Resource Metadata Database

Registered language resources: 3330. Showing 1221 - 1230 of 2023 results.

C-003581: BLLIP North American News Text, Complete
*Introduction*

Brown Laboratory for Linguistic Information Processing (BLLIP) North American News Text, Complete, LDC2008T13, ISBN 1-58563-482-4, contains a Penn Treebank-style parsing of approximately 24 million sentences from the North American News Text Corpus (LDC95T21). The North American News Text Corpus consists of English news text from the Los Angeles Times-Washington Post (1994-1997), the New York Times (1994-1996), Reuters News Service (1994-1996) and the Wall Street Journal (1994-1996).

BLLIP North American News Text is released in two versions: BLLIP North American News Text, Complete (LDC2008T13), a members-only corpus that contains sentences from all sources in The North American News Text Corpus; and BLLIP North American News Text, General Release (LDC2008T14), a corpus available to nonmembers that does not include the Wall Street Journal data from The North American News Text Corpus.

To complement the Complete and General Release versions of BLLIP North American News Text, LDC is re-releasing The North American News Text Corpus in two versions. North American News Text, Complete (LDC2008T15), the members-only original version, is now available as a 2008 Membership Year corpus. North American News Text, General Release (LDC2008T16), which does not include news text from the Wall Street Journal, is available to nonmembers for the first time. The directory structure of each of these publications has been reorganized to be identical to that of the BLLIP releases.

*Methodology*

A key problem in natural language processing is syntactic ambiguity resulting from uncertain relationships between words and their connections to sentence clauses. Sentences that can be constructed with correct syntax in more than one way are ambiguous, and such sentences generate multiple parse trees when they are separated into clauses by parts of speech.

Traditional parsing techniques, such as part-of-speech (POS) tagging, typically achieve a 90% accuracy rate because most sentences are not ambiguous. Resolving ambiguous sentences requires a probabilistic approach. Using the relative frequencies of grammar rules, statistical processing techniques assign probabilities for each clause. These probabilities are then summed up over each complete sentence parse and a probability is assigned for that sentence parse. In that way, the most likely parse can be determined.
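
As a rough illustration of scoring competing parses from rule frequencies, the sketch below assumes a probabilistic context-free grammar, in which the probability of a parse is the product of its rule probabilities (equivalently, the sum of their log probabilities). The rule counts, the two candidate analyses, and all function names are invented for illustration; they are not taken from the corpus or from the Charniak-Johnson parser.

```python
import math
from collections import defaultdict

# Hypothetical counts for (left-hand side, right-hand side) grammar rules.
RULE_COUNTS = {
    ("VP", ("VBD", "NP")): 60,
    ("VP", ("VBD", "NP", "PP")): 40,
    ("NP", ("DT", "NN")): 70,
    ("NP", ("NP", "PP")): 30,
    ("PP", ("IN", "NP")): 100,
}


def rule_probabilities(counts):
    """Relative frequency: count(rule) / count(all rules sharing its left-hand side)."""
    totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}


def parse_log_prob(rules_used, probs):
    """Sum the log probabilities of every rule used in one parse."""
    return sum(math.log(probs[rule]) for rule in rules_used)


PROBS = rule_probabilities(RULE_COUNTS)

# Two candidate analyses of a verb phrase followed by a prepositional phrase.
attach_to_verb = [
    ("VP", ("VBD", "NP", "PP")),
    ("NP", ("DT", "NN")),
    ("PP", ("IN", "NP")),
    ("NP", ("DT", "NN")),
]
attach_to_noun = [
    ("VP", ("VBD", "NP")),
    ("NP", ("NP", "PP")),
    ("NP", ("DT", "NN")),
    ("PP", ("IN", "NP")),
    ("NP", ("DT", "NN")),
]

candidates = {"verb": attach_to_verb, "noun": attach_to_noun}
# The most likely parse is the one with the highest log probability.
best = max(candidates, key=lambda name: parse_log_prob(candidates[name], PROBS))
```

Working in log space avoids numerical underflow on long sentences, which is presumably why the corpus reports parser scores as log probabilities.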

The data in this release was parsed into Penn Treebank-style parse trees using a re-ranking parser developed by Eugene Charniak and Mark Johnson. The Charniak and Johnson parser is statistically based and uses a generative first stage followed by a discriminative second stage. Both stages were trained on the Wall Street Journal data in Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42). BLLIP 1987-1989 WSJ Corpus Release 1 (LDC2000T43) contains a complete Treebank-style parsing of that Wall Street Journal material.

To produce BLLIP North American News Text, the Charniak-Johnson parser used a simplified context-free grammar in the first stage to generate a set of n-best parses. Those parses were then pruned by eliminating the parses at the edges of the distribution. In the second stage, a maximum entropy-based parser using a complete grammar was applied. The output trees are ranked in order of probability.

*Data*

The parses in BLLIP North American News Text include constituency and POS tagging information for each of the 50-best parses of each sentence.

Each file contains a sequence of n-best lists. An n-best list is a list of the top n parses of each sentence with the corresponding parser probability and re-ranker score. Following is an example of a simple n-best list:

50 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency)))) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))) (. .)))
3.35952 -151.173
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (NP (DT the) (NN presidency)) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))))) (. .)))
2.67662 -148.374
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (VP (ADVP (RB first)) (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))

In the above example, the first number ("50") indicates the number of parses. The next token is the article id from the North American News Text Corpus ("reute9406_007.0356"), followed by an underscore, followed by the number of the sentence in the article ("13"). The parses follow; for brevity, only four of the fifty parses are presented here. Each parse consists of a reranker score (4.9244 for the first parse) and parser log probability (-147.337 for the first parse), a new line, and then the parse tree itself. Parse trees are given in Penn Treebank format. Note that the n-best list is sorted by decreasing reranker scores.
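
The layout described above could be read with a short script along the following lines. This is a minimal sketch that assumes the header (parse count and sentence id) sits on its own line, that each parse occupies two lines (scores, then tree), and that blank lines are only separators; the names `Parse`, `NBestList`, and `read_nbest_lists` are hypothetical, and the two trees in `sample` are shortened stand-ins rather than corpus data.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Parse:
    reranker_score: float
    parser_log_prob: float
    tree: str  # Penn Treebank bracketed string, left unparsed here


@dataclass
class NBestList:
    article_id: str
    sentence_number: int
    parses: List[Parse]


def read_nbest_lists(lines: Iterable[str]) -> Iterator[NBestList]:
    """Yield one NBestList per sentence from an iterable of text lines."""
    rows = (line.strip() for line in lines)
    rows = (row for row in rows if row)  # skip blank separator lines
    for header in rows:
        count_text, sentence_id = header.split()
        # The sentence number follows the last underscore of the id.
        article_id, _, sentence_number = sentence_id.rpartition("_")
        parses = []
        for _ in range(int(count_text)):
            score_text, log_prob_text = next(rows).split()
            tree = next(rows)
            parses.append(Parse(float(score_text), float(log_prob_text), tree))
        yield NBestList(article_id, int(sentence_number), parses)


# Shortened, made-up trees; only the surrounding layout mirrors the example.
sample = """2 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued)) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued))) (. .))
"""

nbest_lists = list(read_nbest_lists(sample.splitlines()))
top_parse = nbest_lists[0].parses[0]  # lists are sorted by reranker score
```

Because each list is sorted by decreasing reranker score, taking the first parse of each list gives the reranker's preferred analysis without any further comparison.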

Source material is as follows:

Source | Dates | Approx. # Words (millions)
Los Angeles Times & Washington Post | 1994-1997 | 52
New York Times | 1994-1996 | 173
Reuters (General and Financial) | 1994-1996 | 85
Wall Street Journal (Not included in General Release) | 1994-1996 | 40

*Pricing*

The Reduced Licensing Fee for this corpus is US$800.
- hasPart: C-003582: BLLIP North American News Text, General Release
- isReferencedBy: C-003579: North American News Text, Complete
- references: C-001073: North American News Text Corpus
- isReferencedBy: David McClosky, Eugene Charniak, and Mark Johnson, 2008, BLLIP North American News Text, Complete, Linguistic Data Consortium, Philadelphia
C-003582: BLLIP North American News Text, General Release
*Introduction*

Brown Laboratory for Linguistic Information Processing (BLLIP) North American News Text, General Release, LDC2008T14, ISBN 1-58563-482-4, contains a Penn Treebank-style parsing of approximately 21 million sentences from the North American News Text Corpus (LDC95T21). The North American News Text Corpus consists of English news text from the Los Angeles Times-Washington Post (1994-1997), the New York Times (1994-1996), Reuters News Service (1994-1996) and the Wall Street Journal (1994-1996).

BLLIP North American News Text is released in two versions: BLLIP North American News Text, Complete (LDC2008T13), a members-only corpus that contains sentences from all sources in The North American News Text Corpus; and BLLIP North American News Text, General Release (LDC2008T14), a corpus available to nonmembers that does not include the Wall Street Journal data from The North American News Text Corpus.

To complement the Complete and General Release versions of BLLIP North American News Text, LDC is re-releasing The North American News Text Corpus in two versions. North American News Text, Complete (LDC2008T15), the members-only original version, is now available as a 2008 Membership Year corpus. North American News Text, General Release (LDC2008T16), which does not include news text from the Wall Street Journal, is available to nonmembers for the first time. The directory structure of each of these publications has been reorganized to be identical to that of the BLLIP releases.

*Methodology*

A key problem in natural language processing is syntactic ambiguity resulting from uncertain relationships between words and their connections to sentence clauses. Sentences that can be constructed with correct syntax in more than one way are ambiguous, and such sentences generate multiple parse trees when they are separated into clauses by parts of speech.

Traditional parsing techniques, such as part-of-speech (POS) tagging, typically achieve a 90% accuracy rate because most sentences are not ambiguous. Resolving ambiguous sentences requires a probabilistic approach. Using the relative frequencies of grammar rules, statistical processing techniques assign probabilities for each clause. These probabilities are then summed up over each complete sentence parse and a probability is assigned for that sentence parse. In that way, the most likely parse can be determined.

The data in this release was parsed into Penn Treebank-style parse trees using a re-ranking parser developed by Eugene Charniak and Mark Johnson. The Charniak and Johnson parser is statistically based and uses a generative first stage followed by a discriminative second stage. Both stages were trained on the Wall Street Journal data in Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42). BLLIP 1987-1989 WSJ Corpus Release 1 (LDC2000T43) contains a complete Treebank-style parsing of that Wall Street Journal material.

To produce BLLIP North American News Text, the Charniak-Johnson parser used a simplified context-free grammar in the first stage to generate a set of n-best parses. Those parses were then pruned by eliminating the parses at the edges of the distribution. In the second stage, a maximum entropy-based parser using a complete grammar was applied. The output trees are ranked in order of probability.

*Data*

The parses in BLLIP North American News Text include constituency and POS tagging information for each of the 50-best parses of each sentence.

Each file contains a sequence of n-best lists. An n-best list is a list of the top n parses of each sentence with the corresponding parser probability and re-ranker score. Following is an example of a simple n-best list:

50 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency)))) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))) (. .)))
3.35952 -151.173
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (NP (DT the) (NN presidency)) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))))) (. .)))
2.67662 -148.374
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (VP (ADVP (RB first)) (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))

In the above example, the first number ("50") indicates the number of parses. The next token is the article id from the North American News Text Corpus ("reute9406_007.0356"), followed by an underscore, followed by the number of the sentence in the article ("13"). The parses follow; for brevity, only four of the fifty parses are presented here. Each parse consists of a reranker score (4.9244 for the first parse) and parser log probability (-147.337 for the first parse), a new line, and then the parse tree itself. Parse trees are given in Penn Treebank format. Note that the n-best list is sorted by decreasing reranker scores.
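
Since every tree is a Penn Treebank bracketed string, the POS tagging information can be recovered without building a full tree. The sketch below is one possible approach, assuming that each word appears in a `(TAG word)` pair with no whitespace or parentheses inside the tag or the word; the function name `tagged_tokens` is hypothetical, and the tree is the top-ranked parse from the example above.

```python
import re
from typing import List, Tuple

# A preterminal has the form "(TAG word)" with no nested parentheses.
_PRETERMINAL = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def tagged_tokens(tree: str) -> List[Tuple[str, str]]:
    """Return (word, POS tag) pairs in sentence order."""
    return [(word, tag) for tag, word in _PRETERMINAL.findall(tree)]


tree = (
    "(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) "
    "(ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) "
    "(NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) "
    "(NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) "
    "(. .)))"
)

tokens = tagged_tokens(tree)
sentence = " ".join(word for word, _ in tokens)
```

A regular expression is enough here because only the innermost bracket pairs are needed; recovering the constituency structure itself would call for a proper bracket parser.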

Source material is as follows:

Source | Dates | Approx. # Words (millions)
Los Angeles Times & Washington Post | 1994-1997 | 52
New York Times | 1994-1996 | 173
Reuters (General and Financial) | 1994-1996 | 85
Wall Street Journal (Not included in General Release) | 1994-1996 | 40
- isPartOf: C-003581: BLLIP North American News Text, Complete
- isReferencedBy: C-003580: North American News Text, General Release
- references: C-001073: North American News Text Corpus
- isReferencedBy: David McClosky, Eugene Charniak, and Mark Johnson, 2008, BLLIP North American News Text, General Release, Linguistic Data Consortium, Philadelphia
C-003585: CD-毎日新聞2007データ集
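Full-text article data collection of the Mainichi Shimbun for 2007 (tagged text data), covering the final editions of the morning and evening papers from the Tokyo and Osaka head offices.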
- isPartOf: C-003588: CD-毎日新聞2007データ集プラス
- hasVersion: C-000838: DCS-毎日新聞1991～2006データファイル
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001598: CD-Mainichi Shimbun '91 data collection
- hasVersion: C-003589: CD-毎日新聞2006データ集プラス
- hasVersion: C-003590: CD-毎日新聞2005データ集プラス
- hasVersion: C-004211: CD-毎日新聞2008データ集
- hasVersion: C-004212: CD-毎日新聞2009データ集
- hasVersion: C-004213: CD-毎日新聞2010データ集
- hasVersion: C-004214: CD-毎日新聞2011データ集
- hasVersion: C-004215: CD-毎日新聞2008データ集プラス
- hasVersion: C-004216: CD-毎日新聞2009データ集プラス
- hasVersion: C-004217: CD-毎日新聞2010データ集プラス
- hasVersion: C-004218: CD-毎日新聞2011データ集プラス
- hasVersion: C-004390: CD-毎日新聞2012データ集
- hasVersion: C-004391: CD-毎日新聞2012データ集プラス
C-003588: CD-毎日新聞2007データ集プラス
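Full-text article data collection of the Mainichi Shimbun for 2007 (tagged text data), combining the final editions of the morning and evening papers from the Tokyo and Osaka head offices with the "regional editions," which contain articles from Hokkaido to Kagoshima (approximately 200,000 articles).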
- hasPart: C-003585: CD-毎日新聞2007データ集
- hasVersion: C-000838: DCS-毎日新聞1991～2006データファイル
- hasVersion: C-001598: CD-Mainichi Shimbun '91 data collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-003589: CD-毎日新聞2006データ集プラス
- hasVersion: C-003590: CD-毎日新聞2005データ集プラス
- hasVersion: C-004211: CD-毎日新聞2008データ集
- hasVersion: C-004212: CD-毎日新聞2009データ集
- hasVersion: C-004213: CD-毎日新聞2010データ集
- hasVersion: C-004214: CD-毎日新聞2011データ集
- hasVersion: C-004215: CD-毎日新聞2008データ集プラス
- hasVersion: C-004216: CD-毎日新聞2009データ集プラス
- hasVersion: C-004217: CD-毎日新聞2010データ集プラス
- hasVersion: C-004218: CD-毎日新聞2011データ集プラス
- hasVersion: C-004390: CD-毎日新聞2012データ集
- hasVersion: C-004391: CD-毎日新聞2012データ集プラス
- hasVersion: C-005027: CD-毎日新聞2013データ集
- hasVersion: C-005028: CD-毎日新聞2014データ集
- hasVersion: C-005029: CD-毎日新聞2015データ集
- hasVersion: C-005030: CD-毎日新聞2016データ集
- hasVersion: C-005031: CD-毎日新聞2013データ集プラス
- hasVersion: C-005032: CD-毎日新聞2014データ集プラス
- hasVersion: C-005033: CD-毎日新聞2015データ集プラス
- hasVersion: C-005034: CD-毎日新聞2016データ集プラス
C-003589: CD-毎日新聞2006データ集プラス
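Full-text article data collection of the Mainichi Shimbun for 2006 (tagged text data), combining the final editions of the morning and evening papers from the Tokyo and Osaka head offices with the "regional editions," which contain articles from Hokkaido to Kagoshima.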
- hasPart: CD-毎日新聞2006データ集
- hasVersion: C-003588: CD-毎日新聞2007データ集プラス
- hasVersion: C-003590: CD-毎日新聞2005データ集プラス
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001598: CD-Mainichi Shimbun '91 data collection
- hasVersion: C-000838: DCS-毎日新聞1991～2006データファイル
- hasVersion: C-004211: CD-毎日新聞2008データ集
- hasVersion: C-004212: CD-毎日新聞2009データ集
- hasVersion: C-004213: CD-毎日新聞2010データ集
- hasVersion: C-004214: CD-毎日新聞2011データ集
- hasVersion: C-004215: CD-毎日新聞2008データ集プラス
- hasVersion: C-004216: CD-毎日新聞2009データ集プラス
- hasVersion: C-004217: CD-毎日新聞2010データ集プラス
- hasVersion: C-004218: CD-毎日新聞2011データ集プラス
- hasVersion: C-003585: CD-毎日新聞2007データ集
- hasVersion: C-004390: CD-毎日新聞2012データ集
- hasVersion: C-004391: CD-毎日新聞2012データ集プラス
- hasVersion: C-005027: CD-毎日新聞2013データ集
- hasVersion: C-005028: CD-毎日新聞2014データ集
- hasVersion: C-005029: CD-毎日新聞2015データ集
- hasVersion: C-005030: CD-毎日新聞2016データ集
- hasVersion: C-005031: CD-毎日新聞2013データ集プラス
- hasVersion: C-005032: CD-毎日新聞2014データ集プラス
- hasVersion: C-005033: CD-毎日新聞2015データ集プラス
- hasVersion: C-005034: CD-毎日新聞2016データ集プラス
C-003590: CD-毎日新聞2005データ集プラス
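Full-text article data collection of the Mainichi Shimbun for 2005 (tagged text data), combining the final editions of the morning and evening papers from the Tokyo and Osaka head offices with the "regional editions," which contain articles from Hokkaido to Kagoshima.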
- hasPart: CD-毎日新聞2005データ集
- hasVersion: C-003589: CD-毎日新聞2006データ集プラス
- hasVersion: C-003588: CD-毎日新聞2007データ集プラス
- hasVersion: C-000838: DCS-毎日新聞1991～2006データファイル
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001598: CD-Mainichi Shimbun '91 data collection
- hasVersion: C-004211: CD-毎日新聞2008データ集
- hasVersion: C-004212: CD-毎日新聞2009データ集
- hasVersion: C-004213: CD-毎日新聞2010データ集
- hasVersion: C-004214: CD-毎日新聞2011データ集
- hasVersion: C-004215: CD-毎日新聞2008データ集プラス
- hasVersion: C-004216: CD-毎日新聞2009データ集プラス
- hasVersion: C-004217: CD-毎日新聞2010データ集プラス
- hasVersion: C-004218: CD-毎日新聞2011データ集プラス
- hasVersion: C-003585: CD-毎日新聞2007データ集
- hasVersion: C-004390: CD-毎日新聞2012データ集
- hasVersion: C-004391: CD-毎日新聞2012データ集プラス
- hasVersion: C-005027: CD-毎日新聞2013データ集
- hasVersion: C-005028: CD-毎日新聞2014データ集
- hasVersion: C-005029: CD-毎日新聞2015データ集
- hasVersion: C-005030: CD-毎日新聞2016データ集
- hasVersion: C-005031: CD-毎日新聞2013データ集プラス
- hasVersion: C-005032: CD-毎日新聞2014データ集プラス
- hasVersion: C-005033: CD-毎日新聞2015データ集プラス
- hasVersion: C-005034: CD-毎日新聞2016データ集プラス
C-003593: 朝日新聞記事データ集学術研究用 2007
Newspaper article data collection containing approximately 170,000 head-office-edition articles of the Asahi Shimbun from 2007 (tagged text data).
- hasVersion: C-001597: Asahi Shimbun News Article Data for Research
- hasVersion: C-003594: 朝日新聞記事データ集学術研究用 2006
- hasVersion: C-004341: 朝日新聞記事データ（学術・研究用）2008年版
- hasVersion: C-004342: 朝日新聞記事データ（学術・研究用）2009年版
- hasVersion: C-004343: 朝日新聞記事データ（学術・研究用）2010年版
- hasVersion: C-004344: 朝日新聞記事データ（学術・研究用）2011年版
C-003594: 朝日新聞記事データ集学術研究用 2006
Newspaper article data collection containing approximately 170,000 head-office-edition articles of the Asahi Shimbun from 2006 (tagged text data).
- hasVersion: C-001597: Asahi Shimbun News Article Data for Research
- hasVersion: C-003593: 朝日新聞記事データ集学術研究用 2007
- hasVersion: C-004341: 朝日新聞記事データ（学術・研究用）2008年版
- hasVersion: C-004342: 朝日新聞記事データ（学術・研究用）2009年版
- hasVersion: C-004343: 朝日新聞記事データ（学術・研究用）2010年版
- hasVersion: C-004344: 朝日新聞記事データ（学術・研究用）2011年版
C-003604: 読売新聞記事データ＜邦文＞2007年版 (*CSV形式)
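A newspaper article database intended to support research in fields such as linguistics, information science, and media studies. Contains one year of Japanese-language Yomiuri Shimbun article data for 2007 (approximately 350,000 articles) in CSV format. Use for purposes other than research is prohibited.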
- hasVersion: C-001632: Yomiuri Shimbun Articles Data(Japanese)
- hasVersion: C-003605: 読売新聞記事データ＜邦文＞2006年版 (*CSV形式)
- hasVersion: C-003608: 読売新聞記事データ＜英文＞2007年版 (*CSV形式)
- references: D-003612: ヨミダス用語辞書
- hasFormat: C-003606: 読売新聞記事データ集 2007 (*テキスト形式)
- hasVersion: C-005045: 読売新聞記事データ＜邦文＞2008年版
- hasVersion: C-005046: 読売新聞記事データ＜邦文＞2009年版
- hasVersion: C-005047: 読売新聞記事データ＜邦文＞2010年版
- hasVersion: C-005048: 読売新聞記事データ＜邦文＞2011年版
- hasVersion: C-005049: 読売新聞記事データ＜邦文＞2012年版
- hasVersion: C-005050: 読売新聞記事データ＜邦文＞2013年版
- hasVersion: C-005051: 読売新聞記事データ＜邦文＞2014年版
- hasVersion: C-005052: 読売新聞記事データ＜邦文＞2015年版
- hasVersion: C-005053: 読売新聞記事データ＜邦文＞2016年版
C-003605: 読売新聞記事データ＜邦文＞2006年版 (*CSV形式)
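A newspaper article database intended to support research in fields such as linguistics, information science, and media studies. Contains one year of Japanese-language Yomiuri Shimbun article data for 2006 (approximately 360,000 articles) in CSV format. Use for purposes other than research is prohibited.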
- hasVersion: C-001632: Yomiuri Shimbun Articles Data(Japanese)
- hasVersion: C-003604: 読売新聞記事データ＜邦文＞2007年版 (*CSV形式)
- hasVersion: C-003609: 読売新聞記事データ＜英文＞2006年版 (*CSV形式)
- references: D-003612: ヨミダス用語辞書
- hasFormat: C-003607: 読売新聞記事データ集 2006 (*テキスト形式)
- hasVersion: C-005045: 読売新聞記事データ＜邦文＞2008年版
- hasVersion: C-005046: 読売新聞記事データ＜邦文＞2009年版
- hasVersion: C-005047: 読売新聞記事データ＜邦文＞2010年版
- hasVersion: C-005048: 読売新聞記事データ＜邦文＞2011年版
- hasVersion: C-005049: 読売新聞記事データ＜邦文＞2012年版
- hasVersion: C-005050: 読売新聞記事データ＜邦文＞2013年版
- hasVersion: C-005051: 読売新聞記事データ＜邦文＞2014年版
- hasVersion: C-005052: 読売新聞記事データ＜邦文＞2015年版
- hasVersion: C-005053: 読売新聞記事データ＜邦文＞2016年版
