Language Resource Search - SHACHI: Language Resource Metadata Database

C-003581: BLLIP North American News Text, Complete
*Introduction*

Brown Laboratory for Linguistic Information Processing (BLLIP) North American News Text, Complete, LDC2008T13, ISBN 1-58563-482-4, contains a Penn Treebank-style parsing of approximately 24 million sentences from the North American News Text Corpus (LDC95T21). The North American News Text Corpus consists of English news text from the Los Angeles Times-Washington Post (1994-1997), the New York Times (1994-1996), Reuters News Service (1994-1996) and the Wall Street Journal (1994-1996).

BLLIP North American News Text is released in two versions: BLLIP North American News Text, Complete (LDC2008T13), a members-only corpus that contains sentences from all sources in The North American News Text Corpus; and BLLIP North American News Text, General Release (LDC2008T14), a corpus available to nonmembers that does not include the Wall Street Journal data from The North American News Text Corpus.

To complement the Complete and General Release versions of BLLIP North American News Text, LDC is re-releasing The North American News Text Corpus in two versions. North American News Text, Complete (LDC2008T15), the members-only original version, is now available as a 2008 Membership Year corpus. North American News Text, General Release (LDC2008T16), which does not include news text from the Wall Street Journal, is available to nonmembers for the first time. The directory structure of each of these publications has been reorganized to match the directory structure of the BLLIP releases.

*Methodology*

A key problem in natural language processing is syntactic ambiguity resulting from uncertain relationships between words and their connections to sentence clauses. Sentences that can be constructed with correct syntax in more than one way are ambiguous, and such sentences generate multiple parse trees when they are separated into clauses by parts of speech.

Traditional parsing techniques, such as part-of-speech (POS) tagging, typically achieve a 90% accuracy rate because most sentences are not ambiguous. Resolving ambiguous sentences requires a probabilistic approach. Using the relative frequencies of grammar rules, statistical processing techniques assign a probability to each clause. These probabilities are then combined over each complete sentence parse (in practice, by summing log probabilities) to give a probability for that parse, so that the most likely parse can be determined.
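
The selection step described above can be sketched as follows. This is a minimal illustration, not the parser used for this corpus: the grammar rules and their probabilities in `RULE_PROB` are invented for the example, and the two candidate parses are hand-written rule lists for the verb phrase "saw the man with the telescope".

```python
import math

# Hypothetical rule probabilities (stand-ins for relative frequencies
# estimated from a treebank); the numbers are illustrative only.
RULE_PROB = {
    "VP -> V NP": 0.5,
    "VP -> V NP PP": 0.2,
    "NP -> NP PP": 0.1,
    "NP -> Det N": 0.6,
    "PP -> P NP": 1.0,
}

def parse_log_prob(rules):
    """Sum the log probabilities of the rules used in one parse."""
    return sum(math.log(RULE_PROB[rule]) for rule in rules)

def most_likely_parse(candidates):
    """Return the name of the candidate parse with the highest log probability."""
    return max(candidates, key=lambda name: parse_log_prob(candidates[name]))

# Two readings of the same ambiguous phrase, each listed as the rules it uses.
candidates = {
    "pp_attaches_to_verb": [
        "VP -> V NP PP", "NP -> Det N", "PP -> P NP", "NP -> Det N",
    ],
    "pp_attaches_to_noun": [
        "VP -> V NP", "NP -> NP PP", "NP -> Det N", "PP -> P NP", "NP -> Det N",
    ],
}

best = most_likely_parse(candidates)
print(best, round(math.exp(parse_log_prob(candidates[best])), 3))
```

Summing log probabilities is equivalent to multiplying the rule probabilities, and it avoids numerical underflow on long sentences.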

The data in this release was parsed into Penn Treebank-style parse trees using a re-ranking parser developed by Eugene Charniak and Mark Johnson. The Charniak and Johnson parser is statistically based and uses a generative first stage followed by a discriminative second stage. Both stages were trained on the Wall Street Journal data in Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42). BLLIP 1987-1989 WSJ Corpus Release 1 (LDC2000T43) contains a complete Treebank-style parsing of that Wall Street Journal material.

To produce BLLIP North American News Text, the Charniak-Johnson parser used a simplified context-free grammar in the first stage to generate a set of n-best parses. Those parses were then pruned by eliminating the parses at the edges of the distribution. In the second stage, a maximum entropy-based parser using a complete grammar was applied. The output trees are ranked in order of probability.

*Data*

The parses in BLLIP North American News Text include constituency and POS tagging information for each of the 50-best parses of each sentence.

Each file contains a sequence of n-best lists. An n-best list is a list of the top n parses of each sentence with the corresponding parser probability and re-ranker score. Following is an example of a simple n-best list:

50 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency)))) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))) (. .)))
3.35952 -151.173
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (NP (DT the) (NN presidency)) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))))) (. .)))
2.67662 -148.374
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (VP (ADVP (RB first)) (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))

In the above example, the first number ("50") indicates the number of parses. The next token is the article id from the North American News Text Corpus ("reute9406_007.0356"), followed by an underscore, followed by the number of the sentence in the article ("13"). The parses follow; for brevity, only four of the fifty parses are presented here. Each parse consists of a reranker score (4.9244 for the first parse) and parser log probability (-147.337 for the first parse), a new line, and then the parse tree itself. Parse trees are given in Penn Treebank format. Note that the n-best list is sorted by decreasing reranker scores.
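
A reader for this layout might look like the sketch below. It assumes, based only on the description above, that each list starts with a header line holding the parse count and the sentence id, and that every parse occupies two lines (scores, then tree). The names `Parse`, `NBestList` and `read_nbest` are made up for this example, and the trees in `sample` are shortened stand-ins rather than real corpus output.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

@dataclass
class Parse:
    reranker_score: float
    log_prob: float
    tree: str

@dataclass
class NBestList:
    article_id: str
    sentence_no: int
    parses: List[Parse]

def read_nbest(lines: Iterable[str]) -> Iterator[NBestList]:
    """Yield one NBestList per sentence from an iterable of text lines."""
    stripped = (line.strip() for line in lines)
    it = (line for line in stripped if line)  # skip blank lines
    for header in it:
        count, sentence_id = header.split()
        # The sentence number follows the last underscore of the id.
        article_id, _, sentence_no = sentence_id.rpartition("_")
        parses = []
        for _ in range(int(count)):
            score, log_prob = next(it).split()
            tree = next(it)
            parses.append(Parse(float(score), float(log_prob), tree))
        yield NBestList(article_id, int(sentence_no), parses)

# Shortened, illustrative input (two parses instead of fifty).
sample = """2 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued)) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued) (. .))))
"""

nbest = next(read_nbest(sample.splitlines()))
print(nbest.article_id, nbest.sentence_no, len(nbest.parses))
```

Because the function is a generator, a whole file can be streamed with `read_nbest(open(path, encoding="utf-8"))` without loading every list into memory.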

Source material is as follows:

| Source | Dates | Approx. # Words (millions) |
| --- | --- | --- |
| Los Angeles Times & Washington Post | 1994-1997 | 52 |
| New York Times | 1994-1996 | 173 |
| Reuters (General and Financial) | 1994-1996 | 85 |
| Wall Street Journal (Not included in General Release) | 1994-1996 | 40 |
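
As a quick arithmetic check on the source figures listed above, the snippet below totals the approximate word counts for the Complete version and for the General Release, which omits the Wall Street Journal. The dictionary name is arbitrary; the numbers are copied from the listing.

```python
# Approximate word counts in millions, copied from the source listing.
WORDS_MILLIONS = {
    "Los Angeles Times & Washington Post": 52,
    "New York Times": 173,
    "Reuters (General and Financial)": 85,
    "Wall Street Journal": 40,
}

complete_total = sum(WORDS_MILLIONS.values())
# The General Release leaves out the Wall Street Journal material.
general_total = complete_total - WORDS_MILLIONS["Wall Street Journal"]

print(complete_total, general_total)
```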

*Pricing*

The Reduced Licensing Fee for this corpus is US$800.
- hasPart: C-003582: BLLIP North American News Text, General Release
- isReferencedBy: C-003579: North American News Text, Complete
- references: C-001073: North American News Text Corpus
- isReferencedBy: David McClosky, Eugene Charniak, and Mark Johnson, 2008, BLLIP North American News Text, Complete, Linguistic Data Consortium, Philadelphia
C-003582: BLLIP North American News Text, General Release
*Introduction*

Brown Laboratory for Linguistic Information Processing (BLLIP) North American News Text, General Release, LDC2008T14, ISBN 1-58563-482-4, contains a Penn Treebank-style parsing of approximately 21 million sentences from the North American News Text Corpus (LDC95T21). The North American News Text Corpus consists of English news text from the Los Angeles Times-Washington Post (1994-1997), the New York Times (1994-1996), Reuters News Service (1994-1996) and the Wall Street Journal (1994-1996).

BLLIP North American News Text is released in two versions: BLLIP North American News Text, Complete (LDC2008T13), a members-only corpus that contains sentences from all sources in The North American News Text Corpus; and BLLIP North American News Text, General Release (LDC2008T14), a corpus available to nonmembers that does not include the Wall Street Journal data from The North American News Text Corpus.

To complement the Complete and General Release versions of BLLIP North American News Text, LDC is re-releasing The North American News Text Corpus in two versions. North American News Text, Complete (LDC2008T15), the members-only original version, is now available as a 2008 Membership Year corpus. North American News Text, General Release (LDC2008T16), which does not include news text from the Wall Street Journal, is available to nonmembers for the first time. The directory structure of each of these publications has been reorganized to match the directory structure of the BLLIP releases.

*Methodology*

A key problem in natural language processing is syntactic ambiguity resulting from uncertain relationships between words and their connections to sentence clauses. Sentences that can be constructed with correct syntax in more than one way are ambiguous, and such sentences generate multiple parse trees when they are separated into clauses by parts of speech.

Traditional parsing techniques, such as part-of-speech (POS) tagging, typically achieve a 90% accuracy rate because most sentences are not ambiguous. Resolving ambiguous sentences requires a probabilistic approach. Using the relative frequencies of grammar rules, statistical processing techniques assign a probability to each clause. These probabilities are then combined over each complete sentence parse (in practice, by summing log probabilities) to give a probability for that parse, so that the most likely parse can be determined.

The data in this release was parsed into Penn Treebank-style parse trees using a re-ranking parser developed by Eugene Charniak and Mark Johnson. The Charniak and Johnson parser is statistically based and uses a generative first stage followed by a discriminative second stage. Both stages were trained on the Wall Street Journal data in Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42). BLLIP 1987-1989 WSJ Corpus Release 1 (LDC2000T43) contains a complete Treebank-style parsing of that Wall Street Journal material.

To produce BLLIP North American News Text, the Charniak-Johnson parser used a simplified context-free grammar in the first stage to generate a set of n-best parses. Those parses were then pruned by eliminating the parses at the edges of the distribution. In the second stage, a maximum entropy-based parser using a complete grammar was applied. The output trees are ranked in order of probability.
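
The two-stage idea can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not the Charniak-Johnson implementation: the first stage is replaced by a simple cut-off on log probability, the second by a linear scoring model, and the feature function, weights and candidate trees are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, float]  # (tree string, first-stage log probability)

def n_best(candidates: List[Candidate], n: int) -> List[Candidate]:
    """Stand-in for stage one: keep the n candidates with the highest log probability."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:n]

def rerank(
    candidates: List[Candidate],
    features: Callable[[str, float], Dict[str, float]],
    weights: Dict[str, float],
) -> List[Candidate]:
    """Stand-in for stage two: score each candidate with a weighted feature sum and sort."""
    def score(candidate: Candidate) -> float:
        feats = features(candidate[0], candidate[1])
        return sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return sorted(candidates, key=score, reverse=True)

def toy_features(tree: str, log_prob: float) -> Dict[str, float]:
    """Hypothetical features: the first-stage score and the number of NP nodes."""
    return {"log_prob": log_prob, "num_NP": float(tree.count("(NP "))}

weights = {"log_prob": 1.0, "num_NP": -2.0}  # invented weights

candidates = [
    ("(S (NP a) (VP (V b) (NP (NP c) (PP d))))", -10.0),
    ("(S (NP a) (VP (V b) (NP c) (PP d)))", -10.5),
    ("(S (X a b c d))", -30.0),
]

ranked = rerank(n_best(candidates, 2), toy_features, weights)
print(ranked[0][0])
```

In this toy run the second stage promotes the candidate that had the lower first-stage log probability. The sample n-best list in this entry shows the same effect: it is sorted by reranker score, and its log probabilities are not in descending order.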

*Data*

The parses in BLLIP North American News Text include constituency and POS tagging information for each of the 50-best parses of each sentence.

Each file contains a sequence of n-best lists. An n-best list is a list of the top n parses of each sentence with the corresponding parser probability and re-ranker score. Following is an example of a simple n-best list:

50 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency)))) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))) (. .)))
3.35952 -151.173
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (NP (DT the) (NN presidency)) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))))) (. .)))
2.67662 -148.374
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (VP (ADVP (RB first)) (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))

In the above example, the first number ("50") indicates the number of parses. The next token is the article id from the North American News Text Corpus ("reute9406_007.0356"), followed by an underscore, followed by the number of the sentence in the article ("13"). The parses follow; for brevity, only four of the fifty parses are presented here. Each parse consists of a reranker score (4.9244 for the first parse) and parser log probability (-147.337 for the first parse), a new line, and then the parse tree itself. Parse trees are given in Penn Treebank format. Note that the n-best list is sorted by decreasing reranker scores.
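
Since each tree carries both constituency and POS information, the tagged words can be recovered directly from the bracketing. The sketch below uses a regular expression rather than a full tree parser; `LEAF` and `tagged_words` are names chosen for this example, and the approach assumes that words never contain literal parentheses (Penn Treebank text normally escapes them, for instance as -LRB- and -RRB-).

```python
import re
from typing import List, Tuple

# A preterminal has the form "(TAG word)" with no nested brackets inside it.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")

def tagged_words(tree: str) -> List[Tuple[str, str]]:
    """Return (word, POS tag) pairs in sentence order."""
    return [(word, tag) for tag, word in LEAF.findall(tree)]

# The top-ranked parse from the sample n-best list.
tree = (
    "(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) "
    "(ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) "
    "(NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) "
    "(, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))"
)

pairs = tagged_words(tree)
sentence = " ".join(word for word, _ in pairs)
print(sentence)
```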

Source material is as follows:

| Source | Dates | Approx. # Words (millions) |
| --- | --- | --- |
| Los Angeles Times & Washington Post | 1994-1997 | 52 |
| New York Times | 1994-1996 | 173 |
| Reuters (General and Financial) | 1994-1996 | 85 |
| Wall Street Journal (Not included in General Release) | 1994-1996 | 40 |
- isPartOf: C-003581: BLLIP North American News Text, Complete
- isReferencedBy: C-003580: North American News Text, General Release
- references: C-001073: North American News Text Corpus
- isReferencedBy: David McClosky, Eugene Charniak, and Mark Johnson, 2008, BLLIP North American News Text, General Release, Linguistic Data Consortium, Philadelphia
C-003585: CD-Mainichi Shimbun 2007 Data Collection
A full-text newspaper article database containing data from the national edition of Mainichi Newspaper published in 2007.
- isPartOf: C-003588: CD-Mainichi Shimbun 2007 Data Collection Plus
- hasVersion: C-000838: DCS - Mainichi Newspaper 1991-2006 data files
- hasVersion: C-001598: CD-Mainichi Shimbun '91 data collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-003589: CD-Mainichi Shimbun 2006 Data Collection Plus
- hasVersion: C-003590: CD-Mainichi Shimbun 2005 Data Collection Plus
C-003588: CD-Mainichi Shimbun 2007 Data Collection Plus
A full-text newspaper article database containing data from the morning and evening editions of Mainichi Newspaper, covering both national and local (regions from Hokkaido to Kagoshima) editions published in 2007.
- hasPart: C-003585: CD-Mainichi Shimbun 2007 Data Collection
- hasVersion: C-000838: DCS - Mainichi Newspaper 1991-2006 data files
- hasVersion: C-003589: CD-Mainichi Shimbun 2006 Data Collection Plus
- hasVersion: C-003590: CD-Mainichi Shimbun 2005 Data Collection Plus
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001598: CD-Mainichi Shimbun '91 data collection
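- hasVersion: C-004211: CD-Mainichi Shimbun 2008 Data Collection
- hasVersion: C-004212: CD-Mainichi Shimbun 2009 Data Collection
- hasVersion: C-004213: CD-Mainichi Shimbun 2010 Data Collection
- hasVersion: C-004214: CD-Mainichi Shimbun 2011 Data Collection
- hasVersion: C-004390: CD-Mainichi Shimbun 2012 Data Collection
- hasVersion: C-004215: CD-Mainichi Shimbun 2008 Data Collection Plus
- hasVersion: C-004216: CD-Mainichi Shimbun 2009 Data Collection Plus
- hasVersion: C-004217: CD-Mainichi Shimbun 2010 Data Collection Plus
- hasVersion: C-004218: CD-Mainichi Shimbun 2011 Data Collection Plus
- hasVersion: C-004391: CD-Mainichi Shimbun 2012 Data Collection Plus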
- hasVersion: C-005027: CD-Mainichi Shimbun 2013 Data Collection
- hasVersion: C-005028: CD-Mainichi Shimbun 2014 Data Collection
- hasVersion: C-005029: CD-Mainichi Shimbun 2015 Data Collection
- hasVersion: C-005030: CD-Mainichi Shimbun 2016 Data Collection
- hasVersion: C-005031: CD-Mainichi Shimbun 2013 Data Collection Plus
- hasVersion: C-005032: CD-Mainichi Shimbun 2014 Data Collection Plus
- hasVersion: C-005033: CD-Mainichi Shimbun 2015 Data Collection Plus
- hasVersion: C-005034: CD-Mainichi Shimbun 2016 Data Collection Plus
C-003589: CD-Mainichi Shimbun 2006 Data Collection Plus
A full-text newspaper article database containing data from the morning and evening editions of Mainichi Newspaper, covering both national and local (regions from Hokkaido to Kagoshima) editions published in 2006.
- hasPart: CD-Mainichi Shimbun 2006 Data Collection
- hasVersion: C-003588: CD-Mainichi Shimbun 2007 Data Collection Plus
- hasVersion: C-003590: CD-Mainichi Shimbun 2005 Data Collection Plus
- hasVersion: C-000838: DCS - Mainichi Newspaper 1991-2006 data files
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001598: CD-Mainichi Shimbun '91 data collection
- hasVersion: C-003585: CD-Mainichi Shimbun 2007 Data Collection
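- hasVersion: C-004211: CD-Mainichi Shimbun 2008 Data Collection
- hasVersion: C-004212: CD-Mainichi Shimbun 2009 Data Collection
- hasVersion: C-004213: CD-Mainichi Shimbun 2010 Data Collection
- hasVersion: C-004214: CD-Mainichi Shimbun 2011 Data Collection
- hasVersion: C-004390: CD-Mainichi Shimbun 2012 Data Collection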
- hasVersion: C-005027: CD-Mainichi Shimbun 2013 Data Collection
- hasVersion: C-005028: CD-Mainichi Shimbun 2014 Data Collection
- hasVersion: C-005029: CD-Mainichi Shimbun 2015 Data Collection
- hasVersion: C-005030: CD-Mainichi Shimbun 2016 Data Collection
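- hasVersion: C-004215: CD-Mainichi Shimbun 2008 Data Collection Plus
- hasVersion: C-004216: CD-Mainichi Shimbun 2009 Data Collection Plus
- hasVersion: C-004217: CD-Mainichi Shimbun 2010 Data Collection Plus
- hasVersion: C-004218: CD-Mainichi Shimbun 2011 Data Collection Plus
- hasVersion: C-004391: CD-Mainichi Shimbun 2012 Data Collection Plus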
- hasVersion: C-005031: CD-Mainichi Shimbun 2013 Data Collection Plus
- hasVersion: C-005032: CD-Mainichi Shimbun 2014 Data Collection Plus
- hasVersion: C-005033: CD-Mainichi Shimbun 2015 Data Collection Plus
- hasVersion: C-005034: CD-Mainichi Shimbun 2016 Data Collection Plus
C-003590: CD-Mainichi Shimbun 2005 Data Collection Plus
A full-text newspaper article database containing data from the morning and evening editions of Mainichi Newspaper, covering both national and local (regions from Hokkaido to Kagoshima) editions published in 2005.
- hasPart: CD-Mainichi Shimbun 2005 Data Collection
- hasVersion: C-003588: CD-Mainichi Shimbun 2007 Data Collection Plus
- hasVersion: C-003589: CD-Mainichi Shimbun 2006 Data Collection Plus
- hasVersion: C-000838: DCS - Mainichi Newspaper 1991-2006 data files
- hasVersion: C-003585: CD-Mainichi Shimbun 2007 Data Collection
- hasVersion: C-001600: CD-Mainichi Shimbun '95 Data Collection
- hasVersion: C-001603: CD-ROM Mainichi Shimbun '94 Data Collection
- hasVersion: C-001599: CD-Mainichi Shimbun '93 Data Collection
- hasVersion: C-001602: CD-ROM Mainichi Shimbun '92 Data Collection
- hasVersion: C-001598: CD-Mainichi Shimbun '91 data collection
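- hasVersion: C-004211: CD-Mainichi Shimbun 2008 Data Collection
- hasVersion: C-004212: CD-Mainichi Shimbun 2009 Data Collection
- hasVersion: C-004213: CD-Mainichi Shimbun 2010 Data Collection
- hasVersion: C-004214: CD-Mainichi Shimbun 2011 Data Collection
- hasVersion: C-004390: CD-Mainichi Shimbun 2012 Data Collection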
- hasVersion: C-005027: CD-Mainichi Shimbun 2013 Data Collection
- hasVersion: C-005028: CD-Mainichi Shimbun 2014 Data Collection
- hasVersion: C-005029: CD-Mainichi Shimbun 2015 Data Collection
- hasVersion: C-005030: CD-Mainichi Shimbun 2016 Data Collection
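- hasVersion: C-004215: CD-Mainichi Shimbun 2008 Data Collection Plus
- hasVersion: C-004216: CD-Mainichi Shimbun 2009 Data Collection Plus
- hasVersion: C-004217: CD-Mainichi Shimbun 2010 Data Collection Plus
- hasVersion: C-004218: CD-Mainichi Shimbun 2011 Data Collection Plus
- hasVersion: C-004391: CD-Mainichi Shimbun 2012 Data Collection Plus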
- hasVersion: C-005031: CD-Mainichi Shimbun 2013 Data Collection Plus
- hasVersion: C-005032: CD-Mainichi Shimbun 2014 Data Collection Plus
- hasVersion: C-005033: CD-Mainichi Shimbun 2015 Data Collection Plus
- hasVersion: C-005034: CD-Mainichi Shimbun 2016 Data Collection Plus
C-003593: Asahi Shimbun News Article Data for Research 2007
A full-text newspaper article database containing data from the national editions of Asahi Newspaper published in 2007.
- hasVersion: C-001597: Asahi Shimbun News Article Data for Research
- hasVersion: C-003594: Asahi Shimbun News Article Data for Research 2006
C-003594: Asahi Shimbun News Article Data for Research 2006
A full-text newspaper article database containing data from the national editions of Asahi Newspaper published in 2006.
- hasVersion: C-001597: Asahi Shimbun News Article Data for Research
- hasVersion: C-003593: Asahi Shimbun News Article Data for Research 2007
C-003604: Article Data of Yomiuri Shimbun (Japanese) 2007 (*CSV format)
The database contains about 350,000 newspaper articles from Yomiuri Newspaper (written in Japanese) published in 2007. The database is exclusively for research and academic use and is intended to support development and studies in fields such as linguistics, informatics and media studies. The data is provided in CSV format.
- hasVersion: C-001632: Yomiuri Shimbun Articles Data(Japanese)
- hasVersion: C-003605: Article Data of Yomiuri Shimbun (Japanese) 2006 (*CSV format)
- hasVersion: C-003608: Article Data of Yomiuri Shimbun (English) 2007 (*CSV format)
- hasFormat: C-003606: Yomiuri Shimbun Data Collection 2007 (*text file)
- references: D-003612: YOMIDAS Dictionary
- hasVersion: C-005045: Article Data of Yomiuri Shimbun (Japanese) 2008
- hasVersion: C-005046: Article Data of Yomiuri Shimbun (Japanese) 2009
- hasVersion: C-005047: Article Data of Yomiuri Shimbun (Japanese) 2010
- hasVersion: C-005048: Article Data of Yomiuri Shimbun (Japanese) 2011
- hasVersion: C-005049: Article Data of Yomiuri Shimbun (Japanese) 2012
- hasVersion: C-005050: Article Data of Yomiuri Shimbun (Japanese) 2013
- hasVersion: C-005051: Article Data of Yomiuri Shimbun (Japanese) 2014
- hasVersion: C-005052: Article Data of Yomiuri Shimbun (Japanese) 2015
- hasVersion: C-005053: Article Data of Yomiuri Shimbun (Japanese) 2016
C-003605: Article Data of Yomiuri Shimbun (Japanese) 2006 (*CSV format)
The database contains about 360,000 newspaper articles from Yomiuri Newspaper (written in Japanese) published in 2006. The database is exclusively for research and academic use and is intended to support development and studies in fields such as linguistics, informatics and media studies. The data is provided in CSV format.
- hasVersion: C-001632: Yomiuri Shimbun Articles Data(Japanese)
- hasVersion: C-003604: Article Data of Yomiuri Shimbun (Japanese) 2007 (*CSV format)
- hasVersion: C-003609: Article Data of Yomiuri Shimbun (English) 2006 (*CSV format)
- references: D-003612: YOMIDAS Dictionary
- hasFormat: C-003607: Yomiuri Shimbun Data Collection 2006 (*text file)
- hasVersion: C-005045: Article Data of Yomiuri Shimbun (Japanese) 2008
- hasVersion: C-005046: Article Data of Yomiuri Shimbun (Japanese) 2009
- hasVersion: C-005047: Article Data of Yomiuri Shimbun (Japanese) 2010
- hasVersion: C-005048: Article Data of Yomiuri Shimbun (Japanese) 2011
- hasVersion: C-005049: Article Data of Yomiuri Shimbun (Japanese) 2012
- hasVersion: C-005050: Article Data of Yomiuri Shimbun (Japanese) 2013
- hasVersion: C-005051: Article Data of Yomiuri Shimbun (Japanese) 2014
- hasVersion: C-005052: Article Data of Yomiuri Shimbun (Japanese) 2015
- hasVersion: C-005053: Article Data of Yomiuri Shimbun (Japanese) 2016
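
The relation lines used throughout this listing follow a regular "- relation: ID: title" shape, so they can be read into a small structure. The sketch below is one way to do that; the `Relation` type and the pattern are illustrative assumptions, including the guess that identifiers are always "C-" or "D-" followed by six digits, as in the entries above. Lines without an identifier (such as a bibliographic citation) keep the whole remainder as the title.

```python
import re
from typing import NamedTuple, Optional

class Relation(NamedTuple):
    kind: str                 # e.g. hasVersion, hasPart, isPartOf, references
    target_id: Optional[str]  # e.g. C-003589, or None when no id is given
    title: str

# Assumed line shape: "- <relation>: [<id>: ]<title>"
REL = re.compile(r"^- (\w+): (?:([CD]-\d{6}): )?(.+)$")

def parse_relation(line: str) -> Optional[Relation]:
    """Parse one relation line; return None if the line has a different shape."""
    match = REL.match(line.strip())
    if match is None:
        return None
    return Relation(match.group(1), match.group(2), match.group(3))

with_id = parse_relation("- references: D-003612: YOMIDAS Dictionary")
without_id = parse_relation("- hasPart: CD-Mainichi Shimbun 2006 Data Collection")
print(with_id)
print(without_id)
```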
