GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1

registration: 2016/01/17 10:20:54, last modified: 2016/01/17 10:20:54

Language  Genre  Docs  Words   CharTokens  Segments
Chinese   BC     12    51,192  76,789      2,943
Chinese   BN     16    68,702  103,053     3,539

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
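The 1.5-characters-per-word conversion can be checked against the counts above with a minimal sketch. The function name `estimate_words` is hypothetical, and truncating the fractional part is an assumption; it happens to reproduce the published Words figures from the CharTokens figures.

```python
def estimate_words(char_tokens: int) -> int:
    """Estimate a word count from a character-token count at 1.5 characters per word."""
    # Dividing by 1.5 is the same as multiplying by 2/3; integer arithmetic
    # avoids floating-point error. Truncating the remainder is an assumption.
    return char_tokens * 2 // 3


# CharTokens figures taken from the table in this document.
char_token_counts = {"BC": 76_789, "BN": 103_053}

for genre, chars in char_token_counts.items():
    print(genre, estimate_words(chars))
# prints:
# BC 51192
# BN 68702
```

Both results match the Words column, which suggests the word counts were derived from the character counts rather than from independent word segmentation.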

The Chinese word alignment tasks consisted of the following components:

* Identifying, aligning, and tagging 8 different types of links
* Identifying, attaching, and tagging local-level unmatched words
* Identifying and tagging sentence/discourse-level unmatched words
* Identifying and tagging all instances of Chinese 的 (DE), except when they were part of a semantic link

*Samples*

Please view the following samples:

* Chinese Raw
* Chinese Token
* English Raw
* English Token
* Word Alignment

*Sponsorship*

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

*Updates*

None at this time.