GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web

registration: 2016/01/17 10:21:31, last modified: 2016/01/17 10:21:31

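| Language | Genre | Docs | Words | CharTokens | Segments |
|----------|-------|------|-------|------------|----------|
| Arabic | WB | 119 | 59,696 | 81,620 | 4,383 |
| Arabic | NW | 717 | 198,621 | 263,060 | 8,423 |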

Note that word count is based on the untokenized Arabic source, and token count is based on the tokenized Arabic source.

The Arabic word alignment tasks consisted of the following components:

* Normalizing tokens as needed
* Identifying different types of links
* Identifying sentence segments not suitable for annotation
* Tagging unmatched words attached to other words or phrases

*Samples*

Please view the following samples:

* English Raw
* English Token
* Arabic Raw
* Arabic Token
* Word Alignment

*Sponsorship*

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

*Updates*

None at this time.