Language resource #: 3330
-
C-004996: TRAD Arabic-English Newspaper Parallel corpus - Test set 1
This is a parallel corpus of 10,000 Arabic words with 2 reference translations in English. The source texts are articles collected in 2012 from the Arabic version of Le Monde Diplomatique. The translations were produced by two different translation teams following a strict protocol aimed at producing high-quality translations.
The content is also translated into French (see ELRA-W0098).
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA). It was used as a test set for an internal MT evaluation campaign.
-
C-004997: TRAD Arabic-French Newspaper Parallel corpus - Test set 2
This is a parallel corpus of 10,000 Arabic words with 2 reference translations in French. The source texts are articles collected in May 2013 from the Arabic version of Le Monde Diplomatique. The translations were produced by two different translation teams following a strict protocol aimed at producing high-quality translations.
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA). It was used as a test set for the 2014 TRAD MT evaluation campaign.
-
C-004998: TRAD Arabic-French Newspaper Parallel corpus - Test set 1
This is a parallel corpus of 10,000 Arabic words with 4 reference translations in French. The source texts are articles collected in 2012 from the Arabic version of Le Monde Diplomatique. The translations were produced by four different translation teams following a strict protocol aimed at producing high-quality translations.
The content is also translated into English (see ELRA-W0099).
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA). It was used as a test set for the 2012 TRAD MT evaluation campaign.
-
C-004999: TRAD Pashto-English News Articles Parallel corpus
This parallel corpus contains 10,000 Pashto words translated into English by two different translators. The source texts were collected from the following news websites: Azadiradio, Mashaal and Voice of America Pashto.
The content has also been translated into French (see ELRA-W0096 TRAD Pashto-French Newspaper Parallel corpus).
Pashto is an Indo-Iranian language spoken by the Pashtun people, mainly in Pakistan and Afghanistan.
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA).
-
C-005000: TRAD Pashto-French News Articles Parallel corpus
This parallel corpus contains 10,000 Pashto words translated into French by two different translators. The source texts were collected from the following news websites: Azadiradio, Mashaal and Voice of America Pashto.
The content has also been translated into English (see ELRA-W0097 TRAD Pashto-English News Parallel corpus).
Pashto is an Indo-Iranian language spoken by the Pashtun people, mainly in Pakistan and Afghanistan.
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA).
-
C-005001: TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data
This parallel corpus contains 10,000 Pashto words translated into English. The source texts come from 3 broadcast news transcriptions in the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381). These texts are VOA Ashna TV programs recorded on 15/01/2011, 18/01/2011 and 19/01/2011.
The content has also been translated into French (see ELRA-W0094 TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test set).
Pashto is an Indo-Iranian language spoken by the Pashtun people, mainly in Pakistan and Afghanistan.
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA). It was used as a test set for an internal MT evaluation campaign.
- references: C-004906: TRAD Pashto Broadcast News Speech Corpus
- hasVersion: C-004994: TRAD Arabic-English Parallel corpus of transcribed Broadcast News Speech
- hasVersion: C-004995: TRAD Arabic-French Parallel corpus of transcribed Broadcast News Speech
- hasVersion: C-005002: TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data
- hasVersion: C-005003: TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data
-
C-005002: TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data
This parallel corpus contains 10,000 Pashto words translated into French by two different translators. The source texts come from 3 broadcast news transcriptions in the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381). These texts are VOA Ashna TV programs recorded on 15/01/2011, 18/01/2011 and 19/01/2011. These translations differ from those provided in the TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training set (ELRA-W0093).
The content has also been translated into English (see ELRA-W0095 TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech).
Pashto is an Indo-Iranian language spoken by the Pashtun people, mainly in Pakistan and Afghanistan.
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA). It was used as a test set for an internal MT evaluation campaign.
- references: C-004906: TRAD Pashto Broadcast News Speech Corpus
- hasVersion: C-004994: TRAD Arabic-English Parallel corpus of transcribed Broadcast News Speech
- hasVersion: C-004995: TRAD Arabic-French Parallel corpus of transcribed Broadcast News Speech
- hasVersion: C-005001: TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data
- hasVersion: C-005003: TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data
-
C-005003: TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data
The corpus consists of transcriptions of 106 hours of Pashto recordings, translated into French. The transcriptions are extracted from the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381). The corpus contains about 832,000 source words and 747,000 target words. No audio files are provided.
Pashto is an Indo-Iranian language spoken by the Pashtun people, mainly in Pakistan and Afghanistan.
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA). It was used as training data for language modelling in machine translation.
- references: C-004906: TRAD Pashto Broadcast News Speech Corpus
- hasVersion: C-005001: TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data
- hasVersion: C-004994: TRAD Arabic-English Parallel corpus of transcribed Broadcast News Speech
- hasVersion: C-004995: TRAD Arabic-French Parallel corpus of transcribed Broadcast News Speech
- hasVersion: C-005002: TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data
-
C-005004: TRAD Pashto Monolingual text Corpus
This is a monolingual text corpus in Pashto. The corpus contains about 112,000,000 tokens collected from 46 different blogs and websites.
The sources, either identified and negotiated or freely available, were crawled in 2012, then cleaned and formatted as XML.
Pashto is an Indo-Iranian language spoken by the Pashtun people, mainly in Pakistan and Afghanistan.
This corpus was produced by ELDA within the PEA TRAD project supported by the French Ministry of Defence (DGA).
-
C-005005: Linguatools Webcrawl Parallel Corpus German-English 2015
The corpus consists of 10 million German-English parallel sentences that were crawled from the internet between 10/2013 and 04/2015. The sentences were gathered from over 112,000 different hosts. Multi-step quality filtering was applied, including a language identification filter, a machine translation filter and a grammaticality filter, among others, to obtain data that is as clean as possible. There are no duplicate sentence pairs, and there is no overlap with existing publicly available corpora such as Europarl and DGT-TM. Web pages have been automatically categorized by subject area. The corpus is available in TMX and Moses format (UTF-8 encoding).
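For readers unfamiliar with the Moses format, the following is a minimal sketch of how such data might be read, assuming the usual convention of two plain-text UTF-8 files (one per language) with one sentence per line, aligned by line number. The function names and any file names passed to them are hypothetical; the actual file layout of this corpus is not described here. The duplicate check is only a sanity check on the "no duplicate sentence pairs" property and is not the producer's filtering pipeline.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def read_moses_pairs(src_path: str, tgt_path: str) -> Iterator[Pair]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumes the Moses convention: line N of the source file is the
    translation counterpart of line N of the target file.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip_longest(src, tgt):
            # zip_longest pads the shorter file with None, which reveals
            # a line-count mismatch (and therefore broken alignment).
            if s is None or t is None:
                raise ValueError("source and target files differ in line count")
            yield s.rstrip("\n"), t.rstrip("\n")


def unique_pairs(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Yield each sentence pair once, keeping first-seen order."""
    seen = set()
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            yield pair
```

Because both functions are generators, sentence pairs are streamed one at a time rather than read into memory together; note, however, that `unique_pairs` still keeps every distinct pair in its `seen` set, so memory use grows with the number of unique pairs.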