Hiencor: On mining of a hi-en general purpose parallel corpus from the web
Document Type
Conference Article
Publication Title
Proceedings of the 2017 International Conference on Asian Language Processing, IALP 2017
Abstract
This paper presents a simple, language-independent methodology for mining bilingual parallel corpora from the web. In particular, we extract a parallel corpus for the Hindi-English (Hi-En) language pair from previously unexplored web pages. Candidate websites containing both Hindi and English pages are identified by supplying a list of Hindi stop words to the system. A small set of manually generated patterns and a state-of-the-art sentence aligner are then used to extract the Hindi-English parallel corpus from these candidate websites. The quality of the mined parallel corpus is also demonstrated empirically on a Hindi-English machine translation task.
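The stop-word step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the function names `hindi_stop_word_ratio` and `looks_hindi`, and the 0.2 threshold are all assumptions chosen for illustration, since the abstract does not give these details.

```python
# Hypothetical, very small stop-word list for illustration only;
# the list actually used in the paper is not given in the abstract.
HINDI_STOP_WORDS = {"का", "की", "के", "में", "है", "और", "से", "को", "पर", "यह"}

# Punctuation stripped from token edges, including the Devanagari danda.
PUNCTUATION = "।,.!?;:\"'()"


def hindi_stop_word_ratio(text):
    """Return the fraction of whitespace-separated tokens that are Hindi stop words."""
    tokens = [t.strip(PUNCTUATION) for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in HINDI_STOP_WORDS)
    return hits / len(tokens)


def looks_hindi(text, threshold=0.2):
    """Flag a page as Hindi when enough of its tokens are stop words.

    The threshold is an assumed value, not one reported by the authors.
    """
    return hindi_stop_word_ratio(text) >= threshold


if __name__ == "__main__":
    print(looks_hindi("यह किताब मेज पर है"))
    print(looks_hindi("This book is on the table"))
```

Under these assumptions, a website would become a candidate when some of its pages pass this check and others do not, which suggests it hosts pages in both languages.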
First Page
235
Last Page
238
DOI
10.1109/IALP.2017.8300587
Publication Date
2-21-2018
Recommended Citation
Das, Arjun; Garain, Utpal; Kumar, Ravindra; and Senapati, Apurbalal, "Hiencor: On mining of a hi-en general purpose parallel corpus from the web" (2018). Conference Articles. 101.
https://digitalcommons.isical.ac.in/conf-articles/101