Hiencor: On mining of a hi-en general purpose parallel corpus from the web
Proceedings of the 2017 International Conference on Asian Language Processing, IALP 2017
This paper presents a simple, language-independent methodology for mining a bilingual parallel corpus from the web. In particular, we extract a parallel corpus for the Hindi-English (Hi-En) language pair from web pages that were previously unexplored. Candidate websites containing both Hindi and English pages are identified by supplying a list of Hindi stop words to the system. A small set of manually generated patterns and a state-of-the-art sentence aligner are then used to extract the Hindi-English parallel corpus from these candidate websites. The quality of the mined parallel corpus is also demonstrated empirically on a Hindi-English machine translation task.
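As a rough illustration of the stop-word idea in the abstract, the sketch below flags a page as likely Hindi when a sufficient share of its tokens are Hindi stop words. It is not the authors' implementation: the stop-word list, the tokenizer, the threshold of 0.2, and the function names `hindi_stop_word_ratio` and `looks_hindi` are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the list used in the paper is not reproduced here.
HINDI_STOP_WORDS = {"और", "का", "की", "के", "है", "में", "से", "को", "पर", "यह"}

# Tokens are runs of Devanagari characters (U+0900 to U+097F) or Latin letters.
TOKEN_RE = re.compile(r"[\u0900-\u097F]+|[A-Za-z]+")


def hindi_stop_word_ratio(text: str) -> float:
    """Return the fraction of tokens in `text` that are Hindi stop words."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in HINDI_STOP_WORDS)
    return hits / len(tokens)


def looks_hindi(text: str, threshold: float = 0.2) -> bool:
    """Heuristic check: treat the page text as Hindi if the stop-word
    ratio reaches `threshold` (an assumed value, not from the paper)."""
    return hindi_stop_word_ratio(text) >= threshold
```

Under these assumptions, a site would count as a candidate when some of its pages pass `looks_hindi` and others do not; pairing those pages and aligning their sentences are separate steps that this sketch does not cover.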
Das, Arjun; Garain, Utpal; Kumar, Ravindra; and Senapati, Apurbalal, "Hiencor: On mining of a hi-en general purpose parallel corpus from the web" (2018). Conference Articles. 101.