Hiencor: On mining of a Hi-En general purpose parallel corpus from the web

Document Type

Conference Article

Publication Title

Proceedings of the 2017 International Conference on Asian Language Processing, IALP 2017

Abstract

This paper presents a simple, language-independent methodology for mining a bilingual parallel corpus from the web. In particular, we extract a parallel corpus for the Hindi-English (Hi-En) language pair from previously unexplored web pages. Candidate websites containing Hindi and English pages are identified by supplying a list of Hindi stop words to the system. A small set of manually generated patterns and a state-of-the-art sentence aligner are then used to extract the Hindi-English parallel corpus from these candidate websites. The quality of the mined parallel corpus is also demonstrated empirically on a Hindi-English machine translation task.
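
As a rough illustration of the candidate-identification step described in the abstract, the sketch below flags a page as Hindi when enough of its tokens match a stop-word list. The stop-word list, the function names, and the min_hits threshold are all hypothetical; this record does not give the paper's actual list or decision criteria.

```python
import re

# Hypothetical, very small stop-word list used only for illustration.
HINDI_STOP_WORDS = {"और", "का", "की", "के", "है", "में", "से", "को", "पर", "यह"}

# Runs of Devanagari characters, excluding the danda punctuation marks
# (U+0964, U+0965) so that a sentence-final word is not glued to "।".
DEVANAGARI_TOKEN = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def count_hindi_stop_words(text: str) -> int:
    """Count how many Devanagari tokens in `text` are in the stop-word list."""
    tokens = DEVANAGARI_TOKEN.findall(text)
    return sum(1 for token in tokens if token in HINDI_STOP_WORDS)


def is_hindi_candidate(text: str, min_hits: int = 3) -> bool:
    """Treat a page as a Hindi candidate if it has at least `min_hits` stop words.

    The threshold is an assumed value, not one taken from the paper.
    """
    return count_hindi_stop_words(text) >= min_hits
```

A site would then count as a candidate when some of its pages pass this check and others are detected as English; the pairing of pages and the sentence alignment are separate steps not sketched here.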

First Page

235

Last Page

238

DOI

10.1109/IALP.2017.8300587

Publication Date

2-21-2018
