Conference Articles

Hiencor: On mining of a hi-en general purpose parallel corpus from the web

Arjun Das, University of Calcutta
Utpal Garain, Indian Statistical Institute, Kolkata
Ravindra Kumar, Indian Statistical Institute, Kolkata
Apurbalal Senapati, CIT Kokrajhar

Document Type

Conference Article

Publication Title

Proceedings of the 2017 International Conference on Asian Language Processing, IALP 2017

Abstract

This paper presents a simple, language-independent methodology for mining a bilingual parallel corpus from the web. In particular, we extract a parallel corpus for the Hindi-English (Hi-En) language pair from previously unexplored web pages. Candidate websites containing Hindi and English pages are identified by supplying the system with a list of Hindi stop words. A small set of manually generated patterns and a state-of-the-art sentence aligner are then used to extract the Hindi-English parallel corpus from these candidate websites. The quality of the mined parallel corpus is also demonstrated empirically on a Hindi-English machine translation task.
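
The abstract mentions identifying candidate pages with a list of Hindi stop words. The sketch below is only an illustration of that general idea, not the authors' implementation: the stop-word list, the function names, and the `min_hits` threshold are all assumptions made for this example.

```python
import re

# Hypothetical, very small stop-word list for illustration only;
# the list actually used in the paper is not given in this record.
HINDI_STOP_WORDS = {"और", "का", "की", "के", "है", "में", "से", "को", "पर", "यह"}

# Runs of Devanagari code points, excluding the danda and double danda
# punctuation marks (U+0964, U+0965) so they do not stick to words.
DEVANAGARI_TOKEN = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def count_hindi_stop_words(text: str) -> int:
    """Count Devanagari tokens in `text` that appear in the stop-word list."""
    return sum(1 for tok in DEVANAGARI_TOKEN.findall(text) if tok in HINDI_STOP_WORDS)


def looks_hindi(text: str, min_hits: int = 3) -> bool:
    """Flag a page as a Hindi candidate if it has at least `min_hits` stop words.

    The threshold is an assumed value, not one taken from the paper.
    """
    return count_hindi_stop_words(text) >= min_hits
```

A crawler could call `looks_hindi` on the visible text of each fetched page and keep only sites where some pages pass the check and others do not, which would then be passed on to pattern matching and sentence alignment.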

First Page

235

Last Page

238

DOI

10.1109/IALP.2017.8300587

Publication Date

7-2-2017

Recommended Citation

Das, Arjun; Garain, Utpal; Kumar, Ravindra; and Senapati, Apurbalal, "Hiencor: On mining of a hi-en general purpose parallel corpus from the web" (2017). Conference Articles. 216.
https://digitalcommons.isical.ac.in/conf-articles/216
