Bilingual Parallel Corpora Mining from the Web for Improving Hindi-English Machine Translation System.

Date of Submission

December 2015

Date of Award

Winter 12-12-2016

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science


Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)


Garain, Utpal (CVPR-Kolkata; ISI)

Abstract (Summary of the Work)

Bilingual parallel corpora are used in many applications in Natural Language Processing (NLP) and beyond; machine translation is a well-known example. This work presents a system for mining bilingual parallel corpora from the web. We first collect candidate websites that contain Hindi-English text by supplying the system with the Hindi-English language pair and a list of Hindi words. A Hindi-English parallel corpus is then mined from these candidate websites. Although the system has room for improvement, the resulting parallel corpus is highly accurate. We have not built a very large Hindi-English parallel corpus, because our initial goal was to establish an approach. We show that the mined corpus improves Hindi-English machine translation, and we give details of the Hindi-English parallel corpora mined from the web so far. The system requires no manual effort and can be used to mine domain-specific as well as general-domain bilingual parallel corpora. As data on the web grows over time, the system can be used to build larger parallel corpora in the future.



Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

