Search in Transliterated Domain for Marathi.

Date of Submission

December 2015

Date of Award

Winter 12-12-2016

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science


Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)


Mitra, Mandar (CVPR-Kolkata; ISI)

Abstract (Summary of the Work)

A large number of languages, including Arabic, Russian, and most of the South and South East Asian languages, are written using indigenous scripts. However, websites and user-generated content in these languages are often written in Roman script for various reasons. A challenge that search engines face while processing transliterated queries and documents is extensive spelling variation. In this report, we address the word-level language identification problem for Marathi text that is written in Roman script and also contains English words. We consider a method based on character-level n-grams to address this problem. Our method gives around 98% mean accuracy under 10-fold cross-validation. In addition to word-level language identification, we also handle transliteration of Marathi words from Roman script to Devanagari script. Our method achieves around 95% character-level unigram precision. We have also implemented an ad-hoc retrieval system for Marathi news data based on transliterated queries. We compared its results with those obtained using the original Devanagari-script queries and found that the performance of the ad-hoc retrieval system did not deteriorate significantly.
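To make the idea of word-level language identification with character n-grams concrete, here is a minimal sketch. It is not the dissertation's implementation: the abstract does not name the classifier, so a naive Bayes model with add-one smoothing is assumed here, and the class name, the labels "mr" and "en", and the tiny word lists are all hypothetical.

```python
import math
from collections import Counter


def char_ngrams(word, n_max=3):
    """Return all character n-grams (n = 1..n_max) of a word,
    padded with '^' and '$' so that prefixes and suffixes are visible."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Hypothetical word-level language identifier (assumed: naive Bayes
    over character n-grams with add-one smoothing)."""

    def __init__(self, n_max=3):
        self.n_max = n_max
        self.counts = {}    # label -> Counter of n-grams seen for that label
        self.totals = {}    # label -> total number of n-grams for that label
        self.priors = {}    # label -> number of training words
        self.vocab = set()  # all n-grams seen in training

    def fit(self, words, labels):
        for word, label in zip(words, labels):
            grams = char_ngrams(word, self.n_max)
            self.counts.setdefault(label, Counter()).update(grams)
            self.priors[label] = self.priors.get(label, 0) + 1
            self.vocab.update(grams)
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, word):
        n_words = sum(self.priors.values())
        v = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            # log prior plus smoothed log likelihood of each n-gram
            score = math.log(self.priors[label] / n_words)
            for gram in char_ngrams(word, self.n_max):
                score += math.log((counter[gram] + 1) / (self.totals[label] + v))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy illustration only; these word lists are made up for the sketch.
marathi = ["majha", "tujha", "aahe", "nahi", "kasa", "pahije", "mhanun", "zala"]
english = ["the", "and", "which", "search", "with", "this", "query", "news"]
model = NgramLanguageIdentifier().fit(
    marathi + english, ["mr"] * len(marathi) + ["en"] * len(english))
print(model.predict("aahet"), model.predict("searching"))
```

Because the features are short character sequences rather than whole words, a model of this kind can still score an unseen spelling variant of a known word. That property matters for Romanized text, where spelling is not standardized. A realistic evaluation would need a labeled corpus and a held-out protocol such as the 10-fold cross-validation mentioned above.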



Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.


