Lemmatizers for Highly Inflected Indian Languages.

Date of Submission

December 2016

Date of Award

Winter 12-12-2017

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science


Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)


Garain, Utpal (CVPR-Kolkata; ISI)

Abstract (Summary of the Work)

Lemmatization is the process of finding the appropriate root for a given surface word. In morphologically rich languages, one root word may have many morphological variants due to agglutination or inflection. Tasks such as question answering, text summarization, topic identification, word sense disambiguation, and information retrieval in such languages therefore need a good lemmatizer to find the lemmata of words. This thesis considers the lemmatization problem for two major Indian languages, Hindi and Bengali, both of which are considered highly inflected. Two different techniques are explored. First, the efficiency of an off-the-shelf lemmatizer, Lemming [5], is tested on Hindi and Bengali. Lemming uses a log-linear model for lemmatization and requires part-of-speech (POS) and lemma-annotated data for training. Experiments show that Lemming performs well when sufficiently large data sets (about twenty thousand annotated words in continuous text) are provided for Hindi and Bengali. However, for many Indian languages such a resource is not available. The second part of this thesis therefore attempts to develop a graph-based unsupervised lemmatizer whose only resource requirements are a large corpus and a POS-tagged dataset; lemma annotation, which is an expensive resource, is not required. Finally, the role of the lemmatizer in information retrieval is investigated for Hindi and Bengali.


ProQuest Collection ID: http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:28843730


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.


