Lemmatizers for Highly Inflected Indian Languages.
Date of Submission
December 2016
Date of Award
Winter 12-12-2017
Institute Name (Publisher)
Indian Statistical Institute
Document Type
Master's Dissertation
Degree Name
Master of Technology
Subject Name
Computer Science
Department
Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)
Supervisor
Garain, Utpal (CVPR-Kolkata; ISI)
Abstract (Summary of the Work)
Lemmatization is the process of finding the appropriate root for a given surface word. In morphologically rich languages, one root word may have many morphological variants due to agglutination or inflection. Tasks such as question answering, text summarization, topic identification, word sense disambiguation, and information retrieval therefore need a good lemmatizer to find the lemmata of words in such languages. This thesis considers the lemmatization problem for two major Indian languages, Hindi and Bengali, both of which are considered highly inflected. Two different techniques are explored in this work. First, the efficiency of an off-the-shelf lemmatizer, Lemming[5], is tested for Hindi and Bengali. Lemming uses a log-linear model for lemmatization and requires data annotated with part-of-speech (POS) tags and lemmata for learning. Experiments show that Lemming performs well when sufficiently large data sets (about twenty thousand annotated words in continuous text) are provided for Hindi and Bengali. However, for many Indian languages such a resource is not available. In the second part of this thesis, we therefore tried to develop a graph-based unsupervised lemmatizer whose only resource requirements are a large corpus and a POS-tagged dataset; lemma annotation, which is an expensive resource, is not required. Finally, the role of the lemmatizer in Information Retrieval is investigated for Hindi and Bengali.
Control Number
ISI-DISS-2016-308
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
DOI
http://dspace.isical.ac.in:8080/jspui/handle/10263/6465
Recommended Citation
Nayak, Kamlesh, "Lemmatizers for Highly Inflected Indian Languages." (2017). Master’s Dissertations. 385.
https://digitalcommons.isical.ac.in/masters-dissertations/385
Comments
ProQuest Collection ID: http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:28843730