Study of Neural Learning in Text Processing.

Date of Submission

December 2016

Date of Award

Winter 12-12-2017

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science


Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)


Garain, Utpal (CVPR-Kolkata; ISI)

Abstract (Summary of the Work)

This study explores different neural learning frameworks in natural language processing and information retrieval. The distributed neural language model Word2Vec has been reported to provide elegant word embeddings that capture semantic and syntactic information. Recent studies have also shown that such feature embeddings, coupled with various neural network models, have set new benchmarks in various text processing problems. The aim of this research is to study different neural models and the word embedding framework, and to explore their effectiveness and limitations in different text processing challenges. Three problems are explored in this study: (i) learning document embeddings from word embeddings and analyzing their effectiveness in document classification; (ii) automatic query expansion using neural word embeddings; and (iii) biomedical information extraction for cancer genetics.

Effective use of a neural framework for learning document representations for document classification is challenging, because existing techniques perform remarkably well and the extension from the word embedding model is not straightforward. Our study found that learning such document embeddings yields no advantage in document classification when compared with naive Term Frequency-Inverse Document Frequency (TF-IDF) embedding.

In the second problem, semantically related terms are obtained by finding the terms most similar to the query terms using word embeddings. Query expansion using such semantically K-nearest neighbor terms in the vocabulary does help improve results over baseline retrieval using a language model. However, query expansion for ad hoc retrieval requires, in addition to semantically related terms, terms that occur with high frequency in the relevant documents along with the query terms. Query expansion using word embeddings fails to include terms that co-occur with high frequency anywhere in the relevant document, because the Word2Vec model measures co-occurrence within a limited context window.

Our third problem, biomedical information extraction, essentially requires identifying events and finding relations among events and entities. Word embeddings are found to be extremely useful in biomedical relation extraction. Neural architectures such as convolutional neural networks also provide superior results in event identification. We propose a parser architecture for biomedical documents concerning cancer genetics, using a neural architecture and word vectors as features. The parser outperforms the state-of-the-art results.
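
The K-nearest neighbor query expansion described in the second problem can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function name `expand_query`, the toy three-dimensional vectors, and the choice of cosine similarity are all assumptions made for illustration. In practice the vectors would come from a trained Word2Vec model.

```python
import numpy as np

def expand_query(query_terms, embeddings, k=2):
    """Append up to k nearest vocabulary terms (by cosine similarity)
    for each query term. Hypothetical helper, for illustration only."""
    vocab = list(embeddings)
    matrix = np.array([embeddings[w] for w in vocab], dtype=float)
    # Unit-normalize rows so that a dot product equals cosine similarity.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(vocab)}

    expanded = list(query_terms)
    for term in query_terms:
        if term not in index:
            continue  # out-of-vocabulary query terms cannot be expanded
        sims = matrix @ matrix[index[term]]
        added = 0
        # Walk the vocabulary from most to least similar.
        for i in np.argsort(-sims):
            if added == k:
                break
            candidate = vocab[i]
            if candidate not in expanded:
                expanded.append(candidate)
                added += 1
    return expanded

# Toy vectors (assumed values, not learned from any corpus).
toy_embeddings = {
    "cancer":   [0.9, 0.1, 0.0],
    "tumor":    [0.8, 0.2, 0.1],
    "gene":     [0.1, 0.9, 0.0],
    "mutation": [0.2, 0.8, 0.1],
    "river":    [0.0, 0.1, 0.9],
}

print(expand_query(["cancer"], toy_embeddings, k=1))
```

The sketch ranks candidates only by embedding similarity, so it shares the limitation noted in the abstract: a term that co-occurs frequently with the query terms elsewhere in relevant documents, but not within the embedding model's context window, would not be selected.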


Creative Commons License

Creative Commons Attribution 4.0 International License

