Date of Submission
7-31-2014
Date of Award
7-31-2015
Institute Name (Publisher)
Indian Statistical Institute
Document Type
Doctoral Thesis
Degree Name
Doctor of Philosophy
Subject Name
Computer Science
Department
Machine Intelligence Unit (MIU-Kolkata)
Supervisor
Murthy, C. A. (MIU-Kolkata; ISI)
Abstract (Summary of the Work)
Supervised and unsupervised methodologies for text mining on plain-text English data are discussed. New supervised and unsupervised methodologies have been developed for effective mining of text data by overcoming some limitations of the existing techniques.
The problems of unsupervised text mining techniques, i.e., document clustering methods, are addressed. A new similarity measure between documents has been designed to measure the content similarity between documents more accurately. A hierarchical document clustering technique is then designed using this similarity measure. The main significance of the clustering algorithm is that the number of clusters can be determined automatically by varying a threshold on the proposed similarity measure. The algorithm experimentally outperforms several other document clustering techniques, but it is computationally expensive. Therefore a hybrid document clustering technique has been designed using the same similarity measure to overcome the computational burden of the proposed hierarchical algorithm. The hybrid technique also performs better than the hierarchical one on most of the corpora.
The limitations of the nearest neighbor decision rule for text categorization are discussed. An efficient nearest neighbor decision rule is designed for qualitative improvement of text categorization. The significance of the proposed decision rule is that it does not categorize a document when the decision is not sufficiently certain. The method shows better results than several other classifiers for text categorization. The decision rule is also implemented using the proposed similarity measure in place of a traditional similarity measure, and this variant performs better than the same rule with a traditional similarity measure.
The importance of dimensionality reduction for text categorization is also discussed. A supervised term selection technique has been presented to boost the performance of text categorization by removing redundant and unimportant terms. Empirical studies have shown that the proposed method improves the quality of text categorization.
Control Number
ISILib-TH426
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
DOI
http://dspace.isical.ac.in:8080/jspui/handle/10263/2146
Recommended Citation
Basu, Tanmay Dr., "On Supervised and Unsupervised Methodologies for Mining of Text Data." (2015). Doctoral Theses. 313.
https://digitalcommons.isical.ac.in/doctoral-theses/313
Comments
ProQuest Collection ID: http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:28843378