Date of Submission


Date of Award


Institute Name (Publisher)

Indian Statistical Institute

Document Type

Doctoral Thesis

Degree Name

Doctor of Philosophy

Subject Name

Computer Science


Machine Intelligence Unit (MIU-Kolkata)


Murthy, C. A. (MIU-Kolkata; ISI)

Abstract (Summary of the Work)

This thesis discusses supervised and unsupervised text mining methodologies for plain-text English data. It develops new supervised and unsupervised methods that overcome some limitations of existing techniques.

The problems of unsupervised text mining, i.e., document clustering, are addressed. A new similarity measure between documents is designed to measure content similarity more accurately. A hierarchical document clustering technique is then built on this similarity measure. Its main significance is that the number of clusters is determined automatically by varying a similarity threshold of the proposed measure. Experimentally, the algorithm outperforms several other document clustering techniques, but it is computationally expensive. A hybrid document clustering technique is therefore designed using the same similarity measure to reduce that computational burden; it performs better than the hierarchical algorithm on most of the corpora.

The limitations of the nearest neighbor decision rule for text categorization are discussed. An efficient nearest neighbor decision rule is designed to improve the quality of text categorization. Its significance is that it does not categorize a document when the decision is not sufficiently certain. The method gives better results than several other classifiers for text categorization. The decision rule is also implemented with the proposed similarity measure in place of the traditional one, and this variant performs better than the version using the traditional similarity measure.

The importance of dimensionality reduction for text categorization is also discussed. A supervised term selection technique is presented that boosts the performance of text categorization by removing redundant and unimportant terms. Empirical studies show that the proposed method improves the quality of text categorization.
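The abstract does not spell out the proposed decision rule, so the following is only a minimal sketch of the general idea of a nearest neighbor classifier that declines to label a document when the vote is not decisive. Everything in it is an assumption made for illustration: the plain cosine similarity over term counts (not the thesis's similarity measure), the vote-margin rejection test, the names `cosine`, `knn_with_reject`, `k`, and `min_margin`, and the toy `TRAINING` corpus.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_with_reject(doc, training, k=3, min_margin=2):
    """Label `doc` by majority vote of its k most similar training documents.

    Returns None (no decision) when the winning label leads the runner-up
    by fewer than `min_margin` votes. This rejection test is an illustrative
    assumption, not the rule proposed in the thesis.
    """
    query = Counter(doc.lower().split())
    scored = sorted(
        ((cosine(query, Counter(text.lower().split())), label)
         for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k]).most_common()
    top_votes = votes[0][1]
    runner_up_votes = votes[1][1] if len(votes) > 1 else 0
    if top_votes - runner_up_votes >= min_margin:
        return votes[0][0]
    return None


# Hypothetical toy corpus, for illustration only.
TRAINING = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("market shares rally as bank rates hold", "finance"),
    ("team wins football match", "sport"),
    ("player scores goal in football final", "sport"),
    ("football team signs new player", "sport"),
]

print(knn_with_reject("bank stock market shares rally", TRAINING))
print(knn_with_reject("football player buys market shares", TRAINING))
```

In this toy setup the first query is labeled because all three nearest neighbors agree, while the second, which mixes vocabulary from both categories, is left uncategorized because the vote is split two to one. Raising `min_margin` makes the classifier more cautious at the cost of leaving more documents unlabeled.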


ProQuest Collection ID:

Control Number


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.


Included in

Mathematics Commons