Hierarchical Approach to Document Classification of 20 Newsgroup Dataset.

Date of Submission

December 2016

Date of Award

Winter 12-12-2017

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science


Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)


Parui, Swapan Kumar (CVPR-Kolkata; ISI)

Abstract (Summary of the Work)

The aim of the dissertation is to develop an algorithm that classifies the documents of the 20 newsgroup data set into their proper classes. Documents are represented as vectors of unigram counts (the number of times each unigram occurs in a document), and different methods are applied to this representation to find the one that gives the best classification accuracy. A hierarchical structure for classification was followed, and different methods were compared within it to see which gives the best accuracy. Different ways of detecting outliers in the training set were also applied, and these outliers were removed from the training set to improve accuracy.
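
As an illustration of the unigram-count representation mentioned in the abstract, the following is a minimal sketch and not the dissertation's own code. The function names (unigram_counts, to_vector), the whitespace tokenization, the lowercasing, and the two toy documents are all assumptions made for this example.

```python
from collections import Counter


def unigram_counts(text):
    """Count how often each token occurs (assumed: lowercase, whitespace split)."""
    return Counter(text.lower().split())


def to_vector(counts, vocabulary):
    """Map a Counter onto a fixed vocabulary order; absent terms get 0."""
    return [counts.get(term, 0) for term in vocabulary]


# Two toy documents (hypothetical, not taken from the 20 newsgroup data set).
docs = ["the car engine stalled", "the engine of the car"]

counts = [unigram_counts(d) for d in docs]

# The vocabulary is the sorted union of all tokens seen in the documents.
vocabulary = sorted(set().union(*counts))

# One count vector per document, aligned with the vocabulary.
vectors = [to_vector(c, vocabulary) for c in counts]
```

Each document becomes a vector whose entries line up with the shared vocabulary, which is the kind of input a classifier at any level of a hierarchy could consume.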


ProQuest Collection ID: http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:28843176

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.


