Classification of text documents through distance measurement: An experiment with multi-domain Bangla text documents

Document Type

Conference Article

Publication Title

Proceedings of the Third International Conference on Advances in Computing, Communication and Automation (Fall), ICACCA 2017

Abstract

This paper explores the use of two similarity measures for categorizing Bangla text documents into their respective domains. Cosine Similarity and Euclidean Distance have been used as the similarity measures on a vector space model based on TF-IDF features. The domains of interest are Business, State, Medical, Sports, and Science, and texts from these domains are used as inputs for analysis. Recognition accuracies of 95.80% for Cosine Similarity and 95.20% for Euclidean Distance are achieved on 1000 text documents. This confirms that an unsupervised feature extraction technique may be treated as one of the useful methods for automatic text classification in Bangla (and for other Indian language documents) if the input texts are not pre-classified based on certain predefined linguistic or statistical parameters. Comparative experiments on the dataset using several classification algorithms show that the distance measures perform better than the other classifiers.
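
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exact decision rule, so the sketch assumes that a document receives the label of its closest labeled reference document, it assumes a plain ln(N/df) IDF weighting, and it uses a few English placeholder tokens in place of the Bangla corpus. All function names (`tfidf_vectors`, `cosine_similarity`, `euclidean_distance`, `classify`) are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenized document.

    Assumed weighting: tf = count / document length, idf = ln(N / df).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def classify(query_vec, ref_vecs, labels, measure):
    """Assumed rule: return the label of the closest reference document."""
    indices = range(len(ref_vecs))
    if measure == "cosine":
        # Higher similarity means closer.
        best = max(indices, key=lambda i: cosine_similarity(query_vec, ref_vecs[i]))
    else:
        # Lower distance means closer.
        best = min(indices, key=lambda i: euclidean_distance(query_vec, ref_vecs[i]))
    return labels[best]


# Toy placeholder data, not taken from the paper's dataset.
reference_docs = [
    ["match", "team", "goal", "score"],
    ["team", "player", "goal", "win"],
    ["market", "stock", "price", "bank"],
    ["bank", "loan", "price", "trade"],
]
reference_labels = ["sports", "sports", "business", "business"]
query_doc = ["player", "goal", "team", "match"]

# Fit TF-IDF over the references plus the query so they share one vocabulary.
all_vectors = tfidf_vectors(reference_docs + [query_doc])
reference_vectors, query_vector = all_vectors[:-1], all_vectors[-1]

cosine_label = classify(query_vector, reference_vectors, reference_labels, "cosine")
euclidean_label = classify(query_vector, reference_vectors, reference_labels, "euclidean")
```

On this toy input both measures assign the query to the sports label, because the query shares no terms with the business documents. The two measures can disagree on real data, since Euclidean distance is sensitive to vector length while cosine similarity is not.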

First Page

1

Last Page

6

DOI

10.1109/ICACCAF.2017.8344721

Publication Date

4-20-2018
