Classification of text documents through distance measurement: An experiment with multi-domain Bangla text documents
Document Type
Conference Article
Publication Title
Proceedings - 2017 3rd International Conference on Advances in Computing, Communication and Automation (Fall), ICACCA 2017
Abstract
This paper explores the use of two similarity measures for categorizing Bangla text documents into their respective domains. Cosine Similarity and Euclidean Distance are used as the similarity measures on a vector space model based on TF-IDF features. The domains of interest are Business, State, Medical, Sports, and Science, whose texts are used as inputs for analysis. Recognition accuracies of 95.80% for Cosine Similarity and 95.20% for Euclidean Distance are achieved on 1000 text documents. This confirms that an unsupervised feature extraction technique may be treated as one of the useful methods for automatic text classification in Bangla (and for other Indian language documents) if the input texts are not pre-classified based on certain predefined linguistic or statistical parameters. Comparative experiments on the dataset using several classification algorithms show that the distance measures perform better compared to the other classifiers.
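As a rough illustration of the two measures named in the abstract, the following minimal Python sketch computes TF-IDF vectors and compares a query document with labeled documents using cosine similarity and Euclidean distance. The abstract does not state how the measures are turned into a class decision, so assigning the label of the single closest document is an assumption made here. The smoothed IDF formula, the function names, and the toy English tokens (standing in for tokenized Bangla text) are likewise hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute a smoothed IDF weight for every term in the tokenized docs."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Assumption: smoothed IDF, so common terms keep a small positive weight.
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def vectorize(doc, idf):
    """Map a tokenized document to a sparse TF-IDF vector (dict term -> weight)."""
    counts = Counter(doc)
    total = len(doc) or 1  # guard against empty documents
    # Terms never seen when fitting the IDF table are ignored.
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two sparse vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def classify(query_vec, train_vecs, labels, measure="cosine"):
    """Return the label of the closest training document (assumed decision rule)."""
    if measure == "cosine":
        # Higher similarity means closer.
        scores = [cosine_similarity(query_vec, v) for v in train_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
    else:
        # Lower distance means closer.
        scores = [euclidean_distance(query_vec, v) for v in train_vecs]
        best = min(range(len(scores)), key=scores.__getitem__)
    return labels[best]


# Hypothetical toy corpus; real input would be tokenized Bangla documents.
train_docs = [
    ["market", "stock", "bank", "profit"],
    ["match", "team", "goal", "player"],
    ["doctor", "patient", "medicine", "hospital"],
]
train_labels = ["Business", "Sports", "Medical"]

idf = fit_idf(train_docs)
train_vecs = [vectorize(d, idf) for d in train_docs]
query_vec = vectorize(["team", "player", "win"], idf)

print(classify(query_vec, train_vecs, train_labels, measure="cosine"))
print(classify(query_vec, train_vecs, train_labels, measure="euclidean"))
```

Cosine similarity depends only on the direction of the vectors, whereas Euclidean distance is also sensitive to their magnitude, so the two measures can rank documents of very different lengths differently.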
First Page
1
Last Page
6
DOI
10.1109/ICACCAF.2017.8344721
Publication Date
7-2-2017
Recommended Citation
Dhar, Ankita; Dash, Niladri Sekhar; and Roy, Kaushik, "Classification of text documents through distance measurement: An experiment with multi-domain Bangla text documents" (2017). Conference Articles. 214.
https://digitalcommons.isical.ac.in/conf-articles/214