Categorization of Bangla Web Text Documents Based on TF-IDF-ICF Text Analysis Scheme

Document Type

Conference Article

Publication Title

Communications in Computer and Information Science

Abstract

With the rapid growth and wide availability of digital text data, automatic text categorization (or classification) has become an effective way to organize and manage textual information. It is the process of automatically assigning a text document to one of a predefined set of categories. Although many methods have been applied to English text documents, few studies have addressed Indian-language texts, including Bangla. Against this background, this paper analyzes the efficiency of several existing text classification methods and proposes to supplement them with a new analysis method for classifying Bangla text documents obtained from online web sources. The paper argues that adding an Inverse Class Frequency (ICF) measure to the Term Frequency (TF) and Inverse Document Frequency (IDF) measures can improve feature extraction for a language like Bangla. The combination of the three measures generates a set of features that is used to train a Multilayer Perceptron (MLP) classifier, which produces promising results in assigning text documents to their respective domains and categories. A comparison with other classifiers shows that this approach achieves higher accuracy on Bangla text documents. The MLP is also expected to perform satisfactorily on high-dimensional and relatively noisy feature vectors.
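
The TF-IDF-ICF weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this example assumes one common formulation: relative term frequency multiplied by log(1 + N/df) and log(1 + C/cf), where N is the number of documents, df the number of documents containing the term, C the number of classes, and cf the number of classes in which the term occurs. The function name `tf_idf_icf` and the toy English tokens are hypothetical and stand in for tokenized Bangla text; the paper's own weighting may differ.

```python
import math
from collections import Counter, defaultdict


def tf_idf_icf(docs, labels):
    """Return one {term: weight} dict per document.

    docs   -- list of token lists (already tokenized)
    labels -- class label of each document, parallel to docs
    Assumed formula (not taken from the paper):
        weight = tf * log(1 + N/df) * log(1 + C/cf)
    """
    n_docs = len(docs)
    n_classes = len(set(labels))

    doc_freq = Counter()           # documents containing each term
    class_sets = defaultdict(set)  # classes in which each term occurs
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_sets[term].add(label)

    weighted = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vec = {}
        for term, count in counts.items():
            tf = count / total                                     # relative frequency
            idf = math.log(1 + n_docs / doc_freq[term])            # rarity across documents
            icf = math.log(1 + n_classes / len(class_sets[term]))  # rarity across classes
            vec[term] = tf * idf * icf
        weighted.append(vec)
    return weighted


# Toy corpus (hypothetical): two "sports" documents and one "politics" document.
docs = [
    ["cricket", "match", "today"],
    ["cricket", "score", "today"],
    ["election", "vote", "today"],
]
labels = ["sports", "sports", "politics"]
weights = tf_idf_icf(docs, labels)
```

In this toy corpus, "today" occurs in every document and in both classes, so it receives a lower weight than "cricket", which is confined to one class. The resulting weight dictionaries would still need to be mapped onto a fixed vocabulary to form the feature vectors on which a classifier such as an MLP is trained.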

First Page

477

Last Page

484

DOI

10.1007/978-981-13-1343-1_39

Publication Date

1-1-2018
