Application of TF-IDF feature for categorizing documents of online Bangla web text corpus
Advances in Intelligent Systems and Computing
This paper explores the use of standard features and machine learning approaches for categorizing Bangla text documents drawn from an online Web corpus. The TF-IDF feature, combined with a dimensionality reduction technique (40% of TF), is used to improve the precision of lexical matching when identifying the domain category, or class, of a text document. The approach rests on the general observation that text categorization, or text classification, is the task of automatically sorting a set of text documents into predefined categories. Although a wide range of methods has been applied to English texts, few studies have been carried out on texts in Indian languages, including Bangla. Hence, an attempt is made here to analyze how efficiently the categorization method described above works for Bangla text documents. For verification and validation, Bangla text documents obtained from various online Web sources are normalized and used as inputs for the experiment. The experimental results show that the feature extraction method, together with the LIBLINEAR classification model, performs quite satisfactorily, attaining good results on high-dimensional feature sets and relatively noisy document feature vectors.
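The feature pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "40% of TF" means keeping the 40% of terms with the highest corpus-wide term frequency, and it uses one common TF-IDF variant (length-normalized TF times log(N/df)); the abstract specifies neither. The function names and the toy English tokens, which stand in for normalized Bangla tokens, are hypothetical.

```python
import math
from collections import Counter


def select_vocabulary(docs, keep_fraction=0.4):
    """Keep the top `keep_fraction` of terms, ranked by corpus-wide term frequency.

    Assumption: this is one reading of "40% of TF" in the abstract.
    Ties are broken alphabetically so the result is deterministic.
    """
    totals = Counter(tok for doc in docs for tok in doc)
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    keep = max(1, round(keep_fraction * len(ranked)))
    return ranked[:keep]


def tfidf_vectors(docs, vocab):
    """Return one TF-IDF vector per document over the reduced vocabulary."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        vectors.append([counts[t] / length * idf[t] for t in vocab])
    return vectors


# Toy tokenized documents (placeholders for normalized Bangla text).
docs = [
    "cricket match team score".split(),
    "cricket team win match".split(),
    "election vote party minister".split(),
    "party vote election result".split(),
]

vocab = select_vocabulary(docs)
vectors = tfidf_vectors(docs, vocab)
for row in vectors:
    print([round(x, 3) for x in row])
```

The resulting vectors would then be passed to a linear classifier. In scikit-learn, for example, `LinearSVC` is backed by LIBLINEAR, though whether the authors used that interface is not stated here.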
Dhar, Ankita; Dash, Niladri Sekhar; and Roy, Kaushik, "Application of TF-IDF feature for categorizing documents of online Bangla web text corpus" (2018). Conference Articles. 158.