NN-based analytic approach to symbol level recognition for degraded Bengali printed documents
Sadhana - Academy Proceedings in Engineering Sciences
Analysis of degraded printed documents has been a research topic for the last several years. The contribution of this article lies in the segmentation of word images into symbols and the recognition of those symbols in degraded printed document images of Bengali, the seventh most widely spoken language in the world. A novel approach to symbol level segmentation based on a Multilayer Perceptron (MLP) network is proposed. A database of segmenting and non-segmenting image columns is developed from the ISIDDI page level database, and segmentation is treated as a two-class classification problem. The MLP weights are learnt from this database using the backpropagation algorithm. We introduce certain new metrics, based on which the F-score of the proposed segmentation algorithm is determined. Our method utilizes the information that is relevant for character segmentation and ignores other highly variable information contained in a printed text document, thus allowing for efficient transfer learning between datasets and alleviating the need for labelled training data. Besides Bengali, we have tested the method on English, Tamil and Devanagari scripts. For classification, we have identified 336 symbols and developed the corresponding training and test sets from the ISIDDI database. Two classifiers, one CNN based and the other LSTM based, have been developed for this 336-class problem. The classification accuracies obtained on the test set by the CNN classifier and the LSTM classifier are 86.05% and 88.11%, respectively. The proposed classifiers outperform the existing classifiers for the ISIDDI database.
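The idea of treating segmentation as a two-class problem over image columns can be illustrated with a minimal sketch. It is not the authors' implementation: the window features, the hidden layer size, the learning rate, the toy word image and the plain F-score below are all assumptions chosen for illustration, and every function name (`column_windows`, `train_mlp`, `f_score`) is hypothetical. Each column of a word image becomes one sample, labelled 1 if it is a segmenting column and 0 otherwise, and a small MLP is trained on these samples with backpropagation.

```python
import numpy as np

def column_windows(img, half_width=2):
    """One feature vector per image column: the flattened neighbourhood of
    2*half_width+1 columns, zero-padded at the left and right edges.
    (Assumed feature design, not taken from the article.)"""
    h, w = img.shape
    padded = np.pad(img, ((0, 0), (half_width, half_width)))
    width = 2 * half_width + 1
    return np.stack([padded[:, c:c + width].ravel() for c in range(w)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    """Binary cross-entropy, clipped to avoid log(0)."""
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def train_mlp(X, y, n_hidden=8, lr=0.1, epochs=500, seed=0):
    """Train a one-hidden-layer MLP with full-batch backpropagation.
    Returns a predict function and the loss recorded at each epoch."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, n_hidden)
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        # Forward pass
        h = np.tanh(X @ W1 + b1)
        p = sigmoid(h @ W2 + b2)
        losses.append(bce(p, y))
        # Backward pass (gradients of the mean cross-entropy)
        dz2 = (p - y) / len(y)
        dW2 = h.T @ dz2
        db2 = dz2.sum()
        dz1 = np.outer(dz2, W2) * (1.0 - h ** 2)
        dW1 = X.T @ dz1
        db1 = dz1.sum(axis=0)
        # Gradient descent update
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    def predict(X_new):
        h_new = np.tanh(X_new @ W1 + b1)
        return (sigmoid(h_new @ W2 + b2) >= 0.5).astype(int)

    return predict, losses

def f_score(y_true, y_pred):
    """Standard F-score from precision and recall. The article defines its
    own metrics for segmentation; this plain version is a stand-in."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy word image (1 = ink): three symbols separated by two blank columns.
img = np.ones((8, 12))
img[:, [3, 7]] = 0.0
y = (img.sum(axis=0) == 0).astype(float)  # 1 marks a segmenting column
X = column_windows(img)

predict, losses = train_mlp(X, y)
print("F-score on the toy image:", f_score(y.astype(int), predict(X)))
```

Real degraded documents would need labelled segmenting and non-segmenting columns such as those the authors derive from the ISIDDI database. The toy image here only shows the shape of the data and of the training loop.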
Mukherjee, Jayati; Parui, Swapan K.; and Roy, Utpal, "NN-based analytic approach to symbol level recognition for degraded Bengali printed documents" (2020). Journal Articles. 35.