A Deep OCR for Degraded Bangla Documents

Article Type

Research Article

Publication Title

ACM Transactions on Asian and Low-Resource Language Information Processing


Despite the significant success of document image analysis techniques, efficient Optical Character Recognition (OCR) of degraded document images still remains an open problem. Although a body of work has been reported on degraded document recognition for English language, only little attention has been paid to Indic scripts. In this work, we focus on developing a degraded OCR for Bangla, a major Indian language. In general, an OCR system includes segmentation of the foreground text part from the background followed by recognition of the extracted text. The text segmentation module aims to assign the foreground or background label to each pixel of the document image. In this paper, we present a new OCR system which is particularly suitable for degraded quality Bangla document images. The contribution is two fold. In the first phase, we use a semi-supervised Markov Random Field (MRF)-based Generative Adversarial Network (GAN) model (which we call MRF-GAN) for foreground segmentation of texts from degraded text. In the proposed MRF-GAN, we extend the concept of GAN to a multitask learning mechanism where discriminator-classifier networks differentiate between real/fake images and also assign a foreground or background label to each pixel. In the second phase, we propose to use a new encoder-decoder based recognizer that incorporates an attention-based character to a word prediction model, which has the capability of minimizing Word Error Rate (WER). We optimize this network using a Multitask based Transfer Learning scheme (MTTL). We perform experiments on a publicly available degraded Bangla document image dataset as well as on a new degraded printed Hindi document image dataset, which has been created as a part of the present study. Results of the experimentations demonstrate the efficacy of the proposed OCR.



Publication Date


This document is currently not available here.