Institute Name (Publisher)

Indian Statistical Institute

Document Type

Doctoral Thesis

Degree Name

Doctor of Philosophy

Subject Name

Computer Science


Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)


Chaudhuri, Bidyut Baran (CVPR-Kolkata; ISI)

Abstract (Summary of the Work)

This thesis concerns the development of OCR for machine-printed text in an Indian language, Bangla (Bengali), which is the fourth-most popular language in the world and the second-most popular language in India.

1.1 Optical Character Recognition

Optical Character Recognition (OCR) is the automatic computer recognition of characters in optically scanned and digitized pages of text. OCR is one of the most fascinating and challenging areas of pattern recognition, with various practical applications. It can contribute tremendously to the advancement of automation and can improve the interface between man and machine in many applications, including office automation and data entry.

The input of an OCR system consists of text on paper. The output is a coded file with an ASCII (American Standard Code for Information Interchange) or other character code representation, as well as special symbols for unrecognized or doubtfully recognized patterns. The output can be reformatted for input to a word processing or page layout system, or used directly in information retrieval, image filing, remittance processing, document sorting, or other downstream applications.

Some potential practical applications of an OCR system are as follows:

• Reading aid for the blind (OCR + speech synthesis).
• Automatic text entry into the computer, desktop publication, library cataloging, ledgering, etc.
• Automatic reading for sorting of postal mail, bank cheques and other documents.
• Document data compression: from document image to ASCII format.
• Language processing.
• Multi-media system design.

Several factors can create difficulties in the development of an OCR system. Some of those pertaining to machine-printed documents are listed below.

Shape similarity: Different characters can have similar shapes. Some examples of similarly shaped characters in Bangla are GH( ) and J,( ), KH(খ) and TH(থ), B(ব) and R(র), U(উ) and U,(উ), etc. (Here GH; J,; KH etc. are the ASCII codes of Bangla characters used for representing them in the computer. Chapter 2 shows the complete set of Bangla characters with their ASCII codes.) Because of shape similarity, one character may be recognized as another. Also, due to font, size and style variations, a single character may have different shapes, which also contributes to the recognition problem.

Deformation of image: Noise is the main cause of image deformation. It produces isolated dots, disconnected line segments, holes, and breaks in lines. Noise may result from improper scanning or poor paper quality. Photocopied documents may also contain noise.

Document skew: There may be natural skew, or skew imposed during scanning of the document, that creates difficulty in recognition.

Effect of resolution: If a document is digitized at an inappropriate Dots Per Inch (DPI) setting, the character recognition rate may suffer.

Presence of graphics, tables, halftone pictures and mathematical symbols: A document may contain graphics, tables, halftone pictures, mathematical symbols, etc., which should be separated from the basic text material before recognition. Some of this material, such as tables and mathematical and chemical expressions, may need special treatment for recognition.

Complex document structure: A document page may have a complex physical layout, which should be segmented before subjecting each segment to recognition.

Complex background texture and colour: The background over which the text is printed may not have a uniform white texture.


Creative Commons License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.


Included in

Mathematics Commons