No-Reference Quality Assessment for OCR'd Documents

Date of Submission

December 2016

Date of Award

Winter 12-12-2017

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science

Department

Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)

Supervisor

Garain, Utpal (CVPR-Kolkata; ISI)

Abstract (Summary of the Work)

This thesis deals with predicting the quality of a text document generated by an OCR system. Because OCR systems are prone to mistakes when converting an imaged document to machine-readable form, this research is concerned with finding errors in OCR’d text and classifying the OCR’d document accordingly. So far, the OCR community has addressed this problem with one of two methods: (i) manual labeling of the errors, or (ii) comparing the OCR’d document against the true text file. Manual counting of errors is infeasible in commercial settings, whereas the true text file is often not available for comparison. This work attempts to develop methods for automatically predicting the quality of OCR’d documents under the no-reference condition. Bengali has been taken as the reference language. The use of lexicons and language models has been explored in several directions. Experiments with a large corpus of OCR’d documents show that a lexicon-based approach coupled with a suitable edit distance measure could be a viable method for no-reference quality assessment of OCR’d documents.
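
The general idea of pairing a lexicon with an edit distance can be illustrated with a minimal sketch. This is not the thesis's actual method, which the abstract does not specify. The function names, the length-normalized scoring rule, and the toy English lexicon are all assumptions made for illustration; plain Levenshtein distance stands in for the "suitable edit distance measure".

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), cost 1 each."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def token_score(token: str, lexicon: set) -> float:
    """Hypothetical per-token score: 1.0 for a lexicon word, otherwise
    1 minus the smallest length-normalized distance to any lexicon word."""
    if token in lexicon:
        return 1.0
    best = min(levenshtein(token, word) / max(len(token), len(word))
               for word in lexicon)
    return 1.0 - best


def document_quality(tokens: list, lexicon: set) -> float:
    """Hypothetical document score in [0, 1]: the mean token score.
    No ground-truth text is consulted, only the lexicon."""
    if not tokens or not lexicon:
        raise ValueError("tokens and lexicon must both be non-empty")
    return sum(token_score(t, lexicon) for t in tokens) / len(tokens)


if __name__ == "__main__":
    toy_lexicon = {"cat", "dog"}        # assumed toy lexicon
    ocr_tokens = ["cat", "dxg"]         # "dxg" mimics an OCR error
    print(document_quality(ocr_tokens, toy_lexicon))
```

A document could then be assigned a quality class by thresholding this score; the thresholds would have to be chosen empirically. For Bengali text, the distance would likely need to operate on suitably normalized characters or grapheme clusters rather than raw code points, which this sketch does not attempt.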

Comments

ProQuest Collection ID: http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:28843110

Control Number

ISI-DISS-2016-352

Creative Commons License

Creative Commons Attribution 4.0 International License

DOI

http://dspace.isical.ac.in:8080/jspui/handle/10263/6512

