No-Reference Quality Assessment for OCR'D Documents.
Date of Submission
December 2016
Date of Award
Winter 12-12-2017
Institute Name (Publisher)
Indian Statistical Institute
Document Type
Master's Dissertation
Degree Name
Master of Technology
Subject Name
Computer Science
Department
Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)
Supervisor
Garain, Utpal (CVPR-Kolkata; ISI)
Abstract (Summary of the Work)
This thesis deals with predicting the quality of a text document generated by an OCR system. Because OCR systems are prone to mistakes when converting an imaged document to machine-readable form, this research is concerned with finding errors in OCR’d text and classifying the OCR’d document accordingly. So far, the OCR community has dealt with this problem by one of two methods: (i) manual labeling of the errors or (ii) comparing the OCR’d document against the true text file. Manual counting of errors is infeasible in commercial settings, whereas the true text file is often not available for comparison. This work attempts to develop methods for automatic quality prediction of OCR’d documents under the no-reference condition. Bengali has been taken as the reference language. The use of lexicons and language models has been explored in several directions. Experiments with a large corpus of OCR’d documents show that a lexicon-based approach coupled with a suitable edit distance measure could be a viable method for no-reference quality assessment of OCR’d documents.
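As a rough illustration of what a lexicon-based, no-reference score with an edit distance measure might look like, the sketch below gives each out-of-lexicon token a penalty based on its distance to the nearest lexicon word. It is not the method of the dissertation: the function names `levenshtein` and `quality_score`, the plain Levenshtein distance, the length normalization, and the averaging are all assumptions made for illustration, and the distance is computed over Unicode code points, which may not match how Bengali characters should be compared.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance over Unicode code points (an assumption)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def quality_score(tokens, lexicon):
    """Hypothetical no-reference score in [0, 1]; 1.0 means every token is in the lexicon."""
    if not lexicon:
        raise ValueError("lexicon must not be empty")
    if not tokens:
        return 1.0
    penalty = 0.0
    for token in tokens:
        if token in lexicon:
            continue
        # Distance to the closest lexicon entry, normalized by token length
        # and capped at 1 so a single token cannot dominate the score.
        nearest = min(levenshtein(token, word) for word in lexicon)
        penalty += min(1.0, nearest / max(len(token), 1))
    return 1.0 - penalty / len(tokens)
```

A document could then be scored with `quality_score(ocr_text.split(), lexicon)` and assigned to a quality class by thresholding the result; the thresholds and the tokenization would have to be chosen for the corpus at hand.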
Control Number
ISI-DISS-2016-352
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
DOI
http://dspace.isical.ac.in:8080/jspui/handle/10263/6512
Recommended Citation
Biswas, Arnab, "No-Reference Quality Assessment for OCR'D Documents." (2017). Master’s Dissertations. 96.
https://digitalcommons.isical.ac.in/masters-dissertations/96
Comments
ProQuest Collection ID: http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:28843110