Automatic extraction of text and non-text information directly from compressed document images
Document Type
Conference Article
Publication Title
Advances in Intelligent Systems and Computing
Abstract
Text, images, audio, and video make up most of the Big Data generated in today’s tech-savvy world. Such data are preferably archived and transmitted in compressed form for storage and transmission efficiency. Although compression makes data efficient to store and transmit, processing becomes expensive, because the data must be decompressed every time it is processed, and this requires additional computing resources. It would therefore be beneficial if data processing and information extraction could be carried out directly on the compressed data, without decompression. Against this backdrop, the paper demonstrates a novel technique for extracting text and non-text information directly from compressed document images (as supported by the TIFF and PDF formats) using correlation-entropy features computed directly from the compressed representation. The experimental results reported on compressed printed text document images validate the proposed method, and also demonstrate that the text and non-text information extracted from the compressed document is identical to that obtained from the uncompressed representation.
First Page
38
Last Page
46
DOI
10.1007/978-3-319-52941-7_5
Publication Date
1-1-2017
Recommended Citation
Javed, Mohammed; Nagabhushan, P.; and Chaudhuri, Bidyut B., "Automatic extraction of text and non-text information directly from compressed document images" (2017). Conference Articles. 347.
https://digitalcommons.isical.ac.in/conf-articles/347