TransDocUNet: A Transformer-based UNet Architecture for Degraded Document Image Binarization

Document Type

Conference Article

Publication Title

ACM International Conference Proceeding Series

Abstract

Enhancing historical document images is critical for improving the quality and legibility of scanned or captured documents. Convolution-based techniques have produced competitive results for document image binarization; however, because of their inherent locality, these models are often limited in explicitly modeling long-range dependencies. Vision Transformers (ViT) have emerged as an alternative design whose global self-attention mechanism addresses this issue, but they can have restricted localization capabilities because they lack low-level details. To address this problem, we propose TransDocUNet, a CNN-Transformer hybrid for document image binarization that combines attention and convolution within a U-Net architecture and serves as a strong alternative to existing solutions. Experimental results on the DIBCO/H-DIBCO datasets show that the proposed method outperforms all existing competing methods in both objective quality metrics and visual quality assessment, achieving state-of-the-art performance in document image binarization. In addition, we conduct an ablation study to understand the role of dilation in the CNN in capturing feature dependencies while also reducing computational cost. The findings helped us arrive at the final model and provide valuable insights into the importance of capturing both global and local contextual information for tasks such as document image enhancement.
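The abstract's point about dilation can be illustrated with a small arithmetic sketch. The code below is not taken from the paper: the function names, the 3x3 kernel, and the dilation rates (1, 2, 4) are assumptions chosen for illustration, and stride 1 is assumed throughout. It uses the standard relations for dilated convolutions to show that dilation widens the receptive field without adding weights.

```python
from typing import Sequence


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by one dilated kernel along a single axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of stacked stride-1 convolutions along a single axis."""
    field = 1
    for dilation in dilations:
        # Each stride-1 layer extends the field by (k - 1) * dilation pixels.
        field += (kernel_size - 1) * dilation
    return field


def weights_per_layer(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weight count of a 2D convolution (bias ignored); independent of dilation."""
    return kernel_size * kernel_size * in_channels * out_channels


if __name__ == "__main__":
    # Hypothetical settings, chosen only to illustrate the trade-off.
    plain = receptive_field(3, [1, 1, 1])
    dilated = receptive_field(3, [1, 2, 4])
    print("plain stack receptive field:", plain)
    print("dilated stack receptive field:", dilated)
    print("weights per 3x3 layer (64 -> 64 channels):", weights_per_layer(3, 64, 64))
```

Under these assumptions, three plain 3x3 layers cover 7 pixels per axis while the dilated stack covers 15, with the same number of weights per layer. Whether TransDocUNet uses these particular rates is not stated in the abstract.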

DOI

10.1145/3627631.3627639

Publication Date

12-15-2023
