Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network
Document Type
Conference Article
Publication Title
Proceedings of the Aaai Conference on Artificial Intelligence
Abstract
Scene text recognition is inherently a vision-language task. However, previous works have predominantly focused either on extracting more robust visual features or designing better language modeling. How to effectively and jointly model vision and language to mitigate heavy reliance on a single modality remains a problem. In this paper, aiming to enhance vision-language reasoning in scene text recognition, we present a balanced, unified and synchronized vision-language reasoning network (BUSNet). Firstly, revisiting the image as a language by balanced concatenation along length dimension alleviates the issue of over-reliance on vision or language. Secondly, BUSNet learns an ensemble of unified external and internal vision-language model with shared weight by masked modality modeling (MMM). Thirdly, a novel vision-language reasoning module (VLRM) with synchronized vision-language decoding capacity is proposed. Additionally, BUSNet achieves improved performance through iterative reasoning, which utilizes the vision-language prediction as a new language input. Extensive experiments indicate that BUSNet achieves state-of-the-art performance on several mainstream benchmark datasets and more challenge datasets for both synthetic and real training data compared to recent outstanding methods. Code and dataset will be available at https://github.com/jjwei66/BUSNet.
First Page
5885
Last Page
5893
DOI
10.1609/aaai.v38i6.28402
Publication Date
3-25-2024
Recommended Citation
Wei, Jiajun; Zhan, Hongjian; Lu, Yue; Tu, Xiao; Yin, Bing; Liu, Cong; and Pal, Umapada, "Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network" (2024). Conference Articles. 870.
https://digitalcommons.isical.ac.in/conf-articles/870
Comments
Open Access; Gold Open Access