Split-net: Dual transformer encoder with splitting scene text image for script identification
Article Type
Research Article
Publication Title
Pattern Recognition Letters
Abstract
Script identification is vital for understanding scene and video images. It is challenging because of high variation in physical appearance and typeface design, complex backgrounds, distortion, and significant overlap in the characteristics of different scripts. Unlike existing models, which process the scene text image as a whole, we propose splitting the image into upper and lower halves to capture the intricate differences in stroke and style among scripts. Motivated by the success of the transformer, we explore a modified script-style-aware Mobile-Vision Transformer (M-ViT) for encoding the visual features of the images. To enrich the features of the transformer blocks, a novel Edge Enhanced Style Aware Channel Attention Module (EESA-CAM) is integrated with M-ViT. Furthermore, the model fuses the features of the dual encoders (one extracting features from the upper half of the image, the other from the lower half) through a dynamic weighted average that uses the gradient information of the encoders as the weights. In experiments on three standard datasets, MLe2e, CVSI2015, and SIW-13, the proposed model outperformed state-of-the-art models.
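The split-and-fuse idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the weighting formula, so the sketch assumes that each encoder's weight is the L2 norm of its gradient, normalized so the two weights sum to one. The function names `split_halves` and `fuse_features` are hypothetical, and the encoders themselves (M-ViT with EESA-CAM) are not modeled; their feature vectors and gradients are taken as given inputs.

```python
import numpy as np


def split_halves(image):
    """Split an H x W (x C) image array into upper and lower halves by height.

    For an odd height, the extra row goes to the lower half.
    """
    mid = image.shape[0] // 2
    return image[:mid], image[mid:]


def fuse_features(feat_upper, feat_lower, grad_upper, grad_lower):
    """Weighted average of two feature vectors.

    Assumption (not from the source): each weight is the L2 norm of the
    corresponding encoder's gradient, normalized to sum to one. If both
    gradients are zero, the two encoders are weighted equally.
    """
    w_upper = float(np.linalg.norm(grad_upper))
    w_lower = float(np.linalg.norm(grad_lower))
    total = w_upper + w_lower
    if total == 0.0:
        w_upper, w_lower = 0.5, 0.5
    else:
        w_upper, w_lower = w_upper / total, w_lower / total
    return w_upper * feat_upper + w_lower * feat_lower


# Usage with placeholder data standing in for a text image and encoder outputs.
image = np.arange(24, dtype=float).reshape(6, 4)
upper, lower = split_halves(image)
fused = fuse_features(
    feat_upper=np.ones(4),
    feat_lower=np.zeros(4),
    grad_upper=np.array([3.0, 4.0]),
    grad_lower=np.array([3.0, 4.0]),
)
```

Because the weights are recomputed from the gradients on every call, the encoder with the larger gradient magnitude contributes more to the fused feature under this assumption; other normalizations (for example a softmax over gradient norms) would fit the abstract's description equally well.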
First Page
100
Last Page
108
DOI
10.1016/j.patrec.2025.05.026
Publication Date
10-1-2025
Recommended Citation
Roy, Ayush; Palaiahnakote, Shivakumara; Pal, Umapada; and Liu, Cheng Lin, "Split-net: Dual transformer encoder with splitting scene text image for script identification" (2025). Journal Articles. 5600.
https://digitalcommons.isical.ac.in/journal-articles/5600