Adaptive feature mixing with Vision Transformers for clinical image analysis
Article Type
Research Article
Publication Title
Applied Soft Computing
Abstract
The Vision Transformer (ViT) is an adaptation of the Transformer architecture that shows promise in image classification. However, limited training samples and the complex attributes of clinical images hinder its performance in identifying medical conditions. To address this challenge, we propose a modified ViT architecture called ReMixViT, which incorporates an efficient MLP-Mixer layer and reorders the residual blocks within the encoder block. This modification improves feature mixing and enhances the model's generalization ability. Additionally, we design two hybrid architectures, Res-ReMixViT and Res-ReMixViT+, by integrating a Convolutional Neural Network (ResNet50) with ReMixViT encoder blocks, considering feature maps of single and multiple scales, respectively. We evaluated the proposed architectures on six diverse medical imaging datasets with varying modalities and medical conditions. Our comparative study reveals that ReMixViT and the hybrid models outperform the vanilla ViT models and hybrid models with ViT encoder blocks, respectively, based on widely accepted performance measures. Specifically, we observe improvements of 4.62% and 3.08% in F1-score. Moreover, when combined with data augmentation algorithms, the proposed hybrid architectures surpass other state-of-the-art hybrid networks. In addition to performance evaluation, we provide visual explanations through attention maps and the gradient flow of our model. These visual explanations contribute to the interpretability of the Artificial Intelligence (AI) system, assisting medical practitioners in drawing inferences from an explainable AI perspective. Finally, an extended study demonstrates that the proposed modifications can be successfully adapted to other vision transformer architectures, resulting in enhanced performance.
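For readers unfamiliar with the feature mixing mentioned in the abstract, the following is a minimal NumPy sketch of a generic MLP-Mixer layer (token mixing followed by channel mixing, each with a residual connection). It is not the authors' ReMixViT block: the abstract does not specify the "efficient" Mixer variant or how the residual blocks are reordered, so every function name, shape, and hyperparameter here is an illustrative assumption, and the learnable scale and shift of layer normalization are omitted for brevity.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, eps=1e-6):
    # Normalize over the last axis (no learnable scale/shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron applied along the last axis.
    return gelu(x @ w1 + b1) @ w2 + b2


def mixer_layer(x, params):
    """Generic MLP-Mixer layer on x of shape (num_patches, channels)."""
    # Token mixing: transpose so the MLP acts across patches for each channel.
    y = layer_norm(x).T                  # (channels, num_patches)
    y = mlp(y, *params["token"]).T       # back to (num_patches, channels)
    x = x + y                            # residual connection
    # Channel mixing: the MLP acts across channels for each patch.
    y = mlp(layer_norm(x), *params["channel"])
    return x + y                         # residual connection


def init_params(num_patches, channels, token_hidden, channel_hidden, seed=0):
    # Hypothetical initializer: small Gaussian weights, zero biases.
    rng = np.random.default_rng(seed)

    def dense(n_in, n_out):
        return rng.normal(0.0, 0.02, (n_in, n_out)), np.zeros(n_out)

    tw1, tb1 = dense(num_patches, token_hidden)
    tw2, tb2 = dense(token_hidden, num_patches)
    cw1, cb1 = dense(channels, channel_hidden)
    cw2, cb2 = dense(channel_hidden, channels)
    return {"token": (tw1, tb1, tw2, tb2), "channel": (cw1, cb1, cw2, cb2)}


if __name__ == "__main__":
    patches = np.random.default_rng(1).normal(size=(16, 8))  # 16 patches, 8 channels
    out = mixer_layer(patches, init_params(16, 8, token_hidden=32, channel_hidden=64))
    print(out.shape)
```

The layer preserves the input shape, so in principle it can be stacked or placed alongside attention inside an encoder block; how ReMixViT actually arranges these pieces is described in the full article, not here.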
DOI
10.1016/j.asoc.2025.113259
Publication Date
9-1-2025
Recommended Citation
Ghosh, Susmita and Das, Swagatam, "Adaptive feature mixing with Vision Transformers for clinical image analysis" (2025). Journal Articles. 5228.
https://digitalcommons.isical.ac.in/journal-articles/5228