Adaptive feature mixing with Vision Transformers for clinical image analysis

Article Type

Research Article

Publication Title

Applied Soft Computing

Abstract

The Vision Transformer (ViT) is an adaptation of the Transformer architecture that shows promise in image classification. However, limited training samples and the complex attributes of clinical images hinder its performance in identifying medical conditions from such images. To address this challenge, we propose a modified ViT architecture, ReMixViT, which incorporates an efficient MLP-Mixer layer and reorders the residual blocks within the encoder block. This modification improves feature mixing and enhances the model's generalization ability. Additionally, we design two hybrid architectures, Res-ReMixViT and Res-ReMixViT+, which integrate a Convolutional Neural Network (ResNet50) with ReMixViT encoder blocks and use feature maps of single and multiple scales, respectively. We evaluate the proposed architectures on six diverse medical imaging datasets covering varying modalities and medical conditions. Our comparative study reveals that ReMixViT and the hybrid models outperform the vanilla ViT models and hybrid models with ViT encoder blocks, respectively, on widely accepted performance measures. Specifically, we observe improvements of 4.62% and 3.08% in F1-score. Moreover, when combined with data augmentation algorithms, the proposed hybrid architectures surpass other state-of-the-art hybrid networks. In addition to performance evaluation, we provide visual explanations through attention maps and the gradient flow of our model. These visual explanations contribute to the interpretability of the Artificial Intelligence (AI) system, assisting medical practitioners in drawing inferences from an explainable AI perspective. Finally, an extended study demonstrates that the proposed modifications can be successfully adapted to other vision transformer architectures, resulting in enhanced performance.

DOI

10.1016/j.asoc.2025.113259

Publication Date

9-1-2025
