Date of Submission

2025

Date of Award

6-11-2025

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science

Department

Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)

Supervisor

Bhattacharya, Ujjwal

Abstract (Summary of the Work)

Artificial intelligence (AI) strategies such as multimodal learning, which integrates inputs from multiple modalities (e.g., image and text), have shown significant promise in medical applications. In this dissertation, we present our study of a Multimodal Large Language Model (MLLM) designed for Visual Question Answering (VQA) in the medical domain, which uses both image and text inputs to improve diagnostic reasoning and decision support. Our model processes medical images (e.g., chest X-rays, CT scans, and ultrasound images) along with clinical text to answer complex, domain-specific questions. We employ a cross-modal fusion mechanism to align visual features with textual embeddings, enabling the model to generate accurate and contextually relevant responses. We study two datasets: the ImageCLEF 2019 medical VQA dataset and the MED-GRIT-270K dataset. First, on the ImageCLEF 2019 medical VQA dataset, our approach outperforms existing multimodal baselines on the same dataset, achieving state-of-the-art results in diagnostic precision and interpretability. Furthermore, to address the limitations of existing datasets, we reformat ImageCLEF 2019 VQA into a descriptive answer-style dataset and fine-tune a vision large language model (Vision-LLM) on this enhanced dataset to improve its medical reasoning capabilities. Second, to specialize the model for chest X-ray analysis, we extract a subset of radiology images and paired text from the MED-GRIT-270K dataset, then fine-tune the Vision-LLM to create a robust chest X-ray AI system.

Control Number

CS2326

DOI

https://dspace.isical.ac.in/items/a1aa2602-0080-49de-acaf-27348f27568b

DSpace Identifier

http://hdl.handle.net/10263/7562

Recommended Citation

Singha, Srimanta, "Multi-Modal Large Language Model for Visual Question Answering on Medical Domain" (2025). Master’s Dissertations. 454.
https://digitalcommons.isical.ac.in/masters-dissertations/454

Included in

Computer Sciences Commons
