Date of Submission
6-11-2026
Date of Award
6-17-2026
Institute Name (Publisher)
Indian Statistical Institute
Document Type
Master's Dissertation
Degree Name
Master of Technology
Subject Name
Computer Science
Department
Linguistic Research Unit (LRU-Kolkata)
Supervisor
Dash, Niladri Sekhar
Co-Supervisor (if any)
Bapuji, Mendem
Abstract (Summary of the Work)
Conversational artificial intelligence has become the primary interface through which hundreds of millions of users in India seek information and customer support. Yet the way these users actually write and speak is fundamentally at odds with the monolingual assumptions baked into most retrieval and generation systems: they code-switch, fluidly mixing one or more of the twenty-two scheduled languages of India with English, frequently typing Indic words in the Roman script ("mera refund kab tak aayega"). Standard Retrieval-Augmented Generation (RAG) pipelines silently fail on such input: the retriever returns off-topic passages because the query and the knowledge base live in different representation spaces, and the generator replies in a register that does not match the user. This thesis presents SETU-RAG (from setu, a bridge), a code-switching-aware multilingual RAG system engineered to run end-to-end on a single commodity GPU (a Google Colab T4 with 16 GB of memory). The system makes three novel contributions. First, a CMI-Adaptive Retrieval Router routes the retrieval strategy by the linguistic profile of the query, namely its Code-Mixing Index (CMI) and matrix language, rather than by reasoning complexity, so monolingual queries stay cheap while genuinely code-mixed queries trigger a cross-lingual fan-out. Second, a Transliteration-Robust Multi-View Query expands every query into up to four parallel views (surface, native-script, matrix-canonical, and English-pivot), each embedded separately, so at least one view lands in the knowledge base's representation space regardless of script. Third, a CMI-Conditioned Generation stage conditions the answer on the user's measured matrix language and mix ratio so that the reply mirrors their register while remaining grounded in retrieved evidence.
Around this text core we build a speech-to-speech (VANI) layer that adds two further contributions, acoustic–lexical language-identification fusion and CMI-conditioned text-to-speech, enabling code-switched voice in and style-matched voice out. We additionally introduce CS-RAGAS, an evaluation harness that augments the standard RAGAS quality axes (faithfulness, answer relevancy, context precision/recall) with code-switching-native metrics: CMI-alignment, language-consistency, and transliteration-robustness. Every model in the pipeline is wired in a real-with-fallback manner: the live path loads strong open-weight models (BGE-M3, BGE-reranker-v2-m3, IndicLID, IndicXlit, IndicTrans2, and a 4-bit instruction-tuned generator), while a deterministic stand-in keeps the entire system runnable offline and on CPU for testing and reproducibility. We describe the design, the mathematical formulation of each stage, the corrective (CRAG) and faithfulness (Self-RAG) gates that guard against the documented failure modes of code-switched retrieval, and the memory policy that keeps peak usage under 16 GB. Experiments on a code-switched customer-support corpus demonstrate that the router produces well-calibrated routing decisions, that the multi-view expansion recovers retrieval hits that a single-view retriever misses, and that the CS-native metrics capture register-mirroring behaviour that conventional metrics miss entirely. The work is a step toward conversational AI that meets Indian users in the language, the script, and the register in which they actually speak.
Control Number
CS2410
DOI
https://dspace.isical.ac.in/items/a5737c2e-4861-4e94-b33b-3c97d5e42653
DSpace Identifier
http://hdl.handle.net/10263/7722
Recommended Citation
Juvale, Ashutosh, "Design and Evaluation of a Code-Switching-Aware Multilingual Conversational AI System using Advanced RAG Architectures" (2026). Master’s Dissertations. 458.
https://digitalcommons.isical.ac.in/masters-dissertations/458