Date of Submission
2025
Date of Award
6-2025
Institute Name (Publisher)
Indian Statistical Institute
Document Type
Master's Dissertation
Degree Name
Master of Technology
Subject Name
Computer Science
Department
Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)
Supervisor
Majumdar, Debapriyo
Co-Supervisor (if any)
Panuganti, Rajkiran
Abstract (Summary of the Work)
Retrieval-Augmented Generation (RAG) has become a popular technique for enhancing Large Language Models (LLMs) with access to external information sources. However, the success of a RAG system depends critically on the relevance and quality of the retrieved documents. In particular, supplying irrelevant or noisy context can degrade downstream generation quality. To address this, our project focuses on improving the document filtering stage of a RAG pipeline through binary relevance classification: deciding whether a retrieved document is suitable for inclusion in the final context window, based on its usefulness in directly answering the user query. We explore a wide range of approaches to this task, including rule-based retrieval methods (TF-IDF, BM25), classical machine learning classifiers (logistic regression, SVM), deep neural networks, and LLM-based methods in both zero-shot and few-shot settings. Our final pipeline uses instruction-tuned LLMs as strict binary classifiers, prioritizing precision over recall so that only the most relevant and high-quality documents are passed to the generation module. Experiments are conducted on a Reddit-based query-document dataset tailored to subjective and opinion-heavy queries. Our evaluations suggest that LLMs, even without fine-tuning, can outperform traditional methods in this setting, offering a strong foundation for further enhancement through supervised fine-tuning.
Control Number
CS2325
DOI
https://dspace.isical.ac.in/items/eb4c11e8-8db4-4d72-82da-1cf4c704afb2
DSpace Identifier
http://hdl.handle.net/10263/7587
Recommended Citation
Saha, Sreyan, "Binary Document Filtering for Retrieval-Augmented Generation" (2025). Master’s Dissertations. 433.
https://digitalcommons.isical.ac.in/masters-dissertations/433