Date of Submission
6-11-2026
Date of Award
6-15-2026
Institute Name (Publisher)
Indian Statistical Institute
Document Type
Master's Dissertation
Degree Name
Master of Technology
Subject Name
Computer Science
Department
Electronics and Communication Sciences Unit (ECSU-Kolkata)
Supervisor
Das, Swagatam
Abstract (Summary of the Work)
Token-level adversarial perturbations remain one of the most efficient known attacks against the safety alignment of instruction-tuned large language models (LLMs). Among recent work, the UniBreak framework (You et al., 2026) stands out for unifying gradient-based optimization with an evolutionary perturbation repository. However, its repository relies solely on accumulated success frequency without using query content, and its fitness function implicitly assumes that suppressing refusal tokens is sufficient to elicit harmful responses. In this dissertation, we extend UniBreak along both axes and re-evaluate the framework under stricter generalization and judgment protocols. Specifically, we introduce a semantic perturbation repository that replaces frequency-only retrieval with a geometric interpolation between historical frequency and sentence-encoder cosine similarity. Furthermore, we use Harmful-Intent Direction Suppression (HIDS) to augment the fitness function by explicitly penalizing the model’s residual-stream projection onto a validated harmful-intent direction. To isolate genuine cross-query generalization from within-dataset memorization, we introduce a two-phase frozen-repository evaluation protocol. Results are evaluated under two complementary judges: a binary classification judge and a 0-10 actionability scoring judge. The scoring judge itself is subsequently analysed through Grad×Input attribution.
Control Number
CS2425
DOI
https://dspace.isical.ac.in/items/436bbeeb-a2a1-435e-9c12-6f9f62ade17b
DSpace Identifier
http://hdl.handle.net/10263/7739
Recommended Citation
Saha, Sanket, "Extending UniBreak: Semantic Retrieval and Harmful-Intent Direction Suppression for Token-Level LLM Jailbreaking" (2026). Master’s Dissertations. 463.
https://digitalcommons.isical.ac.in/masters-dissertations/463