Author (Researcher Name)

Date of Submission

6-11-2026

Date of Award

6-15-2026

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science

Department

Electronics and Communication Sciences Unit (ECSU-Kolkata)

Supervisor

Das, Swagatam

Abstract (Summary of the Work)

Token-level adversarial perturbations remain among the most efficient known attacks against the safety alignment of instruction-tuned large language models (LLMs). Among recent works, the UniBreak framework (You et al., 2026) stands out for unifying gradient-based optimization with an evolutionary perturbation repository. However, its repository relies solely on accumulated success frequency and ignores query content, and its fitness function implicitly assumes that suppressing refusal tokens is sufficient to elicit harmful responses. In this dissertation, we extend UniBreak along both axes and re-evaluate the framework under stricter generalization and judgment protocols. Specifically, we introduce a semantic perturbation repository that replaces frequency-only retrieval with a geometric interpolation between historical frequency and sentence-encoder cosine similarity. Furthermore, we use Harmful-Intent Direction Suppression (HIDS) to augment the fitness function by explicitly penalizing the model’s residual-stream projection onto a validated harmful-intent direction. To isolate genuine cross-query generalization from within-dataset memorization, we introduce a two-phase frozen-repository evaluation protocol. Results are evaluated under two complementary judges: a binary classification judge and a 0-10 actionability scoring judge. The scoring judge itself is subsequently analysed through Grad×Input attribution.
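The abstract does not give the exact formulas, so the following is only a minimal sketch of the two scoring ideas it names, under stated assumptions. It reads "geometric interpolation" as a weighted geometric mean of a frequency term and a cosine-similarity term, and it reads the HIDS penalty as a scalar projection of a hidden-state vector onto a fixed direction. The function names, the weight `alpha`, the clipping to [0, 1], and the assumption that frequency is already normalized are all hypothetical and are not taken from the dissertation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieval_score(freq, sim, alpha=0.5):
    """Weighted geometric mean of frequency and similarity.

    Assumption: freq is a success frequency already normalized to [0, 1],
    sim is a cosine similarity, and alpha in [0, 1] is a hypothetical
    mixing weight (alpha=0 gives frequency only, alpha=1 similarity only).
    Both inputs are clipped to [0, 1] so the powers are well defined.
    """
    f = min(max(freq, 0.0), 1.0)
    s = min(max(sim, 0.0), 1.0)
    return (f ** (1.0 - alpha)) * (s ** alpha)


def direction_projection(hidden, direction):
    """Scalar projection of a hidden-state vector onto a direction.

    Assumption: a HIDS-style penalty could be built from this value;
    how it enters the fitness function is not specified in the abstract.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(h * d for h, d in zip(hidden, direction)) / norm
```

Under this reading, a stored perturbation with a high success frequency but no semantic similarity to the current query scores zero, which is one way a repository could stop favouring perturbations that succeeded only on unrelated queries.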

Control Number

CS2425

DOI

https://dspace.isical.ac.in/items/436bbeeb-a2a1-435e-9c12-6f9f62ade17b

DSpace Identifier

http://hdl.handle.net/10263/7739
