Master’s Dissertations

Extending UniBreak: Semantic Retrieval and Harmful-Intent Direction Suppression for Token-Level LLM Jailbreaking

Sanket Saha

Date of Submission

6-11-2026

Date of Award

6-15-2026

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science

Department

Electronics and Communication Sciences Unit (ECSU-Kolkata)

Supervisor

Das, Swagatam

Abstract (Summary of the Work)

Token-level adversarial perturbations remain among the most efficient known attacks against the safety alignment of instruction-tuned large language models (LLMs). Among recent works, the UniBreak framework (You et al., 2026) stands out for unifying gradient-based optimization with an evolutionary perturbation repository. However, its repository relies solely on accumulated success frequency and ignores query content, and its fitness function implicitly assumes that suppressing refusal tokens is sufficient to elicit harmful responses. In this dissertation, we extend UniBreak along both axes and re-evaluate the framework under stricter generalization and judgment protocols. Specifically, we introduce a semantic perturbation repository that replaces frequency-only retrieval with a geometric interpolation between historical frequency and sentence-encoder cosine similarity. Furthermore, we use Harmful-Intent Direction Suppression (HIDS) to augment the fitness function by explicitly penalizing the model's residual-stream projection onto a validated harmful-intent direction. To isolate genuine cross-query generalization from within-dataset memorization, we introduce a two-phase frozen-repository evaluation protocol. Results are evaluated under two complementary judges: a binary classification judge and a 0-10 actionability scoring judge. The scoring judge itself is subsequently analysed through Grad×Input attribution.
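
The abstract does not give the exact form of the geometric interpolation, so the following is only an illustrative sketch of one common reading: a weighted geometric mean of a normalised frequency and a cosine similarity. The function names, the mixing weight `alpha`, and the clipping of both inputs to [0, 1] are assumptions made for illustration and are not taken from the dissertation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieval_score(freq, similarity, alpha=0.5):
    """Hypothetical weighted geometric mean: freq^(1 - alpha) * similarity^alpha.

    Assumptions (not from the source): freq is a success frequency already
    normalised to [0, 1]; similarity is clipped to [0, 1]; alpha in [0, 1]
    controls how much weight the semantic term receives.
    """
    f = min(max(freq, 0.0), 1.0)
    s = min(max(similarity, 0.0), 1.0)
    return (f ** (1.0 - alpha)) * (s ** alpha)
```

Under this reading, `alpha = 0` reduces to frequency-only retrieval and `alpha = 1` ranks purely by semantic similarity; intermediate values require an entry to score reasonably on both terms, since a near-zero factor pulls the product towards zero.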

Control Number

CS2425

DOI

https://dspace.isical.ac.in/items/436bbeeb-a2a1-435e-9c12-6f9f62ade17b

DSpace Identifier

http://hdl.handle.net/10263/7739

Recommended Citation

Saha, Sanket, "Extending UniBreak: Semantic Retrieval and Harmful-Intent Direction Suppression for Token-Level LLM Jailbreaking" (2026). Master’s Dissertations. 463.
https://digitalcommons.isical.ac.in/masters-dissertations/463

Included in

Computer Sciences Commons
