SAFENUDGE: Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs

Published as an arXiv preprint (under submission to EMNLP 2025), 2025

Fonseca, J.*, Bell, A.*, & Stoyanovich, J. (2025). SAFENUDGE: Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs. arXiv preprint arXiv:2501.02018. (Submitted to EMNLP 2025 — awaiting meta-review) https://arxiv.org/abs/2501.02018

Abstract

Recent high-profile incidents have demonstrated the susceptibility of Large Language Models (LLMs) to jailbreak attacks, or adversarial attacks used to elicit high-risk behavior from a model, highlighting the critical need to safeguard widely-deployed models. Safeguarding approaches, which include fine-tuning models or having LLMs “self-reflect”, may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict “normal” model behavior. Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area. In this work, we introduce a novel safeguard, called SAFENUDGE, that combines Controlled Text Generation with “nudging,” or using text interventions to modify a model’s behavior. SAFENUDGE triggers during text generation while a jailbreak attack is being executed, and can reduce successful jailbreak attempts by 30% by guiding the LLM towards a safe response. It adds minimal latency to inference and has a negligible impact on the semantic fluency of outputs. Further, it supports tunable SPTs, meaning practitioners can set their own tolerance for trade-offs balancing safety and restrictions to normal model behavior. SAFENUDGE is open-source and available through https://pypi.org/, and is compatible with models loaded with the Hugging Face transformers library.
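To give a rough sense of the general idea described above (intervening mid-generation with a text "nudge" when a partial output looks risky), here is a minimal conceptual sketch using the Hugging Face transformers library. This is not the SAFENUDGE package or the paper's actual method; the `safety_score` function, the nudge string, and the `threshold` parameter (a stand-in for the tunable safety-performance tolerance) are hypothetical placeholders.

```python
# Minimal conceptual sketch of nudging during generation.
# NOT the SAFENUDGE API: safety_score, NUDGE, and threshold are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # small placeholder model for illustration
NUDGE = " Remember: respond safely and refuse harmful requests."

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def safety_score(text: str) -> float:
    """Hypothetical stand-in for a classifier that scores partial outputs for risk."""
    return 0.0  # always "safe" in this toy example

@torch.no_grad()
def generate_with_nudge(prompt: str, max_new_tokens: int = 64,
                        check_every: int = 8, threshold: float = 0.5) -> str:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for step in range(max_new_tokens):
        # Greedy decoding, one token at a time.
        logits = model(input_ids).logits[:, -1, :]
        next_id = torch.argmax(logits, dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        # Periodically score the partial generation; if it exceeds the risk
        # threshold, append a safety instruction ("nudge") to the running context.
        if (step + 1) % check_every == 0:
            partial = tokenizer.decode(input_ids[0], skip_special_tokens=True)
            if safety_score(partial) > threshold:
                nudge_ids = tokenizer(NUDGE, return_tensors="pt").input_ids
                input_ids = torch.cat([input_ids, nudge_ids], dim=-1)
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

print(generate_with_nudge("Write a short poem about safety."))
```

In this sketch, lowering `threshold` makes the intervention fire more often (safer but more restrictive of normal behavior), which loosely mirrors the tunable safety-performance trade-off the paper describes; the actual package should be consulted for its real interface.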

Link to Publication
Paper