Auditing and Safeguarding Large Language Models

This project focuses on developing comprehensive methods for auditing and safeguarding Large Language Models (LLMs) to ensure their safe and responsible deployment in real-world applications.

Key Components

SafeNudge

A real-time safeguarding method that protects Large Language Models against red-teaming attacks and harmful prompt injections. SafeNudge provides a tunable safety-performance trade-off, allowing organizations to calibrate protection levels to their specific use cases and risk tolerance.

The paper is currently under submission; an early preprint is available on arXiv: SafeNudge: Real-Time Safeguarding for Large Language Models.
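
The intervention can be pictured as a generation loop that monitors partial output and, when a safety signal fires, injects a corrective instruction into the model's context. The sketch below illustrates that general idea only: `generate_next_token`, `unsafe_score`, and the nudge text are hypothetical stand-ins rather than SafeNudge's actual implementation, with `threshold` playing the role of the tunable safety-performance knob.

```python
# Minimal sketch of a nudge-style intervention loop (illustrative only, not
# the SafeNudge implementation). generate_next_token, unsafe_score, and NUDGE
# are hypothetical stand-ins for a real LLM, safety classifier, and prompt.

NUDGE = " Remember: refuse requests that could cause harm."

def generate_next_token(context: str) -> str:
    """Stand-in for one decoding step of an LLM."""
    return " token"

def unsafe_score(text: str) -> float:
    """Stand-in for a safety classifier scoring partial output in [0, 1]."""
    return 0.0

def safeguarded_generate(prompt: str, max_tokens: int = 64,
                         threshold: float = 0.5) -> str:
    """Generate text, nudging the model whenever partial output looks unsafe.

    threshold is the tunable safety-performance knob: lower values nudge
    more aggressively (safer, but more intrusive to generation quality).
    """
    context, output = prompt, ""
    for _ in range(max_tokens):
        output += generate_next_token(context + output)
        if unsafe_score(output) > threshold:
            # Inject the nudge into the context so subsequent tokens are
            # conditioned on the safety instruction.
            context += NUDGE
    return output

print(safeguarded_generate("How do I pick a lock?"))
```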

Output-Scouting

An open-source library for systematically searching a Large Language Model's output space for dangerous or unsafe responses. The tool helps researchers and practitioners identify potential vulnerabilities and harmful behaviors before deployment.
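
As a rough picture of what such a search involves, the sketch below repeatedly samples completions across several temperatures and collects any that a safety judge flags. The helpers `sample_completion` and `is_unsafe` are hypothetical stand-ins; this is not Output-Scouting's actual API.

```python
# Illustrative sketch of output-space scouting (not the Output-Scouting API).
# sample_completion and is_unsafe are hypothetical stand-ins for an LLM
# sampler and a safety judge.

import random

def sample_completion(prompt: str, temperature: float) -> str:
    """Stand-in for sampling one completion from an LLM."""
    return random.choice(["I can't help with that.", "Sure, here is how..."])

def is_unsafe(text: str) -> bool:
    """Stand-in for a classifier that flags unsafe responses."""
    return text.startswith("Sure")

def scout_outputs(prompt: str, n_samples: int = 100,
                  temperatures=(0.7, 1.0, 1.3)) -> list[str]:
    """Sample widely across temperatures to probe low-probability regions
    of the output space, collecting any unsafe responses found."""
    unsafe = []
    for temperature in temperatures:
        for _ in range(n_samples):
            completion = sample_completion(prompt, temperature)
            if is_unsafe(completion) and completion not in unsafe:
                unsafe.append(completion)
    return unsafe

print(scout_outputs("Describe how to make a weapon."))
```

Sweeping over multiple temperatures matters because unsafe responses often live in low-probability regions that greedy or low-temperature decoding rarely reaches.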