Auditing and Safeguarding Large Language Models
Development of methods and tools for auditing and safeguarding Large Language Models against harmful outputs and adversarial attacks. This project includes SafeNudge, a real-time safeguard against red-teaming attacks, and Output-Scouting, an open-source library for searching an LLM's possible output space for dangerous outputs.