Name: Computer Science Department, MS Thesis Presentation, Seungho Lee "ToxiPrompt: A Two-Stage Red-Teaming Approach for Balancing Adversarial Prompt Diversity and Response Toxicity"
Start: 2026-03-04T11:00:00-0500
End: 2026-03-04T12:00:00-0500
Location: Worcester Polytechnic Institute

Seungho Lee

MS student

WPI – Computer Science Department

Wednesday, March 4th, 2026

Time: 11:00AM – 12:00PM

Location: Fuller Lab 320

Zoom Link: https://wpi.zoom.us/j/6220508406

Advisor: Prof. Kyumin Lee

Reader: Prof. Raha Moraffah

Abstract:

While large language models (LLMs) offer great promise, they also pose concrete safety risks. To audit and mitigate these risks, researchers have developed automated red-teaming methods, which generate adversarial prompts to elicit unsafe behavior from target LLMs during evaluation.

Recent automated red-teaming methods for LLMs face a persistent trade-off: techniques that increase prompt diversity often reduce the toxicity elicited from the target LLMs, while toxicity-maximizing methods tend to collapse diversity.

To address this limitation, we propose ToxiPrompt, a two-stage framework that explicitly separates exploration (diversity) from exploitation (toxicity) and reunifies them with a single selection criterion that balances diversity and toxicity. Our code is available at https://github.com/seungho715/ToxiPrompt
