Seungho Lee
MS student
WPI – Computer Science Department
Wednesday, March 4th, 2026
Time: 11:00AM – 12:00PM
Location: Fuller Lab 320
Zoom Link: https://wpi.zoom.us/j/6220508406
Advisor: Prof. Kyumin Lee
Reader: Prof. Raha Moraffah
Abstract:
While large language models (LLMs) offer great promise, they also pose concrete safety risks. To audit and mitigate these risks, researchers have developed automated red-teaming methods, which generate adversarial prompts that elicit unsafe behavior from target LLMs during evaluation.
Recent automated red-teaming methods for LLMs face a persistent trade-off: techniques that increase prompt diversity often reduce the toxicity elicited from the target LLMs, while toxicity-maximizing methods tend to collapse diversity.
To address this trade-off, we propose ToxiPrompt, a two-stage framework that explicitly separates exploration (diversity) from exploitation (toxicity) and reunifies them under a single selection criterion that balances diversity and toxicity. Our code is available at https://github.com/seungho715/ToxiPrompt
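
To make the idea of a single criterion that trades off diversity against toxicity concrete, here is a minimal sketch. It is not ToxiPrompt's actual criterion, which the abstract does not specify. It substitutes a greedy, maximal-marginal-relevance-style score as a stand-in. The names `jaccard`, `select_prompts`, and `lam`, the token-overlap similarity, and the toxicity scores are all assumptions made for illustration; in practice the scores would come from a toxicity classifier applied to the target LLM's responses.

```python
from typing import List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a crude, assumed proxy for prompt similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def select_prompts(
    candidates: Sequence[Tuple[str, float]], k: int, lam: float = 0.5
) -> List[str]:
    """Greedily pick up to k prompts from (prompt, toxicity_score) pairs.

    Hypothetical criterion (an MMR-style stand-in, not the author's method):
        score = lam * toxicity - (1 - lam) * max similarity to selected prompts
    """
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(item: Tuple[str, float]) -> float:
            prompt, toxicity = item
            # Redundancy is 0.0 while nothing has been selected yet.
            redundancy = max((jaccard(prompt, s) for s in selected), default=0.0)
            return lam * toxicity - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best[0])
        remaining.remove(best)
    return selected


# Made-up candidates with made-up toxicity scores in [0, 1].
pool = [("a b c", 0.9), ("a b c d", 0.85), ("x y z", 0.6)]
balanced = select_prompts(pool, k=2, lam=0.5)
toxicity_only = select_prompts(pool, k=2, lam=1.0)
```

With `lam=1.0` the score ignores redundancy and keeps the two highest-scoring but near-duplicate prompts. Lowering `lam` penalizes overlap with prompts already chosen, so a less toxic but dissimilar prompt can win the second slot.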