Computer Science Department, MS Thesis Presentation, Seungho Lee: "ToxiPrompt: A Two-Stage Red-Teaming Approach for Balancing Adversarial Prompt Diversity and Response Toxicity"

Wednesday, March 4, 2026
11:00 a.m. to 12:00 p.m.

 

Seungho Lee

MS Student

WPI – Computer Science Department 

 

 

Location: Fuller Lab 320

Zoom Link: https://wpi.zoom.us/j/6220508406 

Advisor: Prof. Kyumin Lee

Reader: Prof. Raha Moraffah

Abstract:

While large language models (LLMs) offer great promise, they also pose concrete safety risks. To audit and mitigate these risks, researchers have developed automated red-teaming methods, which generate adversarial prompts to elicit unsafe behavior from target LLMs during evaluation.

Recent automated red-teaming methods for LLMs face a persistent trade-off: techniques that increase prompt diversity often reduce the toxicity elicited from the target LLM, while toxicity-maximizing methods tend to collapse diversity.

To address this trade-off, we propose ToxiPrompt, a two-stage framework that explicitly separates exploration (diversity) from exploitation (toxicity) and then reunifies them under a single selection criterion that balances the two. Our code is available at https://github.com/seungho715/ToxiPrompt