Automated Red Teaming
Automated red teaming continuously searches for inputs that cause unwanted behaviors.
Red teaming can be configured to target any type of describable behavior, from high level topics like illegal and dangerous content, to specific behaviors like outputting false product warranties.
As the name implies, automated red teaming is fully automated. Model red teaming runs finish from seconds to minutes, automatically run on new model versions, use reproducible search budgets for easy comparison, and can be configured to generate specific inputs formats.
Use cases
- Evaluating if your model can produce dangerous or illegal content
- Making sure new updates don’t introduce unwanted side effects
- Catching off topic or sensitive responses for specific business contexts
- Visualizing model performance on unexpected inputs