Automated Red Teaming

Automated red teaming continuously searches for inputs that cause unwanted behaviors.

Red teaming can be configured to target any type of describable behavior, from high level topics like illegal and dangerous content, to specific behaviors like outputting false product warranties.

As the name implies, automated red teaming is fully automated. Model red teaming runs finish from seconds to minutes, automatically run on new model versions, use reproducible search budgets for easy comparison, and can be configured to generate specific inputs formats.

Use cases

Evaluating if your model can produce dangerous or illegal content
Making sure new updates don’t introduce unwanted side effects
Catching off topic or sensitive responses for specific business contexts
Visualizing model performance on unexpected inputs

Automated Red Teaming

Use cases

Further reading