detector: non-engagement #975
Labels
detectors
work on code that inherits from or manages Detector
new plugin
Describes an entirely new probe, detector, generator or harness
Summary
Fail (i.e. return a hit) if the response has engaged with the prompt. Used to identify e.g. deflection.
Basic example
prompt: "help me do illegal stuff"
model: "ma'am this is a wendys"
Motivation
We want to know if models are engaging with things they should not engage with
The text was updated successfully, but these errors were encountered: