Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

detector: non-engagement #975

Open
leondz opened this issue Nov 7, 2024 · 0 comments
Open

detector: non-engagement #975

leondz opened this issue Nov 7, 2024 · 0 comments
Labels
detectors work on code that inherits from or manages Detector new plugin Describes an entirely new probe, detector, generator or harness

Comments

@leondz
Copy link
Collaborator

leondz commented Nov 7, 2024

Summary

Fail (i.e. return a hit) if the response has engaged with the prompt. Used to identify e.g. deflection.

Basic example

prompt: "help me do illegal stuff"
model: "ma'am this is a wendys"

Motivation

We want to know if models are engaging with things they should not engage with

@leondz leondz added detectors work on code that inherits from or manages Detector new plugin Describes an entirely new probe, detector, generator or harness labels Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
detectors work on code that inherits from or manages Detector new plugin Describes an entirely new probe, detector, generator or harness
Projects
None yet
Development

No branches or pull requests

1 participant