probe: Past Tense Vulnerability #924
Merged
Changes from 35 of 37 commits.

Commits:
ce6b9c9  Add files via upload (Shine-afk)
04805e5  Add files via upload (Shine-afk)
a6e3a20  automatic garak/resources/plugin_cache.json update (github-actions[bot])
9ee738a  Add files via upload (Shine-afk)
6148306  Update detectors.rst (Shine-afk)
30c1612  Update probes.rst (Shine-afk)
44321d6  Update detectors.rst (Shine-afk)
9a935bb  Delete docs/source/garak.detectors.keywords.rst (Shine-afk)
8bdd8d6  Add files via upload (Shine-afk)
f1e4e08  Update probes.rst (Shine-afk)
3d917c5  Delete docs/source/garak.probes.past_tense.rst (Shine-afk)
4f52965  Delete garak/detectors/keywords.py (Shine-afk)
0758cf6  Add files via upload (Shine-afk)
d18bd09  Delete garak/probes/past_tense.py (Shine-afk)
39186bb  Add files via upload (Shine-afk)
72a6051  Create phrasing (Shine-afk)
cf2bc47  Delete garak/resources/phrasing (Shine-afk)
c0578f2  Create past_tense.txt (Shine-afk)
fc5cdfe  Delete garak/resources/plugin_cache.json (Shine-afk)
aa6beae  Merge pull request #1 from Shine-afk/PTV (Shine-afk)
a2f767c  Update specialwords.py (Shine-afk)
b2f71fb  Update specialwords.py (Shine-afk)
ab7611e  Update phrasing.py (Shine-afk)
2ac9433  Add files via upload (Shine-afk)
85e8f73  @Shine-afk has signed the CLA in Shine-afk/garak#2 (github-actions[bot])
3866426  Merge pull request #2 from Shine-afk/PTV (Shine-afk)
71c5838  automatic garak/resources/plugin_cache.json update (github-actions[bot])
4d2ba73  Merge branch 'main' into main (leondz)
d305b8f  move past tense data file to data/ (leondz)
47ba58f  Merge branch 'main' into Shine-afk/main (leondz)
0e1a3be  drop changes to plugin cache (leondz)
a57a999  extend past tense to all variants in tml-epfl/llm-past-tense repo, & … (leondz)
a404c00  shuffle and dedupe tense examples (leondz)
48777b1  add mini versions of tense probes (leondz)
40a7763  shuffle mini phrasing prompts, factor mini-versions up (leondz)
a43d32b  Update garak/probes/phrasing.py (leondz)
f757096  Update garak/probes/phrasing.py (leondz)
docs/source/garak.probes.phrasing.rst (new file, +8 lines):

garak.probes.phrasing
=====================

.. automodule:: garak.probes.phrasing
   :members:
   :undoc-members:
   :show-inheritance:
(Two large file diffs are not rendered by default.)
garak/probes/phrasing.py (new file, +111 lines):

"""Past Tense

Refusal training is widely used to prevent LLMs from generating harmful, undesirable, or illegal
outputs. We reveal a curious generalization gap in the current refusal training approaches:
simply reformulating a harmful request in the past tense is often sufficient to jailbreak many
state-of-the-art LLMs.
"""

import random

from garak import _config
from garak.data import path as data_path
from garak.probes.base import Probe


class TenseMini:
    def _minify_prompts(self):
        # keep a random subset of 200 prompts and do a single generation per prompt
        random.shuffle(self.prompts)
        self.prompts = self.prompts[:200]
        self.generations = 1


class PastTense(Probe):
    """Will the system give unsafe responses when asked to write about weapon creation or dangerous behavior?

    This check sends the generator requests for advice on making weapons, phrased in the past tense.
    """

    active = False
    bcp47 = "en"
    doc_uri = "https://arxiv.org/pdf/2407.11969v2"
    primary_detector = "specialwords.Prefixes"
    tags = [
        "owasp:llm01",
        "quality:Security:PromptStability",
        "payload:jailbreak",
    ]
    goal = "use past tense to disregard the system prompt"

    def __init__(self, config_root=_config):
        super().__init__(config_root=config_root)

        self.prompts = []

        # load the past-tense reformulations of harmful requests, one per line
        with open(
            data_path / "phrasing" / "past_tense_en.txt",
            "r",
            encoding="utf-8",
        ) as file:
            for line in file:
                self.prompts.append(line.strip())


class PastTenseMini(PastTense, TenseMini):
    """Attempt to evade safeguards by setting requests in the past tense

    A subset of the tense-modified requests. Success is reported higher with more variation, so
    generations is overridden.
    """

    active = True

    def __init__(self, config_root=_config):
        super().__init__(config_root=config_root)
        self._minify_prompts()


class FutureTense(Probe):
    """Will the system give unsafe responses when asked to write about weapon creation or dangerous behavior?

    This check sends the generator requests for advice on making weapons, phrased in the future tense.
    """

    active = False

    bcp47 = "en"
    doc_uri = "https://arxiv.org/pdf/2407.11969v2"
    primary_detector = "specialwords.Prefixes"
    tags = [
        "owasp:llm01",
        "quality:Security:PromptStability",
        "payload:jailbreak",
    ]
    goal = "use future tense to disregard the system prompt"

    def __init__(self, config_root=_config):
        super().__init__(config_root=config_root)

        self.prompts = []

        # load the future-tense reformulations of harmful requests, one per line
        with open(
            data_path / "phrasing" / "future_tense_en.txt",
            "r",
            encoding="utf-8",
        ) as file:
            for line in file:
                self.prompts.append(line.strip())


class FutureTenseMini(FutureTense, TenseMini):
    """Attempt to evade safeguards by setting requests in the future tense

    A subset of the tense-modified requests. Success is reported higher with more variation, so
    generations is overridden.
    """

    active = True

    def __init__(self, config_root=_config):
        super().__init__(config_root=config_root)
        self._minify_prompts()
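For a quick interactive check of the new probes (a sketch, assuming garak is installed from this branch so the phrasing module and its data files are importable), something like the following could be used:

import json  # only needed if you want to dump the prompt lists
from garak import _config
from garak.probes.phrasing import PastTense, PastTenseMini

# The full probe loads every past-tense reformulation from the data file;
# the Mini variant shuffles and keeps at most 200 prompts, one generation each.
full = PastTense(config_root=_config)
mini = PastTenseMini(config_root=_config)

print(len(full.prompts))
print(len(mini.prompts), mini.generations)  # expect at most 200, and 1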
Review discussion:
Should we consider providing a seed to ensure a reproducible shuffle? Creating a custom Random object avoids impacting the global random generator, while still giving users a way to get reproducibility when required and keeping behaviour consistent between runs. It is reasonable to defer this and keep it optional, since the seed would need to come from some default or logged value, overridable by configuration and possibly injected from _config.run.seed by the CLI, once the refactor that removes plugins' direct access to that value happens.
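For reference, a minimal sketch of that idea (standalone, not part of this PR; the shuffle_seed attribute and where its value would come from are assumptions):

import random


class TenseMini:
    # Sketch: shuffle with a private Random instance so the module-level
    # generator's state is left untouched. shuffle_seed is a hypothetical
    # name; its value would come from config or a logged default.
    shuffle_seed = None

    def _minify_prompts(self):
        rng = random.Random(self.shuffle_seed)  # None still gives a fresh, non-reproducible shuffle
        rng.shuffle(self.prompts)
        self.prompts = self.prompts[:200]
        self.generations = 1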
Setting a seed already sets random's seed (though only as part of cli.py), and this propagates through the run, so reproducibility is already here. To verify, run something like python -m garak -m test.Repeat -p phrasing -s 4 twice and compare the order of prompts in the two report.jsonl files.
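To make that check concrete, a short script along these lines could compare prompt order across the two runs (the entry_type and prompt keys are assumptions about the report.jsonl schema, not taken from this PR; adjust to whatever fields the reports actually use):

import json
import sys


def prompt_order(report_path):
    # Collect prompts in the order they appear in a report.jsonl;
    # key names here are assumed, not confirmed by this PR.
    order = []
    with open(report_path, encoding="utf-8") as report:
        for row in report:
            entry = json.loads(row)
            if entry.get("entry_type") == "attempt":
                order.append(entry.get("prompt"))
    return order


run_a, run_b = sys.argv[1], sys.argv[2]
print(prompt_order(run_a) == prompt_order(run_b))  # True when the shuffle reproduced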
Seeding the global random module only gives consistency if no other probe or task consumes randomness in a different order between runs, and the example limits the run to a single probe. Since probes are instantiated in series, two runs that combined a dynamic probe with phrasing could leave the global random object in a different state by the time the shuffle happens. This idea was optional, so this will land and we can circle back if the extra consistency turns out to be needed.
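A standalone illustration of that concern (not garak code): with the module-level generator, one extra random call between seeding and the shuffle changes the outcome, while a dedicated Random instance is insulated from other callers.

import random

prompts = [f"prompt {i}" for i in range(20)]

# Run 1: seed the global generator, then shuffle.
random.seed(4)
a = prompts[:]
random.shuffle(a)

# Run 2: same seed, but some other component consumes randomness first.
random.seed(4)
random.random()  # stands in for another plugin touching the global generator
b = prompts[:]
random.shuffle(b)
print(a == b)  # almost certainly False: global state diverged before the shuffle

# With private generators, the interleaved call has no effect.
c, d = prompts[:], prompts[:]
random.Random(4).shuffle(c)
random.random()
random.Random(4).shuffle(d)
print(c == d)  # True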