Feature: use hf chat support #1047
Conversation
Signed-off-by: Jeffrey Martin <[email protected]>
* When using a pipeline with `chat` support, defer to the library so response formatting stays in the canonical chat list-of-dict form.
* Model responses may include the chat-template-based prompt as a prefix; decode so that the template-mutated prompt is removed.
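For reference, a minimal sketch of the canonical chat form and how a tokenizer's chat template turns it into a model-specific prompt string (the model name is just one from the verification steps below; the printed output is indicative only):

```python
from transformers import AutoTokenizer

# Canonical chat form: a list of {"role", "content"} dicts.
messages = [{"role": "user", "content": "Hello world!"}]

# A tokenizer that ships a chat_template can render that list into the
# model's trained prompt format.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt_text)  # e.g. "<s>[INST] Hello world! [/INST]" for this template
```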
This all looks good, and we went over it in some depth ahead of the PR. Approved, with the caveat that I'm putting some faith in this passing scream tests.
```python
@@ -79,6 +79,9 @@ def _load_client(self):
        set_seed(_config.run.seed)

        pipeline_kwargs = self._gather_hf_params(hf_constructor=pipeline)
        pipeline_kwargs["truncation"] = (
```
Sometimes I wish this were just the default for HF, but c'est la vie. We may want to consider whether `truncation = True` requires `max_len` to be set -- I've had HF yell at me for not specifying both, but it may be a corner case that I encountered.
HF is kinda yell-y, and over the course of garak dev, HF has varied what it yells about. It can also be the case that some models require certain param combos to operate, while others will find the same param combo utterly intolerable. This has led to a style where one tries to do the right thing in garak, and tries to listen to HF warnings a little less.
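For what it's worth, a hedged sketch of the combination being discussed: enabling truncation on a text-generation pipeline while also supplying a length cap. Whether HF complains when only `truncation=True` is given seems to vary by model and transformers version, so treat the pairing as an assumption rather than a rule:

```python
from transformers import pipeline

# Sketch only: pass truncation (and a length limit) at pipeline construction,
# mirroring the pipeline_kwargs approach in the diff above.
generator = pipeline(
    "text-generation",
    model="gpt2",
    truncation=True,
    max_length=256,
)
print(generator("Hello world!")[0]["generated_text"])
```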
```python
        if self.use_chat:
            formatted_prompt = self._format_chat_prompt(prompt)
        else:
            formatted_prompt = prompt
```
Note for future Erick and Jeffrey: we will want to merge `ConversationalPipeline` into this at some point, given HF's deprecation of the "conversational" pipeline and `Conversation` object. At that point, we'll also want to validate that `prompt` is a string.
```python
        if self.use_chat:
            text_output = [
                re.sub("^" + re.escape(prefix_prompt), "", i).strip()
                for i in raw_text_output
            ]
        else:
            text_output = raw_text_output
```
Part of me (albeit a part with limited convictions) feels like there HAS to be a better way to handle this. HF seems to be managing it internally with their `pipeline`s, but this seems fine for the time being.
idk man, the format uses `\n` as a special char but doesn't seem to do much escaping, so many bets are off.
Another option could be to do `\n`-based escaping and parsing of messages line by line, assuming `[A-Z][a-z]+:` is a metadata line (sketched below).
My experience also is that things like `prefix_prompt` get mangled in other code sometimes. I don't have great suggestions for how to check this well, other than:
- continue doing qualitative reviews of prompt:output pairs for suspicious items;
- migrate to a data format with clearer separation between data and metadata;
- badger HF for a reference implementation;
- give notice to users that HF Chat is "janky".
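A rough sketch of that line-by-line idea, treating lines that match `[A-Z][a-z]+:` as metadata/speaker markers; purely illustrative, not how garak parses output:

```python
import re

# Hypothetical parser: a line like "User: hi" starts a new message, anything
# else is treated as a continuation of the previous message's content.
ROLE_LINE = re.compile(r"^([A-Z][a-z]+):\s*(.*)$")

def split_chat_lines(raw: str) -> list[dict]:
    messages = []
    for line in raw.split("\n"):
        m = ROLE_LINE.match(line)
        if m:
            messages.append({"role": m.group(1).lower(), "content": m.group(2)})
        elif messages:
            messages[-1]["content"] += "\n" + line
    return messages

print(split_chat_lines("User: hi\nAssistant: hello there\nand a second line"))
```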
tests/generators/test_huggingface.py
Outdated
```python
def test_model_chat(mocker, hf_generator_config):
    # uses a ~350M model with chat support
    g = garak.generators.huggingface.Model(
        "microsoft/DialoGPT-small", config_root=hf_generator_config
    )
    mock_format = mocker.patch.object(
        g, "_format_chat_prompt", wraps=g._format_chat_prompt
    )
    g.generate("Hello world!")
    mock_format.assert_called_once()
```
This is a great test for where we are now -- would be nice to have something where we could specifically test the deprefixing but there is some unavoidable stochasticity here that I'm not sure how to account for.
Fair point. In the vein of wanting less reliance on actual models, I think adding a future test that mocks the model response completely, offering a known result, should be a reasonable way to add a unit test validating the deprefix function.
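Not for this PR, but as a shape for that future test: with a canned response in hand, the deprefix step can be checked deterministically without loading a model. The helper below simply re-applies the regex pattern from the diff above; it is an illustrative copy, not the garak function itself:

```python
import re

def _deprefix(prefix_prompt: str, raw_text_output: list[str]) -> list[str]:
    # illustrative copy of the stripping logic shown in the diff above
    return [
        re.sub("^" + re.escape(prefix_prompt), "", i).strip()
        for i in raw_text_output
    ]

def test_deprefix_removes_template_prompt():
    prefix = "<s>[INST] Hello world! [/INST]"
    raw = [prefix + " Hi there, how can I help?"]
    assert _deprefix(prefix, raw) == ["Hi there, how can I help?"]
```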
Nice find re: the model. I wonder if we could use something like the `lipsum` module (which we already depend upon) to mock up a test HF generator that could be used in many of our tests. Maybe for Jan?
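A very rough sketch of what a `lipsum`-backed mock generator might look like; the base-class hook (`_call_model`) and the `lipsum.generate_words` call are assumptions to verify against garak's Generator API and the packaged module:

```python
import lipsum  # assumed API: generate_words(n) returning a string

from garak.generators.base import Generator


class LipsumMockGenerator(Generator):
    """Hypothetical test-only generator: emits lorem ipsum instead of
    calling a real model, keeping tests fast and offline."""

    generator_family_name = "mock"

    def _call_model(self, prompt, generations_this_call=1):
        # ignore the prompt; return filler text for each requested generation
        return [lipsum.generate_words(12) for _ in range(generations_this_call)]
```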
Thanks a lot for this. A few questions around testing and refactoring, but LGTM otherwise.
```python
        if self.use_chat:
            text_outputs = [_o[-1]["content"].strip() for _o in outputs]
        else:
            text_outputs = outputs
```
Comment not for this PR: there are multiple ways of representing conversations in garak, and attempt should be canonical for conversation history. I'm starting to consider patterns where attempt still holds the conversation history, but where it can be read & written using other interfaces, like an OpenAI API messages dict list, or this HF-style format. But maybe the work is so simple that this current pattern works fine.
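Purely to illustrate the pattern being mused about (the names here are hypothetical, not the attempt API): a canonical turn list with thin adapters for other wire formats:

```python
from dataclasses import dataclass, field


@dataclass
class ConversationHistory:
    """Hypothetical canonical store; attempt would own something like this."""

    turns: list[dict] = field(default_factory=list)  # {"role": ..., "content": ...}

    def as_messages(self) -> list[dict]:
        # an OpenAI-style messages list is already a role/content dict list
        return [dict(t) for t in self.turns]

    def add_hf_chat_output(self, chat: list[dict]) -> None:
        # HF chat-template output: keep only the final (assistant) turn
        self.turns.append(dict(chat[-1]))
```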
garak/generators/huggingface.py
Outdated
```python
        # test tokenizer for `apply_chat_template` support
        self.use_chat = (
            hasattr(self.tokenizer, "chat_template")
            and self.tokenizer.chat_template is not None
        )
```
This logic is in a few places; is it worth factoring up to `HFCompatible`?
The object format is slightly different: `self.tokenizer` vs `self.generator.tokenizer`. There is an abstraction possible here for `_supports_chat(tokenizer)`, though I'm not sold on the value yet:
```python
def _supports_chat(self, tokenizer) -> bool:
    return hasattr(tokenizer, "chat_template") and tokenizer.chat_template is not None
```
I would like to defer this until I see a third usage or usage outside this module.
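To make the proposed helper concrete outside either class, a tiny self-contained check (the fake tokenizer is obviously a stand-in; Model would call the helper with `self.tokenizer`, Pipeline with `self.generator.tokenizer`):

```python
class _FakeTokenizer:
    chat_template = "{{ messages }}"  # stand-in for a real template string


def _supports_chat(tokenizer) -> bool:
    return hasattr(tokenizer, "chat_template") and tokenizer.chat_template is not None


print(_supports_chat(_FakeTokenizer()))  # True
print(_supports_chat(object()))          # False: no chat_template attribute
```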
While not exposed as a `DEFAULT_PARAM`, allowing the user to suppress chat template usage for a model can enable exploration of possible weaknesses when insufficient or improper prompt handling occurs, for instance if the model is passed input from inside an application. Signed-off-by: Jeffrey Martin <[email protected]>
Signed-off-by: Jeffrey Martin <[email protected]>
Fix #1029
When a huggingface model supports a `chat_template`, wrap the prompt in a chat list to enable request formatting. Using the trained template formatting can have a significant impact on inference efficiency as well as the context of the response. This aligns local model execution with deployed scenarios.

* `truncation` requirement consolidated into the pipeline creation call
* chat template support applied in `huggingface.Model` inference

As a future task I could see possibly wanting to suppress `chat_template` formatting to test poorly configured inference against an instruct model. This idea is left as an exercise for future consideration.

Verification
List the steps needed to make sure this thing works
python -m garak -m huggingface -n mistralai/Mistral-7B-Instruct-v0.1 -p atkgen
python -m garak -m huggingface -n mistralai/Mistral-7B-Instruct-v0.1 -p lmrc
python -m garak -m huggingface -n gpt2 -p lmrc
python -m garak -m huggingface.Model -n mistralai/Mistral-7B-Instruct-v0.1 -p atkgen
python -m garak -m huggingface.Model -n mistralai/Mistral-7B-Instruct-v0.1 -p lmrc
python -m garak -m huggingface.Model -n gpt2 -p lmrc