
update hammer handler and add Hammer2.0 model #667

Merged: 8 commits merged into ShishirPatil:main, Oct 15, 2024

Conversation

@linqq9 (Contributor) commented Sep 30, 2024

Hello, we have updated the Hammer handler and added the Hammer2.0 series models, including Hammer2.0-7b, Hammer2.0-3b, Hammer2.0-1.5b and Hammer2.0-0.5b. The performance on BFCL-V3 is as follows:

| Model | Overall Acc | Non-live AST | Non-live Exec | Live AST | Multi Turn Acc | Relevance | Irrelevance |
|--------------------------------|-------------|--------------|---------------|----------|----------------|-----------|-------------|
| MadeAgents/Hammer2.0-7b (FC) | 56.60 | 90.15 | 82.64 | 68.68 | 15.75 | 92.68 | 68.20 |
| MadeAgents/Hammer2.0-1.5b (FC) | 51.94 | 84.31 | 81.80 | 63.17 | 11.38 | 92.68 | 61.83 |
| MadeAgents/Hammer2.0-3b (FC) | 49.88 | 86.77 | 80.25 | 66.06 | 0.50 | 92.68 | 68.59 |
| MadeAgents/Hammer2.0-0.5b (FC) | 39.51 | 67.00 | 65.73 | 51.62 | 0.00 | 87.80 | 67.00 |

@ShishirPatil (Owner) commented:

Thank you for the PR, @linqq9! We will review this PR tomorrow PST :)

@HuanzhiMao (Collaborator) left a comment:

Hey @linqq9, thanks for the PR!

Questions regarding the _format_prompt function:

  1. In these lines, why do we remove all those <|im_start|> <|im_end|> tags? The chat template for Hammer2.0 on Hugging Face does include them.
  2. What happened to the system prompts here? Previously they were concatenated into the task instruction section, but now it seems they are all thrown away.
  3. Why do we need to special-case the situation when the length of the prompt is 2 (code here)?
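For reference, the tags in question come from a ChatML-style chat template (the format used by Qwen-family models); a rough illustration of that rendering is sketched below, though it is not necessarily Hammer2.0's exact template:

```python
# Illustrative only: a ChatML-style rendering of a message list;
# not necessarily Hammer2.0's exact chat template.
def render_chatml(messages: list[dict]) -> str:
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    # Trailing generation prompt so the model continues as the assistant.
    return "\n".join(parts) + "\n<|im_start|>assistant\n"
```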

@linqq9 (Contributor, Author) commented Oct 6, 2024

> Hey @linqq9, thanks for the PR!
>
> Questions regarding the _format_prompt function:
>
>   1. In these lines, why do we remove all those <|im_start|> <|im_end|> tags? The chat template for Hammer2.0 on Hugging Face does include them.
>   2. What happened to the system prompts here? Previously they were concatenated into the task instruction section, but now it seems they are all thrown away.
>   3. Why do we need to special-case the situation when the length of the prompt is 2 (code here)?

Hi @HuanzhiMao, thanks for your questions. Here are my responses:

  1. Regarding the <|im_start|> <|im_end|> tags: in _query_prompting we use the client.chat.completions.create function, which already applies these special tokens to the user-provided content.
  2. As for the system prompts, the system content is later included in the generated prompt content, so there is no need to add it redundantly and it was removed.
  3. We use a prompt length of 2 to determine whether historical information is included. When the length of the prompt is 2 (system and user), there is no history; if it is greater than 2, historical turns are present (see the sketch below).
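For illustration, a minimal sketch of how such a length check can distinguish a fresh turn from one that carries history (hypothetical names; not the actual handler code):

```python
# Hypothetical sketch only -- not the BFCL handler implementation.
def has_history(messages: list[dict]) -> bool:
    # A fresh request holds exactly a system message and a user message;
    # anything longer means prior assistant/tool turns were appended.
    return len(messages) > 2

messages = [
    {"role": "system", "content": "You are a function-calling assistant."},
    {"role": "user", "content": "What is the weather in Berlin?"},
]
print(has_history(messages))  # False -> no historical information yet
```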

@linqq9 requested a review from HuanzhiMao, October 7, 2024 15:14
@linqq9 (Contributor, Author) commented Oct 9, 2024

Hi @HuanzhiMao, do you have any other questions?

@HuanzhiMao (Collaborator) commented:

I will submit a PR to your branch.

@linqq9 (Contributor, Author) commented Oct 10, 2024

> I will submit a PR to your branch.

ok, thanks!

@HuanzhiMao (Collaborator) commented:

Hi @linqq9, I have submitted a PR to your branch: MadeAgents#2
Things I changed:

  1. Let HammerHandler inherit from OSSHandler instead, and copy over any necessary decoding logic from the SalesforceHandler. This simplifies things a lot.
  2. Change to use the completions endpoint instead of chat.completions. This won't affect the score.
  3. Since Hammer doesn't take a user-supplied system message, we turn any system message into a user message (only the message role changes). This affects the live categories, and is also what we do for other models in similar situations.
  4. Since Hammer has its own system prompt, we don't need to add the default BFCL system prompt in _pre_query_processing_prompting.

P.S. After the change, the score is roughly the same as what you reported for MadeAgents/Hammer2.0-7b (FC).
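For illustration, point 3 above amounts to something like the following sketch (assuming OpenAI-style message dicts; the helper name is hypothetical and this is not the exact BFCL code):

```python
# Hypothetical sketch of the system-to-user role conversion; not the
# exact BFCL handler implementation.
def demote_system_messages(messages: list[dict]) -> list[dict]:
    # Hammer does not accept a user-supplied system message, so any
    # system message is re-labeled as a user message; the content is
    # left untouched.
    return [
        {**m, "role": "user"} if m.get("role") == "system" else m
        for m in messages
    ]
```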

@HuanzhiMao (Collaborator) left a comment:

LGTM

@linqq9 (Contributor, Author) commented Oct 11, 2024

> Hi @linqq9, I have submitted a PR to your branch: MadeAgents#2
> Things I changed:
>
>   1. Let HammerHandler inherit from OSSHandler instead, and copy over any necessary decoding logic from the SalesforceHandler. This simplifies things a lot.
>   2. Change to use the completions endpoint instead of chat.completions. This won't affect the score.
>   3. Since Hammer doesn't take a user-supplied system message, we turn any system message into a user message (only the message role changes). This affects the live categories, and is also what we do for other models in similar situations.
>   4. Since Hammer has its own system prompt, we don't need to add the default BFCL system prompt in _pre_query_processing_prompting.
>
> P.S. After the change, the score is roughly the same as what you reported for MadeAgents/Hammer2.0-7b (FC).

@HuanzhiMao Thank you for your modification!

@ShishirPatil ShishirPatil merged commit 79c50ab into ShishirPatil:main Oct 15, 2024
ShishirPatil pushed a commit that referenced this pull request Oct 21, 2024
This PR updates the leaderboard to reflect the changes in score due to
the following PR merges:

1. #660 
2. #661
3. #683
4. #679
5. #708 
6. #709
7. #701
8. #657 
9. #658 
10. #640 
11. #653
12. #642 
13. #696 
14. #667

Close #662.

Note: Some models (like `firefunction`, `functionary`,
`microsoft/phi`) are not included in this leaderboard update because we
don't have all the entries generated. We will add them back once the
full results are generated.
VishnuSuresh27 pushed a commit to VishnuSuresh27/gorilla that referenced this pull request Nov 11, 2024
Hello, we have updated the Hammer handler and added the Hammer2.0 series
models, including
[Hammer2.0-7b](https://huggingface.co/MadeAgents/Hammer2.0-7b),
[Hammer2.0-3b](https://huggingface.co/MadeAgents/Hammer2.0-3b),
[Hammer2.0-1.5b](https://huggingface.co/MadeAgents/Hammer2.0-1.5b) and
[Hammer2.0-0.5b](https://huggingface.co/MadeAgents/Hammer2.0-0.5b). The
performance on BFCL-V3 is as follows:

| Model | Overall Acc | Non-live AST | Non-live Exec | Live AST | Multi Turn Acc | Relevance | Irrelevance |
|--------------------------------|-------------|--------------|---------------|----------|----------------|-----------|-------------|
| MadeAgents/Hammer2.0-7b (FC) | 56.60 | 90.15 | 82.64 | 68.68 | 15.75 | 92.68 | 68.20 |
| MadeAgents/Hammer2.0-1.5b (FC) | 51.94 | 84.31 | 81.80 | 63.17 | 11.38 | 92.68 | 61.83 |
| MadeAgents/Hammer2.0-3b (FC) | 49.88 | 86.77 | 80.25 | 66.06 | 0.50 | 92.68 | 68.59 |
| MadeAgents/Hammer2.0-0.5b (FC) | 39.51 | 67.00 | 65.73 | 51.62 | 0.00 | 87.80 | 67.00 |

---------

Co-authored-by: linqiqiang1 <[email protected]>
Co-authored-by: Huanzhi (Hans) Mao <[email protected]>