Fs 104/Fix-Webagent #44

Open
wants to merge 3 commits into
base: main

75 changes: 58 additions & 17 deletions backend/promptfoo/intent_config.yaml
@@ -1,4 +1,4 @@
-description: "Intent"
+description: 'Intent'

Collaborator

The other YAML files use double quotes ("). I assume your IDE's linter flagged this; could you update your IDE linting config instead of making this change?

providers:
- id: openai:gpt-4o-mini
@@ -8,54 +8,95 @@ providers:
prompts: file://promptfoo_test_runner.py:create_prompt

tests:
- description: "questions directed towards the database lookups should have only 1 question -1"
- description: 'questions directed towards the database lookups should have only 1 question -1'
vars:
system_prompt_template: "intent-system"
user_prompt_template: "intent"
system_prompt_template: 'intent-system'
user_prompt_template: 'intent'
user_prompt_args:
chat_history: []
question: "Check the database and tell me the average ESG score (Environmental) for the WhiteRock ETF fund"
question: 'Check the database and tell me the average ESG score (Environmental) for the WhiteRock ETF fund'
assert:
- type: javascript
value: JSON.parse(output).questions.length === 0

- description: "questions directed towards the database look ups should have only 1 question -2"
- description: 'questions directed towards the database look ups should have only 1 question -2'
vars:
system_prompt_template: "intent-system"
user_prompt_template: "intent"
system_prompt_template: 'intent-system'
user_prompt_template: 'intent'
user_prompt_args:
chat_history: []
question: "Using Bloomberg.csv dataset give me the company with the best esg score"
question: 'Using Bloomberg.csv dataset give me the company with the best esg score'
assert:
- type: javascript
value: JSON.parse(output).questions.length === 0

- description: "verify that the correct company name is determined from the chat history"
- description: 'verify that the correct company name is determined from the chat history'
vars:
system_prompt_template: "intent-system"
user_prompt_template: "intent"
system_prompt_template: 'intent-system'
user_prompt_template: 'intent'
user_prompt_args:
chat_history: |
[
"User: When was Coca Cola founded?",
"System: Coca-Cola was founded on May 8, 1886.",
]
question: "What is their best selling product?"
question: 'What is their best selling product?'
assert:
- type: javascript
value: output.includes("Coca-Cola") || output.includes("Coca Cola")

- description: "verify that the question is correctly split up"
- description: 'verify that the question is correctly split up'
vars:
system_prompt_template: "intent-system"
user_prompt_template: "intent"
system_prompt_template: 'intent-system'
user_prompt_template: 'intent'
user_prompt_args:
chat_history: []
question: "Compare Ryanair emissions to other companies in the industry"
question: 'Compare Ryanair emissions to other companies in the industry'
assert:
- type: javascript
value: JSON.parse(output).questions[0].includes("Ryanair")
- type: llm-rubric
value: The 1st item in the questions array contains a question about finding the emissions for Ryanair
- type: llm-rubric
value: The 2nd item in the questions array contains a question about finding the emissions for companies in the industry

- description: 'verify intent for finding ESG scores online in the Technology sector'
vars:
system_prompt_template: 'intent-system'
user_prompt_template: 'intent'
user_prompt_args:
chat_history: []
question: 'provide a list of companies with the highest ESG scores in the Technology sector?'
assert:
- type: javascript
value: JSON.parse(output).user_intent.includes("Technology sector")
- type: javascript
value: JSON.parse(output).questions[0].includes("highest ESG scores")
- type: llm-rubric
value: The output correctly identifies the intent to search online for companies in the Technology sector with high ESG scores.

- description: 'Validation - General information is rejected'
vars:
system_prompt_template: 'validator'
user_prompt_template: 'validate'
user_prompt_args:
task: 'Provide a list of companies with the highest ESG scores in the Technology sector.'
answer: "As of the end of 2023, the Technology sector had the highest weighted-average ESG score among all sectors, according to the MSCI ACWI SRI Index. However, I don't have a specific list of individual companies with the highest scores."
assert:
- type: javascript
value: JSON.parse(output).response === "false"
- type: llm-rubric
value: The reasoning should explain that general sector information is insufficient to fulfill the task.

- description: 'Validation - Incorrect company is rejected'
vars:
system_prompt_template: 'validator'
user_prompt_template: 'validate'
user_prompt_args:
task: "What are Apple's ESG scores?"
answer: "Microsoft's ESG (Environmental, Social, and Governance) scores are as follows: Environmental Score of 95.0, Social Score of 90.0, Governance Score of 92.0."
assert:
- type: javascript
value: JSON.parse(output).response === "false"
- type: llm-rubric
value: The reasoning should explain that the scores provided do not match Apple's scores as requested.
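
The two validator tests above compare `response` against the string "false" rather than a JSON boolean, which matches the output format shown in validator.j2. A minimal Python sketch of the same check, for illustration only (the helper name `validator_rejected` is hypothetical and not part of this PR):

```python
import json


def validator_rejected(output: str) -> bool:
    """Return True when the validator's JSON output rejects the answer.

    Hypothetical helper: mirrors the JavaScript assertion
    JSON.parse(output).response === "false" used in the tests.
    """
    parsed = json.loads(output)
    # The template returns the verdict as the string "true"/"false",
    # so a real JSON boolean would not match here.
    return parsed.get("response") == "false"


sample = json.dumps({
    "response": "false",
    "reasoning": "General sector information does not list individual companies.",
})
print(validator_rejected(sample))
```

A strict string comparison like this fails if the model ever emits an unquoted `false`, which may or may not be the desired behaviour for these tests.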
2 changes: 1 addition & 1 deletion backend/src/agents/web_agent.py
@@ -86,7 +86,7 @@ async def web_general_search_core(search_query, llm, model) -> str:
continue # Skip if the summarization is not valid
response = {
"content": summary,
-"ignore_validation": "false"
+"ignore_validation": "true" # This is to ignore the validation of the answer again by the supervisor
Collaborator

Why are we disabling validation for the web agent?

Collaborator Author

@evpearce - Because we have already validated it on line 84.

}
return json.dumps(response, indent=4)
return "No relevant information found on the internet for the given query."
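
The exchange above turns on how the supervisor treats the `ignore_validation` flag. The supervisor code is not part of this diff, so the following is only a sketch under the assumption that it parses the agent's JSON reply and skips its own validation when the flag is the string "true"; `needs_validation` is a hypothetical name.

```python
import json


def needs_validation(agent_output: str) -> bool:
    """Hypothetical supervisor-side check: validate unless the agent opted out."""
    try:
        payload = json.loads(agent_output)
    except json.JSONDecodeError:
        # Plain-text replies, such as the "No relevant information found"
        # fallback, carry no flag, so they are validated as usual.
        return True
    if not isinstance(payload, dict):
        return True
    # The flag is a string in the agent's response, not a JSON boolean.
    return payload.get("ignore_validation") != "true"


reply = json.dumps({"content": "summary text", "ignore_validation": "true"}, indent=4)
print(needs_validation(reply))
```

Under this assumption the web agent's summary is validated once inside the agent and not a second time by the supervisor.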
5 changes: 4 additions & 1 deletion backend/src/prompts/templates/intent-system.j2
@@ -13,4 +13,7 @@ Output your result in the following json format:
"question": "string of the original question",
"user_intent": "string of the intent of the user's question",
"questions": array of singular objective questions or if the question mentions csv, dataset or database an empty array
}
}

Guidelines:
Collaborator

Why do we want this?
If this is a beneficial change, then I would expect more tests to be added to intent_config.yaml to prove that this is working as expected.

Collaborator Author

I saw that the intent was not being determined correctly for ESG-related tasks: the request was going to the Datastore agent when there was more than one question.
Sorry, I wasn't aware of intent_config.yaml. I will have a look at it and add more tests there.

Collaborator

Do you have an example of when it was wrong, and how? I'm still not quite getting the problem.

- If the user has asked to check online, then each question in the questions array should also specify that.
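
One way to test the new guideline is to check that every entry in the `questions` array still mentions the online search. The sketch below is a plain Python helper, not an existing promptfoo assertion; the function name and the keyword "online" are assumptions chosen for illustration.

```python
import json


def questions_keep_online_hint(output: str, keyword: str = "online") -> bool:
    """Return True if every split question still mentions the online search.

    Hypothetical helper: expects the intent JSON with a "questions" array.
    An empty array (the database/CSV case) returns False, because this
    check only makes sense for requests that ask to search online.
    """
    parsed = json.loads(output)
    questions = parsed.get("questions", [])
    return bool(questions) and all(keyword in q.lower() for q in questions)


example = json.dumps({
    "question": "Check online and compare Ryanair emissions to the industry",
    "user_intent": "compare emissions using online sources",
    "questions": [
        "Search online for Ryanair's emissions",
        "Search online for emissions of other airlines",
    ],
})
print(questions_keep_online_hint(example))
```

A keyword match is brittle against paraphrases such as "on the web", so an `llm-rubric` assertion may be a better fit for the real test.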
23 changes: 20 additions & 3 deletions backend/src/prompts/templates/validator.j2
@@ -2,7 +2,7 @@ You are an expert validator. You can help with validating the answers to the tas

Your entire purpose is to return a "true" or "false" value to indicate if the answer has fulfilled the task, along with a reasoning to explain your decision.

-You will be passed a task and an answer. You need to determine if the answer is correct or not.
+You will be passed a task and an answer. You need to determine if the answer is correct or not, ensuring that the task's specific requirements are addressed.
Collaborator

Why are there changes to the validator template if we are disabling it for the web agent? Again, promptfoo tests should be added for these changes.

Collaborator Author

I disabled the second validation; I will add promptfoo tests.

Collaborator

I see, thanks for explaining.


Output format:

@@ -14,10 +14,13 @@ json
}

**Validation Guidelines:**
- Be lenient - if the answer looks reasonably accurate, return "true".
- If multiple entities have the same highest score and this matches the query intent, return "true".
- The answer must fulfill the specific intent of the task, not just provide related information.
- Be lenient if the answer is reasonably accurate and fulfills the task's intent, even if it lacks minor details.
- If specific data (like a list of companies) is requested but missing, return "false."
- If multiple entities have the same highest score and this matches the query intent, return "true."
- Spending is negative; ensure any calculations involving spending reflect this if relevant to the task.


Example:
Task: What is 2 + 2?
Answer: 4
@@ -33,6 +36,20 @@ Answer: 5
"reasoning": "The answer is incorrect; 2 + 2 equals 4, not 5."
}

Task: Provide a list of companies with the highest ESG scores in the Technology sector.
Answer: As of the end of 2023, the Technology sector had the highest weighted-average ESG score among all sectors, according to the MSCI ACWI SRI Index. However, I don't have a specific list of individual companies with the highest scores.
{
"response": "false",
"reasoning": "The answer provides general information about ESG scores in the Technology sector but fails to fulfill the task's intent of listing companies with the highest scores."
}

Task: Provide a list of companies with the highest ESG scores in the Technology sector.
Answer: Here are the companies with the highest ESG scores in the Technology sector: 1. Apple Inc., 2. Microsoft Corp., 3. Alphabet Inc.
{
"response": "true",
"reasoning": "The answer lists companies with the highest ESG scores in the Technology sector, fulfilling the task's intent."
}

Task: What are Apple's ESG scores?
Answer: Apple's ESG (Environmental, Social, and Governance) scores are as follows: Environmental Score of 95.0, Social Score of 90.0, Governance Score of 92.0.
{
2 changes: 1 addition & 1 deletion backend/src/utils/web_utils.py
@@ -13,7 +13,7 @@
engine = PromptEngine()


-async def search_urls(search_query, num_results=10) -> str:
+async def search_urls(search_query, num_results=30) -> str:
Collaborator

Does this change slow down the web agent significantly? We are looking to improve the web agent search in general with https://scottlogic.atlassian.net/browse/FS-46

logger.info(f"Searching the web for: {search_query}")
try:
https_urls = [str(url) for url in search(search_query, num_results=num_results) if str(url).startswith("https")]
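
The performance question above could be answered by timing the search at both values of `num_results`. The sketch below shows only the shape of such a measurement: `fake_search_urls` is a stand-in with an artificial delay, not the real `search_urls`, so its timings say nothing about actual latency until the real function is swapped in.

```python
import asyncio
import time


async def fake_search_urls(search_query: str, num_results: int = 10) -> list[str]:
    """Stand-in for search_urls: pretends each result costs a fixed delay."""
    await asyncio.sleep(0.001 * num_results)
    return [f"https://example.com/{i}" for i in range(num_results)]


async def time_search(num_results: int) -> tuple[int, float]:
    """Run one search and return (number of URLs, elapsed seconds)."""
    start = time.perf_counter()
    urls = await fake_search_urls("ESG scores technology sector", num_results=num_results)
    return len(urls), time.perf_counter() - start


async def main() -> dict[int, float]:
    timings: dict[int, float] = {}
    for n in (10, 30):  # the old and new defaults from this diff
        count, elapsed = await time_search(n)
        timings[n] = elapsed
        print(f"num_results={n}: {count} urls in {elapsed:.3f}s")
    return timings


if __name__ == "__main__":
    asyncio.run(main())
```

Replacing `fake_search_urls` with the project's `search_urls` and repeating each run a few times would give a rough before/after comparison.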