diff --git a/backend/library/AstraZeneca-Sustainability-Report-2023.pdf b/backend/library/AstraZeneca-Sustainability-Report-2023.pdf
new file mode 100644
index 00000000..d793f551
Binary files /dev/null and b/backend/library/AstraZeneca-Sustainability-Report-2023.pdf differ
diff --git a/backend/promptfoo/report_agent_config.yaml b/backend/promptfoo/report_agent_config.yaml
index 458e86b4..c06c30e9 100644
--- a/backend/promptfoo/report_agent_config.yaml
+++ b/backend/promptfoo/report_agent_config.yaml
@@ -1,7 +1,7 @@
 description: "Test Report Agent Prompts"
 
 providers:
-  - id: mistral:mistral-large-latest
+  - id: openai:gpt-4o-mini
     config:
       temperature: 0
@@ -12,102 +12,7 @@ tests:
     vars:
      user_prompt_template: "create-report-user-prompt"
      system_prompt_template: "create-report-system-prompt"
-      user_prompt_args:
-        document_text: "Carbon Reduction Plan (Published September 2024)
-Supplier name: Amazon Web Services EU SARL (UK Branch) (“AWS UK”)
-Publication date: September 30, 2024
-Commitment to Achieving Net Zero
-AWS UK, as part of Amazon.com, Inc. (“Amazon”), is committed to achieving net-zero emissions by 2040. In 2019, Amazon co-founded The Climate Pledge, a public commitment to innovate, use our scale for good and go faster to address the urgency of the climate crisis to reach net-zero carbon across the entire organization by 2040. Since committing to the Pledge, we’ve changed how we conduct our business and the running of our operations, and we’ve increased funding and implementation of new technologies and services that decarbonize and help preserve the natural world, alongside the ambitious goals outlined in The Climate Pledge. We’re fully committed to our goals and our work to build a better planet.
-Baseline Emissions Footprint
-Base Year emissions are a record of the greenhouse gases that have been produced in the past and are the reference point against which emissions reduction can be measured.
-Baseline Year: 2020
-Additional Details relating to the Baseline Emissions calculations:
-AWS UK utilized January 1, 2020 to December 31, 2020 as the baseline year for emissions reporting under this Carbon Reduction Plan. Our plan includes emissions data from relevant affiliate companies helping to provide AWS UK’s services to our customers. We’ve included both location-based and market-based method Scope 2 emissions in the following tables. AWS UK benefits from contractual arrangements entered into by our affiliate(s) for renewable electricity and/or renewable attributes that are reflected in the market-based data set. More information about our corporate carbon footprint and methodology can be found on our website.
-Our baseline year does not include Scope 1 emissions. In 2022 we updated our methodology and Scope 1 emissions are now included in total emissions for AWS UK.
-Baseline year emissions (tCO2e):
-Scope 1: 0
-Scope 2: 61,346 (location-based method); 2,813 (market-based method)
-Scope 3 (included sources): 3,770
-Total Emissions: 65,116 (location-based method); 6,583 (market-based method)
-Current Emissions Reporting
-Reporting Year: 2023 (January 1, 2023 to December 31, 2023)
-Emissions (tCO2e):
-Scope 1: 2,233
-Scope 2: 126,755 (location-based method); 0 (market-based method)
-Scope 3 (included sources): 13,188
-Total Emissions: 142,176 (location-based method); 15,421 (market-based method)
-Emissions Reduction Targets
-In 2019, we set an ambitious goal to match 100% of the electricity we use with renewable energy by 2030. This goal includes all data centres, logistics facilities, physical stores, and corporate offices, as well as on-site charging points and our financially integrated subsidiaries. We are proud to have achieved this goal in 2023, seven years early, with 100% of the electricity consumed by Amazon matched with renewable energy sources.
-Amazon continues to be transparent and share our progress to reach net-zero carbon in our annual Sustainability Report, which also includes details on how we measure carbon.
-Carbon Reduction Projects
-Completed Carbon Reduction Initiatives
-Amazon continues to take actions across our operations to drive carbon reduction around the world, including in the UK. As of January 2024, Amazon’s renewable energy portfolio includes 243 wind and solar farms and 270 rooftop solar projects, totalling 513 projects and 28 gigawatts of renewable energy capacity. This includes several utility-scale renewable energy projects located within the UK:
-• In 2019, Amazon announced our first power purchase agreement in the UK, located in Kintyre Peninsula, Scotland. The “Amazon Wind Farm Scotland – Beinn an Tuirc 3” began operating in 2021, providing 50 megawatts (MW) of new renewable capacity to the electricity grid with expected generation of 168,000 megawatt hours (MWh) of clean energy annually. That’s enough to power 46,000 UK homes every year.
-• In December 2020, Amazon announced a two-phase renewable energy project located in South Lanarkshire, Scotland, the Kennoxhead wind farm. Kennoxhead will be the largest single-site onshore wind project in the UK, enabled through corporate procurement. Once fully operational, Kennoxhead will produce 129 MW of renewable capacity and is expected to generate 439,000 MWh of clean energy annually. Phase 1 (60 MW) began operating in 2022, and Phase 2 (69 MW) will begin operations in 2024.
-• In 2022, Amazon announced its first project in Northern Ireland, a 16 MW onshore windfarm in Co Antrim.
-• In 2022, Amazon also announced a new 473 MW offshore wind farm, Moray West, located off the coast of Scotland. Amazon expects completion of Moray West in 2024. This is Amazon’s largest project in Scotland and the largest corporate renewable energy deal announced by any company in the UK to date.
-• In 2023, Amazon announced a new 47 MW solar farm, Warley, located in Essex. This project is expected to be operational in 2024.
-Declaration and Sign Off
-This Carbon Reduction Plan has been completed in accordance with PPN 06/21 and associated guidance and reporting standard for Carbon Reduction Plans.
-Emissions have been reported and recorded in accordance with the published reporting standard for Carbon Reduction Plans and the GHG Reporting Protocol corporate standard, and uses the appropriate Government emission conversion factors for greenhouse gas company reporting.
-Scope 1 and Scope 2 emissions have been reported in accordance with SECR requirements, and the required subset of Scope 3 emissions have been reported in accordance with the published reporting standard for Carbon Reduction Plans and the Corporate Value Chain (Scope 3) Standard.
-This Carbon Reduction Plan has been reviewed and signed off by the board of directors (or equivalent management body)."
+      file_attachment: "../library/AstraZeneca-Sustainability-Report-2023.pdf"
    assert:
      - type: contains-all
        value:
@@ -122,106 +27,11 @@ equivalent management body)."
    vars:
      user_prompt_template: "find-company-name-from-file-user-prompt"
      system_prompt_template: "find-company-name-from-file-system-prompt"
-      user_prompt_args:
-        file_content: "Carbon Reduction Plan (Published September 2024)
-Supplier name: Amazon Web Services EU SARL (UK Branch) (“AWS UK”)
-Publication date: September 30, 2024
-Commitment to Achieving Net Zero
-AWS UK, as part of Amazon.com, Inc. (“Amazon”), is committed to achieving net-zero emissions by 2040. In 2019, Amazon co-founded The Climate Pledge, a public commitment to innovate, use our scale for good and go faster to address the urgency of the climate crisis to reach net-zero carbon across the entire organization by 2040. Since committing to the Pledge, we’ve changed how we conduct our business and the running of our operations, and we’ve increased funding and implementation of new technologies and services that decarbonize and help preserve the natural world, alongside the ambitious goals outlined in The Climate Pledge. We’re fully committed to our goals and our work to build a better planet.
-Baseline Emissions Footprint
-Base Year emissions are a record of the greenhouse gases that have been produced in the past and are the reference point against which emissions reduction can be measured.
-Baseline Year: 2020
-Additional Details relating to the Baseline Emissions calculations:
-AWS UK utilized January 1, 2020 to December 31, 2020 as the baseline year for emissions reporting under this Carbon Reduction Plan. Our plan includes emissions data from relevant affiliate companies helping to provide AWS UK’s services to our customers. We’ve included both location-based and market-based method Scope 2 emissions in the following tables. AWS UK benefits from contractual arrangements entered into by our affiliate(s) for renewable electricity and/or renewable attributes that are reflected in the market-based data set. More information about our corporate carbon footprint and methodology can be found on our website.
-Our baseline year does not include Scope 1 emissions. In 2022 we updated our methodology and Scope 1 emissions are now included in total emissions for AWS UK.
-Baseline year emissions (tCO2e):
-Scope 1: 0
-Scope 2: 61,346 (location-based method); 2,813 (market-based method)
-Scope 3 (included sources): 3,770
-Total Emissions: 65,116 (location-based method); 6,583 (market-based method)
-Current Emissions Reporting
-Reporting Year: 2023 (January 1, 2023 to December 31, 2023)
-Emissions (tCO2e):
-Scope 1: 2,233
-Scope 2: 126,755 (location-based method); 0 (market-based method)
-Scope 3 (included sources): 13,188
-Total Emissions: 142,176 (location-based method); 15,421 (market-based method)
-Emissions Reduction Targets
-In 2019, we set an ambitious goal to match 100% of the electricity we use with renewable energy by 2030. This goal includes all data centres, logistics facilities, physical stores, and corporate offices, as well as on-site charging points and our financially integrated subsidiaries. We are proud to have achieved this goal in 2023, seven years early, with 100% of the electricity consumed by Amazon matched with renewable energy sources.
-Amazon continues to be transparent and share our progress to reach net-zero carbon in our annual Sustainability Report, which also includes details on how we measure carbon.
-Carbon Reduction Projects
-Completed Carbon Reduction Initiatives
-Amazon continues to take actions across our operations to drive carbon reduction around the world, including in the UK. As of January 2024, Amazon’s renewable energy portfolio includes 243 wind and solar farms and 270 rooftop solar projects, totalling 513 projects and 28 gigawatts of renewable energy capacity. This includes several utility-scale renewable energy projects located within the UK:
-• In 2019, Amazon announced our first power purchase agreement in the UK, located in Kintyre Peninsula, Scotland. The “Amazon Wind Farm Scotland – Beinn an Tuirc 3” began operating in 2021, providing 50 megawatts (MW) of new renewable capacity to the electricity grid with expected generation of 168,000 megawatt hours (MWh) of clean energy annually. That’s enough to power 46,000 UK homes every year.
-• In December 2020, Amazon announced a two-phase renewable energy project located in South Lanarkshire, Scotland, the Kennoxhead wind farm. Kennoxhead will be the largest single-site onshore wind project in the UK, enabled through corporate procurement. Once fully operational, Kennoxhead will produce 129 MW of renewable capacity and is expected to generate 439,000 MWh of clean energy annually. Phase 1 (60 MW) began operating in 2022, and Phase 2 (69 MW) will begin operations in 2024.
-• In 2022, Amazon announced its first project in Northern Ireland, a 16 MW onshore windfarm in Co Antrim.
-• In 2022, Amazon also announced a new 473 MW offshore wind farm, Moray West, located off the coast of Scotland. Amazon expects completion of Moray West in 2024. This is Amazon’s largest project in Scotland and the largest corporate renewable energy deal announced by any company in the UK to date.
-• In 2023, Amazon announced a new 47 MW solar farm, Warley, located in Essex. This project is expected to be operational in 2024.
-Declaration and Sign Off
-This Carbon Reduction Plan has been completed in accordance with PPN 06/21 and associated guidance and reporting standard for Carbon Reduction Plans.
-Emissions have been reported and recorded in accordance with the published reporting standard for Carbon Reduction Plans and the GHG Reporting Protocol corporate standard, and uses the appropriate Government emission conversion factors for greenhouse gas company reporting.
-Scope 1 and Scope 2 emissions have been reported in accordance with SECR requirements, and the required subset of Scope 3 emissions have been reported in accordance with the published reporting standard for Carbon Reduction Plans and the Corporate Value Chain (Scope 3) Standard.
-This Carbon Reduction Plan has been reviewed and signed off by the board of directors (or equivalent management body)."
+      file_attachment: "../library/AstraZeneca-Sustainability-Report-2023.pdf"
    assert:
      - type: is-json
        value:
          required: ["company_name"]
          type: object
      - type: javascript
-        value: JSON.parse(output).company_name === "Amazon"
+        value: JSON.parse(output).company_name === "AstraZeneca"
diff --git a/backend/src/agents/report_agent.py b/backend/src/agents/report_agent.py
index a2f911d2..34c29cbc 100644
--- a/backend/src/agents/report_agent.py
+++ b/backend/src/agents/report_agent.py
@@ -1,6 +1,7 @@
 import json
 import logging
 
+from src.llm.llm import LLMFile
 from src.agents import Agent
 from src.prompts import PromptEngine
@@ -9,25 +10,19 @@
-    async def create_report(self, file_content: str, materiality_topics: dict[str, str]) -> str:
-        user_prompt = engine.load_prompt(
-            "create-report-user-prompt",
-            document_text=file_content,
-            materiality_topics=materiality_topics
+    async def create_report(self, file: LLMFile, materiality_topics: dict[str, str]) -> str:
+        return await self.llm.chat_with_file(
+            self.model,
+            system_prompt=engine.load_prompt("create-report-system-prompt"),
+            user_prompt=engine.load_prompt("create-report-user-prompt", materiality_topics=materiality_topics),
+            files=[file],
         )
-        system_prompt = engine.load_prompt("create-report-system-prompt")
-
-        return await self.llm.chat(self.model, system_prompt=system_prompt, user_prompt=user_prompt)
-
-    async def get_company_name(self, file_content: str) -> str:
-        response = await self.llm.chat(
+    async def get_company_name(self, file: LLMFile) -> str:
+        response = await self.llm.chat_with_file(
             self.model,
             system_prompt=engine.load_prompt("find-company-name-from-file-system-prompt"),
-            user_prompt=engine.load_prompt(
-                "find-company-name-from-file-user-prompt",
-                file_content=file_content
-            ),
-            return_json=True
+            user_prompt=engine.load_prompt("find-company-name-from-file-user-prompt"),
+            files=[file],
         )
         return json.loads(response)["company_name"]
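Reviewer note: the refactor hinges on the `LLMFile` value object that now travels from the director down to each LLM implementation. Its definition is not part of this diff; judging from the call sites (keyword construction with bytes in the director, positional construction with a `Path` in `openai_test.py`), it is roughly a two-field container. A minimal sketch, assuming a plain dataclass:

```python
from dataclasses import dataclass
from os import PathLike


@dataclass
class LLMFile:
    """A file handed to an LLM: raw bytes, or a path to read from disk.

    Hypothetical reconstruction; the real class lives in src/llm/llm.py
    and is not shown in this diff.
    """

    file_name: str
    file: bytes | PathLike | str
```

Keeping bytes and paths behind one type is what lets `handle_file_upload` and `OpenAI.__upload_files` branch on `isinstance(file.file, bytes)` later in the diff.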
diff --git a/backend/src/api/app.py b/backend/src/api/app.py
index dc4eb205..1d3044b1 100644
--- a/backend/src/api/app.py
+++ b/backend/src/api/app.py
@@ -8,7 +8,7 @@
 from src.utils.scratchpad import ScratchPadMiddleware
 from src.session.chat_response import get_session_chat_response_ids
 from src.chat_storage_service import clear_chat_messages, get_chat_message
-from src.directors.report_director import report_on_file_upload
+from src.directors.report_director import create_report_from_file
 from src.session.file_uploads import clear_session_file_uploads, get_report
 from src.session.redis_session_middleware import reset_session
 from src.utils import Config, test_connection
@@ -129,7 +129,7 @@ async def report(file: UploadFile):
     logger.info(f"upload file type={file.content_type} name={file.filename} size={file.size}")
     try:
-        processed_upload = await report_on_file_upload(file)
+        processed_upload = await create_report_from_file(file)
         return JSONResponse(status_code=200, content=processed_upload)
     except HTTPException as he:
         raise he
@@ -137,6 +137,7 @@ async def report(file: UploadFile):
         logger.exception(e)
         return JSONResponse(status_code=500, content=file_upload_failed_response)
 
+
 @app.get("/report/{id}")
 def download_report(id: str):
     logger.info(f"Get report download called for id: {id}")
@@ -144,12 +145,13 @@ def download_report(id: str):
         final_result = get_report(id)
         if final_result is None:
             return JSONResponse(status_code=404, content=f"Message with id {id} not found")
-        headers = {'Content-Disposition': 'attachment; filename="report.md"'}
-        return Response(final_result.get("report"), headers=headers, media_type='text/markdown')
+        headers = {"Content-Disposition": 'attachment; filename="report.md"'}
+        return Response(final_result.get("report"), headers=headers, media_type="text/markdown")
     except Exception as e:
         logger.exception(e)
         return JSONResponse(status_code=500, content=report_get_upload_failed_response)
 
+
 @app.get("/uploadfile")
 async def fetch_file(id: str):
     logger.info(f"fetch uploaded file id={id} ")
diff --git a/backend/src/directors/report_director.py b/backend/src/directors/report_director.py
index f3f4a18c..b20425de 100644
--- a/backend/src/directors/report_director.py
+++ b/backend/src/directors/report_director.py
@@ -1,38 +1,49 @@
-from fastapi import UploadFile
+import uuid
+from fastapi import UploadFile, HTTPException
 
-from src.session.file_uploads import FileUploadReport, store_report
-from src.utils.file_utils import handle_file_upload
+from src.llm.llm import LLMFile
+from src.session.file_uploads import ReportResponse, store_report
 from src.agents import get_report_agent, get_materiality_agent
 
+MAX_FILE_SIZE = 10 * 1024 * 1024
 
-async def report_on_file_upload(upload: UploadFile) -> FileUploadReport:
-    file = handle_file_upload(upload)
+
+async def create_report_from_file(upload: UploadFile) -> ReportResponse:
+    file_bytes = await upload.read()
+    if not upload.filename:
+        raise HTTPException(status_code=400, detail="Filename missing from file upload")
+
+    if len(file_bytes) > MAX_FILE_SIZE:
+        raise HTTPException(status_code=413, detail=f"File upload must be less than {MAX_FILE_SIZE} bytes")
+
+    file = LLMFile(file_name=upload.filename, file=file_bytes)
+    file_id = str(uuid.uuid4())
 
     report_agent = get_report_agent()
-    company_name = await report_agent.get_company_name(file["content"])
+    company_name = await report_agent.get_company_name(file)
     topics = await get_materiality_agent().list_material_topics(company_name)
-    report = await get_report_agent().create_report(file["content"], topics)
+    report = await report_agent.create_report(file, topics)
 
-    report_upload = FileUploadReport(
-        filename=file["filename"],
-        id=file["uploadId"],
+    report_response = ReportResponse(
+        filename=file.file_name,
+        id=file_id,
         report=report,
-        answer=create_report_chat_message(file["filename"], company_name, topics)
+        answer=create_report_chat_message(file.file_name, company_name, topics),
     )
 
-    store_report(report_upload)
+    store_report(report_response)
 
-    return report_upload
+    return report_response
 
 
 def create_report_chat_message(file_name: str, company_name: str, topics: dict[str, str]) -> str:
-    topics_with_markdown = [
-        f"{key}\n{value}" for key, value in topics.items()
-    ]
+    topics_with_markdown = [f"{key}\n{value}" for key, value in topics.items()]
     return f"""Your report for {file_name} is ready to view.

The following materiality topics were identified for {company_name} which the report focuses on:
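One note on the director's size check: `len()` on the payload is the number of bytes actually uploaded, whereas `sys.getsizeof()` reports the size of the CPython object, which includes the interpreter's object header and therefore overstates the file size. A quick illustration:

```python
import sys

payload = b"x" * 1024

print(len(payload))            # 1024: the number of bytes in the upload
print(sys.getsizeof(payload))  # 1057 on a 64-bit CPython: adds ~33 bytes of object overhead
```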
return f"""Your report for {file_name} is ready to view. The following materiality topics were identified for {company_name} which the report focuses on: diff --git a/backend/src/llm/mistral.py b/backend/src/llm/mistral.py index bf303fdf..875e0f28 100644 --- a/backend/src/llm/mistral.py +++ b/backend/src/llm/mistral.py @@ -1,7 +1,7 @@ -from typing import Coroutine - +from fastapi import HTTPException from mistralai import Mistral as MistralApi, UserMessage, SystemMessage import logging +from src.utils.file_utils import handle_file_upload from src.utils import Config from .llm import LLM, LLMFile @@ -35,11 +35,18 @@ async def chat(self, model, system_prompt: str, user_prompt: str, return_json=Fa logger.debug('{0} response : "{1}"'.format(model, content)) return content - def chat_with_file( + async def chat_with_file( self, model: str, system_prompt: str, user_prompt: str, - files: list[LLMFile] - ) -> Coroutine: - raise Exception("Mistral does not support chat_with_file") + files: list[LLMFile], + ) -> str: + try: + for file in files: + file = handle_file_upload(file) + extracted_content = file["content"] + user_prompt += f"\n\nDocument:\n{extracted_content}" + return await self.chat(model, system_prompt, user_prompt) + except Exception as file_error: + raise HTTPException(status_code=500, detail=f"Failed to process files: {str(file_error)}") from file_error diff --git a/backend/src/llm/openai.py b/backend/src/llm/openai.py index 8b42f618..a80c0f37 100644 --- a/backend/src/llm/openai.py +++ b/backend/src/llm/openai.py @@ -17,7 +17,6 @@ def remove_citations(message: Text): class OpenAI(LLM): - async def chat(self, model, system_prompt: str, user_prompt: str, return_json=False) -> str: logger.debug( "##### Called open ai chat ... llm. Waiting on response model with prompt {0}.".format( @@ -33,7 +32,7 @@ async def chat(self, model, system_prompt: str, user_prompt: str, return_json=Fa {"role": "user", "content": user_prompt}, ], temperature=0, - response_format={"type": "json_object"} if return_json else NOT_GIVEN + response_format={"type": "json_object"} if return_json else NOT_GIVEN, ) content = response.choices[0].message.content logger.info(f"OpenAI response: Finish reason: {response.choices[0].finish_reason}, Content: {content}") @@ -48,13 +47,7 @@ async def chat(self, model, system_prompt: str, user_prompt: str, return_json=Fa logger.error(f"Error calling OpenAI model: {e}") return "An error occurred while processing the request." 
diff --git a/backend/src/llm/openai.py b/backend/src/llm/openai.py
index 8b42f618..a80c0f37 100644
--- a/backend/src/llm/openai.py
+++ b/backend/src/llm/openai.py
@@ -17,7 +17,6 @@ def remove_citations(message: Text):
 
 
 class OpenAI(LLM):
-
     async def chat(self, model, system_prompt: str, user_prompt: str, return_json=False) -> str:
         logger.debug(
             "##### Called open ai chat ... llm. Waiting on response model with prompt {0}.".format(
@@ -33,7 +32,7 @@
                 {"role": "user", "content": user_prompt},
             ],
             temperature=0,
-            response_format={"type": "json_object"} if return_json else NOT_GIVEN
+            response_format={"type": "json_object"} if return_json else NOT_GIVEN,
         )
         content = response.choices[0].message.content
         logger.info(f"OpenAI response: Finish reason: {response.choices[0].finish_reason}, Content: {content}")
@@ -48,13 +47,7 @@ async def chat(self, model, system_prompt: str, user_prompt: str, return_json=False) -> str:
             logger.error(f"Error calling OpenAI model: {e}")
             return "An error occurred while processing the request."
 
-    async def chat_with_file(
-        self,
-        model: str,
-        system_prompt: str,
-        user_prompt: str,
-        files: list[LLMFile]
-    ) -> str:
+    async def chat_with_file(self, model: str, system_prompt: str, user_prompt: str, files: list[LLMFile]) -> str:
         client = AsyncOpenAI(api_key=config.openai_key)
         file_ids = await self.__upload_files(files)
@@ -70,17 +63,12 @@
             {
                 "role": "user",
                 "content": user_prompt,
-                "attachments": [
-                    {"file_id": file_id, "tools": [{"type": "file_search"}]}
-                    for file_id in file_ids
-                ],
+                "attachments": [{"file_id": file_id, "tools": [{"type": "file_search"}]} for file_id in file_ids],
             }
         ]
     )
 
-    run = await client.beta.threads.runs.create_and_poll(
-        thread_id=thread.id, assistant_id=file_assistant.id
-    )
+    run = await client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=file_assistant.id)
 
     messages = await client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id)
@@ -98,6 +86,8 @@
         file_ids = []
         for file in files:
             logger.info(f"Uploading file '{file.file_name}' to OpenAI")
-            file = await client.files.create(file=file.file, purpose="assistants")
-            file_ids.append(file.id)
+            upload_source = (file.file_name, file.file) if isinstance(file.file, bytes) else file.file
+            response = await client.files.create(file=upload_source, purpose="assistants")
+            file_ids.append(response.id)
+
         return file_ids
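The `__upload_files` change is needed because `LLMFile.file` may now be raw bytes rather than a path or file handle. The SDK accepts a path or open handle directly, but bare bytes carry no filename, so they are wrapped in a `(filename, contents)` tuple. The shape in isolation, as a sketch:

```python
from pathlib import Path

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="sk-...")  # placeholder key


async def upload_for_assistants(file_name: str, file: bytes | Path) -> str:
    # Bytes need a name attached; paths and file handles already have one.
    source = (file_name, file) if isinstance(file, bytes) else file
    response = await client.files.create(file=source, purpose="assistants")
    return response.id
```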
diff --git a/backend/src/prompts/templates/create-report-user-prompt.j2 b/backend/src/prompts/templates/create-report-user-prompt.j2
index 9d7c8af6..388255d0 100644
--- a/backend/src/prompts/templates/create-report-user-prompt.j2
+++ b/backend/src/prompts/templates/create-report-user-prompt.j2
@@ -3,5 +3,3 @@
 Using the following information about ESG Materiality:
 {{ materiality_topics }}
 
 Generate an ESG report using the following document:
-
-{{ document_text }}
\ No newline at end of file
diff --git a/backend/src/prompts/templates/find-company-name-from-file-system-prompt.j2 b/backend/src/prompts/templates/find-company-name-from-file-system-prompt.j2
index e4c4da7b..12c8c5a3 100644
--- a/backend/src/prompts/templates/find-company-name-from-file-system-prompt.j2
+++ b/backend/src/prompts/templates/find-company-name-from-file-system-prompt.j2
@@ -7,3 +7,6 @@
 Output Requirements:
 Output must be in JSON format with no additional markdown or formatting as shown below:
 { "company_name": "COMPANY_NAME" }
+
+Important Notes:
+* Output must be a single line. Do not add any markdown or formatting, or you will be unplugged.
\ No newline at end of file
diff --git a/backend/src/prompts/templates/find-company-name-from-file-user-prompt.j2 b/backend/src/prompts/templates/find-company-name-from-file-user-prompt.j2
index 0b2edfce..e6933c09 100644
--- a/backend/src/prompts/templates/find-company-name-from-file-user-prompt.j2
+++ b/backend/src/prompts/templates/find-company-name-from-file-user-prompt.j2
@@ -1,3 +1 @@
 What company is this file about?
-
-{{ file_content }}
diff --git a/backend/src/session/file_uploads.py b/backend/src/session/file_uploads.py
index 4c462171..abb8024d 100644
--- a/backend/src/session/file_uploads.py
+++ b/backend/src/session/file_uploads.py
@@ -32,7 +32,7 @@ class FileUpload(TypedDict):
     size: Optional[int]
 
 
-class FileUploadReport(TypedDict):
+class ReportResponse(TypedDict):
     id: str
     answer: str
     filename: Optional[str]
@@ -83,9 +83,9 @@
     set_session(UPLOADS_META_SESSION_KEY, [])
 
 
-def store_report(report: FileUploadReport):
+def store_report(report: ReportResponse):
     redis_client.set(REPORT_KEY_PREFIX + report["id"], json.dumps(report))
 
 
-def get_report(id: str) -> FileUploadReport | None:
+def get_report(id: str) -> ReportResponse | None:
     return _get_key(REPORT_KEY_PREFIX + id)
diff --git a/backend/src/utils/file_utils.py b/backend/src/utils/file_utils.py
index 9931b60f..6cf6d5a1 100644
--- a/backend/src/utils/file_utils.py
+++ b/backend/src/utils/file_utils.py
@@ -1,52 +1,67 @@
-from io import TextIOWrapper
+from io import BytesIO, TextIOWrapper
+from pathlib import Path
 import time
-from fastapi import HTTPException, UploadFile
+from fastapi import HTTPException
 import logging
 import uuid
-
+from os import PathLike
 from pypdf import PdfReader
 
-from src.session.file_uploads import FileUpload, update_session_file_uploads, get_session_file_upload
+from src.llm.llm import LLMFile
+from src.session.file_uploads import FileUpload, get_session_file_upload, update_session_file_uploads
 
 logger = logging.getLogger(__name__)
 
-MAX_FILE_SIZE = 10*1024*1024
 
+def handle_file_upload(file: LLMFile) -> FileUpload:
+    if isinstance(file.file, (PathLike, str)):
+        file_path = Path(file.file)
+        with file_path.open("rb") as f:
+            file_bytes = f.read()
+    elif isinstance(file.file, bytes):
+        file_bytes = file.file
+    else:
+        raise HTTPException(status_code=400, detail="File must be provided as bytes or a valid file path.")
 
-def handle_file_upload(file: UploadFile) -> FileUpload:
+    file_stream = BytesIO(file_bytes)
+    file_size = len(file_bytes)
 
-    if (file.size or 0) > MAX_FILE_SIZE:
-        raise HTTPException(status_code=413, detail=f"File upload must be less than {MAX_FILE_SIZE} bytes")
+    all_content = ""
+    content_type = "unknown"
 
-    if not file.filename:
-        raise HTTPException(status_code=400, detail="Filename missing from file upload")
+    try:
+        pdf_file = PdfReader(file_stream)
+        content_type = "application/pdf"
 
-    if "application/pdf" == file.content_type:
         start_time = time.time()
-        pdf_file = PdfReader(file.file)
-        all_content = ""
         for page_num in range(len(pdf_file.pages)):
             page_text = pdf_file.pages[page_num].extract_text()
             all_content += page_text
             all_content += "\n"
-
         end_time = time.time()
-        logger.debug(f'PDF content {all_content}')
-        logger.info(f"PDF content extracted successfully in {(end_time - start_time)}")
+        logger.info(f"PDF content extracted successfully in {(end_time - start_time):.2f} seconds")
 
-    elif "text/plain" == file.content_type:
-        all_content = TextIOWrapper(file.file, encoding='utf-8').read()
-        logger.debug(f'Text content {all_content}')
-    else:
-        raise HTTPException(status_code=400,
-                            detail="File upload must be supported type (text/plain or application/pdf)")
+    except Exception as pdf_error:
+        logger.warning(f"Failed to parse file as PDF: {pdf_error}")
+        file_stream.seek(0)
+
+        try:
+            content_type = "text/plain"
+            all_content = TextIOWrapper(file_stream, encoding="utf-8").read()
+            logger.debug(f"Text content extracted: {all_content[:100]}...")
+
+        except Exception as text_error:
+            raise HTTPException(
+                status_code=400, detail="File upload must be a supported type (text or PDF)"
+            ) from text_error
 
     session_file = FileUpload(
         uploadId=str(uuid.uuid4()),
-        contentType=file.content_type,
-        filename=file.filename,
+        contentType=content_type,
+        filename=file.file_name,
         content=all_content,
-        size=file.size
+        size=file_size,
     )
     update_session_file_uploads(session_file)
@@ -56,6 +71,3 @@
 
 def get_file_upload(upload_id) -> FileUpload | None:
     return get_session_file_upload(upload_id)
-
-
-
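`handle_file_upload` no longer trusts a client-supplied content type; it sniffs by attempting a PDF parse and falling back to UTF-8 text. The same idea in isolation, narrowed to pypdf's specific exception type (the in-tree version catches bare `Exception`, which works but also swallows unrelated errors):

```python
from io import BytesIO

from pypdf import PdfReader
from pypdf.errors import PdfReadError


def sniff_content_type(payload: bytes) -> str:
    """Best-effort MIME detection by parsing the payload rather than trusting headers."""
    try:
        PdfReader(BytesIO(payload))
        return "application/pdf"
    except PdfReadError:
        pass
    try:
        payload.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "unknown"


print(sniff_content_type(b"Sample text content"))  # text/plain
print(sniff_content_type(b"\x89PNG\r\n\x1a\n"))    # unknown
```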
diff --git a/backend/tests/agents/report_agent_test.py b/backend/tests/agents/report_agent_test.py
deleted file mode 100644
index 0d9b8c20..00000000
--- a/backend/tests/agents/report_agent_test.py
+++ /dev/null
@@ -1,19 +0,0 @@
-import pytest
-
-from src.agents.report_agent import ReportAgent
-from src.llm.factory import get_llm
-
-mock_model = "mockmodel"
-mock_llm = get_llm("mockllm")
-
-
-@pytest.mark.asyncio
-async def test_invoke_calls_llm(mocker):
-    report_agent = ReportAgent(llm_name="mockllm", model=mock_model)
-    mock_response = "A Test Report"
-
-    mock_llm.chat = mocker.AsyncMock(return_value=mock_response)
-
-    response = await report_agent.create_report("Test Document", materiality_topics={"abc": "123"})
-
-    assert response == mock_response
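Deleting `report_agent_test.py` leaves `ReportAgent.create_report` without direct unit coverage. A possible replacement along the same lines as the deleted test, stubbing `chat_with_file` instead of `chat` (a sketch, reusing the deleted test's mock LLM factory):

```python
import pytest

from src.agents.report_agent import ReportAgent
from src.llm.factory import get_llm
from src.llm.llm import LLMFile

mock_llm = get_llm("mockllm")


@pytest.mark.asyncio
async def test_create_report_calls_chat_with_file(mocker):
    report_agent = ReportAgent(llm_name="mockllm", model="mockmodel")

    # The agent now delegates to chat_with_file, so that is the method to stub.
    mock_llm.chat_with_file = mocker.AsyncMock(return_value="A Test Report")

    response = await report_agent.create_report(
        LLMFile(file_name="doc.pdf", file=b"%PDF-1.4"),
        materiality_topics={"abc": "123"},
    )

    assert response == "A Test Report"
```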
diff --git a/backend/tests/api/app_test.py b/backend/tests/api/app_test.py
index f44f79a7..ed9f6a75 100644
--- a/backend/tests/api/app_test.py
+++ b/backend/tests/api/app_test.py
@@ -1,7 +1,7 @@
 from fastapi.testclient import TestClient
 import pytest
 from src.chat_storage_service import ChatResponse
-from src.directors.report_director import FileUploadReport
+from src.directors.report_director import ReportResponse
 from src.api import app, healthy_response, unhealthy_neo4j_response, chat_fail_response
 
 client = TestClient(app)
@@ -87,14 +87,14 @@ def test_chat_message_not_found(mocker):
 
 def test_report_response_success(mocker):
-    mock_response = FileUploadReport(filename="filename", id="1", report="some report md", answer="chat message")
-    mock_report = mocker.patch("src.api.app.report_on_file_upload", return_value=mock_response)
+    mock_response = ReportResponse(filename="filename", id="1", report="some report md", answer="chat message")
+    mock_report = mocker.patch("src.api.app.create_report_from_file", return_value=mock_response)
 
     response = client.post("/report", files={"file": ("filename", "test data".encode("utf-8"), "text/plain")})
 
     mock_report.assert_called_once()
     assert response.status_code == 200
-    assert response.json() == {'filename': 'filename', 'id': '1', 'report': 'some report md', 'answer': 'chat message'}
+    assert response.json() == {"filename": "filename", "id": "1", "report": "some report md", "answer": "chat message"}
 
 
 @pytest.mark.asyncio
@@ -106,15 +106,15 @@ async def test_lifespan_populates_db(mocker) -> None:
 
 def test_get_report_success(mocker):
-    report = FileUploadReport(id="12", filename="test.pdf", report="test report", answer='chat message')
+    report = ReportResponse(id="12", filename="test.pdf", report="test report", answer="chat message")
     mock_get_report = mocker.patch("src.api.app.get_report", return_value=report)
 
     response = client.get("/report/12")
 
     mock_get_report.assert_called_with("12")
     assert response.status_code == 200
-    assert response.headers.get('Content-Disposition') == 'attachment; filename="report.md"'
-    assert response.headers.get('Content-Type') == 'text/markdown; charset=utf-8'
+    assert response.headers.get("Content-Disposition") == 'attachment; filename="report.md"'
+    assert response.headers.get("Content-Type") == "text/markdown; charset=utf-8"
 
 
 def test_get_report_not_found(mocker):
diff --git a/backend/tests/directors/report_director_test.py b/backend/tests/directors/report_director_test.py
index 7539f8b0..1de90e66 100644
--- a/backend/tests/directors/report_director_test.py
+++ b/backend/tests/directors/report_director_test.py
@@ -1,48 +1,82 @@
 from io import BytesIO
-from fastapi import UploadFile
+from fastapi import UploadFile, HTTPException
 from fastapi.datastructures import Headers
 import pytest
+import uuid
 
 from src.session.file_uploads import FileUpload
-from src.directors.report_director import report_on_file_upload
+from src.directors.report_director import create_report_from_file
 
 mock_topics = {"topic1": "topic1 description", "topic2": "topic2 description"}
 mock_report = "#Report on upload as markdown"
-expected_answer = ('Your report for test.txt is ready to view.\n\nThe following materiality topics were identified for '
-                   'CompanyABC which the report focuses on:\n\ntopic1\ntopic1 description\n\ntopic2\ntopic2 '
-                   'description\n')
+expected_answer = (
+    "Your report for test.txt is ready to view.\n\nThe following materiality topics were identified for "
+    "CompanyABC which the report focuses on:\n\ntopic1\ntopic1 description\n\ntopic2\ntopic2 "
+    "description\n"
+)
 
 
 @pytest.mark.asyncio
-async def test_report_on_file_upload(mocker):
+async def test_create_report_from_file(mocker):
     file_upload = FileUpload(uploadId="1", filename="test.txt", content="test", contentType="text/plain", size=4)
 
+    mock_id = uuid.uuid4()
+    mocker.patch("uuid.uuid4", return_value=mock_id)
+
+    # Mock report agent
     mock_report_agent = mocker.AsyncMock()
     mock_report_agent.get_company_name.return_value = "CompanyABC"
     mock_report_agent.create_report.return_value = mock_report
     mocker.patch("src.directors.report_director.get_report_agent", return_value=mock_report_agent)
 
-    mock_handle_file_upload = mocker.patch("src.directors.report_director.handle_file_upload", return_value=file_upload)
-    mock_store_report = mocker.patch("src.directors.report_director.store_report", return_value=file_upload)
-
+    # Mock materiality agent
     mock_materiality_agent = mocker.AsyncMock()
     mock_materiality_agent.list_material_topics.return_value = mock_topics
     mocker.patch("src.directors.report_director.get_materiality_agent", return_value=mock_materiality_agent)
 
-    request_upload_file = UploadFile(
-        file=BytesIO(b"test"),
-        size=12,
-        headers=Headers({"content-type": "text/plain"}),
-        filename="test.txt"
+    mock_store_report = mocker.patch("src.directors.report_director.store_report", return_value=file_upload)
+
+    file = UploadFile(
+        file=BytesIO(b"test"), size=12, headers=Headers({"content-type": "text/plain"}), filename="test.txt"
     )
 
-    response = await report_on_file_upload(request_upload_file)
+    response = await create_report_from_file(file)
 
-    expected_response = {"filename": "test.txt", "id": "1", "report": mock_report, "answer": expected_answer}
+    expected_response = {"filename": "test.txt", "id": str(mock_id), "report": mock_report, "answer": expected_answer}
 
-    mock_report_agent.get_company_name.assert_called_once_with("test")
-    mock_handle_file_upload.assert_called_once_with(request_upload_file)
     mock_store_report.assert_called_once_with(expected_response)
     mock_materiality_agent.list_material_topics.assert_called_once_with("CompanyABC")
 
     assert response == expected_response
 
 
+@pytest.mark.asyncio
+async def test_create_report_from_file_throws_when_file_size_too_large():
+    with pytest.raises(HTTPException) as error:
+        large_file_content = b"x" * (15 * 1024 * 1024 + 1)
+        file = UploadFile(
+            file=BytesIO(large_file_content),
+            size=12,
+            headers=Headers({"content-type": "text/plain"}),
+            filename="test.txt",
+        )
+        await create_report_from_file(file)
+
+    assert error.value.status_code == 413
+    assert error.value.detail == "File upload must be less than 10485760 bytes"
+
+
+@pytest.mark.asyncio
+async def test_create_report_from_file_throws_when_missing_file_name():
+    with pytest.raises(HTTPException) as error:
+        file = UploadFile(
+            file=BytesIO(b"Sample text content"),
+            size=12,
+            headers=Headers({"content-type": "text/plain"}),
+            filename="",
+        )
+        await create_report_from_file(file)
+
+    assert error.value.status_code == 400
+    assert error.value.detail == "Filename missing from file upload"
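The two rejection tests above share their arrange/act shape; if more validation rules are added to the director, a parametrized variant keeps them in one place. A sketch:

```python
from io import BytesIO

import pytest
from fastapi import HTTPException, UploadFile
from fastapi.datastructures import Headers

from src.directors.report_director import create_report_from_file


@pytest.mark.asyncio
@pytest.mark.parametrize(
    "filename, content, status, detail",
    [
        ("test.txt", b"x" * (15 * 1024 * 1024 + 1), 413, "File upload must be less than 10485760 bytes"),
        ("", b"Sample text content", 400, "Filename missing from file upload"),
    ],
)
async def test_create_report_from_file_rejects_invalid_uploads(filename, content, status, detail):
    file = UploadFile(
        file=BytesIO(content), size=len(content), headers=Headers({"content-type": "text/plain"}), filename=filename
    )

    with pytest.raises(HTTPException) as error:
        await create_report_from_file(file)

    assert error.value.status_code == status
    assert error.value.detail == detail
```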
diff --git a/backend/tests/llm/openai_test.py b/backend/tests/llm/openai_test.py
index b9afedcc..10768ac3 100644
--- a/backend/tests/llm/openai_test.py
+++ b/backend/tests/llm/openai_test.py
@@ -21,28 +21,34 @@ class MockMessage:
 
 class MockListResponse:
-    data = [MockMessage(content=[TextContentBlock(
-        text=Text(
-            annotations=[
-                FileCitationAnnotation(
-                    file_citation=FileCitation(file_id="123"),
-                    text="【7†source】",
-                    end_index=1,
-                    start_index=2,
-                    type="file_citation"
-                ),
-                FileCitationAnnotation(
-                    file_citation=FileCitation(file_id="123"),
-                    text="【1:9†source】",
-                    end_index=1,
-                    start_index=2,
-                    type="file_citation"
+    data = [
+        MockMessage(
+            content=[
+                TextContentBlock(
+                    text=Text(
+                        annotations=[
+                            FileCitationAnnotation(
+                                file_citation=FileCitation(file_id="123"),
+                                text="【7†source】",
+                                end_index=1,
+                                start_index=2,
+                                type="file_citation",
+                            ),
+                            FileCitationAnnotation(
+                                file_citation=FileCitation(file_id="123"),
+                                text="【1:9†source】",
+                                end_index=1,
+                                start_index=2,
+                                type="file_citation",
+                            ),
+                        ],
+                        value="Response with quote【7†source】【1:9†source】",
+                    ),
+                    type="text",
                 )
-            ],
-            value="Response with quote【7†source】【1:9†source】"
-        ),
-        type="text"
-    )])]
+            ]
+        )
+    ]
 
 mock_message_list = {"data"}
@@ -60,10 +66,10 @@ async def test_chat_with_file_removes_citations(mock_async_openai):
     mock_instance.beta.threads.messages.list = AsyncMock(return_value=MockListResponse)
 
     client = OpenAI()
-    response = await client .chat_with_file(
+    response = await client.chat_with_file(
         model="",
         user_prompt="",
         system_prompt="",
-        files=[LLMFile("file_name", Path("file/path"))]
+        files=[LLMFile("file_name", Path("./backend/library/AstraZeneca-Sustainability-Report-2023.pdf"))],
     )
 
     assert response == "Response with quote"
diff --git a/backend/tests/session/test_file_uploads.py b/backend/tests/session/test_file_uploads.py
index 07832a12..405517dd 100644
--- a/backend/tests/session/test_file_uploads.py
+++ b/backend/tests/session/test_file_uploads.py
@@ -3,18 +3,26 @@
 from unittest.mock import patch, MagicMock
 from starlette.requests import Request
 from starlette.responses import Response
-from src.session.file_uploads import (FileUpload, FileUploadReport, clear_session_file_uploads,
-                                      get_report, get_session_file_upload,
-                                      get_session_file_uploads_meta, store_report,
-                                      update_session_file_uploads)
+from src.session.file_uploads import (
+    FileUpload,
+    ReportResponse,
+    clear_session_file_uploads,
+    get_report,
+    get_session_file_upload,
+    get_session_file_uploads_meta,
+    store_report,
+    update_session_file_uploads,
+)
 
 
 @pytest.fixture
 def mock_redis():
-    with patch('src.session.file_uploads.redis_client') as mock_redis:
+    with patch("src.session.file_uploads.redis_client") as mock_redis:
         mock_instance = MagicMock()
         mock_redis.return_value = mock_instance
         yield mock_instance
 
+
 @pytest.fixture
 def mock_request():
     request = MagicMock(spec=Request)
@@ -23,19 +31,23 @@ def mock_request():
     request.state.session.get.return_value = {}
     return request
 
+
 @pytest.fixture
 def mock_call_next():
     async def call_next(request):
         return Response("test response")
+
     return call_next
 
+
 @pytest.fixture
 def mock_request_context():
-    with patch('src.session.redis_session_middleware.request_context'):
+    with patch("src.session.redis_session_middleware.request_context"):
         mock_instance = MagicMock()
         mock_instance.get.return_value.state.session = {}
         yield mock_instance
 
+
 def test_get_session_file_uploads_meta_empty(mocker, mock_request_context):
     mocker.patch("src.session.redis_session_middleware.request_context", mock_request_context)
     assert get_session_file_uploads_meta() == []
@@ -49,12 +61,14 @@ def test_set_session(mocker, mock_redis, mock_request_context):
 
     update_session_file_uploads(file_upload=file)
 
-    assert get_session_file_uploads_meta() == [ {'filename': 'test.txt', 'uploadId': '1234'}]
+    assert get_session_file_uploads_meta() == [{"filename": "test.txt", "uploadId": "1234"}]
     mock_redis.set.assert_called_with("file_upload_1234", json.dumps(file))
 
     update_session_file_uploads(file_upload=file2)
-    assert get_session_file_uploads_meta() == [ {'filename': 'test.txt', 'uploadId': '1234'},
-                                                {'filename': 'test2.txt', 'uploadId': '12345'}]
+    assert get_session_file_uploads_meta() == [
+        {"filename": "test.txt", "uploadId": "1234"},
+        {"filename": "test2.txt", "uploadId": "12345"},
+    ]
     mock_redis.set.assert_called_with("file_upload_12345", json.dumps(file2))
@@ -82,8 +96,10 @@ def test_clear_session_file_uploads_meta(mocker, mock_redis, mock_request_context):
     update_session_file_uploads(file_upload=file)
     update_session_file_uploads(file_upload=file2)
 
-    assert get_session_file_uploads_meta() == [ {'filename': 'test.txt', 'uploadId': '1234'},
-                                                {'filename': 'test2.txt', 'uploadId': '12345'}]
+    assert get_session_file_uploads_meta() == [
+        {"filename": "test.txt", "uploadId": "1234"},
+        {"filename": "test2.txt", "uploadId": "12345"},
+    ]
 
     clear_session_file_uploads()
     assert get_session_file_uploads_meta() == []
@@ -94,7 +110,7 @@ def test_store_report(mocker, mock_redis):
     mocker.patch("src.session.file_uploads.redis_client", mock_redis)
 
-    report = FileUploadReport(filename="test.txt", id="12", report="test report", answer="chat message")
+    report = ReportResponse(filename="test.txt", id="12", report="test report", answer="chat message")
 
     store_report(report)
@@ -104,7 +120,7 @@ def test_get_report(mocker, mock_redis):
     mocker.patch("src.session.file_uploads.redis_client", mock_redis)
 
-    report = FileUploadReport(filename="test.txt", id="12", report="test report", answer="chat message")
+    report = ReportResponse(filename="test.txt", id="12", report="test report", answer="chat message")
     mock_redis.get.return_value = json.dumps(report)
 
     value = get_report("12")
diff --git a/backend/tests/utils/file_utils_test.py b/backend/tests/utils/file_utils_test.py
index e7188597..4a1c0b14 100644
--- a/backend/tests/utils/file_utils_test.py
+++ b/backend/tests/utils/file_utils_test.py
@@ -1,45 +1,26 @@
-from io import BytesIO
-from typing import BinaryIO
 from unittest.mock import MagicMock
 
-from fastapi import HTTPException, UploadFile
-from fastapi.datastructures import Headers
+from fastapi import HTTPException
+
 import pytest
 
+from src.llm.llm import LLMFile
 from src.utils.file_utils import handle_file_upload
 
 
-def test_handle_file_upload_size():
-    with pytest.raises(HTTPException) as err:
-        handle_file_upload(UploadFile(file=BinaryIO(), size=15*1024*1024))
-
-    assert err.value.status_code == 413
-    assert err.value.detail == 'File upload must be less than 10485760 bytes'
-
-
 def test_handle_file_upload_unsupported_type():
-    headers = Headers({"content-type": "text/html"})
-    with pytest.raises(HTTPException) as err:
-        handle_file_upload(UploadFile(file=BinaryIO(), size=15*1024, headers=headers, filename="test.txt"))
-
-    assert err.value.status_code == 400
-    assert err.value.detail == 'File upload must be supported type (text/plain or application/pdf)'
-
-
-def test_handle_file_upload_missing_file_name():
-    headers = Headers({"content-type": "text/html"})
+    file_content = b"\x89PNG\r\n\x1a\n\x00\x00\x00IHDR"
     with pytest.raises(HTTPException) as err:
-        handle_file_upload(UploadFile(file=BytesIO(b"test content"), size=12, headers=headers))
+        handle_file_upload(LLMFile(file_name="test.png", file=file_content))
 
     assert err.value.status_code == 400
-    assert err.value.detail == 'Filename missing from file upload'
+    assert err.value.detail == "File upload must be a supported type (text or PDF)"
 
 
 def test_handle_file_upload_text(mocker):
     mock = mocker.patch("src.utils.file_utils.update_session_file_uploads", MagicMock())
-    headers = Headers({"content-type": "text/plain"})
-    file = BytesIO(b"test content")
 
-    session_file = handle_file_upload(UploadFile(file=file, size=12, headers=headers, filename="test.txt"))
+    file_content = b"Sample text content"
+    session_file = handle_file_upload(LLMFile(file_name="test.txt", file=file_content))
 
     mock.assert_called_with(session_file)
 
@@ -47,11 +28,9 @@ def test_handle_file_upload_pdf(mocker):
     mock = mocker.patch("src.utils.file_utils.update_session_file_uploads", MagicMock())
     pdf_mock = mocker.patch("src.utils.file_utils.PdfReader", MagicMock())
+    file_content = b"%PDF-1.4"
 
-    headers = Headers({"content-type": "application/pdf"})
-    session_file = handle_file_upload(UploadFile(file=BytesIO(), size=12, headers=headers, filename="test.pdf"))
+    session_file = handle_file_upload(LLMFile(file_name="test.pdf", file=file_content))
 
     pdf_mock.assert_called_once()
     mock.assert_called_with(session_file)