small improvement to the prompt to model dates as relationships and remove the example output, which was restricting the prompt's flexibility
ScottLogic · Nov 12, 2024 · 3caead4
1 parent e28e3eb

Showing 2 changed files with 11 additions and 10 deletions.
diff --git a/backend/src/prompts/templates/generate-knowledge-graph-model.j2 b/backend/src/prompts/templates/generate-knowledge-graph-model.j2
@@ -10,7 +10,7 @@ Here, all_data is a list of lists, where each inner list corresponds to a row of
 Analyze the structure of this data to produce an intuitive Neo4j model, focusing on identifying core entities, attributes, and relationships.
 
 1. Data Structure:
-* Report: Each row in the dataset represents an ESG report about a company. This can be represented as a node.
+* Report: Each row in the dataset represents an ESG report about a company. If there is a date or year in the row, then the report should have a "reported in" relationship to the company which contains the date / year.
 * Identify Key Entities: Based on the data headers, determine the main entity types (e.g., Company, Fund, or other core entities in the dataset) and map out each unique entity's attributes. Favour full names over abbreviations.
 * Identify Common Categories: Based on the data, determine common categories that appear in the data. Look for recurring values that appear in the same csv column and map these out as nodes with relationships to the main entity types.
 * Determine Relationships: Define the relationships between these entities, such as associating entities with reports, linking entities to specific time periods, or establishing hierarchical or categorical groupings within the data.
@@ -20,18 +20,20 @@ Analyze the structure of this data to produce an intuitive Neo4j model, focusing
 2. Output Model Structure:
 Describe Entities:
 * Example: Company: Represents each company with attributes like name and identifier.
-* Example: Year: Represents each distinct time period, based on date-related fields.
 * Example: Environment, Social, Governance: Each node represents one ESG category, containing only fields relevant to that specific category (e.g., CO2 emissions for Environment, injury rate for Social, and shareholder rights for Governance).
-* Describe Relationships:
+
+Describe Relationships:
 * Example: (Entity1)-[:HAS_ENTITY2]->(Entity2): Links between main entities, such as companies, years, or categories.
-* Example: (MainEntity)-[:HAS_ENVIRONMENT]->(Environment): Connects main entities to the Environment category node for environment-specific metrics.
-* Example: (MainEntity)-[:HAS_SOCIAL]->(Social): Links main entities to the Social category node for social-specific metrics.
-* Example: (MainEntity)-[:HAS_GOVERNANCE]->(Governance): Links main entities to the Governance category node for governance-specific metrics.
+* Example: (MainEntity)-[:REPORTED_ON {date: "YYYY-MM-DD"}]->(Report) or (MainEntity)-[:REPORTED_ON {year: "YYYY"}]->(Report): Links each report to its main entity, using the REPORTED_ON relationship with an attribute for the report date.
 * Example: (MainEntity)-[:BELONGS_TO]->(CategoryNode): Links main entities to a categorical node for any recurring category (such as industries or sectors).
+* Example: (Report)-[:HAS_ENVIRONMENT]->(Environment): Connects main entities to the Environment category node for environment-specific metrics.
+* Example: (Report)-[:HAS_SOCIAL]->(Social): Links main entities to the Social category node for social-specific metrics.
+* Example: (Report)-[:HAS_GOVERNANCE]->(Governance): Links main entities to the Governance category node for governance-specific metrics.
+
 Please provide the inferred model structure in the "model" field of the JSON output, specifying entities, attributes, and relationships. The output must explicitly link each header in the input data to the corresponding part of the model.
 
 ## Expected Output Format:
-{ model: "The model identifies 'Company' as the primary entity with unique identifiers for each company. Each data row is represented by a 'Report' node, with 'Year' nodes representing temporal relationships, and 'Industry' nodes for each unique industry type. ESG metrics are grouped into separate 'Environment', 'Social', and 'Governance' nodes. Relationships include: (Report)-[:IS_A]->(Company), (Report)-[:IN_YEAR]->(Year), (Company)-[:IN_INDUSTRY]->(Industry), (Report)-[:HAS_ENVIRONMENT]->(Environment), (Report)-[:HAS_SOCIAL]->(Social), and (Report)-[:HAS_GOVERNANCE]->(Governance). \n\nThe attributes for each entity are as follows:\n\n* Company: Identifier (RIC), Company Name\n\n* Year: Date\n\n* Industry: Name\n\n* Report: ESG_score, BVPS, Market_cap, Shares, Net_income, RETURN_ON_ASSET, QUICK_RATIO, ASSET_GROWTH, FNCL_LVRG, PE_RATIO\n\n* Environment: Env_score, Scope_1, Scope_2, CO2_emissions, Energy_use, Water_use, Water_recycle, Toxic_chem_red\n\n* Social: Social_score, Injury_rate, Women_Employees, Human_Rights, Strikes, Turnover_empl, Board_Size, Bribery, Recycling_Initiatives\n\n* Governance: Gov_score, Shareholder_Rights, Board_gen_div\n\nRelationships:\n\n* (Report)-[:IS_A]->(Company): Links each report to the relevant company.\n\n* (Report)-[:IN_YEAR]->(Year): Links each report to a specific year.\n\n* (Company)-[:IN_INDUSTRY]->(Industry): Links each company to its industry node.\n\n* (Report)-[:HAS_ENVIRONMENT]->(Environment): Links each report to an Environment node containing environmental metrics.\n\n* (Report)-[:HAS_SOCIAL]->(Social): Links each report to a Social node containing social metrics.\n\n* (Report)-[:HAS_GOVERNANCE]->(Governance): Links each report to a Governance node containing governance metrics.\n\nEach header in the input data is assigned to one of these nodes based on the model’s grouping structure." }
+{ model: "YOUR MODEL HERE" }
 
 Important Notes:
 * Ensure the model clearly identifies which header field relates to which part of the Neo4j graph.
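
Review note: the new REPORTED_ON example carries the date as a relationship property, with either a `date` or a `year` key depending on what the row contains. A minimal Python sketch of that choice follows. The helper names `reported_on_properties` and `reported_on_pattern` are hypothetical and not part of this repository, and the two accepted formats (YYYY-MM-DD and YYYY) are an assumption taken from the example strings in the prompt.

```python
import re


def reported_on_properties(value: str) -> dict:
    """Return the relationship properties for a raw date cell (hypothetical helper)."""
    value = value.strip()
    # Full ISO-style date, as in the prompt's {date: "YYYY-MM-DD"} example.
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return {"date": value}
    # Bare year, as in the prompt's {year: "YYYY"} example.
    if re.fullmatch(r"\d{4}", value):
        return {"year": value}
    # Anything else: no date property is attached.
    return {}


def reported_on_pattern(value: str) -> str:
    """Render the Cypher-style pattern used in the prompt for one date cell."""
    props = reported_on_properties(value)
    if not props:
        return "(MainEntity)-[:REPORTED_ON]->(Report)"
    key, val = next(iter(props.items()))
    return f'(MainEntity)-[:REPORTED_ON {{{key}: "{val}"}}]->(Report)'


if __name__ == "__main__":
    print(reported_on_pattern("2024-11-12"))
    print(reported_on_pattern("2023"))
```

Keeping the date on the relationship means a company with several reports gets one REPORTED_ON edge per report, with no shared Year node to deduplicate.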

diff --git a/backend/src/prompts/templates/generate-knowledge-graph-query.j2 b/backend/src/prompts/templates/generate-knowledge-graph-query.j2
@@ -21,12 +21,11 @@ Generate a Cypher query based on the provided model structure and data.
 * Use data.all_data[0] as headers to identify the fields.
 * Process each row of data and map the header fields to their corresponding parts of the model based on the model input.
 * Reminder: Avoid duplicating fields across multiple nodes. For example, if the same field appears in multiple places (e.g., company name, industry), create and reference a single node for that field rather than creating it multiple times.
-* Extract parts of fields, such as the year from a full date, to avoid ambiguity when creating nodes like Year (use substring(row[2], 0, 4) to extract the year from a date string).
-* Ensure that each node, such as Year or Industry, is only created once and reused in relationships to avoid redundant nodes.
+* Ensure that each node is only created once and reused in relationships to avoid redundant nodes.
 
 2. Generate Cypher Query:
 * Based on the model and data input, create a Cypher query to:
-* For each primary entity (e.g., Company, Fund, Industry, Year...), use MERGE to ensure only one instance is created, even if some rows contain null values for other attributes.
+* For each primary entity (e.g., Company, Fund, Industry), use MERGE to ensure only one instance is created, even if some rows contain null values for other attributes.
 * For each Environment, Social, and Governance category, use CREATE to ensure each report has its unique instance.
 * Use COALESCE to handle missing values and provide default values (e.g., COALESCE(row[10], 'Unknown') for industry).
 * Establish relationships as defined by the model, using MERGE for any reusable nodes but CREATE for nodes specific to each report.
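
Review note: the MERGE / CREATE / COALESCE rules above can be illustrated with a small Python sketch that assembles the kind of Cypher the prompt asks for. This is only an assumption about what a generated query might look like, not the output of the template. The function `build_import_query`, the `$rows` parameter (assumed to hold the data rows after the header row), the column indices, and the property names are all hypothetical.

```python
def build_import_query(company_col: int, industry_col: int,
                       year_col: int, co2_col: int) -> str:
    """Assemble an illustrative Cypher import query (hypothetical helper)."""
    return "\n".join([
        # One iteration per data row; $rows is an assumed query parameter.
        "UNWIND $rows AS row",
        # Reusable nodes: MERGE so each company / industry exists only once.
        # COALESCE supplies a default, since MERGE rejects null property values.
        f"MERGE (c:Company {{name: COALESCE(row[{company_col}], 'Unknown')}})",
        f"MERGE (i:Industry {{name: COALESCE(row[{industry_col}], 'Unknown')}})",
        "MERGE (c)-[:BELONGS_TO]->(i)",
        # Per-report nodes: CREATE so every row gets its own instance.
        "CREATE (r:Report)",
        # The year lives on the relationship, not on a separate Year node.
        f"CREATE (c)-[:REPORTED_ON {{year: row[{year_col}]}}]->(r)",
        f"CREATE (e:Environment {{co2_emissions: row[{co2_col}]}})",
        "CREATE (r)-[:HAS_ENVIRONMENT]->(e)",
    ])


if __name__ == "__main__":
    print(build_import_query(company_col=0, industry_col=10, year_col=2, co2_col=5))
```

With the Year node gone from the model, the only merged nodes left are genuinely shared entities such as Company and Industry, which matches the removal of Year from the MERGE examples in this diff.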