From 9eef1c283a56994c54dd4f3fc959f40032de1c20 Mon Sep 17 00:00:00 2001
From: Bharat Pasupula <123897612+bhapas@users.noreply.github.com>
Date: Tue, 17 Sep 2024 16:34:10 +0200
Subject: [PATCH] [Automatic Import] Improve KV and log type detection prompts
 (#193136)

## Summary

This PR reworks the `log type detection` and `structured log` prompts
to produce more reliable results.

Improvements include:

- Moved the steps out of the guidelines section and listed them as
numbered steps.
- Clarified the wording used to identify the `message body`.
- Expanded the list of possible header parts in structured log parsing.
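
To illustrate the header / `message body` split that the structured log prompt asks for, here is a minimal sketch. It is not code from this PR: the sample line, the regex, and the helper names (`splitHeaderAndBody`, `parseKeyValues`) are hypothetical, and the grok pattern in the comment is only an assumed rough equivalent.

```typescript
// Hypothetical illustration only; none of these names exist in the plugin.
interface SplitLog {
  header: string;
  message: string;
}

// Assumed RFC3164-style header: <priority>, timestamp, hostname.
// Everything after the header is captured as the message body.
// A rough grok equivalent (assumption) would be:
//   <%{NONNEGINT:priority}>%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{GREEDYDATA:message}
const HEADER_REGEX =
  /^(<\d{1,3}>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s\S+)\s(.*)$/;

function splitHeaderAndBody(line: string): SplitLog | null {
  const match = HEADER_REGEX.exec(line);
  if (match === null) {
    return null; // the sample does not follow the assumed header layout
  }
  return { header: match[1], message: match[2] };
}

// Parses a flat list of key=value pairs, allowing double-quoted values.
function parseKeyValues(body: string): Record<string, string> {
  const pairs: Record<string, string> = {};
  const kvRegex = /(\w+)=("[^"]*"|\S+)/g;
  let m: RegExpExecArray | null;
  while ((m = kvRegex.exec(body)) !== null) {
    pairs[m[1]] = m[2].replace(/^"|"$/g, '');
  }
  return pairs;
}

const sample =
  '<34>Oct 11 22:14:15 host01 user=alice action="login ok" src=10.0.0.1';
const split = splitHeaderAndBody(sample);
if (split !== null) {
  console.log(split.header);
  console.log(parseKeyValues(split.message));
}
```

Only the header is matched field by field; the structured body is left as one capture so that a later key-value step can parse it, which mirrors the intent of keeping only the message body under GREEDYDATA.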

---------

Co-authored-by: Hanna Tamoudi <hanna.tamoudi@elastic.co>
(cherry picked from commit d1f068de9f57232e9a7c4a72afadf9afa093d055)
---
 .../server/graphs/kv/prompts.ts               | 23 ++++++++++++-------
 .../graphs/log_type_detection/prompts.ts      | 20 +++++++++-------
 2 files changed, 27 insertions(+), 16 deletions(-)
diff --git a/x-pack/plugins/integration_assistant/server/graphs/kv/prompts.ts b/x-pack/plugins/integration_assistant/server/graphs/kv/prompts.ts
index 0fd7f1262c251..2ab1073a4ba8b 100644
--- a/x-pack/plugins/integration_assistant/server/graphs/kv/prompts.ts
+++ b/x-pack/plugins/integration_assistant/server/graphs/kv/prompts.ts
@@ -68,15 +68,19 @@ export const KV_HEADER_PROMPT = ChatPromptTemplate.fromMessages([
   ],
   [
     'human',
-    `Looking at the multiple syslog samples provided in the context, our goal is to identify which RFC they belog to. Then create a regex pattern that can separate the header and the structured body.
+    `Looking at the multiple syslog samples provided in the context, your task is to separate the "header" and the "message body" from this log. Our goal is to identify which RFC they belong to. Then create a regex pattern that can separate the header and the structured body.
 You then have to create a grok pattern using the regex pattern.
+You are given a log entry in a structured format.
+
+Follow these steps to identify the header pattern:
+1. Identify if the log samples fall under RFC5424 or RFC3164. If not, return 'Custom Format'.
+2. The log samples contain the header and structured body. The header may contain any or all of priority, timestamp, loglevel, hostname, ipAddress, messageId, or any free-form text or other non-key-value information.
+3. Make sure the regex and grok pattern match all the header information. Only the structured message body should be under GREEDYDATA in the grok pattern.
 
  You ALWAYS follow these guidelines when writing your response:
  <guidelines>
- - If you cannot match all the logs to the same RFC, return 'Custom Format' for RFC and provide the regex and grok patterns accordingly.
- - If the message part contains any unstructured data , make sure to add this in regex pattern and grok pattern.
  - Do not parse the message part in the regex. Just the header part should be in regex nad grok_pattern.
- - Make sure to map the remaining message part to \'message\' in grok pattern.
+ - Make sure to map the remaining message body to \'message\' in grok pattern.
  - Do not respond with anything except the processor as a JSON object enclosed with 3 backticks (\`), see example response above. Use strict JSON response format.
  </guidelines>
 
@@ -110,12 +114,15 @@ export const KV_HEADER_ERROR_PROMPT = ChatPromptTemplate.fromMessages([
 {errors}
 </errors>
 
- You ALWAYS follow these guidelines when writing your response:
+Follow these steps to fix the errors in the header pattern:
+1. Identify any mismatches, incorrect syntax, or logical errors in the pattern.
+2. The log samples contain the header and structured body. The header may contain any or all of priority, timestamp, loglevel, hostname, ipAddress, messageId, or any free-form text or other non-key-value information.
+3. The message body may start with a description, followed by structured key-value pairs.
+4. Make sure the regex and grok pattern match all the header information. Only the structured message body should be under GREEDYDATA in the grok pattern.
+You ALWAYS follow these guidelines when writing your response:
  <guidelines>
- - Identify any mismatches, incorrect syntax, or logical errors in the pattern.
- - If the message part contains any unstructured data , make sure to add this in grok pattern.
  - Do not parse the message part in the regex. Just the header part should be in regex nad grok_pattern.
- - Make sure to map the remaining message part to \'message\' in grok pattern.
+ - Make sure to map the remaining message body to \'message\' in grok pattern.
  - Do not respond with anything except the processor as a JSON object enclosed with 3 backticks (\`), see example response above. Use strict JSON response format.
  </guidelines>
 
diff --git a/x-pack/plugins/integration_assistant/server/graphs/log_type_detection/prompts.ts b/x-pack/plugins/integration_assistant/server/graphs/log_type_detection/prompts.ts
index 2ed547de00132..74ba8f719f875 100644
--- a/x-pack/plugins/integration_assistant/server/graphs/log_type_detection/prompts.ts
+++ b/x-pack/plugins/integration_assistant/server/graphs/log_type_detection/prompts.ts
@@ -20,15 +20,19 @@ Here is some context for you to reference for your task, read it carefully as yo
   [
     'human',
     `Looking at the log samples , our goal is to identify the syslog type based on the guidelines below.
+Follow these steps to identify the log format type:
+1. Go through each log sample and identify the log format type.
+2. If the samples begin with any or all of priority, timestamp, loglevel, hostname, ipAddress, messageId then set "header: true".
+3. If the samples have a syslog header then set "header: true", else set "header: false". If you are unable to determine whether a syslog header is present then set "header: false".
+4. If the log samples have a structured message body with key-value pairs then classify it as "log_type: structured". Look for a flat list of key-value pairs, often separated by spaces, commas, or other delimiters.
+5. Consider variations in formatting, such as quotes around values ("key=value", key="value"), special characters in keys or values, or escape sequences.
+6. If the log samples have an unstructured body, such as free-form text, then classify it as "log_type: unstructured".
+7. If the log samples follow a csv format then classify it as "log_type: csv".
+8. If the samples are identified as "csv" and there is a csv header then set "header: true", else set "header: false".
+9. If you do not find the log format in any of the above categories then classify it as "log_type: unsupported".
+
+ You ALWAYS follow these guidelines when writing your response:
 <guidelines>
-- Go through each log sample and identify the log format type.
-- If the samples have a timestamp , loglevel in the beginning information then set "header: true".
-- If the samples have a syslog header then set "header: true" , else set "header: false". If you are unable to determine the syslog header presence then set "header: false".
-- If the syslog samples have structured body then classify it as "log_type: structured".
-- If the syslog samples have unstructured body then classify it as "log_type: unstructured".
-- If the syslog samples follow a csv format then classify it as "log_type: csv".
-- If the samples are identified as "csv" and there is a csv header then set "header: true" , else set "header: false".
-- If you do not find the log format in any of the above categories then classify it as "log_type: unsupported".
 - Do not respond with anything except the updated current mapping JSON object enclosed with 3 backticks (\`). See example response below.
 </guidelines>