[Automatic Import] Improve KV and log type detection prompt improvements (#193136)

## Summary

This PR improves the `log type detection` and `structured log` prompts
for better results.

Improvements include:

- Moved the steps out of the guidelines section and listed them as
numbered steps.
- Improved the wording used to identify the `message body`.
- Expanded the list of possible header parts in structured log parsing.
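
For reviewers, here is a minimal sketch of the header / message body split the KV header prompt asks the model to produce. It is not code from this PR: the sample line, `headerRegex`, `illustrativeGrok`, and `splitHeader` are all hypothetical and assume an RFC3164-style line. Only the structured body ends up in the final capture, which is the part the prompt wants under GREEDYDATA.

```typescript
// Hypothetical sample line, not taken from the PR or its test data.
const sample =
  '<134>Sep 17 10:15:32 host01 app[123]: src=10.0.0.1 dst=10.0.0.2 action=allow';

// Illustrative grok pattern for the same shape (an assumption, not the prompt's output).
const illustrativeGrok =
  '<%{NONNEGINT:priority}>%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{DATA:tag}: %{GREEDYDATA:message}';

// Header parts: priority, timestamp, hostname, tag. Everything after ": " is the body.
const headerRegex =
  /^<(\d{1,3})>([A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s(\S+)\s([^:]+):\s(.*)$/;

interface SplitLog {
  priority: number;
  timestamp: string;
  hostname: string;
  tag: string;
  message: string; // structured key-value body, left unparsed at this stage
}

function splitHeader(line: string): SplitLog | null {
  const m = headerRegex.exec(line);
  if (m === null) {
    return null; // the line does not carry the assumed header
  }
  const [, priority = '', timestamp = '', hostname = '', tag = '', message = ''] = m;
  return { priority: Number(priority), timestamp, hostname, tag, message };
}
```

Calling `splitHeader(sample)` returns the header fields separately and leaves the key-value pairs untouched in `message`, ready for a later KV parsing step.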

---------

Co-authored-by: Hanna Tamoudi
(cherry picked from commit d1f068d)
bhapas committed Sep 17, 2024
1 parent c9c3e89 commit 9eef1c2
Showing 2 changed files with 27 additions and 16 deletions.
23 changes: 15 additions & 8 deletions x-pack/plugins/integration_assistant/server/graphs/kv/prompts.ts
@@ -68,15 +68,19 @@ export const KV_HEADER_PROMPT = ChatPromptTemplate.fromMessages([
],
[
'human',
`Looking at the multiple syslog samples provided in the context, our goal is to identify which RFC they belog to. Then create a regex pattern that can separate the header and the structured body.
`Looking at the multiple syslog samples provided in the context, your task is to separate the "header" and the "message body" from this log. Our goal is to identify which RFC they belong to. Then create a regex pattern that can separate the header and the structured body.
You then have to create a grok pattern using the regex pattern.
You are given a log entry in a structured format.
Follow these steps to identify the header pattern:
1. Identify if the log samples fall under RFC5424 or RFC3164. If not, return 'Custom Format'.
2. The log samples contain the header and structured body. The header may contain any or all of priority, timestamp, loglevel, hostname, ipAddress, messageId or any free-form text or non key-value information etc.,
3. Make sure the regex and grok pattern matches all the header information. Only the structured message body should be under GREEDYDATA in grok pattern.
You ALWAYS follow these guidelines when writing your response:
<guidelines>
- If you cannot match all the logs to the same RFC, return 'Custom Format' for RFC and provide the regex and grok patterns accordingly.
- If the message part contains any unstructured data , make sure to add this in regex pattern and grok pattern.
- Do not parse the message part in the regex. Just the header part should be in regex nad grok_pattern.
- Make sure to map the remaining message part to \'message\' in grok pattern.
- Make sure to map the remaining message body to \'message\' in grok pattern.
- Do not respond with anything except the processor as a JSON object enclosed with 3 backticks (\`), see example response above. Use strict JSON response format.
</guidelines>
@@ -110,12 +114,15 @@ export const KV_HEADER_ERROR_PROMPT = ChatPromptTemplate.fromMessages([
{errors}
</errors>
You ALWAYS follow these guidelines when writing your response:
Follow these steps to fix the errors in the header pattern:
1. Identify any mismatches, incorrect syntax, or logical errors in the pattern.
2. The log samples contain the header and structured body. The header may contain any or all of priority, timestamp, loglevel, hostname, ipAddress, messageId or any free-form text or non key-value information etc.,
3. The message body may start with a description, followed by structured key-value pairs.
4. Make sure the regex and grok pattern matches all the header information. Only the structured message body should be under GREEDYDATA in grok pattern.
You ALWAYS follow these guidelines when writing your response:
<guidelines>
- Identify any mismatches, incorrect syntax, or logical errors in the pattern.
- If the message part contains any unstructured data , make sure to add this in grok pattern.
- Do not parse the message part in the regex. Just the header part should be in regex nad grok_pattern.
- Make sure to map the remaining message part to \'message\' in grok pattern.
- Make sure to map the remaining message body to \'message\' in grok pattern.
- Do not respond with anything except the processor as a JSON object enclosed with 3 backticks (\`), see example response above. Use strict JSON response format.
</guidelines>
@@ -20,15 +20,19 @@ Here is some context for you to reference for your task, read it carefully as yo
[
'human',
`Looking at the log samples , our goal is to identify the syslog type based on the guidelines below.
Follow these steps to identify the log format type:
1. Go through each log sample and identify the log format type.
2. If the samples have any or all of priority, timestamp, loglevel, hostname, ipAddress, messageId in the beginning information then set "header: true".
3. If the samples have a syslog header then set "header: true" , else set "header: false". If you are unable to determine the syslog header presence then set "header: false".
4. If the log samples have structured message body with key-value pairs then classify it as "log_type: structured". Look for a flat list of key-value pairs, often separated by spaces, commas, or other delimiters.
5. Consider variations in formatting, such as quotes around values ("key=value", key="value"), special characters in keys or values, or escape sequences.
6. If the log samples have unstructured body like a free-form text then classify it as "log_type: unstructured".
7. If the log samples follow a csv format then classify it as "log_type: csv".
8. If the samples are identified as "csv" and there is a csv header then set "header: true" , else set "header: false".
9. If you do not find the log format in any of the above categories then classify it as "log_type: unsupported".
You ALWAYS follow these guidelines when writing your response:
<guidelines>
- Go through each log sample and identify the log format type.
- If the samples have a timestamp , loglevel in the beginning information then set "header: true".
- If the samples have a syslog header then set "header: true" , else set "header: false". If you are unable to determine the syslog header presence then set "header: false".
- If the syslog samples have structured body then classify it as "log_type: structured".
- If the syslog samples have unstructured body then classify it as "log_type: unstructured".
- If the syslog samples follow a csv format then classify it as "log_type: csv".
- If the samples are identified as "csv" and there is a csv header then set "header: true" , else set "header: false".
- If you do not find the log format in any of the above categories then classify it as "log_type: unsupported".
- Do not respond with anything except the updated current mapping JSON object enclosed with 3 backticks (\`). See example response below.
</guidelines>
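
As a rough illustration of the numbered classification steps above, the following sketch approximates them with a plain heuristic. It is not part of this PR (the plugin delegates this decision to the LLM). The names `KV_PAIR`, `classifyBody`, and `classifySamples` and the thresholds are hypothetical, and the CSV check is deliberately naive.

```typescript
type LogType = 'structured' | 'unstructured' | 'csv' | 'unsupported';

// key=value, key="quoted value" or key='quoted value' (hypothetical pattern).
const KV_PAIR = /[\w.-]+=(?:"[^"]*"|'[^']*'|[^\s,;]+)/g;

function countKvPairs(body: string): number {
  const matches = body.match(KV_PAIR);
  return matches === null ? 0 : matches.length;
}

// Classify a single message body (header assumed already removed).
function classifyBody(body: string): LogType {
  if (countKvPairs(body) >= 2) {
    return 'structured'; // flat list of key-value pairs
  }
  if (body.split(',').length >= 3) {
    return 'csv'; // naive assumption: three or more comma-separated fields
  }
  return 'unstructured'; // free-form text
}

// All samples must agree; otherwise fall back to 'unsupported'.
function classifySamples(bodies: string[]): LogType {
  let result: LogType | null = null;
  for (const body of bodies) {
    const current = classifyBody(body);
    if (result !== null && result !== current) {
      return 'unsupported';
    }
    result = current;
  }
  return result === null ? 'unsupported' : result;
}
```

A heuristic like this misclassifies free text that happens to contain commas, which is one reason the prompt asks the model to consider formatting variations rather than rely on fixed rules.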