[Auto Import] Improve log format recognition #196228

Merged
merged 6 commits into from
Oct 15, 2024

Conversation

ilyannn
Contributor

@ilyannn ilyannn commented Oct 15, 2024

Context

Previously, the LLM would often select the `unstructured` format for samples that are (to our eye) clearly CSV, e.g. for this PAN-OS integration log (note that the 5 samples are squashed into one line):

Please process these log samples:
<log_samples>
Nov 30 16:09:08 PA-220 1,2018/11/30 16:09:07,012801096514,TRAFFIC,end,2049,2018/11/30 16:09:07,192.168.15.207,175.16.199.1,192.168.1.63,175.16.199.1,new_outbound_from_trust,,,apple-maps,vsys1,trust,untrust,ethernet1/2,ethernet1/1,send_to_mac,2018/11/30 16:09:07,22751,1,55113,443,16418,443,0x400053,tcp,allow,7734,1758,5976,36,2018/11/30 15:59:04,586,computer-and-internet-info,0,32091112,0x0,192.168.0.0-192.168.255.255,United States,0,16,20,tcp-fin,0,0,0,0,,PA-220,from-policy,,,0,,0,,N/A,0,0,0,0,1,2021/10/26 14:58:43,,TRAFFIC,end,2561,2021/10/26 14:58:43,81.2.69.144,81.2.69.145,,,intrazone-default,,,syslog,vsys1,LAN,LAN,ethernet1/2,ethernet1/2,LFPpan,2021/10/26 14:58:43,15877,1,57681,30514,0,0,0x10005e,udp,allow,2545,1365,1180,4,2021/10/26 14:58:07,0,any,,7022390495259151779,0x0,United States,United States,,2,2,aged-out,0,0,0,0,,PA-VM,from-policy," ", , , , , , , , , , , , , , , , , , , ,,,,2021-10-26T14:58:43.066-07:00,Nov 30 16:09:45 PA-220 1,2018/11/30 16:09:45,012801096514,TRAFFIC,end,2049,2018/11/30 16:09:45,192.168.15.224,175.16.199.1,192.168.1.63,175.16.199.1,new_outbound_from_trust,,,ssl,vsys1,xtrust,untrust,ethernet1/2,ethernet1/1,send_to_mac,2018/11/30 16:09:45,24204,1,52459,443,28012,443,0x40001c,tcp,allow,1761,1100,661,15,2018/11/30 16:09:14,13,computer-and-internet-info,0,32091159,0x0,192.168.0.0-192.168.255.255,United States,0,8,7,tcp-rst-from-client,0,0,0,0,,PA-220,from-policy,,,0,,0,,N/A,0,0,0,0,1,2021/10/26 14:31:32,,TRAFFIC,start,2561,2021/10/26 14:31:32,81.2.69.193,81.2.69.193,0.0.0.0,0.0.0.0,any to Intranet DCs,intranet\\sampleuser$,,dns,vsys1,LAN,WAN,ethernet1/2,ethernet1/1,LFPpan,2021/10/26 14:31:32,15844,1,64624,53,32849,53,0x400000,udp,allow,93,93,0,1,2021/10/26 14:31:31,0,any,,7022390495259151731,0x0,169.254.0.0-169.254.255.255,United States,,1,0,n/a,0,0,0,0,,PA-VM,from-policy," ", , , , , , , , , , , , , , , , , , , ,,,,2021-10-26T14:31:32.772-07:00,1,2021/10/26 14:32:07,,TRAFFIC,end,2561,2021/10/26 
14:32:07,81.2.69.193,81.2.69.193,192.168.10.111,81.2.69.193,LAn-TO-WAn,,,dns,vsys1,LAN,WAN,ethernet1/2,ethernet1/1,LFPpan,2021/10/26 14:32:07,15844,1,64624,53,32849,53,0x400019,udp,allow,93,93,0,1,2021/10/26 14:31:31,0,any,,7022390495259151737,0x0,169.254.0.0-169.254.255.255,United States,,1,0,aged-out,0,0,0,0,,PA-VM,from-policy," ", , , , , , , , , , , , , , , , , , , ,,,,2021-10-26T14:32:07.776-07:00
</log_samples>

Please find the JSON object below:
{
  "name": "structured",
  "header": true
}

Summary

We add the missing line break between the log samples (which should help format recognition in general) and change the prompt to clarify when a comma-separated list should be treated as `csv` and when as `structured` format.
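
As a rough illustration of the line-break fix, here is a minimal TypeScript sketch. The helper name `buildLogSamplesPrompt` is hypothetical and the prompt wording is copied from the example above; this is not the actual Automatic Import code, only one way the samples could be kept on separate lines.

```typescript
// Hypothetical helper (name and signature are assumptions, not Kibana code).
// It places every log sample on its own line inside the <log_samples> tags,
// so the model sees one record per line instead of one squashed line.
function buildLogSamplesPrompt(samples: string[]): string {
  // Join with '\n' so that sample boundaries stay visible to the model.
  const body = samples.join('\n');
  return [
    'Please process these log samples:',
    '<log_samples>',
    body,
    '</log_samples>',
  ].join('\n');
}

// Two shortened, made-up samples in the style of the PAN-OS log above.
const prompt = buildLogSamplesPrompt([
  '1,2021/10/26 14:58:43,,TRAFFIC,end',
  '1,2021/10/26 14:31:32,,TRAFFIC,start',
]);
console.log(prompt);
```

With one record per line, the repeated comma-separated layout is much easier to spot than in the single squashed line shown under Context.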

Testing

The result is (compare to the existing integration):

{
  "name": "csv",
  "header": false,
  "columns": [
    "version",
    "timestamp",
    "serial_number",
    "log_type",
    "log_subtype",
    "session_id",
    "generate_time",
    "source_ip",
    "destination_ip",
    "nat_source_ip",
    "nat_destination_ip",
    "rule_name",
    "user",
    "",
    "application",
    "vsys",
    "source_zone",
    "destination_zone",
    "ingress_interface",
    "egress_interface",
    "log_forwarding_profile",
    "session_start_time",
    "elapsed_time",
    "repeat_count",
    "source_port",
    "destination_port",
    "nat_source_port",
    "nat_destination_port",
    "flags",
    "protocol",
    "action",
    "bytes",
    "bytes_sent",
    "bytes_received",
    "packets",
    "start_time",
    "elapsed_time",
    "category",
    "",
    "sequence_number",
    "",
    "source_location",
    "destination_location",
    "",
    "packets_sent",
    "packets_received",
    "session_end_reason",
    "",
    "",
    "",
    "",
    "",
    "device_name",
    "action_source",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "timestamp_iso"
  ]
}

Generated integration: ai_panw_traffic_202410150239-1.0.0.zip

(check the sample event)
        {
            "ai_panw_traffic_202410150239": {
                "log": {
                    "action_source": "from-policy",
                    "category": "any",
                    "column41": "0x0",
                    "column48": "0",
                    "column49": "0",
                    "column50": "0",
                    "column51": "0",
                    "column55": " ",
                    "column78": " , , , , , , , ,,,,2021-10-26T14:58:43.066-07:00",
                    "destination_zone": "LAN",
                    "egress_interface": "ethernet1/2",
                    "elapsed_time": "15877",
                    "elapsed_time_2": "0",
                    "flags": "0x10005e",
                    "generate_time": "2021/10/26 14:58:43",
                    "ingress_interface": "ethernet1/2",
                    "log_forwarding_profile": "LFPpan",
                    "log_subtype": "end",
                    "log_type": "TRAFFIC",
                    "repeat_count": "1",
                    "session_end_reason": "aged-out",
                    "session_id": "2561",
                    "session_start_time": "2021/10/26 14:58:43",
                    "source_location": "United States",
                    "source_zone": "LAN",
                    "start_time": "2021/10/26 14:58:07",
                    "timestamp": "2021/10/26 14:58:43",
                    "version": "1",
                    "vsys": "vsys1"
                }
            },
            "destination": {
                "bytes": "1180",
                "geo": {
                    "city_name": "London",
                    "continent_name": "Europe",
                    "country_iso_code": "GB",
                    "country_name": "United Kingdom",
                    "location": {
                        "lat": 51.5142,
                        "lon": -0.0931
                    },
                    "region_iso_code": "GB-ENG",
                    "region_name": "England"
                },
                "ip": "81.2.69.145",
                "nat": {
                    "port": "0"
                },
                "packets": "2",
                "port": "30514"
            },
            "ecs": {
                "version": "8.11.0"
            },
            "event": {
                "action": "allow",
                "category": [
                    "network"
                ],
                "end": "2021-10-26T14:58:43.000Z",
                "original": "1,2021/10/26 14:58:43,,TRAFFIC,end,2561,2021/10/26 14:58:43,81.2.69.144,81.2.69.145,,,intrazone-default,,,syslog,vsys1,LAN,LAN,ethernet1/2,ethernet1/2,LFPpan,2021/10/26 14:58:43,15877,1,57681,30514,0,0,0x10005e,udp,allow,2545,1365,1180,4,2021/10/26 14:58:07,0,any,,7022390495259151779,0x0,United States,United States,,2,2,aged-out,0,0,0,0,,PA-VM,from-policy,\" \", , , , , , , , , , , , , , , , , , , ,,,,2021-10-26T14:58:43.066-07:00",
                "sequence": 7022390495259151779,
                "start": "2021-10-26T14:58:43.000Z",
                "type": [
                    "allowed",
                    "connection",
                    "end"
                ]
            },
            "network": {
                "application": "syslog",
                "bytes": 2545,
                "packets": "4",
                "transport": "udp"
            },
            "observer": {
                "hostname": "PA-VM"
            },
            "related": {
                "host": [
                    "PA-VM"
                ],
                "ip": [
                    "81.2.69.144",
                    "81.2.69.145"
                ]
            },
            "rule": {
                "name": "intrazone-default"
            },
            "source": {
                "bytes": 1365,
                "geo": {
                    "city_name": "London",
                    "continent_name": "Europe",
                    "country_iso_code": "GB",
                    "country_name": "United Kingdom",
                    "location": {
                        "lat": 51.5142,
                        "lon": -0.0931
                    },
                    "region_iso_code": "GB-ENG",
                    "region_name": "England"
                },
                "ip": "81.2.69.144",
                "nat": {
                    "port": "0"
                },
                "packets": "2",
                "port": "57681"
            },
            "tags": [
                "preserve_original_event"
            ]
        },
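
The sample event suggests how the column list maps to field names: unnamed columns appear as `column41`, `column48`, and so on (1-based position), and the second `elapsed_time` appears as `elapsed_time_2`. A minimal TypeScript sketch of such a naming scheme follows. The function `uniqueColumnNames` is hypothetical and inferred only from the output above; the real implementation may differ.

```typescript
// Hypothetical sketch inferred from the sample event; not the actual
// Automatic Import implementation.
function uniqueColumnNames(columns: string[]): string[] {
  const seen = new Map<string, number>();
  return columns.map((name, index) => {
    // Unnamed columns fall back to a 1-based positional name, e.g. "column41".
    const base = name === '' ? `column${index + 1}` : name;
    const count = (seen.get(base) ?? 0) + 1;
    seen.set(base, count);
    // The first occurrence keeps its name; repeats get a _2, _3, ... suffix.
    return count === 1 ? base : `${base}_${count}`;
  });
}

const names = uniqueColumnNames(['version', 'elapsed_time', '', 'elapsed_time']);
// names: ['version', 'elapsed_time', 'column3', 'elapsed_time_2']
console.log(names);
```

This sketch does not handle the case where a suffixed name such as `elapsed_time_2` already exists in the input; a production version would need to check for that collision.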

@ilyannn ilyannn added bug Fixes for quality problems that affect the customer experience release_note:skip Skip the PR/issue when compiling release notes backport missing Added to PRs automatically when the are determined to be missing a backport. backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) Team:Security-Scalability Team label for Security Integrations Scalability Team Feature:AutomaticImport labels Oct 15, 2024
@ilyannn ilyannn changed the title Make the CSV format choice less likely to be recognized as unstructured [Auto Import] Fix cases of the CSV format being recognized as unstructured Oct 15, 2024
@ilyannn ilyannn changed the title [Auto Import] Fix cases of the CSV format being recognized as unstructured [Auto Import] Fix cases of CSV format being recognized as unstructured Oct 15, 2024
@ilyannn ilyannn marked this pull request as ready for review October 15, 2024 00:54
@ilyannn ilyannn requested a review from a team as a code owner October 15, 2024 00:54
@elasticmachine
Contributor

Pinging @elastic/security-scalability (Team:Security-Scalability)

@ilyannn ilyannn changed the title [Auto Import] Fix cases of CSV format being recognized as unstructured [Auto Import] Improve log format recognition Oct 15, 2024
Contributor

@bhapas bhapas left a comment


Looks good, minor comments.

@bhapas
Contributor

bhapas commented Oct 15, 2024

`release_note:fix`, maybe?

Contributor

@bhapas bhapas left a comment


LGTM

@ilyannn
Contributor Author

ilyannn commented Oct 15, 2024

`release_note:fix`, maybe?

I considered this, but it's mostly needed for CSV, and we already have a release note about CSV format support, which more or less covers this one.

@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #13 / endpoint Endpoint permissions: when running with user/role [t1_analyst] should display endpoint data on Host Details

Metrics [docs]

✅ unchanged

@ilyannn ilyannn merged commit bdc9ce9 into elastic:main Oct 15, 2024
22 checks passed
@ilyannn ilyannn deleted the auto-import/csv-over-unstructured branch October 15, 2024 12:02
@kibanamachine
Contributor

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11345593101

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 15, 2024
Previously the LLM would often select `unstructured` format for what (to
our eye) clearly are CSV samples.

We add the missing line break between the log samples (which should help
format recognition in general) and change the prompt to clarify when the
comma-separated list should be treated as a `csv` and when as
`structured` format.

See GitHub for examples.

---------

Co-authored-by: Bharat Pasupula <[email protected]>
(cherry picked from commit bdc9ce9)
@kibanamachine
Contributor

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Oct 15, 2024
# Backport

This will backport the following commits from `main` to `8.x`:
- [[Auto Import] Improve log format recognition
(#196228)](#196228)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Ilya Nikokoshev <[email protected]>
@kibanamachine kibanamachine added v8.16.0 and removed backport missing Added to PRs automatically when the are determined to be missing a backport. labels Oct 15, 2024
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) bug Fixes for quality problems that affect the customer experience Feature:AutomaticImport release_note:skip Skip the PR/issue when compiling release notes Team:Security-Scalability Team label for Security Integrations Scalability Team v8.16.0 v9.0.0