Introduce log pattern lib with initial implementation of Brain algorithm log parser #16751
…thm log parser Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
❕ Gradle check result for 00de7ad: UNSTABLE. Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #16751     +/-   ##
============================================
+ Coverage     72.08%    72.23%   +0.14%
- Complexity    65279     65361      +82
============================================
  Files          5318      5319       +1
  Lines        304056    304179     +123
  Branches      43992     44016      +24
============================================
+ Hits         219188    219725     +537
+ Misses        66932     66468     -464
- Partials      17936     17986      +50
☔ View full report in Codecov by Sentry.
@dbwiddis, @joshuali925, @anirudha, @dblock, please help review the code, or point to someone who can help review it.
Signed-off-by: Songkan Tang <[email protected]>
Thanks @songkant-aws, the algorithm seems legit, but I don't see the "large" picture of how it is going to be integrated into the search flow. We don't need a full-fledged implementation, but sketching out the SPI/API would definitely help with shaping the parsing in general. I believe the Brain algorithm is just one of many possible implementations (as per #16627), so we should figure out:
As it stands now, this is a very targeted and isolated module which exposes the general algorithm.
Signed-off-by: Songkan Tang <[email protected]>
@reta Thanks, it's fair to ask for a high-level design of the integration with the search flow. I will add more to the RFC and let folks review again. For now, I don't think that is a blocker for checking in this isolated library so that other plugin components can reuse it.
Signed-off-by: Songkan Tang <[email protected]>
@songkant-aws I disagree with that: I don't see the point of committing code into core that not a single component in core is using or planning to use. It circles back to the discussion on the RFC: if you plan to have the implementation in a specific plugin (SQL fe), it would make sense to start from there and move it to core later, when it becomes clear that wider applicability and reusability are needed.
I agree with @reta's point here. This repo is already huge and we bottleneck on getting changes through, so we're always hesitant to add more code that is not actually used within the repo itself. For now I'd suggest applying the YAGNI principle by implementing this where you need it, and we can evaluate moving it later if/when that becomes necessary.
for (int i = 0; i < tokens.size() - 1; i++) {
    String tokenKey = String.format(Locale.ROOT, POSITIONED_TOKEN_KEY_FORMAT, i, tokens.get(i));
    Long tokenFreq = tokenFreqMap.get(tokenKey);
    occurrences.put(tokenFreq, occurrences.getOrDefault(tokenFreq, 0) + 1);
Are words that have the same frequency from the same lines?
This is part of the algorithm: tokens with the same frequency are grouped to form the initial pattern. The longest group of same-frequency tokens is encoded as a (frequency, length) word occurrence, which is used later to refine the internal groups.
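To make the grouping step above concrete, here is a minimal sketch of how it could look in isolation. It is not the code from this PR: the class name, the method names, the `"%d-%s"` key format, and the tie-breaking rule (prefer the higher frequency) are all assumptions made for illustration, and unlike the reviewed loop it does not skip the last token of each line.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; names and key format are hypothetical, not taken from the PR.
final class SameFrequencyGroupSketch {
    // Assumed key format: "<position>-<token>". The PR's POSITIONED_TOKEN_KEY_FORMAT may differ.
    private static final String KEY_FORMAT = "%d-%s";

    private SameFrequencyGroupSketch() {}

    /** Counts, across all lines, how often each token appears at each position. */
    static Map<String, Long> positionedTokenFrequencies(List<List<String>> lines) {
        Map<String, Long> freq = new HashMap<>();
        for (List<String> tokens : lines) {
            for (int i = 0; i < tokens.size(); i++) {
                String key = String.format(Locale.ROOT, KEY_FORMAT, i, tokens.get(i));
                freq.merge(key, 1L, Long::sum);
            }
        }
        return freq;
    }

    /**
     * For one line, returns (frequency, length) for the largest set of tokens
     * that share the same frequency, or null if the line has no tokens.
     * Ties on length are broken by the higher frequency (an assumption).
     */
    static Map.Entry<Long, Integer> longestSameFrequencyGroup(List<String> tokens, Map<String, Long> freq) {
        // frequency -> number of tokens in this line that have that frequency
        Map<Long, Integer> occurrences = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            String key = String.format(Locale.ROOT, KEY_FORMAT, i, tokens.get(i));
            occurrences.merge(freq.getOrDefault(key, 0L), 1, Integer::sum);
        }
        Map.Entry<Long, Integer> best = null;
        for (Map.Entry<Long, Integer> e : occurrences.entrySet()) {
            boolean longer = best == null || e.getValue() > best.getValue();
            boolean tieHigherFreq = best != null
                && e.getValue().equals(best.getValue())
                && e.getKey() > best.getKey();
            if (longer || tieHigherFreq) {
                best = new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue());
            }
        }
        return best;
    }
}
```

With this sketch, tokens of one line that share a frequency are counted together even when they are not adjacent, which matches the `occurrences` map in the reviewed snippet. The same frequency can of course also occur in unrelated lines; the (frequency, length) pair is only computed per line.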
Description
Introduce a new library to maintain algorithms for log parsing. The initial commit implements a single log parsing algorithm, Brain, which is reported to have the highest grouping accuracy. See: https://ieeexplore.ieee.org/document/10109145
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
RFC: #16627
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.