Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce log pattern lib with initial implementation of Brain algorithm log parser #16751

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

songkant-aws
Copy link

Description

Introduce a new library to maintain algorithms for log parsing. The initial commit simply implement a log parser algorithm with highest grouping accuracy called Brain. See: https://ieeexplore.ieee.org/document/10109145

Related Issues

Resolves #[Issue number to be closed when this PR is merged]
RFC: #16627

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

github-actions bot commented Dec 2, 2024

❕ Gradle check result for 00de7ad: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link

codecov bot commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 92.43697% with 9 lines in your changes missing coverage. Please review.

Project coverage is 72.23%. Comparing base (b359dd8) to head (62d4f25).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...in/java/org/opensearch/pattern/BrainLogParser.java 92.43% 4 Missing and 5 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16751      +/-   ##
============================================
+ Coverage     72.08%   72.23%   +0.14%     
- Complexity    65279    65361      +82     
============================================
  Files          5318     5319       +1     
  Lines        304056   304179     +123     
  Branches      43992    44016      +24     
============================================
+ Hits         219188   219725     +537     
+ Misses        66932    66468     -464     
- Partials      17936    17986      +50     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gaobinlong gaobinlong added the backport 2.x Backport to 2.x branch label Dec 2, 2024
@xluo-aws
Copy link
Member

xluo-aws commented Dec 3, 2024

@dbwiddis , @joshuali925 , @anirudha , @dblock , please help review the code if you or know someone that can help review it.

Copy link
Contributor

github-actions bot commented Dec 3, 2024

✅ Gradle check result for 3dd4156: SUCCESS

@songkant-aws
Copy link
Author

@penghuo @andrross + Peng and Andrew to review this initial PR of log pattern

@songkant-aws
Copy link
Author

@reta @andrross , I see you commented on the RFC, would you like to review this PR as well?

@reta
Copy link
Collaborator

reta commented Dec 10, 2024

@reta @andrross , I see you commented on the RFC, would you like to review this PR as well?

Thanks @songkant-aws , the algorithm seem legit, but I don't see the "large" picture of how it is going to be integrated into the search flow. We don't need a full fledged implementation but sketching out SPI/API would definitely help out with shaping the parsing in general. I believe Brain algorithm is just one of many possible implementations (as per #16627), so we should figure out:

  • what the SPI/API for such pattern parsers should be?
  • how it is going to be integrated into the rest of OpenSearch?

As it stands now, this is very targeted and isolated module which exposes the general algorithm.

@songkant-aws
Copy link
Author

@reta @andrross , I see you commented on the RFC, would you like to review this PR as well?

Thanks @songkant-aws , the algorithm seem legit, but I don't see the "large" picture of how it is going to be integrated into the search flow. We don't need a full fledged implementation but sketching out SPI/API would definitely help out with shaping the parsing in general. I believe Brain algorithm is just one of many possible implementations (as per #16627), so we should figure out:

  • what the SPI/API for such pattern parsers should be?
  • how it is going to be integrated into the rest of OpenSearch?

As it stands now, this is very targeted and isolated module which exposes the general algorithm.

@reta Thanks reta, it's fair to add high level design integration with search flow. I will add more to RFC and let folks review again. For now, I think it's not a blocker to check-in this isolated library so that other plugin components can reuse it.

Copy link
Contributor

✅ Gradle check result for 62d4f25: SUCCESS

@reta
Copy link
Collaborator

reta commented Dec 16, 2024

For now, I think it's not a blocker to check-in this isolated library so that other plugin components can reuse it.

@songkant-aws I disagree with that: I don't see a point committing the code into core which not a single component in core is using or is planning to use. It circles back to the discussion on the RFC: if you plan to have implementation in some specific plugin (SQL fe), it would make sense to start from there and move to the core later, when it becomes clear that wider applicability and reusability is needed.

@andrross
Copy link
Member

It circles back to the discussion on the RFC: if you plan to have implementation in some specific plugin (SQL fe), it would make sense to start from there and move to the core later, when it becomes clear that wider applicability and reusability is needed.

I agree with @reta's point here. This repo is already huge and we bottleneck on getting changes through, so we're always hesitant to add more code that is not actually used within the repo itself. For now I'd suggest applying the YAGNI principle by implementing this where you need it and we can evaluate moving it later if/when that becomes necessary.

@songkant-aws
Copy link
Author

Thanks for your feedback, @reta @andrross. I understand the concern. I will evaluate this proposal. Meanwhile, I'm adding integration design with search flow to RFC in next few weeks.

for (int i = 0; i < tokens.size() - 1; i++) {
String tokenKey = String.format(Locale.ROOT, POSITIONED_TOKEN_KEY_FORMAT, i, tokens.get(i));
Long tokenFreq = tokenFreqMap.get(tokenKey);
occurrences.put(tokenFreq, occurrences.getOrDefault(tokenFreq, 0) + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

words have the same frequency are from the same lines?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of the algorithm to group same frequency tokens as initial pattern. Longest same frequency tokens is encoded into (frequency, length) word occurrence for later on refining internal groups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants