-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A mildly simple approach at introducing the 'combined diff format' #8
base: master
Are you sure you want to change the base?
Conversation
This adds in a new diff format parser based on the combined diff format: https://git-scm.com/docs/git-diff Git generates this when you perform a 'git diff --cc' or 'git diff -c' invocation. It shows the differences when the file had more than 1 parent. Without this fix, such differences are ignored. This is to help support fixing [issue 1028](gitleaks/gitleaks#1028) in GitLeaks. GitLeaks will also need to add the '--cc' argument, as mentioned by that issue. There are a few limitations with this. A big one is in the file header termination detection. For now, it only scans for the 1-parent style ('@@ -') and the 2-parent style ('@@@ -'). The combined diff format allows for having many more than just 2 parents, so this will skip those cases. I've updated the README to note about this limitation. This includes some unit test coverage. It's not exhaustive, but it does lend some credibility to the code.
These were originally going to be for patch testing, but those tests weren't written. Mostly because I'm not quite sure how to evaluate a multi-parent patch with just one source file.
const prefix = "diff --git " | ||
|
||
if !strings.HasPrefix(p.Line(0), prefix) { | ||
const prefix1 = "diff --git " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This prefix check may need to change. If the diff is invoked with just "-c" instead of "--cc", it won't match. If more options are needed here, a fixed array of prefix matching would work better.
|
||
// check for a final "no newline" marker since it is not included in the | ||
// counters used to stop the loop above | ||
if isNoNewlineMarker(p.Line(0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reuses functionality from text.go
; some future refactoring could centralize these functions as well as reduce the freshly-added code duplication.
I found some situations where the 'diff --cc' can return a line count number of 18446744073709551615, which is a uint64 value of a signed int64 -1. The code now adjusts for this strange result. Additionally, a scenario came up where the line counting algorithm was off by 1, which may be related to the -1 issue above. This was captured in the 'strangeCount' test. Both of these scenarios point to a bug in the git diff algorithm. It's due to this that the final line count check can't be used.
I'm investigating a bug with this code, where the commit ID does not associate with the right change. |
@groboclown thanks for this PR, is it in draft or is this ready for a review? |
I finally gave this the attention it needs. I found the bug - it's the parsing of multiple empty commits; I have a test case for it. I'm looking into where the parser gets it wrong now. |
In some cases, especially for some merges, a commit can contain zero files. This can throw off the header parser by interpreting the following headers as the body text of the first commit. Down-stream, this causes the GitLeaks tool to identify the wrong commit ID with the leaked data. This patch currently fails the empty-commit check. A minor issue with the expected data eludes me at the moment.
This latest commit includes the handler for the empty commits. The test still fails due to some minor issue with the expected data, but the answer eludes me at the moment - a simple text diff of the output shows no differences. I'll need to look at this in more detail, or someone with sharper eyes may give it a go. |
In some cases, especially for some merges, a commit can contain zero files. This can throw off the header parser by interpreting the following headers as the body text of the first commit. Down-stream, this causes the GitLeaks tool to identify the wrong commit ID with the leaked data. This patch currently fails the empty-commit check. A minor issue with the expected data eludes me at the moment.
The test did not include the correct time zone, which caused local time differences to cause an error. Additionally, the test time creator should use the same time creation method as the corresponding code. Also, added a note for how to fix the multi-header parser, without actually fixing it.
I finally had time to get back to this patch. It now has the fix for the test. Turns out, after some very deep inspection, that it came from a time formatting issue in the unit test. @zricethezav I now consider this PR ready for a review. |
// TODO '@@@ -' is the simplest form of the combined diff; it | ||
// could be any number of '@' - one per parent + 1 '@'. | ||
// The fix involves adding a separate '@' count discovery | ||
// before this block (as '@' is the most commonly encountered structure). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this fix would add even more complexity to this PR. Rather than try to tackle that here, I'm proposing to only handle the simple case first, and add in the more-than-2 scenario later.
This adds in a new diff format parser based on the combined diff format:
https://git-scm.com/docs/git-diff
Git generates this when you perform a 'git diff --cc' or 'git diff -c' invocation. It shows the differences when the file had more than 1 parent. Without this fix, such differences are ignored.
This is to help support fixing issue 1028 in GitLeaks. GitLeaks will also need to add the '--cc' argument, as mentioned by that issue.
There are a few limitations with this. A big one is in the file header termination detection. For now, it only scans for the 1-parent style ('@@ -') and the 2-parent style ('@@@ -'). The combined diff format allows for having many more than just 2 parents, so this will skip those cases. I've updated the README to note about this limitation.
This includes some unit test coverage. It's not exhaustive, but it does lend some credibility to the code.