Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A mildly simple approach at introducing the 'combined diff format' #8

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

groboclown
Copy link

This adds in a new diff format parser based on the combined diff format:

https://git-scm.com/docs/git-diff

Git generates this when you perform a 'git diff --cc' or 'git diff -c' invocation. It shows the differences when the file had more than 1 parent. Without this fix, such differences are ignored.

This is to help support fixing issue 1028 in GitLeaks. GitLeaks will also need to add the '--cc' argument, as mentioned by that issue.

There are a few limitations with this. A big one is in the file header termination detection. For now, it only scans for the 1-parent style ('@@ -') and the 2-parent style ('@@@ -'). The combined diff format allows for having many more than just 2 parents, so this will skip those cases. I've updated the README to note about this limitation.

This includes some unit test coverage. It's not exhaustive, but it does lend some credibility to the code.

This adds in a new diff format parser based on the combined diff format:

https://git-scm.com/docs/git-diff

Git generates this when you perform a 'git diff --cc' or 'git diff -c' invocation.  It shows the differences when the file had more than 1 parent.  Without this fix, such differences are ignored.

This is to help support fixing [issue 1028](gitleaks/gitleaks#1028) in GitLeaks.  GitLeaks will also need to add the '--cc' argument, as mentioned by that issue.

There are a few limitations with this.  A big one is in the file header termination detection.  For now, it only scans for the 1-parent style ('@@ -') and the 2-parent style ('@@@ -').  The combined diff format allows for having many more than just 2 parents, so this will skip those cases.  I've updated the README to note about this limitation.

This includes some unit test coverage.  It's not exhaustive, but it does lend some credibility to the code.
These were originally going to be for patch testing, but those tests weren't written.  Mostly because I'm not quite sure how to evaluate a multi-parent patch with just one source file.
const prefix = "diff --git "

if !strings.HasPrefix(p.Line(0), prefix) {
const prefix1 = "diff --git "
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prefix check may need to change. If the diff is invoked with just "-c" instead of "--cc", it won't match. If more options are needed here, a fixed array of prefix matching would work better.


// check for a final "no newline" marker since it is not included in the
// counters used to stop the loop above
if isNoNewlineMarker(p.Line(0)) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reuses functionality from text.go; some future refactoring could centralize these functions as well as reduce the freshly-added code duplication.

I found some situations where the 'diff --cc' can return a line count number of 18446744073709551615, which is a uint64 value of a signed int64 -1.  The code now adjusts for this strange result.

Additionally, a scenario came up where the line counting algorithm was off by 1, which may be related to the -1 issue above.  This was captured in the 'strangeCount' test.

Both of these scenarios point to a bug in the git diff algorithm.  It's due to this that the final line count check can't be used.
@groboclown
Copy link
Author

I'm investigating a bug with this code, where the commit ID does not associate with the right change.

@zricethezav
Copy link

@groboclown thanks for this PR, is it in draft or is this ready for a review?

@groboclown
Copy link
Author

I finally gave this the attention it needs. I found the bug - it's the parsing of multiple empty commits; I have a test case for it. I'm looking into where the parser gets it wrong now.

In some cases, especially for some merges, a commit can contain zero files.  This can throw off the header parser by interpreting the following headers as the body text of the first commit.  Down-stream, this causes the GitLeaks tool to identify the wrong commit ID with the leaked data.

This patch currently fails the empty-commit check.  A minor issue with the expected data eludes me at the moment.
@groboclown
Copy link
Author

This latest commit includes the handler for the empty commits. The test still fails due to some minor issue with the expected data, but the answer eludes me at the moment - a simple text diff of the output shows no differences. I'll need to look at this in more detail, or someone with sharper eyes may give it a go.

In some cases, especially for some merges, a commit can contain zero files.  This can throw off the header parser by interpreting the following headers as the body text of the first commit.  Down-stream, this causes the GitLeaks tool to identify the wrong commit ID with the leaked data.

This patch currently fails the empty-commit check.  A minor issue with the expected data eludes me at the moment.
The test did not include the correct time zone, which caused local time differences to cause an error.  Additionally, the test time creator should use the same time creation method as the corresponding code.

Also, added a note for how to fix the multi-header parser, without actually fixing it.
@groboclown
Copy link
Author

groboclown commented Nov 6, 2024

I finally had time to get back to this patch. It now has the fix for the test. Turns out, after some very deep inspection, that it came from a time formatting issue in the unit test.

@zricethezav I now consider this PR ready for a review.

// TODO '@@@ -' is the simplest form of the combined diff; it
// could be any number of '@' - one per parent + 1 '@'.
// The fix involves adding a separate '@' count discovery
// before this block (as '@' is the most commonly encountered structure).
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this fix would add even more complexity to this PR. Rather than try to tackle that here, I'm proposing to only handle the simple case first, and add in the more-than-2 scenario later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants