Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always include trailing newline #747

Merged
merged 11 commits into from
Apr 18, 2024
Merged

Always include trailing newline #747

merged 11 commits into from
Apr 18, 2024

Conversation

camdencheek
Copy link
Member

@camdencheek camdencheek commented Mar 18, 2024

The goal of this PR is to fix our "line model" to fix the edge cases that led to https://github.com/sourcegraph/sourcegraph/issues/60605. In short, this changes the definition of a "line" to include its terminating newline (if it exists).

Before this PR, we had defined a "line" as starting at the byte after a newline (or the beginning of a file) and ending at the byte before a newline (or the end of the file).

The problem with that definition is that a newline that is the last byte in the file can never successfully be matched because we would trim that from the returned content, so any ranges that would match that trailing newline would be out of bounds in the result returned to the client. That's the reason behind the panics caused by #709, which was an attempt to formalize the "line does not include a trailing newline" definition.

So, instead, this PR proposes that we redefine a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.

The downside is it does complicate the contract for the client a bit. In practice, it means to get the set of lines, you need to do something like chunk.content.replace(/\r?\n$/, '').split(/\r?\n/) instead of just chunk.content.split(/\r?\n/) because there may or may not be a trailing newline at the end of the file, but if there is, it does not indicate there is an empty line at the end of the file.

This PR makes significant client-observable changes, and would be accompanied by some associated updates in sourcegraph/sourcegraph drafted here.

@camdencheek camdencheek marked this pull request as draft March 18, 2024 22:00
Comment on lines -463 to -491
// lineBounds returns the byte offsets of the start and end of the 1-based
// lineNumber. The end offset is exclusive and will not contain the line-ending
// newline. If the line number is out of range of the lines in the file, start
// and end will be clamped to [0,fileSize].
func (nls newlines) lineBounds(lineNumber int) (start, end uint32) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split lineBounds into lineStart and lineEnd because 1) there were often places where we only needed one or the other, and 2) it let me add an option to exclude the trailing newline for lineEnd more easily.

Comment on lines -501 to -535
// Drop any trailing newline. Editors do not treat a trailing newline as
// the start of a new line, so we should not either. lineBounds clamps to
// len(data) when an out-of-bounds line is requested.
//
// As an example, if we request lines 1-5 from a file with contents
// `one\ntwo\nthree\n`, we should return `one\ntwo\nthree` because those are
// the three "lines" in the file, separated by newlines.
if highEnd == uint32(len(data)) && bytes.HasSuffix(data, []byte{'\n'}) {
highEnd = highEnd - 1
lowStart = min(lowStart, highEnd)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the culprit of the panics

@camdencheek
Copy link
Member Author

@sourcegraph/search-platform mind taking a peek at this to see if y'all agree with the direction?

@isker
Copy link
Contributor

isker commented Mar 25, 2024

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206 History is on your side 🌞.

@keegancsmith
Copy link
Member

In short, this changes the definition of a "line" in a chunk to include its terminating newline (if it exists).

This sounds good and correct to me.

So, instead, this PR proposes that we redefine a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.

This makes a lot of sense to me. If your regex asks for \n and we say sure we found one but then it isn't there... that feels broken. So I 100% agree we need to go in this direction.

The downside is it does complicate the ChunkMatch contract for the client a bit. In practice, it means to get the set of lines, you need to do something like chunk.content.replace(/\r?\n$/, '').split(/\r?\n/) instead of just chunk.content.split(/\r?\n/) because there may or may not be a trailing newline at the end of the file, but if there is, it does not indicate there is an empty line at the end of the file.

It is already a bit tricky for the client right? It feels like we should document corner cases for content for clients, to ensure they test. Additionally what you propose I think will match what users want to see.

Some fun corner cases while I am here. Imagine your corpus is two files: `` (empty) and \n (single newline):

  • /^\n$/
  • /^\n?$/

This PR makes client-observable changes, and would be accompanied by some associated updates in sourcegraph/sourcegraph drafted here.

I'm happy with the description. Ready for code review?

@camdencheek camdencheek force-pushed the cc/fix-newlines branch 4 times, most recently from 7413469 to ba9a6a0 Compare April 10, 2024 15:57
@camdencheek camdencheek changed the title Chunk matches: always include trailing newline Always include trailing newline Apr 10, 2024
@camdencheek camdencheek force-pushed the cc/fix-newlines branch 6 times, most recently from a2e32d0 to 08349fe Compare April 10, 2024 17:59
@camdencheek camdencheek marked this pull request as ready for review April 10, 2024 18:31
@camdencheek camdencheek requested review from keegancsmith and a team April 10, 2024 18:31
@camdencheek
Copy link
Member Author

@keegancsmith, this is ready for review now. It doesn't look we do any "release" for Zoekt -- any thoughts about how to publicize this since it is a breaking change?

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to review this deeper, so requesting changes so I remember to come back to it + I have some blocking feedback.

contentprovider.go Outdated Show resolved Hide resolved
contentprovider.go Outdated Show resolved Hide resolved
contentprovider.go Outdated Show resolved Hide resolved
contentprovider.go Outdated Show resolved Hide resolved
contentprovider.go Show resolved Hide resolved
internal/e2e/e2e_rank_test.go Outdated Show resolved Hide resolved
@keegancsmith
Copy link
Member

@keegancsmith, this is ready for review now. It doesn't look we do any "release" for Zoekt -- any thoughts about how to publicize this since it is a breaking change?

Yeah we don't have a changelog. So I am just going to CC people who I know use our API. cc @isker for neogrok and @binarymason from GitLab.

Alternatively is it possible to only do this trimming behaviour for ChunkMatches? Or is that asking for trouble.

@camdencheek
Copy link
Member Author

is it possible to only do this trimming behaviour for ChunkMatches?

I did try that in my first attempt. Not impossible, but definitely messy. Also, the ambiguity described exists for LineMatch too, so I consider this a bugfix. If you'd like, I can make a sibling PR that attempts this and we can decide whether it's worth further splitting the code paths.

@cla-bot cla-bot bot added the cla-signed label Apr 16, 2024
@binarymason
Copy link

Thanks for the ping @keegancsmith! The logic behind this PR makes sense to me, but I'll need to dig in a bit to see if there would be any breaking changes on our end. We pin to a specific commit, so I don't see any blockers that should prevent this from moving forward.

If I see any breaking changes, will report back ASAP. 🤝

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Can you add docstrings to api.go for the LineEnd on LineMatches (and I guess for ChunkMatch related fields as well).

contentprovider.go Show resolved Hide resolved
contentprovider.go Outdated Show resolved Hide resolved
contentprovider.go Outdated Show resolved Hide resolved
index_test.go Outdated Show resolved Hide resolved
index_test.go Outdated Show resolved Hide resolved
@sourcegraph sourcegraph deleted a comment from github-actions bot Apr 18, 2024
@camdencheek camdencheek merged commit 74e75ef into main Apr 18, 2024
9 checks passed
@camdencheek camdencheek deleted the cc/fix-newlines branch April 18, 2024 02:57
isker added a commit to isker/neogrok that referenced this pull request Apr 19, 2024
Previously, zoekt made the curious choice to not include trailing
newline characters in each chunk it served in its API, in spite of the
posix definition of a line including the trailing newline character,
which is respected in many line-oriented unix tools like grep/ripgrep,
etc.

This mistake has finally been corrected! But it is a breakage that we
have to handle here. I've updated the tests to reflect the kind of data
that zoekt is actually serving now.

See sourcegraph/zoekt#747.
@isker
Copy link
Contributor

isker commented Apr 19, 2024

This was a breaking change for neogrok, but one that I'm happy to handle.

@keegancsmith
Copy link
Member

Sorry and thank you @isker!

camdencheek added a commit to sourcegraph/sourcegraph-public-snapshot that referenced this pull request Apr 23, 2024
This:
1) Bumps Zoekt to include sourcegraph/zoekt#747
2) updates all the consumers of our APIs to trim the trailing newline before splitting
3) updates searcher to also include trailing newlines in chunk matches
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants