Always include trailing newline #747
// lineBounds returns the byte offsets of the start and end of the 1-based
// lineNumber. The end offset is exclusive and will not contain the line-ending
// newline. If the line number is out of range of the lines in the file, start
// and end will be clamped to [0,fileSize].
func (nls newlines) lineBounds(lineNumber int) (start, end uint32) {
I split `lineBounds` into `lineStart` and `lineEnd` because 1) there were often places where we only needed one or the other, and 2) it let me add an option to exclude the trailing newline for `lineEnd` more easily.
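To make the split concrete, here is a minimal sketch of what separate start and end lookups could look like. The `newlineIndex` type, `indexNewlines`, and the `excludeNewline` parameter are hypothetical names chosen for illustration; this is not the code from the PR, only an approximation of the idea under the assumption that the index stores the offset of every `\n` plus the file size.

```go
package main

import "fmt"

// newlineIndex is a hypothetical stand-in for Zoekt's newlines type. It
// records the byte offset of every '\n' in a file, plus the file size.
type newlineIndex struct {
	locs     []uint32
	fileSize uint32
}

// indexNewlines scans data once and records where each newline sits.
func indexNewlines(data []byte) newlineIndex {
	var locs []uint32
	for i, b := range data {
		if b == '\n' {
			locs = append(locs, uint32(i))
		}
	}
	return newlineIndex{locs: locs, fileSize: uint32(len(data))}
}

// lineStart returns the offset of the first byte of the 1-based line,
// clamped to [0, fileSize] for out-of-range line numbers.
func (n newlineIndex) lineStart(lineNumber int) uint32 {
	if lineNumber <= 1 {
		return 0
	}
	if lineNumber-2 >= len(n.locs) {
		return n.fileSize
	}
	return n.locs[lineNumber-2] + 1
}

// lineEnd returns the exclusive end offset of the 1-based line, clamped to
// [0, fileSize]. By default the end includes the terminating newline;
// excludeNewline (an assumed option) stops just before it.
func (n newlineIndex) lineEnd(lineNumber int, excludeNewline bool) uint32 {
	if lineNumber <= 0 {
		return 0
	}
	if lineNumber-1 >= len(n.locs) {
		// The last line has no terminating newline, or the line is out of range.
		return n.fileSize
	}
	end := n.locs[lineNumber-1]
	if !excludeNewline {
		end++
	}
	return end
}

func main() {
	data := []byte("one\ntwo\nthree\n")
	nls := indexNewlines(data)
	for line := 1; line <= 4; line++ {
		start := nls.lineStart(line)
		fmt.Printf("line %d: with newline %q, without %q\n", line,
			data[start:nls.lineEnd(line, false)],
			data[start:nls.lineEnd(line, true)])
	}
}
```

Callers that only need one boundary call one method, and the newline handling lives in a single place instead of being patched up after the fact.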
// Drop any trailing newline. Editors do not treat a trailing newline as
// the start of a new line, so we should not either. lineBounds clamps to
// len(data) when an out-of-bounds line is requested.
//
// As an example, if we request lines 1-5 from a file with contents
// `one\ntwo\nthree\n`, we should return `one\ntwo\nthree` because those are
// the three "lines" in the file, separated by newlines.
if highEnd == uint32(len(data)) && bytes.HasSuffix(data, []byte{'\n'}) {
	highEnd = highEnd - 1
	lowStart = min(lowStart, highEnd)
}
This is the culprit of the panics
@sourcegraph/search-platform mind taking a peek at this to see if y'all agree with the direction?
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206 History is on your side 🌞.
This sounds good and correct to me.
This makes a lot of sense to me. If your regex asks for
It is already a bit tricky for the client right? It feels like we should document corner cases for content for clients, to ensure they test. Additionally what you propose I think will match what users want to see. Some fun corner cases while I am here. Imagine your corpus is two files: `` (empty) and
I'm happy with the description. Ready for code review?
7413469 to ba9a6a0
a2e32d0 to 08349fe
@keegancsmith, this is ready for review now. It doesn't look like we do any "release" for Zoekt -- any thoughts about how to publicize this since it is a breaking change?
I still need to review this deeper, so requesting changes so I remember to come back to it + I have some blocking feedback.
Yeah we don't have a changelog. So I am just going to CC people who I know use our API. cc @isker for neogrok and @binarymason from GitLab. Alternatively is it possible to only do this trimming behaviour for ChunkMatches? Or is that asking for trouble.
I did try that in my first attempt. Not impossible, but definitely messy. Also, the ambiguity described exists for
Thanks for the ping @keegancsmith! The logic behind this PR makes sense to me, but I'll need to dig in a bit to see if there would be any breaking changes on our end. We pin to a specific commit, so I don't see any blockers that should prevent this from moving forward. If I see any breaking changes, will report back ASAP. 🤝
LGTM!
Can you add docstrings to api.go for the LineEnd on LineMatches (and I guess for the ChunkMatch-related fields as well)?
6d8d33a to e402c78
e402c78 to bf2734b
Previously, zoekt made the curious choice to not include trailing newline characters in each chunk it served in its API, in spite of the POSIX definition of a line including the trailing newline character, which is respected in many line-oriented Unix tools like grep and ripgrep. This mistake has finally been corrected! But it is a breakage that we have to handle here. I've updated the tests to reflect the kind of data that zoekt is actually serving now. See sourcegraph/zoekt#747.
This was a breaking change for neogrok, but one that I'm happy to handle.
Sorry and thank you @isker!
This: 1) Bumps Zoekt to include sourcegraph/zoekt#747 2) updates all the consumers of our APIs to trim the trailing newline before splitting 3) updates searcher to also include trailing newlines in chunk matches
The goal of this PR is to fix our "line model" so that it handles the edge cases that led to https://github.com/sourcegraph/sourcegraph/issues/60605. In short, this changes the definition of a "line" to include its terminating newline (if it exists).
Before this PR, we had defined a "line" as starting at the byte after a newline (or the beginning of a file) and ending at the byte before a newline (or the end of the file).
The problem with that definition is that a newline that is the last byte in the file can never successfully be matched because we would trim that from the returned content, so any ranges that would match that trailing newline would be out of bounds in the result returned to the client. That's the reason behind the panics caused by #709, which was an attempt to formalize the "line does not include a trailing newline" definition.
So, instead, this PR proposes that we redefine a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.
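The difference between the two definitions can be illustrated with a small sketch. `chunkFor` is a hypothetical helper written for this example (it is not Zoekt's API, and it assumes a non-empty match); it returns the content of the lines covering a match under either definition, so we can check whether the match range still fits inside the returned content.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// chunkFor returns the start offset and content of the line(s) covering the
// non-empty match [matchStart, matchEnd). With includeNewline=false it follows
// the old definition (a line ends before its newline); with true it follows
// the new one (a line ends after its newline).
func chunkFor(data []byte, matchStart, matchEnd int, includeNewline bool) (int, []byte) {
	// A line starts at the byte after the previous newline, or at offset 0.
	start := bytes.LastIndexByte(data[:matchStart], '\n') + 1
	// Find the newline that terminates the line holding the last matched byte.
	end := len(data)
	if i := bytes.IndexByte(data[matchEnd-1:], '\n'); i >= 0 {
		end = matchEnd - 1 + i
		if includeNewline {
			end++
		}
	}
	return start, data[start:end]
}

func main() {
	data := []byte("one\ntwo\n")
	loc := regexp.MustCompile(`two\n`).FindIndex(data) // match includes the final newline

	for _, includeNewline := range []bool{false, true} {
		start, content := chunkFor(data, loc[0], loc[1], includeNewline)
		relEnd := loc[1] - start // match end relative to the returned content
		fmt.Printf("includeNewline=%v content=%q matchEnd=%d len=%d inBounds=%v\n",
			includeNewline, content, relEnd, len(content), relEnd <= len(content))
	}
}
```

Under the old definition the match ends one byte past the returned content, which is the kind of out-of-bounds range a client would trip over; under the new definition the range fits exactly.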
The downside is it does complicate the contract for the client a bit. In practice, it means to get the set of lines, you need to do something like `chunk.content.replace(/\r?\n$/, '').split(/\r?\n/)` instead of just `chunk.content.split(/\r?\n/)` because there may or may not be a trailing newline at the end of the file, but if there is, it does not indicate there is an empty line at the end of the file.

This PR makes significant client-observable changes, and would be accompanied by some associated updates in sourcegraph/sourcegraph drafted here.
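
For clients written in Go rather than JavaScript, the same contract might be sketched as follows. `splitLines` is a hypothetical helper name, not part of any Zoekt or Sourcegraph package; it is meant to mirror the trim-then-split expression from the description: drop at most one terminating newline, then split on `\n` or `\r\n`.

```go
package main

import (
	"fmt"
	"strings"
)

// splitLines splits chunk content into lines. A single terminating "\n" or
// "\r\n" is dropped first, because it ends the last line rather than
// starting a new, empty one.
func splitLines(content string) []string {
	if strings.HasSuffix(content, "\n") {
		content = strings.TrimSuffix(content, "\n")
		content = strings.TrimSuffix(content, "\r")
	}
	lines := strings.Split(content, "\n")
	// Every piece except the last was followed by "\n", so a trailing "\r"
	// on those pieces belonged to a "\r\n" separator.
	for i := 0; i < len(lines)-1; i++ {
		lines[i] = strings.TrimSuffix(lines[i], "\r")
	}
	return lines
}

func main() {
	fmt.Printf("%q\n", splitLines("one\ntwo\nthree\n")) // terminated last line
	fmt.Printf("%q\n", splitLines("one\ntwo\nthree"))   // unterminated last line
	fmt.Printf("%q\n", splitLines("one\n\n"))           // genuine empty second line
}
```

The first two calls yield the same three lines, while the third keeps the real empty line, which is the distinction a plain split cannot make.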