Always include trailing newline #747

camdencheek · 2024-03-18T22:00:12Z

The goal of this PR is to fix our "line model" to fix the edge cases that led to https://github.com/sourcegraph/sourcegraph/issues/60605. In short, this changes the definition of a "line" to include its terminating newline (if it exists).

Before this PR, we had defined a "line" as starting at the byte after a newline (or the beginning of a file) and ending at the byte before a newline (or the end of the file).

The problem with that definition is that a newline that is the last byte in the file can never successfully be matched because we would trim that from the returned content, so any ranges that would match that trailing newline would be out of bounds in the result returned to the client. That's the reason behind the panics caused by #709, which was an attempt to formalize the "line does not include a trailing newline" definition.

So, instead, this PR proposes that we redefine a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.

The downside is it does complicate the contract for the client a bit. In practice, it means to get the set of lines, you need to do something like chunk.content.replace(/\r?\n$/, '').split(/\r?\n/) instead of just chunk.content.split(/\r?\n/) because there may or may not be a trailing newline at the end of the file, but if there is, it does not indicate there is an empty line at the end of the file.

This PR makes significant client-observable changes, and would be accompanied by some associated updates in sourcegraph/sourcegraph drafted here.

camdencheek · 2024-03-18T23:21:33Z

contentprovider.go

-// lineBounds returns the byte offsets of the start and end of the 1-based
-// lineNumber. The end offset is exclusive and will not contain the line-ending
-// newline. If the line number is out of range of the lines in the file, start
-// and end will be clamped to [0,fileSize].
-func (nls newlines) lineBounds(lineNumber int) (start, end uint32) {


I split lineBounds into lineStart and lineEnd because 1) there were often places where we only needed one or the other, and 2) it let me add an option to exclude the trailing newline for lineEnd more easily.

camdencheek · 2024-03-18T23:22:11Z

contentprovider.go

-	// Drop any trailing newline. Editors do not treat a trailing newline as
-	// the start of a new line, so we should not either. lineBounds clamps to
-	// len(data) when an out-of-bounds line is requested.
-	//
-	// As an example, if we request lines 1-5 from a file with contents
-	// `one\ntwo\nthree\n`, we should return `one\ntwo\nthree` because those are
-	// the three "lines" in the file, separated by newlines.
-	if highEnd == uint32(len(data)) && bytes.HasSuffix(data, []byte{'\n'}) {
-		highEnd = highEnd - 1
-		lowStart = min(lowStart, highEnd)
-	}


This is the culprit of the panics

camdencheek · 2024-03-18T23:47:34Z

@sourcegraph/search-platform mind taking a peek at this to see if y'all agree with the direction?

isker · 2024-03-25T23:20:46Z

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206 History is on your side 🌞.

keegancsmith · 2024-04-03T10:24:53Z

In short, this changes the definition of a "line" in a chunk to include its terminating newline (if it exists).

This sounds good and correct to me.

So, instead, this PR proposes that we redefine a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.

This makes a lot of sense to me. If your regex asks for \n and we say sure we found one but then it isn't there... that feels broken. So I 100% agree we need to go in this direction.

The downside is it does complicate the ChunkMatch contract for the client a bit. In practice, it means to get the set of lines, you need to do something like chunk.content.replace(/\r?\n$/, '').split(/\r?\n/) instead of just chunk.content.split(/\r?\n/) because there may or may not be a trailing newline at the end of the file, but if there is, it does not indicate there is an empty line at the end of the file.

It is already a bit tricky for the client right? It feels like we should document corner cases for content for clients, to ensure they test. Additionally what you propose I think will match what users want to see.

Some fun corner cases while I am here. Imagine your corpus is two files: `` (empty) and \n (single newline):

/^\n$/
/^\n?$/

This PR makes client-observable changes, and would be accompanied by some associated updates in sourcegraph/sourcegraph drafted here.

I'm happy with the description. Ready for code review?

camdencheek · 2024-04-10T18:32:39Z

@keegancsmith, this is ready for review now. It doesn't look we do any "release" for Zoekt -- any thoughts about how to publicize this since it is a breaking change?

keegancsmith

I still need to review this deeper, so requesting changes so I remember to come back to it + I have some blocking feedback.

contentprovider.go

internal/e2e/e2e_rank_test.go

keegancsmith · 2024-04-14T19:07:21Z

@keegancsmith, this is ready for review now. It doesn't look we do any "release" for Zoekt -- any thoughts about how to publicize this since it is a breaking change?

Yeah we don't have a changelog. So I am just going to CC people who I know use our API. cc @isker for neogrok and @binarymason from GitLab.

Alternatively is it possible to only do this trimming behaviour for ChunkMatches? Or is that asking for trouble.

camdencheek · 2024-04-15T23:05:36Z

is it possible to only do this trimming behaviour for ChunkMatches?

I did try that in my first attempt. Not impossible, but definitely messy. Also, the ambiguity described exists for LineMatch too, so I consider this a bugfix. If you'd like, I can make a sibling PR that attempts this and we can decide whether it's worth further splitting the code paths.

binarymason · 2024-04-17T01:50:51Z

Thanks for the ping @keegancsmith! The logic behind this PR makes sense to me, but I'll need to dig in a bit to see if there would be any breaking changes on our end. We pin to a specific commit, so I don't see any blockers that should prevent this from moving forward.

If I see any breaking changes, will report back ASAP. 🤝

keegancsmith

LGTM!

Can you add docstrings to api.go for the LineEnd on LineMatches (and I guess for ChunkMatch related fields as well).

contentprovider.go

index_test.go

Previously, zoekt made the curious choice to not include trailing newline characters in each chunk it served in its API, in spite of the posix definition of a line including the trailing newline character, which is respected in many line-oriented unix tools like grep/ripgrep, etc. This mistake has finally been corrected! But it is a breakage that we have to handle here. I've updated the tests to reflect the kind of data that zoekt is actually serving now. See sourcegraph/zoekt#747.

isker · 2024-04-19T05:07:16Z

This was a breaking change for neogrok, but one that I'm happy to handle.

keegancsmith · 2024-04-19T07:55:38Z

Sorry and thank you @isker!

This: 1) Bumps Zoekt to include sourcegraph/zoekt#747 2) updates all the consumers of our APIs to trim the trailing newline before splitting 3) updates searcher to also include trailing newlines in chunk matches

camdencheek marked this pull request as draft March 18, 2024 22:00

camdencheek commented Mar 18, 2024

View reviewed changes

camdencheek force-pushed the cc/fix-newlines branch 4 times, most recently from 7413469 to ba9a6a0 Compare April 10, 2024 15:57

camdencheek changed the title ~~Chunk matches: always include trailing newline~~ Always include trailing newline Apr 10, 2024

camdencheek force-pushed the cc/fix-newlines branch 6 times, most recently from a2e32d0 to 08349fe Compare April 10, 2024 17:59

camdencheek marked this pull request as ready for review April 10, 2024 18:31

camdencheek requested review from keegancsmith and a team April 10, 2024 18:31

keegancsmith requested changes Apr 14, 2024

View reviewed changes

cla-bot bot added the cla-signed label Apr 16, 2024

keegancsmith approved these changes Apr 17, 2024

View reviewed changes

contentprovider.go Show resolved Hide resolved

contentprovider.go Outdated Show resolved Hide resolved

contentprovider.go Outdated Show resolved Hide resolved

index_test.go Outdated Show resolved Hide resolved

index_test.go Outdated Show resolved Hide resolved

camdencheek added 3 commits April 17, 2024 20:08

always include newlines as part of line

a266c2b

update tests

114fc0f

trim newlines for display

8d011c3

camdencheek added 6 commits April 17, 2024 20:08

update comments

a384c06

update test files

d8c99b1

remove lineEnd

a91d71f

check for underflow

804d888

fix debug score printing

34765c8

review feedback

40e86e5

camdencheek force-pushed the cc/fix-newlines branch from 6d8d33a to e402c78 Compare April 18, 2024 02:37

fix off-by-one

bf2734b

camdencheek force-pushed the cc/fix-newlines branch from e402c78 to bf2734b Compare April 18, 2024 02:42

sourcegraph deleted a comment from github-actions bot Apr 18, 2024

add doc comments

8e9d2a4

camdencheek merged commit 74e75ef into main Apr 18, 2024
9 checks passed

camdencheek deleted the cc/fix-newlines branch April 18, 2024 02:57

camdencheek mentioned this pull request Apr 18, 2024

Search: expect trailing newlines in chunk matches sourcegraph/sourcegraph-public-snapshot#61247

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Always include trailing newline #747

Always include trailing newline #747

camdencheek commented Mar 18, 2024 •

edited

Loading

camdencheek Mar 18, 2024

camdencheek Mar 18, 2024

camdencheek commented Mar 18, 2024

isker commented Mar 25, 2024

keegancsmith commented Apr 3, 2024

camdencheek commented Apr 10, 2024

keegancsmith left a comment

keegancsmith commented Apr 14, 2024

camdencheek commented Apr 15, 2024

binarymason commented Apr 17, 2024

keegancsmith left a comment

isker commented Apr 19, 2024

keegancsmith commented Apr 19, 2024

Always include trailing newline #747

Always include trailing newline #747

Conversation

camdencheek commented Mar 18, 2024 • edited Loading

camdencheek Mar 18, 2024

Choose a reason for hiding this comment

camdencheek Mar 18, 2024

Choose a reason for hiding this comment

camdencheek commented Mar 18, 2024

isker commented Mar 25, 2024

keegancsmith commented Apr 3, 2024

camdencheek commented Apr 10, 2024

keegancsmith left a comment

Choose a reason for hiding this comment

keegancsmith commented Apr 14, 2024

camdencheek commented Apr 15, 2024

binarymason commented Apr 17, 2024

keegancsmith left a comment

Choose a reason for hiding this comment

isker commented Apr 19, 2024

keegancsmith commented Apr 19, 2024

camdencheek commented Mar 18, 2024 •

edited

Loading