chunkmatches: reuse last calculated column when filling #711

Merged 3 commits on Jan 10, 2024
Changes from 2 commits
40 changes: 38 additions & 2 deletions contentprovider.go
@@ -293,6 +293,7 @@ func (p *contentProvider) fillContentChunkMatches(ms []*candidateMatch, numConte
chunks := chunkCandidates(ms, newlines, numContextLines)
data := p.data(false)
chunkMatches := make([]ChunkMatch, 0, len(chunks))
columnHelper := columnHelper{data: data}
for _, chunk := range chunks {
ranges := make([]Range, 0, len(chunk.candidates))
var symbolInfo []*Symbol
@@ -306,12 +307,12 @@ func (p *contentProvider) fillContentChunkMatches(ms []*candidateMatch, numConte
Start: Location{
ByteOffset: startOffset,
LineNumber: uint32(startLine),
- Column: uint32(utf8.RuneCount(data[startLineOffset:startOffset]) + 1),
+ Column: columnHelper.get(startLineOffset, startOffset),
},
End: Location{
ByteOffset: endOffset,
LineNumber: uint32(endLine),
- Column: uint32(utf8.RuneCount(data[endLineOffset:endOffset]) + 1),
+ Column: columnHelper.get(endLineOffset, endOffset),
},
})

@@ -392,6 +393,41 @@ func chunkCandidates(ms []*candidateMatch, newlines newlines, numContextLines in
return chunks
}

// columnHelper is a helper struct which caches the number of runes last
// counted. If we naively use utf8.RuneCount for each match on a line, this
// leads to an O(nm) algorithm where m is the number of matches and n is the
// length of the line. Aassuming we our candidates are increasing in offset
Member

Suggested change
// length of the line. Aassuming we our candidates are increasing in offset
// length of the line. Assuming we our candidates are increasing in offset

Member

Just to check: are we always sure we can assume our candidates are increasing in offset? I can't remember if this is always true.

Member

Oh, reading the implementation, I guess we just fall back to the less performant version.

Member

@jtibshirani Jan 9, 2024

I believe this invariant is true, because in gatherMatches we always make sure to sort by byteOffset.

Maybe we could update the comments to make it clear this invariant is assumed, and treat the unsorted case as an error rather than being expected? That way if we ever introduce a bug here, we don't silently fall back to an O(n^2) algorithm... much harder to track down than a clear error in testing.

General thought: if invariants are too tricky to reason about, sometimes I just explicitly add a (re)sort! I believe Go's default sort is very fast when the input is already sorted. This bounds the worst case nicely.

Member Author

I also checked the invariants. We also have the invariant that things don't overlap, which is also important since we look up the end column.

The sorted invariant is actually quite important for other bits of code like chunkCandidates. So what I did was add a sorted check which loudly complains and then sorts if the invariant is broken.

Initially I pretended I was a Haskell programmer and added a special type which guaranteed this, but TBH it felt quite overengineered. Happy to try it out if there is interest, but for now gonna merge with extra perf invariant documentation and sort check.

// makes this operation O(n) instead.
type columnHelper struct {
data []byte

// 0 values for all these are valid values
lastLineOffset int
lastOffset uint32
lastRuneCount uint32
}

// get returns the line column for offset. offset is the byte offset of the
// rune in data. lineOffset is the byte offset inside of data for the line
// containing offset.
func (c *columnHelper) get(lineOffset int, offset uint32) uint32 {
var runeCount uint32

if lineOffset == c.lastLineOffset && offset >= c.lastOffset {
// Can count from last calculation
runeCount = c.lastRuneCount + uint32(utf8.RuneCount(c.data[c.lastOffset:offset]))
} else {
// Need to count from the beginning of line
runeCount = uint32(utf8.RuneCount(c.data[lineOffset:offset]))
}

c.lastLineOffset = lineOffset
c.lastOffset = offset
c.lastRuneCount = runeCount

return runeCount + 1
}

type newlines struct {
// locs is the sorted set of byte offsets of the newlines in the file
locs []uint32
80 changes: 80 additions & 0 deletions contentprovider_test.go
@@ -4,6 +4,8 @@ import (
"bytes"
"fmt"
"testing"
"testing/quick"
"unicode/utf8"

"github.com/google/go-cmp/cmp"
)
@@ -327,3 +329,81 @@ func TestChunkMatches(t *testing.T) {
})
}
}

func BenchmarkColumnHelper(b *testing.B) {
// We simulate looking up columns of evenly spaced matches
const matches = 10_000
const match = "match"
const space = " "
const dist = uint32(len(match) + len(space))
data := bytes.Repeat([]byte(match+space), matches)

b.ResetTimer()

for i := 0; i < b.N; i++ {
columnHelper := columnHelper{data: data}

lineOffset := 0
offset := uint32(0)
for offset < uint32(len(data)) {
col := columnHelper.get(lineOffset, offset)
if col != offset+1 {
b.Fatal("column is not offset even though data is ASCII")
}
offset += dist
}
}
}

func TestColumnHelper(t *testing.T) {
f := func(line0, line1 string) bool {
data := []byte(line0 + line1)
lineOffset := len(line0)

columnHelper := columnHelper{data: data}

// We check every second rune returns the correct answer
offset := lineOffset
column := 1
for offset < len(data) {
if column%2 == 0 {
got := columnHelper.get(lineOffset, uint32(offset))
if got != uint32(column) {
return false
}
}
_, size := utf8.DecodeRune(data[offset:])
offset += size
column++
}

return true
}

if err := quick.Check(f, nil); err != nil {
t.Fatal(err)
}

// Corner cases

// empty data, shouldn't happen but just in case it slips through
ch := columnHelper{data: nil}
if got := ch.get(0, 0); got != 1 {
t.Fatal("empty data didn't return 1", got)
}

// Repeating a call to get should return the same value
ch = columnHelper{data: []byte("hello\nworld")}
if got := ch.get(6, 8); got != 3 {
t.Fatal("unexpected value for third column on second line", got)
}
if got := ch.get(6, 8); got != 3 {
t.Fatal("unexpected value for repeated call for third column on second line", got)
}

// Now make sure if we go backwards we do not incorrectly use the cache
if got := ch.get(6, 6); got != 1 {
t.Fatal("unexpected value for backwards call for first column on second line", got)
}
}