Hi!
Hope it's cool that I made a PR. I really like this tool, but I noticed it keeps entropy information about all lines for a given file in memory and does a huge sort at the end.
Proposed change
I propose this change, which means we never keep more lines in memory than necessary.
The `Entropies` struct keeps the top n lines sorted in a slice.
Testing with a medium-sized repository, I noticed the old version climbed all the way up to 1 GB of memory; after the change it doesn't even show up among my top 100 programs by memory consumption.
Execution time
While I didn't see an improvement in execution time in my testing, this version does get rid of the large sort at the end.
If we want to cut down execution time in the future, it may be wise to make an individual list per file and/or per directory (as before), and merge these together when done with each file/directory, to reduce locking on the main `Entropies` struct.
For very large values of `-top`, we could possibly also get very small performance gains using a max heap vs a slice, but that's probably premature optimization at this point.
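The per-file idea could look roughly like the sketch below. This is only an illustration: `shared`, `merge`, and `localTop` are hypothetical names, the scores are plain `float64` values instead of full line records, and `localTop` sorts a copy for brevity where a real version would keep its own bounded list.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// shared is a hypothetical stand-in for the main Entropies struct.
type shared struct {
	mu  sync.Mutex
	max int
	top []float64 // descending order, at most max items
}

// merge folds one per-file result into the shared list, taking the
// lock once per file instead of once per line.
func (s *shared) merge(local []float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.top = append(s.top, local...)
	sort.Sort(sort.Reverse(sort.Float64Slice(s.top)))
	if len(s.top) > s.max {
		s.top = s.top[:s.max]
	}
}

// localTop returns the max highest scores of one file, descending.
// It needs no lock because the data is private to the caller.
func localTop(scores []float64, max int) []float64 {
	out := append([]float64(nil), scores...)
	sort.Sort(sort.Reverse(sort.Float64Slice(out)))
	if len(out) > max {
		out = out[:max]
	}
	return out
}

func main() {
	files := [][]float64{{1, 5, 3}, {4, 2}, {6, 0.5}}
	s := &shared{max: 3}

	var wg sync.WaitGroup
	for _, f := range files {
		wg.Add(1)
		go func(scores []float64) {
			defer wg.Done()
			s.merge(localTop(scores, s.max))
		}(f)
	}
	wg.Wait()

	fmt.Println(s.top)
}
```

Since each goroutine only touches the shared struct once, contention should scale with the number of files rather than the number of lines. Whether that helps in practice would need measuring.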