Improve performance of JoinParts for spans #42

Merged
merged 6 commits into from
Apr 3, 2024

Sewer56 (Member) commented Mar 13, 2024

This improves the performance of the JoinParts method by avoiding an unnecessary heap allocation that makes an extra copy of the string.

We pin the Span (in case it originated from the heap) and use pointers to its characters as input to string.Create.
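The pin-and-reconstruct idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `PinnedJoinSketch` and `Join` are hypothetical names, a `'/'` separator stands in for whatever separator logic `JoinParts` really uses, and the project is assumed to have `<AllowUnsafeBlocks>true</AllowUnsafeBlocks>` enabled.

```csharp
using System;

// Hypothetical sketch, not the NexusMods.Paths implementation.
// Requires <AllowUnsafeBlocks>true</AllowUnsafeBlocks> in the project file.
public static unsafe class PinnedJoinSketch
{
    // Joins two spans as "left/right" with a single string allocation.
    public static string Join(ReadOnlySpan<char> left, ReadOnlySpan<char> right)
    {
        if (left.IsEmpty) return right.ToString();
        if (right.IsEmpty) return left.ToString();

        // Pin both spans so the GC cannot relocate heap-backed data while raw
        // pointers to it are in use (harmless if the data is on the stack).
        fixed (char* leftPtr = left)
        fixed (char* rightPtr = right)
        {
            // ReadOnlySpan<char> is a ref struct and cannot be used as the
            // TState of string.Create, so each span is deconstructed into a
            // pointer + length pair here...
            var state = (LeftPtr: (nint)leftPtr, LeftLen: left.Length,
                         RightPtr: (nint)rightPtr, RightLen: right.Length);

            return string.Create(left.Length + 1 + right.Length, state, static (dest, s) =>
            {
                // ...and reconstructed inside the callback, which runs
                // synchronously while both spans are still pinned.
                var l = new ReadOnlySpan<char>((char*)s.LeftPtr, s.LeftLen);
                var r = new ReadOnlySpan<char>((char*)s.RightPtr, s.RightLen);
                l.CopyTo(dest);
                dest[l.Length] = '/';
                r.CopyTo(dest[(l.Length + 1)..]);
            });
        }
    }
}
```

Because `string.Create` invokes the callback before returning, the pointers are only dereferenced while the `fixed` blocks are still active.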

@Sewer56 Sewer56 added the meta-performance Anywhere we might get an improvement in performance. label Mar 13, 2024
@Sewer56 Sewer56 requested a review from a team March 13, 2024 10:18
@Sewer56 Sewer56 self-assigned this Mar 13, 2024
@erri120 erri120 changed the title Improved: Performance of JoinParts (Span) Improve performance of JoinParts for spans Mar 13, 2024
codecov-commenter commented Mar 13, 2024

Codecov Report

Attention: Patch coverage is 92.30769%, with 1 line in your changes missing coverage. Please review.

Project coverage is 87.20%. Comparing base (582e3ca) to head (948996f).

| Files | Patch % | Lines |
|---|---|---|
| src/NexusMods.Paths/Utilities/PathHelpers.cs | 92.30% | 0 Missing and 1 partial ⚠️ |


Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
+ Coverage   87.19%   87.20%   +0.01%
==========================================
  Files          42       42
  Lines        3357     3362       +5
  Branches      546      544       -2
==========================================
+ Hits         2927     2932       +5
  Misses        371      371
  Partials       59       59
```

| Flag | Coverage Δ |
|---|---|
| Linux | 86.85% <92.30%> (+0.01%) ⬆️ |
| Windows | 86.85% <92.30%> (+0.01%) ⬆️ |
| macOS | 86.79% <92.30%> (+0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

erri120 (Member) left a comment

This needs some benchmarks to compare old vs. new. The span-version of JoinParts is also not used by anything except the tests, so even if the new version is more performant than the old version, we won't see that benefit.

If I remember correctly, we have a span-version and a string-version of JoinParts because they have different performance characteristics. The string-version uses string.Create to get a Span<char> while the span-version used to allocate an array "the normal way". Due to ref-struct weirdness, I couldn't get the span-version to use string.Create. This PR circumvents the ref-struct weirdness by deconstructing and reconstructing the span inside the string.Create part.
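To make the two shapes concrete, here is a rough sketch of what the review describes. The method names and the `'/'` separator are hypothetical, and the array-based body is an assumption about what allocating "the normal way" looked like. The underlying constraint is real: `ReadOnlySpan<char>` is a ref struct, so a `(ReadOnlySpan<char>, ReadOnlySpan<char>)` tuple cannot be formed and passed as `string.Create` state, whereas a `(string, string)` tuple can.

```csharp
using System;

// Hypothetical illustration of the two shapes; not the library's code.
public static class JoinShapesSketch
{
    // Assumed "old" span-version shape: build the result in a temporary
    // array, then copy it again into a new string (two allocations).
    public static string JoinSpansViaArray(ReadOnlySpan<char> left, ReadOnlySpan<char> right)
    {
        var buffer = new char[left.Length + 1 + right.Length];
        left.CopyTo(buffer);
        buffer[left.Length] = '/';
        right.CopyTo(buffer.AsSpan(left.Length + 1));
        return new string(buffer);
    }

    // String-version shape: strings are ordinary heap objects, so they can be
    // passed as the TState of string.Create and copied directly into the
    // final string's buffer (one allocation).
    public static string JoinStrings(string left, string right)
    {
        return string.Create(left.Length + 1 + right.Length, (left, right), static (dest, s) =>
        {
            s.left.AsSpan().CopyTo(dest);
            dest[s.left.Length] = '/';
            s.right.AsSpan().CopyTo(dest[(s.left.Length + 1)..]);
        });
    }
}
```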

With this PR, the span- and string-versions should have similar performance characteristics (needs to be benchmarked), so the string-version should be removed.

You should also add a short comment on the unsafe part, explaining what I mentioned above about the span deconstructing and reconstructing inside the string.Create part.
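A BenchmarkDotNet class along the following lines would produce tables with the same columns as the ones posted in this thread (Method, Path1, Path2, Mean, Error, StdDev, Gen0, Allocated). It is a hypothetical sketch: it assumes the BenchmarkDotNet NuGet package and a project targeting `net8.0`, the input strings are placeholders (the real benchmark paths appear truncated in the tables), and the two method bodies are stand-ins rather than the actual old and new `JoinParts` code.

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;

// Hypothetical harness; method bodies are stand-ins for the real JoinParts variants.
[MemoryDiagnoser]                  // adds the GC (Gen0) and Allocated columns
[SimpleJob(RuntimeMoniker.Net80)]  // run on .NET 8 (project must target net8.0)
public class JoinBenchmarks
{
    // Placeholder inputs, not the paths used in the PR's runs.
    [Params("short", "a/very/long/path/with/several/segments")]
    public string Path1 = "";

    [Params("short", "a/very/long/path/with/several/segments")]
    public string Path2 = "";

    [Benchmark]
    public string Current()
    {
        // Stand-in for the new approach: write straight into the result string.
        return string.Create(Path1.Length + 1 + Path2.Length, (Path1, Path2), static (dest, s) =>
        {
            s.Path1.AsSpan().CopyTo(dest);
            dest[s.Path1.Length] = '/';
            s.Path2.AsSpan().CopyTo(dest[(s.Path1.Length + 1)..]);
        });
    }

    [Benchmark]
    public string OldMethod()
    {
        // Stand-in for the old approach: temporary array, then a second copy.
        var buffer = new char[Path1.Length + 1 + Path2.Length];
        Path1.AsSpan().CopyTo(buffer);
        buffer[Path1.Length] = '/';
        Path2.AsSpan().CopyTo(buffer.AsSpan(Path1.Length + 1));
        return new string(buffer);
    }
}
```

Calling `BenchmarkRunner.Run<JoinBenchmarks>()` (from `BenchmarkDotNet.Running`) in a Release build runs every Path1/Path2 combination and prints the summary table.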

Sewer56 (Member Author) commented Mar 13, 2024

| Method    | Path1                | Path2                | Mean     | Error    | StdDev   | Gen0   | Allocated |
|---------- |--------------------- |--------------------- |---------:|---------:|---------:|-------:|----------:|
| Current   | a/ver(...)forms [46] | a/ver(...)forms [46] | 20.98 ns | 0.150 ns | 0.141 ns | 0.0124 |     208 B |
| OldMethod | a/ver(...)forms [46] | a/ver(...)forms [46] | 27.12 ns | 0.370 ns | 0.346 ns | 0.0124 |     208 B |
| Current   | a/ver(...)forms [46] | short                | 18.00 ns | 0.188 ns | 0.167 ns | 0.0076 |     128 B |
| OldMethod | a/ver(...)forms [46] | short                | 23.57 ns | 0.501 ns | 0.536 ns | 0.0076 |     128 B |
| Current   | short                | a/ver(...)forms [46] | 17.91 ns | 0.081 ns | 0.063 ns | 0.0076 |     128 B |
| OldMethod | short                | a/ver(...)forms [46] | 24.16 ns | 0.447 ns | 0.349 ns | 0.0076 |     128 B |
| Current   | short                | short                | 14.93 ns | 0.347 ns | 0.399 ns | 0.0029 |      48 B |
| OldMethod | short                | short                | 19.71 ns | 0.084 ns | 0.070 ns | 0.0029 |      48 B |

Sure, you can have a benchmark if you want.
I just knew it would be faster from a lot of past experience optimizing down to this level.

As a point of reference, pinning a string (or a Span) costs around 5 x86 instructions. No memory allocation and copy in the world would come close.

erri120 (Member) commented Mar 13, 2024

The benchmark doesn't compare the old span-version to the new span-version. It compares the old span-version to the unchanged string-version.

Sewer56 (Member Author) commented Mar 13, 2024

Ah, apologies. I'm sleepy and didn't even notice; one moment.

erri120 (Member) commented Mar 13, 2024

| Method    | Path1                | Path2                | Mean     | Error    | StdDev   | Gen0   | Allocated |
|---------- |--------------------- |--------------------- |---------:|---------:|---------:|-------:|----------:|
| Current   | a/ver(...)forms [46] | a/ver(...)forms [46] | 38.11 ns | 0.326 ns | 0.289 ns | 0.0124 |     208 B |
| OldMethod | a/ver(...)forms [46] | a/ver(...)forms [46] | 29.69 ns | 0.634 ns | 1.190 ns | 0.0124 |     208 B |
| Current   | a/ver(...)forms [46] | short                | 35.46 ns | 0.666 ns | 0.866 ns | 0.0076 |     128 B |
| OldMethod | a/ver(...)forms [46] | short                | 24.90 ns | 0.536 ns | 0.677 ns | 0.0076 |     128 B |
| Current   | short                | a/ver(...)forms [46] | 36.42 ns | 0.750 ns | 0.833 ns | 0.0076 |     128 B |
| OldMethod | short                | a/ver(...)forms [46] | 27.04 ns | 0.585 ns | 1.070 ns | 0.0076 |     128 B |
| Current   | short                | short                | 31.94 ns | 0.657 ns | 0.703 ns | 0.0029 |      48 B |
| OldMethod | short                | short                | 19.94 ns | 0.270 ns | 0.226 ns | 0.0029 |      48 B |

These are the results comparing the old span-version to the new span-version.

Sewer56 (Member Author) commented Mar 13, 2024

| Method    | Path1                | Path2                | Mean     | Error    | StdDev   | Gen0   | Allocated |
|---------- |--------------------- |--------------------- |---------:|---------:|---------:|-------:|----------:|
| Current   | a/ver(...)forms [46] | a/ver(...)forms [46] | 17.77 ns | 0.188 ns | 0.157 ns | 0.0124 |     208 B |
| OldMethod | a/ver(...)forms [46] | a/ver(...)forms [46] | 26.01 ns | 0.205 ns | 0.192 ns | 0.0124 |     208 B |
| Current   | a/ver(...)forms [46] | short                | 15.58 ns | 0.179 ns | 0.168 ns | 0.0076 |     128 B |
| OldMethod | a/ver(...)forms [46] | short                | 22.75 ns | 0.187 ns | 0.156 ns | 0.0076 |     128 B |
| Current   | short                | a/ver(...)forms [46] | 15.43 ns | 0.069 ns | 0.054 ns | 0.0076 |     128 B |
| OldMethod | short                | a/ver(...)forms [46] | 21.63 ns | 0.197 ns | 0.185 ns | 0.0076 |     128 B |
| Current   | short                | short                | 12.94 ns | 0.152 ns | 0.142 ns | 0.0029 |      48 B |
| OldMethod | short                | short                | 15.75 ns | 0.104 ns | 0.087 ns | 0.0029 |      48 B |

I forgot to account for the fact that nested tuples with deconstruction result in some very suboptimal codegen.
In any case, the core idea wasn't wrong. I fixed the PR.

I used a trick I learned from one of the runtime folks and left a short note explaining it. I also set the benchmarks to run on .NET 8, which was overdue.

Comment on lines 417 to 422

```csharp
unsafe struct JoinPartsParams
{
    internal ReadOnlySpan<char>* Left;
    internal ReadOnlySpan<char>* Right;
    internal IOSInformation Os;
}
```
Member

The struct should be marked as private as well as readonly.

Member Author

I'll do it when I get back to my machine, don't worry.
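The struct under review holds pointers to the `ReadOnlySpan<char>` variables themselves rather than to their characters. A minimal sketch of how such a struct could be passed as `string.Create` state follows, with the struct made `private` and `readonly` as the review requests. All names are hypothetical (`SpanPointerJoinSketch`, `JoinState`), the `IOSInformation` field is dropped, a `'/'` separator is assumed, `AllowUnsafeBlocks` must be enabled, and the compiler may report warning CS8500 for the pointer-to-span fields.

```csharp
using System;

// Hypothetical sketch: pass pointers to the span variables (not their
// characters) through string.Create. Requires AllowUnsafeBlocks; the
// compiler may warn (CS8500) about pointers to ReadOnlySpan<char>.
public static unsafe class SpanPointerJoinSketch
{
    // private + readonly, as requested in the review.
    private readonly struct JoinState
    {
        public readonly ReadOnlySpan<char>* Left;
        public readonly ReadOnlySpan<char>* Right;

        public JoinState(ReadOnlySpan<char>* left, ReadOnlySpan<char>* right)
        {
            Left = left;
            Right = right;
        }
    }

    public static string Join(ReadOnlySpan<char> left, ReadOnlySpan<char> right)
    {
        // left and right are value parameters (fixed variables on this stack
        // frame), so their addresses stay valid until Join returns, and
        // string.Create runs the callback synchronously before that.
        var state = new JoinState(&left, &right);
        return string.Create(left.Length + 1 + right.Length, state, static (dest, s) =>
        {
            // Dereference the pointers to get copies of the caller's spans.
            ReadOnlySpan<char> l = *s.Left;
            ReadOnlySpan<char> r = *s.Right;
            l.CopyTo(dest);
            dest[l.Length] = '/';
            r.CopyTo(dest[(l.Length + 1)..]);
        });
    }
}
```

Each `*s.Left` dereference copies a span out of the caller's frame; that copy and the extra indirection are the kind of overhead discussed later in this thread.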

erri120 (Member) commented Mar 13, 2024

The current string-version is still faster than the new span-version, and all of our code uses the string-version. We can probably just remove the span-version of JoinParts.

- Current = span-version
- OldMethod = string-version

| Method    | Path1                | Path2                | Mean     | Error    | StdDev   | Gen0   | Allocated |
|---------- |--------------------- |--------------------- |---------:|---------:|---------:|-------:|----------:|
| Current   | a/ver(...)forms [46] | a/ver(...)forms [46] | 16.75 ns | 0.031 ns | 0.024 ns | 0.0124 |     208 B |
| OldMethod | a/ver(...)forms [46] | a/ver(...)forms [46] | 15.86 ns | 0.062 ns | 0.058 ns | 0.0124 |     208 B |
| Current   | a/ver(...)forms [46] | short                | 15.38 ns | 0.348 ns | 0.542 ns | 0.0076 |     128 B |
| OldMethod | a/ver(...)forms [46] | short                | 14.70 ns | 0.335 ns | 0.314 ns | 0.0076 |     128 B |
| Current   | short                | a/ver(...)forms [46] | 15.08 ns | 0.126 ns | 0.118 ns | 0.0076 |     128 B |
| OldMethod | short                | a/ver(...)forms [46] | 14.11 ns | 0.113 ns | 0.106 ns | 0.0076 |     128 B |
| Current   | short                | short                | 12.44 ns | 0.077 ns | 0.072 ns | 0.0029 |      48 B |
| OldMethod | short                | short                | 11.60 ns | 0.062 ns | 0.052 ns | 0.0029 |      48 B |

Sewer56 (Member Author) commented Mar 13, 2024

> The current string-version is still faster than the new span-version.

They're the same code; the only catch is that the Span version has 3 additional stack copies and 2 dereferences, which account for a flat nanosecond. Unfortunately, those are unavoidable for now.

If a future runtime uses a throw helper inside string.Create, making it inlineable, and the SpanAction callback also got inlined, the JIT could then turn the reference into direct variable use and the difference would be 0. Hopefully in a few runtimes from now, hahaha.
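For readers unfamiliar with the throw-helper pattern mentioned here: the construction and throwing of an exception is moved into a separate, non-inlined method, so the hot method stays small and becomes a better inlining candidate for the JIT. The sketch below is a generic illustration with hypothetical names; it is not how `string.Create` is implemented in any runtime version.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;
using System.Runtime.CompilerServices;

// Generic illustration of the throw-helper pattern; names are hypothetical.
public static class ThrowHelperSketch
{
    // Hot path: a cheap check plus the real work. The throw itself lives
    // out of line, which keeps this method body small.
    public static int CheckedLength(int length)
    {
        if (length < 0)
            ThrowNegativeLength(length);
        return length;
    }

    // Cold path: never inlined, and flagged as never returning normally.
    [DoesNotReturn]
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void ThrowNegativeLength(int length) =>
        throw new ArgumentOutOfRangeException(nameof(length), length, "Length must be non-negative.");
}
```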

I know this is mostly dead code; I just had some spare time to kill, so I figured 'one day I'll use this, I might as well'. One use case that comes to mind is a future where we might want to return the parent as a ref struct (a wrapper around a Span).

@Al12rs Al12rs requested a review from erri120 March 26, 2024 09:27
@Sewer56 Sewer56 merged commit 7ec5282 into main Apr 3, 2024
5 checks passed
@Sewer56 Sewer56 deleted the improve-join branch April 3, 2024 12:21