
Autovectorize 3 and 6 bpp Paeth unfiltering on stable #513

Closed
wants to merge 4 commits

Conversation

@Shnatsel (Contributor) commented Sep 29, 2024

Make the 3 and 6 bpp codepaths operate on i16 vectors of widths 4 and 8 respectively. This gains +5% for the 3 bpp case and +10% for the 6 bpp case. It works on stable Rust and matches the current std::simd implementation in master in terms of performance.

Part of #511

This does not obsolete #512 which pushes SIMD performance even further.
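For context, a minimal sketch of the idea, with hypothetical helper names rather than the PR's exact code: do the Paeth math for each 3 bpp pixel in a fixed-width i16 array (the fourth lane is padding), which the compiler can autovectorize on stable.

```rust
/// Paeth predictor on 4 lanes of i16 at once. Hypothetical helper,
/// not the exact code from this PR. Using i16 avoids overflow in
/// the `a + b - c` intermediate.
fn paeth4(a: [i16; 4], b: [i16; 4], c: [i16; 4]) -> [i16; 4] {
    let mut out = [0i16; 4];
    for i in 0..4 {
        let p = a[i] + b[i] - c[i];
        let (pa, pb, pc) = ((p - a[i]).abs(), (p - b[i]).abs(), (p - c[i]).abs());
        out[i] = if pa <= pb && pa <= pc {
            a[i]
        } else if pb <= pc {
            b[i]
        } else {
            c[i]
        };
    }
    out
}

/// Unfilter one 3 bpp row; `prev` is the previous, already-unfiltered
/// row of the same length. Each pixel is widened into a width-4 i16
/// array, with the 4th lane left as zero padding.
fn unfilter_paeth_3bpp(row: &mut [u8], prev: &[u8]) {
    let mut a = [0i16; 4]; // left pixel
    let mut c = [0i16; 4]; // upper-left pixel
    for (cur, up) in row.chunks_exact_mut(3).zip(prev.chunks_exact(3)) {
        let mut b = [0i16; 4]; // above pixel
        for i in 0..3 {
            b[i] = up[i] as i16;
        }
        let pred = paeth4(a, b, c);
        for i in 0..3 {
            cur[i] = cur[i].wrapping_add(pred[i] as u8);
            a[i] = cur[i] as i16;
        }
        c = b;
    }
}
```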

@okaneco (Contributor) commented Sep 29, 2024

This scalarizes the code for the bpp=3 case
https://godbolt.org/z/qKsYas8TP
and seems to scalarize part of the bpp=6 case
https://godbolt.org/z/c4oeEWsoh

The current version was massaged through several iterations to eventually get auto-vectorization. I'm entirely open to the idea that scalarized code can be faster than auto-vectorized SIMD but I'm not sure that's the case here.

We have to resort to SIMD types to force vectorization in these non-power-of-two cases (or pull out the TryInto::<&mut [u8; N]>::try_into casting acrobatics). The compiler sees that you never write to part of the array and discards any chunk handling you try to force, even though you're using 4- and 8-element arrays.
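For readers unfamiliar with that trick, a minimal sketch of what the slice-to-fixed-array conversion looks like (a hypothetical Sub-filter helper, not this crate's actual code): converting each chunk to a `&mut [u8; 3]` proves the exact length to the compiler, which is often what unlocks autovectorization.

```rust
// Hypothetical illustration of the `try_into` casting trick:
// the fixed-size array reference lets the compiler prove every
// index is in bounds and vectorize the loop body.
fn add_left_3bpp(row: &mut [u8]) {
    let mut prev = [0u8; 3]; // previous pixel (the "left" neighbor)
    for px in row.chunks_exact_mut(3) {
        // Free at runtime: chunks_exact guarantees the length is 3,
        // so this unwrap cannot fail.
        let px: &mut [u8; 3] = px.try_into().unwrap();
        for i in 0..3 {
            px[i] = px[i].wrapping_add(prev[i]);
        }
        prev = *px;
    }
}
```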

@Shnatsel (Contributor, Author) commented
Thanks for the info! I didn't stare at the assembly for this PR after doing a great deal of that in other ones; I just measured it and posted the version that improved the benchmarks. It is possible that this is only faster on the latest CPUs with very wide ILP.

Could you run `cargo bench --bench=unfilter --features=benchmarks Paeth` on your machine with and without this PR? It would be a valuable data point, so that I'm not optimizing for a single specific CPU.

@Shnatsel (Contributor, Author) commented Oct 5, 2024

Since this scalarizes the code, which probably only benefits the latest CPUs with very high ILP potential, and the gains aren't significant enough to justify switching over, I'm going to close this.

@Shnatsel closed this on Oct 5, 2024