- Fixed possible overflow in LUT processing
- Restored compatibility with Visual C Compiler
- Filter: fixed wrong offset handling for 3x3 single-band version
- ColorLUT: fixed potential access violation, up to 2x faster
- ColorLUT: SSE4 & AVX2
- Bands: fixed access violation in `getband` in some environments
- Reduce: SSE4
- GCC 9.0+: fixed unaligned read for `_**_cvtepu8_epi32` functions.
- Resampling: corrected max coefficient calculation. Previously, some rare combinations of initial and requested sizes led to black lines.
- Float-based filters, single-band: 3x3 SSE4, 5x5 SSE4
- Float-based filters, multi-band: 3x3 SSE4 & AVX2, 5x5 SSE4
- Int-based filters, multi-band: 3x3 SSE4 & AVX2, 5x5 SSE4 & AVX2
- Box blur: fast path for radius < 1
- Alpha composite: fast div approximation
- Color conversion: RGB to L SSE4, fast div in RGBa to RGBA
- Resampling: optimized coefficients loading
- Split and get_channel: SSE4
- Fixed a critical memory error for some combinations of source/destination sizes.
- Many optimizations in resampling, including 16-bit intermediate color representation and heavy loop unrolling.
- Maintenance release
- Fixed error in RGBa -> RGBA conversion
- SSE4 and AVX2 fixed-point full loading implementation. Up to 4.6x faster.
- SSE4 and AVX2 fixed-point full loading horizontal pass.
- SSE4 and AVX2 fixed-point full loading vertical pass.
- RGBA -> RGBa SSE4 and AVX2 fixed-point full loading implementations. Up to 2.6x faster.
- RGBa -> RGBA AVX2 implementation using gather instructions. Up to 5x faster.
- SSE4 and AVX2 float full loading horizontal pass.
- SSE4 float full loading vertical pass.
- SSE4 and AVX2 float full loading horizontal pass.
- SSE4 float per-pixel loading vertical pass.
- SSE4 and AVX2 float per-pixel loading horizontal pass.
- SSE4 float per-pixel loading vertical pass.
- SSE4: Up to 2x for downscaling. Up to 3.5x for upscaling.
- AVX2: Up to 2.7x for downscaling. Up to 3.5x for upscaling.
- Simple SSE4 fixed-point implementations with per-pixel loading.
- Up to 2.1x faster.
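
Several entries above mention a "fast div" in alpha compositing and premultiplied-alpha conversion. The notes do not say which approximation is used, so the following is only a sketch of the general idea: replacing an integer division with adds and shifts. It shows one well-known trick for a rounded divide by 255, applied to premultiplying a single pixel. The names `div255` and `premultiply` are hypothetical and are not part of the library.

```python
def div255(x):
    """Rounded x / 255 for 0 <= x <= 255 * 255, using only adds and shifts.

    This is an illustrative technique, not necessarily the one the
    release notes refer to.
    """
    t = x + 128
    return (t + (t >> 8)) >> 8


def premultiply(r, g, b, a):
    """Convert one straight-alpha RGBA pixel to premultiplied RGBa.

    Hypothetical helper: each color channel is scaled by alpha / 255.
    """
    return div255(r * a), div255(g * a), div255(b * a), a


if __name__ == "__main__":
    print(premultiply(255, 128, 0, 128))  # (128, 64, 0, 128)
```

For every product of two 8-bit values this gives the same result as a correctly rounded division, which is why this form is convenient in vectorized code where integer division instructions are unavailable or slow.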