NTT’s 5th optimization #1264

Open · wants to merge 95 commits into base: dev/3.0
b061ed4
add dynamic for gather
Nov 5, 2024
7348fef
Update ctest and benchmark case for gather.
zhangyang2057 Nov 5, 2024
dfbfe20
add dynamic ctest
Nov 5, 2024
12a1111
revise function name to snake case
Nov 5, 2024
a92a1b6
change variable name to snake case
Nov 5, 2024
0796b1c
omitted variable names
Nov 6, 2024
1f2d803
Add strides support and more benchmark cases for slice.
zhangyang2057 Nov 6, 2024
f3b2715
Merge branch 'feature/ntt_benchmark_roofline_5' of https://github.com…
Nov 6, 2024
9834c5b
add just for slice benchmark
Nov 6, 2024
973b451
limit to l1 size
Nov 6, 2024
19826e3
Merge branch 'dev/3.0' into feature/ntt_benchmark_roofline_5
zhangyang2057 Nov 6, 2024
621dbb3
change shape for accuracy test
Nov 6, 2024
850d9c4
change for other cases
Nov 6, 2024
45eaaa5
expand dim1 dims
Nov 6, 2024
58bab9a
add ctest for pack and init opt
Nov 7, 2024
bed436b
fallback for some errors
Nov 7, 2024
c827ac9
Fix pack benchmark and add ctest/benchmark test for unpack.
zhangyang2057 Nov 7, 2024
82f6d58
Update ctest and benchmark test of unpack.
zhangyang2057 Nov 8, 2024
d5cdf25
some prepare code for 2d pack
Nov 8, 2024
7e2fa68
Merge branch 'feature/ntt_benchmark_roofline_5' of https://github.com…
Nov 8, 2024
9bfa3f7
opt for 2d pack x86
Nov 11, 2024
7c25189
remove useless code
Nov 11, 2024
d0f68a4
add benchmark info for pack
Nov 12, 2024
d64b093
opt for 2d pack
Nov 12, 2024
3ba0faf
a better way
Nov 12, 2024
43f449e
revise unalign problems
Nov 12, 2024
1f8ce94
change size for benchmark unary test
Nov 13, 2024
dfdb94b
change benchmark for pack
Nov 13, 2024
2da9a19
revise typo in ifdef
Nov 13, 2024
711ce0f
add ctest for pack
Nov 13, 2024
99554df
remove iostream include
Nov 13, 2024
bbdac69
adjust for benchmark shape
Nov 13, 2024
bb8d6a8
Optimize unpack rvv 1D and update roofline.
zhangyang2057 Nov 14, 2024
0c20459
Add missing u_unpack.h.
zhangyang2057 Nov 14, 2024
1386706
Optimize unpack 1D rvv (C/H: 8 -> 7 cycles).
zhangyang2057 Nov 15, 2024
08b4770
Optimize u_unpack 1D and rvv(N/C/H in 7 cycles)
zhangyang2057 Nov 15, 2024
94112e4
Add ranked shape support for unpack 1D.
zhangyang2057 Nov 15, 2024
fee0136
add some opt for 2d pack
Nov 18, 2024
7b519e9
limit the size to l1 size
Nov 18, 2024
ef6cb97
index opts for pack2d
Nov 18, 2024
ee86467
add roofline info
Nov 18, 2024
b2a91ce
remove some typo bugs
Nov 18, 2024
bb2e953
some small opts
Nov 18, 2024
ffb589b
fallback
Nov 18, 2024
ef19cb3
adjust size for l1 limit
Nov 18, 2024
4bf1fe5
add roofline info for pack hw
Nov 18, 2024
0a2d032
more ctest for pack
Nov 18, 2024
5e38186
add more ctest for pack 2d
Nov 18, 2024
b24e218
add roofline info for unpack
Nov 19, 2024
ddf9d59
change unpack benchmark test
Nov 19, 2024
eb7eb69
revise some special unpack
Nov 19, 2024
e2cf8da
update u_unpack_1d as u_unpack_1d_fixed.
zhangyang2057 Nov 19, 2024
07275bf
try to fix macos build.
zhangyang2057 Nov 19, 2024
6c2b712
revise build error for windows
Nov 19, 2024
73ff1c4
revise bug for not continuous
Nov 19, 2024
1657943
change test case for unpack test case
Nov 19, 2024
2dffe92
lower shape info for future opt
Nov 20, 2024
84af5d5
a better way
Nov 20, 2024
38209a5
revise some build problem for riscv
Nov 20, 2024
56a6278
revise typo
Nov 20, 2024
50b8a83
lower Axis info for some opt
Nov 20, 2024
47ce646
Add 2D support for unpack fixed_shape.
zhangyang2057 Nov 20, 2024
39cc7b7
some opt for 1d unpack
Nov 21, 2024
cde9768
update for l1 limit
Nov 21, 2024
fa177bc
lower some shape info for opts
Nov 21, 2024
1bfe0f0
Fix unpack 2d bug and update hw roofline.
zhangyang2057 Nov 21, 2024
40b44b0
revise for align problems
Nov 21, 2024
8df9bf5
Merge branch 'feature/ntt_benchmark_roofline_5' of https://github.com…
Nov 21, 2024
88aaa48
revise segmentfault problem
Nov 21, 2024
d399281
revise clamp align problem
Nov 21, 2024
0b886aa
Optimize unpack 2D.
zhangyang2057 Nov 22, 2024
bfe4e7a
Fix unpack<C, N> 2D bug and add ctest case.
zhangyang2057 Nov 22, 2024
0e718f4
Optimize unpack 2d fixed shape and add ranked shape version.
zhangyang2057 Nov 22, 2024
dd98ff4
Merge branch 'dev/3.0' into feature/ntt_benchmark_roofline_5
zhangyang2057 Nov 22, 2024
806a1d4
add 2d unpack opt
Nov 22, 2024
a9d4da4
change size for gather test
Nov 22, 2024
59bf01b
Add rvv optimization for pack 1D fixed shape.
zhangyang2057 Nov 26, 2024
1676266
uniform output optimization of benchmark test for x86_64.
zhangyang2057 Nov 26, 2024
a06d89c
some change for cast
Nov 27, 2024
9f3fe8c
some build problems for macos
Nov 27, 2024
ff381de
for more build error for clang
Nov 27, 2024
ddf2d95
build error for windows
Nov 27, 2024
d0d0dd9
change cycle error for cast
Nov 27, 2024
64af79d
add special version for bool2float
Nov 27, 2024
5f4cdfb
Add u_pack2d for rvv(NC/CH boost 40%+).
zhangyang2057 Nov 27, 2024
c938b69
opt code for more general
Nov 27, 2024
aabed15
Merge branch 'feature/ntt_benchmark_roofline_5' of https://github.com…
Nov 27, 2024
6fc4503
fix build for macos
Nov 27, 2024
a6360c3
try again for macos build
Nov 27, 2024
85eda40
fallback for no work
Nov 27, 2024
0e833b6
Optimize unpack 2d fixed shape and ranked shape for rvv(7.6 -> 7 cycles)
zhangyang2057 Nov 28, 2024
36f2da6
remove test temporary code
Nov 28, 2024
04b7274
add ctest and benchmark for expand
Nov 28, 2024
6d7a89d
add roofline info keywords
Nov 28, 2024
4f847af
adjust benchmark size and add roofline info
Nov 28, 2024
2 changes: 1 addition & 1 deletion ntt/include/nncase/bfloat16.h
@@ -184,7 +184,7 @@ DEFINE_BF16_BINARY_BOOLRET(>=)
DEFINE_BF16_BINARY_BOOLRET(>)

#define DEFINE_BF16_BINARY_SELF_MOD(x, op) \
-    inline bfloat16 &operator x(bfloat16 & a, bfloat16 b) noexcept {          \
+    inline bfloat16 &operator x(bfloat16 &a, bfloat16 b) noexcept {           \
a = a op b; \
return a; \
}
2 changes: 1 addition & 1 deletion ntt/include/nncase/half.h
@@ -187,7 +187,7 @@ DEFINE_FP16_BINARY_BOOLRET(>=)
DEFINE_FP16_BINARY_BOOLRET(>)

#define DEFINE_FP16_BINARY_SELF_MOD(x, op) \
-    inline half &operator x(half & a, half b) noexcept {                      \
+    inline half &operator x(half &a, half b) noexcept {                       \
a = a op b; \
return a; \
}
824 changes: 819 additions & 5 deletions ntt/include/nncase/ntt/arch/riscv64/ukernels.h

Large diffs are not rendered by default.

35 changes: 35 additions & 0 deletions ntt/include/nncase/ntt/arch/x86_64/primitive_ops.h
@@ -189,6 +189,41 @@ template <> struct cast<ntt::vector<bool, 8>, ntt::vector<float, 8>> {
}
};

// cast
template <> struct cast<ntt::vector<bool, 32>, ntt::vector<float, 8>> {
void operator()(const ntt::vector<bool, 32> &v, ntt::vector<float, 8> &v0,
ntt::vector<float, 8> &v1, ntt::vector<float, 8> &v2,
ntt::vector<float, 8> &v3) const noexcept {
__m256i mask0 = _mm256_setr_epi32(
v(0) ? -1 : 0, v(1) ? -1 : 0, v(2) ? -1 : 0, v(3) ? -1 : 0,
v(4) ? -1 : 0, v(5) ? -1 : 0, v(6) ? -1 : 0, v(7) ? -1 : 0);

// Convert to float (1.0f for true, 0.0f for false)
v0 = _mm256_and_ps(_mm256_castsi256_ps(mask0), _mm256_set1_ps(1.0f));

__m256i mask1 = _mm256_setr_epi32(
v(8) ? -1 : 0, v(9) ? -1 : 0, v(10) ? -1 : 0, v(11) ? -1 : 0,
v(12) ? -1 : 0, v(13) ? -1 : 0, v(14) ? -1 : 0, v(15) ? -1 : 0);

// Convert to float (1.0f for true, 0.0f for false)
v1 = _mm256_and_ps(_mm256_castsi256_ps(mask1), _mm256_set1_ps(1.0f));

__m256i mask2 = _mm256_setr_epi32(
v(16) ? -1 : 0, v(17) ? -1 : 0, v(18) ? -1 : 0, v(19) ? -1 : 0,
v(20) ? -1 : 0, v(21) ? -1 : 0, v(22) ? -1 : 0, v(23) ? -1 : 0);

// Convert to float (1.0f for true, 0.0f for false)
v2 = _mm256_and_ps(_mm256_castsi256_ps(mask2), _mm256_set1_ps(1.0f));

__m256i mask3 = _mm256_setr_epi32(
v(24) ? -1 : 0, v(25) ? -1 : 0, v(26) ? -1 : 0, v(27) ? -1 : 0,
v(28) ? -1 : 0, v(29) ? -1 : 0, v(30) ? -1 : 0, v(31) ? -1 : 0);

// Convert to float (1.0f for true, 0.0f for false)
v3 = _mm256_and_ps(_mm256_castsi256_ps(mask3), _mm256_set1_ps(1.0f));
}
};

// cast
template <> struct cast<ntt::vector<float, 8>, ntt::vector<int, 8>> {
ntt::vector<int, 8>
476 changes: 464 additions & 12 deletions ntt/include/nncase/ntt/arch/x86_64/ukernels.h

Large diffs are not rendered by default.

99 changes: 65 additions & 34 deletions ntt/include/nncase/ntt/kernels/cast.h
@@ -21,64 +21,96 @@

namespace nncase::ntt {
namespace detail {
-template <class Shape, class InStrides, class OutStrides> class cast_impl;
+template <class InShape, class OutShape, class InStrides, class OutStrides>
+class cast_impl;

-template <size_t... Dims, size_t... InStrides, size_t... OutStrides>
-class cast_impl<fixed_shape<Dims...>, fixed_strides<InStrides...>,
-                fixed_strides<OutStrides...>> {
+template <size_t... InDims, size_t... OutDims, size_t... InStrides,
+          size_t... OutStrides>
+class cast_impl<fixed_shape<InDims...>, fixed_shape<OutDims...>,
+                fixed_strides<InStrides...>, fixed_strides<OutStrides...>> {
public:
template <class TIn, class TOut>
constexpr void operator()(const TIn &input, TOut &output) {
-        constexpr size_t rank = sizeof...(Dims);
+        constexpr float scale =
+            (float)TIn::shape().length() / TOut::shape().length();
+
+        if constexpr (scale != 1.0f) {
+            static_assert(TIn::rank() == 1,
+                          "Only support 1D tensor repack for now!");
+        }
+
+        constexpr auto in_offset_scale =
+            scale > 1.0f ? (size_t)scale : (size_t)1;
+        constexpr auto out_offset_scale =
+            scale > 1.0f ? (size_t)1 : (size_t)(1.0f / scale);
+
+        constexpr size_t rank = sizeof...(InDims);
ranked_shape<rank> index{};
constexpr auto conti_dims =
-            std::min(contiguous_dims(fixed_shape<Dims...>{},
+            std::min(contiguous_dims(fixed_shape<InDims...>{},
                      fixed_strides<InStrides...>{}),
-                     contiguous_dims(fixed_shape<Dims...>{},
+                     contiguous_dims(fixed_shape<InDims...>{},
fixed_strides<OutStrides...>{}));
-        apply<TIn, TOut, 0, rank, conti_dims, Dims...>(index, input, output);
+
+        if constexpr (scale >= 1.0f) {
+            apply<in_offset_scale, out_offset_scale, TIn, TOut, 0, rank,
+                  conti_dims, OutDims...>(index, input, output);
+        } else {
+            apply<in_offset_scale, out_offset_scale, TIn, TOut, 0, rank,
+                  conti_dims, InDims...>(index, input, output);
+        }
}

private:
-    template <class TIn, class TOut, size_t Axis, size_t Rank,
-              size_t ContiguousDims, size_t... RestDims>
+    template <size_t in_offset_scale, size_t out_offset_scale, class TIn,
+              class TOut, size_t Axis, size_t Rank, size_t ContiguousDims,
+              size_t... RestDims>
constexpr void apply(ranked_shape<Rank> &index, const TIn &input,
TOut &output) {
if constexpr (ContiguousDims == sizeof...(RestDims)) {
constexpr auto inner_size = fixed_shape<RestDims...>::length();
-            auto input_p =
-                input.elements().data() + linear_offset(index, input.strides());
-            auto output_p = output.elements().data() +
-                            linear_offset(index, output.strides());
-            cast_contiguous<inner_size>(input_p, output_p);
+
+            auto in_offset =
+                linear_offset(index, input.strides()) * in_offset_scale;
+            auto out_offset =
+                linear_offset(index, output.strides()) * out_offset_scale;
+            auto input_p = input.elements().data() + in_offset;
+            auto output_p = output.elements().data() + out_offset;
+            cast_contiguous<in_offset_scale, out_offset_scale, inner_size>(
+                input_p, output_p);
} else {
-            apply_next<TIn, TOut, Axis, Rank, ContiguousDims, RestDims...>(
-                index, input, output);
+            apply_next<in_offset_scale, out_offset_scale, TIn, TOut, Axis, Rank,
+                       ContiguousDims, RestDims...>(index, input, output);
}
}

-    template <class TIn, class TOut, size_t Axis, size_t Rank,
-              size_t ContiguousDims, size_t Dim, size_t... RestDims>
+    template <size_t in_offset_scale, size_t out_offset_scale, class TIn,
+              class TOut, size_t Axis, size_t Rank, size_t ContiguousDims,
+              size_t Dim, size_t... RestDims>
constexpr void apply_next(ranked_shape<Rank> &index, const TIn &input,
TOut &output) {
for (index[Axis] = 0; index[Axis] < Dim; index[Axis]++) {
-            apply<TIn, TOut, Axis + 1, Rank, ContiguousDims, RestDims...>(
-                index, input, output);
+            apply<in_offset_scale, out_offset_scale, TIn, TOut, Axis + 1, Rank,
+                  ContiguousDims, RestDims...>(index, input, output);
}
}

-    template <size_t Extent, class T1, class T2>
+    template <size_t in_offset_scale, size_t out_offset_scale, size_t Extent,
+              class T1, class T2>
     constexpr void cast_contiguous(const T1 *input, T2 *output) {
-        ntt::u_cast(input, 1, output, 1, Extent);
+        ntt::u_cast<T1, T2, in_offset_scale, out_offset_scale>(input, 1, output,
+                                                               1, Extent);
}
};

-template <size_t Rank, class InStrides, class OutStrides>
-class cast_impl<ranked_shape<Rank>, InStrides, OutStrides> {
+template <size_t InRank, size_t OutRank, class InStrides, class OutStrides>
+class cast_impl<ranked_shape<InRank>, ranked_shape<OutRank>, InStrides,
+                OutStrides> {
public:
template <class TIn, class TOut>
constexpr void operator()(const TIn &input, TOut &output) {
-        ranked_shape<Rank> index{};
+        ranked_shape<InRank> index{};
auto conti_dims =
std::min(contiguous_dims(input.shape(), input.strides()),
contiguous_dims(input.shape(), output.strides()));
@@ -87,9 +119,9 @@ class cast_impl<ranked_shape<Rank>, InStrides, OutStrides> {

private:
template <class TIn, class TOut, size_t Axis>
-    constexpr void apply(ranked_shape<Rank> &index, size_t conti_dims,
+    constexpr void apply(ranked_shape<InRank> &index, size_t conti_dims,
                          const TIn &input, TOut &output) {
-        const auto outer_dims = Rank - conti_dims;
+        const auto outer_dims = InRank - conti_dims;
if (Axis >= outer_dims) {
size_t inner_size = 1;
for (size_t i = outer_dims; i < input.shape().rank(); i++)
@@ -99,7 +131,7 @@ class cast_impl<ranked_shape<Rank>, InStrides, OutStrides> {
auto output_p =
output.buffer().data() + linear_offset(index, output.strides());
cast_contiguous(input_p, output_p, inner_size);
-        } else if constexpr (Axis < Rank - 1) {
+        } else if constexpr (Axis < InRank - 1) {
const auto dim = input.shape()[Axis];
for (index[Axis] = 0; index[Axis] < dim; index[Axis]++) {
apply<TIn, TOut, Axis + 1>(index, conti_dims, input, output);
Expand All @@ -109,17 +141,16 @@ class cast_impl<ranked_shape<Rank>, InStrides, OutStrides> {

template <class T1, class T2>
constexpr void cast_contiguous(const T1 *input, T2 *output, size_t extent) {
-        ntt::u_cast(input, 1, output, 1, extent);
+        ntt::u_cast<T1, T2, 1, 1>(input, 1, output, 1, extent);
}
};
} // namespace detail

template <typename TIn, typename TOut>
void cast(const TIn &input, TOut &&output) noexcept {
-    detail::cast_impl<common_shape_t<typename TIn::shape_type,
-                                     typename std::decay_t<TOut>::shape_type>,
-                      typename TIn::strides_type,
-                      typename std::decay_t<TOut>::strides_type>
+    detail::cast_impl<
+        typename TIn::shape_type, typename std::decay_t<TOut>::shape_type,
+        typename TIn::strides_type, typename std::decay_t<TOut>::strides_type>
impl;
impl(input, output);
}