-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cilk reducer bug with avx512 vectors #238
Comments
It looks like your I need to set things up on a machine with AVX512 instructions to confirm the issue, unless you can check it first. You might try running ASan on your code to see if it detects a memory issue. The Cilksan segfault might be related to the same issue, if my guess is correct. I'll take a look at that. |
The data size should be a power of 2 std::vector<double> make_data(size_t logN) {
std::vector<double> data(1UL << logN);
for (size_t i = 0; i < (1UL << logN); i++) {
data[i] = i;
}
return data;
} |
Ah, I missed that. Thanks for pointing it out. |
Also, I just tried running with asan and got no error in a run with bad results
|
So I managed to simplify it a bit, switched from doubles to uint64_t , and got rid of a square. Now both approaches that use vectorized reducers clearly give the wrong answer and do so consistently. Here is the updated code #include <cstdlib>
#include <cilk/cilk.h>
#include <cilk/opadd_reducer.h>
#include <immintrin.h>
#include <iostream>
#include <vector>
void zero_vec_512(void *view) {
*((__m512i *)view) = _mm512_setzero_si512(); // Initialize sum as zero
}
void add_vec_512(void *left, void *right) {
*((__m512i *)left) =
_mm512_add_epi64(*((__m512i *)left), *((__m512i *)right));
}
void zero_vec_256(void *view) {
*((__m256i *)view) = _mm256_setzero_si256(); // Initialize sum as zero
}
void add_vec_256(void *left, void *right) {
*((__m256i *)left) =
_mm256_add_epi64(*((__m256i *)left), *((__m256i *)right));
}
std::vector<uint64_t> make_data(size_t logN) {
std::vector<uint64_t> data(1UL << logN);
for (size_t i = 0; i < (1UL << logN); i++) {
data[i] = i;
}
return data;
}
uint64_t sum_serial(const auto &data) {
uint64_t sum(0);
for (std::size_t i = 0; i < data.size(); ++i) {
sum += data[i];
}
return sum;
}
uint64_t sum_parallel(const auto &data) {
cilk::opadd_reducer<uint64_t> sum(0);
cilk_for(std::size_t i = 0; i < data.size(); ++i) { sum += data[i]; }
return sum;
}
uint64_t sum256(const auto &data) {
__m256i cilk_reducer(zero_vec_256, add_vec_256) sum_vec;
cilk_for(std::size_t i = 0; i < data.size(); i += 4) {
__m256i data_vec = _mm256_loadu_si256((__m256i *)(data.data() + i));
// data_vec = _mm256_mullo_epi64(data_vec, data_vec);
sum_vec += data_vec;
}
uint64_t sum_array[4];
_mm256_storeu_epi64(sum_array, sum_vec); // Store the results back to an array
uint64_t sum = sum_array[0] + sum_array[1] + sum_array[2] + sum_array[3];
return sum;
}
uint64_t sum512_parallel(const auto &data) {
__m512i cilk_reducer(zero_vec_512, add_vec_512) sum_vec;
cilk_for(std::size_t i = 0; i < data.size(); i += 8) {
__m512i data_vec = _mm512_loadu_si512(data.data() + i);
// data_vec = _mm512_mullo_epi64(data_vec, data_vec);
sum_vec += data_vec;
}
return _mm512_reduce_add_epi64(sum_vec);
}
uint64_t sum512_serial(const auto &data) {
__m512i sum_vec = _mm512_setzero_si512();
for (std::size_t i = 0; i < data.size(); i += 8) {
__m512i data_vec = _mm512_loadu_si512(data.data() + i);
// data_vec = _mm512_mullo_epi64(data_vec, data_vec);
sum_vec += data_vec;
}
return _mm512_reduce_add_epi64(sum_vec);
}
int main(int argc, char *argv[]) {
auto data = make_data(20);
std::cout << "data size = " << data.size() << "\n";
std::cout << "serial no vectorization: " << sum_serial(data) << "\n";
std::cout << "parallel no vectorization: " << sum_parallel(data) << "\n";
std::cout << "parallel 256: " << sum256(data) << "\n";
std::cout << "serial 512: " << sum512_serial(data) << "\n";
std::cout << "parallel 512: " << sum512_parallel(data) << "\n";
}
output
|
You need to initialize your reducer variables, just like you would initialize them in the serial projection. For example: uint64_t sum512_parallel(const auto &data) {
- __m512i cilk_reducer(zero_vec_512, add_vec_512) sum_vec;
+ __m512i cilk_reducer(zero_vec_512, add_vec_512) sum_vec = _mm512_setzero_si512();
cilk_for(std::size_t i = 0; i < data.size(); i += 8) {
__m512i data_vec = _mm512_loadu_si512(data.data() + i);
// data_vec = _mm512_mullo_epi64(data_vec, data_vec);
sum_vec += data_vec;
}
return _mm512_reduce_add_epi64(sum_vec);
} |
Ah thank you, my mistake, I thought that was handled with the identity function. |
Describe the bug
Trying to use a vector reducer and get non deterministic results only with avx512 vectors, not 256 bit ones
Expected behavior
deterministic results in summing a vector
OpenCilk version
clang version 16.0.6 (https://github.com/OpenCilk/opencilk-project df80b7d)
System information
VERSION="20.04.1 LTS (Focal Fossa)"
c5.xlargs from aws
Steps to reproduce (include relevant output)
code
build with
opencilk16/build/bin/clang++ -fopencilk -march=native -std=c++20 del.cpp
It should output
but sometimes either the 3rd or 4th row is equal to 3.7586e+179
I know that is one of the serial ones, but I was never able to reproduce running just the serial one, so I think its something else. When trying to run with cilksan I got segfaults
The text was updated successfully, but these errors were encountered: