Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cilk reducer bug with avx512 vectors #238

Closed
wheatman opened this issue Feb 20, 2024 · 7 comments
Closed

cilk reducer bug with avx512 vectors #238

wheatman opened this issue Feb 20, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@wheatman
Copy link
Contributor

Describe the bug

Trying to use a vector reducer and get non deterministic results only with avx512 vectors, not 256 bit ones

Expected behavior

deterministic results in summing a vector

OpenCilk version

clang version 16.0.6 (https://github.com/OpenCilk/opencilk-project df80b7d)

System information

VERSION="20.04.1 LTS (Focal Fossa)"
c5.xlargs from aws

Steps to reproduce (include relevant output)

code

#include <cstdlib>

#include <cilk/cilk.h>
#include <cilk/opadd_reducer.h>
#include <immintrin.h>
#include <iostream>
#include <vector>

void zero_vec_512(void *view) {
  *((__m512d *)view) = _mm512_setzero_pd(); // Initialize sum as zero
}

void add_vec_512(void *left, void *right) {
  *((__m512d *)left) = _mm512_add_pd(*((__m512d *)left), *((__m512d *)right));
}

void zero_vec_256(void *view) {
  *((__m256d *)view) = _mm256_setzero_pd(); // Initialize sum as zero
}

void add_vec_256(void *left, void *right) {
  *((__m256d *)left) = _mm256_add_pd(*((__m256d *)left), *((__m256d *)right));
}

std::vector<double> make_data(size_t logN) {
  std::vector<double> data(1UL << logN);
  for (size_t i = 0; i < (1UL << logN); i++) {
    data[i] = i;
  }
  return data;
}

double sum(const auto &data) {
  cilk::opadd_reducer<double> sum(0.0);
  cilk_for(std::size_t i = 0; i < data.size(); ++i) {
    sum += data[i] * data[i];
  }
  return sum;
}

double sum256(const auto &data) {
  __m256d cilk_reducer(zero_vec_256, add_vec_256) sum_vec;
  cilk_for(std::size_t i = 0; i < data.size(); i += 4) {
    __m256d data_vec = _mm256_loadu_pd(data.data() + i);
    data_vec = _mm256_mul_pd(data_vec, data_vec);
    sum_vec += data_vec;
  }
  double sum_array[4];
  _mm256_storeu_pd(sum_array, sum_vec); // Store the results back to an array
  double sum = sum_array[0] + sum_array[1] + sum_array[2] + sum_array[3];
  return sum;
}

double sum512_parallel(const auto &data) {
  __m512d cilk_reducer(zero_vec_512, add_vec_512) sum_vec;
  cilk_for(std::size_t i = 0; i < data.size(); i += 8) {
    __m512d data_vec = _mm512_loadu_pd(data.data() + i);
    data_vec = _mm512_mul_pd(data_vec, data_vec);
    sum_vec += data_vec;
  }
  return _mm512_reduce_add_pd(sum_vec);
}

double sum512_serial(const auto &data) {
  __m512d sum_vec;
  for (std::size_t i = 0; i < data.size(); i += 8) {
    __m512d data_vec = _mm512_loadu_pd(data.data() + i);
    data_vec = _mm512_mul_pd(data_vec, data_vec);
    sum_vec += data_vec;
  }
  return _mm512_reduce_add_pd(sum_vec);
}

int main(int argc, char *argv[]) {
  auto data = make_data(20);
  std::cout << sum(data) << "\n";
  std::cout << sum256(data) << "\n";
  std::cout << sum512_serial(data) << "\n";
  std::cout << sum512_parallel(data) << "\n";
}

build with opencilk16/build/bin/clang++ -fopencilk -march=native -std=c++20 del.cpp

It should output

3.84307e+17
3.84307e+17
3.84307e+17
3.84307e+17

but sometimes either the 3rd or 4th row is equal to 3.7586e+179

I know that is one of the serial ones, but I was never able to reproduce running just the serial one, so I think its something else. When trying to run with cilksan I got segfaults

(gdb) bt
#0  0x00005555555e8f21 in zero_vec_256(void*) ()
#1  0x00005555555e7d9d in create_reducer_view ()
    at /home/ubuntu/opencilk16/cilktools/cilksan/cilksan_internal.h:192
#2  0x00005555555eb46a in sum256<std::vector<double, std::allocator<double> > > ()
#3  0x00005555555ed5d3 in double sum256<std::vector<double, std::allocator<double> > >(std::vector<double, std::allocator<double> > const&) ()
#4  0x00005555555e9974 in main ()
@wheatman wheatman added the bug Something isn't working label Feb 20, 2024
@neboat
Copy link
Collaborator

neboat commented Feb 20, 2024

It looks like your data vector stores a number of elements — 20 — that is not divisible by 8, the number of elements in an AVX512 register. I'm guessing that causes your AVX512 codes to read invalid memory. Although the code might get lucky most of the time in seeing only 0 values in those invalid locations, the Cilk reducer mechanism does memory allocations internally, which might result in those AVX512 operations reading non-zero values from time to time.

I need to set things up on a machine with AVX512 instructions to confirm the issue, unless you can check it first. You might try running ASan on your code to see if it detects a memory issue.

The Cilksan segfault might be related to the same issue, if my guess is correct. I'll take a look at that.

@wheatman
Copy link
Contributor Author

The data size should be a power of 2

std::vector<double> make_data(size_t logN) {
  std::vector<double> data(1UL << logN);
  for (size_t i = 0; i < (1UL << logN); i++) {
    data[i] = i;
  }
  return data;
}

@neboat
Copy link
Collaborator

neboat commented Feb 20, 2024

Ah, I missed that. Thanks for pointing it out.

@wheatman
Copy link
Contributor Author

Also, I just tried running with asan and got no error in a run with bad results

$ opencilk16/build/bin/clang++ -fopencilk -march=native -std=c++20 -fsanitize=address del.cpp
$ ./a.out 
3.84307e+17
3.84307e+17
3.7586e+179
3.7586e+179

@wheatman
Copy link
Contributor Author

So I managed to simplify it a bit, switched from doubles to uint64_t , and got rid of a square. Now both approaches that use vectorized reducers clearly give the wrong answer and do so consistently.

Here is the updated code

#include <cstdlib>

#include <cilk/cilk.h>
#include <cilk/opadd_reducer.h>
#include <immintrin.h>
#include <iostream>
#include <vector>

void zero_vec_512(void *view) {
  *((__m512i *)view) = _mm512_setzero_si512(); // Initialize sum as zero
}

void add_vec_512(void *left, void *right) {
  *((__m512i *)left) =
      _mm512_add_epi64(*((__m512i *)left), *((__m512i *)right));
}

void zero_vec_256(void *view) {
  *((__m256i *)view) = _mm256_setzero_si256(); // Initialize sum as zero
}

void add_vec_256(void *left, void *right) {
  *((__m256i *)left) =
      _mm256_add_epi64(*((__m256i *)left), *((__m256i *)right));
}

std::vector<uint64_t> make_data(size_t logN) {
  std::vector<uint64_t> data(1UL << logN);
  for (size_t i = 0; i < (1UL << logN); i++) {
    data[i] = i;
  }
  return data;
}

uint64_t sum_serial(const auto &data) {
  uint64_t sum(0);
  for (std::size_t i = 0; i < data.size(); ++i) {
    sum += data[i];
  }
  return sum;
}

uint64_t sum_parallel(const auto &data) {
  cilk::opadd_reducer<uint64_t> sum(0);
  cilk_for(std::size_t i = 0; i < data.size(); ++i) { sum += data[i]; }
  return sum;
}

uint64_t sum256(const auto &data) {
  __m256i cilk_reducer(zero_vec_256, add_vec_256) sum_vec;
  cilk_for(std::size_t i = 0; i < data.size(); i += 4) {
    __m256i data_vec = _mm256_loadu_si256((__m256i *)(data.data() + i));
    // data_vec = _mm256_mullo_epi64(data_vec, data_vec);
    sum_vec += data_vec;
  }
  uint64_t sum_array[4];
  _mm256_storeu_epi64(sum_array, sum_vec); // Store the results back to an array
  uint64_t sum = sum_array[0] + sum_array[1] + sum_array[2] + sum_array[3];
  return sum;
}

uint64_t sum512_parallel(const auto &data) {
  __m512i cilk_reducer(zero_vec_512, add_vec_512) sum_vec;
  cilk_for(std::size_t i = 0; i < data.size(); i += 8) {
    __m512i data_vec = _mm512_loadu_si512(data.data() + i);
    // data_vec = _mm512_mullo_epi64(data_vec, data_vec);
    sum_vec += data_vec;
  }
  return _mm512_reduce_add_epi64(sum_vec);
}

uint64_t sum512_serial(const auto &data) {
  __m512i sum_vec = _mm512_setzero_si512();
  for (std::size_t i = 0; i < data.size(); i += 8) {
    __m512i data_vec = _mm512_loadu_si512(data.data() + i);
    // data_vec = _mm512_mullo_epi64(data_vec, data_vec);
    sum_vec += data_vec;
  }
  return _mm512_reduce_add_epi64(sum_vec);
}

int main(int argc, char *argv[]) {
  auto data = make_data(20);
  std::cout << "data size = " << data.size() << "\n";
  std::cout << "serial no vectorization:   " << sum_serial(data) << "\n";
  std::cout << "parallel no vectorization: " << sum_parallel(data) << "\n";
  std::cout << "parallel 256:              " << sum256(data) << "\n";
  std::cout << "serial 512:                " << sum512_serial(data) << "\n";
  std::cout << "parallel 512:              " << sum512_parallel(data) << "\n";
}

output

data size = 1048576
serial no vectorization:   549755289600
parallel no vectorization: 549755289600
parallel 256:              281640026961247
serial 512:                549755289600
parallel 512:              516246975623928

@neboat
Copy link
Collaborator

neboat commented Feb 20, 2024

You need to initialize your reducer variables, just like you would initialize them in the serial projection. For example:

uint64_t sum512_parallel(const auto &data) {
-  __m512i cilk_reducer(zero_vec_512, add_vec_512) sum_vec;
+  __m512i cilk_reducer(zero_vec_512, add_vec_512) sum_vec = _mm512_setzero_si512();
  cilk_for(std::size_t i = 0; i < data.size(); i += 8) {
    __m512i data_vec = _mm512_loadu_si512(data.data() + i);
    // data_vec = _mm512_mullo_epi64(data_vec, data_vec);
    sum_vec += data_vec;
  }
  return _mm512_reduce_add_epi64(sum_vec);
}

@wheatman
Copy link
Contributor Author

Ah thank you, my mistake, I thought that was handled with the identity function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants