Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed fix to improve pybind11 dynamic type casting #85

Merged
merged 56 commits into from
Oct 16, 2024
Merged

Conversation

The9Cat
Copy link
Contributor

@The9Cat The9Cat commented Sep 26, 2024

Problem

The pyEXP.coefs.setMatrix() routines fail to cast ndarray with complex type to Eigen::MatrixXd. Not all compiles demonstrated the problem. But recent compiles in the Docker container are affected.

Proposed fix

Use Eigen::Ref<Eigen::ComplexXcd> to pass arrays in setMatrix() rather than const Eigen::MatrixXcd&. Tests show that this fixed the issue.

To Do
The implementation has been verified on the failing case but more testing is needed.

@The9Cat
Copy link
Contributor Author

The9Cat commented Sep 28, 2024

Changed the binding interface to allow both copies and references. This makes implies extra copies of the setMatrix() data blocks, but probably a worthwhile trade off for ease of use.

@The9Cat
Copy link
Contributor Author

The9Cat commented Sep 28, 2024

To Do

  • Confirm that these changes compiling correctly and solve the setMatrix() issues both natively and in the Docker container.
  • Consider adding a regression test to CMake tests

Copy link
Member

@michael-petersen michael-petersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been working with this branch for two weeks, and no workflows have been affected. I wasn't demonstrating the problem originally, so as long as that is reported fixed, I'm happy.

@@ -196,7 +196,8 @@ void FieldGeneratorClasses(py::module &m) {

See also
--------
slices : generate fields in a surface slice given by the initializtion grid
slices : generate fields in a surface slice given by the
initializtion grid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
initializtion grid
initialization grid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks!

@@ -281,7 +282,8 @@ void FieldGeneratorClasses(py::module &m) {

See also
--------
slices : generate fields in a surface slice given by the initializtion grid
slices : generate fields in a surface slice given by the
initializtion grid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
initializtion grid
initialization grid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks!

@@ -314,7 +316,8 @@ void FieldGeneratorClasses(py::module &m) {

See also
--------
slices : generate fields in a surface slice given by the initializtion grid
slices : generate fields in a surface slice given by the
initializtion grid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
initializtion grid
initialization grid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now added CTest pyEXP exercises for setMatrix() and createFromArray(). It seems that this bug bit Sóley while running a test notebook from Nico. Ooops.

@The9Cat
Copy link
Contributor Author

The9Cat commented Oct 12, 2024

Summary for this final (hopefully) commit

  • Fix a typo in the test dependencies for expNbodyTest (erroneously called makeNbodyTest in two locations)
  • Add a Python script to read a coefficient file, use setMatrix() to zero all odd-order coefficients and confirm that the new coefficieint object has zero odd-order coefficients.
  • Add a Python script to read a body file and use createFromArray() to make a coefficient structure and add it to a coefficient object.
  • Checked that all tests succeed

Probably good to go now...

@michael-petersen
Copy link
Member

Great addition of tests!

The tests pass locally for me, but why are they failing in CI? That nothing is getting printed in Halo/createCoefs.py suggests to me the failure is early on.

@@ -44,11 +44,11 @@
ypos.append(random.random()*2.0 - 1.0)
zpos.append(random.random()*2.0 - 1.0)

exit(0)

print("---- createFromArray usings lists")
coef1 = basis.createFromArray(mass, [xpos, ypos, zpos], time=3.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the failure in the CI tests must be coming from this line -- but I can't reproduce it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was inserted to try to bisect the error. Thought I removed that.

Not sure what to do here; if we can't reproduce the error except in the runner where we can't debug it. Not really a solution, but the test seems to work fine on native Ubuntu builds, in the Docker container, and for you with Clang, right?

I'm going to label this test "long" and merge to main unless you have another idea?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I've figured it out, and fixed the script. In short, the Python numpy array objects need to persist. Even though I got rid of the Eigen::Ref which was hosing the interface in some builds, the data is passed by reference but if the data is wiped by an assignment, the memory gets stomped on access from the C++ side. To test, I changed the script to 'hold' the instance by giving it a unique name, and the script now works in CI.

However, this could lead to Python users writing code that breaks, depending on their workflow. So now, the question is: should we eliminate pass by reference for setMatrix() and createFromAarray() and force copies? I'm sort of thinking, yes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the use case of a non-unique name is probably going to happen, so it makes sense to eliminate the pass by reference. The copies can't be that large (at least compared to RAM, I hope), so probably no harm in the copying strategy.

Martin D. Weinberg and others added 3 commits October 13, 2024 10:34
One more try just to make sure that this fails on all `createFromArrry` statements.
More bisection...
@The9Cat
Copy link
Contributor Author

The9Cat commented Oct 14, 2024

Summary

Linked the bug in the github CI runner to a pybind11 issue

Details

I built the act local github runner generator with a tweaked build.yml that built with with debugging symbols and included gdb. When the code died in the same way (with a seg. fault), I logged into the docker image and fired up gdb. I was able to trace the failure to a pybind11 wrapper for an eigen3 object that was pointing at a bad address. I then updated the pybind11 git submodule and that fixed the problem! Still not sure why it failed in this particular build and not others, but I feel more comfortable that the native local builds are consistent with the github CI at this point.

To Do

I suppose we should make sure that nothing else is broken but that does seem unlikely at this point. Then I'm hoping we can merge.

Reverted from `Debug` to `Release`
@The9Cat
Copy link
Contributor Author

The9Cat commented Oct 15, 2024

Another (very minor) question is: do we leave the build.yml as it is current (with the 4 builds) or scale it back? Currently, I believe that all four are identical: ubuntu-latest is ubuntu-22.04 and openmpi will use gcc for mpicc anyway, so I don't think those are probing anything new. At some point, ubuntu-latest will be ubuntu-24.04, I assume. Then that might be useful since ubuntu-22.04 images are still used in the wild.

@michael-petersen
Copy link
Member

I think we want some sort of matrix; currently ubuntu-latest is 24.04 (at least according to this log), but I agree with you on mpicc. One option would be to replace that with clang, but I imagine that would fail unless the rest of the environment was built against clang?

@The9Cat
Copy link
Contributor Author

The9Cat commented Oct 15, 2024

The log file from the ubuntu-latest run shows 22.04. But, I assume that it will be 24.04 at some point. I'll try the clang tag in act and see what happens...I think recent clang versions should work on straight Ubuntu.

The9Cat and others added 7 commits October 15, 2024 15:09
test gcc and clang matrix
Another try.
Need OpenMP support for clang
Need flags for clang++ to use stdc++ lib.
Remove hardwired use of llvm's libc++
Use apt numpy package for 24.04
@The9Cat
Copy link
Contributor Author

The9Cat commented Oct 15, 2024

Summary of last bunch of commits

  • Updated build.yml to run gcc and clang C and C++ compiler pairs using include and exclude stanzas
  • Updated the top-level CMakeLists.txt to eliminate the previous default setting of the clang++ -stdlib=... flag. This should be set, if needed, by the local build configuration.
  • Fixed the missing C++ override decoration to eliminate chatty warnings from clang++. Some additional non-standard (according to clang++) grammar can be fixed later. The standard says that sizes must be known at compile time but most compilers allow this. At this point, there is no clean workaround. We could use combinations of std::vector<> and auto array = std::make_unique<int[]>(size); for this where size is a variable if we really want to comply. I have a commit for this stuff, but I'm reluctant to push that here without explicit testing.

Martin D. Weinberg added 2 commits October 15, 2024 20:34
…dentified by Clang and Clang++. They are all minor. The result passes 'ctest -L long' but pyEXP-example notebooks have not been run.
@The9Cat The9Cat mentioned this pull request Oct 16, 2024
@The9Cat
Copy link
Contributor Author

The9Cat commented Oct 16, 2024

Looks like we are good to go. All merged and checked.

@The9Cat The9Cat merged commit 922d165 into main Oct 16, 2024
12 checks passed
@The9Cat The9Cat deleted the pybind11fix branch October 16, 2024 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants