Vectorisation sprint #654

Closed
wants to merge 107 commits into from

@sv2518 sv2518 commented Mar 1, 2022

Implements automatic cross-element vectorisation. This is work by TJ and Kaushik.

See firedrakeproject/firedrake#2365 for firedrake CI runs.
The corresponding loopy PR is inducer/loopy#557.

Big thanks to @kaushikcfd for working hard on an update of this so that we can get it merged.

The mechanism in PyOP2

  1. Check if kernel is vectorisable (see list below)
  2. Change target of the kernel to CVectorExtensionsTarget
  3. Inline all inner kernels in the wrapper kernel
  4. Align all temporaries
  5. Decide which instructions cannot be vectorised with extensions (see list below)
  6. Shift the bound of the loop index (iname) to vectorise over so that it starts from 0
  7. Split the iname according to the SIMD length of the architecture, which is determined with py-cpuinfo
  8. Break the temporaries (add a new axis to each temporary) and index them with the provided iname
  9. Tag axes to vectorise over
  10. Tag the iname to vectorise with lp.VectorizeTag(lp.OpenMPSIMDTag()). VectorizeTag indicates that we try vector extensions first; if an instruction can't be vectorised that way, we fall back to OpenMPSIMDTag, which wraps the instruction in OpenMP SIMD pragmas
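
Steps 6 to 8 can be illustrated without loopy. The following is only a sketch of the index arithmetic, not the code in this PR: `simd_width`, `shift_and_split` and `break_temporary` are hypothetical helpers, the flag-to-width table is an assumption, and placing the new temporary axis last is also an assumption.

```python
# Hypothetical helpers for illustration only; none of these are PyOP2 or loopy API.

# Assumed mapping from CPU feature flags to vector register width in bits.
_VECTOR_BITS = (("avx512f", 512), ("avx2", 256), ("avx", 256), ("sse2", 128))


def simd_width(cpu_flags, itemsize=8):
    """Number of `itemsize`-byte lanes in the widest vector unit advertised.

    `cpu_flags` is an iterable of flag strings, for example the "flags" entry
    of the dict returned by py-cpuinfo's get_cpu_info().
    Falls back to 1 (no vectorisation) if no known flag is present.
    """
    flags = set(cpu_flags)
    for flag, bits in _VECTOR_BITS:
        if flag in flags:
            return bits // (8 * itemsize)
    return 1


def shift_and_split(start, end, width):
    """Rewrite `for n in range(start, end)` as an (outer, inner) loop nest.

    Step 6: shift the loop so it starts at 0, i.e. n = shifted + start.
    Step 7: split shifted = outer * width + inner with 0 <= inner < width.
    Returns one list of original indices per outer iteration; the last
    batch is shorter when the extent is not a multiple of `width`.
    """
    extent = end - start
    batches = []
    for outer in range((extent + width - 1) // width):
        batch = []
        for inner in range(width):
            shifted = outer * width + inner
            if shifted < extent:  # guard for the remainder batch
                batch.append(shifted + start)
        batches.append(batch)
    return batches


def break_temporary(shape, width):
    """Step 8: add an axis of length `width` so each SIMD lane has its own copy.

    Assumption: the new axis is appended as the trailing axis.
    """
    return tuple(shape) + (width,)
```

For example, with `width = simd_width(["sse2", "avx", "avx2"])` (four doubles per vector under the assumed table), `shift_and_split(3, 13, width)` groups the ten elements into two full batches and one remainder batch of two.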

Kernels which cannot be vectorised

  • Kernels which assemble matrices
  • Kernels which use complex types
  • Kernels with read-write access arguments
  • Kernels which generate the extrusion coordinates
  • Kernels with conditionals
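
The exclusion list above amounts to a simple eligibility check. A minimal sketch follows, assuming a hypothetical `KernelInfo` summary whose fields mirror the five bullets; the real check in PyOP2 inspects the kernel itself and may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelInfo:
    """Hypothetical summary of a kernel; not a PyOP2 class."""
    assembles_matrix: bool = False
    uses_complex: bool = False
    has_read_write_args: bool = False
    generates_extrusion_coords: bool = False
    has_conditionals: bool = False


def is_vectorisable(info):
    """Return (ok, reasons), where `reasons` lists every exclusion that applies."""
    checks = [
        (info.assembles_matrix, "matrix assembly"),
        (info.uses_complex, "complex types"),
        (info.has_read_write_args, "read-write access arguments"),
        (info.generates_extrusion_coords, "extrusion coordinates"),
        (info.has_conditionals, "conditionals"),
    ]
    reasons = [message for applies, message in checks if applies]
    return (not reasons, reasons)
```

Returning the reasons alongside the boolean makes it easy to log why a given kernel was left unvectorised.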

Single instructions which cannot be vectorised

  • Instructions outside the loop that was split (because they don't depend on the loop index we vectorise over)
  • Constant literal temporaries on the RHS (because we cannot just index into them)
  • Instructions with calls to Slate inverses and solves (gcc could vectorise these, since Kaushik extended the solve and inverse callables in PyOP2 with strided versions, but clang can't)
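
The per-instruction decision can be pictured as a small classification. This is a hedged sketch: `choose_strategy` and its three boolean inputs are hypothetical, and treating instructions outside the split loop as plain scalar code is an assumption rather than something stated in this PR.

```python
def choose_strategy(depends_on_vector_iname,
                    reads_constant_literal_temp,
                    calls_slate_inverse_or_solve):
    """Pick how a single instruction is emitted (illustrative only).

    "scalar":            assumed handling for instructions outside the split loop
    "omp_simd":          fallback, instruction wrapped in OpenMP SIMD pragmas
    "vector_extensions": instruction expressed with compiler vector extensions
    """
    if not depends_on_vector_iname:
        return "scalar"
    if reads_constant_literal_temp or calls_slate_inverse_or_solve:
        return "omp_simd"
    return "vector_extensions"
```

The ordering matters: the dependence on the vectorised loop index is checked first, because the other two criteria only make sense for instructions inside that loop.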

@connorjward connorjward left a comment

I think there are going to be a few naming errors from my refactoring. I've labelled where they are.

pyop2/global_kernel.py (three resolved review comments, now outdated)
@sv2518 sv2518 requested a review from connorjward May 24, 2022 16:00

@connorjward connorjward left a comment

Only one comment. Otherwise looks good AFAICT. You should definitely run the Firedrake test suite to make sure you haven't broken anything by accident.

pyop2/configuration.py (one resolved review comment, now outdated)
@sv2518 sv2518 mentioned this pull request Jul 7, 2022
@connorjward

Closing, as #677 is a newer version of the same work.

6 participants