Vectorisation sprint #654

Closed
wants to merge 107 commits into from
Changes from 21 commits
Commits
2a8d17c
codegen: Implement SIMD vectorisation
tj-sun Apr 11, 2019
fbc6e4a
add omp simd vectorization mode
tj-sun Aug 1, 2019
5ae780d
add openmp flag and by pass workaround flag
tj-sun Aug 4, 2019
ba693dc
DROP BEFORE MERGE: test with correct loopy branch
wence- Apr 11, 2019
4ec0769
Turn of tree vectorize for certain gcc compilers. We might not need t…
sv2518 Jul 1, 2020
f9e60fd
Add simd compiler flags.
sv2518 Jul 1, 2020
00e073d
Remove time configuration.
sv2518 Jul 1, 2020
1cf7698
Default SIMD width.
sv2518 Jul 1, 2020
3e66946
Generate CVec Target with batch size infomation and move typedef into…
sv2518 Jul 3, 2020
1238ce8
Move zero declaration to loopy code base to be more robust in naming …
sv2518 Jul 3, 2020
1d54777
Added conditionals when to vectorise:
sv2518 Jul 15, 2020
b369213
Drop omp vectorisation.
sv2518 Jul 15, 2020
1c6346e
Add -march=native everywhere.
sv2518 Jul 16, 2020
856b6aa
Silence warnings.
sv2518 Jul 22, 2020
5e52ce1
Change vector tag.
sv2518 Aug 24, 2020
537c14c
Give more control over vectorisation to PyOP2.
sv2518 Sep 1, 2020
9317654
Naming adaption.
sv2518 Sep 1, 2020
6723b6a
Realize ilp first.
sv2518 Sep 1, 2020
38ebc8a
Jenkins.
sv2518 Sep 1, 2020
32b2910
Merge branch 'master' into vectorisation-restructure-checks
sv2518 Feb 28, 2022
944c6cf
DBM: run against new loopy branch
sv2518 Mar 1, 2022
3a1eb24
Lint
sv2518 Mar 1, 2022
681e315
More adapations to new PyOP2
sv2518 Mar 1, 2022
48d6142
More adapations to new PyOP2
sv2518 Mar 1, 2022
792c8f0
DBM take the correct branch
sv2518 Mar 3, 2022
2469870
Adapt to new PyOP2 and vectorisation
sv2518 Mar 3, 2022
4bbcde5
Adapt to new PyOP2 and vectorisation
sv2518 Mar 3, 2022
a5c0455
Fix return wrapper with kernel not kernel
sv2518 Mar 3, 2022
c374031
We do need to inline bc Implementing transforms that apply cleanly ac…
sv2518 Mar 3, 2022
e7d31eb
First split then tag because loopy does not support retaggin of iname…
sv2518 Mar 3, 2022
56a8dde
tag_array_axes requires us to specify the tags for each dimension of …
sv2518 Mar 3, 2022
d1171b3
Fix
sv2518 Mar 3, 2022
0641c75
fix
sv2518 Mar 3, 2022
644842e
improve comments
sv2518 Mar 3, 2022
9e58b22
tag only non-constant arrays with vec axes
kaushikcfd Mar 3, 2022
3f133fd
Only vectorise when local kernel is a loopy thing.
sv2518 Mar 4, 2022
dcd0b69
shift iel-loop to have lbound of 0
kaushikcfd Mar 4, 2022
907fe58
Fix import
sv2518 Mar 6, 2022
ca2aaaf
Debug: try with newer python version
sv2518 Mar 6, 2022
0440f66
Debug: try with newer python version
sv2518 Mar 6, 2022
4bcb592
change target before inlining
kaushikcfd Mar 7, 2022
d42e7e8
ignore loopy vectorization fallback warnings
kaushikcfd Mar 7, 2022
7e37e02
Revert "Debug: try with newer python version"
sv2518 Mar 6, 2022
b541dbd
Make complex check tighter
sv2518 Mar 11, 2022
caa567a
extend the set of variables that cannot be vecotrized
kaushikcfd Mar 11, 2022
c3a96fa
Attempt to fix Slate by inlining of all subkernels
sv2518 Mar 14, 2022
dc996de
Add comment
sv2518 Mar 14, 2022
fa343e1
placate flake8
kaushikcfd Mar 15, 2022
aa7bc0c
blas callables: do not accept vectorized dtypes
kaushikcfd Apr 1, 2022
8302d52
allow inverse.c::inverse() to take in vector dtypes
kaushikcfd May 5, 2022
a767fe2
Merge remote-tracking branch 'origin/master' into vectorisation-sprint
kaushikcfd May 5, 2022
85de156
do not invoke the vectorization pass if one of the arguments is a Mix…
kaushikcfd May 5, 2022
30f8ecb
makes freeing logic accurate
kaushikcfd May 5, 2022
0d5023d
rewrite solve to accept strided inputs
kaushikcfd May 6, 2022
d25545b
blas-helpers: corrects the freeing logic
kaushikcfd May 6, 2022
0ade829
Don't vectorise the kernel which generates the coordinates for the ex…
sv2518 May 6, 2022
a4bab8e
PyOP2 compilation: add a pathway to compile with gcc on Mac.
sv2518 May 6, 2022
175eb14
do not vectorize the entire kernel if some instruction are surrounded…
kaushikcfd May 8, 2022
8256bd2
loop being split starts from '0' => do not peel at the head
kaushikcfd May 8, 2022
6585dbb
Merge branch 'vectorisation-sprint' of github.com:OP2/PyOP2 into vect…
sv2518 May 9, 2022
4c0ca6e
Add comment
sv2518 May 9, 2022
e744092
Fix complex check?
sv2518 May 10, 2022
5fc4264
Fix complex check?
sv2518 May 10, 2022
31f0c39
Fix complex check?
sv2518 May 10, 2022
7e8a86a
Fix complex check?
sv2518 May 10, 2022
63f1e52
clarifies vectorization strategy
kaushikcfd May 11, 2022
8b19370
Updates to transform startegy
kaushikcfd May 11, 2022
7a2cbd6
Time configuration is not used anywhere and add doc
sv2518 May 19, 2022
69d4921
Move conditional
sv2518 May 19, 2022
43960e6
sun2020study -> cross-element
sv2518 May 19, 2022
b4c9926
Make default_simd_width more readable
sv2518 May 19, 2022
c603f3f
cleanup
sv2518 May 19, 2022
1cee3d7
Lint
sv2518 May 19, 2022
a671b6c
corrects the condition to not vectorize temps passed to BLAS calls
kaushikcfd May 20, 2022
4aa86e1
Add vectorisation config to cache keys
sv2518 May 24, 2022
60b4b3e
Tests: add a vectorisation test
sv2518 May 24, 2022
1b3c29e
Cleanup
sv2518 May 24, 2022
0a54a34
Cleanup
sv2518 May 24, 2022
9b23200
Use reconfigure not init for changing the vectorisation strategy in t…
sv2518 May 24, 2022
acb9c89
Cleanup
sv2518 May 24, 2022
49e2779
Test: improve the vectorisation test.
sv2518 May 24, 2022
e5fe4d2
Put vectorisation strategy only in cache key of the global kernel.
sv2518 May 24, 2022
0eff9d6
lint
sv2518 May 25, 2022
22ce06e
Fix docs
sv2518 May 25, 2022
bdefbfa
Fix config error
sv2518 May 25, 2022
2a459e5
Fix config error
sv2518 May 25, 2022
56c65da
Don't add py-cpuinfo
May 27, 2022
ca5c51b
Add nbytes property
connorjward Jun 22, 2022
dc5f3bc
Drop unused args
sv2518 Jun 22, 2022
ac36708
Time->extra_info
sv2518 Jun 22, 2022
89c9dec
Merge branch 'vectorisation-sprint' into connorjward/add-nbytes
sv2518 Jun 22, 2022
e2af4c7
Merge pull request #666 from OP2/connorjward/add-nbytes
sv2518 Jun 22, 2022
4de6f06
Merge branch 'vectorisation-sprint' into JDBetteridge/vectorisation-s…
sv2518 Jun 22, 2022
2840f28
Merge pull request #665 from OP2/JDBetteridge/vectorisation-sprint
sv2518 Jun 22, 2022
89feb72
Fix bandwidth calculation
Jun 24, 2022
0857145
Add simd compiler flag also to LinuxGNU compiler
Jun 24, 2022
662241e
Add vectorisation flag to linux clang compiler too
Jun 27, 2022
203223c
account for changed in loopy's vectorization syntax
kaushikcfd Jul 6, 2022
fae323f
run CI with py3.8
kaushikcfd Jul 6, 2022
030cae5
Fallback for stopping criterium
sv2518 Jul 7, 2022
ece0e62
Fallback for stopping criterium
sv2518 Jul 7, 2022
934e147
Reduce inames to untag
sv2518 Jul 7, 2022
bd95ba3
Reduce inames to untag
sv2518 Jul 7, 2022
fd6650d
Fallback for stopping criterium
sv2518 Jul 7, 2022
f69755d
unroll (not vectorize) loops surrounding CInstructions
kaushikcfd Jul 11, 2022
e72f316
get rid of noop insns
kaushikcfd Jul 11, 2022
09bf629
Fix merge leftovers for vectorisation in chapter 3
sv2518 Oct 4, 2022
11 changes: 7 additions & 4 deletions pyop2/codegen/rep2loopy.py
@@ -191,13 +191,14 @@ def generate_preambles(self, target):


 class _PreambleGen(ImmutableRecord):
-    fields = set(("preamble", ))
+    fields = {"preamble", "idx"}

-    def __init__(self, preamble):
+    def __init__(self, preamble, idx="0"):
         self.preamble = preamble
+        self.idx = idx

     def __call__(self, preamble_info):
-        yield ("0", self.preamble)
+        yield (self.idx, self.preamble)


 class PyOP2KernelCallable(loopy.ScalarCallable):
@@ -533,7 +534,9 @@ def renamer(expr):
                                 options=options,
                                 assumptions=assumptions,
                                 lang_version=(2018, 2),
-                                name=wrapper_name)
+                                name=wrapper_name,
+                                # TODO, should these really be silenced?
+                                silenced_warnings=["write_race*", "data_dep*"])

     # prioritize loops
     for indices in context.index_ordering:
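The `idx` field added to `_PreambleGen` lets a caller choose the identifier that is yielded alongside the preamble text, instead of the fixed `"0"`. The sketch below is a loopy-free stand-in: `PreambleGenSketch` and `collect_preambles` are hypothetical names, the two C snippets are illustrative only, and it assumes that the consumer orders preambles by identifier. Whether loopy does exactly that should be checked against its code generation sources.

```python
class PreambleGenSketch:
    """Loopy-free stand-in for _PreambleGen; not the PyOP2 class itself."""

    def __init__(self, preamble, idx="0"):
        self.preamble = preamble
        self.idx = idx

    def __call__(self, preamble_info):
        # Same shape as in the diff: one (identifier, code) pair per generator.
        yield (self.idx, self.preamble)


def collect_preambles(generators, preamble_info=None):
    """Hypothetical consumer: gather all pairs and order them by identifier."""
    pairs = [pair for gen in generators for pair in gen(preamble_info)]
    return [code for _, code in sorted(pairs, key=lambda pair: pair[0])]


# Illustrative preambles: a typedef that must precede a declaration using it.
gens = [
    PreambleGenSketch("static double4 zero;", idx="1"),
    PreambleGenSketch("typedef double double4 __attribute__((vector_size(32)));", idx="0"),
]
ordered = collect_preambles(gens)
```

Under that ordering assumption, giving the typedef a smaller `idx` than the declaration that uses it places it first in the generated source, regardless of the order in which the generators were registered.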
7 changes: 3 additions & 4 deletions pyop2/compilation.py
@@ -219,7 +219,6 @@ def workaround_cflags(self):
             if version.StrictVersion("7.3") <= ver <= version.StrictVersion("7.5"):
                 # GCC bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90055
                 # See also https://github.com/firedrakeproject/firedrake/issues/1442
-                # And https://github.com/firedrakeproject/firedrake/issues/1717
                 # Bug also on skylake with the vectoriser in this
                 # combination (disappears without
                 # -fno-tree-loop-vectorize!)
@@ -370,7 +369,7 @@ class MacCompiler(Compiler):

     def __init__(self, cppargs=[], ldargs=[], cpp=False, comm=None):
         machine = platform.uname().machine
-        opt_flags = ["-O3", "-ffast-math"]
+        opt_flags = ["-O3", "-ffast-math", "-fopenmp-simd"]
         if machine == "arm64":
             # See https://stackoverflow.com/q/65966969
             opt_flags.append("-mcpu=apple-a14")
@@ -405,7 +404,7 @@ class LinuxCompiler(Compiler):
     :kwarg comm: Optional communicator to compile the code on (only
         rank 0 compiles code) (defaults to COMM_WORLD)."""
     def __init__(self, cppargs=[], ldargs=[], cpp=False, comm=None):
-        opt_flags = ['-march=native', '-O3', '-ffast-math']
+        opt_flags = ['-march=native', '-O3', '-ffast-math', '-fopenmp-simd']
         if configuration['debug']:
             opt_flags = ['-O0', '-g']
         cc = "mpicc"
@@ -431,7 +430,7 @@ class LinuxIntelCompiler(Compiler):
         rank 0 compiles code) (defaults to COMM_WORLD).
     """
     def __init__(self, cppargs=[], ldargs=[], cpp=False, comm=None):
-        opt_flags = ['-Ofast', '-xHost']
+        opt_flags = ['-march=native', '-Ofast', '-xHost', '-qopenmp-simd']
         if configuration['debug']:
             opt_flags = ['-O0', '-g']
         cc = "mpicc"
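The three compiler hunks above each append an OpenMP SIMD flag to the optimisation flags; with GCC and Clang, `-fopenmp-simd` enables `#pragma omp simd` handling without pulling in the rest of OpenMP. A minimal sketch of the resulting flag selection follows. The function `opt_flags` and the family names are hypothetical; the flag lists are copied from the diff, and the debug override mirrors the `configuration['debug']` branches shown for the two Linux compilers. The Mac hunk does not show its debug handling, so applying the override there is an assumption.

```python
# Flag lists copied from the diff; the lookup table itself is illustrative.
OPT_FLAGS = {
    "mac": ["-O3", "-ffast-math", "-fopenmp-simd"],
    "linux": ["-march=native", "-O3", "-ffast-math", "-fopenmp-simd"],
    "linux-intel": ["-march=native", "-Ofast", "-xHost", "-qopenmp-simd"],
}
DEBUG_FLAGS = ["-O0", "-g"]


def opt_flags(family, debug=False, machine=""):
    """Return optimisation flags for a compiler family (hypothetical helper)."""
    if debug:
        # In debug mode the optimisation flags, including the SIMD one, are dropped.
        return list(DEBUG_FLAGS)
    flags = list(OPT_FLAGS[family])
    if family == "mac" and machine == "arm64":
        # Mirrors the arm64 branch in MacCompiler.
        flags.append("-mcpu=apple-a14")
    return flags


linux_flags = opt_flags("linux")
mac_arm_flags = opt_flags("mac", machine="arm64")
```

Note that a debug build replaces the whole list, so any SIMD pragmas in the generated code would be ignored by the compiler in that mode.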
21 changes: 20 additions & 1 deletion pyop2/configuration.py
@@ -39,6 +39,22 @@
 from pyop2.exceptions import ConfigurationError


+def default_simd_width():
+    from cpuinfo import get_cpu_info
+    avx_to_width = {'avx': 2, 'avx1': 2, 'avx128': 2, 'avx2': 4,
+                    'avx256': 4, 'avx3': 8, 'avx512': 8}
+    longest_ext = [t for t in get_cpu_info()["flags"] if t.startswith('avx')][-1]
+    if longest_ext not in avx_to_width.keys():
+        if longest_ext[:6] not in avx_to_width.keys():
+            assert longest_ext[:4] in avx_to_width.keys(), \
+                "The vector extension of your architecture is unknown. Disable vectorisation!"
+            return avx_to_width[longest_ext[:4]]
+        else:
+            return avx_to_width[longest_ext[:6]]
+    else:
+        return avx_to_width[longest_ext]
+
+
 class Configuration(dict):
     r"""PyOP2 configuration parameters

@@ -78,7 +94,10 @@ class Configuration(dict):
     # name, env variable, type, default, write once
     DEFAULTS = {
         "compiler": ("PYOP2_BACKEND_COMPILER", str, "gcc"),
-        "simd_width": ("PYOP2_SIMD_WIDTH", int, 4),
+        "simd_width": ("PYOP2_SIMD_WIDTH", int, default_simd_width()),
+        "vectorization_strategy": ("PYOP2_VECT_STRATEGY", str, "ve"),
+        "alignment": ("PYOP2_ALIGNMENT", int, 64),
+        "time": ("PYOP2_TIME", bool, False),
         "debug": ("PYOP2_DEBUG", bool, False),
         "cflags": ("PYOP2_CFLAGS", str, ""),
         "ldflags": ("PYOP2_LDFLAGS", str, ""),
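Each `DEFAULTS` entry pairs a configuration key with an environment variable, a type, and a default, so the new vectorisation settings can be overridden from the environment (for example `PYOP2_SIMD_WIDTH=8`). The sketch below shows one way such a table could be resolved. `resolve` is a hypothetical helper, not PyOP2's `Configuration` class; its boolean parsing is an assumption, and the `simd_width` default is fixed at 4 here rather than computed by `default_simd_width()`.

```python
# Subset of the table from the diff, with a fixed simd_width default for illustration.
DEFAULTS = {
    "simd_width": ("PYOP2_SIMD_WIDTH", int, 4),
    "vectorization_strategy": ("PYOP2_VECT_STRATEGY", str, "ve"),
    "alignment": ("PYOP2_ALIGNMENT", int, 64),
    "debug": ("PYOP2_DEBUG", bool, False),
}


def resolve(defaults, environ):
    """Build a config dict, letting environment variables override defaults."""
    config = {}
    for key, (env_name, typ, default) in defaults.items():
        raw = environ.get(env_name)
        if raw is None:
            config[key] = default
        elif typ is bool:
            # Assumed convention for truthy strings; PyOP2 may parse booleans differently.
            config[key] = raw.strip().lower() in ("1", "true", "yes", "on")
        else:
            config[key] = typ(raw)
    return config


# A plain dict stands in for os.environ so the example is reproducible.
config = resolve(DEFAULTS, {"PYOP2_SIMD_WIDTH": "8", "PYOP2_VECT_STRATEGY": ""})
```

In this sketch an empty `PYOP2_VECT_STRATEGY` resolves to an empty string, which is falsy. The check in `global_kernel.py` uses the strategy value as a truth test, so an empty strategy would switch vectorisation off there.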
58 changes: 57 additions & 1 deletion pyop2/global_kernel.py
@@ -15,6 +15,8 @@
 from pyop2.datatypes import IntType, as_ctypes
 from pyop2.types import IterationRegion
 from pyop2.utils import cached_property, get_petsc_dir
+from pyop2 import configuration
+from pyop2 import op2


 # We set eq=False to force identity-based hashing. This is required for when
@@ -331,14 +333,68 @@ def code_to_compile(self):
         from pyop2.codegen.rep2loopy import generate

         wrapper = generate(self.builder)
+        if self._iterset._extruded:
+            iname = "layer"
+        else:
+            iname = "n"
+
+        has_matrix = any(arg._is_mat for arg in self._args)
+        has_rw = any(arg.access == op2.RW for arg in self._args)
+        is_cplx = any(arg.dtype.name == 'complex128' for arg in self._args)
+        vectorisable = not (has_matrix or has_rw) and (configuration["vectorization_strategy"])
+
+        if (isinstance(self._kernel.code, lp.LoopKernel) and vectorisable):
+            wrapper = lp.inline_callable_kernel(wrapper, self._kernel.name)
+            if not is_cplx:
+                wrapper = self.vectorise(wrapper, iname, configuration["simd_width"])
         code = lp.generate_code_v2(wrapper)

         if self.local_kernel.cpp:
-            from loopy.codegen.result import process_preambles
+            from lp.codegen.result import process_preambles
             preamble = "".join(process_preambles(getattr(code, "device_preambles", [])))
             device_code = "\n\n".join(str(dp.ast) for dp in code.device_programs)
             return preamble + "\nextern \"C\" {\n" + device_code + "\n}\n"
         return code.device_code()

+    def vectorise(wrapper, iname, batch_size):
+        """Return a vectorised version of wrapper, vectorising over iname.
+
+        :arg wrapper: A loopy kernel to vectorise.
+        :arg iname: The iteration index to vectorise over.
+        :arg batch_size: The vector width."""
+        if batch_size == 1:
+            return wrapper
+
+        # create constant zero vectors
+        wrapper = wrapper.copy(target=lp.CVecTarget(batch_size))
+        kernel = wrapper.root_kernel
+
+        # split iname and vectorize the inner loop
+        slabs = (1, 1)
+        inner_iname = iname + "_batch"
+
+        if configuration["vectorization_strategy"] == "ve":
+            kernel = lp.split_iname(kernel, iname, batch_size, slabs=slabs, inner_tag="vec", inner_iname=inner_iname)
+
+            alignment = configuration["alignment"]
+            tmps = dict((name, tv.copy(alignment=alignment)) for name, tv in kernel.temporary_variables.items())
+            kernel = kernel.copy(temporary_variables=tmps)
+
+            from lp.preprocess import check_cvec_vectorizability, cvec_retag_and_privatize, realize_ilp
+            from lp.kernel.data import OpenMPSIMDTag, VectorizeTag
+            from lp.transform.iname import tag_inames
+
+            kernel = realize_ilp(kernel)  # FIXME: do we also need to realize the reductions first?
+
+            # try to vectorise with vector extensionn
+            vector_inst, pragma_inst_to_tag, unr_inst_to_tag = check_cvec_vectorizability(kernel)
+
+            # if not possible fall back to OpenMP SIMD pragmas or unrolling by retagging, then privatize
+            kernel = cvec_retag_and_privatize(kernel, vector_inst, pragma_inst_to_tag, unr_inst_to_tag)
+
+            wrapper = wrapper.with_root_kernel(kernel)
+
+        return wrapper
+
     @PETSc.Log.EventDecorator()
     @mpi.collective
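Two ideas in this hunk can be illustrated without loopy: the guard that decides whether `vectorise` is called at all, and the splitting of the element loop into batches of `simd_width` iterations. The sketch below is a conceptual stand-in, not the loopy transformation. `Arg`, `should_vectorise`, and `split_batches` are hypothetical names, the string `"RW"` stands in for `op2.RW`, and handling leftover iterations in a separate scalar remainder is an assumption about what the split amounts to.

```python
from dataclasses import dataclass


@dataclass
class Arg:
    """Hypothetical stand-in for a PyOP2 kernel argument."""
    is_mat: bool = False
    access: str = "READ"
    dtype_name: str = "float64"


def should_vectorise(args, strategy, is_loopy_kernel=True):
    """Mirror the conditions in the diff under which vectorise() is called."""
    has_matrix = any(a.is_mat for a in args)
    has_rw = any(a.access == "RW" for a in args)
    is_cplx = any(a.dtype_name == "complex128" for a in args)
    vectorisable = not (has_matrix or has_rw) and bool(strategy)
    # In the diff, complex kernels are still inlined but not vectorised.
    return is_loopy_kernel and vectorisable and not is_cplx


def split_batches(n_elements, batch_size):
    """Split range(n_elements) into full batches plus a scalar remainder."""
    n_full = n_elements // batch_size
    batches = [range(b * batch_size, (b + 1) * batch_size) for b in range(n_full)]
    remainder = range(n_full * batch_size, n_elements)
    return batches, remainder


batches, remainder = split_batches(10, 4)
```

With ten elements and a batch size of four, two full batches would each be processed as one vector operation across elements, and the last two elements would be handled one at a time.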
1 change: 1 addition & 0 deletions requirements-ext.txt
@@ -4,6 +4,7 @@ pytest>=2.3
 flake8>=2.1.0
 pycparser>=2.10
 mpi4py>=1.3.1
+py-cpuinfo
Collaborator

Slightly concerned about adding this package, as maintenance seems to have ceased. This could be problematic moving to the M1 Mac (workhorsy/py-cpuinfo#162), but I might be wrong. I will try to look into an alternative and report back.

Contributor Author

Aii, sorry. I had added this dependency two years ago and didn't look into it again. Thanks for trying to find an alternative, Jack!

Contributor @kaushikcfd, May 20, 2022

Doesn't look like there are great alternatives. I wouldn't mind querying platform and hard-coding its parameters.

Collaborator

I'm 100% in favour of just hardcoding PYOP2_SIMD_WIDTH=4 but for my alternative implementation of what is already implemented here see #665
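
One direction raised in this thread, querying the platform directly instead of depending on py-cpuinfo, could look roughly like the sketch below. It is not the implementation from #665. The helper names `read_cpu_flags` and `simd_width_from_flags` are hypothetical, reading `/proc/cpuinfo` is a Linux-only assumption, and the width table is copied unchanged from `default_simd_width()` in this PR. Unlike that function, the sketch takes the widest matching extension rather than the last one listed, and it falls back to a width of 1 when no AVX flag is found; `vectorise` returns the wrapper unchanged for a batch size of 1.

```python
# Width table copied from default_simd_width() in the diff.
AVX_TO_WIDTH = {'avx': 2, 'avx1': 2, 'avx128': 2, 'avx2': 4,
                'avx256': 4, 'avx3': 8, 'avx512': 8}


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU flag list on Linux, or [] if it cannot be read."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return line.split(":", 1)[1].split()
    except OSError:
        pass
    return []


def simd_width_from_flags(flags, fallback=1):
    """Return the widest SIMD width implied by the AVX flags, else fallback."""
    widths = []
    for flag in flags:
        if not flag.startswith("avx"):
            continue
        # Try the whole flag first, then shorter prefixes: 'avx512f' -> 'avx512'.
        for prefix_len in (len(flag), 6, 4, 3):
            width = AVX_TO_WIDTH.get(flag[:prefix_len])
            if width is not None:
                widths.append(width)
                break
    return max(widths, default=fallback)


width = simd_width_from_flags(["sse2", "avx", "avx2", "avx512f"])
```

Calling `simd_width_from_flags(read_cpu_flags())` on a machine without AVX flags, or on a platform without `/proc/cpuinfo`, would return the fallback instead of raising, which avoids the import-time failure mode of indexing an empty list.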

 decorator<=4.4.2
 dataclasses
 cachetools
2 changes: 1 addition & 1 deletion requirements-git.txt
@@ -1,2 +1,2 @@
 git+https://github.com/coneoproject/COFFEE.git#egg=coffee
-git+https://github.com/firedrakeproject/loopy.git@main#egg=loopy
+git+https://github.com/firedrakeproject/loopy.git@vectorisation-sprint#egg=loopy