Unable to compile for SU(4) for A100 #94

edbennett · 2022-09-30T22:59:45Z

I'm working with @LupoA trying to benchmark Hadrons on Tursa, and am hitting an issue where the nest of templates seemingly prevents Hadrons from compiling with CUDA. A number of modules (including MGauss, which we need) give the error:

/home/dp208/dp208/dc-benn2/prefix_su4test/include/Grid/threads/Accelerator.h(160): Error: Formal parameter space overflowed (4248 bytes required, max 4096 bytes allowed) in function _ZN4Grid11LambdaApplyIZNS_7LatticeINS_7iScalarINS_7iMatrixINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI7double2EEEEEELi4EEELi4EEEEEEaSINS_9BinaryMulENS1_INS2_INS2_INS2_ISD_EEEEEEEENS2_INS3_INS3_IS7_Li4EEELi4EEEEEEERSH_RKNS_23LatticeBinaryExpressionIT_T0_T1_EEEUlmmmE_EEvmmmT_

CPU compilation is fine.

Do you have any idea how fixing this could be approached, beyond going into Grid and renaming all of the type names to something shorter?

Thanks!

edbennett · 2022-10-01T19:34:48Z

Trying some basic shortening reveals a subsequent 9368 bytes required, max 4096 bytes allowed; I'm not sure there's any amount of name-shortening that will overcome that.

aportelli · 2022-10-03T09:57:44Z

Hi @edbennett, the issue you are encountering is related to the actual size of the parameters, not their names.
If a function has 4x4 matrices passed by value, that could explain how the issue is related to Nc=4.
However, without more information about the compilation or the location of the issue I cannot help much, so if you could please be a bit more specific that would help.
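
To illustrate the point about parameter size, here is a minimal host-only C++ sketch, not Grid code. `SitePropagator4`, `g_site`, `by_value` and `by_pointer` are hypothetical names introduced for this example. It assumes only that a by-value capture is stored inside the closure object, which is what makes a closure passed as a kernel argument grow with the captured data.

```cpp
#include <complex>
#include <cstddef>

// Hypothetical stand-in for one site of an Nc=4 propagator in double
// precision: 4x4 colour times 4x4 spin complex numbers. Not a Grid type.
struct SitePropagator4 {
    std::complex<double> m[4][4][4][4];
};

// Parameter-space limit quoted in the nvcc error message.
constexpr std::size_t kParamLimitBytes = 4096;

// Storage that outlives the closures below, so pointing at it is safe.
static SitePropagator4 g_site{};

// Capturing by value copies the whole object into the closure.
static const auto by_value = [site = SitePropagator4{}](std::size_t i) {
    return site.m[0][0][0][i % 4];
};

// Capturing a pointer stores only the address in the closure.
static const auto by_pointer = [p = &g_site](std::size_t i) {
    return p->m[0][0][0][i % 4];
};

// The by-value closure is at least as large as the object it holds,
// so it alone already uses up the whole 4096-byte budget.
static_assert(sizeof(by_value) >= sizeof(SitePropagator4),
              "by-value capture is stored in the closure");
static_assert(sizeof(by_pointer) < kParamLimitBytes,
              "pointer capture stays far below the limit");
```

The type names in the mangled symbol play no role in these sizes. On a GPU, a captured pointer would additionally have to refer to device-accessible memory, which this sketch does not address.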

edbennett · 2022-10-03T10:56:44Z

Thanks for explaining, Antonin; I made assumptions based on the compiler's relatively unhelpful message.

Places where this occurs:

Modules/MContraction/Gamma3pt:

/home/dp208/dp208/dc-benn2/prefix_su4test/include/Grid/threads/Accelerator.h(160): Error: Formal parameter space overflowed (4360 bytes required, max 4096 bytes allowed) in function _ZN4Grid11LambdaApplyIZNS_7LatticeINS_2iSINS2_INS2_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI7double2EEEEEEEEEEEEEaSINS_10UnaryTraceENS_3LBEINS_9BinaryMulENSJ_ISK_NSJ_ISK_NS2_INS_2iMINSL_IS6_Li4EEELi4EEEEENS1_INS2_INSL_INSL_ISC_Li4EEELi4EEEEEEEEENS_5GammaEEESS_EEEERSG_RKNS_22LatticeUnaryExpressionIT_T0_EEEUlmmmE_EEvmmmT_

Modules/MContraction/WeakEye3pt:

/home/dp208/dp208/dc-benn2/prefix_su4test/include/Grid/threads/Accelerator.h(160): Error: Formal parameter space overflowed (4568 bytes required, max 4096 bytes allowed) in function _ZN4Grid11LambdaApplyIZNS_7LatticeINS_2iSINS2_INS2_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI7double2EEEEEEEEEEEEEaSINS_10UnaryTraceENS_3LBEINS_9BinaryMulENSJ_ISK_NSJ_ISK_NSJ_ISK_NSJ_ISK_NSJ_ISK_NSJ_ISK_NSJ_ISK_NSJ_ISK_NS1_INS2_INS_2iMINSL_ISC_Li4EEELi4EEEEEEENS_5GammaEEENS2_INSL_INSL_IS6_Li4EEELi4EEEEEEESQ_EESQ_EESP_EESQ_EESQ_EESP_EESQ_EEEERSG_RKNS_22LatticeUnaryExpressionIT_T0_EEEUlmmmE_EEvmmmT_
/home/dp208/dp208/dc-benn2/prefix_su4test/include/Grid/threads/Accelerator.h(160): Error: Formal parameter space overflowed (4568 bytes required, max 4096 bytes allowed) in function _ZN4Grid11LambdaApplyIZNS_7LatticeINS_2iSINS2_INS2_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI7double2EEEEEEEEEEEEEaSINS_9BinaryMulENS_22LatticeUnaryExpressionINS_10UnaryTraceENS_3LBEISI_NSL_ISI_NSL_ISI_NSL_ISI_NSL_ISI_NSL_ISI_NSL_ISI_NS1_INS2_INS_2iMINSM_ISC_Li4EEELi4EEEEEEENS_5GammaEEENS2_INSM_INSM_IS6_Li4EEELi4EEEEEEESR_EESR_EESQ_EESR_EESR_EEEENSJ_ISK_SS_EEEERSG_RKNSL_IT_T0_T1_EEEUlmmmE_EEvmmmT_

Modules/MSource/Gauss:

/home/dp208/dp208/dc-benn2/prefix_su4test/include/Grid/threads/Accelerator.h(160): Error: Formal parameter space overflowed (4248 bytes required, max 4096 bytes allowed) in function _ZN4Grid11LambdaApplyIZNS_7LatticeINS_2iSINS_2iMINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI7double2EEEEEELi4EEELi4EEEEEEaSINS_9BinaryMulENS1_INS2_INS2_INS2_ISD_EEEEEEEENS2_INS3_INS3_IS7_Li4EEELi4EEEEEEERSH_RKNS_3LBEIT_T0_T1_EEEUlmmmE_EEvmmmT_

For Alessandro's code the first two aren't needed, but the latter is.

In each case, this is the complete error.

(In the above function names LBE should be read as LatticeBinaryExpression, iM as iMatrix and iS as iScalar; this is building on top of the Grid I modified, as a few modules in Grid would take multiple hours each to recompile for SU(4) if I changed it back.)

This is the grid.configure.summary for the Grid build I'm compiling against:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Summary of configuration for Grid v0.7.0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
----- GIT VERSION -------------------------------------
commit: 913fbca7
branch: develop
date  : 2022-08-31
----- PLATFORM ----------------------------------------
architecture (build)        : x86_64
os (build)                  : linux-gnu
architecture (target)       : x86_64
os (target)                 : linux-gnu
compiler vendor             : gnu
compiler version            :
----- BUILD OPTIONS -----------------------------------
Nc                          : 4
SIMD                        : GPU (width= 64)
Threading                   : yes
Acceleration                : cuda
Unified virtual memory      : no
Communications type         : mpi3
Shared memory allocator     : nvlink
Shared memory mmap path     : /var/lib/hugetlbfs/global/pagesize-2MB/
Default precision           :
Software FP16 conversion    : yes
RNG choice                  : sitmo
GMP                         : yes
LAPACK                      : no
FFTW                        : no
LIME (ILDG support)         : yes
HDF5                        : no
build DOXYGEN documentation : no
----- BUILD FLAGS -------------------------------------
CXXFLAGS:
    -I/mnt/lustre/tursafs1/home/dp208/dp208/shared/src/grid_20220929_su4test
    -I/home/dp208/dp208/dc-benn2/prefix/include
    -O3
    -ccbin
mpicxx
    -gencode
arch=compute_80,code=sm_80
    -std=c++14
    -cudart
shared
    -Xcompiler
    -fno-strict-aliasing
    --expt-extended-lambda
    --expt-relaxed-constexpr
    -Xcompiler
    -fopenmp
LDFLAGS:
    -L/mnt/lustre/tursafs1/home/dp208/dp208/shared/src/grid_20220929_su4test/build-gpu/Grid
    -L/home/dp208/dp208/dc-benn2/prefix/lib
    -cudart
shared
    -Xcompiler
    -fopenmp
LIBS:
    -lz
    -lcrypto
    -llime
    -lmpfr
    -lgmp
    -lstdc++
    -lm
    -lcuda
    -lz
-------------------------------------------------------

aportelli · 2022-10-05T14:47:45Z

Thanks, figuring out which lines of code these lambda functions came from would be useful, and could help producing a minimal reproducible example to report on Grid side. On the Hadrons side we can't do much about it, but if you do not need these specific modules you could deactivate them in the list of things to compile.

Also do you confirm that you compiled the Nc=3 version without encountering that?

edbennett · 2022-10-05T20:51:05Z

> Also do you confirm that you compiled the Nc=3 version without encountering that?

I have now tested that and indeed do not encounter the issue.

> Thanks, figuring out which lines of code these lambda functions came from would be useful, and could help producing a minimal reproducible example to report on Grid side. On the Hadrons side we can't do much about it, but if you do not need these specific modules you could deactivate them in the list of things to compile.

OK, I've narrowed the error in Gauss.hpp down to line 193,

rho=ScalarRho*idMat;

…which seems a very strange place to encounter an issue.

Minimal failing example based on this:

#include <Grid/Grid.h>

using namespace Grid;

void test(WilsonImplR::PropagatorField &out, WilsonImplR::ComplexField in, const WilsonImplR::SitePropagator::scalar_object idMat) {
  out = in * idMat;
}

Compiled with (or, more accurately, attempted unsuccessfully to compile with):

nvcc -x cu -I${HOME}/prefix_su4test/include -ccbin mpicxx -I${HOME}/prefix/include --expt-extended-lambda -c -o test.o test.cpp

(This does compile successfully when using Grid build with Nc=3.)

aportelli · 2022-10-05T23:50:37Z

Hi @edbennett, thanks, I think I understand now. In Grid all arguments of an expression are captured by value and made into a CUDA kernel. A double precision SU(4) propagator has size 16(Nc)x16(spin)x16B = 4096B, and in the expression you shared the identity matrix has this type. Avoiding the use of this kind of constant site value in expressions might solve the issue.
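
The size estimate above can be checked with a few lines of compile-time C++. This is only a sketch of the arithmetic, not Grid code: `site_propagator_bytes` is a hypothetical helper, and it assumes 4 spin components and 16 bytes per double-precision complex number.

```cpp
#include <cstddef>

// Bytes in one site propagator: (Nc x Nc colour) x (Ns x Ns spin)
// complex numbers of `bytes_per_complex` bytes each.
// Hypothetical helper for illustration; not part of Grid.
constexpr std::size_t site_propagator_bytes(std::size_t nc,
                                            std::size_t bytes_per_complex,
                                            std::size_t ns = 4) {
    return nc * nc * ns * ns * bytes_per_complex;
}

constexpr std::size_t kDoubleComplexBytes = 16;  // two 8-byte doubles
constexpr std::size_t kParamLimitBytes = 4096;   // limit from the nvcc error

// Nc=3 leaves room for the other kernel arguments.
static_assert(site_propagator_bytes(3, kDoubleComplexBytes) == 2304,
              "9 * 16 * 16 bytes");
// Nc=4 consumes the entire budget by itself.
static_assert(site_propagator_bytes(4, kDoubleComplexBytes) == kParamLimitBytes,
              "16 * 16 * 16 bytes");
```

For the Gauss module the compiler reported 4248 bytes required, which is 152 bytes more than the 4096-byte site value; those extra bytes are presumably the other captured arguments of the expression, although the thread does not confirm this.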

edbennett · 2022-10-07T15:24:42Z

In principle, storing this matrix in single precision would presumably halve its size, which would allow going up to SU(5); since in this case it is an identity matrix, this should cause no loss of precision. (Of course, a better solution would be needed to go to general $N$.) A quick test failed to compile, though, as there was no available overload, presumably because the operations defined assume that all types use the same precision.
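
As a rough check of the SU(5) estimate, the following compile-time sketch finds the largest Nc for which one site propagator fits in the 4096-byte limit. `site_bytes` and `max_nc` are hypothetical helpers, not Grid functions. The `other_bytes` argument is an assumption: it stands for whatever else the kernel captures, and the value 152 is taken from the 4248-byte figure in the Gauss error (4248 - 4096), which may well differ for other expressions or precisions.

```cpp
#include <cstddef>

// Bytes in one site propagator with 4 spin components (assumed).
constexpr std::size_t site_bytes(std::size_t nc, std::size_t bytes_per_complex) {
    return nc * nc * 4 * 4 * bytes_per_complex;
}

// Largest Nc such that the site propagator plus `other_bytes` of
// additional kernel arguments still fits within `limit` bytes.
constexpr std::size_t max_nc(std::size_t bytes_per_complex,
                             std::size_t limit,
                             std::size_t other_bytes) {
    std::size_t nc = 0;
    while (site_bytes(nc + 1, bytes_per_complex) + other_bytes <= limit) {
        ++nc;
    }
    return nc;
}

constexpr std::size_t kLimit = 4096;  // limit from the nvcc error
constexpr std::size_t kOther = 152;   // assumed overhead, see lead-in

static_assert(max_nc(16, kLimit, kOther) == 3, "double precision: up to SU(3)");
static_assert(max_nc(8, kLimit, kOther) == 5, "single precision: up to SU(5)");
```

With single precision an SU(5) site value takes 25 * 16 * 8 = 3200 bytes, while SU(6) would need 4608 bytes, so halving the precision only postpones the problem by two values of Nc.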

edbennett mentioned this issue Oct 5, 2022

Certain operations involving SitePropagator::scalar_object won't compile with CUDA for Nc > 3 paboyle/Grid#413

Open
