Unable to compile for SU(4) for A100 #94

Open
edbennett opened this issue Sep 30, 2022 · 7 comments

@edbennett

I'm working with @LupoA to benchmark Hadrons on Tursa, and am hitting an issue where the nest of templates seemingly prevents Hadrons from compiling with CUDA. A number of modules (including MGauss, which we need) give the error:

/home/dp208/dp208/dc-benn2/prefix_su4test/include/Grid/threads/Accelerator.h(160): Error: Formal parameter space overflowed (4248 bytes required, max 4096 bytes allowed) in function _ZN4Grid11LambdaApplyIZNS_7LatticeINS_7iScalarINS_7iMatrixINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI7double2EEEEEELi4EEELi4EEEEEEaSINS_9BinaryMulENS1_INS2_INS2_INS2_ISD_EEEEEEEENS2_INS3_INS3_IS7_Li4EEELi4EEEEEEERSH_RKNS_23LatticeBinaryExpressionIT_T0_T1_EEEUlmmmE_EEvmmmT_

CPU compilation is fine.

Do you have any idea how fixing this could be approached, beyond going into Grid and renaming all of the type names to something shorter?

Thanks!

@edbennett
Author

Some basic shortening of the type names reveals a subsequent error, "9368 bytes required, max 4096 bytes allowed"; I'm not sure any amount of name-shortening will overcome that.

@aportelli
Owner

Hi @edbennett, the issue you are encountering is related to the actual size of the parameters, not their names.
If a function has 4x4 matrices passed by value, that could explain how the issue is related to Nc=4.
However, without more information about the compilation or the location of the issue I cannot help much, so if you could be a bit more specific, that would help.
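
To illustrate the point about size versus name, here is a minimal host-side C++ sketch. It is not Grid code: `SiteMatrix`, `ByValueFunctor` and `ByPointerFunctor` are hypothetical stand-ins invented for this example. It shows that an object holding a matrix by value (as a by-value lambda capture does) is at least as large as the matrix itself, whatever the type is called, while one holding only a pointer stays pointer-sized.

```cpp
#include <complex>
#include <cstddef>

// Hypothetical stand-in for a site-local N x N complex matrix (not a Grid type).
template <int N>
struct SiteMatrix {
    std::complex<double> m[N][N];
};

// Stores the matrix by value, like a lambda that captures it with [=].
template <int N>
struct ByValueFunctor {
    SiteMatrix<N> mat;
    std::complex<double> operator()(int i, int j) const { return mat.m[i][j]; }
};

// Stores only a pointer to a matrix that lives elsewhere.
template <int N>
struct ByPointerFunctor {
    const SiteMatrix<N>* mat;
    std::complex<double> operator()(int i, int j) const { return mat->m[i][j]; }
};

// The matrix payload is N*N complex numbers, independent of any type name.
static_assert(sizeof(SiteMatrix<4>) == 16 * sizeof(std::complex<double>),
              "4x4 matrix holds 16 complex entries");

// The by-value functor carries the whole matrix with it.
static_assert(sizeof(ByValueFunctor<4>) >= sizeof(SiteMatrix<4>),
              "by-value storage is at least as large as the matrix");

// The pointer functor does not grow with N.
static_assert(sizeof(ByPointerFunctor<4>) == sizeof(ByPointerFunctor<3>),
              "pointer storage is independent of the matrix size");
```

Shortening the type names changes none of these sizes, which is consistent with the name-shortening attempt above not helping.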

@edbennett
Author

edbennett commented Oct 3, 2022

Thanks for explaining, Antonin; I made assumptions based on the compiler's relatively unhelpful message.

Places where this occurs:

Modules/MContraction/Gamma3pt:

/home/dp208/dp208/dc-benn2/prefix_su4test/include/Grid/threads/Accelerator.h(160): Error: Formal parameter space overflowed (4360 bytes required, max 4096 bytes allowed) in function _ZN4Grid11LambdaApplyIZNS_7LatticeINS_2iSINS2_INS2_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI7double2EEEEEEEEEEEEEaSINS_10UnaryTraceENS_3LBEINS_9BinaryMulENSJ_ISK_NSJ_ISK_NS2_INS_2iMINSL_IS6_Li4EEELi4EEEEENS1_INS2_INSL_INSL_ISC_Li4EEELi4EEEEEEEEENS_5GammaEEESS_EEEERSG_RKNS_22LatticeUnaryExpressionIT_T0_EEEUlmmmE_EEvmmmT_

Modules/MContraction/WeakEye3pt:

/home/dp208/dp208/dc-benn2/prefix_su4test/include/Grid/threads/Accelerator.h(160): Error: Formal parameter space overflowed (4568 bytes required, max 4096 bytes allowed) in function _ZN4Grid11LambdaApplyIZNS_7LatticeINS_2iSINS2_INS2_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI7double2EEEEEEEEEEEEEaSINS_10UnaryTraceENS_3LBEINS_9BinaryMulENSJ_ISK_NSJ_ISK_NSJ_ISK_NSJ_ISK_NSJ_ISK_NSJ_ISK_NSJ_ISK_NSJ_ISK_NS1_INS2_INS_2iMINSL_ISC_Li4EEELi4EEEEEEENS_5GammaEEENS2_INSL_INSL_IS6_Li4EEELi4EEEEEEESQ_EESQ_EESP_EESQ_EESQ_EESP_EESQ_EEEERSG_RKNS_22LatticeUnaryExpressionIT_T0_EEEUlmmmE_EEvmmmT_
/home/dp208/dp208/dc-benn2/prefix_su4test/include/Grid/threads/Accelerator.h(160): Error: Formal parameter space overflowed (4568 bytes required, max 4096 bytes allowed) in function _ZN4Grid11LambdaApplyIZNS_7LatticeINS_2iSINS2_INS2_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI7double2EEEEEEEEEEEEEaSINS_9BinaryMulENS_22LatticeUnaryExpressionINS_10UnaryTraceENS_3LBEISI_NSL_ISI_NSL_ISI_NSL_ISI_NSL_ISI_NSL_ISI_NSL_ISI_NS1_INS2_INS_2iMINSM_ISC_Li4EEELi4EEEEEEENS_5GammaEEENS2_INSM_INSM_IS6_Li4EEELi4EEEEEEESR_EESR_EESQ_EESR_EESR_EEEENSJ_ISK_SS_EEEERSG_RKNSL_IT_T0_T1_EEEUlmmmE_EEvmmmT_

Modules/MSource/Gauss:

/home/dp208/dp208/dc-benn2/prefix_su4test/include/Grid/threads/Accelerator.h(160): Error: Formal parameter space overflowed (4248 bytes required, max 4096 bytes allowed) in function _ZN4Grid11LambdaApplyIZNS_7LatticeINS_2iSINS_2iMINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI7double2EEEEEELi4EEELi4EEEEEEaSINS_9BinaryMulENS1_INS2_INS2_INS2_ISD_EEEEEEEENS2_INS3_INS3_IS7_Li4EEELi4EEEEEEERSH_RKNS_3LBEIT_T0_T1_EEEUlmmmE_EEvmmmT_

For Alessandro's code the first two aren't needed, but the latter is.

In each case, this is the complete error message.

(In the above function names LBE should be read as LatticeBinaryExpression, iM as iMatrix and iS as iScalar; this is building on top of the Grid I modified, as a few modules in Grid would take multiple hours each to recompile for SU(4) if I changed it back.)

This is the grid.configure.summary for the Grid build I'm compiling against:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Summary of configuration for Grid v0.7.0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
----- GIT VERSION -------------------------------------
commit: 913fbca7
branch: develop
date  : 2022-08-31
----- PLATFORM ----------------------------------------
architecture (build)        : x86_64
os (build)                  : linux-gnu
architecture (target)       : x86_64
os (target)                 : linux-gnu
compiler vendor             : gnu
compiler version            :
----- BUILD OPTIONS -----------------------------------
Nc                          : 4
SIMD                        : GPU (width= 64)
Threading                   : yes
Acceleration                : cuda
Unified virtual memory      : no
Communications type         : mpi3
Shared memory allocator     : nvlink
Shared memory mmap path     : /var/lib/hugetlbfs/global/pagesize-2MB/
Default precision           :
Software FP16 conversion    : yes
RNG choice                  : sitmo
GMP                         : yes
LAPACK                      : no
FFTW                        : no
LIME (ILDG support)         : yes
HDF5                        : no
build DOXYGEN documentation : no
----- BUILD FLAGS -------------------------------------
CXXFLAGS:
    -I/mnt/lustre/tursafs1/home/dp208/dp208/shared/src/grid_20220929_su4test
    -I/home/dp208/dp208/dc-benn2/prefix/include
    -O3
    -ccbin
mpicxx
    -gencode
arch=compute_80,code=sm_80
    -std=c++14
    -cudart
shared
    -Xcompiler
    -fno-strict-aliasing
    --expt-extended-lambda
    --expt-relaxed-constexpr
    -Xcompiler
    -fopenmp
LDFLAGS:
    -L/mnt/lustre/tursafs1/home/dp208/dp208/shared/src/grid_20220929_su4test/build-gpu/Grid
    -L/home/dp208/dp208/dc-benn2/prefix/lib
    -cudart
shared
    -Xcompiler
    -fopenmp
LIBS:
    -lz
    -lcrypto
    -llime
    -lmpfr
    -lgmp
    -lstdc++
    -lm
    -lcuda
    -lz
-------------------------------------------------------

@aportelli
Owner

aportelli commented Oct 5, 2022

Thanks. Figuring out which lines of code these lambda functions come from would be useful, and could help produce a minimal reproducible example to report on the Grid side. On the Hadrons side we can't do much about it, but if you do not need these specific modules you could deactivate them in the list of things to compile.

Also, can you confirm that you compiled the Nc=3 version without encountering this?

@edbennett
Author

edbennett commented Oct 5, 2022

> Also, can you confirm that you compiled the Nc=3 version without encountering this?

I have now tested that and indeed do not encounter the issue.

> Thanks. Figuring out which lines of code these lambda functions come from would be useful, and could help produce a minimal reproducible example to report on the Grid side. On the Hadrons side we can't do much about it, but if you do not need these specific modules you could deactivate them in the list of things to compile.

OK, I've narrowed the error in Gauss.hpp down to line 193,

rho=ScalarRho*idMat;

…which seems a very strange place to encounter an issue.

Minimal failing example based on this:

#include <Grid/Grid.h>

using namespace Grid;

void test(WilsonImplR::PropagatorField &out, WilsonImplR::ComplexField in, const WilsonImplR::SitePropagator::scalar_object idMat) {
  out = in * idMat;
}

Compiled with (or, more accurately, unsuccessfully attempted to compile with):

nvcc -x cu -I${HOME}/prefix_su4test/include -ccbin mpicxx -I${HOME}/prefix/include --expt-extended-lambda -c -o test.o test.cpp

(This does compile successfully when using Grid build with Nc=3.)

@aportelli
Owner

Hi @edbennett, thanks, I think I understand now. In Grid, all arguments of an expression are captured by value and passed to a CUDA kernel. A double-precision SU(4) propagator has size 16 (colour, Nc x Nc) x 16 (spin) x 16 B = 4096 B, and in the expression you shared the identity matrix has this type, so on its own it already fills the 4096 B limit quoted in the error. Avoiding this kind of constant site value in expressions might solve the issue.
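
The arithmetic above can be checked with a short compile-time sketch. The helper `sitePropagatorBytes` and the constant names are illustrative and not part of Grid; the 4096-byte limit is the one quoted in the compiler error, and 4 spin components is an assumption matching the "16 (spin)" factor.

```cpp
#include <cstddef>

// Assumption: 4 Dirac spin components, giving the 4 x 4 = 16 spin factor.
constexpr std::size_t kSpinDim = 4;
// One double-precision complex number: 2 x 8 bytes.
constexpr std::size_t kComplexDoubleBytes = 16;
// Limit quoted in the "Formal parameter space overflowed" error.
constexpr std::size_t kParamLimitBytes = 4096;

// Bytes in one site propagator: (Nc x Nc colour) x (spin x spin) x complex size.
constexpr std::size_t sitePropagatorBytes(std::size_t nc, std::size_t bytesPerComplex) {
    return (nc * nc) * (kSpinDim * kSpinDim) * bytesPerComplex;
}

// Nc = 3 leaves room for the other captured arguments.
static_assert(sitePropagatorBytes(3, kComplexDoubleBytes) == 2304, "SU(3), double");
// Nc = 4 uses the entire limit before anything else is counted.
static_assert(sitePropagatorBytes(4, kComplexDoubleBytes) == kParamLimitBytes, "SU(4), double");
```

The 4248 bytes reported for MSource/Gauss are 152 bytes more than this, which is consistent with the remaining captured arguments being added on top of the site value.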

@edbennett
Author

In principle, storing this in single precision would presumably halve its size, which would allow going up to SU(5); since in this case it is an identity matrix, there should be no loss of precision. (Of course, a better solution would be needed to go to general $N$.) A quick test failed to compile, though, as there was no available overload, presumably because the operations defined assume that all types use the same precision.
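
As a rough check of the SU(5) estimate, the sketch below finds the largest Nc whose site propagator is strictly smaller than the 4096-byte limit. The function names are hypothetical, and the strict inequality is an assumption standing in for the extra bytes taken by the other kernel arguments, whose exact size for other Nc or precisions is not known from this thread.

```cpp
#include <cstddef>

// Bytes in one site propagator: Nc x Nc colour, 4 x 4 spin (assumed), one complex each.
constexpr std::size_t propagatorBytes(std::size_t nc, std::size_t bytesPerComplex) {
    return nc * nc * 16 * bytesPerComplex;
}

// Largest Nc whose propagator is strictly below `limit` bytes.
// Strictly below, because other captured arguments also need some space.
constexpr std::size_t largestNc(std::size_t bytesPerComplex, std::size_t limit) {
    std::size_t nc = 0;
    while (propagatorBytes(nc + 1, bytesPerComplex) < limit) {
        ++nc;
    }
    return nc;
}

// Double precision (16 bytes per complex): SU(3) fits, SU(4) does not.
static_assert(largestNc(16, 4096) == 3, "double precision");
// Single precision (8 bytes per complex): SU(5) needs 3200 bytes, SU(6) needs 4608.
static_assert(largestNc(8, 4096) == 5, "single precision");
```

This only bounds the size of the constant site value; it says nothing about whether Grid provides the mixed-precision overloads needed to use it.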
