Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature suggestion] OpenMP Parallelization #1

Open
sthartman opened this issue Feb 15, 2022 · 16 comments
Open

[Feature suggestion] OpenMP Parallelization #1

sthartman opened this issue Feb 15, 2022 · 16 comments

Comments

@sthartman
Copy link

Hello,
I've been working on parallelizing the KPROJ code using OpenMP, I have a minimal working version by now. I've gotten around 10x speedup by running on all 36 cores of a compute node - would you be interested in merging that once I get all the details sorted out and pushed to my fork?

@jinxin68
Copy link
Contributor

Hi, @sthartman
I'm very interested in your version ! When will you update your version?

@sthartman
Copy link
Author

I've just updated. I've tested the bands_wt_layer to check that it reproduces the same results as the serial version within rounding error. I don't remember if I ever got the bands_wt_all subroutine working correctly in parallel, since I didn't need it for my particular case. It looks like I tried to parallelize it, it shouldn't be that different than the other one. Unfortunately I don't think I have my input files any more or I'd test its accuracy myself.

@mxchen-2020
Copy link
Owner

@sthartman Hi, Steven, I am sorry for the delayed reply. I almost forgot this depository. Many thanks for your work on the program. Have you already updated the subroutine of the program? In fact, my student recently worked out interfaces to the LCAO code ABCUS and Phonopy. In addition, he replaced the FFT part with that from the intel lib, which can also speed up calculations significantly. We will update the program soon.

@sthartman
Copy link
Author

the bands_wt_layer in mod_kproj.f of my fork definitely works, i think bands_wt_all also works but it's been 4 months since I did it so I don't remember the details.

@jinxin68
Copy link
Contributor

Hi, @sthartman .
Many thanks for your work on the program. It reproduces the same bandunfolding results as before version. 2×2 graphene was used. Plane-wave energy cutoffs of 400eV were used for the electronic structure calculations.
But it's seems to be something wrong with your code when use 6×6 graphene and plane-wave energy cutoffs was 400eV .
I try to fix it but failed .Prompt information : Segmentation fault (core dumped).

@jinxin68
Copy link
Contributor

Hi, @sthartman !
I am very glad to tell you that I have fix this problem . You should use "!$omp parallel" before "!$omp parallel do"(359 row in mod_kproj.f). But the program was slower than serials version . I think it's because the "!$omp parallel" will create NKPTS*NSPIN times .

@sthartman
Copy link
Author

That's interesting. It's possible that the segfault could be a memory problem, the big array is shared between threads but some of the other variables are duplicated. What value of OMP_NUM_THREADS are you using?

@jinxin68
Copy link
Contributor

@sthartman
I didn't set OMP_NUM_THREADS ,so it will use all threads .For my computer , it's value is equal to 16 .
Can you test it as I said before?

@sthartman
Copy link
Author

The test you're asking about, do you mean putting the $omp parallel in? I can try to do that later today, if I can find any of my input files.

@jinxin68
Copy link
Contributor

Yes,it is.
In addition, I don't think large arrays are shared between threads,because the largest array is “q_pw” . But it's private .
Thank you very much for your help!

@sthartman
Copy link
Author

Going back and doing some tests, it seems that I have to put export OMP_STACKSIZE=1024m in my batch script, to set the environment variable, or else it segfaults. The default OMP_STACKSIZE is very small, only a few MB, is there a chance that it's overflowing the default stack when you run the larger 6x6 graphene example? The system I'm testing is a big heterointerface supercell with INKPROJ:
LZLAYER = .TRUE.
zlay1 = 0.13
zlay2 = 0.29
MAT_P2S = 7 3 0
4 7 0
0 0 1
LSORBIT = .TRUE.
the WAVECAR is about 49 GB but kproj is only using about 6.1 GB total across 36 threads, according to the top command, so the memory consumption is not terrible overall. I was able to reproduce the 10x speedup.

I tried adding an !$OMP PARALLEL on the line before my $OMP PARALLEL DO, but the compiler (ifort) threw an error
ifort -O3 -free -xhost -mkl -qopenmp -CB -traceback -c mod_kproj.f
mod_kproj.f(491): error #7918: A RETURN statement is invalid as an exit from a parallel region.
return
------^
mod_kproj.f(498): error #7653: There are unterminated OpenMP* directive block[s].
WRITE(IUO,'(5X,A)')'ERR: while reading coef. from WAVECAR'
^
mod_kproj.f(493): error #7677: This label has been branched to from outside this PPD or OpenMP* block. [220]
220 CONTINUE
--^
compilation aborted for mod_kproj.f (code 1)
make: *** [mod_kproj.o] Error 1

I don't think it liked the syntax of having two commands to begin the parallel section, but only one OMP END PARALLEL DO. I also tried adding another OMP END PARALLEL after OMP END PARALLEL DO, but it errored with an abort trap signal - I think it considers that to be nested parallelism, but with the wrong syntax.

@jinxin68
Copy link
Contributor

jinxin68 commented Jul 1, 2022

I got faster speed during using "ulimit -s unlimited" and "export OMP_STACKSIZE=1024m" . As you said ,it's a memory problem . But usually users don't take the initiative to set these. I will try to find a better solution.
Thank you for your detailed and useful suggestions !

@jinxin68
Copy link
Contributor

jinxin68 commented Jul 1, 2022

I try to move "fft_allocate and allocate(q_tmp)" and add "deallocate(q_pw,q_tmp)" into parallel region to aovid memory problem . Though this solves the probelm ,it will slow the parallel speed by 30 % . Because large arrays are allocated and deallocated repeatedly in every threads .
Now ,I am quite sure that the memory problem is arised from "q_pw " and "q_tmp"

@sthartman
Copy link
Author

That makes sense, the default stacksize is around 4 MB I think, it will be very hard to contain the thread within that size. What are the dimensions of q_pw? I couldn't tell from looking at the fft_allocate subroutine.

If the concern is only that less experienced users will not know to set omp_stacksize correctly, it might be easier just to leave the standard compiler options in the makefile, with a commented-out alternate line which has the -qopenmp option also. If kproj compiled without the -qopenmp option, the compiler should ignore the openmp directives in mod_kproj.f and compile serial code. That way, a user will only get the parallel version if they know what they are doing and choose it on purpose.

@jinxin68
Copy link
Contributor

jinxin68 commented Jul 1, 2022 via email

@sthartman
Copy link
Author

I don't know if it's possible to set environment variables in the makefile, but if it is, that would be the simplest way to do it. Normally I set it as a runtime environment variable in my SLURM batch script, and I just set it to some large value like 1024m. As far as I know, there's no drawback to setting a large value as long as OMP_STACKSIZE x OMP_NUM_THREADS doesn't exceed the computer's memory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants