StatAnalysis Memory and Thresholding Issues #1076
-
Hi wonderful MET help team! I told you it wouldn't be long before you heard from me! I hope you're all doing well. I'm running into a few issues with StatAnalysis, which I'm running on an HPC via Singularity, starting from the DTC Docker image. I am using version 10.0.0.
The first issue is memory use. Without the grouping, the job runs in about 1 minute and 20 seconds (7,000 lines). If I add -by FCST_VAR, OBS_SID, the job consumes all 128 GB of RAM available on the compute node before crashing. Do you know why this would occur? I am following the example in the NRL tutorial StatAnalysis presentation (slide 14) that uses multiple -by statements. I tried turning on debugging (-v 4) and I don't get any related messages.

The second issue shows up in the debug output:

"DEBUG 4: ClimoCDFInfo::set_cdf_ta() -> For "cdf_bins" (20) and "center_bins" (false), defined climatology CDF thresholds: >=0.00000,>=0.05000,>=0.10000,>=0.15000,>=0.20000,>=0.25000,>=0.30000,>=0.35000,>=0.40000,>=0.45000,>=0.50000,>=0.55000,>=0.60000,>=0.65000,>=0.70000,>=0.75000,>=0.80000,>=0.85000,>=0.90000,>=0.95000,>=1.00000"

This is cumulative, so when running over 5,000 matched pairs, the last one has 100,000 thresholds attached to it. That makes the job so complex that it never finishes. I was able to work around the problem by adding the flag -out_bin_size 1, but I figured you would still want to know about it. Thank you in advance for your help!

Best,
-
Hi Lindsay, these sound like performance issues that we should investigate. Thank you for bringing this to our attention. I think @JohnHalleyGotway would be best suited to look into it; however, he is out on vacation this week. I will let him know about these issues so he can look into them when he returns. To help us recreate them, could you provide the data you are using and each command that causes these slow run times? If the files aren't too big, you could attach a zip or tar file to this discussion for easy access. If they are very large, you could upload them to FTP instead. Thanks,
-
@lindsayrblank, good news. I found a bug with a simple one-line fix that'll solve this excessive memory use problem.

The void ClimoCDFInfo::set_cdf_ta(int n_bin, bool &center) function fails to initialize the cdf_ta array. Each time we call it, that array grows by 21 elements. Your job calls it 22,530 times, once for each combination of station id and variable name, so by the end it has length 473,130 and we have 22,530 copies of it. That's what's hogging all the memory. Thanks for finding this issue!

Testing with a patch, Stat-Analysis consumes around 2 GB for this job, and technically it could consume less than a quarter of that. The remaining overhead comes from the NumArray class: it allocates memory in blocks of 1,000 elements, but each of your cases has fewer than 50 elements, so a lot of the allocated memory goes unused. We could consider reimplementing NumArray to consume less.

I'll write up a GitHub issue describing the problem and commit a simple bugfix to the main_v10.0 branch. Once it's merged in, DockerHub will rebuild the main_v10.0 image and then you should be able to pull that via Singularity.
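To make the failure mode concrete, here is a minimal standalone sketch. This is not MET's actual ClimoCDFInfo code; the class, member types, and bin math are simplified assumptions. It only illustrates the pattern of an array that keeps growing because it is never reset before being repopulated, and the one-line fix of clearing it.

```cpp
// Minimal standalone sketch of the failure mode described above -- NOT MET's
// actual ClimoCDFInfo code; the class, types, and bin math are simplified.
#include <iostream>
#include <vector>

struct CDFInfo {
   std::vector<double> cdf_ta;   // stands in for the threshold array member

   void set_cdf_ta(int n_bin, bool center) {
      // The one-line fix: reset the array before repopulating it.  Without
      // this clear(), every call appends another n_bin + 1 thresholds, so
      // the array grows by 21 elements per call for the default 20 bins.
      cdf_ta.clear();

      const double step  = 1.0 / n_bin;
      const double start = center ? step / 2.0 : 0.0;
      for (int i = 0; i <= n_bin; i++) {
         cdf_ta.push_back(start + i * step);
      }
   }
};

int main() {
   CDFInfo info;

   // Simulate a few of the per-case calls, one for each
   // station-id/variable-name combination.
   for (int call = 1; call <= 3; call++) {
      info.set_cdf_ta(20, false);
      std::cout << "call " << call << ": cdf_ta length = "
                << info.cdf_ta.size() << "\n";
   }

   // With the clear() in place the length stays at 21 on every call; with it
   // removed, the lengths grow 21, 42, 63, ... which is the same pattern
   // that reaches 473,130 elements after 22,530 calls in the reported job.
   return 0;
}
```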