Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/memcomp #4599

Closed
wants to merge 10 commits into from
Closed

Fix/memcomp #4599

wants to merge 10 commits into from

Conversation

jedwards4b
Copy link
Contributor

Currently if a test is too short for the memcomp test it fails because the test baseline is missing. It should just skip the test in this case...

This PR causes the test to be skipped and pass, but I would like it to be more informative.

Test suite:
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes:
User interface changes?:

Update gh-pages html (Y/N)?:

@jedwards4b jedwards4b self-assigned this Mar 13, 2024
tolerance = case.get_value("TEST_MEMLEAK_TOLERANCE")

if tolerance is None:
tolerance = 0.1

baseline_file = os.path.join(baseline_dir, "cpl-mem.log")
if not os.path.exists(baseline_file):
return tolerance, "MEMCOMP: No baseline memory file present"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgfouca I would like this string to appear in the TestStatus file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jedwards4b , I think you need to return a boolean here, right?
return True, "MEMCOMP: No baseline memory file present"

I'm a little worried that this will hide errors. I would not have noticed that our perf test was not running long enough to create a cpl-mem file if it was passing. @jasonb5 , thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should be boolean. The options for TEST_*_STATUS are PASS, FAIL and PEND - should I add a TEST_SKIP_STATUS?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jedwards4b , were the tests that were failing performance tests or just regular tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just regular tests and I am only dealing with this in system_tests_common.py so I don't think there will be a problem for your performance tests. Now have:
PASS SMS_Ln9_Vnuopc.ne30_ne30_mg17.FW2000climo.derecho_intel.cam-outfrq9s MEMCOMP: No baseline memory file present
in TestStatus file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a script. It's users looking at TestStatus and seeing that cpl-mem.log is missing. Then they start contacting me asking what went wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just remembered, a lot of people use cs.status.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what dies the output of cs.status look like when cpl-mem.log is missing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s (Overall: PASS) details:
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s CREATE_NEWCASE
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s XML
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s SETUP
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s SHAREDLIB_BUILD time=143
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s NLCOMP
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s MODEL_BUILD time=238
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s SUBMIT
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s RUN time=159
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s COMPARE_base_rest
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s GENERATE cesm2_3_alpha17c
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s BASELINE cesm2_3_alpha17b:
FAIL ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s MEMCOMP [Errno 2] No such file or directory: '/glade/campaign/cesm/cesmdata/cesm_baselines/cesm2_3_alpha17b/ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s/cpl-mem.log'
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s TPUTCOMP
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s MEMLEAK insufficient data for memleak test
PASS ERP_D_Ln9_Vnuopc.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s SHORT_TERM_ARCHIVER

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to what happens when the baseline comparison fails, because of missing baselines. So I'm OK with having it marked as FAIL. So far I've been ignoring the MEM and TPUTCOMP tests because we found we couldn't count on them. So I'm glad they are being worked on. It would be nice to be able to use them. So thanks for working on this.

@jedwards4b
Copy link
Contributor Author

I'll look into refactoring this as a change in cs.status ...

@jedwards4b
Copy link
Contributor Author

Actually that won't solve the problem. People are still going to manually look at the TestStatus file and see that the Test indicates fail.

@rljacob
Copy link
Member

rljacob commented Mar 14, 2024

Do we know about how long the run has to be for a memtest to be valid? Just skip all the mem testing if the test length is to short. Force it on with a command-line argument.

@rljacob
Copy link
Member

rljacob commented Mar 14, 2024

Or...only run the memtest portion if that file is present. If something about the machine keeps the file from being created, like disk issues, then other things will also go wrong.

@ekluzek
Copy link
Contributor

ekluzek commented Mar 14, 2024

@jedwards4b as I say above I'm OK with it marked as fail if it says it's because of missing baseline files. Personally I then do some grepping to make sure the FAIL's are only because of that kind of problem.

But, I do like @rljacob idea of not doing these tests for cases that are too short. Marking them as such would also be fine as well.

All -- I know we can change the thresholds for these tests, but is there a way to signal in the testlist*.xml file that you want them turned off? For example for tests that are too short or you just don't want to check it for particular tests?

Still just glad this is being worked on! Thanks @jedwards4b

@jedwards4b
Copy link
Contributor Author

I thought that this was going to be an easy fix - but it's not, I'm going to set it aside for now.

@jedwards4b jedwards4b marked this pull request as draft September 20, 2024 15:44
@jedwards4b
Copy link
Contributor Author

Superceeded by #4691

@jedwards4b jedwards4b closed this Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants