
Results regression in system tests #594

Open
MakisH opened this issue Nov 15, 2024 · 7 comments

Comments

MakisH (Member) commented Nov 15, 2024

As discussed in the coding days (Nov 2024), we have a strange regression in some results compared to the reference results. See original discussion in precice/precice#2131.

I have ruled out changes in the preCICE repository, as I get the same issue with preCICE v3.1.2 (the reference results were actually produced with v3.1.1, but this should not matter).

Based on where we have regressions, I suspect something related either to the Python bindings or to the Python dependencies.

Overview

Results regression ❌

  • elastic-tube-1d
    • fluid-python / solid-python: pressure written by fluid-python (+ some sporadic cross section length values)
  • flow-over-heated-plate
    • fluid-openfoam / solid-nutils: heat flux written by solid-nutils
  • perpendicular-flap
    • fluid-su2 / solid-fenics: both force and displacement

Works ✔️

  • elastic-tube-1d
    • fluid-cpp / solid-python
    • fluid-cpp / solid-cpp
  • flow-over-heated-plate
    • fluid-openfoam / solid-fenics
    • fluid-openfoam / solid-openfoam
  • perpendicular-flap
    • fluid-openfoam / solid-calculix
MakisH (Member, Author) commented Nov 16, 2024

Without changing anything from our side (I think), the Nutils and SU2-FEniCS cases have been fixed. The elastic-tube-1d python-python case remains broken:

https://github.com/precice/precice/actions/runs/11869063712/job/33078975390?pr=2052

Maybe a Python dependency?

uekerman (Member) commented:

Could we compare which NumPy version was used for the reference data and which one now? Is this in our data?

MakisH (Member, Author) commented Nov 22, 2024

> Could we compare which NumPy version was used for the reference data and which one now? Is this in our data?

No, unfortunately this is not yet in our data, but we could implement a pip freeze in the output (#596).
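A minimal sketch of what that could look like (the output file name and where this hooks into systemtests.py are assumptions; #596 tracks the actual implementation):

    import subprocess
    import sys
    from pathlib import Path

    def dump_pip_freeze(run_output_dir: Path) -> None:
        # Record the exact package versions of the interpreter that runs the case,
        # so future regressions can be correlated with dependency updates.
        frozen = subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            check=True, capture_output=True, text=True,
        ).stdout
        (run_output_dir / "pip-freeze.txt").write_text(frozen)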

MakisH (Member, Author) commented Nov 22, 2024

I am now investigating the results themselves. fieldcompare also writes diff files, but these contain the absolute differences. We currently use a relative tolerance of 3e-7 (see #393).
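As a side note, the relative quantity that the 3e-7 tolerance refers to can be recovered from an absolute diff and the reference field with a few lines of numpy (a sketch; how the diff files are actually loaded is left out, and the zero-guard value is an arbitrary choice):

    import numpy as np

    def relative_diff(abs_diff: np.ndarray, reference: np.ndarray, eps: float = 1e-30) -> np.ndarray:
        # Scale the absolute difference by the magnitude of the reference values,
        # guarding against division by (near-)zero reference entries.
        return np.abs(abs_diff) / np.maximum(np.abs(reference), eps)

    # rel = relative_diff(diff_field, reference_field)
    # failing = rel > 3e-7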

Comparing runs on the tests VM and on my laptop

On both systems, I first ran a docker system prune -a to ensure that no caching-related issues occur. Trying to reproduce this job, I ran the following on my laptop:

python3 systemtests.py --build_args=PLATFORM:ubuntu_2204,PRECICE_REF:f6e48e45167d9312ac14cec9efa8222a915ef201,PYTHON_BINDINGS_REF:b6b9ee5,CALCULIX_VERSION:2.20,CALCULIX_ADAPTER_REF:8eb1d43,FENICS_ADAPTER_REF:3de561d,OPENFOAM_EXECUTABLE:openfoam2312,OPENFOAM_ADAPTER_REF:20b4617,SU2_VERSION:7.5.1,SU2_ADAPTER_REF:64d4aff,TUTORIALS_REF:c5bc59f --suites=release_test
  • On the VM, only the Elastic tube 1D (fluid-python, solid-python) fails.
  • On my laptop, this + Flow over heated plate (fluid-openfoam, solid-nutils) + Perpendicular flap (fluid-su2, solid-fenics) fail.

This alone probably hints that our tolerances are too tight (even though I cannot definitively explain why).

Elastic tube 1D

Opening the file groups related to the Fluid-Nodes-Mesh-Fluid (from the case that fails on the VM) in ParaView, and applying a TemporalStatistics filter, I get:

  • CrossSectionLength:

    • maximum: [1.021327626346455, 1.032480229130512]
    • average: [1.0001867218695808, 1.0019902605226825]
    • minimum: [0.9686030436806008, 0.9807664502070361]
    • absolute diff to reference: [3.5098794759491625e-8, 1.9195113998282665e-7]
  • Pressure:

    • maximum: [186.04054809996933, 281.02104922355835]
    • average: [-1.724136046412882, 16.41037397202194]
    • minimum: [-284.9766694349331, -172.9517923390018]
    • absolute diff to reference: [0.0003018627092785664, 0.001706805917660148]

At least for the CrossSectionLength, one could argue that the diff is small, but the values themselves are also small, making all this more prone to floating-point errors.

[Attached image: tutorials-elastic-tube-1d-all]

Trying to reproduce the reference results

Running the following test locally (again after a docker system prune -a):

python3 systemtests.py --suites release_test --build_args PLATFORM:ubuntu_2204,PRECICE_REF:v3.1.1,PYTHON_BINDINGS_REF:b6b9ee5,CALCULIX_VERSION:2.20,CALCULIX_ADAPTER_REF:v2.20.1,FENICS_ADAPTER_REF:v2.1.0,OPENFOAM_EXECUTABLE:openfoam2312,OPENFOAM_ADAPTER_REF:v1.3.0,SU2_VERSION:7.5.1,SU2_ADAPTER_REF:64d4aff,TUTORIALS_REF:c5bc59f --log-level DEBUG

(using the versions from reference_results.metadata, except for the python-bindings and tutorials refs, where I used the latest, because we did not have the requirements.txt before)

The same tests are failing with regressions, so it is either something in the python-bindings/tutorials (which I think I have previously excluded), or something outside our control.

How to move on

Do we now accept this as the new baseline, or do we keep digging?

Should we already relax the tolerances?

Should we maybe introduce a difference tolerance per test case?
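Regarding the last question, a per-case tolerance could be as simple as a lookup table in the test harness (the names and numbers below are only illustrative, not an existing systemtests feature):

    # Hypothetical per-case relative tolerances, falling back to the global default.
    CASE_TOLERANCES = {
        "elastic-tube-1d:fluid-python-solid-python": 1e-5,
        "perpendicular-flap:fluid-su2-solid-fenics": 1e-6,
    }
    DEFAULT_RELATIVE_TOLERANCE = 3e-7

    def tolerance_for(case_id: str) -> float:
        return CASE_TOLERANCES.get(case_id, DEFAULT_RELATIVE_TOLERANCE)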

uekerman (Member) commented Dec 5, 2024

> but the values themselves are also small, making all this more prone to floating-point errors.

This statement is wrong. Floating-point errors have nothing to do with the size of the values as long as we don't underflow or overflow. Let's always look at relative errors. Both are around 1e-7.
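A quick numpy check illustrates this: the absolute gap between adjacent doubles grows with the magnitude of the value, while the relative gap stays around 2e-16 regardless of scale.

    import numpy as np

    for scale in (1e-3, 1.0, 1e6):
        x = np.float64(scale * np.pi)
        # np.spacing gives the absolute distance to the next representable double.
        print(f"{x:12.5e}  abs spacing {np.spacing(x):.3e}  relative {np.spacing(x) / x:.3e}")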

Concerning tolerances I think we have two choices:

  • We set them close to double precision, 1e-13 or so. Then we get a regression whenever any component changes, including OpenFOAM or NumPy ...
  • We set them a bit above the preCICE convergence criteria, so case-dependent (here 1e-4). But then we should also check iteration numbers.

I find the first option better. When running into a regression, we should always try to find out why, but this will not always be possible. It would be good to freeze as much as possible of what is outside our control (OpenFOAM, NumPy, ...) and to test specifically when updating such dependencies.

Could we somehow try to run the python-python elastic tube 1D with different NumPy versions?

MakisH (Member, Author) commented Dec 5, 2024

> Could we somehow try to run the python-python elastic tube 1D with different NumPy versions?

All Python-related versions are now defined in the requirements.txt of each tutorial. If we want to pin them, we should pin them there. While this would help reproducibility, it would contradict common practice. But common practice might not be the best practice for our scenario.

Besides the specific versions, we currently don't have a mechanism to propagate Python package versions into the venv of each test. Even the pyprecice version is essentially ignored, see #584.

We get everything else from APT, so versions should more or less be frozen as long as we stick to the same Ubuntu version.
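To answer the earlier question about different NumPy versions: sweeping them manually for the elastic-tube-1d Python solvers could look roughly like this (the paths and version list are assumptions about the local checkout; both participants would still need to be started from the prepared venv):

    import subprocess
    import sys
    from pathlib import Path

    FLUID = Path("tutorials/elastic-tube-1d/fluid-python")  # hypothetical checkout location
    NUMPY_VERSIONS = ["1.24.4", "1.26.4", "2.1.3"]          # example versions to compare

    for version in NUMPY_VERSIONS:
        env_dir = Path(f".venv-numpy-{version}").resolve()
        subprocess.run([sys.executable, "-m", "venv", str(env_dir)], check=True)
        pip = env_dir / "bin" / "pip"
        subprocess.run(
            [str(pip), "install", "-r", str(FLUID / "requirements.txt"), f"numpy=={version}"],
            check=True,
        )
        # The coupled case then has to be run with this venv activated
        # (run.sh in fluid-python and solid-python), and the results compared.
        print(f"Prepared {env_dir} with numpy=={version}")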

uekerman (Member) commented Dec 5, 2024

I compared different NumPy versions and different versions of the Python bindings. I don't see any other dependencies of the elastic tube 1D Python case that could have changed. The tutorial itself did not change recently either. LAPACK is fixed through the OS. I am puzzled as well.

> Do we now accept this as the new baseline, or do we keep digging?

I would suggest defining your reproduction above as the new baseline.

> All Python-related versions are now defined in the requirements.txt of each tutorial. If we want to pin them, we should pin them there. While this would help reproducibility, it would contradict common practice. But common practice might not be the best practice for our scenario.

I think we should pin them. We could also add a requirements_testing.txt to each tutorial? The fewer moving targets, the better.
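If we go that route, the harness could prefer the pinned file whenever a tutorial ships one (a sketch; the helper is hypothetical, not current systemtests behaviour):

    from pathlib import Path

    def requirements_file(case_dir: Path) -> Path:
        # Prefer a fully pinned requirements_testing.txt when the tutorial provides it,
        # and fall back to the regular requirements.txt otherwise.
        pinned = case_dir / "requirements_testing.txt"
        return pinned if pinned.exists() else case_dir / "requirements.txt"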
