tests/test_parallel.py fails sometimes #13

Open
chris-hld opened this issue Jul 1, 2020 · 11 comments

@chris-hld (Owner)

test_parallel fails frequently, but not always.
It seems that tests/test_parallel.py::test_render_bsdm is causing it (most of the time, at least).
Since it doesn't occur on every run, it might be undefined behaviour related to the shared memory access.

I will first try to collect some tracebacks; maybe they show a pattern.

Test on macOS-latest with Python 3.6

test_jobs = None

    @pytest.mark.parametrize('test_jobs', JOB_COUNTS)
    def test_render_bsdm(test_jobs):
        sdm_p, sdm_phi, sdm_theta = [*np.random.randn(3, 1000)]
        hrirs = spa.IO.load_hrirs(fs=44100, filename='dummy')
        bsdm_l_r, bsdm_r_r = spa.sdm.render_bsdm(sdm_p, sdm_phi, sdm_theta, hrirs,
                                                 jobs_count=1)
        bsdm_l_t, bsdm_r_t = spa.sdm.render_bsdm(sdm_p, sdm_phi, sdm_theta, hrirs,
                                                 jobs_count=test_jobs)
>       assert_allclose([bsdm_l_t, bsdm_r_t], [bsdm_l_r, bsdm_r_r])
E       AssertionError: 
E       Not equal to tolerance rtol=1e-07, atol=0
E       
E       Mismatched elements: 1 / 2510 (0.0398%)
E       Max absolute difference: 0.64633057
E       Max relative difference: 1.
E        x: array([[ 1.955098,  0.205032, -1.341322, ...,  0.      ,  0.      ,
E                0.      ],
E              [ 1.955098,  0.205032, -1.341322, ...,  0.      ,  0.      ,
E                0.      ]])
E        y: array([[ 1.955098,  0.205032, -1.341322, ...,  0.      ,  0.      ,
E                0.      ],
E              [ 1.955098,  0.205032, -1.341322, ...,  0.      ,  0.      ,
E                0.      ]])

@HaHeho commented Jul 1, 2020

Strange. Does this also happen occasionally when you execute it locally, or only in CI?

Is it always only one time-domain sample that is wrong?

I have not looked into the actual processing at all, but here is an idea for narrowing the error down.
The "sometimes" most likely comes from using randn as the input. However, splitting the work into multiple tasks should of course not alter the results. Are the data types in both branches exactly the same? Maybe it is a matter of insufficient numerical precision, which is only sometimes triggered by the noise.

@chris-hld (Owner, Author)

I could also make it fail locally, but only a single time so far; it seems to be a lot less frequent when executed on a local machine.
Yes, it appears to always be exactly one sample...

@chris-hld (Owner, Author)

Another traceback:

Test on ubuntu-latest with Python 3.7

=================================== FAILURES ===================================
_____________________________ test_render_bsdm[2] ______________________________

test_jobs = 2

    @pytest.mark.parametrize('test_jobs', JOB_COUNTS)
    def test_render_bsdm(test_jobs):
        sdm_p, sdm_phi, sdm_theta = [*np.random.randn(3, 1000)]
        hrirs = spa.IO.load_hrirs(fs=44100, filename='dummy')
        bsdm_l_r, bsdm_r_r = spa.sdm.render_bsdm(sdm_p, sdm_phi, sdm_theta, hrirs,
                                                 jobs_count=1)
        bsdm_l_t, bsdm_r_t = spa.sdm.render_bsdm(sdm_p, sdm_phi, sdm_theta, hrirs,
                                                 jobs_count=test_jobs)
>       assert_allclose([bsdm_l_t, bsdm_r_t], [bsdm_l_r, bsdm_r_r])
E       AssertionError: 
E       Not equal to tolerance rtol=1e-07, atol=0
E       
E       Mismatched elements: 1 / 2510 (0.0398%)
E       Max absolute difference: 1.31017305
E       Max relative difference: 1.
E        x: array([[0.916457, 2.660908, 1.464512, ..., 0.      , 0.      , 0.      ],
E              [0.916457, 2.660908, 1.464512, ..., 0.      , 0.      , 0.      ]])
E        y: array([[0.916457, 2.660908, 1.464512, ..., 0.      , 0.      , 0.      ],
E              [0.916457, 2.660908, 1.464512, ..., 0.      , 0.      , 0.      ]])

@HaHeho commented Jul 2, 2020

So it is not related to the Python version, since this one is 3.7? (That would have been very weird anyway, if all other packages are identical.)

Does Max relative difference: 1. mean that the single value is exactly double in one case? (|x − y| / |y| = 1 would fit either x = 2y or x = 0.)

@HaHeho commented Jul 2, 2020

I'm pretty certain this should solve it.

# part of parallel render_bsdm:
def _render_bsdm_sample(i, p, phi, theta, hrirs):
    h_l, h_r = hrirs[hrirs.nearest(phi, theta)]
    # shared_arr is the multiprocessing.Array and shared_array the NumPy
    # view of its buffer (presumably module globals in the workers)
    with shared_arr.get_lock():  # synchronize access to the += below
        shared_array[i:i + len(h_l), 0] += p * h_l
        shared_array[i:i + len(h_l), 1] += p * h_r

@chris-hld (Owner, Author)

Hm, I am using multiprocessing.Array, which already includes a lock...
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Array
I am quite surprised that this particular function seems to fail while the others don't.
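For reference, a small illustration of what the built-in lock does and does not cover (my reading of the documentation, not project code): indexing through the multiprocessing.Array wrapper takes the lock per access, but a[i] += x is a locked read followed by a separate locked write, and a NumPy view of the buffer bypasses the wrapper entirely.

import multiprocessing as mp
import numpy as np

arr = mp.Array('d', 4)   # synchronized wrapper, carries a lock

arr[0] = 1.0             # wrapper __setitem__ acquires the lock
arr[0] += 1.0            # a locked read, then a separate locked write:
                         # not one atomic update

view = np.frombuffer(arr.get_obj())  # raw view of the shared buffer
view[0] += 1.0           # no locking at all on this path
print(arr[0])            # -> 3.0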

@HaHeho commented Jul 7, 2020

Yes. I see the documentation is confusing with regard to the lock. The way I understand it so far is that it provides lock functionality for synchronized access, but the locking is not managed for you. Many examples I found online also apply the lock explicitly, in the way I showed in the code above.

I did not check all methods, but could it be that bsdm is the only one where shared state is part of the processing (i.e. it uses +=, and hence requires synchronization)? If so, that is also a strong indicator that manual locking is required. Did you try it this way, and does the test still fail occasionally?
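To make the failure mode concrete, here is a minimal standalone repro of the lost-update race (a hypothetical demo with made-up names, not the spaudiopy code): several processes accumulate into one shared buffer, and without the explicit lock around the read-modify-write, some increments are usually lost.

import multiprocessing as mp
import numpy as np

def _accumulate(shared_arr, use_lock):
    view = np.frombuffer(shared_arr.get_obj())  # NumPy view, no copy
    for _ in range(10000):
        if use_lock:
            with shared_arr.get_lock():   # guard the read-modify-write
                view[0] += 1.0
        else:
            view[0] += 1.0                # unsynchronized +=

if __name__ == '__main__':
    for use_lock in (False, True):
        shared_arr = mp.Array('d', 1)
        procs = [mp.Process(target=_accumulate, args=(shared_arr, use_lock))
                 for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # expected 40000.0; without the lock the result is usually smaller
        print('lock' if use_lock else 'no lock', shared_arr[0])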

@chris-hld (Owner, Author)

Seems it was indeed the += in-place operator. I'm also casting back to a NumPy array, so the operator is apparently not thread safe there.

cf77fa8 and f5e5654 target this issue.

The reentrant lock seemed necessary at least on Windows, even though on my local Linux machine I couldn't make it fail with
repeat 100 { pytest tests/test_parallel.py::test_render_bsdm }

Thank you @HaHeho !

@chris-hld (Owner, Author)

Re-opening, because this is still not resolved.
The recent failures seem to always happen on Windows.

Traceback:

Cross-Platform Test / Test on windows-latest with Python 3.6

test_jobs = 2

    @pytest.mark.parametrize('test_jobs', JOB_COUNTS)
    def test_render_bsdm(test_jobs):
        sdm_p, sdm_phi, sdm_theta = [*np.random.randn(3, 1000)]
        hrirs = spa.IO.load_hrirs(fs=44100, filename='dummy')
        bsdm_l_r, bsdm_r_r = spa.sdm.render_bsdm(sdm_p, sdm_phi, sdm_theta, hrirs,
                                                 jobs_count=1)
        bsdm_l_t, bsdm_r_t = spa.sdm.render_bsdm(sdm_p, sdm_phi, sdm_theta, hrirs,
                                                 jobs_count=test_jobs)
>       assert_allclose([bsdm_l_t, bsdm_r_t], [bsdm_l_r, bsdm_r_r])
E       AssertionError: 
E       Not equal to tolerance rtol=1e-07, atol=0
E       
E       Mismatched elements: 1 / 2510 (0.0398%)
E       Max absolute difference: 0.01022394
E       Max relative difference: 1.
E        x: array([[-1.208554, -0.140148,  0.93946 , ...,  0.      ,  0.      ,
E                0.      ],
E              [-1.208554, -0.140148,  0.93946 , ...,  0.      ,  0.      ,
E                0.      ]])
E        y: array([[-1.208554, -0.140148,  0.93946 , ...,  0.      ,  0.      ,
E                0.      ],
E              [-1.208554, -0.140148,  0.93946 , ...,  0.      ,  0.      ,
E                0.      ]])

chris-hld reopened this Jul 24, 2020
@HaHeho commented Jul 24, 2020

Windows (by default) follows a different mechanism for spawning child processes. I am not sure why this would influence the lock, though:
https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

However, I did not really understand the way you implemented the lock there. What is the reason for using RLock() instead of Lock()? The multiprocessing documentation is pretty cryptic on the practical differences. This seems to explain them:
https://stackoverflow.com/questions/22885775/what-is-the-difference-between-lock-and-rlock

According to this, I would assume that Lock() is the more reasonable choice for you (it is supposed to be faster and does not allow multiple acquisitions by the same process). Also, why not straight up use the lock that is created together with the array (with shared_arr.get_lock():)?
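For illustration, the behavioural difference in isolation (a plain demo, not project code):

import multiprocessing as mp

lock = mp.Lock()
lock.acquire()
# lock.acquire()   # a second acquire here would block forever (deadlock)
lock.release()

rlock = mp.RLock()
rlock.acquire()
rlock.acquire()     # fine: the owning process/thread may re-acquire
rlock.release()
rlock.release()     # must be released once per acquire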

I would try that and see if it solves the problem.

@HaHeho commented Jul 24, 2020

Actually, https://docs.python.org/3/library/multiprocessing.html#all-start-methods ("Explicitly pass resources to child processes") suggests that this might indeed be caused by the different process spawning on Windows. I would still expect my suggested way above to work, and to be the easiest to read.
Otherwise, you seem to be required to pass the lock instance to the worker function explicitly in order for it to work properly under Windows.
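A sketch of that with a Pool initializer (hypothetical names, not the actual render_bsdm code): under the 'spawn' start method used on Windows, module globals set in the parent are not inherited, so the shared array, which carries its lock, has to be handed to each child when the pool is created.

import multiprocessing as mp
import numpy as np

def _init_worker(arr):
    # runs once per child process; stash the shared array in a global
    global shared_arr
    shared_arr = arr

def _accumulate(task):
    i, value = task
    view = np.frombuffer(shared_arr.get_obj())
    with shared_arr.get_lock():   # explicit guard around the +=
        view[i] += value

if __name__ == '__main__':
    arr = mp.Array('d', 8)
    with mp.Pool(processes=2, initializer=_init_worker,
                 initargs=(arr,)) as pool:
        pool.map(_accumulate, [(i % 8, 1.0) for i in range(64)])
    print(np.frombuffer(arr.get_obj()))   # each slot should be 8.0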
