tests/test_parallel.py fails sometimes #13

Open
chris-hld opened this issue Jul 1, 2020 · 11 comments

@chris-hld (Owner)

test_parallel fails frequently, but not always.
It seems that tests/test_parallel.py::test_render_bsdm is causing it (most of the time, at least).
Since it doesn't occur on every run, it might be undefined behaviour related to the shared memory access.

I will first try to collect some tracebacks; maybe they show a pattern.

Test on macOS-latest with Python 3.6

test_jobs = None

    @pytest.mark.parametrize('test_jobs', JOB_COUNTS)
    def test_render_bsdm(test_jobs):
        sdm_p, sdm_phi, sdm_theta = [*np.random.randn(3, 1000)]
        hrirs = spa.IO.load_hrirs(fs=44100, filename='dummy')
        bsdm_l_r, bsdm_r_r = spa.sdm.render_bsdm(sdm_p, sdm_phi, sdm_theta, hrirs,
                                                 jobs_count=1)
        bsdm_l_t, bsdm_r_t = spa.sdm.render_bsdm(sdm_p, sdm_phi, sdm_theta, hrirs,
                                                 jobs_count=test_jobs)
>       assert_allclose([bsdm_l_t, bsdm_r_t], [bsdm_l_r, bsdm_r_r])
E       AssertionError: 
E       Not equal to tolerance rtol=1e-07, atol=0
E       
E       Mismatched elements: 1 / 2510 (0.0398%)
E       Max absolute difference: 0.64633057
E       Max relative difference: 1.
E        x: array([[ 1.955098,  0.205032, -1.341322, ...,  0.      ,  0.      ,
E                0.      ],
E              [ 1.955098,  0.205032, -1.341322, ...,  0.      ,  0.      ,
E                0.      ]])
E        y: array([[ 1.955098,  0.205032, -1.341322, ...,  0.      ,  0.      ,
E                0.      ],
E              [ 1.955098,  0.205032, -1.341322, ...,  0.      ,  0.      ,
E                0.      ]])

@HaHeho commented Jul 1, 2020

Strange. Does this also happen occasionally when you execute it locally, or only in CI?

Is it always only one time-domain sample that is wrong?

I have not looked into the actual processing at all, but here is an idea for narrowing the error down.
The "sometimes" most likely comes from using randn as the input. However, splitting the work into multiple tasks should of course not alter the results. Are the data types in both branches exactly the same? Maybe it is a matter of insufficient numerical precision, which is only sometimes triggered by the noise.

@chris-hld (Owner, Author)

I could also make it fail locally, but only a single time so far; it seems to be a lot less frequent when executed on a local machine.
Yes, it appears to always be exactly one sample...

@chris-hld (Owner, Author)

Another traceback:

Test on ubuntu-latest with Python 3.7

=================================== FAILURES ===================================
_____________________________ test_render_bsdm[2] ______________________________

test_jobs = 2

    @pytest.mark.parametrize('test_jobs', JOB_COUNTS)
    def test_render_bsdm(test_jobs):
        sdm_p, sdm_phi, sdm_theta = [*np.random.randn(3, 1000)]
        hrirs = spa.IO.load_hrirs(fs=44100, filename='dummy')
        bsdm_l_r, bsdm_r_r = spa.sdm.render_bsdm(sdm_p, sdm_phi, sdm_theta, hrirs,
                                                 jobs_count=1)
        bsdm_l_t, bsdm_r_t = spa.sdm.render_bsdm(sdm_p, sdm_phi, sdm_theta, hrirs,
                                                 jobs_count=test_jobs)
>       assert_allclose([bsdm_l_t, bsdm_r_t], [bsdm_l_r, bsdm_r_r])
E       AssertionError: 
E       Not equal to tolerance rtol=1e-07, atol=0
E       
E       Mismatched elements: 1 / 2510 (0.0398%)
E       Max absolute difference: 1.31017305
E       Max relative difference: 1.
E        x: array([[0.916457, 2.660908, 1.464512, ..., 0.      , 0.      , 0.      ],
E              [0.916457, 2.660908, 1.464512, ..., 0.      , 0.      , 0.      ]])
E        y: array([[0.916457, 2.660908, 1.464512, ..., 0.      , 0.      , 0.      ],
E              [0.916457, 2.660908, 1.464512, ..., 0.      , 0.      , 0.      ]])

@HaHeho commented Jul 2, 2020

So it is not related to the Python version, since this one is 3.7? (That would have been very weird anyway, if all other packages are identical.)

Does Max relative difference: 1. mean that the single value is exactly double in one case? (|x − y| / |y| = 1 would fit either x = 2y or x = 0.)

@HaHeho commented Jul 2, 2020

I'm pretty certain this should solve it.

# part of parallel render_bsdm:
def _render_bsdm_sample(i, p, phi, theta, hrirs):
    h_l, h_r = hrirs[hrirs.nearest(phi, theta)]
    # shared_arr is the multiprocessing.Array and shared_array the NumPy
    # view of its buffer (presumably module globals in the workers)
    with shared_arr.get_lock():  # synchronize access to the += below
        shared_array[i:i + len(h_l), 0] += p * h_l
        shared_array[i:i + len(h_l), 1] += p * h_r

@chris-hld (Owner, Author)

Hm, I am using multiprocessing.Array, which already includes a lock...
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Array
I am quite surprised that this particular function seems to fail while the others don't.
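For reference, a small illustration of what the built-in lock does and does not cover (my reading of the documentation, not project code): indexing through the multiprocessing.Array wrapper takes the lock per access, but a[i] += x is a locked read followed by a separate locked write, and a NumPy view of the buffer bypasses the wrapper entirely.

import multiprocessing as mp
import numpy as np

arr = mp.Array('d', 4)   # synchronized wrapper, carries a lock

arr[0] = 1.0             # wrapper __setitem__ acquires the lock
arr[0] += 1.0            # a locked read, then a separate locked write:
                         # not one atomic update

view = np.frombuffer(arr.get_obj())  # raw view of the shared buffer
view[0] += 1.0           # no locking at all on this path
print(arr[0])            # -> 3.0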

@HaHeho commented Jul 7, 2020

Yes. I see the documentation is confusing with regard to the lock. The way I understand it so far is that it provides lock functionality for synchronized access, but the locking is not managed for you. Many examples I found online also apply the lock explicitly, in the way I showed in the code above.

I did not check all methods, but could it be that bsdm is the only one where shared state is part of the processing (i.e. it uses +=, and hence requires synchronization)? If so, that is also a strong indicator that manual locking is required. Did you try it this way, and does the test still fail occasionally?
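To make the failure mode concrete, here is a minimal standalone repro of the lost-update race (a hypothetical demo with made-up names, not the spaudiopy code): several processes accumulate into one shared buffer, and without the explicit lock around the read-modify-write, some increments are usually lost.

import multiprocessing as mp
import numpy as np

def _accumulate(shared_arr, use_lock):
    view = np.frombuffer(shared_arr.get_obj())  # NumPy view, no copy
    for _ in range(10000):
        if use_lock:
            with shared_arr.get_lock():   # guard the read-modify-write
                view[0] += 1.0
        else:
            view[0] += 1.0                # unsynchronized +=

if __name__ == '__main__':
    for use_lock in (False, True):
        shared_arr = mp.Array('d', 1)
        procs = [mp.Process(target=_accumulate, args=(shared_arr, use_lock))
                 for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # expected 40000.0; without the lock the result is usually smaller
        print('lock' if use_lock else 'no lock', shared_arr[0])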

@chris-hld (Owner, Author)

Seems it was indeed the += in-place operator. I'm also casting back to a NumPy array, so the operator is apparently not thread safe there.

cf77fa8 and f5e5654 target this issue.

The reentrant lock seemed necessary at least on Windows, even though on my local Linux machine I couldn't make it fail with
repeat 100 { pytest tests/test_parallel.py::test_render_bsdm }

Thank you @HaHeho !

@chris-hld (Owner, Author)

Re-opening, because this is still not resolved.
The recent failures seem to always happen on Windows.

Traceback:

Cross-Platform Test / Test on windows-latest with Python 3.6

test_jobs = 2

    @pytest.mark.parametrize('test_jobs', JOB_COUNTS)
    def test_render_bsdm(test_jobs):
        sdm_p, sdm_phi, sdm_theta = [*np.random.randn(3, 1000)]
        hrirs = spa.IO.load_hrirs(fs=44100, filename='dummy')
        bsdm_l_r, bsdm_r_r = spa.sdm.render_bsdm(sdm_p, sdm_phi, sdm_theta, hrirs,
                                                 jobs_count=1)
        bsdm_l_t, bsdm_r_t = spa.sdm.render_bsdm(sdm_p, sdm_phi, sdm_theta, hrirs,
                                                 jobs_count=test_jobs)
>       assert_allclose([bsdm_l_t, bsdm_r_t], [bsdm_l_r, bsdm_r_r])
E       AssertionError: 
E       Not equal to tolerance rtol=1e-07, atol=0
E       
E       Mismatched elements: 1 / 2510 (0.0398%)
E       Max absolute difference: 0.01022394
E       Max relative difference: 1.
E        x: array([[-1.208554, -0.140148,  0.93946 , ...,  0.      ,  0.      ,
E                0.      ],
E              [-1.208554, -0.140148,  0.93946 , ...,  0.      ,  0.      ,
E                0.      ]])
E        y: array([[-1.208554, -0.140148,  0.93946 , ...,  0.      ,  0.      ,
E                0.      ],
E              [-1.208554, -0.140148,  0.93946 , ...,  0.      ,  0.      ,
E                0.      ]])

chris-hld reopened this Jul 24, 2020
@HaHeho commented Jul 24, 2020

Windows (by default) follows a different mechanism for spawning child processes. I am not sure why this would influence the lock, though:
https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

However, I did not really understand the way you implemented the lock there. What is the reason for using RLock() instead of Lock()? The multiprocessing documentation is pretty cryptic on the practical differences. This seems to explain them:
https://stackoverflow.com/questions/22885775/what-is-the-difference-between-lock-and-rlock

According to this, I would assume that Lock() is the more reasonable choice for you (it is supposed to be faster and does not allow multiple acquisitions by the same process). Also, why not straight up use the lock that is created together with the array (with shared_arr.get_lock():)?
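For illustration, the behavioural difference in isolation (a plain demo, not project code):

import multiprocessing as mp

lock = mp.Lock()
lock.acquire()
# lock.acquire()   # a second acquire here would block forever (deadlock)
lock.release()

rlock = mp.RLock()
rlock.acquire()
rlock.acquire()     # fine: the owning process/thread may re-acquire
rlock.release()
rlock.release()     # must be released once per acquire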

I would try that and see if it solves the problem.

@HaHeho commented Jul 24, 2020

Actually, https://docs.python.org/3/library/multiprocessing.html#all-start-methods ("Explicitly pass resources to child processes") suggests that this might indeed be caused by the different process spawning on Windows. I would still expect my suggested way above to work, and to be the easiest to read.
Otherwise, you seem to be required to pass the lock instance to the worker function explicitly in order for it to work properly under Windows.
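A sketch of that with a Pool initializer (hypothetical names, not the actual render_bsdm code): under the 'spawn' start method used on Windows, module globals set in the parent are not inherited, so the shared array, which carries its lock, has to be handed to each child when the pool is created.

import multiprocessing as mp
import numpy as np

def _init_worker(arr):
    # runs once per child process; stash the shared array in a global
    global shared_arr
    shared_arr = arr

def _accumulate(task):
    i, value = task
    view = np.frombuffer(shared_arr.get_obj())
    with shared_arr.get_lock():   # explicit guard around the +=
        view[i] += value

if __name__ == '__main__':
    arr = mp.Array('d', 8)
    with mp.Pool(processes=2, initializer=_init_worker,
                 initargs=(arr,)) as pool:
        pool.map(_accumulate, [(i % 8, 1.0) for i in range(64)])
    print(np.frombuffer(arr.get_obj()))   # each slot should be 8.0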
