Sidejob races and eqc failures (related to #10?) #17

russelldb · 2017-10-12T08:04:13Z

Supervisor_eqc:prop_par/0 and supervisor_eqc:prop_pulse/0 have shown a couple of races in sidejob.

Caveats

It may well be that the original author was content with these races,
as the code is supposed to be performance critical, they may have felt
the windows open for incorrectness were acceptable, but there is no
remaining documentation of such a decision. It could be that the
author of the code and the author of the tests did not communicate and
the tests make wrong assumptions.

Which children?

The property tests show a problem with
sidejob_supervisor:which_children/0 there are checked in
counter_examples in the nhs-riak repo (where this issue is duplicated
for now) that show this issue, though I don't yet have a fix.

https://github.com/nhs-riak/sidejob/blob/rdb/2.0.1-nhs-2.2.5/test/which_children_ce.eqc

Limit not a limit

There are two races around the process limit for sidejob_supervisor,
one is a check then act in `sidejob:is_availble/3'
https://github.com/basho/sidejob/blob/rdb/2.0.1-nhs-2.2.5/src/sidejob.erl#L144
here the ets lookup (check) counter update, and set of full (act) are
all capable of being interleaved. In one attempt to fix I added a
gen_server per worker-ets table and serialised access to the worker
ets with the gen_server, thus turning this group of commands into an
atomic call. This fixed this one race only

The larger issue is a larger window for a
race. sidejob:is_available/3 will increment the counter for a
worker-ets table, when LIMIT is met no more workers should be
created. However, after a worker is created be sidejob_supervisor
there is a call to sidejob_worker:update_usage/1 which sets the
usage counter in ets to the number of actually created processes
regardless of the number requested and set by is_available.

Two issues around this race:

There are acquire (is_available) and use (start_child) phases
that a process can crash between, leaving riak with no more
capacity and no fsms running at all. This is imo serious and should
be addressed. This is essentially a lock with no timeout or unlock.
The limit on the number of processes in the worse case is
N(N+1)/2 since LIMIT processes could set the ETS counter, then
1 gets created and set the counter to 1, and then LIMIT-1 update
the ETS counter, and then 1 get created and set the counter to 2,
and LIMIT-2 update the counter, etc etc etc.

I said that I'd tag @paegun for this issue. You done been tagged.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sidejob races and eqc failures (related to #10?) #17

Sidejob races and eqc failures (related to #10?) #17

russelldb commented Oct 12, 2017 •

edited

Loading

Sidejob races and eqc failures (related to #10?) #17

Sidejob races and eqc failures (related to #10?) #17

Comments

russelldb commented Oct 12, 2017 • edited Loading

Caveats

Which children?

Limit not a limit

russelldb commented Oct 12, 2017 •

edited

Loading