Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sidejob races and eqc failures (related to #10?) #17

Open
russelldb opened this issue Oct 12, 2017 · 0 comments
Open

Sidejob races and eqc failures (related to #10?) #17

russelldb opened this issue Oct 12, 2017 · 0 comments

Comments

@russelldb
Copy link
Member

russelldb commented Oct 12, 2017

Supervisor_eqc:prop_par/0 and supervisor_eqc:prop_pulse/0 have shown a couple of races in sidejob.

Caveats

It may well be that the original author was content with these races,
as the code is supposed to be performance critical, they may have felt
the windows open for incorrectness were acceptable, but there is no
remaining documentation of such a decision. It could be that the
author of the code and the author of the tests did not communicate and
the tests make wrong assumptions.

Which children?

The property tests show a problem with
sidejob_supervisor:which_children/0 there are checked in
counter_examples in the nhs-riak repo (where this issue is duplicated
for now) that show this issue, though I don't yet have a fix.

https://github.com/nhs-riak/sidejob/blob/rdb/2.0.1-nhs-2.2.5/test/which_children_ce.eqc

Limit not a limit

There are two races around the process limit for sidejob_supervisor,
one is a check then act in `sidejob:is_availble/3'
https://github.com/basho/sidejob/blob/rdb/2.0.1-nhs-2.2.5/src/sidejob.erl#L144
here the ets lookup (check) counter update, and set of full (act) are
all capable of being interleaved. In one attempt to fix I added a
gen_server per worker-ets table and serialised access to the worker
ets with the gen_server, thus turning this group of commands into an
atomic call. This fixed this one race only

The larger issue is a larger window for a
race. sidejob:is_available/3 will increment the counter for a
worker-ets table, when LIMIT is met no more workers should be
created. However, after a worker is created be sidejob_supervisor
there is a call to sidejob_worker:update_usage/1 which sets the
usage counter in ets to the number of actually created processes
regardless of the number requested and set by is_available.

Two issues around this race:

  1. There are acquire (is_available) and use (start_child) phases
    that a process can crash between, leaving riak with no more
    capacity and no fsms running at all. This is imo serious and should
    be addressed. This is essentially a lock with no timeout or unlock.

  2. The limit on the number of processes in the worse case is
    N(N+1)/2 since LIMIT processes could set the ETS counter, then
    1 gets created and set the counter to 1, and then LIMIT-1 update
    the ETS counter, and then 1 get created and set the counter to 2,
    and LIMIT-2 update the counter, etc etc etc.

I said that I'd tag @paegun for this issue. You done been tagged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant