
Fundamental incompatibility of Threading and Multiprocessing (Python) #230

Open

martinpakosch opened this issue Oct 13, 2023 · 2 comments

@martinpakosch (Collaborator)
Hi @knipknap, hi community,

Over the last few days - after migrating some old environments from Python 2.7 with Exscript 2.6.3/2.6.22 to Python 3.11 with Exscript 2.6.28 - we have encountered some very strange and unpredictable deadlocks while using the exscript script to run Exscript templates. At first we thought these issues were related to our new environment, then perhaps to bugs in the locks handled by Exscript, but we finally found out that this is a fundamental issue in Python itself! More about this later...

Exscript is internally designed to rely on threading for some of its major functionalities, such as the MainLoop in the WorkQueue, the Jobs, and some pipeline-related code. See https://github.com/search?q=repo%3Aknipknap%2Fexscript%20threading.Thread&type=code. Additionally, around 12 years ago, support for multiprocessing was introduced (e.g., e43f6b7, 0baf858). From that point on, Exscript by default behaved differently in the two major use-cases:

  1. When using the exscript script to simply execute Exscript templates, Exscript was started in multiprocessing mode. This essentially means:
    a. one main process
    b. many internal threads
    c. one multiprocessing.Process for each job (=host)
  2. When using the Exscript API from Python for more complex scenarios, everything was pinned to threading mode by default, which was good, because then we essentially get:
    a. one main process
    b. many internal threads
    c. one threading.Thread for each job (=host)
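For illustration, the thread-per-job shape of use-case 2 can be sketched with the standard library alone (the run_template function and host names are hypothetical placeholders, not Exscript's actual internals):

```python
# Illustrative sketch (not Exscript code) of the use-case 2 shape:
# one main process, a pool of worker threads, one task per job (=host).
from concurrent.futures import ThreadPoolExecutor

def run_template(host):
    # Placeholder for connecting to `host` and running a template.
    return "done: " + host

hosts = ["r1.example.net", "r2.example.net", "r3.example.net"]
with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map preserves the input order of the hosts.
    results = list(pool.map(run_template, hosts))
print(results)
```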

So, you might wonder why all this is a big deal. ;-) Well, the problem is that - prompted by recent changes in Python 3.12 - we found out that the combination of multiprocessing and threading is not safe/stable on any POSIX system, and is not considered stable by the CPython implementation either! Rather than repeating everything I found out about this fundamental issue in Python, I would like to share these articles/posts that explain the problem:

  1. https://pythonspeed.com/articles/faster-multiprocessing-pickle/
  2. https://pythonspeed.com/articles/python-multiprocessing/
  3. https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
  4. https://docs.python.org/3/library/os.html#os.fork
  5. https://discuss.python.org/t/concerns-regarding-deprecation-of-fork-with-alive-threads/33555/2
  6. multiprocessing's default posix start method of 'fork' is broken: change to 'forkserver' || 'spawn' (python/cpython#84559)
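To make the failure mode the links describe concrete, here is a minimal, POSIX-only sketch (unrelated to Exscript's code) of why fork plus live threads deadlocks: fork() copies only the calling thread, so a lock held by any other thread at fork time stays locked forever in the child.

```python
import os
import threading
import time

lock = threading.Lock()

def holder():
    with lock:
        time.sleep(2)  # hold the lock across the fork

t = threading.Thread(target=holder)
t.start()
time.sleep(0.2)  # make sure the lock is actually held

pid = os.fork()
if pid == 0:
    # Child process: the `holder` thread does not exist here, yet the
    # copied lock is still in the acquired state -> without a timeout
    # this acquire() would block forever (the deadlock).
    acquired = lock.acquire(timeout=0.5)
    os._exit(0 if acquired else 1)

# Parent: exit code 1 means the child timed out waiting for the lock.
child_exit = os.waitstatus_to_exitcode(os.waitpid(pid, 0)[1])
print("child could acquire the lock:", child_exit == 0)
t.join()
```

On Python 3.12+ this also emits a DeprecationWarning, since calling os.fork() in a multi-threaded process is now explicitly flagged as unsafe.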

Quick solution:
Coming back to our issue: we were able to eliminate the deadlocks by changing the default mode in the exscript script from multiprocessing to threading.
See:

mode = 'multiprocessing',

If most of you have not encountered such deadlocks in use-case 1 so far, you may simply have been lucky - as we were for the past 10 years running our older environments! But that does not change anything about the fundamental issue here.

Long-term solution:
The change above affects only the exscript script and should be good enough as a quick fix. In the long term, @knipknap @bigmars86, you should think about a fundamental fix - if that is possible at all - or about dropping multiprocessing support from Exscript. What I have found out so far regarding possible solutions:

  1. The default start method for new processes in Python on Linux is fork, relying on os.fork(), which copies the parent process - but only the calling thread - along with almost(!) all of its state/data; see the articles above for the full explanation. This will be dropped as the default in Python 3.14, making spawn the new default. But spawn and forkserver create clean child processes and fully rely on e.g. pickling to move data down to these children - a problem in case Exscript fundamentally relies on full process copies.
  2. I tried to do this in the exscript script with the current code base according to https://docs.python.org/3/library/multiprocessing.html#multiprocessing.set_start_method, using both alternate methods, and failed: Exscript is then unable to pickle some local-scope objects - among possibly other issues, as I stopped digging here... Example:
Traceback (most recent call last):
  File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 52, in run
    self.child.start(to_self)
  File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 94, in start
    base.start(self)
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/context.py", line 300, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/usr/lib64/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/usr/lib64/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_make_process_class.<locals>.process_cls'

# After "fixing" this by creating static public custom classes for Process and Thread in Exscript/workqueue/job.py,
# I encountered the next one... stopped digging here. ;-)

Traceback (most recent call last):
  File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 52, in run
    self.child.start(to_self)
  File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 152, in start
    super(Process, self).start()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/context.py", line 300, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/usr/lib64/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/usr/lib64/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_prepare_connection.<locals>._wrapped'
  3. For me this is beyond my skills - to be honest, I am happy to have understood this stuff up to this point. :D So I assume that fixing this - basically making Exscript compatible with the spawn or forkserver method for multiprocessing - will take some bigger effort, unless you already know where to get your hands dirty.
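The tracebacks above boil down to a documented pickle limitation: classes and functions defined inside a local scope cannot be pickled by reference, and spawn/forkserver must pickle the Process object to hand it to the clean child. A minimal sketch (the make_process_class function is a hypothetical stand-in mirroring the local-class pattern from the traceback, not Exscript's actual code):

```python
import multiprocessing as mp
import pickle

def make_process_class():
    # A class defined inside a function, similar in spirit to the
    # _make_process_class.<locals>.process_cls from the traceback.
    class ProcessCls(mp.Process):
        pass
    return ProcessCls

cls = make_process_class()
try:
    pickle.dumps(cls)
    picklable = True
except (AttributeError, pickle.PicklingError):
    # CPython raises "Can't pickle local object '...'" here, because
    # pickle serializes classes by qualified name, and a name containing
    # '<locals>' cannot be looked up again in the child process.
    picklable = False
print("local class picklable:", picklable)

# Under the fork start method this never comes up, because the child
# is a memory copy of the parent and nothing needs to be pickled.
print(mp.get_all_start_methods())
```

So making Exscript spawn/forkserver-compatible means moving every class and callable that crosses the process boundary to module level, which matches the "static public custom classes" direction tried above.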

That's from my side. Hope this helps in any way.
Cheers, Martin

@knipknap (Owner)

Yeah, I doubt this can or should be solved without a major redesign of the whole queuing mechanism.
By now, Python comes with good support for pooling, and the whole Queue class could be redesigned on top of it (or on top of another third-party library for handling pools).
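As a hedged sketch of that pooling direction (not a proposed implementation): a ProcessPoolExecutor with an explicit start-method context sidesteps the fork+threads hazard, at the price that every job callable must be picklable and defined at module level. The job function and host names are placeholders.

```python
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp

def job(host):
    # Placeholder job; a real redesign would open a connection to
    # `host` and run the template here. Must live at module level so
    # that spawn/forkserver children can pickle and re-import it.
    return host.upper()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # explicit, fork-free start method
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
        print(list(pool.map(job, ["r1", "r2"])))
```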

But I would guess it simply isn't worth it. Multiprocessing support was originally introduced to solve a performance bottleneck, which happened in two scenarios:

  • Using the Exscript CLI tool with > 30 concurrent connections. I doubt that this limitation is still an issue on current hardware.
  • When running Exscript workers on IPS with high concurrency. These days, this could more easily be solved by scaling the number of workers and handling queuing exclusively in RabbitMQ. It probably no longer makes sense to use any Exscript-internal queuing at all. BUT this may require a synchronization mechanism that handles locking of TACACS accounts, which may or may not yet exist outside of Exscript. IIRC, though, @bigmars86 avoided this problem by simply assigning separate accounts to every single instance of a worker. Not quite scalable, but a workable workaround.

That's what I suspect, at least. Whether it turns out to be true in practice remains to be tested. If not - it's redesign time.

@martinpakosch (Collaborator, Author)

Hi Samuel,

thanks for your explanation.

  1. Currently, the Exscript CLI help suggests using only 1-20 concurrent connections:
    Maximum number of concurrent connections.

    But you are right, there surprisingly seems to be no hard limit in the internals. :D
  2. You are right that a 1:1 mapping of accounts to external workers does not scale very well, but at least it seems to work.
    a. Regarding an external sync for TACACS accounts: sounds like a queue encapsulating an Exscript.Queue. 🤔🙈
    b. Regarding RabbitMQ: that's a topic for @bigmars86. :D

Well, I can tell you that we once did a config rollout across 600 routers in 5 minutes, with 20 concurrent connections and an account pool of 20 accounts - and yes, that was on our old environment, working fine with "multiprocessing" in the CLI tool.

On the other hand, I would agree that nowadays threading shouldn't be much of a bottleneck in most I/O-bound scenarios when automating tasks on network elements - maybe other users can comment on their setups? I would expect that nowadays any high-cost pre- or post-processing of data would be out-sourced to dedicated multiprocessing tools/scripts instead of trying to solve everything with the "old" (but still fine ;-)) Exscript Queue.

So yes, I also doubt that going for a redesign would make much sense nowadays - at least for our use-cases - and I would also reject any quick hacks just to save effort. The consequence would then be to drop official multiprocessing support in Exscript for now (simply making the "multiprocessing" mode invalid) - maybe until someone has the time/passion for such a redesign... ;-)

What do you think?
