pickles break while "multiprocess"-ing dots #288
Replies: 8 comments
-
This is caused by multiprocessing's default use of pickle protocol v3 for transferring data between processes, which encodes its payload size as a signed 32-bit integer. multiprocess is a fork of multiprocessing that replaces pickle with dill, but in my experience it doesn't seem to be able to override the protocol version, which is likely a bug similar to this one. In any case, we should also look into shedding some weight by scrutinizing how much data is getting serialized and sent to worker processes: chunks = map_(job, tiles, **map_kwargs). I doubt that the individual tiles can be made lighter, but there seems to be heavy stuff going into the …
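For reference, here is a minimal sketch of the size-field limit itself - it reproduces the struct error from such tracebacks using the same signed 32-bit ("i") length field that pre-3.8 multiprocessing writes in front of each pickled payload (illustration only, not cooltools code):

import struct

# A signed 32-bit length field tops out at 2**31 - 1, so a pickled payload
# of 2 GiB or more fails before it is even written to the pipe.
struct.pack("!i", 2**31 - 1)      # fine: 2147483647 bytes
try:
    struct.pack("!i", 2**31)      # one byte over the limit
except struct.error as err:
    print(err)                    # 'i' format requires -2147483648 <= number <= 2147483647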
-
so, input counts as well?!
it is the number of bytes, right? not … - this turns out to be a bit more complicated/delicate thing …
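One quick way to see what actually counts against the limit is to measure the pickled byte stream of whatever gets shipped to a worker; a rough sketch with a made-up dataframe standing in for a chunk (the column names are invented, not the real call-dots schema):

import pickle

import numpy as np
import pandas as pd

# Made-up stand-in for a single chunk that would be sent to a worker.
chunk = pd.DataFrame({
    "bin1_id": np.arange(100_000, dtype=np.int64),
    "bin2_id": np.arange(100_000, dtype=np.int64),
    "count": np.random.poisson(5, 100_000),
})

payload = pickle.dumps(chunk, protocol=3)        # protocol 3 is the default on Python < 3.8
print(f"{len(payload):,} bytes")                 # this byte count must stay below 2**31 - 1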
-
Well, not all at once! Plus, I doubt it will be less efficient than pickling + IPC + unpickling 1000 times.
Yup. Another possibility to consider is the …
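To put a rough number on that comparison, here is a crude timing sketch of just the pickle round-trips - it ignores the IPC itself and uses a made-up array as a stand-in for a tile:

import pickle
import time

import numpy as np

tile = np.random.rand(1_000_000)                  # ~8 MB stand-in for one tile

start = time.perf_counter()
pickle.loads(pickle.dumps(tile, protocol=4))
print(f"1 round-trip:     {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
for _ in range(1000):
    pickle.loads(pickle.dumps(tile, protocol=4))
print(f"1000 round-trips: {time.perf_counter() - start:.3f} s")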
-
The undocumented patch seems to work on 3.6:

import multiprocessing as mp

class ForkingPickler4(mp.reduction.ForkingPickler):
    def __init__(self, *args):
        # force pickle protocol 4 regardless of what the caller passed
        args = list(args)          # *args arrives as a tuple; make it mutable
        if len(args) > 1:
            args[1] = 4
        else:
            args.append(4)
        super().__init__(*args)

    @classmethod
    def dumps(cls, obj, protocol=4):
        print("USING VERSION 4!!!")
        return mp.reduction.ForkingPickler.dumps(obj, protocol)

class Pickle4Reducer(mp.reduction.AbstractReducer):
    # plug the protocol-4 pickler into multiprocessing's reduction machinery
    ForkingPickler = ForkingPickler4
    register = ForkingPickler4.register

    def dump(self, obj, file, protocol=4):
        ForkingPickler4(file, protocol).dump(obj)

ctx = mp.get_context()
ctx.reducer = Pickle4Reducer()     # install before creating the pool

def foo(x):
    print(x)

with mp.Pool(3) as pool:
    pool.map(foo, ['a', 'b', 'c'])
-
It's not about efficiency, though - just to overcome the 2GB limitation, right? I found another weak spot in the current implementation #51 - we need to stop annotating chunks and overall trim them down. 2GB is still too small for us, as we might be passing ~… I'd personally focus on optimizations - trimming dataframes down, because …
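A sketch of the "trim the dataframes down" idea: drop columns the workers don't need and downcast wide dtypes, then compare the pickled sizes. All names below are invented for illustration, not the actual call-dots columns:

import pickle

import numpy as np
import pandas as pd

n = 500_000
chunk = pd.DataFrame({
    "bin1_id": np.arange(n, dtype=np.int64),
    "bin2_id": np.arange(n, dtype=np.int64),
    "count": np.random.poisson(5, n).astype(np.int64),
    "expected": np.random.rand(n),
    "annotation": ["some long label"] * n,        # heavy object-dtype column
})

def trim(df):
    # keep only what a worker actually needs and shrink the integer dtypes
    out = df[["bin1_id", "bin2_id", "count", "expected"]].copy()
    out["count"] = out["count"].astype(np.int32)
    return out

print(f"full:    {len(pickle.dumps(chunk, protocol=4)):,} bytes")
print(f"trimmed: {len(pickle.dumps(trim(chunk), protocol=4)):,} bytes")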
-
this bites again ... strikes back ... Observations: …
From that I can understand that: … I'll describe other thoughts about the whole …
-
I stumbled upon this issue when doing pileups, and for me it was solved by updating to Python 3.8 (python/cpython#10305). Would that be possible for this use case as well?
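For context (my reading of that CPython change, so treat it as a hedged note): python/cpython#10305, released with Python 3.8, lets multiprocessing send payloads of 2 GiB and larger by switching to an 8-byte length header, so on 3.8+ no pickler patching should be needed. A trivial guard along those lines:

import sys

# Assumption based on python/cpython#10305: interpreters older than 3.8
# still hit the signed 32-bit header limit when sending >= 2 GiB pickles.
if sys.version_info < (3, 8):
    print("warning: payloads of 2 GiB or more will fail with struct.error here")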
-
@sergpolly - is this still an issue?
-
more of a docs/reminder rather than an issue
… call-dots into modular steps, returned to scoring_step by @nvictus in call-dots. Testing on "big"-data yielded a somewhat familiar multiprocessing/pickle error: 500,000X30 … call-dots instance that didn't use @nvictus's scoring_step. I could not find a corresponding issue anywhere. pickle is calculating the total number of elements - looks like it does columns*rows times the number of dataframes, otherwise the math does not work out (>= 2147483647). Is it indeed the case @nvictus @mimakaev @golobor? What if it were to be a bunch of string-s of total length > 2bln?! https://stackoverflow.com/questions/47776486/python-struct-error-i-format-requires-2147483648-number-2147483647 - indeed says something about counting elements in each object ... dask? or at least multipro-something that is using dill - @nvictus?
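On the elements-vs-bytes question above, a small sanity check (not an authoritative answer): pickling a list of dataframes produces a single byte stream roughly the size of the sum of its items, and it is that total byte count that has to fit into the signed 32-bit field. A scaled-down sketch (50,000x30 instead of 500,000x30 to keep it light):

import pickle

import numpy as np
import pandas as pd

# Four independent dataframes so that pickle's memo cannot deduplicate them.
dfs = [pd.DataFrame(np.random.rand(50_000, 30)) for _ in range(4)]

one = len(pickle.dumps(dfs[0], protocol=4))
four = len(pickle.dumps(dfs, protocol=4))
print(f"one dataframe: {one:,} bytes")
print(f"list of four:  {four:,} bytes (~{four / one:.2f}x)")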