
Support persistent function.__globals__ #411

Open

faucct wants to merge 2 commits into master from feature/persistent-function-globals

Conversation


faucct commented Feb 20, 2021

Support Pickler#persistent_id and Unpickler#persistent_load for __globals__ of functions:

# Excerpt from a unittest.TestCase test method; assumes io, pickle and
# cloudpickle are imported at module level.
__globals__ = {"a": "foo"}

class Pickler(cloudpickle.CloudPickler):
    @staticmethod
    def persistent_id(obj):
        if id(obj) == id(__globals__):
            return "__globals__"

class Unpickler(pickle.Unpickler):
    @staticmethod
    def persistent_load(pid):
        return {"__globals__": __globals__}[pid]

get = eval('lambda: a', __globals__)
file = io.BytesIO()
Pickler(file).dump(get)
dumped = file.getvalue()
self.assertNotIn(b'foo', dumped)
get = Unpickler(io.BytesIO(dumped)).load()
self.assertEqual(id(__globals__), id(get.__globals__))
self.assertEqual('foo', get())
__globals__['a'] = 'bar'
self.assertEqual('bar', get())

@faucct faucct force-pushed the feature/persistent-function-globals branch from 686b224 to 5477bd7 Compare February 20, 2021 06:39
@faucct faucct force-pushed the feature/persistent-function-globals branch from 5477bd7 to bbe76e9 Compare February 20, 2021 06:56

codecov bot commented Feb 20, 2021

Codecov Report

Merging #411 (7d1ae89) into master (cdc704d) will increase coverage by 0.05%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #411      +/-   ##
==========================================
+ Coverage   91.61%   91.66%   +0.05%     
==========================================
  Files           3        3              
  Lines         656      660       +4     
  Branches      135      137       +2     
==========================================
+ Hits          601      605       +4     
  Misses         34       34              
  Partials       21       21              
Impacted Files Coverage Δ
cloudpickle/cloudpickle_fast.py 96.79% <100.00%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cdc704d...7d1ae89. Read the comment docs.

@pierreglaser
Member

Hi, thank you for the PR.

If I understood correctly, this feature allows for a pickled function's __globals__ to be reconciled at unpickling time with potentially pre-existing __globals__. __globals__ handling is a sensitive issue in cloudpickle (see #240, #216, #214, #188). I'm giving some additional context at the end of this comment.

Could you provide more context on the use case for which you would need this feature?

Additional context

We did some back and forth in the past in cloudpickle w.r.t __globals__ handling, but the current behavior of cloudpickle is essentially:

  • functions sharing a __globals__ and pickled in the same pickle.dump call will share __globals__ in the unpickling environment
  • functions pickled in separate pickle.dump calls will have isolated __globals__ at unpickling time, even if they shared the same __globals__ at pickling time.

We explored globals reconciliation as an opt-in feature in #216, but ended up not merging it.
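
For reference, a minimal sketch (not part of the original thread; it assumes cloudpickle is installed and relies on the behavior described in the two bullets above) illustrating this:

import pickle
import cloudpickle

# Two dynamic functions sharing one globals dict.
shared_globals = {"x": 1}
f1 = eval("lambda: x", shared_globals)
f2 = eval("lambda: x + 1", shared_globals)

# Pickled in the same dump call: the unpickled functions share __globals__.
g1, g2 = pickle.loads(cloudpickle.dumps((f1, f2)))
assert g1.__globals__ is g2.__globals__

# Pickled in separate dump calls: each unpickled function gets its own,
# isolated __globals__ dict.
h1 = pickle.loads(cloudpickle.dumps(f1))
h2 = pickle.loads(cloudpickle.dumps(f2))
assert h1.__globals__ is not h2.__globals__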


faucct commented Feb 21, 2021

Our cloud-hosted Jupyter is a serverless solution which uses lazy-loading __globals__ dicts that are managed entirely by us. We want to be able to restore references to __globals__ after serialization, and we don't want its contents to be dumped. This feature does not affect anyone who does not use persistent_id/persistent_load, so it looks safe to merge.

pierreglaser added a commit to pierreglaser/cloudpickle that referenced this pull request Feb 21, 2021

faucct commented Apr 9, 2021

Bump


pierreglaser commented Apr 10, 2021

This feature does not affect anyone who does not use persistent_id/persistent_load, so it looks safe to merge.

A quick search on GitHub shows example CloudPickler subclasses that implement persistent_id. Example:

https://github.com/venkattgg/venkey/blob/796b9bdfb2fa1b881d82080754643c7e68629cd2/oss_src/unity/python/sframe/_gl_pickle.py#L300-L364

(Implementing persistent_id in CloudPickler subclasses causes Pickler.save (a method which is not re-implemented in CloudPickler) to execute persistent_id-specific code, see https://github.com/python/cpython/blob/e05a703848473b0365886dcc593cbddc46609f29/Lib/pickle.py#L539-L542)
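
To make the discussion concrete, here is a hypothetical sketch (the ExternalHandle class and the "external" tag are invented for illustration, not taken from the linked code) of the kind of CloudPickler subclass being referred to: persistent_id tags selected objects so that Pickler.save emits a persistent reference instead of pickling the object's contents.

import io
import cloudpickle

class ExternalHandle:
    # Hypothetical stand-in for an object that lives outside the pickle stream.
    def __init__(self, name):
        self.name = name

class MyPickler(cloudpickle.CloudPickler):
    def persistent_id(self, obj):
        if isinstance(obj, ExternalHandle):
            return ("external", obj.name)  # a reference resolved at load time
        return None  # everything else is pickled as usual

buf = io.BytesIO()
MyPickler(buf).dump(ExternalHandle("dataset-1"))
# Only the persistent id tuple ends up in the stream; loading it back requires
# an Unpickler whose persistent_load resolves ("external", "dataset-1").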

There is a risk that the changes in this PR might break code relying on such subclasses, since cloudpickle's behavior will differ from its previous behavior when persistent_id is implemented. If I'm missing something, @faucct, please let me know.

@ogrisel if you have any thoughts on this, I'm happy to hear them :)


faucct commented Apr 11, 2021

I meant that this won’t break anything for people who don’t intentionally save globals in persistent_id. For those who do, this change is more like a fix. The link you’ve shared gives persistent ids to custom classes, not to dict, so this change does not affect them:
https://github.com/venkattgg/venkey/blob/796b9bdfb2fa1b881d82080754643c7e68629cd2/oss_src/unity/python/sframe/_gl_pickle.py#L67


pierreglaser commented Apr 11, 2021

I meant that this won’t break anything for people who don’t intentionally save globals in persistent_id.

Agreed.

For those who do, this change is more like a fix.

I don't quite agree with that for now; you would need to elaborate on why this is a fix for anyone using persistent_id alongside CloudPickler :)

The link you’ve shared gives persistent ids to custom classes, not to dict, so this change does not affect them.

That is true. The example I linked is meant to highlight the fact that there exist code chunks out there, some open source and surely others closed source, consisting of a CloudPickler subclass implementing persistent_id. In the possible case where such a subclass's persistent_id handles dict objects (a case whose existence we cannot easily rule out given that not all code is open source), I stand by the fact that this PR may introduce a breaking change.

There are two ways I see this PR moving forward (other than me being wrong, in which case, feel free to explain why):

  • either some other cloudpickle core developers chime in and consider that such a breaking change is acceptable, in which case we may merge this PR (ogrisel may give feedback within a week or two);
  • or we find a way to modify the changes in this PR to remove the risk of breaking changes.


faucct commented Apr 12, 2021

Here is why I think of this change as a fix. The docs say that Pickler#persistent_id should be customized if some objects are references to something outside the pickled data stream – https://docs.python.org/3/library/pickle.html#pickle-persistent:

For the benefit of object persistence, the pickle module supports the notion of a reference to an object outside the pickled data stream.

If users customize CloudPickler in this way, they (myself included) expect that any references to such an object will be restored via Unpickler#persistent_load:

To unpickle external objects, the unpickler must have a custom persistent_load() method that takes a persistent ID object and returns the referenced object.

The same logic applies to a custom CloudPickler#persistent_id for dicts. If a user defines it only for globals(), then they obviously want this change, because globals() is mostly referenced by functions. If a user defines it for all dicts (or even all objects), which does not look like a valid case since almost everything in Python is backed by a dict, then so be it: the user does not want any of them to end up in the pickled data stream, neither whole nor partial, and wants references to them to be restored after unpickling:

To unpickle external objects, the unpickler must have a custom persistent_load() method that takes a persistent ID object and returns the referenced object.


ogrisel commented Jul 1, 2021

To me, this seems like a bit of an abuse of the protocol for persistent_id / persistent_load as defined in the pickle documentation for external objects. Indeed, cloudpickle never attempts to pickle a full __globals__ dict in its entirety, so it does not really make sense to allow the user to subclass the pickler and implement persistent_id to customize the way a full __globals__ dict would be handled.

Here the use case seems to be more about how to customize the pickling of dynamically defined functions. To achieve this goal, I would instead recommend overriding the CloudPickler._dynamic_function_reduce method to do what you want in your code.

I agree that this is a private API, and is not guaranteed to be stable when upgrading the version of cloudpickle, though.


faucct commented Jul 1, 2021

I don't think of this as an abuse of the protocol for persistent_id/persistent_load: our globals() object is external.
Your suggestion to override the private method CloudPickler._dynamic_function_reduce might solve some consequences of the problem we are dealing with, but it won't solve the exact problem of globals() being dumped when it is referenced. And that is the thing we are trying to solve by defining persistent_id/persistent_load for it, so we will continue doing that.
Will you support this use of the protocol?

@pierreglaser
Member

but it won't solve the exact problem of globals() being dumped when it is referenced

Mmh, does that problem actually exist on cloudpickle master? I imagine that by subclassing the CloudPickler class and defining a persistent_id/persistent_load pair on the subclass that handles globals dictionaries the way you want them to be handled, you have the ability to not pickle the content of said globals dictionaries.


faucct commented Jul 1, 2021

I meant that we will define persistent_id/persistent_load for globals() regardless. But this won't fix the pickling of functions. To solve that problem, either this change should be accepted, or we should lock onto the private API CloudPickler._dynamic_function_reduce. I don't like the idea of depending on an unstable API.


pierreglaser commented Jul 1, 2021

@faucct, I have an idea that may solve your use-case:
what if we add an extra item inside f_globals dictionaries to signal that they are dictionaries containing global variables referenced by a function? Something like

f_globals['__cloudpickle_globals'] = True

You can then detect and handle such dictionaries using your own persistent_id/persistent_load the way you want.
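
A minimal sketch (not from the thread) of how such a marker could be consumed on the user side, assuming cloudpickle actually inserted the proposed, hypothetical '__cloudpickle_globals' key:

import pickle
import cloudpickle

MANAGED_GLOBALS = {"a": "foo"}  # the externally managed globals dict (illustrative)

class MyPickler(cloudpickle.CloudPickler):
    def persistent_id(self, obj):
        # Detect globals dicts via the proposed marker instead of comparing ids.
        if isinstance(obj, dict) and obj.get("__cloudpickle_globals"):
            return "managed-globals"
        return None

class MyUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        if pid == "managed-globals":
            return MANAGED_GLOBALS  # hand back the live dict, not a copy
        raise pickle.UnpicklingError("unknown persistent id: %r" % (pid,))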


ogrisel commented Jul 1, 2021

Your suggestion to override the private method CloudPickler._dynamic_function_reduce might solve some consequences of the problem we are dealing with, but it won't solve the exact problem of globals() being dumped when it is referenced.

I don't understand this sentence. If you customize _dynamic_function_reduce you can do whatever you want by replacing the FuncType constructor with your own constructor (one that uses FuncType internally but would use your own globals dict instead of the one extracted by cloudpickle._function_getstate).


faucct commented Jul 1, 2021

Your suggestion to override the private method CloudPickler._dynamic_function_reduce might solve some consequences of the problem we are dealing with, but it won't solve the exact problem of globals() being dumped when it is referenced.

I don't understand this sentence. If you customize _dynamic_function_reduce you can do whatever you want by replacing the FuncType constructor with your own constructor (one that uses FuncType internally but would use your own globals dict instead of the one extracted by cloudpickle._function_getstate).

You have said that this is a private API here:

I agree that this is a private API, and is not guaranteed to be stable when upgrading the version of cloudpickle, though.


faucct commented Jul 1, 2021

@faucct, I have an idea that may solve your use-case:
what if we add an extra item inside f_globals dictionaries to signal that they are dictionaries containing global variables referenced by a function? Something like

f_globals['__cloudpickle_globals'] = True

You can then detect and handle such dictionaries using your own persistent_id/persistent_load the way you want.

Will we have to add this key to user globals()? Also, I don't understand how this would help with deserializing functions so that they would reference the current globals().


pierreglaser commented Jul 1, 2021

For you to be able to distinguish between different __globals__ dicts, we could even do something like:

f_globals['__cloudpickle_globals_id'] = id(func.__globals__)

In my opinion, such a change solves your use case as well as the current PR does. Moreover, this change does not rely on code branching depending on the return value of persistent_id(func.__globals__). Given that func.__globals__ is an object which is never pickled in the first place, the current logic is in my opinion both a breaking change in cloudpickle and an abuse of the persistent_id method.

Will we have to add this key to user globals()?

Given the current code in your PR, I'm assuming your application already contains some code able to detect __globals__ dictionaries (by looping through sys.modules and checking for identity, for instance). So in my opinion, making sure that code such as cloudpickle.dumps(func.__globals__) does not pickle the actual contents of func.__globals__ is a concern that is orthogonal to the changes introduced in this PR.
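
For concreteness, a small sketch (invented here, not from the thread) of the kind of identity-based detection mentioned above:

import sys

def is_module_globals(d):
    # Return True if d is the __dict__ of a currently imported module;
    # a module's __dict__ is the live __globals__ of functions defined in it.
    return any(getattr(mod, "__dict__", None) is d for mod in sys.modules.values())

assert is_module_globals(vars(sys))
assert not is_module_globals({"a": 1})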


ogrisel commented Jul 1, 2021

I think that the "__cloudpickle_globals_id" entry in the globals dict should be cleaned up automatically at unpickling time. This means that we should use a custom alternative to FuncType to do this.


ogrisel commented Jul 1, 2021

We could also subclass dict with a custom class that would hold this link to the original globals dict, possibly as a weak reference, and we could implement the reducer to rebuild a regular dict instead at unpickling time.

Actually, it does not even need to subclass dict or hold a weakref.


faucct commented Jul 1, 2021

I have stopped understanding the solutions you suggest. :( Could you adapt the example from the description to them, please?


ogrisel commented Jul 1, 2021

Actually, I don't think it would work, since the original func.__globals__ dict is split into a base_globals dict with the basic func metadata and a second, filtered globals dict with the actual content of the interesting symbols referenced in func.__code__.

The problem is that to implement the use case correctly with the persistent_id API, both of them would have to be intercepted. That would make the code not very intuitive and possibly hard to maintain.


ogrisel commented Jul 1, 2021

I am working on a prototype to make things clearer.


ogrisel commented Jul 1, 2021

@faucct @pierreglaser feel free to have a look at: #430.
