Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix earthaccess.EarthAccessFile method lookup #620

Merged
merged 3 commits into from
Sep 23, 2024

Conversation

itcarroll
Copy link
Collaborator

@itcarroll itcarroll commented Jun 26, 2024

Fixes #610, closes #563.

This PR removes any base class from the definition of earthaccess.EarthAccessFile (EAF). Previously, EAF inherited from fsspec.spec.AbstractBufferedFile (ABF) so was capable of using methods defined on ABF. But EAF also held an instance inheriting from ABF at self.f and handed off __getattr__ requests to that instance. Under that setup, self.read returned super().read because read is defined on ABF, rather than returning self.f.read. That is a bug. It was probably assumed that __getattr__ would catch all method calls, but it only handles what __getattribute__ can't find.

We've scraped by with this setup because self.f (being an instance of an ABF) either does not override the called ABF method or the override does little more than itself call super(). The latter is the case for self.f.read when f is a fsspec.implementations.http.HTTPFile. It is not the case when f is a fsspec.implementations.http.HTTPStreamFile. (I discovered this bug while working on fsspec/filesystem_spec#1631.)

This PR also updates some type hints and relevant documentation.

  • the type hint on f was wrong, it is an ABF not a fsspec.AbstractFileSystem
  • type hints previously given as an ABF are now given as EAF (b/c it is no longer an instance of ABF)
  • the EAF is added to mkdocs un Modules/Store

ToDo if integration tests look okay:

  • add to changelog
  • check for any changes needed to docs

📚 Documentation preview 📚: https://earthaccess--620.org.readthedocs.build/en/620/

@mfisher87
Copy link
Collaborator

I need to play with this a little bit to better understand what's going on, but I may not have time until the next hack day.

@mfisher87 mfisher87 changed the title the earthaccess.EarthAccessFile wrapper need not subclass anything the earthaccess.EarthAccessFile wrapper need not subclass anything Jul 9, 2024
@chuckwondo
Copy link
Collaborator

I need to play with this a little bit to better understand what's going on, but I may not have time until the next hack day.

@mfisher87, have you had a chance to do this?

@itcarroll or @betolink, I suppose the larger question for me is, what's the point of this class to begin with? Why do we even need it?

@itcarroll
Copy link
Collaborator Author

Thanks for checking on this one @chuckwondo! My guess on the need for this class was something to do with deserializing into a useable object in case authentication timed out. A guess only though, as I don't know when EarthAccessFile.__reduce__ would be called.

@betolink
Copy link
Member

betolink commented Sep 3, 2024

@chuckwondo

I suppose the larger question for me is, what's the point of this class to begin with? Why do we even need it?

@jrbourbeau can explain in detail but the gist of it is that this class allows a serialization trick, if we open granules from our laptop but we are offloading an xarray operation to a Dask cluster in us-west-2, it will re-open the files in place using s3://url instead of the https://cloud-front-tea url. I think James mentioned that the speed improvement was ~2x vs only using HTTPS.

@itcarroll
Copy link
Collaborator Author

@jrbourbeau The essential question is whether removing fsspec.spec.AbstractBufferedFile from the MRO of EarthAccessFile will break the usage @betolink describes. I don't think we have a test using coiled. (We still have the file-like instance attached as the f attribute and used in the __getattr__ redirection.)

@betolink betolink requested a review from jrbourbeau September 3, 2024 18:15
@chuckwondo
Copy link
Collaborator

chuckwondo commented Sep 3, 2024

I suppose the larger question for me is, what's the point of this class to begin with? Why do we even need it?

@jrbourbeau can explain in detail but the gist of it is that this class allows a serialization trick, if we open granules from our laptop but we are offloading an xarray operation to a Dask cluster in us-west-2, it will re-open the files in place using s3://url instead of the https://cloud-front-tea url. I think James mentioned that the speed improvement was ~2x vs only using HTTPS.

I didn't notice the make_instance function earlier, but now that I've looked at it, your description is now more clear to me.

However, I'm questioning the whole idea of pickling the fsspec.spec.AbstractBufferedFile to begin with. Opening files from your laptop, then distributing the open files across a cluster seems like you may be inviting more trouble than it's worth. Pickling potentially oddly complex objects to distribute them to other processes can be more problematic than passing around simpler things, such as the URLs from which the files were opened, rather than the opened files themselves.

I may very well be thinking about this poorly, or simply be too inexperienced with using dask (and the like), but without seeing an example of the types of things you think this would be helpful for, offhand I would ask you why you're not simply distributing the URLs rather than the open files?

Is it so that you don't have to also potentially distribute credentials across such clusters as well?

@mfisher87
Copy link
Collaborator

have you had a chance to do this?

I have not yet, sorry :X

@codecov-commenter
Copy link

codecov-commenter commented Sep 23, 2024

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@chuckwondo
Copy link
Collaborator

I suppose the larger question for me is, what's the point of this class to begin with? Why do we even need it?

@jrbourbeau can explain in detail but the gist of it is that this class allows a serialization trick, if we open granules from our laptop but we are offloading an xarray operation to a Dask cluster in us-west-2, it will re-open the files in place using s3://url instead of the https://cloud-front-tea url. I think James mentioned that the speed improvement was ~2x vs only using HTTPS.

I didn't notice the make_instance function earlier, but now that I've looked at it, your description is now more clear to me.

However, I'm questioning the whole idea of pickling the fsspec.spec.AbstractBufferedFile to begin with. Opening files from your laptop, then distributing the open files across a cluster seems like you may be inviting more trouble than it's worth. Pickling potentially oddly complex objects to distribute them to other processes can be more problematic than passing around simpler things, such as the URLs from which the files were opened, rather than the opened files themselves.

I may very well be thinking about this poorly, or simply be too inexperienced with using dask (and the like), but without seeing an example of the types of things you think this would be helpful for, offhand I would ask you why you're not simply distributing the URLs rather than the open files?

Is it so that you don't have to also potentially distribute credentials across such clusters as well?

Regarding the specific changes in the PR, they look fine to me, but is anybody able to answer my question above?

Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@itcarroll
Copy link
Collaborator Author

pre-commit.ci autofix

Copy link
Collaborator

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @itcarroll! Apologies for the lack of engagement here. The changes here LGTM. Though not sure exactly what's going on with the integration tests

the gist of it is that this class allows a serialization trick, if we open granules from our laptop but we are offloading an xarray operation to a Dask cluster in us-west-2, it will re-open the files in place using s3://url instead of the https://cloud-front-tea url

Yeah, that's exactly right 👍 Obviously there's no http <--> s3 swapping when granules are opened and computed on in the same environment (i.e. both happen outside us-west-2 or both happen inside us-west-2)

The essential question is whether removing fsspec.spec.AbstractBufferedFile from the MRO of EarthAccessFile will break the usage @betolink describes

Nope -- we still keep the https <--> s3 behavior but fix the attribute lookup like you mentioned

tests/unit/test_store.py Show resolved Hide resolved
@itcarroll itcarroll requested review from jrbourbeau and removed request for mfisher87 September 23, 2024 20:02
@itcarroll
Copy link
Collaborator Author

@chuckwondo:

Pickling potentially oddly complex objects to distribute them to other processes can be more problematic than passing around simpler things, such as the URLs from which the files were opened, rather than the opened files themselves.

Some complexity will be needed to handle the s3 <--> https hand-off between laptops and in-region workers spun up by coiled. Here it's a pickled file-like object, but additional in_region checks in __store__._get_urls could be a better approach. It looks like you are okay with making that a separate issue for an enhancement, rather than addressing it in this bug fix.

@itcarroll
Copy link
Collaborator Author

No worries @jrbourbeau, thanks for the review!

Though not sure exactly what's going on with the integration tests

Someone with write-access needs to trigger them is all.

@chuckwondo
Copy link
Collaborator

No worries @jrbourbeau, thanks for the review!

Though not sure exactly what's going on with the integration tests

Someone with write-access needs to trigger them is all.

This is not yet possible. Once #808 lands, this will be possible. If you are willing to wait for #808 to be approved and merged to main, then you could merge main into your PR, and this will retrigger int. tests, which will then fail again, but then a maintainer will be able to re-run the failed int. tests so that they can pass (assuming nothing was broken).

@jrbourbeau
Copy link
Collaborator

Someone with write-access needs to trigger them is all
This is not yet possible

Not sure exactly what the state of integrations tests are but FWIW I went ahead and bumped those failing CI jobs and they're passing now

@chuckwondo
Copy link
Collaborator

Someone with write-access needs to trigger them is all
This is not yet possible

Not sure exactly what the state of integrations tests are but FWIW I went ahead and bumped those failing CI jobs and they're passing now

LOL! Apologies. I'm getting my PRs mixed up. I forgot that part already landed 🤦

@jrbourbeau jrbourbeau changed the title the earthaccess.EarthAccessFile wrapper need not subclass anything Fix earthaccess.EarthAccessFile method lookup Sep 23, 2024
@jrbourbeau jrbourbeau merged commit 68e19a2 into nsidc:main Sep 23, 2024
12 checks passed
@itcarroll itcarroll deleted the eafile-method-resolution branch September 24, 2024 01:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants