Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enumerate unzipped files #10842

Conversation

ridoo
Copy link
Contributor

@ridoo ridoo commented Mar 23, 2023

The last base_file candidate overrides previously ones. Also, different files with the same extension (file1.csv, file2.csv) are not be accessible from file_paths as they get overridden, too.

An enumerated list of unzipped files should resolve this.

Checklist

Reviewing is a process done by project maintainers, mostly on a volunteer basis. We try to keep the overhead as small as possible and appreciate if you help us to do so by completing the following items. Feel free to ask in a comment if you have troubles with any of them.

For all pull requests:

  • Confirm you have read the contribution guidelines
  • You have sent a Contribution Licence Agreement (CLA) as necessary (not required for small changes, e.g., fixing typos in the documentation)
  • Make sure the first PR targets the master branch, eventual backports will be managed later. This can be ignored if the PR is fixing an issue that only happens in a specific branch, but not in newer ones.

The following are required only for core and extension modules (they are welcomed, but not required, for contrib modules):

  • There is a ticket in https://github.com/GeoNode/geonode/issues describing the issue/improvement/feature (a notable exemption is, changes not visible to end-users)
  • The issue connected to the PR must have Labels and Milestone assigned
  • PR for bug fixes and small new features are presented as a single commit
  • Commit message must be in the form "[Fixes #<issue_number>] Title of the Issue"
  • New unit tests have been added covering the changes, unless there is an explanation on why the tests are not necessary/implemented
  • This PR passes all existing unit tests (test results will be reported by travis-ci after opening this PR)
  • This PR passes the QA checks: black geonode && flake8 geonode
  • Commits changing the settings, UI, existing user workflows, or adding new functionality, need to include documentation updates
  • Commits adding new texts do use gettext and have updated .po / .mo files (without location infos)

Submitting the PR does not require you to check all items, but by the time it gets merged, they should be either satisfied or inapplicable.

@cla-bot
Copy link

cla-bot bot commented Mar 23, 2023

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @github-actions[bot] on file. In order for us to review and merge your code, please contact the project maintainers to get yourself added.

@gannebamm
Copy link
Contributor

@ridoo, please take a look at your commit history.

@ridoo ridoo force-pushed the keep-unzipped-files-accessible branch from 28ab262 to 9ea06df Compare March 23, 2023 16:25
@cla-bot cla-bot bot added the cla-signed CLA Bot: community license agreement signed label Mar 23, 2023
@ridoo ridoo force-pushed the keep-unzipped-files-accessible branch 2 times, most recently from 3761d1a to d10741f Compare March 24, 2023 08:17
@ridoo
Copy link
Contributor Author

ridoo commented Mar 24, 2023

Kontext:

iterdir() seems to be platform dependent. It happened to me that the order of the returned items differed on different platforms. In cases where a zip file contains multiple base_file candidates (in my case json and two csv files) the last one wins. In combination with the arbitrary order of the return value of iterdir(), this becomes non-deterministic.

Also, different files with the same extension (some_data.csv, more_data.csv) will not be accessible from file_paths as they get overridden, too.

The fix enumerates all files to make them accessible from file_paths.

@ridoo
Copy link
Contributor Author

ridoo commented Mar 24, 2023

Seems that one test fails (test_geoserver_proxy_strip_paths (geonode.geoserver.tests.test_helpers.HelperTest) ... FAIL) according to restrictions on /geoserver/rest/layers. Not sure right now, if this relates to the recent adjustments made in geoservers rest.properties ...

@t-book
Copy link
Contributor

t-book commented Mar 25, 2023

A little improvement might be to use a context manager (with ...)

the_zip = zipfile.ZipFile(zip_file, allowZip64=True)

@gannebamm
Copy link
Contributor

@afabiani I see the same tests failing for recent PRs e.g. the dependabot bumping. Can you confirm @ridoo idea? Is this due to the new rest security rules?

@gannebamm
Copy link
Contributor

@afabiani I see the same tests failing for recent PRs e.g. the dependabot bumping. Can you confirm @ridoo idea? Is this due to the new rest security rules?

@ridoo please see #10887. If the issue gets resolved restart the tests to see if this will run successfully

@giohappy
Copy link
Contributor

@ridoo the issue with the failing test has been fixed in fa2c6da

@codecov
Copy link

codecov bot commented Apr 11, 2023

Codecov Report

Merging #10842 (356cf3c) into master (43188af) will increase coverage by 0.05%.
Report is 3 commits behind head on master.
The diff coverage is 56.25%.

❗ Current head 356cf3c differs from pull request most recent head 9623ff9. Consider uploading reports for the commit 9623ff9 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10842      +/-   ##
==========================================
+ Coverage   62.16%   62.21%   +0.05%     
==========================================
  Files         871      828      -43     
  Lines       52020    51262     -758     
  Branches     6495     6569      +74     
==========================================
- Hits        32338    31895     -443     
+ Misses      18133    17671     -462     
- Partials     1549     1696     +147     

@gannebamm
Copy link
Contributor

I see it passes tests now and only shows codecov fails, which I would ignore.
@ridoo, please incorporate the context manager idea of @t-book (#10842 (comment)), and then I think this can get merged. @giohappy Do you agree?

@giohappy
Copy link
Contributor

@gannebamm I've asked for a quick internal review, but it looks good to me.

Copy link
Contributor

@mattiagiupponi mattiagiupponi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ridoo
I tried the PR and overall the idea looks good but is not working as expected.
The consistency between the files must be preserved.
I tried to upload a zip with multiple and mixed supported types
image

But after running the code, I can see that the payload the key needed for an shp (for example) refers to different files:
image

If we start to have the possibility to refer to multiple files, I guess is better to keep the consistency between them. For example in this case I expect:

  • the key without index should refer to the same shp/gpkg
  • the key with the same index should refer to the same shp/gkp

We have also to consider eventually the XML file and SLD file that can belong to the same layer

@ridoo
Copy link
Contributor Author

ridoo commented Apr 18, 2023

@mattiagiupponi thanks for having a look at this.

It seems to be unclear to me, what is actually acceptable content of an uploaded zip file. However, what about a simple sort before doing the whole magic (see my commit 573c3dc)

The zipped test collection could contain more mixed content and the test could be improved as well. But it may give you the idea, hopefully :)

@mattiagiupponi
Copy link
Contributor

It seems to be unclear to me, what is actually acceptable content of an uploaded zip file. However, what about a simple sort before doing the whole magic (see my commit 573c3dc)

In GeoNode one zip contains just 1 layer (relation 1-1), but here since we are introducing the idea to have possible multiple base_file is important to keep the consistency with the previous behavior, anyway I don't think that a sorted could be enough I guess it would be safer to have a code which can group directly by file name instead of relying on the sorting function.

(IMO) A good way would be to:

  • scan the zip file
  • group the file available in the zip by name
  • based on the file extension, verify that the needed file for that resource is available (this information is available via the get_supported_datasets_file_types
  • Return the first completed set found

@giohappy what do you think?

@ridoo
Copy link
Contributor Author

ridoo commented Apr 19, 2023

(IMO) A good way would be to:

  • scan the zip file
  • group the file available in the zip by name
  • based on the file extension, verify that the needed file for that resource is available (this information is available via >the get_supported_datasets_file_types
  • Return the first completed set found

Sorting before running the refactored loop would give you named groups, no?

However, for a datapackage.json which references just other files, a named group won't be helpful as it would for shapefiles. Here the actual handler (from the new geonode-importer) actually decides what to do with the input data. Making too strict decisions or verifications during a simple unzip step would go too far in my opinion. What do you think?

@ridoo
Copy link
Contributor Author

ridoo commented May 15, 2023

Hey @mattiagiupponi @giohappy do you see any chance to see this PR merged into an upcoming release (4.2.x or even 4.1.x)?

If you need further input from our side, please let me know! 🌞

cc @gannebamm

@giohappy
Copy link
Contributor

@ridoo sorry for the late reply. We can plan to have it in 4.2.0, but I think we need more thinking to implement a clean solution for the grouping of files.

@ridoo
Copy link
Contributor Author

ridoo commented May 26, 2023

@giohappy no worries .. are you going to share the discussion on this ticket or elsewhere?

ridoo added a commit to Thuenen-GeoNode-Development/geonode that referenced this pull request Sep 20, 2023
Test example is available still from the PR GeoNode#10842
ridoo added a commit to Thuenen-GeoNode-Development/geonode that referenced this pull request Oct 9, 2023
Test example is available still from the PR GeoNode#10842
@ridoo
Copy link
Contributor Author

ridoo commented Oct 11, 2023

@giohappy do you have any updates on this one? If you need further clarifications or work on my side please let me know

last statement from @mattiagiupponi was to discuss, what is acceptable content of an uploaded zip file (e.g. containing mixed content). See here: #10842 (comment)

@giohappy
Copy link
Contributor

@ridoo from what I see your proposal doesn't work in several ways:

1. We don't support multiple base_files

Every single (multi)file to be handled by the importer must have a base_file key. Outside the issue of sorting on grouping, you end up with a data structure where you have 1 base_file and multiple shp_file, shx_file, etc.
The last "group" of files that are processed overwrite the single available base_file value. I suspect this is what happens in the screenshot from @mattiagiupponi:

image

I suspect that your test is a false positive because you only take one group of files (the ones that generate the shp_file, etc. keys without postfixes. If you run the test on all the other files in _files you would get an assertion error since it wouldn't find a matched base_file

2. The importer can only handle one multi(file) at a time

The importer creates one (multi)format at a time. The create method triggers only one handler. If the zip file you send contains multiple files, it will get the handler only for the single base_file inside the retrieved paths.
So, in the end, the additional files would be discarded.
For the problem at point 1, you might end up with a base_file file that doesn't match with the other unprefixed keys.

Multifile support must be rethought IMHO, both at the StorageManager and the ImporterViewSet level

@ridoo
Copy link
Contributor Author

ridoo commented Oct 12, 2023

@ridoo from what I see your proposal doesn't work in several ways:

1. We don't support multiple base_files

Every single (multi)file to be handled by the importer must have a base_file key. Outside the issue of sorting on grouping, you end up with a data structure where you have 1 base_file and multiple shp_file, shx_file, etc. The last "group" of files that are processed overwrite the single available base_file value.

Well, the multi-mixed content example was one of @mattiagiupponi 's tests. That is why I was asking on the criteria, what actually is acceptable content of a zip file. If I understand you correctly, only one file assemble (e.g. shp, shx, prj, dbf) is allowed. This would be ok to me, and does not interfere with the issue I tried to describe.

I suspect that your test is a false positive because you only take one group of files (the ones that generate the shp_file, etc. keys without postfixes. If you run the test on all the other files in _files you would get an assertion error since it wouldn't find a matched base_file

Well, I do not mind (for this issue) if one file assemble overrides the other. What I tried to describe is the fact, that different handlers may decide on different base_files. In my case, my upload handler has a json file as base_file and multiple csv files as related files. See my comment: #10842 (comment). Sorting was just a quick hack to group each file assembly so that they get processed sequentially.

Multifile support must be rethought IMHO, both at the StorageManager and the ImporterViewSet level

I agree here, but as said, this was not the intent of this issue.

@giohappy
Copy link
Contributor

@ridoo I admit I lost track of this issue by reading only the comments. I've read the initial description again and now things are much clearer 😀.

Okay, so the goal was to not break the upload in case multiple files are put inside the uploaded zip file and let the handlers manage the enumerated paths.

BTW, even taking into account only specific cases, your patch would support only a very specific case: you have a custom handler registered at the top of the handlers registry.

Example:
You have a zip containing 1 json file + many csv files. Both json and csv are good candidates for the base_file. The base_file selection would depend on the (lexicographical) order, am I wrong?

So, when the ImporterViewSet selects the right handler, it do it by asking each handler if it can handle the fileset, and so the first one in the list will be selected, and with the factory handlers, it could either the geojson or the csv handler.

If your custom handler is at the top of the handlers registry, it might return immediately because you know how to handle that specific fileset.

If this is your expected scenario that's fine, although I don't think we can control the ordering of the registry at the moment.

iterdir() is platform dependent, that is the order of the returned items
may be different on different platforms. In cases where a zip file
contains multiple base_file candidates it will be overridden by the last
one found (which varies on different platforms).

Also, different files with the same extension (file1.csv, file2.csv) will not
be accessible from file_paths as they get overridden, too.

The fix enumerates all files to make them accessible from file_paths.
Ensures that unpacked content is sorted before getting handled
@ridoo ridoo force-pushed the keep-unzipped-files-accessible branch 2 times, most recently from 4d3040d to 15e4d45 Compare October 13, 2023 11:17
@ridoo
Copy link
Contributor Author

ridoo commented Oct 13, 2023

Okay, so the goal was to not break the upload in case multiple files are put inside the uploaded zip file and let the handlers manage the enumerated paths.

That's right.

Example: You have a zip containing 1 json file + many csv files. Both json and csv are good candidates for the base_file. The base_file selection would depend on the (lexicographical) order, am I wrong?

To my understanding right now, the unpack process is just making a best guess on the base_file, depending on what geonode.utils.get_supported_datasets_file_types() returned first item is. This is ok, as I am free to re-set the base_file in my handler.

So, when the ImporterViewSet selects the right handler, it do it by asking each handler if it can handle the fileset, and so the first one in the list will be selected, and with the factory handlers, it could either the geojson or the csv handler.

If your custom handler is at the top of the handlers registry, it might return immediately because you know how to handle that specific fileset.

This is the way I have it configured and it works.

If this is your expected scenario that's fine, although I don't think we can control the ordering of the registry at the moment.

Ok, perhaps this issue targets multiple things here. The main issue, actually, is that unpacking files overrides files having the same extensions. This is why the PR is calle "Enumerate unzipped files", ie. two csv files in a zip would be unpacked to files[csv_1] and files[csv_2]

@giohappy
Copy link
Contributor

This is the way I have it configured and it works.

@ridoo ok, the ordering of the handlers in the settings is respected when they're registered

@mattiagiupponi to me this PR is harmless. It doesn't break the current behavior (which is undefined behavior in case the zip contains multiple filesets) and it offers support to handlers that can handle multiple file sets.

@mattiagiupponi
Copy link
Contributor

@mattiagiupponi to me this PR is harmless. It doesn't break the current behavior (which is undefined behavior in case the zip contains multiple filesets) and it offers support to handlers that can handle multiple file sets.

i agree

@ridoo can you please fix the flake8 and black issues? So we can have also the greenline from CircleCI.
Flake8 is complaining about this:

^@^@All done! ✨ 🍰 ✨
503 files would be left unchanged.
geonode/storage/data_retriever.py:224:39: E721 do not compare types, for exact checks use `is` / `is not`, for instance checks use `isinstance()`
geonode/storage/data_retriever.py:231:16: E721 do not compare types, for exact checks use `is` / `is not`, for instance checks use `isinstance()`

I guess replacing the type(existing) == list with isinstance(existing, list) should be enough

@ridoo
Copy link
Contributor Author

ridoo commented Oct 13, 2023

Oh yes, will do this next.

One quick note: I removed the test which was based on wrong assumptions. If you would like to see tests on this part, I would have to do this some time next week.

@ridoo ridoo force-pushed the keep-unzipped-files-accessible branch from 8bd8c0d to 37be03a Compare October 13, 2023 14:04
@gitguardian
Copy link

gitguardian bot commented Oct 16, 2023

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positive and are still valid, we highly recommend you to revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Find here more information about risks.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Our GitHub checks need improvements? Share your feedbacks!

@mattiagiupponi
Copy link
Contributor

Hi @ridoo, a small test to be sure that works as expected it would be nice

@ridoo ridoo force-pushed the keep-unzipped-files-accessible branch from d06323d to b75d185 Compare October 16, 2023 12:43
@ridoo ridoo force-pushed the keep-unzipped-files-accessible branch from b75d185 to 9623ff9 Compare October 16, 2023 12:45
@ridoo
Copy link
Contributor Author

ridoo commented Oct 16, 2023

@mattiagiupponi please have a look at my (really simple) test on enumeration/indexing multiple files with the same extension.

I just added a copy of zipped example.csv file. As ordering is undefined, I changed an existing test as well and commented where appropriate.

Copy link
Contributor

@mattiagiupponi mattiagiupponi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ridoo it looks good to me

@giohappy giohappy merged commit 9ca2f13 into GeoNode:master Oct 17, 2023
6 checks passed
ridoo added a commit to GeoNodeUserGroup-DE/contrib_datapackage that referenced this pull request Nov 2, 2023
ridoo added a commit to Thuenen-GeoNode-Development/geonode that referenced this pull request Feb 22, 2024
Test example is available still from the PR GeoNode#10842
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed CLA Bot: community license agreement signed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants