Enumerate unzipped files #10842
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @github-actions[bot] on file. In order for us to review and merge your code, please contact the project maintainers to get yourself added.
@ridoo, please take a look at your commit history.
28ab262 to 9ea06df
3761d1a to d10741f
Context:
Also, different files with the same extension (file1.csv, file2.csv) will not be accessible from file_paths as they get overridden, too. The fix enumerates all files to make them accessible from file_paths.
Seems that one test fails ( |
A little improvement might be to use a context manager (with ...) geonode/geonode/storage/data_retriever.py Line 208 in d10741f
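A minimal sketch of the context-manager suggestion. The referenced line is not shown in this thread, so this assumes it opens a zip archive with the standard library's `zipfile` module; `extract_archive` is a hypothetical helper name, not GeoNode's actual code.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive_path: Path, target_dir: Path) -> list[Path]:
    """Extract a zip archive and return the extracted files in sorted order.

    Hypothetical helper for illustration only.
    """
    # The with-statement closes the archive even if extraction raises.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
    # Sort so the result does not depend on the platform's directory order.
    return sorted(p for p in target_dir.iterdir() if p.is_file())
```

Compared with calling `close()` by hand, the `with` block guarantees the file handle is released on every exit path.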
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #10842 +/- ##
==========================================
+ Coverage 62.16% 62.21% +0.05%
==========================================
Files 871 828 -43
Lines 52020 51262 -758
Branches 6495 6569 +74
==========================================
- Hits 32338 31895 -443
+ Misses 18133 17671 -462
- Partials 1549 1696 +147
I see it passes tests now and only shows codecov fails, which I would ignore. |
@gannebamm I've asked for a quick internal review, but it looks good to me. |
Hi @ridoo
I tried the PR and overall the idea looks good, but it is not working as expected.
The consistency between the files must be preserved.
I tried to upload a zip with multiple and mixed supported types.
But after running the code, I can see that in the payload the keys needed for a shp (for example) refer to different files:
If we start to allow references to multiple files, I guess it is better to keep the consistency between them. For example, in this case I expect:
- the key without index should refer to the same shp/gpkg
- the key with the same index should refer to the same shp/gpkg
We also have to consider the XML and SLD files that can belong to the same layer.
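One way to read the consistency requirement above is to group files by their stem first and only then assign indices, so every key carrying the same index points into the same file set. This is a sketch under assumptions: the `<ext>_file` key names with a numeric postfix and the helper names `group_by_stem` and `build_payload` are hypothetical and not taken from the PR.

```python
from collections import defaultdict
from pathlib import Path


def group_by_stem(paths):
    """Group paths that share a stem, e.g. roads.shp and roads.dbf."""
    groups = defaultdict(list)
    for path in sorted(paths):
        groups[path.stem].append(path)
    return dict(groups)


def build_payload(paths):
    """Assign one index per file set (hypothetical key scheme).

    The first set gets plain keys ("shp_file"), later sets get a
    numeric postfix ("shp_file_1"), so an index never mixes file sets.
    """
    payload = {}
    for index, (_stem, members) in enumerate(sorted(group_by_stem(paths).items())):
        postfix = "" if index == 0 else f"_{index}"
        for member in members:
            ext = member.suffix.lstrip(".").lower()
            payload[f"{ext}_file{postfix}"] = member
    return payload
```

Sidecar files named like `roads.xml` or `roads.sld` land in the same group as `roads.shp`; a name such as `roads.shp.xml` has the stem `roads.shp` and would need extra handling that this sketch leaves out.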
@mattiagiupponi thanks for having a look at this. It is unclear to me what is actually acceptable content of an uploaded zip file. However, what about a simple sort before doing the whole magic (see my commit 573c3dc)? The zipped test collection could contain more mixed content and the test could be improved as well. But it may give you the idea, hopefully :)
In GeoNode one zip contains just 1 layer (relation 1-1), but here since we are introducing the idea to have possible multiple (IMO) A good way would be to:
@giohappy what do you think? |
Sorting before running the refactored loop would give you named groups, no? However, for a |
Hey @mattiagiupponi @giohappy do you see any chance to see this PR merged into an upcoming release (4.2.x or even 4.1.x)? If you need further input from our side, please let me know! 🌞 cc @gannebamm |
@ridoo sorry for the late reply. We can plan to have it in 4.2.0, but I think we need more thinking to implement a clean solution for the grouping of files. |
@giohappy no worries .. are you going to share the discussion on this ticket or elsewhere? |
Test example is available still from the PR GeoNode#10842
@giohappy do you have any updates on this one? If you need further clarifications or work on my side, please let me know. The last statement from @mattiagiupponi was to discuss what is acceptable content of an uploaded zip file (e.g. containing mixed content). See here: #10842 (comment)
@ridoo from what I see your proposal doesn't work in several ways:
1. We don't support multiple base_files
Every single (multi)file to be handled by the importer must have a
I suspect that your test is a false positive because you only take one group of files (the ones that generate the shp_file, etc. keys without postfixes). If you run the test on all the other files in _files you would get an assertion error since it wouldn't find a matched
2. The importer can only handle one multi(file) at a time
The importer creates one (multi)format at a time. The create method triggers only one handler. If the zip file you send contains multiple files, it will get the handler only for the single
Multifile support must be rethought IMHO, both at the
Well, the multi-mixed content example was one of @mattiagiupponi's tests. That is why I was asking about the criteria for what actually is acceptable content of a zip file. If I understand you correctly, only one file set (e.g. shp, shx, prj, dbf) is allowed. This would be OK with me, and does not interfere with the issue I tried to describe.
Well, I do not mind (for this issue) if one file set overrides the other. What I tried to describe is the fact that different handlers may decide on different base_files. In my case, my upload handler has a
I agree here, but as said, this was not the intent of this issue.
@ridoo I admit I lost track of this issue by reading only the comments. I've read the initial description again and now things are much clearer 😀.
Okay, so the goal was to not break the upload in case multiple files are put inside the uploaded zip file and let the handlers manage the enumerated paths.
BTW, even taking into account only specific cases, your patch would support only a very specific case: you have a custom handler registered at the top of the handlers registry.
Example:
So, when the
If your custom handler is at the top of the handlers registry, it might return immediately because you know how to handle that specific fileset. If this is your expected scenario that's fine, although I don't think we can control the ordering of the registry at the moment.
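The registry-ordering point can be illustrated with a first-match lookup. This is only a sketch: the handler classes, the `can_handle` check, and `pick_handler` are hypothetical stand-ins, and the importer's real registry may work differently.

```python
class ShapefileHandler:
    """Hypothetical stock handler that only looks at the plain shp_file key."""

    @staticmethod
    def can_handle(files):
        return "shp_file" in files


class CsvBundleHandler:
    """Hypothetical custom handler that understands enumerated CSV keys."""

    @staticmethod
    def can_handle(files):
        return any(key.startswith("csv_file") for key in files)


def pick_handler(registry, files):
    """Return the first handler in registry order that accepts the files."""
    for handler in registry:
        if handler.can_handle(files):
            return handler
    return None
```

With `files = {"csv_file": "a.csv", "csv_file_1": "b.csv", "shp_file": "x.shp"}`, the registry `[CsvBundleHandler, ShapefileHandler]` selects the custom handler, while the reversed order selects the shapefile handler. That is why the position of a custom handler in the registry matters for a mixed upload.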
iterdir() is platform dependent; that is, the order of the returned items may differ between platforms. When a zip file contains multiple base_file candidates, the base_file is overridden by the last one found (which varies between platforms). Also, different files with the same extension (file1.csv, file2.csv) will not be accessible from file_paths as they get overridden, too. The fix enumerates all files to make them accessible from file_paths.
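A minimal sketch of the idea, not the PR's actual implementation: sort the directory listing so the result is stable, then give each file of the same extension its own key. The function name `enumerate_file_paths` and the key scheme (`csv_file`, `csv_file_1`, ...) are assumptions made for illustration.

```python
from pathlib import Path


def enumerate_file_paths(directory):
    """Map every file to a unique key so same-extension files do not collide."""
    file_paths = {}
    counters = {}
    # Path.iterdir() yields entries in arbitrary order, so sort for stability.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        ext = path.suffix.lstrip(".").lower()
        count = counters.get(ext, 0)
        # Hypothetical key scheme: "csv_file", then "csv_file_1", "csv_file_2", ...
        key = f"{ext}_file" if count == 0 else f"{ext}_file_{count}"
        file_paths[key] = path
        counters[ext] = count + 1
    return file_paths
```

For a directory holding file1.csv and file2.csv, both files stay reachable under separate keys instead of the second one replacing the first.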
Ensures that unpacked content is sorted before getting handled
4d3040d to 15e4d45
That's right.
To my understanding right now, the unpack process is just making a best guess on the base_file, depending on what
This is the way I have it configured and it works.
Ok, perhaps this issue targets multiple things here. The main issue, actually, is that unpacking files overrides files having the same extension. This is why the PR is called "Enumerate unzipped files", i.e. two csv files in a zip would be unpacked to
@ridoo ok, the ordering of the handlers in the settings is respected when they're registered.
@mattiagiupponi to me this PR is harmless. It doesn't break the current behavior (which is undefined behavior in case the zip contains multiple filesets) and it offers support to handlers that can handle multiple file sets.
I agree. @ridoo can you please fix the flake8 and black issues, so we can also have the green light from CircleCI?
I guess replacing the |
Oh yes, will do this next. One quick note: I removed the test which was based on wrong assumptions. If you would like to see tests on this part, I would have to do this some time next week. |
8bd8c0d to 37be03a
✅ There are no secrets present in this pull request anymore. If these secrets were true positive and are still valid, we highly recommend you to revoke them. 🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Hi @ridoo, a small test to make sure it works as expected would be nice.
d06323d to b75d185
b75d185 to 9623ff9
@mattiagiupponi please have a look at my (really simple) test on enumeration/indexing multiple files with the same extension. I just added a copy of zipped example.csv file. As ordering is undefined, I changed an existing test as well and commented where appropriate. |
@ridoo it looks good to me
The last base_file candidate overrides previous ones. Also, different files with the same extension (file1.csv, file2.csv) are not accessible from file_paths as they get overridden, too.
An enumerated list of unzipped files should resolve this.
Checklist
For all pull requests:
The following are required only for core and extension modules (they are welcomed, but not required, for contrib modules):
Submitting the PR does not require you to check all items, but by the time it gets merged, they should be either satisfied or inapplicable.