Section 7.2 #22
Comments
Background information: bulk download was discussed in opengeospatial/ogcapi-features#230 and there "enclosure" was suggested. |
@AlexRamageScot1 - What would you consider to be the "mainstream" link relation type for a downloadable distribution of a dataset? Could you provide good examples where that link relation type is used? We use |
If my memory serves me right, the previous download services documents allowed for either service-based download (WFS 2.0) or bulk download (ATOM), but did not require implementing both. Having both options in a single API is an improvement, but I would suggest making the enclosure link optional, while "7.2.2. INSPIRE-specific requirement" seems to make it mandatory. Case in point: the dataset might be large and already stored in a database, so implementing an enclosure link would require preparing a full dataset dump in one or more formats, increasing space utilization. A possible alternative would be to build the result on the fly from the database, which would be feasible only with streaming-oriented formats (e.g. GeoJSON, GML), but would not work in a synchronous request for formats that need to be fully written before being sent back to the client (e.g. GeoPackage, shapefile). |
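To illustrate the "build the result on the fly" alternative mentioned above, here is a minimal Python sketch of streaming a GeoJSON FeatureCollection one feature at a time. The function name `stream_feature_collection` and the sample rows are assumptions made for illustration; a real server would read the features from a database cursor and write the chunks to the HTTP response.

```python
import json
from typing import Iterable, Iterator


def stream_feature_collection(features: Iterable[dict]) -> Iterator[str]:
    """Yield a GeoJSON FeatureCollection chunk by chunk.

    `features` stands in for a database cursor (an assumption); only one
    feature needs to be held in memory at a time.
    """
    yield '{"type": "FeatureCollection", "features": ['
    first = True
    for feature in features:
        if not first:
            yield ","  # separator between consecutive features
        yield json.dumps(feature)
        first = False
    yield "]}"


# Hypothetical rows, standing in for features read from a database.
rows = (
    {"type": "Feature", "id": i, "geometry": None, "properties": {"n": i}}
    for i in range(3)
)

# A server would send each chunk as it is produced; here they are joined
# only to show that the result is one valid GeoJSON document.
body = "".join(stream_feature_collection(rows))
```

This works because GeoJSON can be written front to back. A format such as GeoPackage needs the whole file to be written before the first byte can be sent, which is the limitation described in the comment.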
7.2.2 We raise similar concerns regarding implementation and anticipated performance challenges. Agree with suggestion of making the enclosure link optional. |
The INSPIRE implementing rules require that a download service has an operation "Get Spatial Data Set", that returns the whole data set (see section 3). This corresponds to e.g.
If an implementation of OAPIF does not provide a mechanism to get a bulk download of the dataset served, that implementation cannot be an INSPIRE-compliant download service. The enclosure construction is one mechanism to provide a bulk download.
The idea would be that a data provider provides a link to an already prepared dataset dump, that resides on another server. That could e.g. be an FTP-server, that has dataset dumps on it prepared by another tool, e.g. FME reading data from a database and uploading them every night. See the example from the OGCAPIF spec: The example in the current version of the Good Practice does not reflect this, I guess that should be updated to e.g. the following: So the consequence for a server supporting OGCAPI would be that it must support adding an "enclosure" link, including a See e.g. the following architecture: |
Thanks for the update. I maintain the point that the requirement is taxing, but the INSPIRE directive is free to ask for whatever seems fit, and those implementing it will simply have to deal with the cost of it. |
That depends on the interpretation of "operation" / "request" / "response" in the legal text. If the terms are interpreted in the HTTP 1.1 sense, the statement above would be correct, but if a more general definition is used and iterating over the Preparing a bulk download for data that is a static snapshot and never changes may not be an issue in general, but for data that changes (and I think that is a large part of the datasets) having a bulk download that always has the up-to-date data with the QoS requirements will often be challenging and costly plus the downloaded data will in general be outdated. I would find it surprising, if that was really the intent of the legal text. |
A clarification: the quote is not complete, there was an extra sentence there (highlighted here):
Paging would indeed be a second possible mechanism. It was discussed earlier in the discussion paper, and also in the context of WFS, see Download Service WFS: StoredQuery and ResponsePaging for large datasets?. For large datasets this seems to be far from ideal (though probably compliant?). Then we're back in the discussion regarding the use of INSPIRE: do we implement it to be compliant ("tick the box") or do we implement it because it makes sense to provide a spatial data infrastructure giving access to our data. |
From the document:
This sounds like a mandatory presence, while "one mechanism to provide bulk download" makes it sound like it would be one among other various valid options (hence, optional). If the enclosure can just point to the "items" resource and paging is allowed, then the issue is solved... however, any client can already download the full dataset that way, as part of the OGC Features API, no need to have an explicit "enclosure" link for it. |
@heidivanparys - Just to clarify: I omitted the additional sentence as I don't think it is important for the discussion whether the legal text requires a "bulk download" (single downloadable file) for the Get Spatial Data Set operation or not, which was the subject of my comment. If indeed there would be consensus that a paged download would meet the legal requirements then there would be no need for a requirement related to a bulk download and an "enclosure" link could just be in an example like on the OGC API Features standard. |
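For readers who want to see what the paged download discussed above looks like from the client side, here is a minimal Python sketch that walks an "items" resource by following `next` links. The function `iter_all_features`, the URLs, the `offset` parameter and the in-memory `PAGES` dictionary are all assumptions for illustration; `fetch` would be an HTTP GET returning parsed JSON in a real client.

```python
from typing import Callable, Iterator, Optional


def iter_all_features(start_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every feature of a collection by following 'next' links."""
    url: Optional[str] = start_url
    while url is not None:
        page = fetch(url)
        yield from page.get("features", [])
        # Continue with the link whose relation type is "next", if any.
        url = next(
            (link["href"] for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )


# Hypothetical two-page response set, used instead of real HTTP requests.
ITEMS_URL = "http://my-org.eu/collections/buildings/items"
PAGES = {
    ITEMS_URL: {
        "features": [{"id": 1}, {"id": 2}],
        "links": [{"rel": "next", "href": ITEMS_URL + "?offset=2"}],
    },
    ITEMS_URL + "?offset=2": {
        "features": [{"id": 3}],
        "links": [],
    },
}

ids = [feature["id"] for feature in iter_all_features(ITEMS_URL, PAGES.__getitem__)]
```

Whether a loop like this counts as the "Get Spatial Data Set" operation is exactly the legal question raised in the comments; the sketch only shows that it is technically straightforward.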
This would be the best solution for providers of APIs if we'd like to avoid INSPIRE-specific requirements as much as possible. I do see the use case of a bulk download, especially for users of an API, so what about making an |
@thijsbrentjens the "other CRSs" bit piqued my interest... how does one advertise the CRS of the specific enclosure link? Is there a machine-processable way, or would this be delegated to the link title, and hence, to human interpretation? |
@aaime you've got me there, I didn't think this through / check it. There does not seem to be a machine processable way indeed. |
@cportele This was actually my comment - I didn't intend to imply there was a more mainstream link relation (although it seems to have generated some good discussion anyway), it was more a question of whether enclosure is a readily understandable term for a bulk download, just from a developer experience point of view. "Download" clearly has issues as a term both for INSPIRE and more generally, and "bulk download" might be more understandable but isn't registered. So I can't say that I've got a better suggestion (which I acknowledge isn't helpful!), I was just concerned that if I was coding against a service I wouldn't guess that a link called enclosure would lead to a full bulk download. Perhaps the guidance could recommend that the title of an enclosure link specifically makes clear it's a bulk download of the full dataset? |
I had a look at the service that is provided by pygeoapi community for the compliance test [1], https://demo.pygeoapi.io/cite , and there "download" is used: [1] See https://www.opengeospatial.org/resource/products?display_opt=1&specid=1022 . |
I would avoid This might be more intuitive, uses a commonly known vocabulary, has a URI that can be dereferenced to find more information about the semantics and is at least broadly inline with the rules for Extension Relation Types in RFC 8288. |
https://schema.org/downloadUrl is supposed to be used for software applications: How about using
DCAT is widely used in government portals and is a well-known specification, see e.g. Building a search engine for datasets in an open Web ecosystem (DOI 10.1145/3308558.3313685):
One more argument: given the effort the EU has put into developing DCAT-AP, the re-use of elements of DCAT within INSPIRE would also be good I think. ❓ It is not clear that this is the download URL for another distribution of the dataset. Is that an issue?
I like that approach. ❓ @cportele What exactly do you mean with "at least broadly inline"? Where would the approach not be compliant? |
Good catch, I overlooked this.
I said that because:
|
For reference, the whole relevant clause from the IETF Proposed Standard RFC 8288:
And according to the IETF Best Current Practice BCP 14:
So the proposal would be to use the extension relation type The reason to ignore the recommendation of IETF would be that we follow best practice Best Practice 15: Reuse vocabularies, preferably standardized ones in order to increase interoperability.

Are there any implications? We could repeat the requirement that extension relation types must be compared in a case-insensitive fashion.

The example would then become:

{
  "links": [
    { "href": "http://my-org.eu/collections.json",
      "rel": "self", "type": "application/json", "title": "this document" },
    { "href": "http://my-org.eu/buildings.gpkg",
      "rel": "http://www.w3.org/ns/dcat#downloadURL",
      "type": "application/geopackage+sqlite3",
      "title": "Pre-defined data set download (GeoPackage)" }
  ]
}

❓ Then the next question is, what about the
The Given the use of

{
  "links": [
    { "href": "http://my-org.eu/collections.json",
      "rel": "self", "type": "application/json", "title": "this document" },
    { "href": "http://my-org.eu/buildings.gpkg",
      "rel": "http://www.w3.org/ns/dcat#downloadURL",
      "type": "application/geopackage+sqlite3",
      "title": "Pre-defined data set download (GeoPackage)",
      "byteSize": 472546 }
  ]
}
|
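The case-insensitive comparison of extension relation types mentioned above can be sketched in a few lines of Python. The helper names `same_relation_type` and `links_with_rel` are hypothetical, and the mixed-case `rel` value in the sample document is there only to show why the comparison matters.

```python
DCAT_DOWNLOAD_URL = "http://www.w3.org/ns/dcat#downloadURL"


def same_relation_type(a: str, b: str) -> bool:
    """Compare two link relation types as strings, ignoring case."""
    return a.lower() == b.lower()


def links_with_rel(links: list, rel: str) -> list:
    """Return the links whose 'rel' matches `rel` case-insensitively."""
    return [link for link in links if same_relation_type(link.get("rel", ""), rel)]


# Sample document modelled on the example in this thread; the rel of the
# second link is deliberately written in a different case.
document = {
    "links": [
        {"href": "http://my-org.eu/collections.json",
         "rel": "self", "type": "application/json", "title": "this document"},
        {"href": "http://my-org.eu/buildings.gpkg",
         "rel": "http://www.w3.org/ns/dcat#downloadurl",
         "type": "application/geopackage+sqlite3",
         "title": "Pre-defined data set download (GeoPackage)"},
    ]
}

downloads = links_with_rel(document["links"], DCAT_DOWNLOAD_URL)
```

A client that compared `rel` values with a plain `==` would miss the second link here, which is the interoperability risk behind repeating the requirement.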
The link relation type
Right now, I'm inclined to say that we should stick to Regarding some earlier comments:
According to RFC 8288 and also according to the IETF Internet-Draft: JSON serialization for Web Linking, additional target attributes may be added, see also the comment above. So if I understand that correctly: we could define a

{
  "links": [
    { "href": "http://my-org.eu/collections.json",
      "rel": "self", "type": "application/json", "title": "this document" },
    { "href": "http://my-org.eu/buildings.gpkg",
      "rel": "enclosure",
      "type": "application/geopackage+sqlite3",
      "title": "Pre-defined data set download (GeoPackage)",
      "length": 472546,
      "crs": "http://www.opengis.net/def/crs/EPSG/0/3044" }
  ]
}

@aaime @thijsbrentjens What do you think? |
@heidivanparys Adding the crs attribute works for me. |
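As a rough illustration of how a client could use the proposed `crs` attribute, the following Python sketch filters enclosure links by media type and CRS. The function `find_bulk_downloads` is hypothetical, as is the second enclosure link (the EPSG:4258 file) in the sample data; the `crs` attribute itself is the proposal from this thread, not an established part of any standard.

```python
from typing import List, Optional


def find_bulk_downloads(
    links: List[dict],
    media_type: Optional[str] = None,
    crs: Optional[str] = None,
) -> List[dict]:
    """Return enclosure links, optionally filtered by media type and CRS."""
    matches = []
    for link in links:
        if link.get("rel", "").lower() != "enclosure":
            continue
        if media_type is not None and link.get("type") != media_type:
            continue
        # 'crs' is the additional target attribute proposed in this thread.
        if crs is not None and link.get("crs") != crs:
            continue
        matches.append(link)
    return matches


links = [
    {"href": "http://my-org.eu/collections.json",
     "rel": "self", "type": "application/json", "title": "this document"},
    {"href": "http://my-org.eu/buildings.gpkg",
     "rel": "enclosure", "type": "application/geopackage+sqlite3",
     "title": "Pre-defined data set download (GeoPackage)",
     "length": 472546,
     "crs": "http://www.opengis.net/def/crs/EPSG/0/3044"},
    # Hypothetical second distribution in another CRS.
    {"href": "http://my-org.eu/buildings-4258.gpkg",
     "rel": "enclosure", "type": "application/geopackage+sqlite3",
     "title": "Pre-defined data set download (GeoPackage, ETRS89)",
     "crs": "http://www.opengis.net/def/crs/EPSG/0/4258"},
]

utm = find_bulk_downloads(links, crs="http://www.opengis.net/def/crs/EPSG/0/3044")
```

Without a machine-processable attribute, the same selection would depend on parsing the link title, which is the concern raised earlier about human interpretation.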
I'm re-reading this again and found the situation is worse than I initially imagined. The requirement is to allow download of the entire dataset (potentially multiple collections) with a single link, however the allowable return types are just the following http://inspire.ec.europa.eu/media-types/application The only true multi-collection formats I see there are:
It is my understanding that GeoJSON tools prefer to have a single collection per document instead. Was hoping to just return a ZIP file with one GeoJSON file entry per collection, but don't see zip as an acceptable response type. So, in the current situation, it seems that if one cannot prepare in advance a static download package, the best option for full dataset download would be GML... not the direction I was expecting :-D Please clarify? As suggested previously, the preparation of a static download package is not always feasible. |
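The "ZIP file with one GeoJSON file entry per collection" idea could look roughly like the Python sketch below, using only the standard library. The function name `zip_collections`, the collection names and the `.geojson` entry naming are assumptions, and the thread leaves open which media type such a package would be served with.

```python
import io
import json
import zipfile
from typing import Dict


def zip_collections(collections: Dict[str, dict]) -> bytes:
    """Pack each collection as '<name>.geojson' into one in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, feature_collection in collections.items():
            archive.writestr(f"{name}.geojson", json.dumps(feature_collection))
    return buffer.getvalue()


# Hypothetical dataset with two (empty) collections.
dataset = {
    "buildings": {"type": "FeatureCollection", "features": []},
    "addresses": {"type": "FeatureCollection", "features": []},
}
package = zip_collections(dataset)

# Read the package back to list its entries.
with zipfile.ZipFile(io.BytesIO(package)) as archive:
    entry_names = archive.namelist()
```

Note that this sketch builds the whole archive in memory before returning it, so it has the same "fully written before being sent" property that was raised earlier for GeoPackage and shapefile.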
I think the simple explanation is that the contents of the registry is dynamic and should be updated when a need arises. I tried to look up what the media type for a "ZIP file with one GeoJSON file entry per collection" could be and found the following:
On RFC 6839: it has status "informational", and is updated by RFC 7303, XML Media Types, a proposed standard, that does not say anything about zip, so I assume that what is said about zip in RFC 6839 is still useful. Again, it is "informational", but probably the best we have to refer to? So If yes, then I am actually wondering, where does
Does this only refer to missing media types or do we have other outstanding issues? To be honest, I had the impression we were getting quite close to a workable and standardised solution 😕 . @alexanderkotsev Should we maybe organize a telecon to try to finalise this issue? |
@heidivanparys - A few comments:
|
@heidivanparys by "much worse" I mean the requirement to allow download of an entire dataset (multiple collections) instead of each single collection independently, and the associated issues. |
@cportele Ok, thanks for the comments.
I'm not sure that INSPIRE should just invent a new media type here. Wouldn't it be best to ask for the advice of OGC on this matter? Would the Naming Authority be the right place? |
My point was that INSPIRE has been doing this from the beginning. Yes, it is not perfect, but hasn't it been sufficient so far? The alternative would be that INSPIRE registers the additional media types in the vnd branch, which should be straightforward.
Yes, in OGC the Naming Authority is now responsible for registering media types needed for OGC standards (which would exclude, e.g., zipped GeoJSON, I think). |
@thijsbrentjens You also expressed that you could see the use case for this, do you maybe have anything to add here? |
@heidivanparys the way Chris Holmes suggested it, including asynch behavior, would make it feasible too, and open the door for geopackage, shapefiles and the like, outside of the case where the data is mostly static, or infrequently updated anyways. |
As said during today's meeting, resolving and closing this issue would be my first priority. GitHub-label "waiting for input" |
I tried to break this issue into several issues that address one topic, see the mentioned issues above. |
Comments from the UK:
Suggested change for section 7.2:
Second: we appreciate the usefulness of 7.2.2, “Download of the whole data set”, but are not sure that “rel = enclosure” easily communicates this. We recognize that it is adopted from the Atom specifications & IANA link relation register – but we’re not sure how many users (developers?) find that ‘mainstream’.
Alex Ramage