
Extend and explain in more detail how to upload large files. #315

Conversation

petterreinholdtsen
Collaborator

Introduce a new bulkgrense system value to allow clients to learn exactly how small
"small" files are. Specify how to reject large files uploaded using the small-file method.
Explain the format of the Range and Content-Range headers. Make it clear that the upload
needs to be done in sequence (not out of order) by specifying that the Location header
returned after each bulk upload is to be used for the next bulk upload.

Drop the Google Drive reference. It does not really explain much, the URL is broken,
and this specification should describe its own protocol, as it is not identical to
the Google Drive protocol.

Fixes #313.
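As a sketch of the sequential chunked upload the commit message describes, the helper below plans the Content-Range header for each chunk. This is illustrative only: the byte-range arithmetic follows standard HTTP semantics, and the requirement that chunks be sent in order (each response's Location header feeding the next request) is from the specification change, not from any real client library.

```python
def content_range(offset: int, chunk_len: int, total: int) -> str:
    """Build a Content-Range header value for one chunk.

    HTTP byte ranges are inclusive, so a 1000-byte chunk at
    offset 0 of a 2500-byte file is "bytes 0-999/2500".
    """
    return f"bytes {offset}-{offset + chunk_len - 1}/{total}"


def plan_chunks(total: int, chunk_size: int):
    """Yield (offset, length, header) for each chunk, in order.

    The protocol requires these to be uploaded sequentially: the
    Location header returned for chunk N gives the URL to use for
    chunk N+1, so chunks cannot be sent out of order.
    """
    offset = 0
    while offset < total:
        length = min(chunk_size, total - offset)
        yield offset, length, content_range(offset, length, total)
        offset += length


# Example: a 2500-byte file uploaded in 1000-byte chunks.
for offset, length, header in plan_chunks(2500, 1000):
    print(header)
```

Running this prints three ranges covering the whole file with no gaps or overlap, which is the property the rewritten examples in the specification are meant to demonstrate.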

@petterreinholdtsen petterreinholdtsen force-pushed the bugfix/fix-misleading-large-file-upload-range-values branch from ae297f4 to f082527 Compare August 24, 2023 06:50
@petterreinholdtsen
Collaborator Author

@tsodring and @ivaylomitrev, please have a look at the rewritten example and description and let me know whether it makes sense and seems correct to you.

@ivaylomitrev

It seems correct to me on a general level, but I have several (not-so-small) concerns on the topic:

  1. I do not see a reason to limit "small" file uploads. If a client decides to upload a 5 GB file as a "small" file, that is up to them. I would assume that in such a case they simply do not want a resumable session and are OK with repeating the upload if necessary (i.e. if it fails). In other words, to me the distinction is between resumable and non-resumable uploads, not between small and large files, and I do not see the need for limitations. Also, 1.0 and 1.1 imply that resumable file upload can be used for files larger than 150 MB, but they do not prevent files larger than 150 MB from being uploaded via the "small file upload" implementation. On the contrary, the distinction between the two contracts is based on headers, not on size. This is a backwards-incompatible change that would break any client implementation that has used the "small file upload" contract for files larger than 150 MB, as vendors would be required to start rejecting them.
  2. The change of the query parameter filsesjon to ref is also backwards-incompatible. Any integration developed against a 1.0/1.1 vendor implementation needs to be modified to support 1.2, meaning that vendors cannot upgrade to 1.2 seamlessly without client changes. In the real world, it is nigh impossible to request client changes for such scenarios. My advice would be to still allow filsesjon as a fallback if ref is missing in 1.2. If not, I believe the version should be bumped to 2.0 as a result. Even then, there are still other backwards-compatibility problems in the specification; please check my comment at the end.
  3. I see a sequence number appended to the "session" in the examples (ref=abc1234567-1, ref=abc1234567-2, etc.), and I am not sure where the requirement originates from. I do not see it described, and I also do not see the need for it, considering that an implementation can reject an out-of-order "chunk" based on the Range and Content-Range headers together with the state of the chunks maintained on the server side (since this is a stateful implementation, after all). I believe this sequence number is derived information that is not needed and simply introduces noise and complexity. It is also backwards-incompatible, which implies it would require a version change to 2.0 and would prevent vendors from supporting both 1.x and the new format.
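The server-side check suggested in point 3 can be sketched as follows. Everything here is hypothetical: the class name, the choice of status codes (308 to continue, 201 on completion, 416 on rejection), and the idea that the server tracks one expected offset per session are assumptions for illustration, not part of the specification.

```python
import re


class UploadSession:
    """Minimal sketch of server-side chunk bookkeeping.

    The server tracks the next expected byte offset per session; a
    chunk whose Content-Range does not start at that offset is
    rejected. With this check in place, an explicit sequence number
    in the session reference is redundant.
    """

    _RANGE_RE = re.compile(r"bytes (\d+)-(\d+)/(\d+)")

    def __init__(self, total: int):
        self.total = total        # declared total file size in bytes
        self.next_offset = 0      # next byte the server expects

    def accept(self, content_range: str) -> int:
        """Return an HTTP-like status for one incoming chunk:
        308 to continue, 201 when the file is complete,
        416 for a malformed or out-of-order range."""
        m = self._RANGE_RE.fullmatch(content_range)
        if not m:
            return 416
        start, end, total = map(int, m.groups())
        if total != self.total or start != self.next_offset or end < start:
            return 416  # wrong total, out of order, or inverted range
        self.next_offset = end + 1
        return 201 if self.next_offset == self.total else 308
```

A chunk arriving ahead of the expected offset is refused, while in-order chunks advance the session until the declared total is reached, showing that Range/Content-Range plus server state is sufficient to enforce ordering.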

A note on the backwards-incompatible mentions in my comment:

The specification only (indirectly) requires the OData version to be transferred between the client and server. It does not, in any way, allow a client and server to negotiate the version to be used, other than via the purely informational admin/system. Even if admin/system was not intended as informational, a server implementation that upgrades its version automatically and inadvertently breaks all clients which only expect a lower version to be supported. Consequently, a 1.y change that breaks previous 1.x requirements cannot be communicated effectively between the client and server: a 1.2 vendor implementation (assuming such backwards-incompatible changes go there) cannot respond to a client in the 1.1 format, because the client has no way to indicate that it supports only 1.1 and nothing above it. A vendor therefore cannot actually upgrade its implementation to 1.2, as it cannot communicate with clients supporting different versions simultaneously. This means that vendor implementations are locked to the version at which the first client integration was performed, with no way to support newer features.

Please correct me if you find flaws in my logic regarding backwards compatibility.

@petterreinholdtsen
Collaborator Author

petterreinholdtsen commented Aug 30, 2023 via email

@ivaylomitrev

I only gave OData as an example; I was referring to the N5WS specification. Since different 1.x versions will now require different handling on the client side, there must be a way for the server and client to negotiate the "version" (format) they will communicate in. Otherwise a 1.1 client cannot talk to a 1.2 server, since certain fields change as part of this pull request.

@petterreinholdtsen
Collaborator Author

petterreinholdtsen commented Aug 30, 2023 via email

@tsodring
Contributor

I think this can go in, but it likely needs implementations from both the server and client perspectives to identify potential weaknesses.

@petterreinholdtsen petterreinholdtsen force-pushed the bugfix/fix-misleading-large-file-upload-range-values branch from 30ac9f7 to 963487e Compare November 27, 2023 12:17
@petterreinholdtsen petterreinholdtsen merged commit 7d4d548 into arkivverket:master Nov 27, 2023
1 check passed
Successfully merging this pull request may close these issues.

Misleading "large file upload" examples and specification