
Extend and explain in more detail how to upload large files. #315

Conversation

petterreinholdtsen
Collaborator

Introduce a new bulkgrense system value to allow clients to learn exactly how small
"small" files are. Specify how to reject large files uploaded using the small-file method.
Explain the format of the Range and Content-Range headers. Make it clear that the upload
needs to be done in sequence (not out of order) by specifying that the Location header
returned after each bulk upload is to be used for the next bulk upload.

Drop the Google Drive reference. It does not really explain much, the URL is broken,
and this specification should describe its own protocol, as it is not identical to
the Google Drive protocol.

Fixes #313.
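As a sketch of the sequential chunked upload the commit message describes, the helper below plans the Content-Range header for each chunk. This is illustrative only: the byte-range arithmetic follows standard HTTP semantics, and the requirement that chunks be sent in order (each response's Location header feeding the next request) is from the specification change, not from any real client library.

```python
def content_range(offset: int, chunk_len: int, total: int) -> str:
    """Build a Content-Range header value for one chunk.

    HTTP byte ranges are inclusive, so a 1000-byte chunk at
    offset 0 of a 2500-byte file is "bytes 0-999/2500".
    """
    return f"bytes {offset}-{offset + chunk_len - 1}/{total}"


def plan_chunks(total: int, chunk_size: int):
    """Yield (offset, length, header) for each chunk, in order.

    The protocol requires these to be uploaded sequentially: the
    Location header returned for chunk N gives the URL to use for
    chunk N+1, so chunks cannot be sent out of order.
    """
    offset = 0
    while offset < total:
        length = min(chunk_size, total - offset)
        yield offset, length, content_range(offset, length, total)
        offset += length


# Example: a 2500-byte file uploaded in 1000-byte chunks.
for offset, length, header in plan_chunks(2500, 1000):
    print(header)
```

Running this prints three ranges covering the whole file with no gaps or overlap, which is the property the rewritten examples in the specification are meant to demonstrate.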

@petterreinholdtsen petterreinholdtsen force-pushed the bugfix/fix-misleading-large-file-upload-range-values branch from ae297f4 to f082527 Compare August 24, 2023 06:50
@petterreinholdtsen
Collaborator Author

@tsodring and @ivaylomitrev, please have a look at the rewritten example and description and let me know whether it makes sense and seems correct to you.

@ivaylomitrev

It seems correct to me on a general level, but I have several (not-so-small) concerns on the topic:

  1. I do not see a reason to limit "small" file uploads. If a client decides to upload a 5 GB file as a "small" file, that is up to them. I would assume that in such a case they simply do not want a resumable session and are OK with repeating the upload if necessary (i.e. if it fails). In other words, to me the distinction is between resumable and non-resumable uploads, not between small and large files, and I do not see the need for limitations. Also, 1.0 and 1.1 imply that resumable file upload can be used for files larger than 150 MB, but they do not prevent files larger than 150 MB from being uploaded via the "small file upload" implementation. On the contrary, the distinction between the two contracts is based on headers, not on size. This is a backwards-incompatible change that would break any client implementation that has used the "small file upload" contract for files larger than 150 MB, as vendors would be required to start rejecting them.
  2. The change of the query parameter filsesjon to ref is also backwards-incompatible. Any integration developed against a 1.0/1.1 vendor implementation needs to be modified to support 1.2, meaning that vendors cannot upgrade to 1.2 seamlessly without client changes. In the real world, it is nigh impossible to request client changes for such scenarios. My advice would be to still allow filsesjon as a fallback if ref is missing in 1.2. If not, I believe the version should be bumped to 2.0 as a result. Even then, there are still other backwards-compatibility problems in the specification; please check my comment at the end.
  3. I see a sequence number appended to the "session" in the examples (ref=abc1234567-1, ref=abc1234567-2, etc.), and I am not sure where the requirement originates from. I do not see it described, and I also do not see the need for it, considering that an implementation can reject an out-of-order "chunk" based on the Range and Content-Range headers together with the state of the chunks maintained on the server side (since this is a stateful implementation, after all). I believe this sequence number is derived information that is not needed and simply introduces noise and complexity. It is also backwards-incompatible, which implies it would require a version change to 2.0 and would prevent vendors from supporting both 1.x and the new format.
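The server-side check suggested in point 3 can be sketched as follows. Everything here is hypothetical: the class name, the choice of status codes (308 to continue, 201 on completion, 416 on rejection), and the idea that the server tracks one expected offset per session are assumptions for illustration, not part of the specification.

```python
import re


class UploadSession:
    """Minimal sketch of server-side chunk bookkeeping.

    The server tracks the next expected byte offset per session; a
    chunk whose Content-Range does not start at that offset is
    rejected. With this check in place, an explicit sequence number
    in the session reference is redundant.
    """

    _RANGE_RE = re.compile(r"bytes (\d+)-(\d+)/(\d+)")

    def __init__(self, total: int):
        self.total = total        # declared total file size in bytes
        self.next_offset = 0      # next byte the server expects

    def accept(self, content_range: str) -> int:
        """Return an HTTP-like status for one incoming chunk:
        308 to continue, 201 when the file is complete,
        416 for a malformed or out-of-order range."""
        m = self._RANGE_RE.fullmatch(content_range)
        if not m:
            return 416
        start, end, total = map(int, m.groups())
        if total != self.total or start != self.next_offset or end < start:
            return 416  # wrong total, out of order, or inverted range
        self.next_offset = end + 1
        return 201 if self.next_offset == self.total else 308
```

A chunk arriving ahead of the expected offset is refused, while in-order chunks advance the session until the declared total is reached, showing that Range/Content-Range plus server state is sufficient to enforce ordering.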

A note on the backwards-incompatible mentions in my comment:

The specification only (indirectly) requires the OData version to be transferred between the client and server. It does not, in any way, allow a client and server to negotiate the version to be used, other than via the purely informational admin/system. Even if admin/system was not intended as informational, a server implementation that upgrades its version automatically and inadvertently breaks all clients which only expect a lower version to be supported. Consequently, a 1.y change that breaks previous 1.x requirements cannot be communicated effectively between the client and server: a 1.2 vendor implementation (assuming such backwards-incompatible changes go there) cannot respond to a client in the 1.1 format, because the client has no way to indicate that it supports only 1.1 and nothing above it. A vendor therefore cannot actually upgrade its implementation to 1.2, as it cannot communicate with clients supporting different versions simultaneously. This means that vendor implementations are locked to the version at which the first client integration was performed, with no way to support newer features.

Please correct me if you find flaws in my logic regarding backwards compatibility.

@petterreinholdtsen
Collaborator Author

petterreinholdtsen commented Aug 30, 2023 via email

@ivaylomitrev

I only gave OData as an example; I was referring to the N5WS specification. Since different 1.x versions will now require different handling on the client side, there must be a way for the server and client to negotiate the "version" (format) they will communicate in. Otherwise a 1.1 client cannot talk to a 1.2 server, since certain fields change as part of this pull request.

@petterreinholdtsen
Collaborator Author

petterreinholdtsen commented Aug 30, 2023 via email

@tsodring
Contributor

I think this can go in, but it likely needs implementations from both the server and client perspectives to identify potential weaknesses.

@petterreinholdtsen petterreinholdtsen force-pushed the bugfix/fix-misleading-large-file-upload-range-values branch from 30ac9f7 to 963487e Compare November 27, 2023 12:17
@petterreinholdtsen petterreinholdtsen merged commit 7d4d548 into arkivverket:master Nov 27, 2023
1 check passed
Successfully merging this pull request may close these issues.

Misleading "large file upload" examples and specification