Describe the bug
It's not strictly a bug, but I'm filing it as one, given the high payoff.
CKAN is widely used to distribute data. OOTB, max_resource_size is set to 10MB. Nowadays, that setting is minuscule, and even when CKAN is configured to handle large files (>500MB), several related problems present themselves:
resource_create and resource_update file uploads can only handle files up to 2GB (otherwise you get an OverflowError: string longer than 2147483647 bytes)
even for files smaller than 2GB, the current implementation requires loading the whole file into memory, sometimes freezing the client
unreliable connections and timeouts, as HTTP is not optimized for handling such large requests
These issues can be mitigated by using chunked/streaming uploads, but that is a sizeable issue in its own right and would also require rework of existing Filestore API clients.
Several users have also migrated to alternate filestores like ckanext-cloudstorage to overcome this limitation.
But much of the pain of handling large files can be avoided by adding native support for resource precompression to the Filestore API, in a way that is transparent to existing Filestore API clients.
And it comes with big benefits for all CKAN users - developers, publishers, and downstream users alike:
most machine-readable formats (CSV, TSV, JSON, XML) compress very well - typically >80%
huge bandwidth savings - for both the CKAN operator and the user
huge time/cost savings that follow directly from the lower bandwidth requirements
improved user experience, as downloads and response times are faster
and last but not least, the large file upload limit is effectively raised beyond 2GB
Publishers can precompress their data before upload. This lets them stay under the 2GB upload limit even for files that are larger than 2GB uncompressed, without having to resort to chunked/streaming uploads (e.g. upload a 10GB CSV file that is less than 2GB compressed).
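For illustration, here is a minimal sketch of that publisher workflow, assuming the ckanapi client library; the CKAN URL, API token, dataset id, and file names are hypothetical, not a prescribed implementation:

```python
import gzip
import shutil

import ckanapi

# Precompress the CSV before upload; a multi-GB CSV typically shrinks well below 2GB.
with open("sample.csv", "rb") as src, gzip.open("sample.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)  # streams in chunks, never loads the whole file into memory

ckan = ckanapi.RemoteCKAN("https://ckan.example.org", apikey="my-api-token")
ckan.action.resource_create(
    package_id="my-dataset",             # hypothetical dataset id
    name="sample.csv",
    format="CSV",                        # still described by its primary format
    upload=open("sample.csv.gz", "rb"),  # some CKAN versions also expect a 'url' field
)
```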
To do so, the following needs to be implemented:
automatic processing of an extra ".gz" suffix
Files uploaded with a ".gz" suffix (e.g. sample.csv.gz, sample.xml.gz, test.json.gz) will still be recognized by their primary MIME type. Upon upload, precompressed files are asynchronously decompressed by a background job into the same path in the filestore (see the decompression sketch after this list). Existing resource actions remain largely unchanged.
set the nginx gzip_static on directive for the filestore location
nginx will then serve the precompressed .gz file transparently. If there is no precompressed file, or the client doesn't support gzip, nginx automatically serves the original uncompressed file. In this way, existing Filestore API clients like datapusher and ckanapi need not be modified (Python's requests library already handles gzip-encoded responses transparently). A sample nginx config is sketched after this list.
OPTIONAL: a background job that automatically precompresses resource files if a data publisher doesn't precompress them before upload (also sketched below).
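For the ".gz" suffix handling above, a minimal sketch of the decompression job - the function name, paths, and the enqueue_job wiring are assumptions for illustration, not an agreed implementation:

```python
import gzip
import shutil


def decompress_resource(gz_path: str, out_path: str) -> None:
    """Stream-decompress a precompressed upload to its final path in the filestore."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copies in chunks, never loads the whole file


# In CKAN this could be queued from a resource hook, for example:
# toolkit.enqueue_job(decompress_resource, [stored_path + ".gz", stored_path])
```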
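For the nginx step, a sketch that assumes the deployment already serves the filestore directory directly from nginx - the location and path are illustrative, not CKAN defaults:

```nginx
location /filestore/ {
    alias /var/lib/ckan/default/resources/;
    gzip_static on;   # serve sample.csv.gz for sample.csv when the client accepts gzip,
                      # otherwise fall back to the uncompressed file automatically
    gzip_vary on;     # send "Vary: Accept-Encoding" so caches keep both variants
}
```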
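And the optional compression job could be the mirror image - again just a sketch with illustrative names:

```python
import gzip
import shutil


def precompress_resource(path: str) -> None:
    """Create path + '.gz' next to the stored file so gzip_static can pick it up."""
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst)
```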
The only potential downside is the increased filestore storage footprint, as you now need to store both the compressed and uncompressed variants. IMHO, storage is very cheap, and bandwidth/performance is far more expensive/valuable.
But even here, we can use gzip_static always and force nginx to always serve the gzipped file, eliminating the need to store the uncompressed variant - effectively reducing your filestore storage requirements as well!
It's also noteworthy that uwsgi supports gzip precompression with the static-gzip-all setting - https://ugu.readthedocs.io/en/latest/compress.html
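A minimal sketch of that storage-saving nginx variant - paths are illustrative, and it assumes nginx is built with the ngx_http_gunzip_module so clients that can't accept gzip still get usable responses:

```nginx
location /filestore/ {
    alias /var/lib/ckan/default/resources/;
    gzip_static always;  # always hand out the precompressed .gz file
    gunzip on;           # decompress on the fly for the rare client without gzip support
    gzip_vary on;
}
```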
CKAN version
2.9.1