This repository has been archived by the owner on Oct 31, 2019. It is now read-only.

Downloading large files (>2Gb) from Azure blob storage to R #83

Open
Dacolo opened this issue Jun 15, 2017 · 5 comments

Comments

@Dacolo

Dacolo commented Jun 15, 2017

I need to download files from blob storage onto an Azure VM.
For smaller files the code below works; it is also possible to set the type to “text” and leave out the rawToChar call.
For larger files (>2 GB), downloading with type “text” throws an error, because a single character string in R cannot exceed 2 GB.
Downloading with type set to “raw” works, but then the rawToChar call fails in the same way (long vectors not supported yet: /builddir/patched_source/src/main/raw.c:68).

# Import data (encode spaces as %20)
filename <- "20150101.csv"

data <- azureGetBlob(sc, 
             resourceGroup = "CLOGGING-PROD-RG",
             storageAccount = "cloggingmlsinput2", 
             container = "mlsinput",
             storageKey = sKey,
             blob=filename,
             type="raw")
data2 <- rawToChar(data, multiple = FALSE)
data3 <- read.csv(text=data2)

Error for large files:
Error in readBin(content, character()) : 
  R character strings are limited to 2^31-1 bytes

Could this be fixed?
Thanks!

@hongooi73
Contributor

This is a fundamental limit in R. As a workaround, you can use download.file to save the file to disk, rather than trying to create an in-memory object.
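
A minimal sketch of the save-to-disk idea, assuming the `type = "raw"` download from the original post has already succeeded and the bytes are sitting in a raw vector. Instead of converting them with rawToChar, the bytes are written to a file in pieces and the CSV is then read from that file, so no single character string ever has to hold the whole blob. The helper name `write_raw_in_chunks` and the chunk size are hypothetical choices for illustration, not part of any package.

```r
# Hypothetical helper: write a (possibly very long) raw vector to disk in chunks.
write_raw_in_chunks <- function(x, path, chunk_bytes = 64 * 1024^2) {
  stopifnot(is.raw(x), chunk_bytes >= 1)
  con <- file(path, open = "wb")
  on.exit(close(con))
  n <- length(x)          # returned as a double for long vectors
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_bytes - 1, n)
    writeBin(x[start:end], con)   # each call writes at most chunk_bytes bytes
    start <- end + 1
  }
  invisible(path)
}

# Small stand-in for the downloaded blob; a real blob would come from the download step.
blob_bytes <- charToRaw("id,value\n1,10\n2,20\n")
csv_path <- tempfile(fileext = ".csv")
write_raw_in_chunks(blob_bytes, csv_path, chunk_bytes = 5)

# Read from the file path rather than from an in-memory string.
df <- read.csv(csv_path)
```

This still needs enough memory to hold the raw vector itself; it only avoids the conversion to one large string.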

@andrie
Contributor

andrie commented Sep 27, 2017

I don't think this is a fundamental limit, actually. I don't have the link ready, but I think the API allows a blob to be downloaded in chunks. In other words, the download becomes a streaming operation in which you stream bytes from the blob to local disk.
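
A rough sketch of what such a chunked download could look like, assuming the blob endpoint honors an HTTP Range header, the blob size is known in advance, and the httr package is installed. The names `byte_ranges`, `download_blob_in_chunks`, `total_bytes` and `auth_headers` are hypothetical, and the authentication headers a private blob would need are not worked out here.

```r
# Hypothetical helper: split a blob of total_bytes into inclusive byte ranges.
byte_ranges <- function(total_bytes, chunk_bytes = 4 * 1024^2) {
  total_bytes <- as.numeric(total_bytes)
  chunk_bytes <- as.numeric(chunk_bytes)
  stopifnot(total_bytes > 0, chunk_bytes >= 1)
  starts <- seq(0, total_bytes - 1, by = chunk_bytes)
  ends <- pmin(starts + chunk_bytes - 1, total_bytes - 1)
  sprintf("bytes=%.0f-%.0f", starts, ends)
}

# Hypothetical downloader: request one range at a time and append it to a file.
# auth_headers is an assumed named character vector of whatever headers
# the storage account requires.
download_blob_in_chunks <- function(url, dest, total_bytes,
                                    auth_headers = character(),
                                    chunk_bytes = 4 * 1024^2) {
  if (!requireNamespace("httr", quietly = TRUE)) {
    stop("package 'httr' is required for this sketch")
  }
  con <- file(dest, open = "wb")
  on.exit(close(con))
  for (rng in byte_ranges(total_bytes, chunk_bytes)) {
    resp <- httr::GET(url, httr::add_headers(.headers = c(auth_headers, Range = rng)))
    httr::stop_for_status(resp)
    writeBin(httr::content(resp, as = "raw"), con)  # only one chunk is in memory
  }
  invisible(dest)
}

# The range computation can be checked without any network access.
example_ranges <- byte_ranges(10, chunk_bytes = 4)
```

Only one chunk is held in memory at a time, so the size of the blob is bounded by disk space rather than by R's string limit.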

@hongooi73
Contributor

Well, I meant a fundamental limit on the length of character strings. Downloading to disk would get around that, sure. Wouldn't it still be simpler to use download.file?

@andrie
Contributor

andrie commented Sep 27, 2017

download.file() will only work if the blob is public. If it's private, you must access via REST.

Also, I thought the vector limit is 2^58 - 1?

@hongooi73
Contributor

hongooi73 commented Sep 27, 2017

Ah yes, I forgot about public v private.

The limit on the number of strings in a vector is essentially how much memory you have, but the limit on string size is 2^31 - 1; see ?"Memory-limits".

There are also limits on individual objects. The storage space cannot exceed the address limit, and if you try to exceed that limit, the error message begins "cannot allocate vector of length". The number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9, which is also the limit on each dimension of an array.
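
As a small illustration of that limit, here is a sketch that checks whether a raw vector is short enough to fit in a single character string before rawToChar is attempted. The helper name `fits_in_one_string` is hypothetical; the only fact it relies on is that `.Machine$integer.max` equals 2^31 - 1.

```r
# The per-string byte limit quoted above, taken from the largest R integer.
max_string_bytes <- .Machine$integer.max

# Hypothetical guard: TRUE if the bytes could fit in one character string.
fits_in_one_string <- function(x) {
  stopifnot(is.raw(x))
  length(x) <= max_string_bytes
}

small <- charToRaw("a,b\n1,2\n")
if (fits_in_one_string(small)) {
  small_text <- rawToChar(small)
}
```

For anything larger, the data would have to go to disk or be processed in pieces instead.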
