Add the ability to fetch remote files (s3 and http[s]) #34

ZJONSSON · 2018-01-22T22:57:23Z

Adapters included for S3 files and files available over http(s). Only the parts of interest are fetched over the wire, eliminating the need to download the complete files. The performance of the adapters when fetching full rows is drastically improved with #33.
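
A minimal sketch of what "only the parts of interest" can mean over HTTP, assuming the standard Parquet layout in which the file ends with a 4-byte little-endian footer length followed by the magic `PAR1`. The helper names (`rangeHeader`, `trailerRange`, `footerRange`) are hypothetical and not taken from this PR; they only compute the `Range` header values a client could send.

```javascript
// Build an HTTP Range header value for `length` bytes starting at `offset`.
// Both ends of an HTTP byte range are inclusive, hence the -1.
function rangeHeader(offset, length) {
  if (!Number.isInteger(offset) || !Number.isInteger(length) || offset < 0 || length <= 0) {
    throw new RangeError('offset must be >= 0 and length must be > 0');
  }
  return `bytes=${offset}-${offset + length - 1}`;
}

// The trailer is the last 8 bytes: 4-byte footer length + 4-byte magic "PAR1".
function trailerRange(fileSize) {
  return rangeHeader(fileSize - 8, 8);
}

// Given the 8 trailer bytes (a Buffer), return the range covering the footer metadata.
function footerRange(fileSize, trailer) {
  if (trailer.toString('latin1', 4, 8) !== 'PAR1') {
    throw new Error('not a parquet file');
  }
  const footerLength = trailer.readUInt32LE(0);
  return rangeHeader(fileSize - 8 - footerLength, footerLength);
}
```

A client would request `trailerRange(fileSize)` first, then `footerRange(fileSize, trailer)` for the metadata, and after that only the column chunks it needs.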

No tests so far, but if you are happy with the approach we could add tests using a simple localhost server and an S3 compatible local service, if that makes sense?

ZJONSSON · 2018-02-01T17:13:02Z

Added the ability to read directly from a buffer containing the whole parquet file.
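
As an illustration only (this is not the `openBuffer` code from the PR), reading from a buffer can be thought of as exposing the same offset/length read interface that a file or remote adapter would provide. `bufferReader` and its `read`/`fileSize` members are hypothetical names.

```javascript
// Hypothetical helper: wrap an in-memory Buffer in an (offset, length)
// read interface, similar in spirit to what a file or remote reader offers.
function bufferReader(buffer) {
  return {
    fileSize: buffer.length,
    read(offset, length) {
      if (offset < 0 || length < 0 || offset + length > buffer.length) {
        throw new RangeError('read outside of buffer');
      }
      // subarray returns a view on the same memory, so nothing is copied
      return buffer.subarray(offset, offset + length);
    },
  };
}
```

Since the whole file is already in memory, the file size is known up front and no network round trip is involved.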

asmuth · 2018-02-11T20:48:49Z

LGTM

kessler · 2018-02-11T21:32:57Z

@ZJONSSON let's add tests if you have the time, if not let us know and we'll try to do it

ZJONSSON · 2018-02-12T01:45:19Z

I agree. To keep tests as unit tests we would have to add a couple of things to the devDependencies:

- a server that accepts range requests, such as expressjs
- an S3 mock adapter, such as mock-aws-s3

A test for the buffer reader does not require any additional dependencies.
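
A rough sketch of the localhost test server idea, using Node's built-in `http` module instead of expressjs so that no extra devDependency is needed. `parseRange` and `createRangeServer` are hypothetical names; only single `bytes=start-end` and `bytes=start-` ranges are handled, and anything else falls back to a full `200` response for simplicity.

```javascript
const http = require('http');

// Parse a single "bytes=start-end" or "bytes=start-" range.
// Returns null when the header is missing, unsupported, or out of bounds.
function parseRange(header, size) {
  const match = /^bytes=(\d+)-(\d*)$/.exec(header || '');
  if (!match) return null;
  const start = Number(match[1]);
  const end = match[2] === '' ? size - 1 : Math.min(Number(match[2]), size - 1);
  return start <= end ? { start, end } : null;
}

// Create (but do not start) a server that serves the Buffer `data`
// and answers range requests with 206 Partial Content.
function createRangeServer(data) {
  return http.createServer((req, res) => {
    const range = parseRange(req.headers.range, data.length);
    if (!range) {
      res.writeHead(200, { 'Content-Length': data.length, 'Accept-Ranges': 'bytes' });
      return res.end(data);
    }
    const body = data.subarray(range.start, range.end + 1);
    res.writeHead(206, {
      'Content-Range': `bytes ${range.start}-${range.end}/${data.length}`,
      'Content-Length': body.length,
      'Accept-Ranges': 'bytes',
    });
    res.end(body);
  });
}
```

In a test, `createRangeServer(fs.readFileSync(fixturePath)).listen(0)` would bind to a free local port, where `fixturePath` stands for whatever parquet fixture the test suite uses.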

Kosta-Github · 2019-04-26T10:37:14Z

what is the status of this? abandoned?

kessler · 2019-04-28T14:09:47Z

@Kosta-Github it's not abandoned, but we (mainly me) have had trouble allocating time for this project since the initial effort of building it. If you want to contribute, let's discuss :-)

ZJONSSON changed the title ~~Add the ability to fetch data remote data~~ Add the ability to fetch remote files (s3 and http[s]) Jan 22, 2018

ZJONSSON force-pushed the remote-files branch 2 times, most recently from bdb76c1 to 792c289 Compare February 1, 2018 02:59

ZJONSSON added 3 commits February 18, 2018 16:02

Add the ability to fetch data remote data

00ce93e

Adapters included for S3 files and files available over http(s)

Add openBuffer

cc67213

Allow this.fileSize to be an async function as well

e3c70df

Instantiating a new client blocks on retrieving filesize. But there are cases when we really don't need the filesize, for example when we have the metadata cached already.
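
One way to picture the idea in this commit, as a sketch rather than the actual change: make the file size a function that performs the lookup on first call and caches the result, so constructing a client never blocks on it. `lazyFileSize` and `fetchSize` are hypothetical names; `fetchSize` stands for whatever does the real lookup (for example a HEAD request).

```javascript
// Hypothetical helper: defer the size lookup until it is first needed,
// and share one in-flight lookup between concurrent callers.
function lazyFileSize(fetchSize) {
  let pending = null;
  return function fileSize() {
    if (pending === null) {
      pending = Promise.resolve()
        .then(fetchSize)
        .catch((err) => {
          // forget a failed lookup so a later call can retry
          pending = null;
          throw err;
        });
    }
    return pending;
  };
}
```

A caller that already has the metadata cached simply never invokes the returned function, and no size request is made.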

ZJONSSON force-pushed the remote-files branch from 4f5ad32 to e3c70df Compare February 18, 2018 21:02

jeffbski-rga pushed a commit to jeffbski/parquetjs that referenced this pull request Mar 2, 2020

Merge pull request ironSource#34 from ZJONSSON/allow-list-map

606bb29

Allow MAP and LIST (for athena/hive)
