Integrate with Redshift Spectrum #40
@moeezalisyed thanks for the feature request. This looks to be a really useful set of functionality. My inclination with any large feature/rewrite is to make as many small, meaningful steps towards the final functionality as possible without breaking things. To this end, I'm pretty excited by the final feature set you detail here, as I think there are a LOT of little improvements we can make which will lead to the final solution.

Questions

Before scoping out all of the above, I'm hoping that your experience with Redshift Spectrum can help get me up to speed so I can poke about on some points:
@AlexanderMann I agree - it is always great to build up in steps. For your questions:
Here is the command:
@moeezalisyed sweet, so I think there are some basic steps to make this work.

❓Next question❓: I've been poking around online trying to find a simple, small-footprint way to write arbitrary nested Python data structures (JSON-compatible crap) to Parquet. Seems like the "simplest" method is to install
I did a small amount of digging and found an "Arrow" parquet library which can be used as follows:
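For context, something along these lines with pyarrow (a minimal sketch; the records and file name are purely illustrative, and I'm assuming the "Arrow" library in question is pyarrow):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small batch of flattened records, e.g. as they come off a stream
records = [
    {"id": 1, "email": "a@example.com", "amount": 10.5},
    {"id": 2, "email": "b@example.com", "amount": 3.25},
]

# Pivot the list of dicts into columns and build an Arrow table
columns = {key: [record.get(key) for record in records] for key in records[0]}
table = pa.table(columns)

# Write a gzip-compressed Parquet file
pq.write_table(table, "batch-0001.parquet", compression="gzip")
```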
Two desired improvements from this simple sample are:

I can see getting around at least the first challenge by perhaps using koalas (the big-data version of pandas) or another compatible library.

Hope this helps.
There's apparently also a native koalas to_parquet method.

Ref: https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html

Again, I'd suggest avoiding pandas at all costs, but koalas might work for this use case.

UPDATE: Upon further research, it appears that koalas may require a working local installation of Spark, which is most likely not feasible. I also found this. It looks like all roads lead to using pandas, unfortunately. And perhaps this is inevitable due to the need to create categories/lookups on source columns during compression. To revise my previous anti-pandas statement, it looks like pandas may be required. In that case, a mitigation to the scalability limitations of having to load all data into memory might be to support a

Hope this helps. 😄

UPDATE (#2): Related discussion here: dask/fastparquet#476
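For illustration, a rough sketch of the pandas-based, batch-at-a-time approach (the bucket/prefix and the helper name are hypothetical, and writing straight to s3:// paths assumes s3fs is installed alongside pyarrow or fastparquet):

```python
import pandas as pd

def flush_batch(records, batch_num, bucket="my-bucket", prefix="streams/orders"):
    # Each call writes one bounded in-memory batch as its own Parquet object,
    # so memory use stays proportional to the batch size rather than the whole stream.
    df = pd.DataFrame.from_records(records)
    df.to_parquet(
        f"s3://{bucket}/{prefix}/batch-{batch_num:05d}.parquet",
        compression="gzip",
        index=False,
    )
```

Since an external table's LOCATION is an S3 prefix, Spectrum reads every file under that prefix as part of the same table, so writing one file per batch fits naturally.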
@aaronsteers I doubt that'll be a problem. We already have all data loaded in memory, and are batching by reasonable sizes, so I think we can handle that. I also found the above links you're talking about, and had the same 🤦‍♀ moments, since it'd be great to have a simple library which writes a data format, but alas... pandas it is.

So I think the simplest path forward here is a couple of easy PRs, and then some more head-scratchy ones. Bonus: some of these can be done in parallel.
Integrating with Redshift Spectrum will make this target very efficient.
Use Case
If you have large amounts of data that is not accessed frequently, it is more efficient to store it in S3 instead of keeping it loaded in Redshift. When the data needs to be queried, it can be queried with Redshift Spectrum, which provides really fast querying by leveraging MPP. In order for this to work, the data needs to be stored in S3 in a structured columnar format - I suggest Parquet. Moreover, these files can be gzipped, leading to more efficient storage without slowing down query times.
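To make that concrete, here's roughly what pointing Spectrum at those Parquet files looks like (a hedged sketch: the schema, table, bucket, IAM role, and connection details are all placeholders, and I'm assuming a plain psycopg2 connection):

```python
import psycopg2

conn = psycopg2.connect(host="example.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="admin", password="...")
conn.autocommit = True  # external DDL cannot run inside a transaction block

with conn.cursor() as cur:
    # Register an external schema backed by the data catalog
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'spectrum_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
        CREATE EXTERNAL DATABASE IF NOT EXISTS
    """)
    # Point an external table at the Parquet files under an S3 prefix
    cur.execute("""
        CREATE EXTERNAL TABLE spectrum.orders (
            id BIGINT,
            email VARCHAR(256),
            amount DOUBLE PRECISION
        )
        STORED AS PARQUET
        LOCATION 's3://my-bucket/streams/orders/'
    """)
```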
Enhancements
Redshift Spectrum is also able to query nested fields in files that are stored in S3 as Parquet (https://docs.aws.amazon.com/redshift/latest/dg/tutorial-nested-data-create-table.html). Looking at the code, it seems like that's not something currently provided, so this will be an additional enhancement.
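On the writing side, pyarrow can already preserve that nesting (a minimal sketch; the records are illustrative and this assumes a reasonably recent pyarrow with Table.from_pylist):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Nested records stay as structs/lists instead of being flattened into columns
records = [
    {"id": 1, "customer": {"name": "a", "phones": ["111", "222"]}},
    {"id": 2, "customer": {"name": "b", "phones": []}},
]

# Type inference yields customer: struct<name: string, phones: list<string>>
table = pa.Table.from_pylist(records)
pq.write_table(table, "nested-batch.parquet", compression="gzip")
```

The nested-data tutorial linked above then covers mapping those struct/array columns onto the external table definition.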
Integration Steps:
Changes I foresee
I think this can make target-redshift very scalable. In fact, we may be able to separate the intermediate code and create a target-s3-parquet that simply stores the stream as an S3 Parquet file (similar to the s3-csv tap).
Talked with @AlexanderMann about this in Slack.