v3.0.0-preview1
Pre-release
⚠️ Important: If you are using Spark 1.x, then use v1.1.0 instead. 3.x releases of this library are only compatible with Spark 2.x.
This is the first preview release of the 3.x version of this library. This release includes several major performance and security enhancements.
Feedback on the changes in this release is welcome! Please open an issue on GitHub to share your comments.
Upgrading from a 2.x release
Version 3.0 now requires `forward_spark_s3_credentials` to be explicitly set before Spark's S3 credentials will be forwarded to Redshift. Users of the `aws_iam_role` or `temporary_aws_*` authentication mechanisms are unaffected by this change. Users who relied on the old default behavior must now explicitly set `forward_spark_s3_credentials` to `true` to continue using their previous Redshift-to-S3 authentication mechanism. For a discussion of the three authentication mechanisms and their security trade-offs, see the "Authenticating to S3 and Redshift" section of the README.
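As a minimal sketch of the opt-in, a job that previously relied on the default forwarding behavior would add the option explicitly when reading or writing. The JDBC URL, table name, and `tempdir` below are placeholders, not values from this release:

```scala
// Hypothetical upgrade example: opting in to S3 credential forwarding in 3.0.
// All connection values below are placeholders.
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("forward_spark_s3_credentials", "true") // explicit opt-in required as of 3.0
  .load()
```

Jobs using `aws_iam_role` or `temporary_aws_*` options instead of forwarded credentials need no change.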
Changes
New Features:
- Add experimental support for using CSV as an intermediate data format when writing back to Redshift (#73 / #288). This can significantly speed up large writes and also allows saving of tables whose column names are unsupported by Avro's strict schema validation rules (#84).
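A sketch of selecting the experimental CSV intermediate format on the write path, assuming it is chosen via a `tempformat` option (the URL and table name are placeholders):

```scala
// Hypothetical example: using CSV instead of Avro as the intermediate
// format when writing back to Redshift. Connection values are placeholders.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
  .option("dbtable", "target_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("tempformat", "CSV") // experimental; Avro remains the default
  .option("forward_spark_s3_credentials", "true")
  .mode("error")
  .save()
```

Because CSV sidesteps Avro's strict schema validation, this path also allows saving tables whose column names Avro would reject.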
Performance enhancements:
- The read path is now based on Spark 2.0's new `FileFormat`-based data source, allowing it to benefit from performance improvements in `FileScanRDD`, such as automatic coalescing of partitions based on input size (#289).
Usability enhancements:
- Attempt to automatically detect when the Redshift cluster and S3 bucket are in different regions in order to provide more informative error messages (#285).
- Document this library's support for encrypted load / unload (#189).
- Document this library's security-related configurations, including an extensive discussion of the different communication channels and data sources and how each may be authenticated and encrypted (#291). Forwarding of Spark credentials to Redshift now requires explicit opt-in.
Bug fixes:
- Fix a `NumberFormatException` which occurred when reading the special floating-point values `NaN` and `Infinity` from Redshift (#261 / #269).
- Pass `AWSCredentialProvider`s instead of `AWSCredentials` instances in order to avoid expiration of temporary AWS credentials between different steps of the read or write operation (#200 / #284).
- IAM instance profile authentication no longer requires temporary STS keys (or regular AWS keys) to be explicitly acquired / supplied by user code (#173 / #274, #276 / #277).