Releases: databricks/spark-redshift
v3.0.0-preview1
⚠️ Important: If you are using Spark 1.x, then use v1.1.0 instead. 3.x releases of this library are only compatible with Spark 2.x.
This is the first preview release of the 3.x version of this library. This release includes several major performance and security enhancements.
Feedback on the changes in this release is welcome! Please open an issue on GitHub to share your comments.
Upgrading from a 2.x release
Version 3.0 now requires `forward_spark_s3_credentials` to be explicitly set before Spark's S3 credentials will be forwarded to Redshift. Users who use the `aws_iam_role` or `temporary_aws_*` authentication mechanisms are unaffected by this change. Users who relied on the old default behavior will now need to explicitly set `forward_spark_s3_credentials` to `true` to continue using their previous Redshift-to-S3 authentication mechanism. For a discussion of the three authentication mechanisms and their security trade-offs, see the Authenticating to S3 and Redshift section of the README.
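As a minimal sketch of the new opt-in (the JDBC URL, table name, and `tempdir` are placeholders, and a SparkSession named `spark` is assumed):

```scala
// Read a Redshift table while explicitly opting in to forwarding Spark's S3 credentials.
// All connection values below are placeholders.
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/dev?user=USER&password=PASS")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("forward_spark_s3_credentials", "true") // explicit opt-in required as of 3.0
  .load()
```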
Changes
New Features:
- Add experimental support for using CSV as an intermediate data format when writing back to Redshift (#73 / #288). This can significantly speed up large writes and also allows saving of tables whose column names are unsupported by Avro's strict schema validation rules (#84).
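A sketch of opting in to the experimental CSV write path, assuming the `tempformat` option described in the README and an existing DataFrame `df`; all connection values are placeholders:

```scala
// Write a DataFrame to Redshift using CSV (rather than Avro) as the intermediate format.
// "CSV GZIP" is also expected to be accepted; the default remains "AVRO".
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/dev?user=USER&password=PASS")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("forward_spark_s3_credentials", "true")
  .option("tempformat", "CSV") // experimental
  .mode("error")
  .save()
```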
Performance enhancements:
- The read path is now based on Spark 2.0's new `FileFormat`-based data source, allowing it to benefit from performance improvements in `FileScanRDD`, such as automatic coalescing of partitions based on input size (#289).
Usability enhancements:
- Attempt to automatically detect when the Redshift cluster and S3 bucket are in different regions in order to provide more informative error messages (#285).
- Document this library's support for encrypted load / unload (#189).
- Document this library's security-related configurations, including an extensive discussion of the different communication channels and data sources and how each may be authenticated and encrypted (#291). Forwarding of Spark credentials to Redshift now requires explicit opt-in.
Bug fixes:
- Fix a `NumberFormatException` which occurred when reading the special floating-point values `NaN` and `Infinity` from Redshift (#261 / #269).
- Pass `AWSCredentialsProvider` instances instead of `AWSCredentials` instances in order to avoid expiration of temporary AWS credentials between different steps of the read or write operation (#200 / #284).
- IAM instance profile authentication no longer requires temporary STS keys (or regular AWS keys) to be explicitly acquired / supplied by user code (#173 / #274, #276 / #277).
v1.1.0
⚠️ Important: If you are using Spark 2.x, then use v2.0.1 instead. 1.x releases of this library are only compatible with Spark 1.x.
The 1.1.0 release (which supports Spark 1.x) contains the following changes:
Bug fixes:
- Provide a clearer error message when attempting to write `BinaryType` columns to Redshift (#251).
- Automatically detect the JDBC 4.2 version of the Amazon Redshift JDBC driver (#258 / #259).
- Restore compatibility with old versions of the AWS Java SDK (#254 / #135). This library now works with versions 1.7.4+ of the AWS Java SDK (and possibly earlier versions, but this has not been tested).
New Features:
- Support for setting custom JDBC column types (#220)
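As a sketch of this feature (assuming the `redshift_type` column-metadata key described in the README, and a DataFrame `df` with a string column named `city`):

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Force a column to a specific Redshift type by attaching the `redshift_type` key
// as column metadata before writing. The column name and type are placeholders.
val meta = new MetadataBuilder().putString("redshift_type", "VARCHAR(1024)").build()
val dfWithCustomType = df.withColumn("city", df("city").as("city", meta))
```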
v2.0.1
⚠️ Important: If you are using Spark 1.x, then use v1.1.0 instead. 2.x releases of this library are only compatible with Spark 2.x.
The 2.0.1 release (which is compatible with Spark 2.x) includes the following bug fixes:
- Provide a clearer error message when attempting to write `BinaryType` columns to Redshift (#251).
- Automatically detect the JDBC 4.2 version of the Amazon Redshift JDBC driver (#258 / #259).
- Restore compatibility with old versions of the AWS Java SDK (#254 / #135). This library now works with versions 1.7.4+ of the AWS Java SDK (and possibly earlier versions, but this has not been tested).
v2.0.0
This is the first non-preview release of this library which supports Spark 2.0.0+.
This incorporates all of the changes from the v2.0.0-preview1 release, as well as the following bug fixes:
- Fix an issue where decimal-parsing logic was locale-sensitive, which could lead to incorrect results when reading decimals from Redshift when Spark was running in locales that use commas as decimal marks (#243 / #249).
v2.0.0-preview1
This is the first preview release of this library which supports Spark 2.x previews and release candidates.
A small number of deprecated features have been removed in this release; for a list, see #239.
v1.0.0
This is the last planned major release of this library for Spark 1.x.
`spark-redshift` 1.x releases will remain compatible with Spark 1.4.x through 1.6.x, while `spark-redshift` 2.0.0+ will support only Spark 2.0.0+.
We will continue to fix minor bugs in 1.x maintenance releases but do not plan to add major new features in the 1.x line.
Bug Fixes:
- Properly escape backslashes in queries passed to `UNLOAD` (#215 / #228).
- Fix loss of sub-second precision when reading timestamps from Redshift (#214 / #227).
- Fix a bug which led to incorrect results for queries that contained filters with date or timestamp literals (#152 / #156).
- Fix a bug which broke the use of IAM instance profile credentials (#158 / #159).
- Use `MANIFEST` to guard against eventually-consistent S3 bucket listing calls (#151).
Enhancements:
- The Redshift username and password can now be specified as configuration options rather than being embedded in the URL (#132 / #162). This should fix connectivity issues for users whose Redshift passwords contained non-URL-safe characters (see the sketch after this list).
- Support for using IAM roles to authorize Redshift <-> S3 connections (#199 / #219).
- Support for specifying column comments and encodings (#164, #172, #178).
- The `COPY` statement issued against Redshift is now logged in order to make debugging easier (#196).
- Documentation enhancements: #150, #163.
v0.6.0
Bug Fixes:
- Properly handle special characters in JDBC connection strings (#132 / #134). This bug affected users whose Redshift passwords contained special characters that were not valid in URLs (e.g. a password containing a percent sign (`%`)).
- Restored compatibility with `spark-avro` 1.0.0 (#111 / #114).
- Fix bugs related to using the PostgreSQL JDBC driver instead of Amazon's official Redshift JDBC driver (#126, #143, #147). If your classpath contains both the PostgreSQL and Amazon drivers, explicitly specifying a JDBC driver class via the `jdbcdriver` parameter will now force that driver class to be used (see the sketch after this list).
- Give a better warning message for non-existent S3 buckets when attempting to read their bucket lifecycle configurations (#138 / #142).
- Minor documentation fixes: #119, #120, #123, #137.
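A sketch of pinning the driver explicitly when both drivers are on the classpath; the class name shown is Amazon's JDBC 4.1 Redshift driver and is given as an illustration, and all other values are placeholders:

```scala
// Explicitly select the Amazon Redshift driver rather than letting driver
// resolution fall through to the PostgreSQL driver on the same classpath.
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/dev?user=USER&password=PASS")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("jdbcdriver", "com.amazon.redshift.jdbc41.Driver")
  .load()
```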
Enhancements:
- Redshift queries are now cancelled when the thread issuing the query is interrupted (#116 / #117). If you cancel a Databricks notebook shell while it is executing a `spark-redshift` query, the Spark REPL will no longer crash due to interrupts being swallowed.
- When writing data back to Redshift, dates are now written in the default Redshift date format (`yyyy-MM-dd`) rather than a timestamp format (#122 / #130).
- `spark-redshift` now implements Spark 1.6's new `unhandledFilters` API, which allows Spark to eliminate a duplicate layer of filtering for filters that are pushed down to Redshift (#128).
v0.5.2
`spark-redshift` 0.5.2 is a maintenance release that contains a handful of important bugfixes. We recommend that all users upgrade to this release.
Bug Fixes:
- Fixed a thread-safety issue which could lead to errors or data corruption when processing date, timestamp, or decimal columns (#107 / #108).
- Fixed bugs related to handling of S3 credentials when they are specified as part of the `tempdir` URL (#109).
- Fixed a typo in the AWS credentials section of the README: the old text referred to `sc.hadoopConfig` instead of `sc.hadoopConfiguration` (#109).
Enhancements:
- Added a new `extracopyoptions` configuration, which allows advanced users to pass additional options to Redshift `COPY` commands (#35) (see the sketch after this list).
- Added an example of writing data back to Redshift using the SQL language API (#110).
- Added documentation on how to configure the SparkContext's global `hadoopConfiguration` from Python (#109).
- Added a tutorial (#101 and #106).
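A sketch of these options in Scala (the Scala equivalent of the Python `hadoopConfiguration` recipe referenced above); all keys, credentials, and connection values are placeholders, and `sc`, a `DataFrame` named `df`, and standard s3n Hadoop credential keys are assumed:

```scala
// Supply S3 credentials via the SparkContext's global Hadoop configuration
// (an alternative to embedding them in the tempdir URL).
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

// Write to Redshift, passing extra options through to the generated COPY command.
// TRUNCATECOLUMNS and MAXERROR are ordinary Redshift COPY parameters shown as examples.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/dev?user=USER&password=PASS")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("extracopyoptions", "TRUNCATECOLUMNS MAXERROR 10")
  .mode("append")
  .save()
```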
v0.5.1
`spark-redshift` 0.5.1 is a maintenance release which contains several bugfixes and usability improvements:
- Improved JDBC quoting and escaping:
  - Column names are now properly quoted when saving tables to Redshift, allowing reserved words or names containing special characters to be used as column names (#80 / #85).
  - Table names that are qualified with schemas (e.g. `myschema.mytable`) or which contain special characters (such as spaces) are now supported (#97 / #102).
- Improved dependency handling:
  - `spark-redshift` no longer has a binary dependency on the `hadoop-aws` artifact, which caused problems for EMR users (#92 / #94).
  - When using the Redshift JDBC driver, both the JDBC 4.0 and 4.1 versions of the driver can now be used without having to change the default `jdbcdriver` setting; the proper configuration will be automatically chosen depending on which version of the JDBC driver can be loaded (#83 / #90).
- Misc. bugfixes:
  - Fix a bug which prevented tables with empty partitions from being saved to Redshift (#96 / #102).
  - Fix spurious exceptions when checking the S3 bucket lifecycle configuration when `tempdir` points to the root of the bucket (#91 / #95).
  - Fixed a bug in `Utils.joinURLs` which caused problems for Windows users (#93).