Releases: databricks/spark-redshift
v3.0.0-preview1
⚠️ Important: If you are using Spark 1.x, then use v1.1.0 instead. 3.x releases of this library are only compatible with Spark 2.x.
This is the first preview release of the 3.x version of this library. This release includes several major performance and security enhancements.
Feedback on the changes in this release is welcome! Please open an issue on GitHub to share your comments.
Upgrading from a 2.x release
Version 3.0 now requires `forward_spark_s3_credentials` to be explicitly set before Spark's S3 credentials will be forwarded to Redshift. Users who use the `aws_iam_role` or `temporary_aws_*` authentication mechanisms are unaffected by this change. Users who relied on the old default behavior will now need to explicitly set `forward_spark_s3_credentials` to `true` to continue using their previous Redshift-to-S3 authentication mechanism. For a discussion of the three authentication mechanisms and their security trade-offs, see the Authenticating to S3 and Redshift section of the README.
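As a minimal sketch of the new opt-in (the JDBC URL, table name, and `tempdir` are placeholders, and a SparkSession named `spark` is assumed):

```scala
// Read a Redshift table while explicitly opting in to forwarding Spark's S3 credentials.
// All connection values below are placeholders.
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/dev?user=USER&password=PASS")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("forward_spark_s3_credentials", "true") // explicit opt-in required as of 3.0
  .load()
```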
Changes
New Features:
- Add experimental support for using CSV as an intermediate data format when writing back to Redshift (#73 / #288). This can significantly speed up large writes and also allows saving of tables whose column names are unsupported by Avro's strict schema validation rules (#84).
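A sketch of opting in to the experimental CSV write path, assuming the `tempformat` option described in the README and an existing DataFrame `df`; all connection values are placeholders:

```scala
// Write a DataFrame to Redshift using CSV (rather than Avro) as the intermediate format.
// "CSV GZIP" is also expected to be accepted; the default remains "AVRO".
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/dev?user=USER&password=PASS")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("forward_spark_s3_credentials", "true")
  .option("tempformat", "CSV") // experimental
  .mode("error")
  .save()
```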
Performance enhancements:
- The read path is now based on Spark 2.0's new `FileFormat`-based data source, allowing it to benefit from performance improvements in `FileScanRDD`, such as automatic coalescing of partitions based on input size (#289).
Usability enhancements:
- Attempt to automatically detect when the Redshift cluster and S3 bucket are in different regions in order to provide more informative error messages (#285).
- Document this library's support for encrypted load / unload (#189).
- Document this library's security-related configurations, including an extensive discussion of the different communication channels and data sources and how each may be authenticated and encrypted (#291). Forwarding of Spark credentials to Redshift now requires explicit opt-in.
Bug fixes:
- Fix a `NumberFormatException` which occurred when reading the special floating-point values `NaN` and `Infinity` from Redshift (#261 / #269).
- Pass `AWSCredentialsProvider` instances instead of `AWSCredentials` instances in order to avoid expiration of temporary AWS credentials between different steps of the read or write operation (#200 / #284).
- IAM instance profile authentication no longer requires temporary STS keys (or regular AWS keys) to be explicitly acquired / supplied by user code (#173 / #274, #276 / #277).
v1.1.0
⚠️ Important: If you are using Spark 2.x, then use v2.0.1 instead. 1.x releases of this library are only compatible with Spark 1.x.
The 1.1.0 release (which supports Spark 1.x) contains the following changes:
Bug fixes:
- Provide a clearer error message when attempting to write `BinaryType` columns to Redshift (#251).
- Automatically detect the JDBC 4.2 version of the Amazon Redshift JDBC driver (#258 / #259).
- Restore compatibility with old versions of the AWS Java SDK (#254 / #135). This library now works with versions 1.7.4+ of the AWS Java SDK (and possibly earlier versions, but this has not been tested).
New Features:
- Support for setting custom JDBC column types (#220)
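As a sketch of this feature (assuming the `redshift_type` column-metadata key described in the README, and a DataFrame `df` with a string column named `city`):

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Force a column to a specific Redshift type by attaching the `redshift_type` key
// as column metadata before writing. The column name and type are placeholders.
val meta = new MetadataBuilder().putString("redshift_type", "VARCHAR(1024)").build()
val dfWithCustomType = df.withColumn("city", df("city").as("city", meta))
```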
v2.0.1
⚠️ Important: If you are using Spark 1.x, then use v1.1.0 instead. 2.x releases of this library are only compatible with Spark 2.x.
The 2.0.1 release (which is compatible with Spark 2.x) includes the following bug fixes:
- Provide a clearer error message when attempting to write `BinaryType` columns to Redshift (#251).
- Automatically detect the JDBC 4.2 version of the Amazon Redshift JDBC driver (#258 / #259).
- Restore compatibility with old versions of the AWS Java SDK (#254 / #135). This library now works with versions 1.7.4+ of the AWS Java SDK (and possibly earlier versions, but this has not been tested).
v2.0.0
This is the first non-preview release of this library which supports Spark 2.0.0+.
This incorporates all of the changes from the v2.0.0-preview1 release, as well as the following bug fixes:
- Fix an issue where decimal-parsing logic was locale-sensitive, which could lead to incorrect results when reading decimals from Redshift when Spark was running in locales that use commas as decimal marks (#243 / #249).
v2.0.0-preview1
This is the first preview release of this library which supports Spark 2.x previews and release candidates.
A small number of deprecated features have been removed in this release; for a list, see #239.
v1.0.0
This is the last planned major release of this library for Spark 1.x.
`spark-redshift` 1.x releases will remain compatible with Spark 1.4.x through 1.6.x, while `spark-redshift` 2.0.0+ will support only Spark 2.0.0+.
We will continue to fix minor bugs in 1.x maintenance releases but do not plan to add major new features in the 1.x line.
Bug Fixes:
- Properly escape backslashes in queries passed to `UNLOAD` (#215 / #228).
- Fix loss of sub-second precision when reading timestamps from Redshift (#214 / #227).
- Fix a bug which led to incorrect results for queries that contained filters with date or timestamp literals (#152 / #156).
- Fix a bug which broke the use of IAM instance profile credentials (#158 / #159).
- Use `MANIFEST` to guard against eventually-consistent S3 bucket listing calls (#151).
Enhancements:
- The Redshift username and password can now be specified as configuration options rather than being embedded in the URL (#132 / #162). This should fix connectivity issues for users whose Redshift passwords contained non-URL-safe characters (see the sketch after this list).
- Support for using IAM roles to authorize Redshift <-> S3 connections (#199 / #219).
- Support for specifying column comments and encodings (#164, #172, #178).
- The `COPY` statement issued against Redshift is now logged in order to make debugging easier (#196).
- Documentation enhancements: #150, #163.
v0.6.0
Bug Fixes:
- Properly handle special characters in JDBC connection strings (#132 / #134). This bug affected users whose Redshift passwords contained special characters that were not valid in URLs (e.g. a password containing a percent sign (`%`)).
- Restored compatibility with `spark-avro` 1.0.0 (#111 / #114).
- Fix bugs related to using the PostgreSQL JDBC driver instead of Amazon's official Redshift JDBC driver (#126, #143, #147). If your classpath contains both the PostgreSQL and Amazon drivers, explicitly specifying a JDBC driver class via the `jdbcdriver` parameter will now force that driver class to be used (see the sketch after this list).
- Give a better warning message for non-existent S3 buckets when attempting to read their bucket lifecycle configurations (#138 / #142).
- Minor documentation fixes: #119, #120, #123, #137.
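A sketch of pinning the driver explicitly when both drivers are on the classpath; the class name shown is Amazon's JDBC 4.1 Redshift driver and is given as an illustration, and all other values are placeholders:

```scala
// Explicitly select the Amazon Redshift driver rather than letting driver
// resolution fall through to the PostgreSQL driver on the same classpath.
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/dev?user=USER&password=PASS")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("jdbcdriver", "com.amazon.redshift.jdbc41.Driver")
  .load()
```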
Enhancements:
- Redshift queries are now cancelled when the thread issuing the query is interrupted (#116 / #117). If you cancel a Databricks notebook shell while it is executing a `spark-redshift` query, the Spark REPL will no longer crash due to interrupts being swallowed.
- When writing data back to Redshift, dates are now written in the default Redshift date format (`yyyy-MM-dd`) rather than a timestamp format (#122 / #130).
- `spark-redshift` now implements Spark 1.6's new `unhandledFilters` API, which allows Spark to eliminate a duplicate layer of filtering for filters that are pushed down to Redshift (#128).
v0.5.2
`spark-redshift` 0.5.2 is a maintenance release that contains a handful of important bugfixes. We recommend that all users upgrade to this release.
Bug Fixes:
- Fixed a thread-safety issue which could lead to errors or data corruption when processing date, timestamp, or decimal columns (#107 / #108).
- Fixed bugs related to handling of S3 credentials when they are specified as part of the `tempdir` URL (#109).
- Fixed a typo in the AWS credentials section of the README: the old text referred to `sc.hadoopConfig` instead of `sc.hadoopConfiguration` (#109).
Enhancements:
- Added a new `extracopyoptions` configuration, which allows advanced users to pass additional options to Redshift `COPY` commands (#35) (see the sketch after this list).
- Added an example of writing data back to Redshift using the SQL language API (#110).
- Added documentation on how to configure the SparkContext's global `hadoopConfiguration` from Python (#109).
- Added a tutorial (#101 and #106).
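A sketch of these options in Scala (the Scala equivalent of the Python `hadoopConfiguration` recipe referenced above); all keys, credentials, and connection values are placeholders, and `sc`, a `DataFrame` named `df`, and standard s3n Hadoop credential keys are assumed:

```scala
// Supply S3 credentials via the SparkContext's global Hadoop configuration
// (an alternative to embedding them in the tempdir URL).
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

// Write to Redshift, passing extra options through to the generated COPY command.
// TRUNCATECOLUMNS and MAXERROR are ordinary Redshift COPY parameters shown as examples.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/dev?user=USER&password=PASS")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("extracopyoptions", "TRUNCATECOLUMNS MAXERROR 10")
  .mode("append")
  .save()
```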
v0.5.1
`spark-redshift` 0.5.1 is a maintenance release which contains several bugfixes and usability improvements:
- Improved JDBC quoting and escaping:
  - Column names are now properly quoted when saving tables to Redshift, allowing reserved words or names containing special characters to be used as column names (#80 / #85).
  - Table names that are qualified with schemas (e.g. `myschema.mytable`) or which contain special characters (such as spaces) are now supported (#97 / #102).
- Improved dependency handling:
  - `spark-redshift` no longer has a binary dependency on the `hadoop-aws` artifact, which caused problems for EMR users (#92 / #94).
  - When using the Redshift JDBC driver, both the JDBC 4.0 and 4.1 versions of the driver can now be used without having to change the default `jdbcdriver` setting; the proper configuration will be automatically chosen depending on which version of the JDBC driver can be loaded (#83 / #90).
- Misc. bugfixes:
  - Fix a bug which prevented tables with empty partitions from being saved to Redshift (#96 / #102).
  - Fix spurious exceptions when checking the S3 bucket lifecycle configuration when `tempdir` points to the root of the bucket (#91 / #95).
  - Fixed a bug in `Utils.joinURLs` which caused problems for Windows users (#93).