This repository has been archived by the owner on Oct 5, 2021. It is now read-only.

Snowflake (#87)
snowflake support
CORE-687
joycelau1 authored Sep 6, 2019
1 parent 4b499e0 commit 5f44aaa
Showing 34 changed files with 486 additions and 74 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,10 @@
# Change Log
All notable changes to this project will be documented in this file using [Semantic Versioning](http://semver.org/).

## [0.4.0] - 2019-09-02
### Features
- Added support for Snowflake

## [0.3.0] - 2019-02-26
### Features
- [Scheduled Query Execution](https://github.com/lumoslabs/aleph/issues/42)
2 changes: 2 additions & 0 deletions Gemfile
@@ -36,7 +36,9 @@ gem 'resque-pool', '~> 0.5.0'
gem 'resque-web', '0.0.6', require: 'resque_web'
gem 'roar'
gem 'rollbar', '~> 2.3.0'
gem 'ruby-odbc'
gem 'sass-rails'
gem 'sequel', '~> 4.35'
gem 'sprockets-es6', '~> 0.8.0'
gem 'therubyracer'
gem 'thin'
4 changes: 4 additions & 0 deletions Gemfile.lock
@@ -338,6 +338,7 @@ GEM
rspec-mocks (~> 3.3.0)
rspec-support (~> 3.3.0)
rspec-support (3.3.0)
ruby-odbc (0.99999)
ruby-saml (1.0.0)
nokogiri (>= 1.5.10)
uuid (~> 2.3)
@@ -349,6 +350,7 @@ GEM
sprockets (>= 2.8, < 4.0)
sprockets-rails (>= 2.0, < 4.0)
tilt (>= 1.1, < 3)
sequel (4.49.0)
sham_rack (1.3.6)
rack
shoulda-matchers (3.0.0)
@@ -473,7 +475,9 @@ DEPENDENCIES
roar
rollbar (~> 2.3.0)
rspec-rails
ruby-odbc
sass-rails
sequel (~> 4.35)
sham_rack
shoulda-matchers
sprockets-es6 (~> 0.8.0)
64 changes: 55 additions & 9 deletions README.md
@@ -1,6 +1,6 @@

# Aleph
Aleph is a Redshift analytics platform that focuses on aggregating institutional data investigation techniques.
Aleph is a business analytics platform that focuses on ease-of-use and operational simplicity. It allows analysts to quickly author and iterate on queries, then share result sets and visualizations. Most components are modular, but it was designed to version-control queries (and analyze their differences) using GitHub and to store result sets long-term in Amazon S3.

![aleph](images/aleph_repo_banner.png)

@@ -13,24 +13,48 @@ Aleph is a Redshift analytics platform that focuses on aggregating institutional


## Quickstart
If you want to connect to your own Redshift cluster, the follow instructions should get you up and running.
If you want to connect to your own Redshift or Snowflake cluster, the following instructions should get you up and running.

### Database Configuration
Configure your Redshift or Snowflake database and user(s).

###### Additional requirements for Snowflake
* Snowflake users must be set up with a default warehouse and role; these are not configurable in Aleph.
* Since Aleph query results are unloaded directly from Snowflake to AWS S3, an S3 bucket is required for a Snowflake connection.
Configure an S3 bucket and create an external S3 stage in Snowflake, e.g.

create stage mydb.myschema.aleph_stage url='s3://<s3_bucket>/<path>/'
credentials=(aws_role = '<iam role>')

### Docker Install
The fastest way to get started: [Docker](https://docs.docker.com/mac/step_one/)

###### Configure your Redshift and run
* For Redshift, run

docker run -ti -p 3000:3000 lumos/aleph-playground /bin/bash -c "aleph setup_minimal -H {host} -D {db} -p {port} -U {user} -P {password}; redis-server & aleph run_demo"

* For Snowflake, run

docker run -ti -p 3000:3000 lumos/aleph-snowflake-playground /bin/bash -c "export AWS_ACCESS_KEY_ID=\"{aws_key_id}\" ; export AWS_SECRET_ACCESS_KEY=\"{aws_secret_key}\" ; cd /usr/bin/snowflake_odbc && sed -i 's/SF_ACCOUNT/{your_snowflake_account}/g' ./unixodbc_setup.sh && ./unixodbc_setup.sh && aleph setup_minimal -t snowflake -S snowflake -U {user} -P {password} -L {snowflake_unload_target} -R {s3_region} -B {s3_bucket} -F {s3_folder}; redis-server & aleph run_demo"

`snowflake_unload_target` is the external stage and location in Snowflake, e.g. `@mydb.myschema.aleph_stage/results/`

###### Open in browser

open http://$(docker-machine ip):3000

### Gem Install

###### For Redshift
You must be using [PostgreSQL 9.2beta3 or later client libraries](https://kkob.us/2014/12/20/homebrew-and-postgresql-9-4/)

###### For Snowflake
You must install `unixodbc-dev`, then set up and configure the [Snowflake ODBC driver](https://docs.snowflake.net/manuals/user-guide/odbc.html), e.g.

apt-get update && apt-get install -y unixodbc-dev
curl -o /tmp/snowflake_linux_x8664_odbc-2.19.8.tgz https://sfc-repo.snowflakecomputing.com/odbc/linux/latest/snowflake_linux_x8664_odbc-2.19.8.tgz && cd /tmp && gunzip snowflake_linux_x8664_odbc-2.19.8.tgz && tar -xvf snowflake_linux_x8664_odbc-2.19.8.tar && cp -r snowflake_odbc /usr/bin && rm -r /tmp/snowflake_odbc
cd /usr/bin/snowflake_odbc
./unixodbc_setup.sh # then follow the instructions to set up the Snowflake DSN

###### Install and run Redis

brew install redis && redis-server &
@@ -39,11 +63,22 @@ You must be using [PostgreSQL 9.2beta3 or later client libraries](https://kkob.u

gem install aleph_analytics

###### Configure your Redshift and run
###### Configure your database
See [Database Configuration](#database-configuration) above

###### Run Aleph
* For Redshift

aleph setup_minimal -H {host} -D {db} -p {port} -U {user} -P {password}
aleph run_demo
aleph setup_minimal -H {host} -D {db} -p {port} -U {user} -P {password}
aleph run_demo

* For Snowflake

export AWS_ACCESS_KEY_ID="{aws key id}"
export AWS_SECRET_ACCESS_KEY="{aws secret key}"
aleph setup_minimal -t snowflake -S snowflake -U {user} -P {password} -L {snowflake_unload_target} -R {s3_region} -B {s3_bucket} -F {s3_folder}
aleph run_demo

Aleph should be running at `localhost:3000`

## Aleph Gem
@@ -62,9 +97,15 @@ There are a number of ways to install and deploy Aleph. The simplest is to set u

FROM ruby:2.2.4

# we need postgres client libs
# we need postgres client libs for Redshift
RUN apt-get update && apt-get install -y postgresql-client --no-install-recommends && rm -rf /var/lib/apt/lists/*

# for Snowflake, install unix odbc and Snowflake ODBC driver and setup DSN
# replace {your snowflake account} below
RUN apt-get update && apt-get install -y unixodbc-dev
RUN curl -o /tmp/snowflake_linux_x8664_odbc-2.19.8.tgz https://sfc-repo.snowflakecomputing.com/odbc/linux/latest/snowflake_linux_x8664_odbc-2.19.8.tgz && cd /tmp && gunzip snowflake_linux_x8664_odbc-2.19.8.tgz && tar -xvf snowflake_linux_x8664_odbc-2.19.8.tar && cp -r snowflake_odbc /usr/bin && rm -r /tmp/snowflake_odbc
RUN cd /usr/bin/snowflake_odbc && sed -i 's/SF_ACCOUNT/{your snowflake account}/g' ./unixodbc_setup.sh && ./unixodbc_setup.sh

# make a log location
RUN mkdir -p /var/log/aleph
ENV SERVER_LOG_ROOT /var/log/aleph
@@ -91,10 +132,15 @@ You can then deploy and run the main components of Aleph as separate services us

At runtime, you can inject all the secrets as environment variables.

We *highly* recommend that you have a git repo for your queries and s3 location for you results.
S3 is required for Snowflake.

We *highly* recommend that you have a git repo for your queries and an S3 location for your results.

Advanced setup and configuration details (including how to use Aleph roles for data access, using different auth providers, creating users, and more) can be found [here](docs/ADVANCED_CONFIGURATION.md).

## Limitations
The default maximum result size for Snowflake queries is 5 GB. This is due to the MAX_FILE_SIZE limit of the [Snowflake COPY command](https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-location.html#copy-options-copyoptions). If Snowflake changes this limit, update the setting in [snowflake.yml](docs/ADVANCED_CONFIGURATION.md#snowflake).

## Contribute
Aleph is Rails on the backend, Angular on the front end. It uses Resque workers to run queries against Redshift. Here are few things you should have before developing:

@@ -103,7 +149,7 @@ Aleph is Rails on the backend, Angular on the front end. It uses Resque workers
* Git Repo (for query versions)
* S3 Location (store results)

While the demo/playground version does not use a git repo or S3, we *highly* recommend that you use them in general.
While the demo/playground version does not use a git repo and S3 is optional for Redshift, we *highly* recommend that you use them in general.

### Setup
*Postgres*
6 changes: 3 additions & 3 deletions aleph.gemspec
@@ -1,8 +1,8 @@
Gem::Specification.new do |s|
s.name = 'aleph_analytics'
s.version = '0.3.0'
s.date = '2019-02-26'
s.summary = 'Redshift analytics platform'
s.version = '0.4.0'
s.date = '2019-09-02'
s.summary = 'Redshift/Snowflake analytics platform'
s.description = 'The best way to develop and share queries/investigations/results within an analytics team'
s.authors = ['Andrew Xue', 'Rob Froetscher']
s.email = '[email protected]'
56 changes: 49 additions & 7 deletions app/models/query_execution.rb
@@ -1,6 +1,17 @@
class QueryExecution
@queue = :query_exec
NUM_SAMPLE_ROWS = 100
SNOWFLAKE_UNLOAD_SQL = <<-EOF
COPY INTO %{location} FROM (
%{query}
)
FILE_FORMAT = (TYPE = 'csv' FIELD_DELIMITER = ',' RECORD_DELIMITER = '\\n' FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ('') COMPRESSION = NONE)
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE
MAX_FILE_SIZE = %{max_file_size}
EOF

def self.perform(result_id, role)
result = Result.find(result_id)
@@ -14,8 +25,24 @@ def self.perform(result_id, role)
result.mark_running!
sample_callback = ->(sample) { result.mark_processing_from_sample(sample) }

connection = RedshiftConnectionPool.instance.get(role)
connection = AnalyticDBConnectionPool.instance.get(role)
if connection.is_a? RedshiftPG::Connection
query_redshift(connection, body, result, sample_callback, csv_service)
else
query_snowflake(connection, body, result, sample_callback)
end

rescue => e
if result && csv_service
csv_service.clear_tmp_file
result.mark_failed!(e.message)
end
raise
end

private

def self.query_redshift(connection, body, result, sample_callback, csv_service)
connection.reconnect_on_failure do
query_stream = PgStream::Stream.new(connection.pg_connection, body)
result.headers = query_stream.headers
@@ -24,21 +51,36 @@ def self.perform(result_id, role)
rrrc = result.redis_result_row_count

stream_processor = PgStream::Processor.new(query_stream)
stream_processor.register(ResultCsvGenerator.new(result_id, result.headers).callbacks)
stream_processor.register(ResultCsvGenerator.new(result.id, result.headers).callbacks)
stream_processor.register(SampleSkimmer.new(NUM_SAMPLE_ROWS, &sample_callback).callbacks)
stream_processor.register(CountPublisher.new(rrrc).callbacks)

row_count = stream_processor.execute
result.mark_complete_with_count(row_count)
end

rescue *RedshiftPG::USER_ERROR_CLASSES => e
csv_service.clear_tmp_file
result.mark_failed!(e.message)
rescue => e
if result && csv_service
csv_service.clear_tmp_file
result.mark_failed!(e.message)
end

def self.query_snowflake(connection, body, result, sample_callback)
# unload the query result from snowflake directly into s3
# then read in the first 100 rows from the file as sample rows
# Note: snowflake unload currently has a max file size of 5 GB.
connection.reconnect_on_failure do
location = File.join(connection.unload_target, result.current_result_filename)
sql = SNOWFLAKE_UNLOAD_SQL % {location: location, query: body, max_file_size: connection.max_file_size}
row = connection.connection.fetch(sql).first
row_count = row[:rows_unloaded]

headers, samples = CsvSerializer.load_from_s3_file(result.current_result_s3_key, NUM_SAMPLE_ROWS)

result.headers = headers
result.save!

sample_callback.call(samples)
result.mark_complete_with_count(row_count)
end
raise
end
end
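
As a rough illustration of the unload flow added above, the sketch below fills in the `SNOWFLAKE_UNLOAD_SQL` template the same way `query_snowflake` does. The stage path, file name, query, and 5 GB ceiling are hypothetical example values, not part of this commit.

    # Hypothetical values; in the commit these come from connection.unload_target,
    # result.current_result_filename, the query body, and connection.max_file_size.
    location      = File.join('@mydb.myschema.aleph_stage/results/', 'result_42.csv')
    query         = 'SELECT game, score FROM plays LIMIT 1000'
    max_file_size = 5 * 1024**3 # ~5 GB, the COPY INTO ceiling noted in the README

    sql = QueryExecution::SNOWFLAKE_UNLOAD_SQL % {
      location: location,
      query: query,
      max_file_size: max_file_size
    }

    # In the commit, connection.connection.fetch(sql).first returns a row hash whose
    # :rows_unloaded value is used as the result's row count.
    puts sql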
12 changes: 8 additions & 4 deletions app/models/result.rb
@@ -53,6 +53,14 @@ def copy_latest_result
end
end

def current_result_filename
@result_filename ||= CsvHelper::Base.new(id).filename
end

def current_result_s3_key
@result_key ||= CsvHelper::Aws.new(id).key
end

private

def duration(start_field, end_field)
@@ -67,8 +75,4 @@ def duration(start_field, end_field)
0
end
end

def current_result_s3_key
@result_key ||= CsvHelper::Aws.new(id).key
end
end
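
For context, here is a minimal sketch of how these two new helpers feed the Snowflake path in `query_execution.rb`. The result id and stage path are placeholders, and the exact filename format from `CsvHelper` is an assumption, since its internals are not shown in this diff.

    # Hypothetical usage; values are illustrative only.
    result   = Result.find(42)
    filename = result.current_result_filename        # assumed to be something like "result_42.csv"
    location = File.join('@mydb.myschema.aleph_stage/results/', filename)

    # After COPY INTO unloads the result to S3, the first 100 rows are read back as samples.
    headers, samples = CsvSerializer.load_from_s3_file(result.current_result_s3_key, 100)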
40 changes: 36 additions & 4 deletions bin/aleph
@@ -33,6 +33,10 @@ class CommandParser
@options[:worker_processes] = w
end

opts.on('-t', '--db_type DB_TYPE', 'redshift or snowflake. Default is redshift') do |t|
@options[:db_type] = t
end

opts.on('-H', '--redshift-host REDSHIFT_HOST', 'Redshift Hostname') do |rhost|
@options[:redshift_host] = rhost
end
@@ -45,12 +49,40 @@ class CommandParser
@options[:redshift_port] = rport
end

opts.on('-U', '--redshift-user REDSHIFT_USER', 'Redshift User') do |ruser|
@options[:redshift_user] = ruser
opts.on('-U', '--db-user DB_USER', 'Redshift or Snowflake User') do |dbuser|
@options[:db_user] = dbuser
end

opts.on('--redshift-user DB_USER', 'Same as --db-user (for backward compatibility)') do |dbuser|
@options[:db_user] = dbuser
end

opts.on('-P', '--db-password DB_PASSWORD', 'Redshift or Snowflake Password') do |dbpw|
@options[:db_password] = dbpw
end

opts.on('--redshift-password DB_PASSWORD', 'Same as --db-password (for backward compatibility)') do |dbpw|
@options[:db_password] = dbpw
end

opts.on('-S', '--dsn ODBC_DSN', 'Snowflake ODBC DSN') do |dsn|
@options[:dsn] = dsn
end

opts.on('-L', '--snowflake_unload_target STAGE_OR_LOCATION', 'Snowflake only. Stage or location where result files are unloaded') do |target|
@options[:snowflake_unload_target] = target
end

opts.on('-R', '--s3_region S3_REGION', 'S3 region') do |region|
@options[:s3_region] = region
end

opts.on('-B', '--s3_bucket S3_BUCKET', 'S3 bucket to which result files are unloaded. Required for Snowflake DB') do |bucket|
@options[:s3_bucket] = bucket
end

opts.on('-P', '--redshift-password REDSHIFT_PASSWORD', 'Redshift Password') do |rpw|
@options[:redshift_password] = rpw
opts.on('-F', '--s3_folder S3_FOLDER', 'S3 folder where result files are unloaded') do |folder|
@options[:s3_folder] = folder
end

opts.on('-h', '--help', 'Halp!') do |h|
11 changes: 10 additions & 1 deletion bin/executables/lib/config_generator.rb
@@ -22,7 +22,16 @@ def write_redshift(host, database, port, user, password)
write_yaml('redshift.yml', redshift_properties, environments: [@rails_env.to_sym])
@env_writer.merge(admin_redshift_username: user, admin_redshift_password: password)
end


def write_snowflake(dsn, user, password, unload_target)
snowflake_properties = {
'dsn' => dsn,
'unload_target' => unload_target
}
write_yaml('snowflake.yml', snowflake_properties, environments: [@rails_env.to_sym])
@env_writer.merge(admin_snowflake_username: user, admin_snowflake_password: password)
end

def write_envs!
@env_writer.write!
end
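
A minimal usage sketch of the new generator method, assuming a `ConfigGenerator` instance has already been built by the setup task (its constructor is not part of this diff); the DSN, user, and stage values are placeholders.

    # Hypothetical invocation; arguments mirror those parsed by bin/aleph above.
    generator.write_snowflake(
      'aleph_snowflake',                      # ODBC DSN created by unixodbc_setup.sh
      'aleph_user',                           # Snowflake user with a default warehouse and role
      ENV.fetch('SNOWFLAKE_PASSWORD'),        # kept out of snowflake.yml; merged into env vars instead
      '@mydb.myschema.aleph_stage/results/'   # unload_target: external stage + path
    )
    generator.write_envs!                     # persists admin_snowflake_username/password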