This repository has been archived by the owner on Oct 5, 2021. It is now read-only.

Snowflake (#87)
snowflake support
CORE-687
joycelau1 authored Sep 6, 2019
1 parent 4b499e0 commit 5f44aaa
Showing 34 changed files with 486 additions and 74 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,10 @@
# Change Log
All notable changes to this project will be documented in this file using [Semantic Versioning](http://semver.org/).

## [0.4.0] - 2019-09-02
### Features
- Added support for Snowflake

## [0.3.0] - 2019-02-26
### Features
- [Scheduled Query Execution](https://github.com/lumoslabs/aleph/issues/42)
2 changes: 2 additions & 0 deletions Gemfile
@@ -36,7 +36,9 @@ gem 'resque-pool', '~> 0.5.0'
gem 'resque-web', '0.0.6', require: 'resque_web'
gem 'roar'
gem 'rollbar', '~> 2.3.0'
gem 'ruby-odbc'
gem 'sass-rails'
gem 'sequel', '~> 4.35'
gem 'sprockets-es6', '~> 0.8.0'
gem 'therubyracer'
gem 'thin'
4 changes: 4 additions & 0 deletions Gemfile.lock
@@ -338,6 +338,7 @@ GEM
rspec-mocks (~> 3.3.0)
rspec-support (~> 3.3.0)
rspec-support (3.3.0)
ruby-odbc (0.99999)
ruby-saml (1.0.0)
nokogiri (>= 1.5.10)
uuid (~> 2.3)
@@ -349,6 +350,7 @@ GEM
sprockets (>= 2.8, < 4.0)
sprockets-rails (>= 2.0, < 4.0)
tilt (>= 1.1, < 3)
sequel (4.49.0)
sham_rack (1.3.6)
rack
shoulda-matchers (3.0.0)
@@ -473,7 +475,9 @@ DEPENDENCIES
roar
rollbar (~> 2.3.0)
rspec-rails
ruby-odbc
sass-rails
sequel (~> 4.35)
sham_rack
shoulda-matchers
sprockets-es6 (~> 0.8.0)
64 changes: 55 additions & 9 deletions README.md
@@ -1,6 +1,6 @@

# Aleph
Aleph is a Redshift analytics platform that focuses on aggregating institutional data investigation techniques.
Aleph is a business analytics platform that focuses on ease-of-use and operational simplicity. It allows analysts to quickly author and iterate on queries, then share result sets and visualizations. Most components are modular, but it was designed to version-control queries (and analyze their differences) using GitHub and to store result sets long-term in Amazon S3.

![aleph](images/aleph_repo_banner.png)

@@ -13,24 +13,48 @@ Aleph is a Redshift analytics platform that focuses on aggregating institutional


## Quickstart
If you want to connect to your own Redshift cluster, the follow instructions should get you up and running.
If you want to connect to your own Redshift or Snowflake cluster, the following instructions should get you up and running.

### Database Configuration
Configure your Redshift or Snowflake database and user(s).

###### Additional requirements for Snowflake
* Snowflake users must be set up with a default warehouse and role; these are not configurable in Aleph.
* Since Aleph query results are unloaded directly from Snowflake to AWS S3, an S3 bucket is required for a Snowflake connection.
Configure an S3 bucket and create an external S3 stage in Snowflake, e.g.

create stage mydb.myschema.aleph_stage url='s3://<s3_bucket>/<path>/'
credentials=(aws_role = '<iam role>')

### Docker Install
The fastest way to get started: [Docker](https://docs.docker.com/mac/step_one/)

###### Configure your Redshift and run
* For Redshift, run

docker run -ti -p 3000:3000 lumos/aleph-playground /bin/bash -c "aleph setup_minimal -H {host} -D {db} -p {port} -U {user} -P {password}; redis-server & aleph run_demo"

* For Snowflake, run

docker run -ti -p 3000:3000 lumos/aleph-snowflake-playground /bin/bash -c "export AWS_ACCESS_KEY_ID=\"{aws_key_id}\" ; export AWS_SECRET_ACCESS_KEY=\"{aws_secret_key}\" ; cd /usr/bin/snowflake_odbc && sed -i 's/SF_ACCOUNT/{your_snowflake_account}/g' ./unixodbc_setup.sh && ./unixodbc_setup.sh && aleph setup_minimal -t snowflake -S snowflake -U {user} -P {password} -L {snowflake_unload_target} -R {s3_region} -B {s3_bucket} -F {s3_folder}; redis-server & aleph run_demo"

`snowflake_unload_target` is the external stage and location in Snowflake, e.g. `@mydb.myschema.aleph_stage/results/`

###### Open in browser

open http://$(docker-machine ip):3000

### Gem Install

###### For Redshift
You must be using [PostgreSQL 9.2beta3 or later client libraries](https://kkob.us/2014/12/20/homebrew-and-postgresql-9-4/)

###### For Snowflake
You must install `unixodbc-dev`, then set up and configure the [Snowflake ODBC driver](https://docs.snowflake.net/manuals/user-guide/odbc.html), e.g.

apt-get update && apt-get install -y unixodbc-dev
curl -o /tmp/snowflake_linux_x8664_odbc-2.19.8.tgz https://sfc-repo.snowflakecomputing.com/odbc/linux/latest/snowflake_linux_x8664_odbc-2.19.8.tgz && cd /tmp && gunzip snowflake_linux_x8664_odbc-2.19.8.tgz && tar -xvf snowflake_linux_x8664_odbc-2.19.8.tar && cp -r snowflake_odbc /usr/bin && rm -r /tmp/snowflake_odbc
cd /usr/bin/snowflake_odbc
./unixodbc_setup.sh # then follow the instructions to set up the Snowflake DSN

###### Install and run Redis

brew install redis && redis-server &
@@ -39,11 +63,22 @@ You must be using [PostgreSQL 9.2beta3 or later client libraries](https://kkob.u

gem install aleph_analytics

###### Configure your Redshift and run
###### Configure your database
See [Database Configuration](#database-configuration) above

###### Run Aleph
* For Redshift

aleph setup_minimal -H {host} -D {db} -p {port} -U {user} -P {password}
aleph run_demo
aleph setup_minimal -H {host} -D {db} -p {port} -U {user} -P {password}
aleph run_demo

* For Snowflake

export AWS_ACCESS_KEY_ID="{aws key id}"
export AWS_SECRET_ACCESS_KEY="{aws secret key}"
aleph setup_minimal -t snowflake -S snowflake -U {user} -P {password} -L {snowflake_unload_target} -R {s3_region} -B {s3_bucket} -F {s3_folder}
aleph run_demo

Aleph should be running at `localhost:3000`

## Aleph Gem
@@ -62,9 +97,15 @@ There are a number of ways to install and deploy Aleph. The simplest is to set u

FROM ruby:2.2.4

# we need postgres client libs
# we need postgres client libs for Redshift
RUN apt-get update && apt-get install -y postgresql-client --no-install-recommends && rm -rf /var/lib/apt/lists/*

# for Snowflake, install unix odbc and Snowflake ODBC driver and setup DSN
# replace {your snowflake account} below
RUN apt-get update && apt-get install -y unixodbc-dev
RUN curl -o /tmp/snowflake_linux_x8664_odbc-2.19.8.tgz https://sfc-repo.snowflakecomputing.com/odbc/linux/latest/snowflake_linux_x8664_odbc-2.19.8.tgz && cd /tmp && gunzip snowflake_linux_x8664_odbc-2.19.8.tgz && tar -xvf snowflake_linux_x8664_odbc-2.19.8.tar && cp -r snowflake_odbc /usr/bin && rm -r /tmp/snowflake_odbc
RUN cd /usr/bin/snowflake_odbc && sed -i 's/SF_ACCOUNT/{your snowflake account}/g' ./unixodbc_setup.sh && ./unixodbc_setup.sh

# make a log location
RUN mkdir -p /var/log/aleph
ENV SERVER_LOG_ROOT /var/log/aleph
@@ -91,10 +132,15 @@ You can then deploy and run the main components of Aleph as separate services us

At runtime, you can inject all the secrets as environment variables.

We *highly* recommend that you have a git repo for your queries and s3 location for you results.
S3 is required for Snowflake.

We *highly* recommend that you have a git repo for your queries and an S3 location for your results.

Advanced setup and configuration details (including how to use Aleph roles for data access, using different auth providers, creating users, and more) can be found [here](docs/ADVANCED_CONFIGURATION.md).

## Limitations
The default maximum result size for Snowflake queries is 5 GB. This is due to the MAX_FILE_SIZE limit of the [Snowflake COPY command](https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-location.html#copy-options-copyoptions). If Snowflake changes this limit, update the setting in [snowflake.yml](docs/ADVANCED_CONFIGURATION.md#snowflake).

## Contribute
Aleph is Rails on the backend, Angular on the front end. It uses Resque workers to run queries against Redshift. Here are few things you should have before developing:

@@ -103,7 +149,7 @@ Aleph is Rails on the backend, Angular on the front end. It uses Resque workers
* Git Repo (for query versions)
* S3 Location (store results)

While the demo/playground version does not use a git repo or S3, we *highly* recommend that you use them in general.
While the demo/playground version does not use a git repo and S3 is optional for Redshift, we *highly* recommend that you use them in general.

### Setup
*Postgres*
6 changes: 3 additions & 3 deletions aleph.gemspec
@@ -1,8 +1,8 @@
Gem::Specification.new do |s|
s.name = 'aleph_analytics'
s.version = '0.3.0'
s.date = '2019-02-26'
s.summary = 'Redshift analytics platform'
s.version = '0.4.0'
s.date = '2019-09-02'
s.summary = 'Redshift/Snowflake analytics platform'
s.description = 'The best way to develop and share queries/investigations/results within an analytics team'
s.authors = ['Andrew Xue', 'Rob Froetscher']
s.email = '[email protected]'
56 changes: 49 additions & 7 deletions app/models/query_execution.rb
@@ -1,6 +1,17 @@
class QueryExecution
@queue = :query_exec
NUM_SAMPLE_ROWS = 100
SNOWFLAKE_UNLOAD_SQL = <<-EOF
COPY INTO %{location} FROM (
%{query}
)
FILE_FORMAT = (TYPE = 'csv' FIELD_DELIMITER = ',' RECORD_DELIMITER = '\\n' FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ('') COMPRESSION = NONE)
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE
MAX_FILE_SIZE = %{max_file_size}
EOF

def self.perform(result_id, role)
result = Result.find(result_id)
@@ -14,8 +25,24 @@ def self.perform(result_id, role)
result.mark_running!
sample_callback = ->(sample) { result.mark_processing_from_sample(sample) }

connection = RedshiftConnectionPool.instance.get(role)
connection = AnalyticDBConnectionPool.instance.get(role)
if connection.is_a? RedshiftPG::Connection
query_redshift(connection, body, result, sample_callback, csv_service)
else
query_snowflake(connection, body, result, sample_callback)
end

rescue => e
if result && csv_service
csv_service.clear_tmp_file
result.mark_failed!(e.message)
end
raise
end

private

def self.query_redshift(connection, body, result, sample_callback, csv_service)
connection.reconnect_on_failure do
query_stream = PgStream::Stream.new(connection.pg_connection, body)
result.headers = query_stream.headers
@@ -24,21 +51,36 @@ def self.perform(result_id, role)
rrrc = result.redis_result_row_count

stream_processor = PgStream::Processor.new(query_stream)
stream_processor.register(ResultCsvGenerator.new(result_id, result.headers).callbacks)
stream_processor.register(ResultCsvGenerator.new(result.id, result.headers).callbacks)
stream_processor.register(SampleSkimmer.new(NUM_SAMPLE_ROWS, &sample_callback).callbacks)
stream_processor.register(CountPublisher.new(rrrc).callbacks)

row_count = stream_processor.execute
result.mark_complete_with_count(row_count)
end

rescue *RedshiftPG::USER_ERROR_CLASSES => e
csv_service.clear_tmp_file
result.mark_failed!(e.message)
rescue => e
if result && csv_service
csv_service.clear_tmp_file
result.mark_failed!(e.message)
end

def self.query_snowflake(connection, body, result, sample_callback)
# unload the query result from snowflake directly into s3
# then read in the first 100 rows from the file as sample rows
# Note: snowflake unload currently has a max file size of 5 GB.
connection.reconnect_on_failure do
location = File.join(connection.unload_target, result.current_result_filename)
sql = SNOWFLAKE_UNLOAD_SQL % {location: location, query: body, max_file_size: connection.max_file_size}
row = connection.connection.fetch(sql).first
row_count = row[:rows_unloaded]

headers, samples = CsvSerializer.load_from_s3_file(result.current_result_s3_key, NUM_SAMPLE_ROWS)

result.headers = headers
result.save!

sample_callback.call(samples)
result.mark_complete_with_count(row_count)
end
raise
end
end
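
As a rough illustration of the unload flow added above, the sketch below fills in the `SNOWFLAKE_UNLOAD_SQL` template the same way `query_snowflake` does. The stage path, file name, query, and 5 GB ceiling are hypothetical example values, not part of this commit.

    # Hypothetical values; in the commit these come from connection.unload_target,
    # result.current_result_filename, the query body, and connection.max_file_size.
    location      = File.join('@mydb.myschema.aleph_stage/results/', 'result_42.csv')
    query         = 'SELECT game, score FROM plays LIMIT 1000'
    max_file_size = 5 * 1024**3 # ~5 GB, the COPY INTO ceiling noted in the README

    sql = QueryExecution::SNOWFLAKE_UNLOAD_SQL % {
      location: location,
      query: query,
      max_file_size: max_file_size
    }

    # In the commit, connection.connection.fetch(sql).first returns a row hash whose
    # :rows_unloaded value is used as the result's row count.
    puts sql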
12 changes: 8 additions & 4 deletions app/models/result.rb
@@ -53,6 +53,14 @@ def copy_latest_result
end
end

def current_result_filename
@result_filename ||= CsvHelper::Base.new(id).filename
end

def current_result_s3_key
@result_key ||= CsvHelper::Aws.new(id).key
end

private

def duration(start_field, end_field)
@@ -67,8 +75,4 @@ def duration(start_field, end_field)
0
end
end

def current_result_s3_key
@result_key ||= CsvHelper::Aws.new(id).key
end
end
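
For context, here is a minimal sketch of how these two new helpers feed the Snowflake path in `query_execution.rb`. The result id and stage path are placeholders, and the exact filename format from `CsvHelper` is an assumption, since its internals are not shown in this diff.

    # Hypothetical usage; values are illustrative only.
    result   = Result.find(42)
    filename = result.current_result_filename        # assumed to be something like "result_42.csv"
    location = File.join('@mydb.myschema.aleph_stage/results/', filename)

    # After COPY INTO unloads the result to S3, the first 100 rows are read back as samples.
    headers, samples = CsvSerializer.load_from_s3_file(result.current_result_s3_key, 100)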
40 changes: 36 additions & 4 deletions bin/aleph
@@ -33,6 +33,10 @@ class CommandParser
@options[:worker_processes] = w
end

opts.on('-t', '--db_type DB_TYPE', 'redshift or snowflake. Default is redshift') do |t|
@options[:db_type] = t
end

opts.on('-H', '--redshift-host REDSHIFT_HOST', 'Redshift Hostname') do |rhost|
@options[:redshift_host] = rhost
end
@@ -45,12 +49,40 @@ class CommandParser
@options[:redshift_port] = rport
end

opts.on('-U', '--redshift-user REDSHIFT_USER', 'Redshift User') do |ruser|
@options[:redshift_user] = ruser
opts.on('-U', '--db-user DB_USER', 'Redshift or Snowflake User') do |dbuser|
@options[:db_user] = dbuser
end

opts.on('--redshift-user DB_USER', 'Same as --db-user (for backward compatibility)') do |dbuser|
@options[:db_user] = dbuser
end

opts.on('-P', '--db-password DB_PASSWORD', 'Redshift or Snowflake Password') do |dbpw|
@options[:db_password] = dbpw
end

opts.on('--redshift-password DB_PASSWORD', 'Same as --db-password (for backward compatibility)') do |dbpw|
@options[:db_password] = dbpw
end

opts.on('-S', '--dsn ODBC_DSN', 'Snowflake ODBC DSN') do |dsn|
@options[:dsn] = dsn
end

opts.on('-L', '--snowflake_unload_target STAGE_OR_LOCATION', 'Snowflake only. Stage or location where result files are unloaded') do |target|
@options[:snowflake_unload_target] = target
end

opts.on('-R', '--s3_region S3_REGION', 'S3 region') do |region|
@options[:s3_region] = region
end

opts.on('-B', '--s3_bucket S3_BUCKET', 'S3 bucket to which result files are unloaded. Required for Snowflake DB') do |bucket|
@options[:s3_bucket] = bucket
end

opts.on('-P', '--redshift-password REDSHIFT_PASSWORD', 'Redshift Password') do |rpw|
@options[:redshift_password] = rpw
opts.on('-F', '--s3_folder S3_FOLDER', 'S3 folder where result files are unloaded') do |folder|
@options[:s3_folder] = folder
end

opts.on('-h', '--help', 'Halp!') do |h|
11 changes: 10 additions & 1 deletion bin/executables/lib/config_generator.rb
@@ -22,7 +22,16 @@ def write_redshift(host, database, port, user, password)
write_yaml('redshift.yml', redshift_properties, environments: [@rails_env.to_sym])
@env_writer.merge(admin_redshift_username: user, admin_redshift_password: password)
end


def write_snowflake(dsn, user, password, unload_target)
snowflake_properties = {
'dsn' => dsn,
'unload_target' => unload_target
}
write_yaml('snowflake.yml', snowflake_properties, environments: [@rails_env.to_sym])
@env_writer.merge(admin_snowflake_username: user, admin_snowflake_password: password)
end

def write_envs!
@env_writer.write!
end
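
A minimal usage sketch of the new generator method, assuming a `ConfigGenerator` instance has already been built by the setup task (its constructor is not part of this diff); the DSN, user, and stage values are placeholders.

    # Hypothetical invocation; arguments mirror those parsed by bin/aleph above.
    generator.write_snowflake(
      'aleph_snowflake',                      # ODBC DSN created by unixodbc_setup.sh
      'aleph_user',                           # Snowflake user with a default warehouse and role
      ENV.fetch('SNOWFLAKE_PASSWORD'),        # kept out of snowflake.yml; merged into env vars instead
      '@mydb.myschema.aleph_stage/results/'   # unload_target: external stage + path
    )
    generator.write_envs!                     # persists admin_snowflake_username/password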