Adds support for COPY TO/FROM Azure Blob Storage
Supports the following Azure Blob URI forms:
- `az://{container}/key`
- `azure://{container}/key`
- `https://{account}.blob.core.windows.net/{container}/key`
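
As a rough illustration of how the three forms map onto a container and a key, the sketch below splits a URI with plain shell parameter expansion. `parse_azure_uri` is a hypothetical helper written for this note; it is not the parsing code used by the extension, which may accept or reject inputs differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of pg_parquet): print "<container> <key>"
# for the three URI forms listed above, and fail for anything else.
parse_azure_uri() {
    local uri="$1" rest host
    case "$uri" in
        az://*)    rest="${uri#az://}" ;;
        azure://*) rest="${uri#azure://}" ;;
        https://*/*)
            rest="${uri#https://}"
            host="${rest%%/*}"
            # Only accept hosts of the form {account}.blob.core.windows.net.
            case "$host" in
                ?*.blob.core.windows.net) rest="${rest#*/}" ;;
                *) return 1 ;;
            esac ;;
        *) return 1 ;;
    esac
    local container="${rest%%/*}" key="${rest#*/}"
    # Reject input with no "/" separator, an empty container, or an empty key.
    [ "$rest" != "$container" ] && [ -n "$container" ] && [ -n "$key" ] || return 1
    printf '%s %s\n' "$container" "$key"
}
```

For example, `parse_azure_uri 'az://testcontainer/data/file.parquet'` prints `testcontainer data/file.parquet`. Note that the `az://` and `azure://` forms carry no account name, so the account has to come from the configuration described next.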

**Configuration**

The simplest way to configure object storage is by creating the standard [`~/.azure/config`](https://learn.microsoft.com/en-us/cli/azure/azure-cli-configuration?view=azure-cli-latest) file:

```bash
$ cat ~/.azure/config
[storage]
account = devstoreaccount1
key = Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==
```
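
The account and key shown above are the defaults of the Azurite emulator that this commit's tests run against; a real storage key is a secret. The sketch below writes a config file with the same `[storage]` layout and owner-only permissions. `write_azure_config` is a hypothetical helper, and the restrictive `umask` is a suggestion of this note rather than something the extension requires.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write an Azure CLI style config with a [storage] section.
# Usage: write_azure_config <config-path> <account> <key>
write_azure_config() {
    local path="$1" account="$2" key="$3"
    mkdir -p "$(dirname "$path")"
    # The subshell keeps the umask change local. A newly created file gets
    # owner-only permissions; an existing file keeps the mode it already had.
    (umask 077 && printf '[storage]\naccount = %s\nkey = %s\n' "$account" "$key" > "$path")
}
```

A call such as `write_azure_config "$HOME/.azure/config" devstoreaccount1 "$KEY"` would produce the file shown above, with `KEY` standing in for the storage key.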

Alternatively, you can use the following environment variables when starting postgres to configure the Azure Blob Storage client:
- `AZURE_STORAGE_ACCOUNT`: the storage account name of the Azure Blob
- `AZURE_STORAGE_KEY`: the storage key of the Azure Blob
- `AZURE_STORAGE_SAS_TOKEN`: the storage SAS token for the Azure Blob
- `AZURE_CONFIG_FILE`: an alternative location for the config file
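
Because these variables are read when postgres starts, they have to be exported in the environment that launches the server, not in a client session. The sketch below is a small pre-flight check that reports which of the four variables are exported without printing their values. `azure_env_report` is a hypothetical helper, not something shipped with the extension.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which of the four variables are exported in the
# current environment, without echoing their (possibly secret) values.
azure_env_report() {
    local name
    for name in AZURE_STORAGE_ACCOUNT AZURE_STORAGE_KEY \
                AZURE_STORAGE_SAS_TOKEN AZURE_CONFIG_FILE; do
        # printenv only sees exported variables, which is what a child
        # process such as the postgres server would inherit.
        if [ -n "$(printenv "$name")" ]; then
            printf '%s=set\n' "$name"
        else
            printf '%s=unset\n' "$name"
        fi
    done
}
```

Running `azure_env_report` in the shell that is about to start the server shows at a glance whether the expected credentials will be inherited.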

**Bonus**

Additionally, this PR supports the following S3 URI forms:
- `s3://{bucket}/key`
- `s3a://{bucket}/key`
- `https://s3.amazonaws.com/{bucket}/key`
- `https://{bucket}.s3.amazonaws.com/key`
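
The two `https` forms differ in where the bucket lives: in the first path segment (path style) or in the host name (virtual-hosted style). The sketch below normalizes all four forms to a bucket and a key. `parse_s3_uri` is a hypothetical helper for illustration only and does not reflect how the extension parses URIs internally.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of pg_parquet): print "<bucket> <key>" for the
# four S3 URI forms listed above, and fail for anything else.
parse_s3_uri() {
    local uri="$1" rest host
    case "$uri" in
        s3://*)  rest="${uri#s3://}" ;;
        s3a://*) rest="${uri#s3a://}" ;;
        https://*/*)
            rest="${uri#https://}"
            host="${rest%%/*}"
            rest="${rest#*/}"
            case "$host" in
                # Path style: the bucket is already the first path segment.
                s3.amazonaws.com) ;;
                # Virtual-hosted style: move the bucket from the host to the path.
                ?*.s3.amazonaws.com) rest="${host%.s3.amazonaws.com}/$rest" ;;
                *) return 1 ;;
            esac ;;
        *) return 1 ;;
    esac
    local bucket="${rest%%/*}" key="${rest#*/}"
    # Reject input with no "/" separator, an empty bucket, or an empty key.
    [ "$rest" != "$bucket" ] && [ -n "$bucket" ] && [ -n "$key" ] || return 1
    printf '%s %s\n' "$bucket" "$key"
}
```

Both `parse_s3_uri 'https://s3.amazonaws.com/testbucket/a.parquet'` and `parse_s3_uri 'https://testbucket.s3.amazonaws.com/a.parquet'` print `testbucket a.parquet`.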

Closes #50
aykut-bozkurt committed Oct 26, 2024
1 parent 78fc489 commit 2feb683
Showing 12 changed files with 560 additions and 104 deletions.
25 changes: 13 additions & 12 deletions .devcontainer/Dockerfile
@@ -6,10 +6,11 @@ ENV TZ="Europe/Istanbul"
ARG PG_MAJOR=17

# install deps
RUN apt-get update && apt-get -y install build-essential libreadline-dev zlib1g-dev \
flex bison libxml2-dev libxslt-dev libssl-dev \
libxml2-utils xsltproc ccache pkg-config wget \
curl lsb-release sudo nano net-tools git awscli
RUN apt-get update && apt-get -y install build-essential libreadline-dev zlib1g-dev \
flex bison libxml2-dev libxslt-dev libssl-dev \
libxml2-utils xsltproc ccache pkg-config wget \
curl lsb-release ca-certificates gnupg sudo git \
nano net-tools awscli

# install Postgres
RUN sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
@@ -19,6 +20,14 @@ RUN apt-get update && apt-get -y install postgresql-${PG_MAJOR}-postgis-3 \
postgresql-client-${PG_MAJOR} \
libpq-dev

# install azure-cli and azurite
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
RUN apt-get update && apt-get install -y nodejs
RUN curl -sL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor | tee /etc/apt/trusted.gpg.d/microsoft.gpg > /dev/null
RUN echo "deb [arch=`dpkg --print-architecture` signed-by=/etc/apt/trusted.gpg.d/microsoft.gpg] https://packages.microsoft.com/repos/azure-cli/ `lsb_release -cs` main" | tee /etc/apt/sources.list.d/azure-cli.list
RUN apt-get update && apt-get install -y azure-cli
RUN npm install -g azurite

# download and install MinIO server and client
RUN wget https://dl.min.io/server/minio/release/linux-amd64/minio
RUN chmod +x minio
@@ -58,11 +67,3 @@ ARG PGRX_VERSION=0.12.6
RUN cargo install --locked cargo-pgrx@${PGRX_VERSION}
RUN cargo pgrx init --pg${PG_MAJOR} $(which pg_config)
RUN echo "shared_preload_libraries = 'pg_parquet'" >> $HOME/.pgrx/data-${PG_MAJOR}/postgresql.conf

ENV MINIO_ROOT_USER=admin
ENV MINIO_ROOT_PASSWORD=admin123
ENV AWS_S3_TEST_BUCKET=testbucket
ENV AWS_REGION=us-east-1
ENV AWS_ACCESS_KEY_ID=admin
ENV AWS_SECRET_ACCESS_KEY=admin123
ENV PG_PARQUET_TEST=true
2 changes: 1 addition & 1 deletion .devcontainer/devcontainer.json
@@ -15,7 +15,7 @@
]
}
},
"postStartCommand": "bash .devcontainer/scripts/setup-minio.sh",
"postStartCommand": "bash .devcontainer/scripts/setup_minio.sh && bash .devcontainer/scripts/setup_azurite.sh",
"forwardPorts": [
5432
],
7 changes: 7 additions & 0 deletions .devcontainer/scripts/setup_azurite.sh
@@ -0,0 +1,7 @@
#!/bin/bash

source setup_test_envs.sh

nohup azurite --location /tmp/azurite-storage > /dev/null 2>&1 &

az storage container create --name "${AZURE_TEST_CONTAINER_NAME}" --public off --connection-string "$AZURE_STORAGE_CONNECTION_STRING"
.devcontainer/scripts/setup_minio.sh
@@ -1,5 +1,7 @@
#!/bin/bash

source setup_test_envs.sh

nohup minio server /tmp/minio-storage > /dev/null 2>&1 &

mc alias set local http://localhost:9000 $MINIO_ROOT_USER $MINIO_ROOT_PASSWORD
19 changes: 19 additions & 0 deletions .devcontainer/scripts/setup_test_envs.sh
@@ -0,0 +1,19 @@
# S3 tests
export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=admin123
export AWS_REGION=us-east-1
export AWS_S3_TEST_BUCKET=testbucket
export MINIO_ROOT_USER=admin
export MINIO_ROOT_PASSWORD=admin123

# Azure Blob tests
export AZURE_STORAGE_ACCOUNT=devstoreaccount1
export AZURE_STORAGE_KEY="Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://localhost:10000/devstoreaccount1;"
export AZURE_TEST_CONTAINER_NAME=testcontainer
export AZURE_TEST_READ_ONLY_SAS="se=2100-05-05&sp=r&sv=2022-11-02&sr=c&sig=YMPFnAHKe9y0o3hFegncbwQTXtAyvsJEgPB2Ne1b9CQ%3D"
export AZURE_TEST_READ_WRITE_SAS="se=2100-05-05&sp=rcw&sv=2022-11-02&sr=c&sig=TPz2jEz0t9L651t6rTCQr%2BOjmJHkM76tnCGdcyttnlA%3D"

# Other
export PG_PARQUET_TEST=true
export RUST_TEST_THREADS=1
5 changes: 0 additions & 5 deletions .env_sample

This file was deleted.

42 changes: 20 additions & 22 deletions .github/workflows/ci.yml
@@ -70,12 +70,23 @@ jobs:
sudo sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get install build-essential libreadline-dev zlib1g-dev flex bison libxml2-dev libxslt-dev libssl-dev libxml2-utils xsltproc ccache pkg-config
sudo apt-get -y install build-essential libreadline-dev zlib1g-dev flex bison libxml2-dev \
libxslt-dev libssl-dev libxml2-utils xsltproc ccache pkg-config \
gnupg ca-certificates
sudo apt-get -y install postgresql-${{ env.PG_MAJOR }}-postgis-3 \
postgresql-server-dev-${{ env.PG_MAJOR }} \
postgresql-client-${{ env.PG_MAJOR }} \
libpq-dev
- name: Install Azurite
run: |
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo bash -
sudo apt-get update && sudo apt-get install -y nodejs
curl -sL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/microsoft.gpg > /dev/null
echo "deb [arch=`dpkg --print-architecture` signed-by=/etc/apt/trusted.gpg.d/microsoft.gpg] https://packages.microsoft.com/repos/azure-cli/ `lsb_release -cs` main" | sudo tee /etc/apt/sources.list.d/azure-cli.list
sudo apt-get update && sudo apt-get install -y azure-cli
npm install -g azurite
- name: Install MinIO
run: |
# Download and install MinIO server and client
@@ -108,23 +119,14 @@ jobs:
$(pg_config --sharedir)/extension \
/var/run/postgresql/
# pgrx tests with runas argument ignores environment variables, so
# we read env vars from .env file in tests (https://github.com/pgcentralfoundation/pgrx/pull/1674)
touch /tmp/.env
echo AWS_ACCESS_KEY_ID=${{ env.AWS_ACCESS_KEY_ID }} >> /tmp/.env
echo AWS_SECRET_ACCESS_KEY=${{ env.AWS_SECRET_ACCESS_KEY }} >> /tmp/.env
echo AWS_S3_TEST_BUCKET=${{ env.AWS_S3_TEST_BUCKET }} >> /tmp/.env
echo AWS_REGION=${{ env.AWS_REGION }} >> /tmp/.env
echo PG_PARQUET_TEST=${{ env.PG_PARQUET_TEST }} >> /tmp/.env
# Set up test environments
source .devcontainer/scripts/setup_test_envs.sh
# Start MinIO server
export MINIO_ROOT_USER=${{ env.AWS_ACCESS_KEY_ID }}
export MINIO_ROOT_PASSWORD=${{ env.AWS_SECRET_ACCESS_KEY }}
minio server /tmp/minio-storage > /dev/null 2>&1 &
bash .devcontainer/scripts/setup_minio.sh
# Set access key and create test bucket
mc alias set local http://localhost:9000 ${{ env.AWS_ACCESS_KEY_ID }} ${{ env.AWS_SECRET_ACCESS_KEY }}
aws --endpoint-url http://localhost:9000 s3 mb s3://${{ env.AWS_S3_TEST_BUCKET }}
# Start Azurite server
bash .devcontainer/scripts/setup_azurite.sh
# Run tests with coverage tool
source <(cargo llvm-cov show-env --export-prefix)
@@ -135,13 +137,9 @@ jobs:
# Stop MinIO server
pkill -9 minio
env:
RUST_TEST_THREADS: 1
AWS_ACCESS_KEY_ID: test_secret_access_key
AWS_SECRET_ACCESS_KEY: test_access_key_id
AWS_REGION: us-east-1
AWS_S3_TEST_BUCKET: testbucket
PG_PARQUET_TEST: true
# Stop Azurite server
pkill -9 node
- name: Upload coverage report to Codecov
if: ${{ env.PG_MAJOR }} == 17
39 changes: 35 additions & 4 deletions Cargo.lock

Some generated files are not rendered by default.

5 changes: 3 additions & 2 deletions Cargo.toml
@@ -24,9 +24,9 @@ arrow = {version = "53", default-features = false}
arrow-schema = {version = "53", default-features = false}
aws-config = { version = "1.5", default-features = false, features = ["rustls"]}
aws-credential-types = {version = "1.2", default-features = false}
dotenvy = "0.15"
futures = "0.3"
object_store = {version = "0.11", default-features = false, features = ["aws"]}
home = "0.5"
object_store = {version = "0.11", default-features = false, features = ["aws", "azure"]}
once_cell = "1"
parquet = {version = "53", default-features = false, features = [
"arrow",
@@ -38,6 +38,7 @@ parquet = {version = "53", default-features = false, features = [
"object_store",
]}
pgrx = "=0.12.6"
rust-ini = "0.21"
tokio = {version = "1", default-features = false, features = ["rt", "time", "macros"]}
url = "2.5"

38 changes: 34 additions & 4 deletions README.md
@@ -155,7 +155,13 @@ SELECT uri, encode(key, 'escape') as key, encode(value, 'escape') as value FROM
```

## Object Store Support
`pg_parquet` supports reading and writing Parquet files from/to `S3` object store. Only the uris with `s3://` scheme is supported.
`pg_parquet` supports reading and writing Parquet files from/to `S3` and `Azure Blob Storage` object stores.

> [!NOTE]
> To be able to write into a object store location, you need to grant `parquet_object_store_write` role to your current postgres user.
> Similarly, to read from an object store location, you need to grant `parquet_object_store_read` role to your current postgres user.
#### S3 Storage

The simplest way to configure object storage is by creating the standard `~/.aws/credentials` and `~/.aws/config` files:

@@ -178,9 +184,33 @@ Alternatively, you can use the following environment variables when starting pos
- `AWS_CONFIG_FILE`: an alternative location for the config file
- `AWS_PROFILE`: the name of the profile from the credentials and config file (default profile name is `default`)

> [!NOTE]
> To be able to write into a object store location, you need to grant `parquet_object_store_write` role to your current postgres user.
> Similarly, to read from an object store location, you need to grant `parquet_object_store_read` role to your current postgres user.
Supported S3 uri formats are shown below:
- s3:// \<bucket\> / \<path\>
- s3a:// \<bucket\> / \<path\>
- https:// \<bucket\>.s3.amazonaws.com / \<path\>
- https:// s3.amazonaws.com / \<bucket\> / \<path\>

#### Azure Blob Storage

The simplest way to configure object storage is by creating the standard [`~/.azure/config`](https://learn.microsoft.com/en-us/cli/azure/azure-cli-configuration?view=azure-cli-latest) file:

```bash
$ cat ~/.azure/config
[storage]
account = devstoreaccount1
key = Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==
```

Alternatively, you can use the following environment variables when starting postgres to configure the Azure Blob Storage client:
- `AZURE_STORAGE_ACCOUNT`: the storage account name of the Azure Blob
- `AZURE_STORAGE_KEY`: the storage key of the Azure Blob
- `AZURE_STORAGE_SAS_TOKEN`: the storage SAS token for the Azure Blob
- `AZURE_CONFIG_FILE`: an alternative location for the config file

Supported Azure Blob Storage uri formats are shown below:
- az:// \<container\> / \<path\>
- azure:// \<container\> / \<path\>
- https:// \<account\>.blob.core.windows.net / \<container\> / \<path\>

## Copy Options
`pg_parquet` supports the following options in the `COPY TO` command: