Commit

Use parquet CLI to combine
orf committed Aug 6, 2023
1 parent 5643782 commit bb6909d
Showing 2 changed files with 23 additions and 16 deletions.
31 changes: 23 additions & 8 deletions .github/workflows/run.yml
@@ -127,15 +127,17 @@ jobs:
         with:
           name: groups

+      - name: Install Rust
+        uses: actions-rs/toolchain@v1
+        with:
+          toolchain: stable
+          override: true
+
+      - run: cargo install parquet -F cli
+
       - name: Download links
         run: cat ${{ matrix.index }} | jq -rc '.[]'

-      - name: Setup DuckDB
-        run: |
-          wget https://github.com/duckdb/duckdb/releases/download/v0.8.1/duckdb_cli-linux-amd64.zip -O /tmp/duckdb.zip
-          unzip /tmp/duckdb.zip -d ${{ github.workspace }}
-          chmod +x ${{ github.workspace }}/duckdb
-
       - name: Debug
         run: |
           echo "Links for ${{ matrix.index }}"
@@ -149,9 +151,22 @@ jobs:
       - run: ls -la ${{ github.workspace }}/input/

       - name: Combine
-        run: ${{ github.workspace }}/duckdb -echo foo.db < ${{ github.workspace }}/sql/combine.sql
+        run: parquet-concat ${{ github.workspace }}/input/*.parquet ${{ github.workspace }}/merged.parquet
+
+      - name: Merged size
+        run: du -hs ${{ github.workspace }}/merged.parquet

-      - run: ls -la ${{ github.workspace }}/*.parquet
+      - name: Rewrite
+        run: |
+          parquet-rewrite --compression=zstd \
+          --input=${{ github.workspace }}/merged.parquet \
+          --output=${{ github.workspace }}/output.parquet \
+          --writer-version=2.0 \
+          --statistics-enabled=page \
+          --bloom-filter-enabled=true
+
+      - name: Output size
+        run: du -hs ${{ github.workspace }}/merged.parquet

       - name: Upload Assets
         uses: shogo82148/actions-upload-release-asset@v1
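
The Combine and Rewrite steps in this workflow can be tried outside of CI. The following is a minimal dry-run sketch, not part of the commit. `WORKSPACE`, `RUN`, `step` and `combine_and_rewrite` are hypothetical names introduced here; `WORKSPACE` stands in for `${{ github.workspace }}`. The commands and flags are copied from the workflow as committed and have not been checked against the installed tools, so it may be worth confirming the argument order with `parquet-concat --help` first.

```shell
#!/bin/sh
# Dry-run sketch of the Combine and Rewrite workflow steps.
set -eu

# Hypothetical stand-in for ${{ github.workspace }}; defaults to the current directory.
WORKSPACE="${WORKSPACE:-$PWD}"

# Print a command; execute it only when RUN=1 is set in the environment.
step() {
  printf '%s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

combine_and_rewrite() {
  # Argument order mirrors the workflow step as committed (inputs first, output last).
  step parquet-concat "$WORKSPACE"/input/*.parquet "$WORKSPACE/merged.parquet"

  # Flags copied from the Rewrite step of the workflow.
  step parquet-rewrite --compression=zstd \
    --input="$WORKSPACE/merged.parquet" \
    --output="$WORKSPACE/output.parquet" \
    --writer-version=2.0 \
    --statistics-enabled=page \
    --bloom-filter-enabled=true
}

combine_and_rewrite
```

By default the script only prints the two commands. Setting `RUN=1` executes them, which assumes `cargo install parquet -F cli` has already placed `parquet-concat` and `parquet-rewrite` on the `PATH`.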
8 changes: 0 additions & 8 deletions sql/combine.sql

This file was deleted.
