Basin + WeatherXM Queries Demo

standard-readme compliant

Compute queries for WeatherXM data pushed to Textile Basin + Vaults

Table of Contents

  • Background
  • Data
  • Install
  • Usage
  • Contributing
  • License

Background

This project runs a simple analysis on WeatherXM data that's pushed to Tableland Basin (replicated to Filecoin). It fetches remote data and queries it with DuckDB.

Data

The script fetches data from WeatherXM's wxm.weather_data_dev vault (created by 0x75c9eCb5309d068f72f545F3fbc7cCD246016fE0) on a cron schedule with GitHub Actions. On each run, it queries the data and writes the results to:

  • Data: Summary metrics for the run, including averages across all columns.
  • History: A CSV file containing the full history of all runs, along with the run date and time.

Install

To set things up (for local development), you'll need to do the following. First, install pipx and pipenv:

python3 -m pip install pipx
python3 -m pipx ensurepath
pipx install pipenv
pipenv run pip install --upgrade pip setuptools wheel

Then, install dependencies:

pipenv install --dev

And then, activate the virtual environment and set up pre-commit hooks:

pipenv shell
pipenv run pre-commit install -t pre-commit
pipenv run pre-commit install -t pre-push

Note the core dependencies installed are:

  • contextily: Used for plotting data on a map.
  • duckdb: Creates an in-memory SQL database for querying parquet files extracted from Basin.
  • geopandas: Also used for plotting data on a map.
  • shapely: Required for geopandas to work.
  • requests: Makes requests to the Basin HTTP API to fetch data.
  • polars: Used for DataFrame operations as part of post-query logic.
  • pyarrow: Required for DuckDB to work with parquet files.
  • rich: Used for logging purposes.

Once you've done this, you'll also need to make sure the Basin CLI is installed, since it's used by the underlying application logic. You'll need Go 1.21 installed; then run:

go install github.com/tablelandnetwork/basin-cli/cmd/vaults@latest

Also, the go-car CLI is required to extract the underlying parquet files from the CAR files retrieved from Basin. You'll need Go 1.20 (note: a different version than the Basin CLI requires), and you can install it with:

go install github.com/ipld/go-car/cmd/car@latest

Usage

Running basin_wxm/main.py will fetch remote files from Tableland Basin, extract the contents with go-car, load the resulting parquet files into a DuckDB in-memory database, run queries on the data, and collect the results into a polars DataFrame for final operations (e.g., writing to files).

To use default time ranges (the full dataset), run:

make run

Or, you can define a custom time range with start and end arguments (Unix epoch timestamps or YYYY-MM-DD dates), which are used to filter the data when queries are executed. Note: the timestamp range for the wxm.weather_data_dev vault starts at 1707436800 (2024-02-09 UTC).

make run start=1707436800 end=2024-02-15

This range also defines which events/data are fetched; the cache.json file stores all previously extracted events, so only new events are fetched on subsequent runs.
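As a minimal sketch of that caching behavior (the function name and the assumption that cache.json holds a flat list of event IDs are hypothetical; the real cache layout may differ):

```python
import json
from pathlib import Path


def new_events(all_events: list, cache_path: str = "cache.json") -> list:
    """Return only events not seen in previous runs, then update the cache.

    Assumes cache.json holds a flat JSON list of event IDs, which is an
    illustrative simplification.
    """
    path = Path(cache_path)
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    fresh = [e for e in all_events if e not in seen]
    path.write_text(json.dumps(sorted(seen | set(fresh))))
    return fresh
```

This is why the slow extraction step only pays its cost once per event: a second run over the same range finds nothing new to fetch.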

Once you run the command, it logs the status of each step in the run and, upon finishing, the total time to complete:

[18:45:46] INFO     Getting events for vault...done in 0.53s
           INFO     Number of events found: 7
           INFO     Number of new events: 7
[18:51:04] INFO     Extracting data from events...done in 317.89s
           INFO     Creating database with parquet files...done in 0.02s
[18:51:11] INFO     Executing queries...done in 6.92s
[18:51:29] INFO     Generating bbox plots...done in 17.88s
⠙ Writing results to files...

Note: The program downloads the files locally before creating the database and running queries, which takes up some disk space. For example, five wxm parquet files total ~1.2 GiB of raw file size (each is 200-250 MiB).

Flags

The following flags are available for the main.py script:

  • --start: Start of the query range (Unix epoch timestamp or YYYY-MM-DD, e.g., 1707436800). Defaults to the full range.
  • --end: End of the query range (Unix epoch timestamp or YYYY-MM-DD, e.g., 1707955200). Defaults to the full range.
  • --verbose: Enable verbose logging to show stack traces for errors. Defaults to true.
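A minimal sketch of how flags like these could be parsed with argparse; the parse_timestamp helper is hypothetical, and the real script's argument handling may differ (e.g., its --verbose defaults to true, simplified here):

```python
import argparse
from datetime import datetime, timezone


def parse_timestamp(value: str) -> int:
    """Accept either a Unix epoch ('1707436800') or a 'YYYY-MM-DD' date."""
    if value.isdigit():
        return int(value)
    dt = datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


parser = argparse.ArgumentParser(description="Query wxm data in a time range")
parser.add_argument("--start", type=parse_timestamp, default=None,
                    help="start of query range (Unix epoch or YYYY-MM-DD)")
parser.add_argument("--end", type=parse_timestamp, default=None,
                    help="end of query range (Unix epoch or YYYY-MM-DD)")
parser.add_argument("--verbose", action="store_true",
                    help="show stack traces for errors")
```

With a custom type function, both input forms normalize to an epoch integer before any query logic runs.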

Makefile Reference

The following defines all commands available in the Makefile:

  • make run: Run the main.py program to fetch Basin/wxm data, run queries, and write metrics to summary files.
  • make install: Install dependencies with pip, upgrading pip first.
  • make setup: Activate the virtual environment and set up pre-commit hooks.
  • make format: Run black, isort, mypy, and flake8 to format and lint the code.
  • make basin: Install the Basin CLI from the latest release.
  • make ipfs-car: Install the go-car CLI from the latest release.
  • make test: Run the (dummy) tests.

Contributing

PRs accepted.

Small note: If editing the README, please conform to the standard-readme specification.

License

MIT AND Apache-2.0, © 2021-2024 Textile Contributors