Commit
Merge remote-tracking branch 'origin/dev' into rousik-output-diff
rousik committed Sep 11, 2023
2 parents ab0decc + 86ece0c commit c32ba73
Showing 262 changed files with 15,379 additions and 26,318 deletions.
2 changes: 0 additions & 2 deletions .bandit.yml

This file was deleted.

36 changes: 36 additions & 0 deletions .github/workflows/docker-build-test.yml
@@ -0,0 +1,36 @@
---
name: docker-build-test
on:
workflow_dispatch:
pull_request:

jobs:
pudl_docker_build:
name: Test building the PUDL ETL Docker image
runs-on: ubuntu-latest
permissions:
contents: read
id-token: write
steps:
- name: Checkout Repository
uses: actions/checkout@v3

- name: Docker Metadata
id: docker_metadata
uses: docker/[email protected]
with:
images: catalystcoop/pudl-etl
flavor: |
latest=auto
- name: Set up Docker Buildx
uses: docker/[email protected]

- name: Build image but do not push to Docker Hub
uses: docker/[email protected]
with:
context: .
file: docker/Dockerfile
push: false
cache-from: type=gha
cache-to: type=gha,mode=max
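The build that this workflow performs in CI can be approximated locally. A minimal sketch, assuming Docker is installed and you are in the repository root; the `-f docker/Dockerfile` path mirrors the workflow's `file:` input, while the tag name is purely illustrative:

```shell
# Build the PUDL ETL image locally without pushing, mirroring the CI job.
# The -f path matches the workflow's `file:` input; the tag is a local label only.
docker build -f docker/Dockerfile -t pudl-etl:local .
```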
18 changes: 14 additions & 4 deletions .github/workflows/tox-pytest.yml
@@ -1,11 +1,18 @@
---
name: tox-pytest

on: [pull_request]
on:
pull_request:
types:
- created
- opened
- synchronize
- ready_for_review

env:
PUDL_OUTPUT: /home/runner/pudl-work/output
PUDL_INPUT: /home/runner/pudl-work/data/
DAGSTER_HOME: /home/runner/pudl-work/dagster_home/

jobs:
ci-static:
@@ -97,7 +104,10 @@ jobs:
path: coverage.xml

ci-integration:
runs-on: ubuntu-latest
runs-on:
group: large-runner-group
labels: ubuntu-22.04-4core
if: github.event.pull_request.draft == false
permissions:
contents: read
id-token: write
@@ -145,8 +155,8 @@ jobs:
path: ${{ env.PUDL_INPUT }}
key: zenodo-datastore-${{ hashFiles('datastore-dois.txt') }}

- name: Make cache/output dirs
run: mkdir -p ${{ env.PUDL_OUTPUT }} ${{ env.PUDL_INPUT}}
- name: Make input, output and dagster dirs
run: mkdir -p ${{ env.PUDL_OUTPUT }} ${{ env.PUDL_INPUT}} ${{ env.DAGSTER_HOME }}

- name: List workspace contents
run: find /home/runner/pudl-work
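The `env:` block and directory-creation step this hunk touches can be mirrored in a local shell when reproducing the CI run. A hedged sketch; the paths under `$HOME` are examples, not requirements:

```shell
# PUDL reads its workspace locations from these environment variables,
# mirroring the `env:` block in the workflow above.
export PUDL_OUTPUT="$HOME/pudl-work/output"
export PUDL_INPUT="$HOME/pudl-work/data"
export DAGSTER_HOME="$HOME/pudl-work/dagster_home"

# Create all three directories up front, as the CI step now does.
mkdir -p "$PUDL_OUTPUT" "$PUDL_INPUT" "$DAGSTER_HOME"
```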
84 changes: 9 additions & 75 deletions .pre-commit-config.yaml
@@ -5,8 +5,6 @@ repos:
rev: v1.10.0
hooks:
- id: python-check-blanket-noqa # Prohibit overly broad QA exclusions.
- id: python-no-eval # Never use eval() it's dangerous.
- id: python-no-log-warn # logger.warning(), not old .warn()
- id: rst-backticks # Find single rather than double backticks
- id: rst-directive-colons # Missing double-colons after directives
- id: rst-inline-touching-normal # Inline code should never touch normal text
@@ -21,113 +19,49 @@ repos:
- id: check-yaml # Validate all YAML files.
- id: check-case-conflict # Avoid case sensitivity in file names.
- id: debug-statements # Watch for lingering debugger calls.
- id: end-of-file-fixer # Ensure there's a newline at EOF.
- id: mixed-line-ending # Use Unix line-endings to avoid big no-op CSV diffs.
args: ["--fix=lf"]
- id: trailing-whitespace # Remove trailing whitespace.
- id: name-tests-test # Follow PyTest naming convention.

####################################################################################
# Formatters: hooks that re-write Python & documentation files
####################################################################################
# Make sure import statements are sorted uniformly.
- repo: https://github.com/PyCQA/isort
rev: 5.12.0
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.0.287
hooks:
- id: isort
exclude: migrations/.*
- id: ruff
args: [--fix, --exit-non-zero-on-fix]

# Format the code
- repo: https://github.com/psf/black
rev: 23.3.0
rev: 23.7.0
hooks:
- id: black
language_version: python3.11
exclude: migrations/.*

# Format docstrings
- repo: https://github.com/PyCQA/docformatter
rev: v1.6.5
hooks:
- id: docformatter
args: ["--in-place", "--config", "tox.ini"]
exclude: migrations/.*

# Remove f-string prefix when there's nothing in the string to format.
- repo: https://github.com/dannysepler/rm_unneeded_f_str
rev: v0.2.0
hooks:
- id: rm-unneeded-f-str

# Use built-in types for annotations as per PEP585
- repo: https://github.com/sondrelg/pep585-upgrade
rev: v1.0
hooks:
- id: upgrade-type-hints

- repo: https://github.com/pre-commit/mirrors-prettier
rev: v3.0.0-alpha.9-for-vscode
rev: v3.0.3
hooks:
- id: prettier
types_or: [yaml]

# Update Python language constructs to modern standards
- repo: https://github.com/asottile/pyupgrade
rev: v3.4.0
hooks:
- id: pyupgrade
args: ["--py311-plus"]

####################################################################################
# Linters: hooks that check but don't alter Python and documentation files
# Linters: hooks that check but don't alter files
####################################################################################

# Check for PEP8 non-compliance, code complexity, style, errors, etc:
- repo: https://github.com/PyCQA/flake8
rev: 6.0.0
hooks:
- id: flake8
args: [--config, tox.ini]
exclude: migrations/.*
additional_dependencies:
- flake8-docstrings
- flake8-colors
- pydocstyle
- flake8-builtins
- mccabe
- pep8-naming
- pycodestyle
- pyflakes
- flake8-rst-docstrings
- flake8-use-fstring

# Check for known security vulnerabilities:
- repo: https://github.com/PyCQA/bandit
rev: 1.7.5
hooks:
- id: bandit
additional_dependencies: ["bandit[toml]"]
args: ["--configfile", ".bandit.yml"]

# Lint Dockerfiles for errors and to ensure best practices
- repo: https://github.com/AleksaC/hadolint-py
rev: v2.12.0.2
hooks:
- id: hadolint

# Check for errors in restructuredtext (.rst) files under the doc hierarchy
- repo: https://github.com/PyCQA/doc8
rev: v1.1.1
hooks:
- id: doc8
args: [--config, tox.ini]

# Lint any RST files and embedded code blocks for syntax / formatting errors
- repo: https://github.com/rstcheck/rstcheck
rev: v6.1.2
hooks:
- id: rstcheck
additional_dependencies: [sphinx]
args: [--config, tox.ini]
args: [--config, pyproject.toml]

#####################################################################################
# Our own pre-commit hooks, which don't come from the pre-commit project
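With the config consolidated onto ruff and friends as shown above, the remaining hooks can be exercised locally. A sketch, assuming `pre-commit` is installed in your environment:

```shell
# Install the git hook scripts once per clone.
pre-commit install

# Run every configured hook (ruff, black, prettier, etc.) against all files,
# not just the staged ones -- useful after editing the config itself.
pre-commit run --all-files
```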
85 changes: 21 additions & 64 deletions README.rst
@@ -64,18 +64,18 @@ What data is available?

PUDL currently integrates data from:

* `EIA Form 860 <https://www.eia.gov/electricity/data/eia860/>`__: 2001-2021
* `EIA Form 860m <https://www.eia.gov/electricity/data/eia860m/>`__: 2022-06
* `EIA Form 861 <https://www.eia.gov/electricity/data/eia861/>`__: 2001-2021
* `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__: 2001-2021
* `EPA Continuous Emissions Monitoring System (CEMS) <https://campd.epa.gov/>`__: 1995-2021
* `EIA Form 860 <https://www.eia.gov/electricity/data/eia860/>`__: 2001-2022
* `EIA Form 860m <https://www.eia.gov/electricity/data/eia860m/>`__: 2023-06
* `EIA Form 861 <https://www.eia.gov/electricity/data/eia861/>`__: 2001-2022
* `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__: 2001-2022
* `EPA Continuous Emissions Monitoring System (CEMS) <https://campd.epa.gov/>`__: 1995-2022
* `FERC Form 1 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__: 1994-2021
* `FERC Form 714 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__: 2006-2020
* `US Census Demographic Profile 1 Geodatabase <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__: 2010

Thanks to support from the `Alfred P. Sloan Foundation Energy & Environment
Program <https://sloan.org/programs/research/energy-and-environment>`__, from
2021 to 2023 we will be integrating the following data as well:
2021 to 2024 we will be integrating the following data as well:

* `EIA Form 176 <https://www.eia.gov/dnav/ng/TblDefs/NG_DataSources.html#s176>`__
(The Annual Report of Natural Gas Supply and Disposition)
@@ -101,7 +101,7 @@ resources and everyone in between!
How do I access the data?
-------------------------

There are four main ways to access PUDL outputs. For more details you'll want
There are several ways to access PUDL outputs. For more details you'll want
to check out `the complete documentation
<https://catalystcoop-pudl.readthedocs.io>`__, but here's a quick overview:

@@ -124,7 +124,7 @@ Want access to all the published data in bulk? If you're familiar with Python
and `Jupyter Notebooks <https://jupyter.org/>`__ and are willing to install Docker you
can:

* `Download a PUDL data release <https://sandbox.zenodo.org/record/764696>`__ from
* `Download a PUDL data release <https://zenodo.org/record/3653158>`__ from
CERN's `Zenodo <https://zenodo.org>`__ archiving service.
* `Install Docker <https://docs.docker.com/get-docker/>`__
* Run the archived image using ``docker-compose up``
@@ -138,20 +138,6 @@ The `PUDL Examples repository <https://github.com/catalyst-cooperative/pudl-exam
has more detailed instructions on how to work with the Zenodo data archive and Docker
image.

JupyterHub
^^^^^^^^^^
Do you want to use Python and Jupyter Notebooks to access the data but aren't
comfortable setting up Docker? We are working with `2i2c <https://2i2c.org>`__ to host
a JupyterHub that has the same software and data as the Docker container and Zenodo
archive mentioned above, but running in the cloud.

* `Request an account <https://forms.gle/TN3GuE2e2mnWoFC4A>`__
* `Log in to the JupyterHub <https://bit.ly/pudl-examples-01>`__

**Note:** you'll only have 4-6GB of RAM and 1 CPU to work with on the JupyterHub, so
if you need more computing power, you may need to set PUDL up on your own computer.
Eventually we hope to offer scalable computing resources on the JupyterHub as well.

The PUDL Development Environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you're more familiar with the Python data science stack and are comfortable working
@@ -161,51 +147,22 @@ full data processing pipeline yourself, tweak the underlying source code, and (w
make contributions back to the project.

This is by far the most involved way to access the data and isn't recommended for
most users. You should check out the `Development section <https://catalystcoop-pudl.readthedocs.io/en/latest/dev/dev_setup.html>`__ of the main `PUDL
documentation <https://catalystcoop-pudl.readthedocs.io>`__ for more details.
most users. You should check out the `Development section <https://catalystcoop-pudl.readthedocs.io/en/latest/dev/dev_setup.html>`__
of the main `PUDL documentation <https://catalystcoop-pudl.readthedocs.io>`__ for more
details.

Nightly Data Builds
^^^^^^^^^^^^^^^^^^^
If you are less concerned with reproducibility and want the freshest possible data
we also upload the outputs of our nightly builds to public S3 storage buckets. This
data is produced by the `dev branch <https://github.com/catalyst-cooperative/pudl/tree/dev>`__,
of PUDL, and is updated most weekday mornings. It is also the data used to populate
Datasette:

* `PUDL SQLite DB <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/pudl.sqlite>`__
* `EPA CEMS Hourly Emissions Parquet (1995-2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/hourly_emissions_epacems.parquet>`__
* `Census DP1 SQLite DB (2010) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/censusdp1tract.sqlite>`__

* Raw FERC Form 1:

* `FERC-1 SQLite derived from DBF (1994-2020) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc1.sqlite>`__
* `FERC-1 SQLite derived from XBRL (2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc1_xbrl.sqlite>`__
* `FERC-1 Datapackage (JSON) describing SQLite derived from XBRL <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc1_xbrl_datapackage.json>`__
* `FERC-1 XBRL Taxonomy Metadata as JSON (2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc1_xbrl_taxonomy_metadata.json>`__

* Raw FERC Form 2:

* `FERC-2 SQLite derived from XBRL (2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc2_xbrl.sqlite>`__
* `FERC-2 Datapackage (JSON) describing SQLite derived from XBRL <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc2_xbrl_datapackage.json>`__
* `FERC-2 XBRL Taxonomy Metadata as JSON (2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc2_xbrl_taxonomy_metadata.json>`__

* Raw FERC Form 6:

* `FERC-6 SQLite derived from XBRL (2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc6_xbrl.sqlite>`__
* `FERC-6 Datapackage (JSON) describing SQLite derived from XBRL <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc6_xbrl_datapackage.json>`__
* `FERC-6 XBRL Taxonomy Metadata as JSON (2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc6_xbrl_taxonomy_metadata.json>`__

* Raw FERC Form 60:

* `FERC-60 SQLite derived from XBRL (2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc60_xbrl.sqlite>`__
* `FERC-60 Datapackage (JSON) describing SQLite derived from XBRL <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc60_xbrl_datapackage.json>`__
* `FERC-60 XBRL Taxonomy Metadata as JSON (2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc60_xbrl_taxonomy_metadata.json>`__

* Raw FERC Form 714:
we automatically upload the outputs of our nightly builds to public S3 storage buckets
as part of the `AWS Open Data Registry
<https://registry.opendata.aws/catalyst-cooperative-pudl/>`__. This data is based on
the `dev branch <https://github.com/catalyst-cooperative/pudl/tree/dev>`__ of PUDL, and
is updated most weekday mornings. It is also the data used to populate Datasette.

* `FERC-714 SQLite derived from XBRL (2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc714_xbrl.sqlite>`__
* `FERC-714 Datapackage (JSON) describing SQLite derived from XBRL <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc714_xbrl_datapackage.json>`__
* `FERC-714 XBRL Taxonomy Metadata as JSON (2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/ferc714_xbrl_taxonomy_metadata.json>`__
The nightly build outputs can be accessed using the AWS CLI, the S3 API, or downloaded
directly via the web. See `Accessing Nightly Builds <https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html#access-nightly-builds>`__
for links to the individual SQLite, JSON, and Apache Parquet outputs.
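Because the nightly-build buckets are part of the AWS Open Data Registry, they can be read anonymously with the AWS CLI. A hedged sketch; the exact bucket and object paths shown here are assumptions, so check the registry entry above for the canonical locations:

```shell
# List the nightly build outputs anonymously; --no-sign-request skips credentials.
aws s3 ls --no-sign-request s3://pudl.catalyst.coop/

# Copy a single output locally (the object path is illustrative, not canonical).
aws s3 cp --no-sign-request s3://pudl.catalyst.coop/nightly/pudl.sqlite .
```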

Contributing to PUDL
--------------------
@@ -250,9 +207,9 @@ Contact Us
`Office Hours <https://calend.ly/catalyst-cooperative/pudl-office-hours>`__
* Follow us on Twitter: `@CatalystCoop <https://twitter.com/CatalystCoop>`__
* More info on our website: https://catalyst.coop
* For private communication about the project or to hire us to provide customized data
* To hire us to provide customized data
extraction and analysis, you can email the maintainers:
`pudl@catalyst.coop <mailto:pudl@catalyst.coop>`__
`hello@catalyst.coop <mailto:hello@catalyst.coop>`__

About Catalyst Cooperative
--------------------------
3 changes: 3 additions & 0 deletions devtools/datasette/publish.sh
@@ -16,6 +16,9 @@ datasette publish cloudrun \
--extra-options="--setting sql_time_limit_ms 5000" \
$SQLITE_DIR/pudl.sqlite \
$SQLITE_DIR/ferc1.sqlite \
$SQLITE_DIR/ferc2.sqlite \
$SQLITE_DIR/ferc6.sqlite \
$SQLITE_DIR/ferc60.sqlite \
$SQLITE_DIR/ferc1_xbrl.sqlite \
$SQLITE_DIR/ferc2_xbrl.sqlite \
$SQLITE_DIR/ferc6_xbrl.sqlite \
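Before publishing to Cloud Run, the same databases can be previewed locally with Datasette. A sketch, assuming `datasette` is installed and `$SQLITE_DIR` points at your build outputs, using the same SQL time limit the publish script sets:

```shell
# Serve a couple of the databases locally (defaults to http://127.0.0.1:8001)
# with the same sql_time_limit_ms setting used in the publish script above.
datasette serve \
  --setting sql_time_limit_ms 5000 \
  "$SQLITE_DIR/pudl.sqlite" \
  "$SQLITE_DIR/ferc1.sqlite"
```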
