Fix PUDL Kaggle dataset updates #3852

Closed
zaneselvans opened this issue Sep 18, 2024 · 5 comments · Fixed by #3853
Assignees: zaneselvans
Labels: kaggle (Sharing our data and analysis with the Kaggle community), nightly-builds (Anything having to do with nightly builds or continuous deployment)

Comments

@zaneselvans (Member)

The PUDL Kaggle dataset was last updated on August 24th, but it's supposed to pull new data from the S3 bucket every Monday, so something is broken.

I attempted a manual update and it failed with an error, but I can't actually see the error. I made a forum post reporting the issue.

I'm worried that we might have run up against some kind of quota. Searching around the support forums, it sounds like it may be impossible to delete old versions of a dataset, meaning the only way to continue with updates would be to create an entirely new dataset, which would be lousy for continuity.

@zaneselvans zaneselvans converted this from a draft issue Sep 18, 2024
@zaneselvans zaneselvans added the kaggle Sharing our data and analysis with the Kaggle community label Sep 18, 2024
@zaneselvans (Member, Author)

Ah okay, with some help from the page source inspector, I was able to see the full error, which seems addressable. For some reason, the Census DP1 database is no longer getting zipped in the nightly builds, so it's not showing up at the expected URL:

Failed - Error during creation: Accessing URL failed (Response status code does not indicate success: 404 (Not Found).):
https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/censusdp1tract.sqlite.zip
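
For what it's worth, the missing archive can be confirmed from the command line; this is just a diagnostic one-liner against the URL from the error above, not part of the build:

# HEAD request against the expected nightly archive; the first line of output is
# the HTTP status line, which should show the same 404 the Kaggle updater hit.
curl -sI "https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/censusdp1tract.sqlite.zip" | head -n 1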

Looking at last night's build logs, there's a weird error for the compression of the Census DP1 database only, which is confusing. Why would it run out of space on this one relatively small database, but not on pudl.sqlite, which is much larger?

Compressing censusdp1tract.sqlite
  adding: censusdp1tract.sqlite
zip I/O error: No space left on device
zip error: Output file write failure (write error on zip file)
Compressing ferc1_dbf.sqlite
  adding: ferc1_dbf.sqlite (deflated 65%)
Compressing ferc1_xbrl.sqlite
  adding: ferc1_xbrl.sqlite (deflated 89%)
Compressing ferc2_dbf.sqlite
  adding: ferc2_dbf.sqlite (deflated 75%)
Compressing ferc2_xbrl.sqlite
  adding: ferc2_xbrl.sqlite (deflated 85%)
Compressing ferc60_dbf.sqlite
  adding: ferc60_dbf.sqlite (deflated 67%)
Compressing ferc60_xbrl.sqlite
  adding: ferc60_xbrl.sqlite (deflated 89%)
Compressing ferc6_dbf.sqlite
  adding: ferc6_dbf.sqlite (deflated 73%)
Compressing ferc6_xbrl.sqlite
  adding: ferc6_xbrl.sqlite (deflated 83%)
Compressing ferc714_xbrl.sqlite
  adding: ferc714_xbrl.sqlite (deflated 89%)
Compressing pudl.sqlite
  adding: pudl.sqlite (deflated 83%)

This should also probably qualify as a build failure, but the way the exit codes are combined in the nightly build script means a zip failure inside the loop never propagates out of the function:

function clean_up_outputs_for_distribution() {
    # Compress the SQLite DBs for easier distribution
    pushd "$PUDL_OUTPUT" && \
    for file in *.sqlite; do
        echo "Compressing $file" && \
        zip "$file.zip" "$file" && \
        rm "$file"
    done
    popd && \
    # Create a zip file of all the parquet outputs for distribution on Kaggle
    # Don't try to compress the already compressed Parquet files with Zip.
    pushd "$PUDL_OUTPUT/parquet" && \
    zip -0 "$PUDL_OUTPUT/pudl_parquet.zip" ./*.parquet && \
    # Move the individual parquet outputs to the output directory for direct access
    mv ./*.parquet "$PUDL_OUTPUT" && \
    popd && \
    # Remove any remaining files and directories we don't want to distribute
    rm -rf "$PUDL_OUTPUT/parquet" && \
    rm -f "$PUDL_OUTPUT/metadata.yml"
}
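
A minimal sketch of one way to make a failed zip abort the function, assuming the caller checks the function's return code; only the compression loop is shown, and the function name here is made up for the sketch rather than taken from the real build script:

function compress_sqlite_outputs() {
    # Compress the SQLite DBs, aborting on the first failure so a bad zip step
    # fails the build instead of being silently skipped.
    pushd "$PUDL_OUTPUT" || return 1
    local file
    for file in *.sqlite; do
        echo "Compressing $file"
        # Propagate a zip failure out of the loop rather than just skipping the rm.
        if ! zip "$file.zip" "$file"; then
            echo "ERROR: failed to compress $file" >&2
            popd
            return 1
        fi
        rm "$file"
    done
    popd || return 1
}

The same pattern could be folded into clean_up_outputs_for_distribution directly; the key change is that the loop returns a non-zero code as soon as zip fails instead of letting the && chain quietly skip the rm and move on.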

@zaneselvans (Member, Author) commented Sep 18, 2024

It seems a little unlikely that we're actually running out of space on the device, though, since all of the other databases seem to have no trouble getting zipped. We specify 80GB of disk space for the machine that Google Batch uses. I guess we could bump it to 100GB and see if that makes the problem go away, which would at least confirm it really was a disk space issue.

Hmm, no data on disk usage from the GCP console:

[screenshot of the GCP console showing no disk usage data]

@zaneselvans (Member, Author)

Searching for background on this error message, the only posts I'm finding are related to actually running out of disk space. So even though all the nightly build outputs together are only ~25GB before they get compressed for distribution, with all the raw input data and interim outputs from Dagster on the same disk, maybe it really is exceeding 80GB in total. I'm inclined to try bumping it to 100GB just to see.
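
One cheap way to settle the question before (or alongside) resizing the disk would be to log disk usage from inside the build right before the compression step. A rough sketch, assuming it can just be dropped into the nightly build script and that $PUDL_OUTPUT lives on the disk in question:

# Log overall disk usage and the size of the output directory just before
# compression, so the next failed build shows whether the 80GB disk really is full.
echo "Disk usage before compressing outputs:"
df -h "$PUDL_OUTPUT"
du -sh "$PUDL_OUTPUT"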

@zaneselvans zaneselvans added the nightly-builds Anything having to do with nightly builds or continuous deployment. label Sep 18, 2024
@zaneselvans zaneselvans self-assigned this Sep 18, 2024
@github-project-automation github-project-automation bot moved this from In review to Done in Catalyst Megaproject Sep 18, 2024
@zaneselvans (Member, Author)

I'm going to hold off on closing this until I can verify that the builds worked and I can update the Kaggle dataset tomorrow.

@zaneselvans zaneselvans reopened this Sep 18, 2024
@e-belfer (Member)

@zaneselvans I think we've resolved this; can I close this issue?
