Fix PUDL Kaggle dataset updates #3852
Ah okay, with some help from the document source inspector I was able to see the full error, which seems addressable. For some reason, the Census DP1 database is no longer getting zipped in the nightly builds, so it's not showing up at the expected URL:
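A quick way to confirm whether the zipped database is being published is to check the distribution URL directly. This is just a sketch: `DIST_URL` is a placeholder for the real nightly distribution path, and the `censusdp1tract.sqlite.zip` filename is assumed.

```bash
# Sketch only: DIST_URL is a placeholder for the actual nightly distribution path.
DIST_URL="https://example.com/nightly"
curl -sI "$DIST_URL/censusdp1tract.sqlite.zip" | head -n 1
# A 200 response means the zipped DB was published;
# a 404 means the compression step never produced it.
```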
Looking at last night's build logs, there's a weird error for only the compression of the Census DP1 database, which is confusing. Why would it run out of space on this one relatively small database, but not on any of the larger ones?
This should also probably qualify as a build failure, but the way the error codes are combined in the nightly build script means it never gets flagged as one:

```bash
function clean_up_outputs_for_distribution() {
    # Compress the SQLite DBs for easier distribution
    pushd "$PUDL_OUTPUT" && \
    for file in *.sqlite; do
        echo "Compressing $file" && \
        zip "$file.zip" "$file" && \
        rm "$file"
    done
    popd && \
    # Create a zip file of all the parquet outputs for distribution on Kaggle
    # Don't try to compress the already compressed Parquet files with Zip.
    pushd "$PUDL_OUTPUT/parquet" && \
    zip -0 "$PUDL_OUTPUT/pudl_parquet.zip" ./*.parquet && \
    # Move the individual parquet outputs to the output directory for direct access
    mv ./*.parquet "$PUDL_OUTPUT" && \
    popd && \
    # Remove any remaining files and directories we don't want to distribute
    rm -rf "$PUDL_OUTPUT/parquet" && \
    rm -f "$PUDL_OUTPUT/metadata.yml"
}
```
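One possible fix, sketched below rather than prescribed, is to stop chaining the loop body with `&&` (which lets a failed `zip` exit code get swallowed) and instead bail out of the function on the first compression failure so the nightly build reports it:

```bash
# Sketch: propagate a compression failure out of the loop so the nightly
# build actually fails, instead of silently dropping the non-zero exit code.
pushd "$PUDL_OUTPUT" || exit 1
for file in *.sqlite; do
    echo "Compressing $file"
    if ! zip "$file.zip" "$file"; then
        echo "ERROR: failed to compress $file" >&2
        exit 1
    fi
    rm "$file"
done
popd || exit 1
```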
It seems a little bit unlikely that we're actually running out of space on the device though, since all of the other databases seem to have no trouble getting zipped. We specify 80GB of disk space for the machine that Google Batch uses. I guess we could bump it to 100GB and see if that makes the problem go away, which would suggest it really was a disk space issue. Hmm, there's no data on disk usage in the GCP console:
Searching for background on this error message, the only posts I'm finding are related to actually running out of disk space. So even though all the nightly build outputs together are only ~25GB before they get compressed for distribution, the raw input data and interim Dagster outputs may really push the total past 80GB. I'm inclined to try bumping it to 100GB just to see.
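To confirm whether the build is actually hitting the 80GB limit, the build script could log disk usage right before the compression step. A minimal sketch, assuming `$PUDL_OUTPUT` lives on the disk in question:

```bash
# Sketch: log overall disk usage and the size of the build outputs just before
# compressing, so a future failure can be correlated with disk pressure.
echo "Disk usage before compression:"
df -h "$PUDL_OUTPUT"
du -sh "$PUDL_OUTPUT"/*.sqlite "$PUDL_OUTPUT"/parquet 2>/dev/null
```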
I'm going to hold off on closing this until I can verify that the builds worked and I can update the Kaggle dataset tomorrow.
@zaneselvans I think we've resolved this; can I close this issue?
The PUDL Kaggle dataset was last updated on August 24th, but it's supposed to pull new data from the S3 bucket every Monday, so something is broken.
I attempted a manual update and it failed, but I can't actually see the error. I made a forum post reporting the issue.
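For reference, a manual update with the Kaggle CLI looks roughly like the sketch below. The upload directory is a placeholder, and this assumes the standard `kaggle datasets version` workflow rather than whatever automation normally runs the Monday updates:

```bash
# Sketch: manually push a new version of an existing Kaggle dataset.
# Requires a Kaggle API token in ~/.kaggle/kaggle.json and a
# dataset-metadata.json in the upload directory.
# "./kaggle_upload_dir" is a placeholder for wherever the new files live.
kaggle datasets version -p ./kaggle_upload_dir \
    -m "Weekly update from the PUDL nightly build outputs" \
    --dir-mode zip
```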
I'm worried that we might have run up against some kind of quota. Searching the support forums, it sounds like it may be impossible to delete old versions of a dataset, which would mean the only way to continue publishing updates is to create an entirely new dataset, and that would be lousy for continuity.