
EIA-930 DATETIME_UTC data format issue using Snowflake #3937

Closed

RyanMasson opened this issue Nov 2, 2024 · 8 comments
Labels

bug: Things that are just plain broken.
data-types: Dtype conversions, standardization and implications of data types
parquet: Issues related to the Apache Parquet file format which we use for long tables.
snowflake: Issues related to using PUDL data in snowflake.
time: what even is time. fixing and changing the way in which PUDL data deals with time

Comments

@RyanMasson

Describe the bug

I am working in a trial version of the Snowflake web app to explore EIA-930 data. After loading EIA-930 data into Snowflake from one of the Parquet files hosted in the PUDL S3 bucket (specifically the hourly_operations table from the nightly build path: s3://pudl.catalyst.coop/nightly/core_eia930__hourly_operations.parquet), I am seeing entries in the DATETIME_UTC column that look like the following:

50467-06-17 01:00:00.000
50467-07-28 17:00:00.000
50467-09-08 09:00:00.000
50467-10-20 01:00:00.000
50467-11-30 16:00:00.000
50468-01-11 08:00:00.000

It appears that the year (YYYY) portion of each datetime value has been corrupted.

Bug Severity

  • Medium: With some effort, I can work around the bug.
    I would like to sort this time series by date to look at BPA area electricity generation over time, but this datetime issue is preventing me from doing that.

To Reproduce

This is a snippet from my Snowflake SQL worksheet that reproduces the behavior. It loads and then queries all 5 million+ rows of the hourly_operations table, and it has to be run inside Snowflake. I have not tried loading this particular Parquet file any other way to inspect the DATETIME_UTC column for comparison.

--- set Role Context
USE ROLE accountadmin;

--- set Warehouse Context
USE WAREHOUSE compute_wh;

--- create the Database
CREATE OR REPLACE DATABASE pudl
    COMMENT = 'Database for Catalyst Cooperative Public Utility Data Liberation project' ;
USE DATABASE PUDL;

--- create the Schema
CREATE OR REPLACE SCHEMA pudl.s3_data
    COMMENT = 'Schema for PUDL tables loaded from S3';

--- create the Table
CREATE OR REPLACE TABLE pudl.s3_data.core_eia930_hourly_operations
    (
    balancing_authority_code_eia STRING,
    datetime_utc DATETIME,
    demand_adjusted_mwh NUMBER,
    demand_forecast_mwh NUMBER,
    demand_imputed_mwh NUMBER,
    demand_reported_mwh NUMBER,
    interchange_adjusted_mwh NUMBER,
    interchange_imputed_mwh NUMBER,
    interchange_reported_mwh NUMBER,
    net_generation_adjusted_mwh NUMBER,
    net_generation_imputed_mwh NUMBER,
    net_generation_reported_mwh NUMBER
    --> supported types: https://docs.snowflake.com/en/sql-reference/intro-summary-data-types.html
    )
    COMMENT = 'Table for EIA 930 hourly operations data';

--- query the empty Table
SELECT * FROM pudl.s3_data.core_eia930_hourly_operations;

--- parquet file format to be able to use parquet data
CREATE OR REPLACE FILE FORMAT pudl_parquet_format
  TYPE = parquet;

--- create Snowflake stage
CREATE OR REPLACE STAGE pudl.s3_data.pudl_s3
    URL = 's3://pudl.catalyst.coop/'
    DIRECTORY = ( ENABLE = true )
    FILE_FORMAT = pudl_parquet_format;

--- look at PUDL S3 directories
LIST @pudl.s3_data.pudl_s3;

--- load data from nightly build hourly operations into my hourly operations table
COPY INTO pudl.s3_data.CORE_EIA930_HOURLY_OPERATIONS
    FROM @pudl.s3_data.pudl_s3/nightly/core_eia930__hourly_operations.parquet
    FILE_FORMAT = (FORMAT_NAME = 'pudl_parquet_format')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

SELECT * FROM PUDL.S3_DATA.CORE_EIA930_HOURLY_OPERATIONS;

Expected behavior

I expected the DATETIME_UTC values produced by the final SELECT statement to conform to 'YYYY-MM-DD HH:MM:SS', but the YYYY part appears corrupted. It could be caused by some step in how I am accessing the data, or by the file itself; I am not sure.

Software Environment?

This is happening when accessing nightly build data from PUDL's AWS S3 bucket using Snowflake. I did not install PUDL on my machine.

@RyanMasson added the bug label Nov 2, 2024
@RyanMasson
Author

@zaneselvans just showed me via email that pandas' pd.read_parquet() produces correct datetime output, so there is likely something wrong with my use of Snowflake.

@zaneselvans
Member

zaneselvans commented Nov 2, 2024

Hey, that's definitely weird. Have you been able to read datetime columns from other Parquet files without any issues?

What happens if you try using the TIMESTAMP type? Oh wait, DATETIME is just an alias for TIMESTAMP_NTZ.
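A quick way to confirm what Snowflake actually resolved the column to (a sketch; I haven't run this against your table):

-- DATETIME is just sugar for TIMESTAMP_NTZ, so swapping between the two
-- aliases shouldn't change how the raw integer gets decoded.
DESC TABLE pudl.s3_data.core_eia930_hourly_operations;
-- expect the datetime_utc row to show something like TIMESTAMP_NTZ(9)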

Can you select the maximum value of datetime_utc so we can see how it compares with the maximum date I'm getting using pandas?

Are all of the years crazy? Or is it just a few of them?

Here's the code I used which gave the expected results:

ops = pd.read_parquet(
    "s3://pudl.catalyst.coop/nightly/core_eia930__hourly_operations.parquet",
)
ops.datetime_utc.max()
# Timestamp('2024-08-01 07:00:00')

And using PyArrow:

ops = pd.read_parquet(
    "s3://pudl.catalyst.coop/nightly/core_eia930__hourly_operations.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
)
ops.datetime_utc.max()
# datetime.datetime(2024, 8, 1, 7, 0)

Distribution and set of years looks right with pandas:

ops.datetime_utc.dt.year.value_counts().sort_index().plot(kind="bar")

[bar chart: row counts per year of datetime_utc, showing the expected distribution of years]

@zaneselvans added the snowflake, time, and data-types labels Nov 2, 2024
@zaneselvans
Member

Maybe it has something to do with the precision of the Timestamp in the Parquet file, which is ms rather than ns?

ops.datetime_utc.dtype
# timestamp[ms][pyarrow]
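One way to double-check what's physically in the file is to read just the schema from the Parquet footer (a sketch; assumes the file has been downloaded locally first, e.g. with the AWS CLI, since the bucket allows anonymous access):

import pyarrow.parquet as pq

# Read only the schema from the Parquet footer, not the data itself.
# Assumes a local copy, e.g.:
#   aws s3 cp --no-sign-request s3://pudl.catalyst.coop/nightly/core_eia930__hourly_operations.parquet .
schema = pq.read_schema("core_eia930__hourly_operations.parquet")
print(schema.field("datetime_utc").type)
# timestamp[ms]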

@RyanMasson
Author

It seems that all of the years in my data are weird:

SELECT COUNT(*) FROM PUDL.S3_DATA.CORE_EIA930_HOURLY_OPERATIONS;
-- 5,118,554

SELECT MAX(DATETIME_UTC) AS max_datetime
    FROM pudl.s3_data.core_eia930_hourly_operations;
-- 56553-09-17 09:00:00.000

SELECT MIN(DATETIME_UTC) AS min_datetime
    FROM pudl.s3_data.core_eia930_hourly_operations;
-- 47466-05-16 01:00:00.000

@zaneselvans
Member

My best guess is that it has to do with the timestamp being stored as an integer number of milli/micro/nanoseconds since the epoch (t=0), and you need to somehow tell Snowflake which precision to use when interpreting that integer. But it really seems like it should be smart enough to figure that out itself, given the Parquet dtype metadata.
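That guess is easy to test: take the true maximum timestamp pandas reported above, pull out its integer millisecond count, and deliberately misread that count as seconds (a sketch using numpy, since pandas' default nanosecond timestamps can't represent year 56553):

import numpy as np
import pandas as pd

# True maximum from the Parquet file, per the pandas result above.
true_max = pd.Timestamp("2024-08-01 07:00:00")
ms_since_epoch = true_max.value // 10**6  # Timestamp.value is ns since epoch

# Misinterpret milliseconds-since-epoch as *seconds*-since-epoch:
print(np.datetime64(ms_since_epoch, "s"))
# 56553-09-17T16:00:00 -- the same far-future year Snowflake shows
# Rough sanity check: 1970 + 1000 * (2024.6 - 1970) ≈ 56553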

@zaneselvans
Member

zaneselvans commented Nov 2, 2024

Yeah, looking at your biggest "years", they're very close to 1000x further from 1970 than the present day is, so my guess is that there's a milli/micro/nano conversion error somewhere on the way from Parquet to Snowflake. Try dividing the timestamps by 1000 and see what dates you get.
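In Snowflake SQL that division could look something like this (a sketch, untested: DATE_PART(EPOCH_SECOND, ...) pulls back out the raw integer that got decoded as seconds, and the scale argument of 3 tells TO_TIMESTAMP_NTZ to treat it as milliseconds instead):

SELECT
    datetime_utc AS corrupted,
    TO_TIMESTAMP_NTZ(DATE_PART(EPOCH_SECOND, datetime_utc), 3) AS repaired
FROM pudl.s3_data.core_eia930_hourly_operations
LIMIT 10;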

@zaneselvans added the parquet label Nov 2, 2024
@zaneselvans
Member

zaneselvans commented Nov 2, 2024

So I think Snowflake is getting milliseconds, and interpreting them as seconds.

# Take the integer millisecond values and scale them up 1000x, then reinterpret
# as milliseconds -- equivalent to decoding milliseconds as if they were seconds:
thousander = ops["datetime_utc"].astype(int) * 1000
thousander.astype("datetime64[ms]").min()
# Timestamp('47466-05-16 08:00:00')
thousander.astype("datetime64[ms]").max()
# Timestamp('56553-09-17 16:00:00')
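If that's what's happening, the cleaner fix is probably at load time rather than patching afterwards. Snowflake's Parquet file format has a USE_LOGICAL_TYPE option that tells it to honor the Parquet logical type annotations (which carry the millisecond precision) instead of guessing from the raw INT64. Worth a try, though I haven't verified it against this table (sketch):

CREATE OR REPLACE FILE FORMAT pudl_parquet_format
  TYPE = parquet
  USE_LOGICAL_TYPE = TRUE;

-- then re-run the COPY INTO from the original worksheet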

@zaneselvans
Member

@RyanMasson I'm going to go ahead and close this since I'm pretty sure we found the issue and it's (thankfully!) not in our data.
