
EIA-930 DATETIME_UTC data format issue using Snowflake #3937

Closed

RyanMasson opened this issue Nov 2, 2024 · 8 comments
Labels

bug: Things that are just plain broken.
data-types: Dtype conversions, standardization and implications of data types
parquet: Issues related to the Apache Parquet file format which we use for long tables.
snowflake: Issues related to using PUDL data in snowflake.
time: what even is time. fixing and changing the way in which PUDL data deals with time

Comments

@RyanMasson

Describe the bug

I am working in a trial version of the Snowflake web app to explore EIA-930 data. After loading EIA-930 data into Snowflake from one of the Parquet files hosted in the PUDL S3 bucket (specifically the hourly_operations table from the nightly build path: s3://pudl.catalyst.coop/nightly/core_eia930__hourly_operations.parquet), I am seeing entries in the DATETIME_UTC column that look like the following:

50467-06-17 01:00:00.000
50467-07-28 17:00:00.000
50467-09-08 09:00:00.000
50467-10-20 01:00:00.000
50467-11-30 16:00:00.000
50468-01-11 08:00:00.000

It appears that the year (YYYY) portion of each datetime value has been corrupted.

Bug Severity

  • Medium: With some effort, I can work around the bug.
    I would like to sort this time series by date to look at BPA area electricity generation over time, but this datetime issue is preventing me from doing that.

To Reproduce

This is a snippet from my Snowflake SQL worksheet that reproduces the behavior. It loads and then queries all 5 million+ rows of the hourly_operations table, and it has to be run inside Snowflake. I have not tried loading this particular Parquet file any other way to inspect the DATETIME_UTC column for comparison.

--- set Role Context
USE ROLE accountadmin;

--- set Warehouse Context
USE WAREHOUSE compute_wh;

--- create the Database
CREATE OR REPLACE DATABASE pudl
    COMMENT = 'Database for Catalyst Cooperative Public Utility Data Liberation project' ;
USE DATABASE PUDL;

--- create the Schema
CREATE OR REPLACE SCHEMA pudl.s3_data
    COMMENT = 'Schema for PUDL tables loaded from S3';

--- create the Table
CREATE OR REPLACE TABLE pudl.s3_data.core_eia930_hourly_operations
    (
    balancing_authority_code_eia STRING,
    datetime_utc DATETIME,
    demand_adjusted_mwh NUMBER,
    demand_forecast_mwh NUMBER,
    demand_imputed_mwh NUMBER,
    demand_reported_mwh NUMBER,
    interchange_adjusted_mwh NUMBER,
    interchange_imputed_mwh NUMBER,
    interchange_reported_mwh NUMBER,
    net_generation_adjusted_mwh NUMBER,
    net_generation_imputed_mwh NUMBER,
    net_generation_reported_mwh NUMBER
    --> supported types: https://docs.snowflake.com/en/sql-reference/intro-summary-data-types.html
    )
    COMMENT = 'Table for EIA 930 hourly operations data';

--- query the empty Table
SELECT * FROM pudl.s3_data.core_eia930_hourly_operations;

--- parquet file format to be able to use parquet data
CREATE OR REPLACE FILE FORMAT pudl_parquet_format
  TYPE = parquet;

--- create Snowflake stage
CREATE OR REPLACE STAGE pudl.s3_data.pudl_s3
    URL = 's3://pudl.catalyst.coop/'
    DIRECTORY = ( ENABLE = true )
    FILE_FORMAT = pudl_parquet_format;

--- look at PUDL S3 directories
LIST @pudl.s3_data.pudl_s3;

--- load data from nightly build hourly operations into my hourly operations table
COPY INTO pudl.s3_data.CORE_EIA930_HOURLY_OPERATIONS
    FROM @pudl.s3_data.pudl_s3/nightly/core_eia930__hourly_operations.parquet
    FILE_FORMAT = (FORMAT_NAME = 'pudl_parquet_format')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

SELECT * FROM PUDL.S3_DATA.CORE_EIA930_HOURLY_OPERATIONS;

Expected behavior

I expected the DATETIME_UTC values produced by the final SELECT statement to conform to 'YYYY-MM-DD HH:MM:SS', but the YYYY part appears corrupted. It could be caused by some step in how I am accessing the data, or by the file itself; I am not sure.

Software Environment?

This is happening when accessing nightly build data from PUDL's AWS S3 bucket using Snowflake. I did not install PUDL on my machine.

@RyanMasson added the bug label Nov 2, 2024
@RyanMasson
Author

@zaneselvans just showed me via email that pandas' pd.read_parquet() produces correct datetime output, so there is likely something wrong with my use of Snowflake.

@zaneselvans
Member

zaneselvans commented Nov 2, 2024

Hey, that's definitely weird. Have you been able to read datetime columns from other Parquet files without any issues?

What happens if you try using the TIMESTAMP type? Oh wait, DATETIME is just an alias for TIMESTAMP_NTZ.
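A quick way to confirm what Snowflake actually resolved the column to (a sketch; I haven't run this against your table):

-- DATETIME is just sugar for TIMESTAMP_NTZ, so swapping between the two
-- aliases shouldn't change how the raw integer gets decoded.
DESC TABLE pudl.s3_data.core_eia930_hourly_operations;
-- expect the datetime_utc row to show something like TIMESTAMP_NTZ(9)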

Can you select the maximum value of datetime_utc so we can see how it compares with the maximum date I'm getting using pandas?

Are all of the years crazy? Or is it just a few of them?

Here's the code I used which gave the expected results:

ops = pd.read_parquet(
    "s3://pudl.catalyst.coop/nightly/core_eia930__hourly_operations.parquet",
)
ops.datetime_utc.max()
# Timestamp('2024-08-01 07:00:00')

And using PyArrow:

ops = pd.read_parquet(
    "s3://pudl.catalyst.coop/nightly/core_eia930__hourly_operations.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
)
ops.datetime_utc.max()
# datetime.datetime(2024, 8, 1, 7, 0)

Distribution and set of years looks right with pandas:

ops.datetime_utc.dt.year.value_counts().sort_index().plot(kind="bar")

[bar chart: row counts per year of datetime_utc, showing the expected distribution of years]

@zaneselvans added the snowflake, time, and data-types labels Nov 2, 2024
@zaneselvans
Member

Maybe it has something to do with the precision of the Timestamp in the Parquet file, which is ms rather than ns?

ops.datetime_utc.dtype
# timestamp[ms][pyarrow]
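One way to double-check what's physically in the file is to read just the schema from the Parquet footer (a sketch; assumes the file has been downloaded locally first, e.g. with the AWS CLI, since the bucket allows anonymous access):

import pyarrow.parquet as pq

# Read only the schema from the Parquet footer, not the data itself.
# Assumes a local copy, e.g.:
#   aws s3 cp --no-sign-request s3://pudl.catalyst.coop/nightly/core_eia930__hourly_operations.parquet .
schema = pq.read_schema("core_eia930__hourly_operations.parquet")
print(schema.field("datetime_utc").type)
# timestamp[ms]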

@RyanMasson
Author

It seems that all of the years in my data are weird:

SELECT COUNT(*) FROM PUDL.S3_DATA.CORE_EIA930_HOURLY_OPERATIONS;
-- 5,118,554

SELECT MAX(DATETIME_UTC) AS max_datetime
    FROM pudl.s3_data.core_eia930_hourly_operations;
-- 56553-09-17 09:00:00.000

SELECT MIN(DATETIME_UTC) AS min_datetime
    FROM pudl.s3_data.core_eia930_hourly_operations;
-- 47466-05-16 01:00:00.000

@zaneselvans
Member

My best guess is that it has to do with the timestamp being stored as an integer number of milli/micro/nanoseconds since the epoch (t=0), and you need to somehow tell Snowflake which precision to use when interpreting that integer. But it really seems like it should be smart enough to figure that out itself, given the Parquet dtype metadata.
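That guess is easy to test: take the true maximum timestamp pandas reported above, pull out its integer millisecond count, and deliberately misread that count as seconds (a sketch using numpy, since pandas' default nanosecond timestamps can't represent year 56553):

import numpy as np
import pandas as pd

# True maximum from the Parquet file, per the pandas result above.
true_max = pd.Timestamp("2024-08-01 07:00:00")
ms_since_epoch = true_max.value // 10**6  # Timestamp.value is ns since epoch

# Misinterpret milliseconds-since-epoch as *seconds*-since-epoch:
print(np.datetime64(ms_since_epoch, "s"))
# 56553-09-17T16:00:00 -- the same far-future year Snowflake shows
# Rough sanity check: 1970 + 1000 * (2024.6 - 1970) ≈ 56553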

@zaneselvans
Member

zaneselvans commented Nov 2, 2024

Yeah, looking at your biggest "years", they're very close to 1000x further from 1970 than the present day is, so my guess is that there's a milli/micro/nano conversion error somewhere on the way from Parquet to Snowflake. Try dividing the timestamps by 1000 and see what dates you get.
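In Snowflake SQL that division could look something like this (a sketch, untested: DATE_PART(EPOCH_SECOND, ...) pulls back out the raw integer that got decoded as seconds, and the scale argument of 3 tells TO_TIMESTAMP_NTZ to treat it as milliseconds instead):

SELECT
    datetime_utc AS corrupted,
    TO_TIMESTAMP_NTZ(DATE_PART(EPOCH_SECOND, datetime_utc), 3) AS repaired
FROM pudl.s3_data.core_eia930_hourly_operations
LIMIT 10;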

@zaneselvans added the parquet label Nov 2, 2024
@zaneselvans
Member

zaneselvans commented Nov 2, 2024

So I think Snowflake is getting milliseconds, and interpreting them as seconds.

# Take the integer millisecond values and scale them up 1000x, then reinterpret
# as milliseconds -- equivalent to decoding milliseconds as if they were seconds:
thousander = ops["datetime_utc"].astype(int) * 1000
thousander.astype("datetime64[ms]").min()
# Timestamp('47466-05-16 08:00:00')
thousander.astype("datetime64[ms]").max()
# Timestamp('56553-09-17 16:00:00')
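If that's what's happening, the cleaner fix is probably at load time rather than patching afterwards. Snowflake's Parquet file format has a USE_LOGICAL_TYPE option that tells it to honor the Parquet logical type annotations (which carry the millisecond precision) instead of guessing from the raw INT64. Worth a try, though I haven't verified it against this table (sketch):

CREATE OR REPLACE FILE FORMAT pudl_parquet_format
  TYPE = parquet
  USE_LOGICAL_TYPE = TRUE;

-- then re-run the COPY INTO from the original worksheet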

@zaneselvans
Member

@RyanMasson I'm going to go ahead and close this since I'm pretty sure we found the issue and it's (thankfully!) not in our data.
