After merging #2948 into dev, some of us started getting sporadic failures in the Hypothesis-based test_filter_for_freshest_data test. On some machines (e.g. Zane's laptop) it failed every time. On others (like the CI on GitHub) it only fails rarely (see error output below). It doesn't seem to be a material problem and @jdangerx is out this week, so for the moment it's been marked XFAIL until he can take a look at it upon his return.
________________________ test_filter_for_freshest_data _________________________
@hypothesis.given(example_schema.strategy(size=3))
> def test_filter_for_freshest_data(df):
test/unit/io_managers_test.py:372:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
df = entity_id date utility_type publication_time int_factoid float_factoid str_factoid
0 ...0
2 1970-01-01 electric 1970-01-01 00:00:00.000000000 0 0.0
@hypothesis.given(example_schema.strategy(size=3))
def test_filter_for_freshest_data(df):
# XBRL context is the identifying metadata for reported values
xbrl_context_cols = ["entity_id", "date", "utility_type"]
filing_metadata_cols = ["publication_time", "filing_name"]
primary_key = xbrl_context_cols + filing_metadata_cols
deduped = FercXBRLSQLiteIOManager.filter_for_freshest_data(
df, primary_key=primary_key
)
example_schema.validate(deduped)
# every post-deduplication row exists in the original rows
assert (deduped.merge(df, how="left", indicator=True)._merge != "left_only").all()
    # for every [entity_id, utility_type, date] - there is only one row
assert (~deduped.duplicated(subset=xbrl_context_cols)).all()
# for every *context* in the input there is a corresponding row in the output
original_contexts = df.groupby(xbrl_context_cols, as_index=False).last()
paired_by_context = original_contexts.merge(
deduped,
on=xbrl_context_cols,
how="outer",
suffixes=["_in", "_out"],
indicator=True,
).set_index(xbrl_context_cols)
hypothesis.note(f"Found these contexts in input data:\n{original_contexts}")
hypothesis.note(f"The freshest data:\n{deduped}")
hypothesis.note(f"Paired by context:\n{paired_by_context}")
> assert (paired_by_context._merge == "both").all()
E AssertionError: assert False
E + where False = <bound method Series.all of entity_id date utility_type\n 1970-01-01 electric False\n electric False\nName: _merge, dtype: bool>()
E + where <bound method Series.all of entity_id date utility_type\n 1970-01-01 electric False\n electric False\nName: _merge, dtype: bool> = entity_id date utility_type\n 1970-01-01 electric left_only\n electric right_only\nName: _merge, dtype: category\nCategories (3, object): ['left_only', 'right_only', 'both'] == 'both'.all
E + where entity_id date utility_type\n 1970-01-01 electric left_only\n electric right_only\nName: _merge, dtype: category\nCategories (3, object): ['left_only', 'right_only', 'both'] = publication_time_in int_factoid_in float_factoid_in str_factoid_in publication_time_out int_factoid_out float_factoid_out str_factoid_out _merge\nentity_id date utility_type \n 1970-01-01 electric 1970-01-01 0.0 0.0 NaT NaN NaN NaN left_only\n electric NaT NaN NaN NaN 1970-01-01 00:00:00.000000001 0.0 0.0 right_only._merge
E Falsifying example: test_filter_for_freshest_data(
E df=
E entity_id date utility_type publication_time int_factoid float_factoid str_factoid
E 0 1970-01-01 electric 1970-01-01 00:00:00.000000000 0 0.0
E 1 0 1970-01-01 electric 1970-01-01 00:00:00.000000001 0 0.0
E 2 1970-01-01 electric 1970-01-01 00:00:00.000000000 0 0.0
E ,
E )
E Found these contexts in input data:
E entity_id date utility_type publication_time int_factoid float_factoid str_factoid
E 0 1970-01-01 electric 1970-01-01 0 0.0
E The freshest data:
E entity_id date utility_type publication_time int_factoid float_factoid str_factoid
E 1 0 1970-01-01 electric 1970-01-01 00:00:00.000000001 0 0.0
E Paired by context:
E publication_time_in int_factoid_in float_factoid_in str_factoid_in publication_time_out int_factoid_out float_factoid_out str_factoid_out _merge
E entity_id date utility_type
E 1970-01-01 electric 1970-01-01 0.0 0.0 NaT NaN NaN NaN left_only
E electric NaT NaN NaN NaN 1970-01-01 00:00:00.000000001 0.0 0.0 right_only
E Explanation:
E These lines were always and only run by failing examples:
E /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/_pytest/assertion/util.py:134
E /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/numpy/core/_dtype.py:336
E /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:107
E /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:112
E /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:138
E (and 66 more with settings.verbosity >= verbose)
test/unit/io_managers_test.py:398: AssertionError
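For context, the property under test is "for each XBRL context, keep only the most recently published filing's row." A hypothetical minimal sketch of that idea (not the actual FercXBRLSQLiteIOManager implementation, whose primary key also includes filing metadata) looks like:

```python
import pandas as pd

def filter_for_freshest(df: pd.DataFrame, context_cols: list[str]) -> pd.DataFrame:
    """Keep only the most recently published row for each XBRL context.

    Hypothetical sketch: sort by publication_time, then keep the last
    (i.e. freshest) row within each context group.
    """
    return (
        df.sort_values("publication_time")
        .drop_duplicates(subset=context_cols, keep="last")
        .reset_index(drop=True)
    )

df = pd.DataFrame(
    {
        "entity_id": ["a", "a", "b"],
        "date": ["2020-01-01", "2020-01-01", "2020-01-01"],
        "utility_type": ["electric", "electric", "electric"],
        "publication_time": pd.to_datetime(
            ["2021-01-01", "2021-06-01", "2021-01-01"]
        ),
        "value": [1, 2, 3],
    }
)
fresh = filter_for_freshest(df, ["entity_id", "date", "utility_type"])
# Context "a" keeps only its later (2021-06-01) filing; "b" keeps its only row.
```

The test's failing assertion checks the complementary invariant: every context present in the input must also appear in the deduplicated output, which is what the outer merge with `indicator=True` verifies.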
zaneselvans added the ferc1, testing, xbrl, and dagster labels on Oct 27, 2023.
OK, after some digging... it looks like we're not actually identifying the correct "original contexts" in the test, due to some very surprising df.groupby behavior.
You can see in the error message above that the 0 entity ID context is not being detected, only the "" entity ID:
E Found these contexts in input data:
E entity_id date utility_type publication_time int_factoid float_factoid str_factoid
E 0 1970-01-01 electric 1970-01-01 0 0.0
And you can get a debugger in there: I used hypothesis.settings(print_blob=True, max_examples=10_000) to get a reproducible binary blob, then pasted that into @hypothesis.reproduce_failure("6.87.3", b"AXicY2BgYGBkAANGJBIF4JVnZIhiJKwKAA9lAGY=") to, well... reproduce the failure.
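The replay workflow mentioned above looks roughly like this (a hypothetical minimal example, not the actual PUDL test): with print_blob=True, Hypothesis prints a ready-to-paste @reproduce_failure(...) decorator whenever a test fails, so you can replay the exact failing input under a debugger.

```python
from hypothesis import given, settings, strategies as st

# print_blob=True makes Hypothesis print a @reproduce_failure(version, blob)
# decorator on failure; pasting that decorator onto the test replays the
# same failing example deterministically.
@settings(print_blob=True, max_examples=50)
@given(st.text())
def test_utf8_roundtrip(s):
    # A property that always holds here; if it ever failed, Hypothesis
    # would print the blob needed to reproduce the failing example.
    assert s.encode("utf-8").decode("utf-8") == s

# A @given-decorated test can be invoked directly to run the property.
test_utf8_roundtrip()
```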
Indeed! The null byte seems to break pandas string handling, since it sometimes gets passed through to a C library: pandas-dev/pandas#53720. Changing this test to not allow \x00 in the entity ID.
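One hedged sketch of that fix is to constrain the Hypothesis strategy that generates entity IDs so it never produces a null byte (the strategy name below is hypothetical; the real test derives its strategy from example_schema):

```python
from hypothesis import strategies as st

# Hypothetical entity-ID strategy that filters out any string containing
# a null byte, the character that triggers pandas-dev/pandas#53720.
entity_id_strategy = st.text(min_size=1).filter(lambda s: "\x00" not in s)

# Draw a sample value; by construction it can never contain "\x00".
sample = entity_id_strategy.example()
```

Using `.filter(...)` is the most conservative route; restricting the alphabet passed to `st.text()` would avoid discarded examples but requires enumerating the allowed character set.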