
test_filter_for_freshest_data occasionally fails #2983

Closed
zaneselvans opened this issue Oct 27, 2023 · 2 comments · Fixed by #2993 or #2998
Labels
dagster Issues related to our use of the Dagster orchestrator ferc1 Anything having to do with FERC Form 1 testing Writing tests, creating test data, automating testing, etc. xbrl Related to the FERC XBRL transition

Comments

@zaneselvans (Member)

After merging #2948 into dev, some of us started getting sporadic failures in the Hypothesis-based test_filter_for_freshest_data test:

 pytest test/unit/io_managers_test.py::test_filter_for_freshest_data

On some machines (e.g. Zane's laptop) it fails every time. On others (like the CI on GitHub) it fails only rarely (see error output below). It doesn't seem to be a material problem and @jdangerx is out this week, so for the moment it has been marked XFAIL until he can take a look upon his return.
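For context, the semantics the test exercises can be sketched roughly as follows. This is a hypothetical simplification, not the actual FercXBRLSQLiteIOManager implementation: it assumes that "freshest" means the row with the latest publication_time for each XBRL context.

```python
import pandas as pd

def filter_for_freshest_data(df: pd.DataFrame, primary_key: list[str]) -> pd.DataFrame:
    # Hypothetical sketch: for each XBRL context (the primary-key columns
    # minus the filing metadata), keep only the most recently published row.
    context_cols = [
        c for c in primary_key if c not in ("publication_time", "filing_name")
    ]
    return df.sort_values("publication_time").drop_duplicates(
        subset=context_cols, keep="last"
    )
```

The test then checks that every surviving row came from the input, that each context appears exactly once, and that no context was dropped.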

________________________ test_filter_for_freshest_data _________________________

    @hypothesis.given(example_schema.strategy(size=3))
>   def test_filter_for_freshest_data(df):

test/unit/io_managers_test.py:372:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

df =   entity_id       date utility_type              publication_time  int_factoid  float_factoid str_factoid
0           ...0
2           1970-01-01     electric 1970-01-01 00:00:00.000000000            0            0.0

    @hypothesis.given(example_schema.strategy(size=3))
    def test_filter_for_freshest_data(df):
        # XBRL context is the identifying metadata for reported values
        xbrl_context_cols = ["entity_id", "date", "utility_type"]
        filing_metadata_cols = ["publication_time", "filing_name"]
        primary_key = xbrl_context_cols + filing_metadata_cols
        deduped = FercXBRLSQLiteIOManager.filter_for_freshest_data(
            df, primary_key=primary_key
        )
        example_schema.validate(deduped)

        # every post-deduplication row exists in the original rows
        assert (deduped.merge(df, how="left", indicator=True)._merge != "left_only").all()
        # for every [entity_id, utility_type, date] - there is only one row
        assert (~deduped.duplicated(subset=xbrl_context_cols)).all()
        # for every *context* in the input there is a corresponding row in the output
        original_contexts = df.groupby(xbrl_context_cols, as_index=False).last()
        paired_by_context = original_contexts.merge(
            deduped,
            on=xbrl_context_cols,
            how="outer",
            suffixes=["_in", "_out"],
            indicator=True,
        ).set_index(xbrl_context_cols)
        hypothesis.note(f"Found these contexts in input data:\n{original_contexts}")
        hypothesis.note(f"The freshest data:\n{deduped}")
        hypothesis.note(f"Paired by context:\n{paired_by_context}")
>       assert (paired_by_context._merge == "both").all()
E       AssertionError: assert False
E        +  where False = <bound method Series.all of entity_id  date        utility_type\n           1970-01-01  electric        False\n                       electric        False\nName: _merge, dtype: bool>()
E        +    where <bound method Series.all of entity_id  date        utility_type\n           1970-01-01  electric        False\n                       electric        False\nName: _merge, dtype: bool> = entity_id  date        utility_type\n           1970-01-01  electric         left_only\n                       electric        right_only\nName: _merge, dtype: category\nCategories (3, object): ['left_only', 'right_only', 'both'] == 'both'.all
E        +      where entity_id  date        utility_type\n           1970-01-01  electric         left_only\n                       electric        right_only\nName: _merge, dtype: category\nCategories (3, object): ['left_only', 'right_only', 'both'] =                                   publication_time_in  int_factoid_in  float_factoid_in str_factoid_in          publication_time_out  int_factoid_out  float_factoid_out str_factoid_out      _merge\nentity_id date       utility_type                                                                                                                                                                   \n          1970-01-01 electric              1970-01-01             0.0               0.0                                          NaT              NaN                NaN             NaN   left_only\n                     electric                     NaT             NaN               NaN            NaN 1970-01-01 00:00:00.000000001              0.0                0.0                  right_only._merge
E       Falsifying example: test_filter_for_freshest_data(
E           df=
E                 entity_id       date utility_type              publication_time  int_factoid  float_factoid str_factoid
E               0           1970-01-01     electric 1970-01-01 00:00:00.000000000            0            0.0
E               1        0 1970-01-01     electric 1970-01-01 00:00:00.000000001            0            0.0
E               2           1970-01-01     electric 1970-01-01 00:00:00.000000000            0            0.0
E           ,
E       )
E       Found these contexts in input data:
E         entity_id       date utility_type publication_time  int_factoid  float_factoid str_factoid
E       0           1970-01-01     electric       1970-01-01            0            0.0
E       The freshest data:
E         entity_id       date utility_type              publication_time  int_factoid  float_factoid str_factoid
E       1        0 1970-01-01     electric 1970-01-01 00:00:00.000000001            0            0.0
E       Paired by context:
E                                         publication_time_in  int_factoid_in  float_factoid_in str_factoid_in          publication_time_out  int_factoid_out  float_factoid_out str_factoid_out      _merge
E       entity_id date       utility_type
E                 1970-01-01 electric              1970-01-01             0.0               0.0                                          NaT              NaN                NaN             NaN   left_only
E                            electric                     NaT             NaN               NaN            NaN 1970-01-01 00:00:00.000000001              0.0                0.0                  right_only
E       Explanation:
E           These lines were always and only run by failing examples:
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/_pytest/assertion/util.py:134
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/numpy/core/_dtype.py:336
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:107
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:112
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:138
E               (and 66 more with settings.verbosity >= verbose)

test/unit/io_managers_test.py:398: AssertionError
@zaneselvans zaneselvans added ferc1 Anything having to do with FERC Form 1 testing Writing tests, creating test data, automating testing, etc. xbrl Related to the FERC XBRL transition dagster Issues related to our use of the Dagster orchestrator labels Oct 27, 2023

jdangerx commented Oct 30, 2023

OK, after some digging... it looks like we're not actually identifying the correct "original contexts" in the test, due to some very surprising df.groupby behavior...

You can see in the error message above that the 0 entity ID context is not being detected; only the "" entity ID is:

E       Found these contexts in input data:
E         entity_id       date utility_type publication_time  int_factoid  float_factoid str_factoid
E       0           1970-01-01     electric       1970-01-01            0            0.0

And if you get a debugger in there (I used hypothesis.settings(print_blob=True, max_examples=10_000) to get a reproducible binary blob, then pasted that into @hypothesis.reproduce_failure("6.87.3", b"AXicY2BgYGBkAANGJBIF4JVnZIhiJKwKAA9lAGY=") to, well... reproduce the failure), you can see:

> df
  entity_id       date utility_type publication_time  int_factoid  float_factoid str_factoid
0           1970-01-01     electric       1970-01-01            0            0.0            
1           1970-01-01     electric       1970-01-01            0            0.0            
2        0 1970-01-01     electric       1970-01-01            0            0.0     
> df.entity_id.value_counts()
entity_id
      2
0    1
Name: count, dtype: int64
> df.groupby("entity_id", dropna=False).groups.keys()
dict_keys([''])

This... isn't how I expect groupby to work - I expected both '' and '0' to show up as group keys.
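For comparison, with ordinary strings (no embedded null bytes) groupby does surface every distinct value as a group key, which is the behavior the test assumes:

```python
import pandas as pd

# With clean string keys, groupby treats '' and '0' as distinct groups.
df = pd.DataFrame({"entity_id": ["", "", "0"], "value": [1, 2, 3]})
keys = sorted(df.groupby("entity_id").groups.keys())
# keys is ['', '0'] here; in the failing example above, the row whose
# entity_id was actually '\x000' collapsed into the '' group instead.
```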

--

Further digging:

> df.iloc[2].entity_id
'\x000'

So maybe there's something funny with the null byte going on?
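One way to confirm that suspicion is to scan the column for embedded NUL bytes directly. The value '\x000' often renders as plain '0', which is why the problem was invisible until inspecting repr(df.iloc[2].entity_id):

```python
import pandas as pd

# Flag any entity_id values that contain an embedded NUL byte.
df = pd.DataFrame({"entity_id": ["", "", "\x000"]})
has_nul = df["entity_id"].str.contains("\x00", regex=False)
# has_nul is True only for the third row.
```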

@jdangerx (Member)

Indeed! The null byte seems to break pandas string handling, since strings sometimes get passed through to a C library: see pandas-dev/pandas#53720. I'm changing this test to not allow \x00 in the entity ID.
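A minimal sketch of that kind of fix (this is an illustration, not the actual PR): constrain the Hypothesis strategy so generated entity IDs never contain a NUL byte. The `entity_id_strategy` name here is hypothetical.

```python
from hypothesis import given, strategies as st

# Hypothetical sketch: filter the text strategy so pandas' C-level string
# paths never see an embedded NUL byte in generated entity IDs.
entity_id_strategy = st.text(max_size=10).filter(lambda s: "\x00" not in s)
```

Using `.filter()` keeps the change version-proof; building the alphabet from `st.characters()` with the NUL byte excluded would avoid rejection sampling at the cost of a more intricate strategy definition.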

@jdangerx jdangerx linked a pull request Oct 31, 2023 that will close this issue
@jdangerx jdangerx moved this from New to In review in Catalyst Megaproject Oct 31, 2023
@jdangerx jdangerx moved this from In review to Done in Catalyst Megaproject Oct 31, 2023
@jdangerx jdangerx closed this as completed Nov 2, 2023