
test_filter_for_freshest_data occasionally fails #2983

Closed
zaneselvans opened this issue Oct 27, 2023 · 2 comments · Fixed by #2993 or #2998
Labels
dagster Issues related to our use of the Dagster orchestrator ferc1 Anything having to do with FERC Form 1 testing Writing tests, creating test data, automating testing, etc. xbrl Related to the FERC XBRL transition

Comments

@zaneselvans (Member)

After merging #2948 into dev, some of us started getting sporadic failures in the Hypothesis-based test_filter_for_freshest_data test:

 pytest test/unit/io_managers_test.py::test_filter_for_freshest_data

On some machines (e.g. Zane's laptop) it fails every time. On others (like the CI on GitHub) it fails only rarely (see error output below). It doesn't seem to be a material problem and @jdangerx is out this week, so for the moment it has been marked XFAIL until he can take a look upon his return.
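For context, the semantics the test exercises can be sketched roughly as follows. This is a hypothetical simplification, not the actual FercXBRLSQLiteIOManager implementation: it assumes that "freshest" means the row with the latest publication_time for each XBRL context.

```python
import pandas as pd

def filter_for_freshest_data(df: pd.DataFrame, primary_key: list[str]) -> pd.DataFrame:
    # Hypothetical sketch: for each XBRL context (the primary-key columns
    # minus the filing metadata), keep only the most recently published row.
    context_cols = [
        c for c in primary_key if c not in ("publication_time", "filing_name")
    ]
    return df.sort_values("publication_time").drop_duplicates(
        subset=context_cols, keep="last"
    )
```

The test then checks that every surviving row came from the input, that each context appears exactly once, and that no context was dropped.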

________________________ test_filter_for_freshest_data _________________________

    @hypothesis.given(example_schema.strategy(size=3))
>   def test_filter_for_freshest_data(df):

test/unit/io_managers_test.py:372:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

df =   entity_id       date utility_type              publication_time  int_factoid  float_factoid str_factoid
0           ...0
2           1970-01-01     electric 1970-01-01 00:00:00.000000000            0            0.0

    @hypothesis.given(example_schema.strategy(size=3))
    def test_filter_for_freshest_data(df):
        # XBRL context is the identifying metadata for reported values
        xbrl_context_cols = ["entity_id", "date", "utility_type"]
        filing_metadata_cols = ["publication_time", "filing_name"]
        primary_key = xbrl_context_cols + filing_metadata_cols
        deduped = FercXBRLSQLiteIOManager.filter_for_freshest_data(
            df, primary_key=primary_key
        )
        example_schema.validate(deduped)

        # every post-deduplication row exists in the original rows
        assert (deduped.merge(df, how="left", indicator=True)._merge != "left_only").all()
        # for every [entity_id, utility_type, date] - there is only one row
        assert (~deduped.duplicated(subset=xbrl_context_cols)).all()
        # for every *context* in the input there is a corresponding row in the output
        original_contexts = df.groupby(xbrl_context_cols, as_index=False).last()
        paired_by_context = original_contexts.merge(
            deduped,
            on=xbrl_context_cols,
            how="outer",
            suffixes=["_in", "_out"],
            indicator=True,
        ).set_index(xbrl_context_cols)
        hypothesis.note(f"Found these contexts in input data:\n{original_contexts}")
        hypothesis.note(f"The freshest data:\n{deduped}")
        hypothesis.note(f"Paired by context:\n{paired_by_context}")
>       assert (paired_by_context._merge == "both").all()
E       AssertionError: assert False
E        +  where False = <bound method Series.all of entity_id  date        utility_type\n           1970-01-01  electric        False\n                       electric        False\nName: _merge, dtype: bool>()
E        +    where <bound method Series.all of entity_id  date        utility_type\n           1970-01-01  electric        False\n                       electric        False\nName: _merge, dtype: bool> = entity_id  date        utility_type\n           1970-01-01  electric         left_only\n                       electric        right_only\nName: _merge, dtype: category\nCategories (3, object): ['left_only', 'right_only', 'both'] == 'both'.all
E        +      where entity_id  date        utility_type\n           1970-01-01  electric         left_only\n                       electric        right_only\nName: _merge, dtype: category\nCategories (3, object): ['left_only', 'right_only', 'both'] =                                   publication_time_in  int_factoid_in  float_factoid_in str_factoid_in          publication_time_out  int_factoid_out  float_factoid_out str_factoid_out      _merge\nentity_id date       utility_type                                                                                                                                                                   \n          1970-01-01 electric              1970-01-01             0.0               0.0                                          NaT              NaN                NaN             NaN   left_only\n                     electric                     NaT             NaN               NaN            NaN 1970-01-01 00:00:00.000000001              0.0                0.0                  right_only._merge
E       Falsifying example: test_filter_for_freshest_data(
E           df=
E                 entity_id       date utility_type              publication_time  int_factoid  float_factoid str_factoid
E               0           1970-01-01     electric 1970-01-01 00:00:00.000000000            0            0.0
E               1        0 1970-01-01     electric 1970-01-01 00:00:00.000000001            0            0.0
E               2           1970-01-01     electric 1970-01-01 00:00:00.000000000            0            0.0
E           ,
E       )
E       Found these contexts in input data:
E         entity_id       date utility_type publication_time  int_factoid  float_factoid str_factoid
E       0           1970-01-01     electric       1970-01-01            0            0.0
E       The freshest data:
E         entity_id       date utility_type              publication_time  int_factoid  float_factoid str_factoid
E       1        0 1970-01-01     electric 1970-01-01 00:00:00.000000001            0            0.0
E       Paired by context:
E                                         publication_time_in  int_factoid_in  float_factoid_in str_factoid_in          publication_time_out  int_factoid_out  float_factoid_out str_factoid_out      _merge
E       entity_id date       utility_type
E                 1970-01-01 electric              1970-01-01             0.0               0.0                                          NaT              NaN                NaN             NaN   left_only
E                            electric                     NaT             NaN               NaN            NaN 1970-01-01 00:00:00.000000001              0.0                0.0                  right_only
E       Explanation:
E           These lines were always and only run by failing examples:
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/_pytest/assertion/util.py:134
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/numpy/core/_dtype.py:336
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:107
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:112
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:138
E               (and 66 more with settings.verbosity >= verbose)

test/unit/io_managers_test.py:398: AssertionError
@zaneselvans zaneselvans added ferc1 Anything having to do with FERC Form 1 testing Writing tests, creating test data, automating testing, etc. xbrl Related to the FERC XBRL transition dagster Issues related to our use of the Dagster orchestrator labels Oct 27, 2023

jdangerx commented Oct 30, 2023

OK, after some digging... it looks like we're not actually identifying the correct "original contexts" in the test, due to some very surprising df.groupby behavior...

You can see in the error message above that the 0 entity ID context is not being detected; only the "" entity ID is:

E       Found these contexts in input data:
E         entity_id       date utility_type publication_time  int_factoid  float_factoid str_factoid
E       0           1970-01-01     electric       1970-01-01            0            0.0

And if you get a debugger in there (I used hypothesis.settings(print_blob=True, max_examples=10_000) to get a reproducible binary blob, then pasted that into @hypothesis.reproduce_failure("6.87.3", b"AXicY2BgYGBkAANGJBIF4JVnZIhiJKwKAA9lAGY=") to, well... reproduce the failure), you can see:

> df
  entity_id       date utility_type publication_time  int_factoid  float_factoid str_factoid
0           1970-01-01     electric       1970-01-01            0            0.0            
1           1970-01-01     electric       1970-01-01            0            0.0            
2        0 1970-01-01     electric       1970-01-01            0            0.0     
> df.entity_id.value_counts()
entity_id
      2
0    1
Name: count, dtype: int64
> df.groupby("entity_id", dropna=False).groups.keys()
dict_keys([''])

This... isn't how I expect groupby to work - I expected both '' and '0' to show up as group keys.
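For comparison, with ordinary strings (no embedded null bytes) groupby does surface every distinct value as a group key, which is the behavior the test assumes:

```python
import pandas as pd

# With clean string keys, groupby treats '' and '0' as distinct groups.
df = pd.DataFrame({"entity_id": ["", "", "0"], "value": [1, 2, 3]})
keys = sorted(df.groupby("entity_id").groups.keys())
# keys is ['', '0'] here; in the failing example above, the row whose
# entity_id was actually '\x000' collapsed into the '' group instead.
```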

--

Further digging:

> df.iloc[2].entity_id
'\x000'

So maybe there's something funny with the null byte going on?
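One way to confirm that suspicion is to scan the column for embedded NUL bytes directly. The value '\x000' often renders as plain '0', which is why the problem was invisible until inspecting repr(df.iloc[2].entity_id):

```python
import pandas as pd

# Flag any entity_id values that contain an embedded NUL byte.
df = pd.DataFrame({"entity_id": ["", "", "\x000"]})
has_nul = df["entity_id"].str.contains("\x00", regex=False)
# has_nul is True only for the third row.
```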

@jdangerx (Member)

Indeed! The null byte seems to break pandas string handling, since strings sometimes get passed through to a C library: see pandas-dev/pandas#53720. I'm changing this test to not allow \x00 in the entity ID.
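A minimal sketch of that kind of fix (this is an illustration, not the actual PR): constrain the Hypothesis strategy so generated entity IDs never contain a NUL byte. The `entity_id_strategy` name here is hypothetical.

```python
from hypothesis import given, strategies as st

# Hypothetical sketch: filter the text strategy so pandas' C-level string
# paths never see an embedded NUL byte in generated entity IDs.
entity_id_strategy = st.text(max_size=10).filter(lambda s: "\x00" not in s)
```

Using `.filter()` keeps the change version-proof; building the alphabet from `st.characters()` with the NUL byte excluded would avoid rejection sampling at the cost of a more intricate strategy definition.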

@jdangerx jdangerx linked a pull request Oct 31, 2023 that will close this issue
@jdangerx jdangerx moved this from New to In review in Catalyst Megaproject Oct 31, 2023
@jdangerx jdangerx moved this from In review to Done in Catalyst Megaproject Oct 31, 2023
@jdangerx jdangerx closed this as completed Nov 2, 2023