Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding all_rows_overlap to fugue #244

Merged
merged 2 commits into from
Nov 15, 2023
Merged

adding all_rows_overlap to fugue #244

merged 2 commits into from
Nov 15, 2023

Conversation

fdosani
Copy link
Member

@fdosani fdosani commented Nov 10, 2023

Ref: #214
Support for all_rows_overlap with Fugue

@fdosani
Copy link
Member Author

fdosani commented Nov 10, 2023

@goodwanghan FYI. Had some time, so adding in some more functionality to the Fugue side.

@fdosani fdosani marked this pull request as draft November 10, 2023 15:52
@goodwanghan
Copy link
Contributor

Wow nice

@fdosani
Copy link
Member Author

fdosani commented Nov 10, 2023

Wow nice

Well this sucks. Seems like the last release of fugue is breaking some of the tests :(

@goodwanghan
Copy link
Contributor

goodwanghan commented Nov 10, 2023 via email

@goodwanghan
Copy link
Contributor

I can reproduce this issue, will take a look today

@goodwanghan
Copy link
Contributor

So after some initial investigation, it is related with pandas. Fugue 0.8.7 has a big update that is it relies more on pandas extension d types. It is an interesting case why the current datacompy logic breaks with this change. I will investigate more.

@fdosani
Copy link
Member Author

fdosani commented Nov 10, 2023

So after some initial investigation, it is related with pandas. Fugue 0.8.7 has a big update that is it relies more on pandas extension d types. It is an interesting case why the current datacompy logic breaks with this change. I will investigate more.

Ahh interesting. Thanks for helping out here and taking a peek, really appreciate it.

@goodwanghan
Copy link
Contributor

goodwanghan commented Nov 10, 2023

no problem, actually think this is a very good case, i have found an example
You know pandas will use object type for string columns, but in new pandas, it can be StringDType, and it can be an arrow type (it is a mess I know). So in this case we want to treat them as the same type? In the current logic, we think they are different. That is why we saw those failed tests.

To generalize, datacompy needs to consider the comparison between original type and extension dtypes, do we think they are the same?

Examples are:

object vs StringDType
int vs Int
bool vs Bool

@fdosani
Copy link
Member Author

fdosani commented Nov 10, 2023

no problem, actually think this is a very good case, i have found an example

You know pandas will use object type for string columns, but in new pandas, it can be StringDType, and it can be an arrow type (it is a mess I know). So in this case we want to treat them as the same type? In the current logic, we think they are different. That is why we saw those failed tests.

To generalize, datacompy needs to consider the comparison between original type and extension dtypes, do we think they are the same?

Examples are:

object vs StringDType

int vs Int

bool vs Bool

Yup those should be considered the same type IMHO. Off the top of my head pandas does store other things as objects too though. I'm pretty there is some conversion logic which happens in the compare class. Just travelling but I can take a look once I'm home.

@fdosani
Copy link
Member Author

fdosani commented Nov 14, 2023

Waiting on #245 first before rebasing, reviewing, and merging

@fdosani fdosani marked this pull request as ready for review November 14, 2023 18:19
@jdawang
Copy link
Contributor

jdawang commented Nov 15, 2023

We should probably list the added functionality for fugue in the docs. Right now only is_match and report are mentioned.

@fdosani
Copy link
Member Author

fdosani commented Nov 15, 2023

We should probably list the added functionality for fugue in the docs. Right now only is_match and report are mentioned.

Good call! Will add this in a bit.

@CLAassistant
Copy link

CLAassistant commented Nov 15, 2023

CLA assistant check
All committers have signed the CLA.

@fdosani
Copy link
Member Author

fdosani commented Nov 15, 2023

@jdawang updated now

@fdosani fdosani merged commit 8d0e542 into develop Nov 15, 2023
18 checks passed
@fdosani fdosani deleted the fugue-all_rows_overlap branch November 15, 2023 17:33
rhaffar pushed a commit to rhaffar/datacompy that referenced this pull request Sep 12, 2024
* adding all_rows_overlap to fugue

* updating fugue functions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants