Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support Index.to_frame() #894

Open
wants to merge 27 commits into
base: main
Choose a base branch
from
Open

Conversation

mattyopl
Copy link
Contributor

@mattyopl mattyopl commented Aug 8, 2024

Intern Starter Task

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

can test by checking:

import pandas as pd
idx = pd.Index(['Ant', 'Bear', 'Cow'], name='animal')
df = idx.to_frame()
print(df)
df = idx.to_frame(False)
print(df)
df = idx.to_frame(name = "food")
print(df)
df = idx.to_frame(False, name = "food")
print(df)

import bigframes.core as core
idx = core.indexes.Index(['Ant', 'Bear', 'Cow'], name='animal')
df = idx.to_frame()
print(df)
df = idx.to_frame(False)
print(df)
df = idx.to_frame(name = "food")
print(df)
df = idx.to_frame(False, name = "food")
print(df)

Fixes internal 356891401 🦕

@mattyopl mattyopl requested a review from GarrettWu August 8, 2024 17:51
@mattyopl mattyopl requested review from a team as code owners August 8, 2024 17:51
@product-auto-label product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Aug 8, 2024
Comment on lines 129 to 142
def to_frame(
self, index: bool = True, name: blocks.Label | None = None
) -> bigframes.dataframe.DataFrame:
import numpy as np

provided_name = name if name else self.name
provided_index = self.values if index else np.arange(len(self.values))
result = bigframes.dataframe.DataFrame(
{provided_name: self.values}, index=provided_index
)
if index: # matching pandas behavior
result.index.name = self.name

return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redefinition

import numpy as np
provided_name = name if name else self.name
provided_index = self.values if index else np.arange(len(self.values))
result = bigframes.dataframe.DataFrame({provided_name: self.values}, index= provided_index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the index is a MultiIndex, the resulting DataFrame will need to have multiple columns (and names will need to be a list-like with length equal to the multi-index level depth).

) -> bigframes.dataframe.DataFrame:
import numpy as np
provided_name = name if name else self.name
provided_index = self.values if index else np.arange(len(self.values))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just provide None if want default sequential index. np.arrange may create a very large in-memory array here if the original index is large (some BigFrames objects have billions of rows).

def to_frame(
self, index: bool = True, name: blocks.Label | None = None
) -> bigframes.dataframe.DataFrame:
import numpy as np
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put the import at the top of the file for easier management. Actually numpy is already imported in the file.

Import in the method only to avoid circular imports.

bigframes/core/indexes/base.py Show resolved Hide resolved
@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Aug 9, 2024
@mattyopl mattyopl assigned mattyopl and unassigned orrbradford Aug 9, 2024
@mattyopl mattyopl enabled auto-merge (squash) August 9, 2024 18:57
@@ -115,6 +115,34 @@ def from_frame(
index._linked_frame = frame
return index

def to_frame(self, index: bool = True, name=None) -> bigframes.dataframe.DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add back type hint for "name"? Why it is removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added back, still getting familiar with static type checker in Python


multi = isinstance(self, MultiIndex)

if multi:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In stead of checking type and do everything in base class, should override the method in derived class(MultiIndex).


if multi:
columns = [
[self.values[j][i] for j in range(len(self.values))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.values calls to_numpy, which downloads everything and not scalable. Maybe take a look at Index.from_frame() and Series.to_frame()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed for Index. per offline conversation with @TrevorBergeron, efficient MultiIndex implementation is blocked by bugs in DataFrame construction with MultiIndex

bug TL;DR

  1. Index with array values being created for DataFrame instead of MultiIndex
bf_idx = indexes.MultiIndex.from_arrays([["a", "b", "c"], ["d", "e", "f"]])
test = bigframes.dataframe.DataFrame({1: [2, 4, 6], 2: [1, 3, 5]}, index = bf_idx)
print(test)
  1. block joining for DataFrame creation using MultiIndex and Dict of Series data throws
bf_idx = indexes.MultiIndex.from_arrays([["a", "b", "c"], ["d", "e", "f"]])
nlevels = bf_idx.nlevels
columns : list[bigframes.series.Series] = []
 for level in range(nlevels):
       series = self.get_level_values(level).to_series()
       columns.append(series)
data = {i: column for i, column in enumerate(columns)}
result = bigframes.dataframe.DataFrame(
            data, index= bf_idx
        )

)
bf_idx = indexes.Index(["Ant", "Bear", "Cow"], name="animal")

for index_arg, name_arg in itertools.product(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use @pytest.mark.parametrize to achieve combinations of params.

Which creates multiple separate tests. Easier to find the problem if anyone fails.

Copy link
Contributor

@GarrettWu GarrettWu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM.

)
bf_idx = indexes.Index(["Ant", "Bear", "Cow"], name="animal")

if name_arg is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't need the if else condition. Just

pd_df = pd_idx.to_frame(index=index_arg, name=name_arg)
bf_df = bf_idx.to_frame(index=index_arg, name=name_arg)

is good enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas implementation of name arg handling is slightly different. If we set name=None for pandas, it will create a DataFrame with column names as string None, since technically the default is lib.nodefault (code pointer: https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/_libs/lib.pyi#L32), whereas use None: https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/core/indexes/base.py#L1607-L1666

Copy link
Contributor Author

@mattyopl mattyopl Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add something similar to BigFrames if you think it is better to exactly mirror pandas functionality: I am equally happy either way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a weird behavior of pandas. Thanks for pointing out. Then we shouldn't let the default to be None. But should have sth similar. @TrevorBergeron


pd_idx = pandas.MultiIndex.from_arrays([["a", "b", "c"], ["d", "e", "f"]])
bf_idx = indexes.MultiIndex.from_arrays([["a", "b", "c"], ["d", "e", "f"]])
if name_arg is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants