
feat: support `Index.to_frame()` #894

Open

mattyopl wants to merge 27 commits into googleapis:main from mattyopl:main

+87 −0

Contributor

mattyopl commented Aug 8, 2024

Intern Starter Task

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
Ensure the tests and linter pass
Code coverage does not decrease (if any source code was changed)
Appropriate docs were updated (if necessary)

Can be tested by comparing the pandas output with the BigFrames output:

import pandas as pd
idx = pd.Index(['Ant', 'Bear', 'Cow'], name='animal')
df = idx.to_frame()
print(df)
df = idx.to_frame(False)
print(df)
df = idx.to_frame(name = "food")
print(df)
df = idx.to_frame(False, name = "food")
print(df)

import bigframes.core as core
idx = core.indexes.Index(['Ant', 'Bear', 'Cow'], name='animal')
df = idx.to_frame()
print(df)
df = idx.to_frame(False)
print(df)
df = idx.to_frame(name = "food")
print(df)
df = idx.to_frame(False, name = "food")
print(df)

Fixes internal 356891401 🦕

Matthew Laurence Chen and others added 6 commits

August 6, 2024 22:55


          chore: clean up OWNERS

b62f1ca

- remove inactive users
- add myself


          Merge branch 'googleapis:main' into main

ef7eaf8


          feat: support Index.to_frame()

cd97a45


          Merge branch 'googleapis:main' into main

860f370


          feat: support Index.to_frame()

8761d8a


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

c68a5f5

mattyopl requested a review from GarrettWu

August 8, 2024 17:51

mattyopl requested review from a team as code owners

August 8, 2024 17:51

blunderbuss-gcf bot assigned orrbradford

product-auto-label bot added size: s api: bigquery labels

TrevorBergeron reviewed


bigframes/core/indexes/base.py Outdated

Comment on lines 129 to 142

+                  def to_frame(
+                      self, index: bool = True, name: blocks.Label | None = None
+                  ) -> bigframes.dataframe.DataFrame:
+                      import numpy as np
+                      provided_name = name if name else self.name
+                      provided_index = self.values if index else np.arange(len(self.values))
+                      result = bigframes.dataframe.DataFrame(
+                          {provided_name: self.values}, index=provided_index
+                      )
+                      if index:  # matching pandas behavior
+                          result.index.name = self.name
+                      return result

Contributor

TrevorBergeron Aug 8, 2024

Redefinition

bigframes/core/indexes/base.py Outdated

+                      import numpy as np
+                      provided_name = name if name else self.name
+                      provided_index = self.values if index else np.arange(len(self.values))
+                      result = bigframes.dataframe.DataFrame({provided_name: self.values}, index= provided_index)

Contributor

TrevorBergeron Aug 8, 2024

If the index is a MultiIndex, the resulting DataFrame will need to have multiple columns (and names will need to be a list-like with length equal to the multi-index level depth).
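For reference, a minimal pandas sketch of the MultiIndex behavior this comment describes. The level names `first`/`second` and the replacement names `x`/`y` are made up for illustration, and the BigFrames implementation may end up differing in detail.

```python
import pandas as pd

# A two-level index; the level names are illustrative only.
midx = pd.MultiIndex.from_arrays(
    [["a", "b", "c"], ["d", "e", "f"]], names=["first", "second"]
)

# Without `name`, each level becomes one column named after that level.
default_frame = midx.to_frame(index=False)

# With `name`, pandas expects a list-like with one entry per level.
renamed_frame = midx.to_frame(index=False, name=["x", "y"])

print(default_frame)
print(renamed_frame)
```

Passing a `name` whose length does not match the number of levels raises an error in pandas, which is the constraint the comment refers to.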

bigframes/core/indexes/base.py Outdated

+                  ) -> bigframes.dataframe.DataFrame:
+                      import numpy as np
+                      provided_name = name if name else self.name
+                      provided_index = self.values if index else np.arange(len(self.values))

Contributor

TrevorBergeron Aug 8, 2024

Just pass None if you want the default sequential index. np.arange may create a very large in-memory array here if the original index is large (some BigFrames objects have billions of rows).
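A small local illustration of the difference, using pandas as a stand-in (an assumption: the BigFrames constructor is expected to treat `index=None` analogously, per the comment above). With `index=None`, pandas builds a lazy `RangeIndex` instead of materializing an array of positions.

```python
import numpy as np
import pandas as pd

values = ["Ant", "Bear", "Cow"]

# Explicit positions: np.arange allocates one integer per row up front.
explicit = pd.DataFrame({"animal": values}, index=np.arange(len(values)))

# index=None: pandas falls back to a RangeIndex, which stores only start/stop/step.
default = pd.DataFrame({"animal": values}, index=None)

print(type(explicit.index).__name__, type(default.index).__name__)
```

Both frames carry the same labels 0, 1, 2; only the default one avoids allocating them.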

Matthew Laurence Chen added 2 commits

August 8, 2024 18:04


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

7c2e1b0


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

e9bffa8

GarrettWu reviewed


bigframes/core/indexes/base.py Outdated

+                  def to_frame(
+                      self, index: bool = True, name: blocks.Label | None = None
+                  ) -> bigframes.dataframe.DataFrame:
+                      import numpy as np

Contributor

GarrettWu Aug 8, 2024

Put the import at the top of the file for easier management. Actually numpy is already imported in the file.

Import inside a method only when that is needed to avoid circular imports.


mattyopl and others added 3 commits

August 8, 2024 15:55


          Merge branch 'main' into main

0f2e291


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

2e1afff


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

e376b44

product-auto-label bot added size: m and removed size: s labels

Matthew Laurence Chen and others added 5 commits

August 9, 2024 18:24


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

31f3db8


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

e813c61


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

32eb06c


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

60bea0a


          Merge branch 'main' into main

ec97d72

mattyopl assigned mattyopl and unassigned orrbradford

mattyopl enabled auto-merge (squash)

August 9, 2024 18:57


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

af9f7a4


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

b212126

GarrettWu reviewed


bigframes/core/indexes/base.py Outdated

@@ -115,6 +115,34 @@ def from_frame(
                       index._linked_frame = frame
                       return index
+                  def to_frame(self, index: bool = True, name=None) -> bigframes.dataframe.DataFrame:

Contributor

GarrettWu Aug 9, 2024

Can we add back the type hint for "name"? Why was it removed?

Contributor Author

mattyopl Aug 12, 2024

added back, still getting familiar with static type checker in Python

bigframes/core/indexes/base.py Outdated


		multi = isinstance(self, MultiIndex)

		if multi:

Contributor

GarrettWu Aug 9, 2024

Instead of checking the type and doing everything in the base class, we should override the method in the derived class (MultiIndex).
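A rough sketch of the override pattern suggested here. The two classes below are toy stand-ins, not the BigFrames classes, and `to_frame` returns a plain dict of columns in place of a DataFrame so the example stays self-contained.

```python
class Index:
    """Toy single-level index (hypothetical, for illustration only)."""

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name

    def to_frame(self, name=None):
        # Single-level case: one column, named after the index unless overridden.
        column = name if name is not None else self.name
        return {column: list(self.values)}


class MultiIndex(Index):
    """Toy multi-level index whose values are tuples, one entry per level."""

    def __init__(self, tuples, names):
        super().__init__(tuples)
        self.names = list(names)

    def to_frame(self, name=None):
        # The multi-level logic lives in the subclass, so the base class
        # never needs an isinstance check.
        columns = list(name) if name is not None else self.names
        if len(columns) != len(self.names):
            raise ValueError("name must have one entry per index level")
        return {
            column: [row[level] for row in self.values]
            for level, column in enumerate(columns)
        }


single = Index(["Ant", "Bear"], name="animal").to_frame()
multi = MultiIndex([("a", "d"), ("b", "e")], names=["first", "second"]).to_frame()
print(single)
print(multi)
```

Callers use `to_frame()` the same way on both types, and each class only has to know about its own shape.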

bigframes/core/indexes/base.py Outdated

+                      if multi:
+                          columns = [
+                              [self.values[j][i] for j in range(len(self.values))]

Contributor

GarrettWu Aug 9, 2024

self.values calls to_numpy, which downloads everything and is not scalable. Maybe take a look at Index.from_frame() and Series.to_frame().

Contributor Author

mattyopl Aug 12, 2024

Fixed for Index. Per an offline conversation with @TrevorBergeron, an efficient MultiIndex implementation is blocked by bugs in DataFrame construction with MultiIndex.

bug TL;DR

Index with array values being created for DataFrame instead of MultiIndex

bf_idx = indexes.MultiIndex.from_arrays([["a", "b", "c"], ["d", "e", "f"]])
test = bigframes.dataframe.DataFrame({1: [2, 4, 6], 2: [1, 3, 5]}, index = bf_idx)
print(test)

block joining for DataFrame creation using MultiIndex and Dict of Series data throws

bf_idx = indexes.MultiIndex.from_arrays([["a", "b", "c"], ["d", "e", "f"]])
nlevels = bf_idx.nlevels
columns: list[bigframes.series.Series] = []
# Build one Series per index level.
for level in range(nlevels):
    series = bf_idx.get_level_values(level).to_series()
    columns.append(series)
data = {i: column for i, column in enumerate(columns)}
# Constructing the DataFrame from the dict of Series plus the MultiIndex is the step that throws.
result = bigframes.dataframe.DataFrame(data, index=bf_idx)

tests/system/small/test_index.py Outdated

+                  )
+                  bf_idx = indexes.Index(["Ant", "Bear", "Cow"], name="animal")
+                  for index_arg, name_arg in itertools.product(

Contributor

GarrettWu Aug 9, 2024

You can use @pytest.mark.parametrize to cover the combinations of params.

This creates multiple separate tests, which makes it easier to find the problem if any one of them fails.
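A sketch of what that could look like, written against pandas only so it runs locally. The test name and the chosen parameter values are illustrative and are not taken from the PR's test file.

```python
import pandas as pd
import pytest


# Stacked parametrize decorators produce the cross product of the arguments,
# so pytest collects four independent test cases here.
@pytest.mark.parametrize("index_arg", [True, False])
@pytest.mark.parametrize("name_arg", ["food", "animal"])
def test_to_frame_uses_given_name(index_arg, name_arg):
    idx = pd.Index(["Ant", "Bear", "Cow"], name="animal")
    df = idx.to_frame(index=index_arg, name=name_arg)

    # The single column takes the requested name and holds the index values.
    assert list(df.columns) == [name_arg]
    assert df[name_arg].tolist() == ["Ant", "Bear", "Cow"]

    # index=True reuses the original index; index=False yields 0..n-1.
    expected_index = ["Ant", "Bear", "Cow"] if index_arg else [0, 1, 2]
    assert df.index.tolist() == expected_index
```

Each combination is reported under its own test ID, so a failure points directly at the offending `index_arg`/`name_arg` pair.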

Matthew Laurence Chen added 6 commits

August 12, 2024 18:29


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

9c31fb3


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

266feae


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

43f826c


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

5d388da


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

c368362


          Merge branch 'main' of github.com:mattyopl/python-bigquery-dataframes

3d2ba36

GarrettWu reviewed


Contributor

GarrettWu left a comment

Generally LGTM.

tests/system/small/test_index.py

+                  )
+                  bf_idx = indexes.Index(["Ant", "Bear", "Cow"], name="animal")
+                  if name_arg is None:

Contributor

GarrettWu Aug 12, 2024

It doesn't need the if else condition. Just

pd_df = pd_idx.to_frame(index=index_arg, name=name_arg)
bf_df = bf_idx.to_frame(index=index_arg, name=name_arg)

is good enough

Contributor Author

mattyopl Aug 12, 2024

pandas handles the name argument slightly differently. If we pass name=None to pandas, it creates a DataFrame whose column is named None, since technically the pandas default is lib.no_default (code pointer: https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/_libs/lib.pyi#L32), whereas we use None. pandas implementation: https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/core/indexes/base.py#L1607-L1666

Contributor Author

mattyopl Aug 12, 2024 (edited)

I can add something similar to BigFrames if you think it is better to exactly mirror pandas functionality: I am equally happy either way

Contributor

GarrettWu Aug 12, 2024

Ah, that's a weird behavior of pandas. Thanks for pointing it out. Then we shouldn't let the default be None, but should have something similar. @TrevorBergeron
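One possible shape for "something similar" is a sentinel default, sketched below in plain Python. The names `_NoDefault`, `NO_DEFAULT`, and `resolve_column_name` are hypothetical and are not part of pandas or BigFrames.

```python
class _NoDefault:
    """Sentinel type: marks 'argument not passed', distinct from an explicit None."""

    def __repr__(self):
        return "<no_default>"


NO_DEFAULT = _NoDefault()


def resolve_column_name(index_name, name=NO_DEFAULT):
    # Fall back to the index's own name only when the caller passed nothing.
    if name is NO_DEFAULT:
        return index_name
    # An explicit None is kept as the column label, mirroring the pandas
    # behavior described earlier in this thread.
    return name


print(resolve_column_name("animal"))          # caller omitted name
print(resolve_column_name("animal", None))    # caller passed None explicitly
print(resolve_column_name("animal", "food"))  # caller passed a real name
```

With this in place, the tests could pass `name=name_arg` straight through to both libraries without an if/else branch for None.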

tests/system/small/test_multiindex.py

+                  pd_idx = pandas.MultiIndex.from_arrays([["a", "b", "c"], ["d", "e", "f"]])
+                  bf_idx = indexes.MultiIndex.from_arrays([["a", "b", "c"], ["d", "e", "f"]])
+                  if name_arg is None:

Contributor

GarrettWu Aug 12, 2024

same here.

mattyopl added 3 commits

August 14, 2024 15:29


          Merge branch 'googleapis:main' into main

2966f08


          Merge branch 'googleapis:main' into main

ab2753e


          Merge branch 'googleapis:main' into main

777ff7d


Labels

api: bigquery size: m