[R] When write parquet as data.frame we lost information about class. #44524

Closed
galachad opened this issue Oct 24, 2024 · 4 comments

@galachad

galachad commented Oct 24, 2024

Describe the bug, including details regarding any error messages, version, and platform.

When we write a Parquet file with arrow, the information that the object is a plain data.frame is lost.

arrow::write_parquet(iris, "iris.parquet")

When we read iris.parquet back, it is returned as a tibble (tbl_df) by default.

arrow::read_parquet("iris.parquet")
class(arrow::read_parquet("iris.parquet"))
# [1] "tbl_df"     "tbl"        "data.frame"

This bug was introduced in #34775

The data.frame class is removed in the .serialize_arrow_r_metadata function.
https://github.com/apache/arrow/blame/7ef5437e23bd7d7571a0c7a7fc0c5d3634816802/r/R/metadata.R#L25

prop <- arrow::ParquetFileReader$create(
  "iris.parquet",
  props = arrow::ParquetArrowReaderProperties$create()
)
prop$GetSchema()$metadata$r$attributes$class
# NULL

The class attribute was removed when the Parquet file was saved.

Workaround:

  1. Apply extra class to data.frame
new_iris <- iris
class(new_iris) <- c("custom.data.frame", "data.frame")
arrow::write_parquet(new_iris, "iris.parquet")
prop <- arrow::ParquetFileReader$create(
  "iris.parquet",
  props = arrow::ParquetArrowReaderProperties$create()
)
prop$GetSchema()$metadata$r$attributes$class
# [1] "custom.data.frame" "data.frame"    
  2. Apply the metadata properties manually
# Convert the data.frame to an Arrow Table
iris_table <- arrow::Table$create(iris)

# Retrieve existing metadata (if any)
existing_metadata <- iris_table$schema$metadata

# Add or update the 'class' metadata to indicate it's a data.frame
new_metadata <- existing_metadata
new_metadata$r$attributes$class <- c("data.frame", "custom.data.frame")

# Update the schema with the new metadata
iris_table <- iris_table$ReplaceSchemaMetadata(new_metadata)

arrow::write_parquet(iris_table, "iris.parquet")

Bug description:

When the data.frame class attribute is removed, the Parquet file is read back as a tibble by default. In my opinion this is not the expected behavior: when we write a data.frame, we should read a data.frame.

What was the reason for removing the class if it's just data.frame?

Component(s)

R

@thisisnic
Member

Hi @galachad, the behaviour you describe is not a bug but an intended feature. There's more on reasoning in #35173 and #34825. Would you mind expanding a bit more on the problems this is causing for you?

@galachad
Author

We have some legacy code that isn't compatible with tibble. When the data.frame attributes are removed, parquet files are saved without a class (starting from arrow version 13+). In the absence of the class attribute, Arrow reads the file as a tibble by default.

While I understand the intent to remove unnecessary attributes, the assumption that data.frame is the default format is incorrect. The default type for reading parquet files is actually tibble.

If the goal is to remove attributes for the default table type in R, we should remove the class only when it is c("tbl_df", "tbl", "data.frame"), not just 'data.frame'.

As I mentioned in the description, an extra workaround is now required to avoid the class being changed. Unfortunately, tibble and data.frame are not 100% compatible.

The example on which that assumption was based is probably incorrect:

> class(arrow::arrow_table(name = "1", mtcars)$to_data_frame())
[1] "tbl_df"     "tbl"        "data.frame"
> class(arrow::arrow_table(mtcars, name = "1")$to_data_frame())
[1] "tbl_df"     "tbl"        "data.frame"

In this case, mtcars is a data.frame and should be represented as data.frame.

In my opinion, this bug is tiny and can be resolved just by removing the "remove the class if it's just data.frame" section.

arrow/r/R/metadata.R

Lines 25 to 31 in 7ef5437

# remove the class if it's just data.frame
if (identical(x$attributes$class, "data.frame")) {
  x$attributes <- x$attributes[names(x$attributes) != "class"]
  if (is_empty(x$attributes)) {
    x <- x[names(x) != "attributes"]
  }
}
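
To see what the quoted lines do in isolation, here is a minimal base-R sketch. It assumes the metadata is an ordinary nested list with an `attributes` element; the helper name `strip_plain_data_frame_class`, the `columns` element, and the use of `length(...) == 0` in place of `is_empty()` are illustrative assumptions, not the arrow source.

```r
# Hypothetical helper mirroring the quoted lines; not the arrow source itself.
strip_plain_data_frame_class <- function(x) {
  # Drop the class only when it is exactly "data.frame"
  if (identical(x$attributes$class, "data.frame")) {
    x$attributes <- x$attributes[names(x$attributes) != "class"]
    # Base-R stand-in for is_empty(): drop "attributes" once nothing is left
    if (length(x$attributes) == 0) {
      x <- x[names(x) != "attributes"]
    }
  }
  x
}

# Two made-up metadata lists: a plain data.frame and one with an extra class
plain <- list(attributes = list(class = "data.frame"), columns = list())
custom <- list(
  attributes = list(class = c("custom.data.frame", "data.frame")),
  columns = list()
)

stripped_plain <- strip_plain_data_frame_class(plain)
stripped_custom <- strip_plain_data_frame_class(custom)

names(stripped_plain)             # the "attributes" element is gone
stripped_custom$attributes$class  # the class vector is kept unchanged
```

Under these assumptions, only the plain data.frame loses its class entry, which matches why the extra-class workaround in the description keeps the class in the file metadata.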

@thisisnic
Member

It's not a bug - it was a deliberate design decision to have a default type of tibble - the discussion is in those linked PRs, though it is spread across several of them. Do you still have the same problems if you call as.data.frame() on the loaded object after calling read_parquet()?
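
The as.data.frame() suggestion could be wrapped roughly as follows. This is only a sketch: `read_as_data_frame` is a hypothetical helper name, and the `reader` argument is an assumption added so the conversion step can be tried without a Parquet file on disk.

```r
# Hypothetical wrapper: read with the given reader, then coerce the result
# to a plain data.frame.
# The default reader is arrow::read_parquet; it is only looked up when the
# default is actually used, so passing another reader does not require arrow.
read_as_data_frame <- function(path, reader = arrow::read_parquet) {
  as.data.frame(reader(path))
}

# Stand-in reader that returns an object carrying the tibble class vector,
# used here only to illustrate the conversion step
fake_reader <- function(path) {
  structure(iris, class = c("tbl_df", "tbl", "data.frame"))
}

result <- read_as_data_frame("iris.parquet", reader = fake_reader)
class(result)
```

With arrow installed, `read_as_data_frame("iris.parquet")` would use the default reader. Whether this is acceptable for legacy code depends on how many call sites would need to change.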

@galachad
Author

I'm just pointing out that if arrow's default type is tibble, then we should remove the class attribute only when the object is a tibble.

As I mentioned before, when we save a data.frame to Parquet, read_parquet returns a tibble.

The check could be changed to target tibble instead: remove the class attribute only when the object is just a tibble.

 # remove the class if it's just tibble
 if (identical(x$attributes$class, c("tbl_df", "tbl", "data.frame"))) { 
   x$attributes <- x$attributes[names(x$attributes) != "class"] 
   if (is_empty(x$attributes)) { 
     x <- x[names(x) != "attributes"] 
   } 
 } 
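
The difference between the current check and the proposed one can be compared with a small base-R sketch. `would_strip_class` is a hypothetical predicate introduced only for illustration; it reproduces the `identical()` comparison and nothing else from arrow.

```r
# Hypothetical predicate: would the class attribute be dropped under a rule
# that strips objects whose class vector equals `default_class`?
would_strip_class <- function(cls, default_class) {
  identical(cls, default_class)
}

tibble_class <- c("tbl_df", "tbl", "data.frame")

# Current rule: strip when the class is exactly "data.frame"
current_plain <- would_strip_class(class(iris), "data.frame")
current_tibble <- would_strip_class(tibble_class, "data.frame")

# Proposed rule: strip when the class is exactly the tibble class vector
proposed_plain <- would_strip_class(class(iris), tibble_class)
proposed_tibble <- would_strip_class(tibble_class, tibble_class)

c(current_plain, current_tibble, proposed_plain, proposed_tibble)
```

Under the proposed rule a plain data.frame would keep its class in the metadata, while a tibble would rely on the reader's default.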

You can always convert a tibble to a data.frame, but I'm just pointing out that the current behavior is not consistent.

Anyway, if you do not agree that removing class information from an object that is not the default type is a bug, then we should close the issue.
