Support read unstructured excel file #901

Open
wants to merge 4 commits into
base: master
from


@khm0651 khm0651 commented Oct 1, 2024

The current DataFrame.readExcel function only supports structured Excel files.

Ideally everyone would create Excel files in a structured format. But as the image shows, with unstructured Excel files the current DataFrame behavior of always treating the first row as the header makes the data hard to work with.

The Python pandas library supports this, making it very efficient to use.

So for unstructured Excel files, I've added a parameter called withDefaultHeader.

When it is set to true, headers are generated automatically and the name repair behaves as [NameRepairStrategy.MAKE_UNIQUE], which makes unstructured Excel files readable.
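To make the "make unique" idea concrete, here is a rough standalone sketch. The function name makeUnique and the numeric-suffix scheme are assumptions for illustration only; they are not taken from the library and may differ from what NameRepairStrategy.MAKE_UNIQUE actually produces.

```kotlin
// Hypothetical helper: makes every name in the list unique.
// The suffix scheme ("A", "A1", "A2", ...) is an assumption for this sketch.
fun makeUnique(names: List<String>): List<String> {
    val used = mutableSetOf<String>()
    return names.map { name ->
        var candidate = name
        var counter = 1
        // Keep trying "name1", "name2", ... until the candidate is unused.
        while (!used.add(candidate)) {
            candidate = "$name$counter"
            counter++
        }
        candidate
    }
}
```

With this sketch, the first occurrence of a name is kept as-is and only later duplicates get a suffix.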

Is there a better approach?

image

@khm0651 khm0651 force-pushed the support_unstructured_excel_file branch from 301f53c to 2b3361f Compare October 1, 2024 12:25
@Jolanrensen Jolanrensen self-requested a review October 1, 2024 12:49
@Jolanrensen
Collaborator

Thanks for your contribution!

Some thoughts:

So if I'm correct, withDefaultHeader = true will make the columns be named according to Excel columns, like "A", "B", "C" etc.? It's not really clear to me from your description of the argument that this will happen.

I might change the argument to firstRowIsHeader = true, and describe that when it's explicitly set to false, DF will fall back to Excel letter column names; otherwise it will take the first row (after skipRows) as the header.
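As a rough sketch of those semantics, independent of the actual readExcel code: the names Sheet and splitHeader are hypothetical, cells are simplified to nullable strings, and the letter fallback here only handles single letters, so it assumes at most 26 columns.

```kotlin
// Hypothetical model of a parsed sheet; not part of the DataFrame API.
data class Sheet(val header: List<String>, val rows: List<List<String?>>)

// Hypothetical helper showing the proposed firstRowIsHeader behavior.
fun splitHeader(
    cells: List<List<String?>>,
    skipRows: Int = 0,
    firstRowIsHeader: Boolean = true,
): Sheet {
    val remaining = cells.drop(skipRows)
    if (remaining.isEmpty()) return Sheet(emptyList(), emptyList())
    return if (firstRowIsHeader) {
        // The first row after skipRows supplies the column names.
        Sheet(remaining.first().map { it.orEmpty() }, remaining.drop(1))
    } else {
        // Fall back to Excel-style letters; every remaining row is data.
        // Single letters only, so this sketch assumes at most 26 columns.
        Sheet(remaining.first().indices.map { ('A' + it).toString() }, remaining)
    }
}
```

Defaulting firstRowIsHeader to true would keep the existing behavior unchanged for callers who do not pass the argument.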

Also, am I correct that in the current implementation, column J will not be included in the result?
image

@khm0651
Author

khm0651 commented Oct 1, 2024

Oh, it's a bit different. When withDefaultHeader is set to true, the row specified by skipRows is not used as the header. If skipRows is set, data is read starting from that row, with automatically generated headers.
The headers are generated from the Excel column letters: "A", "B", "C", and so on.
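For reference, the letter naming can be sketched as a small standalone function. The name excelColumnName is hypothetical and this is not the code in the PR; it just shows the usual bijective base-26 conversion from a zero-based column index.

```kotlin
// Hypothetical helper: converts a zero-based column index to an
// Excel-style column name (0 -> "A", 25 -> "Z", 26 -> "AA", ...).
fun excelColumnName(index: Int): String {
    require(index >= 0) { "index must be non-negative" }
    val sb = StringBuilder()
    var n = index
    while (n >= 0) {
        // Least significant letter first; reversed at the end.
        sb.append('A' + n % 26)
        // The "- 1" accounts for Excel letters having no zero digit.
        n = n / 26 - 1
    }
    return sb.reverse().toString()
}
```

For example, index 9 maps to "J" and index 26 maps to "AA".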

However, as @Jolanrensen pointed out, I hadn't considered that particular case.
We need to handle it as well. To do so, we should calculate the maximum column count across the rows and use that when generating the headers.
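One possible way to do that, sketched outside the actual implementation: padToWidest is a hypothetical helper, and representing cells as nullable strings is a simplification.

```kotlin
// Hypothetical helper: finds the widest row and pads shorter rows with
// nulls, so a column that is only filled in later rows is not dropped.
fun padToWidest(rows: List<List<String?>>): List<List<String?>> {
    val width = rows.maxOfOrNull { it.size } ?: 0
    return rows.map { row -> row + List<String?>(width - row.size) { null } }
}
```

The resulting width could then be used as the number of generated header names.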

I also agree with changing the argument to firstRowIsHeader = true.
