
[Improve][Connector-V2] Change read excel util from POI to EasyExcel #8064

Open
wants to merge 19 commits into
base: dev

@dwave dwave commented Nov 15, 2024

#8040

Purpose of this pull request

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

@Hisoka-X Hisoka-X changed the title from "[Hotfix][Connector-V2] ExcelReader read more than 65000 rows XSSFWorkbook will cause oom . so change POI to EasyExcel #8040" to "[Improve][Connector-V2] Change read excel util from POI to EasyExcel" Nov 15, 2024
@@ -54,7 +55,7 @@ public class ExcelReadStrategyTest {

 @Test
 public void testExcelRead() throws IOException, URISyntaxException {
-    testExcelRead("/excel/test_read_excel.xlsx");
+    // testExcelRead("/excel/test_read_excel.xlsx");
Member

why disable this?

Author

The commented-out test uses this Excel file. The date string that needs to be converted is 2024/1/31, and the cell format is
{mso-generic-font-family:auto;
mso-font-charset:134;
mso-number-format:"yyyy/m/d"; }

With POI, we can get the correct data type from the cell's format. With EasyExcel, we only get a string, and that string does not match the defined yyyy/MM/dd format when it is converted to the Date type, so the test case fails. That is why I commented out this one test case.
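
For illustration only, here is a minimal sketch of how such a mismatch can arise, assuming the conversion goes through `java.time.DateTimeFormatter` (the PR's actual date utility code is not shown in this thread). The class and method names are hypothetical. A fixed-width pattern such as `yyyy/MM/dd` rejects `2024/1/31`, while a variable-width pattern such as `yyyy/M/d` accepts it.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of SeaTunnel.
final class StrictVsLenientDate {

    // Variable-width patterns: M and d accept one or two digits.
    private static final List<DateTimeFormatter> LENIENT =
            Arrays.asList(
                    DateTimeFormatter.ofPattern("yyyy/M/d"),
                    DateTimeFormatter.ofPattern("yyyy-M-d"));

    // Returns true if the text parses as a LocalDate with the given pattern.
    static boolean matches(String pattern, String text) {
        try {
            LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Tries each lenient pattern in turn and fails if none of them fits.
    static LocalDate parseLenient(String text) {
        for (DateTimeFormatter formatter : LENIENT) {
            try {
                return LocalDate.parse(text, formatter);
            } catch (DateTimeParseException ignored) {
                // Fall through to the next candidate pattern.
            }
        }
        throw new IllegalArgumentException("Unrecognized date: " + text);
    }

    public static void main(String[] args) {
        System.out.println(matches("yyyy/MM/dd", "2024/1/31")); // MM needs two digits
        System.out.println(matches("yyyy/M/d", "2024/1/31"));
        System.out.println(parseLenient("2024-1-31"));
    }
}
```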

Member

I think we should find a way to keep the old behavior unchanged, or add an option that lets users choose between POI and EasyExcel.

Author

Okay, I'll find a way to deal with it

@@ -15,8 +15,9 @@
* limitations under the License.
*/

-package org.apache.seatunnel.connectors.seatunnel.file.writer;
+package org.apache.seatunnel.connectors.seatunnel.file.Reader;
Member

Suggested change
-package org.apache.seatunnel.connectors.seatunnel.file.Reader;
+package org.apache.seatunnel.connectors.seatunnel.file.reader;

Author

Under the hood, Excel stores date and time values as doubles, so I now convert the double back to the Date and DateTime types, and all test cases pass.
I also added the yyyy/M/d and yyyy-M-d patterns to the string-matching time format recognition in DateTimeUtils and DateUtils.
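
A minimal sketch of the double-to-date conversion described above, assuming the workbook uses Excel's 1900 date system, in which serial numbers count days from 1899-12-30 for any date after February 1900. The class and method names are hypothetical and this is not the PR's code. With POI on the classpath, `org.apache.poi.ss.usermodel.DateUtil.getJavaDate` performs a comparable conversion.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

// Hypothetical helper, not part of SeaTunnel.
final class ExcelSerialDates {

    // Day zero of the 1900 date system (valid for serials after Feb 1900).
    private static final LocalDate EPOCH = LocalDate.of(1899, 12, 30);
    private static final long SECONDS_PER_DAY = 86_400L;

    // The integer part of the serial is the day count.
    static LocalDate toLocalDate(double serial) {
        return EPOCH.plusDays((long) Math.floor(serial));
    }

    // The fractional part of the serial is the fraction of a 24-hour day.
    static LocalDateTime toLocalDateTime(double serial) {
        long wholeDays = (long) Math.floor(serial);
        long seconds = Math.round((serial - wholeDays) * SECONDS_PER_DAY);
        return EPOCH.plusDays(wholeDays).atStartOfDay().plusSeconds(seconds);
    }

    public static void main(String[] args) {
        System.out.println(toLocalDate(45322.0));     // 2024-01-31
        System.out.println(toLocalDateTime(45322.5)); // 2024-01-31T12:00
    }
}
```

Rounding to whole seconds is a simplification chosen for this sketch; sub-second precision would need a different unit.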

@github-actions github-actions bot added the api label Nov 19, 2024
Contributor

corgy-w commented Nov 19, 2024

https://github.com/apache/seatunnel/runs/33188901598 @dwave Please open ci workflow

Author

dwave commented Nov 20, 2024

> https://github.com/apache/seatunnel/runs/33188901598 @dwave Please open ci workflow

Okay, it's already opened

Comment on lines +162 to +167

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>${easyexcel.version}</version>
</dependency>
Member

As we all know, easyexcel is no longer maintained. It doesn't seem good to introduce it at this time. We can try other alternatives, such as fastexcel. There are also reports online that it is faster than easyexcel. What do you think? cc @hailin0

Contributor

or easyexcel-plus?

Author

I will give it a try

Author

> or easyexcel-plus?

easyexcel-plus only appeared on GitHub last night, and I haven't seen it in the Maven repository yet.

Author

@dwave dwave Nov 22, 2024

> As we all know, easyexcel is no longer maintained. It doesn't seem good to introduce it at this time. We can try other alternatives, such as fastexcel. There are also reports online that it is faster than easyexcel. What do you think? cc @hailin0

I tried fastexcel, but it has a problem with the Excel 97-2003 .xls format:

dhatim/fastexcel#287

Member

Oh, let's add an option to configure the Excel parse engine. It should default to POI and support POI and easyexcel for now, so that we can implement other engines in the future.
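
One possible shape for such an option, sketched here purely for illustration: the enum name, its constants, and the `fromConfig` helper are hypothetical and do not come from the PR. An unset value falls back to POI so existing jobs keep their current behavior.

```java
import java.util.Locale;

// Hypothetical engine selector, not SeaTunnel's actual option.
enum ExcelEngine {
    POI,
    EASYEXCEL;

    // Maps a raw config string to an engine; null or blank falls back to POI.
    // An unknown name fails with IllegalArgumentException from valueOf.
    static ExcelEngine fromConfig(String value) {
        if (value == null || value.trim().isEmpty()) {
            return POI;
        }
        return ExcelEngine.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}

final class ExcelEngineDemo {
    public static void main(String[] args) {
        System.out.println(ExcelEngine.fromConfig(null));        // POI
        System.out.println(ExcelEngine.fromConfig("easyexcel")); // EASYEXCEL
    }
}
```

Adding another engine later would then only require a new enum constant and a matching reader implementation.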

Author

Okay

Contributor

> Oh, let's add an option to configure the Excel parse engine. It should default to POI and support POI and easyexcel for now, so that we can implement other engines in the future.

Will there be any conflict between POI versions?

dwave and others added 2 commits November 21, 2024 11:26
…/main/java/org/apache/seatunnel/connectors/seatunnel/file/excel/ExcelReaderListener.java

Co-authored-by: corgy-w <[email protected]>
3 participants