sedona-spark-shaded package should exclude some repetitive dependencies. #1584

freamdx · 2024-09-12T03:26:11Z

pom.xml should exclude any dependencies that already exist in the Spark jars, e.g.:
edu.ucar:cdm-core should exclude guava/httpclient/protobuf-java
... ...
<dependency>
  <groupId>edu.ucar</groupId>
  <artifactId>cdm-core</artifactId>
  <exclusions>
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.protobuf</groupId>
      <artifactId>protobuf-java</artifactId>
    </exclusion>
  </exclusions>
</dependency>
... ...
<artifactSet>
  <excludes>
    <exclude>org.scala-lang:scala-library</exclude>
    <exclude>org.apache.commons:commons-*</exclude>
    <exclude>commons-pool:commons-pool</exclude>
    <exclude>commons-lang:commons-lang</exclude>
    <exclude>commons-io:commons-io</exclude>
    <exclude>commons-logging:commons-logging</exclude>
  </excludes>
</artifactSet>
... ...


jiayuasu · 2024-09-13T01:03:21Z

@freamdx would you mind creating a PR to fix this yourself?

Kontinuation · 2024-09-13T01:26:22Z

A few notes:

cdm-core has to be bundled in the shaded package, so we declare it in the shaded pom.xml with compile scope rather than provided.
Guava is better shaded than excluded, since we don't know whether the Guava jar shipped with Spark is compatible with Sedona, and shading Guava is common practice.
The other package exclusions seem OK. Apache Commons has good backward compatibility, and the versions shipped with Spark are later than what we bundle.
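To illustrate what "shading" Guava means here, as opposed to excluding it, below is a minimal sketch of a maven-shade-plugin relocation. It is not taken from Sedona's actual pom.xml: the `shadedPattern` value `org.apache.sedona.shaded.guava` is a hypothetical name chosen for illustration, and the surrounding plugin configuration is trimmed to the relevant part.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Rewrite Guava's packages inside the shaded jar so the bundled
               copy cannot clash with whatever Guava version Spark ships. -->
          <relocation>
            <pattern>com.google.common</pattern>
            <!-- Hypothetical target package, assumed for this sketch. -->
            <shadedPattern>org.apache.sedona.shaded.guava</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With a relocation like this, the bundled Guava classes and every reference to them in the shaded jar are rewritten to the new package, so the Guava jar on Spark's classpath is never consulted. An exclusion, by contrast, drops the bundled copy and relies on Spark's version being compatible.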

zwu-net · 2024-09-24T06:10:37Z

I agree with @Kontinuation. Guava is pretty tricky, as I learned when upgrading Spark.

I'm having trouble understanding why this Google library has such poor compatibility. It seems unexpected from a leading tech company.

I'm willing to step in if @jiayuasu approves, given @freamdx's lack of response. However, I am somewhat worried about proceeding without rigorous testing to mitigate potential risks.

jiayuasu · 2024-09-24T11:41:50Z

@zwu-net please feel free to create a PR to fix this

zwu-net · 2024-09-25T15:46:51Z

@jiayuasu (I would also like to bring in @james-willis, since I got to know him through the DBSCAN discussion. If you are busy with other tasks, you don't need to participate.) @Kontinuation @freamdx

Research Findings and Proposal

After researching the feasibility of modifying Sedona's pom.xml (https://github.com/apache/sedona/blob/master/spark-shaded/pom.xml), I concluded that manual management of library inclusions/exclusions across various Spark and Sedona versions would be overly complex and labor-intensive.

Proposed Solution

To address this challenge, I suggest creating a custom tool to manage pom.xml generation. This approach is inspired by my previous work on a custom transformer that converts SQL from Oracle to Snowflake (https://www.linkedin.com/pulse/revolutionizing-data-analysis-custom-transformer-seamless-paul-wu-4euxc/?trackingId=1AyXqiJh1sei4yybC%2FZW9g%3D%3D).

Tool Requirements

The tool should:

Fetch Spark releases (ideally from a prefetched repository, to reduce build time and avoid network instability)
Generate pom.xml files
Be implemented in Java or Python (with careful dependency management if using Python)

Implementation Considerations

Given the scope of this task, it may take several months for part-time contributors like myself to implement and test. Before proceeding, I'd appreciate feedback on:

The viability of this approach
Potential alternative solutions

Please share your thoughts.
