Fix FileAlreadyExistsException in LORE dump process #11484

Merged
12 commits merged into NVIDIA:branch-24.10 on Sep 27, 2024

Collaborator

@ustcfy ustcfy commented Sep 19, 2024

Summary

This PR resolves the FileAlreadyExistsException that occurs when using spark.rapids.sql.lore.idsToDump to dump particular partitions.

Changes Made

  • Added a check to verify if the LORE dump path exists and is not empty before proceeding with the dump.
  • If the path exists and contains files, an IllegalArgumentException is thrown to prevent overwriting existing data.
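
The check described in the two bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses java.nio.file on the local file system so that it runs without extra dependencies, whereas the PR works against a Hadoop FileSystem. The names LoreDumpPathCheck and requireEmptyOrAbsent are hypothetical and do not come from the PR.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper, not the code from this PR.
object LoreDumpPathCheck {
  /** Throws IllegalArgumentException if `root` exists and is not an empty directory. */
  def requireEmptyOrAbsent(root: Path): Unit = {
    if (Files.exists(root)) {
      val blocked =
        if (Files.isDirectory(root)) {
          // Files.list returns a lazy stream that must be closed after use.
          val entries = Files.list(root)
          try entries.findAny().isPresent finally entries.close()
        } else {
          true // a plain file at the dump root also blocks the dump
        }
      if (blocked) {
        throw new IllegalArgumentException(
          s"LORE dump path $root already exists and is not empty")
      }
    }
  }
}
```

Failing fast here means the user sees one clear error before any partition is written, rather than a FileAlreadyExistsException from the middle of the dump.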

Collaborator

@liurenjie1024 liurenjie1024 left a comment

Hi @ustcfy, we also need to add a check that the root directory of the LORE dump is empty.

@@ -197,6 +197,17 @@ object GpuLore {
s"when ${RapidsConf.LORE_DUMP_IDS.key} is set."))

val spark = SparkShimImpl.sessionFromPlan(sparkPlan)

Option(spark.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY)).foreach { executionId =>
if (!idGen.containsKey(executionId)) {
Collaborator

This is a bit tricky to me. How about maintaining a hashmap for this check?
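
One way to read the hashmap suggestion is a map keyed by execution id, so that the check runs at most once per execution. The sketch below is an assumption about the shape of such a guard, not the code that was merged; OncePerExecution and ensureChecked are hypothetical names.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical guard: runs `check` at most once per execution id.
final class OncePerExecution(check: String => Unit) {
  private val checked = new ConcurrentHashMap[String, java.lang.Boolean]()

  def ensureChecked(executionId: String): Unit = {
    // computeIfAbsent invokes the function only when the key is missing.
    // If `check` throws, no entry is recorded, so a later call checks again.
    checked.computeIfAbsent(executionId, id => {
      check(id)
      java.lang.Boolean.TRUE
    })
    ()
  }
}
```

For example, calling ensureChecked("1") twice and ensureChecked("2") once runs the check twice in total, once for each distinct id.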

@liurenjie1024
Collaborator

Please add a test for this change, and update the dev/lore.md doc.

@ustcfy ustcfy changed the title from "Updated parameters to enable file overwriting when dumping." to "Fix FileAlreadyExistsException in LORE dump process" on Sep 20, 2024
@liurenjie1024
Collaborator

build

@ustcfy ustcfy marked this pull request as ready for review September 23, 2024 09:47
liurenjie1024 previously approved these changes Sep 23, 2024
Collaborator

@liurenjie1024 liurenjie1024 left a comment

LGTM, thanks!

docs/dev/lore.md (outdated, resolved)
liurenjie1024 previously approved these changes Sep 23, 2024
@ustcfy
Collaborator Author

ustcfy commented Sep 23, 2024

build

@liurenjie1024
Collaborator

build

@liurenjie1024
Collaborator

build

@ustcfy
Collaborator Author

ustcfy commented Sep 24, 2024

build

@liurenjie1024
Collaborator

build

@liurenjie1024
Collaborator

build

liurenjie1024 previously approved these changes Sep 24, 2024
@liurenjie1024
Collaborator

build

@ustcfy
Collaborator Author

ustcfy commented Sep 25, 2024

build

@pxLi
Collaborator

pxLi commented Sep 25, 2024

build

docs/dev/lore.md (outdated, resolved)
@@ -90,7 +90,7 @@ object GpuLore {

def dumpObject[T: ClassTag](obj: T, path: Path, hadoopConf: Configuration): Unit = {
withResource(path.getFileSystem(hadoopConf)) { fs =>
Collaborator

Not part of this PR, but typically we want to leave the underlying FileSystem.get calls memoized to make the subsequent ones cheaper. Thus the FileSystem object should not be closed via withResource.
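
To illustrate the concern, here is a toy model of a memoizing factory. It is not Hadoop code: ToyFs, ToyFsCache, and SharedHandleDemo are hypothetical stand-ins, and the only assumption is the one stated in the comment above, namely that repeated lookups hand back the same memoized object. Closing that object in one place then affects every other holder.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for a file system client; not a Hadoop class.
final class ToyFs(val scheme: String) extends AutoCloseable {
  @volatile private var open = true
  def isOpen: Boolean = open
  override def close(): Unit = { open = false }
}

// Hypothetical memoizing factory: one shared ToyFs per scheme.
object ToyFsCache {
  private val cache = new ConcurrentHashMap[String, ToyFs]()
  def get(scheme: String): ToyFs =
    cache.computeIfAbsent(scheme, s => new ToyFs(s))
}

object SharedHandleDemo {
  /** Returns (both lookups gave the same object, second holder still open). */
  def run(): (Boolean, Boolean) = {
    val a = ToyFsCache.get("hdfs")
    val b = ToyFsCache.get("hdfs")
    val same = a eq b
    a.close() // what a withResource-style wrapper would do on exit
    (same, b.isOpen)
  }
}
```

In this model the second holder ends up with a closed handle even though it never called close itself, which is the kind of side effect the comment warns about.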

executionId =>
loreOutputRootPathChecked.computeIfAbsent(executionId, _ => {
val path = new Path(loreOutputRootPath)
withResource(path.getFileSystem(spark.sparkContext.hadoopConfiguration)) { fs =>
Collaborator

Don't wrap getFileSystem in withResource.

Co-authored-by: Gera Shegalov <[email protected]>
@ustcfy
Collaborator Author

ustcfy commented Sep 26, 2024

build

@liurenjie1024 liurenjie1024 merged commit 41351c0 into NVIDIA:branch-24.10 Sep 27, 2024
45 checks passed
@sameerz sameerz added the bug ("Something isn't working") label Oct 1, 2024
5 participants