Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix failed to delete an empty directory due to cache evict #18675

Open
wants to merge 2 commits into
base: master-2.x
Choose a base branch
from

Conversation

gp1314
Copy link
Contributor

@gp1314 gp1314 commented Aug 26, 2024

What changes are proposed in this pull request?

phenomenon

  • In a production environment I delete empty directory encountered DirectoryNotEmptyException anomalies
    企业微信截图_127297c4-0aae-4ce4-b489-aa72ac7d16bb

  • I'm sure I deleted the files in the directory first, and then deleted the empty directory

  • recursive is set to false when an empty directory is deleted

reason

  • Delete the directory, and if recursive is false, determine whether the directory has subdirectories or files
  • CachingInodeStore.hasChildren the mListingCache does not have the target directory, through!mListingCache.getDataFromBackingStore(inode.getId(), option).isEmpty()|| mBackingStore.hasChildren(inode); judgment
  • If there is the following logic:
  1. Write data to directory A and flush directory A to rocksdb when the cache is full
  2. File b is written to directory A,b cached in InodeCache,A->b cached in EdgeCache
  3. If the cache is full, EdgeCache flush is triggered,A->b in rocksdb edges column, evict A from mListingCache
  4. Delete b file,add into mUnflushedDeletes
  5. At this time, in CachingInodeStore.hasChildren
  • directory A cannot be retrieved from mListingCache(mListingCache.getCachedChildIds not present)
  • and b cannot be obtained from inodecache because b has been deleted(mListingCache.getDataFromBackingStore return empty)
  • we can also get A->b at rocksdb (mBackingStore.hasChildren return true)

reproduction

  • In CachingInodeStoreMockedBackingStoreTest we can construct a single measurement
  • MASTER_METASTORE_INODE_CACHE_LOW_WATER_MARK_RATIO is set to 0 to ensure that all data is evict
  @Rule
  public ConfigurationRule mConf = new ConfigurationRule(
      ImmutableMap.of(PropertyKey.MASTER_METASTORE_INODE_CACHE_MAX_SIZE, CACHE_SIZE,
          PropertyKey.MASTER_METASTORE_INODE_CACHE_EVICT_BATCH_SIZE, 5,
          PropertyKey.LEAK_DETECTOR_LEVEL, ResourceLeakDetector.Level.PARANOID,
          PropertyKey.LEAK_DETECTOR_EXIT_ON_LEAK, true,
          PropertyKey.MASTER_METASTORE_INODE_CACHE_LOW_WATER_MARK_RATIO, 0),
      Configuration.modifiableGlobal());

@Test
  public void evictInodeForTestHasChildren() throws Exception {
    MutableInodeDirectory myTestdir =
        MutableInodeDirectory.create(500, 0, "guptest", CreateDirectoryContext.defaults());
    mStore.writeNewInode(myTestdir);
    mStore.mListingCache.evictForTesting();
    mStore.mListingCache.evictForTesting();
    mStore.mEdgeCache.flush();
    mStore.mInodeCache.flush();

    long id = 6000;
    MutableInodeFile child =
        MutableInodeFile.create(id, 500, "child" + id, 0, CreateFileContext.defaults());
    mStore.writeNewInode(child);
    mStore.addChild(500, child);

    mStore.mEdgeCache.flush();

    mStore.removeInodeAndParentEdge(child);
    assertEquals(true, mStore.hasChildren(myTestdir));
  }
  • Or set configuration
alluxio.master.metastore.inode.cache.max.size=1024
alluxio.master.metastore.inode.cache.evict.batch.size=100
  • write subdir-i/i, delete i and subdir-i (i loop 500)

Why are the changes needed?

Please clarify why the changes are needed. For instance,

  1. In CachingInodeStore.hasChildren, can remove mBackingStore.hasChildren(inode), because mListingCache.GetDataFromBackingStore already from rocksdb checked again

Does this PR introduce any user facing changes?

No changes to the user-facing api

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: FAIL
    • The title of the PR does not pass all the checks. Please fix the following issues:
      • Title is too long (95 characters). Must be at most 72 characters.
      • Title must not end with punctuation
  • Commits associated with Github account: FAIL
    • It looks like your commits can't be linked to a valid Github account.
      Your commits are made with the email [email protected], which does not allow your contribution to be tracked by Github.
      See this link for possible reasons this might be happening.
      To change the author email address that your most recent commit was made under, you can run:
      git -c user.name="Name" -c user.email="Email" commit --amend --reset-author
      See this answer for more details about how to change commit email addresses.
      Once the author email address has been updated, update the pull request by running:
      git push --force https://github.com/gp1314/alluxio.git dev/gup/failed_delete_empty_directory

Some checks failed. Please fix the reported issues and reply
alluxio-bot, check this please
to re-run checks.

@gp1314 gp1314 force-pushed the dev/gup/failed_delete_empty_directory branch from 3da3a4a to 893ea90 Compare August 26, 2024 08:17
@gp1314 gp1314 changed the title Fix the issue of deleting empty directories failing due to cache eviction in CachingInodeStore. Fix the issue of deleting empty directories failing due to cache eviction in CachingInodeStore Aug 26, 2024
@gp1314
Copy link
Contributor Author

gp1314 commented Aug 26, 2024

alluxio-bot, check this please

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: FAIL
    • The title of the PR does not pass all the checks. Please fix the following issues:
      • Title is too long (94 characters). Must be at most 72 characters.
  • Commits associated with Github account: PASS

Some checks failed. Please fix the reported issues and reply
alluxio-bot, check this please
to re-run checks.

1 similar comment
@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: FAIL
    • The title of the PR does not pass all the checks. Please fix the following issues:
      • Title is too long (94 characters). Must be at most 72 characters.
  • Commits associated with Github account: PASS

Some checks failed. Please fix the reported issues and reply
alluxio-bot, check this please
to re-run checks.

@gp1314 gp1314 changed the title Fix the issue of deleting empty directories failing due to cache eviction in CachingInodeStore Fix failed to delete an empty directory due to cache evict Aug 26, 2024
@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: PASS
  • Commits associated with Github account: PASS

All checks passed!

@gp1314
Copy link
Contributor Author

gp1314 commented Sep 19, 2024

@jiacheliu3 Could you please help core review it

@jiacheliu3
Copy link
Contributor

That is very interesting finding, thanks a lot @gp1314 ! The logic here is definitely complicated and let me spend some time thinking and get back to you.

Right now, I feel the "issue" is because there are many lock-free operations in the CachingInodeStore. So operations on different parts of the internal cache (inode cache, edge cache, listing cache) are not atomic. Your solution is probably correct, but let me think twice about possible alternatives.

@@ -335,6 +336,27 @@ public void skipCache() throws Exception {
assertEquals(0, mStore.mInodeCache.getCacheMap().size());
}

@Test
public void evictInodeForTestHasChildren() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding this test case! My understanding is, once your code is merged, this test case is no longer needed right? In other words, if your code is merged, assertEquals(false, mStore.hasChildren(myTestdir)); will return false.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants