Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-8371] Fixing column stats index with MDT for few scenarios #12105

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Oct 15, 2024

Change Logs

  • Support bootstrapping of col stats for MOR table.
  • Fix new partition instantiation w/ MDT when there are pending operations in data table.
  • Fix clean operation with col stats. Even though stats are nullified, the records apparently were not deleted from the col stats partition.

Impact

We could enable col stats for MOR table at any given state.
Ran into other issues along the way which I had to fix to get the patch ready.

  • DirectoryInfo was not accounting for files fetched from MDT. When a new MDT partition is initialized, to fetch file info, we poll MDT rather than doing FS based listing. This had some a bug and had to fix it.
  • MDT writes had data loss while trying to initialize during a rollback or any pending instant in data table. So, fixed the same in this patch.
  • When clean from data table is applied to MDT, we were nullifying the stats or marking it as deleted, but the record as such is not deleted from col stats partition and was lingering. Fixed the same in this patch.

Tests covered:

  • bootstrapping of both COW and MOR table.
  • Covered both partitioned and non-partitioned table.
  • Ensure log files w/ delete block, partially failed log blocks and rollback blocks are accounted for in tests.
  • Added tests to validate clean does remove the entry from col stats for both table types and partition and non-partitioned table.

Risk level (write none, low medium or high below)

low.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Oct 15, 2024
Copy link
Contributor Author

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

responded and addressed all feedback

@nsivabalan nsivabalan force-pushed the mor_colstats_initializationfix branch 2 times, most recently from 8025ba8 to afad75a Compare October 15, 2024 21:04
Copy link
Contributor

@jonvex jonvex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few questions

return getLogFileColumnRangeMetadata(fullFilePath.toString(), datasetMetaClient, columnsToIndex, writerSchemaOpt);
} else if (FSUtils.isLogFile(filePath.substring(filePath.lastIndexOf("/") + 1)) && fetchStatsForLogFiles) {
LOG.warn("Reading log file: {}, to build column range metadata.", filePath);
return getLogFileColumnRangeMetadata(new StoragePath(datasetMetaClient.getBasePath(), filePath).toString(), datasetMetaClient, columnsToIndex, writerSchemaOpt, maxBufferSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StoragePath fullFilePath = new StoragePath(datasetMetaClient.getBasePath(), filePath);

so why are we changing this here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do better w/ variable naming.
for eg, the input arg to this method is file with partition path value (filePath variable).
but for unmerged log record reader we need full path (absolute path).

I have made some variable naming fixes in my latest commit. you can check it out.

Copy link
Contributor

@jonvex jonvex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unblocking the pr from approve

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM including tests.

@nsivabalan nsivabalan changed the title [HUDI-8371] Support bootstrapping of col stats for MOR table [HUDI-8371] Fixing column stats index with MDT for few scenarios Oct 16, 2024
@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Oct 17, 2024
Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocked on my review

@codope codope force-pushed the mor_colstats_initializationfix branch from 9be4d1b to effdeec Compare October 19, 2024 02:02
@nsivabalan nsivabalan force-pushed the mor_colstats_initializationfix branch from effdeec to 0567058 Compare October 19, 2024 04:14
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:XL PR with lines of changes > 1000
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants