[#2695] feat(doc): Add docs for fileset catalog #2781

Merged: 6 commits, Apr 3, 2024

Conversation

jerryshao (Contributor)

What changes were proposed in this pull request?

This PR proposes to add docs for the fileset catalog.

Why are the changes needed?

Fix: #2695

Does this PR introduce any user-facing change?

No.

How was this patch tested?

No.

@jerryshao jerryshao self-assigned this Apr 2, 2024
@jerryshao (Contributor, Author)

@coolderli @xloya would you please help review? Thanks.

@coolderli (Collaborator)

@jerryshao Do we need to introduce how to use the fileset in the Spark engine? In addition, I have already tested TensorFlow and submitted a PR: tensorflow/io#1970. After #2779 is resolved, we can support TensorFlow. I think we can add a doc like https://help.aliyun.com/zh/hdfs/using-tensorflow-on?spm=a2c4g.11186623.0.i6. What do you think?

@jerryshao (Contributor, Author)

I would suggest having a separate doc about gvfs and adding the Spark and TF related things there.


FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
NameIdentifier[] identifiers =
    filesetCatalog.listFilesets(Namespace.ofFileset("metalake", "catalog", "schema"));
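
For readers skimming this excerpt, a minimal sketch of the setup that would normally precede it is shown below. This is only an illustrative assumption based on the pre-refactoring Gravitino Java client (the builder and load calls may differ by version); it is not part of the docs under review.

// Hypothetical setup preceding the snippet above; names are assumptions, not from this PR.
GravitinoClient client = GravitinoClient.builder("http://localhost:8090").build();
// The pre-refactoring client loads a metalake and catalog by fully qualified identifier.
var metalake = client.loadMetalake(NameIdentifier.of("metalake"));
Catalog catalog = metalake.loadCatalog(NameIdentifier.of("metalake", "catalog"));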
Collaborator

I think there's another issue: the metalake in Namespace seems redundant. With the new GravitinoClient, we have already declared the name of the current metalake. It is not related to this PR, though. Never mind.

Contributor Author

Yeah, it is unrelated; we are working on the client refactoring.
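
To make the redundancy concrete, here is a hedged before/after sketch. The "after" form is purely an assumption about where a metalake-scoped client refactoring could lead, not an API documented in this PR.

// Shape used in the reviewed docs: the metalake name is repeated inside the namespace.
NameIdentifier[] before =
    filesetCatalog.listFilesets(Namespace.ofFileset("metalake", "catalog", "schema"));

// Hypothetical shape with a metalake-scoped client (assumption): the client already
// knows the metalake, so the namespace would only need catalog and schema.
NameIdentifier[] after =
    filesetCatalog.listFilesets(Namespace.of("catalog", "schema"));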

@coolderli (Collaborator)

@jerryshao Left some comments. Overall, it looks good to me.

@jerryshao (Contributor, Author)

@shaofengshi would you please also check the Java client part? Thanks.

shaofengshi previously approved these changes Apr 3, 2024

@shaofengshi (Contributor) left a comment

LGTM

… tabular data and others in Gravitino in a unified way.

After a fileset is created, users can easily access and manage the files/directories through the fileset's identifier, without needing to know the physical path of the managed datasets. Also, with …
Collaborator

Maybe "of the managed datasets" is not necessary.

Contributor Author

I feel like it is still necessary: it means that the dataset is managed by Gravitino, so users don't need to know the physical path. For unmanaged datasets, users still need to know the physical path before visiting the dataset.
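
A short sketch of the managed vs. unmanaged (external) distinction being discussed. The createFileset signature, the Fileset.Type values, and the fileset names used here are assumptions based on the Gravitino Java API at the time, so treat this as illustrative only.

// Managed fileset (assumption): Gravitino allocates and owns the storage location,
// so callers only ever use the fileset identifier, never the physical path.
Fileset managed = filesetCatalog.createFileset(
    NameIdentifier.of("metalake", "catalog", "schema", "managed_example"),
    "storage managed by Gravitino",
    Fileset.Type.MANAGED,
    null,      // location derived from catalog/schema settings (assumption)
    Map.of());

// External (unmanaged) fileset (assumption): the caller must already know the
// physical path and supplies it as the storage location.
Fileset external = filesetCatalog.createFileset(
    NameIdentifier.of("metalake", "catalog", "schema", "external_example"),
    "data that lives outside Gravitino's management",
    Fileset.Type.EXTERNAL,
    "hdfs://namenode:8020/path/to/existing/data",
    Map.of());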

Collaborator

Got it.

@qqqttt123 (Contributor) left a comment

LGTM.

@jerryshao jerryshao merged commit f119d90 into apache:main Apr 3, 2024
7 checks passed

Successfully merging this pull request may close these issues.

[Subtask] Add document about how to use fileset type catalog