feat: add support for catalogs with glue implementation to start #51

wolfeidau · 2024-01-17T02:18:17Z

This PR adds an implementation of the catalog which is heavily inspired by iceberg-python, I have tried to keep it as minimal as possible to start somewhere, and get feedback.

This includes coverage for all the glue related operations which can be easily mocked, with some discussion needed around testing of the s3 access.

I am doing my best to follow the conventions in the existing code, with a few things needing discussion:

The table identifier is a bit confusing, I have done my best to follow the python structure in the catalog.
Need to do a bit of work around testing combined glue and s3 access, maybe start by exposing a way to load the s3fs directly so we can pass in a mock to test LoadTable.

I can easily use this as a basis for the dynamodb catalog, which would enable running both the iofs and catalog locally in an integration test environment at some point.

wolfeidau · 2024-01-18T23:46:06Z

@nastra @zeroshade not sure if either of you saw this, would love some feedback.

Fokko

Thanks for working on this @wolfeidau Left a few comments to make sure that we align the implementations on the naming.

CC @jackye1995 and @HonahX who are much more familiar with Glue

catalog/catalog.go

catalog/glue.go

This has the following changes: * Just provide a load table, which looks up the table in the catalog and returns it with metadata loaded. * Remove the table structure, and follow the python list tables which returns identifiers instead. * Update the interface documentation to ensure it is clear about inputs and outputs like the python library. * Adjust the tests and verify via AWS locally

wolfeidau · 2024-01-19T08:20:18Z

@Fokko hopefull that changes makes things a bit clearer, I have added more internal documentation as well on the interface to be more specific on behavior around these identifiers. 😅🤞

zeroshade · 2024-01-22T17:33:11Z

catalog/glue.go

+type GlueAPI interface {
+	GetTable(ctx context.Context, params *glue.GetTableInput, optFns ...func(*glue.Options)) (*glue.GetTableOutput, error)
+	GetTables(ctx context.Context, params *glue.GetTablesInput, optFns ...func(*glue.Options)) (*glue.GetTablesOutput, error)
+}
+
+type GlueCatalog struct {
+	glueSvc GlueAPI
+}


Are we intending for these to be used externally? I think it would make more sense for all interactions with the catalog to be through the Catalog interface and that we shouldn't export these at all. Thoughts?

Very good point, last thing you want is that interface in the public API, great pick up, and fixed 😅

zeroshade · 2024-01-22T17:40:12Z

catalog/glue.go

+func NewGlueCatalog(awscfg aws.Config) *GlueCatalog {
+	return &GlueCatalog{
+		glueSvc: glue.NewFromConfig(awscfg),
+	}
+}


Rather than NewGlueCatalog like this, maybe we follow the more typical iceberg setup (similar to pyiceberg) and create a LoadCatalog function taking a uri + map[string]string (properties/options) and return the Catalog interface directly.

We can then infer which type of catalog to create based on the uri or from a type property. I have an example of this in my older iceberg implementation of stuff located here: https://github.com/zeroshade/icegopher/blob/main/table/catalog/catalog.go also containing an implementation for the rest catalog there.

Thoughts?

@zeroshade yeah this abstraction seems a bit more than I would expect in a Go library, I prefer the more common functional options where you inject specific overrides. https://dave.cheney.net/2014/10/17/functional-options-for-friendly-apis

Providing maps of properties just result in you needing to re-implement the myriad of overrides provided by say the aws.Config, or similar configs for other providers.

Creating an abstraction where "everything is created" feels a bit unnecessary, especially when each of these things already has a rich creation API, like the aws SDK as can be seen in it's configuration section https://aws.github.io/aws-sdk-go-v2/docs/configuring-sdk/.

As an integration point, I think it would be better to provide a set of functional options and strip back the abstraction to specific create methods for each catalogue as it is more typical of a Go API. Go tends to encourage a flatter API with fewer abstractions like these.

That said, these abstractions may make sense in the future, but again function options are more common in Go.

I have moved NewGlueCatalog(awscfg aws.Config) *GlueCatalog to use functional options for the AWS configuration as it is cleaner, more extensible, and removes this coupling directly to aws.Config.

@zeroshade Just my two cents, and hopefully explains why I kept this catalogue implementation simple.

As an integration point, I think it would be better to provide a set of functional options and strip back the abstraction to specific create methods for each catalogue as it is more typical of a Go API. Go tends to encourage a flatter API with fewer abstractions like these.

That's fair and I'm fine with that. Perhaps in the future we'd add the more generic abstraction to help people convert from existing iceberg libraries but a per-catalog-type New functions will work for now.

@zeroshade The current model feels like it is following the factory method pattern https://en.wikipedia.org/wiki/Factory_method_pattern, which is great pattern for certain use cases, but in this case there are lots of other things to consider when creating these catalogs, like identity, custom platform specific configuration ect.

I prefer the Go model which although using common interfaces, encourages a more explicit creation for these things as this avoids "magic", which is important for reliable, deterministic operation of systems.

You still have common configuration/metrics/logging components, however these are instantiated based on the New routine in that catalog implementation.

Really it comes down to migrations being more of a considered action in most Go libraries, this ensures the configuration and security of these implementations is respected.

For example accessing an S3 data store could require specific role configuration for authentication, endpoints and event hooks, which may differ from the information used between different tables. A developer can do all this with the AWS SDK config, then pass it in when creating an IO for that particular store, and this is clearly specified in their code.

That said the operation of the catalog using a common interface is great, this i where other common components interact and operate so it is nice to have.

Again as I said before, this is just something I have observed as idiomatic in Go libraries, like the AWS SDK.

zeroshade · 2024-01-24T17:05:12Z

catalog/glue.go

+	// TODO: consider providing a way to directly access the S3 iofs to enable testing of the catalog.
+	iofs, err := io.LoadFS(map[string]string{}, location)


add optional key-value pairs as a slice of options arguments to this to pass to LoadFS instead of hard coding the empty map?

@zeroshade apologies, this is fixed, should have passed it in in the first place 😅

zeroshade

LGTM with one nit pick

HonahX

Sorry for being late here. Thanks for working on this! @wolfeidau

HonahX · 2024-01-23T06:30:53Z

catalog/glue.go

+
+	params := &glue.GetTablesInput{DatabaseName: aws.String(database)}
+
+	tblsRes, err := c.glueSvc.GetTables(ctx, params)


I checked https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/service/glue#GetTablesOutputand found that it also contains a NextToken field for pagination. Shall we handle this like what we did in python and java?

https://github.com/apache/iceberg-python/blob/defdac48b5ff236e2464a7f4666feb3d07e66a88/pyiceberg/catalog/glue.py#L599-L610

https://github.com/apache/iceberg/blob/4e62b58f04aba4ccc2a2a846494b478d3b03f58f/aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java#L300-L328

@HonahX good call, this is now fixed and matches the existing implementations.

…ation Also addressed a couple of linter suggestions.

Fokko

Two small comments, I think we're almost there.

Fokko · 2024-01-25T09:54:42Z

catalog/catalog.go

+
+var (
+	// ErrNoSuchTable is returned when a table does not exist in the catalog.
+	ErrNoSuchTable = errors.New("table does not exist")


I'm not a go expert, but typically we do something like:

Suggested change

ErrNoSuchTable = errors.New("table does not exist")

ErrNoSuchTable = errors.New("Table does not exist: {namespace.table_name}")

@Fokko Yeah I have done this in python, and Java, however in Go it is more common to have these static error types for things like NotFound so they can be used to match a specific error. This paired the being forced to handle the error in the same block means you can "wrap" the error at that point and include the table name.

A consumer of the API would do something like this.

table, err := cat.GetTable(identfier) if err != nil { // do something specific on not found if errors.Is(err, catalog.ErrNoSuchTable) { // something here } return fmt.Errorf("failed to get table (%s): %w", identifier, err) }

@wolfeidau Ideally cat.GetTable would return fmt.Errorf("failed to get table (%s): %w", identifier, ErrNoSuchTable) rather than making the consumer do that wrapping itself.

@zeroshade good point, fix added to wrap sentinel not found error.

Seems reasonable. Thanks for the explanation 👍

Fokko · 2024-01-25T09:56:47Z

catalog/glue.go

+	GetTable(ctx context.Context, params *glue.GetTableInput, optFns ...func(*glue.Options)) (*glue.GetTableOutput, error)
+	GetTables(ctx context.Context, params *glue.GetTablesInput, optFns ...func(*glue.Options)) (*glue.GetTablesOutput, error)


I think LoadTable is clearer, as I generally try to avoid Get in the name. Why do we need GetTables? Wouldn't calling LoadTable multiple times give the same result? This also ties in with the comment on ErrNoSuchTable above; if we don't give the table name in the exception, you don't know which one doesn't exists.

I think this is describing the AWS Glue API, which has GetTable and GetTables API methods

@Fokko yeah @jackye1995 is 100% correct on this, I am using an interface to enable mocking of just the operations in the AWS I want to consume. This pattern is covered here and is widely used in Go. https://aws.github.io/aws-sdk-go-v2/docs/unit-testing/#mocking-client-operations

One thing to note is this interface is private as well thanks to feedback from @zeroshade .

jackye1995 · 2024-01-25T22:55:14Z

catalog/catalog.go

+type Catalog interface {
+	// ListTables returns a list of table identifiers in the catalog, with the returned
+	// identifiers containing the information required to load the table via that catalog.
+	ListTables(ctx context.Context, namespace table.Identifier) ([]table.Identifier, error)


just for my ignorance in go, having namespace havng type as table.Identifier feels really confusing. Should we have something like typedef.Identifier like pyiceberg?

@jackye1995 yeah I think that Identifier will need to moved at some point, either to the root namespace being iceberg, which is a common convention, or somewhere else to avoid cyclic dependencies which will happen as this evolves.

Good call out, and something I did want to chat with @zeroshade about.

i'd be fine with shifting table.Identifier to instead be iceberg.Identifier if that would make things easier to understand and more reasonable.

+1 to doing that

HonahX

Thanks for fixing the pagination. I have no additional comments. I think this is a good initial implementation. 👍

zeroshade · 2024-01-26T23:07:08Z

catalog/catalog.go

+type Option func(*Options)
+
+type Options struct {
+	awsConfig aws.Config
+}


Any reason to export these? We don't expect consumers to create their own Option functions right? If so we'd need to export the members of the Options struct which we don't.

@zeroshade Yeah I am a bit on the fence with this, feels half in and half out when the Options is private.

I am questioning whether this should be a Config structure with public members, but I wanted to get a couple more catalog implementations in and adjust it then.

Again keen to get a bit more code in the catalog and adjust it based on needs.

If we're gonna go that route, I'd vote making the whole thing private for now, and then if we adjust it we can make it public or a public config struct later.

Personally my default is always to make things private until it's explicitly necessary for it to be public. Keeps the surface API as small as possible

@zeroshade cool, I have made options private so we can think about it once there is more implementations.

zeroshade · 2024-01-26T23:09:00Z

catalog/glue.go

+func GlueTableIdentifier(database string, tableName string) table.Identifier {
+	return []string{database, tableName}
+}
+
+// GlueDatabaseIdentifier returns a database identifier for a Glue database in the format [database].
+func GlueDatabaseIdentifier(database string) table.Identifier {
+	return []string{database}
+}


Could these be made generic rather than Glue specific? Any catalog specific logic should happen by manipulating the Identifier internally to that catalog, right?

@zeroshade I wanted to provide some helpers for handling this identifier tuple for Glue to make it SUPER clear and use the terminology of Glue, rather than providing some vague external table mapping indexes in a tuple to a given catalogues expectation.

Users of Glue should be able to clearly see their terminology in the API as it will get them running faster, and make troubleshooting easier.

The use of a tuple for identifier feels a bit odd in a Go API, ideally I think these semantics would be in the identifier using a struct with a dash of generics. Catalogs could then extend this when needed but yeah

Okay, that makes sense. We can always adjust later if we find we're duplicating things.

zeroshade · 2024-01-30T02:52:57Z

@wolfeidau Can you add check marks to the appropriate spots in the README for the functionality you're adding?

wolfeidau · 2024-01-30T10:49:17Z

@zeroshade done

nastra · 2024-01-31T15:44:19Z

thanks for the reviews everyone and thanks to @wolfeidau for getting this done!

wolfeidau added 4 commits January 17, 2024 12:19

feat: add support for catalogs with glue implementation to start

f7f933d

refactored to seperate table info from loading

56c761b

add an implementation of load table using the catalog

1bed31e

added integration test for load table

e992a05

github-actions bot added the INFRA label Jan 17, 2024

fix: added missing license headers from added files

4f5ce6d

Fokko reviewed Jan 19, 2024

View reviewed changes

catalog/catalog.go Outdated Show resolved Hide resolved

catalog/glue.go Outdated Show resolved Hide resolved

wolfeidau requested a review from Fokko January 19, 2024 08:20

zeroshade reviewed Jan 22, 2024

View reviewed changes

wolfeidau added 2 commits January 23, 2024 08:07

fix: do not expose glue client interface in public api

bc30773

feat: move to functional options to cleanup creation of catalog

6327be2

wolfeidau requested a review from zeroshade January 22, 2024 23:12

zeroshade reviewed Jan 24, 2024

View reviewed changes

zeroshade approved these changes Jan 24, 2024

View reviewed changes

fix: update the LoadTable to accept LoadFS props

1aa77dd

HonahX reviewed Jan 25, 2024

View reviewed changes

fix: ensure glue catalog list retrieves all table entries using pagin…

1efa2d0

…ation Also addressed a couple of linter suggestions.

wolfeidau requested a review from HonahX January 25, 2024 08:10

Fokko approved these changes Jan 25, 2024

View reviewed changes

jackye1995 reviewed Jan 25, 2024

View reviewed changes

HonahX approved these changes Jan 26, 2024

View reviewed changes

fix: wrap not found sentinel error to add more context based on feedback

da70023

wolfeidau requested a review from zeroshade January 26, 2024 19:02

zeroshade reviewed Jan 26, 2024

View reviewed changes

fix: make options private to enable changes later

595e0af

wolfeidau requested a review from jackye1995 January 28, 2024 12:02

docs: updated catalog support table in readme

b1c3974

wolfeidau requested a review from zeroshade January 30, 2024 10:49

zeroshade approved these changes Jan 30, 2024

View reviewed changes

nastra approved these changes Jan 31, 2024

View reviewed changes

nastra merged commit abd4fb0 into apache:main Jan 31, 2024
8 checks passed

		// TODO: consider providing a way to directly access the S3 iofs to enable testing of the catalog.
		iofs, err := io.LoadFS(map[string]string{}, location)


		params := &glue.GetTablesInput{DatabaseName: aws.String(database)}

		tblsRes, err := c.glueSvc.GetTables(ctx, params)

	ErrNoSuchTable = errors.New("table does not exist")
	ErrNoSuchTable = errors.New("Table does not exist: {namespace.table_name}")

		GetTable(ctx context.Context, params glue.GetTableInput, optFns ...func(glue.Options)) (*glue.GetTableOutput, error)
		GetTables(ctx context.Context, params glue.GetTablesInput, optFns ...func(glue.Options)) (*glue.GetTablesOutput, error)

feat: add support for catalogs with glue implementation to start #51

feat: add support for catalogs with glue implementation to start #51

Conversation

wolfeidau commented Jan 17, 2024

wolfeidau commented Jan 18, 2024

Fokko left a comment

Choose a reason for hiding this comment

wolfeidau commented Jan 19, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wolfeidau Jan 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wolfeidau Jan 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zeroshade left a comment

Choose a reason for hiding this comment

HonahX left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Fokko left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HonahX left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zeroshade commented Jan 30, 2024

wolfeidau commented Jan 30, 2024

nastra commented Jan 31, 2024

wolfeidau Jan 22, 2024 •

edited

Loading

wolfeidau Jan 24, 2024 •

edited

Loading

HonahX left a comment •

edited

Loading