Improve PerTableConfig's extensibility and make it format agnostic #297
Comments
I like the idea of cleaning this up. I've been struggling with what "perTable" actually means in this name. It doesn't really have per-table cardinality within it, and there isn't one provided per table in the sync. Why is this not just considered a sync or job context? I have questions about the class diagram:
I'm picturing a flow where there is a SyncSubmit class that gets handed a config like the above, which either comes from a YAML file or is assembled programmatically, and then uses the factory to instantiate the various clients and inject the client-specific config before init(). This limits the scope of what each class can see. Each targetClient gets a TargetClientConfig |
This is unclear to me. A single source table implies a
Agree, I too see it as a job config. Hence my proposal was to call it a PerTableSyncConfig.
A target table identifier alone is not enough, as #296 shows. For example, the input should specify where to create the target table. In other cases, the target table format may need catalog configuration. Hence the proposal to abstract it like a table.
Source and target tables are representations of the table configs provided by the user in the per-table config. They will be used for instantiating Delta, Hudi, or Iceberg clients. I guess a better name should be found for this.
They are different. For simplicity I did not put details there. Configs like target metadata retention belong to Target tables only. |
@ashvina - perhaps the injection should actually be done at the property level. Each client provides a setter for each param that it requires, and the SyncSubmit code would pull all the properties from the relevant config and inject them as needed. That leaves the syncconfig object out of the clients altogether. Thoughts? |
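A minimal sketch of what property-level injection could look like, assuming a client that exposes one setter per required parameter. The setter names, the `DemoTargetClient` class, the config keys, and the default retention of 168 hours are all hypothetical and chosen only for illustration; they are not taken from the project's actual interfaces.

```java
import java.util.Map;

// Hypothetical client contract: setters plus init(), no sync config object.
interface TargetClient {
    void setBasePath(String basePath);
    void setMetadataRetentionHours(int hours);
    void init();
}

// Hypothetical client used only to demonstrate the flow.
class DemoTargetClient implements TargetClient {
    private String basePath;
    private int metadataRetentionHours;
    private boolean initialized;

    @Override public void setBasePath(String basePath) { this.basePath = basePath; }
    @Override public void setMetadataRetentionHours(int hours) { this.metadataRetentionHours = hours; }

    @Override public void init() {
        // The client validates what it was given; it never sees the full config.
        if (basePath == null) {
            throw new IllegalStateException("basePath must be set before init()");
        }
        initialized = true;
    }

    String describe() {
        return basePath + " retention=" + metadataRetentionHours + "h initialized=" + initialized;
    }
}

// Hypothetical submitter: the only code that reads the sync config.
final class SyncSubmit {
    private SyncSubmit() {}

    static void configure(TargetClient client, Map<String, String> syncConfig) {
        client.setBasePath(syncConfig.get("basePath"));
        // Assumed default of 168 hours when the key is absent.
        client.setMetadataRetentionHours(
            Integer.parseInt(syncConfig.getOrDefault("metadataRetentionHours", "168")));
        client.init();
    }
}
```

With this shape, the client depends only on its own setters, so the sync config type can change without touching client code.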
I agree with you. The sync task's config model should not depend on how it is configured. By the way, about the terminology, in |
@ashvina and @the-other-tim-brown - I've just been playing around with rolling our own injection mechanism to address PerTableConfig. I think that this will greatly simplify what we need to do. By doing this, we eliminate the need for the PerTableConfig interface in the api module and therefore the implementation specifics as well. By following the below steps within the Factory class, all of the required common and implementation-specific properties are set in between the ServiceLoader creating the uninitialized client and the call to init(). The knowledge of all those properties remains in the core module and the clients, and the clients can further process them within init() if needed.
This allows for extensibility of the getters and setters as new clients are added and new config is required. It allows the client-specific interfaces to remain unknown to the factory code, with reflection-based casting being used for the implementation-specific setters (this was a tricky part). It may also mean that we don't have to change PerTableConfig much, if at all. We can let it represent the configured sync job and not worry about separating it into specific format configs or the like. Here is some POC code that represents the injection stuff that I describe above. It will need to be productized and cleaned up but shows what I am talking about.
The above would be called from the existing factory method as such - notice that PerTableConfig no longer needs to be in the TargetClient interface:
I'm wondering whether we should investigate the injection before the refactoring of the PerTableConfig to see what is still needed after. What do you think? |
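For readers following the thread, here is an independent, minimal sketch of reflection-based setter injection. It is not the POC referred to in this comment. `SetterInjector`, `DemoClient`, and the property names are hypothetical, and the client is instantiated directly instead of through ServiceLoader so that the example stays self-contained.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical helper: calls setXyz(value) for every property "xyz" that is present.
final class SetterInjector {
    private SetterInjector() {}

    static void inject(Object client, Map<String, Object> properties) {
        for (Method method : client.getClass().getMethods()) {
            String name = method.getName();
            // Only consider single-argument methods named setSomething.
            if (!name.startsWith("set") || name.length() <= 3 || method.getParameterCount() != 1) {
                continue;
            }
            String property = Character.toLowerCase(name.charAt(3)) + name.substring(4);
            Object value = properties.get(property);
            if (value == null) {
                continue; // Property not configured; the client keeps its default.
            }
            try {
                method.invoke(client, value);
            } catch (IllegalAccessException | InvocationTargetException e) {
                throw new IllegalStateException("Could not inject property " + property, e);
            }
        }
    }
}

// Hypothetical client with one common and one implementation-specific setter.
class DemoClient {
    private String tableName;
    private int snapshotRetentionHours;
    private boolean initialized;

    public void setTableName(String tableName) { this.tableName = tableName; }
    public void setSnapshotRetentionHours(int hours) { this.snapshotRetentionHours = hours; }
    public void init() { initialized = true; }

    public String describe() {
        return tableName + " retention=" + snapshotRetentionHours + "h initialized=" + initialized;
    }
}
```

The injector never references `DemoClient` by type, which is the property the comment is after: the calling code does not need to know the implementation-specific interface, and properties with no matching setter are simply ignored.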
I will need some time to think through the sample you provided. My gut is saying to try to tackle the PerTableConfig and cleanup of some of these format specific things leaking into the common interfaces first to better understand how we want to do injection. |
@the-other-tim-brown - sure. Check out the PR #307 as a POC implementation. I think if the injection filters out the format specifics then the config representing the submitted sync can be mixed just as the yaml would be. But I am up for adjusting it to whatever we end up with. |
@ashvina , @the-other-tim-brown - can we step back and revisit the problem statement for the PerTableConfig refactoring here? If we can limit the exposure of PTC to the Factory, then it can more closely resemble the sync details and not worry about the table format specifics being in there. I do think that certain implementation classes used inside PTC itself, for things like adding default implementations, should maybe be moved into the TargetClients themselves. The PTC should maybe also contain just config values rather than object references. The consumers could then react to null or missing values and create the defaults themselves, for instance. What, if anything, about #296 do we still need to consider in order for the metadata to be located in some other location? Is this just additional format-specific config values for location? |
@ashvina , @the-other-tim-brown - it does occur to me that there is still a lack of extensibility in the config (even with the injection) due to the fact that the configuration details are built into the config class interface. If we were to add the ability to provide additional name/value pairs then you could add an arbitrary targetClient and specific config to PerTableConfig through that generic mechanism. Maybe with target specific prefixes to the params. Injection would also be able to handle those params. |
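A minimal sketch of the prefixed name/value idea, assuming the generic properties are carried as a flat string map and each target's keys start with its lowercase format name followed by a dot. The `PrefixedProperties` class and the example keys are hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that selects the properties meant for one target format.
final class PrefixedProperties {
    private PrefixedProperties() {}

    static Map<String, String> forTarget(Map<String, String> all, String targetFormat) {
        String prefix = targetFormat.toLowerCase(Locale.ROOT) + ".";
        Map<String, String> scoped = new TreeMap<>();
        for (Map.Entry<String, String> entry : all.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                // Strip the prefix so the client sees only its own key names.
                scoped.put(entry.getKey().substring(prefix.length()), entry.getValue());
            }
        }
        return scoped;
    }
}
```

For example, given the keys `iceberg.catalog.uri`, `hudi.partition.spec`, and `basePath`, calling `forTarget(all, "ICEBERG")` returns a map containing only `catalog.uri`. A new target format can then be configured without adding fields to the config class.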
@lmccay. I don't think we should mix injection and extensibility of PTC in one issue. PTC should be extensible regardless. Let's focus only on PTC here. |
I am off from work this week, so I will be able to put up some suggestions in the form of PRs for breaking down the existing PerTableConfig and making it extensible for users who call the sync method directly. One option is to move towards a model like Hudi uses where you have a |
@the-other-tim-brown , @ashvina - I am very used to the hadoop style config and therefore like that idea. |
@ashvina - I didn't mean to suggest that injection should be part of this, sorry. I was just pointing out that solving the leaking of table specifics with something like injection doesn't address the extensibility of PTC itself. Can you explain what external tables are, if they are not source and target tables? Probably bumping up against my ignorance here. |
@lmccay the "external" was used in the original description to distinguish between the internal data model for a table. The diagram at the top of the issue shows what it is meant to represent. External may not be the best term to use since data warehouses have external tables as well. |
This enhancement request stems from 1) the conversation in #293 and 2) emerging needs such as #296. The intent is to clean up the PerTableConfig.
OneTable's sync flow is based on PerTableConfig, which is the input configuration provided by the user. The sync process translates metadata of a single source table into one or more target tables. The user must provide this config for the translation process to be successful.
This image below illustrates the current structure of PerTableConfig.
However, the use cases have changed over time and now require more flexibility and compatibility with different table formats. A different location may be required for generating the metadata of the target table. In that case, the path to that location should also be provided. Additionally, the target table may have a connection to another catalog instance. Which means that the target table requires not just a format identifier, but also some of the configurations that are currently provided for a source table only.
The current configuration object includes some configurations that are specific to Iceberg and Hudi formats. These configurations should be wrapped by input configuration instances that are specific to each format.
The following image shows the proposed PerTableConfig.
PerTableConfig is a TableSyncConfig, because it is a configuration for synchronizing a table.
ExternalTable that can be either a source table or a target table. The ExternalTable clearly differentiates between internal representation and external table.
A possible class structure for representing the table config is the topic I want to discuss.
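As a starting point for that discussion, here is a minimal sketch of one possible class structure, assuming a TableSyncConfig that holds one source ExternalTable and one or more target ExternalTables. The field names (`format`, `name`, `metadataPath`, `formatProperties`) are hypothetical and only illustrate the needs described above: a separate metadata location and per-table catalog or format settings.

```java
import java.util.List;
import java.util.Map;

// Hypothetical user-facing table description, usable as a source or a target.
final class ExternalTable {
    final String format;                        // e.g. "HUDI", "ICEBERG", "DELTA"
    final String name;
    final String metadataPath;                  // where the table's metadata is read or written
    final Map<String, String> formatProperties; // catalog or format-specific settings

    ExternalTable(String format, String name, String metadataPath,
                  Map<String, String> formatProperties) {
        this.format = format;
        this.name = name;
        this.metadataPath = metadataPath;
        this.formatProperties = Map.copyOf(formatProperties);
    }
}

// Hypothetical sync config: a single source table and one or more targets.
final class TableSyncConfig {
    final ExternalTable source;
    final List<ExternalTable> targets;

    TableSyncConfig(ExternalTable source, List<ExternalTable> targets) {
        if (targets.isEmpty()) {
            throw new IllegalArgumentException("at least one target table is required");
        }
        this.source = source;
        this.targets = List.copyOf(targets);
    }
}
```

Keeping format-specific settings in a per-table map is only one option; typed, format-specific config classes wrapped by each ExternalTable would match the proposal above more closely at the cost of more types.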