Create LanceTableFactory implementing DataFusion's TableProviderFactory trait #3157

Open
matthewmturner opened this issue Nov 22, 2024 · 2 comments

Comments

@matthewmturner

I would like to add Lance as a supported file type in dft, similar to how we currently support deltalake and are working on hudi / Iceberg support. All of these formats are accessed via DataFusion's TableProviderFactory. I see that TableProvider is already implemented, so I am hoping that can be extended.

@westonpace
Contributor

I've made a quick PR to expose the table provider.

TableProviderFactory is a little interesting. I might need some help making sure I understand the various inputs.

Is the intention to open an existing table from a location? Or is the intention to create a brand new empty table? Or both?

  • schema - Lance datasets store the schema internally and so this is redundant if the dataset already exists. If we're creating a new dataset then I understand how this will be useful. On reads of existing datasets should we use it to apply a sort of "default projection" to the dataset when we open it? E.g. if the provided schema has x, y and the dataset schema is x, z, y should we hide z in the created dataset? Or should we just ignore this if the dataset already exists?
  • name - Understood
  • location - I'll interpret this as the URI to the lance dataset?
  • file_type - will ignore for now though I guess we can use it later to specify the storage version (e.g. 2.0, 2.1, ...) when creating tables.
  • table_partition_cols - if I ignore this will the dataframe not do any partitioning? Lance can support any number of partitions (e.g. if you want to scan a dataset with 10 threads we can evenly divide the work amongst 10 threads, regardless of the storage). I'm not entirely sure how to communicate this. This can be an optimization though.
  • if_not_exists - Understood
  • temporary - What exactly are the semantics for a temporary table? We have no notion of one right now so I'd probably just error out if this is true.
  • definition - Parsing this will be extra work. Can do in stages (e.g. first make sure we can open existing datasets and then make sure we can create new ones).
  • order_exprs - I'm a little confused why this is part of the create table and not part of the query
  • unbounded - I guess we just error if this is true? Or we can ignore?
  • options - Understood. We can add some here later (e.g. storage options, rows per file, etc.)
  • constraints - Not yet supported but may be someday (although these may be handled more easily by an eventual LanceDBTableProvider)
  • column_defaults - Not supported, will ignore for now

Does this sound correct?
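
To make the proposed handling of these inputs concrete, here is a minimal sketch of the "open an existing dataset" path. It assumes one possible answer to the open questions above: `temporary` and `unbounded` are rejected, `file_type` and `table_partition_cols` are ignored, and a provided schema is treated as a default projection. `CreateRequest`, `OpenPlan`, `PlanError`, and `plan_open` are hypothetical names invented for this illustration; they are simplified stand-ins, not DataFusion's or Lance's actual types.

```rust
/// Hypothetical, simplified stand-in for the `CreateExternalTable` fields
/// listed above. Names follow the list; the types are NOT DataFusion's.
#[derive(Debug, Default, Clone)]
pub struct CreateRequest {
    pub location: String,
    /// Requested column names; empty means "no schema provided".
    pub schema: Vec<String>,
    pub table_partition_cols: Vec<String>,
    pub temporary: bool,
    pub unbounded: bool,
}

#[derive(Debug, PartialEq)]
pub enum PlanError {
    /// The named input is set but has no Lance equivalent yet.
    Unsupported(&'static str),
    /// The provided schema names a column the dataset does not have.
    UnknownColumn(String),
}

/// What the factory would do when opening an existing dataset.
#[derive(Debug, PartialEq)]
pub struct OpenPlan {
    pub uri: String,
    /// Indices into the dataset's own schema, in the requested order.
    pub projection: Vec<usize>,
    /// Inputs that were accepted but had no effect.
    pub ignored: Vec<&'static str>,
}

pub fn plan_open(req: &CreateRequest, dataset_columns: &[&str]) -> Result<OpenPlan, PlanError> {
    // Assumption: error out rather than silently ignore these two flags.
    if req.temporary {
        return Err(PlanError::Unsupported("temporary"));
    }
    if req.unbounded {
        return Err(PlanError::Unsupported("unbounded"));
    }

    // `file_type` is always ignored for now; partition columns only if given.
    let mut ignored = vec!["file_type"];
    if !req.table_partition_cols.is_empty() {
        ignored.push("table_partition_cols");
    }

    // Assumption: a provided schema acts as a default projection that hides
    // dataset columns it does not mention.
    let projection: Vec<usize> = if req.schema.is_empty() {
        (0..dataset_columns.len()).collect()
    } else {
        req.schema
            .iter()
            .map(|wanted| {
                dataset_columns
                    .iter()
                    .position(|c| *c == wanted.as_str())
                    .ok_or_else(|| PlanError::UnknownColumn(wanted.clone()))
            })
            .collect::<Result<Vec<usize>, PlanError>>()?
    };

    Ok(OpenPlan {
        uri: req.location.clone(),
        projection,
        ignored,
    })
}
```

With the example from the schema bullet (provided schema `x, y`, dataset schema `x, z, y`), `plan_open` yields the projection `[0, 2]`, which hides `z`. If the alternative of ignoring the provided schema were chosen instead, the projection branch would simply always select every column.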

@matthewmturner
Author

matthewmturner commented Nov 23, 2024

@westonpace appreciate your quick and thoughtful response.

The intent here is to be able to write DDL like the following so that I can start reading the Lance format. (I believe the TableProviderFactory may also enable writing to the format, but I think that would only be the case if it was implemented by the TableProvider; don't quote me on this though.)

CREATE EXTERNAL TABLE my_table STORED AS LANCE LOCATION '/path/to/lance';

Here is an example of how we use the DeltaTableFactory for this purpose.

Unfortunately, I'm not familiar enough with Lance semantics to answer the specifics of how that maps to TableProviderFactory (but I'm hoping to start learning more about it, hence this issue ;) ). Here is some documentation on how it works, though, which can hopefully help.

To the extent it's feasible on your side, I think a v1 that only exposes the simplest functionality would be reasonable.
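
As a rough illustration of what such a v1 amounts to, the sketch below models how a `STORED AS LANCE LOCATION '...'` statement could be routed to a Lance factory that only opens existing datasets. It is a std-only toy model and not the DataFusion API: `FactoryRegistry`, `Factory`, `OpenedTable`, and `open_lance` are hypothetical names, and the case-insensitive lookup on the file type is an assumption made for this sketch.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a table handle; a real factory would return a
/// DataFusion `TableProvider` instead.
#[derive(Debug, PartialEq)]
pub struct OpenedTable {
    pub format: &'static str,
    pub uri: String,
}

/// Simplified factory signature: location in, table handle (or error) out.
pub type Factory = fn(&str) -> Result<OpenedTable, String>;

/// Maps the `STORED AS <file_type>` keyword to a factory.
#[derive(Default)]
pub struct FactoryRegistry {
    factories: HashMap<String, Factory>,
}

impl FactoryRegistry {
    /// Registers a factory under an upper-cased file type key.
    pub fn register(&mut self, file_type: &str, factory: Factory) {
        self.factories.insert(file_type.to_uppercase(), factory);
    }

    /// Looks up the factory for `file_type` and asks it to open `location`.
    pub fn create(&self, file_type: &str, location: &str) -> Result<OpenedTable, String> {
        let factory = self
            .factories
            .get(&file_type.to_uppercase())
            .ok_or_else(|| format!("no table factory registered for file type {file_type}"))?;
        factory(location)
    }
}

/// v1 behaviour in this sketch: only open an existing dataset at `location`;
/// creating new datasets is left for a later stage.
pub fn open_lance(location: &str) -> Result<OpenedTable, String> {
    if location.is_empty() {
        return Err("LOCATION must not be empty".to_string());
    }
    Ok(OpenedTable {
        format: "lance",
        uri: location.to_string(),
    })
}
```

For the DDL above, registering `open_lance` under `"LANCE"` and then calling `registry.create("LANCE", "/path/to/lance")` returns a handle for that URI, while an unregistered file type produces an error.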
