Query on C Interface Arrow Data #79

Open
kesavkolla opened this issue Sep 9, 2022 · 21 comments
Labels
feature Used for auto generate changelog

Comments

@kesavkolla

Can someone provide an example of how to seamlessly pass Arrow data to DuckDB in Rust? I have Parquet files in Azure ADLSv2, and I currently read them with object_store and arrow-rs. DuckDB and Arrow work together seamlessly in Python. Is there an example someone can point to that shows how to do the same in Rust?

Thanks in advance.

@wangfenjin
Collaborator

Do you mean something like this? https://duckdb.org/docs/guides/python/sql_on_arrow

It would help if you shared your Python example.

@kesavkolla
Author

Yes, that is correct: the example given at that URL. Ideally, being able to query a RecordBatchReader from Rust would be super useful.

@wangfenjin
Collaborator

That isn't supported in the Rust API yet. For now, what you can do is load the Parquet files directly into DuckDB.

#80

@wangfenjin wangfenjin added the feature Used for auto generate changelog label Sep 10, 2022
@kesavkolla
Author

Hmmm... unfortunately I can't load parquet in duckdb because duckdb doesn't have support for Azure ADLS. Currently duckdb has no pluggable mechanism for loading parquet files.

Is it possible to write an extension for duckdb in rust?

@wangfenjin
Collaborator

What about downloading the Parquet files first and then loading them into DuckDB?

I haven't spent much time digging into DuckDB extensions; I need time to investigate.

@kesavkolla
Author

Yeah, I could do that. Ideally it would be nice to provide an Arrow C interface so we can easily integrate Rust and DuckDB.

@wangfenjin
Collaborator

Yes, the API would be much simpler.

But the logic is still the same: download the file and load it into DuckDB. Reading into Arrow and then binding the data into DuckDB might be slower than reading the Parquet file directly into DuckDB.

@kesavkolla
Author

I am looking at this example in Python:

from adlfs import AzureBlobFileSystem
import duckdb
import pyarrow.parquet as pq
import pyarrow.dataset as ds

account_name = "xxxxx"
account_key = "xxxxx"
abfs = AzureBlobFileSystem( account_name = account_name, account_key = account_key, container_name = "data")
pqdata = ds.dataset("path/inside/abfs", filesystem=abfs)


conn = duckdb.connect(":memory:")
conn.execute("SELECT * from pqdata LIMIT 10")

Something similar in rust would be greatly helpful

@wangfenjin
Collaborator

Actually, this example may not work because it misses one important step:

       conn.register("pqdata",pqdata)

This register API is what we need to implement in Rust.

It is only useful when you already have some data in memory and want to query it with DuckDB. If you only want to load Parquet and query it with DuckDB, using the parquet extension is much more straightforward and faster.

-- load Parquet files in the data folder
SELECT * FROM read_parquet('./data/*.parquet');

@kesavkolla
Author

kesavkolla commented Sep 28, 2022

@wangfenjin Can you outline the high-level flow for writing the register API? I want to implement it and open a PR.

@wangfenjin
Collaborator

Hi @kesavkolla, it seems there is no C API for this yet. I've created an issue in the duckdb repo; let's wait for the response.

@kesavkolla
Author

I've been reading arrow-rs and found that it has an ffi module. Arrow's FFI has implementations of a stream producer and a stream consumer. It looks like DuckDB has an Arrow stream reader, which is what the Python DuckDB client uses too.
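To make "stream producer and consumer" a bit more concrete, here is a std-only sketch of the struct that the Arrow C stream interface passes between the two sides. The field layout follows the Arrow specification; everything else is an assumption for illustration: the opaque ArrowSchema and ArrowArray stand-ins, and the toy empty_stream producer, which leaves the mandatory callbacks unset and therefore is not a valid stream. It only demonstrates the layout and the release convention. Real code would use the types arrow-rs already provides in its ffi modules.

```rust
use std::ffi::c_void;
use std::os::raw::{c_char, c_int};

/// Opaque stand-ins for the structs of the Arrow C data interface.
#[repr(C)]
pub struct ArrowSchema {
    _private: [u8; 0],
}
#[repr(C)]
pub struct ArrowArray {
    _private: [u8; 0],
}

/// Field layout of the Arrow C stream interface: four callbacks plus
/// producer-owned private data.
#[repr(C)]
pub struct ArrowArrayStream {
    pub get_schema: Option<unsafe extern "C" fn(*mut ArrowArrayStream, *mut ArrowSchema) -> c_int>,
    pub get_next: Option<unsafe extern "C" fn(*mut ArrowArrayStream, *mut ArrowArray) -> c_int>,
    pub get_last_error: Option<unsafe extern "C" fn(*mut ArrowArrayStream) -> *const c_char>,
    pub release: Option<unsafe extern "C" fn(*mut ArrowArrayStream)>,
    pub private_data: *mut c_void,
}

/// Producer-side release callback: a released stream is marked by a null `release`.
unsafe extern "C" fn release_stream(stream: *mut ArrowArrayStream) {
    unsafe { (*stream).release = None };
}

/// Toy producer (illustrative only): no data callbacks, just the release hook.
fn empty_stream() -> ArrowArrayStream {
    ArrowArrayStream {
        get_schema: None,
        get_next: None,
        get_last_error: None,
        release: Some(release_stream),
        private_data: std::ptr::null_mut(),
    }
}

fn main() {
    let mut stream = empty_stream();
    // Consumer side: call `release` exactly once when done with the stream.
    if let Some(release) = stream.release {
        unsafe { release(&mut stream) };
    }
    println!("released: {}", stream.release.is_none());
}
```

A consumer such as DuckDB's Arrow scan only needs a pointer to a struct like this; it pulls the schema and batches through the callbacks and never sees the producer's internals.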

@wangfenjin
Collaborator

I don't understand your question.

So you have in-memory data in Arrow format. Do you want to query it using DuckDB, or using arrow-rs?

  • If it's arrow-rs, then you can use something like the arrow-rs ffi module or arrow-datafusion.
  • If you want to use DuckDB's SQL engine to query it, you have no choice but to make the data available to DuckDB first. That's why we need the register API.

@snth
Contributor

snth commented Sep 28, 2022

Hi,

Sorry to chime in here, but I have also been struggling to make sense of these Arrow interfaces to other systems in Rust, so perhaps I can help out a bit. It's unfortunate that while the Python documentation for most projects is excellent, there are very few docs and examples for Rust.

I've extracted a minimal example from my code (I'm not 100% sure that it works, as I combined a few things, but I think it should give a good idea).

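use arrow::record_batch::RecordBatch;
use arrow::util::pretty::pretty_format_batches;
use duckdb::Connection;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let conn = Connection::open_in_memory()?;

    // Placeholder query; substitute your own SQL here.
    let sql = "SELECT 42 AS answer";
    let mut stmt = conn.prepare(sql)?;

    // query_arrow yields the result as an iterator of Arrow RecordBatches.
    let batches = stmt.query_arrow([])?.collect::<Vec<RecordBatch>>();

    // pretty_format_batches returns a Display value, so println! needs a format string.
    println!("{}", pretty_format_batches(&batches)?);
    Ok(())
}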

I think this shows how you can get the data from DuckDB into arrow data structures and you can then work with them using other arrow functions, e.g. pretty_format_batches() in this example.

HTH

@wangfenjin
Collaborator

Yeah, I agree that the documentation is not very clear or complete. I may find some time to improve it, but I can't guarantee when. Also, I'm not a native English speaker, so writing docs takes me longer.

BTW, if you are looking for examples, you can refer to the test code, such as query_arrow. Every public API should have at least one test case, which can also serve as an example.

@kesavkolla
Author

@wangfenjin I looked into the duckdb python code. Here is what python code is doing for registering arrow:

unique_ptr<DuckDBPyRelation> DuckDBPyConnection::FromArrow(py::object &arrow_object) {
	if (!connection) {
		throw ConnectionException("Connection has already been closed");
	}
	py::gil_scoped_acquire acquire;
	string name = "arrow_object_" + GenerateRandomName();
	if (!IsAcceptedArrowObject(arrow_object)) {
		auto py_object_type = string(py::str(arrow_object.get_type().attr("__name__")));
		throw InvalidInputException("Python Object Type %s is not an accepted Arrow Object.", py_object_type);
	}
	auto stream_factory =
	    make_unique<PythonTableArrowArrayStreamFactory>(arrow_object.ptr(), connection->context->config);

	auto stream_factory_produce = PythonTableArrowArrayStreamFactory::Produce;
	auto stream_factory_get_schema = PythonTableArrowArrayStreamFactory::GetSchema;

	auto rel = make_unique<DuckDBPyRelation>(
	    connection
	        ->TableFunction("arrow_scan", {Value::POINTER((uintptr_t)stream_factory.get()),
	                                       Value::POINTER((uintptr_t)stream_factory_produce),
	                                       Value::POINTER((uintptr_t)stream_factory_get_schema)})
	        ->Alias(name));
	rel->rel->extra_dependencies =
	    make_unique<PythonDependencies>(make_unique<RegisteredArrow>(move(stream_factory), arrow_object));
	return rel;
}

Also, the duckdb C API test code shows a way to register a table function:

https://github.com/duckdb/duckdb/blob/master/test/api/capi/capi_table_functions.cpp#L53

static void capi_register_table_function(duckdb_connection connection, const char *name,
                                         duckdb_table_function_bind_t bind, duckdb_table_function_init_t init,
                                         duckdb_table_function_t f) {
	duckdb_state status;

	// create a table function
	auto function = duckdb_create_table_function();
	duckdb_table_function_set_name(nullptr, name);
	duckdb_table_function_set_name(function, nullptr);
	duckdb_table_function_set_name(function, name);
	duckdb_table_function_set_name(function, name);

	// add a string parameter
	duckdb_logical_type type = duckdb_create_logical_type(DUCKDB_TYPE_BIGINT);
	duckdb_table_function_add_parameter(function, type);
	duckdb_destroy_logical_type(&type);

	// set up the function pointers
	duckdb_table_function_set_bind(function, bind);
	duckdb_table_function_set_init(function, init);
	duckdb_table_function_set_function(function, f);

	// register and cleanup
	status = duckdb_register_table_function(connection, function);
	REQUIRE(status == DuckDBSuccess);

	duckdb_destroy_table_function(&function);
	duckdb_destroy_table_function(&function);
	duckdb_destroy_table_function(nullptr);
}

It looks like they do have this register functionality. I also see duckdb_register_table_function in the libduckdb-sys crate that is part of this repo.

I am not very familiar with FFI in Rust. Can you add this register function to InnerConnection and Connection? Then we can pass Arrow data to DuckDB.

Thanks in advance

@wangfenjin wangfenjin changed the title Example of using Arrow C interface Query on C Interface Arrow Data Dec 13, 2022
@wangfenjin
Collaborator

@duarten

duarten commented Jan 26, 2024

Are there plans to export ArrowVTab? The code is working fine for me so far.

@phillipleblanc
Contributor

> Are there plans to export ArrowVTab? The code is working fine for me so far.

This is exported now! #259

@navdeep710

navdeep710 commented Apr 7, 2024

@kesavkolla Do you have any updates or hints? Were you able to query Arrow data via DuckDB in Rust? We are trying to achieve the same for a different format that is not supported by DuckDB but is supported by Arrow.

@ajwerner

Any updates on this?
