Query on C Interface Arrow Data #79
Comments
Do you mean something like this? https://duckdb.org/docs/guides/python/sql_on_arrow It would help if you shared your Python examples. |
Yes, that is correct: the example given at that URL. Ideally, being able to query a RecordBatchReader from Rust would be super useful. |
That isn't supported in the Rust API yet. For now, what you can do is load Parquet directly into duckdb. |
Hmmm... unfortunately I can't load parquet in duckdb because duckdb doesn't have support for Azure ADLS. Currently duckdb has no pluggable mechanism for loading parquet files. Is it possible to write an extension for duckdb in rust? |
What about downloading the Parquet file first and then loading it into duckdb? I haven't spent much time digging into duckdb extensions yet; I need time to investigate. |
Yeah, I could do that. Ideally it would be nice to provide an Arrow C interface so we can easily integrate Rust and duckdb. |
Yes, the API would be much simpler. But the logic is still the same: download the file and load it into duckdb. Reading into Arrow and then binding the data into duckdb might be slower than reading the Parquet file directly into duckdb. |
I am looking at this example in Python:
from adlfs import AzureBlobFileSystem
import duckdb
import pyarrow.parquet as pq
import pyarrow.dataset as ds
account_name = "xxxxx"
account_key = "xxxxx"
abfs = AzureBlobFileSystem(account_name=account_name, account_key=account_key, container_name="data")
pqdata = ds.dataset("path/inside/abfs", filesystem=abfs)
conn = duckdb.connect(":memory:")
conn.execute("SELECT * from pqdata LIMIT 10")
Something similar in Rust would be greatly helpful. |
Actually, this example may not work because it is missing one important step: conn.register("pqdata", pqdata). This register API is only useful when you already have some data in memory and want to query it with duckdb. If you only want to load Parquet and query it with duckdb, using the parquet extension is much more straightforward and faster.
-- load Parquet files in the data folder
SELECT * FROM read_parquet('./data/*.parquet'); |
@wangfenjin Can you provide some high level flow for writing |
Hi @kesavkolla, it seems we don't have a C API for this yet. I created an issue in the duckdb repo; let's wait for the response. |
I was just reading arrow-rs and found that it has something called ffi. Arrow ffi has implementations of a stream producer and consumer. It looks like duckdb has this arrow stream reader, which is what even the Python duckdb package uses. |
I don't understand your question. You have in-memory data in Arrow format; do you want to query it using duckdb, or using arrow-rs?
|
Hi, sorry to be chiming in here, but I have also been struggling to make sense of these. I've extracted a minimal example from my code (I'm not 100% sure that it works, as I combined a few things, but I think it should give a good idea).
use arrow::record_batch::RecordBatch;
use arrow::util::pretty::pretty_format_batches;
use duckdb::Connection;
let conn = Connection::open_in_memory()?;
// `sql` holds the query text; it is not defined in this extract
let mut stmt = conn.prepare(&sql)?;
// query_arrow yields the result set as Arrow RecordBatches
let batches = stmt.query_arrow([])?.collect::<Vec<RecordBatch>>();
// println! needs a format string as its first argument
println!("{}", pretty_format_batches(&batches)?);
I think this shows how you can get the data from DuckDB into HTH |
Yeah, I agree that the documentation is not very clear or complete. I may make some time to improve it, but I can't guarantee when. Also, I'm not a native English speaker, so writing the docs takes me longer. BTW, if you are looking for examples, you can refer to the test code, such as query_arrow. Every public API should have at least one test case, which can also serve as an example. |
@wangfenjin I looked into the duckdb Python code. Here is what the Python code does to register Arrow:
unique_ptr<DuckDBPyRelation> DuckDBPyConnection::FromArrow(py::object &arrow_object) {
    if (!connection) {
        throw ConnectionException("Connection has already been closed");
    }
    py::gil_scoped_acquire acquire;
    string name = "arrow_object_" + GenerateRandomName();
    if (!IsAcceptedArrowObject(arrow_object)) {
        auto py_object_type = string(py::str(arrow_object.get_type().attr("__name__")));
        throw InvalidInputException("Python Object Type %s is not an accepted Arrow Object.", py_object_type);
    }
    auto stream_factory =
        make_unique<PythonTableArrowArrayStreamFactory>(arrow_object.ptr(), connection->context->config);
    auto stream_factory_produce = PythonTableArrowArrayStreamFactory::Produce;
    auto stream_factory_get_schema = PythonTableArrowArrayStreamFactory::GetSchema;
    auto rel = make_unique<DuckDBPyRelation>(
        connection
            ->TableFunction("arrow_scan", {Value::POINTER((uintptr_t)stream_factory.get()),
                                           Value::POINTER((uintptr_t)stream_factory_produce),
                                           Value::POINTER((uintptr_t)stream_factory_get_schema)})
            ->Alias(name));
    rel->rel->extra_dependencies =
        make_unique<PythonDependencies>(make_unique<RegisteredArrow>(move(stream_factory), arrow_object));
    return rel;
}
Also, the duckdb C API test code shows a way to register a table function: https://github.com/duckdb/duckdb/blob/master/test/api/capi/capi_table_functions.cpp#L53
static void capi_register_table_function(duckdb_connection connection, const char *name,
                                         duckdb_table_function_bind_t bind, duckdb_table_function_init_t init,
                                         duckdb_table_function_t f) {
    duckdb_state status;
    // create a table function
    auto function = duckdb_create_table_function();
    duckdb_table_function_set_name(nullptr, name);
    duckdb_table_function_set_name(function, nullptr);
    duckdb_table_function_set_name(function, name);
    duckdb_table_function_set_name(function, name);
    // add a string parameter
    duckdb_logical_type type = duckdb_create_logical_type(DUCKDB_TYPE_BIGINT);
    duckdb_table_function_add_parameter(function, type);
    duckdb_destroy_logical_type(&type);
    // set up the function pointers
    duckdb_table_function_set_bind(function, bind);
    duckdb_table_function_set_init(function, init);
    duckdb_table_function_set_function(function, f);
    // register and cleanup
    status = duckdb_register_table_function(connection, function);
    REQUIRE(status == DuckDBSuccess);
    duckdb_destroy_table_function(&function);
    duckdb_destroy_table_function(&function);
    duckdb_destroy_table_function(nullptr);
}
It looks like they do have this register functionality. I do see duckdb_register_table_function in the libduckdb-sys crate that is part of this repo. I am not too familiar with how to use FFI in Rust; can you add this register function to InnerConnection and Connection? Then we can pass Arrow data to duckdb. Thanks in advance. |
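For readers less familiar with FFI in Rust, the pattern in the C++ snippet above (handing a table function three pointer-sized values: a factory object, a produce callback, and a get-schema callback) can be modeled with the standard library alone. This is only a toy sketch under stated assumptions: `ToyFactory`, `toy_arrow_scan`, and both callbacks are hypothetical names invented for illustration, the "schema" is reduced to a column count, and nothing here is DuckDB's or Arrow's actual API.

```rust
use std::ffi::c_void;

/// Hypothetical stand-in for a stream factory; NOT a DuckDB or Arrow type.
struct ToyFactory {
    column_count: usize,
    batches: Vec<Vec<i64>>,
}

/// Hypothetical callback types, playing the roles of `Produce` and `GetSchema`.
type ProduceFn = unsafe extern "C" fn(*mut c_void, usize, *mut usize) -> *const i64;
type GetSchemaFn = unsafe extern "C" fn(*mut c_void) -> usize;

/// Returns a pointer to batch `index` and writes its length to `out_len`,
/// or returns null once the batches are exhausted.
unsafe extern "C" fn produce(factory: *mut c_void, index: usize, out_len: *mut usize) -> *const i64 {
    let f = unsafe { &*(factory as *const ToyFactory) };
    match f.batches.get(index) {
        Some(batch) => {
            unsafe { *out_len = batch.len() };
            batch.as_ptr()
        }
        None => {
            unsafe { *out_len = 0 };
            std::ptr::null()
        }
    }
}

/// Reports the toy "schema" (here just the number of columns).
unsafe extern "C" fn get_schema(factory: *mut c_void) -> usize {
    unsafe { (*(factory as *const ToyFactory)).column_count }
}

/// Plays the role of the engine-side table function: it receives three
/// pointer-sized integers, casts them back, and drains the batches.
fn toy_arrow_scan(factory_addr: usize, produce_addr: usize, get_schema_addr: usize) -> (usize, i64) {
    let factory = factory_addr as *mut c_void;
    // SAFETY: the caller must pass addresses obtained from a live ToyFactory
    // and from functions with exactly these signatures.
    let produce_fn: ProduceFn = unsafe { std::mem::transmute(produce_addr) };
    let get_schema_fn: GetSchemaFn = unsafe { std::mem::transmute(get_schema_addr) };

    let columns = unsafe { get_schema_fn(factory) };
    let mut total = 0i64;
    let mut index = 0usize;
    loop {
        let mut len = 0usize;
        let ptr = unsafe { produce_fn(factory, index, &mut len) };
        if ptr.is_null() {
            break;
        }
        let values = unsafe { std::slice::from_raw_parts(ptr, len) };
        total += values.iter().sum::<i64>();
        index += 1;
    }
    (columns, total)
}

fn main() {
    // The factory must outlive the scan, which is the job that
    // `extra_dependencies` appears to do in the C++ snippet.
    let mut factory = ToyFactory { column_count: 1, batches: vec![vec![1, 2, 3], vec![4, 5]] };
    let factory_addr = &mut factory as *mut ToyFactory as usize;
    let (columns, total) = toy_arrow_scan(
        factory_addr,
        produce as ProduceFn as usize,
        get_schema as GetSchemaFn as usize,
    );
    println!("columns = {columns}, total = {total}");
}
```

The point of the sketch is the ownership rule rather than the arithmetic: the side that creates the factory has to keep it alive for as long as the consumer may call back into it, so a real Rust binding would need to tie the factory's lifetime to the registered relation or connection.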
You can try this feature: https://github.com/wangfenjin/duckdb-rs/blob/main/src/vtab/arrow.rs#L480 |
Are there plans to export |
This is exported now! #259 |
@kesavkolla do you have any updates or hints on whether you were able to query Arrow data via duckdb in Rust? We are trying to achieve the same thing for a different format that is not supported by duckdb but is supported by Arrow. |
Any updates on this? |
Can someone provide an example of how to seamlessly pass Arrow data to duckdb in Rust? I have the location of Parquet file(s) in Azure ADLSv2. I am currently using object_store and arrow-rs to read those Parquet files. I see that duckdb and Arrow work together seamlessly in Python. Is there an example someone can point to that shows how to do this in Rust?
Thanks in advance.