-
-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add network resources #491
base: main
Are you sure you want to change the base?
Conversation
This is awesome! I've been thinking about decoupling transport from format for a while now, but never had the change to work on it. Will take a look this week! |
Hi, @betodealmeida! I'm glad you found the changes interesting! About the failed integration test job: are you sure the Weather API key used in GitHub Actions is currently valid? I registered with WeatherAPI and obtained an API key. After that, I created a secret I also made the following fixes in previous commits:
|
The integration tests always fail for non-committers, since the secrets are not exposed to them. Otherwise someone could write a PR that grabs the secrets and sends them to a host they control. |
Oh, got it! Thanks! |
Foreword
In my company we use Apache Superset as a general BI solution. So, we encountered the case when we need to get data (CSV file in our case) from external resource (FTP in our case) and use data in our datasets. I found Shillelagh as the main solution to solve our problem, but for full use it was necessary to add to Shillelagh the ability to communicate with external network resources. Now I'm introducing new feature in Shillelagh - Network Resources.
Summary
A network resource is a resource located on a remote server and provides access to data over a network protocol.
The implementation of a network resource is a specific implementation of the functionality for obtaining data on a specific protocol from network resources that provide access to data on this protocol.
So, a file adapter in Shillelagh is an entity that answers the question "How to process specific data?", and a network resource answers the question "How to get this specific data to process it?".
NetworkResource in Shillelagh implement a single point for receiving data from external sources, after which this data will be used by the appropriate adapter for processing them. In turn, the Adapters' implementations no longer needs to worry about how to receive data, since this process is encapsulated in NetworkResourceImplementation class.
As part of the current PR, I have written three network resource implementations, these are HTTP(S), FTP and SFTP.
So, now we can use in our SQL query URL like:
sftp://user:[email protected]:22/path/to/file.ext
or
ftp://user:[email protected]:21/path/to/file.ext
We can omit the port in the URL if they are set by default: 22 for SFTP and 21 for FTP.
Using ContextVar approach
In my implementation, I used ContextVar as the global state containing the current instance of NetworkResource. This is due to two points:
I failed to throw the created NetworkResource instance through the arguments to the adapter due to their serialization process into a string.
When we are looking for a suitable adapter, generalizing, we cycle through all the adapters, calling the supports() method for each, which in turn can use a NetworkResource and create a connection to an external resource. I also wanted to reuse this connection at the moment of receiving data from an external resource.
Testing instructions
In the examples directory of the repository, I created two additional examples of Shillelagh working with FTP and SFTP. Before launching them, you should run the corresponding services in docker/docker-compose.yml
You can also up your FTP/SFTP resource or use an existing one to test this functionality.
For discussion
Right now I'm using a very simple method of getting the content type in the FTP and SFTP implementations with a simple check for the extension of the requested file in the URL. The good thing would be to go to an external resource and ask for the actual type of content. But I am concerned about the possible amount of data on an external resource: for a simple check, you will need to download all the data and it may happen that no adapter will be able to process the received data, in this case we could lose a lot of time.
Should I leave the current implementation or still go to a remote resource? Just with the second option, using ContextVar to save the created instance of NetworkResource is very suitable, so as not to request data again.