🐛 Bug: Cannot install connectors on Databricks/Spark (also Render.com and Replit.com) #78
Comments
@betizad - thanks for creating this issue! Have you tried skipping the install of the connector? PyAirbyte is able to install your connectors in their own dedicated virtual environments, and it does this by default in order to prevent version conflicts.
Alternatively, you can use a tool like
I tried letting Airbyte install the needed library, but it did not work. I get the following error:
My current workaround is:
@betizad - I think I see the issue here. From the logs, I see you are running in Databricks/Spark, and their runtime apparently does not support the venv library - or the I'm glad to hear you have a temporary workaround, but we'd still like to find a solution that works for Databricks users broadly. (Updated the title of this issue to reflect what I now think is the root cause.) Can you provide the specifics of your runtime? And can you try the workaround which we applied to Colab? In Colab, our examples like this one start with
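One quick way to check whether a given runtime can create virtual environments at all is a small probe like the one below. This is only a sketch using the standard library `venv` module; the function name `can_create_venv` is made up for illustration, and it is not something PyAirbyte provides.

```python
import tempfile
from pathlib import Path


def can_create_venv() -> bool:
    """Return True if this runtime can create a virtual environment.

    The environment is created in a throwaway temp directory and without
    pip, so the check stays fast and needs no network access.
    """
    try:
        import venv  # imported here so a missing module is reported as False

        with tempfile.TemporaryDirectory() as tmp:
            env_dir = Path(tmp) / "probe-venv"
            venv.create(env_dir, with_pip=False)
            # venv writes a pyvenv.cfg marker file into every environment.
            return (env_dir / "pyvenv.cfg").is_file()
    except Exception:
        # Any failure (missing module, permissions, read-only filesystem)
        # means venv-based isolation is not available here.
        return False


if __name__ == "__main__":
    print("venv supported:", can_create_venv())
```

Running this in a notebook cell on the affected cluster should show whether the failure comes from `venv` itself or from a later step such as the pip install inside the new environment.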
I'm running into a similar problem on another platform (Replit). Replit is built on Nix and I suspect there are some permissions / config issues with trying to install venvs into the project folder.
My workaround:
import os
import airbyte as ab

SOURCE_GOOGLE_SHEETS = "source-google-sheets"
# Point PyAirbyte at the connector executable that is already installed,
# instead of letting it create a virtual environment.
source = ab.get_source(
    name=SOURCE_GOOGLE_SHEETS,
    local_executable=f".pythonlibs/bin/{SOURCE_GOOGLE_SHEETS}"
)
source.set_config({
    "credentials": {
        "auth_type": "Service",
        "service_account_info": os.environ["SERVICE_ACCOUNT_JSON"]
    },
    "spreadsheet_id": SPREADSHEET_ID  # defined elsewhere
})
Of course, that presents its own challenges because now there are dependency issues 😅 Would love to find a solution for environments with challenging venv configurations.
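If the connector's location differs between environments, the hard-coded path in the workaround above could be replaced by a lookup. The following is a rough sketch using only the standard library; `find_connector_executable` and its `extra_dirs` parameter are hypothetical names, and `.pythonlibs/bin` is simply the Replit directory mentioned above.

```python
import os
import shutil
from typing import Optional, Sequence


def find_connector_executable(
    name: str,
    extra_dirs: Sequence[str] = (".pythonlibs/bin",),
) -> Optional[str]:
    """Return the path to an already-installed connector CLI, or None."""
    # Look in the explicitly listed directories first ...
    search_path = os.pathsep.join(extra_dirs)
    found = shutil.which(name, path=search_path) if search_path else None
    # ... then fall back to whatever is on the regular PATH.
    return found or shutil.which(name)


if __name__ == "__main__":
    print(find_connector_executable("source-google-sheets"))
```

The returned path can then be passed as `local_executable=` to `ab.get_source(...)`. A `None` result means the connector has not been pip-installed into the current environment yet.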
It took me a while to get back to this. I'm using: The workaround in Colab does not work in DBX. If I run
@betizad and @mattppal - Thank you both for sharing more about your context and execution requirements. I did a bit of digging (mostly ChatGPT 🙄) and I believe I've confirmed that in both the Spark and the Replit runtimes, there is no ability to create an 'isolated virtual env', which we would need to ensure proper dependency isolation. If we don't want to roll the dice on a per-connector basis about whether the connectors will have conflicts with each other and/or with PyAirbyte or other libraries that you are using in these environments, I can think of two decent paths forward:

Option 1: Leverage Conda across connectors and PyAirbyte to align dependency versions

This requires net new work on the side of Airbyte, and it would (probably?) also require some work from the user in terms of interacting with Conda or building a Conda environment. This has an added benefit of streamlining usage in other environments that have Conda-based delivery integration - for instance with Snowflake's Snowpark Python runtime.

Option 2: Use a tool like Shiv or PyOxidizer to pre-build the connector executable

In this approach, we would design a process to build connectors into CLI executables - and the executable itself would handle delivery of dependencies and the needed environment isolation. I believe this would work well in the case of Replit, where the executable would be uploaded to the Replit environment and then invoked/called by PyAirbyte. But getting this working correctly in a Spark cluster could be more complicated, since you'd need to ensure the CLI executable is available to all nodes in the cluster. (Not impossible, but also probably not a trivial effort.)

@betizad and @mattppal - I'm curious about your thoughts on both of these approaches. Let me know if one or both seem like they could be a good fit, and/or if you have any other ideas not mentioned above. Thanks! 🙏
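To make Option 2 a bit more concrete, here is a minimal sketch of the "single-file executable" idea using the standard library `zipapp` module. Shiv produces archives in the same PEP 441 zipapp format, but this example does not use Shiv or PyOxidizer, and the file names and the `build_and_run_demo` helper are invented for illustration. A real connector build would also need to bundle its dependencies.

```python
import subprocess
import sys
import tempfile
import zipapp
from pathlib import Path


def build_and_run_demo() -> str:
    """Build a tiny single-file executable archive and return its output."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "connector_src"
        src.mkdir()
        # __main__.py is the entry point a zipapp runs. A real connector
        # build would also copy its pinned dependencies into this directory.
        (src / "__main__.py").write_text(
            'print("hello from bundled connector")\n'
        )

        # Keep the target outside the source directory so the archive
        # does not try to include itself.
        target = Path(tmp) / "demo-connector.pyz"
        zipapp.create_archive(src, target)

        # Run the archive with the current interpreter, the same way a
        # caller would invoke a pre-built connector executable.
        result = subprocess.run(
            [sys.executable, str(target)],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(build_and_run_demo())
```

The resulting `.pyz` file is the kind of artifact that could be uploaded to Replit and referenced through `local_executable`. On Spark it would still have to be distributed to every node, as noted above.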
Circling back to this thread - A few other runtimes have been requested since my last post. Cethan in Slack has reported difficulty deploying on www.render.com, and separately we've had some progress getting this to work with Airflow. The trick that worked in Airflow was to use a Dockerfile that handles the isolation of installing the connectors into their own virtualenvs:
If
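As a rough illustration of the per-connector isolation described above, the steps such a Dockerfile would run for each connector can be sketched as plain commands. This is not the actual Dockerfile from the Airflow setup; the `venv_install_commands` helper and the `/opt/connectors` root are assumptions, and the `airbyte-<connector>` package naming is taken from the `airbyte-source-linkedin-ads` package mentioned in this issue.

```python
from pathlib import PurePosixPath
from typing import List


def venv_install_commands(
    connector: str,
    root: str = "/opt/connectors",  # hypothetical install location
) -> List[List[str]]:
    """Return the commands that give one connector its own virtualenv."""
    env_dir = PurePosixPath(root) / connector / ".venv"
    return [
        # 1. Create a dedicated virtual environment for this connector.
        ["python", "-m", "venv", str(env_dir)],
        # 2. Install the connector package with that environment's pip.
        [str(env_dir / "bin" / "pip"), "install", f"airbyte-{connector}"],
    ]


if __name__ == "__main__":
    for cmd in venv_install_commands("source-linkedin-ads"):
        print(" ".join(cmd))
```

Each printed line could become a `RUN` step in an image build. The connector's executable would then live under that environment's `bin` directory, which is the kind of path the `local_executable` workaround earlier in this thread points at.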
Circling back here again. 👋 Very happy to announce that we have a new "yaml" installation option that works for ~135 different API source connectors - along with all custom connectors built with our no-code Connector Builder. We're also investing heavily in migrating Python connectors to the no-code/low-code framework, which means the number of supported connectors will continue to grow. Here is a Loom I recorded to walk through the feature:
Compiling a related list of docs and resources here:
When I try to install airbyte and airbyte-source-linkedin-ads, I get the following error.
I install in Databricks using the command
%pip install airbyte==0.7.2 airbyte-source-linkedin-ads==0.7.0
When I do the same on a local machine, linkedin-ads is installed in a new venv, which does not work in Databricks.