
Support of setting the arangoDB name on the configuration #772

Open
zacayd opened this issue Dec 5, 2023 · 7 comments
zacayd commented Dec 5, 2023

Hi,
I am using Spline to capture lineage from Databricks notebooks. I set the following on the cluster, in the advanced settings:

spark.spline.mode ENABLED
spark.spline.lineageDispatcher.http.producer.url http://10.0.19.4:8080/producer
spark.spline.lineageDispatcher http

Since I have several customers, I don't want to keep all of their data in the same ArangoDB, so I want a way for the lineage to be stored in a separate database per customer.

Can we also send the ArangoDB database name as a parameter, so that the execution plan lineage data is kept in a different database for each cluster I use?

Thanks in advance

wajda (Contributor) commented Dec 5, 2023

No, this isn't possible. The database is an internal part of the system and is not something you can easily select on a request basis.

My recommendation for your use case would be to simply augment your execution plan and event objects with the DBR cluster name stored as an extra parameter, or a tag, and filter on that in the UI (a beta of this feature is available in the develop version of the server and the UI).
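As a rough illustration, attaching an extra attribute like this can be done via the agent's post-processing filter mechanism. The sketch below is an assumption-laden example, not a verified config: the exact property keys and rule syntax should be checked against the README of the agent version you run, and `clusterName` / `customer-a-cluster` are purely illustrative values.

```
# Sketch only - verify keys against your spline-spark-agent version's docs.
spark.spline.postProcessingFilter=userExtraMeta
spark.spline.postProcessingFilter.userExtraMeta.rules={"executionPlan":{"extra":{"clusterName":"customer-a-cluster"}}}
```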

Alternatively, you may augment the URIs for the input/output sources to include the cluster name as a part of the name. That is another way to logically separate the lineage data.

If you absolutely want to use different DBs then you can run separate Spline instances, put a custom proxy gateway in front of the Spline Producer REST API (or implement a custom LineageDispatcher wrapper) and route your requests to different Spline instances based on your custom conditions.
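The proxy-gateway approach can be sketched minimally, for example in Python with only the standard library. Everything here is hypothetical: the `X-Cluster-Name` header, the backend URLs, and the cluster-to-instance mapping are assumptions for illustration, not part of Spline itself; the agent side would need a custom `LineageDispatcher` wrapper (or similar) to send the identifying header.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

# Hypothetical mapping: each customer's cluster gets its own Spline instance
# (and therefore its own ArangoDB database behind it).
BACKENDS = {
    "customer-a": "http://spline-a:8080/producer",
    "customer-b": "http://spline-b:8080/producer",
}
DEFAULT_BACKEND = "http://spline-default:8080/producer"


def route_backend(cluster_name: str) -> str:
    """Pick the Spline Producer base URL for a given cluster name."""
    return BACKENDS.get(cluster_name, DEFAULT_BACKEND)


class ProducerProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        # "X-Cluster-Name" is an assumed header the agent side would have
        # to add, e.g. via a custom LineageDispatcher wrapper.
        cluster = self.headers.get("X-Cluster-Name", "")
        base = route_backend(cluster)
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        req = Request(
            base + self.path,
            data=body,
            headers={"Content-Type": self.headers.get("Content-Type", "application/json")},
            method="POST",
        )
        # Forward the request to the selected Spline instance and relay the response.
        with urlopen(req) as resp:
            self.send_response(resp.status)
            self.end_headers()
            self.wfile.write(resp.read())


# To run the gateway:
#   HTTPServer(("0.0.0.0", 8080), ProducerProxy).serve_forever()
```

The routing decision is kept in a pure function (`route_backend`) so the mapping can be tested and swapped out independently of the HTTP plumbing.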

zacayd (Author) commented Dec 5, 2023

Regarding "DBR cluster name stored as an extra parameter, or a tag, and filter the stuff on the UI based on that (the feature beta is available in the develop version of the server and the UI)":

Do you mean that the name of the cluster is on the execution plan?

zacayd (Author) commented Dec 5, 2023

Is the feature beta available as a Maven package for Databricks?

wajda (Contributor) commented Dec 6, 2023

No. You need to build and install it from the latest development branch.

zacayd (Author) commented Dec 6, 2023

Any chance that it will be available on the Databricks cloud soon? I'm having trouble building and installing it.

wajda (Contributor) commented Dec 6, 2023

No ETA, unfortunately. The team has no capacity and the business priorities have changed, so the project is on hold at the moment.

zacayd (Author) commented Jan 7, 2024

Hi,
I managed to compile the project, create a JAR, and load it via DBFS. But it seems that when I run the notebook, I get lineage data while the notebook info is missing.
I used the develop branch:
https://github.com/AbsaOSS/spline-spark-agent/tree/develop
Can you advise?
