This is the implementation of the Engine contract of Open Data Fabric using the Apache Spark data processing framework. It is currently in use in the kamu-cli data management tool.
- The Spark engine currently provides the richest SQL dialect for map/filter-style transformations (see the sketch after this list)
- Integrates GeoSpark to provide geo-spatial SQL functions
- It is used by kamu-cli for ingesting data into Parquet
- It is used by kamu-cli along with Apache Livy to provide SQL query functionality in Jupyter notebooks
- Takes a long time to start up, which hurts the user experience
- Does not support temporal table joins
- You might be better off using the Flink-based engine for joining and aggregating event streams
- TODO
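
The following is a minimal, self-contained Spark sketch of the kind of map/filter transformation and Parquet output mentioned above. The dataset, column names, output path, and local session setup are purely illustrative assumptions, not part of the ODF engine contract; in practice the queries are defined in dataset metadata and the engine is invoked by kamu-cli.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: illustrates a map/filter-style SQL transformation and a
// Parquet write, not the engine's actual entry point.
object MapFilterSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; the real engine receives its session
    // configuration from the ODF execution environment.
    val spark = SparkSession.builder()
      .appName("map-filter-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input slice of a dataset
    val input = Seq(
      ("2020-01-01T00:00:00Z", "vancouver", 12.5),
      ("2020-01-01T00:00:00Z", "seattle", -3.0),
      ("2020-01-02T00:00:00Z", "vancouver", 15.0)
    ).toDF("event_time", "city", "temperature_c")
    input.createOrReplaceTempView("weather")

    // Map/filter-style transformation: project columns, derive a new one,
    // and filter rows. With GeoSpark's SQL functions registered in the
    // session, queries like this can also use geo-spatial predicates
    // such as ST_Contains.
    val result = spark.sql(
      """
        |SELECT
        |  event_time,
        |  city,
        |  temperature_c,
        |  temperature_c * 9 / 5 + 32 AS temperature_f
        |FROM weather
        |WHERE temperature_c > 0
        |""".stripMargin)

    // Ingested and derived data ends up stored as Parquet
    result.write.mode("overwrite").parquet("/tmp/weather_derived")

    spark.stop()
  }
}
```
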
See the Developer Guide.