Changelog

0.12.0 (2024-01-14)

Full Changelog

Documentation updates:

  • docs: fix link #799 (haoxins)

Merged pull requests:

  • [minor] remove outdated todo #683 (Ted-Jiang)
  • Add executor terminating status for graceful shutdown #667 (thinkharderdev)
  • Allow BallistaContext::read_* methods to read multiple paths. #679 (luckylsk34)
  • Update scheduler.md #657 (psvri)
  • Mark SchedulerState as pub #688 (Dandandan)
  • Update graphviz-rust requirement from 0.5.0 to 0.6.1 #651 (dependabot[bot])
  • Upgrade DataFusion to 19.0.0 #691 (r4ntix)
  • Update release docs #692 (andygrove)
  • Mark SchedulerServer::with_task_launcher as pub #695 (Dandandan)
  • Make task_manager pub #698 (Dandandan)
  • Add ExecutionEngine abstraction #687 (andygrove)
  • Allow accessing s3 locations in client mode #700 (luckylsk34)
  • git clone branch incorrect #699 (BubbaJoe)
  • Fix for error message during testing #707 (yahoNanJing)
  • Upgrade datafusion to 20.0.0 & sqlparser to 0.32.0 #711 (r4ntix)
  • Update README.md #729 (jiangzhx)
  • Update link to scheduler proto file in dev docs #713 (JAicewizard)
  • Fix show tables fails #715 (r4ntix)
  • Remove redundant fields in ExecutorManager #728 (yahoNanJing)
  • Fix parameter '--config-backend' to '--cluster-backend' #720 (paolorechia)
  • Upgrade DataFusion to 21.0.0 #727 (r4ntix)
  • [minor] remove useless brackets #739 (Ted-Jiang)
  • Only decode plan in LaunchMultiTaskParams once #743 (Dandandan)
  • Upgrade DataFusion to 22.0.0 #740 (r4ntix)
  • [feature] support shuffle read with retry when facing IO error. #738 (Ted-Jiang)
  • [log] Print long running task status. #750 (Ted-Jiang)
  • Upgrade DataFusion to 23.0.0 #755 (yahoNanJing)
  • Fix plan metrics length and stage metrics length not match #764 (yahoNanJing)
  • added match arms to create ClusterStorageConfig #766 (BokarevNik)
  • [Improve] refactor the offer_reservation to avoid waiting for the result #760 (Ted-Jiang)
  • [fea] Avoid multithreaded write lock conflicts in event queue. #754 (Ted-Jiang)
  • Upgrade DataFusion to 24.0.0, Object_Store to 0.5.6 #769 (r4ntix)
  • Refine create_datafusion_context() #778 (yahoNanJing)
  • Remove output_partitioning for task definition #776 (yahoNanJing)
  • Upgrade DataFusion to 25.0.0 #779 (r4ntix)
  • Disable the ansi feature of tracing-subscriber #784 (yahoNanJing)
  • Add config grpc_server_max_decoding_message_size to make the maximum size of a decoded message at the grpc server side configurable #782 (yahoNanJing)
  • Fix nodejs issues in Docker build #731 (jnaous)
  • Upgrade node version to fix build in main #794 (avantgardnerio)
  • Remove redundant mod session_registry #792 (yahoNanJing)
  • Make last_seen_ts_threshold for getting alive executor at the scheduler side larger than the heartbeat time interval #786 (yahoNanJing)
  • Remove the prometheus-metrics from the default feature #788 (yahoNanJing)
  • Refine the ExecuteQuery grpc interface #790 (yahoNanJing)
  • Add config to collect statistics, enable in TPC-H benchmark #796 (Dandandan)
  • Add support for GCS data sources #805 (haoxins)
  • Update DataFusion to 26 #798 (Dandandan)
  • Issue 162 build docker image in ci #716 (paolorechia)
  • Fix index out of bounds panic #819 (yahoNanJing)
  • Refactor the TaskDefinition by changing encoding execution plan to the decoded one #817 (yahoNanJing)
  • Fix ballista-cli docs #800 (jonahgao)
  • docs: fix link #799 (haoxins)
  • Implement the with_new_children for ShuffleReaderExec #821 (yahoNanJing)
  • Update to point to the correct documentation #838 (dadepo)
  • Remove ExecutorReservation and change the task assignment philosophy from executor first to task first #823 (yahoNanJing)
  • Upgrade DataFusion to 27.0.0 #834 (r4ntix)
  • Reduce the number of calls to create_logical_plan #842 (jonahgao)
  • Bump semver from 5.7.1 to 5.7.2 in /ballista/scheduler/ui #843 (dependabot[bot])
  • Bump actions/labeler from 4.1.0 to 4.3.0 #841 (dependabot[bot])
  • Bump tough-cookie from 4.1.2 to 4.1.3 in /ballista/scheduler/ui #840 (dependabot[bot])
  • Update flatbuffers requirement from 22.9.29 to 23.5.26 #801 (dependabot[bot])
  • Update dirs requirement from 4.0.0 to 5.0.1 #767 (dependabot[bot])
  • Update libloading requirement from 0.7.3 to 0.8.0 #761 (dependabot[bot])
  • Introduce a cache crate supporting concurrent cache value loading #825 (yahoNanJing)
  • Fix cargo clippy for latest rust version #848 (yahoNanJing)
  • Introduce CachedBasedObjectStoreRegistry to use data source cache transparently #827 (yahoNanJing)
  • Add ConsistentHash for node topology management #830 (yahoNanJing)
  • Implement 3-phase consistent hash based task assignment policy #833 (yahoNanJing)
  • Update tonic requirement from 0.8 to 0.9 #733 (dependabot[bot])
  • Update itertools requirement from 0.10 to 0.11 #844 (dependabot[bot])
  • Update etcd-client requirement from 0.10 to 0.11 #845 (dependabot[bot])
  • Update hashbrown requirement from 0.13 to 0.14 #846 (dependabot[bot])
  • Bump word-wrap from 1.2.3 to 1.2.4 in /ballista/scheduler/ui #849 (dependabot[bot])
  • Update hdfs requirement from 0.1.1 to 0.1.4 #856 (yahoNanJing)
  • Update to DataFusion 28 #858 (Dandandan)
  • Upgrade datafusion to 30.0.0 #866 (r4ntix)
  • refactor: port get_scan_files to Ballista #877 (alamb)
  • Upgrade datafusion to 31.0.0 #878 (r4ntix)
  • Upgrade datafusion to 32.0.0 #899 (r4ntix)
  • Update to DataFusion 33 #900 (Dandandan)
  • Refactor lru mod, remove linked_hash_map #918 (PsiACE)
  • Dynamically optimize aggregate (count) based on shuffle stats #919 (Dandandan)
  • Use lz4 compression for shuffle files & flight stream, refactoring / improvements #920 (Dandandan)
  • Make max encoding message size configurable #928 (andygrove)
  • Set max message size to 16MB in gRPC clients #931 (andygrove)
  • Upgrade to DataFusion 34.0.0-rc1 #927 (andygrove)
  • Use official DF 34 release #939 (andygrove)
  • Use StreamWriter instead of FileWriter #943 (avantgardnerio)
  • Remove some TODO comments related to context fetching schemas from scheduler #946 (andygrove)
  • Fix Docker build #947 (andygrove)
  • Fix regression in DataFrame.write_xxx #945 (andygrove)

0.11.0 (2023-02-19)

Full Changelog

Implemented enhancements:

  • Remove python since it has been moved to its own repo, arrow-ballista-python #653
  • Add executor self-registration mechanism in the heartbeat service #648
  • Upgrade to DataFusion 17 #638
  • Move Python bindings to separate repo? #635
  • Implement new release process #622
  • Change default branch name from master to main #618
  • Update latest datafusion dependency #610
  • Implement optimizer rule to remove redundant repartitioning #608
  • ballista-cli as (docker) images #600
  • Update contributor guide #598
  • Fix cargo clippy #570
  • Support Alibaba Cloud OSS with ObjectStore #566
  • Refactor StateBackendClient to be a higher-level interface #554
  • Launch tasks to executors concurrently #544
  • Simplify docs #531
  • Provide an in-memory StateBackend #505
  • Add support for Azure blob storage #294
  • Add a workflow to build the image and publish it to the package #71

Fixed bugs:

  • Rust / Check Cargo.toml formatting (amd64, stable) (pull_request) Failing #662
  • Protobuf parsing error #646
  • jobs from python client not showing up in Scheduler UI #625
  • ballista ui fails to build #594
  • cargo build --release fails for ballista-scheduler #590
  • docker build fails #589
  • Multi-scheduler Job Starvation #585
  • Cannot query file from S3 #559
  • Benchmark q16 fails #373

Documentation updates:

Merged pull requests:

0.10.0 (2022-11-18)

Full Changelog

Implemented enhancements:

  • Add user guide section on prometheus metrics #507
  • Don't throw error when job path not exist in remove_job_data #502
  • Fix clippy warning #494
  • Use job_data_clean_up_interval_seconds == 0 to indicate executor_cleanup_enable #488
  • Add a config for tracing log rolling policy for both scheduler and executor #486
  • Set up repo where we can push benchmark results #473
  • Make the delayed time interval for cleanup job data in both scheduler and executor configurable #469
  • Add some validation for the remove_job_data grpc service #467
  • Add ability to build docker images using release-lto profile #463
  • Suggest users download (rather than build) the FlightSQL JDBC Driver #460
  • Clean up legacy job shuffle data #459
  • Add grpc service for the scheduler to make it able to be triggered by client explicitly #458
  • Replace Mutex<HashMap> by using DashMap #448
  • Refine log level #446
  • Upgrade to DataFusion 14.0.0 #445
  • Add a feature for hdfs3 #419
  • Add optional flag which advertises host for Arrow Flight SQL #418
  • Partitioning reasoning in DataFusion and Ballista #284
  • Stop wasting time in CI on MIRI runs #283
  • Publish Docker images as part of each release #236
  • Cleanup job/stage status from TaskManager and clean up shuffle data after a period after JobFinished #185

Fixed bugs:

  • build broken: configure_me_codegen retroactively reserved bind_host #519
  • Return empty results for SQLs with order by #451
  • ballista scheduler does not take inline parameters into account #443
  • [FlightSQL] Cannot connect with Tableau Desktop #428
  • Benchmark q15 fails #372
  • Incorrect documentation for building Ballista on Linux when using docker-compose #362
  • Scheduler silently replaces ParquetExec with EmptyExec if data path is not correctly mounted in container #353
  • SQL with order by limit returns nothing #334

Documentation updates:

Merged pull requests:

0.9.0 (2022-10-22)

Full Changelog

Implemented enhancements:

  • Support count distinct aggregation function #411
  • Use multi-task definition in pull-based execution loop #400
  • Make the scheduler event loop buffer size configurable #397
  • Remove active execution graph when the related job is successful or failed. #391
  • Improve launch task efficiency by calling LaunchMultiTask #389
  • Use tokio::sync::Semaphore to wait for available task slots #388
  • stdout and file log level settings are inconsistent #385
  • Use dedicated executor in pull based loop #383
  • Avoid calling scheduler when the executor cannot accept new tasks #377
  • Add round robin executor slots reservation policy for the scheduler to evenly assign tasks to executors #371
  • Switch to mimalloc and enable by default #369
  • Integration test script should use docker-compose #364
  • Use local shuffle reader in containerized environments #356
  • Add --ext option to benchmark #352
  • Add job cancel in the UI #350
  • Use local shuffle reader to avoid flight rpc call. #346
  • Add a Helm Chart #321
  • [UI] Show list of query stages with metrics #306
  • [UI] Add ability to specify job name and have it show in the job listing page in the UI #277
  • [UI] Add ability to download query plans in dot format #276
  • [UI] Add ability to render query plans #275
  • Add REST API documentation to User Guide #272
  • Graceful shutdown: Handle SIGTERM #266
  • [EPIC] Scheduler UI #265
  • Introduce the datafusion-objectstore-hdfs in datafusion-contrib as an object store feature #259
  • Add a feature based object store provider #257
  • Add docker build files #248
  • Allow IDEs to recognize generated code #246
  • Add user guide section on Flight SQL support #230
  • dev/release/README.md is outdated #228
  • Make ShuffleReaderExec output less verbose #211
  • Add LaunchMultiTask rpc interface for executor #209
  • Make executor fetch shuffle partition data in parallel #208
  • Concurrency control and rate limit during shuffle reader #195
  • Update User Guide #160
  • Ballista 0.8.0 Release #159
  • Save encoded execution plan in the ExecutionStage to reduce cost of task serialization and deserialization #142
  • Failed task retry #140
  • Redefine the executor task slots #132
  • Use ArrowFlight bearer token auth to create a session key for FlightSql clients #112
  • Leverage Atomic for the in-memory states in Scheduler #101
  • Introduce the object stores in datafusion-contrib as optional features #87
  • Support multiple paths for ListingTableScanNode #75
  • Need clean up intermediate data in Ballista #9
  • Ballista does not support external file systems #10

Fixed bugs:

  • Build errors in ./dev/build-ballista-rust.sh #407
  • The Ballista Scheduler Dockerfile copies a file that no longer exists #402
  • Benchmark q20 fails #374
  • Integration tests fail #360
  • Helm deploy fails #344
  • Executor gets stopped unexpectedly #333
  • Executor poll work loop failure #311
  • Queries with LIMIT are failing with "PhysicalExtensionCodec is not provided" #300
  • Schema inference does not work in Ballista-cli with a remote context #287
  • There are bugs in the yarn build github misses but break our internal build #270
  • Race condition running docker-compose #267
  • Scheduler UI not working in Docker image #250
  • Use bind host rather than the external host for starting a local executor service #244
  • Initial query stages read parquet files and repartition them needlessly #243
  • Cannot build Docker images on macOS 12.5.1 with M1 chip #234
  • CLI uses DataFusion context if no host or port are provided #219
  • Unsupported binary operator StringConcat #201
  • Ballista assumes all aggregate expressions are not DISTINCT #5
  • Ballista ui started with docker cannot find the ballista scheduler #11
  • Cannot build Ballista docker images on Apple silicon #17

Documentation updates:

Closed issues:

  • Automatic version updates for github actions with dependabot #127

Merged pull requests:

0.8.0 (2022-09-16)

Full Changelog

Implemented enhancements:

  • Executor should use all available cores by default #218
  • Update task status to the task job curator scheduler #179
  • update datafusion and arrow to 20.0.0 #176
  • No scheduler logs when deployed to k8s #165
  • Upgrade to DataFusion 11.0.0 #163
  • Better encapsulation for ExecutionGraph #149
  • A stage may act as the input of multiple stages #144
  • Executor Lost handling #143
  • Cancel a running query. #139
  • Ignore the previous job_id inside fill_reservations() #138
  • Normalize the serialization and deserialization places of protobuf structs #137
  • Remove revive offer event loop #136
  • Remove Keyspace::QueuedJobs #133
  • Spawn a thread for execution plan generation #131
  • Introduce CuratorTaskManager to make an active job be curated by only one scheduler #130
  • Using tokio tracing for log file #122
  • Ballista Executor report plan/operators metrics to Ballista Scheduler when task finish #116
  • Add timeout settings for Grpc Client #114
  • Add log level config in ballista #102
  • Use another channel to update the status of a task set for executor #96
  • Add config for concurrent_task in executor #94
  • Ballista should support Arrow FlightSQL #92
  • Why not include the ballista-cli in the member of workspace #88
  • Upgrade dependency of arrow-datafusion to commit d0d5564b8f689a01e542b8c1df829d74d0fab2b0 #84
  • Support sled path in config file. #79
  • Support for multi-scheduler deployments #39
  • Ballista 0.7.0 Release #126
  • Improvements to Ballista extensibility #8
  • Implement Python bindings for BallistaContext #15

Fixed bugs:

  • Run example fails via PushStaged mode #214
  • Config settings in BallistaContext do not get passed to DataFusion context #213
  • Start scheduler fails with arguments "-s PushStaged" #207
  • FlightSQL is broken and CI isn't catching it #190
  • Query fails with "NULL is invalid as a DataFusion scalar value" #180
  • Executor doesn't compile, missing tokio::signal #171
  • Unable to build master #76

ballista-0.7.0 (2022-05-12)

Full Changelog

Breaking changes:

  • Make ExecutionPlan::execute Sync #2434 (tustvold)
  • Add Expr::Exists to represent EXISTS subquery expression #2339 (andygrove)
  • Remove dependency from LogicalPlan::TableScan to ExecutionPlan #2284 (andygrove)
  • Move logical expression type-coercion code from physical-expr crate to expr crate #2257 (andygrove)
  • feat: 2061 create external table ddl table partition cols #2099 [sql] (jychen7)
  • Reorganize the project folders #2081 (yahoNanJing)
  • Support more ScalarFunction in Ballista #2008 (Ted-Jiang)
  • Merge dataframe and dataframe imp #1998 (vchag)
  • Rename ExecutionContext to SessionContext, ExecutionContextState to SessionState, add TaskContext to support multi-tenancy configurations - Part 1 #1987 (mingmwang)
  • Add Coalesce function #1969 (msathis)
  • Add Create Schema functionality in SQL #1959 [sql] (matthewmturner)
  • remove sync constraint of SendableRecordBatchStream #1884 (doki23)

Implemented enhancements:

Fixed bugs:

  • Ballista integration tests no longer work #2440
  • Ballista crates cannot be released from DataFusion 7.0.0 source release #1980
  • protobuf OctetLength should be deserialized as octet_length, not length #1834 (carols10cents)

Documentation updates:

Performance improvements:

Closed issues:

  • Make expected result string in unit tests more readable #2412
  • remove duplicated fn aggregate() in aggregate expression tests #2399
  • split distinct_expression.rs into count_distinct.rs and array_agg_distinct.rs #2385
  • move sql tests in context.rs to corresponding test files in datafusion/core/tests/sql #2328
  • Date32/Date64 as join keys for merge join #2314
  • Error precision and scale for decimal coercion in logic comparison #2232
  • Support Multiple row layout #2188
  • Discussion: Is Ballista a standalone system or framework #1916

Merged pull requests:

7.1.0-rc1 (2022-04-10)

Full Changelog

Implemented enhancements:

  • Support substring with three arguments: (str, from, for) for DataFrame API and Ballista #2092
  • UnionAll support for Ballista #2032
  • Separate cpu-bound and IO-bound work in ballista-executor by using diff tokio runtime. #1770
  • [Ballista] Introduce DAGScheduler for better managing the stage-based task scheduling #1704
  • [Ballista] Support to better manage cluster state, like alive executors, executor available task slots, etc #1703

Closed issues:

  • Optimize memory usage pattern to avoid "double memory" behavior #2149
  • Document approx_percentile_cont_with_weight in users guide #2078
  • [follow up]cleaning up statements.remove(0) #1986
  • Formatting error on documentation for Python #1873
  • Remove duplicate tests from test_const_evaluator_scalar_functions #1727
  • Question: Is the Ballista project providing value to the overall DataFusion project? #1273

7.0.0-rc2 (2022-02-14)

Full Changelog

7.0.0 (2022-02-14)

Full Changelog

Breaking changes:

  • Update ExecutionPlan to know about sortedness and repartitioning optimizer pass respect the invariants #1776 (alamb)
  • Update to arrow 8.0.0 #1673 (alamb)

Implemented enhancements:

  • Task assignment between Scheduler and Executors #1221
  • Add approx_median() aggregate function #1729 (realno)
  • [Ballista] Add Decimal128, Date64, TimestampSecond, TimestampMillisecond, Interv… #1659 (gaojun2048)
  • Add corr aggregate function #1561 (realno)
  • Add covar, covar_pop and covar_samp aggregate functions #1551 (realno)
  • Add approx_quantile() aggregation function #1539 (domodwyer)
  • Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526 (yjshen)
  • Add stddev and variance #1525 (realno)
  • Add rem operation for Expr #1467 (liukun4515)
  • Implement array_agg aggregate function #1300 (viirya)

Fixed bugs:

  • Ballista context::tests::test_standalone_mode test fails #1020
  • [Ballista] Fix scheduler state mod bug #1655 (gaojun2048)
  • Pass local address host so we do not get mismatch between IPv4 and IP… #1466 (thinkharderdev)
  • Add Timezone to Scalar::Time* types, and better timezone awareness to Datafusion's time types #1455 (maxburke)

Documentation updates:

Performance improvements:

Closed issues:

  • Track memory usage in Non Limited Operators #1569
  • [Question] Why does ballista store tables in the client instead of in the SchedulerServer #1473
  • Why use the expr types before coercion to get the result type? #1358
  • A problem about the projection_push_down optimizer gathers valid columns #1312
  • apply constant folding to LogicalPlan::Values #1170
  • reduce usage of IntoIterator<Item = Expr> in logical plan builder window fn #372

Merged pull requests:

6.0.0-rc0 (2021-11-14)

Full Changelog

6.0.0 (2021-11-14)

Full Changelog

ballista-0.6.0 (2021-11-13)

Full Changelog

Breaking changes:

  • File partitioning for ListingTable #1141 (rdettai)
  • Register tables in BallistaContext using TableProviders instead of Dataframe #1028 (rdettai)
  • Make TableProvider.scan() and PhysicalPlanner::create_physical_plan() async #1013 (rdettai)
  • Reorganize table providers by table format #1010 (rdettai)
  • Move CBOs and Statistics to physical plan #965 (rdettai)
  • Update to sqlparser v 0.10.0 #934 [sql] (alamb)
  • FilePartition and PartitionedFile for scanning flexibility #932 [sql] (yjshen)
  • Improve SQLMetric APIs, port existing metrics #908 (alamb)
  • Add support for EXPLAIN ANALYZE #858 [sql] (alamb)
  • Rename concurrency to target_partitions #706 (andygrove)

Implemented enhancements:

Fixed bugs:

  • Test execution_plans::shuffle_writer::tests::test Fail #1040
  • Integration test fails to build docker images #918
  • Ballista: Remove hard-coded concurrency from logical plan serde code #708
  • How can I make ballista distributed compute work? #327
  • fix subquery alias #1067 [sql] (xudong963)
  • Fix compilation for ballista in stand-alone mode #1008 (Igosuki)

Documentation updates:

Performance improvements:

  • optimize build profile for datafusion python binding, cli and ballista #1137 (houqp)

Closed issues:

  • InList expr with NULL literals do not work #1190
  • update the homepage README to include values, approx_distinct, etc. #1171
  • [Python]: Inconsistencies with Python package name #1011
  • Wanting to contribute to project where to start? #983
  • delete redundant code #973
  • How to build DataFusion python wheel #853
  • Produce a design for a metrics framework #21

Merged pull requests:

  • [nit] simplify ballista executor CollectExec impl codes #1140 (panarch)

For older versions, see apache/arrow/CHANGELOG.md

ballista-0.5.0 (2021-08-10)

Full Changelog

Breaking changes:

  • [ballista] support date_part and date_trunc ser/de, pass tpch 7 #840 (houqp)
  • Box ScalarValue:Lists, reduce size by half size #788 (alamb)
  • Support DataFrame.collect for Ballista DataFrames #785 (andygrove)
  • JOIN conditions are order dependent #778 (seddonm1)
  • UnresolvedShuffleExec should represent a single shuffle #727 (andygrove)
  • Ballista: Make shuffle partitions configurable in benchmarks #702 (andygrove)
  • Rename MergeExec to CoalescePartitionsExec #635 (andygrove)
  • Ballista: Rename QueryStageExec to ShuffleWriterExec #633 (andygrove)
  • fix 593, reduce cloning by taking ownership in logical planner's from fn #610 (Jimexist)
  • fix join column handling logic for On and Using constraints #605 (houqp)
  • Move ballista standalone mode to client #589 (edrevo)
  • Ballista: Implement map-side shuffle #543 (andygrove)
  • ShuffleReaderExec now supports multiple locations per partition #541 (andygrove)
  • Make external hostname in executor optional #232 (edrevo)
  • Remove namespace from executors #75 (edrevo)
  • Support qualified columns in queries #55 (houqp)
  • Read CSV format text from stdin or memory #54 (heymind)
  • Remove Ballista DataFrame #48 (andygrove)
  • Use atomics for SQLMetric implementation, remove unused name field #25 (returnString)

Implemented enhancements:

  • Add crate documentation for Ballista crates #830
  • Support DataFrame.collect for Ballista DataFrames #787
  • Ballista: Prep for supporting shuffle correctly, part one #736
  • Ballista: Implement physical plan serde for ShuffleWriterExec #710
  • Ballista: Finish implementing shuffle mechanism #707
  • Rename QueryStageExec to ShuffleWriterExec #542
  • Ballista ShuffleReaderExec should be able to read from multiple locations per partition #540
  • [Ballista] Use deployments in k8s user guide #473
  • Ballista refactor QueryStageExec in preparation for map-side shuffle #458
  • Ballista: Implement map-side of shuffle #456
  • Refactor Ballista to separate Flight logic from execution logic #449
  • Use published versions of arrow rather than github shas #393
  • BallistaContext::collect() logging is too noisy #352
  • Update Ballista to use new physical plan formatter utility #343
  • Add Ballista Getting Started documentation #329
  • Remove references to ballistacompute Docker Hub repo #325
  • Implement scalable distributed joins #63
  • Remove hard-coded Ballista version from scripts #32
  • Implement streaming versions of Dataframe.collect methods #789 (andygrove)
  • Ballista shuffle is finally working as intended, providing scalable distributed joins #750 (andygrove)
  • Update to use arrow 5.0 #721 (alamb)
  • Implement serde for ShuffleWriterExec #712 (andygrove)
  • dedup using join column in wildcard expansion #678 (houqp)
  • Implement metrics for shuffle read and write #676 (andygrove)
  • Remove hard-coded PartitionMode from Ballista serde #637 (andygrove)
  • Ballista: Implement scalable distributed joins #634 (andygrove)
  • Add Keda autoscaling for ballista in k8s #586 (edrevo)
  • Add some resiliency to lost executors #568 (edrevo)
  • Add partition by constructs in window functions and modify logical planning #501 (Jimexist)
  • Support anti join #482 (Dandandan)
  • add order by construct in window function and logical plans #463 (Jimexist)
  • Refactor Ballista executor so that FlightService delegates to an Executor struct #450 (andygrove)
  • implement lead and lag built-in window function #429 (Jimexist)
  • Implement fmt_as for ShuffleReaderExec #400 (andygrove)
  • Add window expression part 1 - logical and physical planning, structure, to/from proto, and explain, for empty over clause only #334 (Jimexist)
  • [breaking change] fix 265, log should be log10, and add ln #271 (Jimexist)
  • Allow table providers to indicate their type for catalog metadata #205 (returnString)
  • Add query 19 to TPC-H regression tests #59 (Dandandan)
  • Use arrow eq kernels in CaseWhen expression evaluation #52 (Dandandan)
  • Add option param for standalone mode #42 (djKooks)
  • [DataFusion] Optimize hash join inner workings, null handling fix #24 (Dandandan)
  • [Ballista] Docker files for ui #22 (msathis)

Fixed bugs:

  • Ballista: TPC-H q3 @ SF=1000 never completes #835
  • Ballista does not support MIN/MAX aggregate functions #832
  • Ballista docker images fail to build #828
  • Ballista: UnresolvedShuffleExec should only have a single stage_id #726
  • Ballista integration tests are failing #623
  • Integration test build failure due to arrow-rs using unstable feature #596
  • cargo build cannot build the project #531
  • ShuffleReaderExec does not get formatted correctly in displayable physical plan #399
  • Implement serde for MIN and MAX #833 (andygrove)
  • Ballista: Prep for fixing shuffle mechanism, part 1 #738 (andygrove)
  • Ballista: Shuffle write bug fix #714 (andygrove)
  • honor table name for csv/parquet scan in ballista plan serde #629 (houqp)
  • MINOR: Fix integration tests by adding datafusion-cli module to docker image #322 (andygrove)

Documentation updates:

Performance improvements:

  • Ballista: Avoid sleeping between polling for tasks #698 (Dandandan)
  • Make BallistaContext::collect streaming #535 (edrevo)

Closed issues:

  • Confirm git tagging strategy for releases #770
  • arrow::util::pretty::pretty_format_batches missing #769
  • move the assert_batches_eq! macros to a non part of datafusion #745
  • fix an issue where aliases are not respected in generating downstream schemas in window expr #592
  • make the planner to print more succinct and useful information in window function explain clause #526
  • move window frame module to be in logical_plan #517
  • use a more rust idiomatic way of handling nth_value #448
  • Make Ballista not depend on arrow directly #446
  • create a test with more than one partition for window functions #435
  • Implement hash-partitioned hash aggregate #27
  • Consider using GitHub pages for DataFusion/Ballista documentation #18
  • Add Ballista to default cargo workspace #17
  • Update "repository" in Cargo.toml #16
  • Consolidate TPC-H benchmarks #6
  • [Ballista] Fix integration test script #4
  • Ballista should not have separate DataFrame implementation #2

Merged pull requests:

  • Change datatype of tpch keys from Int32 to UInt64 to support sf=1000 #836 (andygrove)
  • Add ballista-examples to docker build #829 (andygrove)
  • Update dependencies: prost to 0.8 and tonic to 0.5 #818 (alamb)
  • Move hash_array into hash_utils.rs #807 (alamb)
  • Fix: Update clippy lints for Rust 1.54 #794 (alamb)
  • MINOR: Remove unused Ballista query execution code path #732 (andygrove)
  • [fix] benchmark run with compose #666 (rdettai)
  • bring back dev scripts for ballista #648 (Jimexist)
  • Remove unnecessary mutex #639 (edrevo)
  • round trip TPCH queries in tests #630 (houqp)
  • Fix build #627 (andygrove)
  • in ballista also check for UI prettier changes #578 (Jimexist)
  • turn on clippy rule for needless borrow #545 (Jimexist)
  • reuse datafusion physical planner in ballista building from protobuf #532 (Jimexist)
  • update cargo.toml in python crate and fix unit test due to hash joins #483 (Jimexist)
  • make VOLUME declaration in tpch datagen docker absolute #466 (crepererum)
  • Refactor QueryStageExec in preparation for implementing map-side shuffle #459 (andygrove)
  • Simplified usage of use arrow in ballista. #447 (jorgecarleitao)
  • Benchmark subcommand to distinguish between DataFusion and Ballista #402 (jgoday)
  • #352: BallistaContext::collect() logging is too noisy #394 (jgoday)
  • cleanup function return type fn #350 (Jimexist)
  • Update Ballista to use new physical plan formatter utility #344 (andygrove)
  • Update arrow dependencies again #341 (alamb)
  • Remove references to Ballista Docker images published to ballistacompute Docker Hub repo #326 (andygrove)
  • Update arrow-rs deps #317 (alamb)
  • Update arrow deps #269 (alamb)
  • Enable redundant_field_names clippy lint #261 (Dandandan)
  • Update arrow-rs deps (to fix build due to flatbuffers update) #224 (alamb)
  • update arrow-rs deps to latest master #216 (alamb)

* This Changelog was automatically generated by github_changelog_generator