diff --git a/404.html b/404.html new file mode 100644 index 00000000000..c73f3c45338 --- /dev/null +++ b/404.html @@ -0,0 +1,33 @@ + + + + + + + + + PolarDB for PostgreSQL + + + + +

404

There's nothing here.
Take me home
+ + + diff --git a/assets/10_solutions_to_future_pages-f585c284.png b/assets/10_solutions_to_future_pages-f585c284.png new file mode 100644 index 00000000000..5c92fd109f3 Binary files /dev/null and b/assets/10_solutions_to_future_pages-f585c284.png differ diff --git a/assets/10_solutions_to_future_pages-f6d8bc5c.png b/assets/10_solutions_to_future_pages-f6d8bc5c.png new file mode 100644 index 00000000000..fca2a2e24a2 Binary files /dev/null and b/assets/10_solutions_to_future_pages-f6d8bc5c.png differ diff --git a/assets/11_issues_of_conventional_streaming_replication-79eab5de.png b/assets/11_issues_of_conventional_streaming_replication-79eab5de.png new file mode 100644 index 00000000000..89267d2b848 Binary files /dev/null and b/assets/11_issues_of_conventional_streaming_replication-79eab5de.png differ diff --git a/assets/11_issues_of_conventional_streaming_replication-fe65f8ee.png b/assets/11_issues_of_conventional_streaming_replication-fe65f8ee.png new file mode 100644 index 00000000000..a62fec493a0 Binary files /dev/null and b/assets/11_issues_of_conventional_streaming_replication-fe65f8ee.png differ diff --git a/assets/12_Replicate_only_metadata_of_WAL_records-092ef5f2.png b/assets/12_Replicate_only_metadata_of_WAL_records-092ef5f2.png new file mode 100644 index 00000000000..4f996d9f7be Binary files /dev/null and b/assets/12_Replicate_only_metadata_of_WAL_records-092ef5f2.png differ diff --git a/assets/12_Replicate_only_metadata_of_WAL_records-d2fbf65b.png b/assets/12_Replicate_only_metadata_of_WAL_records-d2fbf65b.png new file mode 100644 index 00000000000..7bd75338be5 Binary files /dev/null and b/assets/12_Replicate_only_metadata_of_WAL_records-d2fbf65b.png differ diff --git a/assets/13_optimization1_result-98261cb3.png b/assets/13_optimization1_result-98261cb3.png new file mode 100644 index 00000000000..b225f0685c7 Binary files /dev/null and b/assets/13_optimization1_result-98261cb3.png differ diff --git a/assets/13_optimization1_result-d85386d9.png 
b/assets/13_optimization1_result-d85386d9.png new file mode 100644 index 00000000000..de8bd454d3b Binary files /dev/null and b/assets/13_optimization1_result-d85386d9.png differ diff --git a/assets/14_optimize_log_apply_of_WAL_records-a2722b50.png b/assets/14_optimize_log_apply_of_WAL_records-a2722b50.png new file mode 100644 index 00000000000..68582fb9d92 Binary files /dev/null and b/assets/14_optimize_log_apply_of_WAL_records-a2722b50.png differ diff --git a/assets/14_optimize_log_apply_of_WAL_records-e19cfea8.png b/assets/14_optimize_log_apply_of_WAL_records-e19cfea8.png new file mode 100644 index 00000000000..fbefdac11b5 Binary files /dev/null and b/assets/14_optimize_log_apply_of_WAL_records-e19cfea8.png differ diff --git a/assets/15_optimization2_result-3dd5d1a8.png b/assets/15_optimization2_result-3dd5d1a8.png new file mode 100644 index 00000000000..c0576074109 Binary files /dev/null and b/assets/15_optimization2_result-3dd5d1a8.png differ diff --git a/assets/15_optimization2_result-5c124fdf.png b/assets/15_optimization2_result-5c124fdf.png new file mode 100644 index 00000000000..742dd825d8c Binary files /dev/null and b/assets/15_optimization2_result-5c124fdf.png differ diff --git a/assets/16_optimize_log_apply_of_DDL_locks-0e74ca0c.png b/assets/16_optimize_log_apply_of_DDL_locks-0e74ca0c.png new file mode 100644 index 00000000000..f109427ac6a Binary files /dev/null and b/assets/16_optimize_log_apply_of_DDL_locks-0e74ca0c.png differ diff --git a/assets/16_optimize_log_apply_of_DDL_locks-d4407c97.png b/assets/16_optimize_log_apply_of_DDL_locks-d4407c97.png new file mode 100644 index 00000000000..61ce4f638d7 Binary files /dev/null and b/assets/16_optimize_log_apply_of_DDL_locks-d4407c97.png differ diff --git a/assets/17_optimization3_result-2e8e1fc5.png b/assets/17_optimization3_result-2e8e1fc5.png new file mode 100644 index 00000000000..10f3bc0aec2 Binary files /dev/null and b/assets/17_optimization3_result-2e8e1fc5.png differ diff --git 
a/assets/17_optimization3_result-c08ad12d.png b/assets/17_optimization3_result-c08ad12d.png new file mode 100644 index 00000000000..fa93ac58fe4 Binary files /dev/null and b/assets/17_optimization3_result-c08ad12d.png differ diff --git a/assets/18_recovery_optimization_background-60743f8d.png b/assets/18_recovery_optimization_background-60743f8d.png new file mode 100644 index 00000000000..ac16aaf9cc3 Binary files /dev/null and b/assets/18_recovery_optimization_background-60743f8d.png differ diff --git a/assets/18_recovery_optimization_background-a9ce115d.png b/assets/18_recovery_optimization_background-a9ce115d.png new file mode 100644 index 00000000000..46cb61da386 Binary files /dev/null and b/assets/18_recovery_optimization_background-a9ce115d.png differ diff --git a/assets/19_lazy_recovery-ba7ee19e.png b/assets/19_lazy_recovery-ba7ee19e.png new file mode 100644 index 00000000000..ffcf350987b Binary files /dev/null and b/assets/19_lazy_recovery-ba7ee19e.png differ diff --git a/assets/19_lazy_recovery-f16bb60b.png b/assets/19_lazy_recovery-f16bb60b.png new file mode 100644 index 00000000000..c31a09577ae Binary files /dev/null and b/assets/19_lazy_recovery-f16bb60b.png differ diff --git a/assets/1_polardb_architecture-1942f502.png b/assets/1_polardb_architecture-1942f502.png new file mode 100644 index 00000000000..a2cb6c5d4c2 Binary files /dev/null and b/assets/1_polardb_architecture-1942f502.png differ diff --git a/assets/1_polardb_architecture-ce580cc6.png b/assets/1_polardb_architecture-ce580cc6.png new file mode 100644 index 00000000000..826cd0b4874 Binary files /dev/null and b/assets/1_polardb_architecture-ce580cc6.png differ diff --git a/assets/20_recovery_optimization_result-5bbf801d.png b/assets/20_recovery_optimization_result-5bbf801d.png new file mode 100644 index 00000000000..79cc259d37f Binary files /dev/null and b/assets/20_recovery_optimization_result-5bbf801d.png differ diff --git a/assets/20_recovery_optimization_result-80832b6f.png 
b/assets/20_recovery_optimization_result-80832b6f.png new file mode 100644 index 00000000000..cb194f7d162 Binary files /dev/null and b/assets/20_recovery_optimization_result-80832b6f.png differ diff --git a/assets/21_Persistent_BufferPool-30d61026.png b/assets/21_Persistent_BufferPool-30d61026.png new file mode 100644 index 00000000000..0d32c82618f Binary files /dev/null and b/assets/21_Persistent_BufferPool-30d61026.png differ diff --git a/assets/21_Persistent_BufferPool-bd6c06a2.png b/assets/21_Persistent_BufferPool-bd6c06a2.png new file mode 100644 index 00000000000..01baa86916c Binary files /dev/null and b/assets/21_Persistent_BufferPool-bd6c06a2.png differ diff --git a/assets/22_buffer_pool_structure-a53b4626.png b/assets/22_buffer_pool_structure-a53b4626.png new file mode 100644 index 00000000000..c0915b3e364 Binary files /dev/null and b/assets/22_buffer_pool_structure-a53b4626.png differ diff --git a/assets/22_buffer_pool_structure-a755d484.png b/assets/22_buffer_pool_structure-a755d484.png new file mode 100644 index 00000000000..1aebca9c1dc Binary files /dev/null and b/assets/22_buffer_pool_structure-a755d484.png differ diff --git a/assets/23_persistent_buffer_pool_result-6759a779.png b/assets/23_persistent_buffer_pool_result-6759a779.png new file mode 100644 index 00000000000..e5809f23c26 Binary files /dev/null and b/assets/23_persistent_buffer_pool_result-6759a779.png differ diff --git a/assets/23_persistent_buffer_pool_result-abf85155.png b/assets/23_persistent_buffer_pool_result-abf85155.png new file mode 100644 index 00000000000..b88aadaf3da Binary files /dev/null and b/assets/23_persistent_buffer_pool_result-abf85155.png differ diff --git a/assets/24_principles_of_HTAP-2f3b912c.png b/assets/24_principles_of_HTAP-2f3b912c.png new file mode 100644 index 00000000000..83dc6342eca Binary files /dev/null and b/assets/24_principles_of_HTAP-2f3b912c.png differ diff --git a/assets/24_principles_of_HTAP-b1327018.png b/assets/24_principles_of_HTAP-b1327018.png 
new file mode 100644 index 00000000000..79e83f3ded1 Binary files /dev/null and b/assets/24_principles_of_HTAP-b1327018.png differ diff --git a/assets/25_distributed_optimizer-153c6304.png b/assets/25_distributed_optimizer-153c6304.png new file mode 100644 index 00000000000..d90ea11b637 Binary files /dev/null and b/assets/25_distributed_optimizer-153c6304.png differ diff --git a/assets/25_distributed_optimizer-a73c4add.png b/assets/25_distributed_optimizer-a73c4add.png new file mode 100644 index 00000000000..7c4627d0105 Binary files /dev/null and b/assets/25_distributed_optimizer-a73c4add.png differ diff --git a/assets/26_parallelism_of_operators-61071ed7.png b/assets/26_parallelism_of_operators-61071ed7.png new file mode 100644 index 00000000000..aafba50dbbc Binary files /dev/null and b/assets/26_parallelism_of_operators-61071ed7.png differ diff --git a/assets/26_parallelism_of_operators-d53ecbd5.png b/assets/26_parallelism_of_operators-d53ecbd5.png new file mode 100644 index 00000000000..76243e76118 Binary files /dev/null and b/assets/26_parallelism_of_operators-d53ecbd5.png differ diff --git a/assets/27_parallelism_of_operators_result-28ed41a9.png b/assets/27_parallelism_of_operators_result-28ed41a9.png new file mode 100644 index 00000000000..f2d68a84375 Binary files /dev/null and b/assets/27_parallelism_of_operators_result-28ed41a9.png differ diff --git a/assets/27_parallelism_of_operators_result-ab7b692f.png b/assets/27_parallelism_of_operators_result-ab7b692f.png new file mode 100644 index 00000000000..f36848148d2 Binary files /dev/null and b/assets/27_parallelism_of_operators_result-ab7b692f.png differ diff --git a/assets/28_data_skew-4f127c17.png b/assets/28_data_skew-4f127c17.png new file mode 100644 index 00000000000..429885ad63d Binary files /dev/null and b/assets/28_data_skew-4f127c17.png differ diff --git a/assets/28_data_skew-4fce9edd.png b/assets/28_data_skew-4fce9edd.png new file mode 100644 index 00000000000..84b5af5bc52 Binary files /dev/null and 
b/assets/28_data_skew-4fce9edd.png differ diff --git a/assets/29_Solve_data_skew_result-cfa7b2f0.png b/assets/29_Solve_data_skew_result-cfa7b2f0.png new file mode 100644 index 00000000000..6ec66c828e3 Binary files /dev/null and b/assets/29_Solve_data_skew_result-cfa7b2f0.png differ diff --git a/assets/29_Solve_data_skew_result-d1f5cd26.png b/assets/29_Solve_data_skew_result-d1f5cd26.png new file mode 100644 index 00000000000..4437bbc2051 Binary files /dev/null and b/assets/29_Solve_data_skew_result-d1f5cd26.png differ diff --git a/assets/2_compute-storage_separation_architecture-150c6ffc.png b/assets/2_compute-storage_separation_architecture-150c6ffc.png new file mode 100644 index 00000000000..23c550cf6ad Binary files /dev/null and b/assets/2_compute-storage_separation_architecture-150c6ffc.png differ diff --git a/assets/2_compute-storage_separation_architecture-2a7ce395.png b/assets/2_compute-storage_separation_architecture-2a7ce395.png new file mode 100644 index 00000000000..4d5525ba5ab Binary files /dev/null and b/assets/2_compute-storage_separation_architecture-2a7ce395.png differ diff --git a/assets/30_SQL_statement-level_scalability-03086846.png b/assets/30_SQL_statement-level_scalability-03086846.png new file mode 100644 index 00000000000..b98f494bc48 Binary files /dev/null and b/assets/30_SQL_statement-level_scalability-03086846.png differ diff --git a/assets/30_SQL_statement-level_scalability-e2a14f1f.png b/assets/30_SQL_statement-level_scalability-e2a14f1f.png new file mode 100644 index 00000000000..dccdb9f2421 Binary files /dev/null and b/assets/30_SQL_statement-level_scalability-e2a14f1f.png differ diff --git a/assets/31_schedule_workloads-1e37f980.png b/assets/31_schedule_workloads-1e37f980.png new file mode 100644 index 00000000000..59365ce37fa Binary files /dev/null and b/assets/31_schedule_workloads-1e37f980.png differ diff --git a/assets/31_schedule_workloads-b339cf98.png b/assets/31_schedule_workloads-b339cf98.png new file mode 100644 index 
00000000000..0a3c70133f0 Binary files /dev/null and b/assets/31_schedule_workloads-b339cf98.png differ diff --git a/assets/32_transactional_consistency-0f80c9d0.png b/assets/32_transactional_consistency-0f80c9d0.png new file mode 100644 index 00000000000..590f5056327 Binary files /dev/null and b/assets/32_transactional_consistency-0f80c9d0.png differ diff --git a/assets/32_transactional_consistency-4f51f637.png b/assets/32_transactional_consistency-4f51f637.png new file mode 100644 index 00000000000..e02fabec2e5 Binary files /dev/null and b/assets/32_transactional_consistency-4f51f637.png differ diff --git a/assets/33_TPC-H_performance_Speedup1-b6f25c5e.png b/assets/33_TPC-H_performance_Speedup1-b6f25c5e.png new file mode 100644 index 00000000000..b172cd2b920 Binary files /dev/null and b/assets/33_TPC-H_performance_Speedup1-b6f25c5e.png differ diff --git a/assets/33_TPC-H_performance_Speedup1-bea777d8.png b/assets/33_TPC-H_performance_Speedup1-bea777d8.png new file mode 100644 index 00000000000..ed32f83b9a8 Binary files /dev/null and b/assets/33_TPC-H_performance_Speedup1-bea777d8.png differ diff --git a/assets/34_TPC-H_performance_Speedup2-57228502.png b/assets/34_TPC-H_performance_Speedup2-57228502.png new file mode 100644 index 00000000000..fdf6d6eba4f Binary files /dev/null and b/assets/34_TPC-H_performance_Speedup2-57228502.png differ diff --git a/assets/34_TPC-H_performance_Speedup2-5c119fbc.png b/assets/34_TPC-H_performance_Speedup2-5c119fbc.png new file mode 100644 index 00000000000..2d0ea066d0e Binary files /dev/null and b/assets/34_TPC-H_performance_Speedup2-5c119fbc.png differ diff --git a/assets/35_TPC-H_performance_Speedup3-6e2b1a40.png b/assets/35_TPC-H_performance_Speedup3-6e2b1a40.png new file mode 100644 index 00000000000..5ead28c2ea8 Binary files /dev/null and b/assets/35_TPC-H_performance_Speedup3-6e2b1a40.png differ diff --git a/assets/35_TPC-H_performance_Speedup3-c1c35820.png b/assets/35_TPC-H_performance_Speedup3-c1c35820.png new file mode 
100644 index 00000000000..7060f7cd052 Binary files /dev/null and b/assets/35_TPC-H_performance_Speedup3-c1c35820.png differ diff --git a/assets/36_TPC-H_performance_Comparison_with_mpp1-265dba6a.png b/assets/36_TPC-H_performance_Comparison_with_mpp1-265dba6a.png new file mode 100644 index 00000000000..a0bbfaa9d66 Binary files /dev/null and b/assets/36_TPC-H_performance_Comparison_with_mpp1-265dba6a.png differ diff --git a/assets/36_TPC-H_performance_Comparison_with_mpp1-ecbde071.png b/assets/36_TPC-H_performance_Comparison_with_mpp1-ecbde071.png new file mode 100644 index 00000000000..3f1b2e8887d Binary files /dev/null and b/assets/36_TPC-H_performance_Comparison_with_mpp1-ecbde071.png differ diff --git a/assets/37_TPC-H_performance_Comparison_with_mpp2-6a739c6c.png b/assets/37_TPC-H_performance_Comparison_with_mpp2-6a739c6c.png new file mode 100644 index 00000000000..bf8a3eb68a5 Binary files /dev/null and b/assets/37_TPC-H_performance_Comparison_with_mpp2-6a739c6c.png differ diff --git a/assets/37_TPC-H_performance_Comparison_with_mpp2-e0571d47.png b/assets/37_TPC-H_performance_Comparison_with_mpp2-e0571d47.png new file mode 100644 index 00000000000..6a0cbb98a30 Binary files /dev/null and b/assets/37_TPC-H_performance_Comparison_with_mpp2-e0571d47.png differ diff --git a/assets/38_Index_creation_accelerated_by_PX-63d21186.png b/assets/38_Index_creation_accelerated_by_PX-63d21186.png new file mode 100644 index 00000000000..2e65ee53b99 Binary files /dev/null and b/assets/38_Index_creation_accelerated_by_PX-63d21186.png differ diff --git a/assets/38_Index_creation_accelerated_by_PX-cc3737a1.png b/assets/38_Index_creation_accelerated_by_PX-cc3737a1.png new file mode 100644 index 00000000000..4e93df7da2e Binary files /dev/null and b/assets/38_Index_creation_accelerated_by_PX-cc3737a1.png differ diff --git a/assets/39_Index_creation_accelerated_by_PX2-0c310510.png b/assets/39_Index_creation_accelerated_by_PX2-0c310510.png new file mode 100644 index 
00000000000..6eefa12dfc6 Binary files /dev/null and b/assets/39_Index_creation_accelerated_by_PX2-0c310510.png differ diff --git a/assets/39_Index_creation_accelerated_by_PX2-340b1909.png b/assets/39_Index_creation_accelerated_by_PX2-340b1909.png new file mode 100644 index 00000000000..02aafdf1eb2 Binary files /dev/null and b/assets/39_Index_creation_accelerated_by_PX2-340b1909.png differ diff --git a/assets/3_HTAP_architecture-219dd5bb.png b/assets/3_HTAP_architecture-219dd5bb.png new file mode 100644 index 00000000000..f7a4c3c4962 Binary files /dev/null and b/assets/3_HTAP_architecture-219dd5bb.png differ diff --git a/assets/3_HTAP_architecture-43d8e225.png b/assets/3_HTAP_architecture-43d8e225.png new file mode 100644 index 00000000000..feef664215c Binary files /dev/null and b/assets/3_HTAP_architecture-43d8e225.png differ diff --git a/assets/404.html-60b35caa.js b/assets/404.html-60b35caa.js new file mode 100644 index 00000000000..7a25b17a47d --- /dev/null +++ b/assets/404.html-60b35caa.js @@ -0,0 +1 @@ +const t=JSON.parse('{"key":"v-3706649a","path":"/404.html","title":"","lang":"en-US","frontmatter":{"layout":"NotFound"},"headers":[],"git":{},"filePathRelative":null}');export{t as data}; diff --git a/assets/404.html-66349191.js b/assets/404.html-66349191.js new file mode 100644 index 00000000000..707ce6e17a4 --- /dev/null +++ b/assets/404.html-66349191.js @@ -0,0 +1 @@ +import{_ as e,o as c,c as t}from"./app-3d1677bf.js";const _={};function o(r,n){return c(),t("div")}const a=e(_,[["render",o],["__file","404.html.vue"]]);export{a as default}; diff --git a/assets/40_spatio-temporal_databases-2527a436.png b/assets/40_spatio-temporal_databases-2527a436.png new file mode 100644 index 00000000000..05348c1085d Binary files /dev/null and b/assets/40_spatio-temporal_databases-2527a436.png differ diff --git a/assets/40_spatio-temporal_databases-8411c32e.png b/assets/40_spatio-temporal_databases-8411c32e.png new file mode 100644 index 00000000000..455c57ba6fb Binary 
files /dev/null and b/assets/40_spatio-temporal_databases-8411c32e.png differ diff --git a/assets/41_spatio-temporal_databases_result-33628595.png b/assets/41_spatio-temporal_databases_result-33628595.png new file mode 100644 index 00000000000..a894bc014dd Binary files /dev/null and b/assets/41_spatio-temporal_databases_result-33628595.png differ diff --git a/assets/41_spatio-temporal_databases_result-7e6ba3f6.png b/assets/41_spatio-temporal_databases_result-7e6ba3f6.png new file mode 100644 index 00000000000..efd644c371c Binary files /dev/null and b/assets/41_spatio-temporal_databases_result-7e6ba3f6.png differ diff --git a/assets/42_FlushList-20e70d3c.png b/assets/42_FlushList-20e70d3c.png new file mode 100644 index 00000000000..7118146e464 Binary files /dev/null and b/assets/42_FlushList-20e70d3c.png differ diff --git a/assets/42_FlushList-a5ba8869.png b/assets/42_FlushList-a5ba8869.png new file mode 100644 index 00000000000..addecd2c40f Binary files /dev/null and b/assets/42_FlushList-a5ba8869.png differ diff --git a/assets/42_buffer_conntrol-0b37890d.png b/assets/42_buffer_conntrol-0b37890d.png new file mode 100644 index 00000000000..23d00027247 Binary files /dev/null and b/assets/42_buffer_conntrol-0b37890d.png differ diff --git a/assets/42_buffer_conntrol-6b7ab4e5.png b/assets/42_buffer_conntrol-6b7ab4e5.png new file mode 100644 index 00000000000..cf432e40da5 Binary files /dev/null and b/assets/42_buffer_conntrol-6b7ab4e5.png differ diff --git a/assets/43_parr_Flush-3be063a3.png b/assets/43_parr_Flush-3be063a3.png new file mode 100644 index 00000000000..d8c79ae2153 Binary files /dev/null and b/assets/43_parr_Flush-3be063a3.png differ diff --git a/assets/44_Copy_Buffer-505a142f.png b/assets/44_Copy_Buffer-505a142f.png new file mode 100644 index 00000000000..f306ab31a61 Binary files /dev/null and b/assets/44_Copy_Buffer-505a142f.png differ diff --git a/assets/45_DDL_1-bc7e6ba3.png b/assets/45_DDL_1-bc7e6ba3.png new file mode 100644 index 
00000000000..b1342eaedd5 Binary files /dev/null and b/assets/45_DDL_1-bc7e6ba3.png differ diff --git a/assets/46_DDL_2-51d294c2.png b/assets/46_DDL_2-51d294c2.png new file mode 100644 index 00000000000..564855fca8b Binary files /dev/null and b/assets/46_DDL_2-51d294c2.png differ diff --git a/assets/47_DDL_3-7c0328bf.png b/assets/47_DDL_3-7c0328bf.png new file mode 100644 index 00000000000..5d59ad0245b Binary files /dev/null and b/assets/47_DDL_3-7c0328bf.png differ diff --git a/assets/48_DDL_4-b2177964.png b/assets/48_DDL_4-b2177964.png new file mode 100644 index 00000000000..fc9f228381c Binary files /dev/null and b/assets/48_DDL_4-b2177964.png differ diff --git a/assets/49_LogIndex_1-e49fa6a7.png b/assets/49_LogIndex_1-e49fa6a7.png new file mode 100644 index 00000000000..cb976176b88 Binary files /dev/null and b/assets/49_LogIndex_1-e49fa6a7.png differ diff --git a/assets/4_principles_of_shared_storage-1ff0f380.png b/assets/4_principles_of_shared_storage-1ff0f380.png new file mode 100644 index 00000000000..86fdca49972 Binary files /dev/null and b/assets/4_principles_of_shared_storage-1ff0f380.png differ diff --git a/assets/4_principles_of_shared_storage-3ac70e1b.png b/assets/4_principles_of_shared_storage-3ac70e1b.png new file mode 100644 index 00000000000..0704d3e817a Binary files /dev/null and b/assets/4_principles_of_shared_storage-3ac70e1b.png differ diff --git a/assets/50_LogIndex_2-2d85ed00.png b/assets/50_LogIndex_2-2d85ed00.png new file mode 100644 index 00000000000..0cde7346732 Binary files /dev/null and b/assets/50_LogIndex_2-2d85ed00.png differ diff --git a/assets/51_LogIndex_3-1c28dec4.png b/assets/51_LogIndex_3-1c28dec4.png new file mode 100644 index 00000000000..a45d736ba56 Binary files /dev/null and b/assets/51_LogIndex_3-1c28dec4.png differ diff --git a/assets/52_LogIndex_4-50a08309.png b/assets/52_LogIndex_4-50a08309.png new file mode 100644 index 00000000000..c78d526e359 Binary files /dev/null and b/assets/52_LogIndex_4-50a08309.png differ diff 
--git a/assets/53_LogIndex_5-3a25393f.png b/assets/53_LogIndex_5-3a25393f.png new file mode 100644 index 00000000000..5ed9bd0ea82 Binary files /dev/null and b/assets/53_LogIndex_5-3a25393f.png differ diff --git a/assets/54_LogIndex_6-ea27fcdf.png b/assets/54_LogIndex_6-ea27fcdf.png new file mode 100644 index 00000000000..fbfae0b7993 Binary files /dev/null and b/assets/54_LogIndex_6-ea27fcdf.png differ diff --git a/assets/55_LogIndex_7-a84ed0dd.png b/assets/55_LogIndex_7-a84ed0dd.png new file mode 100644 index 00000000000..70c96cf017b Binary files /dev/null and b/assets/55_LogIndex_7-a84ed0dd.png differ diff --git a/assets/56_LogIndex_8-3f14f302.png b/assets/56_LogIndex_8-3f14f302.png new file mode 100644 index 00000000000..2e0ed0293a7 Binary files /dev/null and b/assets/56_LogIndex_8-3f14f302.png differ diff --git a/assets/57_LogIndex_9-1fcc55d8.png b/assets/57_LogIndex_9-1fcc55d8.png new file mode 100644 index 00000000000..989e127e4c5 Binary files /dev/null and b/assets/57_LogIndex_9-1fcc55d8.png differ diff --git a/assets/58_LogIndex_10-2eab9094.png b/assets/58_LogIndex_10-2eab9094.png new file mode 100644 index 00000000000..cb52a69702d Binary files /dev/null and b/assets/58_LogIndex_10-2eab9094.png differ diff --git a/assets/59_LogIndex_11-e0277c33.png b/assets/59_LogIndex_11-e0277c33.png new file mode 100644 index 00000000000..1c27151324f Binary files /dev/null and b/assets/59_LogIndex_11-e0277c33.png differ diff --git a/assets/5_In-memory_page_synchronization-9737c89d.png b/assets/5_In-memory_page_synchronization-9737c89d.png new file mode 100644 index 00000000000..a4540f0c91c Binary files /dev/null and b/assets/5_In-memory_page_synchronization-9737c89d.png differ diff --git a/assets/5_In-memory_page_synchronization-edc6ee66.png b/assets/5_In-memory_page_synchronization-edc6ee66.png new file mode 100644 index 00000000000..5784c4e8281 Binary files /dev/null and b/assets/5_In-memory_page_synchronization-edc6ee66.png differ diff --git 
a/assets/60_LogIndex_12-6b577085.png b/assets/60_LogIndex_12-6b577085.png new file mode 100644 index 00000000000..2536aef4f9d Binary files /dev/null and b/assets/60_LogIndex_12-6b577085.png differ diff --git a/assets/61_LogIndex_13-4a2d72a8.png b/assets/61_LogIndex_13-4a2d72a8.png new file mode 100644 index 00000000000..fcb623980dd Binary files /dev/null and b/assets/61_LogIndex_13-4a2d72a8.png differ diff --git a/assets/62_LogIndex_14-c90cc6e7.png b/assets/62_LogIndex_14-c90cc6e7.png new file mode 100644 index 00000000000..01f00d92876 Binary files /dev/null and b/assets/62_LogIndex_14-c90cc6e7.png differ diff --git a/assets/63-PolarDBStack-arch-88440a72.png b/assets/63-PolarDBStack-arch-88440a72.png new file mode 100644 index 00000000000..3222807c177 Binary files /dev/null and b/assets/63-PolarDBStack-arch-88440a72.png differ diff --git a/assets/6_outdated_pages-08398a44.png b/assets/6_outdated_pages-08398a44.png new file mode 100644 index 00000000000..09bd968b8d4 Binary files /dev/null and b/assets/6_outdated_pages-08398a44.png differ diff --git a/assets/6_outdated_pages-0ec897bc.png b/assets/6_outdated_pages-0ec897bc.png new file mode 100644 index 00000000000..7e3ae3d1514 Binary files /dev/null and b/assets/6_outdated_pages-0ec897bc.png differ diff --git a/assets/7_solution_to_outdated_pages-15d7ced2.png b/assets/7_solution_to_outdated_pages-15d7ced2.png new file mode 100644 index 00000000000..36010ea455b Binary files /dev/null and b/assets/7_solution_to_outdated_pages-15d7ced2.png differ diff --git a/assets/7_solution_to_outdated_pages-4f375655.png b/assets/7_solution_to_outdated_pages-4f375655.png new file mode 100644 index 00000000000..f771ade6796 Binary files /dev/null and b/assets/7_solution_to_outdated_pages-4f375655.png differ diff --git a/assets/8_solution_to_outdated_pages_LogIndex-aea5e936.png b/assets/8_solution_to_outdated_pages_LogIndex-aea5e936.png new file mode 100644 index 00000000000..36915fe3930 Binary files /dev/null and 
b/assets/8_solution_to_outdated_pages_LogIndex-aea5e936.png differ diff --git a/assets/8_solution_to_outdated_pages_LogIndex-b696c625.png b/assets/8_solution_to_outdated_pages_LogIndex-b696c625.png new file mode 100644 index 00000000000..5394f97c053 Binary files /dev/null and b/assets/8_solution_to_outdated_pages_LogIndex-b696c625.png differ diff --git a/assets/9_future_pages-13873b1a.js b/assets/9_future_pages-13873b1a.js new file mode 100644 index 00000000000..7aebeb18d4e --- /dev/null +++ b/assets/9_future_pages-13873b1a.js @@ -0,0 +1 @@ +const s="/PolarDB-for-PostgreSQL/assets/1_polardb_architecture-1942f502.png",o="/PolarDB-for-PostgreSQL/assets/6_outdated_pages-08398a44.png",t="/PolarDB-for-PostgreSQL/assets/7_solution_to_outdated_pages-15d7ced2.png",a="/PolarDB-for-PostgreSQL/assets/9_future_pages-5180f2fc.png";export{s as _,o as a,t as b,a as c}; diff --git a/assets/9_future_pages-5180f2fc.png b/assets/9_future_pages-5180f2fc.png new file mode 100644 index 00000000000..752d29e983b Binary files /dev/null and b/assets/9_future_pages-5180f2fc.png differ diff --git a/assets/9_future_pages-9b52d775.png b/assets/9_future_pages-9b52d775.png new file mode 100644 index 00000000000..89459cca572 Binary files /dev/null and b/assets/9_future_pages-9b52d775.png differ diff --git a/assets/9_future_pages-9e3b8fc6.js b/assets/9_future_pages-9e3b8fc6.js new file mode 100644 index 00000000000..7d6b3255369 --- /dev/null +++ b/assets/9_future_pages-9e3b8fc6.js @@ -0,0 +1 @@ +const s="/PolarDB-for-PostgreSQL/assets/1_polardb_architecture-ce580cc6.png",o="/PolarDB-for-PostgreSQL/assets/6_outdated_pages-0ec897bc.png",t="/PolarDB-for-PostgreSQL/assets/7_solution_to_outdated_pages-4f375655.png",r="/PolarDB-for-PostgreSQL/assets/9_future_pages-9b52d775.png";export{s as _,o as a,t as b,r as c}; diff --git a/assets/ArticleInfo-e2b0e2fd.js b/assets/ArticleInfo-e2b0e2fd.js new file mode 100644 index 00000000000..ff73db339e2 --- /dev/null +++ b/assets/ArticleInfo-e2b0e2fd.js @@ -0,0 +1 @@ 
+import{f as h,t as r,o as c,c as i,u as s,a as t,g as o,h as l,_}from"./app-3d1677bf.js";const d={class:"line"},v={key:0,class:"line container"},p=t("svg",{t:"1658821554263",class:"icon",viewBox:"0 0 1024 1024",version:"1.1",xmlns:"http://www.w3.org/2000/svg","p-id":"12186",width:"16",height:"16"},[t("path",{d:"M171.1 861.3c5.9 3.7 12.6 5.6 19.3 5.6 5.3 0 10.7-1.2 15.6-3.6l446.9-215c12-5.8 19.8-17.6 20.4-30.9l8.9-202.6 19.6 12.5c6 3.9 12.7 5.7 19.4 5.7 11.9 0 23.5-5.9 30.3-16.6 10.7-16.7 5.8-39-10.9-49.7L721 354.3l44.3-51.2c6.8-7.9 9.9-18.4 8.4-28.7s-7.4-19.5-16.2-25.1L481.9 72.7c-8.8-5.6-19.6-7.2-29.6-4.2S434 78.6 429.7 88.1l-28 61.6-19.6-12.6c-16.7-10.7-39-5.8-49.7 10.9-10.7 16.7-5.8 39 10.9 49.7l17.8 11.4-181.1 92c-12 6.1-19.5 18.3-19.7 31.7l-5.7 497.7c-0.3 12.5 6 24.1 16.5 30.8z m512.5-573.9L659.8 315 463 189l15.1-33.2 205.5 131.6z m-252.3-33.2l91.2 58.4 89.6 57.4-9.7 222.8-313.1 150.7L431 522.2c24.7 1.1 49.3-10.7 63.6-33 21.4-33.4 11.7-77.7-21.7-99.1s-77.7-11.7-99.1 21.7c-14.3 22.3-14.7 49.5-3.4 71.5L227.9 705.8l4-350.3 199.4-101.3zM896.2 887.2l-704.2 0.1c-19.9 0-36 16.1-36 36s16.1 36 36 36l704.1-0.1c19.9 0 36-16.1 36-36s-16-36-35.9-36zM704.6 761.2c0 19.9 16.1 36 36 36h155.6c19.9 0 36-16.1 36-36s-16.1-36-36-36H740.6c-19.9 0-36 16.2-36 36z","p-id":"12187",fill:"#999999"})],-1),f={class:"text"},m={key:1,class:"line container"},u=t("svg",{t:"1658821678607",class:"icon",viewBox:"0 0 1084 1024",version:"1.1",xmlns:"http://www.w3.org/2000/svg","p-id":"14656",width:"16",height:"16"},[t("path",{d:"M679.96791406 629.91443867h96.25668487v96.25668399h-96.25668487zM491.78609667 629.91443867h96.25668399v96.25668399h-96.25668398zM303.60427842 629.91443867h96.25668398v96.25668399h-96.25668398zM679.96791406 427.77540107h96.25668487v96.25668487h-96.25668487zM491.78609667 427.77540107h96.25668399v96.25668487h-96.25668398zM303.60427842 427.77540107h96.25668398v96.25668487h-96.25668398z",fill:"#999999","p-id":"14657"}),t("path",{d:"M821.94652373 
105.79679141h-48.12834199V62h-96.25668398v43.79679141h-89.5187171V62h-96.25668399v43.79679141H398.8983957V62H302.64171084v43.79679141H254.51336885a192.51336885 192.51336885 0 0 0-192.51336885 192.51336885v471.17647089a192.51336885 192.51336885 0 0 0 192.51336885 192.51336885h567.43315488a192.51336885 192.51336885 0 0 0 192.51336973-192.51336885V298.31016026a192.51336885 192.51336885 0 0 0-192.51336973-192.51336885z m96.25668486 663.68983974a96.25668487 96.25668487 0 0 1-96.25668486 96.25668398H254.51336885a96.25668487 96.25668487 0 0 1-96.25668398-96.25668398V298.31016026 293.01604297h759.46523994v5.29411729z",fill:"#999999","p-id":"14658"})],-1),w={class:"text"},z={key:2,class:"line container"},g=t("svg",{t:"1658821512864",class:"icon",viewBox:"0 0 1024 1024",version:"1.1",xmlns:"http://www.w3.org/2000/svg","p-id":"10973",width:"16",height:"16"},[t("path",{d:"M836.879252 877.489158l-40.623209 0L796.256042 771.125689c0-59.154261-59.33641-108.910479-115.690906-157.030429-19.591197-16.72594-46.74872-41.527812-56.900941-53.077869l0-98.03478c10.152221-11.550057 37.310767-36.351929 56.900941-53.077869 56.354496-48.119951 115.690906-97.876168 115.690906-157.030429L796.256042 146.509818l40.623209 0c22.426779 0 40.609906-18.183128 40.609906-40.609906S859.306031 65.290005 836.879252 65.290005L187.120748 65.290005c-22.425755 0-40.609906 18.183128-40.609906 40.609906s18.184151 40.609906 40.609906 40.609906l40.596604 0 0 106.36347c0 59.154261 59.33334 108.910479 115.692952 157.030429 19.586081 16.72594 46.747697 41.527812 56.899918 53.077869l0 98.03478c-10.152221 11.550057-37.313837 36.351929-56.899918 53.077869C287.049668 662.21521 227.717352 711.971427 227.717352 771.125689l0 106.36347-40.596604 0c-22.425755 0-40.609906 18.183128-40.609906 40.609906 0 22.426779 18.184151 40.609906 40.609906 40.609906l649.759527 0c22.426779 0 40.609906-18.183128 40.609906-40.609906C877.489158 895.672286 859.306031 877.489158 836.879252 877.489158zM308.937165 771.125689c0-21.703301 
57.097416-69.555146 87.207178-95.263667 48.958038-41.794895 85.384669-71.994708 85.384669-110.442368L481.529011 458.580347c0-38.448684-36.426631-68.648496-85.384669-110.442368-30.109762-25.709545-87.207178-73.560366-87.207178-95.263667L308.937165 146.509818l406.099065 0 0 106.36347c0 21.707394-57.100486 69.555146-87.211271 95.263667-48.952922 41.794895-85.381599 71.994708-85.381599 110.442368l0 106.838284c0 38.448684 36.427654 68.648496 85.381599 110.442368 30.109762 25.709545 87.211271 73.556273 87.211271 95.263667l0 106.36347-406.099065 0L308.937165 771.125689zM603.358731 238.027162l0-20.662599c0-19.62599 15.907295-35.533284 35.533284-35.533284 19.62599 0 35.533284 15.908318 35.533284 35.533284l0 20.662599c0 19.62599-15.908318 35.533284-35.533284 35.533284C619.267049 273.560446 603.358731 257.652128 603.358731 238.027162z","p-id":"10974",fill:"#999999"})],-1),x={class:"text"},M=h({__name:"ArticleInfo",props:{frontmatter:{type:Object,required:!0}},setup(n){const a=n,{frontmatter:e}=r(a);return(L,y)=>(c(),i("div",d,[s(e).author?(c(),i("div",v,[p,t("p",f,o(s(e).author),1)])):l("v-if",!0),s(e).date?(c(),i("div",m,[u,t("p",w,o(s(e).date),1)])):l("v-if",!0),s(e).minute?(c(),i("div",z,[g,t("p",x,o(s(e).minute)+" min",1)])):l("v-if",!0)]))}});const B=_(M,[["__file","ArticleInfo.vue"]]);export{B as default}; diff --git a/assets/adaptive-scan.html-63d6e581.js b/assets/adaptive-scan.html-63d6e581.js new file mode 100644 index 00000000000..968e24d4a55 --- /dev/null +++ b/assets/adaptive-scan.html-63d6e581.js @@ -0,0 +1 @@ +const 
l=JSON.parse('{"key":"v-59700d71","path":"/zh/features/v11/epq/adaptive-scan.html","title":"自适应扫描","lang":"zh-CN","frontmatter":{"author":"步真","date":"2022/09/21","minute":25},"headers":[{"level":2,"title":"背景介绍","slug":"背景介绍","link":"#背景介绍","children":[]},{"level":2,"title":"术语","slug":"术语","link":"#术语","children":[]},{"level":2,"title":"功能介绍","slug":"功能介绍","link":"#功能介绍","children":[{"level":3,"title":"非自适应扫描","slug":"非自适应扫描","link":"#非自适应扫描","children":[]},{"level":3,"title":"自适应扫描","slug":"自适应扫描-1","link":"#自适应扫描-1","children":[]}]},{"level":2,"title":"功能设计","slug":"功能设计","link":"#功能设计","children":[{"level":3,"title":"非自适应扫描","slug":"非自适应扫描-1","link":"#非自适应扫描-1","children":[]},{"level":3,"title":"自适应扫描","slug":"自适应扫描-2","link":"#自适应扫描-2","children":[]}]},{"level":2,"title":"使用指南","slug":"使用指南","link":"#使用指南","children":[{"level":3,"title":"非自适应扫描","slug":"非自适应扫描-2","link":"#非自适应扫描-2","children":[]},{"level":3,"title":"自适应扫描","slug":"自适应扫描-3","link":"#自适应扫描-3","children":[]}]}],"git":{"updatedTime":1697908247000},"filePathRelative":"zh/features/v11/epq/adaptive-scan.md"}');export{l as data}; diff --git a/assets/adaptive-scan.html-a651e93c.js b/assets/adaptive-scan.html-a651e93c.js new file mode 100644 index 00000000000..24e187fd38c --- /dev/null +++ b/assets/adaptive-scan.html-a651e93c.js @@ -0,0 +1,66 @@ +import{_ as d,r as o,o as u,c as k,d as a,a as n,w as e,b as s,e as p}from"./app-3d1677bf.js";const m="/PolarDB-for-PostgreSQL/assets/htap-non-adaptive-scan-5fb1b1e0.png",b="/PolarDB-for-PostgreSQL/assets/htap-adaptive-scan-21b95764.png",v={},h=n("h1",{id:"自适应扫描",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#自适应扫描","aria-hidden":"true"},"#"),s(" 自适应扫描")],-1),_={class:"table-of-contents"},g=p('

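Background

PolarDB for PostgreSQL supports ePQ, an elastic cross-node parallel query feature that uses the computing power of multiple nodes in a cluster to run a single query in parallel across nodes. ePQ can parallelize several physical operators across nodes, including sequential scan and index scan. For the sequential scan operator, ePQ provides two scan modes: adaptive scan and non-adaptive scan.

Terminology

Overview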

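Non-adaptive Scan

Non-adaptive scan is the default scan mode of the ePQ sequential scan operator (Sequential Scan). Every PX Worker that takes part in a parallel query is assigned a unique Worker ID during execution. Non-adaptive scan uses the Worker ID to partition the Disk Unit IDs of the table's physical storage, so that each PX Worker scans an equal share of the table's storage units on shared storage. The scan results of all PX Workers are then combined to form the complete data set.

Adaptive Scan

In non-adaptive scan mode, the scan units are divided evenly among the PX Workers. When a few read-only nodes are short of computing resources, the scan can suffer from computation skew: a single parallel query issued by the user cannot finish for a long time, because it is held back by the under-resourced nodes that take a long time to complete their share of the scan.

The adaptive scan mode provided by ePQ solves this problem. Adaptive scan no longer binds each PX Worker to specific Disk Unit IDs. Instead, it uses a request-response model: through a dedicated RPC mechanism between the QC process and the PX Worker processes, the QC process tells each PX Worker process which scan task it may execute next, which eliminates the computation skew.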

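Design

Non-adaptive Scan

When the QC process starts a parallel query, it assigns a fixed Worker ID to each PX Worker process. Each PX Worker process takes the storage unit number modulo the number of workers and scans only the Disk Units that match its own Worker ID.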

non-adaptive-scan
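The static partitioning described above can be sketched in a few lines of C. This is only an illustration under assumed names: `unit_owned_by_worker` and `units_for_worker` are hypothetical helpers, not functions from the PolarDB source, and the sketch assumes Worker IDs are numbered from 0.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not PolarDB source): true when a Disk Unit belongs to
 * the given PX Worker under static modulo partitioning.
 */
bool
unit_owned_by_worker(uint64_t disk_unit_id, uint32_t worker_id,
                     uint32_t worker_count)
{
    return worker_count > 0 && disk_unit_id % worker_count == worker_id;
}

/* Counts how many of the first total_units Disk Units one worker would scan. */
uint64_t
units_for_worker(uint64_t total_units, uint32_t worker_id,
                 uint32_t worker_count)
{
    uint64_t owned = 0;

    for (uint64_t unit = 0; unit < total_units; unit++)
    {
        if (unit_owned_by_worker(unit, worker_id, worker_count))
            owned++;
    }
    return owned;
}
```

Because ownership is fixed up front, a worker on a slow node still has to finish its full share, which is the computation skew that adaptive scan is meant to remove.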

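Adaptive Scan

When the QC process starts a parallel query, it launches an adaptive scan thread that receives and handles request messages from the PX Worker processes. The adaptive scan thread tracks the progress of the current scan task and, based on how far each PX Worker process has progressed, hands out the Disk Unit IDs that still need to be scanned. For the last Disk Unit to be scanned, the adaptive scan thread wakes up the PX Workers that are idle, to speed up the scan of that final Disk Unit.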

adaptive-scan
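The request-response idea can be sketched as a tiny dispatcher that lives on the QC side. The names `ScanTask`, `scan_task_next_unit`, and `NO_MORE_UNITS` are invented for this sketch and are not PolarDB identifiers; the special handling of the last Disk Unit is deliberately left out.

```c
#include <stdint.h>

#define NO_MORE_UNITS (-1)

/* Hypothetical progress record kept by the adaptive scan thread. */
typedef struct ScanTask
{
    int64_t next_unit;   /* first Disk Unit that has not been handed out */
    int64_t total_units; /* number of Disk Units in the table */
} ScanTask;

/*
 * Answers one worker request: returns the next unscanned Disk Unit and
 * advances the progress, or NO_MORE_UNITS once the table is exhausted.
 */
int64_t
scan_task_next_unit(ScanTask *task)
{
    if (task->next_unit >= task->total_units)
        return NO_MORE_UNITS;
    return task->next_unit++;
}
```

With this shape, a worker on a slow node simply sends fewer requests and therefore receives fewer units, while faster workers pick up the remainder.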

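Message Communication

Because the adaptive scan thread and the PX Worker processes exchange little data and do so infrequently, they reuse the existing libpq connections between the QC process and the PX Worker processes for message exchange. The adaptive scan thread uses poll to synchronously check for requests and responses from the PX Worker processes when needed.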

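Scan Task Coordination

When a PX Worker process executes the sequential scan operator, it first sends a query request to the QC process, passing the following information to the adaptive scan thread on the QC side:

On receiving the request, the adaptive scan thread creates a scan task or updates the progress of an existing one.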

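Variable Granularity

To reduce the number of network round trips caused by requests, ePQ uses a variable task granularity. When a large part of the scan remains, a PX Worker process receives many physical blocks per request; when little of the scan remains, it receives correspondingly fewer blocks per request. This balances network overhead against load balancing.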
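One way to picture variable granularity is a batch size that shrinks with the remaining work. The rule below (hand out the remaining blocks divided by twice the worker count, clamped to a range) is an assumption chosen for illustration, not the formula PolarDB uses, and `blocks_to_hand_out` is a hypothetical name.

```c
#include <stdint.h>

/*
 * Hypothetical sizing rule: the batch shrinks as the scan nears its end.
 * If every worker took one batch of this size, about half of the remaining
 * blocks would be consumed. The result is clamped to [1, max_batch].
 */
int64_t
blocks_to_hand_out(int64_t remaining_blocks, int32_t worker_count,
                   int64_t max_batch)
{
    int64_t batch;

    if (remaining_blocks <= 0 || worker_count <= 0 || max_batch <= 0)
        return 0;

    batch = remaining_blocks / (2 * (int64_t) worker_count);
    if (batch < 1)
        batch = 1;
    if (batch > max_batch)
        batch = max_batch;
    return batch;
}
```

Large batches early on keep the request count low, and small batches near the end keep any single slow worker from holding a large tail of unscanned blocks.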

Cache Friendliness

',26),P=p(`

Message Design

Usage

Create a test table:

postgres=# CREATE TABLE t(id INT);
+CREATE TABLE
+postgres=# INSERT INTO t VALUES(generate_series(1,100));
+INSERT 0 100
+

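Non-adaptive Scan

Enable ePQ parallel query and set the per-node degree of parallelism to 3. EXPLAIN shows that the plan comes from the PX optimizer. Because two read-only nodes take part in the test, the plan shows an overall degree of parallelism of 6.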

postgres=# SET polar_enable_px = 1;
+SET
+postgres=# SET polar_px_dop_per_node = 3;
+SET
+postgres=# SHOW polar_px_enable_adps;
+ polar_px_enable_adps
+----------------------
+ off
+(1 row)
+
+postgres=# EXPLAIN SELECT * FROM t;
+                                  QUERY PLAN
+-------------------------------------------------------------------------------
+ PX Coordinator 6:1  (slice1; segments: 6)  (cost=0.00..431.00 rows=1 width=4)
+   ->  Partial Seq Scan on t  (cost=0.00..431.00 rows=1 width=4)
+ Optimizer: PolarDB PX Optimizer
+(3 rows)
+
+postgres=# SELECT COUNT(*) FROM t;
+ count
+-------
+   100
+(1 row)
+

Adaptive Scan

After the adaptive scan switch is turned on, EXPLAIN ANALYZE shows the physical block numbers scanned by each PX Worker process.

postgres=# SET polar_enable_px = 1;
+SET
+postgres=# SET polar_px_dop_per_node = 3;
+SET
+postgres=# SET polar_px_enable_adps = 1;
+SET
+postgres=# SHOW polar_px_enable_adps;
+ polar_px_enable_adps
+----------------------
+ on
+(1 row)
+
+postgres=# SET polar_px_enable_adps_explain_analyze = 1;
+SET
+postgres=# SHOW polar_px_enable_adps_explain_analyze;
+ polar_px_enable_adps_explain_analyze
+--------------------------------------
+ on
+(1 row)
+
+postgres=# EXPLAIN ANALYZE SELECT * FROM t;
+                                                        QUERY PLAN
+---------------------------------------------------------------------------------------------------------------------------
+ PX Coordinator 6:1  (slice1; segments: 6)  (cost=0.00..431.00 rows=1 width=4) (actual time=0.968..0.982 rows=100 loops=1)
+   ->  Partial Seq Scan on t  (cost=0.00..431.00 rows=1 width=4) (actual time=0.380..0.435 rows=100 loops=1)
+         Dynamic Pages Per Worker: [1]
+ Planning Time: 5.571 ms
+ Optimizer: PolarDB PX Optimizer
+   (slice0)    Executor memory: 23K bytes.
+   (slice1)    Executor memory: 14K bytes avg x 6 workers, 14K bytes max (seg0).
+ Execution Time: 9.047 ms
+(8 rows)
+
+postgres=# SELECT COUNT(*) FROM t;
+ count
+-------
+   100
+(1 row)
+
`,11);function w(r,f){const l=o("Badge"),c=o("ArticleInfo"),t=o("router-link"),i=o("RouterLink");return u(),k("div",null,[h,a(l,{type:"tip",text:"V11 / v1.1.17-",vertical:"top"}),a(c,{frontmatter:r.$frontmatter},null,8,["frontmatter"]),n("nav",_,[n("ul",null,[n("li",null,[a(t,{to:"#背景介绍"},{default:e(()=>[s("背景介绍")]),_:1})]),n("li",null,[a(t,{to:"#术语"},{default:e(()=>[s("术语")]),_:1})]),n("li",null,[a(t,{to:"#功能介绍"},{default:e(()=>[s("功能介绍")]),_:1}),n("ul",null,[n("li",null,[a(t,{to:"#非自适应扫描"},{default:e(()=>[s("非自适应扫描")]),_:1})]),n("li",null,[a(t,{to:"#自适应扫描-1"},{default:e(()=>[s("自适应扫描")]),_:1})])])]),n("li",null,[a(t,{to:"#功能设计"},{default:e(()=>[s("功能设计")]),_:1}),n("ul",null,[n("li",null,[a(t,{to:"#非自适应扫描-1"},{default:e(()=>[s("非自适应扫描")]),_:1})]),n("li",null,[a(t,{to:"#自适应扫描-2"},{default:e(()=>[s("自适应扫描")]),_:1})])])]),n("li",null,[a(t,{to:"#使用指南"},{default:e(()=>[s("使用指南")]),_:1}),n("ul",null,[n("li",null,[a(t,{to:"#非自适应扫描-2"},{default:e(()=>[s("非自适应扫描")]),_:1})]),n("li",null,[a(t,{to:"#自适应扫描-3"},{default:e(()=>[s("自适应扫描")]),_:1})])])])])]),g,n("p",null,[s("自适应扫描模式将尽量保证每个节点在多次执行并行查询任务时,能够重用 Shared Buffer 缓存,避免缓存频繁更新 / 淘汰。在实现上,自适应扫描功能会根据 "),a(i,{to:"/zh/features/v11/epq/cluster-info.html"},{default:e(()=>[s("集群拓扑视图")]),_:1}),s(" 配置的节点 IP 地址信息,采用缓存绑定策略,尽量让同一个物理 Page 被同一个节点复用。")]),P])}const x=d(v,[["render",w],["__file","adaptive-scan.html.vue"]]);export{x as default}; diff --git a/assets/aliyun-ecs-instance-b4e46a52.png b/assets/aliyun-ecs-instance-b4e46a52.png new file mode 100644 index 00000000000..b776526f5c9 Binary files /dev/null and b/assets/aliyun-ecs-instance-b4e46a52.png differ diff --git a/assets/aliyun-ecs-procedure-60ba621e.png b/assets/aliyun-ecs-procedure-60ba621e.png new file mode 100644 index 00000000000..fc49357dc18 Binary files /dev/null and b/assets/aliyun-ecs-procedure-60ba621e.png differ diff --git a/assets/aliyun-ecs-specs-323b2032.png b/assets/aliyun-ecs-specs-323b2032.png new file mode 100644 index 00000000000..0cf061d5623 Binary files 
/dev/null and b/assets/aliyun-ecs-specs-323b2032.png differ diff --git a/assets/aliyun-ecs-system-disk-c8a747ce.png b/assets/aliyun-ecs-system-disk-c8a747ce.png new file mode 100644 index 00000000000..83d688ab132 Binary files /dev/null and b/assets/aliyun-ecs-system-disk-c8a747ce.png differ diff --git a/assets/aliyun-essd-mounted-f02e5c42.png b/assets/aliyun-essd-mounted-f02e5c42.png new file mode 100644 index 00000000000..2689339e107 Binary files /dev/null and b/assets/aliyun-essd-mounted-f02e5c42.png differ diff --git a/assets/aliyun-essd-mounting-1b470123.png b/assets/aliyun-essd-mounting-1b470123.png new file mode 100644 index 00000000000..580ac221d68 Binary files /dev/null and b/assets/aliyun-essd-mounting-1b470123.png differ diff --git a/assets/aliyun-essd-ready-to-mount-59aa890c.png b/assets/aliyun-essd-ready-to-mount-59aa890c.png new file mode 100644 index 00000000000..02d9f545772 Binary files /dev/null and b/assets/aliyun-essd-ready-to-mount-59aa890c.png differ diff --git a/assets/aliyun-essd-specs-207958e6.png b/assets/aliyun-essd-specs-207958e6.png new file mode 100644 index 00000000000..e32edc1b8f6 Binary files /dev/null and b/assets/aliyun-essd-specs-207958e6.png differ diff --git a/assets/analyze.html-4db5cb7d.js b/assets/analyze.html-4db5cb7d.js new file mode 100644 index 00000000000..3c8afdf927e --- /dev/null +++ b/assets/analyze.html-4db5cb7d.js @@ -0,0 +1 @@ +const t=JSON.parse('{"key":"v-28309dcf","path":"/zh/theory/analyze.html","title":"ANALYZE 源码解读","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/06/20","minute":15},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"统计信息","slug":"统计信息","link":"#统计信息","children":[{"level":3,"title":"Most Common Values 
(MCV)","slug":"most-common-values-mcv","link":"#most-common-values-mcv","children":[]},{"level":3,"title":"Histogram","slug":"histogram","link":"#histogram","children":[]},{"level":3,"title":"Correlation","slug":"correlation","link":"#correlation","children":[]},{"level":3,"title":"Most Common Elements","slug":"most-common-elements","link":"#most-common-elements","children":[]},{"level":3,"title":"Distinct Elements Count Histogram","slug":"distinct-elements-count-histogram","link":"#distinct-elements-count-histogram","children":[]},{"level":3,"title":"Length Histogram","slug":"length-histogram","link":"#length-histogram","children":[]},{"level":3,"title":"Bounds Histogram","slug":"bounds-histogram","link":"#bounds-histogram","children":[]}]},{"level":2,"title":"内核执行流程","slug":"内核执行流程","link":"#内核执行流程","children":[{"level":3,"title":"compute_trivial_stats","slug":"compute-trivial-stats","link":"#compute-trivial-stats","children":[]},{"level":3,"title":"compute_distinct_stats","slug":"compute-distinct-stats","link":"#compute-distinct-stats","children":[]},{"level":3,"title":"compute_scalar_stats","slug":"compute-scalar-stats","link":"#compute-scalar-stats","children":[]}]},{"level":2,"title":"总结","slug":"总结","link":"#总结","children":[]},{"level":2,"title":"参考资料","slug":"参考资料","link":"#参考资料","children":[]}],"git":{"updatedTime":1658822731000},"filePathRelative":"zh/theory/analyze.md"}');export{t as data}; diff --git a/assets/analyze.html-70b019fc.js b/assets/analyze.html-70b019fc.js new file mode 100644 index 00000000000..e77f438be6c --- /dev/null +++ b/assets/analyze.html-70b019fc.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-5879645e","path":"/theory/analyze.html","title":"Code Analysis of 
ANALYZE","lang":"en-US","frontmatter":{"author":"棠羽","date":"2022/06/20","minute":15},"headers":[{"level":2,"title":"Background","slug":"background","link":"#background","children":[]},{"level":2,"title":"Statistics","slug":"statistics","link":"#statistics","children":[{"level":3,"title":"Most Common Values (MCV)","slug":"most-common-values-mcv","link":"#most-common-values-mcv","children":[]},{"level":3,"title":"Histogram","slug":"histogram","link":"#histogram","children":[]},{"level":3,"title":"Correlation","slug":"correlation","link":"#correlation","children":[]},{"level":3,"title":"Most Common Elements","slug":"most-common-elements","link":"#most-common-elements","children":[]},{"level":3,"title":"Distinct Elements Count Histogram","slug":"distinct-elements-count-histogram","link":"#distinct-elements-count-histogram","children":[]},{"level":3,"title":"Length Histogram","slug":"length-histogram","link":"#length-histogram","children":[]},{"level":3,"title":"Bounds Histogram","slug":"bounds-histogram","link":"#bounds-histogram","children":[]}]},{"level":2,"title":"Kernel Execution of ANALYZE","slug":"kernel-execution-of-analyze","link":"#kernel-execution-of-analyze","children":[{"level":3,"title":"compute_trivial_stats","slug":"compute-trivial-stats","link":"#compute-trivial-stats","children":[]},{"level":3,"title":"compute_distinct_stats","slug":"compute-distinct-stats","link":"#compute-distinct-stats","children":[]},{"level":3,"title":"compute_scalar_stats","slug":"compute-scalar-stats","link":"#compute-scalar-stats","children":[]}]},{"level":2,"title":"Summary","slug":"summary","link":"#summary","children":[]},{"level":2,"title":"References","slug":"references","link":"#references","children":[]}],"git":{"updatedTime":1658822731000},"filePathRelative":"theory/analyze.md"}');export{e as data}; diff --git a/assets/analyze.html-877fc82a.js b/assets/analyze.html-877fc82a.js new file mode 100644 index 00000000000..a1a32198e35 --- /dev/null +++ 
b/assets/analyze.html-877fc82a.js @@ -0,0 +1,607 @@ +import{_ as l,r as t,o as p,c,d as s,a as n,b as a,e as r}from"./app-3d1677bf.js";const d={},u=n("h1",{id:"code-analysis-of-analyze",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#code-analysis-of-analyze","aria-hidden":"true"},"#"),a(" Code Analysis of ANALYZE")],-1),v=r(`

Background

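In the optimizer, PostgreSQL produces for each query tree the physical plan tree with the highest execution efficiency. Efficiency is measured by cost estimation. For example, estimating the number of tuples a query returns and the width of those tuples gives the I/O cost, and the CPU cost can be estimated from the physical operations that will be executed. The optimizer obtains the key statistics needed for cost estimation from the system catalog pg_statistic, and the statistics in pg_statistic are in turn computed by automatic or manual ANALYZE operations (or VACUUM with the ANALYZE option). ANALYZE scans the data in a table and analyzes it column by column, then writes statistics such as each column's data distribution, most common values, and their frequencies into the system catalog.

This article examines how ANALYZE is implemented, working from the source code. The source code used is PostgreSQL 14, the latest stable release at the time of writing.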

Statistics

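First, we should be clear about what the analysis produces. Let us look at the columns of pg_statistic and what each of them means. Each row in this system catalog holds the statistics of one column of some other table.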

postgres=# \\d+ pg_statistic
+                                 Table "pg_catalog.pg_statistic"
+   Column    |   Type   | Collation | Nullable | Default | Storage  | Stats target | Description
+-------------+----------+-----------+----------+---------+----------+--------------+-------------
+ starelid    | oid      |           | not null |         | plain    |              |
+ staattnum   | smallint |           | not null |         | plain    |              |
+ stainherit  | boolean  |           | not null |         | plain    |              |
+ stanullfrac | real     |           | not null |         | plain    |              |
+ stawidth    | integer  |           | not null |         | plain    |              |
+ stadistinct | real     |           | not null |         | plain    |              |
+ stakind1    | smallint |           | not null |         | plain    |              |
+ stakind2    | smallint |           | not null |         | plain    |              |
+ stakind3    | smallint |           | not null |         | plain    |              |
+ stakind4    | smallint |           | not null |         | plain    |              |
+ stakind5    | smallint |           | not null |         | plain    |              |
+ staop1      | oid      |           | not null |         | plain    |              |
+ staop2      | oid      |           | not null |         | plain    |              |
+ staop3      | oid      |           | not null |         | plain    |              |
+ staop4      | oid      |           | not null |         | plain    |              |
+ staop5      | oid      |           | not null |         | plain    |              |
+ stanumbers1 | real[]   |           |          |         | extended |              |
+ stanumbers2 | real[]   |           |          |         | extended |              |
+ stanumbers3 | real[]   |           |          |         | extended |              |
+ stanumbers4 | real[]   |           |          |         | extended |              |
+ stanumbers5 | real[]   |           |          |         | extended |              |
+ stavalues1  | anyarray |           |          |         | extended |              |
+ stavalues2  | anyarray |           |          |         | extended |              |
+ stavalues3  | anyarray |           |          |         | extended |              |
+ stavalues4  | anyarray |           |          |         | extended |              |
+ stavalues5  | anyarray |           |          |         | extended |              |
+Indexes:
+    "pg_statistic_relid_att_inh_index" UNIQUE, btree (starelid, staattnum, stainherit)
+
/* ----------------
+ *      pg_statistic definition.  cpp turns this into
+ *      typedef struct FormData_pg_statistic
+ * ----------------
+ */
+CATALOG(pg_statistic,2619,StatisticRelationId)
+{
+    /* These fields form the unique key for the entry: */
+    Oid         starelid BKI_LOOKUP(pg_class);  /* relation containing
+                                                 * attribute */
+    int16       staattnum;      /* attribute (column) stats are for */
+    bool        stainherit;     /* true if inheritance children are included */
+
+    /* the fraction of the column's entries that are NULL: */
+    float4      stanullfrac;
+
+    /*
+     * stawidth is the average width in bytes of non-null entries.  For
+     * fixed-width datatypes this is of course the same as the typlen, but for
+     * var-width types it is more useful.  Note that this is the average width
+     * of the data as actually stored, post-TOASTing (eg, for a
+     * moved-out-of-line value, only the size of the pointer object is
+     * counted).  This is the appropriate definition for the primary use of
+     * the statistic, which is to estimate sizes of in-memory hash tables of
+     * tuples.
+     */
+    int32       stawidth;
+
+    /* ----------------
+     * stadistinct indicates the (approximate) number of distinct non-null
+     * data values in the column.  The interpretation is:
+     *      0       unknown or not computed
+     *      > 0     actual number of distinct values
+     *      < 0     negative of multiplier for number of rows
+     * The special negative case allows us to cope with columns that are
+     * unique (stadistinct = -1) or nearly so (for example, a column in which
+     * non-null values appear about twice on the average could be represented
+     * by stadistinct = -0.5 if there are no nulls, or -0.4 if 20% of the
+     * column is nulls).  Because the number-of-rows statistic in pg_class may
+     * be updated more frequently than pg_statistic is, it's important to be
+     * able to describe such situations as a multiple of the number of rows,
+     * rather than a fixed number of distinct values.  But in other cases a
+     * fixed number is correct (eg, a boolean column).
+     * ----------------
+     */
+    float4      stadistinct;
+
+    /* ----------------
+     * To allow keeping statistics on different kinds of datatypes,
+     * we do not hard-wire any particular meaning for the remaining
+     * statistical fields.  Instead, we provide several "slots" in which
+     * statistical data can be placed.  Each slot includes:
+     *      kind            integer code identifying kind of data (see below)
+     *      op              OID of associated operator, if needed
+     *      coll            OID of relevant collation, or 0 if none
+     *      numbers         float4 array (for statistical values)
+     *      values          anyarray (for representations of data values)
+     * The ID, operator, and collation fields are never NULL; they are zeroes
+     * in an unused slot.  The numbers and values fields are NULL in an
+     * unused slot, and might also be NULL in a used slot if the slot kind
+     * has no need for one or the other.
+     * ----------------
+     */
+
+    int16       stakind1;
+    int16       stakind2;
+    int16       stakind3;
+    int16       stakind4;
+    int16       stakind5;
+
+    Oid         staop1 BKI_LOOKUP_OPT(pg_operator);
+    Oid         staop2 BKI_LOOKUP_OPT(pg_operator);
+    Oid         staop3 BKI_LOOKUP_OPT(pg_operator);
+    Oid         staop4 BKI_LOOKUP_OPT(pg_operator);
+    Oid         staop5 BKI_LOOKUP_OPT(pg_operator);
+
+    Oid         stacoll1 BKI_LOOKUP_OPT(pg_collation);
+    Oid         stacoll2 BKI_LOOKUP_OPT(pg_collation);
+    Oid         stacoll3 BKI_LOOKUP_OPT(pg_collation);
+    Oid         stacoll4 BKI_LOOKUP_OPT(pg_collation);
+    Oid         stacoll5 BKI_LOOKUP_OPT(pg_collation);
+
+#ifdef CATALOG_VARLEN           /* variable-length fields start here */
+    float4      stanumbers1[1];
+    float4      stanumbers2[1];
+    float4      stanumbers3[1];
+    float4      stanumbers4[1];
+    float4      stanumbers5[1];
+
+    /*
+     * Values in these arrays are values of the column's data type, or of some
+     * related type such as an array element type.  We presently have to cheat
+     * quite a bit to allow polymorphic arrays of this kind, but perhaps
+     * someday it'll be a less bogus facility.
+     */
+    anyarray    stavalues1;
+    anyarray    stavalues2;
+    anyarray    stavalues3;
+    anyarray    stavalues4;
+    anyarray    stavalues5;
+#endif
+} FormData_pg_statistic;
+

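Whether viewed from the database command line or from the kernel C code, the statistics have the same content. All attributes start with sta. Among them, as the kernel comments above describe: starelid identifies the relation that contains the column, staattnum is the number of the column the statistics are for, and stainherit tells whether inheritance children are included; stanullfrac is the fraction of the column's entries that are NULL; stawidth is the average width in bytes of non-null entries; stadistinct is the approximate number of distinct non-null values in the column.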
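The sign convention of stadistinct in the kernel comment above is easy to misread, so here is a minimal sketch of how a consumer might decode it. The function name `estimated_distinct_values` and the use of -1.0 as an "unknown" marker are assumptions made for this example; they are not PostgreSQL APIs.

```c
/*
 * Decodes a stadistinct value into an estimated number of distinct values.
 *   > 0  the value is the number of distinct values itself
 *   < 0  the value is the negative of a multiplier of the row count
 *   = 0  unknown or not computed; this sketch returns -1.0 so the caller
 *        can fall back to a default of its own
 */
double
estimated_distinct_values(double stadistinct, double reltuples)
{
    if (stadistinct > 0.0)
        return stadistinct;
    if (stadistinct < 0.0)
        return -stadistinct * reltuples;
    return -1.0;
}
```

For a unique column (stadistinct = -1) the estimate follows the row count automatically, which is why the negative form stays valid when the table grows between two ANALYZE runs.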

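Because the statistics that can be computed differ slightly from one data type to another, PostgreSQL reserves a number of slots for storing the remaining statistics. The kernel currently reserves five slots: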

#define STATISTIC_NUM_SLOTS  5
+

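Each kind of statistic can occupy one slot, and what goes into the slot is entirely up to the definition of that kind of statistic. As the kernel comment above lists, each slot offers the following fields, where N is the slot number from 1 to 5: stakindN (the kind code), staopN (the associated operator), stacollN (the relevant collation), stanumbersN (a float4 array), and stavaluesN (an anyarray).

The PostgreSQL kernel reserves the kind codes 1 to 99 for core PostgreSQL statistics. The other ranges are allocated as described in the kernel comment: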

/*
+ * The present allocation of "kind" codes is:
+ *
+ *  1-99:       reserved for assignment by the core PostgreSQL project
+ *              (values in this range will be documented in this file)
+ *  100-199:    reserved for assignment by the PostGIS project
+ *              (values to be documented in PostGIS documentation)
+ *  200-299:    reserved for assignment by the ESRI ST_Geometry project
+ *              (values to be documented in ESRI ST_Geometry documentation)
+ *  300-9999:   reserved for future public assignments
+ *
+ * For private use you may choose a "kind" code at random in the range
+ * 10000-30000.  However, for code that is to be widely disseminated it is
+ * better to obtain a publicly defined "kind" code by request from the
+ * PostgreSQL Global Development Group.
+ */
+
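The allocation in the comment above can be restated as a small classifier. The enum `StatKindOwner` and the function `stat_kind_owner` are made up for this sketch and do not exist in PostgreSQL; only the numeric ranges come from the kernel comment.

```c
/* Hypothetical labels for the ranges listed in the kernel comment. */
typedef enum StatKindOwner
{
    STAT_KIND_UNASSIGNED,    /* outside every documented range */
    STAT_KIND_CORE,          /* 1-99: core PostgreSQL project */
    STAT_KIND_POSTGIS,       /* 100-199: PostGIS project */
    STAT_KIND_ESRI,          /* 200-299: ESRI ST_Geometry project */
    STAT_KIND_FUTURE_PUBLIC, /* 300-9999: future public assignments */
    STAT_KIND_PRIVATE        /* 10000-30000: private use */
} StatKindOwner;

/* Maps a stakind code to the range it falls in. */
StatKindOwner
stat_kind_owner(int kind)
{
    if (kind >= 1 && kind <= 99)
        return STAT_KIND_CORE;
    if (kind >= 100 && kind <= 199)
        return STAT_KIND_POSTGIS;
    if (kind >= 200 && kind <= 299)
        return STAT_KIND_ESRI;
    if (kind >= 300 && kind <= 9999)
        return STAT_KIND_FUTURE_PUBLIC;
    if (kind >= 10000 && kind <= 30000)
        return STAT_KIND_PRIVATE;
    return STAT_KIND_UNASSIGNED;
}
```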

The kernel code currently defines seven kinds of core PostgreSQL statistics, numbered 1 to 7. Let us see how each of them uses the slots described above.

Most Common Values (MCV)

/*
+ * In a "most common values" slot, staop is the OID of the "=" operator
+ * used to decide whether values are the same or not, and stacoll is the
+ * collation used (same as column's collation).  stavalues contains
+ * the K most common non-null values appearing in the column, and stanumbers
+ * contains their frequencies (fractions of total row count).  The values
+ * shall be ordered in decreasing frequency.  Note that since the arrays are
+ * variable-size, K may be chosen by the statistics collector.  Values should
+ * not appear in MCV unless they have been observed to occur more than once;
+ * a unique column will have no MCV slot.
+ */
+#define STATISTIC_KIND_MCV  1
+

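For the most common values of a column, staop stores the = operator that decides whether a value equals one of the most common values. stavalues stores the K most common non-null values of the column, ordered by decreasing frequency, and stanumbers stores the frequency of each of these K values.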
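To see why the optimizer cares about this slot, consider estimating the selectivity of `col = x`. The sketch below is a simplified illustration for an integer column and is only loosely modelled on what a planner does: the function name, the parameter list, and the handling of values outside the list are assumptions made for the example.

```c
#include <stddef.h>

/*
 * Estimates the fraction of rows where the column equals probe.
 * mcv_values / mcv_freqs mirror the stavalues / stanumbers arrays of an MCV
 * slot with k entries. A value outside the list is assumed to share the
 * remaining non-null population evenly with the other distinct values.
 */
double
eq_selectivity_from_mcv(const int *mcv_values, const float *mcv_freqs,
                        size_t k, double nullfrac, double ndistinct,
                        int probe)
{
    double mcv_total = 0.0;
    double other_distinct;
    double rest;

    for (size_t i = 0; i < k; i++)
    {
        if (mcv_values[i] == probe)
            return mcv_freqs[i];      /* exact frequency is known */
        mcv_total += mcv_freqs[i];
    }

    rest = 1.0 - mcv_total - nullfrac;
    if (rest < 0.0)
        rest = 0.0;

    other_distinct = ndistinct - (double) k;
    if (other_distinct < 1.0)
        other_distinct = 1.0;

    return rest / other_distinct;
}
```

A value found in the list gets its recorded frequency directly; everything else falls back to an even split, which is exactly where skewed columns would be misestimated without the MCV slot.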

Histogram

/*
+ * A "histogram" slot describes the distribution of scalar data.  staop is
+ * the OID of the "<" operator that describes the sort ordering, and stacoll
+ * is the relevant collation.  (In theory more than one histogram could appear,
+ * if a datatype has more than one useful sort operator or we care about more
+ * than one collation.  Currently the collation will always be that of the
+ * underlying column.)  stavalues contains M (>=2) non-null values that
+ * divide the non-null column data values into M-1 bins of approximately equal
+ * population.  The first stavalues item is the MIN and the last is the MAX.
+ * stanumbers is not used and should be NULL.  IMPORTANT POINT: if an MCV
+ * slot is also provided, then the histogram describes the data distribution
+ * *after removing the values listed in MCV* (thus, it's a "compressed
+ * histogram" in the technical parlance).  This allows a more accurate
+ * representation of the distribution of a column with some very-common
+ * values.  In a column with only a few distinct values, it's possible that
+ * the MCV list describes the entire data population; in this case the
+ * histogram reduces to empty and should be omitted.
+ */
+#define STATISTIC_KIND_HISTOGRAM  2
+

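This slot describes the data distribution histogram of a (scalar) column. staop stores the < operator that defines the sort order of the distribution. stavalues contains M non-null values that divide the non-null values of the column into M - 1 bins of approximately equal population. If the column also has an MCV slot, the histogram excludes the values listed in the MCV slot, so that it describes the rest of the distribution more accurately.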
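An equal-population histogram makes range predicates cheap to estimate. The following sketch estimates the selectivity of `col < x` by locating the bin that contains x and interpolating linearly inside it. It is a simplified illustration for double values, the function name is hypothetical, and the result applies only to the population the histogram describes (non-null values that are not in the MCV list).

```c
#include <stddef.h>

/*
 * bounds holds the m histogram boundaries (the stavalues of the slot) in
 * ascending order. Returns the estimated fraction of histogram rows that
 * are below probe, or -1.0 when there is no usable histogram (m < 2).
 */
double
lt_selectivity_from_histogram(const double *bounds, size_t m, double probe)
{
    if (m < 2)
        return -1.0;
    if (probe <= bounds[0])
        return 0.0;
    if (probe > bounds[m - 1])
        return 1.0;

    for (size_t i = 0; i + 1 < m; i++)
    {
        if (probe <= bounds[i + 1])
        {
            double width = bounds[i + 1] - bounds[i];
            /* position of probe inside bin i; 0.5 for a zero-width bin */
            double frac = width > 0.0 ? (probe - bounds[i]) / width : 0.5;

            /* i full bins plus a fraction of bin i, out of m - 1 bins */
            return ((double) i + frac) / (double) (m - 1);
        }
    }
    return 1.0;
}
```

Since every bin holds about the same number of rows, the estimate depends only on which bin the value lands in and how far into that bin it sits.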

Correlation

/*
+ * A "correlation" slot describes the correlation between the physical order
+ * of table tuples and the ordering of data values of this column, as seen
+ * by the "<" operator identified by staop with the collation identified by
+ * stacoll.  (As with the histogram, more than one entry could theoretically
+ * appear.)  stavalues is not used and should be NULL.  stanumbers contains
+ * a single entry, the correlation coefficient between the sequence of data
+ * values and the sequence of their actual tuple positions.  The coefficient
+ * ranges from +1 to -1.
+ */
+#define STATISTIC_KIND_CORRELATION  3
+

stanumbers 中保存数据值和它们的实际元组位置的相关系数。

Most Common Elements

/*
+ * A "most common elements" slot is similar to a "most common values" slot,
+ * except that it stores the most common non-null *elements* of the column
+ * values.  This is useful when the column datatype is an array or some other
+ * type with identifiable elements (for instance, tsvector).  staop contains
+ * the equality operator appropriate to the element type, and stacoll
+ * contains the collation to use with it.  stavalues contains
+ * the most common element values, and stanumbers their frequencies.  Unlike
+ * MCV slots, frequencies are measured as the fraction of non-null rows the
+ * element value appears in, not the frequency of all rows.  Also unlike
+ * MCV slots, the values are sorted into the element type's default order
+ * (to support binary search for a particular value).  Since this puts the
+ * minimum and maximum frequencies at unpredictable spots in stanumbers,
+ * there are two extra members of stanumbers, holding copies of the minimum
+ * and maximum frequencies.  Optionally, there can be a third extra member,
+ * which holds the frequency of null elements (expressed in the same terms:
+ * the fraction of non-null rows that contain at least one null element).  If
+ * this member is omitted, the column is presumed to contain no null elements.
+ *
+ * Note: in current usage for tsvector columns, the stavalues elements are of
+ * type text, even though their representation within tsvector is not
+ * exactly text.
+ */
+#define STATISTIC_KIND_MCELEM  4
+

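This slot is similar to MCV, but it stores the most common elements of the column values and is mainly used for arrays and similar types. Likewise, staop stores the equality operator appropriate to the element type, which is used to compare elements. Unlike MCV, the frequency is computed over the non-null rows rather than over all rows. In addition, the common elements are sorted in the default order of the element type, so that they can be found by binary search.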
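The sorted layout is what makes lookups cheap. As a rough sketch for an integer element type, the function below binary-searches the element list and reads the matching frequency; for an element that is not tracked it returns the stored minimum frequency as an approximate upper bound. The function name and that fallback are assumptions for illustration, and the optional third extra member of stanumbers is ignored.

```c
#include <stddef.h>

/*
 * elems: the n most common elements, sorted ascending (stavalues).
 * numbers: n frequencies followed by the minimum and maximum frequency
 *          (stanumbers), so it has at least n + 2 entries.
 * Returns the recorded frequency of probe, or the minimum tracked
 * frequency when probe is not in the list.
 */
double
mcelem_frequency(const int *elems, const float *numbers, size_t n, int probe)
{
    size_t lo = 0;
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (elems[mid] == probe)
            return numbers[mid];
        if (elems[mid] < probe)
            lo = mid + 1;
        else
            hi = mid;
    }
    return numbers[n]; /* copy of the minimum frequency */
}
```

Sorting by value instead of by frequency is the reason the two extra members exist: the minimum and maximum can no longer be read from the ends of the array.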

Distinct Elements Count Histogram

/*
+ * A "distinct elements count histogram" slot describes the distribution of
+ * the number of distinct element values present in each row of an array-type
+ * column.  Only non-null rows are considered, and only non-null elements.
+ * staop contains the equality operator appropriate to the element type,
+ * and stacoll contains the collation to use with it.
+ * stavalues is not used and should be NULL.  The last member of stanumbers is
+ * the average count of distinct element values over all non-null rows.  The
+ * preceding M (>=2) members form a histogram that divides the population of
+ * distinct-elements counts into M-1 bins of approximately equal population.
+ * The first of these is the minimum observed count, and the last the maximum.
+ */
+#define STATISTIC_KIND_DECHIST  5
+

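This slot describes the distribution of the number of distinct element values in each row of an array-type column. The first M members of the stanumbers array are boundary values that divide the per-row distinct-element counts into M - 1 bins of approximately equal population. They are followed by one more member, the average number of distinct elements over all non-null rows. This statistic is presumably used to compute selectivity.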

Length Histogram

/*
+ * A "length histogram" slot describes the distribution of range lengths in
+ * rows of a range-type column. stanumbers contains a single entry, the
+ * fraction of empty ranges. stavalues is a histogram of non-empty lengths, in
+ * a format similar to STATISTIC_KIND_HISTOGRAM: it contains M (>=2) range
+ * values that divide the column data values into M-1 bins of approximately
+ * equal population. The lengths are stored as float8s, as measured by the
+ * range type's subdiff function. Only non-null rows are considered.
+ */
+#define STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM  6
+

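The length histogram describes the distribution of range lengths in a range-type column. It is also a histogram of M values, stored in stavalues, while stanumbers holds a single entry: the fraction of empty ranges.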

Bounds Histogram

/*
+ * A "bounds histogram" slot is similar to STATISTIC_KIND_HISTOGRAM, but for
+ * a range-type column.  stavalues contains M (>=2) range values that divide
+ * the column data values into M-1 bins of approximately equal population.
+ * Unlike a regular scalar histogram, this is actually two histograms combined
+ * into a single array, with the lower bounds of each value forming a
+ * histogram of lower bounds, and the upper bounds a histogram of upper
+ * bounds.  Only non-NULL, non-empty ranges are included.
+ */
+#define STATISTIC_KIND_BOUNDS_HISTOGRAM  7
+

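The bounds histogram is also used for range types and is similar to the data distribution histogram. stavalues stores M range values that divide the values of the column into M - 1 bins of approximately equal population. Only non-null, non-empty ranges are considered.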

Kernel Execution of ANALYZE

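Now that we know what pg_statistic has to store, let us see how the kernel collects and computes this information by stepping into the PostgreSQL kernel code that executes commands. For a utility command such as ANALYZE, the switch statement in standard_ProcessUtility() routes each kind of command to the function that implements it.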

/*
+ * standard_ProcessUtility itself deals only with utility commands for
+ * which we do not provide event trigger support.  Commands that do have
+ * such support are passed down to ProcessUtilitySlow, which contains the
+ * necessary infrastructure for such triggers.
+ *
+ * This division is not just for performance: it's critical that the
+ * event trigger code not be invoked when doing START TRANSACTION for
+ * example, because we might need to refresh the event trigger cache,
+ * which requires being in a valid transaction.
+ */
+void
+standard_ProcessUtility(PlannedStmt *pstmt,
+                        const char *queryString,
+                        bool readOnlyTree,
+                        ProcessUtilityContext context,
+                        ParamListInfo params,
+                        QueryEnvironment *queryEnv,
+                        DestReceiver *dest,
+                        QueryCompletion *qc)
+{
+    // ...
+
+    switch (nodeTag(parsetree))
+    {
+        // ...
+
+        case T_VacuumStmt:
+            ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
+            break;
+
+        // ...
+    }
+
+    // ...
+}
+

ANALYZE shares its entry point with VACUUM: both go into the ExecVacuum() function.

/*
+ * Primary entry point for manual VACUUM and ANALYZE commands
+ *
+ * This is mainly a preparation wrapper for the real operations that will
+ * happen in vacuum().
+ */
+void
+ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
+{
+    // ...
+
+    /* Now go through the common routine */
+    vacuum(vacstmt->rels, &params, NULL, isTopLevel);
+}
+

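After parsing a pile of options, control reaches vacuum(). Here the kernel first works out which tables to analyze, because ANALYZE can be run with no arguments (every table in the current database), with a list of tables, or with a list of specific columns for a table.

Once the set of tables is known, each table is passed to analyze_rel() in turn: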

if (params->options & VACOPT_ANALYZE)
+{
+    // ...
+
+    analyze_rel(vrel->oid, vrel->relation, params,
+                vrel->va_cols, in_outer_xact, vac_strategy);
+
+    // ...
+}
+

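Inside analyze_rel(), the kernel takes a ShareUpdateExclusiveLock on the table to be analyzed, which prevents two ANALYZE operations from running on it concurrently. It then decides how to proceed based on the kind of table (for a foreign table, for example, it calls the ANALYZE support provided by the FDW routine). Finally, it passes the table to do_analyze_rel().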

/*
+ *  analyze_rel() -- analyze one relation
+ *
+ * relid identifies the relation to analyze.  If relation is supplied, use
+ * the name therein for reporting any failure to open/lock the rel; do not
+ * use it once we've successfully opened the rel, since it might be stale.
+ */
+void
+analyze_rel(Oid relid, RangeVar *relation,
+            VacuumParams *params, List *va_cols, bool in_outer_xact,
+            BufferAccessStrategy bstrategy)
+{
+    // ...
+
+    /*
+     * Do the normal non-recursive ANALYZE.  We can skip this for partitioned
+     * tables, which don't contain any rows.
+     */
+    if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+        do_analyze_rel(onerel, params, va_cols, acquirefunc,
+                       relpages, false, in_outer_xact, elevel);
+
+    // ...
+}
+

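Inside do_analyze_rel(), the kernel narrows down which columns of the table to analyze: the user may have asked for only a few columns, since frequently accessed columns are the ones most worth analyzing. It also opens all indexes on the table to see whether they contain anything that can be analyzed.

To get statistics for a column, we obviously have to read its data from disk and compute on it. That raises a key question: how many rows should be scanned? In theory, analyzing as much data as possible, ideally all of it, gives the most accurate statistics. For a very large table, however, the data will not fit in memory, and the time the analysis would take is unacceptable. So the number of sampled rows is capped, and the user can tune that cap through the statistics target, trading run-time cost against the accuracy of the statistics: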

/*
+ * Determine how many rows we need to sample, using the worst case from
+ * all analyzable columns.  We use a lower bound of 100 rows to avoid
+ * possible overflow in Vitter's algorithm.  (Note: that will also be the
+ * target in the corner case where there are no analyzable columns.)
+ */
+targrows = 100;
+for (i = 0; i < attr_cnt; i++)
+{
+    if (targrows < vacattrstats[i]->minrows)
+        targrows = vacattrstats[i]->minrows;
+}
+for (ind = 0; ind < nindexes; ind++)
+{
+    AnlIndexData *thisdata = &indexdata[ind];
+
+    for (i = 0; i < thisdata->attr_cnt; i++)
+    {
+        if (targrows < thisdata->vacattrstats[i]->minrows)
+            targrows = thisdata->vacattrstats[i]->minrows;
+    }
+}
+
+/*
+ * Look at extended statistics objects too, as those may define custom
+ * statistics target. So we may need to sample more rows and then build
+ * the statistics with enough detail.
+ */
+minrows = ComputeExtStatisticsRows(onerel, attr_cnt, vacattrstats);
+
+if (targrows < minrows)
+    targrows = minrows;
+
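
The sample-size rule in the snippet above can be restated as a small standalone function. This is only a sketch: `sample_target_rows` and its `stat_targets` array are hypothetical names, and the `300 * target` request per column is taken from std_typanalyze(), shown later in this article. Index columns and extended statistics are left out.

```c
#include <assert.h>

/*
 * Sample size for one ANALYZE run (simplified): each column asks for
 * 300 * its statistics target, and the overall target is the largest
 * request, never below 100 rows.
 */
int
sample_target_rows(const int *stat_targets, int ncols)
{
    int targrows = 100;     /* lower bound used by do_analyze_rel() */

    assert(ncols >= 0);
    for (int i = 0; i < ncols; i++)
    {
        int minrows = 300 * stat_targets[i];

        if (targrows < minrows)
            targrows = minrows;
    }
    return targrows;
}
```

With the default statistics target of 100 on every column, this gives a sample of 30000 rows no matter how large the table is.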

With the sample size decided, the kernel allocates a tuple array of that length and starts sampling through the acquirefunc function pointer:

/*
+ * Acquire the sample rows
+ */
+rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
+pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
+                             inh ? PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS_INH :
+                             PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS);
+if (inh)
+    numrows = acquire_inherited_sample_rows(onerel, elevel,
+                                            rows, targrows,
+                                            &totalrows, &totaldeadrows);
+else
+    numrows = (*acquirefunc) (onerel, elevel,
+                              rows, targrows,
+                              &totalrows, &totaldeadrows);
+

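This function pointer points to acquire_sample_rows(), which was set up in analyze_rel(). The function samples the table in two stages: stage one selects up to targrows random blocks, and stage two scans those blocks and uses Vitter's algorithm to keep a random sample of up to targrows rows.

The two stages run at the same time. When sampling finishes, the sampled tuples are sitting in the tuple array. The array is then sorted by tuple position with quicksort, and the sample is used to estimate the number of live and dead tuples in the whole table: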

/*
+ * acquire_sample_rows -- acquire a random sample of rows from the table
+ *
+ * Selected rows are returned in the caller-allocated array rows[], which
+ * must have at least targrows entries.
+ * The actual number of rows selected is returned as the function result.
+ * We also estimate the total numbers of live and dead rows in the table,
+ * and return them into *totalrows and *totaldeadrows, respectively.
+ *
+ * The returned list of tuples is in order by physical position in the table.
+ * (We will rely on this later to derive correlation estimates.)
+ *
+ * As of May 2004 we use a new two-stage method:  Stage one selects up
+ * to targrows random blocks (or all blocks, if there aren't so many).
+ * Stage two scans these blocks and uses the Vitter algorithm to create
+ * a random sample of targrows rows (or less, if there are less in the
+ * sample of blocks).  The two stages are executed simultaneously: each
+ * block is processed as soon as stage one returns its number and while
+ * the rows are read stage two controls which ones are to be inserted
+ * into the sample.
+ *
+ * Although every row has an equal chance of ending up in the final
+ * sample, this sampling method is not perfect: not every possible
+ * sample has an equal chance of being selected.  For large relations
+ * the number of different blocks represented by the sample tends to be
+ * too small.  We can live with that for now.  Improvements are welcome.
+ *
+ * An important property of this sampling method is that because we do
+ * look at a statistically unbiased set of blocks, we should get
+ * unbiased estimates of the average numbers of live and dead rows per
+ * block.  The previous sampling method put too much credence in the row
+ * density near the start of the table.
+ */
+static int
+acquire_sample_rows(Relation onerel, int elevel,
+                    HeapTuple *rows, int targrows,
+                    double *totalrows, double *totaldeadrows)
+{
+    // ...
+
+    /* Outer loop over blocks to sample */
+    while (BlockSampler_HasMore(&bs))
+    {
+        bool        block_accepted;
+        BlockNumber targblock = BlockSampler_Next(&bs);
+        // ...
+    }
+
+    // ...
+
+    /*
+     * If we didn't find as many tuples as we wanted then we're done. No sort
+     * is needed, since they're already in order.
+     *
+     * Otherwise we need to sort the collected tuples by position
+     * (itempointer). It's not worth worrying about corner cases where the
+     * tuples are already sorted.
+     */
+    if (numrows == targrows)
+        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
+
+    /*
+     * Estimate total numbers of live and dead rows in relation, extrapolating
+     * on the assumption that the average tuple density in pages we didn't
+     * scan is the same as in the pages we did scan.  Since what we scanned is
+     * a random sample of the pages in the relation, this should be a good
+     * assumption.
+     */
+    if (bs.m > 0)
+    {
+        *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
+        *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+    }
+    else
+    {
+        *totalrows = 0.0;
+        *totaldeadrows = 0.0;
+    }
+
+    // ...
+}
+
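
To make stage two and the final extrapolation more concrete, here is a minimal sketch. It deliberately uses the simpler Algorithm R for reservoir sampling, not the Vitter algorithm that acquire_sample_rows() actually uses, and it samples plain integers instead of heap tuples. All names (`next_random`, `reservoir_sample`, `extrapolate_rows`) are made up for this illustration.

```c
#include <assert.h>

/* Tiny deterministic PRNG (an LCG) so the sketch is reproducible. */
static unsigned long long
next_random(unsigned long long *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return *state >> 33;
}

/*
 * Keep a random sample of up to k items out of stream[0..n-1]
 * (Algorithm R).  Returns the number of items stored in sample[].
 */
int
reservoir_sample(const int *stream, int n, int *sample, int k,
                 unsigned long long seed)
{
    int kept = 0;

    assert(k > 0 && n >= 0);
    for (int i = 0; i < n; i++)
    {
        if (kept < k)
            sample[kept++] = stream[i];     /* fill the reservoir first */
        else
        {
            /* item i replaces a random slot with probability k / (i + 1) */
            unsigned long long j =
                next_random(&seed) % (unsigned long long) (i + 1);

            if (j < (unsigned long long) k)
                sample[j] = stream[i];
        }
    }
    return kept;
}

/*
 * Extrapolate a table-wide row count from the rows seen in the sampled
 * blocks, rounding to the nearest integer like floor(x + 0.5).
 */
double
extrapolate_rows(double rows_seen, int blocks_sampled, int total_blocks)
{
    if (blocks_sampled <= 0)
        return 0.0;
    return (double) (long long)
        (rows_seen / blocks_sampled * total_blocks + 0.5);
}
```

For example, if 500 live rows were seen in 10 sampled blocks of a 100-block table, `extrapolate_rows(500.0, 10, 100)` estimates 5000 live rows. The estimate is only as good as the assumption, stated in the kernel comment, that unscanned pages have the same tuple density as scanned ones.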

Back in do_analyze_rel(): with the sample in hand, the kernel computes statistics for each column to be analyzed and then updates the pg_statistic catalog:

/*
+ * Compute the statistics.  Temporary results during the calculations for
+ * each column are stored in a child context.  The calc routines are
+ * responsible to make sure that whatever they store into the VacAttrStats
+ * structure is allocated in anl_context.
+ */
+if (numrows > 0)
+{
+    // ...
+
+    for (i = 0; i < attr_cnt; i++)
+    {
+        VacAttrStats *stats = vacattrstats[i];
+        AttributeOpts *aopt;
+
+        stats->rows = rows;
+        stats->tupDesc = onerel->rd_att;
+        stats->compute_stats(stats,
+                             std_fetch_func,
+                             numrows,
+                             totalrows);
+
+        // ...
+    }
+
+    // ...
+
+    /*
+     * Emit the completed stats rows into pg_statistic, replacing any
+     * previous statistics for the target columns.  (If there are stats in
+     * pg_statistic for columns we didn't process, we leave them alone.)
+     */
+    update_attstats(RelationGetRelid(onerel), inh,
+                    attr_cnt, vacattrstats);
+
+    // ...
+}
+

Clearly, the compute_stats function pointer points to a different function for different column types. So let's look at where the pointer is assigned:

/*
+ * std_typanalyze -- the default type-specific typanalyze function
+ */
+bool
+std_typanalyze(VacAttrStats *stats)
+{
+    // ...
+
+    /*
+     * Determine which standard statistics algorithm to use
+     */
+    if (OidIsValid(eqopr) && OidIsValid(ltopr))
+    {
+        /* Seems to be a scalar datatype */
+        stats->compute_stats = compute_scalar_stats;
+        /*--------------------
+         * The following choice of minrows is based on the paper
+         * "Random sampling for histogram construction: how much is enough?"
+         * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
+         * Proceedings of ACM SIGMOD International Conference on Management
+         * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
+         * says that for table size n, histogram size k, maximum relative
+         * error in bin size f, and error probability gamma, the minimum
+         * random sample size is
+         *      r = 4 * k * ln(2*n/gamma) / f^2
+         * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
+         *      r = 305.82 * k
+         * Note that because of the log function, the dependence on n is
+         * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
+         * bin size error with probability 0.99.  So there's no real need to
+         * scale for n, which is a good thing because we don't necessarily
+         * know it at this point.
+         *--------------------
+         */
+        stats->minrows = 300 * attr->attstattarget;
+    }
+    else if (OidIsValid(eqopr))
+    {
+        /* We can still recognize distinct values */
+        stats->compute_stats = compute_distinct_stats;
+        /* Might as well use the same minrows as above */
+        stats->minrows = 300 * attr->attstattarget;
+    }
+    else
+    {
+        /* Can't do much but the trivial stuff */
+        stats->compute_stats = compute_trivial_stats;
+        /* Might as well use the same minrows as above */
+        stats->minrows = 300 * attr->attstattarget;
+    }
+
+    // ...
+}
+
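
The constant 300 in the comment above can be checked by hand. The arithmetic below simply plugs the values quoted in the comment into its formula (k is the histogram size, that is, the statistics target); nothing here goes beyond what the comment states.

```latex
r = \frac{4\,k\,\ln(2n/\gamma)}{f^{2}}
  = \frac{4\,k\,\ln\!\left(2\cdot 10^{6}/0.01\right)}{0.5^{2}}
  = 16\,k\,\ln\!\left(2\cdot 10^{8}\right)
  \approx 16 \times 19.114\,k
  \approx 305.82\,k

% Solving for f with r = 300k and n = 10^{12}:
f = \sqrt{\frac{4\,\ln\!\left(2\cdot 10^{14}\right)}{300}}
  \approx \sqrt{\frac{4 \times 32.929}{300}}
  \approx 0.66
```

Growing the table by six orders of magnitude only moves the bin-size error bound from 0.5 to about 0.66, which is why the code uses a flat 300 * k without scaling for n.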

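This conditional can be read as follows: if the data type has both an = and a < operator, use compute_scalar_stats(); if it has only an = operator, use compute_distinct_stats(); otherwise fall back to compute_trivial_stats().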
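
The same decision can be written as a tiny standalone function. This is a sketch only: the enum and `choose_stats_algorithm` are hypothetical names, and the two booleans stand in for the `OidIsValid(eqopr)` and `OidIsValid(ltopr)` checks in std_typanalyze().

```c
#include <assert.h>
#include <stdbool.h>

typedef enum StatsAlgorithm
{
    STATS_TRIVIAL,      /* stands for compute_trivial_stats */
    STATS_DISTINCT,     /* stands for compute_distinct_stats */
    STATS_SCALAR        /* stands for compute_scalar_stats */
} StatsAlgorithm;

/* Pick the analysis routine from the operators the data type offers. */
StatsAlgorithm
choose_stats_algorithm(bool has_eq, bool has_lt)
{
    if (has_eq && has_lt)
        return STATS_SCALAR;    /* values can be compared and ordered */
    if (has_eq)
        return STATS_DISTINCT;  /* values can only be compared for equality */
    return STATS_TRIVIAL;       /* nothing but NULL-ness and width */
}
```

Note that a type with a < operator but no = operator still ends up in the trivial branch, exactly as in the original if/else chain.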

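We could look at what each of these three functions does, but I am not going to walk through their logic in depth. The ideas come from some very old statistics papers, old enough that the letters in the PDFs are barely legible. The code is not especially readable either, because it largely implements formulas from those papers; without reading them, the variables and formulas are hard to follow.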

compute_trivial_stats

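If a column's data type supports neither the equality operator nor the comparison operator, only some simple analysis is possible, namely the fraction of non-null rows and the average datum width.

Both are easy to obtain with a single loop over the sampled tuple array.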

/*
+ *  compute_trivial_stats() -- compute very basic column statistics
+ *
+ *  We use this when we cannot find a hash "=" operator for the datatype.
+ *
+ *  We determine the fraction of non-null rows and the average datum width.
+ */
+static void
+compute_trivial_stats(VacAttrStatsP stats,
+                      AnalyzeAttrFetchFunc fetchfunc,
+                      int samplerows,
+                      double totalrows)
+{}
+
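
Since the function body is elided above, here is a rough sketch of what such a loop could look like. It is not the PostgreSQL implementation: `SampleDatum`, `TrivialStats`, and `trivial_stats` are hypothetical, and each sampled value is reduced to a NULL flag plus a stored width.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct SampleDatum
{
    bool    is_null;
    int     width;      /* stored size in bytes; ignored when is_null */
} SampleDatum;

typedef struct TrivialStats
{
    double  nullfrac;   /* fraction of sampled rows that are NULL */
    int     avg_width;  /* average width of non-null values, rounded */
} TrivialStats;

/* One pass over the sample: count NULLs and add up the widths. */
TrivialStats
trivial_stats(const SampleDatum *sample, int n)
{
    TrivialStats result = {0.0, 0};
    int         nonnull = 0;
    long        total_width = 0;

    assert(n > 0);
    for (int i = 0; i < n; i++)
    {
        if (sample[i].is_null)
            continue;
        nonnull++;
        total_width += sample[i].width;
    }
    result.nullfrac = (double) (n - nonnull) / n;
    if (nonnull > 0)
        result.avg_width = (int) ((double) total_width / nonnull + 0.5);
    return result;
}
```

The two results correspond to the stanullfrac and stawidth columns of pg_statistic.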

compute_distinct_stats

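If a column supports only the equality operator, we can tell what a value is but cannot order it relative to other values. We therefore cannot analyze how values are distributed over a range, only how often they occur. The statistics this function computes are the fraction of non-null rows, the average width, the most common values, and the estimated number of distinct values.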

/*
+ *  compute_distinct_stats() -- compute column statistics including ndistinct
+ *
+ *  We use this when we can find only an "=" operator for the datatype.
+ *
+ *  We determine the fraction of non-null rows, the average width, the
+ *  most common values, and the (estimated) number of distinct values.
+ *
+ *  The most common values are determined by brute force: we keep a list
+ *  of previously seen values, ordered by number of times seen, as we scan
+ *  the samples.  A newly seen value is inserted just after the last
+ *  multiply-seen value, causing the bottommost (oldest) singly-seen value
+ *  to drop off the list.  The accuracy of this method, and also its cost,
+ *  depend mainly on the length of the list we are willing to keep.
+ */
+static void
+compute_distinct_stats(VacAttrStatsP stats,
+                       AnalyzeAttrFetchFunc fetchfunc,
+                       int samplerows,
+                       double totalrows)
+{}
+
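
The "(estimated) number of distinct values" needs an estimator, because the sample only shows part of the table. The sketch below assumes the estimator that the comments in analyze.c attribute to Haas and Stokes, n*d / (n - f1 + f1*n/N). It leaves out the special cases the real code handles first (for example, a column whose sampled values are all unique), and `estimate_ndistinct` is a hypothetical name.

```c
#include <assert.h>

/*
 * n:  non-null rows in the sample
 * N:  estimated non-null rows in the whole table
 * d:  distinct values seen in the sample
 * f1: values seen exactly once in the sample
 */
double
estimate_ndistinct(int n, double N, int d, int f1)
{
    double numer, denom, est;

    assert(n > 0 && N >= n && d > 0 && d <= n && f1 >= 0 && f1 <= d);

    numer = (double) n * (double) d;
    denom = (double) (n - f1) + (double) f1 * (double) n / N;
    est = numer / denom;

    /* clamp to the sane range [d, N] */
    if (est < d)
        est = d;
    if (est > N)
        est = N;
    return est;
}
```

When no value is seen exactly once (f1 = 0), the estimate is just d: the sample has probably seen every value. The more singletons the sample contains, the further the estimate is pushed toward N.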

compute_scalar_stats

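If a column's data type supports both the equality operator and the comparison operator, the most thorough analysis is possible. In addition to everything compute_distinct_stats() produces, the targets include the distribution histogram and the correlation between physical and logical order.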

/*
+ *  compute_scalar_stats() -- compute column statistics
+ *
+ *  We use this when we can find "=" and "<" operators for the datatype.
+ *
+ *  We determine the fraction of non-null rows, the average width, the
+ *  most common values, the (estimated) number of distinct values, the
+ *  distribution histogram, and the correlation of physical to logical order.
+ *
+ *  The desired stats can be determined fairly easily after sorting the
+ *  data values into order.
+ */
+static void
+compute_scalar_stats(VacAttrStatsP stats,
+                     AnalyzeAttrFetchFunc fetchfunc,
+                     int samplerows,
+                     double totalrows)
+{}
+

Summary

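Starting from the statistics that the PostgreSQL optimizer needs, this article walked through the overall flow of the ANALYZE command. For brevity, the walkthrough does not cover the various corner cases and the logic that handles them.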

References

PostgreSQL 14 Documentation: ANALYZE (https://www.postgresql.org/docs/current/sql-analyze.html)

PostgreSQL 14 Documentation: 25.1. Routine Vacuuming (https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-STATISTICS)

PostgreSQL 14 Documentation: 14.2. Statistics Used by the Planner (https://www.postgresql.org/docs/current/planner-stats.html)

PostgreSQL 14 Documentation: 52.49. pg_statistic (https://www.postgresql.org/docs/current/catalog-pg-statistic.html)

阿里云数据库内核月报 2016/05:PostgreSQL 特性分析 统计信息计算方法 [Alibaba Cloud Database Kernel Monthly 2016/05: PostgreSQL feature analysis, how statistics are computed] (http://mysql.taobao.org/monthly/2016/05/09/)

diff --git a/assets/analyze.html-f587193a.js b/assets/analyze.html-f587193a.js new file mode 100644 index 00000000000..93fd7311555 --- /dev/null +++ b/assets/analyze.html-f587193a.js @@ -0,0 +1,607 @@

ANALYZE Source Code Walkthrough

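Background

The PostgreSQL optimizer turns a query tree into the physical plan tree that it expects to execute most efficiently. Efficiency is measured through cost estimation. For example, estimating how many tuples a query returns and how wide they are gives the I/O cost; the CPU cost can likewise be estimated from the physical operations that will be executed. The optimizer gets the key statistics used in cost estimation from the pg_statistic system catalog, and the statistics in pg_statistic are in turn computed by automatic or manual ANALYZE operations (or by VACUUM with the ANALYZE option). ANALYZE scans the data in a table, analyzes it column by column, and writes statistics such as each column's data distribution, most common values, and their frequencies into the catalog.

This article looks at how ANALYZE is implemented, from the source code. The source used is PostgreSQL 14, the latest stable release at the time of writing.

Statistics

First we should be clear about what the analysis outputs. So let's look at which columns pg_statistic has and what each one means. Each row of this catalog holds the statistics for one column of some other table.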

postgres=# \\d+ pg_statistic
+                                 Table "pg_catalog.pg_statistic"
+   Column    |   Type   | Collation | Nullable | Default | Storage  | Stats target | Description
+-------------+----------+-----------+----------+---------+----------+--------------+-------------
+ starelid    | oid      |           | not null |         | plain    |              |
+ staattnum   | smallint |           | not null |         | plain    |              |
+ stainherit  | boolean  |           | not null |         | plain    |              |
+ stanullfrac | real     |           | not null |         | plain    |              |
+ stawidth    | integer  |           | not null |         | plain    |              |
+ stadistinct | real     |           | not null |         | plain    |              |
+ stakind1    | smallint |           | not null |         | plain    |              |
+ stakind2    | smallint |           | not null |         | plain    |              |
+ stakind3    | smallint |           | not null |         | plain    |              |
+ stakind4    | smallint |           | not null |         | plain    |              |
+ stakind5    | smallint |           | not null |         | plain    |              |
+ staop1      | oid      |           | not null |         | plain    |              |
+ staop2      | oid      |           | not null |         | plain    |              |
+ staop3      | oid      |           | not null |         | plain    |              |
+ staop4      | oid      |           | not null |         | plain    |              |
+ staop5      | oid      |           | not null |         | plain    |              |
+ stanumbers1 | real[]   |           |          |         | extended |              |
+ stanumbers2 | real[]   |           |          |         | extended |              |
+ stanumbers3 | real[]   |           |          |         | extended |              |
+ stanumbers4 | real[]   |           |          |         | extended |              |
+ stanumbers5 | real[]   |           |          |         | extended |              |
+ stavalues1  | anyarray |           |          |         | extended |              |
+ stavalues2  | anyarray |           |          |         | extended |              |
+ stavalues3  | anyarray |           |          |         | extended |              |
+ stavalues4  | anyarray |           |          |         | extended |              |
+ stavalues5  | anyarray |           |          |         | extended |              |
+Indexes:
+    "pg_statistic_relid_att_inh_index" UNIQUE, btree (starelid, staattnum, stainherit)
+
/* ----------------
+ *      pg_statistic definition.  cpp turns this into
+ *      typedef struct FormData_pg_statistic
+ * ----------------
+ */
+CATALOG(pg_statistic,2619,StatisticRelationId)
+{
+    /* These fields form the unique key for the entry: */
+    Oid         starelid BKI_LOOKUP(pg_class);  /* relation containing
+                                                 * attribute */
+    int16       staattnum;      /* attribute (column) stats are for */
+    bool        stainherit;     /* true if inheritance children are included */
+
+    /* the fraction of the column's entries that are NULL: */
+    float4      stanullfrac;
+
+    /*
+     * stawidth is the average width in bytes of non-null entries.  For
+     * fixed-width datatypes this is of course the same as the typlen, but for
+     * var-width types it is more useful.  Note that this is the average width
+     * of the data as actually stored, post-TOASTing (eg, for a
+     * moved-out-of-line value, only the size of the pointer object is
+     * counted).  This is the appropriate definition for the primary use of
+     * the statistic, which is to estimate sizes of in-memory hash tables of
+     * tuples.
+     */
+    int32       stawidth;
+
+    /* ----------------
+     * stadistinct indicates the (approximate) number of distinct non-null
+     * data values in the column.  The interpretation is:
+     *      0       unknown or not computed
+     *      > 0     actual number of distinct values
+     *      < 0     negative of multiplier for number of rows
+     * The special negative case allows us to cope with columns that are
+     * unique (stadistinct = -1) or nearly so (for example, a column in which
+     * non-null values appear about twice on the average could be represented
+     * by stadistinct = -0.5 if there are no nulls, or -0.4 if 20% of the
+     * column is nulls).  Because the number-of-rows statistic in pg_class may
+     * be updated more frequently than pg_statistic is, it's important to be
+     * able to describe such situations as a multiple of the number of rows,
+     * rather than a fixed number of distinct values.  But in other cases a
+     * fixed number is correct (eg, a boolean column).
+     * ----------------
+     */
+    float4      stadistinct;
+
+    /* ----------------
+     * To allow keeping statistics on different kinds of datatypes,
+     * we do not hard-wire any particular meaning for the remaining
+     * statistical fields.  Instead, we provide several "slots" in which
+     * statistical data can be placed.  Each slot includes:
+     *      kind            integer code identifying kind of data (see below)
+     *      op              OID of associated operator, if needed
+     *      coll            OID of relevant collation, or 0 if none
+     *      numbers         float4 array (for statistical values)
+     *      values          anyarray (for representations of data values)
+     * The ID, operator, and collation fields are never NULL; they are zeroes
+     * in an unused slot.  The numbers and values fields are NULL in an
+     * unused slot, and might also be NULL in a used slot if the slot kind
+     * has no need for one or the other.
+     * ----------------
+     */
+
+    int16       stakind1;
+    int16       stakind2;
+    int16       stakind3;
+    int16       stakind4;
+    int16       stakind5;
+
+    Oid         staop1 BKI_LOOKUP_OPT(pg_operator);
+    Oid         staop2 BKI_LOOKUP_OPT(pg_operator);
+    Oid         staop3 BKI_LOOKUP_OPT(pg_operator);
+    Oid         staop4 BKI_LOOKUP_OPT(pg_operator);
+    Oid         staop5 BKI_LOOKUP_OPT(pg_operator);
+
+    Oid         stacoll1 BKI_LOOKUP_OPT(pg_collation);
+    Oid         stacoll2 BKI_LOOKUP_OPT(pg_collation);
+    Oid         stacoll3 BKI_LOOKUP_OPT(pg_collation);
+    Oid         stacoll4 BKI_LOOKUP_OPT(pg_collation);
+    Oid         stacoll5 BKI_LOOKUP_OPT(pg_collation);
+
+#ifdef CATALOG_VARLEN           /* variable-length fields start here */
+    float4      stanumbers1[1];
+    float4      stanumbers2[1];
+    float4      stanumbers3[1];
+    float4      stanumbers4[1];
+    float4      stanumbers5[1];
+
+    /*
+     * Values in these arrays are values of the column's data type, or of some
+     * related type such as an array element type.  We presently have to cheat
+     * quite a bit to allow polymorphic arrays of this kind, but perhaps
+     * someday it'll be a less bogus facility.
+     */
+    anyarray    stavalues1;
+    anyarray    stavalues2;
+    anyarray    stavalues3;
+    anyarray    stavalues4;
+    anyarray    stavalues5;
+#endif
+} FormData_pg_statistic;
+

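Whether viewed from the database command line or from the kernel's C code, the contents of the statistics are the same. Every attribute starts with sta. Among them, starelid identifies the table the column belongs to, staattnum is the column's number, stainherit says whether inheritance children are included, stanullfrac is the fraction of NULL entries, stawidth is the average width in bytes of non-null entries, and stadistinct is the (approximate) number of distinct non-null values.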
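
The sign convention of stadistinct, described in the catalog comment above, is easy to get wrong, so here is a minimal sketch of how a consumer could decode it. `distinct_count` is a hypothetical helper, and `reltuples` stands for the table's current row-count estimate.

```c
#include <assert.h>

/*
 * Turn a stadistinct value into an absolute distinct-value count.
 * Returns -1.0 when stadistinct is 0 ("unknown or not computed").
 */
double
distinct_count(double stadistinct, double reltuples)
{
    assert(reltuples >= 0.0);

    if (stadistinct > 0.0)
        return stadistinct;                 /* fixed number of values */
    if (stadistinct < 0.0)
        return -stadistinct * reltuples;    /* multiplier of the row count */
    return -1.0;                            /* unknown */
}
```

A unique column (stadistinct = -1) therefore always decodes to the current row count, even if the table has grown since the last ANALYZE.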

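Because the statistics that can be computed differ slightly from one data type to another, PostgreSQL reserves a number of slots for the remaining statistics. The kernel currently reserves five of them: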

#define STATISTIC_NUM_SLOTS  5
+

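Each particular kind of statistic can use one slot, and what goes into the slot is entirely up to that kind's definition. Each slot provides the following parts (N is the slot number, from 1 to 5): stakindN, an integer code identifying the kind of statistic; staopN, the OID of the associated operator; stacollN, the OID of the relevant collation; stanumbersN, a float4 array of statistical values; and stavaluesN, an anyarray of data values.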
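
Because any kind may sit in any slot, a reader of pg_statistic has to search the slots by kind code. The sketch below shows the idea on a hypothetical in-memory mirror of one row; `StatRow` and `find_slot` are made-up names, and only the kind, operator, and collation arrays are modelled.

```c
#include <assert.h>
#include <stddef.h>

#define STATISTIC_NUM_SLOTS         5
#define STATISTIC_KIND_MCV          1
#define STATISTIC_KIND_HISTOGRAM    2
#define STATISTIC_KIND_CORRELATION  3

/* Hypothetical mirror of the per-slot columns of one pg_statistic row. */
typedef struct StatRow
{
    short       stakind[STATISTIC_NUM_SLOTS];   /* 0 means "slot unused" */
    unsigned    staop[STATISTIC_NUM_SLOTS];
    unsigned    stacoll[STATISTIC_NUM_SLOTS];
} StatRow;

/* Return the index (0..4) of the first slot holding `kind`, or -1. */
int
find_slot(const StatRow *row, short kind)
{
    assert(row != NULL);

    for (int i = 0; i < STATISTIC_NUM_SLOTS; i++)
    {
        if (row->stakind[i] == kind)
            return i;
    }
    return -1;
}
```

A caller would look up, say, STATISTIC_KIND_HISTOGRAM and then read the stavalues array of the slot that was found, or treat -1 as "no histogram available".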

The kernel reserves kind codes 1 to 99 for PostgreSQL's core statistics; the remaining codes are allocated as described in the kernel comment:

/*
+ * The present allocation of "kind" codes is:
+ *
+ *  1-99:       reserved for assignment by the core PostgreSQL project
+ *              (values in this range will be documented in this file)
+ *  100-199:    reserved for assignment by the PostGIS project
+ *              (values to be documented in PostGIS documentation)
+ *  200-299:    reserved for assignment by the ESRI ST_Geometry project
+ *              (values to be documented in ESRI ST_Geometry documentation)
+ *  300-9999:   reserved for future public assignments
+ *
+ * For private use you may choose a "kind" code at random in the range
+ * 10000-30000.  However, for code that is to be widely disseminated it is
+ * better to obtain a publicly defined "kind" code by request from the
+ * PostgreSQL Global Development Group.
+ */
+

The kernel code currently defines 7 core kinds of statistics, numbered 1 to 7. Let's see how each of them uses the slots described above.

Most Common Values (MCV)

/*
+ * In a "most common values" slot, staop is the OID of the "=" operator
+ * used to decide whether values are the same or not, and stacoll is the
+ * collation used (same as column's collation).  stavalues contains
+ * the K most common non-null values appearing in the column, and stanumbers
+ * contains their frequencies (fractions of total row count).  The values
+ * shall be ordered in decreasing frequency.  Note that since the arrays are
+ * variable-size, K may be chosen by the statistics collector.  Values should
+ * not appear in MCV unless they have been observed to occur more than once;
+ * a unique column will have no MCV slot.
+ */
+#define STATISTIC_KIND_MCV  1
+

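For a column's most common values, staop holds the = operator used to decide whether a value equals a most common value. stavalues holds the K most common non-null values in the column, and stanumbers holds the frequency of each of these K values.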
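
As a rough illustration of what ends up in an MCV slot, the sketch below derives the most common values from a sample of integers by sorting and counting runs. This is not how compute_distinct_stats() or compute_scalar_stats() is written; `McvEntry` and `most_common_values` are hypothetical, and the sample is assumed to contain no NULLs.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct McvEntry
{
    int     value;
    double  freq;       /* fraction of all sampled rows */
} McvEntry;

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;

    return (x > y) - (x < y);
}

static int
cmp_freq_desc(const void *a, const void *b)
{
    double x = ((const McvEntry *) a)->freq;
    double y = ((const McvEntry *) b)->freq;

    return (x < y) - (x > y);
}

/*
 * Fill out[] with up to k most common values of sample[0..n-1], in
 * decreasing frequency.  Sorts sample[] in place.  Values seen only
 * once are never reported.  Returns the number of entries written.
 */
int
most_common_values(int *sample, int n, McvEntry *out, int k)
{
    McvEntry   *runs;
    int         nruns = 0, written;

    assert(n > 0 && k > 0);
    runs = malloc(sizeof(McvEntry) * (size_t) n);
    assert(runs != NULL);

    qsort(sample, (size_t) n, sizeof(int), cmp_int);
    for (int i = 0; i < n;)
    {
        int j = i;

        while (j < n && sample[j] == sample[i])
            j++;                        /* j - i = length of this run */
        if (j - i > 1)
        {
            runs[nruns].value = sample[i];
            runs[nruns].freq = (double) (j - i) / n;
            nruns++;
        }
        i = j;
    }
    qsort(runs, (size_t) nruns, sizeof(McvEntry), cmp_freq_desc);
    written = nruns < k ? nruns : k;
    for (int i = 0; i < written; i++)
        out[i] = runs[i];
    free(runs);
    return written;
}
```

In slot terms, the `value` fields would go into stavalues and the `freq` fields into stanumbers. A column whose sampled values are all unique yields zero entries, matching the rule that a unique column has no MCV slot.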

Histogram

/*
+ * A "histogram" slot describes the distribution of scalar data.  staop is
+ * the OID of the "<" operator that describes the sort ordering, and stacoll
+ * is the relevant collation.  (In theory more than one histogram could appear,
+ * if a datatype has more than one useful sort operator or we care about more
+ * than one collation.  Currently the collation will always be that of the
+ * underlying column.)  stavalues contains M (>=2) non-null values that
+ * divide the non-null column data values into M-1 bins of approximately equal
+ * population.  The first stavalues item is the MIN and the last is the MAX.
+ * stanumbers is not used and should be NULL.  IMPORTANT POINT: if an MCV
+ * slot is also provided, then the histogram describes the data distribution
+ * *after removing the values listed in MCV* (thus, it's a "compressed
+ * histogram" in the technical parlance).  This allows a more accurate
+ * representation of the distribution of a column with some very-common
+ * values.  In a column with only a few distinct values, it's possible that
+ * the MCV list describes the entire data population; in this case the
+ * histogram reduces to empty and should be omitted.
+ */
+#define STATISTIC_KIND_HISTOGRAM  2
+

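This is the data distribution histogram of a (scalar) column. staop holds the < operator that defines the sort order. stavalues contains M non-null values that split the column's non-null values into M - 1 bins of nearly equal population. If the column also has an MCV slot, the histogram excludes the values listed there, which gives a more accurate picture of the distribution.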
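
An equal-population ("equi-depth") histogram of this kind can be sketched in a few lines once the sample is sorted. The function below is an illustration only: `equi_depth_bounds` is a hypothetical name, the input is assumed to be sorted ascending with NULLs and MCV values already removed, and the real code distributes rounding error more carefully.

```c
#include <assert.h>

/*
 * Pick m bounds from sorted[0..n-1] so that they delimit m-1 bins of
 * roughly equal population.  bounds[0] is the minimum and bounds[m-1]
 * the maximum of the input.
 */
void
equi_depth_bounds(const int *sorted, int n, int *bounds, int m)
{
    assert(m >= 2 && n >= m);

    for (int i = 0; i < m; i++)
    {
        /* position of the i-th bound, spread evenly over 0..n-1 */
        long pos = (long) i * (n - 1) / (m - 1);

        bounds[i] = sorted[pos];
    }
}
```

For the 101 values 0 through 100 and M = 5, the bounds come out as 0, 25, 50, 75, 100, so each of the four bins covers about a quarter of the rows. Skewed data gives unevenly spaced bounds, which is exactly the information the planner wants.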

Correlation

/*
+ * A "correlation" slot describes the correlation between the physical order
+ * of table tuples and the ordering of data values of this column, as seen
+ * by the "<" operator identified by staop with the collation identified by
+ * stacoll.  (As with the histogram, more than one entry could theoretically
+ * appear.)  stavalues is not used and should be NULL.  stanumbers contains
+ * a single entry, the correlation coefficient between the sequence of data
+ * values and the sequence of their actual tuple positions.  The coefficient
+ * ranges from +1 to -1.
+ */
+#define STATISTIC_KIND_CORRELATION  3
+

stanumbers holds a single entry: the correlation coefficient between the data values and their actual tuple positions.
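
To give a feel for this number, the sketch below computes the coefficient for a simplified case: `rank_at[i]` is the rank (0 to n-1) of the value stored at physical position i, and there are no duplicate values, so the ranks form a permutation. Under that assumption both sequences have the same sum and sum of squares, and the Pearson formula reduces to the expression in the code. `position_correlation` is a hypothetical name.

```c
#include <assert.h>

/*
 * Correlation between physical position (0..n-1) and value rank.
 * rank_at[] must be a permutation of 0..n-1.
 */
double
position_correlation(const int *rank_at, int n)
{
    double xsum, x2sum, xysum = 0.0;

    assert(n >= 2);
    for (int i = 0; i < n; i++)
        xysum += (double) i * (double) rank_at[i];

    /* closed forms for the sum and sum of squares of 0..n-1 */
    xsum = (double) (n - 1) * n / 2.0;
    x2sum = (double) (n - 1) * n * (2.0 * n - 1.0) / 6.0;

    return (n * xysum - xsum * xsum) / (n * x2sum - xsum * xsum);
}
```

A table stored in ascending order of the column gives +1, one stored in descending order gives -1, and a randomly ordered one gives a value near 0. A coefficient close to +1 or -1 suggests that an index scan on the column will touch heap pages in nearly sequential order.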

Most Common Elements

/*
+ * A "most common elements" slot is similar to a "most common values" slot,
+ * except that it stores the most common non-null *elements* of the column
+ * values.  This is useful when the column datatype is an array or some other
+ * type with identifiable elements (for instance, tsvector).  staop contains
+ * the equality operator appropriate to the element type, and stacoll
+ * contains the collation to use with it.  stavalues contains
+ * the most common element values, and stanumbers their frequencies.  Unlike
+ * MCV slots, frequencies are measured as the fraction of non-null rows the
+ * element value appears in, not the frequency of all rows.  Also unlike
+ * MCV slots, the values are sorted into the element type's default order
+ * (to support binary search for a particular value).  Since this puts the
+ * minimum and maximum frequencies at unpredictable spots in stanumbers,
+ * there are two extra members of stanumbers, holding copies of the minimum
+ * and maximum frequencies.  Optionally, there can be a third extra member,
+ * which holds the frequency of null elements (expressed in the same terms:
+ * the fraction of non-null rows that contain at least one null element).  If
+ * this member is omitted, the column is presumed to contain no null elements.
+ *
+ * Note: in current usage for tsvector columns, the stavalues elements are of
+ * type text, even though their representation within tsvector is not
+ * exactly text.
+ */
+#define STATISTIC_KIND_MCELEM  4
+

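This is similar to MCV, but it stores the most common elements of the column's values and is mainly used for arrays and similar types. staop again holds the equality operator, here the one appropriate to the element type. Unlike MCV, the denominator of each frequency is the number of non-null rows rather than all rows. In addition, the common elements are sorted in the default order of the element type so that they can be binary-searched.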

Distinct Elements Count Histogram

/*
+ * A "distinct elements count histogram" slot describes the distribution of
+ * the number of distinct element values present in each row of an array-type
+ * column.  Only non-null rows are considered, and only non-null elements.
+ * staop contains the equality operator appropriate to the element type,
+ * and stacoll contains the collation to use with it.
+ * stavalues is not used and should be NULL.  The last member of stanumbers is
+ * the average count of distinct element values over all non-null rows.  The
+ * preceding M (>=2) members form a histogram that divides the population of
+ * distinct-elements counts into M-1 bins of approximately equal population.
+ * The first of these is the minimum observed count, and the last the maximum.
+ */
+#define STATISTIC_KIND_DECHIST  5
+

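This describes the distribution of the number of distinct elements found in each row of an array-type column. The first M entries of stanumbers are the boundaries that divide the per-row distinct-element counts into M - 1 bins of roughly equal population, followed by one more entry holding the average number of distinct elements over all non-null rows. This statistic is presumably used when computing selectivity.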

Length Histogram

/*
+ * A "length histogram" slot describes the distribution of range lengths in
+ * rows of a range-type column. stanumbers contains a single entry, the
+ * fraction of empty ranges. stavalues is a histogram of non-empty lengths, in
+ * a format similar to STATISTIC_KIND_HISTOGRAM: it contains M (>=2) range
+ * values that divide the column data values into M-1 bins of approximately
+ * equal population. The lengths are stored as float8s, as measured by the
+ * range type's subdiff function. Only non-null rows are considered.
+ */
+#define STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM  6
+

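The length histogram describes the distribution of range lengths in a range-type column. It is again a histogram of M values, stored in stavalues; stanumbers holds a single entry, the fraction of empty ranges.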

Bounds Histogram

/*
+ * A "bounds histogram" slot is similar to STATISTIC_KIND_HISTOGRAM, but for
+ * a range-type column.  stavalues contains M (>=2) range values that divide
+ * the column data values into M-1 bins of approximately equal population.
+ * Unlike a regular scalar histogram, this is actually two histograms combined
+ * into a single array, with the lower bounds of each value forming a
+ * histogram of lower bounds, and the upper bounds a histogram of upper
+ * bounds.  Only non-NULL, non-empty ranges are included.
+ */
+#define STATISTIC_KIND_BOUNDS_HISTOGRAM  7
+

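The bounds histogram is also used for range types and is similar to the data distribution histogram. stavalues holds M range values that divide the column's values into M - 1 bins of roughly equal population; the lower bounds form one histogram and the upper bounds form another. Only non-NULL, non-empty ranges are included.

Kernel Execution of ANALYZE

Now that we know what pg_statistic ultimately has to store, let's see how the kernel collects and computes it. We start in PostgreSQL's utility-command processing code. For utility commands such as ANALYZE, the switch statement in standard_ProcessUtility() routes each command to the function that implements it.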

/*
+ * standard_ProcessUtility itself deals only with utility commands for
+ * which we do not provide event trigger support.  Commands that do have
+ * such support are passed down to ProcessUtilitySlow, which contains the
+ * necessary infrastructure for such triggers.
+ *
+ * This division is not just for performance: it's critical that the
+ * event trigger code not be invoked when doing START TRANSACTION for
+ * example, because we might need to refresh the event trigger cache,
+ * which requires being in a valid transaction.
+ */
+void
+standard_ProcessUtility(PlannedStmt *pstmt,
+                        const char *queryString,
+                        bool readOnlyTree,
+                        ProcessUtilityContext context,
+                        ParamListInfo params,
+                        QueryEnvironment *queryEnv,
+                        DestReceiver *dest,
+                        QueryCompletion *qc)
+{
+    // ...
+
+    switch (nodeTag(parsetree))
+    {
+        // ...
+
+        case T_VacuumStmt:
+            ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
+            break;
+
+        // ...
+    }
+
+    // ...
+}
+

ANALYZE shares its entry point with VACUUM: both arrive as a T_VacuumStmt node and enter ExecVacuum().

/*
+ * Primary entry point for manual VACUUM and ANALYZE commands
+ *
+ * This is mainly a preparation wrapper for the real operations that will
+ * happen in vacuum().
+ */
+void
+ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
+{
+    // ...
+
+    /* Now go through the common routine */
+    vacuum(vacstmt->rels, &params, NULL, isTopLevel);
+}
+

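After parsing a long list of options, ExecVacuum() calls vacuum(). Here the kernel first works out which tables to analyze, because an ANALYZE command can target every table in the current database, a list of specific tables, or specific columns of a table.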

Once the set of tables is known, each one is passed to analyze_rel() in turn:

if (params->options & VACOPT_ANALYZE)
+{
+    // ...
+
+    analyze_rel(vrel->oid, vrel->relation, params,
+                vrel->va_cols, in_outer_xact, vac_strategy);
+
+    // ...
+}
+

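Inside analyze_rel(), the kernel takes a ShareUpdateExclusiveLock on the target table so that two ANALYZE runs cannot process it concurrently. It then decides how to proceed based on the kind of relation (for a foreign table, for example, it uses the ANALYZE support provided by the FDW routine). Finally, the table is passed to do_analyze_rel().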

/*
+ *  analyze_rel() -- analyze one relation
+ *
+ * relid identifies the relation to analyze.  If relation is supplied, use
+ * the name therein for reporting any failure to open/lock the rel; do not
+ * use it once we've successfully opened the rel, since it might be stale.
+ */
+void
+analyze_rel(Oid relid, RangeVar *relation,
+            VacuumParams *params, List *va_cols, bool in_outer_xact,
+            BufferAccessStrategy bstrategy)
+{
+    // ...
+
+    /*
+     * Do the normal non-recursive ANALYZE.  We can skip this for partitioned
+     * tables, which don't contain any rows.
+     */
+    if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+        do_analyze_rel(onerel, params, va_cols, acquirefunc,
+                       relpages, false, in_outer_xact, elevel);
+
+    // ...
+}
+

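Inside do_analyze_rel(), the kernel narrows down which columns to analyze: the user may have asked for only a few columns, since frequently accessed columns benefit most from fresh statistics. It also opens all indexes on the table to check whether they contain columns that can be analyzed.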

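To compute per-column statistics, the column data has to be read from disk. This raises a key question: how many rows should be scanned? In theory, analyzing as much data as possible, ideally all of it, gives the most accurate statistics. For a very large table, however, the data will not fit in memory and the time spent analyzing would be unacceptable. The user can therefore bound the number of rows sampled, which trades statistical accuracy against run-time cost: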

/*
+ * Determine how many rows we need to sample, using the worst case from
+ * all analyzable columns.  We use a lower bound of 100 rows to avoid
+ * possible overflow in Vitter's algorithm.  (Note: that will also be the
+ * target in the corner case where there are no analyzable columns.)
+ */
+targrows = 100;
+for (i = 0; i < attr_cnt; i++)
+{
+    if (targrows < vacattrstats[i]->minrows)
+        targrows = vacattrstats[i]->minrows;
+}
+for (ind = 0; ind < nindexes; ind++)
+{
+    AnlIndexData *thisdata = &indexdata[ind];
+
+    for (i = 0; i < thisdata->attr_cnt; i++)
+    {
+        if (targrows < thisdata->vacattrstats[i]->minrows)
+            targrows = thisdata->vacattrstats[i]->minrows;
+    }
+}
+
+/*
+ * Look at extended statistics objects too, as those may define custom
+ * statistics target. So we may need to sample more rows and then build
+ * the statistics with enough detail.
+ */
+minrows = ComputeExtStatisticsRows(onerel, attr_cnt, vacattrstats);
+
+if (targrows < minrows)
+    targrows = minrows;
+

With the sample size settled, the kernel allocates a tuple array of that length and starts sampling through the acquirefunc function pointer:

/*
+ * Acquire the sample rows
+ */
+rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
+pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
+                             inh ? PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS_INH :
+                             PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS);
+if (inh)
+    numrows = acquire_inherited_sample_rows(onerel, elevel,
+                                            rows, targrows,
+                                            &totalrows, &totaldeadrows);
+else
+    numrows = (*acquirefunc) (onerel, elevel,
+                              rows, targrows,
+                              &totalrows, &totaldeadrows);
+

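This function pointer is set up in analyze_rel() and points to acquire_sample_rows(). That function samples the table in two stages: stage one selects up to targrows random blocks (or all blocks, if there are fewer), and stage two scans those blocks and uses Vitter's algorithm to pick a random sample of up to targrows rows.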

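The two stages run simultaneously. When sampling finishes, the sampled tuples are in the tuple array. If the array was filled completely, it is sorted by tuple position with qsort() (a partially filled array is already in order). The sample is then used to estimate the numbers of live and dead tuples in the whole table: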

/*
+ * acquire_sample_rows -- acquire a random sample of rows from the table
+ *
+ * Selected rows are returned in the caller-allocated array rows[], which
+ * must have at least targrows entries.
+ * The actual number of rows selected is returned as the function result.
+ * We also estimate the total numbers of live and dead rows in the table,
+ * and return them into *totalrows and *totaldeadrows, respectively.
+ *
+ * The returned list of tuples is in order by physical position in the table.
+ * (We will rely on this later to derive correlation estimates.)
+ *
+ * As of May 2004 we use a new two-stage method:  Stage one selects up
+ * to targrows random blocks (or all blocks, if there aren't so many).
+ * Stage two scans these blocks and uses the Vitter algorithm to create
+ * a random sample of targrows rows (or less, if there are less in the
+ * sample of blocks).  The two stages are executed simultaneously: each
+ * block is processed as soon as stage one returns its number and while
+ * the rows are read stage two controls which ones are to be inserted
+ * into the sample.
+ *
+ * Although every row has an equal chance of ending up in the final
+ * sample, this sampling method is not perfect: not every possible
+ * sample has an equal chance of being selected.  For large relations
+ * the number of different blocks represented by the sample tends to be
+ * too small.  We can live with that for now.  Improvements are welcome.
+ *
+ * An important property of this sampling method is that because we do
+ * look at a statistically unbiased set of blocks, we should get
+ * unbiased estimates of the average numbers of live and dead rows per
+ * block.  The previous sampling method put too much credence in the row
+ * density near the start of the table.
+ */
+static int
+acquire_sample_rows(Relation onerel, int elevel,
+                    HeapTuple *rows, int targrows,
+                    double *totalrows, double *totaldeadrows)
+{
+    // ...
+
+    /* Outer loop over blocks to sample */
+    while (BlockSampler_HasMore(&bs))
+    {
+        bool        block_accepted;
+        BlockNumber targblock = BlockSampler_Next(&bs);
+        // ...
+    }
+
+    // ...
+
+    /*
+     * If we didn't find as many tuples as we wanted then we're done. No sort
+     * is needed, since they're already in order.
+     *
+     * Otherwise we need to sort the collected tuples by position
+     * (itempointer). It's not worth worrying about corner cases where the
+     * tuples are already sorted.
+     */
+    if (numrows == targrows)
+        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
+
+    /*
+     * Estimate total numbers of live and dead rows in relation, extrapolating
+     * on the assumption that the average tuple density in pages we didn't
+     * scan is the same as in the pages we did scan.  Since what we scanned is
+     * a random sample of the pages in the relation, this should be a good
+     * assumption.
+     */
+    if (bs.m > 0)
+    {
+        *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
+        *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+    }
+    else
+    {
+        *totalrows = 0.0;
+        *totaldeadrows = 0.0;
+    }
+
+    // ...
+}
+
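The row-selection stage and the final extrapolation can be sketched in a few lines. This is a simplified illustration, not the code in analyze.c: it uses the basic reservoir algorithm (Algorithm R) rather than Vitter's faster skip-based variant, it samples integer row ids instead of heap tuples, and all Demo names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reservoir over integer row ids. */
typedef struct
{
    int    *rows;       /* caller-allocated array with targrows slots */
    int     targrows;   /* capacity of the reservoir */
    int     numrows;    /* slots currently filled */
    long    seen;       /* rows offered so far */
} DemoReservoir;

void
demo_reservoir_init(DemoReservoir *r, int *rows, int targrows)
{
    assert(rows != NULL && targrows > 0);
    r->rows = rows;
    r->targrows = targrows;
    r->numrows = 0;
    r->seen = 0;
}

/* Offer one row; every row offered so far ends up kept with equal probability. */
void
demo_reservoir_offer(DemoReservoir *r, int row_id)
{
    long k;

    r->seen++;
    if (r->numrows < r->targrows)
    {
        /* The first targrows rows are copied in physical order. */
        r->rows[r->numrows++] = row_id;
        return;
    }

    /* k is uniform in [0, seen); the row replaces slot k iff k < targrows. */
    k = (long) (((double) rand() / ((double) RAND_MAX + 1.0)) * (double) r->seen);
    if (k < r->targrows)
        r->rows[k] = row_id;
}

/*
 * Scale a per-sample count up to the whole table, assuming the unscanned
 * blocks have the same average density as the scanned ones. The cast
 * rounds to the nearest integer for non-negative inputs.
 */
double
demo_extrapolate_rows(double counted, int blocks_scanned, double totalblocks)
{
    if (blocks_scanned <= 0)
        return 0.0;
    return (double) (long long) ((counted / blocks_scanned) * totalblocks + 0.5);
}
```

Once the reservoir is full, replacements land in random slots, so the array is no longer in physical order. That is why the sort above is needed only when numrows equals targrows.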

Back in do_analyze_rel(). With the sample in hand, the kernel computes statistics for each column to be analyzed and then updates the pg_statistic catalog:

/*
+ * Compute the statistics.  Temporary results during the calculations for
+ * each column are stored in a child context.  The calc routines are
+ * responsible to make sure that whatever they store into the VacAttrStats
+ * structure is allocated in anl_context.
+ */
+if (numrows > 0)
+{
+    // ...
+
+    for (i = 0; i < attr_cnt; i++)
+    {
+        VacAttrStats *stats = vacattrstats[i];
+        AttributeOpts *aopt;
+
+        stats->rows = rows;
+        stats->tupDesc = onerel->rd_att;
+        stats->compute_stats(stats,
+                             std_fetch_func,
+                             numrows,
+                             totalrows);
+
+        // ...
+    }
+
+    // ...
+
+    /*
+     * Emit the completed stats rows into pg_statistic, replacing any
+     * previous statistics for the target columns.  (If there are stats in
+     * pg_statistic for columns we didn't process, we leave them alone.)
+     */
+    update_attstats(RelationGetRelid(onerel), inh,
+                    attr_cnt, vacattrstats);
+
+    // ...
+}
+

Different column types clearly need different computations, so the compute_stats function pointer points to different functions. Let's look at where it is assigned:

/*
+ * std_typanalyze -- the default type-specific typanalyze function
+ */
+bool
+std_typanalyze(VacAttrStats *stats)
+{
+    // ...
+
+    /*
+     * Determine which standard statistics algorithm to use
+     */
+    if (OidIsValid(eqopr) && OidIsValid(ltopr))
+    {
+        /* Seems to be a scalar datatype */
+        stats->compute_stats = compute_scalar_stats;
+        /*--------------------
+         * The following choice of minrows is based on the paper
+         * "Random sampling for histogram construction: how much is enough?"
+         * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
+         * Proceedings of ACM SIGMOD International Conference on Management
+         * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
+         * says that for table size n, histogram size k, maximum relative
+         * error in bin size f, and error probability gamma, the minimum
+         * random sample size is
+         *      r = 4 * k * ln(2*n/gamma) / f^2
+         * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
+         *      r = 305.82 * k
+         * Note that because of the log function, the dependence on n is
+         * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
+         * bin size error with probability 0.99.  So there's no real need to
+         * scale for n, which is a good thing because we don't necessarily
+         * know it at this point.
+         *--------------------
+         */
+        stats->minrows = 300 * attr->attstattarget;
+    }
+    else if (OidIsValid(eqopr))
+    {
+        /* We can still recognize distinct values */
+        stats->compute_stats = compute_distinct_stats;
+        /* Might as well use the same minrows as above */
+        stats->minrows = 300 * attr->attstattarget;
+    }
+    else
+    {
+        /* Can't do much but the trivial stuff */
+        stats->compute_stats = compute_trivial_stats;
+        /* Might as well use the same minrows as above */
+        stats->minrows = 300 * attr->attstattarget;
+    }
+
+    // ...
+}
+
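The arithmetic behind the factor of 300 in the comment above can be checked by hand. The derivation below only substitutes the values quoted in the comment into the formula it cites; nothing else is assumed.

```latex
\begin{align*}
r &= \frac{4\,k\,\ln(2n/\gamma)}{f^{2}} \\[4pt]
% f = 0.5, \gamma = 0.01, n = 10^{6}
\ln\!\left(\frac{2 \cdot 10^{6}}{0.01}\right) &= \ln(2 \cdot 10^{8}) \approx 19.114
\quad\Longrightarrow\quad
r \approx \frac{4 \cdot 19.114}{0.25}\,k \approx 305.82\,k \\[4pt]
% Solving for f with r = 300k and n = 10^{12}
f &= \sqrt{\frac{4\,\ln(2 \cdot 10^{14})}{300}}
\approx \sqrt{\frac{4 \cdot 32.929}{300}} \approx 0.66
\end{align*}
```

Growing n by six orders of magnitude only moves the relative bin-size error from 0.5 to about 0.66, which is why minrows is not scaled by table size. With a statistics target of 100 (the default value of default_statistics_target), minrows comes out to 300 * 100 = 30000 rows.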

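The conditional can be read as follows. If the column's data type has both an equality operator and a less-than operator, it is treated as a scalar type and compute_scalar_stats() is used. If it has only an equality operator, compute_distinct_stats() is used. If it has no equality operator, only compute_trivial_stats() is possible.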

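We could walk through each of the three functions, but I will not dissect their logic here. They are based on rather old statistics papers, old enough that the letters in the PDFs are barely legible. The code itself is not very readable either, because it follows the formulas in those papers closely; without reading the papers, the variables and formulas are hard to make sense of.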

compute_trivial_stats

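If a column's data type supports neither the equality operator nor comparison operators, only simple analysis is possible, namely the fraction of non-null rows and the average datum width.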

Both are easy to obtain with a single loop over the sampled tuple array.

/*
+ *  compute_trivial_stats() -- compute very basic column statistics
+ *
+ *  We use this when we cannot find a hash "=" operator for the datatype.
+ *
+ *  We determine the fraction of non-null rows and the average datum width.
+ */
+static void
+compute_trivial_stats(VacAttrStatsP stats,
+                      AnalyzeAttrFetchFunc fetchfunc,
+                      int samplerows,
+                      double totalrows)
+{}
+
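The single pass described above might look like the following sketch. It works on a hypothetical DemoDatum array rather than heap tuples, and it reports the null fraction (the complement of the non-null fraction), since that is the form pg_statistic stores in stanullfrac.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sampled value: a null flag plus a width in bytes. */
typedef struct
{
    int     is_null;
    size_t  width;
} DemoDatum;

typedef struct
{
    double  null_frac;  /* fraction of sampled rows that are NULL */
    double  avg_width;  /* average width of the non-null values */
} DemoTrivialStats;

DemoTrivialStats
demo_trivial_stats(const DemoDatum *sample, size_t n)
{
    DemoTrivialStats out = {0.0, 0.0};
    size_t  null_cnt = 0;
    size_t  nonnull_cnt = 0;
    double  total_width = 0.0;
    size_t  i;

    assert(sample != NULL || n == 0);

    /* One pass: count NULLs and sum the widths of everything else. */
    for (i = 0; i < n; i++)
    {
        if (sample[i].is_null)
        {
            null_cnt++;
            continue;
        }
        nonnull_cnt++;
        total_width += (double) sample[i].width;
    }

    if (n > 0)
        out.null_frac = (double) null_cnt / (double) n;
    if (nonnull_cnt > 0)
        out.avg_width = total_width / (double) nonnull_cnt;
    return out;
}
```

No operator on the data type is needed for either figure, which is why this fallback works for any type.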

compute_distinct_stats

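If a column supports only the equality operator, we can tell what a value is but cannot order it against other values. The distribution of values across a range therefore cannot be analyzed; only their distribution by frequency can. The statistics computed by this function are the fraction of non-null rows, the average width, the most common values, and the estimated number of distinct values.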

/*
+ *  compute_distinct_stats() -- compute column statistics including ndistinct
+ *
+ *  We use this when we can find only an "=" operator for the datatype.
+ *
+ *  We determine the fraction of non-null rows, the average width, the
+ *  most common values, and the (estimated) number of distinct values.
+ *
+ *  The most common values are determined by brute force: we keep a list
+ *  of previously seen values, ordered by number of times seen, as we scan
+ *  the samples.  A newly seen value is inserted just after the last
+ *  multiply-seen value, causing the bottommost (oldest) singly-seen value
+ *  to drop off the list.  The accuracy of this method, and also its cost,
+ *  depend mainly on the length of the list we are willing to keep.
+ */
+static void
+compute_distinct_stats(VacAttrStatsP stats,
+                       AnalyzeAttrFetchFunc fetchfunc,
+                       int samplerows,
+                       double totalrows)
+{}
+

compute_scalar_stats

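If a column's data type supports both the equality operator and comparison operators, the most thorough analysis is possible. The targets are the fraction of non-null rows, the average width, the most common values, the estimated number of distinct values, the distribution histogram, and the correlation between physical and logical order.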

/*
+ *  compute_scalar_stats() -- compute column statistics
+ *
+ *  We use this when we can find "=" and "<" operators for the datatype.
+ *
+ *  We determine the fraction of non-null rows, the average width, the
+ *  most common values, the (estimated) number of distinct values, the
+ *  distribution histogram, and the correlation of physical to logical order.
+ *
+ *  The desired stats can be determined fairly easily after sorting the
+ *  data values into order.
+ */
+static void
+compute_scalar_stats(VacAttrStatsP stats,
+                     AnalyzeAttrFetchFunc fetchfunc,
+                     int samplerows,
+                     double totalrows)
+{}
+
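Once values can be sorted, one statistic that becomes available is the correlation between a row's physical position and the rank of its value. The sketch below is an illustration, not the analyze.c implementation: it assumes a sample of doubles numbered 0..n-1 in physical order, and DemoItem and demo_order_correlation are hypothetical names. Because both the positions and the ranks are permutations of 0..n-1, the sums of x and of x squared have closed forms.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sampled value tagged with its physical position. */
typedef struct
{
    double  value;
    int     tupno;
} DemoItem;

static int
demo_cmp_item(const void *a, const void *b)
{
    const DemoItem *x = a;
    const DemoItem *y = b;

    if (x->value < y->value)
        return -1;
    if (x->value > y->value)
        return 1;
    /* Break ties by physical position so the order is deterministic. */
    return (x->tupno > y->tupno) - (x->tupno < y->tupno);
}

/*
 * values[i] is the value found at physical position i of the sample.
 * Returns the correlation between sort rank and physical position,
 * in the range [-1, 1]. Requires n >= 2.
 */
double
demo_order_correlation(const double *values, int n)
{
    DemoItem *items;
    double  xysum = 0.0;
    double  dn = (double) n;
    double  xsum;
    double  x2sum;
    int     i;

    assert(values != NULL && n >= 2);

    items = malloc((size_t) n * sizeof(DemoItem));
    if (items == NULL)
        return 0.0;
    for (i = 0; i < n; i++)
    {
        items[i].value = values[i];
        items[i].tupno = i;
    }
    qsort(items, (size_t) n, sizeof(DemoItem), demo_cmp_item);

    /* After sorting, index i is the rank and tupno is the position. */
    for (i = 0; i < n; i++)
        xysum += (double) i * (double) items[i].tupno;
    free(items);

    /* Closed forms for sum(x) and sum(x^2) over 0..n-1. */
    xsum = (dn - 1.0) * dn / 2.0;
    x2sum = (dn - 1.0) * dn * (2.0 * dn - 1.0) / 6.0;

    return (dn * xysum - xsum * xsum) / (dn * x2sum - xsum * xsum);
}
```

A value near 1 means the rows are stored roughly in sorted order, and a value near -1 means roughly reverse order.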

Summary

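Starting from the statistics the PostgreSQL optimizer needs, this article walked through the overall execution flow of the ANALYZE command. For brevity, the walkthrough does not cover the various corner cases and their handling.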

References

`,80),m={href:"https://www.postgresql.org/docs/current/sql-analyze.html",target:"_blank",rel:"noopener noreferrer"},k={href:"https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-STATISTICS",target:"_blank",rel:"noopener noreferrer"},b={href:"https://www.postgresql.org/docs/current/planner-stats.html",target:"_blank",rel:"noopener noreferrer"},h={href:"https://www.postgresql.org/docs/current/catalog-pg-statistic.html",target:"_blank",rel:"noopener noreferrer"},g={href:"http://mysql.taobao.org/monthly/2016/05/09/",target:"_blank",rel:"noopener noreferrer"};function f(o,w){const i=t("ArticleInfo"),e=t("ExternalLinkIcon");return p(),c("div",null,[u,s(i,{frontmatter:o.$frontmatter},null,8,["frontmatter"]),v,n("p",null,[n("a",m,[a("PostgreSQL 14 Documentation: ANALYZE"),s(e)])]),n("p",null,[n("a",k,[a("PostgreSQL 14 Documentation: 25.1. Routine Vacuuming"),s(e)])]),n("p",null,[n("a",b,[a("PostgreSQL 14 Documentation: 14.2. Statistics Used by the Planner"),s(e)])]),n("p",null,[n("a",h,[a("PostgreSQL 14 Documentation: 52.49. 
pg_statistic"),s(e)])]),n("p",null,[n("a",g,[a("阿里云数据库内核月报 2016/05:PostgreSQL 特性分析 统计信息计算方法"),s(e)])])])}const _=l(d,[["render",f],["__file","analyze.html.vue"]]);export{_ as default}; diff --git a/assets/app-3d1677bf.js b/assets/app-3d1677bf.js new file mode 100644 index 00000000000..2a2b446dba6 --- /dev/null +++ b/assets/app-3d1677bf.js @@ -0,0 +1,10 @@ +const ql="modulepreload",Ul=function(e){return"/PolarDB-for-PostgreSQL/"+e},Jo={},m=function(t,n,r){if(!n||n.length===0)return t();const o=document.getElementsByTagName("link");return Promise.all(n.map(i=>{if(i=Ul(i),i in Jo)return;Jo[i]=!0;const s=i.endsWith(".css"),l=s?'[rel="stylesheet"]':"";if(!!r)for(let u=o.length-1;u>=0;u--){const f=o[u];if(f.href===i&&(!s||f.rel==="stylesheet"))return}else if(document.querySelector(`link[href="${i}"]${l}`))return;const c=document.createElement("link");if(c.rel=s?"stylesheet":ql,s||(c.as="script",c.crossOrigin=""),c.href=i,document.head.appendChild(c),s)return new Promise((u,f)=>{c.addEventListener("load",u),c.addEventListener("error",()=>f(new Error(`Unable to preload CSS for ${i}`)))})})).then(()=>t()).catch(i=>{const s=new Event("vite:preloadError",{cancelable:!0});if(s.payload=i,window.dispatchEvent(s),!s.defaultPrevented)throw i})};function ho(e,t){const n=Object.create(null),r=e.split(",");for(let o=0;o!!n[o.toLowerCase()]:o=>!!n[o]}const Te={},nn=[],it=()=>{},Kl=()=>!1,Wl=/^on[^a-z]/,Vn=e=>Wl.test(e),po=e=>e.startsWith("onUpdate:"),De=Object.assign,mo=(e,t)=>{const n=e.indexOf(t);n>-1&&e.splice(n,1)},Ql=Object.prototype.hasOwnProperty,de=(e,t)=>Ql.call(e,t),G=Array.isArray,rn=e=>yr(e)==="[object Map]",cs=e=>yr(e)==="[object Set]",se=e=>typeof e=="function",me=e=>typeof e=="string",vo=e=>typeof e=="symbol",Pe=e=>e!==null&&typeof e=="object",us=e=>Pe(e)&&se(e.then)&&se(e.catch),fs=Object.prototype.toString,yr=e=>fs.call(e),Yl=e=>yr(e).slice(8,-1),ds=e=>yr(e)==="[object 
Object]",_o=e=>me(e)&&e!=="NaN"&&e[0]!=="-"&&""+parseInt(e,10)===e,Pn=ho(",key,ref,ref_for,ref_key,onVnodeBeforeMount,onVnodeMounted,onVnodeBeforeUpdate,onVnodeUpdated,onVnodeBeforeUnmount,onVnodeUnmounted"),Er=e=>{const t=Object.create(null);return n=>t[n]||(t[n]=e(n))},Gl=/-(\w)/g,ft=Er(e=>e.replace(Gl,(t,n)=>n?n.toUpperCase():"")),Jl=/\B([A-Z])/g,Qt=Er(e=>e.replace(Jl,"-$1").toLowerCase()),Lr=Er(e=>e.charAt(0).toUpperCase()+e.slice(1)),kr=Er(e=>e?`on${Lr(e)}`:""),Dn=(e,t)=>!Object.is(e,t),zr=(e,t)=>{for(let n=0;n{Object.defineProperty(e,t,{configurable:!0,enumerable:!1,value:n})},Zl=e=>{const t=parseFloat(e);return isNaN(t)?e:t},Xl=e=>{const t=me(e)?Number(e):NaN;return isNaN(t)?e:t};let Zo;const Yr=()=>Zo||(Zo=typeof globalThis<"u"?globalThis:typeof self<"u"?self:typeof window<"u"?window:typeof global<"u"?global:{});function Mn(e){if(G(e)){const t={};for(let n=0;n{if(n){const r=n.split(ta);r.length>1&&(t[r[0].trim()]=r[1].trim())}}),t}function Ue(e){let t="";if(me(e))t=e;else if(G(e))for(let n=0;nme(e)?e:e==null?"":G(e)||Pe(e)&&(e.toString===fs||!se(e.toString))?JSON.stringify(e,ps,2):String(e),ps=(e,t)=>t&&t.__v_isRef?ps(e,t.value):rn(t)?{[`Map(${t.size})`]:[...t.entries()].reduce((n,[r,o])=>(n[`${r} =>`]=o,n),{})}:cs(t)?{[`Set(${t.size})`]:[...t.values()]}:Pe(t)&&!G(t)&&!ds(t)?String(t):t;let Qe;class sa{constructor(t=!1){this.detached=t,this._active=!0,this.effects=[],this.cleanups=[],this.parent=Qe,!t&&Qe&&(this.index=(Qe.scopes||(Qe.scopes=[])).push(this)-1)}get active(){return this._active}run(t){if(this._active){const n=Qe;try{return Qe=this,t()}finally{Qe=n}}}on(){Qe=this}off(){Qe=this.parent}stop(t){if(this._active){let n,r;for(n=0,r=this.effects.length;n{const t=new Set(e);return t.w=0,t.n=0,t},vs=e=>(e.w&Ct)>0,_s=e=>(e.n&Ct)>0,ca=({deps:e})=>{if(e.length)for(let t=0;t{const{deps:t}=e;if(t.length){let n=0;for(let r=0;r{(u==="length"||u>=a)&&l.push(c)})}else switch(n!==void 
0&&l.push(s.get(n)),t){case"add":G(e)?_o(n)&&l.push(s.get("length")):(l.push(s.get(qt)),rn(e)&&l.push(s.get(Jr)));break;case"delete":G(e)||(l.push(s.get(qt)),rn(e)&&l.push(s.get(Jr)));break;case"set":rn(e)&&l.push(s.get(qt));break}if(l.length===1)l[0]&&Zr(l[0]);else{const a=[];for(const c of l)c&&a.push(...c);Zr(go(a))}}function Zr(e,t){const n=G(e)?e:[...e];for(const r of n)r.computed&&ei(r);for(const r of n)r.computed||ei(r)}function ei(e,t){(e!==rt||e.allowRecurse)&&(e.scheduler?e.scheduler():e.run())}function fa(e,t){var n;return(n=ar.get(e))==null?void 0:n.get(t)}const da=ho("__proto__,__v_isRef,__isVue"),ys=new Set(Object.getOwnPropertyNames(Symbol).filter(e=>e!=="arguments"&&e!=="caller").map(e=>Symbol[e]).filter(vo)),ha=yo(),pa=yo(!1,!0),ma=yo(!0),ti=va();function va(){const e={};return["includes","indexOf","lastIndexOf"].forEach(t=>{e[t]=function(...n){const r=pe(this);for(let i=0,s=this.length;i{e[t]=function(...n){mn();const r=pe(this)[t].apply(this,n);return vn(),r}}),e}function _a(e){const t=pe(this);return Ke(t,"has",e),t.hasOwnProperty(e)}function yo(e=!1,t=!1){return function(r,o,i){if(o==="__v_isReactive")return!e;if(o==="__v_isReadonly")return e;if(o==="__v_isShallow")return t;if(o==="__v_raw"&&i===(e?t?Sa:As:t?Ps:Ts).get(r))return r;const s=G(r);if(!e){if(s&&de(ti,o))return Reflect.get(ti,o,i);if(o==="hasOwnProperty")return _a}const l=Reflect.get(r,o,i);return(vo(o)?ys.has(o):da(o))||(e||Ke(r,"get",o),t)?l:ke(l)?s&&_o(o)?l:l.value:Pe(l)?e?Hn(l):Nn(l):l}}const ga=Es(),ba=Es(!0);function Es(e=!1){return function(n,r,o,i){let s=n[r];if(an(s)&&ke(s)&&!ke(o))return!1;if(!e&&(!cr(o)&&!an(o)&&(s=pe(s),o=pe(o)),!G(n)&&ke(s)&&!ke(o)))return s.value=o,!0;const l=G(n)&&_o(r)?Number(r)e,Tr=e=>Reflect.getPrototypeOf(e);function Kn(e,t,n=!1,r=!1){e=e.__v_raw;const o=pe(e),i=pe(t);n||(t!==i&&Ke(o,"get",t),Ke(o,"get",i));const{has:s}=Tr(o),l=r?Eo:n?Po:In;if(s.call(o,t))return l(e.get(t));if(s.call(o,i))return l(e.get(i));e!==o&&e.get(t)}function Wn(e,t=!1){const 
n=this.__v_raw,r=pe(n),o=pe(e);return t||(e!==o&&Ke(r,"has",e),Ke(r,"has",o)),e===o?n.has(e):n.has(e)||n.has(o)}function Qn(e,t=!1){return e=e.__v_raw,!t&&Ke(pe(e),"iterate",qt),Reflect.get(e,"size",e)}function ni(e){e=pe(e);const t=pe(this);return Tr(t).has.call(t,e)||(t.add(e),gt(t,"add",e,e)),this}function ri(e,t){t=pe(t);const n=pe(this),{has:r,get:o}=Tr(n);let i=r.call(n,e);i||(e=pe(e),i=r.call(n,e));const s=o.call(n,e);return n.set(e,t),i?Dn(t,s)&>(n,"set",e,t):gt(n,"add",e,t),this}function oi(e){const t=pe(this),{has:n,get:r}=Tr(t);let o=n.call(t,e);o||(e=pe(e),o=n.call(t,e)),r&&r.call(t,e);const i=t.delete(e);return o&>(t,"delete",e,void 0),i}function ii(){const e=pe(this),t=e.size!==0,n=e.clear();return t&>(e,"clear",void 0,void 0),n}function Yn(e,t){return function(r,o){const i=this,s=i.__v_raw,l=pe(s),a=t?Eo:e?Po:In;return!e&&Ke(l,"iterate",qt),s.forEach((c,u)=>r.call(o,a(c),a(u),i))}}function Gn(e,t,n){return function(...r){const o=this.__v_raw,i=pe(o),s=rn(i),l=e==="entries"||e===Symbol.iterator&&s,a=e==="keys"&&s,c=o[e](...r),u=n?Eo:t?Po:In;return!t&&Ke(i,"iterate",a?Jr:qt),{next(){const{value:f,done:h}=c.next();return h?{value:f,done:h}:{value:l?[u(f[0]),u(f[1])]:u(f),done:h}},[Symbol.iterator](){return this}}}}function Pt(e){return function(...t){return e==="delete"?!1:this}}function Aa(){const e={get(i){return Kn(this,i)},get size(){return Qn(this)},has:Wn,add:ni,set:ri,delete:oi,clear:ii,forEach:Yn(!1,!1)},t={get(i){return Kn(this,i,!1,!0)},get size(){return Qn(this)},has:Wn,add:ni,set:ri,delete:oi,clear:ii,forEach:Yn(!1,!0)},n={get(i){return Kn(this,i,!0)},get size(){return Qn(this,!0)},has(i){return Wn.call(this,i,!0)},add:Pt("add"),set:Pt("set"),delete:Pt("delete"),clear:Pt("clear"),forEach:Yn(!0,!1)},r={get(i){return Kn(this,i,!0,!0)},get size(){return Qn(this,!0)},has(i){return 
Wn.call(this,i,!0)},add:Pt("add"),set:Pt("set"),delete:Pt("delete"),clear:Pt("clear"),forEach:Yn(!0,!0)};return["keys","values","entries",Symbol.iterator].forEach(i=>{e[i]=Gn(i,!1,!1),n[i]=Gn(i,!0,!1),t[i]=Gn(i,!1,!0),r[i]=Gn(i,!0,!0)}),[e,n,t,r]}const[wa,Ra,xa,Oa]=Aa();function Lo(e,t){const n=t?e?Oa:xa:e?Ra:wa;return(r,o,i)=>o==="__v_isReactive"?!e:o==="__v_isReadonly"?e:o==="__v_raw"?r:Reflect.get(de(n,o)&&o in r?n:r,o,i)}const Da={get:Lo(!1,!1)},Ia={get:Lo(!1,!0)},Ca={get:Lo(!0,!1)},Ts=new WeakMap,Ps=new WeakMap,As=new WeakMap,Sa=new WeakMap;function ka(e){switch(e){case"Object":case"Array":return 1;case"Map":case"Set":case"WeakMap":case"WeakSet":return 2;default:return 0}}function za(e){return e.__v_skip||!Object.isExtensible(e)?0:ka(Yl(e))}function Nn(e){return an(e)?e:To(e,!1,Ls,Da,Ts)}function ws(e){return To(e,!1,Pa,Ia,Ps)}function Hn(e){return To(e,!0,Ta,Ca,As)}function To(e,t,n,r,o){if(!Pe(e)||e.__v_raw&&!(t&&e.__v_isReactive))return e;const i=o.get(e);if(i)return i;const s=za(e);if(s===0)return e;const l=new Proxy(e,s===2?r:n);return o.set(e,l),l}function on(e){return an(e)?on(e.__v_raw):!!(e&&e.__v_isReactive)}function an(e){return!!(e&&e.__v_isReadonly)}function cr(e){return!!(e&&e.__v_isShallow)}function Rs(e){return on(e)||an(e)}function pe(e){const t=e&&e.__v_raw;return t?pe(t):e}function xs(e){return lr(e,"__v_skip",!0),e}const In=e=>Pe(e)?Nn(e):e,Po=e=>Pe(e)?Hn(e):e;function Ao(e){Dt&&rt&&(e=pe(e),bs(e.dep||(e.dep=go())))}function wo(e,t){e=pe(e);const n=e.dep;n&&Zr(n)}function ke(e){return!!(e&&e.__v_isRef===!0)}function Le(e){return Os(e,!1)}function Ro(e){return Os(e,!0)}function Os(e,t){return ke(e)?e:new $a(e,t)}class $a{constructor(t,n){this.__v_isShallow=n,this.dep=void 0,this.__v_isRef=!0,this._rawValue=n?t:pe(t),this._value=n?t:In(t)}get value(){return Ao(this),this._value}set value(t){const n=this.__v_isShallow||cr(t)||an(t);t=n?t:pe(t),Dn(t,this._rawValue)&&(this._rawValue=t,this._value=n?t:In(t),wo(this))}}function ee(e){return 
ke(e)?e.value:e}const Va={get:(e,t,n)=>ee(Reflect.get(e,t,n)),set:(e,t,n,r)=>{const o=e[t];return ke(o)&&!ke(n)?(o.value=n,!0):Reflect.set(e,t,n,r)}};function Ds(e){return on(e)?e:new Proxy(e,Va)}class Ma{constructor(t){this.dep=void 0,this.__v_isRef=!0;const{get:n,set:r}=t(()=>Ao(this),()=>wo(this));this._get=n,this._set=r}get value(){return this._get()}set value(t){this._set(t)}}function Na(e){return new Ma(e)}function xo(e){const t=G(e)?new Array(e.length):{};for(const n in e)t[n]=Ba(e,n);return t}class Ha{constructor(t,n,r){this._object=t,this._key=n,this._defaultValue=r,this.__v_isRef=!0}get value(){const t=this._object[this._key];return t===void 0?this._defaultValue:t}set value(t){this._object[this._key]=t}get dep(){return fa(pe(this._object),this._key)}}function Ba(e,t,n){const r=e[t];return ke(r)?r:new Ha(e,t,n)}class Fa{constructor(t,n,r,o){this._setter=n,this.dep=void 0,this.__v_isRef=!0,this.__v_isReadonly=!1,this._dirty=!0,this.effect=new bo(t,()=>{this._dirty||(this._dirty=!0,wo(this))}),this.effect.computed=this,this.effect.active=this._cacheable=!o,this.__v_isReadonly=r}get value(){const t=pe(this);return Ao(t),(t._dirty||!t._cacheable)&&(t._dirty=!1,t._value=t.effect.run()),t._value}set value(t){this._setter(t)}}function ja(e,t,n=!1){let r,o;const i=se(e);return i?(r=e,o=it):(r=e.get,o=e.set),new Fa(r,o,i||!o,n)}function It(e,t,n,r){let o;try{o=r?e(...r):e()}catch(i){Bn(i,t,n)}return o}function Xe(e,t,n,r){if(se(e)){const i=It(e,t,n,r);return i&&us(i)&&i.catch(s=>{Bn(s,t,n)}),i}const o=[];for(let i=0;i>>1;Sn(He[r])ut&&He.splice(t,1)}function Wa(e){G(e)?sn.push(...e):(!mt||!mt.includes(e,e.allowRecurse?Ht+1:Ht))&&sn.push(e),Cs()}function si(e,t=Cn?ut+1:0){for(;tSn(n)-Sn(r)),Ht=0;Hte.id==null?1/0:e.id,Qa=(e,t)=>{const n=Sn(e)-Sn(t);if(n===0){if(e.pre&&!t.pre)return-1;if(t.pre&&!e.pre)return 1}return n};function Ss(e){Xr=!1,Cn=!0,He.sort(Qa);const t=it;try{for(ut=0;utme(_)?_.trim():_)),f&&(o=n.map(Zl))}let 
l,a=r[l=kr(t)]||r[l=kr(ft(t))];!a&&i&&(a=r[l=kr(Qt(t))]),a&&Xe(a,e,6,o);const c=r[l+"Once"];if(c){if(!e.emitted)e.emitted={};else if(e.emitted[l])return;e.emitted[l]=!0,Xe(c,e,6,o)}}function ks(e,t,n=!1){const r=t.emitsCache,o=r.get(e);if(o!==void 0)return o;const i=e.emits;let s={},l=!1;if(!se(e)){const a=c=>{const u=ks(c,t,!0);u&&(l=!0,De(s,u))};!n&&t.mixins.length&&t.mixins.forEach(a),e.extends&&a(e.extends),e.mixins&&e.mixins.forEach(a)}return!i&&!l?(Pe(e)&&r.set(e,null),null):(G(i)?i.forEach(a=>s[a]=null):De(s,i),Pe(e)&&r.set(e,s),s)}function wr(e,t){return!e||!Vn(t)?!1:(t=t.slice(2).replace(/Once$/,""),de(e,t[0].toLowerCase()+t.slice(1))||de(e,Qt(t))||de(e,t))}let Ve=null,zs=null;function fr(e){const t=Ve;return Ve=e,zs=e&&e.type.__scopeId||null,t}function $e(e,t=Ve,n){if(!t||e._n)return e;const r=(...o)=>{r._d&&gi(-1);const i=fr(t);let s;try{s=e(...o)}finally{fr(i),r._d&&gi(1)}return s};return r._n=!0,r._c=!0,r._d=!0,r}function $r(e){const{type:t,vnode:n,proxy:r,withProxy:o,props:i,propsOptions:[s],slots:l,attrs:a,emit:c,render:u,renderCache:f,data:h,setupState:_,ctx:E,inheritAttrs:T}=e;let R,g;const y=fr(e);try{if(n.shapeFlag&4){const I=o||r;R=nt(u.call(I,I,f,i,_,h,E)),g=a}else{const I=t;R=nt(I.length>1?I(i,{attrs:a,slots:l,emit:c}):I(i,null)),g=t.props?a:Ga(a)}}catch(I){Rn.length=0,Bn(I,e,1),R=ne(Ye)}let C=R;if(g&&T!==!1){const I=Object.keys(g),{shapeFlag:W}=C;I.length&&W&7&&(s&&I.some(po)&&(g=Ja(g,s)),C=kt(C,g))}return n.dirs&&(C=kt(C),C.dirs=C.dirs?C.dirs.concat(n.dirs):n.dirs),n.transition&&(C.transition=n.transition),R=C,fr(y),R}const Ga=e=>{let t;for(const n in e)(n==="class"||n==="style"||Vn(n))&&((t||(t={}))[n]=e[n]);return t},Ja=(e,t)=>{const n={};for(const r in e)(!po(r)||!(r.slice(9)in t))&&(n[r]=e[r]);return n};function Za(e,t,n){const{props:r,children:o,component:i}=e,{props:s,children:l,patchFlag:a}=t,c=i.emitsOptions;if(t.dirs||t.transition)return!0;if(n&&a>=0){if(a&1024)return!0;if(a&16)return r?li(r,s,c):!!s;if(a&8){const 
u=t.dynamicProps;for(let f=0;fe.__isSuspense;function $s(e,t){t&&t.pendingBranch?G(e)?t.effects.push(...e):t.effects.push(e):Wa(e)}function Vs(e,t){return Do(e,null,t)}const Jn={};function et(e,t,n){return Do(e,t,n)}function Do(e,t,{immediate:n,deep:r,flush:o,onTrack:i,onTrigger:s}=Te){var l;const a=ms()===((l=Ie)==null?void 0:l.scope)?Ie:null;let c,u=!1,f=!1;if(ke(e)?(c=()=>e.value,u=cr(e)):on(e)?(c=()=>e,r=!0):G(e)?(f=!0,u=e.some(I=>on(I)||cr(I)),c=()=>e.map(I=>{if(ke(I))return I.value;if(on(I))return jt(I);if(se(I))return It(I,a,2)})):se(e)?t?c=()=>It(e,a,2):c=()=>{if(!(a&&a.isUnmounted))return h&&h(),Xe(e,a,3,[_])}:c=it,t&&r){const I=c;c=()=>jt(I())}let h,_=I=>{h=y.onStop=()=>{It(I,a,4)}},E;if(fn)if(_=it,t?n&&Xe(t,a,3,[c(),f?[]:void 0,_]):c(),o==="sync"){const I=Yc();E=I.__watcherHandles||(I.__watcherHandles=[])}else return it;let T=f?new Array(e.length).fill(Jn):Jn;const R=()=>{if(y.active)if(t){const I=y.run();(r||u||(f?I.some((W,te)=>Dn(W,T[te])):Dn(I,T)))&&(h&&h(),Xe(t,a,3,[I,T===Jn?void 0:f&&T[0]===Jn?[]:T,_]),T=I)}else y.run()};R.allowRecurse=!!t;let g;o==="sync"?g=R:o==="post"?g=()=>qe(R,a&&a.suspense):(R.pre=!0,a&&(R.id=a.uid),g=()=>Ar(R));const y=new bo(c,g);t?n?R():T=y.run():o==="post"?qe(y.run.bind(y),a&&a.suspense):y.run();const C=()=>{y.stop(),a&&a.scope&&mo(a.scope.effects,y)};return E&&E.push(C),C}function tc(e,t,n){const r=this.proxy,o=me(e)?e.includes(".")?Ms(r,e):()=>r[e]:e.bind(r,r);let i;se(t)?i=t:(i=t.handler,n=t);const s=Ie;un(this);const l=Do(o,i.bind(r),n);return s?un(s):Kt(),l}function Ms(e,t){const n=t.split(".");return()=>{let r=e;for(let o=0;o{jt(n,t)});else if(ds(e))for(const n in e)jt(e[n],t);return e}function dr(e,t){const n=Ve;if(n===null)return e;const r=Ir(n)||n.proxy,o=e.dirs||(e.dirs=[]);for(let i=0;i{e.isMounted=!0}),xr(()=>{e.isUnmounting=!0}),e}const 
Je=[Function,Array],Ns={mode:String,appear:Boolean,persisted:Boolean,onBeforeEnter:Je,onEnter:Je,onAfterEnter:Je,onEnterCancelled:Je,onBeforeLeave:Je,onLeave:Je,onAfterLeave:Je,onLeaveCancelled:Je,onBeforeAppear:Je,onAppear:Je,onAfterAppear:Je,onAppearCancelled:Je},rc={name:"BaseTransition",props:Ns,setup(e,{slots:t}){const n=tl(),r=nc();let o;return()=>{const i=t.default&&Bs(t.default(),!0);if(!i||!i.length)return;let s=i[0];if(i.length>1){for(const T of i)if(T.type!==Ye){s=T;break}}const l=pe(e),{mode:a}=l;if(r.isLeaving)return Vr(s);const c=ai(s);if(!c)return Vr(s);const u=eo(c,l,r,n);to(c,u);const f=n.subTree,h=f&&ai(f);let _=!1;const{getTransitionKey:E}=c.type;if(E){const T=E();o===void 0?o=T:T!==o&&(o=T,_=!0)}if(h&&h.type!==Ye&&(!Bt(c,h)||_)){const T=eo(h,l,r,n);if(to(h,T),a==="out-in")return r.isLeaving=!0,T.afterLeave=()=>{r.isLeaving=!1,n.update.active!==!1&&n.update()},Vr(s);a==="in-out"&&c.type!==Ye&&(T.delayLeave=(R,g,y)=>{const C=Hs(r,h);C[String(h.key)]=h,R._leaveCb=()=>{g(),R._leaveCb=void 0,delete u.delayedLeave},u.delayedLeave=y})}return s}}},oc=rc;function Hs(e,t){const{leavingVNodes:n}=e;let r=n.get(t.type);return r||(r=Object.create(null),n.set(t.type,r)),r}function eo(e,t,n,r){const{appear:o,mode:i,persisted:s=!1,onBeforeEnter:l,onEnter:a,onAfterEnter:c,onEnterCancelled:u,onBeforeLeave:f,onLeave:h,onAfterLeave:_,onLeaveCancelled:E,onBeforeAppear:T,onAppear:R,onAfterAppear:g,onAppearCancelled:y}=t,C=String(e.key),I=Hs(n,e),W=(v,j)=>{v&&Xe(v,r,9,j)},te=(v,j)=>{const N=j[1];W(v,j),G(v)?v.every(Y=>Y.length<=1)&&N():v.length<=1&&N()},M={mode:i,persisted:s,beforeEnter(v){let j=l;if(!n.isMounted)if(o)j=T||l;else return;v._leaveCb&&v._leaveCb(!0);const N=I[C];N&&Bt(e,N)&&N.el._leaveCb&&N.el._leaveCb(),W(j,[v])},enter(v){let j=a,N=c,Y=u;if(!n.isMounted)if(o)j=R||a,N=g||c,Y=y||u;else return;let w=!1;const k=v._enterCb=z=>{w||(w=!0,z?W(Y,[v]):W(N,[v]),M.delayedLeave&&M.delayedLeave(),v._enterCb=void 0)};j?te(j,[v,k]):k()},leave(v,j){const 
N=String(e.key);if(v._enterCb&&v._enterCb(!0),n.isUnmounting)return j();W(f,[v]);let Y=!1;const w=v._leaveCb=k=>{Y||(Y=!0,j(),k?W(E,[v]):W(_,[v]),v._leaveCb=void 0,I[N]===e&&delete I[N])};I[N]=e,h?te(h,[v,w]):w()},clone(v){return eo(v,t,n,r)}};return M}function Vr(e){if(Fn(e))return e=kt(e),e.children=null,e}function ai(e){return Fn(e)?e.children?e.children[0]:void 0:e}function to(e,t){e.shapeFlag&6&&e.component?to(e.component.subTree,t):e.shapeFlag&128?(e.ssContent.transition=t.clone(e.ssContent),e.ssFallback.transition=t.clone(e.ssFallback)):e.transition=t}function Bs(e,t=!1,n){let r=[],o=0;for(let i=0;i1)for(let i=0;iDe({name:e.name},t,{setup:e}))():e}const ln=e=>!!e.type.__asyncLoader;function O(e){se(e)&&(e={loader:e});const{loader:t,loadingComponent:n,errorComponent:r,delay:o=200,timeout:i,suspensible:s=!0,onError:l}=e;let a=null,c,u=0;const f=()=>(u++,a=null,h()),h=()=>{let _;return a||(_=a=t().catch(E=>{if(E=E instanceof Error?E:new Error(String(E)),l)return new Promise((T,R)=>{l(E,()=>T(f()),()=>R(E),u+1)});throw E}).then(E=>_!==a&&a?a:(E&&(E.__esModule||E[Symbol.toStringTag]==="Module")&&(E=E.default),c=E,E)))};return ue({name:"AsyncComponentWrapper",__asyncLoader:h,get __asyncResolved(){return c},setup(){const _=Ie;if(c)return()=>Mr(c,_);const E=y=>{a=null,Bn(y,_,13,!r)};if(s&&_.suspense||fn)return h().then(y=>()=>Mr(y,_)).catch(y=>(E(y),()=>r?ne(r,{error:y}):null));const T=Le(!1),R=Le(),g=Le(!!o);return o&&setTimeout(()=>{g.value=!1},o),i!=null&&setTimeout(()=>{if(!T.value&&!R.value){const y=new Error(`Async component timed out after ${i}ms.`);E(y),R.value=y}},i),h().then(()=>{T.value=!0,_.parent&&Fn(_.parent.vnode)&&Ar(_.parent.update)}).catch(y=>{E(y),R.value=y}),()=>{if(T.value&&c)return Mr(c,_);if(R.value&&r)return ne(r,{error:R.value});if(n&&!g.value)return ne(n)}}})}function Mr(e,t){const{ref:n,props:r,children:o,ce:i}=t.vnode,s=ne(e,r,o);return s.ref=n,s.ce=i,delete t.vnode.ce,s}const Fn=e=>e.type.__isKeepAlive;function 
ic(e,t){Fs(e,"a",t)}function sc(e,t){Fs(e,"da",t)}function Fs(e,t,n=Ie){const r=e.__wdc||(e.__wdc=()=>{let o=n;for(;o;){if(o.isDeactivated)return;o=o.parent}return e()});if(Rr(t,r,n),n){let o=n.parent;for(;o&&o.parent;)Fn(o.parent.vnode)&&lc(r,t,n,o),o=o.parent}}function lc(e,t,n,r){const o=Rr(t,e,r,!0);Or(()=>{mo(r[t],o)},n)}function Rr(e,t,n=Ie,r=!1){if(n){const o=n[e]||(n[e]=[]),i=t.__weh||(t.__weh=(...s)=>{if(n.isUnmounted)return;mn(),un(n);const l=Xe(t,n,e,s);return Kt(),vn(),l});return r?o.unshift(i):o.push(i),i}}const yt=e=>(t,n=Ie)=>(!fn||e==="sp")&&Rr(e,(...r)=>t(...r),n),ac=yt("bm"),Ge=yt("m"),cc=yt("bu"),uc=yt("u"),xr=yt("bum"),Or=yt("um"),fc=yt("sp"),dc=yt("rtg"),hc=yt("rtc");function pc(e,t=Ie){Rr("ec",e,t)}const js="components";function bt(e,t){return vc(js,e,!0,t)||e}const mc=Symbol.for("v-ndc");function vc(e,t,n=!0,r=!1){const o=Ve||Ie;if(o){const i=o.type;if(e===js){const l=Kc(i,!1);if(l&&(l===t||l===ft(t)||l===Lr(ft(t))))return i}const s=ci(o[e]||i[e],t)||ci(o.appContext[e],t);return!s&&r?i:s}}function ci(e,t){return e&&(e[t]||e[ft(t)]||e[Lr(ft(t))])}function St(e,t,n,r){let o;const i=n&&n[r];if(G(e)||me(e)){o=new Array(e.length);for(let s=0,l=e.length;st(s,l,void 0,i&&i[l]));else{const s=Object.keys(e);o=new Array(s.length);for(let l=0,a=s.length;lvr(t)?!(t.type===Ye||t.type===Ee&&!qs(t.children)):!0)?e:null}const no=e=>e?nl(e)?Ir(e)||e.proxy:no(e.parent):null,An=De(Object.create(null),{$:e=>e,$el:e=>e.vnode.el,$data:e=>e.data,$props:e=>e.props,$attrs:e=>e.attrs,$slots:e=>e.slots,$refs:e=>e.refs,$parent:e=>no(e.parent),$root:e=>no(e.root),$emit:e=>e.emit,$options:e=>Io(e),$forceUpdate:e=>e.f||(e.f=()=>Ar(e.update)),$nextTick:e=>e.n||(e.n=Pr.bind(e.proxy)),$watch:e=>tc.bind(e)}),Nr=(e,t)=>e!==Te&&!e.__isScriptSetup&&de(e,t),_c={get({_:e},t){const{ctx:n,setupState:r,data:o,props:i,accessCache:s,type:l,appContext:a}=e;let c;if(t[0]!=="$"){const _=s[t];if(_!==void 0)switch(_){case 1:return r[t];case 2:return o[t];case 4:return n[t];case 3:return 
i[t]}else{if(Nr(r,t))return s[t]=1,r[t];if(o!==Te&&de(o,t))return s[t]=2,o[t];if((c=e.propsOptions[0])&&de(c,t))return s[t]=3,i[t];if(n!==Te&&de(n,t))return s[t]=4,n[t];ro&&(s[t]=0)}}const u=An[t];let f,h;if(u)return t==="$attrs"&&Ke(e,"get",t),u(e);if((f=l.__cssModules)&&(f=f[t]))return f;if(n!==Te&&de(n,t))return s[t]=4,n[t];if(h=a.config.globalProperties,de(h,t))return h[t]},set({_:e},t,n){const{data:r,setupState:o,ctx:i}=e;return Nr(o,t)?(o[t]=n,!0):r!==Te&&de(r,t)?(r[t]=n,!0):de(e.props,t)||t[0]==="$"&&t.slice(1)in e?!1:(i[t]=n,!0)},has({_:{data:e,setupState:t,accessCache:n,ctx:r,appContext:o,propsOptions:i}},s){let l;return!!n[s]||e!==Te&&de(e,s)||Nr(t,s)||(l=i[0])&&de(l,s)||de(r,s)||de(An,s)||de(o.config.globalProperties,s)},defineProperty(e,t,n){return n.get!=null?e._.accessCache[t]=0:de(n,"value")&&this.set(e,t,n.value,null),Reflect.defineProperty(e,t,n)}};function ui(e){return G(e)?e.reduce((t,n)=>(t[n]=null,t),{}):e}let ro=!0;function gc(e){const t=Io(e),n=e.proxy,r=e.ctx;ro=!1,t.beforeCreate&&fi(t.beforeCreate,e,"bc");const{data:o,computed:i,methods:s,watch:l,provide:a,inject:c,created:u,beforeMount:f,mounted:h,beforeUpdate:_,updated:E,activated:T,deactivated:R,beforeDestroy:g,beforeUnmount:y,destroyed:C,unmounted:I,render:W,renderTracked:te,renderTriggered:M,errorCaptured:v,serverPrefetch:j,expose:N,inheritAttrs:Y,components:w,directives:k,filters:z}=t;if(c&&bc(c,r,null),s)for(const oe in s){const ie=s[oe];se(ie)&&(r[oe]=ie.bind(n))}if(o){const oe=o.call(n,n);Pe(oe)&&(e.data=Nn(oe))}if(ro=!0,i)for(const oe in i){const ie=i[oe],Me=se(ie)?ie.bind(n,n):se(ie.get)?ie.get.bind(n,n):it,ze=!se(ie)&&se(ie.set)?ie.set.bind(n):it,je=q({get:Me,set:ze});Object.defineProperty(r,oe,{enumerable:!0,configurable:!0,get:()=>je.value,set:Ne=>je.value=Ne})}if(l)for(const oe in l)Us(l[oe],r,n,oe);if(a){const oe=se(a)?a.call(n):a;Reflect.ownKeys(oe).forEach(ie=>{Ut(ie,oe[ie])})}u&&fi(u,e,"c");function 
U(oe,ie){G(ie)?ie.forEach(Me=>oe(Me.bind(n))):ie&&oe(ie.bind(n))}if(U(ac,f),U(Ge,h),U(cc,_),U(uc,E),U(ic,T),U(sc,R),U(pc,v),U(hc,te),U(dc,M),U(xr,y),U(Or,I),U(fc,j),G(N))if(N.length){const oe=e.exposed||(e.exposed={});N.forEach(ie=>{Object.defineProperty(oe,ie,{get:()=>n[ie],set:Me=>n[ie]=Me})})}else e.exposed||(e.exposed={});W&&e.render===it&&(e.render=W),Y!=null&&(e.inheritAttrs=Y),w&&(e.components=w),k&&(e.directives=k)}function bc(e,t,n=it){G(e)&&(e=oo(e));for(const r in e){const o=e[r];let i;Pe(o)?"default"in o?i=Re(o.from||r,o.default,!0):i=Re(o.from||r):i=Re(o),ke(i)?Object.defineProperty(t,r,{enumerable:!0,configurable:!0,get:()=>i.value,set:s=>i.value=s}):t[r]=i}}function fi(e,t,n){Xe(G(e)?e.map(r=>r.bind(t.proxy)):e.bind(t.proxy),t,n)}function Us(e,t,n,r){const o=r.includes(".")?Ms(n,r):()=>n[r];if(me(e)){const i=t[e];se(i)&&et(o,i)}else if(se(e))et(o,e.bind(n));else if(Pe(e))if(G(e))e.forEach(i=>Us(i,t,n,r));else{const i=se(e.handler)?e.handler.bind(n):t[e.handler];se(i)&&et(o,i,e)}}function Io(e){const t=e.type,{mixins:n,extends:r}=t,{mixins:o,optionsCache:i,config:{optionMergeStrategies:s}}=e.appContext,l=i.get(t);let a;return l?a=l:!o.length&&!n&&!r?a=t:(a={},o.length&&o.forEach(c=>hr(a,c,s,!0)),hr(a,t,s)),Pe(t)&&i.set(t,a),a}function hr(e,t,n,r=!1){const{mixins:o,extends:i}=t;i&&hr(e,i,n,!0),o&&o.forEach(s=>hr(e,s,n,!0));for(const s in t)if(!(r&&s==="expose")){const l=yc[s]||n&&n[s];e[s]=l?l(e[s],t[s]):t[s]}return e}const yc={data:di,props:hi,emits:hi,methods:Tn,computed:Tn,beforeCreate:Be,created:Be,beforeMount:Be,mounted:Be,beforeUpdate:Be,updated:Be,beforeDestroy:Be,beforeUnmount:Be,destroyed:Be,unmounted:Be,activated:Be,deactivated:Be,errorCaptured:Be,serverPrefetch:Be,components:Tn,directives:Tn,watch:Lc,provide:di,inject:Ec};function di(e,t){return t?e?function(){return De(se(e)?e.call(this,this):e,se(t)?t.call(this,this):t)}:t:e}function Ec(e,t){return Tn(oo(e),oo(t))}function oo(e){if(G(e)){const t={};for(let n=0;n1)return 
n&&se(t)?t.call(r&&r.proxy):t}}function Ac(e,t,n,r=!1){const o={},i={};lr(i,Dr,1),e.propsDefaults=Object.create(null),Ws(e,t,o,i);for(const s in e.propsOptions[0])s in o||(o[s]=void 0);n?e.props=r?o:ws(o):e.type.props?e.props=o:e.props=i,e.attrs=i}function wc(e,t,n,r){const{props:o,attrs:i,vnode:{patchFlag:s}}=e,l=pe(o),[a]=e.propsOptions;let c=!1;if((r||s>0)&&!(s&16)){if(s&8){const u=e.vnode.dynamicProps;for(let f=0;f{a=!0;const[h,_]=Qs(f,t,!0);De(s,h),_&&l.push(..._)};!n&&t.mixins.length&&t.mixins.forEach(u),e.extends&&u(e.extends),e.mixins&&e.mixins.forEach(u)}if(!i&&!a)return Pe(e)&&r.set(e,nn),nn;if(G(i))for(let u=0;u-1,_[1]=T<0||E-1||de(_,"default"))&&l.push(f)}}}const c=[s,l];return Pe(e)&&r.set(e,c),c}function pi(e){return e[0]!=="$"}function mi(e){const t=e&&e.toString().match(/^\s*(function|class) (\w+)/);return t?t[2]:e===null?"null":""}function vi(e,t){return mi(e)===mi(t)}function _i(e,t){return G(t)?t.findIndex(n=>vi(n,e)):se(t)&&vi(t,e)?0:-1}const Ys=e=>e[0]==="_"||e==="$stable",Co=e=>G(e)?e.map(nt):[nt(e)],Rc=(e,t,n)=>{if(t._n)return t;const r=$e((...o)=>Co(t(...o)),n);return r._c=!1,r},Gs=(e,t,n)=>{const r=e._ctx;for(const o in e){if(Ys(o))continue;const i=e[o];if(se(i))t[o]=Rc(o,i,r);else if(i!=null){const s=Co(i);t[o]=()=>s}}},Js=(e,t)=>{const n=Co(t);e.slots.default=()=>n},xc=(e,t)=>{if(e.vnode.shapeFlag&32){const n=t._;n?(e.slots=pe(t),lr(t,"_",n)):Gs(t,e.slots={})}else e.slots={},t&&Js(e,t);lr(e.slots,Dr,1)},Oc=(e,t,n)=>{const{vnode:r,slots:o}=e;let i=!0,s=Te;if(r.shapeFlag&32){const l=t._;l?n&&l===1?i=!1:(De(o,t),!n&&l===1&&delete o._):(i=!t.$stable,Gs(t,o)),s=t}else t&&(Js(e,t),s={default:1});if(i)for(const l in o)!Ys(l)&&!(l in s)&&delete o[l]};function mr(e,t,n,r,o=!1){if(G(e)){e.forEach((h,_)=>mr(h,t&&(G(t)?t[_]:t),n,r,o));return}if(ln(r)&&!o)return;const 
i=r.shapeFlag&4?Ir(r.component)||r.component.proxy:r.el,s=o?null:i,{i:l,r:a}=e,c=t&&t.r,u=l.refs===Te?l.refs={}:l.refs,f=l.setupState;if(c!=null&&c!==a&&(me(c)?(u[c]=null,de(f,c)&&(f[c]=null)):ke(c)&&(c.value=null)),se(a))It(a,l,12,[s,u]);else{const h=me(a),_=ke(a);if(h||_){const E=()=>{if(e.f){const T=h?de(f,a)?f[a]:u[a]:a.value;o?G(T)&&mo(T,i):G(T)?T.includes(i)||T.push(i):h?(u[a]=[i],de(f,a)&&(f[a]=u[a])):(a.value=[i],e.k&&(u[e.k]=a.value))}else h?(u[a]=s,de(f,a)&&(f[a]=s)):_&&(a.value=s,e.k&&(u[e.k]=s))};s?(E.id=-1,qe(E,n)):E()}}}let At=!1;const Zn=e=>/svg/.test(e.namespaceURI)&&e.tagName!=="foreignObject",Xn=e=>e.nodeType===8;function Dc(e){const{mt:t,p:n,o:{patchProp:r,createText:o,nextSibling:i,parentNode:s,remove:l,insert:a,createComment:c}}=e,u=(g,y)=>{if(!y.hasChildNodes()){n(null,g,y),ur(),y._vnode=g;return}At=!1,f(y.firstChild,g,null,null,null),ur(),y._vnode=g,At&&console.error("Hydration completed but contains mismatches.")},f=(g,y,C,I,W,te=!1)=>{const M=Xn(g)&&g.data==="[",v=()=>T(g,y,C,I,W,M),{type:j,ref:N,shapeFlag:Y,patchFlag:w}=y;let k=g.nodeType;y.el=g,w===-2&&(te=!1,y.dynamicChildren=null);let z=null;switch(j){case cn:k!==3?y.children===""?(a(y.el=o(""),s(g),g),z=g):z=v():(g.data!==y.children&&(At=!0,g.data=y.children),z=i(g));break;case Ye:k!==8||M?z=v():z=i(g);break;case wn:if(M&&(g=i(g),k=g.nodeType),k===1||k===3){z=g;const le=!y.children.length;for(let U=0;U{te=te||!!y.dynamicChildren;const{type:M,props:v,patchFlag:j,shapeFlag:N,dirs:Y}=y,w=M==="input"&&Y||M==="option";if(w||j!==-1){if(Y&&ct(y,null,C,"created"),v)if(w||!te||j&48)for(const z in v)(w&&z.endsWith("value")||Vn(z)&&!Pn(z))&&r(g,z,null,v[z],!1,void 0,C);else v.onClick&&r(g,"onClick",null,v.onClick,!1,void 0,C);let k;if((k=v&&v.onVnodeBeforeMount)&&Ze(k,C,y),Y&&ct(y,null,C,"beforeMount"),((k=v&&v.onVnodeMounted)||Y)&&$s(()=>{k&&Ze(k,C,y),Y&&ct(y,null,C,"mounted")},I),N&16&&!(v&&(v.innerHTML||v.textContent))){let z=_(g.firstChild,y,g,C,I,W,te);for(;z;){At=!0;const 
le=z;z=z.nextSibling,l(le)}}else N&8&&g.textContent!==y.children&&(At=!0,g.textContent=y.children)}return g.nextSibling},_=(g,y,C,I,W,te,M)=>{M=M||!!y.dynamicChildren;const v=y.children,j=v.length;for(let N=0;N{const{slotScopeIds:M}=y;M&&(W=W?W.concat(M):M);const v=s(g),j=_(i(g),y,v,C,I,W,te);return j&&Xn(j)&&j.data==="]"?i(y.anchor=j):(At=!0,a(y.anchor=c("]"),v,j),j)},T=(g,y,C,I,W,te)=>{if(At=!0,y.el=null,te){const j=R(g);for(;;){const N=i(g);if(N&&N!==j)l(N);else break}}const M=i(g),v=s(g);return l(g),n(null,y,v,M,C,I,Zn(v),W),M},R=g=>{let y=0;for(;g;)if(g=i(g),g&&Xn(g)&&(g.data==="["&&y++,g.data==="]")){if(y===0)return i(g);y--}return g};return[u,f]}const qe=$s;function Ic(e){return Cc(e,Dc)}function Cc(e,t){const n=Yr();n.__VUE__=!0;const{insert:r,remove:o,patchProp:i,createElement:s,createText:l,createComment:a,setText:c,setElementText:u,parentNode:f,nextSibling:h,setScopeId:_=it,insertStaticContent:E}=e,T=(d,p,b,L=null,A=null,x=null,H=!1,S=null,V=!!p.dynamicChildren)=>{if(d===p)return;d&&!Bt(d,p)&&(L=P(d),Ne(d,A,x,!0),d=null),p.patchFlag===-2&&(V=!1,p.dynamicChildren=null);const{type:D,ref:J,shapeFlag:K}=p;switch(D){case cn:R(d,p,b,L);break;case Ye:g(d,p,b,L);break;case wn:d==null&&y(p,b,L,H);break;case Ee:w(d,p,b,L,A,x,H,S,V);break;default:K&1?W(d,p,b,L,A,x,H,S,V):K&6?k(d,p,b,L,A,x,H,S,V):(K&64||K&128)&&D.process(d,p,b,L,A,x,H,S,V,$)}J!=null&&A&&mr(J,d&&d.ref,x,p||d,!p)},R=(d,p,b,L)=>{if(d==null)r(p.el=l(p.children),b,L);else{const A=p.el=d.el;p.children!==d.children&&c(A,p.children)}},g=(d,p,b,L)=>{d==null?r(p.el=a(p.children||""),b,L):p.el=d.el},y=(d,p,b,L)=>{[d.el,d.anchor]=E(d.children,p,b,L,d.el,d.anchor)},C=({el:d,anchor:p},b,L)=>{let A;for(;d&&d!==p;)A=h(d),r(d,b,L),d=A;r(p,b,L)},I=({el:d,anchor:p})=>{let b;for(;d&&d!==p;)b=h(d),o(d),d=b;o(p)},W=(d,p,b,L,A,x,H,S,V)=>{H=H||p.type==="svg",d==null?te(p,b,L,A,x,H,S,V):j(d,p,A,x,H,S,V)},te=(d,p,b,L,A,x,H,S)=>{let 
V,D;const{type:J,props:K,shapeFlag:Z,transition:re,dirs:ae}=d;if(V=d.el=s(d.type,x,K&&K.is,K),Z&8?u(V,d.children):Z&16&&v(d.children,V,null,L,A,x&&J!=="foreignObject",H,S),ae&&ct(d,null,L,"created"),M(V,d,d.scopeId,H,L),K){for(const ge in K)ge!=="value"&&!Pn(ge)&&i(V,ge,null,K[ge],x,d.children,L,A,Se);"value"in K&&i(V,"value",null,K.value),(D=K.onVnodeBeforeMount)&&Ze(D,L,d)}ae&&ct(d,null,L,"beforeMount");const be=(!A||A&&!A.pendingBranch)&&re&&!re.persisted;be&&re.beforeEnter(V),r(V,p,b),((D=K&&K.onVnodeMounted)||be||ae)&&qe(()=>{D&&Ze(D,L,d),be&&re.enter(V),ae&&ct(d,null,L,"mounted")},A)},M=(d,p,b,L,A)=>{if(b&&_(d,b),L)for(let x=0;x{for(let D=V;D{const S=p.el=d.el;let{patchFlag:V,dynamicChildren:D,dirs:J}=p;V|=d.patchFlag&16;const K=d.props||Te,Z=p.props||Te;let re;b&&$t(b,!1),(re=Z.onVnodeBeforeUpdate)&&Ze(re,b,p,d),J&&ct(p,d,b,"beforeUpdate"),b&&$t(b,!0);const ae=A&&p.type!=="foreignObject";if(D?N(d.dynamicChildren,D,S,b,L,ae,x):H||ie(d,p,S,null,b,L,ae,x,!1),V>0){if(V&16)Y(S,p,K,Z,b,L,A);else if(V&2&&K.class!==Z.class&&i(S,"class",null,Z.class,A),V&4&&i(S,"style",K.style,Z.style,A),V&8){const be=p.dynamicProps;for(let ge=0;ge{re&&Ze(re,b,p,d),J&&ct(p,d,b,"updated")},L)},N=(d,p,b,L,A,x,H)=>{for(let S=0;S{if(b!==L){if(b!==Te)for(const S in b)!Pn(S)&&!(S in L)&&i(d,S,b[S],null,H,p.children,A,x,Se);for(const S in L){if(Pn(S))continue;const V=L[S],D=b[S];V!==D&&S!=="value"&&i(d,S,D,V,H,p.children,A,x,Se)}"value"in L&&i(d,"value",b.value,L.value)}},w=(d,p,b,L,A,x,H,S,V)=>{const D=p.el=d?d.el:l(""),J=p.anchor=d?d.anchor:l("");let{patchFlag:K,dynamicChildren:Z,slotScopeIds:re}=p;re&&(S=S?S.concat(re):re),d==null?(r(D,b,L),r(J,b,L),v(p.children,b,J,A,x,H,S,V)):K>0&&K&64&&Z&&d.dynamicChildren?(N(d.dynamicChildren,Z,b,A,x,H,S),(p.key!=null||A&&p===A.subTree)&&Zs(d,p,!0)):ie(d,p,b,J,A,x,H,S,V)},k=(d,p,b,L,A,x,H,S,V)=>{p.slotScopeIds=S,d==null?p.shapeFlag&512?A.ctx.activate(p,b,L,H,V):z(p,b,L,A,x,H,V):le(d,p,V)},z=(d,p,b,L,A,x,H)=>{const 
S=d.component=Bc(d,L,A);if(Fn(d)&&(S.ctx.renderer=$),Fc(S),S.asyncDep){if(A&&A.registerDep(S,U),!d.el){const V=S.subTree=ne(Ye);g(null,V,p,b)}return}U(S,d,p,b,A,x,H)},le=(d,p,b)=>{const L=p.component=d.component;if(Za(d,p,b))if(L.asyncDep&&!L.asyncResolved){oe(L,p,b);return}else L.next=p,Ka(L.update),L.update();else p.el=d.el,L.vnode=p},U=(d,p,b,L,A,x,H)=>{const S=()=>{if(d.isMounted){let{next:J,bu:K,u:Z,parent:re,vnode:ae}=d,be=J,ge;$t(d,!1),J?(J.el=ae.el,oe(d,J,H)):J=ae,K&&zr(K),(ge=J.props&&J.props.onVnodeBeforeUpdate)&&Ze(ge,re,J,ae),$t(d,!0);const xe=$r(d),tt=d.subTree;d.subTree=xe,T(tt,xe,f(tt.el),P(tt),d,A,x),J.el=xe.el,be===null&&Xa(d,xe.el),Z&&qe(Z,A),(ge=J.props&&J.props.onVnodeUpdated)&&qe(()=>Ze(ge,re,J,ae),A)}else{let J;const{el:K,props:Z}=p,{bm:re,m:ae,parent:be}=d,ge=ln(p);if($t(d,!1),re&&zr(re),!ge&&(J=Z&&Z.onVnodeBeforeMount)&&Ze(J,be,p),$t(d,!0),K&&ce){const xe=()=>{d.subTree=$r(d),ce(K,d.subTree,d,A,null)};ge?p.type.__asyncLoader().then(()=>!d.isUnmounted&&xe()):xe()}else{const xe=d.subTree=$r(d);T(null,xe,b,L,d,A,x),p.el=xe.el}if(ae&&qe(ae,A),!ge&&(J=Z&&Z.onVnodeMounted)){const xe=p;qe(()=>Ze(J,be,xe),A)}(p.shapeFlag&256||be&&ln(be.vnode)&&be.vnode.shapeFlag&256)&&d.a&&qe(d.a,A),d.isMounted=!0,p=b=L=null}},V=d.effect=new bo(S,()=>Ar(D),d.scope),D=d.update=()=>V.run();D.id=d.uid,$t(d,!0),D()},oe=(d,p,b)=>{p.component=d;const L=d.vnode.props;d.vnode=p,d.next=null,wc(d,p.props,L,b),Oc(d,p.children,b),mn(),si(),vn()},ie=(d,p,b,L,A,x,H,S,V=!1)=>{const D=d&&d.children,J=d?d.shapeFlag:0,K=p.children,{patchFlag:Z,shapeFlag:re}=p;if(Z>0){if(Z&128){ze(D,K,b,L,A,x,H,S,V);return}else if(Z&256){Me(D,K,b,L,A,x,H,S,V);return}}re&8?(J&16&&Se(D,A,x),K!==D&&u(b,K)):J&16?re&16?ze(D,K,b,L,A,x,H,S,V):Se(D,A,x,!0):(J&8&&u(b,""),re&16&&v(K,b,L,A,x,H,S,V))},Me=(d,p,b,L,A,x,H,S,V)=>{d=d||nn,p=p||nn;const D=d.length,J=p.length,K=Math.min(D,J);let Z;for(Z=0;ZJ?Se(d,A,x,!0,!1,K):v(p,b,L,A,x,H,S,V,K)},ze=(d,p,b,L,A,x,H,S,V)=>{let D=0;const J=p.length;let 
K=d.length-1,Z=J-1;for(;D<=K&&D<=Z;){const re=d[D],ae=p[D]=V?xt(p[D]):nt(p[D]);if(Bt(re,ae))T(re,ae,b,null,A,x,H,S,V);else break;D++}for(;D<=K&&D<=Z;){const re=d[K],ae=p[Z]=V?xt(p[Z]):nt(p[Z]);if(Bt(re,ae))T(re,ae,b,null,A,x,H,S,V);else break;K--,Z--}if(D>K){if(D<=Z){const re=Z+1,ae=reZ)for(;D<=K;)Ne(d[D],A,x,!0),D++;else{const re=D,ae=D,be=new Map;for(D=ae;D<=Z;D++){const We=p[D]=V?xt(p[D]):nt(p[D]);We.key!=null&&be.set(We.key,D)}let ge,xe=0;const tt=Z-ae+1;let Jt=!1,Qo=0;const _n=new Array(tt);for(D=0;D=tt){Ne(We,A,x,!0);continue}let at;if(We.key!=null)at=be.get(We.key);else for(ge=ae;ge<=Z;ge++)if(_n[ge-ae]===0&&Bt(We,p[ge])){at=ge;break}at===void 0?Ne(We,A,x,!0):(_n[at-ae]=D+1,at>=Qo?Qo=at:Jt=!0,T(We,p[at],b,null,A,x,H,S,V),xe++)}const Yo=Jt?Sc(_n):nn;for(ge=Yo.length-1,D=tt-1;D>=0;D--){const We=ae+D,at=p[We],Go=We+1{const{el:x,type:H,transition:S,children:V,shapeFlag:D}=d;if(D&6){je(d.component.subTree,p,b,L);return}if(D&128){d.suspense.move(p,b,L);return}if(D&64){H.move(d,p,b,$);return}if(H===Ee){r(x,p,b);for(let K=0;KS.enter(x),A);else{const{leave:K,delayLeave:Z,afterLeave:re}=S,ae=()=>r(x,p,b),be=()=>{K(x,()=>{ae(),re&&re()})};Z?Z(x,ae,be):be()}else r(x,p,b)},Ne=(d,p,b,L=!1,A=!1)=>{const{type:x,props:H,ref:S,children:V,dynamicChildren:D,shapeFlag:J,patchFlag:K,dirs:Z}=d;if(S!=null&&mr(S,null,b,d,!0),J&256){p.ctx.deactivate(d);return}const re=J&1&&Z,ae=!ln(d);let be;if(ae&&(be=H&&H.onVnodeBeforeUnmount)&&Ze(be,p,d),J&6)lt(d.component,b,L);else{if(J&128){d.suspense.unmount(b,L);return}re&&ct(d,null,p,"beforeUnmount"),J&64?d.type.remove(d,p,b,A,$,L):D&&(x!==Ee||K>0&&K&64)?Se(D,p,b,!1,!0):(x===Ee&&K&384||!A&&J&16)&&Se(V,p,b),L&&Lt(d)}(ae&&(be=H&&H.onVnodeUnmounted)||re)&&qe(()=>{be&&Ze(be,p,d),re&&ct(d,null,p,"unmounted")},b)},Lt=d=>{const{type:p,el:b,anchor:L,transition:A}=d;if(p===Ee){Tt(b,L);return}if(p===wn){I(d);return}const 
x=()=>{o(b),A&&!A.persisted&&A.afterLeave&&A.afterLeave()};if(d.shapeFlag&1&&A&&!A.persisted){const{leave:H,delayLeave:S}=A,V=()=>H(b,x);S?S(d.el,x,V):V()}else x()},Tt=(d,p)=>{let b;for(;d!==p;)b=h(d),o(d),d=b;o(p)},lt=(d,p,b)=>{const{bum:L,scope:A,update:x,subTree:H,um:S}=d;L&&zr(L),A.stop(),x&&(x.active=!1,Ne(H,d,p,b)),S&&qe(S,p),qe(()=>{d.isUnmounted=!0},p),p&&p.pendingBranch&&!p.isUnmounted&&d.asyncDep&&!d.asyncResolved&&d.suspenseId===p.pendingId&&(p.deps--,p.deps===0&&p.resolve())},Se=(d,p,b,L=!1,A=!1,x=0)=>{for(let H=x;Hd.shapeFlag&6?P(d.component.subTree):d.shapeFlag&128?d.suspense.next():h(d.anchor||d.el),F=(d,p,b)=>{d==null?p._vnode&&Ne(p._vnode,null,null,!0):T(p._vnode||null,d,p,null,null,null,b),si(),ur(),p._vnode=d},$={p:T,um:Ne,m:je,r:Lt,mt:z,mc:v,pc:ie,pbc:N,n:P,o:e};let Q,ce;return t&&([Q,ce]=t($)),{render:F,hydrate:Q,createApp:Pc(F,Q)}}function $t({effect:e,update:t},n){e.allowRecurse=t.allowRecurse=n}function Zs(e,t,n=!1){const r=e.children,o=t.children;if(G(r)&&G(o))for(let i=0;i>1,e[n[l]]0&&(t[r]=n[i-1]),n[i]=r)}}for(i=n.length,s=n[i-1];i-- >0;)n[i]=s,s=t[s];return n}const kc=e=>e.__isTeleport,Ee=Symbol.for("v-fgt"),cn=Symbol.for("v-txt"),Ye=Symbol.for("v-cmt"),wn=Symbol.for("v-stc"),Rn=[];let ot=null;function B(e=!1){Rn.push(ot=e?null:[])}function zc(){Rn.pop(),ot=Rn[Rn.length-1]||null}let kn=1;function gi(e){kn+=e}function Xs(e){return e.dynamicChildren=kn>0?ot||nn:null,zc(),kn>0&&ot&&ot.push(e),e}function X(e,t,n,r,o,i){return Xs(he(e,t,n,r,o,i,!0))}function Oe(e,t,n,r,o){return Xs(ne(e,t,n,r,o,!0))}function vr(e){return e?e.__v_isVNode===!0:!1}function Bt(e,t){return e.type===t.type&&e.key===t.key}const Dr="__vInternal",el=({key:e})=>e??null,ir=({ref:e,ref_key:t,ref_for:n})=>(typeof e=="number"&&(e=""+e),e!=null?me(e)||ke(e)||se(e)?{i:Ve,r:e,k:t,f:!!n}:e:null);function he(e,t=null,n=null,r=0,o=null,i=e===Ee?0:1,s=!1,l=!1){const 
a={__v_isVNode:!0,__v_skip:!0,type:e,props:t,key:t&&el(t),ref:t&&ir(t),scopeId:zs,slotScopeIds:null,children:n,component:null,suspense:null,ssContent:null,ssFallback:null,dirs:null,transition:null,el:null,anchor:null,target:null,targetAnchor:null,staticCount:0,shapeFlag:i,patchFlag:r,dynamicProps:o,dynamicChildren:null,appContext:null,ctx:Ve};return l?(So(a,n),i&128&&e.normalize(a)):n&&(a.shapeFlag|=me(n)?8:16),kn>0&&!s&&ot&&(a.patchFlag>0||i&6)&&a.patchFlag!==32&&ot.push(a),a}const ne=$c;function $c(e,t=null,n=null,r=0,o=null,i=!1){if((!e||e===mc)&&(e=Ye),vr(e)){const l=kt(e,t,!0);return n&&So(l,n),kn>0&&!i&&ot&&(l.shapeFlag&6?ot[ot.indexOf(e)]=l:ot.push(l)),l.patchFlag|=-2,l}if(Wc(e)&&(e=e.__vccOpts),t){t=Vc(t);let{class:l,style:a}=t;l&&!me(l)&&(t.class=Ue(l)),Pe(a)&&(Rs(a)&&!G(a)&&(a=De({},a)),t.style=Mn(a))}const s=me(e)?1:ec(e)?128:kc(e)?64:Pe(e)?4:se(e)?2:0;return he(e,t,n,r,o,s,i,!0)}function Vc(e){return e?Rs(e)||Dr in e?De({},e):e:null}function kt(e,t,n=!1){const{props:r,ref:o,patchFlag:i,children:s}=e,l=t?so(r||{},t):r;return{__v_isVNode:!0,__v_skip:!0,type:e.type,props:l,key:l&&el(l),ref:t&&t.ref?n&&o?G(o)?o.concat(ir(t)):[o,ir(t)]:ir(t):o,scopeId:e.scopeId,slotScopeIds:e.slotScopeIds,children:s,target:e.target,targetAnchor:e.targetAnchor,staticCount:e.staticCount,shapeFlag:e.shapeFlag,patchFlag:t&&e.type!==Ee?i===-1?16:i|16:i,dynamicProps:e.dynamicProps,dynamicChildren:e.dynamicChildren,appContext:e.appContext,dirs:e.dirs,transition:e.transition,component:e.component,suspense:e.suspense,ssContent:e.ssContent&&kt(e.ssContent),ssFallback:e.ssFallback&&kt(e.ssFallback),el:e.el,anchor:e.anchor,ctx:e.ctx,ce:e.ce}}function zt(e=" ",t=0){return ne(cn,null,e,t)}function Mc(e,t){const n=ne(wn,null,e);return n.staticCount=t,n}function we(e="",t=!1){return t?(B(),Oe(Ye,null,e)):ne(Ye,null,e)}function nt(e){return e==null||typeof e=="boolean"?ne(Ye):G(e)?ne(Ee,null,e.slice()):typeof e=="object"?xt(e):ne(cn,null,String(e))}function xt(e){return 
e.el===null&&e.patchFlag!==-1||e.memo?e:kt(e)}function So(e,t){let n=0;const{shapeFlag:r}=e;if(t==null)t=null;else if(G(t))n=16;else if(typeof t=="object")if(r&65){const o=t.default;o&&(o._c&&(o._d=!1),So(e,o()),o._c&&(o._d=!0));return}else{n=32;const o=t._;!o&&!(Dr in t)?t._ctx=Ve:o===3&&Ve&&(Ve.slots._===1?t._=1:(t._=2,e.patchFlag|=1024))}else se(t)?(t={default:t,_ctx:Ve},n=32):(t=String(t),r&64?(n=16,t=[zt(t)]):n=8);e.children=t,e.shapeFlag|=n}function so(...e){const t={};for(let n=0;nIe||Ve;let ko,Zt,bi="__VUE_INSTANCE_SETTERS__";(Zt=Yr()[bi])||(Zt=Yr()[bi]=[]),Zt.push(e=>Ie=e),ko=e=>{Zt.length>1?Zt.forEach(t=>t(e)):Zt[0](e)};const un=e=>{ko(e),e.scope.on()},Kt=()=>{Ie&&Ie.scope.off(),ko(null)};function nl(e){return e.vnode.shapeFlag&4}let fn=!1;function Fc(e,t=!1){fn=t;const{props:n,children:r}=e.vnode,o=nl(e);Ac(e,n,o,t),xc(e,r);const i=o?jc(e,t):void 0;return fn=!1,i}function jc(e,t){const n=e.type;e.accessCache=Object.create(null),e.proxy=xs(new Proxy(e.ctx,_c));const{setup:r}=n;if(r){const o=e.setupContext=r.length>1?Uc(e):null;un(e),mn();const i=It(r,e,0,[e.props,o]);if(vn(),Kt(),us(i)){if(i.then(Kt,Kt),t)return i.then(s=>{yi(e,s,t)}).catch(s=>{Bn(s,e,0)});e.asyncDep=i}else yi(e,i,t)}else rl(e,t)}function yi(e,t,n){se(t)?e.type.__ssrInlineRender?e.ssrRender=t:e.render=t:Pe(t)&&(e.setupState=Ds(t)),rl(e,n)}let Ei;function rl(e,t,n){const r=e.type;if(!e.render){if(!t&&Ei&&!r.render){const o=r.template||Io(e).template;if(o){const{isCustomElement:i,compilerOptions:s}=e.appContext.config,{delimiters:l,compilerOptions:a}=r,c=De(De({isCustomElement:i,delimiters:l},s),a);r.render=Ei(o,c)}}e.render=r.render||it}un(e),mn(),gc(e),vn(),Kt()}function qc(e){return e.attrsProxy||(e.attrsProxy=new Proxy(e.attrs,{get(t,n){return Ke(e,"get","$attrs"),t[n]}}))}function Uc(e){const t=n=>{e.exposed=n||{}};return{get attrs(){return qc(e)},slots:e.slots,emit:e.emit,expose:t}}function Ir(e){if(e.exposed)return e.exposeProxy||(e.exposeProxy=new 
Proxy(Ds(xs(e.exposed)),{get(t,n){if(n in t)return t[n];if(n in An)return An[n](e)},has(t,n){return n in t||n in An}}))}function Kc(e,t=!0){return se(e)?e.displayName||e.name:e.name||t&&e.__name}function Wc(e){return se(e)&&"__vccOpts"in e}const q=(e,t)=>ja(e,t,fn);function _e(e,t,n){const r=arguments.length;return r===2?Pe(t)&&!G(t)?vr(t)?ne(e,null,[t]):ne(e,t):ne(e,null,t):(r>3?n=Array.prototype.slice.call(arguments,2):r===3&&vr(n)&&(n=[n]),ne(e,t,n))}const Qc=Symbol.for("v-scx"),Yc=()=>Re(Qc),Gc="3.3.4",Jc="http://www.w3.org/2000/svg",Ft=typeof document<"u"?document:null,Li=Ft&&Ft.createElement("template"),Zc={insert:(e,t,n)=>{t.insertBefore(e,n||null)},remove:e=>{const t=e.parentNode;t&&t.removeChild(e)},createElement:(e,t,n,r)=>{const o=t?Ft.createElementNS(Jc,e):Ft.createElement(e,n?{is:n}:void 0);return e==="select"&&r&&r.multiple!=null&&o.setAttribute("multiple",r.multiple),o},createText:e=>Ft.createTextNode(e),createComment:e=>Ft.createComment(e),setText:(e,t)=>{e.nodeValue=t},setElementText:(e,t)=>{e.textContent=t},parentNode:e=>e.parentNode,nextSibling:e=>e.nextSibling,querySelector:e=>Ft.querySelector(e),setScopeId(e,t){e.setAttribute(t,"")},insertStaticContent(e,t,n,r,o,i){const s=n?n.previousSibling:t.lastChild;if(o&&(o===i||o.nextSibling))for(;t.insertBefore(o.cloneNode(!0),n),!(o===i||!(o=o.nextSibling)););else{Li.innerHTML=r?`${e}`:e;const l=Li.content;if(r){const a=l.firstChild;for(;a.firstChild;)l.appendChild(a.firstChild);l.removeChild(a)}t.insertBefore(l,n)}return[s?s.nextSibling:t.firstChild,n?n.previousSibling:t.lastChild]}};function Xc(e,t,n){const r=e._vtc;r&&(t=(t?[t,...r]:[...r]).join(" ")),t==null?e.removeAttribute("class"):n?e.setAttribute("class",t):e.className=t}function eu(e,t,n){const r=e.style,o=me(n);if(n&&!o){if(t&&!me(t))for(const i in t)n[i]==null&&lo(r,i,"");for(const i in n)lo(r,i,n[i])}else{const i=r.display;o?t!==n&&(r.cssText=n):t&&e.removeAttribute("style"),"_vod"in e&&(r.display=i)}}const Ti=/\s*!important$/;function 
lo(e,t,n){if(G(n))n.forEach(r=>lo(e,t,r));else if(n==null&&(n=""),t.startsWith("--"))e.setProperty(t,n);else{const r=tu(e,t);Ti.test(n)?e.setProperty(Qt(r),n.replace(Ti,""),"important"):e[r]=n}}const Pi=["Webkit","Moz","ms"],Hr={};function tu(e,t){const n=Hr[t];if(n)return n;let r=ft(t);if(r!=="filter"&&r in e)return Hr[t]=r;r=Lr(r);for(let o=0;oBr||(au.then(()=>Br=0),Br=Date.now());function uu(e,t){const n=r=>{if(!r._vts)r._vts=Date.now();else if(r._vts<=n.attached)return;Xe(fu(r,n.value),t,5,[r])};return n.value=e,n.attached=cu(),n}function fu(e,t){if(G(t)){const n=e.stopImmediatePropagation;return e.stopImmediatePropagation=()=>{n.call(e),e._stopped=!0},t.map(r=>o=>!o._stopped&&r&&r(o))}else return t}const Ri=/^on[a-z]/,du=(e,t,n,r,o=!1,i,s,l,a)=>{t==="class"?Xc(e,r,o):t==="style"?eu(e,n,r):Vn(t)?po(t)||su(e,t,n,r,s):(t[0]==="."?(t=t.slice(1),!0):t[0]==="^"?(t=t.slice(1),!1):hu(e,t,r,o))?ru(e,t,r,i,s,l,a):(t==="true-value"?e._trueValue=r:t==="false-value"&&(e._falseValue=r),nu(e,t,r,o))};function hu(e,t,n,r){return r?!!(t==="innerHTML"||t==="textContent"||t in e&&Ri.test(t)&&se(n)):t==="spellcheck"||t==="draggable"||t==="translate"||t==="form"||t==="list"&&e.tagName==="INPUT"||t==="type"&&e.tagName==="TEXTAREA"||Ri.test(t)&&me(n)?!1:t in e}const wt="transition",gn="animation",jn=(e,{slots:t})=>_e(oc,pu(e),t);jn.displayName="Transition";const ol={name:String,type:String,css:{type:Boolean,default:!0},duration:[String,Number,Object],enterFromClass:String,enterActiveClass:String,enterToClass:String,appearFromClass:String,appearActiveClass:String,appearToClass:String,leaveFromClass:String,leaveActiveClass:String,leaveToClass:String};jn.props=De({},Ns,ol);const Vt=(e,t=[])=>{G(e)?e.forEach(n=>n(...t)):e&&e(...t)},xi=e=>e?G(e)?e.some(t=>t.length>1):e.length>1:!1;function pu(e){const t={};for(const w in e)w in ol||(t[w]=e[w]);if(e.css===!1)return 
t;const{name:n="v",type:r,duration:o,enterFromClass:i=`${n}-enter-from`,enterActiveClass:s=`${n}-enter-active`,enterToClass:l=`${n}-enter-to`,appearFromClass:a=i,appearActiveClass:c=s,appearToClass:u=l,leaveFromClass:f=`${n}-leave-from`,leaveActiveClass:h=`${n}-leave-active`,leaveToClass:_=`${n}-leave-to`}=e,E=mu(o),T=E&&E[0],R=E&&E[1],{onBeforeEnter:g,onEnter:y,onEnterCancelled:C,onLeave:I,onLeaveCancelled:W,onBeforeAppear:te=g,onAppear:M=y,onAppearCancelled:v=C}=t,j=(w,k,z)=>{Mt(w,k?u:l),Mt(w,k?c:s),z&&z()},N=(w,k)=>{w._isLeaving=!1,Mt(w,f),Mt(w,_),Mt(w,h),k&&k()},Y=w=>(k,z)=>{const le=w?M:y,U=()=>j(k,w,z);Vt(le,[k,U]),Oi(()=>{Mt(k,w?a:i),Rt(k,w?u:l),xi(le)||Di(k,r,T,U)})};return De(t,{onBeforeEnter(w){Vt(g,[w]),Rt(w,i),Rt(w,s)},onBeforeAppear(w){Vt(te,[w]),Rt(w,a),Rt(w,c)},onEnter:Y(!1),onAppear:Y(!0),onLeave(w,k){w._isLeaving=!0;const z=()=>N(w,k);Rt(w,f),gu(),Rt(w,h),Oi(()=>{w._isLeaving&&(Mt(w,f),Rt(w,_),xi(I)||Di(w,r,R,z))}),Vt(I,[w,z])},onEnterCancelled(w){j(w,!1),Vt(C,[w])},onAppearCancelled(w){j(w,!0),Vt(v,[w])},onLeaveCancelled(w){N(w),Vt(W,[w])}})}function mu(e){if(e==null)return null;if(Pe(e))return[Fr(e.enter),Fr(e.leave)];{const t=Fr(e);return[t,t]}}function Fr(e){return Xl(e)}function Rt(e,t){t.split(/\s+/).forEach(n=>n&&e.classList.add(n)),(e._vtc||(e._vtc=new Set)).add(t)}function Mt(e,t){t.split(/\s+/).forEach(r=>r&&e.classList.remove(r));const{_vtc:n}=e;n&&(n.delete(t),n.size||(e._vtc=void 0))}function Oi(e){requestAnimationFrame(()=>{requestAnimationFrame(e)})}let vu=0;function Di(e,t,n,r){const o=e._endId=++vu,i=()=>{o===e._endId&&r()};if(n)return setTimeout(i,n);const{type:s,timeout:l,propCount:a}=_u(e,t);if(!s)return r();const c=s+"end";let u=0;const f=()=>{e.removeEventListener(c,h),i()},h=_=>{_.target===e&&++u>=a&&f()};setTimeout(()=>{u(n[E]||"").split(", "),o=r(`${wt}Delay`),i=r(`${wt}Duration`),s=Ii(o,i),l=r(`${gn}Delay`),a=r(`${gn}Duration`),c=Ii(l,a);let 
u=null,f=0,h=0;t===wt?s>0&&(u=wt,f=s,h=i.length):t===gn?c>0&&(u=gn,f=c,h=a.length):(f=Math.max(s,c),u=f>0?s>c?wt:gn:null,h=u?u===wt?i.length:a.length:0);const _=u===wt&&/\b(transform|all)(,|$)/.test(r(`${wt}Property`).toString());return{type:u,timeout:f,propCount:h,hasTransform:_}}function Ii(e,t){for(;e.lengthCi(n)+Ci(e[r])))}function Ci(e){return Number(e.slice(0,-1).replace(",","."))*1e3}function gu(){return document.body.offsetHeight}const bu={esc:"escape",space:" ",up:"arrow-up",left:"arrow-left",right:"arrow-right",down:"arrow-down",delete:"backspace"},yu=(e,t)=>n=>{if(!("key"in n))return;const r=Qt(n.key);if(t.some(o=>o===r||bu[o]===r))return e(n)},_r={beforeMount(e,{value:t},{transition:n}){e._vod=e.style.display==="none"?"":e.style.display,n&&t?n.beforeEnter(e):bn(e,t)},mounted(e,{value:t},{transition:n}){n&&t&&n.enter(e)},updated(e,{value:t,oldValue:n},{transition:r}){!t!=!n&&(r?t?(r.beforeEnter(e),bn(e,!0),r.enter(e)):r.leave(e,()=>{bn(e,!1)}):bn(e,t))},beforeUnmount(e,{value:t}){bn(e,t)}};function bn(e,t){e.style.display=t?e._vod:"none"}const Eu=De({patchProp:du},Zc);let jr,Si=!1;function Lu(){return jr=Si?jr:Ic(Eu),Si=!0,jr}const Tu=(...e)=>{const t=Lu().createApp(...e),{mount:n}=t;return t.mount=r=>{const o=Pu(r);if(o)return n(o,!0,o instanceof SVGElement)},t};function Pu(e){return me(e)?document.querySelector(e):e}const 
Au={"v-8daa1a0e":()=>m(()=>import("./index.html-c2968b1e.js"),[]).then(({data:e})=>e),"v-64270bfa":()=>m(()=>import("./db-localfs.html-bdc3f77a.js"),[]).then(({data:e})=>e),"v-20ec2a08":()=>m(()=>import("./db-pfs-curve.html-ee679a35.js"),[]).then(({data:e})=>e),"v-2da78b44":()=>m(()=>import("./db-pfs.html-00133c95.js"),[]).then(({data:e})=>e),"v-bca378d6":()=>m(()=>import("./deploy-official.html-efd51867.js"),[]).then(({data:e})=>e),"v-097f9dea":()=>m(()=>import("./deploy-stack.html-9812b946.js"),[]).then(({data:e})=>e),"v-4a7bdef6":()=>m(()=>import("./deploy.html-d61ba66a.js"),[]).then(({data:e})=>e),"v-e8e53a66":()=>m(()=>import("./fs-pfs-curve.html-fa75d4d3.js"),[]).then(({data:e})=>e),"v-4bd622ef":()=>m(()=>import("./fs-pfs.html-78c353ce.js"),[]).then(({data:e})=>e),"v-12a5021c":()=>m(()=>import("./introduction.html-1d0705b0.js"),[]).then(({data:e})=>e),"v-1ced8944":()=>m(()=>import("./quick-start.html-ede64a2e.js"),[]).then(({data:e})=>e),"v-5a992740":()=>m(()=>import("./storage-aliyun-essd.html-3dd7acdd.js"),[]).then(({data:e})=>e),"v-e3a62740":()=>m(()=>import("./storage-ceph.html-9327a336.js"),[]).then(({data:e})=>e),"v-7f31e698":()=>m(()=>import("./storage-curvebs.html-f3b814a4.js"),[]).then(({data:e})=>e),"v-c895df30":()=>m(()=>import("./storage-nbd.html-0d5c1474.js"),[]).then(({data:e})=>e),"v-43a2065f":()=>m(()=>import("./coding-style.html-9c14d7a6.js"),[]).then(({data:e})=>e),"v-2be11236":()=>m(()=>import("./contributing-polardb-docs.html-f51dbfef.js"),[]).then(({data:e})=>e),"v-48520b74":()=>m(()=>import("./contributing-polardb-kernel.html-1eca7ee4.js"),[]).then(({data:e})=>e),"v-c4fe9fca":()=>m(()=>import("./customize-dev-env.html-6e08f45f.js"),[]).then(({data:e})=>e),"v-2a8fa310":()=>m(()=>import("./dev-on-docker.html-efa784b2.js"),[]).then(({data:e})=>e),"v-7fdfc12a":()=>m(()=>import("./backup-and-restore.html-7682916f.js"),[]).then(({data:e})=>e),"v-530a6d12":()=>m(()=>import("./grow-storage.html-358b501e.js"),[]).then(({data:e})=>e),"v-4cbd0b64":(
)=>m(()=>import("./ro-online-promote.html-089ffddc.js"),[]).then(({data:e})=>e),"v-4a6d2de2":()=>m(()=>import("./scale-out.html-ee1b6f09.js"),[]).then(({data:e})=>e),"v-3a0d4712":()=>m(()=>import("./tpcc-test.html-52f0c227.js"),[]).then(({data:e})=>e),"v-691e4b88":()=>m(()=>import("./tpch-test.html-78343832.js"),[]).then(({data:e})=>e),"v-98064128":()=>m(()=>import("./index.html-60aab00b.js"),[]).then(({data:e})=>e),"v-5879645e":()=>m(()=>import("./analyze.html-70b019fc.js"),[]).then(({data:e})=>e),"v-4ccaa7d8":()=>m(()=>import("./arch-htap.html-21a1bc97.js"),[]).then(({data:e})=>e),"v-14c84b4c":()=>m(()=>import("./arch-overview.html-dcc3d371.js"),[]).then(({data:e})=>e),"v-46e5eefa":()=>m(()=>import("./buffer-management.html-120b73ba.js"),[]).then(({data:e})=>e),"v-5cfdf98b":()=>m(()=>import("./ddl-synchronization.html-37f0cfaf.js"),[]).then(({data:e})=>e),"v-65697b4c":()=>m(()=>import("./logindex.html-1973076c.js"),[]).then(({data:e})=>e),"v-6edf83b7":()=>m(()=>import("./polar-sequence-tech.html-0de65483.js"),[]).then(({data:e})=>e),"v-2d0ad528":()=>m(()=>import("./index.html-f767cea4.js"),[]).then(({data:e})=>e),"v-3ec72c4e":()=>m(()=>import("./coding-style.html-b182657f.js"),[]).then(({data:e})=>e),"v-210f48a7":()=>m(()=>import("./contributing-polardb-docs.html-5c2bada8.js"),[]).then(({data:e})=>e),"v-aa672cb6":()=>m(()=>import("./contributing-polardb-kernel.html-54788b1e.js"),[]).then(({data:e})=>e),"v-55351ab4":()=>m(()=>import("./db-localfs.html-0d436603.js"),[]).then(({data:e})=>e),"v-71a5b926":()=>m(()=>import("./db-pfs-curve.html-210b20fc.js"),[]).then(({data:e})=>e),"v-b00a48e2":()=>m(()=>import("./db-pfs.html-79e35242.js"),[]).then(({data:e})=>e),"v-c6592cf8":()=>m(()=>import("./deploy-official.html-6001b0e7.js"),[]).then(({data:e})=>e),"v-3dba534a":()=>m(()=>import("./deploy-stack.html-d9f23f36.js"),[]).then(({data:e})=>e),"v-ccde9c94":()=>m(()=>import("./deploy.html-523aee49.js"),[]).then(({data:e})=>e),"v-63309b3e":()=>m(()=>import("./fs-pfs-curve.htm
l-99b42104.js"),[]).then(({data:e})=>e),"v-0aa4c420":()=>m(()=>import("./fs-pfs.html-0c262459.js"),[]).then(({data:e})=>e),"v-635e913a":()=>m(()=>import("./introduction.html-5114e518.js"),[]).then(({data:e})=>e),"v-7eb8feb3":()=>m(()=>import("./quick-start.html-fadf16d2.js"),[]).then(({data:e})=>e),"v-6c33fa62":()=>m(()=>import("./storage-aliyun-essd.html-82759337.js"),[]).then(({data:e})=>e),"v-65d024d1":()=>m(()=>import("./storage-ceph.html-27d081d7.js"),[]).then(({data:e})=>e),"v-7a570c87":()=>m(()=>import("./storage-curvebs.html-c5a165f0.js"),[]).then(({data:e})=>e),"v-04fef452":()=>m(()=>import("./storage-nbd.html-6a3a12bf.js"),[]).then(({data:e})=>e),"v-d69972ec":()=>m(()=>import("./customize-dev-env.html-f893e063.js"),[]).then(({data:e})=>e),"v-25b4c8ff":()=>m(()=>import("./dev-on-docker.html-fe137802.js"),[]).then(({data:e})=>e),"v-0bbe1b6a":()=>m(()=>import("./index.html-a68fc122.js"),[]).then(({data:e})=>e),"v-6fed01c8":()=>m(()=>import("./backup-and-restore.html-15f20d92.js"),[]).then(({data:e})=>e),"v-a8802f54":()=>m(()=>import("./cpu-usage-high.html-c7872413.js"),[]).then(({data:e})=>e),"v-a3c3fc30":()=>m(()=>import("./grow-storage.html-2006e829.js"),[]).then(({data:e})=>e),"v-13307193":()=>m(()=>import("./ro-online-promote.html-73ae6acb.js"),[]).then(({data:e})=>e),"v-4a816e3e":()=>m(()=>import("./scale-out.html-c244f53c.js"),[]).then(({data:e})=>e),"v-52b161a6":()=>m(()=>import("./tpcc-test.html-0f31266e.js"),[]).then(({data:e})=>e),"v-3b28df6b":()=>m(()=>import("./tpch-test.html-5f912467.js"),[]).then(({data:e})=>e),"v-7b6b229b":()=>m(()=>import("./index.html-ebe7d04c.js"),[]).then(({data:e})=>e),"v-28309dcf":()=>m(()=>import("./analyze.html-4db5cb7d.js"),[]).then(({data:e})=>e),"v-0b994909":()=>m(()=>import("./arch-htap.html-581ae188.js"),[]).then(({data:e})=>e),"v-7ce47b0b":()=>m(()=>import("./arch-overview.html-c15ab6a4.js"),[]).then(({data:e})=>e),"v-7ac661aa":()=>m(()=>import("./buffer-management.html-35dc0ba1.js"),[]).then(({data:e})=>e),"v-730
4dd08":()=>m(()=>import("./ddl-synchronization.html-9c478656.js"),[]).then(({data:e})=>e),"v-170991ee":()=>m(()=>import("./logindex.html-d1deed5e.js"),[]).then(({data:e})=>e),"v-4f41c8b0":()=>m(()=>import("./polar-sequence-tech.html-2a7cb868.js"),[]).then(({data:e})=>e),"v-7f44b843":()=>m(()=>import("./index.html-8e3e01b7.js"),[]).then(({data:e})=>e),"v-6024a2d1":()=>m(()=>import("./index.html-f4a2cc54.js"),[]).then(({data:e})=>e),"v-2a7736c4":()=>m(()=>import("./avail-online-promote.html-40b93d0d.js"),[]).then(({data:e})=>e),"v-18c2ec3b":()=>m(()=>import("./avail-parallel-replay.html-a035d420.js"),[]).then(({data:e})=>e),"v-4e16f0f0":()=>m(()=>import("./datamax.html-5138183e.js"),[]).then(({data:e})=>e),"v-bb50ce5c":()=>m(()=>import("./flashback-table.html-2404989b.js"),[]).then(({data:e})=>e),"v-4fd5d67a":()=>m(()=>import("./resource-manager.html-ea11f2ad.js"),[]).then(({data:e})=>e),"v-62087a8c":()=>m(()=>import("./index.html-e2ca5e7d.js"),[]).then(({data:e})=>e),"v-59700d71":()=>m(()=>import("./adaptive-scan.html-63d6e581.js"),[]).then(({data:e})=>e),"v-798d4bcc":()=>m(()=>import("./cluster-info.html-8592b599.js"),[]).then(({data:e})=>e),"v-5b4b4332":()=>m(()=>import("./epq-create-btree-index.html-c1e1de42.js"),[]).then(({data:e})=>e),"v-da223262":()=>m(()=>import("./epq-ctas-mtview-bulk-insert.html-ca602c4f.js"),[]).then(({data:e})=>e),"v-9aa77614":()=>m(()=>import("./epq-explain-analyze.html-948c1fdb.js"),[]).then(({data:e})=>e),"v-351ad83c":()=>m(()=>import("./epq-node-and-dop.html-2ee64cdd.js"),[]).then(({data:e})=>e),"v-5d5635bc":()=>m(()=>import("./epq-partitioned-table.html-804dd467.js"),[]).then(({data:e})=>e),"v-3f61fca0":()=>m(()=>import("./parallel-dml.html-ce10b755.js"),[]).then(({data:e})=>e),"v-9d84b310":()=>m(()=>import("./index.html-de3342c7.js"),[]).then(({data:e})=>e),"v-3c5bafa7":()=>m(()=>import("./pgvector.html-3c1132df.js"),[]).then(({data:e})=>e),"v-bc8fc3a4":()=>m(()=>import("./smlar.html-43cf50c7.js"),[]).then(({data:e})=>e),"v-ba4b3c7c"
:()=>m(()=>import("./index.html-ed802d58.js"),[]).then(({data:e})=>e),"v-0bb2232b":()=>m(()=>import("./bulk-read-and-extend.html-ecac2d9c.js"),[]).then(({data:e})=>e),"v-37c6fdad":()=>m(()=>import("./rel-size-cache.html-d3f30121.js"),[]).then(({data:e})=>e),"v-69fcb160":()=>m(()=>import("./shared-server.html-aa99c110.js"),[]).then(({data:e})=>e),"v-010157e8":()=>m(()=>import("./index.html-efbb7ed1.js"),[]).then(({data:e})=>e),"v-39aa8be0":()=>m(()=>import("./tde.html-ef77c890.js"),[]).then(({data:e})=>e),"v-3706649a":()=>m(()=>import("./404.html-60b35caa.js"),[]).then(({data:e})=>e)},wu=JSON.parse('{"base":"/PolarDB-for-PostgreSQL/","lang":"en-US","title":"","description":"","head":[["link",{"rel":"icon","href":"/PolarDB-for-PostgreSQL/favicon.ico"}]],"locales":{"/":{"lang":"en-US","title":"PolarDB for PostgreSQL","description":"A cloud-native database developed by Alibaba Cloud"},"/zh/":{"lang":"zh-CN","title":"PolarDB for PostgreSQL","description":"阿里云自主研发的云原生数据库"}}}');var Ru=([e,t,n])=>e==="meta"&&t.name?`${e}.${t.name}`:["title","base"].includes(e)?e:e==="template"&&t.id?`${e}.${t.id}`:JSON.stringify([e,t,n]),xu=e=>{const t=new Set,n=[];return e.forEach(r=>{const o=Ru(r);t.has(o)||(t.add(o),n.push(r))}),n},qn=e=>/^(https?:)?\/\//.test(e),Ou=e=>/^mailto:/.test(e),Du=e=>/^tel:/.test(e),zo=e=>Object.prototype.toString.call(e)==="[object Object]",il=e=>e[e.length-1]==="/"?e.slice(0,-1):e,sl=e=>e[0]==="/"?e.slice(1):e,ll=(e,t)=>{const n=Object.keys(e).sort((r,o)=>{const i=o.split("/").length-r.split("/").length;return i!==0?i:o.length-r.length});for(const r of n)if(t.startsWith(r))return r;return"/"},ki=(e,t="/")=>{const n=e.replace(/^(https?:)?\/\/[^/]*/,"");return n.startsWith(t)?`/${n.slice(t.length)}`:n};const 
al={"v-8daa1a0e":O(()=>m(()=>import("./index.html-390f5696.js"),[])),"v-64270bfa":O(()=>m(()=>import("./db-localfs.html-6fce4fb5.js"),[])),"v-20ec2a08":O(()=>m(()=>import("./db-pfs-curve.html-2c67fb2a.js"),[])),"v-2da78b44":O(()=>m(()=>import("./db-pfs.html-ec141362.js"),[])),"v-bca378d6":O(()=>m(()=>import("./deploy-official.html-56c4332d.js"),[])),"v-097f9dea":O(()=>m(()=>import("./deploy-stack.html-883c5c20.js"),[])),"v-4a7bdef6":O(()=>m(()=>import("./deploy.html-42673f52.js"),[])),"v-e8e53a66":O(()=>m(()=>import("./fs-pfs-curve.html-b215dfd2.js"),[])),"v-4bd622ef":O(()=>m(()=>import("./fs-pfs.html-fa03f7c3.js"),[])),"v-12a5021c":O(()=>m(()=>import("./introduction.html-606b1a82.js"),[])),"v-1ced8944":O(()=>m(()=>import("./quick-start.html-993589ee.js"),[])),"v-5a992740":O(()=>m(()=>import("./storage-aliyun-essd.html-f09c57cf.js"),[])),"v-e3a62740":O(()=>m(()=>import("./storage-ceph.html-fd9bfda4.js"),[])),"v-7f31e698":O(()=>m(()=>import("./storage-curvebs.html-e2572630.js"),[])),"v-c895df30":O(()=>m(()=>import("./storage-nbd.html-162d7e26.js"),[])),"v-43a2065f":O(()=>m(()=>import("./coding-style.html-f771a098.js"),[])),"v-2be11236":O(()=>m(()=>import("./contributing-polardb-docs.html-2f5025bd.js"),[])),"v-48520b74":O(()=>m(()=>import("./contributing-polardb-kernel.html-92d0b879.js"),[])),"v-c4fe9fca":O(()=>m(()=>import("./customize-dev-env.html-95ee07be.js"),[])),"v-2a8fa310":O(()=>m(()=>import("./dev-on-docker.html-36d2d71c.js"),[])),"v-7fdfc12a":O(()=>m(()=>import("./backup-and-restore.html-03a875b8.js"),[])),"v-530a6d12":O(()=>m(()=>import("./grow-storage.html-ae16c782.js"),[])),"v-4cbd0b64":O(()=>m(()=>import("./ro-online-promote.html-b639a7d1.js"),[])),"v-4a6d2de2":O(()=>m(()=>import("./scale-out.html-eed2da3b.js"),[])),"v-3a0d4712":O(()=>m(()=>import("./tpcc-test.html-b60e72ae.js"),[])),"v-691e4b88":O(()=>m(()=>import("./tpch-test.html-f7f8e1ad.js"),[])),"v-98064128":O(()=>m(()=>import("./index.html-cd3aa341.js"),[])),"v-5879645e":O(()=>m(()=>import("./anal
yze.html-877fc82a.js"),[])),"v-4ccaa7d8":O(()=>m(()=>import("./arch-htap.html-03506fa3.js"),[])),"v-14c84b4c":O(()=>m(()=>import("./arch-overview.html-c2599ebc.js"),["assets/arch-overview.html-c2599ebc.js","assets/9_future_pages-13873b1a.js"])),"v-46e5eefa":O(()=>m(()=>import("./buffer-management.html-d9b5fb2e.js"),["assets/buffer-management.html-d9b5fb2e.js","assets/9_future_pages-13873b1a.js"])),"v-5cfdf98b":O(()=>m(()=>import("./ddl-synchronization.html-dc62a732.js"),[])),"v-65697b4c":O(()=>m(()=>import("./logindex.html-2840dbbf.js"),[])),"v-6edf83b7":O(()=>m(()=>import("./polar-sequence-tech.html-569c0f3f.js"),[])),"v-2d0ad528":O(()=>m(()=>import("./index.html-b1951828.js"),[])),"v-3ec72c4e":O(()=>m(()=>import("./coding-style.html-181aff3b.js"),[])),"v-210f48a7":O(()=>m(()=>import("./contributing-polardb-docs.html-43544697.js"),[])),"v-aa672cb6":O(()=>m(()=>import("./contributing-polardb-kernel.html-9fffc22f.js"),[])),"v-55351ab4":O(()=>m(()=>import("./db-localfs.html-d7558701.js"),[])),"v-71a5b926":O(()=>m(()=>import("./db-pfs-curve.html-bc2859d9.js"),[])),"v-b00a48e2":O(()=>m(()=>import("./db-pfs.html-25c4f785.js"),[])),"v-c6592cf8":O(()=>m(()=>import("./deploy-official.html-d090475d.js"),[])),"v-3dba534a":O(()=>m(()=>import("./deploy-stack.html-b9d4cc47.js"),[])),"v-ccde9c94":O(()=>m(()=>import("./deploy.html-2951b18a.js"),[])),"v-63309b3e":O(()=>m(()=>import("./fs-pfs-curve.html-afd924fe.js"),[])),"v-0aa4c420":O(()=>m(()=>import("./fs-pfs.html-b712bf3a.js"),[])),"v-635e913a":O(()=>m(()=>import("./introduction.html-db3ff455.js"),[])),"v-7eb8feb3":O(()=>m(()=>import("./quick-start.html-b665e5e8.js"),[])),"v-6c33fa62":O(()=>m(()=>import("./storage-aliyun-essd.html-a35a0fec.js"),[])),"v-65d024d1":O(()=>m(()=>import("./storage-ceph.html-4c626f1d.js"),[])),"v-7a570c87":O(()=>m(()=>import("./storage-curvebs.html-a99e7740.js"),[])),"v-04fef452":O(()=>m(()=>import("./storage-nbd.html-97a12948.js"),[])),"v-d69972ec":O(()=>m(()=>import("./customize-dev-env.html-aa6a857
6.js"),[])),"v-25b4c8ff":O(()=>m(()=>import("./dev-on-docker.html-c045aaf0.js"),[])),"v-0bbe1b6a":O(()=>m(()=>import("./index.html-c24c33bc.js"),[])),"v-6fed01c8":O(()=>m(()=>import("./backup-and-restore.html-293288f7.js"),[])),"v-a8802f54":O(()=>m(()=>import("./cpu-usage-high.html-7366dfbc.js"),[])),"v-a3c3fc30":O(()=>m(()=>import("./grow-storage.html-f1072fd0.js"),[])),"v-13307193":O(()=>m(()=>import("./ro-online-promote.html-659e21ea.js"),[])),"v-4a816e3e":O(()=>m(()=>import("./scale-out.html-c7075237.js"),[])),"v-52b161a6":O(()=>m(()=>import("./tpcc-test.html-88535e70.js"),[])),"v-3b28df6b":O(()=>m(()=>import("./tpch-test.html-83b2e511.js"),[])),"v-7b6b229b":O(()=>m(()=>import("./index.html-66d290ab.js"),[])),"v-28309dcf":O(()=>m(()=>import("./analyze.html-f587193a.js"),[])),"v-0b994909":O(()=>m(()=>import("./arch-htap.html-b0e18587.js"),[])),"v-7ce47b0b":O(()=>m(()=>import("./arch-overview.html-ed106ad9.js"),["assets/arch-overview.html-ed106ad9.js","assets/9_future_pages-9e3b8fc6.js"])),"v-7ac661aa":O(()=>m(()=>import("./buffer-management.html-5ac35282.js"),["assets/buffer-management.html-5ac35282.js","assets/9_future_pages-9e3b8fc6.js"])),"v-7304dd08":O(()=>m(()=>import("./ddl-synchronization.html-bc052c77.js"),[])),"v-170991ee":O(()=>m(()=>import("./logindex.html-2ff46a28.js"),[])),"v-4f41c8b0":O(()=>m(()=>import("./polar-sequence-tech.html-a8c531ba.js"),[])),"v-7f44b843":O(()=>m(()=>import("./index.html-d6e90735.js"),[])),"v-6024a2d1":O(()=>m(()=>import("./index.html-a1b339d0.js"),[])),"v-2a7736c4":O(()=>m(()=>import("./avail-online-promote.html-21127e10.js"),[])),"v-18c2ec3b":O(()=>m(()=>import("./avail-parallel-replay.html-2136f786.js"),[])),"v-4e16f0f0":O(()=>m(()=>import("./datamax.html-2f4105a2.js"),[])),"v-bb50ce5c":O(()=>m(()=>import("./flashback-table.html-f52a25dd.js"),[])),"v-4fd5d67a":O(()=>m(()=>import("./resource-manager.html-1cf58c68.js"),[])),"v-62087a8c":O(()=>m(()=>import("./index.html-4d0beb35.js"),[])),"v-59700d71":O(()=>m(()=>import("./ad
aptive-scan.html-a651e93c.js"),[])),"v-798d4bcc":O(()=>m(()=>import("./cluster-info.html-987f49a8.js"),[])),"v-5b4b4332":O(()=>m(()=>import("./epq-create-btree-index.html-86dfc866.js"),[])),"v-da223262":O(()=>m(()=>import("./epq-ctas-mtview-bulk-insert.html-120d6540.js"),[])),"v-9aa77614":O(()=>m(()=>import("./epq-explain-analyze.html-c636bd81.js"),[])),"v-351ad83c":O(()=>m(()=>import("./epq-node-and-dop.html-bb13ee52.js"),[])),"v-5d5635bc":O(()=>m(()=>import("./epq-partitioned-table.html-bde50aed.js"),[])),"v-3f61fca0":O(()=>m(()=>import("./parallel-dml.html-fe244403.js"),[])),"v-9d84b310":O(()=>m(()=>import("./index.html-f9a07053.js"),[])),"v-3c5bafa7":O(()=>m(()=>import("./pgvector.html-74643e40.js"),[])),"v-bc8fc3a4":O(()=>m(()=>import("./smlar.html-e8b6a5e2.js"),[])),"v-ba4b3c7c":O(()=>m(()=>import("./index.html-2e0bfe16.js"),[])),"v-0bb2232b":O(()=>m(()=>import("./bulk-read-and-extend.html-586617cf.js"),[])),"v-37c6fdad":O(()=>m(()=>import("./rel-size-cache.html-0ff52651.js"),[])),"v-69fcb160":O(()=>m(()=>import("./shared-server.html-5057af3a.js"),[])),"v-010157e8":O(()=>m(()=>import("./index.html-c8d3a2fb.js"),[])),"v-39aa8be0":O(()=>m(()=>import("./tde.html-babd189d.js"),[])),"v-3706649a":O(()=>m(()=>import("./404.html-66349191.js"),[]))};var Iu=Symbol(""),cl=Symbol(""),Cu=Hn({key:"",path:"",title:"",lang:"",frontmatter:{},headers:[]}),Wt=()=>{const e=Re(cl);if(!e)throw new Error("pageData() is called without provider.");return e},ul=Symbol(""),vt=()=>{const e=Re(ul);if(!e)throw new Error("usePageFrontmatter() is called without provider.");return e},fl=Symbol(""),Su=()=>{const e=Re(fl);if(!e)throw new Error("usePageHead() is called without provider.");return e},ku=Symbol(""),dl=Symbol(""),hl=()=>{const e=Re(dl);if(!e)throw new Error("usePageLang() is called without provider.");return e},pl=Symbol(""),zu=()=>{const e=Re(pl);if(!e)throw new Error("usePageLayout() is called without provider.");return e},$u=Le(Au),$o=Symbol(""),Un=()=>{const 
e=Re($o);if(!e)throw new Error("useRouteLocale() is called without provider.");return e},tn=Le(wu),ml=()=>tn,vl=Symbol(""),Vo=()=>{const e=Re(vl);if(!e)throw new Error("useSiteLocaleData() is called without provider.");return e},Vu=Symbol(""),Mu="Layout",Nu="NotFound",ht=Nn({resolveLayouts:e=>e.reduce((t,n)=>({...t,...n.layouts}),{}),resolvePageData:async e=>{const t=$u.value[e];return await(t==null?void 0:t())??Cu},resolvePageFrontmatter:e=>e.frontmatter,resolvePageHead:(e,t,n)=>{const r=me(t.description)?t.description:n.description,o=[...G(t.head)?t.head:[],...n.head,["title",{},e],["meta",{name:"description",content:r}]];return xu(o)},resolvePageHeadTitle:(e,t)=>[e.title,t.title].filter(n=>!!n).join(" | "),resolvePageLang:(e,t)=>e.lang||t.lang||"en-US",resolvePageLayout:(e,t)=>{let n;if(e.path){const r=e.frontmatter.layout;me(r)?n=r:n=Mu}else n=Nu;return t[n]},resolveRouteLocale:(e,t)=>ll(e,t),resolveSiteLocaleData:(e,t)=>({...e,...e.locales[t]})}),Mo=ue({name:"ClientOnly",setup(e,t){const n=Le(!1);return Ge(()=>{n.value=!0}),()=>{var r,o;return n.value?(o=(r=t.slots).default)==null?void 0:o.call(r):null}}}),Hu=ue({name:"Content",props:{pageKey:{type:String,required:!1,default:""}},setup(e){const t=Wt(),n=q(()=>al[e.pageKey||t.value.key]);return()=>n.value?_e(n.value):_e("div","404 Not Found")}}),Et=(e={})=>e,No=e=>qn(e)?e:`/PolarDB-for-PostgreSQL/${sl(e)}`;function Ho(e,t,n){var r,o,i;t===void 0&&(t=50),n===void 0&&(n={});var s=(r=n.isImmediate)!=null&&r,l=(o=n.callback)!=null&&o,a=n.maxWait,c=Date.now(),u=[];function f(){if(a!==void 0){var _=Date.now()-c;if(_+t>=a)return a-_}return t}var h=function(){var _=[].slice.call(arguments),E=this;return new Promise(function(T,R){var g=s&&i===void 0;if(i!==void 0&&clearTimeout(i),i=setTimeout(function(){if(i=void 0,c=Date.now(),!s){var C=e.apply(E,_);l&&l(C),u.forEach(function(I){return(0,I.resolve)(C)}),u=[]}},f()),g){var y=e.apply(E,_);return l&&l(y),T(y)}u.push({resolve:T,reject:R})})};return 
h.cancel=function(_){i!==void 0&&clearTimeout(i),u.forEach(function(E){return(0,E.reject)(_)}),u=[]},h}/*! + * vue-router v4.2.4 + * (c) 2023 Eduardo San Martin Morote + * @license MIT + */const en=typeof window<"u";function Bu(e){return e.__esModule||e[Symbol.toStringTag]==="Module"}const ve=Object.assign;function qr(e,t){const n={};for(const r in t){const o=t[r];n[r]=st(o)?o.map(e):e(o)}return n}const xn=()=>{},st=Array.isArray,Fu=/\/$/,ju=e=>e.replace(Fu,"");function Ur(e,t,n="/"){let r,o={},i="",s="";const l=t.indexOf("#");let a=t.indexOf("?");return l=0&&(a=-1),a>-1&&(r=t.slice(0,a),i=t.slice(a+1,l>-1?l:t.length),o=e(i)),l>-1&&(r=r||t.slice(0,l),s=t.slice(l,t.length)),r=Wu(r??t,n),{fullPath:r+(i&&"?")+i+s,path:r,query:o,hash:s}}function qu(e,t){const n=t.query?e(t.query):"";return t.path+(n&&"?")+n+(t.hash||"")}function zi(e,t){return!t||!e.toLowerCase().startsWith(t.toLowerCase())?e:e.slice(t.length)||"/"}function Uu(e,t,n){const r=t.matched.length-1,o=n.matched.length-1;return r>-1&&r===o&&dn(t.matched[r],n.matched[o])&&_l(t.params,n.params)&&e(t.query)===e(n.query)&&t.hash===n.hash}function dn(e,t){return(e.aliasOf||e)===(t.aliasOf||t)}function _l(e,t){if(Object.keys(e).length!==Object.keys(t).length)return!1;for(const n in e)if(!Ku(e[n],t[n]))return!1;return!0}function Ku(e,t){return st(e)?$i(e,t):st(t)?$i(t,e):e===t}function $i(e,t){return st(t)?e.length===t.length&&e.every((n,r)=>n===t[r]):e.length===1&&e[0]===t}function Wu(e,t){if(e.startsWith("/"))return e;if(!e)return t;const n=t.split("/"),r=e.split("/"),o=r[r.length-1];(o===".."||o===".")&&r.push("");let i=n.length-1,s,l;for(s=0;s1&&i--;else break;return n.slice(0,i).join("/")+"/"+r.slice(s-(s===r.length?1:0)).join("/")}var zn;(function(e){e.pop="pop",e.push="push"})(zn||(zn={}));var On;(function(e){e.back="back",e.forward="forward",e.unknown=""})(On||(On={}));function Qu(e){if(!e)if(en){const t=document.querySelector("base");e=t&&t.getAttribute("href")||"/",e=e.replace(/^\w+:\/\/[^\/]+/,"")}else 
e="/";return e[0]!=="/"&&e[0]!=="#"&&(e="/"+e),ju(e)}const Yu=/^[^#]+#/;function Gu(e,t){return e.replace(Yu,"#")+t}function Ju(e,t){const n=document.documentElement.getBoundingClientRect(),r=e.getBoundingClientRect();return{behavior:t.behavior,left:r.left-n.left-(t.left||0),top:r.top-n.top-(t.top||0)}}const Cr=()=>({left:window.pageXOffset,top:window.pageYOffset});function Zu(e){let t;if("el"in e){const n=e.el,r=typeof n=="string"&&n.startsWith("#"),o=typeof n=="string"?r?document.getElementById(n.slice(1)):document.querySelector(n):n;if(!o)return;t=Ju(o,e)}else t=e;"scrollBehavior"in document.documentElement.style?window.scrollTo(t):window.scrollTo(t.left!=null?t.left:window.pageXOffset,t.top!=null?t.top:window.pageYOffset)}function Vi(e,t){return(history.state?history.state.position-t:-1)+e}const ao=new Map;function Xu(e,t){ao.set(e,t)}function ef(e){const t=ao.get(e);return ao.delete(e),t}let tf=()=>location.protocol+"//"+location.host;function gl(e,t){const{pathname:n,search:r,hash:o}=t,i=e.indexOf("#");if(i>-1){let l=o.includes(e.slice(i))?e.slice(i).length:1,a=o.slice(l);return a[0]!=="/"&&(a="/"+a),zi(a,"")}return zi(n,e)+r+o}function nf(e,t,n,r){let o=[],i=[],s=null;const l=({state:h})=>{const _=gl(e,location),E=n.value,T=t.value;let R=0;if(h){if(n.value=_,t.value=h,s&&s===E){s=null;return}R=T?h.position-T.position:0}else r(_);o.forEach(g=>{g(n.value,E,{delta:R,type:zn.pop,direction:R?R>0?On.forward:On.back:On.unknown})})};function a(){s=n.value}function c(h){o.push(h);const _=()=>{const E=o.indexOf(h);E>-1&&o.splice(E,1)};return i.push(_),_}function u(){const{history:h}=window;h.state&&h.replaceState(ve({},h.state,{scroll:Cr()}),"")}function f(){for(const h of i)h();i=[],window.removeEventListener("popstate",l),window.removeEventListener("beforeunload",u)}return window.addEventListener("popstate",l),window.addEventListener("beforeunload",u,{passive:!0}),{pauseListeners:a,listen:c,destroy:f}}function 
Mi(e,t,n,r=!1,o=!1){return{back:e,current:t,forward:n,replaced:r,position:window.history.length,scroll:o?Cr():null}}function rf(e){const{history:t,location:n}=window,r={value:gl(e,n)},o={value:t.state};o.value||i(r.value,{back:null,current:r.value,forward:null,position:t.length-1,replaced:!0,scroll:null},!0);function i(a,c,u){const f=e.indexOf("#"),h=f>-1?(n.host&&document.querySelector("base")?e:e.slice(f))+a:tf()+e+a;try{t[u?"replaceState":"pushState"](c,"",h),o.value=c}catch(_){console.error(_),n[u?"replace":"assign"](h)}}function s(a,c){const u=ve({},t.state,Mi(o.value.back,a,o.value.forward,!0),c,{position:o.value.position});i(a,u,!0),r.value=a}function l(a,c){const u=ve({},o.value,t.state,{forward:a,scroll:Cr()});i(u.current,u,!0);const f=ve({},Mi(r.value,a,null),{position:u.position+1},c);i(a,f,!1),r.value=a}return{location:r,state:o,push:l,replace:s}}function of(e){e=Qu(e);const t=rf(e),n=nf(e,t.state,t.location,t.replace);function r(i,s=!0){s||n.pauseListeners(),history.go(i)}const o=ve({location:"",base:e,go:r,createHref:Gu.bind(null,e)},t,n);return Object.defineProperty(o,"location",{enumerable:!0,get:()=>t.location.value}),Object.defineProperty(o,"state",{enumerable:!0,get:()=>t.state.value}),o}function sf(e){return typeof e=="string"||e&&typeof e=="object"}function bl(e){return typeof e=="string"||typeof e=="symbol"}const pt={path:"/",name:void 0,params:{},query:{},hash:"",fullPath:"/",matched:[],meta:{},redirectedFrom:void 0},yl=Symbol("");var Ni;(function(e){e[e.aborted=4]="aborted",e[e.cancelled=8]="cancelled",e[e.duplicated=16]="duplicated"})(Ni||(Ni={}));function hn(e,t){return ve(new Error,{type:e,[yl]:!0},t)}function dt(e,t){return e instanceof Error&&yl in e&&(t==null||!!(e.type&t))}const Hi="[^/]+?",lf={sensitive:!1,strict:!1,start:!0,end:!0},af=/[.+*?^${}()[\]/\\]/g;function cf(e,t){const n=ve({},lf,t),r=[];let o=n.start?"^":"";const i=[];for(const c of e){const u=c.length?[]:[90];n.strict&&!c.length&&(o+="/");for(let 
f=0;ft.length?t.length===1&&t[0]===40+40?1:-1:0}function ff(e,t){let n=0;const r=e.score,o=t.score;for(;n0&&t[t.length-1]<0}const df={type:0,value:""},hf=/[a-zA-Z0-9_]/;function pf(e){if(!e)return[[]];if(e==="/")return[[df]];if(!e.startsWith("/"))throw new Error(`Invalid path "${e}"`);function t(_){throw new Error(`ERR (${n})/"${c}": ${_}`)}let n=0,r=n;const o=[];let i;function s(){i&&o.push(i),i=[]}let l=0,a,c="",u="";function f(){c&&(n===0?i.push({type:0,value:c}):n===1||n===2||n===3?(i.length>1&&(a==="*"||a==="+")&&t(`A repeatable param (${c}) must be alone in its segment. eg: '/:ids+.`),i.push({type:1,value:c,regexp:u,repeatable:a==="*"||a==="+",optional:a==="*"||a==="?"})):t("Invalid state to consume buffer"),c="")}function h(){c+=a}for(;l{s(y)}:xn}function s(u){if(bl(u)){const f=r.get(u);f&&(r.delete(u),n.splice(n.indexOf(f),1),f.children.forEach(s),f.alias.forEach(s))}else{const f=n.indexOf(u);f>-1&&(n.splice(f,1),u.record.name&&r.delete(u.record.name),u.children.forEach(s),u.alias.forEach(s))}}function l(){return n}function a(u){let f=0;for(;f=0&&(u.record.path!==n[f].record.path||!El(u,n[f]));)f++;n.splice(f,0,u),u.record.name&&!ji(u)&&r.set(u.record.name,u)}function c(u,f){let h,_={},E,T;if("name"in u&&u.name){if(h=r.get(u.name),!h)throw hn(1,{location:u});T=h.record.name,_=ve(Fi(f.params,h.keys.filter(y=>!y.optional).map(y=>y.name)),u.params&&Fi(u.params,h.keys.map(y=>y.name))),E=h.stringify(_)}else if("path"in u)E=u.path,h=n.find(y=>y.re.test(E)),h&&(_=h.parse(E),T=h.record.name);else{if(h=f.name?r.get(f.name):n.find(y=>y.re.test(f.path)),!h)throw hn(1,{location:u,currentLocation:f});T=h.record.name,_=ve({},f.params,u.params),E=h.stringify(_)}const R=[];let g=h;for(;g;)R.unshift(g.record),g=g.parent;return{name:T,path:E,params:_,matched:R,meta:bf(R)}}return e.forEach(u=>i(u)),{addRoute:i,resolve:c,removeRoute:s,getRoutes:l,getRecordMatcher:o}}function Fi(e,t){const n={};for(const r of t)r in e&&(n[r]=e[r]);return n}function 
_f(e){return{path:e.path,redirect:e.redirect,name:e.name,meta:e.meta||{},aliasOf:void 0,beforeEnter:e.beforeEnter,props:gf(e),children:e.children||[],instances:{},leaveGuards:new Set,updateGuards:new Set,enterCallbacks:{},components:"components"in e?e.components||null:e.component&&{default:e.component}}}function gf(e){const t={},n=e.props||!1;if("component"in e)t.default=n;else for(const r in e.components)t[r]=typeof n=="object"?n[r]:n;return t}function ji(e){for(;e;){if(e.record.aliasOf)return!0;e=e.parent}return!1}function bf(e){return e.reduce((t,n)=>ve(t,n.meta),{})}function qi(e,t){const n={};for(const r in e)n[r]=r in t?t[r]:e[r];return n}function El(e,t){return t.children.some(n=>n===e||El(e,n))}const Ll=/#/g,yf=/&/g,Ef=/\//g,Lf=/=/g,Tf=/\?/g,Tl=/\+/g,Pf=/%5B/g,Af=/%5D/g,Pl=/%5E/g,wf=/%60/g,Al=/%7B/g,Rf=/%7C/g,wl=/%7D/g,xf=/%20/g;function Bo(e){return encodeURI(""+e).replace(Rf,"|").replace(Pf,"[").replace(Af,"]")}function Of(e){return Bo(e).replace(Al,"{").replace(wl,"}").replace(Pl,"^")}function co(e){return Bo(e).replace(Tl,"%2B").replace(xf,"+").replace(Ll,"%23").replace(yf,"%26").replace(wf,"`").replace(Al,"{").replace(wl,"}").replace(Pl,"^")}function Df(e){return co(e).replace(Lf,"%3D")}function If(e){return Bo(e).replace(Ll,"%23").replace(Tf,"%3F")}function Cf(e){return e==null?"":If(e).replace(Ef,"%2F")}function gr(e){try{return decodeURIComponent(""+e)}catch{}return""+e}function Sf(e){const t={};if(e===""||e==="?")return t;const r=(e[0]==="?"?e.slice(1):e).split("&");for(let o=0;oi&&co(i)):[r&&co(r)]).forEach(i=>{i!==void 0&&(t+=(t.length?"&":"")+n,i!=null&&(t+="="+i))})}return t}function kf(e){const t={};for(const n in e){const r=e[n];r!==void 0&&(t[n]=st(r)?r.map(o=>o==null?null:""+o):r==null?r:""+r)}return t}const zf=Symbol(""),Ki=Symbol(""),Sr=Symbol(""),Fo=Symbol(""),uo=Symbol("");function yn(){let e=[];function t(r){return e.push(r),()=>{const o=e.indexOf(r);o>-1&&e.splice(o,1)}}function 
n(){e=[]}return{add:t,list:()=>e.slice(),reset:n}}function Ot(e,t,n,r,o){const i=r&&(r.enterCallbacks[o]=r.enterCallbacks[o]||[]);return()=>new Promise((s,l)=>{const a=f=>{f===!1?l(hn(4,{from:n,to:t})):f instanceof Error?l(f):sf(f)?l(hn(2,{from:t,to:f})):(i&&r.enterCallbacks[o]===i&&typeof f=="function"&&i.push(f),s())},c=e.call(r&&r.instances[o],t,n,a);let u=Promise.resolve(c);e.length<3&&(u=u.then(a)),u.catch(f=>l(f))})}function Kr(e,t,n,r){const o=[];for(const i of e)for(const s in i.components){let l=i.components[s];if(!(t!=="beforeRouteEnter"&&!i.instances[s]))if($f(l)){const c=(l.__vccOpts||l)[t];c&&o.push(Ot(c,n,r,i,s))}else{let a=l();o.push(()=>a.then(c=>{if(!c)return Promise.reject(new Error(`Couldn't resolve component "${s}" at "${i.path}"`));const u=Bu(c)?c.default:c;i.components[s]=u;const h=(u.__vccOpts||u)[t];return h&&Ot(h,n,r,i,s)()}))}}return o}function $f(e){return typeof e=="object"||"displayName"in e||"props"in e||"__vccOpts"in e}function Wi(e){const t=Re(Sr),n=Re(Fo),r=q(()=>t.resolve(ee(e.to))),o=q(()=>{const{matched:a}=r.value,{length:c}=a,u=a[c-1],f=n.matched;if(!u||!f.length)return-1;const h=f.findIndex(dn.bind(null,u));if(h>-1)return h;const _=Qi(a[c-2]);return c>1&&Qi(u)===_&&f[f.length-1].path!==_?f.findIndex(dn.bind(null,a[c-2])):h}),i=q(()=>o.value>-1&&Hf(n.params,r.value.params)),s=q(()=>o.value>-1&&o.value===n.matched.length-1&&_l(n.params,r.value.params));function l(a={}){return Nf(a)?t[ee(e.replace)?"replace":"push"](ee(e.to)).catch(xn):Promise.resolve()}return{route:r,href:q(()=>r.value.href),isActive:i,isExactActive:s,navigate:l}}const Vf=ue({name:"RouterLink",compatConfig:{MODE:3},props:{to:{type:[String,Object],required:!0},replace:Boolean,activeClass:String,exactActiveClass:String,custom:Boolean,ariaCurrentValue:{type:String,default:"page"}},useLink:Wi,setup(e,{slots:t}){const 
n=Nn(Wi(e)),{options:r}=Re(Sr),o=q(()=>({[Yi(e.activeClass,r.linkActiveClass,"router-link-active")]:n.isActive,[Yi(e.exactActiveClass,r.linkExactActiveClass,"router-link-exact-active")]:n.isExactActive}));return()=>{const i=t.default&&t.default(n);return e.custom?i:_e("a",{"aria-current":n.isExactActive?e.ariaCurrentValue:null,href:n.href,onClick:n.navigate,class:o.value},i)}}}),Mf=Vf;function Nf(e){if(!(e.metaKey||e.altKey||e.ctrlKey||e.shiftKey)&&!e.defaultPrevented&&!(e.button!==void 0&&e.button!==0)){if(e.currentTarget&&e.currentTarget.getAttribute){const t=e.currentTarget.getAttribute("target");if(/\b_blank\b/i.test(t))return}return e.preventDefault&&e.preventDefault(),!0}}function Hf(e,t){for(const n in t){const r=t[n],o=e[n];if(typeof r=="string"){if(r!==o)return!1}else if(!st(o)||o.length!==r.length||r.some((i,s)=>i!==o[s]))return!1}return!0}function Qi(e){return e?e.aliasOf?e.aliasOf.path:e.path:""}const Yi=(e,t,n)=>e??t??n,Bf=ue({name:"RouterView",inheritAttrs:!1,props:{name:{type:String,default:"default"},route:Object},compatConfig:{MODE:3},setup(e,{attrs:t,slots:n}){const r=Re(uo),o=q(()=>e.route||r.value),i=Re(Ki,0),s=q(()=>{let c=ee(i);const{matched:u}=o.value;let f;for(;(f=u[c])&&!f.components;)c++;return c}),l=q(()=>o.value.matched[s.value]);Ut(Ki,q(()=>s.value+1)),Ut(zf,l),Ut(uo,o);const a=Le();return et(()=>[a.value,l.value,e.name],([c,u,f],[h,_,E])=>{u&&(u.instances[f]=c,_&&_!==u&&c&&c===h&&(u.leaveGuards.size||(u.leaveGuards=_.leaveGuards),u.updateGuards.size||(u.updateGuards=_.updateGuards))),c&&u&&(!_||!dn(u,_)||!h)&&(u.enterCallbacks[f]||[]).forEach(T=>T(c))},{flush:"post"}),()=>{const c=o.value,u=e.name,f=l.value,h=f&&f.components[u];if(!h)return Gi(n.default,{Component:h,route:c});const _=f.props[u],E=_?_===!0?c.params:typeof _=="function"?_(c):_:null,R=_e(h,ve({},E,t,{onVnodeUnmounted:g=>{g.component.isUnmounted&&(f.instances[u]=null)},ref:a}));return Gi(n.default,{Component:R,route:c})||R}}});function Gi(e,t){if(!e)return null;const 
n=e(t);return n.length===1?n[0]:n}const Rl=Bf;function Ff(e){const t=vf(e.routes,e),n=e.parseQuery||Sf,r=e.stringifyQuery||Ui,o=e.history,i=yn(),s=yn(),l=yn(),a=Ro(pt);let c=pt;en&&e.scrollBehavior&&"scrollRestoration"in history&&(history.scrollRestoration="manual");const u=qr.bind(null,P=>""+P),f=qr.bind(null,Cf),h=qr.bind(null,gr);function _(P,F){let $,Q;return bl(P)?($=t.getRecordMatcher(P),Q=F):Q=P,t.addRoute(Q,$)}function E(P){const F=t.getRecordMatcher(P);F&&t.removeRoute(F)}function T(){return t.getRoutes().map(P=>P.record)}function R(P){return!!t.getRecordMatcher(P)}function g(P,F){if(F=ve({},F||a.value),typeof P=="string"){const b=Ur(n,P,F.path),L=t.resolve({path:b.path},F),A=o.createHref(b.fullPath);return ve(b,L,{params:h(L.params),hash:gr(b.hash),redirectedFrom:void 0,href:A})}let $;if("path"in P)$=ve({},P,{path:Ur(n,P.path,F.path).path});else{const b=ve({},P.params);for(const L in b)b[L]==null&&delete b[L];$=ve({},P,{params:f(b)}),F.params=f(F.params)}const Q=t.resolve($,F),ce=P.hash||"";Q.params=u(h(Q.params));const d=qu(r,ve({},P,{hash:Of(ce),path:Q.path})),p=o.createHref(d);return ve({fullPath:d,hash:ce,query:r===Ui?kf(P.query):P.query||{}},Q,{redirectedFrom:void 0,href:p})}function y(P){return typeof P=="string"?Ur(n,P,a.value.path):ve({},P)}function C(P,F){if(c!==P)return hn(8,{from:F,to:P})}function I(P){return M(P)}function W(P){return I(ve(y(P),{replace:!0}))}function te(P){const F=P.matched[P.matched.length-1];if(F&&F.redirect){const{redirect:$}=F;let Q=typeof $=="function"?$(P):$;return typeof Q=="string"&&(Q=Q.includes("?")||Q.includes("#")?Q=y(Q):{path:Q},Q.params={}),ve({query:P.query,hash:P.hash,params:"path"in Q?{}:P.params},Q)}}function M(P,F){const $=c=g(P),Q=a.value,ce=P.state,d=P.force,p=P.replace===!0,b=te($);if(b)return M(ve(y(b),{state:typeof b=="object"?ve({},ce,b.state):ce,force:d,replace:p}),F||$);const L=$;L.redirectedFrom=F;let 
A;return!d&&Uu(r,Q,$)&&(A=hn(16,{to:L,from:Q}),je(Q,Q,!0,!1)),(A?Promise.resolve(A):N(L,Q)).catch(x=>dt(x)?dt(x,2)?x:ze(x):ie(x,L,Q)).then(x=>{if(x){if(dt(x,2))return M(ve({replace:p},y(x.to),{state:typeof x.to=="object"?ve({},ce,x.to.state):ce,force:d}),F||L)}else x=w(L,Q,!0,p,ce);return Y(L,Q,x),x})}function v(P,F){const $=C(P,F);return $?Promise.reject($):Promise.resolve()}function j(P){const F=Tt.values().next().value;return F&&typeof F.runWithContext=="function"?F.runWithContext(P):P()}function N(P,F){let $;const[Q,ce,d]=jf(P,F);$=Kr(Q.reverse(),"beforeRouteLeave",P,F);for(const b of Q)b.leaveGuards.forEach(L=>{$.push(Ot(L,P,F))});const p=v.bind(null,P,F);return $.push(p),Se($).then(()=>{$=[];for(const b of i.list())$.push(Ot(b,P,F));return $.push(p),Se($)}).then(()=>{$=Kr(ce,"beforeRouteUpdate",P,F);for(const b of ce)b.updateGuards.forEach(L=>{$.push(Ot(L,P,F))});return $.push(p),Se($)}).then(()=>{$=[];for(const b of d)if(b.beforeEnter)if(st(b.beforeEnter))for(const L of b.beforeEnter)$.push(Ot(L,P,F));else $.push(Ot(b.beforeEnter,P,F));return $.push(p),Se($)}).then(()=>(P.matched.forEach(b=>b.enterCallbacks={}),$=Kr(d,"beforeRouteEnter",P,F),$.push(p),Se($))).then(()=>{$=[];for(const b of s.list())$.push(Ot(b,P,F));return $.push(p),Se($)}).catch(b=>dt(b,8)?b:Promise.reject(b))}function Y(P,F,$){l.list().forEach(Q=>j(()=>Q(P,F,$)))}function w(P,F,$,Q,ce){const d=C(P,F);if(d)return d;const p=F===pt,b=en?history.state:{};$&&(Q||p?o.replace(P.fullPath,ve({scroll:p&&b&&b.scroll},ce)):o.push(P.fullPath,ce)),a.value=P,je(P,F,$,p),ze()}let k;function z(){k||(k=o.listen((P,F,$)=>{if(!lt.listening)return;const Q=g(P),ce=te(Q);if(ce){M(ve(ce,{replace:!0}),Q).catch(xn);return}c=Q;const 
d=a.value;en&&Xu(Vi(d.fullPath,$.delta),Cr()),N(Q,d).catch(p=>dt(p,12)?p:dt(p,2)?(M(p.to,Q).then(b=>{dt(b,20)&&!$.delta&&$.type===zn.pop&&o.go(-1,!1)}).catch(xn),Promise.reject()):($.delta&&o.go(-$.delta,!1),ie(p,Q,d))).then(p=>{p=p||w(Q,d,!1),p&&($.delta&&!dt(p,8)?o.go(-$.delta,!1):$.type===zn.pop&&dt(p,20)&&o.go(-1,!1)),Y(Q,d,p)}).catch(xn)}))}let le=yn(),U=yn(),oe;function ie(P,F,$){ze(P);const Q=U.list();return Q.length?Q.forEach(ce=>ce(P,F,$)):console.error(P),Promise.reject(P)}function Me(){return oe&&a.value!==pt?Promise.resolve():new Promise((P,F)=>{le.add([P,F])})}function ze(P){return oe||(oe=!P,z(),le.list().forEach(([F,$])=>P?$(P):F()),le.reset()),P}function je(P,F,$,Q){const{scrollBehavior:ce}=e;if(!en||!ce)return Promise.resolve();const d=!$&&ef(Vi(P.fullPath,0))||(Q||!$)&&history.state&&history.state.scroll||null;return Pr().then(()=>ce(P,F,d)).then(p=>p&&Zu(p)).catch(p=>ie(p,P,F))}const Ne=P=>o.go(P);let Lt;const Tt=new Set,lt={currentRoute:a,listening:!0,addRoute:_,removeRoute:E,hasRoute:R,getRoutes:T,resolve:g,options:e,push:I,replace:W,go:Ne,back:()=>Ne(-1),forward:()=>Ne(1),beforeEach:i.add,beforeResolve:s.add,afterEach:l.add,onError:U.add,isReady:Me,install(P){const F=this;P.component("RouterLink",Mf),P.component("RouterView",Rl),P.config.globalProperties.$router=F,Object.defineProperty(P.config.globalProperties,"$route",{enumerable:!0,get:()=>ee(a)}),en&&!Lt&&a.value===pt&&(Lt=!0,I(o.location).catch(ce=>{}));const $={};for(const ce in pt)Object.defineProperty($,ce,{get:()=>a.value[ce],enumerable:!0});P.provide(Sr,F),P.provide(Fo,ws($)),P.provide(uo,a);const Q=P.unmount;Tt.add(P),P.unmount=function(){Tt.delete(P),Tt.size<1&&(c=pt,k&&k(),k=null,a.value=pt,Lt=!1,oe=!1),Q()}}};function Se(P){return P.reduce((F,$)=>F.then(()=>j($)),Promise.resolve())}return lt}function jf(e,t){const n=[],r=[],o=[],i=Math.max(t.matched.length,e.matched.length);for(let s=0;sdn(c,l))?r.push(l):n.push(l));const 
a=e.matched[s];a&&(t.matched.find(c=>dn(c,a))||o.push(a))}return[n,r,o]}function Yt(){return Re(Sr)}function Gt(){return Re(Fo)}const qf=({headerLinkSelector:e,headerAnchorSelector:t,delay:n,offset:r=5})=>{const o=Yt(),s=Ho(()=>{var T,R;const l=Math.max(window.scrollY,document.documentElement.scrollTop,document.body.scrollTop);if(Math.abs(l-0)h.some(y=>y.hash===g.hash));for(let g=0;g=(((T=y.parentElement)==null?void 0:T.offsetTop)??0)-r,W=!C||l<(((R=C.parentElement)==null?void 0:R.offsetTop)??0)-r;if(!(I&&W))continue;const M=decodeURIComponent(o.currentRoute.value.hash),v=decodeURIComponent(y.hash);if(M===v)return;if(f){for(let j=g+1;j{window.addEventListener("scroll",s)}),xr(()=>{window.removeEventListener("scroll",s)})},Ji=async(e,t)=>{const{scrollBehavior:n}=e.options;e.options.scrollBehavior=void 0,await e.replace({query:e.currentRoute.value.query,hash:t}).finally(()=>e.options.scrollBehavior=n)},Uf="a.sidebar-item",Kf=".header-anchor",Wf=300,Qf=5,Yf=Et({setup(){qf({headerLinkSelector:Uf,headerAnchorSelector:Kf,delay:Wf,offset:Qf})}}),Zi=()=>window.pageYOffset||document.documentElement.scrollTop||document.body.scrollTop||0,Gf=()=>window.scrollTo({top:0,behavior:"smooth"});const Jf=ue({name:"BackToTop",setup(){const e=Le(0),t=q(()=>e.value>300),n=Ho(()=>{e.value=Zi()},100);Ge(()=>{e.value=Zi(),window.addEventListener("scroll",()=>n())});const r=_e("div",{class:"back-to-top",onClick:Gf});return()=>_e(jn,{name:"back-to-top"},()=>t.value?r:null)}}),Zf=Et({rootComponents:[Jf]});const Xf=_e("svg",{class:"external-link-icon",xmlns:"http://www.w3.org/2000/svg","aria-hidden":"true",focusable:"false",x:"0px",y:"0px",viewBox:"0 0 100 100",width:"15",height:"15"},[_e("path",{fill:"currentColor",d:"M18.8,85.1h56l0,0c2.2,0,4-1.8,4-4v-32h-8v28h-48v-48h28v-8h-32l0,0c-2.2,0-4,1.8-4,4v56C14.8,83.3,16.6,85.1,18.8,85.1z"}),_e("polygon",{fill:"currentColor",points:"45.7,48.7 51.3,54.3 77.2,28.5 77.2,37.2 85.2,37.2 85.2,14.9 62.8,14.9 62.8,22.9 
71.5,22.9"})]),ed=ue({name:"ExternalLinkIcon",props:{locales:{type:Object,required:!1,default:()=>({})}},setup(e){const t=Un(),n=q(()=>e.locales[t.value]??{openInNewWindow:"open in new window"});return()=>_e("span",[Xf,_e("span",{class:"external-link-icon-sr-only"},n.value.openInNewWindow)])}}),td={"/":{openInNewWindow:"open in new window"},"/zh/":{openInNewWindow:"open in new window"}},nd=Et({enhance({app:e}){e.component("ExternalLinkIcon",_e(ed,{locales:td}))}});/*! medium-zoom 1.0.8 | MIT License | https://github.com/francoischalifour/medium-zoom */var Nt=Object.assign||function(e){for(var t=1;t1&&arguments[1]!==void 0?arguments[1]:{},r=window.Promise||function(w){function k(){}w(k,k)},o=function(w){var k=w.target;if(k===j){E();return}C.indexOf(k)!==-1&&T({target:k})},i=function(){if(!(W||!v.original)){var w=window.pageYOffset||document.documentElement.scrollTop||document.body.scrollTop||0;Math.abs(te-w)>M.scrollOffset&&setTimeout(E,150)}},s=function(w){var k=w.key||w.keyCode;(k==="Escape"||k==="Esc"||k===27)&&E()},l=function(){var w=arguments.length>0&&arguments[0]!==void 0?arguments[0]:{},k=w;if(w.background&&(j.style.background=w.background),w.container&&w.container instanceof Object&&(k.container=Nt({},M.container,w.container)),w.template){var z=sr(w.template)?w.template:document.querySelector(w.template);k.template=z}return M=Nt({},M,k),C.forEach(function(le){le.dispatchEvent(Xt("medium-zoom:update",{detail:{zoom:N}}))}),N},a=function(){var w=arguments.length>0&&arguments[0]!==void 0?arguments[0]:{};return e(Nt({},M,w))},c=function(){for(var w=arguments.length,k=Array(w),z=0;z0?k.reduce(function(U,oe){return[].concat(U,es(oe))},[]):C;return le.forEach(function(U){U.classList.remove("medium-zoom-image"),U.dispatchEvent(Xt("medium-zoom:detach",{detail:{zoom:N}}))}),C=C.filter(function(U){return le.indexOf(U)===-1}),N},f=function(w,k){var z=arguments.length>2&&arguments[2]!==void 0?arguments[2]:{};return 
C.forEach(function(le){le.addEventListener("medium-zoom:"+w,k,z)}),I.push({type:"medium-zoom:"+w,listener:k,options:z}),N},h=function(w,k){var z=arguments.length>2&&arguments[2]!==void 0?arguments[2]:{};return C.forEach(function(le){le.removeEventListener("medium-zoom:"+w,k,z)}),I=I.filter(function(le){return!(le.type==="medium-zoom:"+w&&le.listener.toString()===k.toString())}),N},_=function(){var w=arguments.length>0&&arguments[0]!==void 0?arguments[0]:{},k=w.target,z=function(){var U={width:document.documentElement.clientWidth,height:document.documentElement.clientHeight,left:0,top:0,right:0,bottom:0},oe=void 0,ie=void 0;if(M.container)if(M.container instanceof Object)U=Nt({},U,M.container),oe=U.width-U.left-U.right-M.margin*2,ie=U.height-U.top-U.bottom-M.margin*2;else{var Me=sr(M.container)?M.container:document.querySelector(M.container),ze=Me.getBoundingClientRect(),je=ze.width,Ne=ze.height,Lt=ze.left,Tt=ze.top;U=Nt({},U,{width:je,height:Ne,left:Lt,top:Tt})}oe=oe||U.width-M.margin*2,ie=ie||U.height-M.margin*2;var lt=v.zoomedHd||v.original,Se=Xi(lt)?oe:lt.naturalWidth||oe,P=Xi(lt)?ie:lt.naturalHeight||ie,F=lt.getBoundingClientRect(),$=F.top,Q=F.left,ce=F.width,d=F.height,p=Math.min(Math.max(ce,Se),oe)/ce,b=Math.min(Math.max(d,P),ie)/d,L=Math.min(p,b),A=(-Q+(oe-ce)/2+M.margin+U.left)/L,x=(-$+(ie-d)/2+M.margin+U.top)/L,H="scale("+L+") translate3d("+A+"px, "+x+"px, 0)";v.zoomed.style.transform=H,v.zoomedHd&&(v.zoomedHd.style.transform=H)};return new r(function(le){if(k&&C.indexOf(k)===-1){le(N);return}var U=function je(){W=!1,v.zoomed.removeEventListener("transitionend",je),v.original.dispatchEvent(Xt("medium-zoom:opened",{detail:{zoom:N}})),le(N)};if(v.zoomed){le(N);return}if(k)v.original=k;else if(C.length>0){var 
oe=C;v.original=oe[0]}else{le(N);return}if(v.original.dispatchEvent(Xt("medium-zoom:open",{detail:{zoom:N}})),te=window.pageYOffset||document.documentElement.scrollTop||document.body.scrollTop||0,W=!0,v.zoomed=id(v.original),document.body.appendChild(j),M.template){var ie=sr(M.template)?M.template:document.querySelector(M.template);v.template=document.createElement("div"),v.template.appendChild(ie.content.cloneNode(!0)),document.body.appendChild(v.template)}if(v.original.parentElement&&v.original.parentElement.tagName==="PICTURE"&&v.original.currentSrc&&(v.zoomed.src=v.original.currentSrc),document.body.appendChild(v.zoomed),window.requestAnimationFrame(function(){document.body.classList.add("medium-zoom--opened")}),v.original.classList.add("medium-zoom-image--hidden"),v.zoomed.classList.add("medium-zoom-image--opened"),v.zoomed.addEventListener("click",E),v.zoomed.addEventListener("transitionend",U),v.original.getAttribute("data-zoom-src")){v.zoomedHd=v.zoomed.cloneNode(),v.zoomedHd.removeAttribute("srcset"),v.zoomedHd.removeAttribute("sizes"),v.zoomedHd.removeAttribute("loading"),v.zoomedHd.src=v.zoomed.getAttribute("data-zoom-src"),v.zoomedHd.onerror=function(){clearInterval(Me),console.warn("Unable to reach the zoom image target "+v.zoomedHd.src),v.zoomedHd=null,z()};var Me=setInterval(function(){v.zoomedHd.complete&&(clearInterval(Me),v.zoomedHd.classList.add("medium-zoom-image--opened"),v.zoomedHd.addEventListener("click",E),document.body.appendChild(v.zoomedHd),z())},10)}else if(v.original.hasAttribute("srcset")){v.zoomedHd=v.zoomed.cloneNode(),v.zoomedHd.removeAttribute("sizes"),v.zoomedHd.removeAttribute("loading");var ze=v.zoomedHd.addEventListener("load",function(){v.zoomedHd.removeEventListener("load",ze),v.zoomedHd.classList.add("medium-zoom-image--opened"),v.zoomedHd.addEventListener("click",E),document.body.appendChild(v.zoomedHd),z()})}else z()})},E=function(){return new r(function(w){if(W||!v.original){w(N);return}var k=function 
z(){v.original.classList.remove("medium-zoom-image--hidden"),document.body.removeChild(v.zoomed),v.zoomedHd&&document.body.removeChild(v.zoomedHd),document.body.removeChild(j),v.zoomed.classList.remove("medium-zoom-image--opened"),v.template&&document.body.removeChild(v.template),W=!1,v.zoomed.removeEventListener("transitionend",z),v.original.dispatchEvent(Xt("medium-zoom:closed",{detail:{zoom:N}})),v.original=null,v.zoomed=null,v.zoomedHd=null,v.template=null,w(N)};W=!0,document.body.classList.remove("medium-zoom--opened"),v.zoomed.style.transform="",v.zoomedHd&&(v.zoomedHd.style.transform=""),v.template&&(v.template.style.transition="opacity 150ms",v.template.style.opacity=0),v.original.dispatchEvent(Xt("medium-zoom:close",{detail:{zoom:N}})),v.zoomed.addEventListener("transitionend",k)})},T=function(){var w=arguments.length>0&&arguments[0]!==void 0?arguments[0]:{},k=w.target;return v.original?E():_({target:k})},R=function(){return M},g=function(){return C},y=function(){return v.original},C=[],I=[],W=!1,te=0,M=n,v={original:null,zoomed:null,zoomedHd:null,template:null};Object.prototype.toString.call(t)==="[object Object]"?M=t:(t||typeof t=="string")&&c(t),M=Nt({margin:0,background:"#fff",scrollOffset:40,container:null,template:null},M);var j=od(M.background);document.addEventListener("click",o),document.addEventListener("keyup",s),document.addEventListener("scroll",i),window.addEventListener("resize",E);var N={open:_,close:E,toggle:T,update:l,clone:a,attach:c,detach:u,on:f,off:h,getOptions:R,getImages:g,getZoomedImage:y};return N};function ld(e,t){t===void 0&&(t={});var n=t.insertAt;if(!(!e||typeof document>"u")){var r=document.head||document.getElementsByTagName("head")[0],o=document.createElement("style");o.type="text/css",n==="top"&&r.firstChild?r.insertBefore(o,r.firstChild):r.appendChild(o),o.styleSheet?o.styleSheet.cssText=e:o.appendChild(document.createTextNode(e))}}var 
ad=".medium-zoom-overlay{position:fixed;top:0;right:0;bottom:0;left:0;opacity:0;transition:opacity .3s;will-change:opacity}.medium-zoom--opened .medium-zoom-overlay{cursor:pointer;cursor:zoom-out;opacity:1}.medium-zoom-image{cursor:pointer;cursor:zoom-in;transition:transform .3s cubic-bezier(.2,0,.2,1)!important}.medium-zoom-image--hidden{visibility:hidden}.medium-zoom-image--opened{position:relative;cursor:pointer;cursor:zoom-out;will-change:transform}";ld(ad);const cd=sd,ud=Symbol("mediumZoom");const fd=".theme-default-content > img, .theme-default-content :not(a) > img",dd={},hd=300,pd=Et({enhance({app:e,router:t}){const n=cd(dd);n.refresh=(r=fd)=>{n.detach(),n.attach(r)},e.provide(ud,n),t.afterEach(()=>{setTimeout(()=>n.refresh(),hd)})}});/** + * NProgress, (c) 2013, 2014 Rico Sta. Cruz - http://ricostacruz.com/nprogress + * @license MIT + */const fe={settings:{minimum:.08,easing:"ease",speed:200,trickle:!0,trickleRate:.02,trickleSpeed:800,barSelector:'[role="bar"]',parent:"body",template:'
'},status:null,set:e=>{const t=fe.isStarted();e=Wr(e,fe.settings.minimum,1),fe.status=e===1?null:e;const n=fe.render(!t),r=n.querySelector(fe.settings.barSelector),o=fe.settings.speed,i=fe.settings.easing;return n.offsetWidth,md(s=>{tr(r,{transform:"translate3d("+ts(e)+"%,0,0)",transition:"all "+o+"ms "+i}),e===1?(tr(n,{transition:"none",opacity:"1"}),n.offsetWidth,setTimeout(function(){tr(n,{transition:"all "+o+"ms linear",opacity:"0"}),setTimeout(function(){fe.remove(),s()},o)},o)):setTimeout(()=>s(),o)}),fe},isStarted:()=>typeof fe.status=="number",start:()=>{fe.status||fe.set(0);const e=()=>{setTimeout(()=>{fe.status&&(fe.trickle(),e())},fe.settings.trickleSpeed)};return fe.settings.trickle&&e(),fe},done:e=>!e&&!fe.status?fe:fe.inc(.3+.5*Math.random()).set(1),inc:e=>{let t=fe.status;return t?(typeof e!="number"&&(e=(1-t)*Wr(Math.random()*t,.1,.95)),t=Wr(t+e,0,.994),fe.set(t)):fe.start()},trickle:()=>fe.inc(Math.random()*fe.settings.trickleRate),render:e=>{if(fe.isRendered())return document.getElementById("nprogress");ns(document.documentElement,"nprogress-busy");const t=document.createElement("div");t.id="nprogress",t.innerHTML=fe.settings.template;const n=t.querySelector(fe.settings.barSelector),r=e?"-100":ts(fe.status||0),o=document.querySelector(fe.settings.parent);return tr(n,{transition:"all 0 linear",transform:"translate3d("+r+"%,0,0)"}),o!==document.body&&ns(o,"nprogress-custom-parent"),o==null||o.appendChild(t),t},remove:()=>{rs(document.documentElement,"nprogress-busy"),rs(document.querySelector(fe.settings.parent),"nprogress-custom-parent");const e=document.getElementById("nprogress");e&&vd(e)},isRendered:()=>!!document.getElementById("nprogress")},Wr=(e,t,n)=>en?n:e,ts=e=>(-1+e)*100,md=function(){const e=[];function t(){const n=e.shift();n&&n(t)}return function(n){e.push(n),e.length===1&&t()}}(),tr=function(){const e=["Webkit","O","Moz","ms"],t={};function n(s){return s.replace(/^-ms-/,"ms-").replace(/-([\da-z])/gi,function(l,a){return 
a.toUpperCase()})}function r(s){const l=document.body.style;if(s in l)return s;let a=e.length;const c=s.charAt(0).toUpperCase()+s.slice(1);let u;for(;a--;)if(u=e[a]+c,u in l)return u;return s}function o(s){return s=n(s),t[s]??(t[s]=r(s))}function i(s,l,a){l=o(l),s.style[l]=a}return function(s,l){for(const a in l){const c=l[a];c!==void 0&&Object.prototype.hasOwnProperty.call(l,a)&&i(s,a,c)}}}(),xl=(e,t)=>(typeof e=="string"?e:jo(e)).indexOf(" "+t+" ")>=0,ns=(e,t)=>{const n=jo(e),r=n+t;xl(n,t)||(e.className=r.substring(1))},rs=(e,t)=>{const n=jo(e);if(!xl(e,t))return;const r=n.replace(" "+t+" "," ");e.className=r.substring(1,r.length-1)},jo=e=>(" "+(e.className||"")+" ").replace(/\s+/gi," "),vd=e=>{e&&e.parentNode&&e.parentNode.removeChild(e)};const _d=()=>{Ge(()=>{const e=Yt(),t=new Set;t.add(e.currentRoute.value.path),e.beforeEach(n=>{t.has(n.path)||fe.start()}),e.afterEach(n=>{t.add(n.path),fe.done()})})},gd=Et({setup(){_d()}}),bd=JSON.parse(`{"logo":"/images/polardb.png","repo":"ApsaraDB/PolarDB-for-PostgreSQL","colorMode":"light","contributors":false,"locales":{"/":{"selectLanguageName":"English","editLinkText":"Edit this page on GitHub","navbar":[{"text":"Deployment","children":["/deploying/introduction.html","/deploying/quick-start.html","/deploying/deploy.html",{"text":"Preparation of Shared-Storage Device","children":["/deploying/storage-aliyun-essd.html","/deploying/storage-curvebs.html","/deploying/storage-ceph.html","/deploying/storage-nbd.html"]},{"text":"Preparation of File System","children":["/deploying/fs-pfs.html","/deploying/fs-pfs-curve.html"]},{"text":"Deploying PolarDB","children":["/deploying/db-localfs.html","/deploying/db-pfs.html","/deploying/db-pfs-curve.html"]},{"text":"More about Deployment","children":["/deploying/deploy-stack.html","/deploying/deploy-official.html"]}]},{"text":"Ops","link":"/operation/","children":[{"text":"Daily 
Ops","children":["/operation/backup-and-restore.html","/operation/grow-storage.html","/operation/scale-out.html","/operation/ro-online-promote.html"]},{"text":"Benchmarks","children":["/operation/tpcc-test.html","/operation/tpch-test.html"]}]},{"text":"Kernel Features","link":"/features/"},{"text":"Theory","link":"/theory/","children":[{"text":"PolarDB for PostgreSQL","children":["/theory/arch-overview.html","/theory/arch-htap.html","/theory/buffer-management.html","/theory/ddl-synchronization.html","/theory/logindex.html"]},{"text":"PostgreSQL","children":["/theory/analyze.html","/theory/polar-sequence-tech.html"]}]},{"text":"Dev","link":"/development/","children":[{"text":"Development on Docker","link":"/development/dev-on-docker.html"},{"text":"Customize Development Environment","link":"/development/customize-dev-env.html"}]},{"text":"Contributing","link":"/contributing/","children":[{"text":"Contributing Docs","link":"/contributing/contributing-polardb-docs.html"},{"text":"Contributing Code","link":"/contributing/contributing-polardb-kernel.html"},{"text":"Coding Style","link":"/contributing/coding-style.html"}]}],"sidebarDepth":1,"sidebar":{"/deploying":[{"text":"Deployment","children":["/deploying/introduction.md","/deploying/quick-start.md",{"text":"Advanced Deployment","link":"/deploying/deploy.md","children":[{"text":"Preparation of Shared-Storage Device","children":["/deploying/storage-aliyun-essd.md","/deploying/storage-curvebs.md","/deploying/storage-ceph.md","/deploying/storage-nbd.md"]},{"text":"Preparation of File System","children":["/deploying/fs-pfs.md","/deploying/fs-pfs-curve.md"]},{"text":"Deploying PolarDB","children":["/deploying/db-localfs.md","/deploying/db-pfs.md","/deploying/db-pfs-curve.md"]}]},{"text":"More about Deployment","children":["/deploying/deploy-stack.md","/deploying/deploy-official.md"]}]}],"/operation/":[{"text":"Ops","children":[{"text":"Daily 
Ops","children":["/operation/backup-and-restore.md","/operation/grow-storage.md","/operation/scale-out.md","/operation/ro-online-promote.md"]},{"text":"Benchmarks","children":["/operation/tpcc-test.md","/operation/tpch-test.md"]}]}],"/features":[],"/theory/":[{"text":"Theory","children":[{"text":"PolarDB for PostgreSQL","children":["/theory/arch-overview.md","/theory/arch-htap.md","/theory/buffer-management.md","/theory/ddl-synchronization.md","/theory/logindex.md"]},{"text":"PostgreSQL","children":["/theory/analyze.md","/theory/polar-sequence-tech.md"]}]}],"/development/":[{"text":"Development","children":["/development/dev-on-docker.md","/development/customize-dev-env.md"]}],"/contributing":[{"text":"Contributing","children":["/contributing/contributing-polardb-kernel.md","/contributing/contributing-polardb-docs.md","/contributing/coding-style.md"]}]}},"/zh/":{"selectLanguageName":"简体中文","selectLanguageText":"选择语言","selectLanguageAriaLabel":"选择语言","editLinkText":"在 GitHub 上编辑此页","lastUpdatedText":"上次更新","contributorsText":"贡献者","tip":"提示","warning":"注意","danger":"警告","navbar":[{"text":"部署指南","children":["/zh/deploying/introduction.html","/zh/deploying/quick-start.html","/zh/deploying/deploy.html",{"text":"共享存储设备的准备","children":["/zh/deploying/storage-aliyun-essd.html","/zh/deploying/storage-curvebs.html","/zh/deploying/storage-ceph.html","/zh/deploying/storage-nbd.html"]},{"text":"文件系统的准备","children":["/zh/deploying/fs-pfs.html","/zh/deploying/fs-pfs-curve.html"]},{"text":"部署 PolarDB 
数据库","children":["/zh/deploying/db-localfs.html","/zh/deploying/db-pfs.html","/zh/deploying/db-pfs-curve.html"]},{"text":"更多部署方式","children":["/zh/deploying/deploy-stack.html","/zh/deploying/deploy-official.html"]}]},{"text":"使用与运维","link":"/zh/operation/","children":[{"text":"日常运维","children":["/zh/operation/backup-and-restore.html","/zh/operation/grow-storage.html","/zh/operation/scale-out.html","/zh/operation/ro-online-promote.html"]},{"text":"问题诊断","children":["/zh/operation/cpu-usage-high.html"]},{"text":"性能测试","children":["/zh/operation/tpcc-test.html","/zh/operation/tpch-test.html"]}]},{"text":"自研功能","children":[{"text":"功能总览","link":"/zh/features/"},{"text":"PolarDB for PostgreSQL 11","link":"/zh/features/v11/","children":["/zh/features/v11/performance/","/zh/features/v11/availability/","/zh/features/v11/security/","/zh/features/v11/epq/","/zh/features/v11/extensions/"]}]},{"text":"原理解读","link":"/zh/theory/","children":[{"text":"PolarDB for PostgreSQL","children":["/zh/theory/arch-overview.html","/zh/theory/arch-htap.html","/zh/theory/buffer-management.html","/zh/theory/ddl-synchronization.html","/zh/theory/logindex.html"]},{"text":"PostgreSQL","children":["/zh/theory/analyze.html","/zh/theory/polar-sequence-tech.html"]}]},{"text":"上手开发","link":"/zh/development/","children":["/zh/development/dev-on-docker.html","/zh/development/customize-dev-env.html"]},{"text":"参与社区","link":"/zh/contributing/","children":[{"text":"贡献文档","link":"/zh/contributing/contributing-polardb-docs.html"},{"text":"贡献代码","link":"/zh/contributing/contributing-polardb-kernel.html"},{"text":"编码风格","link":"/zh/contributing/coding-style.html"}]}],"sidebarDepth":3,"sidebar":{"/zh/deploying":[{"text":"部署指南","children":["/zh/deploying/introduction.md","/zh/deploying/quick-start.md",{"text":"进阶部署","link":"/zh/deploying/deploy.md","children":[{"text":"共享存储设备的准备","children":["/zh/deploying/storage-aliyun-essd.md","/zh/deploying/storage-curvebs.md","/zh/deploying/storage-ceph.md","/zh/deploying/sto
rage-nbd.md"]},{"text":"文件系统的准备","children":["/zh/deploying/fs-pfs.md","/zh/deploying/fs-pfs-curve.md"]},{"text":"部署 PolarDB 数据库","children":["/zh/deploying/db-localfs.md","/zh/deploying/db-pfs.md","/zh/deploying/db-pfs-curve.md"]}]},{"text":"更多部署方式","children":["/zh/deploying/deploy-stack.md","/zh/deploying/deploy-official.md"]}]}],"/zh/operation/":[{"text":"使用与运维","children":[{"text":"日常运维","children":["/zh/operation/backup-and-restore.md","/zh/operation/grow-storage.md","/zh/operation/scale-out.md","/zh/operation/ro-online-promote.md"]},{"text":"问题诊断","children":["/zh/operation/cpu-usage-high.md"]},{"text":"性能测试","children":["/zh/operation/tpcc-test.md","/zh/operation/tpch-test.md"]}]}],"/zh/features":[{"text":"自研功能","link":"/zh/features/","children":[{"text":"PolarDB for PostgreSQL 11","link":"/zh/features/v11/","children":[{"text":"高性能","link":"/zh/features/v11/performance/","children":["/zh/features/v11/performance/bulk-read-and-extend.md","/zh/features/v11/performance/rel-size-cache.md","/zh/features/v11/performance/shared-server.md"]},{"text":"高可用","link":"/zh/features/v11/availability/","children":["/zh/features/v11/availability/avail-online-promote.md","/zh/features/v11/availability/avail-parallel-replay.md","/zh/features/v11/availability/datamax.md","/zh/features/v11/availability/resource-manager.md","/zh/features/v11/availability/flashback-table.md"]},{"text":"安全","link":"/zh/features/v11/security/","children":["/zh/features/v11/security/tde.md"]},{"text":"弹性跨机并行查询(ePQ)","link":"/zh/features/v11/epq/","children":["/zh/features/v11/epq/epq-explain-analyze.md","/zh/features/v11/epq/epq-node-and-dop.md","/zh/features/v11/epq/epq-partitioned-table.md","/zh/features/v11/epq/epq-create-btree-index.md","/zh/features/v11/epq/cluster-info.md","/zh/features/v11/epq/adaptive-scan.md","/zh/features/v11/epq/parallel-dml.md","/zh/features/v11/epq/epq-ctas-mtview-bulk-insert.md"]},{"text":"第三方插件","link":"/zh/features/v11/extensions/","children":["/zh/features/v11/exten
sions/pgvector.md","/zh/features/v11/extensions/smlar.md"]}]}]}],"/zh/theory/":[{"text":"原理解读","children":[{"text":"PolarDB for PostgreSQL","children":["/zh/theory/arch-overview.md","/zh/theory/arch-htap.md","/zh/theory/buffer-management.md","/zh/theory/ddl-synchronization.md","/zh/theory/logindex.md"]},{"text":"PostgreSQL","children":["/zh/theory/analyze.md","/zh/theory/polar-sequence-tech.md"]}]}],"/zh/development/":[{"text":"上手开发","children":["/zh/development/dev-on-docker.md","/zh/development/customize-dev-env.md"]}],"/zh/contributing":[{"text":"参与社区","children":["/zh/contributing/contributing-polardb-kernel.md","/zh/contributing/contributing-polardb-docs.md","/zh/contributing/coding-style.md"]}]}}},"colorModeSwitch":true,"navbar":[],"selectLanguageText":"Languages","selectLanguageAriaLabel":"Select language","sidebar":"auto","sidebarDepth":2,"editLink":true,"editLinkText":"Edit this page","lastUpdated":true,"lastUpdatedText":"Last Updated","contributorsText":"Contributors","notFound":["There's nothing here.","How did we get here?","That's a Four-Oh-Four.","Looks like we've got some broken links."],"backToHome":"Take me home","openInNewWindow":"open in new window","toggleColorMode":"toggle color mode","toggleSidebar":"toggle sidebar"}`),yd=Le(bd),Ol=()=>yd,Dl=Symbol(""),Ed=()=>{const e=Re(Dl);if(!e)throw new Error("useThemeLocaleData() is called without provider.");return e},Ld=(e,t)=>{const{locales:n,...r}=e;return{...r,...n==null?void 0:n[t]}},Td=Et({enhance({app:e}){const t=Ol(),n=e._context.provides[$o],r=q(()=>Ld(t.value,n.value));e.provide(Dl,r),Object.defineProperties(e.config.globalProperties,{$theme:{get(){return t.value}},$themeLocale:{get(){return r.value}}})}}),Pd=ue({__name:"Badge",props:{type:{type:String,required:!1,default:"tip"},text:{type:String,required:!1,default:""},vertical:{type:String,required:!1,default:void 
0}},setup(e){return(t,n)=>(B(),X("span",{class:Ue(["badge",e.type]),style:Mn({verticalAlign:e.vertical})},[ye(t.$slots,"default",{},()=>[zt(Ce(e.text),1)])],6))}}),Ae=(e,t)=>{const n=e.__vccOpts||e;for(const[r,o]of t)n[r]=o;return n},Ad=Ae(Pd,[["__file","Badge.vue"]]),wd=ue({name:"CodeGroup",slots:Object,setup(e,{slots:t}){const n=Le(-1),r=Le([]),o=(l=n.value)=>{l{l>0?n.value=l-1:n.value=r.value.length-1,r.value[n.value].focus()},s=(l,a)=>{l.key===" "||l.key==="Enter"?(l.preventDefault(),n.value=a):l.key==="ArrowRight"?(l.preventDefault(),o(a)):l.key==="ArrowLeft"&&(l.preventDefault(),i(a))};return()=>{var a;const l=(((a=t.default)==null?void 0:a.call(t))||[]).filter(c=>c.type.name==="CodeGroupItem").map(c=>(c.props===null&&(c.props={}),c));return l.length===0?null:(n.value<0||n.value>l.length-1?(n.value=l.findIndex(c=>c.props.active===""||c.props.active===!0),n.value===-1&&(n.value=0)):l.forEach((c,u)=>{c.props.active=u===n.value}),_e("div",{class:"code-group"},[_e("div",{class:"code-group__nav"},_e("ul",{class:"code-group__ul"},l.map((c,u)=>{const f=u===n.value;return _e("li",{class:"code-group__li"},_e("button",{ref:h=>{h&&(r.value[u]=h)},class:{"code-group__nav-tab":!0,"code-group__nav-tab-active":f},ariaPressed:f,ariaExpanded:f,onClick:()=>n.value=u,onKeydown:h=>s(h,u)},c.props.title))}))),l]))}}}),Rd=["aria-selected"],xd=ue({name:"CodeGroupItem"}),Od=ue({...xd,props:{title:{type:String,required:!0},active:{type:Boolean,required:!1,default:!1}},setup(e){return(t,n)=>(B(),X("div",{class:Ue(["code-group-item",{"code-group-item__active":e.active}]),"aria-selected":e.active},[ye(t.$slots,"default")],10,Rd))}}),Dd=Ae(Od,[["__file","CodeGroupItem.vue"]]);function os(e,t){var n;const r=Ro();return Vs(()=>{r.value=e()},{...t,flush:(n=t==null?void 0:t.flush)!=null?n:"sync"}),Hn(r)}function Id(e,t){let n,r,o;const i=Le(!0),s=()=>{i.value=!0,o()};et(e,s,{flush:"sync"});const l=typeof t=="function"?t:t.get,a=typeof t=="function"?void 
0:t.set,c=Na((u,f)=>(r=u,o=f,{get(){return i.value&&(n=l(),i.value=!1),r(),n},set(h){a==null||a(h)}}));return Object.isExtensible(c)&&(c.trigger=s),c}function Il(e){return ms()?(aa(e),!0):!1}function pn(e){return typeof e=="function"?e():ee(e)}const Cd=typeof window<"u"&&typeof document<"u",Sd=Object.prototype.toString,kd=e=>Sd.call(e)==="[object Object]",zd=()=>{};function $d(e,t){function n(...r){return new Promise((o,i)=>{Promise.resolve(e(()=>t.apply(this,r),{fn:t,thisArg:this,args:r})).then(o).catch(i)})}return n}const Cl=e=>e();function Vd(e=Cl){const t=Le(!0);function n(){t.value=!1}function r(){t.value=!0}const o=(...i)=>{t.value&&e(...i)};return{isActive:Hn(t),pause:n,resume:r,eventFilter:o}}function Md(e,t,n={}){const{eventFilter:r=Cl,...o}=n;return et(e,$d(r,t),o)}function Nd(e,t,n={}){const{eventFilter:r,...o}=n,{eventFilter:i,pause:s,resume:l,isActive:a}=Vd(r);return{stop:Md(e,t,{...o,eventFilter:i}),pause:s,resume:l,isActive:a}}function Hd(e=!1,t={}){const{truthyValue:n=!0,falsyValue:r=!1}=t,o=ke(e),i=Le(e);function s(l){if(arguments.length)return i.value=l,i.value;{const a=pn(n);return i.value=i.value===a?pn(r):a,i.value}}return o?s:[i,s]}function Bd(e){var t;const n=pn(e);return(t=n==null?void 0:n.$el)!=null?t:n}const br=Cd?window:void 0;function fo(...e){let t,n,r,o;if(typeof e[0]=="string"||Array.isArray(e[0])?([n,r,o]=e,t=br):[t,n,r,o]=e,!t)return zd;Array.isArray(n)||(n=[n]),Array.isArray(r)||(r=[r]);const i=[],s=()=>{i.forEach(u=>u()),i.length=0},l=(u,f,h,_)=>(u.addEventListener(f,h,_),()=>u.removeEventListener(f,h,_)),a=et(()=>[Bd(t),pn(o)],([u,f])=>{if(s(),!u)return;const h=kd(f)?{...f}:f;i.push(...n.flatMap(_=>r.map(E=>l(u,_,E,h))))},{immediate:!0,flush:"post"}),c=()=>{a(),s()};return Il(c),c}function Fd(){const e=Le(!1);return tl()&&Ge(()=>{e.value=!0}),e}function jd(e){const t=Fd();return q(()=>(t.value,!!e()))}function qd(e,t={}){const{window:n=br}=t,r=jd(()=>n&&"matchMedia"in n&&typeof n.matchMedia=="function");let o;const 
i=Le(!1),s=c=>{i.value=c.matches},l=()=>{o&&("removeEventListener"in o?o.removeEventListener("change",s):o.removeListener(s))},a=Vs(()=>{r.value&&(l(),o=n.matchMedia(pn(e)),"addEventListener"in o?o.addEventListener("change",s):o.addListener(s),i.value=o.matches)});return Il(()=>{a(),l(),o=void 0}),i}const nr=typeof globalThis<"u"?globalThis:typeof window<"u"?window:typeof global<"u"?global:typeof self<"u"?self:{},rr="__vueuse_ssr_handlers__",Ud=Kd();function Kd(){return rr in nr||(nr[rr]=nr[rr]||{}),nr[rr]}function Wd(e,t){return Ud[e]||t}function Qd(e){return e==null?"any":e instanceof Set?"set":e instanceof Map?"map":e instanceof Date?"date":typeof e=="boolean"?"boolean":typeof e=="string"?"string":typeof e=="object"?"object":Number.isNaN(e)?"any":"number"}const Yd={boolean:{read:e=>e==="true",write:e=>String(e)},object:{read:e=>JSON.parse(e),write:e=>JSON.stringify(e)},number:{read:e=>Number.parseFloat(e),write:e=>String(e)},any:{read:e=>e,write:e=>String(e)},string:{read:e=>e,write:e=>String(e)},map:{read:e=>new Map(JSON.parse(e)),write:e=>JSON.stringify(Array.from(e.entries()))},set:{read:e=>new Set(JSON.parse(e)),write:e=>JSON.stringify(Array.from(e))},date:{read:e=>new Date(e),write:e=>e.toISOString()}},is="vueuse-storage";function Gd(e,t,n,r={}){var o;const{flush:i="pre",deep:s=!0,listenToStorageChanges:l=!0,writeDefaults:a=!0,mergeDefaults:c=!1,shallow:u,window:f=br,eventFilter:h,onError:_=v=>{console.error(v)}}=r,E=(u?Ro:Le)(t);if(!n)try{n=Wd("getDefaultStorage",()=>{var v;return(v=br)==null?void 0:v.localStorage})()}catch(v){_(v)}if(!n)return E;const T=pn(t),R=Qd(T),g=(o=r.serializer)!=null?o:Yd[R],{pause:y,resume:C}=Nd(E,()=>I(E.value),{flush:i,deep:s,eventFilter:h});return f&&l&&(fo(f,"storage",M),fo(f,is,te)),M(),E;function I(v){try{if(v==null)n.removeItem(e);else{const j=g.write(v),N=n.getItem(e);N!==j&&(n.setItem(e,j),f&&f.dispatchEvent(new CustomEvent(is,{detail:{key:e,oldValue:N,newValue:j,storageArea:n}})))}}catch(j){_(j)}}function W(v){const 
j=v?v.newValue:n.getItem(e);if(j==null)return a&&T!==null&&n.setItem(e,g.write(T)),T;if(!v&&c){const N=g.read(j);return typeof c=="function"?c(N,T):R==="object"&&!Array.isArray(N)?{...T,...N}:N}else return typeof j!="string"?j:g.read(j)}function te(v){M(v.detail)}function M(v){if(!(v&&v.storageArea!==n)){if(v&&v.key==null){E.value=T;return}if(!(v&&v.key!==e)){y();try{(v==null?void 0:v.newValue)!==g.write(E.value)&&(E.value=W(v))}catch(j){_(j)}finally{v?Pr(C):C()}}}}}function Jd(e){return qd("(prefers-color-scheme: dark)",e)}const Zd=()=>Ol(),Fe=()=>Ed(),Sl=Symbol(""),qo=()=>{const e=Re(Sl);if(!e)throw new Error("useDarkMode() is called without provider.");return e},Xd=()=>{const e=Fe(),t=Jd(),n=Gd("vuepress-color-scheme",e.value.colorMode),r=q({get(){return e.value.colorModeSwitch?n.value==="auto"?t.value:n.value==="dark":e.value.colorMode==="dark"},set(o){o===t.value?n.value="auto":n.value=o?"dark":"light"}});Ut(Sl,r),eh(r)},eh=e=>{const t=(n=e.value)=>{const r=window==null?void 0:window.document.querySelector("html");r==null||r.classList.toggle("dark",n)};Ge(()=>{et(e,t,{immediate:!0})}),Or(()=>t())},kl=(...e)=>{const n=Yt().resolve(...e),r=n.matched[n.matched.length-1];if(!(r!=null&&r.redirect))return n;const{redirect:o}=r,i=se(o)?o(n):o,s=me(i)?{path:i}:i;return kl({hash:n.hash,query:n.query,params:n.params,...s})},Uo=e=>{const t=kl(encodeURI(e));return{text:t.meta.title||e,link:t.name==="404"?e:t.fullPath}};let Qr=null,En=null;const th={wait:()=>Qr,pending:()=>{Qr=new Promise(e=>En=e)},resolve:()=>{En==null||En(),Qr=null,En=null}},zl=()=>th,$l=Symbol("sidebarItems"),Ko=()=>{const e=Re($l);if(!e)throw new Error("useSidebarItems() is called without provider.");return e},nh=()=>{const e=Fe(),t=vt(),n=q(()=>rh(t.value,e.value));Ut($l,n)},rh=(e,t)=>{const n=e.sidebar??t.sidebar??"auto",r=e.sidebarDepth??t.sidebarDepth??2;return 
e.home||n===!1?[]:n==="auto"?ih(r):G(n)?Vl(n,r):zo(n)?sh(n,r):[]},oh=(e,t)=>({text:e.title,link:e.link,children:Wo(e.children,t)}),Wo=(e,t)=>t>0?e.map(n=>oh(n,t-1)):[],ih=e=>{const t=Wt();return[{text:t.value.title,children:Wo(t.value.headers,e)}]},Vl=(e,t)=>{const n=Gt(),r=Wt(),o=i=>{var l;let s;if(me(i)?s=Uo(i):s=i,s.children)return{...s,children:s.children.map(a=>o(a))};if(s.link===n.path){const a=((l=r.value.headers[0])==null?void 0:l.level)===1?r.value.headers[0].children:r.value.headers;return{...s,children:Wo(a,t)}}return s};return e.map(i=>o(i))},sh=(e,t)=>{const n=Gt(),r=ll(e,n.path),o=e[r]??[];return Vl(o,t)},lh="719px",ah={mobile:lh};var $n;(function(e){e.MOBILE="mobile"})($n||($n={}));var as;const ch={[$n.MOBILE]:Number.parseInt((as=ah.mobile)==null?void 0:as.replace("px",""),10)},Ml=(e,t)=>{const n=ch[e];Number.isInteger(n)&&Ge(()=>{t(n),window.addEventListener("resize",()=>t(n),!1),window.addEventListener("orientationchange",()=>t(n),!1)})},uh={},fh={class:"theme-default-content"};function dh(e,t){const n=bt("Content");return B(),X("div",fh,[ne(n)])}const hh=Ae(uh,[["render",dh],["__file","HomeContent.vue"]]),ph={key:0,class:"features"},mh=ue({__name:"HomeFeatures",setup(e){const t=vt(),n=q(()=>G(t.value.features)?t.value.features:[]);return(r,o)=>n.value.length?(B(),X("div",ph,[(B(!0),X(Ee,null,St(n.value,i=>(B(),X("div",{key:i.title,class:"feature"},[he("h2",null,Ce(i.title),1),he("p",null,Ce(i.details),1)]))),128))])):we("v-if",!0)}}),vh=Ae(mh,[["__file","HomeFeatures.vue"]]),_h=["innerHTML"],gh=["textContent"],bh=ue({__name:"HomeFooter",setup(e){const t=vt(),n=q(()=>t.value.footer),r=q(()=>t.value.footerHtml);return(o,i)=>n.value?(B(),X(Ee,{key:0},[we(" eslint-disable-next-line vue/no-v-html 
"),r.value?(B(),X("div",{key:0,class:"footer",innerHTML:n.value},null,8,_h)):(B(),X("div",{key:1,class:"footer",textContent:Ce(n.value)},null,8,gh))],64)):we("v-if",!0)}}),yh=Ae(bh,[["__file","HomeFooter.vue"]]),Eh=["href","rel","target","aria-label"],Lh=ue({inheritAttrs:!1}),Th=ue({...Lh,__name:"AutoLink",props:{item:{type:Object,required:!0}},setup(e){const t=e,n=Gt(),r=ml(),{item:o}=xo(t),i=q(()=>qn(o.value.link)),s=q(()=>Ou(o.value.link)||Du(o.value.link)),l=q(()=>{if(!s.value){if(o.value.target)return o.value.target;if(i.value)return"_blank"}}),a=q(()=>l.value==="_blank"),c=q(()=>!i.value&&!s.value&&!a.value),u=q(()=>{if(!s.value){if(o.value.rel)return o.value.rel;if(a.value)return"noopener noreferrer"}}),f=q(()=>o.value.ariaLabel||o.value.text),h=q(()=>{const T=Object.keys(r.value.locales);return T.length?!T.some(R=>R===o.value.link):o.value.link!=="/"}),_=q(()=>h.value?n.path.startsWith(o.value.link):!1),E=q(()=>c.value?o.value.activeMatch?new RegExp(o.value.activeMatch).test(n.path):_.value:!1);return(T,R)=>{const g=bt("RouterLink"),y=bt("AutoLinkExternalIcon");return c.value?(B(),Oe(g,so({key:0,class:{"router-link-active":E.value},to:ee(o).link,"aria-label":f.value},T.$attrs),{default:$e(()=>[ye(T.$slots,"before"),zt(" "+Ce(ee(o).text)+" ",1),ye(T.$slots,"after")]),_:3},16,["class","to","aria-label"])):(B(),X("a",so({key:1,class:"external-link",href:ee(o).link,rel:u.value,target:l.value,"aria-label":f.value},T.$attrs),[ye(T.$slots,"before"),zt(" "+Ce(ee(o).text)+" ",1),a.value?(B(),Oe(y,{key:0})):we("v-if",!0),ye(T.$slots,"after")],16,Eh))}}}),_t=Ae(Th,[["__file","AutoLink.vue"]]),Ph={class:"hero"},Ah={key:0,id:"main-title"},wh={key:1,class:"description"},Rh={key:2,class:"actions"},xh=ue({__name:"HomeHero",setup(e){const t=vt(),n=Vo(),r=qo(),o=q(()=>r.value&&t.value.heroImageDark!==void 
0?t.value.heroImageDark:t.value.heroImage),i=q(()=>t.value.heroAlt||l.value||"hero"),s=q(()=>t.value.heroHeight||280),l=q(()=>t.value.heroText===null?null:t.value.heroText||n.value.title||"Hello"),a=q(()=>t.value.tagline===null?null:t.value.tagline||n.value.description||"Welcome to your VuePress site"),c=q(()=>G(t.value.actions)?t.value.actions.map(({text:f,link:h,type:_="primary"})=>({text:f,link:h,type:_})):[]),u=()=>{if(!o.value)return null;const f=_e("img",{src:No(o.value),alt:i.value,height:s.value});return t.value.heroImageDark===void 0?f:_e(Mo,()=>f)};return(f,h)=>(B(),X("header",Ph,[ne(u),l.value?(B(),X("h1",Ah,Ce(l.value),1)):we("v-if",!0),a.value?(B(),X("p",wh,Ce(a.value),1)):we("v-if",!0),c.value.length?(B(),X("p",Rh,[(B(!0),X(Ee,null,St(c.value,_=>(B(),Oe(_t,{key:_.text,class:Ue(["action-button",[_.type]]),item:_},null,8,["class","item"]))),128))])):we("v-if",!0)]))}}),Oh=Ae(xh,[["__file","HomeHero.vue"]]),Dh={class:"home"},Ih=ue({__name:"Home",setup(e){return(t,n)=>(B(),X("main",Dh,[ne(Oh),ne(vh),ne(hh),ne(yh)]))}}),Ch=Ae(Ih,[["__file","Home.vue"]]),Sh=ue({__name:"NavbarBrand",setup(e){const t=Un(),n=Vo(),r=Fe(),o=qo(),i=q(()=>r.value.home||t.value),s=q(()=>n.value.title),l=q(()=>o.value&&r.value.logoDark!==void 0?r.value.logoDark:r.value.logo),a=()=>{if(!l.value)return null;const c=_e("img",{class:"logo",src:No(l.value),alt:s.value});return r.value.logoDark===void 0?c:_e(Mo,()=>c)};return(c,u)=>{const f=bt("RouterLink");return B(),Oe(f,{to:i.value},{default:$e(()=>[ne(a),s.value?(B(),X("span",{key:0,class:Ue(["site-name",{"can-hide":l.value}])},Ce(s.value),3)):we("v-if",!0)]),_:1},8,["to"])}}}),kh=Ae(Sh,[["__file","NavbarBrand.vue"]]),zh=ue({__name:"DropdownTransition",setup(e){const 
t=r=>{r.style.height=r.scrollHeight+"px"},n=r=>{r.style.height=""};return(r,o)=>(B(),Oe(jn,{name:"dropdown",onEnter:t,onAfterEnter:n,onBeforeLeave:t},{default:$e(()=>[ye(r.$slots,"default")]),_:3}))}}),Nl=Ae(zh,[["__file","DropdownTransition.vue"]]),$h=["aria-label"],Vh={class:"title"},Mh=he("span",{class:"arrow down"},null,-1),Nh=["aria-label"],Hh={class:"title"},Bh={class:"navbar-dropdown"},Fh={class:"navbar-dropdown-subtitle"},jh={key:1},qh={class:"navbar-dropdown-subitem-wrapper"},Uh=ue({__name:"NavbarDropdown",props:{item:{type:Object,required:!0}},setup(e){const t=e,{item:n}=xo(t),r=q(()=>n.value.ariaLabel||n.value.text),o=Le(!1),i=Gt();et(()=>i.path,()=>{o.value=!1});const s=a=>{a.detail===0?o.value=!o.value:o.value=!1},l=(a,c)=>c[c.length-1]===a;return(a,c)=>(B(),X("div",{class:Ue(["navbar-dropdown-wrapper",{open:o.value}])},[he("button",{class:"navbar-dropdown-title",type:"button","aria-label":r.value,onClick:s},[he("span",Vh,Ce(ee(n).text),1),Mh],8,$h),he("button",{class:"navbar-dropdown-title-mobile",type:"button","aria-label":r.value,onClick:c[0]||(c[0]=u=>o.value=!o.value)},[he("span",Hh,Ce(ee(n).text),1),he("span",{class:Ue(["arrow",o.value?"down":"right"])},null,2)],8,Nh),ne(Nl,null,{default:$e(()=>[dr(he("ul",Bh,[(B(!0),X(Ee,null,St(ee(n).children,u=>(B(),X("li",{key:u.text,class:"navbar-dropdown-item"},[u.children?(B(),X(Ee,{key:0},[he("h4",Fh,[u.link?(B(),Oe(_t,{key:0,item:u,onFocusout:f=>l(u,ee(n).children)&&u.children.length===0&&(o.value=!1)},null,8,["item","onFocusout"])):(B(),X("span",jh,Ce(u.text),1))]),he("ul",qh,[(B(!0),X(Ee,null,St(u.children,f=>(B(),X("li",{key:f.link,class:"navbar-dropdown-subitem"},[ne(_t,{item:f,onFocusout:h=>l(f,u.children)&&l(u,ee(n).children)&&(o.value=!1)},null,8,["item","onFocusout"])]))),128))])],64)):(B(),Oe(_t,{key:1,item:u,onFocusout:f=>l(u,ee(n).children)&&(o.value=!1)},null,8,["item","onFocusout"]))]))),128))],512),[[_r,o.value]])]),_:1})],2))}}),Kh=Ae(Uh,[["__file","NavbarDropdown.vue"]]),ss=e=>decodeURI(e)
.replace(/#.*$/,"").replace(/(index)?\.(md|html)$/,""),Wh=(e,t)=>{if(t.hash===e)return!0;const n=ss(t.path),r=ss(e);return n===r},Hl=(e,t)=>e.link&&Wh(e.link,t)?!0:e.children?e.children.some(n=>Hl(n,t)):!1,Bl=e=>!qn(e)||/github\.com/.test(e)?"GitHub":/bitbucket\.org/.test(e)?"Bitbucket":/gitlab\.com/.test(e)?"GitLab":/gitee\.com/.test(e)?"Gitee":null,Qh={GitHub:":repo/edit/:branch/:path",GitLab:":repo/-/edit/:branch/:path",Gitee:":repo/edit/:branch/:path",Bitbucket:":repo/src/:branch/:path?mode=edit&spa=0&at=:branch&fileviewer=file-view-default"},Yh=({docsRepo:e,editLinkPattern:t})=>{if(t)return t;const n=Bl(e);return n!==null?Qh[n]:null},Gh=({docsRepo:e,docsBranch:t,docsDir:n,filePathRelative:r,editLinkPattern:o})=>{if(!r)return null;const i=Yh({docsRepo:e,editLinkPattern:o});return i?i.replace(/:repo/,qn(e)?e:`https://github.com/${e}`).replace(/:branch/,t).replace(/:path/,sl(`${il(n)}/${r}`)):null},Jh={key:0,class:"navbar-items"},Zh=ue({__name:"NavbarItems",setup(e){const t=()=>{const u=Yt(),f=Un(),h=ml(),_=Vo(),E=Zd(),T=Fe();return q(()=>{const R=Object.keys(h.value.locales);if(R.length<2)return[];const g=u.currentRoute.value.path,y=u.currentRoute.value.fullPath;return[{text:`${T.value.selectLanguageText}`,ariaLabel:`${T.value.selectLanguageAriaLabel??T.value.selectLanguageText}`,children:R.map(I=>{var N,Y;const W=((N=h.value.locales)==null?void 0:N[I])??{},te=((Y=E.value.locales)==null?void 0:Y[I])??{},M=`${W.lang}`,v=te.selectLanguageName??M;let j;if(M===_.value.lang)j=y;else{const w=g.replace(f.value,I);u.getRoutes().some(k=>k.path===w)?j=y.replace(g,w):j=te.home??I}return{text:v,link:j}})}]})},n=()=>{const u=Fe(),f=q(()=>u.value.repo),h=q(()=>f.value?Bl(f.value):null),_=q(()=>f.value&&!qn(f.value)?`https://github.com/${f.value}`:f.value),E=q(()=>_.value?u.value.repoLabel?u.value.repoLabel:h.value===null?"Source":h.value:null);return 
q(()=>!_.value||!E.value?[]:[{text:E.value,link:_.value}])},r=u=>me(u)?Uo(u):u.children?{...u,children:u.children.map(r)}:u,o=()=>{const u=Fe();return q(()=>(u.value.navbar||[]).map(r))},i=Le(!1),s=o(),l=t(),a=n(),c=q(()=>[...s.value,...l.value,...a.value]);return Ml($n.MOBILE,u=>{window.innerWidthc.value.length?(B(),X("nav",Jh,[(B(!0),X(Ee,null,St(c.value,h=>(B(),X("div",{key:h.text,class:"navbar-item"},[h.children?(B(),Oe(Kh,{key:0,item:h,class:Ue(i.value?"mobile":"")},null,8,["item","class"])):(B(),Oe(_t,{key:1,item:h},null,8,["item"]))]))),128))])):we("v-if",!0)}}),Fl=Ae(Zh,[["__file","NavbarItems.vue"]]),Xh=["title"],ep={class:"icon",focusable:"false",viewBox:"0 0 32 32"},tp=Mc('',9),np=[tp],rp={class:"icon",focusable:"false",viewBox:"0 0 32 32"},op=he("path",{d:"M13.502 5.414a15.075 15.075 0 0 0 11.594 18.194a11.113 11.113 0 0 1-7.975 3.39c-.138 0-.278.005-.418 0a11.094 11.094 0 0 1-3.2-21.584M14.98 3a1.002 1.002 0 0 0-.175.016a13.096 13.096 0 0 0 1.825 25.981c.164.006.328 0 .49 0a13.072 13.072 0 0 0 10.703-5.555a1.01 1.01 0 0 0-.783-1.565A13.08 13.08 0 0 1 15.89 4.38A1.015 1.015 0 0 0 14.98 3z",fill:"currentColor"},null,-1),ip=[op],sp=ue({__name:"ToggleColorModeButton",setup(e){const t=Fe(),n=qo(),r=()=>{n.value=!n.value};return(o,i)=>(B(),X("button",{class:"toggle-color-mode-button",title:ee(t).toggleColorMode,onClick:r},[dr((B(),X("svg",ep,np,512)),[[_r,!ee(n)]]),dr((B(),X("svg",rp,ip,512)),[[_r,ee(n)]])],8,Xh))}}),lp=Ae(sp,[["__file","ToggleColorModeButton.vue"]]),ap=["title"],cp=he("div",{class:"icon","aria-hidden":"true"},[he("span"),he("span"),he("span")],-1),up=[cp],fp=ue({__name:"ToggleSidebarButton",emits:["toggle"],setup(e){const t=Fe();return(n,r)=>(B(),X("div",{class:"toggle-sidebar-button",title:ee(t).toggleSidebar,"aria-expanded":"false",role:"button",tabindex:"0",onClick:r[0]||(r[0]=o=>n.$emit("toggle"))},up,8,ap))}}),dp=Ae(fp,[["__file","ToggleSidebarButton.vue"]]),hp=ue({__name:"Navbar",emits:["toggle-sidebar"],setup(e){const 
t=Fe(),n=Le(null),r=Le(null),o=Le(0),i=q(()=>o.value?{maxWidth:o.value+"px"}:{});Ml($n.MOBILE,l=>{var c;const a=s(n.value,"paddingLeft")+s(n.value,"paddingRight");window.innerWidth{const c=bt("NavbarSearch");return B(),X("header",{ref_key:"navbar",ref:n,class:"navbar"},[ne(dp,{onToggle:a[0]||(a[0]=u=>l.$emit("toggle-sidebar"))}),he("span",{ref_key:"navbarBrand",ref:r},[ne(kh)],512),he("div",{class:"navbar-items-wrapper",style:Mn(i.value)},[ye(l.$slots,"before"),ne(Fl,{class:"can-hide"}),ye(l.$slots,"after"),ee(t).colorModeSwitch?(B(),Oe(lp,{key:0})):we("v-if",!0),ne(c)],4)],512)}}}),pp=Ae(hp,[["__file","Navbar.vue"]]),mp={class:"page-meta"},vp={key:0,class:"meta-item edit-link"},_p={key:1,class:"meta-item last-updated"},gp={class:"meta-item-label"},bp={class:"meta-item-info"},yp={key:2,class:"meta-item contributors"},Ep={class:"meta-item-label"},Lp={class:"meta-item-info"},Tp=["title"],Pp=ue({__name:"PageMeta",setup(e){const t=()=>{const a=Fe(),c=Wt(),u=vt();return q(()=>{if(!(u.value.editLink??a.value.editLink??!0))return null;const{repo:h,docsRepo:_=h,docsBranch:E="main",docsDir:T="",editLinkText:R}=a.value;if(!_)return null;const g=Gh({docsRepo:_,docsBranch:E,docsDir:T,filePathRelative:c.value.filePathRelative,editLinkPattern:u.value.editLinkPattern??a.value.editLinkPattern});return g?{text:R??"Edit this page",link:g}:null})},n=()=>{const a=Fe(),c=Wt(),u=vt();return q(()=>{var _,E;return!(u.value.lastUpdated??a.value.lastUpdated??!0)||!((_=c.value.git)!=null&&_.updatedTime)?null:new Date((E=c.value.git)==null?void 0:E.updatedTime).toLocaleString()})},r=()=>{const a=Fe(),c=Wt(),u=vt();return q(()=>{var h;return u.value.contributors??a.value.contributors??!0?((h=c.value.git)==null?void 0:h.contributors)??null:null})},o=Fe(),i=t(),s=n(),l=r();return(a,c)=>{const u=bt("ClientOnly");return 
B(),X("footer",mp,[ee(i)?(B(),X("div",vp,[ne(_t,{class:"meta-item-label",item:ee(i)},null,8,["item"])])):we("v-if",!0),ee(s)?(B(),X("div",_p,[he("span",gp,Ce(ee(o).lastUpdatedText)+": ",1),ne(u,null,{default:$e(()=>[he("span",bp,Ce(ee(s)),1)]),_:1})])):we("v-if",!0),ee(l)&&ee(l).length?(B(),X("div",yp,[he("span",Ep,Ce(ee(o).contributorsText)+": ",1),he("span",Lp,[(B(!0),X(Ee,null,St(ee(l),(f,h)=>(B(),X(Ee,{key:h},[he("span",{class:"contributor",title:`email: ${f.email}`},Ce(f.name),9,Tp),h!==ee(l).length-1?(B(),X(Ee,{key:0},[zt(", ")],64)):we("v-if",!0)],64))),128))])])):we("v-if",!0)])}}}),Ap=Ae(Pp,[["__file","PageMeta.vue"]]),wp={key:0,class:"page-nav"},Rp={class:"inner"},xp={key:0,class:"prev"},Op={key:1,class:"next"},Dp=ue({__name:"PageNav",setup(e){const t=a=>a===!1?null:me(a)?Uo(a):zo(a)?a:!1,n=(a,c,u)=>{const f=a.findIndex(h=>h.link===c);if(f!==-1){const h=a[f+u];return h!=null&&h.link?h:null}for(const h of a)if(h.children){const _=n(h.children,c,u);if(_)return _}return null},r=vt(),o=Ko(),i=Gt(),s=q(()=>{const a=t(r.value.prev);return a!==!1?a:n(o.value,i.path,-1)}),l=q(()=>{const a=t(r.value.next);return a!==!1?a:n(o.value,i.path,1)});return(a,c)=>s.value||l.value?(B(),X("nav",wp,[he("p",Rp,[s.value?(B(),X("span",xp,[ne(_t,{item:s.value},null,8,["item"])])):we("v-if",!0),l.value?(B(),X("span",Op,[ne(_t,{item:l.value},null,8,["item"])])):we("v-if",!0)])])):we("v-if",!0)}}),Ip=Ae(Dp,[["__file","PageNav.vue"]]),Cp={class:"page"},Sp={class:"theme-default-content"},kp=ue({__name:"Page",setup(e){return(t,n)=>{const r=bt("Content");return B(),X("main",Cp,[ye(t.$slots,"top"),he("div",Sp,[ye(t.$slots,"content-top"),ne(r),ye(t.$slots,"content-bottom")]),ne(Ap),ne(Ip),ye(t.$slots,"bottom")])}}}),zp=Ae(kp,[["__file","Page.vue"]]),$p=["onKeydown"],Vp={class:"sidebar-item-children"},Mp=ue({__name:"SidebarItem",props:{item:{type:Object,required:!0},depth:{type:Number,required:!1,default:0}},setup(e){const 
t=e,{item:n,depth:r}=xo(t),o=Gt(),i=Yt(),s=q(()=>Hl(n.value,o)),l=q(()=>({"sidebar-item":!0,"sidebar-heading":r.value===0,active:s.value,collapsible:n.value.collapsible})),a=q(()=>n.value.collapsible?s.value:!0),[c,u]=Hd(a.value),f=_=>{n.value.collapsible&&(_.preventDefault(),u())},h=i.afterEach(_=>{Pr(()=>{c.value=a.value})});return xr(()=>{h()}),(_,E)=>{var R;const T=bt("SidebarItem",!0);return B(),X("li",null,[ee(n).link?(B(),Oe(_t,{key:0,class:Ue(l.value),item:ee(n)},null,8,["class","item"])):(B(),X("p",{key:1,tabindex:"0",class:Ue(l.value),onClick:f,onKeydown:yu(f,["enter"])},[zt(Ce(ee(n).text)+" ",1),ee(n).collapsible?(B(),X("span",{key:0,class:Ue(["arrow",ee(c)?"down":"right"])},null,2)):we("v-if",!0)],42,$p)),(R=ee(n).children)!=null&&R.length?(B(),Oe(Nl,{key:2},{default:$e(()=>[dr(he("ul",Vp,[(B(!0),X(Ee,null,St(ee(n).children,g=>(B(),Oe(T,{key:`${ee(r)}${g.text}${g.link}`,item:g,depth:ee(r)+1},null,8,["item","depth"]))),128))],512),[[_r,ee(c)]])]),_:1})):we("v-if",!0)])}}}),Np=Ae(Mp,[["__file","SidebarItem.vue"]]),Hp={key:0,class:"sidebar-items"},Bp=ue({__name:"SidebarItems",setup(e){const t=Gt(),n=Ko();return Ge(()=>{et(()=>t.hash,r=>{const o=document.querySelector(".sidebar");if(!o)return;const i=document.querySelector(`.sidebar a.sidebar-item[href="${t.path}${r}"]`);if(!i)return;const{top:s,height:l}=o.getBoundingClientRect(),{top:a,height:c}=i.getBoundingClientRect();as+l&&i.scrollIntoView(!1)})}),(r,o)=>ee(n).length?(B(),X("ul",Hp,[(B(!0),X(Ee,null,St(ee(n),i=>(B(),Oe(Np,{key:`${i.text}${i.link}`,item:i},null,8,["item"]))),128))])):we("v-if",!0)}}),Fp=Ae(Bp,[["__file","SidebarItems.vue"]]),jp={class:"sidebar"},qp=ue({__name:"Sidebar",setup(e){return(t,n)=>(B(),X("aside",jp,[ne(Fl),ye(t.$slots,"top"),ne(Fp),ye(t.$slots,"bottom")]))}}),Up=Ae(qp,[["__file","Sidebar.vue"]]),Kp=ue({__name:"Layout",setup(e){const t=Wt(),n=vt(),r=Fe(),o=q(()=>n.value.navbar!==!1&&r.value.navbar!==!1),i=Ko(),s=Le(!1),l=R=>{s.value=typeof 
R=="boolean"?R:!s.value},a={x:0,y:0},c=R=>{a.x=R.changedTouches[0].clientX,a.y=R.changedTouches[0].clientY},u=R=>{const g=R.changedTouches[0].clientX-a.x,y=R.changedTouches[0].clientY-a.y;Math.abs(g)>Math.abs(y)&&Math.abs(g)>40&&(g>0&&a.x<=80?l(!0):l(!1))},f=q(()=>[{"no-navbar":!o.value,"no-sidebar":!i.value.length,"sidebar-open":s.value},n.value.pageClass]);let h;Ge(()=>{h=Yt().afterEach(()=>{l(!1)})}),Or(()=>{h()});const _=zl(),E=_.resolve,T=_.pending;return(R,g)=>(B(),X("div",{class:Ue(["theme-container",f.value]),onTouchstart:c,onTouchend:u},[ye(R.$slots,"navbar",{},()=>[o.value?(B(),Oe(pp,{key:0,onToggleSidebar:l},{before:$e(()=>[ye(R.$slots,"navbar-before")]),after:$e(()=>[ye(R.$slots,"navbar-after")]),_:3})):we("v-if",!0)]),he("div",{class:"sidebar-mask",onClick:g[0]||(g[0]=y=>l(!1))}),ye(R.$slots,"sidebar",{},()=>[ne(Up,null,{top:$e(()=>[ye(R.$slots,"sidebar-top")]),bottom:$e(()=>[ye(R.$slots,"sidebar-bottom")]),_:3})]),ye(R.$slots,"page",{},()=>[ee(n).home?(B(),Oe(Ch,{key:0})):(B(),Oe(jn,{key:1,name:"fade-slide-y",mode:"out-in",onBeforeEnter:ee(E),onBeforeLeave:ee(T)},{default:$e(()=>[(B(),Oe(zp,{key:ee(t).path},{top:$e(()=>[ye(R.$slots,"page-top")]),"content-top":$e(()=>[ye(R.$slots,"page-content-top")]),"content-bottom":$e(()=>[ye(R.$slots,"page-content-bottom")]),bottom:$e(()=>[ye(R.$slots,"page-bottom")]),_:3}))]),_:3},8,["onBeforeEnter","onBeforeLeave"]))])],34))}}),Wp=Ae(Kp,[["__file","Layout.vue"]]),Qp={class:"theme-container"},Yp={class:"page"},Gp={class:"theme-default-content"},Jp=he("h1",null,"404",-1),Zp=ue({__name:"NotFound",setup(e){const t=Un(),n=Fe(),r=n.value.notFound??["Not Found"],o=()=>r[Math.floor(Math.random()*r.length)],i=n.value.home??t.value,s=n.value.backToHome??"Back to home";return(l,a)=>{const c=bt("RouterLink");return B(),X("div",Qp,[he("main",Yp,[he("div",Gp,[Jp,he("blockquote",null,Ce(o()),1),ne(c,{to:ee(i)},{default:$e(()=>[zt(Ce(ee(s)),1)]),_:1},8,["to"])])])])}}}),Xp=Ae(Zp,[["__file","NotFound.vue"]]);const 
em=Et({enhance({app:e,router:t}){e.component("Badge",Ad),e.component("CodeGroup",wd),e.component("CodeGroupItem",Dd),e.component("AutoLinkExternalIcon",()=>{const r=e.component("ExternalLinkIcon");return r?_e(r):null}),e.component("NavbarSearch",()=>{const r=e.component("Docsearch")||e.component("SearchBox");return r?_e(r):null});const n=t.options.scrollBehavior;t.options.scrollBehavior=async(...r)=>(await zl().wait(),n(...r))},setup(){Xd(),nh()},layouts:{Layout:Wp,NotFound:Xp}}),tm=e=>{const t=fo("keydown",n=>{const r=n.key==="k"&&(n.ctrlKey||n.metaKey);!(n.key==="/")&&!r||(n.preventDefault(),e(),t())})},nm=e=>e.button===1||e.altKey||e.ctrlKey||e.metaKey||e.shiftKey,rm=()=>{const e=Yt();return{hitComponent:({hit:t,children:n})=>({type:"a",ref:void 0,constructor:void 0,key:void 0,props:{href:t.url,onClick:r=>{nm(r)||(r.preventDefault(),e.push(ki(t.url,"/PolarDB-for-PostgreSQL/")))},children:n},__v:null}),navigator:{navigate:({itemUrl:t})=>{e.push(ki(t,"/PolarDB-for-PostgreSQL/"))}},transformSearchClient:t=>{const n=Ho(t.search,500);return{...t,search:async(...r)=>n(...r)}}}},om=(e=[],t)=>[`lang:${t}`,...G(e)?e:[e]],im=({buttonText:e="Search",buttonAriaLabel:t=e}={})=>``,sm=16,jl=()=>{if(document.querySelector(".DocSearch-Modal"))return;const e=new Event("keydown");e.key="k",e.metaKey=!0,window.dispatchEvent(e),setTimeout(jl,sm)},lm=e=>{const t="algolia-preconnect";(window.requestIdleCallback||setTimeout)(()=>{if(document.head.querySelector(`#${t}`))return;const 
r=document.createElement("link");r.id=t,r.rel="preconnect",r.href=`https://${e}-dsn.algolia.net`,r.crossOrigin="",document.head.appendChild(r)})},am={appId:"OYQ6LCESQG",apiKey:"748b096a5ca5958b2da16301f213d7b1",indexName:"polardb-for-postgresql",locales:{"/zh/":{placeholder:"搜索文档",translations:{button:{buttonText:"搜索文档",buttonAriaLabel:"搜索文档"},modal:{searchBox:{resetButtonTitle:"清除查询条件",resetButtonAriaLabel:"清除查询条件",cancelButtonText:"取消",cancelButtonAriaLabel:"取消"},startScreen:{recentSearchesTitle:"搜索历史",noRecentSearchesText:"没有搜索历史",saveRecentSearchButtonTitle:"保存至搜索历史",removeRecentSearchButtonTitle:"从搜索历史中移除",favoriteSearchesTitle:"收藏",removeFavoriteSearchButtonTitle:"从收藏中移除"},errorScreen:{titleText:"无法获取结果",helpText:"你可能需要检查你的网络连接"},footer:{selectText:"选择",navigateText:"切换",closeText:"关闭",searchByText:"搜索提供者"},noResultsScreen:{noResultsText:"无法找到相关结果",suggestedQueryText:"你可以尝试查询",openIssueText:"你认为该查询应该有结果?",openIssueLinkText:"点击反馈"}}}}}};m(()=>import("./style-e9220a04.js"),[]),m(()=>import("./docsearch-1d421ddb.js"),[]);const cm=ue({name:"Docsearch",props:{containerId:{type:String,required:!1,default:"docsearch-container"},options:{type:Object,required:!1,default:()=>am}},setup(e){const t=rm(),n=hl(),r=Un(),o=Le(!1),i=Le(!1),s=q(()=>{var c;return{...e.options,...(c=e.options.locales)==null?void 0:c[r.value]}}),l=async()=>{var u;const{default:c}=await m(()=>import("./index-82585c84.js"),[]);c({...t,...s.value,container:`#${e.containerId}`,searchParameters:{...s.value.searchParameters,facetFilters:om((u=s.value.searchParameters)==null?void 0:u.facetFilters,n.value)}}),o.value=!0},a=()=>{i.value||o.value||(i.value=!0,l(),jl(),et(r,l))};return tm(a),Ge(()=>lm(s.value.appId)),()=>{var c;return[_e("div",{id:e.containerId,style:{display:o.value?"block":"none"}}),o.value?null:_e("div",{onClick:a,innerHTML:im((c=s.value.translations)==null?void 0:c.button)})]}}}),um=Et({enhance({app:e}){e.component("Docsearch",cm)}});const fm={};const 
dm=Et({enhance:({app:e})=>{},setup:()=>{}}),hm={enhance:({app:e})=>{e.component("ArticleInfo",O(()=>m(()=>import("./ArticleInfo-e2b0e2fd.js"),[])))}},or=[Yf,Zf,nd,pd,gd,Td,em,um,fm,dm,hm],pm=[["v-8daa1a0e","/",{title:"Documentation"},["/README.md"]],["v-64270bfa","/deploying/db-localfs.html",{title:"基于单机文件系统部署"},[":md"]],["v-20ec2a08","/deploying/db-pfs-curve.html",{title:"基于 PFS for CurveBS 文件系统部署"},[":md"]],["v-2da78b44","/deploying/db-pfs.html",{title:"基于 PFS 文件系统部署"},[":md"]],["v-bca378d6","/deploying/deploy-official.html",{title:"阿里云官网购买实例"},[":md"]],["v-097f9dea","/deploying/deploy-stack.html",{title:"基于 PolarDB Stack 共享存储"},[":md"]],["v-4a7bdef6","/deploying/deploy.html",{title:"进阶部署"},[":md"]],["v-e8e53a66","/deploying/fs-pfs-curve.html",{title:"格式化并挂载 PFS for CurveBS"},[":md"]],["v-4bd622ef","/deploying/fs-pfs.html",{title:"格式化并挂载 PFS"},[":md"]],["v-12a5021c","/deploying/introduction.html",{title:"架构简介"},[":md"]],["v-1ced8944","/deploying/quick-start.html",{title:"快速部署"},[":md"]],["v-5a992740","/deploying/storage-aliyun-essd.html",{title:"阿里云 ECS + ESSD 云盘存储"},[":md"]],["v-e3a62740","/deploying/storage-ceph.html",{title:"Ceph 共享存储"},[":md"]],["v-7f31e698","/deploying/storage-curvebs.html",{title:"CurveBS 共享存储"},[":md"]],["v-c895df30","/deploying/storage-nbd.html",{title:"NBD 共享存储"},[":md"]],["v-43a2065f","/contributing/coding-style.html",{title:"Coding Style"},[":md"]],["v-2be11236","/contributing/contributing-polardb-docs.html",{title:"Documentation Contributing"},[":md"]],["v-48520b74","/contributing/contributing-polardb-kernel.html",{title:"Code Contributing"},[":md"]],["v-c4fe9fca","/development/customize-dev-env.html",{title:"定制开发环境"},[":md"]],["v-2a8fa310","/development/dev-on-docker.html",{title:"基于 Docker 容器开发"},[":md"]],["v-7fdfc12a","/operation/backup-and-restore.html",{title:"备份恢复"},[":md"]],["v-530a6d12","/operation/grow-storage.html",{title:"共享存储在线扩容"},[":md"]],["v-4cbd0b64","/operation/ro-online-promote.html",{title:"只读节点在线 
Promote"},[":md"]],["v-4a6d2de2","/operation/scale-out.html",{title:"计算节点扩缩容"},[":md"]],["v-3a0d4712","/operation/tpcc-test.html",{title:"TPC-C 测试"},[":md"]],["v-691e4b88","/operation/tpch-test.html",{title:"TPC-H 测试"},[":md"]],["v-98064128","/roadmap/",{title:"Roadmap"},["/roadmap/README.md"]],["v-5879645e","/theory/analyze.html",{title:"Code Analysis of ANALYZE"},[":md"]],["v-4ccaa7d8","/theory/arch-htap.html",{title:"HTAP Architecture"},[":md"]],["v-14c84b4c","/theory/arch-overview.html",{title:"Overview"},[":md"]],["v-46e5eefa","/theory/buffer-management.html",{title:"Buffer Management"},[":md"]],["v-5cfdf98b","/theory/ddl-synchronization.html",{title:"DDL Synchronization"},[":md"]],["v-65697b4c","/theory/logindex.html",{title:"LogIndex"},[":md"]],["v-6edf83b7","/theory/polar-sequence-tech.html",{title:"Sequence"},[":md"]],["v-2d0ad528","/zh/",{title:"文档"},["/zh/README.md"]],["v-3ec72c4e","/zh/contributing/coding-style.html",{title:"编码风格"},[":md"]],["v-210f48a7","/zh/contributing/contributing-polardb-docs.html",{title:"贡献文档"},[":md"]],["v-aa672cb6","/zh/contributing/contributing-polardb-kernel.html",{title:"贡献代码"},[":md"]],["v-55351ab4","/zh/deploying/db-localfs.html",{title:"基于单机文件系统部署"},[":md"]],["v-71a5b926","/zh/deploying/db-pfs-curve.html",{title:"基于 PFS for CurveBS 文件系统部署"},[":md"]],["v-b00a48e2","/zh/deploying/db-pfs.html",{title:"基于 PFS 文件系统部署"},[":md"]],["v-c6592cf8","/zh/deploying/deploy-official.html",{title:"阿里云官网购买实例"},[":md"]],["v-3dba534a","/zh/deploying/deploy-stack.html",{title:"基于 PolarDB Stack 共享存储"},[":md"]],["v-ccde9c94","/zh/deploying/deploy.html",{title:"进阶部署"},[":md"]],["v-63309b3e","/zh/deploying/fs-pfs-curve.html",{title:"格式化并挂载 PFS for CurveBS"},[":md"]],["v-0aa4c420","/zh/deploying/fs-pfs.html",{title:"格式化并挂载 PFS"},[":md"]],["v-635e913a","/zh/deploying/introduction.html",{title:"架构简介"},[":md"]],["v-7eb8feb3","/zh/deploying/quick-start.html",{title:"快速部署"},[":md"]],["v-6c33fa62","/zh/deploying/storage-aliyun-essd.html",{title:"阿里云 ECS 
+ ESSD 云盘存储"},[":md"]],["v-65d024d1","/zh/deploying/storage-ceph.html",{title:"Ceph 共享存储"},[":md"]],["v-7a570c87","/zh/deploying/storage-curvebs.html",{title:"CurveBS 共享存储"},[":md"]],["v-04fef452","/zh/deploying/storage-nbd.html",{title:"NBD 共享存储"},[":md"]],["v-d69972ec","/zh/development/customize-dev-env.html",{title:"定制开发环境"},[":md"]],["v-25b4c8ff","/zh/development/dev-on-docker.html",{title:"基于 Docker 容器开发"},[":md"]],["v-0bbe1b6a","/zh/features/",{title:"自研功能"},["/zh/features/README.md"]],["v-6fed01c8","/zh/operation/backup-and-restore.html",{title:"备份恢复"},[":md"]],["v-a8802f54","/zh/operation/cpu-usage-high.html",{title:"CPU 使用率高的排查方法"},[":md"]],["v-a3c3fc30","/zh/operation/grow-storage.html",{title:"共享存储在线扩容"},[":md"]],["v-13307193","/zh/operation/ro-online-promote.html",{title:"只读节点在线 Promote"},[":md"]],["v-4a816e3e","/zh/operation/scale-out.html",{title:"计算节点扩缩容"},[":md"]],["v-52b161a6","/zh/operation/tpcc-test.html",{title:"TPC-C 测试"},[":md"]],["v-3b28df6b","/zh/operation/tpch-test.html",{title:"TPC-H 测试"},[":md"]],["v-7b6b229b","/zh/roadmap/",{title:"版本规划"},["/zh/roadmap/README.md"]],["v-28309dcf","/zh/theory/analyze.html",{title:"ANALYZE 源码解读"},[":md"]],["v-0b994909","/zh/theory/arch-htap.html",{title:"HTAP 架构详解"},[":md"]],["v-7ce47b0b","/zh/theory/arch-overview.html",{title:"特性总览"},[":md"]],["v-7ac661aa","/zh/theory/buffer-management.html",{title:"缓冲区管理"},[":md"]],["v-7304dd08","/zh/theory/ddl-synchronization.html",{title:"DDL 同步"},[":md"]],["v-170991ee","/zh/theory/logindex.html",{title:"LogIndex"},[":md"]],["v-4f41c8b0","/zh/theory/polar-sequence-tech.html",{title:"Sequence 使用、原理全面解析"},[":md"]],["v-7f44b843","/zh/features/v11/",{title:"自研功能"},["/zh/features/v11/README.md"]],["v-6024a2d1","/zh/features/v11/availability/",{title:"高可用"},["/zh/features/v11/availability/README.md"]],["v-2a7736c4","/zh/features/v11/availability/avail-online-promote.html",{title:"只读节点 Online 
Promote"},[":md"]],["v-18c2ec3b","/zh/features/v11/availability/avail-parallel-replay.html",{title:"WAL 日志并行回放"},[":md"]],["v-4e16f0f0","/zh/features/v11/availability/datamax.html",{title:"DataMax 日志节点"},[":md"]],["v-bb50ce5c","/zh/features/v11/availability/flashback-table.html",{title:"闪回表和闪回日志"},[":md"]],["v-4fd5d67a","/zh/features/v11/availability/resource-manager.html",{title:"Resource Manager"},[":md"]],["v-62087a8c","/zh/features/v11/epq/",{title:"弹性跨机并行查询(ePQ)"},["/zh/features/v11/epq/README.md"]],["v-59700d71","/zh/features/v11/epq/adaptive-scan.html",{title:"自适应扫描"},[":md"]],["v-798d4bcc","/zh/features/v11/epq/cluster-info.html",{title:"集群拓扑视图"},[":md"]],["v-5b4b4332","/zh/features/v11/epq/epq-create-btree-index.html",{title:"ePQ 支持创建 B-Tree 索引并行加速"},[":md"]],["v-da223262","/zh/features/v11/epq/epq-ctas-mtview-bulk-insert.html",{title:"ePQ 支持创建/刷新物化视图并行加速和批量写入"},[":md"]],["v-9aa77614","/zh/features/v11/epq/epq-explain-analyze.html",{title:"ePQ 执行计划查看与分析"},[":md"]],["v-351ad83c","/zh/features/v11/epq/epq-node-and-dop.html",{title:"ePQ 计算节点范围选择与并行度控制"},[":md"]],["v-5d5635bc","/zh/features/v11/epq/epq-partitioned-table.html",{title:"ePQ 支持分区表查询"},[":md"]],["v-3f61fca0","/zh/features/v11/epq/parallel-dml.html",{title:"并行 INSERT"},[":md"]],["v-9d84b310","/zh/features/v11/extensions/",{title:"第三方插件"},["/zh/features/v11/extensions/README.md"]],["v-3c5bafa7","/zh/features/v11/extensions/pgvector.html",{title:"pgvector"},[":md"]],["v-bc8fc3a4","/zh/features/v11/extensions/smlar.html",{title:"smlar"},[":md"]],["v-ba4b3c7c","/zh/features/v11/performance/",{title:"高性能"},["/zh/features/v11/performance/README.md"]],["v-0bb2232b","/zh/features/v11/performance/bulk-read-and-extend.html",{title:"预读 / 预扩展"},[":md"]],["v-37c6fdad","/zh/features/v11/performance/rel-size-cache.html",{title:"表大小缓存"},[":md"]],["v-69fcb160","/zh/features/v11/performance/shared-server.html",{title:"Shared 
Server"},[":md"]],["v-010157e8","/zh/features/v11/security/",{title:"安全"},["/zh/features/v11/security/README.md"]],["v-39aa8be0","/zh/features/v11/security/tde.html",{title:"TDE 透明数据加密"},[":md"]],["v-3706649a","/404.html",{title:""},[]]];var ls=ue({name:"Vuepress",setup(){const e=zu();return()=>_e(e.value)}}),mm=()=>pm.reduce((e,[t,n,r,o])=>(e.push({name:t,path:n,component:ls,meta:r},{path:n.endsWith("/")?n+"index.html":n.substring(0,n.length-5),redirect:n},...o.map(i=>({path:i===":md"?n.substring(0,n.length-5)+".md":i,redirect:n}))),e),[{name:"404",path:"/:catchAll(.*)",component:ls}]),vm=of,_m=()=>{const e=Ff({history:vm(il("/PolarDB-for-PostgreSQL/")),routes:mm(),scrollBehavior:(t,n,r)=>r||(t.hash?{el:t.hash}:{top:0})});return e.beforeResolve(async(t,n)=>{var r;(t.path!==n.path||n===pt)&&([t.meta._data]=await Promise.all([ht.resolvePageData(t.name),(r=al[t.name])==null?void 0:r.__asyncLoader()]))}),e},gm=e=>{e.component("ClientOnly",Mo),e.component("Content",Hu)},bm=(e,t,n)=>{const r=os(()=>t.currentRoute.value.path),o=os(()=>ht.resolveRouteLocale(tn.value.locales,r.value)),i=Id(r,()=>t.currentRoute.value.meta._data),s=q(()=>ht.resolveLayouts(n)),l=q(()=>ht.resolveSiteLocaleData(tn.value,o.value)),a=q(()=>ht.resolvePageFrontmatter(i.value)),c=q(()=>ht.resolvePageHeadTitle(i.value,l.value)),u=q(()=>ht.resolvePageHead(c.value,a.value,l.value)),f=q(()=>ht.resolvePageLang(i.value,l.value)),h=q(()=>ht.resolvePageLayout(i.value,s.value));return 
e.provide(Iu,s),e.provide(cl,i),e.provide(ul,a),e.provide(ku,c),e.provide(fl,u),e.provide(dl,f),e.provide(pl,h),e.provide($o,o),e.provide(vl,l),Object.defineProperties(e.config.globalProperties,{$frontmatter:{get:()=>a.value},$head:{get:()=>u.value},$headTitle:{get:()=>c.value},$lang:{get:()=>f.value},$page:{get:()=>i.value},$routeLocale:{get:()=>o.value},$site:{get:()=>tn.value},$siteLocale:{get:()=>l.value},$withBase:{get:()=>No}}),{layouts:s,pageData:i,pageFrontmatter:a,pageHead:u,pageHeadTitle:c,pageLang:f,pageLayout:h,routeLocale:o,siteData:tn,siteLocaleData:l}},ym=()=>{const e=Su(),t=hl(),n=Le([]),r=()=>{e.value.forEach(i=>{const s=Em(i);s&&n.value.push(s)})},o=()=>{document.documentElement.lang=t.value,n.value.forEach(i=>{i.parentNode===document.head&&document.head.removeChild(i)}),n.value.splice(0,n.value.length),e.value.forEach(i=>{const s=Lm(i);s!==null&&(document.head.appendChild(s),n.value.push(s))})};Ut(Vu,o),Ge(()=>{r(),o(),et(()=>e.value,o)})},Em=([e,t,n=""])=>{const r=Object.entries(t).map(([l,a])=>me(a)?`[${l}=${JSON.stringify(a)}]`:a===!0?`[${l}]`:"").join(""),o=`head > ${e}${r}`;return Array.from(document.querySelectorAll(o)).find(l=>l.innerText===n)||null},Lm=([e,t,n])=>{if(!me(e))return null;const r=document.createElement(e);return zo(t)&&Object.entries(t).forEach(([o,i])=>{me(i)?r.setAttribute(o,i):i===!0&&r.setAttribute(o,"")}),me(n)&&r.appendChild(document.createTextNode(n)),r},Tm=Tu,Pm=async()=>{var n;const e=Tm({name:"VuepressApp",setup(){var r;ym();for(const o of or)(r=o.setup)==null||r.call(o);return()=>[_e(Rl),...or.flatMap(({rootComponents:o=[]})=>o.map(i=>_e(i)))]}}),t=_m();gm(e),bm(e,t,or);for(const r of or)await((n=r.enhance)==null?void 0:n.call(r,{app:e,router:t,siteData:tn}));return e.use(t),{app:e,router:t}};Pm().then(({app:e,router:t})=>{t.isReady().then(()=>{e.mount("#app")})});export{Ae as _,he as a,zt as b,X as c,Pm as createVueApp,ne as d,Mc as e,ue as f,Ce as g,we as h,B as o,bt as r,xo as t,ee as u,$e as w}; diff --git 
a/assets/arch-htap.html-03506fa3.js b/assets/arch-htap.html-03506fa3.js new file mode 100644 index 00000000000..c251e55a6c1 --- /dev/null +++ b/assets/arch-htap.html-03506fa3.js @@ -0,0 +1,77 @@ +import{_ as r,r as l,o as i,c,d as n,a as s,w as o,b as a,e as d}from"./app-3d1677bf.js";const u="/PolarDB-for-PostgreSQL/assets/htap-1-background-c1448c2b.png",k="/PolarDB-for-PostgreSQL/assets/htap-2-arch-75a7a690.png",P="/PolarDB-for-PostgreSQL/assets/htap-3-mpp-125b1127.png",m="/PolarDB-for-PostgreSQL/assets/htap-4-1-consistency-b92b1c5f.png",v="/PolarDB-for-PostgreSQL/assets/htap-4-2-serverless-a6102d5e.png",b="/PolarDB-for-PostgreSQL/assets/htap-4-3-serverlessmap-8c3c8571.png",h="/PolarDB-for-PostgreSQL/assets/htap-5-skew-c7747f23.png",_="/PolarDB-for-PostgreSQL/assets/htap-7-1-acc-f65e825a.png",g="/PolarDB-for-PostgreSQL/assets/htap-7-2-cpu-48d29353.png",f="/PolarDB-for-PostgreSQL/assets/htap-7-3-dop-4dd408f5.png",w="/PolarDB-for-PostgreSQL/assets/htap-8-1-tpch-mpp-1d438468.png",S="/PolarDB-for-PostgreSQL/assets/htap-8-2-tpch-mpp-each-2433a941.png",M="/PolarDB-for-PostgreSQL/assets/htap-6-btbuild-adea540c.png",y={},B=s("h1",{id:"htap-architecture",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#htap-architecture","aria-hidden":"true"},"#"),a(" HTAP Architecture")],-1),T={class:"table-of-contents"},D=d('

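Background

Many PolarDB PG users need to run both TP (Transactional Processing) and AP (Analytical Processing) workloads on the same database. They expect the database to handle highly concurrent TP requests during the day and to run AP report analysis at night, when TP traffic drops and the machines are idle. Even so, the resources of the idle machines are still not fully used. The original PolarDB PG database faced two major challenges when processing complex AP queries:

  • Under the native PostgreSQL execution engine, a single SQL statement can run on only one node. Whether it runs serially or in parallel on that node, it cannot use the CPU, memory, or other compute resources of other nodes. It can only scale up vertically, not scale out horizontally.
  • PolarDB sits on top of a storage pool whose I/O throughput is, in theory, unlimited. However, because a single SQL statement can run on only one node under the native PostgreSQL execution engine, it is bounded by the CPU and memory of that node and cannot take full advantage of the large I/O bandwidth on the storage side.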

image.png

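To address these pain points, PolarDB implements HTAP. The industry currently has three main approaches to HTAP:

  1. TP and AP are fully separated in both storage and compute
    • Advantage: the two workloads do not affect each other
    • Disadvantages:
      • Timeliness: TP data must be imported into the AP system, which introduces some latency
      • Cost / operational complexity: a redundant AP system has to be added
  2. TP and AP fully share both storage and compute
    • Advantage: minimal cost and maximal resource utilization
    • Disadvantages:
      • With shared compute, AP and TP queries running at the same time interfere with each other to some degree
      • When compute nodes with storage are added, data must be redistributed, so fast elastic scale-out is not possible
  3. TP and AP share storage but are separated in compute
    • The storage-compute separation architecture of PolarDB naturally supports this approach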

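Principles

Architecture Features

Building on the storage-compute separation architecture of PolarDB, we developed a distributed MPP execution engine that provides cross-node parallel execution and elastic compute scaling, giving PolarDB initial HTAP capabilities:

  1. Unified storage: millisecond-level data freshness
    • TP and AP share a single copy of the stored data, which reduces storage cost and improves query timeliness
  2. Physical isolation of TP and AP: no CPU or memory interference
    • Single-node execution engine: handles highly concurrent TP queries on RW / RO nodes
    • Distributed MPP execution engine: handles highly complex AP queries on RO nodes
  3. Serverless elastic scaling: any RO node can initiate an MPP query
    • Scale Out: elastically adjust the set of nodes that execute MPP
    • Scale Up: elastically adjust the per-node degree of parallelism of MPP
  4. Elimination of data skew and computation skew, with full consideration of PostgreSQL Buffer Pool affinity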

image.png

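Distributed MPP Execution Engine

The core of PolarDB HTAP is the distributed MPP execution engine, a typical Volcano-model engine. For example, tables A and B are first joined and then aggregated for output, which is also the execution flow of the PostgreSQL single-node execution engine.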

image.png

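In a traditional MPP execution engine, data is scattered across nodes, and the data on different nodes may have different distribution properties, such as hash distribution, random distribution, or replicated distribution. Based on the distribution characteristics of each table, a traditional MPP execution engine inserts operators into the execution plan so that upper-level operators are unaware of how the data is distributed.

PolarDB is different: it uses a shared-storage architecture, and all compute nodes can access all of the data in storage. If a traditional MPP execution engine were used, the Worker on every compute node would scan the full data set and produce duplicate data. The scan would also gain no divide-and-conquer speedup, so it could not be called a true MPP engine.

Therefore, in the PolarDB distributed MPP execution engine, we borrowed ideas from the Volcano model paper, parallelized all scan operators, and introduced the PxScan operator to mask the shared storage. The PxScan operator maps shared-storage data to shared-nothing data: through coordination among Workers, the target table is divided into multiple virtual partition data blocks, and each Worker scans its own virtual partition blocks, which achieves distributed parallel scanning across nodes.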
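
The idea behind PxScan can be illustrated with a minimal Python sketch. It is not the PolarDB implementation: the function name assign_virtual_partitions, the fixed chunk size, and the round-robin policy are all assumptions made for illustration. Every Worker could read every block on shared storage, but each one scans only the blocks assigned to it, so the table behaves as if it were shared-nothing.

```python
from collections import defaultdict


def assign_virtual_partitions(total_blocks, num_workers, chunk_size=4):
    """Map each block number of a shared table to exactly one worker.

    Hypothetical policy: consecutive chunks of chunk_size blocks are
    handed out to workers in round-robin order.
    """
    if num_workers <= 0 or chunk_size <= 0:
        raise ValueError("num_workers and chunk_size must be positive")
    assignment = defaultdict(list)
    for block in range(total_blocks):
        worker = (block // chunk_size) % num_workers
        assignment[worker].append(block)
    return dict(assignment)


if __name__ == "__main__":
    parts = assign_virtual_partitions(10, 2, chunk_size=2)
    for worker, blocks in sorted(parts.items()):
        print(worker, blocks)
```

Because every block maps to exactly one worker, no row is scanned twice and the scan work is divided among the workers.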

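Data scanned by the PxScan operator is redistributed by the Shuffle operator. After redistribution, each Worker processes its data according to the Volcano model, just as in single-node execution.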
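
As a rough sketch of what redistribution means, the Python function below hashes each row on a key column and places it in the bucket of one worker. The name shuffle_by_key and the use of integer keys are assumptions for illustration; the real Shuffle operator moves tuples between processes over the network.

```python
def shuffle_by_key(rows, key_index, num_workers):
    """Redistribute rows so that equal keys land on the same worker."""
    buckets = [[] for _ in range(num_workers)]
    for row in rows:
        # Integer keys are assumed here; hash() of an int is deterministic.
        buckets[hash(row[key_index]) % num_workers].append(row)
    return buckets


if __name__ == "__main__":
    rows = [(1, "a"), (2, "b"), (3, "c"), (1, "d")]
    for worker, bucket in enumerate(shuffle_by_key(rows, 0, 2)):
        print(worker, bucket)
```

After such a shuffle, all rows that share a join or grouping key sit on one worker, so that worker can join or aggregate them locally.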

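Serverless Elastic Scaling

Traditional MPP can initiate MPP queries only on designated nodes, so each node can have only a single Worker scanning a given table. To support serverless elastic scaling in cloud-native environments, we introduced a distributed transaction consistency guarantee.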

image.png

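An arbitrary node is chosen as the Coordinator node, and its ReadLSN is used as the agreed LSN. The smallest snapshot version among all MPP nodes is chosen as the globally agreed snapshot version. LSN replay waiting and the Global Snapshot synchronization mechanism ensure that, whichever node initiates an MPP query, both the data and the snapshot reach a consistent and usable state.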
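
The agreement rule can be written down as a small Python sketch. The class MppNode, its fields, and the use of plain integers for LSNs and snapshot versions are hypothetical simplifications, not PolarDB data structures. The sketch only shows which values are agreed on and which nodes would have to wait for replay.

```python
from dataclasses import dataclass


@dataclass
class MppNode:
    name: str
    replay_lsn: int        # how far this node has replayed the WAL
    snapshot_version: int  # local snapshot version number


def agree_on_read_point(coordinator_read_lsn, nodes):
    """Return (agreed_lsn, agreed_snapshot, names_of_nodes_that_must_wait)."""
    agreed_lsn = coordinator_read_lsn
    # The smallest snapshot version is visible on every participating node.
    agreed_snapshot = min(node.snapshot_version for node in nodes)
    # Nodes that have not replayed up to the agreed LSN wait before scanning.
    must_wait = [node.name for node in nodes if node.replay_lsn < agreed_lsn]
    return agreed_lsn, agreed_snapshot, must_wait


if __name__ == "__main__":
    cluster = [MppNode("ro1", 120, 7), MppNode("ro2", 95, 5), MppNode("ro3", 100, 9)]
    print(agree_on_read_point(100, cluster))
```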

image.png

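To achieve serverless elastic scaling, we took advantage of shared storage and moved all external dependencies required by each module along the full path of the Coordinator node onto shared storage. The parameters that each Worker node needs at run time are also synchronized from the Coordinator node over the control path, which makes the full path of both Coordinator and Worker nodes stateless.

Based on these two design points, elastic scaling in PolarDB has the following advantages:

  • Any node can act as the Coordinator node, which removes the Coordinator single point of failure found in traditional MPP databases.
  • PolarDB can scale out horizontally (number of compute nodes) and scale up vertically (per-node degree of parallelism). Scaling takes effect immediately and requires no data redistribution.
  • Workloads can use more flexible scheduling policies, and different business domains can run on different sets of nodes. As shown on the right side of the figure below, SQL from business domain 1 can use nodes RO1 and RO2 to run AP queries, while SQL from business domain 2 can use nodes RO3 and RO4. The compute nodes used by the two domains can be scheduled elastically.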

image.png

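Eliminating Skew

Skew is an inherent problem of traditional MPP. Its root causes are mainly data distribution skew and computation skew:

  • Data distribution skew is usually caused by uneven data scattering. In PostgreSQL, TOAST table storage for large objects also introduces some unavoidable imbalance in data distribution.
  • Computation skew is usually caused by concurrent transactions, the Buffer Pool, the network, and I/O jitter on different nodes.

Skew causes a bucket effect when traditional MPP executes: the completion time is bounded by the slowest subtask.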

image.png

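PolarDB designed and implemented an adaptive scanning mechanism. As shown in the figure above, a Coordinator node coordinates the work of the Worker nodes. When scanning data, the Coordinator node creates a task manager in memory and schedules the Worker nodes according to the scan tasks. Internally, the Coordinator node has two threads:

  • The Data thread mainly serves the data path and collects and aggregates tuples
  • The Control thread serves the control path and controls the scan progress of each scan operator

A Worker that scans faster can scan more data blocks, so the more capable do more work. For example, in the figure above the Workers on RO1 and RO3 each scanned 4 data blocks, while the Worker on RO2, because of computation skew, was able to scan more and ended up scanning 6 data blocks.

The adaptive scanning mechanism of PolarDB HTAP also takes PostgreSQL Buffer Pool affinity into account: each Worker scans a fixed set of data blocks as far as possible, which maximizes the Buffer Pool hit rate and reduces I/O overhead.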
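
The "faster Workers scan more" behaviour can be simulated with a short Python sketch. It is only a model under stated assumptions: the function adaptive_scan, the integer per-chunk scan times, and the rule that the next chunk goes to whichever worker becomes idle first are illustrative, and Buffer Pool affinity is not modelled.

```python
import heapq


def adaptive_scan(num_chunks, time_per_chunk):
    """Simulate on-demand chunk assignment by a task manager.

    time_per_chunk maps a worker id to the (integer) time that worker
    needs to scan one chunk. Returns worker id -> list of chunks scanned.
    """
    scanned = {worker: [] for worker in time_per_chunk}
    # Heap of (time at which the worker becomes idle, worker id).
    idle_at = [(0, worker) for worker in sorted(time_per_chunk)]
    heapq.heapify(idle_at)
    for chunk in range(num_chunks):
        now, worker = heapq.heappop(idle_at)
        scanned[worker].append(chunk)
        heapq.heappush(idle_at, (now + time_per_chunk[worker], worker))
    return scanned


if __name__ == "__main__":
    # Worker 1 scans a chunk twice as fast as worker 0.
    print(adaptive_scan(6, {0: 2, 1: 1}))
```

With a static split each worker would get 3 chunks and the slow worker would finish last; with on-demand assignment the fast worker takes 4 chunks and the slow one takes 2.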

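TPC-H Performance Comparison

Single-Node Parallelism vs. Distributed MPP

We used 16 PolarDB PG instances with 256 GB of memory each as RO nodes and set up a 1 TB TPC-H environment for comparison. Compared with single-node parallelism, distributed MPP makes full use of the compute resources of all RO nodes and the I/O bandwidth of the underlying shared storage, which fundamentally addresses the HTAP challenges described earlier. Of the 22 TPC-H SQL statements, 3 were accelerated by more than 60x and 19 by more than 10x, with an average speedup of 23x.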

image.png

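We also tested how performance changes as compute resources are scaled elastically. As the total number of CPU cores increased from 16 to 128, the total TPC-H running time decreased linearly and the execution speed of each SQL statement increased linearly, which validates the serverless elastic scaling of PolarDB HTAP.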

image.png

image.png

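In the tests, when the total number of CPU cores increased to 256, the performance gain was no longer significant. The reason is that the I/O bandwidth of the PolarDB shared storage was saturated at that point and had become the bottleneck.

PolarDB vs. Traditional MPP Databases

We compared the PolarDB distributed MPP execution engine with the MPP execution engine of a traditional database, again using 16 nodes with 256 GB of memory each.

On 1 TB of TPC-H data, with the same per-node degree of parallelism as the traditional MPP database (multiple nodes, a single process per node), PolarDB reaches 90% of the performance of the traditional MPP database. The fundamental reason is that data in a traditional MPP database is hash-distributed by default, so when the join keys of two tables are their respective distribution keys, a local Wise Join can be performed without a shuffle. PolarDB, by contrast, sits on a shared storage pool, and the data scanned in parallel by the PxScan operator is effectively randomly distributed, so it must be shuffled and redistributed before it can be processed as in a traditional MPP database. As a result, for TPC-H queries that involve table joins, PolarDB incurs one more network shuffle than a traditional MPP database.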

image.png

image.png

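The PolarDB distributed MPP execution engine can scale elastically without redistributing data. So when running MPP on a limited set of 16 machines, PolarDB can keep increasing the per-node degree of parallelism to make full use of each machine: with a per-node degree of parallelism of 8, its performance is 5-6x that of the traditional MPP database, and as the per-node degree of parallelism increases linearly, the overall performance of PolarDB also increases linearly. Only a configuration parameter needs to be changed, and the change takes effect immediately.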

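Features

Parallel Query

After continuous iteration, PolarDB HTAP currently supports five main groups of Parallel Query features:

  • Full support for basic operators: scan, join, aggregation, subquery, and other operators.
  • Operator optimizations for shared storage: including Shuffle operator sharing, SharedSeqScan sharing, and the SharedIndexScan operator. SharedSeqScan sharing and SharedIndexScan sharing mean that when a large table is joined with a small table, the small table uses a mechanism similar to a replicated table to reduce broadcast overhead and improve performance.
  • Partitioned table support: complete support for Hash / Range / List partitioning, as well as static pruning of multi-level partitions and dynamic partition pruning. In addition, the PolarDB distributed MPP execution engine supports Partition Wise Join on partitioned tables.
  • Elastic control of the degree of parallelism: at the global, table, session, and query levels.
  • Serverless elastic scaling: any node can initiate MPP and any combination of nodes within the MPP node range can be used; cluster topology information is maintained automatically; and shared-storage mode, primary/standby mode, and three-node mode are supported.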

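Parallel DML

Based on the PolarDB read/write separation architecture and the HTAP serverless elastic scaling design, PolarDB Parallel DML supports two modes: one writer with multiple readers, and multiple writers with multiple readers.

  • One writer, multiple readers: there are multiple read Workers on the RO nodes and only one write Worker on the RW node.
  • Multiple writers, multiple readers: there are multiple read Workers on the RO nodes and multiple write Workers on the RW node. In this mode, the read and write degrees of concurrency are fully decoupled.

Different modes suit different scenarios, and users can choose the PDML mode that fits their workload.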

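Index Build Acceleration

The PolarDB distributed MPP execution engine can be used not only for read-only queries and DML but also for index build acceleration. OLTP workloads have a large number of indexes, and building a B-Tree index spends roughly 80% of its time sorting and constructing index pages and 20% writing index pages. As shown in the figure below, PolarDB uses RO nodes to accelerate sorting with distributed MPP, builds index pages with a pipelined technique, and uses batched writes to speed up writing index pages.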

image.png

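For index build acceleration, PolarDB currently supports both regular creation and online (Concurrently) creation of B-Tree indexes.

Usage

PolarDB HTAP is suitable for light analytical workloads in day-to-day business, such as reconciliation and reporting.

Using MPP for Analytical Queries

MPP is disabled by default in the PolarDB PG engine. To use it, set the following parameters:

  • polar_enable_px: specifies whether MPP is enabled. The default is OFF (disabled).
  • polar_px_max_workers_number: sets the maximum number of MPP Worker processes on a single node. The default is 30. This parameter caps the degree of parallelism on a single node: the total number of MPP Worker processes across all sessions on the node cannot exceed it.
  • polar_px_dop_per_node: sets the degree of parallelism for parallel queries in the current session. The default is 1, and the recommended value is the total number of CPU cores. If this parameter is set to N, a session starts N MPP Worker processes on each node to process the current MPP logic.
  • polar_px_nodes: specifies the read-only nodes that participate in MPP. The default is empty, meaning all read-only nodes participate. To restrict MPP to specific nodes, list them separated by commas.
  • px_workers: specifies whether MPP applies to a particular table. By default it does not. MPP consumes a significant amount of compute node resources, so it is used only for tables on which px_workers has been set. For example:
    • ALTER TABLE t1 SET(px_workers=1) allows MPP on table t1
    • ALTER TABLE t1 SET(px_workers=-1) disables MPP on table t1
    • ALTER TABLE t1 SET(px_workers=0) makes table t1 ignore MPP (the default state)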

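This example uses a simple single-table query to show whether MPP is in effect.

-- Create the test table and insert base data.
+CREATE TABLE test(id int);
+INSERT INTO test SELECT generate_series(1,1000000);
+
+-- MPP is disabled by default, so the single-table query plan is the native PostgreSQL Seq Scan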
+EXPLAIN SELECT * FROM test;
+                       QUERY PLAN
+--------------------------------------------------------
+ Seq Scan on test  (cost=0.00..35.50 rows=2550 width=4)
+(1 row)
+

Enable and use MPP:

-- Enable MPP for the test table
+ALTER TABLE test SET (px_workers=1);
+
+-- Enable MPP
+SET polar_enable_px = on;
+
+EXPLAIN SELECT * FROM test;
+
+                                  QUERY PLAN
+-------------------------------------------------------------------------------
+ PX Coordinator 2:1  (slice1; segments: 2)  (cost=0.00..431.00 rows=1 width=4)
+   ->  Seq Scan on test (scan partial)  (cost=0.00..431.00 rows=1 width=4)
+ Optimizer: PolarDB PX Optimizer
+(3 rows)
+

Configure the set of compute nodes that participate in MPP:

-- List the names of all current read-only nodes
+CREATE EXTENSION polar_monitor;
+
+SELECT name,host,port FROM polar_cluster_info WHERE px_node='t';
+ name  |   host    | port
+-------+-----------+------
+ node1 | 127.0.0.1 | 5433
+ node2 | 127.0.0.1 | 5434
+(2 rows)
+
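+-- The cluster currently has 2 read-only nodes, named node1 and node2
+
+-- Make only the read-only node node1 participate in MPP
+SET polar_px_nodes = 'node1';
+
+-- Show the nodes that participate in parallel queries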
+SHOW polar_px_nodes;
+ polar_px_nodes
+----------------
+ node1
+(1 row)
+
+EXPLAIN SELECT * FROM test;
+                                  QUERY PLAN
+-------------------------------------------------------------------------------
+ PX Coordinator 1:1  (slice1; segments: 1)  (cost=0.00..431.00 rows=1 width=4)
+   ->  Partial Seq Scan on test  (cost=0.00..431.00 rows=1 width=4)
+ Optimizer: PolarDB PX Optimizer
+(3 rows)
+

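Using MPP for Partitioned Table Queries

MPP currently supports the following for partitioned tables:

  • Parallel queries on Range partitions
  • Parallel queries on List partitions
  • Parallel queries on single-column Hash partitions
  • Partition pruning
  • Parallel queries on partitioned tables with indexes
  • Join queries on partitioned tables
  • Parallel queries on multi-level partitions

-- MPP for partitioned tables is disabled by default; enable MPP first
+SET polar_enable_px = ON;
+
+-- Run the following statement to enable MPP for partitioned tables
+SET polar_px_enable_partition = true;
+
+-- Run the following statement to enable MPP for multi-level partitioned tables
+SET polar_px_optimizer_multilevel_partitioning = true;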
+

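Using MPP to Accelerate Index Creation

Currently only B-Tree index builds are supported. Index build syntax such as INCLUDE and index column types such as expressions are not yet supported.

To use MPP to accelerate index creation, set the following parameters:

  • polar_px_dop_per_node: specifies the degree of parallelism for MPP-accelerated index builds. The default is 1.
  • polar_px_enable_replay_wait: when MPP is used to accelerate an index build, this parameter does not need to be enabled manually in the current session; it takes effect automatically so that recently updated table entries are included in the index, which keeps the index complete. After the index is created, the parameter is reset to the database default.
  • polar_px_enable_btbuild: specifies whether MPP-accelerated index creation is enabled. OFF (the default) disables it and ON enables it.
  • polar_bt_write_page_buffer_size: specifies the write I/O strategy during index builds. The default is 0 (disabled), the unit is blocks, and the maximum value is 8192. The recommended value is 4096.
    • When the parameter is disabled, index pages are written to disk block by block as each page fills up during index creation.
    • When the parameter is enabled, the kernel keeps a buffer of polar_bt_write_page_buffer_size blocks. Index pages that need to be written are merged in this buffer and then written to disk together, which avoids the overhead of frequent I/O scheduling. This parameter improves index creation performance by an additional 20%.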
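
The effect of the write buffer can be sketched in Python. This is an illustrative model only: the class PageWriteBuffer and the helper count_writes are hypothetical names, and a capacity of 0 stands in for the disabled state described above. The sketch counts how many write calls are issued for the same number of index pages with and without buffering.

```python
class PageWriteBuffer:
    """Collects full index pages and writes them out in batches."""

    def __init__(self, capacity, write_pages):
        self.capacity = capacity        # 0 disables buffering
        self.write_pages = write_pages  # callable that writes a list of pages
        self.pending = []

    def add(self, page):
        if self.capacity == 0:
            self.write_pages([page])    # block-by-block write
            return
        self.pending.append(page)
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        if self.pending:
            self.write_pages(self.pending)
            self.pending = []


def count_writes(num_pages, capacity):
    """Return the number of write calls needed for num_pages pages."""
    calls = []
    buf = PageWriteBuffer(capacity, calls.append)
    for page in range(num_pages):
        buf.add(page)
    buf.flush()  # write out whatever is left at the end of the build
    return len(calls)


if __name__ == "__main__":
    print(count_writes(10, 0), count_writes(10, 4))
```

In this model, 10 pages cost 10 write calls without a buffer and 3 write calls with a 4-page buffer, which is the kind of I/O merging the parameter is meant to provide.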

-- Enable MPP-accelerated index creation.
+SET polar_px_enable_btbuild = on;
+
+-- Create the index with the following syntax
+CREATE INDEX t ON test(id) WITH(px_build = ON);
+
+-- Show the table structure
+\\d test
+               Table "public.test"
+ Column |  Type   | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ id     | integer |           |          |
+ id2    | integer |           |          |
+Indexes:
+    "t" btree (id) WITH (px_build=finish)
+
`,82);function x(p,L){const t=l("ArticleInfo"),e=l("router-link");return i(),c("div",null,[B,n(t,{frontmatter:p.$frontmatter},null,8,["frontmatter"]),s("nav",T,[s("ul",null,[s("li",null,[n(e,{to:"#背景"},{default:o(()=>[a("背景")]),_:1})]),s("li",null,[n(e,{to:"#原理"},{default:o(()=>[a("原理")]),_:1}),s("ul",null,[s("li",null,[n(e,{to:"#架构特性"},{default:o(()=>[a("架构特性")]),_:1})]),s("li",null,[n(e,{to:"#分布式-mpp-执行引擎"},{default:o(()=>[a("分布式 MPP 执行引擎")]),_:1})]),s("li",null,[n(e,{to:"#serverless-弹性扩展"},{default:o(()=>[a("Serverless 弹性扩展")]),_:1})]),s("li",null,[n(e,{to:"#消除倾斜"},{default:o(()=>[a("消除倾斜")]),_:1})])])]),s("li",null,[n(e,{to:"#tpc-h-性能对比"},{default:o(()=>[a("TPC-H 性能对比")]),_:1}),s("ul",null,[s("li",null,[n(e,{to:"#单机并行-vs-分布式-mpp"},{default:o(()=>[a("单机并行 vs 分布式 MPP")]),_:1})]),s("li",null,[n(e,{to:"#polardb-vs-传统-mpp-数据库"},{default:o(()=>[a("PolarDB vs 传统 MPP 数据库")]),_:1})])])]),s("li",null,[n(e,{to:"#功能特性"},{default:o(()=>[a("功能特性")]),_:1}),s("ul",null,[s("li",null,[n(e,{to:"#parallel-query-并行查询"},{default:o(()=>[a("Parallel Query 并行查询")]),_:1})]),s("li",null,[n(e,{to:"#parallel-dml"},{default:o(()=>[a("Parallel DML")]),_:1})]),s("li",null,[n(e,{to:"#索引构建加速"},{default:o(()=>[a("索引构建加速")]),_:1})])])]),s("li",null,[n(e,{to:"#使用说明"},{default:o(()=>[a("使用说明")]),_:1}),s("ul",null,[s("li",null,[n(e,{to:"#使用-mpp-进行分析型查询"},{default:o(()=>[a("使用 MPP 进行分析型查询")]),_:1})]),s("li",null,[n(e,{to:"#使用-mpp-进行分区表查询"},{default:o(()=>[a("使用 MPP 进行分区表查询")]),_:1})]),s("li",null,[n(e,{to:"#使用-mpp-加速索引创建"},{default:o(()=>[a("使用 MPP 加速索引创建")]),_:1})])])])])]),D])}const A=r(y,[["render",x],["__file","arch-htap.html.vue"]]);export{A as default}; diff --git a/assets/arch-htap.html-21a1bc97.js b/assets/arch-htap.html-21a1bc97.js new file mode 100644 index 00000000000..9f33fbd5567 --- /dev/null +++ b/assets/arch-htap.html-21a1bc97.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-4ccaa7d8","path":"/theory/arch-htap.html","title":"HTAP 
Architecture","lang":"en-US","frontmatter":{"author":"严华","date":"2022/09/10","minute":35},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"原理","slug":"原理","link":"#原理","children":[{"level":3,"title":"架构特性","slug":"架构特性","link":"#架构特性","children":[]},{"level":3,"title":"分布式 MPP 执行引擎","slug":"分布式-mpp-执行引擎","link":"#分布式-mpp-执行引擎","children":[]},{"level":3,"title":"Serverless 弹性扩展","slug":"serverless-弹性扩展","link":"#serverless-弹性扩展","children":[]},{"level":3,"title":"消除倾斜","slug":"消除倾斜","link":"#消除倾斜","children":[]}]},{"level":2,"title":"TPC-H 性能对比","slug":"tpc-h-性能对比","link":"#tpc-h-性能对比","children":[{"level":3,"title":"单机并行 vs 分布式 MPP","slug":"单机并行-vs-分布式-mpp","link":"#单机并行-vs-分布式-mpp","children":[]},{"level":3,"title":"PolarDB vs 传统 MPP 数据库","slug":"polardb-vs-传统-mpp-数据库","link":"#polardb-vs-传统-mpp-数据库","children":[]}]},{"level":2,"title":"功能特性","slug":"功能特性","link":"#功能特性","children":[{"level":3,"title":"Parallel Query 并行查询","slug":"parallel-query-并行查询","link":"#parallel-query-并行查询","children":[]},{"level":3,"title":"Parallel DML","slug":"parallel-dml","link":"#parallel-dml","children":[]},{"level":3,"title":"索引构建加速","slug":"索引构建加速","link":"#索引构建加速","children":[]}]},{"level":2,"title":"使用说明","slug":"使用说明","link":"#使用说明","children":[{"level":3,"title":"使用 MPP 进行分析型查询","slug":"使用-mpp-进行分析型查询","link":"#使用-mpp-进行分析型查询","children":[]},{"level":3,"title":"使用 MPP 进行分区表查询","slug":"使用-mpp-进行分区表查询","link":"#使用-mpp-进行分区表查询","children":[]},{"level":3,"title":"使用 MPP 加速索引创建","slug":"使用-mpp-加速索引创建","link":"#使用-mpp-加速索引创建","children":[]}]}],"git":{"updatedTime":1684803576000},"filePathRelative":"theory/arch-htap.md"}');export{l as data}; diff --git a/assets/arch-htap.html-581ae188.js b/assets/arch-htap.html-581ae188.js new file mode 100644 index 00000000000..1f6d4c2a81a --- /dev/null +++ b/assets/arch-htap.html-581ae188.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-0b994909","path":"/zh/theory/arch-htap.html","title":"HTAP 
架构详解","lang":"zh-CN","frontmatter":{"author":"严华","date":"2022/09/10","minute":35},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"原理","slug":"原理","link":"#原理","children":[{"level":3,"title":"架构特性","slug":"架构特性","link":"#架构特性","children":[]},{"level":3,"title":"分布式 MPP 执行引擎","slug":"分布式-mpp-执行引擎","link":"#分布式-mpp-执行引擎","children":[]},{"level":3,"title":"Serverless 弹性扩展","slug":"serverless-弹性扩展","link":"#serverless-弹性扩展","children":[]},{"level":3,"title":"消除倾斜","slug":"消除倾斜","link":"#消除倾斜","children":[]}]},{"level":2,"title":"TPC-H 性能对比","slug":"tpc-h-性能对比","link":"#tpc-h-性能对比","children":[{"level":3,"title":"单机并行 vs 分布式 MPP","slug":"单机并行-vs-分布式-mpp","link":"#单机并行-vs-分布式-mpp","children":[]},{"level":3,"title":"PolarDB vs 传统 MPP 数据库","slug":"polardb-vs-传统-mpp-数据库","link":"#polardb-vs-传统-mpp-数据库","children":[]}]},{"level":2,"title":"功能特性","slug":"功能特性","link":"#功能特性","children":[{"level":3,"title":"Parallel Query 并行查询","slug":"parallel-query-并行查询","link":"#parallel-query-并行查询","children":[]},{"level":3,"title":"Parallel DML","slug":"parallel-dml","link":"#parallel-dml","children":[]},{"level":3,"title":"索引构建加速","slug":"索引构建加速","link":"#索引构建加速","children":[]}]},{"level":2,"title":"使用说明","slug":"使用说明","link":"#使用说明","children":[{"level":3,"title":"使用 MPP 进行分析型查询","slug":"使用-mpp-进行分析型查询","link":"#使用-mpp-进行分析型查询","children":[]},{"level":3,"title":"使用 MPP 进行分区表查询","slug":"使用-mpp-进行分区表查询","link":"#使用-mpp-进行分区表查询","children":[]},{"level":3,"title":"使用 MPP 加速索引创建","slug":"使用-mpp-加速索引创建","link":"#使用-mpp-加速索引创建","children":[]}]}],"git":{"updatedTime":1672148725000},"filePathRelative":"zh/theory/arch-htap.md"}');export{l as data}; diff --git a/assets/arch-htap.html-b0e18587.js b/assets/arch-htap.html-b0e18587.js new file mode 100644 index 00000000000..040d52f4b9c --- /dev/null +++ b/assets/arch-htap.html-b0e18587.js @@ -0,0 +1,77 @@ +import{_ as r,r as l,o as i,c,d as n,a as s,w as o,b as a,e as d}from"./app-3d1677bf.js";const 
u="/PolarDB-for-PostgreSQL/assets/htap-1-background-c1448c2b.png",k="/PolarDB-for-PostgreSQL/assets/htap-2-arch-75a7a690.png",P="/PolarDB-for-PostgreSQL/assets/htap-3-mpp-125b1127.png",m="/PolarDB-for-PostgreSQL/assets/htap-4-1-consistency-b92b1c5f.png",v="/PolarDB-for-PostgreSQL/assets/htap-4-2-serverless-a6102d5e.png",b="/PolarDB-for-PostgreSQL/assets/htap-4-3-serverlessmap-8c3c8571.png",h="/PolarDB-for-PostgreSQL/assets/htap-5-skew-c7747f23.png",_="/PolarDB-for-PostgreSQL/assets/htap-7-1-acc-f65e825a.png",g="/PolarDB-for-PostgreSQL/assets/htap-7-2-cpu-48d29353.png",f="/PolarDB-for-PostgreSQL/assets/htap-7-3-dop-4dd408f5.png",w="/PolarDB-for-PostgreSQL/assets/htap-8-1-tpch-mpp-1d438468.png",S="/PolarDB-for-PostgreSQL/assets/htap-8-2-tpch-mpp-each-2433a941.png",M="/PolarDB-for-PostgreSQL/assets/htap-6-btbuild-adea540c.png",y={},B=s("h1",{id:"htap-架构详解",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#htap-架构详解","aria-hidden":"true"},"#"),a(" HTAP 架构详解")],-1),T={class:"table-of-contents"},D=d('

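Background

Many PolarDB PG users need to run both TP (Transactional Processing) and AP (Analytical Processing) workloads on the same database. They expect the database to handle highly concurrent TP requests during the day and to run AP report analysis at night, when TP traffic drops and the machines are idle. Even so, the resources of the idle machines are still not fully used. The original PolarDB PG database faced two major challenges when processing complex AP queries:

  • Under the native PostgreSQL execution engine, a single SQL statement can run on only one node. Whether it runs serially or in parallel on that node, it cannot use the CPU, memory, or other compute resources of other nodes. It can only scale up vertically, not scale out horizontally.
  • PolarDB sits on top of a storage pool whose I/O throughput is, in theory, unlimited. However, because a single SQL statement can run on only one node under the native PostgreSQL execution engine, it is bounded by the CPU and memory of that node and cannot take full advantage of the large I/O bandwidth on the storage side.

image.png

To address these pain points, PolarDB implements HTAP. The industry currently has three main approaches to HTAP:

  1. TP and AP are fully separated in both storage and compute
    • Advantage: the two workloads do not affect each other
    • Disadvantages:
      • Timeliness: TP data must be imported into the AP system, which introduces some latency
      • Cost / operational complexity: a redundant AP system has to be added
  2. TP and AP fully share both storage and compute
    • Advantage: minimal cost and maximal resource utilization
    • Disadvantages:
      • With shared compute, AP and TP queries running at the same time interfere with each other to some degree
      • When compute nodes with storage are added, data must be redistributed, so fast elastic scale-out is not possible
  3. TP and AP share storage but are separated in compute
    • The storage-compute separation architecture of PolarDB naturally supports this approach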

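Principles

Architecture Features

Building on the storage-compute separation architecture of PolarDB, we developed a distributed MPP execution engine that provides cross-node parallel execution and elastic compute scaling, giving PolarDB initial HTAP capabilities:

  1. Unified storage: millisecond-level data freshness
    • TP and AP share a single copy of the stored data, which reduces storage cost and improves query timeliness
  2. Physical isolation of TP and AP: no CPU or memory interference
    • Single-node execution engine: handles highly concurrent TP queries on RW / RO nodes
    • Distributed MPP execution engine: handles highly complex AP queries on RO nodes
  3. Serverless elastic scaling: any RO node can initiate an MPP query
    • Scale Out: elastically adjust the set of nodes that execute MPP
    • Scale Up: elastically adjust the per-node degree of parallelism of MPP
  4. Elimination of data skew and computation skew, with full consideration of PostgreSQL Buffer Pool affinity

image.png

Distributed MPP Execution Engine

The core of PolarDB HTAP is the distributed MPP execution engine, a typical Volcano-model engine. For example, tables A and B are first joined and then aggregated for output, which is also the execution flow of the PostgreSQL single-node execution engine.

image.png

In a traditional MPP execution engine, data is scattered across nodes, and the data on different nodes may have different distribution properties, such as hash distribution, random distribution, or replicated distribution. Based on the distribution characteristics of each table, a traditional MPP execution engine inserts operators into the execution plan so that upper-level operators are unaware of how the data is distributed.

PolarDB is different: it uses a shared-storage architecture, and all compute nodes can access all of the data in storage. If a traditional MPP execution engine were used, the Worker on every compute node would scan the full data set and produce duplicate data. The scan would also gain no divide-and-conquer speedup, so it could not be called a true MPP engine.

Therefore, in the PolarDB distributed MPP execution engine, we borrowed ideas from the Volcano model paper, parallelized all scan operators, and introduced the PxScan operator to mask the shared storage. The PxScan operator maps shared-storage data to shared-nothing data: through coordination among Workers, the target table is divided into multiple virtual partition data blocks, and each Worker scans its own virtual partition blocks, which achieves distributed parallel scanning across nodes.

Data scanned by the PxScan operator is redistributed by the Shuffle operator. After redistribution, each Worker processes its data according to the Volcano model, just as in single-node execution.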

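Serverless Elastic Scaling

Traditional MPP can initiate MPP queries only on designated nodes, so each node can have only a single Worker scanning a given table. To support serverless elastic scaling in cloud-native environments, we introduced a distributed transaction consistency guarantee.

image.png

An arbitrary node is chosen as the Coordinator node, and its ReadLSN is used as the agreed LSN. The smallest snapshot version among all MPP nodes is chosen as the globally agreed snapshot version. LSN replay waiting and the Global Snapshot synchronization mechanism ensure that, whichever node initiates an MPP query, both the data and the snapshot reach a consistent and usable state.

image.png

To achieve serverless elastic scaling, we took advantage of shared storage and moved all external dependencies required by each module along the full path of the Coordinator node onto shared storage. The parameters that each Worker node needs at run time are also synchronized from the Coordinator node over the control path, which makes the full path of both Coordinator and Worker nodes stateless.

Based on these two design points, elastic scaling in PolarDB has the following advantages:

  • Any node can act as the Coordinator node, which removes the Coordinator single point of failure found in traditional MPP databases.
  • PolarDB can scale out horizontally (number of compute nodes) and scale up vertically (per-node degree of parallelism). Scaling takes effect immediately and requires no data redistribution.
  • Workloads can use more flexible scheduling policies, and different business domains can run on different sets of nodes. As shown on the right side of the figure below, SQL from business domain 1 can use nodes RO1 and RO2 to run AP queries, while SQL from business domain 2 can use nodes RO3 and RO4. The compute nodes used by the two domains can be scheduled elastically.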

image.png

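Eliminating Skew

Skew is an inherent problem of traditional MPP. Its root causes are mainly data distribution skew and computation skew:

  • Data distribution skew is usually caused by uneven data scattering. In PostgreSQL, TOAST table storage for large objects also introduces some unavoidable imbalance in data distribution.
  • Computation skew is usually caused by concurrent transactions, the Buffer Pool, the network, and I/O jitter on different nodes.

Skew causes a bucket effect when traditional MPP executes: the completion time is bounded by the slowest subtask.

image.png

PolarDB designed and implemented an adaptive scanning mechanism. As shown in the figure above, a Coordinator node coordinates the work of the Worker nodes. When scanning data, the Coordinator node creates a task manager in memory and schedules the Worker nodes according to the scan tasks. Internally, the Coordinator node has two threads:

  • The Data thread mainly serves the data path and collects and aggregates tuples
  • The Control thread serves the control path and controls the scan progress of each scan operator

A Worker that scans faster can scan more data blocks, so the more capable do more work. For example, in the figure above the Workers on RO1 and RO3 each scanned 4 data blocks, while the Worker on RO2, because of computation skew, was able to scan more and ended up scanning 6 data blocks.

The adaptive scanning mechanism of PolarDB HTAP also takes PostgreSQL Buffer Pool affinity into account: each Worker scans a fixed set of data blocks as far as possible, which maximizes the Buffer Pool hit rate and reduces I/O overhead.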

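TPC-H Performance Comparison

Single-Node Parallelism vs. Distributed MPP

We used 16 PolarDB PG instances with 256 GB of memory each as RO nodes and set up a 1 TB TPC-H environment for comparison. Compared with single-node parallelism, distributed MPP makes full use of the compute resources of all RO nodes and the I/O bandwidth of the underlying shared storage, which fundamentally addresses the HTAP challenges described earlier. Of the 22 TPC-H SQL statements, 3 were accelerated by more than 60x and 19 by more than 10x, with an average speedup of 23x.

image.png

We also tested how performance changes as compute resources are scaled elastically. As the total number of CPU cores increased from 16 to 128, the total TPC-H running time decreased linearly and the execution speed of each SQL statement increased linearly, which validates the serverless elastic scaling of PolarDB HTAP.

image.png

image.png

In the tests, when the total number of CPU cores increased to 256, the performance gain was no longer significant. The reason is that the I/O bandwidth of the PolarDB shared storage was saturated at that point and had become the bottleneck.

PolarDB vs. Traditional MPP Databases

We compared the PolarDB distributed MPP execution engine with the MPP execution engine of a traditional database, again using 16 nodes with 256 GB of memory each.

On 1 TB of TPC-H data, with the same per-node degree of parallelism as the traditional MPP database (multiple nodes, a single process per node), PolarDB reaches 90% of the performance of the traditional MPP database. The fundamental reason is that data in a traditional MPP database is hash-distributed by default, so when the join keys of two tables are their respective distribution keys, a local Wise Join can be performed without a shuffle. PolarDB, by contrast, sits on a shared storage pool, and the data scanned in parallel by the PxScan operator is effectively randomly distributed, so it must be shuffled and redistributed before it can be processed as in a traditional MPP database. As a result, for TPC-H queries that involve table joins, PolarDB incurs one more network shuffle than a traditional MPP database.

image.png

image.png

The PolarDB distributed MPP execution engine can scale elastically without redistributing data. So when running MPP on a limited set of 16 machines, PolarDB can keep increasing the per-node degree of parallelism to make full use of each machine: with a per-node degree of parallelism of 8, its performance is 5-6x that of the traditional MPP database, and as the per-node degree of parallelism increases linearly, the overall performance of PolarDB also increases linearly. Only a configuration parameter needs to be changed, and the change takes effect immediately.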

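Features

Parallel Query

After continuous iteration, PolarDB HTAP currently supports five main groups of Parallel Query features:

  • Full support for basic operators: scan, join, aggregation, subquery, and other operators.
  • Operator optimizations for shared storage: including Shuffle operator sharing, SharedSeqScan sharing, and the SharedIndexScan operator. SharedSeqScan sharing and SharedIndexScan sharing mean that when a large table is joined with a small table, the small table uses a mechanism similar to a replicated table to reduce broadcast overhead and improve performance.
  • Partitioned table support: complete support for Hash / Range / List partitioning, as well as static pruning of multi-level partitions and dynamic partition pruning. In addition, the PolarDB distributed MPP execution engine supports Partition Wise Join on partitioned tables.
  • Elastic control of the degree of parallelism: at the global, table, session, and query levels.
  • Serverless elastic scaling: any node can initiate MPP and any combination of nodes within the MPP node range can be used; cluster topology information is maintained automatically; and shared-storage mode, primary/standby mode, and three-node mode are supported.

Parallel DML

Based on the PolarDB read/write separation architecture and the HTAP serverless elastic scaling design, PolarDB Parallel DML supports two modes: one writer with multiple readers, and multiple writers with multiple readers.

  • One writer, multiple readers: there are multiple read Workers on the RO nodes and only one write Worker on the RW node.
  • Multiple writers, multiple readers: there are multiple read Workers on the RO nodes and multiple write Workers on the RW node. In this mode, the read and write degrees of concurrency are fully decoupled.

Different modes suit different scenarios, and users can choose the PDML mode that fits their workload.

Index Build Acceleration

The PolarDB distributed MPP execution engine can be used not only for read-only queries and DML but also for index build acceleration. OLTP workloads have a large number of indexes, and building a B-Tree index spends roughly 80% of its time sorting and constructing index pages and 20% writing index pages. As shown in the figure below, PolarDB uses RO nodes to accelerate sorting with distributed MPP, builds index pages with a pipelined technique, and uses batched writes to speed up writing index pages.

image.png

For index build acceleration, PolarDB currently supports both regular creation and online (Concurrently) creation of B-Tree indexes.

Usage

PolarDB HTAP is suitable for light analytical workloads in day-to-day business, such as reconciliation and reporting.

Using MPP for Analytical Queries

MPP is disabled by default in the PolarDB PG engine. To use it, set the following parameters:

  • polar_enable_px: specifies whether MPP is enabled. The default is OFF (disabled).
  • polar_px_max_workers_number: sets the maximum number of MPP Worker processes on a single node. The default is 30. This parameter caps the degree of parallelism on a single node: the total number of MPP Worker processes across all sessions on the node cannot exceed it.
  • polar_px_dop_per_node: sets the degree of parallelism for parallel queries in the current session. The default is 1, and the recommended value is the total number of CPU cores. If this parameter is set to N, a session starts N MPP Worker processes on each node to process the current MPP logic.
  • polar_px_nodes: specifies the read-only nodes that participate in MPP. The default is empty, meaning all read-only nodes participate. To restrict MPP to specific nodes, list them separated by commas.
  • px_workers: specifies whether MPP applies to a particular table. By default it does not. MPP consumes a significant amount of compute node resources, so it is used only for tables on which px_workers has been set. For example:
    • ALTER TABLE t1 SET(px_workers=1) allows MPP on table t1
    • ALTER TABLE t1 SET(px_workers=-1) disables MPP on table t1
    • ALTER TABLE t1 SET(px_workers=0) makes table t1 ignore MPP (the default state)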

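This example uses a simple single-table query to show whether MPP is in effect.

-- Create the test table and insert base data.
+CREATE TABLE test(id int);
+INSERT INTO test SELECT generate_series(1,1000000);
+
+-- MPP is disabled by default, so the single-table query plan is the native PostgreSQL Seq Scan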
+EXPLAIN SELECT * FROM test;
+                       QUERY PLAN
+--------------------------------------------------------
+ Seq Scan on test  (cost=0.00..35.50 rows=2550 width=4)
+(1 row)
+

Enable and use MPP:

-- Enable MPP for the test table
+ALTER TABLE test SET (px_workers=1);
+
+-- Enable MPP
+SET polar_enable_px = on;
+
+EXPLAIN SELECT * FROM test;
+
+                                  QUERY PLAN
+-------------------------------------------------------------------------------
+ PX Coordinator 2:1  (slice1; segments: 2)  (cost=0.00..431.00 rows=1 width=4)
+   ->  Seq Scan on test (scan partial)  (cost=0.00..431.00 rows=1 width=4)
+ Optimizer: PolarDB PX Optimizer
+(3 rows)
+

Configure the set of compute nodes that participate in MPP:

-- List the names of all current read-only nodes
+CREATE EXTENSION polar_monitor;
+
+SELECT name,host,port FROM polar_cluster_info WHERE px_node='t';
+ name  |   host    | port
+-------+-----------+------
+ node1 | 127.0.0.1 | 5433
+ node2 | 127.0.0.1 | 5434
+(2 rows)
+
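+-- The cluster currently has 2 read-only nodes, named node1 and node2
+
+-- Make only the read-only node node1 participate in MPP
+SET polar_px_nodes = 'node1';
+
+-- Show the nodes that participate in parallel queries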
+SHOW polar_px_nodes;
+ polar_px_nodes
+----------------
+ node1
+(1 row)
+
+EXPLAIN SELECT * FROM test;
+                                  QUERY PLAN
+-------------------------------------------------------------------------------
+ PX Coordinator 1:1  (slice1; segments: 1)  (cost=0.00..431.00 rows=1 width=4)
+   ->  Partial Seq Scan on test  (cost=0.00..431.00 rows=1 width=4)
+ Optimizer: PolarDB PX Optimizer
+(3 rows)
+

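Using MPP for Partitioned Table Queries

MPP currently supports the following for partitioned tables:

  • Parallel queries on Range partitions
  • Parallel queries on List partitions
  • Parallel queries on single-column Hash partitions
  • Partition pruning
  • Parallel queries on partitioned tables with indexes
  • Join queries on partitioned tables
  • Parallel queries on multi-level partitions

-- MPP for partitioned tables is disabled by default; enable MPP first
+SET polar_enable_px = ON;
+
+-- Run the following statement to enable MPP for partitioned tables
+SET polar_px_enable_partition = true;
+
+-- Run the following statement to enable MPP for multi-level partitioned tables
+SET polar_px_optimizer_multilevel_partitioning = true;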
+

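Using MPP to Accelerate Index Creation

Currently, only B-Tree index builds are supported. Index build syntax such as INCLUDE and index column types such as expressions are not yet supported.

To use MPP to accelerate index creation, use the following parameters:

  • polar_px_dop_per_node: specifies the degree of parallelism for building indexes with MPP. Default: 1.
  • polar_px_enable_replay_wait: when MPP is used to accelerate an index build, you do not need to enable this parameter manually in the current session. It takes effect automatically so that recently updated table entries are included in the index, which keeps the index complete. After the index is created, the parameter is reset to the database default.
  • polar_px_enable_btbuild: specifies whether to use MPP to accelerate index creation. OFF (default) disables the feature; ON enables it.
  • polar_bt_write_page_buffer_size: specifies the write I/O strategy during the index build. The default value is 0 (disabled). The unit is blocks, and the maximum value is 8192. The recommended value is 4096.
    • When the parameter is disabled, index pages that are full are written to disk one block at a time during index creation.
    • When the parameter is enabled, the kernel keeps a buffer of polar_bt_write_page_buffer_size blocks. Index pages that need to be written are merged in this buffer and written together, which avoids the overhead of frequently scheduling I/O. This improves index creation performance by an additional 20%.
-- Enable MPP-accelerated index creation.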
+SET polar_px_enable_btbuild = on;
+
+-- Create an index with the following syntax
+CREATE INDEX t ON test(id) WITH(px_build = ON);
+
+-- Show the table structure
+\d test
+               Table "public.test"
+ Column |  Type   | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ id     | integer |           |          |
+ id2    | integer |           |          |
+Indexes:
+    "t" btree (id) WITH (px_build=finish)
+
`,82);function x(p,L){const t=l("ArticleInfo"),e=l("router-link");return i(),c("div",null,[B,n(t,{frontmatter:p.$frontmatter},null,8,["frontmatter"]),s("nav",T,[s("ul",null,[s("li",null,[n(e,{to:"#背景"},{default:o(()=>[a("背景")]),_:1})]),s("li",null,[n(e,{to:"#原理"},{default:o(()=>[a("原理")]),_:1}),s("ul",null,[s("li",null,[n(e,{to:"#架构特性"},{default:o(()=>[a("架构特性")]),_:1})]),s("li",null,[n(e,{to:"#分布式-mpp-执行引擎"},{default:o(()=>[a("分布式 MPP 执行引擎")]),_:1})]),s("li",null,[n(e,{to:"#serverless-弹性扩展"},{default:o(()=>[a("Serverless 弹性扩展")]),_:1})]),s("li",null,[n(e,{to:"#消除倾斜"},{default:o(()=>[a("消除倾斜")]),_:1})])])]),s("li",null,[n(e,{to:"#tpc-h-性能对比"},{default:o(()=>[a("TPC-H 性能对比")]),_:1}),s("ul",null,[s("li",null,[n(e,{to:"#单机并行-vs-分布式-mpp"},{default:o(()=>[a("单机并行 vs 分布式 MPP")]),_:1})]),s("li",null,[n(e,{to:"#polardb-vs-传统-mpp-数据库"},{default:o(()=>[a("PolarDB vs 传统 MPP 数据库")]),_:1})])])]),s("li",null,[n(e,{to:"#功能特性"},{default:o(()=>[a("功能特性")]),_:1}),s("ul",null,[s("li",null,[n(e,{to:"#parallel-query-并行查询"},{default:o(()=>[a("Parallel Query 并行查询")]),_:1})]),s("li",null,[n(e,{to:"#parallel-dml"},{default:o(()=>[a("Parallel DML")]),_:1})]),s("li",null,[n(e,{to:"#索引构建加速"},{default:o(()=>[a("索引构建加速")]),_:1})])])]),s("li",null,[n(e,{to:"#使用说明"},{default:o(()=>[a("使用说明")]),_:1}),s("ul",null,[s("li",null,[n(e,{to:"#使用-mpp-进行分析型查询"},{default:o(()=>[a("使用 MPP 进行分析型查询")]),_:1})]),s("li",null,[n(e,{to:"#使用-mpp-进行分区表查询"},{default:o(()=>[a("使用 MPP 进行分区表查询")]),_:1})]),s("li",null,[n(e,{to:"#使用-mpp-加速索引创建"},{default:o(()=>[a("使用 MPP 加速索引创建")]),_:1})])])])])]),D])}const A=r(y,[["render",x],["__file","arch-htap.html.vue"]]);export{A as default}; diff --git a/assets/arch-overview.html-c15ab6a4.js b/assets/arch-overview.html-c15ab6a4.js new file mode 100644 index 00000000000..1563d5949ad --- /dev/null +++ b/assets/arch-overview.html-c15ab6a4.js @@ -0,0 +1 @@ +const 
l=JSON.parse('{"key":"v-7ce47b0b","path":"/zh/theory/arch-overview.html","title":"特性总览","lang":"zh-CN","frontmatter":{"author":"北侠","date":"2021/08/24","minute":35},"headers":[{"level":2,"title":"传统数据库的问题","slug":"传统数据库的问题","link":"#传统数据库的问题","children":[]},{"level":2,"title":"PolarDB 云原生数据库的优势","slug":"polardb-云原生数据库的优势","link":"#polardb-云原生数据库的优势","children":[]},{"level":2,"title":"PolarDB 整体架构概述","slug":"polardb-整体架构概述","link":"#polardb-整体架构概述","children":[{"level":3,"title":"存储计算分离架构概述","slug":"存储计算分离架构概述","link":"#存储计算分离架构概述","children":[]},{"level":3,"title":"HTAP 架构概述","slug":"htap-架构概述","link":"#htap-架构概述","children":[]}]},{"level":2,"title":"PolarDB:存储计算分离架构详解","slug":"polardb-存储计算分离架构详解","link":"#polardb-存储计算分离架构详解","children":[{"level":3,"title":"Shared-Storage 带来的挑战","slug":"shared-storage-带来的挑战","link":"#shared-storage-带来的挑战","children":[]},{"level":3,"title":"架构原理","slug":"架构原理","link":"#架构原理","children":[]},{"level":3,"title":"数据一致性","slug":"数据一致性","link":"#数据一致性","children":[]},{"level":3,"title":"低延迟复制","slug":"低延迟复制","link":"#低延迟复制","children":[]},{"level":3,"title":"Recovery 优化","slug":"recovery-优化","link":"#recovery-优化","children":[]}]},{"level":2,"title":"PolarDB:HTAP 架构详解","slug":"polardb-htap-架构详解","link":"#polardb-htap-架构详解","children":[{"level":3,"title":"HTAP 架构原理","slug":"htap-架构原理","link":"#htap-架构原理","children":[]},{"level":3,"title":"分布式优化器","slug":"分布式优化器","link":"#分布式优化器","children":[]},{"level":3,"title":"算子并行化","slug":"算子并行化","link":"#算子并行化","children":[]},{"level":3,"title":"消除数据倾斜问题","slug":"消除数据倾斜问题","link":"#消除数据倾斜问题","children":[]},{"level":3,"title":"SQL 级别弹性扩展","slug":"sql-级别弹性扩展","link":"#sql-级别弹性扩展","children":[]},{"level":3,"title":"事务一致性","slug":"事务一致性","link":"#事务一致性","children":[]},{"level":3,"title":"TPC-H 性能:加速比","slug":"tpc-h-性能-加速比","link":"#tpc-h-性能-加速比","children":[]},{"level":3,"title":"TPC-H 性能:和传统 MPP 
数据库的对比","slug":"tpc-h-性能-和传统-mpp-数据库的对比","link":"#tpc-h-性能-和传统-mpp-数据库的对比","children":[]},{"level":3,"title":"分布式执行加速索引创建","slug":"分布式执行加速索引创建","link":"#分布式执行加速索引创建","children":[]},{"level":3,"title":"分布式并行执行加速多模:时空数据库","slug":"分布式并行执行加速多模-时空数据库","link":"#分布式并行执行加速多模-时空数据库","children":[]}]},{"level":2,"title":"总结","slug":"总结","link":"#总结","children":[]}],"git":{"updatedTime":1688442053000},"filePathRelative":"zh/theory/arch-overview.md"}');export{l as data}; diff --git a/assets/arch-overview.html-c2599ebc.js b/assets/arch-overview.html-c2599ebc.js new file mode 100644 index 00000000000..94504dbd9b1 --- /dev/null +++ b/assets/arch-overview.html-c2599ebc.js @@ -0,0 +1 @@ +import{_ as l,a as d,b as h,c as p}from"./9_future_pages-13873b1a.js";import{_ as c,r as s,o as u,c as m,d as a,a as e,w as r,b as t,e as g}from"./app-3d1677bf.js";const f="/PolarDB-for-PostgreSQL/assets/2_compute-storage_separation_architecture-150c6ffc.png",y="/PolarDB-for-PostgreSQL/assets/3_HTAP_architecture-43d8e225.png",b="/PolarDB-for-PostgreSQL/assets/4_principles_of_shared_storage-1ff0f380.png",P="/PolarDB-for-PostgreSQL/assets/5_In-memory_page_synchronization-9737c89d.png",_="/PolarDB-for-PostgreSQL/assets/8_solution_to_outdated_pages_LogIndex-b696c625.png",w="/PolarDB-for-PostgreSQL/assets/10_solutions_to_future_pages-f6d8bc5c.png",v="/PolarDB-for-PostgreSQL/assets/11_issues_of_conventional_streaming_replication-79eab5de.png",T="/PolarDB-for-PostgreSQL/assets/12_Replicate_only_metadata_of_WAL_records-092ef5f2.png",L="/PolarDB-for-PostgreSQL/assets/13_optimization1_result-98261cb3.png",D="/PolarDB-for-PostgreSQL/assets/14_optimize_log_apply_of_WAL_records-e19cfea8.png",A="/PolarDB-for-PostgreSQL/assets/15_optimization2_result-5c124fdf.png",x="/PolarDB-for-PostgreSQL/assets/16_optimize_log_apply_of_DDL_locks-0e74ca0c.png",B="/PolarDB-for-PostgreSQL/assets/17_optimization3_result-c08ad12d.png",S="/PolarDB-for-PostgreSQL/assets/18_recovery_optimization_background-a9ce115d.png",W="/PolarDB-for-
PostgreSQL/assets/19_lazy_recovery-f16bb60b.png",I="/PolarDB-for-PostgreSQL/assets/20_recovery_optimization_result-80832b6f.png",k="/PolarDB-for-PostgreSQL/assets/21_Persistent_BufferPool-bd6c06a2.png",z="/PolarDB-for-PostgreSQL/assets/22_buffer_pool_structure-a755d484.png",O="/PolarDB-for-PostgreSQL/assets/23_persistent_buffer_pool_result-abf85155.png",C="/PolarDB-for-PostgreSQL/assets/24_principles_of_HTAP-2f3b912c.png",q="/PolarDB-for-PostgreSQL/assets/25_distributed_optimizer-153c6304.png",Q="/PolarDB-for-PostgreSQL/assets/26_parallelism_of_operators-d53ecbd5.png",H="/PolarDB-for-PostgreSQL/assets/27_parallelism_of_operators_result-ab7b692f.png",R="/PolarDB-for-PostgreSQL/assets/28_data_skew-4fce9edd.png",M="/PolarDB-for-PostgreSQL/assets/29_Solve_data_skew_result-d1f5cd26.png",E="/PolarDB-for-PostgreSQL/assets/30_SQL_statement-level_scalability-03086846.png",N="/PolarDB-for-PostgreSQL/assets/31_schedule_workloads-1e37f980.png",F="/PolarDB-for-PostgreSQL/assets/32_transactional_consistency-0f80c9d0.png",G="/PolarDB-for-PostgreSQL/assets/33_TPC-H_performance_Speedup1-bea777d8.png",Y="/PolarDB-for-PostgreSQL/assets/34_TPC-H_performance_Speedup2-57228502.png",j="/PolarDB-for-PostgreSQL/assets/35_TPC-H_performance_Speedup3-6e2b1a40.png",V="/PolarDB-for-PostgreSQL/assets/36_TPC-H_performance_Comparison_with_mpp1-265dba6a.png",U="/PolarDB-for-PostgreSQL/assets/37_TPC-H_performance_Comparison_with_mpp2-e0571d47.png",X="/PolarDB-for-PostgreSQL/assets/38_Index_creation_accelerated_by_PX-cc3737a1.png",$="/PolarDB-for-PostgreSQL/assets/39_Index_creation_accelerated_by_PX2-0c310510.png",K="/PolarDB-for-PostgreSQL/assets/40_spatio-temporal_databases-8411c32e.png",J="/PolarDB-for-PostgreSQL/assets/41_spatio-temporal_databases_result-33628595.png",Z={},ee=e("h1",{id:"overview",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#overview","aria-hidden":"true"},"#"),t(" Overview")],-1),ae=e("p",null,"PolarDB for PostgreSQL (hereafter simplified as PolarDB) is a stable, 
reliable, scalable, highly available, and secure enterprise-grade database service that is independently developed by Alibaba Cloud to help you increase security compliance and cost-effectiveness. PolarDB is 100% compatible with PostgreSQL. It runs in a proprietary compute-storage separation architecture of Alibaba Cloud to support the horizontal scaling of the storage and computing capabilities.",-1),te=e("p",null,"PolarDB can process a mix of online transaction processing (OLTP) workloads and online analytical processing (OLAP) workloads in parallel. PolarDB also provides a wide range of innovative multi-model database capabilities to help you process, analyze, and search for diversified data, such as spatio-temporal, GIS, image, vector, and graph data.",-1),oe=e("p",null,"PolarDB supports various deployment architectures. For example, PolarDB supports compute-storage separation, three-node X-Paxos clusters, and local SSDs.",-1),re={class:"table-of-contents"},se=g('

Issues in Conventional Database Systems

If you are using a conventional database system and the complexity of your workloads continues to increase, you may face the following challenges as the amount of your business data grows:

  1. The storage capacity is limited by the maximum storage capacity of a single host.
  2. You can increase the read capability of your database system only by creating read-only instances. Each read-only instance must be allocated a specific amount of exclusive storage space, which increases costs.
  3. The time that is required to create a read-only instance increases due to the increase in the amount of data.
  4. The latency of data replication between the primary instance and the secondary instance is high.

Benefits of PolarDB

image.png

To help you resolve the issues that occur in conventional database systems, Alibaba Cloud provides PolarDB. PolarDB runs in a proprietary compute-storage separation architecture of Alibaba Cloud. This architecture has the following benefits:

  1. Scalability: Computing is separated from storage. You can flexibly scale out the computing cluster or the storage cluster based on your business requirements.
  2. Cost-effectiveness: All compute nodes share the same physical storage. This significantly reduces costs.
  3. Easy to use: Each PolarDB cluster consists of one primary node and one or more read-only nodes to support read/write splitting.
  4. Reliability: Data is stored in triplicate, and a backup can be finished in seconds.

A Guide to This Document

PolarDB is integrated with various technologies and innovations. This document describes the following two aspects of the PolarDB architecture in sequence: compute-storage separation and hybrid transactional/analytical processing (HTAP). You can find and read the content of your interest with ease.

  • Compute-storage separation is the foundation of the PolarDB architecture. Conventional database systems run in the shared-nothing architecture, in which each instance is allocated independent computing resources and storage resources. As conventional database systems evolve towards compute-storage separation, database engine developers face challenges in managing executors, transactions, and buffers. PolarDB is designed to help you address these challenges.
  • HTAP is designed to support OLAP queries in OLTP scenarios and fully utilize the computing capabilities of multiple read-only nodes. HTAP is achieved by using a shared storage-based massively parallel processing (MPP) architecture. In the shared storage-based MPP architecture, each table or index tree is stored as a whole and is not divided into virtual partitions that are stored on different nodes. This way, you can retain the workflows used in OLTP scenarios. In addition, you can use the shared storage-based MPP architecture without the need to modify your application data.

This section explains the following two aspects of the PolarDB architecture: compute-storage separation and HTAP.

Compute-Storage Separation

image.png

PolarDB supports compute-storage separation. Each PolarDB cluster consists of a computing cluster and a storage cluster. You can flexibly scale out the computing cluster or the storage cluster based on your business requirements.

  1. If the computing power is insufficient, you can scale out only the computing cluster.
  2. If the storage capacity is insufficient, you can scale out only the storage cluster.

After the shared-storage architecture is used in PolarDB, the primary node and the read-only nodes share the same physical storage. If the primary node still uses the method that is used in conventional database systems to flush write-ahead logging (WAL) records, the following issues may occur.

  1. The pages that the read-only nodes read from the shared storage are outdated pages. Outdated pages are pages that are of earlier versions than the versions that are recorded on the read-only nodes.
  2. The pages that the read-only nodes read from the shared storage are future pages. Future pages are pages that are of later versions than the versions that are recorded on the read-only nodes.
  3. When your workloads are switched over from the primary node to a read-only node, the pages that the read-only node reads from the shared storage are outdated pages. In this case, the read-only node needs to read and apply WAL records to restore dirty pages.

To resolve the first issue, PolarDB must support multiple versions for each page. To resolve the second issue, PolarDB must control the speed at which the primary node flushes WAL records.

HTAP

When read/write splitting is enabled, each individual compute node cannot fully utilize the high I/O throughput that is provided by the shared storage. In addition, you cannot accelerate large queries by adding computing resources. To resolve these issues, PolarDB uses the shared storage-based MPP architecture to accelerate OLAP queries in OLTP scenarios.

PolarDB supports a complete suite of data types that are used in OLTP scenarios. PolarDB also supports two computing engines, which can process these types of data:

  • Standalone execution engine: processes highly concurrent OLTP queries.
  • Distributed execution engine: processes large OLAP queries.

image.png

When the same hardware resources are used, PolarDB delivers 90% of the performance of a traditional MPP database. PolarDB also provides SQL statement-level scalability. If the computing power of your PolarDB cluster is insufficient, you can allocate more CPU resources to OLAP queries without the need to rearrange data.

The following sections provide more details about compute-storage separation and HTAP.

PolarDB: Compute-Storage Separation

Challenges of Shared Storage

Compute-storage separation enables the compute nodes of your PolarDB cluster to share the same physical storage. Shared storage brings the following challenges:

  • Data consistency: how to ensure consistency between N copies of data in the computing cluster and 1 copy of data in the storage cluster.
  • Read/write splitting: how to replicate data at a low latency.
  • High availability: how to perform recovery and failover.
  • I/O model: how to optimize the file system from buffered I/O to direct I/O.

Basic Principles of Shared Storage

image.png

The following basic principles of shared storage apply to PolarDB:

  • The primary node can process read requests and write requests. The read-only nodes can process only read requests.
  • Only the primary node can write data to the shared storage. This way, the data that you query on the primary node is the same as the data that you query on the read-only nodes.
  • The read-only nodes apply WAL records to ensure that the pages in the memory of the read-only nodes are synchronous with the pages in the memory of the primary node.
  • The primary node writes WAL records to the shared storage, and only the metadata of the WAL records is replicated to the read-only nodes.
  • The read-only nodes read WAL records from the shared storage and apply the WAL records.

Data Consistency

In-memory Page Synchronization in Shared-nothing Architecture

In a conventional database system, the primary instance and the read-only instances are each allocated independent memory resources and storage resources. The primary instance replicates WAL records to the read-only instances, and the read-only instances read and apply the WAL records. This approach is based on the replicated state machine model.

In-memory Page Synchronization in Shared-storage Architecture

In a PolarDB cluster, the primary node writes WAL records to the shared storage. The read-only nodes read the most recent WAL records from the shared storage and apply them to ensure that the pages in the memory of the read-only nodes are synchronous with the pages in the memory of the primary node.

image.png

  1. The primary node flushes the WAL records of a page to write version 200 of the page to the shared storage.
  2. The read-only nodes read and apply the WAL records of the page to update the page from version 100 to version 200.

Outdated Pages in Shared-storage Architecture

In the workflow shown in the preceding figure, the new page that the read-only nodes obtain by applying WAL records is removed from the buffer pools of the read-only nodes. When you query the page on the read-only nodes, the read-only nodes read the page from the shared storage. As a result, only the previous version of the page is returned. This previous version is called an outdated page. The following figure shows more details.

image.png

  1. At T1, the primary node writes a WAL record with a log sequence number (LSN) of 200 to the memory to update Page 1 from version 500 to version 600.
  2. At T1, Page 1 on the read-only nodes is in version 500.
  3. At T2, the primary node sends the metadata of WAL Record 200 to the read-only nodes to notify the read-only nodes of a new WAL record.
  4. At T3, you query Page 1 on the read-only nodes. The read-only nodes read version 500 of Page 1 and WAL Record 200 and apply WAL Record 200 to update Page 1 from version 500 to version 600.
  5. At T4, the read-only nodes remove version 600 of Page 1 because their buffer pools cannot provide sufficient space.
  6. The primary node does not write version 600 of Page 1 to the shared storage. The most recent version of Page 1 in the shared storage is still version 500.
  7. At T5, you query Page 1 on the read-only nodes. The read-only nodes read Page 1 from the shared storage because Page 1 has been removed from the memory of the read-only nodes. In this case, the outdated version 500 of Page 1 is returned.
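
The timeline above can be replayed as a toy model. This is a minimal sketch, not PolarDB code: the dictionaries and the `naive_read` helper are hypothetical names, and page versions are plain integers.

```python
# Toy model of the T1..T5 timeline; every name here is illustrative.
shared_storage = {"page1": 500}          # version persisted in the shared storage
replica_buffer = {"page1": 500}          # buffer pool of a read-only node
wal_in_storage = {200: ("page1", 600)}   # LSN -> (page, version after apply)

# T2/T3: the read-only node learns about LSN 200 and applies it in memory.
page_id, new_version = wal_in_storage[200]
replica_buffer[page_id] = new_version

# T4: buffer pool pressure evicts the freshly updated page.
del replica_buffer["page1"]


def naive_read(page):
    """Read path without any per-page WAL bookkeeping (the problematic case)."""
    if page not in replica_buffer:
        # Falls back to the shared storage, which the primary has not updated.
        replica_buffer[page] = shared_storage[page]
    return replica_buffer[page]


# T5: the read returns the outdated version.
stale_version = naive_read("page1")
```

The read at T5 has no record that LSN 200 ever applied to Page 1, which is why the mapping from pages to WAL records has to survive eviction.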

Solution to Outdated Pages

When you query a page on the read-only nodes at a specific point in time, the read-only nodes need to read the base version of the page and the WAL records up to that point in time. Then, the read-only nodes need to apply the WAL records one by one in sequence. The following figure shows more details.

image.png

  1. The metadata of the WAL records of each page is retained in the memory of the read-only nodes.
  2. When you query a page on the read-only nodes, the read-only nodes need to read and apply the WAL records of the page until the read-only nodes obtain the most recent version of the page.
  3. The read-only nodes read and apply WAL records from the shared storage based on the metadata of the WAL records.

PolarDB needs to maintain an inverted index that stores the mapping from each page to the WAL records of the page. However, the memory capacity of each read-only node is limited. Therefore, this inverted index must be persistently stored. To meet this requirement, PolarDB provides LogIndex, an index structure that persistently stores this mapping as hash data.

  1. The WAL receiver processes of the read-only nodes receive the metadata of WAL records from the primary node.
  2. The metadata of each WAL record contains information about which page is updated.
  3. The read-only nodes insert the metadata of each WAL record into a LogIndex structure to generate a LogIndex record. The key of the LogIndex record is the ID of the page that is updated, and the value of the LogIndex record is the LSN of the WAL record.
  4. One WAL record may update multiple pages, for example when an index block is split. In this case, one WAL record maps to multiple LogIndex records.
  5. The read-only nodes mark each updated page as outdated in their buffer pools. When you query an updated page on the read-only nodes, the read-only nodes can read and apply the WAL records of the page based on the LogIndex records that map the WAL records.
  6. When the memory usage of the read-only nodes reaches a specific threshold, the hash data that is stored in LogIndex structures is asynchronously flushed from the memory to the disk.

image.png

LogIndex helps prevent outdated pages and enables the read-only nodes to run in lazy log apply mode. In the lazy log apply mode, the read-only nodes apply only the metadata of the WAL records for dirty pages.
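
The page-to-LSN mapping can be sketched as follows. This is an in-memory illustration only: the `LogIndex` class, its method names, and the assumption that LSNs arrive in ascending order are all hypothetical, and persistence to disk is omitted.

```python
from collections import defaultdict


class LogIndex:
    """Toy in-memory index: page ID -> ascending list of WAL LSNs."""

    def __init__(self):
        self._lsns_by_page = defaultdict(list)

    def insert(self, lsn, page_ids):
        # One WAL record may touch several pages, so it can yield several entries.
        for page_id in page_ids:
            self._lsns_by_page[page_id].append(lsn)

    def lsns_to_apply(self, page_id, page_lsn, read_lsn):
        """Return LSNs newer than the page's base version, up to the read point."""
        candidates = self._lsns_by_page.get(page_id, [])
        return [lsn for lsn in candidates if page_lsn < lsn <= read_lsn]


index = LogIndex()
index.insert(200, ["page1"])
index.insert(300, ["page1", "page2"])  # a record that updates two pages
index.insert(400, ["page2"])

pending_page1 = index.lsns_to_apply("page1", page_lsn=100, read_lsn=300)
pending_page2 = index.lsns_to_apply("page2", page_lsn=300, read_lsn=400)
```

Keying by page ID means a read only has to look at the records for the page it needs, instead of scanning the whole WAL stream.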

Future Pages in Shared-storage Architecture

The read-only nodes may return future pages, whose versions are later than the versions that are recorded on the read-only nodes. The following figure shows more details.

image.png

  1. At T1, the primary node updates Page 1 twice from version 500 to version 700. Two WAL records are generated during the update process. The LSN of one WAL record is 200, and the LSN of the other WAL record is 300. At this time, Page 1 is still in version 500 on the primary node and the read-only nodes.
  2. At T2, the primary node sends WAL Record 200 to the read-only nodes.
  3. At T3, the read-only nodes apply WAL Record 200 to update Page 1 to version 600. At this time, the read-only nodes have not read or applied WAL Record 300.
  4. At T4, the primary node writes version 700 of Page 1 to the shared storage. At the same time, Page 1 is removed from the buffer pools of the read-only nodes.
  5. At T5, the read-only nodes attempt to read Page 1 again. Page 1 cannot be found in the buffer pools of the read-only nodes. Therefore, the read-only nodes obtain version 700 of Page 1 from the shared storage. Version 700 of Page 1 is a future page to the read-only nodes because the read-only nodes have not read or applied WAL Record 300.
  6. If some of the pages that the read-only nodes obtain from the shared storage are future pages and some are normal pages, data inconsistencies may occur. For example, after an index block is split into two blocks, the read-only nodes may read one of the resulting pages as a normal page and the other as a future page. In this case, the B+ tree structure of the index is damaged.

Solutions to Future Pages

The read-only nodes apply WAL records at high speeds in lazy apply mode. However, the speeds may still be lower than the speed at which the primary node flushes WAL records. If the primary node flushes WAL records faster than the read-only nodes apply WAL records, future pages are returned. To prevent future pages, PolarDB must ensure that the speed at which the primary node flushes WAL records does not exceed the speeds at which the read-only nodes apply WAL records. The following figure shows more details.

image.png

  1. The read-only nodes apply the WAL record that is generated at T4.
  2. When the primary node flushes WAL records to the shared storage, it sorts all WAL records by LSN and flushes only the WAL records that are updated up to T4.
  3. The file position of the LSN that is generated at T4 is defined as the file position of consistency.
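
The flush limit can be sketched as a small calculation. This sketch makes two assumptions that the text does not state: the limit is taken as the minimum apply position across all read-only nodes, and the items to flush are modelled as dirty pages tagged with the LSN of their latest change. The function names are hypothetical.

```python
def consistency_lsn(replica_apply_lsns):
    """Highest LSN that every read-only node has already applied."""
    return min(replica_apply_lsns.values())


def flushable(dirty_pages, limit_lsn):
    """Pages that can be flushed without getting ahead of any read-only node.

    The result is ordered by LSN, mirroring the sort step described above.
    """
    eligible = [page for page, lsn in dirty_pages.items() if lsn <= limit_lsn]
    return sorted(eligible, key=lambda page: dirty_pages[page])


replica_apply_lsns = {"ro1": 300, "ro2": 250}
dirty_pages = {"page1": 300, "page2": 200, "page3": 250}

limit = consistency_lsn(replica_apply_lsns)
to_flush = flushable(dirty_pages, limit)
```

With these values, `page1` stays in memory on the primary node until the slower read-only node catches up to LSN 300.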

Low-latency Replication

Issues of Conventional Streaming Replication

  1. The I/O loads on the log synchronization link are heavy, and a large amount of data is transmitted over the network.
  2. When the read-only nodes process I/O-bound workloads or CPU-bound workloads, they read pages and modify the pages in their buffer pools at low speeds.
  3. When file- and data-related DDL operations attempt to acquire locks on specific objects, blocking exceptions may occur. As a result, the operations are run at low speeds.
  4. When the read-only nodes process highly concurrent queries, transaction snapshots are taken at low speeds.

The following figure shows the workflow of conventional streaming replication.

image.png

  1. The primary node writes WAL records to its local file system.
  2. The WAL sender process of the primary node reads and sends the WAL records to the read-only nodes.
  3. The WAL receiver processes of the read-only nodes receive and write the WAL records to the local file systems of the read-only nodes.
  4. The read-only nodes read the WAL records, write the updated pages to their buffer pools, and then apply the WAL records in the memory.
  5. The primary node flushes the WAL records to the shared storage.

The full path is long, and the latency on the read-only nodes is high. This may cause an imbalance between the read loads and write loads over the read/write splitting link.

Optimization Method 1: Replicate Only the Metadata of WAL Records

The read-only nodes can read WAL records from the shared storage. Therefore, the primary node can remove the payloads of WAL records and send only the metadata of WAL records to the read-only nodes. This alleviates the pressure on network transmission and reduces the I/O loads on critical paths. The following figure shows more details.

  1. Each WAL record consists of three parts: header, page ID, and payload. The header and the page ID comprise the metadata of a WAL record.
  2. The primary node replicates only the metadata of WAL records to the read-only nodes.
  3. The read-only nodes read WAL records from the shared storage based on the metadata of the WAL records.

image.png

This optimization method significantly reduces the amount of data that needs to be transmitted between the primary node and the read-only nodes. The amount of data that needs to be transmitted decreases by 98%, as shown in the following figure.
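
The split between metadata and payload can be illustrated with a small sketch. The `WalRecord` class and the byte sizes are made up for illustration and are not the PostgreSQL or PolarDB record layout, so the computed ratio depends entirely on these assumed sizes and is not a reproduction of the measured result.

```python
from dataclasses import dataclass


@dataclass
class WalRecord:
    header: bytes
    page_id: bytes
    payload: bytes

    def metadata(self) -> bytes:
        # What the primary node sends to the read-only nodes.
        return self.header + self.page_id

    def full(self) -> bytes:
        # What conventional streaming replication would send.
        return self.header + self.page_id + self.payload


# Illustrative sizes only: 24-byte header, 20-byte page ID, 2000-byte payload.
records = [WalRecord(b"h" * 24, b"p" * 20, b"x" * 2000) for _ in range(100)]

full_bytes = sum(len(r.full()) for r in records)
meta_bytes = sum(len(r.metadata()) for r in records)
saved_fraction = 1 - meta_bytes / full_bytes
```

The larger the payload is relative to the header and page ID, the more network traffic is saved, because the payload is read from the shared storage instead.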

image.png

Optimization Method 2: Optimize the Log Apply of WAL Records

Conventional database systems need to read a large number of pages, apply WAL records to these pages one by one, and then flush the updated pages to the disk. In PolarDB, the shared storage lets the read-only nodes take this read I/O off the critical path: if the page that a WAL record updates is not in the buffer pool of a read-only node, no I/O is generated and only a LogIndex record is recorded.

The following I/O operations that are performed by log apply processes can be offloaded to session processes:

  1. Data page-related I/O operations
  2. I/O operations to apply WAL records
  3. I/O operations to apply multiple versions of pages based on LogIndex records

In the example shown in the following figure, when the log apply process of a read-only node applies the metadata of a WAL record of a page:

image.png

  1. If the page cannot be hit in the memory, only the LogIndex record that maps the WAL record is recorded.
  2. If the page can be hit in the memory, the page is marked as outdated and the LogIndex record that maps the WAL record is recorded. The log apply process is complete.
  3. When you start a session process to read the page, the session process reads and writes the most recent version of the page to the buffer pool. Then, the session process applies the WAL record that maps the LogIndex record.
  4. Major I/O operations are no longer run by a single log apply process. These operations are offloaded to multiple user processes.
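
The division of work between the log apply process and the session processes can be sketched as follows. This is a simplified model under stated assumptions: `ReadOnlyNode` and its methods are hypothetical, a WAL record simply replaces the whole page value, and concurrency and locking are ignored.

```python
class ReadOnlyNode:
    def __init__(self, shared_pages, shared_wal):
        self.shared_pages = shared_pages  # page_id -> (lsn, value) in shared storage
        self.shared_wal = shared_wal      # lsn -> (page_id, new_value)
        self.buffer = {}                  # page_id -> {"lsn", "value", "outdated"}
        self.log_index = {}               # page_id -> [lsn, ...]

    def apply_metadata(self, lsn, page_id):
        """Log apply process: no page I/O, only bookkeeping."""
        self.log_index.setdefault(page_id, []).append(lsn)
        if page_id in self.buffer:
            self.buffer[page_id]["outdated"] = True

    def read_page(self, page_id):
        """Session process: load the page if needed, then replay pending WAL."""
        if page_id not in self.buffer:
            lsn, value = self.shared_pages[page_id]
            self.buffer[page_id] = {"lsn": lsn, "value": value, "outdated": True}
        page = self.buffer[page_id]
        if page["outdated"]:
            for lsn in self.log_index.get(page_id, []):
                if lsn > page["lsn"]:
                    _, page["value"] = self.shared_wal[lsn]
                    page["lsn"] = lsn
            page["outdated"] = False
        return page["value"]


node = ReadOnlyNode(
    shared_pages={"page1": (100, "v500")},
    shared_wal={200: ("page1", "v600")},
)
node.apply_metadata(200, "page1")     # page not cached: only the index is updated
buffer_after_apply = dict(node.buffer)
first_read = node.read_page("page1")  # replay happens in the reading session

node.buffer.clear()                   # simulate eviction of the updated page
second_read = node.read_page("page1")
```

Because the index entry outlives the eviction, the second read replays LSN 200 again and does not return the outdated version from the shared storage.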

This optimization method significantly reduces the log apply latency and increases the log apply speed by 30 times compared with Amazon Aurora.

image.png

Optimization Method 3: Optimize the Log Apply of DDL Locks

When the primary node runs a DDL operation such as DROP TABLE to modify a table, the primary node acquires an exclusive DDL lock on the table. The exclusive DDL lock is replicated to the read-only nodes along with WAL records. The read-only nodes apply the WAL records to acquire the exclusive DDL lock on the table. This ensures that the table cannot be deleted by the primary node when a read-only node is reading the table. Only one copy of the table is stored in the shared storage.

When the applying process of a read-only node applies the exclusive DDL lock, the read-only node may require a long period of time to acquire the exclusive DDL lock on the table. You can optimize the critical path of the log apply process by offloading the task of acquiring the exclusive DDL lock to other processes.

image.png

This optimization method ensures that the critical path of the log apply process of a read-only node is not blocked even if the log apply process needs to wait for the release of an exclusive DDL lock.

image.png

The three optimization methods in combination significantly reduce replication latency and have the following benefits:

  • Read/write splitting: Loads are balanced, which allows PolarDB to deliver a user experience that is comparable to Oracle Real Application Clusters (RAC).
  • High availability: The time that is required for failover is reduced.
  • Stability: The number of future pages is minimized, and fewer or even no page snapshots need to be taken.

Recovery Optimization

Background Information

If the read-only nodes apply WAL records at low speeds, your PolarDB cluster may require a long period of time to recover from exceptions such as out of memory (OOM) errors and unexpected crashes. When the direct I/O model is used for the shared storage, the severity of this issue increases.

image.png

Lazy Recovery

The preceding sections explain how LogIndex enables the read-only nodes to apply WAL records in lazy log apply mode. In general, the recovery process of the primary node after a restart is the same as the process in which the read-only nodes apply WAL records. In this sense, the lazy log apply mode can also be used to accelerate the recovery of the primary node.

image.png

  1. The primary node begins to apply WAL records in lazy log apply mode one by one starting from a specific checkpoint.
  2. After the primary node applies all LogIndex records, the log apply is complete.
  3. After the recovery is complete, the primary node starts to run.
  4. The actual log apply workloads are offloaded to the session process that is started after the primary node restarts.

The example in the following figure shows how the optimized recovery method significantly reduces the time that is required to apply 500 MB of WAL records.

image.png

Persistent Buffer Pool

After the primary node recovers, a session process may need to apply WAL records to the pages that it reads. While session processes are applying WAL records to pages, the primary node responds at low speeds for a short period of time. To resolve this issue, PolarDB does not delete pages from the buffer pool of the primary node if the primary node restarts or unexpectedly crashes.

image.png

The shared memory of the database engine consists of the following two parts:

  1. One part is used to store global structures and ProcArray structures.
  2. The other part is used to store buffer pool structures. The buffer pool is allocated as a specific amount of named shared memory. Therefore, the buffer pool remains valid after the primary node restarts. However, global structures need to be reinitialized after the primary node restarts.

image.png

Not all pages in the buffer pool of the primary node can be reused. For example, if a process acquires an exclusive lock on a page before the primary node restarts and then unexpectedly crashes, no other processes can release the exclusive lock on the page. Therefore, after the primary node unexpectedly crashes or restarts, it needs to traverse all pages in its buffer pool to identify and remove the pages that cannot be reused. In addition, the recycling of buffer pools depends on Kubernetes.
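
The sweep over the surviving buffer pool can be sketched as a filter. The descriptor fields and the `sweep_after_restart` function are hypothetical, and a held exclusive lock is used here as the only reason a page cannot be reused, following the example above.

```python
def sweep_after_restart(buffer_pool):
    """Keep reusable pages; drop pages left in a state no process can clean up."""
    kept = {}
    dropped = []
    for page_id, descriptor in buffer_pool.items():
        if descriptor["exclusive_lock_held"]:
            # The lock owner died with the crash, so nobody can release it.
            dropped.append(page_id)
        else:
            kept[page_id] = descriptor
    return kept, dropped


surviving_pool = {
    "page1": {"exclusive_lock_held": False, "version": 600},
    "page2": {"exclusive_lock_held": True, "version": 710},
    "page3": {"exclusive_lock_held": False, "version": 450},
}
kept_pages, dropped_pages = sweep_after_restart(surviving_pool)
```

Dropped pages are read again from the shared storage on the next access, so the sweep trades a small amount of extra I/O for a warm buffer pool after the restart.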

This optimized buffer pool mechanism ensures the stable performance of your PolarDB cluster before and after a restart.

image.png

PolarDB HTAP

The shared storage of PolarDB is organized as a storage pool. When read/write splitting is enabled, the I/O throughput that the shared storage can deliver is in theory unlimited. However, a large query can run only on a single compute node, and the CPU, memory, and I/O capacity of a single compute node are limited. Therefore, a single compute node can neither fully utilize the high I/O throughput of the shared storage nor accelerate large queries by acquiring more computing resources. To resolve these issues, PolarDB uses a shared storage-based MPP architecture to accelerate OLAP queries in OLTP scenarios.

Basic Principles of HTAP

In a PolarDB cluster, the physical storage is shared among all compute nodes. Therefore, tables in PolarDB clusters cannot be scanned in the way that conventional MPP databases scan tables. PolarDB adds MPP support to its standalone execution engine and optimizes the shared storage for it. This shared storage-based MPP architecture is the first architecture of its kind in the industry. We recommend that you familiarize yourself with the following basic principles of this architecture before you use PolarDB:

  1. The Shuffle operator masks the data distribution.
  2. The ParallelScan operator masks the shared storage.

image.png

The preceding figure shows an example.

  1. Table A and Table B are joined and aggregated.
  2. Table A and Table B are still individual tables in the shared storage. These tables are not physically partitioned.
  3. Four types of scan operators are redesigned to scan tables in the shared storage as virtual partitions.

Distributed Optimizer

The GPORCA optimizer is extended with a set of transformation rules that are aware of shared storage. These rules enable the optimizer to explore the plan search space that is specific to shared storage. For example, PolarDB can scan a table as a whole or as different virtual partitions. This is a major difference between shared storage-based MPP and conventional MPP.

The modules in gray in the upper part of the following figure are modules of the database engine. These modules enable the database engine of PolarDB to adapt to the GPORCA optimizer.

The modules in the lower part of the following figure comprise the GPORCA optimizer. Among these modules, the modules in gray are extended modules, which enable the GPORCA optimizer to communicate with the shared storage of PolarDB.

image.png

Parallelism of Operators

Four types of operators in PolarDB require parallelism. This section uses the sequential scan (Seqscan) operator as a representative example. To fully utilize the I/O throughput that is supported by the shared storage, PolarDB splits each table into logical units of 4 MB during a sequential scan. This way, PolarDB spreads the I/O load across different disks, and the disks serve reads simultaneously to accelerate the sequential scan. In addition, each read-only node scans only part of the table files rather than all of them. As a result, the total size of the tables that can be cached is the combined size of the buffer pools of all read-only nodes.
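
The 4 MB split can be illustrated with a short sketch. The function names and the round-robin assignment are assumptions made for illustration, and the 8 KB block size is the PostgreSQL default rather than a value stated in this document.

```python
UNIT_SIZE = 4 * 1024 * 1024   # 4 MB logical unit, as described above
BLOCK_SIZE = 8 * 1024         # assumed page size (the PostgreSQL default)


def logical_units(table_bytes):
    """Return inclusive (first_block, last_block) ranges, one per 4 MB unit."""
    blocks_per_unit = UNIT_SIZE // BLOCK_SIZE
    total_blocks = -(-table_bytes // BLOCK_SIZE)   # ceiling division
    return [
        (start, min(start + blocks_per_unit, total_blocks) - 1)
        for start in range(0, total_blocks, blocks_per_unit)
    ]


def assign_round_robin(units, workers):
    """Static assignment: unit i goes to worker i modulo the number of workers."""
    plan = {worker: [] for worker in workers}
    for i, unit in enumerate(units):
        plan[workers[i % len(workers)]].append(unit)
    return plan


units = logical_units(10 * 1024 * 1024)            # a 10 MB table
plan = assign_round_robin(units, ["ro1", "ro2"])
```

Because each node only ever reads its own units, its buffer pool caches a disjoint part of the table, which is why the cacheable table size adds up across nodes.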

image.png

Parallelism has the following benefits, as shown in the following figure:

  1. Scan performance increases linearly, by up to 30 times, as read-only nodes are added.
  2. With buffering enabled, the time that is required for a scan drops from 37 minutes to 3.75 seconds.

image.png

Solve the Issue of Data Skew

Data skew is a common issue in conventional MPP:

  1. In PolarDB, large objects are stored in TOAST tables that heap tables reference. The load cannot be balanced regardless of whether you shard the TOAST tables or the heap tables.
  2. In addition, the transactions, buffer pools, network connections, and I/O loads of the read-only nodes jitter.
  3. The preceding issues cause long-tail processes.

image.png

  1. The coordinator node consists of two parts: DataThread and ControlThread.
  2. DataThread collects and aggregates tuples.
  3. ControlThread controls the scan progress of each scan operator.
  4. A worker thread that scans data at a high speed can scan more logical data shards.
  5. The affinity of buffers must be considered.

Although scan tasks are distributed dynamically, PolarDB preserves buffer affinity as far as possible. In addition, the context of each operator is stored in the private memory of the worker threads. The coordinator node does not store the information about specific tables.
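
One way to picture dynamic distribution with buffer affinity is the sketch below. It is an assumption-laden illustration: the source only states that faster workers scan more slices and that affinity is considered. The per-worker queues and the work-stealing rule are hypothetical choices, and the class merely borrows the ControlThread name.

```python
from collections import deque


class ControlThread:
    """Hands out logical units on request and prefers each unit's home worker."""

    def __init__(self, num_units, workers):
        self.workers = list(workers)
        self.queues = {worker: deque() for worker in self.workers}
        # The home worker is fixed by unit number, so repeated scans hit the same buffers.
        for unit in range(num_units):
            home = self.workers[unit % len(self.workers)]
            self.queues[home].append(unit)

    def next_unit(self, worker):
        """Return the next unit for `worker`, or None when the scan is finished."""
        if self.queues[worker]:
            return self.queues[worker].popleft()   # affinity hit
        # The worker's own queue is empty: take a unit from the most loaded worker.
        victim = max(self.workers, key=lambda w: len(self.queues[w]))
        if self.queues[victim]:
            return self.queues[victim].pop()       # steal from the tail
        return None


control = ControlThread(6, ["w1", "w2"])
fast = [control.next_unit("w1") for _ in range(5)]   # w1 asks five times
slow = [control.next_unit("w2")]                     # w2 asks only once
```

In this run the fast worker finishes its own three units and then takes two more from the slow worker, so no worker becomes a long tail.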

In the example shown in the following table, static sharding suffers from data skew when large objects are present, whereas the performance of dynamic scanning still increases linearly.

image.png

SQL Statement-level Scalability

Data sharing also helps PolarDB meet the demand for extreme elasticity in cloud-native environments. The full path of the coordinator node involves various modules, and PolarDB stores the external dependencies of these modules in the shared storage. In addition, the full path of a worker thread involves a number of runtime parameters, and PolarDB synchronizes these parameters from the coordinator node over the control path. This way, both the coordinator node and the worker threads are stateless.

image.png

The following conclusions are made based on the preceding analysis:

  1. Any read-only node that receives a SQL connection can function as the coordinator node. Therefore, the performance of PolarDB is no longer limited by a single coordinator node.
  2. Each SQL statement can start any number of worker threads on any compute node. This increases the computing power and allows you to schedule your workloads in a more flexible manner. You can configure PolarDB to simultaneously run different kinds of workloads on different compute nodes.

image.png

Transactional Consistency

The log apply wait mechanism and the global snapshot mechanism are used to ensure data consistency among multiple compute nodes. The log apply wait mechanism ensures that all worker threads can obtain the most recent version of each page. The global snapshot mechanism ensures that a unified version of each page can be selected.
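
The interplay of the two mechanisms can be sketched as follows. This is a hypothetical model, not PolarDB internals: it assumes that the global snapshot can be reduced to a single LSN chosen by the coordinator and that each worker tracks the LSN up to which it has applied WAL records.

```python
import threading


class Worker:
    def __init__(self, name):
        self.name = name
        self.applied_lsn = 0
        self._cond = threading.Condition()

    def apply_up_to(self, lsn):
        """Called by the log apply path when more WAL has been applied."""
        with self._cond:
            self.applied_lsn = max(self.applied_lsn, lsn)
            self._cond.notify_all()

    def wait_until_applied(self, lsn, timeout=1.0):
        """Log apply wait: block until this worker has reached `lsn`."""
        with self._cond:
            return self._cond.wait_for(lambda: self.applied_lsn >= lsn, timeout=timeout)


def start_scan(workers, snapshot_lsn, timeout=1.0):
    """Global snapshot: every worker must reach the same LSN before it scans."""
    return all(w.wait_until_applied(snapshot_lsn, timeout) for w in workers)


workers = [Worker("ro1"), Worker("ro2")]
workers[0].apply_up_to(300)
threading.Timer(0.05, workers[1].apply_up_to, args=(300,)).start()   # ro2 lags briefly
ready = start_scan(workers, snapshot_lsn=300)
```

The scan starts only after the slowest worker catches up, so all workers read the same version of each page.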

image.png

TPC-H Performance: Speedup

image.png

A total of 1 TB of data is used for TPC-H testing. In the first test, the 22 TPC-H SQL statements are run in a PolarDB cluster, which supports distributed parallelism, and in a conventional database system, which supports standalone parallelism. The test result shows that the PolarDB cluster executes three SQL statements 60 times faster and 19 SQL statements more than 10 times faster than the conventional database system.

image.png

image.png

Then, run a TPC-H test by using a distributed execution engine. The test result shows that the speed at which each of the 22 SQL statements runs linearly increases as the number of cores increases from 16 to 128.

TPC-H Performance: Comparison with Traditional MPP Database

When 16 nodes are configured, PolarDB delivers 90% of the performance that is delivered by a conventional MPP database.

image.png

image.png

As mentioned earlier, the distributed execution engine of PolarDB scales elastically, and data in PolarDB does not need to be redistributed. When the degree of parallelism (DOP) is 8, PolarDB delivers 5.6 times the performance that is delivered by a conventional MPP database.

Index Creation Accelerated by Distributed Execution

A large number of indexes are created in OLTP scenarios. The workloads that you run to create these indexes are divided into two parts: 80% of the workloads are run to sort and create index pages, and 20% of the workloads are run to write index pages. Distributed execution accelerates the process of sorting indexes and supports the batch writing of index pages.

image.png

Distributed execution accelerates the creation of indexes by four to five times.
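
A back-of-the-envelope check with Amdahl's law shows why a speedup in this range is plausible. The sketch assumes that only the 80% sorting and page-building part is parallelized and that the 20% write part stays serial. That is a simplification, because the text also mentions batch writing, so measured results may differ.

```python
def index_build_speedup(parallel_fraction, workers):
    """Amdahl's law: the serial part is unchanged, the parallel part is divided by `workers`."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


# Estimated speedup for a few hypothetical worker counts.
estimates = {n: round(index_build_speedup(0.8, n), 2) for n in (4, 16, 64)}
```

Under these assumptions the speedup is about 4 with 16 workers and can never exceed 1 / 0.2 = 5, however many workers are added.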

image.png

Multi-model Spatio-temporal Database Accelerated by Distributed, Parallel Execution

PolarDB is a multi-model database service that supports spatio-temporal data. Spatio-temporal workloads are both CPU-bound and I/O-bound, so they can be accelerated by distributed execution. The shared storage of PolarDB supports scans on shared R-tree indexes.

image.png

  • Data volume: 400 million data records, which amount to 500 GB in total
  • Configuration: 5 read-only nodes, each of which provides 16 cores and 128 GB of memory
  • Performance:
    • Linearly increases with the number of cores.
    • Increases by 71 times when the number of cores increases from 16 to 80.

image.png

Summary

This document describes the crucial technologies that are used in the PolarDB architecture:

  • Compute-storage separation
  • HTAP

More technical details about PolarDB will be discussed in other documents. These include how the shared storage-based query optimizer works, how LogIndex achieves high performance, how PolarDB flashes your data back to a specific point in time, how MPP is implemented on the shared storage, and how PolarDB works with X-Paxos to ensure high availability.

',168);function ie(i,ne){const n=s("ArticleInfo"),o=s("router-link");return u(),m("div",null,[ee,a(n,{frontmatter:i.$frontmatter},null,8,["frontmatter"]),ae,te,oe,e("nav",re,[e("ul",null,[e("li",null,[a(o,{to:"#issues-in-conventional-database-systems"},{default:r(()=>[t("Issues in Conventional Database Systems")]),_:1})]),e("li",null,[a(o,{to:"#benefits-of-polardb"},{default:r(()=>[t("Benefits of PolarDB")]),_:1})]),e("li",null,[a(o,{to:"#a-guide-to-this-document"},{default:r(()=>[t("A Guide to This Document")]),_:1}),e("ul",null,[e("li",null,[a(o,{to:"#compute-storage-separation"},{default:r(()=>[t("Compute-Storage Separation")]),_:1})]),e("li",null,[a(o,{to:"#htap"},{default:r(()=>[t("HTAP")]),_:1})])])]),e("li",null,[a(o,{to:"#polardb-compute-storage-separation"},{default:r(()=>[t("PolarDB: Compute-Storage Separation")]),_:1}),e("ul",null,[e("li",null,[a(o,{to:"#challenges-of-shared-storage"},{default:r(()=>[t("Challenges of Shared Storage")]),_:1})]),e("li",null,[a(o,{to:"#basic-principles-of-shared-storage"},{default:r(()=>[t("Basic Principles of Shared Storage")]),_:1})]),e("li",null,[a(o,{to:"#data-consistency"},{default:r(()=>[t("Data Consistency")]),_:1})]),e("li",null,[a(o,{to:"#low-latency-replication"},{default:r(()=>[t("Low-latency Replication")]),_:1})]),e("li",null,[a(o,{to:"#recovery-optimization"},{default:r(()=>[t("Recovery Optimization")]),_:1})])])]),e("li",null,[a(o,{to:"#polardb-htap"},{default:r(()=>[t("PolarDB HTAP")]),_:1}),e("ul",null,[e("li",null,[a(o,{to:"#basic-principles-of-htap"},{default:r(()=>[t("Basic Principles of HTAP")]),_:1})]),e("li",null,[a(o,{to:"#distributed-optimizer"},{default:r(()=>[t("Distributed Optimizer")]),_:1})]),e("li",null,[a(o,{to:"#parallelism-of-operators"},{default:r(()=>[t("Parallelism of Operators")]),_:1})]),e("li",null,[a(o,{to:"#solve-the-issue-of-data-skew"},{default:r(()=>[t("Solve the Issue of Data Skew")]),_:1})])])]),e("li",null,[a(o,{to:"#sql-statement-level-scalability"},{default:r(()=>[t("SQL 
Statement-level Scalability")]),_:1}),e("ul",null,[e("li",null,[a(o,{to:"#transactional-consistency"},{default:r(()=>[t("Transactional Consistency")]),_:1})]),e("li",null,[a(o,{to:"#tpc-h-performance-speedup"},{default:r(()=>[t("TPC-H Performance: Speedup")]),_:1})]),e("li",null,[a(o,{to:"#tpc-h-performance-comparison-with-traditional-mpp-database"},{default:r(()=>[t("TPC-H Performance: Comparison with Traditional MPP Database")]),_:1})]),e("li",null,[a(o,{to:"#index-creation-accelerated-by-distributed-execution"},{default:r(()=>[t("Index Creation Accelerated by Distributed Execution")]),_:1})]),e("li",null,[a(o,{to:"#multi-model-spatio-temporal-database-accelerated-by-distributed-parallel-execution"},{default:r(()=>[t("Multi-model Spatio-temporal Database Accelerated by Distributed, Parallel Execution")]),_:1})])])]),e("li",null,[a(o,{to:"#summary"},{default:r(()=>[t("Summary")]),_:1})])])]),se])}const he=c(Z,[["render",ie],["__file","arch-overview.html.vue"]]);export{he as default}; diff --git a/assets/arch-overview.html-dcc3d371.js b/assets/arch-overview.html-dcc3d371.js new file mode 100644 index 00000000000..649d234e778 --- /dev/null +++ b/assets/arch-overview.html-dcc3d371.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-14c84b4c","path":"/theory/arch-overview.html","title":"Overview","lang":"en-US","frontmatter":{"author":"北侠","date":"2021/08/24","minute":35},"headers":[{"level":2,"title":"Issues in Conventional Database Systems","slug":"issues-in-conventional-database-systems","link":"#issues-in-conventional-database-systems","children":[]},{"level":2,"title":"Benefits of PolarDB","slug":"benefits-of-polardb","link":"#benefits-of-polardb","children":[]},{"level":2,"title":"A Guide to This Document","slug":"a-guide-to-this-document","link":"#a-guide-to-this-document","children":[{"level":3,"title":"Compute-Storage 
Separation","slug":"compute-storage-separation","link":"#compute-storage-separation","children":[]},{"level":3,"title":"HTAP","slug":"htap","link":"#htap","children":[]}]},{"level":2,"title":"PolarDB: Compute-Storage Separation","slug":"polardb-compute-storage-separation","link":"#polardb-compute-storage-separation","children":[{"level":3,"title":"Challenges of Shared Storage","slug":"challenges-of-shared-storage","link":"#challenges-of-shared-storage","children":[]},{"level":3,"title":"Basic Principles of Shared Storage","slug":"basic-principles-of-shared-storage","link":"#basic-principles-of-shared-storage","children":[]},{"level":3,"title":"Data Consistency","slug":"data-consistency","link":"#data-consistency","children":[]},{"level":3,"title":"Low-latency Replication","slug":"low-latency-replication","link":"#low-latency-replication","children":[]},{"level":3,"title":"Recovery Optimization","slug":"recovery-optimization","link":"#recovery-optimization","children":[]}]},{"level":2,"title":"PolarDB HTAP","slug":"polardb-htap","link":"#polardb-htap","children":[{"level":3,"title":"Basic Principles of HTAP","slug":"basic-principles-of-htap","link":"#basic-principles-of-htap","children":[]},{"level":3,"title":"Distributed Optimizer","slug":"distributed-optimizer","link":"#distributed-optimizer","children":[]},{"level":3,"title":"Parallelism of Operators","slug":"parallelism-of-operators","link":"#parallelism-of-operators","children":[]},{"level":3,"title":"Solve the Issue of Data Skew","slug":"solve-the-issue-of-data-skew","link":"#solve-the-issue-of-data-skew","children":[]}]},{"level":2,"title":"SQL Statement-level Scalability","slug":"sql-statement-level-scalability","link":"#sql-statement-level-scalability","children":[{"level":3,"title":"Transactional Consistency","slug":"transactional-consistency","link":"#transactional-consistency","children":[]},{"level":3,"title":"TPC-H Performance: 
Speedup","slug":"tpc-h-performance-speedup","link":"#tpc-h-performance-speedup","children":[]},{"level":3,"title":"TPC-H Performance: Comparison with Traditional MPP Database","slug":"tpc-h-performance-comparison-with-traditional-mpp-database","link":"#tpc-h-performance-comparison-with-traditional-mpp-database","children":[]},{"level":3,"title":"Index Creation Accelerated by Distributed Execution","slug":"index-creation-accelerated-by-distributed-execution","link":"#index-creation-accelerated-by-distributed-execution","children":[]},{"level":3,"title":"Multi-model Spatio-temporal Database Accelerated by Distributed, Parallel Execution","slug":"multi-model-spatio-temporal-database-accelerated-by-distributed-parallel-execution","link":"#multi-model-spatio-temporal-database-accelerated-by-distributed-parallel-execution","children":[]}]},{"level":2,"title":"Summary","slug":"summary","link":"#summary","children":[]}],"git":{"updatedTime":1688442053000},"filePathRelative":"theory/arch-overview.md"}');export{e as data}; diff --git a/assets/arch-overview.html-ed106ad9.js b/assets/arch-overview.html-ed106ad9.js new file mode 100644 index 00000000000..585c6baa98f --- /dev/null +++ b/assets/arch-overview.html-ed106ad9.js @@ -0,0 +1 @@ +import{_ as p,a as n,b as d,c as h}from"./9_future_pages-9e3b8fc6.js";import{_ as g,r as i,o as c,c as _,d as e,a,w as o,b as r,e as P}from"./app-3d1677bf.js";const 
f="/PolarDB-for-PostgreSQL/assets/2_compute-storage_separation_architecture-2a7ce395.png",m="/PolarDB-for-PostgreSQL/assets/3_HTAP_architecture-219dd5bb.png",u="/PolarDB-for-PostgreSQL/assets/4_principles_of_shared_storage-3ac70e1b.png",S="/PolarDB-for-PostgreSQL/assets/5_In-memory_page_synchronization-edc6ee66.png",L="/PolarDB-for-PostgreSQL/assets/8_solution_to_outdated_pages_LogIndex-aea5e936.png",B="/PolarDB-for-PostgreSQL/assets/10_solutions_to_future_pages-f585c284.png",D="/PolarDB-for-PostgreSQL/assets/11_issues_of_conventional_streaming_replication-fe65f8ee.png",b="/PolarDB-for-PostgreSQL/assets/12_Replicate_only_metadata_of_WAL_records-d2fbf65b.png",x="/PolarDB-for-PostgreSQL/assets/13_optimization1_result-d85386d9.png",A="/PolarDB-for-PostgreSQL/assets/14_optimize_log_apply_of_WAL_records-a2722b50.png",T="/PolarDB-for-PostgreSQL/assets/15_optimization2_result-3dd5d1a8.png",Q="/PolarDB-for-PostgreSQL/assets/16_optimize_log_apply_of_DDL_locks-d4407c97.png",I="/PolarDB-for-PostgreSQL/assets/17_optimization3_result-2e8e1fc5.png",O="/PolarDB-for-PostgreSQL/assets/18_recovery_optimization_background-60743f8d.png",y="/PolarDB-for-PostgreSQL/assets/19_lazy_recovery-ba7ee19e.png",C="/PolarDB-for-PostgreSQL/assets/20_recovery_optimization_result-5bbf801d.png",v="/PolarDB-for-PostgreSQL/assets/21_Persistent_BufferPool-30d61026.png",W="/PolarDB-for-PostgreSQL/assets/22_buffer_pool_structure-a53b4626.png",H="/PolarDB-for-PostgreSQL/assets/23_persistent_buffer_pool_result-6759a779.png",M="/PolarDB-for-PostgreSQL/assets/24_principles_of_HTAP-b1327018.png",R="/PolarDB-for-PostgreSQL/assets/25_distributed_optimizer-a73c4add.png",k="/PolarDB-for-PostgreSQL/assets/26_parallelism_of_operators-61071ed7.png",z="/PolarDB-for-PostgreSQL/assets/27_parallelism_of_operators_result-28ed41a9.png",N="/PolarDB-for-PostgreSQL/assets/28_data_skew-4f127c17.png",w="/PolarDB-for-PostgreSQL/assets/29_Solve_data_skew_result-cfa7b2f0.png",U="/PolarDB-for-PostgreSQL/assets/30_SQL_statement-level
_scalability-e2a14f1f.png",X="/PolarDB-for-PostgreSQL/assets/31_schedule_workloads-b339cf98.png",q="/PolarDB-for-PostgreSQL/assets/32_transactional_consistency-4f51f637.png",E="/PolarDB-for-PostgreSQL/assets/33_TPC-H_performance_Speedup1-b6f25c5e.png",G="/PolarDB-for-PostgreSQL/assets/34_TPC-H_performance_Speedup2-5c119fbc.png",V="/PolarDB-for-PostgreSQL/assets/35_TPC-H_performance_Speedup3-c1c35820.png",$="/PolarDB-for-PostgreSQL/assets/36_TPC-H_performance_Comparison_with_mpp1-ecbde071.png",j="/PolarDB-for-PostgreSQL/assets/37_TPC-H_performance_Comparison_with_mpp2-6a739c6c.png",F="/PolarDB-for-PostgreSQL/assets/38_Index_creation_accelerated_by_PX-63d21186.png",J="/PolarDB-for-PostgreSQL/assets/39_Index_creation_accelerated_by_PX2-340b1909.png",K="/PolarDB-for-PostgreSQL/assets/40_spatio-temporal_databases-2527a436.png",Y="/PolarDB-for-PostgreSQL/assets/41_spatio-temporal_databases_result-7e6ba3f6.png",Z={},aa=a("h1",{id:"特性总览",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#特性总览","aria-hidden":"true"},"#"),r(" 特性总览")],-1),ea=a("p",null,"PolarDB for PostgreSQL(以下简称 PolarDB)是一款阿里云自主研发的企业级数据库产品,采用计算存储分离架构,100% 兼容 PostgreSQL。PolarDB 的存储与计算能力均可横向扩展,具有高可靠、高可用、弹性扩展等企业级数据库特性。同时,PolarDB 具有大规模并行计算能力,可以应对 OLTP 与 OLAP 混合负载;还具有时空、向量、搜索、图谱等多模创新特性,可以满足企业对数据处理日新月异的新需求。",-1),ra=a("p",null,"PolarDB 支持多种部署形态:存储计算分离部署、X-Paxos 三节点部署、本地盘部署。",-1),la={class:"table-of-contents"},oa=P('

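Issues in Conventional Database Systems

As the volume of business data grows and workloads become more complex, conventional database systems face great challenges, such as:

  1. Storage capacity cannot exceed the limit of a single host.
  2. Reads are scaled out by adding read-only instances, each of which holds its own exclusive copy of the data. This increases cost.
  3. The time that is required to create a read-only instance grows with the data volume.
  4. The replication latency between the primary and secondary instances is high.

Benefits of PolarDB

image.png

To address these issues, Alibaba Cloud developed PolarDB, a cloud-native database that uses a self-developed architecture in which the compute cluster and the storage cluster are separated. PolarDB provides the following benefits:

  1. Scalability: compute is separated from storage, so both scale elastically.
  2. Cost efficiency: all compute nodes share one copy of the data, which lowers storage costs.
  3. Ease of use: one primary node and multiple read-only nodes, with transparent read/write splitting.
  4. Reliability: three replicas of the data, and backups that complete within seconds.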

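A Guide to This Document

The following sections describe the PolarDB architecture from two perspectives: compute-storage separation and HTAP.

Compute-Storage Separation

image.png

PolarDB separates compute from storage, so the storage cluster and the compute cluster can be scaled independently:

  1. When computing power is insufficient, you can scale out only the compute cluster.
  2. When storage capacity is insufficient, you can scale out only the storage cluster.

With shared storage, the primary node and multiple read-only nodes share a single copy of the data. The primary node can no longer flush dirty pages in the way that a conventional database does. Otherwise:

  1. A page that a read-only node reads from the storage may be an older version that does not match the state of the node.
  2. A page that a read-only node reads may be ahead of the version that the node expects based on its in-memory state.
  3. When a read-only node takes over data updates from the primary node after a switchover, the pages in the storage may be outdated, and the node must read WAL records to recover the dirty pages.

The first issue requires multi-version pages. The second issue requires the primary node to control the speed at which it flushes dirty pages.

HTAP

After read/write splitting is enabled, a single compute node cannot fully utilize the high I/O bandwidth of the storage layer, and large queries cannot be accelerated by adding computing resources. PolarDB therefore provides shared storage-based MPP distributed parallel execution to accelerate OLAP queries in OLTP scenarios. The same set of OLTP data can be used by the following two execution engines:

  • Standalone execution engine: processes highly concurrent OLTP workloads.
  • Distributed execution engine: processes large OLAP queries.

image.png

On the same hardware, PolarDB delivers 90% of the performance of a conventional MPP database and also provides SQL statement-level scalability: when computing power is insufficient, you can add CPUs to OLAP queries at any time without redistributing data.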

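PolarDB: Compute-Storage Separation

Challenges of Shared Storage

With shared storage, the database changes from a conventional shared-nothing architecture to a shared-storage architecture. The following issues need to be resolved:

  • Data consistency: N copies of compute plus N copies of storage become N copies of compute plus one copy of storage.
  • Read/write splitting: how to achieve low-latency replication on the new architecture.
  • High availability: how to perform recovery and failover.
  • I/O model: how to optimize from buffered I/O to direct I/O.

Basic Principles of Shared Storage

image.png

The shared storage-based architecture of PolarDB works on the following principles:

  • The primary node processes read and write requests (RW). Read-only nodes process only read requests (RO).
  • Only the primary node can write to the shared storage, so the primary node and the read-only nodes see the same persisted data.
  • Read-only nodes keep their in-memory state in sync with the primary node by applying WAL records.
  • The primary node writes WAL records to the shared storage and replicates only the WAL metadata to the read-only nodes.
  • Read-only nodes read WAL records from the shared storage and apply them.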

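Data Consistency

In-memory State Synchronization in Conventional Databases

In a conventional shared-nothing database, the primary node and each read-only node have their own memory and storage. WAL records only need to be replicated from the primary node to the read-only nodes and applied there in order. This is the basic principle of a replicated state machine.

In-memory State Synchronization Based on Shared Storage

As described earlier, after compute and storage are separated, all nodes read consistent pages from the shared storage. A read-only node obtains its in-memory state by reading the latest WAL records from the shared storage and applying them, as shown in the following figure:

image.png

  1. The primary node flushes version 200 of a page to the shared storage.
  2. The read-only node starts from version 100 and applies WAL records to obtain version 200.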

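Outdated Pages in Shared Storage

In the preceding workflow, a page that a read-only node has rebuilt by applying WAL records may be evicted from the buffer pool. When the node later reads the page from the storage again, it may get an older version of the page, which is called an outdated page. The following figure shows an example:

image.png

  1. At T1, the primary node writes a WAL record with LSN 200, which updates the content of page P1 from 500 to 600.
  2. At this point, the content of page P1 on the read-only node is still 500.
  3. At T2, the primary node sends the metadata of WAL record 200 to the read-only node, and the read-only node learns that a new WAL record exists.
  4. At T3, page P1 is read on the read-only node. The node reads page P1 and the WAL record with LSN 200 and applies the record, which yields the latest content of P1: 600.
  5. At T4, the read-only node evicts the rebuilt page P1 because its buffer pool is full.
  6. The primary node has not yet flushed the latest content of P1 (600) to the shared storage.
  7. At T5, page P1 is read on the read-only node again. Because P1 has been evicted from memory, the node reads it from the shared storage and gets the content of an outdated page.

Solution to Outdated Pages

Whenever a read-only node reads a page, it must find the corresponding base page and the WAL records that start from the matching position, and then apply the records in order. The following figure shows the process:

image.png

  1. The read-only node maintains in memory the WAL metadata that corresponds to each page.
  2. When a page is read, WAL records are applied one by one on demand until the page reaches the expected version.
  3. When a WAL record is applied, its content is read from the shared storage based on the WAL metadata.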

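The preceding analysis shows that an inverted index from each page to its WAL records is required. Because the memory of a read-only node is limited, this index must be persisted. PolarDB provides a persistent index structure for this purpose, called LogIndex. LogIndex is essentially a persistent hash data structure.

  1. The read-only node receives WAL metadata from the primary node by using the WAL receiver.
  2. The WAL metadata records which pages the WAL record modifies.
  3. The WAL metadata is inserted into LogIndex, with the page ID as the key and the LSN as the value.
  4. A single WAL record may update multiple pages, for example during an index split. In this case, LogIndex contains multiple entries for the record.
  5. The page is also marked as outdated in the buffer pool, so that the corresponding WAL records are applied based on LogIndex the next time the page is read.
  6. When memory usage reaches a specific threshold, LogIndex asynchronously flushes the in-memory hash table to disk.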
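
The steps above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the LogIndex implementation: the class layout, the `flush_threshold` parameter, and the dictionary that stands in for the on-disk part are all hypothetical, and the flush here is synchronous for simplicity.

```python
from collections import defaultdict


class LogIndex:
    """Toy model of a page_id -> [lsn, ...] index with a persisted part."""

    def __init__(self, flush_threshold=4):
        self.mem = defaultdict(list)     # in-memory hash table
        self.disk = defaultdict(list)    # stand-in for the persisted hash table
        self.outdated = set()            # pages in the buffer pool that need log apply
        self.flush_threshold = flush_threshold   # hypothetical entry-count threshold

    def insert(self, lsn, page_ids, buffer_pool):
        """Insert one WAL metadata record, which may touch several pages."""
        for page_id in page_ids:
            self.mem[page_id].append(lsn)
            if page_id in buffer_pool:
                self.outdated.add(page_id)
        if sum(len(lsns) for lsns in self.mem.values()) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Move the in-memory entries to the persisted part."""
        for page_id, lsns in self.mem.items():
            self.disk[page_id].extend(lsns)
        self.mem.clear()

    def lsns_for(self, page_id):
        """Return every LSN that must be considered when `page_id` is read."""
        return self.disk.get(page_id, []) + self.mem.get(page_id, [])


index = LogIndex(flush_threshold=3)
cached_pages = {"P1"}                            # pages currently in the buffer pool
index.insert(200, ["P1"], cached_pages)
index.insert(300, ["P1", "P2"], cached_pages)    # for example, an index split
```

A reader of page P1 would look up `index.lsns_for("P1")`, fetch those records from the shared storage, and apply them in LSN order.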

image.png

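LogIndex resolves the outdated-page issue that is caused by the dependency on page flushing. It also turns log apply on read-only nodes into lazy log apply: the apply process only needs to process the metadata of WAL records.

Future Pages in Shared Storage

After compute and storage are separated, the dependency on page flushing also causes the future-page issue, as shown in the following figure:

image.png

  1. At T1, the primary node updates P1 twice and generates two WAL records. At this point, the content of page P1 is 500 on both the primary node and the read-only node.
  2. At T2, the WAL record with LSN 200 is sent to the read-only node.
  3. At T3, the read-only node applies the WAL record with LSN 200, and the content of P1 becomes 600. The read-only node has applied WAL records up to LSN 200, and the subsequent WAL record with LSN 300 does not yet exist from its point of view.
  4. At T4, the primary node flushes the latest content of P1 (700) to the shared storage. Meanwhile, the read-only node evicts page P1 from its buffer pool.
  5. At T5, the read-only node reads page P1 again. Because P1 is not in the buffer pool, the node reads the latest P1 from the shared storage. However, the read-only node has not applied the WAL record with LSN 300, so it reads a future page that is ahead of its own state.
  6. The problem with future pages is that some pages are future pages and others are normal pages, which causes data inconsistency. For example, after an index page splits into two pages, a read-only node may read one normal page and one future page, which corrupts the B+Tree index structure.

Solution to Future Pages

Future pages occur because the primary node flushes dirty pages faster than a read-only node applies WAL records, even though lazy log apply on read-only nodes is already fast. The solution is to limit page flushing on the primary node so that it does not pass the apply position of the slowest read-only node. The following figure shows the process:

image.png

  1. The read-only nodes have applied WAL records up to position T4.
  2. When the primary node flushes dirty pages, it sorts all dirty pages by LSN and flushes only the dirty pages at or before T4. Dirty pages after T4 are not flushed.
  3. The LSN at position T4 is called the consistency LSN.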
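
The flush rule above can be sketched in a few lines. The function names and the dictionaries that represent apply positions and dirty pages are hypothetical, and each dirty page is assumed to be tagged with the LSN of its newest modification.

```python
def consistency_lsn(apply_positions):
    """The slowest read-only node decides how far the primary node may flush."""
    return min(apply_positions.values())


def pages_to_flush(dirty_pages, limit_lsn):
    """Return the flushable page IDs in LSN order.

    dirty_pages maps page_id -> LSN of the newest change to that page.
    """
    ordered = sorted(dirty_pages.items(), key=lambda item: item[1])
    return [page_id for page_id, lsn in ordered if lsn <= limit_lsn]


apply_positions = {"ro1": 200, "ro2": 350}       # LSN applied by each read-only node
dirty_pages = {"P1": 300, "P2": 150, "P3": 200}
limit = consistency_lsn(apply_positions)
flushable = pages_to_flush(dirty_pages, limit)
```

Page P1 stays dirty in the primary node's buffer pool until the slowest read-only node has applied LSN 300, so no read-only node can read a version of P1 from the storage that is ahead of its own state.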

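Low-latency Replication

Issues of Conventional Streaming Replication

  1. Synchronization path: the log synchronization path involves many I/O operations and a large volume of network traffic.
  2. Page apply: reading pages and modifying buffers is slow, because the work is both I/O-intensive and CPU-intensive.
  3. DDL apply: a file must be locked before it is modified, and the process of acquiring the lock is easily blocked, which slows down DDL.
  4. Snapshot update: high concurrency on read-only nodes slows down transaction snapshot updates.

The following figure shows the process:

image.png

  1. The primary node writes WAL records to its local file system.
  2. The WAL sender process reads the WAL records and sends them.
  3. The WAL receiver process on the read-only node receives the WAL records and writes them to its local file system.
  4. The apply process reads the WAL records, reads the corresponding pages into the buffer pool, and applies the records in memory.
  5. The primary node flushes dirty pages to the shared storage.

The whole path is long, so the latency on read-only nodes is high, which affects load balancing for read/write splitting.

Optimization Method 1: Replicate Only the Metadata of WAL Records

Because the underlying storage is shared, read-only nodes can read the WAL data that they need directly from the shared storage. The primary node therefore replicates only the metadata of WAL records, with the payload removed, to read-only nodes. This reduces the network traffic and the I/O on the critical path. The following figure shows the process:

image.png

  1. A WAL record consists of three parts: header, page ID, and payload.
  2. Because read-only nodes can read the WAL files in the shared storage directly, the primary node sends (replicates) only the metadata of WAL records to read-only nodes. The metadata consists of the header and the page ID.
  3. A read-only node uses the WAL metadata to read the complete WAL record from the WAL files in the shared storage.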
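
The split between metadata and payload can be modelled as follows. The field names are hypothetical, and the `lsn` field that is used to locate the full record in the shared storage is an assumption: the text only names the header, the page ID, and the payload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalRecord:
    lsn: int            # assumed locator of the record in the shared WAL files
    header: bytes
    page_ids: tuple
    payload: bytes      # the bulk of the record; never sent over the network


@dataclass(frozen=True)
class WalMeta:
    lsn: int
    header: bytes
    page_ids: tuple


def to_meta(record):
    """Primary node: strip the payload before replication."""
    return WalMeta(record.lsn, record.header, record.page_ids)


def fetch_full_record(meta, shared_wal):
    """Read-only node: read the complete record from the shared storage on demand."""
    return shared_wal[meta.lsn]


record = WalRecord(200, b"hdr", ("P1",), b"x" * 8192)
shared_wal = {record.lsn: record}    # stand-in for WAL files in the shared storage
meta = to_meta(record)
full = fetch_full_record(meta, shared_wal)
```

Only `meta` crosses the network in this model, which is where the reduction in network traffic comes from.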

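This optimization significantly reduces the network traffic between the primary node and read-only nodes. As shown in the following figure, the network traffic is reduced by 98%.

image.png

Optimization Method 2: Optimize the Log Apply of WAL Records

In a conventional database, log apply reads a large number of pages, applies WAL records to them one by one, and writes the pages to disk. This work is on the critical path of user read I/O. With compute-storage separation, if a page is not in the buffer pool of a read-only node, the apply process generates no I/O and only records the entry in LogIndex.

The following operations of the apply process can be offloaded to session processes:

  1. Data page I/O.
  2. Applying WAL records.
  3. Multi-version page apply based on LogIndex.

As shown in the following figure, the apply process on a read-only node handles the metadata of a WAL record as follows:

image.png

  1. If the corresponding page is not in memory, the apply process only records the entry in LogIndex.
  2. If the corresponding page is in memory, the apply process marks the page as outdated and records the entry in LogIndex. The apply step is then complete.
  3. When a user session process reads a page, it reads the correct page into the buffer pool and applies the relevant WAL records by using LogIndex.
  4. The main I/O operations are therefore offloaded from the single apply process to multiple user session processes.

This optimization significantly reduces the log apply latency and makes log apply 30 times faster than that of AWS Aurora.

image.png

Optimization Method 3: Optimize the Log Apply of DDL Locks

When the primary node runs a DDL operation such as DROP TABLE, an exclusive lock must be held on the table on all nodes. This ensures that the primary node does not delete the table file while a read-only node is reading it, because only one copy of the file exists in the shared storage. The exclusive lock is replicated through WAL to all read-only nodes, which then apply the DDL lock. The apply process may be blocked for a long time when it acquires the lock on the table. Therefore, the critical path of the apply process can be optimized by offloading DDL locks to other processes.

image.png

With this optimization, the apply process keeps running smoothly, and its critical path is not blocked while it waits for DDL locks.

image.png

The three optimization methods greatly reduce the replication latency and bring the following benefits:

  • Read/write splitting: loads are balanced, and the experience is closer to that of Oracle RAC.
  • High availability: the HA process is accelerated.
  • Stability: the number of future pages is minimized, so fewer or even no page snapshots need to be written.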

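Recovery Optimization

Background

When a database recovers from an out-of-memory (OOM) error or a crash, recovery takes a long time, essentially because log apply is slow. The issue is even more pronounced with the direct I/O model on shared storage.

image.png

Lazy Recovery

As described earlier, LogIndex enables lazy log apply on read-only nodes. The recovery of the primary node after a restart is essentially also log apply, so lazy log apply can be used to accelerate recovery:

image.png

  1. Starting from the checkpoint, the primary node reads WAL records one by one.
  2. After the LogIndex records are applied, log apply is considered complete.
  3. Recovery is complete, and the primary node starts to serve requests.
  4. The actual log apply is offloaded to the session processes that connect after the restart.

Result after the optimization (applying 500 MB of WAL records):

image.png

Persistent Buffer Pool

The preceding method speeds up the restart during recovery. After the restart, however, session processes need to read WAL records and apply them to the pages that they want to read, so the database responds slowly for a short period after recovery. To resolve this issue, the buffer pool is not destroyed when the database restarts. As shown in the following figure, the buffer pool is retained across a crash and a restart.

image.png

The shared memory of the database engine consists of two parts:

  1. Global structures, such as ProcArray.
  2. Buffer pool structures. The buffer pool is allocated as named shared memory and therefore remains valid after the process restarts. Global structures need to be reinitialized after the process restarts.

image.png

Not all pages in the buffer pool can be reused. For example, if a process holds an exclusive (X) lock on a page before the restart and then crashes, no process remains to release the lock. Therefore, after a crash and a restart, the entire buffer pool must be traversed to remove the pages that cannot be reused. In addition, the recycling of the buffer pool depends on Kubernetes. With this optimization, performance remains stable before and after a restart.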

image.png

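PolarDB HTAP

After read/write splitting is enabled in PolarDB, the I/O throughput is in theory unlimited because the underlying storage is a storage pool. However, a large query can run only on a single compute node, and the CPU, memory, and I/O capacity of a single compute node are limited. A single compute node therefore cannot fully utilize the high I/O bandwidth of the storage layer, and large queries cannot be accelerated by adding computing resources. PolarDB provides shared storage-based MPP distributed parallel execution to accelerate OLAP queries in OLTP scenarios.

Basic Principles of HTAP

The underlying storage of PolarDB is shared among nodes, so tables cannot be scanned in the way that a conventional MPP database scans them. PolarDB adds MPP distributed parallel execution to the original standalone execution engine and optimizes the shared storage. Shared storage-based MPP is the first of its kind in the industry. It works on the following principles:

  1. The Shuffle operator masks the data distribution.
  2. The ParallelScan operator masks the shared storage.

image.png

As shown in the figure:

  1. Table A and Table B are joined and then aggregated.
  2. The tables in the shared storage remain single tables and are not physically partitioned.
  3. Four types of scan operators are redesigned to scan tables in the shared storage in slices, which form virtual partitions.

Distributed Optimizer

PolarDB extends the community GPORCA optimizer with transformation rules that are aware of shared storage. These rules enable the optimizer to explore the plan space that is specific to shared storage. For example, a table in PolarDB can be scanned as a whole or in partitions, which is the essential difference from conventional MPP. In the figure, the gray modules in the upper part adapt the PolarDB database engine to the GPORCA optimizer. The lower part is the ORCA kernel, in which the gray modules are the extensions that are made for shared storage.

image.png

Parallelism of Operators

Four types of operators in PolarDB need to be parallelized. This section uses a representative one, the Seqscan operator, as an example. To fully utilize the high I/O bandwidth of the storage, a sequential scan splits the table into logical units of 4 MB and spreads the I/O across different disks as much as possible, so that all disks serve reads at the same time. Another benefit is that each read-only node scans only part of the table files, so the total size of the tables that can be cached is the sum of the buffer pools of all read-only nodes.

image.png

As shown in the following chart:

  1. Scan performance increases linearly, by up to 30 times, as read-only nodes are added.
  2. With the buffer enabled, the scan time drops from 37 minutes to 3.75 seconds.

image.png

Solve the Issue of Data Skew

Data skew is an inherent issue in conventional MPP:

  1. In PolarDB, large objects are stored in TOAST tables that are associated with heap tables. Whichever table is sharded, the load cannot be balanced.
  2. In addition, the transactions, buffers, network, and I/O loads of different read-only nodes jitter.

These two factors cause long-tail processes during distributed execution.

image.png

  1. The coordinator node is internally divided into DataThread and ControlThread.
  2. DataThread collects and aggregates tuples.
  3. ControlThread controls the scan progress of each scan operator.
  4. A worker process that scans faster can scan more logical data slices.
  5. Buffer affinity must be considered in the process.

Note that although scan tasks are distributed dynamically, buffer affinity is preserved as far as possible. In addition, the context of each operator is stored in the private memory of the worker, and the coordinator does not store information about specific tables.

In the following table, static sharding suffers from data skew when large objects are present, whereas dynamic scanning still scales linearly.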

image.png

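SQL Statement-level Scalability

Shared data also makes it possible to meet the demand for extreme elasticity in cloud-native environments. The external dependencies that are required by each module on the full path of the coordinator are stored in the shared storage, and the runtime parameters that are required on the full path of a worker are synchronized from the coordinator over the control path. This makes the coordinator and the workers stateless.

image.png

Therefore:

  1. Any read-only node that receives a SQL connection can function as the coordinator node, which removes the coordinator as a single point.
  2. A SQL statement can start any number of workers on any node. Computing power can therefore scale at the SQL statement level, and more scheduling policies become possible: different business domains can run on different sets of nodes at the same time.

image.png

Transactional Consistency

Data consistency among multiple compute nodes is ensured by the log apply wait mechanism and the global snapshot mechanism. The log apply wait mechanism ensures that all workers can see the data version that they need. The global snapshot mechanism ensures that a unified version is selected.

image.png

TPC-H Performance: Speedup

image.png

A 1 TB TPC-H data set was used for testing. The first test compares the new distributed parallelism of PolarDB with standalone parallelism: 3 SQL statements run 60 times faster, and 19 SQL statements run more than 10 times faster.

image.png

image.png

The second test uses the distributed execution engine and measures performance as CPUs are added. Performance increases linearly from 16 cores to 128 cores. For each of the 22 SQL statements, performance also increases linearly as CPUs are added.

TPC-H Performance: Comparison with Traditional MPP Database

With the same 16 nodes, PolarDB delivers 90% of the performance of a conventional MPP database.

image.png

image.png

As described earlier, the distributed execution engine of PolarDB scales elastically, and data does not need to be redistributed. When the degree of parallelism (DOP) is 8, PolarDB delivers 5.6 times the performance of a conventional MPP database.

Index Creation Accelerated by Distributed Execution

A large number of indexes are created in OLTP workloads. Analysis shows that during index creation, 80% of the time is spent sorting and building index pages, and 20% is spent writing index pages. Distributed parallel execution accelerates the sorting, and index pages are written in pipelined batches.

image.png

This optimization accelerates index creation by 4 to 5 times.

image.png

Multi-model Spatio-temporal Database Accelerated by Distributed, Parallel Execution

PolarDB is a multi-model database that supports spatio-temporal data. Spatio-temporal workloads are both CPU-intensive and I/O-intensive and can be accelerated by distributed execution. For shared storage, PolarDB provides the ability to scan shared R-tree indexes.

image.png

  • Data volume: 400 million records, 500 GB
  • Configuration: 5 read-only nodes, each with 16 CPU cores and 128 GB of memory
  • Performance:
    • Increases linearly with the number of CPU cores
    • Improves by 71 times with a total of 80 CPU cores

image.png

Summary

This document describes the key technologies of PolarDB from an architectural perspective:

  • Compute-storage separation
  • HTAP

Subsequent documents discuss more technical details, such as how the shared storage-based query optimizer works, how LogIndex achieves high performance, how PolarDB flashes data back to any point in time, how MPP is supported on shared storage, and how PolarDB works with X-Paxos to ensure high availability.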

',163);function ia(t,ta){const s=i("ArticleInfo"),l=i("router-link");return c(),_("div",null,[aa,e(s,{frontmatter:t.$frontmatter},null,8,["frontmatter"]),ea,ra,a("nav",la,[a("ul",null,[a("li",null,[e(l,{to:"#传统数据库的问题"},{default:o(()=>[r("传统数据库的问题")]),_:1})]),a("li",null,[e(l,{to:"#polardb-云原生数据库的优势"},{default:o(()=>[r("PolarDB 云原生数据库的优势")]),_:1})]),a("li",null,[e(l,{to:"#polardb-整体架构概述"},{default:o(()=>[r("PolarDB 整体架构概述")]),_:1}),a("ul",null,[a("li",null,[e(l,{to:"#存储计算分离架构概述"},{default:o(()=>[r("存储计算分离架构概述")]),_:1})]),a("li",null,[e(l,{to:"#htap-架构概述"},{default:o(()=>[r("HTAP 架构概述")]),_:1})])])]),a("li",null,[e(l,{to:"#polardb-存储计算分离架构详解"},{default:o(()=>[r("PolarDB:存储计算分离架构详解")]),_:1}),a("ul",null,[a("li",null,[e(l,{to:"#shared-storage-带来的挑战"},{default:o(()=>[r("Shared-Storage 带来的挑战")]),_:1})]),a("li",null,[e(l,{to:"#架构原理"},{default:o(()=>[r("架构原理")]),_:1})]),a("li",null,[e(l,{to:"#数据一致性"},{default:o(()=>[r("数据一致性")]),_:1})]),a("li",null,[e(l,{to:"#低延迟复制"},{default:o(()=>[r("低延迟复制")]),_:1})]),a("li",null,[e(l,{to:"#recovery-优化"},{default:o(()=>[r("Recovery 优化")]),_:1})])])]),a("li",null,[e(l,{to:"#polardb-htap-架构详解"},{default:o(()=>[r("PolarDB:HTAP 架构详解")]),_:1}),a("ul",null,[a("li",null,[e(l,{to:"#htap-架构原理"},{default:o(()=>[r("HTAP 架构原理")]),_:1})]),a("li",null,[e(l,{to:"#分布式优化器"},{default:o(()=>[r("分布式优化器")]),_:1})]),a("li",null,[e(l,{to:"#算子并行化"},{default:o(()=>[r("算子并行化")]),_:1})]),a("li",null,[e(l,{to:"#消除数据倾斜问题"},{default:o(()=>[r("消除数据倾斜问题")]),_:1})]),a("li",null,[e(l,{to:"#sql-级别弹性扩展"},{default:o(()=>[r("SQL 级别弹性扩展")]),_:1})]),a("li",null,[e(l,{to:"#事务一致性"},{default:o(()=>[r("事务一致性")]),_:1})]),a("li",null,[e(l,{to:"#tpc-h-性能-加速比"},{default:o(()=>[r("TPC-H 性能:加速比")]),_:1})]),a("li",null,[e(l,{to:"#tpc-h-性能-和传统-mpp-数据库的对比"},{default:o(()=>[r("TPC-H 性能:和传统 MPP 
数据库的对比")]),_:1})]),a("li",null,[e(l,{to:"#分布式执行加速索引创建"},{default:o(()=>[r("分布式执行加速索引创建")]),_:1})]),a("li",null,[e(l,{to:"#分布式并行执行加速多模-时空数据库"},{default:o(()=>[r("分布式并行执行加速多模:时空数据库")]),_:1})])])]),a("li",null,[e(l,{to:"#总结"},{default:o(()=>[r("总结")]),_:1})])])]),oa])}const na=g(Z,[["render",ia],["__file","arch-overview.html.vue"]]);export{na as default}; diff --git a/assets/avail-online-promote.html-21127e10.js b/assets/avail-online-promote.html-21127e10.js new file mode 100644 index 00000000000..090e33ba27c --- /dev/null +++ b/assets/avail-online-promote.html-21127e10.js @@ -0,0 +1,2 @@ +import{_ as c,r as i,o as s,c as p,d as o,a as e,w as a,b as l,e as _}from"./app-3d1677bf.js";const u="/PolarDB-for-PostgreSQL/assets/online_promote_postmaster-92e1fd76.png",g="/PolarDB-for-PostgreSQL/assets/online_promote_startup-b84a6f37.png",m="/PolarDB-for-PostgreSQL/assets/online_promote_logindex_bgw-d9f46b31.png",h={},P=e("h1",{id:"只读节点-online-promote",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#只读节点-online-promote","aria-hidden":"true"},"#"),l(" 只读节点 Online Promote")],-1),L={class:"table-of-contents"},O=_(`

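Background

PolarDB uses a one-writer, multiple-reader architecture built on shared storage, which differs from the primary/standby architecture of traditional databases:

  • Standby node: the standby node of a traditional database. It has its own independent storage and stays in sync with the primary node by receiving complete WAL records;
  • Read-only node, also called a Replica node: the read-only node of a PolarDB database. It shares the same storage as the primary node and stays in sync with the primary node by receiving WAL Meta information.

Traditional databases support a Promote operation that upgrades a Standby node to a primary node without a restart, so that it can continue to serve reads and writes. This keeps the cluster highly available and also reduces the instance recovery time objective (RTO).

PolarDB likewise needs the ability to promote a read-only node to a primary node. Because a read-only node differs from a traditional Standby node, PolarDB introduces OnlinePromote, a promotion mechanism for read-only nodes in the one-writer, multiple-reader architecture.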

Usage

Use the pg_ctl tool to run the Promote operation on a Replica node:

pg_ctl promote -D [datadir]
+

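How OnlinePromote works

Trigger mechanism

PolarDB promotes a Replica node in the same way that traditional databases promote a standby node. Promotion is triggered in either of the following ways:

  • Run the promote command of the pg_ctl tool. pg_ctl sends a signal to the Postmaster process, which then notifies the other processes to carry out their part and complete the Promote operation.
  • Define the path of a trigger file in recovery.conf; other components then trigger promotion by creating that trigger file.

Compared with promoting a Standby node in a traditional database, OnlinePromote on a PolarDB Replica node has to address several additional issues:

  • After the Replica node is promoted to primary, the shared storage must be remounted in read-write mode;
  • A Replica node keeps some important control information in memory only, whereas the primary node persists that information to shared storage. During promotion, this information must also be persisted to shared storage;
  • For the data that the Replica node has built in memory by replaying WAL, OnlinePromote must determine which of it may be written to shared storage;
  • While replaying WAL in memory, a Replica node evicts buffers differently from the primary node and never flushes dirty pages; OnlinePromote must define how to handle this difference;
  • How each child process is handled during OnlinePromote.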

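Postmaster process flow

  1. When the Postmaster process finds the trigger file or receives the OnlinePromote command, it enters the OnlinePromote flow;
  2. It sends SIGTERM to all current Backend processes.
    • A read-only node could keep serving read-only queries during OnlinePromote, but the data read is not guaranteed to be up to date. To avoid reading stale data from the new primary node during the switchover, all Backend sessions are disconnected first, and read-write service starts only after the Startup process exits.
  3. It remounts the shared storage in read-write mode, which requires support from the underlying storage;
  4. It sends SIGUSR2 to the Startup process, telling it to finish replay and handle OnlinePromote;
  5. It sends SIGUSR2 to the Polar Worker auxiliary process, telling it to stop parsing part of the LogIndex data, because that data is only useful to a Replica node in normal operation;
  6. It sends SIGUSR2 to the LogIndex BGW (Background Worker) replay process, telling it to handle OnlinePromote.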

image.png

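Startup process flow

  1. The Startup process replays all WAL generated by the old primary node and generates the corresponding LogIndex data;
  2. It confirms that the last checkpoint of the old primary node has also completed on the Replica node, which ensures that the data this checkpoint requires to be written locally on the Replica node has been flushed to disk;
  3. It waits until the LogIndex BGW process has entered the POLAR_BG_WAITING_RESET state;
  4. It copies the Replica node's local data (such as clog) to shared storage;
  5. It resets the WAL Meta Queue memory, reloads slot information from shared storage, and resets the replay position of the LogIndex BGW process to the smaller of its current replay position and the current consistency point. The LogIndex BGW process starts its new round of replay from that position;
  6. It sets the node role to primary and sets the LogIndex BGW state to POLAR_BG_ONLINE_PROMOTE. From this point the instance can serve reads and writes.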
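
The last steps of the Startup flow can be sketched as follows. This is a minimal illustration and not PolarDB code: the `PromoteState` class, its field names, and `finish_startup_promote` are hypothetical, and LSNs are modeled as plain integers. It only shows the ordering constraint (wait for POLAR_BG_WAITING_RESET) and the choice of the new replay position as the minimum of two positions.

```python
from dataclasses import dataclass


@dataclass
class PromoteState:
    """Hypothetical snapshot of the values the Startup process works with."""
    bgw_replay_lsn: int      # current replay position of the LogIndex BGW process
    consistent_lsn: int      # current consistency point
    bgw_state: str           # state name of the LogIndex BGW process
    role: str = "replica"


def finish_startup_promote(state: PromoteState) -> PromoteState:
    # Step 3: the BGW process must already be waiting for the reset.
    if state.bgw_state != "POLAR_BG_WAITING_RESET":
        raise RuntimeError("LogIndex BGW has not reached POLAR_BG_WAITING_RESET")
    # Step 5: restart replay from the smaller of the two positions.
    state.bgw_replay_lsn = min(state.bgw_replay_lsn, state.consistent_lsn)
    # Step 6: switch the role and let the BGW process start promote replay.
    state.role = "primary"
    state.bgw_state = "POLAR_BG_ONLINE_PROMOTE"
    return state
```

Taking the minimum means replay never starts beyond the consistency point, so no page change between the two positions is skipped.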

image.png

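LogIndex BGW process flow

The LogIndex BGW process runs its own state machine throughout its lifetime. Each state does the following:

  • POLAR_BG_WAITING_RESET: the LogIndex BGW state is being reset; other processes are notified that the state machine has changed;
  • POLAR_BG_ONLINE_PROMOTE: reads LogIndex data, organizes and dispatches replay tasks, and replays WAL with the parallel replay process group. A process in this state must replay all LogIndex data before it switches state, and finally advances the replay position of the background replay process;
  • POLAR_BG_REDO_NOT_START: the replay task has finished;
  • POLAR_BG_RO_BUF_REPLAYING: the state during normal operation of a Replica node. The process reads LogIndex data and replays a certain amount of WAL in WAL order; after each round it advances the replay position of the background replay process;
  • POLAR_BG_PARALLEL_REPLAYING: in each round, the LogIndex BGW process reads a certain amount of LogIndex data, organizes and dispatches replay tasks, and replays WAL with the parallel replay process group; after each round it advances the replay position of the background replay process.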

image.png

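After receiving SIGUSR2 from the Postmaster, the LogIndex BGW process performs OnlinePromote as follows:

  1. Flush all LogIndex data to disk and switch to the POLAR_BG_WAITING_RESET state;
  2. Wait for the Startup process to switch it to the POLAR_BG_ONLINE_PROMOTE state;
    • Before OnlinePromote, the background replay process on a Replica node only replays pages that are in the buffer pool;
    • During OnlinePromote, some pages on the old primary node may have been in memory and not yet flushed, so the background replay process replays all WAL in log order, calls MarkBufferDirty after replay to mark each page dirty, and leaves it to be flushed;
    • When replay finishes, it advances the replay position of the background replay process and switches to the POLAR_BG_REDO_NOT_START state.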
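
The state changes of the LogIndex BGW process during OnlinePromote can be modeled as a small transition table. This is a sketch under assumptions: the event names (`sigusr2`, `startup_ready`, `replay_done`) are invented labels for the triggers described above, and only the promote path is modeled, starting from the normal Replica state.

```python
from enum import Enum, auto


class BgwState(Enum):
    RO_BUF_REPLAYING = auto()   # normal operation on a Replica node
    WAITING_RESET = auto()      # LogIndex flushed, waiting for the Startup process
    ONLINE_PROMOTE = auto()     # replaying all remaining WAL in log order
    REDO_NOT_START = auto()     # replay finished


# (current state, event) -> next state; event names are illustrative only.
PROMOTE_TRANSITIONS = {
    (BgwState.RO_BUF_REPLAYING, "sigusr2"): BgwState.WAITING_RESET,
    (BgwState.WAITING_RESET, "startup_ready"): BgwState.ONLINE_PROMOTE,
    (BgwState.ONLINE_PROMOTE, "replay_done"): BgwState.REDO_NOT_START,
}


def step(state: BgwState, event: str) -> BgwState:
    """Apply one event, rejecting transitions the promote flow does not allow."""
    try:
        return PROMOTE_TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}") from None


def run_promote(events):
    """Drive the state machine from the normal Replica state."""
    state = BgwState.RO_BUF_REPLAYING
    for event in events:
        state = step(state, event)
    return state
```

A table keyed by (state, event) makes it explicit that the BGW process cannot enter POLAR_BG_ONLINE_PROMOTE on its own; it has to pass through the waiting state first.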

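Dirty page flush control

Every dirty page carries an Oldest LSN, and the FlushList is ordered by this LSN. The purpose is to determine the consistency point from this LSN.

Once OnlinePromote is under way on a Replica node, replay and new page writes happen at the same time. If, as on a primary node, the current WAL insert position were used directly as a buffer's Oldest LSN, the new consistency point could be set while buffers with a smaller LSN have not yet been flushed.

So during OnlinePromote a Replica node has to answer two questions:

  • How to set the Oldest LSN of pages dirtied by replaying the old primary node's WAL;
  • How to set the Oldest LSN of pages dirtied by the new primary node.

During OnlinePromote, PolarDB sets the Oldest LSN of the dirty pages in both cases to the replay position advanced by the LogIndex BGW process. The consistency point moves forward only after all buffers marked with the same Oldest LSN have been flushed.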
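
The rule above can be illustrated with a toy flush list. This is a simplified model, not the PolarDB buffer manager: the `FlushList` class and its methods are hypothetical, LSNs are integers, and the fallback to the current replay position when nothing is pending is an assumption made for this sketch.

```python
from collections import Counter


class FlushList:
    """Toy model: counts unflushed dirty buffers per Oldest LSN."""

    def __init__(self, consistent_lsn: int):
        self.consistent_lsn = consistent_lsn
        self._pending = Counter()   # oldest_lsn -> number of unflushed buffers

    def mark_dirty(self, bgw_replay_lsn: int) -> int:
        # During OnlinePromote, pages dirtied by replay and pages dirtied by
        # new writes both take the BGW replay position as their Oldest LSN.
        self._pending[bgw_replay_lsn] += 1
        return bgw_replay_lsn

    def flush_one(self, oldest_lsn: int, bgw_replay_lsn: int) -> int:
        if self._pending[oldest_lsn] <= 0:
            raise ValueError("no dirty buffer with this Oldest LSN")
        self._pending[oldest_lsn] -= 1
        if self._pending[oldest_lsn] == 0:
            del self._pending[oldest_lsn]
        # The consistency point is the smallest Oldest LSN still pending.
        # Assumption: with nothing pending, use the current replay position.
        candidate = min(self._pending) if self._pending else bgw_replay_lsn
        self.consistent_lsn = max(self.consistent_lsn, candidate)
        return self.consistent_lsn
```

With two buffers marked at position 200 and one at 300, the consistency point stays at 200 after the first flush and reaches 300 only once both buffers marked 200 are on disk.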

',32);function f(n,R){const r=i("Badge"),d=i("ArticleInfo"),t=i("router-link");return s(),p("div",null,[P,o(r,{type:"tip",text:"V11 / v1.1.1-",vertical:"top"}),o(d,{frontmatter:n.$frontmatter},null,8,["frontmatter"]),e("nav",L,[e("ul",null,[e("li",null,[o(t,{to:"#背景"},{default:a(()=>[l("背景")]),_:1})]),e("li",null,[o(t,{to:"#使用"},{default:a(()=>[l("使用")]),_:1})]),e("li",null,[o(t,{to:"#onlinepromote-原理"},{default:a(()=>[l("OnlinePromote 原理")]),_:1}),e("ul",null,[e("li",null,[o(t,{to:"#触发机制"},{default:a(()=>[l("触发机制")]),_:1})]),e("li",null,[o(t,{to:"#postmaster-进程处理过程"},{default:a(()=>[l("Postmaster 进程处理过程")]),_:1})]),e("li",null,[o(t,{to:"#startup-进程处理过程"},{default:a(()=>[l("Startup 进程处理过程")]),_:1})]),e("li",null,[o(t,{to:"#logindex-bgw-进程处理过程"},{default:a(()=>[l("LogIndex BGW 进程处理过程")]),_:1})]),e("li",null,[o(t,{to:"#刷脏控制"},{default:a(()=>[l("刷脏控制")]),_:1})])])])])]),O])}const x=c(h,[["render",f],["__file","avail-online-promote.html.vue"]]);export{x as default}; diff --git a/assets/avail-online-promote.html-40b93d0d.js b/assets/avail-online-promote.html-40b93d0d.js new file mode 100644 index 00000000000..dcdfa683fa2 --- /dev/null +++ b/assets/avail-online-promote.html-40b93d0d.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-2a7736c4","path":"/zh/features/v11/availability/avail-online-promote.html","title":"只读节点 Online Promote","lang":"zh-CN","frontmatter":{"author":"学弈","date":"2022/09/20","minute":25},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"使用","slug":"使用","link":"#使用","children":[]},{"level":2,"title":"OnlinePromote 原理","slug":"onlinepromote-原理","link":"#onlinepromote-原理","children":[{"level":3,"title":"触发机制","slug":"触发机制","link":"#触发机制","children":[]},{"level":3,"title":"Postmaster 进程处理过程","slug":"postmaster-进程处理过程","link":"#postmaster-进程处理过程","children":[]},{"level":3,"title":"Startup 进程处理过程","slug":"startup-进程处理过程","link":"#startup-进程处理过程","children":[]},{"level":3,"title":"LogIndex BGW 
进程处理过程","slug":"logindex-bgw-进程处理过程","link":"#logindex-bgw-进程处理过程","children":[]},{"level":3,"title":"刷脏控制","slug":"刷脏控制","link":"#刷脏控制","children":[]}]}],"git":{"updatedTime":1672148725000},"filePathRelative":"zh/features/v11/availability/avail-online-promote.md"}');export{e as data}; diff --git a/assets/avail-parallel-replay.html-2136f786.js b/assets/avail-parallel-replay.html-2136f786.js new file mode 100644 index 00000000000..6996a0ef641 --- /dev/null +++ b/assets/avail-parallel-replay.html-2136f786.js @@ -0,0 +1,2 @@ +import{_ as n,r as i,o as c,c as p,d as a,a as e,w as s,b as l,e as _}from"./app-3d1677bf.js";const k="/PolarDB-for-PostgreSQL/assets/pr_parallel_execute_1-5ade2a03.png",g="/PolarDB-for-PostgreSQL/assets/pr_parallel_execute_2-f13fd3b5.png",h="/PolarDB-for-PostgreSQL/assets/pr_parallel_execute_task-6ddc37a4.png",u="/PolarDB-for-PostgreSQL/assets/pr_parallel_execute_dispatcher-ef5aa6cd.png",T="/PolarDB-for-PostgreSQL/assets/pr_parallel_execute_procs_1-37e52fe7.png",L="/PolarDB-for-PostgreSQL/assets/pr_parallel_execute_procs_2-777f9348.png",m="/PolarDB-for-PostgreSQL/assets/pr_parallel_execute_procs_3-c2b9a687.png",f="/PolarDB-for-PostgreSQL/assets/pr_parallel_replay_1-fa5b96d0.png",x="/PolarDB-for-PostgreSQL/assets/pr_parallel_replay_2-bf2d2654.png",N={},B=e("h1",{id:"wal-日志并行回放",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#wal-日志并行回放","aria-hidden":"true"},"#"),l(" WAL 日志并行回放")],-1),$={class:"table-of-contents"},A=_('

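Background

In the one-writer, multiple-reader architecture of PolarDB for PostgreSQL, while a read-only node (Replica node) is running, the LogIndex background replay process (LogIndex Background Worker) and the session processes (Backends) each use LogIndex data to replay WAL on different buffers. In effect, this already replays WAL in parallel.

Because WAL replay is critical to the high availability of a PolarDB cluster, applying the same parallel replay idea to the regular replay path is a natural optimization.

Parallel WAL replay helps in at least the following three scenarios:

  1. Crash recovery on the primary node, read-only nodes, and standby nodes;
  2. Continuous WAL replay by the LogIndex BGW process on a read-only node;
  3. Continuous WAL replay by the Startup process on a standby node.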

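Terminology

  • Block: data block
  • WAL: Write-Ahead Logging
  • Task Node: a subtask execution node in the parallel execution framework; it can receive and execute one subtask
  • Task Tag: the classification tag of a subtask; subtasks with the same tag must execute in order
  • Hold List: the linked list that each child process in the parallel execution framework uses to schedule replay subtasks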

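How it works

Overview

A single WAL record may modify several data blocks, so WAL replay can be described with the following definitions:

  • Suppose the i-th WAL record has LSN $LSN_i$ and modifies m data blocks. Define the list of blocks modified by the i-th WAL record as $Block_i = [Block_{i,0}, Block_{i,1}, ..., Block_{i,m-1}]$;
  • Define the smallest replay subtask as $Task_{i,j} = \{LSN_i \rightarrow Block_{i,j}\}$, meaning that the i-th WAL record is replayed on data block $Block_{i,j}$;
  • A WAL record that modifies k blocks can therefore be expressed as a set of k replay subtasks: $TASK_{i,*} = [Task_{i,0}, Task_{i,1}, ..., Task_{i,k-1}]$;
  • In turn, multiple WAL records can be expressed as a collection of replay subtasks: $TASK_{*,*} = [TASK_{0,*}, TASK_{1,*}, ..., TASK_{N,*}]$;

In the replay subtask collection $TASK_{*,*}$, a subtask does not always depend on the result of the subtasks before it. Suppose the collection is $TASK_{*,*} = [TASK_{0,*}, TASK_{1,*}, TASK_{2,*}]$, where:

  • $TASK_{0,*} = [Task_{0,0}, Task_{0,1}, Task_{0,2}]$
  • $TASK_{1,*} = [Task_{1,0}, Task_{1,1}]$
  • $TASK_{2,*} = [Task_{2,0}]$

and $Block_{0,0} = Block_{1,0}$, $Block_{0,1} = Block_{1,1}$, $Block_{0,2} = Block_{2,0}$.

Then three subtask sequences can be replayed in parallel with one another: $[Task_{0,0}, Task_{1,0}]$, $[Task_{0,1}, Task_{1,1}]$, and $[Task_{0,2}, Task_{2,0}]$. Each sequence touches a single block, so its subtasks run in LSN order, while the three sequences are independent of each other.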
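
The grouping in this example can be reproduced with a few lines of code. The sketch below is illustrative only: `build_replay_chains` is a hypothetical helper, WAL records are modeled as `(lsn, [block ids])` pairs, and block identifiers are plain strings.

```python
from collections import defaultdict


def build_replay_chains(wal_records):
    """Group replay subtasks by block.

    wal_records: iterable of (lsn, blocks) in WAL order.
    Returns {block: [lsn, ...]}; each list must be replayed in order,
    while lists for different blocks are independent of each other.
    """
    chains = defaultdict(list)
    for lsn, blocks in wal_records:
        for block in blocks:
            chains[block].append(lsn)
    return dict(chains)


# Record 0 modifies blocks A, B, C; record 1 modifies A, B; record 2 modifies C.
records = [(0, ["A", "B", "C"]), (1, ["A", "B"]), (2, ["C"])]
chains = build_replay_chains(records)
print(chains)
```

The three resulting chains correspond to the three sequences listed above: records 0 and 1 on block A, records 0 and 1 on block B, and records 0 and 2 on block C.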

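In summary, within the set of replay subtasks represented by the WAL, many subtask sequences can execute in parallel without affecting the consistency of the final replay result. Building on this idea, PolarDB introduced a parallel task execution framework and applied it to WAL replay.

Parallel task execution framework

A segment of shared memory is divided equally by the number of concurrent processes; each part serves as a ring queue assigned to one process. The depth of each ring queue is set through a configuration parameter:

image.png

  • Dispatcher process
    • Controls concurrent scheduling by dispatching tasks to specific processes;
    • Removes tasks from the queues after the processes have finished them;
  • Process group
    • Each process in the group fetches tasks from its own ring queue and decides whether to execute a task based on the task state.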

image.png

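Tasks

A ring queue consists of Task Nodes. Each Task Node is in one of five states: Idle, Running, Hold, Finished, or Removed.

  • Idle: no task has been assigned to the Task Node;
  • Running: a task has been assigned to the Task Node and is either waiting for a process to execute it or already executing;
  • Hold: the Task Node depends on an earlier task and must wait for that task to finish before it executes;
  • Finished: a process in the process group has finished executing the task;
  • Removed: when the Dispatcher process finds a task in the Finished state, all tasks it depends on must also be Finished. The Removed state means the Dispatcher process has deleted the task and all of its predecessor tasks from the management structures. This mechanism ensures that the Dispatcher process handles the results of dependent tasks in order.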

image.png

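In the state machine above, the transitions drawn in black are performed by the Dispatcher process, and those drawn in orange are performed by the parallel replay process group.

Dispatcher process

The Dispatcher process has three key data structures: Task HashMap, Task Running Queue, and Task Idle Nodes.

  • Task HashMap records the hash mapping from a Task Tag to its list of tasks:
    • Every task has a Task Tag; two tasks that depend on each other have the same Task Tag;
    • When a task is dispatched, if its Task Node has a predecessor task, it is marked Hold and waits for the predecessor to execute first.
  • Task Running Queue records the tasks currently being executed;
  • Task Idle Nodes records, for each process in the process group, the Task Nodes currently in the Idle state;

The Dispatcher schedules as follows:

  • If a task with the same Task Tag as the Task Node to be executed is already running, the Task Node is preferentially assigned to the process that owns the last Task Node in that Task Tag's list. This keeps dependent tasks on the same process where possible and reduces inter-process synchronization;
  • If the queue of the preferred process is full, or no task with the same Task Tag is running, a process is selected from the process group in order, and an Idle Task Node is taken from it to schedule the task. This spreads tasks as evenly as possible across processes.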
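
The two scheduling rules can be sketched as follows. This is a simplified, single-process model and not the PolarDB implementation: the `Dispatcher` class is hypothetical, ring queues are plain lists with a depth limit, and tasks are never completed, so any tag seen before counts as having a pending predecessor.

```python
class Dispatcher:
    """Toy dispatcher: prefers the worker that last received the same Task Tag."""

    def __init__(self, n_workers: int, queue_depth: int):
        self.queues = [[] for _ in range(n_workers)]
        self.depth = queue_depth
        self.last_worker = {}   # Task Tag -> worker holding its most recent task
        self._next = 0          # round-robin cursor

    def dispatch(self, tag, payload):
        preferred = self.last_worker.get(tag)
        if preferred is not None and len(self.queues[preferred]) < self.depth:
            worker = preferred              # rule 1: keep dependent tasks together
        else:
            worker = self._pick_in_order()  # rule 2: spread tasks over the group
        # A task whose tag already has a predecessor must wait for it.
        state = "Hold" if tag in self.last_worker else "Running"
        self.queues[worker].append((tag, payload, state))
        self.last_worker[tag] = worker
        return worker, state

    def _pick_in_order(self) -> int:
        for _ in range(len(self.queues)):
            worker = self._next
            self._next = (self._next + 1) % len(self.queues)
            if len(self.queues[worker]) < self.depth:
                return worker
        raise RuntimeError("all ring queues are full")
```

Keeping tasks with the same tag on one worker means their order is enforced by that worker's queue, which is the synchronization saving the first rule aims for.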

image.png

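Process group

This parallel execution targets tasks of the same type, which share the same Task Node data structure. When the process group is initialized, a SchedContext is configured with the function pointers that do the actual work:

  • TaskStartup: the initialization a process performs before it executes tasks
  • TaskHandler: executes the actual task for the Task Node passed in
  • TaskCleanup: the cleanup a process performs before it exits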

image.png

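A process in the group fetches a Task Node from its ring queue. If the Task Node is in the Hold state, it is appended to the tail of the Hold List. If the Task Node is in the Running state, TaskHandler is called to execute it. If TaskHandler fails, the number of calls to wait before the Task Node is retried is set (3 by default), and the Task Node is inserted at the head of the Hold List.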

image.png

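The process searches from the head of the Hold List first for an executable task. If a task is Running and its wait count is 0, the task is executed; if a task is Running but its wait count is greater than 0, the wait count is decremented by 1.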
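
The Hold List handling described in the two paragraphs above can be sketched like this. It is a minimal model under assumptions: `TaskNode`, `take_from_queue`, and `scan_hold_list` are hypothetical names, the handler is any callable that returns True on success, and the promotion of a Hold task to Running (done elsewhere in the framework) is not modeled.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)
class TaskNode:
    name: str
    state: str = "Running"   # "Running", "Hold" or "Finished"
    wait_calls: int = 0      # scans to skip before a failed task is retried


def take_from_queue(node, hold_list, handler, retry_wait=3):
    """Handle one Task Node fetched from the ring queue."""
    if node.state == "Hold":
        hold_list.append(node)            # wait for the predecessor: tail
    elif node.state == "Running":
        if handler(node):
            node.state = "Finished"
        else:
            node.wait_calls = retry_wait  # retry later: head of the list
            hold_list.appendleft(node)


def scan_hold_list(hold_list, handler, retry_wait=3):
    """One pass over the Hold List from the head; returns executed task names."""
    done = []
    for node in list(hold_list):
        if node.state != "Running":
            continue                      # still blocked by a predecessor
        if node.wait_calls > 0:
            node.wait_calls -= 1          # not yet due for a retry
            continue
        if handler(node):
            node.state = "Finished"
            hold_list.remove(node)
            done.append(node.name)
        else:
            node.wait_calls = retry_wait
    return done
```

The wait count works as a simple back-off: a task that just failed is skipped for a few scans instead of being retried immediately.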

image.png

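Parallel WAL replay

As described in the LogIndex chapter, LogIndex data records the mapping between WAL records and the data blocks they modify, and it can be looked up by LSN. Based on this, PolarDB introduces the parallel task execution framework above into continuous WAL replay on Standby nodes and combines it with LogIndex data to parallelize WAL replay tasks, which speeds up data synchronization on Standby nodes.

Workflow

  • Startup process: parses WAL and only builds LogIndex data, without actually replaying the WAL;
  • LogIndex BGW replay process: acts as the Dispatcher process of the framework above. It looks up LogIndex data by LSN, builds replay subtasks, and assigns them to the parallel replay process group;
  • Processes in the parallel replay process group: execute replay subtasks, each replaying a single WAL record on a data block;
  • Backend process: when it reads a data block, it looks up LogIndex data by PageTag to get the chain of LSNs that modified the block, and replays the whole chain on the block.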

image.png

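  • The Dispatcher process looks up LogIndex data by LSN, enumerates PageTags and their LSNs in LogIndex insertion order, and builds {LSN -> PageTag} entries that form Task Nodes;
  • The PageTag serves as the Task Tag of the Task Node;
  • The resulting Task Nodes are dispatched to the child processes of the process group in the parallel execution framework for replay;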

image.png

Usage

Add the following parameter to postgresql.conf on the Standby node to enable the feature:

polar_enable_parallel_replay_standby_mode = ON
+
`,50);function S(d,W){const r=i("Badge"),t=i("ArticleInfo"),o=i("router-link");return c(),p("div",null,[B,a(r,{type:"tip",text:"V11 / v1.1.17-",vertical:"top"}),a(t,{frontmatter:d.$frontmatter},null,8,["frontmatter"]),e("nav",$,[e("ul",null,[e("li",null,[a(o,{to:"#背景"},{default:s(()=>[l("背景")]),_:1})]),e("li",null,[a(o,{to:"#术语"},{default:s(()=>[l("术语")]),_:1})]),e("li",null,[a(o,{to:"#原理"},{default:s(()=>[l("原理")]),_:1}),e("ul",null,[e("li",null,[a(o,{to:"#概述"},{default:s(()=>[l("概述")]),_:1})]),e("li",null,[a(o,{to:"#并行任务执行框架"},{default:s(()=>[l("并行任务执行框架")]),_:1})]),e("li",null,[a(o,{to:"#wal-日志并行回放-1"},{default:s(()=>[l("WAL 日志并行回放")]),_:1})])])]),e("li",null,[a(o,{to:"#使用方法"},{default:s(()=>[l("使用方法")]),_:1})])])]),A])}const b=n(N,[["render",S],["__file","avail-parallel-replay.html.vue"]]);export{b as default}; diff --git a/assets/avail-parallel-replay.html-a035d420.js b/assets/avail-parallel-replay.html-a035d420.js new file mode 100644 index 00000000000..e958e83f7f5 --- /dev/null +++ b/assets/avail-parallel-replay.html-a035d420.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-18c2ec3b","path":"/zh/features/v11/availability/avail-parallel-replay.html","title":"WAL 日志并行回放","lang":"zh-CN","frontmatter":{"author":"学弈","date":"2022/09/20","minute":30},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"术语","slug":"术语","link":"#术语","children":[]},{"level":2,"title":"原理","slug":"原理","link":"#原理","children":[{"level":3,"title":"概述","slug":"概述","link":"#概述","children":[]},{"level":3,"title":"并行任务执行框架","slug":"并行任务执行框架","link":"#并行任务执行框架","children":[]},{"level":3,"title":"WAL 日志并行回放","slug":"wal-日志并行回放-1","link":"#wal-日志并行回放-1","children":[]}]},{"level":2,"title":"使用方法","slug":"使用方法","link":"#使用方法","children":[]}],"git":{"updatedTime":1697908247000},"filePathRelative":"zh/features/v11/availability/avail-parallel-replay.md"}');export{l as data}; diff --git a/assets/back-to-top-8efcbe56.svg b/assets/back-to-top-8efcbe56.svg new 
file mode 100644 index 00000000000..83236781a94 --- /dev/null +++ b/assets/back-to-top-8efcbe56.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/assets/backup-and-restore.html-03a875b8.js b/assets/backup-and-restore.html-03a875b8.js new file mode 100644 index 00000000000..9f1fc1e8a07 --- /dev/null +++ b/assets/backup-and-restore.html-03a875b8.js @@ -0,0 +1,234 @@ +import{_ as u,r as t,o as d,c as k,d as s,a,w as p,b as n,e as o}from"./app-3d1677bf.js";const l="/PolarDB-for-PostgreSQL/assets/backup-dir-71d185c8.png",b={},m=a("h1",{id:"备份恢复",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#备份恢复","aria-hidden":"true"},"#"),n(" 备份恢复")],-1),_=a("p",null,"PolarDB for PostgreSQL 采用基于共享存储的存算分离架构,其备份恢复和 PostgreSQL 存在部分差异。本文将指导您如何对 PolarDB for PostgreSQL 进行备份,并通过备份来搭建 Replica 节点或 Standby 节点。",-1),v={class:"table-of-contents"},h=o('

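How backup and restore work

The PostgreSQL backup procedure can be summarized in the following steps:

  1. Enter backup mode
    • Force Full Page Write mode and switch the current WAL segment file
    • Create a backup_label file in the data directory that contains the start position of the base backup
    • Recovery from a backup must start at a checkpoint where in-memory data and on-disk data are consistent, so wait for the next checkpoint or force a CHECKPOINT immediately
  2. Back up the database with file-system-level tools
  3. Exit backup mode
    • Reset Full Page Write mode and switch to the next WAL segment file
    • Create the backup history file, which contains the start and end WAL positions of this base backup, and delete the backup_label file

The simplest way to back up a PostgreSQL database is the pg_basebackup tool.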

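Data directory structure

PolarDB for PostgreSQL uses a compute-storage separated architecture built on shared storage. Its data directories fall into two categories:

  • Local data directory: located on the local storage of each compute node and private to that node
  • Shared data directory: located on shared storage and shared by all compute nodes

backup-dir

The directories and files in the local data directory do not hold the core data of the database, so backing up the local data directory is optional. You can back up only the data directory on shared storage and then regenerate a local data directory with initdb. The local configuration files of each compute node, such as postgresql.conf and pg_hba.conf, still need to be backed up manually.

Local data directory

The following SQL command shows the local data directory of a node: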

postgres=# SHOW data_directory;
+     data_directory
+------------------------
+ /home/postgres/primary
+(1 row)
+

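The local data directory is similar to a PostgreSQL data directory; most of its directories and files are generated by initdb. As the database runs, more local files appear in it, such as temporary files, cache files, configuration files, and log files. Its structure is as follows: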

$ tree ./ -L 1
+./
+├── base
+├── current_logfiles
+├── global
+├── pg_commit_ts
+├── pg_csnlog
+├── pg_dynshmem
+├── pg_hba.conf
+├── pg_ident.conf
+├── pg_log
+├── pg_logical
+├── pg_logindex
+├── pg_multixact
+├── pg_notify
+├── pg_replslot
+├── pg_serial
+├── pg_snapshots
+├── pg_stat
+├── pg_stat_tmp
+├── pg_subtrans
+├── pg_tblspc
+├── PG_VERSION
+├── pg_xact
+├── polar_cache_trash
+├── polar_dma.conf
+├── polar_fullpage
+├── polar_node_static.conf
+├── polar_rel_size_cache
+├── polar_shmem
+├── polar_shmem_stat_file
+├── postgresql.auto.conf
+├── postgresql.conf
+├── postmaster.opts
+└── postmaster.pid
+
+21 directories, 12 files
+

Shared data directory

The following SQL command shows the shared data directory on shared storage that all compute nodes use:

postgres=# SHOW polar_datadir;
+     polar_datadir
+-----------------------
+ /nvme1n1/shared_data/
+(1 row)
+

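The shared data directory holds the core data files of PolarDB for PostgreSQL, such as table files, index files, WAL, DMA, LogIndex, and Flashback Log. These files are shared by all nodes and therefore must be backed up. Its structure is as follows: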

$ sudo pfs -C disk ls /nvme1n1/shared_data/
+   Dir  1     512               Wed Jan 11 09:34:01 2023  base
+   Dir  1     7424              Wed Jan 11 09:34:02 2023  global
+   Dir  1     0                 Wed Jan 11 09:34:02 2023  pg_tblspc
+   Dir  1     512               Wed Jan 11 09:35:05 2023  pg_wal
+   Dir  1     384               Wed Jan 11 09:35:01 2023  pg_logindex
+   Dir  1     0                 Wed Jan 11 09:34:02 2023  pg_twophase
+   Dir  1     128               Wed Jan 11 09:34:02 2023  pg_xact
+   Dir  1     0                 Wed Jan 11 09:34:02 2023  pg_commit_ts
+   Dir  1     256               Wed Jan 11 09:34:03 2023  pg_multixact
+   Dir  1     0                 Wed Jan 11 09:34:03 2023  pg_csnlog
+   Dir  1     256               Wed Jan 11 09:34:03 2023  polar_dma
+   Dir  1     512               Wed Jan 11 09:35:09 2023  polar_fullpage
+  File  1     32                Wed Jan 11 09:35:00 2023  RWID
+   Dir  1     256               Wed Jan 11 10:25:42 2023  pg_replslot
+  File  1     224               Wed Jan 11 10:19:37 2023  polar_non_exclusive_backup_label
+total 16384 (unit: 512Bytes)
+

The polar_basebackup backup tool

`,20),g=a("code",null,"polar_basebackup",-1),f={href:"https://www.postgresql.org/docs/11/app-pgbasebackup.html",target:"_blank",rel:"noopener noreferrer"},y=a("code",null,"pg_basebackup",-1),S=a("code",null,"pg_basebackup",-1),P=a("code",null,"polar_basebackup",-1),x=a("code",null,"bin/",-1),D=o(`

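The main function of this tool is to back up the data directories (both the local data directory and the shared data directory) of a running PolarDB for PostgreSQL database to a target directory.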

polar_basebackup takes a base backup of a running PostgreSQL server.
+
+Usage:
+  polar_basebackup [OPTION]...
+
+Options controlling the output:
+  -D, --pgdata=DIRECTORY receive base backup into directory
+  -F, --format=p|t       output format (plain (default), tar)
+  -r, --max-rate=RATE    maximum transfer rate to transfer data directory
+                         (in kB/s, or use suffix "k" or "M")
+  -R, --write-recovery-conf
+                         write recovery.conf for replication
+  -T, --tablespace-mapping=OLDDIR=NEWDIR
+                         relocate tablespace in OLDDIR to NEWDIR
+      --waldir=WALDIR    location for the write-ahead log directory
+  -X, --wal-method=none|fetch|stream
+                         include required WAL files with specified method
+  -z, --gzip             compress tar output
+  -Z, --compress=0-9     compress tar output with given compression level
+
+General options:
+  -c, --checkpoint=fast|spread
+                         set fast or spread checkpointing
+  -C, --create-slot      create replication slot
+  -l, --label=LABEL      set backup label
+  -n, --no-clean         do not clean up after errors
+  -N, --no-sync          do not wait for changes to be written safely to disk
+  -P, --progress         show progress information
+  -S, --slot=SLOTNAME    replication slot to use
+  -v, --verbose          output verbose messages
+  -V, --version          output version information, then exit
+      --no-slot          prevent creation of temporary replication slot
+      --no-verify-checksums
+                         do not verify checksums
+  -?, --help             show this help, then exit
+
+Connection options:
+  -d, --dbname=CONNSTR   connection string
+  -h, --host=HOSTNAME    database server host or socket directory
+  -p, --port=PORT        database server port number
+  -s, --status-interval=INTERVAL
+                         time between status packets sent to server (in seconds)
+  -U, --username=NAME    connect as specified database user
+  -w, --no-password      never prompt for password
+  -W, --password         force password prompt (should happen automatically)
+      --polardata=datadir  receive polar data backup into directory
+      --polar_disk_home=disk_home  polar_disk_home for polar data backup
+      --polar_host_id=host_id  polar_host_id for polar data backup
+      --polar_storage_cluster_name=cluster_name  polar_storage_cluster_name for polar data backup
+

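The options and usage of polar_basebackup are almost identical to those of pg_basebackup. The following options related to shared storage are added:

  • --polar_disk_home / --polar_host_id / --polar_storage_cluster_name: these three options specify the shared storage node that will hold the backed-up shared data
  • --polardata: the path on the backup shared storage node where the shared data is stored; if it is not specified, the shared data is backed up to the polar_shared_data/ path under the local data backup directory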

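Back up and restore a Replica node

A base backup can be used to set up a new Replica (RO) node. As described above, the data files of a running PolarDB for PostgreSQL instance are spread across the local storage of each compute node and the shared storage of the storage node. The following shows how to use polar_basebackup to back up the instance's data files to a local disk and start a Replica node from that backup.

Mounting the PFS file system

First, on the machine where the Replica node will be deployed, start the PFSD daemon and mount the PFS file system of the running shared storage. The Replica node started later uses this daemon to access shared storage.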

sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
+

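Back up data to local storage

Run the following command to back up the local data and shared data of the instance's Primary node to /home/postgres/replica1, the local storage path used to deploy the Replica node:

polar_basebackup \\
+    --host=[Primary node IP] \\
+    --port=[Primary node port] \\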
+    -D /home/postgres/replica1 \\
+    -X stream --progress --write-recovery-conf -v
+

You will see output like the following:

polar_basebackup: initiating base backup, waiting for checkpoint to complete
+polar_basebackup: checkpoint completed
+polar_basebackup: write-ahead log start point: 0/16ADD60 on timeline 1
+polar_basebackup: starting background WAL receiver
+polar_basebackup: created temporary replication slot "pg_basebackup_359"
+851371/851371 kB (100%), 2/2 tablespaces
+polar_basebackup: write-ahead log end point: 0/16ADE30
+polar_basebackup: waiting for background process to finish streaming ...
+polar_basebackup: base backup completed
+

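After the backup completes, this backup directory can be used as the local data directory to start a new Replica node. The local data directory does not need the shared data files that already exist on shared storage, so delete the polar_shared_data/ directory from it: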

rm -rf ~/replica1/polar_shared_data
+

Reconfigure the Replica node

Edit the Replica node's configuration file ~/replica1/postgresql.conf:

-polar_hostid=1
++polar_hostid=2
+-synchronous_standby_names='replica1'
+

Edit the Replica node's replication configuration file ~/replica1/recovery.conf:

polar_replica='on'
+recovery_target_timeline='latest'
+primary_slot_name='replica1'
+primary_conninfo='host=[Primary node IP] port=5432 user=postgres dbname=postgres application_name=replica1'
+

Start the Replica node

Start the Replica node:

pg_ctl -D $HOME/replica1 start
+

Verify the Replica node

Create a table and insert data on the Primary node; the inserted data can then be queried on the Replica node:

$ psql -q \\
+    -h [Primary node IP] \\
+    -p 5432 \\
+    -d postgres \\
+    -c "CREATE TABLE t (t1 INT PRIMARY KEY, t2 INT); INSERT INTO t VALUES (1, 1),(2, 3),(3, 3);"
+
+$ psql -q \\
+    -h [Replica node IP] \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT * FROM t;"
+ t1 | t2
+----+----
+  1 |  1
+  2 |  3
+  3 |  3
+(3 rows)
+

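Back up and restore a Standby node

A base backup can also be used to set up a new Standby node. As shown in the figure below, the Standby node uses shared storage that is independent of the one used by the Primary / Replica nodes, and it stays in sync with the Primary node through physical replication. A Standby node can serve as a disaster-recovery copy of the primary shared storage.

backup-dir

Formatting and mounting the PFS file system

Assume the machine that will host the Standby compute node already has the shared storage nvme2n1 prepared for the standby: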

$ lsblk
+NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
+nvme0n1     259:1    0  40G  0 disk
+└─nvme0n1p1 259:2    0  40G  0 part /etc/hosts
+nvme2n1     259:3    0  70G  0 disk
+nvme1n1     259:0    0  60G  0 disk
+

Format this shared storage as PFS, and start the PFSD daemon to mount the PFS file system:

sudo pfs -C disk mkfs nvme2n1
+sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme2n1 -w 2
+

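Back up data to local storage and shared storage

On the machine that will host the Standby node, run the backup with ~/standby as the local data directory and /nvme2n1/shared_data as the shared storage directory:

polar_basebackup \\
+    --host=[Primary node IP] \\
+    --port=[Primary node port] \\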
+    -D /home/postgres/standby \\
+    --polardata=/nvme2n1/shared_data/ \\
+    --polar_storage_cluster_name=disk \\
+    --polar_disk_name=nvme2n1 \\
+    --polar_host_id=3 \\
+    -X stream --progress --write-recovery-conf -v
+

You will see output like the following. In addition to the polar_basebackup output, it also contains PFS log output:

[PFSD_SDK INF Jan 11 10:11:27.247112][99]pfs_mount_prepare 103: begin prepare mount cluster(disk), PBD(nvme2n1), hostid(3),flags(0x13)
+[PFSD_SDK INF Jan 11 10:11:27.247161][99]pfs_mount_prepare 165: pfs_mount_prepare success for nvme2n1 hostid 3
+[PFSD_SDK INF Jan 11 10:11:27.293900][99]chnl_connection_poll_shm 1238: ack data update s_mount_epoch 1
+[PFSD_SDK INF Jan 11 10:11:27.293912][99]chnl_connection_poll_shm 1266: connect and got ack data from svr, err = 0, mntid 0
+[PFSD_SDK INF Jan 11 10:11:27.293979][99]pfsd_sdk_init 191: pfsd_chnl_connect success
+[PFSD_SDK INF Jan 11 10:11:27.293987][99]pfs_mount_post 208: pfs_mount_post err : 0
+[PFSD_SDK ERR Jan 11 10:11:27.297257][99]pfsd_opendir 1437: opendir /nvme2n1/shared_data/ error: No such file or directory
+[PFSD_SDK INF Jan 11 10:11:27.297396][99]pfsd_mkdir 1320: mkdir /nvme2n1/shared_data
+polar_basebackup: initiating base backup, waiting for checkpoint to complete
+WARNING:  a labelfile "/nvme1n1/shared_data//polar_non_exclusive_backup_label" is already on disk
+HINT:  POLAR: we overwrite it
+polar_basebackup: checkpoint completed
+polar_basebackup: write-ahead log start point: 0/16C91F8 on timeline 1
+polar_basebackup: starting background WAL receiver
+polar_basebackup: created temporary replication slot "pg_basebackup_373"
+...
+[PFSD_SDK INF Jan 11 10:11:32.992005][99]pfsd_open 539: open /nvme2n1/shared_data/polar_non_exclusive_backup_label with inode 6325, fd 0
+[PFSD_SDK INF Jan 11 10:11:32.993074][99]pfsd_open 539: open /nvme2n1/shared_data/global/pg_control with inode 8373, fd 0
+851396/851396 kB (100%), 2/2 tablespaces
+polar_basebackup: write-ahead log end point: 0/16C9300
+polar_basebackup: waiting for background process to finish streaming ...
+polar_basebackup: base backup completed
+[PFSD_SDK INF Jan 11 10:11:52.378220][99]pfsd_umount_force 247: pbdname nvme2n1
+[PFSD_SDK INF Jan 11 10:11:52.378229][99]pfs_umount_prepare 269: pfs_umount_prepare. pbdname:nvme2n1
+[PFSD_SDK INF Jan 11 10:11:52.404010][99]chnl_connection_release_shm 1164: client umount return : deleted /var/run/pfsd//nvme2n1/99.pid
+[PFSD_SDK INF Jan 11 10:11:52.404171][99]pfs_umount_post 281: pfs_umount_post. pbdname:nvme2n1
+[PFSD_SDK INF Jan 11 10:11:52.404174][99]pfsd_umount_force 261: umount success for nvme2n1
+

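The command above backs up the Primary node's local data directory onto the local storage of the current machine and backs up the shared data directory into the shared storage directory given by the options.

Reconfigure the Standby node

Edit the Standby node's configuration file ~/standby/postgresql.conf: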

-polar_hostid=1
++polar_hostid=3
+-polar_disk_name='nvme1n1'
+-polar_datadir='/nvme1n1/shared_data/'
++polar_disk_name='nvme2n1'
++polar_datadir='/nvme2n1/shared_data/'
+-synchronous_standby_names='replica1'
+

Add the following to the Standby node's replication configuration file ~/standby/recovery.conf:

+recovery_target_timeline = 'latest'
++primary_slot_name = 'standby1'
+

Start the Standby node

On the Primary node, create the replication slot used for physical replication with the Standby node:

$ psql \\
+    --host=[Primary node IP] --port=5432 \\
+    -d postgres \\
+    -c "SELECT * FROM pg_create_physical_replication_slot('standby1');"
+ slot_name | lsn
+-----------+-----
+ standby1  |
+(1 row)
+

Start the Standby node:

pg_ctl -D $HOME/standby start
+

Verify the Standby node

Create a table and insert data on the Primary node; the data can then be queried on the Standby node:

$ psql -q \\
+    -h [Primary node IP] \\
+    -p 5432 \\
+    -d postgres \\
+    -c "CREATE TABLE t (t1 INT PRIMARY KEY, t2 INT); INSERT INTO t VALUES (1, 1),(2, 3),(3, 3);"
+
+$ psql -q \\
+    -h [Standby node IP] \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT * FROM t;"
+ t1 | t2
+----+----
+  1 |  1
+  2 |  3
+  3 |  3
+(3 rows)
+
`,54);function w(c,R){const r=t("ArticleInfo"),e=t("router-link"),i=t("ExternalLinkIcon");return d(),k("div",null,[m,s(r,{frontmatter:c.$frontmatter},null,8,["frontmatter"]),_,a("nav",v,[a("ul",null,[a("li",null,[s(e,{to:"#备份恢复原理"},{default:p(()=>[n("备份恢复原理")]),_:1})]),a("li",null,[s(e,{to:"#数据目录结构"},{default:p(()=>[n("数据目录结构")]),_:1}),a("ul",null,[a("li",null,[s(e,{to:"#本地数据目录"},{default:p(()=>[n("本地数据目录")]),_:1})]),a("li",null,[s(e,{to:"#共享数据目录"},{default:p(()=>[n("共享数据目录")]),_:1})])])]),a("li",null,[s(e,{to:"#polar-basebackup-备份工具"},{default:p(()=>[n("polar_basebackup 备份工具")]),_:1})]),a("li",null,[s(e,{to:"#备份并恢复一个-replica-节点"},{default:p(()=>[n("备份并恢复一个 Replica 节点")]),_:1}),a("ul",null,[a("li",null,[s(e,{to:"#pfs-文件系统挂载"},{default:p(()=>[n("PFS 文件系统挂载")]),_:1})]),a("li",null,[s(e,{to:"#备份数据到本地存储"},{default:p(()=>[n("备份数据到本地存储")]),_:1})]),a("li",null,[s(e,{to:"#重新配置-replica-节点"},{default:p(()=>[n("重新配置 Replica 节点")]),_:1})]),a("li",null,[s(e,{to:"#replica-节点启动"},{default:p(()=>[n("Replica 节点启动")]),_:1})]),a("li",null,[s(e,{to:"#replica-节点验证"},{default:p(()=>[n("Replica 节点验证")]),_:1})])])]),a("li",null,[s(e,{to:"#备份并恢复一个-standby-节点"},{default:p(()=>[n("备份并恢复一个 Standby 节点")]),_:1}),a("ul",null,[a("li",null,[s(e,{to:"#pfs-文件系统格式化和挂载"},{default:p(()=>[n("PFS 文件系统格式化和挂载")]),_:1})]),a("li",null,[s(e,{to:"#备份数据到本地存储和共享存储"},{default:p(()=>[n("备份数据到本地存储和共享存储")]),_:1})]),a("li",null,[s(e,{to:"#重新配置-standby-节点"},{default:p(()=>[n("重新配置 Standby 节点")]),_:1})]),a("li",null,[s(e,{to:"#standby-节点启动"},{default:p(()=>[n("Standby 节点启动")]),_:1})]),a("li",null,[s(e,{to:"#standby-节点验证"},{default:p(()=>[n("Standby 节点验证")]),_:1})])])])])]),h,a("p",null,[n("PolarDB for PostgreSQL 的备份工具 "),g,n(",由 PostgreSQL 的 "),a("a",f,[y,s(i)]),n(" 改造而来,完全兼容 "),S,n(",因此同样可以用于对 PostgreSQL 做备份恢复。"),P,n(" 的可执行文件位于 PolarDB for PostgreSQL 安装目录下的 "),x,n(" 目录中。")]),D])}const F=u(b,[["render",w],["__file","backup-and-restore.html.vue"]]);export{F as default}; diff --git 
a/assets/backup-and-restore.html-15f20d92.js b/assets/backup-and-restore.html-15f20d92.js new file mode 100644 index 00000000000..a4ccee637f9 --- /dev/null +++ b/assets/backup-and-restore.html-15f20d92.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-6fed01c8","path":"/zh/operation/backup-and-restore.html","title":"备份恢复","lang":"zh-CN","frontmatter":{"author":"慎追、棠羽","date":"2023/01/11","minute":30},"headers":[{"level":2,"title":"备份恢复原理","slug":"备份恢复原理","link":"#备份恢复原理","children":[]},{"level":2,"title":"数据目录结构","slug":"数据目录结构","link":"#数据目录结构","children":[{"level":3,"title":"本地数据目录","slug":"本地数据目录","link":"#本地数据目录","children":[]},{"level":3,"title":"共享数据目录","slug":"共享数据目录","link":"#共享数据目录","children":[]}]},{"level":2,"title":"polar_basebackup 备份工具","slug":"polar-basebackup-备份工具","link":"#polar-basebackup-备份工具","children":[]},{"level":2,"title":"备份并恢复一个 Replica 节点","slug":"备份并恢复一个-replica-节点","link":"#备份并恢复一个-replica-节点","children":[{"level":3,"title":"PFS 文件系统挂载","slug":"pfs-文件系统挂载","link":"#pfs-文件系统挂载","children":[]},{"level":3,"title":"备份数据到本地存储","slug":"备份数据到本地存储","link":"#备份数据到本地存储","children":[]},{"level":3,"title":"重新配置 Replica 节点","slug":"重新配置-replica-节点","link":"#重新配置-replica-节点","children":[]},{"level":3,"title":"Replica 节点启动","slug":"replica-节点启动","link":"#replica-节点启动","children":[]},{"level":3,"title":"Replica 节点验证","slug":"replica-节点验证","link":"#replica-节点验证","children":[]}]},{"level":2,"title":"备份并恢复一个 Standby 节点","slug":"备份并恢复一个-standby-节点","link":"#备份并恢复一个-standby-节点","children":[{"level":3,"title":"PFS 文件系统格式化和挂载","slug":"pfs-文件系统格式化和挂载","link":"#pfs-文件系统格式化和挂载","children":[]},{"level":3,"title":"备份数据到本地存储和共享存储","slug":"备份数据到本地存储和共享存储","link":"#备份数据到本地存储和共享存储","children":[]},{"level":3,"title":"重新配置 Standby 节点","slug":"重新配置-standby-节点","link":"#重新配置-standby-节点","children":[]},{"level":3,"title":"Standby 节点启动","slug":"standby-节点启动","link":"#standby-节点启动","children":[]},{"level":3,"title":"Standby 
节点验证","slug":"standby-节点验证","link":"#standby-节点验证","children":[]}]}],"git":{"updatedTime":1673450922000},"filePathRelative":"zh/operation/backup-and-restore.md"}');export{l as data}; diff --git a/assets/backup-and-restore.html-293288f7.js b/assets/backup-and-restore.html-293288f7.js new file mode 100644 index 00000000000..9f1fc1e8a07 --- /dev/null +++ b/assets/backup-and-restore.html-293288f7.js @@ -0,0 +1,234 @@ +import{_ as u,r as t,o as d,c as k,d as s,a,w as p,b as n,e as o}from"./app-3d1677bf.js";const l="/PolarDB-for-PostgreSQL/assets/backup-dir-71d185c8.png",b={},m=a("h1",{id:"备份恢复",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#备份恢复","aria-hidden":"true"},"#"),n(" 备份恢复")],-1),_=a("p",null,"PolarDB for PostgreSQL 采用基于共享存储的存算分离架构,其备份恢复和 PostgreSQL 存在部分差异。本文将指导您如何对 PolarDB for PostgreSQL 进行备份,并通过备份来搭建 Replica 节点或 Standby 节点。",-1),v={class:"table-of-contents"},h=o('

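How backup and restore work

The PostgreSQL backup procedure can be summarized in the following steps:

  1. Enter backup mode
    • Force Full Page Write mode and switch the current WAL segment file
    • Create a backup_label file in the data directory that contains the start position of the base backup
    • Recovery from a backup must start at a checkpoint where in-memory data and on-disk data are consistent, so wait for the next checkpoint or force a CHECKPOINT immediately
  2. Back up the database with file-system-level tools
  3. Exit backup mode
    • Reset Full Page Write mode and switch to the next WAL segment file
    • Create the backup history file, which contains the start and end WAL positions of this base backup, and delete the backup_label file

The simplest way to back up a PostgreSQL database is the pg_basebackup tool.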

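Data Directory Structure

PolarDB for PostgreSQL uses a shared-storage architecture that separates compute from storage. Its data directories fall into two categories:

  • Local data directory: located on the local storage of each compute node and private to that node
  • Shared data directory: located on the shared storage and shared by all compute nodes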

backup-dir

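Because the directories and files in the local data directory do not hold the core data of the database, backing up the local data directory is optional. You can back up only the data directory on the shared storage and then use initdb to generate a new local data directory. However, the local configuration files of each compute node, such as postgresql.conf and pg_hba.conf, must be backed up manually.

Local Data Directory

Run the following SQL command to view the local data directory of a node: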

postgres=# SHOW data_directory;
+     data_directory
+------------------------
+ /home/postgres/primary
+(1 row)
+

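The local data directory is similar to a PostgreSQL data directory; most of its directories and files are generated by initdb. As the database service runs, more local files are produced in it, such as temporary files, cache files, configuration files, and log files. Its structure is as follows: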

$ tree ./ -L 1
+./
+├── base
+├── current_logfiles
+├── global
+├── pg_commit_ts
+├── pg_csnlog
+├── pg_dynshmem
+├── pg_hba.conf
+├── pg_ident.conf
+├── pg_log
+├── pg_logical
+├── pg_logindex
+├── pg_multixact
+├── pg_notify
+├── pg_replslot
+├── pg_serial
+├── pg_snapshots
+├── pg_stat
+├── pg_stat_tmp
+├── pg_subtrans
+├── pg_tblspc
+├── PG_VERSION
+├── pg_xact
+├── polar_cache_trash
+├── polar_dma.conf
+├── polar_fullpage
+├── polar_node_static.conf
+├── polar_rel_size_cache
+├── polar_shmem
+├── polar_shmem_stat_file
+├── postgresql.auto.conf
+├── postgresql.conf
+├── postmaster.opts
+└── postmaster.pid
+
+21 directories, 12 files
+

Shared Data Directory

Run the following SQL command to view the shared data directory that all compute nodes use on the shared storage:

postgres=# SHOW polar_datadir;
+     polar_datadir
+-----------------------
+ /nvme1n1/shared_data/
+(1 row)
+

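The shared data directory holds the core data files of PolarDB for PostgreSQL, such as table files, index files, WAL, DMA, LogIndex, and Flashback Log. These files are shared by all nodes and therefore must be backed up. Its structure is as follows: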

$ sudo pfs -C disk ls /nvme1n1/shared_data/
+   Dir  1     512               Wed Jan 11 09:34:01 2023  base
+   Dir  1     7424              Wed Jan 11 09:34:02 2023  global
+   Dir  1     0                 Wed Jan 11 09:34:02 2023  pg_tblspc
+   Dir  1     512               Wed Jan 11 09:35:05 2023  pg_wal
+   Dir  1     384               Wed Jan 11 09:35:01 2023  pg_logindex
+   Dir  1     0                 Wed Jan 11 09:34:02 2023  pg_twophase
+   Dir  1     128               Wed Jan 11 09:34:02 2023  pg_xact
+   Dir  1     0                 Wed Jan 11 09:34:02 2023  pg_commit_ts
+   Dir  1     256               Wed Jan 11 09:34:03 2023  pg_multixact
+   Dir  1     0                 Wed Jan 11 09:34:03 2023  pg_csnlog
+   Dir  1     256               Wed Jan 11 09:34:03 2023  polar_dma
+   Dir  1     512               Wed Jan 11 09:35:09 2023  polar_fullpage
+  File  1     32                Wed Jan 11 09:35:00 2023  RWID
+   Dir  1     256               Wed Jan 11 10:25:42 2023  pg_replslot
+  File  1     224               Wed Jan 11 10:19:37 2023  polar_non_exclusive_backup_label
+total 16384 (unit: 512Bytes)
+

The polar_basebackup Backup Tool

`,20),g=a("code",null,"polar_basebackup",-1),f={href:"https://www.postgresql.org/docs/11/app-pgbasebackup.html",target:"_blank",rel:"noopener noreferrer"},y=a("code",null,"pg_basebackup",-1),S=a("code",null,"pg_basebackup",-1),P=a("code",null,"polar_basebackup",-1),x=a("code",null,"bin/",-1),D=o(`

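The main function of this tool is to back up the data directories (both the local data directory and the shared data directory) of a running PolarDB for PostgreSQL database into a target directory.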

polar_basebackup takes a base backup of a running PostgreSQL server.
+
+Usage:
+  polar_basebackup [OPTION]...
+
+Options controlling the output:
+  -D, --pgdata=DIRECTORY receive base backup into directory
+  -F, --format=p|t       output format (plain (default), tar)
+  -r, --max-rate=RATE    maximum transfer rate to transfer data directory
+                         (in kB/s, or use suffix "k" or "M")
+  -R, --write-recovery-conf
+                         write recovery.conf for replication
+  -T, --tablespace-mapping=OLDDIR=NEWDIR
+                         relocate tablespace in OLDDIR to NEWDIR
+      --waldir=WALDIR    location for the write-ahead log directory
+  -X, --wal-method=none|fetch|stream
+                         include required WAL files with specified method
+  -z, --gzip             compress tar output
+  -Z, --compress=0-9     compress tar output with given compression level
+
+General options:
+  -c, --checkpoint=fast|spread
+                         set fast or spread checkpointing
+  -C, --create-slot      create replication slot
+  -l, --label=LABEL      set backup label
+  -n, --no-clean         do not clean up after errors
+  -N, --no-sync          do not wait for changes to be written safely to disk
+  -P, --progress         show progress information
+  -S, --slot=SLOTNAME    replication slot to use
+  -v, --verbose          output verbose messages
+  -V, --version          output version information, then exit
+      --no-slot          prevent creation of temporary replication slot
+      --no-verify-checksums
+                         do not verify checksums
+  -?, --help             show this help, then exit
+
+Connection options:
+  -d, --dbname=CONNSTR   connection string
+  -h, --host=HOSTNAME    database server host or socket directory
+  -p, --port=PORT        database server port number
+  -s, --status-interval=INTERVAL
+                         time between status packets sent to server (in seconds)
+  -U, --username=NAME    connect as specified database user
+  -w, --no-password      never prompt for password
+  -W, --password         force password prompt (should happen automatically)
+      --polardata=datadir  receive polar data backup into directory
+      --polar_disk_home=disk_home  polar_disk_home for polar data backup
+      --polar_host_id=host_id  polar_host_id for polar data backup
+      --polar_storage_cluster_name=cluster_name  polar_storage_cluster_name for polar data backup
+

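The options and usage of polar_basebackup are almost identical to those of pg_basebackup. It adds the following options related to shared storage:

  • --polar_disk_home / --polar_host_id / --polar_storage_cluster_name: these three options identify the shared storage node that will hold the backed-up shared data
  • --polardata: the path on the backup shared storage node where the shared data is stored; if omitted, the shared data is backed up to polar_shared_data/ under the local backup directory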
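
As a rough illustration of how these options combine with the standard pg_basebackup options, the following Python sketch assembles a command line from flags that appear in the help text above. The helper name polar_basebackup_argv and the address 10.0.0.1 are hypothetical. The disk option is left out on purpose, because its spelling differs between the help text and the Standby example on this page.

```python
import shlex


def polar_basebackup_argv(host, port, pgdata, polardata=None,
                          storage_cluster_name=None, host_id=None):
    """Build an argument list for polar_basebackup (hypothetical helper)."""
    argv = [
        "polar_basebackup",
        f"--host={host}",
        f"--port={port}",
        "-D", pgdata,
    ]
    # Shared-storage options are only added when a value is supplied.
    if polardata is not None:
        argv.append(f"--polardata={polardata}")
    if storage_cluster_name is not None:
        argv.append(f"--polar_storage_cluster_name={storage_cluster_name}")
    if host_id is not None:
        argv.append(f"--polar_host_id={host_id}")
    # Stream WAL during the backup, show progress, write recovery.conf.
    argv += ["-X", "stream", "--progress", "--write-recovery-conf", "-v"]
    return argv


# 10.0.0.1 is a placeholder address, not taken from this page.
argv = polar_basebackup_argv("10.0.0.1", 5432, "/home/postgres/standby",
                             polardata="/nvme2n1/shared_data/",
                             storage_cluster_name="disk", host_id=3)
print(shlex.join(argv))
```

Returning a list rather than a single string keeps each argument separate, so the result can be passed to a process launcher without extra shell quoting.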

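Backing Up and Restoring a Replica Node

A base backup can be used to set up a new Replica (RO) node. As described above, the data files of a running PolarDB for PostgreSQL instance are spread across the local storage of each compute node and the shared storage of the storage nodes. The following steps show how to use polar_basebackup to back up the data files of the instance to a local disk and start a Replica node from that backup.

Mounting the PFS File System

First, on the machine where the Replica node will be deployed, start the PFSD daemon and mount the PFS file system of the shared storage that is already in use. The Replica node started later uses this daemon to access the shared storage.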

sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
+

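Backing Up Data to Local Storage

Run the following command to back up the local data and shared data of the Primary node to /home/postgres/replica1, the local storage path used to deploy the Replica node: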

polar_basebackup \\
+    --host=[Primary node IP] \\
+    --port=[Primary node port] \\
+    -D /home/postgres/replica1 \\
+    -X stream --progress --write-recovery-conf -v
+

The output looks like this:

polar_basebackup: initiating base backup, waiting for checkpoint to complete
+polar_basebackup: checkpoint completed
+polar_basebackup: write-ahead log start point: 0/16ADD60 on timeline 1
+polar_basebackup: starting background WAL receiver
+polar_basebackup: created temporary replication slot "pg_basebackup_359"
+851371/851371 kB (100%), 2/2 tablespaces
+polar_basebackup: write-ahead log end point: 0/16ADE30
+polar_basebackup: waiting for background process to finish streaming ...
+polar_basebackup: base backup completed
+

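After the backup completes, the backup directory can serve as the local data directory of a new Replica node. Because the local data directory does not need the shared data files that already exist on the shared storage, delete the polar_shared_data/ directory from it: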

rm -rf ~/replica1/polar_shared_data
+

Reconfiguring the Replica Node

Edit the configuration file of the Replica node, ~/replica1/postgresql.conf:

-polar_hostid=1
++polar_hostid=2
+-synchronous_standby_names='replica1'
+
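
The same edit can be scripted. The sketch below is one possible way to apply it, assuming the file uses plain key=value lines; the helper name patch_conf is hypothetical and is not part of PolarDB or PostgreSQL.

```python
import re


def patch_conf(text, set_values, drop_keys):
    """Return conf text with set_values applied and drop_keys removed."""
    out, seen = [], set()
    for line in text.splitlines():
        # Commented-out settings start with '#' and do not match.
        m = re.match(r"\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=", line)
        key = m.group(1) if m else None
        if key in drop_keys:
            continue                      # remove the setting entirely
        if key in set_values:
            if key not in seen:           # keep a single line per key
                out.append(f"{key}={set_values[key]}")
                seen.add(key)
            continue
        out.append(line)
    # Append settings that were not present in the original text.
    for key, value in set_values.items():
        if key not in seen:
            out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


original = "polar_hostid=1\nsynchronous_standby_names='replica1'\nport=5432\n"
print(patch_conf(original, {"polar_hostid": "2"}, {"synchronous_standby_names"}))
```

The function works on text rather than on a file path, so the result can be reviewed before it is written back to postgresql.conf.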

Edit the replication configuration file of the Replica node, ~/replica1/recovery.conf:

polar_replica='on'
+recovery_target_timeline='latest'
+primary_slot_name='replica1'
+primary_conninfo='host=[Primary node IP] port=5432 user=postgres dbname=postgres application_name=replica1'
+

Starting the Replica Node

Start the Replica node:

pg_ctl -D $HOME/replica1 start
+

Verifying the Replica Node

Create a table and insert data on the Primary node. The inserted data can then be queried on the Replica node:

$ psql -q \\
+    -h [Primary node IP] \\
+    -p 5432 \\
+    -d postgres \\
+    -c "CREATE TABLE t (t1 INT PRIMARY KEY, t2 INT); INSERT INTO t VALUES (1, 1),(2, 3),(3, 3);"
+
+$ psql -q \\
+    -h [Replica node IP] \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT * FROM t;"
+ t1 | t2
+----+----
+  1 |  1
+  2 |  3
+  3 |  3
+(3 rows)
+

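Backing Up and Restoring a Standby Node

A base backup can also be used to set up a new Standby node. As shown in the following figure, the Standby node uses its own shared storage, separate from that of the Primary / Replica nodes, and stays in sync with the Primary node through physical replication. A Standby node can serve as disaster recovery for the primary shared storage.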

backup-dir

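Formatting and Mounting the PFS File System

Assume that the machine for the Standby compute node already has a shared storage device, nvme2n1, prepared for the standby: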

$ lsblk
+NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
+nvme0n1     259:1    0  40G  0 disk
+└─nvme0n1p1 259:2    0  40G  0 part /etc/hosts
+nvme2n1     259:3    0  70G  0 disk
+nvme1n1     259:0    0  60G  0 disk
+

Format this shared storage as PFS, then start the PFSD daemon to mount the PFS file system:

sudo pfs -C disk mkfs nvme2n1
+sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme2n1 -w 2
+

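Backing Up Data to Local Storage and Shared Storage

On the machine where the Standby node will be deployed, run the backup with ~/standby as the local data directory and /nvme2n1/shared_data as the shared data directory: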

polar_basebackup \\
+    --host=[Primary node IP] \\
+    --port=[Primary node port] \\
+    -D /home/postgres/standby \\
+    --polardata=/nvme2n1/shared_data/ \\
+    --polar_storage_cluster_name=disk \\
+    --polar_disk_name=nvme2n1 \\
+    --polar_host_id=3 \\
+    -X stream --progress --write-recovery-conf -v
+

The output looks like the following. Besides the polar_basebackup output, it also contains PFS log output:

[PFSD_SDK INF Jan 11 10:11:27.247112][99]pfs_mount_prepare 103: begin prepare mount cluster(disk), PBD(nvme2n1), hostid(3),flags(0x13)
+[PFSD_SDK INF Jan 11 10:11:27.247161][99]pfs_mount_prepare 165: pfs_mount_prepare success for nvme2n1 hostid 3
+[PFSD_SDK INF Jan 11 10:11:27.293900][99]chnl_connection_poll_shm 1238: ack data update s_mount_epoch 1
+[PFSD_SDK INF Jan 11 10:11:27.293912][99]chnl_connection_poll_shm 1266: connect and got ack data from svr, err = 0, mntid 0
+[PFSD_SDK INF Jan 11 10:11:27.293979][99]pfsd_sdk_init 191: pfsd_chnl_connect success
+[PFSD_SDK INF Jan 11 10:11:27.293987][99]pfs_mount_post 208: pfs_mount_post err : 0
+[PFSD_SDK ERR Jan 11 10:11:27.297257][99]pfsd_opendir 1437: opendir /nvme2n1/shared_data/ error: No such file or directory
+[PFSD_SDK INF Jan 11 10:11:27.297396][99]pfsd_mkdir 1320: mkdir /nvme2n1/shared_data
+polar_basebackup: initiating base backup, waiting for checkpoint to complete
+WARNING:  a labelfile "/nvme1n1/shared_data//polar_non_exclusive_backup_label" is already on disk
+HINT:  POLAR: we overwrite it
+polar_basebackup: checkpoint completed
+polar_basebackup: write-ahead log start point: 0/16C91F8 on timeline 1
+polar_basebackup: starting background WAL receiver
+polar_basebackup: created temporary replication slot "pg_basebackup_373"
+...
+[PFSD_SDK INF Jan 11 10:11:32.992005][99]pfsd_open 539: open /nvme2n1/shared_data/polar_non_exclusive_backup_label with inode 6325, fd 0
+[PFSD_SDK INF Jan 11 10:11:32.993074][99]pfsd_open 539: open /nvme2n1/shared_data/global/pg_control with inode 8373, fd 0
+851396/851396 kB (100%), 2/2 tablespaces
+polar_basebackup: write-ahead log end point: 0/16C9300
+polar_basebackup: waiting for background process to finish streaming ...
+polar_basebackup: base backup completed
+[PFSD_SDK INF Jan 11 10:11:52.378220][99]pfsd_umount_force 247: pbdname nvme2n1
+[PFSD_SDK INF Jan 11 10:11:52.378229][99]pfs_umount_prepare 269: pfs_umount_prepare. pbdname:nvme2n1
+[PFSD_SDK INF Jan 11 10:11:52.404010][99]chnl_connection_release_shm 1164: client umount return : deleted /var/run/pfsd//nvme2n1/99.pid
+[PFSD_SDK INF Jan 11 10:11:52.404171][99]pfs_umount_post 281: pfs_umount_post. pbdname:nvme2n1
+[PFSD_SDK INF Jan 11 10:11:52.404174][99]pfsd_umount_force 261: umount success for nvme2n1
+

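The command above backs up the local data directory of the Primary node to the local storage of the current machine, and backs up the shared data directory to the shared storage directory given by the options.

Reconfiguring the Standby Node

Edit the configuration file of the Standby node, ~/standby/postgresql.conf: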

-polar_hostid=1
++polar_hostid=3
+-polar_disk_name='nvme1n1'
+-polar_datadir='/nvme1n1/shared_data/'
++polar_disk_name='nvme2n1'
++polar_datadir='/nvme2n1/shared_data/'
+-synchronous_standby_names='replica1'
+

Add the following to the replication configuration file of the Standby node, ~/standby/recovery.conf:

+recovery_target_timeline = 'latest'
++primary_slot_name = 'standby1'
+

Starting the Standby Node

On the Primary node, create the replication slot used for physical replication with the Standby node:

$ psql \\
+    --host=[Primary node IP] --port=5432 \\
+    -d postgres \\
+    -c "SELECT * FROM pg_create_physical_replication_slot('standby1');"
+ slot_name | lsn
+-----------+-----
+ standby1  |
+(1 row)
+

Start the Standby node:

pg_ctl -D $HOME/standby start
+

Verifying the Standby Node

Create a table and insert data on the Primary node. The data can then be queried on the Standby node:

$ psql -q \\
+    -h [Primary node IP] \\
+    -p 5432 \\
+    -d postgres \\
+    -c "CREATE TABLE t (t1 INT PRIMARY KEY, t2 INT); INSERT INTO t VALUES (1, 1),(2, 3),(3, 3);"
+
+$ psql -q \\
+    -h [Standby node IP] \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT * FROM t;"
+ t1 | t2
+----+----
+  1 |  1
+  2 |  3
+  3 |  3
+(3 rows)
+
`,54);function w(c,R){const r=t("ArticleInfo"),e=t("router-link"),i=t("ExternalLinkIcon");return d(),k("div",null,[m,s(r,{frontmatter:c.$frontmatter},null,8,["frontmatter"]),_,a("nav",v,[a("ul",null,[a("li",null,[s(e,{to:"#备份恢复原理"},{default:p(()=>[n("备份恢复原理")]),_:1})]),a("li",null,[s(e,{to:"#数据目录结构"},{default:p(()=>[n("数据目录结构")]),_:1}),a("ul",null,[a("li",null,[s(e,{to:"#本地数据目录"},{default:p(()=>[n("本地数据目录")]),_:1})]),a("li",null,[s(e,{to:"#共享数据目录"},{default:p(()=>[n("共享数据目录")]),_:1})])])]),a("li",null,[s(e,{to:"#polar-basebackup-备份工具"},{default:p(()=>[n("polar_basebackup 备份工具")]),_:1})]),a("li",null,[s(e,{to:"#备份并恢复一个-replica-节点"},{default:p(()=>[n("备份并恢复一个 Replica 节点")]),_:1}),a("ul",null,[a("li",null,[s(e,{to:"#pfs-文件系统挂载"},{default:p(()=>[n("PFS 文件系统挂载")]),_:1})]),a("li",null,[s(e,{to:"#备份数据到本地存储"},{default:p(()=>[n("备份数据到本地存储")]),_:1})]),a("li",null,[s(e,{to:"#重新配置-replica-节点"},{default:p(()=>[n("重新配置 Replica 节点")]),_:1})]),a("li",null,[s(e,{to:"#replica-节点启动"},{default:p(()=>[n("Replica 节点启动")]),_:1})]),a("li",null,[s(e,{to:"#replica-节点验证"},{default:p(()=>[n("Replica 节点验证")]),_:1})])])]),a("li",null,[s(e,{to:"#备份并恢复一个-standby-节点"},{default:p(()=>[n("备份并恢复一个 Standby 节点")]),_:1}),a("ul",null,[a("li",null,[s(e,{to:"#pfs-文件系统格式化和挂载"},{default:p(()=>[n("PFS 文件系统格式化和挂载")]),_:1})]),a("li",null,[s(e,{to:"#备份数据到本地存储和共享存储"},{default:p(()=>[n("备份数据到本地存储和共享存储")]),_:1})]),a("li",null,[s(e,{to:"#重新配置-standby-节点"},{default:p(()=>[n("重新配置 Standby 节点")]),_:1})]),a("li",null,[s(e,{to:"#standby-节点启动"},{default:p(()=>[n("Standby 节点启动")]),_:1})]),a("li",null,[s(e,{to:"#standby-节点验证"},{default:p(()=>[n("Standby 节点验证")]),_:1})])])])])]),h,a("p",null,[n("PolarDB for PostgreSQL 的备份工具 "),g,n(",由 PostgreSQL 的 "),a("a",f,[y,s(i)]),n(" 改造而来,完全兼容 "),S,n(",因此同样可以用于对 PostgreSQL 做备份恢复。"),P,n(" 的可执行文件位于 PolarDB for PostgreSQL 安装目录下的 "),x,n(" 目录中。")]),D])}const F=u(b,[["render",w],["__file","backup-and-restore.html.vue"]]);export{F as default}; diff --git 
a/assets/backup-and-restore.html-7682916f.js b/assets/backup-and-restore.html-7682916f.js new file mode 100644 index 00000000000..29922891543 --- /dev/null +++ b/assets/backup-and-restore.html-7682916f.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-7fdfc12a","path":"/operation/backup-and-restore.html","title":"备份恢复","lang":"en-US","frontmatter":{"author":"慎追、棠羽","date":"2023/01/11","minute":30},"headers":[{"level":2,"title":"备份恢复原理","slug":"备份恢复原理","link":"#备份恢复原理","children":[]},{"level":2,"title":"数据目录结构","slug":"数据目录结构","link":"#数据目录结构","children":[{"level":3,"title":"本地数据目录","slug":"本地数据目录","link":"#本地数据目录","children":[]},{"level":3,"title":"共享数据目录","slug":"共享数据目录","link":"#共享数据目录","children":[]}]},{"level":2,"title":"polar_basebackup 备份工具","slug":"polar-basebackup-备份工具","link":"#polar-basebackup-备份工具","children":[]},{"level":2,"title":"备份并恢复一个 Replica 节点","slug":"备份并恢复一个-replica-节点","link":"#备份并恢复一个-replica-节点","children":[{"level":3,"title":"PFS 文件系统挂载","slug":"pfs-文件系统挂载","link":"#pfs-文件系统挂载","children":[]},{"level":3,"title":"备份数据到本地存储","slug":"备份数据到本地存储","link":"#备份数据到本地存储","children":[]},{"level":3,"title":"重新配置 Replica 节点","slug":"重新配置-replica-节点","link":"#重新配置-replica-节点","children":[]},{"level":3,"title":"Replica 节点启动","slug":"replica-节点启动","link":"#replica-节点启动","children":[]},{"level":3,"title":"Replica 节点验证","slug":"replica-节点验证","link":"#replica-节点验证","children":[]}]},{"level":2,"title":"备份并恢复一个 Standby 节点","slug":"备份并恢复一个-standby-节点","link":"#备份并恢复一个-standby-节点","children":[{"level":3,"title":"PFS 文件系统格式化和挂载","slug":"pfs-文件系统格式化和挂载","link":"#pfs-文件系统格式化和挂载","children":[]},{"level":3,"title":"备份数据到本地存储和共享存储","slug":"备份数据到本地存储和共享存储","link":"#备份数据到本地存储和共享存储","children":[]},{"level":3,"title":"重新配置 Standby 节点","slug":"重新配置-standby-节点","link":"#重新配置-standby-节点","children":[]},{"level":3,"title":"Standby 节点启动","slug":"standby-节点启动","link":"#standby-节点启动","children":[]},{"level":3,"title":"Standby 
节点验证","slug":"standby-节点验证","link":"#standby-节点验证","children":[]}]}],"git":{"updatedTime":1673450922000},"filePathRelative":"operation/backup-and-restore.md"}');export{l as data}; diff --git a/assets/backup-dir-71d185c8.png b/assets/backup-dir-71d185c8.png new file mode 100644 index 00000000000..320fd144d3c Binary files /dev/null and b/assets/backup-dir-71d185c8.png differ diff --git a/assets/buffer-management.html-120b73ba.js b/assets/buffer-management.html-120b73ba.js new file mode 100644 index 00000000000..140ecf611f8 --- /dev/null +++ b/assets/buffer-management.html-120b73ba.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-46e5eefa","path":"/theory/buffer-management.html","title":"Buffer Management","lang":"en-US","frontmatter":{},"headers":[{"level":2,"title":"Background Information","slug":"background-information","link":"#background-information","children":[]},{"level":2,"title":"Terms","slug":"terms","link":"#terms","children":[]},{"level":2,"title":"Flushing Control","slug":"flushing-control","link":"#flushing-control","children":[]},{"level":2,"title":"Consistent LSNs","slug":"consistent-lsns","link":"#consistent-lsns","children":[{"level":3,"title":"Flush Lists","slug":"flush-lists","link":"#flush-lists","children":[]},{"level":3,"title":"Parallel Flushing","slug":"parallel-flushing","link":"#parallel-flushing","children":[]}]},{"level":2,"title":"Hot Buffers","slug":"hot-buffers","link":"#hot-buffers","children":[]},{"level":2,"title":"Lazy Checkpointing","slug":"lazy-checkpointing","link":"#lazy-checkpointing","children":[]}],"git":{"updatedTime":1656919280000},"filePathRelative":"theory/buffer-management.md"}');export{l as data}; diff --git a/assets/buffer-management.html-35dc0ba1.js b/assets/buffer-management.html-35dc0ba1.js new file mode 100644 index 00000000000..b40fbd7997e --- /dev/null +++ b/assets/buffer-management.html-35dc0ba1.js @@ -0,0 +1 @@ +const 
l=JSON.parse('{"key":"v-7ac661aa","path":"/zh/theory/buffer-management.html","title":"缓冲区管理","lang":"zh-CN","frontmatter":{},"headers":[{"level":2,"title":"背景介绍","slug":"背景介绍","link":"#背景介绍","children":[]},{"level":2,"title":"术语解释","slug":"术语解释","link":"#术语解释","children":[]},{"level":2,"title":"刷脏控制","slug":"刷脏控制","link":"#刷脏控制","children":[]},{"level":2,"title":"一致性位点","slug":"一致性位点","link":"#一致性位点","children":[{"level":3,"title":"FlushList","slug":"flushlist","link":"#flushlist","children":[]},{"level":3,"title":"并行刷脏","slug":"并行刷脏","link":"#并行刷脏","children":[]}]},{"level":2,"title":"热点页","slug":"热点页","link":"#热点页","children":[]},{"level":2,"title":"Lazy Checkpoint","slug":"lazy-checkpoint","link":"#lazy-checkpoint","children":[]}],"git":{"updatedTime":1656919280000},"filePathRelative":"zh/theory/buffer-management.md"}');export{l as data}; diff --git a/assets/buffer-management.html-5ac35282.js b/assets/buffer-management.html-5ac35282.js new file mode 100644 index 00000000000..c2c63696d95 --- /dev/null +++ b/assets/buffer-management.html-5ac35282.js @@ -0,0 +1,5 @@ +import{_ as e,c as r,a,b as l}from"./9_future_pages-9e3b8fc6.js";import{_ as o,o as s,c as p,e as f}from"./app-3d1677bf.js";const i="/PolarDB-for-PostgreSQL/assets/42_buffer_conntrol-0b37890d.png",t="/PolarDB-for-PostgreSQL/assets/42_FlushList-20e70d3c.png",n="/PolarDB-for-PostgreSQL/assets/43_parr_Flush-3be063a3.png",u="/PolarDB-for-PostgreSQL/assets/44_Copy_Buffer-505a142f.png",c={},d=f('

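Buffer Management

Background Information

In a conventional primary/standby architecture, the primary and standby nodes each have their own storage. A standby node applies WAL records and reads and writes its own storage, so the nodes are not coupled at the storage layer. PolarDB instead uses a one-writer, multiple-reader architecture built on shared storage, in which all nodes use a single copy of the data. The read-write node, also called the primary node, can read and write data in the shared storage. A read-only node, also called a standby or Replica node, can only read data from the shared storage while applying WAL records on its own, and cannot write to it. The following figure shows the basic architecture: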

image.png

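In the one-writer, multiple-reader architecture, a read-only node may read two types of pages from the shared storage:

  • Future page: The page contains changes that the read-only node has not applied yet. For example, the read-only node has applied WAL records up to LSN 200, but the page already contains the change made by the WAL record with LSN 300. Such a page is called a future page.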

    image.png

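  • Outdated page: The page does not contain all changes made before the apply LSN. For example, the read-only node has applied WAL records up to LSN 200 to a page, but after the page is evicted from the buffer pool and read again from the shared storage, the version read does not contain the change made by the WAL record with LSN 200. Such a page is called an outdated page.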

    image.png

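A read-only node only needs pages that match its apply LSN. How should it handle the future pages and outdated pages described above?

  • For an outdated page, the read-only node applies the WAL records that the page is missing up to the apply LSN of the node. Each read-only node does this based on its own apply LSN as part of its replay logic, which this document does not cover.
  • For a future page, the read-only node cannot turn the page back into the version it needs. The primary node must therefore take the replay progress of all read-only nodes into account when it writes data to the shared storage, so that no read-only node reads a future page. This is the main problem that buffer management solves.

In addition, buffer management maintains the consistent LSN. For a given page, a read-only node only needs to apply the WAL records between the consistent LSN and its current apply LSN, which speeds up replay.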

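Terms

  • Buffer Pool: An in-memory structure that stores the most frequently accessed data, usually cached in units of pages. In PolarDB, each node has its own buffer pool.
  • LSN: Log Sequence Number, the unique identifier of a WAL record. LSNs increase globally.
  • Apply LSN: The position up to which a read-only node has applied WAL records, usually expressed as an LSN.
  • Oldest Apply LSN: The smallest apply LSN among all read-only nodes.

Flushing Control

To keep read-only nodes from reading future pages, PolarDB introduces flushing control: before the primary node writes a page to the shared storage, it checks whether all read-only nodes have applied the WAL record of the most recent change to that page.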

image.png

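The pages in the buffer pool of the primary node fall into two groups depending on whether they contain "future data", that is, data produced after the apply LSN of a read-only node: pages that can be written to storage and pages that cannot. The decision depends on two LSNs:

  • The LSN of the most recent change to the buffer, called the Buffer Latest LSN.
  • The oldest apply LSN, that is, the smallest apply LSN among all read-only nodes, called the Oldest Apply LSN.

The flushing control rule is as follows: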

if buffer latest lsn <= oldest apply lsn
+    flush buffer
+else
+    do not flush buffer
+

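Consistent LSN

To replay a page up to a given LSN, a read-only node maintains the mapping between each page and the LSNs of the WAL records that changed it. This mapping is stored in the LogIndex, which can be thought of as a hash table that can be persisted. When a page is accessed, the node looks up all LSNs that need to be replayed for the page, applies the corresponding WAL records in order, and produces the page version it needs.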

image.png

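The more changes a page receives, the more LSNs it has and the longer replay takes. To minimize the number of LSNs that must be replayed per page, PolarDB introduces the consistent LSN.

The consistent LSN means that every page changed by WAL records before it has already been persisted to storage. The primary node sends its current WAL write LSN and the consistent LSN to the read-only nodes, and each read-only node sends its current apply LSN to the primary node. Because all changes before the consistent LSN are already in the shared storage, read-only nodes no longer need to apply WAL records before it. All LSNs smaller than the consistent LSN can therefore be removed from the LogIndex, which both speeds up replay and reduces the space the LogIndex occupies.

FlushList

To maintain the consistent LSN, PolarDB keeps one piece of in-memory state per buffer: the LSN of the first change to that buffer, called the oldest LSN. The smallest oldest LSN across all buffers is the consistent LSN.

One way to obtain the consistent LSN is to scan every buffer in the buffer pool for the minimum, but the CPU cost and latency of a full scan are unacceptable. To obtain it efficiently, PolarDB introduces the FlushList, which keeps all dirty pages in the buffer pool sorted by oldest LSN in ascending order. With the FlushList, the consistent LSN can be obtained in O(1) time.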

image.png

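When a buffer is changed for the first time and marked dirty, it is inserted into the FlushList and its oldest LSN is set. When the buffer is written to storage, this in-memory mark is cleared.

To advance the consistent LSN efficiently, the PolarDB background writer (bgwriter) flushes buffers in the order in which they were first changed: it walks the FlushList from head to tail and flushes pages one by one, and as soon as a dirty page reaches storage the consistent LSN can advance. In the figure above, once the buffer with an oldest LSN of 10 is flushed, the consistent LSN can advance to 30.

Parallel Flushing

To advance the consistent LSN even faster, PolarDB flushes in parallel. Each background flushing process takes a batch of pages from the FlushList and flushes them.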

image.png

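Hot Pages

With flushing control in place, only buffers that meet the flush condition can be written to storage. If a buffer is changed very frequently, its Buffer Latest LSN may always stay larger than the Oldest Apply LSN, so the buffer never meets the flush condition. Such a buffer is called a hot page. Hot pages keep the consistent LSN from advancing. To solve this, PolarDB introduces the Copy Buffer mechanism.

The Copy Buffer mechanism copies selected buffers that do not meet the flush condition from the buffer pool into a separate Copy Buffer Pool. A buffer in the Copy Buffer Pool is never changed again and its latest LSN is never updated, so as the Oldest Apply LSN advances, the copy eventually meets the flush condition and can be written to storage.

With the Copy Buffer mechanism, flushing works as follows:

  1. If a buffer does not meet the flush condition, check its recent change count and its distance from the current log position. If they exceed certain thresholds, copy the page into the Copy Buffer Pool.
  2. The next time the buffer is considered for flushing, check whether it meets the flush condition. If it does, write the buffer to storage and release its Copy Buffer.
  3. If the buffer does not meet the flush condition, check whether it has a Copy Buffer. If one exists and meets the flush condition, write the Copy Buffer to storage.
  4. After a buffer has been copied into the Copy Buffer Pool, any later change to the buffer regenerates its oldest LSN and appends the buffer to the tail of the FlushList.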
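
The four steps above can be sketched as a small decision function. This is a simplified model, not PolarDB code: the Buffer class, the modify and flush_step helpers, and the thresholds hot_changes and hot_distance are all hypothetical. It also assumes that both thresholds must be exceeded and that the distance is measured from the oldest LSN of the buffer, neither of which the text specifies.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Buffer:
    oldest_lsn: int
    latest_lsn: int
    recent_changes: int = 0
    copy_lsns: Optional[Tuple[int, int]] = None  # frozen (oldest, latest) of the copy
    changed_since_copy: bool = False


def modify(buf: Buffer, lsn: int) -> None:
    """Record a change; step 4 resets the oldest LSN after a copy was taken."""
    if buf.copy_lsns is not None and not buf.changed_since_copy:
        buf.oldest_lsn = lsn              # buffer is re-queued at the FlushList tail
        buf.changed_since_copy = True
    buf.latest_lsn = lsn
    buf.recent_changes += 1


def flush_step(buf: Buffer, oldest_apply_lsn: int, current_lsn: int,
               hot_changes: int = 3, hot_distance: int = 100) -> str:
    """Return the action one flush attempt takes for this buffer."""
    if buf.latest_lsn <= oldest_apply_lsn:
        buf.copy_lsns = None              # step 2: flush buffer, release copy
        return "flush buffer"
    if buf.copy_lsns is not None:
        if buf.copy_lsns[1] <= oldest_apply_lsn:
            buf.copy_lsns = None          # step 3: the frozen copy is flushable
            return "flush copy"
        return "wait"
    # Step 1 (assumed thresholds): frequently changed and far behind the log.
    if (buf.recent_changes >= hot_changes
            and current_lsn - buf.oldest_lsn >= hot_distance):
        buf.copy_lsns = (buf.oldest_lsn, buf.latest_lsn)
        buf.changed_since_copy = False
        return "copy"
    return "wait"


page = Buffer(oldest_lsn=30, latest_lsn=500, recent_changes=5)
print(flush_step(page, oldest_apply_lsn=400, current_lsn=500))
modify(page, 600)
print(flush_step(page, oldest_apply_lsn=550, current_lsn=650))
```

The usage lines follow the [30, 500] example: the page is first copied, then changed at LSN 600, and the frozen copy becomes flushable once the Oldest Apply LSN passes 500.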

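In the figure below, the buffer whose [oldest LSN, latest LSN] is [30, 500] is treated as a hot page and copied into the Copy Buffer Pool. The page is then changed again. Assuming the change has LSN 600, its oldest LSN is set to 600, and the buffer is removed from the FlushList and appended to its tail. The page in the Copy Buffer is never changed again, so its latest LSN stays at 500, and once it meets the flush condition the Copy Buffer can be written to storage.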

image.png

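Note that the Copy Buffer mechanism changes how the consistent LSN is calculated. The oldest LSN at the head of the FlushList is no longer guaranteed to be the smallest one, because the Copy Buffer Pool may hold a smaller oldest LSN. In addition to the oldest LSN in the FlushList, the Copy Buffer Pool must be scanned for its smallest oldest LSN, and the smaller of the two values is the consistent LSN.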
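
As a minimal sketch of this calculation, the function below takes the oldest LSNs in the FlushList (already sorted in ascending order) and the oldest LSNs held in the Copy Buffer Pool. The function name and the choice to return None when both are empty are assumptions made for the example.

```python
def consistent_lsn(flush_list_oldest_lsns, copy_buffer_oldest_lsns):
    """Smaller of the FlushList head and the minimum in the Copy Buffer Pool."""
    candidates = []
    if flush_list_oldest_lsns:
        # The FlushList is sorted, so its head is its minimum: O(1).
        candidates.append(flush_list_oldest_lsns[0])
    if copy_buffer_oldest_lsns:
        # The Copy Buffer Pool is not sorted and has to be scanned.
        candidates.append(min(copy_buffer_oldest_lsns))
    # Assumption: None means there is nothing dirty to bound the value.
    return min(candidates) if candidates else None


# The hot page was re-queued with oldest LSN 600; its copy still holds 30.
print(consistent_lsn([50, 600], [30]))
```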

Lazy Checkpoint

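The consistent LSN that PolarDB introduces is similar to a checkpoint. In PolarDB, the checkpoint LSN indicates that all data before it has been written to storage, so crash recovery can start from the checkpoint LSN, which makes recovery faster. A regular checkpoint writes all dirty pages in the buffer pool and other in-memory data to storage. This can take a long time and generate heavy I/O, which may affect normal business requests.

Building on the consistent LSN, PolarDB introduces a special kind of checkpoint: the Lazy Checkpoint. It is called lazy because, unlike a regular checkpoint, it does not write all dirty pages in the buffer pool to storage. Instead, it uses the current consistent LSN directly as the checkpoint LSN, which greatly improves checkpoint efficiency.

The overall idea of the Lazy Checkpoint is to replace the one-shot flushing of a large number of dirty pages in a regular checkpoint with continuous flushing by the background flushing processes, which also maintain the consistent LSN. Note that the Lazy Checkpoint conflicts with the Full Page Write feature in PolarDB; enabling Full Page Write automatically disables it.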

',44),h=[d];function L(B,g){return s(),p("div",null,h)}const N=o(c,[["render",L],["__file","buffer-management.html.vue"]]);export{N as default}; diff --git a/assets/buffer-management.html-d9b5fb2e.js b/assets/buffer-management.html-d9b5fb2e.js new file mode 100644 index 00000000000..e532c7fcb7b --- /dev/null +++ b/assets/buffer-management.html-d9b5fb2e.js @@ -0,0 +1,5 @@ +import{_ as e,c as t,a,b as o}from"./9_future_pages-13873b1a.js";import{_ as s,o as r,c as n,e as h}from"./app-3d1677bf.js";const i="/PolarDB-for-PostgreSQL/assets/42_buffer_conntrol-6b7ab4e5.png",l="/PolarDB-for-PostgreSQL/assets/42_FlushList-a5ba8869.png",d="/PolarDB-for-PostgreSQL/assets/43_parr_Flush-3be063a3.png",f="/PolarDB-for-PostgreSQL/assets/44_Copy_Buffer-505a142f.png",p={},c=h('

Buffer Management

Background Information

In a conventional database system, the primary instance and the read-only instances are each allocated a specific amount of exclusive storage space. The read-only instances can apply write-ahead logging (WAL) records and can read and write data to their own storage. A PolarDB cluster consists of a primary node and at least one read-only node. The primary node and the read-only nodes share the same physical storage. The primary node can read and write data to the shared storage. The read-only nodes can read data from the shared storage by applying WAL records but cannot write data to the shared storage. The following figure shows the architecture of a PolarDB cluster.

image.png

The read-only nodes may read two types of pages from the shared storage:

  • Future pages: The pages that the read-only nodes read from the shared storage incorporate changes that are made after the apply log sequence numbers (LSNs) of the pages. For example, the read-only nodes have applied all WAL records up to the WAL record with an LSN of 200 to a page, but the change described by the most recent WAL record with an LSN of 300 has been incorporated into the same page in the shared storage. These pages are called future pages.

    image.png

  • Outdated pages: The pages that the read-only nodes read from the shared storage do not incorporate all changes that are made up to the apply LSN. For example, a read-only node has applied all WAL records up to LSN 200 to a page, but after the page is evicted from the buffer pool and read again from the shared storage, the version that is read does not incorporate the change described by the WAL record with an LSN of 200. These pages are called outdated pages.

    image.png

Each read-only node expects to read pages that incorporate only the changes made up to its apply LSN. Outdated pages and future pages that are read from the shared storage are handled as follows:

  • For outdated pages, the read-only node applies the WAL records that the page is missing up to the apply LSN of the node. Each read-only node does this based on its own apply LSN as part of its replay logic, which is outside the scope of this topic.
  • For future pages, a read-only node cannot convert the page back into the version it needs. The primary node must therefore take the replay progress of all read-only nodes into account when it writes data to the shared storage, so that the read-only nodes never read future pages. This is the focus of buffer management.
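
To make the two cases concrete, the following sketch classifies a page that a read-only node has just read. It is an illustration only: classify_page and its parameters are hypothetical names, page_lsn stands for the LSN of the last change contained in the page as read, and page_record_lsns stands for the LSNs of the WAL records known to have changed the page.

```python
def classify_page(page_lsn, apply_lsn, page_record_lsns):
    """Return (kind, missing_lsns) for a page read from shared storage."""
    if page_lsn > apply_lsn:
        # The page already contains changes the node has not applied.
        return "future", []
    # Records the node has applied but the stored page does not contain.
    missing = [lsn for lsn in sorted(page_record_lsns)
               if page_lsn < lsn <= apply_lsn]
    if missing:
        return "outdated", missing
    return "current", []


records = [100, 200, 300]
print(classify_page(300, 200, records))
print(classify_page(100, 200, records))
```

For an outdated page the returned list is exactly what the read-only node would have to apply. For a future page nothing can be done locally, which is why the primary node has to prevent that case.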

Buffer management involves consistent LSNs. For a specific page, each read-only node needs to apply only the WAL records that are generated between the consistent LSN and the apply LSN. This reduces the time that is required to apply WAL records on the read-only nodes.

Terms

  • Buffer Pool: A buffer pool is an amount of memory that is used to store frequently accessed data. In most cases, data is cached in the buffer pool as pages. In a PolarDB cluster, each compute node has its own buffer pool.
  • LSN: Each LSN is the unique identifier of a WAL record. LSNs globally increment.
  • Apply LSN: The apply LSN of a read-only node marks the most recent WAL record that the read-only node has applied. Also called Replay LSN.
  • Oldest Apply LSN: The oldest apply LSN is the smallest apply LSN among the apply LSNs of all the read-only nodes.

Flushing Control

PolarDB provides a flushing control mechanism to prevent the read-only nodes from reading future pages from the shared storage. Before the primary node writes a page to the shared storage, the primary node checks whether all the read-only nodes have applied the most recent WAL record of the page.

image.png

The pages in the buffer pool of the primary node are divided into the following two types based on whether the pages incorporate the changes that are made after the apply LSNs of the pages: pages that can be flushed to the shared storage and pages that cannot be flushed to the shared storage. This categorization is based on the following LSNs:

  • Latest LSN: The latest LSN of a buffer is the LSN of the most recent change that is made to the buffer on the primary node.
  • Oldest apply LSN: The oldest apply LSN is the smallest apply LSN among the apply LSNs of all the read-only nodes.

The primary node determines whether to flush a dirty page to the shared storage based on the following rules:

if buffer latest lsn <= oldest apply lsn
+    flush buffer
+else
+    do not flush buffer
+
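
The rule can be written as a small runnable function. This is a sketch under stated assumptions: can_flush is a hypothetical name, and returning True when there are no read-only nodes is a choice made for the example that the text does not specify.

```python
def can_flush(buffer_latest_lsn, replica_apply_lsns):
    """Return True if the buffer may be written to the shared storage."""
    if not replica_apply_lsns:
        # Assumption: with no read-only nodes, nothing can read a future page.
        return True
    oldest_apply_lsn = min(replica_apply_lsns)
    return buffer_latest_lsn <= oldest_apply_lsn


print(can_flush(200, [200, 300]))  # every node has applied LSN 200
print(can_flush(300, [200, 350]))  # one node is still at LSN 200
```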

Consistent LSNs

To apply the WAL records of a page up to a specified LSN, each read-only node manages the mapping between the page and the LSNs of all WAL records that are generated for the page. This mapping is stored as a LogIndex. A LogIndex is used as a hash table that can be persistently stored. When a read-only node requests a page, the read-only node traverses the LogIndex of the page to obtain the LSNs of all WAL records that need to be applied. Then, the read-only node applies the WAL records in sequence to generate the most recent version of the page.

image.png

For a specific page, more changes mean more LSNs and a longer period of time required to apply WAL records. To minimize the number of WAL records that need to be applied for each page, PolarDB provides consistent LSNs.

The consistent LSN indicates that every page changed by a WAL record generated before this LSN has been persisted to the shared storage. The primary node sends its current WAL write LSN and the consistent LSN to each read-only node, and each read-only node sends its apply LSN to the primary node. Because all changes made before the consistent LSN are already in the shared storage, the read-only nodes do not need to apply the WAL records that are generated before the consistent LSN. Therefore, all LSNs that are smaller than the consistent LSN can be removed from the LogIndex. This reduces the number of WAL records that the read-only nodes need to apply and the storage space that the LogIndex occupies.
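
The effect of pruning can be sketched with a toy in-memory model. The LogIndex class below is a hypothetical stand-in built on a dictionary; it does not reflect the real on-disk structure and only illustrates the lookup and the removal of LSNs below the consistent LSN.

```python
from collections import defaultdict


class LogIndex:
    """Toy model: page id -> LSNs of the WAL records that changed the page."""

    def __init__(self):
        self._lsns = defaultdict(list)

    def add(self, page_id, lsn):
        # WAL is generated in LSN order, so each list stays ascending.
        self._lsns[page_id].append(lsn)

    def prune(self, consistent_lsn):
        """Drop every LSN smaller than the consistent LSN."""
        for page_id in list(self._lsns):
            kept = [lsn for lsn in self._lsns[page_id] if lsn >= consistent_lsn]
            if kept:
                self._lsns[page_id] = kept
            else:
                del self._lsns[page_id]

    def lsns_to_replay(self, page_id, apply_lsn):
        """LSNs still to be applied for this page, up to the apply LSN."""
        return [lsn for lsn in self._lsns.get(page_id, []) if lsn <= apply_lsn]


index = LogIndex()
for lsn in (100, 150, 220):
    index.add("page-A", lsn)
index.prune(consistent_lsn=200)
print(index.lsns_to_replay("page-A", apply_lsn=300))
```

After pruning at 200, only the record at LSN 220 remains to be applied for the page, instead of all three.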

Flush Lists

PolarDB holds a specific state for each buffer in the memory: the LSN that marks the first change to the buffer. This LSN is called the oldest LSN. The consistent LSN is the smallest oldest LSN among the oldest LSNs of all buffers.

A straightforward method of obtaining the consistent LSN requires the primary node to traverse all buffers in the buffer pool to find the smallest oldest LSN. This method causes significant CPU overhead and a long traversal process. To address these issues, PolarDB uses a flush list, in which all dirty pages in the buffer pool are sorted in ascending order based on their oldest LSNs. The flush list reduces the time complexity of obtaining the consistent LSN to O(1).

image.png

When a buffer is updated for the first time, the buffer is labeled as dirty. PolarDB inserts the buffer into the flush list and generates an oldest LSN for the buffer. When the buffer is flushed to the shared storage, the label is removed.

To advance the consistent LSN efficiently, PolarDB runs a BGWRITER process that traverses the buffers in the flush list in chronological order and flushes the earliest buffers to the shared storage one by one. Each time the buffer at the head of the flush list is flushed, the next buffer becomes the head and its oldest LSN becomes the consistent LSN. In the example shown in the preceding figure, once the buffer with an oldest LSN of 10 is flushed to the shared storage, the buffer with an oldest LSN of 30 moves to the head of the flush list, and LSN 30 becomes the consistent LSN.
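The flush list behaviour can be sketched with an ordered dictionary. This is an illustrative model only: the `FlushList` class and its method names are assumptions, and returning `None` for an empty list is a simplification of whatever the real system reports in that case.

```python
from collections import OrderedDict


class FlushList:
    """Dirty buffers ordered by oldest LSN; the head gives the consistent LSN."""

    def __init__(self):
        self._dirty = OrderedDict()  # buffer id -> oldest LSN, ascending

    def mark_dirty(self, buf_id, lsn):
        # Only the first change assigns the oldest LSN. LSNs only grow,
        # so appending at the tail keeps the list sorted.
        if buf_id not in self._dirty:
            self._dirty[buf_id] = lsn

    def flushed(self, buf_id):
        # A flushed buffer is no longer dirty and leaves the list.
        self._dirty.pop(buf_id, None)

    def consistent_lsn(self):
        # O(1): read the head instead of scanning the buffer pool.
        if not self._dirty:
            return None
        return next(iter(self._dirty.values()))


fl = FlushList()
fl.mark_dirty("a", 10)
fl.mark_dirty("b", 30)
fl.mark_dirty("a", 60)      # later change: oldest LSN of "a" stays 10
first = fl.consistent_lsn()
fl.flushed("a")
second = fl.consistent_lsn()
```

With the values from the figure, the consistent LSN is 10 until that buffer is flushed, and 30 afterwards.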

Parallel Flushing

To advance the consistent LSN even faster, PolarDB runs multiple BGWRITER processes that flush buffers in parallel. Each BGWRITER process reads a batch of buffers from the flush list and flushes the whole batch to the shared storage in one pass.

image.png
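A rough sketch of batched, parallel flushing follows. It uses threads purely for brevity, whereas PolarDB uses separate BGWRITER processes; the function names, the batch size, and the `write_page` callback are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def take_batches(flush_list, n_workers, batch_size):
    """Split the head of the flush list into at most one batch per worker."""
    head = flush_list[: n_workers * batch_size]
    return [head[i:i + batch_size] for i in range(0, len(head), batch_size)]


def flush_in_parallel(flush_list, n_workers, batch_size, write_page):
    """Flush the oldest buffers in parallel and return the remaining list."""
    batches = take_batches(flush_list, n_workers, batch_size)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker flushes every buffer in its own batch.
        list(pool.map(lambda batch: [write_page(b) for b in batch], batches))
    flushed = {b for batch in batches for b in batch}
    return [b for b in flush_list if b not in flushed]


written = []
remaining = flush_in_parallel([10, 30, 50, 70, 90], n_workers=2, batch_size=2,
                              write_page=written.append)
```

Because the batches are taken from the head of the list, the buffers with the smallest oldest LSNs are flushed first, so the consistent LSN can advance as soon as the round completes.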

Hot Buffers

After the flushing control mechanism is introduced, PolarDB flushes only the buffers that meet specific flush conditions to the shared storage. If a buffer is updated frequently, its latest LSN may remain larger than the oldest apply LSN, so the buffer never meets the flush conditions. Such a buffer is called a hot buffer. If a page has hot buffers, the consistent LSN of the page cannot advance. To resolve this issue, PolarDB provides a copy buffering mechanism.

The copy buffering mechanism allows PolarDB to copy buffers that do not meet the flush conditions to a copy buffer pool. Buffers in the copy buffer pool, and their latest LSNs, are no longer updated. As the oldest apply LSN advances, these copies eventually meet the flush conditions, and PolarDB can then flush them from the copy buffer pool to the shared storage.

The following flush rules apply:

  1. If a buffer does not meet the flush conditions, PolarDB checks the number of recent changes to the buffer and the time difference between the most recent change and the latest LSN. If the number and the time difference exceed their predefined thresholds, PolarDB copies the buffer to the copy buffer pool.
  2. When a buffer is updated again, PolarDB checks whether the buffer meets the flush conditions. If the buffer meets the flush conditions, PolarDB flushes the buffer to the shared storage and deletes the copy of the buffer from the copy buffer pool.
  3. If a buffer does not meet the flush conditions, PolarDB checks whether a copy of the buffer can be found in the copy buffer pool. If a copy of the buffer can be found in the copy buffer pool and the copy meets the flush conditions, PolarDB flushes the copy to the shared storage.
  4. After a buffer that is copied to the copy buffer pool is updated, PolarDB regenerates an oldest LSN for the buffer and moves the buffer to the tail of the flush list.

In the example shown in the following figure, the buffer with an oldest LSN of 30 and a latest LSN of 500 is considered a hot buffer. The buffer is updated after it is copied to the copy buffer pool. If the change is marked by LSN 600, PolarDB changes the oldest LSN of the buffer to 600 and moves the buffer to the tail of the flush list. At this time, the copy of the buffer is no longer updated, and the latest LSN of the copy remains 500. When the copy meets the flush conditions, PolarDB flushes the copy to the shared storage.

image.png

After the copy buffering mechanism is introduced, PolarDB uses a different method to calculate the consistent LSN of each page. For a specific page, the oldest LSN in the flush list is no longer the smallest oldest LSN because the oldest LSN in the copy buffer pool can be smaller. Therefore, PolarDB needs to compare the oldest LSN in the flush list with the oldest LSN in the copy buffer pool. The smaller oldest LSN is considered the consistent LSN.
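The adjusted calculation can be sketched as below. The function is hypothetical, and the value 50 for a second dirty buffer is invented for the example; 30 and 600 come from the hot-buffer example above.

```python
def consistent_lsn(flush_list, copy_pool):
    """Smaller of the flush list head and the oldest LSN in the copy buffer pool.

    flush_list: oldest LSNs of dirty buffers, in ascending order.
    copy_pool:  oldest LSNs of the copies held in the copy buffer pool.
    """
    candidates = []
    if flush_list:
        candidates.append(flush_list[0])   # head of the flush list
    if copy_pool:
        candidates.append(min(copy_pool))  # oldest copy still waiting to be flushed
    return min(candidates) if candidates else None


# The hot buffer (oldest LSN 30) was copied, then updated at LSN 600 and
# moved to the tail of the flush list; its copy keeps oldest LSN 30.
with_copy = consistent_lsn(flush_list=[50, 600], copy_pool=[30])
after_copy_flushed = consistent_lsn(flush_list=[50, 600], copy_pool=[])
```

While the copy is still pending, the consistent LSN stays at 30 even though the flush list head is already 50; it can only move on once the copy has been flushed.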

Lazy Checkpointing

PolarDB supports consistent LSNs, which are similar to checkpoints. All changes that are made to a page before the checkpoint LSN of the page are flushed to the shared storage. If a recovery operation is run, PolarDB starts to recover the page from the checkpoint LSN, which improves recovery efficiency. With regular checkpoints, PolarDB flushes all dirty pages in the buffer pool and other in-memory pages to the shared storage. This process may take a long time and require high I/O throughput, so normal queries may be affected.

Consistent LSNs enable PolarDB to implement lazy checkpointing. With lazy checkpointing, PolarDB does not flush all dirty pages in the buffer pool to the shared storage. Instead, PolarDB uses consistent LSNs as checkpoint LSNs, which makes checkpointing significantly more efficient.

Lazy checkpointing relies on the BGWRITER processes, which continuously flush dirty pages and maintain consistent LSNs. The lazy checkpointing mechanism cannot be used with the full page write feature. If you enable the full page write feature, the lazy checkpointing mechanism is automatically disabled.

',44),u=[c];function g(m,y){return r(),n("div",null,u)}const S=s(p,[["render",g],["__file","buffer-management.html.vue"]]);export{S as default}; diff --git a/assets/bulk-read-and-extend.html-586617cf.js b/assets/bulk-read-and-extend.html-586617cf.js new file mode 100644 index 00000000000..4ef928cd5c3 --- /dev/null +++ b/assets/bulk-read-and-extend.html-586617cf.js @@ -0,0 +1,13 @@ +import{_ as d,r as t,o as i,c as u,d as n,a,w as l,b as e,e as h}from"./app-3d1677bf.js";const _="/PolarDB-for-PostgreSQL/assets/bulk_read-7ba232a8.png",k="/PolarDB-for-PostgreSQL/assets/bulk_vacuum_data-f68a39eb.png",f="/PolarDB-for-PostgreSQL/assets/bulk_seq_scan-98bbf92e.png",g="/PolarDB-for-PostgreSQL/assets/bulk_insert_data-4b171395.png",S="/PolarDB-for-PostgreSQL/assets/bulk_create_index_data-9d2da036.png",B={},b=a("h1",{id:"预读-预扩展",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#预读-预扩展","aria-hidden":"true"},"#"),e(" 预读 / 预扩展")],-1),P={class:"table-of-contents"},x=a("h2",{id:"背景介绍",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#背景介绍","aria-hidden":"true"},"#"),e(" 背景介绍")],-1),m={href:"https://en.wikipedia.org/wiki/Ext4",target:"_blank",rel:"noopener noreferrer"},E=h('

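Feature Overview

Heap Table Bulk Read

When PostgreSQL reads a heap table, it reads pages from the file system into the buffer pool in units of 8kB pages. PFS is not particularly efficient for I/O operations this small. PolarDB therefore provides heap table bulk read to suit PFS. When more than one page is to be read, bulk read is triggered and a single I/O reads 128kB of data into the buffer pool. Bulk read roughly doubles performance for sequential scans and Vacuum, and improves index creation performance by 18%.

Heap Table Bulk Extend

In PostgreSQL, a tablespace is extended by allocating and extending 8kB pages one at a time. Even with the batched page extension that PostgreSQL supports, extending by N pages still involves N I/O operations. This does not suit PFS, whose minimum page extension granularity is 4MB. PolarDB therefore provides heap table bulk extend, which extends a heap table by 4MB of pages in a single I/O. In write-heavy scenarios such as data loading, this doubles performance.

Index Creation Bulk Extend

Index creation bulk extend works like heap table bulk extend, but it optimizes index creation specifically for PFS. When pages are extended during index creation, a single I/O extends 4MB of pages. This improves index creation performance by 30%.

Note

Index creation bulk extend currently supports only B-Tree indexes. Other index types are not yet supported.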

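Design

Heap Table Bulk Read

Heap table bulk read is implemented in four steps:

  1. Allocate N buffers from the buffer pool.
  2. Use palloc to allocate a memory region of N * page size, referred to as p.
  3. Read N * page size of heap table data through PFS in one batch and copy it into p.
  4. Copy the N pages in p one by one into the N buffers allocated from the buffer pool.

Subsequent reads then hit the buffers directly. The data flow is shown below: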

heap-read
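The four bulk-read steps can be sketched as follows. This is a simplified model, not PolarDB's code: an in-memory file stands in for PFS, a dictionary stands in for the buffer pool, and `bulk_read` is a hypothetical name. The only figures taken from the text are the 8kB page size and the 128kB read size, which works out to 16 pages.

```python
import io

PAGE_SIZE = 8192       # 8kB page
BULK_READ_PAGES = 16   # 128kB / 8kB


def bulk_read(f, first_block, n_pages, buffer_pool):
    """Read up to n_pages pages with one read call, then fill the buffers."""
    f.seek(first_block * PAGE_SIZE)
    scratch = f.read(n_pages * PAGE_SIZE)   # one I/O into the scratch area "p"
    n_read = len(scratch) // PAGE_SIZE
    for i in range(n_read):
        # Copy each page from the scratch area into its own buffer.
        buffer_pool[first_block + i] = scratch[i * PAGE_SIZE:(i + 1) * PAGE_SIZE]
    return n_read


# A 20-page "heap file" in which page i is filled with the byte value i.
heap = io.BytesIO(b"".join(bytes([i]) * PAGE_SIZE for i in range(20)))
pool = {}
n = bulk_read(heap, first_block=2, n_pages=BULK_READ_PAGES, buffer_pool=pool)
```

After the call, blocks 2 through 17 are in the pool, so later reads of those blocks need no further I/O.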

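Heap Table Bulk Extend

Bulk extend is implemented in three steps:

  1. Allocate N buffers from the buffer pool without triggering page extension in the file system.
  2. Extend the pages in one batch through the PFS file write interface, writing them as all-zero pages.
  3. Initialize the allocated pages one by one, marking each page's free space, which completes the bulk extend.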
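Steps 2 and 3 can be sketched as below, under loud assumptions: an in-memory file stands in for PFS, buffer allocation (step 1) is omitted, and page initialization is reduced to recording a free-space value that ignores the page header. `bulk_extend` and `free_space` are hypothetical names.

```python
import io

PAGE_SIZE = 8192                               # 8kB page
EXTEND_PAGES = (4 * 1024 * 1024) // PAGE_SIZE  # 4MB / 8kB = 512 pages


def bulk_extend(f, n_pages, free_space):
    """Extend the file by n_pages zero pages with a single write call."""
    f.seek(0, io.SEEK_END)
    first_new = f.tell() // PAGE_SIZE
    f.write(bytes(n_pages * PAGE_SIZE))        # one write of all-zero pages
    new_blocks = list(range(first_new, first_new + n_pages))
    for blk in new_blocks:
        # Toy "page initialization": record the page's available space.
        free_space[blk] = PAGE_SIZE
    return new_blocks


heap = io.BytesIO(bytes(3 * PAGE_SIZE))        # file that already holds 3 pages
free_space = {}
blocks = bulk_extend(heap, EXTEND_PAGES, free_space)
```

One write call replaces what would otherwise be 512 separate 8kB extensions.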

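Index Creation Bulk Extend

Index creation bulk extend follows steps similar to heap table bulk extend, but it does not allocate buffers. The steps are:

  1. When index pages are written, extend the pages in one batch through the PFS file write interface, writing them as all-zero pages.
  2. Write the index pages that have already been built in the buffer pool to the file system.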

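Usage Guide

Heap Table Bulk Read

Heap table bulk read is controlled by the polar_bulk_read_size parameter. The feature is enabled by default, with a default size of 128kB. Changing the parameter is not recommended: 128kB is the best fit for PFS, and tuning it yourself does not improve performance.

To disable the feature: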

ALTER SYSTEM SET polar_bulk_read_size = 0;
+SELECT pg_reload_conf();
+

To enable the feature and set the bulk read size to 128kB:

ALTER SYSTEM SET polar_bulk_read_size = '128kB';
+SELECT pg_reload_conf();
+

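Heap Table Bulk Extend

Heap table bulk extend is controlled by the polar_bulk_extend_size parameter. The feature is enabled by default, with a default size of 4MB. Changing the parameter is not recommended: 4MB is the best fit for PFS.

To disable the feature: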

ALTER SYSTEM SET polar_bulk_extend_size = 0;
+SELECT pg_reload_conf();
+

To enable the feature and set the bulk extend size to 4MB:

ALTER SYSTEM SET polar_bulk_extend_size = '4MB';
+SELECT pg_reload_conf();
+

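Index Creation Bulk Extend

Index creation bulk extend is controlled by the polar_index_create_bulk_extend_size parameter. The feature is enabled by default, with a default size of 4MB. Changing the parameter is not recommended: 4MB is the best fit for PFS.

To disable the feature: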

ALTER SYSTEM SET polar_index_create_bulk_extend_size = 0;
+SELECT pg_reload_conf();
+

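To enable the feature and set the bulk extend size to 4MB (the value 512 below corresponds to 4MB if the parameter is counted in 8kB pages):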

ALTER SYSTEM SET polar_index_create_bulk_extend_size = 512;
+SELECT pg_reload_conf();
+

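Performance

To show the performance gains from heap table bulk read, heap table bulk extend, and index creation bulk extend, we ran tests on a PolarDB for PostgreSQL 14 instance.

  • Instance size: 8 cores, 32GB memory
  • Test scenario: 400GB pgbench test

Heap Table Bulk Read

Vacuum performance on a 400GB table:

400gb-vacuum-perf

SeqScan performance on a 400GB table:

400gb-vacuum-seqscan

Conclusions:

  • Heap table bulk read improves Vacuum and SeqScan performance by 100% to 200%
  • Raising the bulk read size beyond the default of 128kB brings no noticeable further improvement

Heap Table Bulk Extend

Data loading performance on a 400GB table:

400gb-insert-data-perf

Conclusions:

  • Heap table bulk extend doubles performance in the data loading scenario
  • Raising the bulk extend size beyond the default of 4MB brings no noticeable further improvement

Index Creation Bulk Extend

Index creation performance on a 400GB table:

Index creation performance on a 400GB table

Conclusions:

  • Index creation bulk extend improves index creation performance by 30%
  • Raising the index creation bulk extend size beyond the default of 4MB brings no noticeable further improvement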
',59);function T(o,L){const r=t("Badge"),p=t("ArticleInfo"),s=t("router-link"),c=t("ExternalLinkIcon");return i(),u("div",null,[b,n(r,{type:"tip",text:"V11 / v1.1.1-",vertical:"top"}),n(p,{frontmatter:o.$frontmatter},null,8,["frontmatter"]),a("nav",P,[a("ul",null,[a("li",null,[n(s,{to:"#背景介绍"},{default:l(()=>[e("背景介绍")]),_:1})]),a("li",null,[n(s,{to:"#功能介绍"},{default:l(()=>[e("功能介绍")]),_:1}),a("ul",null,[a("li",null,[n(s,{to:"#堆表预读"},{default:l(()=>[e("堆表预读")]),_:1})]),a("li",null,[n(s,{to:"#堆表预扩展"},{default:l(()=>[e("堆表预扩展")]),_:1})]),a("li",null,[n(s,{to:"#索引创建预扩展"},{default:l(()=>[e("索引创建预扩展")]),_:1})])])]),a("li",null,[n(s,{to:"#功能设计"},{default:l(()=>[e("功能设计")]),_:1}),a("ul",null,[a("li",null,[n(s,{to:"#堆表预读-1"},{default:l(()=>[e("堆表预读")]),_:1})]),a("li",null,[n(s,{to:"#堆表预扩展-1"},{default:l(()=>[e("堆表预扩展")]),_:1})]),a("li",null,[n(s,{to:"#索引创建预扩展-1"},{default:l(()=>[e("索引创建预扩展")]),_:1})])])]),a("li",null,[n(s,{to:"#使用指南"},{default:l(()=>[e("使用指南")]),_:1}),a("ul",null,[a("li",null,[n(s,{to:"#堆表预读-2"},{default:l(()=>[e("堆表预读")]),_:1})]),a("li",null,[n(s,{to:"#堆表预扩展-2"},{default:l(()=>[e("堆表预扩展")]),_:1})]),a("li",null,[n(s,{to:"#索引创建预扩展-2"},{default:l(()=>[e("索引创建预扩展")]),_:1})])])]),a("li",null,[n(s,{to:"#性能表现"},{default:l(()=>[e("性能表现")]),_:1}),a("ul",null,[a("li",null,[n(s,{to:"#堆表预读-3"},{default:l(()=>[e("堆表预读")]),_:1})]),a("li",null,[n(s,{to:"#堆表预扩展-3"},{default:l(()=>[e("堆表预扩展")]),_:1})]),a("li",null,[n(s,{to:"#索引创建预扩展-3"},{default:l(()=>[e("索引创建预扩展")]),_:1})])])])])]),x,a("p",null,[e("PolarDB for PostgreSQL(以下简称 PolarDB)底层使用 PolarFS(以下简称为 PFS)作为文件系统。不同于 "),a("a",m,[e("ext4"),n(c)]),e(" 等单机文件系统,PFS 在页扩展过程中,元数据更新开销较大;且 PFS 的最小页扩展粒度为 4MB。而 PostgreSQL 8kB 的页扩展粒度并不适合 PFS,将会导致写表或创建索引时性能下降;同时,PFS 在读取大块页面时 I/O 效率更高。为了适配上述特征,我们为 PolarDB 设计了堆表预读、堆表预扩展、索引创建预扩展的功能,使运行在 PFS 上的 PolarDB 能够获得更好的性能。")]),E])}const v=d(B,[["render",T],["__file","bulk-read-and-extend.html.vue"]]);export{v as default}; diff --git a/assets/bulk-read-and-extend.html-ecac2d9c.js 
b/assets/bulk-read-and-extend.html-ecac2d9c.js new file mode 100644 index 00000000000..0539e10faae --- /dev/null +++ b/assets/bulk-read-and-extend.html-ecac2d9c.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-0bb2232b","path":"/zh/features/v11/performance/bulk-read-and-extend.html","title":"预读 / 预扩展","lang":"zh-CN","frontmatter":{"author":"何柯文","date":"2022/09/21","minute":30},"headers":[{"level":2,"title":"背景介绍","slug":"背景介绍","link":"#背景介绍","children":[]},{"level":2,"title":"功能介绍","slug":"功能介绍","link":"#功能介绍","children":[{"level":3,"title":"堆表预读","slug":"堆表预读","link":"#堆表预读","children":[]},{"level":3,"title":"堆表预扩展","slug":"堆表预扩展","link":"#堆表预扩展","children":[]},{"level":3,"title":"索引创建预扩展","slug":"索引创建预扩展","link":"#索引创建预扩展","children":[]}]},{"level":2,"title":"功能设计","slug":"功能设计","link":"#功能设计","children":[{"level":3,"title":"堆表预读","slug":"堆表预读-1","link":"#堆表预读-1","children":[]},{"level":3,"title":"堆表预扩展","slug":"堆表预扩展-1","link":"#堆表预扩展-1","children":[]},{"level":3,"title":"索引创建预扩展","slug":"索引创建预扩展-1","link":"#索引创建预扩展-1","children":[]}]},{"level":2,"title":"使用指南","slug":"使用指南","link":"#使用指南","children":[{"level":3,"title":"堆表预读","slug":"堆表预读-2","link":"#堆表预读-2","children":[]},{"level":3,"title":"堆表预扩展","slug":"堆表预扩展-2","link":"#堆表预扩展-2","children":[]},{"level":3,"title":"索引创建预扩展","slug":"索引创建预扩展-2","link":"#索引创建预扩展-2","children":[]}]},{"level":2,"title":"性能表现","slug":"性能表现","link":"#性能表现","children":[{"level":3,"title":"堆表预读","slug":"堆表预读-3","link":"#堆表预读-3","children":[]},{"level":3,"title":"堆表预扩展","slug":"堆表预扩展-3","link":"#堆表预扩展-3","children":[]},{"level":3,"title":"索引创建预扩展","slug":"索引创建预扩展-3","link":"#索引创建预扩展-3","children":[]}]}],"git":{"updatedTime":1672148725000},"filePathRelative":"zh/features/v11/performance/bulk-read-and-extend.md"}');export{l as data}; diff --git a/assets/bulk_create_index_data-9d2da036.png b/assets/bulk_create_index_data-9d2da036.png new file mode 100644 index 00000000000..594b0f95beb Binary files /dev/null and 
b/assets/bulk_create_index_data-9d2da036.png differ diff --git a/assets/bulk_insert_data-4b171395.png b/assets/bulk_insert_data-4b171395.png new file mode 100644 index 00000000000..e1bbba04ddb Binary files /dev/null and b/assets/bulk_insert_data-4b171395.png differ diff --git a/assets/bulk_read-7ba232a8.png b/assets/bulk_read-7ba232a8.png new file mode 100644 index 00000000000..08ebba0b8e6 Binary files /dev/null and b/assets/bulk_read-7ba232a8.png differ diff --git a/assets/bulk_seq_scan-98bbf92e.png b/assets/bulk_seq_scan-98bbf92e.png new file mode 100644 index 00000000000..8932f5b4e0f Binary files /dev/null and b/assets/bulk_seq_scan-98bbf92e.png differ diff --git a/assets/bulk_vacuum_data-f68a39eb.png b/assets/bulk_vacuum_data-f68a39eb.png new file mode 100644 index 00000000000..9c61075ab5e Binary files /dev/null and b/assets/bulk_vacuum_data-f68a39eb.png differ diff --git a/assets/cluster-info.html-8592b599.js b/assets/cluster-info.html-8592b599.js new file mode 100644 index 00000000000..982aac337f8 --- /dev/null +++ b/assets/cluster-info.html-8592b599.js @@ -0,0 +1 @@ +const 
l=JSON.parse('{"key":"v-798d4bcc","path":"/zh/features/v11/epq/cluster-info.html","title":"集群拓扑视图","lang":"zh-CN","frontmatter":{"author":"烛远","date":"2022/09/20","minute":20},"headers":[{"level":2,"title":"功能介绍","slug":"功能介绍","link":"#功能介绍","children":[]},{"level":2,"title":"术语","slug":"术语","link":"#术语","children":[]},{"level":2,"title":"功能使用","slug":"功能使用","link":"#功能使用","children":[]},{"level":2,"title":"设计实现","slug":"设计实现","link":"#设计实现","children":[{"level":3,"title":"信息采集","slug":"信息采集","link":"#信息采集","children":[]},{"level":3,"title":"更新频率","slug":"更新频率","link":"#更新频率","children":[]},{"level":3,"title":"采集维度","slug":"采集维度","link":"#采集维度","children":[]},{"level":3,"title":"消息格式","slug":"消息格式","link":"#消息格式","children":[]},{"level":3,"title":"内部使用","slug":"内部使用","link":"#内部使用","children":[]},{"level":3,"title":"结果展示","slug":"结果展示","link":"#结果展示","children":[]}]}],"git":{"updatedTime":1697908247000},"filePathRelative":"zh/features/v11/epq/cluster-info.md"}');export{l as data}; diff --git a/assets/cluster-info.html-987f49a8.js b/assets/cluster-info.html-987f49a8.js new file mode 100644 index 00000000000..56f7d0e364d --- /dev/null +++ b/assets/cluster-info.html-987f49a8.js @@ -0,0 +1,15 @@ +import{_ as c,r as p,o as i,c as k,d as s,a,w as o,b as n,e as d}from"./app-3d1677bf.js";const u="/PolarDB-for-PostgreSQL/assets/cluster_info_generate-0c9329ce.png",m={},b=a("h1",{id:"集群拓扑视图",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#集群拓扑视图","aria-hidden":"true"},"#"),n(" 集群拓扑视图")],-1),_={class:"table-of-contents"},h=d(`

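Feature Overview

The ePQ elastic cross-node parallel query feature of PolarDB for PostgreSQL can spread a large query across multiple nodes to speed it up. This involves communication between nodes, including distributing execution plans, controlling execution, and collecting results. The cluster topology view was designed for this purpose: it collects and displays the cluster's topology information for the ePQ component so that queries can run across nodes.

Terminology

  • RW / Primary: read-write node, referred to below as Primary
  • RO / Replica: read-only node, referred to below as Replica
  • Standby: disaster recovery node
  • Replication Slot: streaming replication slot, the PostgreSQL mechanism for persisting a streaming replication relationship

Usage

Maintenance of the cluster topology view is fully transparent. Users only need to set up a cluster with one writer and multiple readers as described in the deployment documentation, and the view is maintained correctly. The key requirement is that the Replica / Standby nodes are set up with replication slots.

The cluster topology view can be queried as follows (the output is from PolarDB for PostgreSQL 11):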

postgres=# SELECT * FROM polar_cluster_info;
+ name  |   host    | port | release_date | version | slot_name |  type   | state | cpu | cpu_quota | memory | memory_quota | iops | iops_quota | connection | connection_quota | px_connection | px_connection_quota | px_node
+-------+-----------+------+--------------+---------+-----------+---------+-------+-----+-----------+--------+--------------+------+------------+------------+------------------+---------------+---------------------+---------
+ node0 | 127.0.0.1 | 5432 | 20220930     | 1.1.27  |           | RW      | Ready |   0 |         0 |      0 |            0 |    0 |          0 |          0 |                0 |             0 |                   0 | f
+ node1 | 127.0.0.1 | 5433 | 20220930     | 1.1.27  | replica1  | RO      | Ready |   0 |         0 |      0 |            0 |    0 |          0 |          0 |                0 |             0 |                   0 | t
+ node2 | 127.0.0.1 | 5434 | 20220930     | 1.1.27  | replica2  | RO      | Ready |   0 |         0 |      0 |            0 |    0 |          0 |          0 |                0 |             0 |                   0 | t
+ node3 | 127.0.0.1 | 5431 | 20220930     | 1.1.27  | standby1  | Standby | Ready |   0 |         0 |      0 |            0 |    0 |          0 |          0 |                0 |             0 |                   0 | f
+(4 rows)
+
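  • name is the node's name, which is generated automatically.
  • host / port give the node's connection information. Here, all of them are local addresses.
  • release_date / version identify the PolarDB version.
  • slot_name is the replication slot that the node connects through. Apart from the Primary node, only nodes that connect through a replication slot are included in the view.
  • type is the node type, one of three:
    • PolarDB for PostgreSQL 11: RW / RO / Standby
    • PolarDB for PostgreSQL 14: Primary / Replica / Standby
  • state is the node state: Offline / Going Offline / Disabled / Initialized / Pending / Ready / Unknown. Only nodes in the Ready state can take part in PX computation.
  • px_node indicates whether the node takes part in PX computation.
  • The remaining fields are reserved for performance metrics and are currently unpopulated.

For ePQ queries, only Replica nodes participate by default. Parameters control whether the Primary node or Standby nodes also take part: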

-- Let the Primary node take part in computation
+SET polar_px_use_master = ON;
+
+-- Let the Standby node take part in computation
+SET polar_px_use_standby = ON;
+

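Tip

As of PolarDB for PostgreSQL 14, the polar_px_use_master parameter is renamed polar_px_use_primary.

polar_px_nodes can also be used to specify which nodes take part in PX computation. For example, with the cluster topology view above, the following command makes PX queries run only on node1, the node attached through slot replica1.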

SET polar_px_nodes = 'node1';
+

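Design and Implementation

Information Collection

Cluster topology information is carried over streaming replication. The feature adds new message types to the streaming replication protocol for passing the cluster topology view, in two steps:

  • Replica / Standby nodes send their state to the Primary
  • The Primary aggregates the cluster topology view and returns it to the Replica / Standby nodes

Update Frequency

The cluster topology view is not updated and sent on a timer, because the view does not change constantly. It is updated and sent only when a node has just started or when a key state change occurs.

In the implementation, the global state collected by the Primary node carries a version number, generation, which is incremented only when a topology change is received. Only after the global state version changes is the state sent to the other nodes, which apply it locally on receipt.

Generating the cluster topology view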
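The generation-based update rule can be sketched as follows. The class and method names are hypothetical and the node state is reduced to a single string; the sketch only illustrates that an unchanged report triggers no broadcast and that receivers ignore stale generations.

```python
class PrimaryTopology:
    """Primary-side view: bump generation only when the topology changes."""

    def __init__(self):
        self.generation = 0
        self.nodes = {}

    def report(self, name, state):
        # Returns the message to broadcast, or None if nothing changed.
        if self.nodes.get(name) == state:
            return None
        self.nodes[name] = state
        self.generation += 1
        return self.generation, dict(self.nodes)


class NodeView:
    """Replica / Standby side: accept only a newer generation."""

    def __init__(self):
        self.generation = 0
        self.nodes = {}

    def receive(self, generation, nodes):
        if generation <= self.generation:
            return False
        self.generation, self.nodes = generation, dict(nodes)
        return True


primary, replica = PrimaryTopology(), NodeView()
msg1 = primary.report("node1", "Ready")       # new node: generation becomes 1
accepted = replica.receive(*msg1)
unchanged = primary.report("node1", "Ready")  # same state: nothing to send
msg2 = primary.report("node1", "Offline")     # state change: generation 2
```

A cache keyed on the generation, such as the ePQ connection cache described later, can be invalidated with the same comparison.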

Collected Dimensions

Status metrics:

  • Node name
  • Node host / port
  • Node slot_name
  • Node load (CPU / MEM / connections / IOPS)
  • Node state
    • Offline
    • Going Offline
    • Disabled
    • Initialized
    • Pending
    • Ready
    • Unknown

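Message Format

Following the approach of the other WAL Sender / WAL Receiver messages, the 'm' and 'M' message types are added for collecting node information and broadcasting the cluster topology view.

Internal Use

An interface returns the list of Replica nodes, with IP / port and related information, for use by PX queries.

A number of load interfaces are reserved so that the degree of parallelism can be adjusted dynamically according to load. (Not yet wired in.)

The parameters polar_px_use_master / polar_px_use_standby are also added to include the Primary / Standby in PX computation. They are off by default, because correctness problems are possible: snapshots may be unusable owing to the snapshot format, Vacuum, and similar causes.

ePQ uses the information above to generate and cache the nodes' connection information, and uses it in ePQ queries. When the generation changes, or when polar_px_nodes / polar_px_use_master / polar_px_use_standby is set, the cache is reset and rebuilt on next use.

Displaying Results

The polar_monitor extension provides a view that exposes the cluster topology view described above. It can be queried on any node.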

',34);function f(t,y){const r=p("Badge"),l=p("ArticleInfo"),e=p("router-link");return i(),k("div",null,[b,s(r,{type:"tip",text:"V11 / v1.1.20-",vertical:"top"}),s(l,{frontmatter:t.$frontmatter},null,8,["frontmatter"]),a("nav",_,[a("ul",null,[a("li",null,[s(e,{to:"#功能介绍"},{default:o(()=>[n("功能介绍")]),_:1})]),a("li",null,[s(e,{to:"#术语"},{default:o(()=>[n("术语")]),_:1})]),a("li",null,[s(e,{to:"#功能使用"},{default:o(()=>[n("功能使用")]),_:1})]),a("li",null,[s(e,{to:"#设计实现"},{default:o(()=>[n("设计实现")]),_:1}),a("ul",null,[a("li",null,[s(e,{to:"#信息采集"},{default:o(()=>[n("信息采集")]),_:1})]),a("li",null,[s(e,{to:"#更新频率"},{default:o(()=>[n("更新频率")]),_:1})]),a("li",null,[s(e,{to:"#采集维度"},{default:o(()=>[n("采集维度")]),_:1})]),a("li",null,[s(e,{to:"#消息格式"},{default:o(()=>[n("消息格式")]),_:1})]),a("li",null,[s(e,{to:"#内部使用"},{default:o(()=>[n("内部使用")]),_:1})]),a("li",null,[s(e,{to:"#结果展示"},{default:o(()=>[n("结果展示")]),_:1})])])])])]),h])}const g=c(m,[["render",f],["__file","cluster-info.html.vue"]]);export{g as default}; diff --git a/assets/cluster_info_generate-0c9329ce.png b/assets/cluster_info_generate-0c9329ce.png new file mode 100644 index 00000000000..2442fbd75af Binary files /dev/null and b/assets/cluster_info_generate-0c9329ce.png differ diff --git a/assets/coding-style.html-181aff3b.js b/assets/coding-style.html-181aff3b.js new file mode 100644 index 00000000000..4d5a448e3be --- /dev/null +++ b/assets/coding-style.html-181aff3b.js @@ -0,0 +1 @@ +import{_ as l,r as s,o as a,c as r,a as e,b as o,d as n,e as i}from"./app-3d1677bf.js";const d={},c=i('

Coding Style

Warning

Translation needed

Languages

  • The PostgreSQL kernel, extensions, and kernel-related tools use C, to remain compatible with community versions and to make upgrades easy.
  • Management-related tools can use shell, Go, or Python, for efficient development.

Style

',5),u={href:"https://www.postgresql.org/docs/12/source.html",target:"_blank",rel:"noopener noreferrer"},h=e("ul",null,[e("li",null,"Code in PostgreSQL should only rely on language features available in the C99 standard"),e("li",null,"Do not use // for comments"),e("li",null,"Both, macros with arguments and static inline functions, may be used. The latter is preferred only if the former simplifies coding."),e("li",null,"Follow BSD C programming conventions")],-1),m=e("li",null,[e("p",null,"Programs in Shell, Go, or Python can follow Google code conventions"),e("ul",null,[e("li",null,"https://google.github.io/styleguide/pyguide.html"),e("li",null,"https://github.com/golang/go/wiki/CodeReviewComments"),e("li",null,"https://google.github.io/styleguide/shellguide.html")])],-1),g=e("h2",{id:"code-design-and-review",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#code-design-and-review","aria-hidden":"true"},"#"),o(" Code design and review")],-1),p={href:"https://github.com/google/eng-practices/blob/master/review/index.md",target:"_blank",rel:"noopener noreferrer"},f=i("

Before submitting code for review, please run the unit tests and make sure that all tests under src/test pass, such as regress and isolation. Unit tests or function tests should be submitted together with the code modification.

In addition to code review, this doc offers instructions for the whole cycle of high-quality development, from design, implementation, testing, and documentation to preparing for code review. It asks many good questions about the critical steps of development, such as design, function, complexity, testing, naming, documentation, and code review. The doc summarizes the rules for code review as follows.

In doing a code review, you should make sure that:

  • The code is well-designed.
  • The functionality is good for the users of the code.
  • Any UI changes are sensible and look good.
  • Any parallel programming is done safely.
  • The code isn't more complex than it needs to be.
  • The developer isn't implementing things they might need in the future but don't know they need now.
  • Code has appropriate unit tests.
  • Tests are well-designed.
  • The developer used clear names for everything.
  • Comments are clear and useful, and mostly explain why instead of what.
  • Code is appropriately documented.
  • The code conforms to our style guides.
",4);function y(b,w){const t=s("ExternalLinkIcon");return a(),r("div",null,[c,e("ul",null,[e("li",null,[e("p",null,[o("Coding in C follows PostgreSQL's programing style, such as naming, error message format, control statements, length of lines, comment format, length of functions, and global variable. For detail, please reference "),e("a",u,[o("Postgresql style"),n(t)]),o(". Here is some highlines:")]),h]),m]),g,e("p",null,[o("We share the same thoughts and rules as "),e("a",p,[o("Google Open Source Code Review"),n(t)])]),f])}const v=l(d,[["render",y],["__file","coding-style.html.vue"]]);export{v as default}; diff --git a/assets/coding-style.html-9c14d7a6.js b/assets/coding-style.html-9c14d7a6.js new file mode 100644 index 00000000000..b4afe0271a1 --- /dev/null +++ b/assets/coding-style.html-9c14d7a6.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-43a2065f","path":"/contributing/coding-style.html","title":"Coding Style","lang":"en-US","frontmatter":{},"headers":[{"level":2,"title":"Languages","slug":"languages","link":"#languages","children":[]},{"level":2,"title":"Style","slug":"style","link":"#style","children":[]},{"level":2,"title":"Code design and review","slug":"code-design-and-review","link":"#code-design-and-review","children":[]}],"git":{"updatedTime":1641825312000},"filePathRelative":"contributing/coding-style.md"}');export{e as data}; diff --git a/assets/coding-style.html-b182657f.js b/assets/coding-style.html-b182657f.js new file mode 100644 index 00000000000..4e7faab74cd --- /dev/null +++ b/assets/coding-style.html-b182657f.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-3ec72c4e","path":"/zh/contributing/coding-style.html","title":"编码风格","lang":"zh-CN","frontmatter":{},"headers":[{"level":2,"title":"Languages","slug":"languages","link":"#languages","children":[]},{"level":2,"title":"Style","slug":"style","link":"#style","children":[]},{"level":2,"title":"Code design and 
review","slug":"code-design-and-review","link":"#code-design-and-review","children":[]}],"git":{"updatedTime":1641825312000},"filePathRelative":"zh/contributing/coding-style.md"}');export{e as data}; diff --git a/assets/coding-style.html-f771a098.js b/assets/coding-style.html-f771a098.js new file mode 100644 index 00000000000..78ea9967bc8 --- /dev/null +++ b/assets/coding-style.html-f771a098.js @@ -0,0 +1 @@ +import{_ as l,r as s,o as a,c as r,a as e,b as o,d as n,e as i}from"./app-3d1677bf.js";const d={},c=i('

Coding Style

Languages

  • The PostgreSQL kernel, extensions, and kernel-related tools use C, to remain compatible with community versions and to make upgrades easy.
  • Management-related tools can use shell, Go, or Python, for efficient development.

Style

',4),h={href:"https://www.postgresql.org/docs/12/source.html",target:"_blank",rel:"noopener noreferrer"},u=e("ul",null,[e("li",null,"Code in PostgreSQL should only rely on language features available in the C99 standard"),e("li",null,"Do not use // for comments"),e("li",null,"Both, macros with arguments and static inline functions, may be used. The latter is preferred only if the former simplifies coding."),e("li",null,"Follow BSD C programming conventions")],-1),m=e("li",null,[e("p",null,"Programs in Shell, Go, or Python can follow Google code conventions"),e("ul",null,[e("li",null,"https://google.github.io/styleguide/pyguide.html"),e("li",null,"https://github.com/golang/go/wiki/CodeReviewComments"),e("li",null,"https://google.github.io/styleguide/shellguide.html")])],-1),g=e("h2",{id:"code-design-and-review",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#code-design-and-review","aria-hidden":"true"},"#"),o(" Code design and review")],-1),f={href:"https://github.com/google/eng-practices/blob/master/review/index.md",target:"_blank",rel:"noopener noreferrer"},p=i("

Before submitting code for review, please run the unit tests and make sure that all tests under src/test pass, such as regress and isolation. Unit tests or function tests should be submitted together with the code modification.

In addition to code review, this doc offers instructions for the whole cycle of high-quality development, from design, implementation, testing, and documentation to preparing for code review. It asks many good questions about the critical steps of development, such as design, function, complexity, testing, naming, documentation, and code review. The doc summarizes the rules for code review as follows.

In doing a code review, you should make sure that:

  • The code is well-designed.
  • The functionality is good for the users of the code.
  • Any UI changes are sensible and look good.
  • Any parallel programming is done safely.
  • The code isn't more complex than it needs to be.
  • The developer isn't implementing things they might need in the future but don't know they need now.
  • Code has appropriate unit tests.
  • Tests are well-designed.
  • The developer used clear names for everything.
  • Comments are clear and useful, and mostly explain why instead of what.
  • Code is appropriately documented.
  • The code conforms to our style guides.
",4);function y(b,w){const t=s("ExternalLinkIcon");return a(),r("div",null,[c,e("ul",null,[e("li",null,[e("p",null,[o("Coding in C follows PostgreSQL's programing style, such as naming, error message format, control statements, length of lines, comment format, length of functions, and global variable. For detail, please reference "),e("a",h,[o("Postgresql style"),n(t)]),o(". Here is some highlines:")]),u]),m]),g,e("p",null,[o("We share the same thoughts and rules as "),e("a",f,[o("Google Open Source Code Review"),n(t)])]),p])}const v=l(d,[["render",y],["__file","coding-style.html.vue"]]);export{v as default}; diff --git a/assets/contributing-polardb-docs.html-2f5025bd.js b/assets/contributing-polardb-docs.html-2f5025bd.js new file mode 100644 index 00000000000..49849ef7468 --- /dev/null +++ b/assets/contributing-polardb-docs.html-2f5025bd.js @@ -0,0 +1,28 @@ +import{_ as r,r as o,o as t,c as i,a as e,b as n,d as s,e as d}from"./app-3d1677bf.js";const c={},l=e("h1",{id:"documentation-contributing",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#documentation-contributing","aria-hidden":"true"},"#"),n(" Documentation Contributing")],-1),p=e("div",{class:"custom-container danger"},[e("p",{class:"custom-container-title"},"DANGER"),e("p",null,"需要翻译")],-1),h={href:"https://v2.vuepress.vuejs.org/",target:"_blank",rel:"noopener noreferrer"},u=e("h2",{id:"浏览文档",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#浏览文档","aria-hidden":"true"},"#"),n(" 浏览文档")],-1),v={href:"https://ApsaraDB.github.io/PolarDB-for-PostgreSQL/",target:"_blank",rel:"noopener noreferrer"},g=e("h2",{id:"本地文档开发",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#本地文档开发","aria-hidden":"true"},"#"),n(" 本地文档开发")],-1),b={href:"https://yarnpkg.com/",target:"_blank",rel:"noopener noreferrer"},_={href:"https://nodejs.org/en/",target:"_blank",rel:"noopener noreferrer"},m=e("h3",{id:"node-环境准备",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#node-环境准备","aria-hidden":"true"},"#"),n(" Node 
环境准备")],-1),f={href:"https://nodejs.org/en/download/",target:"_blank",rel:"noopener noreferrer"},k=d(`

Install the Node version manager nvm with curl.

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
+command -v nvm
+

If the previous step prints command not found, close the current terminal and open a new one.

Once nvm is installed, run the following command to install the LTS version of Node:

nvm install --lts
+

After Node.js is installed, check that the installation succeeded with the following commands:

node -v
+npm -v
+

Use npm to install the package manager yarn globally:

npm install -g yarn
+yarn -v
+

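Installing Documentation Dependencies

Run the following command in the root directory of the PolarDB for PostgreSQL project. yarn installs all dependencies listed in package.json: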

yarn
+

Running the Documentation Development Server

Run the following command in the root directory of the PolarDB for PostgreSQL project:

yarn docs:dev
+

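The documentation development server runs at http://localhost:8080/PolarDB-for-PostgreSQL/, which you can open in a browser. Changes to Markdown files appear on the page in real time.

Documentation Directory Layout

The documentation for PolarDB for PostgreSQL lives in the docs/ directory under the project root. The directory is organized as follows: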

└── docs
+    ├── .vuepress
+    │   ├── configs
+    │   ├── public
+    │   └── styles
+    ├── README.md
+    ├── architecture
+    ├── contributing
+    ├── guide
+    ├── imgs
+    ├── roadmap
+    └── zh
+        ├── README.md
+        ├── architecture
+        ├── contributing
+        ├── guide
+        ├── imgs
+        └── roadmap
+

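As shown, the docs/zh/ directory mirrors its parent directory except for .vuepress/. Everything in docs/ is English documentation, and docs/zh/ holds the corresponding Simplified Chinese documentation.

The .vuepress/ directory holds the global configuration of the documentation project:

  • config.js: documentation configuration
  • configs/: documentation configuration modules (navbar / sidebar, English / Chinese, and so on)
  • public/: public static assets
  • styles/: overrides of the default theme styles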
`,22),x={href:"https://v2.vuepress.vuejs.org/guide/configuration.html",target:"_blank",rel:"noopener noreferrer"},P=e("h2",{id:"文档开发规范",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#文档开发规范","aria-hidden":"true"},"#"),n(" 文档开发规范")],-1),D=e("li",null,"新的文档写好后,需要在文档配置中配置路由使其在导航栏和侧边栏中显示(可参考其他已有文档)",-1),N=e("li",null,"修正一种语言的文档时,也需要顺带修正其他语言的相同文档",-1),B={href:"https://prettier.io/",target:"_blank",rel:"noopener noreferrer"},j={href:"https://github.com/prettier/prettier-vscode",target:"_blank",rel:"noopener noreferrer"},S={href:"https://github.com/prettier/vim-prettier",target:"_blank",rel:"noopener noreferrer"},L=e("li",null,[n("或直接在源码根目录运行:"),e("code",null,"npx prettier --write docs/")],-1),y=e("h2",{id:"文档在线部署",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#文档在线部署","aria-hidden":"true"},"#"),n(" 文档在线部署")],-1),V={href:"https://github.com/features/actions",target:"_blank",rel:"noopener noreferrer"},E=e("code",null,"docs/",-1),Q={href:"https://github.com/ApsaraDB/PolarDB-for-PostgreSQL/tree/gh-pages",target:"_blank",rel:"noopener noreferrer"},w={href:"https://pages.github.com/",target:"_blank",rel:"noopener noreferrer"};function A(C,M){const a=o("ExternalLinkIcon");return t(),i("div",null,[l,p,e("p",null,[n("PolarDB for PostgreSQL 的文档使用 "),e("a",h,[n("VuePress 2"),s(a)]),n(" 进行管理,以 Markdown 为中心进行写作。")]),u,e("p",null,[n("本文档在线托管于 "),e("a",v,[n("GitHub Pages"),s(a)]),n(" 服务上。")]),g,e("p",null,[n("若您发现文档中存在内容或格式错误,或者您希望能够贡献新文档,那么您需要在本地安装并配置文档开发环境。本项目的文档是一个 Node.js 工程,以 "),e("a",b,[n("Yarn"),s(a)]),n(" 作为软件包管理器。"),e("a",_,[n("Node.js®"),s(a)]),n(" 是一个基于 Chrome V8 引擎的 JavaScript 运行时环境。")]),m,e("p",null,[n("您需要在本地准备 Node.js 环境。可以选择在 Node.js 官网 "),e("a",f,[n("下载"),s(a)]),n(" 页面下载安装包手动安装,也可以使用下面的命令自动安装。")]),k,e("p",null,[n("文档的配置方式请参考 VuePress 2 官方文档的 "),e("a",x,[n("配置指南"),s(a)]),n("。")]),P,e("ol",null,[D,N,e("li",null,[n("修改文档后,使用 "),e("a",B,[n("Prettier"),s(a)]),n(" 工具对 Markdown 文档进行格式化: "),e("ul",null,[e("li",null,[n("Prettier 支持的编辑器集成: 
"),e("ul",null,[e("li",null,[e("a",j,[n("Prettier-VSCode"),s(a)])]),e("li",null,[e("a",S,[n("Vim-Prettier"),s(a)])])])]),L])])]),y,e("p",null,[n("本文档借助 "),e("a",V,[n("GitHub Actions"),s(a)]),n(" 提供 CI 服务。向主分支推送代码时,将触发对 "),E,n(" 目录下文档资源的构建,并将构建结果推送到 "),e("a",Q,[n("gh-pages"),s(a)]),n(" 分支上。"),e("a",w,[n("GitHub Pages"),s(a)]),n(" 服务会自动将该分支上的文档静态资源部署到 Web 服务器上形成文档网站。")])])}const z=r(c,[["render",A],["__file","contributing-polardb-docs.html.vue"]]);export{z as default}; diff --git a/assets/contributing-polardb-docs.html-43544697.js b/assets/contributing-polardb-docs.html-43544697.js new file mode 100644 index 00000000000..aa5c27141ec --- /dev/null +++ b/assets/contributing-polardb-docs.html-43544697.js @@ -0,0 +1,28 @@ +import{_ as r,r as o,o as t,c as d,a as e,b as n,d as s,e as i}from"./app-3d1677bf.js";const c={},l=e("h1",{id:"贡献文档",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#贡献文档","aria-hidden":"true"},"#"),n(" 贡献文档")],-1),p={href:"https://v2.vuepress.vuejs.org/zh/",target:"_blank",rel:"noopener noreferrer"},h=e("h2",{id:"浏览文档",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#浏览文档","aria-hidden":"true"},"#"),n(" 浏览文档")],-1),u={href:"https://ApsaraDB.github.io/PolarDB-for-PostgreSQL/zh/",target:"_blank",rel:"noopener noreferrer"},v=e("h2",{id:"本地文档开发",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#本地文档开发","aria-hidden":"true"},"#"),n(" 本地文档开发")],-1),b={href:"https://www.yarnpkg.cn/",target:"_blank",rel:"noopener noreferrer"},g={href:"https://nodejs.org/zh-cn/",target:"_blank",rel:"noopener noreferrer"},_=e("h3",{id:"node-环境准备",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#node-环境准备","aria-hidden":"true"},"#"),n(" Node 环境准备")],-1),m={href:"https://nodejs.org/zh-cn/download/",target:"_blank",rel:"noopener noreferrer"},f=i(`

Install nvm, the Node version manager, via curl:

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
+command -v nvm
+

If the previous step prints command not found, close the current terminal and open a new one.

Once nvm is installed, run the following command to install the LTS release of Node:

nvm install --lts
+

After Node.js is installed, use the following commands to check that the installation succeeded:

node -v
+npm -v
+

Use npm to install the yarn package manager globally:

npm install -g yarn
+yarn -v
+

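Installing Documentation Dependencies

Run the following command in the root directory of the PolarDB for PostgreSQL project; yarn installs all dependencies listed in package.json: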

yarn
+

Running the Documentation Development Server

Run the following command in the root directory of the PolarDB for PostgreSQL project:

yarn docs:dev
+

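The documentation development server runs at http://localhost:8080/PolarDB-for-PostgreSQL/; open it in a browser to view the site. Changes to Markdown files show up on the page in real time.

Documentation Directory Layout

The documentation resources of PolarDB for PostgreSQL are located in the docs/ directory under the project root. The directory is organized as follows: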

└── docs
+    ├── .vuepress
+    │   ├── configs
+    │   ├── public
+    │   └── styles
+    ├── README.md
+    ├── architecture
+    ├── contributing
+    ├── guide
+    ├── imgs
+    ├── roadmap
+    └── zh
+        ├── README.md
+        ├── architecture
+        ├── contributing
+        ├── guide
+        ├── imgs
+        └── roadmap
+

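As shown above, docs/zh/ mirrors its parent directory, except for .vuepress/. The docs/ directory contains the English documents, and docs/zh/ contains the corresponding Simplified Chinese documents.

The .vuepress/ directory contains the global configuration of the documentation project:

  • config.js: documentation configuration
  • configs/: documentation configuration modules (navbar / sidebar, English / Chinese, and so on)
  • public/: public static resources
  • styles/: overrides of the default theme styles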
`,22),k={href:"https://v2.vuepress.vuejs.org/zh/guide/configuration.html",target:"_blank",rel:"noopener noreferrer"},x=e("h2",{id:"文档开发规范",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#文档开发规范","aria-hidden":"true"},"#"),n(" 文档开发规范")],-1),P=e("li",null,"新的文档写好后,需要在文档配置中配置路由使其在导航栏和侧边栏中显示(可参考其他已有文档)",-1),B=e("li",null,"修正一种语言的文档时,也需要顺带修正其他语言的相同文档",-1),D={href:"https://prettier.io/",target:"_blank",rel:"noopener noreferrer"},N={href:"https://github.com/prettier/prettier-vscode",target:"_blank",rel:"noopener noreferrer"},j={href:"https://github.com/prettier/vim-prettier",target:"_blank",rel:"noopener noreferrer"},S=e("li",null,[n("或直接在源码根目录运行:"),e("code",null,"npx prettier --write docs/")],-1),L=e("h2",{id:"文档在线部署",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#文档在线部署","aria-hidden":"true"},"#"),n(" 文档在线部署")],-1),w={href:"https://github.com/features/actions",target:"_blank",rel:"noopener noreferrer"},y=e("code",null,"docs/",-1),V={href:"https://github.com/ApsaraDB/PolarDB-for-PostgreSQL/tree/gh-pages",target:"_blank",rel:"noopener noreferrer"},z={href:"https://pages.github.com/",target:"_blank",rel:"noopener noreferrer"};function E(Q,A){const a=o("ExternalLinkIcon");return t(),d("div",null,[l,e("p",null,[n("PolarDB for PostgreSQL 的文档使用 "),e("a",p,[n("VuePress 2"),s(a)]),n(" 进行管理,以 Markdown 为中心进行写作。")]),h,e("p",null,[n("本文档在线托管于 "),e("a",u,[n("GitHub Pages"),s(a)]),n(" 服务上。")]),v,e("p",null,[n("若您发现文档中存在内容或格式错误,或者您希望能够贡献新文档,那么您需要在本地安装并配置文档开发环境。本项目的文档是一个 Node.js 工程,以 "),e("a",b,[n("Yarn"),s(a)]),n(" 作为软件包管理器。"),e("a",g,[n("Node.js®"),s(a)]),n(" 是一个基于 Chrome V8 引擎的 JavaScript 运行时环境。")]),_,e("p",null,[n("您需要在本地准备 Node.js 环境。可以选择在 Node.js 官网 "),e("a",m,[n("下载"),s(a)]),n(" 页面下载安装包手动安装,也可以使用下面的命令自动安装。")]),f,e("p",null,[n("文档的配置方式请参考 VuePress 2 官方文档的 "),e("a",k,[n("配置指南"),s(a)]),n("。")]),x,e("ol",null,[P,B,e("li",null,[n("修改文档后,使用 "),e("a",D,[n("Prettier"),s(a)]),n(" 工具对 Markdown 文档进行格式化: "),e("ul",null,[e("li",null,[n("Prettier 支持的编辑器集成: 
"),e("ul",null,[e("li",null,[e("a",N,[n("Prettier-VSCode"),s(a)])]),e("li",null,[e("a",j,[n("Vim-Prettier"),s(a)])])])]),S])])]),L,e("p",null,[n("本文档借助 "),e("a",w,[n("GitHub Actions"),s(a)]),n(" 提供 CI 服务。向主分支推送代码时,将触发对 "),y,n(" 目录下文档资源的构建,并将构建结果推送到 "),e("a",V,[n("gh-pages"),s(a)]),n(" 分支上。"),e("a",z,[n("GitHub Pages"),s(a)]),n(" 服务会自动将该分支上的文档静态资源部署到 Web 服务器上形成文档网站。")])])}const C=r(c,[["render",E],["__file","contributing-polardb-docs.html.vue"]]);export{C as default}; diff --git a/assets/contributing-polardb-docs.html-5c2bada8.js b/assets/contributing-polardb-docs.html-5c2bada8.js new file mode 100644 index 00000000000..6dd8e40269d --- /dev/null +++ b/assets/contributing-polardb-docs.html-5c2bada8.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-210f48a7","path":"/zh/contributing/contributing-polardb-docs.html","title":"贡献文档","lang":"zh-CN","frontmatter":{},"headers":[{"level":2,"title":"浏览文档","slug":"浏览文档","link":"#浏览文档","children":[]},{"level":2,"title":"本地文档开发","slug":"本地文档开发","link":"#本地文档开发","children":[{"level":3,"title":"Node 环境准备","slug":"node-环境准备","link":"#node-环境准备","children":[]},{"level":3,"title":"文档依赖安装","slug":"文档依赖安装","link":"#文档依赖安装","children":[]},{"level":3,"title":"运行文档开发服务器","slug":"运行文档开发服务器","link":"#运行文档开发服务器","children":[]}]},{"level":2,"title":"文档目录组织","slug":"文档目录组织","link":"#文档目录组织","children":[]},{"level":2,"title":"文档开发规范","slug":"文档开发规范","link":"#文档开发规范","children":[]},{"level":2,"title":"文档在线部署","slug":"文档在线部署","link":"#文档在线部署","children":[]}],"git":{"updatedTime":1652766573000},"filePathRelative":"zh/contributing/contributing-polardb-docs.md"}');export{l as data}; diff --git a/assets/contributing-polardb-docs.html-f51dbfef.js b/assets/contributing-polardb-docs.html-f51dbfef.js new file mode 100644 index 00000000000..3446fbdb469 --- /dev/null +++ b/assets/contributing-polardb-docs.html-f51dbfef.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-2be11236","path":"/contributing/contributing-polardb-docs.html","title":"Documentation 
Contributing","lang":"en-US","frontmatter":{},"headers":[{"level":2,"title":"浏览文档","slug":"浏览文档","link":"#浏览文档","children":[]},{"level":2,"title":"本地文档开发","slug":"本地文档开发","link":"#本地文档开发","children":[{"level":3,"title":"Node 环境准备","slug":"node-环境准备","link":"#node-环境准备","children":[]},{"level":3,"title":"文档依赖安装","slug":"文档依赖安装","link":"#文档依赖安装","children":[]},{"level":3,"title":"运行文档开发服务器","slug":"运行文档开发服务器","link":"#运行文档开发服务器","children":[]}]},{"level":2,"title":"文档目录组织","slug":"文档目录组织","link":"#文档目录组织","children":[]},{"level":2,"title":"文档开发规范","slug":"文档开发规范","link":"#文档开发规范","children":[]},{"level":2,"title":"文档在线部署","slug":"文档在线部署","link":"#文档在线部署","children":[]}],"git":{"updatedTime":1652766573000},"filePathRelative":"contributing/contributing-polardb-docs.md"}');export{l as data}; diff --git a/assets/contributing-polardb-kernel.html-1eca7ee4.js b/assets/contributing-polardb-kernel.html-1eca7ee4.js new file mode 100644 index 00000000000..846a74442b2 --- /dev/null +++ b/assets/contributing-polardb-kernel.html-1eca7ee4.js @@ -0,0 +1 @@ +const e=JSON.parse(`{"key":"v-48520b74","path":"/contributing/contributing-polardb-kernel.html","title":"Code Contributing","lang":"en-US","frontmatter":{},"headers":[{"level":2,"title":"Branch Description and Management","slug":"branch-description-and-management","link":"#branch-description-and-management","children":[]},{"level":2,"title":"Before Contributing","slug":"before-contributing","link":"#before-contributing","children":[]},{"level":2,"title":"Contributing","slug":"contributing","link":"#contributing","children":[]},{"level":2,"title":"An Example of Submitting Code Change to PolarDB","slug":"an-example-of-submitting-code-change-to-polardb","link":"#an-example-of-submitting-code-change-to-polardb","children":[{"level":3,"title":"Fork Your Own Repository","slug":"fork-your-own-repository","link":"#fork-your-own-repository","children":[]},{"level":3,"title":"Create Local 
Repository","slug":"create-local-repository","link":"#create-local-repository","children":[]},{"level":3,"title":"Create a Local Development Branch","slug":"create-a-local-development-branch","link":"#create-a-local-development-branch","children":[]},{"level":3,"title":"Make Changes and Commit Locally","slug":"make-changes-and-commit-locally","link":"#make-changes-and-commit-locally","children":[]},{"level":3,"title":"Rebase and Commit to Remote Repository","slug":"rebase-and-commit-to-remote-repository","link":"#rebase-and-commit-to-remote-repository","children":[]},{"level":3,"title":"Create a Pull Request","slug":"create-a-pull-request","link":"#create-a-pull-request","children":[]},{"level":3,"title":"Address Reviewers' Comments","slug":"address-reviewers-comments","link":"#address-reviewers-comments","children":[]},{"level":3,"title":"Merge","slug":"merge","link":"#merge","children":[]}]}],"git":{"updatedTime":1689229584000},"filePathRelative":"contributing/contributing-polardb-kernel.md"}`);export{e as data}; diff --git a/assets/contributing-polardb-kernel.html-54788b1e.js b/assets/contributing-polardb-kernel.html-54788b1e.js new file mode 100644 index 00000000000..e64f25369a0 --- /dev/null +++ b/assets/contributing-polardb-kernel.html-54788b1e.js @@ -0,0 +1 @@ +const 
l=JSON.parse('{"key":"v-aa672cb6","path":"/zh/contributing/contributing-polardb-kernel.html","title":"贡献代码","lang":"zh-CN","frontmatter":{},"headers":[{"level":2,"title":"分支说明与管理方式","slug":"分支说明与管理方式","link":"#分支说明与管理方式","children":[]},{"level":2,"title":"贡献代码之前","slug":"贡献代码之前","link":"#贡献代码之前","children":[]},{"level":2,"title":"贡献流程","slug":"贡献流程","link":"#贡献流程","children":[]},{"level":2,"title":"代码提交实例说明","slug":"代码提交实例说明","link":"#代码提交实例说明","children":[{"level":3,"title":"复制您自己的仓库","slug":"复制您自己的仓库","link":"#复制您自己的仓库","children":[]},{"level":3,"title":"克隆您的仓库到本地","slug":"克隆您的仓库到本地","link":"#克隆您的仓库到本地","children":[]},{"level":3,"title":"创建本地开发分支","slug":"创建本地开发分支","link":"#创建本地开发分支","children":[]},{"level":3,"title":"在本地仓库修改代码并提交","slug":"在本地仓库修改代码并提交","link":"#在本地仓库修改代码并提交","children":[]},{"level":3,"title":"变基并提交到远程仓库","slug":"变基并提交到远程仓库","link":"#变基并提交到远程仓库","children":[]},{"level":3,"title":"创建 Pull Request","slug":"创建-pull-request","link":"#创建-pull-request","children":[]},{"level":3,"title":"解决代码评审中的问题","slug":"解决代码评审中的问题","link":"#解决代码评审中的问题","children":[]},{"level":3,"title":"代码合并","slug":"代码合并","link":"#代码合并","children":[]}]}],"git":{"updatedTime":1689229584000},"filePathRelative":"zh/contributing/contributing-polardb-kernel.md"}');export{l as data}; diff --git a/assets/contributing-polardb-kernel.html-92d0b879.js b/assets/contributing-polardb-kernel.html-92d0b879.js new file mode 100644 index 00000000000..0d1564672de --- /dev/null +++ b/assets/contributing-polardb-kernel.html-92d0b879.js @@ -0,0 +1,13 @@ +import{_ as l,r,o as c,c as d,a as e,b as a,d as o,w as s,e as i}from"./app-3d1677bf.js";const h={},u=i('

Code Contributing

PolarDB for PostgreSQL is an open source product based on PostgreSQL and other open source projects. Our main goal is to build a larger community for PostgreSQL. Contributors are welcome to submit their code and ideas. In the long run, we hope this project can be managed by developers from both inside and outside Alibaba.

Branch Description and Management

  • POLARDB_11_STABLE is the stable branch of PolarDB; it accepts merges from POLARDB_11_DEV only.
  • POLARDB_11_DEV is the stable development branch of PolarDB; it accepts merges from pull requests as well as direct pushes from maintainers.

New features are merged into POLARDB_11_DEV first, and maintainers periodically merge them into POLARDB_11_STABLE.

Before Contributing

',6),p={href:"https://gist.github.com/alibaba-oss/151a13b0a72e44ba471119c7eb737d74",target:"_blank",rel:"noopener noreferrer"},m=e("h2",{id:"contributing",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#contributing","aria-hidden":"true"},"#"),a(" Contributing")],-1),b=e("p",null,"Here is a checklist to prepare and submit your PR (pull request):",-1),g=e("li",null,[a("Create your own Github repository copy by forking "),e("code",null,"ApsaraDB/PolarDB-for-PostgreSQL"),a(".")],-1),f=e("li",null,"Create a PR with a detailed description, if commit messages do not express themselves.",-1),v=e("li",null,"Submit PR for review and address all feedbacks.",-1),_=e("li",null,"Wait for merging",-1),k=e("h2",{id:"an-example-of-submitting-code-change-to-polardb",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#an-example-of-submitting-code-change-to-polardb","aria-hidden":"true"},"#"),a(" An Example of Submitting Code Change to PolarDB")],-1),y=e("p",null,"Let's use an example to walk through the list.",-1),P=e("h3",{id:"fork-your-own-repository",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#fork-your-own-repository","aria-hidden":"true"},"#"),a(" Fork Your Own Repository")],-1),D={href:"https://github.com/ApsaraDB/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},B=e("strong",null,"fork",-1),L=i(`

Create Local Repository

git clone https://github.com/<your-github>/PolarDB-for-PostgreSQL.git
+

Create a Local Development Branch

Check out a new development branch from the stable development branch POLARDB_11_DEV. Suppose your branch is named dev:

git checkout POLARDB_11_DEV
+git checkout -b dev
+

Make Changes and Commit Locally

git status
+git add <files-to-change>
+git commit -m "modification for dev"
+

Rebase and Commit to Remote Repository

Click Fetch upstream on your own repository page to make sure your stable development branch is up to date with the official PolarDB repository. Then pull the latest commits on the stable development branch into your local repository.

git checkout POLARDB_11_DEV
+git pull
+

Then rebase your development branch onto the stable development branch and resolve any conflicts:

git checkout dev
+git rebase POLARDB_11_DEV
+# resolve any conflicts, then run: git rebase --continue
+git push -f origin dev
+
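
As an alternative to the Fetch upstream button, the same synchronization can be sketched on the command line. This sketch assumes that `origin` points to your fork (the default after `git clone`) and uses a remote named `upstream` for the official repository; the remote name and the function name `sync_and_rebase` are choices made for this example only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: update the local base branch from the official
# repository, rebase the working branch onto it, and push it to the fork.
#   $1: base branch (default: POLARDB_11_DEV)
#   $2: working branch (default: dev)
sync_and_rebase() {
    local base="${1:-POLARDB_11_DEV}"
    local work="${2:-dev}"
    # Register the official repository; ignore the error if it already exists.
    git remote add upstream https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git 2>/dev/null || true
    # Each step runs only if the previous one succeeded. The final push uses
    # --force-with-lease, which refuses to overwrite remote commits that
    # have not been fetched locally.
    git fetch upstream &&
    git checkout "$base" &&
    git merge --ff-only "upstream/$base" &&
    git checkout "$work" &&
    git rebase "$base" &&
    git push --force-with-lease origin "$work"
}
```

If the rebase stops on a conflict, the chain ends before the push; after resolving the conflict and running `git rebase --continue`, the push can be issued by hand.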

Create a Pull Request

Click the New pull request or Compare & pull request button, choose to compare the branches ApsaraDB/PolarDB-for-PostgreSQL:POLARDB_11_DEV and <your-github>/PolarDB-for-PostgreSQL:dev, and write the PR description.

GitHub will automatically run regression tests on your code. Your PR should pass all of these checks.

Address Reviewers' Comments

Resolve all problems raised by reviewers and update the PR.

Merge

Merging is done by the PolarDB maintainers.

`,19);function x(w,R){const n=r("ExternalLinkIcon"),t=r("RouterLink");return c(),d("div",null,[u,e("ul",null,[e("li",null,[a("Sign the "),e("a",p,[a("CLA"),o(n)]),a(" of PolarDB for PostgreSQL")])]),m,b,e("ul",null,[g,e("li",null,[a("Checkout documentations for "),o(t,{to:"/deploying/deploy.html"},{default:s(()=>[a("Advanced Deployment")]),_:1}),a(" from PolarDB source code.")]),e("li",null,[a("Push changes to your personal fork and make sure they follow our "),o(t,{to:"/contributing/coding-style.html"},{default:s(()=>[a("coding style")]),_:1}),a(".")]),f,v,_]),k,y,P,e("p",null,[a("On GitHub repository of "),e("a",D,[a("PolarDB for PostgreSQL"),o(n)]),a(", Click "),B,a(" button to create your own PolarDB repository.")]),L])}const A=l(h,[["render",x],["__file","contributing-polardb-kernel.html.vue"]]);export{A as default}; diff --git a/assets/contributing-polardb-kernel.html-9fffc22f.js b/assets/contributing-polardb-kernel.html-9fffc22f.js new file mode 100644 index 00000000000..5fe7967c859 --- /dev/null +++ b/assets/contributing-polardb-kernel.html-9fffc22f.js @@ -0,0 +1,13 @@ +import{_ as d,r as t,o as l,c,a as e,b as a,d as n,w as o,e as i}from"./app-3d1677bf.js";const h={},u=i('

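Code Contributing

PolarDB for PostgreSQL is developed on top of PostgreSQL and other open source projects. Our main goal is to build a larger community for PostgreSQL. We welcome contributors from the community to submit their code or ideas. In the longer term, we hope this project can be managed jointly by developers from both inside and outside Alibaba.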

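Branch Description and Management

  • POLARDB_11_STABLE is the stable branch of PolarDB; it only accepts merges from POLARDB_11_DEV
  • POLARDB_11_DEV is the stable development branch of PolarDB; it accepts PR merges from the open source community as well as direct pushes from internal developers

New code is merged into POLARDB_11_DEV and is then periodically merged into POLARDB_11_STABLE by internal developers.

Before Contributing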

',6),p={href:"https://gist.github.com/alibaba-oss/151a13b0a72e44ba471119c7eb737d74",target:"_blank",rel:"noopener noreferrer"},b=e("h2",{id:"贡献流程",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#贡献流程","aria-hidden":"true"},"#"),a(" 贡献流程")],-1),g=e("li",null,[a("在 "),e("code",null,"ApsaraDB/PolarDB-for-PostgreSQL"),a(" 仓库点击 "),e("code",null,"fork"),a(" 复制一个属于您自己的仓库")],-1),_=e("li",null,"向 PolarDB 官方源码仓库发起 pull request;如果 commit message 本身不能很好地表达您的贡献内容,您可以在 PR 中给出较为细节的描述",-1),v=e("li",null,"等待维护者评审您的代码,讨论并解决所有的评审意见",-1),m=e("li",null,"等待维护者合并您的代码",-1),f=e("h2",{id:"代码提交实例说明",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#代码提交实例说明","aria-hidden":"true"},"#"),a(" 代码提交实例说明")],-1),P=e("h3",{id:"复制您自己的仓库",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#复制您自己的仓库","aria-hidden":"true"},"#"),a(" 复制您自己的仓库")],-1),D={href:"https://github.com/ApsaraDB/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},k=e("strong",null,"fork",-1),B=i(`

Clone Your Repository Locally

git clone https://github.com/<your-github>/PolarDB-for-PostgreSQL.git
+

Create a Local Development Branch

Check out a new development branch from the stable development branch POLARDB_11_DEV. Suppose the branch is named dev:

git checkout POLARDB_11_DEV
+git checkout -b dev
+

Make Changes and Commit in the Local Repository

git status
+git add <files-to-change>
+git commit -m "modification for dev"
+

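Rebase and Push to the Remote Repository

First click Fetch upstream on your own repository page to make sure your stable development branch matches the stable development branch of the official PolarDB repository. Then pull the latest changes on the stable development branch into your local repository: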

git checkout POLARDB_11_DEV
+git pull
+

Next, rebase your development branch onto the current stable development branch and resolve any conflicts:

git checkout dev
+git rebase POLARDB_11_DEV
+# resolve any conflicts, then run: git rebase --continue
+git push -f origin dev
+

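Create a Pull Request

Click the New pull request or Compare & pull request button, choose to compare the ApsaraDB/PolarDB-for-PostgreSQL:POLARDB_11_DEV branch with the <your-github>/PolarDB-for-PostgreSQL:dev branch, and write the PR description.

GitHub runs automated regression tests on your PR; your PR must pass 100% of them.

Address Review Comments

You can discuss problems in the code with the maintainers and resolve the review comments they raise.

Merge

If your code passes the tests and the review, the PolarDB maintainers will merge your PR into the stable branch.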

`,19);function L(x,R){const s=t("ExternalLinkIcon"),r=t("RouterLink");return l(),c("div",null,[u,e("ul",null,[e("li",null,[a("签署 PolarDB for PostgreSQL 的 "),e("a",p,[a("CLA"),n(s)])])]),b,e("ul",null,[g,e("li",null,[a("查阅 "),n(r,{to:"/zh/deploying/deploy.html"},{default:o(()=>[a("进阶部署")]),_:1}),a(" 了解如何从源码编译开发 PolarDB")]),e("li",null,[a("向您的复制源码仓库推送代码,并确保代码符合我们的 "),n(r,{to:"/zh/contributing/coding-style.html"},{default:o(()=>[a("编码风格规范")]),_:1})]),_,v,m]),f,P,e("p",null,[a("在 "),e("a",D,[a("PolarDB for PostgreSQL"),n(s)]),a(" 的代码仓库页面上,点击右上角的 "),k,a(" 按钮复制您自己的 PolarDB 仓库。")]),B])}const E=d(h,[["render",L],["__file","contributing-polardb-kernel.html.vue"]]);export{E as default}; diff --git a/assets/cpu-usage-high.html-7366dfbc.js b/assets/cpu-usage-high.html-7366dfbc.js new file mode 100644 index 00000000000..09050d4e709 --- /dev/null +++ b/assets/cpu-usage-high.html-7366dfbc.js @@ -0,0 +1,34 @@ +import{_ as d,r as o,o as k,c as i,d as n,a as s,w as t,b as a,e as p}from"./app-3d1677bf.js";const u={},_=s("h1",{id:"cpu-使用率高的排查方法",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#cpu-使用率高的排查方法","aria-hidden":"true"},"#"),a(" CPU 使用率高的排查方法")],-1),g=s("p",null,"在 PolarDB for PostgreSQL 的使用过程中,可能会出现 CPU 使用率异常升高甚至达到满载的情况。本文将介绍造成这种情况的常见原因和排查方法,以及相应的解决方案。",-1),h={class:"table-of-contents"},y=p(`

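Workload Growth

When CPU usage rises, the most likely cause is workload growth, which makes the database consume more computing resources. So first check whether the number of active connections is much higher than usual. If the database has a monitoring system, the change in active connections can be observed in charts; otherwise, connect to the database directly and run the following SQL to get the current number of active connections: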

SELECT COUNT(*) FROM pg_stat_activity WHERE state NOT LIKE 'idle';
+
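
When a single total is too coarse, a per-state breakdown can help. The sketch below wraps such a query in a small shell function; the name `connections_by_state` is hypothetical, and the connection settings are assumed to come from the usual `PG*` environment variables (for example `PGHOST` and `PGUSER`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one "state|count" line per connection state,
# most frequent state first. Rows with a NULL state (background processes)
# are skipped.
connections_by_state() {
    # -X: ignore ~/.psqlrc, -A: unaligned output, -t: rows only
    psql -X -At -c "SELECT state, COUNT(*) FROM pg_stat_activity WHERE state IS NOT NULL GROUP BY state ORDER BY COUNT(*) DESC;"
}
```

Running `connections_by_state` against the affected instance shows at a glance whether the extra load comes from sessions that are actually executing queries or from sessions sitting in open transactions.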

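pg_stat_activity is a built-in PostgreSQL system view. Each row it returns is a running PostgreSQL process, and the state column shows the current state of that process. Possible values are:

  • active: the process is executing a query
  • idle: the process is idle, waiting for a new client command
  • idle in transaction: the process is in a transaction but is not currently executing a query
  • idle in transaction (aborted): the process is in a transaction, and one of its statements caused an error
  • fastpath function call: the process is executing a fast-path function
  • disabled: state reporting is disabled for the process

The SQL above counts all processes that are not in the idle state, that is, the active connections that may be using CPU. If there are more active connections than usual, the rise in CPU usage is expected.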

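Slow Queries

If CPU usage rises while the number of active connections stays within the normal range, there may be many poorly performing slow queries. These slow queries may have occupied a lot of CPU for a long time, driving CPU usage up. PostgreSQL provides a slow query log: SQL statements whose execution time exceeds log_min_duration_statement are recorded in it. However, when CPU usage is close to saturation, the whole system stalls and all SQL statements may slow down, so the slow query log may contain a large amount of information and is not easy to analyze.

Locating Slow Queries with Long Execution Time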

`,9),w={href:"https://www.postgresql.org/docs/current/pgstatstatements.html",target:"_blank",rel:"noopener noreferrer"},E=s("code",null,"pg_stat_statements",-1),L=s("code",null,"shared_preload_libraries",-1),S=p(`

If the pg_stat_statements extension has not been created in the current database, create it first. This registers the functions and views provided by the extension:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
+

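Both the extension and the database system keep accumulating statistics. To investigate the period after CPU usage rose abnormally, reset the statistics kept in the database and in the extension, then start collecting statistics from the current moment:

-- Reset the statistics of the current database
+SELECT pg_stat_reset();
+-- Reset the statistics collected so far by the pg_stat_statements extension
+SELECT pg_stat_statements_reset();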
+

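Next, wait for a while (1-2 minutes) so that the database and the extension can collect enough statistics for this period.

Once the statistics are collected, use the following SQL to query the 5 statements with the longest total execution time: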

-- < PostgreSQL 13
+SELECT * FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;
+-- >= PostgreSQL 13
+SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;
+
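
The version split above can be automated. The following is a minimal sketch, assuming the server version is available as the number printed by `SHOW server_version_num` (for example `110009` for an 11.x server); the function names are hypothetical and chosen for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the pg_stat_statements timing column from the
# numeric server version.
pgss_time_column() {
    if [ "$1" -ge 130000 ]; then
        echo "total_exec_time"
    else
        echo "total_time"
    fi
}

# Hypothetical helper: build the "top N statements by execution time" query.
#   $1: numeric server version, $2: number of rows (default: 5)
pgss_top_query() {
    local version_num="$1"
    local limit="${2:-5}"
    echo "SELECT * FROM pg_stat_statements ORDER BY $(pgss_time_column "$version_num") DESC LIMIT ${limit};"
}
```

One possible way to use it is `psql -X -c "$(pgss_top_query "$(psql -X -At -c 'SHOW server_version_num')")"`, which asks the server for its version first and then runs the matching query.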

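Locating Slow Queries That Read Many Buffers

When a table lacks an index and most queries on it are point queries, the database has to use sequential scans and evaluate the filter conditions in memory, discarding a large number of non-matching rows, which raises CPU usage significantly. Using the statistics of the pg_stat_statements extension, the following SQL lists the 5 statements that have read the most buffers so far: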

SELECT * FROM pg_stat_statements
+ORDER BY shared_blks_hit + shared_blks_read DESC
+LIMIT 5;
+
`,10),f={href:"https://www.postgresql.org/docs/15/monitoring-stats.html#MONITORING-PG-STAT-ALL-TABLES-VIEW",target:"_blank",rel:"noopener noreferrer"},m=s("code",null,"pg_stat_user_tables",-1),q=p(`
SELECT * FROM pg_stat_user_tables
+WHERE n_live_tup > 100000 AND seq_scan > 0
+ORDER BY seq_tup_read DESC
+LIMIT 5;
+

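Locating Long-Running Queries That Do Not Finish

The built-in system view pg_stat_activity can be used to find SQL statements that have been running for a long time without finishing; these are very likely to cause high CPU usage. Use the following SQL to get the 5 longest-running statements that have not yet exited: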

SELECT
+    *,
+    extract(epoch FROM (NOW() - xact_start)) AS xact_stay,
+    extract(epoch FROM (NOW() - query_start)) AS query_stay
+FROM pg_stat_activity
+WHERE state NOT LIKE 'idle%'
+ORDER BY query_stay DESC
+LIMIT 5;
+

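Taking the table identified in the previous step as the one most often read by sequential scans, use the following SQL to find slow queries on that table whose execution time exceeds a threshold (for example, 10s):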

SELECT * FROM pg_stat_activity
+WHERE
+    state NOT LIKE 'idle%' AND
+    query ILIKE '%table_name%' AND
+    NOW() - query_start > interval '10s';
+

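Solutions and Optimization Ideas

For SQL statements that use abnormally high CPU, if there are only a few unexpected ones, you can interrupt them by sending a signal to the backend process so that CPU usage returns to normal. Use the following SQL, passing the pid of the process executing the slow query (the pid column of the pg_stat_activity view) as the argument, to stop the corresponding process: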

SELECT pg_cancel_backend(pid);
+SELECT pg_terminate_backend(pid);
+
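
Note that pg_cancel_backend only cancels the query a backend is currently running, while pg_terminate_backend ends the whole session. When several queries need to be cancelled at once, the statement can be generated rather than typed per pid. The sketch below is one way to do that; the function name and the default threshold of 5 minutes are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a statement that cancels every active query
# running longer than the given interval, skipping the session that issues
# the statement. The interval is pasted into the SQL text as-is, so it must
# come from a trusted source.
#   $1: interval text (default: "5 minutes")
cancel_long_queries_sql() {
    local threshold="${1:-5 minutes}"
    echo "SELECT pid, pg_cancel_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND pid <> pg_backend_pid() AND now() - query_start > interval '${threshold}';"
}
```

It is safer to print the statement with `cancel_long_queries_sql '10 minutes'` and review it before passing it to `psql -X -c`.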

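If a slow SQL statement is required by the business, it needs to be tuned.

First, sample the tables involved in the statement and update their statistics, so that the optimizer can produce more accurate execution plans. Sampling uses some CPU, so it is best run during off-peak hours:

ANALYZE table_name;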
+

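For tables with many sequential scans, create indexes on frequently used filter columns so that index scans are used where possible, reducing the CPU wasted by sequential scans that filter out non-matching rows in memory.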

`,13);function C(l,v){const r=o("ArticleInfo"),e=o("router-link"),c=o("ExternalLinkIcon");return k(),i("div",null,[_,n(r,{frontmatter:l.$frontmatter},null,8,["frontmatter"]),g,s("nav",h,[s("ul",null,[s("li",null,[n(e,{to:"#业务量上涨"},{default:t(()=>[a("业务量上涨")]),_:1})]),s("li",null,[n(e,{to:"#慢查询"},{default:t(()=>[a("慢查询")]),_:1}),s("ul",null,[s("li",null,[n(e,{to:"#定位执行时间较长的慢查询"},{default:t(()=>[a("定位执行时间较长的慢查询")]),_:1})]),s("li",null,[n(e,{to:"#定位读取-buffer-数量较多的慢查询"},{default:t(()=>[a("定位读取 Buffer 数量较多的慢查询")]),_:1})]),s("li",null,[n(e,{to:"#定位长时间执行不结束的慢查询"},{default:t(()=>[a("定位长时间执行不结束的慢查询")]),_:1})]),s("li",null,[n(e,{to:"#解决方法与优化思路"},{default:t(()=>[a("解决方法与优化思路")]),_:1})])])])])]),y,s("p",null,[s("a",w,[E,n(c)]),a(" 插件能够记录数据库服务器上所有 SQL 语句在优化和执行阶段的统计信息。由于该插件需要使用共享内存,因此插件名需要被配置在 "),L,a(" 参数中。")]),S,s("p",null,[a("借助 PostgreSQL 内置系统视图 "),s("a",f,[m,n(c)]),a(" 中的统计信息,也可以统计出使用全表扫描的次数最多的表。参考如下 SQL,可以获取具备一定规模数据量(元组约为 10 万个)且使用全表扫描获取到的元组数量最多的 5 张表:")]),q])}const Q=d(u,[["render",C],["__file","cpu-usage-high.html.vue"]]);export{Q as default}; diff --git a/assets/cpu-usage-high.html-c7872413.js b/assets/cpu-usage-high.html-c7872413.js new file mode 100644 index 00000000000..bda338289a7 --- /dev/null +++ b/assets/cpu-usage-high.html-c7872413.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-a8802f54","path":"/zh/operation/cpu-usage-high.html","title":"CPU 使用率高的排查方法","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2023/03/06","minute":20},"headers":[{"level":2,"title":"业务量上涨","slug":"业务量上涨","link":"#业务量上涨","children":[]},{"level":2,"title":"慢查询","slug":"慢查询","link":"#慢查询","children":[{"level":3,"title":"定位执行时间较长的慢查询","slug":"定位执行时间较长的慢查询","link":"#定位执行时间较长的慢查询","children":[]},{"level":3,"title":"定位读取 Buffer 
数量较多的慢查询","slug":"定位读取-buffer-数量较多的慢查询","link":"#定位读取-buffer-数量较多的慢查询","children":[]},{"level":3,"title":"定位长时间执行不结束的慢查询","slug":"定位长时间执行不结束的慢查询","link":"#定位长时间执行不结束的慢查询","children":[]},{"level":3,"title":"解决方法与优化思路","slug":"解决方法与优化思路","link":"#解决方法与优化思路","children":[]}]}],"git":{"updatedTime":1678166880000},"filePathRelative":"zh/operation/cpu-usage-high.md"}');export{e as data}; diff --git a/assets/curve-cluster-77966d1c.png b/assets/curve-cluster-77966d1c.png new file mode 100644 index 00000000000..d60d7dde8bd Binary files /dev/null and b/assets/curve-cluster-77966d1c.png differ diff --git a/assets/customize-dev-env.html-6e08f45f.js b/assets/customize-dev-env.html-6e08f45f.js new file mode 100644 index 00000000000..7f03360e490 --- /dev/null +++ b/assets/customize-dev-env.html-6e08f45f.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-c4fe9fca","path":"/development/customize-dev-env.html","title":"定制开发环境","lang":"en-US","frontmatter":{},"headers":[{"level":2,"title":"自行构建开发镜像","slug":"自行构建开发镜像","link":"#自行构建开发镜像","children":[]},{"level":2,"title":"从干净的系统开始搭建开发环境","slug":"从干净的系统开始搭建开发环境","link":"#从干净的系统开始搭建开发环境","children":[{"level":3,"title":"建立非 root 用户","slug":"建立非-root-用户","link":"#建立非-root-用户","children":[]},{"level":3,"title":"依赖安装","slug":"依赖安装","link":"#依赖安装","children":[]}]}],"git":{"updatedTime":1690894847000},"filePathRelative":"development/customize-dev-env.md"}');export{e as data}; diff --git a/assets/customize-dev-env.html-95ee07be.js b/assets/customize-dev-env.html-95ee07be.js new file mode 100644 index 00000000000..34431755e9f --- /dev/null +++ b/assets/customize-dev-env.html-95ee07be.js @@ -0,0 +1,154 @@ +import{_ as r,r as o,o as c,c as d,a as s,b as n,d as a,w as l,e as p}from"./app-3d1677bf.js";const m={},v=s("h1",{id:"定制开发环境",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#定制开发环境","aria-hidden":"true"},"#"),n(" 定制开发环境")],-1),u=s("h2",{id:"自行构建开发镜像",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#自行构建开发镜像","aria-hidden":"true"},"#"),n(" 
自行构建开发镜像")],-1),b={href:"https://hub.docker.com/r/polardb/polardb_pg_devel/tags",target:"_blank",rel:"noopener noreferrer"},k=s("code",null,"polardb/polardb_pg_devel",-1),g=s("code",null,"linux/amd64",-1),h=s("code",null,"linux/arm64",-1),_={href:"https://hub.docker.com/_/ubuntu/tags",target:"_blank",rel:"noopener noreferrer"},f=s("code",null,"ubuntu:20.04",-1),S=p(`
FROM ubuntu:20.04
+LABEL maintainer="mrdrivingduck@gmail.com"
+CMD bash
+
+# Timezone problem
+ENV TZ=Asia/Shanghai
+RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
+
+# Upgrade software
+RUN apt update -y && \\
+    apt upgrade -y && \\
+    apt clean -y
+
+# GCC (force to 9) and LLVM (force to 11)
+RUN apt install -y \\
+        gcc-9 \\
+        g++-9 \\
+        llvm-11-dev \\
+        clang-11 \\
+        make \\
+        gdb \\
+        pkg-config \\
+        locales && \\
+    update-alternatives --install \\
+        /usr/bin/gcc gcc /usr/bin/gcc-9 60 --slave \\
+        /usr/bin/g++ g++ /usr/bin/g++-9 && \\
+    update-alternatives --install \\
+        /usr/bin/llvm-config llvm-config /usr/bin/llvm-config-11 60 --slave \\
+        /usr/bin/clang++ clang++ /usr/bin/clang++-11 --slave \\
+        /usr/bin/clang clang /usr/bin/clang-11 && \\
+    apt clean -y
+
+# Generate locale
+RUN sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && \\
+    sed -i '/zh_CN.UTF-8/s/^# //g' /etc/locale.gen && \\
+    locale-gen
+
+# Dependencies
+RUN apt install -y \\
+        libicu-dev \\
+        bison \\
+        flex \\
+        python3-dev \\
+        libreadline-dev \\
+        libgss-dev \\
+        libssl-dev \\
+        libpam0g-dev \\
+        libxml2-dev \\
+        libxslt1-dev \\
+        libldap2-dev \\
+        uuid-dev \\
+        liblz4-dev \\
+        libkrb5-dev \\
+        gettext \\
+        libxerces-c-dev \\
+        tcl-dev \\
+        libperl-dev \\
+        libipc-run-perl \\
+        libaio-dev \\
+        libfuse-dev && \\
+    apt clean -y
+
+# Tools
+RUN apt install -y \\
+        iproute2 \\
+        wget \\
+        ccache \\
+        sudo \\
+        vim \\
+        git \\
+        cmake && \\
+    apt clean -y
+
+# set to empty if GitHub is reachable without a proxy
+# ENV GITHUB_PROXY=https://ghproxy.com/
+ENV GITHUB_PROXY=
+
+ENV ZLOG_VERSION=1.2.14
+ENV PFSD_VERSION=pfsd4pg-release-1.2.42-20220419
+
+# install dependencies from GitHub mirror
+RUN cd /usr/local && \\
+    # zlog for PFSD
+    wget --no-verbose --no-check-certificate "\${GITHUB_PROXY}https://github.com/HardySimpson/zlog/archive/refs/tags/\${ZLOG_VERSION}.tar.gz" && \\
+    # PFSD
+    wget --no-verbose --no-check-certificate "\${GITHUB_PROXY}https://github.com/ApsaraDB/PolarDB-FileSystem/archive/refs/tags/\${PFSD_VERSION}.tar.gz" && \\
+    # unzip and install zlog
+    gzip -d $ZLOG_VERSION.tar.gz && \\
+    tar xpf $ZLOG_VERSION.tar && \\
+    cd zlog-$ZLOG_VERSION && \\
+    make && make install && \\
+    echo '/usr/local/lib' >> /etc/ld.so.conf && ldconfig && \\
+    cd .. && \\
+    rm -rf $ZLOG_VERSION* && \\
+    rm -rf zlog-$ZLOG_VERSION && \\
+    # unzip and install PFSD
+    gzip -d $PFSD_VERSION.tar.gz && \\
+    tar xpf $PFSD_VERSION.tar && \\
+    cd PolarDB-FileSystem-$PFSD_VERSION && \\
+    sed -i 's/-march=native //' CMakeLists.txt && \\
+    ./autobuild.sh && ./install.sh && \\
+    cd .. && \\
+    rm -rf $PFSD_VERSION* && \\
+    rm -rf PolarDB-FileSystem-$PFSD_VERSION*
+
+# create default user
+ENV USER_NAME=postgres
+RUN echo "create default user" && \\
+    groupadd -r $USER_NAME && \\
+    useradd -ms /bin/bash -g $USER_NAME $USER_NAME -p '' && \\
+    usermod -aG sudo $USER_NAME
+
+# modify conf
+RUN echo "modify conf" && \\
+    mkdir -p /var/log/pfs && chown $USER_NAME /var/log/pfs && \\
+    mkdir -p /var/run/pfs && chown $USER_NAME /var/run/pfs && \\
+    mkdir -p /var/run/pfsd && chown $USER_NAME /var/run/pfsd && \\
+    mkdir -p /dev/shm/pfsd && chown $USER_NAME /dev/shm/pfsd && \\
+    touch /var/run/pfsd/.pfsd && \\
+    echo "ulimit -c unlimited" >> /home/postgres/.bashrc && \\
+    echo "export PGHOST=127.0.0.1" >> /home/postgres/.bashrc && \\
+    echo "alias pg='psql -h /home/postgres/tmp_master_dir_polardb_pg_1100_bld/'" >> /home/postgres/.bashrc
+
+ENV PATH="/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin:$PATH"
+WORKDIR /home/$USER_NAME
+USER $USER_NAME
+

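Copy the content above into a file (assumed here to be named Dockerfile-PolarDB), then build the image with the following command:

TIP

💡 In the highlighted line below, replace <image_name> with the Docker image name you want to use.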

docker build --network=host \\
+    -t <image_name> \\
+    -f Dockerfile-PolarDB .
+

 

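Setting up the development environment from a clean system

This approach assumes that you start from scratch on a clean CentOS 7 system where you have root privileges. It can be either of the following:

  • A physical or virtual machine with CentOS 7 installed
  • A Docker container started from the official CentOS 7 Docker image centos:centos7

Creating a non-root user

PolarDB for PostgreSQL must run as a non-root user. The following steps help you create a user group named postgres and a user named postgres.

TIP

If you already have a non-root user whose name is not postgres:postgres, you can skip this step. In that case, remember to replace the user-related information in the commands of the later example steps with your own group name and user name.

The commands below create the group postgres and the user postgres, and grant that user sudo privileges and ownership of its working directory. Run them as root.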

# install sudo
+yum install -y sudo
+# create user and group
+groupadd -r postgres
+useradd -m -g postgres postgres -p ''
+usermod -aG wheel postgres
+# make postgres a sudoer
+chmod u+w /etc/sudoers
+echo 'postgres ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers
+chmod u-w /etc/sudoers
+# grant access to home directory
+chown -R postgres:postgres /home/postgres/
+echo 'source /etc/bashrc' >> /home/postgres/.bashrc
+# for su postgres
+sed -i 's/4096/unlimited/g' /etc/security/limits.d/20-nproc.conf
+

Next, switch to the postgres user to continue with the remaining steps:

su postgres
+source /etc/bashrc
+cd ~
+

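Installing dependencies

The root directory of the PolarDB for PostgreSQL source repository contains a script, install_dependencies.sh, that installs all dependencies PolarDB and PFS need in order to run. You therefore need to clone the PolarDB source repository first.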

`,16),E={href:"https://github.com/ApsaraDB/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},N=s("code",null,"POLARDB_11_STABLE",-1),P={href:"https://gitee.com/mirrors/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},y=s("div",{class:"language-bash","data-ext":"sh"},[s("pre",{class:"language-bash"},[s("code",null,[s("span",{class:"token function"},"sudo"),n(" yum "),s("span",{class:"token function"},"install"),n(),s("span",{class:"token parameter variable"},"-y"),n(),s("span",{class:"token function"},"git"),n(` +`),s("span",{class:"token function"},"git"),n(" clone "),s("span",{class:"token parameter variable"},"-b"),n(` POLARDB_11_STABLE https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git +`)])])],-1),R=s("div",{class:"language-bash","data-ext":"sh"},[s("pre",{class:"language-bash"},[s("code",null,[s("span",{class:"token function"},"sudo"),n(" yum "),s("span",{class:"token function"},"install"),n(),s("span",{class:"token parameter variable"},"-y"),n(),s("span",{class:"token function"},"git"),n(` +`),s("span",{class:"token function"},"git"),n(" clone "),s("span",{class:"token parameter variable"},"-b"),n(` POLARDB_11_STABLE https://gitee.com/mirrors/PolarDB-for-PostgreSQL +`)])])],-1),D=p(`

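Once the source code has been downloaded, run the dependency installation script install_dependencies.sh in the source root directory with sudo to install all dependencies automatically. If you have custom development requirements, modify install_dependencies.sh yourself.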

cd PolarDB-for-PostgreSQL
+sudo ./install_dependencies.sh
+
`,2);function w(O,L){const e=o("ExternalLinkIcon"),i=o("CodeGroupItem"),t=o("CodeGroup");return c(),d("div",null,[v,u,s("p",null,[n("DockerHub 上已有构建完毕的开发镜像 "),s("a",b,[k,a(e)]),n(" 可供直接使用(支持 "),g,n(" 和 "),h,n(" 两种架构)。")]),s("p",null,[n("另外,我们也提供了构建上述开发镜像的 Dockerfile,从 "),s("a",_,[n("Ubuntu 官方镜像"),a(e)]),n(),f,n(" 开始构建出一个安装完所有开发和运行时依赖的镜像,您可以根据自己的需要在 Dockerfile 中添加更多依赖。以下是手动构建镜像的 Dockerfile 及方法:")]),S,s("p",null,[n("PolarDB for PostgreSQL 的代码托管于 "),s("a",E,[n("GitHub"),a(e)]),n(" 上,稳定分支为 "),N,n("。如果因网络原因不能稳定访问 GitHub,则可以访问 "),s("a",P,[n("Gitee 国内镜像"),a(e)]),n("。")]),a(t,null,{default:l(()=>[a(i,{title:"GitHub"},{default:l(()=>[y]),_:1}),a(i,{title:"Gitee 国内镜像"},{default:l(()=>[R]),_:1})]),_:1}),D])}const x=r(m,[["render",w],["__file","customize-dev-env.html.vue"]]);export{x as default}; diff --git a/assets/customize-dev-env.html-aa6a8576.js b/assets/customize-dev-env.html-aa6a8576.js new file mode 100644 index 00000000000..1f67f8cdcf0 --- /dev/null +++ b/assets/customize-dev-env.html-aa6a8576.js @@ -0,0 +1,154 @@ +import{_ as r,r as o,o as c,c as d,a as s,b as n,d as a,w as l,e as p}from"./app-3d1677bf.js";const m={},v=s("h1",{id:"定制开发环境",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#定制开发环境","aria-hidden":"true"},"#"),n(" 定制开发环境")],-1),u=s("h2",{id:"自行构建开发镜像",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#自行构建开发镜像","aria-hidden":"true"},"#"),n(" 自行构建开发镜像")],-1),b={href:"https://hub.docker.com/r/polardb/polardb_pg_devel/tags",target:"_blank",rel:"noopener noreferrer"},k=s("code",null,"polardb/polardb_pg_devel",-1),g=s("code",null,"linux/amd64",-1),h=s("code",null,"linux/arm64",-1),_={href:"https://hub.docker.com/_/ubuntu/tags",target:"_blank",rel:"noopener noreferrer"},f=s("code",null,"ubuntu:20.04",-1),S=p(`
FROM ubuntu:20.04
+LABEL maintainer="mrdrivingduck@gmail.com"
+CMD bash
+
+# Timezone problem
+ENV TZ=Asia/Shanghai
+RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
+
+# Upgrade installed packages
+RUN apt update -y && \\
+    apt upgrade -y && \\
+    apt clean -y
+
+# GCC (force to 9) and LLVM (force to 11)
+RUN apt install -y \\
+        gcc-9 \\
+        g++-9 \\
+        llvm-11-dev \\
+        clang-11 \\
+        make \\
+        gdb \\
+        pkg-config \\
+        locales && \\
+    update-alternatives --install \\
+        /usr/bin/gcc gcc /usr/bin/gcc-9 60 --slave \\
+        /usr/bin/g++ g++ /usr/bin/g++-9 && \\
+    update-alternatives --install \\
+        /usr/bin/llvm-config llvm-config /usr/bin/llvm-config-11 60 --slave \\
+        /usr/bin/clang++ clang++ /usr/bin/clang++-11 --slave \\
+        /usr/bin/clang clang /usr/bin/clang-11 && \\
+    apt clean -y
+
+# Generate locale
+RUN sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && \\
+    sed -i '/zh_CN.UTF-8/s/^# //g' /etc/locale.gen && \\
+    locale-gen
+
+# Dependencies
+RUN apt install -y \\
+        libicu-dev \\
+        bison \\
+        flex \\
+        python3-dev \\
+        libreadline-dev \\
+        libgss-dev \\
+        libssl-dev \\
+        libpam0g-dev \\
+        libxml2-dev \\
+        libxslt1-dev \\
+        libldap2-dev \\
+        uuid-dev \\
+        liblz4-dev \\
+        libkrb5-dev \\
+        gettext \\
+        libxerces-c-dev \\
+        tcl-dev \\
+        libperl-dev \\
+        libipc-run-perl \\
+        libaio-dev \\
+        libfuse-dev && \\
+    apt clean -y
+
+# Tools
+RUN apt install -y \\
+        iproute2 \\
+        wget \\
+        ccache \\
+        sudo \\
+        vim \\
+        git \\
+        cmake && \\
+    apt clean -y
+
+# leave empty if GitHub is reachable without a proxy
+# ENV GITHUB_PROXY=https://ghproxy.com/
+ENV GITHUB_PROXY=
+
+ENV ZLOG_VERSION=1.2.14
+ENV PFSD_VERSION=pfsd4pg-release-1.2.42-20220419
+
+# install dependencies from GitHub mirror
+RUN cd /usr/local && \\
+    # zlog for PFSD
+    wget --no-verbose --no-check-certificate "\${GITHUB_PROXY}https://github.com/HardySimpson/zlog/archive/refs/tags/\${ZLOG_VERSION}.tar.gz" && \\
+    # PFSD
+    wget --no-verbose --no-check-certificate "\${GITHUB_PROXY}https://github.com/ApsaraDB/PolarDB-FileSystem/archive/refs/tags/\${PFSD_VERSION}.tar.gz" && \\
+    # unzip and install zlog
+    gzip -d $ZLOG_VERSION.tar.gz && \\
+    tar xpf $ZLOG_VERSION.tar && \\
+    cd zlog-$ZLOG_VERSION && \\
+    make && make install && \\
+    echo '/usr/local/lib' >> /etc/ld.so.conf && ldconfig && \\
+    cd .. && \\
+    rm -rf $ZLOG_VERSION* && \\
+    rm -rf zlog-$ZLOG_VERSION && \\
+    # unzip and install PFSD
+    gzip -d $PFSD_VERSION.tar.gz && \\
+    tar xpf $PFSD_VERSION.tar && \\
+    cd PolarDB-FileSystem-$PFSD_VERSION && \\
+    sed -i 's/-march=native //' CMakeLists.txt && \\
+    ./autobuild.sh && ./install.sh && \\
+    cd .. && \\
+    rm -rf $PFSD_VERSION* && \\
+    rm -rf PolarDB-FileSystem-$PFSD_VERSION*
+
+# create default user
+ENV USER_NAME=postgres
+RUN echo "create default user" && \\
+    groupadd -r $USER_NAME && \\
+    useradd -ms /bin/bash -g $USER_NAME $USER_NAME -p '' && \\
+    usermod -aG sudo $USER_NAME
+
+# modify conf
+RUN echo "modify conf" && \\
+    mkdir -p /var/log/pfs && chown $USER_NAME /var/log/pfs && \\
+    mkdir -p /var/run/pfs && chown $USER_NAME /var/run/pfs && \\
+    mkdir -p /var/run/pfsd && chown $USER_NAME /var/run/pfsd && \\
+    mkdir -p /dev/shm/pfsd && chown $USER_NAME /dev/shm/pfsd && \\
+    touch /var/run/pfsd/.pfsd && \\
+    echo "ulimit -c unlimited" >> /home/postgres/.bashrc && \\
+    echo "export PGHOST=127.0.0.1" >> /home/postgres/.bashrc && \\
+    echo "alias pg='psql -h /home/postgres/tmp_master_dir_polardb_pg_1100_bld/'" >> /home/postgres/.bashrc
+
+ENV PATH="/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin:$PATH"
+WORKDIR /home/$USER_NAME
+USER $USER_NAME
+

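Copy the content above into a file (assumed here to be named Dockerfile-PolarDB), then build the image with the following command:

TIP

💡 In the highlighted line below, replace <image_name> with the Docker image name you want to use.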

docker build --network=host \\
+    -t <image_name> \\
+    -f Dockerfile-PolarDB .
+
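The Dockerfile above prefixes every GitHub download URL with GITHUB_PROXY. As a minimal sketch of how that prefix changes the URL that wget receives, the snippet below rebuilds the zlog URL for an empty prefix and for the proxy shown in the commented-out ENV line. The helper name zlog_url is hypothetical and exists only for this illustration.

```shell
#!/bin/sh
# Same version as ENV ZLOG_VERSION in the Dockerfile.
ZLOG_VERSION=1.2.14

# zlog_url PROXY: print the zlog tarball URL with the given prefix.
# An empty PROXY yields the plain GitHub URL.
# (zlog_url is a hypothetical helper, not part of the Dockerfile.)
zlog_url() {
    printf '%s%s\n' "$1" "https://github.com/HardySimpson/zlog/archive/refs/tags/$ZLOG_VERSION.tar.gz"
}

zlog_url ""                      # direct download from GitHub
zlog_url "https://ghproxy.com/"  # download routed through the proxy
```

To route downloads through a proxy, swap the two ENV GITHUB_PROXY lines in the Dockerfile before running docker build.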

 

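Setting up the development environment from a clean system

This approach assumes that you start from scratch on a clean CentOS 7 system where you have root privileges. It can be either of the following:

  • A physical or virtual machine with CentOS 7 installed
  • A Docker container started from the official CentOS 7 Docker image centos:centos7

Creating a non-root user

PolarDB for PostgreSQL must run as a non-root user. The following steps help you create a user group named postgres and a user named postgres.

TIP

If you already have a non-root user whose name is not postgres:postgres, you can skip this step. In that case, remember to replace the user-related information in the commands of the later example steps with your own group name and user name.

The commands below create the group postgres and the user postgres, and grant that user sudo privileges and ownership of its working directory. Run them as root.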

# install sudo
+yum install -y sudo
+# create user and group
+groupadd -r postgres
+useradd -m -g postgres postgres -p ''
+usermod -aG wheel postgres
+# make postgres a sudoer
+chmod u+w /etc/sudoers
+echo 'postgres ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers
+chmod u-w /etc/sudoers
+# grant access to home directory
+chown -R postgres:postgres /home/postgres/
+echo 'source /etc/bashrc' >> /home/postgres/.bashrc
+# for su postgres
+sed -i 's/4096/unlimited/g' /etc/security/limits.d/20-nproc.conf
+
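The last command above rewrites the per-user process limit. The following sketch applies the same substitution to a scratch file so that its effect can be inspected without touching /etc. The sample file content is an assumption modeled on the CentOS 7 default 20-nproc.conf; check the real file on your system before relying on it.

```shell
#!/bin/sh
# Scratch copy; the two lines below are assumed sample content,
# modeled on the CentOS 7 default /etc/security/limits.d/20-nproc.conf.
conf=$(mktemp)
cat > "$conf" <<'EOF'
*          soft    nproc     4096
root       soft    nproc     unlimited
EOF

# Same substitution as in the setup commands, but without -i,
# so the scratch file is only read and the result is captured.
result=$(sed 's/4096/unlimited/g' "$conf")
printf '%s\n' "$result"
rm -f "$conf"
```

With this sample input, the wildcard entry ends up with nproc set to unlimited, which presumably keeps sessions opened through su postgres from being capped at 4096 processes.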

Next, switch to the postgres user to continue with the remaining steps:

su postgres
+source /etc/bashrc
+cd ~
+

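Installing dependencies

The root directory of the PolarDB for PostgreSQL source repository contains a script, install_dependencies.sh, that installs all dependencies PolarDB and PFS need in order to run. You therefore need to clone the PolarDB source repository first.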

`,16),E={href:"https://github.com/ApsaraDB/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},N=s("code",null,"POLARDB_11_STABLE",-1),y={href:"https://gitee.com/mirrors/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},R=s("div",{class:"language-bash","data-ext":"sh"},[s("pre",{class:"language-bash"},[s("code",null,[s("span",{class:"token function"},"sudo"),n(" yum "),s("span",{class:"token function"},"install"),n(),s("span",{class:"token parameter variable"},"-y"),n(),s("span",{class:"token function"},"git"),n(` +`),s("span",{class:"token function"},"git"),n(" clone "),s("span",{class:"token parameter variable"},"-b"),n(` POLARDB_11_STABLE https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git +`)])])],-1),D=s("div",{class:"language-bash","data-ext":"sh"},[s("pre",{class:"language-bash"},[s("code",null,[s("span",{class:"token function"},"sudo"),n(" yum "),s("span",{class:"token function"},"install"),n(),s("span",{class:"token parameter variable"},"-y"),n(),s("span",{class:"token function"},"git"),n(` +`),s("span",{class:"token function"},"git"),n(" clone "),s("span",{class:"token parameter variable"},"-b"),n(` POLARDB_11_STABLE https://gitee.com/mirrors/PolarDB-for-PostgreSQL +`)])])],-1),P=p(`

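Once the source code has been downloaded, run the dependency installation script install_dependencies.sh in the source root directory with sudo to install all dependencies automatically. If you have custom development requirements, modify install_dependencies.sh yourself.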

cd PolarDB-for-PostgreSQL
+sudo ./install_dependencies.sh
+
`,2);function w(O,L){const e=o("ExternalLinkIcon"),i=o("CodeGroupItem"),t=o("CodeGroup");return c(),d("div",null,[v,u,s("p",null,[n("DockerHub 上已有构建完毕的开发镜像 "),s("a",b,[k,a(e)]),n(" 可供直接使用(支持 "),g,n(" 和 "),h,n(" 两种架构)。")]),s("p",null,[n("另外,我们也提供了构建上述开发镜像的 Dockerfile,从 "),s("a",_,[n("Ubuntu 官方镜像"),a(e)]),n(),f,n(" 开始构建出一个安装完所有开发和运行时依赖的镜像,您可以根据自己的需要在 Dockerfile 中添加更多依赖。以下是手动构建镜像的 Dockerfile 及方法:")]),S,s("p",null,[n("PolarDB for PostgreSQL 的代码托管于 "),s("a",E,[n("GitHub"),a(e)]),n(" 上,稳定分支为 "),N,n("。如果因网络原因不能稳定访问 GitHub,则可以访问 "),s("a",y,[n("Gitee 国内镜像"),a(e)]),n("。")]),a(t,null,{default:l(()=>[a(i,{title:"GitHub"},{default:l(()=>[R]),_:1}),a(i,{title:"Gitee 国内镜像"},{default:l(()=>[D]),_:1})]),_:1}),P])}const x=r(m,[["render",w],["__file","customize-dev-env.html.vue"]]);export{x as default}; diff --git a/assets/customize-dev-env.html-f893e063.js b/assets/customize-dev-env.html-f893e063.js new file mode 100644 index 00000000000..9c5d266326a --- /dev/null +++ b/assets/customize-dev-env.html-f893e063.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-d69972ec","path":"/zh/development/customize-dev-env.html","title":"定制开发环境","lang":"zh-CN","frontmatter":{},"headers":[{"level":2,"title":"自行构建开发镜像","slug":"自行构建开发镜像","link":"#自行构建开发镜像","children":[]},{"level":2,"title":"从干净的系统开始搭建开发环境","slug":"从干净的系统开始搭建开发环境","link":"#从干净的系统开始搭建开发环境","children":[{"level":3,"title":"建立非 root 用户","slug":"建立非-root-用户","link":"#建立非-root-用户","children":[]},{"level":3,"title":"依赖安装","slug":"依赖安装","link":"#依赖安装","children":[]}]}],"git":{"updatedTime":1690894847000},"filePathRelative":"zh/development/customize-dev-env.md"}');export{e as data}; diff --git a/assets/datamax.html-2f4105a2.js b/assets/datamax.html-2f4105a2.js new file mode 100644 index 00000000000..a382ea9ac33 --- /dev/null +++ b/assets/datamax.html-2f4105a2.js @@ -0,0 +1,42 @@ +import{_ as c,r as o,o as i,c as d,d as s,a,w as e,b as n,e as m}from"./app-3d1677bf.js";const 
u="/PolarDB-for-PostgreSQL/assets/datamax_availability_architecture-53d60a58.png",_="/PolarDB-for-PostgreSQL/assets/datamax_realization_1-1bc15abd.png",k="/PolarDB-for-PostgreSQL/assets/datamax_realization_2-803e32f3.png",x="/PolarDB-for-PostgreSQL/assets/datamax_availability_1-44d46f3b.png",y="/PolarDB-for-PostgreSQL/assets/datamax_availability_2-2972e8a3.png",h={},g=a("h1",{id:"datamax-日志节点",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#datamax-日志节点","aria-hidden":"true"},"#"),n(" DataMax 日志节点")],-1),b={class:"table-of-contents"},D=m('

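Terminology

  • RPO (Recovery Point Objective): the amount of data loss that a business system can tolerate.
  • AZ (Availability Zone): an area within a region whose power supply and network are independent of other areas, so that faults can be isolated between availability zones.

Background

In high-availability scenarios, the primary and standby databases must be configured for synchronous replication to guarantee RPO = 0. However, when the primary and standby are far apart, synchronous replication incurs high latency, which significantly degrades primary performance. Asynchronous replication has little impact on primary performance but can lose some data. PolarDB for PostgreSQL uses a shared-storage architecture with one writer and multiple readers, and can provide high availability within an AZ, across AZs, and across regions. To reduce the impact of log synchronization on the primary, PolarDB for PostgreSQL introduces the DataMax node. For cross-AZ or even cross-region synchronization, a DataMax node acts as a relay for the primary's logs: it achieves zero data loss at relatively low cost while reducing the impact of log synchronization on primary performance.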

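Principles

DataMax high-availability architecture

PolarDB for PostgreSQL synchronizes data between the primary and standby databases through physical streaming replication, which runs in either synchronous mode or asynchronous mode:

  • Asynchronous mode: a transaction on the primary only waits for its WAL records to be written to the local disk file before continuing with the rest of the commit. The state of the standby does not affect primary performance. However, asynchronous mode cannot guarantee RPO = 0: the standby lags behind the primary, so if the cluster hosting the primary fails, switching over to the standby may lose data;
  • Synchronous mode: synchronization between the primary and standby comes in several levels. Once standby synchronization is enabled through the synchronous_standby_names parameter, the synchronous_commit parameter selects the level:
    • remote_write: a commit on the primary waits until the WAL records have been written to the primary's disk file and to the standby's operating system cache;
    • on: a commit on the primary waits until the WAL records have been written to the disk files of both the primary and the standby;
    • remote_apply: a commit on the primary waits until the WAL records have been written to the disk files of both the primary and the standby, and the standby has replayed them so that the transaction is visible to queries on the standby.

Synchronous mode makes a commit on the primary wait until the standby has received the corresponding WAL data, which gives zero data loss between primary and standby and guarantees RPO = 0. In this mode, however, whether a commit can proceed depends on the standby acknowledging receipt of the WAL. When the primary and standby are far apart and transmission latency is high, synchronous mode hurts primary performance. In the extreme case where the standby crashes, the primary blocks indefinitely waiting for it and can no longer serve requests.

To address the heavy performance cost of synchronous replication in the traditional primary/standby setup, PolarDB for PostgreSQL adds the DataMax node for remote synchronization. The high-availability architecture in this mode is shown below: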

dma-arch

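In this architecture:

  1. Each database cluster is deployed within one availability zone. Different clusters act as disaster-recovery backups for one another, and the primary/standby relationship between them provides cross-AZ / cross-region high availability;
  2. Within a single database cluster, the architecture is one writer and multiple readers. The Primary node and the Replica nodes share the same storage, which lowers storage cost; the Replica nodes also provide high availability of compute nodes within a single AZ;
  3. The DataMax node is deployed in the same availability zone as the cluster's Primary node:
    • The DataMax node only receives and stores the Primary node's WAL files. It neither replays the logs nor stores the Primary node's data files, which lowers storage cost;
    • The DataMax node and the Primary node do not share data, and their storage devices are isolated from each other. This prevents a storage failure in the compute cluster from destroying the logs held by the Primary node and the DataMax node at the same time;
    • The DataMax node and the Primary node use synchronous replication to ensure RPO = 0. The DataMax node is deployed close to the Primary node, usually in the same availability zone, to minimize the performance impact of log synchronization on the Primary node;
    • The DataMax node forwards the WAL it receives to Standby nodes in other availability zones. A Standby node receives and replays the logs from the DataMax node and thereby stays in sync with the Primary node. Replication between the Standby node and the DataMax node can be asynchronous. The DataMax node thus offloads from the Primary node the cost of shipping WAL to multiple standby databases.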

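DataMax implementation

DataMax is a new node role. A configuration file setting marks a node as a DataMax node. In DataMax mode, once the Startup process has replayed the DataMax node's own logs, it moves from PM_HOT_STANDBY to PM_DATAMAX mode. In PM_DATAMAX mode, the Startup process only handles the relevant signals and state and notifies the Postmaster process to start streaming replication; it no longer replays logs. A DataMax node therefore does not store the Primary node's data files, which lowers storage cost.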

datamax-impl

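As shown in the figure above, the DataMax node uses its WalReceiver process to issue a streaming replication request to the Primary node and to receive and store the WAL the Primary node sends. At the same time, its WalSender process forwards the received primary WAL to the remote standby nodes. When a standby node receives the WAL, it notifies its Startup process to replay it, which keeps the standby node in sync with the Primary node.

The DataMax node adds a polar_datamax/ directory to its data directory to store the WAL received from the primary. The DataMax node's own WAL stays in the original directory, so the two sets of WAL never overwrite each other, and the DataMax node can also hold data of its own.

Because the DataMax node does not replay the Primary node's log data, it needs a way to determine the starting log position when it restarts after a failure. The DataMax node stores the relevant position information in the polar_datamax_meta metadata file and uses it to determine where to start:

  • Initial deployment: on a fresh deployment, or when the DataMax node is rebuilt, no position information exists yet. When requesting streaming replication from the primary, the node must identify itself as a DataMax node and additionally pass the InvalidXLogRecPtr position, which indicates that it wants to replicate from the oldest position currently available on the Primary node. On receiving a streaming replication request with InvalidXLogRecPtr, the Primary node starts sending WAL from its oldest complete WAL segment file and sets restart_lsn of the corresponding replication slot to that position;
  • Failure recovery: the node reads the metadata file from storage to determine the position and requests streaming replication starting from it.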

datamax-impl-dir

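DataMax cluster high availability

As shown in the figure below, with a DataMax node in place, if the Primary node and the Replica nodes fail at the same time, or the storage becomes unavailable, a Standby node in a different availability zone can be promoted to Primary to keep the service available. Before the Standby node is promoted and starts serving requests, the system checks whether it has pulled all logs from the DataMax node, and promotes it only after it has. Because the DataMax node replicates synchronously from the Primary node, RPO = 0 is guaranteed in this scenario.

In addition, when the DataMax node cleans up logs, it retains not only the WAL files that downstream Standby nodes have not yet received but also the WAL files that the upstream Primary node has not yet deleted. This prevents the backup system from being unable to obtain logs that the Primary node has beyond those on the DataMax node after the Primary node fails, and so preserves the integrity of the cluster's data.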

datamax-ha

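If the DataMax node fails, recovery is first attempted by restarting it; if the restart fails, the node is rebuilt. Because the storage of the DataMax node and the Primary node is isolated, their data does not affect each other. The DataMax node can also use a compute-storage separated architecture, which ensures that a DataMax node failure does not lose the WAL data it has stored.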

datamax-restart

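The DataMax node implements the following log synchronization modes, which you can configure according to your business requirements:

  • Maximum protection mode: the DataMax node replicates synchronously from the Primary node, ensuring RPO = 0. If the DataMax node becomes unavailable because of a network or hardware failure, the Primary node blocks as well and cannot serve requests;
  • Maximum performance mode: the DataMax node replicates asynchronously from the Primary node. The DataMax node has no impact on Primary performance, and a DataMax failure does not affect the Primary node's service. If the Primary node's storage or its cluster fails, data may be lost, so RPO = 0 cannot be guaranteed;
  • Maximum availability mode:
    • While the DataMax node is working normally, it replicates synchronously from the Primary node, which is equivalent to maximum protection mode;
    • If the DataMax node fails, the Primary node automatically downgrades to maximum performance mode so that it keeps serving requests;
    • Once the DataMax node recovers, the Primary node upgrades from maximum performance mode back to maximum protection mode, removing the risk of losing WAL data.

In summary, the DataMax log relay node lowers log synchronization latency and offloads log shipping from the Primary node, providing cross-AZ / cross-region high availability with RPO = 0 while keeping performance stable.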

Usage guide

Initializing the DataMax node directory

Initializing a DataMax node requires the system identifier of the Primary node:

# Get the system identifier of the Primary node
+~/tmp_basedir_polardb_pg_1100_bld/bin/pg_controldata -D ~/primary | grep 'system identifier'
+
+# Create the DataMax node
+# The [primary_system_identifier] passed to -i is the Primary node's system identifier obtained in the previous step
+~/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D datamax -i [primary_system_identifier]
+
+# If needed, initialize shared storage for the DataMax node in the same way as for the Primary node
+sudo pfs -C disk mkdir /nvme0n1/dm_shared_data
+sudo ~/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh ~/datamax/ /nvme0n1/dm_shared_data/
+
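The grep above prints the whole pg_controldata line, so the number still has to be copied by hand into the initdb call. As a minimal sketch, the helper below (extract_sysid, a hypothetical name) keeps only the numeric part; the sample identifier is made up for illustration.

```shell
#!/bin/sh
# extract_sysid: read pg_controldata output on stdin and print only the
# numeric system identifier. (Hypothetical helper for this sketch.)
extract_sysid() {
    awk -F': *' '/system identifier/ { print $2 }'
}

# Sample line in the format printed by pg_controldata; the number is made up.
sample='Database system identifier:           7012345678901234567'
sysid=$(printf '%s\n' "$sample" | extract_sysid)
echo "$sysid"

# On a real node, with the paths used in the commands above, this becomes:
#   sysid=$(~/tmp_basedir_polardb_pg_1100_bld/bin/pg_controldata -D ~/primary | extract_sysid)
#   ~/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D datamax -i "$sysid"
```

Capturing the identifier in a variable avoids copy-and-paste mistakes when the Primary and DataMax nodes are initialized from the same script.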

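Loading the operations extension

Start the node as a writable node first, and create a user and the extension to simplify later operations. A DataMax node is read-only by default and cannot create users or extensions, so this has to happen before the node is configured as a DataMax node.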

~/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D ~/datamax
+

Create the administrative account and the extension:

postgres=# create user test superuser;
+CREATE ROLE
+postgres=# create extension polar_monitor;
+CREATE EXTENSION
+

Stop the DataMax node:

~/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl stop -D ~/datamax
+

Configuring and starting the DataMax node

Add the polar_datamax_mode parameter to recovery.conf on the DataMax node to mark the node as a DataMax node:

polar_datamax_mode = standalone
+recovery_target_timeline='latest'
+primary_slot_name='datamax'
+primary_conninfo='host=[Primary node IP] port=[Primary node port] user=[$USER] dbname=postgres application_name=datamax'
+

Start the DataMax node:

~/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D ~/datamax
+

Checking the DataMax node

On the DataMax node itself, the polar_get_datamax_info() function shows whether the node is running normally:

postgres=# SELECT * FROM polar_get_datamax_info();
+ min_received_timeline | min_received_lsn | last_received_timeline | last_received_lsn | last_valid_received_lsn | clean_reserved_lsn | force_clean
+-----------------------+------------------+------------------------+-------------------+-------------------------+--------------------+-------------
+                     1 | 0/40000000       |                      1 | 0/4079DFE0        | 0/4079DFE0              | 0/0                | f
+(1 row)
+

On the Primary node, the state of the corresponding replication slot can be checked through pg_replication_slots:

postgres=# SELECT * FROM pg_replication_slots;
+ slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
+-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
+ datamax   |        | physical  |        |          | f         | t      |     124551 |  570 |              | 0/4079DFE0  |
+(1 row)
+
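The LSN columns in both outputs above are written as two hexadecimal halves separated by a slash (high 32 bits / low 32 bits). As a rough sketch, the helpers below turn such values into byte offsets so that two positions can be compared; lsn_to_bytes and lsn_diff are hypothetical names introduced only for this illustration.

```shell
#!/bin/sh
# lsn_to_bytes LSN: convert an LSN such as 0/4079DFE0 into a byte offset.
# The part before the slash is the high 32 bits, the part after it the low 32 bits.
lsn_to_bytes() (
    hi=${1%/*}
    lo=${1#*/}
    echo $(( 0x$hi * 4294967296 + 0x$lo ))
)

# lsn_diff A B: number of bytes between two LSNs (A minus B).
lsn_diff() (
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
)

# last_received_lsn and min_received_lsn from the sample
# polar_get_datamax_info() output above.
lsn_diff 0/4079DFE0 0/40000000
```

For the sample values this prints 7987168, the number of WAL bytes between min_received_lsn and last_received_lsn. Inside the database, the built-in PostgreSQL function pg_wal_lsn_diff() performs the same subtraction.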

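Configuring the log synchronization mode

The log synchronization mode of the downstream DataMax node is set in postgresql.conf on the Primary node:

Maximum protection mode. Here datamax is the name of the replication slot created on the Primary node; it is also the application_name set in the DataMax node's primary_conninfo, which is what standard PostgreSQL matches synchronous_standby_names against: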

polar_enable_transaction_sync_mode = on
+synchronous_commit = on
+synchronous_standby_names = 'datamax'
+

Maximum performance mode:

polar_enable_transaction_sync_mode = on
+synchronous_commit = on
+

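Maximum availability mode:

  • polar_sync_replication_timeout sets the synchronization timeout threshold, in milliseconds. When the wait on the synchronous replication lock exceeds this threshold, synchronous replication is downgraded to asynchronous replication;
  • polar_sync_rep_timeout_break_lsn_lag sets the lag threshold for restoring synchronization, in bytes. When the asynchronous replication lag drops below this threshold, replication is restored to synchronous.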
polar_enable_transaction_sync_mode = on
+synchronous_commit = on
+synchronous_standby_names = 'datamax'
+polar_sync_replication_timeout = 10s
+polar_sync_rep_timeout_break_lsn_lag = 8kB
+
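The three configurations above differ only in which lines are present. The sketch below makes that explicit by printing the Primary-side settings for a chosen mode. The function name datamax_sync_conf and the mode labels protection, performance and availability are made up for this illustration; the settings themselves are the ones listed above.

```shell
#!/bin/sh
# datamax_sync_conf MODE: print the Primary-side postgresql.conf lines for
# one of the three modes. Function name and mode labels are hypothetical.
datamax_sync_conf() {
    case "$1" in
        protection|performance|availability) ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    # Common to all three modes.
    echo "polar_enable_transaction_sync_mode = on"
    echo "synchronous_commit = on"
    # Only the synchronous modes name the DataMax node as a sync standby.
    if [ "$1" != performance ]; then
        echo "synchronous_standby_names = 'datamax'"
    fi
    # Maximum availability adds the downgrade and recovery thresholds.
    if [ "$1" = availability ]; then
        echo "polar_sync_replication_timeout = 10s"
        echo "polar_sync_rep_timeout_break_lsn_lag = 8kB"
    fi
}

datamax_sync_conf availability
```

Seen this way, maximum performance mode is the base configuration, maximum protection adds the synchronous standby name, and maximum availability adds the two thresholds on top of that.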
`,60);function v(r,P){const p=o("Badge"),l=o("ArticleInfo"),t=o("router-link");return i(),d("div",null,[g,s(p,{type:"tip",text:"V11 / v1.1.6-",vertical:"top"}),s(l,{frontmatter:r.$frontmatter},null,8,["frontmatter"]),a("nav",b,[a("ul",null,[a("li",null,[s(t,{to:"#术语"},{default:e(()=>[n("术语")]),_:1})]),a("li",null,[s(t,{to:"#背景"},{default:e(()=>[n("背景")]),_:1})]),a("li",null,[s(t,{to:"#原理"},{default:e(()=>[n("原理")]),_:1}),a("ul",null,[a("li",null,[s(t,{to:"#datamax-高可用架构"},{default:e(()=>[n("DataMax 高可用架构")]),_:1})]),a("li",null,[s(t,{to:"#datamax-实现"},{default:e(()=>[n("DataMax 实现")]),_:1})]),a("li",null,[s(t,{to:"#datamax-集群高可用"},{default:e(()=>[n("DataMax 集群高可用")]),_:1})])])]),a("li",null,[s(t,{to:"#使用指南"},{default:e(()=>[n("使用指南")]),_:1}),a("ul",null,[a("li",null,[s(t,{to:"#datamax-节点目录初始化"},{default:e(()=>[n("DataMax 节点目录初始化")]),_:1})]),a("li",null,[s(t,{to:"#加载运维插件"},{default:e(()=>[n("加载运维插件")]),_:1})]),a("li",null,[s(t,{to:"#datamax-节点配置及启动"},{default:e(()=>[n("DataMax 节点配置及启动")]),_:1})]),a("li",null,[s(t,{to:"#datamax-节点检查"},{default:e(()=>[n("DataMax 节点检查")]),_:1})]),a("li",null,[s(t,{to:"#日志同步模式配置"},{default:e(()=>[n("日志同步模式配置")]),_:1})])])])])]),D])}const f=c(h,[["render",v],["__file","datamax.html.vue"]]);export{f as default}; diff --git a/assets/datamax.html-5138183e.js b/assets/datamax.html-5138183e.js new file mode 100644 index 00000000000..49deef66c29 --- /dev/null +++ b/assets/datamax.html-5138183e.js @@ -0,0 +1 @@ +const a=JSON.parse('{"key":"v-4e16f0f0","path":"/zh/features/v11/availability/datamax.html","title":"DataMax 日志节点","lang":"zh-CN","frontmatter":{"author":"玊于","date":"2022/11/17","minute":30},"headers":[{"level":2,"title":"术语","slug":"术语","link":"#术语","children":[]},{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"原理","slug":"原理","link":"#原理","children":[{"level":3,"title":"DataMax 高可用架构","slug":"datamax-高可用架构","link":"#datamax-高可用架构","children":[]},{"level":3,"title":"DataMax 
实现","slug":"datamax-实现","link":"#datamax-实现","children":[]},{"level":3,"title":"DataMax 集群高可用","slug":"datamax-集群高可用","link":"#datamax-集群高可用","children":[]}]},{"level":2,"title":"使用指南","slug":"使用指南","link":"#使用指南","children":[{"level":3,"title":"DataMax 节点目录初始化","slug":"datamax-节点目录初始化","link":"#datamax-节点目录初始化","children":[]},{"level":3,"title":"加载运维插件","slug":"加载运维插件","link":"#加载运维插件","children":[]},{"level":3,"title":"DataMax 节点配置及启动","slug":"datamax-节点配置及启动","link":"#datamax-节点配置及启动","children":[]},{"level":3,"title":"DataMax 节点检查","slug":"datamax-节点检查","link":"#datamax-节点检查","children":[]},{"level":3,"title":"日志同步模式配置","slug":"日志同步模式配置","link":"#日志同步模式配置","children":[]}]}],"git":{"updatedTime":1672148725000},"filePathRelative":"zh/features/v11/availability/datamax.md"}');export{a as data}; diff --git a/assets/datamax_availability_1-44d46f3b.png b/assets/datamax_availability_1-44d46f3b.png new file mode 100644 index 00000000000..28d0a500846 Binary files /dev/null and b/assets/datamax_availability_1-44d46f3b.png differ diff --git a/assets/datamax_availability_2-2972e8a3.png b/assets/datamax_availability_2-2972e8a3.png new file mode 100644 index 00000000000..b6578197355 Binary files /dev/null and b/assets/datamax_availability_2-2972e8a3.png differ diff --git a/assets/datamax_availability_architecture-53d60a58.png b/assets/datamax_availability_architecture-53d60a58.png new file mode 100644 index 00000000000..5a87a7ae72d Binary files /dev/null and b/assets/datamax_availability_architecture-53d60a58.png differ diff --git a/assets/datamax_realization_1-1bc15abd.png b/assets/datamax_realization_1-1bc15abd.png new file mode 100644 index 00000000000..9b1f7dfd15b Binary files /dev/null and b/assets/datamax_realization_1-1bc15abd.png differ diff --git a/assets/datamax_realization_2-803e32f3.png b/assets/datamax_realization_2-803e32f3.png new file mode 100644 index 00000000000..83ad5f47789 Binary files /dev/null and b/assets/datamax_realization_2-803e32f3.png differ diff 
--git a/assets/db-localfs.html-0d436603.js b/assets/db-localfs.html-0d436603.js new file mode 100644 index 00000000000..ffaf3f3798c --- /dev/null +++ b/assets/db-localfs.html-0d436603.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-55351ab4","path":"/zh/deploying/db-localfs.html","title":"基于单机文件系统部署","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2023/08/01","minute":15},"headers":[{"level":2,"title":"拉取镜像","slug":"拉取镜像","link":"#拉取镜像","children":[]},{"level":2,"title":"初始化数据库","slug":"初始化数据库","link":"#初始化数据库","children":[]},{"level":2,"title":"启动 PolarDB-PG 服务","slug":"启动-polardb-pg-服务","link":"#启动-polardb-pg-服务","children":[]}],"git":{"updatedTime":1690894847000},"filePathRelative":"zh/deploying/db-localfs.md"}');export{l as data}; diff --git a/assets/db-localfs.html-6fce4fb5.js b/assets/db-localfs.html-6fce4fb5.js new file mode 100644 index 00000000000..2a2af903185 --- /dev/null +++ b/assets/db-localfs.html-6fce4fb5.js @@ -0,0 +1,17 @@ +import{_ as c,r as o,o as d,c as i,d as s,a,w as t,b as n,e as u}from"./app-3d1677bf.js";const _={},b=a("h1",{id:"基于单机文件系统部署",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#基于单机文件系统部署","aria-hidden":"true"},"#"),n(" 基于单机文件系统部署")],-1),k=a("p",null,"本文将指导您在单机文件系统(如 ext4)上编译部署 PolarDB-PG,适用于所有计算节点都可以访问相同本地磁盘存储的场景。",-1),h={class:"table-of-contents"},v=a("h2",{id:"拉取镜像",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#拉取镜像","aria-hidden":"true"},"#"),n(" 拉取镜像")],-1),m={href:"https://hub.docker.com/r/polardb/polardb_pg_local_instance/tags",target:"_blank",rel:"noopener noreferrer"},f=a("code",null,"linux/amd64",-1),g=a("code",null,"linux/arm64",-1),P=u(`
docker pull polardb/polardb_pg_local_instance
+

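Initializing the database

Create an empty directory \${your_data_dir} to serve as the data directory of the PolarDB-PG instance. When starting the container, mount this directory into the container as a VOLUME so that the data directory gets initialized. During initialization, you can pass environment variables to override the defaults:

  • POLARDB_PORT: the port number PolarDB-PG runs on, 5432 by default; the image uses three consecutive port numbers (5432-5434 by default)
  • POLARDB_USER: the default superuser created when the database is initialized (postgres by default)
  • POLARDB_PASSWORD: the password of the default superuser

Initialize the database with the following command: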

docker run -it --rm \\
+    --env POLARDB_PORT=5432 \\
+    --env POLARDB_USER=u1 \\
+    --env POLARDB_PASSWORD=your_password \\
+    -v \${your_data_dir}:/var/polardb \\
+    polardb/polardb_pg_local_instance \\
+    echo 'done'
+

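Starting the PolarDB-PG service

After the database has been initialized, create the container in background mode with the -d option to start the PolarDB-PG service. The PolarDB-PG ports usually need to be exposed to the outside, which the -p option does by publishing a range of container ports on the host. For example, if the database was initialized with ports 5432-5434, the following command maps these three ports to host ports 54320-54322: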

docker run -d \\
+    -p 54320-54322:5432-5434 \\
+    -v \${your_data_dir}:/var/polardb \\
+    polardb/polardb_pg_local_instance
+
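Because the image always uses three consecutive ports starting at POLARDB_PORT, the -p argument can be derived from two base ports. The helper below is a small sketch of that arithmetic; port_mapping is a hypothetical name used only here.

```shell
#!/bin/sh
# port_mapping HOST_BASE CONTAINER_BASE: print the -p argument that maps
# three consecutive container ports onto three consecutive host ports.
# (Hypothetical helper for this sketch.)
port_mapping() {
    echo "$1-$(( $1 + 2 )):$2-$(( $2 + 2 ))"
}

port_mapping 54320 5432   # same mapping as in the command above
```

If the database was initialized with a different POLARDB_PORT, pass that value as the second argument, for example docker run -d -p "$(port_mapping 54320 6000)" followed by the same volume and image arguments as above.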

Alternatively, let the container share the host's network directly:

docker run -d \\
+    --network=host \\
+    -v \${your_data_dir}:/var/polardb \\
+    polardb/polardb_pg_local_instance
+
`,11);function B(l,D){const r=o("ArticleInfo"),e=o("router-link"),p=o("ExternalLinkIcon");return d(),i("div",null,[b,s(r,{frontmatter:l.$frontmatter},null,8,["frontmatter"]),k,a("nav",h,[a("ul",null,[a("li",null,[s(e,{to:"#拉取镜像"},{default:t(()=>[n("拉取镜像")]),_:1})]),a("li",null,[s(e,{to:"#初始化数据库"},{default:t(()=>[n("初始化数据库")]),_:1})]),a("li",null,[s(e,{to:"#启动-polardb-pg-服务"},{default:t(()=>[n("启动 PolarDB-PG 服务")]),_:1})])])]),v,a("p",null,[n("我们在 DockerHub 上提供了 PolarDB-PG 的 "),a("a",m,[n("本地实例镜像"),s(p)]),n(",里面已包含启动 PolarDB-PG 本地存储实例的入口脚本。镜像目前支持 "),f,n(" 和 "),g,n(" 两种 CPU 架构。")]),P])}const R=c(_,[["render",B],["__file","db-localfs.html.vue"]]);export{R as default}; diff --git a/assets/db-localfs.html-bdc3f77a.js b/assets/db-localfs.html-bdc3f77a.js new file mode 100644 index 00000000000..c002b4ebab3 --- /dev/null +++ b/assets/db-localfs.html-bdc3f77a.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-64270bfa","path":"/deploying/db-localfs.html","title":"基于单机文件系统部署","lang":"en-US","frontmatter":{"author":"棠羽","date":"2023/08/01","minute":15},"headers":[{"level":2,"title":"拉取镜像","slug":"拉取镜像","link":"#拉取镜像","children":[]},{"level":2,"title":"初始化数据库","slug":"初始化数据库","link":"#初始化数据库","children":[]},{"level":2,"title":"启动 PolarDB-PG 服务","slug":"启动-polardb-pg-服务","link":"#启动-polardb-pg-服务","children":[]}],"git":{"updatedTime":1690894847000},"filePathRelative":"deploying/db-localfs.md"}');export{l as data}; diff --git a/assets/db-localfs.html-d7558701.js b/assets/db-localfs.html-d7558701.js new file mode 100644 index 00000000000..2a2af903185 --- /dev/null +++ b/assets/db-localfs.html-d7558701.js @@ -0,0 +1,17 @@ +import{_ as c,r as o,o as d,c as i,d as s,a,w as t,b as n,e as u}from"./app-3d1677bf.js";const _={},b=a("h1",{id:"基于单机文件系统部署",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#基于单机文件系统部署","aria-hidden":"true"},"#"),n(" 基于单机文件系统部署")],-1),k=a("p",null,"本文将指导您在单机文件系统(如 ext4)上编译部署 
PolarDB-PG,适用于所有计算节点都可以访问相同本地磁盘存储的场景。",-1),h={class:"table-of-contents"},v=a("h2",{id:"拉取镜像",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#拉取镜像","aria-hidden":"true"},"#"),n(" 拉取镜像")],-1),m={href:"https://hub.docker.com/r/polardb/polardb_pg_local_instance/tags",target:"_blank",rel:"noopener noreferrer"},f=a("code",null,"linux/amd64",-1),g=a("code",null,"linux/arm64",-1),P=u(`
docker pull polardb/polardb_pg_local_instance
+

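Initializing the database

Create an empty directory \${your_data_dir} to serve as the data directory of the PolarDB-PG instance. When starting the container, mount this directory into the container as a VOLUME so that the data directory gets initialized. During initialization, you can pass environment variables to override the defaults:

  • POLARDB_PORT: the port number PolarDB-PG runs on, 5432 by default; the image uses three consecutive port numbers (5432-5434 by default)
  • POLARDB_USER: the default superuser created when the database is initialized (postgres by default)
  • POLARDB_PASSWORD: the password of the default superuser

Initialize the database with the following command: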

docker run -it --rm \\
+    --env POLARDB_PORT=5432 \\
+    --env POLARDB_USER=u1 \\
+    --env POLARDB_PASSWORD=your_password \\
+    -v \${your_data_dir}:/var/polardb \\
+    polardb/polardb_pg_local_instance \\
+    echo 'done'
+

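Start the PolarDB-PG Service

After the database has been initialized, create the container in the background with the -d flag to start the PolarDB-PG service. The PolarDB-PG ports usually need to be reachable from outside, so use the -p flag to publish the container's port range on the host. For example, if the database was initialized with ports 5432-5434, the following command maps those three ports to ports 54320-54322 on the host: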

docker run -d \\
+    -p 54320-54322:5432-5434 \\
+    -v \${your_data_dir}:/var/polardb \\
+    polardb/polardb_pg_local_instance
+
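The port mapping above follows a simple rule: three consecutive ports on each side of the colon. As a minimal sketch, the hypothetical helper below (`polardb_port_mapping` is not part of the image) builds the value for -p from a host base port and a container base port, assuming the image always uses exactly three consecutive ports as described earlier.

```shell
# Hypothetical helper, not shipped with the image: build the -p argument
# for three consecutive ports.
# $1 = first host port, $2 = first container port (POLARDB_PORT).
polardb_port_mapping() {
    host_base=$1
    container_base=$2
    printf '%s-%s:%s-%s\n' \
        "$host_base" "$((host_base + 2))" \
        "$container_base" "$((container_base + 2))"
}

# Prints the mapping used in the command above.
polardb_port_mapping 54320 5432
```

The result can be passed straight to docker run, for example `-p "$(polardb_port_mapping 54320 5432)"`.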

Alternatively, let the container share the host's network directly:

docker run -d \\
+    --network=host \\
+    -v \${your_data_dir}:/var/polardb \\
+    polardb/polardb_pg_local_instance
+
`,11);function B(l,D){const r=o("ArticleInfo"),e=o("router-link"),p=o("ExternalLinkIcon");return d(),i("div",null,[b,s(r,{frontmatter:l.$frontmatter},null,8,["frontmatter"]),k,a("nav",h,[a("ul",null,[a("li",null,[s(e,{to:"#拉取镜像"},{default:t(()=>[n("拉取镜像")]),_:1})]),a("li",null,[s(e,{to:"#初始化数据库"},{default:t(()=>[n("初始化数据库")]),_:1})]),a("li",null,[s(e,{to:"#启动-polardb-pg-服务"},{default:t(()=>[n("启动 PolarDB-PG 服务")]),_:1})])])]),v,a("p",null,[n("我们在 DockerHub 上提供了 PolarDB-PG 的 "),a("a",m,[n("本地实例镜像"),s(p)]),n(",里面已包含启动 PolarDB-PG 本地存储实例的入口脚本。镜像目前支持 "),f,n(" 和 "),g,n(" 两种 CPU 架构。")]),P])}const R=c(_,[["render",B],["__file","db-localfs.html.vue"]]);export{R as default}; diff --git a/assets/db-pfs-curve.html-210b20fc.js b/assets/db-pfs-curve.html-210b20fc.js new file mode 100644 index 00000000000..ef0af0fc99d --- /dev/null +++ b/assets/db-pfs-curve.html-210b20fc.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-71a5b926","path":"/zh/deploying/db-pfs-curve.html","title":"基于 PFS for CurveBS 文件系统部署","lang":"zh-CN","frontmatter":{"author":"程义","date":"2022/11/02","minute":15},"headers":[{"level":2,"title":"源码下载","slug":"源码下载","link":"#源码下载","children":[]},{"level":2,"title":"编译部署 PolarDB","slug":"编译部署-polardb","link":"#编译部署-polardb","children":[{"level":3,"title":"读写节点部署","slug":"读写节点部署","link":"#读写节点部署","children":[]},{"level":3,"title":"只读节点部署","slug":"只读节点部署","link":"#只读节点部署","children":[]},{"level":3,"title":"集群检查和测试","slug":"集群检查和测试","link":"#集群检查和测试","children":[]}]}],"git":{"updatedTime":1690894847000},"filePathRelative":"zh/deploying/db-pfs-curve.md"}');export{l as data}; diff --git a/assets/db-pfs-curve.html-2c67fb2a.js b/assets/db-pfs-curve.html-2c67fb2a.js new file mode 100644 index 00000000000..1326d69ef8f --- /dev/null +++ b/assets/db-pfs-curve.html-2c67fb2a.js @@ -0,0 +1,95 @@ +import{_ as d,r as e,o as k,c as v,d as s,a,b as n,w as t,e as l}from"./app-3d1677bf.js";const 
m={},_=a("h1",{id:"基于-pfs-for-curvebs-文件系统部署",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#基于-pfs-for-curvebs-文件系统部署","aria-hidden":"true"},"#"),n(" 基于 PFS for CurveBS 文件系统部署")],-1),b=a("p",null,"本文将指导您在分布式文件系统 PolarDB File System(PFS)上编译部署 PolarDB,适用于已经在 Curve 块存储上格式化并挂载 PFS 的计算节点。",-1),g={href:"https://hub.docker.com/r/polardb/polardb_pg_devel/tags",target:"_blank",rel:"noopener noreferrer"},h=a("h2",{id:"源码下载",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#源码下载","aria-hidden":"true"},"#"),n(" 源码下载")],-1),f={href:"https://github.com/ApsaraDB/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},y=a("code",null,"POLARDB_11_STABLE",-1),E={href:"https://gitee.com/mirrors/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},x=a("div",{class:"language-bash","data-ext":"sh"},[a("pre",{class:"language-bash"},[a("code",null,[a("span",{class:"token function"},"git"),n(" clone "),a("span",{class:"token parameter variable"},"-b"),n(` POLARDB_11_STABLE https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git +`)])])],-1),P=a("div",{class:"language-bash","data-ext":"sh"},[a("pre",{class:"language-bash"},[a("code",null,[a("span",{class:"token function"},"git"),n(" clone "),a("span",{class:"token parameter variable"},"-b"),n(` POLARDB_11_STABLE https://gitee.com/mirrors/PolarDB-for-PostgreSQL +`)])])],-1),B=l(`

After the clone finishes, enter the source directory:

cd PolarDB-for-PostgreSQL/
+

Build and Deploy PolarDB

Deploy the Read-Write Node

`,4),D=a("code",null,"--with-pfsd",-1),H=l(`
./polardb_build.sh --with-pfsd
+

WARNING

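After the build completes, the script above automatically deploys an instance based on the local file system, running on port 5432.

Run the following command manually to stop this instance so that an instance can be redeployed on PFS and shared storage: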

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl \\
+    -D $HOME/tmp_master_dir_polardb_pg_1100_bld/ \\
+    stop
+

Initialize the data directory $HOME/primary/ locally on the node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/primary
+

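Initialize the shared data directory at /pool@@volume_my_/shared_data on shared storage:

# Create the shared data directory with pfs
+sudo pfs -C curve mkdir /pool@@volume_my_/shared_data
+# Initialize the local and shared data directories of the database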
+sudo $HOME/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh \\
+    $HOME/primary/ /pool@@volume_my_/shared_data/ curve
+

Edit the configuration of the read-write node. Open $HOME/primary/postgresql.conf and add the following settings:

port=5432
+polar_hostid=1
+polar_enable_shared_storage_mode=on
+polar_disk_name='pool@@volume_my_'
+polar_datadir='/pool@@volume_my_/shared_data/'
+polar_vfs.localfs_mode=off
+shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
+polar_storage_cluster_name='curve'
+logging_collector=on
+log_line_prefix='%p\\t%r\\t%u\\t%m\\t'
+log_directory='pg_log'
+listen_addresses='*'
+max_connections=1000
+synchronous_standby_names='replica1'
+

Open $HOME/primary/pg_hba.conf and add the following entry:

host	replication	postgres	0.0.0.0/0	trust
+

Finally, start the read-write node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/primary
+

Check that the read-write node is running properly:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c 'select version();'
+# Output:
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

On the read-write node, create a replication slot for the corresponding read-only node; the read-only node uses it for physical streaming replication:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c "select pg_create_physical_replication_slot('replica1');"
+# Output:
+ pg_create_physical_replication_slot
+-------------------------------------
+ (replica1,)
+(1 row)
+

Deploy the Read-Only Node

On the read-only node, build the PolarDB kernel with the --with-pfsd option.

./polardb_build.sh --with-pfsd
+

WARNING

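After the build completes, the script above automatically deploys an instance based on the local file system, running on port 5432.

Run the following command manually to stop this instance so that an instance can be redeployed on PFS and shared storage: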

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl \\
+    -D $HOME/tmp_master_dir_polardb_pg_1100_bld/ \\
+    stop
+

Initialize the data directory $HOME/replica1/ locally on the node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/replica1
+

Edit the configuration of the read-only node. Open $HOME/replica1/postgresql.conf and add the following settings:

port=5433
+polar_hostid=2
+polar_enable_shared_storage_mode=on
+polar_disk_name='pool@@volume_my_'
+polar_datadir='/pool@@volume_my_/shared_data/'
+polar_vfs.localfs_mode=off
+shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
+polar_storage_cluster_name='curve'
+logging_collector=on
+log_line_prefix='%p\\t%r\\t%u\\t%m\\t'
+log_directory='pg_log'
+listen_addresses='*'
+max_connections=1000
+

Create $HOME/replica1/recovery.conf and add the following settings:

WARNING

Replace the placeholder below with the IP address of the read-write node (container).

polar_replica='on'
+recovery_target_timeline='latest'
+primary_slot_name='replica1'
+primary_conninfo='host=[read-write node IP] port=5432 user=postgres dbname=postgres application_name=replica1'
+
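Because only the host address changes between deployments, the recovery.conf body can be generated rather than edited by hand. The sketch below is a hypothetical helper (`render_recovery_conf` is not part of PolarDB) that prints the four settings shown above; it assumes the slot name and application_name are identical, as they are in this guide, and uses a documentation-only IP address as the example.

```shell
# Hypothetical helper: print a recovery.conf body for a read-only node.
# $1 = IP address of the read-write node
# $2 = replication slot name, also used as application_name
render_recovery_conf() {
    primary_host=$1
    replica_name=$2
    cat <<EOF
polar_replica='on'
recovery_target_timeline='latest'
primary_slot_name='$replica_name'
primary_conninfo='host=$primary_host port=5432 user=postgres dbname=postgres application_name=$replica_name'
EOF
}

# Example with a placeholder address; redirect the output to
# $HOME/replica1/recovery.conf when using it for real.
render_recovery_conf 192.0.2.10 replica1
```

Keeping the slot name and application_name in one argument avoids a mismatch with synchronous_standby_names on the read-write node.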

Finally, start the read-only node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/replica1
+

Check that the read-only node is running properly:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5433 \\
+    -d postgres \\
+    -c 'select version();'
+# Output:
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

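Cluster Check and Test

After deployment, check and test the instance to make sure that the read-write node can write data and the read-only node can read it.

Log in to the read-write node, create a test table, and insert sample data: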

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5432 \\
+    -d postgres \\
+    -c "create table t(t1 int primary key, t2 int);insert into t values (1, 1),(2, 3),(3, 3);"
+

Log in to the read-only node and query the sample data just inserted:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5433 \\
+    -d postgres \\
+    -c "select * from t;"
+# Output:
+ t1 | t2
+----+----
+  1 |  1
+  2 |  3
+  3 |  3
+(3 rows)
+

The data inserted on the read-write node is visible on the read-only node.

`,38);function O(r,$){const c=e("ArticleInfo"),p=e("ExternalLinkIcon"),o=e("CodeGroupItem"),i=e("CodeGroup"),u=e("RouterLink");return k(),v("div",null,[_,s(c,{frontmatter:r.$frontmatter},null,8,["frontmatter"]),b,a("p",null,[n("我们在 DockerHub 上提供了一个 "),a("a",g,[n("PolarDB 开发镜像"),s(p)]),n(",里面已经包含编译运行 PolarDB for PostgreSQL 所需要的所有依赖。您可以直接使用这个开发镜像进行实例搭建。镜像目前支持 AMD64 和 ARM64 两种 CPU 架构。")]),h,a("p",null,[n("在前置文档中,我们已经从 DockerHub 上拉取了 PolarDB 开发镜像,并且进入到了容器中。进入容器后,从 "),a("a",f,[n("GitHub"),s(p)]),n(" 上下载 PolarDB for PostgreSQL 的源代码,稳定分支为 "),y,n("。如果因网络原因不能稳定访问 GitHub,则可以访问 "),a("a",E,[n("Gitee 国内镜像"),s(p)]),n("。")]),s(i,null,{default:t(()=>[s(o,{title:"GitHub"},{default:t(()=>[x]),_:1}),s(o,{title:"Gitee 国内镜像"},{default:t(()=>[P]),_:1})]),_:1}),B,a("p",null,[n("在读写节点上,使用 "),D,n(" 选项编译 PolarDB 内核。请参考 "),s(u,{to:"/development/dev-on-docker.html#%E7%BC%96%E8%AF%91%E6%B5%8B%E8%AF%95%E9%80%89%E9%A1%B9%E8%AF%B4%E6%98%8E"},{default:t(()=>[n("编译测试选项说明")]),_:1}),n(" 查看更多编译选项的说明。")]),H])}const A=d(m,[["render",O],["__file","db-pfs-curve.html.vue"]]);export{A as default}; diff --git a/assets/db-pfs-curve.html-bc2859d9.js b/assets/db-pfs-curve.html-bc2859d9.js new file mode 100644 index 00000000000..3339c573c5f --- /dev/null +++ b/assets/db-pfs-curve.html-bc2859d9.js @@ -0,0 +1,95 @@ +import{_ as d,r as e,o as k,c as v,d as s,a,b as n,w as t,e as l}from"./app-3d1677bf.js";const m={},_=a("h1",{id:"基于-pfs-for-curvebs-文件系统部署",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#基于-pfs-for-curvebs-文件系统部署","aria-hidden":"true"},"#"),n(" 基于 PFS for CurveBS 文件系统部署")],-1),b=a("p",null,"本文将指导您在分布式文件系统 PolarDB File System(PFS)上编译部署 PolarDB,适用于已经在 Curve 块存储上格式化并挂载 PFS 的计算节点。",-1),g={href:"https://hub.docker.com/r/polardb/polardb_pg_devel/tags",target:"_blank",rel:"noopener noreferrer"},h=a("h2",{id:"源码下载",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#源码下载","aria-hidden":"true"},"#"),n(" 源码下载")],-1),f={href:"https://github.com/ApsaraDB/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener 
noreferrer"},y=a("code",null,"POLARDB_11_STABLE",-1),E={href:"https://gitee.com/mirrors/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},x=a("div",{class:"language-bash","data-ext":"sh"},[a("pre",{class:"language-bash"},[a("code",null,[a("span",{class:"token function"},"git"),n(" clone "),a("span",{class:"token parameter variable"},"-b"),n(` POLARDB_11_STABLE https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git +`)])])],-1),P=a("div",{class:"language-bash","data-ext":"sh"},[a("pre",{class:"language-bash"},[a("code",null,[a("span",{class:"token function"},"git"),n(" clone "),a("span",{class:"token parameter variable"},"-b"),n(` POLARDB_11_STABLE https://gitee.com/mirrors/PolarDB-for-PostgreSQL +`)])])],-1),B=l(`

After the clone finishes, enter the source directory:

cd PolarDB-for-PostgreSQL/
+

Build and Deploy PolarDB

Deploy the Read-Write Node

`,4),D=a("code",null,"--with-pfsd",-1),H=l(`
./polardb_build.sh --with-pfsd
+

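WARNING

After the build completes, the script above automatically deploys an instance based on the local file system, running on port 5432.

Run the following command manually to stop this instance so that an instance can be redeployed on PFS and shared storage: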

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl \\
+    -D $HOME/tmp_master_dir_polardb_pg_1100_bld/ \\
+    stop
+

Initialize the data directory $HOME/primary/ locally on the node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/primary
+

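Initialize the shared data directory at /pool@@volume_my_/shared_data on shared storage:

# Create the shared data directory with pfs
+sudo pfs -C curve mkdir /pool@@volume_my_/shared_data
+# Initialize the local and shared data directories of the database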
+sudo $HOME/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh \\
+    $HOME/primary/ /pool@@volume_my_/shared_data/ curve
+

Edit the configuration of the read-write node. Open $HOME/primary/postgresql.conf and add the following settings:

port=5432
+polar_hostid=1
+polar_enable_shared_storage_mode=on
+polar_disk_name='pool@@volume_my_'
+polar_datadir='/pool@@volume_my_/shared_data/'
+polar_vfs.localfs_mode=off
+shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
+polar_storage_cluster_name='curve'
+logging_collector=on
+log_line_prefix='%p\\t%r\\t%u\\t%m\\t'
+log_directory='pg_log'
+listen_addresses='*'
+max_connections=1000
+synchronous_standby_names='replica1'
+

Open $HOME/primary/pg_hba.conf and add the following entry:

host	replication	postgres	0.0.0.0/0	trust
+

Finally, start the read-write node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/primary
+

Check that the read-write node is running properly:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c 'select version();'
+# Output:
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

On the read-write node, create a replication slot for the corresponding read-only node; the read-only node uses it for physical streaming replication:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c "select pg_create_physical_replication_slot('replica1');"
+# Output:
+ pg_create_physical_replication_slot
+-------------------------------------
+ (replica1,)
+(1 row)
+

Deploy the Read-Only Node

On the read-only node, build the PolarDB kernel with the --with-pfsd option.

./polardb_build.sh --with-pfsd
+

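WARNING

After the build completes, the script above automatically deploys an instance based on the local file system, running on port 5432.

Run the following command manually to stop this instance so that an instance can be redeployed on PFS and shared storage: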

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl \\
+    -D $HOME/tmp_master_dir_polardb_pg_1100_bld/ \\
+    stop
+

Initialize the data directory $HOME/replica1/ locally on the node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/replica1
+

Edit the configuration of the read-only node. Open $HOME/replica1/postgresql.conf and add the following settings:

port=5433
+polar_hostid=2
+polar_enable_shared_storage_mode=on
+polar_disk_name='pool@@volume_my_'
+polar_datadir='/pool@@volume_my_/shared_data/'
+polar_vfs.localfs_mode=off
+shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
+polar_storage_cluster_name='curve'
+logging_collector=on
+log_line_prefix='%p\\t%r\\t%u\\t%m\\t'
+log_directory='pg_log'
+listen_addresses='*'
+max_connections=1000
+

Create $HOME/replica1/recovery.conf and add the following settings:

WARNING

Replace the placeholder below with the IP address of the read-write node (container).

polar_replica='on'
+recovery_target_timeline='latest'
+primary_slot_name='replica1'
+primary_conninfo='host=[read-write node IP] port=5432 user=postgres dbname=postgres application_name=replica1'
+

Finally, start the read-only node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/replica1
+

Check that the read-only node is running properly:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5433 \\
+    -d postgres \\
+    -c 'select version();'
+# Output:
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

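Cluster Check and Test

After deployment, check and test the instance to make sure that the read-write node can write data and the read-only node can read it.

Log in to the read-write node, create a test table, and insert sample data: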

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5432 \\
+    -d postgres \\
+    -c "create table t(t1 int primary key, t2 int);insert into t values (1, 1),(2, 3),(3, 3);"
+

Log in to the read-only node and query the sample data just inserted:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5433 \\
+    -d postgres \\
+    -c "select * from t;"
+# Output:
+ t1 | t2
+----+----
+  1 |  1
+  2 |  3
+  3 |  3
+(3 rows)
+

The data inserted on the read-write node is visible on the read-only node.

`,38);function O(r,$){const c=e("ArticleInfo"),p=e("ExternalLinkIcon"),o=e("CodeGroupItem"),i=e("CodeGroup"),u=e("RouterLink");return k(),v("div",null,[_,s(c,{frontmatter:r.$frontmatter},null,8,["frontmatter"]),b,a("p",null,[n("我们在 DockerHub 上提供了一个 "),a("a",g,[n("PolarDB 开发镜像"),s(p)]),n(",里面已经包含编译运行 PolarDB for PostgreSQL 所需要的所有依赖。您可以直接使用这个开发镜像进行实例搭建。镜像目前支持 AMD64 和 ARM64 两种 CPU 架构。")]),h,a("p",null,[n("在前置文档中,我们已经从 DockerHub 上拉取了 PolarDB 开发镜像,并且进入到了容器中。进入容器后,从 "),a("a",f,[n("GitHub"),s(p)]),n(" 上下载 PolarDB for PostgreSQL 的源代码,稳定分支为 "),y,n("。如果因网络原因不能稳定访问 GitHub,则可以访问 "),a("a",E,[n("Gitee 国内镜像"),s(p)]),n("。")]),s(i,null,{default:t(()=>[s(o,{title:"GitHub"},{default:t(()=>[x]),_:1}),s(o,{title:"Gitee 国内镜像"},{default:t(()=>[P]),_:1})]),_:1}),B,a("p",null,[n("在读写节点上,使用 "),D,n(" 选项编译 PolarDB 内核。请参考 "),s(u,{to:"/zh/development/dev-on-docker.html#%E7%BC%96%E8%AF%91%E6%B5%8B%E8%AF%95%E9%80%89%E9%A1%B9%E8%AF%B4%E6%98%8E"},{default:t(()=>[n("编译测试选项说明")]),_:1}),n(" 查看更多编译选项的说明。")]),H])}const L=d(m,[["render",O],["__file","db-pfs-curve.html.vue"]]);export{L as default}; diff --git a/assets/db-pfs-curve.html-ee679a35.js b/assets/db-pfs-curve.html-ee679a35.js new file mode 100644 index 00000000000..7fe3107460c --- /dev/null +++ b/assets/db-pfs-curve.html-ee679a35.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-20ec2a08","path":"/deploying/db-pfs-curve.html","title":"基于 PFS for CurveBS 文件系统部署","lang":"en-US","frontmatter":{"author":"程义","date":"2022/11/02","minute":15},"headers":[{"level":2,"title":"源码下载","slug":"源码下载","link":"#源码下载","children":[]},{"level":2,"title":"编译部署 PolarDB","slug":"编译部署-polardb","link":"#编译部署-polardb","children":[{"level":3,"title":"读写节点部署","slug":"读写节点部署","link":"#读写节点部署","children":[]},{"level":3,"title":"只读节点部署","slug":"只读节点部署","link":"#只读节点部署","children":[]},{"level":3,"title":"集群检查和测试","slug":"集群检查和测试","link":"#集群检查和测试","children":[]}]}],"git":{"updatedTime":1690894847000},"filePathRelative":"deploying/db-pfs-curve.md"}');export{e as data}; diff 
--git a/assets/db-pfs.html-00133c95.js b/assets/db-pfs.html-00133c95.js new file mode 100644 index 00000000000..c5020d3d09a --- /dev/null +++ b/assets/db-pfs.html-00133c95.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-2da78b44","path":"/deploying/db-pfs.html","title":"基于 PFS 文件系统部署","lang":"en-US","frontmatter":{"author":"棠羽","date":"2022/05/09","minute":15},"headers":[{"level":2,"title":"读写节点部署","slug":"读写节点部署","link":"#读写节点部署","children":[]},{"level":2,"title":"只读节点部署","slug":"只读节点部署","link":"#只读节点部署","children":[]},{"level":2,"title":"集群检查和测试","slug":"集群检查和测试","link":"#集群检查和测试","children":[]},{"level":2,"title":"常见运维步骤","slug":"常见运维步骤","link":"#常见运维步骤","children":[]}],"git":{"updatedTime":1673450922000},"filePathRelative":"deploying/db-pfs.md"}');export{e as data}; diff --git a/assets/db-pfs.html-25c4f785.js b/assets/db-pfs.html-25c4f785.js new file mode 100644 index 00000000000..1cb4010ff6b --- /dev/null +++ b/assets/db-pfs.html-25c4f785.js @@ -0,0 +1,85 @@ +import{_ as r,r as l,o as i,c as u,d as n,a,w as e,b as s,e as d}from"./app-3d1677bf.js";const k={},v=a("h1",{id:"基于-pfs-文件系统部署",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#基于-pfs-文件系统部署","aria-hidden":"true"},"#"),s(" 基于 PFS 文件系统部署")],-1),m=a("p",null,"本文将指导您在分布式文件系统 PolarDB File System(PFS)上编译部署 PolarDB,适用于已经在共享存储上格式化并挂载 PFS 文件系统的计算节点。",-1),b={class:"table-of-contents"},_=d(`

Deploy the Read-Write Node

Initialize the local data directory ~/primary/ of the read-write node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/primary
+

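Create the shared data directory at /nvme1n1/shared_data/ on shared storage, then initialize it with the polar-initdb.sh script:

# Create the shared data directory with pfs
+sudo pfs -C disk mkdir /nvme1n1/shared_data
+# Initialize the local and shared data directories of the database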
+sudo $HOME/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh \\
+    $HOME/primary/ /nvme1n1/shared_data/
+

Edit the configuration of the read-write node. Open ~/primary/postgresql.conf and add the following settings:

port=5432
+polar_hostid=1
+polar_enable_shared_storage_mode=on
+polar_disk_name='nvme1n1'
+polar_datadir='/nvme1n1/shared_data/'
+polar_vfs.localfs_mode=off
+shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
+polar_storage_cluster_name='disk'
+logging_collector=on
+log_line_prefix='%p\\t%r\\t%u\\t%m\\t'
+log_directory='pg_log'
+listen_addresses='*'
+max_connections=1000
+synchronous_standby_names='replica1'
+

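Edit the client authentication file ~/primary/pg_hba.conf of the read-write node and add the following entry to allow read-only nodes to perform physical replication: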

host	replication	postgres	0.0.0.0/0	trust
+

Finally, start the read-write node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/primary
+

Check that the read-write node is running properly:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c 'SELECT version();'
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

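On the read-write node, create a replication slot for the corresponding read-only node; the read-only node uses it for physical replication: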

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT pg_create_physical_replication_slot('replica1');"
+ pg_create_physical_replication_slot
+-------------------------------------
+ (replica1,)
+(1 row)
+

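Deploy the Read-Only Node

Create an empty directory at ~/replica1 on the local disk of the read-only node, then use the polar-replica-initdb.sh script to initialize this local directory from the data directory on shared storage. The initialized local directory contains no default configuration files, so also use initdb to create a temporary local directory as a template and copy all default configuration files from it into the local directory of the read-only node: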

mkdir -m 0700 $HOME/replica1
+sudo ~/tmp_basedir_polardb_pg_1100_bld/bin/polar-replica-initdb.sh \\
+    /nvme1n1/shared_data/ $HOME/replica1/
+
+$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D /tmp/replica1
+cp /tmp/replica1/*.conf $HOME/replica1/
+

Edit the configuration of the read-only node. Open ~/replica1/postgresql.conf and add the following settings:

port=5433
+polar_hostid=2
+polar_enable_shared_storage_mode=on
+polar_disk_name='nvme1n1'
+polar_datadir='/nvme1n1/shared_data/'
+polar_vfs.localfs_mode=off
+shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
+polar_storage_cluster_name='disk'
+logging_collector=on
+log_line_prefix='%p\\t%r\\t%u\\t%m\\t'
+log_directory='pg_log'
+listen_addresses='*'
+max_connections=1000
+

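Create the replication configuration file ~/replica1/recovery.conf for the read-only node and add the connection information of the read-write node and the replication slot name: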

polar_replica='on'
+recovery_target_timeline='latest'
+primary_slot_name='replica1'
+primary_conninfo='host=[read-write node IP] port=5432 user=postgres dbname=postgres application_name=replica1'
+

Finally, start the read-only node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/replica1
+

Check that the read-only node is running properly:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5433 \\
+    -d postgres \\
+    -c 'SELECT version();'
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

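Cluster Check and Test

After deployment, check and test the instance to make sure that the read-write node can write data and the read-only node can read it.

Log in to the read-write node, create a test table, and insert sample data: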

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5432 \\
+    -d postgres \\
+    -c "CREATE TABLE t (t1 INT PRIMARY KEY, t2 INT); INSERT INTO t VALUES (1, 1),(2, 3),(3, 3);"
+

Log in to the read-only node and query the sample data just inserted:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5433 \\
+    -d postgres \\
+    -c "SELECT * FROM t;"
+ t1 | t2
+----+----
+  1 |  1
+  2 |  3
+  3 |  3
+(3 rows)
+

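The data inserted on the read-write node is visible on the read-only node, which means the PolarDB compute cluster on shared storage has been set up successfully.


Common Operations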

`,35);function g(o,h){const c=l("ArticleInfo"),t=l("router-link"),p=l("RouterLink");return i(),u("div",null,[v,n(c,{frontmatter:o.$frontmatter},null,8,["frontmatter"]),m,a("nav",b,[a("ul",null,[a("li",null,[n(t,{to:"#读写节点部署"},{default:e(()=>[s("读写节点部署")]),_:1})]),a("li",null,[n(t,{to:"#只读节点部署"},{default:e(()=>[s("只读节点部署")]),_:1})]),a("li",null,[n(t,{to:"#集群检查和测试"},{default:e(()=>[s("集群检查和测试")]),_:1})]),a("li",null,[n(t,{to:"#常见运维步骤"},{default:e(()=>[s("常见运维步骤")]),_:1})])])]),_,a("ul",null,[a("li",null,[n(p,{to:"/zh/operation/backup-and-restore.html"},{default:e(()=>[s("备份恢复")]),_:1})]),a("li",null,[n(p,{to:"/zh/operation/grow-storage.html"},{default:e(()=>[s("共享存储在线扩容")]),_:1})]),a("li",null,[n(p,{to:"/zh/operation/scale-out.html"},{default:e(()=>[s("计算节点扩缩容")]),_:1})]),a("li",null,[n(p,{to:"/zh/operation/ro-online-promote.html"},{default:e(()=>[s("只读节点在线 Promote")]),_:1})])])])}const y=r(k,[["render",g],["__file","db-pfs.html.vue"]]);export{y as default}; diff --git a/assets/db-pfs.html-79e35242.js b/assets/db-pfs.html-79e35242.js new file mode 100644 index 00000000000..8477ba62f8a --- /dev/null +++ b/assets/db-pfs.html-79e35242.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-b00a48e2","path":"/zh/deploying/db-pfs.html","title":"基于 PFS 文件系统部署","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/05/09","minute":15},"headers":[{"level":2,"title":"读写节点部署","slug":"读写节点部署","link":"#读写节点部署","children":[]},{"level":2,"title":"只读节点部署","slug":"只读节点部署","link":"#只读节点部署","children":[]},{"level":2,"title":"集群检查和测试","slug":"集群检查和测试","link":"#集群检查和测试","children":[]},{"level":2,"title":"常见运维步骤","slug":"常见运维步骤","link":"#常见运维步骤","children":[]}],"git":{"updatedTime":1673450922000},"filePathRelative":"zh/deploying/db-pfs.md"}');export{e as data}; diff --git a/assets/db-pfs.html-ec141362.js b/assets/db-pfs.html-ec141362.js new file mode 100644 index 00000000000..549e3d66544 --- /dev/null +++ b/assets/db-pfs.html-ec141362.js @@ -0,0 +1,85 @@ +import{_ as r,r as l,o as i,c as u,d 
as n,a,w as e,b as s,e as d}from"./app-3d1677bf.js";const k={},v=a("h1",{id:"基于-pfs-文件系统部署",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#基于-pfs-文件系统部署","aria-hidden":"true"},"#"),s(" 基于 PFS 文件系统部署")],-1),m=a("p",null,"本文将指导您在分布式文件系统 PolarDB File System(PFS)上编译部署 PolarDB,适用于已经在共享存储上格式化并挂载 PFS 文件系统的计算节点。",-1),b={class:"table-of-contents"},_=d(`

Deploy the Read-Write Node

Initialize the local data directory ~/primary/ of the read-write node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/primary
+

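Create the shared data directory at /nvme1n1/shared_data/ on shared storage, then initialize it with the polar-initdb.sh script:

# Create the shared data directory with pfs
+sudo pfs -C disk mkdir /nvme1n1/shared_data
+# Initialize the local and shared data directories of the database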
+sudo $HOME/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh \\
+    $HOME/primary/ /nvme1n1/shared_data/
+

Edit the configuration of the read-write node. Open ~/primary/postgresql.conf and add the following settings:

port=5432
+polar_hostid=1
+polar_enable_shared_storage_mode=on
+polar_disk_name='nvme1n1'
+polar_datadir='/nvme1n1/shared_data/'
+polar_vfs.localfs_mode=off
+shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
+polar_storage_cluster_name='disk'
+logging_collector=on
+log_line_prefix='%p\\t%r\\t%u\\t%m\\t'
+log_directory='pg_log'
+listen_addresses='*'
+max_connections=1000
+synchronous_standby_names='replica1'
+

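Edit the client authentication file ~/primary/pg_hba.conf of the read-write node and add the following entry to allow read-only nodes to perform physical replication: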

host	replication	postgres	0.0.0.0/0	trust
+

Finally, start the read-write node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/primary
+

Check that the read-write node is running properly:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c 'SELECT version();'
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

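On the read-write node, create a replication slot for the corresponding read-only node; the read-only node uses it for physical replication: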

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT pg_create_physical_replication_slot('replica1');"
+ pg_create_physical_replication_slot
+-------------------------------------
+ (replica1,)
+(1 row)
+

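Deploy the Read-Only Node

Create an empty directory at ~/replica1 on the local disk of the read-only node, then use the polar-replica-initdb.sh script to initialize this local directory from the data directory on shared storage. The initialized local directory contains no default configuration files, so also use initdb to create a temporary local directory as a template and copy all default configuration files from it into the local directory of the read-only node: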

mkdir -m 0700 $HOME/replica1
+sudo ~/tmp_basedir_polardb_pg_1100_bld/bin/polar-replica-initdb.sh \\
+    /nvme1n1/shared_data/ $HOME/replica1/
+
+$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D /tmp/replica1
+cp /tmp/replica1/*.conf $HOME/replica1/
+

Edit the configuration of the read-only node. Open ~/replica1/postgresql.conf and add the following settings:

port=5433
+polar_hostid=2
+polar_enable_shared_storage_mode=on
+polar_disk_name='nvme1n1'
+polar_datadir='/nvme1n1/shared_data/'
+polar_vfs.localfs_mode=off
+shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
+polar_storage_cluster_name='disk'
+logging_collector=on
+log_line_prefix='%p\\t%r\\t%u\\t%m\\t'
+log_directory='pg_log'
+listen_addresses='*'
+max_connections=1000
+

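Create the replication configuration file ~/replica1/recovery.conf for the read-only node and add the connection information of the read-write node and the replication slot name: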

polar_replica='on'
+recovery_target_timeline='latest'
+primary_slot_name='replica1'
+primary_conninfo='host=[read-write node IP] port=5432 user=postgres dbname=postgres application_name=replica1'
+

Finally, start the read-only node:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/replica1
+

Check that the read-only node is running properly:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5433 \\
+    -d postgres \\
+    -c 'SELECT version();'
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

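Cluster Check and Test

After deployment, check and test the instance to make sure that the read-write node can write data and the read-only node can read it.

Log in to the read-write node, create a test table, and insert sample data: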

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5432 \\
+    -d postgres \\
+    -c "CREATE TABLE t (t1 INT PRIMARY KEY, t2 INT); INSERT INTO t VALUES (1, 1),(2, 3),(3, 3);"
+

Log in to the read-only node and query the sample data just inserted:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5433 \\
+    -d postgres \\
+    -c "SELECT * FROM t;"
+ t1 | t2
+----+----
+  1 |  1
+  2 |  3
+  3 |  3
+(3 rows)
+

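The data inserted on the read-write node is visible on the read-only node, which means the PolarDB compute cluster on shared storage has been set up successfully.


Common Operations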

`,35);function g(o,h){const c=l("ArticleInfo"),t=l("router-link"),p=l("RouterLink");return i(),u("div",null,[v,n(c,{frontmatter:o.$frontmatter},null,8,["frontmatter"]),m,a("nav",b,[a("ul",null,[a("li",null,[n(t,{to:"#读写节点部署"},{default:e(()=>[s("读写节点部署")]),_:1})]),a("li",null,[n(t,{to:"#只读节点部署"},{default:e(()=>[s("只读节点部署")]),_:1})]),a("li",null,[n(t,{to:"#集群检查和测试"},{default:e(()=>[s("集群检查和测试")]),_:1})]),a("li",null,[n(t,{to:"#常见运维步骤"},{default:e(()=>[s("常见运维步骤")]),_:1})])])]),_,a("ul",null,[a("li",null,[n(p,{to:"/operation/backup-and-restore.html"},{default:e(()=>[s("备份恢复")]),_:1})]),a("li",null,[n(p,{to:"/operation/grow-storage.html"},{default:e(()=>[s("共享存储在线扩容")]),_:1})]),a("li",null,[n(p,{to:"/operation/scale-out.html"},{default:e(()=>[s("计算节点扩缩容")]),_:1})]),a("li",null,[n(p,{to:"/operation/ro-online-promote.html"},{default:e(()=>[s("只读节点在线 Promote")]),_:1})])])])}const y=r(k,[["render",g],["__file","db-pfs.html.vue"]]);export{y as default}; diff --git a/assets/ddl-synchronization.html-37f0cfaf.js b/assets/ddl-synchronization.html-37f0cfaf.js new file mode 100644 index 00000000000..911e086248f --- /dev/null +++ b/assets/ddl-synchronization.html-37f0cfaf.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-5cfdf98b","path":"/theory/ddl-synchronization.html","title":"DDL Synchronization","lang":"en-US","frontmatter":{},"headers":[{"level":2,"title":"Overview","slug":"overview","link":"#overview","children":[]},{"level":2,"title":"Terms","slug":"terms","link":"#terms","children":[]},{"level":2,"title":"DDL Synchronization Mechanism","slug":"ddl-synchronization-mechanism","link":"#ddl-synchronization-mechanism","children":[{"level":3,"title":"DDL Locks","slug":"ddl-locks","link":"#ddl-locks","children":[]},{"level":3,"title":"How to Ensure Data Correctness","slug":"how-to-ensure-data-correctness","link":"#how-to-ensure-data-correctness","children":[]}]},{"level":2,"title":"Apply Optimization for DDL Locks on 
RO","slug":"apply-optimization-for-ddl-locks-on-ro","link":"#apply-optimization-for-ddl-locks-on-ro","children":[{"level":3,"title":"Asynchronous Apply of DDL Locks","slug":"asynchronous-apply-of-ddl-locks","link":"#asynchronous-apply-of-ddl-locks","children":[]},{"level":3,"title":"How to Ensure Data Correctness","slug":"how-to-ensure-data-correctness-1","link":"#how-to-ensure-data-correctness-1","children":[]}]}],"git":{"updatedTime":1656919280000},"filePathRelative":"theory/ddl-synchronization.md"}');export{e as data}; diff --git a/assets/ddl-synchronization.html-9c478656.js b/assets/ddl-synchronization.html-9c478656.js new file mode 100644 index 00000000000..e904e0efef0 --- /dev/null +++ b/assets/ddl-synchronization.html-9c478656.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-7304dd08","path":"/zh/theory/ddl-synchronization.html","title":"DDL 同步","lang":"zh-CN","frontmatter":{},"headers":[{"level":2,"title":"概述","slug":"概述","link":"#概述","children":[]},{"level":2,"title":"术语","slug":"术语","link":"#术语","children":[]},{"level":2,"title":"同步 DDL 机制","slug":"同步-ddl-机制","link":"#同步-ddl-机制","children":[{"level":3,"title":"DDL 锁","slug":"ddl-锁","link":"#ddl-锁","children":[]},{"level":3,"title":"如何保证数据正确性","slug":"如何保证数据正确性","link":"#如何保证数据正确性","children":[]}]},{"level":2,"title":"RO 锁回放优化","slug":"ro-锁回放优化","link":"#ro-锁回放优化","children":[{"level":3,"title":"异步 DDL 锁回放","slug":"异步-ddl-锁回放","link":"#异步-ddl-锁回放","children":[]},{"level":3,"title":"如何保证数据正确性","slug":"如何保证数据正确性-1","link":"#如何保证数据正确性-1","children":[]}]}],"git":{"updatedTime":1656919280000},"filePathRelative":"zh/theory/ddl-synchronization.md"}');export{l as data}; diff --git a/assets/ddl-synchronization.html-bc052c77.js b/assets/ddl-synchronization.html-bc052c77.js new file mode 100644 index 00000000000..d11d94ca4c5 --- /dev/null +++ b/assets/ddl-synchronization.html-bc052c77.js @@ -0,0 +1 @@ +import{_ as t,o as e,c as a,e as l}from"./app-3d1677bf.js";const 
r="/PolarDB-for-PostgreSQL/assets/45_DDL_1-bc7e6ba3.png",d="/PolarDB-for-PostgreSQL/assets/46_DDL_2-51d294c2.png",D="/PolarDB-for-PostgreSQL/assets/47_DDL_3-7c0328bf.png",i="/PolarDB-for-PostgreSQL/assets/48_DDL_4-b2177964.png",o={},n=l('

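DDL Synchronization

Overview

In a shared-storage architecture with one writer and multiple readers, there is only one copy of the data files. Thanks to the multi-version mechanism, reads and writes on different nodes do not conflict. Some data operations, however, are not multi-versioned; file operations are the typical example.

Multi-versioning covers only the tuples inside a file, not the file itself. Creating or deleting a file becomes visible to the whole cluster immediately, so an RO node may find that a file it is reading has disappeared. Some synchronization is therefore needed to prevent this.

File operations are usually performed through DDL, so PolarDB provides a synchronization mechanism for DDL operations that prevents concurrent file operations. Apart from this synchronization, DDL is executed exactly as on a single node.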

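Terms

  • LSN: Log Sequence Number. It uniquely identifies a WAL record and increases monotonically across the cluster.
  • Apply LSN: the position up to which a read-only node has replayed the WAL.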

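DDL Synchronization Mechanism

DDL Locks

The DDL synchronization mechanism uses AccessExclusiveLock (hereafter the DDL lock) to synchronize DDL operations between RW and RO nodes.

Figure 1: Relationship between the DDL lock and WAL records

The DDL lock is the highest-level table lock in the database and conflicts with every other lock level. It is replicated to the RO nodes along with the WAL, and the WAL position at which the lock was written (the Lock LSN) can be obtained. Once an RO node has replayed past the Lock LSN, the lock is considered acquired on that RO node. The DDL lock is released when the transaction ends.

As Figure 1 shows, when replay has reached ApplyLSN1 the DDL lock has not yet been acquired; at ApplyLSN2 the lock has been acquired; at ApplyLSN3 the lock has already been released.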

异步回放ddl锁.png
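Figure 2: Conditions for acquiring the DDL lock

When all RO nodes have applied the WAL beyond the Lock LSN (as shown in Figure 2), the transaction on the RW node is considered to have acquired the lock at the cluster level. Holding this lock means that no other session on the RW node or the RO nodes can access the table, so the RW node can perform file operations on it.

Note: A standby node has its own independent file storage, so the situation described above does not arise when it acquires the lock.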

异步回放ddl锁.png
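Figure 3: DDL synchronization workflow

The workflow in Figure 3 is as follows:

  1. A session on an RO node executes a query.
  2. A session on the RW node executes DDL, acquires the DDL lock locally, writes it to the WAL, and waits for all RO nodes to apply that WAL record.
  3. The apply process on each RO node attempts to acquire the lock and, on success, returns its Apply LSN to the RW node.
  4. The RW node learns that all RO nodes have acquired the lock.
  5. The RW node starts to perform the DDL operation.

How to Ensure Data Correctness

The DDL lock is the highest-level lock in PostgreSQL. Operations on a table such as DROP / ALTER / LOCK / VACUUM (FULL) must first acquire the DDL lock. The RW node acquires the lock through an explicit user operation and writes it to the WAL on success; RO nodes acquire the lock by applying the WAL.

  • Primary/standby environment: a hot standby serves read-only queries while applying the WAL. When apply reaches the lock and the table is being read, apply is blocked until the timeout expires.
  • PolarDB environment: the RW node's lock acquisition succeeds only after all RO nodes have acquired the lock, because a DDL operation may proceed only once neither the RW node nor any RO node is still accessing the affected data in shared storage.

When the operations below all target the same table and < denotes "happens before", synchronized DDL executes as follows:

  1. All local queries finish < The local DDL lock is acquired < The local DDL lock is released < New local queries start
  2. The RW node acquires the local DDL lock < Each RO node acquires its local DDL lock < The RW node acquires the global DDL lock
  3. The RW node acquires the global DDL lock < The RW node writes data < The RW node releases the global DDL lock

Combining these rules gives the following order: queries on the RW node and every RO node finish < The RW node acquires the global DDL lock < The RW node writes data < The RW node releases the global DDL lock < New queries start on the RW / RO nodes

While data in shared storage is being written, no queries are running on the RW node or on any RO node, so correctness is preserved. The whole process follows the 2PL protocol, so correctness also holds when multiple tables are involved.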

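Apply Optimization for DDL Locks on RO

The mechanism above has one problem: lock synchronization sits on the main path of RW-to-RO synchronization. When lock synchronization on an RO node is blocked, data synchronization to that node is blocked as well (in steps 3 and 4 of Figure 3, the apply process can acquire the lock only after the local query sessions have finished). The default synchronization timeout in PolarDB is 30s. If the RW node is under heavy load, this can cause significant data latency.

The delays from DDL locks applied on an RO node also add up. For example, if the RW node writes 10 DDL lock records within 1s, an RO node may need 300s to apply them all. Data latency is dangerous for PolarDB: it prevents the RW node from flushing dirty pages and completing checkpoints in time. If a crash occurs in this state, recovery takes longer, which poses a serious stability risk.

Asynchronous Apply of DDL Locks

To address this problem, PolarDB optimizes lock apply on RO nodes.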

异步回放ddl锁.png
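Figure 4: Asynchronous apply of DDL locks on RO

The idea is to use a separate asynchronous process to apply these locks so that the main apply process is not blocked.

Figure 4 shows the overall workflow. Unlike in Figure 3, the apply process offloads lock acquisition to the lock apply process and immediately returns to the main apply flow, so it is not affected when lock apply is blocked.

Lock apply conflicts are uncommon, so the main apply process does not offload every lock to the lock apply process. It first tries to acquire the lock itself; if that succeeds, no offloading is needed. This effectively reduces inter-process synchronization overhead.

The feature is enabled by default in PolarDB. It effectively reduces the apply latency caused by apply conflicts and the stability problems that follow from it. AWS Aurora does not have this feature, so conflicts there severely increase latency.

How to Ensure Data Correctness

In asynchronous apply mode, only the actor that acquires the lock changes; the execution logic stays the same. It is still guaranteed that no queries run while the RW node acquires the global DDL lock, writes data, and releases the global DDL lock, so correctness is not affected.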

',38),s=[n];function h(L,c){return e(),a("div",null,s)}const R=t(o,[["render",h],["__file","ddl-synchronization.html.vue"]]);export{R as default}; diff --git a/assets/ddl-synchronization.html-dc62a732.js b/assets/ddl-synchronization.html-dc62a732.js new file mode 100644 index 00000000000..118daed7525 --- /dev/null +++ b/assets/ddl-synchronization.html-dc62a732.js @@ -0,0 +1 @@ +import{_ as e,o,c as a,e as t}from"./app-3d1677bf.js";const r="/PolarDB-for-PostgreSQL/assets/45_DDL_1-bc7e6ba3.png",n="/PolarDB-for-PostgreSQL/assets/46_DDL_2-51d294c2.png",s="/PolarDB-for-PostgreSQL/assets/47_DDL_3-7c0328bf.png",i="/PolarDB-for-PostgreSQL/assets/48_DDL_4-b2177964.png",l={},c=t('

DDL Synchronization

Overview

In a shared storage architecture that consists of one primary node and multiple read-only nodes, a data file has only one copy. Due to multi-version concurrency control (MVCC), the read and write operations performed on different nodes do not conflict. However, MVCC cannot be used to ensure consistency for some specific data operations, such as file operations.

MVCC applies to tuples within a file but not to the file itself. File operations such as creating and deleting files become visible to the entire cluster immediately. As a result, a file can disappear while a read-only node is still reading it. To prevent this, file operations need to be synchronized.

File operations are usually performed through DDL. For DDL operations, PolarDB therefore provides a synchronization mechanism that prevents concurrent file operations. Apart from this synchronization mechanism, DDL executes with the same logic as on a single node.

Terms

  • LSN: short for log sequence number. An LSN uniquely identifies a record in the write-ahead log (WAL). LSNs increase monotonically across the cluster.
  • Apply LSN: the position up to which a read-only node has applied the WAL.

DDL Synchronization Mechanism

DDL Locks

The DDL synchronization mechanism uses AccessExclusiveLock (referred to below as the DDL lock) to synchronize DDL operations between the primary node and the read-only nodes.

image.png
Figure 1: Relationship Between DDL Lock and WAL Log

The DDL lock is the highest-level table lock in the database and conflicts with every other lock level. The lock is written to the WAL and synchronized to the read-only nodes along with it, so the LSN at which the lock was written (the Lock LSN) is known. When a read-only node has applied the WAL beyond the Lock LSN, the lock is considered to have been acquired on that node. The DDL lock is released when the transaction ends. As Figure 1 shows, when the WAL has been applied up to Apply LSN 1, the DDL lock is not yet acquired; at Apply LSN 2, the lock is held; at Apply LSN 3, the lock has been released.
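
The three positions in Figure 1 can be expressed as a small classification function. This is a minimal sketch rather than PolarDB code: the names `lock_state`, `lock_lsn`, and `release_lsn` are hypothetical, LSNs are modelled as plain integers, and the boundary handling (a lock counts as held once the Apply LSN reaches the Lock LSN) is an assumption.

```python
from enum import Enum


class LockState(Enum):
    NOT_ACQUIRED = "not acquired"
    HELD = "held"
    RELEASED = "released"


def lock_state(apply_lsn: int, lock_lsn: int, release_lsn: int) -> LockState:
    """Classify a DDL lock as seen by one read-only node.

    lock_lsn: WAL position of the lock record (hypothetical name).
    release_lsn: WAL position where the owning transaction ends (hypothetical name).
    """
    if apply_lsn < lock_lsn:
        # Apply has not reached the lock record yet (Apply LSN 1 in Figure 1).
        return LockState.NOT_ACQUIRED
    if apply_lsn < release_lsn:
        # The lock record has been applied, the transaction has not ended (Apply LSN 2).
        return LockState.HELD
    # The end of the transaction has been applied (Apply LSN 3).
    return LockState.RELEASED
```

For example, with the lock written at position 200 and the transaction ending at 300, a node at Apply LSN 250 reports the lock as held.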

image.png
Figure 2: Conditions for Acquiring DDL Lock

When all read-only nodes have applied the WAL beyond the Lock LSN (as shown in Figure 2), the transaction on the primary node is considered to have acquired the DDL lock at the cluster level. From then on, no other session on the primary node or the read-only nodes can access the table, and the primary node can perform file operations on it.

Note: A standby node has its own independent file storage, so the situation described above does not arise when it acquires the lock.
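
The cluster-level condition can be sketched as a check over the Apply LSNs reported by the read-only nodes. The function names and the dictionary of node names to integer LSNs are assumptions made for illustration; they are not PolarDB APIs.

```python
from typing import Dict, List


def lagging_nodes(lock_lsn: int, apply_lsns: Dict[str, int]) -> List[str]:
    """Return the read-only nodes that have not yet applied the lock record."""
    return sorted(node for node, lsn in apply_lsns.items() if lsn < lock_lsn)


def acquired_cluster_wide(lock_lsn: int, apply_lsns: Dict[str, int]) -> bool:
    """The primary holds the lock cluster-wide once no read-only node lags behind."""
    return not lagging_nodes(lock_lsn, apply_lsns)
```

With a lock record at position 200, `{"ro1": 250, "ro2": 150}` is not sufficient because `ro2` still lags; once `ro2` reports 210, the check passes.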

image.png
Figure 3: DDL Synchronization Workflow

Figure 3 shows the workflow of how DDL operations are synchronized.

  1. A session on a read-only node executes a query.
  2. A session on the primary node executes a DDL statement, acquires the DDL lock locally, writes it to the WAL, and then waits for all read-only nodes to apply that WAL record.
  3. The apply process on each read-only node attempts to acquire the lock. Once it succeeds, it returns its Apply LSN to the primary node.
  4. The primary node learns that all read-only nodes have acquired the lock.
  5. The primary node starts to perform the DDL operation.

How to Ensure Data Correctness

DDL locks are the highest-level locks in PostgreSQL. Before operations such as DROP, ALTER, LOCK, and VACUUM (FULL) are performed on a table, the DDL lock must be acquired. The primary node acquires the lock in response to an explicit user operation and, on success, writes the lock to the WAL. Read-only nodes acquire the lock by applying the WAL.

  • In a primary/standby environment, a hot standby serves read-only queries while applying the WAL. When apply reaches the lock record and the table is being read, apply is blocked until the timeout expires.
  • In a PolarDB environment, the primary node's lock acquisition succeeds only after all read-only nodes have acquired the lock, because a DDL operation may proceed only once neither the primary node nor any read-only node is still accessing the affected data in shared storage.

When the following operations all target the same table and < denotes "happens before", synchronized DDL executes according to the following logic:

  1. Completes all local queries < Acquires a local DDL lock < Releases the local DDL lock < Runs new local queries
  2. The primary node acquires a local DDL lock < Each read-only node acquires a local DDL lock < The primary node acquires a global DDL lock
  3. The primary node acquires a global DDL lock < The primary node writes data < The primary node releases the global DDL lock

The sequence of the following operations is inferred based on the preceding execution logic: Queries on the primary node and each read-only node end < The primary node acquires a global DDL lock < The primary node writes data < The primary node releases the global DDL lock < The primary node and read-only nodes run new queries.
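
The ordering above implies an invariant that is easy to check on a recorded trace: no query interval on any node may overlap the window in which the primary node holds the global DDL lock. The sketch below assumes hypothetical integer timestamps and a flat list of `(start, end)` query intervals; it only illustrates the invariant and is not how PolarDB verifies it.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) of one query on any node


def queries_overlapping_ddl(
    queries: List[Interval], lock_acquired: int, lock_released: int
) -> List[Interval]:
    """Return the queries that ran while the global DDL lock was held.

    An empty result means the trace respects the ordering:
    queries end < global lock acquired < data written < lock released < new queries.
    """
    return [
        (start, end)
        for start, end in queries
        if start < lock_released and end > lock_acquired
    ]
```

A trace with queries `(0, 5)` and `(12, 15)` and a lock window of 6 to 10 yields no violations, whereas a query spanning `(0, 7)` would be reported.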

While the primary node writes data to shared storage, no queries are running on the primary node or on any read-only node, so correctness is preserved. The whole process follows the two-phase locking (2PL) protocol, so correctness also holds when multiple tables are involved.

Apply Optimization for DDL Locks on RO

The mechanism above has one problem: lock synchronization sits on the main path of primary-to-read-only synchronization. When lock synchronization on a read-only node is blocked, data synchronization to that node is blocked as well. In steps 3 and 4 of Figure 3, the apply process can acquire the lock only after the local query sessions have finished. The default synchronization timeout in PolarDB is 30s. If the primary node is under heavy load, this can cause significant data latency.

The delays from DDL locks applied on a read-only node also add up. For example, if the primary node writes 10 DDL lock records within 1s, a read-only node may need 300s to apply them all. Data latency is dangerous for PolarDB: it prevents the primary node from flushing dirty pages and completing checkpoints in time. If a crash occurs in this state, recovery takes longer, which poses a serious stability risk.
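
The 300s figure follows from multiplying the number of lock records by the timeout. A small worked sketch, assuming each lock record blocks apply for the full 30s timeout and that the records are applied one after another (the function name is hypothetical):

```python
def worst_case_apply_delay(num_lock_records: int, timeout_s: float = 30.0) -> float:
    """Upper bound on apply delay when every lock record waits for the full timeout."""
    return num_lock_records * timeout_s


# 10 lock records written within 1s on the primary, 30s timeout each on the read-only node.
delay = worst_case_apply_delay(10)
```

Here `delay` is 300.0 seconds, matching the example in the text; in practice the delay is lower whenever a conflicting query finishes before the timeout.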

Asynchronous Apply of DDL Locks

To resolve this issue, PolarDB optimizes DDL lock apply on read-only nodes.

image.png
Figure 4: Asynchronous Apply of DDL Locks on Read-Only Nodes

PolarDB uses an asynchronous process to apply DDL locks so that the main apply process is not blocked.

Figure 4 shows the overall workflow. Unlike in Figure 3, the main apply process offloads lock acquisition to a separate lock apply process and immediately returns to the main apply flow, so it is not affected when lock apply is blocked.

Lock apply conflicts are uncommon, so the main apply process does not offload every lock. It first tries to acquire the lock itself and offloads the acquisition to the lock apply process only if that attempt fails. This reduces inter-process synchronization overhead.
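
The try-first, offload-on-conflict pattern can be sketched with a non-blocking lock attempt and a queue. This is a simplified illustration, not the PolarDB implementation: the class and method names are hypothetical, one mutex per table stands in for PostgreSQL's lock manager, and the lock apply process is modelled as a method that handles one queued request at a time.

```python
import queue
import threading


class LockApplier:
    """Toy model of the main apply process and the lock apply process."""

    def __init__(self) -> None:
        self.table_locks = {}          # table name -> threading.Lock
        self.pending = queue.Queue()   # lock requests offloaded to the lock apply process

    def lock_for(self, table: str) -> threading.Lock:
        return self.table_locks.setdefault(table, threading.Lock())

    def apply_lock_record(self, table: str) -> str:
        """Called by the main apply process; never blocks."""
        if self.lock_for(table).acquire(blocking=False):
            return "acquired inline"   # common case: no conflict, no offloading
        self.pending.put(table)        # conflict: hand over and keep applying WAL
        return "offloaded"

    def run_lock_worker_once(self, timeout_s: float = 30.0) -> bool:
        """Lock apply process: block on one offloaded request until it succeeds or times out.

        Raises queue.Empty if nothing has been offloaded.
        """
        table = self.pending.get_nowait()
        return self.lock_for(table).acquire(timeout=timeout_s)
```

If a query currently holds the lock on a table, `apply_lock_record` returns `"offloaded"` immediately and the main apply flow continues; the worker acquires the lock later, once the query has released it.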

By default, the asynchronous lock apply feature is enabled in PolarDB. This feature can reduce the apply latency caused by apply conflicts to ensure service stability. AWS Aurora does not provide similar features. Apply conflicts in AWS Aurora can severely increase data latency.

How to Ensure Data Correctness

In asynchronous apply mode, only the actor that acquires the lock changes; the execution logic stays the same. It is still guaranteed that no queries run while the primary node acquires the global DDL lock, writes data, and releases the global DDL lock, so correctness is not affected.

',37),h=[c];function d(p,y){return o(),a("div",null,h)}const f=e(l,[["render",d],["__file","ddl-synchronization.html.vue"]]);export{f as default}; diff --git a/assets/deploy-official.html-56c4332d.js b/assets/deploy-official.html-56c4332d.js new file mode 100644 index 00000000000..0576273642a --- /dev/null +++ b/assets/deploy-official.html-56c4332d.js @@ -0,0 +1 @@ +import{_ as a,r as t,o as n,c,a as o,b as e,d as l}from"./app-3d1677bf.js";const s={},d=o("h1",{id:"阿里云官网购买实例",tabindex:"-1"},[o("a",{class:"header-anchor",href:"#阿里云官网购买实例","aria-hidden":"true"},"#"),e(" 阿里云官网购买实例")],-1),_={href:"https://www.aliyun.com/product/polardb",target:"_blank",rel:"noopener noreferrer"};function i(f,p){const r=t("ExternalLinkIcon");return n(),c("div",null,[d,o("p",null,[e("阿里云官网直接提供了可供购买的 "),o("a",_,[e("云原生关系型数据库 PolarDB PostgreSQL 引擎"),l(r)]),e("。")])])}const m=a(s,[["render",i],["__file","deploy-official.html.vue"]]);export{m as default}; diff --git a/assets/deploy-official.html-6001b0e7.js b/assets/deploy-official.html-6001b0e7.js new file mode 100644 index 00000000000..8178772c3bd --- /dev/null +++ b/assets/deploy-official.html-6001b0e7.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-c6592cf8","path":"/zh/deploying/deploy-official.html","title":"阿里云官网购买实例","lang":"zh-CN","frontmatter":{},"headers":[],"git":{"updatedTime":1656919280000},"filePathRelative":"zh/deploying/deploy-official.md"}');export{e as data}; diff --git a/assets/deploy-official.html-d090475d.js b/assets/deploy-official.html-d090475d.js new file mode 100644 index 00000000000..0576273642a --- /dev/null +++ b/assets/deploy-official.html-d090475d.js @@ -0,0 +1 @@ +import{_ as a,r as t,o as n,c,a as o,b as e,d as l}from"./app-3d1677bf.js";const s={},d=o("h1",{id:"阿里云官网购买实例",tabindex:"-1"},[o("a",{class:"header-anchor",href:"#阿里云官网购买实例","aria-hidden":"true"},"#"),e(" 阿里云官网购买实例")],-1),_={href:"https://www.aliyun.com/product/polardb",target:"_blank",rel:"noopener noreferrer"};function i(f,p){const 
r=t("ExternalLinkIcon");return n(),c("div",null,[d,o("p",null,[e("阿里云官网直接提供了可供购买的 "),o("a",_,[e("云原生关系型数据库 PolarDB PostgreSQL 引擎"),l(r)]),e("。")])])}const m=a(s,[["render",i],["__file","deploy-official.html.vue"]]);export{m as default}; diff --git a/assets/deploy-official.html-efd51867.js b/assets/deploy-official.html-efd51867.js new file mode 100644 index 00000000000..7c8a6dff8cd --- /dev/null +++ b/assets/deploy-official.html-efd51867.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-bca378d6","path":"/deploying/deploy-official.html","title":"阿里云官网购买实例","lang":"en-US","frontmatter":{},"headers":[],"git":{"updatedTime":1656919280000},"filePathRelative":"deploying/deploy-official.md"}');export{e as data}; diff --git a/assets/deploy-stack.html-883c5c20.js b/assets/deploy-stack.html-883c5c20.js new file mode 100644 index 00000000000..bf52ec13fd9 --- /dev/null +++ b/assets/deploy-stack.html-883c5c20.js @@ -0,0 +1 @@ +import{_ as r,r as t,o as c,c as l,a,b as e,d as s}from"./app-3d1677bf.js";const n="/PolarDB-for-PostgreSQL/assets/63-PolarDBStack-arch-88440a72.png",_={},d=a("h1",{id:"基于-polardb-stack-共享存储",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#基于-polardb-stack-共享存储","aria-hidden":"true"},"#"),e(" 基于 PolarDB Stack 共享存储")],-1),i=a("p",null,"PolarDB Stack 是轻量级 PolarDB PaaS 软件。基于共享存储提供一写多读的 PolarDB 数据库服务,特别定制和深度优化了数据库生命周期管理。通过 PolarDB Stack 可以一键部署 PolarDB-for-PostgreSQL 内核和 PolarDB-FileSystem。",-1),p={href:"https://github.com/ApsaraDB/PolarDB-Stack-Operator/blob/master/README.md",target:"_blank",rel:"noopener noreferrer"},h=a("p",null,[a("img",{src:n,alt:"PolarDB Stack arch"})],-1);function k(B,P){const o=t("ExternalLinkIcon");return c(),l("div",null,[d,i,a("p",null,[e("PolarDB Stack 架构如下图所示,进入 "),a("a",p,[e("PolarDB Stack 的部署文档"),s(o)])]),h])}const m=r(_,[["render",k],["__file","deploy-stack.html.vue"]]);export{m as default}; diff --git a/assets/deploy-stack.html-9812b946.js b/assets/deploy-stack.html-9812b946.js new file mode 100644 index 
00000000000..248c88588e4 --- /dev/null +++ b/assets/deploy-stack.html-9812b946.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-097f9dea","path":"/deploying/deploy-stack.html","title":"基于 PolarDB Stack 共享存储","lang":"en-US","frontmatter":{},"headers":[],"git":{"updatedTime":1656919280000},"filePathRelative":"deploying/deploy-stack.md"}');export{e as data}; diff --git a/assets/deploy-stack.html-b9d4cc47.js b/assets/deploy-stack.html-b9d4cc47.js new file mode 100644 index 00000000000..bf52ec13fd9 --- /dev/null +++ b/assets/deploy-stack.html-b9d4cc47.js @@ -0,0 +1 @@ +import{_ as r,r as t,o as c,c as l,a,b as e,d as s}from"./app-3d1677bf.js";const n="/PolarDB-for-PostgreSQL/assets/63-PolarDBStack-arch-88440a72.png",_={},d=a("h1",{id:"基于-polardb-stack-共享存储",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#基于-polardb-stack-共享存储","aria-hidden":"true"},"#"),e(" 基于 PolarDB Stack 共享存储")],-1),i=a("p",null,"PolarDB Stack 是轻量级 PolarDB PaaS 软件。基于共享存储提供一写多读的 PolarDB 数据库服务,特别定制和深度优化了数据库生命周期管理。通过 PolarDB Stack 可以一键部署 PolarDB-for-PostgreSQL 内核和 PolarDB-FileSystem。",-1),p={href:"https://github.com/ApsaraDB/PolarDB-Stack-Operator/blob/master/README.md",target:"_blank",rel:"noopener noreferrer"},h=a("p",null,[a("img",{src:n,alt:"PolarDB Stack arch"})],-1);function k(B,P){const o=t("ExternalLinkIcon");return c(),l("div",null,[d,i,a("p",null,[e("PolarDB Stack 架构如下图所示,进入 "),a("a",p,[e("PolarDB Stack 的部署文档"),s(o)])]),h])}const m=r(_,[["render",k],["__file","deploy-stack.html.vue"]]);export{m as default}; diff --git a/assets/deploy-stack.html-d9f23f36.js b/assets/deploy-stack.html-d9f23f36.js new file mode 100644 index 00000000000..5801b9ecb05 --- /dev/null +++ b/assets/deploy-stack.html-d9f23f36.js @@ -0,0 +1 @@ +const t=JSON.parse('{"key":"v-3dba534a","path":"/zh/deploying/deploy-stack.html","title":"基于 PolarDB Stack 共享存储","lang":"zh-CN","frontmatter":{},"headers":[],"git":{"updatedTime":1656919280000},"filePathRelative":"zh/deploying/deploy-stack.md"}');export{t as data}; diff --git 
a/assets/deploy.html-2951b18a.js b/assets/deploy.html-2951b18a.js new file mode 100644 index 00000000000..4eb7061ae2e --- /dev/null +++ b/assets/deploy.html-2951b18a.js @@ -0,0 +1 @@ +import{_ as u,r as s,o as c,c as i,d as l,a as t,b as e,w as n}from"./app-3d1677bf.js";const h={},p=t("h1",{id:"进阶部署",tabindex:"-1"},[t("a",{class:"header-anchor",href:"#进阶部署","aria-hidden":"true"},"#"),e(" 进阶部署")],-1),f=t("p",null,"部署 PolarDB for PostgreSQL 需要在以下三个层面上做准备:",-1),m=t("li",null,[t("strong",null,"块存储设备层"),e(":用于提供存储介质。可以是单个物理块存储设备(本地存储),也可以是多个物理块设备构成的分布式块存储。")],-1),g=t("strong",null,"文件系统层",-1),S={href:"https://github.com/ApsaraDB/PolarDB-FileSystem",target:"_blank",rel:"noopener noreferrer"},b=t("li",null,[t("strong",null,"数据库层"),e(":PolarDB for PostgreSQL 的编译和部署环境。")],-1),y=t("p",null,"以下表格给出了三个层次排列组合出的的不同实践方式,其中的步骤包含:",-1),P=t("ul",null,[t("li",null,"存储层:块存储设备的准备"),t("li",null,"文件系统:PolarDB File System 的编译、挂载"),t("li",null,"数据库层:PolarDB for PostgreSQL 各集群形态的编译部署")],-1),v={href:"https://hub.docker.com/r/polardb/polardb_pg_devel/tags",target:"_blank",rel:"noopener noreferrer"},B=t("thead",null,[t("tr",null,[t("th"),t("th",null,"块存储"),t("th",null,"文件系统")])],-1),D=t("td",null,"本地 SSD",-1),k=t("td",null,"本地文件系统(如 ext4)",-1),x={href:"https://developer.aliyun.com/live/249628"},F=t("td",null,"阿里云 ECS + ESSD 云盘",-1),L=t("td",null,"PFS",-1),C={href:"https://developer.aliyun.com/live/250218"},E={href:"https://opencurve.io/Curve/HOME",target:"_blank",rel:"noopener noreferrer"},z={href:"https://github.com/opencurve/PolarDB-FileSystem",target:"_blank",rel:"noopener noreferrer"},I=t("td",null,"Ceph 共享存储",-1),N=t("td",null,"PFS",-1),Q=t("td",null,"NBD 共享存储",-1),A=t("td",null,"PFS",-1);function V(d,w){const _=s("ArticleInfo"),r=s("ExternalLinkIcon"),o=s("RouterLink"),a=s("Badge");return c(),i("div",null,[p,l(_,{frontmatter:d.$frontmatter},null,8,["frontmatter"]),f,t("ol",null,[m,t("li",null,[g,e(":由于 PostgreSQL 将数据存储在文件中,因此需要在块存储设备上架设文件系统。根据底层块存储设备的不同,可以选用单机文件系统(如 ext4)或分布式文件系统 
"),t("a",S,[e("PolarDB File System(PFS)"),l(r)]),e("。")]),b]),y,P,t("p",null,[e("我们强烈推荐使用发布在 DockerHub 上的 "),t("a",v,[e("PolarDB 开发镜像"),l(r)]),e(" 来完成实践!开发镜像中已经包含了文件系统层和数据库层所需要安装的所有依赖,无需手动安装。")]),t("table",null,[B,t("tbody",null,[t("tr",null,[t("td",null,[l(o,{to:"/zh/deploying/db-localfs.html"},{default:n(()=>[e("实践 1(极简本地部署)")]),_:1})]),D,k]),t("tr",null,[t("td",null,[l(o,{to:"/zh/deploying/storage-aliyun-essd.html"},{default:n(()=>[e("实践 2(生产环境最佳实践)")]),_:1}),e(),t("a",x,[l(a,{type:"tip",text:"视频",vertical:"top"})])]),F,L]),t("tr",null,[t("td",null,[l(o,{to:"/zh/deploying/storage-curvebs.html"},{default:n(()=>[e("实践 3(生产环境最佳实践)")]),_:1}),e(),t("a",C,[l(a,{type:"tip",text:"视频",vertical:"top"})])]),t("td",null,[t("a",E,[e("CurveBS"),l(r)]),e(" 共享存储")]),t("td",null,[t("a",z,[e("PFS for Curve"),l(r)])])]),t("tr",null,[t("td",null,[l(o,{to:"/zh/deploying/storage-ceph.html"},{default:n(()=>[e("实践 4")]),_:1})]),I,N]),t("tr",null,[t("td",null,[l(o,{to:"/zh/deploying/storage-nbd.html"},{default:n(()=>[e("实践 5")]),_:1})]),Q,A])])])])}const R=u(h,[["render",V],["__file","deploy.html.vue"]]);export{R as default}; diff --git a/assets/deploy.html-42673f52.js b/assets/deploy.html-42673f52.js new file mode 100644 index 00000000000..a3ef16e54ac --- /dev/null +++ b/assets/deploy.html-42673f52.js @@ -0,0 +1 @@ +import{_ as u,r as s,o as c,c as i,d as l,a as t,b as e,w as n}from"./app-3d1677bf.js";const h={},p=t("h1",{id:"进阶部署",tabindex:"-1"},[t("a",{class:"header-anchor",href:"#进阶部署","aria-hidden":"true"},"#"),e(" 进阶部署")],-1),f=t("p",null,"部署 PolarDB for PostgreSQL 需要在以下三个层面上做准备:",-1),m=t("li",null,[t("strong",null,"块存储设备层"),e(":用于提供存储介质。可以是单个物理块存储设备(本地存储),也可以是多个物理块设备构成的分布式块存储。")],-1),g=t("strong",null,"文件系统层",-1),S={href:"https://github.com/ApsaraDB/PolarDB-FileSystem",target:"_blank",rel:"noopener noreferrer"},b=t("li",null,[t("strong",null,"数据库层"),e(":PolarDB for PostgreSQL 
的编译和部署环境。")],-1),y=t("p",null,"以下表格给出了三个层次排列组合出的的不同实践方式,其中的步骤包含:",-1),P=t("ul",null,[t("li",null,"存储层:块存储设备的准备"),t("li",null,"文件系统:PolarDB File System 的编译、挂载"),t("li",null,"数据库层:PolarDB for PostgreSQL 各集群形态的编译部署")],-1),v={href:"https://hub.docker.com/r/polardb/polardb_pg_devel/tags",target:"_blank",rel:"noopener noreferrer"},B=t("thead",null,[t("tr",null,[t("th"),t("th",null,"块存储"),t("th",null,"文件系统")])],-1),D=t("td",null,"本地 SSD",-1),k=t("td",null,"本地文件系统(如 ext4)",-1),x={href:"https://developer.aliyun.com/live/249628"},F=t("td",null,"阿里云 ECS + ESSD 云盘",-1),L=t("td",null,"PFS",-1),C={href:"https://developer.aliyun.com/live/250218"},E={href:"https://opencurve.io/Curve/HOME",target:"_blank",rel:"noopener noreferrer"},I={href:"https://github.com/opencurve/PolarDB-FileSystem",target:"_blank",rel:"noopener noreferrer"},N=t("td",null,"Ceph 共享存储",-1),Q=t("td",null,"PFS",-1),A=t("td",null,"NBD 共享存储",-1),V=t("td",null,"PFS",-1);function w(d,H){const _=s("ArticleInfo"),r=s("ExternalLinkIcon"),o=s("RouterLink"),a=s("Badge");return c(),i("div",null,[p,l(_,{frontmatter:d.$frontmatter},null,8,["frontmatter"]),f,t("ol",null,[m,t("li",null,[g,e(":由于 PostgreSQL 将数据存储在文件中,因此需要在块存储设备上架设文件系统。根据底层块存储设备的不同,可以选用单机文件系统(如 ext4)或分布式文件系统 "),t("a",S,[e("PolarDB File System(PFS)"),l(r)]),e("。")]),b]),y,P,t("p",null,[e("我们强烈推荐使用发布在 DockerHub 上的 "),t("a",v,[e("PolarDB 开发镜像"),l(r)]),e(" 来完成实践!开发镜像中已经包含了文件系统层和数据库层所需要安装的所有依赖,无需手动安装。")]),t("table",null,[B,t("tbody",null,[t("tr",null,[t("td",null,[l(o,{to:"/deploying/db-localfs.html"},{default:n(()=>[e("实践 1(极简本地部署)")]),_:1})]),D,k]),t("tr",null,[t("td",null,[l(o,{to:"/deploying/storage-aliyun-essd.html"},{default:n(()=>[e("实践 2(生产环境最佳实践)")]),_:1}),e(),t("a",x,[l(a,{type:"tip",text:"视频",vertical:"top"})])]),F,L]),t("tr",null,[t("td",null,[l(o,{to:"/deploying/storage-curvebs.html"},{default:n(()=>[e("实践 3(生产环境最佳实践)")]),_:1}),e(),t("a",C,[l(a,{type:"tip",text:"视频",vertical:"top"})])]),t("td",null,[t("a",E,[e("CurveBS"),l(r)]),e(" 
共享存储")]),t("td",null,[t("a",I,[e("PFS for Curve"),l(r)])])]),t("tr",null,[t("td",null,[l(o,{to:"/deploying/storage-ceph.html"},{default:n(()=>[e("实践 4")]),_:1})]),N,Q]),t("tr",null,[t("td",null,[l(o,{to:"/deploying/storage-nbd.html"},{default:n(()=>[e("实践 5")]),_:1})]),A,V])])])])}const M=u(h,[["render",w],["__file","deploy.html.vue"]]);export{M as default}; diff --git a/assets/deploy.html-523aee49.js b/assets/deploy.html-523aee49.js new file mode 100644 index 00000000000..68ec4acb8a3 --- /dev/null +++ b/assets/deploy.html-523aee49.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-ccde9c94","path":"/zh/deploying/deploy.html","title":"进阶部署","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/05/09","minute":10},"headers":[],"git":{"updatedTime":1690894847000},"filePathRelative":"zh/deploying/deploy.md"}');export{e as data}; diff --git a/assets/deploy.html-d61ba66a.js b/assets/deploy.html-d61ba66a.js new file mode 100644 index 00000000000..0fb6cee4191 --- /dev/null +++ b/assets/deploy.html-d61ba66a.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-4a7bdef6","path":"/deploying/deploy.html","title":"进阶部署","lang":"en-US","frontmatter":{"author":"棠羽","date":"2022/05/09","minute":10},"headers":[],"git":{"updatedTime":1690894847000},"filePathRelative":"deploying/deploy.md"}');export{e as data}; diff --git a/assets/dev-on-docker.html-36d2d71c.js b/assets/dev-on-docker.html-36d2d71c.js new file mode 100644 index 00000000000..4d1238f2ae8 --- /dev/null +++ b/assets/dev-on-docker.html-36d2d71c.js @@ -0,0 +1,31 @@ +import{_ as c,r as t,o as i,c as p,a,b as e,d as n,w as r,e as l}from"./app-3d1677bf.js";const u={},h=l('

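Developing in a Docker Container

DANGER

For ease of use, the postgres user inside the container has no password set; this is intended for evaluation only. In production or any other security-sensitive environment, be sure to set a strong password!

Download the Source Code on the Development Machine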

',3),b={href:"https://github.com/ApsaraDB/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},v=a("code",null,"POLARDB_11_STABLE",-1),_={href:"https://gitee.com/mirrors/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},m=a("div",{class:"language-bash","data-ext":"sh"},[a("pre",{class:"language-bash"},[a("code",null,[a("span",{class:"token function"},"git"),e(" clone "),a("span",{class:"token parameter variable"},"-b"),e(` POLARDB_11_STABLE https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git +`)])])],-1),g=a("div",{class:"language-bash","data-ext":"sh"},[a("pre",{class:"language-bash"},[a("code",null,[a("span",{class:"token function"},"git"),e(" clone "),a("span",{class:"token parameter variable"},"-b"),e(` POLARDB_11_STABLE https://gitee.com/mirrors/PolarDB-for-PostgreSQL +`)])])],-1),k=l(`

After the clone completes, enter the source directory:

cd PolarDB-for-PostgreSQL/
+

Pull the Development Image

`,3),f={href:"https://hub.docker.com/r/polardb/polardb_pg_devel/tags",target:"_blank",rel:"noopener noreferrer"},x=l(`
# Pull the PolarDB development image
+docker pull polardb/polardb_pg_devel
+

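Create and Run the Container

At this point we are in the source directory on the development machine. Create a container from the development image and mount the current directory into it as a volume, so that you can:

  • Compile the source code in the environment inside the container
  • View or modify the code with an editor outside the container (on the development machine)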
docker run -it \\
+    -v $PWD:/home/postgres/polardb_pg \\
+    --shm-size=512m --cap-add=SYS_PTRACE --privileged=true \\
+    --name polardb_pg_devel \\
+    polardb/polardb_pg_devel \\
+    bash
+

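Once inside the container, grant the container user permissions on the source directory, then build and deploy a PolarDB-PG instance.

# Grant permissions, then build and deploy
+cd polardb_pg
+sudo chmod -R a+wr ./
+sudo chown -R postgres:postgres ./
+./polardb_build.sh
+
+# Verify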
+psql -h 127.0.0.1 -c 'select version();'
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

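Build and Test Options

The following table lists options that may be used when building, initializing, or testing a PolarDB-PG cluster. For more options and their descriptions, see the polardb_build.sh script in the source directory.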

`,9),P=a("thead",null,[a("tr",null,[a("th",null,"选项"),a("th",null,"描述"),a("th",null,"默认值")])],-1),D=a("tr",null,[a("td",null,[a("code",null,"--withrep")]),a("td",null,"是否初始化只读节点"),a("td",null,[a("code",null,"NO")])],-1),B=a("tr",null,[a("td",null,[a("code",null,"--repnum")]),a("td",null,"只读节点数量"),a("td",null,[a("code",null,"1")])],-1),w=a("tr",null,[a("td",null,[a("code",null,"--withstandby")]),a("td",null,"是否初始化热备份节点"),a("td",null,[a("code",null,"NO")])],-1),A=a("tr",null,[a("td",null,[a("code",null,"--initpx")]),a("td",null,"是否初始化为 HTAP 集群(1 个读写节点,2 个只读节点)"),a("td",null,[a("code",null,"NO")])],-1),L=a("tr",null,[a("td",null,[a("code",null,"--with-pfsd")]),a("td",null,"是否编译 PolarDB File System(PFS)相关功能"),a("td",null,[a("code",null,"NO")])],-1),N=a("td",null,[a("code",null,"--with-tde")],-1),O={href:"https://zhuanlan.zhihu.com/p/84829027",target:"_blank",rel:"noopener noreferrer"},S=a("td",null,[a("code",null,"NO")],-1),G=a("tr",null,[a("td",null,[a("code",null,"--with-dma")]),a("td",null,"是否初始化为 DMA(Data Max Availability)高可用三节点集群"),a("td",null,[a("code",null,"NO")])],-1),T=a("tr",null,[a("td",null,[a("code",null,"-r"),e("/ "),a("code",null,"-t"),e(" / "),a("br"),a("code",null,"--regress")]),a("td",null,"在编译安装完毕后运行内核回归测试"),a("td",null,[a("code",null,"NO")])],-1),E=a("tr",null,[a("td",null,[a("code",null,"-r-px")]),a("td",null,"运行 HTAP 实例的回归测试"),a("td",null,[a("code",null,"NO")])],-1),H=a("tr",null,[a("td",null,[a("code",null,"-e"),e(" /"),a("br"),a("code",null,"--extension")]),a("td",null,"运行扩展插件测试"),a("td",null,[a("code",null,"NO")])],-1),Q=a("tr",null,[a("td",null,[a("code",null,"-r-external")]),a("td",null,[e("测试 "),a("code",null,"external/"),e(" 下的扩展插件")]),a("td",null,[a("code",null,"NO")])],-1),R=a("tr",null,[a("td",null,[a("code",null,"-r-contrib")]),a("td",null,[e("测试 "),a("code",null,"contrib/"),e(" 下的扩展插件")]),a("td",null,[a("code",null,"NO")])],-1),C=a("tr",null,[a("td",null,[a("code",null,"-r-pl")]),a("td",null,[e("测试 "),a("code",null,"src/pl/"),e(" 
下的扩展插件")]),a("td",null,[a("code",null,"NO")])],-1),y=l(`

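If you have no custom requirements, you can use the options below to build, deploy, and test PolarDB-PG clusters in different topologies.

Building and Deploying PolarDB-PG Topologies

Local Single-Node Instance

  • 1 read-write (primary) node (running on port 5432)
./polardb_build.sh
+

Local Multi-Node Instance

  • 1 read-write (primary) node (running on port 5432)
  • 1 read-only node (running on port 5433)
./polardb_build.sh --withrep --repnum=1
+

Local Multi-Node Instance with a Standby

  • 1 read-write (primary) node (running on port 5432)
  • 1 read-only node (running on port 5433)
  • 1 standby node (running on port 5434)
./polardb_build.sh --withrep --repnum=1 --withstandby
+

Local Multi-Node HTAP Instance

  • 1 read-write (primary) node (running on port 5432)
  • 2 read-only nodes (running on ports 5433 / 5434)
./polardb_build.sh --initpx
+

Instance Regression Tests

Regression tests for a regular instance:

./polardb_build.sh --withrep -r -e -r-external -r-contrib -r-pl --with-tde
+

Regression tests for an HTAP instance:

./polardb_build.sh -r-px -e -r-external -r-contrib -r-pl --with-tde
+

Regression tests for a DMA instance: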

./polardb_build.sh -r -e -r-external -r-contrib -r-pl --with-tde --with-dma
+
`,21);function I(V,z){const s=t("ExternalLinkIcon"),o=t("CodeGroupItem"),d=t("CodeGroup");return i(),p("div",null,[h,a("p",null,[e("从 "),a("a",b,[e("GitHub"),n(s)]),e(" 上下载 PolarDB for PostgreSQL 的源代码,稳定分支为 "),v,e("。如果因网络原因不能稳定访问 GitHub,则可以访问 "),a("a",_,[e("Gitee 国内镜像"),n(s)]),e("。")]),n(d,null,{default:r(()=>[n(o,{title:"GitHub"},{default:r(()=>[m]),_:1}),n(o,{title:"Gitee 国内镜像"},{default:r(()=>[g]),_:1})]),_:1}),k,a("p",null,[e("从 DockerHub 上拉取 PolarDB for PostgreSQL 的 "),a("a",f,[e("开发镜像"),n(s)]),e("。")]),x,a("table",null,[P,a("tbody",null,[D,B,w,A,L,a("tr",null,[N,a("td",null,[e("是否初始化 "),a("a",O,[e("透明数据加密(TDE)"),n(s)]),e(" 功能")]),S]),G,T,E,H,Q,R,C])]),y])}const F=c(u,[["render",I],["__file","dev-on-docker.html.vue"]]);export{F as default}; diff --git a/assets/dev-on-docker.html-c045aaf0.js b/assets/dev-on-docker.html-c045aaf0.js new file mode 100644 index 00000000000..0699dd0b7c1 --- /dev/null +++ b/assets/dev-on-docker.html-c045aaf0.js @@ -0,0 +1,31 @@ +import{_ as c,r as t,o as i,c as p,a,b as e,d as n,w as r,e as l}from"./app-3d1677bf.js";const u={},h=l('

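Developing in a Docker Container

WARNING

For ease of use, the postgres user inside the container has no password set; this is intended for evaluation only. In production or any other security-sensitive environment, be sure to set a strong password!

Download the Source Code on the Development Machine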

',3),b={href:"https://github.com/ApsaraDB/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},v=a("code",null,"POLARDB_11_STABLE",-1),_={href:"https://gitee.com/mirrors/PolarDB-for-PostgreSQL",target:"_blank",rel:"noopener noreferrer"},m=a("div",{class:"language-bash","data-ext":"sh"},[a("pre",{class:"language-bash"},[a("code",null,[a("span",{class:"token function"},"git"),e(" clone "),a("span",{class:"token parameter variable"},"-b"),e(` POLARDB_11_STABLE https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git +`)])])],-1),g=a("div",{class:"language-bash","data-ext":"sh"},[a("pre",{class:"language-bash"},[a("code",null,[a("span",{class:"token function"},"git"),e(" clone "),a("span",{class:"token parameter variable"},"-b"),e(` POLARDB_11_STABLE https://gitee.com/mirrors/PolarDB-for-PostgreSQL +`)])])],-1),k=l(`

After the clone completes, enter the source directory:

cd PolarDB-for-PostgreSQL/
+

Pull the Development Image

`,3),f={href:"https://hub.docker.com/r/polardb/polardb_pg_devel/tags",target:"_blank",rel:"noopener noreferrer"},x=l(`
# Pull the PolarDB development image
+docker pull polardb/polardb_pg_devel
+

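Create and Run the Container

At this point we are in the source directory on the development machine. Create a container from the development image and mount the current directory into it as a volume, so that you can:

  • Compile the source code in the environment inside the container
  • View or modify the code with an editor outside the container (on the development machine)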
docker run -it \\
+    -v $PWD:/home/postgres/polardb_pg \\
+    --shm-size=512m --cap-add=SYS_PTRACE --privileged=true \\
+    --name polardb_pg_devel \\
+    polardb/polardb_pg_devel \\
+    bash
+

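Once inside the container, grant the container user permissions on the source directory, then build and deploy a PolarDB-PG instance.

# Grant permissions, then build and deploy
+cd polardb_pg
+sudo chmod -R a+wr ./
+sudo chown -R postgres:postgres ./
+./polardb_build.sh
+
+# Verify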
+psql -h 127.0.0.1 -c 'select version();'
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

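Build and Test Options

The following table lists options that may be used when building, initializing, or testing a PolarDB-PG cluster, along with their descriptions. For more options and their descriptions, see the polardb_build.sh script in the source directory.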

`,9),P=a("thead",null,[a("tr",null,[a("th",null,"选项"),a("th",null,"描述"),a("th",null,"默认值")])],-1),D=a("tr",null,[a("td",null,[a("code",null,"--withrep")]),a("td",null,"是否初始化只读节点"),a("td",null,[a("code",null,"NO")])],-1),B=a("tr",null,[a("td",null,[a("code",null,"--repnum")]),a("td",null,"只读节点数量"),a("td",null,[a("code",null,"1")])],-1),w=a("tr",null,[a("td",null,[a("code",null,"--withstandby")]),a("td",null,"是否初始化热备份节点"),a("td",null,[a("code",null,"NO")])],-1),A=a("tr",null,[a("td",null,[a("code",null,"--initpx")]),a("td",null,"是否初始化为 HTAP 集群(1 个读写节点,2 个只读节点)"),a("td",null,[a("code",null,"NO")])],-1),L=a("tr",null,[a("td",null,[a("code",null,"--with-pfsd")]),a("td",null,"是否编译 PolarDB File System(PFS)相关功能"),a("td",null,[a("code",null,"NO")])],-1),O=a("td",null,[a("code",null,"--with-tde")],-1),N={href:"https://zhuanlan.zhihu.com/p/84829027",target:"_blank",rel:"noopener noreferrer"},S=a("td",null,[a("code",null,"NO")],-1),G=a("tr",null,[a("td",null,[a("code",null,"--with-dma")]),a("td",null,"是否初始化为 DMA(Data Max Availability)高可用三节点集群"),a("td",null,[a("code",null,"NO")])],-1),T=a("tr",null,[a("td",null,[a("code",null,"-r"),e("/ "),a("code",null,"-t"),e(" / "),a("br"),a("code",null,"--regress")]),a("td",null,"在编译安装完毕后运行内核回归测试"),a("td",null,[a("code",null,"NO")])],-1),E=a("tr",null,[a("td",null,[a("code",null,"-r-px")]),a("td",null,"运行 HTAP 实例的回归测试"),a("td",null,[a("code",null,"NO")])],-1),H=a("tr",null,[a("td",null,[a("code",null,"-e"),e(" /"),a("br"),a("code",null,"--extension")]),a("td",null,"运行扩展插件测试"),a("td",null,[a("code",null,"NO")])],-1),Q=a("tr",null,[a("td",null,[a("code",null,"-r-external")]),a("td",null,[e("测试 "),a("code",null,"external/"),e(" 下的扩展插件")]),a("td",null,[a("code",null,"NO")])],-1),C=a("tr",null,[a("td",null,[a("code",null,"-r-contrib")]),a("td",null,[e("测试 "),a("code",null,"contrib/"),e(" 下的扩展插件")]),a("td",null,[a("code",null,"NO")])],-1),R=a("tr",null,[a("td",null,[a("code",null,"-r-pl")]),a("td",null,[e("测试 "),a("code",null,"src/pl/"),e(" 
下的扩展插件")]),a("td",null,[a("code",null,"NO")])],-1),y=l(`

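If you have no custom requirements, you can use the options given below to build, deploy, and test PolarDB-PG clusters in different configurations.

Building and Deploying PolarDB-PG in Different Configurations

Local Single-Node Instance

  • 1 read-write node (running on port 5432)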
./polardb_build.sh
+

Local Multi-Node Instance

  • 1 read-write node (running on port 5432)
  • 1 read-only node (running on port 5433)
./polardb_build.sh --withrep --repnum=1
+

Local Multi-Node Instance with a Standby

  • 1 read-write node (running on port 5432)
  • 1 read-only node (running on port 5433)
  • 1 standby node (running on port 5434)
./polardb_build.sh --withrep --repnum=1 --withstandby
+

Local Multi-Node HTAP Instance

  • 1 read-write node (running on port 5432)
  • 2 read-only nodes (running on ports 5433 / 5434)
./polardb_build.sh --initpx
+

Instance Regression Tests

Regression tests for a regular instance:

./polardb_build.sh --withrep -r -e -r-external -r-contrib -r-pl --with-tde
+

Regression tests for an HTAP instance:

./polardb_build.sh -r-px -e -r-external -r-contrib -r-pl --with-tde
+

Regression tests for a DMA instance:

./polardb_build.sh -r -e -r-external -r-contrib -r-pl --with-tde --with-dma
+
`,21);function I(V,z){const s=t("ExternalLinkIcon"),o=t("CodeGroupItem"),d=t("CodeGroup");return i(),p("div",null,[h,a("p",null,[e("从 "),a("a",b,[e("GitHub"),n(s)]),e(" 上下载 PolarDB for PostgreSQL 的源代码,稳定分支为 "),v,e("。如果因网络原因不能稳定访问 GitHub,则可以访问 "),a("a",_,[e("Gitee 国内镜像"),n(s)]),e("。")]),n(d,null,{default:r(()=>[n(o,{title:"GitHub"},{default:r(()=>[m]),_:1}),n(o,{title:"Gitee 国内镜像"},{default:r(()=>[g]),_:1})]),_:1}),k,a("p",null,[e("从 DockerHub 上拉取 PolarDB for PostgreSQL 的 "),a("a",f,[e("开发镜像"),n(s)]),e("。")]),x,a("table",null,[P,a("tbody",null,[D,B,w,A,L,a("tr",null,[O,a("td",null,[e("是否初始化 "),a("a",N,[e("透明数据加密(TDE)"),n(s)]),e(" 功能")]),S]),G,T,E,H,Q,C,R])]),y])}const F=c(u,[["render",I],["__file","dev-on-docker.html.vue"]]);export{F as default}; diff --git a/assets/dev-on-docker.html-efa784b2.js b/assets/dev-on-docker.html-efa784b2.js new file mode 100644 index 00000000000..0b7531bfbf3 --- /dev/null +++ b/assets/dev-on-docker.html-efa784b2.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-2a8fa310","path":"/development/dev-on-docker.html","title":"基于 Docker 容器开发","lang":"en-US","frontmatter":{},"headers":[{"level":2,"title":"在开发机器上下载源代码","slug":"在开发机器上下载源代码","link":"#在开发机器上下载源代码","children":[]},{"level":2,"title":"拉取开发镜像","slug":"拉取开发镜像","link":"#拉取开发镜像","children":[]},{"level":2,"title":"创建并运行容器","slug":"创建并运行容器","link":"#创建并运行容器","children":[]},{"level":2,"title":"编译测试选项说明","slug":"编译测试选项说明","link":"#编译测试选项说明","children":[]},{"level":2,"title":"PolarDB-PG 各形态编译部署","slug":"polardb-pg-各形态编译部署","link":"#polardb-pg-各形态编译部署","children":[{"level":3,"title":"本地单节点实例","slug":"本地单节点实例","link":"#本地单节点实例","children":[]},{"level":3,"title":"本地多节点实例","slug":"本地多节点实例","link":"#本地多节点实例","children":[]},{"level":3,"title":"本地多节点带备库实例","slug":"本地多节点带备库实例","link":"#本地多节点带备库实例","children":[]},{"level":3,"title":"本地多节点 HTAP 
实例","slug":"本地多节点-htap-实例","link":"#本地多节点-htap-实例","children":[]}]},{"level":2,"title":"实例回归测试","slug":"实例回归测试","link":"#实例回归测试","children":[]}],"git":{"updatedTime":1690894847000},"filePathRelative":"development/dev-on-docker.md"}');export{l as data}; diff --git a/assets/dev-on-docker.html-fe137802.js b/assets/dev-on-docker.html-fe137802.js new file mode 100644 index 00000000000..ff39c7b93d5 --- /dev/null +++ b/assets/dev-on-docker.html-fe137802.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-25b4c8ff","path":"/zh/development/dev-on-docker.html","title":"基于 Docker 容器开发","lang":"zh-CN","frontmatter":{},"headers":[{"level":2,"title":"在开发机器上下载源代码","slug":"在开发机器上下载源代码","link":"#在开发机器上下载源代码","children":[]},{"level":2,"title":"拉取开发镜像","slug":"拉取开发镜像","link":"#拉取开发镜像","children":[]},{"level":2,"title":"创建并运行容器","slug":"创建并运行容器","link":"#创建并运行容器","children":[]},{"level":2,"title":"编译测试选项说明","slug":"编译测试选项说明","link":"#编译测试选项说明","children":[]},{"level":2,"title":"PolarDB-PG 各形态编译部署","slug":"polardb-pg-各形态编译部署","link":"#polardb-pg-各形态编译部署","children":[{"level":3,"title":"本地单节点实例","slug":"本地单节点实例","link":"#本地单节点实例","children":[]},{"level":3,"title":"本地多节点实例","slug":"本地多节点实例","link":"#本地多节点实例","children":[]},{"level":3,"title":"本地多节点带备库实例","slug":"本地多节点带备库实例","link":"#本地多节点带备库实例","children":[]},{"level":3,"title":"本地多节点 HTAP 实例","slug":"本地多节点-htap-实例","link":"#本地多节点-htap-实例","children":[]}]},{"level":2,"title":"实例回归测试","slug":"实例回归测试","link":"#实例回归测试","children":[]}],"git":{"updatedTime":1690894847000},"filePathRelative":"zh/development/dev-on-docker.md"}');export{l as data}; diff --git a/assets/docsearch-1d421ddb.js b/assets/docsearch-1d421ddb.js new file mode 100644 index 00000000000..182023d2f3f --- /dev/null +++ b/assets/docsearch-1d421ddb.js @@ -0,0 +1,2 @@ +const i=`@media (min-width: 751px){#docsearch-container{min-width:171.36px}}@media (max-width: 750px){.DocSearch-Container{position:fixed}#docsearch-container{min-width:52px}}@media 
print{#docsearch-container{display:none}} +`;export{i as default}; diff --git a/assets/epq-create-btree-index.html-86dfc866.js b/assets/epq-create-btree-index.html-86dfc866.js new file mode 100644 index 00000000000..a408ebf44b3 --- /dev/null +++ b/assets/epq-create-btree-index.html-86dfc866.js @@ -0,0 +1,15 @@ +import{_ as i,r as t,o as u,c as k,d as a,a as n,w as o,b as s,e as l}from"./app-3d1677bf.js";const _={},h=n("h1",{id:"epq-支持创建-b-tree-索引并行加速",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#epq-支持创建-b-tree-索引并行加速","aria-hidden":"true"},"#"),s(" ePQ 支持创建 B-Tree 索引并行加速")],-1),E={class:"table-of-contents"},w=n("h2",{id:"背景",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#背景","aria-hidden":"true"},"#"),s(" 背景")],-1),T=n("p",null,"在使用 PostgreSQL 时,如果想要在一张表中查询符合某个条件的行,默认情况下需要扫描整张表的数据,然后对每一行数据依次判断过滤条件。如果符合条件的行数非常少,而表的数据总量非常大,这显然是一个非常低效的操作。与阅读书籍类似,想要阅读某个特定的章节时,读者通常会通过书籍开头处的索引查询到对应章节的页码,然后直接从指定的页码开始阅读;在数据库中,通常会对被频繁查找的列创建索引,以避免进行开销极大的全表扫描:通过索引可以精确定位到被查找的数据位于哪些数据页面上。",-1),x={href:"https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-BTREE",target:"_blank",rel:"noopener noreferrer"},g=l(`
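  1. Sequentially scan every row in the table
  2. Sort the physical location of each row in the table by the values of the columns being indexed (the scan key)
  3. Build index tuples, organize them into a B-Tree structure, and write them to index pages

PostgreSQL supports parallel index builds (multi-process scan and sort) and concurrent index builds (which do not block DML), but an index build can only use the resources of a single compute node.

The ePQ (elastic cross-node parallel query) feature of PolarDB-PG can accelerate the creation of B-Tree indexes. ePQ uses the I/O bandwidth of multiple compute nodes to scan the whole table in parallel, and uses their CPU and memory resources to sort the physical location of each row by the indexed column values and build index tuples. Finally, the sorted index tuples are merged into the process that is creating the index and written to index pages, completing the index build.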
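
The scan, sort, and merge flow described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not PolarDB-PG code: the worker count, the round-robin split of rows, the sample data, and the function name epq_style_sort are all hypothetical, and each worker is simply a list sorted in the same process.

```python
import heapq

def epq_style_sort(rows, workers=3):
    """Sort (key, location) pairs the way the text describes: each worker
    sorts its own share of the table, then the sorted runs are merged in
    the process that builds the index. Hypothetical helper."""
    # Each worker scans a disjoint part of the table (round-robin here).
    shares = [rows[i::workers] for i in range(workers)]
    # Each worker sorts its share by the indexed column value.
    runs = [sorted(share) for share in shares]
    # The index-building process merges the already sorted runs.
    return list(heapq.merge(*runs))

# (indexed value, physical row location) pairs; made-up sample data.
table = [(42, "(0,1)"), (7, "(0,2)"), (19, "(0,3)"), (7, "(1,1)"), (88, "(1,2)")]
index_tuples = epq_style_sort(table)
```

Merging pre-sorted runs is cheaper than sorting everything in one place, which is why the final step in the sketch only merges.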

Usage

Data Preparation

Create a table with three columns and 1,000,000 rows:

CREATE TABLE t (id INT, age INT, msg TEXT);
+
+INSERT INTO t
+SELECT
+    random() * 1000000,
+    random() * 10000,
+    md5(random()::text)
+FROM generate_series(1, 1000000);
+

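Creating an Index

Creating an index with ePQ takes the following three steps:

  1. Set the parameter polar_enable_px to ON to turn on ePQ
  2. Set the parameter polar_px_dop_per_node as needed to adjust the degree of query parallelism
  3. Explicitly set the px_build option to ON when creating the index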
SET polar_enable_px TO ON;
+SET polar_px_dop_per_node TO 8;
+CREATE INDEX t_idx1 ON t(id, msg) WITH(px_build = ON);
+
`,11),y={href:"https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-TABLES",target:"_blank",rel:"noopener noreferrer"},f=n("code",null,"ShareLock",-1),m=n("code",null,"INSERT",-1),N=n("code",null,"UPDATE",-1),b=n("code",null,"DELETE",-1),I=l(`

Creating an Index Concurrently

Similarly, ePQ supports building an index concurrently; just add the CONCURRENTLY keyword after CREATE INDEX:

SET polar_enable_px TO ON;
+SET polar_px_dop_per_node TO 8;
+CREATE INDEX CONCURRENTLY t_idx2 ON t(id, msg) WITH(px_build = ON);
+
`,3),L={href:"https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-TABLES",target:"_blank",rel:"noopener noreferrer"},S=n("code",null,"ShareUpdateExclusiveLock",-1),O=l('

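Limitations

ePQ-accelerated index creation does not yet support the following scenarios:

  • Creating a UNIQUE index
  • Creating an index with INCLUDING columns
  • Specifying a TABLESPACE when creating the index
  • Creating a partial index by specifying a WHERE clause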
',3);function P(c,C){const r=t("Badge"),d=t("ArticleInfo"),e=t("router-link"),p=t("ExternalLinkIcon");return u(),k("div",null,[h,a(r,{type:"tip",text:"V11 / v1.1.15-",vertical:"top"}),a(d,{frontmatter:c.$frontmatter},null,8,["frontmatter"]),n("nav",E,[n("ul",null,[n("li",null,[a(e,{to:"#背景"},{default:o(()=>[s("背景")]),_:1})]),n("li",null,[a(e,{to:"#使用方法"},{default:o(()=>[s("使用方法")]),_:1}),n("ul",null,[n("li",null,[a(e,{to:"#数据准备"},{default:o(()=>[s("数据准备")]),_:1})]),n("li",null,[a(e,{to:"#创建索引"},{default:o(()=>[s("创建索引")]),_:1})]),n("li",null,[a(e,{to:"#并发创建索引"},{default:o(()=>[s("并发创建索引")]),_:1})])])]),n("li",null,[a(e,{to:"#使用限制"},{default:o(()=>[s("使用限制")]),_:1})])])]),w,T,n("p",null,[s("PostgreSQL 支持创建多种类型的索引,其中使用得最多的是 "),n("a",x,[s("B-Tree"),a(p)]),s(" 索引,也是 PostgreSQL 默认创建的索引类型。在一张数据量较大的表上创建索引是一件非常耗时的事,因为其中涉及到的工作包含:")]),g,n("p",null,[s("在创建索引的过程中,数据库会对正在创建索引的表施加 "),n("a",y,[f,a(p)]),s(" 锁。这个级别的锁将会阻塞其它进程对表的 DML 操作("),m,s(" / "),N,s(" / "),b,s(")。")]),I,n("p",null,[s("在创建索引的过程中,数据库会对正在创建索引的表施加 "),n("a",L,[S,a(p)]),s(" 锁。这个级别的锁将不会阻塞其它进程对表的 DML 操作。")]),O])}const B=i(_,[["render",P],["__file","epq-create-btree-index.html.vue"]]);export{B as default}; diff --git a/assets/epq-create-btree-index.html-c1e1de42.js b/assets/epq-create-btree-index.html-c1e1de42.js new file mode 100644 index 00000000000..c504b7e1003 --- /dev/null +++ b/assets/epq-create-btree-index.html-c1e1de42.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-5b4b4332","path":"/zh/features/v11/epq/epq-create-btree-index.html","title":"ePQ 支持创建 B-Tree 
索引并行加速","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2023/09/20","minute":20},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"使用方法","slug":"使用方法","link":"#使用方法","children":[{"level":3,"title":"数据准备","slug":"数据准备","link":"#数据准备","children":[]},{"level":3,"title":"创建索引","slug":"创建索引","link":"#创建索引","children":[]},{"level":3,"title":"并发创建索引","slug":"并发创建索引","link":"#并发创建索引","children":[]}]},{"level":2,"title":"使用限制","slug":"使用限制","link":"#使用限制","children":[]}],"git":{"updatedTime":1697908247000},"filePathRelative":"zh/features/v11/epq/epq-create-btree-index.md"}');export{e as data}; diff --git a/assets/epq-ctas-mtview-bulk-insert.html-120d6540.js b/assets/epq-ctas-mtview-bulk-insert.html-120d6540.js new file mode 100644 index 00000000000..a339c7282ee --- /dev/null +++ b/assets/epq-ctas-mtview-bulk-insert.html-120d6540.js @@ -0,0 +1,3 @@ +import{_ as i,r as s,o as p,c as _,d as a,a as e,w as n,b as t,e as h}from"./app-3d1677bf.js";const u={},f=e("h1",{id:"epq-支持创建-刷新物化视图并行加速和批量写入",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#epq-支持创建-刷新物化视图并行加速和批量写入","aria-hidden":"true"},"#"),t(" ePQ 支持创建/刷新物化视图并行加速和批量写入")],-1),g={class:"table-of-contents"},k=e("h2",{id:"背景",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#背景","aria-hidden":"true"},"#"),t(" 背景")],-1),w={href:"https://en.wikipedia.org/wiki/Materialized_view",target:"_blank",rel:"noopener noreferrer"},E={href:"https://www.postgresql.org/docs/current/sql-creatematerializedview.html",target:"_blank",rel:"noopener noreferrer"},b={href:"https://www.postgresql.org/docs/current/sql-refreshmaterializedview.html",target:"_blank",rel:"noopener noreferrer"},m={href:"https://www.postgresql.org/docs/current/sql-createtableas.html",target:"_blank",rel:"noopener noreferrer"},q=e("code",null,"CREATE TABLE AS",-1),T={href:"https://www.postgresql.org/docs/current/sql-selectinto.html",target:"_blank",rel:"noopener noreferrer"},x=e("code",null,"SELECT INTO",-1),L=h(`

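How It Works

Creating and refreshing a materialized view, as well as the CREATE TABLE AS / SELECT INTO syntax, require very similar work at the database level, so the PostgreSQL kernel handles all of them with the same code path. The main steps during execution are:

  1. Data scan: execute the query defined by the view or by the CREATE TABLE AS / SELECT INTO statement, scanning the data that matches the query
  2. Data write: write the data scanned in the previous step into a new materialized view / table

PolarDB for PostgreSQL optimizes these two steps with ePQ parallel scan and bulk data write, respectively. When a large amount of data has to be scanned or written, this can significantly improve the performance of the DDL statements above and shorten their execution time:

  1. ePQ parallel scan: use ePQ to execute the query in the view definition in parallel with the I/O bandwidth and compute resources of multiple compute nodes, improving the utilization of both
  2. Bulk write: instead of writing each scanned tuple to the table or materialized view one at a time, accumulate a certain number of tuples in memory and write them in a single batch, reducing WAL logging overhead and the frequency of page locking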
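
The bulk-write idea can be sketched in a few lines of Python. This is a hypothetical illustration rather than the kernel implementation: the class name BulkWriter, the batch_size knob, and the in-memory flushes list that stands in for page writes are all assumptions made for the example.

```python
class BulkWriter:
    """Buffer tuples in memory and write them out in batches.
    Hypothetical illustration; batch_size is an assumed knob."""

    def __init__(self, batch_size=4):
        self.batch_size = batch_size
        self.buffer = []
        self.flushes = []  # each entry stands for one batched write

    def insert(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            # One write (and one round of logging/locking) per batch.
            self.flushes.append(list(self.buffer))
            self.buffer.clear()

writer = BulkWriter(batch_size=4)
for row in range(10):
    writer.insert(row)
writer.flush()  # write the final partial batch
```

Ten rows end up in three batched writes instead of ten individual ones, which is the source of the reduced logging and locking overhead the text mentions.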

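Usage

ePQ Parallel Scan

Set the following parameter to ON to enable ePQ parallel scan, which accelerates the query phase of the statements above. Its default value is currently ON. The parameter only takes effect when polar_enable_px, the master switch for ePQ, is turned on.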

SET polar_px_enable_create_table_as = ON;
+

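Due to limitations of ePQ, this optimization does not support the CREATE TABLE AS ... WITH OIDS syntax. For that syntax, processing falls back to the built-in PostgreSQL optimizer to generate the plan for the query in the DDL definition, and the query is executed by PostgreSQL's single-node executor.

Bulk Write

Set the following parameter to ON to enable bulk write, which accelerates the write phase of the statements above. Its default value is currently ON.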

SET polar_enable_create_table_as_bulk_insert = ON;
+
`,13);function v(l,A){const c=s("Badge"),d=s("ArticleInfo"),o=s("router-link"),r=s("ExternalLinkIcon");return p(),_("div",null,[f,a(c,{type:"tip",text:"V11 / v1.1.30-",vertical:"top"}),a(d,{frontmatter:l.$frontmatter},null,8,["frontmatter"]),e("nav",g,[e("ul",null,[e("li",null,[a(o,{to:"#背景"},{default:n(()=>[t("背景")]),_:1})]),e("li",null,[a(o,{to:"#功能原理介绍"},{default:n(()=>[t("功能原理介绍")]),_:1})]),e("li",null,[a(o,{to:"#使用说明"},{default:n(()=>[t("使用说明")]),_:1}),e("ul",null,[e("li",null,[a(o,{to:"#epq-并行扫描"},{default:n(()=>[t("ePQ 并行扫描")]),_:1})]),e("li",null,[a(o,{to:"#批量写入"},{default:n(()=>[t("批量写入")]),_:1})])])])])]),k,e("p",null,[e("a",w,[t("物化视图 (Materialized View)"),a(r)]),t(" 是一个包含查询结果的数据库对象。与普通的视图不同,物化视图不仅保存视图的定义,还保存了 "),e("a",E,[t("创建物化视图"),a(r)]),t(" 时的数据副本。当物化视图的数据与视图定义中的数据不一致时,可以进行 "),e("a",b,[t("物化视图刷新 (Refresh)"),a(r)]),t(" 保持物化视图中的数据与视图定义一致。物化视图本质上是对视图定义中的查询做预计算,以便于在查询时复用。")]),e("p",null,[e("a",m,[q,a(r)]),t(" 语法用于将一个查询所对应的数据构建为一个新的表,其表结构与查询的输出列完全相同。")]),e("p",null,[e("a",T,[x,a(r)]),t(" 语法用于建立一张新表,并将查询所对应的数据写入表中,而不是将查询到的数据返回给客户端。其表结构与查询的输出列完全相同。")]),L])}const P=i(u,[["render",v],["__file","epq-ctas-mtview-bulk-insert.html.vue"]]);export{P as default}; diff --git a/assets/epq-ctas-mtview-bulk-insert.html-ca602c4f.js b/assets/epq-ctas-mtview-bulk-insert.html-ca602c4f.js new file mode 100644 index 00000000000..c3175687531 --- /dev/null +++ b/assets/epq-ctas-mtview-bulk-insert.html-ca602c4f.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-da223262","path":"/zh/features/v11/epq/epq-ctas-mtview-bulk-insert.html","title":"ePQ 支持创建/刷新物化视图并行加速和批量写入","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2023/02/08","minute":10},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"功能原理介绍","slug":"功能原理介绍","link":"#功能原理介绍","children":[]},{"level":2,"title":"使用说明","slug":"使用说明","link":"#使用说明","children":[{"level":3,"title":"ePQ 
并行扫描","slug":"epq-并行扫描","link":"#epq-并行扫描","children":[]},{"level":3,"title":"批量写入","slug":"批量写入","link":"#批量写入","children":[]}]}],"git":{"updatedTime":1697908247000},"filePathRelative":"zh/features/v11/epq/epq-ctas-mtview-bulk-insert.md"}');export{e as data}; diff --git a/assets/epq-explain-analyze.html-948c1fdb.js b/assets/epq-explain-analyze.html-948c1fdb.js new file mode 100644 index 00000000000..78d12229e63 --- /dev/null +++ b/assets/epq-explain-analyze.html-948c1fdb.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-9aa77614","path":"/zh/features/v11/epq/epq-explain-analyze.html","title":"ePQ 执行计划查看与分析","lang":"zh-CN","frontmatter":{"author":"渊云、秦疏","date":"2023/09/06","minute":30},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"功能介绍","slug":"功能介绍","link":"#功能介绍","children":[{"level":3,"title":"执行计划查看","slug":"执行计划查看","link":"#执行计划查看","children":[]}]}],"git":{"updatedTime":1697908247000},"filePathRelative":"zh/features/v11/epq/epq-explain-analyze.md"}');export{e as data}; diff --git a/assets/epq-explain-analyze.html-c636bd81.js b/assets/epq-explain-analyze.html-c636bd81.js new file mode 100644 index 00000000000..87313462666 --- /dev/null +++ b/assets/epq-explain-analyze.html-c636bd81.js @@ -0,0 +1,27 @@ +import{_ as r,r as o,o as k,c as d,d as n,a as s,w as p,b as a,e as i}from"./app-3d1677bf.js";const u={},y=s("h1",{id:"epq-执行计划查看与分析",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#epq-执行计划查看与分析","aria-hidden":"true"},"#"),a(" ePQ 执行计划查看与分析")],-1),w={class:"table-of-contents"},m=i(`

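Background

PostgreSQL provides the EXPLAIN command for analyzing the performance of SQL statements. It can output the query plan for a statement, along with the time and resources consumed during execution, and is useful for tracking down SQL performance bottlenecks.

The EXPLAIN command was originally only suited to analyzing SQL executed on a single node. The ePQ (elastic cross-node parallel query) feature of PolarDB-PG extends EXPLAIN so that it can print ePQ's cross-node parallel execution plans. It can also collect the execution time, amount of data scanned, memory usage, and other statistics for each operator of an ePQ plan, returning them to the client in a unified view.

Features

Viewing Execution Plans

ePQ execution plans are divided into slices. Each plan slice (Slice) is executed by a group of processes (Gang) started by the virtual execution units (Segments) on the compute nodes, and performs part of the SQL computation. ePQ introduces the Motion operator into execution plans to pass data between the process groups that execute different slices. The Motion operator is therefore the boundary between plan slices.

ePQ introduces three Motion operators in total:

  • PX Coordinator: data from the sources is sent to a single target (gather)
  • PX Broadcast: data from the sources is sent to every target (broadcast)
  • PX Hash: data from the sources is hashed and then sent to one particular target (redistribute)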
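
The three data-movement patterns can be modelled with plain Python lists. This is only a conceptual sketch: the function names are hypothetical, Python's built-in hash stands in for whatever hash function ePQ actually uses, and no real inter-process communication takes place.

```python
def gather(sources):
    """PX Coordinator style: every source sends to a single target."""
    return [row for src in sources for row in src]

def broadcast(sources, n_targets):
    """PX Broadcast style: every target receives all source rows."""
    everything = gather(sources)
    return [list(everything) for _ in range(n_targets)]

def redistribute(sources, n_targets, key):
    """PX Hash style: a hash of the key picks exactly one target."""
    targets = [[] for _ in range(n_targets)]
    for src in sources:
        for row in src:
            targets[hash(key(row)) % n_targets].append(row)
    return targets

sources = [[1, 2], [3, 4], [5, 6]]
gathered = gather(sources)
copies = broadcast(sources, 2)
buckets = redistribute(sources, 2, key=lambda r: r)
```

Redistribution guarantees that rows with the same key land on the same target, which is what lets each target group or join its share independently.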

Take a simple query as an example:

=> CREATE TABLE t (id INT);
+=> SET polar_enable_px TO ON;
+=> EXPLAIN (COSTS OFF) SELECT * FROM t LIMIT 1;
+                   QUERY PLAN
+-------------------------------------------------
+ Limit
+   ->  PX Coordinator 6:1  (slice1; segments: 6)
+         ->  Partial Seq Scan on t
+ Optimizer: PolarDB PX Optimizer
+(4 rows)
+

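The plan above is split into two slices at the Motion operator: slice0, which receives the final result, and slice1, which scans the data. For slice1, ePQ uses six execution units (segments: 6), each of which starts one process. Each of the six processes scans part of the table (Partial Seq Scan); the Motion operator gathers the data from all six processes to a single target (PX Coordinator 6:1) and passes it to the Limit operator.

As queries become more complex, the plan contains more and more slices and Motion operators: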

=> CREATE TABLE t1 (a INT, b INT, c INT);
+=> SET polar_enable_px TO ON;
+=> EXPLAIN (COSTS OFF) SELECT SUM(b) FROM t1 GROUP BY a LIMIT 1;
+                         QUERY PLAN
+------------------------------------------------------------
+ Limit
+   ->  PX Coordinator 6:1  (slice1; segments: 6)
+         ->  GroupAggregate
+               Group Key: a
+               ->  Sort
+                     Sort Key: a
+                     ->  PX Hash 6:6  (slice2; segments: 6)
+                           Hash Key: a
+                           ->  Partial Seq Scan on t1
+ Optimizer: PolarDB PX Optimizer
+(10 rows)
+

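The plan above has three slices in total. Six processes (segments: 6) execute slice2, each scanning part of the table. A Motion operator (PX Hash 6:6) then redistributes the data to another six processes (segments: 6) executing slice1, each of which performs the sort (Sort) and aggregation (GroupAggregate). Finally, a Motion operator (PX Coordinator 6:1) gathers the data into the result slice slice0.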

`,14);function g(t,P){const c=o("Badge"),l=o("ArticleInfo"),e=o("router-link");return k(),d("div",null,[y,n(c,{type:"tip",text:"V11 / v1.1.20-",vertical:"top"}),n(l,{frontmatter:t.$frontmatter},null,8,["frontmatter"]),s("nav",w,[s("ul",null,[s("li",null,[n(e,{to:"#背景"},{default:p(()=>[a("背景")]),_:1})]),s("li",null,[n(e,{to:"#功能介绍"},{default:p(()=>[a("功能介绍")]),_:1}),s("ul",null,[s("li",null,[n(e,{to:"#执行计划查看"},{default:p(()=>[a("执行计划查看")]),_:1})])])])])]),m])}const _=r(u,[["render",g],["__file","epq-explain-analyze.html.vue"]]);export{_ as default}; diff --git a/assets/epq-node-and-dop.html-2ee64cdd.js b/assets/epq-node-and-dop.html-2ee64cdd.js new file mode 100644 index 00000000000..a978345ace9 --- /dev/null +++ b/assets/epq-node-and-dop.html-2ee64cdd.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-351ad83c","path":"/zh/features/v11/epq/epq-node-and-dop.html","title":"ePQ 计算节点范围选择与并行度控制","lang":"zh-CN","frontmatter":{"author":"渊云","date":"2023/09/06","minute":20},"headers":[{"level":2,"title":"背景介绍","slug":"背景介绍","link":"#背景介绍","children":[]},{"level":2,"title":"计算节点范围选择","slug":"计算节点范围选择","link":"#计算节点范围选择","children":[]},{"level":2,"title":"并行度控制","slug":"并行度控制","link":"#并行度控制","children":[]},{"level":2,"title":"并行度计算方法示例","slug":"并行度计算方法示例","link":"#并行度计算方法示例","children":[]}],"git":{"updatedTime":1697908247000},"filePathRelative":"zh/features/v11/epq/epq-node-and-dop.md"}');export{e as data}; diff --git a/assets/epq-node-and-dop.html-bb13ee52.js b/assets/epq-node-and-dop.html-bb13ee52.js new file mode 100644 index 00000000000..f42680a0a2c --- /dev/null +++ b/assets/epq-node-and-dop.html-bb13ee52.js @@ -0,0 +1,60 @@ +import{_ as r,r as e,o as k,c as d,d as n,a as s,w as o,b as a,e as u}from"./app-3d1677bf.js";const i={},m=s("h1",{id:"epq-计算节点范围选择与并行度控制",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#epq-计算节点范围选择与并行度控制","aria-hidden":"true"},"#"),a(" ePQ 计算节点范围选择与并行度控制")],-1),_={class:"table-of-contents"},w=u(`

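Background

The ePQ (elastic cross-node parallel query) feature of PolarDB-PG provides fine-grained controls for making sensible use of the compute resources in a cluster. It makes the most of idle compute resources for parallel queries, improving resource utilization, while avoiding any impact on other workloads:

  1. ePQ can dynamically adjust which compute nodes in the cluster take part in parallel queries, avoiding nodes that are heavily loaded
  2. ePQ can dynamically adjust the degree of parallelism on the compute nodes for each query, so that the resources consumed by ePQ parallel query processes do not affect other processes on the same node

Selecting Compute Nodes

The parameter polar_px_nodes specifies which compute nodes take part in ePQ. It is empty by default, meaning that all read-only nodes participate in ePQ parallel queries: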

=> SHOW polar_px_nodes;
+ polar_px_nodes
+----------------
+
+(1 row)
+

If you want the read-write node to take part in ePQ parallel queries as well, set the following parameter:

SET polar_px_use_primary TO ON;
+

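If some read-only nodes are heavily loaded, you can set polar_px_nodes so that only specific read-only nodes, rather than all of them, take part in ePQ parallel queries. A valid value for polar_px_nodes is a comma-separated list of node names. Obtaining the node names requires installing the polar_monitor extension: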

CREATE EXTENSION IF NOT EXISTS polar_monitor;
+

The cluster topology view provided by the polar_monitor extension lists the names of all compute nodes in the cluster:

=> SELECT name,slot_name,type FROM polar_cluster_info;
+ name  | slot_name |  type
+-------+-----------+---------
+ node0 |           | Primary
+ node1 | standby1  | Standby
+ node2 | replica1  | Replica
+ node3 | replica2  | Replica
+(4 rows)
+

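Where:

  • Primary denotes the read-write node
  • Replica denotes a read-only node
  • Standby denotes a standby node

A general best practice is to have lightly loaded read-only nodes take part in ePQ parallel queries: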

=> SET polar_px_nodes = 'node2,node3';
+=> SHOW polar_px_nodes;
+ polar_px_nodes
+----------------
+ node2,node3
+(1 row)
+

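Controlling the Degree of Parallelism

The parameter polar_px_dop_per_node sets the number of execution units (Segments) on each compute node for ePQ queries in the current session. Each execution unit starts one process for every plan slice (Slice) it has to execute.

The parameter defaults to 3; a general best practice is half the number of CPU cores on the compute node. If CPU load on the compute nodes is high, decrease the parameter as appropriate to keep CPU usage below 80%. If query performance is poor, increase it as appropriate, while still keeping CPU usage on the compute nodes at or below 80%; otherwise other background processes may be slowed down.

Example: Calculating the Degree of Parallelism

Create a table: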

CREATE TABLE test(id INT);
+

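Suppose the cluster has two read-only nodes and polar_px_nodes is empty, so ePQ uses all read-only nodes in the cluster for parallel queries. polar_px_dop_per_node is 3, meaning there are three execution units on each compute node. The execution plan is as follows: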

=> SHOW polar_px_nodes;
+ polar_px_nodes
+----------------
+
+(1 row)
+
+=> SHOW polar_px_dop_per_node;
+ polar_px_dop_per_node
+-----------------------
+ 3
+(1 row)
+
+=> EXPLAIN SELECT * FROM test;
+                                  QUERY PLAN
+-------------------------------------------------------------------------------
+ PX Coordinator 6:1  (slice1; segments: 6)  (cost=0.00..431.00 rows=1 width=4)
+   ->  Partial Seq Scan on test  (cost=0.00..431.00 rows=1 width=4)
+ Optimizer: PolarDB PX Optimizer
+(3 rows)
+

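The plan shows that a total of six execution units (segments: 6) on the two read-only nodes execute slice1, the only plan slice in this plan. This means six processes in total execute the current query in parallel.

Now set polar_px_dop_per_node to 4 and run the query again. A total of eight execution units on the two read-only nodes take part in the query. Because the plan has only one slice, slice1, eight processes in total execute the current query in parallel: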

=> SET polar_px_dop_per_node TO 4;
+SET
+=> EXPLAIN SELECT * FROM test;
+                                  QUERY PLAN
+-------------------------------------------------------------------------------
+ PX Coordinator 8:1  (slice1; segments: 8)  (cost=0.00..431.00 rows=1 width=4)
+   ->  Partial Seq Scan on test  (cost=0.00..431.00 rows=1 width=4)
+ Optimizer: PolarDB PX Optimizer
+(3 rows)
+

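If you now set polar_px_use_primary so that the read-write node also takes part, the read-write node will also have four execution units participating in ePQ parallel execution, for a total of 12 processes in the cluster: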

=> SET polar_px_use_primary TO ON;
+SET
+=> EXPLAIN SELECT * FROM test;
+                                   QUERY PLAN
+---------------------------------------------------------------------------------
+ PX Coordinator 12:1  (slice1; segments: 12)  (cost=0.00..431.00 rows=1 width=4)
+   ->  Partial Seq Scan on test  (cost=0.00..431.00 rows=1 width=4)
+ Optimizer: PolarDB PX Optimizer
+(3 rows)
+
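
The arithmetic behind the three plans above can be written down as a short Python sketch. The function names are hypothetical, and the process count assumes the simplified model from the text, in which every execution unit starts one process for each counted plan slice and the result slice slice0 is not counted.

```python
def total_segments(read_only_nodes, dop_per_node, use_primary=False):
    """Execution units = participating nodes * units per node."""
    nodes = read_only_nodes + (1 if use_primary else 0)
    return nodes * dop_per_node

def total_processes(segments, slices):
    """Assumed model: each execution unit starts one process per slice."""
    return segments * slices

default_plan = total_segments(2, 3)                   # dop_per_node = 3
raised_dop = total_segments(2, 4)                     # dop_per_node = 4
with_primary = total_segments(2, 4, use_primary=True)
```

The three results correspond to the segments: 6, segments: 8, and segments: 12 values printed in the plans above.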
`,29);function y(t,b){const c=e("Badge"),l=e("ArticleInfo"),p=e("router-link");return k(),d("div",null,[m,n(c,{type:"tip",text:"V11 / v1.1.20-",vertical:"top"}),n(l,{frontmatter:t.$frontmatter},null,8,["frontmatter"]),s("nav",_,[s("ul",null,[s("li",null,[n(p,{to:"#背景介绍"},{default:o(()=>[a("背景介绍")]),_:1})]),s("li",null,[n(p,{to:"#计算节点范围选择"},{default:o(()=>[a("计算节点范围选择")]),_:1})]),s("li",null,[n(p,{to:"#并行度控制"},{default:o(()=>[a("并行度控制")]),_:1})]),s("li",null,[n(p,{to:"#并行度计算方法示例"},{default:o(()=>[a("并行度计算方法示例")]),_:1})])])]),w])}const h=r(i,[["render",y],["__file","epq-node-and-dop.html.vue"]]);export{h as default}; diff --git a/assets/epq-partitioned-table.html-804dd467.js b/assets/epq-partitioned-table.html-804dd467.js new file mode 100644 index 00000000000..b329abd14ac --- /dev/null +++ b/assets/epq-partitioned-table.html-804dd467.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-5d5635bc","path":"/zh/features/v11/epq/epq-partitioned-table.html","title":"ePQ 支持分区表查询","lang":"zh-CN","frontmatter":{"author":"渊云","date":"2023/09/06","minute":20},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"使用指南","slug":"使用指南","link":"#使用指南","children":[{"level":3,"title":"分区表并行查询","slug":"分区表并行查询","link":"#分区表并行查询","children":[]},{"level":3,"title":"分区静态裁剪","slug":"分区静态裁剪","link":"#分区静态裁剪","children":[]},{"level":3,"title":"智能分区连接","slug":"智能分区连接","link":"#智能分区连接","children":[]},{"level":3,"title":"多级分区表并行查询","slug":"多级分区表并行查询","link":"#多级分区表并行查询","children":[]}]}],"git":{"updatedTime":1697908247000},"filePathRelative":"zh/features/v11/epq/epq-partitioned-table.md"}');export{e as data}; diff --git a/assets/epq-partitioned-table.html-bde50aed.js b/assets/epq-partitioned-table.html-bde50aed.js new file mode 100644 index 00000000000..6d86f5e18dc --- /dev/null +++ b/assets/epq-partitioned-table.html-bde50aed.js @@ -0,0 +1,107 @@ +import{_ as l,r as e,o as r,c as d,d as n,a as s,w as o,b as a,e as u}from"./app-3d1677bf.js";const 
i="/PolarDB-for-PostgreSQL/assets/htap-multi-level-partition-1-c17a6008.png",w={},y=s("h1",{id:"epq-支持分区表查询",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#epq-支持分区表查询","aria-hidden":"true"},"#"),a(" ePQ 支持分区表查询")],-1),T={class:"table-of-contents"},O=u(`

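Background

As data volumes keep growing, tables become larger and larger. To ease management and improve query performance, a good practice is to use partitioned tables, splitting a large table into multiple partitions. Each partition can itself be split further into second-level partitions, forming a multi-level partitioned table.

PolarDB-PG supports ePQ (elastic cross-node parallel query), which uses multiple compute nodes in the cluster to speed up read-only queries. ePQ can run efficient cross-node parallel queries not only on ordinary tables but also on partitioned tables.

Basic ePQ support for partitioned tables includes:

  • Parallel scans of partitioned tables that use the Range / List / Hash partitioning strategies
  • Index scans on partitioned tables
  • Join queries on partitioned tables

In addition, ePQ supports some advanced features related to partitioned tables:

  • Partition pruning
  • Partition-wise join
  • Parallel queries on multi-level partitioned tables

ePQ does not yet support parallel queries on partitioned tables with multi-column partition keys.

Usage Guide

Parallel Queries on Partitioned Tables

Create a table partitioned by Range, with three partitions: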

CREATE TABLE t1 (id INT) PARTITION BY RANGE(id);
+CREATE TABLE t1_p1 PARTITION OF t1 FOR VALUES FROM (0) TO (200);
+CREATE TABLE t1_p2 PARTITION OF t1 FOR VALUES FROM (200) TO (400);
+CREATE TABLE t1_p3 PARTITION OF t1 FOR VALUES FROM (400) TO (600);
+

Set the parameters that turn on ePQ and ePQ partitioned-table scans:

SET polar_enable_px TO ON;
+SET polar_px_enable_partition TO ON;
+

View the execution plan for a full scan of the partitioned table:

=> EXPLAIN (COSTS OFF) SELECT * FROM t1;
+                QUERY PLAN
+-------------------------------------------
+ PX Coordinator 6:1  (slice1; segments: 6)
+   ->  Append
+         ->  Partial Seq Scan on t1_p1
+         ->  Partial Seq Scan on t1_p2
+         ->  Partial Seq Scan on t1_p3
+ Optimizer: PolarDB PX Optimizer
+(6 rows)
+

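ePQ starts a group of processes to scan every child table of the partitioned table in parallel. Each scan process uses the Append operator to scan part of each child table in turn (Partial Seq Scan), and a Motion operator (PX Coordinator) gathers the results from all processes to the process that issued the query, which returns them.

Static Partition Pruning

When a query's filter condition includes the partition key, the ePQ optimizer can prune the partitions to be scanned according to the filter, avoiding scans of unneeded partitions, saving system resources, and improving query performance. Using table t1 from above, look at the execution plan of the following query: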

=> EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE id < 100;
+                QUERY PLAN
+-------------------------------------------
+ PX Coordinator 6:1  (slice1; segments: 6)
+   ->  Append
+         ->  Partial Seq Scan on t1_p1
+               Filter: (id < 100)
+ Optimizer: PolarDB PX Optimizer
+(5 rows)
+

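Because the filter condition id < 100 includes the partition key, the ePQ optimizer can use the partition bounds to drop the partitions that cannot satisfy the filter (t1_p2, t1_p3) when generating the plan, keeping only the partition that can (t1_p1).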
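
The pruning decision for this example can be reproduced with a small Python sketch. It is a simplified illustration, not the optimizer's algorithm: it handles only a single "less than" filter on an integer key, the helper name prune_less_than is hypothetical, and the bounds are copied from the t1 definition above.

```python
# Range partitions of t1 as [lower, upper) bounds, taken from the example.
PARTITIONS = {"t1_p1": (0, 200), "t1_p2": (200, 400), "t1_p3": (400, 600)}

def prune_less_than(partitions, value):
    """Keep partitions that may hold a row with id < value.
    A partition [lo, hi) can match only if lo < value."""
    return [name for name, (lo, _hi) in partitions.items() if lo < value]

kept = prune_less_than(PARTITIONS, 100)
```

With the filter id < 100, only t1_p1 survives, matching the single Partial Seq Scan in the plan above.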

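Partition-Wise Join

When joining partitioned tables, if the partitioning strategies and bounds are the same and the join condition is on the partition key, the ePQ optimizer can produce a plan that joins partition by partition. This avoids joining every partition of one table against every partition of the other, saving system resources and improving query performance.

Take a join of two Range-partitioned tables as an example. The following SQL creates two partitioned tables, t2 and t3, with identical partitioning strategies and bounds: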

CREATE TABLE t2 (id INT) PARTITION BY RANGE(id);
+CREATE TABLE t2_p1 PARTITION OF t2 FOR VALUES FROM (0) TO (200);
+CREATE TABLE t2_p2 PARTITION OF t2 FOR VALUES FROM (200) TO (400);
+CREATE TABLE t2_p3 PARTITION OF t2 FOR VALUES FROM (400) TO (600);
+
+CREATE TABLE t3 (id INT) PARTITION BY RANGE(id);
+CREATE TABLE t3_p1 PARTITION OF t3 FOR VALUES FROM (0) TO (200);
+CREATE TABLE t3_p2 PARTITION OF t3 FOR VALUES FROM (200) TO (400);
+CREATE TABLE t3_p3 PARTITION OF t3 FOR VALUES FROM (400) TO (600);
+

Turn on the following parameters to enable ePQ support for partitioned tables:

SET polar_enable_px TO ON;
+SET polar_px_enable_partition TO ON;
+

With partition-wise join turned off, the plan for an equi-join of the two tables on the partition key is as follows:

=> SET polar_px_enable_partitionwise_join TO OFF;
+=> EXPLAIN (COSTS OFF) SELECT * FROM t2 JOIN t3 ON t2.id = t3.id;
+                        QUERY PLAN
+-----------------------------------------------------------
+ PX Coordinator 6:1  (slice1; segments: 6)
+   ->  Hash Join
+         Hash Cond: (t2_p1.id = t3_p1.id)
+         ->  Append
+               ->  Partial Seq Scan on t2_p1
+               ->  Partial Seq Scan on t2_p2
+               ->  Partial Seq Scan on t2_p3
+         ->  Hash
+               ->  PX Broadcast 6:6  (slice2; segments: 6)
+                     ->  Append
+                           ->  Partial Seq Scan on t3_p1
+                           ->  Partial Seq Scan on t3_p2
+                           ->  Partial Seq Scan on t3_p3
+ Optimizer: PolarDB PX Optimizer
+(14 rows)
+

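The plan shows that the six processes executing slice1 each use the Append operator to scan part of every partition of t2 in turn, and receive through a Motion operator (PX Broadcast) the full contents of t3 broadcast by the six processes executing slice2. After the hash join (Hash Join) completes locally, a Motion operator (PX Coordinator) gathers the results and returns them. In essence, every row of t2 is matched against the data of every partition of t3.

Set polar_px_enable_partitionwise_join to ON to enable partition-wise join, then view the plan again: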

=> SET polar_px_enable_partitionwise_join TO ON;
+=> EXPLAIN (COSTS OFF) SELECT * FROM t2 JOIN t3 ON t2.id = t3.id;
+                   QUERY PLAN
+------------------------------------------------
+ PX Coordinator 6:1  (slice1; segments: 6)
+   ->  Append
+         ->  Hash Join
+               Hash Cond: (t2_p1.id = t3_p1.id)
+               ->  Partial Seq Scan on t2_p1
+               ->  Hash
+                     ->  Full Seq Scan on t3_p1
+         ->  Hash Join
+               Hash Cond: (t2_p2.id = t3_p2.id)
+               ->  Partial Seq Scan on t2_p2
+               ->  Hash
+                     ->  Full Seq Scan on t3_p2
+         ->  Hash Join
+               Hash Cond: (t2_p3.id = t3_p3.id)
+               ->  Partial Seq Scan on t2_p3
+               ->  Hash
+                     ->  Full Seq Scan on t3_p3
+ Optimizer: PolarDB PX Optimizer
+(18 rows)
+

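In this plan, the six processes executing slice1 use the Append operator to scan, in turn, part of each partition of t2 together with all of the corresponding partition of t3, hash-join the two (Hash Join), and finally gather and return the results through a Motion operator (PX Coordinator). During execution, the partitions t2_p1, t2_p2, and t2_p3 of t2 are joined only with the corresponding partitions t3_p1, t3_p2, and t3_p3 of t3, not with unrelated partitions, which saves unnecessary work.

Parallel Queries on Multi-Level Partitioned Tables

In a multi-level partitioned table, each level can be partitioned on a different dimension (partition key): for example, the first level by time and the second level by region. When the filter condition of a query includes the partition keys of every level, the ePQ optimizer can perform static partition pruning on the multi-level partitioned table, filtering out the partitions that do not need to be scanned.

Take the figure below as an example: when the filter condition WHERE date = '202201' AND region = 'beijing' includes the first-level partition key date and the second-level partition key region, the ePQ optimizer can prune all unrelated partitions, so the resulting plan contains only the partitions that match. The executor then only needs to scan those partitions.

multi-level-partition

Using the following SQL as an example, create a multi-level partitioned table: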

CREATE TABLE r1 (a INT, b TIMESTAMP) PARTITION BY RANGE (b);
+
+CREATE TABLE r1_p1 PARTITION OF r1 FOR VALUES FROM ('2000-01-01') TO ('2010-01-01')  PARTITION BY RANGE (a);
+CREATE TABLE r1_p1_p1 PARTITION OF r1_p1 FOR VALUES FROM (1) TO (1000000);
+CREATE TABLE r1_p1_p2 PARTITION OF r1_p1 FOR VALUES FROM (1000000) TO (2000000);
+
+CREATE TABLE r1_p2 PARTITION OF r1 FOR VALUES FROM ('2010-01-01') TO ('2020-01-01')  PARTITION BY RANGE (a);
+CREATE TABLE r1_p2_p1 PARTITION OF r1_p2 FOR VALUES FROM (1) TO (1000000);
+CREATE TABLE r1_p2_p2 PARTITION OF r1_p2 FOR VALUES FROM (1000000) TO (2000000);
+

Turn on the following parameters to enable ePQ support for partitioned tables:

SET polar_enable_px TO ON;
+SET polar_px_enable_partition TO ON;
+

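Run a SQL statement that filters on both levels of partition keys with ePQ's multi-level partition scan disabled. The result is the execution plan produced by the built-in PostgreSQL optimizer after static pruning of the multi-level partitions: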

=> SET polar_px_optimizer_multilevel_partitioning TO OFF;
+=> EXPLAIN (COSTS OFF) SELECT * FROM r1 WHERE a < 1000000 AND b < '2009-01-01 00:00:00';
+                                       QUERY PLAN
+----------------------------------------------------------------------------------------
+ Seq Scan on r1_p1_p1 r1
+   Filter: ((a < 1000000) AND (b < '2009-01-01 00:00:00'::timestamp without time zone))
+(2 rows)
+

Enable ePQ's multi-level partition scan and view the execution plan again:

=> SET polar_px_optimizer_multilevel_partitioning TO ON;
+=> EXPLAIN (COSTS OFF) SELECT * FROM r1 WHERE a < 1000000 AND b < '2009-01-01 00:00:00';
+                                             QUERY PLAN
+----------------------------------------------------------------------------------------------------
+ PX Coordinator 6:1  (slice1; segments: 6)
+   ->  Append
+         ->  Partial Seq Scan on r1_p1_p1
+               Filter: ((a < 1000000) AND (b < '2009-01-01 00:00:00'::timestamp without time zone))
+ Optimizer: PolarDB PX Optimizer
+(5 rows)
+

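In the plan above, the ePQ optimizer statically pruned the multi-level partitioned table. The six processes executing plan slice slice1 only need to scan the matching child partition r1_p1_p1 in parallel (Partial Seq Scan); the scanned data is gathered by a Motion operator (PX Coordinator) and returned.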
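
The pruning decision for the r1 example can be illustrated with a small Python sketch. This is only a conceptual model under stated assumptions, not the ePQ optimizer's implementation: the `LEAVES` dictionary and both helper functions are hypothetical names that mirror the `CREATE TABLE` statements for r1, and only "less than" predicates are handled.

```python
from datetime import datetime

# Hypothetical in-memory model of the r1 partition tree from the example.
# Each leaf partition has a half-open range [lo, hi) on b (level 1) and a (level 2).
LEAVES = {
    "r1_p1_p1": {"b": (datetime(2000, 1, 1), datetime(2010, 1, 1)), "a": (1, 1000000)},
    "r1_p1_p2": {"b": (datetime(2000, 1, 1), datetime(2010, 1, 1)), "a": (1000000, 2000000)},
    "r1_p2_p1": {"b": (datetime(2010, 1, 1), datetime(2020, 1, 1)), "a": (1, 1000000)},
    "r1_p2_p2": {"b": (datetime(2010, 1, 1), datetime(2020, 1, 1)), "a": (1000000, 2000000)},
}


def may_contain_less_than(bounds, limit):
    """A range [lo, hi) can hold a value < limit only if lo < limit."""
    lo, _hi = bounds
    return lo < limit


def prune(a_limit, b_limit):
    """Return the leaf partitions that can satisfy `a < a_limit AND b < b_limit`."""
    return sorted(
        name
        for name, ranges in LEAVES.items()
        if may_contain_less_than(ranges["a"], a_limit)
        and may_contain_less_than(ranges["b"], b_limit)
    )


# Same filter as the EXPLAIN example: a < 1000000 AND b < '2009-01-01 00:00:00'
survivors = prune(1000000, datetime(2009, 1, 1))
print(survivors)
```

Because the decision uses only the partition bounds and the constants in the filter, it can be made at planning time, which is why only r1_p1_p1 appears in the plan.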

`,46);function _(t,E){const c=e("Badge"),k=e("ArticleInfo"),p=e("router-link");return r(),d("div",null,[y,n(c,{type:"tip",text:"V11 / v1.1.17-",vertical:"top"}),n(k,{frontmatter:t.$frontmatter},null,8,["frontmatter"]),s("nav",T,[s("ul",null,[s("li",null,[n(p,{to:"#背景"},{default:o(()=>[a("背景")]),_:1})]),s("li",null,[n(p,{to:"#使用指南"},{default:o(()=>[a("使用指南")]),_:1}),s("ul",null,[s("li",null,[n(p,{to:"#分区表并行查询"},{default:o(()=>[a("分区表并行查询")]),_:1})]),s("li",null,[n(p,{to:"#分区静态裁剪"},{default:o(()=>[a("分区静态裁剪")]),_:1})]),s("li",null,[n(p,{to:"#智能分区连接"},{default:o(()=>[a("智能分区连接")]),_:1})]),s("li",null,[n(p,{to:"#多级分区表并行查询"},{default:o(()=>[a("多级分区表并行查询")]),_:1})])])])])]),O])}const m=l(w,[["render",_],["__file","epq-partitioned-table.html.vue"]]);export{m as default}; diff --git a/assets/essd-storage-grow-11277a20.png b/assets/essd-storage-grow-11277a20.png new file mode 100644 index 00000000000..a7a5cc9fbd7 Binary files /dev/null and b/assets/essd-storage-grow-11277a20.png differ diff --git a/assets/essd-storage-grow-complete-f9a772d3.png b/assets/essd-storage-grow-complete-f9a772d3.png new file mode 100644 index 00000000000..d726050d6ab Binary files /dev/null and b/assets/essd-storage-grow-complete-f9a772d3.png differ diff --git a/assets/essd-storage-online-grow-bce55f20.png b/assets/essd-storage-online-grow-bce55f20.png new file mode 100644 index 00000000000..031df34fa00 Binary files /dev/null and b/assets/essd-storage-online-grow-bce55f20.png differ diff --git a/assets/flashback-table.html-2404989b.js b/assets/flashback-table.html-2404989b.js new file mode 100644 index 00000000000..505c394486a --- /dev/null +++ b/assets/flashback-table.html-2404989b.js @@ -0,0 +1 @@ +const 
l=JSON.parse('{"key":"v-bb50ce5c","path":"/zh/features/v11/availability/flashback-table.html","title":"闪回表和闪回日志","lang":"zh-CN","frontmatter":{"author":"恒亦","date":"2022/11/23","minute":20},"headers":[{"level":2,"title":"概述","slug":"概述","link":"#概述","children":[]},{"level":2,"title":"使用方法","slug":"使用方法","link":"#使用方法","children":[{"level":3,"title":"语法","slug":"语法","link":"#语法","children":[]},{"level":3,"title":"示例","slug":"示例","link":"#示例","children":[]}]},{"level":2,"title":"实践指南","slug":"实践指南","link":"#实践指南","children":[{"level":3,"title":"内存占用","slug":"内存占用","link":"#内存占用","children":[]},{"level":3,"title":"磁盘占用","slug":"磁盘占用","link":"#磁盘占用","children":[]},{"level":3,"title":"性能影响","slug":"性能影响","link":"#性能影响","children":[]},{"level":3,"title":"使用限制","slug":"使用限制","link":"#使用限制","children":[]},{"level":3,"title":"使用建议","slug":"使用建议","link":"#使用建议","children":[]}]},{"level":2,"title":"详细参数列表","slug":"详细参数列表","link":"#详细参数列表","children":[]}],"git":{"updatedTime":1693374263000},"filePathRelative":"zh/features/v11/availability/flashback-table.md"}');export{l as data}; diff --git a/assets/flashback-table.html-f52a25dd.js b/assets/flashback-table.html-f52a25dd.js new file mode 100644 index 00000000000..ae71dd7a5db --- /dev/null +++ b/assets/flashback-table.html-f52a25dd.js @@ -0,0 +1,40 @@ +import{_ as p,r as n,o as r,c as i,d as e,a,w as o,b as t,e as k}from"./app-3d1677bf.js";const u={},_=a("h1",{id:"闪回表和闪回日志",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#闪回表和闪回日志","aria-hidden":"true"},"#"),t(" 闪回表和闪回日志")],-1),h={class:"table-of-contents"},b=k(`

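Overview

File systems currently cannot guarantee atomic reads and writes at the granularity of a database page. If the device loses power during a page I/O, the page data can be corrupted or lost. While implementing flashback table, we found that periodically saving old versions of data pages and replaying WAL on top of them can reconstruct a data page as of any point in time, which solves the torn-page (partial write) problem. Compared with PostgreSQL's native Full Page Write, this approach is not on the critical path of transaction commit, so it improves performance by roughly 30% to 100%. The larger the instance and the heavier the load, the more pronounced the effect.

The flashback log (Flashback Log) stores compressed old versions of data pages. It addresses the torn-page problem as follows:

  1. For each buffer in the Shared Buffer, record a Flashback Log entry that saves that version of the data page when the page is modified for the first time after each flashback point (Flashback Point)
  2. Flush the Flashback Log to disk sequentially
  3. Maintain a log index for the Flashback Log so that the Flashback Log records for a given data page can be located quickly

When a torn page is detected (the data page checksum is incorrect), the log index is used to quickly locate the Flashback Log record for that page. The record yields a correct old version of the page, which replaces the damaged one. This feature can be used on any device whose file system cannot guarantee atomic reads and writes at 8kB granularity. Note that enabling it causes some performance degradation.

The flashback table (Flashback Table) feature periodically saves data page snapshots to the flashback log and keeps transaction information in the fast recovery area, which lets users restore the data of a table as of a given moment into a new table.

Usage

Syntax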

FLASHBACK TABLE
+    [ schema. ]table
+    TO TIMESTAMP expr;
+

Example

Prepare the test data. Create the table test and insert data:

CREATE TABLE test(id int);
+INSERT INTO test select * FROM generate_series(1, 10000);
+

View the inserted data:

polardb=# SELECT count(1) FROM test;
+ count
+-------
+ 10000
+(1 row)
+
+polardb=# SELECT sum(id) FROM test;
+   sum
+----------
+ 50005000
+(1 row)
+

Wait 10 seconds, then delete the table data:

SELECT pg_sleep(10);
+DELETE FROM test;
+

The table is now empty:

polardb=# SELECT * FROM test;
+ id
+----
+(0 rows)
+

Flash the table back to its data as of 10 seconds ago:

polardb=# FLASHBACK TABLE test TO TIMESTAMP now() - interval'10s';
+NOTICE:  Flashback the relation test to new relation polar_flashback_65566, please check the data
+FLASHBACK TABLE
+

Check the data in the flashback table:

polardb=# SELECT count(1) FROM polar_flashback_65566;
+ count
+-------
+ 10000
+(1 row)
+
+polardb=# SELECT sum(id) FROM polar_flashback_65566;
+   sum
+----------
+ 50005000
+(1 row)
+

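Practical Guide

The flashback table feature depends on the flashback log and the fast recovery area. Set the polar_enable_flashback_log and polar_enable_fast_recovery_area parameters and restart to enable them. Other parameters should also be adjusted as needed; we recommend changing them all at once and restarting during off-peak hours. Enabling flashback table increases memory and disk usage and incurs some performance loss, so evaluate carefully before using it.

Memory Usage

Enabling the flashback log increases shared memory by the sum of the following three items:

  • polar_flashback_log_buffers * 8kB
  • polar_flashback_logindex_mem_size MB
  • polar_flashback_logindex_queue_buffers MB

Enabling the fast recovery area requires about 32kB of additional shared memory. Evaluate the current state of the instance before adjusting these parameters.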
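
The memory arithmetic can be checked with a short Python sketch. It only adds up the three items listed above plus the roughly 32kB for the fast recovery area. The function name is hypothetical, and the value used for polar_flashback_logindex_queue_buffers is a placeholder chosen for illustration, because this page does not state its default.

```python
KB = 1024
MB = 1024 * KB

# About 32kB of extra shared memory for the fast recovery area (approximate figure from the text).
FAST_RECOVERY_AREA_BYTES = 32 * KB


def flashback_log_shared_memory_bytes(log_buffers, logindex_mem_size_mb, logindex_queue_buffers_mb):
    """Sum the three shared-memory items added by enabling the flashback log."""
    return (
        log_buffers * 8 * KB                # polar_flashback_log_buffers, in units of 8kB
        + logindex_mem_size_mb * MB         # polar_flashback_logindex_mem_size, in MB
        + logindex_queue_buffers_mb * MB    # polar_flashback_logindex_queue_buffers, in MB
    )


# 2048 and 64 are the defaults from the parameter list; 128 is a placeholder, not a documented default.
flashback_bytes = flashback_log_shared_memory_bytes(2048, 64, 128)
total_bytes = flashback_bytes + FAST_RECOVERY_AREA_BYTES
print(flashback_bytes // MB, "MB for the flashback log")
print(total_bytes, "bytes in total")
```

With these inputs the flashback log accounts for 16MB + 64MB + 128MB; substitute the values from your own configuration to get a real estimate.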

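Disk Usage

To be able to flash back to a given point in the past, the flashback log and WAL for that period, plus the LogIndex files of both, must be retained, which increases disk usage. In principle, the larger polar_fast_recovery_area_rotation is, the more disk space is used. If polar_fast_recovery_area_rotation is set to 300, 5 hours of history is kept.

Once the flashback log is enabled, flashback points (Flashback Point) are taken periodically. A flashback point is a kind of checkpoint: when a checkpoint is triggered, the polar_flashback_point_segments and polar_flashback_point_timeout parameters are checked to decide whether the current checkpoint is a flashback point. We therefore recommend:

  • Setting polar_flashback_point_segments to a multiple of max_wal_size
  • Setting polar_flashback_point_timeout to a multiple of checkpoint_timeout

Suppose 20GB of WAL is generated over 5 hours and the ratio of flashback log to WAL is about 1:20; then about 1GB of flashback log is generated. The ratio of flashback log to WAL depends on two factors:

  • The more writes in the workload, the more flashback log is generated
  • The larger polar_flashback_point_segments and polar_flashback_point_timeout are set, the less flashback log is generated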
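
The disk estimate above is simple arithmetic, sketched here in Python. The function names are hypothetical, the 1:20 ratio is the rough figure quoted in the text rather than a guarantee, and the last helper assumes that the retained flashback log files occupy their count multiplied by the 256MB file size given in the parameter list.

```python
def retained_history_hours(rotation_minutes):
    """polar_fast_recovery_area_rotation is expressed in minutes."""
    return rotation_minutes / 60


def estimated_flashback_log_gb(wal_gb, wal_per_flashback_log=20):
    """Estimate flashback log volume from WAL volume using the approximate 1:20 ratio."""
    return wal_gb / wal_per_flashback_log


def retained_flashback_segments_mb(keep_segments, segment_mb=256):
    """Assumed space held by polar_flashback_log_keep_segments files of 256MB each."""
    return keep_segments * segment_mb


print(retained_history_hours(300))        # rotation of 300 minutes
print(estimated_flashback_log_gb(20))     # 20GB of WAL over the retained window
print(retained_flashback_segments_mb(8))  # default keep_segments of 8
```

The real ratio varies with the write intensity of the workload and with the flashback point settings, so treat the result as an order-of-magnitude estimate.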

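Performance Impact

The flashback log feature adds two background processes that consume flashback log records, which inevitably increases CPU overhead. You can adjust polar_flashback_log_bgwrite_delay and polar_flashback_log_insert_list_delay to lengthen the work interval of the two background processes and reduce CPU consumption, but this may cause some performance degradation; we recommend keeping the default values.

In addition, to ensure that no flashback log is lost, the flashback log for a page must be flushed before the dirty page itself is flushed, which may cause some performance degradation. In current tests, the degradation does not exceed 5% in most scenarios.

During a table flashback, the pages of the target table are swapped in and out of the shared buffer pool, which may cause performance jitter for other database operations.

Limitations

The flashback table feature currently restores the data of the target table into a new table whose name is polar_flashback_ followed by the OID of the target table. After the FLASHBACK TABLE statement is executed, a NOTICE like the following is shown: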

polardb=# flashback table test to timestamp now() - interval '1h';
+NOTICE:  Flashback the relation test to new relation polar_flashback_54986, please check the data
+FLASHBACK TABLE
+

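Here polar_flashback_54986 is the temporary table produced by the flashback; only the table data is restored to the target time. Currently only ordinary tables can be flashed back. The following database objects are not supported:

  • Indexes
  • TOAST tables
  • Materialized views
  • Partitioned tables and their child partitions
  • System tables
  • Foreign tables
  • Tables that have a TOAST child table

In addition, a table cannot be flashed back if any of the following DDL was executed on it between the target time and now:

  • DROP TABLE
  • ALTER TABLE SET WITH OIDS
  • ALTER TABLE SET WITHOUT OIDS
  • TRUNCATE TABLE
  • Changing a column's type when the old and new types cannot be implicitly converted directly and the USING clause is not a safe cast that needs no additional values
  • Changing the table to UNLOGGED or LOGGED
  • Adding an IDENTITY column
  • Adding a column with a constraint
  • Adding a column whose default expression contains a volatile function

For DROP TABLE, the table can be recovered with the flashback drop feature of PolarDB for PostgreSQL/Oracle.

Recommendations

When data is modified by mistake, we recommend first using the audit log to quickly pinpoint when the mistake happened, and then flashing the target table back to a time before that. During a table flashback, an exclusive lock is held on the target table, so only queries can be run against it. In addition, the pages of the target table are swapped in and out of the shared buffer pool during the flashback, which may cause performance jitter for other database operations. We therefore recommend running flashback during off-peak hours.

Flashback speed depends on the size of the table. For a large table, you can save time by increasing polar_workers_per_flashback_table to add more parallel flashback workers.

After the flashback finishes, follow the NOTICE to query the data in the flashback table and compare it with the original table. The flashback table has no indexes; create them yourself if your queries need them. Once the comparison is done, the missing data can be written back to the original table.

Detailed Parameter List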

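| Parameter | Description | Range | Default | How it takes effect |
| --- | --- | --- | --- | --- |
| polar_enable_flashback_log | Whether to enable the flashback log | on / off | off | Edit the configuration file, then restart |
| polar_enable_fast_recovery_area | Whether to enable the fast recovery area | on / off | off | Edit the configuration file, then restart |
| polar_flashback_log_keep_segments | Number of flashback log files to keep; files are reusable. Each file is 256MB | [3, 2147483647] | 8 | SIGHUP |
| polar_fast_recovery_area_rotation | How long transaction information is kept in the fast recovery area, in minutes; that is, the furthest back a table can be flashed back | [1, 14400] | 180 | SIGHUP |
| polar_flashback_point_segments | Minimum number of WAL files between two flashback points; each WAL file is 1GB | [1, 2147483647] | 16 | SIGHUP |
| polar_flashback_point_timeout | Minimum time interval between two flashback points, in seconds | [1, 86400] | 300 | SIGHUP |
| polar_flashback_log_buffers | Shared memory size for the flashback log, in units of 8kB | [4, 262144] | 2048 (16MB) | Edit the configuration file, then restart |
| polar_flashback_logindex_mem_size | Shared memory size for the flashback log index, in MB | [3, 1073741823] | 64 | Edit the configuration file, then restart |
| polar_flashback_logindex_bloom_blocks | Number of bloom filter pages for the flashback log index | [8, 1073741823] | 512 | Edit the configuration file, then restart |
| polar_flashback_log_insert_locks | Number of flashback log insert locks | [1, 2147483647] | 8 | Edit the configuration file, then restart |
| polar_workers_per_flashback_table | Number of parallel workers for flashback table | [0, 1024] (0 disables parallelism) | 5 | Immediately |
| polar_flashback_log_bgwrite_delay | Work interval of the flashback log bgwriter process, in ms | [1, 10000] | 100 | SIGHUP |
| polar_flashback_log_flush_max_size | Amount of flashback log flushed to disk by the flashback log bgwriter process each time, in kB | [0, 2097152] (0 means unlimited) | 5120 | SIGHUP |
| polar_flashback_log_insert_list_delay | Work interval of the flashback log bginserter process, in ms | [1, 10000] | 10 | SIGHUP |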
`,52);function f(d,g){const c=n("Badge"),l=n("ArticleInfo"),s=n("router-link");return r(),i("div",null,[_,e(c,{type:"tip",text:"V11 / v1.1.22-",vertical:"top"}),e(l,{frontmatter:d.$frontmatter},null,8,["frontmatter"]),a("nav",h,[a("ul",null,[a("li",null,[e(s,{to:"#概述"},{default:o(()=>[t("概述")]),_:1})]),a("li",null,[e(s,{to:"#使用方法"},{default:o(()=>[t("使用方法")]),_:1}),a("ul",null,[a("li",null,[e(s,{to:"#语法"},{default:o(()=>[t("语法")]),_:1})]),a("li",null,[e(s,{to:"#示例"},{default:o(()=>[t("示例")]),_:1})])])]),a("li",null,[e(s,{to:"#实践指南"},{default:o(()=>[t("实践指南")]),_:1}),a("ul",null,[a("li",null,[e(s,{to:"#内存占用"},{default:o(()=>[t("内存占用")]),_:1})]),a("li",null,[e(s,{to:"#磁盘占用"},{default:o(()=>[t("磁盘占用")]),_:1})]),a("li",null,[e(s,{to:"#性能影响"},{default:o(()=>[t("性能影响")]),_:1})]),a("li",null,[e(s,{to:"#使用限制"},{default:o(()=>[t("使用限制")]),_:1})]),a("li",null,[e(s,{to:"#使用建议"},{default:o(()=>[t("使用建议")]),_:1})])])]),a("li",null,[e(s,{to:"#详细参数列表"},{default:o(()=>[t("详细参数列表")]),_:1})])])]),b])}const w=p(u,[["render",f],["__file","flashback-table.html.vue"]]);export{w as default}; diff --git a/assets/fs-pfs-curve.html-99b42104.js b/assets/fs-pfs-curve.html-99b42104.js new file mode 100644 index 00000000000..4a1ea4fbdd5 --- /dev/null +++ b/assets/fs-pfs-curve.html-99b42104.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-63309b3e","path":"/zh/deploying/fs-pfs-curve.html","title":"格式化并挂载 PFS for CurveBS","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/08/31","minute":20},"headers":[{"level":2,"title":"PFS 编译安装","slug":"pfs-编译安装","link":"#pfs-编译安装","children":[]},{"level":2,"title":"读写节点块设备映射与格式化","slug":"读写节点块设备映射与格式化","link":"#读写节点块设备映射与格式化","children":[]},{"level":2,"title":"格式化 curve 卷","slug":"格式化-curve-卷","link":"#格式化-curve-卷","children":[]},{"level":2,"title":"启动 pfsd 守护进程","slug":"启动-pfsd-守护进程","link":"#启动-pfsd-守护进程","children":[]},{"level":2,"title":"在 PFS 上编译部署 PolarDB for 
Curve","slug":"在-pfs-上编译部署-polardb-for-curve","link":"#在-pfs-上编译部署-polardb-for-curve","children":[]}],"git":{"updatedTime":1652432106000},"filePathRelative":"zh/deploying/fs-pfs-curve.md"}');export{e as data}; diff --git a/assets/fs-pfs-curve.html-afd924fe.js b/assets/fs-pfs-curve.html-afd924fe.js new file mode 100644 index 00000000000..7220ebc062c --- /dev/null +++ b/assets/fs-pfs-curve.html-afd924fe.js @@ -0,0 +1,18 @@ +import{_ as p,r as n,o as d,c as u,d as s,a,b as e,w as o,e as t}from"./app-3d1677bf.js";const v={},m=a("h1",{id:"格式化并挂载-pfs-for-curvebs",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#格式化并挂载-pfs-for-curvebs","aria-hidden":"true"},"#"),e(" 格式化并挂载 PFS for CurveBS")],-1),h=a("p",null,"PolarDB File System,简称 PFS 或 PolarFS,是由阿里云自主研发的高性能类 POSIX 的用户态分布式文件系统,服务于阿里云数据库 PolarDB 产品。使用 PFS 对共享存储进行格式化并挂载后,能够保证一个计算节点对共享存储的写入能够立刻对另一个计算节点可见。",-1),b=a("h2",{id:"pfs-编译安装",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#pfs-编译安装","aria-hidden":"true"},"#"),e(" PFS 编译安装")],-1),f={href:"https://github.com/opencurve",target:"_blank",rel:"noopener noreferrer"},_={href:"https://github.com/opencurve/PolarDB-FileSystem",target:"_blank",rel:"noopener noreferrer"},k=t(`
docker pull polardb/polardb_pg_devel:curvebs
+docker run -it \\
+    --network=host \\
+    --cap-add=SYS_PTRACE --privileged=true \\
+    --name polardb_pg \\
+    polardb/polardb_pg_devel:curvebs bash
+

Block Device Mapping and Formatting on the Read-Write Node

After entering the container, edit the curve configuration file:

sudo vim /etc/curve/client.conf
+#
+################### mds一侧配置信息 ##################
+#
+
+# mds的地址信息,对于mds集群,地址以逗号隔开
+mds.listen.addr=127.0.0.1:6666
+... ...
+
`,4),g=a("code",null,"mds.listen.addr",-1),P=a("code",null,"cluster mds addr",-1),S=t(`

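The curve tool is already installed in the container and can be used to create volumes. Use it to create the curve volume that will actually store the PolarFS data: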

curve create --filename /volume --user my --length 10 --stripeUnit 16384 --stripeCount 64
+

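Run curve create -h for a detailed description of volume creation. In the example above, we created a volume with the following properties:

  • The volume name is /volume
  • The owning user is my
  • The size is 10GB
  • The stripe size is 16KB
  • The stripe count is 64

Note in particular that for database workloads we strongly recommend striped volumes, because only striping takes full advantage of Curve's performance; the 16384 * 64 stripe configuration is currently the optimal one.

Formatting the curve Volume

Before the curve volume can be used, it must be formatted with pfs: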

sudo pfs -C curve mkfs pool@@volume_my_
+

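Just as a file system must be created on a disk before it can be mounted locally, the curve volume must be formatted as a PolarFS file system.

Note that because of the way PolarFS parses volume names, the curve volume is specified in the form pool@\${volume}_\${user}_, and every / in the volume name must be replaced with @.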
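
The naming rule can be written out as a small Python helper. The function name is hypothetical and the sketch only performs the string transformation described above; it does not call pfs or curve.

```python
def pfs_curve_volume_name(volume, user):
    """Build the pool@<volume>_<user>_ identifier, replacing every '/' in the volume name with '@'."""
    return "pool@" + volume.replace("/", "@") + "_" + user + "_"


# The volume created earlier: --filename /volume --user my
name = pfs_curve_volume_name("/volume", "my")
print(name)
```

For the volume /volume owned by user my, this yields the pool@@volume_my_ argument used in the mkfs command.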

Starting the pfsd Daemon

sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p pool@@volume_my_
+

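If pfsd starts successfully, the curve edition of PolarFS is fully deployed and the PFS file system is mounted. The next step is to compile and deploy PolarDB.


Compiling and Deploying PolarDB for Curve on PFS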

`,15);function B(c,F){const i=n("ArticleInfo"),r=n("ExternalLinkIcon"),l=n("RouterLink");return d(),u("div",null,[m,s(i,{frontmatter:c.$frontmatter},null,8,["frontmatter"]),h,b,a("p",null,[e("在 PolarDB 计算节点上准备好 PFS 相关工具。推荐使用 DockerHub 上的 PolarDB 开发镜像,其中已经包含了编译完毕的 PFS,无需再次编译安装。"),a("a",f,[e("Curve 开源社区"),s(r)]),e(" 针对 PFS 对接 CurveBS 存储做了专门的优化。在用于部署 PolarDB 的计算节点上,使用下面的命令拉起带有 "),a("a",_,[e("PFS for CurveBS"),s(r)]),e(" 的 PolarDB 开发镜像:")]),k,a("p",null,[e("注意,这里的 "),g,e(" 请填写"),s(l,{to:"/zh/deploying/storage-curvebs.html#%E9%83%A8%E7%BD%B2-curvebs-%E9%9B%86%E7%BE%A4"},{default:o(()=>[e("部署 CurveBS 集群")]),_:1}),e("中集群状态中输出的 "),P]),S,a("p",null,[e("参阅 "),s(l,{to:"/zh/deploying/db-pfs-curve.html"},{default:o(()=>[e("PolarDB 编译部署:PFS 文件系统")]),_:1}),e("。")])])}const C=p(v,[["render",B],["__file","fs-pfs-curve.html.vue"]]);export{C as default}; diff --git a/assets/fs-pfs-curve.html-b215dfd2.js b/assets/fs-pfs-curve.html-b215dfd2.js new file mode 100644 index 00000000000..195f351ecc1 --- /dev/null +++ b/assets/fs-pfs-curve.html-b215dfd2.js @@ -0,0 +1,18 @@ +import{_ as p,r as n,o as d,c as u,d as s,a,b as e,w as o,e as t}from"./app-3d1677bf.js";const v={},m=a("h1",{id:"格式化并挂载-pfs-for-curvebs",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#格式化并挂载-pfs-for-curvebs","aria-hidden":"true"},"#"),e(" 格式化并挂载 PFS for CurveBS")],-1),b=a("p",null,"PolarDB File System,简称 PFS 或 PolarFS,是由阿里云自主研发的高性能类 POSIX 的用户态分布式文件系统,服务于阿里云数据库 PolarDB 产品。使用 PFS 对共享存储进行格式化并挂载后,能够保证一个计算节点对共享存储的写入能够立刻对另一个计算节点可见。",-1),h=a("h2",{id:"pfs-编译安装",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#pfs-编译安装","aria-hidden":"true"},"#"),e(" PFS 编译安装")],-1),f={href:"https://github.com/opencurve",target:"_blank",rel:"noopener noreferrer"},_={href:"https://github.com/opencurve/PolarDB-FileSystem",target:"_blank",rel:"noopener noreferrer"},k=t(`
docker pull polardb/polardb_pg_devel:curvebs
+docker run -it \\
+    --network=host \\
+    --cap-add=SYS_PTRACE --privileged=true \\
+    --name polardb_pg \\
+    polardb/polardb_pg_devel:curvebs bash
+

Block Device Mapping and Formatting on the Read-Write Node

After entering the container, edit the curve configuration file:

sudo vim /etc/curve/client.conf
+#
+################### mds一侧配置信息 ##################
+#
+
+# mds的地址信息,对于mds集群,地址以逗号隔开
+mds.listen.addr=127.0.0.1:6666
+... ...
+
`,4),g=a("code",null,"mds.listen.addr",-1),P=a("code",null,"cluster mds addr",-1),S=t(`

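The curve tool is already installed in the container and can be used to create volumes. Use it to create the curve volume that will actually store the PolarFS data: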

curve create --filename /volume --user my --length 10 --stripeUnit 16384 --stripeCount 64
+

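Run curve create -h for a detailed description of volume creation. In the example above, we created a volume with the following properties:

  • The volume name is /volume
  • The owning user is my
  • The size is 10GB
  • The stripe size is 16KB
  • The stripe count is 64

Note in particular that for database workloads we strongly recommend striped volumes, because only striping takes full advantage of Curve's performance; the 16384 * 64 stripe configuration is currently the optimal one.

Formatting the curve Volume

Before the curve volume can be used, it must be formatted with pfs: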

sudo pfs -C curve mkfs pool@@volume_my_
+

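Just as a file system must be created on a disk before it can be mounted locally, the curve volume must be formatted as a PolarFS file system.

Note that because of the way PolarFS parses volume names, the curve volume is specified in the form pool@\${volume}_\${user}_, and every / in the volume name must be replaced with @.

Starting the pfsd Daemon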

sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p pool@@volume_my_
+

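If pfsd starts successfully, the curve edition of PolarFS is fully deployed and the PFS file system is mounted. The next step is to compile and deploy PolarDB.


Compiling and Deploying PolarDB for Curve on PFS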

`,15);function B(c,F){const i=n("ArticleInfo"),r=n("ExternalLinkIcon"),l=n("RouterLink");return d(),u("div",null,[m,s(i,{frontmatter:c.$frontmatter},null,8,["frontmatter"]),b,h,a("p",null,[e("在 PolarDB 计算节点上准备好 PFS 相关工具。推荐使用 DockerHub 上的 PolarDB 开发镜像,其中已经包含了编译完毕的 PFS,无需再次编译安装。"),a("a",f,[e("Curve 开源社区"),s(r)]),e(" 针对 PFS 对接 CurveBS 存储做了专门的优化。在用于部署 PolarDB 的计算节点上,使用下面的命令拉起带有 "),a("a",_,[e("PFS for CurveBS"),s(r)]),e(" 的 PolarDB 开发镜像:")]),k,a("p",null,[e("注意,这里的 "),g,e(" 请填写"),s(l,{to:"/deploying/storage-curvebs.html#%E9%83%A8%E7%BD%B2-curvebs-%E9%9B%86%E7%BE%A4"},{default:o(()=>[e("部署 CurveBS 集群")]),_:1}),e("中集群状态中输出的 "),P]),S,a("p",null,[e("参阅 "),s(l,{to:"/deploying/db-pfs-curve.html"},{default:o(()=>[e("PolarDB 编译部署:PFS 文件系统")]),_:1}),e("。")])])}const C=p(v,[["render",B],["__file","fs-pfs-curve.html.vue"]]);export{C as default}; diff --git a/assets/fs-pfs-curve.html-fa75d4d3.js b/assets/fs-pfs-curve.html-fa75d4d3.js new file mode 100644 index 00000000000..0d7c1c123ba --- /dev/null +++ b/assets/fs-pfs-curve.html-fa75d4d3.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-e8e53a66","path":"/deploying/fs-pfs-curve.html","title":"格式化并挂载 PFS for CurveBS","lang":"en-US","frontmatter":{"author":"棠羽","date":"2022/08/31","minute":20},"headers":[{"level":2,"title":"PFS 编译安装","slug":"pfs-编译安装","link":"#pfs-编译安装","children":[]},{"level":2,"title":"读写节点块设备映射与格式化","slug":"读写节点块设备映射与格式化","link":"#读写节点块设备映射与格式化","children":[]},{"level":2,"title":"格式化 curve 卷","slug":"格式化-curve-卷","link":"#格式化-curve-卷","children":[]},{"level":2,"title":"启动 pfsd 守护进程","slug":"启动-pfsd-守护进程","link":"#启动-pfsd-守护进程","children":[]},{"level":2,"title":"在 PFS 上编译部署 PolarDB for Curve","slug":"在-pfs-上编译部署-polardb-for-curve","link":"#在-pfs-上编译部署-polardb-for-curve","children":[]}],"git":{"updatedTime":1652432106000},"filePathRelative":"deploying/fs-pfs-curve.md"}');export{e as data}; diff --git a/assets/fs-pfs.html-0c262459.js b/assets/fs-pfs.html-0c262459.js new file mode 100644 index 00000000000..1eafdb4614d 
--- /dev/null +++ b/assets/fs-pfs.html-0c262459.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-0aa4c420","path":"/zh/deploying/fs-pfs.html","title":"格式化并挂载 PFS","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/05/09","minute":15},"headers":[{"level":2,"title":"PFS 编译安装","slug":"pfs-编译安装","link":"#pfs-编译安装","children":[]},{"level":2,"title":"块设备重命名","slug":"块设备重命名","link":"#块设备重命名","children":[]},{"level":2,"title":"块设备格式化","slug":"块设备格式化","link":"#块设备格式化","children":[]},{"level":2,"title":"PFS 文件系统挂载","slug":"pfs-文件系统挂载","link":"#pfs-文件系统挂载","children":[]},{"level":2,"title":"在 PFS 上编译部署 PolarDB","slug":"在-pfs-上编译部署-polardb","link":"#在-pfs-上编译部署-polardb","children":[]}],"git":{"updatedTime":1703744114000},"filePathRelative":"zh/deploying/fs-pfs.md"}');export{l as data}; diff --git a/assets/fs-pfs.html-78c353ce.js b/assets/fs-pfs.html-78c353ce.js new file mode 100644 index 00000000000..c4899453e09 --- /dev/null +++ b/assets/fs-pfs.html-78c353ce.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-4bd622ef","path":"/deploying/fs-pfs.html","title":"格式化并挂载 PFS","lang":"en-US","frontmatter":{"author":"棠羽","date":"2022/05/09","minute":15},"headers":[{"level":2,"title":"PFS 编译安装","slug":"pfs-编译安装","link":"#pfs-编译安装","children":[]},{"level":2,"title":"块设备重命名","slug":"块设备重命名","link":"#块设备重命名","children":[]},{"level":2,"title":"块设备格式化","slug":"块设备格式化","link":"#块设备格式化","children":[]},{"level":2,"title":"PFS 文件系统挂载","slug":"pfs-文件系统挂载","link":"#pfs-文件系统挂载","children":[]},{"level":2,"title":"在 PFS 上编译部署 PolarDB","slug":"在-pfs-上编译部署-polardb","link":"#在-pfs-上编译部署-polardb","children":[]}],"git":{"updatedTime":1703744114000},"filePathRelative":"deploying/fs-pfs.md"}');export{l as data}; diff --git a/assets/fs-pfs.html-b712bf3a.js b/assets/fs-pfs.html-b712bf3a.js new file mode 100644 index 00000000000..4baf31c3a06 --- /dev/null +++ b/assets/fs-pfs.html-b712bf3a.js @@ -0,0 +1,23 @@ +import{_ as i,r as o,o as k,c as d,d as a,a as n,w as p,b as s,e as 
c}from"./app-3d1677bf.js";const h={},m=n("h1",{id:"格式化并挂载-pfs",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#格式化并挂载-pfs","aria-hidden":"true"},"#"),s(" 格式化并挂载 PFS")],-1),f=n("p",null,"PolarDB File System,简称 PFS 或 PolarFS,是由阿里云自主研发的高性能类 POSIX 的用户态分布式文件系统,服务于阿里云数据库 PolarDB 产品。使用 PFS 对共享存储进行格式化并挂载后,能够保证一个计算节点对共享存储的写入能够立刻对另一个计算节点可见。",-1),b={class:"table-of-contents"},_=n("h2",{id:"pfs-编译安装",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#pfs-编译安装","aria-hidden":"true"},"#"),s(" PFS 编译安装")],-1),g={href:"https://hub.docker.com/u/polardb",target:"_blank",rel:"noopener noreferrer"},v={href:"https://hub.docker.com/r/polardb/polardb_pg_binary/tags",target:"_blank",rel:"noopener noreferrer"},x=n("code",null,"linux/amd64",-1),P=n("code",null,"linux/arm64",-1),S=c(`
docker pull polardb/polardb_pg_binary
+docker run -it \\
+    --cap-add=SYS_PTRACE \\
+    --privileged=true \\
+    --name polardb_pg \\
+    --shm-size=512m \\
+    polardb/polardb_pg_binary \\
+    bash
+
`,1),F={href:"https://github.com/ApsaraDB/polardb-file-system/blob/master/Readme-CN.md",target:"_blank",rel:"noopener noreferrer"},B=n("h2",{id:"块设备重命名",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#块设备重命名","aria-hidden":"true"},"#"),s(" 块设备重命名")],-1),D=n("strong",null,"以特定字符开头的块设备",-1),q={href:"https://github.com/ApsaraDB/PolarDB-FileSystem",target:"_blank",rel:"noopener noreferrer"},y={href:"https://github.com/ApsaraDB/PolarDB-FileSystem/blob/master/src/pfs_core/pfs_api.h",target:"_blank",rel:"noopener noreferrer"},A=n("code",null,"src/pfs_core/pfs_api.h",-1),L=c(`
#define PFS_PATH_ISVALID(path)                                  \\
+    (path != NULL &&                                            \\
+     ((path[0] == '/' && isdigit((path)[1])) || path[0] == '.'  \\
+      || strncmp(path, "/pangu-", 7) == 0                       \\
+      || strncmp(path, "/sd", 3) == 0                           \\
+      || strncmp(path, "/sf", 3) == 0                           \\
+      || strncmp(path, "/vd", 3) == 0                           \\
+      || strncmp(path, "/nvme", 5) == 0                         \\
+      || strncmp(path, "/loop", 5) == 0                         \\
+      || strncmp(path, "/mapper_", 8) ==0))
+

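Therefore, to make sure the subsequent steps go smoothly, we recommend that every node accessing the block device use the same symbolic link to reach the shared block device. For example, on the NBD server host, create the new block device name /dev/nvme1n1 as a symbolic link to the shared storage block device's original name /dev/vdb: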

sudo ln -s /dev/vdb /dev/nvme1n1
+

On the NBD client host, create the same block device name /dev/nvme1n1 as a symbolic link to the shared storage block device's original name /dev/nbd0:

sudo ln -s /dev/nbd0 /dev/nvme1n1
+

The server and client hosts can now both access the same block device under the same name, /dev/nvme1n1.
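
To see why the symbolic link is needed, the prefix check in the PFS_PATH_ISVALID macro quoted earlier can be mirrored in Python. This is an illustrative port with a hypothetical function name, not code from PFS, and it assumes the paths being checked look like the device-rooted names the macro tests for (for example /nvme1n1).

```python
# Prefixes accepted by the PFS_PATH_ISVALID macro, besides "/<digit>" and a leading ".".
_ACCEPTED_PREFIXES = ("/pangu-", "/sd", "/sf", "/vd", "/nvme", "/loop", "/mapper_")


def pfs_path_is_valid(path):
    """Mirror the macro: reject NULL/empty paths, then test the leading characters."""
    if path is None or path == "":
        return False
    # path[0] == '/' && isdigit(path[1])
    if path[0] == "/" and len(path) > 1 and path[1] in "0123456789":
        return True
    if path[0] == ".":
        return True
    return path.startswith(_ACCEPTED_PREFIXES)


for candidate in ("/nvme1n1", "/vdb", "/nbd0"):
    print(candidate, pfs_path_is_valid(candidate))
```

Under this check a name derived from nbd0 is rejected while one derived from nvme1n1 is accepted, which matches the reason given above for linking /dev/nbd0 to /dev/nvme1n1.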

Formatting the Block Device

On any one host, format the shared storage block device with the PFS distributed file system:

sudo pfs -C disk mkfs nvme1n1
+

Mounting the PFS File System

On every host node that can access the shared storage, start the PFS daemon to mount the PFS file system:

sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
+

Compiling and Deploying PolarDB on PFS

`,14);function N(l,I){const r=o("ArticleInfo"),t=o("router-link"),e=o("ExternalLinkIcon"),u=o("RouterLink");return k(),d("div",null,[m,a(r,{frontmatter:l.$frontmatter},null,8,["frontmatter"]),f,n("nav",b,[n("ul",null,[n("li",null,[a(t,{to:"#pfs-编译安装"},{default:p(()=>[s("PFS 编译安装")]),_:1})]),n("li",null,[a(t,{to:"#块设备重命名"},{default:p(()=>[s("块设备重命名")]),_:1})]),n("li",null,[a(t,{to:"#块设备格式化"},{default:p(()=>[s("块设备格式化")]),_:1})]),n("li",null,[a(t,{to:"#pfs-文件系统挂载"},{default:p(()=>[s("PFS 文件系统挂载")]),_:1})]),n("li",null,[a(t,{to:"#在-pfs-上编译部署-polardb"},{default:p(()=>[s("在 PFS 上编译部署 PolarDB")]),_:1})])])]),_,n("p",null,[s("推荐使用 "),n("a",g,[s("DockerHub"),a(e)]),s(" 上的 PolarDB for PostgreSQL "),n("a",v,[s("可执行文件镜像"),a(e)]),s(",目前支持 "),x,s(" 和 "),P,s(" 两种架构,其中已经包含了编译完毕的 PFS 工具,无需手动编译安装。通过以下命令进入容器即可:")]),S,n("p",null,[s("PFS 的手动编译安装方式请参考 PFS 的 "),n("a",F,[s("README"),a(e)]),s(",此处不再赘述。")]),B,n("p",null,[s("PFS 仅支持访问 "),D,s("(详情可见 "),n("a",q,[s("PolarDB File System"),a(e)]),s(" 源代码的 "),n("a",y,[A,a(e)]),s(" 文件):")]),L,n("p",null,[s("参阅 "),a(u,{to:"/zh/deploying/db-pfs.html"},{default:p(()=>[s("PolarDB 编译部署:PFS 文件系统")]),_:1}),s("。")])])}const C=i(h,[["render",N],["__file","fs-pfs.html.vue"]]);export{C as default}; diff --git a/assets/fs-pfs.html-fa03f7c3.js b/assets/fs-pfs.html-fa03f7c3.js new file mode 100644 index 00000000000..dc92b79bac6 --- /dev/null +++ b/assets/fs-pfs.html-fa03f7c3.js @@ -0,0 +1,23 @@ +import{_ as i,r as o,o as k,c as d,d as a,a as n,w as p,b as s,e as c}from"./app-3d1677bf.js";const h={},m=n("h1",{id:"格式化并挂载-pfs",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#格式化并挂载-pfs","aria-hidden":"true"},"#"),s(" 格式化并挂载 PFS")],-1),f=n("p",null,"PolarDB File System,简称 PFS 或 PolarFS,是由阿里云自主研发的高性能类 POSIX 的用户态分布式文件系统,服务于阿里云数据库 PolarDB 产品。使用 PFS 对共享存储进行格式化并挂载后,能够保证一个计算节点对共享存储的写入能够立刻对另一个计算节点可见。",-1),b={class:"table-of-contents"},_=n("h2",{id:"pfs-编译安装",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#pfs-编译安装","aria-hidden":"true"},"#"),s(" PFS 
编译安装")],-1),g={href:"https://hub.docker.com/u/polardb",target:"_blank",rel:"noopener noreferrer"},v={href:"https://hub.docker.com/r/polardb/polardb_pg_binary/tags",target:"_blank",rel:"noopener noreferrer"},x=n("code",null,"linux/amd64",-1),P=n("code",null,"linux/arm64",-1),S=c(`
docker pull polardb/polardb_pg_binary
+docker run -it \\
+    --cap-add=SYS_PTRACE \\
+    --privileged=true \\
+    --name polardb_pg \\
+    --shm-size=512m \\
+    polardb/polardb_pg_binary \\
+    bash
+
`,1),F={href:"https://github.com/ApsaraDB/polardb-file-system/blob/master/Readme-CN.md",target:"_blank",rel:"noopener noreferrer"},B=n("h2",{id:"块设备重命名",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#块设备重命名","aria-hidden":"true"},"#"),s(" 块设备重命名")],-1),D=n("strong",null,"以特定字符开头的块设备",-1),q={href:"https://github.com/ApsaraDB/PolarDB-FileSystem",target:"_blank",rel:"noopener noreferrer"},y={href:"https://github.com/ApsaraDB/PolarDB-FileSystem/blob/master/src/pfs_core/pfs_api.h",target:"_blank",rel:"noopener noreferrer"},A=n("code",null,"src/pfs_core/pfs_api.h",-1),L=c(`
#define PFS_PATH_ISVALID(path)                                  \\
+    (path != NULL &&                                            \\
+     ((path[0] == '/' && isdigit((path)[1])) || path[0] == '.'  \\
+      || strncmp(path, "/pangu-", 7) == 0                       \\
+      || strncmp(path, "/sd", 3) == 0                           \\
+      || strncmp(path, "/sf", 3) == 0                           \\
+      || strncmp(path, "/vd", 3) == 0                           \\
+      || strncmp(path, "/nvme", 5) == 0                         \\
+      || strncmp(path, "/loop", 5) == 0                         \\
+      || strncmp(path, "/mapper_", 8) ==0))
+

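Therefore, to make sure the subsequent steps go smoothly, we recommend that every node accessing the block device use the same symbolic link to reach the shared block device. For example, on the NBD server host, create the new block device name /dev/nvme1n1 as a symbolic link to the shared storage block device's original name /dev/vdb: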

sudo ln -s /dev/vdb /dev/nvme1n1
+

On the NBD client host, create the same block device name /dev/nvme1n1 as a symbolic link to the shared storage block device's original name /dev/nbd0:

sudo ln -s /dev/nbd0 /dev/nvme1n1
+

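The server and client hosts can now both access the same block device under the same name, /dev/nvme1n1.

Formatting the Block Device

On any one host, format the shared storage block device with the PFS distributed file system: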

sudo pfs -C disk mkfs nvme1n1
+

Mounting the PFS File System

On every host node that can access the shared storage, start the PFS daemon to mount the PFS file system:

sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
+

Compiling and Deploying PolarDB on PFS

`,14);function N(l,I){const r=o("ArticleInfo"),t=o("router-link"),e=o("ExternalLinkIcon"),u=o("RouterLink");return k(),d("div",null,[m,a(r,{frontmatter:l.$frontmatter},null,8,["frontmatter"]),f,n("nav",b,[n("ul",null,[n("li",null,[a(t,{to:"#pfs-编译安装"},{default:p(()=>[s("PFS 编译安装")]),_:1})]),n("li",null,[a(t,{to:"#块设备重命名"},{default:p(()=>[s("块设备重命名")]),_:1})]),n("li",null,[a(t,{to:"#块设备格式化"},{default:p(()=>[s("块设备格式化")]),_:1})]),n("li",null,[a(t,{to:"#pfs-文件系统挂载"},{default:p(()=>[s("PFS 文件系统挂载")]),_:1})]),n("li",null,[a(t,{to:"#在-pfs-上编译部署-polardb"},{default:p(()=>[s("在 PFS 上编译部署 PolarDB")]),_:1})])])]),_,n("p",null,[s("推荐使用 "),n("a",g,[s("DockerHub"),a(e)]),s(" 上的 PolarDB for PostgreSQL "),n("a",v,[s("可执行文件镜像"),a(e)]),s(",目前支持 "),x,s(" 和 "),P,s(" 两种架构,其中已经包含了编译完毕的 PFS 工具,无需手动编译安装。通过以下命令进入容器即可:")]),S,n("p",null,[s("PFS 的手动编译安装方式请参考 PFS 的 "),n("a",F,[s("README"),a(e)]),s(",此处不再赘述。")]),B,n("p",null,[s("PFS 仅支持访问 "),D,s("(详情可见 "),n("a",q,[s("PolarDB File System"),a(e)]),s(" 源代码的 "),n("a",y,[A,a(e)]),s(" 文件):")]),L,n("p",null,[s("参阅 "),a(u,{to:"/deploying/db-pfs.html"},{default:p(()=>[s("PolarDB 编译部署:PFS 文件系统")]),_:1}),s("。")])])}const C=i(h,[["render",N],["__file","fs-pfs.html.vue"]]);export{C as default}; diff --git a/assets/grow-storage.html-2006e829.js b/assets/grow-storage.html-2006e829.js new file mode 100644 index 00000000000..32dc839fbd7 --- /dev/null +++ b/assets/grow-storage.html-2006e829.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-a3c3fc30","path":"/zh/operation/grow-storage.html","title":"共享存储在线扩容","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/10/12","minute":15},"headers":[{"level":2,"title":"块存储层扩容","slug":"块存储层扩容","link":"#块存储层扩容","children":[]},{"level":2,"title":"文件系统层扩容","slug":"文件系统层扩容","link":"#文件系统层扩容","children":[]},{"level":2,"title":"数据库实例层扩容","slug":"数据库实例层扩容","link":"#数据库实例层扩容","children":[]}],"git":{"updatedTime":1672970315000},"filePathRelative":"zh/operation/grow-storage.md"}');export{e as data}; diff --git 
a/assets/grow-storage.html-358b501e.js b/assets/grow-storage.html-358b501e.js new file mode 100644 index 00000000000..6925c7d08c2 --- /dev/null +++ b/assets/grow-storage.html-358b501e.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-530a6d12","path":"/operation/grow-storage.html","title":"共享存储在线扩容","lang":"en-US","frontmatter":{"author":"棠羽","date":"2022/10/12","minute":15},"headers":[{"level":2,"title":"块存储层扩容","slug":"块存储层扩容","link":"#块存储层扩容","children":[]},{"level":2,"title":"文件系统层扩容","slug":"文件系统层扩容","link":"#文件系统层扩容","children":[]},{"level":2,"title":"数据库实例层扩容","slug":"数据库实例层扩容","link":"#数据库实例层扩容","children":[]}],"git":{"updatedTime":1672970315000},"filePathRelative":"operation/grow-storage.md"}');export{e as data}; diff --git a/assets/grow-storage.html-ae16c782.js b/assets/grow-storage.html-ae16c782.js new file mode 100644 index 00000000000..cce308d9184 --- /dev/null +++ b/assets/grow-storage.html-ae16c782.js @@ -0,0 +1,28 @@ +import{_ as u,r as e,o as d,c as i,a as s,b as n,d as a,w as o,e as k}from"./app-3d1677bf.js";const m="/PolarDB-for-PostgreSQL/assets/essd-storage-grow-11277a20.png",b="/PolarDB-for-PostgreSQL/assets/essd-storage-online-grow-bce55f20.png",g="/PolarDB-for-PostgreSQL/assets/essd-storage-grow-complete-f9a772d3.png",_={},h={id:"共享存储在线扩容",tabindex:"-1"},S=s("a",{class:"header-anchor",href:"#共享存储在线扩容","aria-hidden":"true"},"#",-1),f={href:"https://developer.aliyun.com/live/250669"},v=s("p",null,"在使用数据库时,随着数据量的逐渐增大,不可避免需要对数据库所使用的存储空间进行扩容。由于 PolarDB for PostgreSQL 基于共享存储与分布式文件系统 PFS 的架构设计,与安装部署时类似,在扩容时,需要在以下三个层面分别进行操作:",-1),P={class:"table-of-contents"},E=s("p",null,"本文将指导您分别在以上三个层面上分别完成扩容操作,以实现不停止数据库实例的动态扩容。",-1),x=s("h2",{id:"块存储层扩容",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#块存储层扩容","aria-hidden":"true"},"#"),n(" 块存储层扩容")],-1),B=s("code",null,"lsblk",-1),D=k(`

In addition, to ensure that the subsequent expansion steps succeed, expand the storage in units of 10 GB.

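In this example, before the expansion, a 20 GB ESSD cloud disk is multi-attached to two ECS instances. Running lsblk on both ECS instances shows that the block device nvme1n1, which corresponds to the shared ESSD cloud disk, currently has 20 GB of physical space.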

$ lsblk
+NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
+nvme0n1     259:0    0  40G  0 disk
+└─nvme0n1p1 259:1    0  40G  0 part /etc/hosts
+nvme1n1     259:2    0  20G  0 disk
+

Next, expand this ESSD cloud disk. On the Alibaba Cloud ESSD disk management page, click the disk expansion button (云盘扩容):

essd-storage-grow

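On the disk expansion page, you can see that the cloud disk is multi-attached to two ECS instances. Enter the capacity after expansion and confirm, expanding the 20 GB disk to 40 GB: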

essd-storage-online-grow

After the expansion succeeds, the following message is displayed:

essd-storage-grow-complete

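At this point, running lsblk on both ECS instances shows that the physical space of the block device nvme1n1, which corresponds to the ESSD, has grown to 40 GB: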

$ lsblk
+NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
+nvme0n1     259:0    0  40G  0 disk
+└─nvme0n1p1 259:1    0  40G  0 part /etc/hosts
+nvme1n1     259:2    0  40G  0 disk
+

This completes the expansion at the block storage layer.
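The check described above (confirming with lsblk that the device has grown) can be scripted. The following is a minimal sketch, assuming the default lsblk column layout shown in the sample output (NAME, MAJ:MIN, RM, SIZE, ...); the helper name `parse_lsblk_sizes` is hypothetical and not part of any PolarDB or PFS tooling.

```python
from __future__ import annotations


def parse_lsblk_sizes(output: str) -> dict[str, str]:
    """Map each device name to its SIZE column from default `lsblk` output.

    Assumes the default column order: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT.
    """
    sizes: dict[str, str] = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row.
        if len(fields) < 4 or fields[0] == "NAME":
            continue
        # Partitions are drawn with tree characters (e.g. "└─"); strip them.
        name = fields[0].lstrip("├─└│")
        sizes[name] = fields[3]
    return sizes


SAMPLE = """NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0  40G  0 disk
└─nvme0n1p1 259:1    0  40G  0 part /etc/hosts
nvme1n1     259:2    0  40G  0 disk
"""

sizes = parse_lsblk_sizes(SAMPLE)
print(sizes["nvme1n1"])
```

In practice the text would come from running lsblk on each host that mounts the shared disk; every host should report the new size before moving on to the file system layer.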

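File System Layer Expansion

Once the physical block device has been expanded, use the tools provided by the PFS file system to format the newly added physical space on the block device. This completes the expansion at the file system layer.

On any one host that can access the shared storage, run the PFS growfs command, where:

  • -o is the size of the shared storage before expansion (in units of 10 GB)
  • -n is the size of the shared storage after expansion (in units of 10 GB)

This example expands the shared storage from 20 GB to 40 GB, so the two parameters are set to 2 and 4: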

$ sudo pfs -C disk growfs -o 2 -n 4 nvme1n1
+
+...
+
+Init chunk 2
+                metaset        2/1: sectbda      0x500001000, npage       80, objsize  128, nobj 2560, oid range [    2000,     2a00)
+                metaset        2/2: sectbda      0x500051000, npage       64, objsize  128, nobj 2048, oid range [    1000,     1800)
+                metaset        2/3: sectbda      0x500091000, npage       64, objsize  128, nobj 2048, oid range [    1000,     1800)
+
+Init chunk 3
+                metaset        3/1: sectbda      0x780001000, npage       80, objsize  128, nobj 2560, oid range [    3000,     3a00)
+                metaset        3/2: sectbda      0x780051000, npage       64, objsize  128, nobj 2048, oid range [    1800,     2000)
+                metaset        3/3: sectbda      0x780091000, npage       64, objsize  128, nobj 2048, oid range [    1800,     2000)
+
+pfs growfs succeeds!
+

If you see the output above, the expansion at the file system layer is complete.
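Because -o and -n are expressed in units of 10 GB, it is easy to pass raw gigabyte values by mistake. The sketch below derives both arguments from sizes in GB and builds the same command line used in this example. The function names are hypothetical helpers, and the strict validation (multiples of 10 GB, new size larger than old) is an assumption based on the guidance to expand in 10 GB units.

```python
from __future__ import annotations

UNIT_GB = 10  # growfs sizes are given in units of 10 GB


def growfs_args(old_gb: int, new_gb: int) -> tuple[int, int]:
    """Convert sizes in GB to the (-o, -n) values expected by `pfs growfs`."""
    if old_gb <= 0 or new_gb <= old_gb:
        raise ValueError("new size must be larger than a positive old size")
    if old_gb % UNIT_GB or new_gb % UNIT_GB:
        raise ValueError("sizes must be multiples of 10 GB")
    return old_gb // UNIT_GB, new_gb // UNIT_GB


def growfs_command(old_gb: int, new_gb: int, device: str) -> str:
    """Build the growfs command line for a disk-backed PFS device."""
    old_units, new_units = growfs_args(old_gb, new_gb)
    return f"sudo pfs -C disk growfs -o {old_units} -n {new_units} {device}"


print(growfs_command(20, 40, "nvme1n1"))
```

The command only needs to run once, on any one host that can reach the shared storage.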

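Database Instance Layer Expansion

Finally, at the database instance layer, the remaining work is to execute a SQL function that notifies the PFSD (PFS Daemon) process on each instance that has mounted the shared storage that the new space is now available. Note that every PFSD in the database cluster must be notified, and the PFSDs on all RO nodes must be notified first, with the PFSD on the RW node notified last. This means the SQL function that notifies PFSD must be executed once on every PolarDB for PostgreSQL node, RO nodes first and the RW node last.

The expansion function that notifies PFSD is implemented in the polar_vfs extension of PolarDB for PostgreSQL, so first load the polar_vfs extension on the RW node. Loading the extension registers the SQL function polar_vfs_disk_expansion on the RW node and on all RO nodes.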

CREATE EXTENSION IF NOT EXISTS polar_vfs;
+

Next, execute this SQL function on every RO node first and then on the RW node, one node at a time. The function's argument is the block device name:

SELECT polar_vfs_disk_expansion('nvme1n1');
+

Once this is done, the expansion at the database instance layer is complete, and the new storage space is available to the database.
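The ordering rule (all RO nodes first, the RW node last) is the part most easily violated when the cluster has several nodes. The following is a minimal sketch that turns a node list into an ordered execution plan. The node names and the `expansion_plan` helper are hypothetical, and actually running each statement (for example through psql) is left out.

```python
from __future__ import annotations


def expansion_plan(nodes: list[tuple[str, str]], device: str) -> list[tuple[str, str]]:
    """Return (node, sql) pairs ordered with all RO nodes first and the RW node last.

    `nodes` is a list of (node_name, role) pairs where role is "RO" or "RW".
    """
    if not device.isalnum():
        # The device name is interpolated into SQL, so keep it strictly alphanumeric.
        raise ValueError("unexpected block device name")
    ro_nodes = [name for name, role in nodes if role == "RO"]
    rw_nodes = [name for name, role in nodes if role == "RW"]
    if len(rw_nodes) != 1:
        raise ValueError("expected exactly one RW node")
    sql = f"SELECT polar_vfs_disk_expansion('{device}');"
    return [(name, sql) for name in ro_nodes + rw_nodes]


# Hypothetical cluster: one RW node and two RO nodes, listed in arbitrary order.
plan = expansion_plan([("rw1", "RW"), ("ro1", "RO"), ("ro2", "RO")], "nvme1n1")
for node, statement in plan:
    print(node, statement)
```

Keeping the plan as plain data makes it easy to review the order before any statement is sent to a node.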

`,26);function w(p,G){const c=e("Badge"),r=e("ArticleInfo"),t=e("router-link"),l=e("RouterLink");return d(),i("div",null,[s("h1",h,[S,n(" 共享存储在线扩容 "),s("a",f,[a(c,{type:"tip",text:"视频",vertical:"top"})])]),a(r,{frontmatter:p.$frontmatter},null,8,["frontmatter"]),v,s("nav",P,[s("ul",null,[s("li",null,[a(t,{to:"#块存储层扩容"},{default:o(()=>[n("块存储层扩容")]),_:1})]),s("li",null,[a(t,{to:"#文件系统层扩容"},{default:o(()=>[n("文件系统层扩容")]),_:1})]),s("li",null,[a(t,{to:"#数据库实例层扩容"},{default:o(()=>[n("数据库实例层扩容")]),_:1})])])]),E,x,s("p",null,[n("首先需要进行的是块存储层面上的扩容。不管使用哪种类型的共享存储,存储层面扩容最终需要达成的目的是:在能够访问共享存储的主机上运行 "),B,n(" 命令,显示存储块设备的物理空间变大。由于不同类型的共享存储有不同的扩容方式,本文以 "),a(l,{to:"/deploying/storage-aliyun-essd.html"},{default:o(()=>[n("阿里云 ECS + ESSD 云盘共享存储")]),_:1}),n(" 为例演示如何进行存储层面的扩容。")]),D])}const N=u(_,[["render",w],["__file","grow-storage.html.vue"]]);export{N as default}; diff --git a/assets/grow-storage.html-f1072fd0.js b/assets/grow-storage.html-f1072fd0.js new file mode 100644 index 00000000000..e41cfbbb606 --- /dev/null +++ b/assets/grow-storage.html-f1072fd0.js @@ -0,0 +1,28 @@ +import{_ as u,r as e,o as d,c as i,a as s,b as n,d as a,w as o,e as k}from"./app-3d1677bf.js";const m="/PolarDB-for-PostgreSQL/assets/essd-storage-grow-11277a20.png",b="/PolarDB-for-PostgreSQL/assets/essd-storage-online-grow-bce55f20.png",g="/PolarDB-for-PostgreSQL/assets/essd-storage-grow-complete-f9a772d3.png",_={},h={id:"共享存储在线扩容",tabindex:"-1"},S=s("a",{class:"header-anchor",href:"#共享存储在线扩容","aria-hidden":"true"},"#",-1),f={href:"https://developer.aliyun.com/live/250669"},v=s("p",null,"在使用数据库时,随着数据量的逐渐增大,不可避免需要对数据库所使用的存储空间进行扩容。由于 PolarDB for PostgreSQL 基于共享存储与分布式文件系统 PFS 的架构设计,与安装部署时类似,在扩容时,需要在以下三个层面分别进行操作:",-1),P={class:"table-of-contents"},E=s("p",null,"本文将指导您分别在以上三个层面上分别完成扩容操作,以实现不停止数据库实例的动态扩容。",-1),x=s("h2",{id:"块存储层扩容",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#块存储层扩容","aria-hidden":"true"},"#"),n(" 块存储层扩容")],-1),B=s("code",null,"lsblk",-1),D=k(`

In addition, to ensure that the subsequent expansion steps succeed, expand the storage in units of 10 GB.

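In this example, before the expansion, a 20 GB ESSD cloud disk is multi-attached to two ECS instances. Running lsblk on both ECS instances shows that the block device nvme1n1, which corresponds to the shared ESSD cloud disk, currently has 20 GB of physical space.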

$ lsblk
+NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
+nvme0n1     259:0    0  40G  0 disk
+└─nvme0n1p1 259:1    0  40G  0 part /etc/hosts
+nvme1n1     259:2    0  20G  0 disk
+

Next, expand this ESSD cloud disk. On the Alibaba Cloud ESSD disk management page, click the disk expansion button (云盘扩容):

essd-storage-grow

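On the disk expansion page, you can see that the cloud disk is multi-attached to two ECS instances. Enter the capacity after expansion and confirm, expanding the 20 GB disk to 40 GB: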

essd-storage-online-grow

After the expansion succeeds, the following message is displayed:

essd-storage-grow-complete

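At this point, running lsblk on both ECS instances shows that the physical space of the block device nvme1n1, which corresponds to the ESSD, has grown to 40 GB: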

$ lsblk
+NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
+nvme0n1     259:0    0  40G  0 disk
+└─nvme0n1p1 259:1    0  40G  0 part /etc/hosts
+nvme1n1     259:2    0  40G  0 disk
+

This completes the expansion at the block storage layer.

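File System Layer Expansion

Once the physical block device has been expanded, use the tools provided by the PFS file system to format the newly added physical space on the block device. This completes the expansion at the file system layer.

On any one host that can access the shared storage, run the PFS growfs command, where:

  • -o is the size of the shared storage before expansion (in units of 10 GB)
  • -n is the size of the shared storage after expansion (in units of 10 GB)

This example expands the shared storage from 20 GB to 40 GB, so the two parameters are set to 2 and 4: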

$ sudo pfs -C disk growfs -o 2 -n 4 nvme1n1
+
+...
+
+Init chunk 2
+                metaset        2/1: sectbda      0x500001000, npage       80, objsize  128, nobj 2560, oid range [    2000,     2a00)
+                metaset        2/2: sectbda      0x500051000, npage       64, objsize  128, nobj 2048, oid range [    1000,     1800)
+                metaset        2/3: sectbda      0x500091000, npage       64, objsize  128, nobj 2048, oid range [    1000,     1800)
+
+Init chunk 3
+                metaset        3/1: sectbda      0x780001000, npage       80, objsize  128, nobj 2560, oid range [    3000,     3a00)
+                metaset        3/2: sectbda      0x780051000, npage       64, objsize  128, nobj 2048, oid range [    1800,     2000)
+                metaset        3/3: sectbda      0x780091000, npage       64, objsize  128, nobj 2048, oid range [    1800,     2000)
+
+pfs growfs succeeds!
+

If you see the output above, the expansion at the file system layer is complete.

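Database Instance Layer Expansion

Finally, at the database instance layer, the remaining work is to execute a SQL function that notifies the PFSD (PFS Daemon) process on each instance that has mounted the shared storage that the new space is now available. Note that every PFSD in the database cluster must be notified, and the PFSDs on all RO nodes must be notified first, with the PFSD on the RW node notified last. This means the SQL function that notifies PFSD must be executed once on every PolarDB for PostgreSQL node, RO nodes first and the RW node last.

The expansion function that notifies PFSD is implemented in the polar_vfs extension of PolarDB for PostgreSQL, so first load the polar_vfs extension on the RW node. Loading the extension registers the SQL function polar_vfs_disk_expansion on the RW node and on all RO nodes.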

CREATE EXTENSION IF NOT EXISTS polar_vfs;
+

Next, execute this SQL function on every RO node first and then on the RW node, one node at a time. The function's argument is the block device name:

SELECT polar_vfs_disk_expansion('nvme1n1');
+

Once this is done, the expansion at the database instance layer is complete, and the new storage space is available to the database.

`,26);function w(p,G){const c=e("Badge"),r=e("ArticleInfo"),t=e("router-link"),l=e("RouterLink");return d(),i("div",null,[s("h1",h,[S,n(" 共享存储在线扩容 "),s("a",f,[a(c,{type:"tip",text:"视频",vertical:"top"})])]),a(r,{frontmatter:p.$frontmatter},null,8,["frontmatter"]),v,s("nav",P,[s("ul",null,[s("li",null,[a(t,{to:"#块存储层扩容"},{default:o(()=>[n("块存储层扩容")]),_:1})]),s("li",null,[a(t,{to:"#文件系统层扩容"},{default:o(()=>[n("文件系统层扩容")]),_:1})]),s("li",null,[a(t,{to:"#数据库实例层扩容"},{default:o(()=>[n("数据库实例层扩容")]),_:1})])])]),E,x,s("p",null,[n("首先需要进行的是块存储层面上的扩容。不管使用哪种类型的共享存储,存储层面扩容最终需要达成的目的是:在能够访问共享存储的主机上运行 "),B,n(" 命令,显示存储块设备的物理空间变大。由于不同类型的共享存储有不同的扩容方式,本文以 "),a(l,{to:"/zh/deploying/storage-aliyun-essd.html"},{default:o(()=>[n("阿里云 ECS + ESSD 云盘共享存储")]),_:1}),n(" 为例演示如何进行存储层面的扩容。")]),D])}const N=u(_,[["render",w],["__file","grow-storage.html.vue"]]);export{N as default}; diff --git a/assets/htap-1-background-c1448c2b.png b/assets/htap-1-background-c1448c2b.png new file mode 100644 index 00000000000..34e7b04bb1c Binary files /dev/null and b/assets/htap-1-background-c1448c2b.png differ diff --git a/assets/htap-2-arch-75a7a690.png b/assets/htap-2-arch-75a7a690.png new file mode 100644 index 00000000000..423e2413530 Binary files /dev/null and b/assets/htap-2-arch-75a7a690.png differ diff --git a/assets/htap-3-mpp-125b1127.png b/assets/htap-3-mpp-125b1127.png new file mode 100644 index 00000000000..5f3fafb28c8 Binary files /dev/null and b/assets/htap-3-mpp-125b1127.png differ diff --git a/assets/htap-4-1-consistency-b92b1c5f.png b/assets/htap-4-1-consistency-b92b1c5f.png new file mode 100644 index 00000000000..4913c48ffbc Binary files /dev/null and b/assets/htap-4-1-consistency-b92b1c5f.png differ diff --git a/assets/htap-4-2-serverless-a6102d5e.png b/assets/htap-4-2-serverless-a6102d5e.png new file mode 100644 index 00000000000..400c46d654c Binary files /dev/null and b/assets/htap-4-2-serverless-a6102d5e.png differ diff --git a/assets/htap-4-3-serverlessmap-8c3c8571.png 
b/assets/htap-4-3-serverlessmap-8c3c8571.png new file mode 100644 index 00000000000..1b4c07c4d1f Binary files /dev/null and b/assets/htap-4-3-serverlessmap-8c3c8571.png differ diff --git a/assets/htap-5-skew-c7747f23.png b/assets/htap-5-skew-c7747f23.png new file mode 100644 index 00000000000..3ab903d0fe4 Binary files /dev/null and b/assets/htap-5-skew-c7747f23.png differ diff --git a/assets/htap-6-btbuild-adea540c.png b/assets/htap-6-btbuild-adea540c.png new file mode 100644 index 00000000000..21148fae067 Binary files /dev/null and b/assets/htap-6-btbuild-adea540c.png differ diff --git a/assets/htap-7-1-acc-f65e825a.png b/assets/htap-7-1-acc-f65e825a.png new file mode 100644 index 00000000000..3ef22381d10 Binary files /dev/null and b/assets/htap-7-1-acc-f65e825a.png differ diff --git a/assets/htap-7-2-cpu-48d29353.png b/assets/htap-7-2-cpu-48d29353.png new file mode 100644 index 00000000000..dbedfc1980b Binary files /dev/null and b/assets/htap-7-2-cpu-48d29353.png differ diff --git a/assets/htap-7-3-dop-4dd408f5.png b/assets/htap-7-3-dop-4dd408f5.png new file mode 100644 index 00000000000..52e1e37e5c2 Binary files /dev/null and b/assets/htap-7-3-dop-4dd408f5.png differ diff --git a/assets/htap-8-1-tpch-mpp-1d438468.png b/assets/htap-8-1-tpch-mpp-1d438468.png new file mode 100644 index 00000000000..334a6b89a86 Binary files /dev/null and b/assets/htap-8-1-tpch-mpp-1d438468.png differ diff --git a/assets/htap-8-2-tpch-mpp-each-2433a941.png b/assets/htap-8-2-tpch-mpp-each-2433a941.png new file mode 100644 index 00000000000..f4695f57f75 Binary files /dev/null and b/assets/htap-8-2-tpch-mpp-each-2433a941.png differ diff --git a/assets/htap-adaptive-scan-21b95764.png b/assets/htap-adaptive-scan-21b95764.png new file mode 100644 index 00000000000..a11d09a9e46 Binary files /dev/null and b/assets/htap-adaptive-scan-21b95764.png differ diff --git a/assets/htap-multi-level-partition-1-c17a6008.png b/assets/htap-multi-level-partition-1-c17a6008.png new file mode 100644 index 
00000000000..014d96e63d8 Binary files /dev/null and b/assets/htap-multi-level-partition-1-c17a6008.png differ diff --git a/assets/htap-non-adaptive-scan-5fb1b1e0.png b/assets/htap-non-adaptive-scan-5fb1b1e0.png new file mode 100644 index 00000000000..a780e53586b Binary files /dev/null and b/assets/htap-non-adaptive-scan-5fb1b1e0.png differ diff --git a/assets/index-82585c84.js b/assets/index-82585c84.js new file mode 100644 index 00000000000..1f525122895 --- /dev/null +++ b/assets/index-82585c84.js @@ -0,0 +1,17 @@ +/*! @docsearch/js 3.5.2 | MIT License | © Algolia, Inc. and contributors | https://docsearch.algolia.com */function un(t,e){var n=Object.keys(t);if(Object.getOwnPropertySymbols){var r=Object.getOwnPropertySymbols(t);e&&(r=r.filter(function(o){return Object.getOwnPropertyDescriptor(t,o).enumerable})),n.push.apply(n,r)}return n}function I(t){for(var e=1;e=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}function se(t,e){return function(n){if(Array.isArray(n))return n}(t)||function(n,r){var o=n==null?null:typeof Symbol<"u"&&n[Symbol.iterator]||n["@@iterator"];if(o!=null){var i,a,u=[],c=!0,s=!1;try{for(o=o.call(n);!(c=(i=o.next()).done)&&(u.push(i.value),!r||u.length!==r);c=!0);}catch(l){s=!0,a=l}finally{try{c||o.return==null||o.return()}finally{if(s)throw a}}return u}}(t,e)||yr(t,e)||function(){throw new TypeError(`Invalid attempt to destructure non-iterable instance. +In order to be iterable, non-array objects must have a [Symbol.iterator]() method.`)}()}function ft(t){return function(e){if(Array.isArray(e))return Lt(e)}(t)||function(e){if(typeof Symbol<"u"&&e[Symbol.iterator]!=null||e["@@iterator"]!=null)return Array.from(e)}(t)||yr(t)||function(){throw new TypeError(`Invalid attempt to spread non-iterable instance. 
+In order to be iterable, non-array objects must have a [Symbol.iterator]() method.`)}()}function yr(t,e){if(t){if(typeof t=="string")return Lt(t,e);var n=Object.prototype.toString.call(t).slice(8,-1);return n==="Object"&&t.constructor&&(n=t.constructor.name),n==="Map"||n==="Set"?Array.from(t):n==="Arguments"||/^(?:Ui|I)nt(?:8|16|32)(?:Clamped)?Array$/.test(n)?Lt(t,e):void 0}}function Lt(t,e){(e==null||e>t.length)&&(e=t.length);for(var n=0,r=new Array(e);n3)for(n=[n],i=3;i0?Ie(v.type,v.props,v.key,null,v.__v):v)!=null){if(v.__=n,v.__b=n.__b+1,(p=b[l])===null||p&&v.key==p.key&&v.type===p.type)b[l]=void 0;else for(m=0;m<_;m++){if((p=b[m])&&v.key==p.key&&v.type===p.type){b[m]=void 0;break}p=null}Yt(t,v,p=p||mt,o,i,a,u,c,s),d=v.__e,(m=v.ref)&&p.ref!=m&&(y||(y=[]),p.ref&&y.push(p.ref,null,v),y.push(m,v.__c||d,v)),d!=null?(h==null&&(h=d),typeof v.type=="function"&&v.__k!=null&&v.__k===p.__k?v.__d=c=jr(v,c,t):c=wr(t,v,p,b,d,c),s||n.type!=="option"?typeof n.type=="function"&&(n.__d=c):t.value=""):c&&p.__e==c&&c.parentNode!=t&&(c=We(p))}for(n.__e=h,l=_;l--;)b[l]!=null&&(typeof n.type=="function"&&b[l].__e!=null&&b[l].__e==n.__d&&(n.__d=We(r,l+1)),Ir(b[l],b[l]));if(y)for(l=0;l3)for(n=[n],i=3;i=n.__.length&&n.__.push({}),n.__[t]}function kr(t){return pe=1,Ar(xr,t)}function Ar(t,e,n){var r=Je(de++,2);return r.t=t,r.__c||(r.__=[n?n(e):xr(void 0,e),function(o){var i=r.t(r.__[0],o);r.__[0]!==i&&(r.__=[i,r.__[1]],r.__c.setState({}))}],r.__c=q),r.__}function Cr(t,e){var n=Je(de++,3);!j.__s&&Gt(n.__H,e)&&(n.__=t,n.__H=e,q.__H.__h.push(n))}function bn(t,e){var n=Je(de++,4);!j.__s&&Gt(n.__H,e)&&(n.__=t,n.__H=e,q.__h.push(n))}function Pt(t,e){var n=Je(de++,7);return Gt(n.__H,e)&&(n.__=t(),n.__H=e,n.__h=t),n.__}function yo(){Ht.forEach(function(t){if(t.__P)try{t.__H.__h.forEach(ct),t.__H.__h.forEach(Ut),t.__H.__h=[]}catch(e){t.__H.__h=[],j.__e(e,t.__v)}}),Ht=[]}j.__b=function(t){q=null,vn&&vn(t)},j.__r=function(t){dn&&dn(t),de=0;var 
e=(q=t.__c).__H;e&&(e.__h.forEach(ct),e.__h.forEach(Ut),e.__h=[])},j.diffed=function(t){hn&&hn(t);var e=t.__c;e&&e.__H&&e.__H.__h.length&&(Ht.push(e)!==1&&pn===j.requestAnimationFrame||((pn=j.requestAnimationFrame)||function(n){var r,o=function(){clearTimeout(i),_n&&cancelAnimationFrame(r),setTimeout(n)},i=setTimeout(o,100);_n&&(r=requestAnimationFrame(o))})(yo)),q=void 0},j.__c=function(t,e){e.some(function(n){try{n.__h.forEach(ct),n.__h=n.__h.filter(function(r){return!r.__||Ut(r)})}catch(r){e.some(function(o){o.__h&&(o.__h=[])}),e=[],j.__e(r,n.__v)}}),yn&&yn(t,e)},j.unmount=function(t){gn&&gn(t);var e=t.__c;if(e&&e.__H)try{e.__H.__.forEach(ct)}catch(n){j.__e(n,e.__v)}};var _n=typeof requestAnimationFrame=="function";function ct(t){var e=q;typeof t.__c=="function"&&t.__c(),q=e}function Ut(t){var e=q;t.__c=t.__(),q=e}function Gt(t,e){return!t||t.length!==e.length||e.some(function(n,r){return n!==t[r]})}function xr(t,e){return typeof e=="function"?e(t):e}function Nr(t,e){for(var n in e)t[n]=e[n];return t}function Ft(t,e){for(var n in t)if(n!=="__source"&&!(n in e))return!0;for(var r in e)if(r!=="__source"&&t[r]!==e[r])return!0;return!1}function Bt(t){this.props=t}(Bt.prototype=new K).isPureReactComponent=!0,Bt.prototype.shouldComponentUpdate=function(t,e){return Ft(this.props,t)||Ft(this.state,e)};var On=j.__b;j.__b=function(t){t.type&&t.type.__f&&t.ref&&(t.props.ref=t.ref,t.ref=null),On&&On(t)};var go=typeof Symbol<"u"&&Symbol.for&&Symbol.for("react.forward_ref")||3911,Sn=function(t,e){return t==null?null:$($(t).map(e))},bo={map:Sn,forEach:Sn,count:function(t){return t?$(t).length:0},only:function(t){var e=$(t);if(e.length!==1)throw"Children.only";return e[0]},toArray:$},_o=j.__e;function ut(){this.__u=0,this.t=null,this.__b=null}function Tr(t){var e=t.__.__c;return e&&e.__e&&e.__e(t)}function we(){this.u=null,this.o=null}j.__e=function(t,e,n){if(t.then){for(var r,o=e;o=o.__;)if((r=o.__c)&&r.__c)return 
e.__e==null&&(e.__e=n.__e,e.__k=n.__k),r.__c(t,e)}_o(t,e,n)},(ut.prototype=new K).__c=function(t,e){var n=e.__c,r=this;r.t==null&&(r.t=[]),r.t.push(n);var o=Tr(r.__v),i=!1,a=function(){i||(i=!0,n.componentWillUnmount=n.__c,o?o(u):u())};n.__c=n.componentWillUnmount,n.componentWillUnmount=function(){a(),n.__c&&n.__c()};var u=function(){if(!--r.__u){if(r.state.__e){var s=r.state.__e;r.__v.__k[0]=function m(p,v,d){return p&&(p.__v=null,p.__k=p.__k&&p.__k.map(function(h){return m(h,v,d)}),p.__c&&p.__c.__P===v&&(p.__e&&d.insertBefore(p.__e,p.__d),p.__c.__e=!0,p.__c.__P=d)),p}(s,s.__c.__P,s.__c.__O)}var l;for(r.setState({__e:r.__b=null});l=r.t.pop();)l.forceUpdate()}},c=e.__h===!0;r.__u++||c||r.setState({__e:r.__b=r.__v.__k[0]}),t.then(a,a)},ut.prototype.componentWillUnmount=function(){this.t=[]},ut.prototype.render=function(t,e){if(this.__b){if(this.__v.__k){var n=document.createElement("div"),r=this.__v.__k[0].__c;this.__v.__k[0]=function i(a,u,c){return a&&(a.__c&&a.__c.__H&&(a.__c.__H.__.forEach(function(s){typeof s.__c=="function"&&s.__c()}),a.__c.__H=null),(a=Nr({},a)).__c!=null&&(a.__c.__P===c&&(a.__c.__P=u),a.__c=null),a.__k=a.__k&&a.__k.map(function(s){return i(s,u,c)})),a}(this.__b,n,r.__O=r.__P)}this.__b=null}var o=e.__e&&W(X,null,t.fallback);return o&&(o.__h=null),[W(X,null,e.__e?null:t.children),o]};var jn=function(t,e,n){if(++n[1]===n[0]&&t.o.delete(e),t.props.revealOrder&&(t.props.revealOrder[0]!=="t"||!t.o.size))for(n=t.u;n;){for(;n.length>3;)n.pop()();if(n[1]>>1,1),e.i.removeChild(r)}}),Ke(W(Oo,{context:e.context},t.__v),e.l)):e.l&&e.componentWillUnmount()}function Rr(t,e){return W(So,{__v:t,i:e})}(we.prototype=new K).__e=function(t){var e=this,n=Tr(e.__v),r=e.o.get(t);return r[0]++,function(o){var i=function(){e.props.revealOrder?(r.push(o),jn(e,t,r)):o()};n?n(i):i()}},we.prototype.render=function(t){this.u=null,this.o=new Map;var e=$(t.children);t.revealOrder&&t.revealOrder[0]==="b"&&e.reverse();for(var 
n=e.length;n--;)this.o.set(e[n],this.u=[1,0,this.u]);return t.children},we.prototype.componentDidUpdate=we.prototype.componentDidMount=function(){var t=this;this.o.forEach(function(e,n){jn(t,n,e)})};var qr=typeof Symbol<"u"&&Symbol.for&&Symbol.for("react.element")||60103,jo=/^(?:accent|alignment|arabic|baseline|cap|clip(?!PathU)|color|fill|flood|font|glyph(?!R)|horiz|marker(?!H|W|U)|overline|paint|stop|strikethrough|stroke|text(?!L)|underline|unicode|units|v|vector|vert|word|writing|x(?!C))[A-Z]/,wo=function(t){return(typeof Symbol<"u"&&Ve(Symbol())=="symbol"?/fil|che|rad/i:/fil|che|ra/i).test(t)};function Lr(t,e,n){return e.__k==null&&(e.textContent=""),Ke(t,e),typeof n=="function"&&n(),t?t.__c:null}K.prototype.isReactComponent={},["componentWillMount","componentWillReceiveProps","componentWillUpdate"].forEach(function(t){Object.defineProperty(K.prototype,t,{configurable:!0,get:function(){return this["UNSAFE_"+t]},set:function(e){Object.defineProperty(this,t,{configurable:!0,writable:!0,value:e})}})});var wn=j.event;function Eo(){}function Po(){return this.cancelBubble}function Io(){return this.defaultPrevented}j.event=function(t){return wn&&(t=wn(t)),t.persist=Eo,t.isPropagationStopped=Po,t.isDefaultPrevented=Io,t.nativeEvent=t};var Mr,En={configurable:!0,get:function(){return this.class}},Pn=j.vnode;j.vnode=function(t){var e=t.type,n=t.props,r=n;if(typeof e=="string"){for(var o in r={},n){var i=n[o];o==="value"&&"defaultValue"in n&&i==null||(o==="defaultValue"&&"value"in n&&n.value==null?o="value":o==="download"&&i===!0?i="":/ondoubleclick/i.test(o)?o="ondblclick":/^onchange(textarea|input)/i.test(o+e)&&!wo(n.type)?o="oninput":/^on(Ani|Tra|Tou|BeforeInp)/.test(o)?o=o.toLowerCase():jo.test(o)?o=o.replace(/[A-Z0-9]/,"-$&").toLowerCase():i===null&&(i=void 
0),r[o]=i)}e=="select"&&r.multiple&&Array.isArray(r.value)&&(r.value=$(n.children).forEach(function(a){a.props.selected=r.value.indexOf(a.props.value)!=-1})),e=="select"&&r.defaultValue!=null&&(r.value=$(n.children).forEach(function(a){a.props.selected=r.multiple?r.defaultValue.indexOf(a.props.value)!=-1:r.defaultValue==a.props.value})),t.props=r}e&&n.class!=n.className&&(En.enumerable="className"in n,n.className!=null&&(r.class=n.className),Object.defineProperty(r,"className",En)),t.$$typeof=qr,Pn&&Pn(t)};var In=j.__r;j.__r=function(t){In&&In(t),Mr=t.__c};var Do={ReactCurrentDispatcher:{current:{readContext:function(t){return Mr.__n[t.__c].props.value}}}};(typeof performance>"u"?"undefined":Ve(performance))=="object"&&typeof performance.now=="function"&&performance.now.bind(performance);function Dn(t){return!!t&&t.$$typeof===qr}var f={useState:kr,useReducer:Ar,useEffect:Cr,useLayoutEffect:bn,useRef:function(t){return pe=5,Pt(function(){return{current:t}},[])},useImperativeHandle:function(t,e,n){pe=6,bn(function(){typeof t=="function"?t(e()):t&&(t.current=e())},n==null?n:n.concat(t))},useMemo:Pt,useCallback:function(t,e){return pe=8,Pt(function(){return t},e)},useContext:function(t){var e=q.context[t.__c],n=Je(de++,9);return n.__c=t,e?(n.__==null&&(n.__=!0,e.sub(q)),e.props.value):t.__},useDebugValue:function(t,e){j.useDebugValue&&j.useDebugValue(e?e(t):t)},version:"16.8.0",Children:bo,render:Lr,hydrate:function(t,e,n){return Dr(t,e),typeof n=="function"&&n(),t?t.__c:null},unmountComponentAtNode:function(t){return!!t.__k&&(Ke(null,t),!0)},createPortal:Rr,createElement:W,createContext:function(t,e){var n={__c:e="__cC"+br++,__:t,Consumer:function(r,o){return r.children(o)},Provider:function(r){var o,i;return this.getChildContext||(o=[],(i={})[e]=this,this.getChildContext=function(){return i},this.shouldComponentUpdate=function(a){this.props.value!==a.value&&o.some(Mt)},this.sub=function(a){o.push(a);var 
u=a.componentWillUnmount;a.componentWillUnmount=function(){o.splice(o.indexOf(a),1),u&&u.call(a)}}),r.children}};return n.Provider.__=n.Consumer.contextType=n},createFactory:function(t){return W.bind(null,t)},cloneElement:function(t){return Dn(t)?ho.apply(null,arguments):t},createRef:function(){return{current:null}},Fragment:X,isValidElement:Dn,findDOMNode:function(t){return t&&(t.base||t.nodeType===1&&t)||null},Component:K,PureComponent:Bt,memo:function(t,e){function n(o){var i=this.props.ref,a=i==o.ref;return!a&&i&&(i.call?i(null):i.current=null),e?!e(this.props,o)||!a:Ft(this.props,o)}function r(o){return this.shouldComponentUpdate=n,W(t,o)}return r.displayName="Memo("+(t.displayName||t.name)+")",r.prototype.isReactComponent=!0,r.__f=!0,r},forwardRef:function(t){function e(n,r){var o=Nr({},n);return delete o.ref,t(o,(r=n.ref||r)&&(Ve(r)!="object"||"current"in r)?r:null)}return e.$$typeof=go,e.render=e,e.prototype.isReactComponent=e.__f=!0,e.displayName="ForwardRef("+(t.displayName||t.name)+")",e},unstable_batchedUpdates:function(t,e){return t(e)},StrictMode:X,Suspense:ut,SuspenseList:we,lazy:function(t){var e,n,r;function o(i){if(e||(e=t()).then(function(a){n=a.default||a},function(a){r=a}),r)throw r;if(!n)throw e;return W(n,i)}return o.displayName="Lazy",o.__f=!0,o},__SECRET_INTERNALS_DO_NOT_USE_OR_YOU_WILL_BE_FIRED:Do};function ko(){return f.createElement("svg",{width:"15",height:"15",className:"DocSearch-Control-Key-Icon"},f.createElement("path",{d:"M4.505 4.496h2M5.505 5.496v5M8.216 4.496l.055 5.993M10 7.5c.333.333.5.667.5 1v2M12.326 4.5v5.996M8.384 4.496c1.674 0 2.116 0 2.116 1.5s-.442 1.5-2.116 1.5M3.205 9.303c-.09.448-.277 1.21-1.241 1.203C1 10.5.5 9.513.5 8V7c0-1.57.5-2.5 1.464-2.494.964.006 1.134.598 1.24 1.342M12.553 10.5h1.953",strokeWidth:"1.2",stroke:"currentColor",fill:"none",strokeLinecap:"square"}))}function Hr(){return f.createElement("svg",{width:"20",height:"20",className:"DocSearch-Search-Icon",viewBox:"0 0 20 
20"},f.createElement("path",{d:"M14.386 14.386l4.0877 4.0877-4.0877-4.0877c-2.9418 2.9419-7.7115 2.9419-10.6533 0-2.9419-2.9418-2.9419-7.7115 0-10.6533 2.9418-2.9419 7.7115-2.9419 10.6533 0 2.9419 2.9418 2.9419 7.7115 0 10.6533z",stroke:"currentColor",fill:"none",fillRule:"evenodd",strokeLinecap:"round",strokeLinejoin:"round"}))}var Ao=["translations"];function Vt(){return Vt=Object.assign||function(t){for(var e=1;et.length)&&(e=t.length);for(var n=0,r=new Array(e);n=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}var No=f.forwardRef(function(t,e){var n=t.translations,r=n===void 0?{}:n,o=xo(t,Ao),i=r.buttonText,a=i===void 0?"Search":i,u=r.buttonAriaLabel,c=u===void 0?"Search":u,s=Co(kr(null),2),l=s[0],m=s[1];return Cr(function(){typeof navigator<"u"&&(/(Mac|iPhone|iPod|iPad)/i.test(navigator.platform)?m("⌘"):m("Ctrl"))},[]),f.createElement("button",Vt({type:"button",className:"DocSearch DocSearch-Button","aria-label":c},o,{ref:e}),f.createElement("span",{className:"DocSearch-Button-Container"},f.createElement(Hr,null),f.createElement("span",{className:"DocSearch-Button-Placeholder"},a)),f.createElement("span",{className:"DocSearch-Button-Keys"},l!==null&&f.createElement(f.Fragment,null,f.createElement("kbd",{className:"DocSearch-Button-Key"},l==="Ctrl"?f.createElement(ko,null):l),f.createElement("kbd",{className:"DocSearch-Button-Key"},"K"))))});function Ur(t,e){var n=void 0;return function(){for(var r=arguments.length,o=new Array(r),i=0;it.length)&&(e=t.length);for(var n=0,r=new Array(e);nt.length)&&(e=t.length);for(var n=0,r=new Array(e);n=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}function Nn(t,e){var n=Object.keys(t);if(Object.getOwnPropertySymbols){var 
r=Object.getOwnPropertySymbols(t);e&&(r=r.filter(function(o){return Object.getOwnPropertyDescriptor(t,o).enumerable})),n.push.apply(n,r)}return n}function ve(t){for(var e=1;e1&&arguments[1]!==void 0?arguments[1]:20,n=[],r=0;r=3||n===2&&r>=4||n===1&&r>=10);function i(a,u,c){if(o&&c!==void 0){var s=c[0].__autocomplete_algoliaCredentials,l={"X-Algolia-Application-Id":s.appId,"X-Algolia-API-Key":s.apiKey};t.apply(void 0,[a].concat(Ge(u),[{headers:l}]))}else t.apply(void 0,[a].concat(Ge(u)))}return{init:function(a,u){t("init",{appId:a,apiKey:u})},setUserToken:function(a){t("setUserToken",a)},clickedObjectIDsAfterSearch:function(){for(var a=arguments.length,u=new Array(a),c=0;c0&&i("clickedObjectIDsAfterSearch",Xe(u),u[0].items)},clickedObjectIDs:function(){for(var a=arguments.length,u=new Array(a),c=0;c0&&i("clickedObjectIDs",Xe(u),u[0].items)},clickedFilters:function(){for(var a=arguments.length,u=new Array(a),c=0;c0&&t.apply(void 0,["clickedFilters"].concat(u))},convertedObjectIDsAfterSearch:function(){for(var a=arguments.length,u=new Array(a),c=0;c0&&i("convertedObjectIDsAfterSearch",Xe(u),u[0].items)},convertedObjectIDs:function(){for(var a=arguments.length,u=new Array(a),c=0;c0&&i("convertedObjectIDs",Xe(u),u[0].items)},convertedFilters:function(){for(var a=arguments.length,u=new Array(a),c=0;c0&&t.apply(void 0,["convertedFilters"].concat(u))},viewedObjectIDs:function(){for(var a=arguments.length,u=new Array(a),c=0;c0&&u.reduce(function(s,l){var m=l.items,p=Br(l,Mo);return[].concat(Ge(s),Ge(Uo(ve(ve({},p),{},{objectIDs:(m==null?void 0:m.map(function(v){return v.objectID}))||p.objectIDs})).map(function(v){return{items:m,payload:v}})))},[]).forEach(function(s){var l=s.items;return i("viewedObjectIDs",[s.payload],l)})},viewedFilters:function(){for(var a=arguments.length,u=new Array(a),c=0;c0&&t.apply(void 0,["viewedFilters"].concat(u))}}}function Bo(t){var e=t.items.reduce(function(n,r){var o;return 
n[r.__autocomplete_indexName]=((o=n[r.__autocomplete_indexName])!==null&&o!==void 0?o:[]).concat(r),n},{});return Object.keys(e).map(function(n){return{index:n,items:e[n],algoliaSource:["autocomplete"]}})}function Dt(t){return t.objectID&&t.__autocomplete_indexName&&t.__autocomplete_queryID}function ke(t){return ke=typeof Symbol=="function"&&typeof Symbol.iterator=="symbol"?function(e){return typeof e}:function(e){return e&&typeof Symbol=="function"&&e.constructor===Symbol&&e!==Symbol.prototype?"symbol":typeof e},ke(t)}function ie(t){return function(e){if(Array.isArray(e))return kt(e)}(t)||function(e){if(typeof Symbol<"u"&&e[Symbol.iterator]!=null||e["@@iterator"]!=null)return Array.from(e)}(t)||function(e,n){if(e){if(typeof e=="string")return kt(e,n);var r=Object.prototype.toString.call(e).slice(8,-1);if(r==="Object"&&e.constructor&&(r=e.constructor.name),r==="Map"||r==="Set")return Array.from(e);if(r==="Arguments"||/^(?:Ui|I)nt(?:8|16|32)(?:Clamped)?Array$/.test(r))return kt(e,n)}}(t)||function(){throw new TypeError(`Invalid attempt to spread non-iterable instance. 
+In order to be iterable, non-array objects must have a [Symbol.iterator]() method.`)}()}function kt(t,e){(e==null||e>t.length)&&(e=t.length);for(var n=0,r=new Array(e);n0&&Ko({onItemsChange:r,items:p,insights:u,state:m}))}},0);return{name:"aa.algoliaInsightsPlugin",subscribe:function(l){var m=l.setContext,p=l.onSelect,v=l.onActive;a("addAlgoliaAgent","insights-plugin"),m({algoliaInsightsPlugin:{__algoliaSearchParameters:{clickAnalytics:!0},insights:u}}),p(function(d){var h=d.item,y=d.state,b=d.event;Dt(h)&&o({state:y,event:b,insights:u,item:h,insightsEvents:[G({eventName:"Item Selected"},Cn({item:h,items:c.current}))]})}),v(function(d){var h=d.item,y=d.state,b=d.event;Dt(h)&&i({state:y,event:b,insights:u,item:h,insightsEvents:[G({eventName:"Item Active"},Cn({item:h,items:c.current}))]})})},onStateChange:function(l){var m=l.state;s({state:m})},__autocomplete_pluginOptions:t}}function lt(t,e){var n=e;return{then:function(r,o){return lt(t.then(et(r,n,t),et(o,n,t)),n)},catch:function(r){return lt(t.catch(et(r,n,t)),n)},finally:function(r){return r&&n.onCancelList.push(r),lt(t.finally(et(r&&function(){return n.onCancelList=[],r()},n,t)),n)},cancel:function(){n.isCanceled=!0;var r=n.onCancelList;n.onCancelList=[],r.forEach(function(o){o()})},isCanceled:function(){return n.isCanceled===!0}}}function Rn(t){return lt(t,{isCanceled:!1,onCancelList:[]})}function et(t,e,n){return t?function(r){return e.isCanceled?r:t(r)}:n}function qn(t,e,n,r){if(!n)return null;if(t<0&&(e===null||r!==null&&e===0))return n+t;var o=(e===null?-1:e)+t;return o<=-1||o>=n?r===null?null:0:o}function Ln(t,e){var n=Object.keys(t);if(Object.getOwnPropertySymbols){var r=Object.getOwnPropertySymbols(t);e&&(r=r.filter(function(o){return Object.getOwnPropertyDescriptor(t,o).enumerable})),n.push.apply(n,r)}return n}function Mn(t){for(var e=1;et.length)&&(e=t.length);for(var n=0,r=new Array(e);n0},reshape:function(i){return i.sources}},t),{},{id:(n=t.id)!==null&&n!==void 
0?n:"autocomplete-".concat(To++),plugins:o,initialState:ae({activeItemId:null,query:"",completion:null,collections:[],isOpen:!1,status:"idle",context:{}},t.initialState),onStateChange:function(i){var a;(a=t.onStateChange)===null||a===void 0||a.call(t,i),o.forEach(function(u){var c;return(c=u.onStateChange)===null||c===void 0?void 0:c.call(u,i)})},onSubmit:function(i){var a;(a=t.onSubmit)===null||a===void 0||a.call(t,i),o.forEach(function(u){var c;return(c=u.onSubmit)===null||c===void 0?void 0:c.call(u,i)})},onReset:function(i){var a;(a=t.onReset)===null||a===void 0||a.call(t,i),o.forEach(function(u){var c;return(c=u.onReset)===null||c===void 0?void 0:c.call(u,i)})},getSources:function(i){return Promise.all([].concat(Go(o.map(function(a){return a.getSources})),[t.getSources]).filter(Boolean).map(function(a){return function(u,c){var s=[];return Promise.resolve(u(c)).then(function(l){return Promise.all(l.filter(function(m){return!!m}).map(function(m){if(m.sourceId,s.includes(m.sourceId))throw new Error("[Autocomplete] The `sourceId` ".concat(JSON.stringify(m.sourceId)," is not unique."));s.push(m.sourceId);var p={getItemInputValue:function(d){return d.state.query},getItemUrl:function(){},onSelect:function(d){(0,d.setIsOpen)(!1)},onActive:vt,onResolve:vt};Object.keys(p).forEach(function(d){p[d].__default=!0});var v=Mn(Mn({},p),m);return Promise.resolve(v)}))})}(a,i)})).then(function(a){return ze(a)}).then(function(a){return a.map(function(u){return ae(ae({},u),{},{onSelect:function(c){u.onSelect(c),e.forEach(function(s){var l;return(l=s.onSelect)===null||l===void 0?void 0:l.call(s,c)})},onActive:function(c){u.onActive(c),e.forEach(function(s){var l;return(l=s.onActive)===null||l===void 0?void 0:l.call(s,c)})},onResolve:function(c){u.onResolve(c),e.forEach(function(s){var l;return(l=s.onResolve)===null||l===void 0?void 0:l.call(s,c)})}})})})},navigator:ae({navigate:function(i){var a=i.itemUrl;r.location.assign(a)},navigateNewTab:function(i){var 
a=i.itemUrl,u=r.open(a,"_blank","noopener");u==null||u.focus()},navigateNewWindow:function(i){var a=i.itemUrl;r.open(a,"_blank","noopener")}},t.navigator)})}function Te(t){return Te=typeof Symbol=="function"&&typeof Symbol.iterator=="symbol"?function(e){return typeof e}:function(e){return e&&typeof Symbol=="function"&&e.constructor===Symbol&&e!==Symbol.prototype?"symbol":typeof e},Te(t)}function Bn(t,e){var n=Object.keys(t);if(Object.getOwnPropertySymbols){var r=Object.getOwnPropertySymbols(t);e&&(r=r.filter(function(o){return Object.getOwnPropertyDescriptor(t,o).enumerable})),n.push.apply(n,r)}return n}function nt(t){for(var e=1;et.length)&&(e=t.length);for(var n=0,r=new Array(e);n=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}var Kn,xt,ot,je=null,zn=(Kn=-1,xt=-1,ot=void 0,function(t){var e=++Kn;return Promise.resolve(t).then(function(n){return ot&&e=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}function Me(t){return Me=typeof Symbol=="function"&&typeof Symbol.iterator=="symbol"?function(e){return typeof e}:function(e){return e&&typeof Symbol=="function"&&e.constructor===Symbol&&e!==Symbol.prototype?"symbol":typeof e},Me(t)}var fi=["props","refresh","store"],mi=["inputElement","formElement","panelElement"],pi=["inputElement"],vi=["inputElement","maxLength"],di=["sourceIndex"],hi=["sourceIndex"],yi=["item","source","sourceIndex"];function $n(t,e){var n=Object.keys(t);if(Object.getOwnPropertySymbols){var r=Object.getOwnPropertySymbols(t);e&&(r=r.filter(function(o){return Object.getOwnPropertyDescriptor(t,o).enumerable})),n.push.apply(n,r)}return n}function R(t){for(var e=1;e=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var 
i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}function bi(t){var e=t.props,n=t.refresh,r=t.store,o=ne(t,fi),i=function(a,u){return u!==void 0?"".concat(a,"-").concat(u):a};return{getEnvironmentProps:function(a){var u=a.inputElement,c=a.formElement,s=a.panelElement;function l(m){!r.getState().isOpen&&r.pendingRequests.isEmpty()||m.target===u||[c,s].some(function(p){return v=p,d=m.target,v===d||v.contains(d);var v,d})===!1&&(r.dispatch("blur",null),e.debug||r.pendingRequests.cancelAll())}return R({onTouchStart:l,onMouseDown:l,onTouchMove:function(m){r.getState().isOpen!==!1&&u===e.environment.document.activeElement&&m.target!==u&&u.blur()}},ne(a,mi))},getRootProps:function(a){return R({role:"combobox","aria-expanded":r.getState().isOpen,"aria-haspopup":"listbox","aria-owns":r.getState().isOpen?"".concat(e.id,"-list"):void 0,"aria-labelledby":"".concat(e.id,"-label")},a)},getFormProps:function(a){return a.inputElement,R({action:"",noValidate:!0,role:"search",onSubmit:function(u){var c;u.preventDefault(),e.onSubmit(R({event:u,refresh:n,state:r.getState()},o)),r.dispatch("submit",null),(c=a.inputElement)===null||c===void 0||c.blur()},onReset:function(u){var c;u.preventDefault(),e.onReset(R({event:u,refresh:n,state:r.getState()},o)),r.dispatch("reset",null),(c=a.inputElement)===null||c===void 0||c.focus()}},ne(a,pi))},getLabelProps:function(a){var u=a||{},c=u.sourceIndex,s=ne(u,di);return R({htmlFor:"".concat(i(e.id,c),"-input"),id:"".concat(i(e.id,c),"-label")},s)},getInputProps:function(a){var u;function c(y){(e.openOnFocus||r.getState().query)&&le(R({event:y,props:e,query:r.getState().completion||r.getState().query,refresh:n,store:r},o)),r.dispatch("focus",null)}var s=a||{},l=(s.inputElement,s.maxLength),m=l===void 0?512:l,p=ne(s,vi),v=fe(r.getState()),d=function(y){return!!(y&&y.match($o))}(((u=e.environment.navigator)===null||u===void 0?void 
0:u.userAgent)||""),h=v!=null&&v.itemUrl&&!d?"go":"search";return R({"aria-autocomplete":"both","aria-activedescendant":r.getState().isOpen&&r.getState().activeItemId!==null?"".concat(e.id,"-item-").concat(r.getState().activeItemId):void 0,"aria-controls":r.getState().isOpen?"".concat(e.id,"-list"):void 0,"aria-labelledby":"".concat(e.id,"-label"),value:r.getState().completion||r.getState().query,id:"".concat(e.id,"-input"),autoComplete:"off",autoCorrect:"off",autoCapitalize:"off",enterKeyHint:h,spellCheck:"false",autoFocus:e.autoFocus,placeholder:e.placeholder,maxLength:m,type:"search",onChange:function(y){le(R({event:y,props:e,query:y.currentTarget.value.slice(0,m),refresh:n,store:r},o))},onKeyDown:function(y){(function(b){var _=b.event,S=b.props,O=b.refresh,g=b.store,P=si(b,ui);if(_.key==="ArrowUp"||_.key==="ArrowDown"){var C=function(){var M=S.environment.document.getElementById("".concat(S.id,"-item-").concat(g.getState().activeItemId));M&&(M.scrollIntoViewIfNeeded?M.scrollIntoViewIfNeeded(!1):M.scrollIntoView(!1))},L=function(){var M=fe(g.getState());if(g.getState().activeItemId!==null&&M){var Ot=M.item,St=M.itemInputValue,$e=M.itemUrl,B=M.source;B.onActive(te({event:_,item:Ot,itemInputValue:St,itemUrl:$e,refresh:O,source:B,state:g.getState()},P))}};_.preventDefault(),g.getState().isOpen===!1&&(S.openOnFocus||g.getState().query)?le(te({event:_,props:S,query:g.getState().query,refresh:O,store:g},P)).then(function(){g.dispatch(_.key,{nextActiveItemId:S.defaultActiveItemId}),L(),setTimeout(C,0)}):(g.dispatch(_.key,{}),L(),C())}else if(_.key==="Escape")_.preventDefault(),g.dispatch(_.key,null),g.pendingRequests.cancelAll();else if(_.key==="Tab")g.dispatch("blur",null),g.pendingRequests.cancelAll();else if(_.key==="Enter"){if(g.getState().activeItemId===null||g.getState().collections.every(function(M){return M.items.length===0}))return void(S.debug||g.pendingRequests.cancelAll());_.preventDefault();var 
x=fe(g.getState()),k=x.item,N=x.itemInputValue,U=x.itemUrl,F=x.source;if(_.metaKey||_.ctrlKey)U!==void 0&&(F.onSelect(te({event:_,item:k,itemInputValue:N,itemUrl:U,refresh:O,source:F,state:g.getState()},P)),S.navigator.navigateNewTab({itemUrl:U,item:k,state:g.getState()}));else if(_.shiftKey)U!==void 0&&(F.onSelect(te({event:_,item:k,itemInputValue:N,itemUrl:U,refresh:O,source:F,state:g.getState()},P)),S.navigator.navigateNewWindow({itemUrl:U,item:k,state:g.getState()}));else if(!_.altKey){if(U!==void 0)return F.onSelect(te({event:_,item:k,itemInputValue:N,itemUrl:U,refresh:O,source:F,state:g.getState()},P)),void S.navigator.navigate({itemUrl:U,item:k,state:g.getState()});le(te({event:_,nextState:{isOpen:!1},props:S,query:N,refresh:O,store:g},P)).then(function(){F.onSelect(te({event:_,item:k,itemInputValue:N,itemUrl:U,refresh:O,source:F,state:g.getState()},P))})}}})(R({event:y,props:e,refresh:n,store:r},o))},onFocus:c,onBlur:vt,onClick:function(y){a.inputElement!==e.environment.document.activeElement||r.getState().isOpen||c(y)}},p)},getPanelProps:function(a){return R({onMouseDown:function(u){u.preventDefault()},onMouseLeave:function(){r.dispatch("mouseleave",null)}},a)},getListProps:function(a){var u=a||{},c=u.sourceIndex,s=ne(u,hi);return R({role:"listbox","aria-labelledby":"".concat(i(e.id,c),"-label"),id:"".concat(i(e.id,c),"-list")},s)},getItemProps:function(a){var u=a.item,c=a.source,s=a.sourceIndex,l=ne(a,yi);return R({id:"".concat(i(e.id,s),"-item-").concat(u.__autocomplete_id),role:"option","aria-selected":r.getState().activeItemId===u.__autocomplete_id,onMouseMove:function(m){if(u.__autocomplete_id!==r.getState().activeItemId){r.dispatch("mousemove",u.__autocomplete_id);var p=fe(r.getState());if(r.getState().activeItemId!==null&&p){var v=p.item,d=p.itemInputValue,h=p.itemUrl,y=p.source;y.onActive(R({event:m,item:v,itemInputValue:d,itemUrl:h,refresh:n,source:y,state:r.getState()},o))}}},onMouseDown:function(m){m.preventDefault()},onClick:function(m){var 
p=c.getItemInputValue({item:u,state:r.getState()}),v=c.getItemUrl({item:u,state:r.getState()});(v?Promise.resolve():le(R({event:m,nextState:{isOpen:!1},props:e,query:p,refresh:n,store:r},o))).then(function(){c.onSelect(R({event:m,item:u,itemInputValue:p,itemUrl:v,refresh:n,source:c,state:r.getState()},o))})}},l)}}}function He(t){return He=typeof Symbol=="function"&&typeof Symbol.iterator=="symbol"?function(e){return typeof e}:function(e){return e&&typeof Symbol=="function"&&e.constructor===Symbol&&e!==Symbol.prototype?"symbol":typeof e},He(t)}function Qn(t,e){var n=Object.keys(t);if(Object.getOwnPropertySymbols){var r=Object.getOwnPropertySymbols(t);e&&(r=r.filter(function(o){return Object.getOwnPropertyDescriptor(t,o).enumerable})),n.push.apply(n,r)}return n}function _i(t){for(var e=1;et.length)&&(e=t.length);for(var n=0,r=new Array(e);n=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}function Bi(t){var e=t.translations,n=e===void 0?{}:e,r=Fi(t,Hi),o=n.noResultsText,i=o===void 0?"No results for":o,a=n.suggestedQueryText,u=a===void 0?"Try searching for":a,c=n.reportMissingResultsText,s=c===void 0?"Believe this query should return results?":c,l=n.reportMissingResultsLinkText,m=l===void 0?"Let us know.":l,p=r.state.context.searchSuggestions;return f.createElement("div",{className:"DocSearch-NoResults"},f.createElement("div",{className:"DocSearch-Screen-Icon"},f.createElement(Li,null)),f.createElement("p",{className:"DocSearch-Title"},i,' 
"',f.createElement("strong",null,r.state.query),'"'),p&&p.length>0&&f.createElement("div",{className:"DocSearch-NoResults-Prefill-List"},f.createElement("p",{className:"DocSearch-Help"},u,":"),f.createElement("ul",null,p.slice(0,3).reduce(function(v,d){return[].concat(Ui(v),[f.createElement("li",{key:d},f.createElement("button",{className:"DocSearch-Prefill",key:d,type:"button",onClick:function(){r.setQuery(d.toLowerCase()+" "),r.refresh(),r.inputRef.current.focus()}},d))])},[]))),r.getMissingResultsUrl&&f.createElement("p",{className:"DocSearch-Help"},"".concat(s," "),f.createElement("a",{href:r.getMissingResultsUrl({query:r.state.query}),target:"_blank",rel:"noopener noreferrer"},m)))}var Vi=["hit","attribute","tagName"];function er(t,e){var n=Object.keys(t);if(Object.getOwnPropertySymbols){var r=Object.getOwnPropertySymbols(t);e&&(r=r.filter(function(o){return Object.getOwnPropertyDescriptor(t,o).enumerable})),n.push.apply(n,r)}return n}function tr(t){for(var e=1;e=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}function nr(t,e){return e.split(".").reduce(function(n,r){return n!=null&&n[r]?n[r]:null},t)}function ue(t){var e=t.hit,n=t.attribute,r=t.tagName;return W(r===void 0?"span":r,tr(tr({},Ki(t,Vi)),{},{dangerouslySetInnerHTML:{__html:nr(e,"_snippetResult.".concat(n,".value"))||nr(e,n)}}))}function rr(t,e){return function(n){if(Array.isArray(n))return n}(t)||function(n,r){var o=n==null?null:typeof Symbol<"u"&&n[Symbol.iterator]||n["@@iterator"];if(o!=null){var i,a,u=[],c=!0,s=!1;try{for(o=o.call(n);!(c=(i=o.next()).done)&&(u.push(i.value),!r||u.length!==r);c=!0);}catch(l){s=!0,a=l}finally{try{c||o.return==null||o.return()}finally{if(s)throw a}}return u}}(t,e)||function(n,r){if(n){if(typeof n=="string")return or(n,r);var 
o=Object.prototype.toString.call(n).slice(8,-1);if(o==="Object"&&n.constructor&&(o=n.constructor.name),o==="Map"||o==="Set")return Array.from(n);if(o==="Arguments"||/^(?:Ui|I)nt(?:8|16|32)(?:Clamped)?Array$/.test(o))return or(n,r)}}(t,e)||function(){throw new TypeError(`Invalid attempt to destructure non-iterable instance. +In order to be iterable, non-array objects must have a [Symbol.iterator]() method.`)}()}function or(t,e){(e==null||e>t.length)&&(e=t.length);for(var n=0,r=new Array(e);n|<\/mark>)/g,$i=RegExp(zr.source);function Jr(t){var e,n,r=t;if(!r.__docsearch_parent&&!t._highlightResult)return t.hierarchy.lvl0;var o=((r.__docsearch_parent?(e=r.__docsearch_parent)===null||e===void 0||(e=e._highlightResult)===null||e===void 0||(e=e.hierarchy)===null||e===void 0?void 0:e.lvl0:(n=t._highlightResult)===null||n===void 0||(n=n.hierarchy)===null||n===void 0?void 0:n.lvl0)||{}).value;return o&&$i.test(o)?o.replace(zr,""):o}function Jt(){return Jt=Object.assign||function(t){for(var e=1;e=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}function Gi(t){var e=t.translations,n=e===void 0?{}:e,r=Yi(t,Zi),o=n.recentSearchesTitle,i=o===void 0?"Recent":o,a=n.noRecentSearchesText,u=a===void 0?"No recent searches":a,c=n.saveRecentSearchButtonTitle,s=c===void 0?"Save this search":c,l=n.removeRecentSearchButtonTitle,m=l===void 0?"Remove this search from history":l,p=n.favoriteSearchesTitle,v=p===void 0?"Favorite":p,d=n.removeFavoriteSearchButtonTitle,h=d===void 0?"Remove this search from favorites":d;return 
r.state.status==="idle"&&r.hasCollections===!1?r.disableUserPersonalization?null:f.createElement("div",{className:"DocSearch-StartScreen"},f.createElement("p",{className:"DocSearch-Help"},u)):r.hasCollections===!1?null:f.createElement("div",{className:"DocSearch-Dropdown-Container"},f.createElement(zt,ht({},r,{title:i,collection:r.state.collections[0],renderIcon:function(){return f.createElement("div",{className:"DocSearch-Hit-icon"},f.createElement(Ai,null))},renderAction:function(y){var b=y.item,_=y.runFavoriteTransition,S=y.runDeleteTransition;return f.createElement(f.Fragment,null,f.createElement("div",{className:"DocSearch-Hit-action"},f.createElement("button",{className:"DocSearch-Hit-action-button",title:s,type:"submit",onClick:function(O){O.preventDefault(),O.stopPropagation(),_(function(){r.favoriteSearches.add(b),r.recentSearches.remove(b),r.refresh()})}},f.createElement(Xn,null))),f.createElement("div",{className:"DocSearch-Hit-action"},f.createElement("button",{className:"DocSearch-Hit-action-button",title:m,type:"submit",onClick:function(O){O.preventDefault(),O.stopPropagation(),S(function(){r.recentSearches.remove(b),r.refresh()})}},f.createElement(Kt,null))))}})),f.createElement(zt,ht({},r,{title:v,collection:r.state.collections[1],renderIcon:function(){return f.createElement("div",{className:"DocSearch-Hit-icon"},f.createElement(Xn,null))},renderAction:function(y){var b=y.item,_=y.runDeleteTransition;return f.createElement("div",{className:"DocSearch-Hit-action"},f.createElement("button",{className:"DocSearch-Hit-action-button",title:h,type:"submit",onClick:function(S){S.preventDefault(),S.stopPropagation(),_(function(){r.favoriteSearches.remove(b),r.refresh()})}},f.createElement(Kt,null)))}})))}var Xi=["translations"];function yt(){return yt=Object.assign||function(t){for(var e=1;e=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var 
i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}var ta=f.memo(function(t){var e=t.translations,n=e===void 0?{}:e,r=ea(t,Xi);if(r.state.status==="error")return f.createElement(Mi,{translations:n==null?void 0:n.errorScreen});var o=r.state.collections.some(function(i){return i.items.length>0});return r.state.query?o===!1?f.createElement(Bi,yt({},r,{translations:n==null?void 0:n.noResultsScreen})):f.createElement(Qi,r):f.createElement(Gi,yt({},r,{hasCollections:o,translations:n==null?void 0:n.startScreen}))},function(t,e){return e.state.status==="loading"||e.state.status==="stalled"}),na=["translations"];function gt(){return gt=Object.assign||function(t){for(var e=1;e=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}function oa(t){var e=t.translations,n=e===void 0?{}:e,r=ra(t,na),o=n.resetButtonTitle,i=o===void 0?"Clear the query":o,a=n.resetButtonAriaLabel,u=a===void 0?"Clear the query":a,c=n.cancelButtonText,s=c===void 0?"Cancel":c,l=n.cancelButtonAriaLabel,m=l===void 0?"Cancel":l,p=r.getFormProps({inputElement:r.inputRef.current}).onReset;return 
f.useEffect(function(){r.autoFocus&&r.inputRef.current&&r.inputRef.current.focus()},[r.autoFocus,r.inputRef]),f.useEffect(function(){r.isFromSelection&&r.inputRef.current&&r.inputRef.current.select()},[r.isFromSelection,r.inputRef]),f.createElement(f.Fragment,null,f.createElement("form",{className:"DocSearch-Form",onSubmit:function(v){v.preventDefault()},onReset:p},f.createElement("label",gt({className:"DocSearch-MagnifierLabel"},r.getLabelProps()),f.createElement(Hr,null)),f.createElement("div",{className:"DocSearch-LoadingIndicator"},f.createElement(ki,null)),f.createElement("input",gt({className:"DocSearch-Input",ref:r.inputRef},r.getInputProps({inputElement:r.inputRef.current,autoFocus:r.autoFocus,maxLength:64}))),f.createElement("button",{type:"reset",title:i,className:"DocSearch-Reset","aria-label":u,hidden:!r.state.query},f.createElement(Kt,null))),f.createElement("button",{className:"DocSearch-Cancel",type:"reset","aria-label":m,onClick:r.onClose},s))}var ia=["_highlightResult","_snippetResult"];function aa(t,e){if(t==null)return{};var n,r,o=function(a,u){if(a==null)return{};var c,s,l={},m=Object.keys(a);for(s=0;s=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}function ca(t){return function(){var e="__TEST_KEY__";try{return localStorage.setItem(e,""),localStorage.removeItem(e),!0}catch{return!1}}()===!1?{setItem:function(){},getItem:function(){return[]}}:{setItem:function(e){return window.localStorage.setItem(t,JSON.stringify(e))},getItem:function(){var e=window.localStorage.getItem(t);return e?JSON.parse(e):[]}}}function cr(t){var e=t.key,n=t.limit,r=n===void 0?5:n,o=ca(e),i=o.getItem().slice(0,r);return{add:function(a){var u=a,c=(u._highlightResult,u._snippetResult,aa(u,ia)),s=i.findIndex(function(l){return 
l.objectID===c.objectID});s>-1&&i.splice(s,1),i.unshift(c),i=i.slice(0,r),o.setItem(i)},remove:function(a){i=i.filter(function(u){return u.objectID!==a.objectID}),o.setItem(i)},getAll:function(){return i}}}var ua=["facetName","facetQuery"];function la(t){var e,n="algoliasearch-client-js-".concat(t.key),r=function(){return e===void 0&&(e=t.localStorage||window.localStorage),e},o=function(){return JSON.parse(r().getItem(n)||"{}")},i=function(u){r().setItem(n,JSON.stringify(u))},a=function(){var u=t.timeToLive?1e3*t.timeToLive:null,c=o(),s=Object.fromEntries(Object.entries(c).filter(function(m){return se(m,2)[1].timestamp!==void 0}));if(i(s),u){var l=Object.fromEntries(Object.entries(s).filter(function(m){var p=se(m,2)[1],v=new Date().getTime();return!(p.timestamp+u2&&arguments[2]!==void 0?arguments[2]:{miss:function(){return Promise.resolve()}};return Promise.resolve().then(function(){a();var l=JSON.stringify(u);return o()[l]}).then(function(l){return Promise.all([l?l.value:c(),l!==void 0])}).then(function(l){var m=se(l,2),p=m[0],v=m[1];return Promise.all([p,v||s.miss(p)])}).then(function(l){return se(l,1)[0]})},set:function(u,c){return Promise.resolve().then(function(){var s=o();return s[JSON.stringify(u)]={timestamp:new Date().getTime(),value:c},r().setItem(n,JSON.stringify(s)),c})},delete:function(u){return Promise.resolve().then(function(){var c=o();delete c[JSON.stringify(u)],r().setItem(n,JSON.stringify(c))})},clear:function(){return Promise.resolve().then(function(){r().removeItem(n)})}}}function Ee(t){var e=ft(t.caches),n=e.shift();return n===void 0?{get:function(r,o){var i=arguments.length>2&&arguments[2]!==void 0?arguments[2]:{miss:function(){return Promise.resolve()}};return o().then(function(a){return Promise.all([a,i.miss(a)])}).then(function(a){return se(a,1)[0]})},set:function(r,o){return Promise.resolve(o)},delete:function(r){return Promise.resolve()},clear:function(){return Promise.resolve()}}:{get:function(r,o){var 
i=arguments.length>2&&arguments[2]!==void 0?arguments[2]:{miss:function(){return Promise.resolve()}};return n.get(r,o,i).catch(function(){return Ee({caches:e}).get(r,o,i)})},set:function(r,o){return n.set(r,o).catch(function(){return Ee({caches:e}).set(r,o)})},delete:function(r){return n.delete(r).catch(function(){return Ee({caches:e}).delete(r)})},clear:function(){return n.clear().catch(function(){return Ee({caches:e}).clear()})}}}function Tt(){var t=arguments.length>0&&arguments[0]!==void 0?arguments[0]:{serializable:!0},e={};return{get:function(n,r){var o=arguments.length>2&&arguments[2]!==void 0?arguments[2]:{miss:function(){return Promise.resolve()}},i=JSON.stringify(n);if(i in e)return Promise.resolve(t.serializable?JSON.parse(e[i]):e[i]);var a=r(),u=o&&o.miss||function(){return Promise.resolve()};return a.then(function(c){return u(c)}).then(function(){return a})},set:function(n,r){return e[JSON.stringify(n)]=t.serializable?JSON.stringify(r):r,Promise.resolve(r)},delete:function(n){return delete e[JSON.stringify(n)],Promise.resolve()},clear:function(){return e={},Promise.resolve()}}}function sa(t){for(var e=t.length-1;e>0;e--){var n=Math.floor(Math.random()*(e+1)),r=t[e];t[e]=t[n],t[n]=r}return t}function $r(t,e){return e&&Object.keys(e).forEach(function(n){t[n]=e[n](t)}),t}function bt(t){for(var e=arguments.length,n=new Array(e>1?e-1:0),r=1;r0?r:void 0,timeout:n.timeout||e,headers:n.headers||{},queryParameters:n.queryParameters||{},cacheable:n.cacheable}}var me={Read:1,Write:2,Any:3},Qr=1,fa=2,Zr=3;function Yr(t){var e=arguments.length>1&&arguments[1]!==void 0?arguments[1]:Qr;return I(I({},t),{},{status:e,lastUpdate:Date.now()})}function Gr(t){return typeof t=="string"?{protocol:"https",url:t,accept:me.Any}:{protocol:t.protocol||"https",url:t.url,accept:t.accept||me.Any}}var $t="GET",_t="POST";function ma(t,e){return Promise.all(e.map(function(n){return t.get(n,function(){return Promise.resolve(Yr(n))})})).then(function(n){var r=n.filter(function(a){return 
function(u){return u.status===Qr||Date.now()-u.lastUpdate>12e4}(a)}),o=n.filter(function(a){return function(u){return u.status===Zr&&Date.now()-u.lastUpdate<=12e4}(a)}),i=[].concat(ft(r),ft(o));return{getTimeout:function(a,u){return(o.length===0&&a===0?1:o.length+3+a)*u},statelessHosts:i.length>0?i.map(function(a){return Gr(a)}):e}})}function lr(t,e,n,r){var o=[],i=function(p,v){if(!(p.method===$t||p.data===void 0&&v.data===void 0)){var d=Array.isArray(p.data)?p.data:I(I({},p.data),v.data);return JSON.stringify(d)}}(n,r),a=function(p,v){var d=I(I({},p.headers),v.headers),h={};return Object.keys(d).forEach(function(y){var b=d[y];h[y.toLowerCase()]=b}),h}(t,r),u=n.method,c=n.method!==$t?{}:I(I({},n.data),r.data),s=I(I(I({"x-algolia-agent":t.userAgent.value},t.queryParameters),c),r.queryParameters),l=0,m=function p(v,d){var h=v.pop();if(h===void 0)throw{name:"RetryError",message:"Unreachable hosts - your application id may be incorrect. If the error persists, contact support@algolia.com.",transporterStackTrace:sr(o)};var y={data:i,headers:a,method:u,url:va(h,n.path,s),connectTimeout:d(l,t.timeouts.connect),responseTimeout:d(l,r.timeout)},b=function(S){var O={request:y,response:S,host:h,triesLeft:v.length};return o.push(O),O},_={onSuccess:function(S){return function(O){try{return JSON.parse(O.content)}catch(g){throw function(P,C){return{name:"DeserializationError",message:P,response:C}}(g.message,O)}}(S)},onRetry:function(S){var O=b(S);return S.isTimedOut&&l++,Promise.all([t.logger.info("Retryable failure",eo(O)),t.hostsCache.set(h,Yr(h,S.isTimedOut?Zr:fa))]).then(function(){return p(v,d)})},onFail:function(S){throw b(S),function(O,g){var P=O.content,C=O.status,L=P;try{L=JSON.parse(P).message}catch{}return function(x,k,N){return{name:"ApiError",message:x,status:k,transporterStackTrace:N}}(L,C,g)}(S,sr(o))}};return t.requester.send(y).then(function(S){return function(O,g){return function(P){var C=P.status;return P.isTimedOut||function(L){var 
x=L.isTimedOut,k=L.status;return!x&&~~k==0}(P)||~~(C/100)!=2&&~~(C/100)!=4}(O)?g.onRetry(O):~~(O.status/100)==2?g.onSuccess(O):g.onFail(O)}(S,_)})};return ma(t.hostsCache,e).then(function(p){return m(ft(p.statelessHosts).reverse(),p.getTimeout)})}function pa(t){var e={value:"Algolia for JavaScript (".concat(t,")"),add:function(n){var r="; ".concat(n.segment).concat(n.version!==void 0?" (".concat(n.version,")"):"");return e.value.indexOf(r)===-1&&(e.value="".concat(e.value).concat(r)),e}};return e}function va(t,e,n){var r=Xr(n),o="".concat(t.protocol,"://").concat(t.url,"/").concat(e.charAt(0)==="/"?e.substr(1):e);return r.length&&(o+="?".concat(r)),o}function Xr(t){return Object.keys(t).map(function(e){return bt("%s=%s",e,(n=t[e],Object.prototype.toString.call(n)==="[object Object]"||Object.prototype.toString.call(n)==="[object Array]"?JSON.stringify(t[e]):t[e]));var n}).join("&")}function sr(t){return t.map(function(e){return eo(e)})}function eo(t){var e=t.request.headers["x-algolia-api-key"]?{"x-algolia-api-key":"*****"}:{};return I(I({},t),{},{request:I(I({},t.request),{},{headers:I(I({},t.request.headers),e)})})}var da=function(t){var e=t.appId,n=function(i,a,u){var c={"x-algolia-api-key":u,"x-algolia-application-id":a};return{headers:function(){return i===st.WithinHeaders?c:{}},queryParameters:function(){return i===st.WithinQueryParameters?c:{}}}}(t.authMode!==void 0?t.authMode:st.WithinHeaders,e,t.apiKey),r=function(i){var a=i.hostsCache,u=i.logger,c=i.requester,s=i.requestsCache,l=i.responsesCache,m=i.timeouts,p=i.userAgent,v=i.hosts,d=i.queryParameters,h={hostsCache:a,logger:u,requester:c,requestsCache:s,responsesCache:l,timeouts:m,userAgent:p,headers:i.headers,queryParameters:d,hosts:v.map(function(y){return Gr(y)}),read:function(y,b){var _=ur(b,h.timeouts.read),S=function(){return lr(h,h.hosts.filter(function(g){return(g.accept&me.Read)!=0}),y,_)};if((_.cacheable!==void 0?_.cacheable:y.cacheable)!==!0)return S();var 
O={request:y,mappedRequestOptions:_,transporter:{queryParameters:h.queryParameters,headers:h.headers}};return h.responsesCache.get(O,function(){return h.requestsCache.get(O,function(){return h.requestsCache.set(O,S()).then(function(g){return Promise.all([h.requestsCache.delete(O),g])},function(g){return Promise.all([h.requestsCache.delete(O),Promise.reject(g)])}).then(function(g){var P=se(g,2);return P[0],P[1]})})},{miss:function(g){return h.responsesCache.set(O,g)}})},write:function(y,b){return lr(h,h.hosts.filter(function(_){return(_.accept&me.Write)!=0}),y,ur(b,h.timeouts.write))}};return h}(I(I({hosts:[{url:"".concat(e,"-dsn.algolia.net"),accept:me.Read},{url:"".concat(e,".algolia.net"),accept:me.Write}].concat(sa([{url:"".concat(e,"-1.algolianet.com")},{url:"".concat(e,"-2.algolianet.com")},{url:"".concat(e,"-3.algolianet.com")}]))},t),{},{headers:I(I(I({},n.headers()),{"content-type":"application/x-www-form-urlencoded"}),t.headers),queryParameters:I(I({},n.queryParameters()),t.queryParameters)})),o={transporter:r,appId:e,addAlgoliaAgent:function(i,a){r.userAgent.add({segment:i,version:a})},clearCache:function(){return Promise.all([r.requestsCache.clear(),r.responsesCache.clear()]).then(function(){})}};return $r(o,t.methods)},ha=function(t){return function(e,n){return e.method===$t?t.transporter.read(e,n):t.transporter.write(e,n)}},to=function(t){return function(e){var n=arguments.length>1&&arguments[1]!==void 0?arguments[1]:{},r={transporter:t.transporter,appId:t.appId,indexName:e};return $r(r,n.methods)}},fr=function(t){return function(e,n){var r=e.map(function(o){return I(I({},o),{},{params:Xr(o.params||{})})});return t.transporter.read({method:_t,path:"1/indexes/*/queries",data:{requests:r},cacheable:!0},n)}},mr=function(t){return function(e,n){return Promise.all(e.map(function(r){var o=r.params,i=o.facetName,a=o.facetQuery,u=fo(o,ua);return 
to(t)(r.indexName,{methods:{searchForFacetValues:no}}).searchForFacetValues(i,a,I(I({},n),u))}))}},ya=function(t){return function(e,n,r){return t.transporter.read({method:_t,path:bt("1/answers/%s/prediction",t.indexName),data:{query:e,queryLanguages:n},cacheable:!0},r)}},ga=function(t){return function(e,n){return t.transporter.read({method:_t,path:bt("1/indexes/%s/query",t.indexName),data:{query:e},cacheable:!0},n)}},no=function(t){return function(e,n,r){return t.transporter.read({method:_t,path:bt("1/indexes/%s/facets/%s/query",t.indexName,e),data:{facetQuery:n},cacheable:!0},r)}},ba=1,_a=2,Oa=3;function ro(t,e,n){var r,o={appId:t,apiKey:e,timeouts:{connect:1,read:2,write:30},requester:{send:function(i){return new Promise(function(a){var u=new XMLHttpRequest;u.open(i.method,i.url,!0),Object.keys(i.headers).forEach(function(m){return u.setRequestHeader(m,i.headers[m])});var c,s=function(m,p){return setTimeout(function(){u.abort(),a({status:0,content:p,isTimedOut:!0})},1e3*m)},l=s(i.connectTimeout,"Connection timeout");u.onreadystatechange=function(){u.readyState>u.OPENED&&c===void 0&&(clearTimeout(l),c=s(i.responseTimeout,"Socket timeout"))},u.onerror=function(){u.status===0&&(clearTimeout(l),clearTimeout(c),a({content:u.responseText||"Network request failed",status:u.status,isTimedOut:!1}))},u.onload=function(){clearTimeout(l),clearTimeout(c),a({content:u.responseText,status:u.status,isTimedOut:!1})},u.send(i.data)})}},logger:(r=Oa,{debug:function(i,a){return ba>=r&&console.debug(i,a),Promise.resolve()},info:function(i,a){return _a>=r&&console.info(i,a),Promise.resolve()},error:function(i,a){return console.error(i,a),Promise.resolve()}}),responsesCache:Tt(),requestsCache:Tt({serializable:!1}),hostsCache:Ee({caches:[la({key:"".concat("4.19.1","-").concat(t)}),Tt()]}),userAgent:pa("4.19.1").add({segment:"Browser",version:"lite"}),authMode:st.WithinQueryParameters};return 
da(I(I(I({},o),n),{},{methods:{search:fr,searchForFacetValues:mr,multipleQueries:fr,multipleSearchForFacetValues:mr,customRequest:ha,initIndex:function(i){return function(a){return to(i)(a,{methods:{search:ga,searchForFacetValues:no,findAnswers:ya}})}}}}))}ro.version="4.19.1";var Sa=["footer","searchBox"];function Be(){return Be=Object.assign||function(t){for(var e=1;et.length)&&(e=t.length);for(var n=0,r=new Array(e);n=0||(l[c]=a[c]);return l}(t,e);if(Object.getOwnPropertySymbols){var i=Object.getOwnPropertySymbols(t);for(r=0;r=0||Object.prototype.propertyIsEnumerable.call(t,n)&&(o[n]=t[n])}return o}function Pa(t){var e=t.appId,n=t.apiKey,r=t.indexName,o=t.placeholder,i=o===void 0?"Search docs":o,a=t.searchParameters,u=t.maxResultsPerGroup,c=t.onClose,s=c===void 0?Ji:c,l=t.transformItems,m=l===void 0?ar:l,p=t.hitComponent,v=p===void 0?Di:p,d=t.resultsFooterComponent,h=d===void 0?function(){return null}:d,y=t.navigator,b=t.initialScrollY,_=b===void 0?0:b,S=t.transformSearchClient,O=S===void 0?ar:S,g=t.disableUserPersonalization,P=g!==void 0&&g,C=t.initialQuery,L=C===void 0?"":C,x=t.translations,k=x===void 0?{}:x,N=t.getMissingResultsUrl,U=t.insights,F=U!==void 0&&U,M=k.footer,Ot=k.searchBox,St=Ea(k,Sa),$e=wa(f.useState({query:"",collections:[],completion:null,context:{},isOpen:!1,activeItemId:null,status:"idle"}),2),B=$e[0],oo=$e[1],Xt=f.useRef(null),jt=f.useRef(null),en=f.useRef(null),Qe=f.useRef(null),he=f.useRef(null),Q=f.useRef(10),tn=f.useRef(typeof window<"u"?window.getSelection().toString().slice(0,64):"").current,ee=f.useRef(L||tn).current,nn=function(w,D,T){return f.useMemo(function(){var H=ro(w,D);return H.addAlgoliaAgent("docsearch","3.5.2"),/docsearch.js 
\(.*\)/.test(H.transporter.userAgent.value)===!1&&H.addAlgoliaAgent("docsearch-react","3.5.2"),T(H)},[w,D,T])}(e,n,O),oe=f.useRef(cr({key:"__DOCSEARCH_FAVORITE_SEARCHES__".concat(r),limit:10})).current,ye=f.useRef(cr({key:"__DOCSEARCH_RECENT_SEARCHES__".concat(r),limit:oe.getAll().length===0?7:4})).current,ge=f.useCallback(function(w){if(!P){var D=w.type==="content"?w.__docsearch_parent:w;D&&oe.getAll().findIndex(function(T){return T.objectID===D.objectID})===-1&&ye.add(D)}},[oe,ye,P]),io=f.useCallback(function(w){if(B.context.algoliaInsightsPlugin&&w.__autocomplete_id){var D=w,T={eventName:"Item Selected",index:D.__autocomplete_indexName,items:[D],positions:[w.__autocomplete_id],queryID:D.__autocomplete_queryID};B.context.algoliaInsightsPlugin.insights.clickedObjectIDsAfterSearch(T)}},[B.context.algoliaInsightsPlugin]),be=f.useMemo(function(){return Ei({id:"docsearch",defaultActiveItemId:0,placeholder:i,openOnFocus:!0,initialState:{query:ee,context:{searchSuggestions:[]}},insights:F,navigator:y,onStateChange:function(w){oo(w.state)},getSources:function(w){var D=w.query,T=w.state,H=w.setContext,Z=w.setStatus;if(!D)return P?[]:[{sourceId:"recentSearches",onSelect:function(A){var V=A.item,_e=A.event;ge(V),at(_e)||s()},getItemUrl:function(A){return A.item.url},getItems:function(){return ye.getAll()}},{sourceId:"favoriteSearches",onSelect:function(A){var V=A.item,_e=A.event;ge(V),at(_e)||s()},getItemUrl:function(A){return A.item.url},getItems:function(){return oe.getAll()}}];var Y=!!F;return 
nn.search([{query:D,indexName:r,params:Rt({attributesToRetrieve:["hierarchy.lvl0","hierarchy.lvl1","hierarchy.lvl2","hierarchy.lvl3","hierarchy.lvl4","hierarchy.lvl5","hierarchy.lvl6","content","type","url"],attributesToSnippet:["hierarchy.lvl1:".concat(Q.current),"hierarchy.lvl2:".concat(Q.current),"hierarchy.lvl3:".concat(Q.current),"hierarchy.lvl4:".concat(Q.current),"hierarchy.lvl5:".concat(Q.current),"hierarchy.lvl6:".concat(Q.current),"content:".concat(Q.current)],snippetEllipsisText:"…",highlightPreTag:"",highlightPostTag:"",hitsPerPage:20,clickAnalytics:Y},a)}]).catch(function(A){throw A.name==="RetryError"&&Z("error"),A}).then(function(A){var V=A.results[0],_e=V.hits,uo=V.nbHits,wt=ir(_e,function(Et){return Jr(Et)},u);T.context.searchSuggestions.length0&&(rn(),he.current&&he.current.focus())},[ee,rn]),f.useEffect(function(){function w(){if(jt.current){var D=.01*window.innerHeight;jt.current.style.setProperty("--docsearch-vh","".concat(D,"px"))}}return w(),window.addEventListener("resize",w),function(){window.removeEventListener("resize",w)}},[]),f.createElement("div",Be({ref:Xt},co({"aria-expanded":!0}),{className:["DocSearch","DocSearch-Container",B.status==="stalled"&&"DocSearch-Container--Stalled",B.status==="error"&&"DocSearch-Container--Errored"].filter(Boolean).join(" 
"),role:"button",tabIndex:0,onMouseDown:function(w){w.target===w.currentTarget&&s()}}),f.createElement("div",{className:"DocSearch-Modal",ref:jt},f.createElement("header",{className:"DocSearch-SearchBar",ref:en},f.createElement(oa,Be({},be,{state:B,autoFocus:ee.length===0,inputRef:he,isFromSelection:!!ee&&ee===tn,translations:Ot,onClose:s}))),f.createElement("div",{className:"DocSearch-Dropdown",ref:Qe},f.createElement(ta,Be({},be,{indexName:r,state:B,hitComponent:v,resultsFooterComponent:h,disableUserPersonalization:P,recentSearches:ye,favoriteSearches:oe,inputRef:he,translations:St,getMissingResultsUrl:N,onItemClick:function(w,D){io(w),ge(w),at(D)||s()}}))),f.createElement("footer",{className:"DocSearch-Footer"},f.createElement(Ii,{translations:M}))))}function Qt(){return Qt=Object.assign||function(t){for(var e=1;et.length)&&(e=t.length);for(var n=0,r=new Array(e);n1&&arguments[1]!==void 0?arguments[1]:window;return typeof e=="string"?n.document.querySelector(e):e}(t.container,t.environment))}export{Da as default}; diff --git a/assets/index.html-2e0bfe16.js b/assets/index.html-2e0bfe16.js new file mode 100644 index 00000000000..7752fdd058c --- /dev/null +++ b/assets/index.html-2e0bfe16.js @@ -0,0 +1 @@ +import{_ as c,r as l,o as s,c as d,a,d as t,w as n,b as e}from"./app-3d1677bf.js";const i={},_=a("h1",{id:"高性能",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#高性能","aria-hidden":"true"},"#"),e(" 高性能")],-1);function u(h,p){const r=l("RouterLink"),o=l("Badge");return s(),d("div",null,[_,a("ul",null,[a("li",null,[t(r,{to:"/zh/features/v11/performance/bulk-read-and-extend.html"},{default:n(()=>[e("预读 / 预扩展")]),_:1}),e(),t(o,{type:"tip",text:"V11 / v1.1.1-",vertical:"top"})]),a("li",null,[t(r,{to:"/zh/features/v11/performance/rel-size-cache.html"},{default:n(()=>[e("表大小缓存")]),_:1}),e(),t(o,{type:"tip",text:"V11 / v1.1.10-",vertical:"top"})]),a("li",null,[t(r,{to:"/zh/features/v11/performance/shared-server.html"},{default:n(()=>[e("Shared 
Server")]),_:1}),e(),t(o,{type:"tip",text:"V11 / v1.1.30-",vertical:"top"})])])])}const m=c(i,[["render",u],["__file","index.html.vue"]]);export{m as default}; diff --git a/assets/index.html-390f5696.js b/assets/index.html-390f5696.js new file mode 100644 index 00000000000..f4ed9211870 --- /dev/null +++ b/assets/index.html-390f5696.js @@ -0,0 +1,11 @@ +import{_ as t,r as o,o as l,c as i,a as e,b as n,d as s,e as r}from"./app-3d1677bf.js";const c={},p=e("hr",null,null,-1),d=e("h3",{id:"quick-start-with-docker",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#quick-start-with-docker","aria-hidden":"true"},"#"),n(" Quick Start with Docker")],-1),h={href:"https://hub.docker.com/r/polardb/polardb_pg_local_instance/tags",target:"_blank",rel:"noopener noreferrer"},u=r(`
# pull the instance image from DockerHub
+docker pull polardb/polardb_pg_local_instance
+# create and run the container
+docker run -it --rm polardb/polardb_pg_local_instance psql
+# check
+postgres=# SELECT version();
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+
`,2);function m(f,v){const a=o("ExternalLinkIcon");return l(),i("div",null,[p,d,e("p",null,[n("Pull the "),e("a",h,[n("local instance image"),s(a)]),n(" of PolarDB for PostgreSQL based on local storage. Create and run the container, and try PolarDB-PG instance directly:")]),u])}const g=t(c,[["render",m],["__file","index.html.vue"]]);export{g as default}; diff --git a/assets/index.html-4d0beb35.js b/assets/index.html-4d0beb35.js new file mode 100644 index 00000000000..3af84358533 --- /dev/null +++ b/assets/index.html-4d0beb35.js @@ -0,0 +1 @@ +import{_ as r,r as i,o as n,c as u,a as l,d as t,w as o,b as e}from"./app-3d1677bf.js";const c={},s=l("h1",{id:"弹性跨机并行查询-epq",tabindex:"-1"},[l("a",{class:"header-anchor",href:"#弹性跨机并行查询-epq","aria-hidden":"true"},"#"),e(" 弹性跨机并行查询(ePQ)")],-1);function d(v,h){const a=i("RouterLink"),p=i("Badge");return n(),u("div",null,[s,l("ul",null,[l("li",null,[t(a,{to:"/zh/features/v11/epq/epq-explain-analyze.html"},{default:o(()=>[e("ePQ 执行计划查看与分析")]),_:1}),e(),t(p,{type:"tip",text:"V11 / v1.1.22-",vertical:"top"})]),l("li",null,[t(a,{to:"/zh/features/v11/epq/epq-node-and-dop.html"},{default:o(()=>[e("ePQ 计算节点范围选择与并行度控制")]),_:1}),e(),t(p,{type:"tip",text:"V11 / v1.1.20-",vertical:"top"})]),l("li",null,[t(a,{to:"/zh/features/v11/epq/epq-partitioned-table.html"},{default:o(()=>[e("ePQ 支持分区表查询")]),_:1}),e(),t(p,{type:"tip",text:"V11 / v1.1.17-",vertical:"top"})]),l("li",null,[t(a,{to:"/zh/features/v11/epq/epq-create-btree-index.html"},{default:o(()=>[e("ePQ 支持创建 B-Tree 索引并行加速")]),_:1}),e(),t(p,{type:"tip",text:"V11 / v1.1.15-",vertical:"top"})]),l("li",null,[t(a,{to:"/zh/features/v11/epq/cluster-info.html"},{default:o(()=>[e("集群拓扑视图")]),_:1}),e(),t(p,{type:"tip",text:"V11 / v1.1.20-",vertical:"top"})]),l("li",null,[t(a,{to:"/zh/features/v11/epq/adaptive-scan.html"},{default:o(()=>[e("自适应扫描")]),_:1}),e(),t(p,{type:"tip",text:"V11 / 
v1.1.17-",vertical:"top"})]),l("li",null,[t(a,{to:"/zh/features/v11/epq/parallel-dml.html"},{default:o(()=>[e("并行 INSERT")]),_:1}),e(),t(p,{type:"tip",text:"V11 / v1.1.17-",vertical:"top"})]),l("li",null,[t(a,{to:"/zh/features/v11/epq/epq-ctas-mtview-bulk-insert.html"},{default:o(()=>[e("ePQ 支持创建/刷新物化视图并行加速和批量写入")]),_:1}),e(),t(p,{type:"tip",text:"V11 / v1.1.30-",vertical:"top"})])])])}const f=r(c,[["render",d],["__file","index.html.vue"]]);export{f as default}; diff --git a/assets/index.html-60aab00b.js b/assets/index.html-60aab00b.js new file mode 100644 index 00000000000..e2bb724653e --- /dev/null +++ b/assets/index.html-60aab00b.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-98064128","path":"/roadmap/","title":"Roadmap","lang":"en-US","frontmatter":{},"headers":[{"level":2,"title":"Version 1.0","slug":"version-1-0","link":"#version-1-0","children":[]},{"level":2,"title":"Version 2.0","slug":"version-2-0","link":"#version-2-0","children":[]},{"level":2,"title":"Version 3.0","slug":"version-3-0","link":"#version-3-0","children":[]},{"level":2,"title":"Version 4.0","slug":"version-4-0","link":"#version-4-0","children":[]},{"level":2,"title":"Version 5.0","slug":"version-5-0","link":"#version-5-0","children":[]}],"git":{"updatedTime":1642525053000},"filePathRelative":"roadmap/README.md"}');export{e as data}; diff --git a/assets/index.html-66d290ab.js b/assets/index.html-66d290ab.js new file mode 100644 index 00000000000..d9eafa89872 --- /dev/null +++ b/assets/index.html-66d290ab.js @@ -0,0 +1 @@ +import{_ as e,o as a,c as r,e as o}from"./app-3d1677bf.js";const l={},d=o('

Roadmap

PolarDB PostgreSQL will keep releasing features that are valuable to users. Five stages are currently planned:

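PolarDB PostgreSQL Version 1.0

Version 1.0 is built on a Shared-Storage architecture that separates compute from storage, and ships the minimum required feature set, for example PolarVFS, dirty-page flushing and buffer management, LogIndex, and SyncDDL.

  • PolarVFS: A VFS layer is abstracted inside the database kernel so that the kernel can work with any storage, including buffered I/O and direct I/O.
  • Dirty-page flushing and buffer management: The architecture changes from N compute nodes with N copies of storage to N compute nodes with one copy of storage. The primary node must coordinate when it flushes dirty pages so that read-only nodes do not read pages that are too new, known as “future pages”.
  • LogIndex: Because read-only nodes cannot flush dirty pages, a read-only node that needs a specific version of a page reads an older version from Shared-Storage and replays WAL in memory to obtain the correct version. The LogIndex structure records the WAL metadata for each page, so replay can look up LogIndex directly, which speeds up the replay process.
  • DDL synchronization: With compute and storage separated, the primary node must take into account the references that read-only nodes hold on objects such as relations when it executes DDL. The relevant DDL operations must take their locks on the read-only nodes synchronously.
  • Database monitoring: Monitors hosts and databases, and provides the basis for HA switchover decisions.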

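PolarDB PostgreSQL Version 2.0

Beyond changes to the compute-storage separation architecture, version 2.0 will deeply optimize the optimizer, for example:

  • UniqueKey: Similar to the ordering of plan nodes, UniqueKey tracks the uniqueness of the data produced by plan nodes. Knowing that data is unique makes it possible to drop unnecessary DISTINCT and GROUP BY operations, improve ordering checks on join results, and so on.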

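PolarDB PostgreSQL Version 3.0

Version 3.0 focuses on major availability improvements for the compute-storage separation architecture, for example:

  • Parallel replay: With compute and storage separated, PolarDB uses LogIndex to implement lazy replay. It only marks which WAL records a page needs, and performs the actual replay when a process reads the page, which affects read performance. Version 3.0 adds parallel replay on top of lazy replay to speed up page replay.
  • OnlinePromote: After the primary node crashes, any read-only node can take over. That node does not need to restart. It finishes replaying all WAL in parallel and is then promoted to the new primary node, which further reduces downtime.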

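PolarDB PostgreSQL Version 4.0

To meet the growing demand for HTAP mixed workloads, version 4.0 will release a distributed parallel execution engine based on the Shared-Storage architecture that makes full use of the CPU, memory, and I/O resources of multiple read-only nodes.

In tests, performance continued to scale linearly as the compute cluster grew to 256 cores.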

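PolarDB PostgreSQL Version 5.0

In the one-writer, multiple-reader architecture built on compute-storage separation, read capacity scales elastically, but writes can still be executed on only a single node.

Version 5.0 will release a Shared-Nothing On Share-Everything architecture that combines the architectural strengths of the distributed and centralized editions of PolarDB so that multiple nodes can accept writes.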

',17),i=[d];function t(s,n){return a(),r("div",null,i)}const p=e(l,[["render",t],["__file","index.html.vue"]]);export{p as default}; diff --git a/assets/index.html-8e3e01b7.js b/assets/index.html-8e3e01b7.js new file mode 100644 index 00000000000..c7cf6b08bc8 --- /dev/null +++ b/assets/index.html-8e3e01b7.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-7f44b843","path":"/zh/features/v11/","title":"自研功能","lang":"zh-CN","frontmatter":{},"headers":[],"git":{"updatedTime":1703745117000},"filePathRelative":"zh/features/v11/README.md"}');export{e as data}; diff --git a/assets/index.html-a1b339d0.js b/assets/index.html-a1b339d0.js new file mode 100644 index 00000000000..1762a3fb789 --- /dev/null +++ b/assets/index.html-a1b339d0.js @@ -0,0 +1 @@ +import{_ as n,r,o as c,c as s,a,d as e,w as o,b as t}from"./app-3d1677bf.js";const u={},p=a("h1",{id:"高可用",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#高可用","aria-hidden":"true"},"#"),t(" 高可用")],-1);function v(d,_){const l=r("RouterLink"),i=r("Badge");return c(),s("div",null,[p,a("ul",null,[a("li",null,[e(l,{to:"/zh/features/v11/availability/avail-online-promote.html"},{default:o(()=>[t("只读节点 Online Promote")]),_:1}),t(),e(i,{type:"tip",text:"V11 / v1.1.1-",vertical:"top"})]),a("li",null,[e(l,{to:"/zh/features/v11/availability/avail-parallel-replay.html"},{default:o(()=>[t("WAL 日志并行回放")]),_:1}),t(),e(i,{type:"tip",text:"V11 / v1.1.17-",vertical:"top"})]),a("li",null,[e(l,{to:"/zh/features/v11/availability/datamax.html"},{default:o(()=>[t("DataMax 日志节点")]),_:1}),t(),e(i,{type:"tip",text:"V11 / v1.1.6-",vertical:"top"})]),a("li",null,[e(l,{to:"/zh/features/v11/availability/resource-manager.html"},{default:o(()=>[t("Resource Manager")]),_:1}),t(),e(i,{type:"tip",text:"V11 / v1.1.1-",vertical:"top"})]),a("li",null,[e(l,{to:"/zh/features/v11/availability/flashback-table.html"},{default:o(()=>[t("闪回表和闪回日志")]),_:1}),t(),e(i,{type:"tip",text:"V11 / v1.1.22-",vertical:"top"})])])])}const 
f=n(u,[["render",v],["__file","index.html.vue"]]);export{f as default}; diff --git a/assets/index.html-a68fc122.js b/assets/index.html-a68fc122.js new file mode 100644 index 00000000000..eea33719e28 --- /dev/null +++ b/assets/index.html-a68fc122.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-0bbe1b6a","path":"/zh/features/","title":"自研功能","lang":"zh-CN","frontmatter":{},"headers":[{"level":2,"title":"功能 / 版本映射矩阵","slug":"功能-版本映射矩阵","link":"#功能-版本映射矩阵","children":[]}],"git":{"updatedTime":1703745117000},"filePathRelative":"zh/features/README.md"}');export{e as data}; diff --git a/assets/index.html-b1951828.js b/assets/index.html-b1951828.js new file mode 100644 index 00000000000..4a11a4cdcc0 --- /dev/null +++ b/assets/index.html-b1951828.js @@ -0,0 +1,13 @@ +import{_ as l,r as n,o as s,c as i,a as e,b as a,d as o,e as r}from"./app-3d1677bf.js";const c={},p=e("hr",null,null,-1),h=e("h3",{id:"通过-docker-快速使用",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#通过-docker-快速使用","aria-hidden":"true"},"#"),a(" 通过 Docker 快速使用")],-1),d={href:"https://hub.docker.com/r/polardb/polardb_pg_local_instance/tags",target:"_blank",rel:"noopener noreferrer"},u=r(`
# pull the PolarDB-PG image
+docker pull polardb/polardb_pg_local_instance
+# create and run the container
+docker run -it --rm polardb/polardb_pg_local_instance psql
+# verify availability
+postgres=# SELECT version();
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+
`,2);function f(m,v){const t=n("ExternalLinkIcon");return s(),i("div",null,[p,h,e("p",null,[a("从 DockerHub 上拉取 PolarDB for PostgreSQL 的 "),e("a",d,[a("本地存储实例镜像"),o(t)]),a(",创建并运行容器,然后直接试用 PolarDB-PG:")]),u])}const k=l(c,[["render",f],["__file","index.html.vue"]]);export{k as default}; diff --git a/assets/index.html-c24c33bc.js b/assets/index.html-c24c33bc.js new file mode 100644 index 00000000000..15236463384 --- /dev/null +++ b/assets/index.html-c24c33bc.js @@ -0,0 +1 @@ +import{_ as i,r as s,o as r,c as a,a as t,d as e,w as d,b as n}from"./app-3d1677bf.js";const c={},_=t("h1",{id:"自研功能",tabindex:"-1"},[t("a",{class:"header-anchor",href:"#自研功能","aria-hidden":"true"},"#"),n(" 自研功能")],-1),h=t("h2",{id:"功能-版本映射矩阵",tabindex:"-1"},[t("a",{class:"header-anchor",href:"#功能-版本映射矩阵","aria-hidden":"true"},"#"),n(" 功能 / 版本映射矩阵")],-1),p=t("thead",null,[t("tr",null,[t("th",null,"功能 / 版本"),t("th",{style:{"text-align":"center"}},"PostgreSQL"),t("th",{style:{"text-align":"center"}},"PolarDB for PostgreSQL 11")])],-1),x=t("tr",null,[t("td",null,[t("strong",null,"高性能")]),t("td",{style:{"text-align":"center"}},"..."),t("td",{style:{"text-align":"center"}},[t("a",{href:"./v11/performance/"},"...")])],-1),v=t("td",null,"预读 / 预扩展",-1),y=t("td",{style:{"text-align":"center"}},"/",-1),u={style:{"text-align":"center"}},g={href:"./v11/performance/bulk-read-and-extend.html"},f=t("td",null,"表大小缓存",-1),m=t("td",{style:{"text-align":"center"}},"/",-1),V={style:{"text-align":"center"}},b={href:"./v11/performance/rel-size-cache.html"},q=t("td",null,"Shared Server",-1),P=t("td",{style:{"text-align":"center"}},"/",-1),Q={style:{"text-align":"center"}},B={href:"./v11/performance/shared-server.html"},k=t("tr",null,[t("td",null,[t("strong",null,"高可用")]),t("td",{style:{"text-align":"center"}},"..."),t("td",{style:{"text-align":"center"}},[t("a",{href:"./v11/availability/"},"...")])],-1),L=t("td",null,"只读节点 Online 
Promote",-1),S=t("td",{style:{"text-align":"center"}},"/",-1),N={style:{"text-align":"center"}},R={href:"./v11/availability/avail-online-promote.html"},w=t("td",null,"WAL 日志并行回放",-1),z=t("td",{style:{"text-align":"center"}},"/",-1),D={style:{"text-align":"center"}},T={href:"./v11/availability/avail-parallel-replay.html"},C=t("td",null,"DataMax 日志节点",-1),E=t("td",{style:{"text-align":"center"}},"/",-1),M={style:{"text-align":"center"}},A={href:"./v11/availability/datamax.html"},I=t("td",null,"Resource Manager",-1),O=t("td",{style:{"text-align":"center"}},"/",-1),W={style:{"text-align":"center"}},j={href:"./v11/availability/resource-manager.html"},F=t("td",null,"闪回表和闪回日志",-1),G=t("td",{style:{"text-align":"center"}},"/",-1),H={style:{"text-align":"center"}},J={href:"./v11/availability/flashback-table.html"},K=t("tr",null,[t("td",null,[t("strong",null,"安全")]),t("td",{style:{"text-align":"center"}},"..."),t("td",{style:{"text-align":"center"}},[t("a",{href:"./v11/security/"},"...")])],-1),U=t("td",null,"透明数据加密",-1),X=t("td",{style:{"text-align":"center"}},"/",-1),Y={style:{"text-align":"center"}},Z={href:"./v11/security/tde.html"},$=t("tr",null,[t("td",null,[t("strong",null,"弹性跨机并行查询(ePQ)")]),t("td",{style:{"text-align":"center"}},"..."),t("td",{style:{"text-align":"center"}},[t("a",{href:"./v11/epq/"},"...")])],-1),tt=t("td",null,"ePQ 执行计划查看与分析",-1),et=t("td",{style:{"text-align":"center"}},"/",-1),lt={style:{"text-align":"center"}},nt={href:"./v11/epq/epq-explain-analyze.html"},st=t("td",null,"ePQ 计算节点范围选择与并行度控制",-1),ot=t("td",{style:{"text-align":"center"}},"/",-1),it={style:{"text-align":"center"}},rt={href:"./v11/epq/epq-node-and-dop.html"},at=t("td",null,"ePQ 支持分区表查询",-1),dt=t("td",{style:{"text-align":"center"}},"/",-1),ct={style:{"text-align":"center"}},_t={href:"./v11/epq/epq-partitioned-table.html"},ht=t("td",null,"ePQ 支持创建 B-Tree 
索引并行加速",-1),pt=t("td",{style:{"text-align":"center"}},"/",-1),xt={style:{"text-align":"center"}},vt={href:"./v11/epq/epq-create-btree-index.html"},yt=t("td",null,"集群拓扑视图",-1),ut=t("td",{style:{"text-align":"center"}},"/",-1),gt={style:{"text-align":"center"}},ft={href:"./v11/epq/cluster-info.html"},mt=t("td",null,"自适应扫描",-1),Vt=t("td",{style:{"text-align":"center"}},"/",-1),bt={style:{"text-align":"center"}},qt={href:"./v11/epq/adaptive-scan.html"},Pt=t("td",null,"并行 INSERT",-1),Qt=t("td",{style:{"text-align":"center"}},"/",-1),Bt={style:{"text-align":"center"}},kt={href:"./v11/epq/parallel-dml.html"},Lt=t("td",null,"ePQ 支持创建/刷新物化视图并行加速和批量写入",-1),St=t("td",{style:{"text-align":"center"}},"/",-1),Nt={style:{"text-align":"center"}},Rt={href:"./v11/epq/epq-ctas-mtview-bulk-insert.html"},wt=t("tr",null,[t("td",null,[t("strong",null,"第三方插件")]),t("td",{style:{"text-align":"center"}},"..."),t("td",{style:{"text-align":"center"}},[t("a",{href:"./v11/extensions/"},"...")])],-1),zt=t("td",null,"pgvector",-1),Dt=t("td",{style:{"text-align":"center"}},"/",-1),Tt={style:{"text-align":"center"}},Ct={href:"./v11/extensions/pgvector.html"},Et=t("td",null,"smlar",-1),Mt=t("td",{style:{"text-align":"center"}},"/",-1),At={style:{"text-align":"center"}},It={href:"./v11/extensions/smlar.html"};function Ot(Wt,jt){const o=s("RouterLink"),l=s("Badge");return r(),a("div",null,[_,t("ul",null,[t("li",null,[e(o,{to:"/zh/features/v11/"},{default:d(()=>[n("PolarDB for PostgreSQL 11")]),_:1})])]),h,t("table",null,[p,t("tbody",null,[x,t("tr",null,[v,y,t("td",u,[t("a",g,[e(l,{type:"tip",text:"V11 / v1.1.1-",vertical:"top"})])])]),t("tr",null,[f,m,t("td",V,[t("a",b,[e(l,{type:"tip",text:"V11 / v1.1.10-",vertical:"top"})])])]),t("tr",null,[q,P,t("td",Q,[t("a",B,[e(l,{type:"tip",text:"V11 / v1.1.30-",vertical:"top"})])])]),k,t("tr",null,[L,S,t("td",N,[t("a",R,[e(l,{type:"tip",text:"V11 / v1.1.1-",vertical:"top"})])])]),t("tr",null,[w,z,t("td",D,[t("a",T,[e(l,{type:"tip",text:"V11 / 
v1.1.17-",vertical:"top"})])])]),t("tr",null,[C,E,t("td",M,[t("a",A,[e(l,{type:"tip",text:"V11 / v1.1.6-",vertical:"top"})])])]),t("tr",null,[I,O,t("td",W,[t("a",j,[e(l,{type:"tip",text:"V11 / v1.1.1-",vertical:"top"})])])]),t("tr",null,[F,G,t("td",H,[t("a",J,[e(l,{type:"tip",text:"V11 / v1.1.22-",vertical:"top"})])])]),K,t("tr",null,[U,X,t("td",Y,[t("a",Z,[e(l,{type:"tip",text:"V11 / v1.1.1-",vertical:"top"})])])]),$,t("tr",null,[tt,et,t("td",lt,[t("a",nt,[e(l,{type:"tip",text:"V11 / v1.1.22-",vertical:"top"})])])]),t("tr",null,[st,ot,t("td",it,[t("a",rt,[e(l,{type:"tip",text:"V11 / v1.1.20-",vertical:"top"})])])]),t("tr",null,[at,dt,t("td",ct,[t("a",_t,[e(l,{type:"tip",text:"V11 / v1.1.17-",vertical:"top"})])])]),t("tr",null,[ht,pt,t("td",xt,[t("a",vt,[e(l,{type:"tip",text:"V11 / v1.1.15-",vertical:"top"})])])]),t("tr",null,[yt,ut,t("td",gt,[t("a",ft,[e(l,{type:"tip",text:"V11 / v1.1.20-",vertical:"top"})])])]),t("tr",null,[mt,Vt,t("td",bt,[t("a",qt,[e(l,{type:"tip",text:"V11 / v1.1.17-",vertical:"top"})])])]),t("tr",null,[Pt,Qt,t("td",Bt,[t("a",kt,[e(l,{type:"tip",text:"V11 / v1.1.17-",vertical:"top"})])])]),t("tr",null,[Lt,St,t("td",Nt,[t("a",Rt,[e(l,{type:"tip",text:"V11 / v1.1.30-",vertical:"top"})])])]),wt,t("tr",null,[zt,Dt,t("td",Tt,[t("a",Ct,[e(l,{type:"tip",text:"V11 / v1.1.35-",vertical:"top"})])])]),t("tr",null,[Et,Mt,t("td",At,[t("a",It,[e(l,{type:"tip",text:"V11 / v1.1.35-",vertical:"top"})])])])])])])}const Gt=i(c,[["render",Ot],["__file","index.html.vue"]]);export{Gt as default}; diff --git a/assets/index.html-c2968b1e.js b/assets/index.html-c2968b1e.js new file mode 100644 index 00000000000..35db882e101 --- /dev/null +++ b/assets/index.html-c2968b1e.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-8daa1a0e","path":"/","title":"Documentation","lang":"en-US","frontmatter":{"home":true,"title":"Documentation","heroImage":"/images/polardb.png","footer":"Apache 2.0 Licensed | Copyright © Alibaba Group, Inc."},"headers":[{"level":3,"title":"Quick Start 
with Docker","slug":"quick-start-with-docker","link":"#quick-start-with-docker","children":[]}],"git":{"updatedTime":1690894847000},"filePathRelative":"README.md"}');export{e as data}; diff --git a/assets/index.html-c8d3a2fb.js b/assets/index.html-c8d3a2fb.js new file mode 100644 index 00000000000..7741de626a0 --- /dev/null +++ b/assets/index.html-c8d3a2fb.js @@ -0,0 +1 @@ +import{_ as c,r as a,o as s,c as d,a as e,d as o,w as l,b as t}from"./app-3d1677bf.js";const i={},_=e("h1",{id:"安全",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#安全","aria-hidden":"true"},"#"),t(" 安全")],-1);function u(h,f){const n=a("RouterLink"),r=a("Badge");return s(),d("div",null,[_,e("ul",null,[e("li",null,[o(n,{to:"/zh/features/v11/security/tde.html"},{default:l(()=>[t("TDE 透明数据加密")]),_:1}),t(),o(r,{type:"tip",text:"V11 / v1.1.1-",vertical:"top"})])])])}const p=c(i,[["render",u],["__file","index.html.vue"]]);export{p as default}; diff --git a/assets/index.html-cd3aa341.js b/assets/index.html-cd3aa341.js new file mode 100644 index 00000000000..49e65b1a5ad --- /dev/null +++ b/assets/index.html-cd3aa341.js @@ -0,0 +1 @@ +import{_ as e,o as r,c as a,e as o}from"./app-3d1677bf.js";const s={},n=o('

Roadmap

Alibaba Cloud continuously releases updates to PolarDB PostgreSQL (hereafter PolarDB) to improve the user experience. The following versions are currently planned:

Version 1.0

Version 1.0 supports shared storage and compute-storage separation. This version provides the minimum set of features such as Polar virtual file system (PolarVFS), flushing and buffer management, LogIndex, and SyncDDL.

  • PolarVFS: A VFS is abstracted from the database engine. This way, the database engine can connect to all types of storage, and you do not need to consider whether the storage uses buffered I/O or direct I/O.
  • Flushing and buffer management: Instead of each compute node having its own copy of the data, all compute nodes in a PolarDB cluster share one copy of the physical storage. The primary node must therefore coordinate when it flushes dirty pages, so that read-only nodes do not read future pages, that is, pages that are newer than the write-ahead log (WAL) records the read-only nodes have replayed.
  • LogIndex: The read-only nodes cannot flush dirty pages. When a read-only node needs a specific version of a page, it reads an earlier version of the page from the shared storage and then replays the WAL records of that page in memory to obtain the required version. LogIndex records the metadata of the WAL records that apply to each page, so a read-only node can look up the WAL records of a page directly, which speeds up replay.
  • SyncDDL: PolarDB supports compute-storage separation. When the primary node runs DDL operations, it considers the objects, such as relations, that are referenced by the read-only nodes. The locks that are held by the DDL operations are synchronized from the primary node to the read-only nodes.
  • db-monitor: The db-monitor module monitors the host on which your PolarDB cluster runs. The db-monitor module also monitors the databases that you create in your PolarDB cluster. The monitoring data provides a basis for switchovers and helps ensure high availability.
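
The LogIndex idea above can be sketched as a mapping from a page to the LSNs of the WAL records that touch it. This is a minimal in-memory illustration, not PolarDB code: the class LogIndexSketch, its methods, and the modelling of a WAL record as a function applied to page contents are all assumptions made for the example.

```python
from bisect import insort
from collections import defaultdict


class LogIndexSketch:
    """Toy page -> WAL record index (hypothetical, for illustration only)."""

    def __init__(self):
        # page_id -> sorted list of LSNs of WAL records that modify the page
        self.index = defaultdict(list)
        # lsn -> redo function: takes old page contents, returns new contents
        self.wal = {}

    def add_record(self, lsn, page_id, redo):
        """Register the metadata of one WAL record."""
        self.wal[lsn] = redo
        insort(self.index[page_id], lsn)

    def read_page(self, page_id, stored_page, stored_lsn, target_lsn):
        """Replay only the records in (stored_lsn, target_lsn] for this page."""
        page = stored_page
        for lsn in self.index[page_id]:
            if stored_lsn < lsn <= target_lsn:
                page = self.wal[lsn](page)
        return page


log_index = LogIndexSketch()
log_index.add_record(10, "page-1", lambda p: p + ["a"])
log_index.add_record(20, "page-2", lambda p: p + ["x"])
log_index.add_record(30, "page-1", lambda p: p + ["b"])

# Shared storage holds page-1 as of LSN 10; the reader wants it as of LSN 30.
print(log_index.read_page("page-1", ["a"], 10, 30))
```

The lookup only visits the records of the requested page, which is the point of keeping a per-page index instead of scanning the whole WAL.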

Version 2.0

In addition to improvements to compute-storage separation, version 2.0 provides a significantly improved optimizer.

  • UniqueKey: Similar to the way the ordering of plan node output is tracked, UniqueKey tracks whether the data produced by a plan node is unique. Knowing that data is unique lets the optimizer drop unnecessary DISTINCT and GROUP BY operations and improves ordering checks on join results.

Version 3.0

The availability of PolarDB with compute-storage separation is significantly improved.

  • Parallel replay: LogIndex enables PolarDB to replay WAL records lazily. In lazy replay mode, a read-only node only marks which WAL records apply to each updated page, and reads and replays those records when the page is actually read. Lazy replay can therefore slow down reads. Version 3.0 adds parallel replay on top of lazy replay to speed up page replay.
  • OnlinePromote: If the primary node unexpectedly exits, your workloads can be switched over to a read-only node. The read-only node does not need to restart. The read-only node is promoted to run as the new primary node immediately after it replays all WAL records in parallel. This significantly reduces downtime.
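
One way to picture parallel replay: because each page has its own list of WAL records, different pages can be replayed by different workers while the records of a single page are still applied in LSN order. The sketch below uses a Python thread pool purely as an illustration. The function names and the data layout are hypothetical and do not reflect the PolarDB implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def replay_page(old_page, records):
    """Apply the WAL records of one page in LSN order."""
    page = list(old_page)
    for _lsn, value in sorted(records):
        page.append(value)  # stand-in for a real redo routine
    return page


def parallel_replay(pages, records_by_page, workers=4):
    """Replay every page concurrently; pages are independent of each other."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            page_id: pool.submit(replay_page, old, records_by_page.get(page_id, []))
            for page_id, old in pages.items()
        }
        return {page_id: fut.result() for page_id, fut in futures.items()}


pages = {"page-1": ["a"], "page-2": []}
records_by_page = {
    "page-1": [(30, "c"), (20, "b")],
    "page-2": [(25, "x")],
}
replayed = parallel_replay(pages, records_by_page)
print(replayed)
```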

Version 4.0

Version 4.0 can meet your growing business requirements in hybrid transaction/analytical processing (HTAP) scenarios. Version 4.0 is based on the shared storage-based massively parallel processing (MPP) architecture, which allows PolarDB to fully utilize the CPU, memory, and I/O resources of multiple read-only nodes.

Test results show that the performance of a PolarDB cluster linearly increases as you increase the number of cores from 1 to 256.

Version 5.0

In earlier versions, each PolarDB cluster consists of one primary node that processes both read and write requests and one or more read-only nodes that process only read requests. You can increase the read capacity of a cluster by adding read-only nodes, but you cannot increase its write capacity, because all writes go through the single primary node.

Version 5.0 uses the shared-nothing architecture together with the shared-everything architecture. This allows multiple compute nodes to process write requests.

',17),t=[n];function i(d,l){return r(),a("div",null,t)}const c=e(s,[["render",i],["__file","index.html.vue"]]);export{c as default}; diff --git a/assets/index.html-d6e90735.js b/assets/index.html-d6e90735.js new file mode 100644 index 00000000000..d7c43cd6d20 --- /dev/null +++ b/assets/index.html-d6e90735.js @@ -0,0 +1 @@ +import{_ as o,r,o as s,c as u,a as e,d as l,w as n,b as t}from"./app-3d1677bf.js";const i={},c=e("h1",{id:"自研功能",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#自研功能","aria-hidden":"true"},"#"),t(" 自研功能")],-1);function d(_,f){const a=r("RouterLink");return s(),u("div",null,[c,e("ul",null,[e("li",null,[l(a,{to:"/zh/features/v11/performance/"},{default:n(()=>[t("高性能")]),_:1})]),e("li",null,[l(a,{to:"/zh/features/v11/availability/"},{default:n(()=>[t("高可用")]),_:1})]),e("li",null,[l(a,{to:"/zh/features/v11/security/"},{default:n(()=>[t("安全")]),_:1})]),e("li",null,[l(a,{to:"/zh/features/v11/epq/"},{default:n(()=>[t("弹性跨机并行查询(ePQ)")]),_:1})]),e("li",null,[l(a,{to:"/zh/features/v11/extensions/"},{default:n(()=>[t("第三方插件")]),_:1})])])])}const m=o(i,[["render",d],["__file","index.html.vue"]]);export{m as default}; diff --git a/assets/index.html-de3342c7.js b/assets/index.html-de3342c7.js new file mode 100644 index 00000000000..98925e1ec74 --- /dev/null +++ b/assets/index.html-de3342c7.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-9d84b310","path":"/zh/features/v11/extensions/","title":"第三方插件","lang":"zh-CN","frontmatter":{},"headers":[],"git":{"updatedTime":1703745117000},"filePathRelative":"zh/features/v11/extensions/README.md"}');export{e as data}; diff --git a/assets/index.html-e2ca5e7d.js b/assets/index.html-e2ca5e7d.js new file mode 100644 index 00000000000..26d57bb1278 --- /dev/null +++ b/assets/index.html-e2ca5e7d.js @@ -0,0 +1 @@ +const 
e=JSON.parse('{"key":"v-62087a8c","path":"/zh/features/v11/epq/","title":"弹性跨机并行查询(ePQ)","lang":"zh-CN","frontmatter":{},"headers":[],"git":{"updatedTime":1697908247000},"filePathRelative":"zh/features/v11/epq/README.md"}');export{e as data}; diff --git a/assets/index.html-ebe7d04c.js b/assets/index.html-ebe7d04c.js new file mode 100644 index 00000000000..53b88c44a2d --- /dev/null +++ b/assets/index.html-ebe7d04c.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-7b6b229b","path":"/zh/roadmap/","title":"版本规划","lang":"zh-CN","frontmatter":{},"headers":[{"level":2,"title":"PolarDB PostgreSQL 1.0 版本","slug":"polardb-postgresql-1-0-版本","link":"#polardb-postgresql-1-0-版本","children":[]},{"level":2,"title":"PolarDB PostgreSQL 2.0 版本","slug":"polardb-postgresql-2-0-版本","link":"#polardb-postgresql-2-0-版本","children":[]},{"level":2,"title":"PolarDB PostgreSQL 3.0 版本","slug":"polardb-postgresql-3-0-版本","link":"#polardb-postgresql-3-0-版本","children":[]},{"level":2,"title":"PolarDB PostgreSQL 4.0 版本","slug":"polardb-postgresql-4-0-版本","link":"#polardb-postgresql-4-0-版本","children":[]},{"level":2,"title":"PolarDB PostgreSQL 5.0 版本","slug":"polardb-postgresql-5-0-版本","link":"#polardb-postgresql-5-0-版本","children":[]}],"git":{"updatedTime":1675309212000},"filePathRelative":"zh/roadmap/README.md"}');export{l as data}; diff --git a/assets/index.html-ed802d58.js b/assets/index.html-ed802d58.js new file mode 100644 index 00000000000..fefe8ae0580 --- /dev/null +++ b/assets/index.html-ed802d58.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-ba4b3c7c","path":"/zh/features/v11/performance/","title":"高性能","lang":"zh-CN","frontmatter":{},"headers":[],"git":{"updatedTime":1693374263000},"filePathRelative":"zh/features/v11/performance/README.md"}');export{e as data}; diff --git a/assets/index.html-efbb7ed1.js b/assets/index.html-efbb7ed1.js new file mode 100644 index 00000000000..7db192cca74 --- /dev/null +++ b/assets/index.html-efbb7ed1.js @@ -0,0 +1 @@ +const 
e=JSON.parse('{"key":"v-010157e8","path":"/zh/features/v11/security/","title":"安全","lang":"zh-CN","frontmatter":{},"headers":[],"git":{"updatedTime":1672148725000},"filePathRelative":"zh/features/v11/security/README.md"}');export{e as data}; diff --git a/assets/index.html-f4a2cc54.js b/assets/index.html-f4a2cc54.js new file mode 100644 index 00000000000..23567d00e3e --- /dev/null +++ b/assets/index.html-f4a2cc54.js @@ -0,0 +1 @@ +const a=JSON.parse('{"key":"v-6024a2d1","path":"/zh/features/v11/availability/","title":"高可用","lang":"zh-CN","frontmatter":{},"headers":[],"git":{"updatedTime":1697908247000},"filePathRelative":"zh/features/v11/availability/README.md"}');export{a as data}; diff --git a/assets/index.html-f767cea4.js b/assets/index.html-f767cea4.js new file mode 100644 index 00000000000..d2a79d95076 --- /dev/null +++ b/assets/index.html-f767cea4.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-2d0ad528","path":"/zh/","title":"文档","lang":"zh-CN","frontmatter":{"home":true,"title":"文档","heroImage":"/images/polardb.png","footer":"Apache 2.0 Licensed | Copyright © Alibaba Group, Inc."},"headers":[{"level":3,"title":"通过 Docker 快速使用","slug":"通过-docker-快速使用","link":"#通过-docker-快速使用","children":[]}],"git":{"updatedTime":1703745117000},"filePathRelative":"zh/README.md"}');export{e as data}; diff --git a/assets/index.html-f9a07053.js b/assets/index.html-f9a07053.js new file mode 100644 index 00000000000..fd3208c5948 --- /dev/null +++ b/assets/index.html-f9a07053.js @@ -0,0 +1 @@ +import{_ as l,r,o as c,c as i,a as t,d as o,w as s,b as e}from"./app-3d1677bf.js";const d={},_=t("h1",{id:"第三方插件",tabindex:"-1"},[t("a",{class:"header-anchor",href:"#第三方插件","aria-hidden":"true"},"#"),e(" 第三方插件")],-1);function u(p,h){const a=r("RouterLink"),n=r("Badge");return c(),i("div",null,[_,t("ul",null,[t("li",null,[o(a,{to:"/zh/features/v11/extensions/pgvector.html"},{default:s(()=>[e("pgvector")]),_:1}),e(),o(n,{type:"tip",text:"V11 / 
v1.1.35-",vertical:"top"})]),t("li",null,[o(a,{to:"/zh/features/v11/extensions/smlar.html"},{default:s(()=>[e("smlar")]),_:1}),e(),o(n,{type:"tip",text:"V11 / v1.1.28-",vertical:"top"})])])])}const f=l(d,[["render",u],["__file","index.html.vue"]]);export{f as default}; diff --git a/assets/introduction.html-1d0705b0.js b/assets/introduction.html-1d0705b0.js new file mode 100644 index 00000000000..85dee50998d --- /dev/null +++ b/assets/introduction.html-1d0705b0.js @@ -0,0 +1 @@ +const t=JSON.parse('{"key":"v-12a5021c","path":"/deploying/introduction.html","title":"架构简介","lang":"en-US","frontmatter":{"author":"棠羽","date":"2022/05/09","minute":5},"headers":[],"git":{"updatedTime":1675307501000},"filePathRelative":"deploying/introduction.md"}');export{t as data}; diff --git a/assets/introduction.html-5114e518.js b/assets/introduction.html-5114e518.js new file mode 100644 index 00000000000..a512ab58df7 --- /dev/null +++ b/assets/introduction.html-5114e518.js @@ -0,0 +1 @@ +const t=JSON.parse('{"key":"v-635e913a","path":"/zh/deploying/introduction.html","title":"架构简介","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/05/09","minute":5},"headers":[],"git":{"updatedTime":1661950750000},"filePathRelative":"zh/deploying/introduction.md"}');export{t as data}; diff --git a/assets/introduction.html-606b1a82.js b/assets/introduction.html-606b1a82.js new file mode 100644 index 00000000000..c8b5c07c02d --- /dev/null +++ b/assets/introduction.html-606b1a82.js @@ -0,0 +1 @@ +import{_ as l,r as s,o as c,c as i,d as t,a as o,b as e}from"./app-3d1677bf.js";const f="/PolarDB-for-PostgreSQL/assets/software-level-5e0933bc.png",_={},d=o("h1",{id:"架构简介",tabindex:"-1"},[o("a",{class:"header-anchor",href:"#架构简介","aria-hidden":"true"},"#"),e(" 架构简介")],-1),h=o("p",null,"PolarDB for PostgreSQL 采用了基于 Shared-Storage 的存储计算分离架构。数据库由传统的 Shared-Nothing 架构,转变成了 Shared-Storage 架构——由原来的 N 份计算 + N 份存储,转变成了 N 份计算 + 1 份存储;而 PostgreSQL 
使用了传统的单体数据库架构,存储和计算耦合在一起。",-1),p=o("p",null,[o("img",{src:f,alt:"software-level"})],-1),u={href:"https://github.com/ApsaraDB/PolarDB-FileSystem",target:"_blank",rel:"noopener noreferrer"},m=o("sup",{class:"footnote-ref"},[o("a",{href:"#footnote1"},"[1]"),o("a",{class:"footnote-anchor",id:"footnote-ref1"})],-1),g=o("hr",{class:"footnotes-sep"},null,-1),S={class:"footnotes"},P={class:"footnotes-list"},b={id:"footnote1",class:"footnote-item"},B={href:"https://www.vldb.org/pvldb/vol11/p1849-cao.pdf",target:"_blank",rel:"noopener noreferrer"},v=o("a",{href:"#footnote-ref1",class:"footnote-backref"},"↩︎",-1);function k(n,x){const a=s("ArticleInfo"),r=s("ExternalLinkIcon");return c(),i("div",null,[d,t(a,{frontmatter:n.$frontmatter},null,8,["frontmatter"]),h,p,o("p",null,[e("为保证所有计算节点能够以相同的可见性视角访问分布式块存储设备,PolarDB 需要使用分布式文件系统 "),o("a",u,[e("PolarDB File System(PFS)"),t(r)]),e(" 来访问块设备,其实现原理可参考发表在 2018 年 VLDB 上的论文"),m,e(";如果所有计算节点都可以本地访问同一个块存储设备,那么也可以不使用 PFS,直接使用本地的单机文件系统(如 ext4)。这是与 PostgreSQL 的不同点之一。")]),g,o("section",S,[o("ol",P,[o("li",b,[o("p",null,[o("a",B,[e("PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database"),t(r)]),e(),v])])])])])}const L=l(_,[["render",k],["__file","introduction.html.vue"]]);export{L as default}; diff --git a/assets/introduction.html-db3ff455.js b/assets/introduction.html-db3ff455.js new file mode 100644 index 00000000000..f8831731a56 --- /dev/null +++ b/assets/introduction.html-db3ff455.js @@ -0,0 +1 @@ +import{_ as l,r as s,o as c,c as i,d as t,a as o,b as e}from"./app-3d1677bf.js";const f="/PolarDB-for-PostgreSQL/assets/software-level-5e0933bc.png",_={},d=o("h1",{id:"架构简介",tabindex:"-1"},[o("a",{class:"header-anchor",href:"#架构简介","aria-hidden":"true"},"#"),e(" 架构简介")],-1),h=o("p",null,"PolarDB for PostgreSQL 采用了基于 Shared-Storage 的存储计算分离架构。数据库由传统的 Share-Nothing 架构,转变成了 Shared-Storage 架构——由原来的 N 份计算 + N 份存储,转变成了 N 份计算 + 1 份存储;而 PostgreSQL 
使用了传统的单体数据库架构,存储和计算耦合在一起。",-1),p=o("p",null,[o("img",{src:f,alt:"software-level"})],-1),u={href:"https://github.com/ApsaraDB/PolarDB-FileSystem",target:"_blank",rel:"noopener noreferrer"},m=o("sup",{class:"footnote-ref"},[o("a",{href:"#footnote1"},"[1]"),o("a",{class:"footnote-anchor",id:"footnote-ref1"})],-1),g=o("hr",{class:"footnotes-sep"},null,-1),S={class:"footnotes"},P={class:"footnotes-list"},b={id:"footnote1",class:"footnote-item"},B={href:"https://www.vldb.org/pvldb/vol11/p1849-cao.pdf",target:"_blank",rel:"noopener noreferrer"},v=o("a",{href:"#footnote-ref1",class:"footnote-backref"},"↩︎",-1);function k(n,x){const a=s("ArticleInfo"),r=s("ExternalLinkIcon");return c(),i("div",null,[d,t(a,{frontmatter:n.$frontmatter},null,8,["frontmatter"]),h,p,o("p",null,[e("为保证所有计算节点能够以相同的可见性视角访问分布式块存储设备,PolarDB 需要使用分布式文件系统 "),o("a",u,[e("PolarDB File System(PFS)"),t(r)]),e(" 来访问块设备,其实现原理可参考发表在 2018 年 VLDB 上的论文"),m,e(";如果所有计算节点都可以本地访问同一个块存储设备,那么也可以不使用 PFS,直接使用本地的单机文件系统(如 ext4)。这是与 PostgreSQL 的不同点之一。")]),g,o("section",S,[o("ol",P,[o("li",b,[o("p",null,[o("a",B,[e("PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database"),t(r)]),e(),v])])])])])}const L=l(_,[["render",k],["__file","introduction.html.vue"]]);export{L as default}; diff --git a/assets/logindex.html-1973076c.js b/assets/logindex.html-1973076c.js new file mode 100644 index 00000000000..02130aba477 --- /dev/null +++ b/assets/logindex.html-1973076c.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-65697b4c","path":"/theory/logindex.html","title":"LogIndex","lang":"en-US","frontmatter":{},"headers":[{"level":2,"title":"Background Information","slug":"background-information","link":"#background-information","children":[]},{"level":2,"title":"Memory Synchronization Architecture for RO","slug":"memory-synchronization-architecture-for-ro","link":"#memory-synchronization-architecture-for-ro","children":[]},{"level":2,"title":"WAL 
Meta","slug":"wal-meta","link":"#wal-meta","children":[]},{"level":2,"title":"LogIndex","slug":"logindex-1","link":"#logindex-1","children":[{"level":3,"title":"Memory data structure","slug":"memory-data-structure","link":"#memory-data-structure","children":[]},{"level":3,"title":"Data Structure on Disk","slug":"data-structure-on-disk","link":"#data-structure-on-disk","children":[]}]},{"level":2,"title":"Log replay","slug":"log-replay","link":"#log-replay","children":[{"level":3,"title":"Delayed replay","slug":"delayed-replay","link":"#delayed-replay","children":[]},{"level":3,"title":"Mini Transaction","slug":"mini-transaction","link":"#mini-transaction","children":[]}]},{"level":2,"title":"Summary","slug":"summary","link":"#summary","children":[]}],"git":{"updatedTime":1656919280000},"filePathRelative":"theory/logindex.md"}');export{e as data}; diff --git a/assets/logindex.html-2840dbbf.js b/assets/logindex.html-2840dbbf.js new file mode 100644 index 00000000000..68e94ce2cd3 --- /dev/null +++ b/assets/logindex.html-2840dbbf.js @@ -0,0 +1 @@ +import{_ as o,r as s,o as r,c as n,a as d,b as e,d as i,w as h,e as a}from"./app-3d1677bf.js";const 
l="/PolarDB-for-PostgreSQL/assets/49_LogIndex_1-e49fa6a7.png",c="/PolarDB-for-PostgreSQL/assets/50_LogIndex_2-2d85ed00.png",g="/PolarDB-for-PostgreSQL/assets/51_LogIndex_3-1c28dec4.png",p="/PolarDB-for-PostgreSQL/assets/52_LogIndex_4-50a08309.png",m="/PolarDB-for-PostgreSQL/assets/53_LogIndex_5-3a25393f.png",f="/PolarDB-for-PostgreSQL/assets/54_LogIndex_6-ea27fcdf.png",u="/PolarDB-for-PostgreSQL/assets/55_LogIndex_7-a84ed0dd.png",y="/PolarDB-for-PostgreSQL/assets/56_LogIndex_8-3f14f302.png",L="/PolarDB-for-PostgreSQL/assets/57_LogIndex_9-1fcc55d8.png",b="/PolarDB-for-PostgreSQL/assets/58_LogIndex_10-2eab9094.png",I="/PolarDB-for-PostgreSQL/assets/59_LogIndex_11-e0277c33.png",x="/PolarDB-for-PostgreSQL/assets/60_LogIndex_12-6b577085.png",T="/PolarDB-for-PostgreSQL/assets/61_LogIndex_13-4a2d72a8.png",_="/PolarDB-for-PostgreSQL/assets/62_LogIndex_14-c90cc6e7.png",w={},A=a('

LogIndex

Background Information

PolarDB uses a shared storage architecture with one writer and multiple readers. Each PolarDB cluster consists of a primary node and multiple read-only nodes that share the same storage. The primary node can read data from and write data to the shared storage. Read-only nodes can only read data from the shared storage and bring the pages they read up to date by replaying logs; they cannot write to the storage. In-memory data is synchronized from the primary node to the read-only nodes to keep the nodes consistent. Read-only nodes can also serve requests, which enables read/write splitting and load balancing. If the primary node crashes, a read-only node can be promoted to primary, which keeps the cluster highly available. The following figure shows the architecture of PolarDB.

image.png

In a conventional shared-nothing architecture, each read-only node has its own memory and storage. It only needs to receive write-ahead logging (WAL) logs from the primary node and replay them. As shown in the following figure, if a page to be replayed is not in the buffer pool, the page must first be read from the storage files into the buffer pool, which incurs the cost of a cache miss. Continuous replay also causes pages to be evicted from the buffer pool frequently.

image.png

Multiple transactions on the primary node can run in parallel, but read-only nodes must replay WAL logs serially, in the order in which the logs were generated. As a result, replay on read-only nodes is slow and their lag behind the primary node keeps growing.

image.png

In the shared storage architecture with one primary node and multiple read-only nodes, the read-only nodes can obtain the WAL logs to replay directly from the shared storage. If a data page on the shared storage is already up to date, a read-only node can read the page without replaying any logs for it. Based on this, PolarDB provides LogIndex to speed up WAL replay on read-only nodes.

Memory Synchronization Architecture for RO

LogIndex stores the mapping between a data page and the log sequence numbers (LSNs) of all updates to that page, so all LSNs that modified a given page can be looked up quickly. This makes it possible to delay the replay of the WAL logs for a page until the page is actually accessed. The following figure shows the architecture that is used to synchronize data from the primary node to read-only nodes.

image.png
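As a minimal sketch of the mapping that LogIndex maintains, the following Python model keeps, for each page, the ordered list of LSNs that modified it. The `PageTag` fields and the `LogIndex` class are hypothetical stand-ins for illustration only; the actual in-memory layout is described later in this topic.

```python
from collections import defaultdict
from typing import NamedTuple


class PageTag(NamedTuple):
    """Hypothetical page identifier; the field names are assumptions."""
    relation: int
    fork: int
    block: int


class LogIndex:
    """Toy index: PageTag -> list of LSNs that modified the page."""

    def __init__(self):
        self._lsns = defaultdict(list)

    def add(self, tag, lsn):
        # Records arrive in WAL order, so each list stays ascending.
        self._lsns[tag].append(lsn)

    def lsns_for(self, tag, after_lsn=0):
        """Return the LSNs for a page that are newer than after_lsn."""
        return [lsn for lsn in self._lsns.get(tag, []) if lsn > after_lsn]


index = LogIndex()
page = PageTag(relation=16384, fork=0, block=7)
index.add(page, 100)
index.add(PageTag(16384, 0, 8), 110)
index.add(page, 120)
print(index.lsns_for(page))                 # all updates to the page
print(index.lsns_for(page, after_lsn=100))  # only what is still missing
```

A reader that already holds the page at LSN 100 only needs the records returned by the second call, which is what makes replay on access cheap.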

Compared with the shared-nothing architecture, the workflow of the primary node and read-only nodes in the shared storage architecture has the following differences:

  • Complete WAL logs are not replicated from the primary node to read-only nodes. Only WAL log metadata is replicated to the read-only nodes. This reduces the amount of data transmitted on the network and the latency between the primary node and read-only nodes.
  • The primary node generates LogIndex records based on WAL log metadata and writes the records to the LogIndex Memory Table. When the LogIndex Memory Table is full, the data in the table is flushed to the disk and stored in a LogIndex Table on the shared storage. A flushed LogIndex Memory Table can be reused.
  • The primary node uses the LogIndex metadata file to ensure the atomicity of I/O operations on the LogIndex Memory Table. After data in the Memory Table is flushed to the disk, the LogIndex metadata file is updated. Bloom data is generated during the flush. Bloom data can be used to quickly check whether a specific page exists in a LogIndex Table, so LogIndex Tables that cannot contain the page are skipped during scans. This improves efficiency.
  • Read-only nodes receive WAL log metadata from the primary node. The read-only nodes then generate LogIndex records based on the WAL log metadata and write the records to the LogIndex Memory Table in their own memory. Pages that correspond to the WAL log metadata and already exist in buffer pools are marked as outdated. In this phase, the read-only nodes do not replay logs or perform data I/O, so the cost of cache misses is avoided.
  • After read-only nodes generate LogIndex records based on WAL log metadata, they can advance the replay position. The actual replay is handed over to the background replay processes and to the backend processes that access a page. In this way, the read-only nodes can replay WAL logs in parallel.
  • LogIndex Memory Tables generated by read-only nodes are not flushed to the disk. The read-only nodes use the LogIndex metadata file to determine whether a full LogIndex Memory Table has been flushed to the disk on the primary node. A LogIndex Memory Table that has been flushed can be reused. When the primary node determines that a LogIndex Table in the storage is no longer used, the LogIndex Table can be truncated.

PolarDB reduces the latency between the primary node and read-only nodes by replicating only WAL log metadata, and it uses LogIndex to delay the replay of WAL logs and to replay them in parallel, which speeds up replay on read-only nodes. The following sections describe both mechanisms.

WAL Meta

A WAL log record is also called an XLogRecord. Each XLogRecord consists of two parts, as shown in the following figure.

  • General header portion: This portion is the XLogRecord struct itself and has a fixed length. It stores general information about the record, such as its length, the ID of the transaction that generated it, and the type of its resource manager.
  • Data portion: This portion is divided into two parts: header and data. The header part contains 0 to N XLogRecordBlockHeader structs and 0 to 1 XLogRecordDataHeader[Short|Long] struct. The data part contains block data and main data. Each XLogRecordBlockHeader struct corresponds to one piece of block data in the data part. The XLogRecordDataHeader[Short|Long] struct corresponds to the main data in the data part.

wal meta.png

In shared storage mode, complete WAL logs do not need to be replicated from the primary node to read-only nodes. Only WAL log metadata is replicated to the read-only nodes. WAL log metadata consists of the general header portion, header part, and main data, as shown in the preceding figure. Read-only nodes can read complete WAL log content from the shared storage based on WAL log metadata. The following figure shows the process of replicating WAL log metadata from the primary node to read-only nodes.

wal meta trans.png

  1. When a transaction on the primary node modifies data, the corresponding WAL logs are generated and written to the WAL buffer, and the metadata of the WAL logs is copied to the WAL metadata queue in the memory.
  2. In synchronous streaming replication mode, before the transaction is committed, WAL logs in the WAL buffer are flushed to the disk and then the WalSender process is woken up.
  3. If the WalSender process finds new WAL logs that can be sent, the process reads the metadata of the logs from the metadata queue of WAL logs. After the metadata is read, the process sends the metadata to read-only nodes over the streaming replication connection that is established.
  4. After the WalReceiver processes on read-only nodes receive the metadata, the processes push the metadata to the metadata queue of WAL logs in the memory and notify the startup processes of the new metadata.
  5. The startup processes read the metadata from the metadata queue of WAL logs and parse the metadata into a LogIndex Memtable.

Streaming replication between the primary node and read-only nodes does not transmit payloads, which reduces the amount of data transmitted on the network. In addition, the WalSender process on the primary node obtains the metadata of WAL logs from the metadata queue in the memory, and the WalReceiver process on a read-only node stores the received metadata in its own in-memory metadata queue. Compared with conventional primary/secondary replication, this removes the disk I/O for sending and receiving logs. This increases the speed at which logs are transmitted and reduces the latency between the primary node and read-only nodes.
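The composition of WAL log metadata described above (general header portion + header part + main data, without block data) can be sketched as follows. This is an illustrative model only: the `WalRecord` and `BlockRef` classes, the `wal_meta` helper, and all byte sizes are assumptions chosen for the example, not the PostgreSQL or PolarDB definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlockRef:
    header: bytes      # stands in for one XLogRecordBlockHeader
    block_data: bytes  # per-block payload, not part of the metadata


@dataclass
class WalRecord:
    general_header: bytes                    # fixed-length general header
    blocks: List[BlockRef] = field(default_factory=list)
    main_data_header: bytes = b""            # XLogRecordDataHeader[Short|Long]
    main_data: bytes = b""


def wal_meta(record):
    """Keep general header + header part + main data; drop block data."""
    header_part = b"".join(b.header for b in record.blocks)
    header_part += record.main_data_header
    return record.general_header + header_part + record.main_data


def full_size(record):
    """Size of the complete record, including block data."""
    block_bytes = sum(len(b.block_data) for b in record.blocks)
    return len(wal_meta(record)) + block_bytes


# Illustrative sizes only.
record = WalRecord(
    general_header=b"H" * 24,
    blocks=[BlockRef(header=b"b" * 4, block_data=b"x" * 8192)],
    main_data_header=b"d" * 2,
    main_data=b"m" * 16,
)
meta = wal_meta(record)
print(len(meta), full_size(record))
```

In this toy record the block data dominates the record size, which is why sending only the metadata reduces network traffic; a read-only node would fetch the block data from the shared storage when it actually replays the page.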

LogIndex

Memory data structure

LogIndex is a HashTable structure. The key of this structure is PageTag, which identifies a specific data page. The value is the list of all LSNs generated for updates on the page. The following figure shows the memory data structure of LogIndex. A LogIndex Memtable contains a Memtable ID, the maximum and minimum LSNs stored in the Memtable, and the following three arrays:

  • HashTable: The HashTable array records the mapping between a page and the LSN list for updates on the page. Each member of the HashTable array points to a specific LogIndex Item in the Segment array.
  • Segment: Each member in the Segment array is a LogIndex Item. A LogIndex Item has two structures: Item Head and Item Seg, as shown in the following figure. Item Head is the head of the LSN linked list for a page. Item Seg is a subsequent node of the LSN linked list. PageTag in Item Head records the metadata of a single page. In Item Head, Next Seg points to the subsequent node and Tail Seg points to the tail node. Item Seg stores pointers to the previous node (Prev Seg) and the subsequent node (Next Seg). A complete LSN is formed by combining a Suffix LSN stored in Item Head or Item Seg with the Prefix LSN stored in the LogIndex Memtable. This way, the Prefix LSN is not stored repeatedly and storage space is not wasted. When different PageTag values hash to the same slot of the HashTable array, Next Item in Item Head points to the next page that has the same hash value. This resolves hash collisions.
  • Index Order: The Index Order array records the order in which LogIndex records are added to a LogIndex Memtable. Each member in the array occupies 2 bytes. The last 12 bits of each member correspond to a subscript of the Segment array and point to a specific LogIndex Item. The first four bits correspond to a subscript of the Suffix LSN array in the LogIndex Item and point to a specific Suffix LSN. The Index Order array can be used to obtain all LSNs that are inserted into a LogIndex Memtable and obtain the mapping between an LSN and all modified pages for which the LSN is generated.

logindex.png
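The 2-byte Index Order entry and the Prefix/Suffix LSN split lend themselves to a small bit-manipulation sketch. Two assumptions are made here and are not confirmed by the text: "the last 12 bits" is read as the low-order 12 bits of the entry, and the LSN is assumed to be a 64-bit value split into a 32-bit prefix and a 32-bit suffix. The function names are hypothetical.

```python
SEG_BITS = 12
SEG_MASK = (1 << SEG_BITS) - 1   # 0x0FFF
SUFFIX_IDX_MAX = 0xF             # 4 bits

# Assumption: 64-bit LSN split into 32-bit prefix and 32-bit suffix.
LSN_SUFFIX_BITS = 32
LSN_SUFFIX_MASK = (1 << LSN_SUFFIX_BITS) - 1


def pack_index_order(seg_index, suffix_index):
    """Pack a Segment subscript and a Suffix LSN subscript into 16 bits."""
    if not 0 <= seg_index <= SEG_MASK:
        raise ValueError("segment index needs at most 12 bits")
    if not 0 <= suffix_index <= SUFFIX_IDX_MAX:
        raise ValueError("suffix index needs at most 4 bits")
    return (suffix_index << SEG_BITS) | seg_index


def unpack_index_order(entry):
    """Return (segment index, suffix index) from a 16-bit entry."""
    return entry & SEG_MASK, entry >> SEG_BITS


def split_lsn(lsn):
    """Split an LSN into the shared prefix and the per-entry suffix."""
    return lsn >> LSN_SUFFIX_BITS, lsn & LSN_SUFFIX_MASK


def join_lsn(prefix, suffix):
    """Rebuild the complete LSN from prefix and suffix."""
    return (prefix << LSN_SUFFIX_BITS) | suffix


entry = pack_index_order(seg_index=0x123, suffix_index=0x5)
print(hex(entry), unpack_index_order(entry))

prefix, suffix = split_lsn(0x0000000A12345678)
print(hex(prefix), hex(suffix), hex(join_lsn(prefix, suffix)))
```

Walking the Index Order array in order and unpacking each entry yields the LSNs in the order in which they were inserted into the Memtable.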

LogIndex Memtables stored in the memory are divided into two categories: Active LogIndex Memtables and Inactive LogIndex Memtables. The LogIndex records generated based on WAL log metadata are written to an Active LogIndex Memtable. After the Active LogIndex Memtable is full, the table is converted to an Inactive LogIndex Memtable and the system generates another Active LogIndex Memtable. The data in the Inactive LogIndex Memtable can be flushed to the disk. Then, the Inactive LogIndex Memtable can be converted to an Active LogIndex Memtable again. The following figure shows more details.

image.png
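The Active/Inactive life cycle can be modeled with a small pool. This is a simplified sketch under stated assumptions: the `Memtable` and `MemtablePool` classes, the tiny capacity, and the list that stands in for the disk are all hypothetical.

```python
class Memtable:
    def __init__(self, capacity):
        self.memtable_id = 0
        self.capacity = capacity
        self.records = []

    def is_full(self):
        return len(self.records) >= self.capacity


class MemtablePool:
    """Toy model of the Active -> Inactive -> flushed -> Active cycle."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.next_id = 1
        self.inactive = []   # full Memtables waiting to be flushed
        self.free = []       # flushed Memtables that can be reused
        self.active = self._acquire()

    def _acquire(self):
        # Reuse a flushed Memtable if one exists, else allocate a new one.
        table = self.free.pop() if self.free else Memtable(self.capacity)
        table.memtable_id = self.next_id
        table.records = []
        self.next_id += 1
        return table

    def add(self, record):
        if self.active.is_full():
            self.inactive.append(self.active)
            self.active = self._acquire()
        self.active.records.append(record)

    def flush_one(self, disk):
        """Flush the oldest Inactive Memtable and make it reusable."""
        table = self.inactive.pop(0)
        disk.append((table.memtable_id, list(table.records)))
        self.free.append(table)


pool = MemtablePool(capacity=2)
for lsn in range(1, 6):
    pool.add(lsn)

disk = []
pool.flush_one(disk)   # Memtable 1 reaches the disk and becomes reusable
pool.add(6)
pool.add(7)            # Memtable 3 is full; the flushed one is reused
print(disk, pool.active.memtable_id, pool.active.records)
```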

Data Structure on Disk

The disk stores a number of LogIndex Tables. The structure of a LogIndex Table is similar to that of a LogIndex Memtable, and a LogIndex Table can contain a maximum of 64 LogIndex Memtables. When an Inactive LogIndex Memtable is flushed to the disk, a Bloom filter is generated for it. As shown in the following figure, a single Bloom filter is 4,096 bytes in size and records information about the Inactive LogIndex Memtable, such as its minimum LSN, its maximum LSN, and the bits that all pages in the Memtable map to in the bit array of the Bloom filter. A Bloom filter can be used to quickly determine whether a page may exist in the corresponding LogIndex Table, so LogIndex Tables in which the page does not exist do not need to be scanned. This accelerates data retrieval.

image.png
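A Bloom filter of the stated size can be sketched as follows. Only the 4,096-byte size and the stored minimum and maximum LSNs come from the text; the number of hash functions, the use of BLAKE2b, and the `MemtableBloom` class are assumptions made for illustration.

```python
import hashlib

BLOOM_BYTES = 4096
BLOOM_BITS = BLOOM_BYTES * 8
NUM_HASHES = 3  # assumption: the text does not state the number of hashes


class MemtableBloom:
    """Toy Bloom filter for the pages of one flushed Memtable."""

    def __init__(self):
        self.bits = bytearray(BLOOM_BYTES)
        self.min_lsn = None
        self.max_lsn = None

    def _positions(self, page_tag):
        # Derive NUM_HASHES bit positions from salted BLAKE2b digests.
        for i in range(NUM_HASHES):
            data = repr((i, page_tag)).encode()
            digest = hashlib.blake2b(data, digest_size=8).digest()
            yield int.from_bytes(digest, "big") % BLOOM_BITS

    def add(self, page_tag, lsn):
        for pos in self._positions(page_tag):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.min_lsn = lsn if self.min_lsn is None else min(self.min_lsn, lsn)
        self.max_lsn = lsn if self.max_lsn is None else max(self.max_lsn, lsn)

    def may_contain(self, page_tag):
        """False means the page is definitely absent; True means maybe."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(page_tag)
        )


bloom = MemtableBloom()
bloom.add((16384, 0, 7), lsn=100)
bloom.add((16384, 0, 9), lsn=250)
print(bloom.may_contain((16384, 0, 7)), bloom.min_lsn, bloom.max_lsn)
```

A Bloom filter never reports a stored page as absent, but it can report an absent page as possibly present, so a positive answer still requires scanning the table.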

After the data in an Inactive LogIndex Memtable is flushed to the disk, the LogIndex metadata file is updated. This file is used to ensure the atomicity of I/O operations on the LogIndex Memtable file. The LogIndex metadata file stores information about the smallest LogIndex Table and the largest LogIndex Memtable on the disk. Start LSN in this file records the maximum LSN among all LogIndex Memtables whose data has been flushed to the disk. If a partial write occurs while a LogIndex Memtable is being flushed, the system parses WAL logs starting from the Start LSN recorded in the LogIndex metadata file. The LogIndex records that were lost in the partial write are then regenerated, which ensures the atomicity of I/O operations on the Memtable.

image.png
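The recovery rule can be illustrated with a toy flush that is interrupted halfway. This is a sketch under assumptions: the metadata is modeled as a dictionary, the `valid_records` field and the `fail_after` switch are hypothetical, and the on-disk table is a plain list.

```python
def flush_memtable(records, table_file, meta, fail_after=None):
    """Append records, then publish the new Start LSN in the metadata.

    fail_after simulates a crash after that many records were written;
    in that case the metadata is left untouched.
    """
    for written, record in enumerate(records):
        if fail_after is not None and written >= fail_after:
            return False
        table_file.append(record)
    meta["start_lsn"] = max(lsn for lsn, _tag in records)
    meta["valid_records"] = len(table_file)
    return True


def recover(wal, table_file, meta):
    """Drop the torn tail and list the WAL entries to index again."""
    del table_file[meta["valid_records"]:]
    return [(lsn, tag) for lsn, tag in wal if lsn > meta["start_lsn"]]


wal = [(10, "A"), (20, "B"), (30, "A"), (40, "C")]
table_file = []
meta = {"start_lsn": 0, "valid_records": 0}

first_ok = flush_memtable(wal[:2], table_file, meta)
second_ok = flush_memtable(wal[2:], table_file, meta, fail_after=1)
torn_length = len(table_file)   # includes one partially flushed record

regenerated = recover(wal, table_file, meta)
print(first_ok, second_ok, torn_length, regenerated)
```

Because the metadata is updated only after a complete flush, everything after the recorded Start LSN can be rebuilt from the WAL, regardless of how much of the interrupted flush reached the disk.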

',35),S=a('

Log replay

Delayed replay

When LogIndex is used, the startup processes of read-only nodes generate LogIndex records based on the received WAL metadata and mark the pages that correspond to the WAL metadata and already exist in buffer pools as outdated. The startup processes can then advance the replay position. The startup processes themselves do not replay WAL logs. Replay is performed by the background replay processes and by the backend processes that access the page. The following figure shows how WAL logs are replayed.

  • The background replay process replays WAL logs in WAL order. Based on the LSN to be replayed, the process searches LogIndex Memtables and LogIndex Tables for the list of pages modified by that LSN. If a page exists in a buffer pool, the page is replayed. Otherwise, the page is skipped. The background replay process advances the pages in buffer pools in LSN order, which prevents too many pending LSNs from accumulating for a single page.
  • The backend process replays only the pages that it must access. If the backend process must access a page that does not exist in a buffer pool, the process reads the page from the shared storage into a buffer pool and replays it. If the page exists in a buffer pool and is marked as outdated, the process replays the page up to the most recent WAL log. The backend process searches LogIndex Memtables and LogIndex Tables by PageTag and builds the ordered list of LSNs for the page. The process then reads the complete WAL logs from the shared storage based on this LSN list to replay the page.

image.png
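The division of work between the startup process and a backend process can be sketched in a few lines. This is a simplified single-threaded model: the `ReadOnlyNode` and `Page` classes are hypothetical, the dictionary `wal_by_lsn` stands in for the WAL on the shared storage, and a page read from storage is assumed to start at LSN 0.

```python
class Page:
    def __init__(self):
        self.lsn = 0          # LSN up to which the page has been replayed
        self.outdated = False
        self.applied = []     # WAL payloads applied so far (for inspection)


class ReadOnlyNode:
    def __init__(self, wal_by_lsn):
        self.wal_by_lsn = wal_by_lsn   # stands in for WAL on shared storage
        self.logindex = {}             # page tag -> ordered list of LSNs
        self.buffer_pool = {}          # page tag -> Page

    def on_wal_meta(self, lsn, page_tags):
        """Startup process: index the record, mark cached pages outdated."""
        for tag in page_tags:
            self.logindex.setdefault(tag, []).append(lsn)
            page = self.buffer_pool.get(tag)
            if page is not None:
                page.outdated = True   # no replay and no data I/O here

    def read_page(self, tag):
        """Backend process: replay only the page that is being accessed."""
        page = self.buffer_pool.get(tag)
        if page is None:
            page = self.buffer_pool[tag] = Page()   # "read from storage"
            page.outdated = True
        if page.outdated:
            for lsn in self.logindex.get(tag, []):
                if lsn > page.lsn:
                    page.applied.append(self.wal_by_lsn[lsn])
                    page.lsn = lsn
            page.outdated = False
        return page


node = ReadOnlyNode({100: "insert a", 110: "insert b", 120: "update a"})
node.on_wal_meta(100, ["A"])
node.on_wal_meta(110, ["B"])
first = list(node.read_page("A").applied)

node.on_wal_meta(120, ["A"])
marked = node.buffer_pool["A"].outdated
second = list(node.read_page("A").applied)
print(first, marked, second)
```

Page B is never read in this example, so it is never loaded or replayed, which is the saving that delayed replay aims for.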

An XLOG Buffer is added to cache the WAL logs that have been read, which reduces the overhead of reading WAL logs from the disk during replay. Originally, WAL logs are read directly from the WAL segment file on the disk. With the XLOG Buffer, WAL logs are first looked up in the buffer. If the WAL logs to be replayed are not in the XLOG Buffer, the corresponding WAL pages are read from the disk into the buffer and then copied to readBuf of XLogReaderState. If the WAL logs are already in the buffer, they are copied directly to readBuf of XLogReaderState. This reduces the number of I/O operations needed to replay WAL logs and further increases the replay speed. The following figure shows more details.

image.png
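The lookup path through the XLOG Buffer is a read-through cache, sketched below. The `XLogPageBuffer` class, its capacity, and the least-recently-used eviction policy are assumptions for illustration; the text does not describe how the real buffer evicts pages.

```python
from collections import OrderedDict


class XLogPageBuffer:
    """Toy read-through cache for WAL pages (LRU eviction is assumed)."""

    def __init__(self, read_from_disk, capacity=4):
        self._read_from_disk = read_from_disk
        self._capacity = capacity
        self._pages = OrderedDict()
        self.disk_reads = 0

    def read_page(self, page_no):
        if page_no in self._pages:
            self._pages.move_to_end(page_no)       # cache hit
        else:
            self.disk_reads += 1                   # cache miss
            self._pages[page_no] = self._read_from_disk(page_no)
            if len(self._pages) > self._capacity:
                self._pages.popitem(last=False)    # evict the oldest page
        # Return a copy, as the page would be copied into readBuf.
        return bytes(self._pages[page_no])


buf = XLogPageBuffer(lambda n: b"wal-page-%d" % n, capacity=2)
buf.read_page(1)
buf.read_page(1)      # served from the buffer
buf.read_page(2)
reads_before_eviction = buf.disk_reads
buf.read_page(3)      # evicts page 1
buf.read_page(1)      # must be read from disk again
print(reads_before_eviction, buf.disk_reads)
```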

Mini Transaction

Log replay under the LogIndex mechanism differs from log replay in the shared-nothing architecture. With LogIndex, the startup process parses WAL metadata to generate LogIndex records while backend processes replay pages based on LogIndex records, and the two kinds of processes run in parallel. Each backend process replays only the pages that it must access. A single XLogRecord may modify multiple pages. For example, an index block split modifies both Page_0 and Page_1, and the modification is an atomic operation: the changes to Page_0 and Page_1 are either all visible or all invisible. For this reason, PolarDB provides the mini transaction lock mechanism, which ensures that the memory data structures are consistent when backend processes replay pages.

Without mini transaction locks, the startup process parses WAL metadata and sequentially inserts the current LSN into the LSN list of each page, as shown in the following figure. Suppose that the startup process has updated the LSN list of Page_0 but has not yet updated the LSN list of Page_1, and that at this moment Backend_0 accesses Page_0 and Backend_1 accesses Page_1. Backend_0 replays Page_0 based on the LSN list of Page_0, and Backend_1 replays Page_1 based on the LSN list of Page_1. Page_0 is replayed up to LSN_N+1, whereas Page_1 is replayed only up to LSN_N. As a result, the versions of the two pages in the buffer pool are not consistent, and the memory data structure of Page_0 is inconsistent with that of Page_1.

image.png

With the mini transaction lock mechanism, the updates to the LSN lists of Page_0 and Page_1 form one mini transaction. Before the startup process updates the LSN list of a page, the process must obtain the mini transaction lock of the page. In the following figure, the process first obtains the mini transaction lock of Page_0. The locks are obtained in the same order in which the pages are modified when the WAL log is replayed. The mini transaction locks are released only after the LSN lists of both Page_0 and Page_1 are updated. If a backend process replays a page based on LogIndex records while the startup process is in a mini transaction that involves the page, the backend process must obtain the mini transaction lock of the page before it replays the page. Suppose that the startup process has updated the LSN list of Page_0 but not yet that of Page_1, and that Backend_0 accesses Page_0 while Backend_1 accesses Page_1. In this case, Backend_0 cannot replay Page_0 until the mini transaction lock of the page is released, and by the time the lock is released, the LSN list of Page_1 has also been updated. The memory data structures are therefore modified atomically.

mini trans.png
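The locking rule can be sketched with ordinary mutexes. This is an illustrative model, not the PolarDB implementation: the `LogIndexWithMtr` class and its methods are hypothetical, one `threading.Lock` per page stands in for the mini transaction lock, and locks are taken in the order in which the record lists its pages.

```python
import threading
from contextlib import contextmanager


class LogIndexWithMtr:
    """Toy LogIndex whose multi-page inserts are guarded by page locks."""

    def __init__(self):
        self.lsn_lists = {}
        self._page_locks = {}
        self._table_lock = threading.Lock()

    def _lock_for(self, tag):
        with self._table_lock:
            return self._page_locks.setdefault(tag, threading.Lock())

    @contextmanager
    def mini_transaction(self, tags):
        # Take the locks in the order in which the record modifies pages.
        ordered = list(dict.fromkeys(tags))
        locks = [self._lock_for(tag) for tag in ordered]
        for lock in locks:
            lock.acquire()
        try:
            yield
        finally:
            # Release only after every LSN list has been updated.
            for lock in reversed(locks):
                lock.release()

    def insert_record(self, lsn, tags):
        """Startup process: add one LSN to the list of every page it touches."""
        with self.mini_transaction(tags):
            for tag in tags:
                self.lsn_lists.setdefault(tag, []).append(lsn)

    def lsns_for_replay(self, tag):
        """Backend process: wait for any mini transaction on the page."""
        with self._lock_for(tag):
            return list(self.lsn_lists.get(tag, []))


idx = LogIndexWithMtr()
idx.insert_record(100, ["Page_0"])
idx.insert_record(101, ["Page_0", "Page_1"])   # e.g. an index block split

with idx.mini_transaction(["Page_0", "Page_1"]):
    held = idx._lock_for("Page_1").locked()    # backends would block here

print(idx.lsns_for_replay("Page_0"), idx.lsns_for_replay("Page_1"), held)
```

A backend that asks for the LSN list of Page_0 during the split blocks until both lists contain LSN 101, so it can never observe the half-updated state described above.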

Summary

PolarDB provides LogIndex based on the storage shared between the primary node and read-only nodes. LogIndex increases the speed at which memory data is synchronized from the primary node to read-only nodes and reduces the latency between them. This ensures the availability of read-only nodes and keeps data consistent between the primary node and read-only nodes. This topic describes LogIndex and the LogIndex-based memory synchronization architecture of read-only nodes. In addition to synchronizing memory data, LogIndex can also be used to promote a read-only node to primary online. If the primary node becomes unavailable, a read-only node can be promoted more quickly, so services are restored in a short period of time and the compute nodes remain highly available.

',15);function W(k,P){const t=s("RouterLink");return r(),n("div",null,[A,d("p",null,[e("All modified data pages recorded in WAL logs before the LSN of consistent data are persisted to the shared storage based on the information described in "),i(t,{to:"/theory/buffer-management.html"},{default:h(()=>[e("Buffer Management")]),_:1}),e(". The LSN of consistent data is the LSN before which data is consistent between the primary node and read-only nodes. Read-only nodes do not need to replay WAL logs generated before the LSN of consistent data. In this case, the WAL logs for the LSNs that are smaller than the LSN of consistent data can be cleared from LogIndex Tables. This way, the primary node can truncate LogIndex Tables that are no longer used in the storage. This enables more efficient log replay for read-only nodes and reduces the space occupied by LogIndex Tables.")]),S])}const v=o(w,[["render",W],["__file","logindex.html.vue"]]);export{v as default}; diff --git a/assets/logindex.html-2ff46a28.js b/assets/logindex.html-2ff46a28.js new file mode 100644 index 00000000000..86e3a27fdea --- /dev/null +++ b/assets/logindex.html-2ff46a28.js @@ -0,0 +1 @@ +import{_ as o,r as n,o as r,c as g,a as d,b as e,d as L,w as i,e as a}from"./app-3d1677bf.js";const 
l="/PolarDB-for-PostgreSQL/assets/49_LogIndex_1-e49fa6a7.png",s="/PolarDB-for-PostgreSQL/assets/50_LogIndex_2-2d85ed00.png",c="/PolarDB-for-PostgreSQL/assets/51_LogIndex_3-1c28dec4.png",m="/PolarDB-for-PostgreSQL/assets/52_LogIndex_4-50a08309.png",p="/PolarDB-for-PostgreSQL/assets/53_LogIndex_5-3a25393f.png",x="/PolarDB-for-PostgreSQL/assets/54_LogIndex_6-ea27fcdf.png",I="/PolarDB-for-PostgreSQL/assets/55_LogIndex_7-a84ed0dd.png",P="/PolarDB-for-PostgreSQL/assets/56_LogIndex_8-3f14f302.png",_="/PolarDB-for-PostgreSQL/assets/57_LogIndex_9-1fcc55d8.png",h="/PolarDB-for-PostgreSQL/assets/58_LogIndex_10-2eab9094.png",f="/PolarDB-for-PostgreSQL/assets/59_LogIndex_11-e0277c33.png",S="/PolarDB-for-PostgreSQL/assets/60_LogIndex_12-6b577085.png",u="/PolarDB-for-PostgreSQL/assets/61_LogIndex_13-4a2d72a8.png",R="/PolarDB-for-PostgreSQL/assets/62_LogIndex_14-c90cc6e7.png",b={},W=a('

LogIndex

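Background Information

PolarDB uses a shared-storage architecture with one writer and multiple readers. The read-write node (RW) and multiple read-only nodes (RO) share the same storage. The RW node can read and write data in the shared storage. Each RO node can only read data from the shared storage, bringing pages up to date by replaying logs, and cannot write to it. RO nodes keep their data consistent through memory synchronization. RO nodes can also serve requests at the same time, which enables read/write splitting and load balancing. When the RW node crashes, an RO node can be promoted to RW, which keeps the cluster highly available. The basic architecture is shown in the following figure: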

image.png

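In a conventional share-nothing architecture, each RO node has its own memory and storage and only needs to receive WAL logs from the RW node and replay them. As shown in the following figure, if the data page to be replayed is not in the Buffer Pool, it must be read from the storage file into the Buffer Pool for replay, which incurs the cost of a cache miss. Continuous replay also causes frequent Buffer Pool eviction.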

image.png

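In addition, multiple transactions on the RW node can run in parallel, whereas an RO node must replay WAL logs serially in WAL order. As a result, replay on the RO node is slow and its lag behind the RW node keeps growing.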

image.png

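Unlike the conventional share-nothing architecture, in the shared-storage architecture with one writer and multiple readers, RO nodes can obtain the WAL logs to replay directly from the shared storage. If a data page on the shared storage is already up to date, an RO node can read it directly without any replay. Based on this, PolarDB designed LogIndex to accelerate log replay on RO nodes.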

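Memory Synchronization Architecture for RO

LogIndex stores the mapping between a data page and all LSNs that modified it. With LogIndex, all LSNs that modified a given data page can be obtained quickly, so the replay of the logs for that page can be delayed until the page is actually accessed. The following figure shows the architecture of RO memory synchronization under the LogIndex mechanism.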

image.png

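Compared with the conventional share-nothing architecture, the RW / RO workflow differs as follows:

  • The RW node no longer sends complete WAL logs to the RO nodes; only WAL meta is transmitted. This reduces network traffic and the latency between RO and RW nodes.
  • The RW node generates LogIndex records from WAL meta and writes them to the LogIndex Memory Table. When a LogIndex Memory Table is full, it is flushed to disk and saved in a LogIndex Table on the shared storage. A flushed LogIndex Memory Table can be reused.
  • The RW node uses the LogIndex Meta file to guarantee the atomicity of LogIndex Memory Table I/O. The LogIndex Meta file is updated after a LogIndex Memory Table is flushed. Bloom Data is generated during the flush and makes it possible to quickly check whether a specific Page exists in a given LogIndex Table, so LogIndex Tables that do not need to be scanned are skipped, which improves efficiency.
  • The RO node receives the WAL Meta sent by the RW node, generates the corresponding LogIndex records in memory, and writes them to its own in-memory LogIndex Memory Table. It also marks the pages that correspond to the WAL Meta and already exist in the Buffer Pool as Outdate. In this phase the RO node performs no actual log replay and no data I/O, which removes the cost of cache misses.
  • Once the RO node has generated LogIndex records from WAL Meta, it can advance the replay position. Log replay itself is handed over to the background process and to the backend processes that actually access the page, so the RO node can also replay logs in parallel.
  • LogIndex Memory Tables generated by the RO node are not flushed to disk. The RO node uses the LogIndex Meta file to determine whether a full LogIndex Memory Table has been flushed on the RW node, and a flushed LogIndex Memory Table can be reused. When the RW node determines that a LogIndex Table on the storage is no longer used, it can truncate that LogIndex Table.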

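PolarDB reduces the latency between RW and RO by transmitting only WAL Meta, and uses LogIndex to implement delayed replay and parallel replay of WAL logs, which speeds up replay on RO. The following sections describe these two points in detail.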

WAL Meta

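A WAL log is also called an XLOG Record. As shown in the following figure, each XLOG Record consists of two parts:

  • General header portion: this portion is the XLogRecord struct and has a fixed length. It stores general information about the XLOG Record, such as its length, the ID of the transaction that generated it, and the type of its resource manager.
  • Data portion: this portion is further divided into a header part and a data part. The header part contains 0 to N XLogRecordBlockHeader structs and 0 to 1 XLogRecordDataHeader[Short|Long] struct. The data part contains block data and main data. Each XLogRecordBlockHeader corresponds to one piece of block data in the data part, and XLogRecordDataHeader[Short|Long] corresponds to the main data in the data part.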

wal meta.png

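In shared-storage mode, the RW node does not need to send complete WAL logs to the RO nodes; only WAL Meta is transmitted. WAL Meta consists of the general header portion + header part + main data shown in the preceding figure. An RO node can read the complete WAL log content from the shared storage based on WAL Meta. Under this mechanism, WAL Meta is transmitted from RW to RO as follows: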

wal meta传输.png

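  1. When a transaction on the RW node modifies data, the corresponding WAL log is generated and written to the WAL Buffer, and the corresponding WAL meta is copied to the WAL Meta queue in memory.
  2. In synchronous streaming replication mode, when the transaction commits, the corresponding WAL log in the WAL Buffer is first flushed to disk, and then the WalSender process is woken up.
  3. When the WalSender process finds new logs that can be sent, it reads the corresponding WAL Meta from the WAL Meta queue and sends it to the RO node over the established streaming replication connection.
  4. After the WalReceiver process on the RO node receives the new log data, it pushes the data to the in-memory WAL Meta queue and notifies the Startup process that new logs have arrived.
  5. The Startup process reads the corresponding meta data from the WAL Meta queue and parses it to generate the corresponding LogIndex memtable.

Streaming replication between the RW and RO nodes does not transmit the actual payload data, which reduces network traffic. In addition, the WalSender process on the RW node obtains WAL Meta from the in-memory WAL Meta queue, and the WalReceiver process on the RO node likewise stores the received WAL Meta in its in-memory WAL Meta queue. Compared with the conventional primary/standby mode, this removes the disk I/O involved in sending and receiving logs, which increases the transmission speed and reduces the latency between RW and RO.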

LogIndex

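Memory data structure

LogIndex is essentially a HashTable structure. Its key is PageTag, which identifies a specific data page, and its value is the list of all LSNs that modified the page. The following figure shows the memory data structure of LogIndex. In addition to information such as the Memtable ID and the maximum and minimum LSNs stored in the Memtable, a LogIndex Memtable contains three arrays:

  • HashTable: the HashTable array records the mapping between a Page and the LSN List of its modifications. Each member of the HashTable array points to a specific LogIndex Item in the Segment array.
  • Segment: each member of the Segment array is a LogIndex Item. A LogIndex Item has two structures, Item Head and Item Seg, as shown in the following figure. Item Head is the head of the LSN linked list for a Page, and Item Seg is a subsequent node of that list. Page TAG in Item Head records the metadata of a single Page. Next Seg and Tail Seg in Item Head point to the subsequent node and the tail node. Item Seg stores pointers to the previous node (Prev Seg) and the subsequent node (Next Seg). A Suffix LSN stored in Item Head or Item Seg is combined with the Prefix LSN stored in the LogIndex Memtable to form a complete LSN, which avoids the space wasted by storing the Prefix LSN repeatedly. When different Page TAG values hash to the same slot of the HashTable, Next Item in Item Head points to the next Page with the same hash value, which resolves the hash collision.
  • Index Order: the Index Order array records the order in which LogIndex records are added to the LogIndex Memtable. Each member of the array occupies 2 bytes. The last 12 bits of each member are a subscript into the Segment array and point to a specific LogIndex Item. The first 4 bits are a subscript into the Suffix LSN array of that LogIndex Item and point to a specific Suffix LSN. Index Order makes it easy to obtain all LSNs inserted into the LogIndex Memtable and the mapping between an LSN and all Pages that it modified.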

logindex.png

LogIndex Memtables in memory are divided into Active LogIndex Memtables and Inactive LogIndex Memtables. As shown in the following figure, LogIndex records generated from WAL Meta are written to the Active LogIndex Memtable. When the Active LogIndex Memtable is full, it becomes an Inactive LogIndex Memtable and a new Active LogIndex Memtable is allocated. An Inactive LogIndex Memtable can be flushed to disk directly, and after it is flushed it can become an Active LogIndex Memtable again.

image.png

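On-Disk Data Structure

A number of LogIndex Tables are stored on disk. A LogIndex Table is structured much like a LogIndex Memtable, and one LogIndex Table can hold 64 LogIndex Memtables. When an Inactive LogIndex Memtable is flushed to disk, its Bloom Filter is generated as well. As shown below, a single Bloom Filter is 4096 bytes and records information about that Inactive LogIndex Memtable, such as the minimum and maximum LSNs it holds and the bits to which all of its Pages map in the bloom filter bit array. The Bloom Filter makes it possible to quickly determine whether a Page may exist in the corresponding LogIndex Table, so LogIndex Tables that need not be scanned can be skipped, which speeds up lookups.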
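
To make the Bloom Filter check concrete, here is a minimal sketch in C. It is not PolarDB's implementation: the `DemoPageTag` fields, the FNV-1a-style hash, the choice of two probes, and the assumption that the whole 4096 bytes are bit array are all hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BYTES  4096u               /* filter size named in the text */
#define BLOOM_BITS   (BLOOM_BYTES * 8u)
#define BLOOM_PROBES 2u                  /* assumed number of hash probes */

/* Hypothetical, padding-free page tag used only by this sketch. */
typedef struct { uint32_t rel; uint32_t fork; uint32_t block; } DemoPageTag;

typedef struct {
    uint64_t min_lsn;                    /* smallest LSN added so far */
    uint64_t max_lsn;                    /* largest LSN added so far */
    uint8_t  bits[BLOOM_BYTES];
} DemoBloom;

/* FNV-1a-style hash over the tag bytes, varied by a seed per probe. */
static uint32_t demo_hash(const DemoPageTag *tag, uint32_t seed)
{
    const unsigned char *p = (const unsigned char *)tag;
    uint32_t h = 2166136261u ^ seed;
    for (size_t i = 0; i < sizeof *tag; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

static void bloom_init(DemoBloom *b)
{
    assert(b != NULL);
    memset(b, 0, sizeof *b);
    b->min_lsn = UINT64_MAX;
}

/* Record that `tag` was modified at `lsn`. */
static void bloom_add(DemoBloom *b, const DemoPageTag *tag, uint64_t lsn)
{
    for (uint32_t k = 0; k < BLOOM_PROBES; k++) {
        uint32_t bit = demo_hash(tag, k) % BLOOM_BITS;
        b->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
    if (lsn < b->min_lsn) b->min_lsn = lsn;
    if (lsn > b->max_lsn) b->max_lsn = lsn;
}

/* false means the page is definitely absent; true means it may be present. */
static bool bloom_may_contain(const DemoBloom *b, const DemoPageTag *tag)
{
    for (uint32_t k = 0; k < BLOOM_PROBES; k++) {
        uint32_t bit = demo_hash(tag, k) % BLOOM_BITS;
        if ((b->bits[bit / 8] & (1u << (bit % 8))) == 0)
            return false;
    }
    return true;
}
```

A lookup would skip any LogIndex Table whose filter returns false for the page, and only scan tables where the answer is "may be present"; false positives cost an unnecessary scan but never a missed LSN.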

image.png

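After an Inactive LogIndex MemTable has been flushed successfully, the LogIndex Meta file is updated as well. This file guarantees the atomicity of LogIndex Memtable file I/O. As shown below, the LogIndex Meta file stores information about the smallest LogIndex Table and the largest LogIndex Memtable currently on disk, and its Start LSN records the largest LSN among all LogIndex MemTables that have been flushed. If a partial write occurs while flushing a LogIndex MemTable, the system re-parses WAL starting from the Start LSN recorded in LogIndex Meta, so the LogIndex records discarded by the partial write are regenerated, which guarantees the atomicity of the I/O operation.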

image.png

',35),M=a('

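Log Replay

Delayed Replay

Under the LogIndex mechanism, the Startup process on the RO node generates LogIndex from the received WAL Meta and marks the pages that this WAL Meta refers to and that are already in the Buffer Pool as Outdate; it can then advance the replay position. The Startup process itself does not replay WAL. Replay is handed over to the background replay process and to the Backend processes that actually access the pages. The replay process is shown in the figure below, where:

  • The background replay process replays WAL in WAL order. For the LSN to be replayed, it searches the LogIndex Memtable and LogIndex Table to obtain the list of Pages modified by that LSN. If a Page is in the Buffer Pool it is replayed; otherwise it is skipped. By advancing the pages in the Buffer Pool step by step in LSN order, the background replay process keeps any single Page from accumulating too many LSNs to replay;
  • A Backend process replays only the Pages it actually needs to access. When a Backend process needs a Page that is not in the Buffer Pool, it reads the Page into the Buffer Pool and then replays it; if the Page is already in the Buffer Pool and marked outdate, it replays the Page up to the latest state. The Backend process searches the LogIndex Memtable and LogIndex Table by Page TAG, builds the ordered list of LSNs related to the Page, and reads the complete WAL records from shared storage according to that list to replay the Page.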

image.png

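To reduce the cost of reading WAL from disk during replay, an XLog Buffer is added to cache the WAL that has been read. As shown below, WAL was originally read directly from the WAL Segment File on disk. With the XLog Buffer, reads go to the XLog Buffer first. If the required WAL is not in the XLog Buffer, the corresponding WAL Page is read from disk into the buffer and then copied into readBuf of XLogReaderState; if it is already in the buffer, it is copied into readBuf of XLogReaderState directly. This reduces the number of I/O operations during WAL replay and further speeds up replay.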
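
The read path through the XLog Buffer can be sketched as a small page cache. Everything here is an assumption for illustration: the names are hypothetical, the cache is direct-mapped with four slots, the page size is taken as 8192 bytes, and the disk read is a callback. The real buffer's sizing and replacement policy are not described in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WAL_PAGE_SIZE 8192u   /* assumed WAL page size */
#define CACHE_SLOTS   4u      /* assumed, deliberately tiny cache */

/* Callback that loads one WAL page from "disk" into dst. */
typedef void (*DemoReadPageFn)(uint64_t page_start, uint8_t *dst);

typedef struct {
    bool     valid[CACHE_SLOTS];
    uint64_t page_start[CACHE_SLOTS];
    uint8_t  data[CACHE_SLOTS][WAL_PAGE_SIZE];
    unsigned disk_reads;      /* how many times we had to go to disk */
} DemoXLogBuffer;

/* Sample loader: fills the page with its page number (mod 256). */
static void demo_fill_page(uint64_t page_start, uint8_t *dst)
{
    memset(dst, (int)((page_start / WAL_PAGE_SIZE) & 0xFF), WAL_PAGE_SIZE);
}

/* Copy the WAL page containing `lsn` into read_buf, going to disk only
 * when the page is not already cached. */
static void demo_read_wal_page(DemoXLogBuffer *cache, DemoReadPageFn read_from_disk,
                               uint64_t lsn, uint8_t *read_buf)
{
    assert(cache != NULL && read_from_disk != NULL && read_buf != NULL);

    uint64_t start = lsn - (lsn % WAL_PAGE_SIZE);
    unsigned slot = (unsigned)((start / WAL_PAGE_SIZE) % CACHE_SLOTS);

    if (!cache->valid[slot] || cache->page_start[slot] != start) {
        /* Miss: load the page from disk into the buffer first. */
        read_from_disk(start, cache->data[slot]);
        cache->page_start[slot] = start;
        cache->valid[slot] = true;
        cache->disk_reads++;
    }
    /* Hit or freshly loaded: copy from the buffer into the reader's buffer. */
    memcpy(read_buf, cache->data[slot], WAL_PAGE_SIZE);
}
```

Two reads that fall on the same WAL page cost one disk read in this model, which is the saving the paragraph above describes.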

image.png

Mini Transaction

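Unlike log replay in the traditional shared-nothing architecture, under the LogIndex mechanism the Startup process parsing WAL Meta to generate LogIndex runs in parallel with Backend processes replaying Pages based on LogIndex, and each Backend process replays only the Pages it needs to access. A single XLog Record may modify multiple Pages. Take an index split as an example: it modifies Page_0 and Page_1, and the modification of both is atomic, i.e. the changes are either all visible or all invisible. To handle this, a mini transaction lock mechanism was designed to keep in-memory data structures consistent while Backend processes replay.

As shown below, without the mini transaction lock, the Startup process parses WAL Meta and inserts the current LSN into the LSN List of each Page in order. Suppose the Startup process has updated the LSN List of Page_0 but not yet that of Page_1 when Backend_0 and Backend_1 access Page_0 and Page_1 respectively. Each Backend replays according to its Page's LSN List, so Page_0 is replayed to LSN_N + 1 while Page_1 is replayed only to LSN_N. The two Pages in the Buffer Pool are then at inconsistent versions, which makes the corresponding in-memory data structures inconsistent.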

image.png

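Under the mini transaction lock mechanism, the updates to the LSN Lists of Page_0 and Page_1 are treated as one mini transaction. Before updating the LSN List of a Page, the Startup process must acquire that Page's mini transaction lock (mtr lock). In the figure below it first acquires the mtr lock of Page_0; Page mtr locks are acquired in the same order as replay. It releases the mtr lock of Page_0 only after the LSN Lists of both Page_0 and Page_1 have been updated. When a Backend process replays a specific Page based on LogIndex and that Page is still part of a mini transaction in the Startup process, the Backend must likewise acquire the Page's mtr lock before replaying. So if the Startup process has updated the LSN List of Page_0 but not yet that of Page_1 when Backend_0 and Backend_1 access Page_0 and Page_1 respectively, Backend_0 has to wait until the LSN Lists are fully updated and the Page_0 mtr lock is released before it can replay. By the time the Page_0 mtr lock is released, the LSN List of Page_1 has also been updated, so the in-memory data structures are modified atomically.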

mini trans.png

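Summary

PolarDB builds on the fact that RW and RO nodes share storage and designs the LogIndex mechanism to speed up memory synchronization on RO nodes, reduce the latency between RO and RW nodes, and ensure the consistency and availability of RO nodes. This article analyzed the background of LogIndex, the LogIndex-based RO memory synchronization architecture, and its details. Beyond RO memory synchronization, LogIndex also enables Online Promote of an RO node, which shortens the time needed to promote an RO node to RW when the RW node crashes, providing high availability for compute nodes and fast service recovery.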

',15);function B(A,O){const t=n("RouterLink");return r(),g("div",null,[W,d("p",null,[e("由 "),L(t,{to:"/zh/theory/buffer-management.html"},{default:i(()=>[e("Buffer 管理")]),_:1}),e(" 可知,一致性位点之前的所有 WAL 日志修改的数据页均已持久化到共享存储中,RO 节点无需回放该位点之前的 WAL 日志,故 LogIndex Table 中小于一致性位点的 LSN 均可清除。RW 据此 Truncate 掉存储上不再使用的 LogIndex Table,在加速 RO 回放效率的同时还可减少 LogIndex Table 占用的空间。")]),M])}const T=o(b,[["render",B],["__file","logindex.html.vue"]]);export{T as default}; diff --git a/assets/logindex.html-d1deed5e.js b/assets/logindex.html-d1deed5e.js new file mode 100644 index 00000000000..a2903e4761f --- /dev/null +++ b/assets/logindex.html-d1deed5e.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-170991ee","path":"/zh/theory/logindex.html","title":"LogIndex","lang":"zh-CN","frontmatter":{},"headers":[{"level":2,"title":"背景介绍","slug":"背景介绍","link":"#背景介绍","children":[]},{"level":2,"title":"RO 内存同步架构","slug":"ro-内存同步架构","link":"#ro-内存同步架构","children":[]},{"level":2,"title":"WAL Meta","slug":"wal-meta","link":"#wal-meta","children":[]},{"level":2,"title":"LogIndex","slug":"logindex-1","link":"#logindex-1","children":[{"level":3,"title":"内存数据结构","slug":"内存数据结构","link":"#内存数据结构","children":[]},{"level":3,"title":"磁盘数据结构","slug":"磁盘数据结构","link":"#磁盘数据结构","children":[]}]},{"level":2,"title":"日志回放","slug":"日志回放","link":"#日志回放","children":[{"level":3,"title":"延迟回放","slug":"延迟回放","link":"#延迟回放","children":[]},{"level":3,"title":"Mini Transaction","slug":"mini-transaction","link":"#mini-transaction","children":[]}]},{"level":2,"title":"总结","slug":"总结","link":"#总结","children":[]}],"git":{"updatedTime":1656919280000},"filePathRelative":"zh/theory/logindex.md"}');export{l as data}; diff --git a/assets/online_promote_logindex_bgw-d9f46b31.png b/assets/online_promote_logindex_bgw-d9f46b31.png new file mode 100644 index 00000000000..beef14da81b Binary files /dev/null and b/assets/online_promote_logindex_bgw-d9f46b31.png differ diff --git a/assets/online_promote_postmaster-92e1fd76.png 
b/assets/online_promote_postmaster-92e1fd76.png new file mode 100644 index 00000000000..e5c946e554c Binary files /dev/null and b/assets/online_promote_postmaster-92e1fd76.png differ diff --git a/assets/online_promote_startup-b84a6f37.png b/assets/online_promote_startup-b84a6f37.png new file mode 100644 index 00000000000..2c8a78dd5d3 Binary files /dev/null and b/assets/online_promote_startup-b84a6f37.png differ diff --git a/assets/parallel-dml.html-ce10b755.js b/assets/parallel-dml.html-ce10b755.js new file mode 100644 index 00000000000..14747005946 --- /dev/null +++ b/assets/parallel-dml.html-ce10b755.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-3f61fca0","path":"/zh/features/v11/epq/parallel-dml.html","title":"并行 INSERT","lang":"zh-CN","frontmatter":{"author":"渊云","date":"2022/09/27","minute":30},"headers":[{"level":2,"title":"背景介绍","slug":"背景介绍","link":"#背景介绍","children":[]},{"level":2,"title":"功能介绍","slug":"功能介绍","link":"#功能介绍","children":[]},{"level":2,"title":"使用方法","slug":"使用方法","link":"#使用方法","children":[]},{"level":2,"title":"使用说明","slug":"使用说明","link":"#使用说明","children":[]},{"level":2,"title":"原理介绍","slug":"原理介绍","link":"#原理介绍","children":[]}],"git":{"updatedTime":1697908247000},"filePathRelative":"zh/features/v11/epq/parallel-dml.md"}');export{l as data}; diff --git a/assets/parallel-dml.html-fe244403.js b/assets/parallel-dml.html-fe244403.js new file mode 100644 index 00000000000..8836c06d855 --- /dev/null +++ b/assets/parallel-dml.html-fe244403.js @@ -0,0 +1,25 @@ +import{_ as r,r as o,o as k,c as d,d as n,a as s,w as p,b as a,e as u}from"./app-3d1677bf.js";const i="/PolarDB-for-PostgreSQL/assets/parallel_data_flow-94a4a827.png",m={},w=s("h1",{id:"并行-insert",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#并行-insert","aria-hidden":"true"},"#"),a(" 并行 INSERT")],-1),b={class:"table-of-contents"},_=u(`

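Background

PolarDB-PG supports ePQ (elastic cross-node parallel query), which uses multiple compute nodes in a cluster to speed up read-only queries. ePQ also supports multi-process parallel writes on the read-write node to accelerate INSERT statements.

Feature Overview

The parallel INSERT feature of ePQ accelerates SQL of the form INSERT INTO ... SELECT ..., which both reads and writes. For the SELECT part, ePQ starts multiple processes that execute the query in parallel; for the INSERT part, ePQ starts multiple processes on the read-write node that perform the writes in parallel. The write processes and the query processes exchange data through the Motion operator.

The table types that support parallel INSERT are:

  • Ordinary tables
  • Partitioned tables
  • (Some) foreign tables

Parallel INSERT supports adjusting the write parallelism (the number of write processes) dynamically. As long as the query side is not the bottleneck, performance can improve by up to 3x.

Usage

Create two tables, t1 and t2, and insert some data into t1: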

CREATE TABLE t1 (id INT);
+CREATE TABLE t2 (id INT);
+INSERT INTO t1 SELECT generate_series(1,100000);
+

Turn on ePQ and parallel INSERT:

SET polar_enable_px TO ON;
+SET polar_px_enable_insert_select TO ON;
+

Use an INSERT statement to insert all rows of t1 into t2, and view the execution plan of the parallel INSERT:

=> EXPLAIN INSERT INTO t2 SELECT * FROM t1;
+                                       QUERY PLAN
+-----------------------------------------------------------------------------------------
+ Insert on t2  (cost=0.00..952.87 rows=33334 width=4)
+   ->  Result  (cost=0.00..0.00 rows=0 width=0)
+         ->  PX Hash 6:3  (slice1; segments: 6)  (cost=0.00..432.04 rows=100000 width=8)
+               ->  Partial Seq Scan on t1  (cost=0.00..431.37 rows=16667 width=4)
+ Optimizer: PolarDB PX Optimizer
+(5 rows)
+

Here PX Hash 6:3 indicates that 6 processes querying t1 in parallel pass data through the Motion operator to 3 processes writing t2 in parallel.
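
As a rough illustration of what an N:M hash redistribution does, the sketch below assigns each row to one of the writer processes by hashing its key. This is an assumption-laden model, not ePQ's code: the function names and the integer mixing hash are hypothetical, and the actual hash function and distribution key used by the Motion operator are not stated in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical integer mixing hash, used only for this sketch. */
static uint32_t demo_hash_u32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x7feb352dU;
    x ^= x >> 15;
    x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

/* Pick the writer process that receives a row with the given id. */
static unsigned demo_target_writer(int32_t id, unsigned n_writers)
{
    assert(n_writers > 0);
    return demo_hash_u32((uint32_t)id) % n_writers;
}

/* Count how many of the given rows each writer would receive. */
static void demo_distribute(const int32_t *ids, size_t n_rows,
                            unsigned n_writers, size_t *rows_per_writer)
{
    for (unsigned w = 0; w < n_writers; w++)
        rows_per_writer[w] = 0;
    for (size_t i = 0; i < n_rows; i++)
        rows_per_writer[demo_target_writer(ids[i], n_writers)]++;
}
```

Because the target depends only on the row's key and the number of writers, any of the 6 reader processes can route a row without coordinating with the others; changing the write parallelism only changes `n_writers`.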

The write parallelism can be adjusted dynamically through the polar_px_insert_dop_num parameter, for example:

=> SET polar_px_insert_dop_num TO 12;
+=> EXPLAIN INSERT INTO t2 SELECT * FROM t1;
+                                        QUERY PLAN
+------------------------------------------------------------------------------------------
+ Insert on t2  (cost=0.00..952.87 rows=8334 width=4)
+   ->  Result  (cost=0.00..0.00 rows=0 width=0)
+         ->  PX Hash 6:12  (slice1; segments: 6)  (cost=0.00..432.04 rows=100000 width=8)
+               ->  Partial Seq Scan on t1  (cost=0.00..431.37 rows=16667 width=4)
+ Optimizer: PolarDB PX Optimizer
+(5 rows)
+

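The PX Hash 6:12 in the plan shows that the number of processes querying t1 in parallel is unchanged, while the number of processes writing t2 in parallel has changed to 12.

Usage Notes

Adjusting polar_px_dop_per_node and polar_px_insert_dop_num changes the query parallelism and the write parallelism of INSERT INTO ... SELECT ..., respectively.

  1. When query parallelism is low, gradually raising write parallelism makes SQL execution time decrease and then level off; it levels off because the query cannot keep up with the writes and becomes the bottleneck
  2. When query parallelism is high, gradually raising write parallelism also makes SQL execution time decrease and then level off; it levels off because parallel writes can run only on the read-write node, and contention among the write processes for the table's page extension lock keeps the write speed from matching the query speed, so writing becomes the bottleneck

How It Works

ePQ handles parallel INSERT as follows:

  1. The ePQ optimizer takes the parse tree produced by query parsing as input and generates a plan tree
  2. The ePQ executor dispatches the plan tree to the compute nodes and creates the parallel query and parallel write processes, each of which starts executing the sub-plan it is responsible for
  3. The parallel query processes read their own data slices from storage in parallel and send the data to the Motion operator
  4. The parallel write processes fetch data from the Motion operator and write it to storage in parallel

Parallel query and parallel write run concurrently as a pipeline. The execution process is shown in the figure: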

parallel_insert_data_flow

',26);function h(t,y){const c=o("Badge"),l=o("ArticleInfo"),e=o("router-link");return k(),d("div",null,[w,n(c,{type:"tip",text:"V11 / v1.1.17-",vertical:"top"}),n(l,{frontmatter:t.$frontmatter},null,8,["frontmatter"]),s("nav",b,[s("ul",null,[s("li",null,[n(e,{to:"#背景介绍"},{default:p(()=>[a("背景介绍")]),_:1})]),s("li",null,[n(e,{to:"#功能介绍"},{default:p(()=>[a("功能介绍")]),_:1})]),s("li",null,[n(e,{to:"#使用方法"},{default:p(()=>[a("使用方法")]),_:1})]),s("li",null,[n(e,{to:"#使用说明"},{default:p(()=>[a("使用说明")]),_:1})]),s("li",null,[n(e,{to:"#原理介绍"},{default:p(()=>[a("原理介绍")]),_:1})])])]),_])}const T=r(m,[["render",h],["__file","parallel-dml.html.vue"]]);export{T as default}; diff --git a/assets/parallel_data_flow-94a4a827.png b/assets/parallel_data_flow-94a4a827.png new file mode 100644 index 00000000000..021ba46b1b6 Binary files /dev/null and b/assets/parallel_data_flow-94a4a827.png differ diff --git a/assets/pgvector.html-3c1132df.js b/assets/pgvector.html-3c1132df.js new file mode 100644 index 00000000000..d8d25c941df --- /dev/null +++ b/assets/pgvector.html-3c1132df.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-3c5bafa7","path":"/zh/features/v11/extensions/pgvector.html","title":"pgvector","lang":"zh-CN","frontmatter":{"author":"山现","date":"2023/12/25","minute":10},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"使用方法","slug":"使用方法","link":"#使用方法","children":[{"level":3,"title":"安装插件","slug":"安装插件","link":"#安装插件","children":[]},{"level":3,"title":"向量操作","slug":"向量操作","link":"#向量操作","children":[]},{"level":3,"title":"卸载插件","slug":"卸载插件","link":"#卸载插件","children":[]}]},{"level":2,"title":"注意事项","slug":"注意事项","link":"#注意事项","children":[]}],"git":{"updatedTime":1703745117000},"filePathRelative":"zh/features/v11/extensions/pgvector.md"}');export{e as data}; diff --git a/assets/pgvector.html-74643e40.js b/assets/pgvector.html-74643e40.js new file mode 100644 index 00000000000..d1a589e3f7c --- /dev/null +++ 
b/assets/pgvector.html-74643e40.js @@ -0,0 +1,14 @@ +import{_ as i,r as o,o as d,c as k,d as s,a as n,w as t,b as a,e as h}from"./app-3d1677bf.js";const _={},g=n("h1",{id:"pgvector",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#pgvector","aria-hidden":"true"},"#"),a(" pgvector")],-1),v={class:"table-of-contents"},m=n("h2",{id:"背景",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#背景","aria-hidden":"true"},"#"),a(" 背景")],-1),f={href:"https://github.com/pgvector/pgvector",target:"_blank",rel:"noopener noreferrer"},b=n("code",null,"pgvector",-1),E=n("p",null,[n("code",null,"pgvector"),a(" 支持 IVFFlat 索引。IVFFlat 索引能够将向量空间分为若干个划分区域,每个区域都包含一些向量,并创建倒排索引,用于快速地查找与给定向量相似的向量。IVFFlat 是 IVFADC 索引的简化版本,适用于召回精度要求高,但对查询耗时要求不严格(100ms 级别)的场景。相比其他索引类型,IVFFlat 索引具有高召回率、高精度、算法和参数简单、空间占用小的优势。")],-1),w=n("p",null,[n("code",null,"pgvector"),a(" 插件算法的具体流程如下:")],-1),x=n("ol",null,[n("li",null,"高维空间中的点基于隐形的聚类属性,按照 K-Means 等聚类算法对向量进行聚类处理,使得每个类簇有一个中心点"),n("li",null,"检索向量时首先遍历计算所有类簇的中心点,找到与目标向量最近的 n 个类簇中心"),n("li",null,"遍历计算 n 个类簇中心所在聚类中的所有元素,经过全局排序得到距离最近的 k 个向量")],-1),q=n("h2",{id:"使用方法",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#使用方法","aria-hidden":"true"},"#"),a(" 使用方法")],-1),I=n("code",null,"pgvector",-1),y={href:"https://github.com/pgvector/pgvector/blob/master/README.md",target:"_blank",rel:"noopener noreferrer"},N=h(`

Install the Extension

CREATE EXTENSION vector;
+

Vector Operations

Run the following command to create a table with a vector column:

CREATE TABLE t (val vector(3));
+

Run the following command to insert vector data:

INSERT INTO t (val) VALUES ('[0,0,0]'), ('[1,2,3]'), ('[1,1,1]'), (NULL);
+

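Create an IVFFlat index:

  1. val vector_ip_ops means the index is built on column val using the vector_ip_ops operator class, which ranks vectors by inner product. pgvector also provides the vector_l2_ops (Euclidean distance) and vector_cosine_ops (cosine distance) operator classes
  2. WITH (lists = 1) sets the number of partitions (lists) to 1, meaning all vectors fall into the same partition. In practice, tune the number of lists to the data volume and the required query performance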
CREATE INDEX ON t USING ivfflat (val vector_ip_ops) WITH (lists = 1);
+

Find the nearest vectors:

=> SELECT * FROM t ORDER BY val <#> '[3,3,3]';
+   val
+---------
+ [1,2,3]
+ [1,1,1]
+ [0,0,0]
+
+(4 rows)
+
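
The ordering in the result above follows from the `<#>` operator, which in pgvector returns the negative inner product so that an ascending sort puts the most similar vectors first. The sketch below reproduces that ordering for the three non-NULL rows; the `Demo*` names are hypothetical, and the NULL row (which sorts last in ascending order) is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_DIM 3

typedef struct { float v[DEMO_DIM]; } DemoVec;

/* Negative inner product of two DEMO_DIM-dimensional vectors. */
static double demo_neg_inner_product(const float *a, const float *b)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < DEMO_DIM; i++)
        sum += (double)a[i] * (double)b[i];
    return -sum;
}

/* Query vector used by the qsort comparator below. */
static const float *demo_query;

static int demo_cmp(const void *x, const void *y)
{
    double dx = demo_neg_inner_product(((const DemoVec *)x)->v, demo_query);
    double dy = demo_neg_inner_product(((const DemoVec *)y)->v, demo_query);
    return (dx > dy) - (dx < dy);
}

/* Sort rows ascending by negative inner product with `query`. */
static void demo_order_by_ip(DemoVec *rows, size_t n, const float *query)
{
    demo_query = query;
    qsort(rows, n, sizeof rows[0], demo_cmp);
}
```

For the query [3,3,3], the distances are -18 for [1,2,3], -9 for [1,1,1], and 0 for [0,0,0], which gives the same order as the psql output.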

Uninstall the Extension

DROP EXTENSION vector;
+

Notes

`,15);function F(l,R){const c=o("Badge"),r=o("ArticleInfo"),e=o("router-link"),p=o("ExternalLinkIcon"),u=o("RouterLink");return d(),k("div",null,[g,s(c,{type:"tip",text:"V11 / v1.1.35-",vertical:"top"}),s(r,{frontmatter:l.$frontmatter},null,8,["frontmatter"]),n("nav",v,[n("ul",null,[n("li",null,[s(e,{to:"#背景"},{default:t(()=>[a("背景")]),_:1})]),n("li",null,[s(e,{to:"#使用方法"},{default:t(()=>[a("使用方法")]),_:1}),n("ul",null,[n("li",null,[s(e,{to:"#安装插件"},{default:t(()=>[a("安装插件")]),_:1})]),n("li",null,[s(e,{to:"#向量操作"},{default:t(()=>[a("向量操作")]),_:1})]),n("li",null,[s(e,{to:"#卸载插件"},{default:t(()=>[a("卸载插件")]),_:1})])])]),n("li",null,[s(e,{to:"#注意事项"},{default:t(()=>[a("注意事项")]),_:1})])])]),m,n("p",null,[n("a",f,[b,s(p)]),a(" 作为一款高效的向量数据库插件,基于 PostgreSQL 的扩展机制,利用 C 语言实现了多种向量数据类型和运算算法,同时还能够高效存储与查询以向量表示的 AI Embedding。")]),E,w,x,q,n("p",null,[I,a(" 可以顺序检索或索引检索高维向量,关于索引类型和更多参数介绍可以参考插件源代码的 "),n("a",y,[a("README"),s(p)]),a("。")]),N,n("ul",null,[n("li",null,[s(u,{to:"/zh/features/v11/epq/"},{default:t(()=>[a("ePQ")]),_:1}),a(" 支持通过排序遍历高维向量,不支持通过索引查询向量类型")])])])}const V=i(_,[["render",F],["__file","pgvector.html.vue"]]);export{V as default}; diff --git a/assets/polar-sequence-tech.html-0de65483.js b/assets/polar-sequence-tech.html-0de65483.js new file mode 100644 index 00000000000..4bc6accd16c --- /dev/null +++ b/assets/polar-sequence-tech.html-0de65483.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-6edf83b7","path":"/theory/polar-sequence-tech.html","title":"Sequence","lang":"en-US","frontmatter":{"author":"羁鸟","date":"2022/08/22","minute":30},"headers":[{"level":2,"title":"介绍","slug":"介绍","link":"#介绍","children":[]},{"level":2,"title":"使用方法","slug":"使用方法","link":"#使用方法","children":[{"level":3,"title":"SQL 接口","slug":"sql-接口","link":"#sql-接口","children":[]},{"level":3,"title":"Sequence 组合使用场景","slug":"sequence-组合使用场景","link":"#sequence-组合使用场景","children":[]}]},{"level":2,"title":"原理剖析","slug":"原理剖析","link":"#原理剖析","children":[{"level":3,"title":"Sequence 
在系统表与数据表中的描述","slug":"sequence-在系统表与数据表中的描述","link":"#sequence-在系统表与数据表中的描述","children":[]},{"level":3,"title":"序列申请机制剖析","slug":"序列申请机制剖析","link":"#序列申请机制剖析","children":[]},{"level":3,"title":"Sequence 缓存机制","slug":"sequence-缓存机制","link":"#sequence-缓存机制","children":[]}]},{"level":2,"title":"总结","slug":"总结","link":"#总结","children":[]}],"git":{"updatedTime":1672148725000},"filePathRelative":"theory/polar-sequence-tech.md"}');export{e as data}; diff --git a/assets/polar-sequence-tech.html-2a7cb868.js b/assets/polar-sequence-tech.html-2a7cb868.js new file mode 100644 index 00000000000..6e6ab7cce98 --- /dev/null +++ b/assets/polar-sequence-tech.html-2a7cb868.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-4f41c8b0","path":"/zh/theory/polar-sequence-tech.html","title":"Sequence 使用、原理全面解析","lang":"zh-CN","frontmatter":{"author":"羁鸟","date":"2022/08/22","minute":30},"headers":[{"level":2,"title":"介绍","slug":"介绍","link":"#介绍","children":[]},{"level":2,"title":"使用方法","slug":"使用方法","link":"#使用方法","children":[{"level":3,"title":"SQL 接口","slug":"sql-接口","link":"#sql-接口","children":[]},{"level":3,"title":"Sequence 组合使用场景","slug":"sequence-组合使用场景","link":"#sequence-组合使用场景","children":[]}]},{"level":2,"title":"原理剖析","slug":"原理剖析","link":"#原理剖析","children":[{"level":3,"title":"Sequence 在系统表与数据表中的描述","slug":"sequence-在系统表与数据表中的描述","link":"#sequence-在系统表与数据表中的描述","children":[]},{"level":3,"title":"序列申请机制剖析","slug":"序列申请机制剖析","link":"#序列申请机制剖析","children":[]},{"level":3,"title":"Sequence 缓存机制","slug":"sequence-缓存机制","link":"#sequence-缓存机制","children":[]}]},{"level":2,"title":"总结","slug":"总结","link":"#总结","children":[]}],"git":{"updatedTime":1672148725000},"filePathRelative":"zh/theory/polar-sequence-tech.md"}');export{e as data}; diff --git a/assets/polar-sequence-tech.html-569c0f3f.js b/assets/polar-sequence-tech.html-569c0f3f.js new file mode 100644 index 00000000000..89a48a02b36 --- /dev/null +++ b/assets/polar-sequence-tech.html-569c0f3f.js @@ -0,0 +1,340 @@ +import{_ as e,r as 
p,o,c as t,d as c,a as n,b as l,e as i}from"./app-3d1677bf.js";const d="/PolarDB-for-PostgreSQL/assets/polar_sequence_monotonic_cyclic-5db3b890.png",r="/PolarDB-for-PostgreSQL/assets/polar_sequence_sql_interface-d4586fc0.png",u="/PolarDB-for-PostgreSQL/assets/polar_sequence_is_called-d65d8316.png",k="/PolarDB-for-PostgreSQL/assets/polar_sequence_called-ce740e57.png",v="/PolarDB-for-PostgreSQL/assets/polar_sequence_alignment_no_cache-728e2b9f.png",m="/PolarDB-for-PostgreSQL/assets/polar_sequence_alignment_desc_1-07cebb0a.png",b="/PolarDB-for-PostgreSQL/assets/polar_sequence_session_cache-81dce2b3.png",E="/PolarDB-for-PostgreSQL/assets/polar_sequence_alignment_cache-89a2a0c4.png",g="/PolarDB-for-PostgreSQL/assets/polar_sequence_alignment_cache_1-695685fe.png",q="/PolarDB-for-PostgreSQL/assets/polar_sequence_performance_comparison-5432b1d6.png",w={},h=n("h1",{id:"sequence",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#sequence","aria-hidden":"true"},"#"),l(" Sequence")],-1),y=i(`

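Introduction

A Sequence is a special table-level database object that generates a series of integers according to user-specified properties, acting as a number generator.

In practice, a never-repeating Sequence can serve as the primary key of a table, and multiple tables can share one Sequence to count the total number of rows inserted across them. According to the ANSI standard, a Sequence object in a database should have the following characteristics:

  1. It is an independent database object (CREATE SEQUENCE), at the same level as tables and views
  2. Its generation properties are configurable: start value, step (increment), maximum/minimum value (max/min), cycling (cycle), cache, and so on
  3. It increments or decrements from its current value, and the current value is initialized to the start value
  4. With cycling enabled, the current value changes periodically; without cycling, it changes monotonically and can no longer change once it reaches its limit

To illustrate these characteristics, we define two sequences, a and b, and look at their behavior.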

CREATE SEQUENCE a start with 5 minvalue -1 maxvalue 5 increment -2;
+CREATE SEQUENCE b start with 2 minvalue 1 maxvalue 4 cycle;
+
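
The difference between the monotonic sequence a and the cyclic sequence b can be modeled with a small C sketch. The struct and function names are hypothetical and this is not PostgreSQL's code. For a, a maximum of 5 is assumed, since a descending sequence's start value cannot exceed its maximum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int64_t increment;
    int64_t min_value;
    int64_t max_value;
    bool    cycle;
    int64_t last_value;  /* current value */
    bool    is_called;   /* has last_value been handed out yet? */
} DemoSeq;

static void demo_seq_init(DemoSeq *s, int64_t start, int64_t increment,
                          int64_t min_value, int64_t max_value, bool cycle)
{
    assert(increment != 0 && min_value <= start && start <= max_value);
    s->increment = increment;
    s->min_value = min_value;
    s->max_value = max_value;
    s->cycle = cycle;
    s->last_value = start;
    s->is_called = false;
}

/* Store the next value in *out and return true, or return false when a
 * non-cycling sequence has reached its limit. */
static bool demo_seq_next(DemoSeq *s, int64_t *out)
{
    int64_t next;

    if (!s->is_called) {
        next = s->last_value;                 /* first call returns start */
    } else if (s->increment > 0) {
        if (s->last_value > s->max_value - s->increment) {
            if (!s->cycle)
                return false;                 /* monotonic: exhausted */
            next = s->min_value;              /* cyclic: wrap to minimum */
        } else {
            next = s->last_value + s->increment;
        }
    } else {
        if (s->last_value < s->min_value - s->increment) {
            if (!s->cycle)
                return false;
            next = s->max_value;              /* cyclic: wrap to maximum */
        } else {
            next = s->last_value + s->increment;
        }
    }
    s->last_value = next;
    s->is_called = true;
    *out = next;
    return true;
}
```

Under this model, a yields 5, 3, 1, -1 and then stops, while b yields 2, 3, 4, 1, 2, ... indefinitely.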

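The values produced by the two Sequence objects as the number of requests grows are shown below:

Monotonic sequence and cyclic sequence

 PostgreSQL | Oracle    | SQLSERVER | MySQL                       | MariaDB   | DB2       | Sybase                      | Hive
 Supported  | Supported | Supported | Auto-increment columns only | Supported | Supported | Auto-increment columns only | Not supported

To understand Sequence objects in PostgreSQL in more depth, we first look at how they are used, and then use that to examine the design behind them.

Usage

PostgreSQL provides a rich set of Sequence interfaces, as well as ways of combining them with other features, to cover a wide range of developer needs.

SQL Interfaces

PostgreSQL offers table-like access to Sequence objects, namely DQL, DML, and DDL. The figure below gives an overview of the SQL interfaces exposed.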

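SQL interfaces

The interfaces are described one by one below:

currval

This interface returns the value most recently obtained from the specified Sequence in the current session.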

postgres=# select nextval('seq');
+ nextval
+---------
+       2
+(1 row)
+
+postgres=# select currval('seq');
+ currval
+---------
+       2
+(1 row)
+

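Note that nextval must have been called at least once in the session before currval can be used; otherwise an error reports that the target Sequence is not yet defined in this session.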

postgres=# select currval('seq');
+ERROR:  currval of sequence "seq" is not yet defined in this session
+

lastval

This interface returns the value most recently obtained from any Sequence in the current session.

postgres=# select nextval('seq');
+ nextval
+---------
+       3
+(1 row)
+
+postgres=# select lastval();
+ lastval
+---------
+       3
+(1 row)
+

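Likewise, for the session to know which Sequence object was used last, nextval('seq') must be called once so that the session records the most recently used Sequence object in a global variable.

lastval and currval differ only in their parameters: currval takes the name of a Sequence that has already been accessed, whereas lastval takes no argument and always refers to the most recently used Sequence object.

nextval

This interface fetches the next value of a Sequence object.

Calling nextval makes the database return the Sequence object's current value advanced by increment, and store the advanced value as the Sequence object's new current value.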

postgres=# CREATE SEQUENCE seq start with 1 increment 2;
+CREATE SEQUENCE
+postgres=# select nextval('seq');
+ nextval
+---------
+       1
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+       3
+(1 row)
+

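increment is called the step of the Sequence object; every nextval request advances by one step. Note also that the first value obtained after a Sequence object is created is the one defined by its start value. When START is not specified, PostgreSQL defaults it to MINVALUE for an ascending sequence and to MAXVALUE for a descending one, so with the default bounds the rule is:

$$start_value = 1 \\quad (increment > 0)$$ $$start_value = -1 \\quad (increment < 0)$$

In addition, nextval is a special kind of DML that is not protected by transactions: a value that has been handed out is never rolled back.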

postgres=# BEGIN;
+BEGIN
+postgres=# select nextval('seq');
+ nextval
+---------
+       1
+(1 row)
+
+postgres=# ROLLBACK;
+ROLLBACK
+postgres=# select nextval('seq');
+ nextval
+---------
+       2
+(1 row)
+

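To give Sequence objects good concurrency, PostgreSQL does not update them with multi-versioning; it updates them in place instead. Skipping transactional protection in this way is the common practice in nearly all RDBMSs that support Sequence objects, and it is what makes a Sequence a special kind of table-level object.

setval

This interface sets the value of a Sequence object.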

postgres=# select nextval('seq');
+ nextval
+---------
+       4
+(1 row)
+
+postgres=# select setval('seq', 1);
+ setval
+--------
+      1
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+       2
+(1 row)
+

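This method moves the Sequence object's value to the given position and, by default, treats that value as already handed out, so the next nextval returns the value after it. If the given value itself should be returned by the next nextval, pass false as an additional argument.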

postgres=# select nextval('seq');
+ nextval
+---------
+       4
+(1 row)
+
+postgres=# select setval('seq', 1, false);
+ setval
+--------
+      1
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+       1
+(1 row)
+

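SQL interfaces

When setval sets the value of a Sequence object, it also sets the object's is_called attribute. nextval then uses is_called to decide whether to return the value that was set. That is, if is_called is false, nextval sets is_called to true and returns the stored value instead of applying the increment.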
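
The interaction between setval, is_called, and nextval can be written down as a tiny C model. The names are hypothetical and bounds, caching, and WAL are left out on purpose; this only mirrors the rule stated in the paragraph above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int64_t last_value;
    int64_t increment;
    bool    is_called;
} DemoSeqState;

/* Model of setval(seq, value, is_called). */
static void demo_setval(DemoSeqState *s, int64_t value, bool is_called)
{
    assert(s != NULL);
    s->last_value = value;
    s->is_called = is_called;
}

/* Model of nextval(seq): if the stored value was never handed out,
 * return it and flip is_called; otherwise advance by one step. */
static int64_t demo_nextval(DemoSeqState *s)
{
    assert(s != NULL);
    if (!s->is_called) {
        s->is_called = true;
        return s->last_value;
    }
    s->last_value += s->increment;
    return s->last_value;
}
```

With an increment of 1, `demo_setval(&s, 1, true)` followed by `demo_nextval(&s)` gives 2, while `demo_setval(&s, 1, false)` followed by `demo_nextval(&s)` gives 1, matching the two psql sessions shown earlier.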

CREATE/ALTER SEQUENCE

CREATE and ALTER SEQUENCE create and modify Sequence objects. Sequence properties are also set through these two statements. Some properties were introduced briefly above; each is described in detail below.

CREATE [ TEMPORARY | TEMP ] SEQUENCE [ IF NOT EXISTS ] name
+    [ AS data_type ]
+    [ INCREMENT [ BY ] increment ]
+    [ MINVALUE minvalue | NO MINVALUE ] [ MAXVALUE maxvalue | NO MAXVALUE ]
+    [ START [ WITH ] start ] [ CACHE cache ] [ [ NO ] CYCLE ]
+    [ OWNED BY { table_name.column_name | NONE } ]
+ALTER SEQUENCE [ IF EXISTS ] name
+    [ AS data_type ]
+    [ INCREMENT [ BY ] increment ]
+    [ MINVALUE minvalue | NO MINVALUE ] [ MAXVALUE maxvalue | NO MAXVALUE ]
+    [ START [ WITH ] start ]
+    [ RESTART [ [ WITH ] restart ] ]
+    [ CACHE cache ] [ [ NO ] CYCLE ]
+    [ OWNED BY { table_name.column_name | NONE } ]
+
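  • AS: sets the data type of the Sequence; only smallint, integer, and bigint are allowed, and the default is bigint. The type also bounds the range in which minvalue and maxvalue may be set (note that it only bounds them and does not set them; the configured range must not exceed the range of the data type).
  • INCREMENT: the step, i.e. the amount by which nextval advances the sequence value; the default is 1.
  • MINVALUE / NO MINVALUE: sets / does not set the minimum value of the Sequence object. If not set, the default is 1 for an ascending sequence and the minimum of the data type for a descending one, e.g. PG_INT64_MIN (-9223372036854775808) for bigint.
  • MAXVALUE / NO MAXVALUE: sets / does not set the maximum value of the Sequence object. If not set, the default is the maximum of the data type for an ascending sequence and -1 for a descending one.
  • START: the initial value of the Sequence object; it must lie between MINVALUE and MAXVALUE.
  • RESTART: used with ALTER to reset the Sequence object's current value; it defaults to the start value.
  • CACHE: sets the cache size used by the Sequence object; if not specified, it defaults to 1, which means no values are cached ahead.
  • OWNED BY: associates the Sequence object with a column of a table; when the column is dropped, the Sequence object is dropped as well.

Sequence Rollback in a Special Case

The following shows a scenario in which a sequence is rolled back.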

CREATE SEQUENCE
+postgres=# BEGIN;
+BEGIN
+postgres=# ALTER SEQUENCE seq maxvalue 10;
+ALTER SEQUENCE
+postgres=# select nextval('seq');
+ nextval
+---------
+       1
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+       2
+(1 row)
+
+postgres=# ROLLBACK;
+ROLLBACK
+postgres=# select nextval('seq');
+ nextval
+---------
+       1
+(1 row)
+

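Unlike what was described earlier, here the Sequence object is protected by the transaction and its value is rolled back. What the transaction actually protects is ALTER SEQUENCE (DDL), not nextval (DML): the rollback restores the Sequence object to its state before ALTER SEQUENCE, which is why the sequence value appears to roll back.

DROP/TRUNCATE

  • DROP SEQUENCE, as the name suggests, removes a Sequence object from the database.
  • TRUNCATE, more precisely, restarts a Sequence through TRUNCATE TABLE (with RESTART IDENTITY, as in the example below).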
postgres=# CREATE TABLE tbl_iden (i INTEGER, j int GENERATED ALWAYS AS IDENTITY);
+CREATE TABLE
+postgres=# insert into tbl_iden values (100);
+INSERT 0 1
+postgres=# insert into tbl_iden values (1000);
+INSERT 0 1
+postgres=# select * from tbl_iden;
+  i   | j
+------+---
+  100 | 1
+ 1000 | 2
+(2 rows)
+
+postgres=# TRUNCATE TABLE tbl_iden RESTART IDENTITY;
+TRUNCATE TABLE
+postgres=# insert into tbl_iden values (1234);
+INSERT 0 1
+postgres=# select * from tbl_iden;
+  i   | j
+------+---
+ 1234 | 1
+(1 row)
+

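This is equivalent to running ALTER SEQUENCE RESTART when the table is truncated.

Combining Sequence with Other Features

Besides being used as a standalone object, a Sequence can be combined with other PostgreSQL features. Several common scenarios are summarized below.

Combined invocation

Explicit invocation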

CREATE SEQUENCE seq;
+CREATE TABLE tbl (i INTEGER PRIMARY KEY);
+INSERT INTO tbl (i) VALUES (nextval('seq'));
+SELECT * FROM tbl ORDER BY 1 DESC;
+ i
+---
+ 1
+(1 row)
+

Trigger invocation

CREATE SEQUENCE seq;
+CREATE TABLE tbl (i INTEGER PRIMARY KEY, j INTEGER);
+CREATE FUNCTION f()
+RETURNS TRIGGER AS
+$$
+BEGIN
+NEW.i := nextval('seq');
+RETURN NEW;
+END;
+$$
+LANGUAGE 'plpgsql';
+
+CREATE TRIGGER tg
+BEFORE INSERT ON tbl
+FOR EACH ROW
+EXECUTE PROCEDURE f();
+
+INSERT INTO tbl (j) VALUES (4);
+
+SELECT * FROM tbl;
+ i | j
+---+---
+ 1 | 4
+(1 row)
+

DEFAULT invocation

Explicit DEFAULT invocation:

CREATE SEQUENCE seq;
+CREATE TABLE tbl(i INTEGER DEFAULT nextval('seq') PRIMARY KEY, j INTEGER);
+
+INSERT INTO tbl (i,j) VALUES (DEFAULT,11);
+INSERT INTO tbl(j) VALUES (321);
+INSERT INTO tbl (i,j) VALUES (nextval('seq'),1);
+
+SELECT * FROM tbl;
+ i |  j
+---+-----
+ 2 | 321
+ 1 |  11
+ 3 |   1
+(3 rows)
+

SERIAL invocation:

CREATE TABLE tbl (i SERIAL PRIMARY KEY, j INTEGER);
+INSERT INTO tbl (i,j) VALUES (DEFAULT,42);
+
+INSERT INTO tbl (j) VALUES (25);
+
+SELECT * FROM tbl;
+ i | j
+---+----
+ 1 | 42
+ 2 | 25
+(2 rows)
+

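Note that SERIAL is not a real type but another form of DEFAULT invocation; the difference is that SERIAL automatically creates the Sequence used by the DEFAULT constraint.

AUTO_INC invocation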

CREATE TABLE tbl (i int GENERATED ALWAYS AS IDENTITY,
+                  j INTEGER);
+INSERT INTO tbl(i,j) VALUES (DEFAULT,32);
+
+INSERT INTO tbl(j) VALUES (23);
+
+SELECT * FROM tbl;
+ i | j
+---+----
+ 1 | 32
+ 2 | 23
+(2 rows)
+

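An AUTO_INC invocation attaches an auto-increment (identity) constraint to the column. Unlike a default constraint, the identity constraint locates the Sequence associated with the column by looking up its dependency, whereas a DEFAULT invocation merely sets the default value to a nextval expression.

Internals

How a Sequence Is Described in the System Catalog and in Its Data Table

PostgreSQL has a system catalog dedicated to Sequence information, pg_sequence. Its structure is as follows: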

postgres=# \\d pg_sequence
+             Table "pg_catalog.pg_sequence"
+    Column    |  Type   | Collation | Nullable | Default
+--------------+---------+-----------+----------+---------
+ seqrelid     | oid     |           | not null |
+ seqtypid     | oid     |           | not null |
+ seqstart     | bigint  |           | not null |
+ seqincrement | bigint  |           | not null |
+ seqmax       | bigint  |           | not null |
+ seqmin       | bigint  |           | not null |
+ seqcache     | bigint  |           | not null |
+ seqcycle     | boolean |           | not null |
+Indexes:
+    "pg_sequence_seqrelid_index" PRIMARY KEY, btree (seqrelid)
+

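As shown, pg_sequence records all of a Sequence's properties, which are set by CREATE/ALTER SEQUENCE. nextval and setval frequently open this catalog and follow the rules recorded in it.

The sequence data itself is implemented on top of a heap table with three columns. Its structure is as follows: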

typedef struct FormData_pg_sequence_data
+{
+    int64		last_value;
+    int64		log_cnt;
+    bool		is_called;
+} FormData_pg_sequence_data;
+
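  • last_value records the current value of the Sequence. We call it the page value (to distinguish it from the cached value discussed later)
  • log_cnt records how many extra values nextval has pre-logged to WAL in advance; this is covered in detail in the section on the sequence allocation mechanism.
  • is_called marks whether the Sequence's last_value has already been handed out. For example, setval can set the is_called field: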
-- setval false
+postgres=# select setval('seq', 10, false);
+ setval
+--------
+     10
+(1 row)
+
+postgres=# select * from seq;
+ last_value | log_cnt | is_called
+------------+---------+-----------
+         10 |       0 | f
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+      10
+(1 row)
+
+-- setval true
+postgres=# select setval('seq', 10, true);
+ setval
+--------
+     10
+(1 row)
+
+postgres=# select * from seq;
+ last_value | log_cnt | is_called
+------------+---------+-----------
+         10 |       0 | t
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+      11
+(1 row)
+

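Whenever a user creates a Sequence object, PostgreSQL creates a heap table with the structure above to hold the object's data. When nextval or setval changes the sequence value, PostgreSQL updates the three fields of the single row in that heap table in place.

Taking setval as an example, the following logic shows how the in-place update works.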

static void
+do_setval(Oid relid, int64 next, bool iscalled)
+{
+
+    /* Open the Sequence heap table and lock it */
+    init_sequence(relid, &elm, &seqrel);
+
+    ...
+
+    /* Lock the buffer and extract the tuple */
+    seq = read_seq_tuple(seqrel, &buf, &seqdatatuple);
+
+    ...
+
+    /* Update the tuple in place */
+    seq->last_value = next;		/* last fetched number */
+    seq->is_called = iscalled;
+    seq->log_cnt = 0;
+
+    ...
+
+    /* Release the buffer lock and close the relation */
+    UnlockReleaseBuffer(buf);
+    relation_close(seqrel, NoLock);
+}
+

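As shown, do_setval modifies the tuple in the Sequence heap table directly, instead of updating it by delete plus insert as an ordinary heap table would. nextval follows a similar path, except that last_value is computed rather than supplied by the user.

Sequence Allocation Mechanism

With the in-kernel representation of a Sequence object explained, the next question is how a value is handed out, i.e. the nextval method. Its implementation is the nextval_internal function in sequence.c, and its core job is to compute last_value and log_cnt.

The relationship between last_value and log_cnt is shown in the figure below:

Relationship between the page value and WAL

Here log_cnt is a reserved number of requests. Its default is 32, determined by the following macro: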

/*
+ * We don't want to log each fetching of a value from a sequence,
+ * so we pre-log a few fetches in advance. In the event of
+ * crash we can lose (skip over) as many values as we pre-logged.
+ */
+#define SEQ_LOG_VALS	32
+

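Each time last_value advances by one increment, log_cnt is decremented by 1.

Page value increment

When log_cnt reaches 0, or after a checkpoint has occurred, a WAL record is written. The page value stored in the WAL record is set by the formula below, and log_cnt is reset to SEQ_LOG_VALS.

$$wal_value = last_value + increment \\times SEQ_LOG_VALS$$

This way, PostgreSQL does not have to write WAL every time nextval modifies last_value in the page. If every nextval call has to modify the page value, this optimization reduces the frequency of WAL writes by a factor of about 32. The cost is that a range of values is lost if the server crashes before a checkpoint has taken place, as shown below: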

postgres=# create sequence seq;
+CREATE SEQUENCE
+postgres=# select nextval('seq');
+ nextval
+---------
+       1
+(1 row)
+
+postgres=# select * from seq;
+ last_value | log_cnt | is_called
+------------+---------+-----------
+          1 |      32 | t
+(1 row)
+
+-- crash and restart
+
+postgres=# select * from seq;
+ last_value | log_cnt | is_called
+------------+---------+-----------
+         33 |       0 | t
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+      34
+(1 row)
+

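Clearly, after the crash the Sequence object has a gap covering 2 through 33. This cost is acceptable because the Sequence still never hands out a duplicate value, and in return the frequency of WAL writes drops considerably in such workloads.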
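
The pre-logging behavior and the resulting gap can be reproduced with a simplified C model. It is a sketch under assumptions, not nextval_internal itself: the `Demo*` names are hypothetical, CACHE is fixed at 1, bounds and cycling are ignored, and the "WAL was not yet written since the last checkpoint" trigger is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEQ_LOG_VALS 32

typedef struct {
    int64_t last_value;
    int64_t log_cnt;
    bool    is_called;
} DemoSeqData;

typedef struct {
    DemoSeqData page;        /* current in-memory page image */
    DemoSeqData wal;         /* image stored in the latest WAL record */
    int64_t     increment;
    unsigned    wal_writes;  /* WAL records written by demo_nextval */
} DemoSeq;

static void demo_seq_create(DemoSeq *s, int64_t start, int64_t increment)
{
    assert(increment != 0);
    s->page.last_value = start;
    s->page.log_cnt = 0;
    s->page.is_called = false;
    s->wal = s->page;        /* creation itself is logged */
    s->increment = increment;
    s->wal_writes = 0;
}

static int64_t demo_nextval(DemoSeq *s)
{
    DemoSeqData *p = &s->page;
    int64_t result = p->is_called ? p->last_value + s->increment
                                  : p->last_value;

    if (!p->is_called || p->log_cnt < 1) {
        /* Pre-log SEQ_LOG_VALS values ahead of the one being returned. */
        s->wal.last_value = result + s->increment * SEQ_LOG_VALS;
        s->wal.log_cnt = 0;
        s->wal.is_called = true;
        s->wal_writes++;
        p->log_cnt = SEQ_LOG_VALS;
    } else {
        p->log_cnt--;        /* consume one pre-logged value, no WAL */
    }
    p->last_value = result;
    p->is_called = true;
    return result;
}

/* After a crash the page is rebuilt from the latest WAL image. */
static void demo_crash_recover(DemoSeq *s)
{
    s->page = s->wal;
}
```

Calling `demo_nextval` once on a fresh sequence returns 1 and leaves the page at (1, 32, true); after `demo_crash_recover` the page is (33, 0, true) and the next call returns 34, which is the 2 through 33 gap shown in the transcript.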

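Sequence Cache Mechanism

As described above, every sequence request has to take a buffer lock to modify the page, which means the concurrency of a Sequence is relatively poor.

To address this, PostgreSQL uses a Session Cache that caches a range of values in advance to improve concurrency, as shown below:

Session Cache

The Sequence Session Cache is implemented as a hash table created with an initial size of 16 entries. It is looked up by the Sequence's OID to find the values already cached; the cached value structure is as follows: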

typedef struct SeqTableData
+{
+    Oid			relid;			/* Sequence OID(hash key) */
+    int64		last;			/* value last returned by nextval */
+    int64		cached;			/* last value already cached for nextval */
+    int64		increment;		/* copy of sequence's increment field */
+} SeqTableData;
+

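Here last is the Sequence's current value within the session (current_value), cached is the last value cached for the session (cached_value), and increment is the step. These three values are enough to support basic Sequence caching.

The relationship between the Sequence Session Cache and the page value is shown below:

Relationship between the cache and the page

Similar to log_cnt, cache_cnt is the cache size the user set when defining the Sequence, with a minimum of 1. Only when the values in the cache domain are used up is the buffer locked and the Sequence's page value modified. The adjustment works as follows:

Cache refill

For example, if CACHE is set to 20, then once the cache is used up the session tries to lock the buffer, adjusts the page value, and fetches another 20 increments into the cache. For the figure above, the following relations hold:

$$cached_value = NEW\\ current_value$$ $$NEW\\ current_value+20\\times INC=NEW\\ cached_value$$ $$NEW\\ last_value = NEW\\ cached_value$$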
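
A simplified C model of the session cache shows when the page is touched. It is a sketch only: the `Demo*` names are hypothetical, and bounds, cycling, and WAL pre-logging are ignored. It illustrates the two consequences of the relations above: each refill moves the session's cached value forward by 20 increments, and the page's last_value is set to the new cached value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int64_t  last_value;      /* page value */
    bool     is_called;
    unsigned buffer_updates;  /* times the page had to be locked and changed */
} DemoSeqPage;

typedef struct {
    int64_t last;             /* value last returned in this session */
    int64_t cached;           /* last value already cached for this session */
    bool    valid;            /* has this session fetched anything yet? */
} DemoSession;

static int64_t demo_nextval(DemoSession *sess, DemoSeqPage *page,
                            int64_t increment, int64_t cache)
{
    assert(cache >= 1);

    /* Fast path: values remain in the session cache, page is untouched. */
    if (sess->valid && sess->last != sess->cached) {
        sess->last += increment;
        return sess->last;
    }

    /* Slow path: cache used up, so lock the buffer (omitted here) and
     * reserve `cache` values in one step. */
    int64_t first = page->is_called ? page->last_value + increment
                                    : page->last_value;
    int64_t last = first + (cache - 1) * increment;

    page->last_value = last;  /* page value jumps to the end of the batch */
    page->is_called = true;
    page->buffer_updates++;

    sess->last = first;
    sess->cached = last;
    sess->valid = true;
    return first;
}
```

With CACHE 20 and an increment of 1, the first 20 calls in a session modify the page once, and the 21st call refills the cache, moving both the cached value and the page value from 20 to 40. A second session calling nextval in between would start at 21, so values from different sessions are unique but not ordered by time.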

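With the Sequence Session Cache, the concurrency of the nextval method improves substantially. The following compares pgbench stress-test results.

Performance comparison

Summary

A Sequence is a special kind of table-level object in PostgreSQL that offers a simple yet rich set of SQL interfaces, making it easy for users to create and use customized sequence objects. A Sequence can also be combined with other features in many ways, which greatly broadens where it can be used.

This article described the design of Sequence objects in the PostgreSQL kernel. Starting from the object's metadata and data representations, it introduced what a Sequence object consists of. It then covered nextval, the core SQL interface of a Sequence, in three respects: computing the sequence value, updating in place, and reducing WAL writes. Finally, it explained how the Sequence Session Cache works, described how sequence values are computed and aligned in the cache and in the page once the cache is introduced, and compared nextval with and without the cache under single-sequence and multi-sequence concurrency.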

',114);function S(s,_){const a=p("ArticleInfo");return o(),t("div",null,[h,c(a,{frontmatter:s.$frontmatter},null,8,["frontmatter"]),y])}const A=e(w,[["render",S],["__file","polar-sequence-tech.html.vue"]]);export{A as default}; diff --git a/assets/polar-sequence-tech.html-a8c531ba.js b/assets/polar-sequence-tech.html-a8c531ba.js new file mode 100644 index 00000000000..a588efd3c72 --- /dev/null +++ b/assets/polar-sequence-tech.html-a8c531ba.js @@ -0,0 +1,340 @@ +import{_ as e,r as p,o,c as t,d as c,a as n,b as l,e as i}from"./app-3d1677bf.js";const d="/PolarDB-for-PostgreSQL/assets/polar_sequence_monotonic_cyclic-5db3b890.png",r="/PolarDB-for-PostgreSQL/assets/polar_sequence_sql_interface-d4586fc0.png",u="/PolarDB-for-PostgreSQL/assets/polar_sequence_is_called-d65d8316.png",k="/PolarDB-for-PostgreSQL/assets/polar_sequence_called-ce740e57.png",v="/PolarDB-for-PostgreSQL/assets/polar_sequence_alignment_no_cache-728e2b9f.png",m="/PolarDB-for-PostgreSQL/assets/polar_sequence_alignment_desc_1-07cebb0a.png",b="/PolarDB-for-PostgreSQL/assets/polar_sequence_session_cache-81dce2b3.png",E="/PolarDB-for-PostgreSQL/assets/polar_sequence_alignment_cache-89a2a0c4.png",g="/PolarDB-for-PostgreSQL/assets/polar_sequence_alignment_cache_1-695685fe.png",q="/PolarDB-for-PostgreSQL/assets/polar_sequence_performance_comparison-5432b1d6.png",w={},h=n("h1",{id:"sequence-使用、原理全面解析",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#sequence-使用、原理全面解析","aria-hidden":"true"},"#"),l(" Sequence 使用、原理全面解析")],-1),y=i(`

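Introduction

A Sequence is a special table-level object in a database. Based on the attributes the user sets, it produces a series of integers that follow a rule, and so acts as a number generator.

In practice, a never-repeating Sequence can serve as the primary key of a table, and several tables can share one Sequence to count the total number of rows inserted across them. According to the ANSI standard, a Sequence object in a database has the following characteristics:

  1. It is an independent database object (CREATE SEQUENCE), at the same level as tables and views.
  2. Its generation attributes are configurable: start value, step (increment), maximum/minimum value (max/min), cycling (cycle), caching (cache), and so on.
  3. It increases or decreases from its current value, and the current value is initialized to the start value.
  4. With cycling enabled, the current value changes periodically. Without cycling, it changes monotonically and cannot change again once it reaches its limit.

To illustrate these characteristics, we define two sequences, a and b, and look at how each behaves.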

CREATE SEQUENCE a start with 5 minvalue -1 maxvalue 5 increment -2;
+CREATE SEQUENCE b start with 2 minvalue 1 maxvalue 4 cycle;
+

As values are requested, the two Sequence objects return the following series:

(Figure: a monotonic sequence and a cyclic sequence)
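The stepping behaviour of the two sequences can be sketched in a few lines of C. This is a toy model, not PostgreSQL code: `ToySeq`, `toy_seq_init`, and `toy_seq_next` are hypothetical names, and sequence a is assumed to have an upper bound of 5.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified sequence definition; not a PostgreSQL struct. */
typedef struct {
    int64_t increment;   /* step, may be negative */
    int64_t min_value;
    int64_t max_value;
    bool    cycle;       /* wrap around at the limit? */
    int64_t current;     /* value returned by the last request */
    bool    started;     /* false until the first request */
} ToySeq;

static ToySeq toy_seq_init(int64_t start, int64_t inc,
                           int64_t min, int64_t max, bool cycle)
{
    ToySeq s = { inc, min, max, cycle, start, false };
    return s;
}

/* Stores the next value in *out and returns true, or returns false
 * when a non-cycling sequence has reached its limit. */
static bool toy_seq_next(ToySeq *s, int64_t *out)
{
    if (!s->started) {
        s->started = true;                 /* first request yields the start value */
    } else if (s->increment > 0) {
        if (s->current > s->max_value - s->increment) {   /* would pass max */
            if (!s->cycle)
                return false;
            s->current = s->min_value;     /* ascending cycle restarts at min */
        } else {
            s->current += s->increment;
        }
    } else {
        if (s->current < s->min_value - s->increment) {   /* would pass min */
            if (!s->cycle)
                return false;
            s->current = s->max_value;     /* descending cycle restarts at max */
        } else {
            s->current += s->increment;
        }
    }
    *out = s->current;
    return true;
}
```

Under these assumptions, a model of sequence a yields 5, 3, 1, -1 and then refuses further requests, while a model of sequence b yields 2, 3, 4, 1, 2, and so on.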

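 Database   | Sequence support
------------+-----------------------------
 PostgreSQL | Supported
 Oracle     | Supported
 SQL Server | Supported
 MySQL      | Auto-increment columns only
 MariaDB    | Supported
 DB2        | Supported
 Sybase     | Auto-increment columns only
 Hive       | Not supported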

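To understand Sequence objects in PostgreSQL more deeply, we first look at how a Sequence is used and then work back from that usage to the design behind it.

Usage

PostgreSQL provides a rich set of Sequence interfaces, as well as ways to combine a Sequence with other features, to cover what developers need.

SQL Interfaces

PostgreSQL offers table-like access to Sequence objects, namely DQL, DML, and DDL. The figure below gives an overview of the SQL interfaces exposed.

(Figure: SQL interfaces)

The interfaces are described one by one below.

currval

This interface returns the value most recently obtained by the current Session from the given Sequence.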

postgres=# select nextval('seq');
+ nextval
+---------
+       2
+(1 row)
+
+postgres=# select currval('seq');
+ currval
+---------
+       2
+(1 row)
+

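Note that nextval must have been called on the Sequence at least once in the current Session before currval can be used. Otherwise, an error reports that the Sequence is not yet defined in this Session.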

postgres=# select currval('seq');
+ERROR:  currval of sequence "seq" is not yet defined in this session
+

lastval

This interface returns the value most recently obtained by the current Session from any Sequence.

postgres=# select nextval('seq');
+ nextval
+---------
+       3
+(1 row)
+
+postgres=# select lastval();
+ lastval
+---------
+       3
+(1 row)
+

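Likewise, for the Session to know which Sequence object was used last, nextval('seq') must be called once first, so that the Session records the most recently used Sequence object in a global variable.

lastval and currval differ only in their arguments: currval must name a Sequence object that has already been accessed, whereas lastval takes no argument and always refers to the most recently used Sequence object.

nextval

This interface fetches the next value of a Sequence object.

Calling nextval makes the database return the Sequence object's current value advanced by increment, and store the advanced value as the new current value.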

postgres=# CREATE SEQUENCE seq start with 1 increment 2;
+CREATE SEQUENCE
+postgres=# select nextval('seq');
+ nextval
+---------
+       1
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+       3
+(1 row)
+

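increment is called the step of the Sequence object: every request made through nextval advances by one step. Note also that the first value obtained after a Sequence object is created is its start value. PostgreSQL applies the following rule for the default start value:

$$start_value = 1, if:increment > 0;$$ $$start_value = -1,if:increment < 0;$$

In addition, nextval is a special kind of DML that is not protected by transactions: a value that has been handed out is never rolled back.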

postgres=# BEGIN;
+BEGIN
+postgres=# select nextval('seq');
+ nextval
+---------
+       1
+(1 row)
+
+postgres=# ROLLBACK;
+ROLLBACK
+postgres=# select nextval('seq');
+ nextval
+---------
+       2
+(1 row)
+

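To give Sequence objects good concurrency, PostgreSQL does not update them through multi-versioning. It modifies them in place instead. Updating without transactional protection has become the common practice in almost every RDBMS that supports Sequence objects, and it is what makes a Sequence a special table-level object.

setval

This interface sets the value of a Sequence object.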

postgres=# select nextval('seq');
+ nextval
+---------
+       4
+(1 row)
+
+postgres=# select setval('seq', 1);
+ setval
+--------
+      1
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+       2
+(1 row)
+

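This call moves the Sequence object to the given value and treats that value as already handed out, so the next nextval returns the value after it. To have the given value itself returned by the next nextval, pass false as the third argument.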

postgres=# select nextval('seq');
+ nextval
+---------
+       4
+(1 row)
+
+postgres=# select setval('seq', 1, false);
+ setval
+--------
+      1
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+       1
+(1 row)
+

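(Figure: SQL interfaces)

Besides setting the value of the Sequence object, setval also sets its is_called attribute. nextval then uses is_called to decide whether to return the value that was set. That is, if is_called is false, nextval sets is_called to true and returns the stored value instead of applying the increment.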
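The interaction between setval and nextval through is_called can be sketched as below. This is a minimal model under stated assumptions: `ToySeqData`, `toy_setval`, and `toy_nextval` are hypothetical names, and bounds checking, caching, and WAL are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the fields that matter for is_called. */
typedef struct {
    int64_t last_value;
    bool    is_called;
    int64_t increment;
} ToySeqData;

/* setval(seq, value, is_called): store both the value and the flag. */
static int64_t toy_setval(ToySeqData *seq, int64_t value, bool is_called)
{
    seq->last_value = value;
    seq->is_called = is_called;
    return value;
}

/* nextval: if the stored value was never handed out, return it as-is
 * and only flip the flag; otherwise advance by one increment. */
static int64_t toy_nextval(ToySeqData *seq)
{
    if (!seq->is_called) {
        seq->is_called = true;
        return seq->last_value;
    }
    seq->last_value += seq->increment;
    return seq->last_value;
}
```

With an increment of 1, `toy_setval(&s, 1, true)` followed by `toy_nextval(&s)` gives 2, while `toy_setval(&s, 1, false)` followed by `toy_nextval(&s)` gives 1, which mirrors the two psql sessions above.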

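CREATE/ALTER SEQUENCE

CREATE and ALTER SEQUENCE create and modify Sequence objects. The Sequence attributes are also set through CREATE and ALTER SEQUENCE. Some of them were introduced briefly above, and each is described in detail below.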

CREATE [ TEMPORARY | TEMP ] SEQUENCE [ IF NOT EXISTS ] name
+    [ AS data_type ]
+    [ INCREMENT [ BY ] increment ]
+    [ MINVALUE minvalue | NO MINVALUE ] [ MAXVALUE maxvalue | NO MAXVALUE ]
+    [ START [ WITH ] start ] [ CACHE cache ] [ [ NO ] CYCLE ]
+    [ OWNED BY { table_name.column_name | NONE } ]
+ALTER SEQUENCE [ IF EXISTS ] name
+    [ AS data_type ]
+    [ INCREMENT [ BY ] increment ]
+    [ MINVALUE minvalue | NO MINVALUE ] [ MAXVALUE maxvalue | NO MAXVALUE ]
+    [ START [ WITH ] start ]
+    [ RESTART [ [ WITH ] restart ] ]
+    [ CACHE cache ] [ [ NO ] CYCLE ]
+    [ OWNED BY { table_name.column_name | NONE } ]
+
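  • AS: sets the data type of the Sequence, which can only be smallint, integer, or bigint, and defaults to bigint. The type also bounds the range allowed for minvalue and maxvalue (it only limits them and does not set them: the configured range must not exceed the range of the data type).
  • INCREMENT: the step by which nextval advances the sequence value. The default is 1.
  • MINVALUE / NO MINVALUE: sets or leaves unset the minimum value of the Sequence object. If unset, the default is 1 for an ascending sequence and the minimum of the data type for a descending one, for example PG_INT64_MIN (-9223372036854775808) for bigint.
  • MAXVALUE / NO MAXVALUE: sets or leaves unset the maximum value of the Sequence object. If unset, the default is the maximum of the data type for an ascending sequence and -1 for a descending one.
  • START: the initial value of the Sequence object. It must lie between MINVALUE and MAXVALUE.
  • RESTART: used with ALTER to reset the value of the Sequence object. It defaults to the start value.
  • CACHE: sets the Cache size used by the Sequence object. If unset, the default is 1.
  • OWNED BY: ties the Sequence object to a column of a table. When the column is dropped, the Sequence object is dropped as well.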
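The defaults in the list above can be summarised in a short C sketch. It is an illustration only: `SeqType`, `SeqDefaults`, and `seq_defaults` are hypothetical names, and the rules encoded are the documented PostgreSQL defaults (ascending: min 1, max of the type, start at min; descending: min of the type, max -1, start at max).

```c
#include <stdint.h>

/* Hypothetical names; this is not the PostgreSQL implementation. */
typedef enum { SEQ_SMALLINT, SEQ_INTEGER, SEQ_BIGINT } SeqType;

typedef struct {
    int64_t min_value;
    int64_t max_value;
    int64_t start;
} SeqDefaults;

/* Compute the defaults used when MINVALUE, MAXVALUE and START are omitted. */
static SeqDefaults seq_defaults(SeqType type, int64_t increment)
{
    int64_t type_min, type_max;
    SeqDefaults d;

    switch (type) {
        case SEQ_SMALLINT: type_min = INT16_MIN; type_max = INT16_MAX; break;
        case SEQ_INTEGER:  type_min = INT32_MIN; type_max = INT32_MAX; break;
        default:           type_min = INT64_MIN; type_max = INT64_MAX; break;
    }

    if (increment > 0) {           /* ascending sequence */
        d.min_value = 1;
        d.max_value = type_max;
        d.start = d.min_value;
    } else {                       /* descending sequence */
        d.min_value = type_min;
        d.max_value = -1;
        d.start = d.max_value;
    }
    return d;
}
```

For example, a descending smallint sequence gets the range [-32768, -1] and starts at -1, which agrees with the default start value rule given for nextval.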

Sequence Rollback in a Special Case

The following shows a case in which a sequence is rolled back:

CREATE SEQUENCE
+postgres=# BEGIN;
+BEGIN
+postgres=# ALTER SEQUENCE seq maxvalue 10;
+ALTER SEQUENCE
+postgres=# select nextval('seq');
+ nextval
+---------
+       1
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+       2
+(1 row)
+
+postgres=# ROLLBACK;
+ROLLBACK
+postgres=# select nextval('seq');
+ nextval
+---------
+       1
+(1 row)
+

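Unlike what was described earlier, here the Sequence object is protected by the transaction and the sequence value is rolled back. What the transaction actually protects is ALTER SEQUENCE (DDL), not nextval (DML). The rollback restores the Sequence object to its state before ALTER SEQUENCE, which is why the sequence value appears to roll back.

DROP/TRUNCATE

  • DROP SEQUENCE, as the name suggests, removes a Sequence object from the database.
  • TRUNCATE, more precisely TRUNCATE TABLE ... RESTART IDENTITY, restarts the Sequences owned by the table's columns.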
postgres=# CREATE TABLE tbl_iden (i INTEGER, j int GENERATED ALWAYS AS IDENTITY);
+CREATE TABLE
+postgres=# insert into tbl_iden values (100);
+INSERT 0 1
+postgres=# insert into tbl_iden values (1000);
+INSERT 0 1
+postgres=# select * from tbl_iden;
+  i   | j
+------+---
+  100 | 1
+ 1000 | 2
+(2 rows)
+
+postgres=# TRUNCATE TABLE tbl_iden RESTART IDENTITY;
+TRUNCATE TABLE
+postgres=# insert into tbl_iden values (1234);
+INSERT 0 1
+postgres=# select * from tbl_iden;
+  i   | j
+------+---
+ 1234 | 1
+(1 row)
+

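This is equivalent to running ALTER SEQUENCE RESTART while truncating the table.

Combined Usage of Sequence

Besides being used as a standalone object, a SEQUENCE can be combined with other PostgreSQL components. The most common cases are summarized below.

(Figure: combined usage)

Explicit call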

CREATE SEQUENCE seq;
+CREATE TABLE tbl (i INTEGER PRIMARY KEY);
+INSERT INTO tbl (i) VALUES (nextval('seq'));
+SELECT * FROM tbl ORDER BY 1 DESC;
+ i
+---
+ 1
+(1 row)
+

Trigger call

CREATE SEQUENCE seq;
+CREATE TABLE tbl (i INTEGER PRIMARY KEY, j INTEGER);
+CREATE FUNCTION f()
+RETURNS TRIGGER AS
+$$
+BEGIN
+NEW.i := nextval('seq');
+RETURN NEW;
+END;
+$$
+LANGUAGE 'plpgsql';
+
+CREATE TRIGGER tg
+BEFORE INSERT ON tbl
+FOR EACH ROW
+EXECUTE PROCEDURE f();
+
+INSERT INTO tbl (j) VALUES (4);
+
+SELECT * FROM tbl;
+ i | j
+---+---
+ 1 | 4
+(1 row)
+

DEFAULT call

Explicit DEFAULT call:

CREATE SEQUENCE seq;
+CREATE TABLE tbl(i INTEGER DEFAULT nextval('seq') PRIMARY KEY, j INTEGER);
+
+INSERT INTO tbl (i,j) VALUES (DEFAULT,11);
+INSERT INTO tbl(j) VALUES (321);
+INSERT INTO tbl (i,j) VALUES (nextval('seq'),1);
+
+SELECT * FROM tbl;
+ i |  j
+---+-----
+ 2 | 321
+ 1 |  11
+ 3 |   1
+(3 rows)
+

SERIAL call:

CREATE TABLE tbl (i SERIAL PRIMARY KEY, j INTEGER);
+INSERT INTO tbl (i,j) VALUES (DEFAULT,42);
+
+INSERT INTO tbl (j) VALUES (25);
+
+SELECT * FROM tbl;
+ i | j
+---+----
+ 1 | 42
+ 2 | 25
+(2 rows)
+

Note that SERIAL is not a real type. It is another form of the DEFAULT call, except that SERIAL automatically creates the Sequence used by the DEFAULT constraint.

AUTO_INC call

CREATE TABLE tbl (i int GENERATED ALWAYS AS IDENTITY,
+                  j INTEGER);
+INSERT INTO tbl(i,j) VALUES (DEFAULT,32);
+
+INSERT INTO tbl(j) VALUES (23);
+
+SELECT * FROM tbl;
+ i | j
+---+----
+ 1 | 32
+ 2 | 23
+(2 rows)
+

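The AUTO_INC call attaches an auto-increment (identity) constraint to the column. Unlike a default constraint, the identity constraint finds the Sequence associated with the column by looking up the dependency, whereas a default call merely sets the column default to a nextval expression.

Internals

How a Sequence Is Described in the System Catalog and in Its Data Table

PostgreSQL has a system catalog dedicated to Sequence information, pg_sequence. Its structure is as follows: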

postgres=# \\d pg_sequence
+             Table "pg_catalog.pg_sequence"
+    Column    |  Type   | Collation | Nullable | Default
+--------------+---------+-----------+----------+---------
+ seqrelid     | oid     |           | not null |
+ seqtypid     | oid     |           | not null |
+ seqstart     | bigint  |           | not null |
+ seqincrement | bigint  |           | not null |
+ seqmax       | bigint  |           | not null |
+ seqmin       | bigint  |           | not null |
+ seqcache     | bigint  |           | not null |
+ seqcycle     | boolean |           | not null |
+Indexes:
+    "pg_sequence_seqrelid_index" PRIMARY KEY, btree (seqrelid)
+

As shown, pg_sequence records all attributes of a Sequence. They are set by CREATE/ALTER SEQUENCE, and nextval and setval consult this catalog frequently to follow the rules it defines.

The sequence data itself is stored in a heap table with three columns, laid out as follows:

typedef struct FormData_pg_sequence_data
+{
+    int64		last_value;
+    int64		log_cnt;
+    bool		is_called;
+} FormData_pg_sequence_data;
+
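  • last_value records the current value of the Sequence. We call it the page value, to distinguish it from the cached value introduced later.
  • log_cnt records how many additional sequence values nextval has pre-logged to WAL. It is covered in detail in the analysis of the request mechanism.
  • is_called marks whether last_value has already been handed out. For example, setval can set the is_called field: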
-- setval false
+postgres=# select setval('seq', 10, false);
+ setval
+--------
+     10
+(1 row)
+
+postgres=# select * from seq;
+ last_value | log_cnt | is_called
+------------+---------+-----------
+         10 |       0 | f
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+      10
+(1 row)
+
+-- setval true
+postgres=# select setval('seq', 10, true);
+ setval
+--------
+     10
+(1 row)
+
+postgres=# select * from seq;
+ last_value | log_cnt | is_called
+------------+---------+-----------
+         10 |       0 | t
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+      11
+(1 row)
+

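Whenever a user creates a Sequence object, PostgreSQL creates a heap table with the structure above to hold its data. When nextval or setval changes the sequence value, PostgreSQL updates the three fields of the single row in that heap table in place.

Taking setval as an example, the following logic shows how the in-place update works.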

static void
+do_setval(Oid relid, int64 next, bool iscalled)
+{
+
+    /* Open the Sequence heap table and lock it */
+    init_sequence(relid, &elm, &seqrel);
+
+    ...
+
+    /* Lock the buffer and fetch the tuple */
+    seq = read_seq_tuple(seqrel, &buf, &seqdatatuple);
+
+    ...
+
+    /* Update the tuple in place */
+    seq->last_value = next;		/* last fetched number */
+    seq->is_called = iscalled;
+    seq->log_cnt = 0;
+
+    ...
+
+    /* Release the buffer lock and close the relation (NoLock keeps the relation lock) */
+    UnlockReleaseBuffer(buf);
+    relation_close(seqrel, NoLock);
+}
+

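As shown, do_setval writes directly to the single tuple in the Sequence heap table, rather than updating it through delete plus insert as an ordinary heap table would. nextval follows a similar process, except that last_value is computed rather than supplied by the user.

Analysis of the Sequence Request Mechanism

Having covered how a Sequence object is stored in the kernel, we now turn to how a sequence value is handed out, that is, the nextval method. It is implemented by the nextval_internal function in sequence.c, whose core job is to compute last_value and log_cnt.

The relationship between last_value and log_cnt is shown in the figure below:

(Figure: relationship between the page value and WAL)

Here log_cnt is a count of pre-logged requests. Its default is 32, as determined by the following macro: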

/*
+ * We don't want to log each fetching of a value from a sequence,
+ * so we pre-log a few fetches in advance. In the event of
+ * crash we can lose (skip over) as many values as we pre-logged.
+ */
+#define SEQ_LOG_VALS	32
+

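Each time last_value advances by one increment, log_cnt decreases by 1.

(Figure: page value advancing)

When log_cnt is 0, or after a checkpoint has occurred, a WAL record is written. The page value stored in the WAL record is set by the formula below, and log_cnt is reset to SEQ_LOG_VALS.

$$wal_value = last_value+increment*SEQ_LOG_VALS$$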

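With this scheme, PostgreSQL does not have to write a WAL record every time nextval changes last_value in the page. Assuming every nextval call has to modify the page value, the optimization cuts the number of WAL writes by a factor of about 32. The cost is that a range of sequence values is lost if the server crashes before the next checkpoint, as shown below: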

postgres=# create sequence seq;
+CREATE SEQUENCE
+postgres=# select nextval('seq');
+ nextval
+---------
+       1
+(1 row)
+
+postgres=# select * from seq;
+ last_value | log_cnt | is_called
+------------+---------+-----------
+          1 |      32 | t
+(1 row)
+
+-- crash and restart
+
+postgres=# select * from seq;
+ last_value | log_cnt | is_called
+------------+---------+-----------
+         33 |       0 | t
+(1 row)
+
+postgres=# select nextval('seq');
+ nextval
+---------
+      34
+(1 row)
+

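Clearly, after the crash the Sequence object has a gap covering values 2 through 33. This cost is acceptable because the Sequence still never hands out a duplicate value, and in return the WAL write frequency drops sharply in nextval-heavy workloads.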
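The pre-logging behaviour and the resulting gap can be reproduced with a small model. This is a sketch under simplifying assumptions, not the code of nextval_internal: the cache size is fixed at 1, the checkpoint trigger is ignored, and `SeqPageModel`, `seq_page_create`, `seq_page_nextval`, and `seq_page_crash_recover` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ_LOG_VALS 32

/* Hypothetical model of the sequence page plus the last WAL record. */
typedef struct {
    int64_t last_value;
    int64_t log_cnt;
    bool    is_called;
    int64_t increment;
    int64_t wal_value;      /* last_value stored in the latest WAL record */
    bool    wal_is_called;  /* is_called stored in the latest WAL record */
    int     wal_writes;     /* number of WAL records written by nextval */
} SeqPageModel;

static SeqPageModel seq_page_create(int64_t start, int64_t increment)
{
    /* Creation itself is logged with the start value and is_called = false. */
    SeqPageModel p = { start, 0, false, increment, start, false, 0 };
    return p;
}

static int64_t seq_page_nextval(SeqPageModel *p)
{
    /* A WAL record is needed on the first call and when log_cnt runs out. */
    bool need_wal = !p->is_called || p->log_cnt == 0;

    if (p->is_called)
        p->last_value += p->increment;
    p->is_called = true;

    if (need_wal) {
        /* wal_value = last_value + increment * SEQ_LOG_VALS */
        p->wal_value = p->last_value + p->increment * SEQ_LOG_VALS;
        p->wal_is_called = true;
        p->log_cnt = SEQ_LOG_VALS;
        p->wal_writes++;
    } else {
        p->log_cnt--;
    }
    return p->last_value;
}

/* After a crash the page is rebuilt from the latest WAL record. */
static void seq_page_crash_recover(SeqPageModel *p)
{
    p->last_value = p->wal_value;
    p->is_called = p->wal_is_called;
    p->log_cnt = 0;
}
```

In this model, one call to `seq_page_nextval` on a fresh sequence returns 1 and leaves log_cnt at 32. Calling `seq_page_crash_recover` then sets last_value to 33, and the next request returns 34, which matches the psql session above. A run of 34 requests without a crash writes only two WAL records.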

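Sequence Cache Mechanism

As described above, every sequence request has to take the buffer lock in order to modify the page, which means the concurrency of a Sequence is relatively poor.

To address this, PostgreSQL gives each Sequence a Session Cache that reserves a range of values in advance and so improves concurrency, as shown in the figure below:

(Figure: Session Cache)

The Sequence Session Cache is implemented as a hash table created with an initial size of 16 entries. It is keyed by the Sequence OID and looks up the values already cached for that Sequence. The cached value has the following structure: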

typedef struct SeqTableData
+{
+    Oid			relid;			/* Sequence OID(hash key) */
+    int64		last;			/* value last returned by nextval */
+    int64		cached;			/* last value already cached for nextval */
+    int64		increment;		/* copy of sequence's increment field */
+} SeqTableData;
+

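Here last is the current value of the Sequence within the Session (current_value), cached is the last value already reserved for the Session (cached_value), and increment records the step. These three values are enough to implement Sequence caching.

The relationship between the Sequence Session Cache and the page value is shown in the figure below:

(Figure: relationship between the cache and the page value)

Like log_cnt, cache_cnt is a count: it is the Cache size the user sets when defining the Sequence, with a minimum of 1. Only when the values in the cache domain are used up does the backend lock the buffer and modify the Sequence page value. The adjustment proceeds as follows:

(Figure: cache refill)

For example, if CACHE is set to 20, then once the cache is used up the backend locks the buffer, adjusts the page value, and reserves another 20 increments for the cache. For the figure above, the following relationships hold: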

$$cached_value = NEW\\ current_value$$ $$NEW\\ current_value+20\\times INC=NEW\\ cached_value$$ $$NEW\\ last_value = NEW\\ cached_value$$
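The refill logic can be sketched as follows. This is an illustrative model only: `SharedSeqPage`, `SessionSeqCache`, and `session_nextval` are hypothetical names, the page is assumed to have is_called set, and a counter stands in for taking the buffer lock.

```c
#include <stdint.h>

/* Hypothetical shared page state: the page value plus a lock counter. */
typedef struct {
    int64_t last_value;     /* page value: last value reserved by any session */
    int     buffer_locks;   /* how many times the buffer had to be locked */
} SharedSeqPage;

/* Hypothetical per-session cache, modelled on last / cached / increment. */
typedef struct {
    int64_t last;           /* value last returned to this session */
    int64_t cached;         /* last value already reserved for this session */
    int64_t increment;
    int64_t cache_cnt;      /* CACHE setting, at least 1 */
} SessionSeqCache;

static int64_t session_nextval(SessionSeqCache *c, SharedSeqPage *page)
{
    if (c->last != c->cached) {
        /* Values remain in the session cache: no page access needed. */
        c->last += c->increment;
        return c->last;
    }

    /* Cache exhausted: lock the buffer and reserve cache_cnt increments. */
    page->buffer_locks++;
    c->last = page->last_value + c->increment;
    c->cached = page->last_value + c->cache_cnt * c->increment;
    page->last_value = c->cached;   /* new page value equals new cached value */
    return c->last;
}
```

With CACHE 20, a session touches the shared page once per 20 requests. Two sessions interleaving requests receive disjoint ranges (1 to 20 and 21 to 40), so values stay unique but are not handed out in strict global order.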

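With the Sequence Session Cache, the concurrency of nextval improves greatly. The pgbench stress-test results below compare the two cases.

(Figure: performance comparison)

Summary

A Sequence is a special table-level object in PostgreSQL. It offers a simple yet rich set of SQL interfaces that make it easy to create and use customized sequence objects. A Sequence can also be combined with other database features, such as column defaults, triggers, and identity columns, which broadens where it can be used.

This article described how the Sequence object is designed inside the PostgreSQL kernel. It started from the metadata and data descriptions of the object to explain what a Sequence consists of. It then covered nextval, the core SQL interface of a Sequence, from three angles: computing the sequence value, updating in place, and reducing WAL writes. Finally, it explained how the Sequence Session Cache works, how values are computed and aligned in the cache and in the page once a Cache is introduced, and compared nextval with and without the Cache under single-sequence and multi-sequence concurrency.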

',114);function S(s,_){const a=p("ArticleInfo");return o(),t("div",null,[h,c(a,{frontmatter:s.$frontmatter},null,8,["frontmatter"]),y])}const A=e(w,[["render",S],["__file","polar-sequence-tech.html.vue"]]);export{A as default}; diff --git a/assets/polar_sequence_alignment_cache-89a2a0c4.png b/assets/polar_sequence_alignment_cache-89a2a0c4.png new file mode 100644 index 00000000000..e836a0d0816 Binary files /dev/null and b/assets/polar_sequence_alignment_cache-89a2a0c4.png differ diff --git a/assets/polar_sequence_alignment_cache_1-695685fe.png b/assets/polar_sequence_alignment_cache_1-695685fe.png new file mode 100644 index 00000000000..4b0ee8e8389 Binary files /dev/null and b/assets/polar_sequence_alignment_cache_1-695685fe.png differ diff --git a/assets/polar_sequence_alignment_desc_1-07cebb0a.png b/assets/polar_sequence_alignment_desc_1-07cebb0a.png new file mode 100644 index 00000000000..b78d312a60c Binary files /dev/null and b/assets/polar_sequence_alignment_desc_1-07cebb0a.png differ diff --git a/assets/polar_sequence_alignment_no_cache-728e2b9f.png b/assets/polar_sequence_alignment_no_cache-728e2b9f.png new file mode 100644 index 00000000000..01a4b57639c Binary files /dev/null and b/assets/polar_sequence_alignment_no_cache-728e2b9f.png differ diff --git a/assets/polar_sequence_called-ce740e57.png b/assets/polar_sequence_called-ce740e57.png new file mode 100644 index 00000000000..f0d603f2ab6 Binary files /dev/null and b/assets/polar_sequence_called-ce740e57.png differ diff --git a/assets/polar_sequence_is_called-d65d8316.png b/assets/polar_sequence_is_called-d65d8316.png new file mode 100644 index 00000000000..cdded4d3ac4 Binary files /dev/null and b/assets/polar_sequence_is_called-d65d8316.png differ diff --git a/assets/polar_sequence_monotonic_cyclic-5db3b890.png b/assets/polar_sequence_monotonic_cyclic-5db3b890.png new file mode 100644 index 00000000000..53e74b69f71 Binary files /dev/null and b/assets/polar_sequence_monotonic_cyclic-5db3b890.png differ 
diff --git a/assets/polar_sequence_performance_comparison-5432b1d6.png b/assets/polar_sequence_performance_comparison-5432b1d6.png new file mode 100644 index 00000000000..c602c605ba8 Binary files /dev/null and b/assets/polar_sequence_performance_comparison-5432b1d6.png differ diff --git a/assets/polar_sequence_session_cache-81dce2b3.png b/assets/polar_sequence_session_cache-81dce2b3.png new file mode 100644 index 00000000000..905b7c25726 Binary files /dev/null and b/assets/polar_sequence_session_cache-81dce2b3.png differ diff --git a/assets/polar_sequence_sql_interface-d4586fc0.png b/assets/polar_sequence_sql_interface-d4586fc0.png new file mode 100644 index 00000000000..a21732f8036 Binary files /dev/null and b/assets/polar_sequence_sql_interface-d4586fc0.png differ diff --git a/assets/pr_parallel_execute_1-5ade2a03.png b/assets/pr_parallel_execute_1-5ade2a03.png new file mode 100644 index 00000000000..0861d45b981 Binary files /dev/null and b/assets/pr_parallel_execute_1-5ade2a03.png differ diff --git a/assets/pr_parallel_execute_2-f13fd3b5.png b/assets/pr_parallel_execute_2-f13fd3b5.png new file mode 100644 index 00000000000..a505d26040b Binary files /dev/null and b/assets/pr_parallel_execute_2-f13fd3b5.png differ diff --git a/assets/pr_parallel_execute_dispatcher-ef5aa6cd.png b/assets/pr_parallel_execute_dispatcher-ef5aa6cd.png new file mode 100644 index 00000000000..77273b49879 Binary files /dev/null and b/assets/pr_parallel_execute_dispatcher-ef5aa6cd.png differ diff --git a/assets/pr_parallel_execute_procs_1-37e52fe7.png b/assets/pr_parallel_execute_procs_1-37e52fe7.png new file mode 100644 index 00000000000..759689ee109 Binary files /dev/null and b/assets/pr_parallel_execute_procs_1-37e52fe7.png differ diff --git a/assets/pr_parallel_execute_procs_2-777f9348.png b/assets/pr_parallel_execute_procs_2-777f9348.png new file mode 100644 index 00000000000..f8a79ce0a39 Binary files /dev/null and b/assets/pr_parallel_execute_procs_2-777f9348.png differ diff --git 
a/assets/pr_parallel_execute_procs_3-c2b9a687.png b/assets/pr_parallel_execute_procs_3-c2b9a687.png new file mode 100644 index 00000000000..466184289ba Binary files /dev/null and b/assets/pr_parallel_execute_procs_3-c2b9a687.png differ diff --git a/assets/pr_parallel_execute_task-6ddc37a4.png b/assets/pr_parallel_execute_task-6ddc37a4.png new file mode 100644 index 00000000000..e187b55934f Binary files /dev/null and b/assets/pr_parallel_execute_task-6ddc37a4.png differ diff --git a/assets/pr_parallel_replay_1-fa5b96d0.png b/assets/pr_parallel_replay_1-fa5b96d0.png new file mode 100644 index 00000000000..db3c6f0685b Binary files /dev/null and b/assets/pr_parallel_replay_1-fa5b96d0.png differ diff --git a/assets/pr_parallel_replay_2-bf2d2654.png b/assets/pr_parallel_replay_2-bf2d2654.png new file mode 100644 index 00000000000..21d0705fe6a Binary files /dev/null and b/assets/pr_parallel_replay_2-bf2d2654.png differ diff --git a/assets/quick-start.html-993589ee.js b/assets/quick-start.html-993589ee.js new file mode 100644 index 00000000000..3b29d99c42d --- /dev/null +++ b/assets/quick-start.html-993589ee.js @@ -0,0 +1,11 @@ +import{_ as l,r as t,o as c,c as p,d as o,a as e,b as n,e as i}from"./app-3d1677bf.js";const d={},_=e("h1",{id:"快速部署",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#快速部署","aria-hidden":"true"},"#"),n(" 快速部署")],-1),k=e("div",{class:"custom-container danger"},[e("p",{class:"custom-container-title"},"DANGER"),e("p",null,[n("为简化使用,容器内的 "),e("code",null,"postgres"),n(" 用户没有设置密码,仅供体验。如果在生产环境等高安全性需求场合,请务必修改健壮的密码!")])],-1),u=e("p",null,"仅需单台计算机,同时满足以下要求,就可以快速开启您的 PolarDB 之旅:",-1),h=e("li",null,"CPU 架构为 AMD64 / ARM64",-1),m=e("li",null,"可用内存 4GB 以上",-1),f={href:"https://www.docker.com/",target:"_blank",rel:"noopener noreferrer"},b={href:"https://docs.docker.com/engine/install/ubuntu/",target:"_blank",rel:"noopener noreferrer"},g={href:"https://docs.docker.com/engine/install/debian/",target:"_blank",rel:"noopener 
noreferrer"},D={href:"https://docs.docker.com/engine/install/centos/",target:"_blank",rel:"noopener noreferrer"},E={href:"https://docs.docker.com/engine/install/rhel/",target:"_blank",rel:"noopener noreferrer"},v={href:"https://docs.docker.com/engine/install/fedora/",target:"_blank",rel:"noopener noreferrer"},B={href:"https://docs.docker.com/desktop/mac/install/",target:"_blank",rel:"noopener noreferrer"},P={href:"https://docs.docker.com/desktop/windows/install/",target:"_blank",rel:"noopener noreferrer"},w={href:"https://hub.docker.com/r/polardb/polardb_pg_local_instance/tags",target:"_blank",rel:"noopener noreferrer"},L=i(`
# Pull the PolarDB-PG image
+docker pull polardb/polardb_pg_local_instance
+# Create and run the container
+docker run -it --rm polardb/polardb_pg_local_instance psql
+# Check that it works
+postgres=# SELECT version();
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+
`,1);function S(a,x){const r=t("ArticleInfo"),s=t("ExternalLinkIcon");return c(),p("div",null,[_,o(r,{frontmatter:a.$frontmatter},null,8,["frontmatter"]),k,u,e("ul",null,[h,m,e("li",null,[n("已安装 "),e("a",f,[n("Docker"),o(s)]),e("ul",null,[e("li",null,[n("Ubuntu:"),e("a",b,[n("在 Ubuntu 上安装 Docker Engine"),o(s)])]),e("li",null,[n("Debian:"),e("a",g,[n("在 Debian 上安装 Docker Engine"),o(s)])]),e("li",null,[n("CentOS:"),e("a",D,[n("在 CentOS 上安装 Docker Engine"),o(s)])]),e("li",null,[n("RHEL:"),e("a",E,[n("在 RHEL 上安装 Docker Engine"),o(s)])]),e("li",null,[n("Fedora:"),e("a",v,[n("在 Fedora 上安装 Docker Engine"),o(s)])]),e("li",null,[n("macOS(支持 M1 芯片):"),e("a",B,[n("在 Mac 上安装 Docker Desktop"),o(s)]),n(",并建议将内存调整为 4GB 以上")]),e("li",null,[n("Windows:"),e("a",P,[n("在 Windows 上安装 Docker Desktop"),o(s)]),n(",并建议将内存调整为 4GB 以上")])])])]),e("p",null,[n("从 DockerHub 上拉取 PolarDB for PostgreSQL 的 "),e("a",w,[n("本地存储实例镜像"),o(s)]),n(",创建并运行容器,然后直接试用 PolarDB-PG:")]),L])}const G=l(d,[["render",S],["__file","quick-start.html.vue"]]);export{G as default}; diff --git a/assets/quick-start.html-b665e5e8.js b/assets/quick-start.html-b665e5e8.js new file mode 100644 index 00000000000..5851aba3bed --- /dev/null +++ b/assets/quick-start.html-b665e5e8.js @@ -0,0 +1,11 @@ +import{_ as l,r as t,o as c,c as p,d as o,a as e,b as n,e as i}from"./app-3d1677bf.js";const d={},_=e("h1",{id:"快速部署",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#快速部署","aria-hidden":"true"},"#"),n(" 快速部署")],-1),k=e("div",{class:"custom-container danger"},[e("p",{class:"custom-container-title"},"警告"),e("p",null,[n("为简化使用,容器内的 "),e("code",null,"postgres"),n(" 用户没有设置密码,仅供体验。如果在生产环境等高安全性需求场合,请务必修改健壮的密码!")])],-1),u=e("p",null,"仅需单台计算机,同时满足以下要求,就可以快速开启您的 PolarDB 之旅:",-1),h=e("li",null,"CPU 架构为 AMD64 / ARM64",-1),m=e("li",null,"可用内存 4GB 以上",-1),f={href:"https://www.docker.com/",target:"_blank",rel:"noopener noreferrer"},b={href:"https://docs.docker.com/engine/install/ubuntu/",target:"_blank",rel:"noopener 
noreferrer"},g={href:"https://docs.docker.com/engine/install/debian/",target:"_blank",rel:"noopener noreferrer"},D={href:"https://docs.docker.com/engine/install/centos/",target:"_blank",rel:"noopener noreferrer"},E={href:"https://docs.docker.com/engine/install/rhel/",target:"_blank",rel:"noopener noreferrer"},v={href:"https://docs.docker.com/engine/install/fedora/",target:"_blank",rel:"noopener noreferrer"},B={href:"https://docs.docker.com/desktop/mac/install/",target:"_blank",rel:"noopener noreferrer"},P={href:"https://docs.docker.com/desktop/windows/install/",target:"_blank",rel:"noopener noreferrer"},w={href:"https://hub.docker.com/r/polardb/polardb_pg_local_instance/tags",target:"_blank",rel:"noopener noreferrer"},L=i(`
# Pull the PolarDB-PG image
+docker pull polardb/polardb_pg_local_instance
+# Create and run the container
+docker run -it --rm polardb/polardb_pg_local_instance psql
+# Check that it works
+postgres=# SELECT version();
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+
`,1);function S(a,x){const r=t("ArticleInfo"),s=t("ExternalLinkIcon");return c(),p("div",null,[_,o(r,{frontmatter:a.$frontmatter},null,8,["frontmatter"]),k,u,e("ul",null,[h,m,e("li",null,[n("已安装 "),e("a",f,[n("Docker"),o(s)]),e("ul",null,[e("li",null,[n("Ubuntu:"),e("a",b,[n("在 Ubuntu 上安装 Docker Engine"),o(s)])]),e("li",null,[n("Debian:"),e("a",g,[n("在 Debian 上安装 Docker Engine"),o(s)])]),e("li",null,[n("CentOS:"),e("a",D,[n("在 CentOS 上安装 Docker Engine"),o(s)])]),e("li",null,[n("RHEL:"),e("a",E,[n("在 RHEL 上安装 Docker Engine"),o(s)])]),e("li",null,[n("Fedora:"),e("a",v,[n("在 Fedora 上安装 Docker Engine"),o(s)])]),e("li",null,[n("macOS(支持 M1 芯片):"),e("a",B,[n("在 Mac 上安装 Docker Desktop"),o(s)]),n(",并建议将内存调整为 4GB 以上")]),e("li",null,[n("Windows:"),e("a",P,[n("在 Windows 上安装 Docker Desktop"),o(s)]),n(",并建议将内存调整为 4GB 以上")])])])]),e("p",null,[n("从 DockerHub 上拉取 PolarDB for PostgreSQL 的 "),e("a",w,[n("本地存储实例镜像"),o(s)]),n(",创建并运行容器,然后直接试用 PolarDB-PG:")]),L])}const C=l(d,[["render",S],["__file","quick-start.html.vue"]]);export{C as default}; diff --git a/assets/quick-start.html-ede64a2e.js b/assets/quick-start.html-ede64a2e.js new file mode 100644 index 00000000000..9a43f98e416 --- /dev/null +++ b/assets/quick-start.html-ede64a2e.js @@ -0,0 +1 @@ +const t=JSON.parse('{"key":"v-1ced8944","path":"/deploying/quick-start.html","title":"快速部署","lang":"en-US","frontmatter":{"author":"棠羽","date":"2022/05/09","minute":5},"headers":[],"git":{"updatedTime":1690894847000},"filePathRelative":"deploying/quick-start.md"}');export{t as data}; diff --git a/assets/quick-start.html-fadf16d2.js b/assets/quick-start.html-fadf16d2.js new file mode 100644 index 00000000000..62489e1a1fe --- /dev/null +++ b/assets/quick-start.html-fadf16d2.js @@ -0,0 +1 @@ +const 
t=JSON.parse('{"key":"v-7eb8feb3","path":"/zh/deploying/quick-start.html","title":"快速部署","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/05/09","minute":5},"headers":[],"git":{"updatedTime":1690894847000},"filePathRelative":"zh/deploying/quick-start.md"}');export{t as data}; diff --git a/assets/rel-size-cache.html-0ff52651.js b/assets/rel-size-cache.html-0ff52651.js new file mode 100644 index 00000000000..0d7f85dfa28 --- /dev/null +++ b/assets/rel-size-cache.html-0ff52651.js @@ -0,0 +1,107 @@ +import{_ as t,r as l,o as d,c as r,d as s,a as n,w as o,b as a,e as u}from"./app-3d1677bf.js";const k="/PolarDB-for-PostgreSQL/assets/rsc-first-cache-98d69d90.png",v="/PolarDB-for-PostgreSQL/assets/rsc-second-cache-e3089e17.png",m={},b=n("h1",{id:"表大小缓存",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#表大小缓存","aria-hidden":"true"},"#"),a(" 表大小缓存")],-1),S={class:"table-of-contents"},h=u('

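Background

During SQL execution, system tables and user tables are queried many times. PolarDB for PostgreSQL obtains table sizes through the lseek system call of the file system. Frequent lseek calls seriously hurt database performance. This is especially true for PolarDB for PostgreSQL with its compute-storage separation architecture, where a PFS lseek call on PolarFS incurs even higher RTO latency. To reduce how often lseek is called, PolarDB for PostgreSQL provides a relation size cache layer in its storage engine, which improves runtime performance.

Terminology

  • RSC (Relation Size Cache): the relation size cache.
  • Smgr (Storage manager): the PolarDB for PostgreSQL storage manager.
  • SmgrRelation: table-level metadata on the storage side of PolarDB for PostgreSQL.

Overview

To implement RSC, PolarDB for PostgreSQL re-adapted and redesigned the smgr layer. Overall, RSC is a cache array plus a two-level index: the first-level index uses a memory address and a reference count to locate a cache block in the shared-memory RSC cache; the second-level index uses a shared-memory hash table to find the array index of an RSC cache block, which is then used to access the RSC cache and obtain the table size.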

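Design

Overall Design

Once RSC is enabled, the smgr-layer interfaces apply the RSC lookup and update logic:

  • smgrnblocks: the actual entry point for getting a table's size. It looks up the RSC first-level or second-level index to obtain the address of the RSC cache block and from it the physical table size. On an RSC hit it returns the cached size directly. Otherwise it issues an lseek system call, stores the actual physical size in the RSC cache, and updates the first-level and second-level indexes.
  • smgrextend: the table file extension interface. It extends the physical table file by one page and updates the RSC index and cache for the table.
  • smgrextendbatch: the table file pre-extension interface. It pre-extends the physical table file by multiple pages and updates the RSC index and cache for the table.
  • smgrtruncate: the table file deletion interface. It deletes the physical table file and clears the RSC index and cache for the table.

RSC Cache Array

Shared memory holds the RSC cache in the form of an array. Each element is an RSC cache block whose key fields are:

  • The table identifier
  • A 64-bit reference count, generation, which is incremented whenever the table is updated
  • The table size

RSC First-Level Index

Each session process that runs user operations keeps the tables it needs to access in a process-private SmgrRelation structure, which contains:

  • A pointer to an RSC cache block, initially null and filled in later
  • A 64-bit generation count

On a table access, if this reference count matches the generation in the RSC cache, the RSC cache is considered unchanged, and the current physical table size can be read directly through the pointer. Overall, the RSC first-level index is a shared reference count plus a shared-memory pointer. In the read-mostly workloads typical of most tables, this design effectively reduces concurrent access to the RSC second-level index.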
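The first-level check described above can be sketched as a pointer plus a generation comparison. The types and functions here (`RscEntryModel`, `SmgrRelModel`, `rsc_first_level_lookup`, `rsc_entry_update`, `rsc_first_level_refresh`) are hypothetical stand-ins for illustration, not the PolarDB structures, and locking is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared-memory cache block. */
typedef struct {
    uint32_t rel_id;        /* table identifier */
    uint64_t generation;    /* bumped whenever the table size is updated */
    uint32_t nblocks;       /* cached table size in blocks */
} RscEntryModel;

/* Hypothetical process-private relation handle. */
typedef struct {
    RscEntryModel *entry;   /* NULL until the first lookup fills it in */
    uint64_t       generation;
} SmgrRelModel;

/* Returns 1 and stores the size on a first-level hit, 0 on a miss. */
static int rsc_first_level_lookup(const SmgrRelModel *rel, uint32_t *nblocks)
{
    if (rel->entry != NULL && rel->entry->generation == rel->generation) {
        *nblocks = rel->entry->nblocks;
        return 1;
    }
    return 0;   /* caller falls back to the second-level index */
}

/* A size change updates the cached size and invalidates private handles. */
static void rsc_entry_update(RscEntryModel *entry, uint32_t nblocks)
{
    entry->nblocks = nblocks;
    entry->generation++;
}

/* After a slow-path lookup, remember the block and its generation. */
static void rsc_first_level_refresh(SmgrRelModel *rel, RscEntryModel *entry)
{
    rel->entry = entry;
    rel->generation = entry->generation;
}
```

In this model a session pays for the slower path only once after each size change. Every later access is a pointer dereference and one integer comparison.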

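(Figure: rsc-first-cache)

RSC Second-Level Index

When a table's size changes (for example through INSERT, UPDATE, COPY, or any other operation that changes the file size metadata), the RSC first-level index becomes invalid because the generation counts no longer match, and the session process then tries the RSC second-level index. The second-level index is a shared-memory hash table:

  • The key is the table OID
  • The value is the index of the table's RSC cache block in the RSC cache array

The OID of the physical table being accessed is looked up in the shared-memory second-level index. On a hit, the RSC cache block is obtained directly, the table size is read, and the first-level index is updated. On a miss, lseek is called to get the actual size of the physical table, and the RSC cache and both index levels are updated. Updating the RSC cache may trigger an eviction if the cache is full.

(Figure: rsc-second-cache)

RSC Cache Update and Eviction

While the RSC cache is being updated, an eviction may be triggered because the cache is full. RSC implements an SLRU eviction algorithm that picks an old cache block to evict when no free block is left. Each RSC cache block maintains a reference counter: the counter is incremented each time the block is accessed and reset to 0 when the block is evicted. When an eviction is triggered, the sweep resumes from the position where the previous sweep over the RSC cache array stopped and moves forward, decrementing the reference count of each block it passes until it finds a block whose count is 0, which is then evicted. The sweep length is controlled by a GUC parameter and defaults to 8: if no evictable block is found after 8 blocks, a block is chosen at random and evicted.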
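The eviction sweep can be sketched as below. Several details are assumptions made for illustration: the function name `rsc_pick_victim` is hypothetical, a block whose count is already 0 is taken before any decrement, the array is treated as circular, and the random choice is replaced by a caller-supplied `fallback` index so that the behaviour is deterministic.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Pick a cache block to evict.
 *   ref_counts : per-block reference counters
 *   n          : number of blocks in the cache array
 *   hand       : position where the previous sweep stopped (updated in place)
 *   sweep_len  : maximum number of blocks to inspect (8 by default in the text)
 *   fallback   : index used when the sweep finds nothing (stands in for random)
 */
static size_t rsc_pick_victim(uint32_t *ref_counts, size_t n, size_t *hand,
                              size_t sweep_len, size_t fallback)
{
    for (size_t step = 0; step < sweep_len; step++) {
        size_t idx = *hand;

        *hand = (*hand + 1) % n;        /* advance for the next inspection */
        if (ref_counts[idx] == 0)
            return idx;                 /* unreferenced block: evict it */
        ref_counts[idx]--;              /* age the block and keep sweeping */
    }
    return fallback % n;                /* nothing evictable within sweep_len */
}
```

Frequently accessed blocks keep a high counter and survive several sweeps, while blocks that are no longer touched age down to 0 and are reclaimed. The bounded sweep length keeps the cost of a single eviction predictable.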

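RSC Cache on Replica Nodes

PolarDB for PostgreSQL has two kinds of replica nodes: Read Only (RO) nodes, which share storage and serve read-only traffic, and Standby nodes, which provide high availability across data centers. A Standby node synchronizes data through conventional streaming replication and WAL replay, so it uses and updates the RSC cache in the same way as the Read Write (RW) node. An RO node synchronizes data through the LogIndex mechanism implemented by PolarDB for PostgreSQL, so RSC cache synchronization must be supported separately under that mechanism. For each type of WAL record, the cache is updated or evicted depending on whether a New Page record is present, which keeps the RSC cache on RO nodes consistent.

Usage Guide

The feature is enabled by default. The following GUC parameters control it:

  • polar_nblocks_cache_mode: whether RSC is enabled. Possible values:
    • scan (default): enable RSC only for sequential scans
    • on: enable RSC in all cases
    • off: disable RSC. Changing the parameter from scan or on to off can be done with ALTER SYSTEM SET and takes effect without a restart. Changing it from off to scan or on requires editing the postgresql.conf configuration file and restarting.
  • polar_enable_replica_use_smgr_cache: whether RSC is enabled on RO nodes. Defaults to on. Can be set to on or off.
  • polar_enable_standby_use_smgr_cache: whether RSC is enabled on Standby nodes. Defaults to on. Can be set to on or off.

Performance Test

The following Shell script creates a partitioned table with 1000 child partitions: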

psql -c "CREATE TABLE hp(a INT) PARTITION BY HASH(a);"
+for ((i=0; i<1000; i++)); do
+    psql -c "CREATE TABLE hp$i PARTITION OF hp FOR VALUES WITH(modulus 1000, remainder $i);"
+done
+

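At this point the child partitions hold no data. Next, an aggregate query over all child partitions is used to measure how much time the lseek system calls cost with RSC enabled and disabled.

With RSC enabled: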

ALTER SYSTEM SET polar_nblocks_cache_mode = 'scan';
+ALTER SYSTEM
+
+ALTER SYSTEM SET polar_enable_replica_use_smgr_cache = on;
+ALTER SYSTEM
+
+ALTER SYSTEM SET polar_enable_standby_use_smgr_cache = on;
+ALTER SYSTEM
+
+SELECT pg_reload_conf();
+ pg_reload_conf
+----------------
+ t
+(1 row)
+
+SHOW polar_nblocks_cache_mode;
+ polar_nblocks_cache_mode
+--------------------------
+ scan
+(1 row)
+
+SHOW polar_enable_replica_use_smgr_cache ;
+ polar_enable_replica_use_smgr_cache
+--------------------------
+ on
+(1 row)
+
+SHOW polar_enable_standby_use_smgr_cache ;
+ polar_enable_standby_use_smgr_cache
+--------------------------
+ on
+(1 row)
+
+SELECT COUNT(*) FROM hp;
+ count
+-------
+     0
+(1 row)
+
+Time: 97.658 ms
+
+SELECT COUNT(*) FROM hp;
+ count
+-------
+     0
+(1 row)
+
+Time: 108.672 ms
+
+SELECT COUNT(*) FROM hp;
+ count
+-------
+     0
+(1 row)
+
+Time: 93.678 ms
+

With RSC disabled:

ALTER SYSTEM SET polar_nblocks_cache_mode = 'off';
+ALTER SYSTEM
+
+ALTER SYSTEM SET polar_enable_replica_use_smgr_cache = off;
+ALTER SYSTEM
+
+ALTER SYSTEM SET polar_enable_standby_use_smgr_cache = off;
+ALTER SYSTEM
+
+SELECT pg_reload_conf();
+ pg_reload_conf
+----------------
+ t
+(1 row)
+
+SELECT COUNT(*) FROM hp;
+ count
+-------
+     0
+(1 row)
+
+Time: 164.772 ms
+
+SELECT COUNT(*) FROM hp;
+ count
+-------
+     0
+(1 row)
+
+Time: 147.255 ms
+
+SELECT COUNT(*) FROM hp;
+ count
+-------
+     0
+(1 row)
+
+Time: 177.039 ms
+
+SELECT COUNT(*) FROM hp;
+ count
+-------
+     0
+(1 row)
+
+Time: 194.724 ms
+
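The timings from the two runs can be averaged with a short helper. This is only a convenience sketch: `mean_ms` is a hypothetical function name, and the numbers are copied from the psql output in this section.

```shell
#!/bin/sh
# Average a list of millisecond timings, printed with one decimal place.
mean_ms() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

on_ms=$(mean_ms 97.658 108.672 93.678)              # runs with RSC enabled
off_ms=$(mean_ms 164.772 147.255 177.039 194.724)   # runs with RSC disabled
echo "RSC on: ${on_ms} ms, RSC off: ${off_ms} ms"
```

On these few runs the query averages about 100 ms with RSC enabled and about 171 ms with it disabled. The sample is small, so the figures are indicative only.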
`,38);function _(c,R){const p=l("Badge"),i=l("ArticleInfo"),e=l("router-link");return d(),r("div",null,[b,s(p,{type:"tip",text:"V11 / v1.1.10-",vertical:"top"}),s(i,{frontmatter:c.$frontmatter},null,8,["frontmatter"]),n("nav",S,[n("ul",null,[n("li",null,[s(e,{to:"#背景介绍"},{default:o(()=>[a("背景介绍")]),_:1})]),n("li",null,[s(e,{to:"#术语"},{default:o(()=>[a("术语")]),_:1})]),n("li",null,[s(e,{to:"#功能介绍"},{default:o(()=>[a("功能介绍")]),_:1})]),n("li",null,[s(e,{to:"#功能设计"},{default:o(()=>[a("功能设计")]),_:1}),n("ul",null,[n("li",null,[s(e,{to:"#总体设计"},{default:o(()=>[a("总体设计")]),_:1})]),n("li",null,[s(e,{to:"#rsc-缓存数组"},{default:o(()=>[a("RSC 缓存数组")]),_:1})]),n("li",null,[s(e,{to:"#rsc-一级索引"},{default:o(()=>[a("RSC 一级索引")]),_:1})]),n("li",null,[s(e,{to:"#rsc-二级索引"},{default:o(()=>[a("RSC 二级索引")]),_:1})]),n("li",null,[s(e,{to:"#rsc-缓存更新与淘汰"},{default:o(()=>[a("RSC 缓存更新与淘汰")]),_:1})]),n("li",null,[s(e,{to:"#备节点的-rsc-缓存"},{default:o(()=>[a("备节点的 RSC 缓存")]),_:1})])])]),n("li",null,[s(e,{to:"#使用指南"},{default:o(()=>[a("使用指南")]),_:1})]),n("li",null,[s(e,{to:"#性能测试"},{default:o(()=>[a("性能测试")]),_:1})])])]),h])}const w=t(m,[["render",_],["__file","rel-size-cache.html.vue"]]);export{w as default}; diff --git a/assets/rel-size-cache.html-d3f30121.js b/assets/rel-size-cache.html-d3f30121.js new file mode 100644 index 00000000000..3f06d8687d9 --- /dev/null +++ b/assets/rel-size-cache.html-d3f30121.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-37c6fdad","path":"/zh/features/v11/performance/rel-size-cache.html","title":"表大小缓存","lang":"zh-CN","frontmatter":{"author":"步真","date":"2022/11/14","minute":50},"headers":[{"level":2,"title":"背景介绍","slug":"背景介绍","link":"#背景介绍","children":[]},{"level":2,"title":"术语","slug":"术语","link":"#术语","children":[]},{"level":2,"title":"功能介绍","slug":"功能介绍","link":"#功能介绍","children":[]},{"level":2,"title":"功能设计","slug":"功能设计","link":"#功能设计","children":[{"level":3,"title":"总体设计","slug":"总体设计","link":"#总体设计","children":[]},{"level":3,"title":"RSC 
缓存数组","slug":"rsc-缓存数组","link":"#rsc-缓存数组","children":[]},{"level":3,"title":"RSC 一级索引","slug":"rsc-一级索引","link":"#rsc-一级索引","children":[]},{"level":3,"title":"RSC 二级索引","slug":"rsc-二级索引","link":"#rsc-二级索引","children":[]},{"level":3,"title":"RSC 缓存更新与淘汰","slug":"rsc-缓存更新与淘汰","link":"#rsc-缓存更新与淘汰","children":[]},{"level":3,"title":"备节点的 RSC 缓存","slug":"备节点的-rsc-缓存","link":"#备节点的-rsc-缓存","children":[]}]},{"level":2,"title":"使用指南","slug":"使用指南","link":"#使用指南","children":[]},{"level":2,"title":"性能测试","slug":"性能测试","link":"#性能测试","children":[]}],"git":{"updatedTime":1672148725000},"filePathRelative":"zh/features/v11/performance/rel-size-cache.md"}');export{l as data}; diff --git a/assets/resource-manager.html-1cf58c68.js b/assets/resource-manager.html-1cf58c68.js new file mode 100644 index 00000000000..099d59a7560 --- /dev/null +++ b/assets/resource-manager.html-1cf58c68.js @@ -0,0 +1,19 @@ +import{_ as l,r as o,o as u,c as i,d as a,a as s,w as e,b as n,e as k}from"./app-3d1677bf.js";const d={},m=s("h1",{id:"resource-manager",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#resource-manager","aria-hidden":"true"},"#"),n(" Resource Manager")],-1),g={class:"table-of-contents"},_=k(`

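Background

Memory in PolarDB for PostgreSQL falls into three parts:

  • Shared memory
  • Dynamic shared memory between processes
  • Process-private memory

Dynamic shared memory and process-private memory are allocated dynamically, and their usage changes continuously with the workload running on the instance. Excessive use of dynamic memory can push memory usage beyond the operating system limit and trigger the kernel's memory limit mechanism, which makes instance processes exit abnormally, restarts the instance, and leaves it unavailable.

Process-private memory, which is managed by MemoryContext, can be divided into two parts:

  • Working memory: memory required to run the workload; this memory affects the normal operation of the workload.
  • Cache memory: the database keeps some internal metadata inside each process; this memory affects only database performance.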

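Goal

To address the problems above, PolarDB for PostgreSQL adds the Resource Manager resource limiting mechanism, which periodically checks resource usage while the instance is running. Processes that exceed the resource limit threshold are forcibly limited, which lowers the risk of the instance becoming unavailable.

The resources that Resource Manager is designed to limit are:

  • Memory
  • CPU
  • I/O

Currently only memory limiting is supported.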

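How Memory Limiting Works

Memory limiting relies on Cgroup; without Cgroup, resources cannot be limited effectively. Resource Manager runs as an auxiliary background process of PolarDB for PostgreSQL and periodically reads memory usage data from Cgroup as the basis for limiting memory. When it finds that the memory limit threshold has been exceeded, it reads the kernel's memory accounting for user processes, sorts the processes by memory usage, and sends either a termination signal (SIGTERM) or a cancel signal (SIGINT) to the processes whose memory usage exceeds the threshold, one after another.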

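Memory Limiting Configuration

The Resource Manager daemon is started together with the instance and takes effect on RW, RO, and Standby nodes. Its behavior can be changed through the following parameters:

  • enable_resource_manager: whether to start Resource Manager. Valid values are on / off; the default is on.
  • stat_interval: interval between periodic resource usage checks, in milliseconds. The valid range is 10 to 10000; the default is 500.
  • total_mem_limit_rate: percentage limit on instance memory usage. Once instance memory usage exceeds this percentage, memory limiting is enforced. The default is 95.
  • total_mem_limit_remain_size: amount of memory reserved for the instance, in kB. Once the free memory of the instance falls below this value, memory limiting is enforced. The valid range is 131072 to MAX_KILOBYTES (the maximum integer value); the default is 524288.
  • mem_release_policy: the memory limiting policy
    • none: no action
    • default: the default policy; terminates idle processes first, then active processes
    • cancel_query: interrupts active processes
    • terminate_idle_backend: terminates idle processes
    • terminate_any_backend: terminates all processes
    • terminate_random_backend: terminates random processes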
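
How the two thresholds interact can be sketched as a small shell check. This is a hedged illustration rather than the Resource Manager implementation: the function name `should_limit_memory` and both variable names are hypothetical, and combining the two conditions with a logical OR, as well as comparing usage against a single memory limit, are assumptions.

```shell
#!/bin/sh
TOTAL_MEM_LIMIT_RATE=95            # percent, default of total_mem_limit_rate
TOTAL_MEM_LIMIT_REMAIN_KB=524288   # kB, default of total_mem_limit_remain_size

# Returns 0 when limiting should start.
# Arguments: used memory in kB, memory limit in kB.
should_limit_memory() {
    used_kb=$1
    limit_kb=$2
    free_kb=$((limit_kb - used_kb))
    # Condition 1: usage exceeds the configured percentage of the limit.
    if [ $((used_kb * 100)) -gt $((limit_kb * TOTAL_MEM_LIMIT_RATE)) ]; then
        return 0
    fi
    # Condition 2: free memory has dropped below the reserved amount.
    if [ "$free_kb" -lt "$TOTAL_MEM_LIMIT_REMAIN_KB" ]; then
        return 0
    fi
    return 1
}
```

With an 8 GiB limit (8388608 kB), a usage of 7900000 kB is still below 95 percent but leaves less than 524288 kB free, so under these assumptions the second condition alone would start limiting.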

Memory Limiting in Action

2022-11-28 14:07:56.929 UTC [18179] LOG:  [polar_resource_manager] terminate process 13461 release memory 65434123 bytes
+2022-11-28 14:08:17.143 UTC [35472] FATAL:  terminating connection due to out of memory
+2022-11-28 14:08:17.143 UTC [35472] BACKTRACE:
+        postgres: primary: postgres postgres [local] idle(ProcessInterrupts+0x34c) [0xae5fda]
+        postgres: primary: postgres postgres [local] idle(ProcessClientReadInterrupt+0x3a) [0xae1ad6]
+        postgres: primary: postgres postgres [local] idle(secure_read+0x209) [0x8c9070]
+        postgres: primary: postgres postgres [local] idle() [0x8d4565]
+        postgres: primary: postgres postgres [local] idle(pq_getbyte+0x30) [0x8d4613]
+        postgres: primary: postgres postgres [local] idle() [0xae1861]
+        postgres: primary: postgres postgres [local] idle() [0xae1a83]
+        postgres: primary: postgres postgres [local] idle(PostgresMain+0x8df) [0xae7949]
+        postgres: primary: postgres postgres [local] idle() [0x9f4c4c]
+        postgres: primary: postgres postgres [local] idle() [0x9f440c]
+        postgres: primary: postgres postgres [local] idle() [0x9ef963]
+        postgres: primary: postgres postgres [local] idle(PostmasterMain+0x1321) [0x9ef18a]
+        postgres: primary: postgres postgres [local] idle() [0x8dc1f6]
+        /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f888afff445]
+        postgres: primary: postgres postgres [local] idle() [0x49d209]
+
`,18);function y(t,b){const r=o("Badge"),c=o("ArticleInfo"),p=o("router-link");return u(),i("div",null,[m,a(r,{type:"tip",text:"V11 / v1.1.1-",vertical:"top"}),a(c,{frontmatter:t.$frontmatter},null,8,["frontmatter"]),s("nav",g,[s("ul",null,[s("li",null,[a(p,{to:"#背景"},{default:e(()=>[n("背景")]),_:1})]),s("li",null,[a(p,{to:"#目标"},{default:e(()=>[n("目标")]),_:1})]),s("li",null,[a(p,{to:"#内存限制原理"},{default:e(()=>[n("内存限制原理")]),_:1}),s("ul",null,[s("li",null,[a(p,{to:"#内存限制方式"},{default:e(()=>[n("内存限制方式")]),_:1})]),s("li",null,[a(p,{to:"#内存限制效果"},{default:e(()=>[n("内存限制效果")]),_:1})])])])])]),_])}const h=l(d,[["render",y],["__file","resource-manager.html.vue"]]);export{h as default}; diff --git a/assets/resource-manager.html-ea11f2ad.js b/assets/resource-manager.html-ea11f2ad.js new file mode 100644 index 00000000000..3cc18b3c0e3 --- /dev/null +++ b/assets/resource-manager.html-ea11f2ad.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-4fd5d67a","path":"/zh/features/v11/availability/resource-manager.html","title":"Resource Manager","lang":"zh-CN","frontmatter":{"author":"学有","date":"2022/11/25","minute":20},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"目标","slug":"目标","link":"#目标","children":[]},{"level":2,"title":"内存限制原理","slug":"内存限制原理","link":"#内存限制原理","children":[{"level":3,"title":"内存限制方式","slug":"内存限制方式","link":"#内存限制方式","children":[]},{"level":3,"title":"内存限制效果","slug":"内存限制效果","link":"#内存限制效果","children":[]}]}],"git":{"updatedTime":1672148725000},"filePathRelative":"zh/features/v11/availability/resource-manager.md"}');export{e as data}; diff --git a/assets/ro-online-promote.html-089ffddc.js b/assets/ro-online-promote.html-089ffddc.js new file mode 100644 index 00000000000..e929fe6a233 --- /dev/null +++ b/assets/ro-online-promote.html-089ffddc.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-4cbd0b64","path":"/operation/ro-online-promote.html","title":"只读节点在线 
Promote","lang":"en-US","frontmatter":{"author":"棠羽","date":"2022/12/25","minute":15},"headers":[{"level":2,"title":"前置准备","slug":"前置准备","link":"#前置准备","children":[]},{"level":2,"title":"验证只读节点不可写","slug":"验证只读节点不可写","link":"#验证只读节点不可写","children":[]},{"level":2,"title":"读写节点停止写入","slug":"读写节点停止写入","link":"#读写节点停止写入","children":[]},{"level":2,"title":"只读节点 Promote","slug":"只读节点-promote","link":"#只读节点-promote","children":[]},{"level":2,"title":"计算集群恢复写入","slug":"计算集群恢复写入","link":"#计算集群恢复写入","children":[]}],"git":{"updatedTime":1703744114000},"filePathRelative":"operation/ro-online-promote.md"}');export{e as data}; diff --git a/assets/ro-online-promote.html-659e21ea.js b/assets/ro-online-promote.html-659e21ea.js new file mode 100644 index 00000000000..3ca6e79c087 --- /dev/null +++ b/assets/ro-online-promote.html-659e21ea.js @@ -0,0 +1,25 @@ +import{_ as l,r as t,o as c,c as d,d as e,a,w as o,b as s,e as i}from"./app-3d1677bf.js";const u={},k=a("h1",{id:"只读节点在线-promote",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#只读节点在线-promote","aria-hidden":"true"},"#"),s(" 只读节点在线 Promote")],-1),h=a("p",null,[s("PolarDB for PostgreSQL 是一款存储与计算分离的云原生数据库,所有计算节点共享一份存储,并且对存储的访问具有 "),a("strong",null,"一写多读"),s(" 的限制:所有计算节点可以对存储进行读取,但只有一个计算节点可以对存储进行写入。这种限制会带来一个问题:当读写节点因为宕机或网络故障而不可用时,集群中将没有能够可以写入存储的计算节点,应用业务中的增、删、改,以及 DDL 都将无法运行。")],-1),_=a("p",null,"本文将指导您在 PolarDB for PostgreSQL 计算集群中的读写节点停止服务时,将任意一个只读节点在线提升为读写节点,从而使集群恢复对于共享存储的写入能力。",-1),g={class:"table-of-contents"},m=i(`

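Prerequisites

For convenience, this example uses an instance based on local disks. Pull the following image and start a container to get an HTAP instance based on local disks: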

docker pull polardb/polardb_pg_local_instance
+docker run -it \\
+    --cap-add=SYS_PTRACE \\
+    --privileged=true \\
+    --name polardb_pg_htap \\
+    --shm-size=512m \\
+    polardb/polardb_pg_local_instance \\
+    bash
+

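Ports 5432 to 5434 in the container run one read-write node and two read-only nodes, respectively. The two read-only nodes share the same data with the read-write node and keep their in-memory state synchronized with it through physical replication.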

Verify That the Read-Only Node Is Not Writable

First, connect to the read-write node, create a table, and insert some data:

psql -p5432
+
postgres=# CREATE TABLE t (id int);
+CREATE TABLE
+postgres=# INSERT INTO t SELECT generate_series(1,10);
+INSERT 0 10
+

Then connect to a read-only node and try to insert data into the same table; the insert is rejected:

psql -p5433
+
postgres=# INSERT INTO t SELECT generate_series(1,10);
+ERROR:  cannot execute INSERT in a read-only transaction
+

Stop Writes on the Read-Write Node

Now shut down the read-write node to simulate it becoming unavailable:

$ pg_ctl -D ~/tmp_master_dir_polardb_pg_1100_bld/ stop
+waiting for server to shut down.... done
+server stopped
+

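At this point no node in the cluster can write to the storage. A read-only node must be promoted to a read-write node to restore writes to the storage.

Promote the Read-Only Node

A read-only node may be promoted only after the read-write node has stopped writing; otherwise two nodes in the cluster would write at the same time. If the database detects writes from multiple nodes, it will run abnormally.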
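
In a script, this precondition could be guarded by a check along the following lines. The function name `promote_if_rw_down` is hypothetical. Note that `pg_isready` only reports whether a server accepts connections from this host, so this is a coarse check under that assumption and not proof that the old read-write node has stopped writing.

```shell
#!/bin/sh
# Promote a read-only node only when the read-write node no longer
# accepts connections. Arguments: RW port, data directory of the RO node.
promote_if_rw_down() {
    rw_port=$1
    ro_datadir=$2
    if pg_isready -q -p "$rw_port"; then
        echo "read-write node on port $rw_port is still up; not promoting" >&2
        return 1
    fi
    pg_ctl -D "$ro_datadir" promote
}

# Example, using the paths from this walkthrough:
# promote_if_rw_down 5432 ~/tmp_replica_dir_polardb_pg_1100_bld1/
```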

Promote the read-only node running on port 5433 to a read-write node:

$ pg_ctl -D ~/tmp_replica_dir_polardb_pg_1100_bld1/ promote
+waiting for server to promote.... done
+server promoted
+

Restore Writes to the Compute Cluster

Connect to the newly promoted read-write node and retry the earlier INSERT:

postgres=# INSERT INTO t SELECT generate_series(1,10);
+INSERT 0 10
+

The output shows that the new read-write node can write to the storage, which means the former read-only node has been promoted successfully.
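
As an additional check, the standard PostgreSQL function `pg_is_in_recovery()` can be queried on the promoted node. The helper name `node_is_writable` is hypothetical, and the sketch assumes that a PolarDB read-only node reports `t` before promotion and `f` afterwards, as a stock PostgreSQL replica does.

```shell
#!/bin/sh
# Succeeds when the node on the given port reports that it is not in recovery.
node_is_writable() {
    port=$1
    # -A: unaligned output, -t: tuples only, -c: run a single command
    [ "$(psql -p "$port" -d postgres -Atc 'SELECT pg_is_in_recovery();')" = "f" ]
}

# Example: node_is_writable 5433 && echo "5433 is now read-write"
```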

`,23);function b(p,v){const r=t("ArticleInfo"),n=t("router-link");return c(),d("div",null,[k,e(r,{frontmatter:p.$frontmatter},null,8,["frontmatter"]),h,_,a("nav",g,[a("ul",null,[a("li",null,[e(n,{to:"#前置准备"},{default:o(()=>[s("前置准备")]),_:1})]),a("li",null,[e(n,{to:"#验证只读节点不可写"},{default:o(()=>[s("验证只读节点不可写")]),_:1})]),a("li",null,[e(n,{to:"#读写节点停止写入"},{default:o(()=>[s("读写节点停止写入")]),_:1})]),a("li",null,[e(n,{to:"#只读节点-promote"},{default:o(()=>[s("只读节点 Promote")]),_:1})]),a("li",null,[e(n,{to:"#计算集群恢复写入"},{default:o(()=>[s("计算集群恢复写入")]),_:1})])])]),m])}const E=l(u,[["render",b],["__file","ro-online-promote.html.vue"]]);export{E as default}; diff --git a/assets/ro-online-promote.html-73ae6acb.js b/assets/ro-online-promote.html-73ae6acb.js new file mode 100644 index 00000000000..a5426f15d1d --- /dev/null +++ b/assets/ro-online-promote.html-73ae6acb.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-13307193","path":"/zh/operation/ro-online-promote.html","title":"只读节点在线 Promote","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/12/25","minute":15},"headers":[{"level":2,"title":"前置准备","slug":"前置准备","link":"#前置准备","children":[]},{"level":2,"title":"验证只读节点不可写","slug":"验证只读节点不可写","link":"#验证只读节点不可写","children":[]},{"level":2,"title":"读写节点停止写入","slug":"读写节点停止写入","link":"#读写节点停止写入","children":[]},{"level":2,"title":"只读节点 Promote","slug":"只读节点-promote","link":"#只读节点-promote","children":[]},{"level":2,"title":"计算集群恢复写入","slug":"计算集群恢复写入","link":"#计算集群恢复写入","children":[]}],"git":{"updatedTime":1703744114000},"filePathRelative":"zh/operation/ro-online-promote.md"}');export{e as data}; diff --git a/assets/ro-online-promote.html-b639a7d1.js b/assets/ro-online-promote.html-b639a7d1.js new file mode 100644 index 00000000000..3ca6e79c087 --- /dev/null +++ b/assets/ro-online-promote.html-b639a7d1.js @@ -0,0 +1,25 @@ +import{_ as l,r as t,o as c,c as d,d as e,a,w as o,b as s,e as i}from"./app-3d1677bf.js";const 
u={},k=a("h1",{id:"只读节点在线-promote",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#只读节点在线-promote","aria-hidden":"true"},"#"),s(" 只读节点在线 Promote")],-1),h=a("p",null,[s("PolarDB for PostgreSQL 是一款存储与计算分离的云原生数据库,所有计算节点共享一份存储,并且对存储的访问具有 "),a("strong",null,"一写多读"),s(" 的限制:所有计算节点可以对存储进行读取,但只有一个计算节点可以对存储进行写入。这种限制会带来一个问题:当读写节点因为宕机或网络故障而不可用时,集群中将没有能够可以写入存储的计算节点,应用业务中的增、删、改,以及 DDL 都将无法运行。")],-1),_=a("p",null,"本文将指导您在 PolarDB for PostgreSQL 计算集群中的读写节点停止服务时,将任意一个只读节点在线提升为读写节点,从而使集群恢复对于共享存储的写入能力。",-1),g={class:"table-of-contents"},m=i(`

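Prerequisites

For convenience, this example uses an instance based on local disks. Pull the following image and start a container to get an HTAP instance based on local disks: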

docker pull polardb/polardb_pg_local_instance
+docker run -it \\
+    --cap-add=SYS_PTRACE \\
+    --privileged=true \\
+    --name polardb_pg_htap \\
+    --shm-size=512m \\
+    polardb/polardb_pg_local_instance \\
+    bash
+

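Ports 5432 to 5434 in the container run one read-write node and two read-only nodes, respectively. The two read-only nodes share the same data with the read-write node and keep their in-memory state synchronized with it through physical replication.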

Verify That the Read-Only Node Is Not Writable

First, connect to the read-write node, create a table, and insert some data:

psql -p5432
+
postgres=# CREATE TABLE t (id int);
+CREATE TABLE
+postgres=# INSERT INTO t SELECT generate_series(1,10);
+INSERT 0 10
+

Then connect to a read-only node and try to insert data into the same table; the insert is rejected:

psql -p5433
+
postgres=# INSERT INTO t SELECT generate_series(1,10);
+ERROR:  cannot execute INSERT in a read-only transaction
+

Stop Writes on the Read-Write Node

Now shut down the read-write node to simulate it becoming unavailable:

$ pg_ctl -D ~/tmp_master_dir_polardb_pg_1100_bld/ stop
+waiting for server to shut down.... done
+server stopped
+

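At this point no node in the cluster can write to the storage. A read-only node must be promoted to a read-write node to restore writes to the storage.

Promote the Read-Only Node

A read-only node may be promoted only after the read-write node has stopped writing; otherwise two nodes in the cluster would write at the same time. If the database detects writes from multiple nodes, it will run abnormally.

Promote the read-only node running on port 5433 to a read-write node: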

$ pg_ctl -D ~/tmp_replica_dir_polardb_pg_1100_bld1/ promote
+waiting for server to promote.... done
+server promoted
+

Restore Writes to the Compute Cluster

Connect to the newly promoted read-write node and retry the earlier INSERT:

postgres=# INSERT INTO t SELECT generate_series(1,10);
+INSERT 0 10
+

The output shows that the new read-write node can write to the storage, which means the former read-only node has been promoted successfully.

`,23);function b(p,v){const r=t("ArticleInfo"),n=t("router-link");return c(),d("div",null,[k,e(r,{frontmatter:p.$frontmatter},null,8,["frontmatter"]),h,_,a("nav",g,[a("ul",null,[a("li",null,[e(n,{to:"#前置准备"},{default:o(()=>[s("前置准备")]),_:1})]),a("li",null,[e(n,{to:"#验证只读节点不可写"},{default:o(()=>[s("验证只读节点不可写")]),_:1})]),a("li",null,[e(n,{to:"#读写节点停止写入"},{default:o(()=>[s("读写节点停止写入")]),_:1})]),a("li",null,[e(n,{to:"#只读节点-promote"},{default:o(()=>[s("只读节点 Promote")]),_:1})]),a("li",null,[e(n,{to:"#计算集群恢复写入"},{default:o(()=>[s("计算集群恢复写入")]),_:1})])])]),m])}const E=l(u,[["render",b],["__file","ro-online-promote.html.vue"]]);export{E as default}; diff --git a/assets/rsc-first-cache-98d69d90.png b/assets/rsc-first-cache-98d69d90.png new file mode 100644 index 00000000000..5f1d22a4274 Binary files /dev/null and b/assets/rsc-first-cache-98d69d90.png differ diff --git a/assets/rsc-second-cache-e3089e17.png b/assets/rsc-second-cache-e3089e17.png new file mode 100644 index 00000000000..0671e62cb8a Binary files /dev/null and b/assets/rsc-second-cache-e3089e17.png differ diff --git a/assets/scale-out.html-c244f53c.js b/assets/scale-out.html-c244f53c.js new file mode 100644 index 00000000000..e66b7d5c93d --- /dev/null +++ b/assets/scale-out.html-c244f53c.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-4a816e3e","path":"/zh/operation/scale-out.html","title":"计算节点扩缩容","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/12/19","minute":30},"headers":[{"level":2,"title":"部署读写节点","slug":"部署读写节点","link":"#部署读写节点","children":[{"level":3,"title":"确认存储可访问","slug":"确认存储可访问","link":"#确认存储可访问","children":[]},{"level":3,"title":"格式化并挂载 PFS 
文件系统","slug":"格式化并挂载-pfs-文件系统","link":"#格式化并挂载-pfs-文件系统","children":[]},{"level":3,"title":"初始化数据目录","slug":"初始化数据目录","link":"#初始化数据目录","children":[]},{"level":3,"title":"编辑读写节点配置","slug":"编辑读写节点配置","link":"#编辑读写节点配置","children":[]},{"level":3,"title":"启动读写节点","slug":"启动读写节点","link":"#启动读写节点","children":[]}]},{"level":2,"title":"集群扩容","slug":"集群扩容","link":"#集群扩容","children":[{"level":3,"title":"确认存储可访问","slug":"确认存储可访问-1","link":"#确认存储可访问-1","children":[]},{"level":3,"title":"挂载 PFS 文件系统","slug":"挂载-pfs-文件系统","link":"#挂载-pfs-文件系统","children":[]},{"level":3,"title":"初始化数据目录","slug":"初始化数据目录-1","link":"#初始化数据目录-1","children":[]},{"level":3,"title":"编辑只读节点配置","slug":"编辑只读节点配置","link":"#编辑只读节点配置","children":[]},{"level":3,"title":"启动只读节点","slug":"启动只读节点","link":"#启动只读节点","children":[]},{"level":3,"title":"集群功能检查","slug":"集群功能检查","link":"#集群功能检查","children":[]}]},{"level":2,"title":"集群缩容","slug":"集群缩容","link":"#集群缩容","children":[]}],"git":{"updatedTime":1703744114000},"filePathRelative":"zh/operation/scale-out.md"}');export{l as data}; diff --git a/assets/scale-out.html-c7075237.js b/assets/scale-out.html-c7075237.js new file mode 100644 index 00000000000..1d603fc71c9 --- /dev/null +++ b/assets/scale-out.html-c7075237.js @@ -0,0 +1,150 @@ +import{_ as r,r as t,o as c,c as i,d as n,a,w as p,b as s,e as u}from"./app-3d1677bf.js";const d={},k=a("h1",{id:"计算节点扩缩容",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#计算节点扩缩容","aria-hidden":"true"},"#"),s(" 计算节点扩缩容")],-1),b=a("p",null,"PolarDB for PostgreSQL 是一款存储与计算分离的数据库,所有计算节点共享存储,并可以按需要弹性增加或删减计算节点而无需做任何数据迁移。所有本教程将协助您在共享存储集群上添加或删除计算节点。",-1),v={class:"table-of-contents"},m=u(`

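Deploy the Read-Write Node

First, on a shared storage cluster that has already been set up, initialize and start the first compute node, the read-write node, which can both read and write the shared storage. The image below provides precompiled executables of the PolarDB for PostgreSQL kernel and its surrounding tools: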

$ docker pull polardb/polardb_pg_binary
+$ docker run -it \\
+    --cap-add=SYS_PTRACE \\
+    --privileged=true \\
+    --name polardb_pg \\
+    --shm-size=512m \\
+    polardb/polardb_pg_binary \\
+    bash
+
+$ ls ~/tmp_basedir_polardb_pg_1100_bld/bin/
+clusterdb     dropuser           pg_basebackup   pg_dump         pg_resetwal    pg_test_timing       polar-initdb.sh          psql
+createdb      ecpg               pgbench         pg_dumpall      pg_restore     pg_upgrade           polar-replica-initdb.sh  reindexdb
+createuser    initdb             pg_config       pg_isready      pg_rewind      pg_verify_checksums  polar_tools              vacuumdb
+dbatools.sql  oid2name           pg_controldata  pg_receivewal   pg_standby     pg_waldump           postgres                 vacuumlo
+dropdb        pg_archivecleanup  pg_ctl          pg_recvlogical  pg_test_fsync  polar_basebackup     postmaster
+

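Confirm That the Storage Is Accessible

Use lsblk to confirm that the current machine can access the storage cluster. In the example below, nvme1n1 is the block device of the shared storage to be used: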

$ lsblk
+NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
+nvme0n1     259:0    0   40G  0 disk
+└─nvme0n1p1 259:1    0   40G  0 part /etc/hosts
+nvme1n1     259:2    0  100G  0 disk
+
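In a provisioning script, the same check could be automated before formatting anything. The helper name `require_block_device` is hypothetical, and the sketch assumes the device shown by lsblk is exposed under `/dev` with the same name.

```shell
#!/bin/sh
# Fails with a message when /dev/<name> is not a block device.
require_block_device() {
    dev=$1
    if [ ! -b "/dev/$dev" ]; then
        echo "block device /dev/$dev not found" >&2
        return 1
    fi
}

# Example: require_block_device nvme1n1 || exit 1
```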

Format and Mount the PFS File System

At this point the shared storage is empty. Use the PFS tool in the container to format the shared storage as a PFS file system:

sudo pfs -C disk mkfs nvme1n1
+

After formatting, start the PFS daemon in the current container to mount the file system. Compute nodes later use this daemon to access the shared storage:

sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
+

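Initialize the Data Directories

Use initdb to create a local data directory at ~/primary on the node's local storage. The local data directory holds node-private information such as the node's configuration and audit logs: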

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/primary
+

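Use the PFS tool to create a shared data directory on the shared storage, then use the polar-initdb.sh script to copy the data files that will be shared by all nodes into it. The shared files include all table files, WAL files, and so on: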

sudo pfs -C disk mkdir /nvme1n1/shared_data
+
+sudo $HOME/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh \\
+    $HOME/primary/ /nvme1n1/shared_data/
+

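Edit the Read-Write Node Configuration

Edit the read-write node's configuration file ~/primary/postgresql.conf so that the database starts in shared storage mode and can locate the data directory on the shared storage: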

port=5432
+polar_hostid=1
+
+polar_enable_shared_storage_mode=on
+polar_disk_name='nvme1n1'
+polar_datadir='/nvme1n1/shared_data/'
+polar_vfs.localfs_mode=off
+shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
+polar_storage_cluster_name='disk'
+
+logging_collector=on
+log_line_prefix='%p\\t%r\\t%u\\t%m\\t'
+log_directory='pg_log'
+listen_addresses='*'
+max_connections=1000
+synchronous_standby_names='replica1'
+

Edit the read-write node's client authentication file ~/primary/pg_hba.conf to allow clients from any address to perform physical replication as the postgres user:

host	replication	postgres	0.0.0.0/0	trust
+

Start the Read-Write Node

Start the read-write node with the following commands and check that it runs correctly:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/primary start
+
+$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c 'SELECT version();'
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+
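When later steps are scripted, it can help to wait until the node accepts connections before continuing. The sketch below uses `pg_isready`, which appears in the bin directory listed earlier; the function name `wait_until_ready` and the default of 30 attempts are hypothetical choices.

```shell
#!/bin/sh
# Poll a node until it accepts connections.
# Arguments: port, maximum number of attempts (one per second, default 30).
wait_until_ready() {
    port=$1
    attempts=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if pg_isready -q -p "$port"; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Example: wait_until_ready 5432 || echo "node on port 5432 did not come up" >&2
```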

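Scale Out the Cluster

Next, add a new compute node to the compute cluster that already has one read-write node. Because PolarDB for PostgreSQL uses a one-writer, multiple-reader architecture, nodes added later can only read the shared storage and cannot write to it. Read-only nodes keep their in-memory state synchronized through physical replication with the read-write node.

Similarly, on the machine that will host the new compute node, pull the image and start a container with the executables: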

docker pull polardb/polardb_pg_binary
+docker run -it \\
+    --cap-add=SYS_PTRACE \\
+    --privileged=true \\
+    --name polardb_pg \\
+    --shm-size=512m \\
+    polardb/polardb_pg_binary \\
+    bash
+

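Confirm That the Storage Is Accessible

Make sure the machine hosting the read-only node can also access the block device of the shared storage: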

$ lsblk
+NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
+nvme0n1     259:0    0   40G  0 disk
+└─nvme0n1p1 259:1    0   40G  0 part /etc/hosts
+nvme1n1     259:2    0  100G  0 disk
+

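Mount the PFS File System

The read-write node has already formatted the shared storage as PFS, so it does not need to be formatted again. Just start the PFS daemon to complete the mount: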

sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
+

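Initialize the Data Directory

Create an empty directory at ~/replica1 on the read-only node's local disk, then use the polar-replica-initdb.sh script to initialize that local directory from the data directory on the shared storage. The initialized local directory has no default configuration files, so you also need to create a temporary local directory template with initdb and copy all the default configuration files into the read-only node's local directory: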

mkdir -m 0700 $HOME/replica1
+sudo ~/tmp_basedir_polardb_pg_1100_bld/bin/polar-replica-initdb.sh \\
+    /nvme1n1/shared_data/ $HOME/replica1/
+
+$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D /tmp/replica1
+cp /tmp/replica1/*.conf $HOME/replica1/
+

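Edit the Read-Only Node Configuration

Edit the read-only node's configuration file ~/replica1/postgresql.conf to set its cluster identifier and listening port, and the same shared storage directory as the read-write node: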

port=5432
+polar_hostid=2
+
+polar_enable_shared_storage_mode=on
+polar_disk_name='nvme1n1'
+polar_datadir='/nvme1n1/shared_data/'
+polar_vfs.localfs_mode=off
+shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
+polar_storage_cluster_name='disk'
+
+logging_collector=on
+log_line_prefix='%p\\t%r\\t%u\\t%m\\t'
+log_directory='pg_log'
+listen_addresses='*'
+max_connections=1000
+

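Edit the read-only node's replication configuration file ~/replica1/recovery.conf to set the node's role (read-only) and the connection string and replication slot used for physical replication from the read-write node: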

polar_replica='on'
+recovery_target_timeline='latest'
+primary_conninfo='host=[IP of the read-write node] port=5432 user=postgres dbname=postgres application_name=replica1'
+primary_slot_name='replica1'
+
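When several read-only nodes are added, the same file can be generated from a template. The function name `render_recovery_conf` is hypothetical, and reusing the slot name as `application_name` simply mirrors the replica1 example above; treat it as an assumption for other nodes.

```shell
#!/bin/sh
# Print a recovery.conf for a read-only node.
# Arguments: address of the read-write node, replication slot name.
render_recovery_conf() {
    primary_host=$1
    slot=$2
    cat <<EOF
polar_replica='on'
recovery_target_timeline='latest'
primary_conninfo='host=${primary_host} port=5432 user=postgres dbname=postgres application_name=${slot}'
primary_slot_name='${slot}'
EOF
}

# Example (192.0.2.10 is a documentation placeholder address):
# render_recovery_conf 192.0.2.10 replica1 > "$HOME/replica1/recovery.conf"
```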

The read-write node does not yet have a replication slot named replica1, so connect to the read-write node and create it:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT pg_create_physical_replication_slot('replica1');"
+ pg_create_physical_replication_slot
+-------------------------------------
+ (replica1,)
+(1 row)
+

Start the Read-Only Node

After completing the steps above, start the read-only node and verify it:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/replica1 start
+
+$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c 'SELECT version();'
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

Check Cluster Functionality

Connect to the read-write node, create a table, and insert data:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5432 \\
+    -d postgres \\
+    -c "CREATE TABLE t(id INT); INSERT INTO t SELECT generate_series(1,10);"
+

The data inserted on the read-write node can be queried on the read-only node immediately:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT * FROM t;"
+ id
+----
+  1
+  2
+  3
+  4
+  5
+  6
+  7
+  8
+  9
+ 10
+(10 rows)
+

On the read-write node, the replication slot used for physical replication with the read-only node is now active:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT * FROM pg_replication_slots;"
+ slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
+-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
+ replica1  |        | physical  |        |          | f         | t      |         45 |      |              | 0/4079E8E8  |
+(1 row)
+

More read-only nodes can be added in the same way.

Scale In the Cluster

Scaling in is simple: just stop the read-only node.

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/replica1 stop
+

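After the read-only node stops, its replication slot on the read-write node becomes inactive. An inactive replication slot prevents WAL files from being recycled, so it should be cleaned up promptly.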
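
Before dropping anything, the inactive slots can be listed from the standard `pg_replication_slots` view. The helper name `inactive_slots` is hypothetical; the default port 5432 matches the read-write node in this walkthrough.

```shell
#!/bin/sh
# Print the names of replication slots that currently have no active consumer.
# Argument: port of the read-write node (default 5432).
inactive_slots() {
    port=${1:-5432}
    psql -p "$port" -d postgres -Atc \
        "SELECT slot_name FROM pg_replication_slots WHERE NOT active;"
}

# Example: inactive_slots 5432
```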

Run the following command on the read-write node to remove the replication slot named replica1:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT pg_drop_replication_slot('replica1');"
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
`,61);function _(l,g){const o=t("ArticleInfo"),e=t("router-link");return c(),i("div",null,[k,n(o,{frontmatter:l.$frontmatter},null,8,["frontmatter"]),b,a("nav",v,[a("ul",null,[a("li",null,[n(e,{to:"#部署读写节点"},{default:p(()=>[s("部署读写节点")]),_:1}),a("ul",null,[a("li",null,[n(e,{to:"#确认存储可访问"},{default:p(()=>[s("确认存储可访问")]),_:1})]),a("li",null,[n(e,{to:"#格式化并挂载-pfs-文件系统"},{default:p(()=>[s("格式化并挂载 PFS 文件系统")]),_:1})]),a("li",null,[n(e,{to:"#初始化数据目录"},{default:p(()=>[s("初始化数据目录")]),_:1})]),a("li",null,[n(e,{to:"#编辑读写节点配置"},{default:p(()=>[s("编辑读写节点配置")]),_:1})]),a("li",null,[n(e,{to:"#启动读写节点"},{default:p(()=>[s("启动读写节点")]),_:1})])])]),a("li",null,[n(e,{to:"#集群扩容"},{default:p(()=>[s("集群扩容")]),_:1}),a("ul",null,[a("li",null,[n(e,{to:"#确认存储可访问-1"},{default:p(()=>[s("确认存储可访问")]),_:1})]),a("li",null,[n(e,{to:"#挂载-pfs-文件系统"},{default:p(()=>[s("挂载 PFS 文件系统")]),_:1})]),a("li",null,[n(e,{to:"#初始化数据目录-1"},{default:p(()=>[s("初始化数据目录")]),_:1})]),a("li",null,[n(e,{to:"#编辑只读节点配置"},{default:p(()=>[s("编辑只读节点配置")]),_:1})]),a("li",null,[n(e,{to:"#启动只读节点"},{default:p(()=>[s("启动只读节点")]),_:1})]),a("li",null,[n(e,{to:"#集群功能检查"},{default:p(()=>[s("集群功能检查")]),_:1})])])]),a("li",null,[n(e,{to:"#集群缩容"},{default:p(()=>[s("集群缩容")]),_:1})])])]),m])}const f=r(d,[["render",_],["__file","scale-out.html.vue"]]);export{f as default}; diff --git a/assets/scale-out.html-ee1b6f09.js b/assets/scale-out.html-ee1b6f09.js new file mode 100644 index 00000000000..6b036f46ca3 --- /dev/null +++ b/assets/scale-out.html-ee1b6f09.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-4a6d2de2","path":"/operation/scale-out.html","title":"计算节点扩缩容","lang":"en-US","frontmatter":{"author":"棠羽","date":"2022/12/19","minute":30},"headers":[{"level":2,"title":"部署读写节点","slug":"部署读写节点","link":"#部署读写节点","children":[{"level":3,"title":"确认存储可访问","slug":"确认存储可访问","link":"#确认存储可访问","children":[]},{"level":3,"title":"格式化并挂载 PFS 
文件系统","slug":"格式化并挂载-pfs-文件系统","link":"#格式化并挂载-pfs-文件系统","children":[]},{"level":3,"title":"初始化数据目录","slug":"初始化数据目录","link":"#初始化数据目录","children":[]},{"level":3,"title":"编辑读写节点配置","slug":"编辑读写节点配置","link":"#编辑读写节点配置","children":[]},{"level":3,"title":"启动读写节点","slug":"启动读写节点","link":"#启动读写节点","children":[]}]},{"level":2,"title":"集群扩容","slug":"集群扩容","link":"#集群扩容","children":[{"level":3,"title":"确认存储可访问","slug":"确认存储可访问-1","link":"#确认存储可访问-1","children":[]},{"level":3,"title":"挂载 PFS 文件系统","slug":"挂载-pfs-文件系统","link":"#挂载-pfs-文件系统","children":[]},{"level":3,"title":"初始化数据目录","slug":"初始化数据目录-1","link":"#初始化数据目录-1","children":[]},{"level":3,"title":"编辑只读节点配置","slug":"编辑只读节点配置","link":"#编辑只读节点配置","children":[]},{"level":3,"title":"启动只读节点","slug":"启动只读节点","link":"#启动只读节点","children":[]},{"level":3,"title":"集群功能检查","slug":"集群功能检查","link":"#集群功能检查","children":[]}]},{"level":2,"title":"集群缩容","slug":"集群缩容","link":"#集群缩容","children":[]}],"git":{"updatedTime":1703744114000},"filePathRelative":"operation/scale-out.md"}');export{l as data}; diff --git a/assets/scale-out.html-eed2da3b.js b/assets/scale-out.html-eed2da3b.js new file mode 100644 index 00000000000..1d603fc71c9 --- /dev/null +++ b/assets/scale-out.html-eed2da3b.js @@ -0,0 +1,150 @@ +import{_ as r,r as t,o as c,c as i,d as n,a,w as p,b as s,e as u}from"./app-3d1677bf.js";const d={},k=a("h1",{id:"计算节点扩缩容",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#计算节点扩缩容","aria-hidden":"true"},"#"),s(" 计算节点扩缩容")],-1),b=a("p",null,"PolarDB for PostgreSQL 是一款存储与计算分离的数据库,所有计算节点共享存储,并可以按需要弹性增加或删减计算节点而无需做任何数据迁移。所有本教程将协助您在共享存储集群上添加或删除计算节点。",-1),v={class:"table-of-contents"},m=u(`

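Deploy the Read-Write Node

First, on a shared storage cluster that has already been set up, initialize and start the first compute node, the read-write node, which can both read and write the shared storage. The image below provides precompiled executables of the PolarDB for PostgreSQL kernel and its surrounding tools: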

$ docker pull polardb/polardb_pg_binary
+$ docker run -it \\
+    --cap-add=SYS_PTRACE \\
+    --privileged=true \\
+    --name polardb_pg \\
+    --shm-size=512m \\
+    polardb/polardb_pg_binary \\
+    bash
+
+$ ls ~/tmp_basedir_polardb_pg_1100_bld/bin/
+clusterdb     dropuser           pg_basebackup   pg_dump         pg_resetwal    pg_test_timing       polar-initdb.sh          psql
+createdb      ecpg               pgbench         pg_dumpall      pg_restore     pg_upgrade           polar-replica-initdb.sh  reindexdb
+createuser    initdb             pg_config       pg_isready      pg_rewind      pg_verify_checksums  polar_tools              vacuumdb
+dbatools.sql  oid2name           pg_controldata  pg_receivewal   pg_standby     pg_waldump           postgres                 vacuumlo
+dropdb        pg_archivecleanup  pg_ctl          pg_recvlogical  pg_test_fsync  polar_basebackup     postmaster
+

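Confirm That the Storage Is Accessible

Use lsblk to confirm that the current machine can access the storage cluster. In the example below, nvme1n1 is the block device of the shared storage to be used: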

$ lsblk
+NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
+nvme0n1     259:0    0   40G  0 disk
+└─nvme0n1p1 259:1    0   40G  0 part /etc/hosts
+nvme1n1     259:2    0  100G  0 disk
+

Format and Mount the PFS File System

At this point the shared storage is empty. Use the PFS tool in the container to format the shared storage as a PFS file system:

sudo pfs -C disk mkfs nvme1n1
+

After formatting, start the PFS daemon in the current container to mount the file system. Compute nodes later use this daemon to access the shared storage:

sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
+

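Initialize the Data Directories

Use initdb to create a local data directory at ~/primary on the node's local storage. The local data directory holds node-private information such as the node's configuration and audit logs: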

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/primary
+

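Use the PFS tool to create a shared data directory on the shared storage, then use the polar-initdb.sh script to copy the data files that will be shared by all nodes into it. The shared files include all table files, WAL files, and so on: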

sudo pfs -C disk mkdir /nvme1n1/shared_data
+
+sudo $HOME/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh \\
+    $HOME/primary/ /nvme1n1/shared_data/
+

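Edit the Read-Write Node Configuration

Edit the read-write node's configuration file ~/primary/postgresql.conf so that the database starts in shared storage mode and can locate the data directory on the shared storage: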

port=5432
+polar_hostid=1
+
+polar_enable_shared_storage_mode=on
+polar_disk_name='nvme1n1'
+polar_datadir='/nvme1n1/shared_data/'
+polar_vfs.localfs_mode=off
+shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
+polar_storage_cluster_name='disk'
+
+logging_collector=on
+log_line_prefix='%p\\t%r\\t%u\\t%m\\t'
+log_directory='pg_log'
+listen_addresses='*'
+max_connections=1000
+synchronous_standby_names='replica1'
+

Edit the read-write node's client authentication file ~/primary/pg_hba.conf to allow clients from any address to perform physical replication as the postgres user:

host	replication	postgres	0.0.0.0/0	trust
+

Start the Read-Write Node

Start the read-write node with the following commands and check that it runs correctly:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/primary start
+
+$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c 'SELECT version();'
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

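Scale Out the Cluster

Next, add a new compute node to the compute cluster that already has one read-write node. Because PolarDB for PostgreSQL uses a one-writer, multiple-reader architecture, nodes added later can only read the shared storage and cannot write to it. Read-only nodes keep their in-memory state synchronized through physical replication with the read-write node.

Similarly, on the machine that will host the new compute node, pull the image and start a container with the executables: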

docker pull polardb/polardb_pg_binary
+docker run -it \\
+    --cap-add=SYS_PTRACE \\
+    --privileged=true \\
+    --name polardb_pg \\
+    --shm-size=512m \\
+    polardb/polardb_pg_binary \\
+    bash
+

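Confirm the storage is accessible

Make sure the machine hosting the read-only node can also access the block device of the shared storage: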

$ lsblk
+NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
+nvme0n1     259:0    0   40G  0 disk
+└─nvme0n1p1 259:1    0   40G  0 part /etc/hosts
+nvme1n1     259:2    0  100G  0 disk
+

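Mount the PFS file system

The read-write node has already formatted the shared storage as PFS, so it does not need to be formatted again. Just start the PFS daemon to complete the mount: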

sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
+

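Initialize the data directory

Create an empty directory at ~/replica1 on the read-only node's local disk, then run the polar-replica-initdb.sh script to initialize that local directory from the data directory on the shared storage. The initialized local directory contains no default configuration files, so also use initdb to create a temporary local directory as a template and copy all of its default configuration files into the read-only node's local directory: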

mkdir -m 0700 $HOME/replica1
+sudo ~/tmp_basedir_polardb_pg_1100_bld/bin/polar-replica-initdb.sh \\
+    /nvme1n1/shared_data/ $HOME/replica1/
+
+$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D /tmp/replica1
+cp /tmp/replica1/*.conf $HOME/replica1/
+

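Edit the read-only node configuration

Edit the read-only node's configuration file ~/replica1/postgresql.conf to set the node's cluster identifier and listening port, along with the same shared storage directory used by the read-write node: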

port=5432
+polar_hostid=2
+
+polar_enable_shared_storage_mode=on
+polar_disk_name='nvme1n1'
+polar_datadir='/nvme1n1/shared_data/'
+polar_vfs.localfs_mode=off
+shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
+polar_storage_cluster_name='disk'
+
+logging_collector=on
+log_line_prefix='%p\\t%r\\t%u\\t%m\\t'
+log_directory='pg_log'
+listen_addresses='*'
+max_connections=1000
+

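Edit the read-only node's replication configuration file ~/replica1/recovery.conf to set the node's role (read-only), plus the connection string and replication slot used for physical replication from the read-write node: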

polar_replica='on'
+recovery_target_timeline='latest'
+primary_conninfo='host=[IP of the read-write node] port=5432 user=postgres dbname=postgres application_name=replica1'
+primary_slot_name='replica1'
+
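The replica name shows up in several settings that need to agree: application_name inside primary_conninfo is the name that synchronous_standby_names on the read-write node refers to, and this walkthrough reuses the same name for primary_slot_name. As a rough sketch only (render_recovery_conf and the 192.0.2.10 address are made up for illustration and are not part of PolarDB), a small Python helper could generate the file from a single name so the values cannot drift apart:

```python
def render_recovery_conf(primary_host: str, replica_name: str, port: int = 5432) -> str:
    """Render recovery.conf text for a read-only node (hypothetical helper)."""
    # application_name must match the entry in synchronous_standby_names
    # on the read-write node; the slot name reuses it for simplicity.
    conninfo = (
        f"host={primary_host} port={port} user=postgres "
        f"dbname=postgres application_name={replica_name}"
    )
    lines = [
        "polar_replica='on'",
        "recovery_target_timeline='latest'",
        f"primary_conninfo='{conninfo}'",
        f"primary_slot_name='{replica_name}'",
    ]
    return "\n".join(lines) + "\n"


# 192.0.2.10 is a documentation-only placeholder address.
print(render_recovery_conf("192.0.2.10", "replica1"))
```

The output can be written to ~/replica1/recovery.conf; the settings themselves are exactly the four shown above.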

The read-write node does not yet have a replication slot named replica1, so connect to the read-write node and create it:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT pg_create_physical_replication_slot('replica1');"
+ pg_create_physical_replication_slot
+-------------------------------------
+ (replica1,)
+(1 row)
+

Start the read-only node

After completing the steps above, start the read-only node and verify it:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/replica1 start
+
+$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c 'SELECT version();'
+            version
+--------------------------------
+ PostgreSQL 11.9 (POLARDB 11.9)
+(1 row)
+

Verifying the cluster

Connect to the read-write node, create a table, and insert data:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5432 \\
+    -d postgres \\
+    -c "CREATE TABLE t(id INT); INSERT INTO t SELECT generate_series(1,10);"
+

The data inserted on the read-write node can be queried on the read-only node right away:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT * FROM t;"
+ id
+----
+  1
+  2
+  3
+  4
+  5
+  6
+  7
+  8
+  9
+ 10
+(10 rows)
+

On the read-write node, the replication slot used for physical replication with the read-only node is now active:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT * FROM pg_replication_slots;"
+ slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
+-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
+ replica1  |        | physical  |        |          | f         | t      |         45 |      |              | 0/4079E8E8  |
+(1 row)
+

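More read-only nodes can be added in the same way to scale out horizontally.

Scaling in the cluster

Scaling in is straightforward: shut down the read-only node.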

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/replica1 stop
+

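After the read-only node shuts down, the replication slot on the read-write node becomes inactive. An inactive replication slot prevents WAL from being recycled, so it should be cleaned up promptly.

Run the following command on the read-write node to drop the replication slot named replica1: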

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \\
+    -p 5432 \\
+    -d postgres \\
+    -c "SELECT pg_drop_replication_slot('replica1');"
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
`,61);function _(l,g){const o=t("ArticleInfo"),e=t("router-link");return c(),i("div",null,[k,n(o,{frontmatter:l.$frontmatter},null,8,["frontmatter"]),b,a("nav",v,[a("ul",null,[a("li",null,[n(e,{to:"#部署读写节点"},{default:p(()=>[s("部署读写节点")]),_:1}),a("ul",null,[a("li",null,[n(e,{to:"#确认存储可访问"},{default:p(()=>[s("确认存储可访问")]),_:1})]),a("li",null,[n(e,{to:"#格式化并挂载-pfs-文件系统"},{default:p(()=>[s("格式化并挂载 PFS 文件系统")]),_:1})]),a("li",null,[n(e,{to:"#初始化数据目录"},{default:p(()=>[s("初始化数据目录")]),_:1})]),a("li",null,[n(e,{to:"#编辑读写节点配置"},{default:p(()=>[s("编辑读写节点配置")]),_:1})]),a("li",null,[n(e,{to:"#启动读写节点"},{default:p(()=>[s("启动读写节点")]),_:1})])])]),a("li",null,[n(e,{to:"#集群扩容"},{default:p(()=>[s("集群扩容")]),_:1}),a("ul",null,[a("li",null,[n(e,{to:"#确认存储可访问-1"},{default:p(()=>[s("确认存储可访问")]),_:1})]),a("li",null,[n(e,{to:"#挂载-pfs-文件系统"},{default:p(()=>[s("挂载 PFS 文件系统")]),_:1})]),a("li",null,[n(e,{to:"#初始化数据目录-1"},{default:p(()=>[s("初始化数据目录")]),_:1})]),a("li",null,[n(e,{to:"#编辑只读节点配置"},{default:p(()=>[s("编辑只读节点配置")]),_:1})]),a("li",null,[n(e,{to:"#启动只读节点"},{default:p(()=>[s("启动只读节点")]),_:1})]),a("li",null,[n(e,{to:"#集群功能检查"},{default:p(()=>[s("集群功能检查")]),_:1})])])]),a("li",null,[n(e,{to:"#集群缩容"},{default:p(()=>[s("集群缩容")]),_:1})])])]),m])}const f=r(d,[["render",_],["__file","scale-out.html.vue"]]);export{f as default}; diff --git a/assets/shared-server.html-5057af3a.js b/assets/shared-server.html-5057af3a.js new file mode 100644 index 00000000000..927f57c45a0 --- /dev/null +++ b/assets/shared-server.html-5057af3a.js @@ -0,0 +1 @@ +import{_ as h,r as i,o as S,c as p,d as r,a as e,w as d,b as t,e as n}from"./app-3d1677bf.js";const 
g="/PolarDB-for-PostgreSQL/assets/ss-old-18134ff8.png",u="/PolarDB-for-PostgreSQL/assets/ss-new-2f3760ae.png",_="/PolarDB-for-PostgreSQL/assets/ss-pool-4965c655.png",x="/PolarDB-for-PostgreSQL/assets/ss-tpcc-c939c142.jpg",P="/PolarDB-for-PostgreSQL/assets/ss-pgbench1-c889b05c.jpg",f="/PolarDB-for-PostgreSQL/assets/ss-pgbench2-4ff36502.jpg",b={},v=e("h1",{id:"shared-server",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#shared-server","aria-hidden":"true"},"#"),t(" Shared Server")],-1),y={class:"table-of-contents"},m=e("h2",{id:"背景",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#背景","aria-hidden":"true"},"#"),t(" 背景")],-1),C=e("p",null,"原生 PostgreSQL 的连接调度方式是每一个进程对应一个连接 (One-Process-Per-Connection),这种调度方式适合低并发、长连接的业务场景。而在高并发或大量短连接的业务场景中,进程的大量创建、销毁以及上下文切换,会严重影响性能。同时,在业务容器化部署后,每个容器通过连接池向数据库发起连接,业务在高峰期会弹性扩展出很多容器,后端数据库的连接数会瞬间增高,影响数据库稳定性,导致 OOM 频发。",-1),D={href:"https://www.pgbouncer.org/",target:"_blank",rel:"noopener noreferrer"},B={href:"https://github.com/alibaba/druid",target:"_blank",rel:"noopener noreferrer"},L=e("code",null,"role",-1),E=n('

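To address these problems, PolarDB for PostgreSQL provides Shared Server (SS for short below), a connection pool built into the database. Its architecture combines shared memory, Session Context, Dispatcher forwarding, and a Backend Pool to decouple user connections from backend processes. Backend processes have three execution modes (Native, Shared, and Dedicated) and switch between them at runtime based on real-time load and on whether a process has been polluted. The load scheduling algorithm builds on the improvements AliSQL made to the thread pool of community MySQL: it uses a Stall mechanism to adjust the number of Workers elastically while preventing user connections from starving. This tackles at the root the performance and stability problems caused by high concurrency or large numbers of short-lived connections.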

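How it works

Under the native One-Process-Per-Connection scheduling strategy of PostgreSQL, each user connection is bound one-to-one to a backend process. The binding covers not only their lifetimes but also the serving relationship between them.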

ss-old

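The Shared Server built-in connection pool extracts the session-related context into a Session Context, which decouples user connections from backend processes, and introduces a Dispatcher to proxy and forward traffic: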

ss-new

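  • The Session Context holds session-related data in shared memory so that it can be shared across processes. It currently stores Prepared Statements, connection-private parameters, temporary table metadata, and so on, and can be extended further.
  • The Dispatcher process handles proxying and forwarding. User connections are dispatched through a Dispatcher to different backend processes, and a backend process is shared by multiple user connections through the Dispatcher. Multiple Dispatcher processes can be configured.
  • The backend processes managed by each Dispatcher are divided into backend pools keyed by <user, database, GUCs>. Each backend pool has its own exclusive group of backend processes; the number of backend processes in a pool grows as load rises and shrinks as load falls.
  • A transaction on a user connection is always served by the same backend process; different transactions may be served by different backend processes.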
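To make the <user, database, GUCs> pool key concrete, here is a toy Python model. It is only an illustration of the grouping idea, not the PolarDB implementation; pool_key, the session names, and the sample GUC values are all hypothetical.

```python
from collections import defaultdict


def pool_key(user, database, gucs):
    """Build a hashable pool key; GUCs are sorted so their order does not matter."""
    return (user, database, tuple(sorted(gucs.items())))


# Maps each pool key to the session names grouped under it.
pools = defaultdict(list)

sessions = [
    ("s1", "alice", "app", {"search_path": "public"}),
    ("s2", "alice", "app", {"search_path": "public"}),
    ("s3", "alice", "app", {"search_path": "audit"}),
    ("s4", "bob", "app", {"search_path": "public"}),
]
for name, user, database, gucs in sessions:
    pools[pool_key(user, database, gucs)].append(name)

for key, members in pools.items():
    print(key, members)
```

In this model s1 and s2 land in the same pool, while s3 (different GUCs) and s4 (different user) each get their own, which mirrors why sessions with differing settings cannot share backend processes.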

ss-pool

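In Shared Server, backend processes have three execution modes. A process switches between them at runtime based on real-time load and on whether the process has been polluted:

  • Native mode: a backend process serves exactly one user connection, and no Dispatcher forwards data.
    • When SS is disabled, all backend processes run in Native mode.
    • When SS is enabled, a backend process still falls back to Native mode during the login phase of the user connection in the following cases:
      • WAL Sender processes
      • MPP processes
      • SS shared memory is exhausted
      • The database or user is covered by the blacklist in the polar_ss_dedicated_dbuser_names parameter
  • Shared mode: the backend process is a shareable worker process available to all user connections. Shared mode is the standard, expected state of the connection pool and means the backend process is reusable. When SS is enabled, backend processes prefer Shared mode and switch to Dedicated mode when the fallback mechanism is triggered.
  • Dedicated mode (fallback mode): the backend process has been polluted for some reason and degrades to serving only the current user connection; when that connection exits, the backend process exits too.
    • The user connection no longer uses new SS shared memory and uses process-local memory instead.
    • Data between the user connection and the backend process is still forwarded through the Dispatcher.
    • The following cases trigger the fallback mechanism and switch the execution mode from Shared to Dedicated:
      • Updating a GUC parameter on the SS blacklist
      • Using an extension on the SS blacklist
      • Executing a DECLARE CURSOR command
      • Operating on a table with the ON COMMIT DELETE ROWS property
      • Executing a CURSOR WITH HOLD operation
      • Using a custom GUC parameter
      • Loading a dynamic library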

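Performance comparison

Shared Server targets workloads with high concurrency or many short-lived connections, so TPC-C is used for testing here.

TPC-C under high concurrency

The test runs on a single physical machine with 104 cores and 512 GB of memory, using TPC-C with 1000 warehouses, and compares the scores of different configurations as concurrency grows from 300 to 5000. The configurations shown in the figure below are:

  • old: no connection pool; the native PostgreSQL execution mode (that is, Native mode)
  • ss off: Shared Server built-in connection pool with the SS switch turned off before startup, degrading to Native mode
  • ss native: Shared Server built-in connection pool with the SS switch turned off after startup, degrading to Native mode
  • ss dedicated: Shared Server built-in connection pool with the SS switch turned on after startup, but with Dedicated mode forced
  • ss shared: Shared Server built-in connection pool with the SS switch turned on after startup, using the standard Shared mode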

ss-tpcc

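The figure shows that:

  • With native PostgreSQL, with Shared Server disabled, and with Shared Server in fallback (Dedicated) mode, the TPC-C high-concurrency test cannot run stably. Performance starts to drop at a concurrency of 1500, and at 5000 the database can no longer serve requests.
  • With Shared Server enabled and running in Shared mode, TPC-C performance is unaffected by high concurrency and stays stable, so high-concurrency workloads are well supported.

PgBench with short-lived connections

On a single physical machine with 104 cores and 512 GB of memory, pgbench is used to measure the performance of the following configurations as the number of concurrent short-lived connections grows from 1 to 128: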

',20),O={href:"https://www.pgbouncer.org/features.html",target:"_blank",rel:"noopener noreferrer"},T={href:"https://www.pgbouncer.org/features.html",target:"_blank",rel:"noopener noreferrer"},N=e("li",null,"old:不使用任何连接池,使用 PostgreSQL 的原生执行模式",-1),R=e("li",null,"ss dedicated:使用 Shared Server 内置连接池,但强制设置为 Dedicated 模式",-1),k=e("li",null,"ss shared:使用 Shared Server 内置连接池,配置为标准的 Shared 模式",-1),Q=n('

ss-pgbench1

ss-pgbench2

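The figures show that with a connection pool, both PgBouncer and Shared Server improve performance for short-lived connections. PgBouncer improves performance by at most 14x, while Shared Server improves it by up to 42x.

Features

Comparison with PgBouncer

PgBouncer, a typical database-side connection pool in the industry, has several modes. Its session pooling mode only benefits short-lived connections and is generally not used; its transaction pooling mode benefits both short-lived and long-lived connections and is the default recommended mode. Compared with PgBouncer, Shared Server differs as follows: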

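Feature                       | PgBouncer Session Pooling | PgBouncer Transaction Pooling | Shared Server
------------------------------+---------------------------+-------------------------------+------------------------------
Startup parameters            | Limited                   | Limited                       | Supported
SSL                           | Supported                 | Supported                     | Planned
LISTEN/NOTIFY                 | Supported                 | Not supported                 | Supported (triggers fallback)
LOAD statement                | Supported                 | Not supported                 | Supported (triggers fallback)
Session-level advisory locks  | Supported                 | Not supported                 | Supported (triggers fallback)
SET/RESET GUC                 | Supported                 | Not supported                 | Supported
Protocol-level prepared plans | Supported                 | Planned                       | Supported
PREPARE / DEALLOCATE          | Supported                 | Not supported                 | Supported
Cached Plan Reset             | Supported                 | Supported                     | Supported
WITHOUT HOLD CURSOR           | Supported                 | Supported                     | Supported
WITH HOLD CURSOR              | Supported                 | Not supported                 | Planned (triggers fallback)
PRESERVE/DELETE ROWS temp     | Supported                 | Not supported                 | Planned (triggers fallback)
ON COMMIT DROP temp           | Supported                 | Supported                     | Supported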

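Notes:

  • PgBouncer supports only the following startup parameters:
    • client_encoding
    • datestyle
    • timezone
    • standard_conforming_strings
  • "Triggers fallback" means the backend process enters the Dedicated fallback mode: once the user connection disconnects, the backend process is released too, so a polluted process is never reused by other user connections.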

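Custom configuration

To adapt to different environments, Shared Server supports a rich set of configuration parameters:

  1. Configure the maximum number of Dispatcher processes and backend processes, so the setup can be tuned at runtime for the best performance
  2. Enable the SS Shared mode only after the total number of connections exceeds a threshold, since SS brings little benefit when there are few connections
  3. Force Dedicated mode, so that a polluted backend process does not keep affecting other user connections
  4. Exclude specified databases/users from Shared Server, leaving an emergency channel for dedicated accounts and administrators
  5. Exclude specified extensions from Shared Server, so that a misbehaving external extension does not destabilize Shared Server
  6. Exclude specified GUC parameters from Shared Server, so that complex GUC behavior does not destabilize Shared Server
  7. Fall back to Native mode when the number of connections blocked in the Dispatcher exceeds a threshold, so that a Dispatcher defect does not cause unavailability
  8. Configure the wait timeout for user connections, so that they do not wait for a backend process for a long time
  9. Configure an idle-time threshold for backend processes, so that long-idle backend processes do not hold system resources
  10. Configure an active-time threshold for backend processes, so that long-active backend processes do not hold system resources
  11. Configure the minimum number of backend processes retained in each backend pool, keeping the pool warm and preventing all processes from being released
  12. Configure Shared Server debug logging to help troubleshoot backend process scheduling problems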

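Usage

Common parameters

Typical Shared Server configuration parameters are described below:

  • polar_enable_shm_aset: whether to enable global shared memory. Currently disabled by default. Takes effect after a restart.
  • polar_ss_shared_memory_size: upper limit of the global shared memory used by Shared Server, in kB. A value of 0 disables it. Defaults to 1 MB. Takes effect after a restart.
  • polar_ss_dispatcher_count: maximum number of Dispatcher processes. Defaults to 2; the maximum is the number of CPU cores, and setting it equal to the number of CPU cores is recommended. Takes effect after a restart.
  • polar_enable_shared_server: whether Shared Server is enabled. Disabled by default.
  • polar_ss_backend_max_count: maximum number of backend processes. Defaults to -5, meaning 1/5 of max_connections; 0 or -1 means the same as max_connections. A value of 10 times the number of CPU cores is recommended.
  • polar_ss_backend_idle_timeout: idle time after which a backend process exits. Defaults to 3 minutes.
  • polar_ss_session_wait_timeout: maximum time a user connection waits to be served when all backend processes are in use. Defaults to 60 seconds.
  • polar_ss_dedicated_dbuser_names: databases/users whose connections use Native mode. Empty by default. The format is d1/_,_/u1,d2/u2, meaning that any connection to database d1, any connection by user u1, and any connection to database d2 by user u2 falls back to Native mode.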
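Two of these parameters use small encodings that are easy to misread, so a Python sketch of how they could be interpreted may help. This is not PolarDB source code: effective_backend_max and uses_native_mode are hypothetical names, and treating any negative value -N as max_connections / N is an assumption that generalizes the documented -5 and -1 cases.

```python
def effective_backend_max(setting: int, max_connections: int) -> int:
    """Resolve polar_ss_backend_max_count to an absolute process count."""
    if setting in (0, -1):
        # Documented: 0 and -1 both mean "same as max_connections".
        return max_connections
    if setting < 0:
        # Assumption: -N means max_connections / N (documented only for -5).
        return max_connections // abs(setting)
    return setting


def uses_native_mode(rules: str, database: str, user: str) -> bool:
    """Check a database/user pair against a d1/_,_/u1,d2/u2 style rule list."""
    for rule in filter(None, (part.strip() for part in rules.split(","))):
        db_pattern, user_pattern = rule.split("/")
        # "_" acts as a wildcard on either side of the slash.
        if db_pattern in ("_", database) and user_pattern in ("_", user):
            return True
    return False


print(effective_backend_max(-5, 1000))
print(uses_native_mode("d1/_,_/u1,d2/u2", "d2", "u3"))
```

With max_connections set to 1000, the default of -5 resolves to 200 backend processes, and a connection to database d2 by user u3 matches none of the three sample rules.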
',16);function U(o,w){const s=i("Badge"),c=i("ArticleInfo"),l=i("router-link"),a=i("ExternalLinkIcon");return S(),p("div",null,[v,r(s,{type:"tip",text:"V11 / v1.1.30-",vertical:"top"}),r(c,{frontmatter:o.$frontmatter},null,8,["frontmatter"]),e("nav",y,[e("ul",null,[e("li",null,[r(l,{to:"#背景"},{default:d(()=>[t("背景")]),_:1})]),e("li",null,[r(l,{to:"#原理"},{default:d(()=>[t("原理")]),_:1})]),e("li",null,[r(l,{to:"#性能对比"},{default:d(()=>[t("性能对比")]),_:1}),e("ul",null,[e("li",null,[r(l,{to:"#tpc-c-高并发"},{default:d(()=>[t("TPC-C 高并发")]),_:1})]),e("li",null,[r(l,{to:"#pgbench-短连接"},{default:d(()=>[t("PgBench 短连接")]),_:1})])])]),e("li",null,[r(l,{to:"#功能特性"},{default:d(()=>[t("功能特性")]),_:1}),e("ul",null,[e("li",null,[r(l,{to:"#pgbouncer-对比"},{default:d(()=>[t("PgBouncer 对比")]),_:1})]),e("li",null,[r(l,{to:"#自定义配置"},{default:d(()=>[t("自定义配置")]),_:1})])])]),e("li",null,[r(l,{to:"#使用说明"},{default:d(()=>[t("使用说明")]),_:1}),e("ul",null,[e("li",null,[r(l,{to:"#常用参数"},{default:d(()=>[t("常用参数")]),_:1})])])])])]),m,C,e("p",null,[t("为了解决上述问题,业界在使用 PostgreSQL 时通常会配置连接池组件,比如部署在数据库侧的后置连接池 "),e("a",D,[t("PgBouncer"),r(a)]),t(",部署在应用侧的前置连接池 "),e("a",B,[t("Druid"),r(a)]),t("。但后置连接池无法支持保留用户连接私有信息(如 GUC 参数、Prepared Statement)的相关功能,在面临进程被污染的情况(如加载动态链接库、修改 "),L,t(" 参数)时也无法及时清理。前置连接池不仅无法解决后置连接池的缺陷,还无法根据应用规模扩展而实时调整配置,仍然会面临连接数膨胀的问题。")]),E,e("ul",null,[e("li",null,[t("pgbouncer session:使用 PgBouncer 后置连接池, 配置为 "),e("a",O,[t("session pooling"),r(a)]),t(" 模式")]),e("li",null,[t("pgbouncer transaction:使用 PgBouncer 后置连接池, 配置为 "),e("a",T,[t("transaction pooling"),r(a)]),t(" 模式")]),N,R,k]),Q])}const A=h(b,[["render",U],["__file","shared-server.html.vue"]]);export{A as default}; diff --git a/assets/shared-server.html-aa99c110.js b/assets/shared-server.html-aa99c110.js new file mode 100644 index 00000000000..e21e26d4000 --- /dev/null +++ b/assets/shared-server.html-aa99c110.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-69fcb160","path":"/zh/features/v11/performance/shared-server.html","title":"Shared 
Server","lang":"zh-CN","frontmatter":{"author":"严华","date":"2022/11/25","minute":20},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"原理","slug":"原理","link":"#原理","children":[]},{"level":2,"title":"性能对比","slug":"性能对比","link":"#性能对比","children":[{"level":3,"title":"TPC-C 高并发","slug":"tpc-c-高并发","link":"#tpc-c-高并发","children":[]},{"level":3,"title":"PgBench 短连接","slug":"pgbench-短连接","link":"#pgbench-短连接","children":[]}]},{"level":2,"title":"功能特性","slug":"功能特性","link":"#功能特性","children":[{"level":3,"title":"PgBouncer 对比","slug":"pgbouncer-对比","link":"#pgbouncer-对比","children":[]},{"level":3,"title":"自定义配置","slug":"自定义配置","link":"#自定义配置","children":[]}]},{"level":2,"title":"使用说明","slug":"使用说明","link":"#使用说明","children":[{"level":3,"title":"常用参数","slug":"常用参数","link":"#常用参数","children":[]}]}],"git":{"updatedTime":1693374263000},"filePathRelative":"zh/features/v11/performance/shared-server.md"}');export{e as data}; diff --git a/assets/smlar.html-43cf50c7.js b/assets/smlar.html-43cf50c7.js new file mode 100644 index 00000000000..9eec85c07e1 --- /dev/null +++ b/assets/smlar.html-43cf50c7.js @@ -0,0 +1 @@ +const 
l=JSON.parse('{"key":"v-bc8fc3a4","path":"/zh/features/v11/extensions/smlar.html","title":"smlar","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/10/05","minute":10},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"函数及运算符介绍","slug":"函数及运算符介绍","link":"#函数及运算符介绍","children":[]},{"level":2,"title":"可配置参数说明","slug":"可配置参数说明","link":"#可配置参数说明","children":[]},{"level":2,"title":"基本使用方法","slug":"基本使用方法","link":"#基本使用方法","children":[{"level":3,"title":"安装插件","slug":"安装插件","link":"#安装插件","children":[]},{"level":3,"title":"相似度计算","slug":"相似度计算","link":"#相似度计算","children":[]},{"level":3,"title":"卸载插件","slug":"卸载插件","link":"#卸载插件","children":[]}]},{"level":2,"title":"原理和设计","slug":"原理和设计","link":"#原理和设计","children":[]}],"git":{"updatedTime":1703745117000},"filePathRelative":"zh/features/v11/extensions/smlar.md"}');export{l as data}; diff --git a/assets/smlar.html-e8b6a5e2.js b/assets/smlar.html-e8b6a5e2.js new file mode 100644 index 00000000000..939fcf1e444 --- /dev/null +++ b/assets/smlar.html-e8b6a5e2.js @@ -0,0 +1,20 @@ +import{_ as d,r as l,o as u,c as k,d as s,a as n,w as t,b as a,e as c}from"./app-3d1677bf.js";const h={},g=n("h1",{id:"smlar",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#smlar","aria-hidden":"true"},"#"),a(" smlar")],-1),m={class:"table-of-contents"},_=n("h2",{id:"背景",tabindex:"-1"},[n("a",{class:"header-anchor",href:"#背景","aria-hidden":"true"},"#"),a(" 背景")],-1),f={href:"https://github.com/jirutka/smlar",target:"_blank",rel:"noopener noreferrer"},y=n("code",null,"smlar",-1),v=c(`

Note

The % operator of the smlar extension conflicts with the % operator of the RUM extension, so smlar and RUM cannot both be created in the same schema.

Functions and operators

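  • float4 smlar(anyarray, anyarray)

    Computes the similarity of two arrays. The arrays must have the same data type.

  • float4 smlar(anyarray, anyarray, bool useIntersect)

    Computes the similarity of two arrays of a user-defined composite type. The useIntersect parameter controls whether only the overlapping elements or all elements take part in the calculation. The composite type can be defined as follows: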

    CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);
    +
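  • float4 smlar(anyarray a, anyarray b, text formula)

    Computes the similarity of two arrays using the formula given as an argument. The arrays must have the same data type. The formula can use these predefined variables:

    • N.i: number of elements common to both arrays (the intersection)
    • N.a: number of unique elements in the first array
    • N.b: number of unique elements in the second array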
    SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)');
    +
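  • anyarray % anyarray

    Returns TRUE when the similarity of the two arrays exceeds the threshold, and FALSE otherwise.

  • text[] tsvector2textarray(tsvector)

    Converts a tsvector value into an array of strings.

  • anyarray array_unique(anyarray)

    Sorts the array and removes duplicates.

  • float4 inarray(anyarray, anyelement)

    Returns 1.0 if the element is present in the array; otherwise returns 0.

  • float4 inarray(anyarray, anyelement, float4, float4)

    Returns the third argument if the element is present in the array; otherwise returns the fourth argument.

Configuration parameters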

`,4),b=n("li",null,[n("p",null,[n("strong",null,[n("code",null,"smlar.threshold FLOAT")])]),n("p",null,[a("相似度阈值,用于给 "),n("code",null,"%"),a(" 运算符判断两个数组是否相似。")])],-1),w=n("li",null,[n("p",null,[n("strong",null,[n("code",null,"smlar.persistent_cache BOOL")])]),n("p",null,"全局统计信息的缓存是否存放在与事务无关的内存中。")],-1),E=n("p",null,[n("strong",null,[n("code",null,"smlar.type STRING")]),a(":相似度计算公式,可选的相似度类型包含:")],-1),x={href:"https://en.wikipedia.org/wiki/Cosine_similarity",target:"_blank",rel:"noopener noreferrer"},N={href:"https://zh.wikipedia.org/zh-cn/Tf-idf",target:"_blank",rel:"noopener noreferrer"},T={href:"https://en.wikipedia.org/wiki/Overlap_coefficient",target:"_blank",rel:"noopener noreferrer"},q=c(`
  • smlar.stattable STRING

    Name of the table that stores set-wide statistics. The table is defined as follows:

    CREATE TABLE table_name (
    +  value   data_type UNIQUE,
    +  ndoc    int4 (or bigint)  NOT NULL CHECK (ndoc>0)
    +);
    +
  • smlar.tf_method STRING: method for computing term frequency (TF). Possible values:

    • n: simple count (default)
    • log: 1 + log(n)
    • const: frequency is always 1
  • smlar.idf_plus_one BOOL: method for computing inverse document frequency (IDF). Possible values:

    • FALSE: log(d / df) (default)
    • TRUE: log(1 + d / df)
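The TF and IDF variants listed here are simple enough to evaluate by hand. The Python sketch below restates them under a few assumptions that the text does not spell out: n is the number of times a term occurs, d is the total number of documents, df is the number of documents containing the term, and log is the natural logarithm. The function names are hypothetical and are not part of smlar.

```python
import math


def term_frequency(n: int, method: str = "n") -> float:
    """Term frequency for an occurrence count n under each tf_method value."""
    if method == "n":
        return float(n)           # simple count
    if method == "log":
        return 1.0 + math.log(n)  # dampened count
    if method == "const":
        return 1.0                # presence only
    raise ValueError(f"unknown tf_method: {method}")


def inverse_document_frequency(d: int, df: int, plus_one: bool = False) -> float:
    """IDF for d total documents, df of which contain the term."""
    ratio = d / df
    return math.log(1.0 + ratio) if plus_one else math.log(ratio)


print(term_frequency(3, "log"))
print(inverse_document_frequency(10, 10))
print(inverse_document_frequency(10, 10, plus_one=True))
```

One visible difference: a term that appears in every document gets an IDF of zero with idf_plus_one off, but keeps a positive weight with it on.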
  • `,3),L=c(`

    Basic usage

    Install the extension

    CREATE EXTENSION smlar;
    +

    Computing similarity

    Use the functions described above to compute the similarity of two arrays:

    SELECT smlar('{3,2}'::int[], '{3,2,1}');
    +  smlar
    +----------
    + 0.816497
    +(1 row)
    +
    +SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / (N.a + N.b)' );
    +  smlar
    +----------
    + 0.333333
    +(1 row)
    +
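The two results above can be checked by hand. The following Python sketch recomputes them from the N.i, N.a, and N.b definitions, assuming the default similarity type is cosine, that is N.i / sqrt(N.a * N.b). It is an independent re-derivation rather than the smlar implementation, and the helper names are made up for illustration.

```python
import math


def counts(a, b):
    """Return (N.i, N.a, N.b): shared elements and unique elements per array."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b), len(set_a), len(set_b)


def cosine_similarity(a, b):
    """Cosine similarity over sets: N.i / sqrt(N.a * N.b)."""
    n_i, n_a, n_b = counts(a, b)
    return n_i / math.sqrt(n_a * n_b)


def custom_similarity(a, b):
    """The custom formula from the second query: N.i / (N.a + N.b)."""
    n_i, n_a, n_b = counts(a, b)
    return n_i / (n_a + n_b)


print(round(cosine_similarity([3, 2], [3, 2, 1]), 6))     # 0.816497
print(round(custom_similarity([1, 4, 6], [5, 4, 6]), 6))  # 0.333333
```

For the first query N.i = 2, N.a = 2, and N.b = 3, giving 2 / sqrt(6); for the second N.i = 2 and N.a = N.b = 3, giving 2 / 6.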

    Uninstall the extension

    DROP EXTENSION smlar;
    +

    Principles and design

    `,9),S={href:"https://github.com/jirutka/smlar",target:"_blank",rel:"noopener noreferrer"},I={href:"https://www.pgcon.org/2012/schedule/track/Hacking/443.en.html",target:"_blank",rel:"noopener noreferrer"},C={href:"https://www.pgcon.org/2012/schedule/attachments/252_smlar-2012.pdf",target:"_blank",rel:"noopener noreferrer"};function A(r,O){const p=l("Badge"),i=l("ArticleInfo"),e=l("router-link"),o=l("ExternalLinkIcon");return u(),k("div",null,[g,s(p,{type:"tip",text:"V11 / v1.1.28-",vertical:"top"}),s(i,{frontmatter:r.$frontmatter},null,8,["frontmatter"]),n("nav",m,[n("ul",null,[n("li",null,[s(e,{to:"#背景"},{default:t(()=>[a("背景")]),_:1})]),n("li",null,[s(e,{to:"#函数及运算符介绍"},{default:t(()=>[a("函数及运算符介绍")]),_:1})]),n("li",null,[s(e,{to:"#可配置参数说明"},{default:t(()=>[a("可配置参数说明")]),_:1})]),n("li",null,[s(e,{to:"#基本使用方法"},{default:t(()=>[a("基本使用方法")]),_:1}),n("ul",null,[n("li",null,[s(e,{to:"#安装插件"},{default:t(()=>[a("安装插件")]),_:1})]),n("li",null,[s(e,{to:"#相似度计算"},{default:t(()=>[a("相似度计算")]),_:1})]),n("li",null,[s(e,{to:"#卸载插件"},{default:t(()=>[a("卸载插件")]),_:1})])])]),n("li",null,[s(e,{to:"#原理和设计"},{default:t(()=>[a("原理和设计")]),_:1})])])]),_,n("p",null,[a("对大规模的数据进行相似度计算在电商业务、搜索引擎中是一个很关键的技术问题。相对简易的相似度计算实现不仅运算速度慢,还十分消耗资源。"),n("a",f,[y,s(o)]),a(" 是 PostgreSQL 的一款开源第三方插件,提供了可以在数据库内高效计算数据相似度的函数,并提供了支持 GiST 和 GIN 索引的相似度运算符。目前该插件已经支持 PostgreSQL 所有的内置数据类型。")]),v,n("ul",null,[b,w,n("li",null,[E,n("ul",null,[n("li",null,[n("a",x,[a("cosine"),s(o)]),a("(默认)")]),n("li",null,[n("a",N,[a("tfidf"),s(o)])]),n("li",null,[n("a",T,[a("overlap"),s(o)])])])]),q]),L,n("p",null,[n("a",S,[a("GitHub - jirutka/smlar"),s(o)])]),n("p",null,[n("a",I,[a("PGCon 2012 - Finding Similar: Effective similarity search in database"),s(o)]),a(" ("),n("a",C,[a("slides"),s(o)]),a(")")])])}const F=d(h,[["render",A],["__file","smlar.html.vue"]]);export{F as default}; diff --git a/assets/software-level-5e0933bc.png b/assets/software-level-5e0933bc.png new file mode 100644 index 00000000000..141bba3dc19 Binary 
files /dev/null and b/assets/software-level-5e0933bc.png differ diff --git a/assets/ss-new-2f3760ae.png b/assets/ss-new-2f3760ae.png new file mode 100644 index 00000000000..9138352f595 Binary files /dev/null and b/assets/ss-new-2f3760ae.png differ diff --git a/assets/ss-old-18134ff8.png b/assets/ss-old-18134ff8.png new file mode 100644 index 00000000000..dfbab045b84 Binary files /dev/null and b/assets/ss-old-18134ff8.png differ diff --git a/assets/ss-pgbench1-c889b05c.jpg b/assets/ss-pgbench1-c889b05c.jpg new file mode 100644 index 00000000000..0951e2202ba Binary files /dev/null and b/assets/ss-pgbench1-c889b05c.jpg differ diff --git a/assets/ss-pgbench2-4ff36502.jpg b/assets/ss-pgbench2-4ff36502.jpg new file mode 100644 index 00000000000..04eefb8b124 Binary files /dev/null and b/assets/ss-pgbench2-4ff36502.jpg differ diff --git a/assets/ss-pool-4965c655.png b/assets/ss-pool-4965c655.png new file mode 100644 index 00000000000..dca7f12306f Binary files /dev/null and b/assets/ss-pool-4965c655.png differ diff --git a/assets/ss-tpcc-c939c142.jpg b/assets/ss-tpcc-c939c142.jpg new file mode 100644 index 00000000000..2cf776d708e Binary files /dev/null and b/assets/ss-tpcc-c939c142.jpg differ diff --git a/assets/storage-aliyun-essd.html-3dd7acdd.js b/assets/storage-aliyun-essd.html-3dd7acdd.js new file mode 100644 index 00000000000..30f73dd7bd1 --- /dev/null +++ b/assets/storage-aliyun-essd.html-3dd7acdd.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-5a992740","path":"/deploying/storage-aliyun-essd.html","title":"阿里云 ECS + ESSD 云盘存储","lang":"en-US","frontmatter":{"author":"棠羽","date":"2022/05/09","minute":20},"headers":[{"level":2,"title":"部署阿里云 ECS","slug":"部署阿里云-ecs","link":"#部署阿里云-ecs","children":[]},{"level":2,"title":"准备 ESSD 
云盘","slug":"准备-essd-云盘","link":"#准备-essd-云盘","children":[]},{"level":2,"title":"检查云盘","slug":"检查云盘","link":"#检查云盘","children":[]},{"level":2,"title":"准备分布式文件系统","slug":"准备分布式文件系统","link":"#准备分布式文件系统","children":[]}],"git":{"updatedTime":1665901724000},"filePathRelative":"deploying/storage-aliyun-essd.md"}');export{e as data}; diff --git a/assets/storage-aliyun-essd.html-82759337.js b/assets/storage-aliyun-essd.html-82759337.js new file mode 100644 index 00000000000..e9c5a32c7dd --- /dev/null +++ b/assets/storage-aliyun-essd.html-82759337.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-6c33fa62","path":"/zh/deploying/storage-aliyun-essd.html","title":"阿里云 ECS + ESSD 云盘存储","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/05/09","minute":20},"headers":[{"level":2,"title":"部署阿里云 ECS","slug":"部署阿里云-ecs","link":"#部署阿里云-ecs","children":[]},{"level":2,"title":"准备 ESSD 云盘","slug":"准备-essd-云盘","link":"#准备-essd-云盘","children":[]},{"level":2,"title":"检查云盘","slug":"检查云盘","link":"#检查云盘","children":[]},{"level":2,"title":"准备分布式文件系统","slug":"准备分布式文件系统","link":"#准备分布式文件系统","children":[]}],"git":{"updatedTime":1665901724000},"filePathRelative":"zh/deploying/storage-aliyun-essd.md"}');export{e as data}; diff --git a/assets/storage-aliyun-essd.html-a35a0fec.js b/assets/storage-aliyun-essd.html-a35a0fec.js new file mode 100644 index 00000000000..9fe0b752213 --- /dev/null +++ b/assets/storage-aliyun-essd.html-a35a0fec.js @@ -0,0 +1,6 @@ +import{_ as i,r as a,o as d,c as p,a as e,b as s,d as n,w as u,e as m}from"./app-3d1677bf.js";const 
h="/PolarDB-for-PostgreSQL/assets/aliyun-ecs-procedure-60ba621e.png",_="/PolarDB-for-PostgreSQL/assets/aliyun-ecs-specs-323b2032.png",S="/PolarDB-for-PostgreSQL/assets/aliyun-ecs-system-disk-c8a747ce.png",g="/PolarDB-for-PostgreSQL/assets/aliyun-ecs-instance-b4e46a52.png",E="/PolarDB-for-PostgreSQL/assets/aliyun-essd-specs-207958e6.png",f="/PolarDB-for-PostgreSQL/assets/aliyun-essd-ready-to-mount-59aa890c.png",b="/PolarDB-for-PostgreSQL/assets/aliyun-essd-mounting-1b470123.png",y="/PolarDB-for-PostgreSQL/assets/aliyun-essd-mounted-f02e5c42.png",C={},k={id:"阿里云-ecs-essd-云盘存储",tabindex:"-1"},v=e("a",{class:"header-anchor",href:"#阿里云-ecs-essd-云盘存储","aria-hidden":"true"},"#",-1),D={href:"https://developer.aliyun.com/live/249628"},P={href:"https://help.aliyun.com/document_detail/122389.html",target:"_blank",rel:"noopener noreferrer"},B={href:"https://help.aliyun.com/document_detail/256487.html",target:"_blank",rel:"noopener noreferrer"},x={href:"https://help.aliyun.com/document_detail/262105.html",target:"_blank",rel:"noopener noreferrer"},L=e("p",null,"本文将指导您完成以下过程:",-1),N=e("ol",null,[e("li",null,"部署两台阿里云 ECS 作为计算节点"),e("li",null,"将一块 ESSD 云盘多重挂载到两台 ECS 上,作为共享存储"),e("li",null,"在 ESSD 共享存储上格式化分布式文件系统 PFS"),e("li",null,"基于 PFS,在两台 ECS 上共同搭建一个存算分离、读写分离的 PolarDB 集群")],-1),M=e("p",null,[e("img",{src:h,alt:"aliyun-ecs-procedure"})],-1),Q=e("h2",{id:"部署阿里云-ecs",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#部署阿里云-ecs","aria-hidden":"true"},"#"),s(" 部署阿里云 ECS")],-1),V={href:"https://www.aliyun.com/product/ecs",target:"_blank",rel:"noopener noreferrer"},G={href:"https://help.aliyun.com/document_detail/256487.htm?spm=a2c4g.11186623.0.0.61397e72QGaXV0#section-4w6-dyy-otg",target:"_blank",rel:"noopener noreferrer"},I=e("strong",null,"部分可用区",-1),w=e("strong",null,"部分规格",-1),A=m('

    aliyun-ecs-specs

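    For the ECS storage configuration, the system disk can use any storage type; do not select a data disk or shared disk for now. A separate ESSD cloud disk will be created later as the shared disk: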

    aliyun-ecs-system-disk

    As shown in the figure, create two ECS instances in the same availability zone:

    aliyun-ecs-instance

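    Prepare the ESSD cloud disk

    In the Alibaba Cloud ECS console, choose Cloud Disks (云盘) under Storage and Snapshots (存储与快照) and click Create Cloud Disk (创建云盘). In the same availability zone as the ECS instances created earlier, create an ESSD cloud disk and select the multi-instance attach option (多实例挂载). If your ECS instances do not meet the requirements for multi-instance attach, this checkbox does not appear.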

    aliyun-essd-specs

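    After the ESSD cloud disk is created, the console shows that the disk supports multi-attach and that it is waiting to be attached (待挂载):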

    aliyun-essd-ready-to-mount

    Next, attach this cloud disk to each of the two ECS instances:

    aliyun-essd-mounting

    After attaching, view the cloud disk; it lists the two ECS instances it is attached to:

    aliyun-essd-mounted

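    Check the cloud disk

    Connect to each of the two ECS instances over ssh and run lsblk. The output shows:

    • nvme0n1 is the 40 GB ECS system disk, private to each ECS instance
    • nvme1n1 is the 100 GB ESSD cloud disk, visible to both ECS instances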
    $ lsblk
    +NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    +nvme0n1     259:0    0   40G  0 disk
    +└─nvme0n1p1 259:1    0   40G  0 part /etc/hosts
    +nvme1n1     259:2    0  100G  0 disk
    +

    Prepare the distributed file system

    `,20);function R(o,T){const r=a("Badge"),l=a("ArticleInfo"),t=a("ExternalLinkIcon"),c=a("RouterLink");return d(),p("div",null,[e("h1",k,[v,s(" 阿里云 ECS + ESSD 云盘存储 "),e("a",D,[n(r,{type:"tip",text:"视频",vertical:"top"})])]),n(l,{frontmatter:o.$frontmatter},null,8,["frontmatter"]),e("p",null,[e("a",P,[s("阿里云 ESSD(Enhanced SSD)云盘"),n(t)]),s(" 结合 25 GE 网络和 RDMA 技术,能够提供单盘高达 100 万的随机读写能力和单路低时延性能。阿里云 ESSD 云盘支持 NVMe 协议,且可以同时挂载到多台支持 NVMe 协议的 ECS(Elastic Compute Service)实例上,从而实现多个 ECS 实例并发读写访问,具备高可靠、高并发、高性能等特点。更新信息请参考阿里云 ECS 文档:")]),e("ul",null,[e("li",null,[e("a",B,[s("支持 NVMe 协议的云盘概述"),n(t)])]),e("li",null,[e("a",x,[s("开启多重挂载功能"),n(t)])])]),L,N,M,Q,e("p",null,[s("首先需要准备两台或以上的 "),e("a",V,[s("阿里云 ECS"),n(t)]),s("。目前,ECS 对支持 ESSD 多重挂载的规格有较多限制,详情请参考 "),e("a",G,[s("使用限制"),n(t)]),s("。仅 "),I,s("、"),w,s("(ecs.g7se、ecs.c7se、ecs.r7se)的 ECS 实例可以支持 ESSD 的多重挂载。如图,请务必选择支持多重挂载的 ECS 规格:")]),A,e("p",null,[s("接下来,将在两台 ECS 上分别部署 PolarDB 的主节点和只读节点。作为前提,需要在 ECS 共享的 ESSD 块设备上 "),n(c,{to:"/zh/deploying/fs-pfs.html"},{default:u(()=>[s("格式化并挂载 PFS")]),_:1}),s("。")])])}const O=i(C,[["render",R],["__file","storage-aliyun-essd.html.vue"]]);export{O as default}; diff --git a/assets/storage-aliyun-essd.html-f09c57cf.js b/assets/storage-aliyun-essd.html-f09c57cf.js new file mode 100644 index 00000000000..f95296b28a2 --- /dev/null +++ b/assets/storage-aliyun-essd.html-f09c57cf.js @@ -0,0 +1,6 @@ +import{_ as i,r as a,o as d,c as p,a as e,b as s,d as n,w as u,e as m}from"./app-3d1677bf.js";const 
h="/PolarDB-for-PostgreSQL/assets/aliyun-ecs-procedure-60ba621e.png",_="/PolarDB-for-PostgreSQL/assets/aliyun-ecs-specs-323b2032.png",S="/PolarDB-for-PostgreSQL/assets/aliyun-ecs-system-disk-c8a747ce.png",g="/PolarDB-for-PostgreSQL/assets/aliyun-ecs-instance-b4e46a52.png",E="/PolarDB-for-PostgreSQL/assets/aliyun-essd-specs-207958e6.png",f="/PolarDB-for-PostgreSQL/assets/aliyun-essd-ready-to-mount-59aa890c.png",b="/PolarDB-for-PostgreSQL/assets/aliyun-essd-mounting-1b470123.png",y="/PolarDB-for-PostgreSQL/assets/aliyun-essd-mounted-f02e5c42.png",C={},k={id:"阿里云-ecs-essd-云盘存储",tabindex:"-1"},v=e("a",{class:"header-anchor",href:"#阿里云-ecs-essd-云盘存储","aria-hidden":"true"},"#",-1),D={href:"https://developer.aliyun.com/live/249628"},P={href:"https://help.aliyun.com/document_detail/122389.html",target:"_blank",rel:"noopener noreferrer"},B={href:"https://help.aliyun.com/document_detail/256487.html",target:"_blank",rel:"noopener noreferrer"},x={href:"https://help.aliyun.com/document_detail/262105.html",target:"_blank",rel:"noopener noreferrer"},L=e("p",null,"本文将指导您完成以下过程:",-1),N=e("ol",null,[e("li",null,"部署两台阿里云 ECS 作为计算节点"),e("li",null,"将一块 ESSD 云盘多重挂载到两台 ECS 上,作为共享存储"),e("li",null,"在 ESSD 共享存储上格式化分布式文件系统 PFS"),e("li",null,"基于 PFS,在两台 ECS 上共同搭建一个存算分离、读写分离的 PolarDB 集群")],-1),M=e("p",null,[e("img",{src:h,alt:"aliyun-ecs-procedure"})],-1),Q=e("h2",{id:"部署阿里云-ecs",tabindex:"-1"},[e("a",{class:"header-anchor",href:"#部署阿里云-ecs","aria-hidden":"true"},"#"),s(" 部署阿里云 ECS")],-1),V={href:"https://www.aliyun.com/product/ecs",target:"_blank",rel:"noopener noreferrer"},G={href:"https://help.aliyun.com/document_detail/256487.htm?spm=a2c4g.11186623.0.0.61397e72QGaXV0#section-4w6-dyy-otg",target:"_blank",rel:"noopener noreferrer"},I=e("strong",null,"部分可用区",-1),w=e("strong",null,"部分规格",-1),A=m('

    aliyun-ecs-specs

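    For the ECS storage configuration, the system disk can use any storage type; do not select a data disk or shared disk for now. A separate ESSD cloud disk will be created later as the shared disk: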

    aliyun-ecs-system-disk

    As shown in the figure, create two ECS instances in the same availability zone:

    aliyun-ecs-instance

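    Prepare the ESSD cloud disk

    In the Alibaba Cloud ECS console, choose Cloud Disks (云盘) under Storage and Snapshots (存储与快照) and click Create Cloud Disk (创建云盘). In the same availability zone as the ECS instances created earlier, create an ESSD cloud disk and select the multi-instance attach option (多实例挂载). If your ECS instances do not meet the requirements for multi-instance attach, this checkbox does not appear.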

    aliyun-essd-specs

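    After the ESSD cloud disk is created, the console shows that the disk supports multi-attach and that it is waiting to be attached (待挂载):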

    aliyun-essd-ready-to-mount

    Next, attach this cloud disk to each of the two ECS instances:

    aliyun-essd-mounting

    After attaching, view the cloud disk; it lists the two ECS instances it is attached to:

    aliyun-essd-mounted

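    Check the cloud disk

    Connect to each of the two ECS instances over ssh and run lsblk. The output shows:

    • nvme0n1 is the 40 GB ECS system disk, private to each ECS instance
    • nvme1n1 is the 100 GB ESSD cloud disk, visible to both ECS instances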
    $ lsblk
    +NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    +nvme0n1     259:0    0   40G  0 disk
    +└─nvme0n1p1 259:1    0   40G  0 part /etc/hosts
    +nvme1n1     259:2    0  100G  0 disk
    +

    Prepare the distributed file system

    `,20);function R(o,T){const r=a("Badge"),l=a("ArticleInfo"),t=a("ExternalLinkIcon"),c=a("RouterLink");return d(),p("div",null,[e("h1",k,[v,s(" 阿里云 ECS + ESSD 云盘存储 "),e("a",D,[n(r,{type:"tip",text:"视频",vertical:"top"})])]),n(l,{frontmatter:o.$frontmatter},null,8,["frontmatter"]),e("p",null,[e("a",P,[s("阿里云 ESSD(Enhanced SSD)云盘"),n(t)]),s(" 结合 25 GE 网络和 RDMA 技术,能够提供单盘高达 100 万的随机读写能力和单路低时延性能。阿里云 ESSD 云盘支持 NVMe 协议,且可以同时挂载到多台支持 NVMe 协议的 ECS(Elastic Compute Service)实例上,从而实现多个 ECS 实例并发读写访问,具备高可靠、高并发、高性能等特点。更新信息请参考阿里云 ECS 文档:")]),e("ul",null,[e("li",null,[e("a",B,[s("支持 NVMe 协议的云盘概述"),n(t)])]),e("li",null,[e("a",x,[s("开启多重挂载功能"),n(t)])])]),L,N,M,Q,e("p",null,[s("首先需要准备两台或以上的 "),e("a",V,[s("阿里云 ECS"),n(t)]),s("。目前,ECS 对支持 ESSD 多重挂载的规格有较多限制,详情请参考 "),e("a",G,[s("使用限制"),n(t)]),s("。仅 "),I,s("、"),w,s("(ecs.g7se、ecs.c7se、ecs.r7se)的 ECS 实例可以支持 ESSD 的多重挂载。如图,请务必选择支持多重挂载的 ECS 规格:")]),A,e("p",null,[s("接下来,将在两台 ECS 上分别部署 PolarDB 的主节点和只读节点。作为前提,需要在 ECS 共享的 ESSD 块设备上 "),n(c,{to:"/deploying/fs-pfs.html"},{default:u(()=>[s("格式化并挂载 PFS")]),_:1}),s("。")])])}const O=i(C,[["render",R],["__file","storage-aliyun-essd.html.vue"]]);export{O as default}; diff --git a/assets/storage-ceph.html-27d081d7.js b/assets/storage-ceph.html-27d081d7.js new file mode 100644 index 00000000000..5e28465a136 --- /dev/null +++ b/assets/storage-ceph.html-27d081d7.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-65d024d1","path":"/zh/deploying/storage-ceph.html","title":"Ceph 共享存储","lang":"zh-CN","frontmatter":{},"headers":[{"level":2,"title":"环境准备","slug":"环境准备","link":"#环境准备","children":[{"level":3,"title":"安装 docker","slug":"安装-docker","link":"#安装-docker","children":[]},{"level":3,"title":"配置 ssh 免密登录","slug":"配置-ssh-免密登录","link":"#配置-ssh-免密登录","children":[]},{"level":3,"title":"下载 ceph daemon","slug":"下载-ceph-daemon","link":"#下载-ceph-daemon","children":[]}]},{"level":2,"title":"mon 部署","slug":"mon-部署","link":"#mon-部署","children":[{"level":3,"title":"ceph001 上 mon 
进程启动","slug":"ceph001-上-mon-进程启动","link":"#ceph001-上-mon-进程启动","children":[]},{"level":3,"title":"查看容器状态","slug":"查看容器状态","link":"#查看容器状态","children":[]},{"level":3,"title":"生成必须的 keyring","slug":"生成必须的-keyring","link":"#生成必须的-keyring","children":[]},{"level":3,"title":"配置文件同步","slug":"配置文件同步","link":"#配置文件同步","children":[]},{"level":3,"title":"在 ceph002 与 ceph003 中启动 mon","slug":"在-ceph002-与-ceph003-中启动-mon","link":"#在-ceph002-与-ceph003-中启动-mon","children":[]},{"level":3,"title":"查看当前集群状态","slug":"查看当前集群状态","link":"#查看当前集群状态","children":[]}]},{"level":2,"title":"osd 部署","slug":"osd-部署","link":"#osd-部署","children":[{"level":3,"title":"osd 准备阶段","slug":"osd-准备阶段","link":"#osd-准备阶段","children":[]},{"level":3,"title":"osd 激活阶段","slug":"osd-激活阶段","link":"#osd-激活阶段","children":[]},{"level":3,"title":"查看集群状态","slug":"查看集群状态","link":"#查看集群状态","children":[]}]},{"level":2,"title":"mgr、mds、rgw 部署","slug":"mgr、mds、rgw-部署","link":"#mgr、mds、rgw-部署","children":[]},{"level":2,"title":"rbd 块设备创建","slug":"rbd-块设备创建","link":"#rbd-块设备创建","children":[{"level":3,"title":"存储池的创建","slug":"存储池的创建","link":"#存储池的创建","children":[]},{"level":3,"title":"创建镜像文件并查看信息","slug":"创建镜像文件并查看信息","link":"#创建镜像文件并查看信息","children":[]},{"level":3,"title":"映射镜像文件","slug":"映射镜像文件","link":"#映射镜像文件","children":[]},{"level":3,"title":"查看块设备","slug":"查看块设备","link":"#查看块设备","children":[]}]},{"level":2,"title":"准备分布式文件系统","slug":"准备分布式文件系统","link":"#准备分布式文件系统","children":[]}],"git":{"updatedTime":1675585309000},"filePathRelative":"zh/deploying/storage-ceph.md"}');export{l as data}; diff --git a/assets/storage-ceph.html-4c626f1d.js b/assets/storage-ceph.html-4c626f1d.js new file mode 100644 index 00000000000..a479e006eb1 --- /dev/null +++ b/assets/storage-ceph.html-4c626f1d.js @@ -0,0 +1,216 @@ +import{_ as s,r as e,o as i,c as l,a as r,b as a,d as c,w as p,e as d}from"./app-3d1677bf.js";const t={},o=d(`

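    Ceph Shared Storage

    Ceph is a unified distributed storage system that is widely used in the storage field because of its good performance, reliability, and scalability. Deploying Ceph requires at least two physical or virtual machines to provide shared storage and data backup. This tutorial uses three virtual machines as an example to show how to build an instance on Ceph shared storage. The overall procedure is as follows:

    1. Obtain three virtual machines in the same network segment and configure passwordless SSH login among them, which is used to synchronize Ceph keys and configuration;
    2. Start the mon process on the primary node, check its status, copy the configuration files to the other nodes, and start mon on them;
    3. Start the osd process on all three machines to set up the storage disks, and start the mgr and rgw processes on the primary node;
    4. Create a storage pool and an rbd block device image, then map the image on every node to share the block device;
    5. Format the block device with PolarFS and deploy PolarDB.

    WARNING

    The operating system must be CentOS 7.5 or later. The following steps were tested on CentOS 7.5.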

    Environment Preparation

    The following virtual machines are used:

    IP                  hostname
    +192.168.1.173       ceph001
    +192.168.1.174       ceph002
    +192.168.1.175       ceph003
    +

    Install docker

    TIP

    This tutorial uses the docker packages provided by the Alibaba Cloud mirror site.

    Install the docker dependencies

    yum install -y yum-utils device-mapper-persistent-data lvm2
    +

    Install and start docker

    yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
    +yum makecache
    +yum install -y docker-ce
    +
    +systemctl start docker
    +systemctl enable docker
    +

    Verify the installation

    docker run hello-world
    +

    Configure Passwordless ssh Login

    Generate and copy the key

    ssh-keygen
    +ssh-copy-id -i /root/.ssh/id_rsa.pub    root@ceph001
    +ssh-copy-id -i /root/.ssh/id_rsa.pub    root@ceph002
    +ssh-copy-id -i /root/.ssh/id_rsa.pub    root@ceph003
    +

    Verify the configuration

    ssh root@ceph003
    +

    Download ceph daemon

    docker pull ceph/daemon
    +

    mon Deployment

    Start the mon Process on ceph001

    docker run -d \\
    +    --net=host \\
    +    --privileged=true \\
    +    -v /etc/ceph:/etc/ceph \\
    +    -v /var/lib/ceph/:/var/lib/ceph/ \\
    +    -e MON_IP=192.168.1.173 \\
    +    -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \\
    +    --security-opt seccomp=unconfined \\
    +    --name=mon01 \\
    +    ceph/daemon mon
    +

    WARNING

    Adjust the IP address and the subnet prefix length to match your actual network.
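
    As a small illustration of the note above, the sketch below derives a CEPH_PUBLIC_NETWORK value from a node IP and a prefix length. The function name cidr_network is hypothetical and is not part of Ceph or the ceph/daemon image; it assumes an IPv4 dotted-quad address.

```shell
# Hypothetical helper (not part of Ceph): print the IPv4 network in CIDR
# notation for a given address and prefix length.
cidr_network() (
    ip=$1
    prefix=$2
    IFS=.
    set -- $ip    # split the dotted quad into $1..$4
    ip_int=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    mask=$(( (4294967295 << (32 - prefix)) & 4294967295 ))
    net=$(( ip_int & mask ))
    printf '%d.%d.%d.%d/%d\n' $(( (net >> 24) & 255 )) $(( (net >> 16) & 255 )) $(( (net >> 8) & 255 )) $(( net & 255 )) "$prefix"
)

cidr_network 192.168.1.173 24
```

    For the address of ceph001 and a 24-bit prefix this prints 192.168.1.0/24, the value passed to CEPH_PUBLIC_NETWORK in this tutorial.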

    Check the Container Status

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     937ccded-3483-4245-9f61-e6ef0dbd85ca
    +    health: HEALTH_OK
    +
    +services:
    +    mon: 1 daemons, quorum ceph001 (age 26m)
    +    mgr: no daemons active
    +    osd: 0 osds: 0 up, 0 in
    +
    +data:
    +    pools:   0 pools, 0 pgs
    +    objects: 0 objects, 0 B
    +    usage:   0 B used, 0 B / 0 B avail
    +    pgs:
    +

    WARNING

    If the message mon is allowing insecure global_id reclaim is reported, resolve it with the following command.

    docker exec mon01 ceph config set mon auth_allow_insecure_global_id_reclaim false
    +

    Generate the Required Keyrings

    docker exec mon01 ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring
    +docker exec mon01 ceph auth get client.bootstrap-rgw -o /var/lib/ceph/bootstrap-rgw/ceph.keyring
    +

    Synchronize the Configuration Files

    ssh root@ceph002 mkdir -p /var/lib/ceph
    +scp -r /etc/ceph root@ceph002:/etc
    +scp -r /var/lib/ceph/bootstrap* root@ceph002:/var/lib/ceph
    +ssh root@ceph003 mkdir -p /var/lib/ceph
    +scp -r /etc/ceph root@ceph003:/etc
    +scp -r /var/lib/ceph/bootstrap* root@ceph003:/var/lib/ceph
    +
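
    The same three commands are repeated for every additional node, so they can be generated in a loop. The sketch below only prints the commands (a dry run); the function name print_sync_cmds is an illustrative assumption, while the paths and host names are the ones used above.

```shell
# Print (do not run) the synchronization commands for each target host.
print_sync_cmds() {
    for host in "$@"; do
        echo "ssh root@${host} mkdir -p /var/lib/ceph"
        echo "scp -r /etc/ceph root@${host}:/etc"
        echo "scp -r /var/lib/ceph/bootstrap* root@${host}:/var/lib/ceph"
    done
}

print_sync_cmds ceph002 ceph003
```

    After reviewing the printed commands, piping the output to sh on ceph001 would execute them.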

    Start mon on ceph002 and ceph003

    docker run -d \\
    +    --net=host \\
    +    --privileged=true \\
    +    -v /etc/ceph:/etc/ceph \\
    +    -v /var/lib/ceph/:/var/lib/ceph/ \\
    +    -e MON_IP=192.168.1.174 \\
    +    -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \\
    +    --security-opt seccomp=unconfined \\
    +    --name=mon02 \\
    +    ceph/daemon mon
    +
    +docker run -d \\
    +    --net=host \\
    +    --privileged=true \\
    +    -v /etc/ceph:/etc/ceph \\
    +    -v /var/lib/ceph/:/var/lib/ceph/ \\
    +    -e MON_IP=192.168.1.175 \\
    +    -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \\
    +    --security-opt seccomp=unconfined \\
    +    --name=mon03 \\
    +    ceph/daemon mon
    +

    Check the Current Cluster Status

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     937ccded-3483-4245-9f61-e6ef0dbd85ca
    +    health: HEALTH_OK
    +
    +services:
    +    mon: 3 daemons, quorum ceph001,ceph002,ceph003 (age 35s)
    +    mgr: no daemons active
    +    osd: 0 osds: 0 up, 0 in
    +
    +data:
    +    pools:   0 pools, 0 pgs
    +    objects: 0 objects, 0 B
    +    usage:   0 B used, 0 B / 0 B avail
    +    pgs:
    +

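    WARNING

    Check the mon line of the status output to confirm that the mon daemons created on the other two nodes have joined the cluster.

    osd Deployment

    osd Prepare Phase

    TIP

    Each virtual machine in this environment has only one available disk, /dev/vdb, so only one osd is created per virtual machine.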

    docker run --rm --privileged=true --net=host --ipc=host \\
    +    --security-opt seccomp=unconfined \\
    +    -v /run/lock/lvm:/run/lock/lvm:z \\
    +    -v /var/run/udev/:/var/run/udev/:z \\
    +    -v /dev:/dev -v /etc/ceph:/etc/ceph:z \\
    +    -v /run/lvm/:/run/lvm/ \\
    +    -v /var/lib/ceph/:/var/lib/ceph/:z \\
    +    -v /var/log/ceph/:/var/log/ceph/:z \\
    +    --entrypoint=ceph-volume \\
    +    docker.io/ceph/daemon \\
    +    --cluster ceph lvm prepare --bluestore --data /dev/vdb
    +

    WARNING

    The command above is the same on all three nodes; only the disk name needs to be adjusted.

    osd Activate Phase

    docker run -d --privileged=true --net=host --pid=host --ipc=host \\
    +    --security-opt seccomp=unconfined \\
    +    -v /dev:/dev \\
    +    -v /etc/localtime:/etc/localtime:ro \\
    +    -v /var/lib/ceph:/var/lib/ceph:z \\
    +    -v /etc/ceph:/etc/ceph:z \\
    +    -v /var/run/ceph:/var/run/ceph:z \\
    +    -v /var/run/udev/:/var/run/udev/ \\
    +    -v /var/log/ceph:/var/log/ceph:z \\
    +    -v /run/lvm/:/run/lvm/ \\
    +    -e CLUSTER=ceph \\
    +    -e CEPH_DAEMON=OSD_CEPH_VOLUME_ACTIVATE \\
    +    -e CONTAINER_IMAGE=docker.io/ceph/daemon \\
    +    -e OSD_ID=0 \\
    +    --name=ceph-osd-0 \\
    +    docker.io/ceph/daemon
    +

    WARNING

    Change OSD_ID and the name attribute on each node. OSD_ID values increase from 0, so the other two nodes use OSD_ID=1 and OSD_ID=2.
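
    To make the per-node difference explicit, the sketch below prints the two options that change between nodes. The helper osd_flags is hypothetical, and it assumes the osd ids were assigned as 0, 1, 2 in the order the nodes were prepared.

```shell
# Hypothetical helper: print the per-node options of the activation command.
osd_flags() {
    echo "-e OSD_ID=$1 --name=ceph-osd-$1"
}

for id in 0 1 2; do
    osd_flags "$id"
done
```

    All other options of the activation command stay identical across the three nodes.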

    Check the Cluster Status

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     e430d054-dda8-43f1-9cda-c0881b782e17
    +    health: HEALTH_WARN
    +            no active mgr
    +
    +services:
    +    mon: 3 daemons, quorum ceph001,ceph002,ceph003 (age 44m)
    +    mgr: no daemons active
    +    osd: 3 osds: 3 up (since 7m), 3 in (since 13m)
    +
    +data:
    +    pools:   0 pools, 0 pgs
    +    objects: 0 objects, 0 B
    +    usage:   0 B used, 0 B / 0 B avail
    +    pgs:
    +

    mgr, mds, and rgw Deployment

    Run all of the following commands on ceph001:

    docker run -d --net=host \\
    +    --privileged=true \\
    +    --security-opt seccomp=unconfined \\
    +    -v /etc/ceph:/etc/ceph \\
    +    -v /var/lib/ceph/:/var/lib/ceph/ \\
    +    --name=ceph-mgr-0 \\
    +    ceph/daemon mgr
    +
    +docker run -d --net=host \\
    +    --privileged=true \\
    +    --security-opt seccomp=unconfined \\
    +    -v /var/lib/ceph/:/var/lib/ceph/ \\
    +    -v /etc/ceph:/etc/ceph \\
    +    -e CEPHFS_CREATE=1 \\
    +    --name=ceph-mds-0 \\
    +    ceph/daemon mds
    +
    +docker run -d --net=host \\
    +    --privileged=true \\
    +    --security-opt seccomp=unconfined \\
    +    -v /var/lib/ceph/:/var/lib/ceph/ \\
    +    -v /etc/ceph:/etc/ceph \\
    +    --name=ceph-rgw-0 \\
    +    ceph/daemon rgw
    +

    Check the cluster status:

    docker exec mon01 ceph -s
    +cluster:
    +    id:     e430d054-dda8-43f1-9cda-c0881b782e17
    +    health: HEALTH_OK
    +
    +services:
    +    mon: 3 daemons, quorum ceph001,ceph002,ceph003 (age 92m)
    +    mgr: ceph001(active, since 25m)
    +    mds: 1/1 daemons up
    +    osd: 3 osds: 3 up (since 54m), 3 in (since 60m)
    +    rgw: 1 daemon active (1 hosts, 1 zones)
    +
    +data:
    +    volumes: 1/1 healthy
    +    pools:   7 pools, 145 pgs
    +    objects: 243 objects, 7.2 KiB
    +    usage:   50 MiB used, 2.9 TiB / 2.9 TiB avail
    +    pgs:     145 active+clean
    +

    rbd Block Device Creation

    TIP

    Run all of the following commands inside the mon01 container.

    Create the Storage Pool

    docker exec -it mon01 bash
    +ceph osd pool create rbd_polar
    +

    Create the Image and View Its Information

    rbd create --size 512000 rbd_polar/image02
    +rbd info rbd_polar/image02
    +
    +rbd image 'image02':
    +size 500 GiB in 128000 objects
    +order 22 (4 MiB objects)
    +snapshot_count: 0
    +id: 13b97b252c5d
    +block_name_prefix: rbd_data.13b97b252c5d
    +format: 2
    +features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    +op_features:
    +flags:
    +create_timestamp: Thu Oct 28 06:18:07 2021
    +access_timestamp: Thu Oct 28 06:18:07 2021
    +modify_timestamp: Thu Oct 28 06:18:07 2021
    +
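
    The size line in the output follows from simple arithmetic: the value given to --size is interpreted in MiB when no unit suffix is present, and order 22 means objects of 2^22 bytes. The sketch below reproduces the numbers; rbd_layout is a hypothetical helper and assumes an order of at least 20.

```shell
# Reproduce the size line of `rbd info` from --size (in MiB) and the object order.
rbd_layout() {
    size_mib=$1
    order=$2
    object_mib=$(( (1 << order) / 1048576 ))   # order 22 -> 4 MiB objects
    echo "$(( size_mib / 1024 )) GiB in $(( size_mib / object_mib )) objects"
}

rbd_layout 512000 22
```

    This prints 500 GiB in 128000 objects, matching the rbd info output for image02.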

    Map the Image

    modprobe rbd # load the kernel module; run this on the host
    +rbd map rbd_polar/image02
    +
    +rbd: sysfs write failed
    +RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable rbd_polar/image02 object-map fast-diff deep-flatten".
    +In some cases useful info is found in syslog -  try "dmesg | tail".
    +rbd: map failed: (6) No such device or address
    +

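    WARNING

    The kernel does not support some of the image features, so they must be disabled before the mapping can succeed. Disable the unsupported rbd features, map the image again, and list the mapped devices as follows.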

    rbd feature disable rbd_polar/image02 object-map fast-diff deep-flatten
    +rbd map rbd_polar/image02
    +rbd device list
    +
    +id  pool       namespace  image    snap  device
    +0   rbd_polar             image01  -     /dev/rbd0
    +1   rbd_polar             image02  -     /dev/rbd1
    +

    TIP

    An image named image01 had already been mapped in this environment, so two entries are listed.
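
    When several images are mapped, the device path of a particular image can be picked out of the listing with a short filter. This is a sketch under the assumption that the listing keeps the column layout shown above; rbd_device_for is a hypothetical name, and the sample input is modeled on that listing.

```shell
# Read `rbd device list` output on stdin and print the device of one image.
rbd_device_for() {
    awk -v pool="$1" -v image="$2" 'NR > 1 && $2 == pool && $(NF - 2) == image { print $NF }'
}

# Sample input modeled on the listing in this section.
printf '%s\n' \
    'id  pool       namespace  image    snap  device' \
    '0   rbd_polar             image01  -     /dev/rbd0' \
    '1   rbd_polar             image02  -     /dev/rbd1' |
    rbd_device_for rbd_polar image02
```

    Counting columns from the right keeps the filter working whether or not the namespace column is empty.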

    View the Block Devices

    Return to the host, outside the container, and list the block devices in the system:

    lsblk
    +
    +NAME                                                               MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
    +vda                                                                253:0    0  500G  0 disk
    +└─vda1                                                             253:1    0  500G  0 part /
    +vdb                                                                253:16   0 1000G  0 disk
    +└─ceph--7eefe77f--c618--4477--a1ed--b4f44520dfc2-osd--block--bced3ff1--42b9--43e1--8f63--e853bce41435
    +                                                                    252:0    0 1000G  0 lvm
    +rbd0                                                               251:0    0  100G  0 disk
    +rbd1                                                               251:16   0  500G  0 disk
    +

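    WARNING

    A block device image appears in the local lsblk output only after it has been mapped on that node. Run the same mapping commands on ceph002 and ceph003.


    Prepare the Distributed File System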

    `,71);function v(u,m){const n=e("RouterLink");return i(),l("div",null,[o,r("p",null,[a("参阅 "),c(n,{to:"/zh/deploying/fs-pfs.html"},{default:p(()=>[a("格式化并挂载 PFS")]),_:1}),a("。")])])}const h=s(t,[["render",v],["__file","storage-ceph.html.vue"]]);export{h as default}; diff --git a/assets/storage-ceph.html-9327a336.js b/assets/storage-ceph.html-9327a336.js new file mode 100644 index 00000000000..5da71acf2dc --- /dev/null +++ b/assets/storage-ceph.html-9327a336.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-e3a62740","path":"/deploying/storage-ceph.html","title":"Ceph 共享存储","lang":"en-US","frontmatter":{},"headers":[{"level":2,"title":"环境准备","slug":"环境准备","link":"#环境准备","children":[{"level":3,"title":"安装 docker","slug":"安装-docker","link":"#安装-docker","children":[]},{"level":3,"title":"配置 ssh 免密登录","slug":"配置-ssh-免密登录","link":"#配置-ssh-免密登录","children":[]},{"level":3,"title":"下载 ceph daemon","slug":"下载-ceph-daemon","link":"#下载-ceph-daemon","children":[]}]},{"level":2,"title":"mon 部署","slug":"mon-部署","link":"#mon-部署","children":[{"level":3,"title":"ceph001 上 mon 进程启动","slug":"ceph001-上-mon-进程启动","link":"#ceph001-上-mon-进程启动","children":[]},{"level":3,"title":"查看容器状态","slug":"查看容器状态","link":"#查看容器状态","children":[]},{"level":3,"title":"生成必须的 keyring","slug":"生成必须的-keyring","link":"#生成必须的-keyring","children":[]},{"level":3,"title":"配置文件同步","slug":"配置文件同步","link":"#配置文件同步","children":[]},{"level":3,"title":"在 ceph002 与 ceph003 中启动 mon","slug":"在-ceph002-与-ceph003-中启动-mon","link":"#在-ceph002-与-ceph003-中启动-mon","children":[]},{"level":3,"title":"查看当前集群状态","slug":"查看当前集群状态","link":"#查看当前集群状态","children":[]}]},{"level":2,"title":"osd 部署","slug":"osd-部署","link":"#osd-部署","children":[{"level":3,"title":"osd 准备阶段","slug":"osd-准备阶段","link":"#osd-准备阶段","children":[]},{"level":3,"title":"osd 激活阶段","slug":"osd-激活阶段","link":"#osd-激活阶段","children":[]},{"level":3,"title":"查看集群状态","slug":"查看集群状态","link":"#查看集群状态","children":[]}]},{"level":2,"title":"mgr、mds、rgw 
部署","slug":"mgr、mds、rgw-部署","link":"#mgr、mds、rgw-部署","children":[]},{"level":2,"title":"rbd 块设备创建","slug":"rbd-块设备创建","link":"#rbd-块设备创建","children":[{"level":3,"title":"存储池的创建","slug":"存储池的创建","link":"#存储池的创建","children":[]},{"level":3,"title":"创建镜像文件并查看信息","slug":"创建镜像文件并查看信息","link":"#创建镜像文件并查看信息","children":[]},{"level":3,"title":"映射镜像文件","slug":"映射镜像文件","link":"#映射镜像文件","children":[]},{"level":3,"title":"查看块设备","slug":"查看块设备","link":"#查看块设备","children":[]}]},{"level":2,"title":"准备分布式文件系统","slug":"准备分布式文件系统","link":"#准备分布式文件系统","children":[]}],"git":{"updatedTime":1656919280000},"filePathRelative":"deploying/storage-ceph.md"}');export{l as data}; diff --git a/assets/storage-ceph.html-fd9bfda4.js b/assets/storage-ceph.html-fd9bfda4.js new file mode 100644 index 00000000000..4df104fdd57 --- /dev/null +++ b/assets/storage-ceph.html-fd9bfda4.js @@ -0,0 +1,216 @@ +import{_ as s,r as e,o as i,c as l,a as r,b as a,d as c,w as p,e as d}from"./app-3d1677bf.js";const t={},o=d(`

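    Ceph Shared Storage

    Ceph is a unified distributed storage system that is widely used in the storage field because of its good performance, reliability, and scalability. Deploying Ceph requires at least two physical or virtual machines to provide shared storage and data backup. This tutorial uses three virtual machines as an example to show how to build an instance on Ceph shared storage. The overall procedure is as follows:

    1. Obtain three virtual machines in the same network segment and configure passwordless SSH login among them, which is used to synchronize Ceph keys and configuration;
    2. Start the mon process on the primary node, check its status, copy the configuration files to the other nodes, and start mon on them;
    3. Start the osd process on all three machines to set up the storage disks, and start the mgr and rgw processes on the primary node;
    4. Create a storage pool and an rbd block device image, then map the image on every node to share the block device;
    5. Format the block device with PolarFS and deploy PolarDB.

    WARNING

    The operating system must be CentOS 7.5 or later. The following steps were tested on CentOS 7.5.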

    Environment Preparation

    The following virtual machines are used:

    IP                  hostname
    +192.168.1.173       ceph001
    +192.168.1.174       ceph002
    +192.168.1.175       ceph003
    +

    Install docker

    TIP

    This tutorial uses the docker packages provided by the Alibaba Cloud mirror site.

    Install the docker dependencies

    yum install -y yum-utils device-mapper-persistent-data lvm2
    +

    Install and start docker

    yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
    +yum makecache
    +yum install -y docker-ce
    +
    +systemctl start docker
    +systemctl enable docker
    +

    Verify the installation

    docker run hello-world
    +

    Configure Passwordless ssh Login

    Generate and copy the key

    ssh-keygen
    +ssh-copy-id -i /root/.ssh/id_rsa.pub    root@ceph001
    +ssh-copy-id -i /root/.ssh/id_rsa.pub    root@ceph002
    +ssh-copy-id -i /root/.ssh/id_rsa.pub    root@ceph003
    +

    Verify the configuration

    ssh root@ceph003
    +

    Download ceph daemon

    docker pull ceph/daemon
    +

    mon Deployment

    Start the mon Process on ceph001

    docker run -d \\
    +    --net=host \\
    +    --privileged=true \\
    +    -v /etc/ceph:/etc/ceph \\
    +    -v /var/lib/ceph/:/var/lib/ceph/ \\
    +    -e MON_IP=192.168.1.173 \\
    +    -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \\
    +    --security-opt seccomp=unconfined \\
    +    --name=mon01 \\
    +    ceph/daemon mon
    +

    WARNING

    Adjust the IP address and the subnet prefix length to match your actual network.

    Check the Container Status

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     937ccded-3483-4245-9f61-e6ef0dbd85ca
    +    health: HEALTH_OK
    +
    +services:
    +    mon: 1 daemons, quorum ceph001 (age 26m)
    +    mgr: no daemons active
    +    osd: 0 osds: 0 up, 0 in
    +
    +data:
    +    pools:   0 pools, 0 pgs
    +    objects: 0 objects, 0 B
    +    usage:   0 B used, 0 B / 0 B avail
    +    pgs:
    +

    WARNING

    If the message mon is allowing insecure global_id reclaim is reported, resolve it with the following command.

    docker exec mon01 ceph config set mon auth_allow_insecure_global_id_reclaim false
    +

    Generate the Required Keyrings

    docker exec mon01 ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring
    +docker exec mon01 ceph auth get client.bootstrap-rgw -o /var/lib/ceph/bootstrap-rgw/ceph.keyring
    +

    Synchronize the Configuration Files

    ssh root@ceph002 mkdir -p /var/lib/ceph
    +scp -r /etc/ceph root@ceph002:/etc
    +scp -r /var/lib/ceph/bootstrap* root@ceph002:/var/lib/ceph
    +ssh root@ceph003 mkdir -p /var/lib/ceph
    +scp -r /etc/ceph root@ceph003:/etc
    +scp -r /var/lib/ceph/bootstrap* root@ceph003:/var/lib/ceph
    +

    Start mon on ceph002 and ceph003

    docker run -d \\
    +    --net=host \\
    +    --privileged=true \\
    +    -v /etc/ceph:/etc/ceph \\
    +    -v /var/lib/ceph/:/var/lib/ceph/ \\
    +    -e MON_IP=192.168.1.174 \\
    +    -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \\
    +    --security-opt seccomp=unconfined \\
    +    --name=mon02 \\
    +    ceph/daemon mon
    +
    +docker run -d \\
    +    --net=host \\
    +    --privileged=true \\
    +    -v /etc/ceph:/etc/ceph \\
    +    -v /var/lib/ceph/:/var/lib/ceph/ \\
    +    -e MON_IP=192.168.1.175 \\
    +    -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \\
    +    --security-opt seccomp=unconfined \\
    +    --name=mon03 \\
    +    ceph/daemon mon
    +

    Check the Current Cluster Status

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     937ccded-3483-4245-9f61-e6ef0dbd85ca
    +    health: HEALTH_OK
    +
    +services:
    +    mon: 3 daemons, quorum ceph001,ceph002,ceph003 (age 35s)
    +    mgr: no daemons active
    +    osd: 0 osds: 0 up, 0 in
    +
    +data:
    +    pools:   0 pools, 0 pgs
    +    objects: 0 objects, 0 B
    +    usage:   0 B used, 0 B / 0 B avail
    +    pgs:
    +

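    WARNING

    Check the mon line of the status output to confirm that the mon daemons created on the other two nodes have joined the cluster.

    osd Deployment

    osd Prepare Phase

    TIP

    Each virtual machine in this environment has only one available disk, /dev/vdb, so only one osd is created per virtual machine.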

    docker run --rm --privileged=true --net=host --ipc=host \\
    +    --security-opt seccomp=unconfined \\
    +    -v /run/lock/lvm:/run/lock/lvm:z \\
    +    -v /var/run/udev/:/var/run/udev/:z \\
    +    -v /dev:/dev -v /etc/ceph:/etc/ceph:z \\
    +    -v /run/lvm/:/run/lvm/ \\
    +    -v /var/lib/ceph/:/var/lib/ceph/:z \\
    +    -v /var/log/ceph/:/var/log/ceph/:z \\
    +    --entrypoint=ceph-volume \\
    +    docker.io/ceph/daemon \\
    +    --cluster ceph lvm prepare --bluestore --data /dev/vdb
    +

    WARNING

    The command above is the same on all three nodes; only the disk name needs to be adjusted.

    osd Activate Phase

    docker run -d --privileged=true --net=host --pid=host --ipc=host \\
    +    --security-opt seccomp=unconfined \\
    +    -v /dev:/dev \\
    +    -v /etc/localtime:/etc/localtime:ro \\
    +    -v /var/lib/ceph:/var/lib/ceph:z \\
    +    -v /etc/ceph:/etc/ceph:z \\
    +    -v /var/run/ceph:/var/run/ceph:z \\
    +    -v /var/run/udev/:/var/run/udev/ \\
    +    -v /var/log/ceph:/var/log/ceph:z \\
    +    -v /run/lvm/:/run/lvm/ \\
    +    -e CLUSTER=ceph \\
    +    -e CEPH_DAEMON=OSD_CEPH_VOLUME_ACTIVATE \\
    +    -e CONTAINER_IMAGE=docker.io/ceph/daemon \\
    +    -e OSD_ID=0 \\
    +    --name=ceph-osd-0 \\
    +    docker.io/ceph/daemon
    +

    WARNING

    Change OSD_ID and the name attribute on each node. OSD_ID values increase from 0, so the other two nodes use OSD_ID=1 and OSD_ID=2.

    Check the Cluster Status

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     e430d054-dda8-43f1-9cda-c0881b782e17
    +    health: HEALTH_WARN
    +            no active mgr
    +
    +services:
    +    mon: 3 daemons, quorum ceph001,ceph002,ceph003 (age 44m)
    +    mgr: no daemons active
    +    osd: 3 osds: 3 up (since 7m), 3 in (since 13m)
    +
    +data:
    +    pools:   0 pools, 0 pgs
    +    objects: 0 objects, 0 B
    +    usage:   0 B used, 0 B / 0 B avail
    +    pgs:
    +

    mgr, mds, and rgw Deployment

    Run all of the following commands on ceph001:

    docker run -d --net=host \\
    +    --privileged=true \\
    +    --security-opt seccomp=unconfined \\
    +    -v /etc/ceph:/etc/ceph \\
    +    -v /var/lib/ceph/:/var/lib/ceph/ \\
    +    --name=ceph-mgr-0 \\
    +    ceph/daemon mgr
    +
    +docker run -d --net=host \\
    +    --privileged=true \\
    +    --security-opt seccomp=unconfined \\
    +    -v /var/lib/ceph/:/var/lib/ceph/ \\
    +    -v /etc/ceph:/etc/ceph \\
    +    -e CEPHFS_CREATE=1 \\
    +    --name=ceph-mds-0 \\
    +    ceph/daemon mds
    +
    +docker run -d --net=host \\
    +    --privileged=true \\
    +    --security-opt seccomp=unconfined \\
    +    -v /var/lib/ceph/:/var/lib/ceph/ \\
    +    -v /etc/ceph:/etc/ceph \\
    +    --name=ceph-rgw-0 \\
    +    ceph/daemon rgw
    +

    Check the cluster status:

    docker exec mon01 ceph -s
    +cluster:
    +    id:     e430d054-dda8-43f1-9cda-c0881b782e17
    +    health: HEALTH_OK
    +
    +services:
    +    mon: 3 daemons, quorum ceph001,ceph002,ceph003 (age 92m)
    +    mgr: ceph001(active, since 25m)
    +    mds: 1/1 daemons up
    +    osd: 3 osds: 3 up (since 54m), 3 in (since 60m)
    +    rgw: 1 daemon active (1 hosts, 1 zones)
    +
    +data:
    +    volumes: 1/1 healthy
    +    pools:   7 pools, 145 pgs
    +    objects: 243 objects, 7.2 KiB
    +    usage:   50 MiB used, 2.9 TiB / 2.9 TiB avail
    +    pgs:     145 active+clean
    +

    rbd Block Device Creation

    TIP

    Run all of the following commands inside the mon01 container.

    Create the Storage Pool

    docker exec -it mon01 bash
    +ceph osd pool create rbd_polar
    +

    Create the Image and View Its Information

    rbd create --size 512000 rbd_polar/image02
    +rbd info rbd_polar/image02
    +
    +rbd image 'image02':
    +size 500 GiB in 128000 objects
    +order 22 (4 MiB objects)
    +snapshot_count: 0
    +id: 13b97b252c5d
    +block_name_prefix: rbd_data.13b97b252c5d
    +format: 2
    +features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    +op_features:
    +flags:
    +create_timestamp: Thu Oct 28 06:18:07 2021
    +access_timestamp: Thu Oct 28 06:18:07 2021
    +modify_timestamp: Thu Oct 28 06:18:07 2021
    +

    Map the Image

    modprobe rbd # load the kernel module; run this on the host
    +rbd map rbd_polar/image02
    +
    +rbd: sysfs write failed
    +RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable rbd_polar/image02 object-map fast-diff deep-flatten".
    +In some cases useful info is found in syslog -  try "dmesg | tail".
    +rbd: map failed: (6) No such device or address
    +

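    WARNING

    The kernel does not support some of the image features, so they must be disabled before the mapping can succeed. Disable the unsupported rbd features, map the image again, and list the mapped devices as follows.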

    rbd feature disable rbd_polar/image02 object-map fast-diff deep-flatten
    +rbd map rbd_polar/image02
    +rbd device list
    +
    +id  pool       namespace  image    snap  device
    +0   rbd_polar             image01  -     /dev/rbd0
    +1   rbd_polar             image02  -     /dev/rbd1
    +

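    TIP

    An image named image01 had already been mapped in this environment, so two entries are listed.

    View the Block Devices

    Return to the host, outside the container, and list the block devices in the system: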

    lsblk
    +
    +NAME                                                               MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
    +vda                                                                253:0    0  500G  0 disk
    +└─vda1                                                             253:1    0  500G  0 part /
    +vdb                                                                253:16   0 1000G  0 disk
    +└─ceph--7eefe77f--c618--4477--a1ed--b4f44520dfc2-osd--block--bced3ff1--42b9--43e1--8f63--e853bce41435
    +                                                                    252:0    0 1000G  0 lvm
    +rbd0                                                               251:0    0  100G  0 disk
    +rbd1                                                               251:16   0  500G  0 disk
    +

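    WARNING

    A block device image appears in the local lsblk output only after it has been mapped on that node. Run the same mapping commands on ceph002 and ceph003.


    Prepare the Distributed File System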

    `,71);function v(u,m){const n=e("RouterLink");return i(),l("div",null,[o,r("p",null,[a("参阅 "),c(n,{to:"/deploying/fs-pfs.html"},{default:p(()=>[a("格式化并挂载 PFS")]),_:1}),a("。")])])}const h=s(t,[["render",v],["__file","storage-ceph.html.vue"]]);export{h as default}; diff --git a/assets/storage-curvebs.html-a99e7740.js b/assets/storage-curvebs.html-a99e7740.js new file mode 100644 index 00000000000..9c93760e1c4 --- /dev/null +++ b/assets/storage-curvebs.html-a99e7740.js @@ -0,0 +1,156 @@ +import{_ as i,r as t,o as u,c as r,a as s,b as n,d as a,w as d,e as k}from"./app-3d1677bf.js";const v="/PolarDB-for-PostgreSQL/assets/curve-cluster-77966d1c.png",m={},b={id:"curvebs-共享存储",tabindex:"-1"},h=s("a",{class:"header-anchor",href:"#curvebs-共享存储","aria-hidden":"true"},"#",-1),g={href:"https://developer.aliyun.com/live/250218"},f={href:"https://github.com/opencurve/curve",target:"_blank",rel:"noopener noreferrer"},y=s("ul",null,[s("li",null,"对接 OpenStack 平台为云主机提供高性能块存储服务;"),s("li",null,"对接 Kubernetes 为其提供 RWO、RWX 等类型的持久化存储卷;"),s("li",null,"对接 PolarFS 作为云原生数据库的高性能存储底座,完美支持云原生数据库的存算分离架构。")],-1),_=s("p",null,"Curve 亦可作为云存储中间件使用 S3 兼容的对象存储作为数据存储引擎,为公有云用户提供高性价比的共享文件存储。",-1),S={href:"https://github.com/opencurve/curveadm/wiki",target:"_blank",rel:"noopener noreferrer"},x=s("h2",{id:"设备准备",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#设备准备","aria-hidden":"true"},"#"),n(" 设备准备")],-1),$=s("p",null,[s("img",{src:v,alt:"curve-cluster"})],-1),C=s("p",null,"如图所示,本示例共使用六台服务器。其中,一台中控服务器和三台存储服务器共同组成 CurveBS 集群,对外暴露为一个共享存储服务。剩余两台服务器分别用于部署 PolarDB for PostgreSQL 数据库的读写节点和只读节点,它们共享 CurveBS 对外暴露的块存储设备。",-1),P={href:"https://openanolis.cn/anolisos",target:"_blank",rel:"noopener noreferrer"},B={href:"https://www.docker.com/",target:"_blank",rel:"noopener noreferrer"},O=s("li",null,"在 Curve 中控机上配置 SSH 免密登陆到其它五台服务器",-1),K=k(`

    Install CurveAdm on the Control Machine

    bash -c "$(curl -fsSL https://curveadm.nos-eastchina1.126.net/script/install.sh)"
    +source /root/.bash_profile
    +

    Import the Host List

    Edit the host list file on the control machine:

    vim hosts.yaml
    +

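    The file contains the IP addresses of the other five servers and their names within the Curve cluster, where:

    • three hosts are Curve storage nodes
    • two hosts are PolarDB for PostgreSQL compute nodes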
    global:
    +  user: root
    +  ssh_port: 22
    +  private_key_file: /root/.ssh/id_rsa
    +
    +hosts:
    +  # Curve worker nodes
    +  - host: server-host1
    +    hostname: 172.16.0.223
    +  - host: server-host2
    +    hostname: 172.16.0.224
    +  - host: server-host3
    +    hostname: 172.16.0.225
    +  # PolarDB nodes
    +  - host: polardb-primary
    +    hostname: 172.16.0.226
    +  - host: polardb-replica
    +    hostname: 172.16.0.227
    +

    Import the host list:

    curveadm hosts commit hosts.yaml
    +

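    Format the Disks

    Prepare the disk list and pre-generate a batch of fixed-size, pre-written chunk files. The disk list must include:

    • all storage node hosts to be formatted
    • the common block device name on every host (/dev/vdb in this example)
    • the mount point to be used
    • the format percentage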
    vim format.yaml
    +
    host:
    +  - server-host1
    +  - server-host2
    +  - server-host3
    +disk:
    +  - /dev/vdb:/data/chunkserver0:90 # device:mount_path:format_percent
    +

    Start formatting. The control machine launches one formatting container for each block device on each storage node host.

    $ curveadm format -f format.yaml
    +Start Format Chunkfile Pool: ⠸
    +  + host=server-host1  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [0/1] ⠸
    +  + host=server-host2  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [0/1] ⠸
    +  + host=server-host3  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [0/1] ⠸
    +

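    When OK is displayed, the formatting container has started, but this does not mean that formatting has finished. Formatting is a lengthy process and continues for some time: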

    Start Format Chunkfile Pool: [OK]
    +  + host=server-host1  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [1/1] [OK]
    +  + host=server-host2  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [1/1] [OK]
    +  + host=server-host3  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [1/1] [OK]
    +

    Use the following command to check the formatting progress. Here the disks are still being formatted:

    $ curveadm format --status
    +Get Format Status: [OK]
    +
    +Host          Device    MountPoint          Formatted  Status
    +----          ------    ----------          ---------  ------
    +server-host1  /dev/vdb  /data/chunkserver0  19/90      Formatting
    +server-host2  /dev/vdb  /data/chunkserver0  22/90      Formatting
    +server-host3  /dev/vdb  /data/chunkserver0  22/90      Formatting
    +

    Output after formatting completes:

    $ curveadm format --status
    +Get Format Status: [OK]
    +
    +Host          Device    MountPoint          Formatted  Status
    +----          ------    ----------          ---------  ------
    +server-host1  /dev/vdb  /data/chunkserver0  95/90      Done
    +server-host2  /dev/vdb  /data/chunkserver0  95/90      Done
    +server-host3  /dev/vdb  /data/chunkserver0  95/90      Done
    +
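
    Because formatting runs in the background, a script may need to wait until every device reports Done. The sketch below checks the status table for that condition; format_done is a hypothetical helper that assumes the column layout shown above, with the device in the second column and the status in the last.

```shell
# Read `curveadm format --status` output on stdin; succeed only when every
# device row reports Done.
format_done() {
    awk '$2 ~ /^\/dev\// { rows++; if ($NF != "Done") pending++ }
         END { exit !(rows > 0 && pending == 0) }'
}

# Sample row modeled on the in-progress output shown earlier.
printf '%s\n' 'server-host1  /dev/vdb  /data/chunkserver0  19/90  Formatting' |
    format_done || echo "still formatting"
```

    On the control machine, a loop such as until curveadm format --status | format_done; do sleep 30; done would block until formatting has finished.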

    Deploy the CurveBS Cluster

    First, prepare the cluster configuration file:

    vim topology.yaml
    +

    Paste the following configuration:

    kind: curvebs
    +global:
    +  container_image: opencurvedocker/curvebs:v1.2
    +  log_dir: ${home}/logs/${service_role}${service_replicas_sequence}
    +  data_dir: ${home}/data/${service_role}${service_replicas_sequence}
    +  s3.nos_address: 127.0.0.1
    +  s3.snapshot_bucket_name: curve
    +  s3.ak: minioadmin
    +  s3.sk: minioadmin
    +  variable:
    +    home: /tmp
    +    machine1: server-host1
    +    machine2: server-host2
    +    machine3: server-host3
    +
    +etcd_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 2380
    +    listen.client_port: 2379
    +  deploy:
    +    - host: ${machine1}
    +    - host: ${machine2}
    +    - host: ${machine3}
    +
    +mds_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 6666
    +    listen.dummy_port: 6667
    +  deploy:
    +    - host: ${machine1}
    +    - host: ${machine2}
    +    - host: ${machine3}
    +
    +chunkserver_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 82${format_replicas_sequence} # 8200,8201,8202
    +    data_dir: /data/chunkserver${service_replicas_sequence} # /data/chunkserver0, /data/chunkserver1
    +    copysets: 100
    +  deploy:
    +    - host: ${machine1}
    +      replicas: 1
    +    - host: ${machine2}
    +      replicas: 1
    +    - host: ${machine3}
    +      replicas: 1
    +
    +snapshotclone_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 5555
    +    listen.dummy_port: 8081
    +    listen.proxy_port: 8080
    +  deploy:
    +    - host: ${machine1}
    +    - host: ${machine2}
    +    - host: ${machine3}
    +

    Create the cluster my-cluster from the topology file above:

    curveadm cluster add my-cluster -f topology.yaml
    +

    Switch the currently managed cluster to my-cluster:

    curveadm cluster checkout my-cluster
    +

    Deploy the cluster. If the deployment succeeds, a message similar to Cluster 'my-cluster' successfully deployed ^_^. is printed.

    $ curveadm deploy --skip snapshotclone
    +
    +...
    +Create Logical Pool: [OK]
    +  + host=server-host1  role=mds  containerId=c6fdd71ae678 [1/1] [OK]
    +
    +Start Service: [OK]
    +  + host=server-host1  role=snapshotclone  containerId=9d3555ba72fa [1/1] [OK]
    +  + host=server-host2  role=snapshotclone  containerId=e6ae2b23b57e [1/1] [OK]
    +  + host=server-host3  role=snapshotclone  containerId=f6d3446c7684 [1/1] [OK]
    +
    +Balance Leader: [OK]
    +  + host=server-host1  role=mds  containerId=c6fdd71ae678 [1/1] [OK]
    +
    +Cluster 'my-cluster' successfully deployed ^_^.
    +

    Check the cluster status:

    $ curveadm status
    +Get Service Status: [OK]
    +
    +cluster name      : my-cluster
    +cluster kind      : curvebs
    +cluster mds addr  : 172.16.0.223:6666,172.16.0.224:6666,172.16.0.225:6666
    +cluster mds leader: 172.16.0.225:6666 / d0a94a7afa14
    +
    +Id            Role         Host          Replicas  Container Id  Status
    +--            ----         ----          --------  ------------  ------
    +5567a1c56ab9  etcd         server-host1  1/1       f894c5485a26  Up 17 seconds
    +68f9f0e6f108  etcd         server-host2  1/1       69b09cdbf503  Up 17 seconds
    +a678263898cc  etcd         server-host3  1/1       2ed141800731  Up 17 seconds
    +4dcbdd08e2cd  mds          server-host1  1/1       76d62ff0eb25  Up 17 seconds
    +8ef1755b0a10  mds          server-host2  1/1       d8d838258a6f  Up 17 seconds
    +f3599044c6b5  mds          server-host3  1/1       d63ae8502856  Up 17 seconds
    +9f1d43bc5b03  chunkserver  server-host1  1/1       39751a4f49d5  Up 16 seconds
    +3fb8fd7b37c1  chunkserver  server-host2  1/1       0f55a19ed44b  Up 16 seconds
    +c4da555952e3  chunkserver  server-host3  1/1       9411274d2c97  Up 16 seconds
    +

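    Deploying the CurveBS Client

    On the Curve control node, edit the client configuration file:

    vim client.yaml
    +

    Note: set mds.listen.addr to the cluster mds addr value printed by the cluster status command in the previous step: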

    kind: curvebs
    +container_image: opencurvedocker/curvebs:v1.2
    +mds.listen.addr: 172.16.0.223:6666,172.16.0.224:6666,172.16.0.225:6666
    +log_dir: /root/curvebs/logs/client
    +

    Preparing the Distributed File System

    `,43);function D(p,I){const l=t("Badge"),c=t("ArticleInfo"),e=t("ExternalLinkIcon"),o=t("RouterLink");return u(),r("div",null,[s("h1",b,[h,n(" CurveBS 共享存储 "),s("a",g,[a(l,{type:"tip",text:"视频",vertical:"top"})])]),a(c,{frontmatter:p.$frontmatter},null,8,["frontmatter"]),s("p",null,[s("a",f,[n("Curve"),a(e)]),n(" 是一款高性能、易运维、云原生的开源分布式存储系统。可应用于主流的云原生基础设施平台:")]),y,_,s("p",null,[n("本示例将引导您以 CurveBS 作为块存储,部署 PolarDB for PostgreSQL。更多进阶配置和使用方法请参考 Curve 项目的 "),s("a",S,[n("wiki"),a(e)]),n("。")]),x,$,C,s("p",null,[n("本示例使用阿里云 ECS 模拟全部六台服务器。六台 ECS 全部运行 "),s("a",P,[n("Anolis OS"),a(e)]),n(" 8.6(兼容 CentOS 8.6)系统,使用 root 用户,并处于同一局域网段内。需要完成的准备工作包含:")]),s("ol",null,[s("li",null,[n("在全部机器上安装 "),s("a",B,[n("Docker"),a(e)]),n("(请参考 Docker 官方文档)")]),O]),K,s("p",null,[n("接下来,将在两台运行 PolarDB 计算节点的 ECS 上分别部署 PolarDB 的主节点和只读节点。作为前提,需要让这两个节点能够共享 CurveBS 块设备,并在块设备上 "),a(o,{to:"/zh/deploying/fs-pfs-curve.html"},{default:d(()=>[n("格式化并挂载 PFS")]),_:1}),n("。")])])}const L=i(m,[["render",D],["__file","storage-curvebs.html.vue"]]);export{L as default}; diff --git a/assets/storage-curvebs.html-c5a165f0.js b/assets/storage-curvebs.html-c5a165f0.js new file mode 100644 index 00000000000..5e6899f9c2d --- /dev/null +++ b/assets/storage-curvebs.html-c5a165f0.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-7a570c87","path":"/zh/deploying/storage-curvebs.html","title":"CurveBS 共享存储","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2022/08/31","minute":30},"headers":[{"level":2,"title":"设备准备","slug":"设备准备","link":"#设备准备","children":[]},{"level":2,"title":"在中控机上安装 CurveAdm","slug":"在中控机上安装-curveadm","link":"#在中控机上安装-curveadm","children":[]},{"level":2,"title":"导入主机列表","slug":"导入主机列表","link":"#导入主机列表","children":[]},{"level":2,"title":"格式化磁盘","slug":"格式化磁盘","link":"#格式化磁盘","children":[]},{"level":2,"title":"部署 CurveBS 集群","slug":"部署-curvebs-集群","link":"#部署-curvebs-集群","children":[]},{"level":2,"title":"部署 CurveBS 
客户端","slug":"部署-curvebs-客户端","link":"#部署-curvebs-客户端","children":[]},{"level":2,"title":"准备分布式文件系统","slug":"准备分布式文件系统","link":"#准备分布式文件系统","children":[]}],"git":{"updatedTime":1685101046000},"filePathRelative":"zh/deploying/storage-curvebs.md"}');export{e as data}; diff --git a/assets/storage-curvebs.html-e2572630.js b/assets/storage-curvebs.html-e2572630.js new file mode 100644 index 00000000000..f4b1e9891bd --- /dev/null +++ b/assets/storage-curvebs.html-e2572630.js @@ -0,0 +1,156 @@ +import{_ as i,r as t,o as u,c as r,a as s,b as n,d as a,w as d,e as k}from"./app-3d1677bf.js";const v="/PolarDB-for-PostgreSQL/assets/curve-cluster-77966d1c.png",m={},b={id:"curvebs-共享存储",tabindex:"-1"},h=s("a",{class:"header-anchor",href:"#curvebs-共享存储","aria-hidden":"true"},"#",-1),g={href:"https://developer.aliyun.com/live/250218"},f={href:"https://github.com/opencurve/curve",target:"_blank",rel:"noopener noreferrer"},y=s("ul",null,[s("li",null,"对接 OpenStack 平台为云主机提供高性能块存储服务;"),s("li",null,"对接 Kubernetes 为其提供 RWO、RWX 等类型的持久化存储卷;"),s("li",null,"对接 PolarFS 作为云原生数据库的高性能存储底座,完美支持云原生数据库的存算分离架构。")],-1),_=s("p",null,"Curve 亦可作为云存储中间件使用 S3 兼容的对象存储作为数据存储引擎,为公有云用户提供高性价比的共享文件存储。",-1),S={href:"https://github.com/opencurve/curveadm/wiki",target:"_blank",rel:"noopener noreferrer"},x=s("h2",{id:"设备准备",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#设备准备","aria-hidden":"true"},"#"),n(" 设备准备")],-1),$=s("p",null,[s("img",{src:v,alt:"curve-cluster"})],-1),C=s("p",null,"如图所示,本示例共使用六台服务器。其中,一台中控服务器和三台存储服务器共同组成 CurveBS 集群,对外暴露为一个共享存储服务。剩余两台服务器分别用于部署 PolarDB for PostgreSQL 数据库的读写节点和只读节点,它们共享 CurveBS 对外暴露的块存储设备。",-1),P={href:"https://openanolis.cn/anolisos",target:"_blank",rel:"noopener noreferrer"},B={href:"https://www.docker.com/",target:"_blank",rel:"noopener noreferrer"},O=s("li",null,"在 Curve 中控机上配置 SSH 免密登陆到其它五台服务器",-1),K=k(`

    Installing CurveAdm on the Control Node

    bash -c "$(curl -fsSL https://curveadm.nos-eastchina1.126.net/script/install.sh)"
    +source /root/.bash_profile
    +

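    Importing the Host List

    On the control node, edit the host list file:

    vim hosts.yaml
    +

    The file lists the IP addresses of the other five servers and their names within the Curve cluster:

    • Three hosts are Curve storage nodes
    • Two hosts are PolarDB for PostgreSQL compute nodes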
    global:
    +  user: root
    +  ssh_port: 22
    +  private_key_file: /root/.ssh/id_rsa
    +
    +hosts:
    +  # Curve worker nodes
    +  - host: server-host1
    +    hostname: 172.16.0.223
    +  - host: server-host2
    +    hostname: 172.16.0.224
    +  - host: server-host3
    +    hostname: 172.16.0.225
    +  # PolarDB nodes
    +  - host: polardb-primary
    +    hostname: 172.16.0.226
    +  - host: polardb-replica
    +    hostname: 172.16.0.227
    +

    Import the host list:

    curveadm hosts commit hosts.yaml
    +

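    Formatting the Disks

    Prepare the disk list and pre-generate a batch of fixed-size, pre-written chunk files. The disk list must include:

    • Every storage node host to be formatted
    • The block device name, identical on every host (/dev/vdb in this example)
    • The mount point to use
    • The percentage of the disk to format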
    vim format.yaml
    +
    host:
    +  - server-host1
    +  - server-host2
    +  - server-host3
    +disk:
    +  - /dev/vdb:/data/chunkserver0:90 # device:mount_path:format_percent
    +
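    Each disk entry packs three fields into one colon-separated string, as the inline comment device:mount_path:format_percent indicates. The following Python sketch shows one way to split and sanity-check such an entry before running the format step. The names DiskEntry and parse_disk_entry are hypothetical helpers, and the 1 to 100 range check is an assumption; neither is part of CurveAdm.

```python
from typing import NamedTuple


class DiskEntry(NamedTuple):
    """One entry of the disk list (hypothetical helper type)."""
    device: str
    mount_path: str
    format_percent: int


def parse_disk_entry(entry: str) -> DiskEntry:
    """Split a 'device:mount_path:format_percent' string into its parts."""
    device, mount_path, percent = entry.strip().split(":")
    value = int(percent)
    # Assumption: a percentage outside 1..100 is treated as a mistake.
    if not 0 < value <= 100:
        raise ValueError(f"format_percent out of range: {value}")
    return DiskEntry(device, mount_path, value)


if __name__ == "__main__":
    print(parse_disk_entry("/dev/vdb:/data/chunkserver0:90"))
```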

    Start formatting. The control node launches one formatting container per block device on every storage node host.

    $ curveadm format -f format.yaml
    +Start Format Chunkfile Pool: ⠸
    +  + host=server-host1  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [0/1] ⠸
    +  + host=server-host2  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [0/1] ⠸
    +  + host=server-host3  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [0/1] ⠸
    +

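    When OK is displayed, the formatting container has started, but this does not mean formatting has finished. Formatting is a lengthy process and will run for some time: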

    Start Format Chunkfile Pool: [OK]
    +  + host=server-host1  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [1/1] [OK]
    +  + host=server-host2  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [1/1] [OK]
    +  + host=server-host3  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [1/1] [OK]
    +

    Use the following command to check formatting progress; at this point the disks are still being formatted:

    $ curveadm format --status
    +Get Format Status: [OK]
    +
    +Host          Device    MountPoint          Formatted  Status
    +----          ------    ----------          ---------  ------
    +server-host1  /dev/vdb  /data/chunkserver0  19/90      Formatting
    +server-host2  /dev/vdb  /data/chunkserver0  22/90      Formatting
    +server-host3  /dev/vdb  /data/chunkserver0  22/90      Formatting
    +

    Output after formatting completes:

    $ curveadm format --status
    +Get Format Status: [OK]
    +
    +Host          Device    MountPoint          Formatted  Status
    +----          ------    ----------          ---------  ------
    +server-host1  /dev/vdb  /data/chunkserver0  95/90      Done
    +server-host2  /dev/vdb  /data/chunkserver0  95/90      Done
    +server-host3  /dev/vdb  /data/chunkserver0  95/90      Done
    +
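    Because formatting runs in the background, a deployment script may want to wait until every row reports Done before continuing. The Python sketch below parses status text shaped like the sample output above. The function name format_finished is hypothetical, and the parsing assumes the table layout shown here (a dashed separator line followed by one row per disk, with the status in the last column).

```python
def format_finished(status_output: str) -> bool:
    """Return True when every disk row in the status table reports 'Done'."""
    statuses = []
    in_table = False
    for line in status_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "----":
            # The dashed separator marks the start of the data rows.
            in_table = True
            continue
        if in_table:
            statuses.append(fields[-1])
    # An empty table is treated as "not finished" rather than success.
    return bool(statuses) and all(s == "Done" for s in statuses)


SAMPLE = """Get Format Status: [OK]

Host          Device    MountPoint          Formatted  Status
----          ------    ----------          ---------  ------
server-host1  /dev/vdb  /data/chunkserver0  95/90      Done
server-host2  /dev/vdb  /data/chunkserver0  95/90      Done
server-host3  /dev/vdb  /data/chunkserver0  95/90      Done
"""

if __name__ == "__main__":
    print(format_finished(SAMPLE))
```

    The text passed in would come from running curveadm format --status, for example captured with subprocess.run; that wiring is left out of the sketch.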

    Deploying the CurveBS Cluster

    First, prepare the cluster configuration file:

    vim topology.yaml
    +

    Paste the following configuration:

    kind: curvebs
    +global:
    +  container_image: opencurvedocker/curvebs:v1.2
    +  log_dir: ${home}/logs/${service_role}${service_replicas_sequence}
    +  data_dir: ${home}/data/${service_role}${service_replicas_sequence}
    +  s3.nos_address: 127.0.0.1
    +  s3.snapshot_bucket_name: curve
    +  s3.ak: minioadmin
    +  s3.sk: minioadmin
    +  variable:
    +    home: /tmp
    +    machine1: server-host1
    +    machine2: server-host2
    +    machine3: server-host3
    +
    +etcd_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 2380
    +    listen.client_port: 2379
    +  deploy:
    +    - host: ${machine1}
    +    - host: ${machine2}
    +    - host: ${machine3}
    +
    +mds_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 6666
    +    listen.dummy_port: 6667
    +  deploy:
    +    - host: ${machine1}
    +    - host: ${machine2}
    +    - host: ${machine3}
    +
    +chunkserver_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 82${format_replicas_sequence} # 8200,8201,8202
    +    data_dir: /data/chunkserver${service_replicas_sequence} # /data/chunkserver0, /data/chunkserver1
    +    copysets: 100
    +  deploy:
    +    - host: ${machine1}
    +      replicas: 1
    +    - host: ${machine2}
    +      replicas: 1
    +    - host: ${machine3}
    +      replicas: 1
    +
    +snapshotclone_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 5555
    +    listen.dummy_port: 8081
    +    listen.proxy_port: 8080
    +  deploy:
    +    - host: ${machine1}
    +    - host: ${machine2}
    +    - host: ${machine3}
    +
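    The chunkserver section relies on two sequence variables whose effect is only hinted at by the inline comments (8200,8201,8202 and /data/chunkserver0, /data/chunkserver1). The Python sketch below reproduces that expansion for a given replica count. It assumes ${format_replicas_sequence} is a zero-padded two-digit index and ${service_replicas_sequence} is the plain index, which is inferred from those comments and not from the CurveAdm source; chunkserver_settings is a hypothetical name.

```python
def chunkserver_settings(replicas: int) -> list:
    """Expand the per-replica port and data directory for one host."""
    settings = []
    for seq in range(replicas):
        settings.append({
            # Assumed meaning of 82${format_replicas_sequence}: 8200, 8201, ...
            "listen.port": int(f"82{seq:02d}"),
            # Assumed meaning of /data/chunkserver${service_replicas_sequence}
            "data_dir": f"/data/chunkserver{seq}",
        })
    return settings


if __name__ == "__main__":
    for item in chunkserver_settings(3):
        print(item)
```

    With replicas: 1, as in the topology above, each host gets a single chunkserver on port 8200 using /data/chunkserver0, which matches the mount point chosen in format.yaml.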

    Create the cluster my-cluster from the topology file above:

    curveadm cluster add my-cluster -f topology.yaml
    +

    Switch to my-cluster as the currently managed cluster:

    curveadm cluster checkout my-cluster
    +

    Deploy the cluster. On success, the output ends with a line similar to Cluster 'my-cluster' successfully deployed ^_^.

    $ curveadm deploy --skip snapshotclone
    +
    +...
    +Create Logical Pool: [OK]
    +  + host=server-host1  role=mds  containerId=c6fdd71ae678 [1/1] [OK]
    +
    +Start Service: [OK]
    +  + host=server-host1  role=snapshotclone  containerId=9d3555ba72fa [1/1] [OK]
    +  + host=server-host2  role=snapshotclone  containerId=e6ae2b23b57e [1/1] [OK]
    +  + host=server-host3  role=snapshotclone  containerId=f6d3446c7684 [1/1] [OK]
    +
    +Balance Leader: [OK]
    +  + host=server-host1  role=mds  containerId=c6fdd71ae678 [1/1] [OK]
    +
    +Cluster 'my-cluster' successfully deployed ^_^.
    +

    Check the cluster status:

    $ curveadm status
    +Get Service Status: [OK]
    +
    +cluster name      : my-cluster
    +cluster kind      : curvebs
    +cluster mds addr  : 172.16.0.223:6666,172.16.0.224:6666,172.16.0.225:6666
    +cluster mds leader: 172.16.0.225:6666 / d0a94a7afa14
    +
    +Id            Role         Host          Replicas  Container Id  Status
    +--            ----         ----          --------  ------------  ------
    +5567a1c56ab9  etcd         server-host1  1/1       f894c5485a26  Up 17 seconds
    +68f9f0e6f108  etcd         server-host2  1/1       69b09cdbf503  Up 17 seconds
    +a678263898cc  etcd         server-host3  1/1       2ed141800731  Up 17 seconds
    +4dcbdd08e2cd  mds          server-host1  1/1       76d62ff0eb25  Up 17 seconds
    +8ef1755b0a10  mds          server-host2  1/1       d8d838258a6f  Up 17 seconds
    +f3599044c6b5  mds          server-host3  1/1       d63ae8502856  Up 17 seconds
    +9f1d43bc5b03  chunkserver  server-host1  1/1       39751a4f49d5  Up 16 seconds
    +3fb8fd7b37c1  chunkserver  server-host2  1/1       0f55a19ed44b  Up 16 seconds
    +c4da555952e3  chunkserver  server-host3  1/1       9411274d2c97  Up 16 seconds
    +

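    Deploying the CurveBS Client

    On the Curve control node, edit the client configuration file:

    vim client.yaml
    +

    Note: set mds.listen.addr to the cluster mds addr value printed by the cluster status command in the previous step: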

    kind: curvebs
    +container_image: opencurvedocker/curvebs:v1.2
    +mds.listen.addr: 172.16.0.223:6666,172.16.0.224:6666,172.16.0.225:6666
    +log_dir: /root/curvebs/logs/client
    +
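    Copying the MDS address list by hand is easy to get wrong. As a minimal sketch, the Python snippet below pulls the cluster mds addr value out of status text shaped like the sample above, so it can be pasted into mds.listen.addr. The function name extract_mds_addr is hypothetical, and the "key : value" line layout is assumed from the sample output.

```python
def extract_mds_addr(status_output: str) -> str:
    """Return the value of the 'cluster mds addr' line from status text."""
    for line in status_output.splitlines():
        # Split on the first colon only; the value itself contains colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cluster mds addr":
            return value.strip()
    raise ValueError("no 'cluster mds addr' line found")


SAMPLE = """Get Service Status: [OK]

cluster name      : my-cluster
cluster kind      : curvebs
cluster mds addr  : 172.16.0.223:6666,172.16.0.224:6666,172.16.0.225:6666
cluster mds leader: 172.16.0.225:6666 / d0a94a7afa14
"""

if __name__ == "__main__":
    print("mds.listen.addr:", extract_mds_addr(SAMPLE))
```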

    Preparing the Distributed File System

    `,43);function D(p,I){const l=t("Badge"),c=t("ArticleInfo"),e=t("ExternalLinkIcon"),o=t("RouterLink");return u(),r("div",null,[s("h1",b,[h,n(" CurveBS 共享存储 "),s("a",g,[a(l,{type:"tip",text:"视频",vertical:"top"})])]),a(c,{frontmatter:p.$frontmatter},null,8,["frontmatter"]),s("p",null,[s("a",f,[n("Curve"),a(e)]),n(" 是一款高性能、易运维、云原生的开源分布式存储系统。可应用于主流的云原生基础设施平台:")]),y,_,s("p",null,[n("本示例将引导您以 CurveBS 作为块存储,部署 PolarDB for PostgreSQL。更多进阶配置和使用方法请参考 Curve 项目的 "),s("a",S,[n("wiki"),a(e)]),n("。")]),x,$,C,s("p",null,[n("本示例使用阿里云 ECS 模拟全部六台服务器。六台 ECS 全部运行 "),s("a",P,[n("Anolis OS"),a(e)]),n(" 8.6(兼容 CentOS 8.6)系统,使用 root 用户,并处于同一局域网段内。需要完成的准备工作包含:")]),s("ol",null,[s("li",null,[n("在全部机器上安装 "),s("a",B,[n("Docker"),a(e)]),n("(请参考 Docker 官方文档)")]),O]),K,s("p",null,[n("接下来,将在两台运行 PolarDB 计算节点的 ECS 上分别部署 PolarDB 的主节点和只读节点。作为前提,需要让这两个节点能够共享 CurveBS 块设备,并在块设备上 "),a(o,{to:"/deploying/fs-pfs-curve.html"},{default:d(()=>[n("格式化并挂载 PFS")]),_:1}),n("。")])])}const L=i(m,[["render",D],["__file","storage-curvebs.html.vue"]]);export{L as default}; diff --git a/assets/storage-curvebs.html-f3b814a4.js b/assets/storage-curvebs.html-f3b814a4.js new file mode 100644 index 00000000000..4715a7f4313 --- /dev/null +++ b/assets/storage-curvebs.html-f3b814a4.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-7f31e698","path":"/deploying/storage-curvebs.html","title":"CurveBS 共享存储","lang":"en-US","frontmatter":{"author":"棠羽","date":"2022/08/31","minute":30},"headers":[{"level":2,"title":"设备准备","slug":"设备准备","link":"#设备准备","children":[]},{"level":2,"title":"在中控机上安装 CurveAdm","slug":"在中控机上安装-curveadm","link":"#在中控机上安装-curveadm","children":[]},{"level":2,"title":"导入主机列表","slug":"导入主机列表","link":"#导入主机列表","children":[]},{"level":2,"title":"格式化磁盘","slug":"格式化磁盘","link":"#格式化磁盘","children":[]},{"level":2,"title":"部署 CurveBS 集群","slug":"部署-curvebs-集群","link":"#部署-curvebs-集群","children":[]},{"level":2,"title":"部署 CurveBS 
客户端","slug":"部署-curvebs-客户端","link":"#部署-curvebs-客户端","children":[]},{"level":2,"title":"准备分布式文件系统","slug":"准备分布式文件系统","link":"#准备分布式文件系统","children":[]}],"git":{"updatedTime":1684803631000},"filePathRelative":"deploying/storage-curvebs.md"}');export{e as data}; diff --git a/assets/storage-nbd.html-0d5c1474.js b/assets/storage-nbd.html-0d5c1474.js new file mode 100644 index 00000000000..84143b25f57 --- /dev/null +++ b/assets/storage-nbd.html-0d5c1474.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-c895df30","path":"/deploying/storage-nbd.html","title":"NBD 共享存储","lang":"en-US","frontmatter":{},"headers":[{"level":2,"title":"安装 NBD","slug":"安装-nbd","link":"#安装-nbd","children":[{"level":3,"title":"为操作系统下载安装 NBD 驱动","slug":"为操作系统下载安装-nbd-驱动","link":"#为操作系统下载安装-nbd-驱动","children":[]},{"level":3,"title":"安装 NBD 软件包","slug":"安装-nbd-软件包","link":"#安装-nbd-软件包","children":[]}]},{"level":2,"title":"使用 NBD 来共享块设备","slug":"使用-nbd-来共享块设备","link":"#使用-nbd-来共享块设备","children":[{"level":3,"title":"服务端部署","slug":"服务端部署","link":"#服务端部署","children":[]},{"level":3,"title":"客户端部署","slug":"客户端部署","link":"#客户端部署","children":[]}]},{"level":2,"title":"准备分布式文件系统","slug":"准备分布式文件系统","link":"#准备分布式文件系统","children":[]}],"git":{"updatedTime":1656919280000},"filePathRelative":"deploying/storage-nbd.md"}');export{l as data}; diff --git a/assets/storage-nbd.html-162d7e26.js b/assets/storage-nbd.html-162d7e26.js new file mode 100644 index 00000000000..34d15393c83 --- /dev/null +++ b/assets/storage-nbd.html-162d7e26.js @@ -0,0 +1,34 @@ +import{_ as c,r as s,o as d,c as r,a as n,b as a,d as e,w as o,e as l}from"./app-3d1677bf.js";const p={},u=l('

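    NBD Shared Storage

    Network Block Device (NBD) is a network protocol that lets multiple hosts share a block storage device. NBD uses a client-server architecture, so a deployment needs at least two physical machines.

    Taking a two-machine environment as an example, this section outlines how to build an instance on NBD shared storage:

    • First, the two hosts share one block device through NBD;
    • Next, PolarDB File System (PFS) is deployed on both hosts to initialize and mount that same block device;
    • Finally, the PolarDB for PostgreSQL kernel is deployed on each host to build a primary node and a read-only node, forming a simple one-writer, multi-reader instance.

    WARNING

    The steps above were tested on CentOS 7.5.

    Installing NBD

    Downloading and Installing the NBD Driver for the Operating System

    TIP

    The operating system kernel must support the NBD kernel module. If it currently does not, you need to compile the NBD module against the matching kernel version and load it yourself.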

    ',8),v={href:"https://www.centos.org/",target:"_blank",rel:"noopener noreferrer"},b=l(`
    rpm -ihv kernel-3.10.0-862.el7.src.rpm
    +cd ~/rpmbuild/SOURCES
    +tar Jxvf linux-3.10.0-862.el7.tar.xz -C /usr/src/kernels/
    +cd /usr/src/kernels/linux-3.10.0-862.el7/
    +

    The NBD driver source is located at drivers/block/nbd.c. Next, build the kernel dependencies and components:

    cp ../$(uname -r)/Module.symvers ./
    +make menuconfig # Device Drivers -> Block devices -> set 'Network block device support' to 'M'
    +make prepare && make modules_prepare && make scripts
    +make CONFIG_BLK_DEV_NBD=m M=drivers/block
    +

    Check that the driver was built correctly:

    modinfo drivers/block/nbd.ko
    +

    Copy the driver, generate dependencies, and install it:

    cp drivers/block/nbd.ko /lib/modules/$(uname -r)/kernel/drivers/block
    +depmod -a
    +modprobe nbd # or use modprobe -f nbd to skip the module version check
    +

    Check that the installation succeeded:

    # Check the loaded kernel modules
    +lsmod | grep nbd
    +# If the NBD driver is installed, /dev/nbd* devices are created (for example /dev/nbd0, /dev/nbd1)
    +ls /dev/nbd*
    +

    Installing the NBD Package

    yum install nbd
    +

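    Sharing a Block Device with NBD

    Server Deployment

    Start the NBD server in synchronous mode (sync/flush=true), listening on a chosen port (for example 1921) for access to a chosen block device (for example /dev/vdb).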

    nbd-server -C /root/nbd.conf
    +

    An example of the configuration file /root/nbd.conf:

    [generic]
    +    #user = nbd
    +    #group = nbd
    +    listenaddr = 0.0.0.0
    +    port = 1921
    +[export1]
    +    exportname = /dev/vdb
    +    readonly = false
    +    multifile = false
    +    copyonwrite = false
    +    flush = true
    +    fua = true
    +    sync = true
    +
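    Since the shared-storage setup depends on the export being served synchronously, it can be worth checking the configuration file before starting the server. The Python sketch below reads an INI-style file shaped like the example above with the standard configparser module and reports any export section where flush, fua, or sync is not enabled. The function name unsynced_exports is hypothetical, and treating all three options as required is an assumption drawn from this example, not a rule taken from the NBD documentation.

```python
import configparser

# Options this sketch expects to be true on every export (an assumption).
SYNC_OPTIONS = ("flush", "fua", "sync")

SAMPLE_CONF = """
[generic]
    #user = nbd
    #group = nbd
    listenaddr = 0.0.0.0
    port = 1921
[export1]
    exportname = /dev/vdb
    readonly = false
    multifile = false
    copyonwrite = false
    flush = true
    fua = true
    sync = true
"""


def unsynced_exports(conf_text: str) -> dict:
    """Map each export section to the sync options that are not enabled."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    problems = {}
    for section in parser.sections():
        if section == "generic":
            continue  # global settings, not an export
        missing = [
            opt for opt in SYNC_OPTIONS
            if not parser.getboolean(section, opt, fallback=False)
        ]
        if missing:
            problems[section] = missing
    return problems


if __name__ == "__main__":
    print(unsynced_exports(SAMPLE_CONF) or "all exports are synchronous")
```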

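    Client Deployment

    Once the NBD driver is installed, /dev/nbd* devices appear. Map the remote block device to one of these local NBD devices according to the server configuration:

    nbd-client x.x.x.x 1921 -N export1 /dev/nbd0
    +# x.x.x.x is the IP address of the NBD server host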
    +

    Preparing the Distributed File System

    `,22);function m(k,h){const i=s("ExternalLinkIcon"),t=s("RouterLink");return d(),r("div",null,[u,n("p",null,[a("从 "),n("a",v,[a("CentOS 官网"),e(i)]),a(" 下载对应内核版本的驱动源码包并解压:")]),b,n("p",null,[a("参阅 "),e(t,{to:"/deploying/fs-pfs.html"},{default:o(()=>[a("格式化并挂载 PFS")]),_:1}),a("。")])])}const f=c(p,[["render",m],["__file","storage-nbd.html.vue"]]);export{f as default}; diff --git a/assets/storage-nbd.html-6a3a12bf.js b/assets/storage-nbd.html-6a3a12bf.js new file mode 100644 index 00000000000..3f9d39ed97c --- /dev/null +++ b/assets/storage-nbd.html-6a3a12bf.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-04fef452","path":"/zh/deploying/storage-nbd.html","title":"NBD 共享存储","lang":"zh-CN","frontmatter":{},"headers":[{"level":2,"title":"安装 NBD","slug":"安装-nbd","link":"#安装-nbd","children":[{"level":3,"title":"为操作系统下载安装 NBD 驱动","slug":"为操作系统下载安装-nbd-驱动","link":"#为操作系统下载安装-nbd-驱动","children":[]},{"level":3,"title":"安装 NBD 软件包","slug":"安装-nbd-软件包","link":"#安装-nbd-软件包","children":[]}]},{"level":2,"title":"使用 NBD 来共享块设备","slug":"使用-nbd-来共享块设备","link":"#使用-nbd-来共享块设备","children":[{"level":3,"title":"服务端部署","slug":"服务端部署","link":"#服务端部署","children":[]},{"level":3,"title":"客户端部署","slug":"客户端部署","link":"#客户端部署","children":[]}]},{"level":2,"title":"准备分布式文件系统","slug":"准备分布式文件系统","link":"#准备分布式文件系统","children":[]}],"git":{"updatedTime":1656919280000},"filePathRelative":"zh/deploying/storage-nbd.md"}');export{l as data}; diff --git a/assets/storage-nbd.html-97a12948.js b/assets/storage-nbd.html-97a12948.js new file mode 100644 index 00000000000..ba7f06d7256 --- /dev/null +++ b/assets/storage-nbd.html-97a12948.js @@ -0,0 +1,34 @@ +import{_ as c,r as s,o as d,c as r,a as n,b as a,d as e,w as o,e as l}from"./app-3d1677bf.js";const p={},u=l('

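    NBD Shared Storage

    Network Block Device (NBD) is a network protocol that lets multiple hosts share a block storage device. NBD uses a client-server architecture, so a deployment needs at least two physical machines.

    Taking a two-machine environment as an example, this section outlines how to build an instance on NBD shared storage:

    • First, the two hosts share one block device through NBD;
    • Next, PolarDB File System (PFS) is deployed on both hosts to initialize and mount that same block device;
    • Finally, the PolarDB for PostgreSQL kernel is deployed on each host to build a primary node and a read-only node, forming a simple one-writer, multi-reader instance.

    WARNING

    The steps above were tested on CentOS 7.5.

    Installing NBD

    Downloading and Installing the NBD Driver for the Operating System

    TIP

    The operating system kernel must support the NBD kernel module. If it currently does not, you need to compile the NBD module against the matching kernel version and load it yourself.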

    ',8),v={href:"https://www.centos.org/",target:"_blank",rel:"noopener noreferrer"},b=l(`
    rpm -ihv kernel-3.10.0-862.el7.src.rpm
    +cd ~/rpmbuild/SOURCES
    +tar Jxvf linux-3.10.0-862.el7.tar.xz -C /usr/src/kernels/
    +cd /usr/src/kernels/linux-3.10.0-862.el7/
    +

    The NBD driver source is located at drivers/block/nbd.c. Next, build the kernel dependencies and components:

    cp ../$(uname -r)/Module.symvers ./
    +make menuconfig # Device Drivers -> Block devices -> set 'Network block device support' to 'M'
    +make prepare && make modules_prepare && make scripts
    +make CONFIG_BLK_DEV_NBD=m M=drivers/block
    +

    Check that the driver was built correctly:

    modinfo drivers/block/nbd.ko
    +

    Copy the driver, generate dependencies, and install it:

    cp drivers/block/nbd.ko /lib/modules/$(uname -r)/kernel/drivers/block
    +depmod -a
    +modprobe nbd # or use modprobe -f nbd to skip the module version check
    +

    Check that the installation succeeded:

    # Check the loaded kernel modules
    +lsmod | grep nbd
    +# If the NBD driver is installed, /dev/nbd* devices are created (for example /dev/nbd0, /dev/nbd1)
    +ls /dev/nbd*
    +

    Installing the NBD Package

    yum install nbd
    +

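    Sharing a Block Device with NBD

    Server Deployment

    Start the NBD server in synchronous mode (sync/flush=true), listening on a chosen port (for example 1921) for access to a chosen block device (for example /dev/vdb).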

    nbd-server -C /root/nbd.conf
    +

    An example of the configuration file /root/nbd.conf:

    [generic]
    +    #user = nbd
    +    #group = nbd
    +    listenaddr = 0.0.0.0
    +    port = 1921
    +[export1]
    +    exportname = /dev/vdb
    +    readonly = false
    +    multifile = false
    +    copyonwrite = false
    +    flush = true
    +    fua = true
    +    sync = true
    +

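    Client Deployment

    Once the NBD driver is installed, /dev/nbd* devices appear. Map the remote block device to one of these local NBD devices according to the server configuration:

    nbd-client x.x.x.x 1921 -N export1 /dev/nbd0
    +# x.x.x.x is the IP address of the NBD server host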
    +

    Preparing the Distributed File System

    `,22);function m(k,h){const i=s("ExternalLinkIcon"),t=s("RouterLink");return d(),r("div",null,[u,n("p",null,[a("从 "),n("a",v,[a("CentOS 官网"),e(i)]),a(" 下载对应内核版本的驱动源码包并解压:")]),b,n("p",null,[a("参阅 "),e(t,{to:"/zh/deploying/fs-pfs.html"},{default:o(()=>[a("格式化并挂载 PFS")]),_:1}),a("。")])])}const f=c(p,[["render",m],["__file","storage-nbd.html.vue"]]);export{f as default}; diff --git a/assets/style-9683479e.css b/assets/style-9683479e.css new file mode 100644 index 00000000000..80ca564d5b3 --- /dev/null +++ b/assets/style-9683479e.css @@ -0,0 +1 @@ +:root{--back-to-top-z-index: 5;--back-to-top-color: #3eaf7c;--back-to-top-color-hover: #71cda3}.back-to-top{cursor:pointer;position:fixed;bottom:2rem;right:2.5rem;width:2rem;height:1.2rem;background-color:var(--back-to-top-color);-webkit-mask:url(/PolarDB-for-PostgreSQL/assets/back-to-top-8efcbe56.svg) no-repeat;mask:url(/PolarDB-for-PostgreSQL/assets/back-to-top-8efcbe56.svg) no-repeat;z-index:var(--back-to-top-z-index)}.back-to-top:hover{background-color:var(--back-to-top-color-hover)}@media (max-width: 959px){.back-to-top{display:none}}@media print{.back-to-top{display:none}}.back-to-top-enter-active,.back-to-top-leave-active{transition:opacity .3s}.back-to-top-enter-from,.back-to-top-leave-to{opacity:0}:root{--external-link-icon-color: #aaa}.external-link-icon{position:relative;display:inline-block;color:var(--external-link-icon-color);vertical-align:middle;top:-1px}@media print{.external-link-icon{display:none}}.external-link-icon-sr-only{position:absolute;width:1px;height:1px;padding:0;margin:-1px;overflow:hidden;clip:rect(0,0,0,0);white-space:nowrap;border-width:0;-webkit-user-select:none;-moz-user-select:none;user-select:none}:root{--medium-zoom-z-index: 100;--medium-zoom-bg-color: #ffffff;--medium-zoom-opacity: 1}.medium-zoom-overlay{background-color:var(--medium-zoom-bg-color)!important;z-index:var(--medium-zoom-z-index)}.medium-zoom-overlay~img{z-index:calc(var(--medium-zoom-z-index) + 1)}.medium-zoom--opened 
.medium-zoom-overlay{opacity:var(--medium-zoom-opacity)}:root{--nprogress-color: #29d;--nprogress-z-index: 1031}#nprogress{pointer-events:none}#nprogress .bar{background:var(--nprogress-color);position:fixed;z-index:var(--nprogress-z-index);top:0;left:0;width:100%;height:2px}:root{--c-brand: #3eaf7c;--c-brand-light: #4abf8a;--c-bg: #ffffff;--c-bg-light: #f3f4f5;--c-bg-lighter: #eeeeee;--c-bg-dark: #ebebec;--c-bg-darker: #e6e6e6;--c-bg-navbar: var(--c-bg);--c-bg-sidebar: var(--c-bg);--c-bg-arrow: #cccccc;--c-text: #2c3e50;--c-text-accent: var(--c-brand);--c-text-light: #3a5169;--c-text-lighter: #4e6e8e;--c-text-lightest: #6a8bad;--c-text-quote: #999999;--c-border: #eaecef;--c-border-dark: #dfe2e5;--c-tip: #42b983;--c-tip-bg: var(--c-bg-light);--c-tip-title: var(--c-text);--c-tip-text: var(--c-text);--c-tip-text-accent: var(--c-text-accent);--c-warning: #ffc310;--c-warning-bg: #fffae3;--c-warning-bg-light: #fff3ba;--c-warning-bg-lighter: #fff0b0;--c-warning-border-dark: #f7dc91;--c-warning-details-bg: #fff5ca;--c-warning-title: #f1b300;--c-warning-text: #746000;--c-warning-text-accent: #edb100;--c-warning-text-light: #c1971c;--c-warning-text-quote: #ccab49;--c-danger: #f11e37;--c-danger-bg: #ffe0e0;--c-danger-bg-light: #ffcfde;--c-danger-bg-lighter: #ffc9c9;--c-danger-border-dark: #f1abab;--c-danger-details-bg: #ffd4d4;--c-danger-title: #ed1e2c;--c-danger-text: #660000;--c-danger-text-accent: #bd1a1a;--c-danger-text-light: #b5474d;--c-danger-text-quote: #c15b5b;--c-details-bg: #eeeeee;--c-badge-tip: var(--c-tip);--c-badge-warning: #ecc808;--c-badge-warning-text: var(--c-bg);--c-badge-danger: #dc2626;--c-badge-danger-text: var(--c-bg);--t-color: .3s ease;--t-transform: .3s ease;--code-bg-color: #282c34;--code-hl-bg-color: rgba(0, 0, 0, .66);--code-ln-color: #9e9e9e;--code-ln-wrapper-width: 3.5rem;--font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, Cantarell, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif;--font-family-code: 
Consolas, Monaco, "Andale Mono", "Ubuntu Mono", monospace;--navbar-height: 3.6rem;--navbar-padding-v: .7rem;--navbar-padding-h: 1.5rem;--sidebar-width: 20rem;--sidebar-width-mobile: calc(var(--sidebar-width) * .82);--content-width: 740px;--homepage-width: 960px}.back-to-top{--back-to-top-color: var(--c-brand);--back-to-top-color-hover: var(--c-brand-light)}.DocSearch{--docsearch-primary-color: var(--c-brand);--docsearch-text-color: var(--c-text);--docsearch-highlight-color: var(--c-brand);--docsearch-muted-color: var(--c-text-quote);--docsearch-container-background: rgba(9, 10, 17, .8);--docsearch-modal-background: var(--c-bg-light);--docsearch-searchbox-background: var(--c-bg-lighter);--docsearch-searchbox-focus-background: var(--c-bg);--docsearch-searchbox-shadow: inset 0 0 0 2px var(--c-brand);--docsearch-hit-color: var(--c-text-light);--docsearch-hit-active-color: var(--c-bg);--docsearch-hit-background: var(--c-bg);--docsearch-hit-shadow: 0 1px 3px 0 var(--c-border-dark);--docsearch-footer-background: var(--c-bg)}.external-link-icon{--external-link-icon-color: var(--c-text-quote)}.medium-zoom-overlay{--medium-zoom-bg-color: var(--c-bg)}#nprogress{--nprogress-color: var(--c-brand)}.pwa-popup{--pwa-popup-text-color: var(--c-text);--pwa-popup-bg-color: var(--c-bg);--pwa-popup-border-color: var(--c-brand);--pwa-popup-shadow: 0 4px 16px var(--c-brand);--pwa-popup-btn-text-color: var(--c-bg);--pwa-popup-btn-bg-color: var(--c-brand);--pwa-popup-btn-hover-bg-color: var(--c-brand-light)}.search-box{--search-bg-color: var(--c-bg);--search-accent-color: var(--c-brand);--search-text-color: var(--c-text);--search-border-color: var(--c-border);--search-item-text-color: var(--c-text-lighter);--search-item-focus-bg-color: var(--c-bg-light)}html.dark{--c-brand: #3aa675;--c-brand-light: #349469;--c-bg: #22272e;--c-bg-light: #2b313a;--c-bg-lighter: #262c34;--c-bg-dark: #343b44;--c-bg-darker: #37404c;--c-text: #adbac7;--c-text-light: #96a7b7;--c-text-lighter: 
#8b9eb0;--c-text-lightest: #8094a8;--c-border: #3e4c5a;--c-border-dark: #34404c;--c-tip: #318a62;--c-warning: #e0ad15;--c-warning-bg: #2d2f2d;--c-warning-bg-light: #423e2a;--c-warning-bg-lighter: #44442f;--c-warning-border-dark: #957c35;--c-warning-details-bg: #39392d;--c-warning-title: #fdca31;--c-warning-text: #d8d96d;--c-warning-text-accent: #ffbf00;--c-warning-text-light: #ddb84b;--c-warning-text-quote: #ccab49;--c-danger: #fc1e38;--c-danger-bg: #39232c;--c-danger-bg-light: #4b2b35;--c-danger-bg-lighter: #553040;--c-danger-border-dark: #a25151;--c-danger-details-bg: #482936;--c-danger-title: #fc2d3b;--c-danger-text: #ea9ca0;--c-danger-text-accent: #fd3636;--c-danger-text-light: #d9777c;--c-danger-text-quote: #d56b6b;--c-details-bg: #323843;--c-badge-warning: var(--c-warning);--c-badge-warning-text: #3c2e05;--c-badge-danger: var(--c-danger);--c-badge-danger-text: #401416;--code-hl-bg-color: #363b46}html.dark .DocSearch{--docsearch-logo-color: var(--c-text);--docsearch-modal-shadow: inset 1px 1px 0 0 #2c2e40, 0 3px 8px 0 #000309;--docsearch-key-shadow: inset 0 -2px 0 0 #282d55, inset 0 0 1px 1px #51577d, 0 2px 2px 0 rgba(3, 4, 9, .3);--docsearch-key-gradient: linear-gradient(-225deg, #444950, #1c1e21);--docsearch-footer-shadow: inset 0 1px 0 0 rgba(73, 76, 106, .5), 0 -4px 8px 0 rgba(0, 0, 0, .2)}html,body{padding:0;margin:0;background-color:var(--c-bg);transition:background-color var(--t-color)}html.dark{color-scheme:dark}html{font-size:16px}body{font-family:var(--font-family);-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;font-size:1rem;color:var(--c-text)}a{font-weight:500;color:var(--c-text-accent);text-decoration:none;overflow-wrap:break-word}p a code{font-weight:400;color:var(--c-text-accent)}kbd{font-family:var(--font-family-code);color:var(--c-text);background:var(--c-bg-lighter);border:solid .15rem var(--c-border-dark);border-bottom:solid .25rem var(--c-border-dark);border-radius:.15rem;padding:0 
.15em}code{font-family:var(--font-family-code);color:var(--c-text-lighter);padding:.25rem .5rem;margin:0;font-size:.85em;background-color:var(--c-bg-light);border-radius:3px;overflow-wrap:break-word;transition:background-color var(--t-color)}blockquote{font-size:1rem;color:var(--c-text-quote);border-left:.2rem solid var(--c-border-dark);margin:1rem 0;padding:.25rem 0 .25rem 1rem;overflow-wrap:break-word}blockquote>p{margin:0}ul,ol{padding-left:1.2em}strong{font-weight:600}h1,h2,h3,h4,h5,h6{font-weight:600;line-height:1.25;overflow-wrap:break-word}h1:focus-visible,h2:focus-visible,h3:focus-visible,h4:focus-visible,h5:focus-visible,h6:focus-visible{outline:none}h1:hover .header-anchor,h2:hover .header-anchor,h3:hover .header-anchor,h4:hover .header-anchor,h5:hover .header-anchor,h6:hover .header-anchor{opacity:1}h1{font-size:2.2rem}h2{font-size:1.65rem;padding-bottom:.3rem;border-bottom:1px solid var(--c-border);transition:border-color var(--t-color)}h3{font-size:1.35rem}h4{font-size:1.15rem}h5{font-size:1.05rem}h6{font-size:1rem}a.header-anchor{font-size:.85em;float:left;margin-left:-.87em;padding-right:.23em;margin-top:.125em;opacity:0;-webkit-user-select:none;-moz-user-select:none;user-select:none}@media print{a.header-anchor{display:none}}a.header-anchor:hover{text-decoration:none}a.header-anchor:focus-visible{opacity:1}@media print{a[href^="http://"]:after,a[href^="https://"]:after{content:" (" attr(href) ") "}}p,ul,ol{line-height:1.7;overflow-wrap:break-word}hr{border:0;border-top:1px solid var(--c-border)}table{border-collapse:collapse;margin:1rem 0;display:block;overflow-x:auto;transition:border-color var(--t-color)}tr{border-top:1px solid var(--c-border-dark);transition:border-color var(--t-color)}tr:nth-child(2n){background-color:var(--c-bg-light);transition:background-color var(--t-color)}tr:nth-child(2n) code{background-color:var(--c-bg-dark)}th,td{padding:.6em 1em;border:1px solid var(--c-border-dark);transition:border-color 
var(--t-color)}.arrow{display:inline-block;width:0;height:0}.arrow.up{border-left:4px solid transparent;border-right:4px solid transparent;border-bottom:6px solid var(--c-bg-arrow)}.arrow.down{border-left:4px solid transparent;border-right:4px solid transparent;border-top:6px solid var(--c-bg-arrow)}.arrow.right{border-top:4px solid transparent;border-bottom:4px solid transparent;border-left:6px solid var(--c-bg-arrow)}.arrow.left{border-top:4px solid transparent;border-bottom:4px solid transparent;border-right:6px solid var(--c-bg-arrow)}.badge{display:inline-block;font-size:14px;font-weight:600;height:18px;line-height:18px;border-radius:3px;padding:0 6px;color:var(--c-bg);vertical-align:top;transition:color var(--t-color),background-color var(--t-color)}.badge.tip{background-color:var(--c-badge-tip)}.badge.warning{background-color:var(--c-badge-warning);color:var(--c-badge-warning-text)}.badge.danger{background-color:var(--c-badge-danger);color:var(--c-badge-danger-text)}.badge+.badge{margin-left:5px}code[class*=language-],pre[class*=language-]{color:#ccc;background:none;font-family:var(--font-family-code);font-size:1em;text-align:left;white-space:pre;word-spacing:normal;word-break:normal;word-wrap:normal;line-height:1.5;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-hyphens:none;hyphens:none}pre[class*=language-]{padding:1em;margin:.5em 
0;overflow:auto}:not(pre)>code[class*=language-],pre[class*=language-]{background:#2d2d2d}:not(pre)>code[class*=language-]{padding:.1em;border-radius:.3em;white-space:normal}.token.comment,.token.block-comment,.token.prolog,.token.doctype,.token.cdata{color:#999}.token.punctuation{color:#ccc}.token.tag,.token.attr-name,.token.namespace,.token.deleted{color:#ec5975}.token.function-name{color:#6196cc}.token.boolean,.token.number,.token.function{color:#f08d49}.token.property,.token.class-name,.token.constant,.token.symbol{color:#f8c555}.token.selector,.token.important,.token.atrule,.token.keyword,.token.builtin{color:#cc99cd}.token.string,.token.char,.token.attr-value,.token.regex,.token.variable{color:#7ec699}.token.operator,.token.entity,.token.url{color:#67cdcc}.token.important,.token.bold{font-weight:700}.token.italic{font-style:italic}.token.entity{cursor:help}.token.inserted{color:#3eaf7c}.theme-default-content pre,.theme-default-content pre[class*=language-]{line-height:1.375;padding:1.3rem 1.5rem;margin:.85rem 0;border-radius:6px;overflow:auto}.theme-default-content pre code,.theme-default-content pre[class*=language-] code{color:#fff;padding:0;background-color:transparent!important;border-radius:0;overflow-wrap:unset;-webkit-font-smoothing:auto;-moz-osx-font-smoothing:auto}.theme-default-content .line-number{font-family:var(--font-family-code)}div[class*=language-]{position:relative;background-color:var(--code-bg-color);border-radius:6px}div[class*=language-]:before{content:attr(data-ext);position:absolute;z-index:3;top:.8em;right:1em;font-size:.75rem;color:var(--code-ln-color)}div[class*=language-] pre,div[class*=language-] pre[class*=language-]{background:transparent!important;position:relative;z-index:1}div[class*=language-] .highlight-lines{-webkit-user-select:none;-moz-user-select:none;user-select:none;padding-top:1.3rem;position:absolute;top:0;left:0;width:100%;line-height:1.375}div[class*=language-] .highlight-lines 
.highlight-line{background-color:var(--code-hl-bg-color)}div[class*=language-]:not(.line-numbers-mode) .line-numbers{display:none}div[class*=language-].line-numbers-mode .highlight-lines .highlight-line{position:relative}div[class*=language-].line-numbers-mode .highlight-lines .highlight-line:before{content:" ";position:absolute;z-index:2;left:0;top:0;display:block;width:var(--code-ln-wrapper-width);height:100%}div[class*=language-].line-numbers-mode pre{margin-left:var(--code-ln-wrapper-width);padding-left:1rem;vertical-align:middle}div[class*=language-].line-numbers-mode .line-numbers{position:absolute;top:0;width:var(--code-ln-wrapper-width);text-align:center;color:var(--code-ln-color);padding-top:1.25rem;line-height:1.375;counter-reset:line-number}div[class*=language-].line-numbers-mode .line-numbers .line-number{position:relative;z-index:3;-webkit-user-select:none;-moz-user-select:none;user-select:none;height:1.375em}div[class*=language-].line-numbers-mode .line-numbers .line-number:before{counter-increment:line-number;content:counter(line-number);font-size:.85em}div[class*=language-].line-numbers-mode:after{content:"";position:absolute;top:0;left:0;width:var(--code-ln-wrapper-width);height:100%;border-radius:6px 0 0 6px;border-right:1px solid var(--code-hl-bg-color)}@media (max-width: 419px){.theme-default-content div[class*=language-]{margin:.85rem -1.5rem;border-radius:0}}.code-group__nav{margin-top:.85rem;margin-bottom:calc(-1.7rem - 6px);padding-bottom:calc(1.7rem - 6px);padding-left:10px;padding-top:10px;border-top-left-radius:6px;border-top-right-radius:6px;background-color:var(--code-bg-color)}.code-group__ul{margin:auto 0;padding-left:0;display:inline-flex;list-style:none}.code-group__nav-tab{border:0;padding:5px;cursor:pointer;background-color:transparent;font-size:.85em;line-height:1.4;color:#ffffffe6;font-weight:600}.code-group__nav-tab:focus{outline:none}.code-group__nav-tab:focus-visible{outline:1px solid 
rgba(255,255,255,.9)}.code-group__nav-tab-active{border-bottom:var(--c-brand) 1px solid}@media (max-width: 419px){.code-group__nav{margin-left:-1.5rem;margin-right:-1.5rem;border-radius:0}}.code-group-item{display:none}.code-group-item__active{display:block}.code-group-item>pre{background-color:orange}.custom-container{transition:color var(--t-color),border-color var(--t-color),background-color var(--t-color)}.custom-container .custom-container-title{font-weight:600}.custom-container .custom-container-title:not(:only-child){margin-bottom:-.4rem}.custom-container.tip,.custom-container.warning,.custom-container.danger{padding:.1rem 1.5rem;border-left-width:.5rem;border-left-style:solid;margin:1rem 0}.custom-container.tip{border-color:var(--c-tip);background-color:var(--c-tip-bg);color:var(--c-tip-text)}.custom-container.tip .custom-container-title{color:var(--c-tip-title)}.custom-container.tip a{color:var(--c-tip-text-accent)}.custom-container.tip code{background-color:var(--c-bg-dark)}.custom-container.warning{border-color:var(--c-warning);background-color:var(--c-warning-bg);color:var(--c-warning-text)}.custom-container.warning .custom-container-title{color:var(--c-warning-title)}.custom-container.warning a{color:var(--c-warning-text-accent)}.custom-container.warning blockquote{border-left-color:var(--c-warning-border-dark);color:var(--c-warning-text-quote)}.custom-container.warning code{color:var(--c-warning-text-light);background-color:var(--c-warning-bg-light)}.custom-container.warning details{background-color:var(--c-warning-details-bg)}.custom-container.warning details code{background-color:var(--c-warning-bg-lighter)}.custom-container.warning .external-link-icon{--external-link-icon-color: var(--c-warning-text-quote)}.custom-container.danger{border-color:var(--c-danger);background-color:var(--c-danger-bg);color:var(--c-danger-text)}.custom-container.danger .custom-container-title{color:var(--c-danger-title)}.custom-container.danger 
a{color:var(--c-danger-text-accent)}.custom-container.danger blockquote{border-left-color:var(--c-danger-border-dark);color:var(--c-danger-text-quote)}.custom-container.danger code{color:var(--c-danger-text-light);background-color:var(--c-danger-bg-light)}.custom-container.danger details{background-color:var(--c-danger-details-bg)}.custom-container.danger details code{background-color:var(--c-danger-bg-lighter)}.custom-container.danger .external-link-icon{--external-link-icon-color: var(--c-danger-text-quote)}.custom-container.details{display:block;position:relative;border-radius:2px;margin:1.6em 0;padding:1.6em;background-color:var(--c-details-bg)}.custom-container.details code{background-color:var(--c-bg-darker)}.custom-container.details h4{margin-top:0}.custom-container.details figure:last-child,.custom-container.details p:last-child{margin-bottom:0;padding-bottom:0}.custom-container.details summary{outline:none;cursor:pointer}.home{padding:var(--navbar-height) 2rem 0;max-width:var(--homepage-width);margin:0 auto;display:block}.home .hero{text-align:center}.home .hero img{max-width:100%;max-height:280px;display:block;margin:3rem auto 1.5rem}.home .hero h1{font-size:3rem}.home .hero h1,.home .hero .description,.home .hero .actions{margin:1.8rem auto}.home .hero .actions{display:flex;flex-wrap:wrap;gap:1rem;justify-content:center}.home .hero .description{max-width:35rem;font-size:1.6rem;line-height:1.3;color:var(--c-text-lightest)}.home .hero .action-button{display:inline-block;font-size:1.2rem;padding:.8rem 1.6rem;border-width:2px;border-style:solid;border-radius:4px;transition:background-color var(--t-color);box-sizing:border-box}.home .hero .action-button.primary{color:var(--c-bg);background-color:var(--c-brand);border-color:var(--c-brand)}.home .hero .action-button.primary:hover{background-color:var(--c-brand-light)}.home .hero .action-button.secondary{color:var(--c-brand);background-color:var(--c-bg);border-color:var(--c-brand)}.home .hero 
.action-button.secondary:hover{color:var(--c-bg);background-color:var(--c-brand-light)}.home .features{border-top:1px solid var(--c-border);transition:border-color var(--t-color);padding:1.2rem 0;margin-top:2.5rem;display:flex;flex-wrap:wrap;align-items:flex-start;align-content:stretch;justify-content:space-between}.home .feature{flex-grow:1;flex-basis:30%;max-width:30%}.home .feature h2{font-size:1.4rem;font-weight:500;border-bottom:none;padding-bottom:0;color:var(--c-text-light)}.home .feature p{color:var(--c-text-lighter)}.home .theme-default-content{padding:0;margin:0}.home .footer{padding:2.5rem;border-top:1px solid var(--c-border);text-align:center;color:var(--c-text-lighter);transition:border-color var(--t-color)}@media (max-width: 719px){.home .features{flex-direction:column}.home .feature{max-width:100%;padding:0 2.5rem}}@media (max-width: 419px){.home{padding-left:1.5rem;padding-right:1.5rem}.home .hero img{max-height:210px;margin:2rem auto 1.2rem}.home .hero h1{font-size:2rem}.home .hero h1,.home .hero .description,.home .hero .actions{margin:1.2rem auto}.home .hero .description{font-size:1.2rem}.home .hero .action-button{font-size:1rem;padding:.6rem 1.2rem}.home .feature h2{font-size:1.25rem}}.page{padding-top:var(--navbar-height);padding-left:var(--sidebar-width)}.navbar{position:fixed;z-index:20;top:0;left:0;right:0;height:var(--navbar-height);box-sizing:border-box;border-bottom:1px solid var(--c-border);background-color:var(--c-bg-navbar);transition:background-color var(--t-color),border-color var(--t-color)}.sidebar{font-size:16px;width:var(--sidebar-width);position:fixed;z-index:10;margin:0;top:var(--navbar-height);left:0;bottom:0;box-sizing:border-box;border-right:1px solid var(--c-border);overflow-y:auto;scrollbar-width:thin;scrollbar-color:var(--c-brand) var(--c-border);background-color:var(--c-bg-sidebar);transition:transform var(--t-transform),background-color var(--t-color),border-color 
var(--t-color)}.sidebar::-webkit-scrollbar{width:7px}.sidebar::-webkit-scrollbar-track{background-color:var(--c-border)}.sidebar::-webkit-scrollbar-thumb{background-color:var(--c-brand)}.sidebar-mask{position:fixed;z-index:9;top:0;left:0;width:100vw;height:100vh;display:none}.theme-container.sidebar-open .sidebar-mask{display:block}.theme-container.sidebar-open .navbar>.toggle-sidebar-button .icon span:nth-child(1){transform:rotate(45deg) translate3d(5.5px,5.5px,0)}.theme-container.sidebar-open .navbar>.toggle-sidebar-button .icon span:nth-child(2){transform:scale3d(0,1,1)}.theme-container.sidebar-open .navbar>.toggle-sidebar-button .icon span:nth-child(3){transform:rotate(-45deg) translate3d(6px,-6px,0)}.theme-container.sidebar-open .navbar>.toggle-sidebar-button .icon span:nth-child(1),.theme-container.sidebar-open .navbar>.toggle-sidebar-button .icon span:nth-child(3){transform-origin:center}.theme-container.no-navbar .theme-default-content h1,.theme-container.no-navbar .theme-default-content h2,.theme-container.no-navbar .theme-default-content h3,.theme-container.no-navbar .theme-default-content h4,.theme-container.no-navbar .theme-default-content h5,.theme-container.no-navbar .theme-default-content h6{margin-top:1.5rem;padding-top:0}.theme-container.no-navbar .page{padding-top:0}.theme-container.no-navbar .sidebar{top:0}.theme-container.no-sidebar .sidebar{display:none}@media (max-width: 719px){.theme-container.no-sidebar .sidebar{display:block}}.theme-container.no-sidebar .page{padding-left:0}.theme-default-content a:hover{text-decoration:underline}.theme-default-content img{max-width:100%}.theme-default-content h1,.theme-default-content h2,.theme-default-content h3,.theme-default-content h4,.theme-default-content h5,.theme-default-content h6{margin-top:calc(.5rem - var(--navbar-height));padding-top:calc(1rem + var(--navbar-height));margin-bottom:0}.theme-default-content h1:first-child,.theme-default-content h2:first-child,.theme-default-content 
h3:first-child,.theme-default-content h4:first-child,.theme-default-content h5:first-child,.theme-default-content h6:first-child{margin-bottom:1rem}.theme-default-content h1:first-child+p,.theme-default-content h1:first-child+pre,.theme-default-content h1:first-child+.custom-container,.theme-default-content h2:first-child+p,.theme-default-content h2:first-child+pre,.theme-default-content h2:first-child+.custom-container,.theme-default-content h3:first-child+p,.theme-default-content h3:first-child+pre,.theme-default-content h3:first-child+.custom-container,.theme-default-content h4:first-child+p,.theme-default-content h4:first-child+pre,.theme-default-content h4:first-child+.custom-container,.theme-default-content h5:first-child+p,.theme-default-content h5:first-child+pre,.theme-default-content h5:first-child+.custom-container,.theme-default-content h6:first-child+p,.theme-default-content h6:first-child+pre,.theme-default-content h6:first-child+.custom-container{margin-top:2rem}@media (max-width: 959px){.sidebar{font-size:15px;width:var(--sidebar-width-mobile)}.page{padding-left:var(--sidebar-width-mobile)}}@media (max-width: 719px){.sidebar{top:0;padding-top:var(--navbar-height);transform:translate(-100%)}.page{padding-left:0}.theme-container.sidebar-open .sidebar{transform:translate(0)}.theme-container.no-navbar .sidebar{padding-top:0}}@media (max-width: 419px){h1{font-size:1.9rem}}.navbar{--navbar-line-height: calc( var(--navbar-height) - 2 * var(--navbar-padding-v) );padding:var(--navbar-padding-v) var(--navbar-padding-h);line-height:var(--navbar-line-height)}.navbar .logo{height:var(--navbar-line-height);margin-right:var(--navbar-padding-v);vertical-align:top}.navbar .site-name{font-size:1.3rem;font-weight:600;color:var(--c-text);position:relative}.navbar 
.navbar-items-wrapper{display:flex;position:absolute;box-sizing:border-box;top:var(--navbar-padding-v);right:var(--navbar-padding-h);height:var(--navbar-line-height);padding-left:var(--navbar-padding-h);white-space:nowrap;font-size:.9rem}.navbar .navbar-items-wrapper .search-box{flex:0 0 auto;vertical-align:top}@media screen and (max-width: 719px){.navbar{padding-left:4rem}.navbar .site-name{display:block;width:calc(100vw - 11rem);overflow:hidden;white-space:nowrap;text-overflow:ellipsis}.navbar .can-hide{display:none}}.navbar-items{display:inline-block}@media print{.navbar-items{display:none}}.navbar-items a{display:inline-block;line-height:1.4rem;color:inherit}.navbar-items a:hover,.navbar-items a.router-link-active{color:var(--c-text)}.navbar-items .navbar-item{position:relative;display:inline-block;margin-left:1.5rem;line-height:var(--navbar-line-height)}.navbar-items .navbar-item:first-child{margin-left:0}.navbar-items .navbar-item>a:hover,.navbar-items .navbar-item>a.router-link-active{margin-bottom:-2px;border-bottom:2px solid var(--c-text-accent)}@media (max-width: 719px){.navbar-items .navbar-item{margin-left:0}.navbar-items .navbar-item>a:hover,.navbar-items .navbar-item>a.router-link-active{margin-bottom:0;border-bottom:none}.navbar-items a:hover,.navbar-items a.router-link-active{color:var(--c-text-accent)}}.toggle-sidebar-button{position:absolute;top:.6rem;left:1rem;display:none;padding:.6rem;cursor:pointer}.toggle-sidebar-button .icon{display:flex;flex-direction:column;justify-content:center;align-items:center;width:1.25rem;height:1.25rem;cursor:inherit}.toggle-sidebar-button .icon span{display:inline-block;width:100%;height:2px;border-radius:2px;background-color:var(--c-text);transition:transform var(--t-transform)}.toggle-sidebar-button .icon span:nth-child(2){margin:6px 0}@media screen and (max-width: 
719px){.toggle-sidebar-button{display:block}}.toggle-color-mode-button{display:flex;margin:auto;margin-left:1rem;border:0;background:none;color:var(--c-text);opacity:.8;cursor:pointer}@media print{.toggle-color-mode-button{display:none}}.toggle-color-mode-button:hover{opacity:1}.toggle-color-mode-button .icon{width:1.25rem;height:1.25rem}.DocSearch{transition:background-color var(--t-color)}.navbar-dropdown-wrapper{cursor:pointer}.navbar-dropdown-wrapper .navbar-dropdown-title,.navbar-dropdown-wrapper .navbar-dropdown-title-mobile{display:block;font-size:.9rem;font-family:inherit;cursor:inherit;padding:inherit;line-height:1.4rem;background:transparent;border:none;font-weight:500;color:var(--c-text)}.navbar-dropdown-wrapper .navbar-dropdown-title:hover,.navbar-dropdown-wrapper .navbar-dropdown-title-mobile:hover{border-color:transparent}.navbar-dropdown-wrapper .navbar-dropdown-title .arrow,.navbar-dropdown-wrapper .navbar-dropdown-title-mobile .arrow{vertical-align:middle;margin-top:-1px;margin-left:.4rem}.navbar-dropdown-wrapper .navbar-dropdown-title-mobile{display:none;font-weight:600;font-size:inherit}.navbar-dropdown-wrapper .navbar-dropdown-title-mobile:hover{color:var(--c-text-accent)}.navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item{color:inherit;line-height:1.7rem}.navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item .navbar-dropdown-subtitle{margin:.45rem 0 0;border-top:1px solid var(--c-border);padding:1rem 0 .45rem;font-size:.9rem}.navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item .navbar-dropdown-subtitle>span{padding:0 1.5rem 0 1.25rem}.navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item .navbar-dropdown-subtitle>a{font-weight:inherit}.navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item .navbar-dropdown-subtitle>a.router-link-active:after{display:none}.navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item 
.navbar-dropdown-subitem-wrapper{padding:0;list-style:none}.navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item .navbar-dropdown-subitem-wrapper .navbar-dropdown-subitem{font-size:.9em}.navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item a{display:block;line-height:1.7rem;position:relative;border-bottom:none;font-weight:400;margin-bottom:0;padding:0 1.5rem 0 1.25rem}.navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item a:hover,.navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item a.router-link-active{color:var(--c-text-accent)}.navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item a.router-link-active:after{content:"";width:0;height:0;border-left:5px solid var(--c-text-accent);border-top:3px solid transparent;border-bottom:3px solid transparent;position:absolute;top:calc(50% - 2px);left:9px}.navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item:first-child .navbar-dropdown-subtitle{margin-top:0;padding-top:0;border-top:0}.navbar-dropdown-wrapper.mobile.open .navbar-dropdown-title,.navbar-dropdown-wrapper.mobile.open .navbar-dropdown-title-mobile{margin-bottom:.5rem}.navbar-dropdown-wrapper.mobile .navbar-dropdown-title,.navbar-dropdown-wrapper.mobile .navbar-dropdown-title-mobile{display:none}.navbar-dropdown-wrapper.mobile .navbar-dropdown-title-mobile{display:block}.navbar-dropdown-wrapper.mobile .navbar-dropdown{transition:height .1s ease-out;overflow:hidden}.navbar-dropdown-wrapper.mobile .navbar-dropdown .navbar-dropdown-item .navbar-dropdown-subtitle{border-top:0;margin-top:0;padding-top:0;padding-bottom:0}.navbar-dropdown-wrapper.mobile .navbar-dropdown .navbar-dropdown-item .navbar-dropdown-subtitle,.navbar-dropdown-wrapper.mobile .navbar-dropdown .navbar-dropdown-item>a{font-size:15px;line-height:2rem}.navbar-dropdown-wrapper.mobile .navbar-dropdown .navbar-dropdown-item 
.navbar-dropdown-subitem{font-size:14px;padding-left:1rem}.navbar-dropdown-wrapper:not(.mobile){height:1.8rem}.navbar-dropdown-wrapper:not(.mobile):hover .navbar-dropdown,.navbar-dropdown-wrapper:not(.mobile).open .navbar-dropdown{display:block!important}.navbar-dropdown-wrapper:not(.mobile).open:blur{display:none}.navbar-dropdown-wrapper:not(.mobile) .navbar-dropdown{display:none;height:auto!important;box-sizing:border-box;max-height:calc(100vh - 2.7rem);overflow-y:auto;position:absolute;top:100%;right:0;background-color:var(--c-bg-navbar);padding:.6rem 0;border:1px solid var(--c-border);border-bottom-color:var(--c-border-dark);text-align:left;border-radius:.25rem;white-space:nowrap;margin:0}.page{padding-bottom:2rem;display:block}.page .theme-default-content{max-width:var(--content-width);margin:0 auto;padding:2rem 2.5rem;padding-top:0}@media (max-width: 959px){.page .theme-default-content{padding:2rem}}@media (max-width: 419px){.page .theme-default-content{padding:1.5rem}}.page-meta{max-width:var(--content-width);margin:0 auto;padding:1rem 2.5rem;overflow:auto}@media (max-width: 959px){.page-meta{padding:2rem}}@media (max-width: 419px){.page-meta{padding:1.5rem}}.page-meta .meta-item{cursor:default;margin-top:.8rem}.page-meta .meta-item .meta-item-label{font-weight:500;color:var(--c-text-lighter)}.page-meta .meta-item .meta-item-info{font-weight:400;color:var(--c-text-quote)}.page-meta .edit-link{display:inline-block;margin-right:.25rem}@media print{.page-meta .edit-link{display:none}}.page-meta .last-updated{float:right}@media (max-width: 719px){.page-meta .last-updated{font-size:.8em;float:none}.page-meta .contributors{font-size:.8em}}.page-nav{max-width:var(--content-width);margin:0 auto;padding:1rem 2.5rem 2rem;padding-bottom:0}@media (max-width: 959px){.page-nav{padding:2rem}}@media (max-width: 419px){.page-nav{padding:1.5rem}}.page-nav .inner{min-height:2rem;margin-top:0;border-top:1px solid var(--c-border);transition:border-color 
var(--t-color);padding-top:1rem;overflow:auto}.page-nav .prev a:before{content:"←"}.page-nav .next{float:right}.page-nav .next a:after{content:"→"}.sidebar ul{padding:0;margin:0;list-style-type:none}.sidebar a{display:inline-block}.sidebar .navbar-items{display:none;border-bottom:1px solid var(--c-border);transition:border-color var(--t-color);padding:.5rem 0 .75rem}.sidebar .navbar-items a{font-weight:600}.sidebar .navbar-items .navbar-item{display:block;line-height:1.25rem;font-size:1.1em;padding:.5rem 0 .5rem 1.5rem}.sidebar .sidebar-items{padding:1.5rem 0}@media (max-width: 719px){.sidebar .navbar-items{display:block}.sidebar .navbar-items .navbar-dropdown-wrapper .navbar-dropdown .navbar-dropdown-item a.router-link-active:after{top:calc(1rem - 2px)}.sidebar .sidebar-items{padding:1rem 0}}.sidebar-item{cursor:default;border-left:.25rem solid transparent;color:var(--c-text)}.sidebar-item:focus-visible{outline-width:1px;outline-offset:-1px}.sidebar-item.active:not(p.sidebar-heading){font-weight:600;color:var(--c-text-accent);border-left-color:var(--c-text-accent)}.sidebar-item.sidebar-heading{transition:color .15s ease;font-size:1.1em;font-weight:700;padding:.35rem 1.5rem .35rem 1.25rem;width:100%;box-sizing:border-box;margin:0}.sidebar-item.sidebar-heading+.sidebar-item-children{transition:height .1s ease-out;overflow:hidden;margin-bottom:.75rem}.sidebar-item.collapsible{cursor:pointer}.sidebar-item.collapsible .arrow{position:relative;top:-.12em;left:.5em}.sidebar-item:not(.sidebar-heading){font-size:1em;font-weight:400;display:inline-block;margin:0;padding:.35rem 1rem .35rem 2rem;line-height:1.4;width:100%;box-sizing:border-box}.sidebar-item:not(.sidebar-heading)+.sidebar-item-children{padding-left:1rem;font-size:.95em}.sidebar-item-children .sidebar-item-children .sidebar-item:not(.sidebar-heading){padding:.25rem 1rem .25rem 1.75rem}.sidebar-item-children .sidebar-item-children 
.sidebar-item:not(.sidebar-heading).active{font-weight:500;border-left-color:transparent}a.sidebar-heading+.sidebar-item-children .sidebar-item:not(.sidebar-heading).active{border-left-color:transparent}a.sidebar-item{cursor:pointer}a.sidebar-item:hover{color:var(--c-text-accent)}.table-of-contents .badge{vertical-align:middle}.dropdown-enter-from,.dropdown-leave-to{height:0!important}.fade-slide-y-enter-active{transition:all .2s ease}.fade-slide-y-leave-active{transition:all .2s cubic-bezier(1,.5,.8,1)}.fade-slide-y-enter-from,.fade-slide-y-leave-to{transform:translateY(10px);opacity:0}:root{scroll-behavior:smooth;--c-brand: #fc5207;--c-brand-light: #fc5207;--c-tip: #fc5207;--content-width: 1020px}html.dark{--c-brand: #fc5207;--c-brand-light: #fc5207;--c-tip: #fc5207}html.dark{--box-shadow: #0f0e0d;--card-shadow: rgba(0, 0, 0, .3);--black: #fff;--dark-grey: #999;--light-grey: #666;--white: #000;--grey3: #bbb;--grey12: #333;--grey14: #111}:root{--vp-bg: var(--c-bg, #fff);--vp-bgl: var(--c-bg-light, #f3f4f5);--vp-bglt: var(--c-bg-lighter, #eeeeee);--vp-c: var(--c-text, #2c3e50);--vp-cl: var(--c-text-light, #3a5169);--vp-clt: var(--c-text-lighter, #4e6e8e);--vp-brc: var(--c-border, #eaecef);--vp-brcd: var(--c-border-dark, #dfe2e5);--vp-tc: var(--c-brand, #3eaf7c);--vp-tcl: var(--c-brand-light, #4abf8a);--vp-ct: var(--t-color, .3s ease);--vp-tt: var(--t-transform, .3s ease);--box-shadow: #f0f1f2;--card-shadow: rgba(0, 0, 0, .15);--black: #000;--dark-grey: #666;--light-grey: #999;--white: #fff;--grey3: #333;--grey12: #bbb;--grey14: #eee}.footnote-item{margin-top:calc(0rem - var(--navbar-height, 3.6rem));padding-top:calc(var(--navbar-height, 3.6rem) + .5rem)}.footnote-item>p{margin-bottom:0}.footnote-ref{position:relative}.footnote-anchor{position:absolute;top:calc(-.5rem - var(--navbar-height, 3.6rem))}.line{display:flex;flex-direction:row;align-items:center}.container{margin-right:20px}.text{color:#999;margin-left:5px}/*! 
@docsearch/css 3.5.2 | MIT License | © Algolia, Inc. and contributors | https://docsearch.algolia.com */:root{--docsearch-primary-color:#5468ff;--docsearch-text-color:#1c1e21;--docsearch-spacing:12px;--docsearch-icon-stroke-width:1.4;--docsearch-highlight-color:var(--docsearch-primary-color);--docsearch-muted-color:#969faf;--docsearch-container-background:rgba(101,108,133,.8);--docsearch-logo-color:#5468ff;--docsearch-modal-width:560px;--docsearch-modal-height:600px;--docsearch-modal-background:#f5f6f7;--docsearch-modal-shadow:inset 1px 1px 0 0 hsla(0,0%,100%,.5),0 3px 8px 0 #555a64;--docsearch-searchbox-height:56px;--docsearch-searchbox-background:#ebedf0;--docsearch-searchbox-focus-background:#fff;--docsearch-searchbox-shadow:inset 0 0 0 2px var(--docsearch-primary-color);--docsearch-hit-height:56px;--docsearch-hit-color:#444950;--docsearch-hit-active-color:#fff;--docsearch-hit-background:#fff;--docsearch-hit-shadow:0 1px 3px 0 #d4d9e1;--docsearch-key-gradient:linear-gradient(-225deg,#d5dbe4,#f8f8f8);--docsearch-key-shadow:inset 0 -2px 0 0 #cdcde6,inset 0 0 1px 1px #fff,0 1px 2px 1px rgba(30,35,90,.4);--docsearch-footer-height:44px;--docsearch-footer-background:#fff;--docsearch-footer-shadow:0 -1px 0 0 #e0e3e8,0 -3px 6px 0 rgba(69,98,155,.12)}html[data-theme=dark]{--docsearch-text-color:#f5f6f7;--docsearch-container-background:rgba(9,10,17,.8);--docsearch-modal-background:#15172a;--docsearch-modal-shadow:inset 1px 1px 0 0 #2c2e40,0 3px 8px 0 #000309;--docsearch-searchbox-background:#090a11;--docsearch-searchbox-focus-background:#000;--docsearch-hit-color:#bec3c9;--docsearch-hit-shadow:none;--docsearch-hit-background:#090a11;--docsearch-key-gradient:linear-gradient(-26.5deg,#565872,#31355b);--docsearch-key-shadow:inset 0 -2px 0 0 #282d55,inset 0 0 1px 1px #51577d,0 2px 2px 0 rgba(3,4,9,.3);--docsearch-footer-background:#1e2136;--docsearch-footer-shadow:inset 0 1px 0 0 rgba(73,76,106,.5),0 -4px 8px 0 
rgba(0,0,0,.2);--docsearch-logo-color:#fff;--docsearch-muted-color:#7f8497}.DocSearch-Button{align-items:center;background:var(--docsearch-searchbox-background);border:0;border-radius:40px;color:var(--docsearch-muted-color);cursor:pointer;display:flex;font-weight:500;height:36px;justify-content:space-between;margin:0 0 0 16px;padding:0 8px;-webkit-user-select:none;-moz-user-select:none;user-select:none}.DocSearch-Button:active,.DocSearch-Button:focus,.DocSearch-Button:hover{background:var(--docsearch-searchbox-focus-background);box-shadow:var(--docsearch-searchbox-shadow);color:var(--docsearch-text-color);outline:none}.DocSearch-Button-Container{align-items:center;display:flex}.DocSearch-Search-Icon{stroke-width:1.6}.DocSearch-Button .DocSearch-Search-Icon{color:var(--docsearch-text-color)}.DocSearch-Button-Placeholder{font-size:1rem;padding:0 12px 0 6px}.DocSearch-Button-Keys{display:flex;min-width:calc(40px + .8em)}.DocSearch-Button-Key{align-items:center;background:var(--docsearch-key-gradient);border-radius:3px;box-shadow:var(--docsearch-key-shadow);color:var(--docsearch-muted-color);display:flex;height:18px;justify-content:center;margin-right:.4em;position:relative;padding:0 0 2px;border:0;top:-1px;width:20px}@media (max-width:768px){.DocSearch-Button-Keys,.DocSearch-Button-Placeholder{display:none}}.DocSearch--active{overflow:hidden!important}.DocSearch-Container,.DocSearch-Container *{box-sizing:border-box}.DocSearch-Container{background-color:var(--docsearch-container-background);height:100vh;left:0;position:fixed;top:0;width:100vw;z-index:200}.DocSearch-Container a{text-decoration:none}.DocSearch-Link{-webkit-appearance:none;-moz-appearance:none;appearance:none;background:none;border:0;color:var(--docsearch-highlight-color);cursor:pointer;font:inherit;margin:0;padding:0}.DocSearch-Modal{background:var(--docsearch-modal-background);border-radius:6px;box-shadow:var(--docsearch-modal-shadow);flex-direction:column;margin:60px auto 
auto;max-width:var(--docsearch-modal-width);position:relative}.DocSearch-SearchBar{display:flex;padding:var(--docsearch-spacing) var(--docsearch-spacing) 0}.DocSearch-Form{align-items:center;background:var(--docsearch-searchbox-focus-background);border-radius:4px;box-shadow:var(--docsearch-searchbox-shadow);display:flex;height:var(--docsearch-searchbox-height);margin:0;padding:0 var(--docsearch-spacing);position:relative;width:100%}.DocSearch-Input{-webkit-appearance:none;-moz-appearance:none;appearance:none;background:transparent;border:0;color:var(--docsearch-text-color);flex:1;font:inherit;font-size:1.2em;height:100%;outline:none;padding:0 0 0 8px;width:80%}.DocSearch-Input::-moz-placeholder{color:var(--docsearch-muted-color);opacity:1}.DocSearch-Input::placeholder{color:var(--docsearch-muted-color);opacity:1}.DocSearch-Input::-webkit-search-cancel-button,.DocSearch-Input::-webkit-search-decoration,.DocSearch-Input::-webkit-search-results-button,.DocSearch-Input::-webkit-search-results-decoration{display:none}.DocSearch-LoadingIndicator,.DocSearch-MagnifierLabel,.DocSearch-Reset{margin:0;padding:0}.DocSearch-MagnifierLabel,.DocSearch-Reset{align-items:center;color:var(--docsearch-highlight-color);display:flex;justify-content:center}.DocSearch-Container--Stalled .DocSearch-MagnifierLabel,.DocSearch-LoadingIndicator{display:none}.DocSearch-Container--Stalled .DocSearch-LoadingIndicator{align-items:center;color:var(--docsearch-highlight-color);display:flex;justify-content:center}@media screen and (prefers-reduced-motion:reduce){.DocSearch-Reset{animation:none;-webkit-appearance:none;-moz-appearance:none;appearance:none;background:none;border:0;border-radius:50%;color:var(--docsearch-icon-color);cursor:pointer;right:0;stroke-width:var(--docsearch-icon-stroke-width)}}.DocSearch-Reset{animation:fade-in .1s ease-in 
forwards;-webkit-appearance:none;-moz-appearance:none;appearance:none;background:none;border:0;border-radius:50%;color:var(--docsearch-icon-color);cursor:pointer;padding:2px;right:0;stroke-width:var(--docsearch-icon-stroke-width)}.DocSearch-Reset[hidden]{display:none}.DocSearch-Reset:hover{color:var(--docsearch-highlight-color)}.DocSearch-LoadingIndicator svg,.DocSearch-MagnifierLabel svg{height:24px;width:24px}.DocSearch-Cancel{display:none}.DocSearch-Dropdown{max-height:calc(var(--docsearch-modal-height) - var(--docsearch-searchbox-height) - var(--docsearch-spacing) - var(--docsearch-footer-height));min-height:var(--docsearch-spacing);overflow-y:auto;overflow-y:overlay;padding:0 var(--docsearch-spacing);scrollbar-color:var(--docsearch-muted-color) var(--docsearch-modal-background);scrollbar-width:thin}.DocSearch-Dropdown::-webkit-scrollbar{width:12px}.DocSearch-Dropdown::-webkit-scrollbar-track{background:transparent}.DocSearch-Dropdown::-webkit-scrollbar-thumb{background-color:var(--docsearch-muted-color);border:3px solid var(--docsearch-modal-background);border-radius:20px}.DocSearch-Dropdown ul{list-style:none;margin:0;padding:0}.DocSearch-Label{font-size:.75em;line-height:1.6em}.DocSearch-Help,.DocSearch-Label{color:var(--docsearch-muted-color)}.DocSearch-Help{font-size:.9em;margin:0;-webkit-user-select:none;-moz-user-select:none;user-select:none}.DocSearch-Title{font-size:1.2em}.DocSearch-Logo a{display:flex}.DocSearch-Logo svg{color:var(--docsearch-logo-color);margin-left:8px}.DocSearch-Hits:last-of-type{margin-bottom:24px}.DocSearch-Hits mark{background:none;color:var(--docsearch-highlight-color)}.DocSearch-HitsFooter{color:var(--docsearch-muted-color);display:flex;font-size:.85em;justify-content:center;margin-bottom:var(--docsearch-spacing);padding:var(--docsearch-spacing)}.DocSearch-HitsFooter a{border-bottom:1px solid;color:inherit}.DocSearch-Hit{border-radius:4px;display:flex;padding-bottom:4px;position:relative}@media screen and 
(prefers-reduced-motion:reduce){.DocSearch-Hit--deleting{transition:none}}.DocSearch-Hit--deleting{opacity:0;transition:all .25s linear}@media screen and (prefers-reduced-motion:reduce){.DocSearch-Hit--favoriting{transition:none}}.DocSearch-Hit--favoriting{transform:scale(0);transform-origin:top center;transition:all .25s linear;transition-delay:.25s}.DocSearch-Hit a{background:var(--docsearch-hit-background);border-radius:4px;box-shadow:var(--docsearch-hit-shadow);display:block;padding-left:var(--docsearch-spacing);width:100%}.DocSearch-Hit-source{background:var(--docsearch-modal-background);color:var(--docsearch-highlight-color);font-size:.85em;font-weight:600;line-height:32px;margin:0 -4px;padding:8px 4px 0;position:sticky;top:0;z-index:10}.DocSearch-Hit-Tree{color:var(--docsearch-muted-color);height:var(--docsearch-hit-height);opacity:.5;stroke-width:var(--docsearch-icon-stroke-width);width:24px}.DocSearch-Hit[aria-selected=true] a{background-color:var(--docsearch-highlight-color)}.DocSearch-Hit[aria-selected=true] mark{text-decoration:underline}.DocSearch-Hit-Container{align-items:center;color:var(--docsearch-hit-color);display:flex;flex-direction:row;height:var(--docsearch-hit-height);padding:0 var(--docsearch-spacing) 0 0}.DocSearch-Hit-icon{height:20px;width:20px}.DocSearch-Hit-action,.DocSearch-Hit-icon{color:var(--docsearch-muted-color);stroke-width:var(--docsearch-icon-stroke-width)}.DocSearch-Hit-action{align-items:center;display:flex;height:22px;width:22px}.DocSearch-Hit-action svg{display:block;height:18px;width:18px}.DocSearch-Hit-action+.DocSearch-Hit-action{margin-left:6px}.DocSearch-Hit-action-button{-webkit-appearance:none;-moz-appearance:none;appearance:none;background:none;border:0;border-radius:50%;color:inherit;cursor:pointer;padding:2px}svg.DocSearch-Hit-Select-Icon{display:none}.DocSearch-Hit[aria-selected=true] 
.DocSearch-Hit-Select-Icon{display:block}.DocSearch-Hit-action-button:focus,.DocSearch-Hit-action-button:hover{background:rgba(0,0,0,.2);transition:background-color .1s ease-in}@media screen and (prefers-reduced-motion:reduce){.DocSearch-Hit-action-button:focus,.DocSearch-Hit-action-button:hover{transition:none}}.DocSearch-Hit-action-button:focus path,.DocSearch-Hit-action-button:hover path{fill:#fff}.DocSearch-Hit-content-wrapper{display:flex;flex:1 1 auto;flex-direction:column;font-weight:500;justify-content:center;line-height:1.2em;margin:0 8px;overflow-x:hidden;position:relative;text-overflow:ellipsis;white-space:nowrap;width:80%}.DocSearch-Hit-title{font-size:.9em}.DocSearch-Hit-path{color:var(--docsearch-muted-color);font-size:.75em}.DocSearch-Hit[aria-selected=true] .DocSearch-Hit-action,.DocSearch-Hit[aria-selected=true] .DocSearch-Hit-icon,.DocSearch-Hit[aria-selected=true] .DocSearch-Hit-path,.DocSearch-Hit[aria-selected=true] .DocSearch-Hit-text,.DocSearch-Hit[aria-selected=true] .DocSearch-Hit-title,.DocSearch-Hit[aria-selected=true] .DocSearch-Hit-Tree,.DocSearch-Hit[aria-selected=true] mark{color:var(--docsearch-hit-active-color)!important}@media screen and (prefers-reduced-motion:reduce){.DocSearch-Hit-action-button:focus,.DocSearch-Hit-action-button:hover{background:rgba(0,0,0,.2);transition:none}}.DocSearch-ErrorScreen,.DocSearch-NoResults,.DocSearch-StartScreen{font-size:.9em;margin:0 auto;padding:36px 0;text-align:center;width:80%}.DocSearch-Screen-Icon{color:var(--docsearch-muted-color);padding-bottom:12px}.DocSearch-NoResults-Prefill-List{display:inline-block;padding-bottom:24px;text-align:left}.DocSearch-NoResults-Prefill-List ul{display:inline-block;padding:8px 0 0}.DocSearch-NoResults-Prefill-List li{list-style-position:inside;list-style-type:"» 
"}.DocSearch-Prefill{-webkit-appearance:none;-moz-appearance:none;appearance:none;background:none;border:0;border-radius:1em;color:var(--docsearch-highlight-color);cursor:pointer;display:inline-block;font-size:1em;font-weight:700;padding:0}.DocSearch-Prefill:focus,.DocSearch-Prefill:hover{outline:none;text-decoration:underline}.DocSearch-Footer{align-items:center;background:var(--docsearch-footer-background);border-radius:0 0 8px 8px;box-shadow:var(--docsearch-footer-shadow);display:flex;flex-direction:row-reverse;flex-shrink:0;height:var(--docsearch-footer-height);justify-content:space-between;padding:0 var(--docsearch-spacing);position:relative;-webkit-user-select:none;-moz-user-select:none;user-select:none;width:100%;z-index:300}.DocSearch-Commands{color:var(--docsearch-muted-color);display:flex;list-style:none;margin:0;padding:0}.DocSearch-Commands li{align-items:center;display:flex}.DocSearch-Commands li:not(:last-of-type){margin-right:.8em}.DocSearch-Commands-Key{align-items:center;background:var(--docsearch-key-gradient);border-radius:2px;box-shadow:var(--docsearch-key-shadow);display:flex;height:18px;justify-content:center;margin-right:.4em;padding:0 0 1px;color:var(--docsearch-muted-color);border:0;width:20px}@media (max-width:768px){:root{--docsearch-spacing:10px;--docsearch-footer-height:40px}.DocSearch-Dropdown{height:100%}.DocSearch-Container{height:100vh;height:-webkit-fill-available;height:calc(var(--docsearch-vh, 1vh)*100);position:absolute}.DocSearch-Footer{border-radius:0;bottom:0;position:absolute}.DocSearch-Hit-content-wrapper{display:flex;position:relative;width:80%}.DocSearch-Modal{border-radius:0;box-shadow:none;height:100vh;height:-webkit-fill-available;height:calc(var(--docsearch-vh, 1vh)*100);margin:0;max-width:100%;width:100%}.DocSearch-Dropdown{max-height:calc(var(--docsearch-vh, 1vh)*100 - var(--docsearch-searchbox-height) - var(--docsearch-spacing) - 
var(--docsearch-footer-height))}.DocSearch-Cancel{-webkit-appearance:none;-moz-appearance:none;appearance:none;background:none;border:0;color:var(--docsearch-highlight-color);cursor:pointer;display:inline-block;flex:none;font:inherit;font-size:1em;font-weight:500;margin-left:var(--docsearch-spacing);outline:none;overflow:hidden;padding:0;-webkit-user-select:none;-moz-user-select:none;user-select:none;white-space:nowrap}.DocSearch-Commands,.DocSearch-Hit-Tree{display:none}}@keyframes fade-in{0%{opacity:0}to{opacity:1}}@media (min-width: 751px){#docsearch-container{min-width:171.36px}}@media (max-width: 750px){.DocSearch-Container{position:fixed}#docsearch-container{min-width:52px}}@media print{#docsearch-container{display:none}} diff --git a/assets/style-e9220a04.js b/assets/style-e9220a04.js new file mode 100644 index 00000000000..c9305480987 --- /dev/null +++ b/assets/style-e9220a04.js @@ -0,0 +1 @@ +const t="";export{t as default}; diff --git a/assets/tde.html-babd189d.js b/assets/tde.html-babd189d.js new file mode 100644 index 00000000000..c4c7f88ce04 --- /dev/null +++ b/assets/tde.html-babd189d.js @@ -0,0 +1,43 @@ +import{_ as r,r as t,o as u,c as _,d as e,a,w as l,b as n,e as c}from"./app-3d1677bf.js";const m="/PolarDB-for-PostgreSQL/assets/tde_1-edc40ca7.png",h="/PolarDB-for-PostgreSQL/assets/tde_2-1982f2b3.png",K="/PolarDB-for-PostgreSQL/assets/tde_3-fa26b33c.png",k={},g=a("h1",{id:"tde-透明数据加密",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#tde-透明数据加密","aria-hidden":"true"},"#"),n(" TDE 透明数据加密")],-1),E={class:"table-of-contents"},v=a("h2",{id:"背景",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#背景","aria-hidden":"true"},"#"),n(" 背景")],-1),b=a("p",null,[a("strong",null,"TDE(Transparent Data Encryption)"),n(",即 "),a("strong",null,"透明数据加密"),n("。TDE 通过在数据库层执行透明的数据加密,阻止可能的攻击者绕过数据库直接从存储层读取敏感信息。经过数据库身份验证的用户可以 "),a("strong",null,"透明"),n("(不需要更改应用代码或配置)地访问数据,而尝试读取表空间文件中敏感数据的 OS 
用户以及尝试读取磁盘或备份信息的不法之徒将不允许访问明文数据。在国内,为了保证互联网信息安全,国家要求相关服务开发商需要满足一些数据安全标准,例如:")],-1),f={href:"http://www.npc.gov.cn/npc/c30834/201910/6f7be7dd5ae5459a8de8baf36296bc74.shtml",target:"_blank",rel:"noopener noreferrer"},D={href:"http://gxxxzx.gxzf.gov.cn/szjcss/wlyxxaq/P020200429546812083554.pdf",target:"_blank",rel:"noopener noreferrer"},M=a("li",null,"...",-1),C=c(`

    Internationally, several industries are also subject to data-security regulations, for example:

    • Payment Card Industry Data Security Standard (PCI DSS)
    • Health Insurance Portability and Accountability Act (HIPAA)
    • General Data Protection Regulation (GDPR)
    • California Consumer Privacy Act (CCPA)
    • Sarbanes-Oxley Act (SOX)

    To meet the need to protect user data, PolarDB implements TDE.

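    Terminology

    • KEK: Key Encryption Key.
    • MDEK: generated randomly by pg_strong_random and kept in memory; it is the root key from which TDEK and WDEK are derived.
    • TDEK: Table Data Encryption Key, derived from MDEK with the HKDF algorithm and kept in memory; it is the key that actually encrypts table data.
    • WDEK: WAL Data Encryption Key, derived from MDEK with the HKDF algorithm and kept in memory; it is the key that actually encrypts WAL data.
    • HMACK: the passphrase is hashed with SHA-512 to produce both KEK and HMACK.
    • KEK_HMAC: produced by running ENCMDEK and HMACK through the HMAC algorithm; used as verification data when the keys are restored.
    • ENCMDEK: produced by encrypting MDEK with KEK.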

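    Usage

    For users:

    • Passing --cluster-passphrase-command 'xxx' -e aes-256 to initdb creates a cluster with TDE enabled. The cluster-passphrase-command parameter is the command used to obtain the key encryption key, and -e selects the data encryption algorithm; AES-128, AES-256, and SM4 are currently supported.

      initdb --cluster-passphrase-command 'echo \\"abc123\\"' -e aes-256

    • While the database is running, only a superuser can run the following command to see which encryption algorithm is in use:

      show polar_data_encryption_cipher;

    • While the database is running, you can create the polar_tde_utils extension to change the TDE encryption key or query TDE status. It currently supports:

      1. Changing the encryption key. The function argument is the command used to obtain the encryption key (the command must only work from within the network of the host machine). After the function runs, the contents of the kmgr file change, and the change takes effect after the next restart.

        select polar_tde_update_kmgr_file('echo \\"abc123456\\"');

      2. Getting the info of the current kmgr file.

        select * from polar_tde_kmgr_info_view();

      3. Checking the integrity of the kmgr file.

        select polar_tde_check_kmgr_file();

    • Running pg_filedump to parse encrypted pages, which is useful for page analysis in extreme cases.

      pg_filedump -e aes-128 -C 'echo \\"abc123\\"' -K global/kmgr base/14543/2608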

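    How it works

    Key management module

    Key structure

    A two-layer key structure is used: a key encryption key and table data encryption keys. The table data encryption key is the key that actually encrypts database data; the key encryption key in turn encrypts the table data encryption key. The two layers are described below:

    • Key encryption key (KEK) and its verification key HMACK: the command in the polar_cluster_passphrase_command parameter is run and its output is hashed with SHA-512, giving 64 bytes of data. The first 32 bytes are the top-level key KEK and the last 32 bytes are HMACK.
    • Table data encryption key (TDEK) and WAL encryption key (WDEK): keys generated by a cryptographically secure random number generator; they are the actual keys used to encrypt data and WAL. The ciphertexts of the two keys are run through the HMAC algorithm with HMACK as the key to produce rdek_hmac and wdek_hmac, which are used to verify the KEK and are stored on shared storage.

    KEK and HMACK are obtained externally each time, for example from a KMS; for testing, a plain echo of the passphrase is enough. ENCMDEK and KEK_HMAC must be kept on shared storage so that both the RW and RO nodes can read the file at the next startup and recover the actual encryption keys. The data structure is as follows: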

    typedef struct KmgrFileData
    {
        /* version for kmgr file */
        uint32      kmgr_version_no;

        /* Are data pages encrypted? Zero if encryption is disabled */
        uint32      data_encryption_cipher;

        /*
         * Wrapped Key information for data encryption.
         */
        WrappedEncKeyWithHmac tde_rdek;
        WrappedEncKeyWithHmac tde_wdek;

        /* CRC of all above ... MUST BE LAST! */
        pg_crc32c   crc;
    } KmgrFileData;

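    This file is currently created at initdb time, which ensures that a Standby obtains it through pg_basebackup.

    While the instance is running, TDE control information is kept in process memory, in the following structures: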

    static keydata_t keyEncKey[TDE_KEK_SIZE];
    static keydata_t relEncKey[TDE_MAX_DEK_SIZE];
    static keydata_t walEncKey[TDE_MAX_DEK_SIZE];
    char *polar_cluster_passphrase_command = NULL;
    extern int data_encryption_cipher;

    Key encryption

    Keys must be generated when the database is initialized. The process is illustrated below:

    image.png

    ',21),A=a("li",null,[n("运行 "),a("code",null,"polar_cluster_passphrase_command"),n(" 得到 64 字节的 KEK + HMACK,其中 KEK 长度为 32 字节,HMACK 长度为 32 字节。")],-1),y={href:"https://www.openssl.org/",target:"_blank",rel:"noopener noreferrer"},H=a("li",null,"使用 MDEK 调用 OpenSSL 的 HKDF 算法生成 TDEK。",-1),x=a("li",null,"使用 MDEK 调用 OpenSSL 的 HKDF 算法生成 WDEK。",-1),w=a("li",null,"使用 KEK 加密 MDEK 生成 ENCMDEK。",-1),S=a("li",null,"ENCMDEK 和 HMACK 经过 HMAC 算法生成 KEK_HMAC 用于还原密钥时的校验信息。",-1),P=a("li",null,[n("将 ENCMDEK 和 KEK_HMAC 补充其他 "),a("code",null,"KmgrFileData"),n(" 结构信息写入 "),a("code",null,"global/kmgr"),n(" 文件。")],-1),I=c('

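    Key decryption

    After a crash or restart, the database must recover the keys from the limited ciphertext information available. The process is as follows:

    image.png

    1. Read the global/kmgr file to obtain ENCMDEK and KEK_HMAC.
    2. Run polar_cluster_passphrase_command to obtain the 64 bytes of KEK + HMACK.
    3. Run ENCMDEK and HMACK through the HMAC algorithm to produce KEK_HMAC', and compare KEK_HMAC with KEK_HMAC'. If they match, continue to the next step; otherwise report an error and return.
    4. Decrypt ENCMDEK with KEK to recover MDEK.
    5. Derive TDEK from MDEK with OpenSSL's HKDF algorithm; because the info parameter is fixed, the same TDEK is produced.
    6. Derive WDEK from MDEK with OpenSSL's HKDF algorithm; because the info parameter is fixed, the same WDEK is produced.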
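
    The recovery steps above can be sketched in Python using only the standard library. This is an illustrative sketch, not PolarDB's implementation: the hash used for HMAC and HKDF (SHA-256 here), the all-zero HKDF salt, and the info labels b"tdek" and b"wdek" are assumptions, and the KEK-based unwrapping of ENCMDEK (step 4) is omitted because it needs a cipher library outside the standard library.

```python
import hashlib
import hmac
from typing import Tuple


def split_passphrase(passphrase: bytes) -> Tuple[bytes, bytes]:
    """Hash the passphrase with SHA-512: first 32 bytes are KEK, last 32 are HMACK."""
    digest = hashlib.sha512(passphrase).digest()
    return digest[:32], digest[32:]


def verify_wrapped_key(enc_mdek: bytes, hmack: bytes, stored_hmac: bytes) -> bool:
    """Recompute KEK_HMAC' over ENCMDEK and compare it with the stored KEK_HMAC."""
    # Assumption: HMAC-SHA256; the source text does not name the hash function.
    expected = hmac.new(hmack, enc_mdek, hashlib.sha256).digest()
    return hmac.compare_digest(expected, stored_hmac)


def hkdf_sha256(key_material: bytes, info: bytes, length: int) -> bytes:
    """HKDF extract-then-expand (RFC 5869) with an all-zero salt."""
    prk = hmac.new(b"\x00" * 32, key_material, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# Example flow; the all-zero MDEK stands in for the value unwrapped with KEK.
kek, hmack = split_passphrase(b"abc123")
mdek = bytes(32)
tdek = hkdf_sha256(mdek, b"tdek", 32)  # hypothetical info label
wdek = hkdf_sha256(mdek, b"wdek", 32)  # hypothetical info label
```

    Because HKDF is deterministic for a given key and info value, deriving TDEK and WDEK again after a restart yields the same keys, which is why only ENCMDEK and KEK_HMAC need to be persisted.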

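    Key rotation

    Key rotation can be understood as first recovering the keys with the old KEK and then generating a new kmgr file with the new KEK. The process is shown below:

    image.png

    1. Read the global/kmgr file to obtain ENCMDEK and KEK_HMAC.
    2. Run polar_cluster_passphrase_command to obtain the 64 bytes of KEK + HMACK.
    3. Run ENCMDEK and HMACK through the HMAC algorithm to produce KEK_HMAC', and compare KEK_HMAC with KEK_HMAC'. If they match, continue to the next step; otherwise report an error and return.
    4. Decrypt ENCMDEK with KEK to recover MDEK.
    5. Run polar_cluster_passphrase_command to obtain the new 64 bytes of new_KEK + new_HMACK.
    6. Encrypt MDEK with new_KEK to produce new_ENCMDEK.
    7. Run new_ENCMDEK and new_HMACK through the HMAC algorithm to produce new_KEK_HMAC, which is used for verification when the keys are restored.
    8. Write new_ENCMDEK and new_KEK_HMAC, together with the other KmgrFileData fields, to the global/kmgr file.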

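    Encryption module

    We aim to encrypt all user data at page granularity, using the AES-128/256 encryption algorithm (AES-256 by default in the product). The pair (page LSN, page number) is used as the IV for each data page; the IV is the initialization vector that ensures identical content encrypts to different results.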
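
    As a rough illustration of how (page LSN, page number) can form a per-page IV, the sketch below packs the two values into a 16-byte AES block. The byte layout (big-endian 8-byte LSN, 4-byte page number, 4 bytes of zero padding) and the name page_iv are assumptions made for this example; the source does not describe the actual layout PolarDB uses.

```python
import struct

AES_BLOCK_SIZE = 16  # AES always uses a 128-bit block, hence a 16-byte IV


def page_iv(page_lsn: int, page_number: int) -> bytes:
    """Build a 16-byte IV from the page LSN and page number (hypothetical layout)."""
    # 8-byte LSN followed by a 4-byte page number, zero-padded to 16 bytes.
    return struct.pack(">QI", page_lsn, page_number).ljust(AES_BLOCK_SIZE, b"\x00")


iv_a = page_iv(0x0000000101A2B3C4, 7)
iv_b = page_iv(0x0000000101A2B3C4, 8)  # same LSN, different page: different IV
```

    Two pages with identical contents therefore get different IVs as long as their LSN or page number differs, and a page rewritten at a later LSN is encrypted under a fresh IV.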

    The header of each page has the following structure:

    typedef struct PageHeaderData
    {
        /* XXX LSN is member of *any* block, not only page-organized ones */
        PageXLogRecPtr pd_lsn;      /* LSN: next byte after last byte of xlog
                                     * record for last change to this page */
        uint16      pd_checksum;    /* checksum */
        uint16      pd_flags;       /* flag bits, see below */
        LocationIndex pd_lower;     /* offset to start of free space */
        LocationIndex pd_upper;     /* offset to end of free space */
        LocationIndex pd_special;   /* offset to start of special space */
        uint16      pd_pagesize_version;
        TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
        ItemIdData  pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
    } PageHeaderData;

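    In the structure above:

    • pd_lsn must not be encrypted, because it is part of the IV that is needed for decryption.
    • pd_flags gains an "encrypted" flag bit, 0x8000, and is itself not encrypted. This keeps plaintext pages readable and is a precondition for enabling TDE incrementally on an instance.
    • pd_checksum is not encrypted, so the page checksum can be verified on the ciphertext.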
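
    Since pd_lsn, pd_checksum, and pd_flags stay in plaintext, a reader can tell whether a page is encrypted without any key. The sketch below parses those three fields from the first 12 bytes of a page. The constant name PD_ENCRYPTED and the little-endian byte order are assumptions for illustration; only the flag value 0x8000 and the field order come from the text above.

```python
import struct

PD_ENCRYPTED = 0x8000  # flag value from the text; the constant name is hypothetical


def read_plain_header_fields(page: bytes):
    """Return (lsn, checksum, flags) from the unencrypted start of a page header."""
    # pd_lsn is stored as two uint32 halves, followed by uint16 pd_checksum
    # and uint16 pd_flags. Little-endian is assumed here.
    xlogid, xrecoff, checksum, flags = struct.unpack_from("<IIHH", page, 0)
    return (xlogid << 32) | xrecoff, checksum, flags


def is_encrypted(page: bytes) -> bool:
    """Check the encryption flag bit in pd_flags."""
    return bool(read_plain_header_fields(page)[2] & PD_ENCRYPTED)


# A synthetic 8 KB page whose header marks it as encrypted.
sample_page = struct.pack("<IIHH", 1, 0x10, 0xBEEF, 0x8004) + bytes(8192 - 12)
```

    Checking the flag first is what lets encrypted and plaintext pages coexist: a page without the bit set is simply read as-is.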

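    Encrypted files

    Currently, files that contain user data are encrypted, such as the files in the following subdirectories of the data directory:

    • base/
    • global/
    • pg_tblspc/
    • pg_replslot/
    • pg_stat/
    • pg_stat_tmp/
    • ...

    When encryption happens

    Data that is organized in pages is currently encrypted page by page. A checksum must always be computed before a page is flushed to disk; even when the checksum-related parameters are turned off, the checksum functions PageSetChecksumCopy and PageSetChecksumInplace are still called. Therefore, encrypting the page just before the checksum is computed is enough to guarantee that user data is encrypted on storage.

    Decryption module

    A page on storage always goes through checksum verification before it is read into memory; even when the related parameters are turned off, the verification function PageIsVerified is still called. Therefore, decrypting right after the checksum is computed is enough to guarantee that the data in memory has been decrypted.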

    `,21);function L(i,N){const p=t("Badge"),d=t("ArticleInfo"),s=t("router-link"),o=t("ExternalLinkIcon");return u(),_("div",null,[g,e(p,{type:"tip",text:"V11 / v1.1.1-",vertical:"top"}),e(d,{frontmatter:i.$frontmatter},null,8,["frontmatter"]),a("nav",E,[a("ul",null,[a("li",null,[e(s,{to:"#背景"},{default:l(()=>[n("背景")]),_:1})]),a("li",null,[e(s,{to:"#术语"},{default:l(()=>[n("术语")]),_:1})]),a("li",null,[e(s,{to:"#使用"},{default:l(()=>[n("使用")]),_:1})]),a("li",null,[e(s,{to:"#原理"},{default:l(()=>[n("原理")]),_:1}),a("ul",null,[a("li",null,[e(s,{to:"#密钥管理模块"},{default:l(()=>[n("密钥管理模块")]),_:1})]),a("li",null,[e(s,{to:"#加密模块"},{default:l(()=>[n("加密模块")]),_:1})]),a("li",null,[e(s,{to:"#解密模块"},{default:l(()=>[n("解密模块")]),_:1})])])])])]),v,b,a("ul",null,[a("li",null,[a("a",f,[n("《国家密码法》"),e(o)]),n("(2020 年 1 月 1 日施行)")]),a("li",null,[a("a",D,[n("《网络安全等级保护基本要求》"),e(o)]),n("(GB/T 22239-2019)")]),M]),C,a("ol",null,[A,a("li",null,[n("调用 "),a("a",y,[n("OpenSSL"),e(o)]),n(" 中的随机数生成算法生成 MDEK。")]),H,x,w,S,P]),I])}const q=r(k,[["render",L],["__file","tde.html.vue"]]);export{q as default}; diff --git a/assets/tde.html-ef77c890.js b/assets/tde.html-ef77c890.js new file mode 100644 index 00000000000..6800eecf513 --- /dev/null +++ b/assets/tde.html-ef77c890.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-39aa8be0","path":"/zh/features/v11/security/tde.html","title":"TDE 
透明数据加密","lang":"zh-CN","frontmatter":{"author":"恒亦","date":"2022/09/27","minute":20},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"术语","slug":"术语","link":"#术语","children":[]},{"level":2,"title":"使用","slug":"使用","link":"#使用","children":[]},{"level":2,"title":"原理","slug":"原理","link":"#原理","children":[{"level":3,"title":"密钥管理模块","slug":"密钥管理模块","link":"#密钥管理模块","children":[]},{"level":3,"title":"加密模块","slug":"加密模块","link":"#加密模块","children":[]},{"level":3,"title":"解密模块","slug":"解密模块","link":"#解密模块","children":[]}]}],"git":{"updatedTime":1672148725000},"filePathRelative":"zh/features/v11/security/tde.md"}');export{e as data}; diff --git a/assets/tde_1-edc40ca7.png b/assets/tde_1-edc40ca7.png new file mode 100644 index 00000000000..4dd02976db6 Binary files /dev/null and b/assets/tde_1-edc40ca7.png differ diff --git a/assets/tde_2-1982f2b3.png b/assets/tde_2-1982f2b3.png new file mode 100644 index 00000000000..e7801917741 Binary files /dev/null and b/assets/tde_2-1982f2b3.png differ diff --git a/assets/tde_3-fa26b33c.png b/assets/tde_3-fa26b33c.png new file mode 100644 index 00000000000..4b35359c671 Binary files /dev/null and b/assets/tde_3-fa26b33c.png differ diff --git a/assets/tpcc-test.html-0f31266e.js b/assets/tpcc-test.html-0f31266e.js new file mode 100644 index 00000000000..4357190abbb --- /dev/null +++ b/assets/tpcc-test.html-0f31266e.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-52b161a6","path":"/zh/operation/tpcc-test.html","title":"TPC-C 测试","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2023/04/11","minute":15},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"测试步骤","slug":"测试步骤","link":"#测试步骤","children":[{"level":3,"title":"部署 PolarDB-PG","slug":"部署-polardb-pg","link":"#部署-polardb-pg","children":[]},{"level":3,"title":"安装测试工具 BenchmarkSQL","slug":"安装测试工具-benchmarksql","link":"#安装测试工具-benchmarksql","children":[]},{"level":3,"title":"TPC-C 
配置","slug":"tpc-c-配置","link":"#tpc-c-配置","children":[]},{"level":3,"title":"导入数据","slug":"导入数据","link":"#导入数据","children":[]},{"level":3,"title":"预热数据","slug":"预热数据","link":"#预热数据","children":[]},{"level":3,"title":"正式测试","slug":"正式测试","link":"#正式测试","children":[]},{"level":3,"title":"查看结果","slug":"查看结果","link":"#查看结果","children":[]}]}],"git":{"updatedTime":1681281377000},"filePathRelative":"zh/operation/tpcc-test.md"}');export{l as data}; diff --git a/assets/tpcc-test.html-52f0c227.js b/assets/tpcc-test.html-52f0c227.js new file mode 100644 index 00000000000..82301ab9786 --- /dev/null +++ b/assets/tpcc-test.html-52f0c227.js @@ -0,0 +1 @@ +const l=JSON.parse('{"key":"v-3a0d4712","path":"/operation/tpcc-test.html","title":"TPC-C 测试","lang":"en-US","frontmatter":{"author":"棠羽","date":"2023/04/11","minute":15},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"测试步骤","slug":"测试步骤","link":"#测试步骤","children":[{"level":3,"title":"部署 PolarDB-PG","slug":"部署-polardb-pg","link":"#部署-polardb-pg","children":[]},{"level":3,"title":"安装测试工具 BenchmarkSQL","slug":"安装测试工具-benchmarksql","link":"#安装测试工具-benchmarksql","children":[]},{"level":3,"title":"TPC-C 配置","slug":"tpc-c-配置","link":"#tpc-c-配置","children":[]},{"level":3,"title":"导入数据","slug":"导入数据","link":"#导入数据","children":[]},{"level":3,"title":"预热数据","slug":"预热数据","link":"#预热数据","children":[]},{"level":3,"title":"正式测试","slug":"正式测试","link":"#正式测试","children":[]},{"level":3,"title":"查看结果","slug":"查看结果","link":"#查看结果","children":[]}]}],"git":{"updatedTime":1681281377000},"filePathRelative":"operation/tpcc-test.md"}');export{l as data}; diff --git a/assets/tpcc-test.html-88535e70.js b/assets/tpcc-test.html-88535e70.js new file mode 100644 index 00000000000..660364ca580 --- /dev/null +++ b/assets/tpcc-test.html-88535e70.js @@ -0,0 +1,27 @@ +import{_ as u,r as p,o as d,c as k,d as n,a,w as e,b as s,e as l}from"./app-3d1677bf.js";const 
h={},m=a("h1",{id:"tpc-c-测试",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#tpc-c-测试","aria-hidden":"true"},"#"),s(" TPC-C 测试")],-1),b=a("p",null,"本文将引导您对 PolarDB for PostgreSQL 进行 TPC-C 测试。",-1),_={class:"table-of-contents"},g=a("h2",{id:"背景",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#背景","aria-hidden":"true"},"#"),s(" 背景")],-1),f={href:"https://www.tpc.org/tpcc/",target:"_blank",rel:"noopener noreferrer"},x=a("h2",{id:"测试步骤",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#测试步骤","aria-hidden":"true"},"#"),s(" 测试步骤")],-1),P=a("h3",{id:"部署-polardb-pg",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#部署-polardb-pg","aria-hidden":"true"},"#"),s(" 部署 PolarDB-PG")],-1),v=a("p",null,"参考如下教程部署 PolarDB for PostgreSQL:",-1),C=a("h3",{id:"安装测试工具-benchmarksql",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#安装测试工具-benchmarksql","aria-hidden":"true"},"#"),s(" 安装测试工具 BenchmarkSQL")],-1),T={href:"https://github.com/pgsql-io/benchmarksql",target:"_blank",rel:"noopener noreferrer"},B=a("code",null,"mvn",-1),L=l(`
    $ git clone https://github.com/pgsql-io/benchmarksql.git
    $ cd benchmarksql
    $ mvn

    The built tool is located in the following directory:

    $ cd target/run

    TPC-C configuration

    The build directory contains sample configurations for different database products:

    $ ls | grep sample
    sample.firebird.properties
    sample.mariadb.properties
    sample.oracle.properties
    sample.postgresql.properties
    sample.transact-sql.properties
    `,6),q=a("code",null,"sample.postgresql.properties",-1),E={href:"https://github.com/pgsql-io/benchmarksql/blob/master/docs/PROPERTIES.md",target:"_blank",rel:"noopener noreferrer"},S=l(`

    The configuration items fall into the following categories:

    • JDBC driver and connection information: you need to set the connection string, user name, password, and so on for the running PostgreSQL database yourself
    • Test scale parameters
    • Test duration parameters
    • Throughput parameters
    • Transaction type parameters

    Load data

    Run the runDatabaseBuild.sh script with the configuration file as its argument to generate and load the test data:

    ./runDatabaseBuild.sh sample.postgresql.properties

    Warm up data

    A warm-up run is usually performed before the formal test:

    ./runBenchmark.sh sample.postgresql.properties

    Formal test

    After the warm-up finishes, run the same command again for the formal test:

    ./runBenchmark.sh sample.postgresql.properties

    View results

                                              _____ latency (seconds) _____
      TransType              count |   mix % |    mean       max     90th% |    rbk%          errors
    +--------------+---------------+---------+---------+---------+---------+---------+---------------+
    | NEW_ORDER    |           635 |  44.593 |   0.006 |   0.012 |   0.008 |   1.102 |             0 |
    | PAYMENT      |           628 |  44.101 |   0.001 |   0.006 |   0.002 |   0.000 |             0 |
    | ORDER_STATUS |            58 |   4.073 |   0.093 |   0.168 |   0.132 |   0.000 |             0 |
    | STOCK_LEVEL  |            52 |   3.652 |   0.035 |   0.044 |   0.041 |   0.000 |             0 |
    | DELIVERY     |            51 |   3.581 |   0.000 |   0.001 |   0.001 |   0.000 |             0 |
    | DELIVERY_BG  |            51 |   0.000 |   0.018 |   0.023 |   0.020 |   0.000 |             0 |
    +--------------+---------------+---------+---------+---------+---------+---------+---------------+

    Overall NOPM:          635 (98.76% of the theoretical maximum)
    Overall TPM:         1,424

    The results are also saved in CSV form; the output log shows the directory where they are stored.
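
    The summary figures in the report can be cross-checked from the per-transaction counts. The sketch below recomputes the mix percentages, TPM, and NOPM from the sample output; it assumes a one-minute measurement window (inferred from the fact that the reported TPM equals the sum of the counts) and that DELIVERY_BG is excluded from the total, as its 0.000 mix value suggests. The function name summarize is made up for this example.

```python
# Counts copied from the sample report; DELIVERY_BG is left out of the total.
counts = {
    "NEW_ORDER": 635,
    "PAYMENT": 628,
    "ORDER_STATUS": 58,
    "STOCK_LEVEL": 52,
    "DELIVERY": 51,
}


def summarize(counts: dict, minutes: float) -> dict:
    """Compute mix percentages, transactions/minute, and new orders/minute."""
    total = sum(counts.values())
    mix = {name: round(100.0 * n / total, 3) for name, n in counts.items()}
    return {
        "tpm": total / minutes,
        "nopm": counts["NEW_ORDER"] / minutes,
        "mix": mix,
    }


# Assumed run length of one minute.
summary = summarize(counts, minutes=1)
print(summary["tpm"], summary["nopm"], summary["mix"]["NEW_ORDER"])
```

    NOPM counts only NEW_ORDER transactions, which is why it is lower than TPM even though both are measured over the same window.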

    `,14);function D(c,R){const i=p("ArticleInfo"),o=p("router-link"),t=p("ExternalLinkIcon"),r=p("RouterLink");return d(),k("div",null,[m,n(i,{frontmatter:c.$frontmatter},null,8,["frontmatter"]),b,a("nav",_,[a("ul",null,[a("li",null,[n(o,{to:"#背景"},{default:e(()=>[s("背景")]),_:1})]),a("li",null,[n(o,{to:"#测试步骤"},{default:e(()=>[s("测试步骤")]),_:1}),a("ul",null,[a("li",null,[n(o,{to:"#部署-polardb-pg"},{default:e(()=>[s("部署 PolarDB-PG")]),_:1})]),a("li",null,[n(o,{to:"#安装测试工具-benchmarksql"},{default:e(()=>[s("安装测试工具 BenchmarkSQL")]),_:1})]),a("li",null,[n(o,{to:"#tpc-c-配置"},{default:e(()=>[s("TPC-C 配置")]),_:1})]),a("li",null,[n(o,{to:"#导入数据"},{default:e(()=>[s("导入数据")]),_:1})]),a("li",null,[n(o,{to:"#预热数据"},{default:e(()=>[s("预热数据")]),_:1})]),a("li",null,[n(o,{to:"#正式测试"},{default:e(()=>[s("正式测试")]),_:1})]),a("li",null,[n(o,{to:"#查看结果"},{default:e(()=>[s("查看结果")]),_:1})])])])])]),g,a("p",null,[s("TPC 是一系列事务处理和数据库基准测试的规范。其中 "),a("a",f,[s("TPC-C"),n(t)]),s(" (Transaction Processing Performance Council) 是针对 OLTP 的基准测试模型。TPC-C 测试模型给基准测试提供了一种统一的测试标准,可以大体观察出数据库服务稳定性、性能以及系统性能等一系列问题。对数据库展开 TPC-C 基准性能测试,一方面可以衡量数据库的性能,另一方面可以衡量采用不同硬件软件系统的性价比,是被业内广泛应用并关注的一种测试模型。")]),x,P,v,a("ul",null,[a("li",null,[n(r,{to:"/zh/deploying/quick-start.html"},{default:e(()=>[s("快速部署")]),_:1})]),a("li",null,[n(r,{to:"/zh/deploying/deploy.html"},{default:e(()=>[s("进阶部署")]),_:1})])]),C,a("p",null,[a("a",T,[s("BenchmarkSQL"),n(t)]),s(" 依赖 Java 运行环境与 Maven 包管理工具,需要预先安装。拉取 BenchmarkSQL 工具源码并进入目录后,通过 "),B,s(" 编译工程:")]),L,a("p",null,[s("其中,"),q,s(" 包含 PostgreSQL 系列数据库的模板参数,可以基于这个模板来修改并自定义配置。参考 BenchmarkSQL 工具的 "),a("a",E,[s("文档"),n(t)]),s(" 可以查看关于配置项的详细描述。")]),S])}const O=u(h,[["render",D],["__file","tpcc-test.html.vue"]]);export{O as default}; diff --git a/assets/tpcc-test.html-b60e72ae.js b/assets/tpcc-test.html-b60e72ae.js new file mode 100644 index 00000000000..b9ad1390b5b --- /dev/null +++ b/assets/tpcc-test.html-b60e72ae.js @@ -0,0 +1,27 @@ +import{_ as u,r as p,o as d,c as k,d as n,a,w as e,b as s,e as 
l}from"./app-3d1677bf.js";const h={},m=a("h1",{id:"tpc-c-测试",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#tpc-c-测试","aria-hidden":"true"},"#"),s(" TPC-C 测试")],-1),b=a("p",null,"本文将引导您对 PolarDB for PostgreSQL 进行 TPC-C 测试。",-1),_={class:"table-of-contents"},g=a("h2",{id:"背景",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#背景","aria-hidden":"true"},"#"),s(" 背景")],-1),f={href:"https://www.tpc.org/tpcc/",target:"_blank",rel:"noopener noreferrer"},x=a("h2",{id:"测试步骤",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#测试步骤","aria-hidden":"true"},"#"),s(" 测试步骤")],-1),P=a("h3",{id:"部署-polardb-pg",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#部署-polardb-pg","aria-hidden":"true"},"#"),s(" 部署 PolarDB-PG")],-1),v=a("p",null,"参考如下教程部署 PolarDB for PostgreSQL:",-1),C=a("h3",{id:"安装测试工具-benchmarksql",tabindex:"-1"},[a("a",{class:"header-anchor",href:"#安装测试工具-benchmarksql","aria-hidden":"true"},"#"),s(" 安装测试工具 BenchmarkSQL")],-1),T={href:"https://github.com/pgsql-io/benchmarksql",target:"_blank",rel:"noopener noreferrer"},B=a("code",null,"mvn",-1),L=l(`
    $ git clone https://github.com/pgsql-io/benchmarksql.git
    $ cd benchmarksql
    $ mvn

    The built tool is located in the following directory:

    $ cd target/run

    TPC-C configuration

    The build directory contains sample configurations for different database products:

    $ ls | grep sample
    sample.firebird.properties
    sample.mariadb.properties
    sample.oracle.properties
    sample.postgresql.properties
    sample.transact-sql.properties
    `,6),q=a("code",null,"sample.postgresql.properties",-1),E={href:"https://github.com/pgsql-io/benchmarksql/blob/master/docs/PROPERTIES.md",target:"_blank",rel:"noopener noreferrer"},S=l(`

    The configuration items fall into the following categories:

    • JDBC driver and connection information: you need to set the connection string, user name, password, and so on for the running PostgreSQL database yourself
    • Test scale parameters
    • Test duration parameters
    • Throughput parameters
    • Transaction type parameters

    Load data

    Run the runDatabaseBuild.sh script with the configuration file as its argument to generate and load the test data:

    ./runDatabaseBuild.sh sample.postgresql.properties

    Warm up data

    A warm-up run is usually performed before the formal test:

    ./runBenchmark.sh sample.postgresql.properties

    Formal test

    After the warm-up finishes, run the same command again for the formal test:

    ./runBenchmark.sh sample.postgresql.properties

    View results

                                              _____ latency (seconds) _____
      TransType              count |   mix % |    mean       max     90th% |    rbk%          errors
    +--------------+---------------+---------+---------+---------+---------+---------+---------------+
    | NEW_ORDER    |           635 |  44.593 |   0.006 |   0.012 |   0.008 |   1.102 |             0 |
    | PAYMENT      |           628 |  44.101 |   0.001 |   0.006 |   0.002 |   0.000 |             0 |
    | ORDER_STATUS |            58 |   4.073 |   0.093 |   0.168 |   0.132 |   0.000 |             0 |
    | STOCK_LEVEL  |            52 |   3.652 |   0.035 |   0.044 |   0.041 |   0.000 |             0 |
    | DELIVERY     |            51 |   3.581 |   0.000 |   0.001 |   0.001 |   0.000 |             0 |
    | DELIVERY_BG  |            51 |   0.000 |   0.018 |   0.023 |   0.020 |   0.000 |             0 |
    +--------------+---------------+---------+---------+---------+---------+---------+---------------+

    Overall NOPM:          635 (98.76% of the theoretical maximum)
    Overall TPM:         1,424

    The results are also saved in CSV form; the output log shows the directory where they are stored.

    `,14);function D(c,R){const i=p("ArticleInfo"),o=p("router-link"),t=p("ExternalLinkIcon"),r=p("RouterLink");return d(),k("div",null,[m,n(i,{frontmatter:c.$frontmatter},null,8,["frontmatter"]),b,a("nav",_,[a("ul",null,[a("li",null,[n(o,{to:"#背景"},{default:e(()=>[s("背景")]),_:1})]),a("li",null,[n(o,{to:"#测试步骤"},{default:e(()=>[s("测试步骤")]),_:1}),a("ul",null,[a("li",null,[n(o,{to:"#部署-polardb-pg"},{default:e(()=>[s("部署 PolarDB-PG")]),_:1})]),a("li",null,[n(o,{to:"#安装测试工具-benchmarksql"},{default:e(()=>[s("安装测试工具 BenchmarkSQL")]),_:1})]),a("li",null,[n(o,{to:"#tpc-c-配置"},{default:e(()=>[s("TPC-C 配置")]),_:1})]),a("li",null,[n(o,{to:"#导入数据"},{default:e(()=>[s("导入数据")]),_:1})]),a("li",null,[n(o,{to:"#预热数据"},{default:e(()=>[s("预热数据")]),_:1})]),a("li",null,[n(o,{to:"#正式测试"},{default:e(()=>[s("正式测试")]),_:1})]),a("li",null,[n(o,{to:"#查看结果"},{default:e(()=>[s("查看结果")]),_:1})])])])])]),g,a("p",null,[s("TPC 是一系列事务处理和数据库基准测试的规范。其中 "),a("a",f,[s("TPC-C"),n(t)]),s(" (Transaction Processing Performance Council) 是针对 OLTP 的基准测试模型。TPC-C 测试模型给基准测试提供了一种统一的测试标准,可以大体观察出数据库服务稳定性、性能以及系统性能等一系列问题。对数据库展开 TPC-C 基准性能测试,一方面可以衡量数据库的性能,另一方面可以衡量采用不同硬件软件系统的性价比,是被业内广泛应用并关注的一种测试模型。")]),x,P,v,a("ul",null,[a("li",null,[n(r,{to:"/deploying/quick-start.html"},{default:e(()=>[s("快速部署")]),_:1})]),a("li",null,[n(r,{to:"/deploying/deploy.html"},{default:e(()=>[s("进阶部署")]),_:1})])]),C,a("p",null,[a("a",T,[s("BenchmarkSQL"),n(t)]),s(" 依赖 Java 运行环境与 Maven 包管理工具,需要预先安装。拉取 BenchmarkSQL 工具源码并进入目录后,通过 "),B,s(" 编译工程:")]),L,a("p",null,[s("其中,"),q,s(" 包含 PostgreSQL 系列数据库的模板参数,可以基于这个模板来修改并自定义配置。参考 BenchmarkSQL 工具的 "),a("a",E,[s("文档"),n(t)]),s(" 可以查看关于配置项的详细描述。")]),S])}const O=u(h,[["render",D],["__file","tpcc-test.html.vue"]]);export{O as default}; diff --git a/assets/tpch-test.html-5f912467.js b/assets/tpch-test.html-5f912467.js new file mode 100644 index 00000000000..bab4344482e --- /dev/null +++ b/assets/tpch-test.html-5f912467.js @@ -0,0 +1 @@ +const 
e=JSON.parse('{"key":"v-3b28df6b","path":"/zh/operation/tpch-test.html","title":"TPC-H 测试","lang":"zh-CN","frontmatter":{"author":"棠羽","date":"2023/04/12","minute":20},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"测试准备","slug":"测试准备","link":"#测试准备","children":[{"level":3,"title":"部署 PolarDB-PG","slug":"部署-polardb-pg","link":"#部署-polardb-pg","children":[]},{"level":3,"title":"生成 TPC-H 测试数据集","slug":"生成-tpc-h-测试数据集","link":"#生成-tpc-h-测试数据集","children":[]}]},{"level":2,"title":"执行 PostgreSQL 单机并行执行","slug":"执行-postgresql-单机并行执行","link":"#执行-postgresql-单机并行执行","children":[]},{"level":2,"title":"执行 ePQ 单机并行执行","slug":"执行-epq-单机并行执行","link":"#执行-epq-单机并行执行","children":[]},{"level":2,"title":"执行 ePQ 跨机并行执行","slug":"执行-epq-跨机并行执行","link":"#执行-epq-跨机并行执行","children":[]}],"git":{"updatedTime":1703744114000},"filePathRelative":"zh/operation/tpch-test.md"}');export{e as data}; diff --git a/assets/tpch-test.html-78343832.js b/assets/tpch-test.html-78343832.js new file mode 100644 index 00000000000..3e283c50205 --- /dev/null +++ b/assets/tpch-test.html-78343832.js @@ -0,0 +1 @@ +const e=JSON.parse('{"key":"v-691e4b88","path":"/operation/tpch-test.html","title":"TPC-H 测试","lang":"en-US","frontmatter":{"author":"棠羽","date":"2023/04/12","minute":20},"headers":[{"level":2,"title":"背景","slug":"背景","link":"#背景","children":[]},{"level":2,"title":"测试准备","slug":"测试准备","link":"#测试准备","children":[{"level":3,"title":"部署 PolarDB-PG","slug":"部署-polardb-pg","link":"#部署-polardb-pg","children":[]},{"level":3,"title":"生成 TPC-H 测试数据集","slug":"生成-tpc-h-测试数据集","link":"#生成-tpc-h-测试数据集","children":[]}]},{"level":2,"title":"执行 PostgreSQL 单机并行执行","slug":"执行-postgresql-单机并行执行","link":"#执行-postgresql-单机并行执行","children":[]},{"level":2,"title":"执行 ePQ 单机并行执行","slug":"执行-epq-单机并行执行","link":"#执行-epq-单机并行执行","children":[]},{"level":2,"title":"执行 ePQ 
跨机并行执行","slug":"执行-epq-跨机并行执行","link":"#执行-epq-跨机并行执行","children":[]}],"git":{"updatedTime":1703744114000},"filePathRelative":"operation/tpch-test.md"}');export{e as data}; diff --git a/assets/tpch-test.html-83b2e511.js b/assets/tpch-test.html-83b2e511.js new file mode 100644 index 00000000000..897df0135d1 --- /dev/null +++ b/assets/tpch-test.html-83b2e511.js @@ -0,0 +1,202 @@ +import{_ as u,r as e,o as i,c as d,d as a,a as s,w as p,b as n,e as t}from"./app-3d1677bf.js";const m={},b=s("h1",{id:"tpc-h-测试",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#tpc-h-测试","aria-hidden":"true"},"#"),n(" TPC-H 测试")],-1),w=s("p",null,"本文将引导您对 PolarDB for PostgreSQL 进行 TPC-H 测试。",-1),y={class:"table-of-contents"},h=s("h2",{id:"背景",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#背景","aria-hidden":"true"},"#"),n(" 背景")],-1),g={href:"https://www.tpc.org/tpch/default5.asp",target:"_blank",rel:"noopener noreferrer"},_=t(`

    Test Preparation

    Deploying PolarDB-PG

    Use Docker to quickly bring up a PolarDB for PostgreSQL cluster backed by local storage:

    docker pull polardb/polardb_pg_local_instance
    +docker run -it \\
    +    --cap-add=SYS_PTRACE \\
    +    --privileged=true \\
    +    --name polardb_pg_htap \\
    +    --shm-size=512m \\
    +    polardb/polardb_pg_local_instance \\
    +    bash
    +
    `,4),P=s("h3",{id:"生成-tpc-h-测试数据集",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#生成-tpc-h-测试数据集","aria-hidden":"true"},"#"),n(" 生成 TPC-H 测试数据集")],-1),f={href:"https://github.com/ApsaraDB/tpch-dbgen",target:"_blank",rel:"noopener noreferrer"},T=t(`
    $ git clone https://github.com/ApsaraDB/tpch-dbgen.git
    +$ cd tpch-dbgen
    +$ ./build.sh --help
    +
    +  1) Use default configuration to build
    +  ./build.sh
    +  2) Use limited configuration to build
    +  ./build.sh --user=postgres --db=postgres --host=localhost --port=5432 --scale=1
    +  3) Run the test case
    +  ./build.sh --run
    +  4) Run the target test case
    +  ./build.sh --run=3. run the 3rd case.
    +  5) Run the target test case with option
    +  ./build.sh --run --option="set polar_enable_px = on;"
    +  6) Clean the test data. This step will drop the database or tables, remove csv
    +  and tbl files
    +  ./build.sh --clean
    +  7) Quick build TPC-H with 100MB scale of data
    +  ./build.sh --scale=0.1
    +

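    By passing different options, you can create TPC-H data sets of different sizes. The options of the build.sh script mean the following:

    • --user: database user name
    • --db: database name
    • --host: database host address
    • --port: database service port
    • --run: run all TPC-H queries, or one specific TPC-H query
    • --option: additional GUC parameters to set
    • --scale: size of the generated TPC-H data set, in GB

    The script has no option for the database password. Authenticate by setting the PGPASSWORD environment variable to the database user's password: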

    export PGPASSWORD=<your password>
    +

    Generate and import 100 MB of TPC-H data:

    ./build.sh --scale=0.1
    +

    Generate and import 1 GB of TPC-H data:

    ./build.sh
    +

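    Running PostgreSQL Single-Node Parallel Execution

    Using TPC-H Q18 as an example, run a PostgreSQL single-node parallel query and observe the query speed.

    From the tpch-dbgen/ directory, connect to the database with psql: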

    cd tpch-dbgen
    +psql
    +
    -- Enable timing
    +\\timing on
    +
    +-- Set the single-node degree of parallelism
    +SET max_parallel_workers_per_gather = 2;
    +
    +-- Show the execution plan for Q18
    +\\i finals/18.explain.sql
    +                                                                         QUERY PLAN
    +------------------------------------------------------------------------------------------------------------------------------------------------------------
    + Sort  (cost=3450834.75..3450835.42 rows=268 width=81)
    +   Sort Key: orders.o_totalprice DESC, orders.o_orderdate
    +   ->  GroupAggregate  (cost=3450817.91..3450823.94 rows=268 width=81)
    +         Group Key: customer.c_custkey, orders.o_orderkey
    +         ->  Sort  (cost=3450817.91..3450818.58 rows=268 width=67)
    +               Sort Key: customer.c_custkey, orders.o_orderkey
    +               ->  Hash Join  (cost=1501454.20..3450807.10 rows=268 width=67)
    +                     Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
    +                     ->  Seq Scan on lineitem  (cost=0.00..1724402.52 rows=59986052 width=22)
    +                     ->  Hash  (cost=1501453.37..1501453.37 rows=67 width=53)
    +                           ->  Nested Loop  (cost=1500465.85..1501453.37 rows=67 width=53)
    +                                 ->  Nested Loop  (cost=1500465.43..1501084.65 rows=67 width=34)
    +                                       ->  Finalize GroupAggregate  (cost=1500464.99..1500517.66 rows=67 width=4)
    +                                             Group Key: lineitem_1.l_orderkey
    +                                             Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
    +                                             ->  Gather Merge  (cost=1500464.99..1500511.66 rows=400 width=36)
    +                                                   Workers Planned: 2
    +                                                   ->  Sort  (cost=1499464.97..1499465.47 rows=200 width=36)
    +                                                         Sort Key: lineitem_1.l_orderkey
    +                                                         ->  Partial HashAggregate  (cost=1499454.82..1499457.32 rows=200 width=36)
    +                                                               Group Key: lineitem_1.l_orderkey
    +                                                               ->  Parallel Seq Scan on lineitem lineitem_1  (cost=0.00..1374483.88 rows=24994188 width=22)
    +                                       ->  Index Scan using orders_pkey on orders  (cost=0.43..8.45 rows=1 width=30)
    +                                             Index Cond: (o_orderkey = lineitem_1.l_orderkey)
    +                                 ->  Index Scan using customer_pkey on customer  (cost=0.43..5.50 rows=1 width=23)
    +                                       Index Cond: (c_custkey = orders.o_custkey)
    +(26 rows)
    +
    +Time: 3.965 ms
    +
    +-- Run Q18
    +\\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 80150.449 ms (01:20.150)
    +
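    The timing lines printed above can be pulled out of captured psql output so that runs are easier to compare. The following is a minimal sketch, assuming timing is enabled in the session and that the output keeps the "Time: <ms> ms" shape shown above; extract_ms is a hypothetical helper name, not part of tpch-dbgen.

```shell
#!/bin/sh
# extract_ms (hypothetical helper): read psql output on stdin and print
# the millisecond value of every line shaped like "Time: <ms> ms ...".
extract_ms() {
    awk '$1 == "Time:" && $3 == "ms" { print $2 }'
}
```

    Feed it a saved session log, for example: extract_ms < q18.log (q18.log is a placeholder file name).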

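    Running ePQ Single-Node Parallel Execution

    PolarDB for PostgreSQL provides elastic cross-node parallel query (ePQ), which is well suited to analytical queries. The steps below walk you through running TPC-H queries in parallel with ePQ on a single host.

    From the tpch-dbgen/ directory, connect to the database with psql:

    cd tpch-dbgen
    +psql
    +

    First, set the maximum ePQ query parallelism on each of the eight tables created by TPC-H: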

    ALTER TABLE nation SET (px_workers = 100);
    +ALTER TABLE region SET (px_workers = 100);
    +ALTER TABLE supplier SET (px_workers = 100);
    +ALTER TABLE part SET (px_workers = 100);
    +ALTER TABLE partsupp SET (px_workers = 100);
    +ALTER TABLE customer SET (px_workers = 100);
    +ALTER TABLE orders SET (px_workers = 100);
    +ALTER TABLE lineitem SET (px_workers = 100);
    +
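    The eight statements above differ only in the table name, so they can also be generated in a loop. This is a sketch under assumptions: gen_px_workers_sql is a hypothetical helper, and it only prints SQL; piping the output into psql relies on the same connection settings used elsewhere in this guide.

```shell
#!/bin/sh
# gen_px_workers_sql <workers> (hypothetical helper): print one
# ALTER TABLE statement per TPC-H table with the given px_workers value.
gen_px_workers_sql() {
    workers="$1"
    for t in nation region supplier part partsupp customer orders lineitem; do
        echo "ALTER TABLE $t SET (px_workers = $workers);"
    done
}

# Print the statements with the value used in this guide.
gen_px_workers_sql 100
```

    To apply the statements rather than print them, the output can be piped into psql, for example: gen_px_workers_sql 100 | psql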

    Using Q18 as an example, run the query:

    -- Enable timing
    +\\timing on
    +
    +-- Turn on the ePQ feature
    +SET polar_enable_px = ON;
    +-- Set the ePQ degree of parallelism per node to 1
    +SET polar_px_dop_per_node = 1;
    +
    +-- Show the execution plan for Q18
    +\\i finals/18.explain.sql
    +                                                                          QUERY PLAN
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------
    + PX Coordinator 2:1  (slice1; segments: 2)  (cost=0.00..257526.21 rows=59986052 width=47)
    +   Merge Key: orders.o_totalprice, orders.o_orderdate
    +   ->  GroupAggregate  (cost=0.00..243457.68 rows=29993026 width=47)
    +         Group Key: orders.o_totalprice, orders.o_orderdate, customer.c_name, customer.c_custkey, orders.o_orderkey
    +         ->  Sort  (cost=0.00..241257.18 rows=29993026 width=47)
    +               Sort Key: orders.o_totalprice DESC, orders.o_orderdate, customer.c_name, customer.c_custkey, orders.o_orderkey
    +               ->  Hash Join  (cost=0.00..42729.99 rows=29993026 width=47)
    +                     Hash Cond: (orders.o_orderkey = lineitem_1.l_orderkey)
    +                     ->  PX Hash 2:2  (slice2; segments: 2)  (cost=0.00..15959.71 rows=7500000 width=39)
    +                           Hash Key: orders.o_orderkey
    +                           ->  Hash Join  (cost=0.00..15044.19 rows=7500000 width=39)
    +                                 Hash Cond: (orders.o_custkey = customer.c_custkey)
    +                                 ->  PX Hash 2:2  (slice3; segments: 2)  (cost=0.00..11561.51 rows=7500000 width=20)
    +                                       Hash Key: orders.o_custkey
    +                                       ->  Hash Semi Join  (cost=0.00..11092.01 rows=7500000 width=20)
    +                                             Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
    +                                             ->  Partial Seq Scan on orders  (cost=0.00..1132.25 rows=7500000 width=20)
    +                                             ->  Hash  (cost=7760.84..7760.84 rows=400 width=4)
    +                                                   ->  PX Broadcast 2:2  (slice4; segments: 2)  (cost=0.00..7760.84 rows=400 width=4)
    +                                                         ->  Result  (cost=0.00..7760.80 rows=200 width=4)
    +                                                               Filter: ((sum(lineitem.l_quantity)) > '314'::numeric)
    +                                                               ->  Finalize HashAggregate  (cost=0.00..7760.78 rows=500 width=12)
    +                                                                     Group Key: lineitem.l_orderkey
    +                                                                     ->  PX Hash 2:2  (slice5; segments: 2)  (cost=0.00..7760.72 rows=500 width=12)
    +                                                                           Hash Key: lineitem.l_orderkey
    +                                                                           ->  Partial HashAggregate  (cost=0.00..7760.70 rows=500 width=12)
    +                                                                                 Group Key: lineitem.l_orderkey
    +                                                                                 ->  Partial Seq Scan on lineitem  (cost=0.00..3350.82 rows=29993026 width=12)
    +                                 ->  Hash  (cost=597.51..597.51 rows=749979 width=23)
    +                                       ->  PX Hash 2:2  (slice6; segments: 2)  (cost=0.00..597.51 rows=749979 width=23)
    +                                             Hash Key: customer.c_custkey
    +                                             ->  Partial Seq Scan on customer  (cost=0.00..511.44 rows=749979 width=23)
    +                     ->  Hash  (cost=5146.80..5146.80 rows=29993026 width=12)
    +                           ->  PX Hash 2:2  (slice7; segments: 2)  (cost=0.00..5146.80 rows=29993026 width=12)
    +                                 Hash Key: lineitem_1.l_orderkey
    +                                 ->  Partial Seq Scan on lineitem lineitem_1  (cost=0.00..3350.82 rows=29993026 width=12)
    + Optimizer: PolarDB PX Optimizer
    +(37 rows)
    +
    +Time: 216.672 ms
    +
    +-- Run Q18
    +\\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 59113.965 ms (00:59.114)
    +

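    This is already shorter than PostgreSQL single-node parallel execution (about 59 s versus about 80 s). Raising the per-node ePQ degree of parallelism improves query performance much further: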

    SET polar_px_dop_per_node = 2;
    +\\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 42400.500 ms (00:42.401)
    +
    +SET polar_px_dop_per_node = 4;
    +\\i finals/18.sql
    +
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 19892.603 ms (00:19.893)
    +
    +SET polar_px_dop_per_node = 8;
    +\\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 10944.402 ms (00:10.944)
    +
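    To put the timings above side by side, the speedup of each ePQ run over the PostgreSQL parallel run can be computed as baseline time divided by ePQ time. The sketch below uses only the numbers reported in this section; the speedup function name is hypothetical, and the figures will differ on other hardware.

```shell
#!/bin/sh
# Baseline: PostgreSQL single-node parallel execution of Q18, in ms,
# as reported earlier in this section.
BASELINE_MS=80150.449

# speedup <elapsed_ms> (hypothetical helper): print baseline / elapsed
# rounded to two decimals.
speedup() {
    awk -v base="$BASELINE_MS" -v t="$1" 'BEGIN { printf "%.2f", base / t; print "" }'
}

# ePQ timings from this section, written as dop:milliseconds.
for pair in 1:59113.965 2:42400.500 4:19892.603 8:10944.402; do
    dop=$(echo "$pair" | cut -d: -f1)
    ms=$(echo "$pair" | cut -d: -f2)
    echo "polar_px_dop_per_node=$dop speedup=$(speedup "$ms")x"
done
```

    With these measurements the speedup grows from roughly 1.4x at a per-node parallelism of 1 to roughly 7.3x at 8.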

    Running Q17 and Q18 with ePQ may cause an out-of-memory (OOM) error. Set the following parameter to avoid exhausting memory:

    SET polar_px_optimizer_enable_hashagg = 0;
    +

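    Running ePQ Cross-Node Parallel Execution

    In the examples above, for simplicity, the compute nodes of PolarDB for PostgreSQL were all deployed on the same host. When ePQ is used in this setup, every compute node shares that host's CPU, memory, and I/O bandwidth, so execution is in essence parallel on a single host. In practice, PolarDB for PostgreSQL compute nodes can be deployed on multiple machines that share the same storage nodes. ePQ then performs truly distributed cross-machine parallel queries and can make full use of the computing resources of all machines.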

    `,27),S=t(`

    If you encounter the following error:

    psql:queries/q01.analyze.sq1:24: WARNING:  interconnect may encountered a network error, please check your network
    +DETAIL:  Failed to send packet (seq 1) to 192.168.1.8:57871 (pid 17766 cid 0) after 100 retries.
    +

    You can try setting the MTU of every machine to 9000:

    ifconfig <interface name> mtu 9000
    +
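    Before or after changing the MTU, it can help to confirm the value each machine actually uses. The following is a sketch, assuming a Linux host that exposes the MTU under /sys/class/net/<interface>/mtu; check_mtu and its optional third argument are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# check_mtu <interface> <expected_mtu> [sysfs_root] (hypothetical helper)
# Compare the interface MTU read from sysfs with the expected value.
# sysfs_root defaults to /sys/class/net; the override only exists so the
# function can be tried against a scratch directory.
check_mtu() {
    iface="$1"
    expected="$2"
    root="$3"
    [ -n "$root" ] || root=/sys/class/net
    file="$root/$iface/mtu"
    if [ ! -r "$file" ]; then
        echo "$iface: cannot read $file"
        return 2
    fi
    actual=$(cat "$file")
    if [ "$actual" = "$expected" ]; then
        echo "$iface: MTU $actual (ok)"
    else
        echo "$iface: MTU $actual (expected $expected)"
        return 1
    fi
}
```

    Run it on every machine in the cluster, for example: check_mtu eth0 9000 (eth0 is a placeholder interface name).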
    `,1);function q(l,v){const k=e("ArticleInfo"),o=e("router-link"),c=e("ExternalLinkIcon"),r=e("RouterLink");return i(),d("div",null,[b,a(k,{frontmatter:l.$frontmatter},null,8,["frontmatter"]),w,s("nav",y,[s("ul",null,[s("li",null,[a(o,{to:"#背景"},{default:p(()=>[n("背景")]),_:1})]),s("li",null,[a(o,{to:"#测试准备"},{default:p(()=>[n("测试准备")]),_:1}),s("ul",null,[s("li",null,[a(o,{to:"#部署-polardb-pg"},{default:p(()=>[n("部署 PolarDB-PG")]),_:1})]),s("li",null,[a(o,{to:"#生成-tpc-h-测试数据集"},{default:p(()=>[n("生成 TPC-H 测试数据集")]),_:1})])])]),s("li",null,[a(o,{to:"#执行-postgresql-单机并行执行"},{default:p(()=>[n("执行 PostgreSQL 单机并行执行")]),_:1})]),s("li",null,[a(o,{to:"#执行-epq-单机并行执行"},{default:p(()=>[n("执行 ePQ 单机并行执行")]),_:1})]),s("li",null,[a(o,{to:"#执行-epq-跨机并行执行"},{default:p(()=>[n("执行 ePQ 跨机并行执行")]),_:1})])])]),h,s("p",null,[s("a",g,[n("TPC-H"),a(c)]),n(" 是专门测试数据库分析型场景性能的数据集。")]),_,s("p",null,[n("或者参考 "),a(r,{to:"/zh/deploying/deploy.html"},{default:p(()=>[n("进阶部署")]),_:1}),n(" 部署一个基于共享存储的 PolarDB for PostgreSQL 集群。")]),P,s("p",null,[n("通过 "),s("a",f,[n("tpch-dbgen"),a(c)]),n(" 工具来生成测试数据。")]),T,s("p",null,[n("参考 "),a(r,{to:"/zh/deploying/deploy.html"},{default:p(()=>[n("进阶部署")]),_:1}),n(" 可以搭建起不同形态的 PolarDB for PostgreSQL 集群。集群搭建成功后,使用 ePQ 的方式与单机 ePQ 完全相同。")]),S])}const C=u(m,[["render",q],["__file","tpch-test.html.vue"]]);export{C as default}; diff --git a/assets/tpch-test.html-f7f8e1ad.js b/assets/tpch-test.html-f7f8e1ad.js new file mode 100644 index 00000000000..bf45c2b02ed --- /dev/null +++ b/assets/tpch-test.html-f7f8e1ad.js @@ -0,0 +1,202 @@ +import{_ as u,r as e,o as i,c as d,d as a,a as s,w as p,b as n,e as t}from"./app-3d1677bf.js";const m={},b=s("h1",{id:"tpc-h-测试",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#tpc-h-测试","aria-hidden":"true"},"#"),n(" TPC-H 测试")],-1),w=s("p",null,"本文将引导您对 PolarDB for PostgreSQL 进行 TPC-H 测试。",-1),y={class:"table-of-contents"},h=s("h2",{id:"背景",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#背景","aria-hidden":"true"},"#"),n(" 
背景")],-1),g={href:"https://www.tpc.org/tpch/default5.asp",target:"_blank",rel:"noopener noreferrer"},_=t(`

    Test Preparation

    Deploying PolarDB-PG

    Use Docker to quickly bring up a PolarDB for PostgreSQL cluster backed by local storage:

    docker pull polardb/polardb_pg_local_instance
    +docker run -it \\
    +    --cap-add=SYS_PTRACE \\
    +    --privileged=true \\
    +    --name polardb_pg_htap \\
    +    --shm-size=512m \\
    +    polardb/polardb_pg_local_instance \\
    +    bash
    +
    `,4),P=s("h3",{id:"生成-tpc-h-测试数据集",tabindex:"-1"},[s("a",{class:"header-anchor",href:"#生成-tpc-h-测试数据集","aria-hidden":"true"},"#"),n(" 生成 TPC-H 测试数据集")],-1),f={href:"https://github.com/ApsaraDB/tpch-dbgen",target:"_blank",rel:"noopener noreferrer"},T=t(`
    $ git clone https://github.com/ApsaraDB/tpch-dbgen.git
    +$ cd tpch-dbgen
    +$ ./build.sh --help
    +
    +  1) Use default configuration to build
    +  ./build.sh
    +  2) Use limited configuration to build
    +  ./build.sh --user=postgres --db=postgres --host=localhost --port=5432 --scale=1
    +  3) Run the test case
    +  ./build.sh --run
    +  4) Run the target test case
    +  ./build.sh --run=3. run the 3rd case.
    +  5) Run the target test case with option
    +  ./build.sh --run --option="set polar_enable_px = on;"
    +  6) Clean the test data. This step will drop the database or tables, remove csv
    +  and tbl files
    +  ./build.sh --clean
    +  7) Quick build TPC-H with 100MB scale of data
    +  ./build.sh --scale=0.1
    +

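    By passing different options, you can create TPC-H data sets of different sizes. The options of the build.sh script mean the following:

    • --user: database user name
    • --db: database name
    • --host: database host address
    • --port: database service port
    • --run: run all TPC-H queries, or one specific TPC-H query
    • --option: additional GUC parameters to set
    • --scale: size of the generated TPC-H data set, in GB

    The script has no option for the database password. Authenticate by setting the PGPASSWORD environment variable to the database user's password: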

    export PGPASSWORD=<your password>
    +

    Generate and import 100 MB of TPC-H data:

    ./build.sh --scale=0.1
    +

    Generate and import 1 GB of TPC-H data:

    ./build.sh
    +

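    Running PostgreSQL Single-Node Parallel Execution

    Using TPC-H Q18 as an example, run a PostgreSQL single-node parallel query and observe the query speed.

    From the tpch-dbgen/ directory, connect to the database with psql: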

    cd tpch-dbgen
    +psql
    +
    -- Enable timing
    +\\timing on
    +
    +-- Set the single-node degree of parallelism
    +SET max_parallel_workers_per_gather = 2;
    +
    +-- Show the execution plan for Q18
    +\\i finals/18.explain.sql
    +                                                                         QUERY PLAN
    +------------------------------------------------------------------------------------------------------------------------------------------------------------
    + Sort  (cost=3450834.75..3450835.42 rows=268 width=81)
    +   Sort Key: orders.o_totalprice DESC, orders.o_orderdate
    +   ->  GroupAggregate  (cost=3450817.91..3450823.94 rows=268 width=81)
    +         Group Key: customer.c_custkey, orders.o_orderkey
    +         ->  Sort  (cost=3450817.91..3450818.58 rows=268 width=67)
    +               Sort Key: customer.c_custkey, orders.o_orderkey
    +               ->  Hash Join  (cost=1501454.20..3450807.10 rows=268 width=67)
    +                     Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
    +                     ->  Seq Scan on lineitem  (cost=0.00..1724402.52 rows=59986052 width=22)
    +                     ->  Hash  (cost=1501453.37..1501453.37 rows=67 width=53)
    +                           ->  Nested Loop  (cost=1500465.85..1501453.37 rows=67 width=53)
    +                                 ->  Nested Loop  (cost=1500465.43..1501084.65 rows=67 width=34)
    +                                       ->  Finalize GroupAggregate  (cost=1500464.99..1500517.66 rows=67 width=4)
    +                                             Group Key: lineitem_1.l_orderkey
    +                                             Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
    +                                             ->  Gather Merge  (cost=1500464.99..1500511.66 rows=400 width=36)
    +                                                   Workers Planned: 2
    +                                                   ->  Sort  (cost=1499464.97..1499465.47 rows=200 width=36)
    +                                                         Sort Key: lineitem_1.l_orderkey
    +                                                         ->  Partial HashAggregate  (cost=1499454.82..1499457.32 rows=200 width=36)
    +                                                               Group Key: lineitem_1.l_orderkey
    +                                                               ->  Parallel Seq Scan on lineitem lineitem_1  (cost=0.00..1374483.88 rows=24994188 width=22)
    +                                       ->  Index Scan using orders_pkey on orders  (cost=0.43..8.45 rows=1 width=30)
    +                                             Index Cond: (o_orderkey = lineitem_1.l_orderkey)
    +                                 ->  Index Scan using customer_pkey on customer  (cost=0.43..5.50 rows=1 width=23)
    +                                       Index Cond: (c_custkey = orders.o_custkey)
    +(26 rows)
    +
    +Time: 3.965 ms
    +
    +-- Run Q18
    +\\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 80150.449 ms (01:20.150)
    +

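    Running ePQ Single-Node Parallel Execution

    PolarDB for PostgreSQL provides elastic cross-node parallel query (ePQ), which is well suited to analytical queries. The steps below walk you through running TPC-H queries in parallel with ePQ on a single host.

    From the tpch-dbgen/ directory, connect to the database with psql:

    cd tpch-dbgen
    +psql
    +

    First, set the maximum ePQ query parallelism on each of the eight tables created by TPC-H: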

    ALTER TABLE nation SET (px_workers = 100);
    +ALTER TABLE region SET (px_workers = 100);
    +ALTER TABLE supplier SET (px_workers = 100);
    +ALTER TABLE part SET (px_workers = 100);
    +ALTER TABLE partsupp SET (px_workers = 100);
    +ALTER TABLE customer SET (px_workers = 100);
    +ALTER TABLE orders SET (px_workers = 100);
    +ALTER TABLE lineitem SET (px_workers = 100);
    +

    Using Q18 as an example, run the query:

    -- Enable timing
    +\\timing on
    +
    +-- Turn on the ePQ feature
    +SET polar_enable_px = ON;
    +-- Set the ePQ degree of parallelism per node to 1
    +SET polar_px_dop_per_node = 1;
    +
    +-- Show the execution plan for Q18
    +\\i finals/18.explain.sql
    +                                                                          QUERY PLAN
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------
    + PX Coordinator 2:1  (slice1; segments: 2)  (cost=0.00..257526.21 rows=59986052 width=47)
    +   Merge Key: orders.o_totalprice, orders.o_orderdate
    +   ->  GroupAggregate  (cost=0.00..243457.68 rows=29993026 width=47)
    +         Group Key: orders.o_totalprice, orders.o_orderdate, customer.c_name, customer.c_custkey, orders.o_orderkey
    +         ->  Sort  (cost=0.00..241257.18 rows=29993026 width=47)
    +               Sort Key: orders.o_totalprice DESC, orders.o_orderdate, customer.c_name, customer.c_custkey, orders.o_orderkey
    +               ->  Hash Join  (cost=0.00..42729.99 rows=29993026 width=47)
    +                     Hash Cond: (orders.o_orderkey = lineitem_1.l_orderkey)
    +                     ->  PX Hash 2:2  (slice2; segments: 2)  (cost=0.00..15959.71 rows=7500000 width=39)
    +                           Hash Key: orders.o_orderkey
    +                           ->  Hash Join  (cost=0.00..15044.19 rows=7500000 width=39)
    +                                 Hash Cond: (orders.o_custkey = customer.c_custkey)
    +                                 ->  PX Hash 2:2  (slice3; segments: 2)  (cost=0.00..11561.51 rows=7500000 width=20)
    +                                       Hash Key: orders.o_custkey
    +                                       ->  Hash Semi Join  (cost=0.00..11092.01 rows=7500000 width=20)
    +                                             Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
    +                                             ->  Partial Seq Scan on orders  (cost=0.00..1132.25 rows=7500000 width=20)
    +                                             ->  Hash  (cost=7760.84..7760.84 rows=400 width=4)
    +                                                   ->  PX Broadcast 2:2  (slice4; segments: 2)  (cost=0.00..7760.84 rows=400 width=4)
    +                                                         ->  Result  (cost=0.00..7760.80 rows=200 width=4)
    +                                                               Filter: ((sum(lineitem.l_quantity)) > '314'::numeric)
    +                                                               ->  Finalize HashAggregate  (cost=0.00..7760.78 rows=500 width=12)
    +                                                                     Group Key: lineitem.l_orderkey
    +                                                                     ->  PX Hash 2:2  (slice5; segments: 2)  (cost=0.00..7760.72 rows=500 width=12)
    +                                                                           Hash Key: lineitem.l_orderkey
    +                                                                           ->  Partial HashAggregate  (cost=0.00..7760.70 rows=500 width=12)
    +                                                                                 Group Key: lineitem.l_orderkey
    +                                                                                 ->  Partial Seq Scan on lineitem  (cost=0.00..3350.82 rows=29993026 width=12)
    +                                 ->  Hash  (cost=597.51..597.51 rows=749979 width=23)
    +                                       ->  PX Hash 2:2  (slice6; segments: 2)  (cost=0.00..597.51 rows=749979 width=23)
    +                                             Hash Key: customer.c_custkey
    +                                             ->  Partial Seq Scan on customer  (cost=0.00..511.44 rows=749979 width=23)
    +                     ->  Hash  (cost=5146.80..5146.80 rows=29993026 width=12)
    +                           ->  PX Hash 2:2  (slice7; segments: 2)  (cost=0.00..5146.80 rows=29993026 width=12)
    +                                 Hash Key: lineitem_1.l_orderkey
    +                                 ->  Partial Seq Scan on lineitem lineitem_1  (cost=0.00..3350.82 rows=29993026 width=12)
    + Optimizer: PolarDB PX Optimizer
    +(37 rows)
    +
    +Time: 216.672 ms
    +
    +-- Run Q18
    +\\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 59113.965 ms (00:59.114)
    +

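    This is already shorter than PostgreSQL single-node parallel execution (about 59 s versus about 80 s). Raising the per-node ePQ degree of parallelism improves query performance much further: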

    SET polar_px_dop_per_node = 2;
    +\\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 42400.500 ms (00:42.401)
    +
    +SET polar_px_dop_per_node = 4;
    +\\i finals/18.sql
    +
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 19892.603 ms (00:19.893)
    +
    +SET polar_px_dop_per_node = 8;
    +\\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 10944.402 ms (00:10.944)
    +
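The three timings above can be compared directly. The following is a minimal sketch that uses only the `Time:` values printed in the runs above; the helper name `speedup` is made up for illustration and is not part of PolarDB or psql. It divides the time measured at `polar_px_dop_per_node = 2` by the time measured at a higher setting:

```shell
#!/bin/sh
# Hypothetical helper: ratio of a baseline run time to another run time.
# $1: baseline time in ms, $2: measured time in ms
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2f\n", base / t }'
}

# Times taken from the Q18 runs above (dop 2 is the baseline).
speedup 42400.500 19892.603   # dop 4 vs dop 2, prints 2.13
speedup 42400.500 10944.402   # dop 8 vs dop 2, prints 3.87
```

In these runs, each doubling of the per-node parallelism roughly halves the query time.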

    Running Q17 and Q18 with ePQ may cause an out-of-memory (OOM) error. Set the following parameter to avoid exhausting memory:

    SET polar_px_optimizer_enable_hashagg = 0;
    +

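    Running ePQ Cross-Machine Parallel Execution

    In the example above, for simplicity, the compute nodes of PolarDB for PostgreSQL were deployed on a single host. In that setup all compute nodes share the CPU, memory, and I/O bandwidth of one host, so ePQ is in essence performing parallel execution on a single machine. In practice, the compute nodes of PolarDB for PostgreSQL can be deployed on multiple machines that share the storage nodes. ePQ then performs truly distributed cross-machine parallel queries and can make full use of the computing resources of all machines.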


    If you encounter the following error:

    psql:queries/q01.analyze.sq1:24: WARNING:  interconnect may encountered a network error, please check your network
    +DETAIL:  Failed to send packet (seq 1) to 192.168.1.8:57871 (pid 17766 cid 0) after 100 retries.
    +

    You can try setting the MTU to 9000 on every machine:

    ifconfig <interface-name> mtu 9000
    +
    diff --git a/contributing/coding-style.html b/contributing/coding-style.html new file mode 100644 index 00000000000..dffa49c1b17 --- /dev/null +++ b/contributing/coding-style.html @@ -0,0 +1,33 @@

    Coding Style

    Languages

    • The PostgreSQL kernel, extensions, and kernel-related tools are written in C, to remain compatible with community versions and to make upgrades easy.
    • Management-related tools can use shell, Go, or Python for efficient development.

    Style

    • C code follows PostgreSQL's programming style, including naming, error message format, control statements, line length, comment format, function length, and use of global variables. For details, refer to the PostgreSQL style guide. Some highlights:

      • Code in PostgreSQL should only rely on language features available in the C99 standard
      • Do not use // for comments
      • Both macros with arguments and static inline functions may be used. The latter are preferable when the macro form would have multiple-evaluation hazards or would be very long.
      • Follow BSD C programming conventions
    • Programs in Shell, Go, or Python can follow Google code conventions

      • https://google.github.io/styleguide/pyguide.html
      • https://github.com/golang/go/wiki/CodeReviewComments
      • https://google.github.io/styleguide/shellguide.html

    Code design and review

    We share the same thoughts and rules as Google Open Source Code Review.

    Before submitting code for review, run the unit tests and make sure all tests under src/test pass, such as regress and isolation. Unit tests or functional tests should be submitted together with the code changes.

    In addition to code review, that document offers guidance for the whole cycle of high-quality development: design, implementation, testing, documentation, and preparing for code review. It poses useful questions for each critical step, covering design, functionality, complexity, tests, naming, documentation, and code review. It summarizes the rules for code review as follows.

    In doing a code review, you should make sure that:

    • The code is well-designed.
    • The functionality is good for the users of the code.
    • Any UI changes are sensible and look good.
    • Any parallel programming is done safely.
    • The code isn't more complex than it needs to be.
    • The developer isn't implementing things they might need in the future but don't know they need now.
    • Code has appropriate unit tests.
    • Tests are well-designed.
    • The developer used clear names for everything.
    • Comments are clear and useful, and mostly explain why instead of what.
    • Code is appropriately documented.
    • The code conforms to our style guides.
    diff --git a/contributing/contributing-polardb-docs.html b/contributing/contributing-polardb-docs.html new file mode 100644 index 00000000000..d80c829220f --- /dev/null +++ b/contributing/contributing-polardb-docs.html @@ -0,0 +1,60 @@

    Documentation Contributing

    DANGER

    Translation needed

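    The PolarDB for PostgreSQL documentation is managed with VuePress 2 and written primarily in Markdown.

    Browsing the Documentation

    The documentation is hosted online on GitHub Pages.

    Local Documentation Development

    If you find content or formatting errors in the documentation, or want to contribute new documents, you need to install and configure the documentation development environment locally. The documentation is a Node.js project that uses Yarn as its package manager. Node.js® is a JavaScript runtime built on Chrome's V8 engine.

    Preparing the Node Environment

    You need a local Node.js environment. You can either download an installer from the download page of the Node.js website and install it manually, or install it automatically with the commands below.

    Install nvm, the Node version manager, with curl: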

    curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
    +command -v nvm
    +

    If the previous step prints command not found, close the current terminal and open a new one.

    Once nvm is installed, run the following command to install the LTS version of Node:

    nvm install --lts
    +

    After Node.js is installed, check that the installation succeeded:

    node -v
    +npm -v
    +

    Use npm to install the yarn package manager globally:

    npm install -g yarn
    +yarn -v
    +

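    Installing Documentation Dependencies

    Run the following command in the root directory of the PolarDB for PostgreSQL project; yarn installs all dependencies listed in package.json: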

    yarn
    +

    Running the Documentation Development Server

    Run the following command in the root directory of the PolarDB for PostgreSQL project:

    yarn docs:dev
    +

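    The documentation development server runs at http://localhost:8080/PolarDB-for-PostgreSQL/; open it in a browser. Changes to Markdown files are reflected on the page in real time.

    Documentation Directory Layout

    The documentation resources of PolarDB for PostgreSQL live in the docs/ directory under the project root. The directory is organized as follows: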

    └── docs
    +    ├── .vuepress
    +    │   ├── configs
    +    │   ├── public
    +    │   └── styles
    +    ├── README.md
    +    ├── architecture
    +    ├── contributing
    +    ├── guide
    +    ├── imgs
    +    ├── roadmap
    +    └── zh
    +        ├── README.md
    +        ├── architecture
    +        ├── contributing
    +        ├── guide
    +        ├── imgs
    +        └── roadmap
    +

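    As shown, docs/zh/ mirrors its parent directory except for .vuepress/. The docs/ directory holds the English documents, and docs/zh/ holds the corresponding Simplified Chinese documents.

    The .vuepress/ directory contains the global configuration of the documentation project:

    • config.js: documentation configuration
    • configs/: configuration modules (navbar / sidebar, English / Chinese, and so on)
    • public/: public static assets
    • styles/: overrides of the default theme styles

    For how to configure the documentation, see the configuration guide in the official VuePress 2 documentation.

    Documentation Development Guidelines

    1. After writing a new document, add a route for it in the documentation configuration so that it appears in the navbar and sidebar (see existing documents for reference)
    2. When fixing a document in one language, also fix the same document in the other languages
    3. After editing, format the Markdown files with Prettier.

    Online Deployment

    The documentation uses GitHub Actions for CI. Pushing to the main branch triggers a build of the documentation resources under docs/, and the build output is pushed to the gh-pages branch. GitHub Pages then automatically deploys the static files on that branch to a web server, forming the documentation site.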

    diff --git a/contributing/contributing-polardb-kernel.html b/contributing/contributing-polardb-kernel.html new file mode 100644 index 00000000000..970b1f3372c --- /dev/null +++ b/contributing/contributing-polardb-kernel.html @@ -0,0 +1,45 @@

    Code Contributing

    PolarDB for PostgreSQL is an open source product built on PostgreSQL and other open source projects. Our main goal is to grow a larger community around PostgreSQL. Contributors are welcome to submit code and ideas. In the long run, we hope this project can be managed by developers from both inside and outside Alibaba.

    Branch Description and Management

    • POLARDB_11_STABLE is the stable branch of PolarDB; it accepts merges from POLARDB_11_DEV only.
    • POLARDB_11_DEV is the stable development branch of PolarDB; it accepts merges from pull requests as well as direct pushes from maintainers.

    New features are merged into POLARDB_11_DEV, and maintainers periodically merge them into POLARDB_11_STABLE.

    Before Contributing

    Contributing

    Here is a checklist to prepare and submit your PR (pull request):

    • Create your own copy of the repository on GitHub by forking ApsaraDB/PolarDB-for-PostgreSQL.
    • Check out the Advanced Deployment documentation to build PolarDB from source.
    • Push changes to your personal fork and make sure they follow our coding style.
    • Create a PR with a detailed description if the commit messages are not self-explanatory.
    • Submit the PR for review and address all feedback.
    • Wait for the PR to be merged.

    An Example of Submitting Code Change to PolarDB

    Let's use an example to walk through the list.

    Fork Your Own Repository

    On the GitHub repository page of PolarDB for PostgreSQL, click the Fork button to create your own PolarDB repository.

    Create Local Repository

    git clone https://github.com/<your-github>/PolarDB-for-PostgreSQL.git
    +

    Create a Local Development Branch

    Check out a new development branch from the stable development branch POLARDB_11_DEV. Suppose your branch is named dev:

    git checkout POLARDB_11_DEV
    +git checkout -b dev
    +

    Make Changes and Commit Locally

    git status
    +git add <files-to-change>
    +git commit -m "modification for dev"
    +

    Rebase and Commit to Remote Repository

    Click Fetch upstream on your own repository page to make sure your stable development branch is up to date with the official PolarDB repository. Then pull the latest commits on the stable development branch to your local repository.

    git checkout POLARDB_11_DEV
    +git pull
    +

    Then rebase your development branch onto the stable development branch and resolve any conflicts:

    git checkout dev
    +git rebase POLARDB_11_DEV
    +# resolve any conflicts, stage the fixes with git add, then run: git rebase --continue
    +git push -f origin dev
    +

    Create a Pull Request

    Click the New pull request or Compare & pull request button, choose to compare the branches ApsaraDB/PolarDB-for-PostgreSQL:POLARDB_11_DEV and <your-github>/PolarDB-for-PostgreSQL:dev, and write the PR description.

    GitHub will automatically run regression tests on your code. Your PR should pass all of these checks.

    Address Reviewers' Comments

    Resolve all problems raised by reviewers and update the PR.

    Merge

    Merging is performed by the PolarDB maintainers.

    diff --git a/deploying/db-localfs.html b/deploying/db-localfs.html new file mode 100644 index 00000000000..9c3842bafcd --- /dev/null +++ b/deploying/db-localfs.html @@ -0,0 +1,49 @@

    Deploying on a Single-Node File System

    棠羽

    2023/08/01

    15 min

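    This guide walks you through compiling and deploying PolarDB-PG on a single-node file system (such as ext4). It applies when all compute nodes can access the same local disk storage.

    Pulling the Image

    We provide a local instance image of PolarDB-PG on DockerHub, which already contains the entrypoint script for starting a PolarDB-PG local-storage instance. The image currently supports the linux/amd64 and linux/arm64 CPU architectures.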

    docker pull polardb/polardb_pg_local_instance
    +

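    Initializing the Database

    Create an empty directory ${your_data_dir} to serve as the data directory of the PolarDB-PG instance. When starting the container, mount this directory into the container as a VOLUME so that the data directory is initialized. During initialization you can pass environment variables to override the defaults:

    • POLARDB_PORT: the port PolarDB-PG runs on, 5432 by default; the image uses three consecutive ports (5432-5434 by default)
    • POLARDB_USER: the default superuser created when the database is initialized (postgres by default)
    • POLARDB_PASSWORD: the password of the default superuser

    Initialize the database with the following command: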

    docker run -it --rm \
    +    --env POLARDB_PORT=5432 \
    +    --env POLARDB_USER=u1 \
    +    --env POLARDB_PASSWORD=your_password \
    +    -v ${your_data_dir}:/var/polardb \
    +    polardb/polardb_pg_local_instance \
    +    echo 'done'
    +

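    Starting the PolarDB-PG Service

    Once the database is initialized, create the container in the background with the -d option to start the PolarDB-PG service. The PolarDB-PG ports usually need to be exposed to the outside; use the -p option to publish the container's port range. For example, if the database was initialized with ports 5432-5434, the following command maps those three ports to ports 54320-54322 on the host: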

    docker run -d \
    +    -p 54320-54322:5432-5434 \
    +    -v ${your_data_dir}:/var/polardb \
    +    polardb/polardb_pg_local_instance
    +
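The port arithmetic described above can be sketched as follows. This is a minimal illustration that assumes only what the text states (the image uses three consecutive ports starting at POLARDB_PORT); the helper name `polardb_port_mapping` and its arguments are hypothetical and not part of the image:

```shell
#!/bin/sh
# Hypothetical helper: build the argument for docker's -p option.
# $1: POLARDB_PORT used inside the container
# $2: first port to expose on the host
polardb_port_mapping() {
    printf '%s-%s:%s-%s\n' "$2" "$(( $2 + 2 ))" "$1" "$(( $1 + 2 ))"
}

polardb_port_mapping 5432 54320   # prints 54320-54322:5432-5434
```

The printed value has the same shape as the `-p` argument used in the command above, so it could be substituted there when a different base port is chosen.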

    Alternatively, let the container share the host's network directly:

    docker run -d \
    +    --network=host \
    +    -v ${your_data_dir}:/var/polardb \
    +    polardb/polardb_pg_local_instance
    +
    diff --git a/deploying/db-pfs-curve.html b/deploying/db-pfs-curve.html new file mode 100644 index 00000000000..28b07cc3061 --- /dev/null +++ b/deploying/db-pfs-curve.html @@ -0,0 +1,127 @@

    Deploying on PFS for CurveBS

    程义

    2022/11/02

    15 min

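    This guide walks you through compiling and deploying PolarDB on the distributed file system PolarDB File System (PFS). It applies to compute nodes where PFS has already been formatted and mounted on Curve block storage.

    We provide a PolarDB development image on DockerHub that already contains every dependency needed to compile and run PolarDB for PostgreSQL. You can use this development image directly to set up an instance. The image currently supports the AMD64 and ARM64 CPU architectures.

    Downloading the Source Code

    In the preceding guide we pulled the PolarDB development image from DockerHub and entered the container. Inside the container, download the PolarDB for PostgreSQL source code from GitHub; the stable branch is POLARDB_11_STABLE. If network issues prevent stable access to GitHub, use the Gitee mirror in China instead.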

    git clone -b POLARDB_11_STABLE https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git
    +
    git clone -b POLARDB_11_STABLE https://gitee.com/mirrors/PolarDB-for-PostgreSQL
    +

    After cloning, enter the source directory:

    cd PolarDB-for-PostgreSQL/
    +

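    Compiling and Deploying PolarDB

    Deploying the Read-Write Node

    On the read-write node, compile the PolarDB kernel with the --with-pfsd option. See the build and test options reference for more build options.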

    ./polardb_build.sh --with-pfsd
    +

    WARNING

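    After the build completes, the script above automatically deploys an instance based on the local file system, listening on port 5432.

    Stop this instance manually with the following command so that an instance can be redeployed on PFS and shared storage: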

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl \
    +    -D $HOME/tmp_master_dir_polardb_pg_1100_bld/ \
    +    stop
    +

    Initialize the data directory $HOME/primary/ locally on the node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/primary
    +

    Initialize the shared data directory at /pool@@volume_my_/shared_data on the shared storage:

    # Create the shared data directory with pfs
    +sudo pfs -C curve mkdir /pool@@volume_my_/shared_data
    +# Initialize the local and shared data directories of the database
    +sudo $HOME/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh \
    +    $HOME/primary/ /pool@@volume_my_/shared_data/ curve
    +

    Edit the configuration of the read-write node. Open $HOME/primary/postgresql.conf and add the following settings:

    port=5432
    +polar_hostid=1
    +polar_enable_shared_storage_mode=on
    +polar_disk_name='pool@@volume_my_'
    +polar_datadir='/pool@@volume_my_/shared_data/'
    +polar_vfs.localfs_mode=off
    +shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
    +polar_storage_cluster_name='curve'
    +logging_collector=on
    +log_line_prefix='%p\t%r\t%u\t%m\t'
    +log_directory='pg_log'
    +listen_addresses='*'
    +max_connections=1000
    +synchronous_standby_names='replica1'
    +

    Open $HOME/primary/pg_hba.conf and add the following entry:

    host	replication	postgres	0.0.0.0/0	trust
    +

    Finally, start the read-write node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/primary
    +

    Check that the read-write node is running properly:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5432 \
    +    -d postgres \
    +    -c 'select version();'
    +# Output:
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +

    On the read-write node, create a replication slot for the read-only node; it is used for the read-only node's physical streaming replication:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5432 \
    +    -d postgres \
    +    -c "select pg_create_physical_replication_slot('replica1');"
    +# Output:
    + pg_create_physical_replication_slot
    +-------------------------------------
    + (replica1,)
    +(1 row)
    +

    Deploying the Read-Only Node

    On the read-only node, compile the PolarDB kernel with the --with-pfsd option.

    ./polardb_build.sh --with-pfsd
    +

    WARNING

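    After the build completes, the script above automatically deploys an instance based on the local file system, listening on port 5432.

    Stop this instance manually with the following command so that an instance can be redeployed on PFS and shared storage: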

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl \
    +    -D $HOME/tmp_master_dir_polardb_pg_1100_bld/ \
    +    stop
    +

    Initialize the data directory $HOME/replica1/ locally on the node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/replica1
    +

    Edit the configuration of the read-only node. Open $HOME/replica1/postgresql.conf and add the following settings:

    port=5433
    +polar_hostid=2
    +polar_enable_shared_storage_mode=on
    +polar_disk_name='pool@@volume_my_'
    +polar_datadir='/pool@@volume_my_/shared_data/'
    +polar_vfs.localfs_mode=off
    +shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
    +polar_storage_cluster_name='curve'
    +logging_collector=on
    +log_line_prefix='%p\t%r\t%u\t%m\t'
    +log_directory='pg_log'
    +listen_addresses='*'
    +max_connections=1000
    +

    Create $HOME/replica1/recovery.conf and add the following settings:

    WARNING

    Replace the placeholder below with the IP address of the read-write node (container).

    polar_replica='on'
    +recovery_target_timeline='latest'
    +primary_slot_name='replica1'
    +primary_conninfo='host=[read-write node IP] port=5432 user=postgres dbname=postgres application_name=replica1'
    +

    Finally, start the read-only node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/replica1
    +

    Check that the read-only node is running properly:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5433 \
    +    -d postgres \
    +    -c 'select version();'
    +# Output:
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +

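    Checking and Testing the Cluster

    After deployment, check and test the instance to make sure that the read-write node can write data and the read-only node can read it.

    Log in to the read-write node, create a test table, and insert sample data: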

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
    +    -p 5432 \
    +    -d postgres \
    +    -c "create table t(t1 int primary key, t2 int);insert into t values (1, 1),(2, 3),(3, 3);"
    +

    Log in to the read-only node and query the sample data just inserted:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
    +    -p 5433 \
    +    -d postgres \
    +    -c "select * from t;"
    +# Output:
    + t1 | t2
    +----+----
    +  1 |  1
    +  2 |  3
    +  3 |  3
    +(3 rows)
    +

    The data inserted on the read-write node is visible on the read-only node.

    diff --git a/deploying/db-pfs.html b/deploying/db-pfs.html new file mode 100644 index 00000000000..0ff6dbe6013 --- /dev/null +++ b/deploying/db-pfs.html @@ -0,0 +1,117 @@

    Deploying on PFS

    棠羽

    2022/05/09

    15 min

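    This guide walks you through compiling and deploying PolarDB on the distributed file system PolarDB File System (PFS). It applies to compute nodes where the PFS file system has already been formatted and mounted on shared storage.

    Deploying the Read-Write Node

    Initialize the local data directory ~/primary/ of the read-write node: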

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/primary
    +

    Create the shared data directory at /nvme1n1/shared_data/ on the shared storage, then initialize it with the polar-initdb.sh script:

    # Create the shared data directory with pfs
    +sudo pfs -C disk mkdir /nvme1n1/shared_data
    +# Initialize the local and shared data directories of the database
    +sudo $HOME/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh \
    +    $HOME/primary/ /nvme1n1/shared_data/
    +

    Edit the configuration of the read-write node. Open ~/primary/postgresql.conf and add the following settings:

    port=5432
    +polar_hostid=1
    +polar_enable_shared_storage_mode=on
    +polar_disk_name='nvme1n1'
    +polar_datadir='/nvme1n1/shared_data/'
    +polar_vfs.localfs_mode=off
    +shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
    +polar_storage_cluster_name='disk'
    +logging_collector=on
    +log_line_prefix='%p\t%r\t%u\t%m\t'
    +log_directory='pg_log'
    +listen_addresses='*'
    +max_connections=1000
    +synchronous_standby_names='replica1'
    +

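    Edit the client authentication file ~/primary/pg_hba.conf of the read-write node and add the following entry to allow the read-only node to perform physical replication: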

    host	replication	postgres	0.0.0.0/0	trust
    +

    Finally, start the read-write node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/primary
    +

    Check that the read-write node is running properly:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5432 \
    +    -d postgres \
    +    -c 'SELECT version();'
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +

    On the read-write node, create a replication slot for the read-only node; it is used for the read-only node's physical replication:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5432 \
    +    -d postgres \
    +    -c "SELECT pg_create_physical_replication_slot('replica1');"
    + pg_create_physical_replication_slot
    +-------------------------------------
    + (replica1,)
    +(1 row)
    +

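    Deploying the Read-Only Node

    Create an empty directory at ~/replica1 on the read-only node's local disk, then use the polar-replica-initdb.sh script to initialize this local directory from the data directory on the shared storage. The initialized local directory contains no default configuration files, so you also need to create a temporary local directory template with initdb and copy all the default configuration files into the read-only node's local directory: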

    mkdir -m 0700 $HOME/replica1
    +sudo ~/tmp_basedir_polardb_pg_1100_bld/bin/polar-replica-initdb.sh \
    +    /nvme1n1/shared_data/ $HOME/replica1/
    +
    +$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D /tmp/replica1
    +cp /tmp/replica1/*.conf $HOME/replica1/
    +

    Edit the configuration of the read-only node. Open ~/replica1/postgresql.conf and add the following settings:

    port=5433
    +polar_hostid=2
    +polar_enable_shared_storage_mode=on
    +polar_disk_name='nvme1n1'
    +polar_datadir='/nvme1n1/shared_data/'
    +polar_vfs.localfs_mode=off
    +shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
    +polar_storage_cluster_name='disk'
    +logging_collector=on
    +log_line_prefix='%p\t%r\t%u\t%m\t'
    +log_directory='pg_log'
    +listen_addresses='*'
    +max_connections=1000
    +

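    Create the replication configuration file ~/replica1/recovery.conf for the read-only node, and add the connection information of the read-write node and the replication slot name: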

    polar_replica='on'
    +recovery_target_timeline='latest'
    +primary_slot_name='replica1'
    +primary_conninfo='host=[read-write node IP] port=5432 user=postgres dbname=postgres application_name=replica1'
    +
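Because the only host-specific value in this file is the address of the read-write node, the file can be generated from a template. The sketch below is an illustration only: the function name `render_recovery_conf` is hypothetical, and the settings are copied from the configuration shown above rather than taken from any PolarDB tool.

```shell
#!/bin/sh
# Hypothetical helper: print the recovery.conf shown above for a given primary.
# $1: IP address of the read-write node (supplied by the caller)
render_recovery_conf() {
    cat <<EOF
polar_replica='on'
recovery_target_timeline='latest'
primary_slot_name='replica1'
primary_conninfo='host=$1 port=5432 user=postgres dbname=postgres application_name=replica1'
EOF
}
```

For example, `render_recovery_conf 192.168.1.10 > $HOME/replica1/recovery.conf` would write the file for a read-write node at that (made-up) address.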

    Finally, start the read-only node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/replica1
    +

    Check that the read-only node is running properly:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5433 \
    +    -d postgres \
    +    -c 'SELECT version();'
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +

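    Checking and Testing the Cluster

    After deployment, check and test the instance to make sure that the read-write node can write data and the read-only node can read it.

    Log in to the read-write node, create a test table, and insert sample data: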

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
    +    -p 5432 \
    +    -d postgres \
    +    -c "CREATE TABLE t (t1 INT PRIMARY KEY, t2 INT); INSERT INTO t VALUES (1, 1),(2, 3),(3, 3);"
    +

    Log in to the read-only node and query the sample data just inserted:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
    +    -p 5433 \
    +    -d postgres \
    +    -c "SELECT * FROM t;"
    + t1 | t2
    +----+----
    +  1 |  1
    +  2 |  3
    +  3 |  3
    +(3 rows)
    +

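    The data inserted on the read-write node is visible on the read-only node, which means the PolarDB compute cluster based on shared storage has been set up successfully.


    Common Operations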

    diff --git a/deploying/deploy-official.html b/deploying/deploy-official.html new file mode 100644 index 00000000000..e4d37953c36 --- /dev/null +++ b/deploying/deploy-official.html @@ -0,0 +1,33 @@

    Purchasing an Instance on Alibaba Cloud

    The Alibaba Cloud website offers the cloud-native relational database PolarDB for PostgreSQL for direct purchase.

    diff --git a/deploying/deploy-stack.html b/deploying/deploy-stack.html new file mode 100644 index 00000000000..72f8207b980 --- /dev/null +++ b/deploying/deploy-stack.html @@ -0,0 +1,33 @@

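    Deploying on PolarDB Stack Shared Storage

    PolarDB Stack is lightweight PolarDB PaaS software. It provides a one-writer, multi-reader PolarDB database service on shared storage, with customized and deeply optimized database lifecycle management. With PolarDB Stack you can deploy the PolarDB-for-PostgreSQL kernel and PolarDB-FileSystem in one step.

    The architecture of PolarDB Stack is shown in the figure below. For details, see the PolarDB Stack deployment documentation.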

    PolarDB Stack arch

    diff --git a/deploying/deploy.html b/deploying/deploy.html new file mode 100644 index 00000000000..b47c4ee4ff6 --- /dev/null +++ b/deploying/deploy.html @@ -0,0 +1,33 @@

    Advanced Deployment

    棠羽

    2022/05/09

    10 min

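    Deploying PolarDB for PostgreSQL requires preparation at three levels:

    1. Block storage device layer: provides the storage medium. It can be a single physical block device (local storage) or distributed block storage built from multiple physical block devices.
    2. File system layer: because PostgreSQL stores data in files, a file system must be set up on the block storage device. Depending on the underlying block storage, you can use a single-node file system (such as ext4) or the distributed file system PolarDB File System (PFS).
    3. Database layer: the build and deployment environment of PolarDB for PostgreSQL.

    The table below lists the practices obtained by combining the three layers. The steps involved are:

    • Storage layer: preparing the block storage device
    • File system: compiling and mounting PolarDB File System
    • Database layer: compiling and deploying PolarDB for PostgreSQL in its various cluster forms

    We strongly recommend using the PolarDB development image published on DockerHub for these practices. The development image already contains all dependencies required by the file system and database layers, so no manual installation is needed.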

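     Practice                                     | Block storage                       | File system
    ----------------------------------------------+-------------------------------------+----------------------------------
     Practice 1 (minimal local deployment)        | Local SSD                           | Local file system (such as ext4)
     Practice 2 (production best practice, video) | Alibaba Cloud ECS + ESSD cloud disk | PFS
     Practice 3 (production best practice, video) | CurveBS shared storage              | PFS for Curve
     Practice 4                                   | Ceph shared storage                 | PFS
     Practice 5                                   | NBD shared storage                  | PFS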
    diff --git a/deploying/fs-pfs-curve.html b/deploying/fs-pfs-curve.html new file mode 100644 index 00000000000..fc248a5bf81 --- /dev/null +++ b/deploying/fs-pfs-curve.html @@ -0,0 +1,50 @@

    Formatting and Mounting PFS for CurveBS

    棠羽

    2022/08/31

    20 min

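    PolarDB File System, PFS or PolarFS for short, is a high-performance, POSIX-like, user-space distributed file system developed by Alibaba Cloud to serve its PolarDB database product. Once shared storage is formatted and mounted with PFS, a write to the shared storage by one compute node becomes immediately visible to the other compute nodes.

    Building and Installing PFS

    Prepare the PFS tools on the PolarDB compute nodes. We recommend the PolarDB development image on DockerHub, which already contains a compiled PFS, so you do not need to build and install it again. The Curve open source community has made dedicated optimizations for connecting PFS to CurveBS storage. On the compute node used to deploy PolarDB, start the PolarDB development image that ships PFS for CurveBS with the following commands: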

    docker pull polardb/polardb_pg_devel:curvebs
    +docker run -it \
    +    --network=host \
    +    --cap-add=SYS_PTRACE --privileged=true \
    +    --name polardb_pg \
    +    polardb/polardb_pg_devel:curvebs bash
    +

    Mapping and Formatting the Block Device on the Read-Write Node

    After entering the container, edit the curve configuration file:

    sudo vim /etc/curve/client.conf
    +#
    +################### MDS-side configuration ##################
    +#
    +
    +# MDS address; for an MDS cluster, separate the addresses with commas
    +mds.listen.addr=127.0.0.1:6666
    +... ...
    +

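    Note: set mds.listen.addr to the cluster mds addr value shown in the cluster status output of your CurveBS deployment.

    The curve tool is already installed in the container and can be used to create volumes. Use it to create the curve volume that will actually store the PolarFS data: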

    curve create --filename /volume --user my --length 10 --stripeUnit 16384 --stripeCount 64
    +

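    Run curve create -h for a detailed description of volume creation. In the example above, we created a volume with the following properties:

    • Volume name: /volume
    • Owner: my
    • Size: 10 GB
    • Stripe unit: 16 KB
    • Stripe count: 64

    Note that for database workloads we strongly recommend striped volumes, since only they can take full advantage of Curve's performance; a stripe setting of 16384 * 64 is currently the optimal one.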
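As a quick check of the stripe setting above, the product of the two `curve create` parameters gives the amount of data covered by one full stripe. This is a sketch under the usual striping assumption (stripe unit in bytes times stripe count); the helper name `stripe_width_bytes` is made up for illustration and is not part of the curve tool:

```shell
#!/bin/sh
# Hypothetical helper: bytes covered by one full stripe.
# $1: --stripeUnit value in bytes, $2: --stripeCount value
stripe_width_bytes() {
    echo $(( $1 * $2 ))
}

stripe_width_bytes 16384 64   # prints 1048576, i.e. 1 MiB
```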

    Formatting the curve Volume

    Before a curve volume can be used, it must be formatted with pfs:

    sudo pfs -C curve mkfs pool@@volume_my_
    +

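    Just as a file system must be created on a disk before it can be mounted locally, the curve volume must be formatted as a PolarFS file system.

    Note: because of how PolarFS parses names, the curve volume is specified in the form pool@${volume}_${user}_, and every / in the volume name must be replaced with @.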
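The naming rule can be expressed as a small helper. This is a minimal sketch of the rule as stated in the text; the function name `curve_pfs_name` is hypothetical and not provided by pfs or curve:

```shell
#!/bin/sh
# Hypothetical helper: derive the name passed to pfs from a curve volume.
# $1: curve volume name (for example /volume), $2: curve user (for example my)
curve_pfs_name() {
    # Replace every "/" in the volume name with "@".
    volume_part=$(printf '%s' "$1" | tr '/' '@')
    printf 'pool@%s_%s_\n' "$volume_part" "$2"
}

curve_pfs_name /volume my   # prints pool@@volume_my_
```

With the volume created earlier (`--filename /volume --user my`), this yields the `pool@@volume_my_` name used in the `pfs -C curve mkfs` command above.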

    Starting the pfsd Daemon

    sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p pool@@volume_my_
    +

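    If pfsd starts successfully, the curve-based PolarFS deployment is complete and the PFS file system is mounted. The next step is to compile and deploy PolarDB.


    Compiling and Deploying PolarDB for Curve on PFS

    See PolarDB build and deployment: PFS file system.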

    diff --git a/deploying/fs-pfs.html b/deploying/fs-pfs.html new file mode 100644 index 00000000000..33150038aef --- /dev/null +++ b/deploying/fs-pfs.html @@ -0,0 +1,55 @@

    Formatting and Mounting PFS

    棠羽

    2022/05/09

    15 min

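    PolarDB File System, PFS or PolarFS for short, is a high-performance, POSIX-like, user-space distributed file system developed by Alibaba Cloud to serve its PolarDB database product. Once shared storage is formatted and mounted with PFS, a write to the shared storage by one compute node becomes immediately visible to the other compute nodes.

    Building and Installing PFS

    We recommend the PolarDB for PostgreSQL binary image on DockerHub, which currently supports the linux/amd64 and linux/arm64 architectures. It already contains the compiled PFS tools, so no manual build or installation is needed. Enter the container with the following commands: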

    docker pull polardb/polardb_pg_binary
    +docker run -it \
    +    --cap-add=SYS_PTRACE \
    +    --privileged=true \
    +    --name polardb_pg \
    +    --shm-size=512m \
    +    polardb/polardb_pg_binary \
    +    bash
    +

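    For how to build and install PFS manually, see the PFS README; it is not repeated here.

    Renaming the Block Device

    PFS can only access block devices whose names start with specific characters (see the src/pfs_core/pfs_api.h file in the PolarDB File System source code for details):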

    #define PFS_PATH_ISVALID(path)                                  \
    +    (path != NULL &&                                            \
    +     ((path[0] == '/' && isdigit((path)[1])) || path[0] == '.'  \
    +      || strncmp(path, "/pangu-", 7) == 0                       \
    +      || strncmp(path, "/sd", 3) == 0                           \
    +      || strncmp(path, "/sf", 3) == 0                           \
    +      || strncmp(path, "/vd", 3) == 0                           \
    +      || strncmp(path, "/nvme", 5) == 0                         \
    +      || strncmp(path, "/loop", 5) == 0                         \
    +      || strncmp(path, "/mapper_", 8) ==0))
    +

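    Therefore, to make sure the subsequent steps go smoothly, we recommend that every node accessing the block device use the same symbolic link to the shared block device. For example, on the NBD server host, create a symlink with the new block device name /dev/nvme1n1 pointing to the original name /dev/vdb of the shared block device: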

    sudo ln -s /dev/vdb /dev/nvme1n1
    +

    On the NBD client host, create a symlink with the same block device name /dev/nvme1n1 pointing to the original name /dev/nbd0 of the shared block device:

    sudo ln -s /dev/nbd0 /dev/nvme1n1
    +

    Both the server and client hosts can then access the same block device under the same name, /dev/nvme1n1.
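To check a candidate name before creating the symlink, the prefix rules of the PFS_PATH_ISVALID macro quoted above can be mirrored in a shell function. This is only a sketch of those prefix checks, written for PFS-style paths such as /nvme1n1/shared_data; the function name `pfs_path_is_valid` is hypothetical and not part of PFS:

```shell
#!/bin/sh
# Hypothetical helper mirroring the prefix checks of PFS_PATH_ISVALID.
# $1: a PFS-style path, e.g. /nvme1n1/shared_data
# Returns 0 if the path starts with an accepted prefix, 1 otherwise.
pfs_path_is_valid() {
    case "$1" in
        /[0-9]*|.*|/pangu-*|/sd*|/sf*|/vd*|/nvme*|/loop*|/mapper_*) return 0 ;;
        *) return 1 ;;
    esac
}

pfs_path_is_valid /nvme1n1/shared_data && echo "accepted"
```

Under these rules a path starting with /nbd0 is rejected, which is why the NBD client device is given an accepted name such as nvme1n1 through a symlink.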

    Formatting the Block Device

    On any one host, format the shared block device with the PFS distributed file system:

    sudo pfs -C disk mkfs nvme1n1
    +

    Mounting the PFS File System

    On every host that can access the shared storage, start the PFS daemon to mount the PFS file system:

    sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
    +

    Compiling and Deploying PolarDB on PFS

    See PolarDB build and deployment: PFS file system.

    diff --git a/deploying/introduction.html b/deploying/introduction.html new file mode 100644 index 00000000000..af336b34a9b --- /dev/null +++ b/deploying/introduction.html @@ -0,0 +1,33 @@

    Architecture Overview

    棠羽

    2022/05/09

    5 min

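    PolarDB for PostgreSQL adopts a compute-storage separation architecture based on Shared-Storage. The database moves from the traditional Shared-Nothing architecture to a Shared-Storage architecture: from N copies of compute plus N copies of storage to N copies of compute plus one copy of storage. PostgreSQL, by contrast, uses the traditional monolithic database architecture, in which storage and compute are coupled.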

    software-level

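    To ensure that all compute nodes access the distributed block storage with the same visibility, PolarDB uses the distributed file system PolarDB File System (PFS) to access the block device; its design is described in the paper published at VLDB 2018 [1]. If all compute nodes can access the same block device locally, PFS is not required and a local single-node file system (such as ext4) can be used directly. This is one of the differences from PostgreSQL.


    1. PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database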

    diff --git a/deploying/quick-start.html b/deploying/quick-start.html new file mode 100644 index 00000000000..f9edad03614 --- /dev/null +++ b/deploying/quick-start.html @@ -0,0 +1,43 @@

    Quick Start

    棠羽

    2022/05/09

    5 min

    DANGER

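    To keep things simple, the postgres user inside the container has no password and is intended for evaluation only. In production or other high-security environments, be sure to set a strong password!

    A single machine that meets the following requirements is all you need to get started with PolarDB quickly:

    Pull the local-storage instance image of PolarDB for PostgreSQL from DockerHub, create and run a container, and try PolarDB-PG right away: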

    # Pull the PolarDB-PG image
    +docker pull polardb/polardb_pg_local_instance
    +# Create and run the container
    +docker run -it --rm polardb/polardb_pg_local_instance psql
    +# Check that the instance is available
    +postgres=# SELECT version();
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +
    diff --git a/deploying/storage-aliyun-essd.html b/deploying/storage-aliyun-essd.html new file mode 100644 index 00000000000..cc7d6ac5ab4 --- /dev/null +++ b/deploying/storage-aliyun-essd.html @@ -0,0 +1,38 @@

    Alibaba Cloud ECS + ESSD Cloud Disk Storage (video)

    棠羽

    2022/05/09

    20 min

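    Alibaba Cloud ESSD (Enhanced SSD) cloud disks combine a 25 GE network with RDMA technology and deliver up to 1 million random read/write IOPS per disk with low single-path latency. ESSD cloud disks support the NVMe protocol and can be attached simultaneously to multiple ECS (Elastic Compute Service) instances that support NVMe, allowing concurrent read/write access from multiple ECS instances with high reliability, high concurrency, and high performance. For more information, see the Alibaba Cloud ECS documentation.

    This guide walks you through the following steps:

    1. Deploy two Alibaba Cloud ECS instances as compute nodes
    2. Attach one ESSD cloud disk to both ECS instances as shared storage
    3. Format the ESSD shared storage with the distributed file system PFS
    4. On top of PFS, build a PolarDB cluster across the two ECS instances with compute-storage separation and read/write splitting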

    aliyun-ecs-procedure

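    Deploying Alibaba Cloud ECS

    First, prepare two or more Alibaba Cloud ECS instances. ECS currently places many restrictions on the instance types that support ESSD multi-attach; see the usage limits for details. Only ECS instances of certain types (ecs.g7se, ecs.c7se, ecs.r7se) in certain zones support ESSD multi-attach. As shown in the figure, be sure to choose an ECS type that supports multi-attach: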

    aliyun-ecs-specs

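    For the ECS storage configuration, the system disk can be of any storage type; do not select a data disk or shared disk for now. A separate ESSD cloud disk will be created later as the shared disk: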

    aliyun-ecs-system-disk

    As shown in the figure, create two ECS instances in the same zone:

    aliyun-ecs-instance

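    Preparing the ESSD Cloud Disk

    In the Alibaba Cloud ECS console, select Disks under Storage & Snapshots and click Create Disk. In the same zone as the ECS instances created earlier, create an ESSD cloud disk and check the multi-attach option. If your ECS instances do not meet the multi-attach restrictions, this checkbox does not appear.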

    aliyun-essd-specs

    After the ESSD cloud disk is created, the console shows that the disk supports multi-attach and that its status is Pending Attach:

    aliyun-essd-ready-to-mount

    Next, attach this cloud disk to each of the two ECS instances:

    aliyun-essd-mounting

    Once attached, the details of the cloud disk list the two ECS instances it is attached to:

    aliyun-essd-mounted

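    Check the Cloud Disk

    Connect to each of the two ECS instances over ssh and run lsblk. The output shows:

    • nvme0n1 is the 40 GB ECS system disk, private to each ECS instance
    • nvme1n1 is the 100 GB ESSD cloud disk, visible to both ECS instances at the same time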
    $ lsblk
    +NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    +nvme0n1     259:0    0   40G  0 disk
    +└─nvme0n1p1 259:1    0   40G  0 part /etc/hosts
    +nvme1n1     259:2    0  100G  0 disk
    +
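
    To confirm on each ECS instance that the shared disk is visible with the expected size, the lsblk output can be filtered. The helper below is a minimal sketch; the function name disk_size is an assumption, and it relies on the default lsblk column order shown above (NAME, MAJ:MIN, RM, SIZE, RO, TYPE, MOUNTPOINT).

```shell
# Hypothetical helper: read `lsblk` output on stdin and print the SIZE
# column of the whole disk named $1. Fails if no such disk is listed.
disk_size() {
    awk -v name="$1" '
        $1 == name && $6 == "disk" { print $4; found = 1 }
        END { exit (found ? 0 : 1) }'
}

# Usage on each ECS instance (both should report the same size):
#   lsblk | disk_size nvme1n1
```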

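    Prepare the Distributed File System

    Next, the PolarDB primary node and read-only node will be deployed on the two ECS instances. As a prerequisite, format and mount PFS on the ESSD block device shared by the ECS instances.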

    + + + diff --git a/deploying/storage-ceph.html b/deploying/storage-ceph.html new file mode 100644 index 00000000000..3b3c0cd4e14 --- /dev/null +++ b/deploying/storage-ceph.html @@ -0,0 +1,248 @@ + + + + + + + + + Ceph 共享存储 | PolarDB for PostgreSQL + + + + +

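    Ceph Shared Storage

    Ceph is a unified distributed storage system that is widely used in the storage field because of its good performance, reliability, and scalability. A Ceph deployment needs at least 2 physical or virtual machines to provide shared storage and data redundancy. This tutorial uses 3 virtual machines as an example to show how to build an instance on Ceph shared storage. The overall procedure is:

    1. Obtain three virtual machines on the same network segment and configure passwordless ssh login among them, which is used to synchronize Ceph keys and configuration;
    2. Start the mon process on the primary node, check its status, copy the configuration files to the other nodes, and start mon on them;
    3. Start the osd process on all three machines to configure the storage disks, and start the mgr and rgw processes on the primary node;
    4. Create a storage pool and an rbd block device image, then map the image on every node to share the block device;
    5. Format the block device with PolarFS and deploy PolarDB.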

    WARNING

    CentOS 7.5 or later is required. The following steps have been tested on CentOS 7.5.

    Environment Preparation

    The virtual machines used are:

    IP                  hostname
    +192.168.1.173       ceph001
    +192.168.1.174       ceph002
    +192.168.1.175       ceph003
    +

    Install docker

    TIP

    This tutorial uses the docker packages provided by the Alibaba Cloud mirror site.

    Install the docker dependencies

    yum install -y yum-utils device-mapper-persistent-data lvm2
    +

    Install and start docker

    yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
    +yum makecache
    +yum install -y docker-ce
    +
    +systemctl start docker
    +systemctl enable docker
    +

    Check that the installation succeeded

    docker run hello-world
    +

    Configure passwordless ssh login

    Generate and copy the key

    ssh-keygen
    +ssh-copy-id -i /root/.ssh/id_rsa.pub    root@ceph001
    +ssh-copy-id -i /root/.ssh/id_rsa.pub    root@ceph002
    +ssh-copy-id -i /root/.ssh/id_rsa.pub    root@ceph003
    +

    Check that the configuration succeeded

    ssh root@ceph003
    +

    Download ceph daemon

    docker pull ceph/daemon
    +

    Deploy mon

    Start the mon process on ceph001

    docker run -d \
    +    --net=host \
    +    --privileged=true \
    +    -v /etc/ceph:/etc/ceph \
    +    -v /var/lib/ceph/:/var/lib/ceph/ \
    +    -e MON_IP=192.168.1.173 \
    +    -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \
    +    --security-opt seccomp=unconfined \
    +    --name=mon01 \
    +    ceph/daemon mon
    +

    WARNING

    Adjust the IP address and the subnet prefix length to match your network.

    Check the container status

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     937ccded-3483-4245-9f61-e6ef0dbd85ca
    +    health: HEALTH_OK
    +
    +services:
    +    mon: 1 daemons, quorum ceph001 (age 26m)
    +    mgr: no daemons active
    +    osd: 0 osds: 0 up, 0 in
    +
    +data:
    +    pools:   0 pools, 0 pgs
    +    objects: 0 objects, 0 B
    +    usage:   0 B used, 0 B / 0 B avail
    +    pgs:
    +

    WARNING

    If you see the warning mon is allowing insecure global_id reclaim, resolve it with the following command.

    docker exec mon01 ceph config set mon auth_allow_insecure_global_id_reclaim false
    +

    Generate the required keyrings

    docker exec mon01 ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring
    +docker exec mon01 ceph auth get client.bootstrap-rgw -o /var/lib/ceph/bootstrap-rgw/ceph.keyring
    +

    Synchronize the configuration files

    ssh root@ceph002 mkdir -p /var/lib/ceph
    +scp -r /etc/ceph root@ceph002:/etc
    +scp -r /var/lib/ceph/bootstrap* root@ceph002:/var/lib/ceph
    +ssh root@ceph003 mkdir -p /var/lib/ceph
    +scp -r /etc/ceph root@ceph003:/etc
    +scp -r /var/lib/ceph/bootstrap* root@ceph003:/var/lib/ceph
    +
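
    The same three commands are repeated for every additional node, so they can be wrapped in a loop. The following is a minimal sketch: the function name sync_ceph_config and the DRY_RUN switch are assumptions added for illustration, while the ssh and scp commands are the ones listed above.

```shell
# Sketch: run the sync commands above for each host given as an argument.
# With DRY_RUN set to any non-empty value, the commands are only printed.
sync_ceph_config() {
    run=${DRY_RUN:+echo}
    for host in "$@"; do
        $run ssh "root@${host}" mkdir -p /var/lib/ceph
        $run scp -r /etc/ceph "root@${host}:/etc"
        $run scp -r /var/lib/ceph/bootstrap* "root@${host}:/var/lib/ceph"
    done
}

# Preview the commands for the two extra nodes without executing them.
DRY_RUN=1 sync_ceph_config ceph002 ceph003
```

    Dropping DRY_RUN=1 executes the commands; this assumes the passwordless ssh login configured earlier is working.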

    Start mon on ceph002 and ceph003

    docker run -d \
    +    --net=host \
    +    --privileged=true \
    +    -v /etc/ceph:/etc/ceph \
    +    -v /var/lib/ceph/:/var/lib/ceph/ \
    +    -e MON_IP=192.168.1.174 \
    +    -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \
    +    --security-opt seccomp=unconfined \
    +    --name=mon02 \
    +    ceph/daemon mon
    +
    +docker run -d \
    +    --net=host \
    +    --privileged=true \
    +    -v /etc/ceph:/etc/ceph \
    +    -v /var/lib/ceph/:/var/lib/ceph/ \
    +    -e MON_IP=192.168.1.175 \
    +    -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \
    +    --security-opt seccomp=unconfined \
    +    --name=mon03 \
    +    ceph/daemon mon
    +
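
    Because the docker run command is repeated on each node with a different MON_IP, a mistyped address is easy to miss. The sketch below is a hypothetical is_ipv4 helper (not part of Ceph or docker) that validates a dotted-quad address before it is passed to the container.

```shell
# Hypothetical helper: succeed only for a dotted-quad IPv4 address whose
# four octets are decimal numbers of at most three digits in the range 0..255.
is_ipv4() {
    addr=$1
    count=0
    while :; do
        octet=${addr%%.*}
        case "$octet" in
            ''|*[!0-9]*|????*) return 1 ;;   # empty, non-numeric, or too long
        esac
        [ "$octet" -le 255 ] || return 1
        count=$((count + 1))
        case "$addr" in
            *.*) addr=${addr#*.} ;;          # drop the octet just checked
            *)   break ;;
        esac
    done
    [ "$count" -eq 4 ]
}

# Usage: refuse to start mon when the address is malformed.
#   is_ipv4 "$MON_IP" || { echo "bad MON_IP: $MON_IP" >&2; exit 1; }
```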

    Check the current cluster status

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     937ccded-3483-4245-9f61-e6ef0dbd85ca
    +    health: HEALTH_OK
    +
    +services:
    +    mon: 3 daemons, quorum ceph001,ceph002,ceph003 (age 35s)
    +    mgr: no daemons active
    +    osd: 0 osds: 0 up, 0 in
    +
    +data:
    +    pools:   0 pools, 0 pgs
    +    objects: 0 objects, 0 B
    +    usage:   0 B used, 0 B / 0 B avail
    +    pgs:
    +
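
    When this check is scripted, the two values of interest (overall health and the number of mon daemons) can be pulled out of the status text. The helpers below are a sketch; their names are assumptions, and they rely only on the line layout shown in the sample output above.

```shell
# Hypothetical helpers that read `ceph -s` output on stdin.
ceph_health()    { awk '$1 == "health:" { print $2; exit }'; }
ceph_mon_count() { awk '$1 == "mon:"    { print $2; exit }'; }

# Usage:
#   docker exec mon01 ceph -s | ceph_health
#   docker exec mon01 ceph -s | ceph_mon_count
```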

    WARNING

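    In the mon line of the output, check that the mon daemons created on the other two nodes have joined.

    Deploy osd

    osd preparation stage

    TIP

    Each virtual machine in this environment has only one available disk, /dev/vdb, so only one osd is created per virtual machine.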

    docker run --rm --privileged=true --net=host --ipc=host \
    +    --security-opt seccomp=unconfined \
    +    -v /run/lock/lvm:/run/lock/lvm:z \
    +    -v /var/run/udev/:/var/run/udev/:z \
    +    -v /dev:/dev -v /etc/ceph:/etc/ceph:z \
    +    -v /run/lvm/:/run/lvm/ \
    +    -v /var/lib/ceph/:/var/lib/ceph/:z \
    +    -v /var/log/ceph/:/var/log/ceph/:z \
    +    --entrypoint=ceph-volume \
    +    docker.io/ceph/daemon \
    +    --cluster ceph lvm prepare --bluestore --data /dev/vdb
    +

    WARNING

    The command above is the same on all three nodes; only the disk name needs to be adjusted.

    osd activation stage

    docker run -d --privileged=true --net=host --pid=host --ipc=host \
    +    --security-opt seccomp=unconfined \
    +    -v /dev:/dev \
    +    -v /etc/localtime:/etc/localtime:ro \
    +    -v /var/lib/ceph:/var/lib/ceph:z \
    +    -v /etc/ceph:/etc/ceph:z \
    +    -v /var/run/ceph:/var/run/ceph:z \
    +    -v /var/run/udev/:/var/run/udev/ \
    +    -v /var/log/ceph:/var/log/ceph:z \
    +    -v /run/lvm/:/run/lvm/ \
    +    -e CLUSTER=ceph \
    +    -e CEPH_DAEMON=OSD_CEPH_VOLUME_ACTIVATE \
    +    -e CONTAINER_IMAGE=docker.io/ceph/daemon \
    +    -e OSD_ID=0 \
    +    --name=ceph-osd-0 \
    +    docker.io/ceph/daemon
    +

    WARNING

    Each node needs its own OSD_ID and name. OSD_ID starts at 0 and increases, so the other nodes use OSD_ID=1 and OSD_ID=2.
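
    The per-node values can be listed before running the activation command on each machine. This is a minimal sketch: the function name is an assumption, and the ceph-osd-N container naming simply follows the --name=ceph-osd-0 pattern used above.

```shell
# Sketch: print the OSD_ID and container name to use on each host,
# numbering the hosts from 0 in the order they are given.
print_osd_plan() {
    id=0
    for host in "$@"; do
        printf '%s OSD_ID=%d --name=ceph-osd-%d\n' "$host" "$id" "$id"
        id=$((id + 1))
    done
}

print_osd_plan ceph001 ceph002 ceph003
```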

    Check the cluster status

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     e430d054-dda8-43f1-9cda-c0881b782e17
    +    health: HEALTH_WARN
    +            no active mgr
    +
    +services:
    +    mon: 3 daemons, quorum ceph001,ceph002,ceph003 (age 44m)
    +    mgr: no daemons active
    +    osd: 3 osds: 3 up (since 7m), 3 in (since 13m)
    +
    +data:
    +    pools:   0 pools, 0 pgs
    +    objects: 0 objects, 0 B
    +    usage:   0 B used, 0 B / 0 B avail
    +    pgs:
    +

    Deploy mgr, mds, and rgw

    Run all of the following commands on ceph001:

    docker run -d --net=host \
    +    --privileged=true \
    +    --security-opt seccomp=unconfined \
    +    -v /etc/ceph:/etc/ceph \
    +    -v /var/lib/ceph/:/var/lib/ceph/ \
    +    --name=ceph-mgr-0 \
    +    ceph/daemon mgr
    +
    +docker run -d --net=host \
    +    --privileged=true \
    +    --security-opt seccomp=unconfined \
    +    -v /var/lib/ceph/:/var/lib/ceph/ \
    +    -v /etc/ceph:/etc/ceph \
    +    -e CEPHFS_CREATE=1 \
    +    --name=ceph-mds-0 \
    +    ceph/daemon mds
    +
    +docker run -d --net=host \
    +    --privileged=true \
    +    --security-opt seccomp=unconfined \
    +    -v /var/lib/ceph/:/var/lib/ceph/ \
    +    -v /etc/ceph:/etc/ceph \
    +    --name=ceph-rgw-0 \
    +    ceph/daemon rgw
    +

    Check the cluster status:

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     e430d054-dda8-43f1-9cda-c0881b782e17
    +    health: HEALTH_OK
    +
    +services:
    +    mon: 3 daemons, quorum ceph001,ceph002,ceph003 (age 92m)
    +    mgr: ceph001(active, since 25m)
    +    mds: 1/1 daemons up
    +    osd: 3 osds: 3 up (since 54m), 3 in (since 60m)
    +    rgw: 1 daemon active (1 hosts, 1 zones)
    +
    +data:
    +    volumes: 1/1 healthy
    +    pools:   7 pools, 145 pgs
    +    objects: 243 objects, 7.2 KiB
    +    usage:   50 MiB used, 2.9 TiB / 2.9 TiB avail
    +    pgs:     145 active+clean
    +

    Create the rbd block device

    TIP

    Run all of the following commands inside the mon01 container.

    Create the storage pool

    docker exec -it mon01 bash
    +ceph osd pool create rbd_polar
    +

    Create an image and view its information

    rbd create --size 512000 rbd_polar/image02
    +rbd info rbd_polar/image02
    +
    +rbd image 'image02':
    +size 500 GiB in 128000 objects
    +order 22 (4 MiB objects)
    +snapshot_count: 0
    +id: 13b97b252c5d
    +block_name_prefix: rbd_data.13b97b252c5d
    +format: 2
    +features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    +op_features:
    +flags:
    +create_timestamp: Thu Oct 28 06:18:07 2021
    +access_timestamp: Thu Oct 28 06:18:07 2021
    +modify_timestamp: Thu Oct 28 06:18:07 2021
    +
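
    The numbers in the rbd info output follow from the create command: --size is interpreted in MiB by default, and order 22 means each object is 2^22 bytes (4 MiB). The following sketch reproduces that arithmetic; the function name is an assumption.

```shell
# Worked arithmetic: image size in MiB and object order -> GiB and object count.
rbd_size_summary() {
    size_mib=$1
    order=$2                                        # object size is 2^order bytes
    gib=$((size_mib / 1024))
    objects=$((size_mib * 1048576 / (1 << order)))  # total bytes / bytes per object
    printf '%d GiB in %d objects\n' "$gib" "$objects"
}

rbd_size_summary 512000 22
```

    For 512000 MiB this gives 512000 / 1024 = 500 GiB and 512000 / 4 = 128000 objects, matching the size line above.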

    Map the image

    modprobe rbd # load the kernel module; run this on the host
    +rbd map rbd_polar/image02
    +
    +rbd: sysfs write failed
    +RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable rbd_polar/image02 object-map fast-diff deep-flatten".
    +In some cases useful info is found in syslog -  try "dmesg | tail".
    +rbd: map failed: (6) No such device or address
    +

    WARNING

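    Some image features are not supported by the kernel and must be disabled before the image can be mapped. Proceed as follows: disable the unsupported rbd features, map the image again, and list the mapped devices.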

    rbd feature disable rbd_polar/image02 object-map fast-diff deep-flatten
    +rbd map rbd_polar/image02
    +rbd device list
    +
    +id  pool       namespace  image    snap  device
    +0   rbd_polar             image01  -     /dev/rbd0
    +1   rbd_polar             image02  -     /dev/rbd1
    +

    TIP

    An image01 had already been mapped here, so two entries are listed.
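
    When several images are mapped, the device path of a particular image can be looked up from the list. The helper below is a sketch: its name is an assumption, and it assumes the namespace column is empty, as in the output above, so that the image name is the third whitespace-separated field.

```shell
# Hypothetical helper: read `rbd device list` output on stdin and print the
# device path for pool $1 and image $2 (assumes an empty namespace column).
rbd_device_for() {
    awk -v pool="$1" -v image="$2" '
        NR > 1 && $2 == pool && $3 == image { print $NF; found = 1 }
        END { exit (found ? 0 : 1) }'
}

# Usage:
#   rbd device list | rbd_device_for rbd_polar image02
```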

    View the block devices

    Return to the host (outside the container) and list the block devices in the system:

    lsblk
    +
    +NAME                                                               MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
    +vda                                                                253:0    0  500G  0 disk
    +└─vda1                                                             253:1    0  500G  0 part /
    +vdb                                                                253:16   0 1000G  0 disk
    +└─ceph--7eefe77f--c618--4477--a1ed--b4f44520dfc2-osd--block--bced3ff1--42b9--43e1--8f63--e853bce41435
    +                                                                    252:0    0 1000G  0 lvm
    +rbd0                                                               251:0    0  100G  0 disk
    +rbd1                                                               251:16   0  500G  0 disk
    +

    WARNING

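    A block device image must be mapped on each node before it appears in the local lsblk output; otherwise it is not shown. The mapping commands on ceph002 and ceph003 are the same as above.


    Prepare the Distributed File System

    See Format and Mount PFS.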

    + + + diff --git a/deploying/storage-curvebs.html b/deploying/storage-curvebs.html new file mode 100644 index 00000000000..649345fc869 --- /dev/null +++ b/deploying/storage-curvebs.html @@ -0,0 +1,188 @@ + + + + + + + + + CurveBS 共享存储 | PolarDB for PostgreSQL + + + + +

    CurveBS Shared Storage (video)

    棠羽

    2022/08/31

    30 min

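    Curve is a high-performance, easy-to-operate, cloud-native open-source distributed storage system. It can be used with mainstream cloud-native infrastructure platforms:

    • It integrates with OpenStack to provide high-performance block storage for cloud hosts;
    • It integrates with Kubernetes to provide persistent volumes such as RWO and RWX;
    • It integrates with PolarFS as a high-performance storage foundation for cloud-native databases, fully supporting their compute-storage separation architecture.

    Curve can also act as cloud storage middleware, using S3-compatible object storage as its data storage engine to offer public cloud users cost-effective shared file storage.

    This example walks you through deploying PolarDB for PostgreSQL with CurveBS as block storage. For more advanced configuration and usage, see the wiki of the Curve project.

    Device Preparation

    curve-cluster

    As shown in the figure, this example uses six servers in total. One control server and three storage servers together form the CurveBS cluster, which is exposed as a single shared storage service. The remaining two servers run the read-write node and the read-only node of PolarDB for PostgreSQL, and share the block storage device exposed by CurveBS.

    This example uses Alibaba Cloud ECS to simulate all six servers. All six ECS instances run Anolis OS 8.6 (compatible with CentOS 8.6), use the root user, and are on the same LAN segment. The required preparation includes:

    1. Install Docker on all machines (see the official Docker documentation)
    2. On the Curve control machine, configure passwordless SSH login to the other five servers

    Install CurveAdm on the Control Machine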

    bash -c "$(curl -fsSL https://curveadm.nos-eastchina1.126.net/script/install.sh)"
    +source /root/.bash_profile
    +

    Import the Host List

    On the control machine, edit the host list file:

    vim hosts.yaml
    +

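    The file contains the IP addresses of the other five servers and their names within the Curve cluster, where:

    • Three hosts are Curve storage node hosts
    • Two hosts are PolarDB for PostgreSQL compute node hosts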
    global:
    +  user: root
    +  ssh_port: 22
    +  private_key_file: /root/.ssh/id_rsa
    +
    +hosts:
    +  # Curve worker nodes
    +  - host: server-host1
    +    hostname: 172.16.0.223
    +  - host: server-host2
    +    hostname: 172.16.0.224
    +  - host: server-host3
    +    hostname: 172.16.0.225
    +  # PolarDB nodes
    +  - host: polardb-primary
    +    hostname: 172.16.0.226
    +  - host: polardb-replica
    +    hostname: 172.16.0.227
    +

    Import the host list:

    curveadm hosts commit hosts.yaml
    +

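    Format the Disks

    Prepare the disk list, and generate in advance a batch of fixed-size, pre-written chunk files. The disk list must include:

    • All storage node hosts to be formatted
    • The common block device name on each host (/dev/vdb in this example)
    • The mount point to be used
    • The format percentage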
    vim format.yaml
    +
    host:
    +  - server-host1
    +  - server-host2
    +  - server-host3
    +disk:
    +  - /dev/vdb:/data/chunkserver0:90 # device:mount_path:format_percent
    +

    Start formatting. The control machine starts one formatting container per block device on each storage node host.

    $ curveadm format -f format.yaml
    +Start Format Chunkfile Pool: ⠸
    +  + host=server-host1  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [0/1] ⠸
    +  + host=server-host2  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [0/1] ⠸
    +  + host=server-host3  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [0/1] ⠸
    +

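    When OK is displayed, the formatting container has started, but this does not mean that formatting has finished. Formatting is a lengthy process and will keep running for some time: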

    Start Format Chunkfile Pool: [OK]
    +  + host=server-host1  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [1/1] [OK]
    +  + host=server-host2  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [1/1] [OK]
    +  + host=server-host3  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [1/1] [OK]
    +

    The following command shows the formatting progress; at this point formatting is still under way:

    $ curveadm format --status
    +Get Format Status: [OK]
    +
    +Host          Device    MountPoint          Formatted  Status
    +----          ------    ----------          ---------  ------
    +server-host1  /dev/vdb  /data/chunkserver0  19/90      Formatting
    +server-host2  /dev/vdb  /data/chunkserver0  22/90      Formatting
    +server-host3  /dev/vdb  /data/chunkserver0  22/90      Formatting
    +

    Output after formatting completes:

    $ curveadm format --status
    +Get Format Status: [OK]
    +
    +Host          Device    MountPoint          Formatted  Status
    +----          ------    ----------          ---------  ------
    +server-host1  /dev/vdb  /data/chunkserver0  95/90      Done
    +server-host2  /dev/vdb  /data/chunkserver0  95/90      Done
    +server-host3  /dev/vdb  /data/chunkserver0  95/90      Done
    +
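
    Since formatting continues in the background, a script may want to wait until every row reports Done before deploying the cluster. The helper below is a sketch: its name is an assumption, and it relies only on the table layout shown above (a dashed separator line followed by one row per device, with the status in the last column).

```shell
# Hypothetical helper: read `curveadm format --status` output on stdin and
# succeed only if at least one row exists and every row has status Done.
format_done() {
    awk '
        /^----/ { in_rows = 1; next }            # rows start after the separator
        in_rows && NF > 0 { rows++; if ($NF != "Done") pending++ }
        END { exit ((rows > 0 && pending == 0) ? 0 : 1) }'
}

# Usage: poll until formatting has finished on all hosts.
#   until curveadm format --status | format_done; do sleep 30; done
```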

    Deploy the CurveBS Cluster

    First, prepare the cluster configuration file:

    vim topology.yaml
    +

    Paste the following configuration:

    kind: curvebs
    +global:
    +  container_image: opencurvedocker/curvebs:v1.2
    +  log_dir: ${home}/logs/${service_role}${service_replicas_sequence}
    +  data_dir: ${home}/data/${service_role}${service_replicas_sequence}
    +  s3.nos_address: 127.0.0.1
    +  s3.snapshot_bucket_name: curve
    +  s3.ak: minioadmin
    +  s3.sk: minioadmin
    +  variable:
    +    home: /tmp
    +    machine1: server-host1
    +    machine2: server-host2
    +    machine3: server-host3
    +
    +etcd_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 2380
    +    listen.client_port: 2379
    +  deploy:
    +    - host: ${machine1}
    +    - host: ${machine2}
    +    - host: ${machine3}
    +
    +mds_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 6666
    +    listen.dummy_port: 6667
    +  deploy:
    +    - host: ${machine1}
    +    - host: ${machine2}
    +    - host: ${machine3}
    +
    +chunkserver_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 82${format_replicas_sequence} # 8200,8201,8202
    +    data_dir: /data/chunkserver${service_replicas_sequence} # /data/chunkserver0, /data/chunkserver1
    +    copysets: 100
    +  deploy:
    +    - host: ${machine1}
    +      replicas: 1
    +    - host: ${machine2}
    +      replicas: 1
    +    - host: ${machine3}
    +      replicas: 1
    +
    +snapshotclone_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 5555
    +    listen.dummy_port: 8081
    +    listen.proxy_port: 8080
    +  deploy:
    +    - host: ${machine1}
    +    - host: ${machine2}
    +    - host: ${machine3}
    +

    Create the cluster my-cluster from the topology file above:

    curveadm cluster add my-cluster -f topology.yaml
    +

    Switch the currently managed cluster to my-cluster:

    curveadm cluster checkout my-cluster
    +

    Deploy the cluster. If deployment succeeds, output similar to Cluster 'my-cluster' successfully deployed ^_^. is printed.

    $ curveadm deploy --skip snapshotclone
    +
    +...
    +Create Logical Pool: [OK]
    +  + host=server-host1  role=mds  containerId=c6fdd71ae678 [1/1] [OK]
    +
    +Start Service: [OK]
    +  + host=server-host1  role=snapshotclone  containerId=9d3555ba72fa [1/1] [OK]
    +  + host=server-host2  role=snapshotclone  containerId=e6ae2b23b57e [1/1] [OK]
    +  + host=server-host3  role=snapshotclone  containerId=f6d3446c7684 [1/1] [OK]
    +
    +Balance Leader: [OK]
    +  + host=server-host1  role=mds  containerId=c6fdd71ae678 [1/1] [OK]
    +
    +Cluster 'my-cluster' successfully deployed ^_^.
    +

    Check the cluster status:

    $ curveadm status
    +Get Service Status: [OK]
    +
    +cluster name      : my-cluster
    +cluster kind      : curvebs
    +cluster mds addr  : 172.16.0.223:6666,172.16.0.224:6666,172.16.0.225:6666
    +cluster mds leader: 172.16.0.225:6666 / d0a94a7afa14
    +
    +Id            Role         Host          Replicas  Container Id  Status
    +--            ----         ----          --------  ------------  ------
    +5567a1c56ab9  etcd         server-host1  1/1       f894c5485a26  Up 17 seconds
    +68f9f0e6f108  etcd         server-host2  1/1       69b09cdbf503  Up 17 seconds
    +a678263898cc  etcd         server-host3  1/1       2ed141800731  Up 17 seconds
    +4dcbdd08e2cd  mds          server-host1  1/1       76d62ff0eb25  Up 17 seconds
    +8ef1755b0a10  mds          server-host2  1/1       d8d838258a6f  Up 17 seconds
    +f3599044c6b5  mds          server-host3  1/1       d63ae8502856  Up 17 seconds
    +9f1d43bc5b03  chunkserver  server-host1  1/1       39751a4f49d5  Up 16 seconds
    +3fb8fd7b37c1  chunkserver  server-host2  1/1       0f55a19ed44b  Up 16 seconds
    +c4da555952e3  chunkserver  server-host3  1/1       9411274d2c97  Up 16 seconds
    +

    Deploy the CurveBS Client

    On the Curve control machine, edit the client configuration file:

    vim client.yaml
    +

    Note that mds.listen.addr must be set to the cluster mds addr value shown in the cluster status output of the previous step:

    kind: curvebs
    +container_image: opencurvedocker/curvebs:v1.2
    +mds.listen.addr: 172.16.0.223:6666,172.16.0.224:6666,172.16.0.225:6666
    +log_dir: /root/curvebs/logs/client
    +
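
    Rather than copying the address list by hand, it can be extracted from the status output. The helper below is a sketch; its name is an assumption, and it relies only on the cluster mds addr line format shown in the status output earlier.

```shell
# Hypothetical helper: read `curveadm status` output on stdin and print the
# value of the "cluster mds addr" line.
mds_addr() {
    sed -n 's/^cluster mds addr *: *//p'
}

# Usage:
#   curveadm status | mds_addr
```

    The printed value is what goes after mds.listen.addr: in client.yaml.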

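    Prepare the Distributed File System

    Next, the PolarDB primary node and read-only node will be deployed on the two ECS instances that serve as PolarDB compute nodes. As a prerequisite, these two nodes must be able to share the CurveBS block device, and PFS must be formatted and mounted on that block device.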

    + + + diff --git a/deploying/storage-nbd.html b/deploying/storage-nbd.html new file mode 100644 index 00000000000..734a106a07d --- /dev/null +++ b/deploying/storage-nbd.html @@ -0,0 +1,66 @@ + + + + + + + + + NBD 共享存储 | PolarDB for PostgreSQL + + + + +

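    NBD Shared Storage

    Network Block Device (NBD) is a network protocol that allows a block storage device to be shared among multiple hosts. NBD uses a client-server architecture, so at least two physical machines are needed for deployment.

    Using two physical machines as an example, this section describes how to build an instance on NBD shared storage. The overall procedure is:

    • First, the two hosts share a block device through NBD;
    • Then, PolarDB File System (PFS) is deployed on both hosts to initialize and mount the same block device;
    • Finally, the PolarDB for PostgreSQL kernel is deployed on both hosts to build a primary node and a read-only node, forming a simple one-writer, multiple-reader instance.

    WARNING

    The steps above have been tested on CentOS 7.5.

    Install NBD

    Download and install the NBD driver for the operating system

    TIP

    The operating system kernel must support the NBD kernel module. If it currently does not, you need to compile and load the NBD kernel module yourself for the corresponding kernel version.

    Download the driver source package for the corresponding kernel version from the CentOS website and extract it: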

    rpm -ihv kernel-3.10.0-862.el7.src.rpm
    +cd ~/rpmbuild/SOURCES
    +tar Jxvf linux-3.10.0-862.el7.tar.xz -C /usr/src/kernels/
    +cd /usr/src/kernels/linux-3.10.0-862.el7/
    +

    The NBD driver source is located at drivers/block/nbd.c. Next, build the kernel dependencies and components:

    cp ../$(uname -r)/Module.symvers ./
    +make menuconfig # Device Driver -> Block devices -> Set 'M' On 'Network block device support'
    +make prepare && make modules_prepare && make scripts
    +make CONFIG_BLK_DEV_NBD=m M=drivers/block
    +

    Check that the driver was built correctly:

    modinfo drivers/block/nbd.ko
    +

    Copy the driver, generate the module dependencies, and install it:

    cp drivers/block/nbd.ko /lib/modules/$(uname -r)/kernel/drivers/block
    +depmod -a
    +modprobe nbd # or use modprobe -f nbd to skip the module version check
    +

    Check that the installation succeeded:

    # List the installed kernel module
    +lsmod | grep nbd
    +# If the NBD driver is installed, /dev/nbd* devices are created (for example /dev/nbd0, /dev/nbd1)
    +ls /dev/nbd*
    +

    Install the NBD package

    yum install nbd
    +

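    Share a Block Device with NBD

    Server deployment

    Start the NBD server, configured in synchronous mode (sync/flush=true), so that it listens on the specified port (for example 1921) for access to the specified block device (for example /dev/vdb).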

    nbd-server -C /root/nbd.conf
    +

    An example of the configuration file /root/nbd.conf:

    [generic]
    +    #user = nbd
    +    #group = nbd
    +    listenaddr = 0.0.0.0
    +    port = 1921
    +[export1]
    +    exportname = /dev/vdb
    +    readonly = false
    +    multifile = false
    +    copyonwrite = false
    +    flush = true
    +    fua = true
    +    sync = true
    +
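
    Because the synchronous settings matter for a shared block device, it can be useful to verify them before starting the server. The helper below is a sketch: its name is an assumption, and it only understands the simple "key = value" layout used in the example file above.

```shell
# Hypothetical helper: print the value of key $3 in section [$2] of the
# INI-style config file $1 (lines of the form "key = value").
conf_get() {
    awk -v section="[$2]" -v key="$3" '
        /^\[/ { current = $1; next }                       # track the current section
        current == section && $1 == key && $2 == "=" { print $3; exit }
    ' "$1"
}

# Usage: all three should print "true" for the example configuration.
#   for key in flush fua sync; do conf_get /root/nbd.conf export1 "$key"; done
```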

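    Client deployment

    After the NBD driver is installed, the /dev/nbd* devices are visible. Map the remote block device to a local NBD device according to the server configuration:

    nbd-client x.x.x.x 1921 -N export1 /dev/nbd0
    +# x.x.x.x is the IP address of the NBD server host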
    +

    Prepare the Distributed File System

    See Format and Mount PFS.

    + + + diff --git a/development/customize-dev-env.html b/development/customize-dev-env.html new file mode 100644 index 00000000000..38d22c50053 --- /dev/null +++ b/development/customize-dev-env.html @@ -0,0 +1,186 @@ + + + + + + + + + 定制开发环境 | PolarDB for PostgreSQL + + + + +

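    Customize the Development Environment

    Build the Development Image Yourself

    A prebuilt development image, polardb/polardb_pg_devel, is available on DockerHub for direct use (it supports both the linux/amd64 and linux/arm64 architectures).

    We also provide the Dockerfile used to build this image. Starting from the official Ubuntu image ubuntu:20.04, it builds an image with all development and runtime dependencies installed. You can add more dependencies to the Dockerfile as needed. The Dockerfile and the manual build procedure are as follows: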

    FROM ubuntu:20.04
    +LABEL maintainer="mrdrivingduck@gmail.com"
    +CMD bash
    +
    +# Timezone problem
    +ENV TZ=Asia/Shanghai
    +RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
    +
    +# Upgrade software
    +RUN apt update -y && \
    +    apt upgrade -y && \
    +    apt clean -y
    +
    +# GCC (force to 9) and LLVM (force to 11)
    +RUN apt install -y \
    +        gcc-9 \
    +        g++-9 \
    +        llvm-11-dev \
    +        clang-11 \
    +        make \
    +        gdb \
    +        pkg-config \
    +        locales && \
    +    update-alternatives --install \
    +        /usr/bin/gcc gcc /usr/bin/gcc-9 60 --slave \
    +        /usr/bin/g++ g++ /usr/bin/g++-9 && \
    +    update-alternatives --install \
    +        /usr/bin/llvm-config llvm-config /usr/bin/llvm-config-11 60 --slave \
    +        /usr/bin/clang++ clang++ /usr/bin/clang++-11 --slave \
    +        /usr/bin/clang clang /usr/bin/clang-11 && \
    +    apt clean -y
    +
    +# Generate locale
    +RUN sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && \
    +    sed -i '/zh_CN.UTF-8/s/^# //g' /etc/locale.gen && \
    +    locale-gen
    +
    +# Dependencies
    +RUN apt install -y \
    +        libicu-dev \
    +        bison \
    +        flex \
    +        python3-dev \
    +        libreadline-dev \
    +        libgss-dev \
    +        libssl-dev \
    +        libpam0g-dev \
    +        libxml2-dev \
    +        libxslt1-dev \
    +        libldap2-dev \
    +        uuid-dev \
    +        liblz4-dev \
    +        libkrb5-dev \
    +        gettext \
    +        libxerces-c-dev \
    +        tcl-dev \
    +        libperl-dev \
    +        libipc-run-perl \
    +        libaio-dev \
    +        libfuse-dev && \
    +    apt clean -y
    +
    +# Tools
    +RUN apt install -y \
    +        iproute2 \
    +        wget \
    +        ccache \
    +        sudo \
    +        vim \
    +        git \
    +        cmake && \
    +    apt clean -y
    +
    +# leave empty if GitHub is reachable without a proxy
    +# ENV GITHUB_PROXY=https://ghproxy.com/
    +ENV GITHUB_PROXY=
    +
    +ENV ZLOG_VERSION=1.2.14
    +ENV PFSD_VERSION=pfsd4pg-release-1.2.42-20220419
    +
    +# install dependencies from GitHub mirror
    +RUN cd /usr/local && \
    +    # zlog for PFSD
    +    wget --no-verbose --no-check-certificate "${GITHUB_PROXY}https://github.com/HardySimpson/zlog/archive/refs/tags/${ZLOG_VERSION}.tar.gz" && \
    +    # PFSD
    +    wget --no-verbose --no-check-certificate "${GITHUB_PROXY}https://github.com/ApsaraDB/PolarDB-FileSystem/archive/refs/tags/${PFSD_VERSION}.tar.gz" && \
    +    # unzip and install zlog
    +    gzip -d $ZLOG_VERSION.tar.gz && \
    +    tar xpf $ZLOG_VERSION.tar && \
    +    cd zlog-$ZLOG_VERSION && \
    +    make && make install && \
    +    echo '/usr/local/lib' >> /etc/ld.so.conf && ldconfig && \
    +    cd .. && \
    +    rm -rf $ZLOG_VERSION* && \
    +    rm -rf zlog-$ZLOG_VERSION && \
    +    # unzip and install PFSD
    +    gzip -d $PFSD_VERSION.tar.gz && \
    +    tar xpf $PFSD_VERSION.tar && \
    +    cd PolarDB-FileSystem-$PFSD_VERSION && \
    +    sed -i 's/-march=native //' CMakeLists.txt && \
    +    ./autobuild.sh && ./install.sh && \
    +    cd .. && \
    +    rm -rf $PFSD_VERSION* && \
    +    rm -rf PolarDB-FileSystem-$PFSD_VERSION*
    +
    +# create default user
    +ENV USER_NAME=postgres
    +RUN echo "create default user" && \
    +    groupadd -r $USER_NAME && \
    +    useradd -ms /bin/bash -g $USER_NAME $USER_NAME -p '' && \
    +    usermod -aG sudo $USER_NAME
    +
    +# modify conf
    +RUN echo "modify conf" && \
    +    mkdir -p /var/log/pfs && chown $USER_NAME /var/log/pfs && \
    +    mkdir -p /var/run/pfs && chown $USER_NAME /var/run/pfs && \
    +    mkdir -p /var/run/pfsd && chown $USER_NAME /var/run/pfsd && \
    +    mkdir -p /dev/shm/pfsd && chown $USER_NAME /dev/shm/pfsd && \
    +    touch /var/run/pfsd/.pfsd && \
    +    echo "ulimit -c unlimited" >> /home/postgres/.bashrc && \
    +    echo "export PGHOST=127.0.0.1" >> /home/postgres/.bashrc && \
    +    echo "alias pg='psql -h /home/postgres/tmp_master_dir_polardb_pg_1100_bld/'" >> /home/postgres/.bashrc
    +
    +ENV PATH="/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin:$PATH"
    +WORKDIR /home/$USER_NAME
    +USER $USER_NAME
    +

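    Copy the content above into a file (for example, Dockerfile-PolarDB), then build the image with the following command:

    TIP

    💡 In the highlighted line below, replace <image_name> with the Docker image name you want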

    docker build --network=host \
    +    -t <image_name> \
    +    -f Dockerfile-PolarDB .
    +

     

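    Set Up the Development Environment from a Clean System

    This approach assumes you start from scratch on a clean CentOS 7 system with root privileges, which can be:

    • A physical or virtual machine with CentOS 7 installed
    • A Docker container started from the official CentOS 7 Docker image centos:centos7

    Create a Non-root User

    PolarDB for PostgreSQL must run as a non-root user. The following steps help you create a user group named postgres and a user named postgres.

    TIP

    If you already have a non-root user whose name is not postgres:postgres, you can skip this step. In the later example steps, remember to replace the user-related parts of the commands with your own group and user names.

    The commands below create the group postgres and the user postgres, and grant that user sudo privileges and ownership of its home directory. Run them as root.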

    # install sudo
    +yum install -y sudo
    +# create user and group
    +groupadd -r postgres
    +useradd -m -g postgres postgres -p ''
    +usermod -aG wheel postgres
    +# make postgres as sudoer
    +chmod u+w /etc/sudoers
    +echo 'postgres ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers
    +chmod u-w /etc/sudoers
    +# grant access to home directory
    +chown -R postgres:postgres /home/postgres/
    +echo 'source /etc/bashrc' >> /home/postgres/.bashrc
    +# for su postgres
    +sed -i 's/4096/unlimited/g' /etc/security/limits.d/20-nproc.conf
    +

    Next, switch to the postgres user to continue with the remaining steps:

    su postgres
    +source /etc/bashrc
    +cd ~
    +

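    Install Dependencies

    The root directory of the PolarDB for PostgreSQL source tree contains an install_dependencies.sh script that installs all the dependencies PolarDB and PFS need to run. So first clone the PolarDB source repository.

    The PolarDB for PostgreSQL code is hosted on GitHub, and the stable branch is POLARDB_11_STABLE. If GitHub is not reliably reachable from your network, you can use the Gitee mirror in China instead.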

    # From GitHub
    +sudo yum install -y git
    +git clone -b POLARDB_11_STABLE https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git
    +
    # Or from the Gitee mirror
    +sudo yum install -y git
    +git clone -b POLARDB_11_STABLE https://gitee.com/mirrors/PolarDB-for-PostgreSQL
    +

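    After the source is downloaded, run the dependency installation script install_dependencies.sh in the source root with sudo to install all dependencies automatically. If you have custom development needs, modify install_dependencies.sh yourself.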

    cd PolarDB-for-PostgreSQL
    +sudo ./install_dependencies.sh
    +
    + + + diff --git a/development/dev-on-docker.html b/development/dev-on-docker.html new file mode 100644 index 00000000000..2e735d719ac --- /dev/null +++ b/development/dev-on-docker.html @@ -0,0 +1,63 @@ + + + + + + + + + 基于 Docker 容器开发 | PolarDB for PostgreSQL + + + + +

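    Developing in a Docker Container

    DANGER

    To keep things simple, the postgres user inside the container has no password and is intended for trial use only. In production or any other security-sensitive environment, be sure to set a strong password!

    Download the Source Code on the Development Machine

    Download the PolarDB for PostgreSQL source code from GitHub; the stable branch is POLARDB_11_STABLE. If GitHub is not reliably reachable from your network, you can use the Gitee mirror in China instead.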

    # From GitHub
    +git clone -b POLARDB_11_STABLE https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git
    +
    # Or from the Gitee mirror
    +git clone -b POLARDB_11_STABLE https://gitee.com/mirrors/PolarDB-for-PostgreSQL
    +

    After cloning, enter the source directory:

    cd PolarDB-for-PostgreSQL/
    +

    Pull the Development Image

    Pull the development image of PolarDB for PostgreSQL from DockerHub.

    # Pull the PolarDB development image
    +docker pull polardb/polardb_pg_devel
    +

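    Create and Run the Container

    We are now in the source directory on the development machine. Create a container from the development image and mount the current directory into it as a volume, so that you can:

    • Build the source code in the environment inside the container
    • Use an editor outside the container (on the development machine) to view or modify the code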
    docker run -it \
    +    -v $PWD:/home/postgres/polardb_pg \
    +    --shm-size=512m --cap-add=SYS_PTRACE --privileged=true \
    +    --name polardb_pg_devel \
    +    polardb/polardb_pg_devel \
    +    bash
    +

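    Once inside the container, give the container user permissions on the source directory, then build and deploy a PolarDB-PG instance.

    # Obtain permissions, then build and deploy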
    +cd polardb_pg
    +sudo chmod -R a+wr ./
    +sudo chown -R postgres:postgres ./
    +./polardb_build.sh
    +
    +# Verify
    +psql -h 127.0.0.1 -c 'select version();'
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +

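    Build and Test Options

    The table below lists the options that may be used when building, initializing, or testing a PolarDB-PG cluster, with their descriptions. For more options, see the polardb_build.sh script in the source directory.

    Option               Description                                                         Default
    --withrep            Initialize a read-only node                                         NO
    --repnum             Number of read-only nodes                                           1
    --withstandby        Initialize a hot standby node                                       NO
    --initpx             Initialize an HTAP cluster (1 read-write node, 2 read-only nodes)   NO
    --with-pfsd          Build PolarDB File System (PFS) support                             NO
    --with-tde           Initialize Transparent Data Encryption (TDE)                        NO
    --with-dma           Initialize a DMA (Data Max Availability) three-node HA cluster      NO
    -r / -t / --regress  Run kernel regression tests after build and install                 NO
    -r-px                Run regression tests for an HTAP instance                           NO
    -e / --extension     Run extension tests                                                 NO
    -r-external          Test the extensions under external/                                 NO
    -r-contrib           Test the extensions under contrib/                                  NO
    -r-pl                Test the extensions under src/pl/                                   NO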

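    If you have no custom requirements, you can use the options below to build, deploy, and test the different forms of PolarDB-PG cluster.

    Building and Deploying Each Form of PolarDB-PG

    Local single-node instance

    • 1 read-write node (on port 5432)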
    ./polardb_build.sh
    +

    Local multi-node instance

    • 1 read-write node (on port 5432)
    • 1 read-only node (on port 5433)
    ./polardb_build.sh --withrep --repnum=1
    +

    Local multi-node instance with a standby

    • 1 read-write node (on port 5432)
    • 1 read-only node (on port 5433)
    • 1 standby node (on port 5434)
    ./polardb_build.sh --withrep --repnum=1 --withstandby
    +

    Local multi-node HTAP instance

    • 1 read-write node (on port 5432)
    • 2 read-only nodes (on ports 5433 / 5434)
    ./polardb_build.sh --initpx
    +
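
    After any of these builds, each node can be checked on its own port. The lookup below is a sketch: the function name and the short form names (single, replica, standby, htap) are labels invented here for illustration, and the ports are simply those listed in the four sections above.

```shell
# Sketch: print the ports used by each deployment form described above.
# The form names are hypothetical labels, not polardb_build.sh options.
ports_for() {
    case "$1" in
        single)  echo "5432" ;;
        replica) echo "5432 5433" ;;
        standby) echo "5432 5433 5434" ;;
        htap)    echo "5432 5433 5434" ;;
        *)       return 1 ;;
    esac
}

# Usage: query every node of an HTAP instance.
#   for port in $(ports_for htap); do
#       psql -h 127.0.0.1 -p "$port" -c 'select version();'
#   done
```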

    Instance Regression Tests

    Regression tests for a regular instance:

    ./polardb_build.sh --withrep -r -e -r-external -r-contrib -r-pl --with-tde
    +

    Regression tests for an HTAP instance:

    ./polardb_build.sh -r-px -e -r-external -r-contrib -r-pl --with-tde
    +

    Regression tests for a DMA instance:

    ./polardb_build.sh -r -e -r-external -r-contrib -r-pl --with-tde --with-dma
    +
    + + + diff --git a/favicon.ico b/favicon.ico new file mode 100644 index 00000000000..873dd54a834 Binary files /dev/null and b/favicon.ico differ diff --git a/icons/favicon-16x16.png b/icons/favicon-16x16.png new file mode 100644 index 00000000000..2378120c15d Binary files /dev/null and b/icons/favicon-16x16.png differ diff --git a/icons/favicon-32x32.png b/icons/favicon-32x32.png new file mode 100644 index 00000000000..db428270c3a Binary files /dev/null and b/icons/favicon-32x32.png differ diff --git a/images/polardb.png b/images/polardb.png new file mode 100644 index 00000000000..703431c1537 Binary files /dev/null and b/images/polardb.png differ diff --git a/images/polardb_group.png b/images/polardb_group.png new file mode 100644 index 00000000000..b389128a07b Binary files /dev/null and b/images/polardb_group.png differ diff --git a/index.html b/index.html new file mode 100644 index 00000000000..8f93e71f342 --- /dev/null +++ b/index.html @@ -0,0 +1,43 @@ + + + + + + + + + Documentation | PolarDB for PostgreSQL + + + + +
    PolarDB for PostgreSQL

    PolarDB for PostgreSQL

    A cloud-native database developed by Alibaba Cloud


    Quick Start with Docker

    Pull the local instance image of PolarDB for PostgreSQL, which is based on local storage. Create and run a container, then try the PolarDB-PG instance directly:

    # pull the instance image from DockerHub
    +docker pull polardb/polardb_pg_local_instance
    +# create and run the container
    +docker run -it --rm polardb/polardb_pg_local_instance psql
    +# check
    +postgres=# SELECT version();
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +
    + + + diff --git a/operation/backup-and-restore.html b/operation/backup-and-restore.html new file mode 100644 index 00000000000..94da6871a5f --- /dev/null +++ b/operation/backup-and-restore.html @@ -0,0 +1,266 @@ + + + + + + + + + Backup and Restore | PolarDB for PostgreSQL + + + + +

    Backup and Restore

    慎追、棠羽

    2023/01/11

    30 min

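    PolarDB for PostgreSQL uses a compute-storage separated architecture based on shared storage, so its backup and restore differ in some respects from those of PostgreSQL. This article shows how to back up PolarDB for PostgreSQL and how to use the backup to set up a Replica node or a Standby node.

    How Backup and Restore Work

    The backup procedure of PostgreSQL can be summarized in the following steps:

    1. Enter backup mode
      • Force Full Page Write mode and switch the current WAL segment file
      • Create a backup_label file in the data directory, containing the starting point of the base backup
      • Recovery from a backup must start at a checkpoint where in-memory data and on-disk data are consistent, so either wait for the next checkpoint or force a CHECKPOINT immediately
    2. Back up the database: use file-system-level tools to copy the data
    3. Exit backup mode
      • Reset Full Page Write mode and switch to the next WAL segment file
      • Create a backup history file containing the start and end WAL positions of this base backup, and delete the backup_label file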
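
The three phases above can be sketched as a dry run that prints the commands instead of executing them. This sketch assumes the exclusive backup API of PostgreSQL 11 (`pg_start_backup(label, fast)` and `pg_stop_backup()`); the label, the paths, the choice of rsync as the file-system-level tool, and the function name are illustrative assumptions.

```shell
# Print the three phases of a manual base backup as commands (dry run).
# Assumes the PostgreSQL 11 exclusive-backup functions; rsync is one possible copy tool.
manual_backup_plan() (
    label=$1; datadir=$2; dest=$3
    # 1. Enter backup mode; 'true' requests an immediate checkpoint.
    printf '%s\n' "psql -c \"SELECT pg_start_backup('$label', true);\""
    # 2. Copy the data directory with a file-system-level tool.
    printf '%s\n' "rsync -a --exclude=postmaster.pid $datadir/ $dest/"
    # 3. Leave backup mode.
    printf '%s\n' "psql -c \"SELECT pg_stop_backup();\""
)
```

Calling `manual_backup_plan nightly /home/postgres/primary /backup/base` prints the three commands in order.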

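    The simplest way to back up a PostgreSQL database is to use the pg_basebackup tool.

    Data Directory Structure

    PolarDB for PostgreSQL uses a compute-storage separated architecture based on shared storage. Its data directories fall into two categories:

    • Local data directory: located on the local storage of each compute node and private to that node
    • Shared data directory: located on the shared storage and shared by all compute nodes

    backup-dir

    Because the directories and files in the local data directory do not hold the core data of the database, backing up the local data directory is optional. You can back up only the data directory on the shared storage and then use initdb to regenerate a new local data directory. However, the local configuration files of each compute node, such as postgresql.conf and pg_hba.conf, must be backed up manually.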
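
A minimal sketch of the manual configuration backup mentioned above follows. The function name, the destination layout, and the exact file list are assumptions; extend the list if a node relies on further local configuration files.

```shell
# Copy the node-local configuration files out of a local data directory.
# Files that do not exist in the source directory are skipped.
backup_local_conf() (
    src=$1; dst=$2
    mkdir -p "$dst" || exit 1
    for f in postgresql.conf postgresql.auto.conf pg_hba.conf pg_ident.conf; do
        if [ -f "$src/$f" ]; then
            cp -p "$src/$f" "$dst/$f" || exit 1
        fi
    done
    exit 0
)
```

For example, `backup_local_conf /home/postgres/primary /home/postgres/conf_backup` keeps a copy of the configuration next to the shared-data backup.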

    Local Data Directory

    The following SQL command shows the local data directory of a node:

    postgres=# SHOW data_directory;
    +     data_directory
    +------------------------
    + /home/postgres/primary
    +(1 row)
    +

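    The local data directory is similar to the data directory of PostgreSQL; most of its directories and files are generated by initdb. As the database service runs, more local files are produced in it, such as temporary files, cache files, configuration files, and log files. Its structure is as follows: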

    $ tree ./ -L 1
    +./
    +├── base
    +├── current_logfiles
    +├── global
    +├── pg_commit_ts
    +├── pg_csnlog
    +├── pg_dynshmem
    +├── pg_hba.conf
    +├── pg_ident.conf
    +├── pg_log
    +├── pg_logical
    +├── pg_logindex
    +├── pg_multixact
    +├── pg_notify
    +├── pg_replslot
    +├── pg_serial
    +├── pg_snapshots
    +├── pg_stat
    +├── pg_stat_tmp
    +├── pg_subtrans
    +├── pg_tblspc
    +├── PG_VERSION
    +├── pg_xact
    +├── polar_cache_trash
    +├── polar_dma.conf
    +├── polar_fullpage
    +├── polar_node_static.conf
    +├── polar_rel_size_cache
    +├── polar_shmem
    +├── polar_shmem_stat_file
    +├── postgresql.auto.conf
    +├── postgresql.conf
    +├── postmaster.opts
    +└── postmaster.pid
    +
    +21 directories, 12 files
    +

    Shared Data Directory

    The following SQL command shows the shared data directory that all compute nodes use on the shared storage:

    postgres=# SHOW polar_datadir;
    +     polar_datadir
    +-----------------------
    + /nvme1n1/shared_data/
    +(1 row)
    +

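    The shared data directory stores the core data files of PolarDB for PostgreSQL, such as table files, index files, WAL, DMA, LogIndex, and Flashback Log. These files are shared by all nodes and therefore must be backed up. Its structure is as follows: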

    $ sudo pfs -C disk ls /nvme1n1/shared_data/
    +   Dir  1     512               Wed Jan 11 09:34:01 2023  base
    +   Dir  1     7424              Wed Jan 11 09:34:02 2023  global
    +   Dir  1     0                 Wed Jan 11 09:34:02 2023  pg_tblspc
    +   Dir  1     512               Wed Jan 11 09:35:05 2023  pg_wal
    +   Dir  1     384               Wed Jan 11 09:35:01 2023  pg_logindex
    +   Dir  1     0                 Wed Jan 11 09:34:02 2023  pg_twophase
    +   Dir  1     128               Wed Jan 11 09:34:02 2023  pg_xact
    +   Dir  1     0                 Wed Jan 11 09:34:02 2023  pg_commit_ts
    +   Dir  1     256               Wed Jan 11 09:34:03 2023  pg_multixact
    +   Dir  1     0                 Wed Jan 11 09:34:03 2023  pg_csnlog
    +   Dir  1     256               Wed Jan 11 09:34:03 2023  polar_dma
    +   Dir  1     512               Wed Jan 11 09:35:09 2023  polar_fullpage
    +  File  1     32                Wed Jan 11 09:35:00 2023  RWID
    +   Dir  1     256               Wed Jan 11 10:25:42 2023  pg_replslot
    +  File  1     224               Wed Jan 11 10:19:37 2023  polar_non_exclusive_backup_label
    +total 16384 (unit: 512Bytes)
    +

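    The polar_basebackup Backup Tool

    polar_basebackup, the backup tool of PolarDB for PostgreSQL, is adapted from PostgreSQL's pg_basebackup and is fully compatible with it, so it can also be used to back up and restore PostgreSQL. The polar_basebackup executable is located in the bin/ directory under the PolarDB for PostgreSQL installation directory.

    Its main function is to back up the data directories of a running PolarDB for PostgreSQL database (both the local data directory and the shared data directory) into a target directory.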

    polar_basebackup takes a base backup of a running PostgreSQL server.
    +
    +Usage:
    +  polar_basebackup [OPTION]...
    +
    +Options controlling the output:
    +  -D, --pgdata=DIRECTORY receive base backup into directory
    +  -F, --format=p|t       output format (plain (default), tar)
    +  -r, --max-rate=RATE    maximum transfer rate to transfer data directory
    +                         (in kB/s, or use suffix "k" or "M")
    +  -R, --write-recovery-conf
    +                         write recovery.conf for replication
    +  -T, --tablespace-mapping=OLDDIR=NEWDIR
    +                         relocate tablespace in OLDDIR to NEWDIR
    +      --waldir=WALDIR    location for the write-ahead log directory
    +  -X, --wal-method=none|fetch|stream
    +                         include required WAL files with specified method
    +  -z, --gzip             compress tar output
    +  -Z, --compress=0-9     compress tar output with given compression level
    +
    +General options:
    +  -c, --checkpoint=fast|spread
    +                         set fast or spread checkpointing
    +  -C, --create-slot      create replication slot
    +  -l, --label=LABEL      set backup label
    +  -n, --no-clean         do not clean up after errors
    +  -N, --no-sync          do not wait for changes to be written safely to disk
    +  -P, --progress         show progress information
    +  -S, --slot=SLOTNAME    replication slot to use
    +  -v, --verbose          output verbose messages
    +  -V, --version          output version information, then exit
    +      --no-slot          prevent creation of temporary replication slot
    +      --no-verify-checksums
    +                         do not verify checksums
    +  -?, --help             show this help, then exit
    +
    +Connection options:
    +  -d, --dbname=CONNSTR   connection string
    +  -h, --host=HOSTNAME    database server host or socket directory
    +  -p, --port=PORT        database server port number
    +  -s, --status-interval=INTERVAL
    +                         time between status packets sent to server (in seconds)
    +  -U, --username=NAME    connect as specified database user
    +  -w, --no-password      never prompt for password
    +  -W, --password         force password prompt (should happen automatically)
    +      --polardata=datadir  receive polar data backup into directory
    +      --polar_disk_home=disk_home  polar_disk_home for polar data backup
    +      --polar_host_id=host_id  polar_host_id for polar data backup
    +      --polar_storage_cluster_name=cluster_name  polar_storage_cluster_name for polar data backup
    +

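    The options and usage of polar_basebackup are almost identical to those of pg_basebackup, with the following additional options related to shared storage:

    • --polar_disk_home / --polar_host_id / --polar_storage_cluster_name: these three options specify the shared storage node used to store the backup of the shared data
    • --polardata: this option specifies the path on the backup shared storage node where the shared data is stored; if it is not specified, the shared data is backed up to the polar_shared_data/ path under the local data backup directory by default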
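
The default described for --polardata can be expressed as a small helper that resolves where the shared data ends up. This is only a sketch of the rule stated above; the function name is hypothetical and it does not call polar_basebackup.

```shell
# Resolve where the shared data is placed, following the rule above:
# the --polardata value if given, otherwise polar_shared_data/ under the -D directory.
shared_backup_path() (
    local_dir=${1%/}      # value passed to -D, trailing slash removed
    polardata=${2:-}      # value passed to --polardata, may be empty
    if [ -n "$polardata" ]; then
        printf '%s\n' "$polardata"
    else
        printf '%s\n' "$local_dir/polar_shared_data/"
    fi
)
```

With only a -D directory, as in a Replica backup, the shared data lands inside the local backup directory; with --polardata, as in a Standby backup, it lands on the second shared storage.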

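    Backing Up and Restoring a Replica Node

    A base backup can be used to set up a new Replica (RO) node. As described above, the data files of a running PolarDB for PostgreSQL instance are spread across the local storage of each compute node and the shared storage of the storage nodes. The following shows how to use polar_basebackup to back up the instance's data files to a local disk and start a Replica node from that backup.

    Mounting the PFS File System

    First, on the machine where the Replica node will be deployed, start the PFSD daemon and mount it on the PFS file system of the running shared storage. The Replica node started later will use this daemon to access the shared storage.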

    sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
    +

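    Backing Up Data to Local Storage

    Run the following command to back up the local data and shared data of the instance's Primary node to /home/postgres/replica1, the local storage path used to deploy the Replica node: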

    polar_basebackup \
    +    --host=[Primary node IP] \
    +    --port=[Primary node port] \
    +    -D /home/postgres/replica1 \
    +    -X stream --progress --write-recovery-conf -v
    +

    You will see output like the following:

    polar_basebackup: initiating base backup, waiting for checkpoint to complete
    +polar_basebackup: checkpoint completed
    +polar_basebackup: write-ahead log start point: 0/16ADD60 on timeline 1
    +polar_basebackup: starting background WAL receiver
    +polar_basebackup: created temporary replication slot "pg_basebackup_359"
    +851371/851371 kB (100%), 2/2 tablespaces
    +polar_basebackup: write-ahead log end point: 0/16ADE30
    +polar_basebackup: waiting for background process to finish streaming ...
    +polar_basebackup: base backup completed
    +

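    After the backup completes, this backup directory can serve as the local data directory for starting a new Replica node. The local data directory does not need the shared data files that already exist on the shared storage, so delete the polar_shared_data/ directory from it: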

    rm -rf ~/replica1/polar_shared_data
    +

    Reconfiguring the Replica Node

    Edit the Replica node's configuration file ~/replica1/postgresql.conf:

    -polar_hostid=1
    ++polar_hostid=2
    +-synchronous_standby_names='replica1'
    +

    Edit the Replica node's replication configuration file ~/replica1/recovery.conf:

    polar_replica='on'
    +recovery_target_timeline='latest'
    +primary_slot_name='replica1'
    +primary_conninfo='host=[Primary node IP] port=5432 user=postgres dbname=postgres application_name=replica1'
    +
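
When several Replica nodes are set up, the recovery.conf above can be generated instead of edited by hand. The sketch below emits the same four settings; the function name and its arguments (Primary host, Primary port, slot name) are hypothetical, and the slot name is reused as application_name as in the example above.

```shell
# Emit a recovery.conf for a Replica node with the four settings shown above.
replica_recovery_conf() (
    host=$1; port=$2; name=$3
    cat <<EOF
polar_replica='on'
recovery_target_timeline='latest'
primary_slot_name='$name'
primary_conninfo='host=$host port=$port user=postgres dbname=postgres application_name=$name'
EOF
)
```

A possible use is `replica_recovery_conf 192.0.2.10 5432 replica1 > ~/replica1/recovery.conf`, where 192.0.2.10 stands in for the Primary node's address.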

    Starting the Replica Node

    Start the Replica node:

    pg_ctl -D $HOME/replica1 start
    +

    Verifying the Replica Node

    Create a table and insert data on the Primary node; the inserted data can then be queried on the Replica node:

    $ psql -q \
    +    -h [Primary node IP] \
    +    -p 5432 \
    +    -d postgres \
    +    -c "CREATE TABLE t (t1 INT PRIMARY KEY, t2 INT); INSERT INTO t VALUES (1, 1),(2, 3),(3, 3);"
    +
    +$ psql -q \
    +    -h [Replica node IP] \
    +    -p 5432 \
    +    -d postgres \
    +    -c "SELECT * FROM t;"
    + t1 | t2
    +----+----
    +  1 |  1
    +  2 |  3
    +  3 |  3
    +(3 rows)
    +

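    Backing Up and Restoring a Standby Node

    A base backup can also be used to set up a new Standby node. As shown in the figure below, the Standby node uses its own shared storage, separate from the one used by the Primary / Replica nodes, and stays in sync with the Primary node through physical replication. The Standby node can serve as disaster recovery for the primary shared storage.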

    backup-dir

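    Formatting and Mounting the PFS File System

    Assume the machine used to deploy the Standby compute node already has the shared storage nvme2n1 prepared for the standby: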

    $ lsblk
    +NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    +nvme0n1     259:1    0  40G  0 disk
    +└─nvme0n1p1 259:2    0  40G  0 part /etc/hosts
    +nvme2n1     259:3    0  70G  0 disk
    +nvme1n1     259:0    0  60G  0 disk
    +

    Format this shared storage as PFS, then start the PFSD daemon to mount the PFS file system:

    sudo pfs -C disk mkfs nvme2n1
    +sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme2n1 -w 2
    +

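    Backing Up Data to Local Storage and Shared Storage

    Run the backup on the machine used to deploy the Standby node, with ~/standby as the local data directory and /nvme2n1/shared_data as the shared data directory: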

    polar_basebackup \
    +    --host=[Primary node IP] \
    +    --port=[Primary node port] \
    +    -D /home/postgres/standby \
    +    --polardata=/nvme2n1/shared_data/ \
    +    --polar_storage_cluster_name=disk \
    +    --polar_disk_name=nvme2n1 \
    +    --polar_host_id=3 \
    +    -X stream --progress --write-recovery-conf -v
    +

    You will see output like the following. Besides the output of polar_basebackup, it also contains PFS log output:

    [PFSD_SDK INF Jan 11 10:11:27.247112][99]pfs_mount_prepare 103: begin prepare mount cluster(disk), PBD(nvme2n1), hostid(3),flags(0x13)
    +[PFSD_SDK INF Jan 11 10:11:27.247161][99]pfs_mount_prepare 165: pfs_mount_prepare success for nvme2n1 hostid 3
    +[PFSD_SDK INF Jan 11 10:11:27.293900][99]chnl_connection_poll_shm 1238: ack data update s_mount_epoch 1
    +[PFSD_SDK INF Jan 11 10:11:27.293912][99]chnl_connection_poll_shm 1266: connect and got ack data from svr, err = 0, mntid 0
    +[PFSD_SDK INF Jan 11 10:11:27.293979][99]pfsd_sdk_init 191: pfsd_chnl_connect success
    +[PFSD_SDK INF Jan 11 10:11:27.293987][99]pfs_mount_post 208: pfs_mount_post err : 0
    +[PFSD_SDK ERR Jan 11 10:11:27.297257][99]pfsd_opendir 1437: opendir /nvme2n1/shared_data/ error: No such file or directory
    +[PFSD_SDK INF Jan 11 10:11:27.297396][99]pfsd_mkdir 1320: mkdir /nvme2n1/shared_data
    +polar_basebackup: initiating base backup, waiting for checkpoint to complete
    +WARNING:  a labelfile "/nvme1n1/shared_data//polar_non_exclusive_backup_label" is already on disk
    +HINT:  POLAR: we overwrite it
    +polar_basebackup: checkpoint completed
    +polar_basebackup: write-ahead log start point: 0/16C91F8 on timeline 1
    +polar_basebackup: starting background WAL receiver
    +polar_basebackup: created temporary replication slot "pg_basebackup_373"
    +...
    +[PFSD_SDK INF Jan 11 10:11:32.992005][99]pfsd_open 539: open /nvme2n1/shared_data/polar_non_exclusive_backup_label with inode 6325, fd 0
    +[PFSD_SDK INF Jan 11 10:11:32.993074][99]pfsd_open 539: open /nvme2n1/shared_data/global/pg_control with inode 8373, fd 0
    +851396/851396 kB (100%), 2/2 tablespaces
    +polar_basebackup: write-ahead log end point: 0/16C9300
    +polar_basebackup: waiting for background process to finish streaming ...
    +polar_basebackup: base backup completed
    +[PFSD_SDK INF Jan 11 10:11:52.378220][99]pfsd_umount_force 247: pbdname nvme2n1
    +[PFSD_SDK INF Jan 11 10:11:52.378229][99]pfs_umount_prepare 269: pfs_umount_prepare. pbdname:nvme2n1
    +[PFSD_SDK INF Jan 11 10:11:52.404010][99]chnl_connection_release_shm 1164: client umount return : deleted /var/run/pfsd//nvme2n1/99.pid
    +[PFSD_SDK INF Jan 11 10:11:52.404171][99]pfs_umount_post 281: pfs_umount_post. pbdname:nvme2n1
    +[PFSD_SDK INF Jan 11 10:11:52.404174][99]pfsd_umount_force 261: umount success for nvme2n1
    +

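    The command above backs up the Primary node's local data directory to the local storage of the current machine, and backs up the shared data directory to the shared storage directory specified by the options.

    Reconfiguring the Standby Node

    Edit the Standby node's configuration file ~/standby/postgresql.conf: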

    -polar_hostid=1
    ++polar_hostid=3
    +-polar_disk_name='nvme1n1'
    +-polar_datadir='/nvme1n1/shared_data/'
    ++polar_disk_name='nvme2n1'
    ++polar_datadir='/nvme2n1/shared_data/'
    +-synchronous_standby_names='replica1'
    +
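
The configuration changes above can also be applied with sed. The following is a sketch under the assumption that the copied postgresql.conf contains the settings exactly in the `name=value` form shown above; the function name and argument order are hypothetical.

```shell
# Apply the postgresql.conf changes above to a copied configuration file:
# new host id, new disk name and data directory, no synchronous_standby_names.
retarget_standby_conf() (
    conf=$1; hostid=$2; disk=$3
    tmp="$conf.tmp.$$"
    sed -e "s|^polar_hostid=.*|polar_hostid=$hostid|" \
        -e "s|^polar_disk_name=.*|polar_disk_name='$disk'|" \
        -e "s|^polar_datadir=.*|polar_datadir='/$disk/shared_data/'|" \
        -e "/^synchronous_standby_names=/d" \
        "$conf" > "$tmp" && mv "$tmp" "$conf"
)
```

For the Standby node in this section, the call would be `retarget_standby_conf ~/standby/postgresql.conf 3 nvme2n1`.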

    Add the following to the Standby node's replication configuration file ~/standby/recovery.conf:

    +recovery_target_timeline = 'latest'
    ++primary_slot_name = 'standby1'
    +

    Starting the Standby Node

    On the Primary node, create the replication slot used for physical replication with the Standby:

    $ psql \
    +    --host=[Primary node IP] --port=5432 \
    +    -d postgres \
    +    -c "SELECT * FROM pg_create_physical_replication_slot('standby1');"
    + slot_name | lsn
    +-----------+-----
    + standby1  |
    +(1 row)
    +

    Start the Standby node:

    pg_ctl -D $HOME/standby start
    +

    Verifying the Standby Node

    Create a table and insert data on the Primary node; the data can then be queried on the Standby node:

    $ psql -q \
    +    -h [Primary node IP] \
    +    -p 5432 \
    +    -d postgres \
    +    -c "CREATE TABLE t (t1 INT PRIMARY KEY, t2 INT); INSERT INTO t VALUES (1, 1),(2, 3),(3, 3);"
    +
    +$ psql -q \
    +    -h [Standby node IP] \
    +    -p 5432 \
    +    -d postgres \
    +    -c "SELECT * FROM t;"
    + t1 | t2
    +----+----
    +  1 |  1
    +  2 |  3
    +  3 |  3
    +(3 rows)
    +
    + + + diff --git a/operation/grow-storage.html b/operation/grow-storage.html new file mode 100644 index 00000000000..6470b0421c0 --- /dev/null +++ b/operation/grow-storage.html @@ -0,0 +1,60 @@ + + + + + + + + + Online Expansion of Shared Storage | PolarDB for PostgreSQL + + + + +

    Online Expansion of Shared Storage

    棠羽

    2022/10/12

    15 min

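    As a database is used and its data volume grows, the storage space it uses inevitably has to be expanded. Because PolarDB for PostgreSQL is built on shared storage and the distributed file system PFS, expansion, like installation and deployment, requires operations at the following three layers:

    • Block storage layer
    • File system layer
    • Database instance layer

    This article walks you through the expansion at each of these three layers, so that the storage can be expanded dynamically without stopping the database instance.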

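    Block Storage Layer Expansion

    The first step is expansion at the block storage layer. Whatever type of shared storage is used, the goal of storage-layer expansion is the same: running lsblk on a host that can access the shared storage shows that the physical space of the block device has grown. Different types of shared storage are expanded in different ways; this article uses Alibaba Cloud ECS + ESSD cloud disk shared storage as an example.

    In addition, to make sure the subsequent expansion steps succeed, expand in units of 10GB.

    In this example, before expansion, a 20GB ESSD cloud disk is multi-attached to two ECS instances. Running lsblk on both ECS instances shows that nvme1n1, the block device of the ESSD shared storage, currently has 20GB of physical space.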

    $ lsblk
    +NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    +nvme0n1     259:0    0  40G  0 disk
    +└─nvme0n1p1 259:1    0  40G  0 part /etc/hosts
    +nvme1n1     259:2    0  20G  0 disk
    +

    Next, expand this ESSD cloud disk. On the ESSD cloud disk management page of Alibaba Cloud, click the disk expansion option:

    essd-storage-grow

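    On the disk expansion page, you can see that the disk is multi-attached to two ECS instances. Enter the capacity after expansion and confirm, expanding the 20GB disk to 40GB: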

    essd-storage-online-grow

    After the expansion succeeds, you will see the following prompt:

    essd-storage-grow-complete

    Now running lsblk on both ECS instances shows that the physical space of nvme1n1, the block device of the ESSD, has become 40GB:

    $ lsblk
    +NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    +nvme0n1     259:0    0  40G  0 disk
    +└─nvme0n1p1 259:1    0  40G  0 part /etc/hosts
    +nvme1n1     259:2    0  40G  0 disk
    +

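    This completes the expansion at the block storage layer.

    File System Layer Expansion

    After the physical block device has been expanded, use the tool provided by the PFS file system to format the newly added physical space on the block device, completing the expansion at the file system layer.

    On any one host that can access the shared storage, run the PFS growfs command, where:

    • -o is the size of the shared storage before expansion (in units of 10GB)
    • -n is the size of the shared storage after expansion (in units of 10GB)

    This example expands the shared storage from 20GB to 40GB, so the two options are set to 2 and 4 respectively: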

    $ sudo pfs -C disk growfs -o 2 -n 4 nvme1n1
    +
    +...
    +
    +Init chunk 2
    +                metaset        2/1: sectbda      0x500001000, npage       80, objsize  128, nobj 2560, oid range [    2000,     2a00)
    +                metaset        2/2: sectbda      0x500051000, npage       64, objsize  128, nobj 2048, oid range [    1000,     1800)
    +                metaset        2/3: sectbda      0x500091000, npage       64, objsize  128, nobj 2048, oid range [    1000,     1800)
    +
    +Init chunk 3
    +                metaset        3/1: sectbda      0x780001000, npage       80, objsize  128, nobj 2560, oid range [    3000,     3a00)
    +                metaset        3/2: sectbda      0x780051000, npage       64, objsize  128, nobj 2048, oid range [    1800,     2000)
    +                metaset        3/3: sectbda      0x780091000, npage       64, objsize  128, nobj 2048, oid range [    1800,     2000)
    +
    +pfs growfs succeeds!
    +

    If you see the output above, the expansion at the file system layer is complete.
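
Because -o and -n are given in units of 10GB, it is easy to pass raw GB values by mistake. The sketch below converts sizes in GB into the growfs command used above and rejects sizes that are not multiples of 10GB; the function name is hypothetical and the command is only printed, not executed.

```shell
# Build the growfs command from sizes in GB, enforcing the 10 GB unit.
# "disk" is the cluster name used throughout this document.
growfs_cmd() (
    dev=$1; old_gb=$2; new_gb=$3
    if [ $((old_gb % 10)) -ne 0 ] || [ $((new_gb % 10)) -ne 0 ]; then
        echo "sizes must be multiples of 10 GB" >&2; exit 1
    fi
    if [ "$new_gb" -le "$old_gb" ]; then
        echo "new size must be larger than old size" >&2; exit 1
    fi
    printf '%s\n' "sudo pfs -C disk growfs -o $((old_gb / 10)) -n $((new_gb / 10)) $dev"
)
```

For the expansion in this example, `growfs_cmd nvme1n1 20 40` prints the same command that was run above.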

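    Database Instance Layer Expansion

    Finally, at the database instance layer, expansion consists of running a SQL function that notifies the PFSD (PFS Daemon) process already mounted on the shared storage on each instance that the new space on the shared storage is available. Note that every PFSD in the database cluster must be notified, and that the PFSDs on all RO nodes must be notified first and the PFSD on the RW node last. In other words, the notification SQL function must be run once on every PolarDB for PostgreSQL node, RO nodes first and the RW node last.

    The function that notifies PFSD is implemented in the polar_vfs extension of PolarDB for PostgreSQL, so first load the polar_vfs extension on the RW node. Loading the extension registers the SQL function polar_vfs_disk_expansion on the RW node and all RO nodes.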

    CREATE EXTENSION IF NOT EXISTS polar_vfs;
    +

    Next, run this SQL function on every RO node in turn, and then on the RW node. The function's argument is the block device name:

    SELECT polar_vfs_disk_expansion('nvme1n1');
    +

    Once this is done, the expansion at the database instance layer is complete, and the new storage space can be used by the database.
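
The required order (all RO nodes first, the RW node last) can be enforced by generating the psql calls from a node list. In this sketch the input format, one `<role> <host> <port>` line per node with role RO or RW, and the function name are assumptions; the commands are printed for review rather than executed.

```shell
# Print the notification commands in the required order: RO nodes first, RW last.
# Reads lines of the hypothetical form "<role> <host> <port>" from stdin.
expansion_plan() (
    dev=$1
    sql="SELECT polar_vfs_disk_expansion('$dev');"
    nodes=$(cat)
    # Two passes keep the order stable: every RO node, then the RW node.
    for role in RO RW; do
        printf '%s\n' "$nodes" | while read -r r host port; do
            [ "$r" = "$role" ] && printf '%s\n' "psql -h $host -p $port -d postgres -c \"$sql\""
        done
    done
    exit 0
)
```

Feeding it the node list in any order yields the RO commands before the RW command.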

    + + + diff --git a/operation/ro-online-promote.html b/operation/ro-online-promote.html new file mode 100644 index 00000000000..e17da1f3640 --- /dev/null +++ b/operation/ro-online-promote.html @@ -0,0 +1,57 @@ + + + + + + + + + Online Promote of a Read-Only Node | PolarDB for PostgreSQL + + + + +

    Online Promote of a Read-Only Node

    棠羽

    2022/12/25

    15 min

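    PolarDB for PostgreSQL is a cloud-native database with separated storage and compute. All compute nodes share one copy of storage, and access to the storage is restricted to one writer and multiple readers: every compute node can read from the storage, but only one compute node can write to it. This restriction raises a problem: when the read-write node becomes unavailable because of a crash or network failure, no compute node in the cluster can write to the storage, and the application's inserts, deletes, updates, and DDL can no longer run.

    This article shows how to promote any read-only node online to a read-write node when the read-write node of a PolarDB for PostgreSQL compute cluster stops serving, restoring the cluster's ability to write to the shared storage.

    Prerequisites

    For convenience, this example uses an instance based on local disk. Pull the following image and start a container to get a local-disk-based HTAP instance: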

    docker pull polardb/polardb_pg_local_instance
    +docker run -it \
    +    --cap-add=SYS_PTRACE \
    +    --privileged=true \
    +    --name polardb_pg_htap \
    +    --shm-size=512m \
    +    polardb/polardb_pg_local_instance \
    +    bash
    +

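    Ports 5432 through 5434 in the container run one read-write node and two read-only nodes respectively. The two read-only nodes share the same data with the read-write node and keep their in-memory state in sync with it through physical replication.

    Verifying That the Read-Only Node Is Not Writable

    First, connect to the read-write node, create a table, and insert some data: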

    psql -p5432
    +
    postgres=# CREATE TABLE t (id int);
    +CREATE TABLE
    +postgres=# INSERT INTO t SELECT generate_series(1,10);
    +INSERT 0 10
    +

    Then connect to a read-only node and try to insert data into the same table; the insert fails:

    psql -p5433
    +
    postgres=# INSERT INTO t SELECT generate_series(1,10);
    +ERROR:  cannot execute INSERT in a read-only transaction
    +

    Stopping Writes on the Read-Write Node

    Now shut down the read-write node to simulate its becoming unavailable:

    $ pg_ctl -D ~/tmp_master_dir_polardb_pg_1100_bld/ stop
    +waiting for server to shut down.... done
    +server stopped
    +

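    At this point, no node in the cluster can write to the storage. We need to promote a read-only node to a read-write node to restore writes to the storage.

    Promoting the Read-Only Node

    A read-only node may be promoted only after the read-write node has stopped writing; otherwise two nodes in the cluster would write at the same time. If the database detects writes from multiple nodes, it will run abnormally.

    Promote the read-only node running on port 5433 to a read-write node: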

    $ pg_ctl -D ~/tmp_replica_dir_polardb_pg_1100_bld1/ promote
    +waiting for server to promote.... done
    +server promoted
    +
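
The precondition that the old read-write node must no longer be writing can be partly guarded by a check before promoting. The sketch below is an assumption-laden illustration: it uses the stock `pg_isready` client (overridable through the hypothetical `PG_ISREADY` variable) and only prints the promote command when the old node does not answer. An unreachable node is not proof that it has stopped writing, so this complements the manual check rather than replacing it.

```shell
# Refuse to print the promote command while the old read-write node still answers.
# PG_ISREADY can point at a different checker; pg_isready is the default.
promote_cmd() (
    rw_port=$1; ro_datadir=$2
    if "${PG_ISREADY:-pg_isready}" -q -p "$rw_port"; then
        echo "read-write node on port $rw_port is still accepting connections" >&2
        exit 1
    fi
    printf '%s\n' "pg_ctl -D $ro_datadir promote"
)
```

In the container used here, the call would be `promote_cmd 5432 ~/tmp_replica_dir_polardb_pg_1100_bld1/`.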

    Restoring Writes in the Compute Cluster

    Connect to the new read-write node that has completed the promote and retry the earlier INSERT:

    postgres=# INSERT INTO t SELECT generate_series(1,10);
    +INSERT 0 10
    +

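    The result shows that the new read-write node can write to the storage successfully, which means the former read-only node has been promoted to a read-write node.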

    + + + diff --git a/operation/scale-out.html b/operation/scale-out.html new file mode 100644 index 00000000000..c80eb453c55 --- /dev/null +++ b/operation/scale-out.html @@ -0,0 +1,182 @@ + + + + + + + + + Scaling Compute Nodes Out and In | PolarDB for PostgreSQL + + + + +

    Scaling Compute Nodes Out and In

    棠羽

    2022/12/19

    30 min

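    PolarDB for PostgreSQL is a database with separated storage and compute. All compute nodes share the storage, and compute nodes can be added or removed elastically as needed without any data migration. This tutorial therefore helps you add or remove compute nodes on a shared storage cluster.

    Deploying the Read-Write Node

    First, on a shared storage cluster that has already been set up, initialize and start the first compute node, the read-write node, which can both read from and write to the shared storage. The image below provides precompiled executables of the PolarDB for PostgreSQL kernel and its surrounding tools: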

    $ docker pull polardb/polardb_pg_binary
    +$ docker run -it \
    +    --cap-add=SYS_PTRACE \
    +    --privileged=true \
    +    --name polardb_pg \
    +    --shm-size=512m \
    +    polardb/polardb_pg_binary \
    +    bash
    +
    +$ ls ~/tmp_basedir_polardb_pg_1100_bld/bin/
    +clusterdb     dropuser           pg_basebackup   pg_dump         pg_resetwal    pg_test_timing       polar-initdb.sh          psql
    +createdb      ecpg               pgbench         pg_dumpall      pg_restore     pg_upgrade           polar-replica-initdb.sh  reindexdb
    +createuser    initdb             pg_config       pg_isready      pg_rewind      pg_verify_checksums  polar_tools              vacuumdb
    +dbatools.sql  oid2name           pg_controldata  pg_receivewal   pg_standby     pg_waldump           postgres                 vacuumlo
    +dropdb        pg_archivecleanup  pg_ctl          pg_recvlogical  pg_test_fsync  polar_basebackup     postmaster
    +

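    Confirming That the Storage Is Accessible

    Use the lsblk command to confirm that the storage cluster is accessible from the current machine. In the example below, nvme1n1 is the block device of the shared storage to be used: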

    $ lsblk
    +NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    +nvme0n1     259:0    0   40G  0 disk
    +└─nvme0n1p1 259:1    0   40G  0 part /etc/hosts
    +nvme1n1     259:2    0  100G  0 disk
    +

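    Formatting and Mounting the PFS File System

    At this point, the shared storage is empty. Use the PFS tool in the container to format the shared storage as a PFS file system: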

    sudo pfs -C disk mkfs nvme1n1
    +

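    After formatting, start the PFS daemon in the current container and mount the file system. The compute node will later use this daemon to access the shared storage: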

    sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
    +

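    Initializing the Data Directories

    Use initdb to create the local data directory at ~/primary on the node's local storage. The local data directory stores node-private information such as the node's configuration and audit logs: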

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/primary
    +

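    Use the PFS tool to create a shared data directory on the shared storage, then use the polar-initdb.sh script to copy the data files that will be shared by all nodes into it. These shared files include all table files, WAL files, and so on: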

    sudo pfs -C disk mkdir /nvme1n1/shared_data
    +
    +sudo $HOME/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh \
    +    $HOME/primary/ /nvme1n1/shared_data/
    +

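    Editing the Read-Write Node Configuration

    Modify the read-write node's configuration file ~/primary/postgresql.conf so that the database starts in shared mode and can find the data directory on the shared storage: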

    port=5432
    +polar_hostid=1
    +
    +polar_enable_shared_storage_mode=on
    +polar_disk_name='nvme1n1'
    +polar_datadir='/nvme1n1/shared_data/'
    +polar_vfs.localfs_mode=off
    +shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
    +polar_storage_cluster_name='disk'
    +
    +logging_collector=on
    +log_line_prefix='%p\t%r\t%u\t%m\t'
    +log_directory='pg_log'
    +listen_addresses='*'
    +max_connections=1000
    +synchronous_standby_names='replica1'
    +

    Edit the read-write node's client authentication file ~/primary/pg_hba.conf to allow clients from any address to perform physical replication as the postgres user:

    host	replication	postgres	0.0.0.0/0	trust
    +

    Starting the Read-Write Node

    Start the read-write node with the following commands and check that it runs properly:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/primary start
    +
    +$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5432 \
    +    -d postgres \
    +    -c 'SELECT version();'
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +

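    Scaling Out the Cluster

    Next, add a new compute node to the compute cluster, which already has one read-write node. Because PolarDB for PostgreSQL uses a one-writer, multiple-reader architecture, nodes added later can only read from the shared storage and cannot write to it. Read-only nodes keep their in-memory state in sync through physical replication with the read-write node.

    Similarly, on the machine used to deploy the new compute node, pull the image and start a container that carries the executables: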

    docker pull polardb/polardb_pg_binary
    +docker run -it \
    +    --cap-add=SYS_PTRACE \
    +    --privileged=true \
    +    --name polardb_pg \
    +    --shm-size=512m \
    +    polardb/polardb_pg_binary \
    +    bash
    +

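    Confirming That the Storage Is Accessible

    Make sure the machine hosting the read-only node can also access the block device of the shared storage: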

    $ lsblk
    +NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    +nvme0n1     259:0    0   40G  0 disk
    +└─nvme0n1p1 259:1    0   40G  0 part /etc/hosts
    +nvme1n1     259:2    0  100G  0 disk
    +

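    Mounting the PFS File System

    The shared storage has already been formatted as PFS by the read-write node, so there is no need to format it again. Just start the PFS daemon to complete the mount: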

    sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
    +

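    Initializing the Data Directory

    Create an empty directory at ~/replica1 on the read-only node's local disk, then use the polar-replica-initdb.sh script to initialize the read-only node's local directory from the data directory on the shared storage. The initialized local directory contains no default configuration files, so also use initdb to create a temporary local directory as a template and copy all of its default configuration files into the read-only node's local directory: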

    mkdir -m 0700 $HOME/replica1
    +sudo ~/tmp_basedir_polardb_pg_1100_bld/bin/polar-replica-initdb.sh \
    +    /nvme1n1/shared_data/ $HOME/replica1/
    +
    +$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D /tmp/replica1
    +cp /tmp/replica1/*.conf $HOME/replica1/
    +

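    Editing the Read-Only Node Configuration

    Edit the read-only node's configuration file ~/replica1/postgresql.conf to set its cluster identifier and listening port, as well as the same shared storage directory as the read-write node: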

    port=5432
    +polar_hostid=2
    +
    +polar_enable_shared_storage_mode=on
    +polar_disk_name='nvme1n1'
    +polar_datadir='/nvme1n1/shared_data/'
    +polar_vfs.localfs_mode=off
    +shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
    +polar_storage_cluster_name='disk'
    +
    +logging_collector=on
    +log_line_prefix='%p\t%r\t%u\t%m\t'
    +log_directory='pg_log'
    +listen_addresses='*'
    +max_connections=1000
    +

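    Edit the read-only node's replication configuration file ~/replica1/recovery.conf to set the role of the current node (read-only), along with the connection string and replication slot for physical replication from the read-write node: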

    polar_replica='on'
    +recovery_target_timeline='latest'
    +primary_conninfo='host=[read-write node IP] port=5432 user=postgres dbname=postgres application_name=replica1'
    +primary_slot_name='replica1'
    +

    The read-write node does not yet have a replication slot named replica1, so connect to the read-write node and create it:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5432 \
    +    -d postgres \
    +    -c "SELECT pg_create_physical_replication_slot('replica1');"
    + pg_create_physical_replication_slot
    +-------------------------------------
    + (replica1,)
    +(1 row)
    +
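
Running the slot creation a second time fails because the slot already exists. A variant that can be rerun safely is sketched below: it builds a statement that only calls pg_create_physical_replication_slot when pg_replication_slots has no row with that name. Both are standard PostgreSQL objects; the wrapper name is hypothetical.

```shell
# Print SQL that creates a physical replication slot only if it is missing.
ensure_slot_sql() (
    slot=$1
    printf '%s\n' "SELECT pg_create_physical_replication_slot('$slot') WHERE NOT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = '$slot');"
)
```

The printed statement can be passed to psql with `-c "$(ensure_slot_sql replica1)"` on the read-write node.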

    Starting the Read-Only Node

    After completing the steps above, start the read-only node and verify it:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/replica1 start
    +
    +$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5432 \
    +    -d postgres \
    +    -c 'SELECT version();'
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +

    Checking Cluster Functionality

    Connect to the read-write node, create a table, and insert data:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
    +    -p 5432 \
    +    -d postgres \
    +    -c "CREATE TABLE t(id INT); INSERT INTO t SELECT generate_series(1,10);"
    +

    The data inserted on the read-write node can be queried on the read-only node immediately:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
    +    -p 5432 \
    +    -d postgres \
    +    -c "SELECT * FROM t;"
    + id
    +----
    +  1
    +  2
    +  3
    +  4
    +  5
    +  6
    +  7
    +  8
    +  9
    + 10
    +(10 rows)
    +

    On the read-write node, you can see that the replication slot used for physical replication with the read-only node is now active:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
    +    -p 5432 \
    +    -d postgres \
    +    -c "SELECT * FROM pg_replication_slots;"
    + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
    +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
    + replica1  |        | physical  |        |          | f         | t      |         45 |      |              | 0/4079E8E8  |
    +(1 row)
    +

    More read-only nodes can be added in the same way.
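
When adding further read-only nodes, each one needs its own host id, replication slot, and local directory. The sketch below derives them from the node's index under an assumed naming convention extrapolated from replica1 above (polar_hostid = n + 1, slot replica<n>, directory $HOME/replica<n>); the convention and the function name are not prescribed by PolarDB.

```shell
# Derive the per-node settings for the n-th read-only node (assumed convention).
replica_settings() (
    n=$1
    case "$n" in ''|*[!0-9]*|0) echo "n must be a positive integer" >&2; exit 1 ;; esac
    printf 'polar_hostid=%s\n' "$((n + 1))"
    printf "primary_slot_name='replica%s'\n" "$n"
    printf 'datadir=%s\n' "\$HOME/replica$n"
)
```

For a second read-only node, `replica_settings 2` lists the values to use in its postgresql.conf and recovery.conf.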

    Scaling In the Cluster

    Scaling in the cluster is simple: just shut down the read-only node.

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/replica1 stop
    +

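    After the read-only node is shut down, its replication slot on the read-write node becomes inactive. An inactive replication slot prevents WAL from being recycled, so it needs to be cleaned up promptly.

    Run the following command on the read-write node to remove the replication slot named replica1: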

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5432 \
    +    -d postgres \
    +    -c "SELECT pg_drop_replication_slot('replica1');"
    + pg_drop_replication_slot
    +--------------------------
    +
    +(1 row)
    +
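
If several read-only nodes were removed, the inactive slots can be found from pg_replication_slots instead of being dropped one by one from memory. The sketch below turns the unaligned, tuples-only output of `psql -At -c "SELECT slot_name, active FROM pg_replication_slots;"` (rows of the form `name|t` or `name|f`) into drop statements; the function name is hypothetical, and the statements are printed so they can be checked before being run.

```shell
# Turn "slot_name|active" rows from stdin into DROP calls for inactive slots.
drop_inactive_slots_sql() {
    awk -F'|' '$2 == "f" { printf "SELECT pg_drop_replication_slot(\047%s\047);\n", $1 }'
}
```

Slots that are still active are left out, so running nodes are not affected.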
    + + + diff --git a/operation/tpcc-test.html b/operation/tpcc-test.html new file mode 100644 index 00000000000..f1e6fc37820 --- /dev/null +++ b/operation/tpcc-test.html @@ -0,0 +1,59 @@ + + + + + + + + + TPC-C Test | PolarDB for PostgreSQL + + + + +

    TPC-C Test

    棠羽

    2023/04/11

    15 min

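    This article guides you through running a TPC-C test against PolarDB for PostgreSQL.

    Background

    TPC (Transaction Processing Performance Council) publishes a series of specifications for transaction processing and database benchmarks. Among them, TPC-C is the benchmark model for OLTP. TPC-C provides a unified standard for benchmarking and gives a broad view of database service stability, performance, and system performance. Running a TPC-C benchmark against a database measures the performance of the database on the one hand and the price-performance of different hardware and software systems on the other; it is a test model widely used and followed in the industry.

    Test Steps

    Deploying PolarDB-PG

    Refer to the deployment tutorials to deploy PolarDB for PostgreSQL.

    Installing the Test Tool BenchmarkSQL

    BenchmarkSQL depends on the Java runtime and the Maven package manager, which must be installed beforehand. After cloning the BenchmarkSQL source and entering its directory, build the project with mvn: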

    $ git clone https://github.com/pgsql-io/benchmarksql.git
    +$ cd benchmarksql
    +$ mvn
    +

    The built tool is located in the following directory:

    $ cd target/run
    +

    TPC-C Configuration

    The built tool directory contains sample configurations for different database products:

    $ ls | grep sample
    +sample.firebird.properties
    +sample.mariadb.properties
    +sample.oracle.properties
    +sample.postgresql.properties
    +sample.transact-sql.properties
    +

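    Among them, sample.postgresql.properties contains template parameters for PostgreSQL-family databases; you can modify this template to customize the configuration. See the documentation of BenchmarkSQL for a detailed description of the configuration items.

    The configuration items cover the following types:

    • JDBC driver and connection information: you need to set the connection string, user name, password, and so on of the running PostgreSQL database yourself
    • Test scale parameters
    • Test duration parameters
    • Throughput parameters
    • Transaction type parameters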
    Loading Data

    Use the runDatabaseBuild.sh script, with the configuration file as its argument, to generate and load the test data:

    ./runDatabaseBuild.sh sample.postgresql.properties
    +

    Warming Up

    A warm-up run is usually performed before the formal test:

    ./runBenchmark.sh sample.postgresql.properties
    +

    Formal Test

    After the warm-up, run the same command again for the formal test:

    ./runBenchmark.sh sample.postgresql.properties
    +

    Viewing the Results

                                              _____ latency (seconds) _____
    +  TransType              count |   mix % |    mean       max     90th% |    rbk%          errors
    ++--------------+---------------+---------+---------+---------+---------+---------+---------------+
    +| NEW_ORDER    |           635 |  44.593 |   0.006 |   0.012 |   0.008 |   1.102 |             0 |
    +| PAYMENT      |           628 |  44.101 |   0.001 |   0.006 |   0.002 |   0.000 |             0 |
    +| ORDER_STATUS |            58 |   4.073 |   0.093 |   0.168 |   0.132 |   0.000 |             0 |
    +| STOCK_LEVEL  |            52 |   3.652 |   0.035 |   0.044 |   0.041 |   0.000 |             0 |
    +| DELIVERY     |            51 |   3.581 |   0.000 |   0.001 |   0.001 |   0.000 |             0 |
    +| DELIVERY_BG  |            51 |   0.000 |   0.018 |   0.023 |   0.020 |   0.000 |             0 |
    ++--------------+---------------+---------+---------+---------+---------+---------+---------------+
    +
    +Overall NOPM:          635 (98.76% of the theoretical maximum)
    +Overall TPM:         1,424
    +

    The results are also saved in CSV form; the result directory can be found in the output log.
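
The summary figures can be cross-checked against the table. In the sample above, Overall NOPM (635) equals the NEW_ORDER count, and the counts of all rows except DELIVERY_BG add up to 635 + 628 + 58 + 52 + 51 = 1424, the reported Overall TPM, which is consistent with a one-minute run. The sketch below reproduces that sum from a result table on stdin; that the tool computes TPM this way is an inference from this sample, and the function name is hypothetical.

```shell
# Sum the per-transaction counts of a result table read from stdin,
# skipping the DELIVERY_BG row.
sum_tpcc_counts() {
    awk -F'|' '
        NF >= 3 && $2 !~ /DELIVERY_BG/ && $3 ~ /^ *[0-9]+ *$/ { total += $3 }
        END { print total + 0 }
    '
}
```

Header and separator lines are ignored because their third field is not a plain number.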

    + + + diff --git a/operation/tpch-test.html b/operation/tpch-test.html new file mode 100644 index 00000000000..a75167102fe --- /dev/null +++ b/operation/tpch-test.html @@ -0,0 +1,234 @@ + + + + + + + + + TPC-H Test | PolarDB for PostgreSQL + + + + +

    TPC-H Test

    棠羽

    2023/04/12

    20 min

    This article walks you through running a TPC-H test against PolarDB for PostgreSQL.

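    Background

    TPC-H is a benchmark dataset designed specifically for measuring database performance in analytical scenarios.

    Test Preparation

    Deploy PolarDB-PG

    Use Docker to quickly bring up a PolarDB for PostgreSQL cluster backed by local storage: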

    docker pull polardb/polardb_pg_local_instance
    +docker run -it \
    +    --cap-add=SYS_PTRACE \
    +    --privileged=true \
    +    --name polardb_pg_htap \
    +    --shm-size=512m \
    +    polardb/polardb_pg_local_instance \
    +    bash
    +

    Alternatively, refer to Advanced Deployment to deploy a PolarDB for PostgreSQL cluster backed by shared storage.

    Generate the TPC-H Test Dataset

    Generate the test data with the tpch-dbgen tool:

    $ git clone https://github.com/ApsaraDB/tpch-dbgen.git
    +$ cd tpch-dbgen
    +$ ./build.sh --help
    +
    +  1) Use default configuration to build
    +  ./build.sh
    +  2) Use limited configuration to build
    +  ./build.sh --user=postgres --db=postgres --host=localhost --port=5432 --scale=1
    +  3) Run the test case
    +  ./build.sh --run
    +  4) Run the target test case
    +  ./build.sh --run=3. run the 3rd case.
    +  5) Run the target test case with option
    +  ./build.sh --run --option="set polar_enable_px = on;"
    +  6) Clean the test data. This step will drop the database or tables, remove csv
    +  and tbl files
    +  ./build.sh --clean
    +  7) Quick build TPC-H with 100MB scale of data
    +  ./build.sh --scale=0.1
    +

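    By setting different parameters, you can build TPC-H datasets of different scales. The parameters of the build.sh script have the following meanings:

    • --user: database username
    • --db: database name
    • --host: database host address
    • --port: database service port
    • --run: run all TPC-H queries, or one specific TPC-H query
    • --option: specify additional GUC parameters
    • --scale: scale of the generated TPC-H dataset, in GB

    The script provides no parameter for entering the database password; to authenticate, set the PGPASSWORD environment variable to the database user's password: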

    export PGPASSWORD=<your password>
    +
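    When the build is driven from a script, the flags and the PGPASSWORD variable can be assembled programmatically. The following is a minimal sketch that uses only the flags listed in the help output; the helper names build_sh_command and build_sh_env are hypothetical and not part of tpch-dbgen, and nothing is executed here.

```python
import os
import shlex


def build_sh_command(user=None, db=None, host=None, port=None, scale=None):
    """Assemble the argument list for ./build.sh from optional settings."""
    args = ["./build.sh"]
    for flag, value in (("user", user), ("db", db), ("host", host),
                        ("port", port), ("scale", scale)):
        if value is not None:
            args.append(f"--{flag}={value}")
    return args


def build_sh_env(password):
    """Copy the current environment and add PGPASSWORD for authentication."""
    env = dict(os.environ)
    env["PGPASSWORD"] = password
    return env


cmd = build_sh_command(user="postgres", db="postgres",
                       host="localhost", port=5432, scale=0.1)
print(shlex.join(cmd))
# To actually run it from the tpch-dbgen directory, one option is:
#   subprocess.run(cmd, env=build_sh_env("<your password>"), check=True)
```

    Passing the password through the environment of the child process keeps it out of the command line, which matches how the script expects to authenticate.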

    Generate and import 100 MB of TPC-H data:

    ./build.sh --scale=0.1
    +

    Generate and import 1 GB of TPC-H data:

    ./build.sh
    +

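    Run PostgreSQL Single-Node Parallel Query

    Taking TPC-H Q18 as an example, run a PostgreSQL single-node parallel query and observe the query speed.

    In the tpch-dbgen/ directory, connect to the database with psql: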

    cd tpch-dbgen
    +psql
    +
    -- Enable timing
    +\timing on
    +
    +-- Set the single-node degree of parallelism
    +SET max_parallel_workers_per_gather = 2;
    +
    +-- Show the execution plan of Q18
    +\i finals/18.explain.sql
    +                                                                         QUERY PLAN
    +------------------------------------------------------------------------------------------------------------------------------------------------------------
    + Sort  (cost=3450834.75..3450835.42 rows=268 width=81)
    +   Sort Key: orders.o_totalprice DESC, orders.o_orderdate
    +   ->  GroupAggregate  (cost=3450817.91..3450823.94 rows=268 width=81)
    +         Group Key: customer.c_custkey, orders.o_orderkey
    +         ->  Sort  (cost=3450817.91..3450818.58 rows=268 width=67)
    +               Sort Key: customer.c_custkey, orders.o_orderkey
    +               ->  Hash Join  (cost=1501454.20..3450807.10 rows=268 width=67)
    +                     Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
    +                     ->  Seq Scan on lineitem  (cost=0.00..1724402.52 rows=59986052 width=22)
    +                     ->  Hash  (cost=1501453.37..1501453.37 rows=67 width=53)
    +                           ->  Nested Loop  (cost=1500465.85..1501453.37 rows=67 width=53)
    +                                 ->  Nested Loop  (cost=1500465.43..1501084.65 rows=67 width=34)
    +                                       ->  Finalize GroupAggregate  (cost=1500464.99..1500517.66 rows=67 width=4)
    +                                             Group Key: lineitem_1.l_orderkey
    +                                             Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
    +                                             ->  Gather Merge  (cost=1500464.99..1500511.66 rows=400 width=36)
    +                                                   Workers Planned: 2
    +                                                   ->  Sort  (cost=1499464.97..1499465.47 rows=200 width=36)
    +                                                         Sort Key: lineitem_1.l_orderkey
    +                                                         ->  Partial HashAggregate  (cost=1499454.82..1499457.32 rows=200 width=36)
    +                                                               Group Key: lineitem_1.l_orderkey
    +                                                               ->  Parallel Seq Scan on lineitem lineitem_1  (cost=0.00..1374483.88 rows=24994188 width=22)
    +                                       ->  Index Scan using orders_pkey on orders  (cost=0.43..8.45 rows=1 width=30)
    +                                             Index Cond: (o_orderkey = lineitem_1.l_orderkey)
    +                                 ->  Index Scan using customer_pkey on customer  (cost=0.43..5.50 rows=1 width=23)
    +                                       Index Cond: (c_custkey = orders.o_custkey)
    +(26 rows)
    +
    +Time: 3.965 ms
    +
    +-- Run Q18
    +\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 80150.449 ms (01:20.150)
    +

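    Run ePQ Single-Node Parallel Query

    PolarDB for PostgreSQL provides elastic cross-node parallel query (ePQ), which is well suited to analytical queries. The following steps walk you through running TPC-H queries in parallel with ePQ on a single host.

    In the tpch-dbgen/ directory, connect to the database with psql: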

    cd tpch-dbgen
    +psql
    +

    First, set the maximum ePQ query parallelism on the eight tables created by TPC-H:

    ALTER TABLE nation SET (px_workers = 100);
    +ALTER TABLE region SET (px_workers = 100);
    +ALTER TABLE supplier SET (px_workers = 100);
    +ALTER TABLE part SET (px_workers = 100);
    +ALTER TABLE partsupp SET (px_workers = 100);
    +ALTER TABLE customer SET (px_workers = 100);
    +ALTER TABLE orders SET (px_workers = 100);
    +ALTER TABLE lineitem SET (px_workers = 100);
    +
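    Since the same storage parameter is applied to every table, the statements can also be generated rather than typed out. This is a small sketch: the table names and the px_workers value come from the statements shown here, while the helper name px_workers_statements is hypothetical.

```python
# The eight tables created by the TPC-H build.
TPCH_TABLES = ["nation", "region", "supplier", "part",
               "partsupp", "customer", "orders", "lineitem"]


def px_workers_statements(tables, workers=100):
    """Return one ALTER TABLE statement per table."""
    return [f"ALTER TABLE {t} SET (px_workers = {workers});" for t in tables]


statements = px_workers_statements(TPCH_TABLES)
print("\n".join(statements))
```

    The printed text can be pasted into psql or saved to a file and loaded with \i.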

    Taking Q18 as an example, run the query:

    -- Enable timing
    +\timing on
    +
    +-- Turn on the ePQ feature switch
    +SET polar_enable_px = ON;
    +-- Set the ePQ degree of parallelism per node to 1
    +SET polar_px_dop_per_node = 1;
    +
    +-- Show the execution plan of Q18
    +\i finals/18.explain.sql
    +                                                                          QUERY PLAN
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------
    + PX Coordinator 2:1  (slice1; segments: 2)  (cost=0.00..257526.21 rows=59986052 width=47)
    +   Merge Key: orders.o_totalprice, orders.o_orderdate
    +   ->  GroupAggregate  (cost=0.00..243457.68 rows=29993026 width=47)
    +         Group Key: orders.o_totalprice, orders.o_orderdate, customer.c_name, customer.c_custkey, orders.o_orderkey
    +         ->  Sort  (cost=0.00..241257.18 rows=29993026 width=47)
    +               Sort Key: orders.o_totalprice DESC, orders.o_orderdate, customer.c_name, customer.c_custkey, orders.o_orderkey
    +               ->  Hash Join  (cost=0.00..42729.99 rows=29993026 width=47)
    +                     Hash Cond: (orders.o_orderkey = lineitem_1.l_orderkey)
    +                     ->  PX Hash 2:2  (slice2; segments: 2)  (cost=0.00..15959.71 rows=7500000 width=39)
    +                           Hash Key: orders.o_orderkey
    +                           ->  Hash Join  (cost=0.00..15044.19 rows=7500000 width=39)
    +                                 Hash Cond: (orders.o_custkey = customer.c_custkey)
    +                                 ->  PX Hash 2:2  (slice3; segments: 2)  (cost=0.00..11561.51 rows=7500000 width=20)
    +                                       Hash Key: orders.o_custkey
    +                                       ->  Hash Semi Join  (cost=0.00..11092.01 rows=7500000 width=20)
    +                                             Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
    +                                             ->  Partial Seq Scan on orders  (cost=0.00..1132.25 rows=7500000 width=20)
    +                                             ->  Hash  (cost=7760.84..7760.84 rows=400 width=4)
    +                                                   ->  PX Broadcast 2:2  (slice4; segments: 2)  (cost=0.00..7760.84 rows=400 width=4)
    +                                                         ->  Result  (cost=0.00..7760.80 rows=200 width=4)
    +                                                               Filter: ((sum(lineitem.l_quantity)) > '314'::numeric)
    +                                                               ->  Finalize HashAggregate  (cost=0.00..7760.78 rows=500 width=12)
    +                                                                     Group Key: lineitem.l_orderkey
    +                                                                     ->  PX Hash 2:2  (slice5; segments: 2)  (cost=0.00..7760.72 rows=500 width=12)
    +                                                                           Hash Key: lineitem.l_orderkey
    +                                                                           ->  Partial HashAggregate  (cost=0.00..7760.70 rows=500 width=12)
    +                                                                                 Group Key: lineitem.l_orderkey
    +                                                                                 ->  Partial Seq Scan on lineitem  (cost=0.00..3350.82 rows=29993026 width=12)
    +                                 ->  Hash  (cost=597.51..597.51 rows=749979 width=23)
    +                                       ->  PX Hash 2:2  (slice6; segments: 2)  (cost=0.00..597.51 rows=749979 width=23)
    +                                             Hash Key: customer.c_custkey
    +                                             ->  Partial Seq Scan on customer  (cost=0.00..511.44 rows=749979 width=23)
    +                     ->  Hash  (cost=5146.80..5146.80 rows=29993026 width=12)
    +                           ->  PX Hash 2:2  (slice7; segments: 2)  (cost=0.00..5146.80 rows=29993026 width=12)
    +                                 Hash Key: lineitem_1.l_orderkey
    +                                 ->  Partial Seq Scan on lineitem lineitem_1  (cost=0.00..3350.82 rows=29993026 width=12)
    + Optimizer: PolarDB PX Optimizer
    +(37 rows)
    +
    +Time: 216.672 ms
    +
    +-- Run Q18
    +\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 59113.965 ms (00:59.114)
    +

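    As you can see, the execution time is shorter than with PostgreSQL's single-node parallel execution (about 59 s versus about 80 s). Increasing the per-node parallelism of ePQ improves query performance more noticeably: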

    SET polar_px_dop_per_node = 2;
    +\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 42400.500 ms (00:42.401)
    +
    +SET polar_px_dop_per_node = 4;
    +\i finals/18.sql
    +
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 19892.603 ms (00:19.893)
    +
    +SET polar_px_dop_per_node = 8;
    +\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 10944.402 ms (00:10.944)
    +
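    To put the timings side by side, the speedup of each ePQ run over the PostgreSQL single-node parallel run can be computed directly. The numbers below are copied from the psql output in this article; the script is only a worked calculation, not a benchmark harness.

```python
# Q18 wall-clock times in milliseconds, copied from the psql output.
baseline_ms = 80150.449  # PostgreSQL, max_parallel_workers_per_gather = 2
epq_ms = {1: 59113.965, 2: 42400.500, 4: 19892.603, 8: 10944.402}

# Speedup factor of each ePQ setting relative to the PostgreSQL baseline.
speedup = {dop: baseline_ms / ms for dop, ms in epq_ms.items()}

for dop, factor in speedup.items():
    print(f"polar_px_dop_per_node={dop}: {factor:.2f}x the baseline speed")
```

    On these measurements the factor grows from roughly 1.4x at a per-node parallelism of 1 to roughly 7.3x at 8.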

    Running Q17 and Q18 with ePQ may cause an out-of-memory (OOM) error. Set the following parameter to avoid exhausting memory:

    SET polar_px_optimizer_enable_hashagg = 0;
    +

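    Run ePQ Cross-Node Parallel Query

    In the example above, for simplicity, the multiple compute nodes of PolarDB for PostgreSQL are deployed on the same host. When ePQ is used in this setup, all compute nodes share the CPU, memory, and I/O bandwidth of that one host, so execution is in essence parallel execution on a single host. In practice, the compute nodes of PolarDB for PostgreSQL can be deployed on multiple machines that share the same storage nodes. Using ePQ then performs truly distributed, cross-machine parallel queries and can make full use of the computing resources of all the machines.

    Refer to Advanced Deployment to set up PolarDB for PostgreSQL clusters in different forms. Once the cluster is up, ePQ is used in exactly the same way as on a single host.

    If you encounter the following error: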

    psql:queries/q01.analyze.sq1:24: WARNING:  interconnect may encountered a network error, please check your network
    +DETAIL:  Failed to send packet (seq 1) to 192.168.1.8:57871 (pid 17766 cid 0) after 100 retries.
    +

    you can try setting the MTU of every machine to 9000:

    ifconfig <interface name> mtu 9000
    +
    + + + diff --git a/roadmap/index.html b/roadmap/index.html new file mode 100644 index 00000000000..0302b94d08d --- /dev/null +++ b/roadmap/index.html @@ -0,0 +1,33 @@ + + + + + + + + + Roadmap | PolarDB for PostgreSQL + + + + +

    Roadmap

    Alibaba Cloud continuously releases updates to PolarDB PostgreSQL (hereafter referred to as PolarDB) to improve user experience. At present, Alibaba Cloud plans the following versions for PolarDB:

    Version 1.0

    Version 1.0 supports shared storage and compute-storage separation. This version provides the minimum set of features such as Polar virtual file system (PolarVFS), flushing and buffer management, LogIndex, and SyncDDL.

    • PolarVFS: A VFS is abstracted from the database engine. This way, the database engine can connect to all types of storage, and you do not need to consider whether the storage uses buffered I/O or direct I/O.
    • Flushing and buffer management: In each PolarDB cluster, data is separately processed on each compute node, but all compute nodes share the same physical storage. The speed at which the primary node flushes write-ahead logging (WAL) records must be controlled to prevent the read-only nodes from reading future pages.
    • LogIndex: The read-only nodes cannot flush WAL records. When you query a page on a read-only node, the read-only node reads a previous version of the page from the shared storage. Then, the read-only node reads and replays the WAL records of the page from its memory to obtain the most recent version of the page. Each LogIndex record consists of the metadata of a specific WAL record. The read-only nodes can efficiently retrieve the WAL records of a page by using LogIndex records.
    • SyncDDL: PolarDB supports compute-storage separation. When the primary node runs DDL operations, it considers the objects, such as relations, that are referenced by the read-only nodes. The locks that are held by the DDL operations are synchronized from the primary node to the read-only nodes.
    • db-monitor: The db-monitor module monitors the host on which your PolarDB cluster runs. The db-monitor module also monitors the databases that you create in your PolarDB cluster. The monitoring data provides a basis for switchovers and helps ensure high availability.

    Version 2.0

    In addition to improvements to compute-storage separation, version 2.0 provides a significantly improved optimizer.

    • UniqueKey: The UniqueKey module ensures that the data on plan nodes is unique. This feature is similar to the ordering feature that you can use on plan nodes. Data uniqueness reduces unnecessary DISTINCT and GROUP BY clauses and improves the ordering of the results of joins.

    Version 3.0

    The availability of PolarDB with compute-storage separation is significantly improved.

    • Parallel replay: LogIndex enables PolarDB to replay WAL records in lazy replay mode. In the lazy replay mode, the read-only nodes only mark the WAL records of each updated page. The read-only nodes read and replay the WAL records only when you query the page on these nodes. The lazy replay mechanism may impair read performance. Version 3.0 uses the parallel replay mechanism together with the lazy replay mechanism to accelerate read queries.
    • OnlinePromote: If the primary node unexpectedly exits, your workloads can be switched over to a read-only node. The read-only node does not need to restart. The read-only node is promoted to run as the new primary node immediately after it replays all WAL records in parallel. This significantly reduces downtime.
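    The interplay between LogIndex and lazy replay can be pictured with a toy model. The sketch below is purely illustrative and makes strong simplifying assumptions: a page is a single integer, a WAL record is a delta added to a page, and the class and method names are hypothetical. It does not reflect PolarDB's actual data structures.

```python
from collections import defaultdict


class ReadOnlyNode:
    """Toy model: pages are integers, a WAL record adds a delta to a page."""

    def __init__(self, shared_storage):
        self.shared_storage = shared_storage      # page_id -> stored value
        self.wal = {}                             # lsn -> (page_id, delta)
        self.log_index = defaultdict(list)        # page_id -> [lsn, ...]
        self.cache = {}                           # page_id -> replayed value

    def receive_wal(self, lsn, page_id, delta):
        """Lazy replay: remember the record and index it by page, nothing more."""
        self.wal[lsn] = (page_id, delta)
        self.log_index[page_id].append(lsn)
        self.cache.pop(page_id, None)             # any cached copy is now stale

    def read_page(self, page_id):
        """Replay the page's pending WAL records only when it is queried."""
        if page_id not in self.cache:
            value = self.shared_storage[page_id]  # older version of the page
            for lsn in sorted(self.log_index[page_id]):
                value += self.wal[lsn][1]
            self.cache[page_id] = value
        return self.cache[page_id]


node = ReadOnlyNode(shared_storage={1: 10, 2: 20})
node.receive_wal(lsn=100, page_id=1, delta=5)
node.receive_wal(lsn=101, page_id=1, delta=-2)
print(node.read_page(1), node.read_page(2))
```

    In this model the cost of replay is paid on the first read of a page, which is the read-latency concern that replaying in parallel is meant to reduce.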

    Version 4.0

    Version 4.0 can meet your growing business requirements in hybrid transaction/analytical processing (HTAP) scenarios. Version 4.0 is based on the shared storage-based massively parallel processing (MPP) architecture, which allows PolarDB to fully utilize the CPU, memory, and I/O resources of multiple read-only nodes.

    Test results show that the performance of a PolarDB cluster linearly increases as you increase the number of cores from 1 to 256.

    Version 5.0

    In earlier versions, each PolarDB cluster consists of one primary node that processes both read requests and write requests and one or more read-only nodes that process only read requests. You can increase the read capability of a PolarDB cluster by creating more read-only nodes. However, you cannot increase the write capability because each PolarDB cluster consists of only one primary node.

    Version 5.0 uses the shared-nothing architecture together with the shared-everything architecture. This allows multiple compute nodes to process write requests.

    + + + diff --git a/theory/analyze.html b/theory/analyze.html new file mode 100644 index 00000000000..e3a9a02923d --- /dev/null +++ b/theory/analyze.html @@ -0,0 +1,639 @@ + + + + + + + + + Code Analysis of ANALYZE | PolarDB for PostgreSQL + + + + +

    Code Analysis of ANALYZE

    棠羽

    2022/06/20

    15 min

    Background

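    In the optimizer, PostgreSQL produces, for a query tree, the physical plan tree with the highest execution efficiency. Execution efficiency is measured by cost estimation. For example, estimating the number of tuples a query returns and the width of those tuples makes it possible to compute the I/O cost; the CPU cost can likewise be estimated from the physical operations to be performed. The optimizer obtains the key statistics needed for cost estimation from the system catalog pg_statistic, and the statistics in pg_statistic are in turn computed by automatic or manual ANALYZE operations (or VACUUM). ANALYZE scans the data in a table, analyzes it column by column, and writes the resulting statistics, such as each column's data distribution, most common values, and their frequencies, into the system catalog.

    This article analyzes how the ANALYZE operation is implemented, from the perspective of the source code. The source code used is PostgreSQL 14, the latest stable version of PostgreSQL at the time of writing.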

    Statistics

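    First, we should understand what the analysis produces. Let us look at the columns of pg_statistic and what each one means. Each row of this system catalog holds the statistics for one column of some other table: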

    postgres=# \d+ pg_statistic
    +                                 Table "pg_catalog.pg_statistic"
    +   Column    |   Type   | Collation | Nullable | Default | Storage  | Stats target | Description
    +-------------+----------+-----------+----------+---------+----------+--------------+-------------
    + starelid    | oid      |           | not null |         | plain    |              |
    + staattnum   | smallint |           | not null |         | plain    |              |
    + stainherit  | boolean  |           | not null |         | plain    |              |
    + stanullfrac | real     |           | not null |         | plain    |              |
    + stawidth    | integer  |           | not null |         | plain    |              |
    + stadistinct | real     |           | not null |         | plain    |              |
    + stakind1    | smallint |           | not null |         | plain    |              |
    + stakind2    | smallint |           | not null |         | plain    |              |
    + stakind3    | smallint |           | not null |         | plain    |              |
    + stakind4    | smallint |           | not null |         | plain    |              |
    + stakind5    | smallint |           | not null |         | plain    |              |
    + staop1      | oid      |           | not null |         | plain    |              |
    + staop2      | oid      |           | not null |         | plain    |              |
    + staop3      | oid      |           | not null |         | plain    |              |
    + staop4      | oid      |           | not null |         | plain    |              |
    + staop5      | oid      |           | not null |         | plain    |              |
    + stanumbers1 | real[]   |           |          |         | extended |              |
    + stanumbers2 | real[]   |           |          |         | extended |              |
    + stanumbers3 | real[]   |           |          |         | extended |              |
    + stanumbers4 | real[]   |           |          |         | extended |              |
    + stanumbers5 | real[]   |           |          |         | extended |              |
    + stavalues1  | anyarray |           |          |         | extended |              |
    + stavalues2  | anyarray |           |          |         | extended |              |
    + stavalues3  | anyarray |           |          |         | extended |              |
    + stavalues4  | anyarray |           |          |         | extended |              |
    + stavalues5  | anyarray |           |          |         | extended |              |
    +Indexes:
    +    "pg_statistic_relid_att_inh_index" UNIQUE, btree (starelid, staattnum, stainherit)
    +
    /* ----------------
    + *      pg_statistic definition.  cpp turns this into
    + *      typedef struct FormData_pg_statistic
    + * ----------------
    + */
    +CATALOG(pg_statistic,2619,StatisticRelationId)
    +{
    +    /* These fields form the unique key for the entry: */
    +    Oid         starelid BKI_LOOKUP(pg_class);  /* relation containing
    +                                                 * attribute */
    +    int16       staattnum;      /* attribute (column) stats are for */
    +    bool        stainherit;     /* true if inheritance children are included */
    +
    +    /* the fraction of the column's entries that are NULL: */
    +    float4      stanullfrac;
    +
    +    /*
    +     * stawidth is the average width in bytes of non-null entries.  For
    +     * fixed-width datatypes this is of course the same as the typlen, but for
    +     * var-width types it is more useful.  Note that this is the average width
    +     * of the data as actually stored, post-TOASTing (eg, for a
    +     * moved-out-of-line value, only the size of the pointer object is
    +     * counted).  This is the appropriate definition for the primary use of
    +     * the statistic, which is to estimate sizes of in-memory hash tables of
    +     * tuples.
    +     */
    +    int32       stawidth;
    +
    +    /* ----------------
    +     * stadistinct indicates the (approximate) number of distinct non-null
    +     * data values in the column.  The interpretation is:
    +     *      0       unknown or not computed
    +     *      > 0     actual number of distinct values
    +     *      < 0     negative of multiplier for number of rows
    +     * The special negative case allows us to cope with columns that are
    +     * unique (stadistinct = -1) or nearly so (for example, a column in which
    +     * non-null values appear about twice on the average could be represented
    +     * by stadistinct = -0.5 if there are no nulls, or -0.4 if 20% of the
    +     * column is nulls).  Because the number-of-rows statistic in pg_class may
    +     * be updated more frequently than pg_statistic is, it's important to be
    +     * able to describe such situations as a multiple of the number of rows,
    +     * rather than a fixed number of distinct values.  But in other cases a
    +     * fixed number is correct (eg, a boolean column).
    +     * ----------------
    +     */
    +    float4      stadistinct;
    +
    +    /* ----------------
    +     * To allow keeping statistics on different kinds of datatypes,
    +     * we do not hard-wire any particular meaning for the remaining
    +     * statistical fields.  Instead, we provide several "slots" in which
    +     * statistical data can be placed.  Each slot includes:
    +     *      kind            integer code identifying kind of data (see below)
    +     *      op              OID of associated operator, if needed
    +     *      coll            OID of relevant collation, or 0 if none
    +     *      numbers         float4 array (for statistical values)
    +     *      values          anyarray (for representations of data values)
    +     * The ID, operator, and collation fields are never NULL; they are zeroes
    +     * in an unused slot.  The numbers and values fields are NULL in an
    +     * unused slot, and might also be NULL in a used slot if the slot kind
    +     * has no need for one or the other.
    +     * ----------------
    +     */
    +
    +    int16       stakind1;
    +    int16       stakind2;
    +    int16       stakind3;
    +    int16       stakind4;
    +    int16       stakind5;
    +
    +    Oid         staop1 BKI_LOOKUP_OPT(pg_operator);
    +    Oid         staop2 BKI_LOOKUP_OPT(pg_operator);
    +    Oid         staop3 BKI_LOOKUP_OPT(pg_operator);
    +    Oid         staop4 BKI_LOOKUP_OPT(pg_operator);
    +    Oid         staop5 BKI_LOOKUP_OPT(pg_operator);
    +
    +    Oid         stacoll1 BKI_LOOKUP_OPT(pg_collation);
    +    Oid         stacoll2 BKI_LOOKUP_OPT(pg_collation);
    +    Oid         stacoll3 BKI_LOOKUP_OPT(pg_collation);
    +    Oid         stacoll4 BKI_LOOKUP_OPT(pg_collation);
    +    Oid         stacoll5 BKI_LOOKUP_OPT(pg_collation);
    +
    +#ifdef CATALOG_VARLEN           /* variable-length fields start here */
    +    float4      stanumbers1[1];
    +    float4      stanumbers2[1];
    +    float4      stanumbers3[1];
    +    float4      stanumbers4[1];
    +    float4      stanumbers5[1];
    +
    +    /*
    +     * Values in these arrays are values of the column's data type, or of some
    +     * related type such as an array element type.  We presently have to cheat
    +     * quite a bit to allow polymorphic arrays of this kind, but perhaps
    +     * someday it'll be a less bogus facility.
    +     */
    +    anyarray    stavalues1;
    +    anyarray    stavalues2;
    +    anyarray    stavalues3;
    +    anyarray    stavalues4;
    +    anyarray    stavalues5;
    +#endif
    +} FormData_pg_statistic;
    +

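    Whether viewed from the database command line or from the kernel C code, the content of the statistics is the same. All attributes start with sta. Among them:

    • starelid: the table or index that the column belongs to
    • staattnum: the number of the column, within that table or index, that this row of statistics describes
    • stainherit: whether the statistics include inheritance children
    • stanullfrac: the fraction of rows in which the column is NULL
    • stawidth: the average width, in bytes, of the column's non-null values
    • stadistinct: the number of distinct non-null values in the column
      • 0: unknown or not computed
      • > 0: the actual number of distinct values
      • < 0: the negative of a multiplier for the number of rows (for example, -1 for a unique column)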
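    The three cases of stadistinct can be folded into one small function. This is a minimal sketch of the interpretation given in the kernel comment, not the planner's own code; the function name estimated_distinct and the row_count argument are illustrative.

```python
def estimated_distinct(stadistinct, row_count):
    """Turn a stadistinct value into an estimated number of distinct values.

    Returns None when the statistic is unknown (stadistinct == 0).
    """
    if stadistinct == 0:
        return None                      # unknown or not computed
    if stadistinct > 0:
        return stadistinct               # fixed number of distinct values
    return -stadistinct * row_count      # multiplier applied to the row count


print(estimated_distinct(2, 1_000_000))     # a fixed count, e.g. a boolean column
print(estimated_distinct(-1, 1_000_000))    # a unique column
print(estimated_distinct(-0.5, 1_000_000))  # each value appears about twice
```

    Storing a multiplier rather than a fixed count keeps the estimate meaningful when the table's row count changes between two ANALYZE runs.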

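    Because the statistics that can be computed differ slightly from one data type to another, PostgreSQL reserves, in the remaining fields, a number of slots for storing statistics. The current kernel reserves five slots: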

    #define STATISTIC_NUM_SLOTS  5
    +

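    Each particular kind of statistic can occupy one slot, and what goes into the slot is entirely up to the definition of that kind of statistic. Each slot provides the following fields (N is the slot number, from 1 to 5):

    • stakindN: integer code identifying the kind of statistic
    • staopN: OID of the operator used to compute or use the statistic
    • stacollN: OID of the collation
    • stanumbersN: array of floating-point numbers
    • stavaluesN: array of values of any type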
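    Because a given kind of statistic may land in any of the five slots, a reader of pg_statistic has to search the stakindN columns rather than assume a fixed position. The sketch below illustrates that lookup on a row represented as a plain dict; the helper names and the sample row are hypothetical, and the kind code 1 is the most-common-values code defined in the same kernel header.

```python
STATISTIC_NUM_SLOTS = 5
STATISTIC_KIND_MCV = 1  # kind code for a "most common values" slot


def slot_columns(n):
    """Column names that make up slot n (1-based) in pg_statistic."""
    if not 1 <= n <= STATISTIC_NUM_SLOTS:
        raise ValueError(f"slot number must be 1..{STATISTIC_NUM_SLOTS}")
    return [f"{prefix}{n}" for prefix in
            ("stakind", "staop", "stacoll", "stanumbers", "stavalues")]


def find_slot(row, kind):
    """Return the 1-based number of the slot holding `kind`, or None."""
    for n in range(1, STATISTIC_NUM_SLOTS + 1):
        if row.get(f"stakind{n}", 0) == kind:
            return n
    return None


# Hypothetical pg_statistic row, reduced to its stakindN columns.
row = {"stakind1": 2, "stakind2": 1, "stakind3": 3, "stakind4": 0, "stakind5": 0}
print(slot_columns(2))
print(find_slot(row, STATISTIC_KIND_MCV))
```

    A kind of 0 marks an unused slot, which is why the search never matches it for a real kind code.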

    The PostgreSQL kernel specifies that kind codes 1 to 99 are reserved for PostgreSQL core statistics; the allocation of the other codes is described in the kernel comment:

    /*
    + * The present allocation of "kind" codes is:
    + *
    + *  1-99:       reserved for assignment by the core PostgreSQL project
    + *              (values in this range will be documented in this file)
    + *  100-199:    reserved for assignment by the PostGIS project
    + *              (values to be documented in PostGIS documentation)
    + *  200-299:    reserved for assignment by the ESRI ST_Geometry project
    + *              (values to be documented in ESRI ST_Geometry documentation)
    + *  300-9999:   reserved for future public assignments
    + *
    + * For private use you may choose a "kind" code at random in the range
    + * 10000-30000.  However, for code that is to be widely disseminated it is
    + * better to obtain a publicly defined "kind" code by request from the
    + * PostgreSQL Global Development Group.
    + */
    +
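    The allocation in this comment amounts to a range lookup. The following sketch encodes the ranges exactly as listed; the function name kind_owner and the "unassigned" fallback label are illustrative choices, not kernel definitions.

```python
# (low, high, owner) ranges as listed in the kernel comment.
KIND_RANGES = [
    (1, 99, "core PostgreSQL"),
    (100, 199, "PostGIS"),
    (200, 299, "ESRI ST_Geometry"),
    (300, 9999, "reserved for future public assignments"),
    (10000, 30000, "private use"),
]


def kind_owner(kind):
    """Return who a statistics kind code is allocated to."""
    for low, high, owner in KIND_RANGES:
        if low <= kind <= high:
            return owner
    return "unassigned"


for code in (1, 150, 20000, 40000):
    print(code, kind_owner(code))
```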

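    Seven kinds of PostgreSQL core statistics can currently be seen in the kernel code, numbered 1 to 7. Let us look at how each of these seven kinds uses the slots described above.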

    Most Common Values (MCV)

    /*
    + * In a "most common values" slot, staop is the OID of the "=" operator
    + * used to decide whether values are the same or not, and stacoll is the
    + * collation used (same as column's collation).  stavalues contains
    + * the K most common non-null values appearing in the column, and stanumbers
    + * contains their frequencies (fractions of total row count).  The values
    + * shall be ordered in decreasing frequency.  Note that since the arrays are
    + * variable-size, K may be chosen by the statistics collector.  Values should
    + * not appear in MCV unless they have been observed to occur more than once;
    + * a unique column will have no MCV slot.
    + */
    +#define STATISTIC_KIND_MCV  1
    +

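    For the most common values of a column, staop stores the = operator, which decides whether a value equals a most common value. stavalues stores the K most common non-null values in the column, in decreasing order of frequency, and stanumbers stores the frequency of each of these K values.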
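    The content of an MCV slot can be illustrated on a small in-memory column. This is a simplified sketch of the slot's definition only: it counts every row instead of sampling, and it is not the algorithm ANALYZE actually runs. The function name mcv_slot is hypothetical.

```python
from collections import Counter


def mcv_slot(values, k):
    """Return (stavalues, stanumbers) for the k most common non-null values."""
    total_rows = len(values)
    counts = Counter(v for v in values if v is not None)
    # Only values seen more than once qualify; most_common() already sorts
    # by decreasing count.
    common = [(v, c) for v, c in counts.most_common() if c > 1][:k]
    stavalues = [v for v, _ in common]
    # Frequencies are fractions of the total row count, NULLs included.
    stanumbers = [c / total_rows for _, c in common]
    return stavalues, stanumbers


column = ["a", "b", "a", None, "c", "a", "b", "d"]
print(mcv_slot(column, k=2))
```

    A column in which every value is unique yields empty lists here, which corresponds to the comment's remark that a unique column has no MCV slot.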

    Histogram

    /*
    + * A "histogram" slot describes the distribution of scalar data.  staop is
    + * the OID of the "<" operator that describes the sort ordering, and stacoll
    + * is the relevant collation.  (In theory more than one histogram could appear,
    + * if a datatype has more than one useful sort operator or we care about more
    + * than one collation.  Currently the collation will always be that of the
    + * underlying column.)  stavalues contains M (>=2) non-null values that
    + * divide the non-null column data values into M-1 bins of approximately equal
    + * population.  The first stavalues item is the MIN and the last is the MAX.
    + * stanumbers is not used and should be NULL.  IMPORTANT POINT: if an MCV
    + * slot is also provided, then the histogram describes the data distribution
    + * *after removing the values listed in MCV* (thus, it's a "compressed
    + * histogram" in the technical parlance).  This allows a more accurate
    + * representation of the distribution of a column with some very-common
    + * values.  In a column with only a few distinct values, it's possible that
    + * the MCV list describes the entire data population; in this case the
    + * histogram reduces to empty and should be omitted.
    + */
    +#define STATISTIC_KIND_HISTOGRAM  2
    +

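    This slot holds the histogram of the data distribution of a (scalar) column. staop stores the < operator, which determines the sort order of the distribution. stavalues contains M non-null values that divide the column's non-null values into M - 1 bins of approximately equal population. If the column already has an MCV slot, the histogram excludes the values listed in MCV, to give a more accurate picture of the distribution.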
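    The idea of a compressed, equal-population histogram can be sketched as follows. The code is an illustration under simplifying assumptions (the whole column is available in memory, and boundaries are taken at evenly spaced positions of the sorted values); it is not taken from the PostgreSQL source, and the function name histogram_bounds is hypothetical.

```python
def histogram_bounds(values, mcv_values, m):
    """Pick m boundary values splitting the remaining data into m-1 bins."""
    excluded = set(mcv_values)
    # Drop NULLs and the values already described by the MCV slot.
    rest = sorted(v for v in values if v is not None and v not in excluded)
    if m < 2 or len(rest) < 2:
        return []  # nothing left to describe: the slot would be omitted
    m = min(m, len(rest))
    last = len(rest) - 1
    # Evenly spaced positions; the first is the minimum, the last the maximum.
    return [rest[(i * last) // (m - 1)] for i in range(m)]


column = [7, 1, 9, 7, 3, 5, 7, 2, 8, 4, 6, None]
print(histogram_bounds(column, mcv_values=[7], m=4))
```

    Removing the frequent value 7 before choosing boundaries keeps it from dominating a bin, which is the benefit the kernel comment attributes to the compressed histogram.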

    Correlation

    /*
    + * A "correlation" slot describes the correlation between the physical order
    + * of table tuples and the ordering of data values of this column, as seen
    + * by the "<" operator identified by staop with the collation identified by
    + * stacoll.  (As with the histogram, more than one entry could theoretically
    + * appear.)  stavalues is not used and should be NULL.  stanumbers contains
    + * a single entry, the correlation coefficient between the sequence of data
    + * values and the sequence of their actual tuple positions.  The coefficient
    + * ranges from +1 to -1.
    + */
    +#define STATISTIC_KIND_CORRELATION  3
    +

    stanumbers holds a single entry: the correlation coefficient between the order of the data values and their actual tuple positions.

    Most Common Elements

    /*
    + * A "most common elements" slot is similar to a "most common values" slot,
    + * except that it stores the most common non-null *elements* of the column
    + * values.  This is useful when the column datatype is an array or some other
    + * type with identifiable elements (for instance, tsvector).  staop contains
    + * the equality operator appropriate to the element type, and stacoll
    + * contains the collation to use with it.  stavalues contains
    + * the most common element values, and stanumbers their frequencies.  Unlike
    + * MCV slots, frequencies are measured as the fraction of non-null rows the
    + * element value appears in, not the frequency of all rows.  Also unlike
    + * MCV slots, the values are sorted into the element type's default order
    + * (to support binary search for a particular value).  Since this puts the
    + * minimum and maximum frequencies at unpredictable spots in stanumbers,
    + * there are two extra members of stanumbers, holding copies of the minimum
    + * and maximum frequencies.  Optionally, there can be a third extra member,
    + * which holds the frequency of null elements (expressed in the same terms:
    + * the fraction of non-null rows that contain at least one null element).  If
    + * this member is omitted, the column is presumed to contain no null elements.
    + *
    + * Note: in current usage for tsvector columns, the stavalues elements are of
    + * type text, even though their representation within tsvector is not
    + * exactly text.
    + */
    +#define STATISTIC_KIND_MCELEM  4
    +

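    Similar to MCV, but stores the most common elements of the column values, which is mainly useful for arrays and similar types. staop again holds the equality operator, which is used to count how often each element occurs. Unlike MCV, the frequency denominator is the number of non-null rows rather than all rows. The common elements are also sorted in the default order of the element type, so that a particular value can be found by binary search.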

    Distinct Elements Count Histogram

    /*
    + * A "distinct elements count histogram" slot describes the distribution of
    + * the number of distinct element values present in each row of an array-type
    + * column.  Only non-null rows are considered, and only non-null elements.
    + * staop contains the equality operator appropriate to the element type,
    + * and stacoll contains the collation to use with it.
    + * stavalues is not used and should be NULL.  The last member of stanumbers is
    + * the average count of distinct element values over all non-null rows.  The
    + * preceding M (>=2) members form a histogram that divides the population of
    + * distinct-elements counts into M-1 bins of approximately equal population.
    + * The first of these is the minimum observed count, and the last the maximum.
    + */
    +#define STATISTIC_KIND_DECHIST  5
    +

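    Describes the distribution of the number of distinct element values in each row of an array-type column. The first M members of stanumbers are the bounds that divide the per-row distinct-element counts into M - 1 bins of roughly equal population. They are followed by one more member, the average number of distinct elements over all non-null rows. This statistic is presumably used when estimating selectivity.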

    Length Histogram

    /*
    + * A "length histogram" slot describes the distribution of range lengths in
    + * rows of a range-type column. stanumbers contains a single entry, the
    + * fraction of empty ranges. stavalues is a histogram of non-empty lengths, in
    + * a format similar to STATISTIC_KIND_HISTOGRAM: it contains M (>=2) range
    + * values that divide the column data values into M-1 bins of approximately
    + * equal population. The lengths are stored as float8s, as measured by the
    + * range type's subdiff function. Only non-null rows are considered.
    + */
    +#define STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM  6
    +

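    The length histogram describes the distribution of range lengths in a range-type column. stavalues holds a histogram of M values over the non-empty range lengths, and stanumbers holds a single entry, the fraction of empty ranges.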

    Bounds Histogram

    /*
    + * A "bounds histogram" slot is similar to STATISTIC_KIND_HISTOGRAM, but for
    + * a range-type column.  stavalues contains M (>=2) range values that divide
    + * the column data values into M-1 bins of approximately equal population.
    + * Unlike a regular scalar histogram, this is actually two histograms combined
    + * into a single array, with the lower bounds of each value forming a
    + * histogram of lower bounds, and the upper bounds a histogram of upper
    + * bounds.  Only non-NULL, non-empty ranges are included.
    + */
    +#define STATISTIC_KIND_BOUNDS_HISTOGRAM  7
    +

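    The bounds histogram also applies to range-type columns and is similar to the scalar histogram. stavalues holds M range values that divide the column's values into M - 1 bins of roughly equal population. Only non-NULL, non-empty ranges are included.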

    Kernel Execution of ANALYZE

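    Now that we know what pg_statistic has to store, let's look at how the kernel collects and computes it, starting in the PostgreSQL kernel's command execution code. For utility commands such as ANALYZE, the switch statement in standard_ProcessUtility() routes each command type to the function that implements it.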

    /*
    + * standard_ProcessUtility itself deals only with utility commands for
    + * which we do not provide event trigger support.  Commands that do have
    + * such support are passed down to ProcessUtilitySlow, which contains the
    + * necessary infrastructure for such triggers.
    + *
    + * This division is not just for performance: it's critical that the
    + * event trigger code not be invoked when doing START TRANSACTION for
    + * example, because we might need to refresh the event trigger cache,
    + * which requires being in a valid transaction.
    + */
    +void
    +standard_ProcessUtility(PlannedStmt *pstmt,
    +                        const char *queryString,
    +                        bool readOnlyTree,
    +                        ProcessUtilityContext context,
    +                        ParamListInfo params,
    +                        QueryEnvironment *queryEnv,
    +                        DestReceiver *dest,
    +                        QueryCompletion *qc)
    +{
    +    // ...
    +
    +    switch (nodeTag(parsetree))
    +    {
    +        // ...
    +
    +        case T_VacuumStmt:
    +            ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
    +            break;
    +
    +        // ...
    +    }
    +
    +    // ...
    +}
    +

    ANALYZE shares its entry point with VACUUM: both go through ExecVacuum().

    /*
    + * Primary entry point for manual VACUUM and ANALYZE commands
    + *
    + * This is mainly a preparation wrapper for the real operations that will
    + * happen in vacuum().
    + */
    +void
    +ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
    +{
    +    // ...
    +
    +    /* Now go through the common routine */
    +    vacuum(vacstmt->rels, &params, NULL, isTopLevel);
    +}
    +

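    After parsing a long list of options, control reaches vacuum(). Here the kernel first works out which tables to analyze, because ANALYZE can be asked to:

    • analyze every table in the database
    • analyze specific tables
    • analyze specific columns of a table

    Once the set of tables is known, each one is passed to analyze_rel() in turn: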

    if (params->options & VACOPT_ANALYZE)
    +{
    +    // ...
    +
    +    analyze_rel(vrel->oid, vrel->relation, params,
    +                vrel->va_cols, in_outer_xact, vac_strategy);
    +
    +    // ...
    +}
    +

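    Inside analyze_rel(), the kernel takes a ShareUpdateExclusiveLock on the table to be analyzed, which prevents two ANALYZE runs from working on it concurrently. It then picks a handling strategy based on the relation type (a foreign table, for example, is analyzed through the ANALYZE routine supplied by its FDW). Finally, the table is passed to do_analyze_rel().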

    /*
    + *  analyze_rel() -- analyze one relation
    + *
    + * relid identifies the relation to analyze.  If relation is supplied, use
    + * the name therein for reporting any failure to open/lock the rel; do not
    + * use it once we've successfully opened the rel, since it might be stale.
    + */
    +void
    +analyze_rel(Oid relid, RangeVar *relation,
    +            VacuumParams *params, List *va_cols, bool in_outer_xact,
    +            BufferAccessStrategy bstrategy)
    +{
    +    // ...
    +
    +    /*
    +     * Do the normal non-recursive ANALYZE.  We can skip this for partitioned
    +     * tables, which don't contain any rows.
    +     */
    +    if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
    +        do_analyze_rel(onerel, params, va_cols, acquirefunc,
    +                       relpages, false, in_outer_xact, elevel);
    +
    +    // ...
    +}
    +

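    Inside do_analyze_rel(), the kernel narrows down which columns of the table to analyze: the user may have listed only a few columns, since frequently accessed columns benefit most from analysis. It also opens all indexes on the table to check whether they contain columns that can be analyzed.

    To compute per-column statistics, the column data has to be read from disk. This raises a key question: how many rows should be scanned? In theory, the more data analyzed (ideally all of it), the more accurate the statistics. For a very large table, though, the data will not fit in memory and the time the analysis would block is unacceptable. So the user can specify the maximum number of rows to sample, trading statistical accuracy against run-time cost: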

    /*
    + * Determine how many rows we need to sample, using the worst case from
    + * all analyzable columns.  We use a lower bound of 100 rows to avoid
    + * possible overflow in Vitter's algorithm.  (Note: that will also be the
    + * target in the corner case where there are no analyzable columns.)
    + */
    +targrows = 100;
    +for (i = 0; i < attr_cnt; i++)
    +{
    +    if (targrows < vacattrstats[i]->minrows)
    +        targrows = vacattrstats[i]->minrows;
    +}
    +for (ind = 0; ind < nindexes; ind++)
    +{
    +    AnlIndexData *thisdata = &indexdata[ind];
    +
    +    for (i = 0; i < thisdata->attr_cnt; i++)
    +    {
    +        if (targrows < thisdata->vacattrstats[i]->minrows)
    +            targrows = thisdata->vacattrstats[i]->minrows;
    +    }
    +}
    +
    +/*
    + * Look at extended statistics objects too, as those may define custom
    + * statistics target. So we may need to sample more rows and then build
    + * the statistics with enough detail.
    + */
    +minrows = ComputeExtStatisticsRows(onerel, attr_cnt, vacattrstats);
    +
    +if (targrows < minrows)
    +    targrows = minrows;
    +

    With the sample size decided, the kernel allocates a tuple array of that length and starts sampling through the acquirefunc function pointer:

    /*
    + * Acquire the sample rows
    + */
    +rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
    +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
    +                             inh ? PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS_INH :
    +                             PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS);
    +if (inh)
    +    numrows = acquire_inherited_sample_rows(onerel, elevel,
    +                                            rows, targrows,
    +                                            &totalrows, &totaldeadrows);
    +else
    +    numrows = (*acquirefunc) (onerel, elevel,
    +                              rows, targrows,
    +                              &totalrows, &totaldeadrows);
    +

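    This function pointer refers to acquire_sample_rows(), which analyze_rel() set up earlier. The function samples the table in two stages:

    • Stage 1: randomly select up to the target number of blocks (or all blocks, if the table has fewer)
    • Stage 2: scan the selected blocks and use the Vitter algorithm to draw a random sample of rows

    The two stages run at the same time. When sampling finishes, the sampled tuples are in the tuple array. If the array is full, it is sorted by tuple position with qsort(); the sample is then used to estimate the total numbers of live and dead tuples in the table: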

    /*
    + * acquire_sample_rows -- acquire a random sample of rows from the table
    + *
    + * Selected rows are returned in the caller-allocated array rows[], which
    + * must have at least targrows entries.
    + * The actual number of rows selected is returned as the function result.
    + * We also estimate the total numbers of live and dead rows in the table,
    + * and return them into *totalrows and *totaldeadrows, respectively.
    + *
    + * The returned list of tuples is in order by physical position in the table.
    + * (We will rely on this later to derive correlation estimates.)
    + *
    + * As of May 2004 we use a new two-stage method:  Stage one selects up
    + * to targrows random blocks (or all blocks, if there aren't so many).
    + * Stage two scans these blocks and uses the Vitter algorithm to create
    + * a random sample of targrows rows (or less, if there are less in the
    + * sample of blocks).  The two stages are executed simultaneously: each
    + * block is processed as soon as stage one returns its number and while
    + * the rows are read stage two controls which ones are to be inserted
    + * into the sample.
    + *
    + * Although every row has an equal chance of ending up in the final
    + * sample, this sampling method is not perfect: not every possible
    + * sample has an equal chance of being selected.  For large relations
    + * the number of different blocks represented by the sample tends to be
    + * too small.  We can live with that for now.  Improvements are welcome.
    + *
    + * An important property of this sampling method is that because we do
    + * look at a statistically unbiased set of blocks, we should get
    + * unbiased estimates of the average numbers of live and dead rows per
    + * block.  The previous sampling method put too much credence in the row
    + * density near the start of the table.
    + */
    +static int
    +acquire_sample_rows(Relation onerel, int elevel,
    +                    HeapTuple *rows, int targrows,
    +                    double *totalrows, double *totaldeadrows)
    +{
    +    // ...
    +
    +    /* Outer loop over blocks to sample */
    +    while (BlockSampler_HasMore(&bs))
    +    {
    +        bool        block_accepted;
    +        BlockNumber targblock = BlockSampler_Next(&bs);
    +        // ...
    +    }
    +
    +    // ...
    +
    +    /*
    +     * If we didn't find as many tuples as we wanted then we're done. No sort
    +     * is needed, since they're already in order.
    +     *
    +     * Otherwise we need to sort the collected tuples by position
    +     * (itempointer). It's not worth worrying about corner cases where the
    +     * tuples are already sorted.
    +     */
    +    if (numrows == targrows)
    +        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
    +
    +    /*
    +     * Estimate total numbers of live and dead rows in relation, extrapolating
    +     * on the assumption that the average tuple density in pages we didn't
    +     * scan is the same as in the pages we did scan.  Since what we scanned is
    +     * a random sample of the pages in the relation, this should be a good
    +     * assumption.
    +     */
    +    if (bs.m > 0)
    +    {
    +        *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
    +        *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
    +    }
    +    else
    +    {
    +        *totalrows = 0.0;
    +        *totaldeadrows = 0.0;
    +    }
    +
    +    // ...
    +}
    +
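
    Stage two relies on Vitter's reservoir algorithm. The C sketch below shows the simpler Algorithm R instead (not Vitter's optimized variant, which skips rows rather than drawing a random number for every row). Rows are represented by their index, and next_random is a tiny deterministic generator included only to keep the example self-contained; both names are hypothetical, and the generator is not statistically rigorous.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small linear congruential generator (hypothetical helper). */
uint32_t
next_random(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/*
 * Algorithm R: keep a uniform sample of up to k row indexes out of a
 * stream of n rows, looking at each row exactly once.
 * Returns the number of rows placed in sample[].
 */
size_t
reservoir_sample(size_t n, size_t k, size_t *sample, uint32_t *rng_state)
{
    size_t filled = 0;
    size_t row;

    assert(k > 0);

    for (row = 0; row < n; row++)
    {
        if (filled < k)
        {
            /* Fill phase: the first k rows always enter the reservoir. */
            sample[filled++] = row;
        }
        else
        {
            /* Row replaces a random slot with probability k / (row + 1). */
            size_t j = next_random(rng_state) % (row + 1);

            if (j < k)
                sample[j] = row;
        }
    }

    return filled;
}
```

    Once the reservoir is full, replacements land in arbitrary slots, so the sample is no longer in physical order. This matches the code above, which sorts only when numrows == targrows.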

    Back in do_analyze_rel(): with the sample in hand, the kernel computes statistics for each column being analyzed and then updates the pg_statistic system catalog:

    /*
    + * Compute the statistics.  Temporary results during the calculations for
    + * each column are stored in a child context.  The calc routines are
    + * responsible to make sure that whatever they store into the VacAttrStats
    + * structure is allocated in anl_context.
    + */
    +if (numrows > 0)
    +{
    +    // ...
    +
    +    for (i = 0; i < attr_cnt; i++)
    +    {
    +        VacAttrStats *stats = vacattrstats[i];
    +        AttributeOpts *aopt;
    +
    +        stats->rows = rows;
    +        stats->tupDesc = onerel->rd_att;
    +        stats->compute_stats(stats,
    +                             std_fetch_func,
    +                             numrows,
    +                             totalrows);
    +
    +        // ...
    +    }
    +
    +    // ...
    +
    +    /*
    +     * Emit the completed stats rows into pg_statistic, replacing any
    +     * previous statistics for the target columns.  (If there are stats in
    +     * pg_statistic for columns we didn't process, we leave them alone.)
    +     */
    +    update_attstats(RelationGetRelid(onerel), inh,
    +                    attr_cnt, vacattrstats);
    +
    +    // ...
    +}
    +

    Clearly, the function behind the compute_stats pointer differs by column type, so let's look at where the pointer is assigned:

    /*
    + * std_typanalyze -- the default type-specific typanalyze function
    + */
    +bool
    +std_typanalyze(VacAttrStats *stats)
    +{
    +    // ...
    +
    +    /*
    +     * Determine which standard statistics algorithm to use
    +     */
    +    if (OidIsValid(eqopr) && OidIsValid(ltopr))
    +    {
    +        /* Seems to be a scalar datatype */
    +        stats->compute_stats = compute_scalar_stats;
    +        /*--------------------
    +         * The following choice of minrows is based on the paper
    +         * "Random sampling for histogram construction: how much is enough?"
    +         * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
    +         * Proceedings of ACM SIGMOD International Conference on Management
    +         * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
    +         * says that for table size n, histogram size k, maximum relative
    +         * error in bin size f, and error probability gamma, the minimum
    +         * random sample size is
    +         *      r = 4 * k * ln(2*n/gamma) / f^2
    +         * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
    +         *      r = 305.82 * k
    +         * Note that because of the log function, the dependence on n is
    +         * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
    +         * bin size error with probability 0.99.  So there's no real need to
    +         * scale for n, which is a good thing because we don't necessarily
    +         * know it at this point.
    +         *--------------------
    +         */
    +        stats->minrows = 300 * attr->attstattarget;
    +    }
    +    else if (OidIsValid(eqopr))
    +    {
    +        /* We can still recognize distinct values */
    +        stats->compute_stats = compute_distinct_stats;
    +        /* Might as well use the same minrows as above */
    +        stats->minrows = 300 * attr->attstattarget;
    +    }
    +    else
    +    {
    +        /* Can't do much but the trivial stuff */
    +        stats->compute_stats = compute_trivial_stats;
    +        /* Might as well use the same minrows as above */
    +        stats->minrows = 300 * attr->attstattarget;
    +    }
    +
    +    // ...
    +}
    +

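    This conditional can be read as follows:

    • If the column's data type has default = (eqopr: equals operator) and < (ltopr: less than operator) operators, the column is treated as a scalar type and can be analyzed with compute_scalar_stats()
    • If the data type supports only the = operator, compute_distinct_stats() can still gather statistics on distinct values
    • If neither applies, the column can only get the basic analysis done by compute_trivial_stats()

    We could look at what each of these three functions does, but I won't walk through their logic in depth. The ideas come from statistics papers old enough that the letters in the PDFs are barely legible. The code is not especially readable either, because it largely transcribes the formulas from those papers; without reading them, the variables and formulas are hard to follow.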

    compute_trivial_stats

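    If a column's data type supports neither the equality operator nor the comparison operator, only a basic analysis is possible, covering:

    • the fraction of non-null rows
    • the average width of the column's values

    Both are easy to obtain with a single loop over the sampled tuple array.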

    /*
    + *  compute_trivial_stats() -- compute very basic column statistics
    + *
    + *  We use this when we cannot find a hash "=" operator for the datatype.
    + *
    + *  We determine the fraction of non-null rows and the average datum width.
    + */
    +static void
    +compute_trivial_stats(VacAttrStatsP stats,
    +                      AnalyzeAttrFetchFunc fetchfunc,
    +                      int samplerows,
    +                      double totalrows)
    +{}
    +
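
    Since the function body is elided above, here is a minimal C sketch of the kind of loop involved. It assumes each sampled value has been reduced to a null flag and a byte width; the SampleValue and TrivialStats types and the function name are hypothetical and do not mirror the kernel's VacAttrStats structure.

```c
#include <assert.h>
#include <stddef.h>

/* One sampled column value, reduced to what the basic stats need. */
typedef struct SampleValue
{
    int     is_null;    /* non-zero if the value is NULL */
    size_t  width;      /* width in bytes; ignored for NULLs */
} SampleValue;

typedef struct TrivialStats
{
    double  nonnull_frac;   /* fraction of non-null rows */
    double  avg_width;      /* average width of non-null values */
} TrivialStats;

TrivialStats
trivial_stats(const SampleValue *samples, size_t n)
{
    TrivialStats out = {0.0, 0.0};
    size_t  nonnull = 0;
    size_t  total_width = 0;
    size_t  i;

    assert(n > 0);

    for (i = 0; i < n; i++)
    {
        if (samples[i].is_null)
            continue;
        nonnull++;
        total_width += samples[i].width;
    }

    out.nonnull_frac = (double) nonnull / (double) n;
    if (nonnull > 0)
        out.avg_width = (double) total_width / (double) nonnull;

    return out;
}
```

    For a sample of four values with widths 4, 8 and 6 and one NULL, the sketch reports a non-null fraction of 0.75 and an average width of 6.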

    compute_distinct_stats

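    If a column supports only the equality operator, we can tell what a value is but cannot order it against other values. The distribution of values over a range therefore cannot be analyzed, only how frequently each value occurs. The statistics computed by this function are:

    • the fraction of non-null rows
    • the average width of the column's values
    • the most common values (MCV)
    • the (estimated) number of distinct values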
    /*
    + *  compute_distinct_stats() -- compute column statistics including ndistinct
    + *
    + *  We use this when we can find only an "=" operator for the datatype.
    + *
    + *  We determine the fraction of non-null rows, the average width, the
    + *  most common values, and the (estimated) number of distinct values.
    + *
    + *  The most common values are determined by brute force: we keep a list
    + *  of previously seen values, ordered by number of times seen, as we scan
    + *  the samples.  A newly seen value is inserted just after the last
    + *  multiply-seen value, causing the bottommost (oldest) singly-seen value
    + *  to drop off the list.  The accuracy of this method, and also its cost,
    + *  depend mainly on the length of the list we are willing to keep.
    + */
    +static void
    +compute_distinct_stats(VacAttrStatsP stats,
    +                       AnalyzeAttrFetchFunc fetchfunc,
    +                       int samplerows,
    +                       double totalrows)
    +{}
    +
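
    The brute-force tracking described in the comment above can be sketched in C as follows. The sketch assumes int values and a caller-provided array of track_max entries; TrackItem and track_value are hypothetical names, and the real function works on Datum values and goes on to derive the distinct-value estimate, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct TrackItem
{
    int value;
    int count;
} TrackItem;

/*
 * Feed one sampled value into the track list, which is kept ordered by
 * descending count. Returns the new number of tracked items.
 */
size_t
track_value(TrackItem *track, size_t track_cnt, size_t track_max, int value)
{
    size_t first_single = track_cnt;    /* first entry seen only once */
    size_t j;

    assert(track_max > 0 && track_cnt <= track_max);

    for (j = 0; j < track_cnt; j++)
    {
        if (track[j].value == value)    /* only "=" is required */
            break;
        if (j < first_single && track[j].count == 1)
            first_single = j;
    }

    if (j < track_cnt)
    {
        /* Seen before: bump the count and bubble the entry forward. */
        track[j].count++;
        while (j > 0 && track[j].count > track[j - 1].count)
        {
            TrackItem tmp = track[j];

            track[j] = track[j - 1];
            track[j - 1] = tmp;
            j--;
        }
        return track_cnt;
    }

    /* New value: insert it just after the last multiply-seen entry. */
    if (track_cnt < track_max)
        track_cnt++;
    for (j = track_cnt - 1; j > first_single; j--)
        track[j] = track[j - 1];        /* oldest single drops off if full */
    if (first_single < track_cnt)
    {
        track[first_single].value = value;
        track[first_single].count = 1;
    }

    return track_cnt;
}
```

    Feeding 7, 8, 7, 9, 10 into a list of length 3 leaves 7 (seen twice) at the front, followed by 10 and 9; the value 8, the oldest singly-seen entry, has dropped off.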

    compute_scalar_stats

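    If a column's data type supports both the equality operator and the comparison operator, the most detailed analysis is possible. The statistics computed are:

    • the fraction of non-null rows
    • the average width of the column's values
    • the most common values (MCV)
    • the (estimated) number of distinct values
    • the data distribution histogram
    • the correlation between physical and logical order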
    /*
    + *  compute_scalar_stats() -- compute column statistics
    + *
    + *  We use this when we can find "=" and "<" operators for the datatype.
    + *
    + *  We determine the fraction of non-null rows, the average width, the
    + *  most common values, the (estimated) number of distinct values, the
    + *  distribution histogram, and the correlation of physical to logical order.
    + *
    + *  The desired stats can be determined fairly easily after sorting the
    + *  data values into order.
    + */
    +static void
    +compute_scalar_stats(VacAttrStatsP stats,
    +                     AnalyzeAttrFetchFunc fetchfunc,
    +                     int samplerows,
    +                     double totalrows)
    +{}
    +

    Summary

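    Taking the statistics needed by the PostgreSQL optimizer as a starting point, this article walked through the overall execution flow of the ANALYZE command. For brevity, the walkthrough does not cover the various corner cases and their handling.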

    References

    PostgreSQL 14 Documentation: ANALYZE

    PostgreSQL 14 Documentation: 25.1. Routine Vacuuming

    PostgreSQL 14 Documentation: 14.2. Statistics Used by the Planner

    PostgreSQL 14 Documentation: 52.49. pg_statistic

    Alibaba Cloud Database Kernel Monthly Report, 2016/05: PostgreSQL Feature Analysis: How Statistics Are Computed

    + + + diff --git a/theory/arch-htap.html b/theory/arch-htap.html new file mode 100644 index 00000000000..dda5fb97440 --- /dev/null +++ b/theory/arch-htap.html @@ -0,0 +1,109 @@ + + + + + + + + + HTAP Architecture | PolarDB for PostgreSQL + + + + +

    HTAP Architecture

    严华

    2022/09/10

    35 min

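    Background

    Many PolarDB PG users need to run both TP (Transactional Processing) and AP (Analytical Processing) workloads on the same database. They expect it to handle highly concurrent TP requests during the day and to run AP reports at night, when TP traffic drops and the machines are idle. Even so, the idle machines' resources are not fully used. The original PolarDB PG faced two major challenges when processing complex AP queries:

    • Under the native PostgreSQL execution engine, a single SQL statement runs on one node only. Whether it runs serially or in parallel on that node, it cannot use the CPU, memory, or other compute resources of other nodes; it can only scale up, not scale out.
    • PolarDB sits on a storage pool whose I/O throughput is, in theory, unbounded. Because a single SQL statement runs on one node only, it is limited by that node's CPU and memory and cannot take full advantage of the storage side's high I/O bandwidth.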

    image.png

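    To address these pain points, PolarDB implements HTAP. The industry currently has three main approaches to HTAP:

    1. TP and AP fully separated in both storage and compute
      • Advantage: the two workloads do not affect each other
      • Disadvantages:
        • Freshness: TP data must be imported into the AP system, which introduces some delay
        • Cost and operational complexity: a redundant AP system has to be added
    2. TP and AP fully shared in both storage and compute
      • Advantage: lowest cost and highest resource utilization
      • Disadvantages:
        • With shared compute, AP and TP queries that run at the same time affect each other to some degree
        • When compute nodes and their storage are scaled out, data must be redistributed, so fast elastic scale-out is not possible
    3. TP and AP sharing storage but separated in compute
      • PolarDB's compute-storage separation architecture supports this approach naturally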

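    Principles

    Architecture Features

    Building on PolarDB's compute-storage separation architecture, we developed a distributed MPP execution engine that provides cross-node parallel execution and elastic scaling of compute, giving PolarDB initial HTAP capability:

    1. Unified storage: millisecond-level data freshness
      • TP and AP share one copy of the stored data, which lowers storage cost and improves query freshness
    2. Physical isolation of TP and AP: no CPU or memory interference between them
      • Single-node execution engine: handles highly concurrent TP queries on RW / RO nodes
      • Distributed MPP execution engine: handles complex AP queries on RO nodes
    3. Serverless elastic scaling: any RO node can initiate an MPP query
      • Scale Out: elastically adjust the set of nodes that execute MPP
      • Scale Up: elastically adjust the per-node degree of parallelism of MPP
    4. Elimination of data skew and computation skew, with full consideration of PostgreSQL's Buffer Pool affinity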

    image.png

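    Distributed MPP Execution Engine

    The core of PolarDB HTAP is the distributed MPP execution engine, a typical Volcano-model engine. For example, tables A and B are first joined and the result is then aggregated and output, which is also how the single-node PostgreSQL execution engine processes the query.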

    image.png

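    In a traditional MPP execution engine, data is scattered across nodes, and the data on different nodes may have different distribution properties, such as hash, random, or replicated distribution. Based on how each table is distributed, the engine inserts operators into the execution plan so that upper-level operators do not need to be aware of data distribution.

    PolarDB is different: with its shared-storage architecture, every compute node can access all of the stored data. If a traditional MPP execution engine were used as is, every Worker would scan the full data set and produce duplicate rows. The scan would also gain nothing from divide-and-conquer, so the engine could not be called MPP in any real sense.

    In the PolarDB distributed MPP execution engine, we therefore borrowed ideas from the Volcano model paper, parallelized all scan operators, and introduced the PxScan operator to hide shared storage. PxScan maps shared-storage data to shared-nothing data: through coordination among Workers, the target table is divided into multiple virtual partition blocks, and each Worker scans its own blocks, which yields a cross-node distributed parallel scan.

    Data scanned by PxScan is redistributed by Shuffle operators. After redistribution, each Worker processes its data according to the Volcano model, just as in single-node execution.

    Serverless Elastic Scaling

    A traditional MPP database can initiate MPP queries only from designated nodes, so each node can have only a single Worker scanning a given table. To support serverless elastic scaling in cloud-native environments, we introduced a distributed transaction consistency guarantee.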

    image.png

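    An arbitrary node is chosen as the Coordinator. Its ReadLSN is used as the agreed LSN, and the smallest snapshot version among all MPP nodes is chosen as the globally agreed snapshot version. LSN replay waiting and Global Snapshot synchronization together ensure that data and snapshots are consistent and usable no matter which node initiates an MPP query.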

    image.png

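    To make serverless elastic scaling possible, we took advantage of shared storage and placed every external dependency needed by the modules along the Coordinator's execution path on shared storage. The parameters each Worker needs at run time are also synchronized from the Coordinator over the control link, so the whole path across Coordinator and Worker nodes is stateless.

    With these two designs, elastic scaling in PolarDB has the following advantages:

    • Any node can act as the Coordinator, which removes the Coordinator single point of failure found in traditional MPP databases.
    • PolarDB can scale out (number of compute nodes) and scale up (per-node degree of parallelism). Scaling takes effect immediately and requires no data redistribution.
    • Workloads can be scheduled more flexibly, with different business domains running on different sets of nodes. As shown on the right of the figure below, SQL from business domain 1 can run AP queries on RO1 and RO2, while SQL from business domain 2 can use RO3 and RO4. The compute nodes used by the two domains can be scheduled elastically.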

    image.png

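    Eliminating Skew

    Skew is an inherent problem of traditional MPP. Its root causes are data distribution skew and computation skew:

    • Data distribution skew usually results from uneven data scattering. In PostgreSQL, TOAST storage for large objects also introduces some unavoidable imbalance in data distribution.
    • Computation skew usually results from differences across nodes in concurrent transactions, Buffer Pool state, network, and I/O jitter.

    Skew produces a weakest-link effect in traditional MPP execution: the completion time is bounded by the slowest subtask.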

    image.png

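    PolarDB designed and implemented an adaptive scan mechanism. As shown in the figure above, a Coordinator node coordinates the work of the Worker nodes. When scanning data, the Coordinator creates an in-memory task manager that schedules the Workers according to the scan tasks. Internally, the Coordinator runs two threads:

    • The Data thread serves the data link and collects and aggregates tuples
    • The Control thread serves the control link and controls the scan progress of each scan operator

    Workers that scan faster can take on more data blocks, so the more capable do more work. In the figure above, the Workers on RO1 and RO3 each scanned 4 data blocks, while RO2, which the computation skew left able to take on more blocks, ended up scanning 6.

    The adaptive scan mechanism of PolarDB HTAP also takes PostgreSQL's Buffer Pool affinity into account: each Worker scans the same data blocks as far as possible, which maximizes the Buffer Pool hit rate and reduces I/O cost.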
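
    The "more capable do more work" behavior can be illustrated with a small C sketch of a block dispenser: whichever Worker asks next receives the next chunk of blocks, so faster Workers end up with more of them. The ScanTask structure and claim_blocks function are hypothetical and are not PolarDB's task manager; the sketch ignores Buffer Pool affinity and the locking a real multi-threaded Coordinator would need.

```c
#include <assert.h>
#include <stddef.h>

/* Scan progress for one table (hypothetical structure). */
typedef struct ScanTask
{
    size_t next_block;      /* first block not yet handed out */
    size_t total_blocks;    /* number of blocks in the table */
} ScanTask;

/*
 * Hand the next chunk of blocks to whichever Worker asks.
 * Stores the first granted block in *start and returns how many blocks
 * were granted (0 once the table is exhausted).
 */
size_t
claim_blocks(ScanTask *task, size_t chunk, size_t *start)
{
    size_t remaining;
    size_t granted;

    assert(chunk > 0 && task->next_block <= task->total_blocks);

    remaining = task->total_blocks - task->next_block;
    granted = remaining < chunk ? remaining : chunk;

    *start = task->next_block;
    task->next_block += granted;

    return granted;
}
```

    With 14 blocks handed out two at a time, a Worker that asks three times while the other two ask twice each ends up with 6 blocks against their 4, which mirrors the RO1 / RO2 / RO3 example above.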

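    TPC-H Performance Comparison

    Single-Node Parallelism vs. Distributed MPP

    We used 16 PolarDB PG instances with 256 GB of memory each as RO nodes and set up a 1 TB TPC-H environment for comparison. Compared with single-node parallel execution, distributed MPP makes full use of the compute resources of all RO nodes and the I/O bandwidth of the underlying shared storage, which addresses the HTAP challenges described earlier at their root. Of the 22 TPC-H queries, 3 ran more than 60 times faster and 19 ran more than 10 times faster, for an average speedup of 23 times.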

    image.png

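    We also tested how performance changes as compute resources scale elastically. As the total number of CPU cores increased from 16 to 128, the total TPC-H run time improved linearly, as did the execution speed of each query, which confirms the serverless elastic scaling of PolarDB HTAP.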

    image.png

    image.png

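    When the total core count reached 256, the tests showed no further significant improvement: the I/O bandwidth of PolarDB's shared storage was saturated and had become the bottleneck.

    PolarDB vs. Traditional MPP Databases

    We compared PolarDB's distributed MPP execution engine with the MPP execution engine of a traditional database, again using 16 nodes with 256 GB of memory each.

    On 1 TB of TPC-H data, with the same per-node degree of parallelism as the traditional MPP database (multiple machines, one process each), PolarDB delivers 90% of the traditional MPP database's performance. The fundamental reason is that data in a traditional MPP database is hash-distributed by default: when the join keys of two tables are their respective distribution keys, a local Wise Join can be done without a shuffle. PolarDB sits on a shared storage pool, and the data that PxScan scans in parallel is equivalent to randomly distributed data, so it must be shuffled before it can be processed the way a traditional MPP database would process it. As a result, whenever a TPC-H query joins tables, PolarDB pays for one more network shuffle than the traditional MPP database.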

    image.png

    image.png

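    PolarDB's distributed MPP execution engine scales elastically without redistributing data. With MPP limited to 16 machines, PolarDB can therefore keep raising the per-node degree of parallelism to make full use of each machine: at a per-node degree of parallelism of 8, PolarDB performs 5 to 6 times better than the traditional MPP database, and as the per-node degree of parallelism increases linearly, overall performance increases linearly too. Only a configuration parameter needs to be changed, and the change takes effect immediately.

    Features

    Parallel Query

    After continued development, PolarDB HTAP's Parallel Query support now covers five main areas:

    • Full support for basic operators: scan, join, aggregation, subquery, and others.
    • Operator optimizations for shared storage: shared Shuffle operators, SharedSeqScan, SharedIndexScan, and others. With SharedSeqScan and SharedIndexScan, when a large table is joined with a small one, the small table is handled in a way similar to a replicated table, which reduces broadcast cost and improves performance.
    • Partitioned table support: full support for Hash / Range / List partitioning, plus static pruning of multi-level partitions and dynamic partition pruning. The PolarDB distributed MPP execution engine also supports Partition Wise Join on partitioned tables.
    • Elastic control of parallelism: at the global, table, session, and query levels.
    • Serverless elastic scaling: MPP can be initiated from any node and run on any combination of nodes within the MPP node range; cluster topology information is maintained automatically; and shared-storage, primary/standby, and three-node modes are supported.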

    Parallel DML

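    Based on PolarDB's read/write splitting architecture and the HTAP serverless elastic scaling design, PolarDB Parallel DML supports two modes: one writer with multiple readers, and multiple writers with multiple readers.

    • One writer, multiple readers: multiple read Workers on RO nodes and a single write Worker on the RW node
    • Multiple writers, multiple readers: multiple read Workers on RO nodes and multiple write Workers on the RW node. In this mode, read parallelism and write parallelism are fully decoupled.

    The modes suit different scenarios, and users can choose the PDML mode that fits their workload.

    Index Build Acceleration

    The PolarDB distributed MPP execution engine can be used not only for read-only queries and DML but also to accelerate index builds. OLTP workloads have many indexes, and building a B-Tree index spends roughly 80% of its time sorting and constructing index pages and 20% writing index pages. As the figure below shows, PolarDB uses RO nodes to sort the data with distributed MPP, builds index pages in a pipelined fashion, and uses batched writes to speed up writing index pages.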

    image.png

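    The index build acceleration feature currently supports both regular creation and online (Concurrently) creation of B-Tree indexes.

    Usage

    PolarDB HTAP is suited to light analytical workloads in day-to-day business, such as reconciliation and reporting.

    Using MPP for Analytical Queries

    MPP is disabled by default in the PolarDB PG engine. To use it, set the following parameters:

    • polar_enable_px: whether MPP is enabled. The default is OFF (disabled).
    • polar_px_max_workers_number: the maximum number of MPP Worker processes on a single node; the default is 30. This parameter caps the degree of parallelism on a node: the MPP Worker processes of all sessions on the node cannot exceed this value.
    • polar_px_dop_per_node: the degree of parallelism for parallel queries in the current session; the default is 1, and the recommended value is the total number of CPU cores. If it is set to N, a session starts N MPP Worker processes on each node to handle its MPP work.
    • polar_px_nodes: the read-only nodes that take part in MPP. The default is empty, meaning that all read-only nodes take part. It can be set to a comma-separated list of specific nodes.
    • px_workers: whether MPP applies to a particular table. By default it does not. MPP consumes considerable compute resources on the cluster's nodes, so it is used only for tables that have px_workers set. For example:
      • ALTER TABLE t1 SET(px_workers=1) allows MPP on table t1
      • ALTER TABLE t1 SET(px_workers=-1) prohibits MPP on table t1
      • ALTER TABLE t1 SET(px_workers=0) makes table t1 ignore MPP (the default state)

    The following example uses a simple single-table query to show whether MPP is in effect.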

    -- Create the test table and insert some base data.
    +CREATE TABLE test(id int);
    +INSERT INTO test SELECT generate_series(1,1000000);
    +
    +-- MPP is disabled by default, so the plan for a single-table query is the native PG Seq Scan
    +EXPLAIN SELECT * FROM test;
    +                       QUERY PLAN
    +--------------------------------------------------------
    + Seq Scan on test  (cost=0.00..35.50 rows=2550 width=4)
    +(1 row)
    +

    Enable and use MPP:

    -- Enable MPP on the test table
    +ALTER TABLE test SET (px_workers=1);
    +
    +-- Turn on MPP
    +SET polar_enable_px = on;
    +
    +EXPLAIN SELECT * FROM test;
    +
    +                                  QUERY PLAN
    +-------------------------------------------------------------------------------
    + PX Coordinator 2:1  (slice1; segments: 2)  (cost=0.00..431.00 rows=1 width=4)
    +   ->  Seq Scan on test (scan partial)  (cost=0.00..431.00 rows=1 width=4)
    + Optimizer: PolarDB PX Optimizer
    +(3 rows)
    +

    Configure the set of compute nodes that take part in MPP:

    -- List the names of all current read-only nodes
    +CREATE EXTENSION polar_monitor;
    +
    +SELECT name,host,port FROM polar_cluster_info WHERE px_node='t';
    + name  |   host    | port
    +-------+-----------+------
    + node1 | 127.0.0.1 | 5433
    + node2 | 127.0.0.1 | 5434
    +(2 rows)
    +
    +-- The cluster has 2 read-only nodes, named node1 and node2
    +
    +-- Make only the node1 read-only node take part in MPP
    +SET polar_px_nodes = 'node1';
    +
    +-- Show the nodes that take part in parallel queries
    +SHOW polar_px_nodes;
    + polar_px_nodes
    +----------------
    + node1
    +(1 row)
    +
    +EXPLAIN SELECT * FROM test;
    +                                  QUERY PLAN
    +-------------------------------------------------------------------------------
    + PX Coordinator 1:1  (slice1; segments: 1)  (cost=0.00..431.00 rows=1 width=4)
    +   ->  Partial Seq Scan on test  (cost=0.00..431.00 rows=1 width=4)
    + Optimizer: PolarDB PX Optimizer
    +(3 rows)
    +

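    Using MPP for Partitioned Table Queries

    MPP currently supports the following for partitioned tables:

    • Parallel queries on Range partitions
    • Parallel queries on List partitions
    • Parallel queries on single-column Hash partitions
    • Partition pruning
    • Parallel queries on partitioned tables with indexes
    • Join queries on partitioned tables
    • Parallel queries on multi-level partitions
    -- MPP on partitioned tables is off by default; MPP itself must be enabled first
    +SET polar_enable_px = ON;
    +
    +-- Run the following statement to enable MPP on partitioned tables
    +SET polar_px_enable_partition = true;
    +
    +-- Run the following statement to enable MPP on multi-level partitioned tables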
    +SET polar_px_optimizer_multilevel_partitioning = true;
    +

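    Using MPP to Accelerate Index Creation

    Only B-Tree index builds are currently supported. Index build syntax such as INCLUDE and index column types such as expressions are not yet supported.

    To accelerate index creation with MPP, use the following parameters:

    • polar_px_dop_per_node: the degree of parallelism for MPP-accelerated index builds. The default is 1.
    • polar_px_enable_replay_wait: does not need to be enabled manually in the current session when MPP accelerates an index build. It takes effect automatically so that recently updated table entries are included in the index, which keeps the index complete. After the index is built, the parameter is reset to the database default.
    • polar_px_enable_btbuild: whether MPP-accelerated index creation is enabled. OFF (the default) disables it; ON enables it.
    • polar_bt_write_page_buffer_size: the write I/O strategy used during an index build. The default is 0 (disabled); the unit is blocks, and the maximum is 8192. The recommended value is 4096.
      • When disabled, full index pages are flushed to disk block by block during the build.
      • When enabled, the kernel keeps a buffer of polar_bt_write_page_buffer_size blocks. Index pages that need to be flushed are merged in this buffer and written together, which avoids the overhead of frequent I/O scheduling. This parameter improves index creation performance by a further 20%.
    -- Enable MPP-accelerated index creation.
    +SET polar_px_enable_btbuild = on;
    +
    +-- Create the index with the following syntax
    +CREATE INDEX t ON test(id) WITH(px_build = ON);
    +
    +-- Show the table definition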
    +\d test
    +               Table "public.test"
    + Column |  Type   | Collation | Nullable | Default
    +--------+---------+-----------+----------+---------
    + id     | integer |           |          |
    + id2    | integer |           |          |
    +Indexes:
    +    "t" btree (id) WITH (px_build=finish)
    +
    + + + diff --git a/theory/arch-overview.html b/theory/arch-overview.html new file mode 100644 index 00000000000..bf4a35d9c16 --- /dev/null +++ b/theory/arch-overview.html @@ -0,0 +1,33 @@ + + + + + + + + + Overview | PolarDB for PostgreSQL + + + + +

    Overview

    北侠

    2021/08/24

    35 min

    PolarDB for PostgreSQL (hereafter referred to as PolarDB) is a stable, reliable, scalable, highly available, and secure enterprise-grade database service that is independently developed by Alibaba Cloud to help you increase security compliance and cost-effectiveness. PolarDB is 100% compatible with PostgreSQL. It runs in a proprietary compute-storage separation architecture of Alibaba Cloud to support the horizontal scaling of the storage and computing capabilities.

    PolarDB can process a mix of online transaction processing (OLTP) workloads and online analytical processing (OLAP) workloads in parallel. PolarDB also provides a wide range of innovative multi-model database capabilities to help you process, analyze, and search for diversified data, such as spatio-temporal, GIS, image, vector, and graph data.

    PolarDB supports various deployment architectures. For example, PolarDB supports compute-storage separation, three-node X-Paxos clusters, and local SSDs.

    Issues in Conventional Database Systems

    If you are using a conventional database system and the complexity of your workloads continues to increase, you may face the following challenges as the amount of your business data grows:

    1. The storage capacity is limited by the maximum storage capacity of a single host.
    2. You can increase the read capability of your database system only by creating read-only instances. Each read-only instance must be allocated a specific amount of exclusive storage space, which increases costs.
    3. The time that is required to create a read-only instance increases due to the increase in the amount of data.
    4. The latency of data replication between the primary instance and the secondary instance is high.

    Benefits of PolarDB

    image.png

    To help you resolve the issues that occur in conventional database systems, Alibaba Cloud provides PolarDB. PolarDB runs in a proprietary compute-storage separation architecture of Alibaba Cloud. This architecture has the following benefits:

    1. Scalability: Computing is separated from storage. You can flexibly scale out the computing cluster or the storage cluster based on your business requirements.
    2. Cost-effectiveness: All compute nodes share the same physical storage. This significantly reduces costs.
    3. Easy to use: Each PolarDB cluster consists of one primary node and one or more read-only nodes to support read/write splitting.
    4. Reliability: Data is stored in triplicate, and a backup can be finished in seconds.

    A Guide to This Document

    PolarDB is integrated with various technologies and innovations. This document describes the following two aspects of the PolarDB architecture in sequence: compute-storage separation and hybrid transactional/analytical processing (HTAP). You can find and read the content of your interest with ease.

    • Compute-storage separation is the foundation of the PolarDB architecture. Conventional database systems run in the shared-nothing architecture, in which each instance is allocated independent computing resources and storage resources. As conventional database systems evolve towards compute-storage separation, database engine developers face challenges in managing executors, transactions, and buffers. PolarDB is designed to help you address these challenges.
    • HTAP is designed to support OLAP queries in OLTP scenarios and fully utilize the computing capabilities of multiple read-only nodes. HTAP is achieved by using a shared storage-based massively parallel processing (MPP) architecture. In the shared storage-based MPP architecture, each table or index tree is stored as a whole and is not divided into virtual partitions that are stored on different nodes. This way, you can retain the workflows used in OLTP scenarios. In addition, you can use the shared storage-based MPP architecture without the need to modify your application data.

    This section explains the following two aspects of the PolarDB architecture: compute-storage separation and HTAP.

    Compute-Storage Separation

    image.png

    PolarDB supports compute-storage separation. Each PolarDB cluster consists of a computing cluster and a storage cluster. You can flexibly scale out the computing cluster or the storage cluster based on your business requirements.

    1. If the computing power is insufficient, you can scale out only the computing cluster.
    2. If the storage capacity is insufficient, you can scale out only the storage cluster.

    After the shared-storage architecture is used in PolarDB, the primary node and the read-only nodes share the same physical storage. If the primary node still flushes dirty pages in the same way as conventional database systems, the following issues may occur.

    1. The pages that the read-only nodes read from the shared storage are outdated pages. Outdated pages are pages that are of earlier versions than the versions that are recorded on the read-only nodes.
    2. The pages that the read-only nodes read from the shared storage are future pages. Future pages are pages that are of later versions than the versions that are recorded on the read-only nodes.
    3. When your workloads are switched over from the primary node to a read-only node, the pages that the read-only node reads from the shared storage are outdated pages. In this case, the read-only node needs to read and apply write-ahead logging (WAL) records to restore dirty pages.

    To resolve the first issue, PolarDB must support multiple versions for each page. To resolve the second issue, PolarDB must control the speed at which the primary node flushes dirty pages.

    HTAP

    When read/write splitting is enabled, each individual compute node cannot fully utilize the high I/O throughput that is provided by the shared storage. In addition, you cannot accelerate large queries by adding computing resources. To resolve these issues, PolarDB uses the shared storage-based MPP architecture to accelerate OLAP queries in OLTP scenarios.

    PolarDB supports a complete suite of data types that are used in OLTP scenarios. PolarDB also supports two computing engines, which can process these types of data:

    • Standalone execution engine: processes highly concurrent OLTP queries.
    • Distributed execution engine: processes large OLAP queries.

    image.png

    When the same hardware resources are used, PolarDB delivers performance that is 90% of the performance delivered by a traditional MPP database. PolarDB also provides SQL statement-level scalability. If the computing power of your PolarDB cluster is insufficient, you can allocate more CPU resources to OLAP queries without the need to rearrange data.

    The following sections provide more details about compute-storage separation and HTAP.

    PolarDB: Compute-Storage Separation

    Challenges of Shared Storage

    Compute-storage separation enables the compute nodes of your PolarDB cluster to share the same physical storage. Shared storage brings the following challenges:

    • Data consistency: how to ensure consistency between N copies of data in the computing cluster and 1 copy of data in the storage cluster.
    • Read/write splitting: how to replicate data at a low latency.
    • High availability: how to perform recovery and failover.
    • I/O model: how to optimize the file system from buffered I/O to direct I/O.

    Basic Principles of Shared Storage

    image.png

    The following basic principles of shared storage apply to PolarDB:

    • The primary node can process read requests and write requests. The read-only nodes can process only read requests.
    • Only the primary node can write data to the shared storage. This way, the data that you query on the primary node is the same as the data that you query on the read-only nodes.
    • The read-only nodes apply WAL records to keep the pages in their memory in sync with the pages in the memory of the primary node.
    • The primary node writes WAL records to the shared storage, and only the metadata of the WAL records is replicated to the read-only nodes.
    • The read-only nodes read WAL records from the shared storage and apply the WAL records.

    Data Consistency

    In-memory Page Synchronization in Shared-nothing Architecture

    In a conventional database system, the primary instance and read-only instances each are allocated independent memory resources and storage resources. The primary instance replicates WAL records to the read-only instances, and the read-only instances read and apply the WAL records. These basic principles also apply to replication state machines.

    In-memory Page Synchronization in Shared-storage Architecture

    In a PolarDB cluster, the primary node writes WAL records to the shared storage. The read-only nodes read the most recent WAL records from the shared storage and apply them to keep the pages in their memory in sync with the pages in the memory of the primary node.

    image.png

    1. The primary node flushes the WAL records of a page to write version 200 of the page to the shared storage.
    2. The read-only nodes read and apply the WAL records of the page to update the page from version 100 to version 200.

    Outdated Pages in Shared-storage Architecture

    In the workflow shown in the preceding figure, the new page that the read-only nodes obtain by applying WAL records may be evicted from the buffer pools of the read-only nodes. When you query the page on the read-only nodes again, the read-only nodes read the page from the shared storage. As a result, only the previous version of the page is returned. This previous version is called an outdated page. The following figure shows more details.

    image.png

    1. At T1, the primary node writes a WAL record with a log sequence number (LSN) of 200 to the memory to update Page 1 from version 500 to version 600.
    2. At T1, Page 1 on the read-only nodes is in version 500.
    3. At T2, the primary node sends the metadata of WAL Record 200 to the read-only nodes to notify the read-only nodes of a new WAL record.
    4. At T3, you query Page 1 on the read-only nodes. The read-only nodes read version 500 of Page 1 and WAL Record 200 and apply WAL Record 200 to update Page 1 from version 500 to version 600.
    5. At T4, the read-only nodes remove version 600 of Page 1 because their buffer pools cannot provide sufficient space.
    6. The primary node does not write version 600 of Page 1 to the shared storage. The most recent version of Page 1 in the shared storage is still version 500.
    7. At T5, you query Page 1 on the read-only nodes. The read-only nodes read Page 1 from the shared storage because Page 1 has been removed from the memory of the read-only nodes. In this case, the outdated version 500 of Page 1 is returned.
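
    The timeline above can be replayed with a toy model. This is a sketch under stated assumptions: NaiveReadOnlyNode is a hypothetical class that applies each WAL record once and then forgets it, which is the behavior that leads to the outdated read. Page versions and LSNs are taken from the steps above.

```python
class NaiveReadOnlyNode:
    """Read-only node that applies each WAL record once and then forgets it."""

    def __init__(self, storage, wal):
        self.storage = storage      # page id -> version in shared storage
        self.wal = wal              # LSN -> (page id, version after apply)
        self.buffer_pool = {}       # page id -> version held in memory
        self.pending = []           # LSNs received but not yet applied

    def receive_metadata(self, lsn):
        self.pending.append(lsn)

    def read_page(self, page_id):
        if page_id not in self.buffer_pool:
            # Buffer miss: fall back to the copy in shared storage.
            self.buffer_pool[page_id] = self.storage[page_id]
        remaining = []
        for lsn in self.pending:
            target, new_version = self.wal[lsn]
            if target == page_id:
                self.buffer_pool[page_id] = new_version
            else:
                remaining.append(lsn)
        self.pending = remaining
        return self.buffer_pool[page_id]

    def evict(self, page_id):
        self.buffer_pool.pop(page_id, None)


storage = {1: 500}                  # the primary has not flushed version 600
wal = {200: (1, 600)}
node = NaiveReadOnlyNode(storage, wal)
node.receive_metadata(200)          # T2
version_at_t3 = node.read_page(1)   # T3: 500 plus WAL Record 200
node.evict(1)                       # T4
version_at_t5 = node.read_page(1)   # T5: re-read from shared storage
```

    Here version_at_t3 is 600 and version_at_t5 is 500. The second read is outdated because nothing remembers that WAL Record 200 still applies to Page 1.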

    Solution to Outdated Pages

    When you query a page on the read-only nodes at a specific point in time, the read-only nodes need to read the base version of the page and the WAL records up to that point in time. Then, the read-only nodes need to apply the WAL records one by one in sequence. The following figure shows more details.

    image.png

    1. The metadata of the WAL records of each page is retained in the memory of the read-only nodes.
    2. When you query a page on the read-only nodes, the read-only nodes need to read and apply the WAL records of the page until the read-only nodes obtain the most recent version of the page.
    3. The read-only nodes read and apply WAL records from the shared storage based on the metadata of the WAL records.

    PolarDB needs to maintain an inverted index that stores the mapping from each page to the WAL records of the page. However, the memory capacity of each read-only node is limited. Therefore, this inverted index must be persistently stored. To meet this requirement, PolarDB provides LogIndex. LogIndex is a hash-based index structure that can be persisted to disk.

    1. The WAL receiver processes of the read-only nodes receive the metadata of WAL records from the primary node.
    2. The metadata of each WAL record contains information about which page is updated.
    3. The read-only nodes insert the metadata of each WAL record into a LogIndex structure to generate a LogIndex record. The key of the LogIndex record is the ID of the page that is updated, and the value of the LogIndex record is the LSN of the WAL record.
    4. One WAL record may update multiple pages, for example, when an index block is split. In this case, one WAL record maps to multiple LogIndex records.
    5. The read-only nodes mark each updated page as outdated in their buffer pools. When you query an updated page on the read-only nodes, the read-only nodes can read and apply the WAL records of the page based on the LogIndex records that map the WAL records.
    6. When the memory usage of the read-only nodes reaches a specific threshold, the hash data that is stored in LogIndex structures is asynchronously flushed from the memory to the disk.
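
    The mapping described in these steps can be sketched as a small in-memory structure. This is an illustration only: the LogIndex class and read_page function below are hypothetical, persistence to disk is not modeled, and pages are represented as (page LSN, version) pairs for simplicity.

```python
from collections import defaultdict


class LogIndex:
    """Maps each page ID to the LSNs of the WAL records that modify it."""

    def __init__(self):
        self.index = defaultdict(list)   # page id -> LSNs in arrival order

    def insert(self, lsn, page_ids):
        # One WAL record can touch several pages (for example a block split),
        # so it may produce several LogIndex entries.
        for page_id in page_ids:
            self.index[page_id].append(lsn)

    def lsns_for(self, page_id, after_lsn, up_to_lsn):
        return [lsn for lsn in self.index.get(page_id, [])
                if after_lsn < lsn <= up_to_lsn]


def read_page(page_id, storage, wal, log_index, applied_up_to):
    """Rebuild the requested version from the base page plus its WAL records."""
    page_lsn, version = storage[page_id]
    for lsn in log_index.lsns_for(page_id, page_lsn, applied_up_to):
        version = wal[lsn][page_id]
        page_lsn = lsn
    return page_lsn, version


storage = {1: (100, 500), 2: (100, 10)}          # page id -> (LSN, version)
wal = {200: {1: 600}, 300: {1: 700, 2: 11}}      # LSN -> {page id: new version}
log_index = LogIndex()
log_index.insert(200, [1])
log_index.insert(300, [1, 2])
```

    Because the lookup depends only on the base page and the index, read_page(1, storage, wal, log_index, 200) yields version 600 no matter how often the page has been evicted in between.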

    image.png

    LogIndex helps prevent outdated pages and enables the read-only nodes to run in lazy log apply mode. In the lazy log apply mode, the read-only nodes apply only the metadata of the WAL records for dirty pages.

    Future Pages in Shared-storage Architecture

    The read-only nodes may return future pages, whose versions are later than the versions that are recorded on the read-only nodes. The following figure shows more details.

    image.png

    1. At T1, the primary node updates Page 1 twice from version 500 to version 700. Two WAL records are generated during the update process. The LSN of one WAL record is 200, and the LSN of the other WAL record is 300. At this time, the primary node has not flushed Page 1, so Page 1 is still in version 500 in the shared storage and on the read-only nodes.
    2. At T2, the primary node sends WAL Record 200 to the read-only nodes.
    3. At T3, the read-only nodes apply WAL Record 200 to update Page 1 to version 600. At this time, the read-only nodes have not read or applied WAL Record 300.
    4. At T4, the primary node writes version 700 of Page 1 to the shared storage. At the same time, Page 1 is removed from the buffer pools of the read-only nodes.
    5. At T5, the read-only nodes attempt to read Page 1 again. Page 1 cannot be found in the buffer pools of the read-only nodes. Therefore, the read-only nodes obtain version 700 of Page 1 from the shared storage. Version 700 of Page 1 is a future page to the read-only nodes because the read-only nodes have not read or applied WAL Record 300.
    6. If some of the pages that the read-only nodes obtain from the shared storage are future pages and some are normal pages, data inconsistencies may occur. For example, after an index block is split into two blocks, one of the two pages that the read-only nodes read may be a normal page and the other may be a future page. In this case, the B+ tree structure of the index is corrupted.

    Solutions to Future Pages

    The read-only nodes apply WAL records at high speeds in lazy apply mode. However, the speeds may still be lower than the speed at which the primary node flushes dirty pages. If the primary node flushes dirty pages faster than the read-only nodes apply WAL records, future pages are returned. To prevent future pages, PolarDB must ensure that the primary node does not flush dirty pages faster than the read-only nodes apply WAL records. The following figure shows more details.

    image.png

    1. The read-only nodes have applied WAL records up to the LSN that is generated at T4.
    2. When the primary node flushes dirty pages to the shared storage, it sorts all dirty pages by LSN and flushes only the pages that were last modified at or before T4.
    3. The position of the LSN that is generated at T4 is defined as the consistency position.
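
    The flush limit can be sketched as follows. This is a minimal illustration, assuming that each dirty page is tagged with the LSN of the WAL record that last modified it and that the primary node knows the LSN up to which each read-only node has applied WAL records. Both function names are hypothetical.

```python
def consistency_position(applied_lsns):
    """The slowest read-only node decides how far the primary may flush."""
    return min(applied_lsns)


def pages_safe_to_flush(dirty_pages, applied_lsns):
    """Return dirty page IDs, ordered by LSN, that no reader would see as future pages.

    dirty_pages: page id -> LSN of the WAL record that last modified the page
    applied_lsns: LSN up to which each read-only node has applied WAL records
    """
    limit = consistency_position(applied_lsns)
    ordered = sorted(dirty_pages.items(), key=lambda item: item[1])
    return [page_id for page_id, lsn in ordered if lsn <= limit]
```

    For example, with dirty pages {1: 300, 2: 150, 3: 200} and read-only nodes that have applied up to LSN 200 and 250, only pages 2 and 3 are flushed. Page 1 stays in memory until every read-only node has applied LSN 300.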

    Low-latency Replication

    Issues of Conventional Streaming Replication

    1. The I/O loads on the log synchronization link are heavy, and a large amount of data is transmitted over the network.
    2. When the read-only nodes process I/O-bound workloads or CPU-bound workloads, they read pages and modify the pages in their buffer pools at low speeds.
    3. When file- and data-related DDL operations attempt to acquire locks on specific objects, blocking exceptions may occur. As a result, the operations are run at low speeds.
    4. When the read-only nodes process highly concurrent queries, transaction snapshots are taken at low speeds. The following figure shows more details.

    image.png

    1. The primary node writes WAL records to its local file system.
    2. The WAL sender process of the primary node reads and sends the WAL records to the read-only nodes.
    3. The WAL receiver processes of the read-only nodes receive and write the WAL records to the local file systems of the read-only nodes.
    4. The read-only nodes read the WAL records, write the updated pages to their buffer pools, and then apply the WAL records in the memory.
    5. The primary node flushes the WAL records to the shared storage.

    The full path is long, and the latency on the read-only nodes is high. This may cause an imbalance between the read loads and write loads over the read/write splitting link.

    Optimization Method 1: Replicate Only the Metadata of WAL Records

    The read-only nodes can read WAL records from the shared storage. Therefore, the primary node can remove the payloads of WAL records and send only the metadata of WAL records to the read-only nodes. This alleviates the pressure on network transmission and reduces the I/O loads on critical paths. The following figure shows more details.

    1. Each WAL record consists of three parts: header, page ID, and payload. The header and the page ID comprise the metadata of a WAL record.
    2. The primary node replicates only the metadata of WAL records to the read-only nodes.
    3. The read-only nodes read WAL records from the shared storage based on the metadata of the WAL records.
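
    The split between metadata and payload can be sketched as follows. The WalRecord class is hypothetical and the byte sizes in the example are illustrative only. They are not measurements and are not meant to reproduce the reduction reported in this document.

```python
from dataclasses import dataclass


@dataclass
class WalRecord:
    header: bytes
    page_id: bytes
    payload: bytes

    def metadata(self):
        # Only this part is sent to the read-only nodes.
        return self.header + self.page_id

    def full(self):
        # The whole record is written to the shared storage.
        return self.header + self.page_id + self.payload


def bytes_saved_ratio(records):
    """Fraction of replication traffic avoided by sending metadata only."""
    full = sum(len(record.full()) for record in records)
    meta = sum(len(record.metadata()) for record in records)
    return 1 - meta / full


example = [WalRecord(b"h" * 24, b"p" * 16, b"d" * 960)]
```

    With these made-up sizes, 40 of 1,000 bytes are sent, so bytes_saved_ratio(example) is about 0.96. The actual ratio depends on the record mix of the workload.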

    image.png

    This optimization method significantly reduces the amount of data that needs to be transmitted between the primary node and the read-only nodes. The amount of data that needs to be transmitted decreases by 98%, as shown in the following figure.

    image.png

    Optimization Method 2: Optimize the Log Apply of WAL Records

    In conventional database systems, the log apply process reads a large number of pages, applies WAL records to these pages one by one, and then flushes the updated pages to the disk. This work sits on the critical path of user read I/O. With compute-storage separation, PolarDB can avoid it: if a page that a WAL record modifies is not in the buffer pool of a read-only node, the node generates no I/O loads and only records a LogIndex record.

    The following I/O operations that are performed by log apply processes can be offloaded to session processes:

    1. Data page-related I/O operations
    2. I/O operations to apply WAL records
    3. I/O operations to apply multiple versions of pages based on LogIndex records

    In the example shown in the following figure, when the log apply process of a read-only node applies the metadata of a WAL record of a page:

    image.png

    1. If the page is not in the memory, only the LogIndex record that maps the WAL record is recorded.
    2. If the page is in the memory, the page is marked as outdated and the LogIndex record that maps the WAL record is recorded. The log apply process is then complete.
    3. When a session process reads the page, the session process loads the page into the buffer pool and applies the WAL records that the LogIndex records map, which brings the page to the most recent version.
    4. Major I/O operations are no longer run by a single log apply process. These operations are offloaded to multiple user processes.
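
    The division of work between the log apply process and the session processes can be sketched as follows. LazyReplayNode is a hypothetical class, pages are modeled as [LSN, version] pairs, and each page read or WAL record read by a session counts as one I/O. Note that apply_metadata never touches the shared storage or the WAL payload.

```python
class LazyReplayNode:
    def __init__(self, storage, wal):
        self.storage = storage      # page id -> (LSN, version) in shared storage
        self.wal = wal              # LSN -> (page id, version after apply)
        self.buffer_pool = {}       # page id -> [LSN, version]
        self.outdated = set()       # buffered pages with unapplied WAL records
        self.log_index = {}         # page id -> list of LSNs
        self.session_ios = 0        # I/O performed by session processes

    def apply_metadata(self, lsn, page_id):
        """Log apply process: record the LogIndex entry, do no page I/O."""
        self.log_index.setdefault(page_id, []).append(lsn)
        if page_id in self.buffer_pool:
            self.outdated.add(page_id)

    def session_read(self, page_id):
        """Session process: load the page if needed, then apply its WAL records."""
        if page_id not in self.buffer_pool:
            self.buffer_pool[page_id] = list(self.storage[page_id])
            self.session_ios += 1               # page read
        page = self.buffer_pool[page_id]
        for lsn in self.log_index.get(page_id, []):
            if lsn > page[0]:
                self.session_ios += 1           # WAL record read
                page[0], page[1] = lsn, self.wal[lsn][1]
        self.outdated.discard(page_id)
        return page[1]
```

    A page that no session ever reads costs the read-only node no page I/O at all, however many WAL records modify it.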

    This optimization method significantly reduces the log apply latency and increases the log apply speed by 30 times compared with Amazon Aurora.

    image.png

    Optimization Method 3: Optimize the Log Apply of DDL Locks

    When the primary node runs a DDL operation such as DROP TABLE to modify a table, the primary node acquires an exclusive DDL lock on the table. The exclusive DDL lock is replicated to the read-only nodes along with WAL records. The read-only nodes apply the WAL records to acquire the exclusive DDL lock on the table. This ensures that the table cannot be deleted by the primary node when a read-only node is reading the table. Only one copy of the table is stored in the shared storage.

    When the log apply process of a read-only node applies the exclusive DDL lock, the read-only node may require a long period of time to acquire the exclusive DDL lock on the table. You can optimize the critical path of the log apply process by offloading the task of acquiring the exclusive DDL lock to other processes.

    image.png

    This optimization method ensures that the critical path of the log apply process of a read-only node is not blocked even if the log apply process needs to wait for the release of an exclusive DDL lock.
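
    The effect on the critical path can be illustrated with a tick-based model. This is a simplified sketch: replay_finish_tick is a hypothetical function, every record costs one tick, and the model ignores the fact that later records on the locked table still have to wait for the lock.

```python
def replay_finish_tick(records, lock_free_at, offload):
    """Return the tick at which the log apply process finishes all records.

    records: list of (kind, name) pairs, where kind is "page" or "lock"
    lock_free_at: tick at which a conflicting reader releases the table
    offload: whether lock acquisition is handed to another process
    """
    tick = 0
    for kind, _name in records:
        if kind == "lock" and not offload:
            # The log apply process itself blocks until the lock is free.
            tick = max(tick, lock_free_at)
        # With offload=True, a helper process waits for the lock instead,
        # so the log apply process only pays for the hand-off.
        tick += 1
    return tick


records = [("page", "a"), ("lock", "t"), ("page", "b"), ("page", "c")]
```

    With a reader that holds the table until tick 50, replay_finish_tick(records, 50, offload=False) returns 53, and replay_finish_tick(records, 50, offload=True) returns 4.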

    image.png

    The three optimization methods in combination significantly reduce replication latency and have the following benefits:

    • Read/write splitting: Loads are balanced, which allows PolarDB to deliver user experience that is comparable to Oracle Real Application Clusters (RAC).
    • High availability: The time that is required for failover is reduced.
    • Stability: The number of future pages is minimized, and fewer or even no page snapshots need to be taken.

    Recovery Optimization

    Background Information

    If the read-only nodes apply WAL records at low speeds, your PolarDB cluster may require a long period of time to recover from exceptions such as out of memory (OOM) errors and unexpected crashes. When the direct I/O model is used for the shared storage, the severity of this issue increases.

    image.png

    Lazy Recovery

    The preceding sections explain how LogIndex enables the read-only nodes to apply WAL records in lazy log apply mode. In general, the recovery process of the primary node after a restart is the same as the process in which the read-only nodes apply WAL records. In this sense, the lazy log apply mode can also be used to accelerate the recovery of the primary node.

    image.png

    1. The primary node begins to apply WAL records in lazy log apply mode one by one starting from a specific checkpoint.
    2. After the primary node applies all LogIndex records, the log apply is complete.
    3. After the recovery is complete, the primary node starts to run.
    4. The actual log apply workloads are offloaded to the session process that is started after the primary node restarts.

    The example in the following figure shows how the optimized recovery method significantly reduces the time that is required to apply 500 MB of WAL records.

    image.png

    Persistent Buffer Pool

    After the primary node recovers, a session process may need to apply the pages that the session process reads. When a session process is applying pages, the primary node responds at low speeds for a short period of time. To resolve this issue, PolarDB does not delete pages from the buffer pool of the primary node if the primary node restarts or unexpectedly crashes.

    image.png

    The shared memory of the database engine consists of the following two parts:

    1. One part is used to store global structures and ProcArray structures.
    2. The other part is used to store buffer pool structures. The buffer pool is allocated as a specific amount of named shared memory. Therefore, the buffer pool remains valid after the primary node restarts. However, global structures need to be reinitialized after the primary node restarts.

    image.png

    Not all pages in the buffer pool of the primary node can be reused. For example, if a process acquires an exclusive lock on a page before the primary node restarts and then unexpectedly crashes, no other processes can release the exclusive lock on the page. Therefore, after the primary node unexpectedly crashes or restarts, it needs to traverse all pages in its buffer pool to identify and remove the pages that cannot be reused. In addition, the recycling of buffer pools depends on Kubernetes.
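
    The traversal after a restart can be sketched as a simple filter. The Buffer class and its fields are hypothetical. The only rule taken from the text is that a page whose exclusive lock was still held at the time of the crash cannot be reused.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    page_id: int
    exclusive_lock_held: bool   # lock state left behind by the crashed process


def split_reusable(buffer_pool):
    """Return (reusable page IDs, discarded page IDs) after a restart."""
    reusable, discarded = [], []
    for buf in buffer_pool:
        if buf.exclusive_lock_held:
            # Nobody is left to release this lock, so the page is dropped
            # and will be rebuilt from shared storage and WAL records.
            discarded.append(buf.page_id)
        else:
            reusable.append(buf.page_id)
    return reusable, discarded
```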

    This optimized buffer pool mechanism ensures the stable performance of your PolarDB cluster before and after a restart.

    image.png

    PolarDB HTAP

    The shared storage of PolarDB is organized as a storage pool. When read/write splitting is enabled, the theoretical I/O throughput that is supported by the shared storage is infinite. However, large queries can be run only on individual compute nodes, and the CPU, memory, and I/O specifications of a single compute node are limited. Therefore, a single compute node cannot fully utilize the high I/O throughput that is supported by the shared storage or accelerate large queries by acquiring more computing resources. To resolve these issues, PolarDB uses the shared storage-based MPP architecture to accelerate OLAP queries in OLTP scenarios.

    Basic Principles of HTAP

    In a PolarDB cluster, the physical storage is shared among all compute nodes. Therefore, you cannot use the method of scanning tables in conventional MPP databases to scan tables in PolarDB clusters. PolarDB supports MPP on standalone execution engines and provides optimized shared storage. This shared storage-based MPP architecture is the first architecture of its kind in the industry. We recommend that you familiarize yourself with the following basic principles of this architecture before you use PolarDB:

    1. The Shuffle operator masks the data distribution.
    2. The ParallelScan operator masks the shared storage.

    image.png

    The preceding figure shows an example.

    1. Table A and Table B are joined and aggregated.
    2. Table A and Table B are still individual tables in the shared storage. These tables are not physically partitioned.
    3. Four types of scan operators are redesigned to scan tables in the shared storage as virtual partitions.

    Distributed Optimizer

    The GPORCA optimizer is extended to provide a set of transformation rules that can recognize shared storage. These rules enable PolarDB to explore the plan search space that is specific to shared storage. For example, PolarDB can scan a table as a whole or as different virtual partitions. This is a major difference between shared storage-based MPP and conventional MPP.

    The modules in gray in the upper part of the following figure are modules of the database engine. These modules enable the database engine of PolarDB to adapt to the GPORCA optimizer.

    The modules in the lower part of the following figure comprise the GPORCA optimizer. Among these modules, the modules in gray are extended modules, which enable the GPORCA optimizer to communicate with the shared storage of PolarDB.

    image.png

    Parallelism of Operators

    Four types of operators in PolarDB require parallelism. This section describes how to enable parallelism for operators that are used to run sequential scans. To fully utilize the I/O throughput that is supported by the shared storage, PolarDB splits each table into logical units during a sequential scan. Each unit contains 4 MB of data. This way, PolarDB can distribute I/O loads to different disks, and the disks can simultaneously scan data to accelerate the sequential scan. In addition, each read-only node needs to scan only part of a table rather than the whole table. As a result, the total size of the tables that can be cached is the combined size of the buffer pools of all read-only nodes.
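
    The splitting step can be sketched as follows. The 4 MB unit size comes from the text. The 8 KB page size is an assumption (the PostgreSQL default block size), and the round-robin assignment is only one possible static policy. Both function names are hypothetical.

```python
UNIT_BYTES = 4 * 1024 * 1024   # size of one logical scan unit
PAGE_BYTES = 8 * 1024          # assumption: default PostgreSQL block size


def scan_units(table_pages):
    """Split a table of table_pages pages into [start, end) page ranges."""
    pages_per_unit = UNIT_BYTES // PAGE_BYTES
    return [(start, min(start + pages_per_unit, table_pages))
            for start in range(0, table_pages, pages_per_unit)]


def assign_round_robin(units, num_workers):
    """Statically spread the units over the workers."""
    plan = [[] for _ in range(num_workers)]
    for position, unit in enumerate(units):
        plan[position % num_workers].append(unit)
    return plan
```

    A table of 1,200 pages yields the units (0, 512), (512, 1024), and (1024, 1200). With two workers, the first worker scans the first and third units and the second worker scans the second unit.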

    image.png

    Parallelism has the following benefits, as shown in the following figure:

    1. You can increase scan performance by 30 times by creating read-only nodes.
    2. You can reduce the time that is required for a scan from 37 minutes to 3.75 seconds by enabling the buffering feature.

    image.png

    Solve the Issue of Data Skew

    Data skew is a common issue in conventional MPP:

    1. In PolarDB, large objects reference TOAST tables by using heap tables. You cannot balance loads even if you shard TOAST tables or heap tables.
    2. In addition, the transactions, buffer pools, network connections, and I/O loads of the read-only nodes jitter.
    3. The preceding issues cause long-tail processes.

    image.png

    1. The coordinator node consists of two parts: DataThread and ControlThread.
    2. DataThread collects and aggregates tuples.
    3. ControlThread controls the scan progress of each scan operator.
    4. A worker thread that scans data at a high speed can scan more logical data shards.
    5. The affinity of buffers must be considered.

    Although a scan task is dynamically distributed, we recommend that you maintain the affinity of buffers as much as possible. In addition, the context of each operator is stored in the private memory of the worker threads. The coordinator node does not store the information about specific tables.
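
    The difference between static and dynamic distribution can be sketched with a small simulation. This is an illustration under stated assumptions: each worker has a fixed cost per logical unit, an idle worker asks for the next unit as soon as it finishes, and buffer affinity is not modeled. Both function names are hypothetical.

```python
import heapq


def dynamic_scan(num_units, cost_per_unit):
    """Hand out units one at a time to whichever worker becomes idle first.

    Returns (units scanned by each worker, time at which the scan finishes).
    """
    scanned = [0] * len(cost_per_unit)
    idle = [(0, worker) for worker in range(len(cost_per_unit))]
    heapq.heapify(idle)                      # (time worker is idle, worker id)
    finish = 0
    for _ in range(num_units):
        now, worker = heapq.heappop(idle)
        done = now + cost_per_unit[worker]
        scanned[worker] += 1
        finish = max(finish, done)
        heapq.heappush(idle, (done, worker))
    return scanned, finish


def static_scan_finish(num_units, cost_per_unit):
    """Give every worker an equal share up front and wait for the slowest."""
    workers = len(cost_per_unit)
    finish = 0
    for worker, cost in enumerate(cost_per_unit):
        share = num_units // workers + (1 if worker < num_units % workers else 0)
        finish = max(finish, share * cost)
    return finish
```

    With 12 units and two workers whose costs per unit are 1 and 3, the dynamic policy lets the fast worker scan 9 units and finishes at time 9, whereas the static split finishes at time 18.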

    In the example shown in the following table, PolarDB uses static sharding to shard large objects. During the static sharding process, data skew occurs, but the performance of dynamic scanning can still linearly increase.

    image.png

    SQL Statement-level Scalability

    Data sharing helps deliver ultimate scalability in cloud-native environments. The full path of the coordinator node involves various modules, and PolarDB can store the external dependencies of these modules to the shared storage. In addition, the full path of a worker thread involves a number of operational parameters, and PolarDB can synchronize these parameters from the coordinator node over the control path. This way, the coordinator node and the worker thread are stateless.

    image.png

    The following conclusions are made based on the preceding analysis:

    1. Any read-only node that receives a SQL connection can function as the coordinator node. Therefore, the performance of PolarDB is no longer limited due to the availability of only a single coordinator node.
    2. Each SQL statement can start any number of worker threads on any compute node. This increases the computing power and allows you to schedule your workloads in a more flexible manner. You can configure PolarDB to simultaneously run different kinds of workloads on different compute nodes.

    image.png

    Transactional Consistency

    The log apply wait mechanism and the global snapshot mechanism are used to ensure data consistency among multiple compute nodes. The log apply wait mechanism ensures that all worker threads can obtain the most recent version of each page. The global snapshot mechanism ensures that a unified version of each page can be selected.
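
    The two mechanisms can be sketched as two checks that every worker performs with the same query context. This is a deliberately simplified model, not the actual implementation: QueryContext, read_lsn, and snapshot_xmax are hypothetical names, and the visibility rule ignores in-progress transactions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryContext:
    read_lsn: int        # LSN observed by the coordinator when the query starts
    snapshot_xmax: int   # transaction IDs at or above this value are invisible


def worker_can_start(ctx, node_applied_lsn):
    """Log apply wait: a worker scans only after its node has caught up."""
    return node_applied_lsn >= ctx.read_lsn


def visible(ctx, row_xid):
    """Global snapshot: every worker filters rows with the same snapshot."""
    return row_xid < ctx.snapshot_xmax
```

    A worker on a node that has applied WAL records only up to LSN 250 waits when the query context carries a read LSN of 300. Once it starts, it sees exactly the rows that every other worker sees.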

    image.png

    TPC-H Performance: Speedup

    image.png

    A total of 1 TB of data is used for TPC-H testing. First, run 22 SQL statements in a PolarDB cluster and in a conventional database system. The PolarDB cluster supports distributed parallelism, and the conventional database system supports standalone parallelism. The test result shows that the PolarDB cluster executes three SQL statements at speeds that are 60 times higher and 19 statements at speeds that are 10 times higher than the conventional database system.

    image.png

    image.png

    Then, run a TPC-H test by using a distributed execution engine. The test result shows that the speed at which each of the 22 SQL statements runs linearly increases as the number of cores increases from 16 to 128.

    TPC-H Performance: Comparison with Traditional MPP Database

    When 16 nodes are configured, PolarDB delivers 90% of the performance of an MPP-based database.

    image.png

    image.png

    As mentioned earlier, the distributed execution engine of PolarDB supports scalability, and data in PolarDB does not need to be redistributed. When the degree of parallelism (DOP) is 8, PolarDB delivers 5.6 times the performance of an MPP-based database.

    Index Creation Accelerated by Distributed Execution

    A large number of indexes are created in OLTP scenarios. The workloads that you run to create these indexes are divided into two parts: 80% of the workloads are run to sort and create index pages, and 20% of the workloads are run to write index pages. Distributed execution accelerates the process of sorting indexes and supports the batch writing of index pages.

    image.png

    Distributed execution accelerates the creation of indexes by four to five times.

    image.png

    Multi-model Spatio-temporal Database Accelerated by Distributed, Parallel Execution

    PolarDB is a multi-model database service that supports spatio-temporal data. PolarDB runs CPU-bound workloads and I/O-bound workloads. These workloads can be accelerated by distributed execution. The shared storage of PolarDB supports scans on shared R-tree indexes.

    image.png

    • Data volume: 400 million data records, which amount to 500 GB in total
    • Configuration: 5 read-only nodes, each of which provides 16 cores and 128 GB of memory
    • Performance:
      • Linearly increases with the number of cores.
      • Increases by 71 times when the number of cores increases from 16 to 80.

    image.png

    Summary

    This document describes the crucial technologies that are used in the PolarDB architecture:

    • Compute-storage separation
    • HTAP

    More technical details about PolarDB will be discussed in other documents. For example, how the shared storage-based query optimizer runs, how LogIndex achieves high performance, how PolarDB flashes your data back to a specific point in time, how MPP can be implemented in the shared storage, and how PolarDB works with X-Paxos to ensure high availability.

    + + + diff --git a/theory/buffer-management.html b/theory/buffer-management.html new file mode 100644 index 00000000000..c18343427cd --- /dev/null +++ b/theory/buffer-management.html @@ -0,0 +1,37 @@ + + + + + + + + + Buffer Management | PolarDB for PostgreSQL + + + + +

    Buffer Management

    Background Information

    In a conventional database system, the primary instance and the read-only instances are each allocated a specific amount of exclusive storage space. The read-only instances can apply write-ahead logging (WAL) records and can read and write data to their own storage. A PolarDB cluster consists of a primary node and at least one read-only node. The primary node and the read-only nodes share the same physical storage. The primary node can read and write data to the shared storage. The read-only nodes can read data from the shared storage by applying WAL records but cannot write data to the shared storage. The following figure shows the architecture of a PolarDB cluster.

    image.png

    The read-only nodes may read two types of pages from the shared storage:

    • Future pages: The pages that the read-only nodes read from the shared storage incorporate changes that are made after the apply log sequence numbers (LSNs) of the pages. For example, the read-only nodes have applied all WAL records up to the WAL record with an LSN of 200 to a page, but the change described by the most recent WAL record with an LSN of 300 has been incorporated into the same page in the shared storage. These pages are called future pages.

      image.png

    • Outdated pages: The pages that the read-only nodes read from the shared storage do not incorporate changes that are made before the apply LSNs of the pages. For example, the read-only nodes have applied all WAL records up to the most recent WAL record with an LSN of 200 to a page, but the change described by that WAL record has not been incorporated into the same page in the shared storage. These pages are called outdated pages.

      image.png

    Each read-only node expects to read pages that incorporate only the changes made up to the apply LSNs of the pages on that read-only node. If the read-only nodes read outdated pages or future pages from the shared storage, you can take the following measures:

    • To prevent outdated pages, configure the read-only nodes to apply all omitted WAL records up to the apply LSN of each page. A page may have different apply LSNs on different read-only nodes.
    • To prevent future pages, configure the primary node to identify how many WAL records are applied on the read-only nodes at the time when the primary node writes data to the shared storage. This is the focus of buffer management.
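
    As a rough illustration of the two cases above, the following sketch classifies a page read from the shared storage on a read-only node. The function name `classify_page`, the integer LSNs, and the `page_record_lsns` list are assumptions made for this example and are not PolarDB identifiers.

```python
from typing import List


def classify_page(page_lsn: int, apply_lsn: int, page_record_lsns: List[int]) -> str:
    """Classify a page read from shared storage on a read-only node.

    page_lsn         -- LSN of the last change already incorporated in the stored page
    apply_lsn        -- apply LSN of the page on this read-only node
    page_record_lsns -- LSNs of the WAL records generated for this page (hypothetical input)
    """
    if page_lsn > apply_lsn:
        # The stored page already contains changes the node has not applied yet.
        return "future"
    if any(page_lsn < lsn <= apply_lsn for lsn in page_record_lsns):
        # The node has applied records that the stored page does not contain.
        return "outdated"
    return "expected"


if __name__ == "__main__":
    print(classify_page(300, 200, [100, 200, 300]))
    print(classify_page(100, 200, [100, 200]))
```

    The first call mirrors the future-page example (apply LSN 200, stored page at LSN 300). The second call shows a stored page that is missing the change at LSN 200.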

    Buffer management involves consistent LSNs. For a specific page, each read-only node needs to apply only the WAL records that are generated between the consistent LSN and the apply LSN. This reduces the time that is required to apply WAL records on the read-only nodes.

    Terms

    • Buffer Pool: A buffer pool is an amount of memory that is used to store frequently accessed data. In most cases, data is cached in the buffer pool as pages. In a PolarDB cluster, each compute node has its own buffer pool.
    • LSN: Each LSN is the unique identifier of a WAL record. LSNs globally increment.
    • Apply LSN: The apply LSN of a page on a read-only node marks the most recent WAL record that is applied on the read-only node for the page. Also called Replay LSN.
    • Oldest Apply LSN: The oldest apply LSN of a page is the smallest apply LSN among the apply LSNs of the page on all the read-only nodes.

    Flushing Control

    PolarDB provides a flushing control mechanism to prevent the read-only nodes from reading future pages from the shared storage. Before the primary node writes a page to the shared storage, the primary node checks whether all the read-only nodes have applied the most recent WAL record of the page.

    image.png

    The pages in the buffer pool of the primary node are divided into the following two types based on whether the pages incorporate the changes that are made after the apply LSNs of the pages: pages that can be flushed to the shared storage and pages that cannot be flushed to the shared storage. This categorization is based on the following LSNs:

    • Latest LSN: The latest LSN of a buffer marks the most recent WAL record that modified the page in the buffer pool of the primary node.
    • Oldest apply LSN: The oldest apply LSN of a page is the smallest apply LSN among the apply LSNs of the page on all the read-only nodes.

    The primary node determines whether to flush a dirty page to the shared storage based on the following rules:

    if buffer latest lsn <= oldest apply lsn
        flush buffer
    else
        do not flush buffer
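
    The rule above can be sketched as a small runnable check. The names `oldest_apply_lsn` and `can_flush`, and the dictionary of per-node apply LSNs, are assumptions for illustration only.

```python
from typing import Dict


def oldest_apply_lsn(apply_lsns_by_node: Dict[str, int]) -> int:
    """Smallest apply LSN among the read-only nodes."""
    return min(apply_lsns_by_node.values())


def can_flush(buffer_latest_lsn: int, apply_lsns_by_node: Dict[str, int]) -> bool:
    """A dirty buffer may be flushed only if every read-only node has
    already applied the buffer's most recent change."""
    return buffer_latest_lsn <= oldest_apply_lsn(apply_lsns_by_node)


if __name__ == "__main__":
    nodes = {"ro1": 250, "ro2": 200}  # hypothetical apply LSNs
    print(can_flush(200, nodes))
    print(can_flush(300, nodes))
```

    With these sample values, a buffer whose latest LSN is 200 can be flushed, whereas a buffer whose latest LSN is 300 would become a future page for `ro2` and must wait.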

    Consistent LSNs

    To apply the WAL records of a page up to a specified LSN, each read-only node manages the mapping between the page and the LSNs of all WAL records that are generated for the page. This mapping is stored as a LogIndex. A LogIndex is used as a hash table that can be persistently stored. When a read-only node requests a page, the read-only node traverses the LogIndex of the page to obtain the LSNs of all WAL records that need to be applied. Then, the read-only node applies the WAL records in sequence to generate the most recent version of the page.

    image.png

    For a specific page, more changes mean more LSNs and a longer period of time required to apply WAL records. To minimize the number of WAL records that need to be applied for each page, PolarDB provides consistent LSNs.

    After all changes that are made up to the consistent LSN of a page are written to the shared storage, the page is persistently stored. The primary node sends the write LSN and consistent LSN of the page to each read-only node, and each read-only node sends the apply LSN of the page to the primary node. The read-only nodes do not need to apply the WAL records that are generated before the consistent LSN of the page. Therefore, all LSNs that are smaller than the consistent LSN can be removed from the LogIndex of the page. This reduces the number of WAL records that the read-only nodes need to apply. This also reduces the storage space that is occupied by LogIndex records.
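
    A minimal sketch of how a consistent LSN narrows the work on a read-only node is shown below. It assumes that the LSNs of a page are kept in a sorted list and that the record at the consistent LSN itself is retained. Both points, and the function names, are assumptions of this example rather than details confirmed by this document.

```python
from bisect import bisect_left, bisect_right
from typing import List


def lsns_to_apply(page_lsns: List[int], consistent_lsn: int, apply_lsn: int) -> List[int]:
    """Return the LSNs in [consistent_lsn, apply_lsn] from a sorted per-page list."""
    lo = bisect_left(page_lsns, consistent_lsn)
    hi = bisect_right(page_lsns, apply_lsn)
    return page_lsns[lo:hi]


def prune(page_lsns: List[int], consistent_lsn: int) -> List[int]:
    """Drop every LSN that is smaller than the consistent LSN."""
    return page_lsns[bisect_left(page_lsns, consistent_lsn):]


if __name__ == "__main__":
    lsns = [100, 150, 200, 250, 300]  # hypothetical LSNs recorded for one page
    print(lsns_to_apply(lsns, consistent_lsn=150, apply_lsn=250))
    print(prune(lsns, consistent_lsn=200))
```

    The larger the consistent LSN, the shorter both results become, which corresponds to fewer WAL records to apply and fewer LogIndex records to keep.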

    Flush Lists

    PolarDB holds a specific state for each buffer in the memory. The state of a buffer in the memory is represented by the LSN that marks the first change to the buffer. This LSN is called the oldest LSN. The consistent LSN of a page is the smallest oldest LSN among the oldest LSNs of all buffers for the page.

    A conventional method of obtaining the consistent LSN of a page requires the primary node to traverse the LSNs of all buffers for the page in the buffer pool. This method causes significant CPU overhead and a long traversal process. To address these issues, PolarDB uses a flush list, in which all dirty pages in the buffer pool are sorted in ascending order based on their oldest LSNs. The flush list helps you reduce the time complexity of obtaining consistent LSNs to O(1).

    image.png

    When a buffer is updated for the first time, the buffer is labeled as dirty. PolarDB inserts the buffer into the flush list and generates an oldest LSN for the buffer. When the buffer is flushed to the shared storage, the label is removed.

    To efficiently move the consistent LSN of each page towards the head of the flush list, PolarDB runs a BGWRITER process to traverse all buffers in the flush list in chronological order and flush early buffers to the shared storage one by one. After a buffer is flushed to the shared storage, the consistent LSN is moved one position forward towards the head of the flush list. In the example shown in the preceding figure, if the buffer with an oldest LSN of 10 is flushed to the shared storage, the buffer with an oldest LSN of 30 is moved one position forward towards the head of the flush list. LSN 30 becomes the consistent LSN.
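
    The flush list behavior described above can be sketched as follows. The class `FlushList` and its methods are hypothetical. An ordered dictionary stands in for the real list, which works here because LSNs increase globally, so appending a newly dirtied buffer at the tail keeps the list sorted by oldest LSN.

```python
from collections import OrderedDict
from typing import Optional


class FlushList:
    """Dirty buffers kept in ascending order of oldest LSN (illustrative sketch)."""

    def __init__(self) -> None:
        self._buffers = OrderedDict()  # buffer id -> oldest LSN

    def mark_dirty(self, buffer_id: int, lsn: int) -> None:
        # Only the first change to a clean buffer assigns its oldest LSN.
        if buffer_id not in self._buffers:
            self._buffers[buffer_id] = lsn

    def flush_head(self) -> Optional[int]:
        """Remove the buffer at the head, as if it were flushed; return its id."""
        if not self._buffers:
            return None
        buffer_id, _ = self._buffers.popitem(last=False)
        return buffer_id

    def consistent_lsn(self) -> Optional[int]:
        """Oldest LSN at the head of the list, read without traversal."""
        if not self._buffers:
            return None
        return next(iter(self._buffers.values()))


if __name__ == "__main__":
    flush_list = FlushList()
    flush_list.mark_dirty(buffer_id=1, lsn=10)
    flush_list.mark_dirty(buffer_id=2, lsn=30)
    print(flush_list.consistent_lsn())
    flush_list.flush_head()
    print(flush_list.consistent_lsn())
```

    This mirrors the example in the text: after the buffer with an oldest LSN of 10 is flushed, the head of the list is the buffer with an oldest LSN of 30.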

    Parallel Flushing

    To further improve the efficiency of moving the consistent LSN of each page to the head of the flush list, PolarDB runs multiple BGWRITER processes to flush buffers in parallel. Each BGWRITER process reads a number of buffers from the flush list and flushes the buffers to the shared storage at a time.

    image.png

    Hot Buffers

    After the flushing control mechanism is introduced, PolarDB flushes only the buffers that meet specific flush conditions to the shared storage. If a buffer is frequently updated, its latest LSN may remain larger than its oldest apply LSN. As a result, the buffer can never meet the flush conditions. Such buffers are called hot buffers. If a page has hot buffers, the consistent LSN of the page cannot be moved towards the head of the flush list. To resolve this issue, PolarDB provides a copy buffering mechanism.

    The copy buffering mechanism allows PolarDB to copy buffers that do not meet the flush conditions to a copy buffer pool. Buffers in the copy buffer pool and their latest LSNs are no longer updated. As the oldest apply LSN moves towards the head of the flush list, these buffers start to meet the flush conditions. When these buffers meet the flush conditions, PolarDB can flush them from the copy buffer pool to the shared storage.

    The following flush rules apply:

    1. If a buffer does not meet the flush conditions, PolarDB checks the number of recent changes to the buffer and the time difference between the most recent change and the latest LSN. If the number and the time difference exceed their predefined thresholds, PolarDB copies the buffer to the copy buffer pool.
    2. When a buffer is updated again, PolarDB checks whether the buffer meets the flush conditions. If the buffer meets the flush conditions, PolarDB flushes the buffer to the shared storage and deletes the copy of the buffer from the copy buffer pool.
    3. If a buffer does not meet the flush conditions, PolarDB checks whether a copy of the buffer can be found in the copy buffer pool. If a copy of the buffer can be found in the copy buffer pool and the copy meets the flush conditions, PolarDB flushes the copy to the shared storage.
    4. After a buffer that is copied to the copy buffer pool is updated, PolarDB regenerates an oldest LSN for the buffer and moves the buffer to the tail of the flush list.

    In the example shown in the following figure, the buffer with an oldest LSN of 30 and a latest LSN of 500 is considered a hot buffer. The buffer is updated after it is copied to the copy buffer pool. If the change is marked by LSN 600, PolarDB changes the oldest LSN of the buffer to 600 and moves the buffer to the tail of the flush list. At this time, the copy of the buffer is no longer updated, and the latest LSN of the copy remains 500. When the copy meets the flush conditions, PolarDB flushes the copy to the shared storage.

    image.png

    After the copy buffering mechanism is introduced, PolarDB uses a different method to calculate the consistent LSN of each page. For a specific page, the oldest LSN in the flush list is no longer the smallest oldest LSN because the oldest LSN in the copy buffer pool can be smaller. Therefore, PolarDB needs to compare the oldest LSN in the flush list with the oldest LSN in the copy buffer pool. The smaller oldest LSN is considered the consistent LSN.
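
    The adjusted calculation can be sketched as taking the minimum over both structures. The function below is a hypothetical illustration and assumes that the head of the flush list and the oldest LSNs in the copy buffer pool are already available as plain integers.

```python
from typing import Iterable, Optional


def consistent_lsn(flush_list_head_lsn: Optional[int],
                   copy_pool_oldest_lsns: Iterable[int]) -> Optional[int]:
    """Smaller of the flush list head and the smallest oldest LSN in the copy buffer pool."""
    candidates = list(copy_pool_oldest_lsns)
    if flush_list_head_lsn is not None:
        candidates.append(flush_list_head_lsn)
    return min(candidates) if candidates else None


if __name__ == "__main__":
    # Hot buffer from the example: copied with oldest LSN 30, then updated at LSN 600.
    print(consistent_lsn(flush_list_head_lsn=600, copy_pool_oldest_lsns=[30]))
    # After the copy has been flushed, only the flush list remains.
    print(consistent_lsn(flush_list_head_lsn=600, copy_pool_oldest_lsns=[]))
```

    In the first call the copy still holds changes from LSN 30 onward that are not on the shared storage, so the consistent LSN cannot advance past 30 until the copy is flushed.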

    Lazy Checkpointing

    PolarDB supports consistent LSNs, which are similar to checkpoints. All changes that are made to a page before the checkpoint LSN of the page are flushed to the shared storage. If a recovery operation is run, PolarDB starts to recover the page from the checkpoint LSN. This improves recovery efficiency. If regular checkpoint LSNs are used, PolarDB flushes all dirty pages in the buffer pool and other in-memory pages to the shared storage. This process may require a long period of time and high I/O throughput. As a result, normal queries may be affected.

    Consistent LSNs empower PolarDB to implement lazy checkpointing. If the lazy checkpointing mechanism is used, PolarDB does not flush all dirty pages in the buffer pool to the shared storage. Instead, PolarDB uses consistent LSNs as checkpoint LSNs. This significantly increases checkpointing efficiency.

    The underlying logic of the lazy checkpointing mechanism allows PolarDB to run BGWRITER processes that continuously flush dirty pages and maintain consistent LSNs. The lazy checkpointing mechanism cannot be used with the full page write feature. If you enable the full page write feature, the lazy checkpointing mechanism is automatically disabled.

    + + + diff --git a/theory/ddl-synchronization.html b/theory/ddl-synchronization.html new file mode 100644 index 00000000000..3229d991963 --- /dev/null +++ b/theory/ddl-synchronization.html @@ -0,0 +1,33 @@ + + + + + + + + + DDL Synchronization | PolarDB for PostgreSQL + + + + +

    DDL Synchronization

    Overview

    In a shared storage architecture that consists of one primary node and multiple read-only nodes, a data file has only one copy. Due to multi-version concurrency control (MVCC), the read and write operations performed on different nodes do not conflict. However, MVCC cannot be used to ensure consistency for some specific data operations, such as file operations.

    MVCC applies to tuples within a file but does not apply to the file itself. File operations such as creating and deleting files are visible to the entire cluster immediately after they are performed. This causes an issue that files disappear while read-only nodes are reading the files. To prevent the issue from occurring, file operations need to be synchronized.

    In most cases, DDL is used to perform operations on files. For DDL operations, PolarDB provides a synchronization mechanism to prevent concurrent file operations. The logic of DDL operations in PolarDB is the same as the logic of single-node execution. However, the synchronization mechanism is different.

    Terms

    • LSN: short for log sequence number. Each LSN is the unique identifier of an entry in a write-ahead logging (WAL) log file. LSNs are incremented at a global level.
    • Apply LSN: refers to the position at which a WAL log file is applied on a read-only node.

    DDL Synchronization Mechanism

    DDL Locks

    The DDL synchronization mechanism uses AccessExclusiveLocks (DDL locks) to synchronize DDL operations between primary and read-only nodes.

    image.png
    Figure 1: Relationship Between DDL Lock and WAL Log

    DDL locks are table locks at the highest level in databases. DDL locks and locks at other levels are mutually exclusive. When the primary node synchronizes a WAL log file of a table to the read-only nodes, the primary node acquires the LSN of the lock in the WAL log file. When a read-only node applies the WAL log file beyond the LSN of the lock, the lock is considered to have been acquired on the read-only node. The DDL lock is released after the transaction ends. Figure 1 shows the entire process from the acquisition to the release of a DDL lock. When the WAL log file is applied at Apply LSN 1, the DDL lock is not acquired. When the WAL log file is applied at Apply LSN 2, the DDL lock is acquired. When the WAL log file is applied at Apply LSN 3, the DDL lock is released.

    image.png
    Figure 2: Conditions for Acquiring DDL Lock

    When the WAL log file is applied beyond the LSN of the lock on all read-only nodes, the DDL lock is considered to have been acquired by the transaction of the primary node at the cluster level. Then, this table cannot be accessed over other sessions on the primary node or read-only nodes. During this time period, the primary node can perform various file operations on the table.

    Note: A standby node in an active/standby environment has independent file storage. When a standby node acquires a lock, the preceding situation never occurs.

    image.png
    Figure 3: DDL Synchronization Workflow

    Figure 3 shows the workflow of how DDL operations are synchronized.

    1. Each read-only node executes query statements in a session.
    2. The primary node executes DDL statements in a session, acquires a local DDL lock, writes the DDL lock to the WAL log file, and then waits for all read-only nodes to apply the WAL log file.
    3. The apply process of each read-only node attempts to acquire the DDL lock. When the apply process acquires the DDL lock, it returns the Apply LSN to the primary node.
    4. The primary node is notified that the DDL lock is acquired on all read-only nodes.
    5. The primary node starts to perform DDL operations.
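
    The condition that the primary node waits for in steps 2 to 4 can be sketched as a comparison of LSNs. The function name and the dictionary of apply LSNs are assumptions for illustration. The sketch also assumes that an apply LSN equal to the LSN of the lock means that the lock record has been applied.

```python
from typing import Dict


def global_ddl_lock_acquired(lock_lsn: int, apply_lsns: Dict[str, int]) -> bool:
    """True once every read-only node has applied the WAL up to the DDL lock record."""
    return all(apply_lsn >= lock_lsn for apply_lsn in apply_lsns.values())


if __name__ == "__main__":
    lock_lsn = 200  # hypothetical LSN of the DDL lock record
    print(global_ddl_lock_acquired(lock_lsn, {"ro1": 250, "ro2": 150}))
    print(global_ddl_lock_acquired(lock_lsn, {"ro1": 250, "ro2": 200}))
```

    Only when the check returns true for all read-only nodes does the primary node treat the DDL lock as acquired at the cluster level.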

    How to Ensure Data Correctness

    DDL locks are locks at the highest level in PostgreSQL databases. Before a database performs operations such as DROP, ALTER, LOCK, and VACUUM (FULL) on a table, a DDL lock must be acquired. The primary node acquires the DDL lock by responding to user requests. When the lock is acquired, the primary node writes the DDL lock to the log file. Read-only nodes acquire the DDL lock by applying the log file.

    • In an active/standby environment, a hot standby node runs read-only queries and applies the log file at the same time. When the log file is applied to the LSN of the lock, the apply is blocked if the table is being read until the apply process times out.
    • In a PolarDB environment, the DDL lock is acquired by the primary node only after the DDL lock is acquired by all read-only nodes. This ensures that no other session on the primary node or the read-only nodes can access the data of the table in the shared storage while the lock is held. This is a prerequisite for performing DDL operations in PolarDB.

    DDL operations on a table are synchronized based on the following logic. The < indicator shows that the operations are performed from left to right.

    1. Completes all local queries < Acquires a local DDL lock < Releases the local DDL lock < Runs new local queries
    2. The primary node acquires a local DDL lock < Each read-only node acquires a local DDL lock < The primary node acquires a global DDL lock
    3. The primary node acquires a global DDL lock < The primary node writes data < The primary node releases the global DDL lock

    The sequence of the following operations is inferred based on the preceding execution logic: Queries on the primary node and each read-only node end < The primary node acquires a global DDL lock < The primary node writes data < The primary node releases the global DDL lock < The primary node and read-only nodes run new queries.

    When the primary node writes data to the shared storage, no queries are run on the primary node or read-only nodes. This way, data correctness is ensured. The entire operation process follows the two-phase locking (2PL) protocol. This way, data correctness is ensured among multiple tables.

    Apply Optimization for DDL Locks on RO

    In the preceding synchronization mechanism, DDL locks are synchronized in the main process that is used for primary/secondary synchronization. When the synchronization of a DDL lock to a read-only node is blocked, the synchronization of data to the read-only node is also blocked. In the third and fourth phases of the apply process shown in Figure 1, the DDL lock can be acquired only after the session in which local queries are run is closed. The default timeout period for synchronization in PolarDB is 30s. If the primary node runs in heavy load, a large data latency may occur.

    In specific cases, the data latency for a read-only node to apply DDL locks is the sum of the time used to apply each log entry. For example, if the primary node writes 10 log entries for DDL locks within 1s and each entry is blocked until the 30s timeout period elapses, the read-only node requires up to 300s (10 × 30s) to apply all log entries. Data latency can negatively affect the stability of PolarDB. The primary node may be unable to clean dirty data and perform checkpoints at the earliest opportunity due to data latency. If the system stops responding when a large data latency occurs, the system requires an extended period of time to recover. This can lead to great stability risks.

    Asynchronous Apply of DDL Locks

    To resolve this issue, PolarDB optimizes DDL lock apply on read-only nodes.

    image.png
    Figure 4: Asynchronous Apply of DDL Locks on Read-Only Nodes

    PolarDB uses an asynchronous process to apply DDL locks so that the main apply process is not blocked.

    Figure 4 shows the overall workflow in which PolarDB offloads the acquisition of DDL locks from the main apply process to the lock apply process and immediately returns to the main apply process. This way, the main apply process is not affected even if lock apply is blocked.

    Lock apply conflicts rarely occur, so PolarDB does not offload the acquisition of all locks to the lock apply process. PolarDB first attempts to acquire a lock in the main apply process. Only if the attempt fails does PolarDB offload the lock acquisition to the lock apply process. This reduces the synchronization overhead between processes.
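
    The try-first, offload-on-conflict pattern can be sketched as follows. This is a simplified single-process model, not PolarDB code: `threading.Lock` stands in for a table-level lock, a held lock stands in for a running local query, and the class and method names are hypothetical.

```python
import threading
from collections import deque
from typing import Dict, List


class LockApplySketch:
    """Main apply path tries the lock once; conflicts go to a pending queue."""

    def __init__(self) -> None:
        self.table_locks: Dict[str, threading.Lock] = {}
        self.pending = deque()  # lock requests handed to the lock apply process

    def _lock_for(self, table: str) -> threading.Lock:
        return self.table_locks.setdefault(table, threading.Lock())

    def apply_lock_record(self, table: str) -> str:
        """Called by the main apply process; never blocks."""
        if self._lock_for(table).acquire(blocking=False):
            return "acquired"
        self.pending.append(table)
        return "offloaded"

    def run_lock_apply_once(self) -> List[str]:
        """One pass of the lock apply process: retry every pending request."""
        acquired = []
        for _ in range(len(self.pending)):
            table = self.pending.popleft()
            if self._lock_for(table).acquire(blocking=False):
                acquired.append(table)
            else:
                self.pending.append(table)
        return acquired


if __name__ == "__main__":
    sketch = LockApplySketch()
    sketch._lock_for("t2").acquire()        # a local query is still using t2
    print(sketch.apply_lock_record("t1"))   # no conflict: stays in the main path
    print(sketch.apply_lock_record("t2"))   # conflict: handed off, main path continues
```

    Because `apply_lock_record` returns immediately in both cases, the main apply path keeps applying subsequent log entries while the pending request is retried separately.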

    By default, the asynchronous lock apply feature is enabled in PolarDB. This feature can reduce the apply latency caused by apply conflicts to ensure service stability. AWS Aurora does not provide similar features. Apply conflicts in AWS Aurora can severely increase data latency.

    How to Ensure Data Correctness

    In asynchronous apply mode, only the process that acquires locks changes. The execution logic does not change. During the process in which the primary node acquires a global DDL lock, writes data, and then releases the global DDL lock, no queries are run. This way, data correctness is not affected.

    + + + diff --git a/theory/logindex.html b/theory/logindex.html new file mode 100644 index 00000000000..366c8e200ae --- /dev/null +++ b/theory/logindex.html @@ -0,0 +1,33 @@ + + + + + + + + + LogIndex | PolarDB for PostgreSQL + + + + +

    LogIndex

    Background Information

    PolarDB uses a shared storage architecture. Each PolarDB cluster consists of a primary node and multiple read-only nodes. The primary node and the read-only nodes share the data in the shared storage. The primary node can read data from the shared storage and write data to the storage. Read-only nodes can only read data from the shared storage and keep their in-memory data up to date by replaying logs. Data in the memory is synchronized from the primary node to read-only nodes. This ensures that data is consistent between the primary node and read-only nodes. Read-only nodes can also provide services to implement read/write splitting and load balancing. If the primary node becomes unavailable, a read-only node can be promoted to the primary node. This ensures the high availability of the cluster. The following figure shows the architecture of PolarDB.

    image.png

    In the shared-nothing architecture, read-only nodes have independent memory and storage. These nodes need only to receive write-ahead logging (WAL) logs from the primary node and replay the WAL logs. If the data that needs to be replayed is not in buffer pools, the data must be read from storage files and written to buffer pools for replay. This can cause cache misses. More data is evicted from buffer pools because the data is replayed in a continuous manner. The following figure shows more details.

    image.png

    Multiple transactions on the primary node can be executed in parallel. Read-only nodes must replay WAL logs in the sequence in which the WAL logs are generated. As a result, read-only nodes replay WAL logs at a low speed and the latency between the primary node and read-only nodes increases.

    image.png

    If a PolarDB cluster uses a shared storage architecture and consists of one primary node and multiple read-only nodes, the read-only nodes can obtain WAL logs that need to be replayed from the shared storage. If data pages on the shared storage are the most recent pages, read-only nodes can read the data pages without replaying the pages. PolarDB provides LogIndex that can be used on read-only nodes to replay WAL logs at a higher speed.

    Memory Synchronization Architecture for RO

    LogIndex stores the mapping between a data page and all the log sequence numbers (LSNs) of updates on the page. LogIndex can be used to rapidly obtain all LSNs of updates on a data page. This way, the WAL logs generated for the data page can be replayed when the data page is read. The following figure shows the architecture that is used to synchronize data from the primary node to read-only nodes.

    image.png

    Compared with the shared-nothing architecture, the workflow of the primary node and read-only nodes in the shared storage architecture has the following differences:

    • Complete WAL logs are not replicated from the primary node to read-only nodes. Only WAL log metadata is replicated to the read-only nodes. This reduces the amount of data transmitted on the network and the latency between the primary node and read-only nodes.
    • The primary node generates LogIndex records based on WAL log metadata and writes the records to the LogIndex Memory Table. After the LogIndex Memory Table is full, data in the table is flushed to the disk and stored in the LogIndex Table of the shared storage. The LogIndex Memory Table can be reused.
    • The primary node uses the LogIndex metadata file to ensure the atomicity of I/O operations on the LogIndex Memory Table. After data in the Memory Table is flushed to the disk, the LogIndex metadata file is updated. When the data is being flushed to the disk, bloom data is generated. Bloom data can be used to check whether a specific page exists in a LogIndex Table. This way, the LogIndex Tables that do not contain the page can be skipped during scans. This improves efficiency.
    • Read-only nodes receive WAL log metadata from the primary node. Then, the read-only nodes generate LogIndex records in the memory based on WAL log metadata and write the records to the LogIndex Memory Table stored in the memory of read-only nodes. The pages that correspond to WAL log metadata in buffer pools are marked as outdated pages. In this process, the read-only nodes do not replay logs or perform I/O operations on data. No cost is required for cache misses.
    • After read-only nodes generate LogIndex records based on WAL log metadata, WAL logs generated for the next LSN are replayed. On the read-only nodes, the backend processes that access a page and the background replay processes replay the logs. In this case, the read-only nodes can replay the WAL logs in parallel.
    • Data in the LogIndex Memory Table generated by read-only nodes is not flushed to the disk. The read-only nodes use the LogIndex metadata file to determine whether data in the full LogIndex Memory Table is flushed to the disk on the primary node. If data in the LogIndex Memory Table is flushed to the disk, the LogIndex Memory Table can be reused. When the primary node determines that the LogIndex Table in the storage is no longer used, the LogIndex Table can be truncated.

    PolarDB reduces the latency between the primary node and read-only nodes by replicating only WAL log metadata. PolarDB uses LogIndex to delay the replay of WAL logs and replay WAL logs in parallel. This can increase the speed at which read-only nodes replay WAL logs.
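
    The use of bloom data to skip LogIndex Tables can be illustrated with a generic Bloom filter. The sketch below is not the PolarDB on-disk format. The class `BloomSketch`, the string form of the page tag, and the filter size and hash count are all assumptions for this example.

```python
import hashlib
from typing import Dict, Iterator, List


class BloomSketch:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, page_tag: str) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{page_tag}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, page_tag: str) -> None:
        for pos in self._positions(page_tag):
            self.bits |= 1 << pos

    def might_contain(self, page_tag: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(page_tag))


def tables_to_scan(blooms: Dict[str, BloomSketch], page_tag: str) -> List[str]:
    """Keep only the tables whose bloom data says the page may be present."""
    return [name for name, bloom in blooms.items() if bloom.might_contain(page_tag)]


if __name__ == "__main__":
    table_a, table_b = BloomSketch(), BloomSketch()
    table_a.add("rel1/blk7")  # hypothetical page tag recorded in table A only
    print(tables_to_scan({"A": table_a, "B": table_b}, "rel1/blk7"))
```

    A negative answer from the filter is definitive, so a scan for a page can safely skip every table whose filter reports that the page is absent.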

    WAL Meta

    WAL logs are also called XLogRecord. Each XLogRecord consists of two parts, as shown in the following figure.

    • General header portion: This portion is the schema of the XLogRecord. The length of this portion is fixed. This portion stores the general information about the XLogRecord, such as the length, transaction ID, and the type of the resource manager of the XLogRecord.
    • Data portion: This portion is divided into two parts: header and data. The header part contains 0 to N XLogRecordBlockHeader schemas and 0 to 1 XLogRecordDataHeader[Short|Long] schema. The data part contains block data and main data. Each XLogRecordBlockHeader structure corresponds to block data of the data part. The XLogRecordDataHeader[Short|Long] schema corresponds to main data of the data part.

    wal meta.png

    In shared storage mode, complete WAL logs do not need to be replicated from the primary node to read-only nodes. Only WAL log metadata is replicated to the read-only nodes. WAL log metadata consists of the general header portion, header part, and main data, as shown in the preceding figure. Read-only nodes can read complete WAL log content from the shared storage based on WAL log metadata. The following figure shows the process of replicating WAL log metadata from the primary node to read-only nodes.

    wal meta trans.png

    1. When a transaction on the primary node modifies data on this node, the WAL logs are generated for the modification and the metadata of the WAL logs is replicated to the metadata queue of WAL logs in the memory.
    2. In synchronous streaming replication mode, before the transaction is committed, WAL logs in the WAL buffer are flushed to the disk and then the WalSender process is woken up.
    3. If the WalSender process finds new WAL logs that can be sent, the process reads the metadata of the logs from the metadata queue of WAL logs. After the metadata is read, the process sends the metadata to read-only nodes over the streaming replication connection that is established.
    4. After the WalReceiver processes on read-only nodes receive the metadata, the processes push the metadata to the metadata queue of WAL logs in the memory and notify the startup processes of the new metadata.
    5. The startup processes read the metadata from the metadata queue of WAL logs and parse the metadata into a LogIndex Memtable.

    In streaming replication mode, payloads are not replicated from the primary node to read-only nodes. This reduces the amount of data transmitted on the network. The WalSender process on the primary node obtains the metadata of WAL logs from the metadata queue stored in the memory. After the WalReceiver process on the read-only nodes receives the metadata, the process stores the metadata in the metadata queue of WAL logs in the memory. The disk I/O in streaming replication mode is lower than that in primary/secondary mode. This increases the speed at which logs are transmitted and reduces the latency between the primary node and read-only nodes.
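The saving can be made concrete with a small C sketch. The struct and function names below are illustrative assumptions, not PolarDB code; the sketch only encodes the rule stated above, that metadata is the general header portion, the header part, and the main data, with the block data left out.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes, in bytes, of the parts of one WAL record (illustrative names). */
typedef struct WalRecordSizes
{
    size_t general_header;  /* fixed-length general header portion */
    size_t block_headers;   /* all XLogRecordBlockHeader structures */
    size_t data_header;     /* XLogRecordDataHeader[Short|Long], 0 if absent */
    size_t block_data;      /* block data: the payload that is not replicated */
    size_t main_data;       /* main data */
} WalRecordSizes;

/* Bytes of the complete record as written to shared storage. */
size_t
wal_full_size(const WalRecordSizes *r)
{
    assert(r != NULL);
    return r->general_header + r->block_headers + r->data_header +
           r->block_data + r->main_data;
}

/* Bytes sent to read-only nodes: everything except the block data. */
size_t
wal_meta_size(const WalRecordSizes *r)
{
    assert(r != NULL);
    return r->general_header + r->block_headers + r->data_header +
           r->main_data;
}
```

For a record that carries a full 8 KB page image, nearly all of the bytes are block data, so the metadata is a small fraction of the record. The sizes used in any call are made-up inputs, not measurements.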

    LogIndex

    Memory data structure

    LogIndex is a hash table keyed by PageTag, which identifies a specific data page. The value is the list of all LSNs generated for updates on that page. The following figure shows the memory data structure of LogIndex. A LogIndex Memtable contains a Memtable ID, the maximum and minimum LSNs, and the following arrays:

    • HashTable: The HashTable array records the mapping between a page and the LSN list for updates on the page. Each member of the HashTable array points to a specific LogIndex Item in the Segment array.
    • Segment: Each member in the Segment array is a LogIndex Item. A LogIndex Item has two structures: Item Head and Item Seg, as shown in the following figure. Item Head is the head of the LSN linked list for a page. Item Seg is the subsequent node of the LSN linked list. PageTag in Item Head is used to record the metadata of a single Page. In Item Head, Next Seg points to the subsequent node and Tail Seg points to the tail node. Item Seg stores pointers that point to the previous node Prev Seg and the subsequent node Next Seg. A complete LSN can consist of a Suffix LSN stored in Item Head and Item Seg and a Prefix LSN stored in the LogIndex Memtable. This way, each stored Prefix LSN is unique and the storage space is not wasted. When different values of PageTag specify the same item in the HashTable array based on the calculated result, Next Item in Item Head points to the next page where the hash value is the same as that of the page. This way, the hash collision is resolved.
    • Index Order: The Index Order array records the order in which LogIndex records are added to a LogIndex Memtable. Each member in the array occupies 2 bytes. The last 12 bits of each member correspond to a subscript of the Segment array and point to a specific LogIndex Item. The first four bits correspond to a subscript of the Suffix LSN array in the LogIndex Item and point to a specific Suffix LSN. The Index Order array can be used to obtain all LSNs that are inserted into a LogIndex Memtable and obtain the mapping between an LSN and all modified pages for which the LSN is generated.
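The bit layout of an Index Order member and the Prefix/Suffix LSN split can be sketched in C as follows. This is a minimal sketch under two assumptions that the text does not state: "first four bits" is read as the four high-order bits of the 16-bit member, and an LSN is split into a 32-bit prefix and a 32-bit suffix. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_INDEX_BITS 12
#define SEG_INDEX_MASK 0x0FFFu

/* Pack a Segment array subscript (12 bits) and a Suffix LSN subscript (4 bits). */
uint16_t
index_order_pack(unsigned seg_index, unsigned suffix_index)
{
    assert(seg_index <= SEG_INDEX_MASK);
    assert(suffix_index <= 0xFu);
    return (uint16_t) ((suffix_index << SEG_INDEX_BITS) | seg_index);
}

/* Low 12 bits: which LogIndex Item in the Segment array. */
unsigned
index_order_seg(uint16_t member)
{
    return member & SEG_INDEX_MASK;
}

/* High 4 bits: which Suffix LSN inside that LogIndex Item. */
unsigned
index_order_suffix(uint16_t member)
{
    return (unsigned) member >> SEG_INDEX_BITS;
}

/* Rebuild a complete LSN from the Memtable-wide prefix and a per-item suffix. */
uint64_t
lsn_combine(uint32_t prefix_lsn, uint32_t suffix_lsn)
{
    return ((uint64_t) prefix_lsn << 32) | suffix_lsn;
}
```

Storing the prefix once per Memtable and only the suffix per entry is what keeps each LSN entry small; the 12-bit subscript also implies at most 4096 items in the Segment array.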

    logindex.png

    LogIndex Memtables stored in the memory are divided into two categories: Active LogIndex Memtables and Inactive LogIndex Memtables. The LogIndex records generated based on WAL log metadata are written to an Active LogIndex Memtable. After the Active LogIndex Memtable is full, the table is converted to an Inactive LogIndex Memtable and the system generates another Active LogIndex Memtable. The data in the Inactive LogIndex Memtable can be flushed to the disk. Then, the Inactive LogIndex Memtable can be converted to an Active LogIndex Memtable again. The following figure shows more details.

    image.png

    Data structure on disk

    The disk stores a large number of LogIndex Tables. The structure of a LogIndex Table is similar to the structure of a LogIndex Memtable. A LogIndex Table can contain a maximum of 64 LogIndex Memtables. When data in Inactive LogIndex Memtables is flushed to the disk, Bloom filters are generated for the Memtables. The size of a single Bloom filter is 4,096 bytes. A Bloom filter records the information about an Inactive LogIndex Memtable, such as the mapped values that the bit array of the Bloom filter stores for all pages in the Inactive LogIndex Memtable, the minimum LSN, and the maximum LSN. The following figure shows more details. A Bloom filter can be used to determine whether a page exists in the LogIndex Table that corresponds to the filter. This way, LogIndex Tables in which the page does not exist do not need to be scanned. This accelerates data retrieval.
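The role of the Bloom filter can be sketched in C. Only the 4,096-byte size and the min/max LSN fields come from the text above; the PageTag fields, the number of hash functions, and the FNV-1a hash are assumptions chosen for illustration, and every name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BYTES  4096               /* size given in the text */
#define BLOOM_BITS   (BLOOM_BYTES * 8)
#define BLOOM_HASHES 3                  /* assumed number of hash functions */

/* Simplified stand-in for PageTag. */
typedef struct PageTagSketch
{
    uint32_t rel_node;
    uint32_t fork_num;
    uint32_t block_num;
} PageTagSketch;

typedef struct BloomSketch
{
    uint8_t  bits[BLOOM_BYTES];
    uint64_t min_lsn;
    uint64_t max_lsn;
} BloomSketch;

/* FNV-1a over the tag fields, varied by a seed to get several hashes. */
static uint32_t
tag_hash(const PageTagSketch *tag, uint32_t seed)
{
    const uint32_t fields[3] = {tag->rel_node, tag->fork_num, tag->block_num};
    uint32_t h = 2166136261u ^ seed;
    int i, b;

    for (i = 0; i < 3; i++)
        for (b = 0; b < 4; b++)
        {
            h ^= (fields[i] >> (8 * b)) & 0xFFu;
            h *= 16777619u;
        }
    return h;
}

void
bloom_init(BloomSketch *f)
{
    assert(f != NULL);
    memset(f->bits, 0, sizeof(f->bits));
    f->min_lsn = UINT64_MAX;
    f->max_lsn = 0;
}

/* Record that the Memtable holds an LSN for this page. */
void
bloom_add(BloomSketch *f, const PageTagSketch *tag, uint64_t lsn)
{
    uint32_t k;

    for (k = 0; k < BLOOM_HASHES; k++)
    {
        uint32_t bit = tag_hash(tag, k) % BLOOM_BITS;

        f->bits[bit / 8] |= (uint8_t) (1u << (bit % 8));
    }
    if (lsn < f->min_lsn)
        f->min_lsn = lsn;
    if (lsn > f->max_lsn)
        f->max_lsn = lsn;
}

/* false: the page is definitely absent; true: the table must be scanned. */
bool
bloom_may_contain(const BloomSketch *f, const PageTagSketch *tag)
{
    uint32_t k;

    for (k = 0; k < BLOOM_HASHES; k++)
    {
        uint32_t bit = tag_hash(tag, k) % BLOOM_BITS;

        if (!(f->bits[bit / 8] & (1u << (bit % 8))))
            return false;
    }
    return true;
}
```

A lookup first checks the filter and scans the table only when bloom_may_contain returns true. False positives cost an unnecessary scan, but a page that was added is never reported as absent.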

    image.png

    After the data in an Inactive LogIndex Memtable is flushed to the disk, the LogIndex metadata file is updated. This file is used to ensure the atomicity of I/O operations on the LogIndex Memtable file. The LogIndex metadata file stores the information about the smallest LogIndex Table and the largest LogIndex Memtable on the disk. Start LSN in this file records the maximum LSN among all LogIndex Memtables whose data has been flushed to the disk. If a write is only partially completed while a LogIndex Memtable is being flushed, the system parses the WAL logs again, starting from the Start LSN recorded in the LogIndex metadata file. The LogIndex records that were lost in the partial write are regenerated, which keeps I/O operations on the Memtable atomic.

    image.png

    All modified data pages recorded in WAL logs before the LSN of consistent data are persisted to the shared storage based on the information described in Buffer Management. The LSN of consistent data is the LSN before which data is consistent between the primary node and read-only nodes. Read-only nodes do not need to replay WAL logs generated before the LSN of consistent data. In this case, the WAL logs for the LSNs that are smaller than the LSN of consistent data can be cleared from LogIndex Tables. This way, the primary node can truncate LogIndex Tables that are no longer used in the storage. This enables more efficient log replay for read-only nodes and reduces the space occupied by LogIndex Tables.

    Log replay

    Delayed replay

    For scenarios in which LogIndex Tables are used, the startup processes of read-only nodes generate LogIndex records based on the received WAL metadata and mark the pages that correspond to the WAL metadata and exist in buffer pools as outdated pages. This way, WAL logs for the next LSN can be replayed. The startup processes do not replay WAL logs. The backend processes that access the page and the background replay processes replay the logs. The following figure shows how WAL logs are replayed.

    • The background replay process replays WAL logs in the sequence of WAL logs. The process retrieves modified pages from LogIndex Memtables and LogIndex Tables based on the LSN of a page that you want to replay. If a page exists in a buffer pool, the page is replayed. Otherwise, the page is skipped. The background replay process replays WAL logs generated for the next LSN of a page in a buffer pool in the sequence of LSNs. This prevents a large number of LSNs for a single page that you want to replay from being accumulated.
    • The backend process replays only the pages it must access. If the backend process must access a page that does not exist in a buffer pool, the process reads this page from the shared storage, writes the page to a buffer pool, and replays this page. If the page exists in a buffer pool and is marked as an outdated page, the process replays the most recent WAL logs of this page. The backend process retrieves the LSNs of the page from LogIndex Memtables and LogIndex Tables based on the value of PageTag. After the process retrieves the LSNs, the process generates the LSNs for the page in sequence. Then, the process reads the complete WAL logs from the shared storage based on the generated LSNs to replay the page.
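The backend path can be sketched in C as follows. This is a hypothetical model, not the PolarDB implementation: the page is reduced to its current LSN, the LSN list is assumed to be already retrieved from LogIndex and sorted in ascending order, and applying a record is reduced to advancing the page LSN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Replay, in LSN order, every record of one page that is newer than the
 * page's current LSN and not beyond replay_upto. Returns how many records
 * were replayed and advances *page_lsn accordingly.
 */
size_t
replay_outdated_page(uint64_t *page_lsn, const uint64_t *lsns, size_t n,
                     uint64_t replay_upto)
{
    size_t i;
    size_t replayed = 0;

    assert(page_lsn != NULL);
    assert(lsns != NULL || n == 0);

    for (i = 0; i < n; i++)
    {
        if (lsns[i] <= *page_lsn)
            continue;           /* already reflected in the page */
        if (lsns[i] > replay_upto)
            break;              /* newer than what this read needs */

        /* A real backend would read the complete WAL record from the
         * shared storage here and apply it to the page. */
        *page_lsn = lsns[i];
        replayed++;
    }
    return replayed;
}
```

The replay_upto parameter is an assumption of this sketch; it stands for whatever upper bound the reader of the page needs.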

    image.png

    An XLOG Buffer is added to cache WAL logs that have been read, which reduces the overhead of reading WAL logs from the disk for replay. Without this buffer, WAL logs are read directly from the WAL segment file on the disk. With it, WAL logs are read from the XLOG Buffer first. If the WAL logs to replay are not in the XLOG Buffer, the corresponding WAL pages are read from the disk, written to the buffer, and then copied to readBuf of XLogReaderState. If they are already in the buffer, they are copied directly to readBuf of XLogReaderState. This reduces the number of I/O operations needed for replay and speeds it up. The following figure shows more details.

    image.png

    Mini Transaction

    Log replay with LogIndex differs from log replay in a shared-nothing architecture. With LogIndex, the startup process parses WAL metadata to generate LogIndex records, while backend processes replay pages based on those records in parallel, each replaying only the pages that it must access. A single XLogRecord may modify multiple pages. For example, an index block split modifies both Page_0 and Page_1. The modification is atomic: either both pages are modified or neither is. The mini transaction lock mechanism ensures that the in-memory data structures stay consistent while backend processes replay pages.

    When mini transaction locks are unavailable, the startup process parses WAL metadata and sequentially inserts the current LSN into the LSN list of each page. The following figure shows more details. The startup process completes the update of the LSN list of Page_0 but does not complete the update of the LSN list of Page_1. In this case, Backend_0 accesses Page_0 and Backend_1 accesses Page_1. Backend_0 replays Page_0 based on the LSN list of Page_0. Backend_1 replays Page_1 based on the LSN list of Page_1. The WAL log for LSN_N+1 is replayed for Page_0 and the WAL log for LSN_N is replayed for Page_1. As a result, the versions of the two pages are not consistent in the buffer pool. This causes inconsistency between the memory data structure of Page_0 and that of Page_1.

    image.png

    With the mini transaction lock mechanism, updating the LSN lists of Page_0 and Page_1 is treated as one mini transaction. Before the startup process updates the LSN list of a page, it must obtain the mini transaction lock of that page. In the following figure, the process first obtains the mini transaction lock of Page_0. Locks are obtained in the same order in which the WAL log modifies the pages during replay. After the LSN lists of Page_0 and Page_1 are both updated, the mini transaction locks are released. If a backend process replays a page based on LogIndex records while the startup process is still in a mini transaction on that page, the backend must obtain the mini transaction lock of the page before replaying it. Suppose the startup process has updated the LSN list of Page_0 but not yet that of Page_1, and Backend_0 accesses Page_0 while Backend_1 accesses Page_1. Backend_0 cannot replay Page_0 until the mini transaction lock of Page_0 is released, and by that time the LSN list of Page_1 has also been updated. The in-memory data structures are therefore modified atomically.

    mini trans.png

    Summary

    PolarDB provides LogIndex based on the shared storage between the primary node and read-only nodes. LogIndex accelerates the speed at which memory data is synchronized from the primary node to read-only nodes and reduces the latency between the primary node and read-only nodes. This ensures the availability of read-only nodes and makes data between the primary node and read-only nodes consistent. This topic describes LogIndex and the LogIndex-based memory synchronization architecture of read-only nodes. LogIndex can be used to synchronize memory data from the primary node to read-only nodes. LogIndex can also be used to promote a read-only node as the primary node online. If the primary node becomes unavailable, the speed at which a read-only node is promoted to the primary node can be increased. This achieves the high availability of compute nodes. In addition, services can be restored in a short period of time.

    + + + diff --git a/theory/polar-sequence-tech.html b/theory/polar-sequence-tech.html new file mode 100644 index 00000000000..c75a8286750 --- /dev/null +++ b/theory/polar-sequence-tech.html @@ -0,0 +1,372 @@ + + + + + + + + + Sequence | PolarDB for PostgreSQL + + + + +

    Sequence

    羁鸟

    2022/08/22

    30 min

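    Introduction

    As a special table-level object in a database, a Sequence generates a series of regular integers according to the attributes that the user sets, and thus acts as a number generator.

    In practice, a never-repeating Sequence can serve as the primary key of a table, and several tables can share one Sequence to count the total number of rows inserted across them. According to the ANSI standard, a Sequence object in a database must have the following characteristics:

    1. It is an independent database object (CREATE SEQUENCE), at the same level as tables and views.
    2. Its generation attributes can be set: start value, increment, maximum and minimum values (max/min), cycling (cycle), caching (cache), and so on.
    3. A Sequence object increases or decreases from its current value, and the current value is initialized to the start value.
    4. With cycling enabled, the current value changes periodically. Without cycling, it changes monotonically and can no longer change once it reaches its bound.

    To illustrate these characteristics, we define two sequences, a and b, and look at how each behaves.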

    CREATE SEQUENCE a start with 5 minvalue -1 maxvalue 5 increment -2;
    +CREATE SEQUENCE b start with 2 minvalue 1 maxvalue 4 cycle;
    +

    The values that the two Sequence objects return as the number of requests grows are shown below:

    Monotonic sequence and cyclic sequence

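    Database   | Sequence support
    -----------+-----------------------------
    PostgreSQL | Supported
    Oracle     | Supported
    SQL Server | Supported
    MySQL      | Auto-increment columns only
    MariaDB    | Supported
    DB2        | Supported
    Sybase     | Auto-increment columns only
    Hive       | Not supported

    To understand Sequence objects in PostgreSQL more deeply, we first look at how Sequences are used and then work out the design principles behind them from that usage.

    Usage

    PostgreSQL provides a rich set of Sequence interfaces, as well as ways to combine Sequences with other features, to support the various needs of developers.

    SQL interfaces

    PostgreSQL also provides table-like access to Sequence objects, namely DQL, DML, and DDL. The figure below gives an overview of the SQL interfaces that are exposed.

    SQL interfaces

    The interfaces are introduced one by one below:

    currval

    This interface returns the value that the current Session most recently obtained from the specified Sequence.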

    postgres=# select nextval('seq');
    + nextval
    +---------
    +       2
    +(1 row)
    +
    +postgres=# select currval('seq');
    + currval
    +---------
    +       2
    +(1 row)
    +

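    Note that nextval must have been called at least once in the Session before currval can be used. Otherwise, an error reports that the target Sequence is not yet defined in the current Session.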

    postgres=# select currval('seq');
    +ERROR:  currval of sequence "seq" is not yet defined in this session
    +

    lastval

    This interface returns the value most recently obtained from any Sequence in the current Session.

    postgres=# select nextval('seq');
    + nextval
    +---------
    +       3
    +(1 row)
    +
    +postgres=# select lastval();
    + lastval
    +---------
    +       3
    +(1 row)
    +

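    Likewise, so that the Session knows which Sequence object was used last, nextval('seq') must be called once first. The Session then records the most recently used Sequence object in a global variable.

    lastval and currval differ only in their parameters: currval requires you to name a Sequence object that has already been accessed, whereas lastval takes no argument and always refers to the most recently used Sequence object.

    nextval

    This interface fetches the next value of a Sequence object.

    Calling nextval makes the database return a value that is one increment beyond the current value of the Sequence object, and store that new value as the current value of the Sequence object.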

    postgres=# CREATE SEQUENCE seq start with 1 increment 2;
    +CREATE SEQUENCE
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       1
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       3
    +(1 row)
    +

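    increment is called the step of the Sequence object: every nextval request advances the sequence by one step. Note also that the first value obtained after a Sequence object is created is the one defined by start value. When MINVALUE and MAXVALUE are also left at their defaults, PostgreSQL uses the following rule for the default start value:

    $$start\_value = \begin{cases} 1, & increment > 0 \\ -1, & increment < 0 \end{cases}$$

    In addition, nextval is a special kind of DML that is not protected by transactions: a value that has been handed out is never rolled back.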

    postgres=# BEGIN;
    +BEGIN
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       1
    +(1 row)
    +
    +postgres=# ROLLBACK;
    +ROLLBACK
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       2
    +(1 row)
    +

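    To give Sequence objects good concurrency, PostgreSQL does not update them through multi-versioning. It updates them in place instead. Updating without transactional protection has become the common practice of nearly all RDBMSs that support Sequence objects, and it is what makes a Sequence a special kind of table-level object.

    setval

    This interface sets the current value of a Sequence object.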

    postgres=# select nextval('seq');
    + nextval
    +---------
    +       4
    +(1 row)
    +
    +postgres=# select setval('seq', 1);
    + setval
    +--------
    +      1
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       2
    +(1 row)
    +

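    This method moves the value of the Sequence object to the given position and, at the same time, marks that first value as already handed out. If you do not want the value to be consumed, pass false as an additional argument.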

    postgres=# select nextval('seq');
    + nextval
    +---------
    +       4
    +(1 row)
    +
    +postgres=# select setval('seq', 1, false);
    + setval
    +--------
    +      1
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       1
    +(1 row)
    +

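    SQL interfaces

    When setval sets the value of a Sequence object, it also sets the is_called attribute of the object. nextval then uses is_called to decide whether to return the value that was set: if is_called is false, nextval sets is_called to true and returns the stored value instead of applying increment.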

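    CREATE/ALTER SEQUENCE

    CREATE and ALTER SEQUENCE create and modify Sequence objects. The Sequence attributes are also set through the CREATE and ALTER SEQUENCE interfaces. Some attributes were introduced briefly above; each is described in detail below.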

    CREATE [ TEMPORARY | TEMP ] SEQUENCE [ IF NOT EXISTS ] name
    +    [ AS data_type ]
    +    [ INCREMENT [ BY ] increment ]
    +    [ MINVALUE minvalue | NO MINVALUE ] [ MAXVALUE maxvalue | NO MAXVALUE ]
    +    [ START [ WITH ] start ] [ CACHE cache ] [ [ NO ] CYCLE ]
    +    [ OWNED BY { table_name.column_name | NONE } ]
    +ALTER SEQUENCE [ IF EXISTS ] name
    +    [ AS data_type ]
    +    [ INCREMENT [ BY ] increment ]
    +    [ MINVALUE minvalue | NO MINVALUE ] [ MAXVALUE maxvalue | NO MAXVALUE ]
    +    [ START [ WITH ] start ]
    +    [ RESTART [ [ WITH ] restart ] ]
    +    [ CACHE cache ] [ [ NO ] CYCLE ]
    +    [ OWNED BY { table_name.column_name | NONE } ]
    +
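    • AS: Sets the data type of the Sequence, which can only be smallint, int, or bigint, and defaults to bigint. The data type also bounds the range in which minvalue and maxvalue can be set (note that it only limits them rather than setting them; the configured range must not exceed the range of the data type).
    • INCREMENT: The step, that is, the amount by which each nextval request advances the sequence. The default is 1.
    • MINVALUE / NO MINVALUE: Sets or leaves unset the minimum value of the Sequence object. If it is not set, the default is 1 for an ascending Sequence and the minimum of the data type for a descending one. For bigint, that minimum is PG_INT64_MIN (-9223372036854775808).
    • MAXVALUE / NO MAXVALUE: Sets or leaves unset the maximum value of the Sequence object. If it is not set, the default is the maximum of the data type for an ascending Sequence and -1 for a descending one.
    • START: The initial value of the Sequence object, which must lie between MINVALUE and MAXVALUE.
    • RESTART: Used with ALTER to reset the current value of the Sequence object. It defaults to the start value.
    • CACHE: Sets the cache size used by the Sequence object. If it is not set, the default is 1.
    • OWNED BY: Associates the Sequence object with a column of a table. When the column is dropped, the Sequence object is dropped as well.

    Sequence rollback in a special scenario

    The following shows a scenario in which a sequence value is rolled back: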

    postgres=# CREATE SEQUENCE seq;
    +CREATE SEQUENCE
    +postgres=# BEGIN;
    +BEGIN
    +postgres=# ALTER SEQUENCE seq maxvalue 10;
    +ALTER SEQUENCE
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       1
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       2
    +(1 row)
    +
    +postgres=# ROLLBACK;
    +ROLLBACK
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       1
    +(1 row)
    +

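    Unlike what was described earlier, here the Sequence object is protected by the transaction and the sequence value is rolled back. What the transaction actually protects is ALTER SEQUENCE (DDL), not nextval (DML): the rollback restores the Sequence object to its state before ALTER SEQUENCE, which is why the sequence value appears to roll back.

    DROP/TRUNCATE

    • DROP SEQUENCE, as the name suggests, removes a Sequence object from the database.
    • TRUNCATE, more precisely, restarts a Sequence by way of TRUNCATE TABLE.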
    postgres=# CREATE TABLE tbl_iden (i INTEGER, j int GENERATED ALWAYS AS IDENTITY);
    +CREATE TABLE
    +postgres=# insert into tbl_iden values (100);
    +INSERT 0 1
    +postgres=# insert into tbl_iden values (1000);
    +INSERT 0 1
    +postgres=# select * from tbl_iden;
    +  i   | j
    +------+---
    +  100 | 1
    + 1000 | 2
    +(2 rows)
    +
    +postgres=# TRUNCATE TABLE tbl_iden RESTART IDENTITY;
    +TRUNCATE TABLE
    +postgres=# insert into tbl_iden values (1234);
    +INSERT 0 1
    +postgres=# select * from tbl_iden;
    +  i   | j
    +------+---
    + 1234 | 1
    +(1 row)
    +

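    This is equivalent to running ALTER SEQUENCE RESTART while truncating the table.

    Using Sequences in combination

    Besides being used as a standalone object, a SEQUENCE can be combined with other PostgreSQL components. We summarize several common scenarios below.

    Combined calls

    Explicit call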

    CREATE SEQUENCE seq;
    +CREATE TABLE tbl (i INTEGER PRIMARY KEY);
    +INSERT INTO tbl (i) VALUES (nextval('seq'));
    +SELECT * FROM tbl ORDER BY 1 DESC;
    + i
    +---
    + 1
    +(1 row)
    +

    Trigger call

    CREATE SEQUENCE seq;
    +CREATE TABLE tbl (i INTEGER PRIMARY KEY, j INTEGER);
    +CREATE FUNCTION f()
    +RETURNS TRIGGER AS
    +$$
    +BEGIN
    +NEW.i := nextval('seq');
    +RETURN NEW;
    +END;
    +$$
    +LANGUAGE 'plpgsql';
    +
    +CREATE TRIGGER tg
    +BEFORE INSERT ON tbl
    +FOR EACH ROW
    +EXECUTE PROCEDURE f();
    +
    +INSERT INTO tbl (j) VALUES (4);
    +
    +SELECT * FROM tbl;
    + i | j
    +---+---
    + 1 | 4
    +(1 row)
    +

    DEFAULT call

    Explicit DEFAULT call:

    CREATE SEQUENCE seq;
    +CREATE TABLE tbl(i INTEGER DEFAULT nextval('seq') PRIMARY KEY, j INTEGER);
    +
    +INSERT INTO tbl (i,j) VALUES (DEFAULT,11);
    +INSERT INTO tbl(j) VALUES (321);
    +INSERT INTO tbl (i,j) VALUES (nextval('seq'),1);
    +
    +SELECT * FROM tbl;
    + i |  j
    +---+-----
    + 2 | 321
    + 1 |  11
    + 3 |   1
    +(3 rows)
    +

    SERIAL call:

    CREATE TABLE tbl (i SERIAL PRIMARY KEY, j INTEGER);
    +INSERT INTO tbl (i,j) VALUES (DEFAULT,42);
    +
    +INSERT INTO tbl (j) VALUES (25);
    +
    +SELECT * FROM tbl;
    + i | j
    +---+----
    + 1 | 42
    + 2 | 25
    +(2 rows)
    +

    Note that SERIAL is not a real type but another form of the DEFAULT call. The only difference is that SERIAL automatically creates the Sequence that the DEFAULT constraint uses.

    AUTO_INC call

    CREATE TABLE tbl (i int GENERATED ALWAYS AS IDENTITY,
    +                  j INTEGER);
    +INSERT INTO tbl(i,j) VALUES (DEFAULT,32);
    +
    +INSERT INTO tbl(j) VALUES (23);
    +
    +SELECT * FROM tbl;
    + i | j
    +---+----
    + 1 | 32
    + 2 | 23
    +(2 rows)
    +

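    The AUTO_INC call attaches an auto-increment (identity) constraint to the column. Unlike a default constraint, the auto-increment constraint finds the Sequence associated with the column by looking up the dependency, whereas the default call merely sets the default value to a nextval expression.

    How it works

    How a Sequence is described in the system catalog and the data table

    PostgreSQL has a system catalog dedicated to Sequence information, pg_sequence. Its structure is as follows: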

    postgres=# \d pg_sequence
    +             Table "pg_catalog.pg_sequence"
    +    Column    |  Type   | Collation | Nullable | Default
    +--------------+---------+-----------+----------+---------
    + seqrelid     | oid     |           | not null |
    + seqtypid     | oid     |           | not null |
    + seqstart     | bigint  |           | not null |
    + seqincrement | bigint  |           | not null |
    + seqmax       | bigint  |           | not null |
    + seqmin       | bigint  |           | not null |
    + seqcache     | bigint  |           | not null |
    + seqcycle     | boolean |           | not null |
    +Indexes:
    +    "pg_sequence_seqrelid_index" PRIMARY KEY, btree (seqrelid)
    +

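    As you can see, pg_sequence records all the attributes of a Sequence. These attributes are set by CREATE/ALTER SEQUENCE, and nextval and setval frequently open this catalog and follow its rules.

    The sequence data itself is stored in a heap table with three columns, structured as follows: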

    typedef struct FormData_pg_sequence_data
    +{
    +    int64		last_value;
    +    int64		log_cnt;
    +    bool		is_called;
    +} FormData_pg_sequence_data;
    +
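    • last_value records the current value of the Sequence. We call it the page value (to distinguish it from the cached value discussed later).
    • log_cnt records the number of extra sequence values that nextval has logged to WAL in advance. This is covered in detail in the analysis of the sequence allocation mechanism.
    • is_called marks whether last_value has already been handed out. For example, setval can set the is_called field: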
    -- setval false
    +postgres=# select setval('seq', 10, false);
    + setval
    +--------
    +     10
    +(1 row)
    +
    +postgres=# select * from seq;
    + last_value | log_cnt | is_called
    +------------+---------+-----------
    +         10 |       0 | f
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +      10
    +(1 row)
    +
    +-- setval true
    +postgres=# select setval('seq', 10, true);
    + setval
    +--------
    +     10
    +(1 row)
    +
    +postgres=# select * from seq;
    + last_value | log_cnt | is_called
    +------------+---------+-----------
    +         10 |       0 | t
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +      11
    +(1 row)
    +

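    Whenever a user creates a Sequence object, PostgreSQL creates a heap table with the structure above to record the data of the object. When nextval or setval changes the sequence value, PostgreSQL updates the three fields of the single row in this heap table in place.

    Taking setval as an example, the following logic illustrates the in-place update process.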

    static void
    +do_setval(Oid relid, int64 next, bool iscalled)
    +{
    +
    +    /* Open the Sequence heap table and lock it */
    +    init_sequence(relid, &elm, &seqrel);
    +
    +    ...
    +
    +    /* Lock the buffer and fetch the tuple */
    +    seq = read_seq_tuple(seqrel, &buf, &seqdatatuple);
    +
    +    ...
    +
    +    /* Update the tuple in place */
    +    seq->last_value = next;		/* last fetched number */
    +    seq->is_called = iscalled;
    +    seq->log_cnt = 0;
    +
    +    ...
    +
    +    /* Release the buffer lock and the table lock */
    +    UnlockReleaseBuffer(buf);
    +    relation_close(seqrel, NoLock);
    +}
    +

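    As shown, do_setval directly sets the tuple in the Sequence heap table, rather than updating it by delete + insert as for an ordinary heap table. nextval follows a similar process, except that the value of last_value is computed rather than supplied by the user.

    Analysis of the sequence allocation mechanism

    Having covered how a Sequence object exists in the kernel, we now explain how a sequence value is issued, that is, the nextval method. It is implemented in the nextval_internal function in sequence.c, whose core job is to compute last_value and log_cnt.

    The relationship between last_value and log_cnt is shown in the figure below:

    Relationship between the page value and WAL

    Here, log_cnt is a reserved number of requests. Its default value is 32, determined by the following macro definition: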

    /*
    + * We don't want to log each fetching of a value from a sequence,
    + * so we pre-log a few fetches in advance. In the event of
    + * crash we can lose (skip over) as many values as we pre-logged.
    + */
    +#define SEQ_LOG_VALS	32
    +

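    Each time last_value is increased by one increment, log_cnt is decremented by 1.

    Page value increment

    When log_cnt reaches 0, or after a checkpoint occurs, a WAL record is written. The page value in the WAL record is set according to the following formula, and log_cnt is reset to SEQ_LOG_VALS.

    $$wal\_value = last\_value + increment \times SEQ\_LOG\_VALS$$

    This way, PostgreSQL does not need to write a WAL record every time nextval modifies last_value in the page. In other words, if every nextval call had to modify the page value, this optimization reduces the frequency of WAL writes by a factor of 32. The cost is that if a crash occurs before a checkpoint has been taken in time, a stretch of the sequence is lost, as shown below: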

    postgres=# create sequence seq;
    +CREATE SEQUENCE
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       1
    +(1 row)
    +
    +postgres=# select * from seq;
    + last_value | log_cnt | is_called
    +------------+---------+-----------
    +          1 |      32 | t
    +(1 row)
    +
    +-- crash and restart
    +
    +postgres=# select * from seq;
    + last_value | log_cnt | is_called
    +------------+---------+-----------
    +         33 |       0 | t
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +      34
    +(1 row)
    +

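    Clearly, after the crash the Sequence object has a gap from 2 to 33. This cost is acceptable because the Sequence has not violated uniqueness, and in certain scenarios the frequency of WAL writes is greatly reduced.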

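    Sequence cache mechanism

    From the description above, it is easy to see that every sequence request must take a buffer lock to modify the page, which means that the concurrency of a Sequence is poor.

    To address this, PostgreSQL uses a Session Cache for Sequences that caches a range of values in advance to improve concurrency, as shown in the figure below:

    Session Cache

    The Sequence Session Cache is implemented as a hash table that is created with an initial size of 16 entries. It uses the OID of the Sequence as the key to look up the cached Sequence values. The cached value has the following structure: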

    typedef struct SeqTableData
    +{
    +    Oid			relid;			/* Sequence OID(hash key) */
    +    int64		last;			/* value last returned by nextval */
    +    int64		cached;			/* last value already cached for nextval */
    +    int64		increment;		/* copy of sequence's increment field */
    +} SeqTableData;
    +

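    Here, last is the current value of the Sequence in the Session (current_value), cached is the cached value of the Sequence in the Session (cached_value), and increment records the step. These three values are enough to meet the basic needs of Sequence caching.

    The relationship between the Sequence Session Cache and the page value is shown in the figure below:

    Relationship between the cache and the page

    Similar to log_cnt, cache_cnt is the Cache size that the user sets when defining the Sequence; its minimum is 1. Only when the values in the cache domain are used up does the backend lock the buffer and modify the Sequence page value in the page. The adjustment process is as follows:

    Cache allocation

    For example, if CACHE is set to 20, then when the cache is used up, the backend tries to lock the buffer to adjust the page value and requests another 20 increments into the cache. For the figure above, the following relationships hold:

    $$cached\_value = NEW\ current\_value$$ $$NEW\ current\_value+20\times INC=NEW\ cached\_value$$ $$NEW\ last\_value = NEW\ cached\_value$$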

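    With the Sequence Session Cache, the concurrency of nextval improves greatly. The following compares the results of stress tests run with pgbench.

    Performance comparison

    Summary

    A Sequence is a special kind of table-level object in PostgreSQL. It offers simple yet rich SQL interfaces that let users conveniently create and use customized sequence objects. Beyond that, a Sequence can be combined with many kernel features, which greatly broadens where it can be used.

    This article described the design of Sequence objects in the PostgreSQL kernel in detail. Starting from the metadata and data descriptions of the object, it introduced what a Sequence object consists of. It then covered nextval, the core SQL interface of a Sequence, from three angles: computing the sequence value, updating in place, and reducing WAL writes. Finally, it explained the principles of the Sequence Session Cache, described how sequence values are computed and aligned between the cache and the page once the cache is introduced, and compared nextval in single-sequence and multi-sequence concurrent scenarios before and after the cache was introduced.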

    + + + diff --git a/zh/contributing/coding-style.html b/zh/contributing/coding-style.html new file mode 100644 index 00000000000..89df9a39d6a --- /dev/null +++ b/zh/contributing/coding-style.html @@ -0,0 +1,33 @@ + + + + + + + + + 编码风格 | PolarDB for PostgreSQL + + + + +

    Coding Style

    Warning

    Translation needed

    Languages

    • PostgreSQL kernel, extension, and kernel related tools use C, in order to remain compatible with community versions and to easily upgrade.
    • Management related tools can use shell, Go, or Python, for efficient development.

    Style

    • Coding in C follows PostgreSQL's programming style, such as naming, error message format, control statements, line length, comment format, function length, and global variables. For details, refer to the PostgreSQL style. Here are some highlights:

      • Code in PostgreSQL should only rely on language features available in the C99 standard
      • Do not use // for comments
      • Both macros with arguments and static inline functions may be used. The latter are preferable if there are multiple-evaluation hazards when written as a macro.
      • Follow BSD C programming conventions
    • Programs in Shell, Go, or Python can follow Google code conventions

      • https://google.github.io/styleguide/pyguide.html
      • https://github.com/golang/go/wiki/CodeReviewComments
      • https://google.github.io/styleguide/shellguide.html
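The guidance on macros with arguments versus static inline functions can be illustrated with a small C example. The names below are made up for the illustration. The sketch shows the multiple-evaluation hazard: a macro argument with a side effect may be evaluated twice, while a function argument is evaluated once.

```c
#include <assert.h>

/* A macro with arguments: each argument may be evaluated more than once. */
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

/* The static inline equivalent: each argument is evaluated exactly once. */
static inline int
max_inline(int a, int b)
{
    return a > b ? a : b;
}

static int call_count = 0;

/* An argument expression with a side effect. */
static int
next_value(void)
{
    return ++call_count;
}

/* Uses the macro; next_value() runs in the comparison and again in the result. */
int
macro_result(void)
{
    call_count = 0;
    return MAX_MACRO(next_value(), 0);
}

/* Uses the inline function; next_value() runs once. */
int
inline_result(void)
{
    call_count = 0;
    return max_inline(next_value(), 0);
}

/* Number of next_value() calls made by the most recent helper. */
int
calls_made(void)
{
    return call_count;
}
```

Here macro_result returns 2 after two calls to next_value, while inline_result returns 1 after a single call. Note that the comments use the block form, in line with the rule above against // comments.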

    Code design and review

    We share the same thoughts and rules as Google Open Source Code Review.

    Before submitting code for review, run the unit tests and make sure that all tests under src/test, such as regress and isolation, pass. Unit tests or function tests should be submitted together with the code change.

    In addition to code review, this doc offers instructions for the whole cycle of high-quality development, from design, implementation, testing, and documentation to preparing for code review. Many good questions are asked for critical steps during development, such as about design, function, complexity, tests, naming, documentation, and code review. The doc summarizes the rules for code review as follows.

    In doing a code review, you should make sure that:

    • The code is well-designed.
    • The functionality is good for the users of the code.
    • Any UI changes are sensible and look good.
    • Any parallel programming is done safely.
    • The code isn't more complex than it needs to be.
    • The developer isn't implementing things they might need in the future but don't know they need now.
    • Code has appropriate unit tests.
    • Tests are well-designed.
    • The developer used clear names for everything.
    • Comments are clear and useful, and mostly explain why instead of what.
    • Code is appropriately documented.
    • The code conforms to our style guides.
    + + + diff --git a/zh/contributing/contributing-polardb-docs.html b/zh/contributing/contributing-polardb-docs.html new file mode 100644 index 00000000000..c0d984dd7f9 --- /dev/null +++ b/zh/contributing/contributing-polardb-docs.html @@ -0,0 +1,60 @@ + + + + + + + + + 贡献文档 | PolarDB for PostgreSQL + + + + +

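    Contributing Documentation

    The PolarDB for PostgreSQL documentation is managed with VuePress 2 and written primarily in Markdown.

    Browsing the documentation

    The documentation is hosted online on GitHub Pages.

    Local documentation development

    If you find content or formatting errors in the documentation, or want to contribute new documents, you need to install and configure a documentation development environment locally. The documentation of this project is a Node.js project that uses Yarn as its package manager. Node.js® is a JavaScript runtime built on the Chrome V8 engine.

    Preparing the Node environment

    You need a local Node.js environment. You can download an installer from the Downloads page of the Node.js website and install it manually, or install it automatically with the commands below.

    Install nvm, the Node version manager, with curl: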

    curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
    +command -v nvm
    +

    If the previous step reports command not found, close the current terminal and open a new one.

    Once nvm is installed, run the following command to install the LTS version of Node:

    nvm install --lts
    +

    After Node.js is installed, verify the installation with the following commands:

    node -v
    +npm -v
    +

    Use npm to install the yarn package manager globally:

    npm install -g yarn
    +yarn -v
    +
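
Before installing the documentation dependencies, it can help to confirm that the whole toolchain is on PATH. The following is a minimal sketch; the function name missing_tools is hypothetical and not part of the project.

```shell
# Print the name of every requested tool that cannot be found on PATH.
# (missing_tools is an illustrative helper, not a project script.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools node npm yarn)
if [ -n "$missing" ]; then
    # Word splitting is intended here so the names print on one line.
    echo "missing tools:" $missing
else
    echo "documentation toolchain is ready"
fi
```

If any name is reported as missing, repeat the corresponding installation step above before continuing.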

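    Installing Documentation Dependencies

    Run the following command in the root directory of the PolarDB for PostgreSQL project; yarn will install all dependencies listed in package.json: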

    yarn
    +

    Running the Documentation Development Server

    Run the following command in the root directory of the PolarDB for PostgreSQL project:

    yarn docs:dev
    +

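    The documentation development server runs at http://localhost:8080/PolarDB-for-PostgreSQL/ and can be opened in a browser. Changes to Markdown files are reflected on the page in real time.

    Documentation Directory Layout

    The documentation resources of PolarDB for PostgreSQL are located in the docs/ directory under the project root. The directory is organized as follows: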

    └── docs
    +    ├── .vuepress
    +    │   ├── configs
    +    │   ├── public
    +    │   └── styles
    +    ├── README.md
    +    ├── architecture
    +    ├── contributing
    +    ├── guide
    +    ├── imgs
    +    ├── roadmap
    +    └── zh
    +        ├── README.md
    +        ├── architecture
    +        ├── contributing
    +        ├── guide
    +        ├── imgs
    +        └── roadmap
    +

    As shown, docs/zh/ mirrors its parent directory except for .vuepress/. All documents under docs/ are in English, and docs/zh/ holds the corresponding Simplified Chinese documents.
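
Because docs/zh/ is expected to mirror docs/, a small check can list top-level entries that have no Chinese counterpart yet. This is only a sketch under that assumption; the helper name missing_translations is hypothetical.

```shell
# List top-level entries of a docs directory that have no counterpart
# under its zh/ subdirectory. The shell glob skips dot-directories, so
# .vuepress/ is ignored automatically.
# (missing_translations is an illustrative helper, not a project script.)
missing_translations() {
    docs_dir=$1
    for entry in "$docs_dir"/*; do
        [ -e "$entry" ] || continue          # directory is empty or absent
        name=$(basename "$entry")
        [ "$name" = "zh" ] && continue       # do not compare zh/ with itself
        [ -e "$docs_dir/zh/$name" ] || echo "$name"
    done
}

missing_translations docs
```

An empty result means every top-level entry under docs/ also exists under docs/zh/.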

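    .vuepress/ contains the global configuration of the documentation project:

    • config.js: documentation configuration
    • configs/: configuration modules (navbar / sidebar, English / Chinese, and so on)
    • public/: public static assets
    • styles/: overrides of the default theme styles

    For how to configure the documentation, see the configuration guide in the official VuePress 2 documentation.

    Documentation Conventions

    1. After writing a new document, add a route for it in the documentation configuration so that it appears in the navbar and sidebar (see existing documents for reference).
    2. When fixing a document in one language, also fix the same document in the other languages.
    3. After modifying documents, format the Markdown files with Prettier.

    Online Deployment

    The documentation uses GitHub Actions for CI. Pushing to the main branch triggers a build of the documentation resources under docs/ and pushes the build output to the gh-pages branch. GitHub Pages then automatically deploys the static assets on that branch to a web server to form the documentation site.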

    + + + diff --git a/zh/contributing/contributing-polardb-kernel.html b/zh/contributing/contributing-polardb-kernel.html new file mode 100644 index 00000000000..aa9b58a6adf --- /dev/null +++ b/zh/contributing/contributing-polardb-kernel.html @@ -0,0 +1,45 @@ + + + + + + + + + 贡献代码 | PolarDB for PostgreSQL + + + + +

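    Contributing Code

    PolarDB for PostgreSQL is developed on top of PostgreSQL and other open source projects. Our main goal is to build a larger community for PostgreSQL. We welcome contributors from the community to submit their code or ideas. In the longer term, we hope this project can be managed jointly by developers from inside and outside Alibaba.

    Branches and Their Management

    • POLARDB_11_STABLE is the stable branch of PolarDB. It only accepts merges from POLARDB_11_DEV.
    • POLARDB_11_DEV is the stable development branch of PolarDB. It accepts PR merges from the open source community and direct pushes from internal developers.

    New code is merged into POLARDB_11_DEV, and internal developers periodically merge it into POLARDB_11_STABLE.

    Before Contributing

    Contribution Workflow

    • On the ApsaraDB/PolarDB-for-PostgreSQL repository, click fork to create a repository of your own.
    • See Advanced Deployment to learn how to build PolarDB from source for development.
    • Push code to your fork and make sure it conforms to our coding style guidelines.
    • Open a pull request against the official PolarDB repository. If the commit message alone does not describe your contribution well, give a more detailed description in the PR.
    • Wait for maintainers to review your code, then discuss and resolve all review comments.
    • Wait for maintainers to merge your code.

    A Worked Example of Submitting Code

    Fork Your Own Repository

    On the PolarDB for PostgreSQL repository page, click the fork button in the upper right corner to create your own PolarDB repository.

    Clone Your Repository Locally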

    git clone https://github.com/<your-github>/PolarDB-for-PostgreSQL.git
    +

    Create a Local Development Branch

    Check out a new development branch from the stable development branch POLARDB_11_DEV. Assume the branch is named dev:

    git checkout POLARDB_11_DEV
    +git checkout -b dev
    +

    Modify Code in the Local Repository and Commit

    git status
    +git add <files-to-change>
    +git commit -m "modification for dev"
    +

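    Rebase and Push to the Remote Repository

    First click Fetch upstream on your own repository page to make sure your stable development branch matches the stable development branch of the official PolarDB repository. Then pull the latest changes on the stable development branch to your local repository: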

    git checkout POLARDB_11_DEV
    +git pull
    +

    Next, rebase your development branch onto the current stable development branch and resolve any conflicts:

    git checkout dev
    +git rebase POLARDB_11_DEV
    +-- resolve conflicts --
    +git push -f origin dev
    +

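    Create a Pull Request

    Click the New pull request or Compare & pull request button, choose to compare the ApsaraDB/PolarDB-for-PostgreSQL:POLARDB_11_DEV branch with the <your-github>/PolarDB-for-PostgreSQL:dev branch, and write the PR description.

    GitHub runs automated regression tests on your PR, and your PR must pass 100% of them.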
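
The branch comparison described above can also be opened directly by URL. The sketch below assumes GitHub's cross-fork compare form (BASE...OWNER:BRANCH); the helper name pr_compare_url and the account name your-github are placeholders.

```shell
# Build the GitHub compare URL for a PR from a fork branch into
# POLARDB_11_DEV. The URL layout is GitHub's cross-fork compare form,
# which is an assumption about GitHub and not part of the PolarDB docs.
# (pr_compare_url is an illustrative helper.)
pr_compare_url() {
    fork_owner=$1
    branch=${2:-dev}
    echo "https://github.com/ApsaraDB/PolarDB-for-PostgreSQL/compare/POLARDB_11_DEV...${fork_owner}:${branch}"
}

pr_compare_url your-github dev
```

Opening the printed URL in a browser should lead to the same comparison page as the buttons mentioned above.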

    Resolve Issues Raised in Code Review

    You can discuss issues in the code with maintainers and resolve the review comments they raise.

    Merging

    Once your code passes the tests and review, PolarDB maintainers will merge your PR into the stable development branch.

    + + + diff --git a/zh/deploying/db-localfs.html b/zh/deploying/db-localfs.html new file mode 100644 index 00000000000..a231d9d55e2 --- /dev/null +++ b/zh/deploying/db-localfs.html @@ -0,0 +1,49 @@ + + + + + + + + + 基于单机文件系统部署 | PolarDB for PostgreSQL + + + + +

    Deploying on a Local File System

    棠羽

    2023/08/01

    15 min

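    This document walks you through building and deploying PolarDB-PG on a single-node file system such as ext4. It applies when all compute nodes can access the same local disk storage.

    Pull the Image

    We provide a local instance image of PolarDB-PG on DockerHub, which already contains the entrypoint script for starting a PolarDB-PG instance on local storage. The image currently supports the linux/amd64 and linux/arm64 CPU architectures.

    docker pull polardb/polardb_pg_local_instance
    +

    Initialize the Database

    Create an empty directory ${your_data_dir} to serve as the data directory of the PolarDB-PG instance. When starting the container, mount this directory into the container as a VOLUME to initialize the data directory. During initialization, you can pass environment variables to override the defaults:

    • POLARDB_PORT: the port that PolarDB-PG runs on, 5432 by default; the image uses three consecutive ports (5432-5434 by default)
    • POLARDB_USER: the default superuser created when the database is initialized (postgres by default)
    • POLARDB_PASSWORD: the password of the default superuser

    Initialize the database with the following command: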

    docker run -it --rm \
    +    --env POLARDB_PORT=5432 \
    +    --env POLARDB_USER=u1 \
    +    --env POLARDB_PASSWORD=your_password \
    +    -v ${your_data_dir}:/var/polardb \
    +    polardb/polardb_pg_local_instance \
    +    echo 'done'
    +

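    Start the PolarDB-PG Service

    After the database is initialized, create the container in detached mode with -d to start the PolarDB-PG service. The PolarDB-PG ports usually need to be reachable from outside, so use -p to publish the container's port range. For example, if the database was initialized with ports 5432-5434, the following command maps these three ports to ports 54320-54322 outside the container: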

    docker run -d \
    +    -p 54320-54322:5432-5434 \
    +    -v ${your_data_dir}:/var/polardb \
    +    polardb/polardb_pg_local_instance
    +

    Alternatively, let the container share the host's network directly:

    docker run -d \
    +    --network=host \
    +    -v ${your_data_dir}:/var/polardb \
    +    polardb/polardb_pg_local_instance
    +
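
Since the image uses three consecutive ports starting at POLARDB_PORT, the argument for -p can be derived from two base ports. The following is a small sketch of that arithmetic; the helper name polardb_port_mapping is hypothetical.

```shell
# Derive the "-p" port-range mapping for three consecutive ports.
# $1: first host port; $2: first container port (POLARDB_PORT, default 5432).
# (polardb_port_mapping is an illustrative helper, not part of the image.)
polardb_port_mapping() {
    host_base=$1
    container_base=${2:-5432}
    echo "${host_base}-$((host_base + 2)):${container_base}-$((container_base + 2))"
}

polardb_port_mapping 54320 5432
```

The printed value can be passed to docker run as the argument of -p when a POLARDB_PORT other than the default was used during initialization.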
    + + + diff --git a/zh/deploying/db-pfs-curve.html b/zh/deploying/db-pfs-curve.html new file mode 100644 index 00000000000..79c154a76e1 --- /dev/null +++ b/zh/deploying/db-pfs-curve.html @@ -0,0 +1,127 @@ + + + + + + + + + 基于 PFS for CurveBS 文件系统部署 | PolarDB for PostgreSQL + + + + +

    Deploying on PFS for CurveBS

    程义

    2022/11/02

    15 min

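    This document walks you through building and deploying PolarDB on the distributed file system PolarDB File System (PFS). It applies to compute nodes on which PFS has already been formatted and mounted on Curve block storage.

    We provide a PolarDB development image on DockerHub that already contains all dependencies needed to build and run PolarDB for PostgreSQL. You can use this development image directly to set up an instance. The image currently supports the AMD64 and ARM64 CPU architectures.

    Download the Source Code

    In the preceding document, we pulled the PolarDB development image from DockerHub and entered the container. Inside the container, download the source code of PolarDB for PostgreSQL from GitHub; the stable branch is POLARDB_11_STABLE. If GitHub is not reliably reachable because of network conditions, you can use the Gitee mirror in China instead.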

    git clone -b POLARDB_11_STABLE https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git
    +
    git clone -b POLARDB_11_STABLE https://gitee.com/mirrors/PolarDB-for-PostgreSQL
    +

    After cloning, enter the source directory:

    cd PolarDB-for-PostgreSQL/
    +

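    Build and Deploy PolarDB

    Deploying the Read-Write Node

    On the read-write node, build the PolarDB kernel with the --with-pfsd option. See the description of build and test options for more build options.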

    ./polardb_build.sh --with-pfsd
    +

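    Note

    After the build completes, the script above automatically deploys an instance based on the local file system, running on port 5432.

    Stop this instance manually with the following command so that an instance can be redeployed on PFS and shared storage: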

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl \
    +    -D $HOME/tmp_master_dir_polardb_pg_1100_bld/ \
    +    stop
    +

    Initialize the data directory $HOME/primary/ locally on the node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/primary
    +

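    Initialize the shared data directory at /pool@@volume_my_/shared_data on the shared storage:

    # Create the shared data directory with pfs
    +sudo pfs -C curve mkdir /pool@@volume_my_/shared_data
    +# Initialize the local and shared data directories of the database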
    +sudo $HOME/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh \
    +    $HOME/primary/ /pool@@volume_my_/shared_data/ curve
    +

    Edit the configuration of the read-write node. Open $HOME/primary/postgresql.conf and add the following settings:

    port=5432
    +polar_hostid=1
    +polar_enable_shared_storage_mode=on
    +polar_disk_name='pool@@volume_my_'
    +polar_datadir='/pool@@volume_my_/shared_data/'
    +polar_vfs.localfs_mode=off
    +shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
    +polar_storage_cluster_name='curve'
    +logging_collector=on
    +log_line_prefix='%p\t%r\t%u\t%m\t'
    +log_directory='pg_log'
    +listen_addresses='*'
    +max_connections=1000
    +synchronous_standby_names='replica1'
    +

    Open $HOME/primary/pg_hba.conf and add the following entry:

    host	replication	postgres	0.0.0.0/0	trust
    +

    Finally, start the read-write node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/primary
    +

    Check that the read-write node is running properly:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5432 \
    +    -d postgres \
    +    -c 'select version();'
    +# Output:
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +

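    On the read-write node, create a replication slot for the corresponding read-only node, which the read-only node uses for physical streaming replication: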

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5432 \
    +    -d postgres \
    +    -c "select pg_create_physical_replication_slot('replica1');"
    +# Output:
    + pg_create_physical_replication_slot
    +-------------------------------------
    + (replica1,)
    +(1 row)
    +

    Deploying the Read-Only Node

    On the read-only node, build the PolarDB kernel with the --with-pfsd option.

    ./polardb_build.sh --with-pfsd
    +

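    Note

    After the build completes, the script above automatically deploys an instance based on the local file system, running on port 5432.

    Stop this instance manually with the following command so that an instance can be redeployed on PFS and shared storage: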

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl \
    +    -D $HOME/tmp_master_dir_polardb_pg_1100_bld/ \
    +    stop
    +

    Initialize the data directory $HOME/replica1/ locally on the node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/replica1
    +

    Edit the configuration of the read-only node. Open $HOME/replica1/postgresql.conf and add the following settings:

    port=5433
    +polar_hostid=2
    +polar_enable_shared_storage_mode=on
    +polar_disk_name='pool@@volume_my_'
    +polar_datadir='/pool@@volume_my_/shared_data/'
    +polar_vfs.localfs_mode=off
    +shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
    +polar_storage_cluster_name='curve'
    +logging_collector=on
    +log_line_prefix='%p\t%r\t%u\t%m\t'
    +log_directory='pg_log'
    +listen_addresses='*'
    +max_connections=1000
    +

    Create $HOME/replica1/recovery.conf and add the following settings:

    Note

    Replace the placeholder below with the IP address of the read-write node (container).

    polar_replica='on'
    +recovery_target_timeline='latest'
    +primary_slot_name='replica1'
    +primary_conninfo='host=[IP of the read-write node] port=5432 user=postgres dbname=postgres application_name=replica1'
    +
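
The settings above differ between replicas only in the primary address and the slot name, so they can be generated rather than typed. The sketch below is one way to do that; the helper name render_recovery_conf and the address 192.0.2.10 are placeholders, not values from this deployment.

```shell
# Print a recovery.conf for a read-only node.
# $1: IP address of the read-write node; $2: replication slot name, which
# is also used as application_name so that it matches
# synchronous_standby_names on the read-write node.
# (render_recovery_conf is an illustrative helper.)
render_recovery_conf() {
    primary_host=$1
    slot_name=$2
    cat <<EOF
polar_replica='on'
recovery_target_timeline='latest'
primary_slot_name='${slot_name}'
primary_conninfo='host=${primary_host} port=5432 user=postgres dbname=postgres application_name=${slot_name}'
EOF
}

# 192.0.2.10 is a documentation-only example address.
render_recovery_conf 192.0.2.10 replica1
```

Redirecting the output to $HOME/replica1/recovery.conf produces the file described above.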

    Finally, start the read-only node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/replica1
    +

    Check that the read-only node is running properly:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5433 \
    +    -d postgres \
    +    -c 'select version();'
    +# Output:
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +

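    Cluster Check and Test

    After deployment, check and test the instance to make sure the read-write node can write data and the read-only node can read it.

    Log in to the read-write node, create a test table, and insert sample data: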

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
    +    -p 5432 \
    +    -d postgres \
    +    -c "create table t(t1 int primary key, t2 int);insert into t values (1, 1),(2, 3),(3, 3);"
    +

    Log in to the read-only node and query the sample data just inserted:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
    +    -p 5433 \
    +    -d postgres \
    +    -c "select * from t;"
    +# Output:
    + t1 | t2
    +----+----
    +  1 |  1
    +  2 |  3
    +  3 |  3
    +(3 rows)
    +

    Data inserted on the read-write node is visible on the read-only node.

    + + + diff --git a/zh/deploying/db-pfs.html b/zh/deploying/db-pfs.html new file mode 100644 index 00000000000..5eb69481174 --- /dev/null +++ b/zh/deploying/db-pfs.html @@ -0,0 +1,117 @@ + + + + + + + + + 基于 PFS 文件系统部署 | PolarDB for PostgreSQL + + + + +

    Deploying on PFS

    棠羽

    2022/05/09

    15 min

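    This document walks you through building and deploying PolarDB on the distributed file system PolarDB File System (PFS). It applies to compute nodes on which the PFS file system has already been formatted and mounted on shared storage.

    Deploying the Read-Write Node

    Initialize the local data directory ~/primary/ of the read-write node: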

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/primary
    +

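    Create the shared data directory at /nvme1n1/shared_data/ on the shared storage, then initialize it with the polar-initdb.sh script:

    # Create the shared data directory with pfs
    +sudo pfs -C disk mkdir /nvme1n1/shared_data
    +# Initialize the local and shared data directories of the database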
    +sudo $HOME/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh \
    +    $HOME/primary/ /nvme1n1/shared_data/
    +

    Edit the configuration of the read-write node. Open ~/primary/postgresql.conf and add the following settings:

    port=5432
    +polar_hostid=1
    +polar_enable_shared_storage_mode=on
    +polar_disk_name='nvme1n1'
    +polar_datadir='/nvme1n1/shared_data/'
    +polar_vfs.localfs_mode=off
    +shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
    +polar_storage_cluster_name='disk'
    +logging_collector=on
    +log_line_prefix='%p\t%r\t%u\t%m\t'
    +log_directory='pg_log'
    +listen_addresses='*'
    +max_connections=1000
    +synchronous_standby_names='replica1'
    +

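    Edit the client authentication file ~/primary/pg_hba.conf of the read-write node and add the following entry to allow the read-only node to perform physical replication: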

    host	replication	postgres	0.0.0.0/0	trust
    +

    Finally, start the read-write node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/primary
    +

    Check that the read-write node is running properly:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5432 \
    +    -d postgres \
    +    -c 'SELECT version();'
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +

    On the read-write node, create a replication slot for the corresponding read-only node, used for its physical replication:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5432 \
    +    -d postgres \
    +    -c "SELECT pg_create_physical_replication_slot('replica1');"
    + pg_create_physical_replication_slot
    +-------------------------------------
    + (replica1,)
    +(1 row)
    +

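    Deploying the Read-Only Node

    Create an empty directory at ~/replica1 on the local disk of the read-only node, then use the polar-replica-initdb.sh script to initialize the local directory of the read-only node from the data directory on shared storage. The initialized local directory contains no default configuration files, so you also need to create a temporary local directory template with initdb and copy all default configuration files into the local directory of the read-only node: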

    mkdir -m 0700 $HOME/replica1
    +sudo ~/tmp_basedir_polardb_pg_1100_bld/bin/polar-replica-initdb.sh \
    +    /nvme1n1/shared_data/ $HOME/replica1/
    +
    +$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D /tmp/replica1
    +cp /tmp/replica1/*.conf $HOME/replica1/
    +

    Edit the configuration of the read-only node. Open ~/replica1/postgresql.conf and add the following settings:

    port=5433
    +polar_hostid=2
    +polar_enable_shared_storage_mode=on
    +polar_disk_name='nvme1n1'
    +polar_datadir='/nvme1n1/shared_data/'
    +polar_vfs.localfs_mode=off
    +shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
    +polar_storage_cluster_name='disk'
    +logging_collector=on
    +log_line_prefix='%p\t%r\t%u\t%m\t'
    +log_directory='pg_log'
    +listen_addresses='*'
    +max_connections=1000
    +
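
The read-write and read-only configurations share every shared-storage setting and differ in port and polar_hostid (the read-write node additionally sets synchronous_standby_names). A sketch that emits the common part per node follows; the helper name render_polar_conf is hypothetical, and the disk and cluster names are simply the ones used on this page.

```shell
# Print the shared-storage settings for one node.
# $1: port; $2: polar_hostid (unique per node).
# (render_polar_conf is an illustrative helper, not a PolarDB tool.)
render_polar_conf() {
    port=$1
    hostid=$2
    cat <<EOF
port=${port}
polar_hostid=${hostid}
polar_enable_shared_storage_mode=on
polar_disk_name='nvme1n1'
polar_datadir='/nvme1n1/shared_data/'
polar_vfs.localfs_mode=off
shared_preload_libraries='\$libdir/polar_vfs,\$libdir/polar_worker'
polar_storage_cluster_name='disk'
EOF
}

render_polar_conf 5433 2
```

Appending the output to the node's postgresql.conf keeps the two nodes consistent; the logging and connection settings shown above still need to be added separately.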

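    Create the replication configuration file ~/replica1/recovery.conf for the read-only node and add the connection information of the read-write node and the replication slot name: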

    polar_replica='on'
    +recovery_target_timeline='latest'
    +primary_slot_name='replica1'
    +primary_conninfo='host=[IP of the read-write node] port=5432 user=postgres dbname=postgres application_name=replica1'
    +

    Finally, start the read-only node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/replica1
    +

    Check that the read-only node is running properly:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
    +    -p 5433 \
    +    -d postgres \
    +    -c 'SELECT version();'
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +

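    Cluster Check and Test

    After deployment, check and test the instance to make sure the read-write node can write data and the read-only node can read it.

    Log in to the read-write node, create a test table, and insert sample data: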

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
    +    -p 5432 \
    +    -d postgres \
    +    -c "CREATE TABLE t (t1 INT PRIMARY KEY, t2 INT); INSERT INTO t VALUES (1, 1),(2, 3),(3, 3);"
    +

    Log in to the read-only node and query the sample data just inserted:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
    +    -p 5433 \
    +    -d postgres \
    +    -c "SELECT * FROM t;"
    + t1 | t2
    +----+----
    +  1 |  1
    +  2 |  3
    +  3 |  3
    +(3 rows)
    +

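    Data inserted on the read-write node is visible on the read-only node, which means the PolarDB compute cluster based on shared storage has been set up successfully.


    Common Operations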

    + + + diff --git a/zh/deploying/deploy-official.html b/zh/deploying/deploy-official.html new file mode 100644 index 00000000000..7d0c7746148 --- /dev/null +++ b/zh/deploying/deploy-official.html @@ -0,0 +1,33 @@ + + + + + + + + + 阿里云官网购买实例 | PolarDB for PostgreSQL + + + + +

    Purchasing an Instance on Alibaba Cloud

    The Alibaba Cloud website offers PolarDB for PostgreSQL, a cloud-native relational database, for direct purchase.

    + + + diff --git a/zh/deploying/deploy-stack.html b/zh/deploying/deploy-stack.html new file mode 100644 index 00000000000..37ee3ae8fac --- /dev/null +++ b/zh/deploying/deploy-stack.html @@ -0,0 +1,33 @@ + + + + + + + + + 基于 PolarDB Stack 共享存储 | PolarDB for PostgreSQL + + + + +

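    Deploying on PolarDB Stack Shared Storage

    PolarDB Stack is a lightweight PolarDB PaaS product. It provides one-writer, multiple-reader PolarDB database services on top of shared storage, with database lifecycle management specially customized and deeply optimized. With PolarDB Stack, you can deploy the PolarDB-for-PostgreSQL kernel and PolarDB-FileSystem in one step.

    The architecture of PolarDB Stack is shown in the figure below. See the PolarDB Stack deployment documentation for details.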

    PolarDB Stack arch

    + + + diff --git a/zh/deploying/deploy.html b/zh/deploying/deploy.html new file mode 100644 index 00000000000..9ffbc361a6d --- /dev/null +++ b/zh/deploying/deploy.html @@ -0,0 +1,33 @@ + + + + + + + + + 进阶部署 | PolarDB for PostgreSQL + + + + +

    Advanced Deployment

    棠羽

    2022/05/09

    10 min

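    Deploying PolarDB for PostgreSQL requires preparation at the following three layers:

    1. Block storage layer: provides the storage medium. It can be a single physical block device (local storage) or distributed block storage built from multiple physical block devices.
    2. File system layer: because PostgreSQL stores data in files, a file system must be set up on the block storage. Depending on the underlying block storage, you can use a single-node file system (such as ext4) or the distributed file system PolarDB File System (PFS).
    3. Database layer: the build and deployment environment of PolarDB for PostgreSQL.

    The table below lists the practices obtained by combining the three layers. The steps include:

    • Storage layer: preparing the block storage device
    • File system: building and mounting PolarDB File System
    • Database layer: building and deploying PolarDB for PostgreSQL in each cluster form

    We strongly recommend using the PolarDB development image published on DockerHub for these practices. The development image already contains all dependencies required by the file system layer and the database layer, so no manual installation is needed.

    Practice | Block storage | File system
    Practice 1 (minimal local deployment) | Local SSD | Local file system (such as ext4)
    Practice 2 (production best practice, video) | Alibaba Cloud ECS + ESSD cloud disk | PFS
    Practice 3 (production best practice, video) | CurveBS shared storage | PFS for Curve
    Practice 4 | Ceph shared storage | PFS
    Practice 5 | NBD shared storage | PFS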
    + + + diff --git a/zh/deploying/fs-pfs-curve.html b/zh/deploying/fs-pfs-curve.html new file mode 100644 index 00000000000..8ec293b55a5 --- /dev/null +++ b/zh/deploying/fs-pfs-curve.html @@ -0,0 +1,50 @@ + + + + + + + + + 格式化并挂载 PFS for CurveBS | PolarDB for PostgreSQL + + + + +

    Formatting and Mounting PFS for CurveBS

    棠羽

    2022/08/31

    20 min

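    PolarDB File System, PFS or PolarFS for short, is a high-performance, POSIX-like, user-space distributed file system developed in-house by Alibaba Cloud to serve its PolarDB database product. Once shared storage is formatted and mounted with PFS, a write to the shared storage by one compute node becomes immediately visible to the other compute nodes.

    Building and Installing PFS

    Prepare the PFS tools on the PolarDB compute nodes. We recommend the PolarDB development image on DockerHub, which already contains a prebuilt PFS, so no separate build and installation is needed. The Curve open source community has made dedicated optimizations for connecting PFS to CurveBS storage. On the compute nodes used to deploy PolarDB, start the PolarDB development image that includes PFS for CurveBS with the following commands: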

    docker pull polardb/polardb_pg_devel:curvebs
    +docker run -it \
    +    --network=host \
    +    --cap-add=SYS_PTRACE --privileged=true \
    +    --name polardb_pg \
    +    polardb/polardb_pg_devel:curvebs bash
    +

    Mapping and Formatting the Block Device on the Read-Write Node

    After entering the container, modify the curve configuration file:

    sudo vim /etc/curve/client.conf
    +#
    +################### MDS-side configuration ##################
    +#
    +
    +# MDS addresses; for an MDS cluster, separate the addresses with commas
    +mds.listen.addr=127.0.0.1:6666
    +... ...
    +

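    Note that mds.listen.addr must be set to the cluster mds addr shown in the cluster status output when the CurveBS cluster was deployed.

    The curve tool is already installed in the container and can be used to create volumes. Use it to create the curve volume that actually stores the PolarFS data: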

    curve create --filename /volume --user my --length 10 --stripeUnit 16384 --stripeCount 64
    +

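    Run curve create -h for details on creating volumes. In the example above, we created a volume with the following properties:

    • Volume name: /volume
    • Owner: my
    • Size: 10 GB
    • Stripe unit: 16 KB
    • Stripe count: 64

    Note in particular that for database workloads we strongly recommend striped volumes, since only striping takes full advantage of Curve's performance. A stripe setting of 16384 * 64 is currently the best-performing setting.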
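
As a quick sanity check on the recommended setting, the product of the two stripe parameters can be computed directly. The sketch below only does the arithmetic; reading the product as the amount of data covered by one full stripe is the usual interpretation of striping and is an assumption here, not a statement from the Curve documentation.

```shell
# Values taken from the curve create command above.
stripe_unit=16384    # bytes, from --stripeUnit (16 KB)
stripe_count=64      # from --stripeCount

# Data covered by one full stripe across all stripe units.
stripe_width=$((stripe_unit * stripe_count))
echo "stripe width: ${stripe_width} bytes ($((stripe_width / 1024 / 1024)) MiB)"
```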

    Formatting the curve Volume

    Before using the curve volume, format it with pfs:

    sudo pfs -C curve mkfs pool@@volume_my_
    +

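    Just as a file system must be created on a disk before it can be mounted locally, the curve volume must be formatted as a PolarFS file system.

    Note that because of how PolarFS parses names, the curve volume is specified in the form pool@${volume}_${user}_, and every / in the volume name must be replaced with @.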
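
The naming rule above can be expressed as a small helper that turns a curve volume name and user into the name expected by pfs. This is a sketch of the rule as stated; the function name curve_pfs_name is hypothetical.

```shell
# Convert a curve volume name and its owner into the PFS device name:
# pool@${volume}_${user}_ with every "/" in the volume name replaced by "@".
# (curve_pfs_name is an illustrative helper, not a curve or pfs command.)
curve_pfs_name() {
    volume=$1
    user=$2
    echo "pool@$(printf '%s' "$volume" | tr '/' '@')_${user}_"
}

curve_pfs_name /volume my
```

For the volume /volume owned by my, this yields pool@@volume_my_, the name used in the pfs mkfs command above.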

    Starting the pfsd Daemon

    sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p pool@@volume_my_
    +

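    If pfsd starts successfully, the curve flavor of PolarFS is fully deployed and the PFS file system is mounted. The next step is to build and deploy PolarDB.


    Building and Deploying PolarDB for Curve on PFS

    See PolarDB Build and Deployment: PFS File System.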

    + + + diff --git a/zh/deploying/fs-pfs.html b/zh/deploying/fs-pfs.html new file mode 100644 index 00000000000..d0cda88d3b9 --- /dev/null +++ b/zh/deploying/fs-pfs.html @@ -0,0 +1,55 @@ + + + + + + + + + 格式化并挂载 PFS | PolarDB for PostgreSQL + + + + +

    Formatting and Mounting PFS

    棠羽

    2022/05/09

    15 min

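    PolarDB File System, PFS or PolarFS for short, is a high-performance, POSIX-like, user-space distributed file system developed in-house by Alibaba Cloud to serve its PolarDB database product. Once shared storage is formatted and mounted with PFS, a write to the shared storage by one compute node becomes immediately visible to the other compute nodes.

    Building and Installing PFS

    We recommend the PolarDB for PostgreSQL binary image on DockerHub, which currently supports the linux/amd64 and linux/arm64 architectures. It already contains prebuilt PFS tools, so no manual build and installation is needed. Enter the container with the following commands: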

    docker pull polardb/polardb_pg_binary
    +docker run -it \
    +    --cap-add=SYS_PTRACE \
    +    --privileged=true \
    +    --name polardb_pg \
    +    --shm-size=512m \
    +    polardb/polardb_pg_binary \
    +    bash
    +

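    For building and installing PFS manually, see the PFS README; the details are not repeated here.

    Renaming the Block Device

    PFS can only access block devices whose names begin with specific characters (for details, see src/pfs_core/pfs_api.h in the PolarDB File System source code):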

    #define PFS_PATH_ISVALID(path)                                  \
    +    (path != NULL &&                                            \
    +     ((path[0] == '/' && isdigit((path)[1])) || path[0] == '.'  \
    +      || strncmp(path, "/pangu-", 7) == 0                       \
    +      || strncmp(path, "/sd", 3) == 0                           \
    +      || strncmp(path, "/sf", 3) == 0                           \
    +      || strncmp(path, "/vd", 3) == 0                           \
    +      || strncmp(path, "/nvme", 5) == 0                         \
    +      || strncmp(path, "/loop", 5) == 0                         \
    +      || strncmp(path, "/mapper_", 8) ==0))
    +

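    Therefore, to make sure the subsequent steps go smoothly, we recommend accessing the shared block device through the same symbolic link on every node that accesses it. For example, on the NBD server host, create a symbolic link with the new block device name /dev/nvme1n1 that points to the original name /dev/vdb of the shared block device: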

    sudo ln -s /dev/vdb /dev/nvme1n1
    +

    On the NBD client host, create a symbolic link with the same block device name /dev/nvme1n1 that points to the original name /dev/nbd0 of the shared block device:

    sudo ln -s /dev/nbd0 /dev/nvme1n1
    +

    This way, both the server and client hosts can access the same block device under the same name /dev/nvme1n1.
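
The prefix rule of the PFS_PATH_ISVALID macro shown earlier can be mirrored in shell to check a candidate name before creating the symbolic link. The sketch below restates the macro's conditions as a case pattern; the function name pfs_path_is_valid is hypothetical and the check is not part of PFS itself.

```shell
# Return 0 if the path starts with a prefix accepted by PFS_PATH_ISVALID:
# "/" followed by a digit, ".", /pangu-, /sd, /sf, /vd, /nvme, /loop, /mapper_.
# (pfs_path_is_valid is an illustrative helper mirroring the C macro.)
pfs_path_is_valid() {
    case $1 in
        /[0-9]*|.*|/pangu-*|/sd*|/sf*|/vd*|/nvme*|/loop*|/mapper_*) return 0 ;;
        *) return 1 ;;
    esac
}

for name in /nvme1n1 /vdb /nbd0; do
    if pfs_path_is_valid "$name"; then
        echo "$name: accepted"
    else
        echo "$name: rejected"
    fi
done
```

A name such as /nbd0 matches none of the prefixes, which is why the NBD client needs a link under an accepted name like nvme1n1.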

    Formatting the Block Device

    On any one host, format the shared block device with the PFS distributed file system:

    sudo pfs -C disk mkfs nvme1n1
    +

    Mounting the PFS File System

    On every host that can access the shared storage, start the PFS daemon to mount the PFS file system:

    sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
    +

    Building and Deploying PolarDB on PFS

    See PolarDB Build and Deployment: PFS File System.

    + + + diff --git a/zh/deploying/introduction.html b/zh/deploying/introduction.html new file mode 100644 index 00000000000..01a3f71b122 --- /dev/null +++ b/zh/deploying/introduction.html @@ -0,0 +1,33 @@ + + + + + + + + + 架构简介 | PolarDB for PostgreSQL + + + + +

    Architecture Overview

    棠羽

    2022/05/09

    5 min

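    PolarDB for PostgreSQL adopts a Shared-Storage architecture that separates compute from storage. The database moves from the traditional Shared-Nothing architecture to Shared-Storage: from N compute nodes + N copies of storage to N compute nodes + 1 copy of storage. PostgreSQL, by contrast, uses a traditional monolithic database architecture in which storage and compute are coupled.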

    software-level

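    To ensure that all compute nodes access the distributed block storage with the same view of visibility, PolarDB uses the distributed file system PolarDB File System (PFS) to access block devices. Its implementation is described in a paper published at VLDB 2018 [1]. If all compute nodes can access the same block device locally, PFS is not required and a local single-node file system (such as ext4) can be used directly. This is one of the differences from PostgreSQL.


    1. PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database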

    + + + diff --git a/zh/deploying/quick-start.html b/zh/deploying/quick-start.html new file mode 100644 index 00000000000..4fc9b03d867 --- /dev/null +++ b/zh/deploying/quick-start.html @@ -0,0 +1,43 @@ + + + + + + + + + 快速部署 | PolarDB for PostgreSQL + + + + +

    Quick Start

    棠羽

    2022/05/09

    5 min

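    Warning

    To simplify usage, the postgres user in the container has no password and is intended for trial only. In production or other security-sensitive environments, be sure to set a strong password!

    With a single machine that meets the following requirements, you can quickly start your PolarDB journey:

    Pull the local storage instance image of PolarDB for PostgreSQL from DockerHub, create and run a container, and try PolarDB-PG right away: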

    # Pull the PolarDB-PG image
    +docker pull polardb/polardb_pg_local_instance
    +# Create and run the container
    +docker run -it --rm polardb/polardb_pg_local_instance psql
    +# Verify availability
    +postgres=# SELECT version();
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +
    + + + diff --git a/zh/deploying/storage-aliyun-essd.html b/zh/deploying/storage-aliyun-essd.html new file mode 100644 index 00000000000..c50b52fcc01 --- /dev/null +++ b/zh/deploying/storage-aliyun-essd.html @@ -0,0 +1,38 @@ + + + + + + + + + 阿里云 ECS + ESSD 云盘存储 | PolarDB for PostgreSQL + + + + +

    Alibaba Cloud ECS + ESSD Cloud Disk Storage (Video)

    棠羽

    2022/05/09

    20 min

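    Alibaba Cloud ESSD (Enhanced SSD) cloud disks combine 25 GE networking with RDMA technology to deliver a random read/write capability of up to 1 million IOPS per disk with low single-path latency. ESSD cloud disks support the NVMe protocol and can be attached simultaneously to multiple ECS (Elastic Compute Service) instances that support NVMe, allowing multiple ECS instances to read and write concurrently with high reliability, high concurrency, and high performance. For more information, see the Alibaba Cloud ECS documentation.

    This document walks you through the following steps:

    1. Deploy two Alibaba Cloud ECS instances as compute nodes
    2. Attach one ESSD cloud disk to both ECS instances as shared storage
    3. Format the ESSD shared storage with the distributed file system PFS
    4. On top of PFS, build a PolarDB cluster across the two ECS instances with compute-storage separation and read-write splitting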

    aliyun-ecs-procedure

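    Deploying Alibaba Cloud ECS

    First prepare two or more Alibaba Cloud ECS instances. ECS currently places a number of restrictions on the instance types that support ESSD multi-attach; see the usage limits for details. Only certain instance families (ecs.g7se, ecs.c7se, ecs.r7se) in certain zones support ESSD multi-attach. As shown in the figure, be sure to choose an ECS instance type that supports multi-attach: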

    aliyun-ecs-specs

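    For the ECS storage configuration, any storage type can be used for the system disk, and no data disk or shared disk should be selected for now. A separate ESSD cloud disk will be created later as the shared disk: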

    aliyun-ecs-system-disk

    As shown in the figure, create two ECS instances in the same zone:

    aliyun-ecs-instance

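    Preparing the ESSD Cloud Disk

    In the Alibaba Cloud ECS console, select Cloud Disks under Storage & Snapshots and click Create Cloud Disk. Create an ESSD cloud disk in the same zone as the ECS instances and select Multi-attach. If your ECS instances do not meet the multi-attach requirements, this checkbox does not appear.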

    aliyun-essd-specs

    After the ESSD cloud disk is created, the console shows that the disk supports multi-attach and that it is waiting to be attached:

    aliyun-essd-ready-to-mount

    Next, attach this cloud disk to each of the two ECS instances:

    aliyun-essd-mounting

    After attaching, the cloud disk details show the two ECS instances it is attached to:

    aliyun-essd-mounted

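    Checking the Cloud Disk

    Connect to each of the two ECS instances over ssh and run lsblk. You should see:

    • nvme0n1 is the 40 GB ECS system disk, private to each ECS instance
    • nvme1n1 is the 100 GB ESSD cloud disk, visible to both ECS instances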
    $ lsblk
    +NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    +nvme0n1     259:0    0   40G  0 disk
    +└─nvme0n1p1 259:1    0   40G  0 part /etc/hosts
    +nvme1n1     259:2    0  100G  0 disk
    +

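    Preparing the Distributed File System

    Next, the PolarDB primary node and read-only node will be deployed on the two ECS instances respectively. As a prerequisite, format and mount PFS on the ESSD block device shared by the ECS instances.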

    + + + diff --git a/zh/deploying/storage-ceph.html b/zh/deploying/storage-ceph.html new file mode 100644 index 00000000000..4eaa343b057 --- /dev/null +++ b/zh/deploying/storage-ceph.html @@ -0,0 +1,248 @@ + + + + + + + + + Ceph 共享存储 | PolarDB for PostgreSQL + + + + +

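    Ceph Shared Storage

    Ceph is a unified distributed storage system that is widely used in the storage field for its performance, reliability, and scalability. A Ceph setup requires two or more physical or virtual machines to provide shared storage and data redundancy. This tutorial uses three virtual machines as an example to describe how to build an instance on Ceph shared storage. The overall procedure is:

    1. Obtain three virtual machines on the same network segment and configure passwordless ssh login among them, which is used to synchronize Ceph keys and configuration;
    2. Start the mon process on the primary node, check its status, copy the configuration files to the other nodes, and start mon on them;
    3. Start the osd process on all three machines to configure the storage disks, and start the mgr and rgw processes on the primary node;
    4. Create a storage pool and an rbd block device image, then map the image on each node to share the block device;
    5. Format the block device with PolarFS and deploy PolarDB.

    Note

    The operating system must be CentOS 7.5 or later. The following steps were tested on CentOS 7.5.

    Environment Preparation

    The virtual machine environment used is: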

    IP                  hostname
    +192.168.1.173       ceph001
    +192.168.1.174       ceph002
    +192.168.1.175       ceph003
    +

    Installing docker

    Tip

    This tutorial uses the docker packages provided by the Alibaba Cloud mirror site.

    Install the docker dependencies:

    yum install -y yum-utils device-mapper-persistent-data lvm2
    +

    Install and start docker:

    yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
    +yum makecache
    +yum install -y docker-ce
    +
    +systemctl start docker
    +systemctl enable docker
    +

    Check that the installation succeeded:

    docker run hello-world
    +

    Configuring Passwordless ssh Login

    Generate and copy the key:

    ssh-keygen
    +ssh-copy-id -i /root/.ssh/id_rsa.pub    root@ceph001
    +ssh-copy-id -i /root/.ssh/id_rsa.pub    root@ceph002
    +ssh-copy-id -i /root/.ssh/id_rsa.pub    root@ceph003
    +

    Check that the configuration succeeded:

    ssh root@ceph003
    +

    Downloading ceph daemon

    docker pull ceph/daemon
    +

    Deploying mon

    Start the mon process on ceph001:

    docker run -d \
    +    --net=host \
    +    --privileged=true \
    +    -v /etc/ceph:/etc/ceph \
    +    -v /var/lib/ceph/:/var/lib/ceph/ \
    +    -e MON_IP=192.168.1.173 \
    +    -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \
    +    --security-opt seccomp=unconfined \
    +    --name=mon01 \
    +    ceph/daemon mon
    +

    Note

    Adjust the IP address and subnet prefix length to match your actual network.

    Check the container status:

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     937ccded-3483-4245-9f61-e6ef0dbd85ca
    +    health: HEALTH_OK
    +
    +services:
    +    mon: 1 daemons, quorum ceph001 (age 26m)
    +    mgr: no daemons active
    +    osd: 0 osds: 0 up, 0 in
    +
    +data:
    +    pools:   0 pools, 0 pgs
    +    objects: 0 objects, 0 B
    +    usage:   0 B used, 0 B / 0 B avail
    +    pgs:
    +

    Note

    If you see the message mon is allowing insecure global_id reclaim, resolve it with the following command.

    docker exec mon01 ceph config set mon auth_allow_insecure_global_id_reclaim false
    +

    Generate the required keyrings:

    docker exec mon01 ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring
    +docker exec mon01 ceph auth get client.bootstrap-rgw -o /var/lib/ceph/bootstrap-rgw/ceph.keyring
    +

    Synchronize the configuration files:

    ssh root@ceph002 mkdir -p /var/lib/ceph
    +scp -r /etc/ceph root@ceph002:/etc
    +scp -r /var/lib/ceph/bootstrap* root@ceph002:/var/lib/ceph
    +ssh root@ceph003 mkdir -p /var/lib/ceph
    +scp -r /etc/ceph root@ceph003:/etc
    +scp -r /var/lib/ceph/bootstrap* root@ceph003:/var/lib/ceph
    +

    Start mon on ceph002 and ceph003:

    docker run -d \
    +    --net=host \
    +    --privileged=true \
    +    -v /etc/ceph:/etc/ceph \
    +    -v /var/lib/ceph/:/var/lib/ceph/ \
    +    -e MON_IP=192.168.1.174 \
    +    -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \
    +    --security-opt seccomp=unconfined \
    +    --name=mon02 \
    +    ceph/daemon mon
    +
    +docker run -d \
    +    --net=host \
    +    --privileged=true \
    +    -v /etc/ceph:/etc/ceph \
    +    -v /var/lib/ceph/:/var/lib/ceph/ \
    +    -e MON_IP=192.168.1.175 \
    +    -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \
    +    --security-opt seccomp=unconfined \
    +    --name=mon03 \
    +    ceph/daemon mon
    +
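
The two commands above differ only in MON_IP and the container name, so they can be generated from a template. The sketch below prints the command for a given node instead of running it; the helper name mon_run_command is hypothetical, and the two-digit naming follows the mon01/mon02/mon03 pattern used here.

```shell
# Print (without executing) the docker command that starts mon on a node.
# $1: node index (1, 2, 3, ...); $2: IP of that node;
# $3: public network, defaulting to the one used in this tutorial.
# (mon_run_command is an illustrative helper.)
mon_run_command() {
    index=$1
    mon_ip=$2
    public_network=${3:-192.168.1.0/24}
    mon_name=$(printf 'mon%02d' "$index")
    cat <<EOF
docker run -d \\
    --net=host \\
    --privileged=true \\
    -v /etc/ceph:/etc/ceph \\
    -v /var/lib/ceph/:/var/lib/ceph/ \\
    -e MON_IP=${mon_ip} \\
    -e CEPH_PUBLIC_NETWORK=${public_network} \\
    --security-opt seccomp=unconfined \\
    --name=${mon_name} \\
    ceph/daemon mon
EOF
}

mon_run_command 2 192.168.1.174
mon_run_command 3 192.168.1.175
```

Printing the command first makes it easy to review the per-node values before running it on the target host.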

    Check the current cluster status:

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     937ccded-3483-4245-9f61-e6ef0dbd85ca
    +    health: HEALTH_OK
    +
    +services:
    +    mon: 3 daemons, quorum ceph001,ceph002,ceph003 (age 35s)
    +    mgr: no daemons active
    +    osd: 0 osds: 0 up, 0 in
    +
    +data:
    +    pools:   0 pools, 0 pgs
    +    objects: 0 objects, 0 B
    +    usage:   0 B used, 0 B / 0 B avail
    +    pgs:
    +

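    Note

    Check the mon entry to confirm that the mon daemons created on the other two nodes have joined.

    Deploying osd

    osd Preparation Phase

    Tip

    In this environment each virtual machine has only one available disk, /dev/vdb, so only one osd is created per virtual machine.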

    docker run --rm --privileged=true --net=host --ipc=host \
    +    --security-opt seccomp=unconfined \
    +    -v /run/lock/lvm:/run/lock/lvm:z \
    +    -v /var/run/udev/:/var/run/udev/:z \
    +    -v /dev:/dev -v /etc/ceph:/etc/ceph:z \
    +    -v /run/lvm/:/run/lvm/ \
    +    -v /var/lib/ceph/:/var/lib/ceph/:z \
    +    -v /var/log/ceph/:/var/log/ceph/:z \
    +    --entrypoint=ceph-volume \
    +    docker.io/ceph/daemon \
    +    --cluster ceph lvm prepare --bluestore --data /dev/vdb
    +

    Note

    The command above is the same on all three nodes; only the disk name needs to be adjusted.

    osd Activation Phase

    docker run -d --privileged=true --net=host --pid=host --ipc=host \
    +    --security-opt seccomp=unconfined \
    +    -v /dev:/dev \
    +    -v /etc/localtime:/etc/localtime:ro \
    +    -v /var/lib/ceph:/var/lib/ceph:z \
    +    -v /etc/ceph:/etc/ceph:z \
    +    -v /var/run/ceph:/var/run/ceph:z \
    +    -v /var/run/udev/:/var/run/udev/ \
    +    -v /var/log/ceph:/var/log/ceph:z \
    +    -v /run/lvm/:/run/lvm/ \
    +    -e CLUSTER=ceph \
    +    -e CEPH_DAEMON=OSD_CEPH_VOLUME_ACTIVATE \
    +    -e CONTAINER_IMAGE=docker.io/ceph/daemon \
    +    -e OSD_ID=0 \
    +    --name=ceph-osd-0 \
    +    docker.io/ceph/daemon
    +

    Note

    Each node must use its own OSD_ID and name. OSD_ID increases from 0, so the remaining nodes use OSD_ID=1 and OSD_ID=2.
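
The per-node values can be listed with a short loop. This is only a sketch: the helper name osd_params is hypothetical, and the container names ceph-osd-1 and ceph-osd-2 are assumed by analogy with ceph-osd-0 in the command above.

```shell
# Print the OSD_ID and container name to use on each host, in order.
# IDs start at 0 and increase by one per host.
# (osd_params is an illustrative helper.)
osd_params() {
    osd_id=0
    for host in "$@"; do
        echo "${host}: -e OSD_ID=${osd_id} --name=ceph-osd-${osd_id}"
        osd_id=$((osd_id + 1))
    done
}

osd_params ceph001 ceph002 ceph003
```

Substitute the printed pair into the activation command when running it on the corresponding node.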

    Check the cluster status:

    $ docker exec mon01 ceph -s
    +cluster:
    +    id:     e430d054-dda8-43f1-9cda-c0881b782e17
    +    health: HEALTH_WARN
    +            no active mgr
    +
    +services:
    +    mon: 3 daemons, quorum ceph001,ceph002,ceph003 (age 44m)
    +    mgr: no daemons active
    +    osd: 3 osds: 3 up (since 7m), 3 in (since 13m)
    +
    +data:
    +    pools:   0 pools, 0 pgs
    +    objects: 0 objects, 0 B
    +    usage:   0 B used, 0 B / 0 B avail
    +    pgs:
    +

    Deploying mgr, mds, and rgw

    Run all of the following commands on ceph001:

    docker run -d --net=host \
    +    --privileged=true \
    +    --security-opt seccomp=unconfined \
    +    -v /etc/ceph:/etc/ceph \
    +    -v /var/lib/ceph/:/var/lib/ceph/ \
    +    --name=ceph-mgr-0 \
    +    ceph/daemon mgr
    +
    +docker run -d --net=host \
    +    --privileged=true \
    +    --security-opt seccomp=unconfined \
    +    -v /var/lib/ceph/:/var/lib/ceph/ \
    +    -v /etc/ceph:/etc/ceph \
    +    -e CEPHFS_CREATE=1 \
    +    --name=ceph-mds-0 \
    +    ceph/daemon mds
    +
    +docker run -d --net=host \
    +    --privileged=true \
    +    --security-opt seccomp=unconfined \
    +    -v /var/lib/ceph/:/var/lib/ceph/ \
    +    -v /etc/ceph:/etc/ceph \
    +    --name=ceph-rgw-0 \
    +    ceph/daemon rgw
    +

    Check the cluster status:

    docker exec mon01 ceph -s
    +cluster:
    +    id:     e430d054-dda8-43f1-9cda-c0881b782e17
    +    health: HEALTH_OK
    +
    +services:
    +    mon: 3 daemons, quorum ceph001,ceph002,ceph003 (age 92m)
    +    mgr: ceph001(active, since 25m)
    +    mds: 1/1 daemons up
    +    osd: 3 osds: 3 up (since 54m), 3 in (since 60m)
    +    rgw: 1 daemon active (1 hosts, 1 zones)
    +
    +data:
    +    volumes: 1/1 healthy
    +    pools:   7 pools, 145 pgs
    +    objects: 243 objects, 7.2 KiB
    +    usage:   50 MiB used, 2.9 TiB / 2.9 TiB avail
    +    pgs:     145 active+clean
    +

    Creating an RBD block device

    Tip

    Run all of the following commands inside the mon01 container.

    Creating the storage pool

    docker exec -it mon01 bash
    +ceph osd pool create rbd_polar
    +

    Create an image and view its information

    rbd create --size 512000 rbd_polar/image02
    +rbd info rbd_polar/image02
    +
    +rbd image 'image02':
    +size 500 GiB in 128000 objects
    +order 22 (4 MiB objects)
    +snapshot_count: 0
    +id: 13b97b252c5d
    +block_name_prefix: rbd_data.13b97b252c5d
    +format: 2
    +features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    +op_features:
    +flags:
    +create_timestamp: Thu Oct 28 06:18:07 2021
    +access_timestamp: Thu Oct 28 06:18:07 2021
    +modify_timestamp: Thu Oct 28 06:18:07 2021
    +
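
The size figures in the rbd info output follow from the arguments. As a quick check, assuming --size is interpreted in MiB (the default unit of the rbd CLI) and that order 22 means objects of 2^22 bytes:

```shell
size_mib=512000                              # value passed to `rbd create --size`
object_mib=$(( (1 << 22) / 1024 / 1024 ))    # order 22 -> 2^22 bytes = 4 MiB per object

size_gib=$(( size_mib / 1024 ))              # image size in GiB
objects=$(( size_mib / object_mib ))         # number of objects backing the image

echo "${size_gib} GiB in ${objects} objects"
```

The result matches the "size 500 GiB in 128000 objects" line reported for image02.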

    Mapping the image

    modprobe rbd # load the kernel module; run this on the host
    +rbd map rbd_polar/image02
    +
    +rbd: sysfs write failed
    +RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable rbd_polar/image02 object-map fast-diff deep-flatten".
    +In some cases useful info is found in syslog -  try "dmesg | tail".
    +rbd: map failed: (6) No such device or address
    +

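    Note

    The kernel does not support some image features, so they must be disabled before the image can be mapped. Proceed as follows: disable the unsupported RBD features, map the image again, and list the mapped devices.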

    rbd feature disable rbd_polar/image02 object-map fast-diff deep-flatten
    +rbd map rbd_polar/image02
    +rbd device list
    +
    +id  pool       namespace  image    snap  device
    +0   rbd_polar             image01  -     /dev/rbd0
    +1   rbd_polar             image02  -     /dev/rbd1
    +

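    Tip

    An image named image01 had already been mapped here, which is why two entries are listed.

    Viewing the block devices

    Leave the container and continue on the host. List the block devices in the system: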

    lsblk
    +
    +NAME                                                               MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
    +vda                                                                253:0    0  500G  0 disk
    +└─vda1                                                             253:1    0  500G  0 part /
    +vdb                                                                253:16   0 1000G  0 disk
    +└─ceph--7eefe77f--c618--4477--a1ed--b4f44520dfc2-osd--block--bced3ff1--42b9--43e1--8f63--e853bce41435
    +                                                                    252:0    0 1000G  0 lvm
    +rbd0                                                               251:0    0  100G  0 disk
    +rbd1                                                               251:16   0  500G  0 disk
    +

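    Note

    A block device image must be mapped on each node before it shows up in the local lsblk output; otherwise it is not displayed. The mapping commands on ceph002 and ceph003 are the same as above.


    Preparing the distributed file system

    See Formatting and Mounting PFS.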

    + + + diff --git a/zh/deploying/storage-curvebs.html b/zh/deploying/storage-curvebs.html new file mode 100644 index 00000000000..669840a9fc7 --- /dev/null +++ b/zh/deploying/storage-curvebs.html @@ -0,0 +1,188 @@ + + + + + + + + + CurveBS 共享存储 | PolarDB for PostgreSQL + + + + +

    CurveBS Shared Storage (Video)

    棠羽

    2022/08/31

    30 min

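    Curve is a high-performance, easy-to-operate, cloud-native open source distributed storage system. It can be used with mainstream cloud-native infrastructure platforms:

    • Integrated with OpenStack, it provides high-performance block storage for cloud hosts;
    • Integrated with Kubernetes, it provides persistent volumes of types such as RWO and RWX;
    • Integrated with PolarFS, it serves as a high-performance storage foundation for cloud-native databases and supports their compute-storage separation architecture.

    Curve can also act as cloud storage middleware that uses S3-compatible object storage as its data storage engine, providing cost-effective shared file storage for public cloud users.

    This example walks you through deploying PolarDB for PostgreSQL with CurveBS as the block storage. For more advanced configuration and usage, see the wiki of the Curve project.

    Preparing the machines

    curve-cluster

    As shown in the figure, this example uses six servers in total. One control server and three storage servers together form the CurveBS cluster, which is exposed as a single shared storage service. The remaining two servers host the read-write node and the read-only node of PolarDB for PostgreSQL, and they share the block device exposed by CurveBS.

    This example uses Alibaba Cloud ECS instances for all six servers. All six instances run Anolis OS 8.6 (compatible with CentOS 8.6), use the root user, and are in the same LAN segment. The required preparation is:

    1. Install Docker on all machines (see the official Docker documentation)
    2. On the Curve control server, configure passwordless SSH login to the other five servers

    Installing CurveAdm on the control server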

    bash -c "$(curl -fsSL https://curveadm.nos-eastchina1.126.net/script/install.sh)"
    +source /root/.bash_profile
    +

    Importing the host list

    On the control server, edit the host list file:

    vim hosts.yaml
    +

    The file contains the IP addresses of the other five servers and their names within the Curve cluster:

    • Three hosts are Curve storage nodes
    • Two hosts are PolarDB for PostgreSQL compute nodes
    global:
    +  user: root
    +  ssh_port: 22
    +  private_key_file: /root/.ssh/id_rsa
    +
    +hosts:
    +  # Curve worker nodes
    +  - host: server-host1
    +    hostname: 172.16.0.223
    +  - host: server-host2
    +    hostname: 172.16.0.224
    +  - host: server-host3
    +    hostname: 172.16.0.225
    +  # PolarDB nodes
    +  - host: polardb-primary
    +    hostname: 172.16.0.226
    +  - host: polardb-replica
    +    hostname: 172.16.0.227
    +

    Import the host list:

    curveadm hosts commit hosts.yaml
    +

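    Formatting the disks

    Prepare the disk list, and generate in advance a batch of fixed-size, pre-written chunk files. The disk list must include:

    • All storage node hosts to be formatted
    • The block device name, identical on every host (/dev/vdb in this example)
    • The mount point to be used
    • The format percentage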
    vim format.yaml
    +
    host:
    +  - server-host1
    +  - server-host2
    +  - server-host3
    +disk:
    +  - /dev/vdb:/data/chunkserver0:90 # device:mount_path:format_percent
    +

    Start formatting. The control server launches one formatting container per block device on each storage node host.

    $ curveadm format -f format.yaml
    +Start Format Chunkfile Pool: ⠸
    +  + host=server-host1  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [0/1] ⠸
    +  + host=server-host2  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [0/1] ⠸
    +  + host=server-host3  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [0/1] ⠸
    +

    When OK is displayed, the formatting container has been started, but this does not mean that formatting has finished. Formatting takes a while:

    Start Format Chunkfile Pool: [OK]
    +  + host=server-host1  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [1/1] [OK]
    +  + host=server-host2  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [1/1] [OK]
    +  + host=server-host3  device=/dev/vdb  mountPoint=/data/chunkserver0  usage=90% [1/1] [OK]
    +

    Use the following command to check the formatting progress. In this output, formatting is still in progress:

    $ curveadm format --status
    +Get Format Status: [OK]
    +
    +Host          Device    MountPoint          Formatted  Status
    +----          ------    ----------          ---------  ------
    +server-host1  /dev/vdb  /data/chunkserver0  19/90      Formatting
    +server-host2  /dev/vdb  /data/chunkserver0  22/90      Formatting
    +server-host3  /dev/vdb  /data/chunkserver0  22/90      Formatting
    +

    Output after formatting completes:

    $ curveadm format --status
    +Get Format Status: [OK]
    +
    +Host          Device    MountPoint          Formatted  Status
    +----          ------    ----------          ---------  ------
    +server-host1  /dev/vdb  /data/chunkserver0  95/90      Done
    +server-host2  /dev/vdb  /data/chunkserver0  95/90      Done
    +server-host3  /dev/vdb  /data/chunkserver0  95/90      Done
    +
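
Because formatting runs in the background, a script usually has to wait for it. The following is a minimal sketch that decides from the status table whether every device has finished. The helper name format_all_done is hypothetical, and the parsing assumes the column layout shown in the sample output above.

```shell
# Hypothetical helper: read `curveadm format --status` output on stdin and
# succeed only if at least one device row exists and every row is "Done".
format_all_done() {
    awk '
        $2 ~ /^\/dev\// { rows++; if ($NF != "Done") pending++ }
        END { exit !(rows > 0 && pending == 0) }
    '
}

# Assumed usage on the control server: poll once a minute until done.
# until curveadm format --status | format_all_done; do sleep 60; done
```

Matching on the device column keeps the header and separator rows out of the check.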

    Deploying the CurveBS cluster

    First, prepare the cluster configuration file:

    vim topology.yaml
    +

    Paste the following configuration:

    kind: curvebs
    +global:
    +  container_image: opencurvedocker/curvebs:v1.2
    +  log_dir: ${home}/logs/${service_role}${service_replicas_sequence}
    +  data_dir: ${home}/data/${service_role}${service_replicas_sequence}
    +  s3.nos_address: 127.0.0.1
    +  s3.snapshot_bucket_name: curve
    +  s3.ak: minioadmin
    +  s3.sk: minioadmin
    +  variable:
    +    home: /tmp
    +    machine1: server-host1
    +    machine2: server-host2
    +    machine3: server-host3
    +
    +etcd_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 2380
    +    listen.client_port: 2379
    +  deploy:
    +    - host: ${machine1}
    +    - host: ${machine2}
    +    - host: ${machine3}
    +
    +mds_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 6666
    +    listen.dummy_port: 6667
    +  deploy:
    +    - host: ${machine1}
    +    - host: ${machine2}
    +    - host: ${machine3}
    +
    +chunkserver_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 82${format_replicas_sequence} # 8200,8201,8202
    +    data_dir: /data/chunkserver${service_replicas_sequence} # /data/chunkserver0, /data/chunkserver1
    +    copysets: 100
    +  deploy:
    +    - host: ${machine1}
    +      replicas: 1
    +    - host: ${machine2}
    +      replicas: 1
    +    - host: ${machine3}
    +      replicas: 1
    +
    +snapshotclone_services:
    +  config:
    +    listen.ip: ${service_host}
    +    listen.port: 5555
    +    listen.dummy_port: 8081
    +    listen.proxy_port: 8080
    +  deploy:
    +    - host: ${machine1}
    +    - host: ${machine2}
    +    - host: ${machine3}
    +

    Create the cluster my-cluster from the topology file above:

    curveadm cluster add my-cluster -f topology.yaml
    +

    Switch the currently managed cluster to my-cluster:

    curveadm cluster checkout my-cluster
    +

    Deploy the cluster. On success, the output ends with a line like Cluster 'my-cluster' successfully deployed ^_^.

    $ curveadm deploy --skip snapshotclone
    +
    +...
    +Create Logical Pool: [OK]
    +  + host=server-host1  role=mds  containerId=c6fdd71ae678 [1/1] [OK]
    +
    +Start Service: [OK]
    +  + host=server-host1  role=snapshotclone  containerId=9d3555ba72fa [1/1] [OK]
    +  + host=server-host2  role=snapshotclone  containerId=e6ae2b23b57e [1/1] [OK]
    +  + host=server-host3  role=snapshotclone  containerId=f6d3446c7684 [1/1] [OK]
    +
    +Balance Leader: [OK]
    +  + host=server-host1  role=mds  containerId=c6fdd71ae678 [1/1] [OK]
    +
    +Cluster 'my-cluster' successfully deployed ^_^.
    +

    Check the cluster status:

    $ curveadm status
    +Get Service Status: [OK]
    +
    +cluster name      : my-cluster
    +cluster kind      : curvebs
    +cluster mds addr  : 172.16.0.223:6666,172.16.0.224:6666,172.16.0.225:6666
    +cluster mds leader: 172.16.0.225:6666 / d0a94a7afa14
    +
    +Id            Role         Host          Replicas  Container Id  Status
    +--            ----         ----          --------  ------------  ------
    +5567a1c56ab9  etcd         server-host1  1/1       f894c5485a26  Up 17 seconds
    +68f9f0e6f108  etcd         server-host2  1/1       69b09cdbf503  Up 17 seconds
    +a678263898cc  etcd         server-host3  1/1       2ed141800731  Up 17 seconds
    +4dcbdd08e2cd  mds          server-host1  1/1       76d62ff0eb25  Up 17 seconds
    +8ef1755b0a10  mds          server-host2  1/1       d8d838258a6f  Up 17 seconds
    +f3599044c6b5  mds          server-host3  1/1       d63ae8502856  Up 17 seconds
    +9f1d43bc5b03  chunkserver  server-host1  1/1       39751a4f49d5  Up 16 seconds
    +3fb8fd7b37c1  chunkserver  server-host2  1/1       0f55a19ed44b  Up 16 seconds
    +c4da555952e3  chunkserver  server-host3  1/1       9411274d2c97  Up 16 seconds
    +

    Deploying the CurveBS client

    On the Curve control server, edit the client configuration file:

    vim client.yaml
    +

    Note that mds.listen.addr must be set to the cluster mds addr value from the cluster status output in the previous step:

    kind: curvebs
    +container_image: opencurvedocker/curvebs:v1.2
    +mds.listen.addr: 172.16.0.223:6666,172.16.0.224:6666,172.16.0.225:6666
    +log_dir: /root/curvebs/logs/client
    +
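
To avoid copying the address by hand, the value can be taken from the status output. This is a sketch under the assumption that the status output keeps the "cluster mds addr" line format shown earlier; the helper name cluster_mds_addr is hypothetical.

```shell
# Hypothetical helper: read `curveadm status` output on stdin and print the
# value of the "cluster mds addr" line.
cluster_mds_addr() {
    sed -n 's/^cluster mds addr[[:space:]]*:[[:space:]]*//p'
}

# Assumed usage on the control server:
# addr=$(curveadm status | cluster_mds_addr)
# printf 'mds.listen.addr: %s\n' "$addr"
```

The "cluster mds leader" line is not matched, so only the full address list is printed.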

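    Preparing the distributed file system

    Next, the PolarDB primary node and read-only node are deployed on the two ECS instances that run the PolarDB compute nodes. As a prerequisite, both nodes must be able to share the CurveBS block device, and PFS must be formatted and mounted on it; see Formatting and Mounting PFS.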

    + + + diff --git a/zh/deploying/storage-nbd.html b/zh/deploying/storage-nbd.html new file mode 100644 index 00000000000..9918a3c344a --- /dev/null +++ b/zh/deploying/storage-nbd.html @@ -0,0 +1,66 @@ + + + + + + + + + NBD 共享存储 | PolarDB for PostgreSQL + + + + +

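    NBD Shared Storage

    Network Block Device (NBD) is a network protocol that allows a block storage device to be shared among multiple hosts. NBD uses a client-server architecture, so at least two physical machines are needed for deployment.

    Taking an environment with two physical machines as an example, this section outlines how to build an instance on NBD shared storage:

    • First, the two hosts share a block device through NBD;
    • Then, PolarDB File System (PFS) is deployed on both hosts to initialize and mount the same block device;
    • Finally, the PolarDB for PostgreSQL kernel is deployed on both hosts to set up a primary node and a read-only node, forming a simple one-writer, multi-reader instance.

    Note

    The steps above have been tested on CentOS 7.5.

    Installing NBD

    Download and install the NBD driver for the operating system

    Tip

    The operating system kernel must support the NBD kernel module. If it does not, you need to compile the NBD kernel module against the matching kernel version and load it yourself.

    Download the driver source package for the matching kernel version from the CentOS website and extract it: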

    rpm -ihv kernel-3.10.0-862.el7.src.rpm
    +cd ~/rpmbuild/SOURCES
    +tar Jxvf linux-3.10.0-862.el7.tar.xz -C /usr/src/kernels/
    +cd /usr/src/kernels/linux-3.10.0-862.el7/
    +

    The NBD driver source is located at drivers/block/nbd.c. Next, build the kernel dependencies and components:

    cp ../$(uname -r)/Module.symvers ./
    +make menuconfig # Device Driver -> Block devices -> Set 'M' On 'Network block device support'
    +make prepare && make modules_prepare && make scripts
    +make CONFIG_BLK_DEV_NBD=m M=drivers/block
    +

    Check that the driver was built correctly:

    modinfo drivers/block/nbd.ko
    +

    Copy the driver, generate dependencies, and install it:

    cp drivers/block/nbd.ko /lib/modules/$(uname -r)/kernel/drivers/block
    +depmod -a
    +modprobe nbd # or use modprobe -f nbd to skip the module version check
    +

    Check that the installation succeeded:

    # list loaded kernel modules
    +lsmod | grep nbd
    +# if the NBD driver is installed, /dev/nbd* devices exist (for example /dev/nbd0, /dev/nbd1)
    +ls /dev/nbd*
    +

    Installing the NBD package

    yum install nbd
    +

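    Sharing a block device with NBD

    Server deployment

    Start the NBD server in synchronous mode (sync/flush=true), listening on a given port (for example 1921) for access to a given block device (for example /dev/vdb).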

    nbd-server -C /root/nbd.conf
    +

    An example of the configuration file /root/nbd.conf:

    [generic]
    +    #user = nbd
    +    #group = nbd
    +    listenaddr = 0.0.0.0
    +    port = 1921
    +[export1]
    +    exportname = /dev/vdb
    +    readonly = false
    +    multifile = false
    +    copyonwrite = false
    +    flush = true
    +    fua = true
    +    sync = true
    +

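    Client deployment

    Once the NBD driver is installed, the /dev/nbd* devices are visible. Map the remote block device to a local NBD device according to the server configuration:

    nbd-client x.x.x.x 1921 -N export1 /dev/nbd0
    +# x.x.x.x is the IP address of the NBD server host
    +

    Preparing the distributed file system

    See Formatting and Mounting PFS.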

    + + + diff --git a/zh/development/customize-dev-env.html b/zh/development/customize-dev-env.html new file mode 100644 index 00000000000..40814482f51 --- /dev/null +++ b/zh/development/customize-dev-env.html @@ -0,0 +1,186 @@ + + + + + + + + + 定制开发环境 | PolarDB for PostgreSQL + + + + +

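    Customizing the Development Environment

    Building the development image yourself

    A prebuilt development image, polardb/polardb_pg_devel, is available on DockerHub and can be used directly (it supports the linux/amd64 and linux/arm64 architectures).

    We also provide the Dockerfile used to build that image. It starts from the official Ubuntu image ubuntu:20.04 and produces an image with all development and runtime dependencies installed. You can add further dependencies to the Dockerfile as needed. The Dockerfile and the manual build procedure follow: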

    FROM ubuntu:20.04
    +LABEL maintainer="mrdrivingduck@gmail.com"
    +CMD bash
    +
    +# Timezone problem
    +ENV TZ=Asia/Shanghai
    +RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
    +
    +# Upgrade softwares
    +RUN apt update -y && \
    +    apt upgrade -y && \
    +    apt clean -y
    +
    +# GCC (force to 9) and LLVM (force to 11)
    +RUN apt install -y \
    +        gcc-9 \
    +        g++-9 \
    +        llvm-11-dev \
    +        clang-11 \
    +        make \
    +        gdb \
    +        pkg-config \
    +        locales && \
    +    update-alternatives --install \
    +        /usr/bin/gcc gcc /usr/bin/gcc-9 60 --slave \
    +        /usr/bin/g++ g++ /usr/bin/g++-9 && \
    +    update-alternatives --install \
    +        /usr/bin/llvm-config llvm-config /usr/bin/llvm-config-11 60 --slave \
    +        /usr/bin/clang++ clang++ /usr/bin/clang++-11 --slave \
    +        /usr/bin/clang clang /usr/bin/clang-11 && \
    +    apt clean -y
    +
    +# Generate locale
    +RUN sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && \
    +    sed -i '/zh_CN.UTF-8/s/^# //g' /etc/locale.gen && \
    +    locale-gen
    +
    +# Dependencies
    +RUN apt install -y \
    +        libicu-dev \
    +        bison \
    +        flex \
    +        python3-dev \
    +        libreadline-dev \
    +        libgss-dev \
    +        libssl-dev \
    +        libpam0g-dev \
    +        libxml2-dev \
    +        libxslt1-dev \
    +        libldap2-dev \
    +        uuid-dev \
    +        liblz4-dev \
    +        libkrb5-dev \
    +        gettext \
    +        libxerces-c-dev \
    +        tcl-dev \
    +        libperl-dev \
    +        libipc-run-perl \
    +        libaio-dev \
    +        libfuse-dev && \
    +    apt clean -y
    +
    +# Tools
    +RUN apt install -y \
    +        iproute2 \
    +        wget \
    +        ccache \
    +        sudo \
    +        vim \
    +        git \
    +        cmake && \
    +    apt clean -y
    +
    +# leave empty if GitHub is directly reachable
    +# ENV GITHUB_PROXY=https://ghproxy.com/
    +ENV GITHUB_PROXY=
    +
    +ENV ZLOG_VERSION=1.2.14
    +ENV PFSD_VERSION=pfsd4pg-release-1.2.42-20220419
    +
    +# install dependencies from GitHub mirror
    +RUN cd /usr/local && \
    +    # zlog for PFSD
    +    wget --no-verbose --no-check-certificate "${GITHUB_PROXY}https://github.com/HardySimpson/zlog/archive/refs/tags/${ZLOG_VERSION}.tar.gz" && \
    +    # PFSD
    +    wget --no-verbose --no-check-certificate "${GITHUB_PROXY}https://github.com/ApsaraDB/PolarDB-FileSystem/archive/refs/tags/${PFSD_VERSION}.tar.gz" && \
    +    # unzip and install zlog
    +    gzip -d $ZLOG_VERSION.tar.gz && \
    +    tar xpf $ZLOG_VERSION.tar && \
    +    cd zlog-$ZLOG_VERSION && \
    +    make && make install && \
    +    echo '/usr/local/lib' >> /etc/ld.so.conf && ldconfig && \
    +    cd .. && \
    +    rm -rf $ZLOG_VERSION* && \
    +    rm -rf zlog-$ZLOG_VERSION && \
    +    # unzip and install PFSD
    +    gzip -d $PFSD_VERSION.tar.gz && \
    +    tar xpf $PFSD_VERSION.tar && \
    +    cd PolarDB-FileSystem-$PFSD_VERSION && \
    +    sed -i 's/-march=native //' CMakeLists.txt && \
    +    ./autobuild.sh && ./install.sh && \
    +    cd .. && \
    +    rm -rf $PFSD_VERSION* && \
    +    rm -rf PolarDB-FileSystem-$PFSD_VERSION*
    +
    +# create default user
    +ENV USER_NAME=postgres
    +RUN echo "create default user" && \
    +    groupadd -r $USER_NAME && \
    +    useradd -ms /bin/bash -g $USER_NAME $USER_NAME -p '' && \
    +    usermod -aG sudo $USER_NAME
    +
    +# modify conf
    +RUN echo "modify conf" && \
    +    mkdir -p /var/log/pfs && chown $USER_NAME /var/log/pfs && \
    +    mkdir -p /var/run/pfs && chown $USER_NAME /var/run/pfs && \
    +    mkdir -p /var/run/pfsd && chown $USER_NAME /var/run/pfsd && \
    +    mkdir -p /dev/shm/pfsd && chown $USER_NAME /dev/shm/pfsd && \
    +    touch /var/run/pfsd/.pfsd && \
    +    echo "ulimit -c unlimited" >> /home/postgres/.bashrc && \
    +    echo "export PGHOST=127.0.0.1" >> /home/postgres/.bashrc && \
    +    echo "alias pg='psql -h /home/postgres/tmp_master_dir_polardb_pg_1100_bld/'" >> /home/postgres/.bashrc
    +
    +ENV PATH="/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin:$PATH"
    +WORKDIR /home/$USER_NAME
    +USER $USER_NAME
    +

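    Copy the content above into a file (assumed here to be named Dockerfile-PolarDB), then build the image with the following command:

    Tip

    💡 In the command below, replace <image_name> with the Docker image name you want.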

    docker build --network=host \
    +    -t <image_name> \
    +    -f Dockerfile-PolarDB .
    +

     

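    Setting up a development environment from a clean system

    This approach assumes you start from scratch on a clean CentOS 7 system with root privileges, which can be:

    • A physical or virtual machine with CentOS 7 installed
    • A Docker container started from the official CentOS 7 Docker image centos:centos7

    Creating a non-root user

    PolarDB for PostgreSQL must run as a non-root user. The following steps help you create a user group named postgres and a user named postgres.

    Tip

    If you already have a non-root user under a name other than postgres:postgres, you can skip this step. In later example steps, remember to replace the user-related parts of the commands with your own group name and user name.

    The commands below create the group postgres and the user postgres, and grant the user sudo privileges and access to its home directory. Run them as root.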

    # install sudo
    +yum install -y sudo
    +# create user and group
    +groupadd -r postgres
    +useradd -m -g postgres postgres -p ''
    +usermod -aG wheel postgres
    +# make postgres as sudoer
    +chmod u+w /etc/sudoers
    +echo 'postgres ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers
    +chmod u-w /etc/sudoers
    +# grant access to home directory
    +chown -R postgres:postgres /home/postgres/
    +echo 'source /etc/bashrc' >> /home/postgres/.bashrc
    +# for su postgres
    +sed -i 's/4096/unlimited/g' /etc/security/limits.d/20-nproc.conf
    +

    Next, switch to the postgres user to continue with the remaining steps:

    su postgres
    +source /etc/bashrc
    +cd ~
    +

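    Installing dependencies

    The root of the PolarDB for PostgreSQL source repository contains a script, install_dependencies.sh, that installs all dependencies needed to run PolarDB and PFS. So the first step is to clone the PolarDB source repository.

    The PolarDB for PostgreSQL code is hosted on GitHub, and the stable branch is POLARDB_11_STABLE. If network conditions prevent stable access to GitHub, you can use the Gitee mirror in China instead.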

    sudo yum install -y git
    +git clone -b POLARDB_11_STABLE https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git
    +
    sudo yum install -y git
    +git clone -b POLARDB_11_STABLE https://gitee.com/mirrors/PolarDB-for-PostgreSQL
    +

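    After the source is downloaded, run the dependency installation script install_dependencies.sh in the source root with sudo to install all dependencies automatically. If you have custom development requirements, modify install_dependencies.sh yourself.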

    cd PolarDB-for-PostgreSQL
    +sudo ./install_dependencies.sh
    +
    + + + diff --git a/zh/development/dev-on-docker.html b/zh/development/dev-on-docker.html new file mode 100644 index 00000000000..eb52c6ec2c2 --- /dev/null +++ b/zh/development/dev-on-docker.html @@ -0,0 +1,63 @@ + + + + + + + + + 基于 Docker 容器开发 | PolarDB for PostgreSQL + + + + +

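    Developing in a Docker Container

    Warning

    To simplify use, the postgres user inside the container has no password and is intended for evaluation only. In production or other security-sensitive environments, be sure to set a strong password!

    Downloading the source code on the development machine

    Download the PolarDB for PostgreSQL source code from GitHub; the stable branch is POLARDB_11_STABLE. If network conditions prevent stable access to GitHub, you can use the Gitee mirror in China instead.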

    git clone -b POLARDB_11_STABLE https://github.com/ApsaraDB/PolarDB-for-PostgreSQL.git
    +
    git clone -b POLARDB_11_STABLE https://gitee.com/mirrors/PolarDB-for-PostgreSQL
    +

    After cloning, enter the source directory:

    cd PolarDB-for-PostgreSQL/
    +

    Pulling the development image

    Pull the PolarDB for PostgreSQL development image from DockerHub.

    # pull the PolarDB development image
    +docker pull polardb/polardb_pg_devel
    +

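    Creating and running the container

    At this point we are in the source directory on the development machine. Create a container from the development image and mount the current directory into it as a volume, so that you can:

    • Compile the source code in the environment inside the container
    • View or modify the code with an editor outside the container (on the development machine)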
    docker run -it \
    +    -v $PWD:/home/postgres/polardb_pg \
    +    --shm-size=512m --cap-add=SYS_PTRACE --privileged=true \
    +    --name polardb_pg_devel \
    +    polardb/polardb_pg_devel \
    +    bash
    +

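    After entering the container, give the container user permissions on the source directory, then build and deploy a PolarDB-PG instance.

    # obtain permissions, then build and deploy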
    +cd polardb_pg
    +sudo chmod -R a+wr ./
    +sudo chown -R postgres:postgres ./
    +./polardb_build.sh
    +
    +# verify
    +psql -h 127.0.0.1 -c 'select version();'
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +

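    Build and test options

    The table below lists options that may be used when building, initializing, or testing a PolarDB-PG cluster. For more options and their descriptions, see the polardb_build.sh script in the source directory.

    Option | Description | Default
    ------ | ----------- | -------
    --withrep | Initialize read-only nodes | NO
    --repnum | Number of read-only nodes | 1
    --withstandby | Initialize a hot standby node | NO
    --initpx | Initialize as an HTAP cluster (1 read-write node, 2 read-only nodes) | NO
    --with-pfsd | Build PolarDB File System (PFS) support | NO
    --with-tde | Initialize Transparent Data Encryption (TDE) | NO
    --with-dma | Initialize as a three-node DMA (Data Max Availability) high-availability cluster | NO
    -r / -t / --regress | Run kernel regression tests after build and install | NO
    -r-px | Run regression tests for the HTAP instance | NO
    -e / --extension | Run extension tests | NO
    -r-external | Test extensions under external/ | NO
    -r-contrib | Test extensions under contrib/ | NO
    -r-pl | Test extensions under src/pl/ | NO

    If you have no custom requirements, use the options below to build and deploy PolarDB-PG clusters of different forms and test them.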

    Building and deploying each form of PolarDB-PG

    Local single-node instance

    • 1 read-write node (running on port 5432)
    ./polardb_build.sh
    +

    Local multi-node instance

    • 1 read-write node (running on port 5432)
    • 1 read-only node (running on port 5433)
    ./polardb_build.sh --withrep --repnum=1
    +

    Local multi-node instance with a standby

    • 1 read-write node (running on port 5432)
    • 1 read-only node (running on port 5433)
    • 1 standby node (running on port 5434)
    ./polardb_build.sh --withrep --repnum=1 --withstandby
    +

    Local multi-node HTAP instance

    • 1 read-write node (running on port 5432)
    • 2 read-only nodes (running on ports 5433 and 5434)
    ./polardb_build.sh --initpx
    +
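
After any of these builds, each node can be probed on its port. The sketch below simply encodes the port lists given in the sections above; the layout labels and the helper name polardb_ports are hypothetical. pg_is_in_recovery() is a standard PostgreSQL function, but what each PolarDB node type reports for it is not stated in this document, so no expected output is shown.

```shell
# Ports for each local layout, as listed in the sections above.
# The layout names are hypothetical labels used only in this sketch.
polardb_ports() {
    case "$1" in
        single)           echo "5432" ;;
        replica)          echo "5432 5433" ;;
        replica-standby)  echo "5432 5433 5434" ;;
        htap)             echo "5432 5433 5434" ;;
        *)                return 1 ;;
    esac
}

# Assumed usage inside the container: query every node of an HTAP layout.
# for port in $(polardb_ports htap); do
#     psql -h 127.0.0.1 -p "$port" -c 'select pg_is_in_recovery();'
# done
```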

    Instance regression tests

    Regression tests for a regular instance:

    ./polardb_build.sh --withrep -r -e -r-external -r-contrib -r-pl --with-tde
    +

    Regression tests for an HTAP instance:

    ./polardb_build.sh -r-px -e -r-external -r-contrib -r-pl --with-tde
    +

    Regression tests for a DMA instance:

    ./polardb_build.sh -r -e -r-external -r-contrib -r-pl --with-tde --with-dma
    +
    + + + diff --git a/zh/features/index.html b/zh/features/index.html new file mode 100644 index 00000000000..506ceff36c7 --- /dev/null +++ b/zh/features/index.html @@ -0,0 +1,33 @@ + + + + + + + + + 自研功能 | PolarDB for PostgreSQL + + + + +

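    In-house Features

    Feature / version mapping matrix

    Feature / Version | PostgreSQL | PolarDB for PostgreSQL 11
    ----------------- | ---------- | -------------------------
    High performance | ... | ...
    Read-ahead / pre-extension | / | V11 / v1.1.1-
    Relation size cache | / | V11 / v1.1.10-
    Shared Server | / | V11 / v1.1.30-
    High availability | ... | ...
    Read-only node Online Promote | / | V11 / v1.1.1-
    Parallel WAL replay | / | V11 / v1.1.17-
    DataMax log node | / | V11 / v1.1.6-
    Resource Manager | / | V11 / v1.1.1-
    Flashback table and flashback log | / | V11 / v1.1.22-
    Security | ... | ...
    Transparent Data Encryption | / | V11 / v1.1.1-
    Elastic cross-node parallel query (ePQ) | ... | ...
    Viewing and analyzing ePQ execution plans | / | V11 / v1.1.22-
    ePQ compute node range selection and parallelism control | / | V11 / v1.1.20-
    ePQ support for partitioned table queries | / | V11 / v1.1.17-
    ePQ parallel acceleration for B-Tree index creation | / | V11 / v1.1.15-
    Cluster topology view | / | V11 / v1.1.20-
    Adaptive scan | / | V11 / v1.1.17-
    Parallel INSERT | / | V11 / v1.1.17-
    ePQ parallel acceleration and batch writes for creating/refreshing materialized views | / | V11 / v1.1.30-
    Third-party extensions | ... | ...
    pgvector | / | V11 / v1.1.35-
    smlar | / | V11 / v1.1.35-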
    + + + diff --git a/zh/features/v11/availability/avail-online-promote.html b/zh/features/v11/availability/avail-online-promote.html new file mode 100644 index 00000000000..e103ed98ee3 --- /dev/null +++ b/zh/features/v11/availability/avail-online-promote.html @@ -0,0 +1,34 @@ + + + + + + + + + 只读节点 Online Promote | PolarDB for PostgreSQL + + + + +

    Read-only Node Online Promote

    V11 / v1.1.1-

    学弈

    2022/09/20

    25 min

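    Background

    PolarDB uses a one-writer, multi-reader architecture on shared storage, which differs from the primary-standby architecture of traditional databases:

    • A Standby node is the standby node of a traditional database. It has its own storage and synchronizes data with the primary node by receiving the complete WAL;
    • A read-only node, also called a Replica node, is the read-only standby node of PolarDB. It shares the same storage with the primary node and synchronizes data with the primary node by receiving WAL Meta information.

    Traditional databases support a Promote operation that promotes a Standby node to primary without a restart, so that it continues to provide read-write service. This keeps the cluster highly available and also reduces the recovery time objective (RTO) of the instance.

    PolarDB likewise needs the ability to promote a read-only node to primary. Because read-only nodes differ from the Standby nodes of traditional databases, PolarDB introduces an OnlinePromote mechanism for read-only nodes under the one-writer, multi-reader architecture.

    Usage

    Use the pg_ctl tool to run the Promote operation on a Replica node: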

    pg_ctl promote -D [datadir]
    +

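    How OnlinePromote works

    Trigger mechanism

    PolarDB uses the same standby Promote methods as traditional databases. It can be triggered in either of the following ways:

    • Run the promote command of pg_ctl. pg_ctl sends a signal to the Postmaster process, which then notifies the other processes to perform their parts and complete the Promote operation.
    • Define the path of a trigger file in recovery.conf; other components trigger the Promote by creating that trigger file.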
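
The trigger-file path can be sketched as follows. This assumes that PolarDB reads the stock PostgreSQL 11 trigger_file setting from recovery.conf, which this document does not confirm by name; the helper request_promote_via_trigger is hypothetical.

```shell
# Hypothetical helper: read the trigger_file setting from recovery.conf in the
# given data directory and create that file to request a promote.
# Assumes the value is written in single quotes, e.g.
#   trigger_file = '/path/to/promote.trigger'
request_promote_via_trigger() {
    datadir=$1
    trigger=$(sed -n "s/^[[:space:]]*trigger_file[[:space:]]*=[[:space:]]*'\([^']*\)'.*/\1/p" \
        "$datadir/recovery.conf" | tail -n 1)
    [ -n "$trigger" ] || return 1    # no trigger file configured
    touch "$trigger"
}
```

An external component such as a cluster manager would call this with the Replica node's data directory; the function fails without side effects when no trigger file is configured.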

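    Compared with the Promote operation on a Standby node of a traditional database, OnlinePromote on a PolarDB Replica node has to address several additional issues:

    • After a Replica node is promoted to primary, it must remount the shared storage in read-write mode;
    • A Replica node keeps some important control information in memory that the primary node persists to shared storage. During Promote, this information must also be persisted to shared storage;
    • For data that the Replica node has obtained in memory through log replay, OnlinePromote must determine which of it can be written to shared storage;
    • While replaying WAL in memory, a Replica node uses a buffer eviction method and a no-flush behavior that are quite different from those of the primary node; OnlinePromote must decide how to handle this;
    • How each child process is handled during OnlinePromote on the Replica node.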

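    Postmaster process handling

    1. When the Postmaster process finds the trigger file or receives the OnlinePromote command, it enters the OnlinePromote flow;
    2. It sends SIGTERM to all current Backend processes.
      • A read-only node could keep serving reads during OnlinePromote, but the data read is not guaranteed to be the latest. To avoid reading stale data from the new primary node during the switchover, all Backend sessions are disconnected first, and read-write service starts only after the Startup process exits.
    3. It remounts the shared storage in read-write mode, which requires support from the underlying storage;
    4. It sends SIGUSR2 to the Startup process, telling it to finish replay and handle the OnlinePromote operation;
    5. It sends SIGUSR2 to the Polar Worker auxiliary process, telling it to stop parsing part of the LogIndex data, because that data is only useful to a Replica node during normal operation;
    6. It sends SIGUSR2 to the LogIndex BGW (Background Worker) replay process, telling it to handle the OnlinePromote operation.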

    image.png

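    Startup process handling

    1. The Startup process replays all WAL generated by the old primary node and generates the corresponding LogIndex data;
    2. It confirms that the last checkpoint of the old primary node has also completed on the Replica node, to ensure that the data this checkpoint should write locally on the Replica node has been flushed to disk;
    3. It waits until the LogIndex BGW process has entered the POLAR_BG_WAITING_RESET state;
    4. It copies the Replica node's local data (such as clog) to shared storage;
    5. It resets the WAL Meta Queue memory, reloads slot information from shared storage, and resets the replay position of the LogIndex BGW process to the minimum of its current replay position and the current consistent position; the LogIndex BGW process starts its new replay from that position;
    6. It sets the node role to primary and sets the LogIndex BGW state to POLAR_BG_ONLINE_PROMOTE. From this point the instance can provide read-write service.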

    image.png

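    LogIndex BGW process handling

    The LogIndex BGW process has its own state machine and follows it throughout its lifetime. The states and what the process does in each are:

    • POLAR_BG_WAITING_RESET: the LogIndex BGW state is being reset, which notifies other processes that the state machine has changed;
    • POLAR_BG_ONLINE_PROMOTE: reads LogIndex data, organizes and dispatches replay tasks, and replays WAL with the parallel replay process group. In this state the process must replay all LogIndex data before switching state, and finally advances the replay position of the background replay process;
    • POLAR_BG_REDO_NOT_START: the replay task has ended;
    • POLAR_BG_RO_BUF_REPLAYING: the state during normal operation of a Replica node. The process reads LogIndex data and replays a certain amount of WAL in WAL order; after each round it advances the replay position of the background replay process;
    • POLAR_BG_PARALLEL_REPLAYING: the LogIndex BGW process reads a certain amount of LogIndex data each time, organizes and dispatches replay tasks, and replays WAL with the parallel replay process group; after each round it advances the replay position of the background replay process.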

    image.png

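    After receiving SIGUSR2 from the Postmaster, the LogIndex BGW process performs OnlinePromote as follows:

    1. It flushes all LogIndex data to disk and switches its state to POLAR_BG_WAITING_RESET;
    2. It waits for the Startup process to switch it to the POLAR_BG_ONLINE_PROMOTE state;
      • Before the Replica node runs OnlinePromote, the background replay process only replays pages that are in the buffer pool;
      • While the Replica node is in OnlinePromote, the old primary node may have had some pages in memory that were never flushed, so the background replay process replays all WAL in log order and, after each replay, calls MarkBufferDirty to mark the page dirty so that it is flushed later;
      • When replay ends, it advances the replay position of the background replay process and switches its state to POLAR_BG_REDO_NOT_START.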
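
The state changes that the LogIndex BGW process goes through during OnlinePromote can be summarized as a small transition function. This sketch only mirrors the steps described above; the event names (sigusr2, startup_ready, replay_finished) are hypothetical labels, not PolarDB identifiers.

```shell
# Sketch of the LogIndex BGW transitions during OnlinePromote.
# Arguments: current state, event. Prints the next state.
bgw_next_state() {
    state=$1
    event=$2
    case "$state:$event" in
        # Postmaster signals the BGW: flush LogIndex data, then wait for reset.
        POLAR_BG_RO_BUF_REPLAYING:sigusr2)        echo POLAR_BG_WAITING_RESET ;;
        # Startup has reset the replay position and switches the BGW state.
        POLAR_BG_WAITING_RESET:startup_ready)     echo POLAR_BG_ONLINE_PROMOTE ;;
        # All LogIndex data has been replayed and the replay position advanced.
        POLAR_BG_ONLINE_PROMOTE:replay_finished)  echo POLAR_BG_REDO_NOT_START ;;
        # Any other combination leaves the state unchanged in this sketch.
        *)                                        echo "$state" ;;
    esac
}
```

Starting from POLAR_BG_RO_BUF_REPLAYING and feeding the three events in order ends in POLAR_BG_REDO_NOT_START, which matches the sequence in the list above.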

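    Dirty page flush control

    Every dirty page carries an Oldest LSN. The pages are ordered by this LSN in the FlushList, and the LSN is used to determine the consistent position.

    After OnlinePromote starts on a Replica node, replay and new page writes happen at the same time. If the node did what a primary node does and simply used the current WAL insert position as a buffer's Oldest LSN, the new consistent position could be set while buffers with a smaller LSN had not yet been flushed.

    A Replica node therefore faces two questions during OnlinePromote:

    • How to set the Oldest LSN of dirty pages produced by replaying the WAL of the old primary node;
    • How to set the Oldest LSN of dirty pages produced by the new primary node.

    During OnlinePromote on a Replica node, PolarDB sets the Oldest LSN of dirty pages from both cases to the replay position advanced by the LogIndex BGW process. The consistent position moves forward only after all buffers marked with the same Oldest LSN have been flushed.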

    + + + diff --git a/zh/features/v11/availability/avail-parallel-replay.html b/zh/features/v11/availability/avail-parallel-replay.html new file mode 100644 index 00000000000..bd346d6171c --- /dev/null +++ b/zh/features/v11/availability/avail-parallel-replay.html @@ -0,0 +1,34 @@ + + + + + + + + + WAL 日志并行回放 | PolarDB for PostgreSQL + + + + +

    Parallel WAL Replay

    V11 / v1.1.17-

    学弈

    2022/09/20

    30 min

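    Background

    In the one-writer, multi-reader architecture of PolarDB for PostgreSQL, while a read-only node (Replica node) is running, the LogIndex background replay process (LogIndex Background Worker) and the session processes (Backends) each use LogIndex data to replay WAL on different buffers. In effect, this already replays WAL in parallel.

    Because WAL replay is critical to the high availability of a PolarDB cluster, applying the idea of parallel WAL replay to the regular replay path is a natural optimization.

    Parallel WAL replay is useful in at least the following three scenarios:

    1. Crash recovery of the primary node, read-only nodes, and standby nodes;
    2. Continuous WAL replay by the LogIndex BGW process on a read-only node;
    3. Continuous WAL replay by the Startup process on a standby node.

    Terminology

    • Block: a data block
    • WAL: Write-Ahead Logging
    • Task Node: a subtask execution node in the parallel execution framework, which can receive and execute one subtask
    • Task Tag: the classification tag of a subtask; subtasks of the same class must execute in order
    • Hold List: the linked list that each child process in the parallel execution framework uses to schedule replay subtasks

    Principles

    Overview

    One WAL record may modify several data blocks, so the replay of WAL can be described with the following definitions:

    • Suppose the i-th WAL record has LSN $LSN_i$ and modifies m data blocks. Define the list of blocks modified by the i-th WAL record as $Block_i = [Block_{i,0}, Block_{i,1}, ..., Block_{i,m-1}]$;
    • Define the smallest replay subtask as $Task_{i,j} = \{LSN_i \rightarrow Block_{i,j}\}$, which means replaying the i-th WAL record on block $Block_{i,j}$;
    • A WAL record that modifies k blocks can therefore be expressed as a set of k replay subtasks: $TASK_{i,*} = [Task_{i,0}, Task_{i,1}, ..., Task_{i,k-1}]$;
    • In turn, multiple WAL records can be expressed as a collection of replay subtask sets: $TASK_{*,*} = [TASK_{0,*}, TASK_{1,*}, ..., TASK_{N,*}]$.

    Within the replay subtask collection $TASK_{*,*}$, a subtask often does not depend on the result of the subtasks before it. Suppose the collection is $TASK_{*,*} = [TASK_{0,*}, TASK_{1,*}, TASK_{2,*}]$, where:

    • $TASK_{0,*} = [Task_{0,0}, Task_{0,1}, Task_{0,2}]$
    • $TASK_{1,*} = [Task_{1,0}, Task_{1,1}]$
    • $TASK_{2,*} = [Task_{2,0}]$

    and $Block_{0,0} = Block_{1,0}$, $Block_{0,1} = Block_{1,1}$, $Block_{0,2} = Block_{2,0}$.

    Then there are three subtask sequences that can be replayed in parallel with one another: $[Task_{0,0}, Task_{1,0}]$, $[Task_{0,1}, Task_{1,1}]$, and $[Task_{0,2}, Task_{2,0}]$.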
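
The grouping in this example can be reproduced with a few lines of shell. This is a sketch, not PolarDB code: blocks are given the hypothetical names A, B, and C (so that $Block_{0,0} = Block_{1,0}$ is A, and so on), and each input line is a "record block" pair in WAL order.

```shell
# Group (record index, block) pairs by block, keeping record order inside
# each group. Input: one "record block" pair per line on stdin.
group_tasks_by_block() {
    awk '
        !($2 in seen) { seen[$2] = 1; order[++n] = $2 }   # remember first-seen order
        {
            sep = (chain[$2] == "") ? "" : " "
            chain[$2] = chain[$2] sep $1                  # append record index to its block
        }
        END { for (i = 1; i <= n; i++) print order[i] ": " chain[order[i]] }
    '
}

# The three WAL records from the example above.
printf '0 A\n0 B\n0 C\n1 A\n1 B\n2 C\n' | group_tasks_by_block
```

Each output line is one sequence that must run in order; different lines touch different blocks and can therefore be replayed in parallel.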

    综上所述,在整个 WAL 日志所表示的回放子任务集合中,存在很多子任务序列可以并行执行,而且不会影响最终回放结果的一致性。PolarDB 借助这种思想,提出了一种并行任务执行框架,并成功运用到了 WAL 日志回放的过程中。
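    The grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not PolarDB code: the function name `group_replay_tasks` and the block names "A", "B", "C" are hypothetical, and the input mirrors the three-record example.

```python
from collections import defaultdict

def group_replay_tasks(wal_records):
    """Group replay subtasks by the block they modify.

    wal_records: list of (lsn, [block, ...]) in LSN order.
    Returns {block: [lsn, ...]}. Each per-block list must be replayed
    in order; lists of different blocks are independent of each other.
    """
    chains = defaultdict(list)
    for lsn, blocks in wal_records:
        for block in blocks:
            chains[block].append(lsn)
    return dict(chains)

# Record 0 modifies blocks A, B, C; record 1 modifies A, B; record 2 modifies C.
records = [(0, ["A", "B", "C"]), (1, ["A", "B"]), (2, ["C"])]
chains = group_replay_tasks(records)
for block, lsns in chains.items():
    print(block, lsns)
```

    Each printed chain corresponds to one of the three sequences above and could be handed to a different worker.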

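    Parallel Task Execution Framework

    A region of shared memory is divided into equal parts according to the number of concurrent processes. Each part serves as a ring queue and is assigned to one process. The depth of each ring queue is set through a configuration parameter: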

    image.png

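    • Dispatcher process
      • Controls concurrent scheduling by dispatching tasks to specific processes;
      • Removes tasks that the processes have finished from the queues;
    • Process group
      • Each process in the group fetches tasks from its own ring queue and decides whether to execute a task based on the task's state.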

    image.png

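    Tasks

    A ring queue consists of Task Nodes. Each Task Node is in one of five states: Idle, Running, Hold, Finished, or Removed.

    • Idle: no task has been assigned to the Task Node;
    • Running: a task has been assigned to the Task Node and is either waiting to be executed by a process or already executing;
    • Hold: the task in the Task Node depends on an earlier task and must wait for it to finish before executing;
    • Finished: a process in the process group has finished executing the task;
    • Removed: when the Dispatcher process finds a task in the Finished state, all tasks that it depends on must also be Finished. The Removed state means that the Dispatcher process has removed the task and all of its predecessors from the management structures. This mechanism ensures that the Dispatcher process handles the results of dependent tasks in order.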

    image.png

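    In the state machine above, the transitions drawn in black are performed by the Dispatcher process, and the transitions drawn in orange are performed by the parallel replay process group.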

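    Dispatcher Process

    The Dispatcher process has three key data structures: Task HashMap, Task Running Queue, and Task Idle Nodes.

    • Task HashMap records the hash mapping between a Task Tag and its list of tasks:
      • Every task has a Task Tag; if two tasks depend on each other, they have the same Task Tag;
      • When a task is dispatched, if its Task Node has a predecessor it is marked as Hold and must wait for the predecessor to execute first.
    • Task Running Queue records the tasks that are currently executing;
    • Task Idle Nodes records, for each process in the process group, the Task Nodes that are currently in the Idle state.

    The Dispatcher uses the following scheduling policy:

    • If a task with the same Task Tag as the Task Node to be executed is already executing, the Task Node is preferentially assigned to the process that holds the last Task Node in that Task Tag's list. The goal is to have dependent tasks executed by the same process whenever possible, which reduces inter-process synchronization overhead;
    • If the queue of the preferred process is full, or no task with the same Task Tag is executing, a process is selected from the process group in order and an Idle Task Node is taken from it to schedule the task. The goal is to spread tasks as evenly as possible across processes.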
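    A minimal sketch of this scheduling policy, under simplifying assumptions: the class and method names are hypothetical, queues are plain deques rather than shared-memory ring queues, and finished tasks are never removed, so a tag is treated as "in flight" once it has been seen.

```python
from collections import deque

class Dispatcher:
    """Toy model of tag-affine dispatch with round-robin fallback."""

    def __init__(self, num_workers, queue_depth):
        self.queues = [deque() for _ in range(num_workers)]
        self.depth = queue_depth
        self.tag_to_worker = {}  # Task Tag -> worker holding the last task with that tag
        self.next_worker = 0     # round-robin cursor

    def _pick_round_robin(self):
        n = len(self.queues)
        for offset in range(n):
            w = (self.next_worker + offset) % n
            if len(self.queues[w]) < self.depth:
                self.next_worker = (w + 1) % n
                return w
        return None  # every queue is full

    def dispatch(self, tag, task):
        preferred = self.tag_to_worker.get(tag)
        has_predecessor = preferred is not None
        if has_predecessor and len(self.queues[preferred]) < self.depth:
            worker = preferred                 # keep dependent tasks together
        else:
            worker = self._pick_round_robin()  # spread load evenly
            if worker is None:
                return None
        state = "Hold" if has_predecessor else "Running"
        self.queues[worker].append((tag, task, state))
        self.tag_to_worker[tag] = worker
        return worker

d = Dispatcher(num_workers=2, queue_depth=2)
placements = [d.dispatch(tag, lsn) for tag, lsn in
              [("p1", 1), ("p2", 2), ("p1", 3), ("p1", 4)]]
print(placements)
```

    The fourth task shares a tag with worker 0's tasks, but that queue is already full, so it falls back to round-robin and lands on worker 1 in the Hold state.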

    image.png

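    Process Group

    The parallel execution framework targets tasks of the same type, which share the same Task Node data structure. When the process group is initialized, a SchedContext is configured with function pointers for executing the actual tasks:

    • TaskStartup: the initialization that a process performs before executing tasks
    • TaskHandler: executes the actual task for the Task Node passed in
    • TaskCleanup: the cleanup that a process performs before exiting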

    image.png

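    A process in the process group fetches a Task Node from its ring queue. If the Task Node is in the Hold state, it is appended to the tail of the Hold List. If the Task Node is in the Running state, TaskHandler is called to execute it. If TaskHandler fails, the number of scheduling rounds the Task Node must wait before being retried is set (3 by default), and the Task Node is inserted at the head of the Hold List.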

    image.png

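    A process searches the Hold List from the head first to find an executable task. If a task is in the Running state and its wait count is 0, the task is executed. If a task is in the Running state but its wait count is greater than 0, the wait count is decremented by 1.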
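    The worker loop described above might look roughly as follows. This is one possible reading of the text, not the actual implementation: `TaskNode`, `worker_step`, and `run_node` are hypothetical names, and the transition from Hold to Running (done elsewhere in the framework) is not modeled.

```python
from collections import deque
from dataclasses import dataclass

RETRY_WAIT = 3  # default number of scheduling rounds to wait before a retry

@dataclass
class TaskNode:
    name: str
    state: str = "Running"  # only "Running", "Hold", "Finished" appear in this sketch
    wait: int = 0

def run_node(node, handler, hold_list):
    if handler(node):
        node.state = "Finished"
    else:
        node.wait = RETRY_WAIT
        hold_list.appendleft(node)  # a failed task goes to the head of the Hold List

def worker_step(ring_queue, hold_list, handler):
    """One scheduling round of a worker process."""
    # The Hold List is searched first, starting from its head.
    for node in list(hold_list):
        if node.state != "Running":
            continue              # still waiting for a predecessor
        if node.wait > 0:
            node.wait -= 1        # not yet due for a retry
            continue
        hold_list.remove(node)
        run_node(node, handler, hold_list)
        return
    # Otherwise take the next Task Node from the ring queue.
    if ring_queue:
        node = ring_queue.popleft()
        if node.state == "Hold":
            hold_list.append(node)  # a dependent task goes to the tail
        else:
            run_node(node, handler, hold_list)

attempts = {}
def flaky_handler(node):
    # Fails on the first attempt, succeeds afterwards.
    attempts[node.name] = attempts.get(node.name, 0) + 1
    return attempts[node.name] > 1

task = TaskNode("a")
ring, hold = deque([task]), deque()
for _ in range(5):
    worker_step(ring, hold, flaky_handler)
print(task.state, attempts)
```

    The first round fails and parks the task, three rounds count the wait down to 0, and the fifth round retries it successfully.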

    image.png

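    Parallel WAL Replay

    As described in the LogIndex chapter, LogIndex data records the mapping between WAL records and the data blocks they modify, and it can be searched by LSN. For this reason, PolarDB applies the parallel task execution framework above to the continuous WAL replay on Standby nodes and uses LogIndex data to parallelize the WAL replay tasks, which speeds up data synchronization on Standby nodes.

    Workflow

    • Startup process: after parsing WAL records, it only builds LogIndex data and does not actually replay the WAL;
    • LogIndex BGW replay process: acts as the Dispatcher process of the parallel task execution framework. It searches LogIndex data by LSN, builds replay subtasks, and assigns them to the parallel replay process group;
    • Processes in the parallel replay process group: execute replay subtasks, each of which replays a single WAL record on a data block;
    • Backend process: when it actively reads a data block, it searches LogIndex data by PageTag to obtain the list of LSNs that modified the block, and replays the complete record chain on the block.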

    image.png

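    • The Dispatcher process searches LogIndex data by LSN, enumerates PageTags and their LSNs in LogIndex insertion order, and builds {LSN -> PageTag} pairs that form the Task Nodes;
    • The PageTag is used as the Task Tag of the Task Node;
    • The enumerated Task Nodes are dispatched to the worker processes of the process group in the parallel execution framework for replay.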

    image.png

    Usage

    Add the following parameter to postgresql.conf on the Standby node to enable the feature:

    polar_enable_parallel_replay_standby_mode = ON

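    DataMax Log Node

    V11 / v1.1.6-

    玊于

    2022/11/17

    30 min

    Terminology

    • RPO (Recovery Point Objective): the amount of data loss that a business system can tolerate.
    • AZ (Availability Zone): an area within a region whose power and network are independent of other areas, so that faults are isolated between availability zones.

    Background

    In high-availability scenarios, the primary and standby databases must use synchronous replication to guarantee RPO = 0. When the primary and standby are far apart, however, synchronous replication incurs high latency and significantly affects the performance of the primary. Asynchronous replication has little effect on the primary's performance but can lose some data. PolarDB for PostgreSQL uses a one-writer, multi-reader architecture based on shared storage and can provide high availability within an AZ, across AZs, and across regions. To reduce the impact of log synchronization on the primary, PolarDB for PostgreSQL introduces the DataMax node. For cross-AZ or even cross-region synchronization, a DataMax node can act as a relay for the primary's logs, achieving zero data loss at relatively low cost while reducing the impact of log synchronization on the primary's performance.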

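    How It Works

    DataMax High-Availability Architecture

    PolarDB for PostgreSQL synchronizes data between the primary and standby through physical streaming replication, which can run in either synchronous mode or asynchronous mode:

    • Asynchronous mode: a transaction commit on the primary only waits for the corresponding WAL to be written to the local disk file before proceeding, so the state of the standby does not affect the primary's performance. Asynchronous mode cannot guarantee RPO = 0, however. The standby lags behind the primary to some degree, and if the cluster hosting the primary fails, switching over to the standby may lose data;
    • Synchronous mode: synchronization between the primary and standby has several levels. After standby synchronization is enabled through the synchronous_standby_names parameter, the level can be set through the synchronous_commit parameter:
      • remote_write: a transaction commit on the primary waits until the corresponding WAL has been written to the primary's disk file and to the standby's operating system cache before proceeding;
      • on: a transaction commit on the primary waits until the corresponding WAL has been written to the disk files of both the primary and the standby before proceeding;
      • remote_apply: a transaction commit on the primary waits until the corresponding WAL has been written to the disk files of both the primary and the standby, and the standby has replayed the WAL so that the transaction is visible to queries on the standby, before proceeding.

    Synchronous mode ensures that a transaction commit on the primary proceeds only after the standby has received the corresponding WAL data. This gives zero data loss between primary and standby and guarantees RPO = 0. In this mode, however, whether a commit on the primary can proceed depends on the standby receiving the WAL. When the primary and standby are far apart and transmission latency is high, synchronous mode affects the primary's performance. In the extreme case where the standby crashes, the primary blocks waiting for it and cannot serve requests.

    To address the large performance impact of synchronous replication in the traditional primary/standby setup, PolarDB for PostgreSQL adds the DataMax node for remote synchronization. The high-availability architecture in this mode is shown below: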

    dma-arch

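    In this architecture:

    1. A database cluster is deployed within one availability zone. Different clusters act as disaster-recovery peers for each other and provide cross-AZ / cross-region high availability in a primary/standby arrangement;
    2. A single database cluster uses the one-writer, multi-reader architecture. The Primary node and Replica nodes share the same storage, which lowers storage cost. Replica nodes also provide high availability for compute nodes within a single AZ;
    3. The DataMax node is deployed in the same availability zone as the cluster's Primary node:
      • The DataMax node only receives and stores the Primary node's WAL files. It does not replay the logs and does not store the Primary node's data files, which lowers storage cost;
      • The DataMax node does not share data with the Primary node, and their storage devices are isolated from each other. This prevents a storage failure in the compute cluster from losing the logs held by the Primary node and the DataMax node at the same time;
      • The DataMax node and the Primary node use synchronous replication, ensuring RPO = 0. The DataMax node is deployed close to the Primary node, usually in the same availability zone, to minimize the performance impact of log synchronization on the Primary node;
      • The DataMax node sends the WAL it receives to Standby nodes in other availability zones. A Standby node receives and replays the logs from the DataMax node to stay in sync with the Primary node. Asynchronous streaming replication can be used between the Standby node and the DataMax node, and the DataMax node offloads the cost of the Primary node sending WAL to multiple standby databases.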

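    DataMax Implementation

    DataMax is a new node role. Users mark a node as a DataMax node through its configuration file. In DataMax mode, after the Startup process finishes replaying the DataMax node's own logs, it moves from PM_HOT_STANDBY to PM_DATAMAX. In PM_DATAMAX mode, the Startup process only handles the relevant signals and state and notifies the Postmaster process to start streaming replication. It no longer replays logs. The DataMax node therefore does not store the Primary node's data files, which lowers storage cost.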

    datamax-impl

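    As shown in the figure above, the DataMax node uses a WalReceiver process to initiate streaming replication from the Primary node, and receives and stores the WAL sent by the Primary node. It also uses a WalSender process to send the received WAL to the remote standby node. After the standby node receives the WAL, it notifies its Startup process to replay it, which keeps the standby node in sync with the Primary node.

    The DataMax node adds a polar_datamax/ directory to the data directory to store the WAL received from the primary. The DataMax node's own WAL is still stored in the original directory, so the two sets of WAL do not overwrite each other, and the DataMax node can also hold data of its own.

    Because the DataMax node does not replay the Primary node's logs, the starting position of the log becomes a question when the DataMax node has to restart and recover after a failure. The DataMax node stores the relevant positions in the polar_datamax_meta metadata file and uses them to determine where to start:

    • Initial deployment: in a fresh deployment or when a DataMax node is rebuilt, no position information exists. When requesting streaming replication from the primary, the node must identify itself as a DataMax node and also pass the InvalidXLogRecPtr position, indicating that it needs to replicate from the oldest position currently available on the Primary node. On receiving a streaming replication request with InvalidXLogRecPtr, the Primary node starts sending WAL from the oldest complete WAL segment file and sets the restart_lsn of the corresponding replication slot to that position;
    • Recovery after failure: the metadata file is read from storage to determine the position, and streaming replication is requested starting from that position.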

    datamax-impl-dir

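    DataMax Cluster High Availability

    As shown in the figure below, once a DataMax node is added, if the Primary node and Replica nodes fail at the same time, or the storage cannot provide service, a Standby node in a different availability zone can be promoted to Primary to keep the service available. Before the Standby node is promoted and starts serving, the system checks whether it has pulled all logs from the DataMax node, and promotes it only after it has obtained all of them. Because the DataMax node and the Primary node use synchronous replication, RPO = 0 is guaranteed in this scenario.

    In addition, when the DataMax node cleans up logs, it retains not only the WAL files that downstream Standby nodes have not yet received, but also the WAL files that the upstream Primary node has not yet deleted. This prevents the backup system from being unable to obtain the logs that the Primary node has beyond those on the DataMax node after the Primary node fails, and so preserves the integrity of the cluster's data.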

    datamax-ha

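    If the DataMax node fails, recovery by restart is attempted first. If the restart fails, the node is rebuilt. Because the storage of the DataMax node and the Primary node is isolated, their data does not affect each other. The DataMax node can also use the compute-storage separation architecture, so that a failure of the DataMax node does not lose the WAL data it stores.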

    datamax-restart

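    Similarly, the DataMax node implements the following log synchronization modes, which users can configure according to their business needs:

    • Maximum protection mode: the DataMax node and the Primary node use synchronous replication, ensuring RPO = 0. If the DataMax node cannot provide service because of a network or hardware failure, the Primary node is also blocked and cannot serve requests;
    • Maximum performance mode: the DataMax node and the Primary node use asynchronous replication. The DataMax node does not affect the Primary node's performance, and a DataMax failure does not affect the Primary node's service. If the Primary node's storage or cluster fails, data may be lost, so RPO = 0 cannot be guaranteed;
    • Maximum availability mode
      • While the DataMax node works normally, it uses synchronous replication with the Primary node, which is the maximum protection mode;
      • If the DataMax node fails, the Primary node automatically downgrades the synchronization mode to maximum performance mode to keep the Primary node's service continuously available;
      • After the DataMax node recovers, the Primary node upgrades from maximum performance mode back to maximum protection mode to avoid the possibility of losing WAL data.

    In summary, the DataMax log relay node lowers log synchronization latency and offloads the Primary node's log transmission load. With stable performance, it provides cross-AZ / cross-region high availability with RPO = 0.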

    User Guide

    Initializing the DataMax Node Directory

    The system identifier of the Primary node must be specified when initializing a DataMax node:

    # Get the system identifier of the Primary node
    ~/tmp_basedir_polardb_pg_1100_bld/bin/pg_controldata -D ~/primary | grep 'system identifier'

    # Create the DataMax node
    # The [primary_system_identifier] passed to -i is the Primary node's system identifier from the previous step
    ~/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D datamax -i [primary_system_identifier]

    # If needed, initialize the DataMax node's shared storage in the same way as for the Primary node
    sudo pfs -C disk mkdir /nvme0n1/dm_shared_data
    sudo ~/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh ~/datamax/ /nvme0n1/dm_shared_data/

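    Loading the Operations Extension

    Start the DataMax node as a writable node and create a user and the extension to simplify later operations. A DataMax node is read-only by default and cannot create users or extensions, so this is done before the node is configured as a DataMax node.

    ~/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D ~/datamax

    Create the administrative account and the extension:

    postgres=# create user test superuser;
    CREATE ROLE
    postgres=# create extension polar_monitor;
    CREATE EXTENSION

    Stop the DataMax node:

    ~/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl stop -D ~/datamax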

    Configuring and Starting the DataMax Node

    Add the polar_datamax_mode parameter to recovery.conf on the DataMax node to mark the node as a DataMax node:

    polar_datamax_mode = standalone
    recovery_target_timeline='latest'
    primary_slot_name='datamax'
    primary_conninfo='host=[primary node IP] port=[primary node port] user=[$USER] dbname=postgres application_name=datamax'

    Start the DataMax node:

    ~/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D ~/datamax

    Checking the DataMax Node

    On the DataMax node itself, the polar_get_datamax_info() function shows whether the node is running normally:

    postgres=# SELECT * FROM polar_get_datamax_info();
     min_received_timeline | min_received_lsn | last_received_timeline | last_received_lsn | last_valid_received_lsn | clean_reserved_lsn | force_clean
    -----------------------+------------------+------------------------+-------------------+-------------------------+--------------------+-------------
                         1 | 0/40000000       |                      1 | 0/4079DFE0        | 0/4079DFE0              | 0/0                | f
    (1 row)

    On the Primary node, the state of the corresponding replication slot can be viewed through pg_replication_slots:

    postgres=# SELECT * FROM pg_replication_slots;
     slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
    -----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
     datamax   |        | physical  |        |          | f         | t      |     124551 |  570 |              | 0/4079DFE0  |
    (1 row)

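    Configuring the Log Synchronization Mode

    The log synchronization mode of the downstream DataMax node is set in postgresql.conf on the Primary node:

    Maximum protection mode, where datamax is the name of the replication slot created on the Primary node:

    polar_enable_transaction_sync_mode = on
    synchronous_commit = on
    synchronous_standby_names = 'datamax'

    Maximum performance mode:

    polar_enable_transaction_sync_mode = on
    synchronous_commit = on

    Maximum availability mode:

    • The polar_sync_replication_timeout parameter sets the synchronization timeout threshold, in milliseconds. When the wait for the synchronous replication lock exceeds this threshold, synchronous replication is downgraded to asynchronous replication;
    • The polar_sync_rep_timeout_break_lsn_lag parameter sets the lag threshold for restoring synchronization, in bytes. When the asynchronous replication lag falls below this threshold, asynchronous replication is restored to synchronous replication.

    polar_enable_transaction_sync_mode = on
    synchronous_commit = on
    synchronous_standby_names = 'datamax'
    polar_sync_replication_timeout = 10s
    polar_sync_rep_timeout_break_lsn_lag = 8kB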
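    The downgrade and recovery rule of the maximum availability mode can be pictured as a small state function. This is a sketch of the rule as stated in the two parameter descriptions, not of PolarDB's internal logic; the function name and argument names are hypothetical.

```python
def next_sync_mode(current, wait_ms, lag_bytes, timeout_ms, break_lag_bytes):
    """Return the replication mode after one check.

    timeout_ms      plays the role of polar_sync_replication_timeout
    break_lag_bytes plays the role of polar_sync_rep_timeout_break_lsn_lag
    """
    if current == "sync" and wait_ms > timeout_ms:
        return "async"   # DataMax is not responding in time: downgrade
    if current == "async" and lag_bytes < break_lag_bytes:
        return "sync"    # DataMax has almost caught up: restore sync
    return current

# With the example settings above: timeout 10s, lag threshold 8kB.
print(next_sync_mode("sync", 12_000, 0, 10_000, 8 * 1024))
print(next_sync_mode("async", 0, 4 * 1024, 10_000, 8 * 1024))
```

    The first call models a commit that has waited longer than the timeout, and the second models a DataMax node whose lag has dropped below the threshold.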

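    Flashback Table and Flashback Log

    V11 / v1.1.22-

    恒亦

    2022/11/23

    20 min

    Overview

    File systems currently cannot guarantee atomic reads and writes at the granularity of a database page. If a device loses power during a page I/O, the page data can be corrupted or lost. While implementing flashback table, we found that periodically saving old versions of data pages and replaying WAL on top of them yields the data page at any point in time, which also solves the torn-page (partial write) problem. Compared with PostgreSQL's native Full Page Write, this approach is not on the critical path of transaction commit, so performance improves by roughly 30% to 100%. The larger the instance and the heavier the load, the more noticeable the effect.

    The Flashback Log stores compressed old versions of data pages. It solves the torn-page problem as follows:

    1. For each buffer in Shared Buffer, when the page is modified for the first time after each Flashback Point, a Flashback Log record is written to save that version of the data page
    2. The Flashback Log is flushed to disk sequentially
    3. A log index for the Flashback Log is maintained to quickly look up the Flashback Log records for a given data page

    When a torn page is encountered (the data page checksum is incorrect), the log index is used to quickly find the page's Flashback Log record, from which a correct old version of the page is obtained to replace the damaged page. This feature can be used on any device whose file system cannot guarantee atomic reads and writes at 8kB granularity. Note that enabling it causes some performance degradation.

    The Flashback Table feature periodically saves data page snapshots to the flashback log and saves transaction information to the fast recovery area, which allows users to restore a table's data as of some point in time into a new table.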

    Usage

    Syntax

    FLASHBACK TABLE
        [ schema. ]table
        TO TIMESTAMP expr;

    Example

    Prepare test data. Create the table test and insert data:

    CREATE TABLE test(id int);
    INSERT INTO test select * FROM generate_series(1, 10000);

    View the inserted data:

    polardb=# SELECT count(1) FROM test;
     count
    -------
     10000
    (1 row)

    polardb=# SELECT sum(id) FROM test;
       sum
    ----------
     50005000
    (1 row)

    Wait 10 seconds and delete the table data:

    SELECT pg_sleep(10);
    DELETE FROM test;

    The table is now empty:

    polardb=# SELECT * FROM test;
     id
    ----
    (0 rows)

    Flash the table back to its data from 10 seconds ago:

    polardb=# FLASHBACK TABLE test TO TIMESTAMP now() - interval'10s';
    NOTICE:  Flashback the relation test to new relation polar_flashback_65566, please check the data
    FLASHBACK TABLE

    Check the data in the flashback table:

    polardb=# SELECT count(1) FROM polar_flashback_65566;
     count
    -------
     10000
    (1 row)

    polardb=# SELECT sum(id) FROM polar_flashback_65566;
       sum
    ----------
     50005000
    (1 row)

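    Practical Guide

    Flashback table depends on the flashback log and the fast recovery area. The polar_enable_flashback_log and polar_enable_fast_recovery_area parameters must be set, followed by a restart. Other parameters should also be adjusted as needed. It is recommended to make all changes at once and restart during off-peak hours. Enabling flashback table increases memory and disk usage and causes some performance loss, so evaluate carefully before using it.

    Memory Usage

    Enabling the flashback log increases shared memory by the sum of the following three items:

    • polar_flashback_log_buffers * 8kB
    • polar_flashback_logindex_mem_size MB
    • polar_flashback_logindex_queue_buffers MB

    Enabling the fast recovery area increases shared memory by about 32kB. Evaluate the current state of the instance before adjusting the parameters.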
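    The sum above is easy to evaluate for a concrete configuration. In the sketch below the function name is hypothetical, the first two arguments are the documented defaults (2048 buffers and 64 MB), and the 128 MB used for polar_flashback_logindex_queue_buffers is an assumed value for illustration only, since its default is not listed here.

```python
def flashback_extra_shared_memory_mb(log_buffers, logindex_mem_mb, queue_buffers_mb):
    """Extra shared memory (MB) needed by the flashback log.

    log_buffers is counted in 8kB units; the other two are already in MB.
    """
    log_buffer_mb = log_buffers * 8 / 1024
    return log_buffer_mb + logindex_mem_mb + queue_buffers_mb

extra_mb = flashback_extra_shared_memory_mb(2048, 64, 128)
print(extra_mb)
```

    The roughly 32kB added by the fast recovery area is negligible next to these figures.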

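    Disk Usage

    To be able to flash back to a given point in the past, the flashback log and WAL for that period must be retained, together with the LogIndex files of both, which increases disk usage. In principle, the larger polar_fast_recovery_area_rotation is, the more disk space is used. If polar_fast_recovery_area_rotation is set to 300, 5 hours of history are kept.

    With the flashback log enabled, a Flashback Point is taken periodically. A flashback point is a kind of checkpoint. When a checkpoint is triggered, the polar_flashback_point_segments and polar_flashback_point_timeout parameters are checked to decide whether the current checkpoint is a flashback point. It is therefore recommended to:

    • Set polar_flashback_point_segments to a multiple of max_wal_size
    • Set polar_flashback_point_timeout to a multiple of checkpoint_timeout

    Suppose 20GB of WAL is generated in 5 hours and the ratio of flashback log to WAL is about 1:20. About 1GB of flashback log is then generated. The ratio of flashback log to WAL depends on the following two factors:

    • In the workload, the more writes there are, the more flashback log is generated
    • The larger polar_flashback_point_segments and polar_flashback_point_timeout are set, the less flashback log is generated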
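    The two estimates in this section reduce to simple arithmetic. The helper names below are hypothetical, and the 1:20 ratio is only the rough figure quoted above, which varies with the workload and the flashback point settings.

```python
def retention_hours(rotation_minutes):
    """History kept for a given polar_fast_recovery_area_rotation (minutes)."""
    return rotation_minutes / 60

def flashback_log_estimate_gb(wal_gb, wal_per_flashback=20):
    """Rough flashback log volume for a given WAL volume at a 1:20 ratio."""
    return wal_gb / wal_per_flashback

hours = retention_hours(300)
log_gb = flashback_log_estimate_gb(20)
print(hours, log_gb)
```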

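    Performance Impact

    The flashback log feature adds two background processes that consume the flashback log, which inevitably increases CPU overhead. The polar_flashback_log_bgwrite_delay and polar_flashback_log_insert_list_delay parameters can be raised so that the two background processes work at longer intervals and use less CPU, but this may reduce performance somewhat. Using the default values is recommended.

    In addition, the flashback log must be flushed before the corresponding page is flushed, so that no flashback log is lost, which may also reduce performance. In current tests, the performance drop does not exceed 5% in most scenarios.

    During a table flashback, the pages of the target table are swapped in and out of the shared buffer pool, which may cause performance jitter for other database operations.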

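    Limitations

    Flashback table currently restores the data of the target table into a new table named polar_flashback_<OID of the target table>. After a FLASHBACK TABLE statement is executed, a NOTICE like the following is shown:

    polardb=# flashback table test to timestamp now() - interval '1h';
    NOTICE:  Flashback the relation test to new relation polar_flashback_54986, please check the data
    FLASHBACK TABLE

    Here polar_flashback_54986 is the temporary table produced by the flashback. Only the table data is restored to the target time. Only ordinary tables can currently be flashed back. The following database objects are not supported:

    • Indexes
    • Toast tables
    • Materialized views
    • Partitioned tables / partitions
    • System tables
    • Foreign tables
    • Tables that have a toast table

    In addition, a table cannot be flashed back if certain DDL has been executed on it between the target time and now:

    • DROP TABLE
    • ALTER TABLE SET WITH OIDS
    • ALTER TABLE SET WITHOUT OIDS
    • TRUNCATE TABLE
    • Changing a column's type, when the old and new types are not directly implicitly convertible and the change does not use a USING clause that is a safe cast requiring no additional values
    • Changing the table to UNLOGGED or LOGGED
    • Adding an IDENTITY column
    • Adding a column with constraints
    • Adding a column whose default expression contains a volatile function

    A table removed by DROP TABLE can be recovered with the flashback drop feature of PolarDB for PostgreSQL/Oracle.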

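    Recommendations

    When data has been changed by mistake, first use the audit log to quickly locate the time of the erroneous operation, then flash the target table back to a point before that time. During a table flashback an exclusive lock is held on the target table, so only queries can be run against it. The pages of the target table are also swapped in and out of the shared buffer pool, which may cause performance jitter for other database operations. It is therefore recommended to run flashback during off-peak hours.

    Flashback speed depends on the size of the table. For a large table, the polar_workers_per_flashback_table parameter can be increased to add parallel flashback workers and save time.

    After the flashback finishes, follow the NOTICE to query the flashback table and compare its data with the original table. The flashback table has no indexes. Users can create indexes as needed for their queries. Once the comparison is done, the missing data can be written back to the original table.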

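    Parameter Reference

    Parameter | Description | Range | Default | How to apply
    polar_enable_flashback_log | Whether to enable the flashback log | on / off | off | Restart after editing the configuration file
    polar_enable_fast_recovery_area | Whether to enable the fast recovery area | on / off | off | Restart after editing the configuration file
    polar_flashback_log_keep_segments | Number of flashback log files to retain (reusable); each file is 256MB | [3, 2147483647] | 8 | SIGHUP
    polar_fast_recovery_area_rotation | How long transaction information is retained in the fast recovery area, in minutes, i.e. how far back a table can be flashed back | [1, 14400] | 180 | SIGHUP
    polar_flashback_point_segments | Minimum number of WAL files between two flashback points; each WAL file is 1GB | [1, 2147483647] | 16 | SIGHUP
    polar_flashback_point_timeout | Minimum interval between two flashback points, in seconds | [1, 86400] | 300 | SIGHUP
    polar_flashback_log_buffers | Shared memory size for the flashback log, in units of 8kB | [4, 262144] | 2048 (16MB) | Restart after editing the configuration file
    polar_flashback_logindex_mem_size | Shared memory size for the flashback log index, in MB | [3, 1073741823] | 64 | Restart after editing the configuration file
    polar_flashback_logindex_bloom_blocks | Number of bloom filter pages for the flashback log index | [8, 1073741823] | 512 | Restart after editing the configuration file
    polar_flashback_log_insert_locks | Number of flashback log insert locks | [1, 2147483647] | 8 | Restart after editing the configuration file
    polar_workers_per_flashback_table | Number of parallel workers for flashback table | [0, 1024] (0 disables parallelism) | 5 | Immediately
    polar_flashback_log_bgwrite_delay | Work interval of the flashback log bgwriter process, in ms | [1, 10000] | 100 | SIGHUP
    polar_flashback_log_flush_max_size | Amount of flashback log flushed per round by the flashback log bgwriter process, in kB | [0, 2097152] (0 means unlimited) | 5120 | SIGHUP
    polar_flashback_log_insert_list_delay | Work interval of the flashback log bginserter process, in ms | [1, 10000] | 10 | SIGHUP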

    High Availability


    Resource Manager

    V11 / v1.1.1-

    学有

    2022/11/25

    20 min

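    Background

    The memory of PolarDB for PostgreSQL can be divided into three parts:

    • Shared memory
    • Inter-process dynamic shared memory
    • Process-private memory

    Inter-process dynamic shared memory and process-private memory are allocated dynamically, and their usage changes continuously with the workload running on the instance. Excessive use of dynamic memory can push memory usage past the operating system limit and trigger the kernel's memory limiting mechanism, causing instance processes to exit abnormally and the instance to restart and become unavailable.

    The process-private memory managed by MemoryContext can be divided into two parts:

    • Working memory: the memory needed to run the workload; this part affects whether the workload runs correctly;
    • Cache memory: the database keeps some internal metadata inside each process; this part only affects database performance.

    Goal

    To address the problem above, PolarDB for PostgreSQL adds the Resource Manager resource limiting mechanism, which periodically checks resource usage while the instance is running. Processes that exceed the resource limit thresholds are forcibly limited, which lowers the risk of the instance becoming unavailable.

    The main resources that Resource Manager is designed to limit are:

    • Memory
    • CPU
    • I/O

    Currently only memory limiting is supported.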

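    How Memory Limiting Works

    Memory limiting depends on Cgroup. Without Cgroup, resources cannot be limited effectively. Resource Manager runs as a background auxiliary process of PolarDB for PostgreSQL and periodically reads Cgroup memory usage data as the basis for memory limiting. When it finds that the memory limit threshold has been exceeded, it reads the kernel's memory accounting for user processes, sorts the processes by memory size, and sends a terminate signal (SIGTERM) or a cancel signal (SIGINT) in turn to the processes whose memory usage exceeds the threshold.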

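    Memory Limiting Configuration

    The Resource Manager daemon is started with the instance and takes effect on RW, RO, and Standby nodes. Its behavior can be changed through the following parameters.

    • enable_resource_manager: whether to start Resource Manager; on / off, default on
    • stat_interval: interval of the periodic resource usage check, in milliseconds; range 10-10000, default 500
    • total_mem_limit_rate: the percentage of instance memory that may be used; once instance memory usage exceeds this percentage, memory limiting is enforced; default 95
    • total_mem_limit_remain_size: the amount of instance memory to keep in reserve; once free instance memory falls below this value, memory limiting is enforced; in kB, range 131072-MAX_KILOBYTES (the maximum integer value), default 524288
    • mem_release_policy: the memory limiting policy
      • none: no action
      • default: the default policy; terminates idle processes first, then active processes
      • cancel_query: interrupts active processes
      • terminate_idle_backend: terminates idle processes
      • terminate_any_backend: terminates all processes
      • terminate_random_backend: terminates random processes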
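    A simplified picture of the selection step: compare total usage against the percentage threshold and walk the processes from largest to smallest. The function name and the stop-when-below-threshold rule are assumptions made for this sketch. The real process additionally reads Cgroup and kernel accounting data and applies mem_release_policy.

```python
def pick_victims(total_used, total_limit, procs, limit_rate=95):
    """Choose processes to signal, largest memory consumer first.

    procs: list of (pid, memory_used) pairs.
    limit_rate mirrors total_mem_limit_rate (a percentage, default 95).
    Stops as soon as the projected usage is back under the threshold.
    """
    threshold = total_limit * limit_rate / 100
    victims = []
    for pid, used in sorted(procs, key=lambda p: p[1], reverse=True):
        if total_used <= threshold:
            break
        victims.append(pid)
        total_used -= used
    return victims

# 980 of 1000 units in use; the 95% threshold is 950.
victims = pick_victims(980, 1000, [(101, 10), (102, 50), (103, 5)])
print(victims)
```

    Here signalling the largest process alone brings projected usage to 930, so the smaller processes are left untouched.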

    Effect of Memory Limiting

    2022-11-28 14:07:56.929 UTC [18179] LOG:  [polar_resource_manager] terminate process 13461 release memory 65434123 bytes
    2022-11-28 14:08:17.143 UTC [35472] FATAL:  terminating connection due to out of memory
    2022-11-28 14:08:17.143 UTC [35472] BACKTRACE:
            postgres: primary: postgres postgres [local] idle(ProcessInterrupts+0x34c) [0xae5fda]
            postgres: primary: postgres postgres [local] idle(ProcessClientReadInterrupt+0x3a) [0xae1ad6]
            postgres: primary: postgres postgres [local] idle(secure_read+0x209) [0x8c9070]
            postgres: primary: postgres postgres [local] idle() [0x8d4565]
            postgres: primary: postgres postgres [local] idle(pq_getbyte+0x30) [0x8d4613]
            postgres: primary: postgres postgres [local] idle() [0xae1861]
            postgres: primary: postgres postgres [local] idle() [0xae1a83]
            postgres: primary: postgres postgres [local] idle(PostgresMain+0x8df) [0xae7949]
            postgres: primary: postgres postgres [local] idle() [0x9f4c4c]
            postgres: primary: postgres postgres [local] idle() [0x9f440c]
            postgres: primary: postgres postgres [local] idle() [0x9ef963]
            postgres: primary: postgres postgres [local] idle(PostmasterMain+0x1321) [0x9ef18a]
            postgres: primary: postgres postgres [local] idle() [0x8dc1f6]
            /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f888afff445]
            postgres: primary: postgres postgres [local] idle() [0x49d209]

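    Adaptive Scan

    V11 / v1.1.17-

    步真

    2022/09/21

    25 min

    Background

    PolarDB for PostgreSQL supports ePQ, the elastic cross-node parallel query feature, which uses the computing power of multiple nodes in the cluster to run queries in parallel across nodes. ePQ can parallelize many physical operators across nodes, including sequential scan and index scan. For the sequential scan operator, ePQ provides two scan modes: adaptive scan mode and non-adaptive scan mode.

    Terminology

    • QC: Query Coordinator, the process role that initiates an ePQ parallel query.
    • PX Worker: the worker process role that takes part in an ePQ cross-node parallel query.
    • Worker ID: the number that uniquely identifies a PX Worker.
    • Disk Unit ID: the identifier of a Disk Unit, the smallest storage unit of an ePQ cross-node parallel scan, 4MB by default.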

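    Feature Description

    Non-Adaptive Scan

    Non-adaptive scan mode is the default scan method of the ePQ sequential scan operator (Sequential Scan). Every PX Worker taking part in a parallel query is assigned a unique Worker ID during execution. Non-adaptive scan mode partitions the Disk Unit IDs of the table's physical storage by Worker ID, so that each PX Worker scans an even share of the table's storage units on shared storage. The scan results of all PX Workers are finally combined into the full data set.

    Adaptive Scan

    In non-adaptive scan mode, scan units are divided evenly among the PX Workers. When some read-only nodes are short of computing resources, the scan can become skewed: a single parallel query issued by the user takes a long time to finish because it is held back by the nodes that lack resources and cannot complete their scan tasks.

    The adaptive scan mode provided by ePQ solves this problem. It no longer restricts each PX Worker to specific Disk Unit IDs. Instead it uses a request-response model: through a dedicated RPC mechanism between the QC process and the PX Worker processes, the QC process tells each PX Worker process which scan tasks it may execute, which eliminates the skew.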

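    Design

    Non-Adaptive Scan

    When the QC process starts a parallel query, it assigns each PX Worker process a fixed Worker ID. Each PX Worker process takes the storage unit number modulo the number of workers and scans only the Disk Units that belong to its Worker ID.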

    non-adaptive-scan
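    The modulo assignment can be sketched as follows. The function name is hypothetical, and Disk Units are simply numbered from 0 for illustration.

```python
def units_for_worker(worker_id, num_workers, num_units):
    """Disk Unit IDs scanned by one PX Worker under non-adaptive scan."""
    return [u for u in range(num_units) if u % num_workers == worker_id]

# 8 Disk Units split across 3 PX Workers.
assignment = {w: units_for_worker(w, 3, 8) for w in range(3)}
print(assignment)
```

    Every unit is scanned by exactly one worker, and the assignment is fixed in advance. A slow worker therefore delays the whole query, which is the skew that adaptive scan addresses.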

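    Adaptive Scan

    When the QC process starts a parallel query, it starts an adaptive scan thread that receives and handles request messages from the PX Worker processes. The adaptive scan thread tracks the progress of the scan tasks of the current query and, according to the progress of each PX Worker process, assigns the Disk Unit IDs that each PX Worker process should scan. For the last Disk Unit to be scanned, the adaptive scan thread wakes up idle PX Workers to speed up the scan of that final unit.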

    adaptive-scan

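    Message Communication

    Because the adaptive scan thread exchanges little data with the PX Worker processes and does so infrequently, it reuses the existing libpq connections between the QC process and the PX Worker processes for message exchange. The adaptive scan thread uses poll to synchronously check for requests and responses from PX Worker processes when needed.

    Scan Task Coordination

    When a PX Worker process executes the sequential scan operator, it first sends a request to the QC process, passing the following information to the adaptive scan thread on the QC side:

    • The scan task number
    • The scan direction (forward / backward)
    • The number of physical blocks to scan

    On receiving the request, the adaptive scan thread creates the scan task or updates its progress.

    Variable Granularity

    To reduce the number of network round trips caused by requests, ePQ uses a variable task granularity. When much of the scan remains, a PX Worker process claims a larger number of physical blocks per request. When little of the scan remains, it claims correspondingly fewer. This balances network overhead against load balancing.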
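    One way to picture variable granularity is a chunk size that shrinks with the remaining work. The formula below (remaining pages divided by twice the worker count, capped at 512) is purely an assumption for illustration. The actual policy used by ePQ is not specified here.

```python
def next_chunk(remaining_pages, num_workers, max_chunk=512):
    """Number of pages to hand out for the next request (hypothetical policy)."""
    if remaining_pages <= 0:
        return 0
    chunk = remaining_pages // (2 * num_workers)
    # At least one page, at most max_chunk, never more than what is left.
    return min(remaining_pages, max(1, min(max_chunk, chunk)))

# Simulate handing out a 1000-page scan to 2 workers.
remaining, sizes = 1000, []
while remaining > 0:
    size = next_chunk(remaining, 2)
    sizes.append(size)
    remaining -= size
print(sizes[:5], "...", sizes[-3:])
```

    Early requests receive large chunks, which keeps the request count low. The final requests receive single pages, so no worker is left holding a large tail.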

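    Cache Friendliness

    Adaptive scan mode tries to ensure that each node can reuse its Shared Buffer cache across repeated parallel queries, avoiding frequent cache updates and evictions. In the implementation, adaptive scan applies a cache affinity policy based on the node IP addresses configured in the cluster topology view, so that the same physical page is reused by the same node as far as possible.

    Message Design

    • PX Worker request message: sent using the 'S' protocol message of libpq, encoded as a string of key-value pairs.

      Field | Description
      task_id | Scan task number
      direction | Scan direction
      page_count | Total number of physical blocks to scan
      scan_start | Starting physical block number of the scan
      current_page | Physical block number currently being scanned
      scan_round | Number of scan rounds
    • Adaptive scan thread reply message

      Field | Description
      success | Whether the request succeeded
      page_start | Starting physical block number of the response
      page_end | Ending physical block number of the response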

    User Guide

    Create a test table:

    postgres=# CREATE TABLE t(id INT);
    CREATE TABLE
    postgres=# INSERT INTO t VALUES(generate_series(1,100));
    INSERT 0 100

    Non-Adaptive Scan

    Enable ePQ parallel query and set the per-node degree of parallelism to 3. EXPLAIN shows that the plan comes from the PX optimizer. Because two read-only nodes take part in the test, the plan shows an overall degree of parallelism of 6.

    postgres=# SET polar_enable_px = 1;
    SET
    postgres=# SET polar_px_dop_per_node = 3;
    SET
    postgres=# SHOW polar_px_enable_adps;
     polar_px_enable_adps
    ----------------------
     off
    (1 row)

    postgres=# EXPLAIN SELECT * FROM t;
                                      QUERY PLAN
    -------------------------------------------------------------------------------
     PX Coordinator 6:1  (slice1; segments: 6)  (cost=0.00..431.00 rows=1 width=4)
       ->  Partial Seq Scan on t  (cost=0.00..431.00 rows=1 width=4)
     Optimizer: PolarDB PX Optimizer
    (3 rows)

    postgres=# SELECT COUNT(*) FROM t;
     count
    -------
       100
    (1 row)

    Adaptive Scan

    After adaptive scan is switched on, EXPLAIN ANALYZE shows the physical blocks scanned by each PX Worker process.

    postgres=# SET polar_enable_px = 1;
    SET
    postgres=# SET polar_px_dop_per_node = 3;
    SET
    postgres=# SET polar_px_enable_adps = 1;
    SET
    postgres=# SHOW polar_px_enable_adps;
     polar_px_enable_adps
    ----------------------
     on
    (1 row)

    postgres=# SET polar_px_enable_adps_explain_analyze = 1;
    SET
    postgres=# SHOW polar_px_enable_adps_explain_analyze;
     polar_px_enable_adps_explain_analyze
    --------------------------------------
     on
    (1 row)

    postgres=# EXPLAIN ANALYZE SELECT * FROM t;
                                                            QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------
     PX Coordinator 6:1  (slice1; segments: 6)  (cost=0.00..431.00 rows=1 width=4) (actual time=0.968..0.982 rows=100 loops=1)
       ->  Partial Seq Scan on t  (cost=0.00..431.00 rows=1 width=4) (actual time=0.380..0.435 rows=100 loops=1)
             Dynamic Pages Per Worker: [1]
     Planning Time: 5.571 ms
     Optimizer: PolarDB PX Optimizer
       (slice0)    Executor memory: 23K bytes.
       (slice1)    Executor memory: 14K bytes avg x 6 workers, 14K bytes max (seg0).
     Execution Time: 9.047 ms
    (8 rows)

    postgres=# SELECT COUNT(*) FROM t;
     count
    -------
       100
    (1 row)

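    Cluster Topology View

    V11 / v1.1.20-

    烛远

    2022/09/20

    20 min

    Feature Description

    The ePQ elastic cross-node parallel query feature of PolarDB for PostgreSQL can spread a large query across multiple nodes to speed it up. This involves communication between nodes, including distributing the execution plan, controlling execution, and collecting results. The cluster topology view was designed to collect and expose the cluster's topology information to the ePQ component so that cross-node queries can be carried out.

    Terminology

    • RW / Primary: the read-write node, referred to below as Primary
    • RO / Replica: a read-only node, referred to below as Replica
    • Standby: a disaster-recovery node
    • Replication Slot: the PostgreSQL mechanism for persisting a streaming replication relationship

    Usage

    Maintenance of the cluster topology view is fully transparent. Users only need to build a one-writer, multi-reader cluster according to the deployment documentation, and the cluster topology view is maintained correctly. The key requirement is that Replica / Standby nodes are set up with streaming replication slots.

    The cluster topology view can be obtained through the following interface (the output is from PolarDB for PostgreSQL 11):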

    postgres=# SELECT * FROM polar_cluster_info;
     name  |   host    | port | release_date | version | slot_name |  type   | state | cpu | cpu_quota | memory | memory_quota | iops | iops_quota | connection | connection_quota | px_connection | px_connection_quota | px_node
    -------+-----------+------+--------------+---------+-----------+---------+-------+-----+-----------+--------+--------------+------+------------+------------+------------------+---------------+---------------------+---------
     node0 | 127.0.0.1 | 5432 | 20220930     | 1.1.27  |           | RW      | Ready |   0 |         0 |      0 |            0 |    0 |          0 |          0 |                0 |             0 |                   0 | f
     node1 | 127.0.0.1 | 5433 | 20220930     | 1.1.27  | replica1  | RO      | Ready |   0 |         0 |      0 |            0 |    0 |          0 |          0 |                0 |             0 |                   0 | t
     node2 | 127.0.0.1 | 5434 | 20220930     | 1.1.27  | replica2  | RO      | Ready |   0 |         0 |      0 |            0 |    0 |          0 |          0 |                0 |             0 |                   0 | t
     node3 | 127.0.0.1 | 5431 | 20220930     | 1.1.27  | standby1  | Standby | Ready |   0 |         0 |      0 |            0 |    0 |          0 |          0 |                0 |             0 |                   0 | f
    (4 rows)

    • name is the node name, which is generated automatically.
    • host / port give the connection information of the node. In this example they are all local addresses.
    • release_date and version identify the PolarDB version.
    • slot_name is the streaming replication slot the node connects with. Apart from the Primary node, only nodes that connect through a replication slot are included in the view.
    • type is the node type. There are three types:
      • PolarDB for PostgreSQL 11: RW / RO / Standby
      • PolarDB for PostgreSQL 14: Primary / Replica / Standby
    • state is the node state. The possible states are Offline / Going Offline / Disabled / Initialized / Pending / Ready / Unknown. Only nodes in the Ready state can take part in PX computation.
    • px_node indicates whether the node takes part in PX computation.
    • The remaining fields are related to performance metrics collection and are currently left empty.

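    For ePQ queries, only Replica nodes take part by default. Parameters control whether the Primary node or Standby nodes also take part in the computation:

    -- Let the Primary node take part in the computation
    SET polar_px_use_master = ON;

    -- Let Standby nodes take part in the computation
    SET polar_px_use_standby = ON;

    Tip

    Starting with PolarDB for PostgreSQL 14, the polar_px_use_master parameter is renamed to polar_px_use_primary.

    polar_px_nodes can also be used to specify which nodes take part in PX computation. With the cluster topology view above, for example, the following command makes PX queries run only on replica1.

    SET polar_px_nodes = 'node1';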

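    Design and Implementation

    Information Collection

    Cluster topology information is collected through streaming replication. The feature adds new message types to the streaming replication protocol for passing the cluster topology view. Collection has two steps:

    • Replica / Standby nodes send their state to the Primary
    • The Primary aggregates the cluster topology view and returns it to the Replica / Standby nodes

    Update Frequency

    The cluster topology view is not updated and sent on a timer, because the view does not change all the time. It is updated and sent only when a node has just started or when a key state change occurs.

    In the implementation, the global state collected by the Primary node carries a version, generation, which is incremented only when a node topology change is received. The global state is sent to the other nodes only after its version has been updated, and the other nodes apply it locally when they receive it.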
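    The generation mechanism can be illustrated with a toy model. The class and method names are hypothetical, and message passing is reduced to direct function calls.

```python
class PrimaryView:
    """Toy model of the Primary's versioned cluster topology."""

    def __init__(self):
        self.generation = 0
        self.nodes = {}  # node name -> state

    def report(self, name, state):
        """Handle a state report; return True if the view must be broadcast."""
        if self.nodes.get(name) == state:
            return False          # nothing changed, nothing to send
        self.nodes[name] = state
        self.generation += 1      # the topology changed: bump the version
        return True

primary = PrimaryView()
sent = [
    primary.report("node1", "Ready"),    # new node
    primary.report("node1", "Ready"),    # repeated report, no change
    primary.report("node1", "Offline"),  # key state change
]
print(sent, primary.generation)
```

    Only the first and third reports bump the generation and would trigger a broadcast, which matches the change-driven update policy described above.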

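    Generating the Cluster Topology View

    Collected Dimensions

    State metrics:

    • Node name
    • Node host / port
    • Node slot_name
    • Node load (CPU / MEM / connections / IOPS)
    • Node state
      • Offline
      • Going Offline
      • Disabled
      • Initialized
      • Pending
      • Ready
      • Unknown

    Message Format

    Following the approach of the other WAL Sender / WAL Receiver messages, new message types 'm' and 'M' are added to collect node information and to broadcast the cluster topology view.

    Internal Use

    An interface is provided to obtain the list of Replicas, with IP / port and related information, for use by PX queries.

    A number of load-related interfaces are reserved so that the degree of parallelism can be adjusted dynamically according to load. (Not yet wired in.)

    The parameters polar_px_use_master / polar_px_use_standby were also added to include the Primary / Standby in PX computation. They are off by default, because correctness problems are possible: the snapshot may be unusable for reasons such as snapshot format and Vacuum.

    ePQ uses the information above to generate and cache the nodes' connection information, and uses this view in ePQ queries. The cache is reset when generation is updated or when polar_px_nodes / polar_px_use_master / polar_px_use_standby is set, and is regenerated on next use.

    Presenting the Result

    The polar_monitor extension provides a view that exposes the cluster topology view, which can be queried on any node.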


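    ePQ Parallel Acceleration for B-Tree Index Creation

    V11 / v1.1.15-

    棠羽

    2023/09/20

    20 min

    Background

    In PostgreSQL, finding the rows of a table that match a condition by default requires scanning the whole table and evaluating the filter on every row. If very few rows match and the table is very large, this is clearly inefficient. It is similar to reading a book: to read a particular chapter, a reader usually looks up its page number in the table of contents at the front and starts reading from that page. In a database, indexes are usually created on frequently searched columns to avoid costly full table scans, because an index pinpoints the data pages on which the searched data is located.

    PostgreSQL supports several index types. The most widely used is the B-Tree index, which is also the type PostgreSQL creates by default. Creating an index on a large table is time-consuming, because the work involves:

    1. Sequentially scanning every row of the table
    2. Sorting the physical locations of the rows in the table by the values of the columns being indexed (the Scan Key)
    3. Building index tuples, organizing them in the B-Tree structure, and writing them to index pages

    PostgreSQL supports parallel (multi-process scan / sort) and concurrent (not blocking DML) index creation, but it can only use the resources of a single compute node while creating the index.

    The ePQ elastic cross-node parallel query feature of PolarDB-PG can accelerate the creation of B-Tree indexes. ePQ uses the I/O bandwidth of multiple compute nodes to scan the whole table in parallel, and uses the CPU and memory of multiple compute nodes to sort the physical locations of the rows by index column value and build index tuples. Finally, the sorted index tuples are merged in the process creating the index and written to index pages to complete the index.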
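    The scan, per-node sort, and final merge can be imitated in miniature. This sketch only illustrates the sort-then-merge pattern. The function name, the way rows are split across workers, and the (key, tid) tuples are hypothetical.

```python
import heapq

def build_sorted_index_entries(rows, num_workers):
    """Sort (key, tid) pairs on several 'workers', then merge them.

    rows: list of (index key, physical location) pairs in table order.
    Each worker sorts its own slice; the coordinator merges the sorted
    runs into one ordered stream ready to be written to index pages.
    """
    runs = [sorted(rows[w::num_workers]) for w in range(num_workers)]
    return list(heapq.merge(*runs))

rows = [(5, "t0"), (1, "t1"), (3, "t2"), (2, "t3")]
entries = build_sorted_index_entries(rows, num_workers=2)
print(entries)
```

    Because each run is already sorted, the merge needs only one pass over the data, which is why the final step can run in the single process that writes the index.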

    Usage

    Preparing Data

    Create a table with three columns and 1000000 rows:

    CREATE TABLE t (id INT, age INT, msg TEXT);

    INSERT INTO t
    SELECT
        random() * 1000000,
        random() * 10000,
        md5(random()::text)
    FROM generate_series(1, 1000000);

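    Creating an Index

    Creating an index with ePQ takes three steps:

    1. Set the polar_enable_px parameter to ON to switch on ePQ
    2. Set the polar_px_dop_per_node parameter as needed to adjust the degree of query parallelism
    3. Explicitly set the px_build option to ON when creating the index

    SET polar_enable_px TO ON;
    SET polar_px_dop_per_node TO 8;
    CREATE INDEX t_idx1 ON t(id, msg) WITH(px_build = ON);

    While the index is being created, the database takes a ShareLock on the table. A lock at this level blocks DML (INSERT / UPDATE / DELETE) on the table by other processes.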

    Creating the index concurrently

    Similarly, ePQ supports building an index concurrently; just add the CONCURRENTLY keyword after CREATE INDEX:

    SET polar_enable_px TO ON;
    SET polar_px_dop_per_node TO 8;
    CREATE INDEX CONCURRENTLY t_idx2 ON t(id, msg) WITH(px_build = ON);

    While the index is being built, the database takes a ShareUpdateExclusiveLock on the table. A lock of this level does not block DML on the table from other processes.

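    Limitations

    ePQ-accelerated index creation does not yet support the following:

    • Creating a UNIQUE index
    • Creating an index with INCLUDE columns
    • Creating an index with an explicit TABLESPACE
    • Creating an index with a WHERE clause, that is, a partial index (Partial Index)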

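    ePQ: Parallel Acceleration and Bulk Insert for Creating / Refreshing Materialized Views

    V11 / v1.1.30-

    棠羽

    2023/02/08

    10 min

    Background

    A materialized view (Materialized View) is a database object that contains the result of a query. Unlike an ordinary view, a materialized view stores not only the view definition but also a copy of the data as of the time the materialized view was created. When the data in the materialized view no longer matches the data produced by its defining query, the materialized view can be refreshed (Refresh) to bring it back in line with the view definition. In essence, a materialized view precomputes the query in its definition so that the result can be reused at query time.

    The CREATE TABLE AS syntax builds a new table from the data returned by a query; the table structure is identical to the query's output columns.

    The SELECT INTO syntax creates a new table and writes the query's data into it instead of returning the data to the client. The table structure is identical to the query's output columns.

    How it works

    Creating and refreshing a materialized view, and the CREATE TABLE AS / SELECT INTO syntax, require very similar work inside the database, so the PostgreSQL kernel handles them with the same code path. The main steps are:

    1. Data scan: run the query from the view definition or from the CREATE TABLE AS / SELECT INTO statement, scanning the data that satisfies the query
    2. Data write: write the data scanned in the previous step into a new materialized view / table

    PolarDB for PostgreSQL optimizes these two steps with ePQ parallel scan and bulk insert, respectively. When the volume of data to scan or write is large, this can markedly improve the performance of these DDL statements and shorten their execution time:

    1. ePQ parallel scan: the ePQ feature runs the query in the view definition in parallel using the I/O bandwidth and compute resources of multiple compute nodes, raising the utilization of both
    2. Bulk insert: instead of writing each scanned tuple into the table or materialized view one at a time, tuples are accumulated in memory until a certain number is reached and then written in one batch, which reduces WAL logging overhead and how often pages are locked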
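    The batching idea behind bulk insert can be sketched in a few lines of Python. This is an illustrative model only: BulkInserter, batch_size, and the write_batch callback are hypothetical names, and the real batch threshold used by PolarDB is not stated in this document.

```python
class BulkInserter:
    """Toy model: buffer tuples in memory and write them out in batches."""

    def __init__(self, batch_size, write_batch):
        self.batch_size = batch_size
        self.write_batch = write_batch  # called once per batch (one "page lock + WAL write")
        self.buffer = []

    def insert(self, tup):
        self.buffer.append(tup)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write whatever is buffered; a no-op when the buffer is empty.
        if self.buffer:
            self.write_batch(list(self.buffer))
            self.buffer.clear()

batches = []
inserter = BulkInserter(batch_size=4, write_batch=batches.append)
for i in range(10):
    inserter.insert(i)
inserter.flush()  # write the final partial batch
```

    Ten tuples end up in three write calls instead of ten, which is the source of the reduced WAL and page-locking overhead described above.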

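    Usage

    ePQ parallel scan

    Setting the following parameter to ON enables ePQ parallel scan to accelerate the query part of the statements above; its default value is currently ON. The parameter only takes effect when polar_enable_px, the master switch of the ePQ feature, is on.

    SET polar_px_enable_create_table_as = ON;

    Because of a limitation in ePQ, this optimization does not support the CREATE TABLE AS ... WITH OIDS syntax. For that syntax, processing falls back to the built-in PostgreSQL optimizer to plan the query in the DDL definition, and the query is executed by PostgreSQL's single-node executor.

    Bulk insert

    Setting the following parameter to ON enables bulk insert to accelerate the write part of the statements above; its default value is currently ON.

    SET polar_enable_create_table_as_bulk_insert = ON;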

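    ePQ: Viewing and Analyzing Execution Plans

    V11 / v1.1.20-

    渊云、秦疏

    2023/09/06

    30 min

    Background

    PostgreSQL provides the EXPLAIN command for analyzing the performance of SQL statements. It prints the query plan for a statement, along with detailed timing and resource consumption during execution, and can be used to track down SQL performance bottlenecks.

    EXPLAIN was originally applicable only to SQL executed on a single node. The ePQ feature of PolarDB-PG extends EXPLAIN so that it can print ePQ's cross-node parallel execution plans, and collect the execution time, amount of data scanned, memory usage, and other statistics for each operator in an ePQ plan, returning them to the client in a unified view.

    Features

    Viewing execution plans

    ePQ execution plans are divided into slices. Each plan slice (Slice) is executed by a group of processes (Gang) started by the virtual execution units (Segments) on the compute nodes, and performs one part of the SQL computation. ePQ introduces Motion operators into the plan to pass data between the process groups that execute different slices. Motion operators are therefore the boundaries between plan slices.

    ePQ introduces three Motion operators in total:

    • PX Coordinator: source data is sent to a single target (gather)
    • PX Broadcast: source data is sent to every target (broadcast)
    • PX Hash: source data is hashed and sent to one particular target (redistribution)

    Take a simple query as an example:

    => CREATE TABLE t (id INT);
    => SET polar_enable_px TO ON;
    => EXPLAIN (COSTS OFF) SELECT * FROM t LIMIT 1;
                       QUERY PLAN
    -------------------------------------------------
     Limit
       ->  PX Coordinator 6:1  (slice1; segments: 6)
             ->  Partial Seq Scan on t
     Optimizer: PolarDB PX Optimizer
    (4 rows)

    This plan is split by the Motion operator into two slices: slice0, which receives the final result, and slice1, which scans the data. For slice1, ePQ uses six execution units (segments: 6), each starting one process. The six processes each scan part of the table (Partial Seq Scan), and the Motion operator gathers their data to a single target (PX Coordinator 6:1), which passes it to the Limit operator.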

    As queries become more complex, the plan contains more slices and Motion operators:

    => CREATE TABLE t1 (a INT, b INT, c INT);
    => SET polar_enable_px TO ON;
    => EXPLAIN (COSTS OFF) SELECT SUM(b) FROM t1 GROUP BY a LIMIT 1;
                             QUERY PLAN
    ------------------------------------------------------------
     Limit
       ->  PX Coordinator 6:1  (slice1; segments: 6)
             ->  GroupAggregate
                   Group Key: a
                   ->  Sort
                         Sort Key: a
                         ->  PX Hash 6:6  (slice2; segments: 6)
                               Hash Key: a
                               ->  Partial Seq Scan on t1
     Optimizer: PolarDB PX Optimizer
    (10 rows)

    This plan has three slices in total. Six processes (segments: 6) execute slice2, each scanning part of the table. A Motion operator (PX Hash 6:6) redistributes the data to another six processes (segments: 6) that execute slice1, each of which sorts (Sort) and aggregates (GroupAggregate) its share. Finally, a Motion operator (PX Coordinator 6:1) gathers the data into the result slice, slice0.
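    Because every Motion operator marks a slice boundary, the number of slices in a plan is the number of Motion operators plus one (for the result slice, slice0). The Python helper below is a hypothetical illustration that counts slices in EXPLAIN text; count_slices and its regular expression are assumptions made for this sketch, not a PolarDB tool.

```python
import re

# Matches the three Motion operators as they appear in the plans above.
MOTION = re.compile(r"PX (?:Coordinator|Broadcast|Hash) \d+:\d+\s+\(slice\d+;")

def count_slices(plan_text):
    """Number of plan slices: one per Motion operator, plus the result slice slice0."""
    return len(MOTION.findall(plan_text)) + 1

plan = """
 Limit
   ->  PX Coordinator 6:1  (slice1; segments: 6)
         ->  GroupAggregate
               Group Key: a
               ->  Sort
                     Sort Key: a
                     ->  PX Hash 6:6  (slice2; segments: 6)
                           Hash Key: a
                           ->  Partial Seq Scan on t1
"""
n_slices = count_slices(plan)
```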


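    ePQ: Selecting Compute Nodes and Controlling the Degree of Parallelism

    V11 / v1.1.20-

    渊云

    2023/09/06

    20 min

    Background

    The ePQ feature of PolarDB-PG offers fine-grained controls so that the compute resources in a cluster can be used sensibly. Idle compute resources can be used for parallel queries as fully as possible to raise utilization, while avoiding any impact on other workloads:

    1. ePQ can dynamically adjust which compute nodes in the cluster take part in parallel queries, so that heavily loaded nodes are avoided
    2. ePQ can adjust the degree of parallelism on each compute node per query, so that the resources consumed by ePQ processes do not affect other processes on the same node

    Selecting compute nodes

    The parameter polar_px_nodes specifies which compute nodes take part in ePQ. Its default value is empty, meaning that all read-only nodes take part in ePQ parallel queries:

    => SHOW polar_px_nodes;
     polar_px_nodes
    ----------------

    (1 row)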

    To let the read-write node take part in ePQ as well, set the following parameter:

    SET polar_px_use_primary TO ON;

    If some read-only nodes are heavily loaded, polar_px_nodes can be set so that only specific read-only nodes, rather than all of them, take part in ePQ parallel queries. A valid value for polar_px_nodes is a comma-separated list of node names. Obtaining node names requires the polar_monitor extension:

    CREATE EXTENSION IF NOT EXISTS polar_monitor;

    The cluster topology view provided by polar_monitor lists the names of all compute nodes in the cluster:

    => SELECT name,slot_name,type FROM polar_cluster_info;
     name  | slot_name |  type
    -------+-----------+---------
     node0 |           | Primary
     node1 | standby1  | Standby
     node2 | replica1  | Replica
     node3 | replica2  | Replica
    (4 rows)

    Where:

    • Primary is the read-write node
    • Replica is a read-only node
    • Standby is a standby node

    A general best practice is to use lightly loaded read-only nodes for ePQ parallel queries:

    => SET polar_px_nodes = 'node2,node3';
    => SHOW polar_px_nodes;
     polar_px_nodes
    ----------------
     node2,node3
    (1 row)

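    Controlling the degree of parallelism

    The parameter polar_px_dop_per_node sets the number of execution units (Segments) that ePQ queries in the current session use on each compute node. Each execution unit starts one process for every plan slice (Slice) it has to execute.

    The default value is 3. A general best practice is half the number of CPU cores of the compute node. If CPU load on the compute nodes is high, lower the parameter gradually to keep CPU utilization below 80%. If query performance is poor, raise it gradually while still keeping CPU utilization at or below 80%; otherwise other background processes may be slowed down.

    Example: computing the degree of parallelism

    Create a table:

    CREATE TABLE test(id INT);

    Assume the cluster has two read-only nodes and polar_px_nodes is empty, so ePQ uses all read-only nodes in the cluster for parallel queries. polar_px_dop_per_node is 3, so each compute node runs three execution units. The plan is:

    => SHOW polar_px_nodes;
     polar_px_nodes
    ----------------

    (1 row)

    => SHOW polar_px_dop_per_node;
     polar_px_dop_per_node
    -----------------------
     3
    (1 row)

    => EXPLAIN SELECT * FROM test;
                                      QUERY PLAN
    -------------------------------------------------------------------------------
     PX Coordinator 6:1  (slice1; segments: 6)  (cost=0.00..431.00 rows=1 width=4)
       ->  Partial Seq Scan on test  (cost=0.00..431.00 rows=1 width=4)
     Optimizer: PolarDB PX Optimizer
    (3 rows)

    The plan shows that a total of six execution units on the two read-only nodes (segments: 6) execute slice1, the only slice in this plan. This means six processes in total run the query in parallel.

    Now set polar_px_dop_per_node to 4 and run the query again: a total of eight execution units on the two read-only nodes take part. Because the plan has only one slice, slice1, eight processes in total run the query in parallel:

    => SET polar_px_dop_per_node TO 4;
    SET
    => EXPLAIN SELECT * FROM test;
                                      QUERY PLAN
    -------------------------------------------------------------------------------
     PX Coordinator 8:1  (slice1; segments: 8)  (cost=0.00..431.00 rows=1 width=4)
       ->  Partial Seq Scan on test  (cost=0.00..431.00 rows=1 width=4)
     Optimizer: PolarDB PX Optimizer
    (3 rows)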

    If polar_px_use_primary is now set so that the read-write node also takes part in the query, the read-write node also runs four execution units for ePQ parallel execution, and a total of 12 processes across the cluster take part:

    => SET polar_px_use_primary TO ON;
    SET
    => EXPLAIN SELECT * FROM test;
                                       QUERY PLAN
    ---------------------------------------------------------------------------------
     PX Coordinator 12:1  (slice1; segments: 12)  (cost=0.00..431.00 rows=1 width=4)
       ->  Partial Seq Scan on test  (cost=0.00..431.00 rows=1 width=4)
     Optimizer: PolarDB PX Optimizer
    (3 rows)
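    The segment counts in the three plans above follow a simple product: participating nodes times polar_px_dop_per_node. The Python function below restates that arithmetic; px_segments and its argument names are hypothetical, chosen only for this sketch.

```python
def px_segments(n_replicas, dop_per_node, use_primary=False):
    """Total execution units (segments) for one slice, per the examples above."""
    # The read-write node joins only when polar_px_use_primary is ON.
    nodes = n_replicas + (1 if use_primary else 0)
    return nodes * dop_per_node

default_plan = px_segments(2, 3)                    # two replicas, dop 3
raised_dop = px_segments(2, 4)                      # two replicas, dop 4
with_primary = px_segments(2, 4, use_primary=True)  # primary joins as well
```

    With a single plan slice the process count equals the segment count; a plan with more slices starts one process per segment for each slice that segment executes.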

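    ePQ: Querying Partitioned Tables

    V11 / v1.1.17-

    渊云

    2023/09/06

    20 min

    Background

    As data volume keeps growing, tables become larger and larger. To ease management and improve query performance, a good practice is to use partitioned tables, splitting a large table into several partitions. Each partition can even be split further into second-level partitions, forming a multi-level partitioned table.

    PolarDB-PG supports ePQ elastic cross-node parallel query, which uses multiple compute nodes in the cluster to speed up read-only queries. ePQ supports efficient cross-node parallel queries not only on ordinary tables but also on partitioned tables.

    ePQ's basic support for partitioned tables includes:

    • Parallel scans of tables partitioned by Range / List / Hash
    • Index scans on partitioned tables
    • Join queries on partitioned tables

    In addition, ePQ supports some advanced features related to partitioned tables:

    • Partition pruning
    • Partition-wise join (Partition Wise Join)
    • Parallel queries on multi-level partitioned tables

    ePQ does not yet support parallel queries on partitioned tables with a multi-column partition key.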

    Usage guide

    Parallel query on a partitioned table

    Create a Range-partitioned table with three partitions:

    CREATE TABLE t1 (id INT) PARTITION BY RANGE(id);
    CREATE TABLE t1_p1 PARTITION OF t1 FOR VALUES FROM (0) TO (200);
    CREATE TABLE t1_p2 PARTITION OF t1 FOR VALUES FROM (200) TO (400);
    CREATE TABLE t1_p3 PARTITION OF t1 FOR VALUES FROM (400) TO (600);

    Turn on ePQ and ePQ's partitioned table scan:

    SET polar_enable_px TO ON;
    SET polar_px_enable_partition TO ON;

    View the plan for a full scan of the partitioned table:

    => EXPLAIN (COSTS OFF) SELECT * FROM t1;
                    QUERY PLAN
    -------------------------------------------
     PX Coordinator 6:1  (slice1; segments: 6)
       ->  Append
             ->  Partial Seq Scan on t1_p1
             ->  Partial Seq Scan on t1_p2
             ->  Partial Seq Scan on t1_p3
     Optimizer: PolarDB PX Optimizer
    (6 rows)

    ePQ starts a group of processes to scan every partition of the table in parallel. Each scan process uses the Append operator to scan part of each partition in turn (Partial Seq Scan), and a Motion operator (PX Coordinator) gathers the results of all processes to the process that issued the query, which returns them.

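    Static partition pruning

    When the filter of a query includes the partition key, the ePQ optimizer can prune the partitions to be scanned according to the filter, skipping partitions that are not needed, which saves system resources and improves query performance. Using the table t1 above, view the plan for the following query:

    => EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE id < 100;
                    QUERY PLAN
    -------------------------------------------
     PX Coordinator 6:1  (slice1; segments: 6)
       ->  Append
             ->  Partial Seq Scan on t1_p1
                   Filter: (id < 100)
     Optimizer: PolarDB PX Optimizer
    (5 rows)

    Because the filter id < 100 includes the partition key, the ePQ optimizer can use the partition bounds to drop the partitions that cannot match (t1_p2 and t1_p3) when generating the plan, keeping only the partition that can (t1_p1).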
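    The pruning decision for a filter of the form id < bound can be modeled by comparing the bound with each partition's lower bound. The Python sketch below uses the bounds of t1 from the example; prune_less_than is a hypothetical helper, not the optimizer's actual code, and it only covers this one filter shape.

```python
# Range bounds of t1's partitions: FROM (lo) TO (hi), with hi exclusive.
T1_PARTITIONS = {
    "t1_p1": (0, 200),
    "t1_p2": (200, 400),
    "t1_p3": (400, 600),
}

def prune_less_than(partitions, bound):
    """Keep partitions whose range [lo, hi) can hold a value below `bound`."""
    return [name for name, (lo, hi) in partitions.items() if lo < bound]

kept = prune_less_than(T1_PARTITIONS, 100)  # models WHERE id < 100
```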

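    Partition-wise join

    When partitioned tables are joined, if their partitioning strategy and bounds are identical and the join condition is on the partition key, the ePQ optimizer can produce a plan that joins partition by partition. This avoids a Cartesian-product-style join between the two partitioned tables, saving system resources and improving query performance.

    Take a join of two Range-partitioned tables as an example. The following SQL creates two partitioned tables, t2 and t3, with identical partitioning strategy and bounds:

    CREATE TABLE t2 (id INT) PARTITION BY RANGE(id);
    CREATE TABLE t2_p1 PARTITION OF t2 FOR VALUES FROM (0) TO (200);
    CREATE TABLE t2_p2 PARTITION OF t2 FOR VALUES FROM (200) TO (400);
    CREATE TABLE t2_p3 PARTITION OF t2 FOR VALUES FROM (400) TO (600);

    CREATE TABLE t3 (id INT) PARTITION BY RANGE(id);
    CREATE TABLE t3_p1 PARTITION OF t3 FOR VALUES FROM (0) TO (200);
    CREATE TABLE t3_p2 PARTITION OF t3 FOR VALUES FROM (200) TO (400);
    CREATE TABLE t3_p3 PARTITION OF t3 FOR VALUES FROM (400) TO (600);

    Set the following parameters to enable ePQ support for partitioned tables:

    SET polar_enable_px TO ON;
    SET polar_px_enable_partition TO ON;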

    With Partition Wise Join turned off, the plan for an equi-join of the two tables on the partition key is:

    => SET polar_px_enable_partitionwise_join TO OFF;
    => EXPLAIN (COSTS OFF) SELECT * FROM t2 JOIN t3 ON t2.id = t3.id;
                            QUERY PLAN
    -----------------------------------------------------------
     PX Coordinator 6:1  (slice1; segments: 6)
       ->  Hash Join
             Hash Cond: (t2_p1.id = t3_p1.id)
             ->  Append
                   ->  Partial Seq Scan on t2_p1
                   ->  Partial Seq Scan on t2_p2
                   ->  Partial Seq Scan on t2_p3
             ->  Hash
                   ->  PX Broadcast 6:6  (slice2; segments: 6)
                         ->  Append
                               ->  Partial Seq Scan on t3_p1
                               ->  Partial Seq Scan on t3_p2
                               ->  Partial Seq Scan on t3_p3
     Optimizer: PolarDB PX Optimizer
    (14 rows)

    The plan shows that the six processes executing slice1 each scan part of every partition of t2 in turn through the Append operator, receive the full data of t3 broadcast by the six processes executing slice2 through a Motion operator (PX Broadcast), perform the hash join (Hash Join) locally, and then gather and return the result through a Motion operator (PX Coordinator). In effect, every row of t2 is joined against every row of t3.

    After setting polar_px_enable_partitionwise_join to enable Partition Wise Join, view the plan again:

    => SET polar_px_enable_partitionwise_join TO ON;
    => EXPLAIN (COSTS OFF) SELECT * FROM t2 JOIN t3 ON t2.id = t3.id;
                       QUERY PLAN
    ------------------------------------------------
     PX Coordinator 6:1  (slice1; segments: 6)
       ->  Append
             ->  Hash Join
                   Hash Cond: (t2_p1.id = t3_p1.id)
                   ->  Partial Seq Scan on t2_p1
                   ->  Hash
                         ->  Full Seq Scan on t3_p1
             ->  Hash Join
                   Hash Cond: (t2_p2.id = t3_p2.id)
                   ->  Partial Seq Scan on t2_p2
                   ->  Hash
                         ->  Full Seq Scan on t3_p2
             ->  Hash Join
                   Hash Cond: (t2_p3.id = t3_p3.id)
                   ->  Partial Seq Scan on t2_p3
                   ->  Hash
                         ->  Full Seq Scan on t3_p3
     Optimizer: PolarDB PX Optimizer
    (18 rows)

    In this plan, the six processes executing slice1 use the Append operator to scan, in turn, part of each partition of t2 together with all of the corresponding partition of t3, hash join (Hash Join) the two, and finally gather and return the result through a Motion operator (PX Coordinator). Each partition of t2 (t2_p1, t2_p2, t2_p3) is joined only with the corresponding partition of t3 (t3_p1, t3_p2, t3_p3) and not with unrelated partitions, which saves unnecessary work.
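    The precondition for a partition-wise join (identical bounds on both sides) and the resulting pairing can be sketched as follows. This Python model is an assumption-laden illustration: partitionwise_pairs is a hypothetical helper, and the real optimizer checks more than bound equality (for example, the partitioning strategy and the join condition).

```python
def partitionwise_pairs(left, right):
    """Pair partitions with identical bounds; return None if the bounds differ."""
    if sorted(left.values()) != sorted(right.values()):
        return None  # fall back to joining the whole tables
    by_bounds = {bounds: name for name, bounds in right.items()}
    return [(name, by_bounds[bounds]) for name, bounds in left.items()]

t2 = {"t2_p1": (0, 200), "t2_p2": (200, 400), "t2_p3": (400, 600)}
t3 = {"t3_p1": (0, 200), "t3_p2": (200, 400), "t3_p3": (400, 600)}
pairs = partitionwise_pairs(t2, t3)
```

    Three partition pairs are joined instead of considering all nine partition combinations, which matches the three Hash Join nodes in the plan above.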

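    Parallel query on multi-level partitioned tables

    In a multi-level partitioned table, each level can partition on a different dimension (partition key): for example, the first level by time and the second level by region. When the filter of a query includes the partition key of every level, the ePQ optimizer can perform static partition pruning across the levels, filtering out partitions that need not be scanned.

    Take the figure below as an example: when the filter WHERE date = '202201' AND region = 'beijing' includes the first-level partition key date and the second-level partition key region, the ePQ optimizer prunes all irrelevant partitions, and the resulting plan contains only the partitions that match. The executor then scans only those partitions.

    [Figure: multi-level-partition]

    As an example, create a multi-level partitioned table with the following SQL:

    CREATE TABLE r1 (a INT, b TIMESTAMP) PARTITION BY RANGE (b);

    CREATE TABLE r1_p1 PARTITION OF r1 FOR VALUES FROM ('2000-01-01') TO ('2010-01-01')  PARTITION BY RANGE (a);
    CREATE TABLE r1_p1_p1 PARTITION OF r1_p1 FOR VALUES FROM (1) TO (1000000);
    CREATE TABLE r1_p1_p2 PARTITION OF r1_p1 FOR VALUES FROM (1000000) TO (2000000);

    CREATE TABLE r1_p2 PARTITION OF r1 FOR VALUES FROM ('2010-01-01') TO ('2020-01-01')  PARTITION BY RANGE (a);
    CREATE TABLE r1_p2_p1 PARTITION OF r1_p2 FOR VALUES FROM (1) TO (1000000);
    CREATE TABLE r1_p2_p2 PARTITION OF r1_p2 FOR VALUES FROM (1000000) TO (2000000);

    Set the following parameters to enable ePQ support for partitioned tables:

    SET polar_enable_px TO ON;
    SET polar_px_enable_partition TO ON;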

    Run a query that filters on both partition keys with ePQ's multi-level partition scan turned off. The result is the plan produced by the built-in PostgreSQL optimizer after multi-level static pruning:

    => SET polar_px_optimizer_multilevel_partitioning TO OFF;
    => EXPLAIN (COSTS OFF) SELECT * FROM r1 WHERE a < 1000000 AND b < '2009-01-01 00:00:00';
                                           QUERY PLAN
    ----------------------------------------------------------------------------------------
     Seq Scan on r1_p1_p1 r1
       Filter: ((a < 1000000) AND (b < '2009-01-01 00:00:00'::timestamp without time zone))
    (2 rows)

    Enable ePQ's multi-level partition scan and view the plan again:

    => SET polar_px_optimizer_multilevel_partitioning TO ON;
    => EXPLAIN (COSTS OFF) SELECT * FROM r1 WHERE a < 1000000 AND b < '2009-01-01 00:00:00';
                                                 QUERY PLAN
    ----------------------------------------------------------------------------------------------------
     PX Coordinator 6:1  (slice1; segments: 6)
       ->  Append
             ->  Partial Seq Scan on r1_p1_p1
                   Filter: ((a < 1000000) AND (b < '2009-01-01 00:00:00'::timestamp without time zone))
     Optimizer: PolarDB PX Optimizer
    (5 rows)

    In this plan, the ePQ optimizer has statically pruned the multi-level partitioned table. The six processes executing slice1 only need to scan the matching partition r1_p1_p1 in parallel (Partial Seq Scan); the scanned data is gathered and returned through a Motion operator (PX Coordinator).


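    Parallel INSERT

    V11 / v1.1.17-

    渊云

    2022/09/27

    30 min

    Background

    PolarDB-PG supports ePQ elastic cross-node parallel query, which uses multiple compute nodes in the cluster to speed up read-only queries. In addition, ePQ supports parallel writes by multiple processes on the read-write node, which accelerates INSERT statements.

    Features

    ePQ's parallel INSERT can accelerate INSERT INTO ... SELECT ... statements, which both read and write. For the SELECT part, ePQ starts multiple processes to run the query in parallel; for the INSERT part, ePQ starts multiple processes on the read-write node to write in parallel. The writer processes and the query processes exchange data through a Motion operator.

    Parallel INSERT supports the following table types:

    • Ordinary tables
    • Partitioned tables
    • (Some) foreign tables

    Parallel INSERT supports dynamically adjusting the write parallelism (the number of writer processes). As long as the query is not the bottleneck, performance can improve by up to three times.

    Usage

    Create two tables, t1 and t2, and insert some data into t1:

    CREATE TABLE t1 (id INT);
    CREATE TABLE t2 (id INT);
    INSERT INTO t1 SELECT generate_series(1,100000);

    Turn on ePQ and parallel INSERT:

    SET polar_enable_px TO ON;
    SET polar_px_enable_insert_select TO ON;

    Use an INSERT statement to insert all the data of t1 into t2, and view the plan of the parallel INSERT:

    => EXPLAIN INSERT INTO t2 SELECT * FROM t1;
                                           QUERY PLAN
    -----------------------------------------------------------------------------------------
     Insert on t2  (cost=0.00..952.87 rows=33334 width=4)
       ->  Result  (cost=0.00..0.00 rows=0 width=0)
             ->  PX Hash 6:3  (slice1; segments: 6)  (cost=0.00..432.04 rows=100000 width=8)
                   ->  Partial Seq Scan on t1  (cost=0.00..431.37 rows=16667 width=4)
     Optimizer: PolarDB PX Optimizer
    (5 rows)

    Here PX Hash 6:3 means that 6 processes querying t1 in parallel pass data through a Motion operator to 3 processes writing t2 in parallel.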
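    The 6:3 redistribution can be pictured with a toy Python model in which every reader routes each row to a writer chosen by a hash of the row. The modulo "hash" and the redistribute helper are assumptions made for illustration; the actual hash function and row routing inside the Motion operator are not described in this document.

```python
def redistribute(rows_per_reader, n_writers):
    """Route every row from each reader to one writer, chosen by a toy hash."""
    writers = [[] for _ in range(n_writers)]
    for rows in rows_per_reader:
        for row in rows:
            writers[row % n_writers].append(row)  # toy hash: value modulo writer count
    return writers

# Six readers, each holding a disjoint share of the ids 1..60.
readers = [list(range(i + 1, 61, 6)) for i in range(6)]
writers = redistribute(readers, n_writers=3)
```

    Every row reaches exactly one writer, so the three writers together insert each row once, regardless of which reader scanned it.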

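    The parameter polar_px_insert_dop_num adjusts the write parallelism dynamically, for example:

    => SET polar_px_insert_dop_num TO 12;
    => EXPLAIN INSERT INTO t2 SELECT * FROM t1;
                                            QUERY PLAN
    ------------------------------------------------------------------------------------------
     Insert on t2  (cost=0.00..952.87 rows=8334 width=4)
       ->  Result  (cost=0.00..0.00 rows=0 width=0)
             ->  PX Hash 6:12  (slice1; segments: 6)  (cost=0.00..432.04 rows=100000 width=8)
                   ->  Partial Seq Scan on t1  (cost=0.00..431.37 rows=16667 width=4)
     Optimizer: PolarDB PX Optimizer
    (5 rows)

    PX Hash 6:12 in the plan shows that the number of processes querying t1 in parallel is unchanged, while the number of processes writing t2 in parallel has changed to 12.

    Notes

    Adjusting polar_px_dop_per_node and polar_px_insert_dop_num changes the parallelism of the query and of the write in INSERT INTO ... SELECT ..., respectively.

    1. When query parallelism is low, gradually raising write parallelism makes SQL execution time fall and then level off. It levels off because the query cannot keep up with the writes and becomes the bottleneck
    2. When query parallelism is high, gradually raising write parallelism makes SQL execution time fall and then level off. It levels off because parallel writes can only happen on the read-write node, where the writer processes contend for the table's relation extension lock; write speed cannot keep up with query speed and becomes the bottleneck

    How it works

    ePQ handles a parallel INSERT as follows:

    1. The ePQ optimizer takes the syntax tree produced by query parsing as input and generates a plan tree
    2. The ePQ executor distributes the plan tree to the compute nodes and creates the parallel query / parallel write processes, each of which starts executing the sub-plan it is responsible for
    3. The parallel query processes read their own data shards from storage in parallel and send the data to the Motion operator
    4. The parallel write processes fetch data from the Motion operator and write it to storage in parallel

    Parallel query and parallel write run at the same time as a pipeline. The process is shown in the figure:

    [Figure: parallel_insert_data_flow]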

    Third-Party Extensions

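    pgvector

    V11 / v1.1.35-

    山现

    2023/12/25

    10 min

    Background

    pgvector is an efficient vector database extension. Built on PostgreSQL's extension mechanism and implemented in C, it provides several vector data types and operations, and can efficiently store and query AI embeddings represented as vectors.

    pgvector supports the IVFFlat index. An IVFFlat index divides the vector space into a number of regions, each containing some vectors, and builds an inverted index for quickly finding vectors similar to a given vector. IVFFlat is a simplified version of the IVFADC index, suited to scenarios that need high recall precision but are not strict about query latency (on the order of 100 ms). Compared with other index types, IVFFlat offers high recall, high precision, a simple algorithm with few parameters, and a small space footprint.

    The algorithm used by the pgvector extension works as follows:

    1. Points in the high-dimensional space are clustered according to their implicit cluster structure, using a clustering algorithm such as K-Means, so that each cluster has a centroid
    2. To search for a vector, first compute the distance to every cluster centroid and find the n centroids nearest the target vector
    3. Compute the distance to every element in the clusters of those n centroids, and sort globally to obtain the k nearest vectors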

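    Usage

    pgvector can search high-dimensional vectors sequentially or through an index. For index types and more parameters, see the README in the extension's source code.

    Installing the extension

    CREATE EXTENSION vector;

    Vector operations

    Run the following command to create a table with a vector column:

    CREATE TABLE t (val vector(3));

    Run the following command to insert vector data:

    INSERT INTO t (val) VALUES ('[0,0,0]'), ('[1,2,3]'), ('[1,1,1]'), (NULL);

    Create an IVFFlat index:

    1. val vector_ip_ops means that the indexed column is val and that the operator class vector_ip_ops, which compares vectors by inner product, is used. pgvector provides separate operator classes for other distance measures, such as vector_l2_ops for Euclidean distance and vector_cosine_ops for cosine distance
    2. WITH (lists = 1) means that one region (list) is used, so all vectors are assigned to the same region. In practice, the number of lists should be tuned to the data size and the required query performance

    CREATE INDEX ON t USING ivfflat (val vector_ip_ops) WITH (lists = 1);

    Find similar vectors:

    => SELECT * FROM t ORDER BY val <#> '[3,3,3]';
       val
    ---------
     [1,2,3]
     [1,1,1]
     [0,0,0]

    (4 rows)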

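    Uninstalling the extension

    DROP EXTENSION vector;

    Notes

    • ePQ supports searching high-dimensional vectors by sorting (sequential traversal), but does not support querying vector types through an index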

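    smlar

    V11 / v1.1.28-

    棠羽

    2022/10/05

    10 min

    Background

    Computing similarity over large-scale data is a key technical problem in e-commerce and search engines. Naive implementations of similarity computation are not only slow but also resource-hungry. smlar is an open-source third-party extension for PostgreSQL that provides functions for efficiently computing the similarity of data inside the database, together with similarity operators that support GiST and GIN indexes. The extension currently supports all of PostgreSQL's built-in data types.

    Note

    Because the % operator of smlar conflicts with the % operator of the RUM extension, smlar and RUM cannot both be created in the same schema.

    Functions and operators

    • float4 smlar(anyarray, anyarray)

      Computes the similarity of two arrays; both arrays must have the same data type.

    • float4 smlar(anyarray, anyarray, bool useIntersect)

      Computes the similarity of two arrays of a user-defined composite type. The useIntersect parameter controls whether only the overlapping elements or all elements take part in the computation. The composite type can be defined as follows:

      CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);

    • float4 smlar(anyarray a, anyarray b, text formula);

      Computes the similarity of two arrays using the formula given as a parameter; both arrays must have the same data type. The predefined variables available in the formula are:

      • N.i: number of elements common to both arrays (the intersection)
      • N.a: number of unique elements in the first array
      • N.b: number of unique elements in the second array

      SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)');

    • anyarray % anyarray

      Returns TRUE when the similarity of the two arrays exceeds the threshold, and FALSE otherwise.

    • text[] tsvector2textarray(tsvector)

      Converts a tsvector value to an array of strings.

    • anyarray array_unique(anyarray)

      Sorts an array and removes duplicates.

    • float4 inarray(anyarray, anyelement)

      Returns 1.0 if the element appears in the array, and 0 otherwise.

    • float4 inarray(anyarray, anyelement, float4, float4)

      Returns the third argument if the element appears in the array, and the fourth argument otherwise.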

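    Configuration parameters

    • smlar.threshold FLOAT

      Similarity threshold used by the % operator to decide whether two arrays are similar.

    • smlar.persistent_cache BOOL

      Whether the cache of global statistics is kept in memory that is independent of transactions.

    • smlar.type STRING: the similarity formula. The available similarity types are cosine, tfidf, and overlap.

    • smlar.stattable STRING

      Name of the table that stores set-wide statistics. The table is defined as follows:

      CREATE TABLE table_name (
        value   data_type UNIQUE,
        ndoc    int4 (or bigint)  NOT NULL CHECK (ndoc>0)
      );

    • smlar.tf_method STRING: how term frequency (TF, Term Frequency) is computed. Possible values:

      • n: simple count (default)
      • log: 1 + log(n)
      • const: frequency is always 1

    • smlar.idf_plus_one BOOL: how inverse document frequency (IDF, Inverse Document Frequency) is computed. Possible values:

      • FALSE: log(d / df) (default)
      • TRUE: log(1 + d / df)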

    Basic usage

    Installing the extension

    CREATE EXTENSION smlar;

    Computing similarity

    Use the functions above to compute the similarity of two arrays:

    SELECT smlar('{3,2}'::int[], '{3,2,1}');
      smlar
    ----------
     0.816497
    (1 row)

    SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / (N.a + N.b)' );
      smlar
    ----------
     0.333333
    (1 row)

    Uninstalling the extension

    DROP EXTENSION smlar;

    Principles and design

    GitHub - jirutka/smlar

    PGCon 2012 - Finding Similar: Effective similarity search in database (slides)


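    Bulk Read / Bulk Extend

    V11 / v1.1.1-

    何柯文

    2022/09/21

    30 min

    Background

    PolarDB for PostgreSQL (PolarDB below) uses PolarFS (PFS below) as its underlying file system. Unlike single-node file systems such as ext4, PFS incurs a relatively high metadata update cost when extending a file, and its minimum extension granularity is 4MB. PostgreSQL's 8kB extension granularity is a poor fit for PFS and degrades performance when writing tables or creating indexes. At the same time, PFS achieves better I/O efficiency when reading pages in large chunks. To fit these characteristics, we designed heap bulk read, heap bulk extend, and index-creation bulk extend for PolarDB, so that PolarDB running on PFS performs better.

    Features

    Heap bulk read

    When PostgreSQL reads a heap table, it reads pages through the file system into the buffer pool (Buffer Pool) in units of 8kB pages. PFS is not particularly efficient for such small I/O operations. PolarDB therefore introduces heap bulk read to fit PFS. When more than one page is to be read, bulk read is triggered and a single I/O reads 128kB of data into the Buffer Pool. Bulk read roughly doubles performance for sequential scan (Sequential Scan) and Vacuum, and improves index creation performance by 18%.

    Heap bulk extend

    In PostgreSQL, a table is extended by requesting and extending 8kB pages one at a time. Even with PostgreSQL's bulk page extension, extending by N pages still involves N I/O operations. This does not fit PFS's minimum extension granularity of 4MB. PolarDB therefore introduces heap bulk extend: when a heap table is extended, a single I/O extends it by 4MB of pages. In write-heavy scenarios such as data loading, this doubles performance.

    Index-creation bulk extend

    Index-creation bulk extend works like heap bulk extend, but specifically optimizes index creation on PFS. When pages are extended during index creation, a single I/O extends the file by 4MB of pages. This improves index creation performance by 30%.

    Note

    Index-creation bulk extend currently supports only B-Tree indexes. Other index types are not yet supported.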

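    Design

    Heap bulk read

    Heap bulk read is implemented in four main steps:

    1. Request N buffers from the Buffer Pool
    2. Use palloc to allocate a memory area of N * page size, called p below
    3. Read N * page size bytes of the heap table through PFS in one batch and copy them into p
    4. Copy the N pages in p, one by one, into the N buffers obtained from the Buffer Pool

    Subsequent reads hit the buffers directly. The data flow is shown below:

    [Figure: heap-read]

    Heap bulk extend

    Bulk extend is implemented in three main steps:

    1. Request N buffers from the Buffer Pool without triggering a file-system extension
    2. Extend the file by a batch of pages through the PFS file write interface, writing them as all-zero pages
    3. Initialize the requested pages one by one, marking their free space, which completes the bulk extend

    Index-creation bulk extend

    Index-creation bulk extend is implemented much like heap bulk extend, but does not involve requesting buffers. The steps are:

    1. When index pages are written, extend the file by a batch of pages through the PFS file write interface, writing them as all-zero pages
    2. Write the index pages already built in the Buffer Pool to the file system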

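    Usage guide

    Heap bulk read

    Heap bulk read is controlled by the parameter polar_bulk_read_size. The feature is on by default, with a default size of 128kB. Changing this parameter is not recommended: 128kB is the best fit for PFS, and other values bring no performance gain.

    To turn the feature off:

    ALTER SYSTEM SET polar_bulk_read_size = 0;
    SELECT pg_reload_conf();

    To turn the feature on with a bulk read size of 128kB:

    ALTER SYSTEM SET polar_bulk_read_size = '128kB';
    SELECT pg_reload_conf();

    Heap bulk extend

    Heap bulk extend is controlled by the parameter polar_bulk_extend_size. The feature is on by default, with a default size of 4MB. Changing this parameter is not recommended: 4MB is the best fit for PFS.

    To turn the feature off:

    ALTER SYSTEM SET polar_bulk_extend_size = 0;
    SELECT pg_reload_conf();

    To turn the feature on with a bulk extend size of 4MB:

    ALTER SYSTEM SET polar_bulk_extend_size = '4MB';
    SELECT pg_reload_conf();

    Index-creation bulk extend

    Index-creation bulk extend is controlled by the parameter polar_index_create_bulk_extend_size. The feature is on by default, with a default size of 4MB. Changing this parameter is not recommended: 4MB is the best fit for PFS.

    To turn the feature off:

    ALTER SYSTEM SET polar_index_create_bulk_extend_size = 0;
    SELECT pg_reload_conf();

    To turn the feature on with a bulk extend size of 4MB (the bare value 512 presumably counts 8kB pages, and 512 pages amount to 4MB):

    ALTER SYSTEM SET polar_index_create_bulk_extend_size = 512;
    SELECT pg_reload_conf();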
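    The sizes used in this guide translate into page counts as follows, assuming the 8kB page size mentioned in the feature description. The helper below is plain arithmetic written for this note; pages_per_io is a hypothetical name, and treating a bare parameter value as a page count is an assumption.

```python
PAGE_SIZE_KB = 8  # PostgreSQL page size stated in this document

def pages_per_io(size_kb):
    """Number of 8kB pages covered by one bulk I/O of the given size."""
    if size_kb % PAGE_SIZE_KB != 0:
        raise ValueError("size must be a multiple of the page size")
    return size_kb // PAGE_SIZE_KB

bulk_read_pages = pages_per_io(128)          # polar_bulk_read_size = '128kB'
bulk_extend_pages = pages_per_io(4 * 1024)   # polar_bulk_extend_size = '4MB'
```

    Under that assumption, one bulk read replaces 16 single-page reads and one bulk extend replaces 512 single-page extensions.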

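    Performance

    To show the gains from heap bulk read, heap bulk extend, and index-creation bulk extend, we ran tests on a PolarDB for PostgreSQL 14 instance.

    • Instance size: 8 cores, 32GB memory
    • Test scenario: 400GB pgbench test

    Heap bulk read

    Vacuum performance on a 400GB table:

    [Figure: 400gb-vacuum-perf]

    SeqScan performance on a 400GB table:

    [Figure: 400gb-vacuum-seqscan]

    Conclusions:

    • Heap bulk read improves performance by 1-2 times in the Vacuum and SeqScan scenarios
    • Raising the bulk read size above the default of 128kB brings no noticeable further gain

    Heap bulk extend

    Data loading performance on a 400GB table:

    [Figure: 400gb-insert-data-perf]

    Conclusions:

    • Heap bulk extend doubles performance in the data loading scenario
    • Raising the bulk extend size above the default of 4MB brings no noticeable further gain

    Index-creation bulk extend

    Index creation performance on a 400GB table:

    [Figure: index creation performance on a 400GB table]

    Conclusions:

    • Index-creation bulk extend improves performance by 30% in the index creation scenario
    • Raising the index-creation bulk extend size above the default of 4MB brings no noticeable further gain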
    High Performance

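    Relation Size Cache

    V11 / v1.1.10-

    步真

    2022/11/14

    50 min

    Background

    During SQL execution, system tables and user tables are queried a number of times. PolarDB for PostgreSQL obtains the size of a table through the file system's lseek system call. Frequent lseek calls seriously hurt database performance; for PolarDB for PostgreSQL, with its separated storage and compute architecture, PFS lseek calls on PolarFS add even more RTO latency. To reduce how often lseek is called, PolarDB for PostgreSQL provides a relation size cache interface on top of its own storage engine to improve runtime performance.

    Terms

    • RSC (Relation Size Cache): the relation size cache.
    • Smgr (Storage manager): the PolarDB for PostgreSQL storage manager.
    • SmgrRelation: table-level metadata on the storage side of PolarDB for PostgreSQL.

    Features

    To implement RSC, PolarDB for PostgreSQL re-adapted and redesigned the smgr layer. Overall, RSC is structured as a cache array plus a two-level index: the first-level index uses a memory address plus a reference count to locate a cache block in the shared-memory RSC cache; the second-level index uses a hash table in shared memory to obtain the array subscript of an RSC cache block, which is then used to access the RSC cache and obtain the table size.

    Design

    Overall design

    Once the RSC feature is enabled, the RSC lookup and update logic takes effect in the following smgr-layer interfaces:

    • smgrnblocks: the actual entry point for obtaining a table's size. It looks up the RSC first-level or second-level index to find the address of the RSC cache block and thereby the physical table size. On an RSC hit, the physical table size in the cache is returned directly; otherwise an lseek system call is made, the actual physical table size is stored in the RSC cache, and the RSC first-level and second-level indexes are updated accordingly.
    • smgrextend: the table file extension interface. It extends the physical table file by one page and updates the table's RSC indexes and cache.
    • smgrextendbatch: the table file bulk extension interface. It extends the physical table file by multiple pages and updates the table's RSC indexes and cache.
    • smgrtruncate: the table file removal interface. It removes the physical table file and clears the table's RSC indexes and cache.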

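    RSC cache array

    An RSC cache in the form of an array is maintained in shared memory. Each element of the array is an RSC cache block, whose key fields are:

    • The table identifier
    • A 64-bit reference count, generation: it is incremented whenever the table is updated
    • The table size

    RSC first-level index

    For each session process that performs user operations, the tables it needs to access are kept in process-private SmgrRelation structures, each of which contains:

    • A pointer to an RSC cache block; initially null, it is filled in later
    • A 64-bit generation count

    When a table is accessed, if the reference count matches the generation in the RSC cache, the RSC cache is considered not to have been updated, and the RSC cache can be reached directly through the pointer to obtain the current size of the physical table. Overall, the RSC first-level index is a design of shared reference count plus shared-memory pointer. In the read-mostly scenarios typical of most tables, this design effectively reduces concurrent access to the RSC second-level index.

    [Figure: rsc-first-cache]

    RSC second-level index

    When a table's size changes (for example through INSERT, UPDATE, COPY, or other operations that change the table file's size metadata), the RSC first-level index becomes invalid (the generation counts no longer match), and the session process tries the RSC second-level index. The RSC second-level index is a hash table in shared memory:

    • The key is the table OID
    • The value is the subscript of the table's RSC cache block in the RSC cache array

    The OID of the physical table to be accessed is looked up in the RSC second-level index in shared memory. On a hit, the RSC cache block is obtained directly, the table size is read, and the RSC first-level index is updated. On a miss, an lseek system call obtains the actual size of the physical table, and the RSC cache and both index levels are updated. Updating the RSC cache may trigger cache eviction if the cache is full.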
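    The two-level lookup can be modeled in Python as follows. This is a single-process sketch under stated assumptions: RelSizeCache, the dictionary-based handle (standing in for the process-private SmgrRelation), and the lseek callback are hypothetical, and eviction, locking, and shared memory are left out.

```python
class RelSizeCache:
    """Toy model of the RSC cache array plus its two index levels."""

    def __init__(self, lseek):
        self.blocks = []    # cache array: one dict per cache block
        self.index = {}     # second-level index: table OID -> array subscript
        self.lseek = lseek  # fallback that asks the file system for the size

    def nblocks(self, handle):
        # First level: private slot pointer, valid only if generations match.
        slot = handle.get("slot")
        if slot is not None:
            blk = self.blocks[slot]
            if blk["oid"] == handle["oid"] and blk["generation"] == handle["generation"]:
                return blk["nblocks"]
        # Second level: hash lookup by OID; on a miss, call lseek and fill the cache.
        slot = self.index.get(handle["oid"])
        if slot is None:
            self.blocks.append({"oid": handle["oid"], "generation": 0,
                                "nblocks": self.lseek(handle["oid"])})
            slot = len(self.blocks) - 1
            self.index[handle["oid"]] = slot
        blk = self.blocks[slot]
        handle["slot"], handle["generation"] = slot, blk["generation"]  # refresh level 1
        return blk["nblocks"]

    def extend(self, oid, n_pages):
        # A size change bumps generation, invalidating every first-level entry.
        blk = self.blocks[self.index[oid]]
        blk["nblocks"] += n_pages
        blk["generation"] += 1

lseek_calls = []
def fake_lseek(oid):
    lseek_calls.append(oid)
    return 10  # pretend the file holds 10 pages

cache = RelSizeCache(fake_lseek)
handle = {"oid": 16384}
first = cache.nblocks(handle)    # miss on both levels: one lseek
second = cache.nblocks(handle)   # first-level hit
cache.extend(16384, 1)
third = cache.nblocks(handle)    # first level stale, second-level hit
```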

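    [Figure: rsc-second-cache]

    RSC cache update and eviction

    While the RSC cache is being updated, eviction may be triggered because the cache is full. RSC implements an SLRU eviction algorithm to choose an old cache block to evict when the cache is full. Each RSC cache block maintains a reference counter: the counter is incremented by 1 each time the block is accessed and reset to 0 when the block is evicted. When eviction is triggered, the RSC cache array is traversed forward from the position where the previous traversal stopped, decrementing the reference count of each RSC cache block, until a block with a reference count of 0 is found and evicted. The traversal length is controlled by a GUC parameter and defaults to 8: if no evictable RSC cache block is found after moving forward 8 blocks, a block is chosen at random for eviction.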
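    One possible reading of this eviction sweep is sketched below in Python. The order of "check for zero" and "decrement" within a step, the pick_victim name, and the use of a seeded random generator are assumptions of this sketch; the document does not specify them.

```python
import random

def pick_victim(usage, start, max_sweep=8, rng=random):
    """Return (victim_slot, next_start); `usage` holds per-block counters and is mutated."""
    n = len(usage)
    pos = start
    for _ in range(max_sweep):
        if usage[pos] == 0:
            return pos, (pos + 1) % n  # found an unreferenced block
        usage[pos] -= 1                # age the block and move on
        pos = (pos + 1) % n
    # Nothing evictable within the sweep limit: fall back to a random block.
    return rng.randrange(n), pos

usage = [2, 1, 0, 3]
victim, next_start = pick_victim(usage, start=0)

busy = [5, 5, 5, 5]
fallback, _ = pick_victim(busy, start=0, rng=random.Random(0))
```

    Bounding the sweep keeps the cost of one eviction predictable even when every block is hot, at the price of occasionally evicting a recently used block.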

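    RSC Cache on Secondary Nodes

    PolarDB for PostgreSQL has two kinds of secondary nodes: Read Only (RO) nodes, which share storage and serve read-only traffic, and Standby nodes, which provide high availability across data centers. A Standby node synchronizes data through conventional streaming replication plus WAL replay, so it uses and updates the RSC cache in the same way as the Read Write (RW) node. An RO node, however, synchronizes data through the LogIndex mechanism implemented by PolarDB for PostgreSQL, so RSC cache synchronization must be supported separately under that mechanism. For every WAL record type, the cache is updated or evicted depending on whether a New Page record is present, which keeps the RSC cache on RO nodes consistent.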

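    Usage Guide

    This feature is enabled by default. It is controlled by the following GUC parameters:

    • polar_nblocks_cache_mode: whether RSC is enabled. Possible values:
      • scan (default): enable RSC only for sequential scans
      • on: enable RSC in all scenarios
      • off: disable RSC. Changing the parameter from scan / on to off can be done directly with ALTER SYSTEM SET and takes effect without a restart. Changing it from off to scan / on requires editing postgresql.conf and restarting
    • polar_enable_replica_use_smgr_cache: whether RO nodes use RSC. Defaults to on. Can be set to on / off
    • polar_enable_standby_use_smgr_cache: whether Standby nodes use RSC. Defaults to on. Can be set to on / off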

    Performance Test

    Use the following shell script to create a partitioned table with 1000 partitions:

    psql -c "CREATE TABLE hp(a INT) PARTITION BY HASH(a);"
    +for ((i=0; i<1000; i++)); do
    +    psql -c "CREATE TABLE hp$i PARTITION OF hp FOR VALUES WITH(modulus 1000, remainder $i);"
    +done
    +

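    At this point the partitions contain no data. Next, an aggregate query over all partitions is used to measure how much time the lseek system calls cost with RSC turned on and off.

    With RSC enabled: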

    ALTER SYSTEM SET polar_nblocks_cache_mode = 'scan';
    +ALTER SYSTEM
    +
    +ALTER SYSTEM SET polar_enable_replica_use_smgr_cache = on;
    +ALTER SYSTEM
    +
    +ALTER SYSTEM SET polar_enable_standby_use_smgr_cache = on;
    +ALTER SYSTEM
    +
    +SELECT pg_reload_conf();
    + pg_reload_conf
    +----------------
    + t
    +(1 row)
    +
    +SHOW polar_nblocks_cache_mode;
    + polar_nblocks_cache_mode
    +--------------------------
    + scan
    +(1 row)
    +
    +SHOW polar_enable_replica_use_smgr_cache ;
    + polar_enable_replica_use_smgr_cache
    +--------------------------
    + on
    +(1 row)
    +
    +SHOW polar_enable_standby_use_smgr_cache ;
    + polar_enable_standby_use_smgr_cache
    +--------------------------
    + on
    +(1 row)
    +
    +SELECT COUNT(*) FROM hp;
    + count
    +-------
    +     0
    +(1 row)
    +
    +Time: 97.658 ms
    +
    +SELECT COUNT(*) FROM hp;
    + count
    +-------
    +     0
    +(1 row)
    +
    +Time: 108.672 ms
    +
    +SELECT COUNT(*) FROM hp;
    + count
    +-------
    +     0
    +(1 row)
    +
    +Time: 93.678 ms
    +

    With RSC disabled:

    ALTER SYSTEM SET polar_nblocks_cache_mode = 'off';
    +ALTER SYSTEM
    +
    +ALTER SYSTEM SET polar_enable_replica_use_smgr_cache = off;
    +ALTER SYSTEM
    +
    +ALTER SYSTEM SET polar_enable_standby_use_smgr_cache = off;
    +ALTER SYSTEM
    +
    +SELECT pg_reload_conf();
    + pg_reload_conf
    +----------------
    + t
    +(1 row)
    +
    +SELECT COUNT(*) FROM hp;
    + count
    +-------
    +     0
    +(1 row)
    +
    +Time: 164.772 ms
    +
    +SELECT COUNT(*) FROM hp;
    + count
    +-------
    +     0
    +(1 row)
    +
    +Time: 147.255 ms
    +
    +SELECT COUNT(*) FROM hp;
    + count
    +-------
    +     0
    +(1 row)
    +
    +Time: 177.039 ms
    +
    +SELECT COUNT(*) FROM hp;
    + count
    +-------
    +     0
    +(1 row)
    +
    +Time: 194.724 ms
    +
    + + + diff --git a/zh/features/v11/performance/shared-server.html b/zh/features/v11/performance/shared-server.html new file mode 100644 index 00000000000..8246cb7d3eb --- /dev/null +++ b/zh/features/v11/performance/shared-server.html @@ -0,0 +1,33 @@ + + + + + + + + + Shared Server | PolarDB for PostgreSQL + + + + +

    Shared Server

    V11 / v1.1.30-

    严华

    2022/11/25

    20 min

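    Background

    Native PostgreSQL schedules connections with one process per connection (One-Process-Per-Connection). This model suits workloads with low concurrency and long-lived connections. Under high concurrency or with many short-lived connections, the cost of creating and destroying large numbers of processes and of context switching seriously hurts performance. In addition, once applications are deployed in containers, each container connects to the database through its own connection pool. At peak times the application scales out to many containers, the number of backend database connections spikes, database stability suffers, and OOM becomes frequent.

    To address these problems, PostgreSQL users typically deploy a connection pool component, such as the server-side pool PgBouncer on the database side or the client-side pool Druid on the application side. A server-side pool, however, cannot preserve a user connection's private state (such as GUC parameters and Prepared Statements), and it cannot clean up promptly when a process becomes polluted (for example by loading a dynamic library or changing the role parameter). A client-side pool does not fix these shortcomings, and it also cannot adjust its configuration in real time as the application scales, so the number of connections still balloons.

    PolarDB for PostgreSQL addresses these problems from inside the database with Shared Server (SS for short below), a built-in connection pool. Its architecture combines shared memory, Session Context, Dispatcher forwarding, and a Backend Pool to decouple user connections from backend processes. Backend processes have three execution modes, Native, Shared, and Dedicated, and switch between them at run time according to the current load and whether the process has been polluted. The load scheduling algorithm builds on AliSQL's improvements to the thread pool of community MySQL: it uses a Stall mechanism to control the number of workers elastically while preventing user connections from starving. This resolves the performance and stability problems caused by high concurrency or large numbers of short-lived connections at their root.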

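    How It Works

    In PostgreSQL's native One-Process-Per-Connection scheduling, each user connection is bound one-to-one to a backend process. The binding covers not only the lifecycle but also the relationship between the connection and the process that serves it.

    ss-old

    The built-in Shared Server connection pool extracts the session-related state into a Session Context, which decouples user connections from backend processes, and introduces a Dispatcher to proxy and forward traffic: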

    ss-new

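    • The Session Context holds session-related data in shared memory so that it can be shared across processes. It stores Prepared Statements, connection-private parameters, temporary table metadata, and more, and can be extended further.
    • Dispatcher processes do the proxying and forwarding. User connections are dispatched to different backend processes through a Dispatcher, and backend processes are shared by multiple user connections through a Dispatcher. Multiple Dispatcher processes can be configured.
    • The backend processes managed by each Dispatcher are divided into backend pools keyed by <user, database, GUCs>. Each backend pool has its own exclusive group of backend processes. The number of backend processes in a pool grows as load rises and shrinks as load falls.
    • A transaction on a user connection is always served by the same backend process. Different transactions may be served by different backend processes.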
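    The pooling key can be pictured with a small Python sketch. This is only a conceptual model of grouping backends by <user, database, GUCs>. The names `pool_key` and `Dispatcher` and the string backend labels are hypothetical and do not correspond to PolarDB internals.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

PoolKey = Tuple[str, str, FrozenSet[Tuple[str, str]]]


def pool_key(user: str, database: str, gucs: Dict[str, str]) -> PoolKey:
    """Build a hashable <user, database, GUCs> key; GUC order does not matter."""
    return (user, database, frozenset(gucs.items()))


class Dispatcher:
    """Groups backend processes into pools, one pool per key."""

    def __init__(self) -> None:
        self.pools: Dict[PoolKey, List[str]] = defaultdict(list)

    def add_backend(self, key: PoolKey, backend: str) -> None:
        self.pools[key].append(backend)

    def backends_for(self, user: str, database: str, gucs: Dict[str, str]) -> List[str]:
        # Only connections with an identical key may share these backends.
        return self.pools.get(pool_key(user, database, gucs), [])
```

    Two connections share backends only when user, database, and the relevant GUC settings all match, which is why a session that changes a GUC may land in a different pool.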

    ss-pool

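    In Shared Server, a backend process runs in one of three execution modes. The mode changes at run time according to the current load and whether the process has been polluted:

    • Native mode: one backend process serves exactly one user connection, and no data is forwarded through a Dispatcher
      • When SS is off, all backend processes run in Native mode
      • When SS is on, a backend process still falls back to Native mode during the login phase of a user connection in the following cases:
        • WAL Sender processes
        • MPP processes
        • SS shared memory is exhausted
        • The database or user is on the blacklist defined by the polar_ss_dedicated_dbuser_names parameter
    • Shared mode: the backend process is a shareable worker available to all user connections. Shared mode is the standard, intended state of the pool and means the backend process can be reused. When SS is on, backend processes prefer Shared mode and switch to Dedicated mode when the fallback mechanism is triggered.
    • Dedicated mode (fallback mode): the backend process has been polluted for some reason and degrades to serving only the current user connection. When the user connection exits, the backend process exits too
      • The user connection stops allocating new SS shared memory and uses process-local memory instead.
      • Data between the user connection and the backend process is still forwarded through the Dispatcher
      • The following cases trigger the fallback mechanism and switch the mode from Shared to Dedicated:
        • Updating a GUC parameter on the SS blacklist
        • Using an extension on the SS blacklist
        • Executing the DECLARE CURSOR command
        • Operating on a table with the ON COMMIT DELETE ROWS property
        • Performing a CURSOR WITH HOLD operation
        • Using a custom GUC parameter
        • Loading a dynamic library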

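    Performance Comparison

    Shared Server mainly targets workloads with high concurrency or many short-lived connections, so TPC-C is used for testing here.

    TPC-C High Concurrency

    The test runs on a single physical machine with 104 cores and 512 GB of memory, using TPC-C with 1000 warehouses, and compares the scores of different configurations as concurrency grows from 300 to 5000. The configurations shown in the figure below are:

    • old: no connection pool; PostgreSQL's native execution mode (that is, Native mode)
    • ss off: built-in Shared Server pool, SS switch turned off before startup, falling back to Native mode
    • ss native: built-in Shared Server pool, SS switch turned off after startup, falling back to Native mode
    • ss dedicated: built-in Shared Server pool, SS switch turned on after startup, but Dedicated mode forced
    • ss shared: built-in Shared Server pool, SS switch turned on after startup, standard Shared mode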

    ss-tpcc

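    The figure shows that:

    • Native PostgreSQL, Shared Server turned off, and Shared Server in fallback mode all fail to sustain the high-concurrency TPC-C test. Performance starts to drop at a concurrency of 1500, and at 5000 the database can no longer serve requests
    • With Shared Server on and in Shared mode, TPC-C performance is unaffected by high concurrency and stays stable throughout, so high-concurrency workloads are well supported

    PgBench Short Connections

    The test runs on a single physical machine with 104 cores and 512 GB of memory and uses pgbench to measure the following configurations with 1 to 128 concurrent short-lived connections:

    • pgbouncer session: PgBouncer server-side pool in session pooling mode
    • pgbouncer transaction: PgBouncer server-side pool in transaction pooling mode
    • old: no connection pool; PostgreSQL's native execution mode
    • ss dedicated: built-in Shared Server pool, forced into Dedicated mode
    • ss shared: built-in Shared Server pool in standard Shared mode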

    ss-pgbench1

    ss-pgbench2

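    The figures show that for short-lived connections both PgBouncer and Shared Server improve performance once a pool is used. PgBouncer improves performance by at most 14x, whereas Shared Server improves it by up to 42x.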

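    Features

    Comparison with PgBouncer

    PgBouncer, a typical server-side connection pool, offers several modes. Session pooling mode is friendly only to short-lived connections and is rarely used. Transaction pooling mode is friendly to both short-lived and long-lived connections and is the recommended default. Compared with PgBouncer, Shared Server differs as follows:

    Feature | PgBouncer Session Pooling | PgBouncer Transaction Pooling | Shared Server
    Startup parameters | Limited | Limited | Supported
    SSL | Supported | Supported | Planned
    LISTEN/NOTIFY | Supported | Not supported | Supported (triggers fallback)
    LOAD statement | Supported | Not supported | Supported (triggers fallback)
    Session-level advisory locks | Supported | Not supported | Supported (triggers fallback)
    SET/RESET GUC | Supported | Not supported | Supported
    Protocol-level prepared plans | Supported | Planned | Supported
    PREPARE / DEALLOCATE | Supported | Not supported | Supported
    Cached Plan Reset | Supported | Supported | Supported
    WITHOUT HOLD CURSOR | Supported | Supported | Supported
    WITH HOLD CURSOR | Supported | Not supported | Planned (triggers fallback)
    PRESERVE/DELETE ROWS temp | Supported | Not supported | Planned (triggers fallback)
    ON COMMIT DROP temp | Supported | Supported | Supported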

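    Notes:

    • PgBouncer's startup parameters include only:
      • client_encoding
      • datestyle
      • timezone
      • standard_conforming_strings
    • "Triggers fallback" means the backend enters Dedicated mode. After the user connection disconnects, the backend process is released as well, so a polluted process is never reused by another user connection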

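    Custom Configuration

    To fit different environments, Shared Server supports a rich set of configuration parameters:

    1. The maximum numbers of Dispatcher processes and backend processes are configurable, so they can be tuned at run time for the best performance
    2. Shared mode can be enabled only after the total number of connections exceeds a threshold, since SS brings little benefit when there are few connections
    3. Dedicated mode can be forced, so that a polluted backend process does not keep affecting other user connections
    4. Specific databases or users can be excluded from Shared Server, leaving an emergency channel for dedicated accounts and administrators
    5. Specific extensions can be excluded from Shared Server, so that a misbehaving external extension does not destabilize Shared Server
    6. Specific GUC parameters can be excluded from Shared Server, so that complex GUC behavior does not destabilize Shared Server
    7. Connections fall back to Native mode once the number of connections blocked on a Dispatcher exceeds a threshold, so that a Dispatcher defect does not cause unavailability
    8. The wait timeout for user connections is configurable, so that user connections do not wait too long for a backend process
    9. The idle-time threshold for backend processes is configurable, so that long-idle backend processes do not hold system resources
    10. The active-time threshold for backend processes is configurable, so that long-active backend processes do not hold system resources
    11. The minimum number of backend processes kept in each backend pool is configurable, which keeps the pool warm and prevents all processes from being released
    12. Shared Server debug logging is configurable, which helps diagnose any problem related to backend process scheduling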

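    Usage

    Common Parameters

    Typical Shared Server configuration parameters are described below:

    • polar_enable_shm_aset: whether global shared memory is enabled. Currently off by default. Takes effect after a restart
    • polar_ss_shared_memory_size: upper limit of the global shared memory used by Shared Server, in kB. 0 disables it. Defaults to 1 MB. Takes effect after a restart.
    • polar_ss_dispatcher_count: maximum number of Dispatcher processes. Defaults to 2, with a maximum equal to the number of CPU cores. Setting it to the number of CPU cores is recommended. Takes effect after a restart.
    • polar_enable_shared_server: whether Shared Server is enabled. Off by default.
    • polar_ss_backend_max_count: maximum number of backend processes. Defaults to -5, meaning 1/5 of max_connections. 0 / -1 means the same value as max_connections. Setting it to 10 times the number of CPU cores is recommended.
    • polar_ss_backend_idle_timeout: idle time after which a backend process exits. Defaults to 3 minutes
    • polar_ss_session_wait_timeout: maximum time a user connection waits to be served when all backend processes are busy. Defaults to 60 seconds
    • polar_ss_dedicated_dbuser_names: databases and users that use Native mode. Empty by default. The format is d1/_,_/u1,d2/u2, meaning that any connection to database d1, any connection by user u1, and any connection to database d2 by user u2 falls back to Native mode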
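    To make the d1/_,_/u1,d2/u2 format concrete, here is a minimal Python sketch of how such a value could be interpreted. It only illustrates the matching rule described above, treating `_` as a wildcard. The function names are hypothetical, and PolarDB's own parser may differ in details such as whitespace handling.

```python
from typing import List, Tuple


def parse_dbuser_names(value: str) -> List[Tuple[str, str]]:
    """Split 'd1/_,_/u1,d2/u2' into (database, user) patterns."""
    patterns = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue                      # tolerate an empty setting
        database, _, user = item.partition("/")
        patterns.append((database, user))
    return patterns


def falls_back_to_native(value: str, database: str, user: str) -> bool:
    """True if a connection for (database, user) matches any pattern."""
    for db_pat, user_pat in parse_dbuser_names(value):
        if db_pat in ("_", database) and user_pat in ("_", user):
            return True
    return False


setting = "d1/_,_/u1,d2/u2"
print(falls_back_to_native(setting, "d1", "anyone"))   # any user on d1
print(falls_back_to_native(setting, "d2", "u3"))       # d2 only matches user u2
```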
    + + + diff --git a/zh/features/v11/security/index.html b/zh/features/v11/security/index.html new file mode 100644 index 00000000000..d6c3725734d --- /dev/null +++ b/zh/features/v11/security/index.html @@ -0,0 +1,33 @@ + + + + + + + + + 安全 | PolarDB for PostgreSQL + + + + +

    Security

    + + + diff --git a/zh/features/v11/security/tde.html b/zh/features/v11/security/tde.html new file mode 100644 index 00000000000..db242bbe1c0 --- /dev/null +++ b/zh/features/v11/security/tde.html @@ -0,0 +1,75 @@ + + + + + + + + + TDE 透明数据加密 | PolarDB for PostgreSQL + + + + +

    TDE: Transparent Data Encryption

    V11 / v1.1.1-

    恒亦

    2022/09/27

    20 min

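    Background

    TDE stands for Transparent Data Encryption. By encrypting data transparently at the database layer, TDE prevents potential attackers from bypassing the database and reading sensitive information directly from the storage layer. Users authenticated by the database can access data transparently, with no changes to application code or configuration, while OS users who try to read sensitive data from tablespace files, and anyone who tries to read disks or backups, cannot access the plaintext. In China, to protect information security on the Internet, the state requires service providers to meet certain data security standards.

    Internationally, several industries also have regulatory data security standards, for example:

    • Payment Card Industry Data Security Standard (PCI DSS)
    • Health Insurance Portability and Accountability Act (HIPAA)
    • General Data Protection Regulation (GDPR)
    • California Consumer Privacy Act (CCPA)
    • Sarbanes-Oxley Act (SOX)

    To meet the need to protect user data, TDE is implemented in PolarDB.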

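    Terminology

    • KEK: Key Encryption Key.
    • MDEK: the master data encryption key, generated randomly by pg_strong_random and kept in memory. TDEK and WDEK are derived from it.
    • TDEK: Table Data Encryption Key, derived from MDEK with the HKDF algorithm and kept in memory. It is the key that actually encrypts table data.
    • WDEK: WAL Data Encryption Key, derived from MDEK with the HKDF algorithm and kept in memory. It is the key that actually encrypts WAL data.
    • HMACK: the passphrase is hashed with SHA-512 to produce KEK and HMACK. HMACK is the key used for the HMAC check.
    • KEK_HMAC: computed from ENCMDEK and HMACK with the HMAC algorithm. It is the verification information used when restoring the keys.
    • ENCMDEK: MDEK encrypted with KEK.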

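    Usage

    For users:

    • Adding the --cluster-passphrase-command 'xxx' -e aes-256 options to initdb creates a cluster that supports TDE. The cluster-passphrase-command option is the command that returns the key used to encrypt the encryption key, and -e selects the data encryption algorithm. AES-128, AES-256, and SM4 are currently supported.

      initdb --cluster-passphrase-command 'echo \"abc123\"' -e aes-256
      +
    • While the database is running, only a superuser can run the following command to see the encryption algorithm in use:

      show polar_data_encryption_cipher;
      +
    • While the database is running, the polar_tde_utils extension can be created to change the TDE encryption key or to query TDE status. It currently supports:

      1. Changing the encryption key. The function argument is the method for obtaining the encryption key (the method must be reachable only from the network of the host machine). After the function runs, the contents of the kmgr file change, and the change takes effect after the next restart.

        select polar_tde_update_kmgr_file('echo \"abc123456\"');
        +
      2. Getting the current kmgr info.

        select * from polar_tde_kmgr_info_view();
        +
      3. Checking the integrity of the kmgr file.

        select polar_tde_check_kmgr_file();
        +
    • Running pg_filedump to parse encrypted pages, for page analysis in extreme situations.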

      pg_filedump -e aes-128 -C 'echo \"abc123\"' -K global/kmgr base/14543/2608
      +

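    How It Works

    Key Management Module

    Key Structure

    A two-layer key structure is used: a key encryption key and table data encryption keys. The table data encryption key is the key that actually encrypts database data. The key encryption key further encrypts the table data encryption key. The two layers are described below:

    • Key encryption key (KEK) and its verification key HMACK: the command in the polar_cluster_passphrase_command parameter is run and its output is hashed with SHA-512, giving 64 bytes of data. The first 32 bytes are the top-level encryption key KEK, and the last 32 bytes are HMACK.
    • Table data encryption key (TDEK) and WAL encryption key (WDEK): keys produced by a cryptographically secure random number generator. They are the real keys used to encrypt data and WAL. The ciphertexts of the two keys are passed through the HMAC algorithm with HMACK as the key, producing rdek_hmac and wdek_hmac, which are used to verify the KEK and are stored on shared storage.

    KEK and HMACK are obtained from an external source every time, for example a KMS. For testing, they can be derived simply from echo passphrase. ENCMDEK and KEK_HMAC must be stored on shared storage so that on the next startup both RW and RO nodes can read the file and recover the real encryption keys. The data structure is as follows: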

    typedef struct KmgrFileData
    +{
    +    /* version for kmgr file */
    +    uint32      kmgr_version_no;
    +
    +    /* Are data pages encrypted? Zero if encryption is disabled */
    +    uint32      data_encryption_cipher;
    +
    +    /*
    +     * Wrapped Key information for data encryption.
    +     */
    +    WrappedEncKeyWithHmac tde_rdek;
    +    WrappedEncKeyWithHmac tde_wdek;
    +
    +    /* CRC of all above ... MUST BE LAST! */
    +    pg_crc32c   crc;
    +} KmgrFileData;
    +

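    This file is currently created by initdb, which guarantees that a Standby obtains it through pg_basebackup.

    While the instance is running, TDE control information is kept in process memory in the following structures: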

    static keydata_t keyEncKey[TDE_KEK_SIZE];
    +static keydata_t relEncKey[TDE_MAX_DEK_SIZE];
    +static keydata_t walEncKey[TDE_MAX_DEK_SIZE];
    +char *polar_cluster_passphrase_command = NULL;
    +extern int data_encryption_cipher;
    +

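    Key Encryption

    Keys must be generated when the database is initialized. The process is shown below:

    image.png

    1. Run polar_cluster_passphrase_command to obtain the 64-byte KEK + HMACK, where KEK is 32 bytes long and HMACK is 32 bytes long.
    2. Call the random number generator in OpenSSL to generate MDEK.
    3. Use MDEK with OpenSSL's HKDF algorithm to derive TDEK.
    4. Use MDEK with OpenSSL's HKDF algorithm to derive WDEK.
    5. Encrypt MDEK with KEK to produce ENCMDEK.
    6. Pass ENCMDEK and HMACK through the HMAC algorithm to produce KEK_HMAC, the verification information used when restoring the keys.
    7. Write ENCMDEK and KEK_HMAC, together with the remaining KmgrFileData fields, to the global/kmgr file.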
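    Steps 1 to 4 can be sketched with the Python standard library. This is an illustration under stated assumptions, not PolarDB's code: the HKDF hash (SHA-256 here) and the `info` labels are hypothetical because the text does not name them, `os.urandom` stands in for the OpenSSL random generator, and step 5 (wrapping MDEK with KEK) is omitted because it needs an AES implementation outside the standard library.

```python
import hashlib
import hmac
import os

KEK_LEN = 32
HASH = hashlib.sha256  # assumption: the text does not name the HKDF hash


def split_passphrase(passphrase: bytes):
    """Step 1: SHA-512 the passphrase; first 32 bytes are KEK, last 32 are HMACK."""
    digest = hashlib.sha512(passphrase).digest()
    return digest[:KEK_LEN], digest[KEK_LEN:]


def hkdf(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF extract-and-expand as specified in RFC 5869."""
    hash_len = HASH().digest_size
    prk = hmac.new(salt or b"\x00" * hash_len, ikm, HASH).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), HASH).digest()
        okm += block
        counter += 1
    return okm[:length]


kek, hmack = split_passphrase(b"abc123")
mdek = os.urandom(32)               # step 2: random master key
tdek = hkdf(mdek, b"table-data")    # step 3 (info label is hypothetical)
wdek = hkdf(mdek, b"wal-data")      # step 4 (info label is hypothetical)
```

    Because HKDF is deterministic for a fixed `info` value, only the wrapped MDEK has to be persisted. TDEK and WDEK can be re-derived after every restart.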

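    Key Decryption

    After a crash or a restart, the keys must be recovered from the limited ciphertext information available. The process is as follows:

    image.png

    1. Read the global/kmgr file to obtain ENCMDEK and KEK_HMAC.
    2. Run polar_cluster_passphrase_command to obtain the 64-byte KEK + HMACK.
    3. Pass ENCMDEK and HMACK through the HMAC algorithm to produce KEK_HMAC', then compare KEK_HMAC with KEK_HMAC'. If they match, continue with the next step. Otherwise report an error and return.
    4. Decrypt ENCMDEK with KEK to recover MDEK.
    5. Use MDEK with OpenSSL's HKDF algorithm to derive TDEK. Because the info value is fixed, the same TDEK is produced.
    6. Use MDEK with OpenSSL's HKDF algorithm to derive WDEK. Because the info value is fixed, the same WDEK is produced.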
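    The integrity check in step 3 can be sketched as follows. This is a minimal illustration only: HMAC-SHA-256 is an assumption (the text does not name the HMAC hash), the function names are hypothetical, and ENCMDEK is treated as opaque bytes read from the kmgr file.

```python
import hashlib
import hmac


def kek_hmac(encmdek: bytes, hmack: bytes) -> bytes:
    """MAC over the wrapped master key, keyed with HMACK (hash is assumed)."""
    return hmac.new(hmack, encmdek, hashlib.sha256).digest()


def verify_kmgr(encmdek: bytes, stored_mac: bytes, passphrase: bytes) -> bytes:
    """Return the KEK if the passphrase matches the stored MAC, else raise."""
    digest = hashlib.sha512(passphrase).digest()
    kek, hmack = digest[:32], digest[32:]
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(kek_hmac(encmdek, hmack), stored_mac):
        raise ValueError("passphrase does not match the kmgr file")
    return kek
```

    Checking the MAC before attempting decryption means that a wrong passphrase is reported as an error rather than silently yielding a garbage MDEK.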

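    Key Rotation

    Key rotation can be understood as first restoring the keys with the old KEK and then generating a new kmgr file with the new KEK. The process is shown below:

    image.png

    1. Read the global/kmgr file to obtain ENCMDEK and KEK_HMAC.
    2. Run polar_cluster_passphrase_command to obtain the 64-byte KEK + HMACK.
    3. Pass ENCMDEK and HMACK through the HMAC algorithm to produce KEK_HMAC', then compare KEK_HMAC with KEK_HMAC'. If they match, continue with the next step. Otherwise report an error and return.
    4. Decrypt ENCMDEK with KEK to recover MDEK.
    5. Run the new passphrase command to obtain 64 bytes of new_KEK + new_HMACK.
    6. Encrypt MDEK with new_KEK to produce new_ENCMDEK.
    7. Pass new_ENCMDEK and new_HMACK through the HMAC algorithm to produce new_KEK_HMAC, the verification information used when restoring the keys.
    8. Write new_ENCMDEK and new_KEK_HMAC, together with the remaining KmgrFileData fields, to the global/kmgr file.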

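    Encryption Module

    All user data is meant to be encrypted at page granularity, using the AES-128/256 algorithms (AES-256 is the product default). The pair (page LSN, page number) serves as the IV for encrypting each data page. The IV is the initialization vector that makes identical content encrypt to different results.

    The header of each page has the following structure: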

    typedef struct PageHeaderData
    +{
    +    /* XXX LSN is member of *any* block, not only page-organized ones */
    +    PageXLogRecPtr pd_lsn;      /* LSN: next byte after last byte of xlog
    +                                 * record for last change to this page */
    +    uint16      pd_checksum;    /* checksum */
    +    uint16      pd_flags;       /* flag bits, see below */
    +    LocationIndex pd_lower;     /* offset to start of free space */
    +    LocationIndex pd_upper;     /* offset to end of free space */
    +    LocationIndex pd_special;   /* offset to start of special space */
    +    uint16      pd_pagesize_version;
    +    TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
    +    ItemIdData  pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
    +} PageHeaderData;
    +

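    In the structure above:

    • pd_lsn must not be encrypted, because decryption needs it to build the IV.
    • pd_flags gains an "encrypted" flag bit, 0x8000, and is not encrypted. This keeps plaintext pages readable and makes it possible to turn on TDE for an existing instance incrementally.
    • pd_checksum is not encrypted, so the page checksum can be verified on the ciphertext.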
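    The plaintext part of the header can be read without any key, as the following Python sketch shows. It assumes a little-endian platform and the standard PostgreSQL layout in which pd_lsn is stored as two 32-bit halves (high half first). The `page_iv` packing is purely hypothetical, since the text only says that the IV is built from the page LSN and page number.

```python
import struct

PD_ENCRYPTED = 0x8000                 # flag bit named in the text
HEADER_FMT = "<IIHH"                  # pd_lsn (hi, lo), pd_checksum, pd_flags
PLAINTEXT_LEN = struct.calcsize(HEADER_FMT)   # bytes left unencrypted


def read_plain_header(page: bytes):
    """Return (lsn, checksum, flags) from the unencrypted header prefix."""
    lsn_hi, lsn_lo, checksum, flags = struct.unpack_from(HEADER_FMT, page, 0)
    return (lsn_hi << 32) | lsn_lo, checksum, flags


def is_encrypted(page: bytes) -> bool:
    """A page is treated as ciphertext only if the 0x8000 flag is set."""
    return bool(read_plain_header(page)[2] & PD_ENCRYPTED)


def page_iv(page: bytes, block_number: int) -> bytes:
    """16-byte IV from (page LSN, page number); this packing is hypothetical."""
    lsn = read_plain_header(page)[0]
    return struct.pack("<QI4x", lsn, block_number)
```

    Since the flag and the LSN are readable before decryption, a reader can decide per page whether to decrypt at all, which is what allows plaintext and encrypted pages to coexist.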

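    Encrypted Files

    Files that contain user data are currently encrypted, such as the files in the following subdirectories of the data directory: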

    • base/
    • global/
    • pg_tblspc/
    • pg_replslot/
    • pg_stat/
    • pg_stat_tmp/
    • ...

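    When Encryption Happens

    Data that is organized in pages is encrypted page by page. A checksum must be computed before a page is written to disk. Even when the checksum parameters are turned off, the checksum functions PageSetChecksumCopy and PageSetChecksumInplace are still called. Encrypting the page just before the checksum is computed is therefore enough to guarantee that user data is encrypted on storage.

    Decryption Module

    A page read from storage always goes through checksum verification before it enters memory. Even when the relevant parameters are turned off, the verification function PageIsVerified is still called. Decrypting right after the checksum is verified is therefore enough to guarantee that data in memory is decrypted.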

    + + + diff --git a/zh/index.html b/zh/index.html new file mode 100644 index 00000000000..90fee9ec50e --- /dev/null +++ b/zh/index.html @@ -0,0 +1,45 @@ + + + + + + + + + 文档 | PolarDB for PostgreSQL + + + + +
    PolarDB for PostgreSQL

    PolarDB for PostgreSQL

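    A cloud-native database developed in-house by Alibaba Cloud


    Quick Start with Docker

    Pull the local-storage instance image of PolarDB for PostgreSQL from DockerHub, create and run a container, and try PolarDB-PG right away:

    # Pull the PolarDB-PG image
    +docker pull polardb/polardb_pg_local_instance
    +# Create and run a container
    +docker run -it --rm polardb/polardb_pg_local_instance psql
    +# Check that it works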
    +postgres=# SELECT version();
    +            version
    +--------------------------------
    + PostgreSQL 11.9 (POLARDB 11.9)
    +(1 row)
    +
    + + + diff --git a/zh/operation/backup-and-restore.html b/zh/operation/backup-and-restore.html new file mode 100644 index 00000000000..3858b54ec97 --- /dev/null +++ b/zh/operation/backup-and-restore.html @@ -0,0 +1,266 @@ + + + + + + + + + 备份恢复 | PolarDB for PostgreSQL + + + + +

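    Backup and Restore

    慎追、棠羽

    2023/01/11

    30 min

    PolarDB for PostgreSQL uses a compute-storage separated architecture based on shared storage, so its backup and restore differs from PostgreSQL in some respects. This document explains how to back up PolarDB for PostgreSQL and how to use a backup to set up a Replica node or a Standby node.

    How Backup and Restore Works

    The PostgreSQL backup procedure can be summarized in the following steps:

    1. Enter backup mode
      • Force Full Page Write mode on and switch the current WAL segment file
      • Create a backup_label file in the data directory that records the starting point of the base backup
      • Restoring a backup must start from a checkpoint at which in-memory data and on-disk data are consistent, so wait for the next checkpoint or force a CHECKPOINT immediately
    2. Back up the database using file-system-level tools
    3. Exit backup mode
      • Reset Full Page Write mode and switch to the next WAL segment file
      • Create a backup history file containing the start and end WAL positions of the base backup, and delete the backup_label file

    The simplest way to back up a PostgreSQL database is the pg_basebackup tool.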

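    Data Directory Layout

    PolarDB for PostgreSQL uses a compute-storage separated architecture based on shared storage. Its data directories fall into two categories:

    • Local data directory: located on the local storage of each compute node and private to that node
    • Shared data directory: located on shared storage and shared by all compute nodes

    backup-dir

    The directories and files in the local data directory do not hold the database's core data, so backing up the local data directory is optional. It is possible to back up only the data directory on shared storage and then regenerate a local data directory with initdb. The compute node's local configuration files, such as postgresql.conf and pg_hba.conf, must however be backed up manually.

    Local Data Directory

    The following SQL command shows a node's local data directory: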

    postgres=# SHOW data_directory;
    +     data_directory
    +------------------------
    + /home/postgres/primary
    +(1 row)
    +

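    The local data directory resembles a PostgreSQL data directory, and most of its directories and files are generated by initdb. As the database service runs, more local files appear in it, such as temporary files, cache files, configuration files, and log files. Its layout is as follows: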

    $ tree ./ -L 1
    +./
    +├── base
    +├── current_logfiles
    +├── global
    +├── pg_commit_ts
    +├── pg_csnlog
    +├── pg_dynshmem
    +├── pg_hba.conf
    +├── pg_ident.conf
    +├── pg_log
    +├── pg_logical
    +├── pg_logindex
    +├── pg_multixact
    +├── pg_notify
    +├── pg_replslot
    +├── pg_serial
    +├── pg_snapshots
    +├── pg_stat
    +├── pg_stat_tmp
    +├── pg_subtrans
    +├── pg_tblspc
    +├── PG_VERSION
    +├── pg_xact
    +├── polar_cache_trash
    +├── polar_dma.conf
    +├── polar_fullpage
    +├── polar_node_static.conf
    +├── polar_rel_size_cache
    +├── polar_shmem
    +├── polar_shmem_stat_file
    +├── postgresql.auto.conf
    +├── postgresql.conf
    +├── postmaster.opts
    +└── postmaster.pid
    +
    +21 directories, 12 files
    +

    Shared Data Directory

    The following SQL command shows the shared data directory that all compute nodes use on shared storage:

    postgres=# SHOW polar_datadir;
    +     polar_datadir
    +-----------------------
    + /nvme1n1/shared_data/
    +(1 row)
    +

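    The shared data directory holds the core data files of PolarDB for PostgreSQL, such as table files, index files, WAL, DMA, LogIndex, and Flashback Log. These files are shared by all nodes and therefore must be backed up. Its layout is as follows: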

    $ sudo pfs -C disk ls /nvme1n1/shared_data/
    +   Dir  1     512               Wed Jan 11 09:34:01 2023  base
    +   Dir  1     7424              Wed Jan 11 09:34:02 2023  global
    +   Dir  1     0                 Wed Jan 11 09:34:02 2023  pg_tblspc
    +   Dir  1     512               Wed Jan 11 09:35:05 2023  pg_wal
    +   Dir  1     384               Wed Jan 11 09:35:01 2023  pg_logindex
    +   Dir  1     0                 Wed Jan 11 09:34:02 2023  pg_twophase
    +   Dir  1     128               Wed Jan 11 09:34:02 2023  pg_xact
    +   Dir  1     0                 Wed Jan 11 09:34:02 2023  pg_commit_ts
    +   Dir  1     256               Wed Jan 11 09:34:03 2023  pg_multixact
    +   Dir  1     0                 Wed Jan 11 09:34:03 2023  pg_csnlog
    +   Dir  1     256               Wed Jan 11 09:34:03 2023  polar_dma
    +   Dir  1     512               Wed Jan 11 09:35:09 2023  polar_fullpage
    +  File  1     32                Wed Jan 11 09:35:00 2023  RWID
    +   Dir  1     256               Wed Jan 11 10:25:42 2023  pg_replslot
    +  File  1     224               Wed Jan 11 10:19:37 2023  polar_non_exclusive_backup_label
    +total 16384 (unit: 512Bytes)
    +

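    The polar_basebackup Backup Tool

    polar_basebackup, the backup tool of PolarDB for PostgreSQL, is derived from PostgreSQL's pg_basebackup and is fully compatible with it, so it can also be used to back up and restore PostgreSQL. The polar_basebackup executable is located in the bin/ directory under the PolarDB for PostgreSQL installation directory.

    The tool's main function is to back up the data directories of a running PolarDB for PostgreSQL database (both the local data directory and the shared data directory) into a target directory.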

    polar_basebackup takes a base backup of a running PostgreSQL server.
    +
    +Usage:
    +  polar_basebackup [OPTION]...
    +
    +Options controlling the output:
    +  -D, --pgdata=DIRECTORY receive base backup into directory
    +  -F, --format=p|t       output format (plain (default), tar)
    +  -r, --max-rate=RATE    maximum transfer rate to transfer data directory
    +                         (in kB/s, or use suffix "k" or "M")
    +  -R, --write-recovery-conf
    +                         write recovery.conf for replication
    +  -T, --tablespace-mapping=OLDDIR=NEWDIR
    +                         relocate tablespace in OLDDIR to NEWDIR
    +      --waldir=WALDIR    location for the write-ahead log directory
    +  -X, --wal-method=none|fetch|stream
    +                         include required WAL files with specified method
    +  -z, --gzip             compress tar output
    +  -Z, --compress=0-9     compress tar output with given compression level
    +
    +General options:
    +  -c, --checkpoint=fast|spread
    +                         set fast or spread checkpointing
    +  -C, --create-slot      create replication slot
    +  -l, --label=LABEL      set backup label
    +  -n, --no-clean         do not clean up after errors
    +  -N, --no-sync          do not wait for changes to be written safely to disk
    +  -P, --progress         show progress information
    +  -S, --slot=SLOTNAME    replication slot to use
    +  -v, --verbose          output verbose messages
    +  -V, --version          output version information, then exit
    +      --no-slot          prevent creation of temporary replication slot
    +      --no-verify-checksums
    +                         do not verify checksums
    +  -?, --help             show this help, then exit
    +
    +Connection options:
    +  -d, --dbname=CONNSTR   connection string
    +  -h, --host=HOSTNAME    database server host or socket directory
    +  -p, --port=PORT        database server port number
    +  -s, --status-interval=INTERVAL
    +                         time between status packets sent to server (in seconds)
    +  -U, --username=NAME    connect as specified database user
    +  -w, --no-password      never prompt for password
    +  -W, --password         force password prompt (should happen automatically)
    +      --polardata=datadir  receive polar data backup into directory
    +      --polar_disk_home=disk_home  polar_disk_home for polar data backup
    +      --polar_host_id=host_id  polar_host_id for polar data backup
    +      --polar_storage_cluster_name=cluster_name  polar_storage_cluster_name for polar data backup
    +

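    The options and usage of polar_basebackup are almost identical to those of pg_basebackup. The following options related to shared storage are new:

    • --polar_disk_home / --polar_host_id / --polar_storage_cluster_name: these three options identify the shared storage node that receives the backup of the shared data
    • --polardata: the path on the backup shared storage node where the shared data is stored. If it is not specified, the shared data is backed up to polar_shared_data/ under the local data backup directory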

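    Backing Up and Restoring a Replica Node

    A base backup can be used to set up a new Replica (RO) node. As described earlier, the data files of a running PolarDB for PostgreSQL instance are spread across the local storage of each compute node and the shared storage of the storage nodes. The following shows how to use polar_basebackup to back up the instance's data files to a local disk and start a Replica node from that backup.

    Mounting the PFS File System

    First, on the machine where the Replica node will be deployed, start the PFSD daemon and mount the PFS file system of the shared storage that is already in use. The Replica node started later uses this daemon to access the shared storage.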

    sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2
    +

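    Backing Up Data to Local Storage

    Run the following command to back up the Primary node's local data and shared data to /home/postgres/replica1, the local storage path where the Replica node will be deployed: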

    polar_basebackup \
    +    --host=[Primary节点所在IP] \
    +    --port=[Primary节点所在端口号] \
    +    -D /home/postgres/replica1 \
    +    -X stream --progress --write-recovery-conf -v
    +

    The output looks like this:

    polar_basebackup: initiating base backup, waiting for checkpoint to complete
    +polar_basebackup: checkpoint completed
    +polar_basebackup: write-ahead log start point: 0/16ADD60 on timeline 1
    +polar_basebackup: starting background WAL receiver
    +polar_basebackup: created temporary replication slot "pg_basebackup_359"
    +851371/851371 kB (100%), 2/2 tablespaces
    +polar_basebackup: write-ahead log end point: 0/16ADE30
    +polar_basebackup: waiting for background process to finish streaming ...
    +polar_basebackup: base backup completed
    +

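    Once the backup completes, the backup directory can serve as the local data directory for a new Replica node. The local data directory does not need the shared data files that already exist on shared storage, so delete the polar_shared_data/ directory from it: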

    rm -rf ~/replica1/polar_shared_data
    +

    Reconfiguring the Replica Node

    Edit the Replica node's configuration file ~/replica1/postgresql.conf:

    -polar_hostid=1
    ++polar_hostid=2
    +-synchronous_standby_names='replica1'
    +

    Edit the Replica node's replication configuration file ~/replica1/recovery.conf:

    polar_replica='on'
    +recovery_target_timeline='latest'
    +primary_slot_name='replica1'
    +primary_conninfo='host=[Primary节点所在IP] port=5432 user=postgres dbname=postgres application_name=replica1'
    +

    Starting the Replica Node

    Start the Replica node:

    pg_ctl -D $HOME/replica1 start
    +

    Verifying the Replica Node

    Create a table and insert data on the Primary node. The data inserted on the Primary node can then be queried on the Replica node:

    $ psql -q \
    +    -h [Primary节点所在IP] \
    +    -p 5432 \
    +    -d postgres \
    +    -c "CREATE TABLE t (t1 INT PRIMARY KEY, t2 INT); INSERT INTO t VALUES (1, 1),(2, 3),(3, 3);"
    +
    +$ psql -q \
    +    -h [Replica节点所在IP] \
    +    -p 5432 \
    +    -d postgres \
    +    -c "SELECT * FROM t;"
    + t1 | t2
    +----+----
    +  1 |  1
    +  2 |  3
    +  3 |  3
    +(3 rows)
    +

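    Backing Up and Restoring a Standby Node

    A base backup can also be used to set up a new Standby node. As shown in the figure below, the Standby node uses its own shared storage, separate from that of the Primary / Replica nodes, and stays in sync with the Primary node through physical replication. A Standby node can serve as disaster recovery for the primary shared storage.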

    backup-dir

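    Formatting and Mounting the PFS File System

    Assume the machine where the Standby compute node will be deployed already has the standby shared storage nvme2n1 ready: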

    $ lsblk
    +NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    +nvme0n1     259:1    0  40G  0 disk
    +└─nvme0n1p1 259:2    0  40G  0 part /etc/hosts
    +nvme2n1     259:3    0  70G  0 disk
    +nvme1n1     259:0    0  60G  0 disk
    +

    Format this shared storage as PFS, then start the PFSD daemon to mount the PFS file system:

    sudo pfs -C disk mkfs nvme2n1
    +sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme2n1 -w 2
    +

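    Backing Up Data to Local Storage and Shared Storage

    Run the backup on the machine where the Standby node will be deployed, using ~/standby as the local data directory and /nvme2n1/shared_data as the shared storage directory: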

    polar_basebackup \
    +    --host=[Primary节点所在IP] \
    +    --port=[Primary节点所在端口号] \
    +    -D /home/postgres/standby \
    +    --polardata=/nvme2n1/shared_data/ \
    +    --polar_storage_cluster_name=disk \
    +    --polar_disk_name=nvme2n1 \
    +    --polar_host_id=3 \
    +    -X stream --progress --write-recovery-conf -v
    +

    The output looks like this. In addition to the output of polar_basebackup, it contains PFS log output:

    [PFSD_SDK INF Jan 11 10:11:27.247112][99]pfs_mount_prepare 103: begin prepare mount cluster(disk), PBD(nvme2n1), hostid(3),flags(0x13)
    +[PFSD_SDK INF Jan 11 10:11:27.247161][99]pfs_mount_prepare 165: pfs_mount_prepare success for nvme2n1 hostid 3
    +[PFSD_SDK INF Jan 11 10:11:27.293900][99]chnl_connection_poll_shm 1238: ack data update s_mount_epoch 1
    +[PFSD_SDK INF Jan 11 10:11:27.293912][99]chnl_connection_poll_shm 1266: connect and got ack data from svr, err = 0, mntid 0
    +[PFSD_SDK INF Jan 11 10:11:27.293979][99]pfsd_sdk_init 191: pfsd_chnl_connect success
    +[PFSD_SDK INF Jan 11 10:11:27.293987][99]pfs_mount_post 208: pfs_mount_post err : 0
    +[PFSD_SDK ERR Jan 11 10:11:27.297257][99]pfsd_opendir 1437: opendir /nvme2n1/shared_data/ error: No such file or directory
    +[PFSD_SDK INF Jan 11 10:11:27.297396][99]pfsd_mkdir 1320: mkdir /nvme2n1/shared_data
    +polar_basebackup: initiating base backup, waiting for checkpoint to complete
    +WARNING:  a labelfile "/nvme1n1/shared_data//polar_non_exclusive_backup_label" is already on disk
    +HINT:  POLAR: we overwrite it
    +polar_basebackup: checkpoint completed
    +polar_basebackup: write-ahead log start point: 0/16C91F8 on timeline 1
    +polar_basebackup: starting background WAL receiver
    +polar_basebackup: created temporary replication slot "pg_basebackup_373"
    +...
    +[PFSD_SDK INF Jan 11 10:11:32.992005][99]pfsd_open 539: open /nvme2n1/shared_data/polar_non_exclusive_backup_label with inode 6325, fd 0
    +[PFSD_SDK INF Jan 11 10:11:32.993074][99]pfsd_open 539: open /nvme2n1/shared_data/global/pg_control with inode 8373, fd 0
    +851396/851396 kB (100%), 2/2 tablespaces
    +polar_basebackup: write-ahead log end point: 0/16C9300
    +polar_basebackup: waiting for background process to finish streaming ...
    +polar_basebackup: base backup completed
    +[PFSD_SDK INF Jan 11 10:11:52.378220][99]pfsd_umount_force 247: pbdname nvme2n1
    +[PFSD_SDK INF Jan 11 10:11:52.378229][99]pfs_umount_prepare 269: pfs_umount_prepare. pbdname:nvme2n1
    +[PFSD_SDK INF Jan 11 10:11:52.404010][99]chnl_connection_release_shm 1164: client umount return : deleted /var/run/pfsd//nvme2n1/99.pid
    +[PFSD_SDK INF Jan 11 10:11:52.404171][99]pfs_umount_post 281: pfs_umount_post. pbdname:nvme2n1
    +[PFSD_SDK INF Jan 11 10:11:52.404174][99]pfsd_umount_force 261: umount success for nvme2n1
    +

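    The command above backs up the Primary node's local data directory to the local storage of the current machine, and backs up the shared data directory to the shared storage directory given by the options.

    Reconfiguring the Standby Node

    Edit the Standby node's configuration file ~/standby/postgresql.conf: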

    -polar_hostid=1
    ++polar_hostid=3
    +-polar_disk_name='nvme1n1'
    +-polar_datadir='/nvme1n1/shared_data/'
    ++polar_disk_name='nvme2n1'
    ++polar_datadir='/nvme2n1/shared_data/'
    +-synchronous_standby_names='replica1'
    +

    Add the following to the Standby node's replication configuration file ~/standby/recovery.conf:

    +recovery_target_timeline = 'latest'
    ++primary_slot_name = 'standby1'
    +

    Starting the Standby Node

    On the Primary node, create the replication slot used for physical replication with the Standby:

    $ psql \
    +    --host=[Primary节点所在IP] --port=5432 \
    +    -d postgres \
    +    -c "SELECT * FROM pg_create_physical_replication_slot('standby1');"
    + slot_name | lsn
    +-----------+-----
    + standby1  |
    +(1 row)
    +

    Start the Standby node:

    pg_ctl -D $HOME/standby start
    +

    Verifying the Standby Node

    Create a table and insert data on the Primary node. The data can then be queried on the Standby node:

    $ psql -q \
    +    -h [Primary节点所在IP] \
    +    -p 5432 \
    +    -d postgres \
    +    -c "CREATE TABLE t (t1 INT PRIMARY KEY, t2 INT); INSERT INTO t VALUES (1, 1),(2, 3),(3, 3);"
    +
    +$ psql -q \
    +    -h [Standby节点所在IP] \
    +    -p 5432 \
    +    -d postgres \
    +    -c "SELECT * FROM t;"
    + t1 | t2
    +----+----
    +  1 |  1
    +  2 |  3
    +  3 |  3
    +(3 rows)
    +
    + + + diff --git a/zh/operation/cpu-usage-high.html b/zh/operation/cpu-usage-high.html new file mode 100644 index 00000000000..f4e7af8db07 --- /dev/null +++ b/zh/operation/cpu-usage-high.html @@ -0,0 +1,66 @@ + + + + + + + + + CPU 使用率高的排查方法 | PolarDB for PostgreSQL + + + + +

    Troubleshooting High CPU Usage

    棠羽

    2023/03/06

    20 min

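    When using PolarDB for PostgreSQL, CPU usage may rise abnormally or even reach full load. This article describes the common causes of this situation, how to troubleshoot them, and the corresponding solutions.

    Increased Workload

    When CPU usage rises, the most likely cause is that an increase in workload is making the database consume more computing resources. The first step is therefore to check whether the number of active connections is much higher than usual. If the database has a monitoring system, the trend of active connections can be observed in its charts; otherwise, connect to the database directly and run the following SQL to get the current number of active connections:

    SELECT COUNT(*) FROM pg_stat_activity WHERE state NOT LIKE 'idle';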

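    pg_stat_activity is a built-in PostgreSQL system view. Each row it returns corresponds to a running PostgreSQL process, and the state column shows the current state of that process. The possible values are:

    • active: the process is executing a query
    • idle: the process is idle, waiting for a new client command
    • idle in transaction: the process is in a transaction but is not currently executing a query
    • idle in transaction (aborted): the process is in a transaction, and one of its statements caused an error
    • fastpath function call: the process is executing a fast-path function
    • disabled: state tracking is disabled for the process

    The SQL above returns the number of processes whose state is anything other than idle, that is, the active connections that may be consuming CPU. If the number of active connections is higher than usual, the rise in CPU usage is expected.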
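The filter in the query above can be mirrored in a few lines of Python to make its behaviour concrete. This is only a sketch: the function name and the sample list are hypothetical. It shows that the pattern 'idle' has no wildcard, so only the exact value idle is excluded and the "idle in transaction" states are still counted.

```python
from typing import Iterable, Optional


def count_non_idle(states: Iterable[Optional[str]]) -> int:
    """Mirror `WHERE state NOT LIKE 'idle'` on a list of state values.

    The pattern contains no wildcard, so only the exact value 'idle' is
    excluded. Rows whose state is NULL (None here) are filtered out too,
    because NULL NOT LIKE 'idle' evaluates to NULL rather than true.
    """
    return sum(1 for s in states if s is not None and s != "idle")


# Hypothetical sample of the state column.
sample = ["active", "idle", "idle in transaction", None, "active"]
print(count_non_idle(sample))
```

If only sessions that are actually executing a query are of interest, filtering on state = 'active' would be the stricter choice.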

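    Slow Queries

    If CPU usage rises while the number of active connections stays within its normal range, there may be a large number of poorly performing slow queries. These slow queries may have been occupying a lot of CPU for a long time, driving CPU usage up. PostgreSQL provides a slow query log: any SQL statement whose execution time exceeds log_min_duration_statement is recorded in it. However, when CPU usage approaches full load, the whole system stalls and all SQL statements may slow down, so the slow query log may contain a very large number of entries and be hard to analyze.

    Locating Slow Queries with Long Execution Times

    The pg_stat_statements extension records statistics about the planning and execution of all SQL statements on the database server. Because the extension uses shared memory, its name must be listed in the shared_preload_libraries parameter.

    If the pg_stat_statements extension has not been created in the current database, create it first. This registers the functions and views provided by the extension:

    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;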

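    Both the extension and the database system itself keep accumulating statistics. To investigate the period after CPU usage rose abnormally, clear the statistics retained in the database and the extension, and then collect statistics starting from the current moment:

    -- Reset the statistics of the current database
    SELECT pg_stat_reset();
    -- Reset the statistics collected so far by the pg_stat_statements extension
    SELECT pg_stat_statements_reset();

    Next, wait for a while (1 to 2 minutes) so that the database and the extension can collect enough statistics for this period.

    Once the statistics have been collected, use SQL like the following to list the 5 statements with the longest total execution time:

    -- < PostgreSQL 13
    SELECT * FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;
    -- >= PostgreSQL 13
    SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;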
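When the same check has to run against servers of different versions, the column choice can be made programmatically. The sketch below is illustrative only: the function name is hypothetical, and it assumes the caller has already obtained the integer server version (the value PostgreSQL reports as server_version_num).

```python
def top_statements_sql(server_version_num: int, limit: int = 5) -> str:
    """Build the pg_stat_statements query for a given server version.

    server_version_num is PostgreSQL's integer version form, for example
    110009 for 11.9 or 130000 for 13.0.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # The column was renamed from total_time to total_exec_time in PostgreSQL 13.
    column = "total_exec_time" if server_version_num >= 130000 else "total_time"
    return f"SELECT * FROM pg_stat_statements ORDER BY {column} DESC LIMIT {limit};"


print(top_statements_sql(110009))
```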

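    Locating Slow Queries That Read Many Buffers

    When a table lacks an index and most queries against it are point lookups, the database has to fall back to full table scans, evaluate the filter conditions in memory, and discard a large number of non-matching rows, which raises CPU usage significantly. Using the statistics from the pg_stat_statements extension, SQL like the following lists the 5 statements that have read the most buffers so far:

    SELECT * FROM pg_stat_statements
    ORDER BY shared_blks_hit + shared_blks_read DESC
    LIMIT 5;

    The statistics in the built-in PostgreSQL system view pg_stat_user_tables can also reveal the tables that are scanned sequentially most often. SQL like the following returns the 5 tables that hold a substantial amount of data (more than 100,000 live tuples) and have had the most tuples read by full table scans:

    SELECT * FROM pg_stat_user_tables
    WHERE n_live_tup > 100000 AND seq_scan > 0
    ORDER BY seq_tup_read DESC
    LIMIT 5;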

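    Locating Long-Running Slow Queries

    The built-in system view pg_stat_activity can be used to find SQL statements that have been running for a long time without finishing; these are very likely to be the cause of high CPU usage. SQL like the following returns the 5 statements that have been running the longest and have not yet exited:

    SELECT
        *,
        extract(epoch FROM (NOW() - xact_start)) AS xact_stay,
        extract(epoch FROM (NOW() - query_start)) AS query_stay
    FROM pg_stat_activity
    WHERE state NOT LIKE 'idle%'
    ORDER BY query_stay DESC
    LIMIT 5;

    Combined with the tables with the most full table scans identified in the previous step, SQL like the following returns the slow queries on such a table that have been running longer than a given threshold (for example, 10s):

    SELECT * FROM pg_stat_activity
    WHERE
        state NOT LIKE 'idle%' AND
        query ILIKE '%table_name%' AND
        NOW() - query_start > interval '10s';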

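    Solutions and Optimization Ideas

    For SQL statements that consume an abnormally large amount of CPU, if only a few unexpected statements are involved, you can first interrupt them by sending a signal to their backend processes so that CPU usage returns to normal. Using the following SQL, pass the pid of the process running the slow query (the pid column of the pg_stat_activity view) as the argument to stop that process:

    SELECT pg_cancel_backend(pid);
    SELECT pg_terminate_backend(pid);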
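The two functions differ in strength: pg_cancel_backend cancels only the query that the backend is currently running, while pg_terminate_backend ends the whole session. A common approach is to try the cancel first and escalate only if the query does not stop. The helper below is a hypothetical sketch that just builds the statements for a list of pids; it does not connect to a database.

```python
from typing import Iterable, List


def stop_statements(pids: Iterable[int], terminate: bool = False) -> List[str]:
    """Build one statement per pid.

    terminate=False uses pg_cancel_backend (cancel the current query only);
    terminate=True uses pg_terminate_backend (end the whole session).
    """
    func = "pg_terminate_backend" if terminate else "pg_cancel_backend"
    # int() guards against non-numeric input ending up in the SQL text.
    return [f"SELECT {func}({int(pid)});" for pid in pids]


# Hypothetical pids taken from the pid column of pg_stat_activity.
for stmt in stop_statements([4711, 4712]):
    print(stmt)
```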

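    If the slow SQL statement is required by the business, it needs to be tuned.

    First, you can sample the tables involved in the statement and update their statistics so that the optimizer can produce a more accurate execution plan. Sampling consumes some CPU, so it is best run during off-peak hours:

    ANALYZE table_name;

    For tables with many full table scans, create indexes on the columns commonly used in filters so that index scans are used wherever possible, reducing the CPU wasted when full table scans filter out non-matching rows in memory.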

    diff --git a/zh/operation/grow-storage.html b/zh/operation/grow-storage.html
    new file mode 100644
    index 00000000000..2d03bed533f
    --- /dev/null
    +++ b/zh/operation/grow-storage.html
    @@ -0,0 +1,60 @@

    Online Expansion of Shared Storage

    棠羽

    2022/10/12

    15 min

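    As the amount of data grows, the storage space used by the database inevitably needs to be expanded. Because PolarDB for PostgreSQL is built on shared storage and the distributed file system PFS, expansion, like installation and deployment, requires operations at the following three layers:

    • Block storage layer
    • File system layer
    • Database instance layer

    This article guides you through the expansion at each of these three layers, so that storage can be expanded dynamically without stopping the database instance.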

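    Block Storage Layer Expansion

    The first step is expansion at the block storage layer. Whatever type of shared storage is used, the goal at this layer is the same: running lsblk on a host that can access the shared storage should show that the physical space of the block device has grown. Because different types of shared storage are expanded in different ways, this article uses Alibaba Cloud ECS + ESSD cloud disk shared storage as an example.

    In addition, to make sure the subsequent steps succeed, expand the storage in units of 10GB.

    In this example, before the expansion, a 20GB ESSD cloud disk is attached to two ECS instances through multi-attach. Running lsblk on these two ECS instances shows that the block device nvme1n1, which corresponds to the ESSD shared storage, currently has 20GB of physical space.

    $ lsblk
    NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    nvme0n1     259:0    0  40G  0 disk
    └─nvme0n1p1 259:1    0  40G  0 part /etc/hosts
    nvme1n1     259:2    0  20G  0 disk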

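    Next, expand this ESSD cloud disk. On the management page of the Alibaba Cloud ESSD cloud disk, click the disk resize option (云盘扩容):

    essd-storage-grow

    On the resize page, you can see that the cloud disk is attached to two ECS instances. Enter the new capacity and confirm the resize to expand the 20GB cloud disk to 40GB:

    essd-storage-online-grow

    After the expansion succeeds, the following message is shown:

    essd-storage-grow-complete

    Now, running lsblk on the two ECS instances shows that the physical space of the block device nvme1n1 has become 40GB:

    $ lsblk
    NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    nvme0n1     259:0    0  40G  0 disk
    └─nvme0n1p1 259:1    0  40G  0 part /etc/hosts
    nvme1n1     259:2    0  40G  0 disk

    This completes the expansion at the block storage layer.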

    File System Layer Expansion

    After the physical block device has been expanded, use the tools provided by the PFS file system to format the newly added physical space on the block device, which completes the expansion at the file system layer.

    On any one host that can access the shared storage, run the PFS growfs command, where:

    • -o is the size of the shared storage before expansion (in units of 10GB)
    • -n is the size of the shared storage after expansion (in units of 10GB)

    This example expands the shared storage from 20GB to 40GB, so the two parameters are 2 and 4 respectively:

    $ sudo pfs -C disk growfs -o 2 -n 4 nvme1n1

    ...

    Init chunk 2
                    metaset        2/1: sectbda      0x500001000, npage       80, objsize  128, nobj 2560, oid range [    2000,     2a00)
                    metaset        2/2: sectbda      0x500051000, npage       64, objsize  128, nobj 2048, oid range [    1000,     1800)
                    metaset        2/3: sectbda      0x500091000, npage       64, objsize  128, nobj 2048, oid range [    1000,     1800)

    Init chunk 3
                    metaset        3/1: sectbda      0x780001000, npage       80, objsize  128, nobj 2560, oid range [    3000,     3a00)
                    metaset        3/2: sectbda      0x780051000, npage       64, objsize  128, nobj 2048, oid range [    1800,     2000)
                    metaset        3/3: sectbda      0x780091000, npage       64, objsize  128, nobj 2048, oid range [    1800,     2000)

    pfs growfs succeeds!

    If you see the output above, the expansion at the file system layer is complete.
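The -o and -n values are simply the old and new disk sizes divided by the 10GB unit. The sketch below derives them and rejects sizes that are not a multiple of 10GB. The function name is hypothetical, and the printed command only reuses the cluster name disk and the device nvme1n1 from the example above.

```python
from typing import Tuple


def growfs_args(old_gb: int, new_gb: int, unit_gb: int = 10) -> Tuple[int, int]:
    """Convert disk sizes in GB into the -o / -n values for pfs growfs."""
    for size in (old_gb, new_gb):
        if size <= 0 or size % unit_gb != 0:
            raise ValueError(f"{size}GB is not a positive multiple of {unit_gb}GB")
    if new_gb <= old_gb:
        raise ValueError("the new size must be larger than the old size")
    return old_gb // unit_gb, new_gb // unit_gb


old_units, new_units = growfs_args(20, 40)
print(f"sudo pfs -C disk growfs -o {old_units} -n {new_units} nvme1n1")
```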

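    Database Instance Layer Expansion

    Finally, at the database instance layer, expansion consists of executing a SQL function that notifies the PFSD (PFS Daemon) on each instance that has mounted the shared storage that the new space is ready for use. Note that every PFSD in the database cluster must be notified, and the PFSDs on all RO nodes must be notified first, followed by the PFSD on the RW node. This means the SQL function that notifies PFSD must be executed once on every PolarDB for PostgreSQL node, RO nodes first and the RW node last.

    The expansion function that notifies PFSD is implemented in the polar_vfs extension of PolarDB for PostgreSQL, so the polar_vfs extension must first be loaded on the RW node. Loading the extension registers the SQL function polar_vfs_disk_expansion on the RW node and on all RO nodes.

    CREATE EXTENSION IF NOT EXISTS polar_vfs;

    Next, execute this SQL function on every RO node in turn, and then on the RW node. The function's argument is the name of the block device:

    SELECT polar_vfs_disk_expansion('nvme1n1');

    Once this has been executed, the expansion at the database instance layer is complete, and the new storage space can now be used by the database.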
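The ordering rule (all RO nodes first, the RW node last) is easy to get wrong when scripting the notification across a cluster. The sketch below orders a node list accordingly. The node names and the (name, role) representation are hypothetical; the function only produces the order and does not connect to any node.

```python
from typing import Iterable, List, Tuple


def expansion_order(nodes: Iterable[Tuple[str, str]]) -> List[str]:
    """Return node names with every RO node before the single RW node.

    nodes is an iterable of (name, role) pairs, where role is 'RO' or 'RW'.
    """
    nodes = list(nodes)
    rw = [name for name, role in nodes if role == "RW"]
    ro = [name for name, role in nodes if role == "RO"]
    if len(rw) != 1 or len(rw) + len(ro) != len(nodes):
        raise ValueError("expected exactly one RW node and only RO/RW roles")
    return ro + rw


# Hypothetical cluster layout.
cluster = [("primary", "RW"), ("replica1", "RO"), ("replica2", "RO")]
for name in expansion_order(cluster):
    print(f"{name}: SELECT polar_vfs_disk_expansion('nvme1n1');")
```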

    diff --git a/zh/operation/ro-online-promote.html b/zh/operation/ro-online-promote.html
    new file mode 100644
    index 00000000000..86d78d2291e
    --- /dev/null
    +++ b/zh/operation/ro-online-promote.html
    @@ -0,0 +1,57 @@

    Online Promotion of a Read-Only Node

    棠羽

    2022/12/25

    15 min

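    PolarDB for PostgreSQL is a cloud-native database that separates storage from compute. All compute nodes share a single copy of the storage, and access to the storage follows a one-writer, multiple-reader constraint: every compute node can read from the storage, but only one compute node can write to it. This constraint raises a problem: when the read-write node becomes unavailable because of a crash or network failure, no compute node in the cluster can write to the storage, and the application's inserts, deletes, updates, and DDL can no longer run.

    This article guides you through promoting any read-only node to a read-write node online when the read-write node in a PolarDB for PostgreSQL compute cluster stops serving, thereby restoring the cluster's ability to write to the shared storage.

    Prerequisites

    For convenience, this example uses an instance based on local disks. Pull the following image and start a container to get an HTAP instance based on local disks:

    docker pull polardb/polardb_pg_local_instance
    docker run -it \
        --cap-add=SYS_PTRACE \
        --privileged=true \
        --name polardb_pg_htap \
        --shm-size=512m \
        polardb/polardb_pg_local_instance \
        bash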

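    Ports 5432 through 5434 in the container run one read-write node and two read-only nodes respectively. The two read-only nodes share the same data as the read-write node and keep their in-memory state in sync with it through physical replication.

    Verify That the Read-Only Node Is Not Writable

    First, connect to the read-write node, create a table, and insert some data:

    psql -p5432

    postgres=# CREATE TABLE t (id int);
    CREATE TABLE
    postgres=# INSERT INTO t SELECT generate_series(1,10);
    INSERT 0 10

    Then connect to a read-only node and try to insert data into the table in the same way. The insert is rejected:

    psql -p5433

    postgres=# INSERT INTO t SELECT generate_series(1,10);
    ERROR:  cannot execute INSERT in a read-only transaction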

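    Stop Writes on the Read-Write Node

    Now shut down the read-write node to simulate it becoming unavailable:

    $ pg_ctl -D ~/tmp_master_dir_polardb_pg_1100_bld/ stop
    waiting for server to shut down.... done
    server stopped

    At this point, no node in the cluster can write to the storage. We need to promote a read-only node to a read-write node to restore writes to the storage.

    Promote the Read-Only Node

    A read-only node may be promoted only after the read-write node has stopped writing; otherwise two nodes in the cluster would write at the same time. When the database detects writes from multiple nodes, it runs abnormally.

    Promote the read-only node running on port 5433 to a read-write node:

    $ pg_ctl -D ~/tmp_replica_dir_polardb_pg_1100_bld1/ promote
    waiting for server to promote.... done
    server promoted

    Compute Cluster Resumes Writes

    Connect to the newly promoted read-write node and retry the earlier INSERT:

    postgres=# INSERT INTO t SELECT generate_series(1,10);
    INSERT 0 10

    The result above shows that the new read-write node can write to the storage successfully, which means the former read-only node has been promoted to a read-write node.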

    diff --git a/zh/operation/scale-out.html b/zh/operation/scale-out.html
    new file mode 100644
    index 00000000000..61a43be8b4d
    --- /dev/null
    +++ b/zh/operation/scale-out.html
    @@ -0,0 +1,182 @@

    Scaling Compute Nodes Out and In

    棠羽

    2022/12/19

    30 min

    PolarDB for PostgreSQL is a database that separates storage from compute. All compute nodes share the storage, and compute nodes can be added or removed elastically as needed without any data migration. This tutorial therefore helps you add or remove compute nodes on a shared storage cluster.

    Deploy the Read-Write Node

    First, on a shared storage cluster that has already been set up, initialize and start the first compute node, the read-write node, which can both read from and write to the shared storage. The following image provides pre-built executables of the PolarDB for PostgreSQL kernel and its companion tools:

    $ docker pull polardb/polardb_pg_binary
    $ docker run -it \
        --cap-add=SYS_PTRACE \
        --privileged=true \
        --name polardb_pg \
        --shm-size=512m \
        polardb/polardb_pg_binary \
        bash

    $ ls ~/tmp_basedir_polardb_pg_1100_bld/bin/
    clusterdb     dropuser           pg_basebackup   pg_dump         pg_resetwal    pg_test_timing       polar-initdb.sh          psql
    createdb      ecpg               pgbench         pg_dumpall      pg_restore     pg_upgrade           polar-replica-initdb.sh  reindexdb
    createuser    initdb             pg_config       pg_isready      pg_rewind      pg_verify_checksums  polar_tools              vacuumdb
    dbatools.sql  oid2name           pg_controldata  pg_receivewal   pg_standby     pg_waldump           postgres                 vacuumlo
    dropdb        pg_archivecleanup  pg_ctl          pg_recvlogical  pg_test_fsync  polar_basebackup     postmaster

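    Confirm That the Storage Is Accessible

    Use the lsblk command to confirm that the storage cluster is accessible from the current machine. In the following example, nvme1n1 is the block device of the shared storage to be used:

    $ lsblk
    NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    nvme0n1     259:0    0   40G  0 disk
    └─nvme0n1p1 259:1    0   40G  0 part /etc/hosts
    nvme1n1     259:2    0  100G  0 disk

    Format and Mount the PFS File System

    At this point the shared storage is empty. Use the PFS tool in the container to format the shared storage as a PFS file system:

    sudo pfs -C disk mkfs nvme1n1

    After formatting, start the PFS daemon in the current container to mount the file system. The compute node will later use this daemon to access the shared storage:

    sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2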

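    Initialize the Data Directories

    Use initdb to create the local data directory at ~/primary on the node's local storage. The local data directory holds node-private information such as the node's configuration and audit logs:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D $HOME/primary

    Use the PFS tool to create a shared data directory on the shared storage, then use the polar-initdb.sh script to copy the data files that will be shared by all nodes into that directory. The shared files include all table files, WAL files, and so on:

    sudo pfs -C disk mkdir /nvme1n1/shared_data

    sudo $HOME/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh \
        $HOME/primary/ /nvme1n1/shared_data/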

    Edit the Read-Write Node Configuration

    Modify the read-write node's configuration file ~/primary/postgresql.conf so that the database starts in shared mode and can find the data directory on the shared storage:

    port=5432
    polar_hostid=1

    polar_enable_shared_storage_mode=on
    polar_disk_name='nvme1n1'
    polar_datadir='/nvme1n1/shared_data/'
    polar_vfs.localfs_mode=off
    shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
    polar_storage_cluster_name='disk'

    logging_collector=on
    log_line_prefix='%p\t%r\t%u\t%m\t'
    log_directory='pg_log'
    listen_addresses='*'
    max_connections=1000
    synchronous_standby_names='replica1'

    Edit the read-write node's client authentication file ~/primary/pg_hba.conf to allow clients from any address to perform physical replication as the postgres user:

    host	replication	postgres	0.0.0.0/0	trust

    Start the Read-Write Node

    Use the following commands to start the read-write node and check that it runs normally:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/primary start

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
        -p 5432 \
        -d postgres \
        -c 'SELECT version();'
                version
    --------------------------------
     PostgreSQL 11.9 (POLARDB 11.9)
    (1 row)

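    Scale Out the Cluster

    Next, add a new compute node to the compute cluster, which already has one read-write node. Because PolarDB for PostgreSQL uses a one-writer, multiple-reader architecture, any node added later can only read from the shared storage and cannot write to it. Read-only nodes keep their in-memory state in sync through physical replication with the read-write node.

    Similarly, on the machine where the new compute node will be deployed, pull the image and start a container with the executables:

    docker pull polardb/polardb_pg_binary
    docker run -it \
        --cap-add=SYS_PTRACE \
        --privileged=true \
        --name polardb_pg \
        --shm-size=512m \
        polardb/polardb_pg_binary \
        bash

    Confirm That the Storage Is Accessible

    Make sure the machine hosting the read-only node can also access the block device of the shared storage:

    $ lsblk
    NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    nvme0n1     259:0    0   40G  0 disk
    └─nvme0n1p1 259:1    0   40G  0 part /etc/hosts
    nvme1n1     259:2    0  100G  0 disk

    Mount the PFS File System

    The shared storage has already been formatted as PFS by the read-write node, so there is no need to format it again. Simply start the PFS daemon to complete the mount:

    sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 2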

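    Initialize the Data Directory

    Create an empty directory at ~/replica1 on the read-only node's local disk, then use the polar-replica-initdb.sh script to initialize the read-only node's local directory from the data directory on the shared storage. The initialized local directory contains no default configuration files, so you also need to use initdb to create a temporary local directory as a template and copy all default configuration files into the read-only node's local directory:

    mkdir -m 0700 $HOME/replica1
    sudo ~/tmp_basedir_polardb_pg_1100_bld/bin/polar-replica-initdb.sh \
        /nvme1n1/shared_data/ $HOME/replica1/

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D /tmp/replica1
    cp /tmp/replica1/*.conf $HOME/replica1/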

    Edit the Read-Only Node Configuration

    Edit the read-only node's configuration file ~/replica1/postgresql.conf to set the node's cluster identifier and listening port, as well as the same shared storage directory used by the read-write node:

    port=5432
    polar_hostid=2

    polar_enable_shared_storage_mode=on
    polar_disk_name='nvme1n1'
    polar_datadir='/nvme1n1/shared_data/'
    polar_vfs.localfs_mode=off
    shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'
    polar_storage_cluster_name='disk'

    logging_collector=on
    log_line_prefix='%p\t%r\t%u\t%m\t'
    log_directory='pg_log'
    listen_addresses='*'
    max_connections=1000

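    Edit the read-only node's replication configuration file ~/replica1/recovery.conf to set the role of the current node (read-only), as well as the connection string and replication slot used for physical replication from the read-write node:

    polar_replica='on'
    recovery_target_timeline='latest'
    primary_conninfo='host=[IP of the read-write node] port=5432 user=postgres dbname=postgres application_name=replica1'
    primary_slot_name='replica1'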
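When more read-only nodes are added, each one needs its own recovery.conf with a distinct application_name and replication slot. The sketch below renders the file from a template. It assumes, as in the example above, that the same name is used for both application_name and primary_slot_name; the function name is hypothetical and the host 192.0.2.10 is only a placeholder address.

```python
def render_recovery_conf(primary_host: str, replica_name: str, port: int = 5432) -> str:
    """Render recovery.conf for one read-only node.

    replica_name is used for both application_name and primary_slot_name,
    mirroring the replica1 example; this is a convention, not a requirement.
    """
    lines = [
        "polar_replica='on'",
        "recovery_target_timeline='latest'",
        f"primary_conninfo='host={primary_host} port={port} user=postgres "
        f"dbname=postgres application_name={replica_name}'",
        f"primary_slot_name='{replica_name}'",
    ]
    return "\n".join(lines) + "\n"


# Placeholder host and a hypothetical second replica.
print(render_recovery_conf("192.0.2.10", "replica2"), end="")
```

The replication slot named in primary_slot_name still has to be created on the read-write node before the new node starts.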

    The read-write node does not yet have a replication slot named replica1, so connect to the read-write node and create it:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
        -p 5432 \
        -d postgres \
        -c "SELECT pg_create_physical_replication_slot('replica1');"
     pg_create_physical_replication_slot
    -------------------------------------
     (replica1,)
    (1 row)

    Start the Read-Only Node

    After completing the steps above, start the read-only node and verify it:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/replica1 start

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
        -p 5432 \
        -d postgres \
        -c 'SELECT version();'
                version
    --------------------------------
     PostgreSQL 11.9 (POLARDB 11.9)
    (1 row)

    Check Cluster Functionality

    Connect to the read-write node, create a table, and insert data:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
        -p 5432 \
        -d postgres \
        -c "CREATE TABLE t(id INT); INSERT INTO t SELECT generate_series(1,10);"

    The data inserted on the read-write node can be queried immediately on the read-only node:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
        -p 5432 \
        -d postgres \
        -c "SELECT * FROM t;"
     id
    ----
      1
      2
      3
      4
      5
      6
      7
      8
      9
     10
    (10 rows)

    On the read-write node, the replication slot used for physical replication with the read-only node is now active:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q \
        -p 5432 \
        -d postgres \
        -c "SELECT * FROM pg_replication_slots;"
     slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
    -----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
     replica1  |        | physical  |        |          | f         | t      |         45 |      |              | 0/4079E8E8  |
    (1 row)

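    More read-only nodes can be added in the same way.

    Scale In the Cluster

    Scaling in the cluster is simple: just shut down the read-only node.

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/replica1 stop

    After the read-only node is shut down, its replication slot on the read-write node becomes inactive. An inactive replication slot prevents WAL files from being recycled, so it should be cleaned up promptly.

    Run the following command on the read-write node to remove the replication slot named replica1:

    $HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql \
        -p 5432 \
        -d postgres \
        -c "SELECT pg_drop_replication_slot('replica1');"
     pg_drop_replication_slot
    --------------------------

    (1 row)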
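After several nodes have been removed, it can help to list every inactive slot in one pass. The sketch below takes rows shaped like the slot_name and active columns of pg_replication_slots and builds the matching drop statements. The function name and the sample rows are hypothetical, and the code only generates SQL text; review the list before running anything, since a slot may be inactive only because its node is temporarily down.

```python
from typing import Dict, Iterable, List


def drop_inactive_slot_statements(slots: Iterable[Dict[str, object]]) -> List[str]:
    """Build pg_drop_replication_slot calls for slots whose active flag is false.

    Each item needs 'slot_name' and 'active' keys, like the corresponding
    columns of pg_replication_slots.
    """
    return [
        f"SELECT pg_drop_replication_slot('{slot['slot_name']}');"
        for slot in slots
        if not slot["active"]
    ]


# Hypothetical rows: replica1 has been stopped, replica2 is still streaming.
rows = [
    {"slot_name": "replica1", "active": False},
    {"slot_name": "replica2", "active": True},
]
for stmt in drop_inactive_slot_statements(rows):
    print(stmt)
```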
    diff --git a/zh/operation/tpcc-test.html b/zh/operation/tpcc-test.html
    new file mode 100644
    index 00000000000..db92c85d1b5
    --- /dev/null
    +++ b/zh/operation/tpcc-test.html
    @@ -0,0 +1,59 @@

    TPC-C Test

    棠羽

    2023/04/11

    15 min

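    This article guides you through running a TPC-C test against PolarDB for PostgreSQL.

    Background

    TPC (Transaction Processing Performance Council) publishes a series of specifications for transaction processing and database benchmarks. Among them, TPC-C is the benchmark model for OLTP. The TPC-C model provides a unified standard for benchmarking and gives a general picture of a database service's stability, performance, and overall system performance. Running a TPC-C benchmark against a database measures the performance of the database on the one hand, and the price/performance ratio of different hardware and software systems on the other. It is a test model that is widely used and followed in the industry.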

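    Test Steps

    Deploy PolarDB-PG

    Refer to the deployment tutorials to deploy PolarDB for PostgreSQL.

    Install the Test Tool BenchmarkSQL

    BenchmarkSQL depends on a Java runtime and the Maven package manager, which must be installed beforehand. After cloning the BenchmarkSQL source code and entering the directory, build the project with mvn:

    $ git clone https://github.com/pgsql-io/benchmarksql.git
    $ cd benchmarksql
    $ mvn

    The built tool is located in the following directory:

    $ cd target/run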

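    TPC-C Configuration

    The directory of the built tool contains sample configurations for different database products:

    $ ls | grep sample
    sample.firebird.properties
    sample.mariadb.properties
    sample.oracle.properties
    sample.postgresql.properties
    sample.transact-sql.properties

    Among them, sample.postgresql.properties contains template parameters for PostgreSQL-family databases; you can modify this template to create a custom configuration. See the BenchmarkSQL documentation for a detailed description of the configuration items.

    The configuration items fall into the following categories:

    • JDBC driver and connection information: you need to set the connection string, user name, password, and so on for the running PostgreSQL database
    • Test scale parameters
    • Test duration parameters
    • Throughput parameters
    • Transaction type parameters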

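    Load Data

    Use the runDatabaseBuild.sh script, with the configuration file as its argument, to generate and load the test data:

    ./runDatabaseBuild.sh sample.postgresql.properties

    Warm Up the Data

    A warm-up run is usually performed before the formal test:

    ./runBenchmark.sh sample.postgresql.properties

    Formal Test

    After the warm-up finishes, run the same command again for the formal test:

    ./runBenchmark.sh sample.postgresql.properties

    View the Results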

                                              _____ latency (seconds) _____
      TransType              count |   mix % |    mean       max     90th% |    rbk%          errors
    +--------------+---------------+---------+---------+---------+---------+---------+---------------+
    | NEW_ORDER    |           635 |  44.593 |   0.006 |   0.012 |   0.008 |   1.102 |             0 |
    | PAYMENT      |           628 |  44.101 |   0.001 |   0.006 |   0.002 |   0.000 |             0 |
    | ORDER_STATUS |            58 |   4.073 |   0.093 |   0.168 |   0.132 |   0.000 |             0 |
    | STOCK_LEVEL  |            52 |   3.652 |   0.035 |   0.044 |   0.041 |   0.000 |             0 |
    | DELIVERY     |            51 |   3.581 |   0.000 |   0.001 |   0.001 |   0.000 |             0 |
    | DELIVERY_BG  |            51 |   0.000 |   0.018 |   0.023 |   0.020 |   0.000 |             0 |
    +--------------+---------------+---------+---------+---------+---------+---------+---------------+

    Overall NOPM:          635 (98.76% of the theoretical maximum)
    Overall TPM:         1,424

    The results are also saved in CSV form; the directory where they are stored can be found in the output log.
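The mix % column in the sample output appears to be each transaction type's share of the total transaction count, with the background DELIVERY_BG row left out. The sketch below recomputes the column from the counts under that assumption. It is a reading of this particular output, not a description of how BenchmarkSQL computes the figures internally.

```python
from typing import Dict

# Counts copied from the sample output; DELIVERY_BG is excluded because its
# mix % is reported as 0.000.
counts = {
    "NEW_ORDER": 635,
    "PAYMENT": 628,
    "ORDER_STATUS": 58,
    "STOCK_LEVEL": 52,
    "DELIVERY": 51,
}


def mix_percent(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each transaction type's share of the total, in percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 3) for name, n in counts.items()}


print(sum(counts.values()))
for name, pct in mix_percent(counts).items():
    print(f"{name:<13}{pct:>8.3f}")
```

The counts sum to 1,424, which matches the reported Overall TPM, and the NEW_ORDER count matches the reported NOPM. That would be consistent with a one-minute measurement, although the run length is not stated in the output.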

    diff --git a/zh/operation/tpch-test.html b/zh/operation/tpch-test.html
    new file mode 100644
    index 00000000000..88901da22c1
    --- /dev/null
    +++ b/zh/operation/tpch-test.html
    @@ -0,0 +1,234 @@

    TPC-H Test

    棠羽

    2023/04/12

    20 min

    This article guides you through running a TPC-H test against PolarDB for PostgreSQL.

    Background

    TPC-H is a benchmark dataset designed specifically to test database performance in analytical scenarios.

    Test Preparation

    Deploy PolarDB-PG

    Use Docker to quickly bring up a PolarDB for PostgreSQL cluster based on local storage:

    docker pull polardb/polardb_pg_local_instance
    docker run -it \
        --cap-add=SYS_PTRACE \
        --privileged=true \
        --name polardb_pg_htap \
        --shm-size=512m \
        polardb/polardb_pg_local_instance \
        bash

    Alternatively, refer to Advanced Deployment to deploy a PolarDB for PostgreSQL cluster based on shared storage.

    Generate the TPC-H Test Dataset

    Use the tpch-dbgen tool to generate the test data.

    $ git clone https://github.com/ApsaraDB/tpch-dbgen.git
    $ cd tpch-dbgen
    $ ./build.sh --help

      1) Use default configuration to build
      ./build.sh
      2) Use limited configuration to build
      ./build.sh --user=postgres --db=postgres --host=localhost --port=5432 --scale=1
      3) Run the test case
      ./build.sh --run
      4) Run the target test case
      ./build.sh --run=3. run the 3rd case.
      5) Run the target test case with option
      ./build.sh --run --option="set polar_enable_px = on;"
      6) Clean the test data. This step will drop the database or tables, remove csv
      and tbl files
      ./build.sh --clean
      7) Quick build TPC-H with 100MB scale of data
      ./build.sh --scale=0.1

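    By setting different parameters, you can create TPC-H datasets of different sizes. The parameters of the build.sh script have the following meanings:

    • --user: database user name
    • --db: database name
    • --host: database host address
    • --port: database service port
    • --run: run all TPC-H queries, or one specific TPC-H query
    • --option: additional GUC parameters to set
    • --scale: size of the TPC-H dataset to generate, in GB

    The script has no parameter for entering the database password. Set PGPASSWORD to the database user's password to authenticate:

    export PGPASSWORD=<your password>

    Generate and load 100MB of TPC-H data:

    ./build.sh --scale=0.1

    Generate and load 1GB of TPC-H data:

    ./build.sh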
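When the dataset is built repeatedly against different targets, assembling the command line from parameters avoids typing mistakes. The sketch below uses only the options shown in the help output above (--user, --db, --host, --port, --scale); the function name is hypothetical, and the defaults mirror the values in that help text.

```python
import shlex


def build_command(user: str = "postgres", db: str = "postgres",
                  host: str = "localhost", port: int = 5432,
                  scale: float = 1) -> str:
    """Assemble a ./build.sh invocation; scale is the dataset size in GB."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    args = [
        "./build.sh",
        f"--user={user}",
        f"--db={db}",
        f"--host={host}",
        f"--port={port}",
        f"--scale={scale}",
    ]
    # shlex.quote protects values that contain spaces or shell metacharacters.
    return " ".join(shlex.quote(arg) for arg in args)


print(build_command(scale=0.1))
```

The password is deliberately not part of the command; it is still expected to come from the PGPASSWORD environment variable.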

    Run PostgreSQL Single-Node Parallel Execution

    Taking Q18 of TPC-H as an example, run the query with PostgreSQL's single-node parallel query and observe the query speed.

    In the tpch-dbgen/ directory, connect to the database with psql:

    cd tpch-dbgen
    psql

    -- Turn on timing
    \timing on

    -- Set the single-node degree of parallelism
    SET max_parallel_workers_per_gather = 2;

    -- Show the execution plan of Q18
    \i finals/18.explain.sql
                                                                             QUERY PLAN
    ------------------------------------------------------------------------------------------------------------------------------------------------------------
     Sort  (cost=3450834.75..3450835.42 rows=268 width=81)
       Sort Key: orders.o_totalprice DESC, orders.o_orderdate
       ->  GroupAggregate  (cost=3450817.91..3450823.94 rows=268 width=81)
             Group Key: customer.c_custkey, orders.o_orderkey
             ->  Sort  (cost=3450817.91..3450818.58 rows=268 width=67)
                   Sort Key: customer.c_custkey, orders.o_orderkey
                   ->  Hash Join  (cost=1501454.20..3450807.10 rows=268 width=67)
                         Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
                         ->  Seq Scan on lineitem  (cost=0.00..1724402.52 rows=59986052 width=22)
                         ->  Hash  (cost=1501453.37..1501453.37 rows=67 width=53)
                               ->  Nested Loop  (cost=1500465.85..1501453.37 rows=67 width=53)
                                     ->  Nested Loop  (cost=1500465.43..1501084.65 rows=67 width=34)
                                           ->  Finalize GroupAggregate  (cost=1500464.99..1500517.66 rows=67 width=4)
                                                 Group Key: lineitem_1.l_orderkey
                                                 Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
                                                 ->  Gather Merge  (cost=1500464.99..1500511.66 rows=400 width=36)
                                                       Workers Planned: 2
                                                       ->  Sort  (cost=1499464.97..1499465.47 rows=200 width=36)
                                                             Sort Key: lineitem_1.l_orderkey
                                                             ->  Partial HashAggregate  (cost=1499454.82..1499457.32 rows=200 width=36)
                                                                   Group Key: lineitem_1.l_orderkey
                                                                   ->  Parallel Seq Scan on lineitem lineitem_1  (cost=0.00..1374483.88 rows=24994188 width=22)
                                           ->  Index Scan using orders_pkey on orders  (cost=0.43..8.45 rows=1 width=30)
                                                 Index Cond: (o_orderkey = lineitem_1.l_orderkey)
                                     ->  Index Scan using customer_pkey on customer  (cost=0.43..5.50 rows=1 width=23)
                                           Index Cond: (c_custkey = orders.o_custkey)
    (26 rows)

    Time: 3.965 ms

    -- Run Q18
    \i finals/18.sql
           c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    --------------------+-----------+------------+-------------+--------------+--------
     Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
     Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
     ...
     Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
     Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    (84 rows)

    Time: 80150.449 ms (01:20.150)

    Run ePQ Single-Node Parallel Execution

    PolarDB for PostgreSQL provides elastic cross-node parallel query (ePQ), which is well suited to analytical queries. The following steps guide you through running TPC-H queries in parallel with ePQ on a single host.

    In the tpch-dbgen/ directory, connect to the database with psql:

    cd tpch-dbgen
    psql

    First, set the maximum ePQ query parallelism on the eight tables created by TPC-H:

    ALTER TABLE nation SET (px_workers = 100);
    ALTER TABLE region SET (px_workers = 100);
    ALTER TABLE supplier SET (px_workers = 100);
    ALTER TABLE part SET (px_workers = 100);
    ALTER TABLE partsupp SET (px_workers = 100);
    ALTER TABLE customer SET (px_workers = 100);
    ALTER TABLE orders SET (px_workers = 100);
    ALTER TABLE lineitem SET (px_workers = 100);
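The eight statements follow one pattern, so they can be generated rather than typed by hand, for example when trying a different px_workers value. This is a small illustrative sketch; the function name is hypothetical, and the table list is the one used in the statements above.

```python
from typing import List

# The eight TPC-H tables, in the order used in the statements above.
TPCH_TABLES = [
    "nation", "region", "supplier", "part",
    "partsupp", "customer", "orders", "lineitem",
]


def px_workers_statements(px_workers: int = 100) -> List[str]:
    """Build one ALTER TABLE ... SET (px_workers = N) statement per table."""
    return [
        f"ALTER TABLE {table} SET (px_workers = {int(px_workers)});"
        for table in TPCH_TABLES
    ]


print("\n".join(px_workers_statements()))
```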

    Taking Q18 as an example, run the query:

    -- Turn on timing
    \timing on

    -- Enable the ePQ feature
    SET polar_enable_px = ON;
    -- Set the ePQ degree of parallelism per node to 1
    SET polar_px_dop_per_node = 1;

    -- Show the execution plan of Q18
    \i finals/18.explain.sql
                                                                              QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------
     PX Coordinator 2:1  (slice1; segments: 2)  (cost=0.00..257526.21 rows=59986052 width=47)
       Merge Key: orders.o_totalprice, orders.o_orderdate
       ->  GroupAggregate  (cost=0.00..243457.68 rows=29993026 width=47)
             Group Key: orders.o_totalprice, orders.o_orderdate, customer.c_name, customer.c_custkey, orders.o_orderkey
             ->  Sort  (cost=0.00..241257.18 rows=29993026 width=47)
                   Sort Key: orders.o_totalprice DESC, orders.o_orderdate, customer.c_name, customer.c_custkey, orders.o_orderkey
                   ->  Hash Join  (cost=0.00..42729.99 rows=29993026 width=47)
                         Hash Cond: (orders.o_orderkey = lineitem_1.l_orderkey)
                         ->  PX Hash 2:2  (slice2; segments: 2)  (cost=0.00..15959.71 rows=7500000 width=39)
                               Hash Key: orders.o_orderkey
                               ->  Hash Join  (cost=0.00..15044.19 rows=7500000 width=39)
                                     Hash Cond: (orders.o_custkey = customer.c_custkey)
                                     ->  PX Hash 2:2  (slice3; segments: 2)  (cost=0.00..11561.51 rows=7500000 width=20)
                                           Hash Key: orders.o_custkey
                                           ->  Hash Semi Join  (cost=0.00..11092.01 rows=7500000 width=20)
                                                 Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
                                                 ->  Partial Seq Scan on orders  (cost=0.00..1132.25 rows=7500000 width=20)
                                                 ->  Hash  (cost=7760.84..7760.84 rows=400 width=4)
                                                       ->  PX Broadcast 2:2  (slice4; segments: 2)  (cost=0.00..7760.84 rows=400 width=4)
                                                             ->  Result  (cost=0.00..7760.80 rows=200 width=4)
                                                                   Filter: ((sum(lineitem.l_quantity)) > '314'::numeric)
                                                                   ->  Finalize HashAggregate  (cost=0.00..7760.78 rows=500 width=12)
                                                                         Group Key: lineitem.l_orderkey
                                                                         ->  PX Hash 2:2  (slice5; segments: 2)  (cost=0.00..7760.72 rows=500 width=12)
                                                                               Hash Key: lineitem.l_orderkey
                                                                               ->  Partial HashAggregate  (cost=0.00..7760.70 rows=500 width=12)
                                                                                     Group Key: lineitem.l_orderkey
                                                                                     ->  Partial Seq Scan on lineitem  (cost=0.00..3350.82 rows=29993026 width=12)
                                     ->  Hash  (cost=597.51..597.51 rows=749979 width=23)
                                           ->  PX Hash 2:2  (slice6; segments: 2)  (cost=0.00..597.51 rows=749979 width=23)
                                                 Hash Key: customer.c_custkey
                                                 ->  Partial Seq Scan on customer  (cost=0.00..511.44 rows=749979 width=23)
                         ->  Hash  (cost=5146.80..5146.80 rows=29993026 width=12)
                               ->  PX Hash 2:2  (slice7; segments: 2)  (cost=0.00..5146.80 rows=29993026 width=12)
                                     Hash Key: lineitem_1.l_orderkey
                                     ->  Partial Seq Scan on lineitem lineitem_1  (cost=0.00..3350.82 rows=29993026 width=12)
     Optimizer: PolarDB PX Optimizer
    (37 rows)

    Time: 216.672 ms

    -- Run Q18
    \i finals/18.sql
           c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    --------------------+-----------+------------+-------------+--------------+--------
     Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
     Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
     ...
     Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
     Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    (84 rows)

    Time: 59113.965 ms (00:59.114)

    可以看到比 PostgreSQL 的单机并行执行的时间略短。加大 ePQ 功能的节点并行度,查询性能将会有更明显的提升:

    SET polar_px_dop_per_node = 2;
    +\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 42400.500 ms (00:42.401)
    +
    +SET polar_px_dop_per_node = 4;
    +\i finals/18.sql
    +
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 19892.603 ms (00:19.893)
    +
    +SET polar_px_dop_per_node = 8;
    +\i finals/18.sql
    +       c_name       | c_custkey | o_orderkey | o_orderdate | o_totalprice |  sum
    +--------------------+-----------+------------+-------------+--------------+--------
    + Customer#001287812 |   1287812 |   42290181 | 1997-11-26  |    558289.17 | 318.00
    + Customer#001172513 |   1172513 |   36667107 | 1997-06-06  |    550142.18 | 322.00
    + ...
    + Customer#001288183 |   1288183 |   48943904 | 1996-07-22  |    398081.59 | 325.00
    + Customer#000114613 |    114613 |   59930883 | 1997-05-17  |    394335.49 | 319.00
    +(84 rows)
    +
    +Time: 10944.402 ms (00:10.944)
    +
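
    As a rough check of how the elapsed time scales, the timings above can be turned into speedup ratios. This is only a sketch: the numbers are copied from the "Time:" lines of the psql output, the label "baseline" stands for the first ePQ run (this section does not show which polar_px_dop_per_node value it used), and the function name speedups is ours.

```python
# Elapsed times copied from the psql "Time:" lines above, in milliseconds.
# "baseline" is the first ePQ run; its polar_px_dop_per_node value is not
# shown in this section, so the label is an assumption of this sketch.
timings_ms = {
    "baseline": 59113.965,
    "dop=2": 42400.500,
    "dop=4": 19892.603,
    "dop=8": 10944.402,
}


def speedups(timings, reference="baseline"):
    """Return elapsed-time ratios relative to the reference run."""
    ref = timings[reference]
    return {name: ref / t for name, t in timings.items()}


for name, ratio in speedups(timings_ms).items():
    print(f"{name:>8}: {timings_ms[name] / 1000:7.2f} s  speedup x{ratio:.2f}")
```

    On these measurements, raising polar_px_dop_per_node from 2 to 8 cuts the elapsed time by a factor of about 3.9.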

    Running Q17 and Q18 with ePQ may trigger an out-of-memory (OOM) error. Set the following parameter to avoid exhausting memory:

    SET polar_px_optimizer_enable_hashagg = 0;
    +

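    Running ePQ Across Machines

    In the examples above, for simplicity, the compute nodes of PolarDB for PostgreSQL were all deployed on a single host. Using ePQ in this setup is essentially parallel execution on one host, because every compute node shares that host's CPU, memory, and I/O bandwidth. In practice, the compute nodes can be deployed on multiple machines that share the same storage nodes. ePQ then performs truly distributed, cross-machine parallel queries and can make full use of the computing resources of all the machines.

    See Advanced Deployment to set up PolarDB for PostgreSQL clusters in different topologies. Once the cluster is up, ePQ is used exactly as in the single-host case.

    If you encounter an error like the following: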

    psql:queries/q01.analyze.sq1:24: WARNING:  interconnect may encountered a network error, please check your network
    +DETAIL:  Failed to send packet (seq 1) to 192.168.1.8:57871 (pid 17766 cid 0) after 100 retries.
    +

    try setting the MTU of every machine to 9000:

    ifconfig <interface-name> mtu 9000
    +
    + + + diff --git a/zh/roadmap/index.html b/zh/roadmap/index.html new file mode 100644 index 00000000000..05714416569 --- /dev/null +++ b/zh/roadmap/index.html @@ -0,0 +1,33 @@ + + + + + + + + + Roadmap | PolarDB for PostgreSQL + + + + +

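    Roadmap

    PolarDB PostgreSQL will keep releasing features that are valuable to users. We currently plan five phases:

    PolarDB PostgreSQL 1.0

    Version 1.0 is built on a Shared-Storage architecture that separates compute from storage, and ships the minimum required feature set, for example PolarVFS, dirty-page flushing and buffer management, LogIndex, and SyncDDL.

    • PolarVFS: a VFS layer is abstracted inside the database kernel so that the kernel can work with any storage, using either buffered I/O or direct I/O.
    • Dirty-page flushing and buffer management: the architecture changes from N compute nodes with N copies of storage to N compute nodes with one copy of storage. The primary node has to coordinate when it flushes dirty pages so that read-only nodes never read "future pages" that are ahead of them.
    • LogIndex: because read-only nodes cannot flush dirty pages, a read-only node that needs a specific version of a page reads an older version from Shared-Storage and replays WAL on it in memory to obtain the correct version. LogIndex records the metadata of the WAL records that apply to each page, so replay can look them up directly, which speeds it up.
    • DDL synchronization: with compute and storage separated, the primary node must take into account the references that read-only nodes hold to objects such as relations when it runs DDL; the corresponding locks have to be taken synchronously on the read-only nodes.
    • Database monitoring: monitors both hosts and databases, and provides the basis for HA switchover decisions.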

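    PolarDB PostgreSQL 2.0

    Besides changes to the compute-storage separation architecture, version 2.0 will deeply optimize the optimizer, for example:

    • UniqueKey: similar to the ordering property of plan nodes, UniqueKey tracks the uniqueness of the data produced by a plan node. Knowing that data is unique makes it possible to drop unnecessary DISTINCT and GROUP BY operations, to improve the ordering checks on join results, and so on.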

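    PolarDB PostgreSQL 3.0

    Version 3.0 focuses on major availability improvements for the compute-storage separation architecture, for example:

    • Parallel replay: after separating compute from storage, PolarDB uses LogIndex to implement lazy replay. It only marks which WAL records should be replayed on a page, and performs the actual replay when a process reads the page. This affects read performance. Version 3.0 adds parallel replay on top of lazy replay to speed up page replay.
    • OnlinePromote: after the primary node crashes, any read-only node can take over. That node does not need to restart; it finishes replaying all WAL in parallel and is then promoted to the new primary, which further reduces downtime.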

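    PolarDB PostgreSQL 4.0

    To meet the growing demand for HTAP mixed workloads, version 4.0 will ship a distributed parallel execution engine based on the Shared-Storage architecture that makes full use of the CPU, memory, and I/O resources of multiple read-only nodes.

    In our tests, performance continued to scale linearly as the compute cluster grew to 256 cores.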

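    PolarDB PostgreSQL 5.0

    In the one-writer, multiple-reader architecture based on compute-storage separation, read capacity scales elastically, but writes can still only be executed on a single node.

    Version 5.0 will ship a Shared-Nothing On Share-Everything architecture that combines the architectural strengths of the distributed and centralized editions of PolarDB so that multiple nodes can write.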

    + + + diff --git a/zh/theory/analyze.html b/zh/theory/analyze.html new file mode 100644 index 00000000000..e1cbfd2eb6d --- /dev/null +++ b/zh/theory/analyze.html @@ -0,0 +1,639 @@ + + + + + + + + + ANALYZE Source Code Walkthrough | PolarDB for PostgreSQL + + + + +

    ANALYZE Source Code Walkthrough

    棠羽

    2022/06/20

    15 min

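    Background

    PostgreSQL's optimizer turns a query tree into the physical plan tree it expects to execute most efficiently. Efficiency is measured by cost estimation. For example, estimating the number of tuples a query returns and their width gives the I/O cost, and the CPU cost can be estimated from the physical operations to be performed. The optimizer obtains the key statistics needed for cost estimation from the pg_statistic system catalog, and those statistics are computed by automatic or manual ANALYZE operations (including VACUUM with the ANALYZE option). ANALYZE scans the table's data, analyzes it column by column, and writes statistics such as each column's data distribution, most common values, and their frequencies into the catalog.

    This article walks through the implementation of ANALYZE at the source level. It uses PostgreSQL 14, the latest stable release at the time of writing.

    Statistics

    First we should be clear about what the analysis produces, so let's look at the columns of pg_statistic and what each one means. Each row of this catalog holds the statistics for one column of some other table.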

    postgres=# \d+ pg_statistic
    +                                 Table "pg_catalog.pg_statistic"
    +   Column    |   Type   | Collation | Nullable | Default | Storage  | Stats target | Description
    +-------------+----------+-----------+----------+---------+----------+--------------+-------------
    + starelid    | oid      |           | not null |         | plain    |              |
    + staattnum   | smallint |           | not null |         | plain    |              |
    + stainherit  | boolean  |           | not null |         | plain    |              |
    + stanullfrac | real     |           | not null |         | plain    |              |
    + stawidth    | integer  |           | not null |         | plain    |              |
    + stadistinct | real     |           | not null |         | plain    |              |
    + stakind1    | smallint |           | not null |         | plain    |              |
    + stakind2    | smallint |           | not null |         | plain    |              |
    + stakind3    | smallint |           | not null |         | plain    |              |
    + stakind4    | smallint |           | not null |         | plain    |              |
    + stakind5    | smallint |           | not null |         | plain    |              |
    + staop1      | oid      |           | not null |         | plain    |              |
    + staop2      | oid      |           | not null |         | plain    |              |
    + staop3      | oid      |           | not null |         | plain    |              |
    + staop4      | oid      |           | not null |         | plain    |              |
    + staop5      | oid      |           | not null |         | plain    |              |
    + stanumbers1 | real[]   |           |          |         | extended |              |
    + stanumbers2 | real[]   |           |          |         | extended |              |
    + stanumbers3 | real[]   |           |          |         | extended |              |
    + stanumbers4 | real[]   |           |          |         | extended |              |
    + stanumbers5 | real[]   |           |          |         | extended |              |
    + stavalues1  | anyarray |           |          |         | extended |              |
    + stavalues2  | anyarray |           |          |         | extended |              |
    + stavalues3  | anyarray |           |          |         | extended |              |
    + stavalues4  | anyarray |           |          |         | extended |              |
    + stavalues5  | anyarray |           |          |         | extended |              |
    +Indexes:
    +    "pg_statistic_relid_att_inh_index" UNIQUE, btree (starelid, staattnum, stainherit)
    +
    /* ----------------
    + *      pg_statistic definition.  cpp turns this into
    + *      typedef struct FormData_pg_statistic
    + * ----------------
    + */
    +CATALOG(pg_statistic,2619,StatisticRelationId)
    +{
    +    /* These fields form the unique key for the entry: */
    +    Oid         starelid BKI_LOOKUP(pg_class);  /* relation containing
    +                                                 * attribute */
    +    int16       staattnum;      /* attribute (column) stats are for */
    +    bool        stainherit;     /* true if inheritance children are included */
    +
    +    /* the fraction of the column's entries that are NULL: */
    +    float4      stanullfrac;
    +
    +    /*
    +     * stawidth is the average width in bytes of non-null entries.  For
    +     * fixed-width datatypes this is of course the same as the typlen, but for
    +     * var-width types it is more useful.  Note that this is the average width
    +     * of the data as actually stored, post-TOASTing (eg, for a
    +     * moved-out-of-line value, only the size of the pointer object is
    +     * counted).  This is the appropriate definition for the primary use of
    +     * the statistic, which is to estimate sizes of in-memory hash tables of
    +     * tuples.
    +     */
    +    int32       stawidth;
    +
    +    /* ----------------
    +     * stadistinct indicates the (approximate) number of distinct non-null
    +     * data values in the column.  The interpretation is:
    +     *      0       unknown or not computed
    +     *      > 0     actual number of distinct values
    +     *      < 0     negative of multiplier for number of rows
    +     * The special negative case allows us to cope with columns that are
    +     * unique (stadistinct = -1) or nearly so (for example, a column in which
    +     * non-null values appear about twice on the average could be represented
    +     * by stadistinct = -0.5 if there are no nulls, or -0.4 if 20% of the
    +     * column is nulls).  Because the number-of-rows statistic in pg_class may
    +     * be updated more frequently than pg_statistic is, it's important to be
    +     * able to describe such situations as a multiple of the number of rows,
    +     * rather than a fixed number of distinct values.  But in other cases a
    +     * fixed number is correct (eg, a boolean column).
    +     * ----------------
    +     */
    +    float4      stadistinct;
    +
    +    /* ----------------
    +     * To allow keeping statistics on different kinds of datatypes,
    +     * we do not hard-wire any particular meaning for the remaining
    +     * statistical fields.  Instead, we provide several "slots" in which
    +     * statistical data can be placed.  Each slot includes:
    +     *      kind            integer code identifying kind of data (see below)
    +     *      op              OID of associated operator, if needed
    +     *      coll            OID of relevant collation, or 0 if none
    +     *      numbers         float4 array (for statistical values)
    +     *      values          anyarray (for representations of data values)
    +     * The ID, operator, and collation fields are never NULL; they are zeroes
    +     * in an unused slot.  The numbers and values fields are NULL in an
    +     * unused slot, and might also be NULL in a used slot if the slot kind
    +     * has no need for one or the other.
    +     * ----------------
    +     */
    +
    +    int16       stakind1;
    +    int16       stakind2;
    +    int16       stakind3;
    +    int16       stakind4;
    +    int16       stakind5;
    +
    +    Oid         staop1 BKI_LOOKUP_OPT(pg_operator);
    +    Oid         staop2 BKI_LOOKUP_OPT(pg_operator);
    +    Oid         staop3 BKI_LOOKUP_OPT(pg_operator);
    +    Oid         staop4 BKI_LOOKUP_OPT(pg_operator);
    +    Oid         staop5 BKI_LOOKUP_OPT(pg_operator);
    +
    +    Oid         stacoll1 BKI_LOOKUP_OPT(pg_collation);
    +    Oid         stacoll2 BKI_LOOKUP_OPT(pg_collation);
    +    Oid         stacoll3 BKI_LOOKUP_OPT(pg_collation);
    +    Oid         stacoll4 BKI_LOOKUP_OPT(pg_collation);
    +    Oid         stacoll5 BKI_LOOKUP_OPT(pg_collation);
    +
    +#ifdef CATALOG_VARLEN           /* variable-length fields start here */
    +    float4      stanumbers1[1];
    +    float4      stanumbers2[1];
    +    float4      stanumbers3[1];
    +    float4      stanumbers4[1];
    +    float4      stanumbers5[1];
    +
    +    /*
    +     * Values in these arrays are values of the column's data type, or of some
    +     * related type such as an array element type.  We presently have to cheat
    +     * quite a bit to allow polymorphic arrays of this kind, but perhaps
    +     * someday it'll be a less bogus facility.
    +     */
    +    anyarray    stavalues1;
    +    anyarray    stavalues2;
    +    anyarray    stavalues3;
    +    anyarray    stavalues4;
    +    anyarray    stavalues5;
    +#endif
    +} FormData_pg_statistic;
    +

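    The statistics look the same from the database command line and from the kernel's C code, except that the \d+ output above does not list the stacollN collation columns that appear in the C definition. All attribute names start with sta. Among them:

    • starelid: the table or index that the column belongs to
    • staattnum: the ordinal number of the column within that table or index
    • stainherit: whether the statistics include inheritance children
    • stanullfrac: the fraction of the column's entries that are NULL
    • stawidth: the average width, in bytes, of the column's non-null entries
    • stadistinct: the (approximate) number of distinct non-null values in the column
      • 0: unknown or not computed
      • > 0: the actual number of distinct values
      • < 0: the negative of a multiplier for the number of rows; for example, -1 means every row is distinct, and -0.5 means the number of distinct values is half the row count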
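
    The three cases of stadistinct can be folded into one small helper. This is a sketch following the kernel comment above, not kernel code: the function name estimated_distinct is ours, and reltuples stands for the row-count estimate kept in pg_class.

```python
def estimated_distinct(stadistinct, reltuples):
    """Turn a stadistinct value into an estimated distinct-value count.

    Returns None when stadistinct is 0 (unknown or not computed).
    """
    if stadistinct == 0:
        return None
    if stadistinct > 0:
        # Positive: an absolute number of distinct values.
        return stadistinct
    # Negative: the magnitude is a multiplier of the current row count,
    # so the estimate keeps tracking the table as it grows.
    return -stadistinct * reltuples


print(estimated_distinct(-0.5, 1000))
```

    Storing a multiplier rather than a fixed count is what lets the estimate stay meaningful when the row count in pg_class is refreshed more often than pg_statistic.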

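    Because the statistics that can be computed differ slightly between data types, PostgreSQL reserves a number of slots for the remaining statistics. The kernel currently reserves five slots: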

    #define STATISTIC_NUM_SLOTS  5
    +

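    Each kind of statistic can occupy one slot, and what goes into the slot is entirely up to the definition of that kind of statistic. Each slot provides the following fields (N is the slot number, from 1 to 5):

    • stakindN: an integer code identifying the kind of statistic
    • staopN: the OID of the operator used to compute or use the statistic
    • stacollN: the OID of the collation
    • stanumbersN: an array of floating-point numbers
    • stavaluesN: an array of values of any type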
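
    One way to picture the slot layout is as five small records that are searched by kind. The sketch below is illustrative only: StatSlot and find_slot are names we made up, and the operator OIDs 1001 and 1002 are placeholders, not real catalog values.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

STATISTIC_NUM_SLOTS = 5
STATISTIC_KIND_MCV = 1
STATISTIC_KIND_HISTOGRAM = 2
STATISTIC_KIND_CORRELATION = 3


@dataclass
class StatSlot:
    kind: int = 0                          # stakindN; 0 marks an unused slot
    op: int = 0                            # staopN
    coll: int = 0                          # stacollN
    numbers: Optional[List[float]] = None  # stanumbersN
    values: Optional[List[Any]] = None     # stavaluesN


def find_slot(slots, kind):
    """Return the first slot holding the requested kind, or None."""
    for slot in slots:
        if slot.kind == kind:
            return slot
    return None


# Hypothetical pg_statistic row: MCV in slot 1, correlation in slot 2,
# the remaining three slots unused.
slots = [
    StatSlot(kind=STATISTIC_KIND_MCV, op=1001,
             numbers=[0.6, 0.3], values=["a", "b"]),
    StatSlot(kind=STATISTIC_KIND_CORRELATION, op=1002, numbers=[0.98]),
] + [StatSlot() for _ in range(STATISTIC_NUM_SLOTS - 2)]

print(find_slot(slots, STATISTIC_KIND_CORRELATION).numbers)
```

    Because a kind is not tied to a fixed slot number, a reader has to search the slots by kind rather than index them directly.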

    The PostgreSQL kernel reserves kind codes 1-99 for the core PostgreSQL statistics; the allocation of the remaining codes is described in this kernel comment:

    /*
    + * The present allocation of "kind" codes is:
    + *
    + *  1-99:       reserved for assignment by the core PostgreSQL project
    + *              (values in this range will be documented in this file)
    + *  100-199:    reserved for assignment by the PostGIS project
    + *              (values to be documented in PostGIS documentation)
    + *  200-299:    reserved for assignment by the ESRI ST_Geometry project
    + *              (values to be documented in ESRI ST_Geometry documentation)
    + *  300-9999:   reserved for future public assignments
    + *
    + * For private use you may choose a "kind" code at random in the range
    + * 10000-30000.  However, for code that is to be widely disseminated it is
    + * better to obtain a publicly defined "kind" code by request from the
    + * PostgreSQL Global Development Group.
    + */
    +
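
    The allocation in the comment above amounts to a simple range lookup. The helper below is a sketch with a made-up name (kind_code_owner); the ranges are taken from the comment, and 0 is treated as an unused slot as stated in the pg_statistic definition.

```python
def kind_code_owner(kind):
    """Map a stakind code to the party it is reserved for."""
    if kind == 0:
        return "unused slot"
    if 1 <= kind <= 99:
        return "PostgreSQL core"
    if 100 <= kind <= 199:
        return "PostGIS"
    if 200 <= kind <= 299:
        return "ESRI ST_Geometry"
    if 300 <= kind <= 9999:
        return "future public assignments"
    if 10000 <= kind <= 30000:
        return "private use"
    return "not covered by the comment"


print(kind_code_owner(7))
```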

    The kernel code currently defines seven core PostgreSQL statistic kinds, numbered 1 to 7. Let's see how each of them uses the slots described above.

    Most Common Values (MCV)

    /*
    + * In a "most common values" slot, staop is the OID of the "=" operator
    + * used to decide whether values are the same or not, and stacoll is the
    + * collation used (same as column's collation).  stavalues contains
    + * the K most common non-null values appearing in the column, and stanumbers
    + * contains their frequencies (fractions of total row count).  The values
    + * shall be ordered in decreasing frequency.  Note that since the arrays are
    + * variable-size, K may be chosen by the statistics collector.  Values should
    + * not appear in MCV unless they have been observed to occur more than once;
    + * a unique column will have no MCV slot.
    + */
    +#define STATISTIC_KIND_MCV  1
    +

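    For the most common values of a column, staop holds the = operator used to decide whether two values are the same. stavalues holds the K most common non-null values in the column, in order of decreasing frequency, and stanumbers holds the frequency of each of these K values as a fraction of the total row count.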
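
    To make the slot contents concrete, here is a minimal sketch that derives the two MCV arrays from a sample of column values. It follows the rules in the kernel comment (non-null values only, frequencies relative to all rows, values seen only once are excluded); the function name build_mcv and the sample data are ours, and the kernel's own selection logic is more involved.

```python
from collections import Counter


def build_mcv(sample, k):
    """Return (stavalues, stanumbers) for the K most common non-null values."""
    total = len(sample)  # NULL rows still count toward the denominator
    counts = Counter(v for v in sample if v is not None)
    # Keep values observed more than once, most frequent first.
    common = [(v, c) for v, c in counts.most_common() if c > 1][:k]
    stavalues = [v for v, _ in common]
    stanumbers = [c / total for _, c in common]
    return stavalues, stanumbers


sample = ["a", "b", "a", None, "c", "a", "b", "d"]
print(build_mcv(sample, 3))
```

    In this sample only "a" and "b" qualify, so the slot ends up shorter than the requested K.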

    Histogram

    /*
    + * A "histogram" slot describes the distribution of scalar data.  staop is
    + * the OID of the "<" operator that describes the sort ordering, and stacoll
    + * is the relevant collation.  (In theory more than one histogram could appear,
    + * if a datatype has more than one useful sort operator or we care about more
    + * than one collation.  Currently the collation will always be that of the
    + * underlying column.)  stavalues contains M (>=2) non-null values that
    + * divide the non-null column data values into M-1 bins of approximately equal
    + * population.  The first stavalues item is the MIN and the last is the MAX.
    + * stanumbers is not used and should be NULL.  IMPORTANT POINT: if an MCV
    + * slot is also provided, then the histogram describes the data distribution
    + * *after removing the values listed in MCV* (thus, it's a "compressed
    + * histogram" in the technical parlance).  This allows a more accurate
    + * representation of the distribution of a column with some very-common
    + * values.  In a column with only a few distinct values, it's possible that
    + * the MCV list describes the entire data population; in this case the
    + * histogram reduces to empty and should be omitted.
    + */
    +#define STATISTIC_KIND_HISTOGRAM  2
    +

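    A histogram of the data distribution of a scalar column. staop holds the < operator that defines the sort order. stavalues holds M non-null values that divide the column's non-null values into M - 1 bins of approximately equal population. If the column also has an MCV slot, the histogram excludes the values listed in the MCV, which gives a more accurate picture of the distribution.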
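
    The idea of an equal-population ("compressed") histogram can be sketched in a few lines. This is an illustration under simplifying assumptions, not the kernel's algorithm: build_histogram is a made-up name, and the bounds are simply picked at evenly spaced positions of the sorted sample after removing NULLs and MCV values.

```python
def build_histogram(sample, num_bounds, mcv_values=()):
    """Return up to num_bounds boundary values (M bounds => M - 1 bins)."""
    skip = set(mcv_values)
    data = sorted(v for v in sample if v is not None and v not in skip)
    if len(data) < 2 or num_bounds < 2:
        return []  # too little data left: the histogram is omitted
    num_bounds = min(num_bounds, len(data))
    last = len(data) - 1
    # Evenly spaced positions; the first bound is the MIN, the last the MAX.
    return [data[i * last // (num_bounds - 1)] for i in range(num_bounds)]


print(build_histogram(list(range(1, 101)), 5))
print(build_histogram([1, 1, 1, 2, 3, 4, 5], 3, mcv_values=[1]))
```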

    Correlation

    /*
    + * A "correlation" slot describes the correlation between the physical order
    + * of table tuples and the ordering of data values of this column, as seen
    + * by the "<" operator identified by staop with the collation identified by
    + * stacoll.  (As with the histogram, more than one entry could theoretically
    + * appear.)  stavalues is not used and should be NULL.  stanumbers contains
    + * a single entry, the correlation coefficient between the sequence of data
    + * values and the sequence of their actual tuple positions.  The coefficient
    + * ranges from +1 to -1.
    + */
    +#define STATISTIC_KIND_CORRELATION  3
    +

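    stanumbers holds a single entry: the correlation coefficient between the sequence of data values and the sequence of their physical tuple positions, ranging from -1 to +1.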
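
    A small sketch shows what such a coefficient measures. It computes the Pearson correlation between each tuple's physical position and the rank of its value in sorted order; the function name physical_correlation is ours, and this is not claimed to be the kernel's exact formula.

```python
def physical_correlation(values_in_physical_order):
    """Correlation between physical position and sorted rank, in [-1, 1]."""
    values = list(values_in_physical_order)
    n = len(values)
    if n < 2:
        return 0.0
    # rank[i] is the position value i would take in sorted order
    # (ties are broken by physical position because sorted() is stable).
    order = sorted(range(n), key=lambda i: values[i])
    rank = [0] * n
    for r, i in enumerate(order):
        rank[i] = r
    mean = (n - 1) / 2
    cov = sum((i - mean) * (rank[i] - mean) for i in range(n))
    # Positions and ranks are both permutations of 0..n-1, so they share
    # the same variance and the coefficient reduces to cov / var.
    var = sum((i - mean) ** 2 for i in range(n))
    return cov / var


print(physical_correlation([1, 2, 3, 4]))
print(physical_correlation([4, 3, 2, 1]))
```

    A value near +1 or -1 means the table is stored nearly in column order, which makes index range scans touch neighbouring pages.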

    Most Common Elements

    /*
    + * A "most common elements" slot is similar to a "most common values" slot,
    + * except that it stores the most common non-null *elements* of the column
    + * values.  This is useful when the column datatype is an array or some other
    + * type with identifiable elements (for instance, tsvector).  staop contains
    + * the equality operator appropriate to the element type, and stacoll
    + * contains the collation to use with it.  stavalues contains
    + * the most common element values, and stanumbers their frequencies.  Unlike
    + * MCV slots, frequencies are measured as the fraction of non-null rows the
    + * element value appears in, not the frequency of all rows.  Also unlike
    + * MCV slots, the values are sorted into the element type's default order
    + * (to support binary search for a particular value).  Since this puts the
    + * minimum and maximum frequencies at unpredictable spots in stanumbers,
    + * there are two extra members of stanumbers, holding copies of the minimum
    + * and maximum frequencies.  Optionally, there can be a third extra member,
    + * which holds the frequency of null elements (expressed in the same terms:
    + * the fraction of non-null rows that contain at least one null element).  If
    + * this member is omitted, the column is presumed to contain no null elements.
    + *
    + * Note: in current usage for tsvector columns, the stavalues elements are of
    + * type text, even though their representation within tsvector is not
    + * exactly text.
    + */
    +#define STATISTIC_KIND_MCELEM  4
    +

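    Similar to MCV, but it stores the most common elements of the column's values; it is mainly used for arrays and similar types. staop holds the equality operator for the element type. Unlike MCV, the denominator of each frequency is the number of non-null rows rather than all rows. The elements are also sorted in the default order of the element type so that a particular value can be found by binary search. Because this sorting scatters the minimum and maximum frequencies, stanumbers carries two extra members holding copies of them, optionally followed by the frequency of null elements.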
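
    The differences from MCV are easier to see in code. The sketch below builds the two arrays for an array-type column according to the kernel comment: each element counts once per row, frequencies are relative to non-null rows, the output is sorted by element, and the minimum and maximum frequencies are appended. The name build_mcelem and the sample rows are ours, and the optional third extra member is left out.

```python
from collections import Counter


def build_mcelem(rows, k):
    """Return (stavalues, stanumbers) for the K most common elements."""
    non_null = [row for row in rows if row is not None]
    counts = Counter()
    for row in non_null:
        # Count each element at most once per row.
        counts.update({e for e in row if e is not None})
    top = [e for e, _ in counts.most_common(k)]
    stavalues = sorted(top)  # element order, to allow binary search
    freqs = [counts[e] / len(non_null) for e in stavalues]
    if not freqs:
        return [], []
    # Two extra members: copies of the minimum and maximum frequency.
    return stavalues, freqs + [min(freqs), max(freqs)]


rows = [["a", "b"], ["a"], None, ["a", "c", "c"], ["b"]]
print(build_mcelem(rows, 2))
```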

    Distinct Elements Count Histogram

    /*
    + * A "distinct elements count histogram" slot describes the distribution of
    + * the number of distinct element values present in each row of an array-type
    + * column.  Only non-null rows are considered, and only non-null elements.
    + * staop contains the equality operator appropriate to the element type,
    + * and stacoll contains the collation to use with it.
    + * stavalues is not used and should be NULL.  The last member of stanumbers is
    + * the average count of distinct element values over all non-null rows.  The
    + * preceding M (>=2) members form a histogram that divides the population of
    + * distinct-elements counts into M-1 bins of approximately equal population.
    + * The first of these is the minimum observed count, and the last the maximum.
    + */
    +#define STATISTIC_KIND_DECHIST  5
    +

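    Describes the distribution of the number of distinct element values in each row of an array-type column. The first M members of stanumbers are bounds that divide these per-row distinct-element counts into M - 1 bins of approximately equal population, and the last member is the average number of distinct elements over all non-null rows. This statistic is presumably used for selectivity estimation.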
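
    Following the kernel comment above, a minimal sketch of this slot counts the distinct non-null elements of every non-null row, picks M evenly spaced bounds from the sorted counts, and appends the average. The name build_dechist and the sample rows are ours; the kernel's bound selection may differ in detail.

```python
def build_dechist(rows, num_bounds):
    """Return stanumbers: M histogram bounds followed by the average count."""
    counts = sorted(
        len({e for e in row if e is not None})
        for row in rows
        if row is not None
    )
    if not counts or num_bounds < 2:
        return []
    last = len(counts) - 1
    bounds = [float(counts[i * last // (num_bounds - 1)])
              for i in range(num_bounds)]
    return bounds + [sum(counts) / len(counts)]


rows = [[1, 1, 2], [3], None, [4, 5, 6, None], [7, 8]]
print(build_dechist(rows, 3))
```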

    Length Histogram

    /*
    + * A "length histogram" slot describes the distribution of range lengths in
    + * rows of a range-type column. stanumbers contains a single entry, the
    + * fraction of empty ranges. stavalues is a histogram of non-empty lengths, in
    + * a format similar to STATISTIC_KIND_HISTOGRAM: it contains M (>=2) range
    + * values that divide the column data values into M-1 bins of approximately
    + * equal population. The lengths are stored as float8s, as measured by the
    + * range type's subdiff function. Only non-null rows are considered.
    + */
    +#define STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM  6
    +

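    The length histogram describes the distribution of range lengths in a range-type column. stanumbers holds a single entry, the fraction of empty ranges, while stavalues holds a histogram of M bounds over the lengths of the non-empty ranges.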

    Bounds Histogram

    /*
    + * A "bounds histogram" slot is similar to STATISTIC_KIND_HISTOGRAM, but for
    + * a range-type column.  stavalues contains M (>=2) range values that divide
    + * the column data values into M-1 bins of approximately equal population.
    + * Unlike a regular scalar histogram, this is actually two histograms combined
    + * into a single array, with the lower bounds of each value forming a
    + * histogram of lower bounds, and the upper bounds a histogram of upper
    + * bounds.  Only non-NULL, non-empty ranges are included.
    + */
    +#define STATISTIC_KIND_BOUNDS_HISTOGRAM  7
    +

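    The bounds histogram is also used for range types and is similar to the data distribution histogram. stavalues holds M range values that divide the column's values into M - 1 bins of approximately equal population. Only non-NULL, non-empty ranges are included.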

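    Kernel Execution Flow

    Now that we know what pg_statistic ultimately has to store, let's see how the kernel collects and computes it, starting from the code that processes utility commands. For a utility statement such as ANALYZE, the switch statement in standard_ProcessUtility() routes each kind of statement to the function that implements it.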

    /*
    + * standard_ProcessUtility itself deals only with utility commands for
    + * which we do not provide event trigger support.  Commands that do have
    + * such support are passed down to ProcessUtilitySlow, which contains the
    + * necessary infrastructure for such triggers.
    + *
    + * This division is not just for performance: it's critical that the
    + * event trigger code not be invoked when doing START TRANSACTION for
    + * example, because we might need to refresh the event trigger cache,
    + * which requires being in a valid transaction.
    + */
    +void
    +standard_ProcessUtility(PlannedStmt *pstmt,
    +                        const char *queryString,
    +                        bool readOnlyTree,
    +                        ProcessUtilityContext context,
    +                        ParamListInfo params,
    +                        QueryEnvironment *queryEnv,
    +                        DestReceiver *dest,
    +                        QueryCompletion *qc)
    +{
    +    // ...
    +
    +    switch (nodeTag(parsetree))
    +    {
    +        // ...
    +
    +        case T_VacuumStmt:
    +            ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
    +            break;
    +
    +        // ...
    +    }
    +
    +    // ...
    +}
    +

    ANALYZE shares its entry point with VACUUM: both go through ExecVacuum().

    /*
    + * Primary entry point for manual VACUUM and ANALYZE commands
    + *
    + * This is mainly a preparation wrapper for the real operations that will
    + * happen in vacuum().
    + */
    +void
    +ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
    +{
    +    // ...
    +
    +    /* Now go through the common routine */
    +    vacuum(vacstmt->rels, &params, NULL, isTopLevel);
    +}
    +

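    After parsing a long list of options, execution enters vacuum(). Here the kernel first works out which tables to analyze, because the ANALYZE command can:

    • analyze every table in the database
    • analyze a few specific tables
    • analyze specific columns of a table

    Once the tables are known, each one is passed to analyze_rel() in turn: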

    if (params->options & VACOPT_ANALYZE)
    +{
    +    // ...
    +
    +    analyze_rel(vrel->oid, vrel->relation, params,
    +                vrel->va_cols, in_outer_xact, vac_strategy);
    +
    +    // ...
    +}
    +

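    Inside analyze_rel(), the kernel takes a ShareUpdateExclusiveLock on the table to be analyzed, which prevents two ANALYZE operations from running on it concurrently. It then chooses how to proceed based on the type of the table (for a foreign table, for example, it uses the ANALYZE support provided by the FDW routine). Finally, the table is passed to do_analyze_rel().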

    /*
    + *  analyze_rel() -- analyze one relation
    + *
    + * relid identifies the relation to analyze.  If relation is supplied, use
    + * the name therein for reporting any failure to open/lock the rel; do not
    + * use it once we've successfully opened the rel, since it might be stale.
    + */
    +void
    +analyze_rel(Oid relid, RangeVar *relation,
    +            VacuumParams *params, List *va_cols, bool in_outer_xact,
    +            BufferAccessStrategy bstrategy)
    +{
    +    // ...
    +
    +    /*
    +     * Do the normal non-recursive ANALYZE.  We can skip this for partitioned
    +     * tables, which don't contain any rows.
    +     */
    +    if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
    +        do_analyze_rel(onerel, params, va_cols, acquirefunc,
    +                       relpages, false, in_outer_xact, elevel);
    +
    +    // ...
    +}
    +

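    Inside do_analyze_rel(), the kernel narrows down which columns of the table to analyze: the user may have asked for only a few columns, since frequently accessed columns are the ones most worth analyzing. It also opens all indexes of the table to see whether they contain columns that can be analyzed.

    To obtain statistics for each column, we obviously have to read the column's data from disk and compute on it. This raises a key question: how many rows should be scanned? In theory, analyzing as much data as possible, ideally all of it, gives the most accurate statistics. For a very large table, however, the data does not fit in memory and the time the analysis would take is unacceptable. So the user can cap the number of rows to sample, trading statistics accuracy against run-time cost: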

    /*
    + * Determine how many rows we need to sample, using the worst case from
    + * all analyzable columns.  We use a lower bound of 100 rows to avoid
    + * possible overflow in Vitter's algorithm.  (Note: that will also be the
    + * target in the corner case where there are no analyzable columns.)
    + */
    +targrows = 100;
    +for (i = 0; i < attr_cnt; i++)
    +{
    +    if (targrows < vacattrstats[i]->minrows)
    +        targrows = vacattrstats[i]->minrows;
    +}
    +for (ind = 0; ind < nindexes; ind++)
    +{
    +    AnlIndexData *thisdata = &indexdata[ind];
    +
    +    for (i = 0; i < thisdata->attr_cnt; i++)
    +    {
    +        if (targrows < thisdata->vacattrstats[i]->minrows)
    +            targrows = thisdata->vacattrstats[i]->minrows;
    +    }
    +}
    +
    +/*
    + * Look at extended statistics objects too, as those may define custom
    + * statistics target. So we may need to sample more rows and then build
    + * the statistics with enough detail.
    + */
    +minrows = ComputeExtStatisticsRows(onerel, attr_cnt, vacattrstats);
    +
    +if (targrows < minrows)
    +    targrows = minrows;
    +
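
    The logic above is a running maximum with a floor of 100. The sketch below restates it; target_sample_rows is a made-up name, and the example figures assume that a scalar column's minrows is 300 times its statistics target, which comes from std_typanalyze rather than from the code shown here.

```python
def target_sample_rows(column_minrows, index_minrows=(), ext_stats_minrows=0):
    """Return the number of rows to sample: the worst case over all inputs.

    column_minrows:    minrows of every analyzable table column
    index_minrows:     one list of minrows per index
    ext_stats_minrows: requirement of extended statistics objects
    """
    targrows = 100  # lower bound, also used when nothing is analyzable
    for minrows in column_minrows:
        targrows = max(targrows, minrows)
    for per_index in index_minrows:
        for minrows in per_index:
            targrows = max(targrows, minrows)
    return max(targrows, ext_stats_minrows)


# Assumption: statistics targets of 100 and 10 give minrows of 30000 and 3000.
print(target_sample_rows([300 * 100, 300 * 10], index_minrows=[[300]]))
```

    Under that assumption a single column with a statistics target of 100 is enough to make ANALYZE sample 30000 rows, whatever the other columns ask for.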

    Once the number of rows to sample is known, the kernel allocates a tuple array of that length and starts sampling through the acquirefunc function pointer:

    /*
    + * Acquire the sample rows
    + */
    +rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
    +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
    +                             inh ? PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS_INH :
    +                             PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS);
    +if (inh)
    +    numrows = acquire_inherited_sample_rows(onerel, elevel,
    +                                            rows, targrows,
    +                                            &totalrows, &totaldeadrows);
    +else
    +    numrows = (*acquirefunc) (onerel, elevel,
    +                              rows, targrows,
    +                              &totalrows, &totaldeadrows);
    +

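    For an ordinary table, this function pointer is set in analyze_rel() to acquire_sample_rows(). The function samples the table in two stages:

    • Stage 1: randomly select up to targrows data blocks (or all blocks if the table has fewer)
    • Stage 2: scan those blocks and use the Vitter algorithm to draw a random sample of rows

    The two stages run simultaneously. When sampling finishes, the sampled tuples are in the tuple array. If the array was filled, it is sorted by tuple position with qsort, and the sample is then used to estimate the total numbers of live and dead tuples in the table: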

    /*
    + * acquire_sample_rows -- acquire a random sample of rows from the table
    + *
    + * Selected rows are returned in the caller-allocated array rows[], which
    + * must have at least targrows entries.
    + * The actual number of rows selected is returned as the function result.
    + * We also estimate the total numbers of live and dead rows in the table,
    + * and return them into *totalrows and *totaldeadrows, respectively.
    + *
    + * The returned list of tuples is in order by physical position in the table.
    + * (We will rely on this later to derive correlation estimates.)
    + *
    + * As of May 2004 we use a new two-stage method:  Stage one selects up
    + * to targrows random blocks (or all blocks, if there aren't so many).
    + * Stage two scans these blocks and uses the Vitter algorithm to create
    + * a random sample of targrows rows (or less, if there are less in the
    + * sample of blocks).  The two stages are executed simultaneously: each
    + * block is processed as soon as stage one returns its number and while
    + * the rows are read stage two controls which ones are to be inserted
    + * into the sample.
    + *
    + * Although every row has an equal chance of ending up in the final
    + * sample, this sampling method is not perfect: not every possible
    + * sample has an equal chance of being selected.  For large relations
    + * the number of different blocks represented by the sample tends to be
    + * too small.  We can live with that for now.  Improvements are welcome.
    + *
    + * An important property of this sampling method is that because we do
    + * look at a statistically unbiased set of blocks, we should get
    + * unbiased estimates of the average numbers of live and dead rows per
    + * block.  The previous sampling method put too much credence in the row
    + * density near the start of the table.
    + */
    +static int
    +acquire_sample_rows(Relation onerel, int elevel,
    +                    HeapTuple *rows, int targrows,
    +                    double *totalrows, double *totaldeadrows)
    +{
    +    // ...
    +
    +    /* Outer loop over blocks to sample */
    +    while (BlockSampler_HasMore(&bs))
    +    {
    +        bool        block_accepted;
    +        BlockNumber targblock = BlockSampler_Next(&bs);
    +        // ...
    +    }
    +
    +    // ...
    +
    +    /*
    +     * If we didn't find as many tuples as we wanted then we're done. No sort
    +     * is needed, since they're already in order.
    +     *
    +     * Otherwise we need to sort the collected tuples by position
    +     * (itempointer). It's not worth worrying about corner cases where the
    +     * tuples are already sorted.
    +     */
    +    if (numrows == targrows)
    +        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
    +
    +    /*
    +     * Estimate total numbers of live and dead rows in relation, extrapolating
    +     * on the assumption that the average tuple density in pages we didn't
    +     * scan is the same as in the pages we did scan.  Since what we scanned is
    +     * a random sample of the pages in the relation, this should be a good
    +     * assumption.
    +     */
    +    if (bs.m > 0)
    +    {
    +        *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
    +        *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
    +    }
    +    else
    +    {
    +        *totalrows = 0.0;
    +        *totaldeadrows = 0.0;
    +    }
    +
    +    // ...
    +}
    +
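
    The two-stage scheme can be sketched as follows. This is a simplified illustration, not the kernel's code: it uses the basic reservoir method (Algorithm R) as a stand-in for the Vitter algorithm used by the kernel, it treats every row as live, and the names sample_rows and blocks are ours.

```python
import math
import random


def sample_rows(blocks, targrows, rng):
    """Two-stage sample over a table given as a list of blocks of rows.

    Returns (rows in physical order, estimated total row count).
    """
    nblocks = len(blocks)
    # Stage 1: choose up to targrows random blocks, visited in block order.
    chosen = sorted(rng.sample(range(nblocks), min(targrows, nblocks)))
    reservoir = []  # entries are (block number, offset, row)
    seen = 0
    for blockno in chosen:
        for offset, row in enumerate(blocks[blockno]):
            seen += 1
            # Stage 2: reservoir sampling over the rows of the chosen blocks.
            if len(reservoir) < targrows:
                reservoir.append((blockno, offset, row))
            else:
                j = rng.randrange(seen)
                if j < targrows:
                    reservoir[j] = (blockno, offset, row)
    # Replacements break the physical order, so sort by position again.
    reservoir.sort(key=lambda entry: (entry[0], entry[1]))
    # Extrapolate the density of the scanned blocks to the whole table.
    if chosen:
        total = math.floor(seen / len(chosen) * nblocks + 0.5)
    else:
        total = 0
    return [row for _, _, row in reservoir], total


# Hypothetical table: 10 blocks of 5 rows, each row numbered by position.
blocks = [[b * 5 + i for i in range(5)] for b in range(10)]
rows, total = sample_rows(blocks, 8, random.Random(0))
print(len(rows), total)
```

    The final sort matters because the correlation statistic relies on the sample being in physical order.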

    Back in do_analyze_rel(): with the sample in hand, the kernel computes the statistics for each column to be analyzed and then updates the pg_statistic catalog:

    /*
     * Compute the statistics.  Temporary results during the calculations for
     * each column are stored in a child context.  The calc routines are
     * responsible to make sure that whatever they store into the VacAttrStats
     * structure is allocated in anl_context.
     */
    if (numrows > 0)
    {
        // ...

        for (i = 0; i < attr_cnt; i++)
        {
            VacAttrStats *stats = vacattrstats[i];
            AttributeOpts *aopt;

            stats->rows = rows;
            stats->tupDesc = onerel->rd_att;
            stats->compute_stats(stats,
                                 std_fetch_func,
                                 numrows,
                                 totalrows);

            // ...
        }

        // ...

        /*
         * Emit the completed stats rows into pg_statistic, replacing any
         * previous statistics for the target columns.  (If there are stats in
         * pg_statistic for columns we didn't process, we leave them alone.)
         */
        update_attstats(RelationGetRelid(onerel), inh,
                        attr_cnt, vacattrstats);

        // ...
    }

    Clearly, the function that the compute_stats pointer refers to differs from one column type to another, so it is worth looking at where this pointer is assigned:

    /*
     * std_typanalyze -- the default type-specific typanalyze function
     */
    bool
    std_typanalyze(VacAttrStats *stats)
    {
        // ...

        /*
         * Determine which standard statistics algorithm to use
         */
        if (OidIsValid(eqopr) && OidIsValid(ltopr))
        {
            /* Seems to be a scalar datatype */
            stats->compute_stats = compute_scalar_stats;
            /*--------------------
             * The following choice of minrows is based on the paper
             * "Random sampling for histogram construction: how much is enough?"
             * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
             * Proceedings of ACM SIGMOD International Conference on Management
             * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
             * says that for table size n, histogram size k, maximum relative
             * error in bin size f, and error probability gamma, the minimum
             * random sample size is
             *      r = 4 * k * ln(2*n/gamma) / f^2
             * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
             *      r = 305.82 * k
             * Note that because of the log function, the dependence on n is
             * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
             * bin size error with probability 0.99.  So there's no real need to
             * scale for n, which is a good thing because we don't necessarily
             * know it at this point.
             *--------------------
             */
            stats->minrows = 300 * attr->attstattarget;
        }
        else if (OidIsValid(eqopr))
        {
            /* We can still recognize distinct values */
            stats->compute_stats = compute_distinct_stats;
            /* Might as well use the same minrows as above */
            stats->minrows = 300 * attr->attstattarget;
        }
        else
        {
            /* Can't do much but the trivial stuff */
            stats->compute_stats = compute_trivial_stats;
            /* Might as well use the same minrows as above */
            stats->minrows = 300 * attr->attstattarget;
        }

        // ...
    }

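    This conditional can be read as follows:

    • If a column's data type supports the default = (eqopr: equals operator) and < (ltopr: less-than operator), the column is treated as a scalar type and can be analyzed with compute_scalar_stats()
    • If the column's data type supports only the = operator, compute_distinct_stats can still be used to analyze its distinct values
    • If neither is available, the column can only be given some simple analysis with compute_trivial_stats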
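    The three-way choice can be restated as a small stand-alone function. This is only an illustrative sketch: the enum and function names are hypothetical, and the boolean flags stand in for the OidIsValid(eqopr) and OidIsValid(ltopr) checks in std_typanalyze().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three analysis routines. */
typedef enum { STATS_TRIVIAL, STATS_DISTINCT, STATS_SCALAR } StatsKind;

/* Pick the analysis routine from the operators the data type provides. */
StatsKind
choose_stats_kind(bool has_eq_operator, bool has_lt_operator)
{
    if (has_eq_operator && has_lt_operator)
        return STATS_SCALAR;    /* "=" and "<": full scalar statistics */
    if (has_eq_operator)
        return STATS_DISTINCT;  /* only "=": MCVs and distinct counts */
    return STATS_TRIVIAL;       /* neither: null fraction and width only */
}

/* All three branches request 300 sample rows per unit of statistics target. */
int
min_sample_rows(int stats_target)
{
    assert(stats_target >= 0);
    return 300 * stats_target;
}
```

    Note that a type with "<" but no "=" still falls through to the trivial routine, because both branches above require the equality operator.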

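    We could look at what each of these three functions does, but I will not walk through the logic of each one in depth. The ideas behind them come from some very old statistics papers, old enough that the letters in the PDFs are barely legible. The code itself is not especially readable either, because it largely follows the formulas in those papers; without reading them, the meaning of the variables and formulas is hard to follow.

    compute_trivial_stats

    If a column's data type supports neither the equality operator nor the comparison operator, only some simple analysis is possible, such as:

    • The fraction of non-null rows
    • The average width of the column's values

    Both can be obtained easily by looping over the array of sampled tuples.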

    /*
     *  compute_trivial_stats() -- compute very basic column statistics
     *
     *  We use this when we cannot find a hash "=" operator for the datatype.
     *
     *  We determine the fraction of non-null rows and the average datum width.
     */
    static void
    compute_trivial_stats(VacAttrStatsP stats,
                          AnalyzeAttrFetchFunc fetchfunc,
                          int samplerows,
                          double totalrows)
    {}

    compute_distinct_stats

    If a column supports only the equality operator, we can tell what a value is but cannot order it against other values. The distribution of values over a range therefore cannot be analyzed; only the distribution of their frequencies can. The statistics computed by this function include:

    • The fraction of non-null rows
    • The average width of the column's values
    • The most common values (MCV)
    • The (estimated) number of distinct values

    /*
     *  compute_distinct_stats() -- compute column statistics including ndistinct
     *
     *  We use this when we can find only an "=" operator for the datatype.
     *
     *  We determine the fraction of non-null rows, the average width, the
     *  most common values, and the (estimated) number of distinct values.
     *
     *  The most common values are determined by brute force: we keep a list
     *  of previously seen values, ordered by number of times seen, as we scan
     *  the samples.  A newly seen value is inserted just after the last
     *  multiply-seen value, causing the bottommost (oldest) singly-seen value
     *  to drop off the list.  The accuracy of this method, and also its cost,
     *  depend mainly on the length of the list we are willing to keep.
     */
    static void
    compute_distinct_stats(VacAttrStatsP stats,
                           AnalyzeAttrFetchFunc fetchfunc,
                           int samplerows,
                           double totalrows)
    {}

    compute_scalar_stats

    If a column's data type supports both the equality operator and the comparison operator, the most detailed analysis is possible. The statistics computed include:

    • The fraction of non-null rows
    • The average width of the column's values
    • The most common values (MCV)
    • The (estimated) number of distinct values
    • The distribution histogram
    • The correlation of physical to logical order

    /*
     *  compute_scalar_stats() -- compute column statistics
     *
     *  We use this when we can find "=" and "<" operators for the datatype.
     *
     *  We determine the fraction of non-null rows, the average width, the
     *  most common values, the (estimated) number of distinct values, the
     *  distribution histogram, and the correlation of physical to logical order.
     *
     *  The desired stats can be determined fairly easily after sorting the
     *  data values into order.
     */
    static void
    compute_scalar_stats(VacAttrStatsP stats,
                         AnalyzeAttrFetchFunc fetchfunc,
                         int samplerows,
                         double totalrows)
    {}

    Summary

    Starting from the statistics that the PostgreSQL optimizer needs, this article walked through the overall execution flow of the ANALYZE command. For brevity, the walkthrough does not cover the various corner cases and their handling logic.

    References

    PostgreSQL 14 Documentation: ANALYZE

    PostgreSQL 14 Documentation: 25.1. Routine Vacuuming

    PostgreSQL 14 Documentation: 14.2. Statistics Used by the Planner

    PostgreSQL 14 Documentation: 52.49. pg_statistic

    阿里云数据库内核月报 (Alibaba Cloud Database Kernel Monthly) 2016/05: PostgreSQL 特性分析 统计信息计算方法 (PostgreSQL feature analysis: how statistics are computed)

    + + + diff --git a/zh/theory/arch-htap.html b/zh/theory/arch-htap.html new file mode 100644 index 00000000000..201bf236751 --- /dev/null +++ b/zh/theory/arch-htap.html @@ -0,0 +1,109 @@ + + + + + + + + + HTAP 架构详解 | PolarDB for PostgreSQL + + + + +

    HTAP Architecture in Detail

    严华

    2022/09/10

    35 min

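    Background

    Many PolarDB PG users need to run TP (Transactional Processing) and AP (Analytical Processing) workloads on the same database. They expect the database to handle highly concurrent TP requests during the day and to run AP reporting at night, when TP traffic drops and machines are idle. Even so, the resources of idle machines are still not fully used. The original PolarDB PG faced two major challenges when processing complex AP queries:

    • Under the native PostgreSQL execution engine, a single SQL statement can run on only one node. Whether it runs serially or in parallel on that node, it cannot use the CPU, memory, or other compute resources of other nodes; it can only scale up, not scale out.
    • PolarDB sits on a storage pool whose I/O throughput is, in theory, unbounded. Because a single SQL statement runs on only one node under the native PostgreSQL execution engine, it is limited by that node's CPU and memory and cannot take full advantage of the large I/O bandwidth on the storage side.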

    image.png

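    To address these pain points, PolarDB implements HTAP. The industry currently has three main approaches to HTAP:

    1. TP and AP fully separated in both storage and compute
      • Advantage: the two workloads do not affect each other
      • Disadvantages:
        • Timeliness: TP data must be imported into the AP system, which introduces some delay
        • Cost / operational difficulty: an additional, redundant AP system is required
    2. TP and AP fully shared in both storage and compute
      • Advantage: lowest cost and highest resource utilization
      • Disadvantages:
        • With shared compute, AP and TP queries that run at the same time interfere with each other to some degree
        • When compute nodes and their storage are scaled, data must be redistributed, so fast elastic scale-out is not possible
    3. TP and AP share storage but use separate compute
      • PolarDB's storage-compute separation architecture supports this approach naturally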

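    Principles

    Architecture Features

    Building on PolarDB's storage-compute separation architecture, we developed a distributed MPP execution engine that provides cross-node parallel execution and elastic compute scaling, giving PolarDB initial HTAP capability:

    1. Unified storage: millisecond-level data freshness
      • TP / AP share one copy of stored data, which reduces storage cost and improves query timeliness
    2. Physical isolation of TP / AP: no mutual interference on CPU / memory
      • Single-node execution engine: handles highly concurrent TP queries on RW / RO nodes
      • Distributed MPP execution engine: handles highly complex AP queries on RO nodes
    3. Serverless elastic scaling: any RO node can initiate an MPP query
      • Scale Out: elastically adjust the set of nodes that execute MPP
      • Scale Up: elastically adjust the per-node degree of parallelism of MPP
    4. Elimination of data skew and computation skew, with full consideration of PostgreSQL's Buffer Pool affinity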

    image.png

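    Distributed MPP Execution Engine

    The core of PolarDB HTAP is the distributed MPP execution engine, a typical Volcano-model engine. For example, two tables A and B are first joined and then aggregated for output, which is also the execution flow of PostgreSQL's single-node execution engine.

    image.png

    In a traditional MPP execution engine, data is scattered across nodes, and the data on different nodes may have different distribution properties, such as hash distribution, random distribution, or replicated distribution. A traditional MPP engine inserts operators into the execution plan according to each table's distribution, so that upper-level operators do not need to be aware of it.

    PolarDB differs in that it uses a shared-storage architecture: all data in storage is fully accessible to every compute node. With a traditional MPP execution engine, the Worker on every compute node would scan the full data set and produce duplicate data. The scan would also gain no divide-and-conquer speedup, so the result could not be called a true MPP engine.

    Therefore, in the PolarDB distributed MPP execution engine, we borrowed ideas from the Volcano model paper and parallelized all scan operators, introducing the PxScan operator to hide the shared storage. PxScan maps shared-storage data to shared-nothing data: through coordination among Workers, the target table is divided into multiple virtual partition data blocks, and each Worker scans its own virtual partition blocks, which achieves cross-node distributed parallel scanning.

    Data scanned by the PxScan operator is redistributed by the Shuffle operator. After redistribution, each Worker processes its data under the Volcano model, just as in single-node execution.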

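    Serverless Elastic Scaling

    Traditional MPP can initiate MPP queries only on designated nodes, so each node can have only a single Worker scanning a given table. To support serverless elastic scaling in cloud-native environments, we introduced a distributed transaction consistency guarantee.

    image.png

    Any node can be chosen as the Coordinator node. Its ReadLSN is used as the agreed LSN, and the smallest snapshot version among all MPP nodes is chosen as the globally agreed snapshot version. LSN replay waiting and Global Snapshot synchronization together ensure that data and snapshots reach a consistent, usable state no matter which node initiates the MPP query.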
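    The agreement step can be sketched as follows. This is a simplified illustration, not PolarDB's implementation: the NodeState struct and both function names are hypothetical, and LSNs and snapshot versions are modeled as plain 64-bit integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node state reported to the Coordinator. */
typedef struct {
    uint64_t replay_lsn;        /* how far this node has replayed WAL */
    uint64_t snapshot_version;  /* this node's local snapshot version */
} NodeState;

/* The globally agreed snapshot version is the smallest one among all nodes. */
uint64_t
agree_snapshot_version(const NodeState *nodes, size_t n)
{
    assert(nodes != NULL && n > 0);
    uint64_t min_version = nodes[0].snapshot_version;
    for (size_t i = 1; i < n; i++)
        if (nodes[i].snapshot_version < min_version)
            min_version = nodes[i].snapshot_version;
    return min_version;
}

/* A node must wait for replay while it is still behind the agreed LSN. */
int
must_wait_for_replay(const NodeState *node, uint64_t agreed_lsn)
{
    return node->replay_lsn < agreed_lsn;
}
```

    With the Coordinator's ReadLSN as the agreed LSN, a node that has replayed less than that value waits, while a node that is already past it can start working immediately.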

    image.png

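    To achieve serverless elastic scaling, we take advantage of shared storage and place on it all the external dependencies needed by every module along the Coordinator node's full path. The runtime parameters that each Worker node needs are synchronized from the Coordinator node over the control path, which makes the full path of both Coordinator and Worker nodes stateless.

    Based on these two design points, PolarDB's elastic scaling has the following advantages:

    • Any node can act as the Coordinator node, which removes the Coordinator single point of failure found in traditional MPP databases.
    • PolarDB can scale out (number of compute nodes) and scale up (per-node parallelism). Scaling takes effect immediately and requires no data redistribution.
    • Workloads can use more flexible scheduling policies, with different business domains running on different sets of nodes. As shown on the right of the figure below, SQL from business domain 1 can run AP queries on RO1 and RO2, while SQL from business domain 2 can use RO3 and RO4. The compute nodes used by the two domains can be scheduled elastically.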

    image.png

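    Eliminating Skew

    Skew is an inherent problem of traditional MPP. Its root causes are mainly data distribution skew and computation skew:

    • Data distribution skew is usually caused by uneven data scattering. In PostgreSQL, TOAST storage for large objects also introduces some unavoidable imbalance in data distribution;
    • Computation skew is usually caused by jitter in concurrent transactions, the Buffer Pool, the network, and I/O on different nodes.

    Skew causes a bucket effect in traditional MPP execution: total completion time is determined by the slowest subtask.

    image.png

    PolarDB designs and implements an adaptive scan mechanism. As shown in the figure above, a Coordinator node coordinates the work of the Worker nodes. When scanning data, the Coordinator node creates a task manager in memory and schedules Worker nodes according to the scan tasks. Internally, the Coordinator node has two threads:

    • The Data thread serves the data path and collects and aggregates tuples
    • The Control thread serves the control path and controls the scan progress of each scan operator

    Workers that scan faster can scan more data blocks, so the more capable do more work. In the figure above, for example, the Workers on RO1 and RO3 each scanned 4 data blocks, while RO2, because of computation skew, was able to scan more and ended up scanning 6.

    The adaptive scan mechanism of PolarDB HTAP also takes PostgreSQL's Buffer Pool affinity fully into account. It ensures that each Worker scans a fixed set of data blocks wherever possible, which maximizes the Buffer Pool hit rate and reduces I/O overhead.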
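    The "faster Workers scan more" behavior can be reproduced with a toy simulation. Everything below is an assumption made for illustration (the TaskManager struct, the function names, and the per-chunk cost model); it ignores Buffer Pool affinity and is not PolarDB's scheduler.

```c
#include <assert.h>

#define NO_CHUNK (-1)
#define MAX_WORKERS 16

/* Hypothetical in-memory task manager kept by the Coordinator. */
typedef struct {
    int total_chunks;   /* number of data chunks in the table */
    int next_chunk;     /* first chunk not yet handed out */
} TaskManager;

/* Hand the next unscanned chunk to whichever Worker asks; NO_CHUNK when done. */
int
request_chunk(TaskManager *tm)
{
    assert(tm != 0);
    if (tm->next_chunk >= tm->total_chunks)
        return NO_CHUNK;
    return tm->next_chunk++;
}

/*
 * Simulate Workers of different speeds. cost[w] is the time Worker w needs
 * per chunk; scanned[w] receives how many chunks it ended up scanning.
 */
void
simulate_adaptive_scan(int total_chunks, const int *cost, int nworkers, int *scanned)
{
    TaskManager tm = { total_chunks, 0 };
    int busy_until[MAX_WORKERS] = { 0 };    /* time at which each Worker is idle */

    assert(nworkers > 0 && nworkers <= MAX_WORKERS);
    for (int w = 0; w < nworkers; w++)
        scanned[w] = 0;

    for (;;) {
        /* The Worker that becomes idle first asks for the next chunk. */
        int w = 0;
        for (int i = 1; i < nworkers; i++)
            if (busy_until[i] < busy_until[w])
                w = i;
        if (request_chunk(&tm) == NO_CHUNK)
            break;
        busy_until[w] += cost[w];
        scanned[w]++;
    }
}
```

    With 14 chunks and per-chunk costs of 3, 2, and 3 time units, the three Workers end up with 4, 6, and 4 chunks, and all finish at the same time instead of waiting on the slowest one.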

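    TPC-H Performance Comparison

    Single-Node Parallelism vs. Distributed MPP

    We used 16 PolarDB PG instances with 256 GB of memory each as RO nodes and built a 1 TB TPC-H environment for comparison. Compared with single-node parallelism, distributed MPP parallelism makes full use of the compute resources of all RO nodes and the I/O bandwidth of the underlying shared storage, which fundamentally addresses the HTAP challenges described above. Of the 22 TPC-H SQL statements, 3 were accelerated by more than 60x and 19 by more than 10x, for an average speedup of 23x.

    image.png

    We also tested how performance changes as compute resources are scaled elastically. As the total number of CPU cores increased from 16 to 128, the total TPC-H run time improved linearly, and so did the execution speed of each SQL statement, which confirms the serverless elastic scaling of PolarDB HTAP.

    image.png

    image.png

    The test also showed that when the total number of CPU cores reached 256, the performance gain was no longer significant, because the I/O bandwidth of PolarDB's shared storage was saturated and had become the bottleneck.

    PolarDB vs. Traditional MPP Databases

    We compared PolarDB's distributed MPP execution engine with the MPP execution engine of a traditional database, again using 16 nodes with 256 GB of memory each.

    On 1 TB of TPC-H data, with the same per-node parallelism as the traditional MPP database (multiple machines, one process each), PolarDB achieves 90% of its performance. The essential reason is that data in a traditional MPP database is hash-distributed by default: when the join keys of two tables are their respective distribution keys, a local Wise Join can be performed without a shuffle. PolarDB, by contrast, sits on a shared storage pool, and data scanned in parallel by the PxScan operator is equivalent to randomly distributed data. It must be shuffled and redistributed before it can be processed in the same way as in a traditional MPP database. As a result, whenever TPC-H involves table joins, PolarDB incurs one more network shuffle than a traditional MPP database.

    image.png

    image.png

    PolarDB's distributed MPP execution engine scales elastically without redistributing data. So when running MPP on a limited set of 16 machines, PolarDB can keep raising per-node parallelism to make full use of each machine: at a per-node parallelism of 8, its performance is 5-6 times that of the traditional MPP database, and as per-node parallelism increases linearly, overall performance also increases linearly. Only a configuration parameter needs to change, and the change takes effect immediately.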

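    Features

    Parallel Query

    After continued iteration, PolarDB HTAP currently supports five main groups of Parallel Query features:

    • Full support for basic operators: scan / join / aggregate / subquery operators.
    • Shared-storage operator optimizations: shared Shuffle operators, SharedSeqScan, SharedIndexScan, and others. With SharedSeqScan and SharedIndexScan, when a large table is joined with a small table, the small table is handled with a mechanism similar to a replicated table, which reduces broadcast overhead and improves performance.
    • Partitioned table support: complete support for Hash / Range / List partitioning, plus static pruning of multi-level partitions and dynamic partition pruning. The PolarDB distributed MPP execution engine also supports Partition Wise Join on partitioned tables.
    • Elastic control of parallelism: at the global, table, session, and query levels.
    • Serverless elastic scaling: MPP can be initiated from any node and run on any combination of nodes within the MPP node range. Cluster topology information is maintained automatically, and shared-storage mode, primary/standby mode, and three-node mode are supported.

    Parallel DML

    Based on PolarDB's read/write splitting architecture and the HTAP serverless elastic scaling design, PolarDB Parallel DML supports two modes: one writer with multiple readers, and multiple writers with multiple readers.

    • One writer, multiple readers: there are multiple read Workers on RO nodes and a single write Worker on the RW node;
    • Multiple writers, multiple readers: there are multiple read Workers on RO nodes and multiple write Workers on the RW node. In this mode, read parallelism and write parallelism are fully decoupled.

    Different modes suit different scenarios; users can choose the PDML mode that fits their workload.

    Index Build Acceleration

    The PolarDB distributed MPP execution engine can be used not only for read-only queries and DML but also to accelerate index builds. OLTP workloads have many indexes, and about 80% of the time spent creating a B-Tree index goes to sorting and building index pages, with the remaining 20% spent writing index pages. As shown in the figure below, PolarDB uses RO nodes to accelerate sorting with distributed MPP, builds index pages in a pipelined manner, and uses batched writes to speed up writing index pages.

    image.png

    Index build acceleration currently supports both regular creation and online (Concurrently) creation of B-Tree indexes.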

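    Usage

    PolarDB HTAP is suited to light analytical workloads in day-to-day business, such as reconciliation and reporting.

    Running Analytical Queries with MPP

    MPP is disabled by default in the PolarDB PG engine. To use it, set the following parameters:

    • polar_enable_px: whether to enable MPP. The default is OFF (disabled).
    • polar_px_max_workers_number: the maximum number of MPP Worker processes on a single node. The default is 30. This parameter caps the parallelism on a single node: the MPP worker processes of all sessions on the node cannot exceed it.
    • polar_px_dop_per_node: the degree of parallelism for parallel queries in the current session. The default is 1; the recommended value is the total number of CPU cores. If it is set to N, a session starts N MPP Worker processes on each node to handle the current MPP logic.
    • polar_px_nodes: the read-only nodes that participate in MPP. Empty by default, which means all read-only nodes participate. It can be set to a comma-separated list of nodes.
    • px_workers: whether MPP applies to a specific table. By default it does not. MPP consumes considerable compute-node resources, so it is used only for tables on which px_workers is set. For example:
      • ALTER TABLE t1 SET(px_workers=1) allows MPP on table t1
      • ALTER TABLE t1 SET(px_workers=-1) forbids MPP on table t1
      • ALTER TABLE t1 SET(px_workers=0) makes table t1 ignore MPP (the default state)

    The following example uses a simple single-table query to show whether MPP is in effect.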

    -- Create the test table and insert some base data.
    CREATE TABLE test(id int);
    INSERT INTO test SELECT generate_series(1,1000000);

    -- MPP is disabled by default, so the single-table query plan is PostgreSQL's native Seq Scan
    EXPLAIN SELECT * FROM test;
                           QUERY PLAN
    --------------------------------------------------------
     Seq Scan on test  (cost=0.00..35.50 rows=2550 width=4)
    (1 row)

    Enable and use MPP:

    -- Allow MPP on the test table
    ALTER TABLE test SET (px_workers=1);

    -- Enable MPP
    SET polar_enable_px = on;

    EXPLAIN SELECT * FROM test;

                                      QUERY PLAN
    -------------------------------------------------------------------------------
     PX Coordinator 2:1  (slice1; segments: 2)  (cost=0.00..431.00 rows=1 width=4)
       ->  Seq Scan on test (scan partial)  (cost=0.00..431.00 rows=1 width=4)
     Optimizer: PolarDB PX Optimizer
    (3 rows)

    Configure which compute nodes participate in MPP:

    -- List the names of all current read-only nodes
    CREATE EXTENSION polar_monitor;

    SELECT name,host,port FROM polar_cluster_info WHERE px_node='t';
     name  |   host    | port
    -------+-----------+------
     node1 | 127.0.0.1 | 5433
     node2 | 127.0.0.1 | 5434
    (2 rows)

    -- The cluster currently has 2 read-only nodes, named node1 and node2

    -- Make only the read-only node node1 participate in MPP
    SET polar_px_nodes = 'node1';

    -- Show the nodes that participate in parallel queries
    SHOW polar_px_nodes;
     polar_px_nodes
    ----------------
     node1
    (1 row)

    EXPLAIN SELECT * FROM test;
                                      QUERY PLAN
    -------------------------------------------------------------------------------
     PX Coordinator 1:1  (slice1; segments: 1)  (cost=0.00..431.00 rows=1 width=4)
       ->  Partial Seq Scan on test  (cost=0.00..431.00 rows=1 width=4)
     Optimizer: PolarDB PX Optimizer
    (3 rows)

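    Querying Partitioned Tables with MPP

    MPP currently supports the following for partitioned tables:

    • Parallel queries on Range partitions
    • Parallel queries on List partitions
    • Parallel queries on single-column Hash partitions
    • Partition pruning
    • Parallel queries on partitioned tables with indexes
    • Join queries on partitioned tables
    • Parallel queries on multi-level partitions

    -- MPP for partitioned tables is disabled by default; enable MPP first
    SET polar_enable_px = ON;

    -- Run the following statement to enable MPP for partitioned tables
    SET polar_px_enable_partition = true;

    -- Run the following statement to enable MPP for multi-level partitioned tables
    SET polar_px_optimizer_multilevel_partitioning = true;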

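    Accelerating Index Creation with MPP

    Currently only B-Tree index builds are supported. Index build syntax such as INCLUDE and index column types such as expressions are not yet supported.

    To use MPP to accelerate index creation, set the following parameters:

    • polar_px_dop_per_node: the degree of parallelism for the MPP-accelerated index build. The default is 1.
    • polar_px_enable_replay_wait: when MPP accelerates an index build, this parameter does not need to be enabled manually in the current session. It takes effect automatically so that the most recently updated table entries are included in the index, which keeps the index complete. After the index is created, the parameter is reset to the database default.
    • polar_px_enable_btbuild: whether to use MPP to accelerate index creation. OFF disables it (the default); ON enables it.
    • polar_bt_write_page_buffer_size: the write I/O strategy used during the index build. The default is 0 (disabled), the unit is blocks, and the maximum is 8192. The recommended value is 4096.
      • When the parameter is disabled, index pages that fill up during index creation are written to disk block by block.
      • When the parameter is enabled, the kernel keeps a buffer of polar_bt_write_page_buffer_size blocks. Index pages to be written are merged in this buffer and flushed together, which avoids the overhead of frequent I/O scheduling. This parameter improves index creation performance by an additional 20%.

    -- Enable MPP-accelerated index creation.
    SET polar_px_enable_btbuild = on;

    -- Create the index with the following syntax
    CREATE INDEX t ON test(id) WITH(px_build = ON);

    -- Inspect the table definition
    \d test
                   Table "public.test"
     Column |  Type   | Collation | Nullable | Default
    --------+---------+-----------+----------+---------
     id     | integer |           |          |
     id2    | integer |           |          |
    Indexes:
        "t" btree (id) WITH (px_build=finish)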
    + + + diff --git a/zh/theory/arch-overview.html b/zh/theory/arch-overview.html new file mode 100644 index 00000000000..89d6e74e495 --- /dev/null +++ b/zh/theory/arch-overview.html @@ -0,0 +1,33 @@ + + + + + + + + + 特性总览 | PolarDB for PostgreSQL + + + + +

    Feature Overview

    北侠

    2021/08/24

    35 min

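    PolarDB for PostgreSQL (hereafter PolarDB) is an enterprise-grade database product developed in-house by Alibaba Cloud. It uses a compute-storage separation architecture and is 100% compatible with PostgreSQL. Both its storage and its compute capacity scale horizontally, and it offers enterprise database features such as high reliability, high availability, and elastic scaling. PolarDB also has large-scale parallel computing capability for mixed OLTP and OLAP workloads, and provides multi-model features such as spatio-temporal, vector, search, and graph to meet enterprises' rapidly evolving data processing needs.

    PolarDB supports several deployment modes: storage-compute separation, X-Paxos three-node, and local disk.

    Problems with Traditional Databases

    As business data volumes grow and workloads become more complex, traditional database systems face major challenges, such as:

    1. Storage capacity cannot exceed the limit of a single machine.
    2. Reads are scaled with read-only instances, each of which has its own copy of storage, so cost increases.
    3. As data grows, creating a read-only instance takes longer.
    4. Primary-standby replication lag is high.

    Advantages of PolarDB as a Cloud-Native Database

    image.png

    To address these problems, Alibaba Cloud developed the PolarDB cloud-native database, which uses an in-house architecture that separates the compute cluster from the storage cluster. It has the following advantages:

    1. Scalability: storage and compute are separated, giving extreme elasticity.
    2. Cost: one shared copy of data keeps storage cost low.
    3. Ease of use: one writer with multiple readers, and transparent read/write splitting.
    4. Reliability: three replicas and second-level backup.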

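    PolarDB Architecture Overview

    The following looks at PolarDB's architecture from two angles: the storage-compute separation architecture and the HTAP architecture.

    Overview of the Storage-Compute Separation Architecture

    image.png

    PolarDB separates storage from compute; the storage cluster and the compute cluster can be scaled independently:

    1. When compute capacity is insufficient, the compute cluster can be scaled on its own.
    2. When storage capacity is insufficient, the storage cluster can be scaled on its own.

    With Shared-Storage, the primary node and multiple read-only nodes share one copy of stored data, so the primary can no longer flush dirty pages the traditional way. Otherwise:

    1. A page that a read-only node reads from storage may be an older version that does not match the node's own state.
    2. A page that a read-only node reads may be ahead of the data that its own memory expects.
    3. When the primary fails over to a read-only node and that node takes over data updates, pages in storage may be stale, and dirty pages must be recovered by reading the log.

    For the first problem, page multi-versioning is needed; for the second, the primary must control how fast dirty pages are flushed.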

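    Overview of the HTAP Architecture

    After read/write splitting, a single compute node cannot exploit the large I/O bandwidth of the storage side, and large queries cannot be accelerated by adding compute resources. We developed Shared-Storage-based MPP distributed parallel execution to accelerate OLAP queries in OLTP scenarios. PolarDB lets one set of OLTP data be used by the following two compute engines:

    • Single-node execution engine: handles highly concurrent OLTP workloads.
    • Distributed execution engine: handles large-query OLAP workloads.

    image.png

    With the same hardware resources, performance reaches 90% of a traditional MPP database, and PolarDB also offers SQL-level elasticity: when compute capacity is insufficient, CPUs that participate in OLAP queries can be added at any time, without redistributing data.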

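    PolarDB: The Storage-Compute Separation Architecture in Detail

    Challenges Introduced by Shared-Storage

    With Shared-Storage, the database moves from a traditional shared-nothing architecture to a shared-storage architecture. The following problems must be solved:

    • Data consistency: N copies of compute plus N copies of storage become N copies of compute plus 1 copy of storage.
    • Read/write splitting: how to achieve low-latency replication on the new architecture.
    • High availability: how to perform Recovery and Failover.
    • I/O model: how to optimize from Buffer-IO to Direct-IO.

    Architecture

    image.png

    First, the architecture of Shared-Storage-based PolarDB:

    • The primary node is read-write (RW); read-only nodes are read-only (RO).
    • At the Shared-Storage layer only the primary can write, so the primary and the read-only nodes see the same persisted data.
    • The in-memory state of a read-only node is kept in sync with the primary by replaying WAL.
    • The primary writes WAL to Shared-Storage and replicates only the WAL meta to read-only nodes.
    • Read-only nodes read WAL from Shared-Storage and replay it.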

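    Data Consistency

    In-Memory State Synchronization in Traditional Databases

    In a traditional shared-nothing database, the primary and the read-only nodes each have their own memory and storage. It is enough to replicate WAL from the primary to the read-only nodes and replay it there in order, which is the basic principle of a replicated state machine.

    In-Memory State Synchronization under Shared-Storage

    As noted earlier, after storage-compute separation, pages read from Shared-Storage are consistent, and in-memory state is obtained by reading the latest WAL from Shared-Storage and replaying it, as shown below:

    image.png

    1. The primary flushes version 200 to Shared-Storage.
    2. The read-only node starts from version 100 and replays the log to reach 200.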

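    "Past Pages" under Shared-Storage

    In the flow above, a page produced by log replay on the read-only node may be evicted. When it is later read from storage again, the page that is read may be an older version, called a "past page", as shown below:

    image.png

    1. At T1, the primary writes the log record LSN=200, which updates page P1 from 500 to 600;
    2. At this point, page P1 on the read-only node still contains 500;
    3. At T2, the primary sends the meta of log record 200 to the read-only node, which learns that a new log record exists;
    4. At T3, page P1 is read on the read-only node. This requires reading P1 and the log record LSN=200 and performing one replay, which yields the latest content of P1, 600;
    5. At T4, because the BufferPool runs short, the read-only node evicts the freshly replayed page P1;
    6. The primary has not flushed the latest content of P1 (600) to Shared-Storage;
    7. At T5, P1 is read again on the read-only node. Because P1 was evicted from memory, it is read from Shared-Storage, and what is read is the "past page";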

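    Solving "Past Pages"

    Whenever a read-only node reads a page, it must find the corresponding base page and the log records from the corresponding starting point, and replay them in order, as shown below:

    image.png

    1. The read-only node maintains in memory the log meta that corresponds to each Page.
    2. When a Page is read, log records are applied one by one as needed until the expected Page version is reached.
    3. When log records are applied, they are read from Shared-Storage using the log meta.

    From this analysis, an "inverted" index from each Page to its log records must be maintained. Because a read-only node's memory is limited, this index must be persistable. PolarDB designed a persistable index structure for this purpose: LogIndex, which is essentially a persistable hash data structure.

    1. The read-only node receives WAL meta from the primary through the WAL receiver.
    2. The WAL meta records which Pages the log record modified.
    3. The WAL meta is inserted into LogIndex, with the PageID as the key and the LSN as the value.
    4. A single WAL record may update multiple Pages (for example, an index split), in which case LogIndex holds multiple entries for it.
    5. The Page is also marked as outdated in the BufferPool, so that the next read replays the corresponding log records from LogIndex.
    6. When memory usage reaches a threshold, LogIndex asynchronously flushes the in-memory hash to disk.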
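    A toy version of the PageID-to-LSN mapping helps make the steps above concrete. The sketch below is an assumption-laden simplification: it uses a small flat array instead of a persistable hash structure, and all type and function names are hypothetical rather than taken from PolarDB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOGINDEX_CAPACITY 64

/* One LogIndex entry: WAL record `lsn` modified page `page_id`. */
typedef struct {
    uint32_t page_id;
    uint64_t lsn;
} LogIndexEntry;

/* Toy in-memory LogIndex; entries are appended in LSN order. */
typedef struct {
    LogIndexEntry entries[LOGINDEX_CAPACITY];
    size_t        count;
} LogIndex;

/* Record WAL meta: one entry per page touched by the record. */
void
logindex_add(LogIndex *idx, uint64_t lsn, const uint32_t *page_ids, size_t npages)
{
    assert(idx->count + npages <= LOGINDEX_CAPACITY);
    for (size_t i = 0; i < npages; i++) {
        idx->entries[idx->count].page_id = page_ids[i];
        idx->entries[idx->count].lsn = lsn;
        idx->count++;
    }
}

/*
 * Collect, in order, the LSNs that must be replayed to bring `page_id` from
 * version `base_lsn` (what storage holds) up to `target_lsn`.
 */
size_t
logindex_lsns_to_replay(const LogIndex *idx, uint32_t page_id,
                        uint64_t base_lsn, uint64_t target_lsn,
                        uint64_t *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t i = 0; i < idx->count && n < out_cap; i++) {
        const LogIndexEntry *e = &idx->entries[i];
        if (e->page_id == page_id && e->lsn > base_lsn && e->lsn <= target_lsn)
            out[n++] = e->lsn;
    }
    return n;
}
```

    A record that touches two pages, such as an index split, simply produces two entries with the same LSN, which matches item 4 in the list above.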

    image.png

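    LogIndex solves the "past page" problem caused by the dependency on dirty-page flushing, and it also turns replay on read-only nodes into lazy replay: only the log meta needs to be replayed.

    "Future Pages" under Shared-Storage

    After storage-compute separation, the flush dependency also gives rise to a "future page" problem, as shown below:

    image.png

    1. At T1, the primary updates P1 twice, producing 2 log records. At this point, the content of page P1 on both the primary and the read-only node is 500.
    2. At T2, the log record LSN=200 is sent to the read-only node.
    3. At T3, the read-only node replays the log record LSN=200, and the content of P1 becomes 600. The read-only node has now replayed up to 200; as far as it is concerned, the later log record LSN=300 does not exist yet.
    4. At T4, the primary flushes dirty pages and writes the latest content of P1, 700, to Shared-Storage. Meanwhile, the BufferPool on the read-only node evicts page P1.
    5. At T5, the read-only node reads page P1 again. P1 is not in the BufferPool, so the latest P1 is read from shared storage. However, the read-only node has not replayed the log record LSN=300, so it has read a "future page" that is ahead of its own state.
    6. The problem with future pages is that some pages are future pages while others are normal, which leads to inconsistent data. For example, after an index splits into 2 Pages, one read returns a normal Page and the other returns a "future page", and the B+Tree index structure is corrupted.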

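    Solving "Future Pages"

    Future pages arise because the primary flushes dirty pages faster than a read-only node replays (even though lazy replay on read-only nodes is already fast). The solution is to control the primary's flush progress: it must not go beyond the replay position of the slowest read-only node, as shown below:

    image.png

    1. The read-only node has replayed up to position T4.
    2. When flushing, the primary sorts all dirty pages by LSN and flushes only those at or before T4; later dirty pages are not flushed.
    3. The LSN at T4 is called the "consistent LSN".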

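    Low-Latency Replication

    Problems with Traditional Streaming Replication

    1. Synchronization path: the log synchronization path involves many I/Os and a large volume of network traffic.
    2. Page replay: reading pages and modifying Buffers is slow (I/O-intensive and CPU-intensive).
    3. DDL replay: modifying a file requires locking it, and acquiring the lock is easily blocked, which makes DDL slow.
    4. Snapshot update: high concurrency on RO nodes makes transaction snapshot updates slow.

    As shown below:

    image.png

    1. The primary writes WAL to the local file system.
    2. The WAL Sender process reads it and sends it.
    3. The WAL Receiver process on the read-only node receives it and writes it to the local file system.
    4. The replay process reads the WAL, reads the corresponding Pages into the BufferPool, and replays them in memory.
    5. The primary flushes dirty pages to Shared Storage.

    The whole path is long, so read-only nodes have high latency, which affects load balancing under read/write splitting.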

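    Optimization 1: Replicate Only Meta

    Because the underlying layer is Shared-Storage, read-only nodes can read the WAL data they need directly from it. The primary therefore replicates only the WAL metadata (with the Payload removed) to read-only nodes, which reduces network traffic and I/O on the critical path, as shown below:

    image.png

    1. A WAL Record consists of: Header, PageID, Payload.
    2. Because read-only nodes can read WAL files on Shared-Storage directly, the primary sends (replicates) only the WAL metadata to read-only nodes: Header and PageID.
    3. On the read-only node, the full WAL file on Shared-Storage is read directly using the WAL metadata.

    This optimization significantly reduces network traffic between the primary and the read-only nodes. The figure below shows that network traffic is reduced by 98%.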

    image.png

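    Optimization 2: Page Replay

    In a traditional database, log replay reads a large number of Pages, applies log records to them one by one, and then writes them to disk. This flow sits on the critical path of user read I/O. With storage-compute separation, if a Page is not in the read-only node's BufferPool, no I/O is issued at all; only LogIndex is recorded.

    The following I/O work in the replay process can be offloaded to session processes:

    1. Data page I/O overhead.
    2. Log apply overhead.
    3. Multi-version page replay based on LogIndex.

    As shown below, when the replay process on a read-only node applies the meta of a WAL record:

    image.png

    1. If the corresponding Page is not in memory, only LogIndex is recorded.
    2. If the corresponding Page is in memory, it is marked as Outdated and LogIndex is recorded; the replay step is then complete.
    3. When a user session process reads a Page, it reads the correct Page into the BufferPool and replays the relevant log records through LogIndex.
    4. The main I/O work is thus offloaded from the single replay process to multiple user processes.

    This optimization significantly reduces replay latency, making it 30 times faster than AWS Aurora.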

    image.png

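    Optimization 3: DDL Lock Replay

    When the primary executes DDL such as drop table, an exclusive lock must be taken on the table on all nodes. This guarantees that the table file is not deleted by the primary while a read-only node is reading it (there is only one copy of the file on Shared-Storage). The exclusive lock is taken on all read-only nodes by replicating it through WAL; each read-only node replays the DDL lock. Because the replay process may block for a long time when it acquires the table lock, the DDL lock is also offloaded to another process to keep it off the replay process's critical path.

    image.png

    With this optimization, the replay process stays smooth, and its critical path is not blocked while waiting for DDL.

    image.png

    Together, these three optimizations greatly reduce replication lag and bring the following benefits:

    • Read/write splitting: load balancing, with an experience closer to Oracle RAC.
    • High availability: the HA flow is faster.
    • Stability: the number of future pages is minimized, so fewer page snapshots, or none at all, need to be written.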

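    Recovery Optimization

    Background

    Recovery after database OOM, crash, and similar events takes a long time, essentially because log replay is slow. The problem is more pronounced under the Direct-IO model on shared storage.

    image.png

    Lazy Recovery

    As described earlier, LogIndex enables lazy replay on read-only nodes. Recovery after a primary restart is also essentially log replay, so lazy replay can be used to speed up recovery as well:

    image.png

    1. WAL records are read one by one, starting from the checkpoint.
    2. Once the LogIndex records have been replayed, replay is considered complete.
    3. Recovery completes, and the database starts serving requests.
    4. The actual replay is offloaded to the session processes that arrive after the restart.

    After optimization (replaying 500 MB of log):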

    image.png

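    Persistent BufferPool

    The approach above speeds up the restart during recovery. After the restart, however, session processes must read WAL to replay the pages they need, which shows up as a short period of slow responses after recovery. The fix is to keep the BufferPool when the database restarts, as shown below: the BufferPool is not destroyed across crash and restart.

    image.png

    Shared memory in the kernel is divided into 2 parts:

    1. Global structures, such as ProcArray.
    2. The BufferPool structure. The BufferPool is allocated as named shared memory and remains valid after a process restart, whereas global structures must be reinitialized after a restart.

    image.png

    Not every Page in the BufferPool can be reused, however. For example, if a process took an X lock on a Page before the restart and then crashed, no process is left to release that lock. So after crash and restart, the whole BufferPool must be traversed to discard Pages that cannot be reused. In addition, reclaiming the BufferPool depends on k8s. With this optimization, performance is stable across restarts.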

    image.png

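    PolarDB: The HTAP Architecture in Detail

    After read/write splitting, PolarDB sits on a storage pool whose I/O throughput is, in theory, unbounded. A large query, however, can run only on a single compute node, whose CPU, memory, and I/O are limited. A single compute node therefore cannot exploit the large I/O bandwidth of the storage side, and large queries cannot be accelerated by adding compute resources. We developed Shared-Storage-based MPP distributed parallel execution to accelerate OLAP queries in OLTP scenarios.

    HTAP Architecture

    PolarDB's underlying storage is shared across nodes, so tables cannot be scanned directly the way traditional MPP does. We added MPP distributed parallel execution to the existing single-node execution engine and optimized it for Shared-Storage. Shared-Storage-based MPP is an industry first. It works as follows:

    1. The Shuffle operator hides data distribution.
    2. The ParallelScan operator hides shared storage.

    image.png

    As shown in the figure:

    1. Tables A and B are joined and then aggregated.
    2. Tables in shared storage are still single tables and are not physically partitioned.
    3. Four types of scan operators were redesigned so that they scan tables on shared storage in slices, forming virtual partitions.

    Distributed Optimizer

    We extended the community GPORCA optimizer with Transformation Rules that are aware of shared storage, so that it can explore the plan space specific to shared storage. For example, a table in PolarDB can be scanned in full or scanned by region, which is the essential difference from traditional MPP. In the figure, the gray part at the top is the adaptation layer between the PolarDB kernel and the GPORCA optimizer. The lower part is the ORCA kernel; the gray modules are our extensions to the ORCA kernel for shared storage.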

    image.png

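    Operator Parallelization

    Four types of operators in PolarDB need to be parallelized. Here we describe a representative one, the SeqScan operator. To make the most of the storage's large I/O bandwidth, a sequential scan is logically split into 4 MB units, so that I/O is spread across as many disks as possible and all disks serve reads at the same time. Another advantage is that each read-only node scans only part of the table file, so the total table size that can be cached is the sum of the BufferPools of all read-only nodes.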
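    The 4 MB logical split can be expressed in terms of block numbers. The sketch below assumes PostgreSQL's default 8 KB page size, which the text does not state, and uses hypothetical function names. The round-robin assignment at the end is only a static illustration; the actual assignment described later in this article is dynamic.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE_BYTES  8192u                     /* assumed default page size */
#define SLICE_SIZE_BYTES  (4u * 1024u * 1024u)      /* 4 MB logical scan unit */
#define BLOCKS_PER_SLICE  (SLICE_SIZE_BYTES / BLOCK_SIZE_BYTES)   /* 512 */

/* Number of 4 MB slices needed to cover a relation of `nblocks` pages. */
uint32_t
slice_count(uint32_t nblocks)
{
    return nblocks / BLOCKS_PER_SLICE + (nblocks % BLOCKS_PER_SLICE != 0);
}

/* Slice that contains block `blkno`. */
uint32_t
slice_of_block(uint32_t blkno)
{
    return blkno / BLOCKS_PER_SLICE;
}

/* Hypothetical static assignment: slices are dealt to workers round-robin. */
uint32_t
worker_for_slice(uint32_t slice, uint32_t nworkers)
{
    assert(nworkers > 0);
    return slice % nworkers;
}
```

    Under these assumptions each slice covers 512 consecutive blocks, so a worker reads long sequential runs while different workers hit different parts of the file.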

    image.png

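    In the charts below:

    1. Adding read-only nodes improves scan performance linearly, by 30x.
    2. With Buffer enabled, the scan time drops from 37 minutes to 3.75 seconds.

    image.png

    Eliminating Data Skew

    Skew is an inherent problem of traditional MPP:

    1. In PolarDB, large objects are stored by associating the heap table with a TOAST table; whichever table is split, the split cannot be balanced.
    2. In addition, transaction, buffer, network, and I/O load fluctuates across read-only nodes.

    Both lead to long-tail processes in distributed execution.

    image.png

    1. The coordinator node is internally divided into DataThread and ControlThread.
    2. DataThread collects and aggregates tuples.
    3. ControlThread controls the scan progress of each scan operator.
    4. Worker processes that scan fast can scan more logical data slices.
    5. Buffer affinity must be considered in the process.

    Note that although allocation is dynamic, buffer affinity is maintained as far as possible. In addition, the context of each operator is stored in the worker's private memory; the Coordinator stores no table-specific information.

    In the table below, when large objects are present, static splitting suffers from data skew, while dynamic scanning still scales linearly.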

    image.png

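    SQL-Level Elastic Scaling

    Taking advantage of shared data, we can also meet the cloud-native demand for extreme elasticity: the external dependencies needed by each module along the Coordinator's full path are stored on shared storage, and the runtime parameters needed along the worker's full path are synchronized from the Coordinator over the control path, which makes both the Coordinator and the workers stateless.

    image.png

    As a result:

    1. Any read-only node that a SQL connection lands on can act as the Coordinator node, which removes the Coordinator single point of failure.
    2. A SQL statement can start any number of workers on any node, so compute capacity scales elastically at the SQL level. This also allows more scheduling policies: different business domains can run on different sets of nodes at the same time.

    image.png

    Transactional Consistency

    Data consistency across multiple compute nodes is achieved through replay waiting and the globalsnapshot mechanism. Replay waiting ensures that all workers can see the data version they need, and globalsnapshot ensures that a single agreed version is selected.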

    image.png

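    TPC-H Performance: Speedup

    image.png

    We tested with 1 TB of TPC-H data, first comparing PolarDB's new distributed parallelism with single-node parallelism: 3 SQL statements were 60x faster, and 19 SQL statements were more than 10x faster;

    image.png

    image.png

    We also tested the distributed execution engine while adding CPUs. Performance improves linearly from 16 cores to 128 cores. Looking at the 22 SQL statements individually, the performance of each statement also improves linearly as CPUs are added.

    TPC-H Performance: Comparison with Traditional MPP Databases

    Compared with a traditional MPP database on the same 16 nodes, PolarDB reaches 90% of its performance.

    image.png

    image.png

    As noted above, PolarDB's distributed engine scales elastically without redistributing data. At dop = 8, its performance is 5.6 times that of the traditional MPP database.

    Accelerating Index Creation with Distributed Execution

    OLTP workloads create many indexes. Analysis shows that during index creation, 80% of the time is spent sorting and building index pages and 20% is spent writing index pages. We use distributed parallelism to accelerate the sort, together with pipelined, batched writes.

    image.png

    These optimizations make index creation 4 to 5 times faster.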

    image.png

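    Accelerating Multi-Model Workloads with Distributed Parallel Execution: Spatio-Temporal Database

    PolarDB is a multi-model database that supports spatio-temporal data. Spatio-temporal workloads are both compute-intensive and I/O-intensive and can be accelerated with distributed execution. We developed the ability to scan shared RTREE indexes on shared storage.

    image.png

    • Data volume: 400 million, 500 GB
    • Specification: 5 read-only nodes, each with 16 CPU cores and 128 GB of memory
    • Performance:
      • Scales linearly with the number of CPUs
      • With 80 CPU cores in total, a 71x speedup

    image.png

    Summary

    This article analyzed the key technical points of PolarDB at the architecture level:

    • The storage-compute separation architecture.
    • The HTAP architecture.

    Later articles will discuss more technical details, such as how the query optimizer works on Shared-Storage, how LogIndex achieves high performance, how to flash back to any point in time, how MPP is supported on Shared-Storage, and how to combine with X-Paxos to build high availability. Stay tuned.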

    + + + diff --git a/zh/theory/buffer-management.html b/zh/theory/buffer-management.html new file mode 100644 index 00000000000..ccd8da55721 --- /dev/null +++ b/zh/theory/buffer-management.html @@ -0,0 +1,37 @@ + + + + + + + + + 缓冲区管理 | PolarDB for PostgreSQL + + + + +

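    Buffer Management

    Background

    In the traditional primary/standby database architecture, the primary and the standby each have their own storage. The standby replays WAL and reads and writes its own storage, and the two nodes are not coupled at the storage layer. PolarDB implements a one-writer, multiple-reader architecture on shared storage, in which the primary and the standbys use a single copy of data in shared storage. The read-write node, also called the primary or Primary node, can read and write data in shared storage. A read-only node, also called a standby or Replica node, can only read data from shared storage, replaying logs on its own, and cannot write. The basic architecture is shown below:

    image.png

    Under the one-writer, multiple-reader architecture, a read-only node may read two kinds of pages from shared storage:

    • Future page: the page contains data that the read-only node has not yet replayed. For example, the read-only node has replayed up to the WAL record with LSN 200, but the page already contains the changes of the WAL record with LSN 300. Such a page is called a "future page".

      image.png

    • Past page: the page does not contain all changes before the replay position. For example, the read-only node has replayed the page up to the WAL record with LSN 200, but after the page is evicted from the Buffer Pool and read again from shared storage, the page that is read does not contain the changes of the WAL record with LSN 200. Such a page is called a "past page".

      image.png

    A read-only node only needs to access pages that correspond to its replay position. How should it handle the "future pages" and "past pages" described above?

    • For a "past page", the read-only node must replay the WAL records that the page is missing up to the replay position. Each read-only node does this according to its own replay position; it is part of the read-only node's replay function and is not discussed here.
    • For a "future page", the read-only node cannot convert the "future" page into the page it needs. The primary must therefore take the replay progress of all read-only nodes into account when it writes data to shared storage, so that read-only nodes never read a "future page". This is the main problem that Buffer management has to solve.

    In addition, Buffer management maintains the consistent LSN. For a given page, a read-only node only needs to replay the WAL between the consistent LSN and its current replay position, which speeds up replay.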

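    Terminology

    • Buffer Pool: an in-memory structure that stores the most frequently accessed data, usually cached in units of pages. Each node in PolarDB has its own Buffer Pool.
    • LSN: Log Sequence Number, the unique identifier of a WAL record. LSNs increase globally.
    • Replay position: Apply LSN, the position up to which a read-only node has replayed the log, usually marked by an LSN.
    • Oldest replay position: Oldest Apply LSN, the smallest replay position among all read-only nodes.

    Flush Control

    To prevent read-only nodes from reading "future pages", PolarDB introduces flush control: when the primary is about to write a page to shared storage, it checks whether all read-only nodes have replayed the WAL record of the page's most recent modification.

    image.png

    Pages in the primary's Buffer Pool fall into two classes, depending on whether they contain "future data" (data produced after the read-only nodes' replay position): those that may be written to storage and those that may not. The decision depends on two positions:

    • The LSN of the Buffer's most recent modification, which we call Buffer Latest LSN.
    • The oldest replay position, that is, the smallest replay position among all read-only nodes, which we call Oldest Apply LSN.

    The flush control rule is: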

    if buffer latest lsn <= oldest apply lsn
        flush buffer
    else
        do not flush buffer
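    The same rule can be written as compilable C. This is a minimal sketch, assuming LSNs are modeled as 64-bit integers; the type and function names are hypothetical and not taken from the PolarDB source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;

/* Oldest Apply LSN: the smallest replay position among all read-only nodes. */
Lsn
oldest_apply_lsn(const Lsn *apply_lsns, size_t n)
{
    assert(apply_lsns != NULL && n > 0);
    Lsn oldest = apply_lsns[0];
    for (size_t i = 1; i < n; i++)
        if (apply_lsns[i] < oldest)
            oldest = apply_lsns[i];
    return oldest;
}

/* A buffer may be flushed only if every RO node has replayed its last change. */
bool
can_flush(Lsn buffer_latest_lsn, Lsn oldest_apply)
{
    return buffer_latest_lsn <= oldest_apply;
}
```

    With read-only nodes at replay positions 300, 200, and 250, the Oldest Apply LSN is 200, so a buffer last modified at LSN 200 may be flushed while one last modified at LSN 300 may not.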

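    Consistent LSN

    To replay a page to a given LSN, a read-only node maintains a mapping from each page to the LSNs of the records that modify it. This mapping is stored in LogIndex, which can be thought of as a HashTable that can be persisted. When a page is accessed, all LSNs that need to be replayed for it are fetched from this mapping, and the corresponding WAL records are replayed in order to produce the page that is needed.

    image.png

    Clearly, the more a page is modified, the more LSNs it has, and the longer replay takes. To minimize the number of LSNs that need to be replayed for a page, PolarDB introduces the concept of the consistent LSN.

    The consistent LSN is the position such that all pages modified by WAL records before it have been persisted to storage. The primary sends the standby its current WAL write position and the consistent LSN, and the standby sends the primary its current replay position. Because all WAL changes before the consistent LSN have already been written to shared storage, the standby does not need to replay WAL records before it. All LSNs in LogIndex that are smaller than the consistent LSN can therefore be removed, which both speeds up replay and reduces the space that LogIndex uses.

    FlushList

    To maintain the consistent LSN, PolarDB adds an in-memory state to each Buffer: the LSN of the first modification of that Buffer, called the oldest LSN. The smallest oldest LSN among all Buffers is the consistent LSN.

    One way to obtain the consistent LSN is to traverse all Buffers in the Buffer Pool and find the minimum, but the traversal is expensive: neither the CPU cost nor the latency is acceptable. To obtain the consistent LSN efficiently, PolarDB introduces the FlushList mechanism, which keeps all dirty pages in the Buffer Pool sorted by oldest LSN in ascending order. With FlushList, the consistent LSN can be obtained in O(1) time.

    image.png

    When a Buffer is first modified and marked dirty, it is inserted into FlushList and its oldest LSN is set. When the Buffer is written to storage, this in-memory mark is cleared.

    To advance the consistent LSN efficiently, PolarDB's background writer (bgwriter) flushes in "first modified, first flushed" order: bgwriter traverses FlushList from front to back and flushes pages one by one, and the consistent LSN can advance as soon as a dirty page is written to storage. In the figure above, for example, once the Buffer whose oldest LSN is 10 is flushed, the consistent LSN can advance to 30.

    Parallel Flushing

    To further speed up the advance of the consistent LSN, PolarDB implements parallel flushing. Each background flush process takes a batch of pages from FlushList and flushes them.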

    image.png

    热点页

    引入刷脏控制之后,仅满足刷脏条件的 Buffer 才能写入存储,假如某个 Buffer 修改非常频繁,可能导致 Buffer Latest LSN 总是大于 Oldest Apply LSN,该 Buffer 始终无法满足刷脏条件,此类 Buffer 我们称之为热点页。热点页会导致一致性位点无法推进,为解决热点页的刷脏问题,PolarDB 引入了 Copy Buffer 机制。

    Copy Buffer 机制会将特定的、不满足刷脏条件的 Buffer 从 Buffer Pool 中拷贝至新增的 Copy Buffer Pool 中,Copy Buffer Pool 中的 Buffer 不会再被修改,其对应的 Latest LSN 也不会更新,随着 Oldest Apply LSN 的推进,Copy Buffer 会逐步满足刷脏条件,从而可以将 Copy Buffer 落盘。

    引入 Copy Buffer 机制后,刷脏的流程如下:

    1. 如果 Buffer 不满足刷脏条件,判断其最近修改次数以及距离当前日志位点的距离,超过一定阈值,则将当前数据页拷贝一份至 Copy Buffer Pool 中。
    2. 下次再刷该 Buffer 时,判断其是否满足刷脏条件,如果满足,则将该 Buffer 写入存储并释放其对应的 Copy Buffer。
    3. 如果 Buffer 不满足刷脏条件,则判断其是否存在 Copy Buffer,若存在且 Copy Buffer 满足刷脏条件,则将 Copy Buffer 落盘。
    4. Buffer 被拷贝到 Copy Buffer Pool 之后,如果有对该 Buffer 的修改,则会重新生成该 Buffer 的 Oldest LSN,并将其追加到 FlushList 末尾。
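
    The decision made for one buffer in steps 1 to 3 can be written down as a small pure function. This is a hedged sketch, not PolarDB code: the `demo_*` names are hypothetical, and the `looks_hot` flag stands in for the modification-count and WAL-distance thresholds, whose actual values the text does not give.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_lsn;

typedef enum {
    DEMO_FLUSH_BUFFER,   /* write the buffer itself and release its copy, if any */
    DEMO_FLUSH_COPY,     /* write the frozen Copy Buffer instead */
    DEMO_MAKE_COPY,      /* hot page: copy it into the Copy Buffer Pool */
    DEMO_SKIP            /* nothing can be written yet */
} demo_flush_action;

typedef struct {
    demo_lsn latest_lsn;       /* Buffer Latest LSN */
    bool     has_copy;         /* a Copy Buffer exists for this page */
    demo_lsn copy_latest_lsn;  /* Latest LSN frozen when the copy was taken */
    bool     looks_hot;        /* stand-in for the threshold checks (assumption) */
} demo_page_state;

/* The flush control rule: flush only what every read-only node has replayed. */
static bool demo_can_flush(demo_lsn latest_lsn, demo_lsn oldest_apply_lsn)
{
    return latest_lsn <= oldest_apply_lsn;
}

demo_flush_action demo_decide_flush(const demo_page_state *p,
                                    demo_lsn oldest_apply_lsn)
{
    assert(p != NULL);
    if (demo_can_flush(p->latest_lsn, oldest_apply_lsn))
        return DEMO_FLUSH_BUFFER;
    if (p->has_copy)
        return demo_can_flush(p->copy_latest_lsn, oldest_apply_lsn)
                   ? DEMO_FLUSH_COPY
                   : DEMO_SKIP;
    return p->looks_hot ? DEMO_MAKE_COPY : DEMO_SKIP;
}
```

    For a page whose copy was frozen at Latest LSN 500 and whose live buffer has since moved to 600, the function returns DEMO_FLUSH_COPY once the Oldest Apply LSN reaches 500 and DEMO_FLUSH_BUFFER once it reaches 600.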

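    In the figure below, the buffer with [oldest LSN, latest LSN] = [30, 500] is considered a hot page and is copied into the Copy Buffer Pool. The page is then modified again; assuming that modification has LSN 600, its oldest LSN is set to 600, and it is removed from the FlushList and appended to the tail. The page in the Copy Buffer is no longer modified, so its Latest LSN stays at 500, and once the flush condition holds the Copy Buffer can be written to storage.

    image.png

    Note that the Copy Buffer changes how the consistent LSN is computed. The oldest LSN at the head of the FlushList is no longer guaranteed to be the smallest one, because the Copy Buffer Pool may hold a smaller oldest LSN. Besides the oldest LSN in the FlushList, the Copy Buffer Pool must therefore also be scanned for its smallest oldest LSN; the smaller of the two values is the consistent LSN.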
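
    A minimal sketch of the adjusted computation, assuming the Copy Buffer Pool can be viewed as a plain array of oldest LSNs (the function name and the array representation are hypothetical, chosen only for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_lsn;

/* Consistent LSN = min(oldest LSN at the FlushList head,
 *                      smallest oldest LSN in the Copy Buffer Pool). */
demo_lsn demo_consistent_lsn_with_copies(demo_lsn flush_list_head_oldest,
                                         const demo_lsn *copy_oldest,
                                         size_t ncopies)
{
    assert(ncopies == 0 || copy_oldest != NULL);
    demo_lsn result = flush_list_head_oldest;
    for (size_t i = 0; i < ncopies; i++) {
        if (copy_oldest[i] < result)
            result = copy_oldest[i];
    }
    return result;
}
```

    In the example above, if the FlushList head sits at oldest LSN 600 while the Copy Buffer Pool still holds the copy with oldest LSN 30, the consistent LSN stays at 30 until that copy is flushed.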

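    Lazy Checkpoint

    The consistent LSN in PolarDB is similar to a checkpoint. A checkpoint position in PolarDB indicates that all data before it has been written to storage, so crash recovery can start from the checkpoint, which speeds up recovery. A regular checkpoint flushes every dirty page in the Buffer Pool, along with other in-memory data, to storage. This can take a long time and generates heavy I/O in the meantime, which may affect normal workloads.

    Building on the consistent LSN, PolarDB introduces a special kind of checkpoint, the Lazy Checkpoint. It is called lazy because, unlike a regular checkpoint, it does not flush all dirty pages in the Buffer Pool; it simply uses the current consistent LSN as the checkpoint position, which makes the checkpoint much cheaper to execute.

    The idea behind Lazy Checkpoint is to replace the one-off flush of a large number of dirty pages with continuous flushing by the background flush processes, which also maintain the consistent LSN. Note that Lazy Checkpoint conflicts with PolarDB's Full Page Write feature: when Full Page Write is enabled, Lazy Checkpoint is turned off automatically.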

    + + + diff --git a/zh/theory/ddl-synchronization.html b/zh/theory/ddl-synchronization.html new file mode 100644 index 00000000000..8c827aa8942 --- /dev/null +++ b/zh/theory/ddl-synchronization.html @@ -0,0 +1,33 @@ + + + + + + + + + DDL 同步 | PolarDB for PostgreSQL + + + + +

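    DDL Synchronization

    Overview

    In the shared-storage, one-writer-multiple-readers architecture there is only one copy of the data files. Thanks to multi-versioning, reads and writes on different nodes do not conflict. Some operations, however, are not covered by multi-versioning; file operations are the most typical example.

    Multi-versioning applies only to the tuples inside a file, not to the file itself. Creating or deleting a file becomes visible to the whole cluster immediately, so an RO node may find that a file has disappeared while reading it. Some synchronization is therefore needed to prevent this.

    File operations are usually performed through DDL, so PolarDB provides a synchronization mechanism for DDL that prevents concurrent file operations. Apart from this mechanism, DDL executes exactly as it does on a single node.

    Terminology

    • LSN: Log Sequence Number, the unique identifier of a WAL record. LSNs increase monotonically across the whole system.
    • Apply LSN: the replay position of a read-only node.

    DDL Synchronization Mechanism

    DDL Lock

    The mechanism uses AccessExclusiveLock (referred to below as the DDL lock) to synchronize DDL operations between the RW and RO nodes.

    异步回放ddl锁.png
    Figure 1: Relationship between the DDL lock and WAL

    The DDL lock is the strongest table-level lock in the database and conflicts with every other lock mode. It is shipped to the RO nodes through WAL, and the WAL position at which the lock was written (the Lock LSN) can be obtained. Once an RO node has replayed past the Lock LSN, the lock is considered acquired on that node. The DDL lock is released when the transaction ends.

    As Figure 1 shows, at ApplyLSN1 the DDL lock has not been acquired yet; at ApplyLSN2 the lock is held; at ApplyLSN3 the lock has already been released.

    异步回放ddl锁.png
    Figure 2: Condition for acquiring the DDL lock

    Once every RO node has replayed past the Lock LSN (Figure 2), the RW transaction is considered to hold the lock at cluster level. Holding it means that no other session on the RW or any RO node can access the table, so the RW node can safely perform file operations on it.

    Note: a Standby node has its own file storage, so the situation described above does not arise when it acquires the lock.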
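
    The cluster-level condition can be sketched as a check over the Apply LSNs reported by the RO nodes. The function name is hypothetical, and the strict comparison is an assumption that mirrors the wording "replayed past the Lock LSN"; the real comparison in PolarDB may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_lsn;

/* True when every RO node has replayed past lock_lsn, i.e. the RW
 * transaction may treat the DDL lock as held cluster-wide.
 * With zero RO nodes the condition holds trivially. */
bool demo_ddl_lock_held_globally(const demo_lsn *ro_apply_lsn, size_t nro,
                                 demo_lsn lock_lsn)
{
    assert(nro == 0 || ro_apply_lsn != NULL);
    for (size_t i = 0; i < nro; i++) {
        if (ro_apply_lsn[i] <= lock_lsn)
            return false;   /* this RO node has not replayed the lock record yet */
    }
    return true;
}
```

    A single lagging RO node is enough to keep the RW transaction waiting, which is why replay delay on any replica directly delays DDL on the primary.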

    异步回放ddl锁.png
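    Figure 3: DDL synchronization flow

    The flow in Figure 3 is as follows:

    1. An RO session runs a query.
    2. An RW session runs DDL, acquires the DDL lock locally, writes it to WAL, and waits for all RO nodes to replay that WAL record.
    3. The replay process on each RO node tries to acquire the lock and, once it succeeds, reports its replay position back to the RW node.
    4. The RW node learns that every RO node has acquired the lock.
    5. The RW node starts the DDL operation.

    How Correctness Is Guaranteed

    The DDL lock is the strongest lock in PostgreSQL. Operations such as DROP, ALTER, LOCK, and VACUUM (FULL) on a table must acquire it first. The RW node acquires the lock in response to a user operation and writes it to WAL once acquired; the RO nodes acquire it by replaying that WAL.

    • Primary/standby setup: a hot standby serves read-only queries while replaying. When replay reaches the lock and the table is still being read, replay blocks until it times out.
    • PolarDB setup: the RW node holds the lock only after every RO node has acquired it, because DDL may proceed only when neither the RW node nor any RO node is still accessing the data on shared storage.

    Assuming that all of the following operations target the same table and that < denotes "happens before", DDL synchronization follows these rules:

    1. All local queries finish < the DDL lock is acquired locally < the DDL lock is released locally < new local queries start
    2. RW acquires the DDL lock locally < each RO acquires its local DDL lock < RW acquires the global DDL lock
    3. RW acquires the global DDL lock < RW writes data < RW releases the global DDL lock

    Combining these rules gives the overall order: queries on RW and every RO finish < RW acquires the global DDL lock < RW writes data < RW releases the global DDL lock < new queries start on RW and RO

    No query is running on the RW or RO nodes while data on shared storage is being written, so correctness is preserved. The whole procedure follows the two-phase locking (2PL) protocol, so correctness also holds when multiple tables are involved.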

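    RO Lock Replay Optimization

    The mechanism above has one problem: lock synchronization sits on the main path of primary/replica synchronization, so when lock synchronization on an RO node blocks, data synchronization on that node blocks as well (as shown in Figure 3, in stages 3 and 4 the replay process can acquire the lock only after local query sessions finish). PolarDB's default synchronization timeout is 30s, so under heavy RW load this can cause substantial replication lag.

    Replayed DDL locks can also pile up on an RO node. For example, the RW node may write 10 DDL lock records within 1s, while the RO node needs 300s to replay them all. Replication lag is dangerous for PolarDB: it prevents the RW node from flushing dirty pages and taking checkpoints in time, and if a crash occurs in that state, recovery takes longer. This is a serious stability risk.

    Asynchronous DDL Lock Replay

    PolarDB optimizes RO lock replay to address this problem.

    异步回放ddl锁.png
    Figure 4: Asynchronous DDL lock replay on RO

    The idea is to replay these locks in a separate asynchronous process so that the main replay process is not blocked.

    The overall flow is shown in Figure 4. Unlike Figure 3, the replay process offloads lock acquisition to a lock replay process and returns to the main replay flow immediately, so it is not affected by blocked lock replay.

    Lock replay conflicts are not common, so the main replay process does not offload every lock. It first tries to acquire the lock itself; only if that fails does it hand the lock over to the lock replay process. This keeps inter-process synchronization overhead low.

    This feature is enabled by default in PolarDB. It effectively reduces the replay lag caused by replay conflicts and the stability problems that follow from it. AWS Aurora does not have this feature, and latency rises sharply there when such conflicts occur.

    How Correctness Is Guaranteed

    In asynchronous replay mode only the process that acquires the lock changes; the execution logic stays the same. It is still guaranteed that no query runs between the moment the RW node acquires the global DDL lock, writes data, and releases the global DDL lock, so correctness is not affected.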

    + + + diff --git a/zh/theory/logindex.html b/zh/theory/logindex.html new file mode 100644 index 00000000000..88e87e95f5c --- /dev/null +++ b/zh/theory/logindex.html @@ -0,0 +1,33 @@ + + + + + + + + + LogIndex | PolarDB for PostgreSQL + + + + +

    LogIndex

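    Background

    PolarDB uses a shared-storage, one-writer-multiple-readers architecture: the read-write node (RW) and multiple read-only nodes (RO) share the same storage. The RW node can both read and write data on shared storage. Each RO node can only read data from shared storage and bring it up to date by replaying WAL; it cannot write, and it keeps its data consistent through memory synchronization. RO nodes can serve requests at the same time, which enables read/write splitting and load balancing, and when the RW node crashes an RO node can be promoted to RW to keep the cluster highly available. The basic architecture is shown below:

    image.png

    In a traditional shared-nothing architecture, an RO node has its own memory and storage and only needs to receive WAL from the RW node and replay it. As shown below, if a page to be replayed is not in the Buffer Pool, it must first be read from storage into the Buffer Pool, which incurs the cost of a cache miss, and continuous replay causes frequent Buffer Pool evictions.

    image.png

    In addition, transactions on the RW node run in parallel, whereas the RO node has to replay WAL serially in log order. Replay on the RO node is therefore slow, and its lag behind the RW node keeps growing.

    image.png

    Unlike the shared-nothing architecture, in the shared-storage architecture an RO node can read the WAL it needs to replay directly from shared storage. If a data page on shared storage is already up to date, the RO node can read it without any replay. Based on this, PolarDB designed LogIndex to speed up WAL replay on RO nodes.

    RO Memory Synchronization Architecture

    LogIndex stores the mapping from each data page to all LSNs that modify it. With LogIndex, all LSNs that modify a given page can be found quickly, so replay for that page can be deferred until the page is actually accessed. The RO memory synchronization architecture under LogIndex is shown below.

    image.png

    Compared with the shared-nothing architecture, the RW and RO workflows differ as follows:

    • The RW and RO nodes no longer exchange complete WAL records but only WAL Meta, which reduces network traffic and the lag between RO and RW;
    • The RW node generates LogIndex records from WAL Meta and writes them into a LogIndex Memory Table. When a LogIndex Memory Table is full it is flushed to a LogIndex Table on shared storage, after which the in-memory table can be reused;
    • The RW node uses the LogIndex Meta file to make LogIndex Memory Table I/O atomic. The LogIndex Meta file is updated after a LogIndex Memory Table has been flushed. Bloom Data is generated during the flush; it allows a quick check of whether a given page appears in a LogIndex Table, so tables that need not be scanned are skipped;
    • The RO node receives the WAL Meta sent by the RW node, generates the corresponding LogIndex records in memory, and writes them into its own LogIndex Memory Table. It also marks pages that are covered by the WAL Meta and already present in the Buffer Pool as Outdate. No actual replay and no data I/O happens at this stage, which removes the cost of cache misses;
    • Once the RO node has generated LogIndex records from WAL Meta it can advance its replay position. The actual replay is left to a background process and to the backend processes that access the pages, so the RO node can also replay WAL in parallel;
    • LogIndex Memory Tables generated on the RO node are not flushed. The RO node uses the LogIndex Meta file to determine whether a full LogIndex Memory Table has been flushed on the RW node; if so, it can be reused. When the RW node determines that a LogIndex Table on storage is no longer needed, it truncates that table.

    PolarDB reduces the lag between RW and RO by shipping only WAL Meta, and speeds up RO replay by using LogIndex for lazy and parallel replay. The following sections describe both in detail.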

    WAL Meta

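    A WAL record is also called an XLOG Record. As shown below, each XLOG Record consists of two parts:

    • General header portion: this is the XLogRecord struct, which has a fixed length. It holds general information about the XLOG Record, such as its length, the ID of the transaction that produced it, and the type of its resource manager;
    • Data portion: this is further divided into a header part and a data part. The header part contains 0 to N XLogRecordBlockHeader structs and 0 or 1 XLogRecordDataHeader[Short|Long] struct. The data part contains block data and main data. Each XLogRecordBlockHeader corresponds to one block data entry in the data part, and XLogRecordDataHeader[Short|Long] corresponds to the main data.

    wal meta.png

    In shared-storage mode, the RW and RO nodes do not need to exchange complete WAL records, only WAL Meta. WAL Meta consists of the general header portion, the header part, and the main data shown above; with it, an RO node can read the complete WAL content from shared storage. WAL Meta is transferred between RW and RO as follows:

    wal meta传输.png

    1. When a transaction on the RW node modifies data, it generates the corresponding WAL record, writes it into the WAL Buffer, and copies the matching WAL Meta into the in-memory WAL Meta queue;
    2. In synchronous streaming replication mode, transaction commit first flushes the corresponding WAL records from the WAL Buffer to disk and then wakes up the WalSender process;
    3. When the WalSender process finds new WAL to send, it reads the corresponding WAL Meta from the WAL Meta queue and sends it to the RO node over the established streaming replication connection;
    4. When the WalReceiver process on the RO node receives new data, it pushes it into the in-memory WAL Meta queue and notifies the Startup process that new WAL has arrived;
    5. The Startup process reads the meta data from the WAL Meta queue, parses it, and generates the corresponding LogIndex Memtable entries.

    Streaming replication between RW and RO does not carry the payload data, which reduces network traffic. In addition, the WalSender process on the RW node reads WAL Meta from the in-memory WAL Meta queue, and the WalReceiver process on the RO node likewise stores the received WAL Meta in an in-memory WAL Meta queue. Compared with a traditional primary/standby setup, this removes disk I/O from both sending and receiving, which speeds up the transfer and reduces the lag between RW and RO.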

    LogIndex

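    In-Memory Data Structure

    LogIndex is essentially a hash table. Its key is a PageTag, which identifies one data page, and its value is the set of all LSNs that modify that page. The in-memory structure of LogIndex is shown below. Besides the Memtable ID and the largest and smallest LSN stored in the Memtable, a LogIndex Memtable contains three arrays:

    • HashTable: records the mapping from a page to the list of LSNs that modify it. Each element of the HashTable array points to a specific LogIndex Item in the Segment array;
    • Segment: each element of the Segment array is a LogIndex Item. There are two kinds of LogIndex Item, shown in the figure as Item Head and Item Seg. An Item Head is the head of the LSN list of a page, and an Item Seg is a subsequent node of that list. The Page TAG in the Item Head records the metadata of a single page; its Next Seg and Tail Seg point to the next node and the tail node respectively. An Item Seg stores pointers to the previous node (Prev Seg) and the next node (Next Seg). The Suffix LSN stored in an Item Head or Item Seg, combined with the Prefix LSN stored in the LogIndex Memtable, forms a complete LSN, which avoids the space wasted by storing the Prefix LSN repeatedly. When different Page TAGs hash to the same HashTable slot, the Next Item field in the Item Head points to the next page with the same hash value, which resolves the collision;
    • Index Order: records the order in which LogIndex entries were added to the LogIndex Memtable. Each element of this array occupies 2 bytes. The low 12 bits are an index into the Segment array and identify a specific LogIndex Item; the high 4 bits are an index into the Suffix LSN array of that LogIndex Item and identify a specific Suffix LSN. With Index Order it is easy to obtain all LSNs inserted into the LogIndex Memtable as well as the mapping from an LSN to all pages it modifies.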
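
    The 2-byte Index Order element and the prefix/suffix split can be illustrated with a little bit arithmetic. This is a sketch only: the `demo_*` names are hypothetical, the bit positions follow the description above (segment index in the low 12 bits, suffix index in the high 4 bits), and the 32-bit/32-bit split between Prefix LSN and Suffix LSN is an assumption, since the text does not state the widths.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SEG_BITS    12u
#define DEMO_SEG_MASK    0x0FFFu   /* low 12 bits: index into the Segment array */
#define DEMO_SUFFIX_MASK 0x000Fu   /* 4 bits: index into the item's Suffix LSN array */

/* Pack one Index Order element. */
uint16_t demo_index_order_pack(unsigned seg_idx, unsigned suffix_idx)
{
    assert(seg_idx <= DEMO_SEG_MASK && suffix_idx <= DEMO_SUFFIX_MASK);
    return (uint16_t)((suffix_idx << DEMO_SEG_BITS) | seg_idx);
}

/* Which LogIndex Item in the Segment array does the element refer to? */
unsigned demo_index_order_seg(uint16_t elem)
{
    return elem & DEMO_SEG_MASK;
}

/* Which Suffix LSN inside that item does the element refer to? */
unsigned demo_index_order_suffix(uint16_t elem)
{
    return (unsigned)(elem >> DEMO_SEG_BITS) & DEMO_SUFFIX_MASK;
}

/* Rebuild a full LSN from the Memtable-wide prefix and a per-entry suffix
 * (assumed here to be 32 bits each). */
uint64_t demo_full_lsn(uint32_t prefix_lsn, uint32_t suffix_lsn)
{
    return ((uint64_t)prefix_lsn << 32) | suffix_lsn;
}
```

    Twelve bits address at most 4096 items per Memtable and four bits address at most 16 suffix slots per item, which is what makes the 2-byte encoding sufficient.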

    logindex.png

    LogIndex Memtables kept in memory are either Active or Inactive. As shown below, LogIndex records generated from WAL Meta are written into the Active LogIndex Memtable. When it is full it becomes an Inactive LogIndex Memtable and a new Active LogIndex Memtable is allocated. An Inactive LogIndex Memtable can be flushed to disk directly, and after flushing it can be turned into an Active LogIndex Memtable again.

    image.png

    On-Disk Data Structure

    The disk holds a number of LogIndex Tables. A LogIndex Table has a structure similar to that of a LogIndex Memtable and can contain 64 LogIndex Memtables. When an Inactive LogIndex Memtable is flushed, its Bloom Filter is generated at the same time. As shown below, a single Bloom Filter is 4096 bytes in size and records information about the Inactive LogIndex Memtable, such as the smallest and largest LSN it holds and the bits in the bloom filter bit array to which the pages in the Memtable map. The Bloom Filter makes it possible to decide quickly whether a page may be present in the corresponding LogIndex Table, so tables that need not be scanned are skipped and lookups are faster.
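
    For readers unfamiliar with Bloom filters, the membership test can be sketched as follows. This is a generic Bloom filter, not PolarDB's on-disk format: the hash function (seeded FNV-1a), the number of probes, and the use of all 4096 bytes as the bit array are assumptions made for illustration, and all `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BLOOM_BYTES  4096u                   /* illustrative bit-array size */
#define DEMO_BLOOM_BITS   (DEMO_BLOOM_BYTES * 8u)
#define DEMO_BLOOM_HASHES 2u                      /* hypothetical number of probes */

typedef struct {
    uint8_t bits[DEMO_BLOOM_BYTES];
} demo_bloom;

/* Seeded FNV-1a over the raw key bytes; returns a bit position. */
static uint32_t demo_bloom_bit(const uint8_t *key, size_t len, uint32_t seed)
{
    uint32_t h = 2166136261u ^ (seed * 0x9E3779B9u);
    for (size_t i = 0; i < len; i++) {
        h ^= key[i];
        h *= 16777619u;
    }
    return h % DEMO_BLOOM_BITS;
}

/* Record a page tag (given as raw bytes) in the filter. */
void demo_bloom_add(demo_bloom *bf, const uint8_t *key, size_t len)
{
    assert(bf != NULL && key != NULL);
    for (uint32_t s = 0; s < DEMO_BLOOM_HASHES; s++) {
        uint32_t bit = demo_bloom_bit(key, len, s);
        bf->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* false: the page is definitely absent, so the table can be skipped.
 * true:  the page may be present, so the table has to be scanned. */
bool demo_bloom_may_contain(const demo_bloom *bf, const uint8_t *key, size_t len)
{
    assert(bf != NULL && key != NULL);
    for (uint32_t s = 0; s < DEMO_BLOOM_HASHES; s++) {
        uint32_t bit = demo_bloom_bit(key, len, s);
        if ((bf->bits[bit / 8] & (1u << (bit % 8))) == 0)
            return false;
    }
    return true;
}
```

    The filter can report false positives but never false negatives, which is why a negative answer is enough to skip a LogIndex Table safely.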

    image.png

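    After an Inactive LogIndex Memtable has been flushed successfully, the LogIndex Meta file is updated as well. This file makes the file I/O of LogIndex Memtables atomic. As shown below, the LogIndex Meta file stores information about the smallest LogIndex Table and the largest LogIndex Memtable currently on disk, and its Start LSN records the largest LSN among all LogIndex Memtables flushed so far. If a partial write occurs while flushing a LogIndex Memtable, the system re-parses WAL starting from the Start LSN recorded in LogIndex Meta, so the LogIndex records lost in the partial write are regenerated and the I/O operation remains atomic.

    image.png

    As described in Buffer Management, every data page modified by WAL before the consistent LSN has been persisted to shared storage, so RO nodes do not need to replay WAL before that position and all LSNs in LogIndex Tables that are smaller than the consistent LSN can be removed. The RW node truncates LogIndex Tables on storage that are no longer needed on this basis, which speeds up RO replay and reduces the space occupied by LogIndex Tables.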

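    WAL Replay

    Lazy Replay

    Under LogIndex, the Startup process on an RO node generates LogIndex records from the WAL Meta it receives and marks the pages that are covered by that WAL Meta and already present in the Buffer Pool as Outdate; it can then advance the replay position. The Startup process itself does not replay WAL. Replay is performed by the background replay process and by the backend processes that actually access the pages, as shown below:

    • The background replay process replays WAL in log order. For each LSN to be replayed it searches the LogIndex Memtables and LogIndex Tables for the list of pages that LSN modifies. A page that is in the Buffer Pool is replayed; otherwise it is skipped. By advancing the pages in the Buffer Pool in LSN order, the background replay process prevents too many pending LSNs from accumulating on a single page;
    • A backend process replays only the pages it actually needs. When a backend process accesses a page that is not in the Buffer Pool, it reads the page into the Buffer Pool and then replays it; if the page is already in the Buffer Pool and marked Outdate, it replays the page to the latest version. The backend process searches the LogIndex Memtables and LogIndex Tables by Page TAG, builds the ordered list of LSNs for the page, and reads the complete WAL records from shared storage according to that list to replay the page.

    image.png

    To reduce the cost of reading WAL from disk during replay, an XLog Buffer is added to cache WAL that has been read. As shown below, WAL was originally read directly from the WAL segment file on disk. With the XLog Buffer, WAL is first looked up in the buffer. If the required WAL is not there, the corresponding WAL page is read from disk into the buffer and then copied into the readBuf of XLogReaderState; if it is already in the buffer, it is copied into the readBuf of XLogReaderState directly. This reduces the number of I/O operations during replay and speeds up replay further.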

    image.png

    Mini Transaction

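    Unlike WAL replay in a shared-nothing architecture, under LogIndex the Startup process parses WAL Meta and generates LogIndex records in parallel with backend processes replaying pages based on LogIndex, and each backend process replays only the pages it needs. A single XLog Record may modify several pages. Take an index split as an example: it modifies Page_0 and Page_1, and the two modifications form one atomic operation, that is, either both are visible or neither is. The mini transaction lock mechanism is designed to keep in-memory data structures consistent while backend processes replay pages.

    As shown below, without the mini transaction lock the Startup process parses WAL Meta and inserts the current LSN into the LSN list of each affected page in turn. Suppose the Startup process has updated the LSN list of Page_0 but not yet that of Page_1, and Backend_0 and Backend_1 access Page_0 and Page_1 respectively. Each backend replays its page according to the page's LSN list, so Page_0 is replayed to LSN_N + 1 while Page_1 is replayed only to LSN_N. The two pages in the Buffer Pool are then at inconsistent versions, which leaves the corresponding in-memory data structure inconsistent.

    image.png

    With the mini transaction lock, the updates to the LSN lists of Page_0 and Page_1 are treated as one mini transaction. Before updating the LSN list of a page, the Startup process must acquire that page's mini transaction lock (mtr lock). In the figure below it first acquires the mtr lock of Page_0; mtr locks are acquired in the same order as pages are replayed. It releases the mtr lock of Page_0 only after the LSN lists of both Page_0 and Page_1 have been updated. When a backend process replays a page based on LogIndex and that page is still part of a mini transaction in the Startup process, the backend must likewise acquire the page's mtr lock before replaying. So if the Startup process has updated the LSN list of Page_0 but not yet that of Page_1 when Backend_0 and Backend_1 access Page_0 and Page_1, Backend_0 has to wait until the LSN lists are fully updated and the mtr lock of Page_0 is released. By the time that lock is released, the LSN list of Page_1 has been updated as well, so the in-memory data structure is modified atomically.

    mini trans.png

    Summary

    Building on the fact that RW and RO nodes share storage, PolarDB designed the LogIndex mechanism to speed up memory synchronization on RO nodes, reduce the lag between RO and RW nodes, and keep RO nodes consistent and available. This article covered the design background of LogIndex, the RO memory synchronization architecture built on it, and its details. Beyond memory synchronization, LogIndex also enables Online Promote of RO nodes, which shortens the time needed to promote an RO node to RW after the RW node crashes, providing high availability for compute nodes and fast service recovery.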

    + + + diff --git a/zh/theory/polar-sequence-tech.html b/zh/theory/polar-sequence-tech.html new file mode 100644 index 00000000000..b451cfaaf54 --- /dev/null +++ b/zh/theory/polar-sequence-tech.html @@ -0,0 +1,372 @@ + + + + + + + + + Sequence 使用、原理全面解析 | PolarDB for PostgreSQL + + + + +

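    Sequence: Usage and Internals Explained

    羁鸟

    2022/08/22

    30 min

    Introduction

    A Sequence is a special table-level object in a database. Depending on the attributes the user sets, it produces a series of integers that follow a fixed rule and thus serves as a number generator.

    In practice, a never-repeating Sequence can be used as the primary key of a table, and a Sequence shared by several tables can record the total number of rows inserted into them. According to the ANSI standard, a Sequence object in a database has the following characteristics:

    1. It is an independent database object (CREATE SEQUENCE) at the same level as tables and views
    2. Its generation attributes can be configured: start value, increment, maximum/minimum value (max/min), cycling (cycle), caching (cache), and so on
    3. A Sequence object increases or decreases from its current value, and the current value is initialized to the start value
    4. With cycling enabled, the current value changes periodically; without cycling, it changes monotonically and cannot change any further once it reaches the limit

    To illustrate these characteristics, we define two sequences, a and b, and look at how they behave.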

    CREATE SEQUENCE a start with 5 minvalue -1 maxvalue 5 increment -2;
    +CREATE SEQUENCE b start with 2 minvalue 1 maxvalue 4 cycle;
    +

    The values produced by the two Sequence objects change with the number of requests as follows:

    Figure: monotonic sequence and cyclic sequence
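
    The two behaviours can also be reproduced with a small model. The sketch below is not PostgreSQL code: `demo_seq` and `demo_seq_next` are hypothetical names, overflow handling is ignored, and sequence a is modeled with the bounds [-1, 5]. A cycling ascending sequence wraps to its minimum, and a cycling descending sequence wraps to its maximum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int64_t value;       /* next value to hand out */
    int64_t increment;
    int64_t min;
    int64_t max;
    bool    cycle;
    bool    exhausted;   /* a non-cycling sequence has hit its limit */
} demo_seq;

/* Fetches the next value into *out. Returns false once a non-cycling
 * sequence has run past its limit (where a database reports an error). */
bool demo_seq_next(demo_seq *s, int64_t *out)
{
    assert(s != NULL && out != NULL);
    if (s->exhausted)
        return false;
    *out = s->value;
    int64_t next = s->value + s->increment;   /* overflow ignored in this sketch */
    if (next > s->max || next < s->min) {
        if (s->cycle)
            next = s->increment > 0 ? s->min : s->max;
        else
            s->exhausted = true;
    }
    s->value = next;
    return true;
}
```

    Initialized as `{5, -2, -1, 5, false, false}`, the model of a yields 5, 3, 1, -1 and then fails; initialized as `{2, 1, 1, 4, true, false}`, the model of b yields 2, 3, 4, 1, 2, 3, 4, and so on.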

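    Database   | Sequence support
    -----------+-----------------------------
    PostgreSQL | Supported
    Oracle     | Supported
    SQLSERVER  | Supported
    MySQL      | Auto-increment columns only
    MariaDB    | Supported
    DB2        | Supported
    Sybase     | Auto-increment columns only
    Hive       | Not supported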

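    To understand Sequence objects in PostgreSQL more deeply, we first look at how they are used and then work out the design behind them from that usage.

    Usage

    PostgreSQL offers a rich set of Sequence interfaces, as well as ways to combine Sequences with other features, to cover the needs of developers.

    SQL Interfaces

    PostgreSQL lets Sequence objects be accessed much like tables, that is, through DQL, DML, and DDL. The figure below gives an overview of the SQL interfaces.

    Figure: SQL interfaces

    The following sections describe these interfaces one by one:

    currval

    Returns the value most recently obtained by the current session from the given Sequence.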

    postgres=# select nextval('seq');
    + nextval
    +---------
    +       2
    +(1 row)
    +
    +postgres=# select currval('seq');
    + currval
    +---------
    +       2
    +(1 row)
    +

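    Note that nextval must have been called at least once in the session before this interface can be used; otherwise an error reports that the Sequence is not yet defined in the current session.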

    postgres=# select currval('seq');
    +ERROR:  currval of sequence "seq" is not yet defined in this session
    +

    lastval

    Returns the value most recently obtained by the current session from any Sequence.

    postgres=# select nextval('seq');
    + nextval
    +---------
    +       3
    +(1 row)
    +
    +postgres=# select lastval();
    + lastval
    +---------
    +       3
    +(1 row)
    +

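    Likewise, to know which Sequence object was used last, nextval('seq') has to be called once so that the session records the most recently used Sequence object in a global variable.

    lastval and currval differ only in their parameter: currval takes the name of a Sequence that has been accessed in the session, whereas lastval takes no argument and always refers to the most recently used Sequence.

    nextval

    Fetches the next value of the Sequence object.

    Calling nextval makes the database advance the current value of the Sequence object by increment, return the result, and store it as the new current value.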

    postgres=# CREATE SEQUENCE seq start with 1 increment 2;
    +CREATE SEQUENCE
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       1
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       3
    +(1 row)
    +

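    increment is the step of the Sequence object; every nextval request advances the Sequence by one step. Note that the first value obtained after a Sequence object has been created is its start value. PostgreSQL applies the following defaults to the start value:

    $$\text{start\_value} = \begin{cases} 1, & \text{if } \text{increment} > 0 \\ -1, & \text{if } \text{increment} < 0 \end{cases}$$

    In addition, nextval is a special kind of DML that is not protected by transactions: a value that has been handed out is never rolled back.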

    postgres=# BEGIN;
    +BEGIN
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       1
    +(1 row)
    +
    +postgres=# ROLLBACK;
    +ROLLBACK
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       2
    +(1 row)
    +

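    To give Sequence objects good concurrency, PostgreSQL does not update them through multi-versioning but modifies them in place. Updating without transactional protection has become common practice in almost every RDBMS that supports Sequence objects, and it is what makes a Sequence a special kind of table-level object.

    setval

    Sets the value of the Sequence object.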

    postgres=# select nextval('seq');
    + nextval
    +---------
    +       4
    +(1 row)
    +
    +postgres=# select setval('seq', 1);
    + setval
    +--------
    +      1
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       2
    +(1 row)
    +

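    This function moves the Sequence object to the given value and, by default, treats that value as already handed out. To keep the value available for the next request, pass false as the third argument.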

    postgres=# select nextval('seq');
    + nextval
    +---------
    +       4
    +(1 row)
    +
    +postgres=# select setval('seq', 1, false);
    + setval
    +--------
    +      1
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       1
    +(1 row)
    +

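    Figure: SQL interfaces

    When setval sets the value of a Sequence object, it also sets the object's is_called attribute. nextval then uses is_called to decide whether to return the value that was set. That is, if is_called is false, nextval sets is_called to true and returns the stored value instead of applying the increment.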
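
    The interplay between setval, is_called, and nextval can be condensed into a few lines. This is a simplified model, not the PostgreSQL implementation: the `demo_*` names are hypothetical, and bounds, cycling, caching, and WAL logging are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int64_t last_value;
    int64_t increment;
    bool    is_called;   /* has last_value already been handed out? */
} demo_seq_state;

/* Model of setval(seq, value, is_called). */
void demo_setval(demo_seq_state *s, int64_t value, bool is_called)
{
    assert(s != NULL);
    s->last_value = value;
    s->is_called = is_called;
}

/* Model of nextval(seq): if last_value has not been handed out yet,
 * return it as is and only flip is_called; otherwise advance first. */
int64_t demo_nextval(demo_seq_state *s)
{
    assert(s != NULL);
    if (!s->is_called)
        s->is_called = true;
    else
        s->last_value += s->increment;
    return s->last_value;
}
```

    With an increment of 1, `demo_setval(&s, 1, true)` followed by `demo_nextval(&s)` returns 2, while `demo_setval(&s, 1, false)` followed by `demo_nextval(&s)` returns 1, matching the two psql sessions shown for setval.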

    CREATE/ALTER SEQUENCE

    CREATE SEQUENCE and ALTER SEQUENCE create and modify Sequence objects. The attributes of a Sequence are also set through them. Some attributes were introduced briefly above; they are described in detail below.

    CREATE [ TEMPORARY | TEMP ] SEQUENCE [ IF NOT EXISTS ] name
    +    [ AS data_type ]
    +    [ INCREMENT [ BY ] increment ]
    +    [ MINVALUE minvalue | NO MINVALUE ] [ MAXVALUE maxvalue | NO MAXVALUE ]
    +    [ START [ WITH ] start ] [ CACHE cache ] [ [ NO ] CYCLE ]
    +    [ OWNED BY { table_name.column_name | NONE } ]
    +ALTER SEQUENCE [ IF EXISTS ] name
    +    [ AS data_type ]
    +    [ INCREMENT [ BY ] increment ]
    +    [ MINVALUE minvalue | NO MINVALUE ] [ MAXVALUE maxvalue | NO MAXVALUE ]
    +    [ START [ WITH ] start ]
    +    [ RESTART [ [ WITH ] restart ] ]
    +    [ CACHE cache ] [ [ NO ] CYCLE ]
    +    [ OWNED BY { table_name.column_name | NONE } ]
    +
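    • AS: the data type of the Sequence, which can only be smallint, int, or bigint and defaults to bigint. The type also bounds the range allowed for minvalue and maxvalue (it only bounds them and does not set them; the configured range must not exceed the range of the data type).
    • INCREMENT: the step by which nextval advances the Sequence. The default is 1.
    • MINVALUE / NO MINVALUE: sets or leaves unset the minimum value of the Sequence object. If unset, the default is 1 for an ascending sequence and the minimum of the data type for a descending sequence, for example PG_INT64_MIN (-9223372036854775808) for bigint.
    • MAXVALUE / NO MAXVALUE: sets or leaves unset the maximum value of the Sequence object. If unset, the default is the maximum of the data type for an ascending sequence and -1 for a descending sequence.
    • START: the start value of the Sequence object, which must lie between MINVALUE and MAXVALUE.
    • RESTART: with ALTER, resets the current value of the Sequence object; if no value is given, it is reset to the start value.
    • CACHE: the cache size used by the Sequence object. If not specified, it defaults to 1, which means no values are cached in advance.
    • OWNED BY: associates the Sequence object with a column of a table. When the column is dropped, the Sequence object is dropped as well.

    Sequence Rollback in a Special Case

    The following example shows a case in which a Sequence is rolled back.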

    postgres=# CREATE SEQUENCE seq;
    +CREATE SEQUENCE
    +postgres=# BEGIN;
    +BEGIN
    +postgres=# ALTER SEQUENCE seq maxvalue 10;
    +ALTER SEQUENCE
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       1
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       2
    +(1 row)
    +
    +postgres=# ROLLBACK;
    +ROLLBACK
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       1
    +(1 row)
    +

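    Unlike what was described earlier, the Sequence object here is protected by the transaction and its value is rolled back. What the transaction actually protects is ALTER SEQUENCE (DDL), not nextval (DML). The rollback restores the Sequence object to its state before ALTER SEQUENCE, which is why the sequence value appears to go back.

    DROP/TRUNCATE

    • DROP SEQUENCE: as the name says, removes the Sequence object from the database.
    • TRUNCATE: more precisely, TRUNCATE TABLE with RESTART IDENTITY restarts the Sequences owned by the table.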
    postgres=# CREATE TABLE tbl_iden (i INTEGER, j int GENERATED ALWAYS AS IDENTITY);
    +CREATE TABLE
    +postgres=# insert into tbl_iden values (100);
    +INSERT 0 1
    +postgres=# insert into tbl_iden values (1000);
    +INSERT 0 1
    +postgres=# select * from tbl_iden;
    +  i   | j
    +------+---
    +  100 | 1
    + 1000 | 2
    +(2 rows)
    +
    +postgres=# TRUNCATE TABLE tbl_iden RESTART IDENTITY;
    +TRUNCATE TABLE
    +postgres=# insert into tbl_iden values (1234);
    +INSERT 0 1
    +postgres=# select * from tbl_iden;
    +  i   | j
    +------+---
    + 1234 | 1
    +(1 row)
    +

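    This is equivalent to running ALTER SEQUENCE RESTART while truncating the table.

    Using Sequences with Other Features

    Besides being used as a standalone object, a Sequence can be combined with other PostgreSQL features. The most common scenarios are summarized below.

    Combined Calls

    Explicit call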

    CREATE SEQUENCE seq;
    +CREATE TABLE tbl (i INTEGER PRIMARY KEY);
    +INSERT INTO tbl (i) VALUES (nextval('seq'));
    +SELECT * FROM tbl ORDER BY 1 DESC;
    +   tbl
    +---------
    +       1
    +(1 row)
    +

    Trigger call

    CREATE SEQUENCE seq;
    +CREATE TABLE tbl (i INTEGER PRIMARY KEY, j INTEGER);
    +CREATE FUNCTION f()
    +RETURNS TRIGGER AS
    +$$
    +BEGIN
    +NEW.i := nextval('seq');
    +RETURN NEW;
    +END;
    +$$
    +LANGUAGE 'plpgsql';
    +
    +CREATE TRIGGER tg
    +BEFORE INSERT ON tbl
    +FOR EACH ROW
    +EXECUTE PROCEDURE f();
    +
    +INSERT INTO tbl (j) VALUES (4);
    +
    +SELECT * FROM tbl;
    + i | j
    +---+---
    + 1 | 4
    +(1 row)
    +

    DEFAULT call

    Explicit DEFAULT call:

    CREATE SEQUENCE seq;
    +CREATE TABLE tbl(i INTEGER DEFAULT nextval('seq') PRIMARY KEY, j INTEGER);
    +
    +INSERT INTO tbl (i,j) VALUES (DEFAULT,11);
    +INSERT INTO tbl(j) VALUES (321);
    +INSERT INTO tbl (i,j) VALUES (nextval('seq'),1);
    +
    +SELECT * FROM tbl;
    + i |  j
    +---+-----
    + 2 | 321
    + 1 |  11
    + 3 |   1
    +(3 rows)
    +

    SERIAL call:

    CREATE TABLE tbl (i SERIAL PRIMARY KEY, j INTEGER);
    +INSERT INTO tbl (i,j) VALUES (DEFAULT,42);
    +
    +INSERT INTO tbl (j) VALUES (25);
    +
    +SELECT * FROM tbl;
    + i | j
    +---+----
    + 1 | 42
    + 2 | 25
    +(2 rows)
    +

    Note that SERIAL is not a real type but another form of the DEFAULT call; the difference is that SERIAL automatically creates the Sequence used by the DEFAULT constraint.

    AUTO_INC call

    CREATE TABLE tbl (i int GENERATED ALWAYS AS IDENTITY,
    +                  j INTEGER);
    +INSERT INTO tbl(i,j) VALUES (DEFAULT,32);
    +
    +INSERT INTO tbl(j) VALUES (23);
    +
    +SELECT * FROM tbl;
    + i | j
    +---+----
    + 1 | 32
    + 2 | 23
    +(2 rows)
    +

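    An AUTO_INC call attaches an identity constraint to the column. Unlike a default constraint, the identity constraint finds the Sequence associated with the column by looking up its dependency, whereas a DEFAULT call merely sets the default value to a nextval expression.

    Internals

    How a Sequence Is Described in the System Catalog and in Its Data Table

    PostgreSQL has a system catalog dedicated to Sequence information, pg_sequence. Its structure is as follows: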

    postgres=# \d pg_sequence
    +             Table "pg_catalog.pg_sequence"
    +    Column    |  Type   | Collation | Nullable | Default
    +--------------+---------+-----------+----------+---------
    + seqrelid     | oid     |           | not null |
    + seqtypid     | oid     |           | not null |
    + seqstart     | bigint  |           | not null |
    + seqincrement | bigint  |           | not null |
    + seqmax       | bigint  |           | not null |
    + seqmin       | bigint  |           | not null |
    + seqcache     | bigint  |           | not null |
    + seqcycle     | boolean |           | not null |
    +Indexes:
    +    "pg_sequence_seqrelid_index" PRIMARY KEY, btree (seqrelid)
    +

    pg_sequence records all attributes of a Sequence. These attributes are set by CREATE/ALTER SEQUENCE, and nextval and setval consult this catalog frequently to apply them.

    The sequence data itself is stored in a heap table with three columns, laid out as follows:

    typedef struct FormData_pg_sequence_data
    +{
    +    int64		last_value;
    +    int64		log_cnt;
    +    bool		is_called;
    +} FormData_pg_sequence_data;
    +
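    • last_value records the current value of the Sequence. We call it the page value to distinguish it from the cached value introduced later.
    • log_cnt records how many additional fetches nextval has logged in WAL ahead of time. This is covered in detail in the section on how values are handed out.
    • is_called indicates whether the Sequence's last_value has already been handed out. For example, setval can set the is_called field: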
    -- setval false
    +postgres=# select setval('seq', 10, false);
    + setval
    +--------
    +     10
    +(1 row)
    +
    +postgres=# select * from seq;
    + last_value | log_cnt | is_called
    +------------+---------+-----------
    +         10 |       0 | f
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +      10
    +(1 row)
    +
    +-- setval true
    +postgres=# select setval('seq', 10, true);
    + setval
    +--------
    +     10
    +(1 row)
    +
    +postgres=# select * from seq;
    + last_value | log_cnt | is_called
    +------------+---------+-----------
    +         10 |       0 | t
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +      11
    +(1 row)
    +

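    Whenever a user creates a Sequence object, PostgreSQL creates a heap table with the structure above to hold the data of that Sequence. When nextval or setval changes the sequence value, PostgreSQL updates the three fields of the single row in this heap table in place.

    Taking setval as an example, the following logic shows how the in-place update works.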

    static void
    +do_setval(Oid relid, int64 next, bool iscalled)
    +{
    +
    +    /* Open the Sequence heap table and lock it */
    +    init_sequence(relid, &elm, &seqrel);
    +
    +    ...
    +
    +    /* Lock the buffer and fetch the tuple */
    +    seq = read_seq_tuple(seqrel, &buf, &seqdatatuple);
    +
    +    ...
    +
    +    /* Update the tuple in place */
    +    seq->last_value = next;		/* last fetched number */
    +    seq->is_called = iscalled;
    +    seq->log_cnt = 0;
    +
    +    ...
    +
    +    /* Release the buffer lock and close the relation */
    +    UnlockReleaseBuffer(buf);
    +    relation_close(seqrel, NoLock);
    +}
    +

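    As shown, do_setval writes directly to the single tuple in the Sequence heap table instead of updating it through delete plus insert as an ordinary heap table would. nextval works in a similar way, except that last_value is computed rather than supplied by the user.

    How Sequence Values Are Handed Out

    Now that the in-kernel representation of a Sequence object is clear, the next question is how a sequence value is handed out, that is, how nextval works. It is implemented by the nextval_internal function in sequence.c, whose core job is to compute last_value and log_cnt.

    The relationship between last_value and log_cnt is shown below:

    Figure: relationship between the page value and WAL

    log_cnt is the number of fetches reserved in advance. It defaults to 32, as defined by the following macro: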

    /*
    + * We don't want to log each fetching of a value from a sequence,
    + * so we pre-log a few fetches in advance. In the event of
    + * crash we can lose (skip over) as many values as we pre-logged.
    + */
    +#define SEQ_LOG_VALS	32
    +

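    Each time last_value is advanced by one increment, log_cnt is decremented by 1.

    Figure: page value advancing

    When log_cnt reaches 0, or after a checkpoint has occurred, a WAL record is written. The page value stored in that WAL record is set by the following formula, and log_cnt is reset to SEQ_LOG_VALS.

    $$\text{wal\_value} = \text{last\_value} + \text{increment} \times \text{SEQ\_LOG\_VALS}$$

    In this way PostgreSQL does not have to write WAL every time nextval changes last_value in the page. If every nextval call modifies the page value, this optimization reduces the frequency of WAL writes by a factor of about 32. The price is that a range of values is lost if a crash occurs before a checkpoint has been taken, as the following example shows: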

    postgres=# create sequence seq;
    +CREATE SEQUENCE
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +       1
    +(1 row)
    +
    +postgres=# select * from seq;
    + last_value | log_cnt | is_called
    +------------+---------+-----------
    +          1 |      32 | t
    +(1 row)
    +
    +-- crash and restart
    +
    +postgres=# select * from seq;
    + last_value | log_cnt | is_called
    +------------+---------+-----------
    +         33 |       0 | t
    +(1 row)
    +
    +postgres=# select nextval('seq');
    + nextval
    +---------
    +      34
    +(1 row)
    +

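    After the crash, the Sequence object has a gap covering 2 to 33. This cost is acceptable because the Sequence still guarantees uniqueness, and in such workloads the frequency of WAL writes drops considerably.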
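
    The pre-logging scheme and the resulting gap can be reproduced with a simplified model. This is a sketch under stated assumptions rather than the logic of nextval_internal: the `demo_*` names are hypothetical, the cache size is taken to be 1, bounds, cycling, and checkpoints are ignored, and recovery is modeled as restoring the page from the most recent WAL record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SEQ_LOG_VALS 32   /* mirrors SEQ_LOG_VALS */

typedef struct {
    int64_t last_value;   /* page value */
    int64_t log_cnt;      /* fetches still covered by the last WAL record */
    bool    is_called;
    int64_t increment;
    int64_t wal_value;    /* last_value stored in the most recent WAL record */
} demo_logged_seq;

int64_t demo_logged_nextval(demo_logged_seq *s)
{
    assert(s != NULL);
    bool first = !s->is_called;
    if (first)
        s->is_called = true;             /* hand out the start value itself */
    else
        s->last_value += s->increment;

    if (first || s->log_cnt == 0) {
        /* Write WAL: log a value SEQ_LOG_VALS steps ahead of the page. */
        s->wal_value = s->last_value + s->increment * DEMO_SEQ_LOG_VALS;
        s->log_cnt = DEMO_SEQ_LOG_VALS;
    } else {
        s->log_cnt--;                    /* still covered, no WAL needed */
    }
    return s->last_value;
}

/* Crash recovery: the page comes back with the value found in WAL. */
void demo_crash_recover(demo_logged_seq *s)
{
    assert(s != NULL);
    s->last_value = s->wal_value;
    s->log_cnt = 0;
    s->is_called = true;
}
```

    Starting from `{1, 0, false, 1, 0}`, one call returns 1 and leaves log_cnt at 32; after `demo_crash_recover` the page value is 33 and the next call returns 34, which reproduces the gap from 2 to 33.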

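    Sequence Cache

    From the description so far it follows that every request for a sequence value has to take the buffer lock to modify the page, which means the concurrency of a Sequence is rather poor.

    To address this, PostgreSQL uses a Session Cache that caches a range of values in advance to improve concurrency, as shown below:

    Figure: Session Cache

    The Sequence Session Cache is implemented as a hash table created with 16 entries. It is looked up by the OID of the Sequence to find the values already cached for it. The cached value has the following structure: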

    typedef struct SeqTableData
    +{
    +    Oid			relid;			/* Sequence OID(hash key) */
    +    int64		last;			/* value last returned by nextval */
    +    int64		cached;			/* last value already cached for nextval */
    +    int64		increment;		/* copy of sequence's increment field */
    +} SeqTableData;
    +

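    Here last is the current value of the Sequence in the session (current_value), cached is the last value cached for the Sequence in the session (cached_value), and increment is the step. These three values are enough to implement the cache.

    The relationship between the Sequence Session Cache and the page value is shown below:

    Figure: relationship between the cache and the page

    Similar to log_cnt, cache_cnt is the cache size the user set when defining the Sequence; its minimum is 1. Only when the values in the cache domain have been used up is the buffer locked and the page value of the Sequence changed. The adjustment is shown below:

    Figure: requesting a new cache range

    For example, if CACHE is set to 20, then once the cache is used up, the session locks the buffer to adjust the page value and reserves another 20 increments in the cache. For the figure above, the following relationships hold:

    $$\text{cached\_value} = \text{NEW current\_value}$$
    $$\text{NEW current\_value} + 20 \times \text{INC} = \text{NEW cached\_value}$$
    $$\text{NEW last\_value} = \text{NEW cached\_value}$$

    With the Sequence Session Cache, the concurrency of nextval improves considerably. The comparison below shows the results of a pgbench stress test.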

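    Figure: performance comparison

    Summary

    A Sequence in PostgreSQL is a special kind of table-level object with a simple yet rich SQL interface that makes it easy to create and use customized sequence objects. A Sequence can also be combined with many other features in the kernel, which broadens its range of uses considerably.

    This article described the design of Sequence objects in the PostgreSQL kernel. Starting from the metadata and the data of the object, it explained what a Sequence object consists of. It then covered nextval, the core SQL interface of a Sequence, in terms of value computation, in-place update, and reduced WAL writes. Finally, it explained how the Sequence Session Cache works, how values in the cache and in the page are computed and aligned once the cache is introduced, and how nextval performs with and without the cache under single-sequence and multi-sequence concurrent workloads.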

    + + +