From caab1ea5be934b854b476ddec5ca6ed867d6e3e3 Mon Sep 17 00:00:00 2001
From: zjx990 <66360103+zjx990@users.noreply.github.com>
Date: Thu, 14 Sep 2023 00:34:55 +0800
Subject: [PATCH] Update FAQ document about service fake death (#2262)

* Update FAQ document about service fake death

---------

Co-authored-by: zhaojiaxu3
---
 docs/content/faq/_index.cn.md | 19 +++++++++++++++++++
 docs/content/faq/_index.en.md | 17 +++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/docs/content/faq/_index.cn.md b/docs/content/faq/_index.cn.md
index 8d54892615..18cd5dbe09 100644
--- a/docs/content/faq/_index.cn.md
+++ b/docs/content/faq/_index.cn.md
@@ -129,3 +129,22 @@ Mesos 相关请参考 [Apache Mesos](https://mesos.apache.org/)。
 1. 指定网卡 eno1:`-Delasticjob.preferred.network.interface=eno1`。
 1. 指定IP地址 192.168.0.100:`-Delasticjob.preferred.network.ip=192.168.0.100`。
 1. 泛指IP地址(正则表达式) 192.168.*:`-Delasticjob.preferred.network.ip=192.168.*`。
+
+## 15. zk授权升级,在滚动部署过程中出现实例假死,回退到历史版本也依然存在假死。
+
+回答:
+
+在滚动部署过程中,会触发竞争选举leader,有密码的实例会给zk目录加密导致无密码的实例不可访问,最终导致整体选举阻塞。
+
+例如:
+
+通过日志可以发现会抛出-102异常:
+
+```bash
+xxxx-07-27 22:33:55.224 [DEBUG] [localhost-startStop-1-EventThread] [] [] [] - o.a.c.f.r.c.TreeCache : processResult: CuratorEventImpl{type=GET_DATA, resultCode=-102, path='/xxx/leader/election/latch/_c_bccccdcc-1134-4e0a-bb52-59a13836434a-latch-0000000047', name='null', children=null, context=null, stat=null, data=null, watchedEvent=null, aclList=null}
+```
+
+解决方案:
+
+1.如果您在升级的过程中出现回退历史版本也依然假死的问题,建议删除zk上所有作业目录,之后再重启历史版本。
+2.计算出合理的作业执行间隙,比如晚上21:00-21:30作业不会触发,在此期间先将实例全部停止,然后将带密码的版本全部部署上线。
\ No newline at end of file
diff --git a/docs/content/faq/_index.en.md b/docs/content/faq/_index.en.md
index 20d32efe95..976d2b8860 100644
--- a/docs/content/faq/_index.en.md
+++ b/docs/content/faq/_index.en.md
@@ -128,3 +128,20 @@ For example
 1. specify the interface eno1: `-Delasticjob.preferred.network.interface=eno1`.
 1. specify network addresses, 192.168.0.100: `-Delasticjob.preferred.network.ip=192.168.0.100`.
 1. specify network addresses for regular expressions, 192.168.*: `-Delasticjob.preferred.network.ip=192.168.*`.
+
+## 15. During a zk authorization upgrade, instances hang (fake death) during the rolling deployment, and they still hang after rolling back to the previous version.
+
+Answer:
+
+A rolling deployment triggers a leader election. Instances configured with a password restrict access to the zk directory through ACLs, so instances without the password can no longer access it, and the election as a whole ends up blocked.
+
+For example:
+
+The logs show that a -102 exception (ZooKeeper's `NOAUTH` result code) is thrown:
+
+```bash
+xxxx-07-27 22:33:55.224 [DEBUG] [localhost-startStop-1-EventThread] [] [] [] - o.a.c.f.r.c.TreeCache : processResult: CuratorEventImpl{type=GET_DATA, resultCode=-102, path='/xxx/leader/election/latch/_c_bccccdcc-1134-4e0a-bb52-59a13836434a-latch-0000000047', name='null', children=null, context=null, stat=null, data=null, watchedEvent=null, aclList=null}
+```
+
+1. If instances still hang after you roll back to the previous version during the upgrade, delete all job directories on zk and then restart the previous version.
+2. Work out a window in which no job is scheduled to trigger, for example 21:00 to 21:30 in the evening. During that window, stop all instances first, and then deploy the password-enabled version to all of them.
\ No newline at end of file
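
The `resultCode=-102` in the FAQ's log sample corresponds to ZooKeeper's `NOAUTH` code. As a rough diagnostic aid, the Python sketch below scans log lines for Curator events with a non-zero result code and reports the affected paths. The helper name `find_failed_events`, the regular expression, and the code table are illustrative assumptions and are not part of ElasticJob or Curator; the table lists only a few well-known ZooKeeper codes.

```python
import re

# A small, incomplete subset of ZooKeeper result codes (assumption: enough
# for a first triage; extend as needed). -102 is NOAUTH.
ZK_CODES = {
    0: "OK",
    -4: "CONNECTIONLOSS",
    -101: "NONODE",
    -102: "NOAUTH",
    -112: "SESSIONEXPIRED",
}

# Matches the "resultCode=..., path='...'" fragment of a CuratorEventImpl log entry.
EVENT_RE = re.compile(r"resultCode=(-?\d+), path='([^']*)'")


def find_failed_events(lines):
    """Yield (code, code_name, path) for every event whose resultCode is non-zero."""
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue
        code = int(match.group(1))
        if code != 0:
            yield code, ZK_CODES.get(code, "UNKNOWN"), match.group(2)


# Shortened copy of the log line quoted in the FAQ entry.
sample = (
    "o.a.c.f.r.c.TreeCache : processResult: CuratorEventImpl{type=GET_DATA, "
    "resultCode=-102, path='/xxx/leader/election/latch/"
    "_c_bccccdcc-1134-4e0a-bb52-59a13836434a-latch-0000000047', name='null'}"
)
events = list(find_failed_events([sample]))
for code, name, path in events:
    print(f"{code} ({name}): {path}")
```

Repeated `NOAUTH` hits on paths under `leader/election/latch` would be consistent with the mixed password/no-password deployment that the FAQ entry describes.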