Update FAQ document about service fake death (#2262)

* Update FAQ document about service fake death

---------

Co-authored-by: zhaojiaxu3 <[email protected]>
apache · Sep 13, 2023 · caab1ea · caab1ea
1 parent 095f558
commit caab1ea
Showing 2 changed files with 36 additions and 0 deletions.
diff --git a/docs/content/faq/_index.cn.md b/docs/content/faq/_index.cn.md
@@ -129,3 +129,22 @@ For Mesos, please refer to [Apache Mesos](https://mesos.apache.org/).
 1. Specify the network interface eno1: `-Delasticjob.preferred.network.interface=eno1`.
 1. Specify the IP address 192.168.0.100: `-Delasticjob.preferred.network.ip=192.168.0.100`.
 1. Specify IP addresses by regular expression, 192.168.*: `-Delasticjob.preferred.network.ip=192.168.*`.
+
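+## 15. During a zk authorization upgrade, instances hang ("fake death") in the rolling deployment, and the hang persists even after rolling back to the previous version.
+
+Answer:
+
+A rolling deployment triggers competing leader elections. Instances configured with a password restrict access to the zk directory, so instances without the password can no longer read it, and the election as a whole ends up blocked.
+
+For example:
+
+The logs show events with result code -102 (ZooKeeper's NOAUTH error):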
+
+```bash
+xxxx-07-27 22:33:55.224 [DEBUG] [localhost-startStop-1-EventThread] [] [] [] - o.a.c.f.r.c.TreeCache : processResult: CuratorEventImpl{type=GET_DATA, resultCode=-102, path='/xxx/leader/election/latch/_c_bccccdcc-1134-4e0a-bb52-59a13836434a-latch-0000000047', name='null', children=null, context=null, stat=null, data=null, watchedEvent=null, aclList=null}
+```
+
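+Solutions:
+
+1. If instances still hang after you roll back to the previous version during the upgrade, delete all job directories on zk and then restart the previous version.
+2. Work out a window in which no job is scheduled to fire, for example 21:00 to 21:30 in the evening. During that window, stop all instances first, then deploy the password-enabled version to all of them.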
diff --git a/docs/content/faq/_index.en.md b/docs/content/faq/_index.en.md
@@ -128,3 +128,20 @@ For example
 1. specify the interface eno1: `-Delasticjob.preferred.network.interface=eno1`.
 1. specify network addresses, 192.168.0.100: `-Delasticjob.preferred.network.ip=192.168.0.100`.
 1. specify network addresses for regular expressions, 192.168.*: `-Delasticjob.preferred.network.ip=192.168.*`.
+
+## 15. During a zk authorization upgrade, instances hang ("fake death") in the rolling deployment, and the hang persists even after rolling back to the previous version.
+
+Answer:
+
+A rolling deployment triggers competing leader elections. Instances configured with a password restrict access to the zk directory, so instances without the password can no longer read it, and the election as a whole ends up blocked.
+
+For example:
+
+The logs show events with result code -102 (ZooKeeper's NOAUTH error):
+
+```bash
+xxxx-07-27 22:33:55.224 [DEBUG] [localhost-startStop-1-EventThread] [] [] [] - o.a.c.f.r.c.TreeCache : processResult: CuratorEventImpl{type=GET_DATA, resultCode=-102, path='/xxx/leader/election/latch/_c_bccccdcc-1134-4e0a-bb52-59a13836434a-latch-0000000047', name='null', children=null, context=null, stat=null, data=null, watchedEvent=null, aclList=null}
+```
+
+1. If instances still hang after you roll back to the previous version during the upgrade, delete all job directories on zk and then restart the previous version.
+2. Work out a window in which no job is scheduled to fire, for example 21:00 to 21:30 in the evening. During that window, stop all instances first, then deploy the password-enabled version to all of them.
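
The -102 result code in the log above corresponds to ZooKeeper's `KeeperException.Code.NOAUTH`. As a rough aid for confirming this failure mode, here is a minimal Java sketch that scans log lines for that code and collects the affected znode paths. The class `NoAuthLogScanner` and its method are hypothetical helpers, not part of ElasticJob or Curator, and the regular expression assumes the `CuratorEventImpl{...}` layout shown in the sample log line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of ElasticJob or Curator.
final class NoAuthLogScanner {

    // Integer value of ZooKeeper's KeeperException.Code.NOAUTH.
    static final int NOAUTH = -102;

    // Assumes the "resultCode=<int>, path='<znode>'" layout of a CuratorEventImpl log entry.
    private static final Pattern EVENT =
            Pattern.compile("resultCode=(-?\\d+), path='([^']*)'");

    /** Returns the znode paths of all log lines whose resultCode is NOAUTH. */
    static List<String> findNoAuthPaths(List<String> logLines) {
        List<String> paths = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = EVENT.matcher(line);
            if (m.find() && Integer.parseInt(m.group(1)) == NOAUTH) {
                paths.add(m.group(2));
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Sample lines shaped like the log excerpt; the paths are placeholders.
        List<String> sample = List.of(
                "... CuratorEventImpl{type=GET_DATA, resultCode=-102, path='/xxx/leader/election/latch/n1', name='null'}",
                "... CuratorEventImpl{type=GET_DATA, resultCode=0, path='/xxx/servers', name='null'}");
        findNoAuthPaths(sample).forEach(System.out::println);
    }
}
```

If the reported paths sit under a job's `leader/election` node, the hang is likely the access conflict described in this FAQ entry rather than an unrelated connectivity problem.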