During restoration, the storaged or metad service fails to start with the following error: [db/db_impl/db_impl_open.cc:2112] DB::Open() failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: /xxx/000024.ldb: No such file or directory in file /xxx/MANIFEST-000020 #5976

cccxgit · 2024-11-17T08:49:12Z

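nebula version: 3.6
Deployment: distributed on k8s
Installation method: compiled from source
In production: Y
Hardware
- HDD (mechanical disk)
- 6U15G
Detailed description of the problem
I have built a k8s-based distributed nebula cluster in production, with 15 graph spaces created and 500 million vertices and edges imported. With the services running normally, I run create snapshot to back up the data. When restoring the metad and storaged services from that backup, storaged or metad intermittently fails to start, with the following error: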

2024/11/11-21:38:38.656812 140578633020992 [WARN] [db/db_impl/db_impl_open.cc:2112] DB::Open() failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: /xxx/000024.ldb: No such file or directory in file /xxx/MANIFEST-000020
2024/11/11-21:38:38.656835 140578633020992 [db/db_impl/db_impl.cc:477] Shutdown: canceling all background work
2024/11/11-21:38:38.656893 140578633020992 [db/db_impl/db_impl.cc:677] Shutdown complete
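
As a first diagnostic step, the missing file and the MANIFEST named in the warning can be pulled out of the log line and checked on disk. The following is a minimal sketch, not part of nebula or rocksdb: the helper names `parse_missing_file` and `describe` are hypothetical, and the regular expression assumes the message keeps exactly the wording shown in the log above.

```python
import os
import re

# Assumption: the message format matches the warning quoted in this issue.
MISSING_RE = re.compile(
    r"While open a file for random read: (?P<missing>\S+): "
    r"No such file or directory in file (?P<manifest>\S+)"
)


def parse_missing_file(log_line):
    """Return (missing_file, manifest) from a DB::Open() warning, or None."""
    match = MISSING_RE.search(log_line)
    if match is None:
        return None
    return match.group("missing"), match.group("manifest")


def describe(log_line):
    """Report whether the referenced files exist and what the directory holds."""
    parsed = parse_missing_file(log_line)
    if parsed is None:
        return None
    missing, manifest = parsed
    data_dir = os.path.dirname(manifest)
    return {
        "missing": missing,
        "manifest": manifest,
        "missing_exists": os.path.exists(missing),
        "manifest_exists": os.path.exists(manifest),
        # Directory listing helps show which numbered files are present.
        "files_in_dir": sorted(os.listdir(data_dir)) if os.path.isdir(data_dir) else [],
    }
```

Running `describe` on the first log line on each failing node shows whether the file the MANIFEST references is really absent from the data directory or only became visible after the service had already tried to open the database.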

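Other information:
(1) Across repeated restore tests, storaged fails to start more often than metad.
(2) This cluster does not use bragent for backup and restore; it uses a scheme we developed ourselves. The restore procedure is: 1. download the compressed snapshot backup files from the remote storage machine into the storaged and metad containers; 2. stop the storaged and metad services with nebula.service stop, then extract the snapshot files into the configured data directory of storaged or metad (storaged has multiple graph spaces and therefore multiple snapshot archives, which are extracted in parallel by multiple threads); 3. start the services whose extraction has finished with nebula.service start (services are started per node: once both storaged and metad on a node have finished extracting, they are started together).
(3) Machine performance differs between nodes, so the services start at different times; the gap can be around 10 mins.
(4) I have verified that the snapshot files downloaded from the remote storage machine are not corrupted (md5 checksums match), and none of the extractions failed.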
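
Since the md5 check covers only the downloaded archive, one further check is to compare each archive's contents against the data directory after extraction and before nebula.service start. The sketch below assumes the snapshot archives are tar files (the description above only says "compressed"), and `missing_after_extract` is a hypothetical helper name, not something provided by nebula.

```python
import os
import tarfile


def missing_after_extract(archive_path, dest_dir):
    """List regular files from the archive that are absent or truncated on disk.

    Assumption: the snapshot archive is a tar file (optionally compressed)
    that was extracted with dest_dir as its root.
    """
    problems = []
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links and other special entries
            target = os.path.join(dest_dir, member.name)
            if not os.path.isfile(target):
                problems.append((member.name, "missing"))
            elif os.path.getsize(target) != member.size:
                problems.append((member.name, "size mismatch"))
    return problems
```

An empty result for every archive on a node, obtained only after all parallel extraction threads have finished, would rule out a file such as 000024.ldb being absent at the moment the service is started.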
Related issues found:
(1) The rocksdb GitHub community has several issues describing similar problems, all of them still open:

https://github.com/facebook/rocksdb/issues/10258
https://github.com/facebook/rocksdb/issues/10357

(2) At the bottom of the facebook/rocksdb#10357 thread there appears to be a solution. Please help analyze this, thanks.
