
Fantom Sonic Mainnet Archive node gets corrupted DB #167

Open
tibineacsu95 opened this issue Jul 6, 2024 · 12 comments

Comments

@tibineacsu95

tibineacsu95 commented Jul 6, 2024

Describe the bug
Fantom Sonic Mainnet Archive node gets corrupted DB.

To Reproduce
Steps to reproduce the behavior:

  1. Set up an Archive node using the recommended steps.
  2. Use a snapshot (https://files.fantom.network/mainnet-284692.tar.gz) to avoid syncing from scratch and reduce sync time (a shell sketch of this follows the list).
  3. Node starts syncing from the point the snapshot is in.
  4. Database gets corrupted after a couple of hours:
    sonicd[288420]: failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: gossip: DE
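
For reference, a rough shell sketch of steps 2–3 (the snapshot URL is the one above; the data directory path and the --datadir flag are assumptions, adjust them to your setup):

wget https://files.fantom.network/mainnet-284692.tar.gz
tar -xzf mainnet-284692.tar.gz -C /var/lib/sonic    # extract the snapshot into the node's data directory (assumed path)
sonicd --datadir /var/lib/sonic                     # start the archive node from the snapshot state (flag name assumed)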

Expected behavior
Node is able to sync properly, without getting its DB corrupted.

Desktop (please complete the following information):

  • OS: Ubuntu 22.04.4 LTS
  • Version: Sonicd 1.2.1-b

Additional context
Not quite sure how to mitigate this. It's the second time we've run into this on two different nodes; we also changed machines, thinking it might be a local storage problem.

We are using systemd, here is the service file:
(screenshot of the systemd service file attached in the original issue)
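
Since the screenshot does not come through here, a minimal sketch of a comparable unit file (the binary path and user are assumptions; the GOMEMLIMIT, --cache, Restart and timeout values are the ones mentioned elsewhere in this thread, not the actual contents of the screenshot):

[Unit]
Description=Sonic mainnet archive node
After=network-online.target

[Service]
# assumed service user and binary path; GOMEMLIMIT / --cache values are from later comments in this thread
User=sonic
Environment=GOMEMLIMIT=116GiB
ExecStart=/usr/local/bin/sonicd --cache 51200
# avoid an automatic restart into an already-corrupted DB, and allow a long graceful stop
Restart=no
TimeoutStopSec=600

[Install]
WantedBy=multi-user.target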

Any feedback is highly appreciated!

@blockpi019

We had the same problem too

@janzhanal

Same issue with 1.2.1-d

@thaarok
Collaborator

thaarok commented Jul 8, 2024

The error message implies your node cannot start because its database is corrupted. It probably crashed or was killed at some point, and systemd restarted it automatically. The message you describe is produced by that subsequent run, which fails to start because the DB is already corrupted.

Can you provide logs from the original crash? They are necessary to understand what happened. Thanks!
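
For reference, if the node runs under systemd, the logs around the crash can usually be pulled with journalctl (the unit name sonic.service is taken from later in this thread; the time window is just an example):

journalctl -u sonic.service --since "2024-07-06 00:00" --no-pager
journalctl -k --since "2024-07-06" | grep -iE "out of memory|killed process"    # kernel log, to check for an OOM kill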

@tibineacsu95
Author

tibineacsu95 commented Jul 8, 2024

It looks like an OOM kill.

Jul 07 18:28:09 sonic01 systemd[1]: sonic.service: A process of this unit has been killed by the OOM killer.
Jul 07 18:28:17 sonic01 systemd[1]: sonic.service: Main process exited, code=killed, status=9/KILL
Jul 07 18:28:17 sonic01 systemd[1]: sonic.service: Failed with result 'oom-kill'.
Jul 07 18:28:17 sonic01 systemd[1]: sonic.service: Consumed 1d 23h 57min 8.615s CPU time.

This is from a freshly synced node started two days ago from scratch. This time I used Restart=no in the service file. If I try to start the service back up, I get the same message (sonicd[288420]: failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: gossip: DE), although I was hoping the DB wouldn't get corrupted this time, since the service never actually restarted.

Each parameter used for the service is tuned based on the server specs (128 GB RAM and 32 CPUs):

  • GOMEMLIMIT=116GiB
  • --cache 51200

Any advice here? Should I set the values for the limit and the cache lower? And if so, what would be the suitable numbers for this spec? Thanks in advance!
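
For context, those two values work out to roughly 90% of RAM for GOMEMLIMIT and 40% for the cache (assuming --cache is specified in MB, which is an assumption):

GOMEMLIMIT: 128 GB x 0.90 ≈ 115 GB  -> GOMEMLIMIT=116GiB
--cache:    128 GB x 0.40 ≈ 51.2 GB -> --cache 51200    (51200 MB)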

@janzhanal

I'm sorry, I'm running in a container and the logs got flushed. But it means SIGTERM, and if it didn't stop within 10 seconds, SIGKILL.

Is there any way to fix the corruption? Unclean shutdowns happen even in production environments, and having to wait multiple days for the archive genesis to be processed is really painful...
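
For what it's worth, the 10 second default can be raised in a container setup so the node gets time to flush before the SIGKILL arrives (a compose sketch; the service name is illustrative):

services:
  sonicd:
    stop_grace_period: 600s   # default is 10s; SIGKILL is only sent after this period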

@tibineacsu95
Author

tibineacsu95 commented Jul 10, 2024

Just for reference: I updated to 1.2.1-d and tried lowering GOMEMLIMIT to 70% of the machine's total RAM (the docs say 90% should be fine) and --cache to 25% (the docs say 40% should be fine), but the crashes still occur with the same outcome, an OOM kill.

Link to docs - https://docs.fantom.foundation/node/tutorials/sonic-client/run-an-api-node

This was on a freshly installed machine. The systemd service is configured not to restart automatically after a crash and has a stop timeout of 600 seconds (which should be more than enough for the service to stop gracefully), yet the DB still gets corrupted.

If I stop the service manually, using systemctl stop sonic.service, it shuts down correctly (takes about 5 minutes) and I am able to just restart it normally afterwards.

We brought up 4 nodes; they all crashed for the same reason, but at different points in time.
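
For reference, the lowered settings described above would look roughly like this as a systemd drop-in, e.g. /etc/systemd/system/sonic.service.d/override.conf (binary path and exact values are assumptions, translating 70% and 25% of 128 GB):

[Service]
# ~70% of 128 GB for the Go memory limit, ~25% for the client cache (illustrative values)
Environment=GOMEMLIMIT=89GiB
# the empty ExecStart= clears the original command before overriding it
ExecStart=
ExecStart=/usr/local/bin/sonicd --cache 32000
Restart=no
TimeoutStopSec=600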

@janzhanal

I did some testing and can confirm that SIGTERM followed by SIGKILL leaves the database corrupted. So the main questions stand:

  • is there a way to prevent corruption in case of unclean shutdown?
  • is there a way to recover corrupted db?
  • of course, a plan to implement those features would be appreciated as well (otherwise the tool is not usable in production and I would mark this as a critical issue)

@insider89

@janzhanal Is there a docker image to run, or do you build your own? (I didn't find a docker image for the Sonic chain, only for opera.)

@janzhanal

Building my own.

@janzhanal

Hello all,
today I got a corruption even though the app reported a proper shutdown:

INFO [07-20|08:02:47.143] New block                                index=86203336 id=294680:3321:af86b2  gas_used=1,832,237  txs=5/0    age=1.545s          t=7.585ms
INFO [07-20|08:02:47.448] Got interrupt, shutting down... 
INFO [07-20|08:02:47.449] IPC endpoint closed                      url=/data/opera.ipc
INFO [07-20|08:02:47.449] Stopping Fantom protocol 
INFO [07-20|08:02:49.040] Fantom protocol stopped 
INFO [07-20|08:02:49.133] Fantom service stopped 
INFO [07-20|08:02:52.898] Closing State DB...                      module=evm-store
root@ovh-us-hi-10:~# docker logs fantom-mainnet-archive-sonicd -f --tail 10
failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: lachesis-294680: DE
INFO [07-20|08:04:24.236] Maximum peer count                       total=50
INFO [07-20|08:04:24.236] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: lachesis-294680: DE
INFO [07-20|08:04:25.170] Maximum peer count                       total=50
INFO [07-20|08:04:25.170] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: gossip: DE

@flolege

flolege commented Jul 21, 2024

Would also really appreciate a way to recover a dirty state db.

@thaarok
Collaborator

thaarok commented Jul 29, 2024

@janzhanal The Closing State DB... log message needs to be followed by a State DB closed message; otherwise the app was not terminated properly. Are you sure the process was not killed, for example by the OOM killer?
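
A quick way to check both things from the host (the container name is the one from the logs above; dmesg may need root):

docker logs fantom-mainnet-archive-sonicd 2>&1 | grep -c "State DB closed"    # a count of 0 means the shutdown never completed
dmesg -T | grep -iE "out of memory|oom|killed process"                        # did the kernel kill the process?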
