
Fantom Sonic Mainnet Archive node gets corrupted DB #167

Open
tibineacsu95 opened this issue Jul 6, 2024 · 12 comments

Comments

@tibineacsu95

tibineacsu95 commented Jul 6, 2024

Describe the bug
Fantom Sonic Mainnet Archive node gets corrupted DB.

To Reproduce
Steps to reproduce the behavior:

  1. Set up an Archive node using the recommended steps.
  2. Use a snapshot (https://files.fantom.network/mainnet-284692.tar.gz) to avoid syncing from scratch and reduce sync time (a shell sketch of this follows the list).
  3. Node starts syncing from the point the snapshot is in.
  4. Database gets corrupted after a couple of hours:
    sonicd[288420]: failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: gossip: DE
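
For reference, a rough shell sketch of steps 2–3 (the snapshot URL is the one above; the data directory path and the --datadir flag are assumptions, adjust them to your setup):

wget https://files.fantom.network/mainnet-284692.tar.gz
tar -xzf mainnet-284692.tar.gz -C /var/lib/sonic    # extract the snapshot into the node's data directory (assumed path)
sonicd --datadir /var/lib/sonic                     # start the archive node from the snapshot state (flag name assumed)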

Expected behavior
Node is able to sync properly, without getting its DB corrupted.

Desktop (please complete the following information):

  • OS: Ubuntu 22.04.4 LTS
  • Version: Sonicd 1.2.1-b

Additional context
Not quite sure how to mitigate this. It's the second time we've run into this on two different nodes; we also changed machines, thinking it might be a local storage problem.

We are using systemd, here is the service file:
(screenshot of the systemd service file attached in the original issue)
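
Since the screenshot does not come through here, a minimal sketch of a comparable unit file (the binary path and user are assumptions; the GOMEMLIMIT, --cache, Restart and timeout values are the ones mentioned elsewhere in this thread, not the actual contents of the screenshot):

[Unit]
Description=Sonic mainnet archive node
After=network-online.target

[Service]
# assumed service user and binary path; GOMEMLIMIT / --cache values are from later comments in this thread
User=sonic
Environment=GOMEMLIMIT=116GiB
ExecStart=/usr/local/bin/sonicd --cache 51200
# avoid an automatic restart into an already-corrupted DB, and allow a long graceful stop
Restart=no
TimeoutStopSec=600

[Install]
WantedBy=multi-user.target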

Any feedback is highly appreciated!

@blockpi019

We had the same problem too

@janzhanal

Same issue with 1.2.1-d

@thaarok
Collaborator

thaarok commented Jul 8, 2024

The error message implies your node cannot start because its database is corrupted. It probably crashed or was killed at some point, and systemd restarted it automatically. The message you describe is produced by that subsequent run, which fails to start because the DB is already corrupted.

Can you provide logs from the original crash? They are necessary to understand what happened. Thanks!
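
For reference, if the node runs under systemd, the logs around the crash can usually be pulled with journalctl (the unit name sonic.service is taken from later in this thread; the time window is just an example):

journalctl -u sonic.service --since "2024-07-06 00:00" --no-pager
journalctl -k --since "2024-07-06" | grep -iE "out of memory|killed process"    # kernel log, to check for an OOM kill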

@tibineacsu95
Author

tibineacsu95 commented Jul 8, 2024

It looks like an OOM kill.

Jul 07 18:28:09 sonic01 systemd[1]: sonic.service: A process of this unit has been killed by the OOM killer.
Jul 07 18:28:17 sonic01 systemd[1]: sonic.service: Main process exited, code=killed, status=9/KILL
Jul 07 18:28:17 sonic01 systemd[1]: sonic.service: Failed with result 'oom-kill'.
Jul 07 18:28:17 sonic01 systemd[1]: sonic.service: Consumed 1d 23h 57min 8.615s CPU time.

This is from a freshly synced node started two days ago from scratch. This time I used Restart=no in the service file. If I try to start the service back up, I get the same message (sonicd[288420]: failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: gossip: DE), although I was hoping the DB wouldn't get corrupted this time, since the service never actually restarted.

Each parameter used for the service is tuned based on the server specs (128 GB RAM and 32 CPUs):

  • GOMEMLIMIT=116GiB
  • --cache 51200

Any advice here? Should I set the values for the limit and the cache lower? And if so, what would be the suitable numbers for this spec? Thanks in advance!
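
For context, those two values work out to roughly 90% of RAM for GOMEMLIMIT and 40% for the cache (assuming --cache is specified in MB, which is an assumption):

GOMEMLIMIT: 128 GB x 0.90 ≈ 115 GB  -> GOMEMLIMIT=116GiB
--cache:    128 GB x 0.40 ≈ 51.2 GB -> --cache 51200    (51200 MB)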

@janzhanal

I'm sorry, I'm running in a container and the logs got flushed. But it means SIGTERM, and if it didn't stop within 10 seconds, SIGKILL.

Is there any way to fix the corruption? Unclean shutdowns happen even in production environments, and having to wait multiple days for the archive genesis to be processed is really painful...
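
For what it's worth, the 10 second default can be raised in a container setup so the node gets time to flush before the SIGKILL arrives (a compose sketch; the service name is illustrative):

services:
  sonicd:
    stop_grace_period: 600s   # default is 10s; SIGKILL is only sent after this period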

@tibineacsu95
Author

tibineacsu95 commented Jul 10, 2024

Just for reference: I updated to 1.2.1-d and tried lowering GOMEMLIMIT to 70% of the machine's total RAM (the docs say 90% should be fine) and --cache to 25% (the docs say 40% should be fine), but the crashes still occur with the same outcome, an OOM kill.

Link to docs - https://docs.fantom.foundation/node/tutorials/sonic-client/run-an-api-node

This was on a freshly installed machine. The systemd service is configured not to restart automatically after a crash and has a stop timeout of 600 seconds (which should be more than enough for the service to stop gracefully), yet the DB still gets corrupted.

If I stop the service manually, using systemctl stop sonic.service, it shuts down correctly (takes about 5 minutes) and I am able to just restart it normally afterwards.

We brought up 4 nodes; they all crashed for the same reason, but at different points in time.
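
For reference, the lowered settings described above would look roughly like this as a systemd drop-in, e.g. /etc/systemd/system/sonic.service.d/override.conf (binary path and exact values are assumptions, translating 70% and 25% of 128 GB):

[Service]
# ~70% of 128 GB for the Go memory limit, ~25% for the client cache (illustrative values)
Environment=GOMEMLIMIT=89GiB
# the empty ExecStart= clears the original command before overriding it
ExecStart=
ExecStart=/usr/local/bin/sonicd --cache 32000
Restart=no
TimeoutStopSec=600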

@janzhanal

I did some testing and can confirm that SIGTERM followed by SIGKILL leaves the database corrupted. So the main questions stand:

  • is there a way to prevent corruption in case of unclean shutdown?
  • is there a way to recover corrupted db?
  • of course, a plan to implement those features would be appreciated as well (otherwise the tool is not usable in production and I would mark this as a critical issue)

@insider89

@janzhanal Is there a docker image to run, or do you build your own? (I didn't find a docker image for the Sonic chain, only for opera.)

@janzhanal

Building my own.

@janzhanal

Hello all,
today I got a corruption even though the app reported a proper shutdown:

INFO [07-20|08:02:47.143] New block                                index=86203336 id=294680:3321:af86b2  gas_used=1,832,237  txs=5/0    age=1.545s          t=7.585ms
INFO [07-20|08:02:47.448] Got interrupt, shutting down... 
INFO [07-20|08:02:47.449] IPC endpoint closed                      url=/data/opera.ipc
INFO [07-20|08:02:47.449] Stopping Fantom protocol 
INFO [07-20|08:02:49.040] Fantom protocol stopped 
INFO [07-20|08:02:49.133] Fantom service stopped 
INFO [07-20|08:02:52.898] Closing State DB...                      module=evm-store
root@ovh-us-hi-10:~# docker logs fantom-mainnet-archive-sonicd -f --tail 10
failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: lachesis-294680: DE
INFO [07-20|08:04:24.236] Maximum peer count                       total=50
INFO [07-20|08:04:24.236] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: lachesis-294680: DE
INFO [07-20|08:04:25.170] Maximum peer count                       total=50
INFO [07-20|08:04:25.170] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: gossip: DE

@flolege

flolege commented Jul 21, 2024

Would also really appreciate a way to recover a dirty state db.

@thaarok
Collaborator

thaarok commented Jul 29, 2024

@janzhanal The Closing State DB... log message needs to be followed by a State DB closed message; otherwise the app was not terminated properly. Are you sure the process was not killed, for example by the OOM killer?
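
A quick way to check both things from the host (the container name is the one from the logs above; dmesg may need root):

docker logs fantom-mainnet-archive-sonicd 2>&1 | grep -c "State DB closed"    # a count of 0 means the shutdown never completed
dmesg -T | grep -iE "out of memory|oom|killed process"                        # did the kernel kill the process?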
