Postmortem of 7/10 incident #5

corollari · 2020-10-09T20:12:56Z

On the 7th a user ran into a problem with tbtcswaps which led to the discovery of a vulnerability and the pausing of all swaps. More concretely, a user initiated a LN->tBTC swap where, on the second step, where a request is sent to the node that is the swapping counterparty, the node errored, making the UI on the client side get stuck.

After being alerted of a swap that was stuck I looked into the server logs to find this:

2020-10-07T17:23:54.184267+00:00 app[web.1]: (node:23) UnhandledPromiseRejectionWarning: Error: Invoice for this payment has not been generated
2020-10-07T17:23:54.184279+00:00 app[web.1]: at Command.callback (/app/build/index.js:177:24)
2020-10-07T17:23:54.184280+00:00 app[web.1]: at normal_reply (/app/node_modules/redis/index.js:654:21)
2020-10-07T17:23:54.184281+00:00 app[web.1]: at RedisClient.return_reply (/app/node_modules/redis/index.js:752:9)
2020-10-07T17:23:54.184282+00:00 app[web.1]: at JavascriptRedisParser.returnReply (/app/node_modules/ redis/index.js:137:18)
2020-10-07T17:23:54.184282+00:00 app[web.1]: at JavascriptRedisParser.execute (/app/node_modules/redis-parser/lib/parser.js:544:14)
2020-10-07T17:23:54.184283+00:00 app[web.1]: at Socket.<anonymous> (/app/node_modules/redis/index.js:218:27)
2020-10-07T17:23:54.184284+00:00 app[web.1]: at Socket.emit (events.js:314:20)
2020-10-07T17:23:54.184284+00:00 app[web.1]: at addChunk (_stream_readable.js:298:12)
2020-10-07T17:23:54.184284+00:00 app[web.1]: at readableAddChunk (_stream_readable.js:273:9)
2020-10-07T17:23:54.184285+00:00 app[web.1]: at Socket.Readable.push (_stream_readable.js:214:10)
2020-10-07T17:23:54.184285+00:00 app[web.1]: at TCP.onStreamRead (internal/stream_base_commons.js:188:23)

This error comes from https://github.com/corollari/tbtcswaps/blob/master/node/src/index.ts#L187 and it's a direct consequence of a previous necessary step not being performed. But before delving into this let's first take a step back at how LN->tBTC swaps work:

User locks 1 ETH
User requests the LN invoice from the server and pays it
...

To prevent attacks where a malicious user causes the creation of an unlimited number of invoices, the invoice generation is implemented on the server as 2 separate steps:

When the node receives a new locking event from the smart contract, a new invoice is generated and stored on redis
When the users sends an HTTP request to retrieve the invoice, the node gets it from redis and sends it

Avid readers might notice that this system is vulnerable to a race condition where the user's HTTP request is received before the new invoice has been stored on redis, triggering an extraneous error. When I was developing the node I thought of this too, but, after running a few tests on the speeds of both processes, I came to the conclusion that such event would be extremely unlikely to happen under normal conditions, and, given that this wouldn't lead to loss of funds on either side, I decided to leave it like that.

So, when I saw that error on the server logs, which is caused by an attempt to retrieve an invoice that hasn't been generated, I connected this with the fact that several tBTC->LN swaps concluded successfully right before this one, the first LN->tBTC swap performed by someone other than me, and my mind raced to the idea that this was caused by the race condition mentioned before. I thought that maybe something had happened which heavily slowed down one of the processes, such as a spike in packet loss on the Infura<->Node connection.

In any case, it was clear to me that this was a problem which should have never ocurred, so I proceeded to stop LN->tBTC swaps (I don't have any admin keys, I just pushed an update to the frontend that disabled the "Swap" button), helped
the affected user revert their swap and sent a public message on the KEEP discord announcing that a problem related to a race condition had been found and I had disabled swaps.

Afterwards I tested some other stuff to confirm the problem, and, after waiting some more time to make sure nobody had any other problem, I went to bed.

The next day I started working on reproducing the problem, but, after struggling to do that for some time, I saw that in both redis and lnd logs there was no trace of an invoice being created and stored, which seemed to imply that the code that handles the event was never executed in the first place. Once I came into that realization I disabled all swaps on the site preemptively and continued investigating and shared the results of my investigation with prestwich, who, after a few hours, confirmed my suspicions: The error originated from infura not relaying the event to the node. I'm still not sure if that was because the websocket connection was dropped or because infura just missed that event (latter seems to be more likely), but one of these two was the ultimate cause of the incident.

Essentially this came about because I implicitly made the assumption that infura is 100% reliable, which lead to an absence of fallback mechanisms for it and made it become a single point of failure for the whole system.

It must also be said that this new discovery makes the system exploitable, as it is possible for an attacker to perform a lot of swaps and just wait for one where the LN2TBTCPreimageRevealed event is not relayed properly, as then the server would fail to settle the invoice while the attacker would get to retrieve the tBTC on-chain, thus keeping both payments.

Timeline

7:08 Agoristen posts a message on discord saying that he's having some problems with the swap
7:18 Initial reply on my side
7:37 Initial (mis)diagnosis of the problem
7:39 LN->tBTC swaps are disabled
8:21 Message on discord explaining the incident and announcing that LN->tBTC swaps had been disabled
Next day
12:21 tBTC->LN swaps are disabled
5:12 prestwich confirms the problem

What I learned

More and better logging in production is needed to help diagnose problems quickly
I should be more mindful of implicit assumptions
Revert button should be more accessible
Frontend should handle errors on the server responses better (in this case the loading button kept spinning instead of announcing that an error had ocurred)

Moving forward

Fix the issues encountered
Add more redundancy to limit reliance on external services
Revisit all assumptions
Get quotes for an audit and, if it's within my financial means, get one done
Add more defense in depth measures so, if there's a bug, the damage is stopped
- Initially this will mean a notification system that sends emails to the node owner if something weird happens
- If you have any other ideas to improve on this area please let me know

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Postmortem of 7/10 incident #5

Postmortem of 7/10 incident #5

corollari commented Oct 9, 2020

Postmortem of 7/10 incident #5

Postmortem of 7/10 incident #5

Comments

corollari commented Oct 9, 2020

Timeline

What I learned

Moving forward