Figure out how to actually leverage an empty events table for performance improvements #1867

thomaseizinger · 2022-04-05T02:30:01Z

thomaseizinger
Apr 5, 2022

Once #1651 is landed, the events table will be empty if a node has 0 open CFDs.

In that state, a software upgrade that changes the structure of the events is safe. We can leverage this to make backwards-incompatible changes to the layout of our events either for reasons of performance improvements or otherwise.

The tricky part that we need to solve is: How do we get the users application into this state?

Some ideas (with varying degree of user experience):

The application could detect that it is about to run a version with incompatible events for the first time, force-close all CFDs, wait until they are completely closed and only then initialise the user interface.
The application could refuse to boot if the events table is not empty, thus forcing the user to close all CFDs prior to the upgrade with an old version.
Assume (2) and force-close all positions through the maker whilst at the same time, not allowing new positions to be opened. Release a new version once there are no open positions.
Introduce a new version of the events we want to change, keeping the old ones working. Track per CFD which software version it was opened with. Monitor how many open CFDs we still have with this software version. Release new version that removes old events once those CFDs drop to 0 or some defined threshold.

thomaseizinger · 2022-04-06T07:22:28Z

thomaseizinger
Apr 6, 2022
Author

I though about this a bit more.

We can actually make changes to the events today. They just need to be backwards-compatible. We can achieve that by introducing new events and mapping the old events to new ones. For that, it would probably be wise to introduce a dedicated db::Event type that deals with serialization and have the model layer not care about serialization at all. Then the db module can keep supporting old events and map into the newest representation.

When we can remove support for these old events is actually a completely different problem and does not need to be solved straight away. The only downside of that is some legacy mapping code that will be around for a while.

Once we have shipped a few releases with the mapping code in it, we can always make a judgement call on when we think it is safe or acceptable to remove it. To not break any users at that point, all their CFDs will have to be closed.

0 replies

da-kami · 2022-04-12T02:40:28Z

da-kami
Apr 12, 2022

We can actually make changes to the events today. They just need to be backwards-compatible. We can achieve that by introducing new events and mapping the old events to new ones. For that, it would probably be wise to introduce a dedicated db::Event type that deals with serialization and have the model layer not care about serialization at all. Then the db module can keep supporting old events and map into the newest representation.

I like this proposal. I think we should aim for this.
Yes, it would be great if we could keep the mapping purely in the database layer, and the model changes (without introducing model versions). I would hope that our changes will allow that. We can then easily have versioned tables in the db as well, and map old CFDs in the new model.

If we cannot achieve to map old CFDs into the new model I think we should go for a breaking version that forces users to close all positions before an upgrade. We would have to define details for the upgrade process though.
I feel having multiple model version will add complexity that we don't want at the moment.

0 replies

bonomat · 2022-04-20T05:58:11Z

bonomat
Apr 20, 2022
Maintainer

I find it a bit hard to reason about a generic upgrade solution without knowing what we want to change.
Imho a solution depends on the exact changes we want to introduce and any of the proposed 4 points might be viable depending on the scope of the change.

The proposed solution of mapping old events onto new events sounds to be the least invasive one with the best user experience :)

0 replies

bonomat · 2022-04-21T01:27:45Z

bonomat
Apr 21, 2022
Maintainer

What do we want to optimize

We want to lower the storage/db size and want to reduce the time to load CFDs from the DB.

Why is the DB so big

We store every event. The biggest events are ContractSetupCompleted and RolloverCompleted because both store the whole DLC including all CETs. With our current config of 200 CETs one of these events has roughly 820kb. In the long run I see more CETs being used to have a finer granularity on the payouts. This is important if the amounts are bigger. Hence, the size of this will grow!

A rollover happens ideally 1/h meaning for a single CFD we have ~24*820kb=20mb in data per day!

Our rollover protocol foresees that a user should never publish an old state, i.e. if a rollover was completed, a user should not publish an old state.
Publishing an old state can either be a mistake (our famous problem of a rollover worked for one party but not for the other party) or a malicious attempt: the user wants to publish an old state to profit from it.
Since we can't distinguish these situations easily, the user should be punished if he published an old state.

For punishment we need a few things from passed rollovers:

encsig (our own)
revocation secret (other party)
publication keys (other party)
txid of the old transaction
the spending script

We don't need:

the old CETs which are the biggest part of the data

What can we do?

With #1779 landed we move closed CFDs into their own table and delete all events. This frees storage and speeds up loading time.

However, this does not optimize open positions. A single position being open for a year will have ~ 7.3GB in storage.
Hence, we need an additional solution:

Snapshots

Snapshots have been discussed various times: they allow us to store a snapshot at a specific point of time and hence optimize loading - we don't need to load the whole history of events but can start from the snapshot.
While contradicting with the core idea of ES this would also allow us to delete old events prior the snapshot to free up storage.

The question is, how often do we snapshot:

time-based snapshots: We can do a snapshot 1/day, 1/hour/, ... The challenge I see here the longer we wait, the larger and more costly it gets to load everything and create a snapshot. If it is too often, even if it is fast, it will result in spikes in machine resources and might block the DB unnecessarily.
Snapshot after an event: We could do a snapshot after specific events, since the only regular events we are having are rollover-events, we could do a snapshot after each rollover. This should be relatively fast and cheap. The question is, do we even need the old events anymore? We will always base our business logic on the latest snapshot, all prior events are not important.
Snapshot after X events: We could do a snapshot after X rollover. The hard part is to get X right, the more users we get, the smaller the X should be going eventually closed to solution 2.

Our goal is to scale big, it may take some time, but eventually we will get there.
In numbers expressed, if we have 1000 users and increased our CETs to 5MB (~1k CETS), we will have 5GB of data every hour. We basically have to do snapshots after each event because we can't optimize much more.

The only way out is to do rollovers less often but this will only push the inevitable a bit away.

That being said, is this a sign that we should drop ES hard rules on immutability?
If we have todo a "snapshot" after each event, are we even following the core ideas of ES?

1 reply

thomaseizinger Apr 25, 2022
Author

The only way out is to do rollovers less often but this will only push the inevitable a bit away.

Every solution has a bottleneck somewhere if you scale far enough. Hence, it is important to recognise, how for a solution needs to be optimised until it is "good enough". With #1779, the storage consumed is only "working storage" in the sense that only open CFDs consume more storage. If we are trying to make a decision based on numbers, we should also run the numbers for a solution that takes optimisations like "run rollover less often" into account.¹

If these numbers turn out to still be "too much storage", I'd recommend to look into a completely different design overall. On that front, it is super important to point out that events are at the core of our system so we will need to find a way of either keeping those events without ES or re-architecting, how information flows through the system without events.

Changing the storage model can unlock big performance improvements but also comes at a risk. I have the assumption that we can get quite far in optimising the working storage of the application through ideas like "fixed funding intervals", "dissect the Dlc struct", "store signatures as binary instead of hex", "don't use JSON for the database", "combined position". Also not that most of these optimisations would likely carry over to different storage solutions later or at least bring features that have already been talked about.

Running rollover less often has come up outside of performance talk as well in the form of fixed funding intervals. ↩

thomaseizinger · 2022-04-26T04:02:12Z

thomaseizinger
Apr 26, 2022
Author

A reasonably straight-forward way of ditching event sourcing could be to retain the idea of making changes to the model by means of emitting events from commands but apply them to a persistent aggregate and only save that one to the DB. Every other component in the system would then have to initialise its model from that persistent aggregate. In other words, instead of just saving closed CFDs in a separate table, save an aggregate for all CFDs, regardless of their state.

This would allow us to retain most of the program's architecture that relies on events being emitted from the model's commands which IMO would otherwise be a gigantic piece of work.

The major changes would be:

Remove the apply function from the CfdAggregate trait
db::append_event would load the existing Cfd from a table, make changes to it according to the event and save it again

It is somewhat obvious but I am gonna point it out anyway: Any solution that ditches event sourcing will result in a "super" model on the db layer that needs to serve every requirement (monitor, projection, model, auto-rollover etc) because we will no longer be able to selectively take information from the events.

We can still retain a "two-layer" approach where db::Cfd is a dumb struct with a million, potentially optional fields and we initialise various other models from it via the CfdAggregate trait.

To retain an audit-trail, we can still save the name of the event in a table the same we do it for closed CFDs.

3 replies

bonomat May 2, 2022
Maintainer

I think that's pretty much the same what I had in mind and I believe this is relatively easy to be implemented provided we have the db model defined.

Just to repeat in my own words:

we stick to event-based system we have, i.e. whenever there is a new event to record we send it to the db actor
the db actor stores and aggregated view in the db (via append_event), i.e. a relational database.
Given we have all the events, we can additionally record an audit trail
we remove the apply function of CfdAggregate and instead load the final view from the db
implement atomic select statements to get everything at once instead of sequentially executed select statements.

It is somewhat obvious but I am gonna point it out anyway: Any solution that ditches event sourcing will result in a "super" model on the db layer that needs to serve every requirement (monitor, projection, model, auto-rollover etc) because we will no longer be able to selectively take information from the events.

I'm not sure what you mean with this: we can still have separate select statements for different needs and return different structs there, meaning, we don't always have to load the full Cfd with all CETs attached if we don't need them.

thomaseizinger May 2, 2022
Author

It is somewhat obvious but I am gonna point it out anyway: Any solution that ditches event sourcing will result in a "super" model on the db layer that needs to serve every requirement (monitor, projection, model, auto-rollover etc) because we will no longer be able to selectively take information from the events.

I'm not sure what you mean with this: we can still have separate select statements for different needs and return different structs there, meaning, we don't always have to load the full Cfd with all CETs attached if we don't need them.

I was more talking about the relational model itself. We will need to define tables ahead of time that support every usecase because that is the data you need to store.

At the moment - because we store events - we can build the aggregate dynamically based on whatever requirements come in.

bonomat May 2, 2022
Maintainer

Got it, yes, this is the hard part.

luckysori · 2022-05-02T01:30:52Z

luckysori
May 2, 2022
Maintainer

As I've expressed in the past, I'm very skeptical about giving up on event sourcing at this stage. We have definitely identified some pain points with the way we are using it (primarily the fact that we store a gigantic blob every hour and cannot/don't delete the old blobs). I suggest we try to mitigate them, even it means that our take on event sourcing is not as "pure" as it could be.

Looking at the code I've noticed that every time we apply the gigantic events (ContractSetupCompleted and RolloverCompleted) all our aggregates are already prepared for the dlc field to be null. This means that we are already able to process these events with a missing dlc field.

The obvious solution might be to set these dlc fields to Option and break event sourcing rules by somehow removing that part of the event for all "old" events.

Another idea that I think does not go against event sourcing principles could be to:

Remove the DLC field from the ContractSetupCompleted and RolloverCompleted events.
Create a separate dlcs table for DLCs, associated with particular CFDs. Only ever store the latest DLC for a CFD i.e. overwrite the previous one.
When applying events to an aggregate, also use the DLC loaded from the dlcs table.

Basically, my observation is that the DLC doesn't need to follow event sourcing rules, so we can store it separately and do whatever we want to it.

3 replies

thomaseizinger May 2, 2022
Author

Another idea that I think does not go against event sourcing principles could be to:

Remove the DLC field from the ContractSetupCompleted and RolloverCompleted events.

Create a separate dlcs table for DLCs, associated with particular CFDs. Only ever store the latest DLC for a CFD i.e. overwrite the previous one.

When applying events to an aggregate, also use the DLC loaded from the dlcs table.

Basically, my observation is that the DLC doesn't need to follow event sourcing rules, so we can store it separately and do whatever we want to it.

It does not go against it per se but at the same time, you are basically removing one of the main "benefits"¹ of ES: The ability to compute the latest state based on the sequence of events that happened.

If you start storing the Dlc separately, the latest state is no longer a function of the event log but event log + whatever is stored in the dlcs table. At this point, you may as well ditch it completely because a lot of things that you can do with ES are no longer possible.

In quotes because we are currently not leveraging it. ↩

luckysori May 2, 2022
Maintainer

Can you elaborate on what you think we are currently not leveraging about ES?

If you start storing the Dlc separately, the latest state is no longer a function of the event log but event log + whatever is stored in the dlcs table. At this point, you may as well ditch it completely because a lot of things that you can do with ES are no longer possible.

I agree with the facts presented in the first sentence, but I think your second sentence takes a big leap. Can you anticipate some things that will be impossible if we store the Dlc in a separate table? And even if we did identify some things, would that be reason enough to completely forego ES?

thomaseizinger May 2, 2022
Author

I agree with the facts presented in the first sentence, but I think your second sentence takes a big leap. Can you anticipate some things that will be impossible if we store the Dlc in a separate table?

For example, one of the things that an event log allows you to do is inspect (and run) the system with a certain state of the past (by simply not applying all events). If the Dlc is stored in a separate table, the Dlc at the time of the now simulated last event no longer exists because it was overwritten.

Can you elaborate on what you think we are currently not leveraging about ES?

We don't use it debug how a system got into the state that it is in or what the state of a system was at the time of a historical event.

I think this is partly because our RolloverCompleted is already half a snapshot event and doesn't actually store just the diff to the previous rollover.

We did certainly already leverage some of the benefits. For example, when we allowed to close CFDs in "Open pending" state, having the list of events that happened allowed us to very quickly identify, that the event for close confirmed can happen before lock confirmed. That might have been a tricky race condition to otherwise detect!

thomaseizinger · 2022-05-02T04:37:44Z

thomaseizinger
May 2, 2022
Author

To start a meta-discussion:

With Figure out how to actually leverage an empty events table for performance improvements #1867 (comment), we think to have found a way of replacing ES with a relational model where the changes are primarily contained in the DB layer. IMO, this is almost a baseline requirement to make any discussion about ditching ES possible. Completely doing away with events is almost impossible economically speaking.
We have several ideas on how to improve the performance of an ES-based system: Splitting up the Dlc struct, performing less rollovers, introducing snapshots, etc.

Both of these paths are doable, will yield positive results in terms of performance but also have downsides:

Sticking with ES:

Downside: There is an upper-limit to how much performance one can squeeze out of it.
Upside: Incremental performance improvements are safer and will less likely uncover unknowns because we are already working within this idiom.
Upside: As the system grows, being able to replay and inspect how we got into a certain state may become beneficial for debugging or other analysis. More generally: There are several benefits that one can get out of ES which we are not (yet) leveraging.

Moving to relational model:

Upside: We will get close to the smallest storage that is possible given our requirements.
Downside: We are changing a paradigm which always involves unknowns. Unknowns translate to risks in terms of bugs, instability and likely delays in the implementation as we actually try and do stuff.
Upside: Changes to the relational model are much easier. Not being able to change events held us back more than once.

1 reply

thomaseizinger May 2, 2022
Author

I know that we have primarily started this discussion from a performance perspective but I think the following has almost as much weight if not more:

Upside: Changes to the relational model are much easier. Not being able to change events held us back more than once.

Not being able to change the events hinders progress because we cannot iterate on the data model as easily. We have already identified multiple points where the data model is wrong and not being able to fix this harms innovation and cycle time of new features.

I almost certain that we will find ways of dealing with a big event log for the maker deployment. Be it with snapshots, more iron, dissected events or other improvements. However, most of these improvements require control over the upgrade cycle (as we saw with the "sqlite vacuum" situation). We don't have control over the upgrade cycle of our clients so deploying a storage concept that builds on an immutable log of events was likely a very bad idea.

da-kami · 2022-05-16T05:42:49Z

da-kami
May 16, 2022

Upside: Changes to the relational model are much easier. Not being able to change events held us back more than once.

My 2 cents:
I would strongly advocate for no json in the db and normalize all the data. This will help with changes.
We can have an "in between" step where only some of the data is moved to a separate table, but it would be nice to have a vision on how to evolve moving aways from the json in general before.

2 replies

bonomat May 20, 2022
Maintainer

I played around with some advanced queries and came to the conclusion that a pure sql migration might theoretically be possible but probably not worth the effort.
The problem is two fold

we have a hashmap in the json which keys are the oracle event ids - these are dynamic and not easy to match agains
we have multiple arrays in the json which we should flat out. This is not easy as the amount of elements in these arrays are dynamic.

If we want to proper migrate away from json I believe the best approach is to create a new table for DLCs (including multiple sub-tables). We can still keep backwards compatbility by doing the migration manually using rust.

Just for the fun of it, here is a small sql query I tried to extract the fields from the json blob. It's not too bad but stops at having a single column for arrays which is still not optimal.

select table_1.funding_fee,
       table_1.rate,
       table_1.commit_tx,
       table_1.idendity,
       table_1.lock,
       table_1.maker_address,
       table_1.maker_lock_amount,
       table_1.publish,
       table_1.publish_pk_counterparty,
       table_1.refund,
       table_1.refund_timelock,
       table_1.revocation,
       table_1.revocation_pk_counterparty,
       table_1.revoked_commit,
       table_1.settlement_event_id,
       table_1.taker_address,
       table_1.taker_lock_amount,
       table_2.array
from (select EVENTS.id,
             events.cfd_id,
             json_extract(EVENTS.data, '$.funding_fee.fee')                as funding_fee,
             json_extract(EVENTS.data, '$.funding_fee.rate')               as rate,
             json_extract(EVENTS.data, '$.dlc')                            as rest,
             json_extract(EVENTS.data, '$.dlc.commit')                     as commit_tx,
             json_extract(EVENTS.data, '$.dlc.identity')                   as idendity,
             json_extract(EVENTS.data, '$.dlc.lock')                       as lock,
             json_extract(EVENTS.data, '$.dlc.maker_address')              as maker_address,
             json_extract(EVENTS.data, '$.dlc.maker_lock_amount')          as maker_lock_amount,
             json_extract(EVENTS.data, '$.dlc.publish')                    as publish,
             json_extract(EVENTS.data, '$.dlc.publish_pk_counterparty')    as publish_pk_counterparty,
             json_extract(EVENTS.data, '$.dlc.refund')                     as refund,
             json_extract(EVENTS.data, '$.dlc.refund_timelock')            as refund_timelock,
             json_extract(EVENTS.data, '$.dlc.revocation')                 as revocation,
             json_extract(EVENTS.data, '$.dlc.revocation_pk_counterparty') as revocation_pk_counterparty,
             json_extract(EVENTS.data, '$.dlc.revoked_commit')             as revoked_commit,
             json_extract(EVENTS.data, '$.dlc.settlement_event_id')        as settlement_event_id,
             json_extract(EVENTS.data, '$.dlc.taker_address')              as taker_address,
             json_extract(EVENTS.data, '$.dlc.taker_lock_amount')          as taker_lock_amount
      from EVENTS
      where name = 'RolloverCompleted'
        and rest is not null) as table_1
         join (SELECT id,
                      array
               FROM (SELECT id,
                            SUBSTR(SUBSTR(json_extract(data, '$.dlc.cets'),
                                          INSTR(json_extract(data, '$.dlc.cets'), '[') + 0),
                                   1,
                                   INSTR(SUBSTR(json_extract(data, '$.dlc.cets'),
                                                INSTR(json_extract(data, '$.dlc.cets'), '[') + 1),
                                         ']') + 1) AS array
                     FROM EVENTS) subtable
               ) as table_2 on table_1.id = table_2.id

da-kami May 20, 2022

I think it's fine to go with a code based solution where we load, write into new tables and then delete the old data in a transaction.
I like playing with SQL, but I feel we might be faster just writing the code - and the code can be more easily tested with unit tests before testing with testnet / mainnet data.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Figure out how to actually leverage an empty events table for performance improvements #1867

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 8 comments 10 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Figure out how to actually leverage an empty events table for performance improvements #1867

thomaseizinger Apr 5, 2022

Replies: 8 comments · 10 replies

thomaseizinger Apr 6, 2022 Author

da-kami Apr 12, 2022

bonomat Apr 20, 2022 Maintainer

bonomat Apr 21, 2022 Maintainer

What do we want to optimize

Why is the DB so big

What can we do?

Snapshots

thomaseizinger Apr 25, 2022 Author

Footnotes

thomaseizinger Apr 26, 2022 Author

bonomat May 2, 2022 Maintainer

thomaseizinger May 2, 2022 Author

bonomat May 2, 2022 Maintainer

luckysori May 2, 2022 Maintainer

thomaseizinger May 2, 2022 Author

Footnotes

luckysori May 2, 2022 Maintainer

thomaseizinger May 2, 2022 Author

thomaseizinger May 2, 2022 Author

thomaseizinger May 2, 2022 Author

da-kami May 16, 2022

bonomat May 20, 2022 Maintainer

da-kami May 20, 2022

thomaseizinger
Apr 5, 2022

Replies: 8 comments 10 replies

thomaseizinger
Apr 6, 2022
Author

da-kami
Apr 12, 2022

bonomat
Apr 20, 2022
Maintainer

bonomat
Apr 21, 2022
Maintainer

thomaseizinger Apr 25, 2022
Author

thomaseizinger
Apr 26, 2022
Author

bonomat May 2, 2022
Maintainer

thomaseizinger May 2, 2022
Author

bonomat May 2, 2022
Maintainer

luckysori
May 2, 2022
Maintainer

thomaseizinger May 2, 2022
Author

luckysori May 2, 2022
Maintainer

thomaseizinger May 2, 2022
Author

thomaseizinger
May 2, 2022
Author

thomaseizinger May 2, 2022
Author

da-kami
May 16, 2022

bonomat May 20, 2022
Maintainer