Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DB Design prevents Concurrency #29

Open
juagargi opened this issue Jan 23, 2023 · 2 comments
Open

DB Design prevents Concurrency #29

juagargi opened this issue Jan 23, 2023 · 2 comments
Assignees
Labels

Comments

@juagargi
Copy link
Member

The update process disallows pushing certificate data directly to the DB,
unless the previous certificates for existing domains are fetched (and transformed together with the new certificate).
This makes concurrency hard, and disallows pushing data from CSV files directly to the DB.

A proposal is to have one record per certificate.

@juagargi juagargi self-assigned this Jan 23, 2023
@juagargi juagargi added the DB label Jan 23, 2023
@juagargi juagargi mentioned this issue Jan 23, 2023
3 tasks
@juagargi
Copy link
Member Author

Some notes and a proposal:

Fix the update process so that:

  1. It can easily be processed by several routines.
  2. Data can be pushed directly from CSV files into the DB.

The current update process has the following steps:

  1. Obtain the data (download, from file, etc).
  2. Update the domainEntries table in DB with the material from (1).
  3. Update the Sparse Merkle Tree with the material from (2).
  4. Update the tree table in DB with (3).

Steps 2 to 4 happen in small batches. I.e. a batch of e.g. 1000 elements
is taken from step 1, and passed thru steps 2, 3 and 4.
This batch processing should happen in parallel, but it seldom does,
as the update of a certificate C for a domain D requires retrieval of
the certificate collection for D, insertion of C into D (following
certain rules), and back to the DB in the domainEntries table.

We propose to change it to:

  1. Obtain the data.
  2. Create (upsert or similar) a new record per new certificate C and domain D.
  3. Write the modified domains into a table dirty (formerly known as the updates table).
  4. In DB and via a stored procedure,
    serialize the certificate collection (following certain rules) and write it, plus its SHA256,
    to the table.
  5. Wait until all batches have finished.
  6. Update the SMT with the material from (4), and using the domains in dirty.
  7. Store the tree table in DB.
  8. Truncate the dirty table.

Tables

For performance reasons, no foreign keys exist in any table.

  • certs table
    1. id: PK, this is the SHA256 of the certificate.
    2. domain: index, this is the SHA256 of the domain.
    3. serialized: this is the certificate, serialized.
    4. parent: this is the parent certificate, in the trust chain, or NULL if root.
  • domains table. This table is constructed in DB from the certs table
    1. domain: PK, SHA256 of the domain
    2. hash: SHA256 of the serialized certificate collection for the domain.
      This comes from all the certificates that have their certs.domain equal
      to this domains.domain, serialized following certain rules.
  • tree table, remains the same as before
    1. id: PK, auto increment.
    2. key: index, whatever the SMT library uses as key, 32 bytes.
    3. value: whatever the SMT library uses as value.
  • root table. Should contain zero or one elements.
    1. id: PK, 32 bytes, SHA256 of the root of the SMT.
  • dirty table
    1. key: PK, SHA256 of each of the modified domains.

The dirty table should always be non-empty when the SMT update process starts.

Detailed Steps to Update a Bunch of Certificates

Keeping in mind that this "bunch of certificates" could easily be 109 entries,
spread into multiple CSV files, we cannot keep every thing in memory.
We will piggyback into the DB to keep track of the updated domains,
and for that we will have two main steps inside one update cycle:

  1. Update the certificates themselves. This means adding a new entry in certs if the
    certificate doesn't exist yet.
  2. Update the tree table. For that we have to load and then update the SMT structure
    from the DB.

1. Updating the Certificates

We will process all the certificates in batches, whose size depends on
the row count of the CSV files (if local ingest) of download batch size.
Let's pick 105 as a possible batch size example.

If we have CSV files, for each one of the batches:

  1. Disable indices on the certs and dirty tables, until we are done with all batches.
  2. The trust chain will be only referenced by the entries.
    The content of the certificates that are part of the trust chain is inserted later.
  3. Insert the CSV into the certs table directly.
  4. If step 1 is not possible, create a temporary CSV file holding the appropriate
    values, and then do step 1.
  5. Insert only the certs.domain field into the dirty.key table.

Now we insert the certificates part of the trust chain:

  1. TODO(juagargi) This would be much better done if the local files are already
    in the format we expect, uncompressed.

If we instead have in-memory downloads, we do:

  1. Disable indices on the certs and dirty tables, until we are done with all batches.
  2. We will treat the data to split it into two parts: domains pertaining to the certificates,
    and certificates themselves. The certificates can come from the domains or from
    the trust chain.
  3. We divide the data into batches of e.g. 1000 certificate entries.
  4. For each batch:
    1. Get a bit mask of the already present certificates in the certs table.
    2. For those certificates not present, insert them into the certs table.
    3. TODO(juagargi) check if upsert them is more performant on average
      (depends on the number of times we encounter existing identical certificates).
    4. We update the dirty table.
    5. We group all the trust chain certificates into a map.
    6. We get a bit mask of those certificates not present in the DB.
    7. We insert those certificates not present in the bit mask.

We wait until all certificates are inserted into their appropriate rows
in certs and dirty.

2. Updating the SMT

This process is quite straight forward, as done previously, with an SMT updater,
which is an object that maintains the mutexes, etc. required for the
updating using multiple go routines.
We will divide the job into batches, e.g. batches of 106 elements.
These batches will be processed in parallel, using sub-batches
(because probably sending one key-value to a channel to be picked up by
a goroutine is going to be too much overhead, so we will group in sub-batches
of e.g. 10K elements).

The steps done for each batch are:

  1. Load the root from root table.
  2. For each entry in the dirty table:
    1. Use the dirty.key as key in the tree entry structure.
    2. Retrieve its domains.hash and use it as value.
  3. Wait for all the routines to finish their sub-batch.

After all batches have been processed, we can commit the SMT to the DB.
We may want to disable indices before doing this (we would have to test
in real life if it improves performance).

@cyrill-k
Copy link
Collaborator

One thing I'm a bit confused about is the domain field in the certs table.

  • What would this field contain? I guess just the CommonName field? What if the CommonName field of a certificate is empty? Would it then be set to NULL? Or the first SAN entry?
  • What about the SubjectAlternativeName (SAN) domains? Would you create a separate entry in the certs table for each SAN?
    In some cases we have tens of SAN entries for a single certificate. Which means we would store the same serialized certificate multiple times in the certs table. On average, the number of domains is quite low (~1-2) so that might be acceptable. But it can become quite big (see #domains hist).

Btw., here are hists for the #intermediate certs and validity times, which can be useful for performance estimations.

@juagargi juagargi mentioned this issue Jun 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants