Skip to content

Basic Operations

Kinan Bab edited this page Apr 26, 2023 · 2 revisions

Make sure you have a clean K9db server running and connect to it from the terminal using a suitable client. You can find instructions on how to do this in our top-level tutorial page.

Simple course submission system

Let's start with a simple example: a course assignment submission system.

Schema

For simplicity, we will assume this system is specific to only one offering of a single course and a single fixed instructor. We will not consider the instructor to be part of the database or be a data subject for compliance purposes, even-though that is not the case in reality. We will only consider the students to be data subjects.

Our scenario has three tables: assignments, students and submissions.

We will consider assignments to have an id and a title. Assignments are created by the instructor independently of the students and their activity. A reasonable application policy thus grants students no privacy rights over the assignments themselves. From K9db's perspective, this means that assignments is a plain SQL table not associated with any data subjects and unaffected by compliance operations.

We can create this table by inputting the following command via the terminal client:

# In mariadb/mysql terminal client connected to K9db server.
CREATE TABLE assignments (
  ID int,
  title text,
  PRIMARY KEY(ID)
);
> Query OK, 0 rows affected (0.055 sec)

The above command is identical to how you would create this table in MySQL or other SQL-based databases. K9db has some implementation limitations: it only supports int, text, and (partially) datatime types. It does not enforce NOT NULL and UNIQUE constraints (but it does enforce them for PRIMARY KEYs).

Next, we should create the students table. Every row in this table corresponds to a student, which are by-definition a data subject with privacy rights according to the GDPR (and similar privacy laws). Developers need to express this K9db using the DATA_SUBJECT table level annotation as follows:

CREATE DATA_SUBJECT TABLE students (
  ID int,
  name text,
  PRIMARY KEY(ID)
);
> Query OK, 0 rows affected (0.049 sec)

Unlike assignments, K9db knows that data in students are associated with data subjects, and thus is affected by subject access requests (SARs). K9db also maintains a variety of compliance-critical invariants about data in such tables. For example, that each row in this table is associated with at least one data subject. For the students table this is trivially the case. But as we will see later, this becomes more interesting for more complex tables.

K9db does this by physically separating each data subject and data subject-associated table into per-user shards. The data in each table that is owned by a data subject is stored in that data subject's shard. Refer to our OSDI paper for more details on how this works, and what advantages it offers for compliance.

We can validate that K9db understood this for our students table above by running the SHOW SHARDS command, which displays to us all the existing types of data subject shards in the system, and how many data subjects there are. For now, this is 0, since no data has yet been inserted.

SHOW SHARDS;
> +------------+-------+
> | Shard Kind | Count |
> +------------+-------+
> | students   |     0 |
> +------------+-------+
> 1 row in set (0.001 sec)

The last table we have is submissions, which record every submission made by a student to an assignment. While rows in the table do not represent data subject, they are clearly associated to one. In particular, the student that made the submission has GDPR rights to that submission. Another way to look at this is to look at how this table would be expressed in MySQL: it will have a foreign key pointing to the student that created the submission, thus expressing that some relationship exists between that student and the submission.

In this case, one reasonable application policy may consider that a submission is owned by the student, in the sense that the student has full ownership/control over that submission: if a student issues a SAR to access or delete their data, all their submissions should be accessed or deleted respectively. We can express this by annotating that foreign key column with an OWNED_BY annotation.

CREATE TABLE submissions (
  ID int,
  student_id int,
  assignment_id int,
  ts int,
  -- Foreign key with ownership semantics
  FOREIGN KEY (student_id) OWNED_BY students(ID),
  -- Plain foreign key
  FOREIGN KEY (assignment_id) REFERENCES assignments(ID),
  PRIMARY KEY(ID)
);
Query OK, 0 rows affected (0.133 sec)

To reduce developer effort, K9db automatically annotates a foreign key to a data subject table or a table associated with a data subject with the OWNED_BY annotation, except when multiple such columns exist in the same table, in which case K9db issues an error asking developers to manually annotate the foreign keys as they see fit. In cases where the automatic deduction is incorrect, developers can override it by using REFERENCES ONLY when defining the foreign key.

In a different scenario, it may be reasonable to keep the submission data in some anonymized form, even after the student requested deletion, if this is required for the application and its parent institution to function correctly, e.g. in order to issue or validate a degree or transcripts to the student in the future. GDPR allows some flexibility in this regards, provided that such decision is justifiable due to legitimate interest, contractual obligations, compliance with other laws, and similar factors. We will see how such policies can be express to K9db later.

K9db provides a new command: EXPLAIN COMPLIANCE to help developers ensure that they expressed their desired policy correctly, and that K9db correctly understood it. We can use EXPLAIN COMPLIANCE to check that our schema indeed expresses the policy we want. Note that the output of EXPLAIN COMPLIANCE is currently displayed in the K9db server logs, which you may have open in a different terminal, rather than in the client that issued the command.

EXPLAIN COMPLIANCE;

# Output in K9db server logs
############# EXPLAIN COMPLIANCE #############
----------------------------------------- 
assignments: UNSHARDED
----------------------------------------- 
students: DATASUBJECT
----------------------------------------- 
submissions: SHARDED
  student_id           shards to students             (explicit annotation)
      via   submissions(student_id) -> students(ID)
  [Warning] This table is sharded, but all sharding paths are nullable. This will lead to records being stored in a global table if those columns are NULL and could be a source of non-compliance.
############# END EXPLAIN COMPLIANCE #############

The output of EXPLAIN COMPLIANCE indeed validates our policy: the assignments table is unrelated to any data subjects or their shards, and is thus unsharded. The students table represents data subjects. The submissions is related to a data subject, and thus sharded into per-user shards, and the relationship that determines this sharding is the student_id foreign key leading to the students table.

EXPLAIN COMPLIANCE correctly warns that this foreign key is nullable (a NOT NULL constraint is not applied to it), thus a row that has this column set to NULL will not be associated with any data subjects, and can thus lead to orphaned data and potentially violate compliance. Support for NOT NULL constraints is currently a work in progress.

Inserting data

We can insert data into these three tables as in regular SQL databases.

INSERT INTO assignments VALUES (1, 'assignment 1');
INSERT INTO assignments VALUES (2, 'assignment 2');
INSERT INTO students VALUES (1, 'Jerry');
INSERT INTO students VALUES (2, 'Layne');
INSERT INTO students VALUES (3, 'Sean');
INSERT INTO submissions VALUES (1, 1, 1, 1);
INSERT INTO submissions VALUES (2, 2, 1, 2);
INSERT INTO submissions VALUES (3, 3, 1, 3);
INSERT INTO submissions VALUES (4, 1, 1, 4);
INSERT INTO submissions VALUES (5, 1, 2, 5);
INSERT INTO submissions VALUES (6, 2, 2, 6);
INSERT INTO submissions VALUES (7, 3, 2, 7);

Note that we inserted each student prior to inserting any of their submissions. This is intentional. K9db ensures the referential integrity of each foreign key annotated with ownership semantics. Thus, inserting a submission for a student that does not yet exist will cause K9db to error. K9db will also error if inserting data with a duplicate PRIMARY KEY.

Note that many of these errors will be displayed in the K9db server logs rather than in the client. Furthermore, K9db may shutdown after encountering some of these errors due to implementation limitations. However, you can start K9db again after such an error, and it will reload without losing or corrupting any data.

You can similarly issue DELETE and UPDATE statements to K9db. These statements will also ensure referential integrity before committing their changes. Thus, deleting some rows in students that have no corresponding submissions will succeed, while deleting ones that have some corresponding submissions will not. K9db executes each statement in a transactions (somewhat similar to AUTO_COMMIT), thus a violation in one of the affected rows guarantees that the updates to other non-violating rows are not committed. K9db does not support updating PRIMARY KEY columns.

Queries

K9db supports issuing regular SQL queries. Internally, K9db translated these queries into queries over the relevant data subject shards using a variety of static analysis, query re-writing, and indexing techniques.

SELECT * FROM assignments;
+------+--------------+
| ID   | Name         |
+------+--------------+
|    2 | assignment 2 |
|    1 | assignment 1 |
+------+--------------+
2 rows in set (0.001 sec)

SELECT * FROM students;
+------+-------+
| ID   | name  |
+------+-------+
|    2 | Layne |
|    3 | Sean  |
|    1 | Jerry |
+------+-------+
3 rows in set (0.001 sec)

SELECT * FROM submissions;
+------+------------+---------------+------+
| ID   | student_id | assignment_id | ts   |
+------+------------+---------------+------+
|    6 |          2 |             2 |    6 |
|    2 |          2 |             1 |    2 |
|    7 |          3 |             2 |    7 |
|    3 |          3 |             1 |    3 |
|    4 |          1 |             1 |    4 |
|    5 |          1 |             2 |    5 |
|    1 |          1 |             1 |    1 |
+------+------------+---------------+------+
7 rows in set (0.001 sec)

SELECT * FROM submissions WHERE assignment_id = 1;
+------+------------+---------------+------+
| ID   | student_id | assignment_id | ts   |
+------+------------+---------------+------+
|    2 |          2 |             1 |    2 |
|    4 |          1 |             1 |    4 |
|    3 |          3 |             1 |    3 |
|    1 |          1 |             1 |    1 |
+------+------------+---------------+------+
4 rows in set (0.001 sec)

SELECT * FROM submissions WHERE student_id = 1;
+------+------------+---------------+------+
| ID   | student_id | assignment_id | ts   |
+------+------------+---------------+------+
|    4 |          1 |             1 |    4 |
|    5 |          1 |             2 |    5 |
|    1 |          1 |             1 |    1 |
+------+------------+---------------+------+
3 rows in set (0.001 sec)

SELECT * FROM submissions WHERE student_id = 1 AND assignment_id = 2;
+------+------------+---------------+------+
| ID   | student_id | assignment_id | ts   |
+------+------------+---------------+------+
|    5 |          1 |             2 |    5 |
+------+------------+---------------+------+
1 row in set (0.002 sec)

K9db also supports more complicated queries, such as ones that join multiple tables or aggregates data.

SELECT students.name as name, count(submissions.ID)
FROM students JOIN submissions ON students.ID = submissions.student_id
GROUP BY students.name
HAVING name = 'Layne';
+-------+-------+
| name  | Count |
+-------+-------+
| Layne |     2 |
+-------+-------+

In-memory caching

K9db provides an integrated built-in in-memory cache to accelerate expensive queries. This cache is continuously maintained and up-to-date using streaming data-flow processing, and thus is always in compliance with SARs. Unlike standalone caches, e.g. memcached, that need to be maintained by developers manually with respect to both application updates as well as SARs (for compliance). Our cache is not demand-filled, it always has the most update-to-date data and never encounters invalidation.

We can use this cache to accelerate the previous query as below. Note that when creating a view to cache the query, we did not specify the name of the student, and instead used a ?. This means that our cache is computed for all students, and its content are then keyed by the student name, to allow constant-time lookup into the cache. This also means that this view can be re-used for any query that has this structure even when it applies to a different student. K9db supports caching for queries with zero or more ?, as well as a mix of ? and concrete values.

-- Create a view for caching
CREATE VIEW count_view AS '"
    SELECT students.name as name, count(submissions.ID)
    FROM students JOIN submissions ON students.ID = submissions.student_id
    GROUP BY students.name
    HAVING name = ?"';
-- We can now query the view as many times as we want
SELECT * FROM count_view WHERE name = 'Layne';
+-------+-------+
| name  | Count |
+-------+-------+
| Layne |     2 |
+-------+-------+
1 row in set (0.001 sec)

To reduce implementation effort, we double dip on our dataflow caching infrastructure, and use it to both maintain explicitly cached queries, as well as to serve direct complex queries. This frees us form having to implement a fully-fledged query planner. Indeed, our non-cached version of the query that we used first is served by K9db by creating a one-off view. We can validate using the SHOW MEMORY and SHOW VIEW <view_name> commands, which display to the client the memory usage and the planned dataflow for each view.

Note that the earlier simpler queries did not create any views, as K9db can serve them efficiently from the database without the need for caching.

SHOW MEMORY;
+------------+-------------+----------+
| Flow       | Operator ID | Size(KB) |
+------------+-------------+----------+
| count_view | 6           |        0 |
| count_view | 5           |        0 |
| count_view | 7           |        0 |
| count_view | 4           |        0 |
| count_view | 3           |        0 |
| count_view | 2           |        0 |
| count_view | 0           |        0 |
| count_view | 1           |        0 |
| count_view | TOTAL       |        0 |
| _oneoff_0  | 7           |        0 |
| _oneoff_0  | 6           |        0 |
| _oneoff_0  | 8           |        0 |
| _oneoff_0  | 5           |        0 |
| _oneoff_0  | 4           |        0 |
| _oneoff_0  | 1           |        0 |
| _oneoff_0  | 3           |        0 |
| _oneoff_0  | 0           |        0 |
| _oneoff_0  | 2           |        0 |
| _oneoff_0  | TOTAL       |        0 |
| TOTAL      | TOTAL       |        0 |
+------------+-------------+----------+
20 rows in set (0.001 sec)

SHOW VIEW count_view;
+------+-----------+--------------------------------------------------------------+----------+---------+------------------------------------+
| ID   | Type      | Output Schema                                                | Children | Parents | Info                               |
+------+-----------+--------------------------------------------------------------+----------+---------+------------------------------------+
|    0 | INPUT     |  (|ID(INT,KEY)|name(TEXT)|)                                  | [3]      | []      |                                    |
|    1 | INPUT     |  (|ID(INT,KEY)|student_id(INT)|assignment_id(INT)|ts(INT)|)  | [2]      | []      |                                    |
|    2 | PROJECT   |  (|student_id(INT)|)                                         | [3]      | [1]     |                                    |
|    3 | EQUIJOIN  |  (|ID(INT,KEY)|name(TEXT)|)                                  | [4]      | [0,2]   |                                    |
|    4 | PROJECT   |  (|name(TEXT)|)                                              | [7]      | [3]     |                                    |
|    5 | AGGREGATE |  (|name(TEXT,KEY)|Count(UINT)|)                              | [6]      | [7]     |                                    |
|    6 | MAT_VIEW  |  (|name(TEXT,KEY)|Count(UINT)|)                              | []       | [5]     | [0] no order LIMIT = -1 OFFSET = 0 |
|    7 | EXCHANGE  |  (|name(TEXT)|)                                              | [5]      | [4]     | 0                                  |
+------+-----------+--------------------------------------------------------------+----------+---------+------------------------------------+
8 rows in set (0.001 sec)

Caching trades memory for efficiency, and thus makes most sense when the cache is re-used many times to serve a class of queries, rather than a single one-off query. Thus, we recommend that keying these views over ? parameters, and then querying them whenever needed with concrete values.

For terminal-based clients, this can only be done by creating views explicitly using CREATE VIEW and then querying them.

For applications, K9db can automate this when applications use (server-side) prepared statements. Prepared statements are usually created with ? parameters, which are then resolved using concrete values. Whenever a prepared statement is created, K9db analyzes it to determine if it is complex enough to warrant caching, and if so, create a cache for it. K9db then serves all future invocations of this prepared statement from that cache. When K9db analyzes prepared statements in canonicalize them so that prepared statement that are slightly different syntactically (e.g. have a different number of arguments in an IN clause) still resolve to the same cached view. This also avoids duplicating views when the same prepared statement is created multiple times. You can find an example of this in examples/PreparedStatements.java.

SARs and compliance operations

Whenever an application receives a SAR from a data subject, it can use the primitive GDPR operation provided by K9db to perform that SAR. Note that the application may need to perform additional handling of a SAR for relevant data stored outside the database (e.g. files uploaded by the user).

GDPR operations must specify the type of operation (GET or FORGET) and the type and ID of the data subject.

# Layne requests a copy of their data
GDPR GET students 2;

+------+------------+---------------+------+
| ID   | student_id | assignment_id | ts   |
+------+------------+---------------+------+
|    2 |          2 |             1 |    2 |
|    6 |          2 |             2 |    6 |
+------+------------+---------------+------+
2 rows in set (0.000 sec)

+------+-------+
| ID   | name  |
+------+-------+
|    2 | Layne |
+------+-------+
1 row in set (0.000 sec)


# Jerry requests all their data be deleted
GDPR FORGET students 1;
> Query OK, 4 rows affected (0.005 sec)

We can validate that Jerry's data was indeed deleted, including data in the cache. SELECT * FROM students; +------+-------+ | ID | name | +------+-------+ | 2 | Layne | | 3 | Sean | +------+-------+ 2 rows in set (0.001 sec)

SELECT * FROM submissions; +------+------------+---------------+------+ | ID | student_id | assignment_id | ts | +------+------------+---------------+------+ | 6 | 2 | 2 | 6 | | 2 | 2 | 1 | 2 | | 7 | 3 | 2 | 7 | | 3 | 3 | 1 | 3 | +------+------------+---------------+------+

SELECT * FROM count_view; +-------+-------+ | name | Count | +-------+-------+ | Layne | 2 | | Sean | 2 | +-------+-------+ 2 rows in set (0.001 sec)