Skip to content

Deeper Dive Into Schema Annotations

Kinan Bab edited this page May 28, 2023 · 1 revision

In this tutorial, we will highlight a few problematic patterns in web applications that may require more involved schema changes to be compatible with K9db.

For more details, consult the annotated schemas we provide for 7 applications in experiments/schema-annot.

Implicit Foreign Keys / Relationships

Many web applications do not actually spell out foreign key constraints in their schema. While they still have columns that act as foreign keys logically (i.e. represent a relationship between two tables), they do not apply the FOREIGN KEY or REFERENCES keywords to them.

This makes migrating such schemas to K9db tedious. Usually, it is sufficient to reason about only FK columns when selecting the correct K9db annotations to apply. However, when no FK constraints are explicit, we will need to consider all columns when thinking about annotations, while also reasoning about whether these columns represents a relationship or not.

We recommend introducing the FK constraints into the schema first, and then thinking about annotating them, rather than doing both at the same time.

De-normalized Schemas

Some applications may have a de-normalized schema. Sometimes, this may have an affect on annotating the schema.

We encountered one such pattern in ownCloud. Files in ownCloud may be shared with users or groups. This is in encoded in an oc_share table. In particular, this table contains a share_with column, that contains an identifier that may correspond to either a user or a group, depending on the value of the adjacent share_type column. K9db annotations can only be applied to a FK. However, share_with is not a valid FK, as it may be point to two different tables and columns based on a dynamic value in the row.

We address such scenarios by normalizing the schema. For example, in ownCloud, we add a new share_with_group column to the oc_share table, and make this new column a FK to the groups table, while making the share_with a FK to the users table only. We allow both columns to be nullable. Now, a row that represents a share with a group can set the share_with_group column but not the share_with, and vice-versa. We can now apply K9db annotations (specifically, ACCESSED_BY) to both of these columns to get the desired functionality.

Note that doing this in practice also requires modifying the application code to use the new schema correctly.

Ownership normalization

Schemas that are normalized in the traditional SQL sense may still be de-normalized with respect to ownership.

For example, in the conference submission system hotcrp, the PaperConflict table encodes a many-to-many relationship between papers and users, represented in the Paper and Contact tables. The exact relationship depends on the conflictType column, which can represent authorship, an institutional conflict, a recent collaborator, etc.

We would like to specify that users who are authors of a paper are owners of it. At the same time, users that have other types of conflicts with a paper have no rights to it (i.e., are neither owners nor accessors), but both relationships are represented via the same relationship via PaperConflict.

We address this by normalizing the schema with respect to ownership. We create a new table PaperAuthors that only contains records, and annotate the FK linking it to Paper with OWNS. We keep PaperConflict as is, but only use it to record conflicts different than authorship. Thus, we can keep its FK to Paper unannotated.

As in ownCloud, this also requires modifying the application code to use the new schema.

Implicit Data Subjects

K9db requires annotating one or more tables with DATA_SUBJECT. Each row in these tables must represent a unique human person with GDPR rights in the application. Some applications may not contain such an explicit table. For example, mouthful stores comments made by users in its Comment table, which contains an Author text column, that host applications use to identify authors of comments.

mouthful itself does not store any additional information about these authors. In particular, it has no users nor authors tables. The Comment table is not suited for the annotation DATA_SUBJECT, as it does not represent unique data subject, specifically its ID doesn't uniquely identify data subjects.

We introduce a new user table that we annotate with data subject. This new table only contains a text id column. We make the Author column in Comment a FK to to user(id). This has the desired behavior of identifying a single data subject, and also connecting that data subject with all comments they authored.