Skip to content

Latest commit

 

History

History
942 lines (687 loc) · 36.2 KB

UBERALLFS.org

File metadata and controls

942 lines (687 loc) · 36.2 KB

Uberallfs

Overview

The Idea

Uberallfs is a peer to peer distributed filesystem. Objects are cached on the accessing node. Mutation is implemented by passing Tokens around. Only the owner of the Token is eligible to mutate the data. This token can be requested from the current token holder by any entity which has permission to mutate the object. The old Token holder will then keep a reference to the entity which received the token. These references can then be walked to find the authoritative node at the end of the list. The actual implementation of this Idea becomes a bit more complex and offers certain optimization opportunities.

Entities who don’t have ‘write’ access do an object will never get the token. When read access is requested they synchronize the object with the nodes who hold the original data. This synchronization is managed by the current token owner.

Look here for examples how to use it

Features and Goals

Caching and pinning objects

Objects become cached upon access. There will be tools to enforce this caching and pin objects to a node.

Offline use

The filesystem caches objects and can be used when a node is offline. For writes this needs either the token to be locally obtained or the object can become ‘detached’ and changes need to be merged when connectivity is restored (which can be automatic if there are no conflicts).

Strong security and anonymity available

Objects can be either instantiated by a ‘creator’ who defines security policies about who has access to them OR published as anonymous/immutable.

mixed in private data

Objects may be private only and never be shared.

redundant storage

Different levels of redundancy are planned, raid1 like redundancy where sync/close calls only complete after the data is replicated or lazy backup schemes where data becomes synchronized at lower priority without blocking current access.

Garbage collection / Balancing

When space becomes scarce unused objects can be evicted from the cache. Either if this is just a copy (that is not used for redundancy) by deleting the object or by offering objects to other nodes within a configured realm of hosts.

Striped parallel downloads

If possible (later) Object transfer and syncronization can be spread over multiple peers to utilize better bandwidth sharing (Bittorrent alike).

Recoverable

The way objects are stored in the objectstore allows easy revival even without uberallfs available.

Design Choices

Uberallfs uses ‘opinionated’ Design. Protocols include a single version number which fully defines the properties, sizes and algorithms used. Future versions will be backward compatible to few older versions but eventually old versions will become unsupported (which may happen earlier when there are security related problems).

Version O is always defined to be ‘experimental’ it will be used in closed environments for testing and development, never in production. Any Version 0 Protocol outside of this environment is considered incompatible with itself.

Components

Following a coarse overview of the components making uberallfs. Details are described in later Chapters.

Object Store

At the core is a object store where all filesystem objects are cached. Later support for volatile objects is planned to allow once used streaming data. For Details see below.

The Objectstore expose a Virtual-File-System API to give a filesystem alike access layer.

Frontends

User-access to the underlying filesystem hierarchy. The primary goal is a Linux fuse filesystem which maps the underlying uberallfs to an ordinary POSIX conforming filesystem.

Later other front ends are planned. Android storage framework for example.

Object Discovery

As described in the introduction, the ‘trail’ pointer used to locate the node which is authoritative for a filesystem object is the main concept of uberallfs. Still there needs to be more to make this functional. For example Objects need to be recovered when the trail got broken (lost node). Only nodes which have full access to an object are allowed to become authoritative.

When a node becomes authoritative this does not mean that the data is available there, it only manages the ‘ownership’. The object metadata contains references to nodes who actually hold the data. For reading the data will be synchronized. While writing only invalidates the old references and instantiates new data locally.

Nodes without full access to objects can synchronize data as far they have permissions to do so and negotiate promises and leases with the authoritative node for race free data access.

Object Synchronization

Once access/authority to an object is granted the data may be synchronized (for reads). For this maps of byte-ranges and version/generation counts are used. There is no need for rsync like checksumming since the authoritative always knows which data is changed/recent.

Objects may become scattered across the nodes when frequent random writes at different locations of an object happen. This is mitigated by a low priority object coalescing which gather fragments and merges them on single nodes.

Access Control

Access control is implemented over public keys and signatures. The node which is authoritative over an object is responsible for enforcing the permissions. Access control metadata is sufficient enough to be freestanding without any additional information. Still due to the distributed nature there are some loopholes that can not be closed (discussed below). Basically any access ever granted can not be reliably revoked at a later time.

Details below.

Network / Sessions

A node establishes a session with another node on behalf of a user/key. Each session is then authenticated for this keys which is used for access control. Sessions are keep state for some operations. As long a session is alive these states are valid. When a session dies unexpectedly then these states and all associated data gets cleaned up/rolled back.

Handled by the Node.

Node Discovery

Nodes are addressed by their public keys. The last seen addresses and names of other nodes are cached for fast lookup. If that fails then a discovery is initiated (Details to be worked out).

Key Management

creates user and node keys, manages signatures/pki, key-agent process.

Distributed PKI

Future versions will include a distributed public key infrastructure. This augments the exiting Access control with more advanced features like:

  • web of trust for confirming identity and credibility of other keys
  • revoking signatures
  • key aliasing/delegation
  • key renewal.

Object Store

While uberallfs looks like a hierarchical filesystem, the backend store is a flat key/value object store. The keys are derived from universally unique and secure identifiers. Secure in this context means that not entity can create a collision that goes unnoticed. These identifiers resemble global unique inode numbers.

There are different object types of objects stored under a key, explained later in this document. The main parts are the ‘tree’ and ‘blob’ types. A ‘tree’ is an object that holds named references to sub-object keys much like a directory in a filesystem. Blob objects contain the file data. Other types contain metadata for security and distribution.

A mounted uberallfs uses a ‘tree’ object as the root of the mountpoint. From there on a hierarchy like with any other filesystem is created.

The difference here is that all objects can be distributed over the network and anyone (with permission to access the object) can references them within his own hierarchy. This for example allows a complete home directory to be shared as well as mounting the same object (directory) under different names at different positions in the hierarchy. For example one instance may name a directory ‘./Work/’ and another one refers to the same tree object as ‘./Arbeit/’.

Eventually (if one is careless) this could lead to directory cycles, which is the major difference to traditional filesystems where directory cycles are highly disregarded.

The most important difference to traditional filesystems is that Directories in uberallfs do not have parents. Frontends keep track of the directory traversing to for providing the parent directories.

Objects

A Object is defined by different parts:

The Object Type
Defines if it is a plain file, a directory and so on (in future a few more types will be supported).
The Identifier
Is a global unique 264 bit number (44 flipbase64 encoded characters). There are different types of identifiers which describe how the object is handled.
Data
The data of the object itself, could be a directory or file contents etc.
Metadata
Depending on the object type and identifier some extra metadata will be present, some is required (like ACL’s for Shared objects). Maps which show which nodes hold what version of the object data. Block hashes for torrent like distribution and some more.

Object Types

Directory

Stores references to other objects (trees, blobs, symlinks) May store Unix special files (fifo, sockets, device nodes) initially private, eventually network transparent nodes may be implemented.

File

The actual File data. can be sparse/incomplete with not yet synchronized data.

part

PLANNED: parts of blobs with own identifiers.

Identifier Types

A mutable objects are identified by a unique (random or hash) number while an immutable object is identified by a hash over its content. Objects which are constrained by permissions a digital signature is required to guarantee integrity (see below).

We can further deduce the necessity of 3 scopes where these keys are valid:

  1. private objects that must never be shared but is accessible to the local instance
  2. public objects that have ownership and access permissions
  3. anonymous objects without any ownership and public access

This leads to following 4 types of identifiers:

privatepublicanonymous
mutablerandomrandom signature¹
immutable²hash signaturehash

Note that there are 2 not supported combinations:

  1. Anonymous mutable data would lead security problems like denial of service attacks
  2. Having immutable private objects won’t have any security implications and may be supported at some point when need arises (eg. deduplication)

Eventually some more Types might be supported, for example hashing could be indirect being the hash over a bittorrent like list of hashes. This may even become the default for immutable objects at some point.

Plans

Later file encryption might be added. This is not directly on topic for uberallfs as objects are only distributed to nodes that are allowed to (at least) read them. File encryption would remove this requirement and allow proxying/caching on nodes that which don’t have access to the object.

Metadata Types

perm

Security manifest, access control and security related metadata.

meta

Extra metadata about authority/trail/generation/distribution.

dmap

Maps to the nodes holding the data for mutable files. Initially only complete objects, later byte ranges/multi node.

hash

Torrent like hash list for immutable files.

link

When an object type changes, its identifier changes. This .link type is then a pointer to the new identifier.

rule

  • Size restrictions for files.
  • Accepted filename patterns.
  • dirs/files only.
  • Change the properties/identifier of a file, eg. a when a ‘.mkv.part’ file becomes renamed to ‘.mkv’ its type is changed to ‘public immutable’.

It is planned to make a simple rule engine that automates policies on objects (mostly directories). For example:

Ideas

Keep lazy stats (coarse granularity, infrequently written to disk, with risk of loosing data in a crash)

atime
know when the object was last used
afreq
average frequency of use (rolling average?)

Disk Layout

There are (so far) three main components which need to be visible on the host filesystem. These are designed to be in the same place (shared directory) as well as in different places with the components shared over several uberallfs instances.

The basic use case is that all data resides in a single directory which also serves as mountpoint for the fuse filesystem, thus shadowing they underlying data.

objectstore

The objectstore can be freestanding/self contained no external configuration is needed.

objects/
used for the objectstore
objects/??/
any 2 character dir is used for the first level (4096 dirs, base64)
objects/root/
symlink to the root dir object
objects/tmp/
for safe tempfile handling
objects/delete/
deleted objects with some grace period
objects/volatile
can be a tmpfs for temporary objects
objects/volatile/??/
any 2 character dir is used for the first level (4096 dirs)
config/
configuration files
objectstore.version
version identifier

Planned: links to other objectstores on local computer, possibly on slower media for archives.

objects

Objects are stored within the first level (2 character) directory under their flipbase64 identifier. Any associated metadata will have the same name but a filename extension per kind of metadata.

Directories

Directories in the objectstore refer to the contained objects. This is implemented with some special marked symlink which is the flipbase64 identifier prefixed with .uberallfs.. This leverages the underlying filesystem semantics for lookup and other operations.

node

The ‘node’ manages the data distribution between other nodes, forming a peer to peer network.

For that it keeps the networks addresses of other nodes and manages network related keys.

config/
configuration files
nodes/??/
information about other nodes
keystore/
some of the keys used to operate the node. Others may be in ~/.config/uberallfs and are loaded on startup. Private keys will be isolated, TBD.
uberallfs.sock
socket for local node control
node.version
version identifier

fuse

When fuse gets mounted it may shadow all of the above and present POSIX compatible file system. Only files starting with ‘.uberallfs.’ at the root are reserved (control socket etc).

Permissions

Local permissions are treated as ‘voluntary’ in sense that a Node which gathers access to Data must not compromise the global security of the filesystem. The Objectstore itself runs as single user and uses permissions only to enforce the basic requirements (immutable objects become readonly and so on). Actual permission/access checks are managed by the outward facing VFS Api. This ensures security across the global network.

VFS

The ‘public’ API of the Objectstore is a virtual filesystem layer. Frontends like fuse use this to access objects. For this a Client has to authenticate against public Keys and used for permission checks.

Access Control

The ‘perm’ object type contains all metadata necessary for access control for the associated object. Any node is obliged to validate access rights on queries.

Identification

We must ensure that an Object Key and Identifier belongs to the Object in question and all following security metadata needs to be derived from this in a provable way. All public keys can be constrained by an expire date.

Identifier
A random number.
Creator
Public key of the Creator/expiration of this object. Can be only once used key which is deleted after initialization of the metadata. The expiration date here becomes part of the identifier. Once passed the object becomes invalid and can be purged.
Key Expire
Creation and expire parameters
Identifier Signature
The Identifier is signed with the Creators key.
Object Key
The Identifier and its Signature are hashed together to give the key used in the object store. This is not stored in the ‘perm’ object as it is the ‘name’ thereof itself.
Administrative Lists
Super Admins
A (optional) list of public key/expire tupes that are allowed to modify the per-permission admins below.
Super Admins Signature
The list of Super-Admins together with a nonce and the Identifier becomes signed by the Creator. This indirection allows to dispose the Creator key now and to delegate administrative task to multiple entities. Caveat: after the Creator key is disposed the Super-Admin list can not be changed anymore.
Per Permission Admins
Optional list for each possible permission (read, write, delete, append, …). Keys listed in these lists are allowed to modify the respective ACL’s below. (idea: permission tags on the lists itself: an admin may add/delete…)
Per Permission Admins Signature
Each of the lists above needs to be signed by the Creator or a Super-Admin. This signature contains a nonce and the Identifier as well
Access Control Lists
Optional list for each possible permission (read, write, delete, append, …). Keys listed in these lists are allowed to access the object in requested way.
ACL Signature
Each of the lists above needs to be signed by the Creator or a Super-Admin or a matching per-permission-Admin. This signature contains a nonce and the Identifier as well.
Generation Count and Signature
Whenever any data on the above got changed a generation counter is incremented and the all list blocks plus this generation counter must be signed by one of the above administrative Keys (usually the one who did the change).

TODO: creation date and expire parameters are required, shall these be signed here?

Brainstorm/Ideas

Quorum
M of N Admins must grant permission to be effective
Key revocation
special tree object which holds revoked signatures, must be safe against DoS, needs some thinking.
Serial Nonces
Rand(u128) number initially smaller than (MAX_U128-MAX_U64) they are incremented by adding a rand(u32)+1. Thus the magnitude is growing and one can compare that any ‘new’ value must be larger than the last known. This gives a (weak) protection against replay attacks without leaking any info about how frequently metadata got updated.

Security Implications

Replay Attack

TBD: in short one who once had (administrative) access to the object can replay that old version of the metadata under some conditions since the ‘trail’ and generation count can be incomplete. (write example how this can happen, any solution for this?)

  1. A creates a file with B and C as Admin
  2. B takes the token from A A->B
  3. C takes the token from B A->B->C
  4. C removes B from an Administrative list
  5. B takes the token from C back A->B<-C
  6. B replays the ‘perm’ metadata from 2. (gains Admin back)
  7. A takes the file from B but can not discover the tampering

The only ‘weak’ protection against this are the expiration dates. When these are short enough they limit the time window in which such an attack can be done and constrain the necessary lifetime for signature revocations.

Malicious Object Mutation

Can not happen because the token will never be given to a node that won’t have write access.

Privilege Escalation

Object persistence

Collisions

Concise Permissions

Uberallfs implements a set of concise permissions unlike traditional ‘rwx’ Unix permissions with their overloaded meaning for directories.

These permissions are mapped onto the available permissions of the target operating system. Permissions are tied to (lists of) public keys. There are no users and groups otherwise. There is one special (all zero?) Key which means ‘anyone’.

The local system/VFS layer maps Keys to local users to allow a straightforward view of the filesystem contents.

A permission which would allow full access (including deleting/overwriting) all data also allows a node to take authority over an object. Nodes which can’t gain authority over an object must pass their mutations to the authoritative node where they will be validated.

Access control is inclusive, when one could gain access because the key is listed in the respective Admin list, then one gets that permission implicitly.

Someone who gains the knowledge of an Identifier has also further access to inspect its metadata. Thus there are no permission checks on identifers themself. Only their lookup is validated.

File Permissions

File permission are initially relatively simple, only ‘append’ added over unix permissions. Should be self explanatory.

read
read object
write
This is the authoritative permission.
append

Directory Permissions

WIP!

With directories things become more complicated.

list
Allow listing of the directory filenames only (excluding their identifiers).
list-accessible
Listing is filtered to content where one has (any) access to.
list-authoritative
Listing is filtered to content where one has authority for.
read
Allow listing of the directory content including object identifiers
read-accessible
Listing is filtered to content where one has (any) access to.
read-authoritative
Listing is filtered to content where one has authority for.
add
Add new objects. Implies ‘list’.
add-authoritative
Only add objects where one is authoritative for. Implies ‘list-authorative’
add-anonymous
Add anonymous objects. Implies ‘list-accessible’.
rename
Rename an object within the same directory. Moving objects across directories are handled like add/delete on each directory. Implies ‘list’.
rename-authoritative
Rename an object within the same directory where one is authoritative for. Implies ‘list-authorative’.
rename-anonymous
Rename an anonymous object within the same directory. Implies ‘list-accessible’.
delete
Delete any object. This is the authoritative permission.
delete-authoritative
Delete objects where one is authoritative for.
delete-anonymous
Delete anonymous objects.

Further rules can be defined how objects are created, what extra permissions and keys apply (inherit from directory,..)

To prevent collisions, the ‘add’ and ‘rename’ permissions imply the necessary ‘list’ permissions that would make the destination visible. To successfully add or rename a file into an existing name one would need the permission to delete the old content as well.

Permission inheritance

TBD: what permissions do objects inherit from the parent (dir) additionally to the ones the creator set up.

Secure Metadata

leases
expire time for leases, default and per node pubkey. leases are persistent (stored in the token trail)
promises
expire time for promises, default and per node pubkey. promises are volatile and expire with the session.

The Node

Planned

Total Encryption

Any data send around is encrypted starting from the first bit (w/ the targets pubkey). Without knowledge of the keys not even protocol information is leaked. Incoming packets/connection are just dropped when they can’t be decrypted.

Realms

HowTo

Envisioned usage, work in progress.

Examples here using defaults for most options. Defaults should always be the be safe option.

Plumbing vs Porcelain

This examples starting with ‘plumbing’ commands to show the steps involved to set something up. When applicable ‘porcelain’ is added next to it, in general porcelain commands simplify usage, but depend on some preconditions, like that the filesystem is already set up and mounted (unless for the setup commands), contrary plumbing commands need access to the objectstore or node data and may not work when these directories are hidden behind the mounted filesystem.

Initialize and start a new uberallfs node

With private root

$ uberallfs objectstore ./DIR_A init
$ uberallfs node ./DIR_A init
$ uberallfs node ./DIR_A start
$ uberallfs fuse ./DIR_A mount
$ uberallfs init ./DIR_A
$ uberallfs start ./DIR_A
or
$ uberallfs insta ./DIR_A

Will result in a uberallfs mounted on ‘./DIR_A’ with a private (by default) root directory.

Make a Directory shareable

We created a ‘private’ root directory in the previous step. For being used as distributed directory its type must be changed.

$ uberallfs objectstore ./DIR_A chtype public_mutable /

This changes the type and sets up a minimal ACL to make the executing user Creator of the object.

Porcelain will only work on a running (mounted) filesystem.

$ uberallfs chtype public_mutable /path/to/root

Shared Root Dir

The root directory is nothing special an can be shared as any other object, the only difference is that the root directory must be present in the objectstore for almost all other operations (like mounting the file system). Thus objectstore initialization can already takes care for setting up the root directory.

On the new filesystem the node must be initialized first for exporting the (default generated) users public key.

$ uberallfs node ./DIR_B init
$ uberallfs node ./DIR_B export-key
base64encodedpubkey
$ uberallfs node ./DIR_B init
$ uberallfs export_key ./DIR_B
base64encodedpubkey
  • By exported Directory

    Give the new user/key access to the root directory in ‘./DIR_A’ and export it into an archive. This thin export only contains the minimum necessary metadata to reconstruct the content by querying the original node.

    $ uberallfs objectstore ./DIR_A chacl +super_admin base64encodedpubkey /
    $ uberallfs objectstore ./DIR_A send --thin / >ARCHIVE
        
    $ uberallfs chacl +super_admin base64encodedpubkey ./DIR_A
    $ uberallfs export ./DIR_A ARCHIVE
        

    Now we can import that archive as new root directory and go on.

    $ uberallfs objectstore ./DIR_B init --import ARCHIVE
    $ uberallfs node ./DIR_B start
    $ uberallfs fuse ./DIR_B mount
        
    $ uberallfs import --root ARCHIVE ./DIR_B
    $ uberallfs start ./DIR_B
        
  • By URL

    Instead importing an ARCHIVE one can also supply a URL the root dir will then be fetched over the network.

    The an URL has the form ‘uberallfs://host:port/identifier’ and can be shown by:

    $ uberallfs node ./DIR_A show --url /
    uberallfs://localhost:port/base64encodedidentifier
        
    $ uberallfs show-url ./DIR_A
    uberallfs://localhost:port/base64encodedidentifier
        

    This URL can then be used to bootstrap the new objectstore

    $ uberallfs objectstore ./DIR_B init --no-root
    $ uberallfs node ./DIR_B start
    $ uberallfs node ./DIR_B fetch uberallfs://localhost:port/base64encodedidentifier
    $ uberallfs objectstore ./DIR_B root --set base64encodedidentifier
    $ uberallfs fuse ./DIR_B mount
        

    ‘insta’ does all DWIM magic to get a uberallfs running. initialization, starting the node and mounting the filesystem. It possibly asks some interactive questions (for deploying keys). An existing dir will be reused if no data gets overwritten (same root again). By default an ‘insta’ created uberallfs is private but this can be overridden by the ‘–from’ and ‘–shared’ flags.

    $ uberallfs insta ./DIR_B --from uberallfs://localhost:port/base64encodedidentifier
        

Runtime Maintenance Commands

pinning

  • authorative Pins an object to be locally available, possible with short lease times to allow others to mutate it without proxying.
  • non authorative register at the current token holder that one wants to get a notification when the object changed (or is moved). This has only session persistence.

replication rules

Objects can hold a small list of peers where the data must be replicated. There are different modes of operation: N of M operations must succeed before returning, remaining are synced lazy Operations are write, fsync, close. The N of M can be required to be N different realms.

drop/gc

frees memory by dropping non used (lazy atime) non owned objects. may move owned objects away (asking some other node about taking over).

sync

Fetches and syncronizes all date (before going offline)

offline

turns the node into offline mode (with –timeout?) it wont try to access other nodes even when internet is up. normally unnecessary because reachability is determined automatically on a peer by peer base with some backoff mechanism.

detach

explicitly detach objects, so that they can be locally changed even when offline but may later be merged

merge

merge detached objects back. may need manual conflict resolution in case changes happened on both sides.

config

what happens when offline and not owning an object

  • On Read:
    • old version available, just cant sync
      • return stale data
      • block (with timeout, then one of the next)
      • EIO
      • EACCESS
    • data isn’t locally available (or incomplete)
      • block (with timeout, then one of the next)
      • EIO
      • EACCES
  • On Write:
    • block (with timeout, then one of the next)
    • auto detach
    • EIO
    • EACCESS

Problems/Solutions

Symlink escapes

Since normal directory objects can be linked at any position in a filesystem tree and have no implicit parent, symlinking into parents with ”/..’ becomes unreliable and even dangerous. For normal directory objects this becomes forbidden. The same is true for absolute symlinks.

This restriction is be removed for Private entries. Important note is that such directories can not be changed into PublicAcl shared directories in presence of such symlinks.

Later an alternative Directory type “DirectoryWithParent” may be introduced. Such Directories have some restrictions. They can only be linked to the parent defined there and thus can not be root nodes where the filesystem is mounted. Symlinks with parent refs ‘/..’ are allowed to cross into these Directories.

Distributed object deletion

Objects may be referenced from different locations all over the network. Deleting a object from a directory is as simple as just remove it from there when one has authority over the directory. But this does not mean the Object itself can be removed from the object store since other nodes may still refer to it.

Solutions
  • When no parts of the object are locally authoritative (no data!) then it can be removed.
  • Every Object has a ‘grace’ time for which it will be kept with a ‘deleted’ flag. Once this grace time is expired it can be deleted.
    • Any other node which references this object should poll the object within this grace time. When the authoritative node responds that the object ought to be deleted then
      • Node without full access may synchronize the object
      • Nodes with full access are advised to adopt the object.
        • Once adopted and all data is transferred the data can deleted. Metadata (trail) needs to stay alive until the grace time is expired.
  • May also provide an discard command that really deletes an object without grace time. Other nodes querying it then will get a ‘EEXIST’ and may decide how to go on (revive or discard)

Reviving an Object

Eventually objects may get lost when an node takes ownership but is not reachable anymore.

Such an object can then be revived by quering the trail if it is possible to reconstruct the last know state of the object. This may then be revived as ‘detached’ object or put alife again under a new Identifier. This is then per-parent directory as the new identifier is inserted there.

In the event that the initially unreachable node commes alive again, data must be merged from there. The lost node is responsible for merging this. Possibly reestablishing the old Identifier with the new content again.

Maybe mark new object metadata with a ‘revived $oldidentifier’, is this necessary?

Directories may have a flag that they are protected from ‘careless’ reviving because they are intended as mountpoint -> list of nodes/expire that (may) mount them (authoratively only)

The No-Parent Case

When mounting (authoratively) one needs to check that the dir didnt got revived by querying possible buddies:

  • walk trail/redunancy copies/authorative mount list

Worklog per node

limited in size and age

Rust Notes

Error handling

TBD

Logging

TBD, which logging lib?

what to log?

Prelude

log

Ideas