Jonas Kunze edited this page Apr 6, 2014 · 3 revisions

UrlId extends Uid

This class identifies external URLs. It is a Uid whose type is UidTypes.URL. Instead of storing the timestamp and fine time, it stores the domain ID (2 bytes) and the file ID within this host (4 bytes). We can therefore store up to 2^16 ≈ 66 k domains and, for each domain, up to 2^32 ≈ 4.3 G files.
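As a rough illustration of the sizes above, the following sketch packs a 2-byte domain ID and a 4-byte file ID into a long. The class name UrlIdLayoutSketch and the bit positions (file ID in the low 32 bits, domain ID in the next 16) are assumptions made for this example; this page does not state the actual Uid layout or where the type is stored.

```java
// Hypothetical layout for illustration only; the real Uid bit positions
// are not given on this page.
final class UrlIdLayoutSketch {
    static final int FILE_BITS = 32;   // 4-byte file ID
    static final int DOMAIN_BITS = 16; // 2-byte domain ID

    /** Packs the domain ID and file ID into the low 48 bits of a long. */
    static long pack(int domainId, long fileId) {
        return ((long) (domainId & 0xFFFF) << FILE_BITS) | (fileId & 0xFFFFFFFFL);
    }

    /** Extracts the 16-bit domain ID. */
    static int domainId(long urlId) {
        return (int) ((urlId >>> FILE_BITS) & 0xFFFF);
    }

    /** Extracts the 32-bit file ID as an unsigned value held in a long. */
    static long fileId(long urlId) {
        return urlId & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        long id = pack(0xABCD, 4_000_000_000L);
        System.out.println(domainId(id) + " " + fileId(id));
        // Capacity: 2^16 domains and 2^32 files per domain
        System.out.println((1L << DOMAIN_BITS) + " domains, "
                + (1L << FILE_BITS) + " files per domain");
    }
}
```

The file ID is handled as a long here because Java's int is signed and half of the 2^32 values would otherwise appear negative.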

Definition of domain and file

Consider the following URL: http://sub.example.com/some/path?and=variables#

  • The domain is "example.com"
  • The file is "some/path?and=variables#"
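The split above could be sketched as follows. UrlPartsSketch is a hypothetical helper, not part of the described code base, and its rule of keeping only the last two host labels is a simplifying assumption: it would mishandle hosts under multi-label suffixes such as co.uk, and it does not strip ports or user info.

```java
// Hypothetical helper for illustration; assumes an absolute URL and a
// naive "last two host labels" rule for the domain.
final class UrlPartsSketch {
    /** Returns {domain, file} for the given URL. */
    static String[] split(String url) {
        // Drop the scheme ("http://") if present
        int schemeEnd = url.indexOf("://");
        String rest = schemeEnd >= 0 ? url.substring(schemeEnd + 3) : url;

        // Everything before the first '/' is the host, everything after is the file
        int slash = rest.indexOf('/');
        String host = slash >= 0 ? rest.substring(0, slash) : rest;
        String file = slash >= 0 ? rest.substring(slash + 1) : "";

        // Ignore subdomains by keeping only the last two labels of the host
        String[] labels = host.split("\\.");
        String domain = labels.length <= 2
                ? host
                : labels[labels.length - 2] + "." + labels[labels.length - 1];
        return new String[] { domain, file };
    }

    public static void main(String[] args) {
        String[] parts = split("http://sub.example.com/some/path?and=variables#");
        System.out.println(parts[0]); // example.com
        System.out.println(parts[1]); // some/path?and=variables#
    }
}
```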

UrlStore

The mapping between UrlIds (long) and the actual URLs (String) is stored in the UrlStore. Additionally, the largest file ID for each domain is stored; it is used to generate UrlIds for already known domains.

Key | Value
UrlId (long) | URL (String)
UrlId (long) | URL (String)
domain ID (short) | Max file ID (int)
"MaxDomainID" | Max domain ID (short)

Subdomains

To allow quick traversal over all UrlIds of the same domain, we map all subdomains to the same domain ID. To avoid conflicts, the following algorithm is applied when creating a new ID for an already known domain:

  1. Generate the domain ID ignoring the subdomain
  2. Generate the file ID
  3. Check if we've already stored the UrlId with the calculated domain and file ID
  4. If no, store the UrlId and the complete URL in the storage
  5. If yes, check whether the stored URL equals the new one (it will not if the new one has a different subdomain)
       1. If the URLs are the same, we are done (the URL is already stored)
       2. If the URLs are not the same, increment the file ID by 3 and go to step 3. Because 3 is odd, it shares no factor with 2^32, so stepping by 3 (wrapping around at 2^32) visits every file ID before repeating and therefore always finds a free slot if one exists.
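The steps above can be sketched with an in-memory map standing in for the UrlStore. Several details here are assumptions, since this page does not specify them: the class name UrlStoreSketch, the domain-name index, sequential domain IDs, the use of String.hashCode() as the initial file ID, and the bit layout of the resulting long.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; see the assumptions named in the text above.
final class UrlStoreSketch {
    private static final long FILE_MASK = 0xFFFFFFFFL;

    private final Map<Long, String> urlsById = new HashMap<>();     // UrlId -> URL
    private final Map<String, Integer> domainIds = new HashMap<>(); // hypothetical index

    /** Returns the UrlId for fullUrl, storing it if it is not yet known. */
    long idFor(String domain, String file, String fullUrl) {
        // Step 1: domain ID, ignoring the subdomain (assumed sequential;
        // the 16-bit limit is not enforced in this sketch)
        Integer domainId = domainIds.get(domain);
        if (domainId == null) {
            domainId = domainIds.size();
            domainIds.put(domain, domainId);
        }
        // Step 2: initial file ID (assumed: hash of the file string)
        long fileId = file.hashCode() & FILE_MASK;

        // Note: loops forever if all 2^32 file IDs of the domain are taken
        while (true) {
            long urlId = ((long) domainId << 32) | fileId;
            String stored = urlsById.get(urlId);   // Step 3
            if (stored == null) {                  // Step 4: free slot
                urlsById.put(urlId, fullUrl);
                return urlId;
            }
            if (stored.equals(fullUrl)) {          // Step 5.1: already stored
                return urlId;
            }
            fileId = (fileId + 3) & FILE_MASK;     // Step 5.2: probe next slot
        }
    }

    public static void main(String[] args) {
        UrlStoreSketch store = new UrlStoreSketch();
        long a = store.idFor("example.com", "some/path", "http://sub.example.com/some/path");
        long b = store.idFor("example.com", "some/path", "http://www.example.com/some/path");
        // Same domain ID, file IDs three apart
        System.out.println((a >>> 32) == (b >>> 32));
        System.out.println(((b - a) & FILE_MASK) == 3);
    }
}
```

Two URLs that differ only in their subdomain start at the same file ID, so the second one lands three slots further on while still sharing the domain ID.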