
Add configurable cache entry expiration #31

Open
elrayle opened this issue Nov 2, 2016 · 18 comments
@elrayle

elrayle commented Nov 2, 2016

I am exploring adding cache entry expiration. I would like to get feedback from those using linked-data-fragments for caching.

Approach:

TimeToLive configuration - A global TimeToLive value will serve as the default. A TimeToLive interval can also be defined per host; the default is used when the current URI's host has no TimeToLive of its own.

ExpirationDT for a URI - Each cached URI will have an extra triple added to identify the date-time on which the cached entry for the URI expires and becomes invalid. ExpirationDT = date_retrieved + TimeToLive(URI_host)
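As a sketch, the TTL lookup and ExpirationDT computation described above could look like this in Ruby (all class and method names here are hypothetical, not the gem's actual API):

```ruby
require 'uri'

# Hypothetical sketch: a global default TimeToLive with per-host overrides.
class CacheExpirationConfig
  def initialize(default_ttl:, host_ttls: {})
    @default_ttl = default_ttl # seconds
    @host_ttls = host_ttls     # host => seconds
  end

  # TimeToLive(URI_host): the per-host value if configured, else the global default
  def ttl_for(uri)
    @host_ttls.fetch(URI(uri).host, @default_ttl)
  end

  # ExpirationDT = date_retrieved + TimeToLive(URI_host)
  def expiration_for(uri, date_retrieved: Time.now)
    date_retrieved + ttl_for(uri)
  end
end
```

For example, `CacheExpirationConfig.new(default_ttl: 86_400, host_ttls: { 'id.loc.gov' => 604_800 })` would let one authority's entries live a week while everything else expires daily.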

Modifications to Retrieval Algorithm

    if ExpirationDT < now, attempt to get from source
        if success
            update URI's cached value
            reset ExpirationDT
            return value
        if source unavailable
            do NOT adjust ExpirationDT
            Log host out of service
            return value
    else
        return value from cache
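The retrieval steps above could be sketched like so (a rough Ruby sketch; `cache`, `source`, and the cache entry shape are assumptions, not the gem's real interfaces):

```ruby
require 'uri'

# Sketch of the proposed retrieval algorithm. `cache.fetch` is assumed to
# return { value:, expiration_dt: } or nil; `source.get` raises if the host
# is unreachable. Both are hypothetical stand-ins.
def retrieve(uri, cache:, source:, ttl:)
  entry = cache.fetch(uri)
  return fetch_and_cache(uri, cache, source, ttl) if entry.nil?

  if entry[:expiration_dt] < Time.now
    begin
      # success: update the URI's cached value and reset ExpirationDT
      fetch_and_cache(uri, cache, source, ttl)
    rescue StandardError
      # source unavailable: do NOT adjust ExpirationDT; log and serve stale value
      warn "host out of service: #{URI(uri).host}"
      entry[:value]
    end
  else
    entry[:value] # still fresh: return value from cache
  end
end

def fetch_and_cache(uri, cache, source, ttl)
  value = source.get(uri)
  cache.store(uri, value: value, expiration_dt: Time.now + ttl)
  value
end
```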

Predicate for ExpirationDT - I have not found a predicate that matches the concept exactly. The closest I have at the moment is http://vivoweb.org/ontology/core#expirationDate. I am open to suggestions for an alternate predicate.

Other additions that could be part of this work.

Optional ForceRecache - The retrieve method could take a parameter that lets the caller request that the URI's cache entry be refreshed from the source. What would you want returned if the host is out of service?

LastModifiedDT - Add a new triple that holds the LastModifiedDT for the cache.

Thoughts on predicate choices. I am somewhat hesitant to use existing predicates that aren't cache specific. If the cached URI happens to use the same predicates, they would get clobbered by the cache-added predicates. I'd like to see the predicates cache_expiration_dt and cache_last_modified_dt, and possibly cache_create_dt. Other thoughts?

Please comment on this approach as soon as you can. I am looking at beginning work as soon as I get feedback.

@elrayle elrayle self-assigned this Nov 2, 2016
@hackartisan

Can you give an example of the extra triple?

@anarchivist

Can you talk more about the motivation for adding this as a triple? I'm concerned about the potential impact here. Is there an assumption that this triple would be included in the serialized representation?

@acoburn

acoburn commented Nov 2, 2016

The way Marmotta handles this is like so: triples from the resource are cached in one location and metadata about those cached triples is stored separately (i.e. when the triples were retrieved). This way, it is possible to configure a TTL globally (or by endpoint) without mixing the metadata with the triples from the resource. Marmotta also does not use RDF to store that metadata, nor is there any inherent need to do so. As an example, Marmotta's file-based cache looks like this:

http://localhost:8080/fcrepo/rest/test
1470333573969 # last retrieved: 2016-08-04 13:59:33.969
1470419973969 # expires: 2016-08-05 13:59:33.969
1 # 1 updates
0 # 0 triples

This way, you don't mix the resource triples and the metadata about those triples; nor do you run into namespace clashes between that metadata and the triples themselves.

@elrayle

elrayle commented Nov 2, 2016

@acoburn Thanks for that info. I am new to Marmotta and linked data fragments, so pointers are appreciated.

My only concern is that this gem also provides caching in Blazegraph. My approach would have to be compatible with Marmotta and Blazegraph (and other potential repositories).

@elrayle

elrayle commented Nov 2, 2016

@HackMasterA

<subject_URI> <http://vivoweb.org/ontology/core#expirationDate> "2016-11-02T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

@tpendragon

Considering the use case here is effectively a reverse proxy cache for an external RDFSource, I'm 👎 on triples attached to the subject URI for configuration of the caching system. I might be able to be convinced that a second URI (or maybe a named graph, but I'm iffy there too) which is included in the response could have that, though.

The other thing is there's no way to update triples in this gem now, sort of on purpose. You'd need that functionality to add the expiration triple, yes? The use cases were always simple: cache external responses and provide information about that cache. I think there's a benefit to keeping it that way.

Global TTL seems like a good config option. The Marmotta backend's always had it, but surfacing it in this layer is a good thing for those backends that don't have caching built in.

@elrayle

elrayle commented Nov 2, 2016

@anarchivist The motivation is that some authorities change the display string, and potentially other triple values, associated with a controlled vocabulary term. If you capture the triples associated with a URI once and never update them, you will be using a stale cache value. A configurable TimeToLive value lets you invalidate subject_URIs in the cache, forcing a refresh from the original source. Making TimeToLive configurable per host allows a more flexible approach to cache refresh, so that an authority that rarely modifies its data can have a longer TimeToLive setting than one that modifies data frequently.

@tpendragon

I will say, I have a feeling that most users of this don't want hard expirations - they want something like periodic updates from upstream. If the remote source goes down, you want your cache to keep working even if the TTL has expired.

@tpendragon

(They might not even want AUTOMATIC periodic updates - I've heard concerns about data drift in remote sources before, but maybe that's a second product which mints a sameAs URI for temporal locks)

@elrayle

elrayle commented Nov 2, 2016

When a subject expires, the proposal says to attempt to get it from the source and, if unsuccessful (i.e., the server is down), fall back to the cache.

@tpendragon

@elrayle Yeah, I think I could agree if the workflow was more like "if TTL was past, queue up a refresh in the background and serve up a response QUICKLY anyways, with a header saying it's stale", then have some method to block the response while waiting for a cache update.

@elrayle

elrayle commented Nov 2, 2016

One piece I left out of the proposal that we were discussing locally is having something like a cron job that crawls the cache at night and attempts a refresh on expired subject_URIs.

BTW, I like the idea of using a named graph to hold expiration dates. That avoids potential conflict with the cached data.

@tpendragon

Basically my use cases around TTL are these:

  1. It needs to be easy for me to get a fast response from already-cached triples even if the remote source is down or slow, no matter what my TTL is.
  2. I need to easily be able to get the exact response I would have gotten from the source, with no modifications.
  3. I need a way to force a refresh of one URI and wait on it, and have some indication that the forced refresh succeeded. (Maybe this is just a header asking for a modified date past the last time it was refreshed? I dunno)

Anything which solves those three use cases I'm 👍 for.

@acoburn

acoburn commented Nov 2, 2016

@elrayle You may want to take a look at the Marmotta LDCache interface for inspiration.

In particular, the get(URI, RefreshOpts) method accepts both a URI and a RefreshOpts value that determines exactly how to handle stale entries.

@elrayle

elrayle commented Nov 2, 2016

@tpendragon 1 and 3 make sense to me. Can you expand on 2? I think I know what you mean, but want to be sure.

@elrayle

elrayle commented Nov 2, 2016

I would be fine with @tpendragon's suggestion for a modification to the retrieval algorithm...

retrieve from cache
if ExpirationDT < now, start background job to get from source and update ExpirationDT
return value from cache
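That flow might look roughly like this (Ruby sketch; `job_queue` and `:refresh_cache` are hypothetical names for whatever background mechanism ends up being used):

```ruby
# Sketch of the revised algorithm: always answer from the cache immediately,
# and refresh stale entries out of band.
def retrieve(uri, cache:, job_queue:)
  entry = cache.fetch(uri)
  if entry[:expiration_dt] < Time.now
    # expired: queue a background refresh that will also reset ExpirationDT
    job_queue.enqueue(:refresh_cache, uri)
  end
  entry[:value] # return value from cache, fresh or stale
end
```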

@hackartisan

I agree with the direction of this discussion. The example triple confuses me because it appears to conflate a real-world object with its URI representation. If the subject URI denotes, say, a person in a name authority, you would essentially be asserting that the person has an expiration date. It seems like some form of reification would solve that problem, though I'm not sure exactly what that would need to look like.

@elrayle

elrayle commented Nov 2, 2016

@HackMasterA I see your point. I think it would be easy to avoid the triple in Marmotta based on the feedback from @acoburn. Blazegraph and other repository implementations may be more challenging.

I would be less concerned with the conflation with a real-world-object if the predicate were better named, e.g. cache_expiration_dt.

Based on feedback, for triplestore implementations, I propose...

  • the triple for expiration would be stored in a named graph
  • the expiration triple would NOT be returned as part of the set of triples retrieved from the cache for the given subject_URI
  • the expiration datetime could be returned in the header

For Marmotta, I would use the internal mechanism already in Marmotta.
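To illustrate the separation (an in-memory Ruby sketch, not tied to any particular triplestore; the class and header names are made up):

```ruby
require 'time'

# Sketch: cached triples and cache metadata kept apart, so the expiration
# never appears among the subject's own triples but can surface in a header.
class TripleCache
  def initialize
    @triples  = Hash.new { |h, k| h[k] = [] } # subject_URI => triples
    @metadata = {}                            # subject_URI => { expiration_dt: }
  end

  def store(subject, triples, expiration_dt:)
    @triples[subject] = triples
    @metadata[subject] = { expiration_dt: expiration_dt }
  end

  # Only the resource's own triples come back from a cache read...
  def triples(subject)
    @triples[subject]
  end

  # ...while the expiration is exposed separately, e.g. as a response header.
  def headers(subject)
    { 'X-Cache-Expiration' => @metadata[subject][:expiration_dt].utc.iso8601 }
  end
end
```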
