Stale Data with RPKI Validator 3 #275

Open
alkhos opened this issue Sep 17, 2020 · 4 comments

Comments

@alkhos

alkhos commented Sep 17, 2020

Hello,
I have RIPE RPKI Validator 3 deployed on a number of VMs running Ubuntu 18.04, using the Debian instructions in the wiki.
It seems that we are having a couple of issues with the validator, as it has been running there for a while now:

  1. Once in a while, the servers stop getting new data. I can see this by monitoring http:///api/trust-anchors/statuses and noticing that "lastUpdated" is lagging behind the current time by a matter of days (a rough sketch of such a staleness check is included at the end of this post). The situation goes away by restarting the validator (systemctl restart rpki-validator-3), but I was wondering if anybody has had a similar issue and, if so, what the cause of it has been.

  2. Our servers are also deviating in terms of the number of errors, warnings, and even the successful count in the same "trust-anchors/statuses" output when compared to RIPE's public server (https://rpki-validator.ripe.net/). I can see that a lot of these are errors in the following categories (as seen in the validation runs API):

  • crl.next.update.before.now
  • mf.past.next.update
  • cert.not.valid.after
  • validator.manifest.entry.found
  • validator.no.local.manifest.no.manifest.in.repository
  • cert.not.revoked
  • validator.no.manifest.repository.failed
  • validator.rpki.repository.pending

Also, almost all of these errors can be traced to RRDP repositories (and not the rsync ones).
We run the 8/20 build, for reference.
Is there any reason for such deviation? Or are there specific things we have to note in the configuration to avoid these situations?
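
For illustration, here is a rough sketch of the kind of staleness check mentioned in point 1. It assumes the validator listens on localhost:8080 and that each entry returned by /api/trust-anchors/statuses carries a "lastUpdated" timestamp (either epoch milliseconds or an ISO 8601 string); field names and the exact response shape may differ between builds, so treat it as a starting point rather than a drop-in script.

#!/usr/bin/env python3
# Rough staleness check for the RPKI Validator 3 HTTP API (sketch, not production code).
# Assumptions: validator on localhost:8080; each status entry has a "lastUpdated"
# field (epoch milliseconds or ISO 8601 string) and, possibly, a "taName" field.
import json
import sys
import urllib.request
from datetime import datetime, timezone

VALIDATOR = "http://localhost:8080"   # adjust to your deployment
MAX_AGE_SECONDS = 2 * 3600            # complain if a TA has not updated for 2 hours

def parse_timestamp(value):
    # Accept either epoch milliseconds or an ISO 8601 string such as "2020-09-17T07:00:00+00:00".
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value / 1000.0, tz=timezone.utc)
    return datetime.fromisoformat(str(value).replace("Z", "+00:00"))

def main():
    with urllib.request.urlopen(f"{VALIDATOR}/api/trust-anchors/statuses") as resp:
        body = json.load(resp)

    # The payload may be a bare list or wrapped in a "data" key depending on the build.
    statuses = body.get("data", body) if isinstance(body, dict) else body

    now = datetime.now(timezone.utc)
    stale = []
    for ta in statuses:
        age = (now - parse_timestamp(ta["lastUpdated"])).total_seconds()
        if age > MAX_AGE_SECONDS:
            stale.append((ta.get("taName", "unknown"), int(age)))

    if stale:
        for name, age in stale:
            print(f"STALE: {name} last updated {age} seconds ago")
        sys.exit(1)
    print("OK: all trust anchors updated recently")

if __name__ == "__main__":
    main()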

@JvGinkel

JvGinkel commented Oct 5, 2020

I had the same; for some reason my validator's last update was almost two months ago. Today I updated to the latest version and did some OS updates, and after the restart it is up to date again. I added a monitoring check that will alert me if the cache is older than 5 days.

@wibisono
Contributor

wibisono commented Oct 6, 2020

Hi,

After the 8/20 build we have made releases that fix a potential deadlock, e.g. release 9/18. The stale updates might be related to this issue that we occasionally encounter; hopefully upgrading to the latest version will solve it.

Please let us know if the problem persists after updating to the latest release.

@lukastribus

> I added a monitoring check that will alert me if the cache is older than 5 days.

5-day-old VRPs on a ROV-enabled production network is way too much. Please consider dropping the alert threshold to something like 2 hours instead. What if your validator is buggy and a validation run takes 4 hours instead of 5 minutes? You would never know.

ROV is supposed to converge fairly quickly; "days" is not a term we should ever have to use...

@ties
Member

ties commented Oct 7, 2020

> 5-day-old VRPs on a ROV-enabled production network is way too much. Please consider dropping the alert threshold to something like 2 hours instead. What if your validator is buggy and a validation run takes 4 hours instead of 5 minutes? You would never know.

> ROV is supposed to converge fairly quickly; "days" is not a term we should ever have to use...

In practice a TA may not update for an extended period, causing spurious alerts (in my experience only on APNIC's AS0 TA). This Prometheus alert works pretty well for me (no false positives and no alerts since the last release):

alert: ValidatorDown
expr: time() - rpkivalidator_last_validation_run{trust_anchor!="rsync://rpki-as0.apnic.net/repository/APNIC-AS0-AP/apnic-rpki-root-as0-origin.cer"} > 3600
for: 5m
labels:
  severity: critical
annotations:
  summary: Trust anchor {{ $labels.trust_anchor }} has not updated for 60 minutes.

We are discussing some changes for improved scheduling of validation and quicker convergence (and bootstrapping), which should also improve reliability. This may end up in one of our next sprints.
