RA fails in batches due to TIMEOUT #1316

Open
PHA-SYSOPS opened this issue Jun 20, 2023 · 1 comment

@PHA-SYSOPS

It seems RA fails. The failure is two-pronged: it affects both pruntime v1 and v2, and each version reports it differently.

For v2 you will get this error, which is hard to interpret:
error 18; this may indicate that infrastructure for the epid attestation requested by gramine is missing on this machine

For v1 you will get a more descriptive error: `SGX_RA_TIMEOUT`

There are 2 conditions that trigger this error:

  1. Networking issues, mostly DNS. For example, the Docker container uses the wrong DNS servers, such as 127.0.0.1, which will not work.
  2. The reply from Intel is delayed, mostly due to routing issues between ISPs and Microsoft Azure.
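
To rule out condition 1 quickly, the container's resolver configuration can be inspected for loopback nameservers. The following is a minimal sketch, not part of pruntime: the function name `loopback_nameservers` is hypothetical, and it only parses resolv.conf-style text that you pass in. A loopback entry is a hint rather than proof of a problem, since Docker's embedded resolver on user-defined networks legitimately listens on 127.0.0.11.

```rust
use std::net::IpAddr;

/// Returns every `nameserver` entry in resolv.conf-style text that points at
/// a loopback address (for example 127.0.0.1 or ::1).
/// Hypothetical helper for illustration; not part of pruntime.
fn loopback_nameservers(resolv_conf: &str) -> Vec<IpAddr> {
    resolv_conf
        .lines()
        .filter_map(|line| {
            // Drop comments, then split the rest into whitespace-separated fields.
            let line = line
                .split(|c: char| c == '#' || c == ';')
                .next()
                .unwrap_or("");
            let mut fields = line.split_whitespace();
            match (fields.next(), fields.next()) {
                // Entries that do not parse as an IP address are skipped.
                (Some("nameserver"), Some(addr)) => addr.parse::<IpAddr>().ok(),
                _ => None,
            }
        })
        .filter(|ip| ip.is_loopback())
        .collect()
}
```

Inside the container, the input would typically be the contents of `/etc/resolv.conf`, read with `std::fs::read_to_string`. An empty result means no loopback resolver is configured.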

We are specifically looking at condition 2 here. If you run tcpdump, you will notice that the reply arrives more than 8 seconds after the request (in my case between 8.2 and 11.7 seconds). That is slow, but it should not be fatal. The delay is also not rate limiting: Intel signals rate limiting through HTTP status codes, not through delayed replies (see: https://www.intel.in/content/www/in/en/support/articles/000090552/software/intel-security-products.html)

The underlying code for this is :

`fn get_report_from_intel(quote: &[u8], ias_key: &str) -> Result<(String, String, String)> {
    let encoded_quote = base64::encode(quote);
    let encoded_json = format!("{{\"isvEnclaveQuote\":\"{encoded_quote}\"}}\r\n");

    let mut res_body_buffer = Vec::new(); // container for the response body
    let timeout = Some(Duration::from_secs(8)); // hardcoded 8-second timeout

    let url: reqwest::Url = format!("https://{ias_host}{ias_report_endpoint}").parse()?;
    info!(from=%url, "Getting RA report");`

As we can see, failures are neither caught nor retried, and the 8-second timeout is hardcoded. I would request that the timeout and the number of retries be made configurable via environment variables, and that the request be wrapped in a retry loop with error handling to solve this.
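
The request above could look roughly like the sketch below. It is an illustration under stated assumptions, not the pruntime implementation: the environment variable names `IAS_TIMEOUT_SECS`, `IAS_MAX_RETRIES`, and `IAS_RETRY_BACKOFF_SECS` are hypothetical, as are `RetryConfig` and `with_retries`. It uses only the standard library and leaves the actual HTTP call to a closure.

```rust
use std::env;
use std::thread;
use std::time::Duration;

/// Retry settings for the attestation request (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct RetryConfig {
    timeout: Duration,
    max_retries: u32,
    backoff: Duration,
}

impl RetryConfig {
    /// Builds the settings from a lookup function, so the parsing can be
    /// exercised without touching the real process environment.
    /// Missing or unparsable values fall back to the defaults.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Self {
        let number = |key: &str, default: u64| {
            get(key)
                .and_then(|v| v.trim().parse::<u64>().ok())
                .unwrap_or(default)
        };
        RetryConfig {
            // Default matches the 8 seconds that are hardcoded today.
            timeout: Duration::from_secs(number("IAS_TIMEOUT_SECS", 8)),
            max_retries: number("IAS_MAX_RETRIES", 3).min(u32::MAX as u64) as u32,
            backoff: Duration::from_secs(number("IAS_RETRY_BACKOFF_SECS", 1)),
        }
    }

    /// Reads the same settings from the process environment.
    fn from_env() -> Self {
        Self::from_lookup(|key| env::var(key).ok())
    }
}

/// Calls `attempt` with the configured timeout until it succeeds or the
/// retries are used up. With `max_retries = N` there are at most N + 1 calls.
/// The last error is returned if every call fails.
fn with_retries<T, E>(
    cfg: &RetryConfig,
    mut attempt: impl FnMut(Duration) -> Result<T, E>,
) -> Result<T, E> {
    let mut retries_used = 0;
    loop {
        match attempt(cfg.timeout) {
            Ok(value) => return Ok(value),
            Err(err) if retries_used >= cfg.max_retries => return Err(err),
            Err(_) => {
                retries_used += 1;
                thread::sleep(cfg.backoff);
            }
        }
    }
}
```

In this shape, the closure passed to `with_retries` would wrap the existing request code and apply the `Duration` it receives wherever the 8-second value is used today, with `RetryConfig::from_env()` supplying the settings at startup. A fixed backoff keeps the sketch short; a real change might prefer retrying only on timeouts and not on other errors.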

@kvinwang
Collaborator

Looks reasonable to me. Let me improve it later.
