[Update Center] Fix HTTP/404 errors due to broken links in HTML listing pages and missing ?uctest endpoint #4311

Closed
dduportal opened this issue Sep 28, 2024 · 20 comments · Fixed by jenkins-infra/update-center2#810

Comments

@dduportal
Contributor

As described in #2649 (comment), the HTML files generated by jenkins-infra/update-center2 use relative links.

This used to work well in the past, when both updates.jenkins-ci.org and updates.jenkins.io served the files themselves.

But it is now an issue in the context of the new Update Center system which uses HTTP(S) mirrors to serve content to end users to:

Examples of pages:

@dduportal
Contributor Author

Comment by @daniel-beck about the bandwidth in a discussion we got together on this topic:

By @dduportal
I don't recall the exact amount of data transferred but it was huge even for these tiny HTML files. We're speaking about Tbs per month (globally, it's 50 Tb per month)

Did you just group by file extension, or also path? Because some of the "JSON" files also have an HTML file extension. So if you count https://updates.jenkins.io/update-center.json.html as HTML, that'll skew this a lot.

=> Important point as it means we could have to change the routing pattern.

Cloudflare Analytics shows that HTML was far behind in amount of requests but we can't tell the different HTML files apart:

[Screenshot 2024-09-28 at 10:26:45]

@dduportal
Contributor Author

dduportal commented Sep 28, 2024

Proposal: Given the context of the new Update Center, let's use absolute URL links.

What are your thoughts on this @daniel-beck @timja @MarkEWaite ?
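To illustrate the proposal, here is a minimal sketch of how a browser resolves a link found on a listing page, depending on which host served that page. The mirror hostname and the plugin file path are hypothetical and only serve the example; the real generated pages may use different link shapes.

```python
from urllib.parse import urljoin

# Hypothetical URLs: the mirror hostname and the .hpi path are illustrative only.
CANONICAL_PAGE = "https://updates.jenkins.io/download/plugins/htmlpublisher/"
MIRROR_PAGE = "https://mirror.example.org/download/plugins/htmlpublisher/"

RELATIVE_LINK = "1.0/htmlpublisher.hpi"
ABSOLUTE_LINK = (
    "https://updates.jenkins.io/download/plugins/htmlpublisher/1.0/htmlpublisher.hpi"
)


def resolve(page_url: str, href: str) -> str:
    """Resolve an href against the URL of the page that contains it."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    for page in (CANONICAL_PAGE, MIRROR_PAGE):
        print(page)
        # A relative link follows whichever host served the page.
        print("  relative ->", resolve(page, RELATIVE_LINK))
        # An absolute link always points back to the canonical host.
        print("  absolute ->", resolve(page, ABSOLUTE_LINK))
```

With a relative link, a page served from a mirror points at the mirror, which may not have the target; an absolute link is unaffected by where the page came from.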

@timja
Member

timja commented Sep 29, 2024

Absolute URL makes sense to me.

@daniel-beck

Cloudflare Analytics shows that HTML was far behind in amount of requests

It's by far the most popular content type? How does that make any sense?

Is this just the tool installers via DownloadService or are we still downloading the update-center.json.html from Jenkins?

It doesn't look like we understand enough what's going on here to base any decisions on.

@dduportal
Contributor Author

Cloudflare Analytics shows that HTML was far behind in amount of requests

It's by far the most popular content type? How does that make any sense?

Is this just the tool installers via DownloadService or are we still downloading the update-center.json.html from Jenkins?

It doesn't look like we understand enough what's going on here to base any decisions on.

We understand the mirroring mechanism, which is why I opened this issue. If we start selecting which files are mirrored and which ones are not, the architectural complexity will be a pain, as we will need to maintain a list of conditions. It is already nightmare-ish on get.jenkins.io tbh.

Hence the question about the pros and cons of switching to absolute URLs, which is not mutually exclusive with analysing usage to understand it better.

The costs involved here are huge compared to optimization, but it is mandatory to have a finer-grained understanding.

@dduportal
Contributor Author

Hello @daniel-beck 👋

Cloudflare Analytics shows that HTML was far behind in amount of requests

It's by far the most popular content type?

My apologies, I mistakenly used the word "behind". You are correct, I meant that HTML seems to be, by far, the most popular type of file downloaded, at least as per the Cloudflare dashboard during the 24 hours experiment.

Let me check if we see the same result on the current VM (analysing the logs from a few days ago).

How does that make any sense?

I don't know. Let's compare with current behavior.
That could also be "assumed" content type (including HTTP/404) as they are served as HTML as well.

Is this just the tool installers via DownloadService or are we still downloading the update-center.json.html from Jenkins?

I ... don't know. We did not even know there was an HTML version of this one. Where should we look (except our access logs)?

@dduportal
Contributor Author

Initial check for the 09 October 2024 (both HTTP and HTTPS, both updates.jenkins-ci.org and updates.jenkins.io vhosts):

  • ~8,478,760 hits

  • ~444,350 visitors

  • ~5,000,000 redirections (HTTP/3XX) for around 1.2 GiB

  • ~3,200,000 files served (HTTP/2XX) for around 2.1 TiB

  • ~257,890 client errors (HTTP/4XX) for around 43 MiB

Report (generated with GoAccess from the "combined" access log):

report.html.zip
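As a quick sanity check on the summary above, the following sketch derives the share of hits and the average response size per status class. It only reuses the approximate figures quoted in this comment. The three classes sum to slightly less than the reported total of hits, so the sketch uses their sum as the denominator (an assumption on my side).

```python
GIB = 2**30
TIB = 2**40
MIB = 2**20

# Approximate figures copied from the summary above (hits, bytes transferred).
STATUS_CLASSES = {
    "3XX": {"hits": 5_000_000, "bytes": 1.2 * GIB},
    "2XX": {"hits": 3_200_000, "bytes": 2.1 * TIB},
    "4XX": {"hits": 257_890, "bytes": 43 * MIB},
}


def summarize(classes: dict) -> dict:
    """Return share of hits and average bytes per response for each status class."""
    total_hits = sum(c["hits"] for c in classes.values())
    return {
        name: {
            "share_of_hits": c["hits"] / total_hits,
            "avg_bytes": c["bytes"] / c["hits"],
        }
        for name, c in classes.items()
    }


if __name__ == "__main__":
    for name, stats in summarize(STATUS_CLASSES).items():
        print(
            f"{name}: {stats['share_of_hits']:.1%} of hits, "
            f"~{stats['avg_bytes']:,.0f} bytes per response"
        )
```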

@dduportal
Contributor Author

@daniel-beck If we compare with the Cloudflare numbers for 24 hours, which only include HTTP/2XX and HTTP/4XX (as the redirects are NOT sent to Cloudflare), it matches:

  • Total requests on Cloudflare were ~10.8M
    • 7.67M were HTTP/4XX => that leaves ~3.2M HTTP/2XX, which is the same number as what we see on the actual production.
  • 9.88M requests were "HTML": that's ~2.21M HTTP/2XX HTML types (removing the HTTP/4XX)
    • JSON (which are HTTP/2XX) are ~1.14M, which means we have 1/3 of JSON and 2/3 of HTML (all files included, tools installer and metadata).

Need to check the HTML/JSON distribution on the current production, but the high rate of HTTP/4XX clearly explains the ratio change during the brownout.

It also adds more weight to using absolute URLs in the generated HTML files to decrease this amount of HTTP/4XX.
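The arithmetic behind the comparison can be sketched as follows. The figures are the Cloudflare dashboard numbers quoted in this comment (in millions of requests); the sketch assumes, as the comment does, that every HTTP/4XX response was counted under the "HTML" content type.

```python
# Millions of requests over 24 hours, as quoted from the Cloudflare dashboard.
TOTAL = 10.8
CLIENT_ERRORS_4XX = 7.67
HTML_ALL = 9.88
JSON_2XX = 1.14


def breakdown(total: float, errors: float, html_all: float, json_ok: float) -> dict:
    """Split successful requests into HTML and JSON, removing 4XX from the HTML count."""
    ok_total = total - errors
    # Assumption: all 4XX responses were served (and counted) as HTML.
    html_ok = html_all - errors
    return {
        "ok_total": ok_total,
        "html_ok": html_ok,
        "json_ok": json_ok,
        "html_share": html_ok / (html_ok + json_ok),
    }


if __name__ == "__main__":
    result = breakdown(TOTAL, CLIENT_ERRORS_4XX, HTML_ALL, JSON_2XX)
    print(f"HTTP/2XX total: ~{result['ok_total']:.2f}M")
    print(f"HTTP/2XX HTML:  ~{result['html_ok']:.2f}M")
    print(f"HTML share of HTML+JSON hits: {result['html_share']:.0%}")
```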

@daniel-beck

daniel-beck commented Oct 16, 2024

Some of the data in the report makes no sense at all. Could you point me to the raw access logs? I want to check a few things.

1/3 of JSON, and 2/3 of HTML

The problem with this view is that there are different kinds of HTML files on this domain.

The ones that this issue is about (those in https://updates.jenkins.io/download/ ) are never used programmatically unless someone's wget --recursive goes brrrr.

Various update-center.json.html exist and are irrelevant for this topic. Half the tool installer files (e.g. in https://updates.jenkins.io/updates/ ) are HTML files and are irrelevant for this topic.

@dduportal
Contributor Author

Some of the data in the report makes no sense at all. Could you point me to the raw access logs? I want to check a few things.

the report was generated from the access logs on the pkg machine. I used the gzipped logs with the name pattern access20241003gz. Got 4 files (unsecured and secured, for both hostnames)

@dduportal
Contributor Author

Some of the data in the report makes no sense at all. Could you point me to the raw access logs? I want to check a few things.

the report was generated from the access logs on the pkg machine. I used the gzipped logs with the name pattern access_20241003_gz. Got 4 files (unsecured and secured, for both hostnames)

Additions:

  • I concatenated the 4 access log files from production into a single one and ran the goaccess tool on it (specifying the combined log format). The "concatenated" file weighs 1.2 Gb: do you want me to send it to you (compressed) through a private channel @daniel-beck to avoid further unneeded tasks for you?

@dduportal
Contributor Author

The ones that this issue is about (those in https://updates.jenkins.io/download/ ) are never used programmatically unless someone's wget --recursive goes brrrr.

Yes, but we are losing track of the initial problem: using absolute URLs in the links of these specific HTML files, because the mirror system architecture ends up with these files served by a domain other than updates.jenkins.io due to redirections.

I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files.
Unless you want to check the usage for actions (blockers or optimizations) if the wget --recursive is used?

What did I miss?

@daniel-beck

daniel-beck commented Oct 17, 2024

As the log demonstrates, the HTML files discussed in this issue are completely irrelevant for traffic.

The most popular URL that this issue is about is accessed just 24 times across the 4 logs:

  24 /download/plugins/htmlpublisher/

Compared to:

508498 /updates/hudson.tasks.Maven.MavenInstaller.json.html
387857 /updates/hudson.tasks.Ant.AntInstaller.json.html
339259 /updates/hudson.plugins.gradle.GradleInstaller.json.html
334649 /updates/hudson.tools.JDKInstaller.json.html

Methodology (prove me wrong):

cat updates.jenkins*/access*.log.20241003000000 | fgrep 'GET ' | sed 's|.*GET ||g' | sed -E 's|\?.*||g' | sed -E 's| .*||g' > access-combined.log.20241003000000
sort access-combined.log.20241003000000 > access-combined.log.20241003000000.sorted
uniq -c access-combined.log.20241003000000.sorted > access-combined.log.20241003000000.sorted.uniqed
sort -nr access-combined.log.20241003000000.sorted.uniqed > access-combined.log.20241003000000.sorted.uniqed.sorted
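For readers who prefer a single script, here is a rough Python equivalent of the shell pipeline above. It is only a sketch: it keeps lines containing `GET `, takes what follows the last `GET `, drops the query string, cuts at the first space, and counts the resulting paths. The sample log lines in the demo are synthetic.

```python
from collections import Counter
from typing import Iterable, Optional


def request_path(line: str) -> Optional[str]:
    """Extract the request path of a GET line, without its query string."""
    line = line.rstrip("\n")
    if "GET " not in line:
        return None
    # Mirrors the greedy sed 's|.*GET ||': keep what follows the last "GET ".
    after_get = line.rsplit("GET ", 1)[1]
    without_query = after_get.split("?", 1)[0]
    return without_query.split(" ", 1)[0]


def count_paths(lines: Iterable[str]) -> Counter:
    """Count hits per path, like `sort | uniq -c`."""
    return Counter(p for p in map(request_path, lines) if p is not None)


if __name__ == "__main__":
    # Synthetic "combined" log lines, for illustration only.
    sample = [
        '203.0.113.5 - - [03/Oct/2024:00:00:01 +0000] "GET /updates/a.json.html?id=1 HTTP/1.1" 200 10 "-" "x"',
        '203.0.113.6 - - [03/Oct/2024:00:00:02 +0000] "GET /updates/a.json.html HTTP/1.1" 200 10 "-" "x"',
        '203.0.113.7 - - [03/Oct/2024:00:00:03 +0000] "GET /download/plugins/htmlpublisher/ HTTP/1.1" 200 10 "-" "x"',
    ]
    # most_common() sorts by descending count, like `sort -nr`.
    for path, hits in count_paths(sample).most_common():
        print(f"{hits:>8} {path}")
```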

@dduportal
Contributor Author

As the log demonstrates, the HTML files discussed in this issue are completely irrelevant for traffic.

The most popular URL that this issue is about is accessed just 24 times across the 4 logs:

  24 /download/plugins/htmlpublisher/

Compared to:

508498 /updates/hudson.tasks.Maven.MavenInstaller.json.html
387857 /updates/hudson.tasks.Ant.AntInstaller.json.html
339259 /updates/hudson.plugins.gradle.GradleInstaller.json.html
334649 /updates/hudson.tools.JDKInstaller.json.html

Methodology (prove me wrong):

cat updates.jenkins*/access*.log.20241003000000 | fgrep 'GET ' | sed 's|.*GET ||g' | sed -E 's|\?.*||g' | sed -E 's| .*||g' > access-combined.log.20241003000000
sort access-combined.log.20241003000000 > access-combined.log.20241003000000.sorted
uniq -c access-combined.log.20241003000000.sorted > access-combined.log.20241003000000.sorted.uniqed
sort -nr access-combined.log.20241003000000.sorted.uniqed > access-combined.log.20241003000000.sorted.uniqed.sorted

Yes, I had the same results before generating the goaccess report. I fail to understand the relationship with the current issue: the domain change when serving files from mirrors leads to wrong hyperlinks in the generated pages. What did I miss?

@daniel-beck

daniel-beck commented Oct 17, 2024

Yes, but we are losing track of the initial problem: using absolute URLs in the links of these specific HTML files, because the mirror system architecture ends up with these files served by a domain other than updates.jenkins.io due to redirections.

I wonder whether this is necessary. Seems like mirrors make sense for anything that's actual "content" (the stuff being downloaded), not glorified directory indexes.

I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files. Unless you want to check the usage for actions (blockers or optimizations) if the wget --recursive is used?

What did I miss?

This came from #4311 (comment) / #4311 (comment)

Basically the numbers you presented did not align with what I expected usage to look like. Looking at the actual logs shows reality lines up with my expectations :)

@dduportal
Contributor Author

I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files. Unless you want to check the usage for actions (blockers or optimizations) if the wget --recursive is used?
What did I miss?

This came from #4311 (comment) / #4311 (comment)

Basically the numbers you presented did not align with what I expected usage to look like. Looking at the actual logs shows reality lines up with my expectations :)

Oh I see, thanks for clarifying. We agree then on the result from the current production.

Let me compile my thoughts and analysis on the Cloudflare part:

  • Cloudflare still does not provide us with access logs, only the terrible dashboard I took a screenshot of. A request has been sent to them to enable access log publication (streamed to Datadog as we cannot access them directly). As with any sponsorship program, the beginning involves some back and forth.
  • My 1/3 vs. 2/3 is a ratio in number of hits, not in downloaded volume. We need to calculate this on the current access logs (I'll try to do it and publish my shell commands, because goaccess is too limited for such analysis), either by content type or by URL patterns.
  • The huge spike in HTTP/4XX means we still have some endpoints sent to mirrors which should not be. The links on the pages here (most probably due to crawler patterns) are part of this, but we don't really know how much.
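A grouping by URL pattern could look like the sketch below. The categories follow the distinction made earlier in this thread (listing pages under /download/, update-center.json.html variants, tool installer files under /updates/), but the exact matching rules and category names are my assumptions and would need checking against the real paths. The hit counts in the demo are the ones quoted earlier in the thread.

```python
from collections import Counter
from typing import Mapping


def classify(path: str) -> str:
    """Assign a request path to a coarse category (assumed rules, not validated)."""
    if path.endswith("update-center.json.html"):
        return "update-center-html"
    if path.startswith("/updates/") and path.endswith(".json.html"):
        return "tool-installer-html"
    if path.startswith("/download/") and (path.endswith("/") or path.endswith(".html")):
        return "download-listing"
    if path.endswith(".json"):
        return "json"
    return "other"


def hits_by_category(path_counts: Mapping[str, int]) -> Counter:
    """Aggregate per-path hit counts into per-category hit counts."""
    totals: Counter = Counter()
    for path, hits in path_counts.items():
        totals[classify(path)] += hits
    return totals


# Per-path counts quoted earlier in this thread.
QUOTED_COUNTS = {
    "/download/plugins/htmlpublisher/": 24,
    "/updates/hudson.tasks.Maven.MavenInstaller.json.html": 508498,
    "/updates/hudson.tasks.Ant.AntInstaller.json.html": 387857,
    "/updates/hudson.plugins.gradle.GradleInstaller.json.html": 339259,
    "/updates/hudson.tools.JDKInstaller.json.html": 334649,
}

if __name__ == "__main__":
    for category, hits in hits_by_category(QUOTED_COUNTS).most_common():
        print(f"{hits:>8} {category}")
```

The same function could be fed per-path byte totals instead of hit counts to get the ratio in downloaded volume.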

@smerle33 proposed using a non-Cloudflare mirror as a safety net if things go south with CF. It would use a custom web server (or two) that we manage, hosted in DigitalOcean (we have 4-5 Tb of bandwidth for free and 15k credits valid until the end of the year), so we can check access logs in detail. The cost is OK for another brownout (assuming 2 to 3 Tb of download for 24h), but we'll need to be careful if we add it permanently.
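A back-of-the-envelope sketch of why one brownout fits but a permanent setup needs care. It uses only the ranges quoted above, and assumes a 30-day month with constant daily traffic, which is a simplification.

```python
# Ranges quoted above, in the same "Tb" unit as the comment.
FREE_ALLOWANCE_TB = (4, 5)
DAILY_DOWNLOAD_TB = (2, 3)
DAYS_PER_MONTH = 30  # assumption: constant traffic over a 30-day month


def fits_free_allowance(days: int) -> bool:
    """True if the worst-case download volume stays within the smallest free allowance."""
    worst_case = DAILY_DOWNLOAD_TB[1] * days
    return worst_case <= FREE_ALLOWANCE_TB[0]


def monthly_range_tb() -> tuple:
    """Estimated monthly volume (low, high) if the mirror ran permanently."""
    return tuple(daily * DAYS_PER_MONTH for daily in DAILY_DOWNLOAD_TB)


if __name__ == "__main__":
    print("24h brownout within free allowance:", fits_free_allowance(1))
    print("Permanent use, estimated Tb per month:", monthly_range_tb())
```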

@daniel-beck

I met with @dduportal to move this topic along. Outcome:

  • He's looking into continuing to serve download link/index files from updates.jenkins.io, probably involving migrating RedirectMatch to RewriteRule in the uc2 .htaccess file due to how weird Apache is, if that's reasonably straightforward to accomplish. This prevents users from linking/bookmarking to "implementation detail" hostnames.
  • I look into making URLs in --download-links-directory and --latest-links-directory absolute instead of relative, independent of the outcome of your task. This is implemented in Use absolute URLs for links from download indexes update-center2#810

@dduportal
Contributor Author

I met with @dduportal to move this topic along. Outcome:

* He's looking into continuing to serve download link/index files from updates.jenkins.io, probably involving migrating `RedirectMatch` to `RewriteRule` in the uc2 `.htaccess` file due to how weird Apache is, if that's reasonably straightforward to accomplish. This prevents users from linking/bookmarking to "implementation detail" hostnames.

* I look into making URLs in `--download-links-directory` and `--latest-links-directory` absolute instead of relative, independent of the outcome of your task. This is implemented in [Use absolute URLs for links from download indexes update-center2#810](https://github.com/jenkins-infra/update-center2/pull/810)

Following this summary, I've opened the PR jenkins-infra/update-center2#812 to focus on the second solution.

With the use of RewriteRule for the "fallback" rule (tested with success), we can add a rewrite condition to test the absence of a file: that would allow us to serve the /downloads/**/*html file from Apache since it's only a low volume, and would solve the HTTP/404 links without requiring absolute links.
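The routing decision described here can be modelled in a few lines. This is not Apache configuration, only a Python model of the intended behaviour: serve the file locally when it exists under the document root, otherwise fall back to a redirect to the mirrors. The mirror base URL and the 307 status are taken from the curl output in this thread; the function name and the document root layout are hypothetical.

```python
from pathlib import Path
from typing import Tuple

# Taken from the redirect target seen in this thread; treat it as illustrative.
MIRROR_BASE = "https://mirrors.updates.jenkins.io"


def route(request_path: str, docroot: Path) -> Tuple[int, str]:
    """Model of the fallback rule: local file if present, else redirect to mirrors.

    Returns (status, target) where target is a local file path for 200
    or the redirect location for 307.
    """
    root = docroot.resolve()
    candidate = (root / request_path.lstrip("/")).resolve()
    # Only serve regular files that really live under the document root.
    if candidate.is_file() and root in candidate.parents:
        return 200, str(candidate)
    # File absent locally: this is where the "fallback" redirect applies.
    return 307, MIRROR_BASE + request_path
```

Keeping the low-volume listing pages on the origin host means their relative links keep resolving against updates.jenkins.io, while everything that is absent locally still goes to the mirrors.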

@dduportal
Contributor Author

Update:

@dduportal
Contributor Author

# Before the change
$ curl -I "https://azure.updates.jenkins.io/foo/update-center.json?uctest"
HTTP/2 307 
date: Tue, 22 Oct 2024 09:54:37 GMT
content-type: text/html; charset=iso-8859-1
location: https://mirrors.updates.jenkins.io/uctest.json?uctest
strict-transport-security: max-age=2592000; includeSubDomains; preload

# After the change
$ curl -I "https://azure.updates.jenkins.io/foo/update-center.json?uctest"
HTTP/2 200 
date: Tue, 22 Oct 2024 09:55:06 GMT
content-type: application/json
content-length: 3
last-modified: Tue, 22 Oct 2024 09:54:46 GMT
etag: "3-6250dc26ce6f7"
accept-ranges: bytes
strict-transport-security: max-age=2592000; includeSubDomains; preload

@dduportal dduportal changed the title [Update Center] generate HTML pages with absolute links [Update Center] Fix HTTP/404 errors due to broken links in HTML listing pages and missing ?uctest endpoint Oct 24, 2024