[update-center] Downloading plugins unstable: Connection timed out #2960

Closed
AB-xdev opened this issue May 30, 2022 · 11 comments · Fixed by jenkins-infra/status#175

AB-xdev commented May 30, 2022

Service(s)

Update center

Summary

While updating plugins on multiple Jenkins instances today, I encountered the following problem:

Failure -

java.net.SocketTimeoutException: Read timed out
	at java.base/java.net.SocketInputStream.socketRead0(Native Method)
	at java.base/java.net.SocketInputStream.socketRead(Unknown Source)
	at java.base/java.net.SocketInputStream.read(Unknown Source)
	at java.base/java.net.SocketInputStream.read(Unknown Source)
	at java.base/sun.security.ssl.SSLSocketInputRecord.read(Unknown Source)
	at java.base/sun.security.ssl.SSLSocketInputRecord.readHeader(Unknown Source)
	at java.base/sun.security.ssl.SSLSocketInputRecord.decode(Unknown Source)
	at java.base/sun.security.ssl.SSLTransport.decode(Unknown Source)
	at java.base/sun.security.ssl.SSLSocketImpl.decode(Unknown Source)
	at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(Unknown Source)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(Unknown Source)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(Unknown Source)
	at java.base/sun.net.www.protocol.https.HttpsClient.afterConnect(Unknown Source)
	at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
	at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
	at hudson.model.UpdateCenter$UpdateCenterConfiguration.download(UpdateCenter.java:1291)
Caused: java.io.IOException: Failed to load https://updates.jenkins.io/download/plugins/font-awesome-api/6.1.1-1/font-awesome-api.hpi to /var/jenkins_home/plugins/font-awesome-api.jpi.tmp
	at hudson.model.UpdateCenter$UpdateCenterConfiguration.download(UpdateCenter.java:1302)
Caused: java.io.IOException: Failed to download from https://updates.jenkins.io/download/plugins/font-awesome-api/6.1.1-1/font-awesome-api.hpi
	at hudson.model.UpdateCenter$UpdateCenterConfiguration.download(UpdateCenter.java:1336)
	at hudson.model.UpdateCenter$DownloadJob._run(UpdateCenter.java:1893)
	at hudson.model.UpdateCenter$InstallationJob._run(UpdateCenter.java:2205)
	at hudson.model.UpdateCenter$DownloadJob.run(UpdateCenter.java:1867)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:121)
	at java.base/java.lang.Thread.run(Unknown Source)

It looks like the problem happens randomly; after a few retries the download succeeds:
(screenshot)

However, the status page says that everything is OK:
(screenshot)

Geographic location information:

  • All our affected machines are running on AWS eu-central-1 (Frankfurt, Germany)
  • I can also reproduce the problem locally (Germany) when trying to download e.g. https://updates.jenkins.io/download/plugins/font-awesome-api/6.1.1-1/font-awesome-api.hpi; it takes very long (>30 s) to establish a connection in the browser.
    curl also times out when trying to download the plugin:
    C:\TEMP>curl -L https://updates.jenkins.io/download/plugins/font-awesome-api/6.1.1-1/font-awesome-api.hpi
    curl: (35) schannel: failed to receive handshake, SSL/TLS connection failed
    
    C:\TEMP>curl -L https://updates.jenkins.io/download/plugins/font-awesome-api/6.1.1-1/font-awesome-api.hpi
    curl: (28) Failed to connect to updates.jenkins.io port 443 after 21088 ms: Timed out
    

Reproduction steps

Try to download or update a plugin.

github-actions bot added the triage and updateCenter labels May 30, 2022

lemeurherve commented May 30, 2022

A status incident has been opened, we're working on it.

lemeurherve removed the triage label May 30, 2022
AB-xdev changed the title from "Downloading plugins unstable: Connection timed out - Statuspage says everything is fine" to "Downloading plugins unstable: Connection timed out" May 30, 2022
github-actions bot added the triage label May 30, 2022
dduportal self-assigned this May 30, 2022
dduportal removed the triage label May 30, 2022

dduportal commented:

Many thanks for reporting. For info, related to #2888.

dduportal commented:

Recurrent (new) error in the Apache logs: AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting

dduportal commented:

Another recurrent error: AH00646: Error writing to |/usr/bin/rotatelogs -t /var/log/apache2/updates.jenkins.io/access_nonssl.log.%Y%m%d%H%M%S 604800 for all vhosts

dduportal commented May 30, 2022

There were two main issues on the VM hosting the update center (UC). Both issues are solved "manually" and need to be "puppetized" today:

  • The "error.log" files for the Apache vhosts are not rotated. Alas, this is a general setting for ALL of our Puppet-managed Apache services. The error log on the update center was ~40 GB per vhost (7 vhosts) and was locking a lot of Apache threads on each reload. We caught this through the "rotatelogs" warning with the help of @smerle33, and truncating the error log file helped make the Apache server more responsive. (If needed, there is the disk snapshot to recover the error.log from last Friday.)

  • The MaxRequestWorkers setting (formerly known as MaxClients on Apache 2.2 and earlier) was set to its default of 150 on the VM, while it should be greater (to handle the many requests of the UC). The Apache error logs were telling us this after a certain time. The value 150 is the default in both Apache AND the Puppet module that we use for Apache. Puppet has been disabled and MaxRequestWorkers was increased to 500 to match the values from last Friday's backup: no more errors (at least for now; over the past 4 months we had errors after 5-6 hours). => We have to increase it in Puppet after calculating the upper bound (given this is a beefy machine, there is no reason NOT to use a greater value). See the config sketch after this list.
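
For illustration, here is a minimal sketch of what this tuning amounts to in Apache terms. The directive names are standard Apache 2.4 (event MPM); the event.conf values match the Puppet dry run further below, while the ErrorLog line is only a hypothetical example (path and 7-day retention copied from the access-log rotatelogs command quoted earlier), not the actual vhost configuration:

# Sketch of /etc/apache2/mods-available/event.conf after the tuning
# (MaxRequestWorkers: 150 default -> 500 by hand -> 750 via Puppet;
#  note that 750 = ServerLimit 30 x ThreadsPerChild 25)
<IfModule mpm_event_module>
  ServerLimit            30
  StartServers           2
  MaxRequestWorkers      750
  MinSpareThreads        25
  MaxSpareThreads        75
  ThreadsPerChild        25
  MaxRequestsPerChild    0
  ThreadLimit            64
  ListenBacklog          511
</IfModule>

# Hypothetical per-vhost ErrorLog rotation, mirroring the existing access-log setup,
# to stop the unrotated error.log from growing to tens of GB again
ErrorLog "|/usr/bin/rotatelogs -t /var/log/apache2/updates.jenkins.io/error.log.%Y%m%d%H%M%S 604800"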

(edit) Huge thanks to @olblak for taking some time to confirm this reasoning and for sharing insights in his free time ❤️

dduportal commented:

PRs to fix the MaxRequestWorkers value for this VM:

Dry run before enabling puppet:

root@ip-172-31-33-96:~# puppet agent --noop --test
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts


Info: Applying configuration version '1653929150'
Notice: /Stage[main]/Apache::Mod::Event/File[/etc/apache2/mods-available/event.conf]/content: 
--- /etc/apache2/mods-available/event.conf      2022-05-30 12:19:42.284528561 +0000
+++ /tmp/puppet-file20220530-10345-3ff6ur       2022-05-30 16:46:01.339430567 +0000
@@ -1,11 +1,11 @@
 <IfModule mpm_event_module>
-  ServerLimit            25
+  ServerLimit            30
   StartServers           2
+  MaxRequestWorkers      750
   MinSpareThreads        25
   MaxSpareThreads        75
   ThreadsPerChild        25
   MaxRequestsPerChild    0
   ThreadLimit            64
   ListenBacklog          511
-  MaxRequestWorkers     500
 </IfModule>

Notice: /Stage[main]/Apache::Mod::Event/File[/etc/apache2/mods-available/event.conf]/content: 
--- /etc/apache2/mods-available/event.conf      2022-05-30 12:19:42.284528561 +0000
+++ /tmp/puppet-file20220530-10345-lh5m7r       2022-05-30 16:46:01.347430545 +0000
@@ -1,11 +1,11 @@
 <IfModule mpm_event_module>
-  ServerLimit            25
+  ServerLimit            30
   StartServers           2
+  MaxRequestWorkers      750
   MinSpareThreads        25
   MaxSpareThreads        75
   ThreadsPerChild        25
   MaxRequestsPerChild    0
   ThreadLimit            64
   ListenBacklog          511
-  MaxRequestWorkers     500
 </IfModule>

Notice: /Stage[main]/Apache::Mod::Event/File[/etc/apache2/mods-available/event.conf]/content: current_value '{md5}d20788cd0cce84117bc1d6051995205d', should be '{md5}d2be01b9deb39cc51e0a0ac146ce64b1' (noop)
Info: /Stage[main]/Apache::Mod::Event/File[/etc/apache2/mods-available/event.conf]: Scheduling refresh of Class[Apache::Service]
Notice: Class[Apache::Mod::Event]: Would have triggered 'refresh' from 1 event
Notice: Class[Apache::Service]: Would have triggered 'refresh' from 1 event
Info: Class[Apache::Service]: Scheduling refresh of Service[httpd]
Notice: /Stage[main]/Apache::Service/Service[httpd]: Would have triggered 'refresh' from 1 event
Notice: Class[Apache::Service]: Would have triggered 'refresh' from 1 event
Notice: Stage[main]: Would have triggered 'refresh' from 2 events
Notice: Applied catalog in 6.56 seconds
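
For reference, a hedged sketch of how this tuning might be expressed with the puppetlabs/apache module (which the Apache::Mod::Event resources in the dry run come from). The parameter names below are my reading of the module's apache::mod::event class; the actual jenkins-infra change may set them via Hiera or a profile class instead:

# Hypothetical Puppet declaration matching the values from the dry run above
class { 'apache::mod::event':
  serverlimit         => 30,
  startservers        => 2,
  maxrequestworkers   => 750,
  minsparethreads     => 25,
  maxsparethreads     => 75,
  threadsperchild     => 25,
  maxrequestsperchild => 0,
  threadlimit         => 64,
  listenbacklog       => 511,
}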

dduportal commented:

Applied the new settings. Let's check Datadog's monitoring.

  • Puppet agent re-enabled and ran again successfully
  • apache2buddy.pl confirms that the new setting is applied (see the check commands below)
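
As a quick sanity check on the VM itself, generic commands like these confirm that the event MPM is loaded and that the tuned value is present in the enabled configuration (paths assume the Debian layout shown in the dry run; illustrative, not necessarily the exact commands used here):

# Which MPM is loaded? (expect mpm_event_module)
apache2ctl -M | grep mpm
# Did the new limit land in the enabled config?
grep MaxRequestWorkers /etc/apache2/mods-enabled/event.conf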

dduportal commented:

Today's incident summarized in a Datadog graph:

(screenshot: Datadog graph, 2022-05-30 18:53)

dduportal commented:

No errors in the past hour. Let's see how it behaves during the (EU) night / (US) day.

dduportal changed the title from "Downloading plugins unstable: Connection timed out" to "[update-center] Downloading plugins unstable: Connection timed out" May 31, 2022
github-actions bot added the triage label May 31, 2022

dduportal commented:

We are closing the incident in jenkins-infra/status#175 (and this issue) as our automated configuration fixed the incident.

It's been 24 hours without errors: the Apache fine-tuning seems not only to fix this issue but also to improve on the situation previously described in #2888.

(screenshot: Datadog graph, 2022-05-31 12:20)

dduportal commented:

Many thanks for reporting, and really sorry for the inconvenience.
