[Update Center - Azure] HTTP/404 responses instead of redirections with error AH01630: client denied by server configuration #4312

Open
dduportal opened this issue Sep 28, 2024 · 5 comments


@dduportal

During the new (Azure) Update Center brownouts (such as #2649 (comment)), our monitoring system alerted us about URLs answering HTTP/404 instead of redirecting to a valid page (see https://github.com/jenkins-infra/datadog/blob/457ce441a4e4d9c5f1c0434a95a56b3d80d7a645/synthetics_updatecenter.tf#L4-L12).

The associated logs are showing errors such as:

[Fri Sep 27 05:53:50.035112 2024] [authz_core:error] [pid 10:tid 43] [client 10.100.8.4:36002] AH01630: client denied by server configuration: /usr/local/apache2/htdocs/.htaccess
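
For reference, the kind of check that raises this alert can be approximated with a small probe. This is only a sketch, not the actual Datadog synthetic test: the function names `check_redirect` and `is_healthy` are hypothetical, and the rule "a healthy answer is a 3xx with a Location header" is an assumption derived from the description above.

```python
import http.client
from urllib.parse import urlsplit


def check_redirect(url, timeout=10.0):
    """Send one HEAD request without following redirects.

    Returns (status_code, location_header_or_None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def is_healthy(status, location):
    """Assumed rule: healthy means a 3xx answer carrying a Location header."""
    return 300 <= status < 400 and bool(location)
```

Calling `is_healthy(*check_redirect(url))` in a loop against each pod would flag the HTTP/404 answers described here without waiting for the external monitor.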

When it happens, all the files and directories in the htdocs/ directory of the failing pod appear empty:

/usr/local/apache2/htdocs# ls -ltra
total 32
drwxr-xr-x 1 www-data www-data  4096 Sep  5 12:17 ..
drwxrwxrwx 2 root     root         0 Sep 21 09:18 .
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.427
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.450
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.476
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.444
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.420
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.441
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.426
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.452.1
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.440.1
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.426.3
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.440.2
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.426.2
drwxrwxrwx 2 root     root         0 Sep 23 17:01 stable
drwxrwxrwx 2 root     root         0 Sep 23 17:01 current
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.459
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.432
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.463
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.452.3
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.425
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.462
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.430
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.453
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.472
drwxrwxrwx 2 root     root         0 Sep 23 17:01 experimental
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.414.3
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.426.1
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.414.2
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.454
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.421
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.462.1
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.452.2
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.440.3
drwxrwxrwx 2 root     root         0 Sep 23 17:01 latest
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.446
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.440
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.434
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.414.1
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.475
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.460
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.443
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-stable-2.462.2
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.452
drwxrwxrwx 2 root     root         0 Sep 23 17:01 dynamic-2.423
drwxrwxrwx 2 root     root         0 Sep 25 14:34 dynamic-2.464
-rwxrwxrwx 1 root     root     24024 Sep 26 20:24 .htaccess
/usr/local/apache2/htdocs# wc -l .htaccess
0
  • Not all of the 4 pods (2 replica pods for HTTPS traffic and 2 for HTTP traffic) fail at the same time. Sometimes it's only one, sometimes all of them. The behavior is not deterministic.
  • The problem is solved by restarting the pod, which unmounts and remounts the volume.
  • The "hourly" rollout of the httpd pods makes the system "self-healing", but it does not prevent the user-facing errors.
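
The symptom in the listing above (a `.htaccess` reported as 24024 bytes while `wc -l` counts 0 lines) suggests a simple in-pod check: compare the size reported by `stat` with the number of bytes that can actually be read. The sketch below is a hypothetical helper, not something taken from the Jenkins infrastructure repositories, and it assumes the failure always shows up as "fewer bytes readable than advertised".

```python
import os


def stat_and_read_sizes(path):
    """Return (size reported by stat, number of bytes actually read)."""
    stat_size = os.stat(path).st_size
    with open(path, "rb") as handle:
        read_size = len(handle.read())
    return stat_size, read_size


def find_suspect_files(root):
    """List files under root that are unreadable or read shorter than stat says."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                stat_size, read_size = stat_and_read_sizes(path)
            except OSError:
                # An unreadable file on the share is suspicious as well.
                suspects.append(path)
                continue
            if stat_size > 0 and read_size < stat_size:
                suspects.append(path)
    return suspects
```

Running `find_suspect_files("/usr/local/apache2/htdocs")` inside a pod should return an empty list while the mount is healthy. A non-empty list could be used to fail a liveness probe sooner than the hourly rollout.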
@dduportal

A potential source of this behavior could be that we did not specify any custom mount options for the statically provisioned volume (which uses SMB by default): https://github.com/jenkins-infra/azure/blob/e81b9697e4a9cedebf10e119b6a5a112e09b651f/updates.jenkins.io.tf#L77-L108

while we used to have some for the dynamically provisioned volume before the fourth brownout: https://github.com/jenkins-infra/kubernetes-management/blob/ec33aec2a83bd86d876d14327698ab2fe511ed98/config/updates.jenkins.io_httpd.yaml#L44-L52.

However, the problem also appeared with the custom mount options in place, during the 3rd brownout (just less often over 24 hours): #2649 (comment)

A few notes about these parameters (from https://learn.microsoft.com/en-us/azure/aks/azure-csi-files-storage-provision#mount-options and https://linux.die.net/man/8/mount.cifs):

  • actimeo defaults to 1s. It is the time the client caches file and directory attributes (see https://linux.die.net/man/8/mount.cifs).
  • nobrl disables sending byte-range lock requests to the server; it is intended for applications that have trouble with POSIX locks.
  • cache defaults to strict since kernel 3.7.

@dduportal

dduportal commented Sep 28, 2024

It looks like we should also tune Apache to take into account that it is serving files from a network share. The Apache documentation says the following.

About EnableMMAP: "For NFS mounted files, this feature may be disabled explicitly for the offending files by specifying" EnableMMAP Off.

About EnableSendfile: "For network mounted files, this feature may be disabled explicitly for the offending files by specifying" EnableSendfile Off.

Example in https://www.cloudiseasy.com/2021/06/13/deploying-apache-server-on-aks-with-azure-files/
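
A minimal sketch of such tuning follows. It assumes the two quoted passages refer to the `EnableMMAP` and `EnableSendfile` directives, and that the share is mounted at /usr/local/apache2/htdocs as in the logs above. Where the block lives (httpd.conf or an included file) is an assumption and would need to match how the httpd image is configured.

```apache
# Assumption: the network share is mounted on the DocumentRoot shown in the logs.
<Directory "/usr/local/apache2/htdocs">
    # Do not memory-map files read from the network share.
    EnableMMAP Off
    # Do not hand file delivery to the kernel sendfile path for this share.
    EnableSendfile Off
</Directory>
```

Both directives are also accepted in `.htaccess` context, but since the `.htaccess` file itself lives on the share that fails, setting them in the server configuration seems the safer choice.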

@dduportal

Proposed course of action:

  • Find a way to reproduce the issue without requiring a brownout
  • Set up the mount options and Apache tuning, then see if the problem reappears
  • If it does, we could use NFSv4 instead of SMB for the file share PVC
    • Still the pain of network file storage, but it would help for Apache (better support)
  • Last resort: stop using a shared file system for Apache
    • Rsync sidecar container with a local emptyDir (local FS, but needs coupling with update center2)
    • Docker image?
    • Something else?

@dduportal

Update:

dduportal added a commit to jenkins-infra/azure that referenced this issue Oct 22, 2024
…S ROX PVs (#866)

Related to
jenkins-infra/helpdesk#4312 (comment)

This PR sets up custom mount options for the persistent volumes using
SMB/CIFS fileshare, to limit side effects with the update of files "each
10 minutes" which may show files as present but empty.

- Default and recommended mount options are described in Azure
documentation:
https://learn.microsoft.com/en-us/azure/aks/azure-csi-files-storage-provision#mount-file-share-as-a-persistent-volume
- CIFS client options reference with explanations is there:
https://manpages.debian.org/testing/cifs-utils/mount.cifs.8.en.html



NOTE: this change is not sufficient. Remount is required once deployed +
Apache will need to be tuned.

Signed-off-by: Damien Duportal <[email protected]>
@dduportal dduportal self-assigned this Oct 23, 2024
@dduportal

Update: the PR jenkins-infra/azure#866 added custom mount options to tune SMB for POSIX usage (see the PR body for details).

Note that such a volume must be unmounted and remounted with the new options: the AKS CSI driver does not do it automatically (which makes sense, as it would disrupt the service). A node OS upgrade did the trick, as it created new nodes (otherwise I would have cordoned, drained and deleted each node).

I checked the "real life" mount before and after with the mount command executed in a pod:

# Before
//updatesjenkinsio.file.core.windows.net/updates-jenkins-io-redirects on /usr/local/apache2/htdocs type cifs (ro,relatime,vers=3.1.1,cache=strict,username=updatesjenkinsio,uid=0,noforceuid,gid=0,noforcegid,addr=52.239.174.104,file_mode=0777,dir_mode=0777,soft,persistenthandles,nounix,serverino,mapposix,mfsymlinks,rsize=1048576,wsize=1048576,bsize=1048576,echo_interval=60,nosharesock,actimeo=30,closetimeo=1)

# After (note the `nobrl` option applied)
//updatesjenkinsio.file.core.windows.net/updates-jenkins-io-redirects on /usr/local/apache2/htdocs type cifs (ro,relatime,vers=3.1.1,cache=strict,username=updatesjenkinsio,uid=0,forceuid,gid=0,forcegid,addr=52.239.174.72,file_mode=0777,dir_mode=0777,soft,persistenthandles,nounix,serverino,mapposix,nobrl,mfsymlinks,rsize=1048576,wsize=1048576,bsize=1048576,echo_interval=60,nosharesock,actimeo=30,closetimeo=1)
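
Since the two lines are long, the differences are easier to see with a small comparison. The following is a sketch only: `parse_mount_options` and `diff_options` are hypothetical helper names, and the parsing assumes the option list is the last parenthesized group of a `mount` output line.

```python
PREFIX = ("//updatesjenkinsio.file.core.windows.net/updates-jenkins-io-redirects "
          "on /usr/local/apache2/htdocs type cifs ")

BEFORE = PREFIX + (
    "(ro,relatime,vers=3.1.1,cache=strict,username=updatesjenkinsio,uid=0,"
    "noforceuid,gid=0,noforcegid,addr=52.239.174.104,file_mode=0777,"
    "dir_mode=0777,soft,persistenthandles,nounix,serverino,mapposix,"
    "mfsymlinks,rsize=1048576,wsize=1048576,bsize=1048576,echo_interval=60,"
    "nosharesock,actimeo=30,closetimeo=1)"
)

AFTER = PREFIX + (
    "(ro,relatime,vers=3.1.1,cache=strict,username=updatesjenkinsio,uid=0,"
    "forceuid,gid=0,forcegid,addr=52.239.174.72,file_mode=0777,"
    "dir_mode=0777,soft,persistenthandles,nounix,serverino,mapposix,nobrl,"
    "mfsymlinks,rsize=1048576,wsize=1048576,bsize=1048576,echo_interval=60,"
    "nosharesock,actimeo=30,closetimeo=1)"
)


def parse_mount_options(mount_line):
    """Map each option in the trailing '(...)' group to its value (None for flags)."""
    start = mount_line.rindex("(")
    end = mount_line.rindex(")")
    options = {}
    for item in mount_line[start + 1:end].split(","):
        key, separator, value = item.partition("=")
        options[key] = value if separator else None
    return options


def diff_options(before, after):
    """Return sorted lists of (added, removed, changed) option names."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(k for k in set(before) & set(after) if before[k] != after[k])
    return added, removed, changed


ADDED, REMOVED, CHANGED = diff_options(
    parse_mount_options(BEFORE), parse_mount_options(AFTER)
)
```

With the two lines above, this reports `nobrl`, `forceuid` and `forcegid` as added, `noforceuid` and `noforcegid` as removed, and only `addr` as changed. In particular, `actimeo=30` and `cache=strict` are identical before and after.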
