Check quickly that we have Fedora copr-backend backup #3390

Closed
praiskup opened this issue Aug 28, 2024 · 19 comments · Fixed by #3506

@praiskup
Member

praiskup commented Aug 28, 2024

The backups should be on the storinator box.

@praiskup praiskup converted this from a draft issue Aug 28, 2024
@praiskup praiskup moved this from Needs triage to In 3 months in CPT Kanban Aug 28, 2024
@praiskup
Member Author

We need a howto document (output from this ticket).

@praiskup praiskup moved this from In 3 months to In Progress in CPT Kanban Sep 2, 2024
@praiskup praiskup self-assigned this Sep 2, 2024
@praiskup
Member Author

# for i in $(ls -1 /var/log/cron-*.xz | tac); do xzcat $i | grep rsnapshot; done
Sep 17 21:06:26 copr-be CROND[1470129]: (copr) CMDOUT (rsnapshot encountered an error! The program was invoked with these options:)
Sep 17 21:06:26 copr-be CROND[1470129]: (copr) CMDOUT (/bin/rsnapshot -c /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf push )
Sep 17 21:06:26 copr-be CROND[1470129]: (copr) CMDOUT (ERROR: Could not write lockfile /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.pid: No space left on device)
Sep 17 21:06:27 copr-be CROND[1470129]: (copr) CMDOUT (  File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 58, in <module>)
Sep 17 21:06:27 copr-be CROND[1470129]: (copr) CMDOUT (  File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 51, in _main)
Sep 17 21:06:27 copr-be CROND[1470129]: (copr) CMDOUT (  File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 42, in rotate)
Sep 17 21:06:27 copr-be CROND[1470129]: (copr) CMDOUT (subprocess.CalledProcessError: Command '['/bin/rsnapshot', '-c', '/srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf', 'push']' returned non-zero exit status 1.)
Sep 17 21:06:28 copr-be CROND[1470129]: (copr) CMDEND (ionice --class=idle /usr/local/bin/rsnapshot_copr_backend >/dev/null)
Sep 14 01:01:02 copr-be CROND[1470229]: (copr) CMD (ionice --class=idle /usr/local/bin/rsnapshot_copr_backend >/dev/null)
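
The same scan over rotated cron logs can be sketched in Python, which may be handy if the check ends up in the howto document. This is only an illustration: the helper name `grep_rotated_logs` is hypothetical, and reverse name order only approximates what `ls -1 ... | tac` yields.

```python
import glob
import lzma


def grep_rotated_logs(pattern, needle):
    """Return lines containing `needle` from xz-compressed logs matching `pattern`.

    Files are visited in reverse name order, roughly like `ls -1 ... | tac`.
    """
    hits = []
    for path in sorted(glob.glob(pattern), reverse=True):
        # Text mode; undecodable bytes are replaced instead of aborting the scan.
        with lzma.open(path, "rt", encoding="utf-8", errors="replace") as log:
            hits.extend(line.rstrip("\n") for line in log if needle in line)
    return hits
```

Calling `grep_rotated_logs("/var/log/cron-*.xz", "rsnapshot")` as root should list the same lines as the shell loop above.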

@praiskup
Member Author

$ lvresize /dev/VG_nfs/copr-be -L +8TB
$ xfs_growfs /srv/nfs/copr-be/
$ df -h /srv/nfs/copr-be/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VG_nfs-copr--be 48T 40T 8.1T 84% /srv/nfs/copr-be

@praiskup
Member Author

Running ionice --class=idle /usr/local/bin/rsnapshot_copr_backend manually.

@praiskup
Member Author

Still doing the rsync :-( and we seem to run out of space again:
/dev/mapper/VG_nfs-copr--be 48T 47T 1.7T 97% /srv/nfs/copr-be
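
A pre-flight check along these lines could let a wrapper script fail early instead of hitting "No space left on device" while writing the lockfile. It is a minimal sketch: `has_headroom` is a hypothetical helper, not part of rsnapshot or of the existing `rsnapshot_copr_backend` script, and the threshold is left to the caller.

```python
import shutil


def has_headroom(path, needed_bytes):
    """Return True if the filesystem holding `path` has at least `needed_bytes` free."""
    return shutil.disk_usage(path).free >= needed_bytes
```

For example, `has_headroom("/srv/nfs/copr-be", 3 * 10**12)` would ask whether about 3TB is still free on the backup volume (the 3TB figure is an arbitrary illustration).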

@praiskup
Member Author

I would remove the old increments, but that would probably break the current rsnapshot process. I'll keep the sync going for now, and wait for the potential failure (if it really fails, I'll remove old increments, and then restart rsnapshot).

@praiskup
Member Author

Ok, going with /bin/rm -rf push.3 push.2 push.1 push.0 first, keeping the last .sync

@praiskup
Member Author

praiskup commented Oct 5, 2024

[copr@copr-be ~][PROD]$ ionice --class=idle /usr/local/bin/rsnapshot_copr_backend
Warning: Permanently added 'storinator01.rdu-cc.fedoraproject.org' (ED25519) to the list of known hosts.
building file list ... 
rsync: [sender] opendir "/var/lib/copr/public_html/archive/issues/copr-3016" failed: Permission denied (13)
Timeout, server storinator01.rdu-cc.fedoraproject.org not responding.
rsync: [sender] write error: Broken pipe (32)
rsync error: unexplained error (code 255) at io.c(848) [sender=3.3.0]

@praiskup
Member Author

praiskup commented Oct 7, 2024

 33,898,430,947   0%    1.08MB/s    8:17:17 (xfr#69052, to-chk=26461/64855270)
/var/lib/copr/public_html/temp/
/var/lib/copr/public_html/temp/issue-3067/
/var/lib/copr/public_html/usage-2019-08-04/
/var/lib/copr/public_html/usage4/
 33,898,430,947   0%    1.08MB/s    8:17:17 (xfr#69052, to-chk=0/64855270)
rsync: [receiver] stat "var/lib/copr/public_html/temp/issue-3067" (in push) failed: No such file or directory (2)
 33,898,430,947   0%    1.08MB/s    8:17:17 (xfr#69052, to-chk=0/64855270)
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:

rsnapshot encountered an error! The program was invoked with these options:
/bin/rsnapshot -c /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf push 
----------------------------------------------------------------------------
ERROR: Could not write lockfile /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.pid: No space left on device
Traceback (most recent call last):
  File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 58, in <module>
    _main()
  File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 51, in _main
    rotate(database)
  File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 42, in rotate
    subprocess.check_call(cmd)
  File "/usr/lib64/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/bin/rsnapshot', '-c', '/srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf', 'push']' returned non-zero exit status 1.


sent 34,368,856,397 bytes  received 217,861,583 bytes  272,708.92 bytes/sec
total size is 41,526,683,163,969  speedup is 1,200.65

@praiskup
Member Author

praiskup commented Oct 7, 2024

Starting with: /dev/mapper/VG_nfs-copr--be 48T 345G 48T 1% /srv/nfs/copr-be

@praiskup
Member Author

praiskup commented Oct 9, 2024

Hmmm

Timeout, server storinator01.rdu-cc.fedoraproject.org not responding.
rsync: [sender] write error: Broken pipe (32)
rsync error: unexplained error (code 255) at io.c(848) [sender=3.3.0]

real    1038m41.824s
user    67m3.939s
sys     85m51.315s

Even though storinator's sshd is still running:

● sshd.service - OpenSSH server daemon
     Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; preset: enabled)
     Active: active (running) since Sat 2024-10-05 22:14:15 UTC; 3 days ago

@praiskup
Member Author

13,669,445,228,841  66%   31.94MB/s   58:19:50
Timeout, server storinator01.rdu-cc.fedoraproject.org not responding.

rsync: [sender] write error: Broken pipe (32)
rsync error: unexplained error (code 255) at io.c(848) [sender=3.3.0]

real    11032m7.826s
user    821m15.285s
sys     946m50.620s

@praiskup
Member Author

# 5h
ServerAliveInterval 20
ServerAliveCountMax 900
ConnectTimeout 120

Previously I tried 20 / 5 / 60 for these three settings.
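
The "# 5h" note follows from how the OpenSSH client counts keepalives: it gives up after roughly ServerAliveInterval × ServerAliveCountMax seconds without a reply from the server. A small sketch of that arithmetic (the function name `ssh_silence_budget` is made up for illustration):

```python
def ssh_silence_budget(interval_s, count_max):
    """Approximate seconds of server silence the ssh client tolerates before disconnecting."""
    return interval_s * count_max


new_budget = ssh_silence_budget(20, 900)  # current settings
old_budget = ssh_silence_budget(20, 5)    # earlier attempt
print(f"new: {new_budget / 3600:.1f} h, old: {old_budget} s")
```

So the earlier settings tolerated under two minutes of silence from storinator, while the current ones tolerate about five hours.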

@praiskup
Member Author

The first rsync run finished, and the config above probably helped. So the first backup round is done, but we still need to fix ansible.git.

@praiskup
Member Author

Fixed: https://pagure.io/fedora-infra/ansible/c/5cffe17cd8856b14fef8b858ba1dd12dfec43dd3
Running again (second increment, deletes seem to be done correctly)

praiskup added a commit to praiskup/copr that referenced this issue Nov 5, 2024
@praiskup
Member Author

praiskup commented Nov 6, 2024

From triage: let Konflux folks know, let PULP folks know

@praiskup
Member Author

praiskup commented Nov 7, 2024

Last run started 2024-11-05 07:00 AM, ended 2024-11-07 03:00 AM, after ~44 hours. Succeeded. Transferred 2TB of data, which is the increment since the last run finished (~2024-11-02). IOW 2TB for 6 days, which is ~10TB/month. Hmm.

It doesn't seem that the last backup run hit any "build peak"; 🤷 so the increments might be worse sometimes, if build peaks appear.

Then, since we keep 4 weekly increments and need space for a 5th "in progress" increment, storinator should provide at least as much space as the backend consumes (~30TB) plus space for 5 increments (~12TB) = 42T. df reports 48T, and we can get +16T more (the volume group allows it, but we need to ask Fedora infra first).

That said, everything seems OK right now -> but we should keep monitoring the next two incremental backups, to be sure that the increments fit well into the backup volume.
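
The estimate above can be re-run with fresh numbers after each backup. A rough sketch, assuming the figures from this comment (~30TB on the backend, 2TB transferred per 6 days, 4 kept weekly increments plus 1 in progress); the function and key names are illustrative only:

```python
def backup_volume_estimate(backend_tb, increment_tb, increment_days,
                           kept_weekly, in_progress=1):
    """Estimate monthly growth and the total space the backup volume needs, in TB."""
    daily_tb = increment_tb / increment_days
    weekly_tb = daily_tb * 7
    increments_tb = weekly_tb * (kept_weekly + in_progress)
    return {
        "monthly": daily_tb * 30,
        "increments": increments_tb,
        "total": backend_tb + increments_tb,
    }


estimate = backup_volume_estimate(30, 2, 6, kept_weekly=4)
print({name: round(value, 1) for name, value in estimate.items()})
```

With these inputs the result is about 10TB/month of growth, about 12TB for the five increments, and about 42TB in total, matching the numbers above. A build peak would show up as a larger `increment_tb`.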

@praiskup
Member Author

I'd like to take a bit more from the VG: https://pagure.io/fedora-infrastructure/issue/12280

@praiskup
Member Author

Enlarged the volume by 6T. The latest backup has been running for 3.5 days already.

praiskup added a commit to praiskup/copr that referenced this issue Nov 20, 2024
ryanlerch pushed a commit to fedora-infra/ansible that referenced this issue Nov 27, 2024
@nikromen nikromen moved this from In Progress to Done in CPT Kanban Dec 2, 2024