s390x arches got slower in the last few days (like a lot) #3060

Closed
fberat opened this issue Dec 13, 2023 · 14 comments
Labels
bug external fedora-copr-admin Tasks that need to be done by Fedora Copr administrator

Comments

@fberat

fberat commented Dec 13, 2023

The following build took less than 1h on every arch except s390x, where it took more than 7h. That's unusual behaviour for s390x.

https://copr.fedorainfracloud.org/coprs/fberat/gcc-14_gnat/build/6743516/

Overall, for the past few days (maybe since last Sunday), it has been harder to get an s390x builder to start, and when one does, the build seems quite slow.
Note that a build of GCC started on Friday took a similar time to the other arches (about 30 hours), which would mean that if there is a problem, it is fairly recent.

Can you please have a look?

@github-project-automation github-project-automation bot moved this to Needs triage in CPT Kanban Dec 13, 2023
@praiskup praiskup added the fedora-copr-admin Tasks that need to be done by Fedora Copr administrator label Dec 13, 2023
@praiskup
Member

Should we report this against IBM Cloud support?

@praiskup praiskup moved this from Needs triage to In 3 months in CPT Kanban Dec 13, 2023
@codonell

Should we report this against IBM Cloud support?

Yes please?

Just to give some broader context here.

We are using COPR to test the GCC 14 rebase by using the Mass Prebuild tooling.

We are using COPR at the recommendation of FESCo because it is deemed to be as-fast and as-conformant with builds as Koji.

If the s390x builds take longer then it blocks us from getting results to improve GCC 14 for all Fedora arches, and the eventual Fedora 40 mass rebuild on 2024-01-17. We could exclude s390x, but we don't want to do that.

Adding @fweimer-rh for awareness.

@praiskup
Member

Sorry for the inconvenience. There seemed to be some allocation API problem; I'm not sure about the details. I killed some of the old builders, and things seem to allocate fine now. I'll try to keep this monitored :-/

Can you confirm this "performance" problem is still happening? These IBM Cloud builders were always slower compared to the other architectures, but according to @fberat's report above it seems they are now unusably slow. I don't seem to observe this right now, though.

@praiskup praiskup self-assigned this Dec 16, 2023
@praiskup praiskup moved this from In 3 months to In Progress in CPT Kanban Dec 16, 2023
@praiskup
Member

I tested a tar build that took about 14 minutes on the s390x builder, with most of the time spent in the disk-intensive tests:

171: storing sparse files > 8G                       ok
172: storing long sparse file names                  ok
173: listing sparse files bigger than 2^33 B         ok

Copr uses tmpfs for chroots, so when that is combined with a memory-intensive task we might overflow to swap extensively; see the hw profile of the s390x builder.

Can this be the issue?
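
One way to check the swap hypothesis on a builder is to look at how much swap is in use while the build runs. The following is a minimal sketch, not something Copr itself ships: it parses `/proc/meminfo`-style text and reports the fraction of swap in use. The function names (`parse_meminfo`, `swap_used_fraction`, `read_meminfo`) are made up for this illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_fraction(meminfo):
    """Return the fraction of swap currently in use (0.0 if there is no swap)."""
    total = meminfo.get("SwapTotal", 0)
    free = meminfo.get("SwapFree", 0)
    if total == 0:
        return 0.0
    return (total - free) / total


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling `swap_used_fraction(read_meminfo())` periodically during a build would show whether the tmpfs chroot plus the build's own memory use is pushing the machine into swap.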

@praiskup
Member

The instances are cz2-2x4 + 160 Volumes for SWAP. The quota sponsored by IBM allows us to spawn 18 such machines in parallel.

@fberat
Author

fberat commented Dec 18, 2023

@praiskup I'm trying it out right now. I've started 2 builds.
A short one first: https://copr.fedorainfracloud.org/coprs/fberat/gcc-14_gnat.checker/build/6765970/
Result: It took a while to get a builder (5 minutes, compared to less than 1 for the other arches), but the build itself looked fine. As a point of comparison, here is a previous attempt from earlier in the week: https://copr.fedorainfracloud.org/coprs/fberat/gcc-14_gnat/build/6759821/

A longer build is still ongoing, we'll see how it goes today: https://copr.fedorainfracloud.org/coprs/fberat/gcc-14_gnat/build/6765971/
This one took 7h on s390x last week, while it took less than 1h on other arches (specifically ppc64le took 1h). We'll see how it goes.

@praiskup
Member

Thank you for the test, but the build eventually failed for some core dump :-(

make[1]: *** [Makefile:29: check] Aborted (core dumped)
make[1]: Leaving directory '/builddir/build/BUILD/ahven-2.8/gnat_linux'
make: *** [Makefile:32: check] Error 2

@fberat
Author

fberat commented Dec 18, 2023

Thank you for the test, but the build eventually failed for some core dump :-(

make[1]: *** [Makefile:29: check] Aborted (core dumped)
make[1]: Leaving directory '/builddir/build/BUILD/ahven-2.8/gnat_linux'
make: *** [Makefile:32: check] Error 2

Yes, but that's fine, since it fails with the same core dump on all platforms :D

Regarding the second build, it wasn't completed in 5h, so I'd say the builders are still quite slow.

@praiskup
Member

Regarding the second build, it wasn't completed in 5h, so I'd say the builders are still quite slow.

Do you have a build of the Agda package from the time it was working OK?

@praiskup
Member

We discussed this "off list" and it appears that the problems before were caused by the resource allocation hiccup that is resolved.

The builder performance is "expected": the Agda build simply overflows from memory to swap and generates ~3000 IOPS. We are now running a test build with 4x more memory to confirm this, but it is quite obvious already.

@praiskup
Member

JFTR, as a mitigation for the frequent slowdown of giant builds, we upgraded the machines to have more RAM and decreased the quota from "up to 18" machines to "up to 12" machines (to keep the same $ budget). This is because we believe that @fberat could then rebuild all the Fedora packages much faster.

But it seems people are already complaining that we do not have enough s390x builders: https://matrix.to/#/#buildsys:fedoraproject.org

@praiskup
Member

praiskup commented Jan 3, 2024

Just a quick update; it seems that the new pattern with "up to 12" s390x machines with more memory works well enough. The queue gets bigger for "many small builds" but copr eventually handles it; and the "bigger builds" just scale better. So we don't plan to revert the change.

I bumped the thread with the IBM folks, asking if we could implement a more "boosted" approach to process the queue.
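
The trade-off discussed in this thread (going from "up to 18" to "up to 12" machines at the same budget) can be sketched with some back-of-the-envelope arithmetic. This is an illustration only, under loud assumptions that are not from the thread: all builds in a batch take the same time, every builder is always available, cost scales linearly with machine count, and the budget is fully used at the quota. The function names are hypothetical.

```python
import math


def waves(n_builds, n_builders):
    """Number of sequential 'waves' needed to run n_builds on n_builders."""
    return math.ceil(n_builds / n_builders)


def queue_wall_time(n_builds, n_builders, minutes_per_build):
    """Wall-clock minutes to drain a queue of equal-length builds (assumption)."""
    return waves(n_builds, n_builders) * minutes_per_build


def relative_machine_cost(old_quota, new_quota):
    """Per-machine cost ratio implied by keeping the same total budget (assumption)."""
    return old_quota / new_quota


# Hypothetical example: 36 small builds of 10 minutes each.
small_on_18 = queue_wall_time(36, 18, 10)  # 2 waves
small_on_12 = queue_wall_time(36, 12, 10)  # 3 waves
cost_ratio = relative_machine_cost(18, 12)
```

Under these assumptions, a queue of many small builds takes more waves on 12 machines than on 18, which matches the observation that the queue gets bigger, while each bigger machine may cost about 1.5 times as much as a smaller one.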

@praiskup
Member

praiskup commented Jan 4, 2024

We got approval to bump the builders up to 18 while staying with the memory-optimized instances, and on top of that to start 2 high-performance builders, so I created #3086.

@praiskup
Member

praiskup commented Jan 4, 2024

We got approval to bump the builders up to 18

Done, so closing.

@praiskup praiskup closed this as completed Jan 4, 2024
@nikromen nikromen moved this from In Progress to Done in CPT Kanban Jan 8, 2024

3 participants