ci: backport fix for `osx-*` 6h timeouts #700

dscho · 2024-10-22T14:11:24Z

For quite a while now we have been haunted by those evil timeouts in some of the slowest CI jobs. There is a fix in upstream Git now, already on their master branch, but still unreleased. I have integrated a backport in Git for Windows v2.47.0(2) via git-for-windows#5221.

Let's integrate this into vfs-2.47.0 separately since I do not plan on merging Git for Windows' main branch into vfs-2.47.0 any time soon (basically, I integrated those recent changes only because I had to have a version out that fixes git-for-windows#5199 but v2.47.0.vfs.0.2 already fixed that, so there is no urgency).

To start an async ipc server, you call ipc_server_run_async(). That initializes the ipc_server_data object, and starts all of the threads running, which may immediately start serving clients. This can create some awkward timing problems, though. In the fsmonitor daemon (the sole user of the simple-ipc system), we want to create the ipc server early in the process, which means we may start serving clients before the rest of the daemon is fully initialized. To solve this, let's break run_async() into two parts: an initialization which allocates all data and spawns the threads (without letting them run), and a start function which actually lets them begin work. Since we have two simple-ipc implementations, we have to handle this twice: - in ipc-unix-socket.c, we have a central listener thread which hands connections off to worker threads using a work_available mutex. We can hold that mutex after init, and release it when we're ready to start. We do need an extra "started" flag so that we know whether the main thread is holding the mutex or not (e.g., if we prematurely stop the server, we want to make sure all of the worker threads are released to hear about the shutdown). - in ipc-win32.c, we don't have a central mutex. So we'll introduce a new startup_barrier mutex, which we'll similarly hold until we're ready to let the threads proceed. We again need a "started" flag here to make sure that we release the barrier mutex when shutting down, so that the sub-threads can proceed to the finish. I've renamed the run_async() function to init_async() to make sure we catch all callers, since they'll now need to call the matching start_async(). We could leave run_async() as a wrapper that does both, but there's not much point. There are only two callers, one of which is fsmonitor, which will want to actually do work between the two calls. And the other is just a test-tool wrapper. For now I've added the start_async() calls in fsmonitor where they would otherwise have happened, so there should be no behavior change with this patch. Signed-off-by: Jeff King <[email protected]> Acked-by: Koji Nakamaru <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

There's a racy hang in fsmonitor on macOS that we sometimes see in CI. When we serve a client, what's supposed to happen is: 1. The client thread calls with_lock__wait_for_cookie() in which we create a cookie file and then wait for a pthread_cond event 2. The filesystem event listener sees the cookie file creation, does some internal book-keeping, and then triggers the pthread_cond. But there's a problem: we start the listener that accepts client threads before we start the fs event thread. So it's possible for us to accept a client which creates the cookie file and starts waiting before the fs event thread is initialized, and we miss those filesystem events entirely. That leaves the client thread hanging forever. In CI, the symptom is that t9210 (which is testing scalar, which always enables fsmonitor under the hood) may hang forever in "scalar clone". It is waiting on "git fetch" which is waiting on the fsmonitor daemon. The race happens more frequently under load, but you can trigger it predictably with a sleep like this, which delays the start of the fs event thread: --- a/compat/fsmonitor/fsm-listen-darwin.c +++ b/compat/fsmonitor/fsm-listen-darwin.c @@ -510,6 +510,7 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state) FSEventStreamSetDispatchQueue(data->stream, data->dq); data->stream_scheduled = 1; + sleep(1); if (!FSEventStreamStart(data->stream)) { error(_("Failed to start the FSEventStream")); goto force_error_stop_without_loop; One solution might be to reverse the order of initialization: start the fs event thread before we start the thread listening for clients. But the fsmonitor code explicitly does it in the opposite direction. The fs event thread wants to refer to the ipc_server_data struct, so we need it to be initialized first. A further complication is that we need a signal from the fs event thread that it is actually ready and listening. And those details happen within backend-specific fsmonitor code, whereas the initialization is in the shared code. So instead, let's use the ipc_server init/start split added in the previous commit. The generic fsmonitor code will init the ipc_server but _not_ start it, leaving that to the backend specific code, which now needs to call ipc_server_start_async() at the right time. For macOS, that is right after we start the FSEventStream that you can see in the diff above. It's not clear to me if Windows suffers from the same problem (and we simply don't trigger it in CI), or if it is immune. Regardless, the obvious place to start accepting clients there is right after we've established the ReadDirectoryChanges watch. This makes the hangs go away in our macOS CI environment, even when compiled with the sleep() above. Helped-by: Koji Nakamaru <[email protected]> Signed-off-by: Jeff King <[email protected]> Acked-by: Koji Nakamaru <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

derrickstolee

Thanks for getting this into all the right places.

peff added 2 commits October 8, 2024 12:03

dscho requested a review from derrickstolee October 22, 2024 14:11

dscho self-assigned this Oct 22, 2024

dscho mentioned this pull request Oct 22, 2024

ci: backport fix for osx-* 6h timeouts git-for-windows/git#5225

Closed

derrickstolee approved these changes Oct 22, 2024

View reviewed changes

derrickstolee merged commit 61778f5 into microsoft:vfs-2.47.0 Oct 22, 2024
49 checks passed

derrickstolee mentioned this pull request Oct 22, 2024

pack-objects: allow --path-walk with --shallow #699

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: backport fix for `osx-*` 6h timeouts #700

ci: backport fix for `osx-*` 6h timeouts #700

dscho commented Oct 22, 2024

derrickstolee left a comment

ci: backport fix for osx-* 6h timeouts #700

ci: backport fix for osx-* 6h timeouts #700

Conversation

dscho commented Oct 22, 2024

derrickstolee left a comment

Choose a reason for hiding this comment

ci: backport fix for `osx-*` 6h timeouts #700

ci: backport fix for `osx-*` 6h timeouts #700