proposal: testing/synctest: new package for testing concurrent code #67434

Open
neild opened this issue May 16, 2024 · 89 comments

Comments

@neild
Contributor

neild commented May 16, 2024

This is a proposal for a new package to aid in testing concurrent code.

// Package synctest provides support for testing concurrent code.
package synctest

// Run executes f in a new goroutine.
//
// The new goroutine and any goroutines transitively started by it form a group.
// Run waits for all goroutines in the group to exit before returning.
//
// Goroutines in the group use a synthetic time implementation.
// The initial time is midnight UTC 2000-01-01.
// Time advances when every goroutine is idle.
// If every goroutine is idle and there are no timers scheduled,
// Run panics.
func Run(f func())

// Wait blocks until every goroutine within the current group is idle.
//
// A goroutine is idle if it is blocked on a channel operation,
// mutex operation,
// time.Sleep,
// a select with no cases,
// or is the goroutine calling Wait.
//
// A goroutine blocked on an I/O operation, such as a read from a network connection,
// is not idle. Tests which operate on a net.Conn or similar type should use an
// in-memory implementation rather than a real network connection.
//
// The caller of Wait must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Wait panics.
func Wait()

This package has two main features:

  1. It permits using a fake clock to test code which uses timers. The test can control the passage of time as observed by the code under test.
  2. It permits a test to wait until an asynchronous operation has completed.

As an example, let us say we are testing an expiring concurrent cache:

type Cache[K comparable, V any] struct{}

// NewCache creates a new cache with the given expiry.
// f is called to create new items as necessary.
func NewCache[K comparable, V any](expiry time.Duration, f func(K) V) *Cache[K, V] {}

// Get returns the cache entry for K, creating it if necessary.
func (c *Cache[K, V]) Get(key K) V {}

A naive test for this cache might look something like this:

func TestCacheEntryExpires(t *testing.T) {
	count := 0
	c := NewCache(2 * time.Second, func(key string) string {
		count++
		return fmt.Sprintf("%v:%v", key, count)
	})

	// Get an entry from the cache.
	if got, want := c.Get("k"), "k:1"; got != want {
		t.Errorf("c.Get(k) = %q, want %q", got, want)
	}

	// Verify that we get the same entry when accessing it before the expiry.
	time.Sleep(1 * time.Second)
	if got, want := c.Get("k"), "k:1"; got != want {
		t.Errorf("c.Get(k) = %q, want %q", got, want)
	}

	// Wait for the entry to expire and verify that we now get a new one.
	time.Sleep(3 * time.Second)
	if got, want := c.Get("k"), "k:2"; got != want {
		t.Errorf("c.Get(k) = %q, want %q", got, want)
	}
}

This test has a couple of problems. It's slow, taking four seconds to execute. And it's flaky, because it assumes the cache entry will not have expired one second before its deadline and will have expired one second after. While computers are fast, it is not uncommon for an overloaded CI system to pause execution of a program for longer than a second.

We can make the test less flaky by making it slower, or we can make the test faster at the expense of making it flakier, but we can't make it fast and reliable using this approach.

We can design our Cache type to be more testable. We can inject a fake clock to give us control over time in tests. When advancing the fake clock, we will need some mechanism to ensure that any timers that fire have executed before progressing the test. These changes come at the expense of additional code complexity: We can no longer use time.Timer, but must use a testable wrapper. Background goroutines need additional synchronization points.

The synctest package simplifies all of this. Using synctest, we can write:

func TestCacheEntryExpires(t *testing.T) {
        synctest.Run(func() {
                count := 0
                c := NewCache(2 * time.Second, func(key string) string {
                        count++
                        return fmt.Sprintf("%v:%v", key, count)
                })

                // Get an entry from the cache.
                if got, want := c.Get("k"), "k:1"; got != want {
                        t.Errorf("c.Get(k) = %q, want %q", got, want)
                }

                // Verify that we get the same entry when accessing it before the expiry.
                time.Sleep(1 * time.Second)
                synctest.Wait()
                if got, want := c.Get("k"), "k:1"; got != want {
                        t.Errorf("c.Get(k) = %q, want %q", got, want)
                }

                // Wait for the entry to expire and verify that we now get a new one.
                time.Sleep(3 * time.Second)
                synctest.Wait()
                if got, want := c.Get("k"), "k:2"; got != want {
                        t.Errorf("c.Get(k) = %q, want %q", got, want)
                }
        })
}

This is identical to the naive test above, wrapped in synctest.Run and with the addition of two calls to synctest.Wait. However:

  1. This test is not slow. The time.Sleep calls use a fake clock, and execute immediately.
  2. This test is not flaky. The synctest.Wait ensures that all background goroutines have idled or exited before the test proceeds.
  3. This test requires no additional instrumentation of the code under test. It can use standard time package timers, and it does not need to provide any mechanism for tests to synchronize with it.

A limitation of the synctest.Wait function is that it does not recognize goroutines blocked on network or other I/O operations as idle. While the scheduler can identify a goroutine blocked on I/O, it cannot distinguish between a goroutine that is genuinely blocked and one which is about to receive data from a kernel network buffer. For example, if a test creates a loopback TCP connection, starts a goroutine reading from one side of the connection, and then writes to the other, the read goroutine may remain in I/O wait for a brief time before the kernel indicates that the connection has become readable. If synctest.Wait considered a goroutine in I/O wait to be idle, this would cause nondeterminism in cases such as this.

Tests which use synctest with network connections or other external data sources should use a fake implementation with deterministic behavior. For net.Conn, net.Pipe can create a suitable in-memory connection.
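
As a rough illustration of the net.Pipe suggestion, the sketch below sends a message to an echo goroutine over an in-memory connection. The echoRoundTrip helper is hypothetical. Note that net.Pipe is fully synchronous: each Write blocks until the other end reads it, so no kernel buffer is involved.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// echoRoundTrip sends msg over an in-memory connection to an echo goroutine
// and returns what comes back. No sockets or kernel buffers are involved.
func echoRoundTrip(msg string) (string, error) {
	client, server := net.Pipe()
	defer client.Close()

	go func() {
		defer server.Close()
		io.Copy(server, server) // echo until the client closes its end
	}()

	// Write blocks until the echo goroutine has read the whole message.
	if _, err := client.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(client, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := echoRoundTrip("ping")
	fmt.Println(got, err)
}
```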

This proposal is based in part on experience with tests in the golang.org/x/net/http2 package. Tests of an HTTP client or server often involve multiple interacting goroutines and timers. For example, a client request may involve goroutines writing to the server, reading from the server, and reading from the request body; as well as timers covering various stages of the request process. The combination of fake clocks and an operation which waits for all goroutines in the test to stabilize has proven effective.

@aclements
Member

I really like how simple this API is.

Time advances when every goroutine is idle.

How does time work when goroutines aren't idle? Does it stand still, or does it advance at the usual rate? If it stands still, it seems like that could break software that assumes time will advance during computation (though maybe that's rare in practice). If it advances at the usual rate, it seems like that reintroduces a source of flakiness. E.g., in your example, the 1 second sleep will advance time by 1 second, but then on a slow system the checking thread may still not execute for a long time.

What are the bounds of the fake time implementation? Presumably if you're making direct system calls that interact with times or durations, we're not going to do anything about that. Are we going to make any attempt at faking time in the file system?

If every goroutine is idle and there are no timers scheduled, Run panics.

What if a goroutine is blocked on a channel that goes outside the group? This came to mind in the context of whether this could be used to coordinate a multi-process client/server test, though I think it would also come up if there's any sort of interaction with a background worker goroutine or pool.

or is the goroutine calling Wait.

What happens if multiple goroutines in a group call Wait? I think the options are to panic or to consider all of them idle, in which case they would all wake up when every other goroutine in the group is idle.

What happens if you have nested groups, say group A contains group B, and a goroutine in B is blocked in Wait, and then a goroutine in A calls Wait? I think your options are to panic (though that feels wrong), wake up both if all of the goroutines in group A are idle, or wake up just B if all of the goroutines in B are idle (but this blocks waking up A until nothing is calling Wait in group B).

@neild
Contributor Author

neild commented May 16, 2024

How does time work when goroutines aren't idle?

Time stands still, except when all goroutines in a group are idle. (Same as the playground behaves, I believe.) This would break software that assumes time will advance. You'd need to use something else to test that case.

What are the bounds of the fake time implementation?

The time package: Now, Since, Sleep, Timer, Ticker, etc.

Faking time in the filesystem seems complicated and highly specialized, so I don't think we should try. Code which cares about file timestamps will need to use a test fs.FS or some such.

What if a goroutine is blocked on a channel that goes outside the group?

As proposed, this would count as an idle goroutine. If you fail to isolate the system under test this will probably cause problems, so don't do that.

What happens if multiple goroutines in a group call Wait?

As proposed, none of them ever wake up and your test times out, or possibly panics if we can detect that all goroutines are blocked in that case. Having them all wake at the same time would also be reasonable.

What happens if you have nested groups

Oh, I didn't think of that. Nested groups are too complicated, Run should panic if called from within a group.

@apparentlymart

This is a very interesting proposal!

I feel worried that the synctest.Run characteristic of establishing a "goroutine group" and blocking until it completes might make it an attractive nuisance for folks who see it as simpler than arranging for the orderly completion of many goroutines using other synchronization primitives. That is: people may be tempted to use it in non-test code.

Assuming that's a valid concern (if it isn't then I'll retract this entire comment!), I could imagine mitigating it in two different ways:

  • Offer "goroutine groups" as a standalone synchronization primitive that synctest.Run is implemented in terms of, offering the "wait for completion of this and any other related goroutines" mechanism as a feature separate from synthetic time. Those who want to use it in non-test code can therefore use the lower-level function directly, instead of using synctest.Run.
  • Change the synctest.Run design in some way that makes it harder to misuse. One possible idea: make synctest.Run take a testing.TB as an additional argument, and then in every case where the proposal currently calls for a panic use t.FailNow() instead. It's inconvenient (though of course not impossible) to obtain a testing.TB implementation outside of a test case or benchmark, which could be sufficient inconvenience for someone to reconsider what they were attempting.

(I apologize in advance if I misunderstood any part of the proposal or if I am missing something existing that's already similarly convenient to synctest.Run.)

@neild
Contributor Author

neild commented May 17, 2024

The fact that synctest goroutine groups always use a fake clock will hopefully act as discouragement to using them in non-test code. Defining goroutines blocked on I/O as not being idle also discourages use outside of tests; any goroutine reading from a network connection defeats synctest.Wait entirely.

I think using idle-wait synchronization outside of tests is always going to be a mistake. It's fragile and fiddly, and you're better served by explicit synchronization. (This prompts the question: Isn't this fragile and fiddly inside tests as well? It is, but using a fake clock removes many of the sources of fragility, and tests often have requirements that make the fiddliness a more worthwhile tradeoff. In the expiring cache example, for instance, non-test code will never need to guarantee that a cache entry expires precisely at the nanosecond defined.)

So while perhaps we could offer a standalone synchronization primitive outside of synctest, I think we would need a very good understanding of when it would be appropriate to use it.

As for passing a testing.TB to synctest.Run, I don't think this would do much to prevent misuse, since the caller could just pass a &testing.T{}, or just nil. I don't think it would be wise to use synctest outside of tests, but if someone disagrees, then I don't think it's worth trying to stop them.

@gh2o

gh2o commented May 18, 2024

Interesting proposal. I like that it allows for waiting for a group of goroutines, as opposed to all goroutines in my proposal (#65336), though I do have some concerns:

  • Complexity of implementation: Having to modify every time-related function may increase complexity for non-test code. Would it make more sense to outsource the mock time implementation to a third party library? The Wait() function should be sufficient for the third party library to function deterministically, and goroutines started by Run() would behave like normal goroutines in all aspects.

  • Timeouts: In my proposal, WaitIdle() returns a <-chan struct{} since it allows for a test harness to abort the test if it takes too long (e.g. 30 seconds in case the test gets stuck in an infinite loop). Would it make sense for the Wait() function here to return a chan too to allow for timeouts?

@neild
Contributor Author

neild commented May 18, 2024

One of the goals of this proposal is to minimize the amount of unnatural code required to make a system testable. Mock time implementations require replacing calls to idiomatic time package functions with a testable interface. Putting fake time in the standard library would let us just write the idiomatic code without compromising testability.

For timeouts, the -timeout test flag allows aborting too-slow tests. Putting an explicit timeout in test code is usually a bad idea, because how long a test is expected to run is a property of the local system. (I've seen a lot of tests inside Google which set an explicit timeout of 5 or 10 seconds, and then experience flakiness when run with -tsan and on CI systems that execute at a low batch priority.)

Also, it would be pointless for Wait to return a <-chan struct{}, because Wait must be called from within a synctest group and therefore the caller doesn't have access to a real clock.

@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals May 18, 2024
@neild
Contributor Author

neild commented May 22, 2024

I wanted to evaluate practical usage of the proposed API.

I wrote a version of Run and Wait based on parsing the output of runtime.Stack. Wait calls runtime.Gosched in a loop until all goroutines in the current group are idle.
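
The experimental code itself is in the CL linked at the end of this comment and is not reproduced here. As a loose sketch of the general idea only, the hypothetical goroutineStates function below extracts the wait state from each "goroutine N [state]:" header in a runtime.Stack dump. It assumes the header format current Go releases print, which is not a documented stability guarantee.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// goroutineStates extracts the bracketed wait state from each
// "goroutine N [state]:" header in a runtime.Stack dump.
func goroutineStates(dump string) []string {
	var states []string
	for _, line := range strings.Split(dump, "\n") {
		if !strings.HasPrefix(line, "goroutine ") {
			continue
		}
		open := strings.IndexByte(line, '[')
		end := strings.LastIndexByte(line, ']')
		if open < 0 || end < open {
			continue
		}
		state := line[open+1 : end]
		// Drop suffixes such as ", 5 minutes" or ", locked to thread".
		if i := strings.IndexByte(state, ','); i >= 0 {
			state = state[:i]
		}
		states = append(states, state)
	}
	return states
}

func main() {
	buf := make([]byte, 1<<16)
	n := runtime.Stack(buf, true) // true: include all goroutines
	fmt.Println(goroutineStates(string(buf[:n])))
}
```

A polling Wait could then treat states such as "chan receive", "select", or "sleep" as idle and keep yielding with runtime.Gosched until no goroutine of interest is in any other state. Which states count as idle, and how goroutines are assigned to a group, are left out of this sketch.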

I also wrote a fake time implementation.

Combined, these form a reasonable facsimile of the proposed synctest package, with some limitations: The code under test needs to be instrumented to call the fake time functions, and to call a marking function after creating new goroutines. Also, you need to call a synctest.Sleep function in tests to advance the fake clock.

I then added this instrumentation to net/http.

The synctest package does not work with real network connections, so I added an in-memory net.Conn implementation to the net/http tests.

I also added an additional helper to net/http's tests, which simplifies some of the experimentation below:

var errStillRunning = errors.New("async op still running")

// asyncResult is the result of an asynchronous operation.
type asyncResult[T any] struct {}

// runAsync runs f in a new goroutine,
// and returns an asyncResult which is populated with the result of f when it finishes.
// runAsync calls synctest.Wait after running f.
func runAsync[T any](f func() (T, error)) *asyncResult[T]

// done reports whether the asynchronous operation has finished.
func (r *asyncResult[T]) done() bool

// result returns the result of the asynchronous operation.
// It returns errStillRunning if the operation is still running.
func (r *asyncResult[T]) result() (T, error)
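
Only the declarations are given above. One possible shape for the bodies is sketched below, under a stated difference: this version does not call synctest.Wait (it is written to run without the proposed package), so it adds a hypothetical donec channel that callers can receive from to know the operation has finished.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errStillRunning = errors.New("async op still running")

// asyncResult holds the outcome of f once it has finished.
type asyncResult[T any] struct {
	mu       sync.Mutex
	finished bool
	val      T
	err      error
	donec    chan struct{} // closed on completion; an addition for this sketch
}

// runAsync starts f in a new goroutine. Unlike the helper described above,
// it does not call synctest.Wait, so callers synchronize via donec.
func runAsync[T any](f func() (T, error)) *asyncResult[T] {
	r := &asyncResult[T]{donec: make(chan struct{})}
	go func() {
		v, err := f()
		r.mu.Lock()
		r.finished, r.val, r.err = true, v, err
		r.mu.Unlock()
		close(r.donec)
	}()
	return r
}

// done reports whether the asynchronous operation has finished.
func (r *asyncResult[T]) done() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.finished
}

// result returns the operation's result, or errStillRunning if it is not done.
func (r *asyncResult[T]) result() (T, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.finished {
		var zero T
		return zero, errStillRunning
	}
	return r.val, r.err
}

func main() {
	gate := make(chan struct{})
	r := runAsync(func() (int, error) {
		<-gate // block until main releases the operation
		return 42, nil
	})
	fmt.Println(r.result()) // still blocked on gate
	close(gate)
	<-r.donec
	fmt.Println(r.result())
}
```

With a Wait call inside runAsync, as in the described helper, the test would not need donec: by the time runAsync returned, f would have either finished or blocked.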

One of the longest-running tests in the net/http package is TestServerShutdownStateNew (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/serve_test.go#5611). This test creates a server, opens a connection to it, and calls Server.Shutdown. It asserts that the server, which is expected to wait 5 seconds for the idle connection to close, shuts down in no less than 2.5 seconds and no more than 7.5 seconds. This test generally takes about 5-6 seconds to run in both HTTP/1 and HTTP/2 modes.

The portion of this test which performs the shutdown is:

shutdownRes := make(chan error, 1)
go func() {
	shutdownRes <- ts.Config.Shutdown(context.Background())
}()
readRes := make(chan error, 1)
go func() {
	_, err := c.Read([]byte{0})
	readRes <- err
}()

// TODO(#59037): This timeout is hard-coded in closeIdleConnections.
// It is undocumented, and some users may find it surprising.
// Either document it, or switch to a less surprising behavior.
const expectTimeout = 5 * time.Second

t0 := time.Now()
select {
case got := <-shutdownRes:
	d := time.Since(t0)
	if got != nil {
		t.Fatalf("shutdown error after %v: %v", d, err)
	}
	if d < expectTimeout/2 {
		t.Errorf("shutdown too soon after %v", d)
	}
case <-time.After(expectTimeout * 3 / 2):
	t.Fatalf("timeout waiting for shutdown")
}

// Wait for c.Read to unblock; should be already done at this point,
// or within a few milliseconds.
if err := <-readRes; err == nil {
	t.Error("expected error from Read")
}

I wrapped the test in a synctest.Run call and changed it to use the in-memory connection. I then rewrote this section of the test:

shutdownRes := runAsync(func() (struct{}, error) {
	return struct{}{}, ts.Config.Shutdown(context.Background())
})
readRes := runAsync(func() (int, error) {
	return c.Read([]byte{0})
})

// TODO(#59037): This timeout is hard-coded in closeIdleConnections.
// It is undocumented, and some users may find it surprising.
// Either document it, or switch to a less surprising behavior.
const expectTimeout = 5 * time.Second

synctest.Sleep(expectTimeout - 1)
if shutdownRes.done() {
	t.Fatal("shutdown too soon")
}

synctest.Sleep(2 * time.Second)
if _, err := shutdownRes.result(); err != nil {
	t.Fatalf("Shutdown() = %v, want complete", err)
}
if n, err := readRes.result(); err == nil || err == errStillRunning {
	t.Fatalf("Read() = %v, %v; want error", n, err)
}

The test exercises the same behavior it did before, but it now runs instantaneously. (0.01 seconds on my laptop.)

I made an interesting discovery after converting the test: The server does not actually shut down in 5 seconds. In the initial version of this test, I checked for shutdown exactly 5 seconds after calling Shutdown. The test failed, reporting that the Shutdown call had not completed.

Examining the Shutdown function revealed that the server polls for closed connections during shutdown, with a maximum poll interval of 500ms, and therefore shutdown can be delayed slightly past the point where connections have shut down.

I changed the test to check for shutdown after 6 seconds. But once again, the test failed.

Further investigation revealed this code (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/server.go#3041):

st, unixSec := c.getState()
// Issue 22682: treat StateNew connections as if
// they're idle if we haven't read the first request's
// header in over 5 seconds.
if st == StateNew && unixSec < time.Now().Unix()-5 {
	st = StateIdle
}

The comment states that new connections are considered idle after 5 seconds, but thanks to the low granularity of Unix timestamps this check can consider one idle after as little as 4 or as much as 6 seconds. Combined with the 500ms poll interval (and ignoring any added scheduler delay), Shutdown may take up to 6.5 seconds to complete, not 5.

Using a fake clock rather than a real one not only speeds up this test dramatically, but it also allows us to more precisely test the behavior of the system under test.


Another slow test is TestTransportExpect100Continue (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/transport_test.go#1188). This test sends an HTTP request containing an "Expect: 100-continue" header, which indicates that the client is waiting for the server to indicate that it wants the request body before it sends it. In one variation, the server does not send a response; after a 2 second timeout, the client gives up waiting and sends the request body.

This test takes 2 seconds to execute, thanks to this timeout. In addition, the test does not validate the timing of the client sending the request body; in particular, tests pass even if the client waits

The portion of the test which sends the request is:

resp, err := c.Do(req)

I changed this to:

rt := runAsync(func() (*Response, error) {
	return c.Do(req)
})
if v.timeout {
	synctest.Sleep(expectContinueTimeout-1)
	if rt.done() {
		t.Fatalf("RoundTrip finished too soon")
	}
	synctest.Sleep(1)
}
resp, err := rt.result()
if err != nil {
	t.Fatal(err)
}

This test now executes instantaneously. It also verifies that the client does or does not wait for the ExpectContinueTimeout as expected.

I made one discovery while converting this test. The synctest.Run function blocks until all goroutines in the group have exited. (In the proposed synctest package, Run will panic if all goroutines become blocked (deadlock), but I have not implemented that feature in the test version of the package.) The test was hanging in Run, due to leaking a goroutine. I tracked this down to a missing net.Conn.Close call, which was leaving an HTTP client reading indefinitely from an idle and abandoned server connection.

In this case, Run's behavior caused me some confusion, but ultimately led to the discovery of a real (if fairly minor) bug in the test. (I'd probably have experienced less confusion, but I initially assumed this was a bug in the implementation of Run.)


At one point during this exercise, I accidentally called testing.T.Run from within a synctest.Run group. This results in, at the very best, quite confusing behavior. I think we would want to make it possible to detect when running within a group, and have testing.T.Run panic in this case.


My experimental implementation of the synctest package includes a synctest.Sleep function by necessity: It was much easier to implement with an explicit call to advance the fake clock. However, I found in writing these tests that I often want to sleep and then wait for any timers to finish executing before continuing.

I think, therefore, that we should have one additional convenience function:

package synctest

// Sleep pauses the current goroutine for the duration d,
// and then blocks until every goroutine in the current group is idle.
// It is identical to calling time.Sleep(d) followed by Wait.
//
// The caller of Sleep must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Sleep panics.
func Sleep(d time.Duration) {
	time.Sleep(d)
	Wait()
}

The net/http package was not designed to support testing with a fake clock. This has served as an obstacle to improving the state of the package's tests, many of which are slow, flaky, or both.

Converting net/http to be testable with my experimental version of synctest required a small number of minor changes. A runtime-supported synctest would have required no changes at all to net/http itself.

Converting net/http tests to use synctest required adding an in-memory net.Conn. (I didn't attempt to use net.Pipe, because its fully-synchronous behavior tends to cause problems in tests.) Aside from this, the changes required were very small.


My experiment is in https://go.dev/cl/587657.

@rsc
Contributor

rsc commented May 23, 2024

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@gh2o

gh2o commented May 29, 2024

Commenting here due to @rsc's request:

Relative to my proposal #65336, I have the following concerns:

  • Goroutine grouping: the only precedent for a goroutine having a user-visible identity is runtime.LockOSThread(), and even then, it is set-only: a goroutine cannot know whether it is locked to a thread or not without parsing runtime.Stack() output. Having these special "test mode" goroutines feels like a violation of goroutines being interchangeable anonymous workers, insofar as the Go runtime hides the goroutine ID from user code. Having a global wait is acceptable in the case of tests since it is unlikely for background goroutines to be present to interfere with the wait (and possibly actually desirable to catch those too).
  • Overriding standard library behavior: again, there is no precedent for standard library functions to behave differently based on what goroutine they are called from. The standard idiomatic way to do this is to define an interface (e.g. fs.FS) and direct all calls through the interface, and the implementation of the interface can be mocked at test time. If it is desirable to keep the current Run()/Wait() API, I would still strongly advocate for not changing the behavior of the standard time package, and instead incorporate a mock clock implementation in another package (likely under testing).

@neild
Contributor Author

neild commented May 29, 2024

Regarding overriding the time package vs. providing a testing implementation:

The time package provides a number of commonly-used, exported functions, where code that makes direct use of these functions cannot be properly tested. I think this makes it unique in the standard library. For example, code which directly calls time.Sleep cannot be tested properly, because inserting a real delay inherently makes a test slow, and because there is no non-flaky way for a test to ensure that a certain amount of time has elapsed.

In contrast, we can test code which calls os.Open by providing it with the name of a file in a test directory. We can test code which calls net.Listen by listening on a loopback interface. The io/fs.FS interface may be used to create a testable seam in a system, but it isn't required.

Time is fundamentally different in that there is no way to use real time in a test without making the test flaky and slow.

Time is also different from an fs.File or a net.Conn in that there is only one plausible production implementation of time. A fs.FS might be the local filesystem, or an embedded set of static files, or a remote filesystem of some kind. A net.Conn might be a TCP or TLS connection. But it is difficult to come up with occasions outside of tests when time.Sleep should do anything other than sleep for the defined amount of time.

Since we can't use real time in tests, we can insert a testable wrapper around the time package as you propose. This requires that we avoid the idiomatic and easy-to-use time package functions. We essentially put an asterisk next to every existing function in the time package that deals with the system clock saying, "don't actually use this, or at least not in code you intend to test".

In addition, if we define a standard testable wrapper around the clock, we are essentially declaring that all public packages which deal with time should provide a way to plumb in a clock. (Some packages do this already, of course; crypto/tls.Config.Time is an example in std).

That's an option, of course. But it would be a very large change to the Go ecosystem as a whole.

@DmitriyMV
Contributor

DmitriyMV commented May 29, 2024

the only precedent for goroutine having a user-visible identity is runtime.LockOSThread()

The pprof.SetGoroutineLabels disagrees.

insofar as the Go runtime hides the goroutine ID from user code

It doesn't try to hide it, more like tries to restrict people from relying on numbers.

Having a global wait is acceptable in the case of tests since it is unlikely for background goroutines to be present to interfere with the wait (and possibly actually desirable to catch those too).

If I understood the proposal correctly, it will wait for any goroutine (recursively) that was started using a go statement from the func passed to Run. It will not catch anything started before it or alongside it. Which raises a good question: @neild will it also wait for time.AfterFunc(...) goroutines if time.AfterFunc(...) was called in the chain leading to synctest.Run?

@neild
Contributor Author

neild commented May 29, 2024

@neild will it also wait for time.AfterFunc(...) goroutines if time.AfterFunc(...) was called in the chain leading to synctest.Run?

Yes, if you call AfterFunc from within a synctest group then the goroutine started by AfterFunc is also in the group.

@gh2o

gh2o commented May 30, 2024

Given that there's more precedent for goroutine identity than I had previously thought, and seeing how pprof.Do() works, I am on board with the idea of goroutine groups.

However, I'm still a little ambivalent about goroutine groups affecting time package / standard library behavior, and theoretically a test running in synctest mode may want to know the real world time for logging purposes (I guess that could be solved by adding a time.RealNow() or something similar). The Wait() primitive seems to provide what is necessary for a third-party package to provide the same functionality without additional runtime support, so it could be worth exploring this option a bit more.

That being said, I agree that plumbing a time/clock interface through existing code is indeed tedious, and having time modified to conditionally use a mock timer may be the lesser evil. But it still feels a little icky to me for some reason.

@aclements
Member

Thanks for doing the experiment. I find the results pretty compelling.

I think, therefore, that we should have one additional convenience function: [synctest.Sleep]

I don't quite understand this function. Given the fake time implementation, if you sleep even a nanosecond past timer expiry, aren't you already guaranteed that those timers will have run because the fake time won't advance to your sleep deadline until everything is blocked again?

Nested groups are too complicated, Run should panic if called from within a group.

Partly I was wondering about nested groups because I've been scheming other things that the concept of a goroutine group could be used for. Though it's true that, even if we have groups for other purposes, it may make sense to say that synctest groups cannot be nested, even if in general groups can be nested.

@neild
Contributor Author

neild commented May 30, 2024

Given the fake time implementation, if you sleep even a nanosecond past timer expiry, aren't you already guaranteed that those timers will have run because the fake time won't advance to your sleep deadline until everything is blocked again?

You're right that sleeping past the deadline of a timer is sufficient. The synctest.Wait function isn't strictly necessary at all; you could use time.Sleep(1) to skip ahead a nanosecond and ensure all currently running goroutines have parked.

It's fairly natural to sleep to the exact instant of a timer, however. If a cache entry expires in some amount of time, it's easy to sleep for that exact amount of time, possibly using the same constant that the cache timeout was initialized with, rather than adding a nanosecond.

Adding nanoseconds also adds a small but real amount of confusion to a test in various small ways: The time of logged events drifts off the integer second, rate calculations don't come out as cleanly, and so on.

Plus, if you forget to add the necessary adjustment or otherwise accidentally sleep directly onto the instant of a timer's expiry, you get a race condition.

Cleaner, I think, for the test code to always resynchronize after poking the system under test. This doesn't have to be a function in the synctest package, of course; synctest.Sleep is a trivial two-liner using exported APIs. But I suspect most users of the package would use it, or at least the ones that make use of the synthetic clock.

I've been scheming other things that the concept of a goroutine group could be used for.

I'm very intrigued! I've just about convinced myself that there's a useful general purpose synchronization API hiding in here, but I'm not sure what it is or what it's useful for.

@rsc
Contributor

rsc commented Jun 5, 2024

For what it's worth, I think it's a good thing that virtual time is included in this, because it makes sure that this package isn't used in production settings. It makes it only suitable for tests (and very suitable).

@rsc
Contributor

rsc commented Jun 5, 2024

It sounds like the API is still:

// Package synctest provides support for testing concurrent code.
package synctest

// Run executes f in a new goroutine.
//
// The new goroutine and any goroutines transitively started by it form a group.
// Run waits for all goroutines in the group to exit before returning.
//
// Goroutines in the group use a synthetic time implementation.
// The initial time is midnight UTC 2000-01-01.
// Time advances when every goroutine is idle.
// If every goroutine is idle and there are no timers scheduled,
// Run panics.
func Run(f func())

// Wait blocks until every goroutine within the current group is idle.
//
// A goroutine is idle if it is blocked on a channel operation,
// mutex operation,
// time.Sleep,
// a select with no cases,
// or is the goroutine calling Wait.
//
// A goroutine blocked on an I/O operation, such as a read from a network connection,
// is not idle. Tests which operate on a net.Conn or similar type should use an
// in-memory implementation rather than a real network connection.
//
// The caller of Wait must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Wait panics.
func Wait()

Damien suggested adding also:

// Sleep pauses the current goroutine for the duration d,
// and then blocks until every goroutine in the current group is idle.
// It is identical to calling time.Sleep(d) followed by Wait.
//
// The caller of Sleep must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Sleep panics.
func Sleep(d time.Duration) {
	time.Sleep(d)
	Wait()
}

The difference between time.Sleep and synctest.Sleep seems subtle enough that it seems like you should have to spell out the Wait at the call sites where you need it. The only time you really need Wait is if you know someone else is waking up at that very moment. But then if they've both done the Sleep+Wait form then you still have a problem. You really only want some of the call sites (maybe just one) to use the Sleep+Wait form. I suppose that the production code will use time.Sleep since it's not importing synctest, so maybe it's clear that the test harness is the only one that will call Sleep+Wait. On the other hand, fixing a test failure by changing s/time.Sleep/synctest.Sleep/ will be a strange-looking bug fix. Better to have to add synctest.Wait instead. If we really need this, it could be synctest.SleepAndWait but that's what statements are for. Probably too subtle and should just limit the proposal to Run and Wait.

@gh2o

gh2o commented Jun 5, 2024

Some additional suggestions for the description of the Wait() function:

// A goroutine is idle if it is blocked on a channel operation,
// mutex operation (...),
// time.Sleep,
// a select operation with or without cases,
// or is the goroutine calling Wait.
//
// A goroutine blocked on an I/O operation, such as a read from a network connection,
// is not idle. Tests which operate on a net.Conn or similar type should use an
// in-memory implementation rather than a real network connection.
//
// A goroutine blocked on a direct syscall (via the syscall package) is also not idle,
// even if the syscall itself sleeps.

Additionally, for "mutex operation", let's list out the exact operations considered for implementation/testing completeness:

  • sync.Cond.Wait()
  • sync.Mutex.Lock()
  • sync.RWMutex.Lock()
  • sync.RWMutex.RLock()
  • sync.WaitGroup.Wait()

@nightlyone
Contributor

The API looks simple and that is excellent.

What I am worried about is the unexpected failure modes, leading to undetected regressions, which might need tight support in the testing package to detect.

Imagine you unit test your code but are unable to mock out a dependency. Maybe due to lack of experience or bad design of existing code I have to work with.

That dependency then suddenly starts making a syscall (e.g. to lazily tune the library via a sync.Once, with a timeout, rather than at init time).

Without support in testing, you will never detect that; your tests will just suddenly time out after an innocent minor dependency update.

@nightlyone
Contributor

Orthogonally to the previous comment, may I suggest limiting this package to the standard library at first, to gather more experience with the approach?

That would also allow sketching out integration with the testing package, in addition to finding more pitfalls.

@neild
Contributor Author

neild commented Jun 6, 2024

What I am worried about is the unexpected failure modes, leading to undetected regressions, which might need tight support in the testing package to detect.

Can you expand more on what you mean by undetected regressions?

If the code under test (either directly, or through a dependency) unexpectedly calls a blocking syscall, Wait will wait for that syscall to complete before proceeding. If the syscall completes normally (the code is using os/exec to execute a subprocess, for example), then everything should operate as expected--the operation completes and the test proceeds. If the syscall is waiting on some event (reading from a network socket, perhaps), then the test will hang, which is a detectable event. You can look at goroutine stacks from the timed-out test to analyze the reason for the hang.

Without support in testing

What kind of support are you thinking of?

@ChrisHines
Contributor

What does this do?

func TestWait(t *testing.T) {
    synctest.Run(func() {
        synctest.Wait()
    })
}

Does it succeed or panic? It's not clear to me from the API docs because:

If every goroutine is idle and there are no timers scheduled, Run panics.

A goroutine is idle if it [...] is the goroutine calling Wait.

This is obviously a degenerate case, but I think it also applies if a test wanted to get the fake time features when testing otherwise non-concurrent code.

@gh2o

gh2o commented Jun 6, 2024

What does this do?

func TestWait(t *testing.T) {
    synctest.Run(func() {
        synctest.Wait()
    })
}

In this case, the goroutine calling synctest.Wait() should never enter idle because there's nothing to wait for, and hence a panic should not occur.

@prattmic
Member

prattmic commented Sep 11, 2024

crypto/internal/randutil.MaybeReadByte does exactly this and is used by a variety of public crypto APIs. :(

Edit: FWIW, I don't understand why this function is implemented this way. The select implementation ends up using runtime.cheaprand, not some particularly fancy RNG. Perhaps the issue here was just the import graph? But that is mostly beside the point; this is just an example that such lazy initialization exists in practice.

@cherrymui
Member

@rsc suggested above #67434 (comment) that we restrict the panic behavior to timer channels. That would eliminate @prattmic 's concern.

The updated proposal/CL applies the restriction to all bubbled channels, regardless of timers or data type. While the new update does solve the issue about buffering and (mostly) the issue about global channels, is there a good reason to do all channels instead of just timer channels? Or would timer channels suffice for the intended testing cases?

@rsc
Contributor

rsc commented Sep 11, 2024

"Just timers" does seem better if we can make it work.

Someone can implement what is semantically a mutex using (a) sync.Mutex, (b) channels, or (c) sync.Cond.
It seems weird that one of these methods has different behavior than the other two.
It seems like all three should be acceptable, not just sync.Mutex.

@neild
Contributor Author

neild commented Sep 11, 2024

To recap, the proposed panic behavior is: If a non-bubbled goroutine operates on a bubbled channel, panic.

The rationale for having a distinction between bubbled and unbubbled channels is to allow a bubbled goroutine to access channel-synchronized resources from outside its bubble. For example, let's imagine a simple case where a channel is being used as a lock:

var lockChan = make(chan struct{}, 1) // locked when the channel contains a value

// Get fetches some resource.
func Get() T {
  lockChan <- struct{}{} // lock acquired
  v := acquireResource()
  <-lockChan // lock released
  return v
}

If we didn't have a distinction between bubbled and unbubbled channels, then a bubbled goroutine calling Get while the lock is held from outside the bubble will run into problems:

synctest.Run(func() {
  Get()
})

Get blocks writing to lockChan, the only goroutine in the bubble is now idle, Run panics because all goroutines are deadlocked.

Making a distinction between bubbled and unbubbled channels means that instead when Get blocks on lockChan, synctest can recognize that it is blocked on something outside the bubble and not actually idle.

If lockChan is lazily created, however, it might be inadvertently created within a synctest bubble. Now we fall back to the previous behavior: Some goroutine outside the bubble acquires lockChan, Get blocks on lockChan, lockChan is in the bubble, Run panics. But things are much more confusing than before, because the behavior depends on when lockChan was created.

To avoid this confusion, we panic when the unbubbled goroutine writes to lockChan. An unbubbled goroutine accessing a bubbled channel indicates that something has gone wrong. The fix to the problem will probably be to ensure any lazy initialization happens outside the bubble.

I think that if we distinguish between bubbled and unbubbled chans, then we need to prevent unbubbled goroutines from accessing bubbled chans to avoid confusion. If we don't distinguish between bubbled and unbubbled chans, then the overall model is simpler, but bubbled goroutines can't access global resources synchronized by a channel, which is unfortunate.
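As a rough illustration of the lazy-initialization hazard described above, the following sketch contrasts a lock channel created at package initialization with one created on first use. The names eagerLock, getLazyLock, and withLock are hypothetical and not part of the proposal; the sketch does not use synctest at all, it only shows where each channel comes into existence.

```go
package main

import (
	"fmt"
	"sync"
)

// eagerLock is created at package initialization, which happens before
// any test function runs, and so before any bubble could exist.
var eagerLock = make(chan struct{}, 1)

// lazyLock is created on first use. If the first caller happened to be
// a bubbled goroutine, the channel would be created inside that bubble.
var (
	lazyOnce sync.Once
	lazyLock chan struct{}
)

// getLazyLock returns the lazily created lock channel (hypothetical helper).
func getLazyLock() chan struct{} {
	lazyOnce.Do(func() { lazyLock = make(chan struct{}, 1) })
	return lazyLock
}

// withLock runs f while holding the channel-based lock.
func withLock(lock chan struct{}, f func() int) int {
	lock <- struct{}{}        // acquire: blocks while the channel is full
	defer func() { <-lock }() // release
	return f()
}

func main() {
	fmt.Println(withLock(eagerLock, func() int { return 1 }))
	fmt.Println(withLock(getLazyLock(), func() int { return 2 }))
}
```

Under the rules discussed in this comment, the eager form would always be an unbubbled channel, while the lazy form depends on who calls getLazyLock first.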

@bmizerany
Contributor

I've greatly enjoyed using the preview of this addition. It's been very useful in my work. However, now that synctest is internal, it's become challenging to test and use outside stdlib.

Would it be possible to make it accessible through a less restrictive means than a stdlib internal package? Perhaps a GOEXPERIMENT flag could work? The functionality is valuable enough that I'd consider vendoring it if it weren't so tightly coupled with Go's internals.

This is such an awesome new addition. I'm eager to keep using it, even in an experimental state. :)

@neild
Contributor Author

neild commented Sep 16, 2024

Out of curiosity, I tried implementing the sync.Mutex semantics I described above:
https://go.dev/cl/613515

To be clear, I am not currently proposing we do this. This is just an experiment to see how intrusive the changes to sync.Mutex might be.

To recap, this adds the following rules:

  • A locked sync.Mutex tracks whether the locking goroutine was in a bubble or not.
  • A goroutine blocked on sync.Mutex.Lock is idle if and only if the locking goroutine was in a bubble.

This essentially means that a mutex used within a bubble counts for idleness detection, but a global mutex shared with goroutines outside the bubble (such as the reflect type cache mutexes) does not.
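The two rules above can be modeled with a toy type. This is only a sketch of the bookkeeping, not the change in the linked CL and not how sync.Mutex is implemented; trackedMutex and its methods are hypothetical names, and the caller passes in whether it is bubbled because ordinary Go code has no way to ask.

```go
package main

import "fmt"

// trackedMutex is a toy model, not sync.Mutex. It records only the state
// needed to answer "is a goroutine blocked on Lock idle?".
type trackedMutex struct {
	locked         bool
	lockedInBubble bool // whether the locking goroutine was in a bubble
}

// lock records an acquisition by a goroutine that is (or is not) bubbled.
// It returns false if the mutex is already held; a real mutex would block.
func (m *trackedMutex) lock(inBubble bool) bool {
	if m.locked {
		return false
	}
	m.locked, m.lockedInBubble = true, inBubble
	return true
}

// unlock releases the mutex and clears the recorded bubble state.
func (m *trackedMutex) unlock() {
	m.locked, m.lockedInBubble = false, false
}

// waiterIsIdle reports whether a bubbled goroutine blocked on Lock would
// count as idle: only when the current holder locked it from a bubble.
func (m *trackedMutex) waiterIsIdle() bool {
	return m.locked && m.lockedInBubble
}

func main() {
	var m trackedMutex
	m.lock(true) // held by a bubbled goroutine
	fmt.Println(m.waiterIsIdle())
	m.unlock()
	m.lock(false) // held by a goroutine outside any bubble
	fmt.Println(m.waiterIsIdle())
}
```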

The changes required to sync.Mutex are not huge, but are not entirely trivial either. I'm still not convinced the value of mutex support in synctest is worth changing such a performance-critical type.

@aclements
Member

I am not fully caught up on this, but my inclination is that we need to take a step back and rethink the concept of bubbled and non-bubbled synchronization primitives. Given that the heap is shared, it seems fundamental that synchronization between inside and outside a bubble needs to work without panicking.

@neild
Contributor Author

neild commented Sep 18, 2024

To summarize the current state of affairs as I understand it:

The synctest package depends on identifying when bubbled goroutines are durably blocked. ("Durably": The goroutine is not just parked, it isn't going to unpark without some other goroutine in its bubble taking an action.) The synctest fake clock advances when all bubbled goroutines are durably blocked, and the Wait function lets tests wait for background work to complete.
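To make the clock rule concrete, here is a toy model of "time advances only when every bubbled goroutine is durably blocked". It is a sketch under simplifying assumptions, not the runtime implementation: the bubble type, its running counter, and errDeadlock are hypothetical, and goroutine state is tracked by hand. The starting time of midnight UTC 2000-01-01 is taken from the proposed Run documentation.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"time"
)

// bubble is a toy model of a synctest bubble's fake clock.
type bubble struct {
	now     time.Time
	running int         // goroutines that are NOT durably blocked
	timers  []time.Time // pending timer deadlines, kept sorted
}

func newBubble() *bubble {
	return &bubble{now: time.Date(2000, 1, 1, 0, 0, 0, 0, time.UTC)}
}

// addTimer schedules a timer d after the current fake time.
func (b *bubble) addTimer(d time.Duration) {
	b.timers = append(b.timers, b.now.Add(d))
	sort.Slice(b.timers, func(i, j int) bool { return b.timers[i].Before(b.timers[j]) })
}

var errDeadlock = errors.New("all goroutines durably blocked and no timers scheduled")

// advance jumps the clock to the earliest timer, but only if every
// goroutine is durably blocked. It reports whether the clock moved.
func (b *bubble) advance() (bool, error) {
	if b.running > 0 {
		return false, nil // someone can still make progress; time stands still
	}
	if len(b.timers) == 0 {
		return false, errDeadlock // the proposed Run panics in this situation
	}
	b.now = b.timers[0]
	b.timers = b.timers[1:]
	b.running++ // the goroutine waiting on that timer wakes up
	return true, nil
}

func main() {
	b := newBubble()
	b.addTimer(5 * time.Second)
	moved, err := b.advance() // running == 0, so the clock jumps to the timer
	fmt.Println(moved, err, b.now.Format(time.RFC3339))
}
```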

A bubbled goroutine can block on some non-bubbled resource. For example, reflect.TypeOf has a mutex-guarded cache, so a bubbled goroutine which calls TypeOf can block waiting for a non-bubbled goroutine. This goroutine is not durably blocked--it will resume executing when the non-bubbled goroutine releases the mutex.

The reflect.TypeOf case demonstrates that synctest must gracefully handle the case of a bubbled goroutine non-durably blocked on a goroutine outside the bubble. We can impose limitations on what you can do inside a synctest bubble, but "don't call reflect.TypeOf" is too much of a limitation.

There are three types of resource a bubbled goroutine can durably block on: Mutexes (sync.Mutex, sync.RWMutex), condition variables (sync.Cond), and channels. In all cases, a bubbled goroutine can block waiting for some other goroutine in the bubble, or block waiting for some global resource held by a goroutine outside the bubble.

For a bubbled goroutine blocked on any of these resources, we can

  • always say that it is durably blocked;
  • always say that it is not durably blocked;
  • or make a decision based on the state of the resource it is blocked on.

In the last case, we distinguish between a resource (mutex, channel, cond) "in the bubble" and one "out of the bubble". (For efficiency and implementation simplicity, this probably takes the form of a boolean "in some bubble" state, rather than tracking the actual bubble, but that's an implementation detail.)

A goroutine blocked on a mutex is always blocked because some goroutine acquired the mutex and has not released it. We could move a mutex into a bubble when a bubbled goroutine locks it, and move it out when it is unlocked. (This is demonstrated in https://go.dev/cl/613515.)

This trick doesn't work for channels and conds. However, channels and conds are created with a constructor, so we could mark them as bubbled or non-bubbled at creation time.

If resources have a bubbled/non-bubbled state, then there are several scenarios to consider:

  • A bubbled goroutine acts on a bubbled resource: If the action blocks, the goroutine is durably idle.
  • A bubbled goroutine acts on a non-bubbled resource: Fine, but if the action blocks, the goroutine is not durably idle.
  • A non-bubbled goroutine acts on a non-bubbled resource: Business as usual.
  • A non-bubbled goroutine acts on a bubbled resource: Something has gone wrong, and we should panic.

This last case is the one where panicking can (and I think must) occur. If a bubbled goroutine has been marked durably idle, it should not be woken by some event outside the bubble--the entire notion of "durably idle" is that the goroutine is waiting only for events produced within the bubble. If we can mark channels as being bubbled, then it is an error for a bubbled channel to be operated on from outside the bubble, since a "bubbled channel" is specifically a channel that isn't supposed to escape its bubble.
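The four scenarios listed above form a small decision table. A minimal sketch, with hypothetical names (outcome, classify) that are not part of any proposed API:

```go
package main

import "fmt"

// outcome describes what happens when a goroutine acts on a resource.
type outcome string

const (
	durablyIdle outcome = "durably idle if the action blocks"
	notIdle     outcome = "allowed; not durably idle if the action blocks"
	normal      outcome = "business as usual"
	mustPanic   outcome = "panic"
)

// classify maps (goroutine bubbled?, resource bubbled?) to an outcome,
// following the four scenarios in the list.
func classify(goroutineBubbled, resourceBubbled bool) outcome {
	switch {
	case goroutineBubbled && resourceBubbled:
		return durablyIdle
	case goroutineBubbled:
		return notIdle
	case !resourceBubbled:
		return normal
	default: // non-bubbled goroutine, bubbled resource
		return mustPanic
	}
}

func main() {
	for _, g := range []bool{true, false} {
		for _, r := range []bool{true, false} {
			fmt.Printf("goroutine bubbled=%v, resource bubbled=%v: %s\n", g, r, classify(g, r))
		}
	}
}
```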


That's a lot of theory. I think the practical choices available to us are:

  • Mutexes:
    1. Bubbled goroutines blocked on a mutex are not idle.

      Pro: Simple. Allows synchronization between inside and outside a bubble.
      Con: A bubble doesn't become idle if a goroutine is durably blocked on a mutex.

    2. Bubbled goroutines blocked on a mutex are idle if and only if the mutex was locked by a bubbled goroutine.

      Pro: Allows synchronization between inside and outside a bubble. Mutexes inside a bubble just work.
      Con: A bit more complexity in Mutex. See https://go.dev/cl/613515.

  • Channels:
    1. Bubbled goroutines blocked on a channel are always idle.

      Pro: Simple. Channels are used less often than mutexes for global synchronization.
      Con: Does not allow synchronization between inside and outside a bubble.

    2. Channels are marked as bubbled or non-bubbled. We panic if a non-bubbled goroutine accesses a bubbled channel.

      Pro: Allows synchronization between inside and outside a bubble. Everything pretty much just works.
      Con: If a global synchronization channel is lazily constructed from within a bubble, the next non-bubbled goroutine to access it panics.

  • sync.Cond:
    1. Bubbled goroutines blocked on sync.Cond.Wait are always idle.

      Pro: Simple. Does anybody ever use a sync.Cond for global synchronization?
      Con: Does not allow synchronization between inside and outside a bubble.

    2. Do whatever we do for channels.

      Pro: Consistent.
      Con: If we mark channels as bubbled/non-bubbled, a bit more code in sync.Cond.

The simplest set of choices would be option 1 from all the above: blocking on a mutex is not idle, blocking on a chan or cond is always idle. That might be good enough, but it means synctest bubbles must not access global channels. That might be fine. (We can fix crypto/internal/randutil to not use a chan.)

@neild
Contributor Author

neild commented Sep 27, 2024

We had a VC discussion about how to progress. To summarize my understanding of our conclusions:

For the moment, we're going to go with:

  • Channels are marked as bubbled when created in a bubble. Bubbled goroutines are idle when blocked on bubbled channels. A non-bubbled goroutine acting on a bubbled channel panics.
  • Bubbled goroutines are not idle when blocked on a mutex.
  • Bubbled goroutines are idle when blocked on sync.Cond.Wait.

This is the approach currently implemented in https://go.dev/cl/613515.
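Restated as code, the three rules above amount to the following predicate for a bubbled goroutine that is blocked. This is only a summary sketch; blockedOn and isIdle are hypothetical names, and the real decision is made inside the runtime.

```go
package main

import "fmt"

// blockedOn identifies the kind of primitive a bubbled goroutine is blocked on.
type blockedOn int

const (
	onChannel blockedOn = iota
	onMutex
	onCond
)

// isIdle reports whether a blocked bubbled goroutine counts as idle.
// channelBubbled only matters when kind == onChannel.
func isIdle(kind blockedOn, channelBubbled bool) bool {
	switch kind {
	case onChannel:
		return channelBubbled // idle only on channels created in a bubble
	case onCond:
		return true // sync.Cond.Wait is always idle
	default:
		return false // blocked on a mutex: never idle
	}
}

func main() {
	fmt.Println(isIdle(onChannel, true), isIdle(onChannel, false), isIdle(onMutex, false), isIdle(onCond, false))
}
```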

The rationale for these choices is based on what we're least likely to regret:

  • If we don't make it a panic for a non-bubbled goroutine to act on a bubbled channel now, we probably can't add it in the future. We can remove the panic if we decide it's a bad idea, though.
  • If we implement the more complex mutex behavior (idle only when blocking on a mutex acquired by another bubbled goroutine), we probably can't remove it in the future. We can add it later if we decide we want it, though.
  • sync.Cond generally doesn't get used for global resource locks, so we can do the simple thing now and decide to be more complicated later if necessary.

We will initially add the package with a GOEXPERIMENT=synctest guard to give people a chance to try out the API before we commit to it. (The package will only exist when using Go compiled with GOEXPERIMENT=synctest.)

I will also pick out a few existing third-party modules that use fake clocks in tests, and try rewriting their tests to use synctest instead to provide some more examples of whether it provides any significant benefit compared to existing approaches to testing.

@aclements
Copy link
Member

@neild , just for logistics, would you mind creating a new mini proposal issue for landing this with your stated semantics as a GOEXPERIMENT?

@neild
Contributor Author

neild commented Sep 27, 2024

Done: #69687

@neild
Contributor Author

neild commented Oct 17, 2024

I've encountered an unexpected problem caused by the new channel semantics (specifically: sending to a bubbled channel from outside the bubble panics).

I have a net/http test which (highly simplified) does something like this:

func Test(t *testing.T) {
  synctest.Run(func() {
    ch := make(chan struct{})
    t.Cleanup(func() {
      close(ch)
    })
    // ...
  })
}

(The actual test contains many more layers, of course, and the channel creation and t.Cleanup happen a couple levels of object initialization down.)

Since the t.Cleanup func executes after synctest.Run returns, it is executed outside the bubble and the close(ch) operation panics.

One obvious solution here is to say that I shouldn't do this. I can arrange for cleanup to happen inside the bubble. This is a bit unfortunate, however, since t.Cleanup really is tremendously useful and it would be a shame to take that tool off the shelf.
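The ordering problem can be seen without synctest. In the sketch below, runGroup is a hypothetical stand-in for synctest.Run that only runs f in a new goroutine and waits for it (no bubble, no fake clock), and the cleanups slice stands in for t.Cleanup registrations. It shows that a deferred cleanup runs before the group returns, while a registered cleanup runs after.

```go
package main

import (
	"fmt"
	"sync"
)

// runGroup is a hypothetical stand-in for synctest.Run: it runs f in a
// new goroutine and waits for it. It creates no bubble and no fake clock.
func runGroup(f func()) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		f()
	}()
	wg.Wait()
}

// scenario records the order in which the two kinds of cleanup run.
func scenario() []string {
	var events []string
	var cleanups []func() // stand-in for functions registered with t.Cleanup

	runGroup(func() {
		ch := make(chan struct{})
		// Registered cleanup: runs only after runGroup has returned.
		cleanups = append(cleanups, func() {
			events = append(events, "registered cleanup")
		})
		// Deferred cleanup: runs before f returns, so still inside the group.
		defer func() {
			close(ch)
			events = append(events, "deferred cleanup")
		}()
	})
	events = append(events, "run returned")

	// Like t.Cleanup, run registered functions last-in, first-out.
	for i := len(cleanups) - 1; i >= 0; i-- {
		cleanups[i]()
	}
	return events
}

func main() {
	fmt.Println(scenario())
}
```

With a real bubble, closing ch in the deferred function would happen inside the bubble, whereas closing it in the registered cleanup is the case that panics.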

Or perhaps this indicates that panicking when operating on a bubbled channel from outside the bubble is a mistake. I'm not convinced, though: Waking up an "idle" bubble from outside the bubble is never correct. On the other hand, in this particular case nothing in the bubble is being woken--the bubble is already gone.

Another possibility is to add API to the testing package (as proposed by #67434 (comment) and others) to run a subtest in a bubble:

package testing

// RunIsolated runs f as a subtest of t, in a bubble.
// Cleanup funcs run in the bubble.
func (t *testing.T) RunIsolated(name string, f func())

This could either be in addition to the proposed synctest package, or a complete replacement of it: synctest.Run becomes testing.T.RunIsolated, and synctest.Wait becomes (perhaps) testing.T.Wait.

I am currently leaning towards saying that this case indicates that, while I'm reluctant to propose increasing the testing package's API surface, there needs to be a version of T.Run that starts a bubble and arranges for cleanup to happen within it. Probably on B and F as well for consistency, although I'm very dubious that any benchmark using a bubble will produce useful results.

@Merovius
Contributor

@neild why not package synctest; func Run(t *testing.T, name string, f func(*testing.T))? That is, why does it have to be a method on *testing.T, instead of being a top-level function accepting a *testing.T? That would not increase the testing API, while still allowing it to be a subtest with its own cleanup.

@neild
Contributor Author

neild commented Oct 18, 2024

@Merovius That's an interesting idea.

Having a version of t.Run that isn't in the testing package seems a bit strange to me, but perhaps that would be okay.

Implementation would be messy. The only way I see to do it is to implement the Run function in package testing and linkname it over to testing/synctest. The problem is that Run(t *testing.T, name string, f func(*testing.T)) needs to run f and the cleanup funcs registered by f within a bubble, but needs to run the surrounding test infrastructure (which accounts for test timeouts, among other things) outside the bubble--it can't just call testing.T.Run, because either it calls it outside a bubble (and cleanup funcs run outside the bubble) or inside a bubble (and the testing package's timing mechanisms run with the fake clock).

I think the messiness of implementation indicates that this isn't feasible. Maybe I'm missing a good way to do it.

@neild
Contributor Author

neild commented Oct 18, 2024

I'm going to propose two options:

Option 1: Keep testing/synctest, add a method to testing.TB.

We keep all of the existing proposal unchanged. We add one method in the testing package to T, B, F, and TB:

package testing

// RunIsolated runs f as a subtest of t called name.
// It runs f in a synctest bubble (as if called by [synctest.Run]).
// Cleanup functions run in the bubble.
// RunIsolated otherwise behaves like [t.Run].
func (t *T) RunIsolated(name string, f func()) bool

This option has the advantage of keeping all the complicated synctest documentation in its own package.

Option 2: Move the API to the testing package.

We drop the testing/synctest package and add the following to the testing package:

package testing

// RunIsolated runs f as a subtest of t called name.
// Run reports whether f succeeded (or at least did not fail before calling t.Parallel).
//
// Run runs f in a new goroutine.
// This goroutine and any goroutines transitively started by it form
// an isolated "bubble".
// Run waits for all goroutines in the bubble to exit, or for f to call t.Parallel.
//
// Goroutines in the bubble use a fake time implementation.
// The initial time is midnight UTC 2000-01-01.
//
// A goroutine in the bubble is idle if it is blocked on:
//   - a send or receive on a channel created within the bubble
//   - a select statement where every case is a channel within the bubble
//   - sync.Cond.Wait
//   - time.Sleep
//
// The above cases are the only times a goroutine is idle.
// In particular, a goroutine is NOT idle when blocked on:
//   - system calls
//   - cgo calls
//   - I/O, such as reading from a network connection with no data
//   - sync.Mutex.Lock or sync.RWMutex.RLock
//
// Time advances when every goroutine in the bubble is idle.
// For example, a call to time.Sleep will block until all goroutines
// are idle, and return after the bubble's clock has advanced
// by the sleep duration.
//
// If every goroutine is idle and there are no timers scheduled,
// Run panics.
//
// Channels, time.Timers, and time.Tickers created within the bubble
// are associated with it. Operating on a bubbled channel, timer, or ticker
// from outside the bubble panics.
func Run(f func(*Sync))

// A Sync is used by RunIsolated for running isolated subtests.
type Sync

// Wait blocks until every goroutine within the current bubble,
// other than the current goroutine, is idle.
func (s *Sync) Wait()

This option has the advantage of not adding a synctest.Run that, presumably, just about nobody will ever use (because testing.T.RunIsolated is the preferable choice), but it adds a big chunk of documentation to the testing package.

Of the two, I lean towards option 2, but only slightly.

@ianthehat

Would it be possible to unbubble channels that are owned by a bubble when the bubble goes away, such that operating on them no longer panics after the bubble has ended? It might be too expensive to implement, but I think it would logically solve this case in a clean way.

@neild
Contributor Author

neild commented Oct 18, 2024

Unbubbling channels after the bubble ends is an interesting idea. It would fix the panic in the case that I have. It wouldn't help a test that starts a background goroutine and signals that goroutine to exit in a t.Cleanup. (Think: Start a server that listens for connections in a separate goroutine, stop it in cleanup.)

Implementation would be a bit tricky, since we need to store an association between channels and bubbles. The obvious and easy way is to put a pointer to the bubble on the channel, but that grows the size of a channel by a word; too expensive for a test-only feature. Alternatively, the bubble could have a set of weak references to bubbled channels, or we could have a global weak map of channel-to-bubble.

@neild
Contributor Author

neild commented Oct 22, 2024

This is a report on attempting to apply the proposed synctest API to existing, real-world code.

etcd is a popular distributed key-value store. "github.com/jonboulle/clockwork" is an also-popular fake time package. The etcd repository (https://github.com/etcd-io/etcd) contains a number of Go modules, some of which use the clockwork package for testing.

In this experiment, I rewrote the tests for package "go.etcd.io/etcd/server/v3/etcdserver/api/v3compactor" to use the proposed synctest API.

To give away the ending: The synctest package was able to replace the fake clock, simplifying the system under test (SUT). The synctest package was also able to replace some complex manual synchronization between test and SUT, leading to a simpler, more robust test.

In the following, I will go over one specific test in detail, explaining its behavior before and after my changes. The test is TestPeriodicPause: https://github.com/etcd-io/etcd/blob/ac3d5d77ea5fdbc12ef07a6f6fe1722f06d75b24/server/etcdserver/api/v3compactor/periodic_test.go#L132-L175

System Under Test

The system under test (SUT) is a Periodic. A Periodic operates on a RevGetter and a Compactable:

type RevGetter interface {
	Rev() int64
}

type Compactable interface {
	Compact(ctx context.Context, r *pb.CompactionRequest) (*pb.CompactionResponse, error)
}

A Periodic maintains a background goroutine which periodically polls RevGetter.Rev, and calls Compactable.Compact when certain conditions are met. The internal details of RevGetter and Compactable are not important to the test, which uses a fake implementation of both.

This is a straightforward system: The inputs are time and the revisions, and the output is a series of Compact calls.

Test Infrastructure

TestPeriodicPause uses a testutil.Recorder to synchronize and monitor the SUT.

type Action struct {
	Name   string
	Params []interface{}
}

type Recorder interface {
	// Record publishes an Action (e.g., function call) which will
	// be reflected by Wait() or Chan()
	Record(a Action)
	// Wait waits until at least n Actions are available or returns with error
	Wait(n int) ([]Action, error)
	// Action returns immediately available Actions
	Action() []Action
	// Chan returns the channel for actions published by Record
	Chan() <-chan Action
}

A Recorder records a sequence of Actions (events) performed by the SUT. The fake implementations of RevGetter and Compactable record actions for each Rev/Compact call.

The implementation of the Recorder interface is interesting and relevant to us. The testutil package contains two implementations of Recorder.

A RecorderBuffered records each action to an internal slice of unbounded length. Record calls do not block. Wait calls attempt to wait for all pending record calls to finish before returning:

func (r *RecorderBuffered) Wait(n int) (acts []Action, err error) {
	// legacy racey behavior
	WaitSchedule()
	// ...
}

// WaitSchedule briefly sleeps in order to invoke the go scheduler.
// TODO: improve this when we are able to know the schedule or status of target go-routine.
func WaitSchedule() {
	time.Sleep(10 * time.Millisecond)
}

Note the comment about "legacy racey behavior", and the WaitSchedule function.

A recorderStream, in contrast, records each action to an unbuffered channel. Record calls block until a Wait call consumes the action. A recorderStream is created with a timeout, where a timeout of 0 indicates no timeout. Wait(n) waits up to the timeout (or indefinitely when timeout==0) or until n actions are received.
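A simplified sketch of the blocking behavior just described, for readers who want to see the handshake. This is not etcd's recorderStream: the names streamRecorder, action, and errTimeout are hypothetical, and applying the timeout to the whole wait call (rather than per action) is an assumption made here for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// action is a minimal stand-in for testutil.Action.
type action struct{ name string }

// streamRecorder is a simplified sketch, not etcd's recorderStream.
type streamRecorder struct {
	ch      chan action   // unbuffered: record blocks until wait receives
	timeout time.Duration // 0 means wait indefinitely
}

func newStreamRecorder(timeout time.Duration) *streamRecorder {
	return &streamRecorder{ch: make(chan action), timeout: timeout}
}

// record blocks until a wait call consumes the action.
func (r *streamRecorder) record(a action) { r.ch <- a }

var errTimeout = errors.New("timed out waiting for actions")

// wait collects n actions, giving up after the timeout if one is set.
func (r *streamRecorder) wait(n int) ([]action, error) {
	var expired <-chan time.Time // a nil channel is never ready
	if r.timeout > 0 {
		expired = time.After(r.timeout)
	}
	acts := make([]action, 0, n)
	for len(acts) < n {
		select {
		case a := <-r.ch:
			acts = append(acts, a)
		case <-expired:
			return acts, errTimeout
		}
	}
	return acts, nil
}

func main() {
	r := newStreamRecorder(0)
	go r.record(action{name: "rev"}) // the "SUT" blocks here until wait runs
	acts, err := r.wait(1)
	fmt.Println(len(acts), err)
}
```

Because the channel is unbuffered, each record call is a synchronization point between the recording goroutine and the test.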

TestPeriodicPause uses blocking recorderStreams to synchronize with the SUT:

rg := &fakeRevGetter{testutil.NewRecorderStreamWithWaitTimout(0), 0}
compactable := &fakeCompactable{testutil.NewRecorderStreamWithWaitTimout(10 * time.Millisecond)}

Note that the fakeRevGetter Recorder is created with no timeout, while the fakeCompactable Recorder has a 10ms timeout. The test uses the fakeRevGetter's Recorder to synchronize with the Periodic's background goroutine. For example, when advancing over an interval of time, the test reads actions from the RevGetter:

func TestPeriodicPause(t *testing.T) {
	fc := clockwork.NewFakeClock()
	// ...

	// tb will collect 3 hours of revisions but not compact since paused
	for i := 0; i < n*3; i++ {
		waitOneAction(t, rg)
		fc.Advance(tb.getRetryInterval())
	}

	// ...
}

func waitOneAction(t *testing.T, r testutil.Recorder) {
	if actions, _ := r.Wait(1); len(actions) != 1 {
		t.Errorf("expect 1 action, got %v instead", len(actions))
	}
}

In each iteration of the loop above:

  • The SUT polls the RevGetter for the current revision. This call blocks while recording an action.
  • The test consumes the RevGetter action, unblocking the SUT.
  • The fake RevGetter automatically increments the current revision.
  • The test advances the fake clock.

This is a complicated dance. Every call of RevGetter.Rev by the SUT must be paired with a Recorder.Wait call by the test. It would be fairly easy to desynchronize the SUT and the test, with confusing results. However, the use of the Recorder to create synchronization points between the SUT and test allows tests to create carefully orchestrated scenarios with a fake clock.

Using synctest

Using the synctest package, we can eliminate much of this test's infrastructure, and simplify what remains.

I made the following changes:

  • Use the time package in the SUT, rather than a testable wrapper.
  • Remove the use of testutil.Recorder entirely.
  • Remove the automatic incrementing of revisions in the fake RevGetter. Previously, every RevGetter.Rev call by the SUT was paired with a Recorder.Wait call by the test. Each Wait corresponded to an increment in the revision. We don't need the Wait calls for synchronization any more, but it seems useful to keep the changes in revision number explicit in the test.
  • Change the fake Compactable to record one compaction event. It reports an error if an unexpected compaction occurs. This is an improvement over the Recorder-based design, which could ignore unexpected compactions.
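The last change can be sketched roughly as follows. This is an illustration only, not the code from the linked commit: CompactionRequest stands in for pb.CompactionRequest, the reporter interface stands in for *testing.T, and the exact behavior of Compact and Want is assumed.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
)

// CompactionRequest stands in for pb.CompactionRequest.
type CompactionRequest struct{ Revision int64 }

// reporter is the subset of *testing.T this sketch needs.
type reporter interface {
	Errorf(format string, args ...any)
}

// fakeCompactable remembers at most one compaction at a time.
type fakeCompactable struct {
	t   reporter
	mu  sync.Mutex
	got *CompactionRequest
}

// Compact records req, or reports an error if an earlier compaction
// has not yet been consumed by Want.
func (c *fakeCompactable) Compact(req *CompactionRequest) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.got != nil {
		c.t.Errorf("unexpected compaction: %v", req)
		return
	}
	c.got = req
}

// Want compares the recorded compaction with want (nil means "none")
// and clears it, so the next Compact call is expected again.
func (c *fakeCompactable) Want(want *CompactionRequest) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !reflect.DeepEqual(c.got, want) {
		c.t.Errorf("compaction = %v, want %v", c.got, want)
	}
	c.got = nil
}

// collectReporter gathers error messages so the sketch runs outside go test.
type collectReporter struct{ errs []string }

func (r *collectReporter) Errorf(format string, args ...any) {
	r.errs = append(r.errs, fmt.Sprintf(format, args...))
}

func main() {
	rep := &collectReporter{}
	c := &fakeCompactable{t: rep}
	c.Want(nil)                                // nothing compacted yet: no error
	c.Compact(&CompactionRequest{Revision: 5}) // recorded
	c.Want(&CompactionRequest{Revision: 5})    // matches: no error
	c.Compact(&CompactionRequest{Revision: 6}) // recorded
	c.Compact(&CompactionRequest{Revision: 7}) // not consumed yet: reported
	fmt.Println("errors reported:", len(rep.errs))
}
```

In a real test the SUT calls Compact from its background goroutine, so the test would need to let that goroutine settle (for example with synctest.Wait) before calling Want.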

The rewritten test:

func TestPeriodicPause(t *testing.T) {
	synctest.Run(func() {
		testPeriodicPause(t)
	})
}

func testPeriodicPause(t *testing.T) {
	retentionDuration := time.Hour
	rg := &fakeRevGetter{rev: 1}
	compactable := newFakeCompactible(t)
	tb := newPeriodic(zaptest.NewLogger(t), retentionDuration, rg, compactable)
	defer tb.Stop()

	tb.Run()
	tb.Pause()

	n := tb.getRetentions()

	// tb will collect 3 hours of revisions but not compact since paused
	for i := 0; i < n*3; i++ {
		rg.IncRev()
		time.Sleep(tb.getRetryInterval())
	}

	compactable.Want(nil) // no compaction

	// tb resumes to being blocked on the clock
	// will kick off a compaction at T=3h6m by retry
	tb.Resume()
	time.Sleep(tb.getRetryInterval())
	compactable.Want(&pb.CompactionRequest{Revision: int64(1 + 2*n + 1)})
}

And, for contrast, the original:

func TestPeriodicPause(t *testing.T) {
	fc := clockwork.NewFakeClock()
	retentionDuration := time.Hour
	rg := &fakeRevGetter{testutil.NewRecorderStreamWithWaitTimout(0), 0}
	compactable := &fakeCompactable{testutil.NewRecorderStreamWithWaitTimout(10 * time.Millisecond)}
	tb := newPeriodic(zaptest.NewLogger(t), fc, retentionDuration, rg, compactable)

	tb.Run()
	tb.Pause()

	n := tb.getRetentions()

	// tb will collect 3 hours of revisions but not compact since paused
	for i := 0; i < n*3; i++ {
		waitOneAction(t, rg)
		fc.Advance(tb.getRetryInterval())
	}
	// t.revs = [21 22 23 24 25 26 27 28 29 30]

	select {
	case a := <-compactable.Chan():
		t.Fatalf("unexpected action %v", a)
	case <-time.After(10 * time.Millisecond):
	}

	// tb resumes to being blocked on the clock
	tb.Resume()
	waitOneAction(t, rg)

	// unblock clock, will kick off a compaction at T=3h6m by retry
	fc.Advance(tb.getRetryInterval())

	// T=3h6m
	a, err := compactable.Wait(1)
	if err != nil {
		t.Fatal(err)
	}

	// compact the revision from hour 2:06
	wreq := &pb.CompactionRequest{Revision: int64(1 + 2*n + 1)}
	if !reflect.DeepEqual(a[0].Params[0], wreq) {
		t.Errorf("compact request = %v, want %v", a[0].Params[0], wreq.Revision)
	}
}

The complete code is at:
neild/etcd@57e8a4d

What went well

The synctest package was effective at providing synchronization and fake time for this test.

The synctest version of the test is slightly shorter than the original, although some of that reduction in size is thanks to moving some functionality to the fakeCompactable type.

The synctest version of the test is, I believe, easier to modify: The original depends on precisely pairing every recorded action in the SUT with a Wait call in the test.

The original version of the test contains 10ms waits in various places, waiting for the SUT to stabilize. The synctest version just waits for the SUT to stabilize.

What went less well

The synctest.Run call indents all the test code by a level. Not really a big concern.

I forgot to follow a time.Sleep call with a synctest.Wait a couple times.

One test (TestRevisionPause) left a background goroutine running. This produces confusing results: The test hangs after executing, because synctest.Run keeps advancing the fake clock into the future and restarting the background goroutine. The fix was simple (stop the goroutine before finishing the test), but identifying the problem was a bit difficult.

These last two points make me wonder if it would be clearer to have a function that explicitly advances the fake clock. Or perhaps we should stop advancing the fake clock when the root goroutine started by synctest.Run returns.
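For reference, the "stop the goroutine before finishing the test" fix follows a common pattern: signal the goroutine, then wait until it has actually exited. The sketch below is a generic illustration rather than the etcd code; poller, startPoller, and Stop are hypothetical names, and it runs against the real clock.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// poller is a hypothetical background worker that wakes up on a ticker.
type poller struct {
	done    chan struct{} // closed by Stop to request shutdown
	stopped chan struct{} // closed by the goroutine when it exits
	polls   atomic.Int64
}

func startPoller(interval time.Duration) *poller {
	p := &poller{done: make(chan struct{}), stopped: make(chan struct{})}
	go func() {
		defer close(p.stopped)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				p.polls.Add(1)
			case <-p.done:
				return
			}
		}
	}()
	return p
}

// Stop asks the goroutine to exit and waits until it has done so.
// It must be called exactly once.
func (p *poller) Stop() {
	close(p.done)
	<-p.stopped
}

func main() {
	p := startPoller(time.Millisecond)
	time.Sleep(20 * time.Millisecond)
	p.Stop() // returns only after the goroutine has exited
	fmt.Println("polled at least once:", p.polls.Load() > 0)
}
```

Inside synctest.Run, a ticker goroutine like this that is never stopped always has a pending timer, so the fake clock can keep advancing and Run never returns. Deferring the stop call, as the rewritten test does with tb.Stop(), avoids that.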

@nightlyone
Contributor

Since a common error condition in tests is to still have goroutines running when one actually doesn't want to, identifying and noting goroutines that were started but have not exited would contribute to the isolation provided by this test feature. But that is probably its own proposal.

When implementing synctest.Run as testing.T.RunIsolated, it would be great to keep that door open.

@cherrymui
Member

If we want to keep synctest.Run and also have t.Cleanup work as expected, perhaps we can have synctest.Run take a func(*testing.T), and we pass a wrapped t that runs t.Cleanup functions in the same bubble? The wrapping may need to access some internals of the testing package (I haven't looked into the detail), but we probably could make that happen. If we do this, it probably becomes synctest.Run(*testing.T, func(*testing.T)), where it takes the original t, and passes the wrapped one to the closure.

@neild
Contributor Author

neild commented Oct 23, 2024

We can have synctest.Run(*testing.T, func(*testing.T)), but it'll probably need to do some internal linkname shenanigans to cooperate with the testing package. I think we'd probably implement the function in the testing package (it'll be simple, just a few lines long), and then use linkname to make it visible in synctest.

I'm fine with that, but it might be simpler to just say that if Run uses a *testing.T it belongs in the testing package.

@aclements
Member

Personally I like the unbubbling idea, perhaps because I'm already a bit skeptical of the idea of bubbling in general. You could do it without increasing the size of chan by keeping a side table of bubbled channels.

> One test (TestRevisionPause) left a background goroutine running.

It sounds like this should be easy to debug from a stack dump, no?

> Or perhaps we should stop advancing the fake clock when the root goroutine started by synctest.Run returns.

Does that mean that in a situation like the background goroutine, sleeps would block the goroutine forever?

@neild
Contributor Author

neild commented Oct 23, 2024

> It sounds like this should be easy to debug from a stack dump, no?

This is easy to debug from a stack dump--it's just briefly confusing, because the failure mode is for the test to hang indefinitely rather than giving you an immediate stack dump. Once I ran the test with -timeout=2s the stacks made it clear what had happened.

> Does that mean that in a situation like the background goroutine, sleeps would block the goroutine forever?

In the case with the background goroutine, the main test goroutine started by Run would return, and Run would then panic complaining that all remaining goroutines are blocked. (As opposed to advancing the fake clock, which is what it currently does.)

I think we should do this; I'm not sure there's a use case for continuing to run the fake clock after the main test goroutine has returned, and we can change our minds later if we want.

@neild
Contributor Author

neild commented Oct 24, 2024

> Personally I like the unbubbling idea, perhaps because I'm already a bit skeptical of the idea of bubbling in general. You could do it without increasing the size of chan by keeping a side table of bubbled channels.

I've been thinking about this, and I think unbubbling channels at the end of the bubble isn't the right choice.

The problem is that it's reasonable for a test to want to shut down background goroutines in a cleanup function. For example, a test may start a server listening on a fake network socket (with a background goroutine blocked in net.Listener.Accept) and stop the server in a cleanup function. If the cleanup function runs after the bubble exits, then it runs too late: a bubble never exits cleanly while any bubbled goroutines are still executing.

Cleanup functions registered in a bubble should execute in the bubble.

This is independent of the question of whether bubbling channels is a good idea or not--even if we don't associate channels with bubbles, we still want to be able to shut down a test completely before Run exits and its bubble ends.
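One way to picture "cleanup functions registered in a bubble execute in the bubble" is a wrapper that collects cleanups and runs them before the bubble's root function returns. This is only a sketch under stated assumptions: bubbleT and runInBubble are hypothetical, the run parameter is a placeholder for synctest.Run, and none of this reflects how the testing package would actually implement it.

```go
package main

import "fmt"

// bubbleT is a hypothetical wrapper around a test handle.
// Only Cleanup is modelled here.
type bubbleT struct {
	cleanups []func()
}

// Cleanup registers f to run before the bubble exits.
func (b *bubbleT) Cleanup(f func()) { b.cleanups = append(b.cleanups, f) }

// runCleanups runs registered functions in last-added, first-called
// order, matching the documented order of testing.T.Cleanup.
func (b *bubbleT) runCleanups() {
	for len(b.cleanups) > 0 {
		last := len(b.cleanups) - 1
		f := b.cleanups[last]
		b.cleanups = b.cleanups[:last]
		f()
	}
}

// runInBubble calls f inside run, where run is a placeholder for
// synctest.Run. The cleanups run before the root function returns,
// which means they still execute inside the bubble.
func runInBubble(run func(func()), f func(*bubbleT)) {
	run(func() {
		b := &bubbleT{}
		defer b.runCleanups()
		f(b)
	})
}

func main() {
	var order []string
	direct := func(f func()) { f() } // placeholder for synctest.Run
	runInBubble(direct, func(b *bubbleT) {
		b.Cleanup(func() { order = append(order, "stop server") })
		b.Cleanup(func() { order = append(order, "close client") })
		order = append(order, "test body")
	})
	fmt.Println(order)
}
```

With this shape, a cleanup that stops a server goroutine runs while the bubble is still active, so the bubble can then exit with no bubbled goroutines left.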

@aclements
Member

Putting on hold for experience with synctest under a GOEXPERIMENT (#69687). Discussion can of course continue, but this way we'll hold off on looking at this each week until there's more experience with it.

@aclements moved this from Active to Hold in Proposals Nov 13, 2024
@aclements
Member

Placed on hold.
