Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add features tests for gRPC retry behavior #435

Closed
wants to merge 13 commits into from

Conversation

chronos-tachyon
Copy link

@chronos-tachyon chronos-tachyon commented Mar 8, 2024

What was changed

  • Implement a small gRPC proxy in Go that can freeze or fail requests on demand
  • Extend features runner to optionally launch the proxy, if the executable can be found
  • Add proxy-based retry tests for Go and Java
  • Minor style cleanups / refactoring of the features runner

Why?

We need tests that verify the client's retry behavior. A comment on features#326 says that @rross reports that sdk-java doesn't retry correctly on the first connection. Indeed, these new tests confirmed that a bug exists in sdk-java's GrpcRetryer, as the serverCapabilities.get() call can throw in a place that the code doesn't anticipate exceptions.

From here, there is some future work to be done to make these tests more generally useful:

  1. CI needs to be modified, to build the proxy executable before running the features tests.
  2. Test implementations for C#, TypeScript, and Python should be written.

Copy link
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally makes sense to me and thanks for some of the various cleanup that's in here. My only question is around the need for this proxy to be its own binary/process. We could save a little argparsing and CI code if it's not.

@@ -154,6 +204,22 @@ func (r *Runner) Run(ctx context.Context, patterns []string) error {
return err
}

var fn func(context.Context, *cmd.Run) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var fn func(context.Context, *cmd.Run) error
var langRunFn func(context.Context, *cmd.Run) error

@@ -0,0 +1,752 @@
package main
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not immediately obvious to me why this guy needs to be his own executable? Seemingly the runner could start this up internally rather than needing to start it as a subprocess?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concur. @chronos-tachyon - take a peek at what I recently had to do with HTTP proxying: #448 (ignore the subprocess stuff which was because I had to test environment variables safely and unrelated to the HTTP proxy part). Some notes:

  • Can have a feature's config opt-in to the proxy so if someone runs a specific feature without needing a proxy it's not started
  • Can name it "tcp proxy" to disambiguate from http proxy (note, I called the other "http proxy" to avoid this confusion too)
  • Can make a TCPProxy utility as a Go library that the command just happens to use. This can help properly encapsulate the tool (even though it'll only ever be used by the cmd) instead of using globals and subprocesses and such
  • Features know if they need a proxy, so they can choose to, say, dial to that address instead of the default one. No need to affect existing running features by making them go through the proxy.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't do it in-process because, when I did, there were too many CPU-bound goroutines to allow the proxy server's goroutines to get serviced in a timely manner (delays on the order of 20 seconds).

Copy link
Member

@cretz cretz Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is very surprising. We test the HTTP proxy embedded in the main process (not the per-lang worker process) without any such delays. I wonder what is causing such delays. We shouldn't really have many CPU-bound goroutines here except for the case of running Go features embedded (every other feature type of run uses external processes to run the features). Also, can we consider only using the TCP proxy for features that need it (i.e. they can connect different clients to the proxy inside their tests instead of using default client)?

Comment on lines +44 to +54
- POST /reject
Immediately reject incoming gRPC requests with UNAVAILABLE.

- POST /accept
Accept incoming gRPC requests; this is the default.

- POST /freeze
Block on incoming accepted gRPC requests.

- POST /thaw
Process incoming accepted gRPC requests immediately; this is the default.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume posting to these switches to the corresponding "mode"? Might be worth mentioning this modality concept explicitly.

}
}()

Info("HTTP control server is running on: %s", flagControl)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: If user passes 127.0.0.1:0 this would show 0 rather than the bound port, yes?

Not really a problem since we're predetermining available ports and using those, but thought I'd mention it since this can be run directly in theory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this is made a library as part of a runner, we should let this listener pick the random port and extract the address out of it instead temporary-listener-free-port approach (which is a good approach in general, but won't be needed when this is not a separate process).

Comment on lines +302 to +303
func HandleHelp(w http.ResponseWriter, r *http.Request) {
if path.Clean(r.URL.Path) != "/" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice touch

Comment on lines +59 to +60
var (
flagTrace bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit in the same vein as asking why this is its own binary, it might be nice to move all of these into some FeaturesProxyServer struct or something

Comment on lines +668 to +673
s.gc = nil
s.gs = nil
s.l = nil
s.wc = nil
s.ws = nil
s.quitCh = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it might be easy to miss any future fields here. I'm wondering if just having three top-level fields: The mutex, condvar, and an "everything else" container might be nice.

It could also potentially let you have some generic code for this pattern that the servers can share.

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All makes sense. High-level notes on the separate proxy process.

@@ -154,6 +204,22 @@ func (r *Runner) Run(ctx context.Context, patterns []string) error {
return err
}

var fn func(context.Context, *cmd.Run) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No disagreement with extracting switch case content into methods, but why move the switch up here? If it's all the same, can just leave switch where it is and make the actual invocations in each case

@@ -0,0 +1,752 @@
package main
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concur. @chronos-tachyon - take a peek at what I recently had to do with HTTP proxying: #448 (ignore the subprocess stuff which was because I had to test environment variables safely and unrelated to the HTTP proxy part). Some notes:

  • Can have a feature's config opt-in to the proxy so if someone runs a specific feature without needing a proxy it's not started
  • Can name it "tcp proxy" to disambiguate from http proxy (note, I called the other "http proxy" to avoid this confusion too)
  • Can make a TCPProxy utility as a Go library that the command just happens to use. This can help properly encapsulate the tool (even though it'll only ever be used by the cmd) instead of using globals and subprocesses and such
  • Features know if they need a proxy, so they can choose to, say, dial to that address instead of the default one. No need to affect existing running features by making them go through the proxy.

@@ -221,6 +287,49 @@ func (r *Runner) Run(ctx context.Context, patterns []string) error {
return err
}
}
r.config.DirectServer = r.config.Server
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the direct server should be the server and the proxy can be provided for features that know they need it (and the features can create clients to the proxy that know they need to test with the proxy)

func (s *ControlServer) serveThread() {
defer s.finish()

err := s.server.Serve(s.l)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big fan of using HTTP here. I wish when the "summary server" thing was done HTTP was used there instead of homemade Temporal binary protocol, heh.

}
}()

Info("HTTP control server is running on: %s", flagControl)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this is made a library as part of a runner, we should let this listener pick the random port and extract the address out of it instead temporary-listener-free-port approach (which is a good approach in general, but won't be needed when this is not a separate process).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants