Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client/v3 fix watch API blocked forever even with WithRequireLeader option #17622

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

346999452
Copy link

@346999452 346999452 commented Mar 21, 2024

Fixes #17611

@k8s-ci-robot
Copy link

Hi @346999452. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ivanvc
Copy link
Member

ivanvc commented Mar 21, 2024

Hi, @346999452, thanks for your PR. It seems like your commit wasn't properly signed off. The proper syntax is

Signed-off-by: Name <email>

Could you please amend your commit and push it again? so the developer certificate of origin (DCO) check passes, i.e:

git commit --amend --signoff # Fix the signed off trailer
git push --force

@346999452 346999452 force-pushed the fix-watch-api-blocked-forever branch from f70abbe to a921a9b Compare March 21, 2024 06:13
@346999452 346999452 force-pushed the fix-watch-api-blocked-forever branch from a921a9b to 5b22764 Compare March 21, 2024 06:35
@serathius
Copy link
Member

/ok-to-test

@serathius
Copy link
Member

Can you provide a test to verify that the PR fixes the issue #17611?

@346999452 346999452 closed this Mar 21, 2024
@346999452 346999452 reopened this Mar 21, 2024
@346999452
Copy link
Author

346999452 commented Mar 21, 2024

Can you provide a test to verify that the PR fixes the issue #17611?

The watch method is used as normal, it is difficult to reproduce the problem in real scenarios.
I simulated what I encountered and it could be easily reproduced:

The serveWatchClient method can be modified for easy reproduction

func (w *watchGRPCStream) serveWatchClient(wc pb.Watch_WatchClient) {
	for {
		resp, err := wc.Recv()
		if err != nil {
			select {
			case w.errc <- err:
			case <-w.donec:
			}
			return
		}
		if resp.Created {
			resp.Canceled = true
			resp.CancelReason = v3rpc.ErrGRPCPermissionDenied.Error()
		}
		select {
		case w.respc <- resp:
			if resp.Created && resp.Canceled && resp.CancelReason != "" {
				select {
				case w.errc <- errors.New("some err to test"):
				case <-w.donec:
				}
				return
			}
		case <-w.donec:
			return
		}
	}
}

then it will be blocked forever

@ivanvc
Copy link
Member

ivanvc commented Mar 21, 2024

@346999452, I think if you can provide (write) a unit or integration test for this, it would be the best.

@346999452
Copy link
Author

346999452 commented Mar 22, 2024

@346999452, I think if you can provide (write) a unit or integration test for this, it would be the best.

func TestWatch(t *testing.T) {
	var client, err = clientv3.New(clientv3.Config{
		Endpoints:            []string{"127.0.0.1:2379", "127.0.0.1:3379", "127.0.0.1:4379"},
		DialTimeout:          1e11,
		Username:             "hello",
		Password:             "world",
		MaxCallSendMsgSize:   10485760,
		MaxCallRecvMsgSize:   107374182400,
		DialKeepAliveTime:    3e10,
		DialKeepAliveTimeout: 1e10,
		DialOptions:          []grpc.DialOption{grpc.WithBlock()},
	})
	if err != nil {
		panic(err)
	}
	var watcher = clientv3.NewWatcher(client)
	t.Log("to watch")
	var (
		revision    int64
		resp        clientv3.WatchResponse
		watcherChan = watcher.Watch(clientv3.WithRequireLeader(context.Background()), "/hello/", clientv3.WithPrefix())
	)
	t.Log("watch success")
	for resp = range watcherChan {
		if resp.Err() != nil {
			t.Log(resp.Err())
			break
		}
		if resp.Header.Revision > revision {
			revision = resp.Header.Revision
		}
		t.Log("receive resp")
	}
	_ = watcher.Close()
	retry(t, client, revision)
}

func retry(t *testing.T, client *clientv3.Client, revision int64) {
	var watcher = clientv3.NewWatcher(client)
	t.Log("retry watch")
	var (
		resp        clientv3.WatchResponse
		watcherChan = watcher.Watch(clientv3.WithRequireLeader(context.Background()), "/hello/", clientv3.WithPrefix(), clientv3.WithRev(revision + 1))
	)
	t.Log("watch success")
	for resp = range watcherChan {
		if resp.Err() != nil {
			if errors.Is(resp.Err(), rpctypes.ErrCompacted) {
				revision = func() int64 {
					/*
						TODO: ...
					*/
					return 0
				}()
			}
			break
		}
		if resp.Header.Revision > revision {
			revision = resp.Header.Revision
		}
		t.Log("receive resp")
	}
	_ = watcher.Close()
	retry(t, client, revision)
}

There is nothing special about the unit testing, because the block is not caused by the way it is used, but by some extreme scenario
when the server is frequently faulty, a block may occur when the client rewatches

This code may not print 'watch success', or it may blocked while receiving from the watch channel

@liangyuanpeng
Copy link
Contributor

This will greatly increase the confidence of this PR if you can reproduce this problem in CI.

@ivanvc
Copy link
Member

ivanvc commented Mar 23, 2024

I agree. Writing the test in integration/clientv3/ would be better. Probably in integration/clientv3/watch_test.go.

@346999452 346999452 closed this Mar 25, 2024
@346999452 346999452 reopened this Mar 25, 2024
@jmhbnz
Copy link
Member

jmhbnz commented Mar 27, 2024

/retitle client/v3 fix watch API blocked forever even with WithRequireLeader option

@k8s-ci-robot k8s-ci-robot changed the title client/v3: fix issue https://github.com/etcd-io/etcd/issues/17611 client/v3 fix watch API blocked forever even with WithRequireLeader option Mar 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Watch API. blocked forever even with WithRequireLeader option
6 participants