feature: support etcd v3, by mocking v2 API #2036

Merged: 79 commits, Sep 16, 2020

Member

@Yiyiyimu Yiyiyimu commented Aug 10, 2020

What this PR does / why we need it:

A simpler way to support etcd v3 than #1943: mock both the requests and the responses of the v2 API.

#1767
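
As a rough illustration of what mocking a v2 response on top of v3 involves, here is a minimal Python sketch (the PR itself is Lua). It assumes the v3 JSON gateway returns base64-encoded keys and values with string revisions, as the logs later in this thread show, and it maps one record onto a v2-style node. The helper names `v3_kv_to_v2_node` and `v3_range_to_v2_get` are hypothetical and are not the functions added by this PR.

```python
import base64


def b64_text(encoded):
    """Decode a base64 string from the v3 JSON gateway into text."""
    return base64.b64decode(encoded).decode("utf-8")


def v3_kv_to_v2_node(kv):
    """Map one v3 key-value record onto a v2-style node (hypothetical helper)."""
    return {
        "key": b64_text(kv["key"]),
        "value": b64_text(kv["value"]),
        # v3 revisions arrive as JSON strings; v2 indexes are integers.
        "createdIndex": int(kv["create_revision"]),
        "modifiedIndex": int(kv["mod_revision"]),
    }


def v3_range_to_v2_get(v3_body):
    """Wrap the first kv of a v3 range response as a v2-style `get` body."""
    kvs = v3_body.get("kvs") or []
    if not kvs:
        return None, "Key not found"
    return {"action": "get", "node": v3_kv_to_v2_node(kvs[0])}, None


# Sample record taken from the lua-resty-etcd log quoted later in this thread.
sample = {
    "kvs": [
        {
            "key": "L2FwaXNpeC90ZXN0X2tleQ==",
            "value": "InRlc3RfdmFsdWUi",
            "create_revision": "149",
            "mod_revision": "150",
            "version": "2",
        }
    ]
}
body, err = v3_range_to_v2_get(sample)
print(body["node"]["key"], body["node"]["value"])
```

The value is left as the raw stored string; whether the mock layer should also JSON-decode it is a design choice this sketch does not settle.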

Pre-submission checklist:

  • Did you explain what problem this PR solves, or what new features it adds?
  • Have you added corresponding test cases?
  • Have you modified the corresponding document?
  • Is this PR backward compatible?

The discussion thread on the mailing list: https://lists.apache.org/thread.html/r9a6dd85cc5388547ce1c20446d18366ed6f11844cacbb7bdd6be6005%40%3Cdev.apisix.apache.org%3E

@Yiyiyimu
Member Author

I ran the default benchmark test on my local PC. It seems that deploying etcd v3 could improve performance quite a lot.

Results for etcd v3 and etcd v2 (two wrk runs each):

1 worker + 1 upstream + no plugin

etcd v3:
+ sleep 1
+ wrk -d 5 -c 16 http://127.0.0.1:9080/hello
Running 5s test @ http://127.0.0.1:9080/hello
2 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.17ms 153.33us 3.44ms 77.84%
Req/Sec 6.83k 271.91 7.53k 72.55%
69316 requests in 5.10s, 275.85MB read
Requests/sec: 13591.19
Transfer/sec: 54.09MB
+ sleep 1
+ wrk -d 5 -c 16 http://127.0.0.1:9080/hello
Running 5s test @ http://127.0.0.1:9080/hello
2 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.18ms 176.65us 5.14ms 83.33%
Req/Sec 6.80k 268.62 7.48k 73.53%
68972 requests in 5.10s, 274.49MB read
Requests/sec: 13524.55
Transfer/sec: 53.82MB

etcd v2:
+ sleep 1
+ wrk -d 5 -c 16 http://127.0.0.1:9080/hello
Running 5s test @ http://127.0.0.1:9080/hello
2 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.33ms 211.08us 5.95ms 82.60%
Req/Sec 6.05k 289.03 6.63k 71.57%
61423 requests in 5.10s, 245.32MB read
Requests/sec: 12043.51
Transfer/sec: 48.10MB
+ sleep 1
+ wrk -d 5 -c 16 http://127.0.0.1:9080/hello
Running 5s test @ http://127.0.0.1:9080/hello
2 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.31ms 195.78us 3.63ms 81.55%
Req/Sec 6.11k 600.94 11.41k 97.03%
61396 requests in 5.10s, 245.22MB read
Requests/sec: 12039.57
Transfer/sec: 48.09MB

1 worker + 1 upstream + 2 plugins

etcd v3:
+ sleep 3
+ wrk -d 5 -c 16 http://127.0.0.1:9080/hello
Running 5s test @ http://127.0.0.1:9080/hello
2 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 558.87us 1.24ms 35.81ms 96.89%
Req/Sec 19.64k 1.91k 28.10k 76.24%
197213 requests in 5.10s, 801.24MB read
Requests/sec: 38669.03
Transfer/sec: 157.11MB
+ sleep 1
+ wrk -d 5 -c 16 http://127.0.0.1:9080/hello
Running 5s test @ http://127.0.0.1:9080/hello
2 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 521.08us 1.05ms 25.48ms 97.08%
Req/Sec 19.72k 1.84k 23.12k 73.00%
196309 requests in 5.00s, 797.58MB read
Requests/sec: 39251.72
Transfer/sec: 159.47MB

etcd v2:
+ sleep 3
+ wrk -d 5 -c 16 http://127.0.0.1:9080/hello
Running 5s test @ http://127.0.0.1:9080/hello
2 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.47ms 252.12us 9.54ms 84.44%
Req/Sec 5.48k 588.37 10.98k 97.03%
55055 requests in 5.10s, 223.67MB read
Requests/sec: 10796.49
Transfer/sec: 43.86MB
+ sleep 1
+ wrk -d 5 -c 16 http://127.0.0.1:9080/hello
Running 5s test @ http://127.0.0.1:9080/hello
2 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.51ms 292.56us 9.64ms 88.99%
Req/Sec 5.34k 540.96 10.24k 97.03%
53658 requests in 5.10s, 217.99MB read
Requests/sec: 10521.64
Transfer/sec: 42.75MB

fake empty apisix server: 1 worker

etcd v3:
+ sleep 1
+ wrk -d 5 -c 16 http://127.0.0.1:9080/hello
Running 5s test @ http://127.0.0.1:9080/hello
2 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.20ms 249.70us 8.56ms 90.65%
Req/Sec 6.71k 299.13 7.87k 75.49%
68112 requests in 5.10s, 271.06MB read
Requests/sec: 13356.74
Transfer/sec: 53.15MB
+ sleep 1
+ wrk -d 5 -c 16 http://127.0.0.1:9080/hello
Running 5s test @ http://127.0.0.1:9080/hello
2 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.27ms 1.00ms 18.01ms 98.38%
Req/Sec 6.74k 786.85 13.37k 96.04%
67753 requests in 5.10s, 269.64MB read
Requests/sec: 13287.07
Transfer/sec: 52.88MB

etcd v2:
+ sleep 1
+ wrk -d 5 -c 16 http://127.0.0.1:9080/hello
Running 5s test @ http://127.0.0.1:9080/hello
2 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.17ms 153.33us 3.44ms 77.84%
Req/Sec 6.83k 271.91 7.53k 72.55%
69316 requests in 5.10s, 275.85MB read
Requests/sec: 13591.19
Transfer/sec: 54.09MB
+ sleep 1
+ wrk -d 5 -c 16 http://127.0.0.1:9080/hello
Running 5s test @ http://127.0.0.1:9080/hello
2 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.18ms 176.65us 5.14ms 83.33%
Req/Sec 6.80k 268.62 7.48k 73.53%
68972 requests in 5.10s, 274.49MB read
Requests/sec: 13524.55
Transfer/sec: 53.82MB
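
To put a number on "quite a lot", the short Python sketch below averages the Requests/sec figures from the wrk runs above and prints the v3/v2 ratio per scenario. It assumes that, within each scenario, the first two runs are etcd v3 and the last two are etcd v2, following the column order of the results; the `runs` table and `speedup` helper are only illustrative. Note that the v2 figures of the last scenario are identical to the v3 figures of the first one in the posted output.

```python
from statistics import mean

# Requests/sec copied from the wrk output above.
# Assumption: per scenario, the first two runs are etcd v3, the last two etcd v2.
runs = {
    "1 worker + 1 upstream + no plugin": {
        "v3": [13591.19, 13524.55],
        "v2": [12043.51, 12039.57],
    },
    "1 worker + 1 upstream + 2 plugins": {
        "v3": [38669.03, 39251.72],
        "v2": [10796.49, 10521.64],
    },
    "fake empty apisix server: 1 worker": {
        "v3": [13356.74, 13287.07],
        "v2": [13591.19, 13524.55],
    },
}


def speedup(cell):
    """Ratio of mean v3 throughput to mean v2 throughput for one scenario."""
    return mean(cell["v3"]) / mean(cell["v2"])


for name, cell in runs.items():
    print(f"{name}: v3/v2 = {speedup(cell):.2f}")
```

Two 5-second runs per cell is a small sample, so these ratios are best read as a hint about where to look rather than as a measured improvement.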

@Yiyiyimu Yiyiyimu mentioned this pull request Aug 10, 2020
4 tasks
@Yiyiyimu
Member Author

Yiyiyimu commented Aug 10, 2020

TODO:

  • currently only v3 is tested; test both v2 and v3
  • currently only etcd v2 and v3.3+ are supported; do we need to support the etcd versions between these two?
  • document how to migrate data from v2 to v3, to avoid data loss

@Yiyiyimu
Member Author

Yiyiyimu commented Aug 10, 2020

It seems Travis CI passed in my personal repo, but the GitHub CI failed quite early. I'm a bit confused here.


UPDATE:
It seems Travis CI is using etcd v3.4, but GitHub Actions is using etcd v3.2. I'm working on multi-version support.

bin/apisix (outdated, resolved)
@Yiyiyimu
Member Author

I have implemented multi-version support (automatically changing the etcd prefix) in t/APISIX.pm, and it works for most tests. But test files like t/admin/stream-routes-disable.t, which change the content of config.yaml before the test runs, do not pick up the prefix change made in APISIX.pm.

So for now I manually set the etcd prefix to "/v3alpha" to try to pass the GitHub CI.
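
The prefix depends on the etcd version because the JSON gateway path changed across releases: according to the etcd documentation, v3.2 and earlier serve only `/v3alpha`, v3.3 adds `/v3beta`, and v3.4 adds `/v3`. A minimal Python sketch of that selection follows; `gateway_prefix` is a hypothetical helper that assumes a 3.x version string, and it is not the logic in t/APISIX.pm.

```python
def major_minor(version):
    """Turn '3.4.0', 'v3.4.0' or '3.4.0-rc.1' into a (major, minor) tuple."""
    parts = version.lstrip("v").split(".")[:2]
    return tuple(int(part) for part in parts)


def gateway_prefix(etcd_version):
    """Pick the newest JSON gateway prefix the given etcd 3.x version serves."""
    release = major_minor(etcd_version)
    if release >= (3, 4):
        return "/v3"
    if release == (3, 3):
        return "/v3beta"
    return "/v3alpha"


for version in ("3.2.26", "3.3.13", "3.4.0"):
    print(version, gateway_prefix(version))
```

Since v3.3 still accepts `/v3alpha` and v3.4 still accepts `/v3beta`, hard-coding an older prefix can work across neighbouring releases, which is presumably why "/v3alpha" is a usable stopgap for the etcd v3.2 CI environment.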

@Yiyiyimu
Member Author

Yiyiyimu commented Aug 12, 2020

Fixed by iterating through watch response.
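
To illustrate what iterating through a watch response can look like: a single message from the etcd v3 watch stream may carry several events, so applying only the first one would drop updates. The sketch below is in Python rather than the PR's Lua. The response shape (`result.events[].kv` with base64 fields, and `type` set to `DELETE` for removals) follows the etcd v3 JSON gateway as I understand it, and `apply_watch_response` is a hypothetical helper, not the code from this PR.

```python
import base64


def enc(text):
    """Base64-encode text the way the v3 JSON gateway transports it."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def dec(encoded):
    """Decode a base64 field from a v3 gateway message."""
    return base64.b64decode(encoded).decode("utf-8")


def apply_watch_response(store, response):
    """Apply every event in one watch message to a local key/value store."""
    events = response.get("result", {}).get("events", [])
    for event in events:
        key = dec(event["kv"]["key"])
        if event.get("type") == "DELETE":
            store.pop(key, None)
        else:
            # PUT events usually omit "type" because it is the default value.
            store[key] = dec(event["kv"]["value"])
    return len(events)


# Illustrative message carrying three events at once (made-up keys and values).
message = {
    "result": {
        "events": [
            {"kv": {"key": enc("/apisix/consumers/a"), "value": enc("1")}},
            {"kv": {"key": enc("/apisix/consumers/b"), "value": enc("2")}},
            {"type": "DELETE", "kv": {"key": enc("/apisix/consumers/a")}},
        ]
    }
}
store = {}
applied = apply_watch_response(store, message)
print(applied, store)
```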


DEBUG HELP NEEDED!!

Currently only one test fails: the last test of t/plugin/key-auth.t. With etcd v2, it adds 20 consumers and finds the 13th. With the current etcd v3 implementation, the test adds 20 consumers but only gets the first three of them, so it cannot find the 13th. However, if I rerun the test, it gets all 20 consumers and passes.

I think it might be related to how waitdir is implemented in the two versions, but I could not find a way to debug it.

I print self.values_hash at the start of each sync_data, and the log is here; maybe it could be of help.

@membphis
Member

I ran the default benchmark test on my local PC. It seems that deploying etcd v3 could improve performance quite a lot.

Whether the v3 or the v2 protocol is used, the test results should be the same. Because etcd notifies changes incrementally, most requests are in a waiting state.

@nic-chen
Member

Currently only one test fails: the last test of t/plugin/key-auth.t. With etcd v2, it adds 20 consumers and finds the 13th. With the current etcd v3 implementation, the test adds 20 consumers but only gets the first three of them, so it cannot find the 13th. However, if I rerun the test, it gets all 20 consumers and passes.

I think your guess is correct: it should be a waitdir problem. You can debug to see why the etcd data is not synchronized in time.

@Yiyiyimu
Member Author

Whether the v3 or the v2 protocol is used, the test results should be the same. Because etcd notifies changes incrementally, most requests are in a waiting state.

Is there any reason that could cause the difference? Or is the default benchmark test not suitable for this change?

@membphis
Member

Is there any reason that could cause the difference? Or is the default benchmark test not suitable for this change?

I think you need to check the error log first to confirm some details. If you need more help, please provide your benchmark steps one by one.

@nic-chen
Member

Note:
I found that running etcd v3 on Docker for Mac causes high CPU usage, as in:
etcd-io/etcd#11460

Running on Linux is OK.

@membphis
Member

One more thing: we need to check the etcd version in bin/apisix and confirm that it is >= 3.4.

We can fix this in a new PR; here is the related issue: #2227
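
A version check of this kind could look like the Python sketch below (bin/apisix itself is not Python). It assumes the version is read from etcd's `/version` endpoint, whose JSON body has an `etcdserver` field; `etcd_version_ok` is a hypothetical name, and the sketch parses a body passed in as a string rather than making the HTTP call.

```python
import json

MIN_ETCD = (3, 4)


def etcd_version_ok(version_body, minimum=MIN_ETCD):
    """Return True if the `/version` body reports etcd >= minimum."""
    info = json.loads(version_body)
    major, minor = info["etcdserver"].split(".")[:2]
    # Compare as integers so that 3.10 counts as newer than 3.4.
    return (int(major), int(minor)) >= minimum


print(etcd_version_ok('{"etcdserver":"3.4.0","etcdcluster":"3.4.0"}'))
print(etcd_version_ok('{"etcdserver":"3.2.26","etcdcluster":"3.2.0"}'))
```

Comparing numeric tuples rather than raw strings avoids the classic mistake of treating "3.10" as smaller than "3.4".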

apisix/core/config_etcd.lua (outdated, resolved)
apisix/core/config_etcd.lua (outdated, resolved)
# limitations under the License.
#

wget https://github.com/etcd-io/etcd/releases/download/v3.4.0/etcd-v3.4.0-linux-amd64.tar.gz
Member

Just a suggestion: you could follow the approach used for Docker: https://github.com/apache/apisix/pull/2225/files#diff-65e6a3c4290328a0a57797b4cf3de4d2R39.
It does not have to be changed in this PR.

Member

sure.

@Yiyiyimu
Member Author

Yiyiyimu commented Sep 15, 2020

One more thing: we need to check the etcd version in bin/apisix and confirm that it is >= 3.4.

We can fix this in a new PR; here is the related issue: #2227

fix #2227

end


function _M.watch_format(v3res)
Member

This needs a more meaningful function name.



function _M.get_format(res, realkey)
if res.body.error == "etcdserver: user name is empty" then
Member

Does etcd return an error code along with the message?

Member Author

Some do and some don't. For example, in this case ngx.status is set to 400.

{
  body = {
    code = 3,
    error = "etcdserver: user name is empty",
    message = "etcdserver: user name is empty"
  },
  body_reader = <function 1>,
  has_body = true,
  headers = {...},
  read_body = <function 4>,
  read_trailers = <function 5>,
  reason = "Bad Request",
  status = 400,
  trailer_reader = <function 6>
}


function _M.get_format(res, realkey)
if res.body.error == "etcdserver: user name is empty" then
return nil, "insufficient credentials code: 401"
Member

why not return res.body.error?

Member

And do we need to handle other error messages in res.body.error?

Member Author

  1. The error returned by get when etcd auth fails is different in v3 from v2, so I tried to keep it the same as in v2
  2. That's the only difference I know of between v2 and v3. We could return other errors directly.
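
The two points above amount to a small mapping: translate the one known v3 auth error into the v2-style message and pass everything else through. A Python sketch of that rule follows (the reviewed code is Lua, and `normalize_get_error` is a hypothetical name); the two message strings are the ones shown in the diff under review.

```python
AUTH_FAILED_V3 = "etcdserver: user name is empty"
AUTH_FAILED_V2 = "insufficient credentials code: 401"


def normalize_get_error(body):
    """Return None when the body has no error, otherwise an error string."""
    err = body.get("error")
    if err is None:
        return None
    if err == AUTH_FAILED_V3:
        # Keep the message callers already expect from the v2 API.
        return AUTH_FAILED_V2
    # Any other etcd error is returned unchanged.
    return err


print(normalize_get_error({"code": 3, "error": AUTH_FAILED_V3}))
```

Matching on the message text is fragile if etcd ever rewords it, which is the concern behind the reviewer's question about error codes.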

@Yiyiyimu
Member Author

Currently the logs produced by lua-resty-etcd show messages without base64-decoding them. For example:

2020/09/15 23:18:52 [info] 1964#1964: *2 [lua] v3.lua:284: set(): v3 set body: {"prev_kv":{"value":"InRlc3RfdmFsdWUi","create_revision":"149","mod_revision":"150","key":"L2FwaXNpeC90ZXN0X2tleQ==","version":"2"},"header":{"raft_term":"2","cluster_id":"8925027824743593106","member_id":"13803658152347727308","revision":"151"}}, client: 127.0.0.1, server: localhost, request: "GET /t HTTP/1.1", host: "localhost"
2020/09/15 23:18:52 [info] 1964#1964: *15 [lua] v3.lua:500: request_chunk(): http request method: POST path: /v3/watch body: {"create_request":{"range_end":"L2FwaXNpeC91cHN0cmVhbXQ=","start_revision":151,"key":"L2FwaXNpeC91cHN0cmVhbXM="}} query: nil, context: ngx.timer

Shall we decode them on the etcd side, or just remove these logs as v2 did?
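
For reading such log lines in the meantime, the base64 fields can be decoded offline. The Python sketch below walks a logged JSON body and decodes the `key`, `value` and `range_end` fields; `decode_b64_fields` is a hypothetical helper, and the input is the watch request body from the second log line above.

```python
import base64
import json

B64_FIELDS = {"key", "value", "range_end"}


def decode_b64_fields(node):
    """Recursively decode the base64 string fields of a v3 gateway body."""
    if isinstance(node, dict):
        decoded = {}
        for name, item in node.items():
            if name in B64_FIELDS and isinstance(item, str):
                decoded[name] = base64.b64decode(item).decode("utf-8", "replace")
            else:
                decoded[name] = decode_b64_fields(item)
        return decoded
    if isinstance(node, list):
        return [decode_b64_fields(item) for item in node]
    return node


raw = (
    '{"create_request":{"range_end":"L2FwaXNpeC91cHN0cmVhbXQ=",'
    '"start_revision":151,"key":"L2FwaXNpeC91cHN0cmVhbXM="}}'
)
readable = decode_b64_fields(json.loads(raw))
print(json.dumps(readable))
```

Decoded, the watch covers keys from `/apisix/upstreams` up to `/apisix/upstreamt`, that is, the key with its last byte incremented, which is how a prefix watch is expressed as a range.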

@nic-chen
Member

Currently the logs produced by lua-resty-etcd show messages without base64-decoding them. For example:


2020/09/15 23:18:52 [info] 1964#1964: *2 [lua] v3.lua:284: set(): v3 set body: {"prev_kv":{"value":"InRlc3RfdmFsdWUi","create_revision":"149","mod_revision":"150","key":"L2FwaXNpeC90ZXN0X2tleQ==","version":"2"},"header":{"raft_term":"2","cluster_id":"8925027824743593106","member_id":"13803658152347727308","revision":"151"}}, client: 127.0.0.1, server: localhost, request: "GET /t HTTP/1.1", host: "localhost"

2020/09/15 23:18:52 [info] 1964#1964: *15 [lua] v3.lua:500: request_chunk(): http request method: POST path: /v3/watch body: {"create_request":{"range_end":"L2FwaXNpeC91cHN0cmVhbXQ=","start_revision":151,"key":"L2FwaXNpeC91cHN0cmVhbXM="}} query: nil, context: ngx.timer

Shall we decode them on the etcd side, or just remove these logs as v2 did?

We could optimize it in another PR.

@moonming moonming merged commit 4722198 into apache:master Sep 16, 2020
bin/apisix (resolved)
@Yiyiyimu
Member Author

We could optimize it in another PR.

@nic-chen I think we need to release a newer version of lua-resty-etcd, since the current version outputs TONS of logs.

@membphis
Member

@Yiyiyimu you can create a new GitHub issue if you find anything else.

This PR has been merged, so we should use new issues to resolve such problems.

@Yiyiyimu
Member Author

TODO: we need to add doc for etcd migration
