feat: implement apiToken failover mechanism #1256
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1256      +/-   ##
==========================================
+ Coverage   35.91%   44.23%   +8.32%
==========================================
  Files          69       76       +7
  Lines       11576     9895    -1681
==========================================
+ Hits         4157     4377     +220
+ Misses       7104     5183    -1921
- Partials      315      335      +20
@cr7258 You can use SetSharedData to synchronize this. Be careful to use a CAS mechanism to avoid conflicts. You can also build leader election on top of the SetSharedData mechanism, so that a single worker performs the health-check recovery. Note, however, that the data in SharedData is VM-scoped: it is not cleared even when the plugin configuration is updated.
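The lease-based leader election described above can be sketched as follows. This is a minimal stand-alone sketch: the `sharedStore` type is an in-memory stand-in for the host's shared-data ABI (the real plugin would call `proxywasm.GetSharedData`/`SetSharedData`), and all names and the `vmID|expiresAt` lease encoding are illustrative assumptions, not the PR's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"sync"
)

// In-memory stand-in for the host's shared-data ABI (hypothetical mock).
type sharedStore struct {
	mu   sync.Mutex
	data map[string][]byte
	cas  map[string]uint32
}

func (s *sharedStore) Get(key string) ([]byte, uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key], s.cas[key]
}

func (s *sharedStore) Set(key string, value []byte, cas uint32) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cas != s.cas[key] {
		return false // CAS conflict: another VM wrote first
	}
	s.data[key] = value
	s.cas[key]++
	return true
}

// tryAcquireOrRenewLease lets exactly one VM win the health-check role:
// a VM takes the lease if it is empty, expired, or already its own, and
// the CAS check on Set rejects concurrent writers.
func tryAcquireOrRenewLease(s *sharedStore, vmID string, now, ttl int64) bool {
	raw, cas := s.Get("healthCheckLeader")
	holder, expiry := "", int64(0)
	if parts := strings.SplitN(string(raw), "|", 2); len(parts) == 2 {
		holder = parts[0]
		expiry, _ = strconv.ParseInt(parts[1], 10, 64)
	}
	if holder != "" && holder != vmID && now < expiry {
		return false // someone else holds a live lease
	}
	lease := fmt.Sprintf("%s|%d", vmID, now+ttl)
	return s.Set("healthCheckLeader", []byte(lease), cas)
}

func main() {
	s := &sharedStore{data: map[string][]byte{}, cas: map[string]uint32{}}
	fmt.Println(tryAcquireOrRenewLease(s, "vm-a", 100, 30)) // vm-a wins the empty lease
	fmt.Println(tryAcquireOrRenewLease(s, "vm-b", 110, 30)) // rejected: lease still live
	fmt.Println(tryAcquireOrRenewLease(s, "vm-b", 160, 30)) // wins: lease expired
}
```

The CAS counter makes the race explicit: if two VMs read the same expired lease, only the first `Set` succeeds and the loser simply stays a follower until the next tick.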
healthCheckClient = wrapper.NewClusterClient(wrapper.StaticIpCluster{
	ServiceName: "local_cluster",
	Port:        10000,
})
Shouldn't this be configurable?
Yes, currently a cluster pointing to 127.0.0.1 needs to be configured. Is there a better way to handle this, so that users don't have to configure this extra cluster?
- name: outbound|10000||local_cluster.static
  connect_timeout: 0.25s
  type: STATIC
  load_assignment:
    cluster_name: outbound|10000||local_cluster.static
    endpoints:
      - lb_endpoints:
          - endpoint:
              address:
                socket_address:
                  address: 127.0.0.1
                  port_value: 10000
Just use RouteCluster, i.e. the service the current request is routed to, and probe that service to verify health. Using the Authorization header directly is enough; there is no need to set this ApiToken-Health-Check header.
The health-check task is registered during the parseConfig phase; from what I tried, the current request's cluster is not available at that point, right?
Would it be better to trigger the check only after a failure? When a request fails, combine the RouteCluster and the failed apiToken and add the pair to the check queue.
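The failure-triggered queue suggested here can be sketched as a deduplicated list of (cluster, apiToken) pairs; all names below are illustrative assumptions, not the PR's actual identifiers.

```go
package main

import "fmt"

// Instead of a standing health-check timer over all tokens, enqueue a
// (RouteCluster, apiToken) pair only when a request actually fails, so
// the checker knows which upstream to probe with which key.
type checkItem struct {
	Cluster  string // RouteCluster captured from the failed request
	ApiToken string
}

type checkQueue struct {
	items []checkItem
	seen  map[checkItem]bool
}

func newCheckQueue() *checkQueue {
	return &checkQueue{seen: map[checkItem]bool{}}
}

// Enqueue deduplicates so one token/cluster pair is probed once per cycle.
func (q *checkQueue) Enqueue(cluster, token string) {
	it := checkItem{Cluster: cluster, ApiToken: token}
	if q.seen[it] {
		return
	}
	q.seen[it] = true
	q.items = append(q.items, it)
}

func main() {
	q := newCheckQueue()
	q.Enqueue("outbound|443||llm.dns", "sk-bad")
	q.Enqueue("outbound|443||llm.dns", "sk-bad") // duplicate, ignored
	fmt.Println(len(q.items))
}
```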
Passing RouteCluster through would make that workable.
However, I still feel the health-check request should go through ai-proxy, because ai-proxy uniformly handles the differences between LLM providers, for example:
- Some LLMs put the model in a header, others in the body.
- Different LLMs use different Authorization headers.
- Different LLMs use different paths.
Doesn't this local_cluster part need to be configured? And users may change the gateway's listen port. If this is meant as a general solution, wouldn't it be best to have pilot-agent or the controller push down such a built-in cluster?
@johnlanni I've updated the code to synchronize the apiToken state across VMs with SetSharedData, and leader election is also done via SetSharedData.
Regarding the caveat mentioned here, what handling do I still need to add?
}

func generateVMID() string {
	return fmt.Sprintf("%016x", time.Now().Nanosecond())
}
The vm id can be obtained directly via getProperty; the key is "plugin_vm_id".
Oh, my mistake: that vm_id comes from the configuration and doesn't identify a unique VM; it is currently configured as an empty string.
Then would generating a uuid be better?
ok
if c.failover != nil && c.failover.enabled {
	wrapper.RegisteTickFunc(c.failover.healthCheckTimeout, func() {
		// Only the Wasm VM that successfully acquires the lease will perform health check
		if tryAcquireOrRenewLease(vmID, log) {
An else branch is needed here: VMs that are not elected leader should periodically (at the health-check interval) fetch the global tokens from shared data and refresh their own thread-local copy of them.
This avoids hitting shared data on every request; in Envoy's underlying implementation both the get and the set shared-data operations take a lock, which adds overhead.
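The leader/follower split on the shared tick can be sketched as below. This is a stand-alone sketch with the shared-data read injected as a function; all names are illustrative, not the PR's actual identifiers.

```go
package main

import "fmt"

// On each tick, the leader runs health checks, while every VM refreshes a
// thread-local token cache from shared data so that the per-request path
// never has to touch the locked shared-data store.
type vmState struct {
	isLeader    bool
	localTokens []string // thread-local copy consulted on the request path
}

func onTick(v *vmState, readSharedTokens func() []string, healthCheck func()) {
	if v.isLeader {
		healthCheck() // probe disabled tokens and restore recovered ones
	}
	// Refresh happens for leader and followers alike, since the leader
	// also serves traffic (a follow-up point raised later in this thread).
	v.localTokens = append([]string(nil), readSharedTokens()...)
}

func main() {
	follower := &vmState{}
	checks := 0
	onTick(follower, func() []string { return []string{"sk-1", "sk-2"} }, func() { checks++ })
	fmt.Println(follower.localTokens, checks)
}
```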
Wasm VMs that are not elected leader don't run health checks; only the leader fetches the global unavailableTokens and performs the checks. So it seems the else branch isn't needed here?
But shared data is not only touched during health checks; it is also accessed when a request fails and the fail count has to be incremented.
There are currently 4 pieces of shared data:
- ctxApiTokenRequestFailureCount: per-token failure counts, a map keyed by apiToken whose value is the number of failures
- ctxApiTokenRequestSuccessCount: tokens disabled after reaching the failure threshold and awaiting health checks, a map keyed by apiToken whose value is the number of successful health checks
- ctxApiTokens: the list of usable tokens
- ctxUnavailableApiTokens: the list of disabled tokens

> VMs that are not elected leader should periodically (at the health-check interval) fetch the global tokens from shared data and refresh their own thread-local copy.

There are two problems here:
1. If an apiToken reaches the failure threshold and is removed, the delay of the periodic sync from shared data means some Wasm VMs may still try to use the already-disabled token.
2. The leader Wasm VM also receives requests, so if tokens need to be refreshed periodically from shared data into the local copy, that logic shouldn't live only in the else branch; both the leader and the followers need to do it.

> But shared data is not only touched during health checks; it is also accessed when a request fails and the fail count has to be incremented.

Incrementing the fail count touches ctxApiTokenRequestFailureCount, and that count needs to be exact, so it can't just be synced periodically from the global copy to the local one.
})

vmID := generateVMID()
err := c.initApiTokens()
Because the contents of shared data follow the plugin VM's lifecycle and are only cleared when the plugin is shut down or its version is upgraded, initialization here must reset all related shared data; both the available and the unavailable entries need to be reset.
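The reset described above might look like the following sketch; the key names are the identifiers from this thread, and the injected setter is an assumption standing in for the host's SetSharedData call.

```go
package main

import "fmt"

// All failover-related shared-data keys that must be cleared when the
// plugin (re)parses its config, since shared data survives config updates
// and a stale unavailable list would otherwise leak into the new config.
var failoverSharedKeys = []string{
	"ctxApiTokenRequestFailureCount",
	"ctxApiTokenRequestSuccessCount",
	"ctxApiTokens",
	"ctxUnavailableApiTokens",
}

func resetSharedState(set func(key string, value []byte) error) error {
	for _, key := range failoverSharedKeys {
		if err := set(key, nil); err != nil {
			return fmt.Errorf("reset %s: %w", key, err)
		}
	}
	return nil
}

func main() {
	store := map[string][]byte{"ctxApiTokens": []byte("stale")}
	if err := resetSharedState(func(k string, v []byte) error {
		store[k] = v
		return nil
	}); err != nil {
		panic(err)
	}
	fmt.Println(len(store["ctxApiTokens"]))
}
```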
Done: 82b2284
	log.Errorf("Failed to get unavailable apiToken: %v", err)
	return
}
c.addApiToken(ctxUnavailableApiTokens, apiToken, unavailableTokens, unavailableCas, log)
handle unavailable api token is called from every worker thread, so the add here must not simply overwrite: it has to account for conflicts. First get the current value, then append the token on top of it, then set, using CAS throughout to detect conflicts and retrying on conflict, up to e.g. 10 times.
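The read-modify-write loop described here can be sketched as follows, with the host shared-data calls injected as `get`/`set` functions (an assumption of this stand-alone sketch; the real code uses the proxy-wasm shared-data API).

```go
package main

import "fmt"

// addTokenWithRetry gets the current list, appends the token, and sets it
// back with the observed CAS value, retrying on conflict up to 10 times.
func addTokenWithRetry(
	get func() (tokens []string, cas uint32),
	set func(tokens []string, cas uint32) bool,
	token string,
) bool {
	const maxRetries = 10
	for i := 0; i < maxRetries; i++ {
		tokens, cas := get()
		if set(append(tokens, token), cas) {
			return true
		}
		// CAS mismatch: another worker wrote first; re-read and retry.
	}
	return false
}

func main() {
	var stored []string
	var version uint32
	get := func() ([]string, uint32) { return stored, version }
	set := func(t []string, cas uint32) bool {
		if cas != version {
			return false
		}
		stored, version = t, version+1
		return true
	}
	fmt.Println(addTokenWithRetry(get, set, "sk-bad"), stored)
}
```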
Here I do get the current value first and then append to it before adding.
higress/plugins/wasm-go/extensions/ai-proxy/provider/failover.go
Lines 256 to 261 in 856343c
unavailableTokens, unavailableCas, err := getApiTokens(ctxUnavailableApiTokens)
if err != nil {
	log.Errorf("Failed to get unavailable apiToken: %v", err)
	return
}
c.addApiToken(ctxUnavailableApiTokens, apiToken, unavailableTokens, unavailableCas, log)
I'll add the CAS retry logic.
The CAS retry logic has been added.
No major problems overall; please adjust the mechanism-related details mentioned above.
The README.md should also be updated.
	return
}

failureCount[apiToken]++
Is this counter the number of consecutive failures or the total number of failures?
It is the apiToken's total number of failures. When the failure threshold is reached, the token is added to unavailableTokens for later health checks, removed from apiTokens so subsequent requests no longer use it, and its failure count is reset to 0.
Would counting consecutive failures be more reasonable here? For example, with a threshold of 3: 10 requests with 1 failure the day before yesterday, 20 requests with 1 failure yesterday, and another 10 requests with 1 failure today; pulling the token out in that case doesn't seem right, does it?
Done: 1e40d82
…ures exceeds the threshold
@@ -47,6 +47,10 @@ func parseGlobalConfig(json gjson.Result, pluginConfig *config.PluginConfig, log
 	if err := pluginConfig.Complete(); err != nil {
 		return err
 	}
+
+	providerConfig := pluginConfig.GetProviderConfig()
+	providerConfig.SetApiTokensFailover(log)
This shouldn't be called only for the global config; parseOverrideRuleConfig needs to call it too. Also, could this logic be moved into pluginConfig's Complete function?
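The refactor suggested here, folding the call into Complete so both parse paths pick it up, might look like this sketch; the types and method bodies are illustrative stand-ins, not the PR's actual code.

```go
package main

import "fmt"

type providerConfig struct{ failoverStarted bool }

func (p *providerConfig) SetApiTokensFailover() { p.failoverStarted = true }

type pluginConfig struct{ provider providerConfig }

// Complete runs after both parseGlobalConfig and parseOverrideRuleConfig,
// so putting the failover setup here means neither path can forget it.
func (c *pluginConfig) Complete() error {
	// ...existing validation would go here...
	c.provider.SetApiTokensFailover()
	return nil
}

func main() {
	var c pluginConfig
	if err := c.Complete(); err != nil {
		panic(err)
	}
	fmt.Println(c.provider.failoverStarted)
}
```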
providerConfig := pluginConfig.GetProviderConfig()
// If apiToken failover is enabled and the request is not a health check request, handle unavailable apiToken.
if providerConfig.IsFailoverEnabled() && ctx.GetContext(provider.ApiTokenHealthCheck) == nil {
	providerConfig.HandleUnavailableApiToken(apiTokenInUse, log)
Encapsulate all of this inside ProviderConfig or PluginConfig, e.g. add a function named something like OnRequestFailed, so that these details aren't exposed in main.
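The suggested wrapper might look like this sketch, where main only reports the failure and the config object owns the branching; all names are illustrative, not the PR's actual API.

```go
package main

import "fmt"

type providerConfig struct {
	failoverEnabled bool
	unavailable     []string
}

// OnRequestFailed hides the failover details from main: it ignores
// health-check probes (so they don't feed back into the accounting) and
// no-ops entirely when failover is disabled.
func (c *providerConfig) OnRequestFailed(apiToken string, isHealthCheck bool) {
	if !c.failoverEnabled || isHealthCheck {
		return
	}
	c.unavailable = append(c.unavailable, apiToken)
}

func main() {
	c := &providerConfig{failoverEnabled: true}
	c.OnRequestFailed("sk-1", false)
	c.OnRequestFailed("sk-2", true) // ignored: health-check request
	fmt.Println(c.unavailable)
}
```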
	}
}
log.Debugf("[onHttpRequestHeader] use apiToken %s to send request", apiTokenInUse)
ctx.SetContext(provider.ApiTokenInUse, apiTokenInUse)
这部分逻辑是不是也可以抽到ProviderConfig里
Ⅰ. Describe what this PR did
Configuration example:
Currently, whether an apiToken is usable is judged solely by whether the HTTP response status code is 200; more complex conditions shouldn't be needed for now.
Ⅱ. Does this pull request fix one issue?
fixes #1227
Ⅲ. Why don't you add test cases (unit test/integration test)?
Ⅳ. Describe how to verify it
Ⅴ. Special notes for reviews
Question
There are still two open questions: