fix: [dynamic config] Fix deadlock caused by recursive read-locking in config_flow #212

Merged
merged 1 commit into from
Aug 11, 2024

ArmstrongCN
Contributor

@ArmstrongCN ArmstrongCN commented Aug 10, 2024

Please provide issue(s) of this PR:
Fixes #183

To help us figure out who should review this PR, please put an X in all the areas that this PR affects.

  • Configuration
  • Docs
  • Performance and Scalability
  • Naming
  • HealthCheck
  • Test and Release

Please check any characteristics that apply to this pull request.

  • Does not have any user-facing changes. This may include API changes, behavior changes, performance improvements, etc.

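Background

When a program fetches a large number of config files serially at startup, it can easily deadlock: it hangs permanently inside the GetConfigFile call for one of the files. Adding a sleep between two consecutive fetches, to yield, mitigates the problem.

Similar issue report: #183

Reproduction

Pseudocode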

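for _, fileName := range configFileNames {
	// Serial fetch; with many files, one of these calls eventually hangs forever.
	cfile, _ := configAPI.GetConfigFile(trpcConfig.Global.Namespace,
		fmt.Sprintf("%s.%s", trpcConfig.Server.App, trpcConfig.Server.Server),
		fileName)
	_ = cfile // placeholder: the real program goes on to use the file
}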

Analysis

  1. Goroutine 1, GetConfigFile, acquires the write lock
    Code

    c.fclock.Lock()
    defer c.fclock.Unlock()
    // double check
    configFile, ok = c.configFileCache[cacheKey]
    if ok {
    	return configFile, nil
    }
    fileRepo, err := newConfigFileRepo(configFileMetadata, c.connector, c.chain, c.conf, c.persistHandler)
    if err != nil {
    	return nil, err
    }
    configFile = newDefaultConfigFile(configFileMetadata, fileRepo)
    if req.Subscribe {
    	c.addConfigFileToLongPollingPool(fileRepo)
    	c.repos = append(c.repos, fileRepo)
    	c.configFileCache[cacheKey] = configFile
    }

  2. Goroutine 2, the scheduled task, acquires the read lock twice

    Flow:

    1. The first GetConfigFile call starts a new goroutine (addConfigFileToLongPollingPool), which polls every 5 seconds:
      c.startLongPollingTaskOnce.Do(func() {
      	ctx, cancel := context.WithCancel(context.Background())
      	c.cancel = cancel
      	go func() {
      		time.Sleep(5 * time.Second)
      		c.mainLoop(ctx)
      	}()
      })
    2. During polling, it builds the list of subscribed config files while holding the read lock:
      func (c *ConfigFileFlow) assembleWatchConfigFiles() []*configconnector.ConfigFile {
      	c.fclock.RLock()
      	defer c.fclock.RUnlock()
      	watchConfigFiles := make([]*configconnector.ConfigFile, 0, len(c.configFilePool))
      	for cacheKey := range c.configFilePool {
      		configFileMetadata := extractConfigFileMetadata(cacheKey)
      		watchConfigFiles = append(watchConfigFiles, &configconnector.ConfigFile{
      			Namespace: configFileMetadata.GetNamespace(),
      			FileGroup: configFileMetadata.GetFileGroup(),
      			FileName:  configFileMetadata.GetFileName(),
      			Version:   c.getConfigFileNotifiedVersion(cacheKey),
      		})
      	}
      	return watchConfigFiles
      }
    3. While still holding the read lock from step 2, it calls getConfigFileNotifiedVersion, which requests the read lock again:
      func (c *ConfigFileFlow) getConfigFileNotifiedVersion(cacheKey string) uint64 {
      	c.fclock.RLock()
      	defer c.fclock.RUnlock()
      	version, ok := c.notifiedVersion[cacheKey]
      	if !ok {
      		version = initVersion
      	}
      	return version
      }

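Root cause

To keep writers from starving, a read-write lock that receives a write-lock request waits for the goroutines currently holding the read lock to release it, and blocks any new read-lock requests in the meantime.

In the sequence above, goroutine 1 (the business GetConfigFile call) requests the write lock after goroutine 2 (the scheduled task) has acquired the read lock for the first time (step 2). Goroutine 2's second read-lock request (step 3) then blocks behind the pending writer, while the writer waits for goroutine 2 to release its first read lock. The two wait on each other in a cycle and neither yields.

The RWMutex documentation also warns against recursive read-locking. https://pkg.go.dev/sync#RWMutex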

If any goroutine calls Lock while the lock is already held by one or more readers,
concurrent calls to RLock will block until the writer has acquired (and released) the lock,
to ensure that the lock eventually becomes available to the writer. Note that this prohibits recursive read-locking.

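Fix

getConfigFileNotifiedVersion needs both a lock-free and a locking variant, so that it does not lock again when the caller already holds the read lock. This PR adds a locking parameter that lets the caller decide whether to take the lock.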
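
A rough sketch of the approach follows. The flow struct, assembleVersions, and the value of initVersion are simplified stand-ins assumed for illustration. Only the method name getConfigFileNotifiedVersion and the locking parameter come from the description above; see the merged diff for the exact code.

```go
package main

import (
	"fmt"
	"sync"
)

const initVersion uint64 = 0 // assumed value for this sketch

// flow is a simplified, hypothetical stand-in for ConfigFileFlow.
type flow struct {
	fclock          sync.RWMutex
	notifiedVersion map[string]uint64
}

// getConfigFileNotifiedVersion takes the read lock only when locking is true.
// Callers that already hold fclock must pass false.
func (c *flow) getConfigFileNotifiedVersion(cacheKey string, locking bool) uint64 {
	if locking {
		c.fclock.RLock()
		defer c.fclock.RUnlock()
	}
	version, ok := c.notifiedVersion[cacheKey]
	if !ok {
		version = initVersion
	}
	return version
}

// assembleVersions mirrors the shape of assembleWatchConfigFiles: it holds the
// read lock for the whole loop, so it asks for the lock-free path.
func (c *flow) assembleVersions(keys []string) map[string]uint64 {
	c.fclock.RLock()
	defer c.fclock.RUnlock()
	out := make(map[string]uint64, len(keys))
	for _, k := range keys {
		out[k] = c.getConfigFileNotifiedVersion(k, false)
	}
	return out
}

func main() {
	c := &flow{notifiedVersion: map[string]uint64{"a": 3}}
	fmt.Println(c.assembleVersions([]string{"a", "b"}))
	fmt.Println(c.getConfigFileNotifiedVersion("a", true))
}
```

The cost of this design is that correctness depends on every caller passing the right flag: passing true while holding the lock reintroduces the deadlock, and passing false without the lock is a data race.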

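There are more elegant ways to write this logic, but for the deadlock itself, adding a parameter is the smallest change and the quickest fix. Restructuring the code can be considered at a future milestone.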
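
One possible shape for such a restructuring, offered only as an assumption and not as something this PR or the maintainers have proposed, is the common Go convention of pairing a public method that locks with an internal helper that expects the lock to be held. All names below (versionStore, versionLocked, Version, Snapshot) are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// versionStore is a hypothetical type, not part of polaris-go.
type versionStore struct {
	mu       sync.RWMutex
	versions map[string]uint64
}

// versionLocked assumes the caller already holds mu (read or write).
func (s *versionStore) versionLocked(key string) uint64 {
	return s.versions[key] // a missing key yields the zero value
}

// Version is the entry point for callers that do not hold the lock.
func (s *versionStore) Version(key string) uint64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.versionLocked(key)
}

// Snapshot takes the read lock once and uses only the *Locked helper inside,
// so it never requests the read lock recursively.
func (s *versionStore) Snapshot() map[string]uint64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string]uint64, len(s.versions))
	for k := range s.versions {
		out[k] = s.versionLocked(k)
	}
	return out
}

func main() {
	s := &versionStore{versions: map[string]uint64{"app.yaml": 7}}
	fmt.Println(s.Version("app.yaml"), s.Version("missing"), s.Snapshot())
}
```

Compared with a boolean flag, the locking contract is encoded in the method name, so a misuse is easier to spot in review.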

@chuntaojun chuntaojun merged commit fd7c1cc into polarismesh:main Aug 11, 2024
6 of 18 checks passed
ArmstrongCN added a commit to ArmstrongCN/polaris-go that referenced this pull request Aug 17, 2024

Successfully merging this pull request may close these issues.

polaris.ConfigAPI cannot watch multiple config files in parallel
2 participants