[clustering] When a gateway loses its connection to ZooKeeper, it does not update ZookeeperNonBlockingLock's acquired or lockPath field, so a target gets a duplicate subscription when the gateway restores its connection to zk #61

Open
yangyu66 opened this issue Nov 25, 2024 · 1 comment

Comments

yangyu66 commented Nov 25, 2024

Bug description
Initial setup:
Two gateway instances (a and b) on different servers subscribe to the same target with the same config. Only one of them (a) gets the lock from ZooKeeper and starts subscribing.

Producing the bug:
When gateway a loses its connection to ZooKeeper, it does not update the "acquired" or "lockPath" field of its own ZookeeperNonBlockingLock, so the target gets a duplicate subscription once the gateway restores its connection to ZooKeeper.

Steps to reproduce

  1. My config is
{
  "request": {
    "default": {
      "subscribe": {
        "prefix": {
        },
        "subscription": [
          {
            "path": {
              "elem": [
                {
                  "name": "interfaces"
                },
                {
                  "name": "interface",
                  "key": {
                    "name": "Ethernet100"
                  }
                },
                {
                  "name": "state"
                },
                {
                  "name": "counters"
                },
                {
                  "name": "in-discards"
                }
              ]
            },
            "mode": 2,
            "sample_interval": 30000000000,
            "heartbeat_interval": 30000000000
          }
        ]
      }
    }
  },
  "target": {
    "demo-router:8080": {
      "addresses": [
        "demo-router:8080"
      ],
      "credentials": {
        "username": "xx",
        "password": "xx"
      },
      "request": "default",
      "meta": {
        "NoTLS": "yes"
      }
    }
  }
}
  2. On one server (a), run './gnmi-gateway -EnableGNMIServer -ServerTLSCert=server.crt -ServerTLSKey=server.key -TargetLoaders=json -TargetJSONFile=targets.json -Exporters=kafka -ExporterKafkaTopic=xxx -ExporterKafkaBrokers=kafka:9092 -ExporterKafkaLogging -ZookeeperHosts zk:2181'

  3. On another server (b), run the same command: './gnmi-gateway -EnableGNMIServer -ServerTLSCert=server.crt -ServerTLSKey=server.key -TargetLoaders=json -TargetJSONFile=targets.json -Exporters=kafka -ExporterKafkaTopic=xxx -ExporterKafkaBrokers=kafka:9092 -ExporterKafkaLogging -ZookeeperHosts zk:2181'

  4. Break the connection between server a and zk.

  5. Server b starts subscribing to the target (expected).

  6. Restore the connection between server a and zk.

  7. Server a starts subscribing to the target again (bug).

Expected behavior
After the connection between server a and zk is restored, server a should not start subscribing to the target, because server b is already subscribed to it.

Output or code snippets
In connections/zookeeper.go, in
func (c *ZookeeperConnectionManager) eventListener(zkEvents <-chan zk.Event) {}, next to the targetConfig.unlock() call, the distributed lock status should also be cleared or corrected.

@yangyu66
Author

I think in

func (c *ZookeeperConnectionManager) eventListener(zkEvents <-chan zk.Event) {
....
case zk.StateDisconnected: ...
}

we should add some code to clear the DistributedLocker's "acquired" and "lockPath" fields. targetConfig.unlock() appears only to cancel the context for the target. Once the connection to zk is restored, the gateway still sees the lock as already acquired and starts subscribing to the target. This happens even though a deadlock error is logged in connections/state.go by t.config.Log.Error().Msgf("Target %s: error while trying to acquire lock: %v", t.name, err): the lock attempt fails with the deadlock error, but the lock is still marked as acquired after the connection to zk is restored.
I'm working on a PR to fix this issue and should publish it soon.
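
To illustrate the failure mode and the proposed reset, here is a minimal in-memory sketch. It does not use the real gnmi-gateway or ZooKeeper client code: fakeZK, localLock, expire, and resetOnDisconnect are hypothetical names, and the model assumes that the lost connection leads to the lock node being released (which is why b can take over) and that a retry with a non-empty lockPath fails with a deadlock-style error while leaving "acquired" unchanged.

```go
package main

import (
	"errors"
	"fmt"
)

// errDeadlock stands in for the deadlock error mentioned in the issue.
var errDeadlock = errors.New("trying to acquire a lock twice")

// fakeZK is an in-memory stand-in for ZooKeeper (hypothetical): it only
// records which member currently owns the lock node for one target.
type fakeZK struct{ owner string }

// expire models ZooKeeper releasing a member's lock node after the
// connection is lost.
func (z *fakeZK) expire(member string) {
	if z.owner == member {
		z.owner = ""
	}
}

// localLock is a simplified model of the per-target lock state a gateway
// keeps in memory. It is not the real ZookeeperNonBlockingLock.
type localLock struct {
	member   string
	acquired bool
	lockPath string
}

// try attempts to take the lock. A stale, non-empty lockPath short-circuits
// with errDeadlock and leaves acquired untouched.
func (l *localLock) try(z *fakeZK) (bool, error) {
	if l.lockPath != "" {
		return l.acquired, errDeadlock
	}
	if z.owner != "" && z.owner != l.member {
		return false, nil
	}
	z.owner = l.member
	l.acquired = true
	l.lockPath = "/locks/target/" + l.member
	return true, nil
}

// resetOnDisconnect is the proposed fix: forget the local lock state when
// the connection to ZooKeeper is lost.
func (l *localLock) resetOnDisconnect() {
	l.acquired = false
	l.lockPath = ""
}

// scenario replays the reproduction steps and reports whether gateway a
// still believes it holds the lock, and who really owns it.
func scenario(reset bool) (aHolds bool, owner string) {
	z := &fakeZK{}
	a := &localLock{member: "a"}
	b := &localLock{member: "b"}
	a.try(z)      // a gets the lock and subscribes
	b.try(z)      // b does not get the lock
	z.expire("a") // a loses its connection to zk
	if reset {
		a.resetOnDisconnect()
	}
	b.try(z) // b takes over (expected)
	a.try(z) // a reconnects and retries
	return a.acquired, z.owner
}

func main() {
	fmt.Println(scenario(false)) // without the reset: true b
	fmt.Println(scenario(true))  // with the reset: false b
}
```

Without the reset, a still reports the lock as acquired while b is the real owner, which matches the duplicate subscription described above. With the reset, a's retry goes through the normal acquisition path and fails cleanly.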
