[clustering] When a gateway loses its connection to Zookeeper, it doesn't update ZookeeperNonBlockingLock's acquired or lockPath fields, so a target gets a duplicate subscription when the gateway's connection to zk is restored
#61
Open
yangyu66 opened this issue
Nov 25, 2024
· 1 comment
Bug description
Initial setup:
Two gateway instances (a and b) on different servers subscribe to the same target (with the same config); only one of them (a) gets the lock from zk and starts subscribing.
Reproducing the bug:
When gateway a loses its connection to Zookeeper, it doesn't update its own ZookeeperNonBlockingLock's "acquired" or "lockPath" fields, so the target gets a duplicate subscription when the gateway's connection to zk is restored.
Steps to reproduce
1. On one server (a), run `./gnmi-gateway -EnableGNMIServer -ServerTLSCert=server.crt -ServerTLSKey=server.key -TargetLoaders=json -TargetJSONFile=targets.json -Exporters=kafka -ExporterKafkaTopic=xxx -ExporterKafkaBrokers=kafka:9092 -ExporterKafkaLogging -ZookeeperHosts zk:2181`
2. On another server (b), run the same command: `./gnmi-gateway -EnableGNMIServer -ServerTLSCert=server.crt -ServerTLSKey=server.key -TargetLoaders=json -TargetJSONFile=targets.json -Exporters=kafka -ExporterKafkaTopic=xxx -ExporterKafkaBrokers=kafka:9092 -ExporterKafkaLogging -ZookeeperHosts zk:2181`
3. Break the connection between server a and zk.
4. Server b starts subscribing to the target. (expected)
5. Restore the connection between server a and zk.
6. Server a starts subscribing to the target. (bug)
Expected behavior
In the last step, server a should not start subscribing to the target.
Output or code snippets
In connections/zookeeper.go, `func (c *ZookeeperConnectionManager) eventListener(zkEvents <-chan zk.Event)` calls `targetConfig.unlock()` on a lost connection; it should also clear/correct the distributed lock status.
We should add some code to clear DistributedLocker's "acquired" and "lockPath" fields. `targetConfig.unlock()` appears only to cancel the context for the target, so once the connection to zk is restored the lock is still marked as acquired (the acquire attempt even fails with a deadlock error, logged in connections/state.go as `t.config.Log.Error().Msgf("Target %s: error while trying to acquire lock: %v", t.name, err)`, yet the lock is still marked acquired), and the gateway starts subscribing to the target again.
I'm working on a PR to fix this issue and should publish it soon.