
Increase health-check threshold for filesystem #11721

Open
vnovotny98 opened this issue Jan 3, 2024 · 6 comments

Comments

@vnovotny98

Is your feature request related to a problem? Please describe

I have old and slow HDDs, and when I delete big indices on them the health check fails, which then disconnects the cold nodes and turns the whole cluster red.

It happens only on cold nodes, because their disk utilization hits 100% when ILM deletes a 50 GB index.

When I delete the indices manually with a DELETE index request in Kibana, disk utilization is only about 5%. Is this a bug in ILM, or can I increase the threshold?

health check of [/usr/share/opensearch/data/nodes/0] took [5202ms] which is above the warn threshold of [5s]

Describe the solution you'd like

Can I increase this threshold to 10 or more seconds?

Related component

Storage:Performance

Describe alternatives you've considered

Here is my topic on the forum, but it received no response.
https://forum.opensearch.org/t/increase-health-check-threshhold/17302

Additional context

LOGS

pescold02-elastic[7214]: [2023-12-30T16:10:02,030][WARN ][o.o.m.f.FsHealthService  ] [pescold02-spc] health check of [/usr/share/opensearch/data/nodes/0] took [5202ms] which is above the warn threshold of [5s]
pescold02-elastic[7214]: [2023-12-30T16:10:12,959][INFO ][o.o.c.c.Coordinator      ] [pescold02-spc] cluster-manager node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
pescold02-elastic[7214]: org.opensearch.OpenSearchException: node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed [3] consecutive checks
pescold01-elastic[6944]: [2023-12-30T16:08:47,486][WARN ][o.o.m.f.FsHealthService  ] [pescold01-spc] health check of [/usr/share/opensearch/data/nodes/0] took [11405ms] which is above the warn threshold of [5s]
pescold01-elastic[6944]: [2023-12-30T16:09:53,189][WARN ][o.o.m.f.FsHealthService  ] [pescold01-spc] health check of [/usr/share/opensearch/data/nodes/0] took [5803ms] which is above the warn threshold of [5s]
Dec 30 17:10:12 pescold01-spc pescold01-elastic[6944]: [2023-12-30T16:10:12,859][INFO ][o.o.c.c.Coordinator      ] [pescold01-spc] cluster-manager node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
pescold01-elastic[6944]: org.opensearch.OpenSearchException: node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed [3] consecutive checks
@vnovotny98 vnovotny98 added the enhancement and untriaged labels Jan 3, 2024
@Bukhtawar
Collaborator

Can you share the cluster manager logs, grepping for node_left? The default threshold is set at 60s, which is pretty reasonable in my opinion for a wide variety of storage devices. The warn logs are controlled by a dynamic setting, monitor.fs.health.slow_path_logging_threshold, which is set at 5s. Note that it is just a warning indicating IO bottlenecks so admins can take corrective action.
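
As a rough sketch (assuming a standard tarball layout under /usr/share/opensearch; your log path may differ), something like this on the cluster manager node should surface those events:

# look for node-left events on the cluster manager (log path is an assumption)
grep -iE "node-left|node_left" /usr/share/opensearch/logs/*.log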

@Bukhtawar Bukhtawar added the feedback needed label and removed the enhancement and untriaged labels Jan 3, 2024
@vnovotny98
Author

LOGS FROM COLD NODE:

[2023-12-30T16:08:47,486][WARN ][o.o.m.f.FsHealthService  ] [pescold01-spc] health check of [/usr/share/opensearch/data/nodes/0] took [11405ms] which is above the warn threshold of [5s]
[2023-12-30T16:09:53,189][WARN ][o.o.m.f.FsHealthService  ] [pescold01-spc] health check of [/usr/share/opensearch/data/nodes/0] took [5803ms] which is above the warn threshold of [5s]
[2023-12-30T16:10:12,859][INFO ][o.o.c.c.Coordinator      ] [pescold01-spc] cluster-manager node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
org.opensearch.OpenSearchException: node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}] failed [3] consecutive checks

[2023-12-30T16:10:22,860][WARN ][o.o.c.c.ClusterFormationFailureHelper] [pescold01-spc] cluster-manager not discovered yet: have discovered [{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}, {pesmaster01-spc}{KTmndIIPSJW5ZG9h0SsH8Q}{yCor3E-RQdaHCwksEbGo9A}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster02-spc}{PiKX_fELQHusiepq5RLhbQ}{8NoVsoUQTnyQfDtGb_9G1w}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}]; discovery will continue using [10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300] from hosts providers and [{pesmaster01-spc}{KTmndIIPSJW5ZG9h0SsH8Q}{yCor3E-RQdaHCwksEbGo9A}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster02-spc}{PiKX_fELQHusiepq5RLhbQ}{8NoVsoUQTnyQfDtGb_9G1w}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 72, last-accepted version 109454 in term 72
[2023-12-30T16:10:25,228][INFO ][o.o.j.s.JobSweeper       ] [pescold01-spc] Running full sweep

[2023-12-30T16:10:32,861][WARN ][o.o.c.c.ClusterFormationFailureHelper] [pescold01-spc] cluster-manager not discovered yet: have discovered [{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}, {pesmaster01-spc}{KTmndIIPSJW5ZG9h0SsH8Q}{yCor3E-RQdaHCwksEbGo9A}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster02-spc}{PiKX_fELQHusiepq5RLhbQ}{8NoVsoUQTnyQfDtGb_9G1w}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}]; discovery will continue using [10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300] from hosts providers and [{pesmaster01-spc}{KTmndIIPSJW5ZG9h0SsH8Q}{yCor3E-RQdaHCwksEbGo9A}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster02-spc}{PiKX_fELQHusiepq5RLhbQ}{8NoVsoUQTnyQfDtGb_9G1w}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 72, last-accepted version 109454 in term 72

[2023-12-30T16:10:42,861][WARN ][o.o.c.c.ClusterFormationFailureHelper] [pescold01-spc] cluster-manager not discovered yet: have discovered [{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}, {pesmaster01-spc}{KTmndIIPSJW5ZG9h0SsH8Q}{yCor3E-RQdaHCwksEbGo9A}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster02-spc}{PiKX_fELQHusiepq5RLhbQ}{8NoVsoUQTnyQfDtGb_9G1w}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}]; discovery will continue using [10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300] from hosts providers and [{pesmaster01-spc}{KTmndIIPSJW5ZG9h0SsH8Q}{yCor3E-RQdaHCwksEbGo9A}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster02-spc}{PiKX_fELQHusiepq5RLhbQ}{8NoVsoUQTnyQfDtGb_9G1w}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 72, last-accepted version 109454 in term 72


[2023-12-30T16:12:19,915][INFO ][o.o.a.c.ADClusterEventListener] [pescold01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:12:19,916][INFO ][o.o.a.c.HashRing         ] [pescold01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q, kQMsooynRqa_wWrerH2sKw]
[2023-12-30T16:12:19,916][INFO ][o.o.a.c.ADClusterEventListener] [pescold01-spc] Hash ring build result: true
[2023-12-30T16:12:19,916][INFO ][o.o.a.c.HashRing         ] [pescold01-spc] Rebuild AD hash ring for realtime AD with cooldown, nodeChangeEvents size 2
[2023-12-30T16:12:19,916][INFO ][o.o.a.c.HashRing         ] [pescold01-spc] Build AD version hash ring successfully
[2023-12-30T16:12:19,916][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [pescold01-spc] Detected cluster change event for destination migration
[2023-12-30T16:12:19,932][INFO ][o.o.d.PeerFinder         ] [pescold01-spc] setting findPeersInterval to [1s] as node commission status = [true] for local node [{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}]

[2023-12-30T16:12:19,959][INFO ][o.o.c.s.ClusterApplierService] [pescold01-spc] added {{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true},{pescold03-spc}{kQMsooynRqa_wWrerH2sKw}{e0GIClkPQWq7xn9wnMFFpA}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109462, reason: ApplyCommitRequest{term=72, version=109462, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:12:20,060][INFO ][o.o.a.c.ADClusterEventListener] [pescold01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:12:20,060][INFO ][o.o.a.c.HashRing         ] [pescold01-spc] Node added: [KqFo0_SCQdeBhDotJlQ28Q, kQMsooynRqa_wWrerH2sKw]
[2023-12-30T16:12:20,060][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [pescold01-spc] Detected cluster change event for destination migration
[2023-12-30T16:12:20,061][INFO ][o.o.m.a.MLModelAutoReDeployer] [pescold01-spc] Model auto reload configuration is false, not performing auto reloading!
[2023-12-30T16:12:20,296][INFO ][o.o.a.c.HashRing         ] [pescold01-spc] All nodes with known AD version: {KTmndIIPSJW5ZG9h0SsH8Q=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, zW7MK68wTciyYhNsLYdQRg=ADNodeInfo{version=2.8.0, isEligibleDataNode=true}, u393ACmDRiyS7Gfv20CvwQ=ADNodeInfo{version=2.8.0, isEligibleDataNode=true}, edKjilksTHmpfKVOMOsqUw=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, MQFKK1lfRcGpZ9l9JE8t7Q=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, OBI1tBR-RlW8N5jRl3Du5w=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, NrLruO_pR3GK6kg8o6a-CA=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, KqFo0_SCQdeBhDotJlQ28Q=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, Qnrakfs_T-qeQG880TcD9Q=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, 9jX0k5J6Q5muN5DPn4vu1Q=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, PiKX_fELQHusiepq5RLhbQ=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, TnKeeOk_SKS7VeM80Qaefg=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, kQMsooynRqa_wWrerH2sKw=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, DVfYSiL6Reaysc8a2tNYMw=ADNodeInfo{version=2.8.0, isEligibleDataNode=true}, El3uXMa2TTK7-vi77ctNRQ=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}}
[2023-12-30T16:12:20,296][INFO ][o.o.a.c.ADClusterEventListener] [pescold01-spc] Hash ring build result: true

CLUSTER MANAGER NODE:

[2023-12-30T16:00:24,760][INFO ][o.o.j.s.JobSweeper       ] [pesmaster01-spc] Running full sweep
[2023-12-30T16:05:24,760][INFO ][o.o.j.s.JobSweeper       ] [pesmaster01-spc] Running full sweep
[2023-12-30T16:08:10,812][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [pesmaster01-spc] Detected cluster change event for destination migration
[2023-12-30T16:10:10,865][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true},{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109455, reason: ApplyCommitRequest{term=72, version=109455, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q, OBI1tBR-RlW8N5jRl3Du5w]
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Hash ring build result: true
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Rebuild AD hash ring for realtime AD with cooldown, nodeChangeEvents size 2
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Build AD version hash ring successfully
[2023-12-30T16:10:10,866][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [pesmaster01-spc] Detected cluster change event for destination migration
[2023-12-30T16:10:12,328][WARN ][r.suppressed             ] [pesmaster01-spc] path: /_prometheus/metrics, params: {}
java.lang.NullPointerException: Cannot invoke "org.opensearch.index.shard.DocsStats.getCount()" because the return value of "org.opensearch.action.admin.indices.stats.CommonStats.getDocs()" is null

When I grepped for "left", the output was empty.

Grepping for "removed":


[2023-12-30T16:10:10,865][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true},{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109455, reason: ApplyCommitRequest{term=72, version=109455, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q, OBI1tBR-RlW8N5jRl3Du5w]
[2023-12-30T16:10:40,920][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:11:10,960][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true},{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109457, reason: ApplyCommitRequest{term=72, version=109457, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:11:10,961][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:11:10,961][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q, OBI1tBR-RlW8N5jRl3Du5w]
[2023-12-30T16:11:41,032][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:12:11,145][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold03-spc}{kQMsooynRqa_wWrerH2sKw}{e0GIClkPQWq7xn9wnMFFpA}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true},{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true},{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109459, reason: ApplyCommitRequest{term=72, version=109459, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:12:11,146][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:12:11,146][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q, OBI1tBR-RlW8N5jRl3Du5w, kQMsooynRqa_wWrerH2sKw]
[2023-12-30T16:12:12,085][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:12:20,008][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:14:20,017][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109466, reason: ApplyCommitRequest{term=72, version=109466, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:14:20,017][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:14:20,017][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q]
[2023-12-30T16:14:24,007][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:16:24,037][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109474, reason: ApplyCommitRequest{term=72, version=109474, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:16:24,037][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:16:24,037][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q]
[2023-12-30T16:16:26,283][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:18:26,253][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109485, reason: ApplyCommitRequest{term=72, version=109485, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:18:26,254][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:18:26,254][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q]
[2023-12-30T16:18:28,485][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true

@Bukhtawar
Collaborator

In any case, you should see these logs if the health check threshold is indeed being breached:

logger.error(
"health check of [{}] failed, took [{}ms] which is above the healthy threshold of [{}]",

@vnovotny98
Author

I don't know; I have my own Docker builds, and in the Dockerfile I download:
wget -P $TEMP_DIR https://artifacts.opensearch.org/releases/bundle/opensearch/${VERSION}/opensearch-${VERSION}-linux-x64.tar.gz

Right now I am using VERSION=2.8.0. All the relevant logs are posted above.

So is there a way to check that "The default threshold is set at 60s." is configured correctly? Maybe some GET API?
Because I think it is set to 5s, and when it fails 3x 5 seconds it disconnects my node.

@andrross
Member

andrross commented Jan 9, 2024

So is there a way to check that "The default threshold is set at 60s." is configured correctly? Maybe some GET API?

Yes, there is the cluster settings API:

GET _cluster/settings?include_defaults=true&flat_settings=true

You should see values like the following to confirm your current settings:

    "monitor.fs.health.healthy_timeout_threshold" : "60s",
    "monitor.fs.health.refresh_interval" : "60s",
    "monitor.fs.health.slow_path_logging_threshold" : "5s",
    "monitor.fs.refresh_interval" : "1s",

@vnovotny98
Author

You are right, these are my cluster settings:

    "monitor.fs.health.enabled": "true",
    "monitor.fs.health.healthy_timeout_threshold": "60s",
    "monitor.fs.health.refresh_interval": "60s",
    "monitor.fs.health.slow_path_logging_threshold": "5s",
    "monitor.fs.refresh_interval": "1s",
    "monitor.jvm.gc.enabled": "true",
    "monitor.jvm.gc.overhead.debug": "10",
    "monitor.jvm.gc.overhead.info": "25",
    "monitor.jvm.gc.overhead.warn": "50",
    "monitor.jvm.gc.refresh_interval": "1s",
    "monitor.jvm.refresh_interval": "1s",
    "monitor.os.refresh_interval": "1s",
    "monitor.process.refresh_interval": "1s",

And the main problem is then "monitor.fs.health.slow_path_logging_threshold": "5s"?

Do I need to set "monitor.fs.health.slow_path_logging_threshold" from 5s to something like 10s, and then this warning may disappear?

health check of [/usr/share/opensearch/data/nodes/0] took [5202ms] which is above the warn threshold of [5s]
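
If so, I assume it would just be a dynamic cluster settings update along these lines (only a sketch; I understand this would only raise the warning threshold for the log message, not the 60s monitor.fs.health.healthy_timeout_threshold that actually fails a node):

PUT _cluster/settings
{
  "persistent": {
    "monitor.fs.health.slow_path_logging_threshold": "10s"
  }
}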

@ashking94 ashking94 moved this from 🆕 New to 👀 In review in Storage Project Board Apr 18, 2024
@shourya035 shourya035 moved this from 👀 In review to Ready To Be Picked in Storage Project Board Jun 13, 2024
@shourya035 shourya035 moved this from Ready To Be Picked to 👀 In review in Storage Project Board Jun 13, 2024