
Increase health-check threshold for filesystem #11721

Open
vnovotny98 opened this issue Jan 3, 2024 · 6 comments

Comments

@vnovotny98

Is your feature request related to a problem? Please describe

I have old and slow HDDs, and when I delete big indices on them the health check fails, which then disconnects the cold nodes and turns the whole cluster red.

It happens only on cold nodes, because their disk utilization hits 100% when ILM deletes a 50 GB index.

When I delete the indices manually with a DELETE index request in Kibana, disk utilization is only about 5%. Is this a bug in ILM, or can I increase the threshold?

health check of [/usr/share/opensearch/data/nodes/0] took [5202ms] which is above the warn threshold of [5s]

Describe the solution you'd like

Can I increase this threshold to 10 or more seconds?

Related component

Storage:Performance

Describe alternatives you've considered

Here is my topic on the forum, but it received no response.
https://forum.opensearch.org/t/increase-health-check-threshhold/17302

Additional context

LOGS

pescold02-elastic[7214]: [2023-12-30T16:10:02,030][WARN ][o.o.m.f.FsHealthService  ] [pescold02-spc] health check of [/usr/share/opensearch/data/nodes/0] took [5202ms] which is above the warn threshold of [5s]
pescold02-elastic[7214]: [2023-12-30T16:10:12,959][INFO ][o.o.c.c.Coordinator      ] [pescold02-spc] cluster-manager node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
pescold02-elastic[7214]: org.opensearch.OpenSearchException: node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed [3] consecutive checks
pescold01-elastic[6944]: [2023-12-30T16:08:47,486][WARN ][o.o.m.f.FsHealthService  ] [pescold01-spc] health check of [/usr/share/opensearch/data/nodes/0] took [11405ms] which is above the warn threshold of [5s]
pescold01-elastic[6944]: [2023-12-30T16:09:53,189][WARN ][o.o.m.f.FsHealthService  ] [pescold01-spc] health check of [/usr/share/opensearch/data/nodes/0] took [5803ms] which is above the warn threshold of [5s]
Dec 30 17:10:12 pescold01-spc pescold01-elastic[6944]: [2023-12-30T16:10:12,859][INFO ][o.o.c.c.Coordinator      ] [pescold01-spc] cluster-manager node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
pescold01-elastic[6944]: org.opensearch.OpenSearchException: node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed [3] consecutive checks
@vnovotny98 vnovotny98 added the enhancement and untriaged labels Jan 3, 2024
@Bukhtawar
Collaborator

Can you share the cluster manager logs, grepping for node_left? The default threshold is set at 60s, which is pretty reasonable in my opinion for a wide variety of storage devices. The warn logs are controlled by a dynamic setting, monitor.fs.health.slow_path_logging_threshold, which is set at 5s. Note that it is just a warning indicating IO bottlenecks so admins can take corrective action.
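
As a rough sketch (assuming a standard tarball layout under /usr/share/opensearch; your log path may differ), something like this on the cluster manager node should surface those events:

# look for node-left events on the cluster manager (log path is an assumption)
grep -iE "node-left|node_left" /usr/share/opensearch/logs/*.log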

@Bukhtawar Bukhtawar added the feedback needed label and removed the enhancement and untriaged labels Jan 3, 2024
@vnovotny98
Author

LOGS FROM COLD NODE:

[2023-12-30T16:08:47,486][WARN ][o.o.m.f.FsHealthService  ] [pescold01-spc] health check of [/usr/share/opensearch/data/nodes/0] took [11405ms] which is above the warn threshold of [5s]
[2023-12-30T16:09:53,189][WARN ][o.o.m.f.FsHealthService  ] [pescold01-spc] health check of [/usr/share/opensearch/data/nodes/0] took [5803ms] which is above the warn threshold of [5s]
[2023-12-30T16:10:12,859][INFO ][o.o.c.c.Coordinator      ] [pescold01-spc] cluster-manager node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
org.opensearch.OpenSearchException: node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}] failed [3] consecutive checks

[2023-12-30T16:10:22,860][WARN ][o.o.c.c.ClusterFormationFailureHelper] [pescold01-spc] cluster-manager not discovered yet: have discovered [{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}, {pesmaster01-spc}{KTmndIIPSJW5ZG9h0SsH8Q}{yCor3E-RQdaHCwksEbGo9A}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster02-spc}{PiKX_fELQHusiepq5RLhbQ}{8NoVsoUQTnyQfDtGb_9G1w}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}]; discovery will continue using [10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300] from hosts providers and [{pesmaster01-spc}{KTmndIIPSJW5ZG9h0SsH8Q}{yCor3E-RQdaHCwksEbGo9A}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster02-spc}{PiKX_fELQHusiepq5RLhbQ}{8NoVsoUQTnyQfDtGb_9G1w}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 72, last-accepted version 109454 in term 72
[2023-12-30T16:10:25,228][INFO ][o.o.j.s.JobSweeper       ] [pescold01-spc] Running full sweep

[2023-12-30T16:10:32,861][WARN ][o.o.c.c.ClusterFormationFailureHelper] [pescold01-spc] cluster-manager not discovered yet: have discovered [{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}, {pesmaster01-spc}{KTmndIIPSJW5ZG9h0SsH8Q}{yCor3E-RQdaHCwksEbGo9A}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster02-spc}{PiKX_fELQHusiepq5RLhbQ}{8NoVsoUQTnyQfDtGb_9G1w}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}]; discovery will continue using [10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300] from hosts providers and [{pesmaster01-spc}{KTmndIIPSJW5ZG9h0SsH8Q}{yCor3E-RQdaHCwksEbGo9A}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster02-spc}{PiKX_fELQHusiepq5RLhbQ}{8NoVsoUQTnyQfDtGb_9G1w}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 72, last-accepted version 109454 in term 72

[2023-12-30T16:10:42,861][WARN ][o.o.c.c.ClusterFormationFailureHelper] [pescold01-spc] cluster-manager not discovered yet: have discovered [{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}, {pesmaster01-spc}{KTmndIIPSJW5ZG9h0SsH8Q}{yCor3E-RQdaHCwksEbGo9A}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster02-spc}{PiKX_fELQHusiepq5RLhbQ}{8NoVsoUQTnyQfDtGb_9G1w}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}]; discovery will continue using [10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300, 10.X.X.X:9300] from hosts providers and [{pesmaster01-spc}{KTmndIIPSJW5ZG9h0SsH8Q}{yCor3E-RQdaHCwksEbGo9A}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}, {pesmaster02-spc}{PiKX_fELQHusiepq5RLhbQ}{8NoVsoUQTnyQfDtGb_9G1w}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 72, last-accepted version 109454 in term 72


[2023-12-30T16:12:19,915][INFO ][o.o.a.c.ADClusterEventListener] [pescold01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:12:19,916][INFO ][o.o.a.c.HashRing         ] [pescold01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q, kQMsooynRqa_wWrerH2sKw]
[2023-12-30T16:12:19,916][INFO ][o.o.a.c.ADClusterEventListener] [pescold01-spc] Hash ring build result: true
[2023-12-30T16:12:19,916][INFO ][o.o.a.c.HashRing         ] [pescold01-spc] Rebuild AD hash ring for realtime AD with cooldown, nodeChangeEvents size 2
[2023-12-30T16:12:19,916][INFO ][o.o.a.c.HashRing         ] [pescold01-spc] Build AD version hash ring successfully
[2023-12-30T16:12:19,916][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [pescold01-spc] Detected cluster change event for destination migration
[2023-12-30T16:12:19,932][INFO ][o.o.d.PeerFinder         ] [pescold01-spc] setting findPeersInterval to [1s] as node commission status = [true] for local node [{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}]

[2023-12-30T16:12:19,959][INFO ][o.o.c.s.ClusterApplierService] [pescold01-spc] added {{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true},{pescold03-spc}{kQMsooynRqa_wWrerH2sKw}{e0GIClkPQWq7xn9wnMFFpA}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109462, reason: ApplyCommitRequest{term=72, version=109462, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:12:20,060][INFO ][o.o.a.c.ADClusterEventListener] [pescold01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:12:20,060][INFO ][o.o.a.c.HashRing         ] [pescold01-spc] Node added: [KqFo0_SCQdeBhDotJlQ28Q, kQMsooynRqa_wWrerH2sKw]
[2023-12-30T16:12:20,060][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [pescold01-spc] Detected cluster change event for destination migration
[2023-12-30T16:12:20,061][INFO ][o.o.m.a.MLModelAutoReDeployer] [pescold01-spc] Model auto reload configuration is false, not performing auto reloading!
[2023-12-30T16:12:20,296][INFO ][o.o.a.c.HashRing         ] [pescold01-spc] All nodes with known AD version: {KTmndIIPSJW5ZG9h0SsH8Q=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, zW7MK68wTciyYhNsLYdQRg=ADNodeInfo{version=2.8.0, isEligibleDataNode=true}, u393ACmDRiyS7Gfv20CvwQ=ADNodeInfo{version=2.8.0, isEligibleDataNode=true}, edKjilksTHmpfKVOMOsqUw=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, MQFKK1lfRcGpZ9l9JE8t7Q=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, OBI1tBR-RlW8N5jRl3Du5w=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, NrLruO_pR3GK6kg8o6a-CA=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, KqFo0_SCQdeBhDotJlQ28Q=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, Qnrakfs_T-qeQG880TcD9Q=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, 9jX0k5J6Q5muN5DPn4vu1Q=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, PiKX_fELQHusiepq5RLhbQ=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, TnKeeOk_SKS7VeM80Qaefg=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, kQMsooynRqa_wWrerH2sKw=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}, DVfYSiL6Reaysc8a2tNYMw=ADNodeInfo{version=2.8.0, isEligibleDataNode=true}, El3uXMa2TTK7-vi77ctNRQ=ADNodeInfo{version=2.8.0, isEligibleDataNode=false}}
[2023-12-30T16:12:20,296][INFO ][o.o.a.c.ADClusterEventListener] [pescold01-spc] Hash ring build result: true

CLUSTER MANAGER NODE:

[2023-12-30T16:00:24,760][INFO ][o.o.j.s.JobSweeper       ] [pesmaster01-spc] Running full sweep
[2023-12-30T16:05:24,760][INFO ][o.o.j.s.JobSweeper       ] [pesmaster01-spc] Running full sweep
[2023-12-30T16:08:10,812][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [pesmaster01-spc] Detected cluster change event for destination migration
[2023-12-30T16:10:10,865][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true},{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109455, reason: ApplyCommitRequest{term=72, version=109455, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q, OBI1tBR-RlW8N5jRl3Du5w]
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Hash ring build result: true
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Rebuild AD hash ring for realtime AD with cooldown, nodeChangeEvents size 2
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Build AD version hash ring successfully
[2023-12-30T16:10:10,866][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [pesmaster01-spc] Detected cluster change event for destination migration
[2023-12-30T16:10:12,328][WARN ][r.suppressed             ] [pesmaster01-spc] path: /_prometheus/metrics, params: {}
java.lang.NullPointerException: Cannot invoke "org.opensearch.index.shard.DocsStats.getCount()" because the return value of "org.opensearch.action.admin.indices.stats.CommonStats.getDocs()" is null

When I grepped for "left", the output was empty.

Grepping for "removed":


[2023-12-30T16:10:10,865][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true},{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109455, reason: ApplyCommitRequest{term=72, version=109455, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:10:10,866][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q, OBI1tBR-RlW8N5jRl3Du5w]
[2023-12-30T16:10:40,920][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:11:10,960][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true},{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109457, reason: ApplyCommitRequest{term=72, version=109457, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:11:10,961][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:11:10,961][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q, OBI1tBR-RlW8N5jRl3Du5w]
[2023-12-30T16:11:41,032][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:12:11,145][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold03-spc}{kQMsooynRqa_wWrerH2sKw}{e0GIClkPQWq7xn9wnMFFpA}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true},{pescold01-spc}{OBI1tBR-RlW8N5jRl3Du5w}{-MhqbxRVSPqlKl7vECUCUg}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true},{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109459, reason: ApplyCommitRequest{term=72, version=109459, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:12:11,146][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:12:11,146][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q, OBI1tBR-RlW8N5jRl3Du5w, kQMsooynRqa_wWrerH2sKw]
[2023-12-30T16:12:12,085][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:12:20,008][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:14:20,017][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109466, reason: ApplyCommitRequest{term=72, version=109466, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:14:20,017][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:14:20,017][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q]
[2023-12-30T16:14:24,007][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:16:24,037][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109474, reason: ApplyCommitRequest{term=72, version=109474, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:16:24,037][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:16:24,037][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q]
[2023-12-30T16:16:26,283][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true
[2023-12-30T16:18:26,253][INFO ][o.o.c.s.ClusterApplierService] [pesmaster01-spc] removed {{pescold02-spc}{KqFo0_SCQdeBhDotJlQ28Q}{Q2zIseMCQM-Dj6fDFO1IAQ}{10.X.X.X}{10.X.X.X:9300}{dr}{temp=cold, box_type=cold, shard_indexing_pressure_enabled=true}}, term: 72, version: 109485, reason: ApplyCommitRequest{term=72, version=109485, sourceNode={pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.X.X.X}{10.X.X.X:9300}{mr}{shard_indexing_pressure_enabled=true}}
[2023-12-30T16:18:26,254][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: true, node added: false
[2023-12-30T16:18:26,254][INFO ][o.o.a.c.HashRing         ] [pesmaster01-spc] Node removed: [KqFo0_SCQdeBhDotJlQ28Q]
[2023-12-30T16:18:28,485][INFO ][o.o.a.c.ADClusterEventListener] [pesmaster01-spc] Cluster node changed, node removed: false, node added: true

@Bukhtawar
Collaborator

In any case, you should see these logs if the health check threshold is indeed being breached:

logger.error(
"health check of [{}] failed, took [{}ms] which is above the healthy threshold of [{}]",

@vnovotny98
Author

I don't know; I have my own Docker builds, and in the Dockerfile I download:
wget -P $TEMP_DIR https://artifacts.opensearch.org/releases/bundle/opensearch/${VERSION}/opensearch-${VERSION}-linux-x64.tar.gz

Right now I am using VERSION=2.8.0. All the relevant logs are posted above.

So is there a way to check that "The default threshold is set at 60s." is configured correctly? Maybe some GET API?
Because I think it is set to 5s, and when it fails 3x 5 seconds it disconnects my node.

@andrross
Member

andrross commented Jan 9, 2024

So is there a way to check that "The default threshold is set at 60s." is configured correctly? Maybe some GET API?

Yes, there is the cluster settings API:

GET _cluster/settings?include_defaults=true&flat_settings=true

You should see values like the following to confirm your current settings:

    "monitor.fs.health.healthy_timeout_threshold" : "60s",
    "monitor.fs.health.refresh_interval" : "60s",
    "monitor.fs.health.slow_path_logging_threshold" : "5s",
    "monitor.fs.refresh_interval" : "1s",

@vnovotny98
Author

You are right, these are my cluster settings:

    "monitor.fs.health.enabled": "true",
    "monitor.fs.health.healthy_timeout_threshold": "60s",
    "monitor.fs.health.refresh_interval": "60s",
    "monitor.fs.health.slow_path_logging_threshold": "5s",
    "monitor.fs.refresh_interval": "1s",
    "monitor.jvm.gc.enabled": "true",
    "monitor.jvm.gc.overhead.debug": "10",
    "monitor.jvm.gc.overhead.info": "25",
    "monitor.jvm.gc.overhead.warn": "50",
    "monitor.jvm.gc.refresh_interval": "1s",
    "monitor.jvm.refresh_interval": "1s",
    "monitor.os.refresh_interval": "1s",
    "monitor.process.refresh_interval": "1s",

And the main problem is then "monitor.fs.health.slow_path_logging_threshold": "5s"?

Do I need to set "monitor.fs.health.slow_path_logging_threshold" from 5s to something like 10s, and then this warning may disappear?

health check of [/usr/share/opensearch/data/nodes/0] took [5202ms] which is above the warn threshold of [5s]
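
If so, I assume it would just be a dynamic cluster settings update along these lines (only a sketch; I understand this would only raise the warning threshold for the log message, not the 60s monitor.fs.health.healthy_timeout_threshold that actually fails a node):

PUT _cluster/settings
{
  "persistent": {
    "monitor.fs.health.slow_path_logging_threshold": "10s"
  }
}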

@ashking94 ashking94 moved this from 🆕 New to 👀 In review in Storage Project Board Apr 18, 2024
@shourya035 shourya035 moved this from 👀 In review to Ready To Be Picked in Storage Project Board Jun 13, 2024
@shourya035 shourya035 moved this from Ready To Be Picked to 👀 In review in Storage Project Board Jun 13, 2024