[BUG] Cross cluster replication is too slow between Opensearch clusters deployed on two separate kubernetes clusters. #527

VinDataR · 2024-04-30T06:08:38Z

Environment information:
We have two OpenSearch clusters deployed on two different Kubernetes clusters in Azure.

We have enabled cross-cluster replication between these two OpenSearch clusters deployed on two separate Kubernetes clusters.

Whenever we push documents into indices on the leader cluster, they take a long time to replicate to the follower.
We have tried this with small indices (kilobytes) as well as larger indices (gigabytes).

In both cases, there is a noticeable lag between the time the documents are pushed to the leader site and the time the same documents become available on the follower site.

We pushed an index named 'projecttask' to the leader site. These are its stats from the leader site:

health status index       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   projecttask atOBXJVWT_6wp7cwV_Y0Mw   3   2    4657988      1786411     51.3gb         17.1gb

While monitoring the remote site, we saw the replication status on the follower site remain at BOOTSTRAPPING for a long time. The status now shows SYNCING, but replication is still taking a long time.

Below is the output of the replication status API, captured at several points yesterday. As bytes_percent and files_percent show, the bootstrap progresses very slowly. We are seeking help on how to troubleshoot this further and speed up the replication.

curl -XGET -k -u ':' 'http://localhost:9200/_plugins/_replication/projecttask/_status?pretty'


{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 62150110,
    "bytes_percent" : 0.50800186,
    "files_total" : 419,
    "files_recovered" : 12,
    "files_percent" : 2.8643699,
    "start_time" : 1714390992205,
    "running_time" : 209247
  }
}

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 62150110,
    "bytes_percent" : 0.50800186,
    "files_total" : 419,
    "files_recovered" : 12,
    "files_percent" : 2.8643699,
    "start_time" : 1714390992205,
    "running_time" : 229621
  }
}

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 166715937,
    "bytes_percent" : 1.361734,
    "files_total" : 419,
    "files_recovered" : 29,
    "files_percent" : 6.9193783,
    "start_time" : 1714390992205,
    "running_time" : 1846339
  }
}

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 166715937,
    "bytes_percent" : 1.361734,
    "files_total" : 419,
    "files_recovered" : 29,
    "files_percent" : 6.9193783,
    "start_time" : 1714390992205,
    "running_time" : 2521097
  }
}

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 282059954,
    "bytes_percent" : 2.3038514,
    "files_total" : 419,
    "files_recovered" : 31,
    "files_percent" : 7.391076,
    "start_time" : 1714390992205,
    "running_time" : 2800792
  }
}

{
  "status" : "SYNCING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 6474917,
    "follower_checkpoint" : 6474917,
    "seq_no" : 6474917
  }
}
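
To put a number on "progressing very gradually", the snapshots above can be turned into an average transfer rate. The following is a minimal sketch, not part of the original report. It assumes that running_time is reported in milliseconds (consistent with start_time being an epoch timestamp in milliseconds) and that the rate stays constant for the rough time-remaining estimate, which it likely does not. The function names are hypothetical.

```python
# (running_time, bytes_recovered) pairs copied from the status outputs above.
# ASSUMPTION: running_time is in milliseconds.
SNAPSHOTS = [
    (209247, 62150110),
    (229621, 62150110),
    (1846339, 166715937),
    (2521097, 166715937),
    (2800792, 282059954),
]
BYTES_TOTAL = 12246723560  # bytes_total from the same outputs


def average_throughput(running_time_ms: int, bytes_recovered: int) -> float:
    """Average bytes per second since the bootstrap started."""
    return bytes_recovered / (running_time_ms / 1000.0)


def eta_seconds(bytes_total: int, bytes_recovered: int, rate: float) -> float:
    """Seconds left at a constant rate (a simplifying assumption)."""
    return (bytes_total - bytes_recovered) / rate


if __name__ == "__main__":
    for ms, recovered in SNAPSHOTS:
        rate = average_throughput(ms, recovered)
        remaining = eta_seconds(BYTES_TOTAL, recovered, rate)
        print(
            f"{ms / 1000:9.0f} s  {rate / 1024:8.1f} KiB/s  "
            f"~{remaining / 3600:6.1f} h left at this rate"
        )
```

Under these assumptions, the last BOOTSTRAPPING snapshot works out to an average on the order of 100 KB/s, which gives a concrete figure to compare against the network bandwidth available between the two Kubernetes clusters.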

 curl -XGET -vkguadmin:admin "http://localhost:9200/_cluster/settings?pretty"
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying 127.0.0.1:9200...
* Connected to localhost (127.0.0.1) port 9200
* Server auth using Basic with user 'admin'
> GET /_cluster/settings?pretty HTTP/1.1
> Host: localhost:9200
> Authorization: Basic YWRtaW46YWRtaW4=
> User-Agent: curl/8.3.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
< content-length: 471
<
{
  "persistent" : {
    "cluster" : {
      "remote" : {
        "my-connection-test" : {
          "mode" : "proxy",
          "proxy_address" : "*.*.*.100:9300"
        }
      }
    },
    "plugins" : {
      "index_state_management" : {
        "template_migration" : {
          "control" : "-1"
        }
      }
    },
    "logger" : {
      "org" : {
        "opensearch" : {
          "transport" : "DEBUG"
        }
      }
    }
  },
  "transient" : { }
}


curl -XGET -vkguadmin:admin "http://localhost:9200/_plugins/_replication/autofollow_stats"

{
  "num_success_start_replication" : 2,
  "num_failed_start_replication" : 0,
  "num_failed_leader_calls" : 0,
  "failed_indices" : [ ],
  "autofollow_stats" : [
    {
      "name" : "replicate-all-rule",
      "pattern" : "*",
      "num_success_start_replication" : 2,
      "num_failed_start_replication" : 0,
      "num_failed_leader_calls" : 0,
      "failed_indices" : [ ],
      "last_execution_time" : 1714468771677
    }
  ]
}

curl -XGET -vkguadmin:admin "http://localhost:9200/_plugins/_replication/follower_stats"
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying 127.0.0.1:9200...
* Connected to localhost (127.0.0.1) port 9200
* Server auth using Basic with user 'admin'
> GET /_plugins/_replication/follower_stats HTTP/1.1
> Host: localhost:9200
> Authorization: Basic YWRtaW46YWRtaW4=
> User-Agent: curl/8.3.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
< content-length: 1844
<
{
"num_syncing_indices" : 4,
"num_bootstrapping_indices" : 0,
"num_paused_indices" : 2,
"num_failed_indices" : 1,
"num_shard_tasks" : 6,
"num_index_tasks" : 4,
"operations_written" : 0,
"operations_read" : 0,
"failed_read_requests" : 60,
"throttled_read_requests" : 0,
"failed_write_requests" : 0,
"throttled_write_requests" : 0,
"follower_checkpoint" : 6475327,
"leader_checkpoint" : 6475327,
"total_write_time_millis" : 0,
"index_stats" : {
  "security-auditlog-2024.04.08" : {
    "operations_written" : 0,
    "operations_read" : 0,
    "failed_read_requests" : 11,
    "throttled_read_requests" : 0,
    "failed_write_requests" : 0,
    "throttled_write_requests" : 0,
    "follower_checkpoint" : 18,
    "leader_checkpoint" : 18,
    "total_write_time_millis" : 0
  },
  "projecttask" : {
    "operations_written" : 0,
    "operations_read" : 0,
    "failed_read_requests" : 8,
    "throttled_read_requests" : 0,
    "failed_write_requests" : 0,
    "throttled_write_requests" : 0,
    "follower_checkpoint" : 6474917,
    "leader_checkpoint" : 6474917,
    "total_write_time_millis" : 0
  },
  "test-01" : {
    "operations_written" : 0,
    "operations_read" : 0,
    "failed_read_requests" : 9,
    "throttled_read_requests" : 0,
    "failed_write_requests" : 0,
    "throttled_write_requests" : 0,
    "follower_checkpoint" : 392,
    "leader_checkpoint" : 392,
    "total_write_time_millis" : 0
  },
  "auto-01" : {
    "operations_written" : 0,
    "operations_read" : 0,
    "failed_read_requests" : 32,
    "throttled_read_requests" : 0,
    "failed_write_requests" : 0,
    "throttled_write_requests" : 0,
    "follower_checkpoint" : 0,
    "leader_checkpoint" : 0,
    "total_write_time_millis" : 0
  }
}
}
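
The follower_stats output above reports 60 failed read requests in total while operations_read is 0 everywhere. As a small sketch (not part of the original report), the per-index counters can be ranked to see which index contributes most. The values are copied from the output above; the function name is hypothetical.

```python
# failed_read_requests per index, copied from the follower_stats output above.
FAILED_READS = {
    "security-auditlog-2024.04.08": 11,
    "projecttask": 8,
    "test-01": 9,
    "auto-01": 32,
}
TOTAL_FAILED_READS = 60  # top-level failed_read_requests in the same output


def rank_by_failed_reads(per_index: dict[str, int]) -> list[tuple[str, int]]:
    """Return (index, count) pairs sorted by count, highest first."""
    return sorted(per_index.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, count in rank_by_failed_reads(FAILED_READS):
        share = count / TOTAL_FAILED_READS
        print(f"{name:32s} {count:3d}  {share:6.1%}")
    # Sanity check: the per-index counters should add up to the total.
    print("sum matches total:", sum(FAILED_READS.values()) == TOTAL_FAILED_READS)
```

The per-index counters add up to the reported total, and auto-01 accounts for more than half of the failures, so the follower logs for that index may be a reasonable place to start looking.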


IanHoang · 2024-04-30T17:07:54Z

@VinDataR Please create an issue in the OpenSearch repository, as this relates to strategies for debugging cross-cluster replication in core OpenSearch. OpenSearch Benchmark is a benchmarking client that runs workloads on clusters and helps users determine the performance of operations such as queries.

VinDataR added the bug and untriaged labels Apr 30, 2024

IanHoang closed this as completed Apr 30, 2024

IanHoang removed the untriaged label Apr 30, 2024
