[BUG] Cross cluster replication is too slow between Opensearch clusters deployed on two separate kubernetes clusters. #527

Closed
VinDataR opened this issue Apr 30, 2024 · 1 comment
Labels
bug Something isn't working

VinDataR commented Apr 30, 2024

Environment information:
We have two OpenSearch clusters deployed on two different Kubernetes clusters in Azure.

We have enabled cross-cluster replication between these two OpenSearch clusters.

Whenever we push documents into indices on the leader cluster, they take a long time to replicate.
We have tried this with small indices (kilobytes) as well as larger indices (gigabytes).

In both cases, we observe a definite lag between pushing the documents on the leader site and the same documents becoming available on the follower site.

We had pushed an index 'projecttask' on the leader site. These are stats from leader site.

health status index       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   projecttask atOBXJVWT_6wp7cwV_Y0Mw   3   2    4657988      1786411     51.3gb         17.1gb

While we monitored the follower site, the replication status showed BOOTSTRAPPING for a long time. The status now shows SYNCING, but replication is still taking a long time.

Below is the output of the replication status API, captured several times yesterday to monitor progress. As bytes_percent and files_percent show, the bootstrap progresses very slowly. We are seeking help on how to troubleshoot this further and how to speed up the replication.

curl -XGET -k -u ':' 'http://localhost:9200/_plugins/_replication/projecttask/_status?pretty'


{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 62150110,
    "bytes_percent" : 0.50800186,
    "files_total" : 419,
    "files_recovered" : 12,
    "files_percent" : 2.8643699,
    "start_time" : 1714390992205,
    "running_time" : 209247
  }
}

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 62150110,
    "bytes_percent" : 0.50800186,
    "files_total" : 419,
    "files_recovered" : 12,
    "files_percent" : 2.8643699,
    "start_time" : 1714390992205,
    "running_time" : 229621
  }
}

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 166715937,
    "bytes_percent" : 1.361734,
    "files_total" : 419,
    "files_recovered" : 29,
    "files_percent" : 6.9193783,
    "start_time" : 1714390992205,
    "running_time" : 1846339
  }
}

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 166715937,
    "bytes_percent" : 1.361734,
    "files_total" : 419,
    "files_recovered" : 29,
    "files_percent" : 6.9193783,
    "start_time" : 1714390992205,
    "running_time" : 2521097
  }
}

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 282059954,
    "bytes_percent" : 2.3038514,
    "files_total" : 419,
    "files_recovered" : 31,
    "files_percent" : 7.391076,
    "start_time" : 1714390992205,
    "running_time" : 2800792
  }
}

{
  "status" : "SYNCING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 6474917,
    "follower_checkpoint" : 6474917,
    "seq_no" : 6474917
  }
} 
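
To put a number on "progressing very gradually", the bootstrap_details samples above can be turned into an average transfer rate. The following is only a rough sketch: it assumes running_time is reported in milliseconds (start_time looks like an epoch-millisecond timestamp, but the unit is not stated in the output), and the function names are made up for illustration.

```python
# (running_time, bytes_recovered) pairs copied from the status outputs above.
# ASSUMPTION: running_time is in milliseconds; the output does not state the unit.
BYTES_TOTAL = 12246723560
SAMPLES = [
    (209247, 62150110),
    (229621, 62150110),
    (1846339, 166715937),
    (2521097, 166715937),
    (2800792, 282059954),
]


def average_rate(samples):
    """Average bytes per second between the first and the last sample."""
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    return (b1 - b0) / ((t1 - t0) / 1000.0)


def eta_hours(samples, bytes_total):
    """Hours left at the average rate, or None if no progress was observed."""
    rate = average_rate(samples)
    if rate <= 0:
        return None
    remaining = bytes_total - samples[-1][1]
    return remaining / rate / 3600.0


rate = average_rate(SAMPLES)
print(f"average rate: {rate / 1024:.1f} KiB/s")
print(f"time remaining at that rate: {eta_hours(SAMPLES, BYTES_TOTAL):.1f} h")
```

Under that assumption the samples work out to well under 100 KiB/s, which would point to a transfer bottleneck rather than to the size of the index itself.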
curl -XGET -vkguadmin:admin "http://localhost:9200/_cluster/settings?pretty"
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying 127.0.0.1:9200...
* Connected to localhost (127.0.0.1) port 9200
* Server auth using Basic with user 'admin'
> GET /_cluster/settings?pretty HTTP/1.1
> Host: localhost:9200
> Authorization: Basic YWRtaW46YWRtaW4=
> User-Agent: curl/8.3.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
< content-length: 471
<
{
  "persistent" : {
    "cluster" : {
      "remote" : {
        "my-connection-test" : {
          "mode" : "proxy",
          "proxy_address" : "*.*.*.100:9300"
        }
      }
    },
    "plugins" : {
      "index_state_management" : {
        "template_migration" : {
          "control" : "-1"
        }
      }
    },
    "logger" : {
      "org" : {
        "opensearch" : {
          "transport" : "DEBUG"
        }
      }
    }
  },
  "transient" : { }
}


curl -XGET -vkguadmin:admin "http://localhost:9200/_plugins/_replication/autofollow_stats"

{
  "num_success_start_replication" : 2,
  "num_failed_start_replication" : 0,
  "num_failed_leader_calls" : 0,
  "failed_indices" : [ ],
  "autofollow_stats" : [
    {
      "name" : "replicate-all-rule",
      "pattern" : "*",
      "num_success_start_replication" : 2,
      "num_failed_start_replication" : 0,
      "num_failed_leader_calls" : 0,
      "failed_indices" : [ ],
      "last_execution_time" : 1714468771677
    }
  ]
}
curl -XGET -vkguadmin:admin "http://localhost:9200/_plugins/_replication/follower_stats"
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying 127.0.0.1:9200...
* Connected to localhost (127.0.0.1) port 9200
* Server auth using Basic with user 'admin'
> GET /_plugins/_replication/follower_stats HTTP/1.1
> Host: localhost:9200
> Authorization: Basic YWRtaW46YWRtaW4=
> User-Agent: curl/8.3.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
< content-length: 1844
<
{
"num_syncing_indices" : 4,
"num_bootstrapping_indices" : 0,
"num_paused_indices" : 2,
"num_failed_indices" : 1,
"num_shard_tasks" : 6,
"num_index_tasks" : 4,
"operations_written" : 0,
"operations_read" : 0,
"failed_read_requests" : 60,
"throttled_read_requests" : 0,
"failed_write_requests" : 0,
"throttled_write_requests" : 0,
"follower_checkpoint" : 6475327,
"leader_checkpoint" : 6475327,
"total_write_time_millis" : 0,
"index_stats" : {
  "security-auditlog-2024.04.08" : {
    "operations_written" : 0,
    "operations_read" : 0,
    "failed_read_requests" : 11,
    "throttled_read_requests" : 0,
    "failed_write_requests" : 0,
    "throttled_write_requests" : 0,
    "follower_checkpoint" : 18,
    "leader_checkpoint" : 18,
    "total_write_time_millis" : 0
  },
  "projecttask" : {
    "operations_written" : 0,
    "operations_read" : 0,
    "failed_read_requests" : 8,
    "throttled_read_requests" : 0,
    "failed_write_requests" : 0,
    "throttled_write_requests" : 0,
    "follower_checkpoint" : 6474917,
    "leader_checkpoint" : 6474917,
    "total_write_time_millis" : 0
  },
  "test-01" : {
    "operations_written" : 0,
    "operations_read" : 0,
    "failed_read_requests" : 9,
    "throttled_read_requests" : 0,
    "failed_write_requests" : 0,
    "throttled_write_requests" : 0,
    "follower_checkpoint" : 392,
    "leader_checkpoint" : 392,
    "total_write_time_millis" : 0
  },
  "auto-01" : {
    "operations_written" : 0,
    "operations_read" : 0,
    "failed_read_requests" : 32,
    "throttled_read_requests" : 0,
    "failed_write_requests" : 0,
    "throttled_write_requests" : 0,
    "follower_checkpoint" : 0,
    "leader_checkpoint" : 0,
    "total_write_time_millis" : 0
  }
}
}
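
A small helper can make the follower_stats response easier to read by listing, per index, the checkpoint gap and the number of failed read requests. This is a sketch only: the dictionary keeps just the fields it needs from the response above, and summarize is a hypothetical name, not part of the replication plugin.

```python
# Fields copied from the follower_stats response above; other fields are omitted.
STATS = {
    "failed_read_requests": 60,
    "index_stats": {
        "security-auditlog-2024.04.08": {
            "failed_read_requests": 11,
            "follower_checkpoint": 18,
            "leader_checkpoint": 18,
        },
        "projecttask": {
            "failed_read_requests": 8,
            "follower_checkpoint": 6474917,
            "leader_checkpoint": 6474917,
        },
        "test-01": {
            "failed_read_requests": 9,
            "follower_checkpoint": 392,
            "leader_checkpoint": 392,
        },
        "auto-01": {
            "failed_read_requests": 32,
            "follower_checkpoint": 0,
            "leader_checkpoint": 0,
        },
    },
}


def summarize(stats):
    """Return (index, checkpoint_lag, failed_reads) rows, most failures first."""
    rows = []
    for name, s in stats["index_stats"].items():
        lag = s["leader_checkpoint"] - s["follower_checkpoint"]
        rows.append((name, lag, s["failed_read_requests"]))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


for name, lag, failed in summarize(STATS):
    print(f"{name:32} lag={lag:<4} failed_reads={failed}")
```

In this snapshot every index reports a checkpoint gap of 0, so the follower appears to have caught up at the time of the call, while the failed read requests are spread across all four indices.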
IanHoang (Collaborator) commented

@VinDataR Please create an issue in the OpenSearch repository, since this is about how to debug cross-cluster replication in core OpenSearch. OpenSearch Benchmark is a benchmarking client that runs workloads against clusters and helps users measure the performance of operations such as queries.
