This perf improvement exercise primarily began with a buggy release that caused users without block-sync enabled to start resyncing (the issue was patched later). The impact on the DB and app servers was completely unsustainable, and it made us realize that the same thing would happen when we start releasing block-sync for real.
We wrote a quick script that simulates what would happen if all users in India started block-level syncing, and ran it on the perf envs, which have real production data.
The following is the series of improvements we made by looking at traces and finding bottlenecks.
Query optimizations
We ran the script for 100 users first (100 coroutines across 6-8 threads), and the bottlenecks were completely DB-bound (datadog ref):
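As a very rough sketch of the shape of such a stress script (entirely illustrative: the real script used coroutines, and the host, endpoint, and auth here are made up):

```ruby
require "net/http"

# Fire sync requests for 100 users concurrently, spread across 8 threads.
PERF_HOST = "perf.example.org" # made-up host
TOKENS = File.readlines("tokens.txt", chomp: true).first(100)

threads = TOKENS.each_slice((TOKENS.size / 8.0).ceil).map do |batch|
  Thread.new do
    batch.each do |token|
      uri = URI("https://#{PERF_HOST}/api/v4/blood_pressures/sync")
      request = Net::HTTP::Get.new(uri, "Authorization" => "Bearer #{token}")
      Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
    end
  end
end
threads.each(&:join)
```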
There were mainly two slow queries, which are very similar:
```sql
# BPs belonging to the current facility's registered patients
SELECT ? FROM blood_pressures
WHERE blood_pressures.patient_id IN (<patients subquery>)
  AND blood_pressures.patient_id IN (<current facility patients>)
  AND (blood_pressures.updated_at >= ?)
ORDER BY blood_pressures.updated_at ASC LIMIT ?

# BPs belonging to patients other than the current facility's
SELECT ? FROM blood_pressures
WHERE blood_pressures.patient_id IN (<patients subquery>)
  AND blood_pressures.patient_id NOT IN (<current facility patients>)
  AND (blood_pressures.updated_at >= ?)
ORDER BY blood_pressures.updated_at ASC LIMIT ?

# The patients subquery is syncable_patients for a region
SELECT patients.id FROM (
  (SELECT patients.* FROM patients WHERE patients.registration_facility_id IN (?))
  UNION
  (SELECT patients.* FROM patients WHERE patients.assigned_facility_id IN (?))
  UNION
  (SELECT patients.* FROM patients
     INNER JOIN appointments
        ON appointments.deleted_at IS ?
       AND appointments.patient_id = patients.id
   WHERE appointments.facility_id IN (?))
) patients
```
Index on (patient_id, updated_at) instead of just updated_at
On further investigation, we found the syncable_patients subquery on its own was decent, whereas simple BP queries were far too slow.
SELECT"blood_pressures".*FROM"blood_pressures"WHERE"blood_pressures"."patient_id"IN (<1k ids>)
AND (blood_pressures.updated_at>=2018-12-3118:30:00 BC') ORDER BY "blood_pressures"."updated_at" ASC LIMIT 1000
This took ~2-3s in a regular FG. We probed some initial suspects:
- Too many patient IDs in the IN clause? The query was slow for a handful of patients as well.
- Missing indexes on updated_at and patient_id? They were present.
The query EXPLAIN suggested that Postgres was not using the patient_id index at all and relied solely on the updated_at index. When we tried the query without the ORDER BY updated_at clause, it was drastically faster (down to ~100ms), and this time it used the patient_id index, which was surprising, since that index was available all along.
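ActiveRecord's built-in .explain is a quick way to reproduce this kind of observation from a console (a minimal sketch; the relation mirrors the query above, and patient_ids is any sample of ids):

```ruby
# Prints the Postgres query plan, showing which index the planner picks.
puts BloodPressure
  .where(patient_id: patient_ids)
  .where("updated_at >= ?", Time.at(0)) # the "beginning of time" resync case
  .order(:updated_at)
  .limit(1000)
  .explain
```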
Separately, removing the updated_at index (and keeping the patient_id index) also sped up the query; the query planner now used the patient_id index (see query explain). We don't have a definitive answer for why the query planner chooses one index over the other, but we have a guess for why the updated_at index was slower:
- When we do a re-sync, we search through data from the beginning of time, which means filtering by updated_at doesn't narrow down any records. Narrowing records by patient_id first is more efficient.
- When we do a regular sync, we search through only the latest records (BPs added since the last sync, for example). An updated_at index does help in this case over having just a patient_id index.
To cover both scenarios, we've added a compound index on patient_id and updated_at (#1900). The updated_at index is removed in #1901.
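As an illustrative sketch of the change (see #1900 for the actual migration; the Rails version and options here are assumed):

```ruby
class AddPatientIdUpdatedAtIndexToBloodPressures < ActiveRecord::Migration[5.2]
  disable_ddl_transaction! # needed for algorithm: :concurrently

  def change
    # One compound index serves both cases: it narrows by patient_id first,
    # and rows within each patient are already ordered by updated_at.
    add_index :blood_pressures, [:patient_id, :updated_at], algorithm: :concurrently
  end
end
```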
This change had the most impact on sync health overall. P99 latencies in the stress test went from consistently 6s+ to ~3s. It also brought marked (read: insane) improvements to DB CPU.
Splitting current_facility_records and other_facility_records
After fixing the index we found that the other_facility_records query was still slow (datadog ref).
This was easy to pinpoint to the NOT query. One would assume this query to:
- fetch region_records,
- apply the not condition on them,
- sort by updated_at and limit.
However, Postgres has no way of figuring out that the not query is a subset of region_records. The not is actually applied on the whole table, and Postgres combines the two at the end (EXPLAIN). This can be fixed with some query refactoring: we can tweak which patients to fetch the records for in Ruby, like so:
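A rough sketch of the idea (method and scope names are illustrative, not the exact app code):

```ruby
# Scope the exclusion to the region's syncable patients in Ruby, so Postgres
# only ever sees IN/NOT IN clauses over the region's patient set instead of
# a NOT IN applied to the whole table.
def other_facility_records
  other_patients =
    region.syncable_patients.where.not(id: current_facility.registered_patients)

  BloodPressure
    .where(patient_id: other_patients)
    .where("updated_at >= ?", processed_since)
    .order(:updated_at)
    .limit(limit)
end
```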
Similarly, since current_facility_records also filters by registered_patients, it can be made more performant. Fixed in #1897.
OR instead of UNION
Doing .or(assigned_patients) instead of .union(assigned_patients) in syncable_patients makes the query slightly faster, as expected.
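A minimal sketch of the shape of that change (scope names like appointment_patients are assumed):

```ruby
# registered_patients and assigned_patients are plain WHEREs on the patients
# table, so ActiveRecord can merge them into a single WHERE ... OR ... clause;
# the appointments-based relation joins another table, so it stays a UNION.
def syncable_patients
  registered_patients.or(assigned_patients).union(appointment_patients)
end
```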
Controller Optimizations
ActiveRecord instantiation
We noticed a lot of time being taken by activerecord.instantiation of Patient records. This was happening in the freshly added where clauses. An easy fix, and it shaved off a handsome 15% in response time:
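A hypothetical before/after of the pattern (the actual fix may differ in the details):

```ruby
# Before: materializing the relation builds a full Patient object per row,
# only to immediately reduce them to ids for the where clause.
patient_ids = region.syncable_patients.map(&:id)

# After: pluck skips ActiveRecord instantiation and returns bare ids.
patient_ids = region.syncable_patients.pluck(:id)

BloodPressure.where(patient_id: patient_ids)
```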
OJ for JSON dumping
Since our JSON payloads are quite large (~100kB), we expected a faster JSON dumping replacement to save us some time. We found Oj to be a good fit. This made JSON dumping 2.5x faster and cut roughly 10% of CPU load during stress tests.
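A minimal sketch of wiring Oj in (this is standard Oj API):

```ruby
require "oj"

# Route Rails' JSON serialization (to_json, render json:) through Oj.
Oj.optimize_rails

# Or dump directly; :compat mode matches stdlib JSON output.
json = Oj.dump({ "blood_pressures" => payload }, mode: :compat)
```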
Facilities controller
The facilities controller was firing N+1 queries to get each facility's block region id (8000 queries in total 😬). This warranted adding a with_block_region_id scope to fetch the block id in an includes. #1896
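The real scope is in #1896; as a hypothetical sketch of the pattern (the association is assumed, and a joins/select stands in for the includes mentioned above):

```ruby
class Facility < ApplicationRecord
  belongs_to :block_region, class_name: "Region" # assumed association

  # Fetch the block region id alongside the facilities in a single query.
  scope :with_block_region_id, -> {
    joins(:block_region).select("facilities.*", "regions.id AS block_region_id")
  }
end

# One query instead of 8000:
Facility.with_block_region_id.each { |facility| facility.block_region_id }
```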
Note on datadog
Datadog.tracer.trace is your friend. This helped a ton in breaking apart the controller. Example usage:
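A minimal sketch with the ddtrace 0.x API (span and tag names are illustrative):

```ruby
# Wrap a suspicious section of the controller in a custom span so it shows
# up as its own block in the trace's flame graph.
Datadog.tracer.trace("sync.other_facility_records") do |span|
  span.set_tag("facility", current_facility.id)
  @other_facility_records = other_facility_records
end
```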