Sometimes containers get evicted due to out-of-memory issues or other random crashes. Also, running crawls on preemptible nodes is preferable since costs are much lower, but it comes with the risk of nodes being shut down suddenly. There is currently a risk of data loss and data duplication if a container is shut down before the S3 Aggregator has had a chance to write its in-memory contents to S3.
Data loss risk
If a container is shut down before the S3 Aggregator has had a chance to write its in-memory contents to S3, that data is lost.
Data duplication risk
Even if we make sure that a new worker re-visits the site the terminated container was processing, the previous container has already written partial data to S3, potentially resulting in duplicated data.
One way to mitigate this is to have the containers write their write queue to a persistent store (such as Google Cloud Memorystore Redis) and run a separate job that drains the queue and writes to S3 atomically; a rough sketch follows below. Other solutions such as BigQuery could also be investigated, but would likely require more engineering to adopt.
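A minimal sketch of what this could look like, assuming redis-py against a Memorystore instance and boto3 for S3. The queue key, bucket name, and function names are illustrative assumptions, not part of the existing S3 Aggregator:

```python
# Sketch only: assumes a Redis-backed write queue (e.g. Memorystore) and boto3.
# The key name, bucket name, and helpers below are hypothetical.
import json

import boto3
import redis

QUEUE_KEY = "s3-write-queue"        # hypothetical Redis list of pending writes
BUCKET = "example-crawl-results"    # hypothetical destination bucket

r = redis.Redis(host="memorystore-host", port=6379)
s3 = boto3.client("s3")


def enqueue_write(key: str, body: str) -> None:
    """Called by the crawler container instead of writing to S3 directly.
    The entry survives container eviction because it lives in Redis."""
    r.rpush(QUEUE_KEY, json.dumps({"key": key, "body": body}))


def drain_queue() -> None:
    """Run by a separate drain job: peek at the head of the queue and only
    remove the entry after the S3 write has succeeded."""
    while True:
        raw = r.lindex(QUEUE_KEY, 0)   # peek without removing
        if raw is None:
            break
        entry = json.loads(raw)
        s3.put_object(Bucket=BUCKET, Key=entry["key"], Body=entry["body"])
        r.lpop(QUEUE_KEY)              # remove only after a successful write
```

Because the drain job removes an entry only after the S3 write succeeds, a crash between the write and the pop produces a retried (duplicate) write rather than a lost one; using deterministic object keys would make such retries idempotent.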
Apparently we might still be forced to use non-preemptible nodes: per the GKE docs, nodes on preemptible instances may be terminated without any notice (including without running the preStop hook).