Sometimes containers get evicted due to out-of-memory issues or other random crashes. Also, running crawls on preemptible nodes is preferable since costs are much lower, but it comes with the risk of nodes being shut down suddenly. There is currently a risk of data loss and data duplication if a container is shut down before the S3 Aggregator has had a chance to write its in-memory contents to S3.
Data loss risk
If a container is shut down before the S3 Aggregator has had a chance to write its in-memory contents to S3, that data is lost.
Data duplication risk
Even if we make sure that a new worker re-visits the site the terminated container was processing, the previous container has already written partial data to S3, potentially resulting in duplicated data.
One way to mitigate this is to have the containers write their write queue to a persistent store (such as Google Cloud Memorystore Redis) and run a separate job that drains the queue and writes to S3 atomically; a rough sketch follows below. Other solutions such as BigQuery could also be investigated, but would likely require more engineering to adopt.
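A minimal sketch of what this could look like, assuming redis-py against a Memorystore instance and boto3 for S3. The queue key, bucket name, and function names are illustrative assumptions, not part of the existing S3 Aggregator:

```python
# Sketch only: assumes a Redis-backed write queue (e.g. Memorystore) and boto3.
# The key name, bucket name, and helpers below are hypothetical.
import json

import boto3
import redis

QUEUE_KEY = "s3-write-queue"        # hypothetical Redis list of pending writes
BUCKET = "example-crawl-results"    # hypothetical destination bucket

r = redis.Redis(host="memorystore-host", port=6379)
s3 = boto3.client("s3")


def enqueue_write(key: str, body: str) -> None:
    """Called by the crawler container instead of writing to S3 directly.
    The entry survives container eviction because it lives in Redis."""
    r.rpush(QUEUE_KEY, json.dumps({"key": key, "body": body}))


def drain_queue() -> None:
    """Run by a separate drain job: peek at the head of the queue and only
    remove the entry after the S3 write has succeeded."""
    while True:
        raw = r.lindex(QUEUE_KEY, 0)   # peek without removing
        if raw is None:
            break
        entry = json.loads(raw)
        s3.put_object(Bucket=BUCKET, Key=entry["key"], Body=entry["body"])
        r.lpop(QUEUE_KEY)              # remove only after a successful write
```

Because the drain job removes an entry only after the S3 write succeeds, a crash between the write and the pop produces a retried (duplicate) write rather than a lost one; using deterministic object keys would make such retries idempotent.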
Apparently we might still be forced to use non-preemptible nodes: per the GKE docs, nodes on preemptible instances may be terminated without any notice (including without running the preStop hook).