Skip to content

An AI/ML solution that provides a probability that a hard drive will fail within some pre-defined time period.

License

Notifications You must be signed in to change notification settings

chauhankaranraj/ceph_drive_failure

 
 

Repository files navigation

Ceph Hard Drive Failure Prediction

A comprehensive machine learning project for predicting whether a hard disk will fail within a given time interval

More than 2500 petabytes of data is generated every day by sources such as social media, IoT devices, etc., and every bit of it is valuable. That’s why modern storage systems need to be reliable, scalable, and efficient. To ensure that data is not lost or corrupted, many large scale distributed storage systems, such as Ceph, use use erasure-coded redundancy or mirroring. Although this provides reasonable fault tolerance, it can make it more difficult and expensive to scale up the storage cluster.

This project seeks to mitigate this problem using machine learning. Specifically, the goal of this project is to train a model that can predict if a given disk will fail within a predefined future time window. These predictions can then be used by Ceph (or other similar systems) to determine when to add or remove data replicas. In this way, the fault tolerance can be improved by up to an order of magnitude, since the probability of data loss is generally related to the probability of multiple, concurrent disk failures.

In addition to creating models, we also aim to catalyze community involvement in this domain by providing Jupyter notebooks to easily get started with and explore some publicly available datasets such as Backblaze Dataset and Ceph Telemetry. Ultimately, we want to provide a platform where data scientists and subject matter experts can collaborate and contribute to this ubiquitous problem of predicting disk failures.

Contact

This project is maintained by the AIOps teams in the AI Center of Excellence within the Office of the CTO at Red Hat. For more information, reach out to us at [email protected].

About

An AI/ML solution that provides a probability that a hard drive will fail within some pre-defined time period.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • HTML 89.3%
  • Jupyter Notebook 10.7%