zachmayer/backblaze_analysis

Data Sources

I’m buying a hard drive for backups, and I want to buy a drive that’s not going to fail. I’m going to use data from Backblaze to assess drive reliability. Backblaze did their own analysis of drive failures, but I don’t like their approach for two reasons:

  1. Their “annualized failure rate”, Drive Failures / (Drive Days / 365), assumes that failure rates are constant over time. For example, under this assumption, observing 1 drive for 100 days gives you exactly the same information as observing 100 drives for 1 day. If drives fail at a constant rate over time, this is fine, but I suspect that drives actually fail at a higher rate early in their lives, so their analysis is biased against newer drives.
  2. I want to compute a confidence interval, so I can select a drive where we have enough observations to be very confident in a low failure rate. For example, if a drive model has been observed as a single drive for 1 day with 0 failures, I probably don’t want to buy it, despite its zero percent failure rate. I’d rather buy a drive model that’s been observed across 100 drives for 1000 days with one failure. This blog post has some good details on why confidence intervals are useful for sorting things.
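
To make the first point concrete, here is a minimal sketch of the annualized failure rate formula quoted above. The function name and the example counts are my own illustration, not Backblaze's code.

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """Failures per drive-year: failures / (drive_days / 365)."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    return failures / (drive_days / 365)


# One drive watched for 100 days and 100 drives watched for 1 day both
# contribute 100 drive-days, so the formula cannot tell them apart.
one_drive_100_days = annualized_failure_rate(failures=1, drive_days=1 * 100)
hundred_drives_1_day = annualized_failure_rate(failures=1, drive_days=100 * 1)
print(one_drive_100_days == hundred_drives_1_day)
```

The formula only sees total drive-days, so it has no way to account for how old each drive was when it was observed.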

Results

I chose to order the drives by their expected 5-year survival rate. I calculated a 95% confidence interval on the 5-year survival rate, and I used the lower bound of that interval to sort the drives. Based on this analysis, the wdc wuh721816ale6l4 is the most reliable model in our data, with an estimated 5-year survival rate that is at least 97.46%. (In other words, we are 95% confident that at least 97.46% of wdc wuh721816ale6l4 drives will last at least 5 years.)
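
As a small illustration of why the interval matters for ranking, the sketch below sorts three rows copied from the full results table two ways. The tuple layout and variable names are just for this example.

```python
# (model, surv_5yr_lo, surv_5yr), in percent, copied from the full results.
drives = [
    ("st8000nm000a", 88.20, 98.18),         # 249 drives, 1 failure
    ("wdc wuh721816ale6l4", 97.46, 97.91),  # 26602 drives, 102 failures
    ("st16000nm001g", 94.88, 95.32),        # 34293 drives, 480 failures
]

# Ranking by the point estimate rewards models with very little data.
by_point_estimate = sorted(drives, key=lambda d: d[2], reverse=True)

# Ranking by the lower bound rewards models with a lot of good data.
by_lower_bound = sorted(drives, key=lambda d: d[1], reverse=True)

print("best by point estimate:", by_point_estimate[0][0])
print("best by lower bound:   ", by_lower_bound[0][0])
```

The st8000nm000a has the highest point estimate of the three, but with only 249 drives observed its lower bound is much weaker, so it drops to last place when sorting by the lower bound.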

Here are the top drives from this analysis, by size. (For example, many manufacturers have drives in the 16TB range that are very reliable, but I’m only showing the single best model in this size range.)

| model | capacity_tb | N | drive_days | failures | years_97pct | surv_5yr_lo | surv_5yr | surv_5yr_hi |
|---|---|---|---|---|---|---|---|---|
| wdc wuh721816ale6l4 | 16 | 26602 | 11616742 | 102 | 5.7 | 97.46% | 97.91% | 98.28% |
| wdc wuh721414ale6l4 | 14 | 8603 | 10867094 | 113 | 5.6 | 97.43% | 97.86% | 98.22% |
| wdc hms5c4040ale640 | 04 | 8716 | 18224627 | 253 | 5.4 | 97.30% | 97.61% | 97.89% |
| wdc huh721212ale600 | 12 | 2673 | 4483664 | 61 | 4.8 | 96.80% | 97.50% | 98.05% |
| st6000dx000 | 06 | 1939 | 4330559 | 100 | 3.6 | 95.25% | 96.09% | 96.78% |
| wdc hds5c3030ala630 | 03 | 4664 | 6934573 | 150 | 3.5 | 95.18% | 95.88% | 96.48% |
| wdc huh728080ale600 | 08 | 1218 | 2609201 | 73 | 3.0 | 94.18% | 95.34% | 96.28% |
| wdc hds722020ala330 | 02 | 4774 | 5675646 | 235 | 1.9 | 90.53% | 91.62% | 92.60% |
| wdc wuh722222ale6l4 | 22 | 13244 | 1279877 | 42 | 1.8 | 89.85% | 92.40% | 94.36% |
| st10000nm0086 | 10 | 1304 | 2924650 | 202 | 1.5 | 87.58% | 89.09% | 90.44% |
| st18000nm000j | 18 | 70 | 82370 | 10 | 0.5 | 63.65% | 77.91% | 87.67% |

Data details:

  • model is the drive model.
  • capacity_tb is the size of the drive.
  • N is the number of unique drives in the analysis.
  • drive_days is the total number of days that we’ve observed for drives of this model in the sample.
  • failures is the number of failures observed so far.
  • years_97pct is the estimated time, in years, at which 97% of the drives are still working. In other words, 97% of the drives are expected to last at least this long.
  • surv_5yr_lo is the lower bound of the 95% confidence interval of the 5-year survival rate.
  • surv_5yr is the 5-year survival rate.
  • surv_5yr_hi is the upper bound of the 95% confidence interval of the 5-year survival rate.
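
The "single best model per size" selection used for the table above can be sketched as follows. This is a minimal illustration using the columns just described and a few rows copied from the full results. It assumes "best" means the highest surv_5yr_lo within each capacity_tb, which matches the rows shown but is not taken from the actual analysis code.

```python
# (model, capacity_tb, surv_5yr_lo), copied from the full results table.
rows = [
    ("wdc wuh721816ale6l4", 16, 97.46),
    ("wdc wuh721816ale6l0", 16, 96.02),
    ("st16000nm001g", 16, 94.88),
    ("wdc wuh721414ale6l4", 14, 97.43),
    ("toshiba mg07aca14ta", 14, 94.40),
]


def best_per_capacity(rows):
    """Keep the model with the highest lower bound for each capacity."""
    best = {}
    for model, capacity_tb, surv_lo in rows:
        if capacity_tb not in best or surv_lo > best[capacity_tb][1]:
            best[capacity_tb] = (model, surv_lo)
    return best


for capacity_tb, (model, surv_lo) in sorted(best_per_capacity(rows).items()):
    print(capacity_tb, model, surv_lo)
```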

Technical Details

Survival analysis is a little weird, because you don’t observe the full distribution of the data. This makes some traditional statistics impossible to calculate. For example, until you observe every hard drive in the sample fail, you can’t know the mean time to failure: if one drive hasn’t failed yet and turns out to be an outlier in survival time, it could have a big impact on the mean survival time. You won’t know the true mean until that last drive fails.

Similarly, to find the median survival time, you need to wait for half of the drives in your sample to fail, which can take a decade or more!

Modern hard drives are so reliable that even after 5+ years of observation, we’ve barely observed the distribution of failures! (This is a good thing, but it makes it hard to choose between drives!)
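
To illustrate the problem, here is a minimal pure-Python sketch of the Kaplan–Meier estimator, which handles drives that are still running ("censored" observations). The ten drives below are made-up numbers, and this is not the code used for the analysis.

```python
def kaplan_meier(durations, failed):
    """Return [(time, survival)] at each distinct failure time.

    durations[i] is how long drive i was observed; failed[i] is True if it
    failed at that time and False if it was still running (censored).
    """
    events = sorted(zip(durations, failed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = 0
        removed = 0
        # Group all drives whose observation ends at the same time.
        while i < len(events) and events[i][0] == t:
            deaths += events[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def median_survival(curve):
    """First time the curve reaches 0.5 or lower, or None if it never does."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical data: 10 drives, only 2 failures, the rest still running.
durations = [100, 400, 900, 1500, 2000, 2000, 2000, 2000, 2000, 2000]
failed = [True, False, True, False, False, False, False, False, False, False]

curve = kaplan_meier(durations, failed)
print(curve)
print(median_survival(curve))
```

With only two failures the estimated curve never drops to 50%, so the median survival time is simply not defined yet, which is the situation described above.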

To compare models with different observation periods (e.g. 22TB vs 4TB drives), I fit a Cox proportional hazards model. This enabled me to estimate 5-year survival rates for all of the drives, as well as a confidence interval on each rate. The confidence interval narrows as you observe more drives and as you observe those drives for a longer time.

The Cox model is semi-parametric. It assumes a non-parametric baseline hazard rate that is the same for all drives. It then fits a single parameter for each drive model that acts as a multiplier on that baseline hazard rate. So every drive model has the same “shape” for its survival curve, scaled by a fixed per-model coefficient that makes that “shape” steeper or shallower.
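
In the standard proportional hazards setup, a model's hazard is the baseline hazard times exp(beta), which implies that its survival curve is the baseline survival curve raised to the power exp(beta). The sketch below shows that relationship. The baseline value and the coefficients are hypothetical numbers chosen for illustration, not fitted values from this analysis.

```python
import math


def model_survival(baseline_survival: float, beta: float) -> float:
    """S_model(t) = S0(t) ** exp(beta) under proportional hazards."""
    return baseline_survival ** math.exp(beta)


baseline_5yr = 0.90  # hypothetical baseline 5-year survival
for beta in (-1.0, 0.0, 1.0):  # hypothetical per-model coefficients
    print(beta, round(model_survival(baseline_5yr, beta), 4))
```

A negative coefficient gives a more reliable model than the baseline, a positive one gives a less reliable model, and a coefficient of zero reproduces the baseline exactly.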

Plots

Here is a plot of the survival for each of the best drive models. Each curve ends with the oldest drive we’ve observed (these are called Kaplan–Meier curves):

The “proportional hazards” assumption from the Cox model allows us to extend these curves and estimate survival times at 5 years for all of the drives:

Note that the curves all have the same shape, but each model has a different slope. Compare this plot to the Kaplan–Meier plot above: the proportional hazards assumption works pretty well, but isn’t perfect.

Cost effectiveness

I manually gathered some hard drive prices from eBay and Amazon. I limited this search to drives with >70% expected 5-year survival, as I want to buy drives that are unlikely to fail on me. I can then use the price data to calculate the cost to store 1TB of data for 5 years for each drive. Note that these prices could be wrong, and also note that only one drive may be available at the given price.

| model | price | capacity_tb | surv_5yr_lower | cost_per_tb | cost_per_tb_5yr |
|---|---|---|---|---|---|
| st16000nm001g | 165.0 | 16 | 94.9 | 10.3 | 10.9 |
| st12000nm001g | 126.0 | 12 | 94.3 | 10.5 | 11.1 |
| st12000nm0008 | 122.0 | 12 | 90.0 | 10.2 | 11.3 |
| st14000nm001g | 149.9 | 14 | 91.7 | 10.7 | 11.7 |
| wdc huh721212ale600 | 148.8 | 12 | 96.8 | 12.4 | 12.8 |
| wdc wuh721414ale6l4 | 179.0 | 14 | 97.4 | 12.8 | 13.1 |
| wdc wuh721816ale6l4 | 209.0 | 16 | 97.5 | 13.1 | 13.4 |
| wdc huh728080ale600 | 101.2 | 08 | 94.2 | 12.6 | 13.4 |
| st4000dm000 | 48.0 | 04 | 88.5 | 12.0 | 13.6 |
| toshiba mg08aca16te | 200.0 | 16 | 91.8 | 12.5 | 13.6 |
| wdc hds722020ala330 | 25.1 | 02 | 90.5 | 12.6 | 13.9 |
| st6000dx000 | 85.0 | 06 | 95.3 | 14.2 | 14.9 |
| st8000nm0055 | 109.0 | 08 | 91.5 | 13.6 | 14.9 |
| wdc wuh721816ale6l0 | 229.9 | 16 | 96.0 | 14.4 | 15.0 |
| st10000nm0086 | 133.0 | 10 | 87.6 | 13.3 | 15.2 |
| toshiba mg08aca16tey | 229.9 | 16 | 93.5 | 14.4 | 15.4 |
| st12000nm0007 | 162.0 | 12 | 87.5 | 13.5 | 15.4 |
| st16000nm002j | 225.0 | 16 | 90.9 | 14.1 | 15.5 |
| wdc hus726040ale610 | 50.1 | 04 | 79.2 | 12.5 | 15.8 |
| st8000nm000a | 111.8 | 08 | 88.2 | 14.0 | 15.8 |
| wdc hds5c3030ala630 | 45.7 | 03 | 95.2 | 15.2 | 16.0 |
| st8000dm002 | 119.7 | 08 | 93.4 | 15.0 | 16.0 |
| wdc huh721212ale604 | 199.0 | 12 | 94.3 | 16.6 | 17.6 |
| toshiba dt01aca300 | 45.5 | 03 | 70.2 | 15.2 | 21.6 |
| wdc hds724040ale640 | 69.0 | 04 | 79.2 | 17.2 | 21.8 |
| wdc hds723030ala640 | 59.8 | 03 | 88.6 | 19.9 | 22.5 |
| wdc wuh722222ale6l4 | 450.0 | 22 | 89.8 | 20.5 | 22.8 |
| wdc wd30efrx | 54.0 | 03 | 73.5 | 18.0 | 24.5 |
| wdc wd40efrx | 79.5 | 04 | 77.6 | 19.9 | 25.6 |
| wdc hms5c4040ale640 | 102.0 | 04 | 97.3 | 25.5 | 26.2 |
| wdc wd60efrx | 125.0 | 06 | 77.5 | 20.8 | 26.9 |
| toshiba mg08aca16ta | 415.0 | 16 | 91.9 | 25.9 | 28.2 |
| wdc huh721212aln604 | 418.0 | 12 | 92.1 | 34.8 | 37.8 |
| wdc hds5c4040ale630 | 180.0 | 04 | 95.6 | 45.0 | 47.1 |
| st8000dm005 | 603.8 | 08 | 75.0 | 75.5 | 100.6 |

According to this analysis, the most cost-effective drive is the st16000nm001g, which costs $165 and has a 5-year survival rate of at least 94.9% (the lower bound of its confidence interval). This drive costs $10.87 to store 1 TB for 5 years (this price includes the probability of failure).
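
The numbers in the table are consistent with dividing the cost per TB by the lower bound of the 5-year survival rate. That formula is my inference from the table values, not something confirmed by the analysis code, so treat the sketch below as an assumption. The function name is hypothetical.

```python
def cost_per_tb_5yr(price: float, capacity_tb: float, surv_5yr_lower_pct: float) -> float:
    """Price per TB, inflated by the chance the drive fails within 5 years.

    Assumed formula: (price / capacity_tb) / (surv_5yr_lower_pct / 100).
    """
    if not 0 < surv_5yr_lower_pct <= 100:
        raise ValueError("survival must be a percentage in (0, 100]")
    cost_per_tb = price / capacity_tb
    return cost_per_tb / (surv_5yr_lower_pct / 100)


# st16000nm001g: $165, 16 TB, surv_5yr_lo of 94.88% from the full results.
print(round(cost_per_tb_5yr(165.0, 16, 94.88), 2))

# wdc wd30efrx: $54, 3 TB, surv_5yr_lo of 73.54%.
print(round(cost_per_tb_5yr(54.0, 3, 73.54), 1))
```

Dividing by the survival rate means an unreliable drive looks more expensive than its sticker price suggests, which is why a cheap drive with a low survival rate can rank below a pricier but more reliable one.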

Again, the price data is probably incorrect, but it’s still an interesting analysis.

Full Data

Here are the full results for all drives, excluding drives that are less than 2TB:

| model | capacity_tb | N | drive_days | failures | years_97pct | surv_5yr_lo | surv_5yr | surv_5yr_hi |
|---|---|---|---|---|---|---|---|---|
| wdc wuh721816ale6l4 | 16 | 26602 | 11616742 | 102 | 5.7 | 97.46% | 97.91% | 98.28% |
| wdc wuh721414ale6l4 | 14 | 8603 | 10867094 | 113 | 5.6 | 97.43% | 97.86% | 98.22% |
| wdc hms5c4040ale640 | 04 | 8716 | 18224627 | 253 | 5.4 | 97.30% | 97.61% | 97.89% |
| wdc huh721212ale600 | 12 | 2673 | 4483664 | 61 | 4.8 | 96.80% | 97.50% | 98.05% |
| wdc wuh721816ale6l0 | 16 | 3069 | 2772374 | 37 | 4.0 | 96.02% | 97.10% | 97.89% |
| wdc hds5c4040ale630 | 04 | 2837 | 4790383 | 95 | 3.8 | 95.61% | 96.40% | 97.04% |
| st6000dx000 | 06 | 1939 | 4330559 | 100 | 3.6 | 95.25% | 96.09% | 96.78% |
| wdc hds5c3030ala630 | 03 | 4664 | 6934573 | 150 | 3.5 | 95.18% | 95.88% | 96.48% |
| st16000nm001g | 16 | 34293 | 22614411 | 480 | 3.4 | 94.88% | 95.32% | 95.72% |
| toshiba mg07aca14ta | 14 | 39365 | 51123732 | 1376 | 3.1 | 94.40% | 94.69% | 94.96% |
| st12000nm001g | 12 | 13627 | 16705713 | 434 | 3.1 | 94.31% | 94.82% | 95.27% |
| wdc huh721212ale604 | 12 | 13519 | 15607978 | 392 | 3.1 | 94.28% | 94.81% | 95.30% |
| wdc huh728080ale600 | 08 | 1218 | 2609201 | 73 | 3.0 | 94.18% | 95.34% | 96.28% |
| toshiba mg08aca16tey | 16 | 5347 | 4846164 | 125 | 2.7 | 93.48% | 94.51% | 95.37% |
| st8000dm002 | 08 | 10307 | 27580788 | 1111 | 2.7 | 93.41% | 93.79% | 94.15% |
| wdc huh721212aln604 | 12 | 11422 | 20559877 | 882 | 2.3 | 92.06% | 92.55% | 93.01% |
| toshiba mg08aca16ta | 16 | 39184 | 12456523 | 361 | 2.2 | 91.88% | 92.67% | 93.39% |
| toshiba mg08aca16te | 16 | 6130 | 5737006 | 192 | 2.2 | 91.84% | 92.89% | 93.81% |
| toshiba md04aba400v | 04 | 150 | 378365 | 11 | 2.2 | 91.74% | 95.32% | 97.39% |
| st14000nm001g | 14 | 11177 | 13299198 | 504 | 2.2 | 91.68% | 92.36% | 92.98% |
| st8000nm0055 | 08 | 15680 | 36632508 | 1893 | 2.1 | 91.46% | 91.84% | 92.20% |
| st16000nm002j | 16 | 468 | 259866 | 4 | 2.0 | 90.91% | 96.45% | 98.66% |
| wdc hds722020ala330 | 02 | 4774 | 5675646 | 235 | 1.9 | 90.53% | 91.62% | 92.60% |
| st12000nm0008 | 12 | 20955 | 31032423 | 1615 | 1.8 | 89.96% | 90.42% | 90.87% |
| wdc wuh722222ale6l4 | 22 | 13244 | 1279877 | 42 | 1.8 | 89.85% | 92.40% | 94.36% |
| wdc hds723030ala640 | 03 | 1048 | 1495337 | 73 | 1.6 | 88.62% | 90.84% | 92.65% |
| st4000dm000 | 04 | 37040 | 81347421 | 5770 | 1.6 | 88.51% | 88.83% | 89.14% |
| toshiba mg07aca14tey | 14 | 738 | 692480 | 28 | 1.6 | 88.22% | 91.69% | 94.20% |
| st8000nm000a | 08 | 249 | 128292 | 1 | 1.5 | 88.20% | 98.18% | 99.74% |
| st10000nm0086 | 10 | 1304 | 2924650 | 202 | 1.5 | 87.58% | 89.09% | 90.44% |
| st12000nm0007 | 12 | 38842 | 36947060 | 2173 | 1.5 | 87.48% | 88.00% | 88.51% |
| wdc hds724040ale640 | 04 | 45 | 64934 | 2 | 1.0 | 79.20% | 94.08% | 98.51% |
| wdc hus726040ale610 | 04 | 55 | 69213 | 3 | 1.0 | 79.16% | 92.49% | 97.56% |
| wdc wd40efrx | 04 | 50 | 77099 | 4 | 0.9 | 77.59% | 90.65% | 96.44% |
| wdc wd60efrx | 06 | 499 | 692834 | 72 | 0.9 | 77.55% | 81.68% | 85.20% |
| st8000dm005 | 08 | 27 | 53730 | 3 | 0.8 | 75.01% | 90.77% | 96.99% |
| wdc wd30efrx | 03 | 1335 | 1365902 | 174 | 0.8 | 73.54% | 76.71% | 79.61% |
| toshiba dt01aca300 | 03 | 60 | 78820 | 7 | 0.6 | 70.16% | 84.05% | 92.19% |
| st4000dm005 | 04 | 90 | 95987 | 11 | 0.6 | 69.16% | 81.19% | 89.25% |
| st14000nm0138 | 14 | 1690 | 1951778 | 317 | 0.6 | 68.21% | 70.99% | 73.62% |
| st33000651as | 03 | 351 | 241851 | 31 | 0.6 | 66.53% | 74.90% | 81.75% |
| st18000nm000j | 18 | 70 | 82370 | 10 | 0.5 | 63.65% | 77.91% | 87.67% |
| toshiba hdwf180 | 08 | 69 | 65254 | 9 | 0.4 | 60.88% | 76.61% | 87.33% |
| st12000nm000j | 12 | 482 | 74238 | 7 | 0.4 | 60.47% | 77.96% | 89.10% |
| wdc huh728080ale604 | 08 | 98 | 77317 | 10 | 0.4 | 59.18% | 74.76% | 85.82% |
| wdc wd30ezrx | 03 | 500 | 149891 | 22 | 0.3 | 54.14% | 66.31% | 76.65% |
| st4000dx000 | 04 | 222 | 305172 | 81 | 0.3 | 52.70% | 59.57% | 66.08% |
| wdc wd20efrx | 02 | 167 | 88330 | 15 | 0.3 | 52.05% | 66.77% | 78.81% |
| st10000nm001g | 10 | 29 | 23056 | 2 | 0.3 | 50.56% | 82.44% | 95.57% |
| st32000542as | 02 | 385 | 147014 | 33 | 0.2 | 46.35% | 57.43% | 67.81% |
| wdc hds723030ble640 | 03 | 10 | 14225 | 1 | 0.2 | 45.21% | 87.06% | 98.21% |
| st14000nm0018 | 14 | 80 | 52551 | 12 | 0.2 | 41.31% | 59.29% | 75.08% |
| st4000dm004 | 04 | 20 | 13505 | 1 | 0.1 | 38.97% | 84.33% | 97.84% |
| wdc hus728t8tale6l4 | 08 | 20 | 13248 | 1 | 0.1 | 38.39% | 84.05% | 97.81% |
| st14000nm000j | 14 | 143 | 26784 | 5 | 0.1 | 33.15% | 60.27% | 82.27% |
| st4000dm001 | 04 | 425 | 95868 | 34 | 0.1 | 25.86% | 37.26% | 50.29% |
| seagate barracuda ssd za2000cm10002 | 02 | 4 | 6645 | 1 | 0.1 | 24.80% | 75.70% | 96.71% |
| wdc hds723020bla642 | 02 | 11 | 9985 | 3 | 0.1 | 18.89% | 52.09% | 83.54% |
| st3000dm001 | 03 | 4707 | 2463925 | 1708 | 0.1 | 17.25% | 18.88% | 20.62% |
| st12000nm0117 | 12 | 30 | 13805 | 7 | 0.1 | 14.78% | 36.11% | 64.81% |
| st2000vn000 | 02 | 10 | 4920 | 2 | 0.0 | 6.48% | 37.76% | 84.15% |
| wdc wd30ezrs | 03 | 18 | 5421 | 2 | 0.0 | 5.53% | 35.29% | 83.55% |
| st2000dm001 | 02 | 8 | 2510 | 1 | 0.0 | 1.92% | 33.21% | 92.67% |
| st320005xxxx | 02 | 18 | 6256 | 7 | 0.0 | 0.50% | 4.96% | 35.21% |
| st4000dx002 | 04 | 7 | 2337 | 4 | 0.0 | 0.02% | 1.53% | 49.92% |
| st8000dm004 | 08 | 7 | 2823 | 7 | 0.0 | 0.01% | 0.35% | 19.01% |
| st2000dl001 | 02 | 12 | 1556 | 5 | 0.0 | 0.00% | 0.09% | 29.71% |
| st2000dl003 | 02 | 17 | 1278 | 8 | 0.0 | 0.00% | 0.00% | 1.24% |
| wdc hus726040aln610 | 04 | 19 | 4483 | 0 | 0.0 | 0.00% | 100.00% | 100.00% |
| st16000nm000j | 16 | 62 | 15848 | 0 | 0.0 | 0.00% | 100.00% | 100.00% |
| st12000nm003g | 12 | 5 | 2031 | 0 | 0.0 | 0.00% | 100.00% | 100.00% |
| st16000nm005g | 16 | 26 | 33182 | 0 | 0.0 | 0.00% | 100.00% | 100.00% |
| wdc hds5c3030ble630 | 03 | 1 | 1477 | 0 | 0.0 | 0.00% | 100.00% | 100.00% |
| st6000dm001 | 06 | 15 | 17510 | 0 | 0.0 | 0.00% | 100.00% | 100.00% |
| st6000dm004 | 06 | 3 | 2761 | 0 | 0.0 | 0.00% | 100.00% | 100.00% |
| toshiba hdwe160 | 06 | 10 | 10437 | 0 | 0.0 | 0.00% | 100.00% | 100.00% |
| wdc huh721010ale600 | 10 | 20 | 40104 | 0 | 0.0 | 0.00% | 100.00% | 100.00% |
| wdc hms5c4040ble641 | 04 | 1 | 2070 | 0 | 0.0 | 0.00% | 100.00% | 100.00% |

Note that some drives have a very low sample size, which gives them a very wide confidence interval. More data are needed for these drives to draw conclusions about their survival rates.

Replicating my results

results/drive_dates.csv has the cleaned-up data from Backblaze, with one row per drive: serial number, capacity, model, when it was installed, when it was last observed, and whether it failed.

| serial_number | capacity_tb | model | min_date | max_date | failed |
|---|---|---|---|---|---|
| PL1311LAG2AULH | 4 | wdc hds5c4040ale630 | 2013-04-10 | 2024-06-30 | 0 |
| W300B5H1 | 4 | st4000dm000 | 2013-06-27 | 2024-06-30 | 0 |
| W300CK7H | 4 | st4000dm000 | 2013-06-27 | 2024-06-30 | 0 |
| W300BA76 | 4 | st4000dm000 | 2013-06-28 | 2024-06-30 | 0 |
| Z300GPBY | 4 | st4000dm000 | 2013-07-23 | 2024-06-30 | 0 |
| Z300GYP2 | 4 | st4000dm000 | 2013-07-23 | 2024-06-30 | 0 |
| W300B2WJ | 4 | st4000dm000 | 2013-07-25 | 2024-06-30 | 0 |
| W300B37Q | 4 | st4000dm000 | 2013-07-25 | 2024-06-30 | 0 |
| PK1331PAJ2V6WS | 4 | wdc hds724040ale640 | 2013-10-15 | 2024-06-30 | 0 |
| W300460J | 4 | st4000dm000 | 2013-10-15 | 2024-06-30 | 0 |
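
As a rough sketch of how this file can be turned into the per-model counts used in the results tables, the snippet below reads a few of the rows above and totals N, drive_days, and failures by model. It assumes the file is comma-separated with the column names shown, and that drive days are counted as max_date minus min_date. Both are assumptions on my part; the actual analysis may count days differently.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# First three sample rows from above, written as comma-separated text.
SAMPLE = """serial_number,capacity_tb,model,min_date,max_date,failed
PL1311LAG2AULH,4,wdc hds5c4040ale630,2013-04-10,2024-06-30,0
W300B5H1,4,st4000dm000,2013-06-27,2024-06-30,0
W300CK7H,4,st4000dm000,2013-06-27,2024-06-30,0
"""


def summarize(csv_text):
    """Total drives, observed days, and failures for each model."""
    totals = defaultdict(lambda: {"N": 0, "drive_days": 0, "failures": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        first_seen = date.fromisoformat(row["min_date"])
        last_seen = date.fromisoformat(row["max_date"])
        model_totals = totals[row["model"]]
        model_totals["N"] += 1
        model_totals["drive_days"] += (last_seen - first_seen).days
        model_totals["failures"] += int(row["failed"])
    return dict(totals)


for model, stats in summarize(SAMPLE).items():
    print(model, stats)
```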

I use a Makefile to automate the analysis. Run make help for more info, or just run make all to download the data, unzip and combine it, run the survival analysis and generate the README.md. The download is 50+ GB, so it takes a while, but you only need to do it once.

An interesting note about this data: it’s 55GB uncompressed, and contains a whole bunch of irrelevant information. It was very interesting to me that I could compress a 55GB dataset to 22MB, while still keeping all of the relevant information for modeling. (In other words, this dataset was thousands of times larger than it needed to be!) I think this is another example of how good data structures are essential for data science.
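
The size reduction comes from keeping one row per drive instead of one row per drive per day. Here is a minimal sketch of that idea, assuming each daily record carries a date, serial number, model, and failure flag. The serial numbers and records below are made up, and this is not the actual Makefile pipeline.

```python
def collapse_daily_rows(daily_rows):
    """Collapse (date, serial_number, model, failure) rows to one per drive."""
    drives = {}
    for day, serial, model, failure in daily_rows:
        drive = drives.get(serial)
        if drive is None:
            drives[serial] = {
                "model": model,
                "min_date": day,
                "max_date": day,
                "failed": int(failure),
            }
        else:
            # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
            drive["min_date"] = min(drive["min_date"], day)
            drive["max_date"] = max(drive["max_date"], day)
            drive["failed"] = max(drive["failed"], int(failure))
    return drives


# Hypothetical daily records for two drives.
daily = [
    ("2024-01-01", "SERIAL_A", "st4000dm000", 0),
    ("2024-01-01", "SERIAL_B", "st4000dm000", 0),
    ("2024-01-02", "SERIAL_A", "st4000dm000", 0),
    ("2024-01-02", "SERIAL_B", "st4000dm000", 1),
    ("2024-01-03", "SERIAL_A", "st4000dm000", 0),
]
for serial, drive in collapse_daily_rows(daily).items():
    print(serial, drive)
```

A drive observed for ten years produces thousands of daily rows but still collapses to a single row, which is all a survival model needs.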

Erratum

I’m probably way over-thinking this, but it was fun to analyze the data.

There are some drives in this data I plan to avoid. For example, the st3000dm001 has a 5-year survival rate of 18.9%. I’d be a little nervous to buy this drive. Backblaze has a good analysis of issues with 3TB drives on their blog.

I nerd sniped myself
