Research on predicting hard disk failures with machine learning



Hard disk drives (HDDs) are well suited to long-term storage because they provide large capacity at low cost, but they are vulnerable to shock and heat, and because they contain many precision parts, physical failures are a real possibility. Backblaze, an online storage service provider, has written about a research paper on technology that uses machine learning to predict the likelihood of future failures from the state of a hard disk.

Interpretable predictive maintenance for hard drives - ScienceDirect

https://www.sciencedirect.com/science/article/pii/S2666827021000219

Using Machine Learning to Predict Hard Drive Failures
https://www.backblaze.com/blog/using-machine-learning-to-predict-hard-drive-failures/


Every day, Backblaze collects data such as HDD model numbers, serial numbers, and SMART values from its data centers around the world, and it has recorded more than 266 million entries in total since April 2013. As of September 30, 2021, data was reportedly being sent to Backblaze from 191,000 HDDs.

SMART, the self-diagnosis function built into HDDs, records values such as data transfer speed, power-on time, drive temperature, seek error frequency, and the number of times the spindle motor has started and stopped.



Attempts to predict HDD failures from this SMART data have been made since the 1990s. For example, Backblaze published studies in 2014 and 2016, and Google published research in 2007. This work reported that, among the SMART attributes, '05: reallocated sector count', 'BB: reported uncorrectable errors', 'BC: command timeout', 'C5: current pending sector count', and 'C6: uncorrectable sector count' correlate with HDD failure, and it examined each attribute with univariate analysis.
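The attribute IDs above are written in hexadecimal, while tools and datasets often list the same attributes by decimal number. The following is a minimal sketch of that conversion. The 'smart_<number>_raw' column pattern and the raw_column helper are assumptions made for illustration, not something stated in the article.

```python
# SMART attribute IDs as written in the article (hexadecimal) with their names.
SMART_ATTRS = {
    "05": "reallocated sector count",
    "BB": "reported uncorrectable errors",
    "BC": "command timeout",
    "C5": "current pending sector count",
    "C6": "uncorrectable sector count",
}


def raw_column(hex_id: str) -> str:
    """Turn a hex ID such as 'BB' into a column name such as 'smart_187_raw'.

    The 'smart_<decimal>_raw' naming is an assumed dataset layout.
    """
    return f"smart_{int(hex_id, 16)}_raw"


for hex_id, name in SMART_ATTRS.items():
    print(hex_id, raw_column(hex_id), name)
```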

The paper that Backblaze highlighted this time was published by a research team at Interpretable AI, an AI company. The team analyzed SMART information collected daily from the first quarter of 2017 (January to March) to the first quarter of 2020 from more than 35,000 Seagate helium-filled 'ST12000NM0007' HDDs. They calculated the remaining useful life of each HDD, combined it with the SMART data, and had the AI build survival trees showing how the SMART attributes affect remaining life, which were then used to predict failures.
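How the remaining useful life might be attached to each daily record can be sketched as follows. This is only an illustration under simple assumptions: each drive has a last recorded date and a flag saying whether it failed on that date, and drives that never failed are treated as censored. The function name and the sample dates are hypothetical and do not come from the paper.

```python
from datetime import date


def remaining_useful_life(observed_on: date, last_seen: date, failed: bool):
    """Return (days, event_observed) for one daily SMART record.

    days: days from this record to the drive's last record.
    event_observed: True if the drive failed on its last record. If False,
    the drive was still working, so its true remaining life is at least
    `days` (a censored observation).
    """
    return (last_seen - observed_on).days, failed


# Hypothetical drive that failed on 2019-03-01, observed on 2019-01-01.
print(remaining_useful_life(date(2019, 1, 1), date(2019, 3, 1), True))
```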

The survival tree for long-term, yearly forecasts works as follows. Node 1 at the top of the tree checks '05: reallocated sector count'. If the value is less than 1.5, the drive proceeds to node 2, which checks '03: spin-up time'; if the value is 1.5 or more, it proceeds to node 15, which checks 'C5: current pending sector count'. The prediction is made by repeating this check-and-branch step down the tree.
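The branching described above can be sketched as a small tree walk. Only the rule for node 1 is taken from the article; the thresholds of the deeper nodes are not given there, so the sketch stops at nodes 2 and 15. The TREE layout and the descend function are hypothetical names used for illustration.

```python
# Each entry: node -> (attribute, threshold, node_if_below, node_if_at_or_above).
# Only node 1 is described in the article, so only node 1 has a rule here.
TREE = {
    1: ("05", 1.5, 2, 15),
}


def descend(tree, smart, node=1):
    """Follow split rules from `node` and return the list of visited nodes.

    The walk stops at the first node that has no rule in `tree`.
    """
    path = [node]
    while node in tree:
        attr, threshold, below, at_or_above = tree[node]
        node = below if smart[attr] < threshold else at_or_above
        path.append(node)
    return path


print(descend(TREE, {"05": 0}))  # [1, 2]
print(descend(TREE, {"05": 8}))  # [1, 15]
```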



For example, node 18 in the bottom layer predicts that at least half of the HDDs that reach it will not fail within two years. HDDs that reach node 11, on the other hand, are predicted to fail within 50 days.
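A statement such as "at least half will not fail within two years" is a statement about median survival time. The article does not say how the values at each node are computed, so the following uses the Kaplan-Meier estimator, a standard technique swapped in here purely for illustration, on hypothetical data.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct failure time.

    durations: days each drive was observed.
    events: True if the drive failed at that time, False if censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        removed = 0
        # Group all drives that share the same observation time.
        while i < len(data) and data[i][0] == t:
            failures += int(data[i][1])
            removed += 1
            i += 1
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def median_survival(curve):
    """Return the first time at which S(t) <= 0.5, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical node: failures on days 10 and 20, censored on days 15 and 30.
curve = kaplan_meier([10, 15, 20, 30], [True, False, True, False])
print(curve, median_survival(curve))
```

When median_survival returns None, fewer than half of the drives failed during the observed period, which is the situation the node 18 prediction describes.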

In the survival tree for short-term predictions over a 90-day range, HDDs that branch to nodes 21 and 24 at the bottom are predicted to fail almost certainly within 90 days. HDDs that branch to nodes 12 and 15, on the other hand, are unlikely to fail within 90 days.



The paper also presents a survival tree for ultra-short-term forecasts over a 30-day range.



For the long-term forecasts, the research team started from the three years of data from 2017 to 2020, then limited it to the one year from 2019 to 2020, which reduced the number of observations to 557,936. They then randomly resampled observations from the first dataset to train the AI model and used the rest for testing.
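The random split into training and test observations can be sketched as follows. The article does not state how many observations were drawn for training, so the training size used here is a hypothetical value, and split_observations is an illustrative helper rather than the team's actual procedure.

```python
import random


def split_observations(n_observations, n_train, seed=0):
    """Randomly pick n_train observation indices for training.

    Returns (train_indices, test_indices); the test set is everything
    that was not drawn for training.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train = set(rng.sample(range(n_observations), n_train))
    test = [i for i in range(n_observations) if i not in train]
    return sorted(train), test


# 557,936 observations as in the article; 50,000 is a hypothetical training size.
train_idx, test_idx = split_observations(557_936, 50_000)
print(len(train_idx), len(test_idx))
```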

Backblaze commented: 'You can predict a drive failure, but it's clear that the prediction is not perfect. At Backblaze, though, it doesn't need to be. In our environment, there are numerous backup strategies if a drive fails. If you rely on a single HDD or SSD for your digital life, forget about failure prediction and instead back up your data on the assumption that a failure will occur.'

in Hardware, Posted by log1i_yk