What factors matter most when estimating SSD reliability? Findings from a study of SSDs used in Google's data centers



As SSDs grow in capacity and fall in price, data centers are increasingly deploying them in place of HDDs. A study analyzing data from a large number of SSDs operated in Google's data centers addresses several questions: which factors indicate SSD reliability, whether inexpensive consumer SSDs differ from expensive enterprise SSDs, and whether SSDs are safer than HDDs.

Flash Reliability in Production: The Expected and the Unexpected (PDF file)
http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/23105-fast16-papers-schroeder.pdf

High-end SLC SSDs No More Reliable than MLC SSDs: Google Study | techPowerUp
https://www.techpowerup.com/220432/high-end-slc-ssds-no-more-reliable-than-mlc-ssds-google-study.html

SSD reliability in the real world: Google's experience | ZDNet
http://www.zdnet.com/article/ssd-reliability-in-the-real-world-googles-experience/

Professor Bianca Schroeder of the University of Toronto examined the factors that determine SSD reliability by analyzing data collected from SSDs used in Google's data centers over more than six years, covering millions of drive days. The study covers three types of drive: SLC, which writes 1 bit per cell and is considered highly reliable but is expensive and aimed mainly at enterprises; MLC, which writes 2 or more bits per cell and is inexpensive; and enterprise MLC (eMLC), which offers more rewrite cycles while staying close to MLC's low price.

The study identified the raw bit error rate (RBER) as an important factor in determining SSD reliability. RBER is the number of bit errors that occur during reads divided by the total number of bits read. The uncorrectable bit error rate (UBER), which is commonly used as a measure of SSD error rates, is the error rate after recovery by ECC (error-correcting code), whereas RBER is the value before error recovery. Schroeder says that UBER does not work as a metric of SSD reliability, and that RBER was closely related to SSD reliability.
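The two rates defined above can be sketched in a few lines of Python. This is only an illustration of the definitions: the function names and the counter values in the example are hypothetical and do not come from the study.

```python
def raw_bit_error_rate(corrupted_bits_read: int, total_bits_read: int) -> float:
    """RBER: bit errors seen at read time (before ECC) divided by bits read."""
    if total_bits_read <= 0:
        raise ValueError("total_bits_read must be positive")
    return corrupted_bits_read / total_bits_read


def uncorrectable_bit_error_rate(uncorrectable_bits: int, total_bits_read: int) -> float:
    """UBER: bit errors that remain after ECC recovery divided by bits read."""
    if total_bits_read <= 0:
        raise ValueError("total_bits_read must be positive")
    return uncorrectable_bits / total_bits_read


if __name__ == "__main__":
    # Hypothetical counters for one drive: 8 TB read in total.
    bits_read = 8 * 10**12 * 8
    print(raw_bit_error_rate(640_000, bits_read))        # errors before ECC
    print(uncorrectable_bit_error_rate(64, bits_read))   # errors left after ECC
```

Both rates share the same denominator; they differ only in whether the errors are counted before or after error correction.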

The study also found no correlation between RBER and UBER values, and RBER rose far more slowly with wear than expected. It concludes that a drive's age, rather than how heavily it has been used, has the greater effect on SSD reliability.


Furthermore, the study found almost no difference in reliability between SLC and MLC, mirroring the situation with HDDs, where inexpensive consumer drives with a SATA interface and expensive enterprise drives with a SAS interface show no large difference in reliability. MLC cells are generally rated for about 3,000 rewrite cycles, but the MLC SSDs surveyed in Google's data centers had not reached that write limit.
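As a rough illustration of what "not reaching the write limit" means, the sketch below estimates how many rewrite cycles a drive has consumed and compares that with the 3,000-cycle figure quoted above. The estimate (bytes written to flash divided by capacity) is a common simplification that assumes perfectly even wear; the function names and the example drive are hypothetical and are not taken from the study.

```python
def estimated_rewrite_cycles(bytes_written_to_flash: int, capacity_bytes: int) -> float:
    """Average rewrite cycles per cell, assuming writes are spread evenly."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return bytes_written_to_flash / capacity_bytes


def fraction_of_limit_used(rewrite_cycles: float, cycle_limit: int = 3000) -> float:
    """Share of the rated rewrite limit already consumed (1.0 = limit reached)."""
    return rewrite_cycles / cycle_limit


if __name__ == "__main__":
    # Hypothetical drive: 480 GB capacity, 240 TB written to flash so far.
    cycles = estimated_rewrite_cycles(240 * 10**12, 480 * 10**9)
    print(cycles)                          # average cycles consumed
    print(fraction_of_limit_used(cycles))  # share of the 3,000-cycle limit
```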

The study found that SSDs are more reliable than expected and highlighted the importance of RBER as a reliability metric. However, depending on the model, 30% to 80% of the surveyed SSDs developed bad blocks within four years of entering service, and 2% to 7% developed bad chips. The incidence of bad blocks was also found to worsen with usage.


In addition, because SSDs have a higher UBER than HDDs, and given the risk of losing data when errors cannot be recovered, the study concludes that backups are even more important for SSDs than for HDDs.

in Hardware, Posted by darkhorse_log