What is 'dRAID', a technology that dramatically reduces the RAID rebuilding time in the event of a disk failure?



'RAID' is widely used as a technology for making data redundant. RAID 6, which offers the most redundancy, can protect data against up to two disk failures, but the hot spare used to recover immediately from a disk failure has had to be a single dedicated disk. With 'dRAID', such hot spares can be distributed across all of the disks.

dRAID, Finally! | Klara Inc.
https://klarasystems.com/articles/openzfs-draid-finally/

dRAID — OpenZFS documentation
https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html

The history of RAID can be seen as a history of achieving redundancy by distributing data. In the figure, green represents the parity that protects the data, and yellow represents the hot spare, the reserve area used when a disk fails. RAID 4, which was rarely used at the time of writing, had the problem that write speed was limited because parity was written to a single disk only.



RAID 5 and RAID 6 have solved the problems that RAID 4 had. In these RAIDs, the parity is distributed to each disk to eliminate the write speed bottleneck.
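The difference between a dedicated parity disk and distributed parity can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the simple per-stripe rotation used here is an assumption, not the layout of any particular RAID implementation.

```python
def parity_disk(stripe: int, num_disks: int, rotate: bool) -> int:
    """Return the index of the disk that holds parity for a given stripe."""
    if not rotate:
        # RAID 4 style: parity always lives on the last disk.
        return num_disks - 1
    # RAID 5 style (illustrative rotation): parity moves one disk per stripe.
    return (num_disks - 1 - stripe) % num_disks


def parity_writes_per_disk(num_stripes: int, num_disks: int, rotate: bool) -> list[int]:
    """Count how many stripes place their parity on each disk."""
    counts = [0] * num_disks
    for stripe in range(num_stripes):
        counts[parity_disk(stripe, num_disks, rotate)] += 1
    return counts


print(parity_writes_per_disk(8, 4, rotate=False))  # [0, 0, 0, 8]
print(parity_writes_per_disk(8, 4, rotate=True))   # [2, 2, 2, 2]
```

With the dedicated layout every parity write lands on one disk, while rotation spreads the same number of parity writes evenly.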



However, because the hot spare was still a single disk, writes during recovery were concentrated on that hot spare disk whenever a disk failed. According to Klara, a FreeBSD enterprise service provider, keeping a single disk on standby is inefficient, and large RAID configurations can even take weeks to resynchronize after a failure.



dRAID addresses these problems with RAID 5 and RAID 6 by distributing not only parity but also the hot spare across all of the disks, as shown in the figure.



Because the hot spare is distributed across the disks, the rebuild work after a failure is distributed as well, which speeds up the rebuild.
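A rough back-of-the-envelope model shows why spreading the spare helps. The sketch below assumes that write throughput to the spare area is the only bottleneck and ignores reads, parity computation, and normal workload, so it is an idealization rather than a prediction. The function name and all the numbers (10 TB to rebuild, 150 MB/s per disk, 10 surviving disks) are hypothetical.

```python
def rebuild_hours(data_tb: float, write_mb_s: float, target_disks: int) -> float:
    """Idealized time to rewrite data_tb when writes are spread over target_disks."""
    total_mb = data_tb * 1_000_000  # decimal terabytes to megabytes
    seconds = total_mb / (write_mb_s * target_disks)
    return seconds / 3600


# Dedicated hot spare: every rebuilt block is written to one disk.
single = rebuild_hours(data_tb=10, write_mb_s=150, target_disks=1)

# Distributed spare: rebuilt blocks are written to spare space on 10 surviving disks.
distributed = rebuild_hours(data_tb=10, write_mb_s=150, target_disks=10)

print(f"dedicated spare:   {single:.1f} h")
print(f"distributed spare: {distributed:.1f} h")
print(f"speedup in this model: {single / distributed:.0f}x")
```

In this simplified model the speedup equals the number of disks sharing the rebuild writes. Real rebuild times depend on many other factors.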



Below is a graph comparing rebuild times, without data integrity checking, for conventional RAID and dRAID. It shows that dRAID rebuilds in less time than the conventional RAID shown in indigo, regardless of the amount of parity.



The following figure shows an image of the dRAID implementation in OpenZFS. With OpenZFS dRAID, the data and parity of multiple RAID-Z groups, along with the hot spares, are distributed and stored across all of the disks. In addition, recovery from a failure is divided into two stages, 'rebuilding', which writes data to the hot spare area, and 'resynchronization', which checks the integrity of the data, so redundancy can apparently be restored quickly.



Conventional RAID-Z uses a variable stripe width, which makes efficient use of disk capacity, but dRAID prioritizes rebuild speed and uses a fixed stripe width. If you write a large amount of small data, the usable capacity shrinks because anything shorter than the fixed stripe width is padded. As a workaround, OpenZFS recommends adding a special vdev (virtual device) to store small data.
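The cost of padding can be estimated with a short calculation. The sketch below is a simplified model that counts data space only and ignores parity. It assumes 8 data disks per group and 4 KiB sectors, so each allocation is rounded up to a multiple of 32 KiB. The function name is hypothetical, and a real pool's values depend on how it was created.

```python
import math


def draid_allocated_bytes(block_bytes: int, data_disks: int = 8, sector_bytes: int = 4096) -> int:
    """Round a block up to a whole number of fixed-width stripes (data only)."""
    stripe_bytes = data_disks * sector_bytes  # assumed fixed stripe width
    return math.ceil(block_bytes / stripe_bytes) * stripe_bytes


for size in (4096, 16384, 32768, 131072):
    allocated = draid_allocated_bytes(size)
    print(f"{size:>7} B block -> {allocated:>7} B allocated, {allocated - size:>6} B padding")
```

Under these assumptions a 4 KiB block occupies a full 32 KiB stripe, while a 128 KiB block needs no padding. This is why workloads with many small blocks lose usable capacity.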

dRAID is not included in the latest OpenZFS release at the time of writing, but it can be used by compiling the development version from source. An official OpenZFS release that includes it is expected in the first half of 2021.

in Software, Posted by darkhorse_log