Pretty much every storage vendor protects customer data using some form of RAID. RAID-1, RAID-5, RAID-6, etc. have served us well, but what about the future? In 2013 we saw 4TB drives. In 2014, both Seagate [1] and Western Digital [2] have announced that 5TB and possibly 6TB drives will be available, with a shared goal of reaching 20TB per drive by 2020. TDK have also entered the race [3], suggesting that a 30-40TB HDD by 2020 is also achievable. Makes you think.
With this context in mind, the question becomes “Is RAID the right answer for these high capacity environments?”
- How long will a rebuild take?
- What happens to I/O performance from the RAID group during a rebuild?
- Since my rebuilds will take longer, does my risk of a subsequent failure increase (due to rebuild stress)?
- The longer the rebuild, the longer the service may fall outside of an agreed SLA.
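To put a rough number on the first question, here is a minimal back-of-the-envelope sketch. It assumes the rebuild is limited only by a sustained write rate to the replacement drive; the 100 MB/s figure is a hypothetical value chosen for illustration, not a vendor number, and real rebuilds competing with production I/O will usually be slower.

```python
def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Best-case hours needed to rewrite an entire drive at a sustained rate."""
    capacity_mb = capacity_tb * 1_000_000  # decimal TB -> MB, as drive vendors count
    return capacity_mb / rebuild_mb_per_s / 3600


if __name__ == "__main__":
    ASSUMED_RATE_MB_S = 100  # hypothetical sustained rebuild rate
    for tb in (4, 6, 20):
        hours = rebuild_hours(tb, ASSUMED_RATE_MB_S)
        print(f"{tb}TB drive: {hours:.1f} hours")
```

Under that assumption the rebuild time grows linearly with capacity, so a 20TB drive takes five times as long as a 4TB drive unless the rebuild rate improves too.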
Disk failures are a fact of life. Disk manufacturers provide Mean Time Between Failures (MTBF) [4] and Annualised Failure Rate (AFR) [5] statistics and probabilities, but they don't change the simple fact that disks will fail, normally when you're not looking or are in the middle of an important production run! Once you accept that disk failure is a fact of life, the question becomes “How do you design storage to accommodate the failures?”
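The two statistics are related. As a rough sketch, assuming a constant failure rate (an exponential lifetime model) and drives powered on all year, AFR can be estimated from MTBF as below. The 1,000,000 hour MTBF and the 1,000 drive population are hypothetical values used only for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # assumes drives run 24x7


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Estimate the annualised failure rate under a constant-failure-rate model."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    mtbf = 1_000_000  # hypothetical MTBF in hours
    drives = 1_000    # hypothetical population size
    afr = afr_from_mtbf(mtbf)
    print(f"AFR: {afr:.2%}")
    print(f"Expected failures per year across {drives} drives: {afr * drives:.1f}")
```

Even a sub-1% AFR means a large enough estate should expect failures every year, which is the point: the design has to absorb them.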
In the proprietary storage world there are a number of techniques that address this. One of the standouts for me is EMC's Isilon array. Isilon protects data and metadata differently: data is protected by Reed-Solomon erasure coding [6], and metadata protection is handled by RAID-1 mirroring. Data written to an Isilon is augmented with recovery codes, and striped across disks and nodes within the cluster. The additional recovery information added by the erasure codes provides the flexibility to survive multiple node and/or disk failures, without impact to the end user.
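To make the idea of "recovery codes" concrete, here is a deliberately simplified sketch. It uses a single XOR parity fragment, which is not Reed-Solomon and is not how Isilon is implemented; it can only rebuild one lost fragment, whereas Reed-Solomon generalises the same principle to multiple losses. The function names and sample data are hypothetical.

```python
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(fragments: List[bytes]) -> bytes:
    """Return one parity fragment: the XOR of all equal-length data fragments."""
    return reduce(xor_bytes, fragments)


def recover(survivors: List[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing data fragment from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


if __name__ == "__main__":
    data = [b"node", b"disk", b"data"]  # three fragments, one per (hypothetical) disk
    parity = encode(data)               # a fourth fragment holding the recovery code
    rebuilt = recover([data[0], data[2]], parity)  # pretend the second disk failed
    print(rebuilt == data[1])
```

The recovery information costs one extra fragment here rather than a full copy of the data, which is where the capacity advantage over mirroring comes from.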
The principles of erasure codes have been defined for over 50 years, but exploiting erasure codes to support multiple failures has had to wait for Moore's Law to deliver. Today, CPU architectures provide the potential for these hardware problems to be fixed in software. The good news for open source storage is that the two leading distributed storage projects, gluster [7] and ceph [8], are both developing erasure coding techniques.
To find out more, I decided to talk to Xavier Hernandez, who's working on the erasure coding implementation within the gluster project.
Xavier works in the R&D Department of Datalab s.l. based in Spain. Datalab were initially searching for a scalable storage solution that was simple to deploy and offered the potential to add functionality. The modular architecture of gluster was a good fit, and around 18 months ago they started to look at the possibility of adding 'translators' to gluster. The goal was to improve the usable capacity and fault tolerance within a gluster volume, through the adoption of erasure coding.
The investigation that followed has led Xavier to the development of the 'disperse' volume, which is based on Rabin's Information Dispersal Algorithm (IDA) [9]. When you look at the impact a dispersed volume has on the available capacity, it's easy to see the attraction.
So from this chart you can see that with the same amount of physical infrastructure the disperse volume reduces 'wasted' space and improves the fault tolerance of the solution.
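The capacity trade-off can also be sketched with simple arithmetic. The brick count, brick size and redundancy levels below are hypothetical and are not taken from the chart; the sketch only assumes that a replicated volume stores full copies, while a dispersed volume gives up the equivalent of its redundancy bricks.

```python
def replicated(bricks: int, brick_tb: float, replicas: int):
    """Usable TB and guaranteed survivable brick failures for a replicated volume."""
    usable = bricks * brick_tb / replicas
    return usable, replicas - 1


def dispersed(bricks: int, brick_tb: float, redundancy: int):
    """Usable TB and survivable brick failures for a dispersed (erasure coded) volume."""
    usable = (bricks - redundancy) * brick_tb
    return usable, redundancy


if __name__ == "__main__":
    BRICKS, BRICK_TB = 6, 4  # hypothetical: six 4TB bricks
    for name, (usable, failures) in (
        ("replica 2", replicated(BRICKS, BRICK_TB, 2)),
        ("disperse 4+2", dispersed(BRICKS, BRICK_TB, 2)),
    ):
        raw = BRICKS * BRICK_TB
        print(f"{name}: {usable:g}TB usable of {raw}TB raw, "
              f"survives any {failures} brick failure(s)")
```

With these example numbers the dispersed layout offers more usable space from the same bricks while tolerating the loss of any two of them, rather than only one guaranteed.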
In my next post, I'll dive 'into' the architecture of a disperse volume, looking deeper into the translator stack that Xavier has developed.
[7] https://forge.gluster.org/disperse