Pretty much every storage vendor protects customer data using some form of RAID. RAID-1, RAID-5, RAID-6 and friends have served us well, but what about the future? In 2013 we saw 4TB drives. In 2014, both Seagate1 and Western Digital2 announced that 5TB and possibly 6TB drives will be available, with a shared goal of reaching 20TB per drive by 2020. TDK have also entered the race3, suggesting that a 30-40TB HDD by 2020 is achievable. Makes you think.
With this context in mind, the question becomes “Is RAID the right answer for these high capacity environments?” A few obvious concerns spring to mind:
- How long will a rebuild take? (a rough back-of-the-envelope sketch follows this list)
- What happens to I/O performance from the RAID group during a rebuild?
- Since rebuilds will take longer, does the risk of a subsequent failure increase (due to rebuild stress)?
- The longer the rebuild, the longer the service may fall outside of an agreed SLA.
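To put some rough numbers against the first question, here's a back-of-the-envelope sketch (my own illustration; the 150MB/s sustained write rate is an assumption, not a vendor figure). A traditional rebuild has to push the failed drive's entire capacity through a single spare, so capacity divided by sustained write throughput gives a hard floor on rebuild time:

```python
# Naive lower bound on RAID rebuild time: the hot spare has to absorb the
# failed drive's full capacity at its sustained write rate. The 150MB/s
# figure is an illustrative assumption, not a measured or vendor number.

def rebuild_hours(capacity_tb, write_mb_per_sec):
    capacity_mb = capacity_tb * 1000 * 1000   # decimal TB -> MB
    return capacity_mb / write_mb_per_sec / 3600.0

for capacity_tb in (4, 6, 20):
    print("{}TB drive @ 150MB/s sustained: ~{:.1f} hours minimum".format(
        capacity_tb, rebuild_hours(capacity_tb, 150)))
```

Even this best case (no competing I/O, no throttling) puts a 20TB drive at well over a day of rebuild, and real rebuilds on busy arrays take considerably longer.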
Disk failures are a fact of life. Disk manufacturers provide Mean Time Between Failure (MTBF4) and Annualised Failure Rate (AFR5) statistics and probabilities, but they don't change the simple fact that disks will fail, normally when you're not looking, or in the middle of an important production run! Once you accept that disk failure is a fact of life, the question becomes “How do you design storage to accommodate the failures?”.
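To get a feel for what those MTBF and AFR numbers mean at scale, here's a small sketch (my own, purely illustrative figures). Assuming a constant failure rate, AFR ≈ 1 - exp(-8760 / MTBF), and in a reasonably sized pool you should simply plan on seeing failures every year:

```python
import math

# Rough translation of a quoted MTBF into an annualised failure rate (AFR),
# assuming a constant (exponential) failure rate. The MTBF and pool size
# below are illustrative assumptions, not vendor figures.

HOURS_PER_YEAR = 8760

def afr(mtbf_hours):
    return 1 - math.exp(-float(HOURS_PER_YEAR) / mtbf_hours)

mtbf_hours = 1200000      # a commonly quoted enterprise-class MTBF
pool_size = 1000          # number of drives in the pool

rate = afr(mtbf_hours)
print("AFR per drive: {:.2%}".format(rate))
print("Expected failures per year in a {}-drive pool: ~{:.1f}".format(
    pool_size, pool_size * rate))
```

With numbers like that, a thousand-drive estate sees a handful of failures every year, which is exactly why the design question above matters.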
In the proprietary storage world there are a number of techniques that address this. One of the standouts for me is EMC's Isilon array. Isilon protects data and metadata differently: data is protected by Reed-Solomon erasure coding6, while metadata protection is handled by RAID-1 mirroring. Data written to an Isilon is augmented with recovery codes, and striped across disks and nodes within the cluster. The additional recovery information added by the erasure codes provides the flexibility to survive multiple node and/or disk failures, without impact to the end user.
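To make the "data plus recovery fragments" idea concrete, here's a toy sketch (my own, not Isilon's or gluster's code). It splits a block into k data fragments and adds a single XOR parity fragment, so it can only rebuild one lost fragment; Reed-Solomon does the same kind of thing with Galois field arithmetic and m parity fragments, which is what lets it survive multiple simultaneous failures:

```python
# Toy erasure-coding sketch: split a block into K data fragments plus one
# XOR parity fragment. Any single missing fragment can be rebuilt from the
# survivors. Reed-Solomon generalises this to m parity fragments over a
# Galois field, tolerating up to m losses.

from functools import reduce

K = 4  # data fragments per block (an illustrative choice)

def encode(block):
    frag_len = -(-len(block) // K)                    # ceiling division
    block = block.ljust(frag_len * K, b'\0')          # pad to a multiple of K
    frags = [block[i * frag_len:(i + 1) * frag_len] for i in range(K)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*frags))
    return frags + [parity]                           # K data + 1 parity

def recover(frags, lost_index):
    # XOR of all surviving fragments reconstructs the missing one.
    survivors = [f for i, f in enumerate(frags) if i != lost_index]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))

frags = encode(b"some important production data")
assert recover(frags, 2) == frags[2]                  # rebuild a lost fragment
```

The key point is that recovery only needs the surviving fragments, wherever they happen to live, rather than a whole-disk rebuild through a single spare.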
The principles of erasure codes have been understood for over 50 years, but exploiting erasure codes to support multiple failures has had to wait for Moore's Law to deliver. Today's CPU architectures provide the potential for these hardware problems to be fixed in software. The good news for open source storage is that the two leading distributed storage projects, gluster7 and ceph8, are both developing erasure coding techniques.
To find out more, I decided to talk to Xavier Hernandez, who's working on the erasure coding implementation within the gluster project.
Xavier works in the R&D department of Datalab s.l., based in Spain. Datalab were initially searching for a scalable storage solution that was simple to deploy and offered the potential to add functionality. The modular architecture of gluster was a good fit, and around 18 months ago they started to look at the possibility of adding 'translators' to gluster. The goal was to improve the usable capacity and fault tolerance within a gluster volume through the adoption of erasure coding.
The investigation that followed has led Xavier to the development of the 'disperse' volume, which is based on Rabin's Information Dispersal Algorithm9 (IDA). When you look at the impact a dispersed volume has on available capacity, it's easy to see the attraction: with the same amount of physical infrastructure, a disperse volume reduces 'wasted' space and improves the fault tolerance of the solution.
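A quick sketch of the arithmetic behind that comparison (my own illustrative numbers, using the usual gluster terms): a 2-way or 3-way replicated volume leaves you with 1/2 or 1/3 of the raw capacity, while a disperse volume of n bricks with redundancy r gives you (n - r)/n of the raw capacity and survives r simultaneous brick failures.

```python
# Usable capacity and guaranteed fault tolerance for the same raw
# infrastructure: 12 bricks of 4TB each (illustrative numbers only).

BRICKS = 12
BRICK_TB = 4
raw_tb = BRICKS * BRICK_TB

layouts = [
    ("replica 2",                  raw_tb / 2.0,                        1),
    ("replica 3",                  raw_tb / 3.0,                        2),
    ("disperse 12, redundancy 2",  raw_tb * (BRICKS - 2.0) / BRICKS,    2),
    ("disperse 12, redundancy 4",  raw_tb * (BRICKS - 4.0) / BRICKS,    4),
]

for name, usable_tb, tolerated in layouts:
    print("{:28s} usable {:>5.1f}TB, survives {} brick failure(s)".format(
        name, usable_tb, tolerated))
```

On those numbers a disperse volume with redundancy 2 yields 40TB usable from 48TB raw, versus 24TB for 2-way replication, while tolerating two brick failures instead of one.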
In my next post, I'll dive into the architecture of a disperse volume, looking deeper into the translator stack that Xavier has developed.