In my last post I introduced the notion that RAID may simply not be the best vehicle for high capacity environments in the future, and proposed an alternate approach based on erasure coding. This lead me to discuss developments that have been going on in the gluster community lead by Xavier Hernandez. He's developing a new type of gluster volume called the dispersed volume, which has the potential to deliver some pretty cool advantages;
- Increased usable capacity over standard replication
- reduced network traffic between the client(s) and storage nodes
- increased fault tolerance (user configurable)
Sounds like magic doesn't it?
So let's have a look to see how this new volume actually works.
The disperse volume introduces 2 (soon to be 3) translators and potentially a set of volume options to 'tweak' behaviour of the volume
- The 'ida' translator is really the engine room of the new volume. This translator operates client side, and is responsible for splitting the data stream up, and calculating the associated erasure codes that need to be combined with each 'fragment' of the data.
- The 'heal' translator is responsible for co-ordinating healing activity across the sub volumes when a discrepancy is detected. It can actually be regarded as a server-side helper for the ida translator which runs on the client – but more on that later.
- The 3rd translator, nearing completion, addresses a specific read/write sequencing issue that results from splitting a file into fragments. In a large environment, multiple clients could be reading/writing to the same file - but there isn't any order imposed on these operations from a gluster itself, so clients may get different results. The role of this translator is therefore to queue and sort the requests, ensuring read integrity and as a by-product reduces locking requirements. The expectation is that once this is complete, performance will also improve.
- Although currently there are numerous parameters that can be tweaked during development, the goal is to maintain gluster's approach of simplicity. In all likelihood, this will mean that the admin will only really have to worry about the redundancy level and potentially timeout settings.
When you create a disperse volume, the first thing to decide is the redundancy level required. The redundancy level defines how many simultaneous brick/node failures can be accommodated by the volume without affecting data access. So if I have 6 bricks, and want a redundancy of 2, this represents a layout conceptually similar to a RAID-6 array – i.e. 4+2. As it stands today, once the volume is defined it's data+redundancy relationship must remain the same – so growing the volume has to adhere to this restriction.
Now we've 'virtually' defined our volume, we can look at how read and write requests are serviced.
For read requests, the translator understands the volumes data+redundancy ratio, so it knows that to satisfy the read it only needs to dispatch requests to a subset of the subvolumes that provide the volume. Here's a worked example. In our 4+2 scenario, when a read request is received, the translator will send reads to 4 of the 6 nodes in parallel. If all of the nodes return the data fragments, the user data is sent on to the application and all is well with the world. However, if gremlins have been at hard at work, and a subvolume does not respond it's marked as 'bad', and a further read is dispatched to one of the other subvolumes enabling the data request to be satisfied. When a subvolume is marked bad it will be avoided in future read requests until it is recovered.
One of the reasons for introducing the RAID analogy is to aid the understanding of the write process. With a write request, the translator has to 'align' the write – in a process similar to RAID's READ-MODIFY-WRITE cycle. For example, when a write occurs to a file byte range or when a the write doesn't align to the internal block size, the translator needs to perform a READ request to 'fill in the blanks'. Once this is done, all the data is available and the fragments can be assembled with the erasure codes and written out to each of the subvolumes.
At this point we can see the decisions needed to define a volume, and understand at a high level how the ida translator services read and write requests. At first glance the write of the fragments would appear to place more work over the network than a standard replicated volume – but the reality is somewhat different. In a replica 2 volume, gluster will write out the full file twice to two target bricks. If we have a file of 20MB, on a replica volume, we'll be writing 40MB between the client and the gluster nodes. Now with a dispersed volume this changes. Returning to our 4+2 example, the data written from the client will resolve to 6 fragments, with each fragment representing 1/4 the size of the file. This means that instead of writing 40MB of data between client and storage, we're actually only writing 30MB!
Now that's pretty cool in itself – but there's more to it than that. If you increase the number of bricks at the same redundancy level, the fragment size sent to each brick gets smaller, ultimately reducing the bandwidth consumption even further.
Take a look at this chart;
Although architecturally the fragment count could be as large as 128, the testing done to date has only gone to 16+n configuration – which to my mind is probably as far as you'd go anyway with a single disperse 'set'.
So far so good, but "what happens when things go wrong?" This is where the heal translator steps in.
The heal process is client initiated, and engages the server side heal translator as a helper process to co-ordinate heal activity across multiple clients. Clients that detect a problem with a fragment, send a heal request to the translator across all of the bricks in the volume. This is similar in function to lock acquisition, and provides a method of ensuring a file is only healed by one client process. The healing process itself is the same as a normal read request – fragments are obtained from the surviving bricks, and the missing fragment re-assembled. The replacement fragment is then sent to the corresponding brick in a special write request.
In the context of a complete node/brick failure the heal process itself is the same. However, since the healing process is initiated 'client-side', the current implementation requires that the files on the volume are touched (stat'd) to force the heal process to recover the missing fragments from the surviving data. It's important to realise that although this process is client initiated, the 'client' could actually be on one of the storage nodes.
In summary, the disperse volume offers the potential to save capacity, reduce costs, reduce network consumption and increase fault tolerance. These are very attractive characteristics, so the next logical question is “when can I use it?”
If you take a look at the code on github or in the gluster forge, you'll see that it hasn't received a commit in over 8 months..so what's the story?. Early testing showed some scale/performance limitations which forced an architectural re-think, inevitably resulting in some areas of the code being rewritten.
The good news is that this work is approaching completion, with a beta expected in the next month or two.
As for future plans, Xavier already has a to-do list;
- dynamic configuration changes of the redundancy level and or subvolume size
- optimisation of the read-mondify-write cycle
- bit rot detection (storing checksums in each fragment)
- automated server-side self heal (similar to the glustershd process today)
- introduce throttling for the heal processed
This is a project that could really impact the cost model of solutions based on gluster – and it's certainly a project I'll be following.
Many thanks to Xavier for making the time to discuss the project with me.