In my last post I introduced the notion that RAID may simply not be the best vehicle for high capacity environments in the future, and proposed an alternate approach based on erasure coding. This led me to discuss developments that have been going on in the gluster community, led by Xavier Hernandez. He's developing a new type of gluster volume called the dispersed volume, which has the potential to deliver some pretty cool advantages:
- Increased usable capacity over standard replication
- Reduced network traffic between the client(s) and storage nodes
- Increased fault tolerance (user configurable)
Sounds like magic, doesn't it?
So let's have a look to see how this new volume actually works.
The disperse volume introduces two (soon to be three) translators, and potentially a set of volume options to 'tweak' the behaviour of the volume:
- The 'ida' translator is really the engine room of the new volume. This translator operates client side, and is responsible for splitting the data stream up and calculating the associated erasure codes that need to be combined with each 'fragment' of the data (there's a toy sketch of this idea just after this list).
- The 'heal' translator is responsible for co-ordinating healing activity across the sub volumes when a discrepancy is detected. It can actually be regarded as a server-side helper for the ida translator which runs on the client – but more on that later.
- The 3rd translator, nearing completion, addresses a specific read/write sequencing issue that results from splitting a file into fragments. In a large environment, multiple clients could be reading/writing to the same file, but there isn't any order imposed on these operations by gluster itself, so clients may get different results. The role of this translator is therefore to queue and sort the requests, ensuring read integrity and, as a by-product, reducing locking requirements. The expectation is that once this is complete, performance will also improve.
- Although currently there are numerous parameters that can be tweaked during development, the goal is to maintain gluster's approach of simplicity. In all likelihood, this will mean that the admin will only really have to worry about the redundancy level and potentially timeout settings.
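To make the ida translator's job a little more concrete, here's a toy sketch of the split-and-encode idea. To be clear, this is not the erasure coding maths the translator actually uses – it's a simple scheme of k data fragments plus one XOR parity fragment (i.e. redundancy 1), purely to show the shape of the operation: one buffer in, k+1 fragments out, any one of which can be lost and rebuilt.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_and_encode(data: bytes, k: int) -> list[bytes]:
    """Cut 'data' into k fragments and append one XOR parity fragment."""
    frag_len = -(-len(data) // k)               # ceiling division
    data = data.ljust(frag_len * k, b"\0")      # pad to a multiple of k
    frags = [data[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    frags.append(reduce(xor, frags))            # the redundancy fragment
    return frags

# Any single lost fragment can be rebuilt by XOR-ing the survivors:
frags = split_and_encode(b"some file contents", 4)
lost = frags.pop(2)
assert reduce(xor, frags) == lost
```

The real translator generalises this so that any 'redundancy' fragments can be lost rather than just one, but the client-side split-and-encode flow is the same idea.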
When you create a disperse volume, the first thing to decide is the redundancy level required. The redundancy level defines how many simultaneous brick/node failures can be accommodated by the volume without affecting data access. So if I have 6 bricks and want a redundancy of 2, this represents a layout conceptually similar to a RAID-6 array – i.e. 4+2. As it stands today, once the volume is defined its data+redundancy relationship must remain the same – so growing the volume has to adhere to this restriction.
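The capacity maths falls straight out of those two numbers. A quick back-of-the-envelope check (plain Python, nothing gluster-specific):

```python
# With n bricks and redundancy r, a file is stored as n fragments and
# any (n - r) of them are enough to rebuild it.
bricks, redundancy = 6, 2                        # the 4+2 example
data = bricks - redundancy
print(f"layout: {data}+{redundancy}")            # -> layout: 4+2
print(f"usable capacity: {data / bricks:.0%}")   # -> usable capacity: 67%
print(f"tolerates {redundancy} simultaneous brick failures")
```

Compare that 67% usable with the 50% you'd get from a replica 2 volume – and this layout survives two failures rather than one.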
Now that we've 'virtually' defined our volume, we can look at how read and write requests are serviced.
For read requests, the translator understands the volume's data+redundancy ratio, so it knows that to satisfy the read it only needs to dispatch requests to a subset of the subvolumes that provide the volume. Here's a worked example. In our 4+2 scenario, when a read request is received, the translator will send reads to 4 of the 6 nodes in parallel. If all of the nodes return their data fragments, the user data is sent on to the application and all is well with the world. However, if gremlins have been hard at work and a subvolume does not respond, it's marked as 'bad', and a further read is dispatched to one of the other subvolumes, enabling the data request to be satisfied. When a subvolume is marked bad it will be avoided in future read requests until it is recovered.
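As a rough sketch of that fallback logic – the helper names here are illustrative rather than gluster's actual API, and the real translator issues the first batch of reads in parallel rather than one at a time:

```python
def disperse_read(subvols, data, read_fragment, bad):
    """Collect 'data' fragments, avoiding and recording bad subvolumes.

    read_fragment(subvol) returns that subvolume's fragment, or raises
    IOError if the subvolume doesn't respond.  'bad' is a set shared
    across reads, so failed subvolumes are skipped until they recover.
    """
    fragments = []
    candidates = (s for s in subvols if s not in bad)
    while len(fragments) < data:
        try:
            subvol = next(candidates)
        except StopIteration:
            raise IOError("too many failed subvolumes to satisfy the read")
        try:
            fragments.append(read_fragment(subvol))
        except IOError:
            bad.add(subvol)   # skip this subvolume on future reads
    return fragments
```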
One of the reasons for introducing the RAID analogy is to aid the understanding of the write process. With a write request, the translator has to 'align' the write – in a process similar to RAID's READ-MODIFY-WRITE cycle. For example, when a write covers only part of a file's byte range, or doesn't align to the internal block size, the translator needs to perform a READ request to 'fill in the blanks'. Once this is done, all the data is available and the fragments can be assembled with the erasure codes and written out to each of the subvolumes.
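In pseudocode terms, the alignment step looks something like the sketch below. The block size and the read_blocks/write_blocks helpers are assumptions for illustration – write_blocks stands in for the encode-and-fan-out step just described.

```python
BLOCK = 4096   # hypothetical internal block size

def aligned_write(read_blocks, write_blocks, offset, buf):
    """Read-modify-write: widen 'buf' to whole blocks before encoding."""
    start = (offset // BLOCK) * BLOCK                        # round down
    end = -(-(offset + len(buf)) // BLOCK) * BLOCK           # round up
    region = bytearray(read_blocks(start, end - start))      # READ: fill in the blanks
    region[offset - start:offset - start + len(buf)] = buf   # MODIFY
    write_blocks(start, bytes(region))                       # WRITE: encode + dispatch
```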
At this point we can see the decisions needed to define a volume, and understand at a high level how the ida translator services read and write requests. At first glance, writing the fragments would appear to place more load on the network than a standard replicated volume – but the reality is somewhat different. In a replica 2 volume, gluster will write out the full file twice, to two target bricks. If we have a file of 20MB, on a replica volume we'll be writing 40MB between the client and the gluster nodes. Now with a dispersed volume this changes. Returning to our 4+2 example, the data written from the client will resolve to 6 fragments, with each fragment representing 1/4 the size of the file. This means that instead of writing 40MB of data between client and storage, we're actually only writing 30MB!
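The arithmetic generalises to any k+r layout – the client ships k+r fragments, each 1/k of the file:

```python
filesize, k, r = 20, 4, 2                   # MB, the 4+2 example above
replica2_traffic = filesize * 2             # full file, written twice
disperse_traffic = filesize / k * (k + r)   # 6 fragments of 5MB each
print(replica2_traffic, disperse_traffic)   # -> 40 30.0
```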
Now that's pretty cool in itself – but there's more to it than that. If you increase the number of bricks at the same redundancy level, the fragment size sent to each brick gets smaller, ultimately reducing the bandwidth consumption even further. Take a look at this chart:
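The trend the chart plots is easy to reproduce in a couple of lines of Python, comparing network bytes shipped per byte of user data for replica 2 versus k+2 disperse layouts:

```python
print("replica 2    : 2.00x")
for k in (4, 8, 16):
    print(f"{k:>2}+2 disperse: {(k + 2) / k:.2f}x")
# ->  4+2: 1.50x,  8+2: 1.25x, 16+2: 1.12x
```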
Although architecturally the fragment count could be as large as 128, the testing done to date has only gone as far as a 16+n configuration – which to my mind is probably as far as you'd go anyway with a single disperse 'set'.
So far so good, but what happens when things go wrong? This is where the heal translator steps in.
The heal process is client initiated, and engages the server-side heal translator as a helper process to co-ordinate heal activity across multiple clients. Clients that detect a problem with a fragment send a heal request to the translator across all of the bricks in the volume. This is similar in function to lock acquisition, and provides a method of ensuring a file is only healed by one client process. The healing process itself is the same as a normal read request – fragments are obtained from the surviving bricks, and the missing fragment re-assembled. The replacement fragment is then sent to the corresponding brick in a special write request.
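Pulling those steps together, the client-side flow looks roughly like the sketch below. Every helper name here is hypothetical – the point is the sequence: win the heal, read the survivors, rebuild, write back.

```python
def heal_file(bricks, bad_brick, path, ops):
    """'ops' bundles the hypothetical primitives discussed above."""
    # Heal request across all bricks, lock-acquisition style: only
    # one client wins the right to heal this file.
    if not ops.request_heal(bricks, path):
        return
    # Same as a normal read: gather fragments from the survivors...
    survivors = [b for b in bricks if b is not bad_brick]
    data = ops.decode([ops.read_fragment(b, path) for b in survivors])
    # ...then rebuild the missing fragment and push it back to its
    # brick with a special write request.
    ops.write_fragment(bad_brick, path, ops.encode_for(bad_brick, data))
```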
In the context of a complete node/brick failure, the heal process itself is the same. However, since the healing process is initiated 'client-side', the current implementation requires that the files on the volume are touched (stat'd) to force the heal process to recover the missing fragments from the surviving data. It's important to realise that although this process is client initiated, the 'client' could actually be on one of the storage nodes.
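In practice that 'touch everything' step can be as simple as walking the mount point and stat-ing each file – a minimal sketch, assuming the volume is mounted at /mnt/dispvol:

```python
import os

for dirpath, _dirs, filenames in os.walk("/mnt/dispvol"):
    for name in filenames:
        # stat() is enough for the client to notice the missing
        # fragment and kick off the heal for that file.
        os.stat(os.path.join(dirpath, name))
```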
In summary, the disperse volume offers the potential to save capacity, reduce costs, reduce network consumption and increase fault tolerance. These are very attractive characteristics, so the next logical question is “when can I use it?”
If you take a look at the code on github or in the gluster forge, you'll see that it hasn't received a commit in over 8 months... so what's the story? Early testing showed some scale/performance limitations which forced an architectural re-think, inevitably resulting in some areas of the code being rewritten.
The good news is that this work is approaching completion, with a beta expected in the next month or two.
As for future plans, Xavier already has a to-do list:
- Dynamic configuration changes of the redundancy level and/or subvolume size
- Optimisation of the read-modify-write cycle
- Bit rot detection (storing checksums in each fragment)
- Automated server-side self heal (similar to the glustershd process today)
- Throttling for the heal process
This is a project that could really impact the cost model of solutions based on gluster – and it's certainly a project I'll be following.

Many thanks to Xavier for making the time to discuss the project with me.