Thursday, 17 October 2013

Downtime..."Ain't got time for that"

Storage solutions underpin all of the front line and back room services that can be found in most organisations today. That's the reality. It's also a harsh reality for storage designers and developers to wrestle with. Storage is akin to the network - it's a service that always has to be there - especially for organisations that have embraced a virtualisation strategy. The applications themselves still have service levels and maintenance windows written into the business requirements, but organisations tend to consolidate many applications onto a single infrastructure. And trying to organise an outage across different business areas and customers is like "nailing jelly to a tree."

So Storage needs to be 'always on'.

But maintenance still needs to happen, so the storage architecture needs to support this requirement.

I've been playing with Ceph recently, so I thought I'd look at the implications of upgrading my test cluster to the latest Dumpling release.

Here's my test environment

As you can see it's only a little lab, but will hopefully serve to indicate how well the Ceph guys are handling the requirement for 'always-on' storage.

The first thing I did was head on over to the upgrade docs. They're pretty straightforward, with a well-defined sequence and some fairly good instructions (albeit with a focus skewed towards Ubuntu!).

For my test I decided to create an rbd volume, mount it from the cluster and then continue to access the disk while performing the upgrade.


[root@ceph2-admin /]# rbd create test.img --size 5120
[root@ceph2-admin /]# rbd info test.img
rbd image 'test.img':
size 5120 MB in 1280 objects
order 22 (4096 KB objects)
block_name_prefix: rb.0.1353.2ae8944a
format: 1
[root@ceph2-admin /]# rbd map test.img
[root@ceph2-admin /]# mkfs.xfs /dev/rbd1
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/rbd1              isize=256    agcount=9, agsize=162816 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=1310720, imaxpct=25
         =                       sunit=1024   swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@ceph2-admin /]# mount /dev/rbd1 /mnt/temp
[root@ceph2-admin /]# df -h
Filesystem               Size  Used Avail Use% Mounted on
devtmpfs                 481M     0  481M   0% /dev
tmpfs                    498M     0  498M   0% /dev/shm
tmpfs                    498M  1.9M  496M   1% /run
tmpfs                    498M     0  498M   0% /sys/fs/cgroup
/dev/mapper/fedora-root   11G  3.6G  7.0G  35% /
tmpfs                    498M     0  498M   0% /tmp
/dev/vda1                485M   70M  390M  16% /boot

Upgrade Process

I didn't want to update the OS on each node and ceph at the same time, so I decided to only bring ceph up to the current level. The dumpling version has a couple of additional dependencies, so I installed those first on each node;

>yum install python-requests python-flask

Then I ran the rpm update on each node with the following command;

>yum update --disablerepo="*" --enablerepo="ceph"

Once this was done, the upgrade guide simply indicates a service restart at each layer is needed.
Monitors first
[root@ceph2-3 ceph]# /etc/init.d/ceph restart mon.3
=== mon.3 ===
=== mon.3 ===
Stopping Ceph mon.3 on ceph2-3...kill 846...done
=== mon.3 ===
Starting Ceph mon.3 on ceph2-3...
Starting ceph-create-keys on ceph2-3...

Once all the monitors were done, I noticed that the admin box could no longer talk to the cluster...gasp...but the update from 0.61 to 0.67 changed the protocol and port used by the monitors, so this is an expected outcome until the admin client is updated.

Now each of the osd processes needed to be restarted
[root@ceph2-4 ~]# /etc/init.d/ceph restart osd.4
=== osd.4 ===
=== osd.4 ===
Stopping Ceph osd.4 on ceph2-4...kill 956...done
=== osd.4 ===
create-or-move updated item name 'osd.4' weight 0.01 at location {host=ceph2-4,root=default} to crush map
Starting Ceph osd.4 on ceph2-4...
starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4 /var/lib/ceph/osd/ceph-4/journal
[root@ceph2-4 ~]#

Now my client rbd volume obviously connects to the OSDs, but even with the OSD restarts my mounted rbd volume/filesystem carried on regardless.

Kudos to Ceph guys - a non-disruptive upgrade (at least for my small lab!)

I finished off the upgrade process by upgrading the rpms on the client, and now my lab is running Dumpling.

After any upgrade I'd always recommend a sanity check before you consider it 'job done'. In this case, I thought I'd use a simple performance metric and compare before and after, so I just ran a 'dd' to the rbd device before and after the upgrade. The chart below shows the results;

As you can see the profile is very similar, with only minimal disparity between the releases.
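For anyone who'd rather script this kind of sanity check, here's a rough Python sketch of a sequential write test. It is not the 'dd' run used for the chart: the function name, the sizes and the temporary file are all assumptions for illustration, and it writes to a file rather than to the raw rbd device.

```python
import os
import tempfile
import time


def write_throughput_mb_s(path, total_mb=64, block_kb=1024):
    """Write total_mb of zeros to path in block_kb chunks and return MB/s."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # include the time taken to reach the device
    elapsed = time.monotonic() - start
    return total_mb / elapsed


# Demo against a temporary file. To exercise the cluster instead, pass a
# path on the mounted rbd filesystem (for example a file under /mnt/temp).
with tempfile.TemporaryDirectory() as d:
    rate = write_throughput_mb_s(os.path.join(d, "ddtest.bin"), total_mb=16)
    print(f"{rate:.1f} MB/s")
```

Run it once before the upgrade and once after, with the same sizes, and compare the two figures.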

It's good to see that open source storage is also delivering to the 'always on' principle. The next release of gluster (v3.5) is scheduled for December this year, so I'll run through the same scenario with gluster then.

Under the 'Hood' of the Disperse Volume

In my last post I introduced the notion that RAID may simply not be the best vehicle for high capacity environments in the future, and proposed an alternate approach based on erasure coding. This led me to discuss developments that have been going on in the gluster community, led by Xavier Hernandez. He's developing a new type of gluster volume called the dispersed volume, which has the potential to deliver some pretty cool advantages;
  • increased usable capacity over standard replication
  • reduced network traffic between the client(s) and storage nodes
  • increased fault tolerance (user configurable)
Sounds like magic doesn't it?

So let's have a look to see how this new volume actually works.
The disperse volume introduces 2 (soon to be 3) translators and potentially a set of volume options to 'tweak' behaviour of the volume
  • The 'ida' translator is really the engine room of the new volume. This translator operates client side, and is responsible for splitting the data stream up, and calculating the associated erasure codes that need to be combined with each 'fragment' of the data.
  • The 'heal' translator is responsible for co-ordinating healing activity across the sub volumes when a discrepancy is detected. It can actually be regarded as a server-side helper for the ida translator which runs on the client – but more on that later.
  • The 3rd translator, nearing completion, addresses a specific read/write sequencing issue that results from splitting a file into fragments. In a large environment, multiple clients could be reading/writing to the same file - but there isn't any order imposed on these operations by gluster itself, so clients may get different results. The role of this translator is therefore to queue and sort the requests, ensuring read integrity and, as a by-product, reducing locking requirements. The expectation is that once this is complete, performance will also improve.
  • Although currently there are numerous parameters that can be tweaked during development, the goal is to maintain gluster's approach of simplicity. In all likelihood, this will mean that the admin will only really have to worry about the redundancy level and potentially timeout settings.
When you create a disperse volume, the first thing to decide is the redundancy level required. The redundancy level defines how many simultaneous brick/node failures can be accommodated by the volume without affecting data access. So if I have 6 bricks, and want a redundancy of 2, this represents a layout conceptually similar to a RAID-6 array – i.e. 4+2. As it stands today, once the volume is defined its data+redundancy relationship must remain the same – so growing the volume has to adhere to this restriction.

Now we've 'virtually' defined our volume, we can look at how read and write requests are serviced.

For read requests, the translator understands the volume's data+redundancy ratio, so it knows that to satisfy the read it only needs to dispatch requests to a subset of the subvolumes that provide the volume. Here's a worked example. In our 4+2 scenario, when a read request is received, the translator will send reads to 4 of the 6 nodes in parallel. If all of the nodes return the data fragments, the user data is sent on to the application and all is well with the world. However, if gremlins have been hard at work and a subvolume does not respond, it's marked as 'bad', and a further read is dispatched to one of the other subvolumes, enabling the data request to be satisfied. When a subvolume is marked bad it will be avoided in future read requests until it is recovered.
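The read path can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, not the ida translator's code: `read_fragments`, the `fetch` callback and the brick names are hypothetical, and the reads are issued one after another here rather than in parallel.

```python
def read_fragments(subvolumes, data_count, fetch, bad=None):
    """Collect data_count fragments, skipping subvolumes already marked bad.

    fetch(subvol) returns a fragment, or None if the subvolume did not respond.
    Returns (fragments, bad), where fragments maps subvolume -> fragment.
    """
    bad = set() if bad is None else bad
    fragments = {}
    for subvol in subvolumes:
        if len(fragments) == data_count:
            break                    # enough fragments to rebuild the data
        if subvol in bad:
            continue                 # avoided until it is recovered
        fragment = fetch(subvol)
        if fragment is None:
            bad.add(subvol)          # no response: mark bad, try another one
        else:
            fragments[subvol] = fragment
    if len(fragments) < data_count:
        raise IOError("not enough healthy subvolumes to satisfy the read")
    return fragments, bad


# 4+2 example: brick 'b2' is down, so an extra read goes to 'b5'.
bricks = ["b1", "b2", "b3", "b4", "b5", "b6"]
fetch = lambda b: None if b == "b2" else "frag-" + b
fragments, bad = read_fragments(bricks, 4, fetch)
print(sorted(fragments), sorted(bad))
```

Passing the returned `bad` set into the next call mimics the way a bad subvolume is skipped on later reads.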

One of the reasons for introducing the RAID analogy is to aid the understanding of the write process. With a write request, the translator has to 'align' the write – in a process similar to RAID's READ-MODIFY-WRITE cycle. For example, when a write targets a byte range within a file, or when the write doesn't align to the internal block size, the translator needs to perform a READ request to 'fill in the blanks'. Once this is done, all the data is available and the fragments can be assembled with the erasure codes and written out to each of the subvolumes.
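To make the alignment step concrete, here's a small Python sketch that works out which bytes have to be read back before an unaligned write can be encoded. The 4096-byte block size is a made-up figure for illustration (the translator's real internal block size isn't given here), and `aligned_range` is a hypothetical helper.

```python
def aligned_range(offset, length, block_size):
    """Expand a write of length bytes at offset to whole-block boundaries.

    Returns (start, end, head, tail): the aligned byte range [start, end),
    plus the number of existing bytes to read before and after the write.
    """
    start = (offset // block_size) * block_size
    end = -(-(offset + length) // block_size) * block_size  # round up
    head = offset - start
    tail = end - (offset + length)
    return start, end, head, tail


# A 1000-byte write at offset 5000 with an assumed 4096-byte block
print(aligned_range(5000, 1000, 4096))   # (4096, 8192, 904, 2192)

# An already aligned write needs no read at all
print(aligned_range(4096, 4096, 4096))   # (4096, 8192, 0, 0)
```

The `head` and `tail` values are the 'blanks' that the READ fills in before the fragments are built.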

At this point we can see the decisions needed to define a volume, and understand at a high level how the ida translator services read and write requests. At first glance the write of the fragments would appear to place more work over the network than a standard replicated volume – but the reality is somewhat different. In a replica 2 volume, gluster will write out the full file twice to two target bricks. If we have a file of 20MB, on a replica volume, we'll be writing 40MB between the client and the gluster nodes. Now with a dispersed volume this changes. Returning to our 4+2 example, the data written from the client will resolve to 6 fragments, with each fragment representing 1/4 the size of the file. This means that instead of writing 40MB of data between client and storage, we're actually only writing 30MB!

Now that's pretty cool in itself – but there's more to it than that. If you increase the number of bricks at the same redundancy level, the fragment size sent to each brick gets smaller, ultimately reducing the bandwidth consumption even further. 
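The arithmetic is easy to check. The Python sketch below reproduces the 20MB example and extends it to wider layouts at the same redundancy level; the function names are mine, and protocol and metadata overheads are ignored.

```python
def replica_traffic_mb(file_mb, replicas):
    """A replicated volume sends the whole file to every replica."""
    return file_mb * replicas


def disperse_traffic_mb(file_mb, data, redundancy):
    """Each of the data+redundancy fragments is 1/data of the file size."""
    fragment_mb = file_mb / data
    return fragment_mb * (data + redundancy)


file_mb = 20
print("replica 2:", replica_traffic_mb(file_mb, 2), "MB")
for data in (4, 8, 16):
    total = disperse_traffic_mb(file_mb, data, 2)
    print(f"{data}+2: {total:.1f} MB")
```

For the 20MB file this gives 40MB for replica 2, then 30MB, 25MB and 22.5MB for the 4+2, 8+2 and 16+2 layouts.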

Take a look at this chart;

Although architecturally the fragment count could be as large as 128, the testing done to date has only gone to 16+n configuration – which to my mind is probably as far as you'd go anyway with a single disperse 'set'.

So far so good, but "what happens when things go wrong?" This is where the heal translator steps in.

The heal process is client initiated, and engages the server side heal translator as a helper process to co-ordinate heal activity across multiple clients. Clients that detect a problem with a fragment send a heal request to the translator across all of the bricks in the volume. This is similar in function to lock acquisition, and provides a method of ensuring a file is only healed by one client process. The healing process itself is the same as a normal read request – fragments are obtained from the surviving bricks, and the missing fragment re-assembled. The replacement fragment is then sent to the corresponding brick in a special write request.
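The idea of rebuilding a lost fragment from the survivors can be shown with a deliberately simplified stand-in. The Python sketch below uses plain XOR parity, which only covers a redundancy of 1; it is not the erasure coding used by the disperse volume, and every name in it is hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(fragments):
    """Parity fragment = XOR of all data fragments."""
    return reduce(xor_bytes, fragments)


def rebuild_missing(surviving):
    """XOR of every surviving fragment (data and parity) gives the lost one."""
    return reduce(xor_bytes, surviving)


data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = make_parity(data)

# Pretend the brick holding the third fragment has failed
lost = data[2]
surviving = data[:2] + data[3:] + [parity]
rebuilt = rebuild_missing(surviving)
print(rebuilt == lost)
```

In heal terms, `rebuilt` is the replacement fragment that would be written back to the recovered brick.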

In the context of a complete node/brick failure the heal process itself is the same. However, since the healing process is initiated 'client-side', the current implementation requires that the files on the volume are touched (stat'd) to force the heal process to recover the missing fragments from the surviving data. It's important to realise that although this process is client initiated, the 'client' could actually be on one of the storage nodes.

In summary, the disperse volume offers the potential to save capacity, reduce costs, reduce network consumption and increase fault tolerance. These are very attractive characteristics, so the next logical question is “when can I use it?”

If you take a look at the code on github or in the gluster forge, you'll see that it hasn't received a commit in over 8 ... what's the story? Early testing showed some scale/performance limitations which forced an architectural re-think, inevitably resulting in some areas of the code being rewritten.

The good news is that this work is approaching completion, with a beta expected in the next month or two.

As for future plans, Xavier already has a to-do list;
  • dynamic configuration changes of the redundancy level and/or subvolume size
  • optimisation of the read-modify-write cycle
  • bit rot detection (storing checksums in each fragment)
  • automated server-side self heal (similar to the glustershd process today)
  • introduce throttling for the heal process
This is a project that could really impact the cost model of solutions based on gluster – and it's certainly a project I'll be following.

Many thanks to Xavier for making the time to discuss the project with me.

Monday, 14 October 2013

Is it time to ditch RAID?

Pretty much every storage vendor protects customer data using some form of RAID protection. RAID-1, RAID-5, RAID-6 etc. have served us well, but what about the future? In 2013 we saw 4TB drives. For 2014, both Seagate[1] and Western Digital[2] have announced that 5TB and possibly 6TB drives will be available – with a shared goal of reaching 20TB per drive by 2020. TDK have also entered the race[3], suggesting that a 30-40TB HDD by 2020 is also achievable. Makes you think.

With this context in mind, the question becomes “Is RAID the right answer for these high capacity environments?”
  • How long will a rebuild take?
  • What happens to I/O performance from the raidgroup during a rebuild?
  • Since my rebuilds will take longer, does my risk of subsequent failure increase (due to rebuild stress)?
  • The longer the rebuild, the longer the service may fall outside of an agreed SLA.
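To put a rough number on the first question, here's a back-of-the-envelope Python sketch. The 100MB/s sustained rebuild rate is purely an assumption for illustration; real arrays rebuild at very different rates, and usually slower when they are also serving production I/O.

```python
def rebuild_hours(capacity_tb, rate_mb_per_s):
    """Best-case time to rewrite a whole drive at a constant rate."""
    capacity_mb = capacity_tb * 1_000_000   # decimal TB, as drive vendors quote
    return capacity_mb / rate_mb_per_s / 3600


# Assumed rebuild rate of 100MB/s for each drive size
for tb in (4, 6, 20):
    print(f"{tb}TB at 100MB/s: {rebuild_hours(tb, 100):.1f} hours")
```

Under that assumption a 4TB drive takes around 11 hours, and a 20TB drive more than two days.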

Disk failures are a fact of life. Disk manufacturers provide Mean Time Between Failure (MTBF[4]) and Annualised Failure Rate (AFR[5]) statistics and probabilities, but they don't change the simple fact that disks will fail – normally when you're not looking, or are in the middle of an important production run! Once you accept that disk failure is a fact of life, the question becomes “How do you design storage to accommodate the failures?”

In the proprietary storage world there are a number of techniques that address this. One of the standouts for me is EMC's Isilon array. Isilon protects data and metadata differently; data is protected by Reed Solomon erasure coding[6], and metadata protection is handled by RAID-1 mirroring. Data written to an Isilon is augmented with recovery codes, and striped across disks and nodes within the cluster. The additional recovery information added by the erasure codes provides the flexibility to survive multiple node and/or disk failures, without impact to the end user.

The principles of erasure codes have been defined for over 50 years, but exploiting erasure codes to support multiple failures has had to wait for Moore's Law to deliver. Today, CPU architectures provide the potential for these hardware problems to be fixed in software. The good news for open source storage is that the two leading distributed storage projects, gluster[7] and ceph[8], are both developing erasure coding techniques.

To find out more, I decided to talk to Xavier Hernandez, who's working on the erasure coding implementation within the gluster project.

Xavier works in the R&D Department of Datalab s.l. based in Spain. Datalab were initially searching for a scalable storage solution that was simple to deploy and offered the potential to add functionality. The modular architecture of gluster was a good fit, and around 18 months ago they started to look at the possibility of adding 'translators' to gluster. The goal was to improve the usable capacity and fault tolerance within a gluster volume, through the adoption of erasure coding.

The investigation that followed has led Xavier to the development of the 'disperse' volume, which is based on Rabin's Information Dispersal Algorithm[9] (IDA). When you look at the impact a dispersed volume has on the available capacity it's easy to see the attraction.

So from this chart you can see that with the same amount of physical infrastructure the disperse volume reduces 'wasted' space and improves the fault tolerance of the solution.
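The same comparison can be worked through in Python for a small example. The 6-brick layout and the function names are my own choices for illustration, and 'survives' means the number of brick failures the layout tolerates in the worst case.

```python
def replica_layout(bricks, replicas):
    """Usable capacity (in bricks) and worst-case failures survived."""
    return bricks / replicas, replicas - 1


def disperse_layout(bricks, redundancy):
    """Usable capacity (in bricks) and failures survived."""
    return bricks - redundancy, redundancy


bricks = 6
rows = [
    ("replica 2", *replica_layout(bricks, 2)),
    ("replica 3", *replica_layout(bricks, 3)),
    ("disperse 4+2", *disperse_layout(bricks, 2)),
]
for name, usable, failures in rows:
    print(f"{name:<13} usable: {usable / bricks:.0%}  survives any {failures} brick failure(s)")
```

With 6 bricks, replica 2 leaves 50% usable and survives any single failure, replica 3 leaves 33% and survives two, while a 4+2 disperse layout leaves 67% and also survives two.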

In my next post, I'll dive 'into' the architecture of a disperse volume, looking deeper into the translator stack that Xavier has developed.


Wednesday, 9 October 2013

Simplifying Gluster Post-Installation

In my last post, I talked about making storage easier and how that can lead to success. This made me take a look at the post install process of gluster. Gluster is a distributed storage solution that has a very simple client-server architecture. It's simple to install and doesn't have any 'kernel' baggage (aka dependencies), so there must also be some simple tools to help with setup once the servers are built.

Hmmm, not quite. Users of gluster are actually faced with two options;
  1. install the oVirt management framework, just to configure gluster
  2. configure gluster by hand with the CLI and parallel ssh

To paraphrase the great Obi-Wan, "these aren't the tools you're looking for."

So in true open source fashion, I sat down to see if I could write something that could help simplify this initial configuration step. One of the cool things about the gluster project these days is the gluster forge.  This provides a melting pot for all manner of projects relating to open source storage and gluster, so it seemed a natural place for my project to find a home.

These were my goals;
  1. make a 'wizard' to guide the user through the initial configuration steps
  2. design the wizard as a web interface 
  3. make the tool light-weight, with few or, ideally, no external dependencies
  4. give it a modern look and feel, exploiting css3 or html5 features where it makes sense
  5. ensure configurations are consistent from the start
I'm not a python developer, or a web developer - but I didn't let that stop me ;o)  I decided on a combination of python and javascript to provide the main logic, with html/css providing the eye-candy, and a dash of bash to provide any lower level OS 'glue'.

The project can be found on the gluster forge here 

The features that have been implemented so far include;
  • detection of glusterd nodes (to use to build the cluster)
  • formation of the cluster
  • distribution of ssh keys from a 'master' node to the other nodes
  • detection of eligible, unused disks on each node
  • user selection of disks to use for gluster 'bricks'
  • formatting and mounting of the disks across each node

I've created a screencast on youtube to show the interface being used to create a cluster.

There's still a fair amount to do, but for now it's in a state that others can poke a stick at and see if it helps them.

Monday, 7 October 2013

Make it Easy...

What would you say is the key to driving technology adoption?

Maybe features? If your product has more features than the other guys, then you'll win -  right?

I don't think so..

My take is that the storage market has moved on from that. If you take a look at features in proprietary storage arrays for example - they all offer pretty much the same deal. So what helps customers make purchasing decisions in today's "Storage Wars"?

In a word: consumability (if that's even a real word!)

The goal for vendors, proprietary and open source alike, should be to make their products
  • easy to deploy
  • easy to support
  • easy to expand
  • easy to budget for

Making products easy for IT departments to live with is the key. Take IBM as an example. When they brought XIV to market, its management interface was streets ahead of the competition. But IBM didn't stop there; they adopted the same management interface across their SVC, DS8000 and SONAS products.

Same look and feel = lower administration overheads for customers (and ultimately, maybe more sales for IBM sales reps!)

These are lessons that the proprietary storage world has learnt. The challenge in front of the open source community is applying those lessons to our storage projects. Ultimately, this means adapting to take on an additional focus. Development projects will need to expand to embrace different skill sets - broadening the community around the core project.

And, broadening the community is never a bad thing ;o)

The gluster project is moving in this direction. To date the administration effort has focused on exploiting the oVirt management framework. oVirt provides a management infrastructure typically for KVM-based virtualisation. However, work between the oVirt team and the folks at has now delivered a management framework for gluster clusters too. In fact the oVirt engine itself can be installed in different 'modes' - virtualisation, gluster and virtualisation or just gluster.

So it's good to know that there are options out there, in open source storage land, where rich UIs and APIs exist to make an admin's life just that little bit easier.