Thursday, 17 October 2013

Downtime..."Ain't got time for that"

Storage solutions underpin all of the front-line and back-room services found in most organisations today. That's the reality. It's also a harsh reality for storage designers and developers to wrestle with. Storage is akin to the network: it's a service that always has to be there, especially for organisations that have embraced a virtualisation strategy. The applications themselves still have service levels and maintenance windows written into the business requirements, but organisations tend to consolidate many applications onto a single infrastructure. And trying to organise an outage across different business areas and customers is like "nailing jelly to a tree."

So storage needs to be 'always on'.

But maintenance still needs to happen, so the storage architecture needs to support this requirement.

I've been playing with Ceph recently, so I thought I'd look at the implications of upgrading my test cluster to the latest Dumpling release.

Here's my test environment

As you can see it's only a little lab, but will hopefully serve to indicate how well the Ceph guys are handling the requirement for 'always-on' storage.

The first thing I did was head on over to the Ceph upgrade docs. They're pretty straightforward, with a well-defined sequence and some fairly good instructions (albeit with a focus skewed towards Ubuntu!)

For my test I decided to create an rbd volume, map and mount it from the cluster, and then continue to access the disk while performing the upgrade.


[root@ceph2-admin /]# rbd create test.img --size 5120
[root@ceph2-admin /]# rbd info test.img
rbd image 'test.img':
size 5120 MB in 1280 objects
order 22 (4096 KB objects)
block_name_prefix: rb.0.1353.2ae8944a
format: 1
[root@ceph2-admin /]# rbd map test.img
[root@ceph2-admin /]# mkfs.xfs /dev/rbd1
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/rbd1              isize=256    agcount=9, agsize=162816 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=1310720, imaxpct=25
         =                       sunit=1024   swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@ceph2-admin /]# mount /dev/rbd1 /mnt/temp
[root@ceph2-admin /]# df -h
Filesystem               Size  Used Avail Use% Mounted on
devtmpfs                 481M     0  481M   0% /dev
tmpfs                    498M     0  498M   0% /dev/shm
tmpfs                    498M  1.9M  496M   1% /run
tmpfs                    498M     0  498M   0% /sys/fs/cgroup
/dev/mapper/fedora-root   11G  3.6G  7.0G  35% /
tmpfs                    498M     0  498M   0% /tmp
/dev/vda1                485M   70M  390M  16% /boot
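
One way to keep the disk busy while the upgrade runs is a small probe loop in a second terminal. This is only a minimal sketch, not the exact commands used in this test: `io_probe` and `io_watch` are hypothetical helper names, and `/mnt/temp` is assumed to be the mount point from the transcript above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: exercise a mounted filesystem during an upgrade.

# Write a timestamp to a scratch file, flush it, and read it back.
# Returns non-zero if the write or the read-back comparison fails.
io_probe() {
    local dir="$1"
    local f="$dir/probe.$$"
    local stamp
    stamp="$(date +%s)"
    echo "$stamp" > "$f" || return 1
    sync                                   # push the write down to the device
    [ "$(cat "$f")" = "$stamp" ] || return 1
    rm -f "$f"
}

# Run the probe `count` times, `interval` seconds apart, logging each result.
# Returns 0 only if every probe succeeded.
io_watch() {
    local dir="$1" count="$2" interval="$3"
    local failures=0 i
    for ((i = 1; i <= count; i++)); do
        if io_probe "$dir" 2>/dev/null; then
            echo "$(date +%T) ok"
        else
            echo "$(date +%T) FAILED"
            failures=$((failures + 1))
        fi
        sleep "$interval"
    done
    [ "$failures" -eq 0 ]
}

# Example (assumed mount point): probe once a second for ten minutes.
# io_watch /mnt/temp 600 1
```

If I/O stalls rather than fails outright, the probe will simply block, so a gap in the logged timestamps is as telling as a FAILED line.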

Upgrade Process

I didn't want to update the OS on each node and ceph at the same time, so I decided to only bring ceph up to the current level. The dumpling version has a couple of additional dependencies, so I installed those first on each node;

>yum install python-requests python-flask

Then I ran the rpm update on each node with the following command;

>yum update --disablerepo="*" --enablerepo="ceph"

Once this was done, the upgrade guide simply indicates a service restart at each layer is needed.
Monitors first
[root@ceph2-3 ceph]# /etc/init.d/ceph restart mon.3
=== mon.3 ===
=== mon.3 ===
Stopping Ceph mon.3 on ceph2-3...kill 846...done
=== mon.3 ===
Starting Ceph mon.3 on ceph2-3...
Starting ceph-create-keys on ceph2-3...

Once all the monitors were done, I noticed that the admin box could no longer talk to the cluster...gasp! However, the update from 0.61 to 0.67 changes the protocol and port used by the monitors, so this is an expected outcome until the admin client is updated as well.

Next, each of the osd processes needed to be restarted
[root@ceph2-4 ~]# /etc/init.d/ceph restart osd.4
=== osd.4 ===
=== osd.4 ===
Stopping Ceph osd.4 on ceph2-4...kill 956...done
=== osd.4 ===
create-or-move updated item name 'osd.4' weight 0.01 at location {host=ceph2-4,root=default} to crush map
Starting Ceph osd.4 on ceph2-4...
starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4 /var/lib/ceph/osd/ceph-4/journal
[root@ceph2-4 ~]#
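
When restarting OSDs one at a time, it seems prudent to let the cluster settle before moving on to the next one. The sketch below polls `ceph health` until it reports `HEALTH_OK`. The function name `wait_for_health_ok`, the retry count and the interval are my own assumptions, not something taken from the upgrade guide, and the health command is a parameter so the loop can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until the cluster reports HEALTH_OK.
# Arguments: max tries (default 60), seconds between tries (default 5),
# and the command to run (default "ceph health").
wait_for_health_ok() {
    local tries="${1:-60}" interval="${2:-5}" health_cmd="${3:-ceph health}"
    local i status
    for ((i = 1; i <= tries; i++)); do
        # Intentionally unquoted so "ceph health" splits into command + argument.
        status="$($health_cmd 2>/dev/null)"
        case "$status" in
            HEALTH_OK*)
                echo "cluster healthy after $i check(s)"
                return 0
                ;;
        esac
        echo "check $i: ${status:-no response}"
        sleep "$interval"
    done
    return 1
}

# Example (assumed workflow): restart one OSD, then wait before the next.
# /etc/init.d/ceph restart osd.4 && wait_for_health_ok
```

A non-zero return means the cluster never reached `HEALTH_OK` within the allotted tries, which is a good cue to stop and investigate rather than restart the next daemon.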

Now my client rbd volume obviously connects to the OSDs, but even with the OSD restarts my mounted rbd volume and filesystem carried on regardless.

Kudos to the Ceph guys - a non-disruptive upgrade (at least for my small lab!)

I finished off the upgrade process by upgrading the RPMs on the client, and now my lab is running Dumpling.

After any upgrade I'd always recommend a sanity check before you consider it 'job done'. In this case I used a simple performance metric: I ran a 'dd' to the rbd device before and after the upgrade and compared the two. The chart below shows the results;

As you can see the profile is very similar, with only minimal disparity between the releases.
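
For anyone wanting to repeat this kind of before-and-after check, here is a minimal sketch of a 'dd' write test. It is an assumption-laden variant rather than the exact command behind the chart: it writes to a file on the mounted filesystem (writing to the raw rbd device would overwrite whatever is on it), and `dd_write_test`, the target path and the sizes are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write `mb` MiB of zeros to `target` and print the
# summary line from dd (bytes copied, elapsed time, throughput).
# conv=fsync makes dd flush the data to the device before reporting.
dd_write_test() {
    local target="$1" mb="${2:-256}"
    dd if=/dev/zero of="$target" bs=1M count="$mb" conv=fsync 2>&1 | tail -n 1
}

# Example (assumed paths): run once before and once after the upgrade.
# dd_write_test /mnt/temp/ddtest.bin 1024
```

Running it a few times with the same size on each release and comparing the reported throughput gives a rough like-for-like comparison.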

It's good to see that open source storage is also delivering to the 'always on' principle. The next release of gluster (v3.5) is scheduled for December this year, so I'll run through the same scenario with gluster then.
