Sunday, 2 March 2014

What's your Storage Up to?

Proprietary and Software Defined Storage platforms are getting increasingly complex as more and more features are added. Let's face it, complexity is a fact of life! The thing that really matters is whether this increasing complexity makes the day-to-day life of an admin easier or harder.

Understanding the state of a platform at any given time is a critical 'feature' for any platform - be it compute, network or storage. Most of the proprietary storage vendors spend a lot of time, money and resources trying to get this right, but what about open source storage? Is that as mature as the proprietary offerings?

I'd have to say 'no'. But open source never stands still - someone always has an itch to scratch, which moves things forward.
I took a look at the ceph project and compared it to the current state of glusterfs.

The architecture of Ceph is a lot more complex than gluster. To my mind this has put more emphasis on the ceph team to provide tools that give insight into the state of the platform. Looking at the ceph cli, you can see that the ceph guys have thought this through and provide a great "at-a-glance" view of what's going on within a ceph cluster.

Here's an example from a test "0.67.4" environment I have;

[root@ceph2-admin ceph]# ceph -s
  cluster 5a155678-4158-4c1d-96f4-9b9e9ad2e363
   health HEALTH_OK
   monmap e1: 3 mons at {1=,2=,3=}, election epoch 34, quorum 0,1,2 1,2,3
   osdmap e95: 4 osds: 4 up, 4 in
    pgmap v5356: 1160 pgs: 1160 active+clean; 16 bytes data, 4155 MB used, 44940 MB / 49096 MB avail
   mdsmap e1: 0/0/1 up

This is what I'm talking about -  a single command, and you stand a fighting chance of understanding what's going on - albeit in geek speak! 

By contrast, gluster's architecture is much less complex, so the need for easy-to-consume status information has not been as great. In fact, as it stands today with gluster 3.4 and the new 3.5 release, you have to piece together information from several commands to get a consolidated view of what's going on.

Not exactly 'optimal' !

However, the gluster project has the community forge where projects can be submitted to add functionality to a gluster deployment, even if the changes aren't inside the core of gluster.

So I came up with a few ideas about what I'd like to see and started a project on the forge called gstatus. I've based the gstatus tool on the xml functionality that was introduced in gluster 3.4, so earlier versions of gluster are not compatible at this point (sorry, if that's you!)

It's also still early days for the tool, but I find it useful now - maybe you will too. As a taster, here's an example of the output the tool generates against a test gluster 3.5 environment.

[root@glfs35-1 gstatus]# ./ -s

      Status: HEALTHY           Capacity: 80.00 GiB(raw bricks)
   Glusterfs: 3.5.0beta3                  60.00 GiB(usable)

   Nodes    :  4/ 4        Volumes:  2 Up
   Self Heal:  4/ 4                  0 Up(Degraded)
   Bricks   :  8/ 8                  0 Up(Partial)
                                     0 Down
Status Messages
  - Cluster is healthy, all checks successful

As I mentioned, gluster does provide all the status information - just not in a single command. 'gstatus' grabs the output from several commands and brings the information together into a logical model that represents a cluster. The object model it creates centres on a cluster object as the parent, with child objects for volume(s), bricks and nodes. The objects 'link' together, allowing simple checks to be performed and state information to be propagated.
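
To make the parent/child idea a bit more concrete, here's a minimal Python sketch of that kind of model. It isn't lifted from gstatus: the class names (Cluster, Volume, Brick) and their methods are made up for illustration, and node objects are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Brick:
    path: str        # e.g. "glfs35-1:/gluster/brick1"
    up: bool = True


@dataclass
class Volume:
    name: str
    bricks: List[Brick] = field(default_factory=list)

    def bricks_up(self) -> int:
        # Count the bricks of this volume that are currently up.
        return sum(1 for b in self.bricks if b.up)


@dataclass
class Cluster:
    volumes: List[Volume] = field(default_factory=list)

    def brick_summary(self) -> str:
        # Roll brick counts up from the volumes to the cluster level.
        total = sum(len(v.bricks) for v in self.volumes)
        up = sum(v.bricks_up() for v in self.volumes)
        return f"{up}/{total}"

    def messages(self) -> List[str]:
        # Propagate child state upwards: one message per brick that is down.
        return [
            f"Brick {b.path} in volume '{v.name}' is down/unavailable"
            for v in self.volumes
            for b in v.bricks
            if not b.up
        ]
```

Because each volume holds its own bricks, a cluster-level check is just a walk over the children: brick_summary() gives the "up/total" figure and messages() collects anything that needs attention.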

You may have noticed that there are some new volume states I've introduced, which aren't currently part of gluster-core.
Up Degraded : a volume in this state is a replicated volume that has at least one brick down/unavailable within a replica set. Data is still available from other members of the replica set, so the degraded state just makes you aware that performance may be affected, and further brick loss to the same replica set will result in data being inaccessible.
Up Partial : this state is an escalation of the "up degraded" state. In this case all members of the same replica set are unavailable - so data stored within this replica set is no longer accessible. Access to data residing on other replica sets continues, so I still regard the volume as a whole as up, but flag it as partial.
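
The two definitions above boil down to a small decision rule. The sketch below is one way to express it in Python, not the code gstatus actually uses. It assumes each replica set is given as a list of up/down flags, that a plain distribute volume can be treated as replica sets of one brick each, and that "Down" means no brick in the volume is available (that last state isn't spelled out above, so it's my assumption here).

```python
from typing import List


def volume_state(replica_sets: List[List[bool]]) -> str:
    """Classify a volume from the up (True) / down (False) flags of its bricks.

    Each inner list is one replica set. A plain distribute volume is
    modelled as replica sets that each contain a single brick.
    """
    # A replica set is fully down when none of its bricks is up.
    fully_down = [not any(rs) for rs in replica_sets]

    if all(fully_down):
        return "DOWN"            # assumption: nothing is reachable at all
    if any(fully_down):
        return "UP(PARTIAL)"     # some data is inaccessible
    if any(not all(rs) for rs in replica_sets):
        return "UP(DEGRADED)"    # data still served by the surviving replicas
    return "UP"
```

For example, a 2x2 replicated volume with one brick down, volume_state([[True, True], [False, True]]), comes out as degraded, while a four-brick distribute volume with one brick down, volume_state([[True], [False], [True], [True]]), comes out as partial.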

The script has a number of 'switches' that allow you to get a high level view of the cluster, or drill down into the brick layouts themselves (much the same as the lsgvt project on the forge)

To further illustrate the kind of information the tool can provide, here's a run where things are not quite as happy (I killed a couple of brick daemons!)

[root@glfs35-1 gstatus]# ./ -s

      Status: UNHEALTHY         Capacity: 80.00 GiB(raw bricks)
   Glusterfs: 3.5.0beta3                  60.00 GiB(usable)

   Nodes    :  4/ 4        Volumes:  0 Up
   Self Heal:  4/ 4                  1 Up(Degraded)
   Bricks   :  6/ 8                  1 Up(Partial)
                                     0 Down
Status Messages
  - Cluster is UNHEALTHY
  - Volume 'dist' is in a PARTIAL state, some data is inaccessible data, due to missing bricks
  - WARNING -> Write requests may fail against volume 'dist'
  - Brick glfs35-3:/gluster/brick1 in volume 'myvol' is down/unavailable
  - Brick glfs35-2:/gluster/brick2 in volume 'dist' is down/unavailable

A couple of areas in the output are worth talking about:
  • The summary section shows 6 of 8 bricks are in an up state - so this is the first indication that things are not going well.
  • We also have a volume in partial state, and as the warning indicates, this means potentially failed writes and inaccessible data for that volume.
Now to drill down, you can use '-v' to get a more detailed volume-level view of the problems (this doesn't have to be two steps; you could use gstatus -a to get state info and volume info combined)

[root@glfs35-1 gstatus]# ./ -v

      Status: UNHEALTHY         Capacity: 80.00 GiB(raw bricks)
   Glusterfs: 3.5.0beta3                  60.00 GiB(usable)

Volume Information
    myvol            UP(DEGRADED) - 3/4 bricks up - Distributed-Replicate
                     Capacity: (0% used) 79.00 MiB/20.00 GiB (used/total)
                     Self Heal:  4/ 4   Heal backlog:54 files
                     Protocols: glusterfs:on  NFS:off  SMB:off

    dist             UP(PARTIAL)  - 3/4 bricks up - Distribute
                     Capacity: (0% used) 129.00 MiB/40.00 GiB (used/total)
                     Self Heal: N/A   Heal backlog:0 files
                     Protocols: glusterfs:on  NFS:on  SMB:on
  • For both volumes a brick is down, so each volume shows only 3/4 bricks up. The state of the volume, however, depends upon the volume type. 'myvol' is a volume that uses replication, so its volume state is just degraded - whereas the 'dist' volume is in a partial state.
  • The heal backlog count against the 'myvol' volume shows 54 files. This indicates that this volume is tracking updates being made to a replica set where a brick is currently unavailable. Once the brick comes back online, the backlog will be cleared by the self heal daemon (all self heal daemons are present and 'up', as shown in the summary section).
You can drill down further into the volumes too, and look at the brick sizes and relationships using the -l flag.

 [root@glfs35-1 gstatus]# ./ -v myvol -l

      Status: UNHEALTHY         Capacity: 80.00 GiB(raw bricks)
   Glusterfs: 3.5.0beta3                  60.00 GiB(usable)

Volume Information
    myvol            UP(DEGRADED) - 3/4 bricks up - Distributed-Replicate
                     Capacity: (0% used) 77.00 MiB/20.00 GiB (used/total)
                     Self Heal:  4/ 4   Heal backlog:54 files
                     Protocols: glusterfs:on  NFS:off  SMB:off

    myvol----------- +
                Distribute (dht)
                         +-- Repl Set 0 (afr)
                         |     |
                         |     +--glfs35-1:/gluster/brick1(UP) 38.00 MiB/10.00 GiB
                         |     |
                         |     +--glfs35-2:/gluster/brick1(UP) 38.00 MiB/10.00 GiB
                         +-- Repl Set 1 (afr)
                               +--glfs35-3:/gluster/brick1(DOWN) 36.00 MiB/10.00 GiB
                               +--glfs35-4:/gluster/brick1(UP) 39.00 MiB/10.00 GiB S/H Backlog 54

You can see in the example above that the layout display also shows which bricks are recording the self-heal backlog information.

How you run the tool depends on the question you want to ask, but the program options provide a way to show state, state + volumes, brick layouts and capacities etc. Here's the 'help' information;

[root@glfs35-1 gstatus]# ./ -h
Usage: [options]

  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -s, --state           show highlevel health of the cluster
  -v, --volume          volume info (default is ALL, or supply a volume name)
  -a, --all             show all cluster information
  -u UNITS, --units=UNITS
                        display capacity units in DECimal or BINary format (GB
                        vs GiB)
  -l, --layout          show brick layout when used with -v, or -a

Lots of ways to get into trouble :)
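
The units switch is just a choice of divisor: decimal units divide by powers of 1000, binary units by powers of 1024. Here's a small Python sketch of that arithmetic. The function name format_capacity and the "dec"/"bin" argument values are my own placeholders for illustration, not necessarily what gstatus uses internally.

```python
def format_capacity(size_bytes: int, units: str = "bin") -> str:
    """Render a byte count in GiB (binary) or GB (decimal)."""
    if units == "bin":
        divisor, suffix = 1024 ** 3, "GiB"   # 1 GiB = 2^30 bytes
    elif units == "dec":
        divisor, suffix = 1000 ** 3, "GB"    # 1 GB = 10^9 bytes
    else:
        raise ValueError("units must be 'bin' or 'dec'")
    return f"{size_bytes / divisor:.2f} {suffix}"
```

So the same 80 GiB of raw brick capacity, format_capacity(80 * 1024**3, "bin"), reads as "80.00 GiB", but as "85.90 GB" when asked for decimal units.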

Now the bad news...

As I mentioned earlier, the tool relies upon the xml output from gluster, which is a relatively new thing. The downside is that I've uncovered a couple of bugs that result in malformed xml coming from the gluster commands. I'll raise a bug for this, but until this gets fixed in gluster I've added some workarounds in the code to mitigate. It's not perfect, so if you do try the tool and see errors like this -

 Traceback (most recent call last):
  File "./", line 153, in <module>
  File "./", line 42, in main
  File "/root/bin/gstatus/classes/", line 398, in updateState
  File "/root/bin/gstatus/classes/", line 601, in update
    brick_path = node_info['hostname'] + ":" + node_info['path']
KeyError: 'hostname'

you're hitting the issue.
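
For what it's worth, the general shape of a guard against this kind of failure is simple enough. The sketch below is not the workaround that's in gstatus; it just shows, with a hypothetical helper called brick_path_or_none, how a lookup like the one in the traceback can be made tolerant of fields that go missing when the xml is malformed.

```python
from typing import Optional


def brick_path_or_none(node_info: dict) -> Optional[str]:
    """Build "host:/path" from parsed node data, or return None if incomplete."""
    hostname = node_info.get("hostname")
    path = node_info.get("path")
    if hostname is None or path is None:
        # Malformed or partial xml: let the caller skip this entry
        # instead of failing with a KeyError.
        return None
    return hostname + ":" + path
```

A caller can then skip, or flag, any entry that comes back as None rather than crashing half way through an update.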

If you're running gluster 3.4 (or above!) and you'd like to give gstatus a go, head on over to the forge. Installing is as simple as grabbing the tar archive from the forge, extracting it and running it... that's it.

Update 4th March. Thanks to Niels at Red Hat, the code has been restructured to make packaging and installation easier. Now to install, you need to download the tar file, extract it and run 'python install' (you'll need to ensure your system has python-setuptools installed first!)

Anyway, it helps me - maybe it'll help you too!