Monday, 30 June 2014

"gluster-deploy" Updated to Support Multiple Volume Configurations

When I started the gluster-deploy project, the main requirement I had for a gluster setup wizard was to support a single volume configuration. However, over time that's changed - in no small part due to OpenStack!

Once glusterfs provided support for OpenStack, multi-volume setups were needed to support different volumes for glance and cinder, each optimised for their respective workloads.

So this requirement slowly rose to the top of my to-do list, and now version 0.8 is available on the gluster forge. The 0.7 and 0.8 releases have extended the tool to provide:

  • support for thin provisioned gluster bricks based on LVM
  • support for raid alignment of the bricks (through globals set in the deploy.cfg file)
  • more flexibility on the node discovery page (you can now change your mind when defining which peers to use in your config)
  • a volume queue concept, allowing multiple volumes to be defined in a single run
  • the ability to change the characteristics of the volumes in the queue
  • the ability to mount volumes defined for hadoop across all nodes (converged storage + compute use case)
  • the ability to set a tuned profile across all nodes during the brick creation phase
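As an aside on the raid alignment point: the idea is that the LVM data alignment (and the filesystem's stripe settings) should match the array's full-stripe width. A minimal sketch of the arithmetic, with a function name of my own invention:

```python
def raid_alignment_kib(stripe_unit_kib, data_disks):
    """Full-stripe width in KiB for a RAID array.

    This is the value you'd feed to pvcreate --dataalignment (and
    reflect in the XFS su/sw options) so that brick I/O lines up
    with the underlying stripe.
    """
    return stripe_unit_kib * data_disks

# e.g. RAID6 of 12 disks = 10 data disks, with a 256 KiB stripe unit
print(raid_alignment_kib(256, 10))  # -> 2560
```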

Here's a video showing the tool running against a beta version of the new Red Hat Storage 3.0 product, which also provides a quick taste of volume level snapshots that are coming in version 3.6 of glusterfs.

Onwards, towards 1.0!

Sunday, 2 March 2014

What's your Storage Up to?

Proprietary and Software Defined Storage platforms are getting increasingly complex as more and more features are added. Let's face it, complexity is a fact of life! The thing that really matters is whether this increasing complexity makes the day-to-day life of an admin easier or harder.

Understanding the state of a platform at any given time is a critical 'feature' for any platform - be it compute, network or storage. Most of the proprietary storage vendors spend a lot of time, money and resources trying to get this right, but what about open source storage? Is it as mature as the proprietary offerings?

I'd have to say 'no'. But open source never stands still - someone always has an itch to scratch, which moves things forward.
I took a look at the ceph project and compared it to the current state of glusterfs.

The architecture of Ceph is a lot more complex than gluster's. To my mind, this has put more emphasis on the ceph team to provide tools that give insight into the state of the platform. Looking at the ceph cli, you can see that the ceph guys have thought this through, providing a great "at-a-glance" view of what's going on within a ceph cluster.

Here's an example from a test 0.67.4 environment I have:

[root@ceph2-admin ceph]# ceph -s
  cluster 5a155678-4158-4c1d-96f4-9b9e9ad2e363
   health HEALTH_OK
   monmap e1: 3 mons at {1=,2=,3=}, election epoch 34, quorum 0,1,2 1,2,3
   osdmap e95: 4 osds: 4 up, 4 in
    pgmap v5356: 1160 pgs: 1160 active+clean; 16 bytes data, 4155 MB used, 44940 MB / 49096 MB avail
   mdsmap e1: 0/0/1 up

This is what I'm talking about - a single command, and you stand a fighting chance of understanding what's going on - albeit in geek speak!

By contrast, gluster's architecture is much less complex, so the necessity for easy to consume status information has not been as great. In fact as it stands today with gluster 3.4 and the new 3.5 release, you have to piece together information from several commands to get a consolidated view of what's going on.
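To make "several commands" concrete, these are the sorts of commands you end up combining by hand (the command-to-question mapping below is my own summary, not output from any gluster tool):

```python
# Commands an admin typically pieces together to assess a cluster,
# paired with the question each one answers.
status_commands = {
    "gluster peer status":          "which nodes are in the cluster, and are they connected?",
    "gluster volume info":          "what volumes exist, and how are they configured?",
    "gluster volume status":        "which bricks and daemons are actually running?",
    "gluster volume heal VOL info": "is there a self-heal backlog?",
}

for cmd, question in status_commands.items():
    print("%-32s # %s" % (cmd, question))
```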

Not exactly 'optimal' !

However, the gluster project has the community forge where projects can be submitted to add functionality to a gluster deployment, even if the changes aren't inside the core of gluster.

So I came up with a few ideas about what I'd like to see and started a project on the forge called gstatus. I've based the gstatus tool on the xml output functionality that was introduced in gluster 3.4, so earlier versions of gluster are not compatible at this point (sorry, if that's you!).
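Since the tool is built on the `--xml` output, the parsing itself is straightforward with Python's standard library. Here's a sketch against a hand-written fragment - the element names follow the 3.4-era schema as I understand it, so treat them as an assumption:

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only - real `gluster volume info --xml`
# output is much larger; the element names here are an assumption
# based on the 3.4-era schema.
sample = """<cliOutput>
  <volInfo><volumes>
    <volume>
      <name>myvol</name>
      <statusStr>Started</statusStr>
      <brickCount>4</brickCount>
    </volume>
  </volumes></volInfo>
</cliOutput>"""

root = ET.fromstring(sample)
for vol in root.iter('volume'):
    print(vol.findtext('name'), vol.findtext('statusStr'),
          vol.findtext('brickCount'))
```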

It's also still early days for the tool, but I find it useful now - maybe you will too. As a taster, here's an example of the output the tool generates against a test gluster 3.5 environment.

[root@glfs35-1 gstatus]# ./ -s

      Status: HEALTHY           Capacity: 80.00 GiB(raw bricks)
   Glusterfs: 3.5.0beta3                  60.00 GiB(usable)

   Nodes    :  4/ 4        Volumes:  2 Up
   Self Heal:  4/ 4                  0 Up(Degraded)
   Bricks   :  8/ 8                  0 Up(Partial)
                                     0 Down
Status Messages
  - Cluster is healthy, all checks successful

As I mentioned, gluster does provide all the status information - just not in a single command. 'gstatus' grabs the output from several commands and brings the information together into a logical model that represents a cluster. The object model it creates centres on a cluster object as the parent, with child objects for volume(s), bricks and nodes. The objects 'link' together, allowing simple checks to be performed and state information to be propagated.
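As a rough illustration of that object model (class and attribute names here are invented for the sketch - the real gstatus model carries much more state):

```python
class Brick(object):
    def __init__(self, name, up=True):
        self.name, self.up = name, up

class Volume(object):
    def __init__(self, name, bricks):
        self.name, self.bricks = name, bricks

class Cluster(object):
    """Parent object: volumes (and their bricks) hang off it, so a
    health check can simply walk the tree and roll state upwards."""
    def __init__(self, volumes):
        self.volumes = volumes

    def healthy(self):
        return all(b.up for v in self.volumes for b in v.bricks)

c = Cluster([Volume('myvol', [Brick('glfs35-1:/gluster/brick1'),
                              Brick('glfs35-2:/gluster/brick1')])])
print(c.healthy())  # -> True
```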

You may have noticed that there are some new volume states introduced here, which aren't currently part of gluster-core.
Up(Degraded): a volume in this state is a replicated volume that has at least one brick down/unavailable within a replica set. Data is still available from the other members of the replica set, so the degraded state simply makes you aware that performance may be affected, and that further brick loss in the same replica set will result in data being inaccessible.
Up(Partial): this state is an escalation of the degraded state. In this case, all members of the same replica set are unavailable - so data stored within that replica set is no longer accessible. Access to data residing on other replica sets continues, so I still regard the volume as a whole as up, but flag it as partial.
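The state logic boils down to looking at brick availability per replica set. A minimal sketch in Python - my own re-statement of the rules above, not gstatus code:

```python
def volume_state(replica_sets):
    """Derive a volume state from brick availability.

    replica_sets: a list of lists of booleans (True = brick up),
    one inner list per replica set. A pure distribute volume is
    just replica sets of size one.
    """
    if all(all(rset) for rset in replica_sets):
        return "UP"
    if any(not any(rset) for rset in replica_sets):
        # at least one replica set has lost every brick
        return "UP(PARTIAL)" if any(any(rset) for rset in replica_sets) else "DOWN"
    # every set still has a surviving brick, but some copies are gone
    return "UP(DEGRADED)"
```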

The script has a number of 'switches' that allow you to get a high-level view of the cluster, or drill down into the brick layouts themselves (much the same as the lsgvt project on the forge).

To further illustrate the kind of information the tool can provide, here's a run where things are not quite as happy (I killed a couple of brick daemons!)

[root@glfs35-1 gstatus]# ./ -s

      Status: UNHEALTHY         Capacity: 80.00 GiB(raw bricks)
   Glusterfs: 3.5.0beta3                  60.00 GiB(usable)

   Nodes    :  4/ 4        Volumes:  0 Up
   Self Heal:  4/ 4                  1 Up(Degraded)
   Bricks   :  6/ 8                  1 Up(Partial)
                                      0 Down
Status Messages
  - Cluster is UNHEALTHY
  - Volume 'dist' is in a PARTIAL state, some data is inaccessible due to missing bricks
  - WARNING -> Write requests may fail against volume 'dist'
  - Brick glfs35-3:/gluster/brick1 in volume 'myvol' is down/unavailable
  - Brick glfs35-2:/gluster/brick2 in volume 'dist' is down/unavailable

I've highlighted some of the areas in the output to make them easier to talk about:
  • The summary section shows 6 of 8 bricks are in an up state - so this is the first indication that things are not going well.
  • We also have a volume in a partial state and, as the warning indicates, this means the potential of failed writes and inaccessible data for that volume.
Now to drill down, you can use '-v' to get a more detailed volume-level view of the problems (this doesn't have to be two steps - you could use 'gstatus -a' to get state info and volume info combined).

[root@glfs35-1 gstatus]# ./ -v

      Status: UNHEALTHY         Capacity: 80.00 GiB(raw bricks)
   Glusterfs: 3.5.0beta3                  60.00 GiB(usable)

Volume Information
    myvol            UP(DEGRADED) - 3/4 bricks up - Distributed-Replicate
                     Capacity: (0% used) 79.00 MiB/20.00 GiB (used/total)
                     Self Heal:  4/ 4   Heal backlog:54 files
                     Protocols: glusterfs:on  NFS:off  SMB:off

    dist             UP(PARTIAL)  - 3/4 bricks up - Distribute
                     Capacity: (0% used) 129.00 MiB/40.00 GiB (used/total)
                     Self Heal: N/A   Heal backlog:0 files
                     Protocols: glusterfs:on  NFS:on  SMB:on
  • For both volumes a brick is down, so each volume shows only 3/4 bricks up. The state of the volume, however, depends upon the volume type: 'myvol' is a volume that uses replication, so its volume state is just degraded - whereas the 'dist' volume is in a partial state.
  • The heal backlog count against the 'myvol' volume shows 54 files. This indicates that the volume is tracking updates being made to a replica set where a brick is currently unavailable. Once the brick comes back online, the backlog will be cleared by the self heal daemon (all self heal daemons are present and 'up', as shown in the summary section).
You can drill down further into the volumes too, and look at the brick sizes and relationships using the -l flag.

 [root@glfs35-1 gstatus]# ./ -v myvol -l

      Status: UNHEALTHY         Capacity: 80.00 GiB(raw bricks)
   Glusterfs: 3.5.0beta3                  60.00 GiB(usable)

Volume Information
    myvol            UP(DEGRADED) - 3/4 bricks up - Distributed-Replicate
                     Capacity: (0% used) 77.00 MiB/20.00 GiB (used/total)
                     Self Heal:  4/ 4   Heal backlog:54 files
                     Protocols: glusterfs:on  NFS:off  SMB:off

    myvol----------- +
                Distribute (dht)
                         +-- Repl Set 0 (afr)
                         |     |
                         |     +--glfs35-1:/gluster/brick1(UP) 38.00 MiB/10.00 GiB
                         |     |
                         |     +--glfs35-2:/gluster/brick1(UP) 38.00 MiB/10.00 GiB
                         +-- Repl Set 1 (afr)
                               +--glfs35-3:/gluster/brick1(DOWN) 36.00 MiB/10.00 GiB
                               +--glfs35-4:/gluster/brick1(UP) 39.00 MiB/10.00 GiB S/H Backlog 54

You can see in the example above, that the layout display also shows which bricks are recording the self-heal backlog information.

How you run the tool depends on the question you want to ask, but the program options provide a way to show state, state + volumes, brick layouts and capacities etc. Here's the 'help' information;

[root@glfs35-1 gstatus]# ./ -h
Usage: [options]

  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -s, --state           show highlevel health of the cluster
  -v, --volume          volume info (default is ALL, or supply a volume name)
  -a, --all             show all cluster information
  -u UNITS, --units=UNITS
                        display capacity units in DECimal or BINary format (GB
                        vs GiB)
  -l, --layout          show brick layout when used with -v, or -a

Lots of ways to get into trouble :)

Now the bad news...

As I mentioned earlier, the tool relies upon the xml output from gluster, which is a relatively new thing. The downside is that I've uncovered a couple of bugs that result in malformed xml coming from the gluster commands. I'll raise a bug for this, but until it gets fixed in gluster I've added some workarounds in the code to mitigate the problem. It's not perfect, so if you do try the tool and see errors like this -

 Traceback (most recent call last):
  File "./", line 153, in <module>
  File "./", line 42, in main
  File "/root/bin/gstatus/classes/", line 398, in updateState
  File "/root/bin/gstatus/classes/", line 601, in update
    brick_path = node_info['hostname'] + ":" + node_info['path']
KeyError: 'hostname'

- you're hitting the issue.

If you're running gluster 3.4 (or above!) and you'd like to give gstatus a go, head on over to the forge and grab it. Installing is as simple as downloading the tar archive from the forge, extracting it and running it... that's it.

Update 4th March: Thanks to Niels at Red Hat, the code has been restructured to make it easier for packaging and installation. Now to install, you need to download the tar file, extract it and run 'python install' (you'll need to ensure your system has python-setuptools installed first!)

Anyway, it helps me - maybe it'll help you too!

Monday, 6 January 2014

Distributed Storage - too complicated to try?

The thing about distributed storage is that all the pieces that make the magic happen are... well, distributed! The distributed nature of the components can represent a significant hurdle for people looking to evaluate whether distributed storage is right for them. Not only do people have to set up multiple servers, but they also have to get to grips with services/daemons, new terms and potentially clustering complexity.

So what can be done?

Well, the first thing is to look for a distributed storage architecture that tries to make things simple in the first place - life's too short for unnecessary complexity.

The next question is: "Does the platform provide an easy to use and understandable deployment tool?"

Confession time - I'm involved with the gluster community. A while ago I started a project called gluster-deploy, which aims to make the first-time configuration of a gluster cluster child's play. I originally blogged about an early release of the tool in October, so perhaps now is a good time to revisit the project and see how easy it is to get started with gluster (completely unbiased view, naturally!)

At a high level, all distributed storage platforms consist of a minimum of two layers;
  • cluster layer - binding the servers together, into a single namespace
  • aggregated disk capacity - pooling storage from each of the servers together to present easy to consume capacity to the end user/applications
So the key thing is to deliver usable capacity as quickly and as pain-free as possible - whilst ensuring that the storage platform is configured correctly. Now, I could proceed to show you a succession of screenshots of gluster-deploy in action - but to prevent 'death-by-screenshot' syndrome, I'll refrain from that and just pick out the highlights.

I won't cover installing the gluster rpms, but I will point out that if you're using fedora, they are in the standard repository - and if you're not using fedora, head on over to the gluster download site.

So let's assume that you have several servers available; each one has an unused disk and gluster installed and started. If you grab the gluster-deploy tool from the gluster-deploy link above you'll have a tar.gz archive that you can untar onto one of your test nodes. Login to one of the nodes as 'root' and untar the archive;

>tar xvzf gluster-deploy.tar.gz && cd gluster-deploy

This will untar the archive and place you in the gluster-deploy directory. Before we run it, let's take a look at the options the program supports:

[root@mynode gluster-deploy]# ./ -h
Usage: [options]

  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -n, --no-password     Skip access key checking (debug only)
  -p PORT, --port=PORT  Port to run UI on (> 1024)
  -f CFGFILE, --config-file=CFGFILE
                        Config file providing server list bypassing subnet

Ok. So there is some tweaking we can do, but for now let's just run it.

[root@mynode gluster-deploy]# ./

gluster-deploy starting

    Configuration file
        -> Not supplied, UI will perform subnet selection/scan

    Web server details:
        Access key  - pf20hyK8p28dPgIxEaExiVm2i6
        Web Address -

    Setup Progress

Taking the URL displayed in the CLI, and pasting into a browser, starts the configuration process.

The deployment tool basically walks through a series of pages that gather some information about how we'd like our cluster and storage to look. Once the information is gathered, the tool then does all the leg-work across the cluster nodes to complete the configuration, resulting in a working cluster and a volume ready to receive application data.
At a high level, gluster-deploy performs the following tasks;

- Build the cluster;
  • via a subnet scan - the user chooses which subnet to scan (based on the subnets seen on the server running the tool) 
  • via a config file that supplies the nodes to use in the cluster (-f invocation parameter) 
- Configure passwordless login across the nodes, enabling automation 

- Perform disk discovery. Any unused disk is shown in the UI

- You then choose which of the discovered disks you want gluster to use

- Once the disks are selected, you define how you want the disks managed
  • lvm (default)
  • lvm with dm-thinp
  • btrfs (not supported yet, but soon!)
  NB When you choose to use snapshot support (lvm with dm-thinp or btrfs), confirmation is required since these are 'future' features, typically there for developers.

- Once the format is complete, you define the volume that you want gluster to present to your application(s). The volume create process includes 'some' intelligence to make life a little easier
  • tuning presets are provided for common gluster workloads like OpenStack cinder and glance, ovirt/rhev, and hadoop
  • distributed volumes and distributed-replicated volumes types are supported
  • for volumes that use replication, the UI prevents disks (bricks) from the same server being assigned to the same replica set
  • UI shows a summary of the capacity expectation for the volume given the brick configuration and replication overheads
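For the curious, the subnet scan in the first step essentially probes each address for a listening glusterd, which defaults to TCP port 24007. A rough sketch of the core idea (not the tool's actual code):

```python
import socket

def glusterd_listening(host, port=24007, timeout=0.5):
    """Return True if something accepts a TCP connection on the
    glusterd port. gluster-deploy's real scan is richer than this;
    the snippet just shows the core idea."""
    try:
        s = socket.create_connection((host, port), timeout)
        s.close()
        return True
    except (socket.error, OSError):
        return False

# probing a /24 would then just be a loop over the addresses
candidates = ['192.168.122.%d' % i for i in range(1, 255)]
```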
Now, let's take a closer look at what you can expect to see during these phases.

The image above shows the results from the subnet scan. Four nodes have been discovered on the selected subnet that have gluster running on them. You then select which nodes you want from the left hand 'box' and click the 'arrow' icon to add them to the cluster nodes. Once you're happy, click 'Create'.

Passwordless login is a feature of ssh, which enables remote login by shared public keys. This capability is used by the tool to enable automation across the nodes.

With the public keys in place, the tool can scan for 'free' disks.

Choosing the disks to use is just a simple checkbox, and if they all look right, just click the checkbox in the table heading. Understanding which disks to use is phase one; the next step is to confirm how you want to manage these disks (which, at a low level, defines the characteristics for the Logical Volume Manager).

Clicking on the "Build Bricks" button initiates a format process across the servers to prepare the disks, building the low-level filesystem and updating the node's filesystem table (fstab). These bricks then become the component parts of the gluster volume that gets mounted by the users or applications.
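The fstab update is worth a peek, since a brick is only useful if it comes back after a reboot. A trivial sketch of rendering the entry (device and mount point names are made up for illustration):

```python
def fstab_entry(device, mountpoint, fstype="xfs", options="defaults"):
    """Render a single /etc/fstab line: device, mount point,
    filesystem type, options, then the dump and fsck pass fields."""
    return "%s  %s  %s  %s  0 0" % (device, mountpoint, fstype, options)

print(fstab_entry("/dev/vg_brick1/lv_brick1", "/gluster/brick1"))
```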

Volumes can be tuned/optimised for different workloads, so the tool has a number of presets to choose from. Choose a 'Use case' that best fits your workload, and then a volume type (distributed or replicated) that meets your data availability requirements. Now you can see a list of bricks on the left and an empty table on the right. Select which bricks you want in the volume and click the arrow to add them to the table. A Volume Summary is presented at the bottom of the page showing you what will be built (space usable, brick count, fault tolerance). Once you're happy, simply click the "Create" button.
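That "no two bricks from the same server in a replica set" rule is easy to express in code. A sketch of the check as I understand it (function name and grouping invented for illustration):

```python
def safe_replica_sets(bricks, replica):
    """Group bricks (host:/path strings) into consecutive replica
    sets, refusing any set that repeats a host - losing that one
    server would take out every copy of the data in the set."""
    sets = [bricks[i:i + replica] for i in range(0, len(bricks), replica)]
    for rset in sets:
        hosts = [b.split(":")[0] for b in rset]
        if len(hosts) != len(set(hosts)):
            raise ValueError("replica set %s repeats a host" % rset)
    return sets

# a sane layout: each replica pair spans two different servers
safe_replica_sets(
    ["glfs35-1:/gluster/brick1", "glfs35-2:/gluster/brick1",
     "glfs35-3:/gluster/brick1", "glfs35-4:/gluster/brick1"], 2)
```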

The volume will be created and started, making it available to clients straight away. In my test environment, the time to configure the cluster and storage was less than a minute...

So, if you can use a mouse and a web browser, you can now configure and enjoy the gluster distributed filesystem : no excuses!

For a closer look at the tool's workflow, I've posted a video to youtube.

In a future post, I'll show you how to use foreman to simplify the provisioning of the gluster nodes themselves.