Tuesday, 5 July 2016

De-mystifying gluster shards

Recently I've been working on converging glusterfs with oVirt - hyperconverged, open source style. oVirt has supported glusterfs storage domains for a while, but in the past a virtual disk was stored as a single file on a gluster volume. This helps some workloads, but file distribution and functions like self heal and rebalance have more work to do. The larger the virtual disk, the more work gluster has to do in one go.

Enter sharding

The shard translator was introduced with version 3.7, and enables large files to be split into smaller chunks (shards) of a user defined size. This addresses a number of legacy issues when using glusterfs for virtual machine storage - but does introduce an additional level of complexity. For example, how do you now relate a file to its shards, or vice-versa?

The great thing is that even though a file is split into shards, the implementation still allows you to relate files to shards with a few simple commands.
Firstly, let's look at how to relate a file to its shards;
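The commands from the original post haven't survived here, so what follows is only a rough sketch of the idea rather than the original listing. It assumes the usual shard layout: the first block stays in the base file, and the remaining blocks are stored on the bricks under /.shard as <gfid>.<index>. It also assumes you already have the file's gfid (on a brick, something like getfattr -d -m . -e hex <file> shows it as trusted.gfid). The function name and parameters are mine, purely for illustration.

```python
import math
from typing import List


def shard_names(gfid: str, file_size: int, shard_block_size: int) -> List[str]:
    """Return the shard paths (relative to a brick root) expected for a file.

    Assumption: block 0 is the base file itself, so only blocks 1..N-1
    appear under .shard. Sparse regions may have no shard at all.
    """
    if shard_block_size <= 0:
        raise ValueError("shard_block_size must be positive")
    total_blocks = max(1, math.ceil(file_size / shard_block_size))
    return [f".shard/{gfid}.{index}" for index in range(1, total_blocks)]
```

With the list in hand you can look for each name across the bricks. Bear in mind that on a distributed volume the shards of one file can be spread over several bricks.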

And now, let's go the other way. We start with a shard, and end with the parent file.
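Again, the original commands are missing, so this is a hedged sketch and not the author's listing. It assumes a shard file is named <gfid>.<index>, and that each brick keeps a hard link to a file under .glusterfs/<first two hex chars>/<next two>/<gfid>. The helper name is hypothetical.

```python
def parent_gfid_path(shard_name: str) -> str:
    """Map a shard file name ('<gfid>.<index>') to the parent's .glusterfs path.

    The returned path is relative to the root of the brick that holds the
    base file, which may not be the brick where the shard was found.
    """
    gfid, _, index = shard_name.rpartition(".")
    if not gfid or not index.isdigit():
        raise ValueError(f"not a shard name: {shard_name!r}")
    return f".glusterfs/{gfid[0:2]}/{gfid[2:4]}/{gfid}"
```

On the brick holding the base file, something like find <brick> -samefile <that path> should then turn the gfid link back into the file's real name.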

Hopefully this helps others getting to grips with glusterfs sharding (and maybe even oVirt!)

Sunday, 29 May 2016

Making gluster play nicely with others

These days hyperconverged strategies are everywhere. But when you think about it, sharing the finite resources within a physical host requires an effective means of prioritisation and enforcement. Luckily, the Linux kernel already provides an infrastructure for this in the shape of cgroups, and the interface to these controls is now simplified with systemd integration.

So let's look at how you could use these capabilities to make Gluster a better neighbour in a collocated or hyperconverged model.

First, some common systemd terms we should be familiar with;
slice : a slice is a concept that systemd uses to group together resources into a hierarchy. Resource constraints can then be applied to the slice, which defines 
  • how different slices may compete with each other for resources (e.g. weighting)
  • how resources within a slice are controlled (e.g. cpu capping)
unit : a systemd unit is a resource definition for controlling a specific system service
NB. More information about control groups with systemd can be found in the systemd.resource-control man page.

In this article, I'm keeping things simple by implementing a cpu cap on glusterfs processes. Hopefully, the two terms above are big clues, but conceptually it breaks down into two main steps;
  1. define a slice which implements a CPU limit
  2. ensure gluster's systemd unit(s) start within the correct slice.
So let's look at how this is done.

Defining a slice

Slice definitions can be found under /lib/systemd/system, but systemd provides a neat feature where /etc/systemd/system can be used to provide local "tweaks". This override directory is where we'll place a slice definition. Create a file called glusterfs.slice, containing;


CPUQuota is our means of applying a cpu limit to everything running within the slice. A value of 200% caps the slice at the equivalent of two full cores (execution threads).
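The slice file listing itself hasn't survived in this copy of the post. As a stand-in, here's a small sketch that renders what I'd assume is the minimal content: a [Slice] section with a single CPUQuota line. The function names are hypothetical, and the real file may well have carried extra settings.

```python
def quota_for_cores(cores: float) -> int:
    """Convert a core count into a CPUQuota percentage (100% per core)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return round(cores * 100)


def render_slice(cpu_quota_percent: int) -> str:
    """Build the text of a systemd slice unit that caps CPU usage.

    Assumption: the slice only needs the [Slice] header and CPUQuota.
    """
    if cpu_quota_percent <= 0:
        raise ValueError("cpu_quota_percent must be positive")
    return f"[Slice]\nCPUQuota={cpu_quota_percent}%\n"
```

Writing the result of render_slice(quota_for_cores(2)) to /etc/systemd/system/glusterfs.slice would give the two core cap described above.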

Updating glusterd

Next step is to give gluster a nudge so that it shows up in the right slice. If you're using RHEL 7 or CentOS 7, cpu accounting may be off by default (you can check in /etc/systemd/system.conf). This is OK, it just means we have an extra parameter to define. Follow these steps to change the way glusterd is managed by systemd;

# cd /etc/systemd/system
# mkdir glusterd.service.d
# echo -e "[Service]\nCPUAccounting=true\nSlice=glusterfs.slice" > glusterd.service.d/override.conf

glusterd is responsible for starting the brick and self heal processes, so by ensuring glusterd starts in our cpu limited slice, we capture all of glusterd's child processes too. Now the potentially bad news...this 'nudge' requires a stop/start of gluster services. If you're doing this on a live system you'll need to consider quorum, self heal etc. However, with the settings above in place, you can get gluster into the right slice with;

# systemctl daemon-reload
# systemctl stop glusterd
# killall glusterfsd && killall glusterfs
# systemctl daemon-reload
# systemctl start glusterd

You can see where gluster is within the control group hierarchy by looking at its runtime settings

# systemctl show glusterd | grep slice
After=rpcbind.service glusterfs.slice systemd-journald.socket network.target basic.target

or use the systemd-cgls command to see the whole control group hierarchy

├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 19
│ └─glusterd.service
│   ├─ 867 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
│   ├─1231 /usr/sbin/glusterfsd -s server-1 --volfile-id repl.server-1.bricks-brick-repl -p /var/lib/glusterd/vols/repl/run/server-1-bricks-brick-repl.pid
│   └─1305 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log
│ └─user-0.slice
│   └─session-1.scope
│     ├─2075 sshd: root@pts/0  
│     ├─2078 -bash
│     ├─2146 systemd-cgls
│     └─2147 less

At this point gluster is exactly where we want it! 

Time for some more systemd coolness ;) The resource constraints that are applied by the slice are dynamic, so if you need more cpu, you're one command away from getting it;

# systemctl set-property glusterfs.slice CPUQuota=350%

Try the 'systemd-cgtop' command to show the cpu usage across the complete control group hierarchy.

Now if jumping straight into applying resource constraints to gluster is a little daunting, why not test this approach with a tool like 'stress'? Stress is designed to simply consume components of the system - cpu, memory, disk. Here's an example .service file which uses stress to consume 4 cores

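[Unit]
Description=CPU soak task

[Service]
# Assumption: run the soak test inside the slice defined earlier
Slice=glusterfs.slice
ExecStart=/usr/bin/stress -c 4
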
Now you can tweak the service, and the slice with different thresholds before you move on to bigger things! Use stress to avoid stress :)

And now the obligatory warning. Introducing any form of resource constraint may result in unexpected outcomes, especially in hyperconverged/collocated systems - so adequate testing is key.

With that said...happy hacking :)

Tuesday, 26 April 2016

Using LIO with Gluster

In the past, gluster users have been able to open up their gluster volumes to iSCSI using the tgt daemon. This has been covered on other blogs and is also documented on gluster.org.

But tgt has been superseded in more recent distros by LIO. LIO provides a number of different local storage options to be utilised as SCSI targets, including FILEIO, BLOCK, PSCSI and RAMDISK. These SCSI targets are implemented as modules in kernel space, but what isn't immediately obvious is that LIO also provides a userspace framework called TCMU. TCMU enables userspace files to become iSCSI targets.

With LIO, the easiest way to exploit gluster as an iSCSI target was through the FILEIO 'storage engine' over FUSE. However, the high number of context switches incurred within FUSE is likely to reduce the performance potential to your 'client' -  especially for random I/O access patterns.

Until now, FUSE was your only option. But Andy Grover at Red Hat has just changed things. Andy has developed tcmu-runner which utilises the TCMU framework, allowing a glusterfs target to be used over gluster's libgfapi interface. Typically, with libgfapi you can expect less context switching, and improved performance.

For those like me, with short attention spans, here's what the improvement looked like when I compared LIO/FUSE with LIO/gfapi using a couple of fio  based workloads.

[Chart: Read Improvement]
[Chart: Mixed Workload Improvement]

In both charts, IOPS and latency improve significantly using LIO/GFAPI, and further still by adopting the arbiter volume.

As you can see, for a young project, these results are really encouraging. The bad news is that to try tcmu-runner you'll need to either build systems based on Fedora F24/rawhide or compile it yourself from the github repo. Let's face it, there's always a price to pay for new shiny stuff :)

For the remainder of this article, I'll walk through the configuration of LIO and the iSCSI client that I used during my comparisons.

Preparing Your Environment

In the interests of brevity, I'm assuming that you know how to build servers,  create a gluster trusted pool and define volumes. Here's a checklist of the tasks you should do in order to prepare a test environment;
  1. build 3 Fedora24 nodes and install gluster (3.7.11) on each peer/node
  2. on each node, ensure /etc/glusterfs/glusterd.vol contains the following setting - option rpc-auth-allow-insecure on. This is needed for gfapi access. Once added, you'll need to restart glusterd.
  3. install targetcli (targetcli-2.1.fb43-1) and tcmu-runner (tcmu-runner-1.0.4-1) on each of your gluster nodes
  4. form a gluster trusted pool, and create a replica 3 volume or replica with arbiter volume (or both!) 
  5. issue "gluster vol set <vol_name> server.allow-insecure on" to enable libgfapi access to the volume
There are several ways to configure the iSCSI environment, but for my tests I adopted the following approach;
  • two of my three gluster nodes will be iSCSI gateways (LIO targets)
  • each gateway will have its own iqn (iSCSI Qualified Name)
  • each gateway will only access the gluster volume from itself, so if gluster is down on this node so is the path for any attached client (makes things simple)
  • high availability for the LUN is provided by client side multipathing
Before moving on, you can confirm that targetcli/tcmu-runner are providing the gluster integration by simply running 'ls' from the targetcli.

# targetcli ls
o- / ...............
  o- backstores ....
  | o- block .......
  | o- fileio ......
  | o- pscsi .......
  | o- ramdisk .....
  | o- user:glfs ...    <--- gluster gfapi available through tcmu
  | o- user:qcow ...
  o- iscsi .........
  o- loopback ......
  o- vhost ......

With the preparation complete, let's configure the LIO gateways.

Configuring LIO - Node 1

The following steps provide an example configuration. You'll need to make changes to naming etc specific to your test environment.

  1. Mount the volume (called iscsi-pool), and allocate the file that will become the LUN image
     # fallocate -l 100G mytest.img
  2. Enter the targetcli shell. The remaining steps all take place within this shell.
  3. Create the backing store connection to the glusterfs file
     /backstores/user:glfs create myLUN 100G iscsi-pool@iscsi-3/mytest.img
  4. Create the node's target portal (this is the name the client will connect to). In this example 'iscsi-3' is the node name
     /iscsi/ create iqn.2016-04.org.gluster:iscsi-3
     NB. this will create the target IQN and the iscsi portal will be enabled and listening on port 3260
  5. On the client, 'grab' its iqn from /etc/iscsi/initiatorname.iscsi, then add it to the gateway
     /iscsi/iqn.2016-04.org.gluster:iscsi-3/tpg1/acls/ create iqn.1994-05.com.redhat:14a2b41fe9e4
  6. Add the LUN, "myLUN", to the target and automatically map it to the client(s)
     /iscsi/iqn.2016-04.org.gluster:iscsi-3/tpg1/luns create /backstores/user:glfs/myLUN 0
  7. Issue saveconfig to commit the configuration (config is stored in /etc/target/saveconfig.json)

Configuring LIO - Node 2 

When a LUN is defined by targetcli, a wwn is automatically generated for it. This is neat, but to ensure multipathing works we need the LUN exported by the gateways to share the same wwn - if they don't match, the client will see two devices, not two paths to the same device.

So for subsequent nodes, the steps are slightly different.
  1. On the first node, look at /etc/target/saveconfig.json. You'll see a storage object item for the gluster file you've just created, together with the wwn that was assigned (the "wwn" entry).
       "storage_objects": [
           "config": "glfs/iscsi-pool@iscsi-3/mytest.img",
           "name": "myLUN",
           "plugin": "user",
           "size": 107374182400,
           "wwn": "653e4072-8aad-4e9d-900e-4059f0e19e7e"
  2. Open the targetcli shell on node 2, and define a LUN pointing to the same backing file as node 1, but this time explicitly specifying the wwn (from step 1)
     /backstores/user:glfs create myLUN 100G iscsi-pool@iscsi-1/mytest.img 653e4072-8aad-4e9d-900e-4059f0e19e7e
     (if you cd to /backstores/user:glfs and use help create you'll see a summary of the options available when creating the LUN)
  3. With the LUN in place, you can follow steps 4-7 above to create the iqn, portal and LUN masking for this node.
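If you'd rather not eyeball the json, the wwn lookup from step 1 can be scripted. This is a minimal sketch that assumes the saveconfig.json layout shown in the excerpt above (a top level "storage_objects" list whose entries carry "name" and "wwn"). The function name is my own.

```python
import json


def find_wwn(saveconfig_text: str, lun_name: str) -> str:
    """Return the wwn recorded for the named storage object."""
    config = json.loads(saveconfig_text)
    for storage_object in config.get("storage_objects", []):
        if storage_object.get("name") == lun_name:
            return storage_object["wwn"]
    raise KeyError(f"no storage object named {lun_name!r}")
```

Feed it the contents of /etc/target/saveconfig.json from the first gateway, and paste the returned value into the create command on the second.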

At this point you have;
  • 3 gluster nodes
  • a gluster volume with a file defined, serving as an iscsi target
  • 2 gluster nodes defined as iscsi gateways
  • each gateway exports the same LUN to a client (supporting multipathing)

Next up...configuring the client.

Client Configuration

To get the client to connect to your 'exported' LUN(s), you first need to ensure that the following rpms are installed on the client; device-mapper-multipath, iscsi-initiator-utils and preferably sg3_utils. With these packages in place you can move on to configure multipathing and connect to your LUN(s).
  • Multipathing : the example below shows a devices section from /etc/multipath.conf that I used to ensure my exported LUNs are seen as multipath devices. With this in place, you can take a node down for maintenance and your LUN remains accessible (as long as your volume has quorum!)
devices {
    device {
        vendor "LIO-ORG"
        path_grouping_policy "multibus"
        # I tested with a path_selector of "round-robin" and "queue-length"
        path_selector "queue-length 0"
        path_checker "directio"
        prio "const"
        rr_weight "uniform"
    }
}

  • iscsi discovery/login : to login to the gluster iscsi gateways just use the iscsiadm command (from iscsi-initiator-utils rpm)

# iscsiadm -m discovery -t st -p <your_gluster_node_1> -l
# iscsiadm -m discovery -t st -p <your_gluster_node_2> -l

# #check your paths are working as expected with multipath command
# multipath -ll
mpathd (36001405891b9858f4b0440285cacbcca) dm-2 LIO-ORG ,TCMU device   
size=8.0G features='0' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 33:0:0:1 sdc 8:32 active ready running
  `- 34:0:0:1 sde 8:64 active ready running
mpathb (3600140596a3a65692104740a88516aba) dm-3 LIO-ORG ,TCMU device   
size=8.0G features='0' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 33:0:0:0 sdb 8:16 active ready running
  `- 34:0:0:0 sdd 8:48 active ready running
mpathf (36001405653e40728aad4e9d900e4059f) dm-6 LIO-ORG ,TCMU device   
size=1.0G features='0' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 35:0:0:0 sdf 8:80 active ready running
  `- 33:0:0:2 sdg 8:96 active ready running

You can see in this example, I have three LUNs exported, and each one has two active paths (one to each gluster node). By default, the iscsi node definition (in /var/lib/iscsi/nodes) uses a setting of node.startup=automatic, which means LUN(s) will automagically reappear on the client following a reboot.

But from the client's perspective, how do you know which LUN is from which glusterfs volume/file? For this, sg_inq is your friend...

# sg_inq -i /dev/dm-6
VPD INQUIRY: Device Identification page
  Designation descriptor number 1, descriptor length: 49
    designator_type: T10 vendor identification,  code_set: ASCII
    associated with the addressed logical unit
      vendor id: LIO-ORG
      vendor specific: 653e4072-8aad-4e9d-900e-4059f0e19e7e
  Designation descriptor number 2, descriptor length: 20
    designator_type: NAA,  code_set: Binary
    associated with the addressed logical unit
      NAA 6, IEEE Company_id: 0x1405
      Vendor Specific Identifier: 0x653e40728
      Vendor Specific Identifier Extension: 0xaad4e9d900e4059f
  Designation descriptor number 3, descriptor length: 39
    designator_type: vendor specific [0x0],  code_set: ASCII
    associated with the addressed logical unit
      vendor specific: glfs/iscsi-pool@iscsi-3/mytest.img

The final descriptor shows the configuration string you specified when you created the LUN in targetcli. If you run the same command against the underlying devices (/dev/sdf or /dev/sdg), you'll see the connection string from each of the respective gateways. Nice and easy!
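There's another handy link hiding in those outputs: the multipath id for mpathf lines up with the wwn assigned in targetcli. Going purely by the pattern visible above (a leading "3", NAA type "6", the company id 001405, then the first 25 hex digits of the wwn), the mapping can be sketched like this. Treat it as an observation from this one example, not a documented guarantee. The function name is hypothetical.

```python
def lio_wwn_to_wwid(wwn: str) -> str:
    """Derive the multipath WWID expected for a LIO wwn.

    Assumption: layout inferred from the sg_inq and multipath -ll output
    in this post: '3' + NAA '6' + company id '001405' + 25 hex digits.
    """
    hex_digits = wwn.replace("-", "").lower()
    if len(hex_digits) < 25:
        raise ValueError("wwn is too short to build an NAA identifier")
    return "3" + "6" + "001405" + hex_digits[:25]
```

For the wwn used in this walkthrough, that gives the id shown against mpathf, which makes it easy to match a multipath device back to a LUN definition on the gateways.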

And Finally...

Remember, this is all shiny and new - so if you try it, expect some rough edges! However, I have to say that it looks promising, and during my tests I didn't lose any data...but YMMV :)

Happy testing!

Tuesday, 31 March 2015

Using SSL with Glusterfs

Wow - it's been a while since my last post!

Recently, I needed to configure glusterfs with SSL and found that the documentation that describes how to do it is actually pretty thin. What's annoying is that this feature has been around since 2013!

First the caveat - I'm not an expert with SSL, but I arrived at this working process after digging through mail lists and a great article from Zbyszek Żółkiewski.

There are 8 steps to follow, so nothing too taxing :)
  1. Create the keys and certificates
     On each node, perform the following;
     # cd /etc/ssl
     # openssl genrsa -out glusterfs.key 1024
     # openssl req -new -x509 -days 3650 -key glusterfs.key -subj /CN=<hostname> -out glusterfs.pem
     This step creates a private key (.key) and associated certificate (.pem) on each node. The common name (CN) I've used is the hostname, so each certificate is unique to each gluster node and/or client. You may opt for a different scheme - but the important thing is the CN chosen here is reflected in step 6.
  2. Combine the pem files to a single file
     Use scp to copy the .pem file from each node to a single node in the cluster (I'm calling it the primary host for the purpose of this article)
     # scp glusterfs.pem root@<primary-host>:/etc/ssl/<this-hostname>.pem
     On the primary host concatenate the files
     # cat glusterfs.pem host2.pem host3.pem > glusterfs.ca
  3. Distribute the common 'ca' file to all nodes
     On the primary host distribute the common CA containing the certs from all nodes/clients
     # scp /etc/ssl/glusterfs.ca root@<hostX>:/etc/ssl/.
  4. Stop the volume you want to enable SSL on
     # gluster vol stop <volume-name>
  5. Restart glusterd
     # systemctl restart glusterd
  6. Update the volume to enable SSL
     # gluster vol set <volume-name> client.ssl on
     # gluster vol set <volume-name> server.ssl on
     # gluster vol set <volume-name> auth.ssl-allow host-1,host-2,host-3
     The comma separated list should consist of the CNs used when generating the .pem files on each host, from step 1.
  7. Start the volume
     # gluster vol start <volume-name>
  8. Check SSL is enabled on the I/O Path
     Although you can use vol info to check the SSL setting is in place, the best way to confirm that SSL is actually being used is to look at one of the log files;
     # grep SSL /var/log/glusterfs/glustershd.log
     [2015-03-31 06:58:34.674091] I [socket.c:3799:socket_init] 0-vol-client-2: SSL support on the I/O path is ENABLED
     [2015-03-31 06:58:34.679316] I [socket.c:3799:socket_init] 0-vol-client-1: SSL support on the I/O path is ENABLED
     [2015-03-31 06:58:34.680784] I [socket.c:3799:socket_init] 0-vol-client-0: SSL support on the I/O path is ENABLED
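If you have a few volumes to check, the grep in the final step can be wrapped up in a few lines of Python. This sketch assumes the log message format shown above and simply returns the connection names that reported SSL as enabled. The names used here are mine, not part of gluster.

```python
import re
from typing import List

# Matches lines like:
# [...] I [socket.c:3799:socket_init] 0-vol-client-2: SSL support on the I/O path is ENABLED
SSL_ENABLED = re.compile(
    r"socket_init\] (\S+): SSL support on the I/O path is ENABLED"
)


def ssl_enabled_connections(log_text: str) -> List[str]:
    """Return the connection names that logged SSL as enabled, in log order."""
    return [match.group(1) for match in SSL_ENABLED.finditer(log_text)]
```

An empty list after the volume restart would be a hint that SSL isn't actually in use, whatever vol info says.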
That's it - enjoy a more secure glusterfs!

Monday, 30 June 2014

"gluster-deploy" Updated to Support Multiple Volume Configurations

When I started the gluster-deploy project, the main requirement I had for a gluster setup wizard was to support a single volume configuration. However, over time that's changed - in no small part, due to OpenStack!

Once glusterfs provided support for OpenStack, multi-volume setups were needed to support different volumes for glance and cinder, each optimised for their respective workloads.

So this requirement slowly rose to the top of my to-do list, and now version 0.8 is available on the gluster forge. The 0.7 and 0.8 releases have extended the tool to provide:

  • support for thin provisioned gluster bricks based on LVM
  • support for raid alignment of the bricks (through globals set in the deploy.cfg file)
  • more flexibility on the node discovery page (you can now change your mind when defining which peers to use in your config)
  • introduced a volume queue concept to allow multiple volumes to be defined in a single run
  • provide the ability to change characteristics of the volumes in the queue
  • added the ability to mount volumes defined for hadoop across all nodes (converged storage + compute use case)
  • provide the ability to set a tuned profile across all nodes during the brick creation phase

Here's a video showing the tool running against a beta version of the new Red Hat Storage 3.0 product, which also provides a quick taste of volume level snapshots that are coming in version 3.6 of glusterfs.

Onwards, towards 1.0!

Sunday, 2 March 2014

What's your Storage Up to?

Proprietary and Software Defined Storage platforms are getting increasingly complex as more and more features are added. Let's face it, complexity is a fact of life! The thing that really matters is whether this increasing complexity makes the day-to-day life of an admin easier or harder.

Understanding what the state of a platform is at any given time is a critical 'feature' for any platform - be it compute, network or storage. Most of the proprietary storage vendors spend a lot of time, money and resources in trying to get this right, but what about open source storage? Is that as mature as the proprietary offerings?

I'd have to say 'no'. But open source never stands still - someone always has an itch to scratch which moves things forward.
I took a look at the ceph project and compared it to the current state of glusterfs.

The architecture of Ceph is a lot more complex than gluster. To my mind this has put more emphasis on the ceph team to provide the tools that give insight into the state of the platform. Looking at the ceph cli, you can see that the ceph guys have thought this through and provide a great "at-a-glance" view of what's going on within a ceph cluster.

Here's an example from a test "0.67.4" environment I have;

[root@ceph2-admin ceph]# ceph -s
  cluster 5a155678-4158-4c1d-96f4-9b9e9ad2e363
   health HEALTH_OK
   monmap e1: 3 mons at {1=,2=,3=}, election epoch 34, quorum 0,1,2 1,2,3
   osdmap e95: 4 osds: 4 up, 4 in
    pgmap v5356: 1160 pgs: 1160 active+clean; 16 bytes data, 4155 MB used, 44940 MB / 49096 MB avail
   mdsmap e1: 0/0/1 up

This is what I'm talking about -  a single command, and you stand a fighting chance of understanding what's going on - albeit in geek speak! 

By contrast, gluster's architecture is much less complex, so the necessity for easy to consume status information has not been as great. In fact as it stands today with gluster 3.4 and the new 3.5 release, you have to piece together information from several commands to get a consolidated view of what's going on.

Not exactly 'optimal' !

However, the gluster project has the community forge where projects can be submitted to add functionality to a gluster deployment, even if the changes aren't inside the core of gluster.

So I came up with a few ideas about what I'd like to see and started a project on the forge called gstatus. I've based the gstatus tool on the xml functionality that was introduced in gluster 3.4, so earlier versions of gluster are not compatible at this point (sorry, if that's you!)

It's also still early days for the tool, but I find it useful now - maybe you will too. As a taster here's an example of the output the tool generates against a test gluster 3.5 environment.

[root@glfs35-1 gstatus]# ./gstatus.py -s

      Status: HEALTHY           Capacity: 80.00 GiB(raw bricks)
   Glusterfs: 3.5.0beta3                  60.00 GiB(usable)

   Nodes    :  4/ 4        Volumes:  2 Up
   Self Heal:  4/ 4                  0 Up(Degraded)
   Bricks   :  8/ 8                  0 Up(Partial)
                                     0 Down
Status Messages
  - Cluster is healthy, all checks successful

As I mentioned, gluster does provide all the status information - just not in a single command. 'gstatus', grabs the output from several commands and brings the information together into a logical model that represents a cluster. The object model it creates centres on a cluster object as the parent, with child objects of volume(s), bricks and nodes. The objects 'link' together allowing simple checks to be performed and state information to be propagated.

You may have noticed that there are some new volume states I've introduced, which aren't currently part of gluster-core.
Up Degraded : a volume in this state is a replicated volume that has at least one brick down/unavailable within a replica set. Data is still available from other members of the replica set, so the degraded state just makes you aware that performance may be affected, and further brick loss to the same replica set will result in data being inaccessible.
Up Partial : this state is an escalation of the "up degraded" state. In this case all members of the same replica set are unavailable - so data stored within this replica set is no longer accessible. Access to data residing on other replica sets continues, so I still regard the volume as a whole as up, but flag it as partial.
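To make the two states concrete, here's a minimal sketch of the rule as described above. It is not the gstatus code itself. It assumes each volume is modelled as a list of replica sets, each a list of brick up/down flags, with a plain distribute volume treated as replica sets of one brick. Treating "every brick down" as DOWN is my assumption.

```python
from typing import List


def volume_state(replica_sets: List[List[bool]]) -> str:
    """Classify a volume from the up/down state of the bricks in each replica set."""
    bricks = [up for replica_set in replica_sets for up in replica_set]
    if not any(bricks):
        return "DOWN"  # assumption: no bricks available at all
    if any(not any(replica_set) for replica_set in replica_sets):
        return "UP(PARTIAL)"  # a whole replica set is unavailable
    if not all(bricks):
        return "UP(DEGRADED)"  # every set still has at least one brick up
    return "UP"
```

So a 2x2 distributed-replicate volume with one brick down comes out as degraded, whereas a four brick distribute volume with one brick down comes out as partial.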

The script has a number of 'switches' that allow you to get a high level view of the cluster, or drill down into the brick layouts themselves (much the same as the lsgvt project on the forge)

To further illustrate the kind of information the tool can provide, here's a run where things are not quite as happy (I killed a couple of brick daemons!)

[root@glfs35-1 gstatus]# ./gstatus.py -s

      Status: UNHEALTHY         Capacity: 80.00 GiB(raw bricks)
   Glusterfs: 3.5.0beta3                  60.00 GiB(usable)

   Nodes    :  4/ 4        Volumes:  0 Up
   Self Heal:  4/ 4                  1 Up(Degraded)
   Bricks   :  6/ 8                  1 Up(Partial)
                                     0 Down
Status Messages
  - Cluster is UNHEALTHY
  - Volume 'dist' is in a PARTIAL state, some data is inaccessible data, due to missing bricks
  - WARNING -> Write requests may fail against volume 'dist'
  - Brick glfs35-3:/gluster/brick1 in volume 'myvol' is down/unavailable
  - Brick glfs35-2:/gluster/brick2 in volume 'dist' is down/unavailable

A couple of areas in the output are worth talking about;
  • The summary section shows 6 of 8 bricks are in an up state - so this is the first indication that things are not going well.
  • We also have a volume in partial state, and as the warning indicates this will mean the potential of failed writes, and inaccessible data for that volume
Now to drill down, you can use '-v' and get a more detailed volume level view of the problems (this doesn't have to be two steps, you could use gstatus -a, to get state info and volume info combined)

[root@glfs35-1 gstatus]# ./gstatus.py -v

      Status: UNHEALTHY         Capacity: 80.00 GiB(raw bricks)
   Glusterfs: 3.5.0beta3                  60.00 GiB(usable)

Volume Information
    myvol            UP(DEGRADED) - 3/4 bricks up - Distributed-Replicate
                     Capacity: (0% used) 79.00 MiB/20.00 GiB (used/total)
                     Self Heal:  4/ 4   Heal backlog:54 files
                     Protocols: glusterfs:on  NFS:off  SMB:off

    dist             UP(PARTIAL)  - 3/4 bricks up - Distribute
                     Capacity: (0% used) 129.00 MiB/40.00 GiB (used/total)
                     Self Heal: N/A   Heal backlog:0 files
                     Protocols: glusterfs:on  NFS:on  SMB:on
  • For both volumes a brick is down, so each volume shows only 3/4 bricks up. The state of the volume however depends upon the volume type. 'myvol' is a volume that uses replication, so its volume state is just degraded - whereas the 'dist' volume is in a partial state
  • The heal backlog count against the 'myvol' volume shows 54 files. This indicates that this volume is tracking updates being made to a replica set, where a brick is currently unavailable. Once the brick comes back online, the backlog will be cleared by the self heal daemon (all self heal daemons are present and 'up' as shown in the summary section).
You can drill down further into the volumes too, and look at the brick sizes and relationships using the -l flag.

 [root@glfs35-1 gstatus]# ./gstatus.py -v myvol -l

      Status: UNHEALTHY         Capacity: 80.00 GiB(raw bricks)
   Glusterfs: 3.5.0beta3                  60.00 GiB(usable)

Volume Information
    myvol            UP(DEGRADED) - 3/4 bricks up - Distributed-Replicate
                     Capacity: (0% used) 77.00 MiB/20.00 GiB (used/total)
                     Self Heal:  4/ 4   Heal backlog:54 files
                     Protocols: glusterfs:on  NFS:off  SMB:off

    myvol----------- +
                Distribute (dht)
                         +-- Repl Set 0 (afr)
                         |     |
                         |     +--glfs35-1:/gluster/brick1(UP) 38.00 MiB/10.00 GiB
                         |     |
                         |     +--glfs35-2:/gluster/brick1(UP) 38.00 MiB/10.00 GiB
                         +-- Repl Set 1 (afr)
                               +--glfs35-3:/gluster/brick1(DOWN) 36.00 MiB/10.00 GiB
                               +--glfs35-4:/gluster/brick1(UP) 39.00 MiB/10.00 GiB S/H Backlog 54

You can see in the example above, that the layout display also shows which bricks are recording the self-heal backlog information.

How you run the tool depends on the question you want to ask, but the program options provide a way to show state, state + volumes, brick layouts and capacities etc. Here's the 'help' information;

[root@glfs35-1 gstatus]# ./gstatus.py -h
Usage: gstatus.py [options]

  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -s, --state           show highlevel health of the cluster
  -v, --volume          volume info (default is ALL, or supply a volume name)
  -a, --all             show all cluster information
  -u UNITS, --units=UNITS
                        display capacity units in DECimal or BINary format (GB
                        vs GiB)
  -l, --layout          show brick layout when used with -v, or -a

Lots of ways to get into trouble :)

Now the bad news...

As I mentioned earlier, the tool relies upon the xml output from gluster, which is a relatively new thing. The downside is that I've uncovered a couple of bugs that result in malformed xml coming from the gluster commands. I'll raise a bug for this, but until this gets fixed in gluster I've added some workarounds in the code to mitigate. It's not perfect, so if you do try the tool and see errors like this -

 Traceback (most recent call last):
  File "./gstatus.py", line 153, in <module>
  File "./gstatus.py", line 42, in main
  File "/root/bin/gstatus/classes/gluster.py", line 398, in updateState
  File "/root/bin/gstatus/classes/gluster.py", line 601, in update
    brick_path = node_info['hostname'] + ":" + node_info['path']
KeyError: 'hostname'

...you're hitting the issue.

If you're running gluster 3.4 (or above!) and you'd like to give gstatus a go, head on over to the forge. Installing is as simple as grabbing the tar archive from the forge, extracting it and running it...that's it.

Update 4th March. Thanks to Niels at Red Hat, the code has been restructured to make it easier for packaging and installation. Now to install, you need to download the tar file, extract it and run 'python setup.py install' (you'll need to ensure your system has python-setuptools installed first!)

Anyway, it helps me - maybe it'll help you too!

Monday, 6 January 2014

Distributed Storage - too complicated to try?

The thing about distributed storage is that all the pieces that make the magic happen are.....well, distributed! The distributed nature of the components can represent a significant hurdle for people looking to evaluate whether distributed storage is right for them. Not only do people have to set up multiple servers, but they also have to get to grips with services/daemons, new terms and potentially clustering complexity.

So what can be done?

Well the first thing is to look for a distributed storage architecture that tries to make things simple in the first place...life's too short for unnecessary complexity.

The next question is "Does the platform provide an easy to use and understand deployment tool?"

Confession time - I'm involved with the gluster community. A while ago I started a project called gluster-deploy, which aims to make the first time configuration of a gluster cluster child's play. I originally blogged about an early release of the tool in October, so perhaps now is a good time to revisit the project and see how easy it is to get started with gluster (completely unbiased view naturally!)

At a high level, all distributed storage platforms consist of a minimum of two layers;
  • cluster layer - binding the servers together, into a single namespace
  • aggregated disk capacity - pooling storage from each of the servers together to present easy to consume capacity to the end user/applications
So the key thing is to deliver usable capacity as quickly and as pain-free as possible - whilst ensuring that the storage platform is configured correctly. Now I could proceed to show you a succession of screenshots of gluster-deploy in action - but to prevent 'death-by-screenshot' syndrome, I'll refrain from that and just pick out the highlights.

I won't cover installing the gluster rpms, but I will point out that if you're using Fedora they are in the standard repository, and if you're not using Fedora, head on over to the gluster download site download.gluster.org.

So let's assume that you have several servers available; each one has an unused disk and gluster installed and started. If you grab the gluster-deploy tool from the gluster-deploy link above you'll have a tar.gz archive that you can untar onto one of your test nodes. Log in to one of the nodes as 'root' and untar the archive;

>tar xvzf gluster-deploy.tar.gz && cd gluster-deploy

This will untar the archive and place you in the gluster-deploy directory, so before we run it let's take a look at the options the program supports

[root@mynode gluster-deploy]# ./gluster-deploy.py -h
Usage: gluster-deploy.py [options]

  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -n, --no-password     Skip access key checking (debug only)
  -p PORT, --port=PORT  Port to run UI on (> 1024)
  -f CFGFILE, --config-file=CFGFILE
                        Config file providing server list bypassing subnet

OK. So there is some tweaking we can do, but for now let's just run it.

[root@mynode gluster-deploy]# ./gluster-deploy.py

gluster-deploy starting

    Configuration file
        -> Not supplied, UI will perform subnet selection/scan

    Web server details:
        Access key  - pf20hyK8p28dPgIxEaExiVm2i6
        Web Address -

    Setup Progress

Taking the URL displayed in the CLI, and pasting into a browser, starts the configuration process.

The deployment tool basically walks through a series of pages that gather some information about how we'd like our cluster and storage to look. Once the information is gathered, the tool then does all the leg-work across the cluster nodes to complete the configuration, resulting in a working cluster and a volume ready to receive application data.
At a high level, gluster-deploy performs the following tasks;

- Build the cluster;
  • via a subnet scan - the user chooses which subnet to scan (based on the subnets seen on the server running the tool) 
  • via a config file that supplies the nodes to use in the cluster (-f invocation parameter) 
- Configure passwordless login across the nodes, enabling automation 

- Perform disk discovery. Any unused disk shows up in the UI

- You then choose which of the discovered disks you want gluster to use

- Once the disks are selected, you define how you want the disks managed
  • lvm (default)
  • lvm with dm-thinp
  • btrfs (not supported yet, but soon!)
  NB When you choose to use snapshot support (lvm with dm-thinp or btrfs), confirmation is required since these are 'future' features, typically there for developers.

- Once the format is complete, you define the volume that you want gluster to present to your application(s). The volume create process includes 'some' intelligence to make life a little easier
  • tuning presets are provided for common gluster workloads like OpenStack cinder and glance, ovirt/rhev, and hadoop
  • distributed and distributed-replicated volume types are supported
  • for volumes that use replication, the UI prevents disks (bricks) from the same server being assigned to the same replica set
  • UI shows a summary of the capacity expectation for the volume given the brick configuration and replication overheads
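
To make the replica set rule concrete, here is a minimal sketch of the kind of check involved. It assumes bricks are written as "server:/path" and that consecutive bricks in the list form a replica set, which is how 'gluster volume create' groups them. The function names are hypothetical and this is not lifted from gluster-deploy itself.

```python
def replica_sets(bricks, replica_count):
    """Split an ordered brick list into consecutive replica sets."""
    if replica_count < 1 or len(bricks) % replica_count != 0:
        raise ValueError("brick count must be a multiple of the replica count")
    return [bricks[i:i + replica_count]
            for i in range(0, len(bricks), replica_count)]


def sets_sharing_a_server(bricks, replica_count):
    """Return the replica sets where two or more bricks live on one server."""
    bad = []
    for group in replica_sets(bricks, replica_count):
        servers = [brick.split(":", 1)[0] for brick in group]
        if len(set(servers)) < len(servers):
            bad.append(group)
    return bad


# Hypothetical layout: the second replica pair sits entirely on node1
layout = ["node1:/gluster/b1", "node2:/gluster/b1",
          "node1:/gluster/b2", "node1:/gluster/b3"]
print(sets_sharing_a_server(layout, 2))
```

A replica set that lives entirely on one server gives no protection if that server goes down, which is why the UI refuses to build one.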
Now, let's take a closer look at what you can expect to see during these phases.

The image above shows the results from the subnet scan. Four nodes have been discovered on the selected subnet that have gluster running on them. You then select which nodes you want from the left hand 'box' and click the 'arrow' icon to add them to the cluster nodes. Once you're happy, click 'Create'.
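
For the curious, a subnet scan of this kind can be sketched in a few lines of Python 3. This is an assumption about the general technique rather than the tool's actual code: it lists the usable addresses in a subnet and probes each one on TCP port 24007, the port glusterd listens on for management traffic. The function names and the example subnet are made up.

```python
import ipaddress
import socket

GLUSTERD_PORT = 24007  # glusterd management port


def candidate_hosts(cidr):
    """List the usable host addresses in a subnet given in CIDR form."""
    network = ipaddress.ip_network(cidr, strict=False)
    return [str(ip) for ip in network.hosts()]


def has_glusterd(host, port=GLUSTERD_PORT, timeout=0.5):
    """Return True if something accepts a TCP connection on the glusterd port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def scan(cidr):
    """Return the hosts in the subnet that appear to be running glusterd."""
    return [host for host in candidate_hosts(cidr) if has_glusterd(host)]


# Example subnet (hypothetical); only the address list is printed here
print(candidate_hosts("192.168.1.0/30"))
```

An open port only suggests glusterd is there, so a real tool would still need to confirm the node before adding it to the cluster.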

Passwordless login is a feature of ssh, which enables remote login using public keys instead of passwords. This capability is used by the tool to enable automation across the nodes.

With the public keys in place, the tool can scan for 'free' disks.

Choosing the disks to use is just a simple checkbox, and if they all look right - just click on the checkbox in the table heading. Understanding which disks to use is phase one; the next step is to confirm how you want to manage these disks (which at a low level defines the characteristics for the Logical Volume Manager)

Clicking on the "Build Bricks" button initiates a format process across the servers to prepare the disks, building the low-level filesystem and updating the node's filesystem table (fstab). These bricks then become the component parts of the gluster volume that gets mounted by the users or applications.
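
To give a feel for the fstab step, the sketch below builds the sort of line that would be appended for a brick. The device path, mount point, xfs filesystem type and mount options are all assumptions for illustration; the exact values gluster-deploy writes may differ.

```python
def fstab_line(device, mount_point, fstype="xfs", options="defaults"):
    """Format one /etc/fstab entry.

    Fields are: device, mount point, filesystem type, mount options,
    dump flag and fsck pass number (both left at 0 here).
    """
    return "%s %s %s %s 0 0" % (device, mount_point, fstype, options)


# Hypothetical logical volume and brick mount point
print(fstab_line("/dev/vg_gluster/brick1", "/gluster/brick1"))
```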

Volumes can be tuned/optimised for different workloads, so the tool has a number of presets to choose from. Choose a 'Use case' that best fits your workload, and then a volume type (distributed or replicated) that meets your data availability requirements. Now you can see a list of bricks on the left and an empty table on the right. Select which bricks you want in the volume and click the arrow to add them to the table. A Volume Summary is presented at the bottom of the page showing you what will be built (space usable, brick count, fault tolerance). Once you're happy, simply click the "Create" button.
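
The arithmetic behind a summary like this is simple enough to sketch. The example below assumes brick sizes are listed in replica set order, that each replica set can hold no more than its smallest brick, and that a set stays available while at least one of its bricks survives (quorum settings are ignored). The function and key names are hypothetical, not taken from the tool.

```python
def volume_summary(brick_sizes, replica_count=1):
    """Estimate usable capacity and fault tolerance for a volume.

    brick_sizes: brick capacities in bytes, in replica set order.
    replica_count: 1 for a plain distributed volume.
    """
    if replica_count < 1 or len(brick_sizes) % replica_count != 0:
        raise ValueError("brick count must be a multiple of the replica count")

    usable = 0
    for i in range(0, len(brick_sizes), replica_count):
        # Every brick in a set holds the same data, so the smallest one caps it
        usable += min(brick_sizes[i:i + replica_count])

    return {
        "brick_count": len(brick_sizes),
        "usable_bytes": usable,
        "brick_failures_tolerated_per_set": replica_count - 1,
    }


# Four 100-byte bricks with 2-way replication: half the raw space is usable
print(volume_summary([100, 100, 100, 100], replica_count=2))
```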

The volume will be created and started, making it available to clients straight away. In my test environment the time to configure the cluster and storage was under a minute...

So, if you can use a mouse and a web browser, you can now configure and enjoy the gluster distributed filesystem : no drama...no stress...no excuses!

For a closer look at the tool's workflow, I've posted a video to youtube.

In a future post, I'll show you how to use foreman to simplify the provisioning of the gluster nodes themselves.