
As KVM seems more and more interesting, at work we wanted to do a Proof of Concept. The KVM hypervisor cluster had to be controlled by CloudStack and also integrate with NSX (formerly known as Nicira).

NSX is owned by VMware these days and is one of the first Software Defined Networking solutions. At Schuberg Philis we have been using it since early 2012.

Choosing an OS
To me, the most interesting part of KVM is that you only need a very basic Linux box with some tooling and you have a nice modern hypervisor ready to rock. Since we’re using CloudStack to orchestrate everything, we do not need cluster features. In fact, this prevents the “two captains” problem that we sometimes encounter with XenServer and VMware ESX. We compared Ubuntu with CentOS/RHEL and both work fine. It all depends on your needs.

Installing KVM
Installing the software is pretty straightforward:

CentOS:

yum install kvm libvirt python-virtinst qemu-kvm bridge-utils pciutils

Ubuntu:

apt-get install qemu-kvm libvirt-bin bridge-utils virt-manager openntpd
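To quickly verify the installation (a simple sanity check, assuming a CPU with virtualization extensions enabled), check that the kernel modules are loaded and that libvirt responds:

lsmod | grep kvm    # expect kvm_intel or kvm_amd to be listed
virsh version       # libvirt should answer without errors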

Installing Open vSwitch
Open vSwitch is a multilayer virtual switch that brings a lot of flexibility in the way you can create interfaces and bridges in Linux. There are two options here: if you need STT tunnels, you need the NSX-patched version of Open vSwitch; if you need VXLAN or GRE tunnels, you can use the open source version that ships with Ubuntu and CentOS. Both ship version 2.3.1, which works perfectly fine.

CentOS:

yum install openvswitch kmod-openvswitch

Ubuntu:

apt-get install openvswitch-switch
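A quick sanity check that Open vSwitch is up and running:

ovs-vsctl --version   # should report 2.3.1
ovs-vsctl show        # prints the (still empty) switch configuration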

Configuring Open vSwitch
Instead of classic Linux bridges, we use Open vSwitch bridges. In our POC lab environment, we were using HP DL380 G9 servers with two 10Gbit NICs connected to two Arista switches. They run an LACP bond, and on top of this we create the bridges for KVM to use. Because we set up the Open vSwitch networking over and over again while debugging and testing different OSes, I wrote a script that quickly configures the networking. You can find it on GitHub.

To give some quick pointers:

Create a bridge:

ovs-vsctl add-br cloudbr0

Create an LACP bond:

ovs-vsctl add-bond cloudbr0 bond0 eno49 eno50 \
bond_mode=balance-tcp lacp=active other_config:lacp-time=fast

Create a so-called fake bridge (with a VLAN tag):

ovs-vsctl add-br mgmt0 cloudbr0 123

Get an overview of current configuration:

ovs-vsctl show

Get an overview of current bond status:

ovs-appctl bond/show

Example output:

---- bond0 ----
bond_mode: balance-tcp
bond may use recirculation: yes, Recirc-ID : 300
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
next rebalance: 9887 ms
lacp_status: negotiated
active slave mac: fc:00:00:f2:00(eno50)

slave eno49: enabled
 may_enable: true
 hash 139: 3 kB load

slave eno50: enabled
 active slave
 may_enable: true
 hash 101: 1 kB load
 hash 143: 5 kB load
 hash 214: 5 kB load
 hash 240: 5 kB load

Add hypervisor to NSX
For now I assume you already have an NSX cluster running that is capable of acting as a controller/manager for Open vSwitch. If you don’t know NSX, have a look because it’s awesome.

We do need to connect our Open vSwitch to the NSX cluster. To do that, you need an SSL certificate. This is how you generate one:

cd /etc/openvswitch
ovs-pki req ovsclient
ovs-pki self-sign ovsclient
ovs-vsctl -- --bootstrap set-ssl \
"/etc/openvswitch/ovsclient-privkey.pem" \
"/etc/openvswitch/ovsclient-cert.pem" \
/etc/openvswitch/vswitchd.cacert

Next, add the hypervisor to NSX. You can either set the authentication to be ip-address based (it will then exchange certificates on connect) or copy/paste the certificate (ovsclient-cert.pem) into NSX yourself. The first method allows for easier automation. I’m showing the UI here, but of course you can also use the API to add the hypervisor.

Setting authentication to be ip-address based (allow for automatic exchange of security certificates)

Setting authentication to use a security certificate (and provide one manually)

The final step is to connect Open vSwitch to NSX:

ovs-vsctl set-manager ssl:10.10.10.10:6632

NSX should then show green lights, and tunnels will be created.

To get an idea of what’s going on, you can run:

ovs-vsctl list manager
ovs-vsctl list controller
ovs-vsctl show

Debugging this can be done from the command line (check Open vSwitch logs) or from NSX.

Debugging in NSX is very nice

At this point, NSX is controlling our Open vSwitch.

Setting up the CloudStack agent
When running KVM, CloudStack runs an agent on the hypervisor in order to configure VMs.

Installing the agent is simply a matter of installing an RPM or DEB package. Depending on the version, you will use different repositories. At Schuberg Philis we build our own packages, which we serve from our own repository.

Because we’re using Open vSwitch, some settings need to be tweaked in the agent.properties file, found in /etc/cloudstack/agent.

echo "libvirt.vif.driver=com.cloud.hypervisor.kvm.resource.OvsVifDriver" \
>> /etc/cloudstack/agent/agent.properties
echo "network.bridge.type=openvswitch" \
>> /etc/cloudstack/agent/agent.properties

You may also want to set the log level to debug:

sed -i 's/INFO/DEBUG/g' /etc/cloudstack/agent/log4j-cloud.xml

CloudStack requires some KVM-related settings to be tweaked:

# Libvirtd
echo 'listen_tls = 0' >> /etc/libvirt/libvirtd.conf
echo 'listen_tcp = 1' >> /etc/libvirt/libvirtd.conf
echo 'tcp_port = "16509"' >> /etc/libvirt/libvirtd.conf
echo 'mdns_adv = 0' >> /etc/libvirt/libvirtd.conf
echo 'auth_tcp = "none"' >> /etc/libvirt/libvirtd.conf
 
# libvirt-bin.conf
sed -i -e 's/libvirtd_opts="-d"/libvirtd_opts="-d -l"/' \
/etc/init/libvirt-bin.conf
service libvirt-bin restart

# qemu.conf
sed -i -e 's/\#vnc_listen.*$/vnc_listen = "0.0.0.0"/g' \
/etc/libvirt/qemu.conf
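On CentOS 7, libvirtd additionally needs the --listen flag for the listen_tcp setting to take effect. A sketch, assuming the default commented-out line in /etc/sysconfig/libvirtd:

# CentOS: make libvirtd listen (equivalent of the -l flag above)
sed -i -e 's/#LIBVIRTD_ARGS="--listen"/LIBVIRTD_ARGS="--listen"/' \
/etc/sysconfig/libvirtd
systemctl restart libvirtd

# verify the TCP socket answers (auth_tcp is set to "none")
virsh -c qemu+tcp://127.0.0.1/system list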

On CentOS 7, Systemd ‘co-mounts’ the cpu and cpuacct cgroups, and this causes issues when launching a VM with libvirt. On the mailing list this is the suggested fix:

Edit /etc/systemd/system.conf and pass an empty string to the JoinControllers parameter. Then rebuild the initramfs via ‘new-kernel-pkg --mkinitrd --install `uname -r`’.
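As a sketch, assuming the default commented-out JoinControllers line, that boils down to:

# pass an empty string to the JoinControllers parameter
sed -i 's/^#\?JoinControllers=.*/JoinControllers=/' /etc/systemd/system.conf
# rebuild the initramfs
new-kernel-pkg --mkinitrd --install `uname -r`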

Something I currently don’t like: SELinux and AppArmor need to be disabled. I will dive into this and get it fixed. For now, let’s continue:

# AppArmor (Ubuntu)
ln -s /etc/apparmor.d/usr.sbin.libvirtd /etc/apparmor.d/disable/
ln -s /etc/apparmor.d/usr.lib.libvirt.virt-aa-helper /etc/apparmor.d/disable/
apparmor_parser -R /etc/apparmor.d/usr.sbin.libvirtd
apparmor_parser -R /etc/apparmor.d/usr.lib.libvirt.virt-aa-helper

# SELinux (CentOS)
setenforce permissive
vim /etc/selinux/config
SELINUX=permissive
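If you prefer a non-interactive equivalent of that edit, a one-liner like this should do (assuming the default ‘enforcing’ setting):

sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config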

You can now add the host to CloudStack, either via the UI or the API.

Keep an eye on the agent log file:

less /var/log/cloudstack/agent/agent.log

After a few minutes, the hypervisor is added and you should be able to spin up virtual machines! 🙂

When we spin up a VM, CloudStack does the orchestration. CloudStack talks to NSX to provide the network (lswitch), and NSX communicates with Open vSwitch on the hypervisor. The VM is provisioned by CloudStack, and KVM/libvirt makes sure the right virtual interfaces are plugged into Open vSwitch. This way VMs on different hypervisors can communicate over their own private guest network, all dynamically created without manual configuration. No more VLANs!
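You can watch this wiring on the hypervisor itself: list the ports that were plugged into the bridge and the interfaces of a running VM (the instance name below is just an example):

ovs-vsctl list-ports cloudbr0
virsh domiflist i-2-42-VM   # example instance name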

If it does not work right away, look at the different log files and see what happens. There are usually hints that help you solve the problem.

Conclusion
KVM hypervisors can be connected to NSX, and with Open vSwitch you can build a Software Defined Networking setup. CloudStack is the orchestrator that connects the dots for us. I’ve played with this setup for some time now and I find it very fast. We’ll keep testing and will probably create some patches for CloudStack. Great to see that the first KVM-related pull request I sent has already been merged 🙂

Looking forward to more KVM!

Recently I played with Open vSwitch and it’s awesome! Open vSwitch is a multilayer virtual switch that brings a lot of flexibility in the way you can create interfaces and bridges in Linux. It’s also a distribution-independent way to configure these things. Switching in software!

To create a bridge, simply run:

ovs-vsctl add-br remibr0

You can also create another bridge on top of it, to handle a VLAN for example:

ovs-vsctl add-br mgmt0 remibr0 101

Even better, create a bond based on LACP:

ovs-vsctl add-bond remibr0 bond0 em49 em50 bond_mode=balance-tcp lacp=active other_config:lacp-time=fast

This is all quite nice but still basic. It gets interesting when you realise you can connect two switches, just like you can put a patch cable between physical switches. To test how cross-platform this works, I set up two boxes: a CentOS 7 box and an Ubuntu 15.04 one. The picture below shows the setup:

(Diagram: Open vSwitch bridges on the two boxes interconnected by a VXLAN tunnel)

We’ll create a new bridge and add a vxlan interface that connects to the other vswitch. Then we create a port on it and assign it an ip address. Installing Open vSwitch should be simple, as it is included in both distributions.

Create the configuration and be sure to fill in the right ip addresses.

ovs-vsctl add-br remibr0
ovs-vsctl add-port remibr0 vxlan0 -- set Interface vxlan0 type=vxlan options:remote_ip=92.168.1.23
ovs-vsctl add-port remibr0 vi0 -- set Interface vi0 type=internal
ifconfig vi0 10.250.204.10/24 up

On the second box, bring up 10.250.204.20/24 on vi0.
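For completeness, the second box gets the mirror-image configuration; note that remote_ip now points back to the first box:

ovs-vsctl add-br remibr0
ovs-vsctl add-port remibr0 vxlan0 -- set Interface vxlan0 type=vxlan options:remote_ip=92.168.2.34
ovs-vsctl add-port remibr0 vi0 -- set Interface vi0 type=internal
ifconfig vi0 10.250.204.20/24 up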

On the first box, your config should now look like this:

ovs-vsctl show
f11505d7-199c-4fa9-9f3a-21016ab4fded
 Bridge "remibr0"
   Port "vxlan0"
     Interface "vxlan0"
       type: vxlan
       options: {remote_ip="92.168.1.23"}
   Port "remibr0"
     Interface "remibr0"
       type: internal
   Port "vi0"
     Interface "vi0"
       type: internal
 ovs_version: "2.3.1"

And on the second box:

ovs-vsctl show

Output:

129f100b-1377-46bd-89ba-eaf1f1bc5162
 Bridge "remibr0"
   Port "vi0"
     Interface "vi0"
       type: internal
   Port "vxlan0"
     Interface "vxlan0"
       type: vxlan
       options: {remote_ip="92.168.2.34"}
   Port "remibr0"
     Interface "remibr0"
       type: internal
 ovs_version: "2.3.90"

As you can see, I used different versions on purpose. You can use two boxes that are the same, of course.

By now, a simple ping test should work:

PING 10.250.204.20 (10.250.204.20) 56(84) bytes of data.
64 bytes from 10.250.204.20: icmp_seq=1 ttl=64 time=0.019 ms
64 bytes from 10.250.204.20: icmp_seq=2 ttl=64 time=0.009 ms
^C
--- 10.250.204.20 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.009/0.014/0.019/0.005 ms

And reversed:

PING 10.250.204.10 (10.250.204.10) 56(84) bytes of data.
64 bytes from 10.250.204.10: icmp_seq=1 ttl=64 time=1.47 ms
64 bytes from 10.250.204.10: icmp_seq=2 ttl=64 time=0.202 ms
^C
--- 10.250.204.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.202/0.839/1.477/0.638 ms

Create a virtual floating ip address
To make the demo a bit more advanced, let’s set up a virtual ip address on the interfaces that can travel between the switches. We use keepalived for this.

vim /etc/keepalived/keepalived.conf

Add this:

global_defs {
 notification_email {
 [email protected]
 [email protected]
 [email protected]
 }
 notification_email_from [email protected]
 smtp_server 192.168.200.1
 smtp_connect_timeout 30
 router_id LVS_DEVEL
}
vrrp_instance VI_1 {
 state MASTER
 interface vi0
 virtual_router_id 51
 priority 200
 advert_int 1
 authentication {
 auth_type PASS
 auth_pass 1111
 }
 virtual_ipaddress {
 10.250.204.30/24 
 }
}

Copy the config to the other box, and be sure to have one MASTER and one BACKUP. Also, the priority of the MASTER should be 200 and that of the BACKUP 100, as shown below. It’s just a demo; all it does is bring up an ip address.
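For reference, the vrrp_instance block on the BACKUP box would look like this:

vrrp_instance VI_1 {
 state BACKUP
 interface vi0
 virtual_router_id 51
 priority 100
 advert_int 1
 authentication {
 auth_type PASS
 auth_pass 1111
 }
 virtual_ipaddress {
 10.250.204.30/24
 }
}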

Start them both and they should discover each other over the vi0 interfaces on the connected vswitches.
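To start them and see which box currently holds the virtual address:

service keepalived start
ip addr show vi0   # the MASTER should also list 10.250.204.30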

Try pinging the virtual ip address:

PING 10.250.204.30 (10.250.204.30) 56(84) bytes of data.
64 bytes from 10.250.204.30: icmp_seq=1 ttl=64 time=0.045 ms
64 bytes from 10.250.204.30: icmp_seq=2 ttl=64 time=0.031 ms
64 bytes from 10.250.204.30: icmp_seq=3 ttl=64 time=0.023 ms
^C
--- 10.250.204.30 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms

Depending on where the virtual address resides, the latency may be different:

PING 10.250.204.30 (10.250.204.30) 56(84) bytes of data.
64 bytes from 10.250.204.30: icmp_seq=1 ttl=64 time=0.481 ms
64 bytes from 10.250.204.30: icmp_seq=2 ttl=64 time=0.202 ms
64 bytes from 10.250.204.30: icmp_seq=3 ttl=64 time=0.215 ms
64 bytes from 10.250.204.30: icmp_seq=4 ttl=64 time=0.203 ms
^C
--- 10.250.204.30 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2998ms
rtt min/avg/max/mdev = 0.202/0.275/0.481/0.119 ms

Now start a ping and stop keepalived, then start it again and stop it on the other side. You’ll miss a ping or two when it fails over and then it will recover just fine.

PING 10.250.204.30 (10.250.204.30) 56(84) bytes of data.
64 bytes from 10.250.204.30: icmp_seq=1 ttl=64 time=0.824 ms
64 bytes from 10.250.204.30: icmp_seq=2 ttl=64 time=0.167 ms
64 bytes from 10.250.204.30: icmp_seq=3 ttl=64 time=0.160 ms
64 bytes from 10.250.204.30: icmp_seq=4 ttl=64 time=0.148 ms
64 bytes from 10.250.204.30: icmp_seq=5 ttl=64 time=0.149 ms
From 10.250.204.10: icmp_seq=6 Redirect Host(New nexthop: 10.250.204.30)
From 10.250.204.10: icmp_seq=7 Redirect Host(New nexthop: 10.250.204.30)
64 bytes from 10.250.204.30: icmp_seq=8 ttl=64 time=0.012 ms
64 bytes from 10.250.204.30: icmp_seq=9 ttl=64 time=0.025 ms
64 bytes from 10.250.204.30: icmp_seq=10 ttl=64 time=0.012 ms
64 bytes from 10.250.204.30: icmp_seq=11 ttl=64 time=0.016 ms
64 bytes from 10.250.204.30: icmp_seq=12 ttl=64 time=0.011 ms
64 bytes from 10.250.204.30: icmp_seq=13 ttl=64 time=0.011 ms
From 10.250.204.10: icmp_seq=14 Redirect Host(New nexthop: 10.250.204.30)
From 10.250.204.10: icmp_seq=15 Redirect Host(New nexthop: 10.250.204.30)
64 bytes from 10.250.204.30: icmp_seq=16 ttl=64 time=0.323 ms
64 bytes from 10.250.204.30: icmp_seq=17 ttl=64 time=0.162 ms
64 bytes from 10.250.204.30: icmp_seq=18 ttl=64 time=0.145 ms
64 bytes from 10.250.204.30: icmp_seq=19 ttl=64 time=0.179 ms
64 bytes from 10.250.204.30: icmp_seq=20 ttl=64 time=0.147 ms
^C
--- 10.250.204.30 ping statistics ---
20 packets transmitted, 16 received, +4 errors, 20% packet loss, time 19000ms
rtt min/avg/max/mdev = 0.011/0.155/0.824/0.193 ms

Note on the MTU when travelling over the internet
vxlan is encapsulation, and this obviously needs space in the packets sent over the wire. If you travel over networks that have a default MTU of 1500, it may be wise to lower the MTU of the vi0 interfaces, as this will prevent fragmentation. Lowering the MTU is a simple work-around. You could also have a look at GRE tunnels instead.

To alter the MTU:

ip link set dev vi0 mtu 1412

You can make this persistent in Ubuntu’s ‘interfaces’ file by adding ‘mtu 1400’. Red Hat-like systems have ‘ifcfg-*’ files for each interface; add ‘MTU=1400’ to them to alter the MTU.
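As a sketch, the persistent configuration could look like this (interface name and address as used above):

# Ubuntu: /etc/network/interfaces
iface vi0 inet static
  address 10.250.204.10
  netmask 255.255.255.0
  mtu 1400

# Red Hat: /etc/sysconfig/network-scripts/ifcfg-vi0
MTU=1400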

Conclusion
Although this is a simple demo, the real power comes when you use this to connect two different (virtual or physical) networks in different data centers. You’ll be able to create a Layer-2 network over Layer-3. It’s simple, fast and awesome.

When you create firewall rules with iptables on Linux, you want to make them persistent across reboots, because they are not by default. Different Linux distributions have different methods of achieving this, although the basics are similar. I’ve been working with Debian, Red Hat Enterprise Linux and SUSE Linux Enterprise Server, and in this blog I’ll describe how to configure each of them to save your iptables rules across reboots.

First the good news: the iptables package, the administration tool for packet filtering and NAT, always ships with Linux distributions. The package also includes the ‘iptables-save‘ and ‘iptables-restore‘ tools. These do what you might already expect from their names: save or restore iptables rules. ‘iptables-save‘ outputs to stdout, which you can save to a file:

iptables-save > /etc/iptables/rules

To load these again:

iptables-restore < /etc/iptables/rules

These really are the basics that work across Linux distributions and that you can use in your custom boot scripts. In addition to this, each Linux distribution has its own way to make this process easier.

Red Hat Enterprise Linux (RHEL):

RHEL (and the same goes for CentOS and Fedora) has some built-in mechanisms to help automate this. First of all, there are some settings in ‘/etc/sysconfig/iptables-config‘:

IPTABLES_SAVE_ON_STOP="yes"
IPTABLES_SAVE_ON_RESTART="yes"

You can set them to “yes” to have persistent iptables rules. In fact, there are many more settings in that file that allow for finer control. That’s all, since the rest is handled automatically.

At any time it’s possible to save the current state. Just run:

service iptables save

And it will, like on reboot, save the rules to: ‘/etc/sysconfig/iptables‘. Pretty easy and pretty powerful!

SUSE Linux Enterprise Server (SLES 11):

SLES (and the same goes for OpenSUSE) is yet another story. SLES 11 ships with SUSE Firewall. Instead of defining the rules yourself, you tell Yast what you want to achieve and it generates the needed iptables rules for you. Although SUSE Firewall does allow you to add custom rules, it isn’t really designed for that. The tool is pretty nice though, because it integrates fully with Yast and allows for easy maintenance of rules. When you install a package, it automatically opens the associated port, for example.

This all might seem a bit scary for us sysadmins, right?! Don’t worry, it’s still possible to manage rules on your own by disabling SUSE Firewall. But have a look at it first, as you might like it.

To start the SUSE Firewall admin module, run:

yast2 firewall

The interface is pretty self-explanatory. Afterwards, to activate the changes, run:

SuSEfirewall2

It’s even possible to bypass Yast and edit the config file directly. It’s safe to combine the two methods, no problem.

vim /etc/sysconfig/SuSEfirewall2

For example, to open a port you’d edit the ‘FW_SERVICES_EXT_TCP’ variable. Just make a space-separated list of the ports or service names you want to allow (see the example below). These refer to files in ‘/etc/sysconfig/SuSEfirewall2.d’.
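For example, to allow SSH and a webserver, the line could look like this (a minimal sketch; FW_SERVICES_EXT_TCP accepts port numbers as well as service names):

FW_SERVICES_EXT_TCP="ssh 80 443"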

Like with using Yast, activate the changes when you’re done.

SuSEfirewall2

I’ve used it for some time and it’s actually pretty easy. Whether or not to use it just depends on the project, I guess.

Debian
When I was using Debian (and the same goes for Ubuntu), I used to create a small shell script and place it in ‘/etc/network/if-pre-up.d’. Just before the network interface is brought up, the iptables rules are restored. The idea is to do the same when the interface goes down (place a similar script in the ‘/etc/network/if-post-down.d’ folder). Using these techniques, you have fine control over what happens; a minimal example follows below.
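A minimal sketch of such a script (the rules file path matches the iptables-save example above; don’t forget to make it executable with chmod +x):

#!/bin/sh
# /etc/network/if-pre-up.d/iptables
# restore the saved rules just before the interface comes up
iptables-restore < /etc/iptables/rules
exit 0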

Recently I heard about a tool called ‘iptables-persistent’ that can automate this out-of-the-box. Here’s the package description:

iptables-persistent - boot-time loader for iptables rules: Current iptables rules can be saved to the configuration file '/etc/iptables/rules.v4'. These rules will then be loaded automatically during system startup.

To install it:

sudo apt-get install iptables-persistent

During the install, the program asks whether to save both the IPv4 and IPv6 iptables rules. Please note this applies to Wheezy; the current stable release uses the file ‘/etc/iptables/rules’.

To manually save the iptables rules, run:

/etc/init.d/iptables-persistent save

This should also happen automatically when you reboot. It looks like the Red Hat way of doing things, just with an extra package installed.

Conclusion: 
iptables all over the place, just with different tooling to automate it 🙂

I’ve been building redundant storage solutions for years. At first, this was for our webcluster storage; nowadays it’s the base of our CloudStack cloud storage. If you ask me, the best way to create a redundant pair of Linux storage servers using Open Source software is to use DRBD. Over the years it has proven to be rock solid to me.

DRBD is a Distributed Replicated Block Device. You can think of DRBD as RAID-1 between two servers: data is mirrored from the primary to the secondary server. When the primary fails, the secondary takes over and all services remain online. DRBD provides tools for failover, but it does not handle the actual failover itself. Cluster management software like Heartbeat and Pacemaker is made for this.

In this post I’ll show you how to install and configure DRBD, create file systems using LVM2 on top of the DRBD device, serve the file systems using NFS and manage the cluster using Heartbeat.

Installing and configuring DRBD
I’m using mostly Debian, so I’ll focus on this OS. I did set up DRBD on CentOS as well; you need to use the ELRepo repository to find the right packages.

Squeeze-backports has a newer version of DRBD. If you, like me, want to use this version instead of the one in Squeeze itself, use this method to do so:

echo "
deb http://ftp.debian.org/debian-backports squeeze-backports main contrib non-free
" >> /etc/apt/sources.list

echo "Package: drbd8-utils
Pin: release n=squeeze-backports
Pin-Priority: 900
" > /etc/apt/preferences.d/drbd

Then install the DRBD utils:

apt-get update
apt-get install drbd8-utils

As the DRBD servers work closely together, it’s important to keep their time synchronised. Install an NTP service for this job.

apt-get install ntp ntpdate

You also need a kernel module but that one is in the stock Debian kernel. If you’re compiling kernels yourself, make sure to include this module. When you’re ready, load the module:

modprobe drbd

Verify if all went well by checking the active modules:

lsmod | grep drbd

The expected output is something like:

drbd 191530 4 
lru_cache 12880 1 drbd
cn 12933 1 drbd

Most online tutorials instruct you to edit ‘/etc/drbd.conf’. I’d suggest not touching that file and creating one in /etc/drbd.d/ instead. That way, your changes are never overwritten and it’s clear what local changes you made.

vim /etc/drbd.d/redundantstorage.res

Enter this configuration:

resource redundantstorage {
 protocol C;
 startup { wfc-timeout 0; degr-wfc-timeout 120; }

disk { on-io-error detach; }
 on storage-server0.example.org {
  device /dev/drbd0;
  disk /dev/sda3;
  meta-disk internal;
  address 10.10.0.86:7788;
 }
 on storage-server1.example.org {
  device /dev/drbd0;
  disk /dev/sda3;
  meta-disk internal;
  address 10.10.0.88:7788;
 }
}

Make sure your hostnames match the hostnames in this config file as it will not work otherwise. To see the current hostname, run:

uname -n

Modify /etc/hosts, /etc/resolv.conf and/or /etc/hostname to your needs and do not continue until the actual hostname matches the one you set in the configuration above.
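For example, the entries in /etc/hosts could look like this (addresses taken from the resource file above):

10.10.0.86 storage-server0.example.org storage-server0
10.10.0.88 storage-server1.example.org storage-server1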

Also, make sure you did all the steps so far on both servers.

It’s now time to initialise the DRBD device:

drbdadm create-md redundantstorage
drbdadm up redundantstorage
drbdadm attach redundantstorage
drbdadm syncer redundantstorage
drbdadm connect redundantstorage

Run this on the primary server only:

drbdadm -- --overwrite-data-of-peer primary redundantstorage

Monitor the progress:

cat /proc/drbd

Start the DRBD service on both servers:

service drbd start

You now have a raw block device on /dev/drbd0 that is synced from the primary to the secondary server.

Using the DRBD device
Let’s create a filesystem on our new DRBD device. I prefer using LVM since that makes it easy to manage the partitions later on. But you may also simply use the /dev/drbd0 device as any block device on its own.

Initialize LVM2:

pvcreate /dev/drbd0
pvdisplay
vgcreate redundantstorage /dev/drbd0

We now have an LVM2 volume group called ‘redundantstorage’ on device /dev/drbd0.

Create the desired LVM partitions on it like this:

lvcreate -L 1T -n web_files redundantstorage
lvcreate -L 250G -n other_files redundantstorage

The partitions you create are named like the volume group. You can now use ‘/dev/redundantstorage/web_files’ and ‘/dev/redundantstorage/other_files’ like you’d otherwise use ‘/dev/sda3’ etc.

Before we can actually use them, we need to create a file system on top:

mkfs.ext4 /dev/redundantstorage/web_files
mkfs.ext4 /dev/redundantstorage/other_files

Finally, mount the file systems:

mkdir /redundantstorage/web_files
mkdir /redundantstorage/other_files
mount /dev/redundantstorage/web_files /redundantstorage/web_files
mount /dev/redundantstorage/other_files /redundantstorage/other_files

Using the DRBD file systems
Two more pieces need to be set up before we can test our new redundant storage cluster: Heartbeat to manage the cluster and NFS to make use of it. Let’s start with NFS, so Heartbeat will be able to manage it later on as well.

To install NFS server, simply run:

apt-get install nfs-kernel-server

Then set up which folders you want to export using your NFS server.

vim /etc/exports

And enter this configuration:

/redundantstorage/web_files 10.10.0.0/24(rw,async,no_root_squash,no_subtree_check,fsid=1)
/redundantstorage/other_files 10.10.0.0/24(rw,async,no_root_squash,no_subtree_check,fsid=2)

Important:
Pay attention to the ‘fsid’ parameter. It is really important because it tells the clients that the file systems on the primary and the secondary are the same. If you omit this parameter, the clients will ‘hang’ and wait for the old primary to come back online after a fail-over happens. Since this is not what we want, we need to tell the clients that the other server is simply the same. Fail-over will then happen almost without notice. Most tutorials I read do not mention this crucial step.

Make sure you have all of this set up on both servers. Since we want Heartbeat to manage our NFS server, NFS should not start on boot. To arrange that, run:

update-rc.d -f nfs-common remove
update-rc.d -f nfs-kernel-server remove

Basic Heartbeat configuration
Installing the Heartbeat package is simple:

apt-get install heartbeat

If you’re on CentOS, have a look at the EPEL repository. I’ve successfully set up Heartbeat with those packages as well.

To configure Heartbeat:

vim /etc/ha.d/ha.cf

Enter this configuration:

autojoin none
auto_failback off
keepalive 2
warntime 5
deadtime 10
initdead 20
bcast eth0
node storage-server0.example.org
node storage-server1.example.org
logfile /var/log/heartbeat-log
debugfile /var/log/heartbeat-debug

I set ‘auto_failback’ to off, since I do not want another fail-over when the old primary comes back. If your primary server has better hardware than the secondary one, you may want to set this to ‘on’ instead.

The parameter ‘deadtime’ tells Heartbeat to declare the other node dead after this many seconds. Heartbeat will send a heartbeat every ‘keepalive’ number of seconds.

Protect your heartbeat setup with a password:

echo "auth 3
3 md5 your_secret_password
" > /etc/ha.d/authkeys
chmod 600 /etc/ha.d/authkeys

You need to select an ip-address that will be your ‘service’ address. Both servers have their own 10.10.0.x ip-address, so choose another one in the same range; I use 10.10.0.10 in this example. Why do we need this? Simply because you cannot know which server you should connect to. That’s why we instruct Heartbeat to manage an extra ip-address and bring it up on the current primary server. When clients connect to this ip-address, it will always work.

In the ‘haresources’ file you describe all services Heartbeat manages. In our case, these services are:
– service ip-address
– DRBD disk
– LVM2 service
– Two filesystems
– NFS daemons

Enter them in the order they need to start. When shutting down, Heartbeat will stop them in reverse order.

vim /etc/ha.d/haresources

Enter this configuration:

storage-server0.example.org \
IPaddr::10.10.0.10/24/eth0 \
drbddisk::redundantstorage \
lvm2 \
Filesystem::/dev/redundantstorage/web_files::/redundantstorage/web_files::ext4::nosuid,usrquota,noatime \
Filesystem::/dev/redundantstorage/other_files::/redundantstorage/other_files::ext4::nosuid,usrquota,noatime \
nfs-common \
nfs-kernel-server

Use the same Heartbeat configuration on both servers. In the ‘haresources’ file you specify one of the nodes as the primary; in our case it’s ‘storage-server0’. When this server is or becomes unavailable, Heartbeat will start the services it knows on the other node, ‘storage-server1’ in this case (as specified in the ha.cf config file).
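A simple way to test the whole cluster from an NFS client is to mount via the service address and then trigger a fail-over (the mount point below is just an example):

mkdir -p /mnt/web_files
mount -t nfs 10.10.0.10:/redundantstorage/web_files /mnt/web_files

# on the primary, stop Heartbeat to force a fail-over
/etc/init.d/heartbeat stop

The mount should keep working once the secondary has taken over, thanks to the matching ‘fsid’ values in the exports.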

Wrapping up
DRBD combined with Heartbeat and NFS creates a powerful, redundant storage solution based entirely on Open Source software. With the right hardware you can achieve great performance with this setup as well. Think about RAID controllers with SSD cache, and don’t forget the Battery Backup Unit so you can enable the Write-Back Cache.

Enjoy building your redundant storage!

Since I’m working with CloudStack, I’m also working with CentOS. Having worked with Debian for over 10 years, it sometimes takes me some extra time to get things done. Like when you want to install a package that is not in the default repos.

This is how to enable the RPMForge repository:

Start by importing the key:

rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt

Then download the RPMForge release file:

wget http://apt.sw.be/redhat/el6/en/x86_64/rpmforge/RPMS/rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm

Then all you need to do is install this rpm:

rpm -i rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
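To verify that the repository is active:

yum repolist | grep rpmforge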

You only need to do this once. Now the packages from RPMForge can be installed using yum. Have fun!