Notes from building Raspberry Pi clusters

A while ago I got it into my head to put a Raspberry Pi cluster together.

As with the other builds floating around the internet, the intent of mine was to produce a low-cost and, in my case, portable platform for developing and ultimately demonstrating cluster operations technology.

We’ve all seen Pi clusters around, and probably seen dozens of blog posts about doing this or that or deploying a cluster management technology on the Pis.

The first thing I’m gonna note is that the TuringPi exists, and that a compute module backplane or other module-based minicluster is gonna be cheaper, though more limited, than any solution built around integrating multiple full-size Pis.

That said, if your heart is set on a pile of full-size Pis like I used, let’s get to it. I’m gonna skip the basics, which you could google easily, and focus on some details that make doing a good cluster build hard: namely power, networking, mechanicals and netboot. I’ll close with some thoughts on the Pi platform in my application(s).

Power

The various models of Pis have different peak draw requirements. You weren’t going to be able to run something truly CPU-intensive on the Pis as a platform anyway, but sometimes Zookeeper or Docker or what have you will peg the cores. This means you’ll need to have appropriate power available to back it up, unless you want to see undervoltage warnings and soft locks.

Model   Power connection        Max draw (A)    Max draw (W)
B       Micro USB, GPIO (5v)    1.2 A           6 W
A       Micro USB, GPIO (5v)    1.2 A           6 W
B+      Micro USB, GPIO (5v)    1.2 A           6 W
A+      Micro USB, GPIO (5v)    1.2 A           6 W
2B      Micro USB, GPIO (5v)    400 mA          2 W
3B      Micro USB, GPIO (5v)    730 mA          3.7 W
3A+     Micro USB, GPIO (5v)    2.5 A           12.5 W
3B+     Micro USB, GPIO (5v)    1 A             5 W
4B      USB-C, GPIO (5v)        1.3 A           6.5 W

It’s important to note here that max current draw is approximate, as it really depends on what other peripherals are hooked up to the Raspberry Pi. These numbers are conservative (high), and reflect using USB peripherals on a given Pi in addition to running the Pi itself. The Raspberry Pi Foundation quotes even higher power usage numbers, which assume USB port peak load rather than CPU peak load.

Power over USB

If you’re going to run several Pis together in a cluster these power numbers matter, because you’ll need to ensure that whatever wall-power-to-USB hub or other power source you’re using is appropriately provisioned. For instance, if you were to build out a cluster of five RPi 3 B+s, your max power draw is somewhere around 25 W, which is considerably more than most 5-port USB power supplies can deliver.
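
If you want to sanity-check a bigger or mixed build before buying a supply, the arithmetic is trivial to script. A quick sketch using the (conservative) wattages from the table above; the model mix and the 20% headroom factor are just my assumptions:

    # rough cluster power budget from the per-model wattages above
    WATTS = {"3B+": 5.0, "4B": 6.5}

    cluster = ["3B+"] * 5                      # e.g. five 3 B+ boards
    budget = sum(WATTS[m] for m in cluster)
    print(f"{budget:.1f} W nominal, {budget * 1.2:.1f} W with 20% headroom")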

The main drawback of going down the USB power road is that USB hubs typically aren’t individually switched, let alone with software control. While the Raspberry Pi is an incredibly stable platform, it lacks the remote management capabilities which can be expected of server hardware.

In a datacenter, if a computer gets real borked, you can usually remotely power cycle it: using IPMI in a fancy deployment, or in simpler setups by just … unplugging it and plugging it back in again with a remotely managed power distribution unit. Entire management systems such as Open19 rely variously on being able to do this. But a typical USB hub won’t deliver anything you could automate around like this.

Power over backplane

All the Pis have the same GPIO header layout, and it’s possible to power any Pi model directly by supplying a 5v source on the appropriate pins. This is how the Power over Ethernet modules for the Pi work: they handle the physical negotiation of PoE delivery, convert voltage as needed and deliver power directly to the Pi’s GPIO. The main drawbacks of the PoE modules for the Raspberry Pi are cost and bulk. A PoE HAT for the Pi can cost $30, and while less integrated PoE splitter solutions can be had for as little as $12, either way you’re adding a fairly bulky component to every Pi, which can complicate a mechanical layout.

The main advantage of PoE is when deploying devices remote from the power source, where it’s convenient to deliver power over the ethernet run rather than supplying power separately. That’s far less relevant in the context of building an integrated cluster, but some PoE-capable network switches offer the ability to turn PoE delivery off per switch port, which would be another way to get remote power switching capabilities.

Backplane solutions such as Bitscope’s Blade or better yet the ClusterCTL Stack can also make powering groups of Pis extremely easy. In particular, ClusterCTL provides software-defined power switching per Pi, which can be used to implement the sort of remote hard reset discussed above.
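
Per my (possibly fuzzy) recollection of the tooling that ships with it, that power switching is driven from a small clusterctrl CLI on the controller Pi; treat the exact commands here as an assumption and check the docs:

    clusterctrl status     # report power state of the attached nodes
    clusterctrl off p2     # hard power off node 2
    clusterctrl on p2      # ... and bring it back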

Networking

It’s also important to note on the networking front that the Pis really are a … limited platform. Commodity compute hardware has offered full gigabit throughput for years. The Pis, however, don’t.

Model   Networking
B       10/100
A       none (no wired ethernet)
B+      10/100
A+      none (no wired ethernet)
2B      10/100
3B      10/100
3A+     none (wireless only)
3B+     10/100/1000 (hung off USB 2.0, roughly 300 Mb/s in practice)
4B      10/100/1000 (full)

The Pis are relatively low performance, so just about any off-the-shelf managed or unmanaged switch will be able to keep up with them. You will need to be able to dedicate a switch port per Pi, but the main consideration in network design for your cluster is how you want to structure DHCP and manage egress.

Networking considerations

I’ll say more about this in a bit when it comes to booting the Pis, but the Raspberry Pi has some … interesting ideas about how netbooting occurs with respect to more conventional platforms. For now, I’ll just say that having routing separation between your Pi cluster and any other networks you may run will be convenient, because you’ll probably want to run a separate DHCP server rather than rely on an embedded one.

In my setup, I accomplish this by running an unmanaged switch which I connect to an isolated (separate VLAN) upstream switch port. This leaves me at liberty to run my own DHCP server on the unmanaged switched network, and keeps the network viable when disconnected from any upstream router(s).
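
Concretely, that can be as small as a dnsmasq instance bound to whichever interface faces the Pi switch. A minimal sketch; the interface name and address range are assumptions about your layout:

    # /etc/dnsmasq.conf
    interface=eth1                            # the NIC facing the Pi switch
    bind-interfaces
    dhcp-range=10.42.0.100,10.42.0.200,12h
    dhcp-option=option:router,10.42.0.1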

If you want to use a Pi cluster as a testbed for real networking problems or distributed systems, you’ll almost certainly want to run a more sophisticated piece of routing hardware than just a generic Netgear unmanaged switch. Otherwise you’ll have a hard time simulating or causing link failures, packet loss, lag and such.
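
That said, if the “more sophisticated piece of routing hardware” is just a Linux box sitting in the forwarding path, the netem queueing discipline gets you a surprising amount of fault injection for free. A sketch, assuming eth1 is the Pi-facing interface:

    # add 100ms (+/- 20ms) of delay and 1% packet loss to traffic headed at the Pis
    sudo tc qdisc add dev eth1 root netem delay 100ms 20ms loss 1%
    # and clean it back up afterwards
    sudo tc qdisc del dev eth1 root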

Laying out a cluster

There’s a ton of mechanical layout options. My original build followed the traditional “pis over power and switch” layout, but because I was using an appropriately provisioned, beefier power supply it wound up looking a bit different.

There’s any number (1, 2, 3, 4, 5, 6, 7, …) of 3d printable cluster configurations to be had, and a quick google search for “raspberry pi rack” turns up a number of vendors who would be delighted to sell you something packaged.

Cases

Things get trickier, and there are fewer examples, when it comes to fitting Pis into common hard cases such as Nanuk or Pelican products. It can most certainly be done and done well, it’s just unusual.

I documented most of rebuilding my case -

And my friend Matt built a comparable thing using a 2u sled design -

A surprising number of cases can be had which are properly internally dimensioned for a 19” wide rack mount unit, and as the Pis aren’t particularly deep (65mm or 2 9/16” on the longest side), going with a 2u sled for packing Pis like Matt did is a pretty good strategy. I went with manually packing a 5 Pi block into a Nanuk 915, which it turns out fits two such blocks, although I haven’t felt the need to expand my case yet.

The big downside of the packed case design I went with is that appropriate cooling is really hard. This isn’t a huge problem for me since I’m not looking to run workloads, just demonstrate provisioning technologies. But the Pis do run plenty toasty when pushed, and most of the Pi “rack” solutions do incorporate fans to push air through the Pis for a reason.

I could probably do a better job with my case layout if I were to design and 3d print up a 5 Pi carrier which integrated with a fan and bolted into the box, but right now everything’s hand-packed with foam. C’est la vie.

Booting the Raspberry Pi

[Network] Booting the Pis is, politely, a mess.

The brief version is that all Pi versions will first try to boot from their SD card, and will then try to boot from a USB device. If you’re willing to individually image SD cards, go crazy. That’s a well trodden path that totally works, although it doesn’t dovetail particularly well with any sort of remote cluster management technology like Puppet or inventory discovery or what have you.

In a real production environment, you’d boot new hardware into some sort of “discovery” phase, using your DHCP server and netboot infrastructure to bring the machine up in an “OS” which collects host metadata, reports it back, and reboots the machine so that a different decision can be made with that metadata in hand. Usually in a production environment you’d use an HTTP server to emit PXE menus (2008), and play games in your webserver of choice to control the generated PXE menu.
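
For the unfamiliar, the “PXE menu” in that flow is just a small text config the network bootloader fetches and interprets. A pxelinux-flavored sketch of what a discovery entry looks like, whether it’s a static file or generated per-host by a webserver; the image names and URL are made up:

    # pxelinux.cfg/default, or a per-host file your webserver generates
    DEFAULT discovery
    PROMPT 0

    LABEL discovery
      KERNEL discovery/vmlinuz
      APPEND initrd=discovery/initrd.img console=tty0 register_url=http://10.42.0.1/register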

The good news is that some Pi models (the 3s and later) also support a rudimentary form of network booting. I won’t spend too many words on it here, as it’s reasonably documented, but the short and very bad news is that the Pis don’t use PXE booting.

Instead of performing a PXE boot, they implement a form of TFTP booting. This is what booting a server used to look like prior to about ’99.

The Raspberry Pi’s firmware knows how to make a DHCP request, extract a next-server and boot-file from the DHCP response, and will fetch that file and boot it. Typically, this will be a bootloader, which separately identifies itself and requests more files, eventually loading a .txt file specifying a kernel, initrd and command line to boot. This works great, and is entirely sufficient for implementing locally stateless boot of Raspberry Pis.
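
In practice the whole dance can be served by one dnsmasq instance doing both DHCP and TFTP, along the lines of the Raspberry Pi documentation’s netboot setup. A sketch extending the DHCP config from earlier; the paths are mine:

    # /etc/dnsmasq.conf, extending the DHCP config above
    enable-tftp
    tftp-root=/srv/tftp
    pxe-service=0,"Raspberry Pi Boot"   # the advertisement the Pi 3 firmware looks for
    log-dhcp

    # /srv/tftp then holds the firmware's boot files:
    #   bootcode.bin, start.elf, fixup.dat, config.txt, cmdline.txt, kernel8.img, ...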

The really big caveat is that netbooting is (unless you have a Pi 4, whose boot order can easily be changed) the dead last thing a Pi will try when it turns on. This means that if you write a (seemingly) viable image to a local SD card, that bootable SD card will always win out over the network in the future. This can present a remote management challenge if you want to be able to recover wedged or corrupted hosts without manually re-imaging SD cards, since in a classical production PXE environment you’d PXE boot every time, so remote management would be able to recover a “wedged” host.
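
On a Pi 4 specifically, the boot order lives in the bootloader EEPROM and can be flipped so the network is tried before (or instead of) the SD card. Roughly the following, though double-check the nibble values against the current bootloader documentation:

    # show the current bootloader configuration
    sudo rpi-eeprom-config
    # open it in an editor and write the change back to the EEPROM
    sudo -E rpi-eeprom-config --edit
    # e.g. BOOT_ORDER=0xf21 tries SD then network (read right to left, 0xf = retry),
    # while BOOT_ORDER=0xf12 tries the network before the SD card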

iPXE

PiPXE is a build of the iPXE PXE implementation for the Raspberry Pi platform. Leveraging the (shiny new!) EFI support backported to the Pi 3 series and present in the 4s, it’s possible to use iPXE as a chainloader in a much more conventional PXE menu based boot process than the TFTP based process which the Pi firmware provides.

The ENORMOUS CAVEAT with this is that it isn’t possible (as of this writing, November ’20) to TFTP boot a Pi to the iPXE chainloader. The short version is that the Pi’s firmware has a “filesystem” abstraction layer which treats TFTP roots and SD cards the same, but the EFI implementation has no comparable support for the Pi’s network card. This means that in order to do PiPXE booting, the PiPXE boot configuration must be present on a local SD card, although I believe it works fine beyond that.
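
Once iPXE is up, though, you’re back in comparatively civilized territory: it can fetch its configuration over HTTP and branch on things like the requesting MAC. A small sketch of the kind of script you’d point it at; the server address and layout are assumptions:

    #!ipxe
    dhcp
    # fetch a per-host script from the provisioning server, fall back to a shell
    chain http://10.42.0.1/boot/${net0/mac}.ipxe || shell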

In review

If you want to build a 5-node (or smaller) cluster with which to demonstrate some piece of software, the Pis are a pretty reasonable platform for that. You’ll buy some SD cards, flash each one by hand, give each node a name, maybe use some Puppet or Ansible to configure them after you’ve done some hand setup and it’ll work great. Throw docker-swarm or k3s or something on it and treat it like a cloud, at least until something goes wrong.
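
For that small hand-flashed flavor of cluster, the “management layer” really can be as boring as a static Ansible inventory; the hostnames and addresses below are stand-ins for whatever you named yours:

    # inventory.ini
    [pis]
    pi-0 ansible_host=10.42.0.100
    pi-1 ansible_host=10.42.0.101
    pi-2 ansible_host=10.42.0.102
    pi-3 ansible_host=10.42.0.103
    pi-4 ansible_host=10.42.0.104

    [pis:vars]
    ansible_user=pi
    ansible_python_interpreter=/usr/bin/python3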

For myself, having built a Pi cluster with the intent of using it to demonstrate more traditional large-scale remote management tools which depend on PXE, the Pi has proven to be a limiting substrate due to its lack of proper PXE support. Were one willing to implement a custom TFTP server with support for variable content or do some serious firmware development it would be possible to implement something resembling a traditional PXE provisioning flow but that’s a pretty heavy lift for a hobby project.
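
To give a sense of the shape of that “custom TFTP server” idea: the protocol is small enough that a toy read-only implementation which picks content per client fits in well under a hundred lines. This is a sketch to illustrate the flow, not something I’d boot production Pis from; it handles one transfer at a time, skips option negotiation and retransmits, and the addresses and profile names are invented:

    #!/usr/bin/env python3
    """Toy read-only TFTP server that picks boot content per client (RFC 1350-ish)."""
    import os
    import socket
    import struct

    TFTP_ROOT = "/srv/tftp"            # assumed layout: /srv/tftp/<profile>/<file>
    DEFAULT_PROFILE = "default"
    PROFILES = {                       # hypothetical client IP -> boot profile map
        "10.42.0.101": "discovery",
        "10.42.0.102": "k3s-worker",
    }

    OP_RRQ, OP_DATA, OP_ACK, OP_ERROR = 1, 3, 4, 5
    BLOCK = 512

    def serve_file(client, path):
        """Send one file to `client` from a fresh ephemeral port (its transfer ID)."""
        tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        tx.settimeout(5.0)
        try:
            with open(path, "rb") as f:
                data = f.read()
        except OSError:
            tx.sendto(struct.pack("!HH", OP_ERROR, 1) + b"File not found\x00", client)
            return
        block, offset = 1, 0
        while True:
            chunk = data[offset:offset + BLOCK]
            tx.sendto(struct.pack("!HH", OP_DATA, block & 0xFFFF) + chunk, client)
            try:
                ack, _ = tx.recvfrom(512)
            except socket.timeout:
                return                            # no retransmit in this sketch
            if struct.unpack("!HH", ack[:4]) != (OP_ACK, block & 0xFFFF):
                return
            if len(chunk) < BLOCK:                # a short block ends the transfer
                return
            block, offset = block + 1, offset + BLOCK

    def main():
        srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        srv.bind(("0.0.0.0", 69))                 # needs root (or CAP_NET_BIND_SERVICE)
        while True:
            pkt, client = srv.recvfrom(2048)
            if len(pkt) < 4 or struct.unpack("!H", pkt[:2])[0] != OP_RRQ:
                continue                          # only read requests are handled
            filename = pkt[2:].split(b"\x00")[0].decode("ascii", "replace")
            profile = PROFILES.get(client[0], DEFAULT_PROFILE)
            path = os.path.normpath(os.path.join(TFTP_ROOT, profile, filename.lstrip("/")))
            if path.startswith(TFTP_ROOT):        # crude path traversal guard
                serve_file(client, path)

    if __name__ == "__main__":
        main()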

Thanks to @krainboltgreene for reminding me that while bits and pieces of this have been tweeted, I’ve never codified it.

^d