In the brand new issue of HackSpace magazine, out now, Derek Woodroffe explains how he made PicoCray, an impressive Raspberry Pi Pico cluster.
Every time a new computer board comes out, shortly after DOOM has been ported to it, there is usually the appearance of ‘the cluster’. The Raspberry Pi Pico was released in January 2021 and, as I’ve seen no distributed computing projects, I thought it was time to see what could be achieved with a cluster of Raspberry Pi Picos.
Of course, the Pico has a few things going against it when it comes to easy clustering:
1. It doesn’t natively run Linux
2. There is no Ethernet interface or easy access to an IP stack (except the Pico W’s Wi-Fi)
3. It isn’t really that powerful (compared to, say, a Raspberry Pi 4)
But they are inexpensive, and I had ten. The lack of a multitasking OS or IP stack means we will have to natively do the communications and the process control but, for something simple, that shouldn’t be too hard.
Proof of Principle
I started with three Picos connected by their I2C ports – one as an I2C controller and two I2C processors, all powered by individual USB cables and connected via a strip of Veroboard. A quick test in MicroPython proved that it was workable but, as we are aiming for speed, I quickly swapped to C. Luckily, there is example code for the Pico I2C Slave in Raspberry Pi’s C GitHub, and this formed the basis for the Processor-to-Processor communication.
I quickly got the C code working, and this allowed me to have a 256-byte area in each Processor using I2C 1-byte addressing and readable/writable from the Controller. One byte in this area was designated as a status for that Processor.
I quickly realised that scaling this up beyond a few Picos would mean that I’d need to set the address of all the Processors individually, which I could do in software, but this would mean programming each differently, or I could use switches or links on each board, which would be error-prone, not scalable, and require a lot of wiring up. I decided on a more elegant solution.
Defining the Controller was easy – for this, I used a single GPIO pin with its pull-up enabled. With this pin grounded, the Pico becomes a Controller; left unconnected, it is a Processor.
For the Processors, I devised a scheme where each Processor would wait a random amount of time before connecting to the I2C bus. When the Controller saw the Processor on a default I2C port, it would allocate that Processor an I2C address, and the Processor would restart its I2C code with the new address.
This worked fine with two Processors, but adding more led to conflicts, as a new Processor had no way of knowing whether the I2C bus already had a device listening on the default address. To resolve this, I added a separate Assert line, which was common to one GPIO pin on all Processors.
Now, when a new Processor comes online, it can check the Assert line. If it’s low, the Processor takes the Assert line high and adds itself to the I2C bus. If the Assert line is already high, it does nothing and waits a random period before trying again.
Once the Controller has assigned the Processor an I2C address, the Processor lowers the Assert line, allowing another Processor to claim the default address, and sets its own status to READY.
After a few seconds, all the Picos will have a unique I2C address and have registered their statuses as READY.
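As a rough illustration, the Assert-line handshake described above can be modelled on a desktop machine in plain C. This is a sketch under stated assumptions, not the PicoCray source: the names `processor_try_claim` and `controller_assign`, the address values, and the state names WAITING and CLAIMED are all hypothetical, and the random back-off delay is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEFAULT_ADDR 0x17  /* hypothetical default I2C address */
#define FIRST_ADDR   0x20  /* hypothetical first address handed out */

enum proc_state { WAITING, CLAIMED, READY };

struct processor {
    enum proc_state state;
    uint8_t addr;          /* current I2C address, 0 while off the bus */
};

struct bus {
    bool assert_high;      /* the shared Assert line */
    uint8_t next_addr;     /* next address the Controller will assign */
};

/* A Processor tries to join the bus. It only succeeds if the Assert
 * line is low; otherwise it should wait a random period and retry. */
bool processor_try_claim(struct bus *b, struct processor *p)
{
    if (p->state != WAITING || b->assert_high)
        return false;
    b->assert_high = true;       /* take the Assert line high */
    p->addr = DEFAULT_ADDR;      /* appear on the default address */
    p->state = CLAIMED;
    return true;
}

/* The Controller has seen a device on the default address and gives it
 * a unique one; the Processor then releases the Assert line. */
void controller_assign(struct bus *b, struct processor *p)
{
    if (p->state != CLAIMED)
        return;
    p->addr = b->next_addr++;    /* Processor restarts I2C on the new address */
    b->assert_high = false;      /* free the default address for the next one */
    p->state = READY;
}
```

Because only one Processor can hold the Assert line high at a time, only one device ever answers on the default address, which is what lets every board run identical code.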
And the best bit – they now all run the same code.
I chose I2C initially because I’m lazy, and connecting three wires to each of the Picos seemed the quickest way to get them all connected, but I stuck with it.
USB would have been great, but it’s complicated to do host/Controller and still allow programming of the Pico. I discounted it as too difficult.
SPI would also be great: its bus speeds are up to 20 times faster than I2C, but you would need a separate CS line for each Processor. This means no automatic address assignment and a limit on the number of Processors, dictated by the spare GPIO pins of the Controller.
Serial/UART has two issues. There is no inherent addressing, so every Processor would have to listen to and process every message. Also, there is a limit of around 1 MBd in asynchronous mode, although synchronous serial may be worth exploring later.
So, I2C became the protocol of choice.
Adding more Picos to the Veroboard caused the cabling to become increasingly difficult to manage – a PCB was in order. To keep the size of the PCB down, I came up with the idea of mounting each Pico with its USB connector facing downwards, as this would give power to each Pico and, as there were only four connections between Picos (GND, I2C Clock, I2C Data, and Assert), it should then be easy to connect a daughter board between them. I then thought it would be great to make an 8-port USB hub to enable me to program them all at once, too.
There are very few USB hub ICs with eight or more ports that are easily hand-solderable. I eventually decided on the FE1.1S, which is only a 4-port hub. You need three of these devices to make an 8-port hub: each chip has four downstream ports and one upstream port, so two chips provide the eight downstream ports, and a third chip combines their two upstream ports into a single upstream connection to your computer. This was all a mighty risk, as I’d never played with USB hubs before, but even if only the power to each socket worked, it would be a neat way of connecting all the Picos.
With the PCB, I made eight (and some spares) daughter boards to connect the GND, I2C, and Assert lines in parallel to all the Picos at a jaunty angle. Annoyingly, I missed off the I2C pull-up resistors, so these are added to the last PCB in the chain.
With a solid hardware base, I could now start to put all this awesome Pico power to work. I’d been looking for a program to use as a demonstration, ideally a problem that could be easily carved up into multiple tasks, and I rediscovered the Mandelbrot set.
The Mandelbrot set is a mathematical function that creates a complex pattern from a relatively simple algorithm. What is brilliant about this fractal is, by controlling the range of numbers sent to the function, you can ‘zoom in’ to the pattern and reveal more and more detail.
This is a seriously simplistic explanation of a truly fascinating set of formulas and patterns that is out of the scope of this article, and will probably get me in trouble with mathematicians everywhere. Take some time out and have a look at fractal mathematics, or just look through the pictures generated by these formulae.
I’ve used the Mandelbrot set on just about every computing architecture, in one form or another, from printing out a set on an ASCII Teletype over many hours on Z80 computers, to zooming in at almost real time on a PC with a modern graphics card.
It does, of course, require some computation to create these pictures, but it’s easy to split this into areas that can be handed out to each of the Processors for them to work on their own little part, so for this project it’s ideal.
Splitting up the Mandelbrot is quite simple. The set is broken up into ‘lumps’ of a line of 120 pixels by the Controller.
Once there is at least one Processor with status READY, the Controller will send a task to the Processor – this is an X and Y coordinate and a step value (3 doubles) for the Mandelbrot calculation and how many calculations to do, which is the lump value. The Controller sets the Processor’s status to GO to start the computation. This is acknowledged by the Processor changing its state to BUSY. A note of which Processor and starting X and Y coordinate is also saved.
The Controller continues allocating tasks until there are no READY Processors.
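The task hand-off described above can be sketched as a small record copied into the shared 256-byte area. The byte layout, the offsets, and the function names here are assumptions made for illustration; only the task contents (three doubles plus a lump count) and the READY, GO, BUSY, and DONE statuses come from the description. The sketch also assumes an 8-byte `double` and the same byte order at both ends.

```c
#include <stdint.h>
#include <string.h>

enum status { STATUS_READY, STATUS_GO, STATUS_BUSY, STATUS_DONE };

#define SHARED_SIZE   256
#define STATUS_OFFSET 0    /* hypothetical: status byte at the start */
#define TASK_OFFSET   1    /* hypothetical: task fields follow it */

struct task {
    double x, y, step;     /* start coordinate and per-pixel step */
    uint8_t count;         /* how many pixels to calculate (the lump) */
};

/* Controller side: write the task fields, then flip the status to GO. */
void controller_send_task(uint8_t shared[SHARED_SIZE], const struct task *t)
{
    memcpy(&shared[TASK_OFFSET],      &t->x,    sizeof t->x);
    memcpy(&shared[TASK_OFFSET + 8],  &t->y,    sizeof t->y);
    memcpy(&shared[TASK_OFFSET + 16], &t->step, sizeof t->step);
    shared[TASK_OFFSET + 24] = t->count;
    shared[STATUS_OFFSET] = STATUS_GO;   /* written last, so the task is complete */
}

/* Processor side: if told to GO, read the task and acknowledge with BUSY.
 * Returns 1 if a task was accepted, 0 otherwise. */
int processor_accept_task(uint8_t shared[SHARED_SIZE], struct task *t)
{
    if (shared[STATUS_OFFSET] != STATUS_GO)
        return 0;
    memcpy(&t->x,    &shared[TASK_OFFSET],      sizeof t->x);
    memcpy(&t->y,    &shared[TASK_OFFSET + 8],  sizeof t->y);
    memcpy(&t->step, &shared[TASK_OFFSET + 16], sizeof t->step);
    t->count = shared[TASK_OFFSET + 24];
    shared[STATUS_OFFSET] = STATUS_BUSY;
    return 1;
}
```

Writing the status byte after the task fields means a Processor that sees GO can rely on the rest of the task already being in place.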
To fully use the available processing power on each Processor, the calculations are split over the two Pico cores. When the Processor has completed the calculations in both cores, it sets its status to DONE.
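For readers who have not met it before, the per-pixel work is the standard escape-time iteration of z = z² + c, counting how many steps it takes for z to leave a circle of radius 2. The following is a minimal single-core sketch of one lump, assuming a cap of 255 iterations so that each result fits in an unsigned char; the function names are hypothetical, and on a Pico each core would simply be given half of the lump.

```c
#include <stdint.h>

#define MAX_ITER 255   /* assumed cap, so the count fits in one byte */

/* Count iterations of z = z*z + c until |z| > 2 or the cap is hit. */
uint8_t mandel_iterations(double cx, double cy)
{
    double zx = 0.0, zy = 0.0;
    uint8_t i = 0;
    while (i < MAX_ITER && zx * zx + zy * zy <= 4.0) {
        double t = zx * zx - zy * zy + cx;   /* real part of z*z + c */
        zy = 2.0 * zx * zy + cy;             /* imaginary part */
        zx = t;
        i++;
    }
    return i;
}

/* Compute one lump: `count` pixels along a horizontal line, starting
 * at (x, y) and moving `step` in x for each pixel. */
void compute_lump(double x, double y, double step, int count, uint8_t *out)
{
    for (int n = 0; n < count; n++)
        out[n] = mandel_iterations(x + n * step, y);
}
```

Points inside the set never escape and so come back with the maximum count, while points far from it escape after only a step or two.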
The Controller continues to scan each Processor for its status. If it sees a status of DONE, it will download the results, currently a lump of unsigned chars representing the iteration value of the Mandelbrot calculation. When this has been transferred, the Controller sets the Processor status to READY.
The Controller uses the received data and the saved X and Y coordinate to assemble the image. A palette is applied to the 8-bit data to give 16 bits of colour – this is sent to the display, either directly or via DMA. The display is a 240×320 pixel colour module with touch, and is connected to the Controller via SPI.
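The palette step can be sketched as a 256-entry lookup table that turns each 8-bit iteration count into a 16-bit pixel. This example assumes the display takes RGB565 (5 bits red, 6 green, 5 blue), which is common for small SPI colour modules but is not stated above; the gradient itself and the function names are invented for illustration.

```c
#include <stdint.h>

/* Pack 8-bit red, green, and blue into one 16-bit RGB565 value. */
uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}

/* Fill a lookup table indexed by iteration count. The colour ramp is
 * arbitrary; swapping this function swaps the palette at compile time. */
void build_palette(uint16_t palette[256])
{
    for (int i = 0; i < 256; i++) {
        uint8_t v = (uint8_t)i;
        palette[i] = rgb565(v, (uint8_t)(v * 2), (uint8_t)(255 - v));
    }
    palette[255] = 0x0000;   /* points that never escaped are drawn black */
}
```

With the table built once, colouring a received lump is a single array lookup per pixel, which keeps the Controller free to service the I2C bus.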
Touching the display starts the Mandelbrot at the next zoom level, centred around the area that was touched. A few different colour palettes can be chosen from, although they can only be changed at compile time. The console has a set of three buttons: Reset, Back (undoes the last zoom), and Again (redoes the last zoom).
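Zooming on a touch comes down to a little coordinate arithmetic: convert the touched pixel to a point on the complex plane, shrink the step, and move the top-left corner so that the point ends up in the middle of the screen. The sketch below assumes the view is stored as a top-left coordinate plus a per-pixel step, matching the task format; the `view` struct, the `zoom_at` name, and the zoom factor are hypothetical.

```c
struct view {
    double x, y;    /* complex-plane coordinate of the top-left pixel */
    double step;    /* distance on the plane between adjacent pixels */
};

/* Return a new view, `factor` times deeper, centred on pixel (px, py)
 * of a display that is `width` by `height` pixels. */
struct view zoom_at(struct view v, int px, int py,
                    int width, int height, double factor)
{
    struct view out;
    double cx = v.x + px * v.step;   /* touched point on the plane */
    double cy = v.y + py * v.step;

    out.step = v.step / factor;
    out.x = cx - (width / 2.0) * out.step;
    out.y = cy - (height / 2.0) * out.step;
    return out;
}
```

Keeping each previous `view` in a small array would be one way to support the Back and Again buttons, since undoing a zoom is then just a matter of restoring the earlier entry.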
The original Cray-1 was a 1970s supercomputer, with a distinctive dodecagon shape and a settee around the base. They were originally made in various colours and with some transparent panels, making them ideal as a case shape for this project. The case was cut from transparent and red acrylic, and glued together using an acrylic welding solution. To keep the shape together whilst the acrylic welded, the insides of the non-transparent pieces were backed with cardboard.
It would not be a Cray without some form of settee, and this is replicated by using red 2 mm thick foam. The foam shape is laser-cut and the seat cushion detail was laser-engraved.
Although the project works very well, there are some ideas I’d like to pursue.
I’d like to do something other than compute a Mandelbrot.
The communication speed could be improved by using SPI, 4-bit SPI, some form of X bit I2C maybe, USB or something else – there are a lot of free PIO processors looking for a job.
The Processor’s results could do with more than 256 bytes in shared/accessible memory too.
There are some unexpected USB problems, although not all are caused by the PCB. They occur when twelve devices are plugged into a computer simultaneously. There is a known issue in TinyUSB, the Pico’s USB stack, that causes enumeration problems. This can be partially mitigated by software changes at compile time.
Some issues are caused by the host computer, and I’m still working through them. Due to this, I only really use the PCB for power, although it does mean I have to update the software on the Picos one by one, which is a pain.
I have achieved some level of distributed computing spread over eight or more Picos and a Controller, and there is a rough format in the code for allocating and structuring the data to and from multiple Processors so other algorithms can be implemented.
But I’m not sure if any of it is of any practical use, other than a learning exercise, unless you want a small inexpensive multiprocessor system with a touch display, 18 cores, and 208 free GPIOs.
Basically, I have created a solution looking for a problem.