New QPU macro assembler

Since Broadcom released complete documentation for the VideoCore IV GPU back in February 2014 we’ve seen a number of fun uses of our 24GFLOPs of QPU compute, from Andrew Holme’s FFT library to Pete Warden’s deep learning experiments. It’s not unusual to see a 10x increase in performance over the ARM for algorithms with a decent amount of parallelism.

A platform is only as good as its development tools, so it’s a great start to the New Year to see a new QPU macro assembler from Marcel Müller. This builds on Pete and Eman’s earlier QPU assemblers to include support for macros and functions. Along the way, he’s even managed to squeeze another few percent out of the size and run time of Andrew’s FFT library. You can find source, binaries, documentation and sample code here.



Looks interesting! ;)



Maybe this will make the Pi more useful for different DSP projects, including more advanced uses related to the RTL-SDR software defined radio.


I'd love to make a mobile scanner setup where I can measure the signal strength of a selected frequency, open a port with a USB GPS dongle, and plot it to a .kml file. I can currently do it with a laptop, but it's bulky; an A+ could hide behind the rear-view mirror, with status coming up on the rear camera display module.

project #43. ( sometime in the year 2525 perhaps after I sort out where to hide a mini keyboard. Behind the sun visor?)


GPU_FFT version 2.0, which is included in the latest Raspbian distribution, or can be obtained from supports all power-of-2 FFT lengths from 2^8 to 2^21 (i.e. 256 to 2 Mega-points) inclusive and it includes a GPU-accelerated transposer for 2-dimensional transforms.


Interesting, but I thought Einstein@home needed 2^22. So is it a case of close but no cigar for use with Einstein@home?


Closer :-) I think Einstein@home needs 12M-point and is currently making this using 2^21 transforms followed by radix-2 and radix-3 stages coded on the ARM. The mixed radix stuff could possibly be done faster on the GPU.
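To illustrate the idea of finishing a power-of-2 transform with a radix-3 stage, here is a minimal Python sketch. It is not the Einstein@home or GPU_FFT code: the function names are hypothetical, the small recursive radix-2 routine merely stands in for the large 2^21 transforms, and everything runs on the CPU.

```python
import cmath


def fft_pow2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two.

    Stand-in for the large power-of-2 transforms (an assumption for
    illustration, not the GPU_FFT implementation).
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_pow2(x[0::2])
    odd = fft_pow2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_3_times_pow2(x):
    """FFT of length 3*M (M a power of two): three length-M FFTs plus one radix-3 stage."""
    n = len(x)
    m = n // 3
    if m == 0 or 3 * m != n or m & (m - 1):
        raise ValueError("length must be 3 times a power of two")
    # Decimate in time: samples 0,3,6,... / 1,4,7,... / 2,5,8,...
    a, b, c = (fft_pow2(x[r::3]) for r in range(3))
    out = [0j] * n
    for q in range(3):
        for k in range(m):
            j = k + q * m  # output bin index
            w1 = cmath.exp(-2j * cmath.pi * j / n)
            w2 = cmath.exp(-2j * cmath.pi * 2 * j / n)
            out[j] = a[k] + w1 * b[k] + w2 * c[k]
    return out


def naive_dft(x):
    """O(N^2) reference DFT, used only to check the result."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
```

Comparing `fft_3_times_pow2` against `naive_dft` on a short input (for example 12 points) is a quick way to confirm that the radix-3 twiddles are right before scaling up.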


As for Einstein@Home: The app that runs on Android and ARM-Linux (so also on the Raspi, but using the CPU only so far) indeed uses an FFT of length 3 x 2^22, real-to-complex, which is easily implemented as a 3 x 2^21 complex-to-complex FFT with some additional twiddling. I’ve begun to try to make use of the GPU for this (e.g. by using batches of length 6 and length 2^10 FFTs (6 x 2^10 x 2^10)). The current GPU lib can easily do batches of N=2^10 FFTs, but the length 6 FFTs would slow things down if done on the CPU, so this would require some QPU programming. The new GPU-accelerated transposer sounds like something I should look at tho, this might be useful for this project.

Heinz-Bernd Eggenstein, Einstein@Home project team, Max-Planck-Institute for Gravitational Physics.
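The "additional twiddling" mentioned above can be sketched as follows: a real-to-complex FFT of length 2N computed from a single complex-to-complex FFT of length N. This is a generic textbook construction, not the Einstein@Home code; the function names are hypothetical, and a naive DFT stands in for the fast complex FFT so that any length (including 3 x 2^k) works.

```python
import cmath


def dft(z):
    """Naive O(N^2) complex DFT; a stand-in for a fast complex-to-complex FFT."""
    n = len(z)
    return [
        sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_fft_via_half_length(x, complex_fft=dft):
    """Return bins 0..N of the DFT of a real sequence x of even length 2N.

    Uses one complex FFT of length N; complex_fft is a pluggable
    (hypothetical) hook for whatever fast transform is available.
    """
    two_n = len(x)
    if two_n == 0 or two_n % 2:
        raise ValueError("input length must be even and non-zero")
    n = two_n // 2
    # Pack even samples into the real part and odd samples into the imaginary part.
    z = [complex(x[2 * t], x[2 * t + 1]) for t in range(n)]
    zf = complex_fft(z)
    out = []
    for k in range(n):
        zc = zf[(n - k) % n].conjugate()
        even = (zf[k] + zc) / 2    # DFT of the even-indexed samples
        odd = (zf[k] - zc) / 2j    # DFT of the odd-indexed samples
        out.append(even + cmath.exp(-2j * cmath.pi * k / two_n) * odd)
    # Nyquist bin: E[0] - O[0], which is purely real.
    out.append(complex(zf[0].real - zf[0].imag, 0.0))
    return out
```

The remaining bins N+1..2N-1 follow from conjugate symmetry of a real input, so only N+1 outputs need to be stored.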


There is also another guy working on this see

He seems to think there are some problems with the RMS error and plans to contact Andrew


It is a very positive step by Broadcom to release complete documentation for the VideoCore IV GPU and so enable a fuller level of engagement from the Open Source communities which will hopefully yield mutual benefits to all:)



The best acceleration for everyday use I’d like to see so far would have been decoding JPEGs in hardware – for fast image preview and display. But I doubt JPEGs, especially progressive ones, are easily decodable on a GPU. I’ve seen some efforts; maybe it will get to Raspbian one day?


I would have thought the video decoder would have hardware for jpeg decode?

I have no knowledge whatsoever about the vc4 gpu but motion jpeg is part of the mpeg standard so I would think fast jpeg decode would be simple with the vc4?

I think there may be a bit of code in the vc4 examples in /opt/vc????? called something like hello_jpeg that does vc4 based jpeg decode (this is a total guess I have never looked at it).

DCT is the basis of jpeg format so……?

I would also think that almost all software on linux uses libjpeg for jpeg decode which is distributed from the jpeg consortium sources, so should be simple to creATE A WRAPPER LIBRARY THAT (oops caps) slots in and would ‘just work’ if library stubs were exchanged for accelerated versions???

I guess only a few days work for someone that knows how to do it.

Of course I am probably totally wrong as I have not researched any of what I said just making guesses.


There is HW JPEG encode and decode on the GPU. Can be accessed via the OpenMAX plugin. I do not believe it supports progressive though.


See hello_jpeg for sample code for hw accelerated jpeg decode.

This is used in xbmc (now kodi) and latest version of Epiphany Web browser (and many user projects).

It doesn’t support progressive (I don’t believe any HW codecs do).

