Accelerating Fourier transforms using the GPU

Andrew Holme is well known to regular blog readers, as the creator of the awesome (and fearsomely clever) homemade GPS receiver. Over the last few months he’s been experimenting with writing general purpose code for the VideoCore IV graphics processing unit (GPU) in the BCM2835, the microchip at the heart of the Raspberry Pi, to create an accelerated fast Fourier transform library. Taking the Fourier transform of a function yields its frequency spectrum (i.e. the pure harmonic functions which can be added together to reconstruct the original function). In the following example, shamelessly lifted from Wikipedia, we have a function which oscillates roughly three times per second, and whose Fourier transform unsurprisingly has a peak around 3Hz.
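The same idea can be sketched in a few lines of C. This is only an illustration and is not part of GPU_FFT: it evaluates the discrete Fourier transform directly from its definition for an assumed one-second window sampled at 8 Hz, so that bin k corresponds to k Hz. The names COS8, SIN8, dft8_power and peak_bin_of_3hz_cosine are made up for this example, and the sines and cosines are tabulated in advance so that no maths library is needed.

```c
#include <assert.h>

#define DFT_LEN 8                      /* one second sampled at 8 Hz (assumed) */
#define HALF_SQRT2 0.70710678118654752 /* sqrt(2)/2 */

/* cos(2*pi*k/8) and sin(2*pi*k/8) for k = 0..7, tabulated in advance */
static const double COS8[DFT_LEN] =
    { 1, HALF_SQRT2, 0, -HALF_SQRT2, -1, -HALF_SQRT2, 0, HALF_SQRT2 };
static const double SIN8[DFT_LEN] =
    { 0, HALF_SQRT2, 1, HALF_SQRT2, 0, -HALF_SQRT2, -1, -HALF_SQRT2 };

/* Squared magnitude of DFT bin k of the 8-sample real signal x, where
   X[k] = sum over n of x[n] * exp(-2*pi*i*k*n/8). */
double dft8_power(const double x[DFT_LEN], int k)
{
    double re = 0.0, im = 0.0;
    int n;

    assert(k >= 0 && k < DFT_LEN);
    for (n = 0; n < DFT_LEN; n++) {
        int t = (k * n) % DFT_LEN;     /* index into the tables */
        re += x[n] * COS8[t];
        im -= x[n] * SIN8[t];
    }
    return re * re + im * im;
}

/* Build a cosine that completes three cycles in the window, then return
   the strongest bin among 0..4 (0 Hz up to the Nyquist bin). */
int peak_bin_of_3hz_cosine(void)
{
    double x[DFT_LEN];
    int n, k, best = 0;

    for (n = 0; n < DFT_LEN; n++)
        x[n] = COS8[(3 * n) % DFT_LEN];
    for (k = 1; k <= DFT_LEN / 2; k++)
        if (dft8_power(x, k) > dft8_power(x, best))
            best = k;
    return best;
}
```

Calling peak_bin_of_3hz_cosine() picks out bin 3, the discrete counterpart of the peak around 3 Hz. A fast Fourier transform computes all the bins at once with far fewer operations than this direct sum.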

Being able to perform lots of Fourier transforms quickly is useful for all sorts of audio and radio applications including, unsurprisingly, GPS. Ham radio enthusiasts will also find Andrew’s work very useful. In this guest post, Andrew talks about his Fourier transform library for the Pi.

Last October, Eben attended the Radio Society of Great Britain (RSGB) Convention, where radio amateurs told him they wanted a speedy fast Fourier transform (FFT) library to do Software Defined Radio (SDR) projects on the Pi.

GPU_FFT is an FFT library for the Raspberry Pi which exploits the BCM2835 SoC V3D hardware to deliver ten times the performance that is possible on the 700 MHz ARM. Kernels are provided for all power-of-2 FFT lengths from 256 to 131,072 points inclusive.

GPU_FFT uses single-precision floating point for data and twiddle factors, so it does not compete on accuracy with double-precision libraries; however, the relative root-mean-square (rms) error for a 2048-point transform is less than one part per million, which is not bad.

The library runs on dedicated 3D hardware in the BCM2835 SoC, and communication between ARM and GPU adds 100µs of latency which is much longer than the shortest transform takes to compute! To overcome this, batches of transforms can be executed with a single call. Typical per-transform runtimes in microseconds are:

Points    batch=1    batch=10    batch=50      FFTW    Speedup
   256        112          22          16        92       5.8x
   512        125          37          26       217       8.3x
  1024        136          54          45       482      10.7x
  2048        180         107          93       952      10.2x
  4096        298         256         240      3002      12.5x
  8192        689         624         608      5082       8.4x
 16384       1274        1167        1131     12005      10.6x
 32768       3397        3225        3196     31211       9.8x
 65536       6978        6703        6674     82769      12.4x
131072      16734       16110       16171    183731      11.4x
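One way to read the table is as a fixed cost per call that is shared by every transform in the batch. The sketch below is a simplified model written for this note, not something taken from the library: the function name per_transform_us and the fitted figure of 12 µs of GPU work for a 256-point transform are assumptions, chosen so that the model reproduces the batch=1 entry with the 100 µs latency quoted above.

```c
#include <assert.h>

/* Simple model (an assumption, not library code): every call pays a fixed
   latency_us that is shared by the whole batch, plus compute_us of GPU work
   for each transform. Returns the average time per transform. */
double per_transform_us(double latency_us, double compute_us, int batch)
{
    assert(batch > 0);
    return latency_us / batch + compute_us;
}
```

With latency_us = 100 and compute_us = 12 the model gives 112, 22 and 14 µs for batches of 1, 10 and 50, against measured figures of 112, 22 and 16 µs for 256 points. For the longest transforms the compute term dominates, so batching makes little difference there.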

To get GPU_FFT enter the following at the command prompt:

sudo rpi-update && sudo reboot

To build and run the example program:

cd /opt/vc/src/hello_pi/hello_fft
make
sudo mknod char_dev c 100 0
sudo ./hello_fft.bin

API documentation can be found in the hello_fft folder.


coppermine avatar

Thanks! Found it very useful!

chic thomson avatar

are we missing a
sudo apt-get install gpu-fft
in the instructions..

or am i missing something

asb avatar

rpi-update is a script that updates the installed ‘firmware’ to the latest upstream git HEAD. This includes /opt/vc/src where hello_fft lives. A future update to the Debian firmware packages will include hello_fft.

Jim Manley avatar

Well, it should be no surprise to anyone here that this is EXTREMELY welcome news! We educators, engineers and scientists hope this is but the first of a number of libraries for using the GPU to accelerate math(s) instead of just graphics. Are you considering any other libraries of this sort? We’ll be looking to see if we can use parts of this library to perform other computations that lend themselves to parallelization – does this use only the floating-point pipelines, or can the integer pipelines also be leveraged for relatively simple calculations (e.g., sorts, matrix manipulations, etc.)?

Thank you, thank you, thank you!!! :D

eben avatar

Hopefully we can provide accelerators for a few more common operations in the future. 1d and 2d convolutions, and FIR, IIR filters are obvious possibilities. We’re still looking to see whether we can make the QPUs available for third parties to write code on.

Martin avatar

Opening up the QPU shaders and the GPU cores would be very nice indeed…

Swen avatar

Hi Eben, how can a company contribute to add more accelerators? We would like to use the QPUs to decode our ICT encoded video streams. Would be great if 3rd parties could get access to it.

eben avatar

QPU is certainly a good match for this sort of thing. Watch this space – I’m hopeful we can find a way to do this.

Sengan Baring-Gould avatar

That would be welcome.

Chris avatar

A 1D convolution is just 2 FFT transformations multiplied together and then the inverse FFT, so in theory this should be good for it.
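To spell that out: the transforms of the two signals are multiplied bin by bin, and the inverse transform of the product is their circular convolution. The sketch below checks this on a tiny case. It is an illustration only: it uses a naive 4-point DFT rather than GPU_FFT, because for four points the twiddle factors are exactly 1, -i, -1 and i, and every name in it (cplx, dft4, circular_convolve4 and so on) is invented for the example.

```c
#include <assert.h>

#define CONV_LEN 4

typedef struct { double re, im; } cplx;

/* cos(2*pi*k/4) and sin(2*pi*k/4) for k = 0..3 are exact */
static const double COS4[CONV_LEN] = { 1, 0, -1,  0 };
static const double SIN4[CONV_LEN] = { 0, 1,  0, -1 };

/* Naive 4-point DFT: sign = -1 is the forward transform,
   sign = +1 the inverse without its 1/4 scaling. */
void dft4(const cplx in[CONV_LEN], cplx out[CONV_LEN], int sign)
{
    int k, n;

    assert(sign == 1 || sign == -1);
    for (k = 0; k < CONV_LEN; k++) {
        out[k].re = out[k].im = 0.0;
        for (n = 0; n < CONV_LEN; n++) {
            double c = COS4[(k * n) % CONV_LEN];
            double s = sign * SIN4[(k * n) % CONV_LEN];
            out[k].re += in[n].re * c - in[n].im * s;
            out[k].im += in[n].re * s + in[n].im * c;
        }
    }
}

/* Circular convolution via the transform: forward-transform both inputs,
   multiply bin by bin, inverse-transform and divide by the length. */
void circular_convolve4(const double a[CONV_LEN], const double b[CONV_LEN],
                        double out[CONV_LEN])
{
    cplx ta[CONV_LEN], tb[CONV_LEN], fa[CONV_LEN], fb[CONV_LEN];
    cplx prod[CONV_LEN], res[CONV_LEN];
    int i;

    for (i = 0; i < CONV_LEN; i++) {
        ta[i].re = a[i]; ta[i].im = 0.0;
        tb[i].re = b[i]; tb[i].im = 0.0;
    }
    dft4(ta, fa, -1);
    dft4(tb, fb, -1);
    for (i = 0; i < CONV_LEN; i++) {
        prod[i].re = fa[i].re * fb[i].re - fa[i].im * fb[i].im;
        prod[i].im = fa[i].re * fb[i].im + fa[i].im * fb[i].re;
    }
    dft4(prod, res, +1);
    for (i = 0; i < CONV_LEN; i++)
        out[i] = res[i].re / CONV_LEN;
}

/* Reference: sample n of the same circular convolution, computed directly. */
double circular_convolve4_direct(const double a[CONV_LEN],
                                 const double b[CONV_LEN], int n)
{
    double sum = 0.0;
    int m;

    for (m = 0; m < CONV_LEN; m++)
        sum += a[m] * b[(n - m + CONV_LEN) % CONV_LEN];
    return sum;
}

/* Returns 1 when both methods agree on a small example. */
int convolution_example_agrees(void)
{
    const double a[CONV_LEN] = { 1, 2, 3, 4 }, b[CONV_LEN] = { 1, 1, 0, 0 };
    double fast[CONV_LEN];
    int n;

    circular_convolve4(a, b, fast);
    for (n = 0; n < CONV_LEN; n++)
        if (fast[n] != circular_convolve4_direct(a, b, n))
            return 0;
    return 1;
}
```

For these inputs both routes give 5, 3, 5, 7. The result wraps around at the ends, so an ordinary (linear) convolution needs both inputs zero-padded to at least the sum of their lengths minus one before transforming.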

dom avatar

These timings are with a pi on stock frequency. With my overclock settings, I get about 30% faster results.

Disabling hdmi output (tvservice -o) saves a few % if you are running headless. Reducing resolution or depth with fbset helps a little if you need the display.

dom avatar

sudo ./a.out
should be
sudo ./hello_fft.bin
with the released version of the code.

peepo avatar

well I’d like to have more of a handle on this, but a start is great!

anyone have a matrix transform handy, sum or other less demanding task?

had to run sudo make???

thanks again


Jim Manley avatar

Yes, the instructions should read “sudo make”, not “make” if you’re not logged in as root. We also need to provide some pointers (pun fully intended) to the less-experienced (e.g., educators) who may not be familiar with how to access the output via a character device.

As you know, this library is exactly the sort of thing I was discussing in my post the other day about accessing the GPU for doing math(s) things. Timing really is everything!

Andy avatar

I didn’t know about that directory! One of Pi’s little ‘easter eggs’.
Now, I’ve dreamt of making a copy of a machine I used to repair – dual-channel audio spectrum analyzer.
It looked like this on the inside…
(Scroll down for what it looked like on the inside!)
Stuffed full of TTL, NO microprocessor, just bit-slice. There were only about 4 of us in Europe who could fix one.
Pi would be SOOO much faster! Just a couple of A/D’s…Oh, MAMA!!!

Jonathan Morris avatar

I’m kind of confused. What is this supposed to do?

dom avatar

It’s a library for doing fourier transforms very fast.
Only useful if you want to do fourier transforms.

Fourier analysis has many scientific applications – in physics, partial differential equations, number theory, combinatorics, signal processing, imaging, probability theory, statistics, option pricing, cryptography, numerical analysis, acoustics, oceanography, sonar, optics, diffraction, geometry, protein structure analysis and other areas.

It’s been requested by people doing software defined radio and audio processing.

Jim Manley avatar

Fourier Transforms (FTs) are usually used to derive the frequency spectrum from a signal mathematically, but they have many, many more uses than that. If you imagine a sine wave at a single frequency, its Fourier Transform (FT) is a single line on a spectrograph (amplitude on the y axis vs. frequency on the x axis) at the frequency of the sine wave. However, a square wave at the same frequency will contain all of the odd harmonics of the fundamental frequency, with each harmonic dropping off in amplitude in proportion to the inverse of its frequency. Triangle waves also contain only the odd harmonics, but these drop off in proportion to the inverse of the frequency squared. More complex signals have correspondingly more complex spectra. It should be noted that a signal can also be derived in the reverse direction, given its spectrum in mathematical terms.

In practice, signals are never pure sine waves, square waves, etc., and quantizing errors (due to sampling of signals at minimum intervals corresponding to the quality of the equipment used) result in a set of data points that approximate the signal’s waveform (amplitude over time). However, FTs can also be approximated by carrying out mathematical computations over discrete data points, which is what the GPU_FFT library is used to do. You provide it a set of signal amplitude data over time, and it spits out a data set that can be plotted to obtain an amplitude vs. frequency spectrograph. A corresponding transform implemented in software could also be used on spectrum data to derive an estimation of the associated signal, plotted as points on an amplitude vs. time graph.

clive avatar

A Fourier Transform is what werewolves experience during a full moon as compared to werepigs.

Jim Manley avatar

I thought a Fourier Transform was the transaction French-Canadian trappers made when turning pelts into whiskey and other critical supplies needed while traipsing about in the Great White North :lol:

Bill avatar

I believe you are thinking of the fast furrier transform followed by the inverse fast furrier transform.

3xBackup's avatar

Spectrograms are probably the easiest way to understand what FFTs can be used for (in this case, hiding an image in music/noise).

The Y axis is frequency, from 0 Hz up to half a CD’s audio sample rate (22,050 Hz). The X axis is time, and scrolls along like a sideways waterfall diagram.

If you could whistle at exactly 11,000 Hz, record the audio with a microphone to a wave file, and process that file with an FFT (to convert the data from the time domain to the frequency domain) to generate a spectrogram, the result would be a straight line at 11 kHz.
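Where that line lands depends on the transform length. As a rough sketch (the function name nearest_bin and the 4096-point length used below are assumptions made for illustration), the bin nearest to a given frequency can be worked out like this:

```c
#include <assert.h>

/* Index of the FFT bin nearest to freq_hz, for an n_points transform of
   audio sampled at sample_rate_hz. Each bin is sample_rate_hz / n_points
   wide, and bin 0 is 0 Hz. */
long long nearest_bin(long long freq_hz, long long sample_rate_hz,
                      long long n_points)
{
    assert(sample_rate_hz > 0 && n_points > 0);
    assert(freq_hz >= 0 && 2 * freq_hz <= sample_rate_hz); /* up to Nyquist */

    /* adding half the divisor first rounds to the nearest bin */
    return (freq_hz * n_points + sample_rate_hz / 2) / sample_rate_hz;
}
```

For CD-rate audio (44,100 Hz) and a 4096-point transform, each bin is about 10.8 Hz wide, and nearest_bin(11000, 44100, 4096) puts the whistle in bin 1022.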

Spectrogram of Beethoven’s “Sonata Pathétique”

Daniel Barker avatar

I am interested in compiler SIMD parallelisation of bitwise operations on arrays of integers. To my pleasant surprise, I found that gcc – with the right command-line options – will just do this automatically for Cortex A9/NEON. I know the Raspberry Pi does not have NEON so I’m not asking for that. But: will the future hold something similar for the GPU – a gcc option to ‘just do it’? (Or, is using the GPU too complicated, compared to a SIMD unit on the CPU?) Thanks a lot, Daniel

Jim Manley avatar

There have been efforts to attempt to ascertain technical details from patents and reverse-engineer the functionality of the GPU described at—BCM2835-Overview and IIRC, there are 16 parallel 32-bit integer pipelines in the quad processing units (QPU), but the learning curve is very steep for the custom compilers and assemblers used with the GPU components (upwards of 10 years of experience is typically required to become proficient at an expert level), and they’re proprietary (as is the case for all of the integrated GPUs on the market), so they’re not available. I haven’t tracked the progress on reverse-engineering the toolchain, but the page at the second link above contains at least half-a-dozen links to various efforts.

Eben noted above that access to the QPU may happen to some degree, but it probably depends on whether the exhausted engineers, who are already doing full-time paid work, can stay awake long enough to do the unsung volunteer Pi work they’ve been performing flawlessly for coming up on three years. Maybe a crowdfunding effort to stock their cabinets full of cheer would help, although that may make them fall asleep even faster (and they could use it!) :lol:

JamesH avatar

That’s not quite right. There are two 16-way SIMD vector/scalar cores, AND a number of QPUs. The QPUs are, IIRC, scalar processors, but there are a number of them that can work in parallel.

Edwin avatar

The QPUs are SIMD as well.

JamesH avatar

Thanks Edwin, I did wonder whether they were SIMD.

Jim Manley avatar

Hi James – Of course it’s not quite right, as trying to find definitive detailed information on the GPU is a real treasure hunt. It’s like the folks who scour the Internet for illegal copies of music files (or references to them, e.g., torrent files) and then have the search engines erase any evidence of their existence have been hired to rub out any mention of how the GPU does its thing. For example, a search for keywords that include QPU and BCM2835 results only in comments to this blog post mixed in with a ton of links to pages discussing other GPUs. If such info beyond what Herman has published exists, it would be nice to know – not critical and not important enough to make anyone at Broadcom drop what they’re doing to help make more use of the GPU, but nice. We are trying our damnedest to promote the Pi and, in my case, the GPU in particular for educational purposes, but with both hands tied behind my back and a blindfold on, I’m limited to semi-randomly poking the keys with my ample schnoz :lol:


hermanhermitage avatar

Jim, I think the miracles come in phases :). The first miracle is getting a cheap board with a SoC and some lifetime guarantees around the SoC at low volumes. The second miracle was all the lovely patents Broadcom/Alphamosaic had created making reverse-engineering fairly straightforward. The third miracle is that there are now APIs to call VPU (integer) and QPU (VLIW/unified shaders) code blocks. The rest will follow… If they had given out all the docs on day 1, a few of us would have missed the fun of solving a digital crossword :)

gus3 avatar

For what it’s worth:

The VFP unit on the BCM2835 does support SIMD operation, but at the expense of GNU ABI compatibility. Once you enable vector operations, you must be responsible for everything until you disable vector mode.

I have a 3-part write-up about VFP vector operations, starting here. Part 2 has coding examples, and part 3 has the full code base, including a Makefile.

JBeale avatar

By the way, if anyone else wants to try writing their own GPS software, you can get all needed hardware, assembled and shipped, for $19. Already a full GPS but with unique feature: you can run your own code. Google “NavSpark”

ethanol100 avatar

Thanks. Nice to have a really “fast” ft. Can this be used to do fft complex to complex or only real to complex? I would like to use it to construct a 2D fft for image processing(for cross-correlation).

3xBackup's avatar

Use the source Luke, use the source.

hint: ” for (i=0; i<N; i++) base[i].re = base[i].im = 0;"

ethanol100 avatar

Thank you, I will try the source…
But I was not sure, if the fft will evaluate the imaginary part in the forward fft. I will give it a try.

Bruce avatar

Exciting times for Raspberry Pi – unlocking the beast within. Looking forward to more of this :-)

HBE avatar

Excellent, thanks for this!

I’m a member of the Einstein@Home developer team; we already have a Raspberry Pi version of an application that searches for radio pulsars in a volunteer distributed computing project. It relies heavily on FFT (currently FFTW, single precision) and I certainly will try to make some use of this library. We need really large transforms (real-to-complex, length 3*2^22 (!)), but if we can do batches of smaller transforms, that can already help to improve performance.

We also support Android phones and are somewhat frustrated that we don’t get access to their powerful GPUs (OpenCL is in theory supported by some mobile GPUs, but there are no official drivers installed in the field). To see hardware accelerated FFTs on the Raspberry Pi is a pleasant surprise.

Heinz-Bernd Eggenstein

Keith Sloan avatar

Well this has prompted me to take a gamble and buy a Raspberry Pi. I like the idea of having a Pi working on BOINC projects and exploiting the GPU. Just hope that Heinz can update the Einstein at home code to use the FFT library and that the access to QPU also gets opened up. My main Intel computer already found one pulsar back in 2011, let’s hope the Pi can find another.

Keith Sloan avatar

Well I now have my Pi plodding away on Einstein@home work. After 11 hours it’s 30% of the way through a unit. Any speed up by exploiting the GPU would be gratefully received.

Kostis avatar

Wow! That would actually put my headless Pi, currently working as a dlna server, to even better use. I don’t know how effective even the gpu could be for such a task, or if the RAM (first batch, 256MB only) would be enough, but it’s a device that works 24/7, so why waste it? I’m in for it. Keep us posted!

Federico avatar

The code seems interesting; I will take a look at the source. It’s the beginning of a nice time for the Pi.

Mark R avatar

That is incredibly useful, I’m already using my Pi to do sound analysis using FFT in numpy, doing it using the GPU would save a lot of CPU and enable me to use higher quality samples etc.

Jednorozec avatar

When I run
sudo ./hello_fft.bin 12
I get
rel_rms_err = 2.3e-06, usecs = 273, k = 0
I’ve tried a couple of different sd cards both running Raspbian and I’ve changed the amount of GPU memory in raspi-config with no joy.

Andrew Holme avatar

That is the correct output. It is measuring the relative rms error and the runtime. Those are the expected numbers. Have a look at the source.

hermanhermitage avatar

Andrew, are you OK with me uploading a disassembly of the routines to, say, GitHub?

Might be some useful pointers for others, especially now there is a call QPU fragment mailbox routine.

— herman

Gordon avatar


That sounds fine, I’m sure Andrew would have no problem… Andrew not sure if you know of Herman, but he’s the guy who’s been doing so much of the reverse engineering of the QPU stuff…


Andrew Holme avatar

It is OK with me. I just wish we could publish the commented source, which makes very heavy use of assembler macros. And, yes, I have seen Herman’s work.

hermanhermitage avatar

I’ve put a preliminary disassembly at:

It looks like there are some new instructions that I haven’t encountered before in the VG implementation and GLES shaders, so it doesn’t make complete sense yet.

The mnemonics I’m using are courtesy of one of the khronos blobs, so they probably don’t match the official assembler.

I will verify, comment and document it when I get some free time.

dom avatar

That’s what it does. It’s a test program that does the ffts, times it, and measures the error.

No pretty pictures from this app.
It would be quite easy for someone to use this library to produce a demo that plotted a spectrogram of a given audio file.

Jednorozec avatar

I was expecting to get something like that nice table. Misleading advertising :)

marked avatar

Is this computing on the gpu and returning results to the cpu for plotting, or is the gpu plotting the results?

Also where is this being discussed in the forums? I’ve googled but not found anything yet.

Jednorozec avatar

The only thing I’ve found in the forums is a link to here. Would probably be a good thread to discuss this.

cyk avatar

Bitcoins, anybody?

3xBackup's avatar

So SHA256 on the GPU? I personally cannot see how it would be worth it in terms of electricity, financial return on investment, or time spent implementing it, for the number of satoshis it would eventually generate. GPU mining of bitcoins is dead (unless the electricity is free!); FPGAs and now ASICs have taken over. Although maybe if the RPi was going to be used as an encryption node, to encrypt and add a hash/checksum to verify no tampering/data-corruption of data being transferred from A to B, then a few GPU accelerated primitives would be good. But with security there is the whole issue of trust: that there are no backdoors in the blob. And that rabbit hole is a scary one to crawl down (ANGRYMONK, JETPLOW).

cyk avatar

You’re absolutely right, but I’m sure that some people will try it nevertheless.
One of my Raspis is running 24/7 as fileserver, downloader, mediaserver and for home automation, and its GPU does exactly nothing. So why not use it for something useful?

Señor Chemist avatar

Thanks for making this available. I’d like to add that Mathematica, a powerful program available free on Raspberry Pi, has functions for FFT and inverse FFT, plus some wonderful plotting functions. Of course, this is not useful for real-time applications, but still a great tool (and toy).

hcpa avatar

Thank you very much for making this available!

I spent some time yesterday to implement a 2D-FFT for image processing based on gpu_fft.

1. 8bit luminance data to complex
2. horizontal fft (batch)
3. transpose result matrix
4. vertical fft (batch)
5. transpose result back

This sequence has pretty much the same speed as fftwf_plan_dft_r2c_2d/fftwf_execute, around 1000 ms over 1024x1024px.
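The five steps above can be sketched in C as follows. This is a toy version for a 4x4 image and only an illustration of the sequence: a naive 4-point DFT (dft_rows) stands in for the batched gpu_fft call, and all the names are invented for the example rather than taken from hcpa’s code or from the library.

```c
#include <assert.h>

#define SIDE 4

typedef struct { double re, im; } cplx;

/* cos(2*pi*k/4) and sin(2*pi*k/4) for k = 0..3 are exact */
static const double COS4[SIDE] = { 1, 0, -1,  0 };
static const double SIN4[SIDE] = { 0, 1,  0, -1 };

/* Forward 4-point DFT of every row: the stand-in for one batch of FFTs. */
static void dft_rows(cplx m[SIDE][SIDE])
{
    int r, k, n;

    for (r = 0; r < SIDE; r++) {
        cplx out[SIDE];
        for (k = 0; k < SIDE; k++) {
            out[k].re = out[k].im = 0.0;
            for (n = 0; n < SIDE; n++) {
                double c = COS4[(k * n) % SIDE];
                double s = -SIN4[(k * n) % SIDE];
                out[k].re += m[r][n].re * c - m[r][n].im * s;
                out[k].im += m[r][n].re * s + m[r][n].im * c;
            }
        }
        for (k = 0; k < SIDE; k++)
            m[r][k] = out[k];
    }
}

/* In-place transpose of a square matrix. */
static void transpose(cplx m[SIDE][SIDE])
{
    int r, c;

    for (r = 0; r < SIDE; r++)
        for (c = r + 1; c < SIDE; c++) {
            cplx tmp = m[r][c];
            m[r][c] = m[c][r];
            m[c][r] = tmp;
        }
}

/* lum holds SIDE*SIDE 8-bit luminance values in row-major order. */
void fft2d_4x4(const unsigned char *lum, cplx out[SIDE][SIDE])
{
    int r, c;

    assert(lum != 0);
    for (r = 0; r < SIDE; r++)          /* 1. 8-bit luminance to complex */
        for (c = 0; c < SIDE; c++) {
            out[r][c].re = lum[r * SIDE + c];
            out[r][c].im = 0.0;
        }
    dft_rows(out);                      /* 2. horizontal transforms */
    transpose(out);                     /* 3. transpose */
    dft_rows(out);                      /* 4. vertical transforms */
    transpose(out);                     /* 5. transpose back */
}

/* Vertical stripes two pixels apart should put all the energy at
   horizontal frequencies 0 and 2 of vertical frequency 0. Returns 1 if so. */
int stripes_example_ok(void)
{
    const unsigned char lum[SIDE * SIDE] = { 2, 0, 2, 0,  2, 0, 2, 0,
                                             2, 0, 2, 0,  2, 0, 2, 0 };
    cplx out[SIDE][SIDE];
    int r, c;

    fft2d_4x4(lum, out);
    for (r = 0; r < SIDE; r++)
        for (c = 0; c < SIDE; c++) {
            double expected = (r == 0 && (c == 0 || c == 2)) ? 16.0 : 0.0;
            if (out[r][c].re != expected || out[r][c].im != 0.0)
                return 0;
        }
    return 1;
}
```

The two transposes exist only so that both passes can run along rows, which is the layout a batched 1D transform expects. On a real 1024x1024 image they move a lot of memory, so their cost is worth measuring separately from the transforms.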

A 2D-real-to-complex FFT directly on the GPU would be a dream.

Andrew Holme avatar

The fft batches should take around 41us*1024*2 = 84ms. Does transposition/copying take ~ 900ms? The GPU can do transposition in a way that accesses SDRAM efficiently. Might be best to have separate real and imaginary arrays. I will give it some thought.

hcpa avatar

Andrew, thanks a lot. You’re completely right, each of the fats takes around 45ms. Filling the complex matrix is 154ms, transposing the matrix a whopping 672ms.
Wikipedia uses matrix transposition as example of a cache-oblivious algorithm, but I’m not sure how much this will help on the Raspberry Pi.
All the best, HC
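For reference, a simpler relative of the cache-oblivious approach is a tiled (blocked) transpose, which copies one small square at a time so that reads and writes both stay within a limited region of memory. The sketch below shows that variant rather than the recursive algorithm the Wikipedia article describes. The name transpose_blocked and the tile size are assumptions, and whether it helps on the Pi is untested here.

```c
#include <assert.h>
#include <stddef.h>

/* Transpose a rows x cols row-major matrix src into dst (cols x rows),
   one tile x tile block at a time. src and dst must not overlap. */
void transpose_blocked(const float *src, float *dst,
                       size_t rows, size_t cols, size_t tile)
{
    size_t r0, c0, r, c;

    assert(src != dst && tile > 0);
    for (r0 = 0; r0 < rows; r0 += tile)
        for (c0 = 0; c0 < cols; c0 += tile)
            /* copy one tile; the extra bounds handle ragged edges */
            for (r = r0; r < r0 + tile && r < rows; r++)
                for (c = c0; c < c0 + tile && c < cols; c++)
                    dst[c * rows + r] = src[r * cols + c];
}

/* Transposes a 3 x 5 matrix with 2 x 2 tiles and checks the result
   against the expected 5 x 3 layout. Returns 1 on success. */
int transpose_example_ok(void)
{
    const float src[15] = {  0,  1,  2,  3,  4,
                             5,  6,  7,  8,  9,
                            10, 11, 12, 13, 14 };
    const float expected[15] = { 0, 5, 10,  1, 6, 11,  2, 7, 12,
                                 3, 8, 13,  4, 9, 14 };
    float dst[15];
    size_t i;

    transpose_blocked(src, dst, 3, 5, 2);
    for (i = 0; i < 15; i++)
        if (dst[i] != expected[i])
            return 0;
    return 1;
}
```

For a complex matrix the same loop structure applies with a two-float element type. The best tile size depends on the cache and would have to be found by timing.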

hcpa avatar

‘fats’ is ‘ffts’. Stupid autocorrect.

hcpa avatar

Just to be complete, here are the results of some optimization.
Bottom line is, GPU_FFT is beating fftw3f in my application by about 40%. YMMV, of course.
I’m doing a phase correlation, i.e. 2D-FFT for 2 images, a cross power spectrum followed by an inverse 2D-FFT. Goal is to identify the shift between the images. It takes 3400ms with fftw3 to do this on a 1024×1024 pic, 2050ms with GPU_FFT.
I’ve optimized things a bit by taking into account the symmetry of real-to-complex fourier transformations. I guess there are still some microseconds to identify and eliminate.

Fred avatar

My C has rusted over the years. I am struggling to understand how hello_fft.c works? Can someone please give an example of how to transform an array containing say 32 time domain samples, and how to output the frequency domain result? Thanks

hcpa avatar

Just for the general idea, not tested. Assuming your data is in my_data and is not complex.

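struct GPU_FFT *fft;
int i, mb, ret;

mb = mbox_open();
/* 5 = log2(32). Note that the post lists 256 points (log2 = 8) as the shortest supported length, so a short input may need zero-padding to 256 points. */
ret = gpu_fft_prepare( mb, 5, GPU_FFT_FWD, 1, &fft ); /* non-zero ret means failure */

for( i = 0; i < 32; i++ ) fft->in[i].re = my_data[i], fft->in[i].im = 0;

gpu_fft_execute( fft );

Your results will be in fft->out[0..31].re and .im. Call gpu_fft_release( fft ) once you are done with them.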


hcpa avatar

The for-loop got mangled. It should read for( i = 0; i < 32; i++ ) fft->in[i].re = my_data[i], fft->in[i].im = 0;

Fred avatar

As easy as that?! Doh! :) Thanks

Yggdrasil avatar


Does the test program require any modules for the mailbox?
I’ve updated the system with rpi-update; apt-get update; apt-get upgrade;
but get the error message

“Unable to enable V3D. Please check your firmware is up to date”

on startup of (sudo) ./hello_fft.bin 8


dom avatar

rpi-update gets newer firmware than the more stable apt-get.
So you don’t want to run apt-get after rpi-update.

sudo rm /boot/.firmware_revision && sudo rpi-update && sudo reboot
to get back to the latest rpi-update firmware.

Yggdrasil avatar

Thanks dom,

this solves it.

Now, I can search/test if there exists a sincos-Function to get both Values together. ;)

Jesús Cea avatar

Besides publishing an SDK for programming the GPU :-), I would love to see md5/sha1/sha256 offloading to it too… :-)

duckboy81 avatar

Has anyone been able to integrate this into another program with cmake running? I’m stuck trying to call functions from this great solution in my existing cmake’d program. Any help would be greatly appreciated!

Tony Abbey avatar

Fantastic news! I was one of the radio amateurs at the RSGB Convention who spoke to Eben about SDR and he promised this would come. The Raspberry Pi Foundation certainly delivers its promises. And thanks for taking time to reply to many of the posts here. Well done.

Tony Abbey avatar

I was unable to get the example working at first, then realised after reading the file gpu_fft.txt in /opt/vc/src/hello_pi/hello_fft that the first parameter of hello_fft is not optional.
Your last instruction should be:
sudo ./hello_fft.bin 12 (or similar.)


Tony Abbey avatar

I looked at the effect of running the code from the R Pi graphics screen – effect of updating the screen is very noticeable on the execution times:
sudo ./hello_fft.bin 12 10 10
rel_rms_err = 2.3e-06, usecs = 266, k = 0
rel_rms_err = 2.3e-06, usecs = 1076, k = 1
rel_rms_err = 2.3e-06, usecs = 302, k = 2
rel_rms_err = 2.3e-06, usecs = 303, k = 3
rel_rms_err = 2.3e-06, usecs = 535, k = 4
rel_rms_err = 2.3e-06, usecs = 275, k = 5
rel_rms_err = 2.3e-06, usecs = 275, k = 6
rel_rms_err = 2.3e-06, usecs = 1201, k = 7
rel_rms_err = 2.3e-06, usecs = 1051, k = 8
rel_rms_err = 2.3e-06, usecs = 1054, k = 9

Whereas the same run headless from a terminal is:
sudo ./hello_fft.bin 12 10 10
rel_rms_err = 2.3e-06, usecs = 279, k = 0
rel_rms_err = 2.3e-06, usecs = 271, k = 1
rel_rms_err = 2.3e-06, usecs = 271, k = 2
rel_rms_err = 2.3e-06, usecs = 277, k = 3
rel_rms_err = 2.3e-06, usecs = 272, k = 4
rel_rms_err = 2.3e-06, usecs = 271, k = 5
rel_rms_err = 2.3e-06, usecs = 269, k = 6
rel_rms_err = 2.3e-06, usecs = 271, k = 7
rel_rms_err = 2.3e-06, usecs = 277, k = 8
rel_rms_err = 2.3e-06, usecs = 278, k = 9

Now just need to work out how to incorporate the fast FFTs into an SDR program for the FunCubeDongle or similar.
BTW – Eben said once that the USB libraries were being updated to work better with the FCD – did that happen?


dom avatar

There’s a lot of processes running when you launch X. I’m guessing the slow numbers have switched you out.

You may find running with a negative nice (and sudo) or chrt will reduce the disruption.

Tony Abbey avatar

dom suggested turning off the TV screen with tvservice -o
This does indeed speed the code up a few % when running headless:
sudo ./hello_fft.bin 12 10 10
rel_rms_err = 2.3e-06, usecs = 251, k = 0
rel_rms_err = 2.3e-06, usecs = 258, k = 1
rel_rms_err = 2.3e-06, usecs = 249, k = 2
rel_rms_err = 2.3e-06, usecs = 251, k = 3
rel_rms_err = 2.3e-06, usecs = 260, k = 4
rel_rms_err = 2.3e-06, usecs = 256, k = 5
rel_rms_err = 2.3e-06, usecs = 251, k = 6
rel_rms_err = 2.3e-06, usecs = 260, k = 7
rel_rms_err = 2.3e-06, usecs = 257, k = 8
rel_rms_err = 2.3e-06, usecs = 251, k = 9


Andy avatar

Excuse my ignorance, but if I have a Python program running on my Pi that’s already using numpy for FFT, is there a way to take advantage of this library from within Python instead of using the numpy FFT?

Martin OHanlon avatar

I have the same question.

Sway P. avatar

hey guys, i need some help/advice. i am working on a project that involves comparing sounds from a bmw x5 hatch spindle to ones in a database. each spindle sound from the database has a number 1-10 associated with it based on the sound it makes when opening and closing the hatch. i’d love to use RPi and the fft function to listen to a spindle while being run then comparing it to the database then output a number 1-10. can this be done? i’m a mechanical engineering student and would appreciate direction since all this is new to me. i have been able to do something similar in MATLAB but want to see if it can be done with just the RPi, Mic, and display. Thanks!

wirk avatar

Is there any chance the VideoCore SDK will ever be released by Broadcom? Implementing all kinds of signal/image processing algorithms on the GPU would be fantastic, but without a public SDK it is impossible.

Neil avatar


Alpha162 avatar

“Broadcom announced the release of full documentation for the VideoCore IV graphics core, and a complete source release of the graphics stack under a 3-clause BSD license”

Jo_b avatar

Hey guys! I need to compute an FFT with 1 million points (1’048’576) but the library allows “only” 131’072 points. I used the split-radix properties to separate my FFT into smaller FFTs, but it takes too much time to reorganize (compute) it.
Is there a way to compute 1 million points (more than 131’072) with the GPU, for example by changing the library a little bit?


Marcel avatar

[larger FFT size]

I have the same issue. I’d like to use the Raspi for digital room correction. While the FFT size of 2^17 is normally sufficient for the deconvolution filter it is too small for the room measurements. Depending on the reverb and the number of simultaneously sampled channels you need 2^18 or 2^19 samples, sometimes even 2^20.

I had a look at the code. It seems to be straight forward to add larger FFT sizes as the processing is done in stages anyway. Taking the twiddles_128k function as template it should be possible to implement up to 2^21 with 4 passes (base_64 and 3 times step_32).

Unfortunately you are not yet done. You also need the shader code. And this part is entirely undocumented. So I have no idea where to get the shader code for larger FFT sizes.

I would really appreciate getting gpufft working with larger FFTs.

… btw.: I am working on another issue: gpufft requires root privileges to map the GPU memory. So putting it into some libgpufft would require root privileges for any application that uses this library – very bad :-(
The basic idea is to split the functionality of gpufft into a (small) root daemon and a public library with the current API of gpufft. Furthermore the daemon should handle concurrent access to the character device from different applications by a command queue.

Kattemageren avatar

Could someone please give any pointers to how (if possible) to implement this within a python file?

Say for example that I am trying to do fft of a near realtime recording, how would I do that?

Please respond, I cannot find this information anywhere.

Marcel avatar

Patch to create character device for communication automatically

The following replacement for mbox_open in mailbox.c will no longer depend on the existence of the character device to access the firmware part:

// needs <errno.h>, <string.h>, <sys/stat.h> and <sys/sysmacros.h> in addition to the existing includes
int mbox_open() {
   int file_desc;
   int retry = 0;
retry:
   // open a char device file used for communicating with kernel mbox driver
   file_desc = open(DEVICE_FILE_NAME, 0);
   if (file_desc < 0)
   {  if (!retry && errno == ENOENT)
      {  // the device file does not exist yet: create it, then try the open once more
         if (mknod(DEVICE_FILE_NAME, S_IFCHR|S_IRUSR|S_IWUSR, makedev(MAJOR_NUM, 0)) < 0)
         {  fprintf(stderr, "Can't create communication device %s - %s\n", DEVICE_FILE_NAME, strerror(errno));
         }
         retry = 1;
         goto retry;
      }
      fprintf(stderr, "Can't open device file: %s - %s\n", DEVICE_FILE_NAME, strerror(errno));
      fprintf(stderr, "Try creating a device file with: sudo mknod %s c %d 0 or use option -c\n", DEVICE_FILE_NAME, MAJOR_NUM);
   }
   return file_desc;
}

Since the program needs to run as root anyway creating the device is no big deal.
I would also recommend to rename DEVICE_FILE_NAME in mailbox.h to /dev/gpufft.

Peter Onion avatar

Just an example of what you can achieve with gpu_fft when you put it together with some OpenGL ES.

I’ll make a better video later :-)


marco avatar

This is what pleases the gods

Ed Bird avatar

My understanding of this (after looking at the code) is that the main program waits for the GPU to process the data before receiving it again, and then continues executing.

Is it possible for a program to be written with an interrupt which fires when the GPU has finished processing the data? This would allow the CPU to continue executing code while the GPU is left to compute the FFT.

If this is possible, has it been implemented, or is it likely that it will ever be implemented? I would like to do it myself, but don’t know nearly enough about the Broadcom chip or asm programming to attempt it.
