Accelerating Fourier transforms using the GPU

Andrew Holme is well known to regular blog readers, as the creator of the awesome (and fearsomely clever) homemade GPS receiver. Over the last few months he’s been experimenting with writing general purpose code for the VideoCore IV graphics processing unit (GPU) in the BCM2835, the microchip at the heart of the Raspberry Pi, to create an accelerated fast Fourier transform library. Taking the Fourier transform of a function yields its frequency spectrum (i.e. the pure harmonic functions which can be added together to reconstruct the original function). In the following example, shamelessly lifted from Wikipedia, we have a function which oscillates roughly three times per second, and whose Fourier transform unsurprisingly has a peak around 3Hz.

Being able to perform lots of Fourier transforms quickly is useful for all sorts of audio and radio applications including, unsurprisingly, GPS. Ham radio enthusiasts will also find Andrew’s work very useful. In this guest post, Andrew talks about his Fourier transform library for the Pi.

Last October, Eben attended the Radio Society of Great Britain (RSGB) Convention, where radio amateurs told him they wanted a speedy fast Fourier transform (FFT) library to do Software Defined Radio (SDR) projects on the Pi.

GPU_FFT is an FFT library for the Raspberry Pi which exploits the BCM2835 SoC V3D hardware to deliver ten times the performance that is possible on the 700 MHz ARM. Kernels are provided for all power-of-2 FFT lengths from 256 to 131,072 points inclusive.

GPU_FFT uses single-precision floating point for data and twiddle factors, so it does not compete on accuracy with double-precision libraries; however, the relative root-mean-square (rms) error for a 2048-point transform is less than one part per million, which is not bad.

The library runs on dedicated 3D hardware in the BCM2835 SoC, and communication between ARM and GPU adds 100µs of latency which is much longer than the shortest transform takes to compute! To overcome this, batches of transforms can be executed with a single call. Typical per-transform runtimes in microseconds are:

Points	batch=1	batch=10	batch=50	FFTW	Speedup
256	112	22	16	92	5.8x
512	125	37	26	217	8.3x
1024	136	54	45	482	10.7x
2048	180	107	93	952	10.2x
4096	298	256	240	3002	12.5x
8192	689	624	608	5082	8.4x
16384	1274	1167	1131	12005	10.6x
32768	3397	3225	3196	31211	9.8x
65536	6978	6703	6674	82769	12.4x
131072	16734	16110	16171	183731	11.4x
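The Speedup column appears to be the FFTW time divided by the batch=50 time (92 / 16 is 5.75, shown as 5.8x). The sketch below reproduces that arithmetic and adds a very rough model of how batching spreads the fixed ARM-to-GPU latency over several transforms. The model and both function names are assumptions made for illustration; the library does not document them.

```c
#include <assert.h>

/* Speedup for one table row: FFTW time divided by the GPU batch=50 time,
   both in microseconds per transform. */
double speedup(double fftw_usecs, double gpu_batch50_usecs)
{
    assert(gpu_batch50_usecs > 0.0);
    return fftw_usecs / gpu_batch50_usecs;
}

/* Rough per-transform time when a fixed call latency is shared by a batch.
   Assumed model: compute time plus latency / batch. */
double amortized_usecs(double latency_usecs, double compute_usecs, int batch)
{
    assert(batch > 0);
    return compute_usecs + latency_usecs / (double)batch;
}
```

With the 100µs latency from the text and a guessed 12µs of compute, the model gives 112, 22 and 14µs for batches of 1, 10 and 50. The 256-point row shows 112, 22 and 16, so treat it as an approximation only.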

To get GPU_FFT enter the following at the command prompt:

sudo rpi-update && sudo reboot

To build and run the example program:

cd /opt/vc/src/hello_pi/hello_fft
make
sudo mknod char_dev c 100 0
sudo ./hello_fft.bin

API documentation can be found in the hello_fft folder.

87 comments

Thanks! Found it very useful!

are we missing a
sudo apt-get install gpu-fft
in the instructions..

or am i missing something

rpi-update is a script that updates the installed ‘firmware’ to the latest upstream git HEAD. This includes /opt/vc/src where hello_fft lives. A future update to the Debian firmware packages will include hello_fft.

Well, it should be no surprise to anyone here that this is EXTREMELY welcome news! We educators, engineers and scientists hope this is but the first of a number of libraries for using the GPU to accelerate math(s) instead of just graphics. Are you considering any other libraries of this sort? We’ll be looking to see if we can use parts of this library to perform other computations that lend themselves to parallelization – does this use only the floating-point pipelines, or can the integer pipelines also be leveraged for relatively simple calculations (e.g., sorts, matrix manipulations, etc.)?

Thank you, thank you, thank you!!! :D

Hopefully we can provide accelerators for a few more common operations in the future. 1d and 2d convolutions, and FIR, IIR filters are obvious possibilities. We’re still looking to see whether we can make the QPUs available for third parties to write code on.

Opening up the QPU shaders and the GPU cores would be very nice indeed…

Hi Eben, how can a company contribute to add more accelerators? We would like to use the QPUs to decode our ICT encoded video streams. Would be great if 3rd parties could get access to it.

QPU is certainly a good match for this sort of thing. Watch this space – I’m hopeful we can find a way to do this.

That would be welcome.

A 1D convolution is just 2 FFT transformations multiplied together and then the inverse FFT, so in theory this should be good for it.

These timings are with a pi on stock frequency. With my overclock settings, I get about 30% faster results.

Disabling hdmi output (tvservice -o) saves a few % if you are running headless. Reducing resolution or depth with fbset helps a little if you need the display.

sudo ./a.out
should be
sudo ./hello_fft.bin
with the released version of the code.

well I’d like to have more of a handle on this, but a start is great!

anyone have a matrix transform handy, sum or other less demanding task?

had to run sudo make???

thanks again

~:”

Yes, the instructions should read “sudo make”, not “make” if you’re not logged in as root. We also need to provide some pointers (pun fully intended) to the less-experienced (e.g., educators) who may not be familiar with how to access the output via a character device.

As you know, this library is exactly the sort of thing I was discussing in my post the other day about accessing the GPU for doing math(s) things. Timing really is everything!

Blimey!
I didn’t know about that directory! One of Pi’s little ‘easter eggs’.
Now, I’ve dreamt of making a copy of a machine I used to repair – dual-channel audio spectrum analyzer.
It looked like this on the inside…
http://www.amplifier.cd/Test_Equipment/other/660b.htm
(Scroll down for what it looked like on the inside!)
Stuffed full of TTL, NO microprocessor, just bit-slice. There were only about 4 of us in Europe who could fix one.
Pi would be SOOO much faster! Just a couple of A/D’s…Oh, MAMA!!!

I’m kind of confused. What is this supposed to do?

It’s a library for doing fourier transforms very fast.
Only useful if you want to do fourier transforms.

(From http://en.wikipedia.org/wiki/Fourier_analysis):
Fourier analysis has many scientific applications – in physics, partial differential equations, number theory, combinatorics, signal processing, imaging, probability theory, statistics, option pricing, cryptography, numerical analysis, acoustics, oceanography, sonar, optics, diffraction, geometry, protein structure analysis and other areas.

It’s been requested by people doing software defined radio and audio processing.

Fourier Transforms (FTs) are usually used to derive the frequency spectrum from a signal mathematically, but they have many, many more uses than that. If you imagine a sine wave at a single frequency, its Fourier Transform (FT) is a single line on a spectrograph (amplitude on the y axis vs. frequency on the x axis) at the frequency of the sine wave. However, a square wave at the same frequency will contain all of the odd harmonics of the fundamental frequency, with the amplitude of each harmonic inversely proportional to its frequency (see https://en.wikipedia.org/wiki/Square_wave). Triangle waves also contain only the odd harmonics, but their amplitudes drop off in proportion to the inverse square of the harmonic number. More complex signals have correspondingly more complex spectra. It should be noted that a signal can also be derived in the reverse direction, given its spectrum in mathematical terms.

In practice, signals are never pure sine waves, square waves, etc., and quantizing errors (due to sampling of signals at minimum intervals corresponding to the quality of the equipment used) result in a set of data points that approximate the signal’s waveform (amplitude over time). However, FTs can also be approximated by carrying out mathematical computations over discrete data points, which is what the GPU_FFT library is used to do. You provide it a set of signal amplitude data over time, and it spits out a data set that can be plotted to obtain an amplitude vs. frequency spectrograph. A corresponding transform implemented in software could also be used on spectrum data to derive an estimation of the associated signal, plotted as points on an amplitude vs. time graph.

A Fourier Transform is what werewolves experience during a full moon as compared to werepigs.

I thought a Fourier Transform was the transaction French-Canadian trappers made when turning pelts into whiskey and other critical supplies needed while traipsing about in the Great White North :lol:

I believe you are thinking of the fast furrier transform followed by the inverse fast furrier transform.

Spectrograms are probably the easiest way to understand what FFTs can be used to do (in this case hiding an image in music/noise).

The Y axis is frequency in hertz, from 0Hz up to half a CD’s audio sample rate, so 22,050Hz. The X axis is time, and scrolls along in a sideways waterfall diagram.
https://www.youtube.com/watch?feature=player_detailpage&v=M9xMuPWAZW8#t=330

If you could whistle at exactly 11000Hz, record the audio with a microphone to a wave file, and process this file with an FFT (to convert the data from the time domain to the frequency domain) to generate a spectrogram, it would be a straight line at 11kHz.

Spectrogram of Beethoven’s “Sonata Pathétique”
https://www.youtube.com/watch?v=XJ6XXjAOvmY

I am interested in compiler SIMD parallelisation of bitwise operations on arrays of integers. To my pleasant surprise, I found that gcc – with the right command-line options – will just do this automatically for Cortex A9/NEON. I know the Raspberry Pi does not have NEON so I’m not asking for that. But: will the future hold something similar for the GPU – a gcc option to ‘just do it’? (Or, is using the GPU too complicated, compared to a SIMD unit on the CPU?) Thanks a lot, Daniel

There have been efforts to attempt to ascertain technical details from patents and reverse-engineer the functionality of the GPU described at https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV---BCM2835-Overview and https://github.com/hermanhermitage/videocoreiv. IIRC, there are 16 parallel 32-bit integer pipelines in the quad processing units (QPU), but the learning curve is very steep for the custom compilers and assemblers used with the GPU components (upwards of 10 years of experience is typically required to become proficient at an expert level), and they’re proprietary (as is the case for all of the integrated GPUs on the market), so they’re not available. I haven’t tracked the progress on reverse-engineering the toolchain, but the page at the second link above contains at least half-a-dozen links to various efforts.

Eben noted above that access to the QPU may happen to some degree, but it probably depends on whether the exhausted engineers, who already do full-time paid work, can stay awake long enough to do the unsung volunteer Pi work they’ve been performing flawlessly for coming up on three years. Maybe a crowdfunding effort to stock their cabinets full of cheer would help, although that may make them fall asleep even faster (and they could use it!) :lol:

That’s not quite right. There are two 16-way SIMD vector/scalar cores, AND a number of QPUs. The QPUs are, IIRC, scalar processors, but there are a number of them that can work in parallel.

The QPUs are SIMD as well.

Thanks Edwin, I did wonder whether they were SIMD.

Hi James – Of course it’s not quite right, as trying to find definitive detailed information on the GPU is a real treasure hunt. It’s like the folks who scour the Internet for illegal copies of music files (or references to them, e.g., torrent files) and then have the search engines erase any evidence of their existence have been hired to rub out any mention of how the GPU does its thing. For example, a search for keywords that include QPU and BCM2835 results only in comments to this blog post mixed in with a ton of links to pages discussing other GPUs. If such info beyond what Herman has published exists, it would be nice to know – not critical and not important enough to make anyone at Broadcom drop what they’re doing to help make more use of the GPU, but nice. We are trying our damnedest to promote the Pi and, in my case, the GPU in particular for educational purposes, but with both hands tied behind my back and a blindfold on, I’m limited to semi-randomly poking the keys with my ample schnoz :lol:

Thanks!

Jim, I think the miracles come in phases :). The first miracle is getting a cheap board with a SoC and some lifetime guarantees around the SoC at low volumes. The second miracle was all the lovely patents Broadcom/Alphamosaic had created making reverse-engineering fairly straightforward. The third miracle is that there are now APIs to call VPU (integer) and QPU (VLIW/unified shaders) code blocks. The rest will follow… If they had given out all the docs on day 1, a few of us would have missed the fun of solving a digital crossword :)

For what it’s worth:

The VFP unit on the BCM2835 does support SIMD operation, but at the expense of GNU ABI compatibility. Once you enable vector operations, you must be responsible for everything until you disable vector mode.

I have a 3-part write-up about VFP vector operations, starting here. Part 2 has coding examples, and part 3 has the full code base, including a Makefile.

By the way, if anyone else wants to try writing their own GPS software, you can get all needed hardware, assembled and shipped, for $19. Already a full GPS but with unique feature: you can run your own code. Google “NavSpark”

Thanks. Nice to have a really “fast” ft. Can this be used to do fft complex to complex or only real to complex? I would like to use it to construct a 2D fft for image processing(for cross-correlation).

Use the source Luke, use the source.

hint: ” for (i=0; i<N; i++) base[i].re = base[i].im = 0;"

Thank you, I will try the source…
But I was not sure, if the fft will evaluate the imaginary part in the forward fft. I will give it a try.

Exciting times for Raspberry Pi – unlocking the beast within. Looking forward to more of this :-)

Excellent, thanks for this!

I’m a member of the Einstein@Home developer team. We already have a Raspberry Pi version of an application that searches for radio pulsars in a volunteer distributed computing project. It relies heavily on FFT (currently FFTW, single precision) and I certainly will try to make some use of this library. We need really large transforms (real-to-complex, length 3*2^22 (!)), but if we can do batches of smaller transforms, that can already help to improve performance.

We also support Android phones and are somewhat frustrated that we don’t get access to their powerful GPUs (OpenCL is in theory supported by some mobile GPUs, but there are no official drivers installed in the field). To see hardware accelerated FFTs on the Raspberry Pi is a pleasant surprise.

Cheers
Heinz-Bernd Eggenstein

Well this has prompted me to take a gamble and buy a Raspberry Pi. I like the idea of having a Pi working on BOINC projects and exploiting the GPU. Just hope that Heinz can update the Einstein@Home code to use the FFT library and that the access to QPU also gets opened up. My main Intel computer already found one pulsar back in 2011, let’s hope the Pi can find another.

Well I now have my Pi plodding away on Einstein@home work. After 11 hours it’s 30% of the way through a unit. Any speed up by exploiting the GPU would be gratefully received.

Wow! That would actually put my headless Pi, currently working as a dlna server, to even better use. I don’t know how effective even the gpu could be for such a task, or if the RAM (first batch, 256MB only) would be enough, but it’s a device that works 24/7, so why waste it? I’m in for it. Keep us posted!

The code seems interesting, I will have a look at the source code. It’s the beginning of a nice time for the Pi.

That is incredibly useful, I’m already using my Pi to do sound analysis using FFT in numpy, doing it using the GPU would save a lot of CPU and enable me to use higher quality samples etc.

When I run

sudo ./hello_fft.bin 12

I get

rel_rms_err = 2.3e-06, usecs = 273, k = 0

I’ve tried a couple of different sd cards both running Raspbian and I’ve changed the amount of GPU memory in raspi-config with no joy.

That is the correct output. It is measuring the relative rms error and the runtime. Those are the expected numbers. Have a look at the source.

Andrew, are you ok with me uploading a disassembly of the routines to, say, github?

Might be some useful pointers for others, especially now there is a call QPU fragment mailbox routine.

— herman

Herman,

That sounds fine, I’m sure Andrew would have no problem… Andrew not sure if you know of Herman, but he’s the guy who’s been doing so much of the reverse engineering of the QPU stuff…

Gordon

It is OK with me. I just wish we could publish the commented source, which makes very heavy use of assembler macros. And, yes, I have seen Herman’s work.

I’ve put a preliminary disassembly at:
http://www.freelists.org/post/raspi-internals/GPU-FFT-Disassembly

It looks like there are some new instructions that I haven’t encountered before in the VG implementation and GLES shaders, so it doesn’t make complete sense yet.

The mnemonics I’m using are courtesy of one of the khronos blobs, so they probably don’t match the official assembler.

I will verify, comment and document it when I get some free time.

That’s what it does. It’s a test program that does the ffts, times it, and measures the error.

No pretty pictures from this app.
It would be quite easy for someone to use this library to produce a demo that plotted a spectrogram of a given audio file.

I was expecting to get something like that nice table. Misleading advertising :)

Is this computing on the gpu and returning results to the cpu for plotting, or is the gpu plotting the results?

Also where is this being discussed in the forums? I’ve googled but not found anything yet.

The only thing I’ve found in the forums is a link to here. Would probably be a good thread to discuss this.

Bitcoins, anybody?

So SHA256 on the GPU: I personally cannot see how it would be worth it in terms of electricity, finance (return on investment) or time spent implementing it for the number of satoshis it would eventually generate. GPU mining of bitcoins is dead (unless the electricity is free!); FPGAs and now ASICs have taken over. Although maybe if the RPi was going to be used as an encryption node, to encrypt and add a hash/checksum to verify no tampering/data-corruption of data being transferred from A to B, then a few GPU accelerated primitives would be good. But with security there is the whole issue of trust, that there are no backdoors in the blob. And that rabbit hole is a scary one to crawl down (ANGRYMONK, JETPLOW – https://en.wikipedia.org/wiki/NSA_ANT_catalog or http://leaksource.wordpress.com/2013/12/30/nsas-ant-division-catalog-of-exploits-for-nearly-every-major-software-hardware-firmware/ ).

You’re absolutely right, but I’m sure that some people will try it nevertheless.
One of my Raspis is running 24/7 as fileserver, downloader, mediaserver and for home automation, and its GPU does exactly nothing. So why not use it for something useful?

Thanks for making this available. I’d like to add that Mathematica, a powerful program available free on Raspberry Pi, has functions for FFT and inverse FFT, plus some wonderful plotting functions. Of course, this is not useful for real-time applications, but still a great tool (and toy).

Thank you very much for making this available!

I spent some time yesterday to implement a 2D-FFT for image processing based on gpu_fft.

Sequence
1. 8bit luminance data to complex
2. horizontal fft (batch)
3. transpose result matrix
4. vertical fft (batch)
5. transpose result back

This sequence has pretty much the same speed as fftwf_plan_dft_r2c_2d/fftwf_execute, around 1000 ms over 1024x1024px.

A 2D-real-to-complex FFT directly on the GPU would be a dream.

The fft batches should take around 41us*1024*2 = 84ms. Does transposition/copying take ~ 900ms? The GPU can do transposition in a way that accesses SDRAM efficiently. Might be best to have separate real and imaginary arrays. I will give it some thought.

Andrew, thanks a lot. You’re completely right, each of the fats takes around 45ms. Filling the complex matrix is 154ms, transposing the matrix a whopping 672ms.
Wikipedia uses matrix transposition as example of a cache-oblivious algorithm, but I’m not sure how much this will help on the Raspberry Pi.
All the best, HC
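For anyone curious about the cache-oblivious transposition mentioned above, here is a minimal out-of-place sketch in C. It keeps halving the longer dimension until the tile is small, then copies that tile directly. The 16-element cut-off and the function names are arbitrary choices for this illustration, and whether the approach actually helps on the Pi's memory system is untested here.

```c
#include <assert.h>
#include <stddef.h>

/* Transpose the tile [r0,r1) x [c0,c1) of src (rows x cols, row-major)
   into dst (cols x rows, row-major), splitting the longer side in half
   until the tile is at most 16 x 16. */
static void transpose_rec(const float *src, float *dst,
                          size_t r0, size_t r1, size_t c0, size_t c1,
                          size_t rows, size_t cols)
{
    size_t dr = r1 - r0, dc = c1 - c0;
    if (dr <= 16 && dc <= 16) {
        for (size_t r = r0; r < r1; r++)
            for (size_t c = c0; c < c1; c++)
                dst[c * rows + r] = src[r * cols + c];
    } else if (dr >= dc) {
        size_t mid = r0 + dr / 2;
        transpose_rec(src, dst, r0, mid, c0, c1, rows, cols);
        transpose_rec(src, dst, mid, r1, c0, c1, rows, cols);
    } else {
        size_t mid = c0 + dc / 2;
        transpose_rec(src, dst, r0, r1, c0, mid, rows, cols);
        transpose_rec(src, dst, r0, r1, mid, c1, rows, cols);
    }
}

/* Out-of-place transpose; src and dst must not overlap. */
void transpose_oblivious(const float *src, float *dst, size_t rows, size_t cols)
{
    assert(src != dst);
    transpose_rec(src, dst, 0, rows, 0, cols, rows, cols);
}
```

The recursion never needs to know the cache size, which is the point of the technique: at some depth the tiles fit in whatever cache is present.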

‘fats’ is ‘ffts’. Stupid autocorrect.

Just to be complete, here are the results of some optimization.
Bottom line is, GPU_FFT is beating fftw3f in my application by about 40%. YMMV, of course.
I’m doing a phase correlation, i.e. 2D-FFT for 2 images, a cross power spectrum followed by an inverse 2D-FFT. Goal is to identify the shift between the images. It takes 3400ms with fftw3 to do this on a 1024×1024 pic, 2050ms with GPU_FFT.
I’ve optimized things a bit by taking into account the symmetry of real-to-complex fourier transformations. I guess there are still some microseconds to identify and eliminate.

My C has rusted over the years. I am struggling to understand how hello_fft.c works? Can someone please give an example of how to transform an array containing say 32 time domain samples, and how to output the frequency domain result? Thanks

Just for the general idea, not tested. Assuming your data is in my_data and is not complex. Note that the smallest supported transform is 256 points (log2_N = 8), so 32 samples have to be zero-padded.

struct GPU_FFT *fft;
int i, mb, ret;

mb = mbox_open();
ret = gpu_fft_prepare( mb, 8, GPU_FFT_FWD, 1, &fft ); /* 2^8 = 256 points, 1 job; ret is 0 on success */
for( i = 0; i < 256; i++ ) { /* copy the 32 real samples, zero the rest */
    fft->in[i].re = (i < 32) ? my_data[i] : 0;
    fft->in[i].im = 0;
}
gpu_fft_execute( fft );

Your results will be in fft->out[0..255].re and .im

HTH

(The for-loop got mangled when first posted; it is written out in full above.)

As easy as that?! Doh! :) Thanks

Hello,

Does the test program require any modules for the mailbox?
I’ve updated the system with rpi-update; apt-get update; apt-get upgrade;
but get the error message

“Unable to enable V3D. Please check your firmware is up to date”

on startup of (sudo) ./hello_fft.bin 8

Regards
Yggdrasil

rpi-update gets newer firmware than the more stable apt-get.
So you don’t want to run apt-get after rpi-update.

Run
sudo rm /boot/.firmware_revision && sudo rpi-update && sudo reboot
to get back to the latest rpi-update firmware.

Thanks dom,

this solves it.

Now, I can search/test if there exists a sincos-Function to get both Values together. ;)

Besides publishing an SDK for programming the GPU :-), I would love to see md5/sha1/sha256 offloading to it too… :-)

Has anyone been able to integrate this into another program with cmake running? I’m stuck trying to call functions from this great solution in my existing cmake’d program. Any help would be greatly appreciated!

Fantastic news! I was one of the radio amateurs at the RSGB Convention who spoke to Eben about SDR and he promised this would come. The Raspberry Pi Foundation certainly delivers its promises. And thanks for taking time to reply to many of the posts here. Well done.
Tony

I was unable to get the example working at first, then realised after reading the file gpu_fft.txt in /opt/vc/src/hello_pi/hello_fft that the first parameter of hello_fft is not optional.
Your last instruction should be:
sudo ./hello_fft.bin 12 (or similar.)

Tony

I looked at the effect of running the code from the R Pi graphics screen – effect of updating the screen is very noticeable on the execution times:
sudo ./hello_fft.bin 12 10 10
rel_rms_err = 2.3e-06, usecs = 266, k = 0
rel_rms_err = 2.3e-06, usecs = 1076, k = 1
rel_rms_err = 2.3e-06, usecs = 302, k = 2
rel_rms_err = 2.3e-06, usecs = 303, k = 3
rel_rms_err = 2.3e-06, usecs = 535, k = 4
rel_rms_err = 2.3e-06, usecs = 275, k = 5
rel_rms_err = 2.3e-06, usecs = 275, k = 6
rel_rms_err = 2.3e-06, usecs = 1201, k = 7
rel_rms_err = 2.3e-06, usecs = 1051, k = 8
rel_rms_err = 2.3e-06, usecs = 1054, k = 9

Whereas the same run headless from a terminal is:
sudo ./hello_fft.bin 12 10 10
rel_rms_err = 2.3e-06, usecs = 279, k = 0
rel_rms_err = 2.3e-06, usecs = 271, k = 1
rel_rms_err = 2.3e-06, usecs = 271, k = 2
rel_rms_err = 2.3e-06, usecs = 277, k = 3
rel_rms_err = 2.3e-06, usecs = 272, k = 4
rel_rms_err = 2.3e-06, usecs = 271, k = 5
rel_rms_err = 2.3e-06, usecs = 269, k = 6
rel_rms_err = 2.3e-06, usecs = 271, k = 7
rel_rms_err = 2.3e-06, usecs = 277, k = 8
rel_rms_err = 2.3e-06, usecs = 278, k = 9

Now just need to work out how to incorporate the fast FFTs into an SDR program for the FunCubeDongle or similar.
BTW – Eben said once that the USB libraries were being updated to work better with the FCD – did that happen?

Tony

There’s a lot of processes running when you launch X. I’m guessing the slow numbers have switched you out.

You may find running with a negative nice (and sudo) or chrt will reduce the disruption.

dom suggested turning off the TV screen with tvservice -o
This does indeed speed the code up a few % when running headless:
sudo ./hello_fft.bin 12 10 10
rel_rms_err = 2.3e-06, usecs = 251, k = 0
rel_rms_err = 2.3e-06, usecs = 258, k = 1
rel_rms_err = 2.3e-06, usecs = 249, k = 2
rel_rms_err = 2.3e-06, usecs = 251, k = 3
rel_rms_err = 2.3e-06, usecs = 260, k = 4
rel_rms_err = 2.3e-06, usecs = 256, k = 5
rel_rms_err = 2.3e-06, usecs = 251, k = 6
rel_rms_err = 2.3e-06, usecs = 260, k = 7
rel_rms_err = 2.3e-06, usecs = 257, k = 8
rel_rms_err = 2.3e-06, usecs = 251, k = 9

Tony

Excuse my ignorance, but if I have a Python program running on my Pi that’s already using numpy for FFT, is there a way to take advantage of this library from within Python instead of using the numpy FFT?

I have the same question.

hey guys, i need some help/advice. i am working on a project that involves comparing sounds from a bmw x5 hatch spindle to ones in a database. each spindle sound from the database has a number 1-10 associated with it based on the sound it makes when opening and closing the hatch. i’d love to use RPi and the fft function to listen to a spindle while being run then comparing it to the database then output a number 1-10. can this be done? i’m a mechanical engineering student and would appreciate direction since all this is new to me. i have been able to do something similar in MATLAB but want to see if it can be done with just the RPi, Mic, and display. Thanks!

Is there any chance the Videocore SDK will ever be released by Broadcom? Implementation of all kinds of signal/image processing algorithms using the GPU would be fantastic, but without a public SDK it is impossible.

No.

“Broadcom announced the release of full documentation for the VideoCore IV graphics core, and a complete source release of the graphics stack under a 3-clause BSD license”

http://www.raspberrypi.org/archives/6299

Hey guys! I need to compute an FFT with 1 million points (1’048’576) but the library allows “only” 131’072 points. I used the split radix properties to separate my FFT into smaller FFTs but it takes too much time to reorganize (compute) it.
Is there a way to compute 1 million points (more than 131’072) with the GPU, for example by changing the library a little bit?

Thanks!

[larger FFT size]

I have the same issue. I’d like to use the Raspi for digital room correction. While the FFT size of 2^17 is normally sufficient for the deconvolution filter it is too small for the room measurements. Depending on the reverb and the number of simultaneously sampled channels you need 2^18 or 2^19 samples, sometimes even 2^20.

I had a look at the code. It seems to be straight forward to add larger FFT sizes as the processing is done in stages anyway. Taking the twiddles_128k function as template it should be possible to implement up to 2^21 with 4 passes (base_64 and 3 times step_32).

Unfortunately you are not yet done. You also need the shader code. And this part is entirely undocumented. So I have no idea where to get the shader code for larger FFT sizes.

I would really appreciate getting gpufft working with larger FFTs.

… btw.: I am working on another issue: gpufft requires root privileges to map the GPU memory. So putting it into some libgpufft would require root privileges for any application that uses this library – very bad :-(
The basic idea is to split the functionality of gpufft into a (small) root daemon and a public library with the current API of gpufft. Furthermore the daemon should handle concurrent access to the character device from different applications by a command queue.

Could someone please give any pointers to how (if possible) to implement this within a python file?

Say for example that I am trying to do fft of a near realtime recording, how would I do that?

Please respond, I cannot find this information anywhere.

Patch to create character device for communication automatically

The following replacement for mbox_open in mailbox.c will no longer depend on the existence of the character device to access the firmware part:

/* May need <errno.h>, <string.h>, <sys/stat.h> and the header that provides makedev(), if mailbox.c does not already include them. */
int mbox_open() {
    int file_desc;
    int retry = 0;

    // open a char device file used for communicating with kernel mbox driver
retry:
    file_desc = open(DEVICE_FILE_NAME, 0);
    if (file_desc < 0) {
        if (!retry && errno == ENOENT) {
            // device file is missing: create it once, then try again
            if (mknod(DEVICE_FILE_NAME, S_IFCHR|S_IRUSR|S_IWUSR, makedev(MAJOR_NUM, 0)) < 0) {
                fprintf(stderr, "Can't create communication device %s - %s\n", DEVICE_FILE_NAME, strerror(errno));
                exit(-1);
            }
            retry = 1;
            goto retry;
        }
        fprintf(stderr, "Can't open device file: %s - %s\n", DEVICE_FILE_NAME, strerror(errno));
        fprintf(stderr, "Try creating a device file with: sudo mknod %s c %d 0 or use option -c\n", DEVICE_FILE_NAME, MAJOR_NUM);
        exit(-1);
    }
    return file_desc;
}

Since the program needs to run as root anyway creating the device is no big deal.
I would also recommend to rename DEVICE_FILE_NAME in mailbox.h to /dev/gpufft.

Just an example of what you can achieve with the gpu_fft when you put it together with some openGL ES.

http://www.peteronion.org.uk/PiPics/pipan2.mpeg

I’ll make a better video later :-)

PeterO
G0DZB

This is what pleases the gods

My understanding of this (after looking at the code) is that the main program waits for the GPU to process the data before receiving it again and then continues executing.

Is it possible for a program to be written with an interrupt which fires when the GPU has finished processing the data? This would allow the CPU to continue executing code while the GPU is left to compute the FFT.

If this is possible, has it been implemented, or is it likely that it will ever be implemented? I would like to do it myself, but don’t know nearly enough about the Broadcom chip or asm programming to attempt it.

Comments are closed

87 comments

coppermine

chic thomson

Jim Manley

eben

Martin

Swen

eben

Sengan Baring-Gould

Chris

dom

dom

Jim Manley

Jonathan Morris

dom

Jim Manley

clive

Jim Manley

Bill

3xBackup’s

Jim Manley

JamesH

Edwin

JamesH

Jim Manley

hermanhermitage

ethanol100

3xBackup’s

ethanol100

HBE

Keith Sloan

Keith Sloan

Kostis

Mark R

hermanhermitage

Gordon

hermanhermitage

dom

marked

cyk

3xBackup’s

cyk

Señor Chemist

hcpa

hcpa

hcpa

hcpa

Fred

hcpa

hcpa

Fred

Yggdrasil

dom

Yggdrasil

duckboy81

Tony Abbey

Tony Abbey

Tony Abbey

dom

Tony Abbey

Andy

Sway P.

wirk

Neil

Alpha162

Jo_b

Marcel

Marcel

Peter Onion