Implementation study of FFT on multi-lane vector processors
Paper in proceedings, 2012
In this paper we extend a custom FFT vector architecture with multi-lane capabilities and study its hardware implementation. We use the six-step algorithm to segment a long Fourier transform of size N = Z x L into L smaller transforms of size Z. We split the data into pairs of vector registers (one for the real and one for the imaginary part), each containing Z elements. A vector register pair together with its corresponding functional unit forms a single lane, which is replicated L times. Each smaller transform proceeds iteratively, but all L of them are computed in parallel. The shorter FFT transforms along the X dimension are computed using previously proposed vector permutations, while the transforms along the Y dimension are performed using a simple butterfly network that handles inter-lane communication. All data patterns required by the FFT computation are generated implicitly in hardware by a simple control unit. No data transposition is required, and the twiddle factors are stored locally inside the functional units. We validated our design through simulation and ASIC synthesis targeting 90 nm CMOS technology. We compare three possible configurations for computing a 256-point FFT, all running at 217 MHz, with Z x L equal to: a) 32 x 8, b) 16 x 16 and c) 8 x 32. Configuration a) is the smallest and the slowest; configuration b) requires 1.43 times fewer cycles but 1.64 times more area, while configuration c) requires 1.76 times fewer cycles and is 3.16 times larger. Unlike other high-performance FFT implementations, our design makes it possible to trade off execution speed against two resource types: vector register size and number of lanes. Another important contribution is the ability to execute a 2D FFT without any hardware modifications or special provisions. © 2012 IEEE.
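
The decomposition of a size N = Z x L transform into L per-lane transforms of size Z followed by cross-lane transforms can be illustrated with a minimal software sketch. This is only a mathematical model under stated assumptions, not the hardware design described in the abstract: the function names `dft` and `split_fft` are hypothetical, a naive DFT stands in for both the vector permutations and the inter-lane butterfly network, and the assignment of samples to lanes (lane l holds every L-th sample starting at l) is one possible choice that the abstract does not specify.

```python
import cmath


def dft(values):
    """Naive O(n^2) DFT, standing in for one short transform (assumption)."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def split_fft(x, Z, L):
    """Length Z*L DFT built from L transforms of size Z and Z transforms of size L."""
    N = Z * L
    assert len(x) == N
    # Assumed data layout: lane l holds the Z samples x[l], x[l + L], x[l + 2L], ...
    lanes = [dft(x[l::L]) for l in range(L)]
    # Twiddle multiplication: element k2 of lane l is scaled by exp(-2*pi*i*l*k2/N).
    for l in range(L):
        for k2 in range(Z):
            lanes[l][k2] *= cmath.exp(-2j * cmath.pi * l * k2 / N)
    # Cross-lane transforms of size L, one per element position k2.
    out = [0j] * N
    for k2 in range(Z):
        column = dft([lanes[l][k2] for l in range(L)])
        for k1 in range(L):
            # Output bin Z*k1 + k2 comes from position k2, cross-lane bin k1.
            out[Z * k1 + k2] = column[k1]
    return out
```

Calling `split_fft(x, 32, 8)`, `split_fft(x, 16, 16)` or `split_fft(x, 8, 32)` on the same 256-element input should give the same spectrum as `dft(x)` up to rounding, which mirrors the three Z x L configurations compared above: the result is identical, and only the split between per-lane work and cross-lane work changes.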