Multiplier-free Implementation of Galois Field Fourier Transform on a FPGA
A novel approach to implementing Galois Field Fourier Transform (GFT) is proposed that completely eliminates the need for any finite field multipliers by transforming the symbols from a vector representation to a power representation. The proposed method is suitable for implementing GFTs of prime and nonprime lengths on modern FPGAs that have a large amount of on-chip distributed embedded memory. For GFT of length 255 that is widely used in many applications, the proposed memory based implementation exhibits 25% improvement in latency, 27% improvement in throughput, and 56% reduction in power consumption compared to a finite field multiplier based implementation.
Fourier Transform over a Galois (finite) field (GFT) and its inverse (IGFT) are some of the most computationally demanding tasks in the implementation of Bose-ChaudhuriHocquenghem (BCH) and Reed-Solomon (RS) codes. Finite field multiplication is the bottleneck in implementing the GFT, so several techniques have been proposed in research literature to reduce the number of multiplications required. A substitution and pre-computation based technique is proposed to compute GFT for prime lengths greater than 2 over GF(2m) for arbitrary m that saves about one-quarter of the multiplications compared to a brute-force implementation. There have been research efforts that use a FFT style implementation of GFT to reduce the number of multiplications of researchers have proposed methods that can reduce the number of multiplications from n 2 to 1 4 n(log2 n) 2 over GF(p m) where p is a prime value and n log2 n for GF(22r ), for an arbitrary value of r, and in  the authors improve this further to O(n(log2 (n))log3/2 2 ) using the cyclotomic method. In  authors present the hardware design and implementation of cyclotomic Fast Fourier Transform (CFFT) over GF(2m) by reformulating the method presented. Though the architecture has some advantages because some of the results of the computation are reused instead of computing them again, the number of finite field multipliers.
In this paper we propose a technique to implement GFT that does not use any finite field multipliers. The main insight of our work is that by transforming the GFT computation from a vector representation to a power representation, we can replace multiplication by wrap around carry addition. The challenge is to do the conversion from the vector to power and back efficiently. We proposed to use the extremely large on-chip embedded memory available in modern FPGAs as ROM (readonly memory) to store the pre computed conversion tables. On a Virtex 7 there is about 37 Mb of embedded memory (called Block RAM). This is sufficient to create the ROMs for GFT computation on finite field up to a length of 1023, which meets the requirements of many emerging applications. For example, in the new iterative soft-decision decoding of RS codes – the application that motivates the work presented in this paper – GFT of length 127 is required. Recent FPGAs such as Stratix 10 TX2800 from Altera has about 229Mb of memory with over 11721 M20K blocks. With more memory, the proposed scheme can be easily extended to larger field sizes if necessary. At the end of the paper we also propose a simple technique to reduce the memory requirements for prime length implementation by taking advantage of the cyclic property of multiplications over Galois Field.
In an FPGA the embedded memory blocks can be configured as extremely wide word ROMs. So, it allows a highly parallel vector processing style implementation, which results in extremely low latency and very high throughput. Note that this is only possible because the embedded memory on a FPGA is very flexible and can be configured with extremely large word size. For example, when n is 1023, the proposed architecture accesses 1023 10-bit words (i.e. 10230 bits) in parallel each clock cycle (227 MHz) which represents an aggregate memory bandwidth of 2.3 Tb/s. This allows the computation of GFT of length 1023 in 1027 clock cycles with a throughput of 2.5 Gbps, using about 52% of the on-chip memory resources on the FPGA.
- More number of multiplications
- Finite filed multiplier will take more computations
- Not having carry additions
We propose two architectures – Serial In Parallel Out (SIPO) and Parallel In Serial Out Architecture (PISO) which differ in how the inputs arrive and how outputs are produced. In SIPO architecture (see Fig. 1) inputs a0 to an−1 come in serially. Power ROM converts vector representation to power representation. The depth of this ROM is n + 1 and the width is m = log2 (n + 1) bits. So, the size of the Power ROM is (n + 1) × log2 (n + 1). Beta ROM consists of powers that are needed to be added with the inputs. The size of Beta ROM is n + 1 × (log2 (n + 1) × n).
Figure 1 : Serial In Parallel Out (SIPO) Architecture
In both the architectures, the values of Beta ROM are pre-computed. The adder unit performs wrap-around carry addition where the carry is added back to get the final result. Then the Vector ROMs are used to convert from power to vector representation. The size of each Vector ROM is (n + 1) × log2(n + 1). If dual-port memory is used the number of Vector ROMs needed is b n 2 c. The output of the Vector ROM goes to the multiplexers. The Multiplexer selects outputs zero if the input is zero or selects the output of Vector ROM otherwise. These are accumulated together by Galois field addition i.e. using XOR gates. In this architecture once ak is received, ak ∗ β z is calculated in parallel for all z values according to k th column of the square matrix in equation (2). After all the inputs have arrived, the outputs b0 to bn−1 are available simultaneously.
Figure 2 : Parallel In Serial Out ( PIPO) Architecture
In PISO architecture shown in Fig. 2, inputs a0 to an−1 are assumed to be available in parallel, so we start computing the outputs from b0 to bn−1 one by one (serially). PowerROM converts vector representation to power representation. The depth of this ROM is n + 1 and the width is log2 (n + 1) bits. The main difference between SIPO and PISO is that we replicate multiple copies of this PowerROM to facilitate parallel look-up. With dual-port ROMs, the number of PowerROMs needed is b n 2 c. BetaROM consists of powers that are needed to be added with the inputs. The size of BetaROM is (n + 1) × (log2(n + 1) × n). The adder unit is wraparound carry addition where the carry is again added with the result to get the final result. Then the VectorROMs are used to convert from power to vector representation. The size of each VectorROM is n+1×log2 (n+1). The number of VectorROMs needed are b n 2 c. These outputs are XOR-ed together (as in the previous case) to get the final output. The control unit for this architecture is very simple as it just generates the address for the BetaROM.
- Less number of multiplications
- Finite filed multiplier will take less computations
- carry additions will added
 W. Gappmair, “An efficient prime-length dft algorithm over finite fields gf (2m),” Transactions on Emerging Telecommunications Technologies, vol. 14, no. 2, pp. 171–176, 2003.