Fast Neural Network Training on FPGA Using Quasi-Newton Optimization Method
In this brief, a customized and pipelined hardware implementation of the quasi-Newton (QN) method on a field-programmable gate array (FPGA) is proposed for fast onsite training of artificial neural networks, targeting embedded applications. The architecture is scalable to cope with different neural network sizes while supporting batch-mode training. Experimental results demonstrate the superior performance and power efficiency of the proposed implementation over CPU, graphics processing unit, and previous FPGA QN implementations.
- Tool used: Xilinx 14.2
Field-programmable gate arrays (FPGAs) have been considered a promising alternative for implementing artificial neural networks (ANNs) in high-performance ANN applications because of their massive parallelism, high reconfigurability (compared with application-specific integrated circuits), and better energy efficiency [compared with graphics processing units (GPUs)]. However, the majority of existing FPGA-based ANN implementations have been static designs for specific offline applications, without learning capability. While achieving high performance, static hardware implementations of ANNs suffer from low adaptability across applications. Onsite training gives ANN implementations the flexibility to be trained dynamically.
Currently, high-performance CPUs and GPUs are widely used for offline training, but they are not suitable for onsite training, especially in embedded applications. Implementing onsite training in hardware is complex for the following reasons. First, some software features, such as floating-point number representations or advanced training techniques, are impractical or expensive to implement in hardware. Second, batch-mode training increases the design complexity of the hardware. In batch training, the ANN's weight update is based on the training error over all samples in the training data set, which enables good convergence and effectively prevents individual sample perturbations from disturbing the training. However, implementing batch training on an FPGA platform requires large data buffers and fast computation of intermediate weights.
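To make the batch/online distinction concrete, the minimal NumPy sketch below (all names hypothetical, not from the paper) contrasts per-sample updates with a batch-mode update that accumulates the error gradient over the whole training set before modifying the weights:

```python
import numpy as np

def online_update(w, samples, targets, grad_fn, lr=0.01):
    """Per-sample (online) training: weights change after every sample."""
    for x, t in zip(samples, targets):
        w = w - lr * grad_fn(w, x, t)
    return w

def batch_update(w, samples, targets, grad_fn, lr=0.01):
    """Batch-mode training: one update based on the error over ALL samples.
    On an FPGA this is what forces large buffering of intermediate results."""
    g = np.zeros_like(w)
    for x, t in zip(samples, targets):
        g += grad_fn(w, x, t)           # accumulate gradient over the data set
    return w - lr * (g / len(samples))  # single update from the averaged gradient
```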
Existing FPGA implementations of ANNs with online training used the backpropagation (BP) learning algorithm with non-batch training. Although widely used, the BP algorithm converges slowly, since it is essentially a steepest-descent method. Several advanced training algorithms, such as conjugate gradient, quasi-Newton (QN), and Levenberg–Marquardt, have been proposed to speed up convergence and reduce training time, at the cost of memory resources. This brief provides a feasible solution to the challenges of implementing onsite training in hardware by implementing the QN training algorithm with batch-mode training support on the latest FPGA. Our previous work implemented the Davidon–Fletcher–Powell QN (DFP-QN) algorithm on an FPGA and achieved a 17-times speedup over the software implementation. To further improve performance, this brief proposes a hardware implementation of the Broyden–Fletcher–Goldfarb–Shanno QN (BFGS-QN) algorithm on an FPGA for ANN training.
To the best of our knowledge, this is the first reported FPGA hardware accelerator of the BFGS-QN algorithm in a deeply pipelined fashion. The performance of the implementation is analyzed quantitatively. The architecture is designed with low hardware cost in mind, for scalable performance, and for both onsite and offline training. The designed accelerator is applied to ANN training and is able to cope with different network sizes. Experimental results show that the proposed accelerator achieves a performance improvement of up to 105 times over the CPU software implementation across various neural network sizes. The hardware implementation is also applied to two real-world applications, showing superior performance.
Disadvantages of existing implementations:
- Performance is not efficient
- Power efficiency is lower
FPGA Hardware Implementation
Because the computations of B, λ, and g are the three most computationally intensive parts of the BFGS-QN algorithm, we mainly describe how these three parts are designed.
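As context for these blocks, the following sketch gives one plausible software view of the BFGS-QN iteration that the hardware pipelines (Algorithm 1 itself is not reproduced in this excerpt; helper names and the B update form are standard BFGS, used here as assumptions): compute the gradient g, form the search direction from the inverse-Hessian approximation B, find the step size λ by line search, update the weights, and update B.

```python
import numpy as np

def bfgs_qn_train(w, grad_fn, line_search, max_iter=100, tol=1e-6):
    """Sketch of a BFGS quasi-Newton training loop (software reference view).

    w           -- flattened ANN weight vector (length n)
    grad_fn     -- returns the gradient g of the training error at w
    line_search -- returns step size lam minimizing E_T(w + lam * d)
    """
    n = w.size
    B = np.eye(n)                      # inverse-Hessian approximation, B0 = I
    g = grad_fn(w)
    for _ in range(max_iter):
        d = -B @ g                     # search direction (an MVM in hardware)
        lam = line_search(w, d)        # step size via exact line search
        w_new = w + lam * d
        g_new = grad_fn(w_new)
        if np.linalg.norm(g_new) < tol:
            return w_new
        delta, gamma = w_new - w, g_new - g
        # BFGS update of B (the "BC block" in hardware, step 5 of Algorithm 1)
        dg = delta @ gamma
        Bg = B @ gamma
        B = (B + (1.0 + (gamma @ Bg) / dg) * np.outer(delta, delta) / dg
               - (np.outer(Bg, delta) + np.outer(delta, Bg)) / dg)
        w, g = w_new, g_new
    return w
```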
B Matrix Computation
The B computation (BC) block derives an n × n matrix, where n is the total number of weights, according to step 5 of Algorithm 1. The most computationally intensive operations are matrix-by-vector multiplications (MVMs). For scalable performance and a modularized architecture, each MVM is implemented as multiple vector-by-vector multiplications (VVMs). Three on-chip RAM units, each storing n words, hold the intermediate results that are repeatedly used in subsequent computations.
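A minimal illustration of the MVM-as-multiple-VVMs decomposition (function names are ours, not the paper's): each matrix row is reduced against the vector by the same dot-product unit, which maps naturally to one pipelined hardware module reused n times.

```python
import numpy as np

def vvm(a, b):
    """Vector-by-vector multiplication (dot product) -- the basic pipelined unit."""
    acc = 0.0
    for ai, bi in zip(a, b):
        acc += ai * bi   # in hardware: pipelined FP multiplier feeding an accumulator
    return acc

def mvm(M, v):
    """Matrix-by-vector multiplication realized as n reuses of the single VVM unit.
    Row results would sit in on-chip RAM for reuse in later computations."""
    return np.array([vvm(row, v) for row in M])
```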
Fig. 2. Architecture of the λ computation block. (a) Forward–backward block for determining a search interval [a0, b0]. (b) GSS block. The operations in the dashed blocks share the same hardware.
In addition, an n-word first-in–first-out (FIFO) buffer is used to match the two input data streams of the final addition and form Bk+1 row by row. The computation time of the BC block depends on Lmult, Lsub, and Ldiv, the latencies of the multiplier, subtractor, and divider, respectively, and on Tvector, the number of execution cycles of a VVM. A deeply pipelined VVM unit is implemented with Tvector = n + Lmult + Ladd(⌈log2 Ladd⌉ + 2), where the last two terms account for draining the pipeline.

Step Size λ Computation

An exact line search method is implemented to find λk according to step 2 of Algorithm 1. The method includes two submodules: the forward–backward algorithm, which determines an initial search interval containing the minimizer, and the golden section search (GSS) algorithm shown in Fig. 1, which reduces the interval iteratively. Fig. 2 shows the hardware architecture, where the computation time of each block is analyzed. The architecture comprises a number of pipelined arithmetic operations and the evaluation of the objective function ET.
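A software-level sketch of the two-submodule line search might look as follows (all step sizes, growth factors, and tolerances are illustrative assumptions, not the paper's values):

```python
def forward_backward(phi, lam0=0.0, h=1e-2, grow=2.0):
    """Forward-backward search: find an interval [a0, b0] bracketing a
    minimizer of phi(lam). Parameters here are illustrative choices."""
    a, fa = lam0, phi(lam0)
    b, fb = lam0 + h, phi(lam0 + h)
    if fb > fa:                      # went uphill: search backward instead
        a, b, fa, fb = b, a, fb, fa
        h = -h
    while True:
        h *= grow
        c, fc = b + h, phi(b + h)
        if fc > fb:                  # error rises again: minimizer is bracketed
            return (min(a, c), max(a, c))
        a, b, fb = b, c, fc

def golden_section(phi, a, b, tol=1e-4):
    """Golden section search: iteratively shrink [a, b] to width below tol."""
    r = 0.6180339887                 # golden ratio conjugate
    x1, x2 = b - r * (b - a), a + r * (b - a)
    f1, f2 = phi(x1), phi(x2)
    while b - a > tol:
        if f1 < f2:                  # minimizer lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - r * (b - a)
            f1 = phi(x1)
        else:                        # minimizer lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + r * (b - a)
            f2 = phi(x2)
    return 0.5 * (a + b)
```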
The dashed blocks in Fig. 2(a) and (b) are the same piece of hardware, shared by the corresponding operations. The complex objective function (1) must be evaluated frequently during the λ computation. Therefore, we also implement the objective function evaluation in hardware as a separate block for high performance. This block can be modified for different types of ANNs while the other parts of the implementation remain the same. As shown in Fig. 3(a), the block is implemented in two parts: the first part exercises the ANN to obtain its outputs, and the second part computes the training error. Fig. 3(b) shows the structure of the first part. Two accumulation units are implemented to obtain h and ŷ. Note that this accumulation unit is implemented differently from the one in the VVM unit, using variable-length shift registers.
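A software reference for the objective-function block, under the assumption of a single-hidden-layer network with sigmoid activation and a sum-of-squares error (the paper's exact network and error definition may differ): the first part runs the ANN to get the outputs, and the second part accumulates the total training error ET over the batch.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def objective_ET(W1, W2, X, Y):
    """Total training error E_T over the whole batch for a 1-hidden-layer ANN.

    X -- (num_samples, num_inputs) training inputs
    Y -- (num_samples, num_outputs) training targets
    """
    H = sigmoid(X @ W1)       # part 1: exercise the ANN (hidden activations h)
    Y_hat = sigmoid(H @ W2)   # network outputs y_hat
    return 0.5 * np.sum((Y_hat - Y) ** 2)   # part 2: sum-of-squares training error
```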
Resource Usage Analysis
During the computation, intermediate results are accessed in different patterns: 1) some results, such as w and g, are connected to multiple calculation modules and are read more than once, and 2) some values, such as h, F(h), and δ, are read in a different order than they are written. Batch training is also supported. Therefore, all intermediate results are buffered in FPGA on-chip memories, enabling pipelined and parallel computation and data reuse. FPGA on-chip dual-port RAMs are used to implement the buffers. The required storage space of each module is shown in Table I. As n increases, off-chip memory must be used, and a properly designed memory hierarchy containing on-chip and off-chip memories is needed to avoid a slowdown. In addition, for applications with a large number of training samples, all training data are stored in off-chip memories. The overall data path of the BFGS-QN hardware design needs a number of floating-point (FP) units, as shown in Table II. The number of required FP units is independent of n.
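As a rough back-of-the-envelope aid (our own illustration; Table I's module-by-module figures are not reproduced in this excerpt), on-chip storage is dominated by the n × n matrix B, while the length-n vector buffers (w, g, δ, the FIFO, and the RAM units) grow only linearly in n:

```python
def buffer_words(n, num_vector_buffers=6):
    """Estimate on-chip buffer size in 32-bit words for weight count n.
    Assumes single-precision FP and an illustrative count of length-n buffers."""
    matrix_words = n * n                   # inverse-Hessian approximation B
    vector_words = num_vector_buffers * n  # w, g, delta, FIFO, RAM units, ...
    return matrix_words + vector_words

# Example: for n = 256 weights, B alone needs 65536 words (~256 KB at 4 B/word),
# which is why off-chip memory becomes necessary as n grows.
print(buffer_words(256))  # 65536 + 1536 = 67072 words
```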
Advantages of the proposed implementation:
- Performance is efficient
- Power efficiency is higher
Qiang Liu, Jia Liu, Ruoyu Sang, Jiajun Li, Tao Zhang, and Qijun Zhang, "Fast Neural Network Training on FPGA Using Quasi-Newton Optimization Method," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018.