An Efficient Fault-Tolerance Design for Integer Parallel Matrix–Vector Multiplications
Parallel matrix processing is a typical operation in many systems, and in particular matrix–vector multiplication (MVM) is one of the most common operations in the modern digital signal processing and digital communication systems. This paper proposes a fault tolerant design for integer parallel MVMs. The scheme combines ideas from error correction codes with the self-checking capability of MVM. Field-programmable gate array evaluation shows that the proposed scheme can significantly reduce the overheads compared to the protection of each MVM on its own. Therefore, the proposed technique can be used to reduce the cost of providing fault tolerance in practical implementations.
- Xilinx 14.2
Matrix–vector multiplication (MVM) is a typical computation in domain transformation, projection, or feature extraction. With increasing demand for high performance or throughput, parallel computing is becoming more important in high-performance computing systems, cloud computing systems, and communication systems. For example, parallel matrix processing with a common input vector (parallel MVMs) is performed for pre coding in multiuser multi-in multi-out (MIMO) (especially large-scale MIMO) and high-performance low density parity check decoders. Parallel processing may run on a large number of processing units, e.g., GPUs or distributed systems, and experiments show that soft errors can affect the outputs. In this ABFT design, the two input matrices are changed to a “row checksum” matrix and a “column checksum” matrix, respectively, and single errors in the resulting “full checksum” matrix can be detected and corrected based on its row checksum and column checksum.
Since the row checksum does not exist for an input vector, such ABFT scheme can only be used for error detection in MVM. In the ABFT schemes proposed , checksum technique is used to locate the affected result element or blocks of result elements in MVM, and partial re computation is used for error correction. Such schemes are suitable for the cases in which the delay introduced by re computation is acceptable. But this is not always the case, especially in high-throughput communication receivers where the inputs arrive in real time from A/D converters. A general method to protect linear dynamic systems was proposed. Part of that system model can be extracted as an MVM. I
n a method to protect parallel digital filters was proposed. The scheme is based on considering each filter as a bit in an error correction code (ECC) so that parity check filters are added and used to detect and correct errors. The schemes in share a similar approach in that redundant rows are added to the processing matrix by combination (or coding) of the original rows of the matrix so that the relations among the multiple outputs can be used to locate and correct the erroneous elements. However, this approach only works for single MVM. In the protection of parallel fast Fourier transform (FFT) operations on different input vectors was studied. In that case, a self checking property of the FFT was combined with the ECC method to reduce the overhead required for protection. we proposed an idea based on ECC and MVM self-checking (checksum) for error protection of parallel MVMs. We assume real-time applications in which integer vectors come from ADCs with a fixed timing and processing is synchronous and repetitive, so a re computation is not tolerable.
This paper is an extension of that two page conference paper, and provides new results based on the FGPA implementation. To the best of the authors’ knowledge, no other fault-tolerant design for parallel MVMs has been proposed. In the proposed structure, a “detection matrix” D and a “sum matrix” A are added for fault tolerance. D is generated by performing ECC to the checksum row of each matrix, and A is a combination of all matrices under protection. By comparing the results of the original MVMs and that of the check matrix, the fault can be located. Then, the erroneous output vector can be recovered by subtracting the correct output vectors from the sum output vector.
- Overheads is not reduced
- Cost is high
- Errors are not identified easily
Fault tolerance design for parallel matrix processing
- Basic Structure The basic structure of the proposed fault-tolerant system for parallel MVMs is shown in Fig. 1. From Fig. 1, we see that the original P matrices remain unchanged, and the redundant part includes two matrices (regardless of the number of parallel matrices under protection):
Fig. 1. Fault-tolerant structure for parallel MVMs
detection matrix and sum matrix . By comparing the P outputs with the checking result the faulty vector can be detected, and then the correct output vector , can be obtained by subtracting the correct vectors among from the “sum output vector” . The number of rows of D (Rp) is determined by the number of parallel matrices (P). When considering the case that only one MVM may fail, a Hamming code can be used for protection, and the relationship between P and Rp would be
On the other hand, in our proposed technique (Fig. 1), a branch is an entire MVM processing.
Generation of Detection Matrix and SUM Matrix
The sum matrix is just the direct sum of the original matrix
For the detection matrix, let us assume P = 4 to facilitate the illustration of the basic idea, and assuming only one branch output may contain errors. First, a checksum row for each matrix can be generated as
Then, the rows of detection D can be generated ,Hamming code as
Then, the error detection matrix can be defined as matrix as
if there are no errors, we should have
where “sum(x)” performs addition of all items within vector x, and
In (12), the result of is an integer vector, and sum(produces an integer scalar. In summary, if we denote
Fig. 2. Implementation structure of MVM (N = 20)
error detection of a single branch can also be done. After identifying the failing branch, the corresponding output vector can be recovered by subtracting the right output vectors from the sum output vector
Since the error detection is based on the sum of the output vectors, and the whole vector could be recovered during the correction, the proposed scheme could deal with multiple errors in a single output vector
- Overheads are reduced
- Cost is low
- Error can be easily identified by error detection method
Zhen Gao , Qingqing Jing, Yumeng Li, Pedro Reviriego, and Juan Antonio Maestro, “An Efficient Fault-Tolerance Design for Integer ParallelMatrix–Vector Multiplications”, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2018.