Algorithm and Architecture Design of the H.265/HEVC Intra Encoder
Improved video coding techniques introduced in the H.265/HEVC standard allow video encoders to achieve better compression efficiency. On the other hand, the increased complexity requires a new design methodology able to face the challenges associated with ever higher spatio-temporal resolutions. This paper presents a computationally scalable algorithm and its hardware architecture, which support intra encoding up to the 2160p@30fps resolution. The scalability allows a tradeoff between throughput and compression efficiency. In particular, the encoder is able to check a variable number of candidate modes. Rate estimation based on bin counting and distortion estimation in the transform domain simplify the rate-distortion analysis and enable the evaluation of a large number of candidate intra modes. The encoder preselects candidate modes by processing 8×8 predictions computed from original samples. The preselection shares the hardware resources used for processing predictions generated from reconstructed samples. To support intra 4×4 modes at the 2160p@30fps resolution, the encoder incorporates a separate reconstruction loop. The processing of blocks with different sizes is interleaved to compensate for the delay of the reconstruction loops. Implementation results show that the encoder utilizes 1086k gates and 52 kB of on-chip memory in the TSMC 90 nm technology. The main reconstruction loop can operate at 400 MHz, whereas the remaining modules work at 200 MHz. For 2160p@30fps videos, the average BD-Rate is 5.46% compared to the HM software.
The latest research and standardization efforts in video coding have led to the specification of the H.265/HEVC standard (High Efficiency Video Coding) in 2013 [1-3]. It significantly improves the rate-distortion efficiency compared to its predecessor, H.264/AVC. On the other hand, the improvement is achieved at the cost of increased computational complexity. The problem is of particular importance given the ever higher demands for spatio-temporal video resolutions. In many applications, support for real-time compression is indispensable. To address this requirement, many research and development efforts have been started.
H.265/HEVC extends the 16×16 macroblock to a 64×64 coding tree unit (CTU), which can be recursively split into four coding units (CUs). The standard specifies more sizes for the prediction units (PUs) and transform units (TUs) included in a CU. In the case of intra coding, 33 directional and two non-directional modes allow a more accurate spatial prediction of successive blocks, whereas H.264/AVC employs up to nine modes. The best efficiency is achieved when using the expensive rate-distortion optimization (RDO). However, a brute-force search for the optimal mode involves a large amount of computation. Therefore, it is beneficial to preselect some modes based on a simplified cost function. This method is applied in the HM reference software, which uses the Sum of Absolute Transformed Differences (SATD) as the cost function. Another speed-up technique applied in the software is the table-based rate estimation. Instead of performing Context Adaptive Binary Arithmetic Coding (CABAC), the technique accumulates bin contributions pre-calculated for each possible probability state. The simplifications applied in the software introduce slight quality losses. Nevertheless, the complexity is still very high. Its reduction is indispensable to obtain an algorithm suitable for a real-time implementation with a reasonable amount of resources.
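The SATD-based preselection mentioned above can be illustrated with a minimal sketch. It assumes 4×4 blocks, an unnormalized Hadamard transform, and hypothetical helper names (`satd4x4`, `preselect`); the HM software applies its own block sizes, scaling, and mode-bit terms, which are not reproduced here.

```python
# Illustrative sketch only: unnormalized 4x4 SATD and a simple mode ranking.
# Function names and the candidate dictionary layout are assumptions.

H4 = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]

def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def satd4x4(orig, pred):
    """Sum of absolute Hadamard-transformed differences for one 4x4 block."""
    diff = [[o - p for o, p in zip(ro, rp)] for ro, rp in zip(orig, pred)]
    # H4 is symmetric, so H * D * H^T equals H * D * H.
    transformed = matmul(matmul(H4, diff), H4)
    return sum(abs(c) for row in transformed for c in row)

def preselect(orig, candidates, keep):
    """Return the `keep` mode labels with the lowest SATD.

    `candidates` maps a mode label to its 4x4 prediction block.
    """
    ranked = sorted(candidates, key=lambda mode: satd4x4(orig, candidates[mode]))
    return ranked[:keep]
```

Only the modes returned by `preselect` would then be passed to the more expensive rate-distortion evaluation.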
In this paper, the algorithm and the architecture design for a computationally scalable H.265/HEVC intra encoder are proposed. The architecture supports resolutions up to 2160p@30fps. The encoder allows a tradeoff between compression efficiency and throughput. The design takes advantage of the following new techniques:
- The rate estimation based on bin counting and the distortion estimation in the transform domain simplify rate-distortion analysis and enable the analysis of a great number of candidate modes.
- The encoder preselects candidate modes by processing 8×8 predictions computed from original samples.
- The preselection shares hardware resources used for the RDO processing of predictions generated from reconstructed samples.
- The encoder incorporates a separate reconstruction loop to support intra 4×4 modes for the 2160p@30fps resolution.
- The processing of blocks with different sizes/types is interleaved to compensate for the delay of the reconstruction loops.
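The transform-domain distortion estimation named in the first item relies on a general property: an orthonormal transform preserves the sum of squared differences, so the quantization error can be measured on coefficients without an inverse transform. The following sketch checks this numerically. It assumes a floating-point orthonormal DCT-II and a uniform quantizer with a hypothetical `step` parameter; the H.265/HEVC integer transform is only approximately orthogonal, so in a real encoder the two values would agree only approximately.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of lists."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def transpose(m):
    return [list(r) for r in zip(*m)]

def transform2d(block, t):
    """Compute t * block * t^T for square matrices."""
    n = len(block)
    tmp = [[sum(t[i][k] * block[k][j] for k in range(n)) for j in range(n)]
           for i in range(n)]
    return [[sum(tmp[i][k] * t[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

def ssd(a, b):
    """Sum of squared differences between two blocks."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def quantize(coeffs, step):
    """Uniform quantization followed by dequantization (assumed model)."""
    return [[round(c / step) * step for c in row] for row in coeffs]

def distortions(residual, step):
    """Return (transform-domain SSD, spatial-domain SSD) for one residual block."""
    t = dct_matrix(len(residual))
    coeffs = transform2d(residual, t)
    dequant = quantize(coeffs, step)
    # Inverse transform: t^T * dequant * t.
    recon = transform2d(dequant, transpose(t))
    return ssd(coeffs, dequant), ssd(residual, recon)
```

For an orthonormal transform, both returned values match up to floating-point rounding, which is why the inverse transform can be skipped when only the distortion is needed for mode decision.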
Fig. 1 shows the probability density functions of the ratio of the number of bits produced by CABAC to the number of input bins. The functions are estimated based on statistics for all CUs. As can be seen, the number of bits is highly correlated with the number of bins. The correlation is stronger for smaller QP values. Taking advantage of this correlation, it is possible to replace the output-bit counting directly by the input-bin counting to estimate rates. In this approach, the contributions of non-bypass bins are computed with an error. As a consequence, the RD costs determined for particular mode combinations are inaccurate, leading to some losses in compression efficiency. On the other hand, the bin counting significantly simplifies computations, since the reference to CABAC probability models is avoided.
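The difference between the two rate estimates can be sketched as follows. This is an illustration under simplifying assumptions, not the encoder's implementation: each bin carries a fixed probability of being 1 (`None` marks a bypass bin) instead of an adaptive CABAC context state, the ideal code length -log2(p) stands in for the table-based estimate, and all function names are hypothetical.

```python
import math

def estimate_bits_entropy(bins):
    """Model-based estimate: ideal code length per bin.

    `bins` is a list of (value, p_one) pairs; p_one is None for bypass bins,
    which cost exactly one bit each.
    """
    total = 0.0
    for value, p_one in bins:
        if p_one is None:
            total += 1.0
        else:
            p = p_one if value == 1 else 1.0 - p_one
            total += -math.log2(p)
    return total

def estimate_bits_bin_count(bins):
    """Bin-counting estimate: every bin is charged one bit, no model lookup."""
    return float(len(bins))

def rd_cost(distortion, rate_bits, lam):
    """Lagrangian cost J = D + lambda * R."""
    return distortion + lam * rate_bits
```

For bypass bins the two estimates coincide. For context-coded bins the bin count deviates from the model-based value, which corresponds to the error in non-bypass contributions described above.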