Excavating the Hidden Parallelism Inside DRAM Architectures With Buffered Compares
We propose an approach called buffered compares, a less-invasive processing-in-memory solution that can be used with existing processor memory interfaces such as DDR3/4 with minimal changes. The approach is based on the observation that the multibank architecture, a key feature of modern main memory DRAM devices, can provide huge internal bandwidth without any major modification. We place a small buffer and a simple ALU in each bank, define a set of new DRAM commands to fill the buffer and feed data to the ALU, and return the result to the host memory controller once per set of commands rather than once per command. By exploiting the under-utilized internal bandwidth with ‘compare-n-op’ operations, which are frequently used in various applications, we not only reduce the amount of energy-inefficient processor–memory communication, but also accelerate big data processing applications by utilizing the parallelism of the buffered compare units in the DRAM banks. We present two versions of the buffered compare architecture, a full-scale architecture and a reduced architecture, which trade off performance against energy. The experimental results show that our solution significantly improves the performance and efficiency of the system on the tested workloads.
With the emergence of big data applications, the center of the computing paradigm is shifting from computation toward data. Big data applications are characterized by an inherently large memory footprint, a small or modest amount of computation, and a high degree of parallelism. Together with the trend of increasing core counts, the external memory bandwidth requirement of a system has steadily increased. However, in contrast to the rapidly growing computing power and bandwidth requirement, the actual bandwidth and energy efficiency of off-chip channels are not improving as much, which is the so-called memory wall problem.
All these circumstances have driven the resurgence of near-data processing (NDP), or processing in memory (PIM), which offloads certain computations to processing units placed at or near the memory. One straightforward way to implement NDP is to add fully functional cores atop DRAM dies using 3-D stacking. However, integrating cores with DRAM raises numerous issues, including thermal problems. Memory chips typically do not have the strong cooling capability that processor chips do, and memories are in general more vulnerable to high temperature. Therefore, the power budget of cores integrated with DRAM devices would be very limited. There are also other problems such as cache coherence, virtual memory support, and the overhead of mapping applications.
By contrast, we leverage existing memory systems to realize NDP with minimal changes to the current ecosystem. Our approach adds a minimal amount of computing capability to the memory die for offloading memory-intensive operations, while leaving complex or unbounded control to the processor and the memory controller. At the same time, given the gap between the internal and external bandwidth of multibank DRAM, the approach tries to exploit the excess internal bandwidth as fully as possible.
To achieve this goal, we focus on compare instructions, mainly targeting table/index scans in in-memory databases. For example, the table scan depicted in Fig. 1 searches for a specific data item in a given table. It is a fundamental database operation and critical to performance, especially for column store databases. Executing the compare operations of a table scan at the memory side greatly reduces the amount of data read from the memory. When scanning a table, there is a key that we search for. The key is compared with the items (called targets) stored in the table. In the conventional system [Fig. 1(a)], a processor: ① fetches the target data stored in a table, ② performs a compare with the key, and ③ outputs the compare results. However, we need to know only whether each target matches the key; the actual target values are unnecessary. If we perform the compares at the memory side [Fig. 1(b)], we: ① send the key to the memory instead of reading the targets, ② do the comparisons, and ③ read the result after the comparisons are over. In this way, we reduce the memory bandwidth consumed, and the benefit grows with the number of targets to read.
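The traffic difference between the two flows of Fig. 1 can be illustrated with a minimal sketch. It is a software model only, not the hardware described in this paper: the function names, the 8-byte target and key sizes, and the one-bit-per-target result format are assumptions made for illustration.

```python
TARGET_BYTES = 8  # assumed size of one target item
KEY_BYTES = 8     # assumed size of the search key


def conventional_scan(table, key):
    """Host-side scan: every target crosses the memory channel."""
    bytes_moved = 0
    matches = []
    for target in table:
        bytes_moved += TARGET_BYTES      # step 1: fetch the target
        matches.append(target == key)    # step 2: compare at the processor
    return matches, bytes_moved          # step 3: output the results


def memory_side_scan(table, key):
    """Memory-side scan: only the key and the match results cross the channel."""
    bytes_moved = KEY_BYTES                        # step 1: send the key once
    matches = [target == key for target in table]  # step 2: compare in memory
    bytes_moved += (len(table) + 7) // 8           # step 3: read 1 bit per target (assumed format)
    return matches, bytes_moved


if __name__ == "__main__":
    table = [3, 7, 7, 1] * 256
    print(conventional_scan(table, 7)[1], memory_side_scan(table, 7)[1])
```

Under these assumptions the host-side traffic grows by one full target per item, while the memory-side traffic grows by one bit per item, which is why the benefit increases with the table size.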
We propose a novel buffered compare scheme, a kind of PIM technique that performs compare-n-op operations inside DRAM banks to speed up many applications and amplify effective memory bandwidth. In contrast to existing PIM techniques, the buffered compare operations have deterministic latency so that they can be treated as simple extensions of ordinary DRAM commands, which leaves the DRAM as a ‘passive’ device (a device that does not invoke any event by itself). Also, without any caches or complicated pipelines of ordinary cores, the buffered compare approach incurs minimal overhead to existing DRAM dies.
To perform buffered compares, we place in each DRAM bank a key buffer that usually holds the fixed search key, an arithmetic unit, and another small buffer that holds the results. We use these units to perform compare-n-op operations within the banks, greatly reducing the traffic and latency of the off-chip channel. This speeds up the system, especially when the off-chip bandwidth is saturated. Simulation results on the tested workloads show that our scheme achieves up to 15.5× speedup and significant energy reduction, at the expense of a minimal increase in DRAM die area. Our key contributions are as follows.
1) We identify that the abundant internal bandwidth left unused in modern DRAM architectures can be exploited with NDP.
2) We propose a buffered compare architecture that performs compare-n-op operations inside DRAM to provide parallelism and off-chip bandwidth savings with lightweight logic.
3) We suggest a way to solve the system integration issues of buffered compare, including programming model, coherence, memory protection, and data placement.
4) We investigate six workloads that utilize buffered compares to enhance system performance and energy efficiency. We also present a detailed circuit-level analysis of buffered compare units (BCUs) on performance, power, and area overheads.
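The per-bank organization described above (a key buffer, an arithmetic unit, and a result buffer in each bank) can be pictured with a small behavioral model. This is a hedged sketch, not the proposed circuit or command set: the class and method names are hypothetical, the arithmetic unit is reduced to an equality compare, and the banks are processed sequentially even though the real units are meant to work in parallel.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BufferedCompareUnit:
    """Hypothetical model of one per-bank unit (names are illustrative)."""
    key: Optional[int] = None                          # key buffer
    results: List[bool] = field(default_factory=list)  # result buffer

    def load_key(self, key: int) -> None:
        """Fill the key buffer and clear any stale results."""
        self.key = key
        self.results = []

    def compare_row(self, row: List[int]) -> None:
        """Compare every word of one DRAM row against the buffered key."""
        self.results.extend(word == self.key for word in row)

    def read_results(self) -> List[bool]:
        """Return the accumulated results once, then empty the result buffer."""
        out, self.results = self.results, []
        return out


def scan_banks(banks: List[List[List[int]]], key: int) -> List[List[bool]]:
    """Scan banks[b][r][w] with one unit per bank and one result read per bank."""
    units = [BufferedCompareUnit() for _ in banks]
    for unit in units:
        unit.load_key(key)
    # The banks are independent; this sequential loop only models the behavior.
    for unit, rows in zip(units, banks):
        for row in rows:
            unit.compare_row(row)
    return [unit.read_results() for unit in units]
```

In this model the host receives one result transfer per bank after all rows have been compared, mirroring the idea of returning a result for a set of commands rather than for each command.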
INTEGRATING BUFFERED COMPARES TO SYSTEMS
Challenges for Processing-in-Memory Integration
Even though we add only minimal overhead to the existing architecture and protocol, there are still some hurdles to overcome before buffered compare can be integrated into a system. In this section, we briefly introduce the challenges; the solutions are described in the following sections. First, a programming model is needed to expose buffered compare to end users. We prefer that the programmer not be aware of detailed DRAM parameters, such as the size of a DRAM row or the number of banks in a rank, because exposing them would not only make programming difficult but also make the code dependent on the system configuration. The second issue is cache coherence. Because the processor cache might hold a copy of the data, it has to be kept coherent with the memory side. One solution would be to implement a coherence protocol between them, but this would incur too much overhead and offset the benefit of using PIM. Virtual memory poses another challenge. Unlike other PIM approaches, buffered compare does not require address translation at the memory side. However, because of virtual memory, a contiguous range of data might be split over multiple physical pages. Another problem comes from data placement. In modern DRAM architectures, a rank is usually composed of multiple devices to increase the width of a memory channel, and a word is usually interleaved over all the devices within the rank. This is a problem for buffered compare operations because a BCU needs a full word for processing. Finally, changes are required to the memory controller. Unlike normal reads and writes, buffered compare operations are performed over a specified address range, and the memory controller must be carefully designed to support them.
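The virtual memory issue can be made concrete with a short sketch: a range that is contiguous in virtual addresses has to be cut at page boundaries, because each page may map to an unrelated physical location. The code below only illustrates the problem and is not the solution adopted in this paper; the function name and the 4 KiB page size are assumptions.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def split_by_page(vaddr, length, page_size=PAGE_SIZE):
    """Split [vaddr, vaddr + length) into chunks that never cross a page boundary.

    Returns a list of (start_address, chunk_length) pairs. Each chunk lies in a
    single virtual page, so it maps to one physically contiguous range.
    """
    chunks = []
    end = vaddr + length
    while vaddr < end:
        page_end = (vaddr // page_size + 1) * page_size  # first byte of the next page
        chunk_end = min(end, page_end)
        chunks.append((vaddr, chunk_end - vaddr))
        vaddr = chunk_end
    return chunks


if __name__ == "__main__":
    print(split_by_page(4000, 5000))
```

A 5000-byte range starting at virtual address 4000 touches three pages here, so a host-side layer would need to translate and issue each chunk separately before any range-based operation reaches the memory.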