Conference Papers: NEMESYS: Near-Memory Graph Copy Enhanced System-Software

[rheindt19memsys] Sven Rheindt, Andreas Fried, Oliver Lenke, Lars Nolte, Thomas Wild, Andreas Herkersdorf, NEMESYS: Near-Memory Graph Copy Enhanced System-Software, Proceedings of the International Symposium on Memory Systems (MEMSYS'19), ACM, 2019.


Although the memory and power walls have been tackled over the last decades, new challenges for manycore architectures have arisen from the ever-increasing memory intensiveness of applications with big, irregular, and cache-unfriendly data sets. As data-to-task locality is of key importance for system performance, the MEMSYS 2017 keynote speaker Peter Kogge showed evidence for the so-called "locality wall", which paved the path to near- and in-memory computing. The reduction of data movement is especially challenging on tile-based architectures with physically distributed memory, as they often omit inter-tile cache coherence and thus require a different programming model (e.g. PGAS). Inter-tile communication in the PGAS paradigm is enabled via a remote procedure call (RPC)-like programming language construct. The more modern PGAS languages are object-oriented, and thus the RPC mechanism requires object graphs to be copied between tiles. Because the transfer of such object graphs is crucial for the performance of object-oriented applications on PGAS architectures, it is the system-software's job to provide an efficient implementation of this mechanism. We therefore propose NEMESYS: NEar-Memory Graph Copy Enhanced SYstem-Software, which outsources the memory-intensive and cache-unfriendly graph copy operation to near-memory hardware accelerators. As NEMESYS is an efficient implementation of the PGAS RPC, it integrates these near-memory accelerators into the system-software, opaque to the application programmer. We integrated NEMESYS into an FPGA prototype and a distributed operating system running on a 4x4-tile design with a total of 56 application cores and two memory tiles. The evaluation with the X10 IMSuite benchmarks, featuring distributed graph algorithm kernels, showed a speedup in execution time between 1.35x and 3.85x compared to a state-of-the-art approach. The overall reduction in communication time was between 40 % and 82 %.
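
To illustrate what "copying an object graph" involves, the following is a minimal software sketch in Python. It is not the NEMESYS implementation, which performs this operation in near-memory hardware accelerators; the names `Node` and `copy_graph` are hypothetical and chosen only for illustration. The sketch shows why the operation is pointer-chasing and cache-unfriendly: every reachable object must be visited once, and shared references and cycles must be preserved in the copy.

```python
from collections import deque


class Node:
    """Hypothetical object with a payload and references to other objects."""

    def __init__(self, value):
        self.value = value
        self.refs = []


def copy_graph(root):
    """Copy every object reachable from root, preserving sharing and cycles."""
    # Maps the identity of each original object to its clone, so that an
    # object referenced from several places is copied exactly once.
    clones = {id(root): Node(root.value)}
    pending = deque([root])
    while pending:
        src = pending.popleft()
        dst = clones[id(src)]
        for ref in src.refs:
            if id(ref) not in clones:
                clones[id(ref)] = Node(ref.value)
                pending.append(ref)
            # Rewire the reference to point at the clone, not the original.
            dst.refs.append(clones[id(ref)])
    return clones[id(root)]
```

In a PGAS RPC, the clone would be built in the memory of the destination tile; here both graphs live in the same address space purely to keep the sketch self-contained.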




Authors at the institute

Scientific Staff
Andreas Fried