EventNaveed UI Mustafa

Reducing Processor-Memory Performance Gap and Improving Network-on-Chip Throughput

Performance of computing systems has tremendously improved over last few decades primarily due to decreasing transistor size and increasing clock rate. Billions of transistors placed on a single chip and switching at high clock rate result in overheating of the chip. The demand for performance improvement without increasing the heat dissipation lead to the inception of multi/many core design where multiple cores and/or memories communicate through a network on chip. Unfortunately, performance of memory devices has not improved at the same rate as that of processors and hence become a performance bottleneck. On the other hand, varying traffic pattern in real applications limits the network throughput delivered by a routing algorithm.

In this thesis, we address the issue of reducing processor-memory performance gap in two ways: First, by integrating improved and newly developed memory technologies in memory hierarchy of a computing system. Second, by equipping the execution platform with necessary architectural features and enabling its compiler to parallelize memory access instructions. We also address issue of improving network throughput by proposing a selection scheme that switches routing algorithm of an NoC with changing traffic pattern of an application

We present integration of emerging non-volatile memory (NVM) devices in memory hierarchy of a computing system in the context of database management systems (DBMS). To this end, we propose modifications in storage engine (SE) of a DBMS aiming at fast access to data through bypassing the slow disk interfaces while maintaining all the functionalities of a robust DBMS. As a case study, we modify the SE of PostgreSQL and detail the necessary changes and challenges such modifications entail. We evaluate our proposal using a comprehensive emulation platform. Results indicate that our modified SE reduces query execution time by up to 45% and 13% when compared to disk and NVM storage, with average reductions of 19% and 4%, respectively. Detailed analysis of these results shows that our modified SE suffers from data readiness problem. To solve this, we develop a general purpose library that employs helper threads to prefetch data from NVM hardware via a simple application program interface (API). Our library further improves query execution time for our modified SE when compared to disk and NVM storage by up to 54% and 17%, with average reductions of 23% and 8%, respectively.

As a second way to reduce processor-memory performance gap, we propose a compiler optimization aiming at reduction of memory bound stalls. The proposed optimization generates efficient instruction schedule through classification of memory references and consists of two steps: affinity analysis and affinity-aware instruction scheduling. We suggest two different approaches for affinity analysis, i.e., source code annotation and automated analysis. Our experimental results show that application of annotation-based approach on a memory intensive program reduces stall cycles by 67.44%, leading to 25.61% improvement in execution time. We also evaluate automated-analysis approach using eleven different image processing benchmarks. Experimental results show that automated-analysis reduces stall cycles, on average, by 69.83%.

As all benchmarks are both compute and memory-intensive, we achieve improvement in execution time by up to 30%, with a modest average of 5.79%.

In order to improve network throughput, we propose a selection scheme that switches routing algorithm with changing traffic pattern. We use two selection strategies: static and dynamic selection. While static selection is made off-line, dynamic approach uses run-time information on network congestion for selection of routing algorithm. Experimental results show that our proposal improves throughput for two real applications, i.e., SPARSE and MPEG-4 decoder, up to 17.37% and 6.74%, respectively.

They key conclusion of this thesis is that improvement in performance of a computing system needs multifaceted approach i.e., improving the performance of memory and communication subsystem at the same time. The reduction in performance gap between processors and memories requires not only integration of improved memory technologies in system but also software/compiler support. We also conclude that switching routing algorithm with changing traffic pattern of an application leads to improvement of NoC throughput.