Using Hardware Counters to Improve Memory Performance: Irregular Codes and Page Migration

The idea of this project is to effectively use on-chip hardware counters to improve the performance of the memory accesses in runtime. We have selected two different contexts in which the benefits of this idea can be important: the efficient execution of irregular codes, and the problem of use page migration to improve the execution of parallel codes. Even though, in principle, both issues are not close to each other, we propose to apply the same techniques for increasing the locality of irregular accesses, to the page migration problem.

Objectives

The main objectives to achieve the above benefits are the following:

  1. To validate the use of hardware counters (HCs) as a source of information to guide runtime optimizations of sequential and parallel irregular codes. This implies establishing procedures to get, store and manage the information obtained from HCs. Also it is necessary to specify mechanisms to select the irregular arrays and pointers to be considered for the optimization of their accesses. The influence of the statistical nature of the reading of the HC must be validated and taken into account through the whole project.
  2. To characterize locality (between pairs of different references) and affinity (between references and processes) of the irregular accesses through models based on the information provided by the hardware counters. This information will be obtained in terms of cache lines. The models to characterize these localities and affinities are based on the temporal locality, the spatial locality and the latency of the accesses. In addition, functions in terms of pages, instead of cache lines, will be also obtained to deal with the problem of page migration.
  3. To develop strategies to optimize different performance problems based on techniques such as graph theory methods, clustering, genetic heuristics, etc. Therefore, the optimization problem can be established as the maximization of locality and affinity. To deal with this problem we will use the mentioned techniques for the model given by the functions mentioned in objective 2. The problems to be considered are:
    • Reordering to optimize locality (for one process) and/or affinity (for a number of processes or threads in a parallel code).
    • Page migration in parallel codes.
  4. To adapt the modelling of the behavior of the memory hierarchy at compile time to the FinisTerrae architecture. In a first stage we propose to improve the integration of our model in a compiler infrastructure such as gcc in order to improve its usability. As second step, the model will be adapted to the architecture of the processors in the FinisTerrae supercomputer. This task involves modelling new complex situations not considered by now such as the interaction of different threads in shared caches.