Speedup and Parallelization Models for Energy-Efficient Many-Core Systems Using Performance Counters

Al-Hayanni, M; Shafik, R; Rafiev, A; Xia, F; Yakovlev, A

doi:10.1109/HPCS.2017.68

NU author(s): Mohammed Al-Hayanni, Professor Rishad Shafik, Dr Ashur Rafiev, Dr Fei Xia, Professor Alex Yakovlev

Licence

This is the authors' accepted manuscript of a conference proceedings (inc. abstract) that has been published in its final definitive form by IEEE, 2017.

For re-use rights please refer to the publisher's terms and conditions.

Abstract

Traditional speedup models, such as Amdahl's law, facilitate the study of the impact of running parallel workloads on many-core systems. However, these models are typically based on software characteristics and assume ideal hardware behavior. As such, their applicability to energy- and/or performance-driven system optimization is limited by two factors. First, speedup cannot be measured without instrumenting the original software code, and second, the parallelization factor of an application running on specific hardware is generally unknown. In this paper, we propose a novel method whereby standard performance counters found in modern many-core platforms can be used to derive speedup without instrumenting applications for time measurements. We postulate that speedup can be accurately estimated as the ratio of the instructions per cycle (IPC) of a parallel many-core system to the IPC of a single-core system. By studying application instructions and system instructions for the first time, our method leads to the determination of the parallelization factor and the optimal system configuration for energy and/or performance. The method is extensively demonstrated through experiments on three different platforms with core counts ranging from 4 to 61, running parallel benchmark applications (including synthetic and PARSEC benchmarks) on the Linux operating system. Speedup and parallelization estimates obtained with our method, together with their extensive cross-validation, show small errors (up to 8%) on these systems. Additionally, we demonstrate the effectiveness of our method for exploring parallelization-aware, energy-efficient configurations of many-core systems using formulations based on the energy-delay product.
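
The central idea of the abstract, speedup as a ratio of instructions per cycle, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `CounterSample` type, the function names, and the counter values are hypothetical, the parallelization factor is obtained by simply inverting the textbook form of Amdahl's law, and the energy-delay product is taken as average power multiplied by the square of the execution time. The sketch also ignores the distinction between application and system instructions that the paper studies.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Hypothetical performance-counter readings for one run.

    Assumption: `instructions` is summed over all active cores, and `cycles`
    is the elapsed cycle count of the run at a fixed clock frequency, so
    instructions / cycles is the aggregate throughput of the system.
    """
    instructions: int
    cycles: int


def ipc(sample: CounterSample) -> float:
    """Instructions per cycle for one run."""
    return sample.instructions / sample.cycles


def speedup(parallel: CounterSample, single: CounterSample) -> float:
    """Speedup estimated as IPC(parallel run) / IPC(single-core run)."""
    return ipc(parallel) / ipc(single)


def parallel_fraction(s: float, n: int) -> float:
    """Parallelization factor p from Amdahl's law, S = 1 / ((1 - p) + p / n).

    Solving for p gives p = (1 - 1/S) / (1 - 1/n), defined for n >= 2.
    """
    if n < 2:
        raise ValueError("need at least two cores to estimate p")
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n)


def edp(avg_power_w: float, delay_s: float) -> float:
    """Energy-delay product: (power * delay) * delay."""
    return avg_power_w * delay_s ** 2


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the paper.
    single = CounterSample(instructions=1_000_000, cycles=1_000_000)
    quad = CounterSample(instructions=1_000_000, cycles=400_000)
    s = speedup(quad, single)
    print(f"speedup = {s:.2f}, p = {parallel_fraction(s, 4):.2f}")
```

With these made-up readings the four-core run completes the same instruction count in 40% of the cycles, so the estimated speedup is 2.5 and the inferred parallelization factor is about 0.8. Comparing `edp` across candidate core counts is one simple way to rank configurations, under the assumption that average power and execution time are available for each.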