Emergent Failures: Rethinking Cloud Reliability at Scale

Garraghan, P; Yang, R; Wen, Z; Romanovsky, A; Xu, J; Buyya, R; Ranjan, R

doi:10.1109/MCC.2018.053711662

Lookup NU author(s): Dr Zhenyu Wen, Emeritus Professor Alexander Romanovsky, Jie Xu, Professor Raj Ranjan

Licence

This is the authors' accepted manuscript of an article that has been published in its final definitive form by IEEE, 2018.

For re-use rights please refer to the publisher's terms and conditions.

Abstract

Since the conception of cloud computing, ensuring its ability to provide highly reliable service has been of the utmost importance to the business objectives of providers and their customers. This has held true for every facet of the system, encompassing applications, resource management, the underlying computing infrastructure, and environmental cooling. Thus, the cloud-computing and dependability research communities have exerted considerable effort toward enhancing the reliability of system components against various software and hardware failures. However, as these systems have continued to grow in scale, with heterogeneity and complexity resulting in the manifestation of emergent behavior, so too have their respective failures. Recent studies of production cloud datacenters indicate the existence of complex failure manifestations that existing fault-tolerance-and-recovery strategies are ill-equipped to handle effectively. These strategies can even be responsible for such failures. These emergent failures, frequently transient and identifiable only at runtime, represent a significant threat to designing reliable cloud systems. This article identifies the challenges of emergent failures in cloud datacenters at scale and their impact on system resource management, and discusses potential directions of further study for Internet of Things integration and holistic fault tolerance.

Publication metadata

Author(s): Garraghan P, Yang R, Wen Z, Romanovsky A, Xu J, Buyya R, Ranjan R

Publication type: Article

Publication status: Published

Journal: IEEE Cloud Computing

Year: 2018

Volume: 5

Issue: 5

Pages: 12-21

Online publication date: 18/10/2018

Acceptance date: 02/04/2018

Date deposited: 19/11/2018

ISSN (electronic): 2325-6095

Publisher: IEEE

URL: https://doi.org/10.1109/MCC.2018.053711662

DOI: 10.1109/MCC.2018.053711662

Funding

Funder reference	Funder name
2016YFB1000103
EP/P031617/1
