Open Access

Architecting Holistic Fault Tolerance

NU author(s): Rem Gensh, Dr Ashur Rafiev, Professor Alexander Romanovsky, Dr Fei Xia, Professor Alex Yakovlev



Abstract

The optimality and maintainability of fault tolerance mechanisms in a computer system has typically not been a major topic of concern, mostly because fault tolerance is a non-functional system requirement. This paper proposes a Holistic Fault Tolerance architecture, based on a centralised fault tolerance management, with related functionality distributed across the entire system. The most suitable error detection and error recovery strategies for a given application are chosen by a special crosscutting controller depending on error rates, system performance and resource utilisation requirements. We discuss the motivation for introducing this holistic fault tolerance architecture and reason about its benefits from the point of view of optimal system operation and improved maintainability. The advantages and possible implementation challenges of the proposed approach are demonstrated by a real-world application.

Publication metadata

Author(s): Gensh R, Rafiev A, Romanovsky A, Garcia A, Xia F, Yakovlev A

Publication type: Report

Publication status: Published

Series Title: School of Computing Science Technical Report Series

Year: 2016

Pages: 13

Print publication date: 14/11/2016

Acceptance date: 14/11/2016

Report Number: 1505

Institution: School of Computing Science, University of Newcastle upon Tyne

Place Published: Newcastle upon Tyne