Lookup NU author(s): Rem Gensh, Dr Ashur Rafiev, Emeritus Professor Alexander Romanovsky, Dr Fei Xia, Professor Alex Yakovlev
Abstract: The optimality and maintainability of fault tolerance mechanisms in a computer system have typically not been a major topic of concern, mostly because fault tolerance is a non-functional system requirement. This paper proposes a Holistic Fault Tolerance architecture, based on centralised fault tolerance management, with related functionality distributed across the entire system. The most suitable error detection and error recovery strategies for a given application are chosen by a special crosscutting controller depending on error rates, system performance and resource utilisation requirements. We discuss the motivation for introducing this holistic fault tolerance architecture and reason about its benefits from the point of view of optimal system operation and improved maintainability. The advantages and possible implementation challenges of the proposed approach are demonstrated by a real-world application.
Author(s): Gensh R, Rafiev A, Romanovsky A, Garcia A, Xia F, Yakovlev A
Publication type: Report
Publication status: Published
Series Title: School of Computing Science Technical Report Series
Year: 2016
Pages: 13
Print publication date: 14/11/2016
Acceptance date: 14/11/2016
Report Number: 1505
Institution: School of Computing Science, University of Newcastle upon Tyne
Place Published: Newcastle upon Tyne
URL: http://www.cs.ncl.ac.uk/publications/trs/papers/1505