Lookup NU author(s): Professor Brian Randell
This paper concerns an important aspect of the problem of designing fault-tolerant distributed computing systems. The concepts involved in "backward error recovery", i.e. restoring a system, or some part of a system, to a previous state which it is hoped or believed preceded the occurrence of any existing errors, are formalised and generalised so as to apply to concurrent, e.g. distributed, systems. Since in distributed systems there may exist a great deal of independence between activities, the system can be restored to a state that could have existed rather than to a state that actually existed. The formalisation is based on the use of what we term "Occurrence Graphs" to represent the cause-effect relationships that exist between the events that occur when a system is operational, and to indicate existing possibilities for state restoration. A protocol is presented which could be used in each of the nodes in a distributed computing system in order to provide system recoverability even in the face of multiple faults.
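The idea of restoring to a state that "could have existed" can be illustrated with a minimal sketch. This is not the paper's formalism or protocol: the dictionary representation, the function name `events_to_undo`, and the example event names are all assumptions made for illustration. The sketch treats an occurrence graph as a directed acyclic graph of cause-effect edges and undoes a faulty event together with everything causally dependent on it.

```python
from collections import deque


def events_to_undo(causes, faulty):
    """Return the faulty events plus every event causally dependent on them.

    `causes` maps an event to the events it directly caused (a hypothetical
    encoding of an occurrence graph, not the paper's own notation).
    """
    undone = set(faulty)
    queue = deque(faulty)
    while queue:
        event = queue.popleft()
        for effect in causes.get(event, ()):
            if effect not in undone:
                undone.add(effect)
                queue.append(effect)
    return undone


# Hypothetical history on two nodes: a1 -> a2 -> a3 on node A, b1 -> b2 on
# node B, and a message sent at a2 that is received at b2.
causes = {
    "a1": ["a2"],
    "a2": ["a3", "b2"],
    "b1": ["b2"],
}
all_events = {"a1", "a2", "a3", "b1", "b2"}

undone = events_to_undo(causes, ["a2"])
surviving = all_events - undone
print(sorted(undone))     # ['a2', 'a3', 'b2']
print(sorted(surviving))  # ['a1', 'b1']
```

In this example the surviving events are a1 and b1. If b1 actually occurred after a2 in real time, the system never passed through a state containing b1 but not a2; because b1 does not depend on a2, that state is nonetheless one that could have existed, which is the kind of restoration target the abstract describes.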
Author(s): Merlin PM, Randell B
Publication type: Conference Proceedings (inc. Abstract)
Publication status: Published
Conference Name: 8th International Conference on Fault-Tolerant Computing (FTCS)
Year of Conference: 1978
Publisher: IEEE Computer Society Press