Determining the Last Membership of a Process Group after a Total Failure

NU author(s): Dr Paul Ezhilchelvan, Emeritus Professor Santosh Shrivastava



There is a large category of distributed systems that use component (e.g., process, object) replication for availability. A large part of the effort involved in crafting these systems lies in maintaining the cardinality of the set of replicas. For example, in primary-secondary replication, if one component crashes, it is necessary to create a replacement on some operational machine and hence keep the cardinality of the set of components at at least two. In systems where failed components are recreated on other machines, the composition of a component group (referred to as a unit) may be seen to 'walk' over a number of machines during normal system operation. We are interested in the problem of recovery after a total failure of a unit (a disaster); that is, recovery after all or a large number of unit members have failed or become partitioned such that the unit can no longer function normally. Disaster recovery requires that, once sufficient members belonging to the unit have restarted or been reconnected, the unit should resume functioning without further delay. A particular requirement is that only the components belonging to the last unit configuration be part of the post-disaster unit configuration. This paper presents an algorithm which a component can execute to determine whether it belonged to the last unit configuration. The algorithm has been developed in the context of an asynchronous distributed system where message delays are unknown and therefore a slow component can appear as crashed or disconnected.

Publication metadata

Author(s): Black D, Ezhilchelvan PD, Shrivastava SK

Publication type: Report

Publication status: Published

Series Title: Department of Computing Science Technical Report Series

Year: 1997

Pages: 19

Print publication date: 01/01/1997

Source Publication Date: 1997

Report Number: 602

Institution: Department of Computing Science, University of Newcastle upon Tyne

Place Published: Newcastle upon Tyne