An architecture for tolerating processor failures in shared-memory multiprocessors

Lookup NU author(s): Professor Pete Lee

Abstract

This paper focuses on the problem of fault tolerance in shared memory multiprocessors, and describes an architecture designed for transparently tolerating processor failures. The Recoverable Shared Memory (RSM) is the novel component of this architecture, providing a hardware supported backward error recovery mechanism which minimizes the propagation of recovery when a processor fails. The RSM permits a shared memory multiprocessor to be constructed using standard caches and cache coherence protocols, and does not require any changes to be made to applications software. The performance of the recovery scheme supported by the RSM is evaluated and compared with other schemes that have been proposed for fault tolerant shared memory multiprocessors. The performance study has been conducted by simulation using address traces collected from real parallel applications. ©1996 IEEE.


Publication metadata

Author(s): Banatre M, Gefflaut A, Joubert P, Morin C, Lee PA

Publication type: Article

Publication status: Published

Journal: IEEE Transactions on Computers

Year: 1996

Volume: 45

Issue: 10

Pages: 1101-1115

Print publication date: 01/01/1996

ISSN (print): 0018-9340

ISSN (electronic): 1557-9956

Publisher: IEEE

URL: http://dx.doi.org/10.1109/12.543705

DOI: 10.1109/12.543705

