Fault-Tolerant Computer System Design

Pradhan, Dhiraj K.

ISBN 10: 0130578878 ISBN 13: 9780130578877

Publisher: Prentice Hall, 1996

Since the 1980s, the field of fault-tolerant design has broadened in appeal, particularly with its emerging application in distributed computing. This edition specifically deals with this dynamically changing computing environment.


From the Publisher

In the ten years since the publication of the first edition of this book, the field of fault-tolerant design has broadened in appeal, particularly with its emerging application in distributed computing. This new edition specifically deals with this dynamically changing computing environment, incorporating new topics such as fault tolerance in multiprocessor and distributed systems.

From the Inside Flap

This book represents an upgrading and enhancement of the earlier work Fault-Tolerant Computing: Theory and Techniques [1], published by Prentice Hall in 1986 and widely adopted as a text for graduate students. The field of fault-tolerant system design has broadened in appeal in the intervening decade, particularly with its emerging application in distributed computing, such as the proposed information highway, as well as the advent of multiprocessor computing nodes as the state of the art. This new book, therefore, reflects this quickly and dynamically changing computing environment. In addition to some of the basic chapters included in its predecessor, this book also incorporates chapters dealing specifically with these newer topics, such as fault tolerance in multiprocessor and distributed systems [5].

Reliability and availability of computing systems remain major concerns, despite frequent misconceptions that a dramatic increase in component reliability has obviated the need for fault tolerance, as reflected in Figure 1 [2]. One challenge posed by the newer client-server model of computing is guaranteeing data integrity in the event of failures. In fact, some attribute the persistent interest in mainframes to their perceived robustness. Apart from hardware reliability, the robustness of a system is rooted in good design discipline at both the hardware and software levels.

The recent discovery of a design flaw in the Pentium chip reveals the complexity of ensuring reliable operation in advanced microprocessors. It is widely known that the reliability weaknesses of the Intel Paragon trace, in part, to persistent design bugs in the microprocessor used in that system. Design diversity has, therefore, become another issue of wider acceptance, as evidenced in the design of the flight-control system of the forthcoming 777 jet aircraft. Both the real-time and the safety requirements mandate precautionary measures against a wide range of possible failures. The opportunity for fast recovery in the event of a fault is greatly aided by the advent of high-speed microprocessors, but new challenges arise regarding reliable synchronization.

One other challenge to fault-tolerant design is the increased use of massively parallel systems. Striving to achieve the highest possible performance through innovative architectures, these systems sometimes rely on still unproven technology. It is not surprising, then, that some of these systems suffer from dismal mean time to failure (MTTF), though achieving spectacular computing speed. One intangible cost of the lack of fault tolerance is the loss of computational power in high-performance computing. Even a small fraction of operational time lost to faults could easily amount to a performance loss approaching the capacity of a mainframe computer. The basic focus, therefore, becomes achieving fault tolerance without significant performance loss. Standard techniques that simply restart the task may be useless on these supercomputers, where a single task can require hours of uninterrupted computation.
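The remark about restarting long tasks can be made concrete with a standard reliability-modeling result, offered here as an illustration rather than as the book's own analysis. If failures follow a Poisson process with rate λ = 1/MTTF and a failed task restarts from scratch with no restart overhead, the expected time to finish a task that needs T hours of uninterrupted computation is (e^(λT) - 1)/λ. The function name and the MTTF values below are assumptions chosen only for this sketch.

```python
import math


def expected_completion_hours(task_hours, mttf_hours):
    """Expected wall-clock hours to finish a task under restart-from-scratch.

    Assumptions (illustrative, not from the book): exponentially distributed
    failures with rate 1 / MTTF, and zero overhead to restart after a failure.
    """
    rate = 1.0 / mttf_hours
    # expm1(x) computes e**x - 1 accurately even when x is small.
    return math.expm1(rate * task_hours) / rate


if __name__ == "__main__":
    task = 10.0  # hypothetical task needing 10 hours without interruption
    for mttf in (1000.0, 10.0, 5.0):  # hypothetical system MTTFs in hours
        hours = expected_completion_hours(task, mttf)
        print(f"MTTF {mttf:7.1f} h: expected completion {hours:6.2f} h")
```

Under these assumptions the expected completion time stays close to the task length while the MTTF is much larger than the task, and grows exponentially once the MTTF drops to the order of the task length. That growth is why plain restart stops being a viable recovery technique on machines with a poor MTTF.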

Another emerging field of computing is mobile computers. These systems require new solutions to their fault tolerance problems [3,4] because of the mobility consideration as well as low-weight and low-power constraints. Specifically, mobility requires a more dynamic environment for recovery, one not encountered in the traditional static environment.

Low-power considerations mean that simple replication techniques, so effective in traditional applications, cannot be used here. For example, battery-powered systems do not offer the flexibility of replication, as seen in Table 1. Assuming, for example, that the nominal capacity of a typical nickel-cadmium battery is 10 watt-hours, the MTTF of the widely used Triple Modular Redundancy (TMR) system is of little relevance in a mobile computing environment, because the mean operational time is now limited more by battery power than by intrinsic reliability. Novel fault-tolerant schemes [6] such as roll-forward techniques may, therefore, find applications. Consequently, what this book brings into focus are these various issues for both mature and emerging technologies.
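The battery argument can be checked with a few lines of arithmetic. The sketch below is an illustration under stated assumptions, not a reproduction of Table 1: the 10 watt-hour capacity comes from the text, while the 2 W per-module power draw and the failure rate of 1e-4 per hour are hypothetical values, and voter power is ignored. The TMR MTTF formula 5/(6λ) is the standard result for three identical modules with a perfect voter and no repair.

```python
def battery_limited_hours(capacity_wh, module_watts, replicas):
    """Hours of operation on one charge when `replicas` modules run at once."""
    return capacity_wh / (module_watts * replicas)


def simplex_mttf(failure_rate):
    """MTTF of a single module with constant failure rate: 1 / lambda."""
    return 1.0 / failure_rate


def tmr_mttf(failure_rate):
    """MTTF of TMR with a perfect voter and no repair: 5 / (6 * lambda)."""
    return 5.0 / (6.0 * failure_rate)


if __name__ == "__main__":
    capacity_wh = 10.0    # nominal battery capacity quoted in the text
    module_watts = 2.0    # hypothetical power draw of one module
    failure_rate = 1e-4   # hypothetical failures per hour per module

    for name, replicas, mttf in (
        ("simplex", 1, simplex_mttf(failure_rate)),
        ("TMR", 3, tmr_mttf(failure_rate)),
    ):
        runtime = battery_limited_hours(capacity_wh, module_watts, replicas)
        print(f"{name:8s} MTTF {mttf:9.1f} h, battery runtime {runtime:5.2f} h")
```

With these assumed numbers both configurations have an MTTF in the thousands of hours, yet the battery is drained within hours, and tripling the hardware cuts the time per charge to a third. The operational time is therefore governed by the battery rather than by the failure rate, which is the point the paragraph makes.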

An overview of the basic concepts is given in Chapter 1, followed by a discussion of various architectures in Chapter 2. Some of the newer fault tolerance issues in the environment of parallel and distributed computing are explored in Chapter 3. In-depth case studies of several major fault-tolerant architectures are offered in Chapter 4. Both experimental and analytical techniques are crucial to accurate prediction of reliability. Too often, the lack of good dependability evaluation tools handicaps fault-tolerant designers and practitioners. Chapters 5 and 6 treat the subject of dependability prediction and measurement techniques.

Chapter 7 reviews the ever-important issue of software reliability. Unlike hardware faults, all software faults are design and implementation errors. Because of this, a wide range of issues affects software reliability. This chapter focuses specifically on fault tolerance techniques, rather than the myriad of fault avoidance techniques. Chapter 8 concludes the book with the theory of system-level diagnosis, a popular subject of academic research that has recently received attention among practicing engineers. A set of problems is included with each chapter, as the book is intended both as a text for first-year graduate students and as a reference for practitioners.

An undertaking such as this one would not have been possible without the invaluable help and contributions from the leading experts, as well as from my students, Gavin Holland, Nitin Vaidya, and Nick Bowen.

References

[1] Pradhan, D. K., Fault-Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice Hall, 1986.

[2] Stiffler, J. J., private communication, 1992.

[3] Krishna, P., N. H. Vaidya, and D. K. Pradhan, "Recovery in Distributed Mobile Environments," IEEE Workshop on Advances in Parallel and Distributed Systems, October 1993, pp. 83-88.

[4] Imielinski, T., and B. R. Badrinath, "Wireless Computing: Challenges in Data Management," Comm. of the ACM, October 1994, pp. 19-28.

[5] Pradhan, D. K., and P. Banerjee, "Fault Tolerance in Multiprocessor and Distributed Systems," Chapter 3 (in this book).

[6] Pradhan, D. K., and N. H. Vaidya, "Roll-Forward and Rollback Recovery: Performance-Reliability Trade-Off," Annual International Symposium of Fault-Tolerant Computing, Austin, Texas, June 1994, pp. 186-195.


Publisher: Prentice Hall
Publication date: 1996
ISBN 10: 0130578878
ISBN 13: 9780130578877
Binding: Textbook Binding
Language: English
Edition number: 1
Number of pages: 560
