In this scenario, rule one is valid at both node x and node y; if rule one is reapplied at node x before it is applied at node y, the system returns to a safe condition. However, if the rule is applied at node y first, the system becomes unsafe; in addition, each of the lower-echelon nodes could illegitimately readjust its distance variable. Ghosh et al. (1996) summarize this circumstance, and the difficulties associated with fault containment in stabilizing algorithms, by confirming that "based on local knowledge, processes cannot unambiguously decide the extent of a fault" (p. 47).


Self-Stabilization and Fault Tolerance

The example presented illustrates the maturation of Dijkstra's (1974) original algorithms, focused on operating system design, into applications of fault tolerance and reliability on a broad scale. Two properties of self-stabilizing systems form a foundation for fault-tolerant systems: no need to be initialized, and recovery from transient failures (Schneider, 1993). Schneider (1993) goes on to offer a computer network as an example; it would be "unreasonable to stop a network and reinitialize it" in order to add or remove components (p. 48). While it may not have been Dijkstra's original intent to promote self-stabilization as a method of fault tolerance and reliability, the two concepts parallel one another.

Consider the currently accepted approach to fault tolerance in programs constructed with object-oriented languages: explicitly predict the fault and include an exception-handling mechanism that passes the fault back up through the call-chain hierarchy; hardly a robust archetype. Schneider (1993) points out that self-stabilization is the "common antidote" to this unpredictable situation: although "the abstract state of a program may be corrupted, the program or system itself is inviolable", and the functionality (its behavior) remains intact (p. 49). As an example, consider a Java program with a class that performs a calculation in memory, then returns a value to its caller. If the local memory is corrupted by outside influences while the calculation is being performed, the abstract state of the program fails, but the class itself, along with its original functionality, is still viable. If the calculation is performed against a protected, read-only memory space, the class will be able to return to a known state, and the properties of self-stabilization are applied in the form of fault tolerance.
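
As a minimal sketch of this recovery idea (my own illustration in Java, not code from Schneider), the class below keeps its inputs in an immutable field that stands in for the protected, read-only memory space; if the working copy is corrupted during the calculation, it is discarded and the calculation repeats from the known state.

    // Hypothetical sketch: a class that returns to a known state after its
    // working memory is corrupted. The names and the corruption check are
    // illustrative assumptions, not from the cited sources.
    public final class StabilizingCalculator {

        // Immutable inputs: the known, legitimate state.
        private final int[] inputs;

        public StabilizingCalculator(int[] inputs) {
            this.inputs = inputs.clone(); // defensive copy
        }

        /** Sums the inputs, recomputing if the working copy was corrupted. */
        public long calculate() {
            while (true) {
                int[] working = inputs.clone(); // scratch copy; may be corrupted
                long sum = 0;
                for (int value : working) {
                    sum += value;
                }
                // Verify against the protected state; retry from scratch on mismatch.
                if (java.util.Arrays.equals(working, inputs)) {
                    return sum;
                }
            }
        }
    }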


Self-Stabilization and the Internet

While grossly simplified, the spanning tree network example (reference Figure 1) also demonstrates the basic premise of Internet routing systems in use today (Adam and Stadler, 2004). Adam and Stadler (2004) offer that "Internet routing systems are well-studied examples of self-stabilizing systems" in that they "maintain a distributed state in a dynamic environment" (p. 66). An Internet routing system provides data forwarding (moving packets from input to output), computes the local state (maintaining a routing table), and disseminates updates (providing updates to other routing systems). This maps onto Dijkstra's (1974) framework, and suggests his original reservations about self-stabilization in a distributed environment were unfounded: data forwarding is the "local action", computing the local state relies on "local information", and the dissemination of updates accomplishes the "global objectives", respectively (p. 643). Adam and Stadler (2004) offer a pragmatic definition of self-stabilization as it relates to the Internet: no matter what catastrophic events, malicious attacks, corrupted data, or "sequence of events" occur, after the "defective equipment is disconnected", "the network will return to normal operation within a tolerable period of time (such as less than an hour)". Events on many occasions have borne this out (Roberts, 2004).
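
A loose illustration of "computing the local state" (a distance-vector-style merge, my own simplification rather than any actual Internet protocol) might look like this in Java:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative router: merges a neighbor's advertised distances into its
    // own routing table. Hypothetical names; real routing protocols such as
    // RIP, OSPF, or BGP are far more involved.
    public final class Router {

        // destination -> best-known distance from this router
        private final Map<String, Integer> routingTable = new HashMap<>();

        /** Merge an update from a neighbor reachable at cost linkCost. */
        public boolean onNeighborUpdate(Map<String, Integer> neighborTable, int linkCost) {
            boolean changed = false;
            for (Map.Entry<String, Integer> route : neighborTable.entrySet()) {
                int candidate = route.getValue() + linkCost;
                Integer known = routingTable.get(route.getKey());
                if (known == null || candidate < known) {
                    routingTable.put(route.getKey(), candidate); // better route found
                    changed = true;
                }
            }
            return changed; // if true, disseminate this table to neighbors in turn
        }
    }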


Self-Stabilization and Asynchronous Systems

Many software systems built today have a distributed architecture, use the Internet for transport, and rely on asynchronous messaging over location-neutral protocols such as http and https. A nascent class of technologies known as service-oriented architectures (SOAs) has emerged to provide the functionality these transport protocols lack: transactions, routing, and security (Greenfield and Short, 2004). Gartner (1999) describes this architecture as "the asynchronous system model" and defines it as a "model in which processors communicate by sending messages to one another with arbitrary delay" (p. 2).
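
The defining property, arbitrary delay, can be made concrete with a toy model (my own construction, not Gartner's): the sender can assume nothing about when, or in what order, messages arrive.

    import java.util.Random;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Toy asynchronous channel: each message is delivered after an
    // arbitrary (here, random) delay, so arrival order is not guaranteed.
    public final class AsyncChannel {

        private final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        private final Random random = new Random();

        /** Deliver a message after an arbitrary delay. */
        public void send(String message) {
            new Thread(() -> {
                try {
                    TimeUnit.MILLISECONDS.sleep(random.nextInt(5000));
                    inbox.put(message);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }

        /** Block until some message arrives. */
        public String receive() throws InterruptedException {
            return inbox.take();
        }
    }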

The World Wide Web Consortium (W3C) (2003) describes a service-oriented architecture as a "specific type of distributed system in which the agents are services" and goes on to offer that a service performs an operation that can be invoked "outside of the context of a larger application" (Section 1.6.2). The services that make up the distributed architecture are encapsulated; consumers of services need only know the interface and the data format. While this sounds strikingly similar to the tenets of the OMG's (2003) CORBA specification and Microsoft's (n.d.) DCOM, components that offer services in an SOA are not constrained to any specific technology; the machinery that powers the service component is also encapsulated. This class of technologies is often referred to as Web services (W3C, 2003).
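
In Java terms, a consumer of a service sees something like the following hypothetical contract (the interface and the data format) and nothing more; the machinery behind it is encapsulated.

    // Hypothetical service contract: consumers depend only on this interface
    // and its data format, never on the technology that implements it.
    public interface AuthenticationService {

        /** Authenticate a user; returns a session token on success. */
        String signOn(String userId, String credential);
    }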

Applying self-stabilization characteristics to service-oriented applications faces intrinsic temporal limitations. Gartner (1999) describes the properties that characterize a fault-tolerant distributed system as "safety and liveness" (p. 5). Safety is typically expressed as a system configuration, often referred to as the "invariant" (Gartner, 1999, p. 6). As an example, imagine a remote, distributed system in which, at any moment in time, only one server is to provide user authentication in response to requests. If for some reason two or three servers came online and began to offer sign-on authorizations, the system would be in an illegitimate state: unsafe.
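
Expressing that invariant as an explicit check (my own formulation of the example above) is almost trivial, which is part of the appeal of stating safety as a configuration:

    import java.util.Set;

    // Illustrative safety invariant for the example: the configuration is
    // legitimate only while exactly one authentication server is active.
    public final class SafetyInvariant {

        /** True when the system configuration is safe. */
        public static boolean holds(Set<String> activeAuthServers) {
            return activeAuthServers.size() == 1;
        }
    }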

Liveness in a distributed system ensures that the system will eventually make progress. To extend the authentication example, consider if that single point of authentication became unavailable due to a hardware or software fault. Liveness ensures that, eventually, sign-on requests will be serviced. The question becomes: how long can you afford to wait? Difficulties associated with ensuring self-stabilization in stateless, distributed architectures grow with the scale of the application. Leal and Arora (2004) confirm that for stabilization to occur in these systems "a reachability analysis must be performed on the state space"; each action must be verified to ensure it does not place the system in an illegitimate state, and checked to ensure it does not move the system away from the legitimate configuration (p. 1). In these asynchronous messaging systems, the messages in transit must be included when considering the global state of the system (Gartner, 1999).
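
A brute-force version of such a reachability analysis (a generic sketch under my own assumptions, not Leal and Arora's actual method) explores every state reachable from the initial states and verifies each against the invariant; the cost of this search is exactly what grows with the scale of the application.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import java.util.function.Function;
    import java.util.function.Predicate;

    // Illustrative reachability analysis over a state space.
    public final class ReachabilityCheck<S> {

        /** True if no reachable state violates the legitimacy predicate. */
        public boolean allReachableStatesLegitimate(Set<S> initialStates,
                                                    Function<S, Set<S>> actions,
                                                    Predicate<S> legitimate) {
            Set<S> visited = new HashSet<>(initialStates);
            Queue<S> frontier = new ArrayDeque<>(initialStates);
            while (!frontier.isEmpty()) {
                S state = frontier.remove();
                if (!legitimate.test(state)) {
                    return false; // some action reaches an illegitimate state
                }
                for (S next : actions.apply(state)) {
                    if (visited.add(next)) { // enqueue unseen states only
                        frontier.add(next);
                    }
                }
            }
            return true;
        }
    }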

Redundancy is one component of a possible solution, but only part of it. Gartner (1999) offers fault detection as the "key to safety", but cautions that it is "easy locally, but in distributed settings it [is] generally difficult" (p. 19). Again, consider the temporal aspect; a fault may occur that prevents a message from being delivered, and therefore prevents discovery of the fault, for relatively long periods. Similarly, correction is offered as one means of ensuring liveness, using coordinated correction and consensus algorithms (Gartner, 1999). The consensus may entail a system reset, which is unacceptable and impractical in many situations. Others have theorized that explicit detector and corrector technologies can be developed to address the fault detection and correction issues, respectively (Leal and Arora, 2004). This approach involves software detectors that inspect components and block any moves into illegitimate states. Correctors not only inspect the state but also change it, hence the nomenclature. This work has shown promise in DARPA experiments, but has yet to face the rigors and realities of the commercial marketplace. Consider that the paradigm shift to Web services over http was motivated in part by the disadvantages of single-vendor, proprietary solutions.
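
The detector/corrector division of labor can be sketched as follows (the interfaces and names here are my own illustration of the idea, not Leal and Arora's actual framework):

    import java.util.function.Predicate;
    import java.util.function.UnaryOperator;

    // Sketch of a guarded component: the detector inspects proposed states
    // and blocks moves into illegitimate ones; the corrector repairs a bad
    // state rather than merely rejecting it.
    public final class GuardedComponent<S> {

        private final Predicate<S> detector;      // recognizes legitimate states
        private final UnaryOperator<S> corrector; // maps a bad state to a good one
        private S state;

        public GuardedComponent(S initial, Predicate<S> detector, UnaryOperator<S> corrector) {
            this.state = initial;
            this.detector = detector;
            this.corrector = corrector;
        }

        /** Apply a move; correct the result if it would be illegitimate. */
        public void move(UnaryOperator<S> action) {
            S proposed = action.apply(state);
            if (!detector.test(proposed)) {
                proposed = corrector.apply(proposed); // corrector changes the state
            }
            state = proposed;
        }

        public S state() {
            return state;
        }
    }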


Summary

Dijkstra (1974) resisted the urge to expand the scope of his original article on self-stabilization to include distributed daemon controllers. After reviewing the inherent complications of distributed applications for self-stabilization and fault tolerance, we see that Dijkstra (1974) was a very smart man. He provided a proof in the context of mutual exclusion; that is, one eminent process, with the rest identical. An alternative approach would be to ensure that all of the processes in the domain are unique. As the scope of the solution grows, the latter approach becomes exponentially more difficult.

Schneider (1993) summarizes the disparity by stating, "in general, systems asymmetric by state (only) cannot be self-stabilizing, while those that are asymmetric by identity can be". In this brief article, we have illustrated examples of asymmetric-by-identity systems: the simple spanning tree network and, on a more grandiose scale, the Internet, each of which uses identical programs parameterized by a local id or has machines that are not all identical. In contrast, we also discussed distributed, service-oriented architectures that follow the general pattern of identical machines (stateless Web services), with local states encompassing the various messages in the queue, effectively unpredictable. In addition, we discussed recent innovations in the field of self-stabilization for distributed applications.

Dijkstra's (1974) work on self-stabilization was visionary, withstanding the test of time. It is very likely that thirty years from today, it will be referenced as the seminal work that supports fault tolerance in asynchronous, service-oriented software systems.


References
Adam, C. & Stadler, R. (2004, April). Patterns for routing and self-stabilization. Network Operations and Management Symposium, NOMS 2004, 61-74.

Dijkstra, E. W. (1974, November). Self-stabilization in spite of distributed control. Communications of the ACM, 17(11), 643-644.

Gartner, F. C. (1999, March). Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys, 31(1), 1-26.

Ghosh, S., Gupta, A., Herman, T., & Pemmaraju, S. (1996). Fault-containing self-stabilizing algorithms. Fifteenth ACM Symposium on Principles of Distributed Computing, Philadelphia, PA.

Greenfield, J. & Short, K. (2004). Software Factories. Indianapolis, IN: Wiley.

Leal, W. & Arora, A. (2004). Scalable self-stabilization via composition. Twenty-Fourth International Conference on Distributed Computing Systems, Tokyo.

Microsoft Corporation (n.d.). Distributed Component Object Model (DCOM): Downloads, specifications, samples, papers, and resources for Microsoft DCOM. Retrieved October 8, 2003, from http://www.microsoft.com/com/tech/DCOM.asp

Object Management Group (2003). Welcome to the OMG's CORBA website. Retrieved October 8, 2003 from http://www.corba.org/

Roberts, P. (2004, June 15). Update: Akamai blames 'global DNS attack' for disruptions. Computerworld. Retrieved October 11, 2004, from http://www.computerworld.com/developmenttopics/websitemgmt/story/0,10801,93837,00.html

Schneider, M. (1993, March). Self-stabilization. ACM Computing Surveys, 25(1), 45-67.

World Wide Web Consortium (2003, August). Web services architecture: Working draft 8. Retrieved October 7, 2004, from http://www.w3.org/TR/2003/WD-ws-arch-20030808/



Self-Stabilizing Software: Applications


John R. Snyder, November 2004

Dijkstra's (1974) article in Communications of the ACM was only two pages long, but it has had a lasting influence on the software industry. Thirty years after that diminutive work, we realize Dijkstra's (1974) statement of relevance to a "worldwide network" (p. 643) was prophetic. The widespread acceptance of that network we know as the Internet started another innovation curve, leading to the popularity of distributed, asynchronous systems. On the peak of this curve we have seen a renewed interest in self-stabilization as a method of achieving fault tolerance, attention fueled by the success of self-stabilizing Internet routing systems (Schneider, 1993). Moreover, our society has developed a dependence on computer systems that are fault-tolerant (Gartner, 1999).

This paper surveys applications of self-stabilization to typical technologies in use today, emphasizing self-stabilization as a pragmatic means to fault tolerance. It is organized as a brief review of self-stabilization concepts, the connection between self-stabilization and fault tolerance, an analysis of applicability to the Internet and to asynchronous messaging systems, and a summary of the discussions.


Self-Stabilization Concepts

Dijkstra (1974) envisioned a system that could return itself to a "legitimate" state, but was not confident of being able to prove the concept universally; he limited his initial analysis to a ring of finite-state machines (p. 643). In this connected graph, each node has a privilege, described as a boolean function of its own state and the states of its neighbors. Dijkstra (1974) offers that "when such a boolean function is true, we say that the privilege is present" (p. 643). Schneider (1993) summarizes Dijkstra's (1974) definition of a self-stabilizing system: "regardless of the initial state, it is guaranteed to arrive at a legitimate state in a finite number of steps" (p. 45). While this sounds simple on the surface, consider that in a disconnected, stateless environment the local node may not have knowledge of the global system, and therefore could move into a fault condition without some intervention. Dijkstra (1974) limited the scope of his solution and proof to a central controller in the same process space (that is, non-distributed) as the nodes it supervises. Dijkstra (1974) introduced a "central daemon" to instruct the node algorithms so that the system condition can either be updated or returned to a known, safe condition (note that Schneider (1993) proposed using the terms "safe and unsafe interchangeably with legitimate and illegitimate, respectively" (p. 47)).
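
In code, a privilege is simply a predicate; the following Java interface (my notation, not Dijkstra's) captures the shape of the definition:

    import java.util.List;

    // A privilege as a boolean function of a node's own state and the
    // states of its neighbors; the node may take a step only while the
    // privilege is present. Types are illustrative.
    public interface Privilege<S> {

        /** True when the privilege is present. */
        boolean present(S ownState, List<S> neighborStates);
    }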

Ghosh, Gupta, Herman, and Pemmaraju (1996) offer a simple example of the need for, as Dijkstra (1974) stated, a "global criterion" to ensure changes to local states are legitimate. Ghosh et al. (1996) present a spanning-tree network with a process r as the root. Each process i (i ≠ r) has two variables, (p, l), where p identifies the parent in the tree and l denotes the distance in the tree between the process and the root. The root process has a single variable set to zero, and each non-root process perpetually checks its variables, adjusting them with the following rules (derived from Dijkstra's (1974) original article; a sketch of the rules in code follows the list):

  • rule one: if l < n, set l to one greater than its parent's l ;
  • rule two: if l ≥ n, set p to a new parent.
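
A sketch of these two rules in Java follows, assuming n is the number of processes and that each non-root process can read its parent's and neighbors' distance values; parent selection under rule two is simplified to "adopt any neighbor offering a legal distance".

    import java.util.List;

    // Illustrative non-root process in the spanning-tree example.
    public final class TreeProcess {

        private TreeProcess parent;  // p: current parent in the tree
        private int distance;        // l: believed distance to the root
        private final List<TreeProcess> neighbors;
        private final int n;         // number of processes in the system

        public TreeProcess(TreeProcess parent, int distance,
                           List<TreeProcess> neighbors, int n) {
            this.parent = parent;
            this.distance = distance;
            this.neighbors = neighbors;
            this.n = n;
        }

        /** One check of the two rules; each non-root process runs this perpetually. */
        public void step() {
            if (distance < n) {
                distance = parent.distance + 1;       // rule one
            } else {
                for (TreeProcess candidate : neighbors) {
                    if (candidate.distance < n - 1) { // candidate offers a legal distance
                        parent = candidate;           // rule two: set p to a new parent
                        break;
                    }
                }
            }
        }
    }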


A spanning tree example, in a legitimate state based on the two rules just presented, is represented in the following figure.

[Figure 1: A spanning tree network in a legitimate (safe) state.]

Now consider a fault condition in this network that causes one of the distance variables to change its value such that rule one is applicable in more than one location, creating an illegitimate (unsafe) condition. This is illustrated by the next figure.