Fault, Failure, Reliability Definitions

OK, the discussion on these basic concepts continues (see the threads “Paper on Software Reliability and the Urn Model”, “Practical Statistical Evaluation of Critical Software”, and “Fault, Failure and Reliability Again (short)” in the System Safety List archive).

This is a lengthy-ish note with a simple point: the notions of software failure, software fault, and software reliability are all well-defined, although it is open what a good measure of software reliability may be.

John Knight has noted privately that in his book he rigorously uses the taxonomy of Avizienis, Laprie, Randell and Landwehr (IEEE Transactions on Dependable and Secure Computing 1(1):11-33, 2004, henceforth ALRL taxonomy), brought to the List’s attention by Örjan Askerdal yesterday, precisely to be clear about all these potentially confusing matters. The ALRL taxonomy is not just the momentary opinion of four computer scientists. It is the update of a taxonomy on which the authors had been working along with other members of IFIP WG 10.4 for decades. There is good reason to take it very seriously indeed.

Let me first take the opportunity to recommend John’s book on the Fundamentals of Dependable Computing. I haven’t read it yet in detail, but I perused a copy at the 23rd Safety-Critical Systems Symposium in Bristol last month and would use it were I to teach a course on dependable computing. (My RVS group teaches computer networking fundamentals, accident analysis, risk analysis and applied logic, and runs student project classes on various topics.)

The fact that John used the ALRL taxonomy suggests that it is adequate to the task. Let me take John’s hint and run with it.

(One task before us, or, rather, before Chris Goeker, whose PhD topic is vocabulary analysis, is to see how the IEC definitions cohere with ALRL. I could also add my own partial set to such a comparison.)

Below is an excerpt from ALRL on failure, fault, error, reliability and so forth, under the usual fair use provisions.

It should be clear that a notion of software failure as a failure whose associated faults lie in the software logic is well defined, and that a notion of software reliability as some measure of proportion of correct to incorrect service is also possible. What the definitions don’t say is what such a measure should be.
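To make the last point concrete, here is a minimal sketch of one candidate measure: the observed proportion of demands on which service was incorrect. It is only an illustration. ALRL does not prescribe this measure or any other, and the function name and the representation of outcomes as booleans are my own assumptions.

```python
from typing import Sequence


def observed_failure_proportion(outcomes: Sequence[bool]) -> float:
    """Return the fraction of observed demands that ended in incorrect service.

    outcomes[i] is True when the i-th demand received correct service and
    False when the delivered service deviated from correct service.
    This is one hypothetical measure among many possible ones.
    """
    if not outcomes:
        raise ValueError("need at least one observed demand")
    failures = sum(1 for correct in outcomes if not correct)
    return failures / len(outcomes)


# Four demands, one of which received incorrect service.
proportion = observed_failure_proportion([True, True, False, True])
```

Other choices are equally compatible with the definitions, for example time to first failure under a given operational profile. The definitions fix what is being counted, not how to count it.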

This contradicts Nick Tudor’s suggestion in a List contribution yesterday that “software does not fail ….. It therefore makes no sense to talk about reliability of software”. Nick has suggested, privately, that this is a common view in aerospace engineering. Another colleague has suggested that some areas of the nuclear power industry also adhere to a similar view. If so, I would respectfully suggest that these areas of engineering get themselves up to date on how the experts, the computer scientists, talk about these matters, for example ALRL. I think it’s simply a matter of engineering responsibility that they do so.

In principle you can use whatever words you want to talk about whatever you want. The main criteria are that such talk is coherent (doesn’t self-contradict) and that the phenomena you wish to address are describable. Subsidiary criteria are: such descriptions must be clear (select the phenomena well from amongst the alternatives) and as simple as possible.

I think ALRL fulfils these criteria well.

[begin quote ALRL]

The function of such a system is what the system is intended to do and is described by the functional specification in terms of functionality and performance. The behavior of a system is what the system does to implement its function and is described by a sequence of states. The total state of a given system is the set of the following states: computation, communication, stored information, interconnection, and physical condition. [Matter omitted.]

The service delivered by a system (in its role as a provider) is its behavior as it is perceived by its user(s); a user is another system that receives service from the provider. [Stuff about interfaces and internal/external states omitted.] A system generally implements more than one function, and delivers more than one service. Function and service can be thus seen as composed of function items and of service items.

Correct service is delivered when the service implements the system function. A service failure, often abbreviated here to failure, is an event that occurs when the delivered service deviates from correct service. A service fails either because it does not comply with the functional specification, or because this specification did not adequately describe the system function. A service failure is a transition from correct service to incorrect service, i.e., to not implementing the system function. …… The deviation from correct service may assume different forms that are called service failure modes and are ranked according to failure severities….

Since a service is a sequence of the system’s external states, a service failure means that at least one (or more) external state of the system deviates from the correct service state. The deviation is called an error. The adjudged or hypothesized cause of an error is called a fault. Faults can be internal or external of a system. ….. For this reason [omitted], the definition of an error is the part of the total state of the system that may lead to its subsequent service failure. It is important to note that many errors do not reach the system’s external state and cause a failure. A fault is active when it causes an error, otherwise it is dormant.

[Material omitted]

  • availability: readiness for correct service.
  • reliability: continuity of correct service.
  • safety: absence of catastrophic consequences on the
    user(s) and the environment.
  • integrity: absence of improper system alterations.
  • maintainability: ability to undergo modifications
    and repairs.

[end quote ALRL]
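The distinctions in the excerpt between a dormant fault, an active fault, an error and a service failure can be shown with a toy program. This is my own sketch, not anything from ALRL: the two-stage "system", the saturation limit and all names are hypothetical, chosen only so that each case occurs for some input.

```python
LIMIT = 10  # hypothetical saturation limit applied at the service interface


def correct_internal(x: int) -> int:
    """Internal state as the functional specification intends: 2 * x."""
    return 2 * x


def faulty_internal(x: int) -> int:
    """Internal state as implemented, with a fault in the logic."""
    if x >= 4:  # the fault: this branch adds a spurious 1
        return 2 * x + 1
    return 2 * x


def external(internal: int) -> int:
    """External state seen by the user: the internal value, saturated."""
    return min(internal, LIMIT)


def classify(x: int) -> str:
    """Classify one demand in the vocabulary of the excerpt."""
    error = faulty_internal(x) != correct_internal(x)
    failure = external(faulty_internal(x)) != external(correct_internal(x))
    if failure:
        return "failure"       # fault active, error reaches the external state
    if error:
        return "masked error"  # fault active, error never reaches the user
    return "dormant"           # fault present but not activated


results = {x: classify(x) for x in (3, 4, 5)}
```

For input 3 the fault is dormant and service is correct. For input 4 the fault is active, the erroneous internal state 9 reaches the user in place of 8, and that is a service failure. For input 5 the fault is active and the internal state is in error (11 rather than 10), but saturation masks it, so no failure occurs. In each case the fault lies in the software logic, which is why it makes sense to speak of a software failure.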