An Observation on the Intertwining of Safety and Security

The security of safety-related and safety-critical systems with components incorporating digital processing is becoming a major issue. We have seen partial control of a car taken from a remote location while the car was being driven. A major electricity outage in an East European country was caused by intrusion into the digital parts of control systems. Intrusions into the critical infrastructure of developed nations are becoming commonplace.

Many have observed that if a safety-critical system is not secure, then it also cannot be safe. Attempts have been made – and are being made – through standards organisations to write guidelines for a coordinated approach to safety and security, in particular for nuclear power plants (IEC 62859 is in the final stages of approval), safety of machinery (a new standardisation project is being proposed to the IEC), and digital automation (ISA 84.00.09 was published in 2013, and a new draft was prepared in 2016). The German electrotechnical standards organisation DKE is preparing general guidelines for industrial automation and control systems (IACS), based on the approach to safety of IEC 61508 and to IACS security of IEC 62443. I am aware of the contents of many of these documents, but to date only ISA 84.00.09:2013 has been published, and standards documents under development fall under internal confidentiality conditions.

There are two broad lines of thought concerning safety and security for systems containing digital components. One is that they can be adequately covered in separate standards. The other is that they are inevitably intertwined, to use the old Swartout-Balzer phrase.

I argue here that they are inevitably intertwined. I show that an intruder can induce unacceptable risk into system operation even when the safety functions are perfect: not compromised, and functioning exactly as specified.

The way that safety is handled according to IEC 61508:2010 is this (I shall use some technical terms without definition). System S consists of equipment under control (EUC) and a control system (EUCCS). A risk analysis of the operation of the EUC under EUCCS is performed, and proceeds as follows. The analysis first identifies all hazards (HazID), then assesses the severity of each hazard (HazAn; often severity is taken as the worst case, but theoretically expected-value could be used instead), then the probability Prob(H) of each hazard H is assessed (“Prob” denotes an aleatory probability in this note), and combined with H’s severity S(H) to form a per-hazard risk: Rsk(Prob(H), S(H)). Each identified hazard H is assigned a risk:

Risk(EUC, EUCCS, H) = Rsk(Prob(H), S(H))
This risk is then assessed for “acceptability”. The acceptability of a risk is a parameter coming from outside; it is taken to be determined by social processes outside the purview of IEC 61508. Let us call this determination AccRisk(H). If Risk(EUC, EUCCS, H) > AccRisk(H), then a safety function SF(H) must be implemented, as follows. SF(H) works in concert with the control system (EUCCS). Let us call the resulting control system EUCCS+SF(H). The risk must now be acceptable:

Risk(EUC, EUCCS+SF(H), H) < AccRisk(H)
(or Risk(EUC, EUCCS+SF(H), H) = AccRisk(H), but no one really expects equality, so I’ll stick with inequalities here and below). Since the risk of H has changed, and the risk is a function of Prob(H) and S(H), it follows that SF(H) must have caused either the initial Prob(H), or the initial S(H), or both, to have changed.

The safety requirement on SF(H) is the reliability required of SF(H). SF(H) may occasionally fail, but it must be reliable enough to ensure Risk(EUC, EUCCS+SF(H), H) < AccRisk(H). According to IEC 61508, safety requirements are exactly such reliability requirements on safety functions. (I shall not discuss here whether this is an adequate characterisation of system safety.) Safety requirements are thus per-safety-function, and safety functions are per-hazard. So each identified hazard H leads to at most one safety requirement (and to none if Risk(EUC, EUCCS, H) < AccRisk(H)).
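The per-hazard logic above can be sketched in a few lines of Python. Note that the combination function Rsk is taken here, purely for illustration, as probability times severity; IEC 61508 does not mandate any particular combination.

```python
# Illustrative sketch of the per-hazard IEC 61508 logic described above.
# Rsk(p, s) = p * s is an assumption of this sketch only; the standard
# leaves open how probability and severity are combined.

def rsk(prob, severity):
    """Combine hazard probability and severity into a risk figure."""
    return prob * severity

def needs_safety_function(prob_h, severity_h, acc_risk_h):
    """A safety function SF(H) is required iff the unmitigated risk of
    hazard H exceeds its acceptable risk AccRisk(H)."""
    return rsk(prob_h, severity_h) > acc_risk_h
```

For instance, a hazard with hourly probability 0.001 and severity 0.1 (expected deaths per occurrence) has risk 10⁻⁴; against an acceptable risk of 10⁻⁶, a safety function is required.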

Let us now (unrealistically) suppose that HazID and HazAn are perfect (and let us assume worst-case severity is used in HazAn). The hazards have been determined to be H(1), H(2), H(3), …, H(n), and their associated severities S(1), S(2), S(3), …, S(n). Safety functions have been designed and implemented: the control system is EUCCS+SF(H1)+SF(H2)+…+SF(Hn). Let us assume these safety functions are also perfect (= perfectly reliable as required).

No matter how an intruder may gain access to EUCCS+SF(H1)+SF(H2)+…+SF(Hn), they can have no effect on safety other than triggering one or more of the hazards, say H(k1), H(k2), …, H(ks), and thereby bringing about the severities associated with these hazards: S(k1), S(k2), …, S(ks). These hazards are protected by the safety functions SF(H(k1)), SF(H(k2)), …, SF(H(ks)), and the safety functions are designed to mitigate the risk of H(k1), and so on.

Recall that the safety functions are perfect. If SF(H(kp)) is designed to mitigate the severity of H(kp), then the intruder will not be able to cause more damage than S(H(kp)). However, Risk(H(kp)) has changed. Although severities have not changed, we have assumed the intruder is able to trigger hazards H(k1), …, H(ks), which entails that Prob(H(k1)), …, Prob(H(ks)), the likelihoods of those hazards occurring, are all no longer what they were assessed to be during the risk analysis. Indeed, with an active and competent attacker, it might be reasonable to assume they are all now equal to 1! The risk associated with H(kp) is no longer Rsk(Prob(H(kp)), S(kp)) but is Rsk(1, S(kp)).

The condition on which the requirements for SF(H(kp)) were determined was Rsk(Prob(H(kp)), S(kp)) and not Rsk(1, S(kp)). It is quite possible that SF(H(kp)), even though perfect, does not protect adequately against Rsk(1, S(kp)) according to the precepts of IEC 61508.
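With the same illustrative combination Rsk(p, s) = p·s (an assumption of this sketch, not of the standard), the shift the intruder causes is easy to quantify:

```python
# Illustrative: the risk basis shifts when an attacker can trigger the
# hazard at will.  Rsk(p, s) = p * s is an assumption for illustration.

def rsk(prob, severity):
    return prob * severity

prob_assessed = 0.001   # hazard probability assumed in the risk analysis
severity = 0.1          # severity S(H(kp)), unchanged by the intrusion

risk_assessed = rsk(prob_assessed, severity)  # basis for SF(H(kp))'s requirements
risk_attacked = rsk(1.0, severity)            # attacker triggers the hazard at will
# The requirement on SF(H(kp)) was derived from risk_assessed, which
# risk_attacked exceeds a thousandfold with these numbers.
```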

An example. Let us suppose that the acceptable risk of a system S has been determined somehow to be one human death in one million hours of operation. Let us suppose two hazards have been identified: H1 = explosion of equipment next to the control room; H2 = pervasive strong overcurrent in the control system’s console, sufficient to cause electrocution of an operator. Let us further suppose S(H1) = S(H2) = death of one person. Root causes of H1 and H2 have been determined: RC(H1) and RC(H2). We can assume these root causes are expressed in disjunctive normal form: RC(H1) = (these-conditions OR those-conditions OR these-other-conditions), and mutatis mutandis for RC(H2). Let us suppose there is one chance in ten that a death will result from H1, and one chance in 1,000 that a death will result from H2.

We can achieve the acceptable risk by distributing the risk between the hazards: we require that there is less than one chance in two that H1 will result in a death in one million hours of operation as well as less than one chance in two that H2 will result in a death in one million hours of operation. I take it to be obvious that this fulfils the acceptable-risk condition.

This entails we require that H1 doesn’t occur more than five times in one million hours of operation, equivalently that Prob(H1) < 0.000005, and H2 doesn’t occur more than five hundred times in one million hours of operation: Prob(H2) < 0.0005 (again ignoring the equality).
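The bounds can be checked with a little arithmetic (the numbers are those of the example):

```python
# Check the per-hazard occurrence bounds derived above.
HOURS = 1_000_000        # assessment interval: one million hours of operation
BUDGET = 0.5             # deaths allotted to each hazard per interval

p_death_h1 = 1 / 10      # chance that one occurrence of H1 kills someone
p_death_h2 = 1 / 1000    # likewise for H2

max_h1 = BUDGET / p_death_h1   # at most 5 occurrences of H1 per million hours
max_h2 = BUDGET / p_death_h2   # at most 500 occurrences of H2 per million hours

prob_h1_bound = max_h1 / HOURS   # Prob(H1) < 0.000005 per hour
prob_h2_bound = max_h2 / HOURS   # Prob(H2) < 0.0005 per hour
```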

Let us assume that RC(H1) occurs natively once every thousand hours, and similarly for RC(H2). In other words, Prob(EUC, EUCCS, H1) = Prob(EUC, EUCCS, H2) = 0.001, and each root cause occurs 1,000 times in one million hours. This leads to the safety requirements that SF(H1) must inhibit the consequences of RC(H1) in all but 5 of every 1,000 occurrences, and SF(H2) must inhibit the consequences of RC(H2) in all but 500 of every 1,000 occurrences, that is, in one out of every 2.
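The resulting reliability requirements on the safety functions follow directly:

```python
# Derive the required maximum failure rates of SF(H1) and SF(H2).
rc_per_million_hours = 1000   # RC(H1) and RC(H2) each occur once per 1,000 hours
allowed_h1 = 5                # H1 may occur at most 5 times per million hours
allowed_h2 = 500              # H2 may occur at most 500 times per million hours

sf1_max_failure = allowed_h1 / rc_per_million_hours   # SF(H1) may fail 5 in 1,000
sf2_max_failure = allowed_h2 / rc_per_million_hours   # SF(H2) may fail 1 in 2
```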

Suppose SF(H1) and SF(H2) are designed and built, perfectly, to do exactly that. We can expect about one death every million hours of system operation.

Suppose now that an intruder enters the system and invokes RC(H1) and RC(H2) hourly, a rate one thousand times the native rates. Assume the intruder does not affect the operation of SF(H1) or SF(H2) at all. Assume an infinite supply of pliable operators, immediately replaced, and instant repair of damage to the system. About 1,000 operators will die every million hours of operation under these conditions of intrusion. That is rather more than the acceptable risk of one death every million hours.
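Checking the arithmetic: hourly invocation is 1,000 times the native once-per-thousand-hours rate, and with the safety functions failing at exactly their designed rates the expected toll scales accordingly:

```python
# Expected deaths per million hours under intrusion: the intruder invokes
# each root cause hourly, while the (uncompromised, perfectly reliable)
# safety functions keep their designed failure rates.
HOURS = 1_000_000
rc_occurrences = HOURS        # RC(H1) and RC(H2): one invocation per hour

sf1_failure = 0.005           # SF(H1) lets 5 in 1,000 root causes through
sf2_failure = 0.5             # SF(H2) lets 1 in 2 root causes through
p_death_h1 = 1 / 10           # chance one occurrence of H1 kills someone
p_death_h2 = 1 / 1000         # likewise for H2

deaths = (rc_occurrences * sf1_failure * p_death_h1
          + rc_occurrences * sf2_failure * p_death_h2)
# About a thousand expected deaths per million hours, against an
# acceptable risk of one.
```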

This shows that, even with perfect safety functions, perfectly implemented, and not compromised, an intruder can theoretically induce unacceptable risk into system operation.

Then, of course, there is the possibility that the operation of the safety functions themselves may be compromised through intrusion.

One can argue that such a possibility is implicitly ruled out by IEC 61508. The reliability of safety functions is set: SIL 3 means such-and-such a reliability condition (Part 1, Tables 2 and 3) and no exception is allowed: there is no “escape” clause saying “fulfil Part 1 Table 2/3 except when somebody hacks the code of your SF.”

Practically, it is difficult to see how one might assure this if an SF contains vulnerabilities that have not been identified at development time. Ensuring that your SF contains no unknown vulnerabilities is the holy grail of security engineering (better said, one of them). But even if you reach this holy grail, the example above shows that you have not solved the safety problem: intrusion can change the safety requirements arbitrarily. A safety function with a requirement of, say, SIL 3, even if perfect and not compromised, may no longer be sufficient to guarantee acceptable risk of system operation under intrusion conditions.
