A Resilience Engineering Approach for the Risk Assessment of IT Services

Article published in Applied Sciences (MDPI).

Authors: Mario Fargnoli and Luca Murgianu

Faculty of Technological and Innovation Sciences, Universitas Mercatorum – 00186 Rome, Italy

Abstract:

Nowadays, services related to IT technologies have assumed paramount importance in most sectors, creating complex systems involving different stakeholders. Such systems are subject to unpredictable risks that differ from what is usually expected and cannot be properly managed using traditional risk assessment approaches.

Consequently, ensuring their reliability represents a critical task for companies, which need to adopt resilience engineering tools to reduce the occurrence of failures and malfunctions. With this goal in mind, the current study proposes a risk assessment procedure for cloud migration processes that integrates the application of the Functional Resonance Analysis Method (FRAM) with tools aimed at defining specific performance requirements for the suppliers of this service. In particular, the Critical-To-Quality (CTQ) method was used to define the quality drivers of the IT platform customers, while technical standards were applied to define requirements for a security management system, including aspects relevant to the supply chain. Such an approach was verified by means of its application to a real-life case study, which concerns the analysis of the risks inherent to the supply chain related to cloud migration. The results achieved can contribute to augmenting knowledge in the field of IT systems’ risk assessment, providing a base for further research.

Keywords: resilience engineering; IT systems; cloud migration; risk assessment; functional resonance analysis method (FRAM); critical-to-quality (CTQ); IT services supply chain; security management systems

1. Introduction

Recently, the concept of resilience has been applied to numerous contexts to stress the role of adaptability and variability in dealing with the events that have characterized our lives in recent years, such as the COVID-19 pandemic, the energy crisis, wars, climate change, etc. As stressed by Lay et al., factors such as complexity, ambiguity, constrained resources, and uncertainty are increasingly shaping our lives, requiring adaptability and adaptive behaviors to face the variety of perturbations that might occur in real systems. While at a general level, the definition of resilience proposed by the United Nations can illustrate its broad application in the case of adverse situations (“The ability of a system, community or society exposed to hazards to resist, absorb, accommodate to, and recover from the effects of a hazard in a timely and efficient manner, including through the preservation and restoration of its essential basic structures and functions”), in the engineering context the research approach proposed by Hollnagel et al. can be considered one of the most widely accepted, describing resilience as “The intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions”.
Resilience represents a key factor in enhancing risk assessment, allowing the shift from traditional risk analysis, mainly based on a control-centric approach, to a more modern concept, which is usually called “Safety II” and relies on the acknowledgment that both acceptable and adverse outcomes are based on everyday performance adjustments.

Actually, as stressed by Farooqi et al., traditional risk analysis tools belonging to the “Safety I” approach, such as Failure Modes and Effect Analysis (FMEA), Fault Tree Analysis (FTA), and Event Tree Analysis (ETA), provide a bimodal perspective of work activities, according to which positive or negative outcomes are a consequence of different systems’ modes of functioning. Hence, despite the advantages in terms of usability, Safety I tools fail to capture the complexity and variability of human performances and related activities.
Accordingly, failures should be considered a result of the everyday variability of human performance rather than unique events. Hence, adverse outcomes are due not only to failures and malfunctions but also to performance variability.

In order to put into practice this novel approach, the Functional Resonance Analysis Method (FRAM) is one of the most widespread tools for modeling causal factors of accidents (i.e., what goes wrong) and the behavior of sociotechnical systems. This method is currently considered one of the few structured “Safety II” tools used in the industrial context. As pointed out by Hollnagel, FRAM allows safety managers to describe the functions of a complex system by characterizing their potential variability and the functional resonance based on dependencies and couplings among them. More in detail, FRAM is aimed at analyzing how a certain socio-technical system works by describing it not in terms of components but as a set of functions that represent the “work-as-done”. In this way, it is possible to continuously adjust the system’s performance, allowing things to go right; conversely, when the performance variability leads to unexpected outcomes, which are described as functional resonance, things go wrong and accidents may occur.

In recent literature, numerous studies have investigated the implementation of FRAM to enhance resilience in different domains. Delikhoon et al. observed that the reduction of unplanned chains of events and losses can be achieved by a proactive management of risks, which can enhance the quality of performance and productivity.
Accordingly, most studies on resilience engineering focused on the risk of safety accidents involving modern industrial-technological systems, whose complexity makes reactive risk assessment approaches unsuitable for comprehending how and why accidents occur.
This is in line with the findings by Patriarca et al., who observed that the majority of FRAM applications concern safety problems in manufacturing, transportation, or power plant contexts, while the use of such an approach in business and operational research domains is still limited. In particular, there is a need to analyze and manage the so-called “black swans”, i.e., unpredictable risks that differ from what is usually expected and are very difficult to envisage. For example, due to their unpredictability, human factors are considered a potential source of errors and malfunctions, requiring a proactive approach to ensure effective safety management.
To deal with them, as stressed by Aven, further research is needed to develop risk assessment procedures tailored for the management of “critical infrastructures”, whose vulnerability is related to threats due to human actions that cannot be properly managed using traditional risk assessment approaches, as in the case of security infrastructures and the reliability of IT services. It must be noted that also in the latter context, recent cases of failures and malfunctioning have caused significant problems: for example, the fire that seriously damaged Europe’s largest cloud services firm, OVHcloud, in 2021, or the outage that occurred in 2021, which involved one of the major Content Delivery Network (CDN) providers. Similarly, several authors pointed out the need to use a resilience engineering approach to augment risk analysis of IT services in the healthcare domain, where system failure must be rigorously avoided. Moreover, it must be noted that IoT technologies attract great interest from cybercriminals; consequently, as observed by Zhou et al., the potential for significant damage to an enterprise network can be substantial. All these occurrences also show the unpredictable nature of risks that might affect IT services and cloud computing.
The need for a proper risk assessment approach in this sector was pointed out by several studies, which mainly outlined the provision of risk assessment frameworks capable of dealing with the variability, unpredictability, and vulnerability of these complex systems.
In the IT context, on the one hand, numerous efforts have been made in recent years to improve risk assessment approaches that can guarantee cloud security, as reported, for example, by the MEDINA project financed by the European Union. On the other hand, most risk assessment tools in this domain rely on a reactive approach, while only a few of them take into account supply chain risks and the elicitation of software requirements for complex systems. Hence, a “Safety II” approach is needed to explain performance variability in such a context, and FRAM appears to be the proper tool for systemic safety assessment in this complex and dynamic domain.
Based on the above considerations, the current study aims to reduce this research gap by augmenting knowledge of the risk management of IT services by means of a resilience engineering approach through the application of FRAM to the case of cloud migration. In particular, a risk assessment framework was developed to provide a structured approach based on FRAM that can be used at an operational level. We think this framework can help address the lack of practical tools for applying functional resonance and a multidisciplinary approach to the requirements elicitation of IT services.

In more detail, the remainder of the paper is organized as follows:
The next section concerns the study’s materials and methods and is divided into two parts.
In the first part, an analysis of FRAM is presented in order to better highlight its main features; in the second part, the framework of the proposed research approach is illustrated to deal with the IT supply chain criticalities.
Then, in Section 3, the application of such an approach to a real-world case study concerning the analysis of the risks inherent to the supply chain related to cloud migration is reported, and the results achieved are provided.
Section 4 discusses the research outcomes, while Section 5 concludes the paper by addressing future work.

2. Materials and Methods

Recent literature reports numerous tools for evaluating and improving the resilience of socio-technical systems.

Relying on the tier-based approach for the classification of these tools, which was proposed by Linkov et al., three different tiers can be distinguished based on the increasing level of complexity of the system and information needed to deal with:

  • Tier I, which consists of a screening level where the main properties of the system are identified and prioritized;
  • Tier II, where the description of the system structure is defined, and bottlenecks are identified;
  • Tier III, which includes the modeling of the interactions between the sub-systems, so that different scenarios can be analyzed to verify the system’s performance under uncertainties.

Usually, Tier II tools are sufficient to provide information about the system, allowing decision-makers to properly improve the system’s resilience, while the third level is hardly ever implemented due to the high level of information and resources needed for the analysis. Among the tools belonging to Tier II, FRAM is certainly one of the most widespread, as it can provide information on the system structure and its components by means of a systemic analysis approach.

2.1. The Functional Resonance Analysis Method (FRAM)

In brief, the method’s scheme and functioning can be summarized in the following main phases:

1. Definition of the goal of the analysis;
2. Identification and definition of the functions;
3. Definition of the variability of each one of the functions;
4. Variability aggregation;
5. Identification of possible solutions.

In the first phase, the objective of the analysis is defined, such as risk assessment or hazard analysis, as well as safety management.
The second step concerns the description of the tasks/functions that allow the system’s functioning. These include human, technological, and organizational activities. Based on the taxonomy provided by Hollnagel, each function is characterized by the following aspects:

  • Input (I), representing the function’s starter or transformer;
  • Output (O), which is the result of the function’s transformation;
  • Precondition (P), i.e., what should happen to allow the function’s transformation;
  • Resource (R), i.e., what is needed for the function’s activation or transformation in order to achieve the output;
  • Time (T), representing the time constraints that can affect the function;
  • Control (C), i.e., the control and monitoring modes of the function.

These aspects are usually graphically related to each function by means of a hexagon, as schematized in Figure 1: they represent a certain status of the system, not an activity.

Figure 1. Scheme of a hexagonal representation of a function in the FRAM method.
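
As an illustration only, the six aspects listed above could be recorded in a small data structure such as the following Python sketch. The class name, field names, and the cloud-migration example values are hypothetical and are not taken from the study; each aspect is simply held as a set of free-text labels.

```python
from dataclasses import dataclass, field


@dataclass
class FramFunction:
    """One FRAM function described by its six aspects (I, O, P, R, T, C)."""

    name: str
    inputs: set[str] = field(default_factory=set)         # I: what starts or is transformed
    outputs: set[str] = field(default_factory=set)        # O: result of the function
    preconditions: set[str] = field(default_factory=set)  # P: what must hold beforehand
    resources: set[str] = field(default_factory=set)      # R: what is needed or consumed
    time: set[str] = field(default_factory=set)           # T: time constraints
    control: set[str] = field(default_factory=set)        # C: control and monitoring modes

    def aspects(self) -> dict[str, set[str]]:
        """Return the aspects keyed by their usual one-letter codes."""
        return {
            "I": self.inputs,
            "O": self.outputs,
            "P": self.preconditions,
            "R": self.resources,
            "T": self.time,
            "C": self.control,
        }


# Hypothetical function from a cloud-migration scenario (illustrative values).
backup = FramFunction(
    name="Back up source data",
    inputs={"migration request approved"},
    outputs={"verified backup"},
    resources={"storage capacity"},
    time={"maintenance window"},
    control={"backup policy"},
)
```

An aspect that is not relevant for a given function is simply left empty, as with the preconditions in the example.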

It must be noted that if a system is characterized by a sequence of activities, each step of such a sequence can be identified as a function. Conversely, a more specific analysis is needed to bring to light elementary actions/tasks characterizing the system, e.g., by means of tools such as the Hierarchical Task Analysis (HTA) technique that allows the decomposition of complex tasks. In such a context, it must be noted that while the term “task” refers to planned working activities (i.e., “work-as-imagined”), the term “activity” identifies their practical execution (i.e., “work-as-done”).

The goal of the third step is to define the variability of each one of the functions/tasks that characterize the system. In other terms, the analysis is focused on ascertaining the variability of function output, which is usually classified as:

  • Endogenous or internal variability;
  • Exogenous or external variability;
  • Influenced by upstream functions, i.e., when a functional upstream-downstream coupling occurs.

Based on this, following the taxonomy proposed by Hollnagel, the output variability of each function can be estimated considering both time and precision, as reported in Table 1, where the functions (Fu) are classified into the three categories: human (M), technological (T), and organizational (O).
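
To make the two dimensions more concrete, the sketch below records a time and a precision judgment for each function output. The category labels follow the simple characterization commonly associated with Hollnagel's method and are an assumption here, since they are not listed in the text; the function names and the screening rule in `needs_attention` are hypothetical and are not the criteria used in the study.

```python
from enum import Enum
from typing import NamedTuple


class Timing(Enum):
    # Assumed timing categories for a function's output.
    TOO_EARLY = "too early"
    ON_TIME = "on time"
    TOO_LATE = "too late"
    NOT_AT_ALL = "not at all"


class Precision(Enum):
    # Assumed precision categories for a function's output.
    PRECISE = "precise"
    ACCEPTABLE = "acceptable"
    IMPRECISE = "imprecise"


class OutputVariability(NamedTuple):
    function: str
    kind: str  # "M" (human), "T" (technological), or "O" (organizational)
    timing: Timing
    precision: Precision


def needs_attention(v: OutputVariability) -> bool:
    """Hypothetical screening rule: flag outputs that are not on time or are imprecise."""
    return v.timing is not Timing.ON_TIME or v.precision is Precision.IMPRECISE


# Illustrative assessments for three made-up functions.
assessments = [
    OutputVariability("Plan migration", "O", Timing.ON_TIME, Precision.ACCEPTABLE),
    OutputVariability("Transfer data to cloud", "T", Timing.TOO_LATE, Precision.PRECISE),
    OutputVariability("Validate migrated data", "M", Timing.ON_TIME, Precision.IMPRECISE),
]

flagged = [v.function for v in assessments if needs_attention(v)]
```

In a real analysis, the judgment for each function would come from the analysts and the evaluation criteria adopted, not from a fixed rule of this kind.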

Then, for each function, the variability can be evaluated following the criteria proposed in Table 2.

The variability aggregation step aims at verifying if and where the functional resonance can occur by means of the analysis of the upstream-downstream couplings, which allows us to understand how the output of each function can be related to the different aspects of the others. At a general level, five different types of upstream-downstream couplings should be considered: between input and output; between output and preconditions; between output and resources; between output and control; and between output and time.
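
The search for upstream-downstream couplings described above can be sketched as a simple label-matching exercise: an output of one function is coupled to another function whenever the same label appears among that function's input, precondition, resource, control, or time aspects. The following Python sketch assumes that aspects are written as plain text labels; the three functions and their labels are hypothetical and do not come from the case study.

```python
# Each function maps aspect codes (I, O, P, R, T, C) to sets of labels.
# All names and labels below are illustrative assumptions.
functions = {
    "Back up source data": {
        "I": {"migration approved"},
        "O": {"verified backup"},
    },
    "Transfer data to cloud": {
        "I": {"verified backup"},
        "O": {"data in cloud"},
        "R": {"network bandwidth"},
        "T": {"maintenance window"},
    },
    "Validate migrated data": {
        "I": {"data in cloud"},
        "P": {"verified backup"},
        "O": {"validation report"},
        "C": {"acceptance criteria"},
    },
}

# Aspects of a downstream function that an upstream output can feed.
DOWNSTREAM_ASPECTS = ("I", "P", "R", "C", "T")


def find_couplings(functions):
    """Return (upstream, label, downstream, aspect) for every matching label."""
    couplings = []
    for up_name, up in functions.items():
        for label in sorted(up.get("O", set())):
            for down_name, down in functions.items():
                if down_name == up_name:
                    continue  # a function is not coupled to itself here
                for aspect in DOWNSTREAM_ASPECTS:
                    if label in down.get(aspect, set()):
                        couplings.append((up_name, label, down_name, aspect))
    return couplings


couplings = find_couplings(functions)
for up_name, label, down_name, aspect in couplings:
    print(f"{up_name} --[{label}]--> {down_name} ({aspect})")
```

Listing the couplings in this way only shows where variability could propagate; judging whether resonance actually occurs still requires the variability assessment of each function involved.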

The last step consists of the definition of the consequences of the variability in order to implement possible measures aimed at reducing the variability of the function’s performance (in the case of negative variability) or at augmenting it when the variability is considered positive for the system’s functioning.

Accordingly, based on the taxonomy provided by Hollnagel, six different solution types can be implemented:

  • Elimination;
  • Prevention;
  • Facilitation;
  • Protection;
  • Monitoring;
  • Dampening.

2.2. Research Approach

The migration of an IT system is considered a high-risk process because only when the process is almost completed is it possible to understand the relationships between the data managed and the related trade-off between the cost, effort, risk, and time on the one hand and the value creation on the other.
