Anand Tamboli® - A Systematic Approach to Risk Mitigation

Risk mitigation is not an easy task, and it is even harder when there is no quantifiable measure attached to it. Today, we will discuss a systematic approach to risk mitigation. This method also helps establish quantitative measures that you can use to assess the progress of risk mitigation activities.

Pre-mortem Analysis

The term project pre-mortem first appeared in the HBR article written by Gary Klein in September 2007. As Gary writes, “A pre-mortem is the hypothetical opposite of a post-mortem”.

Generally speaking, it is a project management strategy in which a project team imagines a project failure and works backward to determine what could potentially lead to that failure. The team then uses this analysis to handle risks upfront.

However, in the risk management context, I am going to use the term pre-mortem interchangeably with a more sophisticated, engineering-oriented methodology known as Failure Mode and Effects Analysis (FMEA).


Critical parts of the pre-mortem analysis

Pre-mortem analysis (or FMEA) is typically done by a cross-functional team of subject matter experts (SMEs). The exercise works best when it is run as a workshop.

During the workshop, the team thoroughly analyses the design and the process being implemented or changed. The primary objective of this activity is to find weaknesses and risks associated with every aspect of the product, the process, or both. Once you identify these risks, take actions to control and mitigate them, and verify that everything is in control.

A pre-mortem analysis record has 16 columns, as explained below:

1. Process step or system function: This column briefly outlines the function, step, or an item you’re analyzing. In a multistep process or a multi-function system, there would be several rows, each outlining those steps.

2. Failure mode: For each step listed in column # 1, you can identify one or more failure modes. It is an answer to the question: in what ways can the process or solution fail? Or, what can go wrong?

3. Failure effects: In case of a failure identified in column # 2, what are its effects? How can the failure affect key process measures, product specifications, customer requirements, or customer experience?

4. Severity: This column lists the severity rating of each failure listed in column # 2. Use the failure effects listed in column # 3 to determine the rating. The typical severity scale is 1 to 10, where 1 is the least severe and 10 is the most severe consequence.

5. Root cause: For each failure listed in column # 2, root cause analysis is done to answer the question: what will cause this step or function to go wrong?

6. Occurrence: This column is another rating, based on the frequency of failure. How frequently is each of the failures listed in column # 2 likely to occur? Occurrence is ranked on a scale of 1 to 10, where 1 is a low occurrence and 10 is a high or frequent occurrence.

7. Controls: An answer to the question—What controls are in place currently to prevent potential failure as per column # 2? What controls are in place to detect the occurrence of a fault, if any?

8. Detection: Another rating column, where the ease of detecting each failure is assessed. Typical questions to ask are: How easy is it to detect each of the potential failures? What is the likelihood that you can discover these potential failures promptly, or before they reach the customers? Detection is also ranked from 1 to 10, but note that the scale is reversed: a rating of 1 means an easily and quickly detectable failure, whereas 10 means unlikely or very late detection.

Late detection usually means a more problematic situation, which is why it gets the higher rating.

9. RPN (Risk Priority Number): The risk priority is determined by multiplying the three ratings from columns # 4, 6, and 8. So, RPN = Severity x Occurrence x Detection. A high RPN therefore indicates a high-risk process step or solution function (as in column # 1), and steps or functions with a higher RPN warrant immediate attention.

10. Recommended actions: In this column, SMEs recommend one or more actions to handle the risks identified. These actions may be directed towards reducing the severity, reducing the chances of occurrence, improving the detection level, or all of the above.

11. Action owner, target date, etc.: This column is essential from the project management point of view as well as for tracking. Each recommended action can be assigned to a specific owner and carried out before the target date to contain the risks.

12. Actions taken: This column lists all the actions actually taken, whether recommended or not, to bring the risk level (RPN) down to an acceptable level or lower.

13. New severity: Once the actions listed in column # 12 are complete, the same exercise must be repeated to arrive at a new level of severity.

14. New occurrence: Depending on the completed actions, the occurrence is likely to have changed, so this column holds the new occurrence rating.

15. New detection: Risk mitigation actions are likely to have changed detection, too. Record the new rating in this column.

16. New RPN: With the change in severity, occurrence, and detection ratings, the risk level will have changed. A new RPN is calculated in the same way (Severity x Occurrence x Detection) and recorded in this column.
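To make the rating columns concrete, one row of the record can be sketched as a small Python data structure. This is only an illustration under stated assumptions, not the template mentioned in this article: the class name `FailureModeRecord`, its field names, and the sample ratings are hypothetical. Only the formula RPN = Severity x Occurrence x Detection and the 1 to 10 scale come from the text.

```python
from dataclasses import dataclass
from typing import Optional


def check_rating(name: str, value: int) -> None:
    """Reject ratings outside the 1 to 10 scale."""
    if not 1 <= value <= 10:
        raise ValueError(f"{name} must be between 1 and 10, got {value}")


@dataclass
class FailureModeRecord:
    """One row of a pre-mortem record (hypothetical field names)."""
    step: str          # column 1
    failure_mode: str  # column 2
    severity: int      # column 4
    occurrence: int    # column 6
    detection: int     # column 8 (1 = easy to detect, 10 = hard or late)
    new_severity: Optional[int] = None    # column 13
    new_occurrence: Optional[int] = None  # column 14
    new_detection: Optional[int] = None   # column 15

    def __post_init__(self) -> None:
        # Validate the three initial ratings when the row is created.
        for name in ("severity", "occurrence", "detection"):
            check_rating(name, getattr(self, name))

    @property
    def rpn(self) -> int:
        """Column 9: Severity x Occurrence x Detection."""
        return self.severity * self.occurrence * self.detection

    @property
    def new_rpn(self) -> Optional[int]:
        """Column 16: same formula, once all three new ratings exist."""
        new = (self.new_severity, self.new_occurrence, self.new_detection)
        if any(rating is None for rating in new):
            return None
        return new[0] * new[1] * new[2]


# Made-up example row; the ratings are illustrative only.
row = FailureModeRecord(
    step="Loan approval model scores an application",
    failure_mode="Model rejects a valid application",
    severity=8, occurrence=4, detection=6,
)
print(row.rpn)      # 8 * 4 * 6 = 192

# After mitigation (for example, a human in the loop), re-rate and recompute.
row.new_severity, row.new_occurrence, row.new_detection = 8, 2, 3
print(row.new_rpn)  # 8 * 2 * 3 = 48
```

In this made-up row the severity stays at 8, yet the RPN drops from 192 to 48 because the failure becomes less frequent and easier to detect.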

Note: A copy of the pre-mortem analysis template can be downloaded from here.

More about ratings

Several risk analysis methodologies recommend only two rating evaluations, i.e. severity and occurrence. In pre-mortem analysis, however, we use a third rating: detection.

Early detection of a problem can often enable you to contain the risks before they become significant and get out of control. You can then either fix the system immediately or invoke systemwide control measures to remain more alert. Either way, being able to detect failures quickly and efficiently is an advantage in complex systems like AI.

In the case of severity and occurrence ratings, the 1 to 10 scale does not change with the type of solution or industry, as these are generally defined scales.

In implementing pre-mortem analysis, you must take a pragmatic approach and choose the scale as appropriate. Just make sure that you are consistent in your definitions throughout the pre-mortem exercise.

While conducting a pre-mortem workshop, participants must first set and agree on the ranking criteria, and then agree on the severity, occurrence, and detection ratings for each of the failure modes.

Using the output of the analysis

The output of the pre-mortem analysis is only useful if you use it.

Each process step or system function will have one or more RPN values associated with it. The higher the RPN, the riskier the step. During the pre-mortem exercise, the team must decide on a threshold RPN value. For all the steps where the RPN is above the threshold, a risk mitigation and control plan becomes mandatory, whereas steps with an RPN below the threshold may be addressed later, as their priority is lower.
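The threshold rule can be sketched in a few lines of Python. The step names, RPN values, and the threshold of 80 below are all made up for illustration; the team agrees on the real threshold during the workshop. The sketch also assumes that a step sitting exactly on the threshold is deferred, which the text does not specify.

```python
# Hypothetical step -> RPN values; in practice these come from the workshop record.
rpn_by_step = {
    "Data ingestion": 40,
    "Model scoring": 192,
    "Decision notification": 90,
    "Manual override": 12,
}

RPN_THRESHOLD = 80  # assumed value, agreed by the team in a real exercise


def triage(rpns, threshold):
    """Split steps into 'mitigate now' and 'address later', highest RPN first."""
    ranked = sorted(rpns.items(), key=lambda item: item[1], reverse=True)
    mandatory = [step for step, rpn in ranked if rpn > threshold]
    # Assumption: a step exactly at the threshold is treated as lower priority.
    later = [step for step, rpn in ranked if rpn <= threshold]
    return mandatory, later


mandatory, later = triage(rpn_by_step, RPN_THRESHOLD)
print(mandatory)  # ['Model scoring', 'Decision notification']
print(later)      # ['Data ingestion', 'Manual override']
```

Sorting by RPN first means the mandatory list doubles as a priority order for the mitigation plan.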

Ideally, you should be addressing all the practical steps wherever RPN is non-zero. However, it is not always possible due to resource limitations.

One of the ways you can reduce the RPN is by reducing the severity of the failure mode. Reducing severity typically needs functional changes in the process steps or the solution itself. Occurrence, in turn, can be controlled by adding specific control measures such as a human in the loop or maker-checker mechanisms.

However, if it is not possible to reduce the severity or occurrence, you can still contain the failures by implementing control systems. Control systems can help either by detecting the causes of unwanted events before their consequences occur, or by detecting the root causes of unwanted failures so that the team can avoid them altogether.

Having risks quantified and visible also enables you to have plans in place to act quickly and appropriately in case of failures and thus reduces the exposure to more failures or adverse consequences.

It is possible that during the pre-mortem exercise, the team will discover many failure modes or root causes that were not covered by the designed controls or test cases and procedures. It is crucial that the test and control plan is always updated with the results of this analysis. Ideally, you should include test and control team members in the pre-mortem analysis exercise.

A common problem I’ve seen in this exercise is difficulty or failure in getting to the root cause of an anticipated failure, and this is where SMEs should lean in. If you identify root causes incorrectly or poorly, your follow-up actions will not yield proper results.

Another problem I’ve seen is the lack of follow-up to ensure that recommended actions are executed and the resulting RPN is lowered to an acceptable level. Doing effective follow-through is a project management function and needs diligent execution to ensure that pre-mortem analysis reaches its logical conclusion.
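As a rough illustration of such follow-through, a tracking check could flag every failure mode whose action is still outstanding or whose new RPN remains too high. The rows, the acceptable level of 80, and the function name `follow_up_status` are hypothetical and not part of any standard FMEA tooling.

```python
from typing import Optional

ACCEPTABLE_RPN = 80  # assumed value; use whatever level your team agreed on

# Hypothetical tracking rows: (failure mode, original RPN, new RPN or None).
# None means the recommended action has not been completed and re-rated yet.
tracked = [
    ("Model rejects a valid application", 192, 48),
    ("Notification sent to the wrong customer", 90, 90),
    ("Stale training data used", 120, None),
]


def follow_up_status(new_rpn: Optional[int], acceptable: int) -> str:
    """Classify one row for the follow-up review."""
    if new_rpn is None:
        return "open: action not completed or not re-rated"
    if new_rpn > acceptable:
        return "open: RPN still above the acceptable level"
    return "closed"


statuses = {
    mode: follow_up_status(new, ACCEPTABLE_RPN) for mode, _old, new in tracked
}
for mode, old, new in tracked:
    print(f"{mode}: {old} -> {new} ({statuses[mode]})")
```

Anything not marked "closed" goes back on the project manager's list for the next review.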

Pre-mortem analysis workshops can be time-consuming. Because of the high time demand, it may become challenging to get sufficient participation from SMEs. The key is to get people who are knowledgeable and experienced about potential failures and their resolutions to show up at these workshops. SME attendance often needs management support, and facilitators need to ensure that this support is garnered.

You can read more about FMEA; there is plenty of quality information on the internet.

Sector-specific considerations

In a pre-mortem analysis, Severity-Occurrence-Detection (SOD) ratings range between 1 and 10. However, the weights assigned to each of the rating values are subjective. It is possible that in the same industry two different companies could come up with slightly different ratings for the same failure mode.

To avoid too much subjectivity and confusion, some level of standardization of the rating scale could be helpful. However, that is only necessary when you’re benchmarking two or more products from different vendors in the same industry. If the analysis is used only for internal purposes, subjectivity won’t matter much, since the relative weights, and therefore the essence of the analysis, would still be preserved across the risks and action items.

Nonetheless, when considering control or action plans for identified risks, sector-specific approaches could be (and should be) different.

Any failure risk can be controlled by reducing severity (S), by lowering the chances of occurrence (O), or by improving detection levels (D). In the banking sector, for example, D ratings might need additional focus alongside improvements to S and O ratings. Given the volume of transactions that the financial sector carries out every day, the severity of a failure could be high due to its widespread impact; if the severity can’t be controlled beyond a point, detecting the failure early enough to fix it becomes essential.

In the health care sector, though, the focus should be on lowering severity itself, because detection may lead to fixing a problem but wouldn’t necessarily reverse its impact. For example, if an AI prediction or solution results in an incorrect prognosis and thereby a change in medicine, early detection of this problem may stop the activity. However, it cannot undo the harm already caused by the failure.

Similarly, in a transportation scenario, especially for autonomous cars, detecting after the fact that a car’s mechanism has failed is less useful, since the accident would already have happened. Reducing severity and occurrence in those cases is a more acceptable course of action.

Generally speaking, you should focus on improving detection if the impact of a failure can be reversed in due course, or if there is enough time between the system’s outcome and its full effect on the end user. If even one failure means significant damage to you, then both severity and occurrence must be reduced.

Severity and occurrence improvements are prevention-focused, whereas detection improvement is fix-focused (cure). If your industry believes that prevention is better than cure, then work on reducing the severity and lowering the occurrence of failures. If your industry is comfortable with fixes after the fact, then detection must be improved.

However, in my view, it is better to address all three factors and ensure that robust risk management is in place.
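The rule of thumb from this section can be written down as a small decision helper. This is a deliberately simplified sketch: the function name, its three yes/no inputs, and the fallback to all three factors when neither condition applies are assumptions made for illustration, and real decisions will need more nuance than three booleans.

```python
from typing import List


def suggested_focus(impact_reversible: bool,
                    time_to_intervene: bool,
                    single_failure_critical: bool) -> List[str]:
    """Suggest which ratings to improve first (hypothetical helper)."""
    # If even one failure means significant damage, prevention comes first.
    if single_failure_critical:
        return ["severity", "occurrence"]
    # If the impact can be reversed, or there is time before the full effect
    # reaches the end user, better detection pays off.
    if impact_reversible or time_to_intervene:
        return ["detection"]
    # Assumed fallback: address all three factors.
    return ["severity", "occurrence", "detection"]


# Illustrative calls loosely based on the sector examples in this section.
print(suggested_focus(True, True, False))   # ['detection']
print(suggested_focus(False, False, True))  # ['severity', 'occurrence']
```

The helper only suggests where to start; it does not replace working on all three factors where resources allow.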

Conclusion

Pearl Zhu, in her book Digitizing Boardroom, says, “Sense and deal with problems in their smallest state, before they grow bigger and become fatal”.

Managing risks systematically and pragmatically is the key to handling AI risks. The problem with AI risks is that they are highly scalable and can quickly grow out of control due to the power of automation and the sheer capacity of AI to execute tasks.

Moreover, subjectivity in risk management is a myth. If you can’t quantify the risk, you can’t measure it; and if you can’t measure it, you can’t improve or control it. The systematic approach outlined here will help you quantify your risks and understand them better while maintaining the context of your use case.

You can develop and implement AI solutions responsibly if you understand the risks, and understand them specifically for your use case!


PS: You may also be interested in reading my book, "Keeping Your AI Under Control," which covers this topic further in the context of responsible and ethical AI development.
