VCDX Part 3 – Risk Management

By | October 31, 2017

Risk Management

Risk analysis is a critical aspect of any project, but on the VCDX journey you should expect it to go deeper and be more thorough: from a handful of risks defined in a basic table to a complete list populated from almost every design decision, presented within a comprehensive table. After all, risk management and validation are clearly stated and defined within the VCDX blueprint!

Every design decision you make has some sort of risk associated with it. Every decision table (logical or physical) must consider the risk, whether it’s technical, project, people or process, and state a mitigation to eliminate or reduce its impact. This is the role of the architect: identify the risks, then consider, manage, eliminate\reduce and track them. Within my design I stated the following under Risk Analysis, which formed part of my conceptual design.

Risk analysis aims at identifying inhibitors to successful implementation and operation of the architecture as designed

The ‘Major’ design risks either project or technical were outlined below.

Other ‘Minor’ risks are outlined throughout the document under each section – ‘design decision’

Note: I also tried to take the risk analysis to the next level by referencing the Standard Operating Procedure(s) and Validation tests, to help mitigate the risk.

My risk analysis table is outlined below. In total I had around 18 Major risks, with several Minor risks outlined in the design decision tables. I could have added the Minor risks to this table at the top of my design for a more comprehensive risk analysis, but the table already spanned 6-7 pages. I wanted it to highlight the most important risks and manage those comprehensively.

  • Risk ID – i.e. rk001
  • Risk – Description of the risk
  • Impact Level – High\Medium\Low
  • Probability – High\Medium\Low (the likelihood that this risk will occur)
  • Mitigation or Acceptance – The plan to mitigate\reduce\eliminate the risk.
  • Standard Operating ID – Linked to a specific operating procedure to mitigate\reduce\eliminate the risk.
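The columns above map naturally onto a small data structure. The following is a minimal sketch in Python, not the format I used in my design: the example entries, the SOP identifiers and the impact × probability score used for ordering are all assumptions added for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """High\\Medium\\Low rating used for both impact and probability."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Risk:
    risk_id: str        # e.g. "rk001"
    description: str    # description of the risk
    impact: Level       # impact level
    probability: Level  # likelihood that the risk will occur
    mitigation: str     # plan to mitigate, reduce or eliminate (or accept)
    sop_id: str         # linked standard operating procedure (hypothetical IDs below)

    @property
    def score(self) -> int:
        # Assumption: a simple impact x probability product, used only to order the table.
        return int(self.impact) * int(self.probability)


def rank(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical entries, for illustration only.
register = [
    Risk("rk001", "Existing hosts are reused and may not cope with the new load",
         Level.MEDIUM, Level.MEDIUM, "Validate sizing with a PoC before go-live", "sop004"),
    Risk("rk002", "Operations team has not been trained on the new platform",
         Level.HIGH, Level.MEDIUM, "Schedule training before handover", "sop011"),
    Risk("rk003", "Single vendor support contract lapses mid-project",
         Level.LOW, Level.LOW, "Accepted; review at contract renewal", "sop002"),
]

for risk in rank(register):
    print(risk.risk_id, risk.impact.name, risk.probability.name, risk.score)
```

A design decision table can then reference a Major risk simply by its `risk_id`, which keeps the summary row short while the full detail stays in the main risk table.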

As you work through each of your design decisions, include an extra row or two within your design decision table for ‘Risk’ and ‘Risk Mitigation’ to provide a short summary statement. If this was a Major risk, I referenced the risk ID from the above table (rk001) within that particular design decision table.

Paul McSharry has a great post on this subject which adds more context to the above. I read this a few times during my preparation, to make sure I had covered everything on the blueprint.


When you defend your design, make sure you have a risk management summary slide somewhere in your main deck, so that it further covers the blueprint and ensures greater scoring power. You will inevitably replay how you managed risks while discussing design decisions\justifications on earlier slides, before you reach the risk summary slide. However, if for some reason you don’t get to cover these, make sure you cover the slide with the 4-5 key risks and their mitigations somewhere within your presentation.

Additional Resources

To gain a more thorough understanding of risk management in general for IT architecture, I recommend the following resources.

During my preparation, I watched this VMworld 2016 session on ‘An Architect’s Guide to Designing risk’. It helps take your thinking on risk to the next level.

Here are some notes from this session.

  • What changes will you make now to affect future?
  • Series of events that have occurred to get you to a place? Same as Infrastructure
  • Different gen of technologies, mergers, acquisitions, layoffs, failures, patches etc
  • Decisions made architecturally to have a different outcome?
  • Availability and security goes beyond SLAs? Some companies have
  • What is a framework? Why do we need a framework?
    • Method of doing things that has structure, follows methodology?
    • Direct you to think in a certain manner.
    • It’s repeatable.
    • Can be applied to multiple situations
    • It’s a guide, not law.
    • Consistently successful outcome
    • TOGAF, ITIL, Zachman
  • Have a framework – Need a structure, go through the design process to try and avoid critical failures, for when things go bad such as:
    • Multiple critical failures
    • No budget
    • No additional resources
    • No staff
  • VCDX Methodology, is a framework – Enterprise Architecture
    • VMware stack, but applicable to any type of architecture out there
  • Design Qualities – AMPRS. Apply these to every component in infrastructure.
    • Manageability – PowerCLI, vCenter, APIs, SSH, DCUI, Host Profiles, need to consider for each component.
    • Performance – Cores\socket, RAM, NUMA for VM sizing
    • Recoverability – How to recover hosts etc? Auto Deploy
    • Security – Lockdown mode, PCI\SOX compliance for hosts, hardening, configuration management for auditing. RBAC, what level.
      • AD auth + local accounts.
  • Considerations for Design (main drivers)
    • Requirements – Data retention, five 9s (99.999% availability)?
    • Constraints – Vendor relationship etc
    • Assumptions – Made up, as answers can’t be confirmed. More space, 2nd Datacenter, staff numbers, training\skills, budget etc
    • Beginning of a dialog, between client and architect
    • May change over a number of design iterations to confirm assumptions and what is the truth?
    • Risks – Use of existing equipment, will load cause an impact? Timing of project? Have we validated solution, validated for problem with PoC? Production ready and capable solution.
  • Risks
    • Usually weakest part of architect design.
    • Should be core to architecture analysis.
    • Component failure is usually addressed after the fact.
    • Understand when things are going to fail – Foresight
  • Plan for Failure (strategy)
    • Survivability Analysis
    • Map components
  • Every environment is different – Workloads, environmental concerns
  • Plan for failure (Defense in depth)
    • Fault tolerance components – RAID-1, multiple power supplies
    • HA software – vSphere HA or whatever for 2-3 recovery
    • Load Balancing – Distributes network traffic load, removes SPoF, plus provides a scalable environment
    • Application level clustering
  • Plan for failure (Graceful degradation)
    • Multiple failures but still be able to run services
  • Risk Mitigation
    • Prevent the worst of only what is known. Not enough of deep insight.
    • Multiple failures and cascading failure
    • Root cause analysis only looks backwards, to ensure it doesn’t happen again
    • You have to fail a lot of times, to get a really strong infrastructure!
  • Fault Tree Analysis (identify weak points) to determine what to build up/protect
    • Map out all different components and dependencies
    • How can I break, let me count the ways?
    • Reduce number of possible outcomes
    • Map out path from one area to another to gain outcome
  • Technological Disobedience
    • Making things do what they weren’t designed to do
  • Decomposing of Technology
    • Breakdown whole infrastructure into discrete parts, when things start failing, how do we rebuild to something that is usable again?
    • Understand mapping of those components to business requirements
    • Understand requirement priority
  • What is indestructible?
    • Resiliency
    • Availability
    • Redundancy
    • Business Continuity
    • Recoverability
    • All comes at cost!
  • Plan for failure and determine strategy.
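To make the Fault Tree Analysis notes above more concrete, here is a minimal sketch in Python. It is not taken from the session: the component names and the example tree are hypothetical, and it only covers simple AND/OR gates. It maps components and dependencies into a tree, checks whether a given set of failures causes the top-level outage, and lists the single points of failure.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A basic event: the failure of one component."""
    name: str


@dataclass
class Gate:
    """An intermediate event combining child failures with AND or OR logic."""
    name: str
    kind: str        # "AND": all children must fail; "OR": any child failing is enough
    children: list


def occurs(node, failed):
    """Return True if the event at `node` occurs given the set of failed component names."""
    if isinstance(node, Event):
        return node.name in failed
    results = [occurs(child, failed) for child in node.children]
    return all(results) if node.kind == "AND" else any(results)


def basic_events(node):
    """Collect the names of all basic events beneath `node`."""
    if isinstance(node, Event):
        return {node.name}
    return set().union(*(basic_events(child) for child in node.children))


def single_points_of_failure(top):
    """Components whose failure alone is enough to trigger the top event."""
    return sorted(name for name in basic_events(top) if occurs(top, {name}))


# Hypothetical tree: the cluster is down if both power supplies fail,
# both hosts fail, or the single core switch fails.
top = Gate("cluster outage", "OR", [
    Gate("power lost", "AND", [Event("psu_a"), Event("psu_b")]),
    Gate("compute lost", "AND", [Event("host_1"), Event("host_2")]),
    Event("core_switch"),
])

print(single_points_of_failure(top))
print(occurs(top, {"psu_a", "host_1"}))
```

In this example the core switch is the only component that takes the service down on its own, so it is the weak point to build up or protect first; losing one power supply and one host together still leaves the service running, which is the graceful degradation idea from the notes.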

The published book from this presenter is now available. I haven’t had the chance to read this yet as it was published before my defence. It looks like another great resource for the IT Architect series.

You can view the rest of my VCDX Nuggets Series here
