Resiliency Terms

Information About Terms

Terms and information to assist you in developing resilience.

The terms in this document are derived from the Disaster Recovery Journal glossary (https://drj.com/resources/drj-glossary-of-business-continuity-terms/) and the NIST glossary (https://csrc.nist.gov/glossary/term/NIST), together with additional information to help teams implement the appropriate resiliency components.

Resiliency Terms

Resilience & Operational Resilience

DRJ Definition:
- Resilience: Ability of an entity to adapt to change or absorb the impact of a business interruption while continuing to provide a minimum acceptable level of service.
- Operational Resilience: The demonstrated and repeated ability of key business units or processes to maintain or return to an acceptable operational status after exposure to disruptive or disastrous events.
NIST Definition:
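- Resilience: The ability of an information system to continue to operate under adverse conditions or stress, even if in a degraded or debilitated state, while maintaining essential operational capabilities, and to recover to an effective operational posture in a time frame consistent with mission needs (NIST SP 800-53 Rev 4, under Information System Resilience).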
- Operational Resilience: The ability of systems to resist, absorb, and recover from or adapt to an adverse occurrence during operation that may cause harm, destruction, or loss of ability to perform mission-related functions (CNSSI 4009-2015, DoDI 8500.01).
Resilience is the ability to operate during and after an interruption.
Operational Resilience builds on Resilience by adding the demonstrated, repeatable actions or steps that normalize operations or bring about recovery.

High Availability (HA)

DRJ Definition: Systems or applications requiring a very high level of reliability and availability.
NIST Definition of High Availability: A failover feature to ensure availability during device or component interruptions (NIST SP 800-113).
Designed to be available 99.999% of the time or as close to it as possible. Does not guarantee 100% uptime.
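To make the 99.999% figure concrete, the sketch below converts an availability target into a yearly downtime budget. It assumes a 365-day year, and the function name `allowed_downtime_minutes` is illustrative rather than taken from DRJ or NIST.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability -> {minutes:.2f} minutes of downtime per year")
```

Under this assumption, "five nines" leaves roughly five minutes of downtime per year, which is why HA designs rely on automatic detection and redirection rather than manual intervention.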
HA implies that there is no single point of failure. HA only works if you have systems in place to detect failures and redirect workloads, whether at the server level or the physical component level.
Failover systems handle the same workloads as the primary system.
- Examples:
  - Virtual Machine (VM) – Uses clustering, using a pool of VMs and resources within a cluster. When one VM fails, it is restarted on another system within the cluster.
  - Cloud services (Azure, GCP, AWS) – Distribute instances across Availability Zones; Azure also offers Availability Sets within a single datacenter.
  - Physical Systems – Redundant components are required for all critical power, cooling, compute, network, and storage infrastructure.
Typically, an HA system can detect a failed component and restart the affected workload elsewhere.
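The detect-and-redirect behavior described in this section can be sketched as follows. This is a simplified model, not any vendor's clustering API: the `Node` class, its `healthy` flag, and the function names are all hypothetical, and a real system would set the health flag from heartbeats or probes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A hypothetical cluster member; 'healthy' would come from a real health check."""
    name: str
    healthy: bool = True


def pick_active(nodes: List[Node]) -> Optional[Node]:
    """Return the first healthy node, or None if every node has failed."""
    for node in nodes:
        if node.healthy:
            return node
    return None


def route_workload(nodes: List[Node], workload: str) -> str:
    """Place the workload on a healthy node, failing over past unhealthy ones."""
    active = pick_active(nodes)
    if active is None:
        raise RuntimeError("no healthy node available")
    return f"{workload} running on {active.name}"


if __name__ == "__main__":
    cluster = [Node("primary"), Node("standby")]
    print(route_workload(cluster, "vm-1"))
    cluster[0].healthy = False  # simulate a detected failure
    print(route_workload(cluster, "vm-1"))
```

Note that the sketch still fails when every node is unhealthy, which mirrors the point that HA does not guarantee 100% uptime.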

Fault Tolerance (FT)

NIST Definition of Fault Tolerance: A property of a system that allows proper operation even if components fail (NISTIR 8202).
Similar to High Availability, but goes one step further by guaranteeing zero downtime.
The goal is to provide nonstop, 24/7 computing, even during a component failure or software crash. Systems are designed so that in the event of a component failure, a backup component or procedure can immediately take its place with no loss of service.
Methods include error detection and correction and hot-swappable systems. Fault tolerance can be provided using software, embedded hardware, or a combination of both.
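As one illustration of masking a failed component without interrupting service, the sketch below uses majority voting across redundant components (often called triple modular redundancy when three are used). This technique is not named in the definitions above; it is chosen here as a simple example, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter
from typing import Sequence, TypeVar

T = TypeVar("T")


def majority_vote(outputs: Sequence[T]) -> T:
    """Return the value produced by a strict majority of redundant components."""
    if not outputs:
        raise ValueError("at least one output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many components disagree")
    return value


if __name__ == "__main__":
    # Three redundant components compute the same result; one is faulty.
    print(majority_vote([42, 42, 7]))
```

With three components, a single faulty output is outvoted and the caller never sees the failure, at the cost of running every computation three times.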

Backup & Data Backups

DRJ Definition:
- Backup: A process by which data (electronic or paper-based) and programs are copied in some form so as to be available and used if the original data from which it originated are lost, destroyed or corrupted.
- Data Backup: The copying of production files to media that can be stored both on and/or offsite and can be used to restore corrupted or lost data or to recover entire systems and databases in the event of a disaster.
NIST Definition of Backup: A copy of files and programs made to facilitate recovery if necessary (NIST 800-34 Rev 1).
The act of copying or the actual copies of files and data.
Capture of data from a moment in time.
Copy of data that can be restored in the event of data loss or corruption.
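A minimal sketch of a point-in-time file backup with verification and restore is shown below, using only the Python standard library. The naming scheme (`<name>.<timestamp>.bak`) and the function names are assumptions made for illustration; production backups would also need retention, offsite storage, and encryption.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir under a timestamped name and verify the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = backup_dir / f"{source.name}.{stamp}.bak"
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError("backup verification failed: checksums differ")
    return target


def restore(backup: Path, destination: Path) -> None:
    """Restore a backup copy over the lost or corrupted destination file."""
    shutil.copy2(backup, destination)
```

The timestamp in the file name records the moment in time the copy represents, which is the value later compared against the Recovery Point Objective.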

Disaster Recovery (DR)

DRJ Definition: The process, policies, and procedures related to preparing for recovery or continuation of technology infrastructure, systems and applications which are vital to an organization after a disaster or outage. The strategies and plans for recovering and restoring the organization's technological infrastructure and capabilities after a serious interruption.
NIST Definition for Disaster Recovery Plan: A written plan for recovering one or more information systems at an alternate facility in response to a major hardware or software failure or destruction of facilities (NIST 800-34 Rev 1).
DR focuses on how the organization responds after the event has been completed and how to return to normal.
DR is the ability to respond to a disaster or an interruption in services by implementing a disaster recovery plan to stabilize and restore the organization’s critical functions.
DR is a strategy for recovering from a disaster and DR implies that a crash, catastrophe, or disaster has already occurred.
DR deals with a complete failure of all infrastructure.
DR is a complete plan to recover critical business systems to normal business operations in the event of a disaster.
DR plans are vital to the Business Continuity (BC) strategy.
DR is configured with a designated Recovery Time Objective (RTO), the time allowed to restore essential systems, and a Recovery Point Objective (RPO), the point in time before the disaster to which data is restored.
DR may employ restoring using backups and HA components.
How it works: DR platforms replicate selected systems and data to a separate cluster, where they are kept in storage. When downtime is detected, the standby system is turned on and network paths are redirected to it.
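The replicate, activate, and redirect sequence described above can be modeled in a few lines. This is a toy sketch rather than a real DR platform: `Site`, `replicate`, `fail_over`, and the `routes` mapping are hypothetical names, and downtime detection is reduced to a boolean flag.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Site:
    """A hypothetical site; 'running' stands in for real downtime detection."""
    name: str
    running: bool
    data: Dict[str, str] = field(default_factory=dict)


def replicate(primary: Site, standby: Site) -> None:
    """Copy the primary site's data to the standby, which stays powered off."""
    standby.data = dict(primary.data)


def fail_over(primary: Site, standby: Site, routes: Dict[str, str]) -> None:
    """If the primary is down, start the standby and point every route at it."""
    if primary.running:
        return  # no downtime detected, nothing to do
    standby.running = True
    for service in routes:
        routes[service] = standby.name


if __name__ == "__main__":
    prod = Site("prod", running=True, data={"orders": "v1"})
    dr = Site("dr", running=False)
    routes = {"app": "prod"}
    replicate(prod, dr)
    prod.running = False  # simulate an outage at the production site
    fail_over(prod, dr, routes)
    print(routes, dr.running)
```

Anything written to the primary after the last `replicate` call is missing from the standby, which is the data loss that the Recovery Point Objective bounds.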

Business Continuity (BC)

DRJ Definition: The strategic and tactical capability of the organization to plan for and respond to incidents and business disruptions in order to continue business operations at an acceptable predefined level. The capability of the organization to continue delivery of products or services at acceptable predefined levels following a disruptive incident.
NIST Definition of Business Continuity Plan: The documentation of a predetermined set of instructions or procedures that describe how an organization’s mission/business processes will be sustained during and after a significant disruption (NIST 800-34 Rev 1).
Process of keeping the entire business functional during a crisis and immediately after.
In the world today, BC strategies must focus on IT-related risks.
- Examples:
  - What do you do during a DDoS attack?
  - What do you do when the production site goes down, and how do you keep things running during the event?
Business continuity consists of a plan of action.
BC usually is broader in scope and incorporates many facets of business operations.
BC has an emphasis not only on keeping the business going but is concerned with maintaining operational service delivery to customers.
Business Continuity directly affects shareholder value.
BC planning typically follows a basic outline of 1) Disaster Planning, 2) Business Impact Analysis, 3) Business Continuity Management, 4) Business Continuity Plan, 5) Recovery Time Objectives, 6) Deployment.

How this all fits together

DR, BC, FT, and HA are all about Resilience. These key topics focus on maintaining operations during a crisis or returning to operations after a disruptive business impact. Think of Operational Resilience as the repeated actions, services, solutions, technologies, steps, processes, procedures, plans, and policies that outline what to do when a disruption occurs.

High Availability (HA) is often confused with Disaster Recovery (DR), but HA is a component of DR. When a system has High Availability, it can tolerate the failure of a component, or "fail over": because redundancy is built into the system, it can immediately switch to a redundant source when a component fails. However, a system, infrastructure, solution, or network that is designed for High Availability may still fail to achieve the goal of Disaster Recovery. In DR, resources and activities are used to restore services to normal operations in the shortest possible time by using an alternative production site, the cloud, or some other mechanism.

Fault Tolerance ensures availability by keeping copies on a separate host machine. For example, with HA on VMware, the hypervisor attempts to restart the Virtual Machine (VM) within the same host cluster. If the physical system has other problems (power, network, etc.), HA may not work because the system itself may have hardware issues. With FT, the VM workload is moved to a completely separate host, or in the case of Microsoft Azure, the entire system would move to a different Availability Zone.

DR generally replaces an entire data center, whether physical or virtual. HA deals with faults in a single component like power or a single server rather than a complete failure of all IT infrastructure, which would occur in the case of a catastrophe. DR goes beyond FT and HA and consists of a comprehensive plan to recover critical systems and normal operations in the event of a catastrophic disaster (such as hurricanes, floods, tornadoes, cyberattacks, or any event that causes significant downtime). HA is often a major component of DR, which can consist of an entirely separate physical infrastructure site consisting of 1:1 replacement for every critical infrastructure component or as many as is required to restore the essential business functions.

Backups are also different from Disaster Recovery. Backups are a copy of data at a specific moment in time that is stored in the event of data loss, corruption, etc., where it can be restored.

For simplicity, FT is a subset of HA, Backups are a subset of DR, HA is a subset of DR, and DR is a subset of BC.

The critical difference between DR and BC is when the plan takes effect. Business Continuity (BC) requires the organization to keep operations functional during the event and immediately after. Disaster Recovery focuses on how the organization responds after the event has been completed and how to return to normal. In simple terms, BC is how to stay operational during a disaster, and DR is what to do once the disaster has occurred and how to return to normal operations after the event.

Recovery Point Objective (RPO)

DRJ Definition: The point in time to which data is restored and/or systems are recovered after an outage. The point to which information used by an activity must be restored to enable the activity to operate on resumption.
NIST Definition: The point in time to which data must be recovered after an outage.
The objective is expressed as the maximum acceptable time (typically measured in hours or days) between the last successful backup of the data and the moment the outage begins (when the DR event or interruption occurs).
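As a worked example of the RPO measurement, the sketch below computes the window between the last successful backup and the start of an outage and checks it against an objective. The function names and the sample timestamps are illustrative assumptions.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, outage_start: datetime) -> timedelta:
    """Return the time between the last successful backup and the outage."""
    if outage_start < last_backup:
        raise ValueError("outage_start must not precede last_backup")
    return outage_start - last_backup


def meets_rpo(last_backup: datetime, outage_start: datetime, rpo: timedelta) -> bool:
    """True when the data loss window is within the Recovery Point Objective."""
    return data_loss_window(last_backup, outage_start) <= rpo


if __name__ == "__main__":
    last_backup = datetime(2024, 1, 1, 0, 0)
    outage_start = datetime(2024, 1, 1, 6, 0)
    print(data_loss_window(last_backup, outage_start))
    print(meets_rpo(last_backup, outage_start, rpo=timedelta(hours=4)))
```

In this example a nightly backup and an outage at 06:00 give a six-hour window, so a four-hour RPO would require more frequent backups or replication.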

Recovery Time Objective (RTO)

DRJ Definition: The period of time within which systems, applications, or functions must be recovered after an outage. RTO includes the time required for: assessment, execution and verification. The period of time following an incident within which a product or service or an activity must be resumed, or resources must be recovered.
NIST Definition: The overall objective length of time an information system’s components can be in the recovery phase before negatively impacting the organization’s mission or mission/business processes.

Recovery Point Capability (RPC)

DRJ Definition: The point in time to which data was restored and/or systems were recovered (at the designated recovery/alternate location) after an outage or during a disaster recovery exercise.
The actual duration of time between the last successful backup and the recovery.

Recovery Time Capability (RTC)

DRJ Definition: The demonstrated amount of time in which systems, applications and/or functions have been recovered, during an exercise or actual event, at the designated recovery/alternate location (physical or virtual).
The recorded time it took for recovery to occur, whether within an exercise or disaster event.
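Since RPC and RTC are the demonstrated counterparts of RPO and RTO, an exercise result can be compared against the objectives directly. The sketch below assumes all four values are expressed as durations; the `RecoveryResult` class and its method names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict


@dataclass(frozen=True)
class RecoveryResult:
    """Objectives (RTO, RPO) and demonstrated capabilities (RTC, RPC) as durations."""
    rto: timedelta
    rtc: timedelta
    rpo: timedelta
    rpc: timedelta

    def gaps(self) -> Dict[str, timedelta]:
        """Return how far each capability exceeds its objective (zero if met)."""
        zero = timedelta(0)
        return {
            "time": max(self.rtc - self.rto, zero),
            "point": max(self.rpc - self.rpo, zero),
        }

    def objectives_met(self) -> bool:
        """True only when both demonstrated capabilities are within objective."""
        return self.rtc <= self.rto and self.rpc <= self.rpo


if __name__ == "__main__":
    exercise = RecoveryResult(
        rto=timedelta(hours=4), rtc=timedelta(hours=6),
        rpo=timedelta(hours=24), rpc=timedelta(hours=12),
    )
    print(exercise.objectives_met(), exercise.gaps())
```

A non-zero gap after an exercise indicates that either the recovery procedure or the objective itself needs to be revisited.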

Backup Recovery Time Expectation (RTE)

The time needed to restore from backups, usually targeted at less than 24 hours per backup item. It is the maximum time it takes for IT resources to recover a single file or a subset of files, and it depends on the size of the data to be recovered. RTE is independent of RTO and RPO since it focuses on backup recovery, not disaster recovery.
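Because restore time depends on data size, a rough estimate can be made from the amount of data and the sustained restore throughput. The sketch below assumes decimal units (1 GB = 1000 MB) and a constant throughput, both simplifications; the function names are illustrative.

```python
def estimated_restore_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Estimate restore time in hours, assuming constant throughput and 1 GB = 1000 MB."""
    if data_gb < 0:
        raise ValueError("data_gb must not be negative")
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput_mb_per_s must be positive")
    seconds = (data_gb * 1000) / throughput_mb_per_s
    return seconds / 3600


def within_rte(data_gb: float, throughput_mb_per_s: float, rte_hours: float = 24.0) -> bool:
    """True when the estimated restore time fits the expectation (default 24 hours)."""
    return estimated_restore_hours(data_gb, throughput_mb_per_s) <= rte_hours


if __name__ == "__main__":
    hours = estimated_restore_hours(data_gb=3600, throughput_mb_per_s=100)
    print(f"Estimated restore time: {hours:.1f} hours")
```

Real restores are usually slower than the raw throughput suggests because of cataloging, decompression, and verification, so the estimate is best treated as a lower bound.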
