Business Continuity and Disaster Recovery Planning

Contingency Planning

Information systems contingency planning refers to a coordinated strategy involving plans, procedures, and technical measures that enable the recovery of information systems, operations, and data after a disruption
Resilience is an organization's ability to quickly adapt to and recover from any known or unknown changes to the environment
BCP and DRP are types of contingency planning
BCP and DRP help minimize financial impact during serious incidents by protecting tangible and intangible assets

Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems

BCP / DRP

Business Continuity Planning (BCP)
Preservation of business in the face of disruptions
Focuses on sustaining an organization’s mission/business processes during and after a disruption
BCP may be created for a single business unit or for the entire organization’s processes; may also be scoped for only functions deemed to be priorities
BCP is the responsibility of the security team since it provides availability
Disaster Recovery Planning (DRP)
DRP is concerned with restoring operability of disrupted IT systems, whereas BCP is concerned with keeping business processes available
DRP applies to major (usually physical) disruptions to service that deny access to the primary facility infrastructure for an extended period
DRP only addresses information system disruptions that require relocation to infrastructure at an alternate site

Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems

The Need for BCP

Natural disasters
Social unrest or terrorist attacks
BCP may often be triggered by an audit
Legislative/regulatory requirements
Equipment failure (such as disk crash)
Disruption of power supply or telecommunication
Application failure or corruption of database
Human error, sabotage or strike
Malicious Software (Viruses, Worms, Trojan horses) attacks
Hacking or other Internet attacks
Fire

Source: Introduction to Business Continuity Planning; SANS Institute InfoSec Reading Room

Standards

NFPA 1600
National Standard on Preparedness by the National Fire Protection Association
ISO 17799
Defense Security Service (DSS)
A division of the DoD
NIST
Standard of due care / best practice / good business practice

Enterprise-wide Continuity Planning

Critical Success Factors for BCP Implementation

• Management support

Ensures that management will allocate resources for this project
It is the key driver of organizational change
Management awareness will steer the program and set priorities

• Accountability and responsibility

All departments/individuals know their role in incorporating BCM
A BCM team lead should oversee the overall process development and report to management on obstacles faced

• Integral part of information assurance management program

BCM is not separate from the organization’s overall IT management
Needs and allows continuous monitoring and improvement
BCM should be integrated into the total change management process

Source: Information Assurance Handbook: Effective Computer Security and Risk Management Strategies by Corey Schou and Steven Hernandez

BCP Process

Source: CISSP CBK

A. Project Initiation

BCPs and DRPs must be based on a clearly defined policy that states:
Organization’s overall contingency objectives
Organizational framework
Resource requirements
Roles and responsibilities
Scope as applies to common platform types and organization functions
Training requirements
Exercise and testing schedules
Plan maintenance schedule
Minimum frequency of backups and storage of backup media

Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems

Project Initiation

Project scope development and planning
BCP vs. DRP
Crisis management planning
Continuous availability
Incident Command System (ICS)
Executive Management Support - CIO must support the contingency program and be included in the process to develop the program policy
Project scope and authorization
Continuity Planning Project Team formation

B. Current State Assessment

Understand Enterprise Strategy, Goals and Objectives
Business Impact Analysis
Threat analysis
Identify critical business functions
3rd party relationships
Assessment of current continuity planning process
Benchmarking or peer review

Business Impact Analysis (BIA)

The BIA correlates the system with the critical mission/business processes and services it supports in order to characterize the consequences of a disruption
Three steps are typically involved in accomplishing the BIA:
Determine mission/business processes and recovery criticality
Identify resource requirements: realistic recovery efforts require a thorough evaluation of the resources needed to resume mission/business processes as quickly as possible
Identify recovery priorities for system resources: system resources can be linked to the critical mission/business processes and functions they support, and priority levels can then be established for sequencing recovery activities and resources

Critical Business Functions

Impacts on business functions are analyzed in terms of availability, integrity and confidentiality
Availability (Time Sensitivity)
Recovery Time Objective (RTO) - the maximum amount of time that a system resource can remain unavailable before there is an unacceptable impact on other system resources, supported mission/business processes, and the MTD
A Plan of Action and Milestones (POA&M) for mitigation should be initiated if the RTO is not feasible
Maximum Tolerable Downtime (MTD) - the total amount of time the system owner is willing to accept for a mission/business process outage or disruption and includes all impact considerations
Max Allowable Downtime (MAD) – the total amount of time that the system can be unavailable before significant organizational impact will result.
Data Integrity
Recovery Point Objective (RPO) - the point in time, prior to a disruption, to which data can be recovered after an outage
Critical business functions should be classified based on the determined impact
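
The relationships among these objectives can be checked mechanically. The following is a minimal sketch and is not taken from NIST SP 800-34: the RecoveryObjectives record, the check_objectives function, and every sample number are hypothetical. It assumes that an RTO longer than the MTD is a finding, and that the backup interval is the worst-case data loss to compare against the RPO.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    """Hypothetical record of BIA results for one system, in hours."""
    name: str
    rto_hours: float  # Recovery Time Objective
    mtd_hours: float  # Maximum Tolerable Downtime of the supported process
    rpo_hours: float  # Recovery Point Objective (acceptable data-loss window)


def check_objectives(obj: RecoveryObjectives, backup_interval_hours: float) -> list[str]:
    """Return a list of findings; an empty list means no issue was detected."""
    findings = []
    # The RTO has to keep the supported process within its MTD.
    if obj.rto_hours > obj.mtd_hours:
        findings.append("RTO exceeds MTD")
    # Data written since the last backup is lost, so the interval bounds the loss.
    if backup_interval_hours > obj.rpo_hours:
        findings.append("backup interval exceeds RPO")
    return findings


# Illustrative values only.
payroll = RecoveryObjectives("payroll", rto_hours=24, mtd_hours=72, rpo_hours=4)
print(check_objectives(payroll, backup_interval_hours=24))
```

In this example the RTO fits inside the MTD, but a daily backup cannot satisfy a four-hour RPO, so either the backup frequency or the RPO would need to be revisited.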

Sample Business Impact Analysis (BIA)

Cost Balancing

The longer a disruption is allowed to continue, the more costly it can become to the organization
Conversely, the shorter the RTO, the more expensive the recovery solutions are to implement
Plotting the cost balance points will show an optimal point between disruption and recovery costs
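
The balance point can be illustrated numerically. This is a sketch under stated assumptions: the two cost functions, their coefficients, and the candidate RTO values are invented for illustration, and treating the lowest combined cost as the balance point is a simplification made here.

```python
# Every number below is an illustrative assumption, not real cost data.
candidate_rto_hours = [1, 4, 12, 24, 72, 168]


def disruption_cost(hours: float) -> float:
    """Assumed loss of 2,000 per hour of outage (linear for simplicity)."""
    return 2_000 * hours


def recovery_cost(rto_hours: float) -> float:
    """Assumed solution cost that falls as the RTO is relaxed."""
    return 240_000 / rto_hours


def total_cost(rto_hours: float) -> float:
    """Combined cost of tolerating the outage and paying for the solution."""
    return disruption_cost(rto_hours) + recovery_cost(rto_hours)


# Pick the candidate RTO with the lowest combined cost.
best_rto = min(candidate_rto_hours, key=total_cost)

for rto in candidate_rto_hours:
    print(f"RTO {rto:>3} h: total cost {total_cost(rto):>10,.0f}")
print("Lowest total cost at RTO =", best_rto, "hours")
```

With these assumed figures the lowest total cost falls at an RTO of 12 hours, close to where the two cost curves cross.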

Critical Business Functions

Identification of critical business functions
Operational impact
Financial impact
Reputation or public image impact
Dependencies
BIA enables characterization of the system components, supported business processes, and interdependencies
Possible business impacts due to the unavailability of systems can be determined (RTO, MTD, etc.)
The recovery sequence of information system components can then be finalized, forming the basis for developing contingency solutions

Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems
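
Sequencing recovery from BIA output can be sketched in code. The example below is not from the NIST guide; it is one possible approach that assumes each component has an RTO and a list of components it depends on, restores dependencies first, and otherwise prefers the shortest RTO. The component names and values are hypothetical.

```python
import heapq
from graphlib import TopologicalSorter

# Hypothetical BIA output: name -> (RTO in hours, components it depends on)
components = {
    "network": (2, []),
    "database": (4, ["network"]),
    "web_app": (4, ["database", "network"]),
    "intranet": (48, ["network"]),
    "reporting": (72, ["database"]),
}


def recovery_sequence(components):
    """Order components so dependencies come first, then shortest RTO first."""
    sorter = TopologicalSorter(
        {name: deps for name, (_, deps) in components.items()}
    )
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    heap, order = [], []
    while sorter.is_active():
        # Queue every component whose dependencies are already restored.
        for name in sorter.get_ready():
            heapq.heappush(heap, (components[name][0], name))
        # Restore the queued component with the shortest RTO next.
        _, name = heapq.heappop(heap)
        order.append(name)
        sorter.done(name)
    return order


print(recovery_sequence(components))
```

Ties on RTO are broken alphabetically here, which is an arbitrary choice; a real plan would use the priority levels agreed during the BIA.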

Third Party Relationships

Downstream liabilities
Who will be impacted if your business is interrupted?
Upstream impacts
What happens if a partner’s business is interrupted?
Enforce SLAs

Identify Preventive Controls

Some outage impacts identified in BIA may be mitigated or eliminated through preventive measures that deter, detect, and/or reduce impacts to the system
Where feasible and cost-effective, preventive methods are preferable to recovery methods. For example:
Appropriately sized uninterruptible power supplies (UPS)
Gasoline- or diesel-powered generators to provide long-term backup power
Air-conditioning systems with adequate excess capacity to prevent failure of certain components, such as a compressor
Fire detection and suppression systems
Heat-resistant and waterproof containers for backup media and vital nonelectronic records
Offsite storage of backup media, nonelectronic records, and system documentation
Frequent scheduled backups, including where the backups are stored (onsite or offsite) and how often they are recirculated and moved to storage

C. Development Phase

Develop and design recovery strategies
IT recovery
Business process recovery
Facilities recovery

BCP/DRP Development

Activation and Notification Phase

Defines initial actions taken once a system disruption or outage has been detected or appears to be imminent
Activation Criteria and Procedure - BC or DR plan should be activated if one or more of the activation criteria are met. Criteria may be based on:
Extent of any damage to the system
Criticality of the system to the organization’s mission
Expected duration of the outage lasting longer than the RTO
Notification Procedures - Describe the methods used to notify recovery personnel during business and nonbusiness hours. Notification methods can be:
Manual
Automatic
Outage Assessment - Assess the nature and extent of the disruption
Assessment should be completed as quickly as the given conditions permit
Outage Assessment Team is the first team notified of the disruption.

Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems

Recovery Phase

Focuses on implementing recovery strategies to restore system capabilities, repair damage, and resume operational capabilities at the original or new alternate location
Sequence of Recovery Activities
Should reflect the system’s MTD to avoid significant impacts to related systems
Recovery Procedures
Should provide detailed procedures to restore the information system or components to a known state
Recovery procedures should be written in a straightforward, step-by-step style
Recovery Escalation and Notification
Effective escalation and notification procedures should define and describe the events, thresholds, or other types of triggers that are necessary for additional action
At the completion of the Recovery Phase, the information system will be functional and capable of performing the functions identified in the plan

Reconstitution Phase

Defines the actions taken to test and validate system capability and functionality
Concurrent Processing - running two systems concurrently until there is a level of assurance that the recovered system is operating properly
Validation Data Testing - testing and validating recovered data to ensure complete and current recovery
Validation Functionality Testing - verifying that all system functionality has been tested, and that normal operations can resume
Plan deactivation activities for returning to normal operations include:
Notifications – notifying users using predefined procedures that normal operations have resumed
Cleanup - cleaning up work space or dismantling any temporary recovery locations, restocking supplies, returning manuals or other documentation to their original locations, and readying the system for another contingency event
Offsite Data Storage - If offsite data storage is used, retrieved backup should be returned to its offsite data storage location
Data Backup - system should be fully backed up and a new copy of the current operational system stored for future recovery efforts
Event Documentation - All recovery and reconstitution events should be well documented for an after-action report with lessons learned

Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems

Backup and Recovery

Backup and recovery methods and strategies are a means to restore system operations quickly and effectively following a service disruption
These should be integrated into the system architecture during the Development/Acquisition phase of the SDLC
Considerations for developing or comparing backup and recovery methods:
Cost
Maximum downtimes
Security
Recovery priorities
Integration with organization-level BCM plans

Recovery Time

Recovery Tier	Recovery Timeframe	Recovery Requirement
I	0-24 hours	Resources must be available in advance and implemented first
II	1-3 days	Resources must be available in advance
III	3-5 days	Resources must be identified and quickly available
IV	Other	Resources must be identified

Method to Prioritize Business Processes or IT Infrastructure Components
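
The tier table can be applied programmatically. The sketch below assumes RTOs are expressed in hours and that a value on a shared boundary (for example exactly 3 days) falls into the faster tier; the table itself does not say how boundaries are resolved, and the process names and RTOs are hypothetical.

```python
def recovery_tier(rto_hours: float) -> str:
    """Map an RTO in hours to a recovery tier label from the table."""
    if rto_hours <= 24:
        return "I"    # 0-24 hours
    if rto_hours <= 72:
        return "II"   # 1-3 days
    if rto_hours <= 120:
        return "III"  # 3-5 days
    return "IV"       # other


# Hypothetical processes and RTOs, for illustration only.
processes = {"order entry": 8, "payroll": 48, "reporting": 96, "archive search": 240}

# Group process names under their tier.
by_tier = {}
for name, rto in processes.items():
    by_tier.setdefault(recovery_tier(rto), []).append(name)

for tier in ("I", "II", "III", "IV"):
    print(tier, by_tier.get(tier, []))
```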

High Availability (HA) Processes

HA is a process where redundancy and failover processes are built into a system to maximize its uptime and availability
Goal of HA is to achieve an uptime of 99.999% or higher
HA can be expensive and is not a viable option for many systems; it should be considered only for systems that cannot tolerate downtime
HA systems cannot be a replacement for a solid backup strategy
HA processes need to be extended to an alternate location
Mechanisms such as block mirroring to an alternate site should be considered to provide redundancy and backup of system data outside of the system facility.
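
An availability target is easier to judge when converted into allowed downtime. The following arithmetic sketch assumes a 365-day year and counts all downtime, planned or unplanned; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

An uptime of 99.999% leaves a little over five minutes of downtime per year, which is why this level of availability is costly to achieve.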

IT Recovery Strategies

Multiple Processing Sites
Mirrored Sites
Fully redundant with identical data and equipment as well as communication capabilities
Highest level of availability at highest cost
It ensures virtually 100% availability
Configuration management is a challenge
RTO of minutes to hours

IT Recovery Strategies

Mobile site/trailer
Self contained unit with IT and communications
RTO of 3-5 days
Hot site
Fully equipped data center and communications
RTO of a few minutes to hours
Warm site
Has some level of IT capabilities, but will have to be further equipped to take over IT operations
RTO of 5+ days
Cold site
A location capable of supporting IT operations, but with no equipment
RTO of 1-2 weeks at the minimum

Alternate IT Recovery Strategies

Virtual business partners
Similar to multiple sites, but alternate sites are hosted by business partners
Reciprocal or mutual aid agreements with an internal or external entity
Dedicated site owned or operated by the organization
Commercially leased facility

Backup Approaches

Electronic vaulting
Sending data directly to an alternate facility
Can be stored on disk or tape depending on RTO requirements
Remote journaling
Replicated data transactions in real-time or near real-time at a secondary processing site
Offsite storage
Storage Area Network
Database shadowing and mirroring

Backup Methods

Data integrity involves keeping data safe and accurate on the system’s primary storage devices
There are three common methods for performing system backups:
Full Backup - captures all files on the disk or within the folder selected for backup
Locating a particular file or group of files is simple
Time required to perform a full backup can be lengthy; might also lead to excessive, unnecessary media storage requirements
Differential Backup - stores files that were created or modified since the last full backup
Takes less time to complete than a full backup
Incremental Backup - captures files that were created or changed since the last backup
Uses storage media more efficiently; backup times are reduced
Media from different backup operations may be required to recover a system from an incremental backup
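
The difference in restore effort between the methods can be sketched as follows. This example assumes one backup per day, a schedule that begins with a full backup, and the definitions given above (a differential covers changes since the last full backup; an incremental covers changes since the last backup of any type). The function name and schedules are hypothetical.

```python
def media_needed_for_restore(schedule, restore_day):
    """Return the days whose backup media are needed to restore restore_day.

    schedule holds one of "full", "differential", or "incremental" per day.
    """
    if schedule[0] != "full":
        raise ValueError("this sketch assumes the schedule starts with a full backup")
    needed = []
    day = restore_day
    # Each incremental only holds changes since the previous backup,
    # so walk back through every consecutive incremental.
    while schedule[day] == "incremental":
        needed.append(day)
        day -= 1
    # A differential holds everything since the last full backup,
    # so anything between it and that full backup can be skipped.
    if schedule[day] == "differential":
        needed.append(day)
        while schedule[day] != "full":
            day -= 1
    needed.append(day)  # the full backup itself
    return needed[::-1]


weekly_incremental = ["full"] + ["incremental"] * 6
weekly_differential = ["full"] + ["differential"] * 6

print(media_needed_for_restore(weekly_incremental, 3))
print(media_needed_for_restore(weekly_differential, 3))
```

Restoring day 3 needs four sets of media under the incremental schedule but only two under the differential schedule, which is the trade-off against the smaller nightly backups that incrementals provide.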

Backup Locations

On-site
Near-site
Off-site

Communications

Emergency communication systems
Remote access may serve as an important contingency capability by providing access to organization-wide data for recovery teams or users from another location
Wireless (or WiFi) local area networks can serve as an effective contingency solution to restore network services following a wired LAN disruption
Business communications systems
Networks
Some of the ways to ensure communication availability are:
Redundant communications links
Redundant network service providers
Redundant network-connecting devices
Redundancy from the network service provider (NSP) or Internet service provider (ISP)
Monitoring software can be installed to provide warning and troubleshoot network issues before users and other nodes notice problems.

D. Implementation

Initial walkthroughs of design
Implement design
Test
Monitor

Testing, Training and Exercises (TT&E)

Training - personnel are trained to fulfill their roles and responsibilities within the plan
Exercises – plans simulated to validate their content
Testing - systems and system components tested to ensure their operability in a disrupted environment

Testing

Design short- and long-term continuity and crisis management testing plans
Update plans as necessary and document
Test types
Checklist
Walkthrough (table top review)
Simulation
Parallel
Full-interruption

BCP Program Awareness and Training

Recovery strategy and procedures must be documented and made available
Recovery personnel should be familiar with their roles and have the skills necessary to be prepared for tests, exercises, and actual outage events
Training should be provided at least annually, to the extent that personnel can execute their respective recovery roles without the aid of documentation
Leadership training – crisis management
Tech teams training – procedures and logistics
Part of onboarding training
Recovery personnel should be trained on the following plan elements:
Purpose of the plan
Cross-team coordination and communication
Reporting procedures
Security requirements
Team-specific processes
Individual responsibilities

BCP Program Exercises

An exercise is a simulation of an emergency designed to validate the viability of one or more aspects of the Business Continuity or Disaster Recovery plans
Exercises are scenario-driven
Types of exercises are:
Tabletop Exercises - Discussion-based exercises in which personnel discuss their roles during an emergency and their responses to a particular emergency situation
Functional Exercises - Personnel validate their operational readiness for emergencies by performing their duties in a simulated operational environment

Developing BCP/DRP culture

Personnel across the organization must be confident and competent with the BCP/DRP program
BCP must be aligned with organizational business objectives
Organizations must establish a BCM culture and integrate it into daily business operations with the support of the CRO and senior management.
Three techniques are involved in developing and establishing BCM culture within an organization:
Design and deliver an awareness campaign to create and promote BCM awareness and develop skills, knowledge, and commitment required to ensure a successful BCM practice.
Ensure the awareness campaign has achieved its goals and monitor BCM awareness over the longer term.
Perform an assessment on the current BCM awareness level against the management-targeted level.

Emergency Operations Center

A physical location to coordinate emergency response efforts
Virtual EOC
Helps in the case of a pandemic or globally dispersed key employees

E. Management of BCP/DRP

Program oversight
Continuity planning manager
Updating and maintaining the plan - Changes in specific areas may require attention, for example, employee turnover, changes to organizational structure, changes to business processes, etc.
Regular practice of the plan
Validate plans by having everyone involved perform simulations of different scenarios
Frequency of exercises depends on the rate of changes made within the organization
Review the result of earlier exercises to ensure identified weaknesses have been addressed
Review BCP - An audit by internal or external auditors can highlight all key material weaknesses and issues