Business Continuity and Disaster Recovery Planning
Contingency Planning
- Information systems contingency planning refers to a coordinated strategy involving plans, procedures, and technical measures that enable the recovery of information systems, operations, and data after a disruption
- Resilience is the ability of an organization to quickly adapt to and recover from any known or unknown changes to the environment
- BCP and DRP are types of contingency planning
- BCP and DRP help minimize financial impact during serious incidents by protecting tangible and intangible assets
Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems
BCP / DRP
- Business Continuity Planning (BCP)
- Preservation of business in the face of disruptions
- Focuses on sustaining an organization’s mission/business processes during and after a disruption
- BCP may be created for a single business unit or for the entire organization’s processes; may also be scoped for only functions deemed to be priorities
- BCP is the responsibility of the security team since it provides availability
- Disaster Recovery Planning (DRP)
- DRP is concerned with restoring operability of disrupted IT systems, whereas BCP is concerned with keeping business processes available
- DRP applies to major (usually physical) disruptions to service that deny access to the primary facility infrastructure for an extended period
- DRP only addresses information system disruptions that require relocation to infrastructure at an alternate site
Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems
The Need for BCP
- Natural disasters
- Social unrest or terrorist attacks
- BCP is often triggered by an audit
- Legislative/regulatory requirements
- Equipment failure (such as disk crash)
- Disruption of power supply or telecommunication
- Application failure or corruption of database
- Human error, sabotage or strike
- Malicious software attacks (viruses, worms, Trojan horses)
- Hacking or other Internet attacks
- Fire
Source: Introduction to Business Continuity Planning; SANS Institute InfoSec Reading Room
Standards
- NFPA 1600
- National Standard on Preparedness by the National Fire Protection Association
- ISO 17799
- Defense Security Service (DSS)
- A division of the DoD
- NIST
- Standard of due care / best practice / good business practice
Enterprise-wide Continuity Planning
Critical Success Factors for BCP Implementation
• Management support
- Ensures that management allocates resources to the project
- It is the key driver of organizational change
- Management awareness will steer the program and set priorities
• Accountability and responsibility
- All departments/individuals know their role in incorporating business continuity management (BCM)
- A BCM team lead should oversee the overall process development and report to management on obstacles faced
• Integral part of information assurance management program
- BCM is not separate from the organization’s overall IT management
- Needs and allows continuous monitoring and improvement
- BCM should be integrated into the total change management process
Source: Information Assurance Handbook: Effective Computer Security and Risk Management Strategies by Corey Schou and Steven Hernandez
BCP Process
Source: CISSP CBK
A. Project Initiation
- BCP and DRP must be based on a clearly defined policy, which states:
- Organization’s overall contingency objectives
- Organizational framework
- Resource requirements
- Roles and responsibilities
- Scope as applies to common platform types and organization functions
- Training requirements
- Exercise and testing schedules
- Plan maintenance schedule
- Minimum frequency of backups and storage of backup media
Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems
Project Initiation
- Project scope development and planning
- BCP vs. DRP
- Crisis management planning
- Continuous availability
- Incident Command System (ICS)
- Executive Management Support - CIO must support the contingency program and be included in the process to develop the program policy
- Project scope and authorization
- Continuity Planning Project Team formation
B. Current State Assessment
- Understand Enterprise Strategy, Goals and Objectives
- Business Impact Analysis
- Threat analysis
- Identify critical business functions
- Third-party relationships
- Assessment of current continuity planning process
- Benchmarking or peer review
Business Impact Analysis (BIA)
- BIA correlates the system with the critical mission/business processes and services it supports, and uses that information to characterize the consequences of a disruption
- Three steps are typically involved in accomplishing the BIA:
- Determine mission/business processes and recovery criticality
- Identify resource requirements: Realistic recovery efforts require a thorough evaluation of the resources required to resume mission/business processes as quickly as possible
- Identify recovery priorities for system resources: Linkage between system resources critical to mission/business processes and functions can be identified. Priority levels can be established for sequencing recovery activities and resources.
Critical Business Functions
- Impacts on business functions are analyzed in terms of availability, integrity and confidentiality
- Availability (Time Sensitivity)
- Recovery Time Objective (RTO) - the maximum amount of time that a system resource can remain unavailable before there is an unacceptable impact on other system resources, supported mission/business processes, and the MTD
- Plan of Action and Milestone for mitigation should be initiated if RTO is not feasible
- Maximum Tolerable Downtime (MTD) - the total amount of time the system owner is willing to accept for a mission/business process outage or disruption and includes all impact considerations
- Max Allowable Downtime (MAD) – the total amount of time that the system can be unavailable before significant organizational impact will result.
- Data Integrity
- Recovery Point Objective (RPO) - the point in time, prior to a disruption, to which data can be recovered after an outage
- Critical business functions should be classified based on the determined impact
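The relationship between these objectives can be sketched in code. This is a minimal illustration, not something taken from the cited NIST guide: the class and function names are hypothetical, and treating the RPO as a tolerable data-loss window (the gap between the last recoverable copy and the disruption) is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ContinuityObjectives:
    """Hypothetical container for the BIA targets of one business function."""
    name: str
    rto: timedelta  # Recovery Time Objective
    mtd: timedelta  # Maximum Tolerable Downtime
    rpo: timedelta  # Recovery Point Objective, modeled as a data-loss window (assumption)


def rto_within_mtd(obj: ContinuityObjectives) -> bool:
    """The RTO should not exceed the MTD, or recovery completes too late."""
    return obj.rto <= obj.mtd


def rpo_is_met(obj: ContinuityObjectives, last_backup: datetime, disruption: datetime) -> bool:
    """Data written after the last backup is lost; that gap must fit within the RPO."""
    return disruption - last_backup <= obj.rpo


# Illustrative values only.
payroll = ContinuityObjectives(
    name="payroll",
    rto=timedelta(hours=8),
    mtd=timedelta(hours=24),
    rpo=timedelta(hours=4),
)
feasible = rto_within_mtd(payroll)
recent_enough = rpo_is_met(payroll, datetime(2024, 1, 1, 6), datetime(2024, 1, 1, 9))
too_old = rpo_is_met(payroll, datetime(2024, 1, 1, 2), datetime(2024, 1, 1, 9))
```

With these sample values the RTO fits inside the MTD, a backup taken three hours before the disruption satisfies the four-hour RPO, and one taken seven hours before does not.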
Sample Business Impact Analysis (BIA)
Cost Balancing
- The longer a disruption is allowed to continue, the more costly it can become to the organization
- Conversely, the shorter the RTO, the more expensive the recovery solutions are to implement
- Plotting the cost balance points will show an optimal point between disruption and recovery costs
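The cost balance point can be approximated numerically. The sketch below is only an illustration: both cost curves and every dollar figure are made-up assumptions, and `optimal_rto` is a hypothetical helper that picks the candidate RTO with the lowest combined cost.

```python
def optimal_rto(candidates, disruption_cost, recovery_cost):
    """Return the candidate RTO (in hours) with the lowest combined cost."""
    return min(candidates, key=lambda hours: disruption_cost(hours) + recovery_cost(hours))


# Hypothetical cost curves in dollars, for illustration only.
def disruption_cost(hours):
    # Losses grow the longer the disruption is allowed to continue.
    return 5_000 * hours


def recovery_cost(hours):
    # Faster recovery solutions (shorter RTO) cost more to implement.
    return 240_000 / hours


candidates = [1, 2, 4, 8, 12, 24, 48, 72]
best = optimal_rto(candidates, disruption_cost, recovery_cost)
best_total = disruption_cost(best) + recovery_cost(best)
```

For these assumed curves the lowest combined cost falls at an RTO of 8 hours; a real analysis would use cost estimates gathered during the BIA.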
Critical Business Functions
- Identification of critical business functions
- Operational impact
- Financial impact
- Reputation or public image impact
- Dependencies
- BIA enables characterization of the system components, supported business processes, and interdependencies
- Possible business impacts due to the unavailability of systems can be determined (RTO, MTD, etc.)
- The recovery sequence of information system components can then be finalized, forming the basis for developing contingency solutions
Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems
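One way to turn the interdependencies identified in the BIA into a recovery sequence is a topological sort, so that every component is restored after the components it depends on. This is a sketch under assumptions: the component names are hypothetical, and the technique is a common choice rather than something prescribed by the cited guide. It uses `graphlib` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def recovery_sequence(depends_on):
    """Order components so that each one comes after everything it depends on.

    depends_on maps a component name to the set of components it needs.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical components and dependencies, for illustration only.
depends_on = {
    "network": set(),
    "database": {"network"},
    "payroll-app": {"database"},
    "email": {"network"},
}
sequence = recovery_sequence(depends_on)
```

The exact order of independent components (here `database` and `email`) is not fixed by the dependencies alone; recovery priorities from the BIA would break such ties.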
Third Party Relationships
- Downstream liabilities
- Who will be impacted if your business is interrupted?
- Upstream impacts
- What happens if a partner’s business is interrupted?
- Enforce SLAs
Identify Preventive Controls
- Some outage impacts identified in BIA may be mitigated or eliminated through preventive measures that deter, detect, and/or reduce impacts to the system
- Where feasible and cost-effective, preventive methods are preferable to recovery methods. For example:
- Appropriately sized uninterruptible power supplies (UPS)
- Gasoline- or diesel-powered generators to provide long-term backup power
- Air-conditioning systems with adequate excess capacity to prevent failure of certain components, such as a compressor
- Fire detection and suppression systems
- Heat-resistant and waterproof containers for backup media and vital nonelectronic records
- Offsite storage of backup media, nonelectronic records, and system documentation
- Frequent scheduled backups including where the backups are stored (onsite or offsite) and how often they are recirculated and moved to storage.
C. Development Phase
- Develop and design recovery strategies
- IT recovery
- Business process recovery
- Facilities recovery
BCP/DRP Development
Activation and Notification Phase
- Defines initial actions taken once a system disruption or outage has been detected or appears to be imminent
- Activation Criteria and Procedure - BC or DR plan should be activated if one or more of the activation criteria are met. Criteria may be based on:
- Extent of any damage to the system
- Criticality of the system to the organization’s mission
- Expected duration of the outage lasting longer than the RTO
- Notification Procedures - Describe the methods used to notify recovery personnel during business and non-business hours. Notification methods can be:
- Manual
- Automatic
- Outage Assessment - Assess the nature and extent of the disruption
- Assessment should be completed as quickly as the given conditions permit
- Outage Assessment Team is the first team notified of the disruption.
Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems
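The activation rule described above (activate when one or more criteria are met) can be sketched as a simple check. The field names are hypothetical, and how an organization judges "extensive damage" or "mission critical" is its own policy decision; only the comparison of expected outage duration against the RTO comes directly from the list above.

```python
from dataclasses import dataclass


@dataclass
class OutageAssessment:
    """Hypothetical summary produced by the outage assessment team."""
    damage_is_extensive: bool         # judged against organization-defined thresholds
    system_is_mission_critical: bool  # criticality to the organization's mission
    expected_outage_hours: float
    rto_hours: float


def should_activate(assessment: OutageAssessment) -> bool:
    """Return True if one or more activation criteria are met."""
    criteria = (
        assessment.damage_is_extensive,
        assessment.system_is_mission_critical,
        assessment.expected_outage_hours > assessment.rto_hours,
    )
    return any(criteria)


# Illustrative cases only.
long_outage = should_activate(OutageAssessment(False, False, 10, 4))
brief_outage = should_activate(OutageAssessment(False, False, 2, 4))
```

In this sketch the first case activates the plan because the expected outage exceeds the RTO, while the second case meets none of the criteria.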
Recovery Phase
- Focuses on implementing recovery strategies to restore system capabilities, repair damage, and resume operational capabilities at the original or new alternate location
- Sequence of Recovery Activities
- Should reflect the system’s MTD to avoid significant impacts to related systems
- Recovery Procedures
- Should provide detailed procedures to restore the information system or components to a known state
- Recovery procedures should be written in a straightforward, step-by-step style
- Recovery Escalation and Notification
- Effective escalation and notification procedures should define and describe the events, thresholds, or other types of triggers that are necessary for additional action
- At the completion of the Recovery Phase, the information system will be functional and capable of performing the functions identified in the plan
Reconstitution Phase
- Defines the actions taken to test and validate system capability and functionality
- Concurrent Processing - running two systems concurrently until there is a level of assurance that the recovered system is operating properly
- Validation Data Testing - testing and validating recovered data to ensure complete and current recovery
- Validation Functionality Testing - verifying that all system functionality has been tested, and that normal operations can resume
- Plan deactivation activities for returning to normal operations include:
- Notifications – notifying users using predefined procedures that normal operations have resumed
- Cleanup - cleaning up work space or dismantling any temporary recovery locations, restocking supplies, returning manuals or other documentation to their original locations, and readying the system for another contingency event
- Offsite Data Storage - If offsite data storage is used, retrieved backup should be returned to its offsite data storage location
- Data Backup - system should be fully backed up and a new copy of the current operational system stored for future recovery efforts
- Event Documentation - All recovery and reconstitution events should be well documented for an after-action report with lessons learned
Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems
Backup and Recovery
- Backup and recovery methods and strategies are a means to restore system operations quickly and effectively following a service disruption
- These should be integrated into the system architecture during the Development/Acquisition phase of the SDLC
- Considerations for developing or comparing backup and recovery methods:
- Cost
- Maximum downtimes
- Security
- Recovery priorities
- Integration with organization-level BCM plans
Recovery Time
Recovery Tier | Recovery Timeframe | Recovery Requirement
I | 0-24 hours | Resources must be available in advance and implemented first
II | 1-3 days | Resources must be available in advance
III | 3-5 days | Resources must be identified and quickly available
IV | Other | Resources must be identified
Method to Prioritize Business Processes or IT Infrastructure Components
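The tier assignment can be expressed as a small lookup. This is a sketch only: the function name is hypothetical, and because the timeframes for tiers II and III both include the 3-day mark, the example assumes a boundary value belongs to the faster tier.

```python
def recovery_tier(rto_hours: float) -> str:
    """Map a recovery timeframe in hours to a recovery tier.

    Assumption: a value that sits exactly on a boundary (24, 72, or 120
    hours) is assigned to the faster tier.
    """
    if rto_hours < 0:
        raise ValueError("rto_hours must not be negative")
    if rto_hours <= 24:
        return "I"    # 0-24 hours
    if rto_hours <= 72:
        return "II"   # 1-3 days
    if rto_hours <= 120:
        return "III"  # 3-5 days
    return "IV"       # Other


tiers = [recovery_tier(hours) for hours in (4, 48, 100, 200)]
```

For the four sample timeframes this yields tiers I, II, III, and IV in that order.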
High Availability (HA) Processes
- HA is a process where redundancy and failover processes are built into a system to maximize its uptime and availability
- The goal of HA is to achieve an uptime of 99.999% or higher
- HA can be expensive and is not a viable option for many systems; it should be considered only for systems that cannot tolerate downtime
- HA systems cannot be a replacement for a solid backup strategy
- HA processes need to be extended to an alternate location
- Mechanisms such as block mirroring to an alternate site should be considered to provide redundancy and backup of system data outside of the system facility.
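An uptime percentage is easier to reason about when converted into the downtime it allows. The helper below is a hypothetical sketch that assumes a 365-day year by default.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime, in minutes, permitted by an uptime target over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    minutes_in_period = period_days * 24 * 60
    return (1 - uptime_percent / 100) * minutes_in_period


five_nines = allowed_downtime_minutes(99.999)
three_nines = allowed_downtime_minutes(99.9)
```

An uptime target of 99.999% leaves roughly 5.26 minutes of downtime per year, while 99.9% leaves about 8.76 hours, which illustrates why HA is costly to achieve.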
IT Recovery Strategies
- Multiple Processing Sites
- Mirrored Sites
- Fully redundant with identical data and equipment as well as communication capabilities
- Highest level of availability at highest cost
- It ensures virtually 100% availability
- Configuration management is a challenge
- RTO of minutes to hours
IT Recovery Strategies
- Mobile site/trailer
- Self contained unit with IT and communications
- RTO of 3-5 days
- Hot site
- Fully equipped data center and communications
- RTO of a few minutes to hours
- Warm site
- Has some level of IT capabilities, but will have to be further equipped to take over IT operations
- RTO of 5+ days
- Cold site
- A location capable of supporting IT operations, but with no equipment
- RTO of 1-2 weeks at the minimum
Alternate IT Recovery Strategies
- Virtual business partners
- Similar to multiple sites, but alternate sites are hosted by business partners
- Reciprocal or mutual aid agreements with an internal or external entity
- Dedicated site owned or operated by the organization
- Commercially leased facility
Backup Approaches
- Electronic vaulting
- Sending data directly to an alternate facility
- Can be stored on disk or tape depending on RTO requirements
- Remote journaling
- Replicates data transactions in real time or near real time to a secondary processing site
- Offsite storage
- Storage Area Network
- Database shadowing and mirroring
Backup Methods
- Data integrity involves keeping data safe and accurate on the system’s primary storage devices
- There are three common methods for performing system backups:
- Full Backup - captures all files on the disk or within the folder selected for backup
- Locating a particular file or group of files is simple
- Time required to perform a full backup can be lengthy; might also lead to excessive, unnecessary media storage requirements
- Differential Backup - stores files that were created or modified since the last full backup
- Takes less time to complete than a full backup
- Incremental Backup - captures files that were created or changed since the last backup
- Affords more efficient use of storage media; backup times are reduced
- Media from different backup operations may be required to recover a system from an incremental backup
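The difference between the three methods comes down to which cutoff time is used when selecting files. The sketch below is illustrative only: `select_files` is a hypothetical helper, and it assumes file modification times are available as a simple mapping rather than read from a real file system.

```python
from datetime import datetime


def select_files(modified_at, method, last_full, last_backup):
    """Return the paths a backup run would capture.

    modified_at: dict mapping path -> last modification time
    last_full:   time of the last full backup
    last_backup: time of the last backup of any kind
    """
    if method == "full":
        return sorted(modified_at)  # everything, regardless of age
    if method == "differential":
        cutoff = last_full          # changed since the last full backup
    elif method == "incremental":
        cutoff = last_backup        # changed since the last backup of any kind
    else:
        raise ValueError(f"unknown backup method: {method}")
    return sorted(path for path, when in modified_at.items() if when > cutoff)


# Hypothetical schedule: full backup on Jan 7, incremental on Jan 9.
modified_at = {
    "a.txt": datetime(2024, 1, 6),
    "b.txt": datetime(2024, 1, 8),
    "c.txt": datetime(2024, 1, 10),
}
last_full = datetime(2024, 1, 7)
last_backup = datetime(2024, 1, 9)

full = select_files(modified_at, "full", last_full, last_backup)
differential = select_files(modified_at, "differential", last_full, last_backup)
incremental = select_files(modified_at, "incremental", last_full, last_backup)
```

Here the full backup captures all three files, the differential captures `b.txt` and `c.txt`, and the incremental captures only `c.txt`. Restoring from incrementals therefore needs the last full backup plus every incremental since, whereas a differential restore needs only the last full backup and the latest differential.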
Backup Locations
- On-site
- Near-site
- Off-site
Communications
- Emergency communication systems
- Remote access may serve as an important contingency capability by providing access to organization-wide data for recovery teams or users from another location
- Wireless (or WiFi) local area networks can serve as an effective contingency solution to restore network services following a wired LAN disruption
- Business communications systems
- Networks
- Some of the ways to ensure communication availability are:
- Redundant communications links
- Redundant network service providers
- Redundant network-connecting devices
- Redundancy from the network service provider (NSP) or Internet service provider (ISP)
- Monitoring software can be installed to provide warning and troubleshoot network issues before users and other nodes notice problems.
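The benefit of redundant links can be estimated with a simple probability calculation. This sketch assumes the links fail independently of each other, which is a strong assumption (links that share a provider, conduit, or power source do not), and the availability figures are made up for illustration.

```python
def combined_availability(link_availabilities):
    """Probability that at least one link is up, assuming independent failures."""
    all_down = 1.0
    for availability in link_availabilities:
        if not 0 <= availability <= 1:
            raise ValueError("availability must be between 0 and 1")
        all_down *= 1 - availability  # chance that this link is also down
    return 1 - all_down


single_link = combined_availability([0.99])
two_links = combined_availability([0.99, 0.99])
```

Under the independence assumption, two links that are each available 99% of the time give a combined availability of about 99.99%. Using different service providers and separate network devices is what makes that assumption more realistic.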
D. Implementation
- Initial walkthroughs of design
- Implement design
- Test
- Monitor
Testing, Training and Exercises (TT&E)
- Training - personnel are trained to fulfill their roles and responsibilities within the plan
- Exercises – plans simulated to validate their content
- Testing - systems and system components tested to ensure their operability in a disrupted environment
Testing
- Design short and long term continuity and crisis management testing plans
- Update plans as necessary and document
- Test types
- Checklist
- Walkthrough (table top review)
- Simulation
- Parallel
- Full-interruption
BCP Program Awareness and Training
- Recovery strategy and procedures must be documented and made available
- Recovery personnel should be familiar with their roles and be taught the skills necessary to prepare for tests, exercises, and actual outage events
- Training should be provided at least annually, to the point that personnel can execute their respective recovery roles without the aid of documentation
- Leadership training – crisis management
- Tech teams training – procedures and logistics
- Part of onboarding training
- Recovery personnel should be trained on the following plan elements:
- Purpose of the plan
- Cross-team coordination and communication
- Reporting procedures
- Security requirements
- Team-specific processes
- Individual responsibilities
BCP Program Exercises
- An exercise is a simulation of an emergency designed to validate the viability of one or more aspects of the Business Continuity or Disaster Recovery plans
- Exercises are scenario-driven
- Types of exercises are:
- Tabletop Exercises - Discussion-based exercises in which personnel discuss their roles during an emergency and their responses to a particular emergency situation
- Functional Exercises - Personnel validate their operational readiness for emergencies by performing their duties in a simulated operational environment
Developing BCP/DRP culture
- Personnel across the organization must be confident and competent with the BCP/DRP program
- BCP must be aligned with organizational business objectives
- Organizations must establish a BCM culture and integrate it into daily business operations with the support of the CRO and senior management.
- Three techniques are involved in developing and establishing BCM culture within an organization:
- Design and deliver an awareness campaign to create and promote BCM awareness and develop skills, knowledge, and commitment required to ensure a successful BCM practice.
- Ensure the awareness campaign has achieved its goals and monitor BCM awareness for a longer term.
- Perform an assessment on the current BCM awareness level against the management-targeted level.
Emergency Operations Center
- A physical location to coordinate emergency response efforts
- Virtual EOC
- Helps in the case of a pandemic or globally dispersed key employees
E. Management of BCP/DRP
- Program oversight
- Continuity planning manager
- Updating and maintenance of the plan - Changes in specific areas may require attention, for example, employee turnover, changes to organizational structure, changes to business processes, etc.
- Regular practice of the plan
- Validate plans by having everyone involved perform simulations of different scenarios
- Frequency of exercises depends on the rate of changes made within the organization
- Review the result of earlier exercises to ensure identified weaknesses have been addressed
- Review BCP - An audit by internal or external auditors can highlight all key material weaknesses and issues