
July 8, 2010 by Bas

A while back I was on a call with someone who asked me the difference between high availability (HA) and disaster recovery (DR), saying that there are so many different solutions out there, and that a lot of people seem to use the terminology but are unable to explain anything more about these two descriptions. One classic definition of availability is the ratio of (a) the total time a functional unit is capable of being used during a given interval to (b) the length of the interval. Going by that definition you will notice that there is no single fixed threshold for availability. Uptime is also not the whole story: you would be hard pressed to call a system "available" if you could still work with it, but the data you were working with was corrupted because one of your power users made an error during a copy job and wrote an older data set in the wrong spot. Questions like these will help you realize that not everything you have running has the same value.
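To make that ratio concrete, here is a minimal sketch; the uptime and interval figures are invented for illustration, and the MTBF/MTTR form is the equivalent steady-state expression.

    # Availability as defined above: usable time divided by interval length.
    uptime_h = 8755.0        # hours the system was usable (assumed)
    interval_h = 8760.0      # one year
    print(uptime_h / interval_h)        # ~0.99943, about "three nines"

    # Equivalent steady-state form using mean time between failures (MTBF)
    # and mean time to repair (MTTR):
    mtbf_h, mttr_h = 1000.0, 2.0        # assumed figures
    print(mtbf_h / (mtbf_h + mttr_h))   # ~0.998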
You might want to look at the definitions for High Availability and Disaster Recovery that a colleague and I came up with.
Very public real-world disasters have taught us as an industry valuable lessons about what real business continuity requires.
Putting both pieces together gives us the infrastructure necessary to perform a Long Distance vMotion.
It is also assumed that the repair will successfully restore redundancy, provided a further drive failure does not occur. Unfortunately, mistakes can happen when personnel are involved in the rebuild.
In this case, MTTDL is approximately 587,000 hours, or a 1 in 67 risk of losing data per year. Because of the amount of storage required for redundancy in RAID-1, it is typically only used for small arrays or applications where data availability and performance are critical. RAID levels using parity are widely used to trade off some performance for additional storage capacity.
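The arithmetic behind an MTTDL figure like this can be sketched as below. The drive parameters (MTBF, MTTR, capacity, unrecoverable error rate) are assumptions for illustration, not the inputs behind the 587,000-hour figure above, so the output will differ.

    # Simple mean-time-to-data-loss model for a two-drive RAID-1 mirror.
    # Data is lost if, after a first drive failure, either the surviving
    # drive fails during the rebuild or an unrecoverable read error (URE)
    # is hit while re-reading it.
    def mirror_mttdl(mtbf_h, mttr_h, capacity_bits, uer):
        first_failure_rate = 2.0 / mtbf_h             # either drive can fail first
        p_second_failure = mttr_h / mtbf_h            # failure inside rebuild window
        p_ure = 1.0 - (1.0 - uer) ** capacity_bits    # URE during the rebuild read
        p_loss = p_second_failure + p_ure - p_second_failure * p_ure
        return 1.0 / (first_failure_rate * p_loss)

    mttdl_h = mirror_mttdl(mtbf_h=500_000, mttr_h=24,
                           capacity_bits=8 * 10**12,  # 1 TB drive (assumed)
                           uer=1e-14)                 # consumer-class URE rate
    print(f"MTTDL ~ {mttdl_h:,.0f} hours, 1 in {mttdl_h / 8760:,.0f} per year")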
Properly calculating the RAID-6 MTTDL requires either Markov chains or very long series expansions, and there are significant differences in rebuild logic between vendors.
Evaluating an equivalent, 7-drive RAID-6 array yields an MTTDL of approximately 100,000 hours, or a 1 in 11 chance of array loss per year. Achieving high MTTDL with RAID requires the use of enterprise drives (which have a lower unrecoverable error rate). Because of these factors, additional redundancy is required in conventional application deployments, which I will cover in subsequent articles in this series. Business requirements and application criticality should guide the approach chosen for business continuity.  Consider the concepts of RPO (Recovery Point Objective) and RTO (Recovery Time Objective).
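For comparison, the triple-failure term alone can be approximated as below; note that this term by itself yields a far larger MTTDL than the 100,000-hour figure above, which is exactly why the URE terms and vendor-specific rebuild logic mentioned earlier dominate real RAID-6 calculations. The parameters are again assumptions.

    # Rough series approximation for RAID-6: data loss requires three drive
    # failures within overlapping rebuild windows. UREs after a double
    # failure are ignored here, which is why this overestimates MTTDL.
    def raid6_triple_failure_mttdl(n, mtbf_h, mttr_h):
        return mtbf_h**3 / (n * (n - 1) * (n - 2) * mttr_h**2)

    print(raid6_triple_failure_mttdl(n=7, mtbf_h=500_000, mttr_h=24))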
When procuring IaaS (Infrastructure as a Service) or SaaS (Software as a Service), it is essential for the organization to perform due diligence regarding what disaster recovery mechanisms the service vendor uses. Wikibon is a professional community solving technology and business problems through an open source sharing of free advisory knowledge. This case illustrates the issues faced by a company that wanted to reduce the risk of losing transactional data. The case study is derived from a real engagement, but has been modified to keep the identity of the organization confidential. MFC has implemented a state-of-the-art metropolitan recovery system between two data centers situated 15 miles apart in the US, with a deployment of storage equipment from two storage vendors.
MFC knew from previous studies that the business impact of losing transactional data was very high. MFC IT wanted to radically change the philosophy of remote recovery, and build resilience into both the applications and the infrastructure.
After evaluating the available technologies, MFC IT concluded that a three-data-center topology was the only approach that could significantly reduce the amount of data lost, and provide testing as a normal part of operations. Two vendors were selected to participate in the project, with a 50-50 split in responsibilities.
This case study is designed to give guidance to other customers considering justifying and optimizing disaster recovery solutions, and to give confidence that the products, skills and experience are available to successfully implement this type of project.
In analyzing the financial impact of a disaster, there are two major contributions to potential financial loss. MFC IT governance executives concluded that if the loss of data were kept to a minimum, this would also significantly improve the recovery time. MFC is no different from any other large organization; IT had to produce a business case before any project could go ahead. The previous steps allowed the estimation of the maximum exposure to loss of customer data, and the expected loss. These results allowed executive management to make a formal decision to authorize the project.
The first task was to establish a metric that set a value on data lost, and that could be adopted as a standard for the organization. RPO example: a finance application with an RPO of 1 hour (90% confidence) means that recovery from a failure will be able to go back to a recovery point that is less than 1 hour old for 90% of all failures. A key question for MFC was to establish a method to estimate the financial impact of loss of data from the FTS application addressed in this exercise. In case both data centers were taken out by a rolling disaster, a consistent point-in-time incremental copy of all the data was made twice a day, after the finish of on-line processing and after the finish of batch processing.
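The confidence qualifier on an RPO can be checked empirically; a minimal sketch, with made-up recovery-point ages standing in for measurements taken at simulated failure times:

    # Fraction of (simulated) failures whose newest usable recovery point
    # was within the 1-hour RPO target. All figures are invented.
    recovery_point_ages_h = [0.2, 0.5, 0.9, 1.4, 0.3, 0.8, 0.6, 2.1, 0.4, 0.7]
    target_rpo_h = 1.0

    met = sum(age <= target_rpo_h for age in recovery_point_ages_h)
    confidence = met / len(recovery_point_ages_h)
    print(f"{confidence:.0%} of failures met the {target_rpo_h}-hour RPO")  # 80%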
The next stage of establishing the RPO is establishing how often the circumstances would occur that would result in loss of data.
The concept of "expected loss" needs clarification, because it has a precise statistical meaning. A possible fourth way is to use Wall Street firms (such as M&A firms) to assess the risk profile of IT, and to assess the impact on share price (short and long term) should there be a disaster. Can a three-data-center solution be implemented that would reduce the amount of data loss from hours to minutes? MFC had been working with storage vendors for a number of years to establish the practical viability of three-data-center topologies. MFC IT executives were convinced that at least two vendors had the capability of delivering the hardware, software and implementation skills necessary to make the project work.
The senior executives were fully behind reducing a potential liability of $2.5B that could happen at any time! MFC made the decision to go ahead with two vendors to implement a full three data center solution for both the on-line and the batch parts of the FTS system.


Switching workloads between the three data centers became a repeatable practice, and enabled MFC to take a significant step towards implementing a philosophy of building business continuance in as an intrinsic part of application and infrastructure design.
When we are talking about HA, we imply that we want to increase the amount of time your system is in a functioning condition. Simply put, this means that you need to put your own definition in place when talking about HA. Now again, it's important to define what you would call a disaster, but at least there seems to be some common understanding that anything that gets you back up and running after an entire site goes down usually falls under the label of a DR solution. Can I bring everything down, or do I need to distribute my maintenance across independent entities? Do I need a failover site, or are all my users in the same spot and won't be able to work anyway? Your development system with 6,000 people working on it worldwide might need better protection than your production system that is only being used by 500 people spread through the Baltic region. The concepts of off-site archives and disaster recovery (DR) can be at least partially attributed to the Oklahoma City bombing. Prior to that, having only local or off-site tape archives was commonly acceptable: if data got lost, I'd get the tape and restore it.
There were companies with primary data centers in one tower and the DR data center in the other. Sarcasm aside, we now have a better set of recommended practices for DR solutions to provide business continuity (BC).
The reason for this is that we’ve designed our commodity server environments as individual application silos directly tied to the operating system and underlying hardware. The RPO of a system is the specified amount of data that may be lost in the event of a failure, while the RTO of a system is the amount of time that it will take to bring the system back online after a failure.  In general, site-local mechanisms will provide near-instantaneous RPO and RTO, while disaster recovery systems often will have an RPO of several hours or days of information, and an RTO measured in tens of minutes.
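To restate the distinction in code form, a small sketch (the tier names and figures are illustrative, not from the article):

    # Typical recovery objectives for the two protection tiers described above.
    from dataclasses import dataclass

    @dataclass
    class RecoveryTier:
        name: str
        rpo: str   # how much data may be lost
        rto: str   # how long until the system is back online

    tiers = [
        RecoveryTier("site-local HA", "near zero", "near zero"),
        RecoveryTier("disaster recovery", "hours to days", "tens of minutes or more"),
    ]
    for tier in tiers:
        print(f"{tier.name}: RPO={tier.rpo}, RTO={tier.rto}")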
The stakes are too high to trust service level agreements alone (in the case of a catastrophic failure during a disaster, will the vendor be solvent and will the compensation received be sufficient to compensate for business losses?). These pages are not sponsored or sanctioned by any of the companies mentioned; they are the sole work and property of the authors.
It shows the processes they followed to analyze and estimate that risk, and develop the business case.
MFC is a multinational finance company recognized as a market maker in the US and internationally. This ensures that no data is lost, and systems are switched seamlessly should there be any disaster at one of the sites. MFC IT has continuously investigated different technologies that would significantly reduce the amount of data lost in the case of a regional disaster. Rather than testing remote disaster recovery as a special case a few times a year, it wanted to be able to switch applications to any node, local or remote, as a normal part of operations. MFC initiated a project to build a business case, test and implement a three-data-center topology that would dramatically reduce the amount and probability of data loss in the event of a regional disaster.
The business case analysis determined that the reduction in risk would be worth $84 million per year after implementation of the three-data center topology. IT had established that there were available technologies that could be implemented from more than one vendor that could reduce the amount of data lost in the case of a disaster from hours to minutes. This process allowed a number of different methodologies to be used to “triangulate” on an overall estimate. This allowed the total cost of the implementation and the expected benefits to be analyzed over a three-year time period, and key financial metrics such as ROI, IRR, NPV and breakeven to be established. The metric selected to define the average amount of data that is likely to be lost during a disaster was the recovery point objective, or RPO. The current topology was two data centers (A & B) separated by less than twenty miles, with a third data center (C) that was in Europe.
Consistent means that all the volumes were consistent with each other, point-in-time means that all the volumes reflected completed transactions at a certain exact time, and incremental means that only the changes in the data were copied (about 2 terabytes of the 10 terabytes of storage). Because the two data centers are separated by 20 miles, the probability of both data centers having a total outage simultaneously is significantly lower than the probability of an outage at just one of them.
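A minimal sketch of why separation helps; the outage probability is invented, and the independence assumption is exactly what fails in a regional disaster, which is why MFC wanted a third, distant site:

    # If site outages were statistically independent, the chance of losing
    # both sites at once would be the product of the individual chances.
    p_site_outage = 0.01            # assumed per-site outage probability
    p_both = p_site_outage ** 2     # 0.0001 under independence
    print(p_both)
    # In practice a large regional event (earthquake, grid failure) can hit
    # both metropolitan sites together, so the true joint probability is
    # higher than this product.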
The best-case scenario they proposed was to reduce the average loss by a factor of three, with the frequency of both data centers being taken out reduced to once every 10 years.
If an insurance company were insuring a large number of companies, it could establish an expected or average loss per company, and ensure that the premiums cover this loss. The International Convergence of Capital Measurement and Capital Standards, known as Basel II, defines operational risk as the risk of loss resulting from inadequate or failed internal processes, people and systems, or from external events. The long-term reduction in capitalization would then be another way to "triangulate" on an agreed range of values for disaster impact. The potential cost of such a solution was estimated to be less than $10 million in initial costs and $5 million per year to sustain (network being a significant portion).
Even if the estimates of expected loss were out by a long way, the business case was overwhelming.
This document is copyright protected by Wikibon and does not fall under the GNU general license terms for Wikibon.org. Wikibon case studies are developed independently and their development is not initiated for or funded by any single company.
You can log on to it, you can work with it, but the output you are going to get will be wrong.
A hazard, in turn, is a situation that poses a level of threat to life, health, property, or the environment, or that may deleteriously affect society. Try to find out the importance and value of your solution and base your requirements on that. If you need the certainty that your solution is both highly available and able to recover from a disaster, you will notice that the price tag quickly skyrockets. That worked well until we saw what happens when you have all the data and no data center to restore to. Through increasingly sophisticated (and costly) infrastructures, these times can be reduced but not entirely eliminated.


While the author(s) may have professional connections to some of the companies mentioned, all opinions are that of the individuals and may differ from official positions of those companies.
MFC was also well aware that there were significant risks in the current disaster recovery plan: should there be a regional disaster that took both metropolitan sites out of commission, the recovery process would be slow, and over 20 hours of transactional data could be lost.
The application selected was the Financial Transaction System (FTS), with many millions of transactions per day. The costs of implementation were about $10 million in initial costs, and $5.25 million in yearly operational costs.
IT worked with business executives from a number of different parts of the organization to establish the case. Traditionally, companies intuitively know they do not want to lose data, but have a difficult time placing a value on transaction losses.
The definition of RPO for a particular installation needs to include an assumption about the percentage of time that the RPO is achieved.
The application running on the A data center was synchronously mirrored onto the second data center. The data was then transmitted over high-speed lines to Europe and merged into the remote storage.
The best-case expected loss every year with the current topology was therefore $1.8 billion divided by four, divided by 10, or $45 million per year. Reducing expected loss would reduce the premiums paid, and would be a business benefit to MFC. The risks apply to any organization in business, but they are of particular relevance to the finance sector, where regulators are responsible for establishing safeguards to protect against systemic failure of the banking system and the economy.
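The expected-loss arithmetic above, spelled out (the factor-of-four adjustment from maximum to best-case loss and the once-every-10-years frequency are the case study's own assumptions):

    # Expected yearly loss = maximum exposure, scaled to a best-case loss,
    # times the assumed frequency of a regional disaster.
    max_exposure = 1.8e9          # dollars
    best_case_factor = 4          # maximum loss -> best-case loss
    events_per_year = 1 / 10      # one regional disaster per decade
    expected_loss = max_exposure / best_case_factor * events_per_year
    print(f"${expected_loss / 1e6:.0f}M per year")   # $45M per year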
One vendor was given responsibility for implementing the on-line portion of the workload, and the other was given responsibility for the batch portion of the project. Links to this article from external sources are allowed, however any other re-distribution of this content for commercial purposes is strictly prohibited. Wikibon reports actual customer experiences and results with no attempt to emphasize any one vendor’s strengths or weaknesses. I’ve had customers that needed HA and defined this as the system having a certain amount of uptime, which is one way to measure it.
But when you ask most people in IT about availability, the first thing you will likely hear is something related to uptime or downtime, which is another reason to make sure that you know exactly what kind of protection you need; creating that definition is the most important starting point. There were latency and locality gains from the setup, and the idea that both world-class engineering marvels could come down was far-fetched. 50 km will protect from an explosion, a power outage, and several other events, but it probably won't protect from a major natural disaster such as an earthquake or hurricane. This is a personal blog of the author, and does not necessarily represent the opinions and positions of his employer or their partners. Customers, partners, shareholders and governance agencies want to be assured that data is not lost and that systems can be restored quickly. Because data could be lost, this disaster scenario cannot be fully tested by transferring the production system to the remote system; remote recovery can only be partially tested with historical data. MFC did disaster recovery testing twice a year, but they were concerned that these tests were not robust enough, and that both the amount of data lost and the time to recover could be significantly higher in the case of a real disaster. The implementation schedule was 6 months, the payback period was estimated at 7 months, and the net present value over three years was $161 million with an IRR of 271%.
The primary concern for MFC in this Financial Transaction System application (FTS) was the loss of their customers’ data.
This included the corporate risk manager, the heads of the departments responsible for execution and business processes of the key finance applications, and the audit and governance functions. By using a synchronous copy, no transaction was complete until it was written to both sets of disks in the A and B data centers.
The backup data took two hours to produce, and another six hours to be transmitted to Europe over an OC-48 line.
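A back-of-envelope check on that transfer time; OC-48 runs at roughly 2.5 Gbit/s, the incremental copy was about 2 terabytes (see above), and the 40% effective-throughput figure is an assumption to illustrate why protocol overhead stretches the raw line rate:

    # Hours to push ~2 TB over an OC-48 line at an assumed 40% efficiency.
    line_rate_bps = 2.488e9            # OC-48 raw rate, bits per second
    payload_bits = 2e12 * 8            # ~2 TB incremental copy
    efficiency = 0.4                   # assumed effective throughput
    hours = payload_bits / (line_rate_bps * efficiency) / 3600
    print(f"~{hours:.1f} hours")       # ~4.5 hours, same order as the 6 quoted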
Reduction in expected loss is therefore a business benefit to MFC, and can be used as a line item in a business case. The business case essentially says that an initial investment of $10 million will return $161 million in three years.
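A minimal NPV sketch using the figures quoted in this case (the 10% discount rate and the even yearly phasing are assumptions; the study's own $161 million figure implies its own rate and benefit ramp-up):

    # Net present value of the project over three years.
    initial_cost = 10e6          # initial implementation cost
    yearly_cost = 5.25e6         # operational cost per year
    yearly_benefit = 84e6        # reduction in expected loss per year
    rate = 0.10                  # assumed discount rate

    npv = -initial_cost
    for year in range(1, 4):
        npv += (yearly_benefit - yearly_cost) / (1 + rate) ** year
    print(f"NPV over 3 years: ${npv / 1e6:.0f}M")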
Once you have your own definition, make sure that you communicate those definitions and requirements so that all parties are on the same page. If those are concerns, the distance increases, and you may end up with more than two data centers.
The loss of service for a day would be very unpleasant, but the business impact could be contained. Any disaster in A meant that no data was lost, and that the systems in center B could recover and continue in exactly the same way as they would in the A center. If both the A and B data centers were taken out, the maximum amount of data that would be lost is 12 + 2 + 6 hours = 20 hours of data. However, the damage done to the reputation of MFC if a day's worth of their customers' data were lost could be catastrophic. Another way of putting the benefits is that the project significantly reduces the risk of losing $2.5 billion, a risk that would otherwise cost at least $84 million per year in insurance payments, or $132 million in interest lost on reserves.


