The Ever Changing IBM Z Mainframe Disaster Recovery Requirement

With a 50+ year longevity, of course the IBM Z Mainframe Disaster Recovery (DR) requirement and associated processes have changed and evolved accordingly.  Initially, the primary focus would have been HDA (Head Disk Assembly) related, recovering data due to hardware (E.g. 23nn, 33nn DASD) failures.  It seems incredulous in the 21st Century to consider the downtime and data loss with such an event, but these failures were commonplace into the early 1980’s.  Disk drive (DASD) reliability increased with the 3380 device in the 1980’s and the introduction of the 3990-03 Dual Copy capability in the late 1980’s eradicated the potential consequences of a physical HDA failure.

The significant cost of storage and CPU resources dictated that many organizations had to rely upon 3rd party service providers for DR resource provision.  Often this dictated a classification of business applications, differentiating between Mission Critical or not, where DR backup and recovery processes would be application based.  Even the largest of organizations that could afford to duplicate CPU resource, would have to rely upon the Ford Transit Access Method (FTAM), shipping physical tape from one location to another and performing proactive or more likely reactive data restore activities.  A modicum of database log-shipping over SNA networks automated this process for Mission Critical data, but successful DR provision was still a major consideration.

Even with the Dual Copy function, this meant DASD storage resources had to be doubled for contingency purposes.  Therefore this dictated only the upper echelons of the business world (I.E. Financial Organizations, Telecommunications Suppliers, Airlines, Etc.) could afford the duplication of investment required for self-sufficient DR capability.  Put simply, a duplication of IBM Mainframe CPU, Network and Storage resources was required…

The 1990’s heralded a significant evolution in generic IT technology, including IBM Mainframe.  The adoption of RAID technology for IBM Mainframe Count Key Data (CKD) provided an affordable solution for all IBM Mainframe users, where RAID-5(+) implementations became commonplace.  The emergence of ESCON/FICON channel connectivity provided the extended distance requirement to complement the emerging Parallel SYSPLEX technology, allowing IBM Mainframe servers and related storage to be geographically dispersed.  This allowed a greater number of IBM Mainframe customers to provision their own in-house DR capability, but many still relied upon physical tape shipment to a 3rd party DR services provider.

The final significant storage technology evolution was the Virtual Tape Library (VTL) structure, introduced in the mid-1990’s.  This technology simplified capacity optimization for physical tape media, while reducing the number of physical drives required to satisfy the tape workload.  These VTL structures would also benefit from SYSPLEX implementations, but for many IBM Mainframe users, physical tape shipment might still be required.  Even though the IBM Mainframe had supported IP connectivity since the early 1990’s, using this network capability to ship significant amounts of data was dependent upon public network infrastructures becoming faster and more affordable.  In the mid-2000’s, transporting IBM Mainframe backup data via extended network carriers, beyond the limit of FICON technologies became more commonplace, once again, changing the face of DR approaches.

More recently, the need for Grid configurations of 2, 3 or more locations has become the utopia for the Global 1000 type business organization.  Numerous copies of synchronized Mission Critical if not all IBM Z Mainframe data are now maintained, reducing the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) DR criteria to several Minutes or less.

As with anything in life, learning from the lessons of history is always a good thing and for each and every high profile IBM Z Mainframe user (E.g. 5000+ MSU), there are many more smaller users, who face the same DR challenges.  Just as various technology races (E.g. Space, Motor Sport, Energy, et al) eventually deliver affordable benefit to a wider population, the same applies for the IBM Z Mainframe community.  The commonality is the challenges faced, where over the years, DR focus has either been application or entire business based, influenced by the technologies available to the IBM Mainframe user, typically dictated by cost.  However, the recent digital data explosion generates a common challenge for all IT users alike, whether large or small.  Quite simply, to remain competitive and generate new business opportunities from that priceless and unique resource, namely business data, organizations must embrace the DevOps philosophy.

Let’s consider the frequency of performing DR tests.  If you’re a smaller IBM Z Mainframe user, relying upon a 3rd party DR service provider, your DR test frequency might be 1-2 tests per year.  Conversely if you’re a large IBM z Mainframe user, deploying a Grid configuration, you might consider that your business no longer has the requirement for periodic DR tests?  This would be a dangerous thought pattern, because it was forever thus, SYSPLEX and Grid configurations only safeguard from physical hardware scenarios, whereas a logical error will proliferate throughout all data copies, whether, 2, 3 or more…

Similarly, when considering the frequency of Business Application changes, for the archetypal IBM Z Mainframe user, this might have been Monthly or Quarterly, perhaps with imposed change freezes due to significant seasonal or business peaks.  However, in an IT ecosystem where the IBM Z Mainframe is just another interconnected node on the network, the requirement for a significantly increased frequency of Business Application changes arguably becomes mandatory.  Therefore, once again, if we consider our frequency of DR tests, how many per year do we perform?  In all likelihood, this becomes the wrong question!  A better statement might be, “we perform an automated DR test as part of our Business Application changes”.  In theory, the adoption of DevOps either increases the frequency of scheduled Business Application changes, or organization embraces an “on demand” type approach…

We must then consider which IT Group performs the DR test?  In theory, it’s many groups, dictated by their technical expertise, whether Server, Storage, Network, Database, Transaction or Operations based.  Once again, if embracing DevOps, the Application Development teams need to be able to write and test code, while the Operations teams need to implement and manage the associated business services.  In such a model, there has to be a fundamental mind change, where technical Subject Matter Experts (SME) design and implement technical processes, which simplify the activities associated with DevOps.  From a DR viewpoint, this dictates that the DevOps process should facilitate a robust DR test, for each and every Business Application change.  Whether an organization is the largest or smallest of IBM Z Mainframe user is somewhat arbitrary, performing an entire system-wide DR test for an isolated Business Application change is not required.  Conversely, performing a meaningful Business Application test during the DevOps code test and acceptance process makes perfect sense.

Performing a meaningful Business Application DR test as part of the DevOps process is a consistent requirement, whether an organization is the largest or smallest IBM Z Mainframe user.  Although their hardware resource might differ significantly, where the largest IBM Z Mainframe user would typically deploy a high-end VTL (I.E. IBM TS77n0, EMC DLm 8n00, Oracle VSM, et al), the requirement to perform a seamless, agile and timely Business Application DR test remains the same.

If we recognize that the IBM Z Mainframe is typically deployed as the System Of Record (SOR) data server, today’s 21st century Business Application incorporates interoperability with Distributed Systems (E.g. Wintel, UNIX, Linux, et al) platforms.  In theory, this is a consideration, as mostly, IBM Z Mainframe data resides in proprietary 3390 DASD subsystems, while Distributed Systems data typically resides in IP (NFS, NAS) and/or FC (SAN) filesystems.  However, the IBM Z Mainframe has leveraged from Distributed Systems technology advancements, where typical VTL Grid configurations utilize proprietary IP connected disk arrays for VTL data.  Ultimately a VTL structure will contain the “just in case” copy of Business Application backup data, the very data copy required for a meaningful DR test.  Wouldn’t it be advantageous if the IBM Z Mainframe backup resided on the same IP or FC Disk Array as Distributed Systems backups?

Ultimately the high-end VTL (I.E. IBM TS77n0, EMC DLm 8n00, Oracle VSM, et al) solutions are designed for the upper echelons of the business and IBM Z Mainframe world.  Their capacity, performance and resilience capability is significant, and by definition, so is the associated cost.  How easy or difficult might it be to perform a seamless, agile and timely Business Application DR test via such a high-end VTL?  Are there alternative options that any IBM Z Mainframe user can consider, regardless of their size, whether large or small?

The advances in FICON connectivity, x86/POWER servers and Distributed Systems disk arrays has allowed for such technologies to be packaged in a cost efficient and small footprint IBM Z VTL appliance.  Their ability to connect to the IBM Z server via FICON connectivity, provide full IBM Z tape emulation and connect to ubiquitous IP and FC Distributed Systems disk arrays, positions them for strategic use by any IBM Z Mainframe user for DevOps DR testing.  Primarily one consistent copy of enterprise wide Business Application data would reside on the same disk array, simplifying the process of recovering Point-In-Time backup data for DR testing.

On the one hand, for the smaller IBM Z user, such an IBM Z VTL appliance (E.g. Optica zVT) could for the first time, allow them to simplify their DR processes with a 3rd party DR supplier.  They could electronically vault their IBM Z Mainframe backup data to their 3rd party DR supplier and activate a totally automated DR invocation, as and when required.  On the other hand, moreover for DevOps processes, the provision of an isolated LPAR, would allow the smaller IBM Z Mainframe user to perform a meaningful Business Application DR test, in-house, without impacting Production services.  Once again, simplifying the Business Application DR test process applies to the largest of IBM Z Mainframe users, and leveraging from such an IBM Z VTL appliance, would simplify things, without impacting their Grid configuration supporting their Mission critical workloads.

In conclusion, there has always been commonality in DR processes for the smallest and largest of IBM Z Mainframe users, where the only tangible difference would have been budget related, where the largest IBM Z Mainframe user could and in fact needed to invest in the latest and greatest.  As always, sometimes there are requirements that apply to all, regardless of size and budget.  Seemingly DevOps is such a requirement, and the need to perform on-demand seamless, agile and timely Business Application DR tests is mandatory for all.  From an enterprise wide viewpoint, perhaps a modicum of investment in an affordable IBM Z VTL appliance might be the last time an IBM Z Mainframe user needs to revisit their DR testing processes!

Mainframe Virtual Tape: Tape On Disk; But For How Long?

By definition, a Virtual Tape Library (VTL) solution uses a disk cache to store tape data files, but for how long is this data retained on disk? Is it minutes, hours, days, weeks or indefinitely? Only business requirements can dictate the time period tape data is stored on disk, which will influence the VTL solution chosen. We will return to this pivotal question later in the article…

Some might say (for some reason I’m thinking of an Oasis lyric) that Mainframe Virtual Tape choice is as simple as black and white; or blue (IBM) and red (Oracle AKA StorageTek). Hmmm, clearly this is not the case; there are grey areas, but moreover, there are many colours to choose from. For sure we must recognize the innovation in tape technologies by StorageTek, delivering the 1st Automated Tape Library (ATL, NearLine) and IBM with the first Virtual Tape Library (VTL, VTS), naming but a few. Of course, now I recall, IBM delivered VTS in the mid-1990’s, about the same time as that Oasis song!

There is also that age old debate as to whether tape is dead or not and the best compromise always seems to be, “we’ll have to agree to disagree”, depending upon your viewpoint. Does it matter?

I also recall the early 1990’s, where Mainframe disk was proprietary and based upon 1:1 mapping, a physical disk was the addressable DASD volume. The promise of Iceberg (AKA SVA) from StorageTek and the delivery of Symmetrix by EMC changed this status quo, and so the Mainframe world adopted logical to physical mapping for disk storage, via RAID technologies, with Just a Bunch Of Disks (JBOD). This was significant, as the acquisition cost per MB for Mainframe disk was ~£5 (yes that’s right, I’m a Brit, so GBP), and today, maybe ~£0.01 (1 Penny) per MB, or ~£10 per GB, and getting lower each year. So yes, tape is always less expensive when compared with disk, by significant magnitudes, but the affordability of disk indicates that it can now be seriously considered, for backup and archive data.

As with any technology decision, it should be business requirements that drive the solution chosen, and not an allegiance to a storage media type, tape or disk, or a long time Mainframe tape vendor, IBM or Oracle. Ultimately there is only one thing that differentiates one business from another, and that is the data itself, stored in whatever format, databases, application code libraries, batch flat files, et al. Therefore the cost of storage is somewhat arbitrary; it’s the value of the business data that we should consider, while recognizing capital expenditure and TCO running costs.

The 21st century business seemingly requires near 24*7 service availability and if that business deploys a zSeries (~zero downtime) Mainframe server, I guess we can presume that said business requires near 24*7 data availability. We then must consider Business Continuity and associated Disaster Recovery metrics, which are measured by the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Ultimately these RTO and RPO values will dictate the required Backup & Recovery and Archive solutions required, where Recovery (time) is the most important factor!

When was the last time you performed a completely successful Disaster Recovery test from a secondary (physical tape, virtual tape disk) copy of data and was the Recovery Time Objective (RTO) satisfied? Was this a complete workload test, where you included on-line, batch and backup (VTL) testing?

From a data categorization viewpoint, industry analysts tell us, if we didn’t know this fact ourselves, that the majority of Mission Critical data is stored in database structures. If we associate other data types with said databases, application code to process the data, policies to manage and safeguard the data and processes to secure and preserve the data, then I guess we have many instances of Mission Critical data.

As the cost of disk has reduced, so has the cost of network bandwidth, so it’s not uncommon for Mainframe customers to mirror/replicate their data between Geographically Dispersed (E.g. GDPS, GDDR) data centres. They deploy this significant investment solution because they have a requirement for near 24*7 service and thus data availability. Therefore their RTO is likely measured in Minutes (E.g. ~5-15), not because the underlying technology can’t deliver a near instantaneous switch, but because the data needs a Point of Consistency (PoC), and this is the “latency time” for delivering a meaningful RPO (E.g. Pre Batch, Post Batch). Mission Critical databases need to establish a Quiesce PoC, to safeguard data consistency.

If the Mainframe user implements this high availability solution for their primary data copy, why wouldn’t they do this for their secondary (E.g. Backup, Archive) data copy? Ultimately there is generally a hierarchy of RTO and RPO objectives, associated with physical and logical failures. A mirrored disk environment only provides rapid recovery (RTO) for a physical component failure, while a logical data failure will manifest itself for all data copies in the mirror topology. Therefore we always have to consider what is our last line of defence for data recovery; typically a secondary backup data copy. Clearly recovering data from a backup, even a disk based backup, generates a significantly higher recovery (RTO) elapsed time. We might also consider data consistency for this backup data copy; namely, has the backup data been completely destaged/written to the target storage device, tape or disk? Of course, if we don’t have a good backup, we can’t recover the data!

OK, we have come full circle to that original question, by definition, a Virtual Tape Library (VTL) solution uses a disk cache to store tape data files, but how long is this data retained on disk? Is it minutes, hours, days, weeks or indefinitely? Only business requirements can dictate the time period tape data is stored on disk, which will influence the VTL solution chosen.

VTL solutions can be classified as either traditional or tapeless. Traditional is a combination of physical drives and cartridge media in an ATL with a Virtual Tape disk cache (usually proprietary) that is destaged periodically to physical cartridge media, where the primary suppliers are of course IBM with their TS7700 family and Oracle with their VSM offering, while Fujitsu have their CentricStor offering. Tapeless VTL solutions are typically FICON/ESCON channel attached appliances to a back-end disk cache (typically IP, FC or iSCSI), where the tape data is permanently stored on disk. Because the back-end disk cache can be any disk subsystem, within reason, the disk acquisition cost is optimized, because it’s classified as Enterprise/Distributed disk, as opposed to Mainframe disk.

There are many suppliers of tapeless VTL solutions, but the primary vendors are EMC with their Disk Library for Mainframe (DLm) offering and HDS with a several layered approach including LUMINEX Gateways and HDS disk. EMC recently acquired Bus-Tech, where DLm is an OEM of the Bus-Tech MDL solution, still available via the EMC Select option. IBM, Oracle and Fujitsu also offer tapeless VTL solutions, as and if required, but generally they’re deployed in combination with their traditional physical tape based VTL/ATL offerings. There are also software options, IBM Virtual Tape Facility for Mainframe (VTFM) and CA Vtape, where these software solutions deploy higher cost Mainframe disk as the virtual tape cache.

The majority of VTL solutions benefit from data dedupe functionality, where IBM incorporates their ProtecTIER technology, EMC and HDS incorporate DataDomain technology, while Oracle does not currently support Mainframe dedupe, incorporating a Virtual Library Extension (VLE) as a second tier of VTL disk storage. Ultimately dedupe delivers significant (~10-20:1) data reduction benefits and arguably is mandatory for any large scale Mainframe VTL implementation.

Each and every business must draw their own conclusions for VTL implementations and whether they should be tapeless or not. Most Mainframe users have experienced the benefits of mirrored disk (I.E. IBM PPRC, EMC SRDF, HDS TrueCopy, XRC, et al) and have implemented high-availability solutions with a short-term RTO for physical failures. However, only that business can consider how robust their data recovery processes are for logical data failures, and in the worst case scenario, restoring an entire Mission Critical application from a backup copy. The driving factor for this type of recovery is RTO and where is that “last chance” backup data copy stored, tape or disk storage media, and local, remote or 3rd party data centre?

Just as the business must establish a 1st level RPO and associated RTO for their Mission Critical database structures, typically via a quiesce Point of Consistency (PoC), they must do the same for their 2nd level backup data. If a VTL destages data from disk cache to physical tape, then the time required to create the final physical tape copy will influence the associated RTO, and potentially how much data loss might occur. For the avoidance of doubt, if backup data cannot be detstaged to physical tape, then the backup has not been completed, and is unusable. Ultimately data loss is not acceptable, whether a database, or a backup copy. So what steps can the Mainframe user take to minimize this risk?

Because tapeless VTL solutions can attach to any disk subsystem, within reason, IT departments generally have their preferred disk supplier and associated processes. Data dedupe significantly reduces disk acquisition cost and associated network transmission costs, while the functional abilities of disk subsystems are typically higher (I.E. Mirroring, Replication) and more robust when compared with tape subsystems.

If the typical Mainframe user has confidence in their disk mirroring solution for physical failure scenarios, generally associated with the primary copy of Mission Critical data, it seems a logical conclusion that they could extend this modus operandi to secondary (E.g. Backup) copies, eradicating if not eliminating any data loss concerns.

If the Mainframe user deploys EMC Symmetrix (VMAX) for disk data, they could deploy the DLm 8000 VTL to benefit from SRDF/GDDR functionality; if they deploy HDS USP, they could deploy LUMINEX gateways to benefit from TrueCopy functionality, and so on. There are many options available, when the front-end host connectivity (E.g. FICON, virtual tape drives) is separated from the back-end data store (E.g. IP/FC/iSCSI disk).

Additionally, the smaller Mainframe user that cannot afford hot/warm site recovery facilities can also consider different options for Disaster Recovery solutions. For example, they could deploy a tapeless VTL in their only data centre, benefitting from data dedupe for data reduction, transmitting their backup/archive data via IP (or other network transmission) into a 3rd party suppliers facilities, duplicating the VTL and disk subsystems to store the data. They can then modify their Disaster Recovery (DR) procedures to invoke DR as and when required, at that point connecting the 3rd party Mainframe resources to the VTL and data recovery can start immediately. Therefore the traditional off-site DR test at 3rd party provider premises increases in efficiency, while backup data availability is not reliant on the Ford Transit Access Method (FTAM)!

So, how long should secondary copies of Mission Critical data be retained on Virtual Tape disk? Is it minutes, hours, days, weeks or indefinitely? The jury might still be out, but to deliver near 24*7 data availability, for both logical and physical failure scenarios, seemingly at least one secondary copy of Mission Critical data should be retained indefinitely on Virtual Tape disk…