IBM Z Server: Best In Class For Availability – Does Form Factor Matter?

A recent ITIC 2017 Global Server Hardware and Server OS Reliability Survey classified the IBM Z server as delivering the highest levels of reliability/uptime, with ~8 Seconds or less of unplanned downtime per month.  This was the 9th consecutive year that the IBM Z Mainframe platform had topped this survey.  It compares with ~3 Minutes of unplanned downtime per month for several other specialized server technologies running the Linux Operating System, including IBM POWER, Cisco UCS and HP Integrity Superdome.  Clearly, unplanned server downtime is undesirable and costly, impacting the bottom line of the business.  Industry Analysts state that ~80% of global businesses require 99.99% uptime, equating to ~52.6 Minutes of downtime per year or ~8.64 Seconds per day.  In theory, only the IBM Z Mainframe platform exceeds this availability requirement, while IBM POWER, Cisco UCS and HP Integrity Superdome deliver borderline 99.99% availability capability.  The IBM Mainframe is classified as a mission-critical resource in 92 of the top 100 global banks, 23 of the top 25 USA based retailers, all 10 of the top 10 global insurance companies and 23 of the top 25 largest airlines globally…
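For the record, these downtime budgets are simple arithmetic; a minimal Python sketch, purely illustrative, confirms the numbers for the common “nines” of availability:

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # non-leap year

for availability in (99.9, 99.99, 99.999):
    downtime_fraction = 1 - availability / 100
    print(f"{availability}%: "
          f"{downtime_fraction * SECONDS_PER_YEAR / 60:.1f} minutes/year, "
          f"{downtime_fraction * SECONDS_PER_DAY:.2f} seconds/day")
```

For 99.99% this prints ~52.6 minutes per year and ~8.64 seconds per day, the figures quoted above.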

The requirement for ever-increasing amounts of corporate compute power is, without doubt, driven by the processing of ever-increasing amounts of data, created from digital sources including Cloud, Mobile and Social, requiring near real-time analytics to deliver meaningful information from these oceans of data.  Some organizations select x86 server technology to deliver this computing power requirement, either in their own Data Centre or via a 3rd party Cloud Provider.  However, with unplanned downtime characteristics that don’t meet the seemingly de facto 99.99% uptime availability metric, can the growth in x86 server technology continue?  From many perspectives, Reliability, Availability & Serviceability (RAS), Data Security via Pervasive Encryption and best-in-class Performance and Scalability, you might think that the IBM Z Mainframe would be the platform of choice?  For whatever reason, this is not always the case!  Maybe we need to look at recent developments and trends in the compute power delivery market and second-guess what might happen in the future…

Significant Cloud providers deliver vast amounts of computing power and associated resources, evolving their business models accordingly.  Such business models have many challenges, primarily uptime and data security related, when convincing prospective customers to migrate their workloads from traditional internal Data Centres into these massive rack provisioned infrastructures.  Recently Google has evolved from using Intel as its primary supplier of Data Centre CPU chips, to also sourcing CPU chips from IBM and other semiconductor rivals.

In April 2016, Google declared it had ported its online services to the IBM POWER CPU chip and that its toolchain could output code for Intel x86, IBM POWER and 64-bit ARM cores at the flip of a command-line switch.  As part of the OpenPOWER and Open Compute Project (OCP) initiatives, Google, IBM and Rackspace are collaborating to develop an open server specification based on the IBM POWER9 architecture.  The OCP Rack & Power Project will define the size and shape, or form factor, for housing these industry standard rack infrastructures.  What does this mean for the IBM Z server form factor?

Traditionally, over the last decade or more, IBM has utilized the 24 Inch rack form factor for the IBM Z Mainframe and Enterprise Class POWER Systems.  Of course, this is a different form factor to the industry standard 19 Inch rack, which eventually became the de facto standard for the ubiquitous blade server.  Unfortunately, there was no tangible standard behind the 19 Inch rack, generating power, cooling and other issues.  Hence the evolution of the OCP Rack & Power Standard, codenamed Open Rack.  Google and Facebook have recently collaborated to evolve the Open Rack Standard V2.0, based upon an external 21 Inch rack form factor, accommodating the de facto 19 Inch rack mounted equipment.

How do these recent developments influence the IBM Z platform?  If you’re the ubiquitous global CIO, knowing your organization requires 99.99%+ uptime, delivering continuous business application change via DevOps and safeguarding corporate data with intelligent and system wide encryption, perhaps you still view the IBM Z Mainframe as a proprietary server with its own form factor?

As IBM has already demonstrated with its OpenPOWER offering, collaborating with Google and Rackspace, the 24 Inch rack approach can be evolved, becoming just another CPU chip in a Cloud (E.g. IaaS, PaaS) service provider environment.  Maybe the final evolution step for the IBM Z Mainframe is moving its form factor to the ubiquitous 19 Inch rack format?  The intelligent and clearly defined approach of the Open Rack Standard makes sense; if IBM could deliver an IBM Z server in such a format, it too becomes just another CPU chip in the ubiquitous Cloud service provider environment.  This might be the final piece of the jigsaw for today’s CIO, whose approach to procuring compute power might be based solely upon uptime and data security metrics.  For those organizations requiring in excess of 99.99% uptime and fully compliant security, there only seems to be one choice, the IBM Z Mainframe CPU chip technology, which has been running Linux workloads since 2000!

The Problem With Problems – Are You zAware?

Several decades ago, when observing potential challenges with hardware, most of us seasoned Mainframe folk would have been familiar with the terms Mean Time Between Failure (MTBF) and Mean Time To Repair (MTTR), although repair might equally mean resolution, replacement, and so on.  As hardware has become more reliable, with very few if any single points of failure, we don’t really use these terms for hardware any more, but perhaps if we don’t use them for problems associated with our business applications, we should…
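As a refresher, MTBF and MTTR combine to give an availability figure, availability = MTBF / (MTBF + MTTR); a minimal Python sketch, where the incident figures are purely illustrative assumptions, not measurements:

```python
# Availability derived from MTBF and MTTR: MTBF / (MTBF + MTTR).
# The incident figures below are illustrative assumptions only.
incidents = [
    (700.0, 1.5),   # (hours of operation before failure, hours to resolve)
    (1460.0, 0.5),
    (900.0, 2.0),
]
mtbf = sum(up for up, _ in incidents) / len(incidents)
mttr = sum(fix for _, fix in incidents) / len(incidents)
availability = mtbf / (mtbf + mttr)
print(f"MTBF={mtbf:.0f}h, MTTR={mttr:.2f}h, availability={availability:.4%}")
```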

Today we generally simplify this area of safeguarding business processing metrics (E.g. SLA, KPI) with the Reliability, Availability and Serviceability (RAS) terminology.  So whether the hardware is provided by an IHV such as IBM, or the software by ISVs such as ASG, BMC, CA and IBM, to name but a few, or indeed by application code writers, we’re all striving to improve the RAS metrics associated with our IT discipline or component.

There will always be the ubiquitous software bugs, human error when making configuration changes, and so on, but what about those scenarios we might not even consider to be a problem, yet which can have a significant impact on our business?  An end-to-end application transaction could consist of an On-Line Transaction Processor (OLTP, E.g. CICS, IMS, et al), a Database Management Subsystem (E.g. DB2, ADABAS, IDMS, et al), a Messaging Broker (E.g. WebSphere MQ) and a Networking Protocol (E.g. TCP/IP, SNA, et al), with all of the associated application infrastructure (E.g. Storage, Operating System, Server, Application Programs, Security, et al); so when we experience a “transaction failure”, which might be performance related, which component failed or caused the incident?

Systems Management disciplines dictate that Mainframe Data Centres deploy a plethora of monitors (E.g. ASG-TMON, BMC MAINVIEW, CA SYSVIEW, IBM Tivoli OMEGAMON, et al), but these software solutions typically generate a significant amount of data, whereas what we really need for successful problem solving is the right amount of meaningful information.

So ask yourself the rhetorical question; you already know the answer: how many application performance issues remain unsolved because we just can’t identify which component caused the issue, or because there is just too much data (E.g. System Monitor Logs) to analyse?  If you’re being honest, I guess the answer is greater than zero, perhaps significantly greater.  Further complications occur because of the collaboration required to resolve such issues, as each discipline, Transaction, Database, Messaging, Networking, Security, General Systems Management and Performance Monitoring, typically resides in a different team…

IBM System z Advanced Workload Analysis Reporter (IBM zAware) is an integrated, self-learning, analytics solution for IBM z/OS that helps identify unusual system behaviour in near real time.  It is designed to help IT personnel improve problem determination so they can restore service quickly and improve overall availability.  zAware integrates with the family of IBM Mainframe System Management tools, including Runtime Diagnostics, Predictive Failure Analysis (PFA), IBM Health Checker for z/OS and z/OS Management Facility (z/OSMF).

IBM zAware runs in an LPAR on a zEC12 or later CPC.  Just like any System z LPAR, IBM zAware requires processing capacity, memory, disk storage, and connectivity.  IBM zAware is able to use either general purpose CPs or IFLs, which can be shared or dedicated.  It is generally more cost effective to deploy zAware on an IFL.

Used together with other Mainframe System Management tools, zAware provides another view of your system(s) behaviour, helping answer questions such as:

  • Are my systems showing abnormal message activity?
  • When did this abnormal message activity start?
  • Is this abnormal message activity repetitive?
  • Are there messages appearing that have never appeared before?
  • Do the times of abnormal message activity coincide with problems in the system?
  • Is the abnormal behaviour limited to one system or are multiple systems involved?

IBM zAware creates a model of the normal operating characteristics of a z/OS system using message data captured from OPERLOG.  This message data includes any well-formed message captured by OPERLOG (I.E. A message with a tangible Message ID), whether it is from an IBM product, a non-IBM product, or one of your own application programs.  This model of past system behaviour is used as the base against which to compare message patterns that are occurring now.  The results of this comparison might help answer these questions.
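IBM’s internal algorithms are not published, but the basic idea, extracting well-formed message IDs and counting their occurrence to form a baseline, can be sketched in Python; the message ID pattern below is a simplified assumption, not the definitive OPERLOG grammar:

```python
import re
from collections import Counter

# Simplified assumption of a "well-formed" z/OS message ID,
# E.g. IEF403I, IEA995I, $HASP373; real OPERLOG parsing is richer.
MSG_ID = re.compile(r"^\$?[A-Z]{3,5}[0-9]{2,5}[A-Z]?\b")

def message_ids(operlog_lines):
    """Yield the message ID of each well-formed OPERLOG line."""
    for line in operlog_lines:
        match = MSG_ID.match(line.strip())
        if match:
            yield match.group(0)

def build_baseline(training_lines):
    """Model of past normal behaviour: occurrence count per message ID."""
    return Counter(message_ids(training_lines))
```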

IBM zAware determines, using its model of each system, what messages are new or if messages have been issued out of context based on the past normal behaviour of the system.  The model contains patterns of message ID occurrence over a previous period and does not need to know what job or started task issued the message. It also does not need to use the text of a message.
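Continuing the sketch above, one simplistic way to flag new or out-of-context messages is rarity scoring against the baseline; this is illustrative only, not IBM’s actual scoring, and reuses the message_ids helper from the previous sketch:

```python
import math

def interval_anomaly_score(baseline, interval_lines):
    """Score an interval: never-before-seen or rare message IDs score highest."""
    total = sum(baseline.values()) or 1
    score = 0.0
    for msg_id in set(message_ids(interval_lines)):
        seen = baseline.get(msg_id, 0)
        score += -math.log((seen + 1) / (total + 1))  # rarity weight
    return score
```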

In summary, zAware is a self-learning technology for newer zSeries Servers (I.E. zEC12 onwards), which can help reduce the time taken to identify the “area” where a problem occurred, or is occurring (E.g. Near Real-Time), allowing a technician to complete the problem diagnosis and consider potential resolutions.  Put very simply, zAware will assist in identifying the problem, but it does not fully qualify the problem and its associated resolution.  This is a good quality, as ultimately the human technician must complete this most important of activities!

So what if you’re not a zEC12 user or you’re concerned about increased costs because you don’t deploy IFL speciality engines?

ConicIT/MF is a proactive performance management solution for First Fault performance problem resolution.  By interfacing with standard system monitors (E.g. ASG-TMON, BMC MAINVIEW, CA SYSVIEW, IBM Tivoli OMEGAMON), ConicIT/MF uses sophisticated mathematical models to perform proactive, intelligent and significant data reduction, quickly highlighting possible causes of problems and allowing for efficient problem determination.  Put another way, Systems Management Performance Monitors provide a wealth of data, but sometimes there’s too much data and not enough information.  ConicIT safeguards that the data provided by Systems Management Performance Monitors is analyzed and consolidated to expedite performance problem resolution.

ConicIT runs on a distributed Linux system, external to the Mainframe system being monitored.  ConicIT has a completely agentless architecture, requiring no installation on the Mainframe system being monitored.  It receives data from existing monitors (E.g. ASG-TMON, BMC MAINVIEW, CA SYSVIEW, IBM Tivoli OMEGAMON, et al) through their standard interfaces.  3270 emulation enables ConicIT to appear as just another operator to the existing monitor, adding no more load to the monitored system than an additional human operator would.

Until a problem is predicted, ConicIT requests basic monitor information at a very low frequency (about once per minute); if the ConicIT analysis senses a performance problem brewing, its requests for information increase, but never so much as to affect the monitored system.  The maximum load generated by ConicIT is configurable and ConicIT supports all the major Mainframe monitors.
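Conceptually, this is an adaptive polling loop; a minimal Python sketch, where fetch_basic_metrics and problem_brewing are hypothetical stand-ins for the monitor interface and the ConicIT analysis:

```python
import time

BASE_INTERVAL = 60   # seconds: roughly one request per minute when healthy
FAST_INTERVAL = 10   # seconds: lower bound, keeping the extra load bounded

def polling_loop(fetch_basic_metrics, problem_brewing):
    """Poll slowly until the analysis senses trouble, then sample faster."""
    while True:
        metrics = fetch_basic_metrics()
        interval = FAST_INTERVAL if problem_brewing(metrics) else BASE_INTERVAL
        time.sleep(interval)
```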

The monitor data stream is retrieved by parsing the data from the various (E.g. Log) data sources.  This raw data is first sent to the ConicIT data cleansing component.  Data from existing monitors is very “noisy”, since various system parameter values can fluctuate widely even when the system is running perfectly.  The job of the data cleansing algorithm is to find meaningful features within the fluctuating data.  Without an appropriate data cleansing algorithm it is very difficult or impossible for any useful analysis to take place.  Such cleansing is a simple visual task for a trained operator, but very tricky for an automated algorithm.
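One common cleansing approach, which may or may not resemble ConicIT’s proprietary algorithms, is rolling-median smoothing, which damps transient fluctuations while preserving genuine shifts; a minimal Python sketch:

```python
from statistics import median

def cleanse(samples, window=5):
    """Rolling-median smoothing: damp noisy fluctuations, keep real shifts."""
    smoothed = []
    for i in range(len(samples)):
        start = max(0, i - window + 1)          # trailing window of samples
        smoothed.append(median(samples[start:i + 1]))
    return smoothed
```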

The relevant features found by the data cleansing algorithm are then processed to create appropriate variables.  These variables are created by a set of rules that apply transformations to the data (E.g. combining single data points into a new synthesized variable, aggregating data points) to better describe the relevant state of the system.
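A hedged Python sketch of one such transformation rule, with hypothetical metric names (io_wait, cpu_busy, io_pressure), might look like this:

```python
def synthesize(metrics):
    """Apply a simple transformation rule to raw metrics (hypothetical names)."""
    derived = dict(metrics)
    # Combine single data points into a new synthesized variable.
    if "io_wait" in metrics and "cpu_busy" in metrics:
        derived["io_pressure"] = metrics["io_wait"] / max(metrics["cpu_busy"], 1e-9)
    return derived
```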

These processed variables are analyzed by models that are used to discover anomalies indicative of a brewing performance problem.  Each model looks for a specific type of statistical anomaly that could predict a performance problem.  No single model is appropriate for a system as complex as a large computing system, especially since the workload profile changes over time.  So rather than using a single model, ConicIT generates models appropriate to the historical data from a large, predefined set of prediction algorithms.  This set of active models is used to analyze the data, detect anomalies and predict performance degradation.  The active models vote on the possibility of an upcoming problem, safeguarding that as wide a set of anomalies as possible is covered, while lowering the number of false alerts.  The set of active models changes over time, based on the results of an offline learning algorithm, which can either generate new models based on the data or change the weighting of existing models.  The learning algorithm runs in the background on a periodic basis.
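The voting mechanism can be sketched as a weighted ensemble; again, a minimal illustrative Python example, not ConicIT’s actual implementation:

```python
def ensemble_predicts_problem(models, weights, variables, threshold=0.5):
    """Weighted vote of the active models: a wide model set covers more
    anomaly types, while the vote threshold keeps false alerts down."""
    votes = sum(w for model, w in zip(models, weights) if model(variables))
    return votes / sum(weights) >= threshold
```

The offline learning algorithm described above would then adjust the weights, or swap models in and out of the active set, as the workload profile evolves.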

When a possible performance problem is predicted by the active models, the ConicIT system takes two actions.  It sends an alert to the appropriate consoles and systems, and also instructs the monitor to collect information from the affected systems more frequently.  The result is that when IT personnel analyze the problem, they have the information describing the state of the system and the affected system components as if they were watching the problem while it was happening.  The system also uses the information from the analysis to point out the anomalies that led it to predict a problem, thereby aiding root cause analysis.

So whether via zAware or ConicIT, there are solutions to assist today’s busy IT technician in improving the Reliability, Availability and Serviceability (RAS) metrics of their business, by implementing practicable resolutions for those problems which were previously just too problematic to solve.  zAware can offload its processing to an IFL, if available, whereas ConicIT performs its processing on a non-Mainframe platform and can thus support all zSeries Servers, not just zEC12 onwards.

Ultimately both the zAware and ConicIT solutions have the same objective, increasing Mean Time Between Failure (MTBF) and decreasing Mean Time To Resolution (MTTR), optimizing IT personnel time accordingly.