MTTR | Value-4IT Blog

Is 2016 the year of the All Flash disk array? Seemingly from a System z perspective, 2016 has seen improvement in the All Flash disk array offerings from the major disk suppliers, namely EMC, HDS and IBM. From a usability perspective, managing latency might be the overall challenge, where these ultra-fast SSD systems are delivering I/O performance response times measured in the ~250-500 Microsecond (μs) range, potentially consigning the traditional Millisecond (ms) measurement to history!

Whatever the speeds and feeds might be, as of 2016, the benchmark for a System z All Flash Disk Array is seemingly measured @ <500 Microsecond μs response time, supporting ~n PB of capacity and delivering ~nnn GB/S throughput for mixed read/write workloads. Of course, strong encryption, typically full disk Data @ Rest Encryption (D@RE) based and full seamless data replication interoperability are also mandatory.

Historically we evolved from Data Processing to Information Technology, not only automating the capture and processing of data, but gradually evolving our processes, using this data for business advantage. In recent years, the information explosion has dictated that each and every business must be a cognitive business, using intelligent analytics to gain insight and faster decision-making from the business data collected.

Currently the Internet of Things (IoT) supplements the medium-term Cloud, Analytics, Mobile, Social & Security (CAMSS) initiative, being the processes and associated solutions required by cognitive businesses to make timely and informed decisions, capturing deeper customer insight, ultimately delivering competitive advantage. Therefore the 21^st century business generates a significant requirement for storage capacity and performance to fully realize the benefits of this truly business aligned cognitive approach.

The largest global organizations from all verticals leverage from the power and true 24*7*365 availability and reliability of the System z Mainframe to power enormous relational databases, processing millions of customer transactions on an hourly basis. These always-on, mission critical business environments demand the performance, reliability, TCO and System z platform integration delivered by the associated DASD (3390) subsystem.

Each and every System z user will have their IHV of choice for delivering disk storage, in alphabetical order, EMC (E.g. VMAX AFA/All Flash Array), HDS (HAF/Hitachi Accelerated Flash) or IBM (E.g. DS8888). The choice of disk storage was forever thus, reviewing the market place and choosing the best option for your business. What might require reflection is how the DASD I/O subsystem is managed and the associated interaction with said IHV supplier. Systems Management solutions such as Easy Disk Analyze Mainframe (EADM) and IntelliMagic Vision (Disk Magic) will certainly simplify the analysis and presentation of DASD subsystem performance data.

However the emphasis of the actual System z DASD subsystem for an All Flash array might move from being an internal consideration, to a direct and timely communication with the IHV supplier. Put very simply, in an environment where Mission Critical systems rely upon ultra fast processing of massive amounts of data, any flash memory issues, whether capacity or defect related will need IHV interaction ASAP, arguably “Before The Event”. As with the System z Server itself, where we’re used to On/Off Capacity on Demand (OOCoD) processes, maybe we need to consider a similar approach with our All Flash System z DASD arrays. For the avoidance of doubt, as opposed to waiting for an issue to impact our business, maybe we need to work smarter with our IHV, to safeguard that sufficient flash memory is available, to proactively resolve capacity or defect issues…

Aligning this with our traditional 3390 DASD I/O subsystem analysis, which might have been on a daily basis, from the rich RMF/CMF data resource, we must fully automate this process to minimize or eliminate the Mean Time To Resolution (MTTR)! The ultimate benefit will be the delivery of meaningful messages that incorporate our 3^rd party IHV supplier, who potentially with Remote Support Facility (RSF) type processes, deploys the “Golden Screwdriver” to seamlessly safeguard the performance profiles of our Mission Critical business applications, leveraging from the latest All Flash disk array.

In conclusion, as always, technology can deliver business benefits, with substance, and this includes All Flash disk arrays. As always, what might need to evolve are the associated Systems Management processes. Therefore asking yet another potential rhetorical question, what is more important, the System z Server or timely data access? The diplomatic answer is that they’re equally important and if so, let’s safeguard the availability of All Flash memory for our DASD subsystems, with the requisite levels of meaningful proactive reporting and IHV supplier interaction.

Several decades ago and observing potential challenges with hardware, most of us seasoned Mainframe folk would have been familiar with the terms Mean Time Between Failure (MTBF) and Mean Time To Repair (MTTR), although repair might become resolution, replacement, and so on. As hardware has become more reliable, with very few if any single points of failure, we don’t really use these terms for hardware, but perhaps if we don’t use them for problems associated with our business applications, we should…

Today we generally simplify this area of safeguarding business processing metrics (E.g. SLA, KPI) with the Reliability, Availability and Serviceability (RAS) terminology. So whether hardware related by an IHV such as IBM, or software related by ISV’s such as ASG, BMC, CA, IBM, naming but a few, or application code writers, we’re all striving to improve the RAS metrics associated with our IT discipline or component.

There will always be the ubiquitous software bugs, human error when making configuration changes, and so on, but what about those scenarios we might not even consider to be a problem, yet they can have a significant impact on our business? An end-to-end application transaction could consist of an On-Line Transaction Processor (E.g. OLTP, CICS, IMS, et al), a Relational Database Management Subsystem (E.g. RDBMS, DB2, ADABAS, IDMS, et al), a Messaging Broker (E.g. WebSphere MQ), a Networking Protocol (E.g. TCP/IP, SNA, et al), with all of the associated application infrastructure (E.g. Storage, Operating System, Server, Application Programs, Security, et al); so when we experience a “transaction failure”, which might be performance related, which component failed or caused the incident?

Systems Management disciplines dictate Mainframe Data Centres deploy a plethora of monitors (E.g. ASG-TMON, BMC MAINVIEW, CA SYSVIEW, IBM Tivoli OMEGAMON, et al), but these software solutions typically generate a significant amount of data, but what we really need for successful problem solving is the right amount of meaningful information.

So ask yourself the rhetorical question. You know it; how many application performance issues remain unsolved, because we just can’t identify which component caused the issue, or there is just too much data (E.g. System Monitor Logs) to analyse? If you’re being honest, I guess the answer is greater than zero, perhaps significantly greater. Further complications can occur, because of the collaboration required to resolve such issues, as each discipline, Transaction, Databases, Messaging, Networking, Security, General Systems Management, Performance Monitoring, typically reside in different teams…

IBM System z Advanced Workload Analysis Reporter (IBM zAware) is an integrated, self-learning, analytics solution for IBM z/OS that helps identify unusual system behaviour in near real time. It is designed to help IT personnel improve problem determination so they can restore service quickly and improve overall availability. zAware integrates with the family of IBM Mainframe System Management tools, including Runtime Diagnostics, Predictive Failure Analysis (PFA), IBM Health Checker for z/OS and z/OS Management Facility (z/OSMF).

IBM zAware runs in an LPAR on a zEC12 or later CPC. Just like any System z LPAR, IBM zAware requires processing capacity, memory, disk storage, and connectivity. IBM zAware is able to use either general purpose CPs or IFLs, which can be shared or dedicated. It is generally more cost effective to deploy zAware on an IFL.

Used together with other Mainframe System Management Tools, zAware provides another view of your system(S) behaviour, helping answer questions such as:

Are my systems showing abnormal message activity?
When did this abnormal message activity start?
Is this abnormal message activity repetitive?
Are there messages appearing that have never appeared before?
Do the times of abnormal message activity coincide with problems in the system?
Is the abnormal behaviour limited to one system or are multiple systems involved?

IBM zAware creates a model of the normal operating characteristics of a z/OS system using message data captured from OPERLOG. This message data includes any well-formed message captured by OPERLOG (I.E. A message with a tangible Message ID), whether it is from an IBM product, a non-IBM product, or one of your own application programs. This model of past system behaviour is used as the base against which to compare message patterns that are occurring now. The results of this comparison might help answer these questions.

IBM zAware determines, using its model of each system, what messages are new or if messages have been issued out of context based on the past normal behaviour of the system. The model contains patterns of message ID occurrence over a previous period and does not need to know what job or started task issued the message. It also does not need to use the text of a message.

In summary, zAware is a self-learning technology, for newer zSeries Servers (I.E. zEC12 onwards), which can help reduce the time to identify the “area” of where a problem occurred, or is occurring (E.g. Near Real-Time), allowing a technician to fully identify the problem diagnosis and consider potential resolutions. Put very simply, zAware will assist in identifying the problem, but it does not fully qualify the problem and associated resolution. This is a good quality, as ultimately the human technician must complete this most important of activities!

So what if you’re not a zEC12 user or you’re concerned about increased costs because you don’t deploy IFL speciality engines?

ConicIT/MF is a Proactive Performance Management for First Fault Performance Problem Resolution solution. By interfacing with standard system monitors (E.g. ASG-TMON, BMC MAINVIEW, CA SYSVIEW, IBM Tivoli OMEGAMON), ConicIT/MF uses sophisticated mathematic models to perform proactive, intelligent and significant data reduction, quickly highlighting possible causes of problems, allowing for efficient problem determination. Put another way, Systems Management Performance Monitors provide a wealth of data, but sometimes there’s too much data and not enough information. ConicIT safeguards that the value of the data provided by Systems Management Performance Monitors is analyzed and consolidated to expedite performance problem resolution.

ConicIT runs on a distributed Linux system external to the Mainframe system being monitored. ConicIT is a completely agentless architecture which doesn’t require installation on the Mainframe system being monitored. It receives data from existing monitors (E.g. ASG-TMON, BMC MAINVIEW, CA SYSVIEW, IBM Tivoli OMEGAMON, et al), through their standard interfaces. 3270 emulation enables ConicIT to appear as just another operator to the existing monitor and adds no more load to the monitored system than would adding an additional human operator.

Until a problem is predicted ConicIT requests basic monitor information at a very low frequency (about once per minute), but if the ConicIT analysis senses a performance problem brewing, its requests for information increase, but never so much as to effect the monitored system. The maximum load generated by ConicIT is configurable and ConicIT supports all the major Mainframe monitors.

The monitor data stream is retrieved by parsing the data from the various (E.g. Log) data sources. This raw data is first sent to the ConicIT data cleansing component. Data from existing monitors is very “noisy”, since various system parameters values can fluctuate widely even when the system is running perfectly. The job of the data cleansing algorithm is to find meaningful features from the fluctuating data. Without an appropriate data cleansing algorithm it is very difficult or impossible for any useful analysis to take place. Such cleansing is a simple visual task for a trained operator, but is very tricky for an automated algorithm.

The relevant features found by the data cleansing algorithm are then processed to create appropriate variables. These variables are created by a set of rules that can process the data and apply transformations to the data (E.g. combine single data points into a new synthesized variable, aggregate data points) to better describe the relevant state of the system.

These processed variables are analyzed by models that are used to discover anomalies that could be indicative of a brewing performance problem. Each model looks for a specific type of statistical anomalies that could predict a performance problem. No single model is appropriate for a system as complex as a large computing system, especially since the workload profile changes over time. So rather than a single model, ConicIT generates models appropriate to the historical data from a large, predefined set of prediction algorithms. This set of active models is used to analyze the data, detect anomalies and predict performance degradation. The active models vote on the possibility of an upcoming problem in order to make sure that as wide a set of anomalies as possible are covered, while lowering the number of false alerts. The set of active models change over time based on the results of an offline learning algorithm which can either generate new models based on the data, or change the weighting of existing models. The learning algorithm is run in the background on a periodic basis.

When a possible performance problem is predicted by the active models, the ConicIT system takes two actions. It sends an alert to the appropriate consoles and systems, and also instructs the monitor to collect information from the effected systems more frequently. The result is that when IT personnel analyze the problem they have the information describing the state of the system and the effected system components as if they were watching the problem while it was happening. The system also uses the information from the analysis to point out the anomalies that led the system to predict a problem, thereby aiding in root cause analysis of the problem.

So whether zAware or ConicIT, there are solutions to assist today’s busy IT technician to improve the Reliability, Availability and Serviceability (RAS) metric for their business, by implementing practicable resolutions for those problems, which previously, were just too problematic to solve. zAware can offload its processing to an IFL, as and if available, whereas ConicIT performs its processing on a Non-Mainframe platform, and thus can support all zSeries Servers, not just the zEC12 platform.

Ultimately both the zAware and ConicIT solutions have the same objective, increasing Mean Time Between Failure (MTBF) and decreasing Mean Time To Resolution (MTTR), optimizing IT personnel time accordingly.

Value-4IT Blog

zWorld Thoughts & Updates

Tag Archives: MTTR

All Flash & Substance – Is The System z Microsecond The New Millisecond?

The Problem With Problems – Are You zAware?