Simplified Business-Facing IBM Z Mainframe DevOps APM Problem Determination

Increasingly, IBM Z Mainframe stakeholders are becoming cognizant that traditional processes for handling Information Technology operations are becoming obsolete, hence the emergence of DevOps (DevSecOps) frameworks.  Driven by digital transformation & the perpetually increasing demand for new digital services consuming unparalleled amounts of data, Data Centres are under ever increasing pressure to deliver & maintain these mission-critical services.  A major challenge is the availability of these services, where transaction & throughput workloads can be unpredictable, often ad-hoc demand driven (E.g. Consumer) rather than the typical periodic planned peaks (E.g. Monthly, Annual, et al).

Today’s inward facing, dispassionate & honest CIO knows their organization can spend inordinate amounts of time reacting to business application impact incidents, all too often without the bandwidth to be proactive & prevent the incident from occurring in the first place.  It’s widely accepted that for the majority of Global 1000 companies, the IBM Z Mainframe platform provides the de facto System Of Record (SOR) data platform, with associated Database (E.g. Db2) & Transaction (E.g. CICS, IMS) subsystems.  Because the platform plays such a central & integral part of today’s 21st century digital application infrastructure, performance issues can affect the entire business application, dictating that early detection & resolution of performance issues are business critical, with the ultimate goal of eliminating such issues altogether.

Technologies such as z/OS Connect provide a simple & intuitive API based method for the IBM Z Mainframe to become an interconnected platform with all other Distributed Platforms.  This dictates an evolution in Operations Management processes, considering the business application from a non-technical viewpoint & managing it holistically with end-to-end monitoring, regardless of the underlying hardware & software platforms.

Today’s 21st Century digital economy dictates that central Operations teams have neither inordinate amounts of time nor the requisite Subject Matter Expert (SME) skills for problem investigation activities.  A more proactive & automated response is the deployment of simplified, lean & cost-efficient automated monitoring processes, allowing Operations teams to detect potential problems & their associated failure reasons in near real-time.

Distributed tracing provides a methodology for interpreting how applications function across processes, services & networks.  Tracing follows requests as they move from interconnected system to system, capturing tracing information from the associated activity log trail accordingly.  Therefore with Distributed tracing, organisations can monitor applications with Event Streams, helping them to understand the size & shape of the generated traffic & assisting them in the identification of potential business application problems & their related causes.  It comes as no surprise that Distributed tracing has become a pivotal cornerstone of the DevOps toolbox, leveraging the pervasive Kafka Open-Source Software technology for distributed event streaming.  Simply, Kafka provides meaningful context from the messaging & logging generated by various IT platforms, delivering data flow visualizations, simplifying identification & prioritization of business application performance anomalies.  Put simply, Kafka based Distributed tracing pinpoints where failures occur & what causes poor performance (I.E. X marks the spot)!
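
To make this concrete, the following minimal sketch shows how trace context can travel alongside a business payload in a Kafka record header, so downstream consumers & APM tools can stitch the message into one end-to-end trace.  The broker address, topic name, key & payload are purely illustrative assumptions; the sketch uses the standard Apache Kafka Java client & the W3C Trace Context traceparent header format:

  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import java.util.Properties;

  public class TracedProducer {
      public static void main(String[] args) {
          Properties props = new Properties();
          // Illustrative broker address & serializers for a String key/value payload
          props.put("bootstrap.servers", "localhost:9092");
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

          try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
              // Hypothetical topic, key & business payload
              ProducerRecord<String, String> record =
                  new ProducerRecord<>("payments", "txn-0001", "{\"amount\": 42.00}");

              // Attach a W3C Trace Context "traceparent" header, so downstream
              // consumers & APM tools can stitch this message into one trace
              record.headers().add("traceparent",
                  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".getBytes());

              producer.send(record);
          }
      }
  }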

From a business & therefore non-technical viewpoint, the utopia is to understand the user experiences delivered & their associated business impacts; ideally positive, therefore eliminating the negative.  Traditionally from a technical viewpoint, experts have focussed on MELT (Metrics, Events, Logs, Traces) data collection, allowing for potential future problem determination & resolution.  Historically, when this was the only data available, manual & time consuming technical processes inevitably ensued.  As we have explored, DevOps is about simplification, optimization, automation & ultimately delivering the best business service!  If only there was a better way…

OpenTelemetry is a collection of tools, APIs & SDKs, utilized to instrument, generate, collect & export telemetry data (Metrics, Events, Logs, Traces) to assist software performance behavioural analysis.  Put simply, OpenTelemetry is an Open-Source Software vendor agnostic standard for the collection of telemetry data from applications & their supporting infrastructures & services (a minimal code sketch follows the list below):

  • APIs: Code instrumentation deployment for telemetry data trace generation
  • SDKs: Collect the telemetry data for the rest of the telemetry data processing
  • In-Process Exporters: Translate telemetry data into custom formats for Back-End processing from within the application
  • Out-Of-Process Exporters (E.g. Collectors): Translate telemetry data into custom formats for Back-End processing in a separate process
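
As promised, here is a minimal sketch of the API & SDK split in practice, using the OpenTelemetry Java API.  The class, method & span names are illustrative assumptions; the SDK & its exporter are assumed to be configured elsewhere (E.g. via the OpenTelemetry Java agent):

  import io.opentelemetry.api.GlobalOpenTelemetry;
  import io.opentelemetry.api.trace.Span;
  import io.opentelemetry.api.trace.Tracer;
  import io.opentelemetry.context.Scope;

  public class OrderService {
      // The API hands out a Tracer; the SDK behind it (plus its exporter) is
      // assumed to be registered elsewhere, E.g. by the OpenTelemetry Java agent
      private static final Tracer TRACER =
          GlobalOpenTelemetry.getTracer("order-service"); // illustrative scope name

      public void processOrder(String orderId) {
          Span span = TRACER.spanBuilder("processOrder").startSpan();
          try (Scope ignored = span.makeCurrent()) {
              // ... business logic; spans created here become children of this span ...
          } catch (RuntimeException e) {
              span.recordException(e);
              throw e;
          } finally {
              span.end(); // the configured exporter ships the finished span to the Back-End
          }
      }
  }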

In conclusion, from a big picture viewpoint, the IBM Z Mainframe is just another IP node on the network, seamlessly interconnecting with Distributed Systems platforms for 21st century digital business application processing.  Regardless of technical platform, DevOps is not a technical discipline; it’s a business orientated user experience process & as such, requires automated issue detection & rapid resolution.  Open-Source Software (OSS) frameworks such as OpenTelemetry & Distributed Tracing allow for the simplified, low cost collection & visualization of instrumentation data.  How can the IBM Z Mainframe organization incorporate a DevOps facing solution to aggregate this log data, providing an optimal cost, resource friendly Application Performance Management (APM) solution for simplified business application performance identification?

z/IRIS (Integrable Real-Time Information Streaming) integrates the IBM Z Mainframe platform into pervasive enterprise wide Application Performance Monitoring (APM) solutions, allowing DevOps resources to gain the insights they need to better understand Mainframe utilization & potential issues for mission critical business services.

z/IRIS incorporates OpenTelemetry observability for IBM Z Mainframe systems & applications, enriching traces (E.g. Db2 Accounting, Db2 Deadlock, z/OS Connect, JES2, OMVS, STC, TSO) with attributes to facilitate searching, filtering & analysis of traces in today’s 3rd party enterprise wide APM tools (E.g. AppDynamics, Datadog, Dynatrace, IBM Instana, Jaeger, New Relic, Splunk, Sumo Logic).
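
The enrichment pattern itself is straightforward to picture.  The sketch below decorates a span with searchable attributes using the OpenTelemetry Java API; please note the span name, attribute keys & values are hypothetical stand-ins to illustrate the pattern, not documented z/IRIS attribute names:

  import io.opentelemetry.api.GlobalOpenTelemetry;
  import io.opentelemetry.api.trace.Span;
  import io.opentelemetry.api.trace.Tracer;

  public class MainframeSpanEnrichment {
      private static final Tracer TRACER =
          GlobalOpenTelemetry.getTracer("mainframe-demo"); // illustrative scope name

      public static void main(String[] args) {
          Span span = TRACER.spanBuilder("db2.accounting").startSpan(); // hypothetical span name
          try {
              // Hypothetical attribute keys & values; real products document their own
              span.setAttribute("mainframe.subsystem", "DB2A");
              span.setAttribute("mainframe.jobname", "PAYROLL1");
              span.setAttribute("mainframe.cpu.ms", 412L);
          } finally {
              span.end(); // once exported, APM tools can search & filter on these attributes
          }
      }
  }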

Capturing metrics & creating associated charts has been an integral part of performance monitoring for several decades.  z/IRIS seamlessly integrates with APM tools such as Instana & data visualization tools such as Grafana to supply zero maintenance automated dashboards for commonplace day-to-day usage.  Of course, each & every business requires their own perspectives, hence z/IRIS incorporates easy-to-use customizable dashboards for such requirements.  Because APM & data visualization tools collect data metrics from a variety of information sources, tracing every request from cradle (E.g. Client Browser) to grave (E.g. Host Server), the z/IRIS Mainframe data combinations for your digital dashboards are potentially infinite, where the data presented is always accurate & in real time.

z/IRIS is simple to use & simple to install, incorporating many tried & tested industry standard Open-Source Software components, optimizing costs & simplifying product support.  Wherever possible z/IRIS uses Java based applications, minimizing IBM Z Mainframe CPU utilization by using zIIP processing cycles whenever available.  z/IRIS delivers a lightweight, resource & cost efficient z/OS APM solution to provide an end-to-end performance analysis of today’s 21st Century digital solutions.  Because z/IRIS leverages industry standard Open-Source frameworks deployed by commonplace Distributed Systems APM solutions, the instrumentation captured & interpreted by z/IRIS enriches dynamically as APM functionality increases.  For example, Datadog Watchdog Insights can identify increased latency from a downstream z/OS Connect application, just by processing new analytics from existing telemetry data.  The data had already been captured; as APM functionality evolves, new meaningful business insights are gained.  z/IRIS can deliver the following example benefits for any typical IBM Z Mainframe DevOps environment:

  • Automated IBM Z Mainframe Observability: Automate the collection of end-to-end data tracing information.
  • Real Time Impact Notification: Intelligent data processing to present meaningful DevOps dashboard notifications of business applications service status & variances.
  • Universal Access & Ease Of Use: Facilitate end-to-end Application Performance Monitoring (APM) for all IT teams, not just IBM Z Mainframe Subject Matter Experts (SME).
  • Reduce MTTD & MTTR For Optimized User Services: Reduce the Mean Time To Detect (MTTD) & ideally eliminate the Mean Time To Repair (MTTR), the typical Key Performance Indicators (KPIs), with intelligent root cause analysis.

Application Performance Tuning – Why Bother?

With older generations of Mainframe Operating Systems, certainly MVS/XA and perhaps MVS/ESA, application performance tuning was a necessity, not an afterthought.  Quite simply, the cost of Mainframe resources, namely CPU, memory and disk, dictated that your mission critical business application might not perform to business requirements, unless you tuned your programming code.  Programmers, both of the system and application variety, understood the bits and bytes of the available programming languages (E.g. ASM, COBOL, PL/I) and Operating System (I.E. MVS), collaborating either via proactive process, or reactive problem solving.  With the continuing reduction of IT hardware component costs, the improvement in Operating Systems (E.g. 64-bit architecture) and newer programming languages (E.g. C, C++), it seems that application performance tuning is somewhat of an afterthought, but at what cost?

We all know that the cost of a Mainframe MIPS is significant, and although it might have reduced dramatically from a hardware viewpoint, from a software viewpoint, the cost remains largely static at ~£1,500-£3,500 per MIPS, per year, depending on your configuration.  So if your applications are burning several hundred if not several thousand extra MIPS unnecessarily, that’s very expensive indeed; for example, 500 wasted MIPS at a midpoint of ~£2,500 per MIPS equates to ~£1,250,000 per annum in software cost alone.  Additionally and just as importantly, a badly tuned system will manifest itself in slower transaction response times and longer batch jobs, if applicable, which could impact service availability.  So why is there a seeming reluctance to tune business applications, Mainframe resident or not?

If ever there was a functional IT area where the skills gap has never been wider, application performance tuning is that skill, when comparing the salty old sea dog Mainframe dinosaur with the newer Mainframe technician!

From an application development process viewpoint, where does the application performance tuning task live; before or after implementation?  The cynical amongst us will know: if it’s after implementation, there’s a strong likelihood said activity will never be performed!  If it’s before implementation, how many projects incorporate a meaningful stress test, or measure transaction response times versus an SLA or KPI metric?  Additionally, if the project is high-priority and/or running behind schedule, then performance testing is an activity that is easily removed…

Back in the good old days, the late 1980s to early 1990s, some application performance tuning tools did start to emerge, most notably Strobe.  Strobe was useful to even the most accomplished system and application programmer personnel, and invaluable to less experienced personnel, and so arguably Strobe became the de facto software tool for tuning Mainframe applications.  However, later releases of MVS (E.g. OS/390 and z/OS) and the non-event that was the Year 2000 (Y2K) seemed to remove the focus on, and importance of, application tuning.

Arguably most important of all, that software MIPS cost item contributed to the demise of application performance tuning, because Strobe and its competitors (E.g. ASG/BMC TriTune, CA Application Tuner, IBM APA, Macro4 ExpeTune, et al) will utilize even more CPU to capture diagnostic trace information.  However, those companies that have undertaken such application tuning activities in the last decade or so are sitting pretty, having reduced the CPU (MIPS) resource consumed, lowering TCO and optimizing performance accordingly.  In the 21st Century, these software solutions are classified as Application Performance Management (APM) solutions.

Is there a better and easier way to stimulate an interest in the application performance tuning discipline?  If the desire exists to tune an application, lowering CPU MIPS usage, optimizing service performance, then the traditional tools and methods mentioned previously exist, but perhaps a new (or not so new) CPU performance data source exists…

With the introduction of the z10 server, a new function, CPU MF (CPU Measurement Facility), was incorporated.  Let’s not forget, z10 is now an n-2 technology, having been superseded by the z196/z114 and the latest zBC12/zEC12 generation of servers.  So each and every committed Mainframe customer should be positioned to benefit from the CPU MF function.

CPU MF provides optional hardware assisted collection of information about logical CPU activity executed over a specified interval in selected Logical Partitions (LPARs).  The CPU MF counters function is intended to be run on a constant basis to collect long-term performance data (I.E. SMF Record 113), in a similar manner to how you collect other performance data.  I have previously briefly discussed how CPU MF SMF data can be used to increase Mainframe Server Capacity Planning efficiencies.
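
For those curious what working with this data looks like off-host, here is a toy sketch that counts SMF type 113 records in a binary copy of an SMF dump dataset.  It assumes the copy retains the Record Descriptor Words (RDWs), that records are not spanned, and that the standard SMF record header layout applies, where the one-byte record type follows the four-byte RDW and the one-byte flag; these assumptions would need validating against your own transfer process:

  import java.io.DataInputStream;
  import java.io.EOFException;
  import java.io.FileInputStream;
  import java.io.IOException;

  public class Smf113Counter {
      public static void main(String[] args) throws IOException {
          long count = 0;
          try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
              while (true) {
                  int rdwLength;
                  try {
                      rdwLength = in.readUnsignedShort(); // RDW bytes 0-1: record length incl. the 4-byte RDW
                  } catch (EOFException endOfFile) {
                      break;
                  }
                  in.skipBytes(2);                        // RDW bytes 2-3: segment descriptor (0 if unspanned)
                  byte[] record = new byte[rdwLength - 4];
                  in.readFully(record);                   // SMF header follows: flag(1), record type(1), ...
                  int recordType = record[1] & 0xFF;
                  if (recordType == 113) count++;         // SMF 113: CPU MF counters
              }
          }
          System.out.println("SMF 113 records: " + count);
      }
  }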

The CPU MF sampling function is a short duration, precise function that identifies where CPU resources are being used, to help you improve application efficiency.  Put very simply, CPU MF sampling data has minimal CPU overhead (E.g. ~0.1-1.0%) when collecting data (I.E. via z/OS Hardware Instrumentation Services – HIS), but this data can then be used to identify CPU “hot spots”, which can then be further analysed to identify the “areas of code” generating the high CPU usage.  However, it was forever thus: whether via an APM tool or CPU MF sampling data, high CPU usage can be identified, but the application programmer must still undertake the task of optimizing the application code!
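
As a generic illustration of the “hot spot” idea, and emphatically not a description of any vendor’s actual processing, the sketch below buckets hypothetical sampled instruction addresses into the modules whose load ranges contain them and reports each module’s share of the samples; the module names, addresses and samples are all invented for illustration:

  import java.util.Map;
  import java.util.TreeMap;

  public class HotSpotReport {
      public static void main(String[] args) {
          // Hypothetical module load map: start address -> module name
          TreeMap<Long, String> moduleMap = new TreeMap<>();
          moduleMap.put(0x1A000L, "PAYCALC");
          moduleMap.put(0x2C000L, "DB2CALL");
          moduleMap.put(0x3F000L, "LOGWRIT");

          // Hypothetical sampled instruction addresses captured by a sampler
          long[] samples = {0x1A0F4L, 0x1A0F8L, 0x1A102L, 0x2C020L, 0x1A0F4L, 0x3F010L};

          // Bucket each sample into the module whose load range contains it
          Map<String, Integer> hits = new TreeMap<>();
          for (long address : samples) {
              Map.Entry<Long, String> module = moduleMap.floorEntry(address);
              if (module != null) hits.merge(module.getValue(), 1, Integer::sum);
          }

          // Each module's percentage of samples: the biggest bucket is the "hot spot"
          hits.forEach((name, n) ->
              System.out.printf("%-8s %5.1f%%%n", name, 100.0 * n / samples.length));
      }
  }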

IBM have done a great job in providing CPU MF counters data, optimizing the Capacity Planning process with the SMF 113 record, and the sample data offers similar potential, but a software solution is required to analyse and summarize this data.

Currently there are very few software solutions, arguably only one, that analyse CPU MF sample data, namely zHISR from Phoenix Software International.  zHISR interfaces directly with z/OS Hardware Instrumentation Services to collect data for hotspot analysis of customer, vendor, or operating system program execution.  zHISR features include:

  • Support for up to 128 simultaneous data collection events.  zHISR collections do not interfere with any HIS functions, including sample or counter collection.
  • System console commands for many zHISR functions.
  • An Application Programming Interface to COBOL and Assembler for starting and stopping data collections. Collection lengths for API generated collections have a time range of one second or more.
  • Ability to schedule a collection with JCL so that collection starts when a given job or step begins.
  • Ability to store data collections as z/OS data sets or UNIX files.
  • Support for collections against CICS/TS transactions.
  • Analysis based on a time range within the collected data for a narrower spotlight on problem code.

An intuitive ISPF dialog allows the user to easily produce a CPU hot spots analysis, which can then be used for identifying the offending code sections.  The user can then drill down and highlight the high CPU CSECT and program offset (instruction), comparing with their Associated Data (ADATA), and thus the source programming instruction.  Therefore the skill required to perform analysis is minimal, as is the CPU overhead in collecting analysis data, eradicating the potential barriers when embarking on an application tuning initiative.  Furthermore, the actual cost of deploying the zHISR software is not onerous, and so perhaps each and every committed Mainframe user can easily include application performance tuning in their application development lifecycle processes.

zHISR has a UNIX file system interface that lets you navigate the system and browse or delete files.  With zHISR, users can start and stop hardware event data collections and view the status of the current or prior HIS run.  zHISR also includes a memory display/alter utility that lets you view main storage in the CPU you are logged on to.  If zIIPs are present and zHISR is defined as an authorized subsystem, nearly all of the CPU processing used by zHISR is redirected to a zIIP.

There are also instances, however few and far between, where Mainframe customers have written their own proprietary in-house OLTP (On-Line Transaction Processor) and Relational Database Management Subsystem (RDBMS), where traditional APM software tools can’t provide a solution, only interfacing with the underlying subsystems (E.g. Adabas, CICS, DB2, IDMS, WebSphere, et al).  In these instances, CPU MF and zHISR offer a solution to help such customers, who probably face challenges when they upgrade their Mainframe servers, safeguarding that software and application code are compatible with the new hardware and, ideally, exploit the latest functionality.

In conclusion, application performance tuning has to be a very important if not mandatory activity for the Mainframe Data Centre.  Whether via CPU MF or traditional APM software solutions, the cost reduction and performance improvement benefits of tuning should be compelling reasons to proactively engage in application tuning activities.  From a skills viewpoint, maybe the KISS (Keep It Simple Stupid) principle can apply, where CPU MF collects the data very simply and efficiently, complemented by zHISR, analysing the data in an intuitive and cost optimized manner.

So turning the subject matter on its head, Application Performance Tuning – Why Bother?  Why not!

Further information can be found from my z/OS Application Performance Tuning presentation, delivered at UK GSE in November 2012.