IBM Z Server: Best In Class For Availability – Does Form Factor Matter?

A recent ITIC 2017 Global Server Hardware and Server OS Reliability Survey classified the IBM Z server as delivering the highest levels of reliability/uptime, with ~8 Seconds or less of unplanned downtime per month.  This was the 9th consecutive year that the IBM Z Mainframe platform had topped this survey.  It compares with ~3 Minutes of unplanned downtime per month for several other specialized server technologies running the Linux Operating System, including IBM POWER, Cisco UCS and HP Integrity Superdome.  Clearly, unplanned server downtime is undesirable and costly, impacting the bottom line of the business.  Industry Analysts state that ~80% of global businesses require 99.99% uptime, equating to ~52.6 Minutes of downtime per year or ~8.64 Seconds per day.  In theory, only the IBM Z Mainframe platform exceeds this availability requirement, while IBM POWER, Cisco UCS and HP Integrity Superdome deliver borderline 99.99% availability capability.  The IBM Mainframe is classified as a mission-critical resource in 92 of the top 100 global banks, 23 of the top 25 USA based retailers, all 10 of the top 10 global insurance companies and 23 of the top 25 largest airlines globally…
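For the record, these downtime budgets are simple arithmetic; a minimal Python sketch, purely illustrative, confirms the numbers for the common “nines” of availability:

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # non-leap year

for availability in (99.9, 99.99, 99.999):
    downtime_fraction = 1 - availability / 100
    print(f"{availability}%: "
          f"{downtime_fraction * SECONDS_PER_YEAR / 60:.1f} minutes/year, "
          f"{downtime_fraction * SECONDS_PER_DAY:.2f} seconds/day")
```

For 99.99% this prints ~52.6 minutes per year and ~8.64 seconds per day, the figures quoted above.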

The requirement for ever-increasing amounts of corporate compute power is, without doubt, driven by the processing of ever-increasing amounts of data, created from digital sources including Cloud, Mobile and Social, requiring near real-time analytics to deliver meaningful information from these oceans of data.  Some organizations select x86 server technology to deliver this computing power requirement, either in their own Data Centre or via a 3rd party Cloud Provider.  However, with unplanned downtime characteristics that don’t meet the seemingly de facto 99.99% uptime availability metric, can the growth in x86 server technology continue?  From many perspectives, Reliability, Availability & Serviceability (RAS), Data Security via Pervasive Encryption and best-in-class Performance and Scalability, you might think that the IBM Z Mainframe would be the platform of choice?  For whatever reason, this is not always the case!  Maybe we need to look at recent developments and trends in the compute power delivery market and second-guess what might happen in the future…

Significant Cloud providers deliver vast amounts of computing power and associated resources, evolving their business models accordingly.  Such business models have many challenges, primarily uptime and data security related, when convincing prospective customers to migrate their workloads from traditional internal Data Centres into these massive rack provisioned infrastructures.  Recently Google has evolved from using Intel as its primary supplier of Data Centre CPU chips, to also sourcing CPU chips from IBM and other semiconductor rivals.

In April 2016, Google declared it had ported its online services to the IBM POWER CPU chip and that its toolchain could output code for Intel x86, IBM POWER and 64-bit ARM cores at the flip of a command-line switch.  As part of the OpenPOWER and Open Compute Project (OCP) initiatives, Google, IBM and Rackspace are collaborating to develop an open server specification based on the IBM POWER9 architecture.  The OCP Rack & Power Project will define the size and shape, or form factor, for housing these industry standard rack infrastructures.  What does this mean for the IBM Z server form factor?

Traditionally, over the last decade or more, IBM has utilized the 24 Inch rack form factor for the IBM Z Mainframe and Enterprise Class POWER Systems.  Of course, this is a different form factor to the industry standard 19 Inch rack, which eventually became the de facto standard for the ubiquitous blade server.  Unfortunately, there was no tangible standard behind the 19 Inch rack, generating power, cooling and other issues.  Hence the evolution of the OCP Rack & Power Standard, codenamed Open Rack.  Google and Facebook have recently collaborated to evolve the Open Rack Standard V2.0, based upon an external 21 Inch rack form factor, accommodating the de facto 19 Inch rack mounted equipment.

How do these recent developments influence the IBM Z platform?  If you’re the ubiquitous global CIO, knowing your organization requires 99.99%+ uptime, delivering continuous business application change via DevOps and safeguarding corporate data with intelligent and system wide encryption, perhaps you still view the IBM Z Mainframe as a proprietary server with its own form factor?

As IBM has already demonstrated with its OpenPOWER offering, collaborating with Google and Rackspace, the 24 Inch rack approach can be evolved, becoming just another CPU chip in a Cloud (E.g. IaaS, PaaS) service provider environment.  Maybe the final evolution step for the IBM Z Mainframe is moving its form factor to the ubiquitous 19 Inch rack format?  The intelligent and clearly defined approach of the Open Rack Standard makes sense; if IBM could deliver an IBM Z server in such a format, it too becomes just another CPU chip in the ubiquitous Cloud service provider environment.  This might be the final piece of the jigsaw for today’s CIO, whose approach to procuring compute power might be based solely upon uptime and data security metrics.  For those organizations requiring in excess of 99.99% uptime and fully compliant security, there only seems to be one choice, the IBM Z Mainframe CPU chip technology, which has been running Linux workloads since 2000!

The Problem With Problems – Are You zAware?

Several decades ago, when observing potential challenges with hardware, most of us seasoned Mainframe folk would have been familiar with the terms Mean Time Between Failure (MTBF) and Mean Time To Repair (MTTR), although repair might equally mean resolution, replacement, and so on.  As hardware has become more reliable, with very few if any single points of failure, we don’t really use these terms for hardware any more, but perhaps if we don’t use them for problems associated with our business applications, we should…
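As a refresher, MTBF and MTTR combine to give an availability figure, availability = MTBF / (MTBF + MTTR); a minimal Python sketch, where the incident figures are purely illustrative assumptions, not measurements:

```python
# Availability derived from MTBF and MTTR: MTBF / (MTBF + MTTR).
# The incident figures below are illustrative assumptions only.
incidents = [
    (700.0, 1.5),   # (hours of operation before failure, hours to resolve)
    (1460.0, 0.5),
    (900.0, 2.0),
]
mtbf = sum(up for up, _ in incidents) / len(incidents)
mttr = sum(fix for _, fix in incidents) / len(incidents)
availability = mtbf / (mtbf + mttr)
print(f"MTBF={mtbf:.0f}h, MTTR={mttr:.2f}h, availability={availability:.4%}")
```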

Today we generally simplify this area of safeguarding business processing metrics (E.g. SLA, KPI) with the Reliability, Availability and Serviceability (RAS) terminology.  So whether the hardware is provided by an IHV such as IBM, or the software by ISVs such as ASG, BMC, CA and IBM, to name but a few, or indeed by application code writers, we’re all striving to improve the RAS metrics associated with our IT discipline or component.

There will always be the ubiquitous software bugs, human error when making configuration changes, and so on, but what about those scenarios we might not even consider to be a problem, yet which can have a significant impact on our business?  An end-to-end application transaction could consist of an On-Line Transaction Processor (OLTP, E.g. CICS, IMS, et al), a Database Management Subsystem (E.g. DB2, ADABAS, IDMS, et al), a Messaging Broker (E.g. WebSphere MQ) and a Networking Protocol (E.g. TCP/IP, SNA, et al), with all of the associated application infrastructure (E.g. Storage, Operating System, Server, Application Programs, Security, et al); so when we experience a “transaction failure”, which might be performance related, which component failed or caused the incident?

Systems Management disciplines dictate that Mainframe Data Centres deploy a plethora of monitors (E.g. ASG-TMON, BMC MAINVIEW, CA SYSVIEW, IBM Tivoli OMEGAMON, et al), but these software solutions typically generate a significant amount of data, whereas what we really need for successful problem solving is the right amount of meaningful information.

So ask yourself the rhetorical question; you already know the answer: how many application performance issues remain unsolved because we just can’t identify which component caused the issue, or because there is just too much data (E.g. System Monitor Logs) to analyse?  If you’re being honest, I guess the answer is greater than zero, perhaps significantly greater.  Further complications occur because of the collaboration required to resolve such issues, as each discipline, Transaction, Database, Messaging, Networking, Security, General Systems Management and Performance Monitoring, typically resides in a different team…

IBM System z Advanced Workload Analysis Reporter (IBM zAware) is an integrated, self-learning, analytics solution for IBM z/OS that helps identify unusual system behaviour in near real time.  It is designed to help IT personnel improve problem determination so they can restore service quickly and improve overall availability.  zAware integrates with the family of IBM Mainframe System Management tools, including Runtime Diagnostics, Predictive Failure Analysis (PFA), IBM Health Checker for z/OS and z/OS Management Facility (z/OSMF).

IBM zAware runs in an LPAR on a zEC12 or later CPC.  Just like any System z LPAR, IBM zAware requires processing capacity, memory, disk storage, and connectivity.  IBM zAware is able to use either general purpose CPs or IFLs, which can be shared or dedicated.  It is generally more cost effective to deploy zAware on an IFL.

Used together with other Mainframe System Management tools, zAware provides another view of your system(s) behaviour, helping answer questions such as:

  • Are my systems showing abnormal message activity?
  • When did this abnormal message activity start?
  • Is this abnormal message activity repetitive?
  • Are there messages appearing that have never appeared before?
  • Do the times of abnormal message activity coincide with problems in the system?
  • Is the abnormal behaviour limited to one system or are multiple systems involved?

IBM zAware creates a model of the normal operating characteristics of a z/OS system using message data captured from OPERLOG.  This message data includes any well-formed message captured by OPERLOG (I.E. A message with a tangible Message ID), whether it is from an IBM product, a non-IBM product, or one of your own application programs.  This model of past system behaviour is used as the base against which to compare message patterns that are occurring now.  The results of this comparison might help answer these questions.
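IBM’s internal algorithms are not published, but the basic idea, extracting well-formed message IDs and counting their occurrence to form a baseline, can be sketched in Python; the message ID pattern below is a simplified assumption, not the definitive OPERLOG grammar:

```python
import re
from collections import Counter

# Simplified assumption of a "well-formed" z/OS message ID,
# E.g. IEF403I, IEA995I, $HASP373; real OPERLOG parsing is richer.
MSG_ID = re.compile(r"^\$?[A-Z]{3,5}[0-9]{2,5}[A-Z]?\b")

def message_ids(operlog_lines):
    """Yield the message ID of each well-formed OPERLOG line."""
    for line in operlog_lines:
        match = MSG_ID.match(line.strip())
        if match:
            yield match.group(0)

def build_baseline(training_lines):
    """Model of past normal behaviour: occurrence count per message ID."""
    return Counter(message_ids(training_lines))
```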

IBM zAware determines, using its model of each system, what messages are new or if messages have been issued out of context based on the past normal behaviour of the system.  The model contains patterns of message ID occurrence over a previous period and does not need to know what job or started task issued the message. It also does not need to use the text of a message.
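Continuing the sketch above, one simplistic way to flag new or out-of-context messages is rarity scoring against the baseline; this is illustrative only, not IBM’s actual scoring, and reuses the message_ids helper from the previous sketch:

```python
import math

def interval_anomaly_score(baseline, interval_lines):
    """Score an interval: never-before-seen or rare message IDs score highest."""
    total = sum(baseline.values()) or 1
    score = 0.0
    for msg_id in set(message_ids(interval_lines)):
        seen = baseline.get(msg_id, 0)
        score += -math.log((seen + 1) / (total + 1))  # rarity weight
    return score
```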

In summary, zAware is a self-learning technology for newer zSeries Servers (I.E. zEC12 onwards), which can help reduce the time taken to identify the “area” where a problem occurred, or is occurring (E.g. Near Real-Time), allowing a technician to complete the problem diagnosis and consider potential resolutions.  Put very simply, zAware will assist in identifying the problem, but it does not fully qualify the problem and its associated resolution.  This is a good quality, as ultimately the human technician must complete this most important of activities!

So what if you’re not a zEC12 user or you’re concerned about increased costs because you don’t deploy IFL speciality engines?

ConicIT/MF is a proactive performance management solution for First Fault performance problem resolution.  By interfacing with standard system monitors (E.g. ASG-TMON, BMC MAINVIEW, CA SYSVIEW, IBM Tivoli OMEGAMON), ConicIT/MF uses sophisticated mathematical models to perform proactive, intelligent and significant data reduction, quickly highlighting possible causes of problems and allowing for efficient problem determination.  Put another way, Systems Management Performance Monitors provide a wealth of data, but sometimes there’s too much data and not enough information.  ConicIT safeguards that the data provided by Systems Management Performance Monitors is analyzed and consolidated to expedite performance problem resolution.

ConicIT runs on a distributed Linux system, external to the Mainframe system being monitored.  ConicIT has a completely agentless architecture, requiring no installation on the Mainframe system being monitored.  It receives data from existing monitors (E.g. ASG-TMON, BMC MAINVIEW, CA SYSVIEW, IBM Tivoli OMEGAMON, et al) through their standard interfaces.  3270 emulation enables ConicIT to appear as just another operator to the existing monitor, adding no more load to the monitored system than an additional human operator would.

Until a problem is predicted, ConicIT requests basic monitor information at a very low frequency (about once per minute); if the ConicIT analysis senses a performance problem brewing, its requests for information increase, but never so much as to affect the monitored system.  The maximum load generated by ConicIT is configurable and ConicIT supports all the major Mainframe monitors.
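Conceptually, this is an adaptive polling loop; a minimal Python sketch, where fetch_basic_metrics and problem_brewing are hypothetical stand-ins for the monitor interface and the ConicIT analysis:

```python
import time

BASE_INTERVAL = 60   # seconds: roughly one request per minute when healthy
FAST_INTERVAL = 10   # seconds: lower bound, keeping the extra load bounded

def polling_loop(fetch_basic_metrics, problem_brewing):
    """Poll slowly until the analysis senses trouble, then sample faster."""
    while True:
        metrics = fetch_basic_metrics()
        interval = FAST_INTERVAL if problem_brewing(metrics) else BASE_INTERVAL
        time.sleep(interval)
```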

The monitor data stream is retrieved by parsing the data from the various (E.g. Log) data sources.  This raw data is first sent to the ConicIT data cleansing component.  Data from existing monitors is very “noisy”, since various system parameter values can fluctuate widely even when the system is running perfectly.  The job of the data cleansing algorithm is to find meaningful features within the fluctuating data.  Without an appropriate data cleansing algorithm it is very difficult or impossible for any useful analysis to take place.  Such cleansing is a simple visual task for a trained operator, but very tricky for an automated algorithm.
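One common cleansing approach, which may or may not resemble ConicIT’s proprietary algorithms, is rolling-median smoothing, which damps transient fluctuations while preserving genuine shifts; a minimal Python sketch:

```python
from statistics import median

def cleanse(samples, window=5):
    """Rolling-median smoothing: damp noisy fluctuations, keep real shifts."""
    smoothed = []
    for i in range(len(samples)):
        start = max(0, i - window + 1)          # trailing window of samples
        smoothed.append(median(samples[start:i + 1]))
    return smoothed
```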

The relevant features found by the data cleansing algorithm are then processed to create appropriate variables.  These variables are created by a set of rules that apply transformations to the data (E.g. combining single data points into a new synthesized variable, aggregating data points) to better describe the relevant state of the system.
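A hedged Python sketch of one such transformation rule, with hypothetical metric names (io_wait, cpu_busy, io_pressure), might look like this:

```python
def synthesize(metrics):
    """Apply a simple transformation rule to raw metrics (hypothetical names)."""
    derived = dict(metrics)
    # Combine single data points into a new synthesized variable.
    if "io_wait" in metrics and "cpu_busy" in metrics:
        derived["io_pressure"] = metrics["io_wait"] / max(metrics["cpu_busy"], 1e-9)
    return derived
```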

These processed variables are analyzed by models that are used to discover anomalies indicative of a brewing performance problem.  Each model looks for a specific type of statistical anomaly that could predict a performance problem.  No single model is appropriate for a system as complex as a large computing system, especially since the workload profile changes over time.  So rather than using a single model, ConicIT generates models appropriate to the historical data from a large, predefined set of prediction algorithms.  This set of active models is used to analyze the data, detect anomalies and predict performance degradation.  The active models vote on the possibility of an upcoming problem, safeguarding that as wide a set of anomalies as possible is covered, while lowering the number of false alerts.  The set of active models changes over time, based on the results of an offline learning algorithm, which can either generate new models based on the data or change the weighting of existing models.  The learning algorithm runs in the background on a periodic basis.
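The voting mechanism can be sketched as a weighted ensemble; again, a minimal illustrative Python example, not ConicIT’s actual implementation:

```python
def ensemble_predicts_problem(models, weights, variables, threshold=0.5):
    """Weighted vote of the active models: a wide model set covers more
    anomaly types, while the vote threshold keeps false alerts down."""
    votes = sum(w for model, w in zip(models, weights) if model(variables))
    return votes / sum(weights) >= threshold
```

The offline learning algorithm described above would then adjust the weights, or swap models in and out of the active set, as the workload profile evolves.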

When a possible performance problem is predicted by the active models, the ConicIT system takes two actions.  It sends an alert to the appropriate consoles and systems, and also instructs the monitor to collect information from the affected systems more frequently.  The result is that when IT personnel analyze the problem, they have the information describing the state of the system and the affected system components as if they were watching the problem while it was happening.  The system also uses the information from the analysis to point out the anomalies that led it to predict a problem, thereby aiding root cause analysis.

So whether via zAware or ConicIT, there are solutions to assist today’s busy IT technician in improving the Reliability, Availability and Serviceability (RAS) metrics of their business, by implementing practicable resolutions for those problems which were previously just too problematic to solve.  zAware can offload its processing to an IFL, if available, whereas ConicIT performs its processing on a non-Mainframe platform and can thus support all zSeries Servers, not just zEC12 onwards.

Ultimately both the zAware and ConicIT solutions have the same objective, increasing Mean Time Between Failure (MTBF) and decreasing Mean Time To Resolution (MTTR), optimizing IT personnel time accordingly.