Optimizing Mission Critical Data Value – IBM Machine Learning for z/OS

Typically the IBM Z Mainframe is recognized as the de facto System Of Record (SOR) for storing Mission Critical data.  It therefore follows that, for generic business applications, DB2, IMS (DB) and even VSAM could be considered database servers, while CICS and IMS (DC) are transaction servers.  Extracting value from the Mission Critical data source has always been desirable, and initially this meant transferring the valuable Mainframe data to a Distributed Platform via ETL (Extract, Transform, Load) processes.  A whole new software and hardware ecosystem was born for these processes, typically classified as data warehousing.  This process has proved valuable for the last 20 years or so, but more recently the IT industry has evolved, embracing Artificial Intelligence (AI) technologies and, ultimately, Machine Learning capabilities.

For some, it’s important to differentiate between Artificial Intelligence and Machine Learning, so here goes!  Artificial Intelligence is an explicit Computer Science activity, endeavouring to build machines capable of intelligent behaviour.  Machine Learning is a process of evolving computing platforms to act on data patterns, without being explicitly programmed.  It’s something of a “which came first, the chicken or the egg?” question: you need AI scientists and engineers to build the smart computing platforms, but you need data scientists or pseudo machine learning experts to make these new computing platforms intelligent.

Conceptually, Machine Learning could be classified as:

  • An automated and seamless learning ability, without being explicitly programmed
  • The ability to grow, change, evolve and adapt when encountering new data
  • An ability to deliver personalized and optimized outcomes from data analysed

When comparing this Machine Learning ability with the traditional ETL model, eliminating the need to move data from one platform to another eradicates both the “point in time” data timestamp of such a model and any associated security exposure of the data transfer process.  Therefore, returning to the IBM Z Mainframe being the de facto System Of Record (SOR) for storing Mission Critical data, it’s imperative that the IBM Z Mainframe server delivers its own Machine Learning ability…

IBM Machine Learning for z/OS is an enterprise class machine learning platform solution, assisting the user to create, train and deploy machine learning models, extracting value from your mission critical data on IBM Z platforms, retaining the data in situ, within the IBM Z complex.

Machine Learning for z/OS integrates several IBM machine learning capabilities, including IBM z/OS Platform for Apache Spark.  It simplifies and automates the machine learning workflow, enabling collaboration on machine learning projects across personnel and disciplines (E.g. Data Scientists, Business Analysts, Application Developers, et al).  By retaining your Mission Critical data in situ, on your IBM Z platforms, Machine Learning for z/OS significantly reduces the cost, complexity, security risk and time for Machine Learning model creation, training and deployment.

Simplistically there are two categories of Machine Learning:

  • Supervised: A model is trained from a known set of data, where each input is paired with a known target output. In mathematical terms, a formulaic approach.
  • Unsupervised: There is no known target output, so unsupervised machine learning is required to discover structure and formulate results from the data patterns themselves.
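
To make the distinction concrete, here is a minimal sketch in plain Python. It is not taken from Machine Learning for z/OS; the function names and sample figures are purely illustrative. The supervised routine is given inputs together with their known outputs and fits a straight line by least squares, while the unsupervised routine is given unlabelled values only and has to find two groups for itself (a one-dimensional k-means with k=2).

```python
from statistics import mean

def fit_line(xs, ys):
    """Supervised: learn slope and intercept from labelled (x, y) pairs."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar

def two_means(values, rounds=10):
    """Unsupervised: find two cluster centres in unlabelled values.

    Assumes the values are not all identical, so neither cluster is empty.
    """
    low, high = min(values), max(values)
    for _ in range(rounds):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(near_low), mean(near_high)
    return low, high

# Supervised: every input x comes with its known target y.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Unsupervised: only raw values, no target; the groups emerge from the data.
centres = two_means([10, 12, 11, 95, 100, 105])
```

In the first case the target column tells the algorithm what "right" looks like; in the second, the only guidance is the shape of the data itself.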

In theory, we have been executing supervised machine learning for some time, but unsupervised is the utopia.

Essentially Machine Learning for z/OS comprises the following functions:

  • Data ingestion (From SOR data sources, DB2, IMS, VSAM)
  • Data preparation
  • Data training and validation
  • Data evaluation
  • Data analysis deployment (predict, score, act)
  • Ongoing learning (monitor, ingestion, feedback)
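
The functions listed above can be sketched end to end as a toy workflow in plain Python. This is an illustration only, not the product's implementation: the in-memory rows stand in for SOR data, the field names (`amount`, `fraud`) are invented, and the "model" is nothing more than a threshold halfway between the two class averages.

```python
def ingest():
    """Stand-in for reading rows from a SOR source; the values are made up."""
    return [
        {"amount": 20.0, "fraud": 0}, {"amount": 35.0, "fraud": 0},
        {"amount": None, "fraud": 0}, {"amount": 900.0, "fraud": 1},
        {"amount": 40.0, "fraud": 0}, {"amount": 1200.0, "fraud": 1},
        {"amount": 25.0, "fraud": 0}, {"amount": 1000.0, "fraud": 1},
    ]

def prepare(rows):
    """Data preparation: drop rows with a missing input value."""
    return [r for r in rows if r["amount"] is not None]

def train(rows):
    """Training: place a threshold halfway between the two class means."""
    legit = [r["amount"] for r in rows if r["fraud"] == 0]
    fraud = [r["amount"] for r in rows if r["fraud"] == 1]
    return (sum(legit) / len(legit) + sum(fraud) / len(fraud)) / 2

def score(threshold, amount):
    """Deployment: predict a label for one new transaction."""
    return 1 if amount > threshold else 0

def evaluate(threshold, rows):
    """Evaluation: fraction of held-out rows labelled correctly."""
    hits = sum(score(threshold, r["amount"]) == r["fraud"] for r in rows)
    return hits / len(rows)

clean = prepare(ingest())
train_rows, validation_rows = clean[:5], clean[5:]  # simple hold-out split
model = train(train_rows)
accuracy = evaluate(model, validation_rows)
```

Ongoing learning would then amount to appending newly confirmed outcomes to the training rows and calling `train` again when the monitored accuracy drifts.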

For these various Machine Learning functions, several technology components are required:

  • z/OS components (MLz scoring service, various Spark ML libraries and CADS/HPO library)
  • Linux/x86 components (Docker images for Repository, Deployment, Training, Ingestion, Authentication and Metadata services)

The Machine Learning for z/OS solution incorporates the following added features:

  • CADS: Cognitive Assistant for Data Scientists (helps select the best fit algorithm for training)
  • HPO: Hyper Parameter Optimization (provides the Data Scientist with optimal parameters)
  • Brunel Visualization Tool (assists the Data Scientist to understand data distribution)

Machine Learning for z/OS provides a simple framework to manage the entire machine learning workflow.  Key functions are delivered through an intuitive web-based GUI, a RESTful API and other programming APIs:

  • Ingest data from various sources including DB2, IMS, VSAM or Distributed Systems data sources.
  • Transform and cleanse data for algorithm input.
  • Train a model for the selected algorithm with the prepared data.
  • Evaluate the results of the trained model.
  • Intelligent and automated algorithm selection, model selection and model parameter optimization, based on IBM Watson Cognitive Assistant for Data Scientists (CADS) and Hyper Parameter Optimization (HPO) technology.
  • Model management.
  • Optimized model development and Production.
  • RESTful API provision allowing Application Development to embed the prediction using the model.
  • Model status, accuracy and resource consumption monitoring.
  • An intuitive GUI wizard allowing users to easily train, evaluate and deploy a model.
  • z Systems authorization and authentication security.
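
As a rough idea of what embedding a prediction via a RESTful API might look like from the Application Developer's side, the sketch below builds (but deliberately does not send) a JSON scoring request using only the Python standard library. Everything specific here is an assumption for illustration: the host name, the `/deployments/.../score` path, the bearer token and the payload shape are hypothetical and are not the documented Machine Learning for z/OS API, so consult the product documentation for the real endpoint and request format.

```python
import json
import urllib.request

def build_scoring_request(base_url, deployment_id, token, record):
    """Build a JSON POST request carrying one record to be scored.

    The URL path and payload layout are hypothetical placeholders.
    """
    url = f"{base_url}/deployments/{deployment_id}/score"
    body = json.dumps({"record": record}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )

# Hypothetical values throughout; nothing is sent over the network here.
request = build_scoring_request(
    "https://mlz.example.com/api", "churn-model", "TOKEN", {"amount": 42.5}
)
```

An application would pass such a request to `urllib.request.urlopen` and read the predicted value from the JSON response.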

In conclusion, the Machine Learning for z/OS solution delivers the requisite framework for the emerging Data Scientists to collaborate with their Business Analysts and Application Developer colleagues for delivering new business opportunities, with smarter outcomes, while lowering risk and associated costs.

Big Data: Is the zSeries Mainframe A Viable Platform?

Noting that ~80% of global corporate data is still managed by IBM Mainframes, doesn’t it make sense that processing this mission critical data should remain local, whenever practicable and pragmatic?

Industry analysts estimate that 90%+ of existing IT budget is spent on the maintenance of existing applications and their supporting infrastructure. A significant factor is the siloed, duplicated and complex nature of these existing IT environments. Repeating this often unnecessary data duplication and processing for big data implementations will only exacerbate this significant TCO expenditure. Therefore it is of primary importance to consider big data from a strategic rather than a purely expedient tactical perspective. Put another way, if big data could be accessed and processed by the incumbent IBM Mainframe environment, why create another silo environment, requiring more servers, storage, software and associated maintenance expenditure?

It is estimated that each and every day another ~2.5 Exabytes (2.5 quintillion bytes) of data is created, meaning that ~90% of electronic data stored has been created in the last two years alone. This data comes from numerous sources, largely Internet and mobile telephony based, including social media, digital pictures and videos, financial transaction records and cell phone generated data, to name but a few.

Industry analysts estimate that only ~1% of global data is currently analysed, leaving massive scope for growth in this functional area, namely big data analytics. Obviously this scope implies exponential and arguably uncontrolled growth in the deployment of big data analytics solutions, generating significant risk that big data projects will lack management oversight and spiral out of control from a cost viewpoint.

It therefore follows that big data initiatives require careful and strategic planning, not only for short-term immediate requirements, but also for future big data projects that can already be perceived and forecast. Moreover, there needs to be a strategic, scalable, cost efficient and secure infrastructure in place, managing the interrelationships and interdependencies between mission critical data stored on the IBM Mainframe and big data being created from Internet and mobile technologies.

Without such a diligent and structured management framework, IT infrastructure costs (TCO) will increase and efficiency will reduce, with the inevitable consequence of siloed environments and duplication of resources, namely servers, software, storage, et al. As always, we must apply lessons learned from past experiences to avoid these inefficiencies.

Hadoop is seemingly the big data buzzword, being an open source software framework for storing and processing big data in a distributed environment on large clusters of commodity hardware. Ultimately Hadoop delivers two primary functions: massive data storage and faster processing, achieved by distributing the work in parallel across the cluster.
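
Hadoop's classic processing model is MapReduce, and its shape can be shown in a few lines of plain Python. The sketch below runs in a single process and does not use any Hadoop API; the log lines and function names are invented for illustration. On a real cluster the map and reduce steps would run in parallel on many nodes, with the framework performing the shuffle between them.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (page, 1) pair for each 'user page' web log line."""
    _user, page = line.split()
    yield page, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse each key's values into a single total."""
    return key, sum(values)

log_lines = ["alice /home", "bob /cart", "alice /cart", "carol /home", "bob /home"]
pairs = [pair for line in log_lines for pair in map_phase(line)]
page_hits = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each log line is mapped independently and each key is reduced independently, both phases can be spread across commodity servers, which is where the scalability comes from.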

In conclusion, the underlying question remains, can mission critical IBM Mainframe data be “coupled” with big data, typically originating from Internet and mobile platforms, to deliver an integrated single image view of customer and/or product data, for business benefit?

IBM offers an integrated solution, namely the zEnterprise Analytics System (I.E. 9700, 9710), comprising hardware (E.g. z196/zEC12 or z114/zBC12 Server plus DS8870 Disk) and software (E.g. Optimized z/OS software stack), combined with optional services. Primarily, data analytics is delivered by the IBM DB2 Analytics Accelerator solution, incorporating Netezza 1000 product function, allowing for intelligent and rapid in-memory data analytics via the DB2 RDBMS. Therefore existing zSeries Mainframe customers can supplement their current IBM Mainframe infrastructure with the IBM DB2 Analytics Accelerator solution, while the possibility also exists for a zSeries Mainframe to be deployed for new workloads, via the zEnterprise Analytics System.

Resource and cost efficiencies are delivered by combining z/OS and Linux on zEnterprise solutions. Data transfer is reduced by keeping data analytics in the same environment as the mission critical source data (I.E. z/OS), using HiperSockets to pass the data between the IBM z/OS and Linux on zEnterprise systems. Overall TCO efficiencies are delivered by exploiting lower cost Linux on zEnterprise systems resources, where for Sub Capacity z/OS customers, no software charges will be incurred for associated CPU processing. This approach therefore leverages existing zEnterprise infrastructure resources, including people and processes, to deploy and support expanding data analytics requirements.

zSeries Mainframe big data analytics solutions, whether via the packaged zEnterprise Analytics System or via the IBM DB2 Analytics Accelerator solution, deliver benefits including:

  • Optimized I/O Processing: Reducing the complexity and cost of data storage and associated processing by bringing data transformation and analytic processes to the data origin (I.E. zSeries Mainframe)
  • Enterprise Wide Data Availability: Safeguarding operational data accessibility to many users in a timely and cost efficient manner without impacting core business processes
  • Near Real Time Data Processing: Delivering near real time operational analytics with minimal latency and superior Quality of Service (QoS) attributes (I.E. RAS – Reliability, Availability, Serviceability)

Syncsort also provide their DMX-h ETL solution to integrate IBM mainframe data with Hadoop technologies. Syncsort DMX-h ETL incorporates a library of Use Case Accelerators to implement common ETL tasks including Mainframe data access, change data capture (CDC), joins, web log aggregations, et al. It implements a more traditional ETL approach, offloading big data batch workload from the Mainframe to Hadoop platforms and reducing Mainframe MIPS accordingly. Obviously ETL solutions have a long-term history, typically associated with Business Intelligence, Data Warehouse, et al. One must draw one’s own conclusions as to whether ETL solutions contribute to the complexity and cost of managing mission critical business data…
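
For readers unfamiliar with what "Mainframe data access" involves in an ETL flow, the following is a minimal sketch in plain Python and has nothing to do with how DMX-h itself works. It assumes a hypothetical fixed-width record layout (`LAYOUT`) and text fields held in EBCDIC code page 037, which Python can decode with its built-in `cp037` codec. Real mainframe records frequently also contain packed decimal (COMP-3) and binary fields, which this sketch does not handle.

```python
import json

# Hypothetical fixed-width layout: (field name, start offset, length in bytes).
LAYOUT = [("customer_id", 0, 6), ("name", 6, 10), ("balance_pence", 16, 8)]

def extract(raw, record_length):
    """Extract: split a byte stream into fixed-length records."""
    return [raw[i:i + record_length] for i in range(0, len(raw), record_length)]

def transform(record):
    """Transform: decode EBCDIC to text, slice the fields, type the number."""
    text = record.decode("cp037")
    row = {name: text[start:start + length].strip() for name, start, length in LAYOUT}
    row["balance_pence"] = int(row["balance_pence"])
    return row

def load(rows):
    """Load: serialize as JSON Lines, ready to hand to a distributed platform."""
    return "\n".join(json.dumps(row) for row in rows)

# Simulate mainframe data by encoding two made-up 24-byte records as EBCDIC.
raw = ("000001SMITH     00012550" + "000002JONES     00000999").encode("cp037")
rows = [transform(record) for record in extract(raw, 24)]
output = load(rows)
```

Even this toy version shows why ETL adds moving parts: the layout has to be known and maintained on both sides, and the copied data is only as current as the last extract.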

From a business viewpoint, big data analytics delivers benefits, including but not limited to:

  • Optimized & Faster Decision Making: Performing real time analysis of customer transaction and activity data, feedback (E.g. survey and experience) data, et al, can dramatically reduce customer attrition, maintaining existing customer loyalty, applying these lessons learned for attracting new customers.
  • New Products & Services: Customers and associated market research have always provided valuable insight into driving innovation, but these traditional processes are time consuming and error prone. Rapidly analysing real life customer data from Internet and mobile sources delivers an opportunity to offer customers a new product and/or service, seemingly specialized to their individual requirements.
  • Cost Reduction: Performed well, big data analytics can clearly deliver significant cost reduction for the business, reducing product/service development time, while retaining existing customers and attracting new customers. However, done badly, data analytics could be a significant drain on the IT expenditure budget.

As always, the zSeries Mainframe delivers an integrated, scalable, secure and cost efficient solution for big data initiatives, even Hadoop, typically perceived as a Distributed Systems solution. Without doubt, big data solutions will be implemented by each and every major global company in the short-term, while pragmatic and careful planning will reduce the associated IT implementation and administration cost. With a legacy of several decades or more delivering enterprise wide solutions, arguably seasoned IBM Mainframe personnel are ideally placed to participate in the design and delivery of big data analytics projects!