Maximizing IBM Z System Of Record (SOR) Data Value: Is ETL Still Relevant?

A general consensus regarding the IBM Z Mainframe platform is that it’s the best transaction and database server available and, more recently with the advent of Pervasive Encryption, the best enterprise-class security server.  It therefore follows that the majority of mission critical and valuable data resides in IBM Z Mainframe System Of Record (SOR) database repositories, receiving and passing data via real-time transaction services.  Traditionally, maximizing data value has generally involved moving data from the IBM Mainframe to another platform for subsequent analysis, typically for Business Intelligence (BI) and Data Warehouse (DW) purposes.

ETL (Extract, Transform, Load) is an automated, bulk data movement process: data is extracted from source systems, passed through a transformation engine according to an installation-defined policy, and loaded into target systems, typically data warehouses or specialized data repositories, for use by business decision driven applications.  Quite simply, ETL enables an organization to make informed and hopefully intelligent data driven business decisions.  This ubiquitous IT industry TLA (Three Letter Acronym) generated a massive industry of specialized ETL software solutions, running on various Distributed Systems hardware platforms, both commodity and specialized.  However, some ~30 years after the first ETL processes appeared, is ETL still relevant in the 21st Century?
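To make the three stages concrete, the following minimal Python sketch illustrates a generic ETL flow against stand-in sources and targets (a hypothetical CSV extract file and a SQLite database; the file layout and column names are invented for illustration).  A real implementation would use a dedicated ETL engine and the installation’s own transformation policies.

    import csv
    import sqlite3

    # Extract: read raw records from a hypothetical source extract file.
    def extract(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    # Transform: apply installation-defined rules before any data is loaded.
    def transform(rows):
        return [
            {"account": r["ACCOUNT"].strip(), "amount_usd": round(float(r["AMOUNT"]) / 100, 2)}
            for r in rows
            if r["STATUS"] == "POSTED"    # filter out incomplete transactions
        ]

    # Load: write the transformed rows into the target warehouse table.
    def load(rows, db_path="warehouse.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS transactions (account TEXT, amount_usd REAL)")
        conn.executemany("INSERT INTO transactions VALUES (:account, :amount_usd)", rows)
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        load(transform(extract("daily_extract.csv")))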

The 21st Century has witnessed a massive and arguably exponential data explosion from cloud, mobile and social media sources.  These dynamic and open data sources demand intelligent analytics to process the data in near real-time, and the notion of a time delay between the Extract and Load parts of the ETL process is becoming increasingly unacceptable for most data driven organizations.  During the last several years there has been increased usage of Cloud BI, with the proportion of public cloud users deploying Cloud BI solutions reportedly increasing from ~25% to ~80%.

For cloud resident data warehouses, an evolution from ETL to ELT (Extract, Load, Transform) has taken place.  ELT is an evolutionary and savvy method of moving data from source systems to centralized data repositories without transforming the data before it’s loaded into the target systems.  The major benefit of the ELT approach is meeting the near real-time processing requirement of today’s data driven 21st Century business.  With ELT, all extracted raw data resides in the data warehouse, where powerful and modern analytical architectures can transform the data, as per the associated business decision making policies.  Put simply, the data transformation occurs when the associated analytical query activities are processed.  For those modern organizations leveraging public cloud resources, ELT and Cloud BI processes make sense and the growth of Cloud BI speaks for itself.  However, what about the traditional business, which has leveraged the IBM Z Mainframe platform for 30-50+ years?
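By contrast, a minimal ELT sketch (again using SQLite as a stand-in warehouse, with the same invented file layout as the earlier ETL example) loads the raw data untransformed and leaves the transformation to the analytical query itself:

    import csv
    import sqlite3

    conn = sqlite3.connect("warehouse.db")

    # Load: land the raw extract in the warehouse as-is, with no prior transformation.
    conn.execute("CREATE TABLE IF NOT EXISTS raw_transactions (account TEXT, amount TEXT, status TEXT)")
    with open("daily_extract.csv", newline="") as f:
        rows = [(r["ACCOUNT"], r["AMOUNT"], r["STATUS"]) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO raw_transactions VALUES (?, ?, ?)", rows)
    conn.commit()

    # Transform at query time: the business rules are applied when the analytical
    # query runs, so there is no separate transformation step delaying the load.
    for account, total_usd in conn.execute(
        """SELECT account, ROUND(SUM(CAST(amount AS REAL)) / 100, 2)
           FROM raw_transactions
           WHERE status = 'POSTED'
           GROUP BY account"""):
        print(account, total_usd)
    conn.close()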

Each and every leading Public Cloud supplier, including IBM (Watson), has its own proprietary analytical engine, integrating that technology into its mainstream offerings.  As always, the IBM Z Mainframe platform has evolved to deliver the near real-time requirements of an ELT framework, but are there any other generic solutions that might assist a Mainframe organization in its ETL to ELT evolution process?

B.O.S. Software Service und Vertrieb GmbH offers its tcVISION solution, which approaches this challenge from a data synchronization viewpoint.  tcVISION is a powerful Change Data Capture (CDC) platform for users of IBM Mainframes and Distributed Systems servers.  tcVISION automatically identifies the changes applied to Mainframe and Distributed Systems databases and files; no programming effort is necessary to obtain the changed data.  tcVISION continuously propagates the changed data to the target systems in real-time, or on a policy driven time interval, as and when required.  tcVISION offers a rich set of processing and control mechanisms to guarantee a data exchange implementation that is fully audit-proof, and contains powerful bulk processors that perform the initial load of mass data, or the cyclic exchange of larger data volumes, in an efficient, fast and reliable way.

tcVISION supports several data capture methods that can be used individually, as the application and associated data processing flow requires.  These methods operate on a real-time or near real-time basis, capturing changes from IBM Mainframe DBMS, Logstream, Log and Snapshot (compare) data sources; a conceptual sketch of the compare method follows the list below.  A wide range of generic database repositories is supported:

  • Adabas: Real-time/Near real-time, log processing, compare processing
  • Adabas LUW: Real-time/Near real-time, log processing, compare processing
  • CA-Datacom: Log processing, compare processing
  • CA-IDMS: Real-time/Near real-time, log processing, compare processing
  • DB2: Real-time/Near real-time, log processing, compare processing
  • DB2/LUW: Real-time/Near real-time, log processing, compare processing
  • Exasol: Compare processing
  • IMS: Real-time/Near real-time, log processing, compare processing
  • Informix: Real-time/Near real-time, log processing, compare processing
  • Microsoft SQL Server: Real-time/Near real-time, log processing, compare processing
  • Oracle: Real-time/Near real-time, log processing, compare processing
  • PostgreSQL: Real-time/Near real-time, log processing, compare processing
  • Sequential file: Compare processing
  • Teradata: Compare processing
  • VSAM: Real-time/Near real-time, log processing, compare processing
  • VSAM/CICS: Real-time/Near real-time, log processing, compare processing
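As a conceptual illustration of the snapshot (compare) capture method referenced above, and emphatically not tcVISION’s own interface, the following sketch derives insert, update and delete events by comparing two keyed snapshots of a table:

    # Hypothetical snapshot-compare change data capture: two keyed snapshots of
    # the same table are diffed to derive the change events to propagate.
    def compare_snapshots(before, after):
        """before/after are dicts keyed on the record key; values are row tuples."""
        changes = []
        for key, row in after.items():
            if key not in before:
                changes.append(("INSERT", key, row))
            elif before[key] != row:
                changes.append(("UPDATE", key, row))
        for key in before:
            if key not in after:
                changes.append(("DELETE", key, None))
        return changes

    previous = {"0001": ("SMITH", 120.00), "0002": ("JONES", 75.50)}
    current = {"0001": ("SMITH", 150.00), "0003": ("BROWN", 10.00)}

    for event in compare_snapshots(previous, current):
        print(event)    # each change would be propagated to the target system here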

tcVISION incorporates an intelligent bulk load component that can be used to unload data from a Mainframe or Distributed Systems data source and load it into a target database, either directly or by using a loader file.  tcVISION comes with integrated loop-back prevention for bidirectional data exchange, where individual criteria can be specified to detect and ignore changes that have already been applied.  tcVISION also incorporates comprehensive monitoring, logging and integrated alert notification.  Optional performance data may be captured and stored in any commercially available relational database; this performance data can be analyzed and graphically displayed using the tcVISION web component.
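Loop-back prevention in bidirectional replication is commonly achieved by tagging each change with its system of origin and ignoring changes that merely echo back.  The sketch below illustrates that general idea only and does not reflect tcVISION’s actual implementation.

    # Generic loop-back prevention for bidirectional replication (illustrative only).
    LOCAL_SYSTEM = "MAINFRAME"

    def apply_incoming_change(change, apply_fn):
        # A change that originated on this system has already been applied locally;
        # re-applying it would bounce back to the partner system and create a loop.
        if change["origin"] == LOCAL_SYSTEM:
            return False
        apply_fn(change)
        return True

    incoming = [
        {"origin": "MAINFRAME", "key": "0001", "op": "UPDATE"},      # echo of a local change
        {"origin": "DISTRIBUTED", "key": "0002", "op": "INSERT"},    # genuine remote change
    ]

    for change in incoming:
        apply_incoming_change(change, lambda c: print("applying", c))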

From an ETL to ELT evolution viewpoint, tcVISION delivers the following data synchronization benefits:

  • Time Optimization: Significant reduction in data exchange implementation processes and data synchronization processing.
  • Heterogeneous Support: Independent of database supplier, offering support for a myriad of source and target databases.
  • Resource Optimization: Mainframe MIPS reduction and data transfer optimization via intelligent secure compression algorithms.
  • Data Availability: Real-time data replication across application and system boundaries.
  • Implementation Simplicity: Eliminates the need for bespoke application programming and dedicated data engineering resources.
  • Security: Full accountability and auditability of all data movements.

In conclusion, the traditional ETL process has now been superseded by the real-time data exchange requirement of 21st Century data processing, via the evolution to ELT.  Whether viewed as an ELT or a data synchronization requirement, tcVISION delivers an independent, vendor-agnostic solution, which can efficiently provide seamless data delivery for analytical purposes, while maintaining synchronized data copies between environments in real-time.

Optimizing Mission Critical Data Value – IBM Machine Learning for z/OS

Typically the IBM Z Mainframe is recognized as the de facto System Of Record (SOR) for storing Mission Critical data.  It therefore follows that, for generic business applications, DB2, IMS (DB) and even VSAM can be considered database servers, while CICS and IMS (DC) are transaction servers.  Extracting value from this Mission Critical data source has always been desirable, initially by transferring this valuable Mainframe data to a Distributed Platform via ETL (Extract, Transform, Load) processes.  A whole new software and hardware ecosystem was born for these processes, typically classified as data warehousing.  This approach has proved valuable for the last 20 years or so, but more recently the IT industry has evolved, embracing Artificial Intelligence (AI) technologies, ultimately generating Machine Learning capabilities.

For some, it’s important to differentiate between Artificial Intelligence and Machine Learning, so here goes!  Artificial Intelligence is an explicit Computer Science activity, endeavouring to build machines capable of intelligent behaviour.  Machine Learning is a process of evolving computing platforms to act from data patterns, without being explicitly programmed.  It’s a “which came first, the chicken or the egg?” situation: you need AI scientists and engineers to build the smart computing platforms, but you need data scientists or pseudo machine learning experts to make these new computing platforms intelligent.

Conceptually, Machine Learning could be classified as:

  • An automated and seamless learning ability, without being explicitly programmed
  • The ability to grow, change, evolve and adapt when encountering new data
  • An ability to deliver personalized and optimized outcomes from data analysed

When combining this Machine Learning ability with the traditional ETL model, eliminating the need to move data from one platform to another eradicates the “point in time” data snapshot inherent in such a model, along with any associated security exposure of the data transfer process.  Therefore, returning to the IBM Z Mainframe being the de facto System Of Record (SOR) for storing Mission Critical data, it’s imperative that the IBM Z Mainframe server delivers its own Machine Learning ability…

IBM Machine Learning for z/OS is an enterprise-class machine learning platform solution, helping users create, train and deploy machine learning models that extract value from mission critical data on IBM Z platforms, while retaining the data in situ, within the IBM Z complex.

Machine Learning for z/OS integrates several IBM machine learning capabilities, including IBM z/OS Platform for Apache Spark.  It simplifies and automates the machine learning workflow, enabling collaboration on machine learning projects across personnel and disciplines (E.g. Data Scientists, Business Analysts, Application Developers, et al).  By retaining your Mission Critical data in situ, on your IBM Z platforms, Machine Learning for z/OS significantly reduces the cost, complexity, security risk and time of Machine Learning model creation, training and deployment.

Simplistically there are two categories of Machine Learning:

  • Supervised: A model is trained from a known set of data sources, with a target output in mind. In mathematical terms, a formulaic approach.
  • Unsupervised: There is no input or output structure and unsupervised machine learning is required to formulate results from evolving data patterns.

In theory, we have been executing supervised machine learning for some time, but unsupervised is the utopia.
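The distinction can be illustrated with a few lines of scikit-learn (assuming that library is available; this is purely illustrative and unrelated to the z/OS product itself): the supervised model learns from known labels, while the unsupervised model derives structure from the data alone.

    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    X = [[0.1, 1.2], [0.3, 0.9], [5.1, 4.8], [4.9, 5.2]]    # feature vectors
    y = [0, 0, 1, 1]                                        # known target labels

    # Supervised: learn a mapping from the features to the known target output.
    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[0.2, 1.0]]))    # predicts class 0

    # Unsupervised: no labels supplied; the model groups the data by pattern alone.
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print(km.labels_)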

Essentially Machine Learning for z/OS comprises the following functions (a conceptual sketch of this workflow follows the list):

  • Data ingestion (From SOR data sources, DB2, IMS, VSAM)
  • Data preparation
  • Data training and validation
  • Data evaluation
  • Data analysis deployment (predict, score, act)
  • Ongoing learning (monitor, ingestion, feedback)
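A minimal, purely illustrative sketch of those workflow stages follows, using scikit-learn as a stand-in for the Spark ML libraries and scoring services that Machine Learning for z/OS actually provides; the synthetic data, function names and file names are assumptions for illustration only.

    import pickle
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    def ingest():
        # Stand-in for ingesting SOR data (DB2, IMS, VSAM) as feature records.
        X = [[i, i % 7] for i in range(200)]
        y = [1 if i % 7 > 3 else 0 for i in range(200)]
        return X, y

    def prepare(X, y):
        # Data preparation: split into training and validation sets.
        return train_test_split(X, y, test_size=0.25, random_state=42)

    def train(X_train, y_train):
        # Data training: fit a model to the prepared training data.
        return RandomForestClassifier(random_state=42).fit(X_train, y_train)

    def evaluate(model, X_val, y_val):
        # Data evaluation: measure model accuracy against held-back data.
        return accuracy_score(y_val, model.predict(X_val))

    def deploy(model, path="model.pkl"):
        # Deployment stand-in: persist the model so a scoring service can load it.
        with open(path, "wb") as f:
            pickle.dump(model, f)

    X, y = ingest()
    X_train, X_val, y_train, y_val = prepare(X, y)
    model = train(X_train, y_train)
    print("validation accuracy:", evaluate(model, X_val, y_val))
    deploy(model)
    # Ongoing learning would monitor live scoring accuracy and trigger retraining.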

For these various Machine Learning functions, several technology components are required:

  • z/OS components (MLz scoring service, various Spark ML libraries and the CADS/HPO library)
  • Linux/x86 components (Docker images for the Repository, Deployment, Training, Ingestion, Authentication and Metadata services)

The Machine Learning for z/OS solution incorporates the following added features:

  • CADS: Cognitive Assistant for Data Scientist (helps select the best fit algorithm for training)
  • HPO: Hyper Parameter Optimization (provides the Data Scientist with optimal parameters)
  • Brunel Visualization Tool (assists the Data Scientist in understanding data distribution)

Machine Learning for z/OS provides a simple framework to manage the entire machine learning workflow.  Key functions are delivered through an intuitive web-based GUI, a RESTful API and other programming APIs (a sketch of a REST scoring call follows the list below):

  • Ingest data from various sources including DB2, IMS, VSAM or Distributed Systems data sources.
  • Transform and cleanse data for algorithm input.
  • Train a model for the selected algorithm with the prepared data.
  • Evaluate the results of the trained model.
  • Intelligent and automated algorithm/model selection/model parameter optimization based on IBM Watson Cognitive Assistant for Data Science (CADS) and Hyper Parameter Optimization (HPO) technology.
  • Model management.
  • Optimized model development and production deployment.
  • RESTful API provision, allowing application developers to embed model predictions within their applications.
  • Model status, accuracy and resource consumption monitoring.
  • An intuitive GUI wizard allowing users to easily train, evaluate and deploy a model.
  • z Systems authorization and authentication security.
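As an illustration of embedding a prediction via a RESTful scoring API, the sketch below posts a feature payload to a hypothetical endpoint; the URL, payload fields and token handling are assumptions for illustration and not the documented Machine Learning for z/OS interface.

    import requests    # assumes the 'requests' package is available

    # Hypothetical scoring endpoint and payload; consult the product documentation
    # for the real Machine Learning for z/OS REST interface.
    SCORING_URL = "https://mlz.example.com/v1/deployments/churn-model/score"
    TOKEN = "..."      # obtained from the platform's authentication service

    payload = {"fields": ["tenure_months", "avg_balance"], "values": [[27, 1543.75]]}

    response = requests.post(
        SCORING_URL,
        json=payload,
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    print(response.json())    # e.g. the predicted class and its probability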

In conclusion, the Machine Learning for z/OS solution delivers the requisite framework for the emerging Data Scientists to collaborate with their Business Analysts and Application Developer colleagues for delivering new business opportunities, with smarter outcomes, while lowering risk and associated costs.

Big Data: Is the zSeries Mainframe A Viable Platform?

Noting that ~80% of global corporate data is still managed by IBM Mainframes, doesn’t it make sense that processing this mission critical data should remain local, whenever practicable and pragmatic?

Industry analysts estimate that 90%+ of existing IT budget expenditure is consumed by the maintenance of existing applications and their supporting infrastructure. A significant factor is the siloed, duplicated and complex nature of these existing IT environments. Repeating this often unnecessary data duplication and processing for big data implementations will only exacerbate this significant TCO expenditure. It is therefore of primary importance to consider big data from a strategic rather than a purely expedient tactical perspective. Put another way, if big data could be accessed and processed by the incumbent IBM Mainframe environment, why create another silo environment, requiring more servers, storage, software and associated maintenance expenditure?

It is estimated that each and every day another ~2.5 Exabytes (2.5 quintillion bytes) of data is created, meaning that ~90% of electronically stored data has been created in the last two years alone. This data comes from numerous sources, largely Internet and mobile telephony based, including social media sources, digital pictures and videos, financial transaction records and cell phone generated data, to name but a few.

Industry analysts also estimate that only ~1% of global data is currently analysed, leaving massive scope for growth in this functional area, namely big data analytics. Obviously this scope dictates exponential and arguably uncontrolled growth in the deployment of big data analytics solutions, generating a significant risk that big data projects will lack management oversight, spiralling out of control from a cost viewpoint.

It therefore follows that big data initiatives require careful and strategic planning, not only for short-term immediate requirements, but also for future big data projects that can already be perceived and forecast. Moreover, there needs to be a strategic, scalable, cost efficient and secure infrastructure in place, managing the interrelationships and interdependencies between mission critical data stored on the IBM Mainframe and big data being created from Internet and mobile technologies.

Without such a diligent and structured management framework, IT infrastructure expenditure (TCO) will increase and efficiency will reduce, with the inevitable consequence of siloed environments and duplication of resources, namely servers, software, storage, et al. As always, we must apply the lessons learned from past experiences to avoid these inefficiencies.

Hadoop is seemingly the big data buzzword, being an open source software framework for storing and processing big data in a distributed environment on large clusters of commodity hardware. Ultimately Hadoop delivers two primary functions: massive distributed data storage (HDFS) and parallel data processing (MapReduce).
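For context, a classic illustration of the Hadoop processing model is a Hadoop Streaming job, where the mapper and reducer are plain scripts reading stdin and writing stdout; the minimal Python word-count pair below is a sketch of that pattern, not a production job.

    # mapper.py - emits one "word <TAB> 1" line per word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

The matching reducer relies on Hadoop sorting the mapper output by key before it is presented on stdin:

    # reducer.py - sums the counts per word from the sorted mapper output
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))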

In conclusion, the underlying question remains, can mission critical IBM Mainframe data be “coupled” with big data, typically originating from Internet and mobile platforms, to deliver an integrated single image view of customer and/or product data, for business benefit?

IBM offers an integrated solution, namely the zEnterprise Analytics System (I.E. 9700, 9710), comprising hardware (E.g. z196/zEC12 or z114/zBC12 Server plus DS8870 Disk) and software (E.g. an optimized z/OS software stack), combined with optional services. Data analytics is primarily delivered by the IBM DB2 Analytics Accelerator solution, incorporating Netezza 1000 product function, allowing for intelligent and rapid in-memory data analytics via the DB2 RDBMS. Therefore existing zSeries Mainframe customers can supplement their current IBM Mainframe infrastructure with the IBM DB2 Analytics Accelerator solution, while the realm of possibility exists for a zSeries Mainframe to be deployed for new workloads, via the zEnterprise Analytics System.
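As a sketch of what in-place analytics looks like from an application viewpoint, the Python fragment below issues an analytical query to DB2 via the ibm_db driver, requesting acceleration where eligible; the connection details, table names and exact special register syntax are assumptions that should be verified against the DB2 Analytics Accelerator documentation.

    import ibm_db    # assumes the ibm_db Python driver is installed

    # Hypothetical connection to an accelerator-enabled DB2 for z/OS subsystem.
    conn = ibm_db.connect(
        "DATABASE=DSN1;HOSTNAME=zhost.example.com;PORT=446;PROTOCOL=TCPIP;UID=user;PWD=secret",
        "", "")

    # Request that eligible queries are routed to the accelerator; check the exact
    # special register settings against the product documentation.
    ibm_db.exec_immediate(conn, "SET CURRENT QUERY ACCELERATION = ENABLE")

    # The analytic query runs where the data lives; no extract or transfer step.
    stmt = ibm_db.exec_immediate(
        conn, "SELECT REGION, SUM(SALES_AMOUNT) FROM SALES.TRANSACTIONS GROUP BY REGION")
    row = ibm_db.fetch_tuple(stmt)
    while row:
        print(row)
        row = ibm_db.fetch_tuple(stmt)
    ibm_db.close(conn)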

Resource and cost efficiencies are delivered by combining z/OS and Linux on zEnterprise solutions. Data transfer is reduced by keeping data analytics in the same environment as the mission critical source data (I.E. z/OS), using HiperSockets to pass the data between the IBM z/OS and Linux on zEnterprise systems. Overall TCO efficiencies are delivered by optimizing lower cost Linux on zEnterprise system resources, where for Sub-Capacity z/OS customers, no software charges will be incurred for the associated CPU processing. This approach leverages existing zEnterprise infrastructure resources, including people and processes, to deploy and support expanding data analytics requirements.

zSeries Mainframe big data analytics solutions, whether via the packaged zEnterprise Analytics System or via the IBM DB2 Analytics Accelerator solution, deliver benefits including:

  • Optimized I/O Processing: Reducing the complexity and cost of data storage and associated processing by bringing data transformation and analytic processes to the data origin (I.E. zSeries Mainframe)
  • Enterprise Wide Data Availability: Safeguarding operational data accessibility to many users in a timely and cost efficient manner without impacting core business processes
  • Near Real Time Data Processing: Delivering near real time operational analytics with minimal latency and superior Quality of Service (QoS) attributes (I.E. RAS – Reliability, Availability, Serviceability)

Syncsort also provides its DMX-h ETL solution to integrate IBM Mainframe data with Hadoop technologies. Syncsort DMX-h ETL incorporates a library of Use Case Accelerators to implement common ETL tasks, including Mainframe data access, change data capture (CDC), joins, web log aggregations, et al. It implements a more traditional ETL approach, offloading big data batch workloads from the Mainframe to Hadoop platforms and reducing Mainframe MIPS accordingly. Obviously ETL solutions have a long-term history, typically associated with Business Intelligence, Data Warehouse, et al. One must draw one’s own conclusions as to whether ETL solutions contribute to the complexity and cost of managing mission critical business data…

From a business viewpoint, big data analytics delivers benefits, including but not limited to:

  • Optimized & Faster Decision Making: Performing real time analysis of customer transaction and activity data, feedback (E.g. survey and experience) data, et al, can dramatically reduce customer attrition, maintaining existing customer loyalty, applying these lessons learned for attracting new customers.
  • New Products & Services: Customer’s and associated market research have always provided valuable insight into driving innovation, but these traditional processes are time consuming and error prone. Rapidly analysing real life customer data from Internet and mobile sources, delivers an opportunity to offer a new product and/or service, seemingly specialized to their personal individual requirements.
  • Cost Reduction: Performed well, clearly big data analytics can deliver significant cost reduction for the business, reducing product/service development time, while retaining existing customers and attracting new customers. However, done badly, data analytics could be a significant drain on the IT expenditure budget

As always, the zSeries Mainframe delivers an integrated, scalable, secure and cost efficient solution for big data initiatives, even Hadoop, typically perceived as a Distributed Systems solution. Without doubt, big data solutions will be implemented by each and every major global company in the short-term, while pragmatic and careful planning will reduce the associated IT implementation and administration cost. With a legacy of several decades or more delivering enterprise wide solutions, arguably seasoned IBM Mainframe personnel are ideally placed to participate in the design and delivery of big data analytics projects!