In the first installment of our blog series on what happens when real-time analytics meets Kubernetes, we discussed how Kubernetes has become the de facto open source container orchestration tool, as well as some practical tips for a one-click Kubernetes deployment with GigaSpaces InsightEdge.

In today’s article, we’ll take a closer look at the why and how of auto-deploying a machine learning job with Spark, visualizing it with Apache Zeppelin on a cloud-native Kubernetes deployment, and operationalizing machine learning with InsightEdge.

The Accelerated Adoption of Kubernetes

Kubernetes maturity and adoption are both accelerating at an unprecedented pace, with no end in sight. For example, in 2018 Google and the open source foundation CNCF announced that control of Kubernetes development infrastructure would move to the CNCF, in an effort to further enable multicloud development.

In fact, Google donated $9 million in cloud credits to help the CNCF reach the point where it can run the Kubernetes infrastructure on its own.

Kubernetes for Machine Learning

As Kubernetes enables faster machine learning capabilities, efforts are also underway to make it easier to develop, deploy, and manage machine learning on Kubernetes. The open source Kubeflow project, for example, aims to achieve just that.

Kubernetes for Apache Spark Developers

In addition, Kubernetes is becoming more and more important to Spark developers working on machine learning applications. Containers provide the simplest way to achieve more consistency and predictability when dealing with the many challenges of developing Spark applications.

These challenges include dependence on multiple third-party products and libraries, dealing with ongoing changes to software components, maintaining compatibility, and delivering high availability, among others.

Since containers are at the core of Kubernetes, it addresses these and other challenges, providing various ways to make provisioning and configuring Spark clusters easier and more transparent.

Now, let’s take a look at how to auto-deploy your GigaSpaces machine learning stack on Kubernetes.

InsightEdge Platform – Real-Time Machine Learning on Your Operational Data

Providing a full Spark distribution, the InsightEdge Platform runs machine learning models on “hot” data enriched with historical data from HDFS data lakes, Amazon S3, and Azure Blob Storage. By co-locating Spark jobs with the in-memory data grid, it enables hybrid transactional/analytical processing for low-latency data grid applications.

Figure 1: Spark Co-location in InsightEdge Platform

InsightEdge also reduces the bandwidth and memory utilization of machine learning applications, and ensures higher availability by saving the Spark worker state for immediate auto-recovery.

When speed is of the essence, InsightEdge accelerates Spark jobs with predicate pushdown and column pruning.

Only relevant data entries are retrieved when running a query; the filtering, pruning, and optimization happen on the in-memory data grid, which improves performance and reduces network overhead. Queries can run 30x faster or more.

Figure 2: Acceleration of Spark Example

Running a Spark & InsightEdge Job

Users can run Spark workloads in an existing Kubernetes 1.9+ cluster and take advantage of Apache Spark’s ability to manage distributed data processing tasks.

To run an Apache Spark and InsightEdge job:

  1. Set the Spark configuration property for the InsightEdge Docker image;
  2. Provide a URL for submitting the Spark jobs to Kubernetes;
  3. Configure the Kubernetes service account so it can be used by the Driver Pod;
  4. Deploy a data grid with a headless service (lookup locator), as sketched below;
  5. Submit the Spark jobs for the examples.
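A minimal sketch of steps 3 and 4: the RBAC commands come from the Apache Spark on Kubernetes documentation, while the Helm repo URL, chart name, and release name for the data grid are assumptions based on the GigaSpaces charts of that era (Helm 2 syntax), so check the GigaSpaces documentation for the current names:

    # Step 3: create a service account for the Spark Driver Pod and grant it
    # the "edit" role (from the Apache Spark on Kubernetes documentation).
    kubectl create serviceaccount spark
    kubectl create clusterrolebinding spark-role --clusterrole=edit \
        --serviceaccount=default:spark --namespace=default

    # Step 4: deploy the data grid. The chart creates a headless service that
    # serves as the lookup locator; repo URL and chart name are illustrative.
    helm repo add gigaspaces https://resources.gigaspaces.com/helm-charts
    helm install gigaspaces/insightedge --name demo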

Spark jobs can be submitted with the insightedge-submit script, which is located in the InsightEdge home directory under insightedge/bin. This script is similar to the spark-submit command used by Spark users to submit Spark jobs.

A SparkPi Example

Run the following InsightEdge submit script for the SparkPi example. This example specifies a JAR file with a URI that uses the local:// scheme, which points to the location of the example JAR that is already available in the Docker image. If your application’s dependencies are all hosted in remote locations (like HDFS or HTTP servers), you can use the appropriate remote URIs, such as https://path/to/examples.jar.
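A sketch of the submit command, using the standard Spark-on-Kubernetes configuration properties; the API server address, executor count, image name, and the examples JAR path inside the image are placeholders to adjust for your cluster and InsightEdge distribution:

    # Submit the SparkPi example to Kubernetes; the image name and the JAR
    # path inside the image are illustrative.
    ./insightedge-submit \
        --class org.apache.spark.examples.SparkPi \
        --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
        --deploy-mode cluster \
        --name spark-pi \
        --conf spark.executor.instances=2 \
        --conf spark.kubernetes.container.image=gigaspaces/insightedge:14.0 \
        --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
        local:///opt/insightedge/insightedge/spark/examples/jars/spark-examples.jar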

Refer to the Apache Spark documentation for more configurations that are specific to Spark on Kubernetes. For example, to specify the Driver Pod name, add the following configuration option to the submit command:
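Here, spark.kubernetes.driver.pod.name is the standard Spark-on-Kubernetes property for this purpose, and <pod-name> is a placeholder:

    --conf spark.kubernetes.driver.pod.name=<pod-name>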

How to Visualize the Job with the Apache Zeppelin Web Notebook

Apache Zeppelin is a multi-purpose, open source, web-based notebook that brings data ingestion and exploration, visualization, sharing, and collaboration features to Spark.

To use the interactive Apache Zeppelin Web Notebook with InsightEdge:

  • Run the insightedge demo command; the web notebook is started automatically at localhost:9090; or
  • Start the web notebook manually at any time by running zeppelin.sh (or zeppelin.cmd on Windows) from the <XAP HOME>/insightedge/zeppelin/bin directory.
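A quick sketch of the manual start, assuming a Unix shell and a standard InsightEdge installation layout:

    # Start Zeppelin manually (use zeppelin.cmd on Windows).
    cd <XAP HOME>/insightedge/zeppelin/bin
    ./zeppelin.sh
    # The notebook is then available at http://localhost:9090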

Once you have the notebook set up and ready for use, we recommend that you review the sample notes that come with the notebook, and use them as a template for your own notes.

There are several things that should be taken into account:

  1. The data grid model can be declared in a notebook using the %define interpreter, for example:
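The snippet below declares a hypothetical Product model via the %define interpreter binding mentioned above; the class and its fields are illustrative, so replace them with your own schema:

    %define
    // A hypothetical grid model class; replace with your own schema.
    case class Product(id: Long, description: String, quantity: Int, featuredProduct: Boolean)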

  2. You can load external JARs from the Spark interpreter settings, or with the z.load("/path/to.jar") command:
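For example, via Zeppelin’s %dep (dynamic dependency) interpreter, which must run before any Spark paragraph; the path is illustrative:

    %dep
    // Load an external JAR onto the Spark classpath; illustrative path.
    z.load("/path/to.jar")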

  3. You must load your dependencies before you start using the SparkContext (sc).

If you want to redefine the model or load another JAR after the SparkContext has already started, you must reload the Spark interpreter.

Figure 3: Data Visualization in the InsightEdge Web Notebook

Deploying MemoryXtend Off-Heap

The GigaSpaces MemoryXtend module provides optimized access to SSDs and storage-class memory, allowing a hybrid storage model where data is stored on multiple tiers according to the priority set by the application. With customized hot-data selection, you can take control and use RAM for the data you deem critical, while saving (evicting) the rest to multi-tiered data storage such as SSD and storage-class memory (3D XPoint).

In the MemoryXtend architecture, data is stored across two tiers: a space partition instance (managed JVM heap) and an embedded key/value store (the blob store). MemoryXtend comes with a built-in blob store cache. This cache is part of the space partition tier, and stores objects in their native form.

Figure 4: MemoryXtend Architecture

Using off-heap memory allows your cache to overcome lengthy JVM Garbage Collection (GC) pauses when working with large heap sizes, by caching data outside of the main Java heap but still in RAM.

Configuring MemoryXtend for off-heap RAM in Kubernetes is a two-step process:

  1. Create your pu.jar with MemoryXtend for off-heap RAM as described here: https://docs.gigaspaces.com/xap/14.0/admin/memoryxtend-ohr.html;
  2. Deploy using helm install with a resource URL pointing to the JAR created in step one, for example:
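A sketch of the deployment command; the chart name (gigaspaces/xap-pu), release name, and value keys (manager.name, resourceUrl) are assumptions based on the GigaSpaces Helm charts of that era (Helm 2 syntax), so check the chart’s values.yaml for the exact names:

    # Deploy the pu.jar from step 1; chart and value names are illustrative.
    helm install gigaspaces/xap-pu --name memoryxtend-demo \
        --set manager.name=demo,resourceUrl=http://<host>:<port>/path/to/my-pu.jar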

Proactive Monitoring With Prometheus

Now that you’ve completed the cloud-native Kubernetes deployment of InsightEdge, it’s time to make sure that all is well in production. But since Kubernetes is a distributed system, it is not always easy to troubleshoot.

That’s why a top-notch monitoring and alerting tool is required, and Prometheus, a CNCF project, is regarded by many as the go-to monitoring tool for Kubernetes in production.

To get started with Prometheus, deploy the Prometheus Operator on top of Kubernetes. You can then create a ServiceMonitor resource that tells Prometheus to scrape metrics from a defined set of pods.
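A minimal ServiceMonitor sketch; the metadata labels, selector labels, and port name here are assumptions, and must match the labels your Operator installation watches for and the Service that exposes your pods’ metrics:

    # Scrape Prometheus metrics from pods behind a matching Service.
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: insightedge-metrics
      labels:
        release: prom          # must match the Operator's serviceMonitorSelector
    spec:
      selector:
        matchLabels:
          app: insightedge     # illustrative; match your Service's labels
      endpoints:
        - port: metrics        # illustrative; the Service's named metrics port
          interval: 30s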

For the Prometheus installation, use the official prometheus-operator Helm chart, which comes with many options. Among other services, this chart installs Grafana and a set of exporters ready to monitor your cluster.
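For example (Helm 2 syntax, matching the stable repo of that era; under Helm 3 the chart now lives in the prometheus-community repo as kube-prometheus-stack):

    # Install the Prometheus Operator stack; the release name is illustrative.
    helm install stable/prometheus-operator --name prom --namespace monitoring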

Within a few minutes, the whole stack should be up and running.


Figure 5: A Node View in the Prometheus Grafana Dashboard

To learn more about how to achieve a seamless, automated, and cloud-native installation and deployment of InsightEdge, watch our on-demand Kubernetes webinar. 
