What’s the benefit of flight delay prediction? For customers, it gives a more accurate expectation of flight time, allowing them to plan accordingly. For airlines, it shows where they can minimize flight delays, thereby reducing expenses and increasing customer satisfaction. Sounds good, right?

In this post we will show how you can use InsightEdge to do exactly that and achieve real-time flight delay predictions. 

We will create a solution based on a decision tree algorithm described by Carol McDonald in her MapR blog post.

InsightEdge Architecture

The following diagram shows the architecture with InsightEdge.

[Diagram: InsightEdge architecture]

To perform real-time predictions we will use Spark Streaming combined with Apache Kafka, which will simulate an endless, continuous data flow. For the prediction part, we will use Spark Machine Learning and a decision tree algorithm. Streamed data will be processed by the decision tree model, and the results will be saved into the InsightEdge data grid for future use.

Our solution consists of two parts (Spark jobs):

  • Model Training
  • Flight Delay Prediction

Let’s take a look at these two jobs in detail. All code and instructions can be found on GitHub.

‘Model Training’ Spark Job

The model training job is a one-time job that performs the initial model training and stores the model in the data grid, so that it can be used by the second job. In this post, we won’t go into too much detail about machine learning algorithms and decision tree model training. If you’d like to learn more, see Carol McDonald’s blog post mentioned earlier.

The first Spark job consists of three simple steps:

1. Load the data (the same data set used in Carol McDonald’s blog post), split it into training and testing parts, and save the testing part for use by the second job.

2. The second job converts flight data into LabeledPoint instances, so we need to store the integer representations of origin, destination and carrier in the data grid.

3. Train the model and save it to the data grid.
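
To make the first two steps more concrete, here is a minimal sketch in plain Python. It is not the Spark or InsightEdge code from the demo: the sample rows, the 75/25 split and the helper names `split_data` and `build_index` are all assumptions made for illustration, and the data grid is left out entirely.

```python
import random

# Hypothetical sample rows: (origin, destination, carrier, delay_minutes).
flights = [
    ("ATL", "LGA", "DL", 5.0),
    ("LGA", "ATL", "DL", 62.0),
    ("MIA", "ORD", "AA", 0.0),
    ("ORD", "MIA", "AA", 45.0),
    ("SFO", "JFK", "UA", 12.0),
    ("JFK", "SFO", "UA", 90.0),
    ("ATL", "MIA", "DL", 3.0),
    ("ORD", "SFO", "UA", 20.0),
]


def split_data(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and split it into (training, testing)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def build_index(values):
    """Map each distinct categorical value to a stable integer id."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


# Step 1: split the data; the testing part would be kept for the second job.
training, testing = split_data(flights)

# Step 2: integer representations of the categorical fields.
origin_index = build_index(row[0] for row in flights)
destination_index = build_index(row[1] for row in flights)
carrier_index = build_index(row[2] for row in flights)

print(len(training), len(testing))
print(carrier_index)
```

Sorting the distinct values before numbering them keeps the mapping stable between runs, which matters because the second job has to encode incoming flights with exactly the same ids the model was trained on.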

‘Flight Delay Prediction’ Spark Job

The second Spark job loads the model and mappings from the grid, reads data from the stream and uses the model for prediction. Predictions are stored in the grid along with the flight data.

The second Spark job also takes three easy steps:

1. Load the model and mappings from the data grid.

2. Open a Kafka stream and parse the lines with flight data.

3. For each batch of lines (an RDD), make predictions and save them to the data grid.
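
The per-batch logic of this job can be sketched in plain Python as follows. This is only an illustration under several assumptions: the comma-separated line layout, the 40-minute threshold for calling a flight delayed, the `toy_model` function standing in for the trained decision tree, and the list standing in for the data grid are all hypothetical. In the real job the input comes from Kafka and the features are wrapped in a LabeledPoint.

```python
from dataclasses import dataclass

# Assumed cut-off (in minutes) above which a flight counts as delayed.
DELAY_THRESHOLD_MINUTES = 40.0


@dataclass
class Flight:
    day: int
    carrier: str
    origin: str
    destination: str
    distance: float
    actual_delay_minutes: float


def parse_flight(line):
    """Parse one line in the assumed layout:
    day,carrier,origin,destination,distance,delay."""
    day, carrier, origin, destination, distance, delay = line.strip().split(",")
    return Flight(int(day), carrier, origin, destination,
                  float(distance), float(delay))


def to_features(flight, mappings):
    """Encode a flight as a numeric vector using the stored integer mappings."""
    return [
        float(flight.day),
        float(mappings["carrier"][flight.carrier]),
        float(mappings["origin"][flight.origin]),
        float(mappings["destination"][flight.destination]),
        flight.distance,
    ]


def process_batch(lines, model, mappings, grid):
    """Predict each flight in the batch and append the outcome to `grid`."""
    for line in lines:
        flight = parse_flight(line)
        predicted_delayed = model(to_features(flight, mappings))
        actually_delayed = flight.actual_delay_minutes > DELAY_THRESHOLD_MINUTES
        outcome = "correct" if predicted_delayed == actually_delayed else "incorrect"
        grid.append({"flight": flight, "prediction": outcome})


# Hypothetical mappings, as they might be loaded in step 1.
mappings = {
    "carrier": {"AA": 0, "DL": 1},
    "origin": {"ATL": 0, "MIA": 1},
    "destination": {"LGA": 0, "ORD": 1},
}


def toy_model(features):
    """Stand-in for the trained model: flags long flights as delayed."""
    return features[4] > 1000.0


grid = []  # a plain list standing in for the data grid
process_batch(["3,DL,ATL,LGA,762,55", "3,AA,MIA,ORD,1197,61"],
              toy_model, mappings, grid)
for record in grid:
    print(record["flight"].origin, record["flight"].destination, record["prediction"])
```

Storing whether the prediction was correct next to the flight itself is what later makes it easy to query the grid for the model’s hit rate.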

Running the Demo and Examining Results

To run the demo we need to perform the following steps:

  1. Start up InsightEdge
  2. Start up Kafka and create a topic
  3. Submit the Model Training job
  4. Submit the Flight Delay Prediction job
  5. Push the test data into the Kafka topic

You can find detailed instructions here to help you run the demo.

After all steps have been completed, we can examine what was stored in the data grid.

Open Zeppelin and import a notebook. The stored data contains the following fields:

  • Day – day of the month
  • Origin – origin airport
  • Destination – destination airport
  • Distance – distance between airports in miles
  • Carrier – airline company
  • Actual_delay_minutes – actual flight delay in minutes
  • Prediction – whether our model made a correct or incorrect prediction


Since we store the prediction results alongside the actual flight delays, we can see the ratio of correct to incorrect predictions.
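
As a rough illustration of that calculation, the sketch below computes the ratio over a handful of made-up records in plain Python. The record layout (a `prediction` field holding "correct" or "incorrect") is an assumption based on the field list above, and the values are invented; in the demo this would be a query against the data grid from the Zeppelin notebook.

```python
from collections import Counter


def prediction_ratio(records):
    """Return the share of correct and incorrect predictions."""
    counts = Counter(record["prediction"] for record in records)
    total = counts["correct"] + counts["incorrect"]
    if total == 0:
        return {"correct": 0.0, "incorrect": 0.0}
    return {
        "correct": counts["correct"] / total,
        "incorrect": counts["incorrect"] / total,
    }


# Invented sample of stored records.
stored = [
    {"actual_delay_minutes": 5.0, "prediction": "correct"},
    {"actual_delay_minutes": 62.0, "prediction": "incorrect"},
    {"actual_delay_minutes": 0.0, "prediction": "correct"},
    {"actual_delay_minutes": 48.0, "prediction": "correct"},
]

ratio = prediction_ratio(stored)
print(ratio)
```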


What’s Next?

In this post we built a simple, real-time prediction application using Spark ML combined with Spark Streaming on top of InsightEdge. We haven’t built the perfect solution just yet and there is always room to improve it. For example:

  • You may want to take a look at other ML algorithms or tune the existing one to get a better prediction rate.
  • Over time this model might become outdated. In order to keep it up to date we will need to come up with a model update strategy. There are two possible solutions you can use:
    • Incremental algorithms: A model built on such algorithms will update itself every time it encounters new data.
    • Periodic model retraining: Here the solution is to store incoming data, periodically retrain the model, and replace the existing model with the updated one.
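
The periodic retraining option could look roughly like the plain-Python sketch below. Everything here is hypothetical: the `PeriodicRetrainer` class, the record-count trigger and the `mean_delay_model` training function (which just averages delays) are placeholders for storing data in the grid and retraining a real decision tree on a schedule.

```python
class PeriodicRetrainer:
    """Keeps incoming records and swaps in a new model every N records."""

    def __init__(self, train_fn, retrain_every):
        self.train_fn = train_fn          # function: list of records -> model
        self.retrain_every = retrain_every
        self.history = []                 # all data seen so far
        self.pending = 0                  # records seen since the last retrain
        self.model = None                 # the model currently in use

    def observe(self, record):
        """Store a record; return True if this call replaced the model."""
        self.history.append(record)
        self.pending += 1
        if self.pending >= self.retrain_every:
            # Train on everything stored so far, then replace the old model.
            self.model = self.train_fn(self.history)
            self.pending = 0
            return True
        return False


def mean_delay_model(delays):
    """Placeholder 'training': the model is simply the average delay."""
    return sum(delays) / len(delays)


retrainer = PeriodicRetrainer(mean_delay_model, retrain_every=3)
for delay in [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]:
    if retrainer.observe(delay):
        print("model replaced:", retrainer.model)
```

A count-based trigger keeps the sketch short; a time-based schedule, or a trigger that fires when the correct prediction ratio drops, would follow the same store, retrain and replace pattern.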

For more information on InsightEdge and fast data analytics, sign up for our 451 Group Webinar or come meet us at one of our upcoming events. To learn more about InsightEdge, you can also check out our latest demo application for a taxi price surge use case, which runs real-time analytics on streaming geospatial data.

Flight Delay Prediction with InsightEdge Spark
Danylo Hurin