A Look to the Future with Timeseries – Part 1

It’s Winter 19 release time! And that means we get some new awesome features in our production environment! One of the things I am excited about is the new timeseries SAQL function. Why? Well, if you look at trends when it comes to big data it is all about machine learning and AI. Einstein Analytics haven’t really had any way to look at predictions – that’s where Einstein Discovery comes in. With timeseries we can actually start predicting what is going to happen based on patterns in our existing data. Let’s have a look at an example.

Prepping your data

You can, of course, use any existing datasets that you have or create a new one with your data from Salesforce. However, I decided to look elsewhere for some data that I could use, so I turned to Kaggle and found some data on avocado prices, cause who does not love avocado? Anyway if you want to do the same as me you can find the data here. We are not going to use all the measures and dimensions and frankly, I don’t know how accurate this dataset is, but it doesn’t matter. Let’s have some fun.

Make sure you download the dataset, unzip it and upload it to Einstein Analytics. Notice that the UI has changed a little bit.

I am sure you are a smart cookie so I won’t go through how you upload the csv file, but once it’s uploaded hit “Explore” in the top right corner.

Let’s look to the future

The very first thing we want to do is to choose the measure average of AveragePrice and group by Date (Year-Month).

Now it’s time to the fun part, we will move away from chart mode and into SAQL mode. So click the SAQL icon in the top right corner. At any point you can always refer to the timeseries documentation here, there are more parameters available than I will be using in my example.

Alright, if you are new to SAQL this may look a little scary, but it really isn’t. All it’s doing is what you told it to load the avocado dataset, group by the date field and showing the average of AveragePrice. Yes, there is also an order and limit. Anyway, your query should look something like this:

q = load "avocado";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = order q by 'Date_Year~~~Date_Month' asc;
q = limit q 2000;

The first thing we want to add is a new ‘foreach’ statement after our grouping (line 2). It’s almost a copy of the existing ‘foreach’ statement (line 3), however, we are not joining our date together.

q = load "avocado";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year', 'Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = order q by 'Date_Year~~~Date_Month' asc;
q = limit q 2000;

Now when I group by date there may be months where we do not have any data, in order to avoid these gaps we can use the new fill function to fill in the missing dates. Please feel free to look at the documentation here.

The ‘fill’ statement will be added after our new ‘foreach’ statement. For this to work make sure if you have grouped by Year-Month then you use the same combination in the ‘fill’ statement and not switch to let’s say Year-Quarter. The syntax is highlighted below.

q = load "avocado";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year', 'Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = order q by 'Date_Year~~~Date_Month' asc;
q = limit q 2000;

With the gaps filled we have now come to adding in our ‘timeseries’ statement – yay! So what we need to do is after our new fill statement, make room for the timeseries by pressing enter. The syntax is to first call the timeseries but then define all the parameters that we want to use, not all are mandatory to use to please refer to the documentation for that – I will be using ‘length’ and ‘dateCols’. ‘dateCols’ is the time period and ‘length’ is how many of those time periods do I want to predict. And of course the measure you see before all the parameters is the measure that is being predicted.

q = load "avocado";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year', 'Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = order q by 'Date_Year~~~Date_Month' asc;
q = limit q 2000;

So we got this far. One more thing to do before we hit run is to make sure the last ‘foreach’ statement has included our new field ‘Forecasted_Avg_Avocado_Price’, also we don’t need to call the aggregation again. So we just want to delete the aggregation and add the new measure to the last part of the ‘foreach’ statement.

q = load "avocado";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year', 'Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month','avg_AveragePrice', 'Forecasted_Avg_Avocado_Price';
q = order q by 'Date_Year~~~Date_Month' asc;
q = limit q 2000;

Now you can hit run. Notice you have your aggregation of the data from the data in the ‘avg_AveragePrice’ and the 12 months in the future has a blank value. However, these are filled in the new ‘Forecasted_Avg_Avocado_Price’.

This is not the end, we can do more. Maybe we want to see how the prediction interval is looking like. We can add that to our ‘timeseries’ and last ‘foreach’ statement. The parameter is ‘predictionInterval’ and we can make it between 80 and 95, which will be the confidence for the prediction. Let’s try to add it to our query in the ‘timeseries’ statement as well as our ‘foreach’ statement, so it’s being projected.

q = load "avocado";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year', 'Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"), predictionInterval=95);
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', 'avg_AveragePrice', 'Forecasted_Avg_Avocado_Price', 'Forecasted_Avg_Avocado_Price_high_95', 'Forecasted_Avg_Avocado_Price_low_95';
q = order q by 'Date_Year~~~Date_Month' asc;
q = limit q 2000;

Notice that in my projection (the ‘foreach’ statement) I am calling my predicted measure again but adding ‘_high_95’ and ‘_low_95’. This is just the syntax for these fields. If I had called my measure ‘Price’ and I had used 80 in my prediction interval then it would have been ‘Price_high_80’ and ‘Price_low_80’. Once you have the new parameter added and the two extra fields projected feel free to run the query. You will now see the high and low values as well.

If I switch over to a timeline chart, it would look something like this – if you made the same changes to the chart of course.

We are not done yet. Let’s clean up the chart a little bit, for instance, why not use ‘avg_AveragePrice’ when we have data and ‘Forecasted_Avg_Avocado_Price’ when we don’t have any actuals? Also, the naming could be a little better. So switch back to SAQL mode.

We will be using the ‘coalesce’ function to take the avg_AveragePrice when we have data and when we don’t when that column is blank, we will take the predicted value instead. The syntax for ‘coalesce’ is simple:

coalesce('avg_AveragePrice', 'Forecasted_Avg_Avocado_Price') as 'Avg Price'

The first part before the comma is the default value and the value after the comma is the value we will be using if the first turned out to be null. Now the latter “as ‘Avg Price'” is the new name for our column.

We are also renaming the high and the low forecast the be more meaningful as you can see below.

q = load "avocado";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year', 'Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"), predictionInterval=95);
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', coalesce('avg_AveragePrice', 'Forecasted_Avg_Avocado_Price') as 'Avg Price', 'Forecasted_Avg_Avocado_Price_high_95' as 'High Forecast', 'Forecasted_Avg_Avocado_Price_low_95' as 'Low Forecast';
q = order q by 'Date_Year~~~Date_Month' asc;
q = limit q 2000;

You can now hit run and switch back to your chart mode and you will have a chart that is similar to mine – I have made a few more adjustments to the formatting of the chart.

If you want to do more with the new ‘timeseries’ statement, then have a look at the second part of this blog series.


Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.