• Analysed historical monthly sales data & created multiple time series forecasting models for 2 different wine products, Sparkling wine & Rose wine.
• Recommended SARIMA model as the optimum forecasting model for Sparkling wine & Triple Exponential Smoothing model for Rose wine, to predict monthly sales for the next 12 months along with appropriate lower and upper confidence limits.
For this assignment, the sales data of two different wines from the 20th century is to be analysed. Both series come from the same company but cover different wines. As analysts at ABC Estate Wines, we are tasked with analysing and forecasting wine sales in the 20th century.
NOTE: Here, for illustration purposes I have shared the detailed report for the first product i.e. 'Sparkling Wine' only. The analysis for 'Rose wine' can be found in the ipynb file attached.
Since we have time series data with only a single variable, we are limited to univariate analysis.
As we can see, there are no missing values in the dataset. There is only one variable, named ‘Sparkling’, and it is of integer datatype.
Looking at the spread of the data, i.e. the min, 25th, 50th and 75th percentiles, and comparing them to the max value, we can clearly see skewness and the presence of outliers in the data. But since this is time series data, the high values may simply correspond to peak seasons, which is clear from the time series plot displayed above. So we are not going to treat these outliers, as they carry a strong seasonality component; this will become clearer after seeing the decomposition plot.
• As we can see there are a few outliers across all the years. These are probably the months which had peak sales. To confirm this we will have a look at the monthly boxplots.
• Also, the sales in 1995 appear quite low compared to other years, but the reason behind this is purely a data artefact and has nothing to do with the business aspect.
• The dataset contains sales data only up to July 1995, so when compared across all years, 1995 has only 7 months of data while the other years have 12.
• Looking at the average units sold in a year, represented by the green triangle, we can see that there is very little change across the different years.
• There is a clear distinction in ‘Wine Sales’ between the different months, consistent across the years.
• For the first 6 months of each year, the number of units sold stays within a more or less fixed range, i.e. between 1000 and 2500 units, across all years.
• But starting in July, sales increase drastically, reaching more than 7000 units in December.
• Here we can clearly see that sales in November and December are a cut above the rest of the months, with December being the highest-selling month throughout the years.
• The absence of data for August to December of 1995 is also clearly visible here.
• Even so, these cannot be considered missing values, as data collection itself ended here, with the last entry being July 1995; everything after this point falls under forecasting.
• Here it becomes clear that 60% of the total sales fall in the range of 1000 to 2000 units.
• The next 20% of sales fall in the range of 2000 to 3000 units, while the remaining 20% is spread over the span of 3000 to 7000 units.
• These observations are in tune with the observations obtained from the previous plots.
In additive decomposition, as we can see, the residuals are located around 0 from the plot of the residuals in the decomposition.
In multiplicative decomposition, as we can see, the residuals are located around 1 from the plot of the residuals in the decomposition.
Comparing the two decomposition plots, it seems that for this dataset, Multiplicative decomposition is a better fit as residuals are more concentrated around 1 in this plot as compared to around 0 in the additive plot.
By looking at the trend line, we can see that in the initial years the sales remained more or less stable around the 2300 mark, but around 1987 sales got a boost, peaking at the 2800 mark in 1988, and then consolidated around the 2500 to 2600 mark.
Also looking at the seasonal plot, it is evident that there is a 12 month seasonality.
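The decomposition comparison above can be reproduced along the following lines. This is a minimal sketch, assuming the series is loaded into a DataFrame df with a monthly DatetimeIndex and a column named 'Sparkling' (the variable names are illustrative assumptions, not the exact notebook code):

```python
# Minimal sketch: compare additive and multiplicative decomposition.
# Assumes `df` has a monthly DatetimeIndex and a 'Sparkling' column.
from statsmodels.tsa.seasonal import seasonal_decompose

add_decomp = seasonal_decompose(df['Sparkling'], model='additive')
mul_decomp = seasonal_decompose(df['Sparkling'], model='multiplicative')

add_decomp.plot()  # residuals should scatter around 0 if the additive form fits
mul_decomp.plot()  # residuals should scatter around 1 if the multiplicative form fits
```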
- Linear Regression
- Naïve Forecast
- Simple Average
- All the exponential smoothing models (along with additive & multiplicative errors)
- ARIMA
- SARIMA
Each of the models listed above is built on the training data and its performance is evaluated on the test data using RMSE.
Building Models:
We know that linear regression needs at least one independent variable X to explain the target variable Y, but here we have only one variable available, the wine sales column 'Sparkling', since we converted the date-time variable into the index.
So we have defined an independent variable and assigned it to a new column ‘time’, separately for both the train and test data.
Now that our training and test data have been modified, let us go ahead and use linear regression to build the model on the training data, and then test it by forecasting on the test data.
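This is a minimal sketch of the setup, assuming `train` and `test` are DataFrames holding the 'Sparkling' column (the names are illustrative assumptions):

```python
# Minimal sketch: regress sales on a simple integer time index.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

train = train.copy()
test = test.copy()
train['time'] = np.arange(1, len(train) + 1)                          # 1, 2, ..., n
test['time'] = np.arange(len(train) + 1, len(train) + len(test) + 1)  # continues the count

lr = LinearRegression().fit(train[['time']], train['Sparkling'])
lr_preds = lr.predict(test[['time']])
lr_rmse = np.sqrt(mean_squared_error(test['Sparkling'], lr_preds))
```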
• As we can see from the forecast plot, linear regression is not a good model for this test data as the data itself is not linear to begin with.
In the naive model, the prediction for tomorrow is the same as today's value, and the prediction for the day after tomorrow is the same as tomorrow's.
Since the prediction for tomorrow equals today's value, the prediction for the day after tomorrow must also be today's value.
Following this logic, we end up with a single value as the forecast for the entire test period: the last observation before the test data begins, i.e. the sales value for 1990-12-01, which is 6047.
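A minimal sketch of this forecast, under the same `train`/`test` assumptions as before:

```python
# Minimal sketch: the naive forecast repeats the last training observation
# (here 6047, the value for 1990-12-01) across the whole test window.
import numpy as np

naive_preds = np.full(len(test), train['Sparkling'].iloc[-1])
naive_rmse = np.sqrt(np.mean((test['Sparkling'].values - naive_preds) ** 2))
```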
• Since the assumption behind the naive forecast is quite simplistic and rarely holds in real-world scenarios, it is no surprise that this model is not a good fit for the test data.
• Comparing the two, the linear regression model still performed better than the naive model as per the test RMSE value.
In the simple average method, we forecast by simply using the average of the training values (see the sketch after the points below).
• Like the previous models, this one is also not a good fit, since it relies on the average of all the training values, which is too simple a rule for a real-world dataset.
• But among the models built thus far, the simple average model performs slightly better than the other two as per the test RMSE value.
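A minimal sketch, under the same assumptions:

```python
# Minimal sketch: the simple average forecast uses the mean of all training
# values as the prediction for every test period.
import numpy as np

avg_preds = np.full(len(test), train['Sparkling'].mean())
avg_rmse = np.sqrt(np.mean((test['Sparkling'].values - avg_preds) ** 2))
```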
This model uses a weighted average of past observations to make predictions. It is the most basic form of exponential smoothing, and its forecast stays flat over the horizon. It does not do well in the presence of a trend (see the sketch after the points below).
• Since it is a simple form of smoothing and our data clearly has a trend, it is not a good fit.
• Compared to the other models, it performed just below the simple average model as per the test RMSE value.
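A minimal sketch of this model with statsmodels, under the same assumptions:

```python
# Minimal sketch: Simple Exponential Smoothing; statsmodels estimates the
# smoothing parameter alpha when fit() is called without fixed values.
from statsmodels.tsa.api import SimpleExpSmoothing

ses = SimpleExpSmoothing(train['Sparkling']).fit()
ses_preds = ses.forecast(steps=len(test))  # a flat forecast, as described above
```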
This model is an extension of SES (Simple Exponential Smoothing) known as the Double Exponential model, which estimates two smoothing parameters, thus taking care of the drawback of SES (a sketch follows the points below).
• Two separate components are considered: Level and Trend.
• Level is the local mean.
• One smoothing parameter, α, corresponds to the level series.
• A second smoothing parameter, β, corresponds to the trend series.
• But this model also has a drawback: it does not do well when the data has seasonality.
• Since our data has quite a prominent seasonality, this is not a good fit.
• Even though this model is supposed to perform better than simple exponential smoothing, it has not done so in our case. This is probably because the data does not have a well-defined linear trend, which makes prediction difficult when trend is taken into account.
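A minimal sketch of double exponential smoothing with statsmodels, under the same assumptions:

```python
# Minimal sketch: Holt's method estimates both the level (alpha) and
# trend (beta) smoothing parameters.
from statsmodels.tsa.api import Holt

des = Holt(train['Sparkling']).fit()
des_preds = des.forecast(steps=len(test))  # a straight-line forecast along the trend
```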
• In addition to alpha and beta, a new parameter called gamma is added here, which takes care of seasonality, thus making this form of smoothing better than the other two (see the sketch below).
• Here we have taken the triple smoothing model with additive errors, but from the decomposition plot we saw that our dataset is more likely to have multiplicative errors, which we will see in the next model.
• Here we can see that the RMSE score improved drastically compared to the other models, as the triple smoothing model takes care of seasonality as well as trend.
• So far, this is the best performing model.
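A minimal sketch with statsmodels; mapping the report's "additive errors" to additive trend and seasonal components is an assumption about the original notebook:

```python
# Minimal sketch: Holt-Winters triple exponential smoothing with additive
# trend and 12-month additive seasonality.
from statsmodels.tsa.api import ExponentialSmoothing

tes_add = ExponentialSmoothing(train['Sparkling'], trend='add',
                               seasonal='add', seasonal_periods=12).fit()
tes_add_preds = tes_add.forecast(steps=len(test))
```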
As discussed above, triple exponential smoothing can also be applied with multiplicative errors, which we will now see.
• This model has also performed quite well compared to the other models, except for the previous one, i.e. the triple exponential smoothing model with additive errors.
• However, we expected this model to perform better instead, since the errors (residuals) in the dataset are more likely multiplicative than additive.
• Therefore we will apply this model once more, using the brute force method to determine the best values for the parameters alpha, beta and gamma.
Model 8: Using the brute force method, i.e. the best parameters (alpha, beta and gamma), for the Holt-Winters method with multiplicative errors
The brute force method for parameter determination involves systematically trying out different parameter values within a specified range and evaluating the performance of the model for each combination of parameters.
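A minimal sketch of such a search; the 0.1-step grid, the multiplicative-seasonal setting, and a recent statsmodels version (for the smoothing_trend/smoothing_seasonal keywords) are assumptions:

```python
# Minimal sketch: brute-force grid search over alpha, beta and gamma,
# keeping the combination with the lowest test RMSE.
import numpy as np
from itertools import product
from statsmodels.tsa.api import ExponentialSmoothing

best_params, best_rmse = None, np.inf
for a, b, g in product(np.arange(0.1, 1.0, 0.1), repeat=3):
    fit = ExponentialSmoothing(train['Sparkling'], trend='add', seasonal='mul',
                               seasonal_periods=12).fit(smoothing_level=a,
                                                        smoothing_trend=b,
                                                        smoothing_seasonal=g)
    rmse = np.sqrt(np.mean((test['Sparkling'].values
                            - fit.forecast(len(test)).values) ** 2))
    if rmse < best_rmse:
        best_params, best_rmse = (a, b, g), rmse
```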
The values so obtained are alpha = 0.3, beta = 0.3, gamma = 0.3, for a test RMSE of 361.397, as can be seen in the figure above.
• As we can see, after applying the brute force method to find the best parameters, triple smoothing with multiplicative errors performed better than the model with additive errors, thus becoming the best performing model so far.
• But this raises the question of whether the model with additive errors could also improve with the brute force method, perhaps even surpassing the current multiplicative model.
• That can happen, but not with this dataset, as we have already discussed that it contains multiplicative errors. The evidence that the brute force method does not improve the additive-error model for this particular dataset is in the attached Python file, where that model is also built.
NOTE: We are checking for stationarity in the data, as it is a prerequisite for ARIMA & SARIMA models that the data be stationary.
• Null Hypothesis, H0 : Time Series is non-stationary.
• Alternate Hypothesis, Ha : Time Series is stationary.
If the p-value < 0.05 (the alpha level), the null hypothesis that the time series is non-stationary is rejected;
if not, we fail to reject the null hypothesis, i.e. the time series data is treated as non-stationary.
Here the p-value is 0.6, which is clearly > 0.05, so the null hypothesis cannot be rejected: the time series data is not stationary.
After differencing once (d=1), the p-value is 0, which is clearly < 0.05, so the null hypothesis is rejected, i.e. the differenced data is stationary.
Thus we have achieved stationarity.
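A minimal sketch of the check, assuming the full series sits in df['Sparkling']:

```python
# Minimal sketch: ADF test on the original and the first-differenced series.
from statsmodels.tsa.stattools import adfuller

p_value = adfuller(df['Sparkling'])[1]                 # ~0.60 here: non-stationary
p_diff = adfuller(df['Sparkling'].diff().dropna())[1]  # ~0.00: stationary at d=1
```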
Here parameters were selected using the lowest 'AIC' (Akaike Information Criteria) on the training data and the models were evaluated on the test data using RMSE.
• ARIMA stands for AutoRegressive Integrated Moving Average.
• As the name suggests it combines autoregression (AR), differencing (I), and moving average (MA) components. ARIMA models are capable of capturing various patterns and dependencies in time series data.
• The basic ARIMA model does not explicitly capture seasonality. It is more suitable for capturing trends and dependencies in stationary time series data.
• However, ARIMA can be extended to incorporate seasonality by introducing seasonal components, resulting in a model called SARIMA (Seasonal ARIMA), which we will look into next.
• ARIMA has 3 parameters:
1. p = non-seasonal AR order
2. d = non-seasonal differencing
3. q = non-seasonal MA order
• Since ARIMA does not capture seasonality, and our data has a strong seasonality component, this model is bound to perform poorly, which is exactly what happened.
• SARIMA, also known as Seasonal ARIMA, is an extension of ARIMA in which 4 new parameters are introduced to capture the seasonality component, bringing the total number of parameters in the SARIMA model to 7.
• 7 parameters of SARIMA:
1. p = non-seasonal AR order
2. d = non-seasonal differencing
3. q = non-seasonal MA order
4. P = seasonal AR order
5. D = seasonal differencing
6. Q = seasonal MA order
7. S = time span of repeating seasonal pattern
The best parameters thus obtained for a lowest AIC value of 1382.347 are: p=1, d=1, q=2 and P=0, D=1, Q=2 and S=12.
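A minimal sketch of this AIC-driven search; the exact grid ranges are assumptions about the original notebook:

```python
# Minimal sketch: fit SARIMAX over a small (p,d,q)x(P,D,Q,12) grid and keep
# the fit with the lowest AIC. This is 324 fits, so it can take a while.
import itertools
import statsmodels.api as sm

best_fit, best_aic = None, float('inf')
for p, d, q in itertools.product(range(3), range(2), range(3)):
    for P, D, Q in itertools.product(range(3), range(2), range(3)):
        try:
            res = sm.tsa.statespace.SARIMAX(train['Sparkling'], order=(p, d, q),
                                            seasonal_order=(P, D, Q, 12)).fit(disp=False)
            if res.aic < best_aic:
                best_fit, best_aic = res, res.aic
        except Exception:
            continue  # skip combinations that fail to converge
```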
Here, ‘mean’ column gives the actual forecast values and the ‘mean_ci_lower’ and ‘mean_ci_upper’ columns give the range of confidence interval.
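A minimal sketch of producing this table, assuming `best_fit` is the fitted SARIMAX model from the search above:

```python
# Minimal sketch: forecast the test window with 95% confidence limits.
forecast = best_fit.get_forecast(steps=len(test))
table = forecast.summary_frame(alpha=0.05)  # mean, mean_ci_lower, mean_ci_upper
best_fit.plot_diagnostics(figsize=(12, 8))  # the four residual plots discussed below
```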
The 4 residual diagnostic plots above tell us the following:
• Standardized residuals plot: The top left plot shows the 1-step-ahead standardized residuals. If the model is working correctly, no pattern should be obvious in the residuals; as we can see, no such pattern is visible here.
• Histogram plus estimated density plot: This plot shows the distribution of the residuals. The orange line shows a smoothed version of this histogram, and the green line shows a normal distribution. If the model is good, these two lines should coincide. Here we see only small differences between them, which indicates that our model is doing well enough.
• Normal Q-Q plot: The Q-Q plot compares the distribution of the residuals to the normal distribution. If the residuals are normally distributed, all the points should lie along the red line, except perhaps for some values at the ends, which is exactly what happens here.
• Correlogram plot: The correlogram is the ACF plot of the residuals rather than of the data. 95% of the correlations for lag > 0 should not be significant (i.e. they should fall within the blue shading). Significant correlation in the residuals would mean there is information in the data that the model did not capture; here too, our model is working as expected.
• Since SARIMA captures seasonality, and our data has a strong seasonality component, as expected this model is the second best fit, falling behind only triple exponential smoothing.
6. Created a data frame with all the models built, along with their corresponding parameters and the respective RMSE values on the test data.
7. Based on the model-building exercise, built the most optimum model(s) on the complete data and predicted 12 months into the future with appropriate confidence intervals/bands.
As we can see, the test RMSE values for the top three models are pretty close, and the top two are both triple exponential smoothing models. Therefore we will build two final models: one triple exponential smoothing model and one SARIMA model, the third best performer.
Final Model 1: Using the brute force method for the best parameters (alpha, beta and gamma) for the Holt-Winters method with multiplicative errors
The values so obtained are alpha = 0.3, beta = 0.3, gamma = 0.4, for an RMSE of 347.614, as can be seen in the figure above.
Plotting the complete data along with the forecasted values using the brute-force alpha, beta and gamma, and computing the RMSE value:
• Here we can see that the calculated RMSE value is pretty low, although strictly speaking it cannot be compared with the other models built so far, as the training set has now changed to the full data and the forecast window has changed to 12 months.
• But what we can do is compare the two final models based on their RMSE values to see which one has performed the best among the two.
The best parameters thus obtained for a lowest AIC value of 2184.006 are: p=0, d=1, q=2 and P=0, D=1, Q=2 & S=12.
Here, ‘mean’ column gives the actual forecast values and the ‘mean_ci_lower’ and ‘mean_ci_upper’ columns give the range of confidence interval.
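A minimal sketch of this final step, using the parameters reported above:

```python
# Minimal sketch: refit SARIMA(0,1,2)(0,1,2,12) on the complete series and
# forecast 12 months ahead with confidence bands.
import statsmodels.api as sm

final = sm.tsa.statespace.SARIMAX(df['Sparkling'], order=(0, 1, 2),
                                  seasonal_order=(0, 1, 2, 12)).fit(disp=False)
future = final.get_forecast(steps=12).summary_frame(alpha=0.05)
final.plot_diagnostics(figsize=(12, 8))  # the residual plots discussed below
```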
As already discussed, the 4 residual diagnostic plots above show that the model is behaving as it should:
• Standardized residuals plot: No definite pattern is visible here.
• Histogram plus estimated density plot: We see only small differences between the two lines, indicating that our model is doing well enough.
• Normal Q-Q plot: The distribution of the residuals is normal, as all the points lie along the red line, except for some values at the ends.
• Correlogram plot: All the correlations for lag > 0 are within the blue shading, meaning they are not significant; there is no information left in the data that the model failed to capture. Here too, our model is working as expected.
• Based on the test RMSE with our predefined training and testing sets, the SARIMA model was only the third best performing model.
• But when trained on the complete original data and forecasting the next 12 months, SARIMA performed even better than the brute-force triple exponential smoothing model, and by a large margin, given the difference in the RMSE values of the two final models.
Hence, our final best performing model for forecasting 12 months into the future is SARIMA Model.
9. Commenting on the most optimum model thus built and reporting the findings and suggestions that the company should be taking for future sales.
• Coming to the model predictions, we have already seen in detail how the different models performed on the dataset provided.
• Overall Triple exponential smoothing using brute force method with multiplicative errors performed well along with SARIMA when these models were built on the train set and forecasting was done on the test set.
• But when we trained the same models on the complete dataset and forecasted 12 months into the future, SARIMA performed the better of the two.
• This suggests that when it comes to forecasting on unseen data, SARIMA performs better than the triple exponential smoothing models.
• The first point to notice is that the trend line is not linear: it first decreased, then increased slightly, then rose abruptly, dipped slightly again, and finally levelled off.
• Another important observation is that sales are really low in the first 6 months of every year, gradually increasing and peaking in November and December.
• This can be used as a baseline by the production and logistics team to avoid wastage as well as shortage of supplies.
• This also gives the company an opportunity to offer extra services or discounts to customers in the first 6 months to lift sales; the strong sales in the second half of the year make it clear that there is room for growth.
• Finally, as sales are quite high in November and especially December, the company can capitalise on this by putting forward lucrative offers.