M5 Walmart Sales Forecasting

Keshu Sharma
11 min read · Oct 16, 2020

LinkedIn Profile, GitHub Repository of Code

In this blog I discuss forecasting Walmart sales across US states for the next 28 days. The blog is organized as follows:

1. Sales Forecasting & Its Importance

2. Machine Learning View of the Problem & Structure of the Data

3. Traditional Approaches

4. Machine Learning Pipeline (Preprocessing, EDA, Feature Engineering, Modeling)

5. Comparison of ML Models Used

6. Future Work

7. References

Sales forecasting is an important area of business management in which we build a system to estimate future sales volumes. It helps businesses avoid panic selling by manufacturing products according to future customer demand, thereby maximizing profit, and it informs virtually every aspect of running a business. It is similar to weather forecasting in that both rely on science and historical data. But while a wrong weather forecast may leave you carrying an umbrella on a sunny day, an inaccurate business forecast can cause real or opportunity losses.

Why Sales Forecasting?

In simple words, sales forecasting is just predicting the future sales volume of a product, but in the real world it carries much more importance.

Some of the reasons for this, as discussed in [1], are:

  1. Helps sales representatives meet their targets: In any business, sales representatives must take several decisions to achieve their sales targets. Forecasting helps them make such decisions based on predicted future sales.
  2. Improves and speeds up product delivery: In most cases, while shopping for a product, customers look at the delivery time a company quotes. Knowing the forecast, the company can manufacture goods in advance so products are ready to ship as soon as a customer orders.

There are several more benefits of sales forecasting, but let us now focus on the main problem.

Machine Learning Problem

In this blog we discuss the Kaggle competition M5 Forecasting - Accuracy. The competition provides sales data for Walmart stores in 3 states (California, Texas, Wisconsin) across 3 categories of items (HOBBIES, FOODS, HOUSEHOLD) from 2011 to 2016. We want to use this data to predict sales for the next 28 days using several ML techniques.

ABOUT DATA

The organizers provide sales data of products from 2011 to 2016 in the form of 4 data frames:

  1. calendar.csv: contains information about the dates on which the products are sold (day ids, events, SNAP days).
  2. sales_train_validation.csv: contains the historical daily unit sales of each product per store [d_1-d_1913] (used for training in our case).
  3. sell_prices.csv: contains information about the price of the products sold, per store and week.
  4. sales_train_evaluation.csv: includes sales [d_1-d_1941] (we use sales from d_1914-d_1941 as the test set).

We will also predict sales for d_1942-d_1969 for the private score on Kaggle.

STRUCTURE OF DATA

structure of data

The structure of the data is shown in the image on the left. There are 3 states: California, Texas and Wisconsin. California has 4 stores; Texas and Wisconsin have 3 stores each. Each store carries 3 categories of items: Hobbies, Foods, Household. The Foods category has 3 departments, while Hobbies and Household have 2 departments each.

EVALUATION METRICS

Here we use Weighted Root Mean Squared Scaled Error (WRMSSE) to evaluate model performance. To compute WRMSSE, the organizers construct 42,840 time series from this data. More details on WRMSSE are given in the notebooks here.

Traditional Solutions

There are several existing statistical solutions for sales forecasting, like ARIMA, moving averages, exponential smoothing, etc. To know more about them, click here. The problem with these methods is that they don't take categorical variables into account and tend to work well only for short horizons. So in this blog we discuss several machine learning models and see how they perform on this problem.

PIPELINE FOR SALES FORECASTING

Basic Pipeline used

The figure shows a basic overview of the pipeline followed while working on this Kaggle competition.

Note: while constructing the time-series-related features, we first split the train data into train and CV sets and only then computed them, so the CV period never leaks into the features.

First-cut approach: I focused on a direct modeling strategy rather than recursive modeling. The reason is that in recursive modeling, even a small error in the first predictions can compound into a huge error in later predictions.

DATA READING AND PREPROCESSING

Source: kdnuggets.com

Data cleaning is an important step in every modeling strategy. The presence of incorrect data may cause our model to behave incorrectly, and there are various sources of bad data: computation errors, human errors, etc. Here event_name_1, event_name_2, event_type_1 and event_type_2 contain NaN values, which we replace with no_events. We also reduce the memory usage of all categorical columns like item_id, cat_id, store_id, dept_id, year, event_name_1, event_name_2, event_type_1 and event_type_2. Here I have shown this only for 2 features, item_id and dept_id.

Similarly, we applied label encoders to all other categorical features present in the data. We also combined snap_CA, snap_TX and snap_WI into a single column named snap.
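As a rough sketch of the preprocessing steps above (the toy frame below is made up for illustration; column names follow the competition files):

```python
import pandas as pd

# Toy stand-in for the merged M5 data; only the columns needed here.
df = pd.DataFrame({
    "item_id": ["HOBBIES_1_001", "FOODS_3_090", "HOBBIES_1_001"],
    "dept_id": ["HOBBIES_1", "FOODS_3", "HOBBIES_1"],
    "event_name_1": [None, "SuperBowl", None],
    "snap_CA": [1, 0, 1],
    "snap_TX": [0, 0, 1],
    "snap_WI": [0, 1, 0],
    "state_id": ["CA", "TX", "WI"],
})

# Replace missing event values with an explicit "no_events" level.
df["event_name_1"] = df["event_name_1"].fillna("no_events")

# Reduce memory: label-encode categoricals down to small integer codes
# (shown here for item_id and dept_id, plus the event column).
for col in ["item_id", "dept_id", "event_name_1"]:
    df[col] = df[col].astype("category").cat.codes.astype("int16")

# Combine the three snap_* flags into one column: each row keeps the
# snap flag of its own state.
df["snap"] = df.apply(lambda r: r["snap_" + r["state_id"]], axis=1)
df = df.drop(columns=["snap_CA", "snap_TX", "snap_WI"])
```

Downcasting to small integer codes matters here because the melted M5 frame has tens of millions of rows, so object columns dominate memory.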

EXPLORATORY DATA ANALYSIS

Exploratory data analysis is a crucial part of any analysis. It gives us an overview of the data and helps us find patterns in it. We performed EDA on this Walmart sales data using two main Python libraries, Seaborn and Matplotlib, and plotted the charts shown below.

Sales Variation According to Categories of Items

In the plot shown, we have plotted the average sales of products by category. From it we find that the average sales of FOODS items are much higher than the other two categories, while HOBBIES items have the lowest average sales of the three.

Sales Variation According to States where Items are Sold

The bar graph shows how average sales vary across the three states California, Texas and Wisconsin. From this plot we see that average sales in California are the highest, though the variation among the three is not large.

Sales Variation According to Stores where Items are Sold

The bar plot shows how the average sales of items vary by the store in which they are sold. From it we see that the store with store_id CA_3 has the highest average sales and CA_4 the lowest.

Average Sales of Products according to day of week

The bar plot shows how sales vary by day of the week. Note that here 1 corresponds to Saturday, 2 to Sunday, 3 to Monday, and so on. From this plot we see that average sales are higher on weekends (Saturday and Sunday).

Sales For Last 30 days in train Data For Each State

The 3 graphs below show the sales over the last 30 days of different products in Wisconsin, California and Texas.

Last 30 days sales of FOODS_3_827 in all stores of Wisconsin
Last 30 days sales of HOBBIES_1_008 in all stores of California
Last 30 days sales of HOUSEHOLD_1_526 in all stores of Texas

Time Series for Total Sales in all 3 States

The time series shows the trend of sales in each state. From it we see that sales in California are always higher than in the other two states.

Time Series for Total Sales of 3 Category of Items

The time series shows the trend of sales for each item category. From it we infer that Foods sales are always the highest and Hobbies sales the lowest.

FEATURE ENGINEERING

Feature engineering is a core part of any machine learning model. We did 2 types of feature engineering in this case study: date-related features and time-series-related features.

Date Related Features

  1. Week Number
  2. Season
  3. Year End
  4. Year Start
  5. Month Start
  6. Month End
  7. Quarter Start
  8. Quarter End

I got the idea of including these features from this link.

Note: the code for these features is quite straightforward; click here to see the implementation details.
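For illustration, a minimal pandas sketch of these date-related features, assuming a calendar frame with a date column (as calendar.csv provides). The season encoding here is one possible convention, not necessarily the author's:

```python
import pandas as pd

# Toy calendar frame standing in for calendar.csv's date column.
cal = pd.DataFrame({"date": pd.to_datetime(
    ["2016-01-01", "2016-03-31", "2016-06-15", "2016-12-31"])})

dt = cal["date"].dt
cal["week_number"] = dt.isocalendar().week.astype("int16")
# 0 = winter, 1 = spring, 2 = summer, 3 = autumn (one possible mapping)
cal["season"] = (dt.month % 12 // 3).astype("int8")
cal["is_year_start"] = dt.is_year_start.astype("int8")
cal["is_year_end"] = dt.is_year_end.astype("int8")
cal["is_month_start"] = dt.is_month_start.astype("int8")
cal["is_month_end"] = dt.is_month_end.astype("int8")
cal["is_quarter_start"] = dt.is_quarter_start.astype("int8")
cal["is_quarter_end"] = dt.is_quarter_end.astype("int8")
```

These boolean flags let tree models pick up calendar effects (e.g. month-end shopping spikes) without learning them from raw dates.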

Time Series Related Features

For the time-series-related features, we created rolling mean/std features (window sizes of 7, 14, 30, 60 and 360 days) with a shift of 28 days. We take a shift of 28 days because we have to predict sales for the last 28 days, and shifting by 28 days ensures the sales of the test (unknown) period are never used in a feature.

I also created lag features for 28, 35, 42, 49, 56, 63, 70, 77, 84, 91 and 96 days, as well as an exponentially weighted average with a shift of 28 days.

Code snippets for the same are given as follows:
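A minimal sketch of this rolling/lag/EWM construction for a single series (in the real data this is done per id with a groupby; the window and lag lists are truncated here for brevity, and the EWM alpha is an illustrative assumption):

```python
import pandas as pd

# Stand-in daily sales series for one item.
sales = pd.Series(range(100), name="sales", dtype="float64")

feats = pd.DataFrame({"sales": sales})
# Shift by 28 first so no feature ever sees the 28-day test horizon.
shifted = sales.shift(28)

for w in [7, 14, 30, 60]:  # the post also uses a 360-day window
    feats[f"rmean_28_{w}"] = shifted.rolling(w).mean()
    feats[f"rstd_28_{w}"] = shifted.rolling(w).std()

for lag in [28, 35, 42, 49]:  # ... up to 96 in the post
    feats[f"lag_{lag}"] = sales.shift(lag)

# Exponentially weighted mean, also computed on the 28-day-shifted series.
feats["ewm_28"] = shifted.ewm(alpha=0.1).mean()
```

Because every feature is built from `sales.shift(28)` or deeper lags, all of them remain computable for the 28 unknown days being predicted.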

MODELING

After feature engineering, I trained various models and compared their performance on the test and CV data as well as their Kaggle results. I tried the following models:

  1. LSTM Neural Network
  2. CNN-LSTM Neural Network
  3. Linear Regression
  4. AdaBoost Regressor
  5. LGBM Model
  6. CatBoost Model

1. LSTM NN

In this model we use all our features except event_type_1 and event_type_2 and the date-related features. The data is passed into the model as follows:

  1. Categorical features: create an embedding layer for each categorical variable, then pass it through LSTM layers.
  2. Numerical features: pass these features together through one LSTM unit.

After getting the outputs of all LSTM units, I concatenated them and passed the result through several dense layers.

While training this model, I used Mean Squared Error as the loss and Nadam (click here to know more about Nadam) as the optimizer.

The Structure of model is given as follows:

structure of LSTM Neural Network

Code:-
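The architecture described above can be sketched in Keras roughly as follows. This is an illustrative sketch, not the author's exact network: the vocabulary sizes, timestep count, embedding dimensions and unit counts are all assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

TIMESTEPS, N_NUM = 28, 10            # assumed window length / numeric feature count
cat_vocab = {"item_id": 3049, "store_id": 10, "dept_id": 7}

cat_inputs, branches = [], []
for name, vocab in cat_vocab.items():
    inp = layers.Input(shape=(TIMESTEPS,), name=name)
    emb = layers.Embedding(vocab, 8)(inp)   # one embedding per categorical
    branches.append(layers.LSTM(16)(emb))   # then its own LSTM
    cat_inputs.append(inp)

# All numerical features share a single LSTM branch.
num_inp = layers.Input(shape=(TIMESTEPS, N_NUM), name="numerical")
branches.append(layers.LSTM(32)(num_inp))

# Concatenate every branch and finish with dense layers.
x = layers.Concatenate()(branches)
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(1)(x)                    # predicted sales

model = keras.Model(cat_inputs + [num_inp], out)
model.compile(loss="mse", optimizer="nadam")
```

Embeddings let high-cardinality ids like item_id enter the network as dense vectors instead of enormous one-hot inputs.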

Kaggle Result:-

Kaggle result of LSTM Model

Variation of Actual & Predicted sales:-

The Figure shows how this model performs on test data. It shows variation between actual & predicted sales of id FOODS_3_090_CA_3_validation.

2. CNN-LSTM Neural Network

This model is similar to the previous LSTM model; we only add Conv1D layers before the LSTM layers. Otherwise it is the same, with the same Nadam optimizer and Mean Squared Error loss.

The Structure of this model is given as follows :-

CNN-LSTM Model structure

Code:-
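A minimal sketch of the modification relative to the plain LSTM network: a Conv1D block in front of each LSTM branch. Input shapes, filter counts and kernel size are illustrative assumptions, not the author's exact settings.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def conv_lstm_branch(x):
    # Conv1D extracts local temporal patterns before the LSTM summarizes them.
    # "causal" padding keeps each timestep from seeing future values.
    x = layers.Conv1D(32, kernel_size=3, padding="causal", activation="relu")(x)
    return layers.LSTM(16)(x)

inp = layers.Input(shape=(28, 10))          # assumed (timesteps, features)
out = layers.Dense(1)(conv_lstm_branch(inp))

model = keras.Model(inp, out)
model.compile(loss="mse", optimizer="nadam")
```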

Kaggle Result:-

Kaggle Result of CNN-LSTM Model

Variation of Actual & Predicted sales:-

The Figure shows how this model performs on test data. It shows variation between actual & predicted sales of id FOODS_3_090_CA_3_validation.

3. Linear Regression

We also used a Linear Regression model. Here I did not use the rolling features due to memory and computation constraints. Linear regression is a very simple model that finds a linear relationship between the dependent variable and the independent variables.

In this case study I created a sparse matrix from all categorical and numerical features (except the rolling-window features), trained the model on it, and then predicted on the CV, test (Kaggle public score) and final_test (Kaggle private score) sets.

Code:-
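A minimal sketch of the sparse-matrix setup on toy data: one-hot encode the categoricals, stack them with the numericals into one sparse matrix, and fit plain LinearRegression. The data and dimensions are made up for illustration.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
cat = rng.integers(0, 5, size=(200, 2))   # e.g. label-encoded store_id, dept_id
num = rng.normal(size=(200, 3))           # e.g. price and calendar features
y = rng.poisson(3, size=200)              # unit sales are non-negative counts

# One-hot the categoricals (sparse by default) and stack with numericals.
X = hstack([OneHotEncoder().fit_transform(cat), csr_matrix(num)]).tocsr()

model = LinearRegression().fit(X, y)      # accepts sparse input directly
pred = model.predict(X)
```

The sparse representation is what makes a one-hot design matrix with thousands of item-level columns fit in memory.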

Kaggle Result:-

Kaggle Result of Linear Regression Model

Variation of Actual & Predicted sales:-

The Figure shows how this model performs on test data. It shows variation between actual & predicted sales of id FOODS_3_090_CA_3_validation.

4. AdaBoost Regressor

I also trained an AdaBoost Regressor on the same data used for the Linear Regression model. It is an ensemble model in which a base regressor is trained on the full data, and then copies of the regressor are trained on the same data with sample weights adjusted according to the errors of the previous models.

Here we trained this model with quite small parameter values due to memory constraints.
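A minimal sketch of such a memory-constrained AdaBoost setup on toy data (the small parameter values below are illustrative, not the author's exact choices):

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                 # stand-in feature matrix
y = rng.poisson(3, size=300).astype(float)    # stand-in sales counts

# Few, shallow boosting rounds keep memory and training time low.
model = AdaBoostRegressor(n_estimators=20, learning_rate=0.1, random_state=0)
model.fit(X, y)
pred = model.predict(X)
```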

Kaggle Result:-

Kaggle Result of Adaboost Model

Variation of Actual & Predicted sales:-

The Figure shows how this model performs on test data. It shows variation between actual & predicted sales of id FOODS_3_090_CA_3_validation.

5. Light GBM Model

We also used LightGBM in this case study, and it gives the best results of every model we tried. LightGBM is a gradient boosting framework that uses tree-based learning algorithms. The main difference between LightGBM and other boosting algorithms is in how the trees grow: while other algorithms grow level- (depth-) wise, LightGBM grows leaf-wise. For more detail, click here [7].

Here we used tweedie as the objective function to train the model and Root Mean Squared Error (RMSE) as the evaluation metric.

Code:-

Kaggle Result:-

Kaggle Result of LGBM Model

Variation of Actual & Predicted sales:-

The Figure shows how this model performs on test data. It shows variation between actual & predicted sales of id FOODS_3_090_CA_3_validation.

6. CatBoost Model

Here I trained the model store-wise due to memory constraints. With proper computational power and memory, CatBoost works very well, but lacking GPUs we trained it store-wise and with quite small parameter values. The main reason for choosing a tree-based method was the presence of many large categorical features.

Here too I used Root Mean Squared Error as the loss for training the model.

Code:-

Kaggle Result:-

Kaggle Result of Catboost Model

Variation of Actual & Predicted sales:-

The Figure shows how this model performs on test data. It shows variation between actual & predicted sales of id FOODS_3_090_CA_3_validation.

COMPARISON OF MODELS

Comparison of all models

FUTURE WORK

There are various methods left untouched in this case study, like ensembling the recursive and direct modeling strategies, and the Facebook Prophet model. I did not try these techniques due to lack of computational power and memory. We could also have tried more dense and LSTM nodes in our deep learning models to improve results, and there is some scope to use a transfer-learning approach in the neural networks.

REFERENCES

[1]. https://www.bluepiit.com/blog/importance-sales-forecasting-businesses/#:~:text=While%20comparing%20products%20or%20brands,estimating%20customer%20demand%20in%20advance.

[2]. https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/163684

[3]. https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/164374

[4]. https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/163578

[5]. https://www.kaggle.com/kyakovlev/m5-simple-fe

[6]. https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/164599

[7]. https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc

[8]. https://catboost.ai/docs/concepts/python-quickstart.html

[9]. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html


Keshu Sharma

Keshu is interested in exploring machine learning and current research in these areas.