Guide: Analyzing Time Series for Sales Forecasting

Sales forecasting is the prediction of future sales levels from historical sales data collected over a specific timeframe. Business managers rely on these forecasts to anticipate upcoming trends. A time series is a set of data points (observations, values, or events) recorded at consistent intervals and arranged in chronological order. Time series analysis is central to sales forecasting because it examines historical sales data to reveal underlying trends. It also helps identify the factors that drive sales variations across time intervals, so businesses can understand why sales levels fluctuate.

In this blog, we describe how to develop accurate forecasting models that help a company optimize inventory management, staffing, and other operational decisions.

So, let’s get started.

Sales forecasting, in this guide, is the task of predicting future sales for a retail chain from historical sales data together with external factors such as promotions and holidays.

Initial Datasets:

train.csv – historical data including Sales

test.csv – historical data excluding Sales

store.csv – supplemental information about the stores

Data Fields:

  • Id – an Id that represents a (Store, Date) duple within the test set
  • Store – a unique Id for each store
  • Sales – the turnover for any given day (this is what you are predicting)
  • Customers – the number of customers on a given day
  • Open – an indicator for whether the store was open: 0 = closed, 1 = open
  • State Holiday – indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
  • School Holiday – indicates if the (Store, Date) was affected by the closure of public schools

  • Store Type – differentiates between 4 different store models: a, b, c, d
  • Assortment – describes an assortment level: a = basic, b = extra, c = extended
  • Competition Distance – distance in meters to the nearest competitor store
  • Competition Open Since [Month/Year] – gives the approximate year and month of the time the nearest competitor was opened

  • Promo – indicates whether a store is running a promo on that day
  • Promo2 – Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
  • Promo2 Since [Year/Week] – describes the year and calendar week when the store started participating in Promo2
  • Promo Interval – describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. “Feb, May, Aug, Nov” means each round starts in February, May, August, November of any given year for that store
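To make the Promo Interval field concrete, here is a minimal sketch of how such a string could be turned into a yes/no feature for a given date. The function name `promo2_round_starts` is hypothetical, and matching on the first three letters of each month name is an assumption made so that spellings such as "Sep" and "Sept" are both accepted.

```python
from datetime import date

# English month abbreviations, indexed by month number minus one.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def promo2_round_starts(promo_interval, day):
    """Return True if `day` falls in a month in which a Promo2 round starts anew."""
    if not promo_interval:
        # Empty or missing interval: the store is not taking part in Promo2.
        return False
    # Keep only the first three letters so "Sep" and "Sept" both match.
    start_months = {m.strip()[:3].lower() for m in promo_interval.split(",")}
    return MONTHS[day.month - 1] in start_months


print(promo2_round_starts("Feb, May, Aug, Nov", date(2014, 5, 10)))
print(promo2_round_starts("Feb, May, Aug, Nov", date(2014, 6, 10)))
```

The first call checks a date in May, which is one of the listed start months; the second checks a date in June, which is not.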

Total Rows:

Train dataset: 1017209 rows

Test dataset: 41088 rows

Store dataset: 1115 rows
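As a rough sketch, the files could be loaded and joined with pandas as shown below. The tiny inline CSV strings are made-up stand-ins for train.csv and store.csv so that the snippet runs on its own; the header names without spaces (for example `StateHoliday`, `StoreType`) and the helper name `load_sales` are assumptions, not something the files are guaranteed to match.

```python
import io

import pandas as pd

# Made-up stand-ins for train.csv and store.csv so the sketch is self-contained.
# With the real files, pass their paths instead of these buffers.
TRAIN_CSV = """Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
1,5,2015-07-31,5000,500,1,1,0,1
2,5,2015-07-31,6000,600,1,1,0,1
1,4,2015-07-30,4800,480,1,1,0,1
"""
STORE_CSV = """Store,StoreType,Assortment,CompetitionDistance,Promo2
1,c,a,1200,0
2,a,a,600,1
"""


def load_sales(train_src, store_src):
    """Read daily sales and store metadata, then join them on Store."""
    train = pd.read_csv(
        train_src,
        parse_dates=["Date"],
        dtype={"StateHoliday": str},  # this column mixes "0" with letters a/b/c
    )
    store = pd.read_csv(store_src)
    # A left join keeps every sales row even if store metadata were missing.
    merged = train.merge(store, on="Store", how="left")
    return merged.sort_values(["Store", "Date"]).reset_index(drop=True)


df = load_sales(io.StringIO(TRAIN_CSV), io.StringIO(STORE_CSV))
print(df.shape)
print(df[["Store", "Date", "Sales", "StoreType"]])
```

On the real data, `df.shape[0]` would be expected to match the train row count listed above, which is a quick way to confirm that the join did not duplicate or drop rows.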

Exploratory Data Analysis (EDA):

The training data contains over 1 million sales observations spanning approximately two years (2013-2015). This is time series data, i.e. an ordered sequence of values of a variable (here, Sales) recorded at equally spaced time intervals (for example daily, weekly, monthly, or yearly).

The above table gives us information about 1115 stores owned by a company.

Checking days when the stores were closed.

We can see from the above plot that stores were mostly closed on the 7th day of the week, i.e. Sunday, which is expected. On other days, they were closed because of a school holiday and/or a state holiday (a = public holiday, b = Easter holiday, c = Christmas, 0 = None).
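A check of this kind can be sketched with a few pandas operations. The rows below are made up for illustration, and the column names (`DayOfWeek`, `Open`, `StateHoliday`, `SchoolHoliday`) are assumed to match the CSV headers, with 1 = Monday and 7 = Sunday.

```python
import pandas as pd

# Illustrative rows only (not taken from train.csv).
df = pd.DataFrame({
    "DayOfWeek":     [7, 7, 7, 1, 4, 5, 6, 3],
    "Open":          [0, 0, 0, 0, 1, 1, 1, 0],
    "StateHoliday":  ["0", "0", "0", "a", "0", "0", "0", "0"],
    "SchoolHoliday": [0, 0, 0, 1, 0, 0, 0, 1],
})

closed = df[df["Open"] == 0]

# Number of closed (store, day) rows for each day of the week.
closed_by_day = closed["DayOfWeek"].value_counts().sort_index()
print(closed_by_day)

# Of the closures that are not on Sunday, how many coincide with a holiday flag?
weekday_closed = closed[closed["DayOfWeek"] != 7]
on_holiday = (weekday_closed["StateHoliday"] != "0") | (weekday_closed["SchoolHoliday"] == 1)
print(int(on_holiday.sum()), "of", len(weekday_closed), "non-Sunday closures fall on a holiday")
```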

Sales trend over the months

The above graph tells us that sales tend to spike in December, which makes sense because of the Christmas and holiday season. So, this confirms that the sales vary with the ‘Date’ (time) and there is a seasonality factor present in our data.
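A monthly view like this can be produced by grouping daily sales by calendar month. The sketch below uses made-up daily values (with December deliberately set higher) and assumes `Date` and `Sales` columns; it only illustrates the aggregation, not the real figures.

```python
import pandas as pd

# Made-up daily sales for three months; December is deliberately set higher.
dates = pd.date_range("2014-10-01", "2014-12-31", freq="D")
df = pd.DataFrame({"Date": dates, "Sales": [5000.0] * len(dates)})
df.loc[df["Date"].dt.month == 12, "Sales"] = 7000.0

# Average daily sales per calendar month (10 = October, 12 = December).
monthly = df.groupby(df["Date"].dt.month)["Sales"].mean()
print(monthly)
print("Month with the highest average sales:", int(monthly.idxmax()))
```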

Sales trend over days

The trend above shows that there are no promotions on weekends (Saturday and Sunday), which makes sense because stores can earn maximum profit at a time when people are running their household errands anyway. Sales tend to increase on Sunday because people shop during the weekend. The highest sales occur on Mondays when promotional offers are running.
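One way to get a day-by-promotion breakdown is a pivot table of average sales. The numbers below are invented for illustration, and the column names `DayOfWeek`, `Promo`, and `Sales` are assumed to match the CSV headers.

```python
import pandas as pd

# Invented rows for illustration (1 = Monday ... 7 = Sunday).
df = pd.DataFrame({
    "DayOfWeek": [1, 1, 1, 1, 5, 5, 6, 7],
    "Promo":     [1, 1, 0, 0, 1, 0, 0, 0],
    "Sales":     [9000, 9400, 6000, 6200, 7500, 6800, 5800, 8000],
})

# Rows: day of week. Columns: Promo flag (0 = no promo, 1 = promo). Cells: mean sales.
by_day_promo = df.pivot_table(index="DayOfWeek", columns="Promo",
                              values="Sales", aggfunc="mean")
print(by_day_promo)

# Days on which at least one promotion row exists.
promo_days = sorted(df.loc[df["Promo"] == 1, "DayOfWeek"].unique())
print("Days with promotions:", promo_days)
```

A missing value (NaN) in the promo column of the pivot table means no promotion was recorded on that day, which is how an absence of weekend promotions would show up.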

The following inferences were obtained from EDA on the sales data:

  1. Store Type a has the highest sales and the most customers.
  2. Customers tend to buy more on Mondays when there are ongoing promotional offers and on Thursdays/Fridays when there is no promotion at all.
  3. The second promotion (Promo2) does not seem to contribute to an increase in sales.

Time Series Analysis:

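Many classical time series forecasting models assume that the data is stationary, i.e. that it has a constant mean, a constant variance, and an autocovariance that depends only on the lag between observations and not on time itself. We therefore need to check this before modeling.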
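Written out, these conditions are the usual definition of weak (second-order) stationarity. The symbols below ($X_t$ for the sales value at time $t$, $\mu$, $\sigma^2$, and $\gamma(h)$) are notation introduced here for illustration.

```latex
\begin{align}
\mathbb{E}[X_t] &= \mu && \text{for all } t, \\
\operatorname{Var}(X_t) &= \sigma^2 < \infty && \text{for all } t, \\
\operatorname{Cov}(X_t, X_{t+h}) &= \gamma(h) && \text{for all } t \text{ and every lag } h.
\end{align}
```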

There are 2 ways to test the stationarity of time series:

  1. Rolling Mean: A rolling analysis of a time series is often used to assess its stability over time. A window is slid across the data one week at a time, and the average is taken over each weekly window. Rolling statistics are a visual test: we compare the original data with the rolled data and check whether the mean stays roughly constant, which is what we expect of stationary data.
  2. Dickey-Fuller test: This test provides statistics such as the p-value that tell us whether we can reject the null hypothesis. The null hypothesis is that the data is not stationary, and the alternative hypothesis is that the data is stationary. If the p-value is less than the chosen significance level (say 0.05), we reject the null hypothesis and conclude that the data is stationary.
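The rolling-statistics check can be sketched with pandas as follows. The series is synthetic (a flat level plus random noise, generated with a fixed seed), and the 7-day window is an assumption that corresponds to one week of daily observations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2014-01-01", periods=140, freq="D")
# Synthetic daily sales: a flat level plus noise (illustrative, not the real data).
sales = pd.Series(6000 + rng.normal(0, 300, size=len(dates)), index=dates)

window = 7  # one week of daily observations
rolling_mean = sales.rolling(window=window).mean()
rolling_std = sales.rolling(window=window).std()

summary = pd.DataFrame({
    "sales": sales,
    "rolling_mean": rolling_mean,
    "rolling_std": rolling_std,
})
# The first window - 1 rows are NaN because a full window is not yet available.
print(summary.dropna().head())
```

Plotting `summary` (for example with `summary.plot()`, which requires matplotlib) gives the visual comparison: for a stationary series the rolling mean and rolling standard deviation stay roughly flat.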

Results of Dickey-Fuller Test:

  • ADF Statistic: -6.218237
  • p-value: 0.000000
  • Critical Values: 1% = -3.4374778690219956, 5% = -2.864686684217556, 10% = -2.5684454926748583

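The plots and test results above indicate that the mean and variance do not change much over time, i.e. they are roughly constant. The ADF statistic (-6.218237) is below even the 1% critical value (about -3.437), and the p-value is effectively zero, so we reject the null hypothesis of non-stationarity. Thus, we do not need to apply any transformation (which would be needed if the time series were not stationary).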
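The decision rule can be spelled out in a few lines using the numbers from the Dickey-Fuller results reported above. The function name `rejected_levels` is hypothetical.

```python
# Numbers copied from the Dickey-Fuller results reported above.
adf_statistic = -6.218237
critical_values = {
    "1%": -3.4374778690219956,
    "5%": -2.864686684217556,
    "10%": -2.5684454926748583,
}


def rejected_levels(stat, crit):
    """Significance levels at which the non-stationarity null hypothesis is rejected."""
    # The test is one-sided: reject when the statistic is below
    # (more negative than) the critical value.
    return [level for level, value in crit.items() if stat < value]


print(rejected_levels(adf_statistic, critical_values))  # ['1%', '5%', '10%']
```

A statistic of, say, -3.0 would be rejected only at the 5% and 10% levels, which shows why the reported value of -6.218237 is strong evidence of stationarity.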

Conclusion

Time series analysis is therefore a useful tool for understanding sequential data, identifying trends, and producing precise forecasts. By applying appropriate forecasting methodology, incorporating advanced techniques, and leveraging the core components of time series, analysts can effectively forecast future trends, improve decision-making processes, and extract useful insights. Time series analysis helps researchers and businesses to fully utilize the potential of sequential data in a variety of sectors, including finance and economics.

Happy Reading!