Sales forecast : Working with Outliers
Wether you run an online business, retail store chain or a SaaS product, one of the basic Key performance Indicators (KPI) is to track the overall New sales / New Business of a product in a given financial year or time period.
Established companies subject their sales data to scrutiny and do a lot of data analytics before coming up with targets for the next financial year. Most often targets are given from the leadership team. For those young cub sales folks and aspiring managers in start ups, the story is entirely different. The motto seems to be “Our target is to bring more new business than we did last year”
There’s just no harm in that but how successful can you be in chasing your target? I would like to share a simple statistical approach that will help in understanding how to determine your actual sales with improved accuracy.
CONCEPT
We should know that, when we plot our data( lets say sales data in this case), the resulting chart can be divided into four equal parts or Quartiles in increasing order.

First quartile (Q1/25th Percentile): the middle number between the smallest number (not the “minimum”) and the median of the dataset.
Median (Q2/50th Percentile): the middle value of the dataset.
Third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.
interquartile range (IQR): 25th to the 75th percentile.
We should note that any value is considered to be an Outlier if its value is above the maximum (Q3 + 1.5*IQR) or below the minimum(Q1 -1.5*IQR).
EXAMPLE:
Lets assume that you are a SaaS startup and selling a product online. You are launching a paid version of your product at 5 USD per user/month. Now lets assume that there were 10 new customers in the first week as shown below,

We summarize the overall sales for the entire week to be 240 USD. We naturally expect the revenue to be the same for the upcoming week. We project the same revenue to be our target. That’s perfect.
But lets look at our data again, we see that one customer has brought 14 user licenses which is out of the ordinary while a large number of our customers purchased between 2 and 6 user licences.

This particular customer who purchased 14 users is an example of an outlier. When we treat this data, we should not consider this particular customer to be normal. We should instead replace this value with the median value of the whole dateset (where Median is the middle or commonly used value in the data set) to get a more accurate prediction.
The below figure shows how to calculate Median in a spreadsheet. You can call the function = MEDIAN and select the data range for which the median needs to be calculated.

Once the median is calculated. You need to replace the outlier with Median and determine New revenue.

In our case, we realize that the new revenue predicted for the upcoming week is 190 USD instead of 240 USD. This gives some fresh perspective.
Real time data:
In real time, no data set is going to be so easy to work on. There will be thousands of transactions every day and its impossible to just look at a data set and hunt outliers. In such cases, we can use some tools to get the job done. I use Python to get the job done. Consider the yearly sales data of an online retails store with thousands of entries.

The easiest way to check the presence of outliers is to visualize the data. There are many ways to visualize data. I choose a Boxplot (which is based on the concept of Quartiles explained above)

All data points that lie outside the Whiskers of a boxplot (denoted by two boundary lines) in both directions are outliers. In this case, by adding up the ‘Total Net Sales’ column we can calculate the overall sales

We see that the overall sales/ New revenue equals 333644 USD in the given data set. If we were to calculate our targets based on this data for the next year, we should first treat this data for outliers.
Python deploys effective codes to treat outliers and replace them with median or mode. Once the outlier treatment is done, the resulting box plot can be seen symmetric and follow normal distribution.

We can now calculate the overall sales with some degree of accuracy in the upcoming year.

The actual sales for the upcoming year can be around 241000 USD as opposed to 333500 USD in the first place. We can use this data to improvise our sales action plan.
CONCLUSION :
A good sales manager will be able to predict New business more accurately by understanding / knowing how to treat outliers and use Median values of the data set using suitable tools. Hope this was helpful.
Happy selling!
Note : I have consciously avoided Jargon, deeper exploration of concepts and python codes to keep the reading light & simple. Outlier treatment is the tip of the iceberg. We can do logistic or linear regression models, time series analysis & build decision trees for more sharper insights and forecasts.Watch this space. I will be writing more on sales forecast techniques in the future)