What Should We Do with Extreme Values?
Every time we ask for a numerical response such as income, expenditure, or consumption and allow people to enter the value freely, we are going to end up with extreme values. Imagine a scenario in which we asked people about their household's daily milk consumption in liters. Most of the answers would be around one liter per day. But then we get a value of 15 liters per day. Is it possible? Maybe the family has many children, or perhaps they make craft cheese. The other option is that the person omitted the decimal point and really meant 1.5. And what if we got a value of 150 liters per day? Should we report that the average household consumes 7.5 liters per day just because of a couple of extremes? That wouldn't be smart. Such awkward situations are frequent, and they don't happen only when studying people. The conditions under which technical devices operate can also change in ways that produce extreme readings. Basically, whatever your data source is, an extreme value is bound to appear at some point. The question is: what do we do with it? In this post, you will learn several ways to detect extreme values and deal with them.
Key metrics: What do I need?
Any question that requires an open-ended numerical response will most likely generate some outlying values. Basically, any numerical measure, no matter what the source is, is vulnerable to outliers.
The main concepts: What should I know?
The standard deviation approach
A common approach to detecting extreme values is to calculate the standard deviation (SD) of the results and then flag all values that fall outside ±3SD as outliers. When the sample size is relatively small (n < 1000), we can use a less strict criterion of ±2.5SD. However, this approach presumes that the data is approximately bell-shaped (normally distributed), a fact that analysts commonly neglect when they apply it indiscriminately.
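To make this concrete, here is a minimal sketch of the SD rule in Python; the milk-consumption data is invented for the sake of the example:

```python
import numpy as np

rng = np.random.default_rng(0)
liters = np.append(rng.normal(1.0, 0.3, 200), 15.0)  # 200 typical answers plus one extreme

mean, sd = liters.mean(), liters.std(ddof=1)  # ddof=1 gives the sample SD
k = 3  # switch to 2.5 for smaller samples (n < 1000)
flagged = liters[np.abs(liters - mean) > k * sd]
print(flagged)  # the 15-liter answer is flagged
```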
IQR
If the distribution of the results drastically deviates from the bell curve,[1] you can use another approach based on the interquartile range (IQR). To find the IQR, sort your data and locate the value below which 25% of the results fall (Q1, the first quartile) and the value below which 75% of the results fall (Q3, the third quartile); the IQR is Q3 − Q1. If you have a result higher than Q3 + ((Q3-Q1)*3) or lower than Q1 – ((Q3-Q1)*3), you can say that you have a clear outlier. If you want to detect moderate-intensity outliers as well, multiply the IQR by 1.5 instead of 3.
[1] Nine out of ten analysts will likely use the eyeball method to determine whether the distribution deviates from normal, or, even worse, simply assume that it is normal and proceed as if it were. We advise you to be an outlier: use the Kolmogorov-Smirnov or Shapiro-Wilk test to assess the normality of your data's distribution.
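Both the normality tests and the IQR fences are easy to compute with scipy; here is a sketch reusing the same invented milk data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
liters = np.append(rng.normal(1.0, 0.3, 200), 15.0)  # same invented data as above

# Normality tests from the footnote: small p-values reject normality.
print(stats.shapiro(liters))
print(stats.kstest(liters, "norm", args=(liters.mean(), liters.std(ddof=1))))

# IQR fences: multiplying by 3 flags clear outliers, 1.5 flags moderate ones.
q1, q3 = np.percentile(liters, [25, 75])
iqr = q3 - q1
extreme = liters[(liters < q1 - 3 * iqr) | (liters > q3 + 3 * iqr)]
print(extreme)  # again, only the 15-liter answer
```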
Robust methods
The approaches above are meant to flag outliers, typically for deletion. However, instead of simply deleting outliers, you can also apply special metrics that are less vulnerable to them. We say that such measures are robust. For example, as alternatives to the mean, which is very sensitive to extreme values, we can use the median, the trimmed mean, the winsorized mean, or an M estimator.
Let’s start with the median. To calculate the median, sort your data and find the value below which 50% of the data lies. While the median is very robust, it does not describe the data set very well – it’s not very informative. The trimmed mean, on the other hand, summarizes the data better. It is computed by cutting off a certain percentage of the most extreme cases from both sides of the distribution. The usual trim value is 20%, which means that we cut off the 20% of cases with the highest values and the 20% with the lowest, and then calculate the mean from the remaining data points. Another similar robust measure is the winsorized mean. Winsorization resembles the trimming described above, but instead of removing the extreme cases, we replace the bottom unwanted percentage of cases with the lowest accepted value and the top unwanted percentage with the highest accepted value. As with the trimmed mean, the standard percentage of replaced cases is 20% at each end, but you can use a smaller or larger percentage.
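For illustration, here is how these robust measures could be computed with numpy and scipy, again on the invented milk data:

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(0)
liters = np.append(rng.normal(1.0, 0.3, 200), 15.0)  # same invented data as above

print(liters.mean())                                # plain mean, pulled up by the outlier
print(np.median(liters))                            # median
print(stats.trim_mean(liters, 0.2))                 # 20% trimmed mean
print(winsorize(liters, limits=(0.2, 0.2)).mean())  # 20% winsorized mean
```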
M estimators are a more mathematically advanced robust alternative to the mean. The most popular is Huber's M estimator, but Tukey's biweight (bisquare) estimator is also commonly used. Simply put, M estimators are based on minimizing a function of the distances between the data points and the central value. Their advantage is that they preserve more information about the data set while still being robust to outliers. Their disadvantage is conceptual complexity, which makes them very rare in business data analysis.
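To make the idea less abstract, here is a bare-bones didactic sketch of a Huber location estimator using iteratively reweighted means; the tuning constant c = 1.345 is the conventional choice, and production code would normally rely on a tested library implementation instead:

```python
import numpy as np

def huber_location(x, c=1.345, tol=1e-6, max_iter=100):
    """Didactic Huber M estimator of location via iteratively reweighted means."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)                            # robust starting point
    scale = np.median(np.abs(x - mu)) / 0.6745   # MAD-based scale estimate
    for _ in range(max_iter):
        z = np.abs(x - mu) / scale
        w = np.minimum(1.0, c / np.maximum(z, 1e-12))  # downweight distant points
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

rng = np.random.default_rng(0)
liters = np.append(rng.normal(1.0, 0.3, 200), 15.0)  # same invented data as above
print(huber_location(liters))  # close to 1, barely affected by the 15
```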
Visualizing distributions with Box plots
The most common way to visualize data with outliers is a box plot (or box-and-whiskers plot). Let’s take a look at the accompanying case-study dashboard below. The data set contains sales of different beverage brands[2] at various points of sale (PoS). Gray dots represent the sales volumes at each PoS, while the orange dots represent the volume at the currently selected PoS (you can use the controller in the top right corner to select a different PoS).
Let's focus on brand E. The midline of the blue box is the median, or the 50th percentile. The bottom part of the box (dark blue) represents the sales volumes of the PoSs between the 1st quartile (25th percentile) and the median. The top part of the box (light blue) shows the sales volumes of the PoSs between the median and the 3rd quartile (75th percentile). The lines extending below and above the box are called whiskers, and they encompass all PoSs whose sales volumes are not extreme. All PoSs that fall outside the whiskers are considered outliers. In this case, the bounds are determined by the following formulas:
Lower bound = Q1 – ((Q3-Q1)*1.5)
Upper bound = Q3 + ((Q3-Q1)*1.5)
[2] Brands are white-labeled with capital letters from A to G.
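If you want to produce a box plot like the one in the dashboard, here is a minimal matplotlib sketch; the brand labels and sales figures below are invented stand-ins, not the dashboard's actual data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
brands = list("ABCDEFG")
sales = [rng.lognormal(mean=3.0, sigma=0.4, size=60) for _ in brands]  # invented volumes

fig, ax = plt.subplots()
ax.boxplot(sales, whis=1.5)  # whis=1.5 applies the 1.5*IQR whisker rule
ax.set_xticklabels(brands)
ax.set_xlabel("Brand")
ax.set_ylabel("Sales volume")
plt.show()
```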
How to approach outlier detection in practice
We outlined several approaches to identifying and handling outliers, and we might have left you feeling confused about what to do. The strategy we propose is to calculate both the mean and some of the robust measures. If you get a big difference between them, interpret it as a red flag for the existence of extreme values. You should also use some kind of visual inspection, such as the box-and-whiskers plot. Always keep in mind the distribution of your data – don’t just assume that it is normally distributed.
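As a rough sketch of that strategy, the check below compares the mean with the median and raises a flag when they diverge; the 5% relative-difference threshold is an arbitrary assumption you should tune to your data:

```python
import numpy as np

def outlier_red_flag(values, threshold=0.05):
    """Flag a possible outlier problem when the mean and median diverge.

    The 5% relative-difference threshold is an arbitrary assumption.
    """
    values = np.asarray(values, dtype=float)
    mean, median = values.mean(), np.median(values)
    return abs(mean - median) > threshold * abs(median)

rng = np.random.default_rng(0)
liters = np.append(rng.normal(1.0, 0.3, 200), 15.0)  # same invented data as above
print(outlier_red_flag(liters))  # True: the mean sits well above the median
```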
All this might seem like tedious work, and it is. However, spending some time on outlier detection before sending out a report will save you from the much bigger headache that might otherwise come your way after the report has been released. Also, one final note – it’s good practice to decide on the approach you are going to use to handle outliers before you begin the actual analysis. Otherwise, it is easy to succumb to the temptation to modify the results according to your expectations.
Further considerations
In this post, we covered only univariate outliers, meaning we considered values on one metric at a time. However, outliers can also be defined by their position on multiple metrics simultaneously. In that case, identifying them can be tricky: multidimensional outliers do not have to be detectable on any of the measures individually, so the detection strategies outlined above can fail for such cases.
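As a tiny illustration, in the sketch below the appended point is unremarkable on each axis separately but clearly violates the joint pattern between the two metrics; Mahalanobis distance (a standard multivariate technique we have not covered in this post) exposes it:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.0, 0.5, 300)
y = x + rng.normal(0.0, 0.1, 300)                         # strongly correlated with x
data = np.vstack([np.column_stack([x, y]), [1.9, 0.1]])   # in range on both axes, jointly odd

mu = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
centered = data - mu
d = np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))
print(d.argmax())  # 300 -> the appended point, despite unremarkable x and y values
```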