Treatment of outliers
What is this about?
The concept of extreme values, much like other topics in machine learning, is not exclusive to this area. What is an outlier today may not be one tomorrow. The boundaries between normal and abnormal behavior are fuzzy; standing at the extremes, on the other hand, is easy.
What are we going to review in this chapter?
- What is an outlier? Philosophical and practical approaches
- Outliers by dimensionality and data type (numerical or categorical)
- How to detect outliers in R (bottom/top X%, Tukey and Hampel)
- Outliers preparation for profiling in R
- Outliers preparation for predictive modeling in R
The intuition behind outliers
For example, consider the following distribution:
## Loading ggplot2 to visualize the distribution
library(ggplot2)

## Creating a sample dataset
set.seed(31415)
df_1=data.frame(var=round(10000*rbeta(1000,0.15,2.5)))

## Plotting
ggplot(df_1, aes(var, fill=var)) + geom_histogram(bins=20) + theme_light()
The distribution is heavily skewed: most values concentrate near zero, with some outlier points on the right. We want to deal with them 😎. So the question arises: where do we set the thresholds for "extreme"? Based on intuition, it could be the highest 1%, or we could analyze how the mean changes after removing the top 1%.
Both cases could be right. In fact, taking another number as the threshold (i.e., 2% or 0.1%), may be right too. Let's visualize them:
## Calculating the 98th, 99th, and 99.9th percentiles
percentile_var=quantile(df_1$var, c(0.98, 0.99, 0.999), na.rm = T)
df_p=data.frame(value=percentile_var, percentile=c("a_98th", "b_99th", "c_99.9th"))

## Plotting the same distribution plus the percentiles
ggplot(df_1, aes(var)) + geom_histogram(bins=20) + geom_vline(data=df_p, aes(xintercept=value, colour = percentile), show.legend = TRUE, linetype="dashed") + theme_light()
To understand more about percentiles, please go to the Annex 1: The magic of percentiles chapter.
For now, we'll keep the top 1% (the 99th percentile) as the threshold, flagging all the points beyond it as outliers.
One interesting conceptual element arises here: when we define abnormal (or an anomaly), the normal concept emerges as its opposite.
This “normal” behavior is shown as the green area:
The hard thing to do is to determine where the normal and abnormal separate. There are several approaches to deal with this. We are going to review a few of them.
Where is the boundary between hot and cold weather?
Where can we put the threshold to indicate where the hot weather begins or, conversely, where does the cold weather end?
Near the Equator, probably a temperature around 10°C (50°F) is an extremely low value; however, in Antarctica, it's a beach day! ⛄ 🏖
👺: "Oh! But that is taking an extreme example with two different locations!"
No problem! Like a fractal, let's zoom into any city: the boundary where one starts (and the other ends) will not have one unique value that lets us state: "Ok, the hot weather starts at 25.5°C (78°F)."
However, it's quite easy to stand in the extremes, where the uncertainty decreases to almost zero, for example, when we consider a temperature of 60°C (140°F).
🤔: "Ok. But how are these concepts related to machine learning?"
We're exposing here the relativity that exists when considering a label (hot/cold) as a numeric variable (temperature). This can be considered for any other numeric, such as income and the labels “normal” and “abnormal.”
Understanding extreme values is one of the first tasks in exploratory data analysis; only then can we see what the normal values are. This is covered in the Profiling chapter.
There are several methods to flag values as outliers. Just as we might analyze the temperature, this flag is relative and all the methods can be right. The quickest method may be to treat the top and bottom X% as outliers.
More robust methods consider the distribution of the variable by using quantiles (Tukey's method) or the spread of the values around the median through the MAD, a robust analogue of the standard deviation (Hampel's method).
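To make the three approaches concrete, here is a minimal base-R sketch of each detection rule applied to a skewed variable. This is only an illustration of the underlying math; the `prep_outliers` function discussed later wraps these methods for us:

```r
# Sample skewed variable (same shape and seed as the chapter's df_1 example)
set.seed(31415)
x = round(10000 * rbeta(1000, 0.15, 2.5))

# 1) Bottom/top X%: flag the lowest and highest 1%
p = quantile(x, probs = c(0.01, 0.99), na.rm = TRUE)
out_pct = x < p[1] | x > p[2]

# 2) Tukey: flag values beyond Q1 - 3*IQR or Q3 + 3*IQR
q = quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
iqr = q[2] - q[1]
out_tukey = x < q[1] - 3 * iqr | x > q[2] + 3 * iqr

# 3) Hampel: flag values beyond median +/- 3*MAD
out_hampel = abs(x - median(x)) > 3 * mad(x)

# How many points does each rule flag?
c(pct = sum(out_pct), tukey = sum(out_tukey), hampel = sum(out_hampel))
```

The three rules flag very different numbers of points on the same data, which is exactly the relativity this section discusses.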
The definition of these boundaries is one of the most common tasks in machine learning. Why? When? Let's point out two examples:
Example 1: When we develop a predictive model that returns the probability of calling (or not) a certain client, we need to set a score threshold to assign the final label: "yes, call!"/"no call". More about this in the Scoring chapter.
Example 2: Another example is when we need to discretize a numerical variable because we need it as categorical. The boundaries in each bin/segment will affect the overall result. More info about it in Data Types chapter, section Discretizing numerical variables.
📌 Going back to the original issue (where does the cold weather end?), not all the questions need to have an answer: some of them just help us simply to think.
The impact of outliers
Some models, such as random forest and gradient boosting machines, tend to deal better with outliers; however, "noise" may affect the results anyway. The impact of outliers on these models is lower than on others, such as linear regression, logistic regression, k-means, and decision trees.
One aspect that contributes to this decreased impact is that both models create many sub-models. If one sub-model takes an outlier as information, other sub-models probably won't, so the errors cancel out. The balance comes from the plurality of voices.
Communicating results 🌍 📣
If we need to report the variables used in the model, we'll end up removing outliers so as not to show a histogram with only one bar and/or a biased mean.
It's better to show an unbiased number than to justify that the model will handle extreme values.
Types of outliers by data type
- Numerical 📏: Like the one we saw before:
- Categorical 📊: Having a variable in which the dispersion of categories is quite high (high cardinality): for example, postal code. More about dealing with outliers in categorical variables in the High Cardinality Variable in Descriptive Stats chapter.
##       var frequency percentage cumulative_perc
## 1  France       288      68.74           68.74
## 2   China        65      15.51           84.25
## 3 Uruguay        63      15.04           99.29
## 4    Peru         2       0.48           99.77
## 5 Vietnam         1       0.24          100.00
Peru and Vietnam are the outlier countries in this example, as their share in the data is less than 1% each.
Types of outliers by dimensionality
So far, we have observed univariate outliers, i.e., considering one variable at a time. We can also consider two or more variables at once.
For instance, we have the following dataset, df_hello_world, with two variables: v1 and v2. Doing the same analysis as before:
##          v1 frequency percentage cumulative_perc
## 1   Uruguay        80       59.7            59.7
## 2 Argentina        54       40.3           100.0
##      v2 frequency percentage cumulative_perc
## 1 cat_A        83      61.94           61.94
## 2 cat_B        51      38.06          100.00
##  "Variables processed: v1, v2"
No outlier so far, right?
Now we build a contingency table that tells us the distribution of both variables against each other:
##            v2
## v1          cat_A cat_B
##   Argentina 39.55  0.75
##   Uruguay   22.39 37.31
Oh 😱! The combination of Argentina and cat_B is really low (0.75%, less than 1%) compared with the other intersections, which are all above 22%.
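A contingency table like the one above can be built with base R's `table` and `prop.table`. The toy counts below are reverse-engineered from the percentages shown (134 rows total), just to illustrate the mechanics:

```r
# Toy data matching the percentages above (80 Uruguay + 54 Argentina rows)
v1 = c(rep("Uruguay", 80), rep("Argentina", 54))
v2 = c(rep("cat_A", 30), rep("cat_B", 50),   # Uruguay: 30 cat_A, 50 cat_B
       rep("cat_A", 53), rep("cat_B", 1))    # Argentina: 53 cat_A, 1 cat_B

# Cell percentages over the whole table
tab = round(100 * prop.table(table(v1, v2)), 2)
tab
```

The lonely Argentina/cat_B cell (a single row) stands out immediately as the 0.75% intersection.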
The last examples illustrate the different forms extreme values or outliers can take, and are presented as checks we should run on any new dataset.
We mention 1% as a possible threshold to flag a value as an outlier. This value could be 0.5% or 3%, depending on the case.
In addition, the presence of this kind of outlier may not pose a problem.
How to deal with outliers in R
The prep_outliers function, present in the funModeling package, can help us in this task. It can handle from one to 'N' variables at a time (by specifying the str_input parameter).
The core is as follows:
- It supports three different methods (method parameter) to consider a value as an outlier: bottom_top, Tukey, and Hampel.
- It works in two modes (type parameter): setting an NA value or stopping the variable at a particular value. Besides the explanation below, prep_outliers is a well-documented function: help("prep_outliers").
Step 1: How to detect outliers 🔎
The following methods are implemented in the prep_outliers function. They return different results, so the user can select the one that best fits her or his needs.
Bottom and top values method
This considers outliers based on the bottom and top X% of values, based on percentiles. Common values are 0.5%, 1%, 1.5%, and 3%, among others.
Setting the parameter top_percent to 0.01 will treat all values in the top 1% as outliers.
The same logic applies to the lowest values: setting the parameter bottom_percent to 0.01 will flag the lowest 1% of all values as outliers.
The internal function used is quantile; if we want to flag the bottom and top 1%, we type:
quantile(heart_disease$age, probs = c(0.01, 0.99), na.rm = T)
##  1% 99% 
##  35  71
All values for those aged less than 35 or more than 71 years will be considered outliers.
For more information about percentiles, check the chapter: The magic of percentiles.
Tukey's method

This method flags outliers considering the quartile values, Q1, Q2, and Q3, where Q1 is the 25th percentile, Q2 the 50th percentile (also known as the median), and Q3 the 75th percentile.
The IQR (Inter-quartile range) comes from Q3 − Q1.
- The bottom threshold is: Q1 − 3*IQR. All below are considered as outliers.
- The top threshold is: Q3 + 3*IQR. All above are considered as outliers.
The value 3 marks the "extreme" boundary. This method comes from the box plot, where the multiplier is 1.5 (not 3); using 1.5 causes many more values to be flagged, as shown in the next image.
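As a sketch, Tukey's boundaries can be computed in a few lines of base R. Here, tukey_bounds is a hypothetical helper (not part of funModeling), with the multiplier k as a parameter: 3 for the "extreme" boundary, 1.5 for the classical boxplot whiskers:

```r
# Tukey's rule: values beyond Q1 - k*IQR or Q3 + k*IQR are flagged
tukey_bounds = function(x, k = 3) {
  q = quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr = q[2] - q[1]
  c(bottom_threshold = q[1] - k * iqr, top_threshold = q[2] + k * iqr)
}

# With k = 1.5 (boxplot default) the band is much narrower than with k = 3
tukey_bounds(1:100, k = 1.5)
tukey_bounds(1:100, k = 3)
```

Note how the band widens as k grows, which is why the 1.5 multiplier flags many more points.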
The internal function used in prep_outliers to calculate Tukey's boundary can be accessed directly:

tukey_outlier(heart_disease$age)
## bottom_threshold    top_threshold 
##                9              100
It returns a two-value vector; thus, we have the bottom and the top thresholds: all below nine and all above 100 will be considered as outliers.
A subtle, step-by-step visual example can be found in the references.
Hampel's method

This method flags outliers using the median and the MAD (median absolute deviation):

- The bottom threshold is: median_value − 3*mad_value. All below are considered as outliers.
- The top threshold is: median_value + 3*mad_value. All above are considered as outliers.
The internal function used in prep_outliers to calculate Hampel's boundary can be accessed directly:

hampel_outlier(heart_disease$age)
## bottom_threshold    top_threshold 
##          29.3132          82.6868
It returns a two-value vector; thus, we have the bottom and the top thresholds. All below 29.31 and all above 82.68 will be considered as outliers.
It has one parameter named k_mad_value, and its default value is 3. k_mad_value can be changed, but not through the prep_outliers function for now.
The higher the k_mad_value, the wider the threshold boundaries will be.
hampel_outlier(heart_disease$age, k_mad_value = 6)
## bottom_threshold    top_threshold 
##           2.6264         109.3736
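For reference, the equivalent computation in base R would look like this. hampel_bounds is a hypothetical helper (funModeling's hampel_outlier does the real work); R's mad already includes the 1.4826 scaling constant that makes the MAD comparable to a standard deviation:

```r
# Hampel's rule: values beyond median +/- k * MAD are flagged
hampel_bounds = function(x, k = 3) {
  med = median(x, na.rm = TRUE)
  mad_v = mad(x, na.rm = TRUE)  # scaled MAD, robust analogue of the sd
  c(bottom_threshold = med - k * mad_v, top_threshold = med + k * mad_v)
}

# Raising k widens the band, as with k_mad_value above
hampel_bounds(1:100, k = 3)
hampel_bounds(1:100, k = 6)
```

Because the median and the MAD ignore the magnitude of the extremes, a handful of huge values barely moves these thresholds, which is what makes the method robust (and, on skewed data, severe).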
Step 2: What to do with the outliers? 🛠
We've already detected which points are the outliers. Therefore, the question now is: What to do with them? 🤔
There are two scenarios:
- Scenario 1: Prepare outliers for data profiling
- Scenario 2: Prepare outliers for predictive modeling
There is a third scenario in which we don't do anything with the spotted outliers. We just let them be.
We propose the prep_outliers function from the funModeling package to give a hand with this task.
Regardless of the function itself, the important point here is the underlying concept and the possibility of developing an improved method.
The prep_outliers function covers these two scenarios through the type parameter:
type = "set_na", for scenario 1
type = "stop", for scenario 2
Scenario 1: Prepare outliers for data profiling
The initial analysis:
In this case, all outliers are converted into NA; thus, applying most of the descriptive functions (max, min, mean, etc.) will return a less-biased indicator. Remember to set the na.rm=TRUE parameter in those functions; otherwise, the result will be NA.
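A quick base-R illustration of why na.rm=TRUE matters once outliers have been set to NA:

```r
# NA standing in for a removed outlier
x = c(5, 7, 9, NA)

mean(x)               # returns NA: the NA propagates
mean(x, na.rm = TRUE) # returns 7: the NA is ignored
```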
For example, let's consider the following variable (the one we saw at the beginning with some outliers):
## To understand all of these metrics, please go to the Profiling Data chapter:
## http://livebook.datascienceheroes.com/exploratory_data_analysis/profiling.html
profiling_num(df_1$var)
##   variable mean std_dev variation_coef p_01 p_05 p_25 p_50 p_75 p_95 p_99
## 1      var  548    1226            2.2    0    0    0   24  370 3382 5467
##   skewness kurtosis iqr     range_98    range_80
## 1      3.3       16 370 [0, 5467.33] [0, 1791.1]
Here we can see several indicators that give us some clues. The std_dev is really high compared with the mean, and this is reflected in the variation_coef. In addition, the kurtosis is high (16) and the p_99 is almost twice the p_95 value (5467 vs. 3382).
This task of looking at some numbers and visualizing the variable's distribution is like imagining a picture from what another person tells us: we convert the voice (a signal) into an image in our brain. 🗣 🙄 ... => 🏔
prep_outliers for data profiling
We need to set type="set_na". This implies that every point flagged as an outlier will be converted into NA.
We will use the three methods: Tukey, Hampel, and the bottom/top X%.
Using Tukey's method:
df_1$var_tukey=prep_outliers(df_1$var, type = "set_na", method = "tukey")
Now, we check how many NA values there are before (in the original variable) and after the transformation based on Tukey.
# before
df_status(df_1$var, print_results = F) %>% select(variable, q_na, p_na)
##   variable q_na p_na
## 1      var    0    0
# after
df_status(df_1$var_tukey, print_results = F) %>% select(variable, q_na, p_na)
##   variable q_na p_na
## 1      var  120   12
Before the transformation there were 0 NA values, whereas afterwards 120 values (around 12%) are spotted as outliers according to Tukey's test and replaced by NA.
We can compare the before and after:
profiling_num(df_1, print_results = F) %>% select(variable, mean, std_dev, variation_coef, kurtosis, range_98)
##    variable mean std_dev variation_coef kurtosis     range_98
## 1       var  548    1226            2.2     15.6 [0, 5467.33]
## 2 var_tukey  163     307            1.9      8.4 [0, 1358.46]
The mean decreased to almost a third of its original value, and all the other metrics decreased as well.
Let's see what happens with Hampel's method (method="hampel"):

df_1$var_hampel=prep_outliers(df_1$var, type = "set_na", method="hampel")
df_status(df_1, print_results = F) %>% select(variable, q_na, p_na)
##     variable q_na p_na
## 1        var    0    0
## 2  var_tukey  120   12
## 3 var_hampel  364   36
This last method is much more severe in spotting outliers, identifying 36% of the values. This is probably related to the variable being heavily skewed.
More info can be found in the references.
Bottom and top X% method
Finally, we can try the easiest method: to remove the top 2%.
df_1$var_top2=prep_outliers(df_1$var, type = "set_na", method="bottom_top", top_percent = 0.02)
Please note that the 2% value was arbitrarily chosen. Other values, like 3% or 0.5%, can be tried as well.
Time to compare all the methods!
Putting it all together
We'll pick a few indicators to make the quantitative comparison.
df_status(df_1, print_results = F) %>% select(variable, q_na, p_na)
##     variable q_na p_na
## 1        var    0    0
## 2  var_tukey  120   12
## 3 var_hampel  364   36
## 4   var_top2   20    2
prof_num=profiling_num(df_1, print_results = F) %>% select(variable, mean, std_dev, variation_coef, kurtosis, range_98)
prof_num
##     variable mean std_dev variation_coef kurtosis     range_98
## 1        var  548    1226            2.2     15.6 [0, 5467.33]
## 2  var_tukey  163     307            1.9      8.4 [0, 1358.46]
## 3 var_hampel   17      31            1.8      6.0   [0, 118.3]
## 4   var_top2  432     908            2.1     10.9 [0, 4364.29]
# First we need to convert the dataset into long format
df_1_m=reshape2::melt(df_1)
plotar(df_1_m, str_target= "variable", str_input = "value", plot_type = "boxplot")
When selecting the bottom/top X%, we will always have some values matching that condition, whereas with other methods this may not be the case.
Conclusions for dealing with outliers in data profiling
The idea is to modify the outliers as little as possible, for example, if we are interested only in describing the general behavior.
To accomplish this task, for example when creating an ad hoc report of the mean, we could choose the top 2% method: it only affects 2% of all values and lowers the mean the least drastically, from 548 to 432, around 21% less.
"To modify or not to modify the dataset, that is the question". William Shakespeare being a Data Scientist.
The Hampel method modified the mean too much, from 548 to 17! That is based on the default boundary used by this method, 3 MAD (a kind of robust standard deviation).
Please note that this demonstration doesn't mean that Hampel or Tukey are bad choices. In fact, they are more robust: the thresholds they produce can lie beyond the observed values, in which case no value is treated as an outlier.
At the other extreme, we can consider, for example, the age variable from the heart_disease data. Let's analyze its outliers:
# Getting outliers threshold
tukey_outlier(heart_disease$age)
## bottom_threshold    top_threshold 
##                9              100
# Getting min and max values
min(heart_disease$age)

##  29

max(heart_disease$age)

##  77
- The bottom threshold is 9, and the minimum value is 29.
- The top threshold is 100, and the maximum value is 77.

Thus, the age variable has no outliers according to Tukey's method. If we had used the bottom/top method instead, the chosen top and bottom X% of values would have been flagged as outliers regardless.
All the examples so far have taken one variable at a time; however, prep_outliers can handle several at the same time using the str_input parameter, as we will see in the next section. Everything we have seen up to here is equivalent; what changes is what we do once we detect the outlier, i.e., the imputation method.
Scenario 2: Prepare outliers for predictive modeling
The previous case results in spotted outliers being converted to NA values. This is a huge problem if we are building a machine learning model, as many of them don't work with NA values. More about dealing with missing data in the Analysis, Handling, and Imputation of Missing Data chapter.
To deal with outliers before using a predictive model, we can set the parameter type='stop' so all values flagged as outliers are converted to the threshold value.
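Under the hood, "stopping" a variable amounts to capping it at the detected thresholds. Here is a hedged base-R sketch of the idea using Tukey's boundaries; prep_outliers does this for us, and the pmin/pmax trick below is just an illustration:

```r
set.seed(10)
x = c(rnorm(100), 50, -50)              # two forced outliers

# Tukey thresholds: Q1/Q3 plus/minus 3*IQR
q = quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr = q[2] - q[1]
bottom = q[1] - 3 * iqr
top = q[2] + 3 * iqr

# "Stop" the variable at the thresholds: no NA is introduced
x_stopped = pmax(pmin(x, top), bottom)
range(x_stopped)
```

After capping, the minimum and maximum equal the thresholds themselves, and every row keeps a usable value for the model.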
Some things to keep in mind:
Try to think of variable treatment (and creation) as if you were explaining it to the model. By stopping a variable at a certain value, the 99th percentile for example, we are telling the model: "Hey model, please consider all extreme values as if they were at the 99th percentile; this value is already high enough. Thanks."
Some predictive models are more noise tolerant than others. We can help them by treating some of the outlier values. In practice, pre-processing data by treating outliers tends to produce more accurate results on unseen data.
Imputing outliers for predictive modeling
First, we create a dataset with some outliers. Now the example has two variables.
# Creating data frame with outliers
options(scipen=999) # deactivating scientific notation
set.seed(10) # setting the seed for a reproducible example
df_2=data.frame(var1=rchisq(1000,df = 1), var2=rnorm(1000)) # creating the variables
df_2=rbind(df_2, 135, rep(400, 30), 245, 300, 303, 200) # forcing outliers
Dealing with outliers in both variables (var1 and var2) using Tukey's method:
df_2_tukey=prep_outliers(data = df_2, str_input = c("var1", "var2"), type='stop', method = "tukey")
Checking some metrics before and after the imputation:
profiling_num(df_2, print_results = F) %>% select(variable, mean, std_dev, variation_coef)
##   variable mean std_dev variation_coef
## 1     var1  2.6      21            8.3
## 2     var2  1.6      21           13.6
profiling_num(df_2_tukey, print_results = F) %>% select(variable, mean, std_dev, variation_coef)
##   variable  mean std_dev variation_coef
## 1     var1 0.997     1.3            1.3
## 2     var2 0.018     1.0           57.5
Tukey worked perfectly this time, exposing a more accurate mean in both variables: 1 for var1 and 0 for var2. Note that this time there is not a single NA value. What the function did was "stop the variable" at the threshold values. Now, the minimum and maximum values are the same as the ones reported by Tukey's method.
Checking the thresholds for var1:

tukey_outlier(df_2$var1)

## bottom_threshold    top_threshold 
##             -3.8              5.3
Now checking the min/max before the transformation:
# before
min(df_2$var1)

##  0.0000031

max(df_2$var1)

##  400
and after the transformation...
# after
min(df_2_tukey$var1)

##  0.0000031

max(df_2_tukey$var1)

##  5.3
The min remains the same (0.0000031), but the maximum was set to the Tukey's value of ~5.3.
The five highest values before the transformation were:
# before
tail(df_2$var1[order(df_2$var1)], 5)
##  200 245 300 303 400
# after
tail(df_2_tukey$var1[order(df_2_tukey$var1)], 5)
##  5.3 5.3 5.3 5.3 5.3
And checking that there is not a single NA value:

df_status(df_2_tukey, print_results = F) %>% select(variable, q_na, p_na)
##   variable q_na p_na
## 1     var1    0    0
## 2     var2    0    0
Pretty clear, right?
Now let's replicate the example we did in the last section with only one variable in order to compare all three methods.
df_2$tukey_var2=prep_outliers(data=df_2$var2, type='stop', method = "tukey")
df_2$hampel_var2=prep_outliers(data=df_2$var2, type='stop', method = "hampel")
df_2$bot_top_var2=prep_outliers(data=df_2$var2, type='stop', method = "bottom_top", bottom_percent=0.01, top_percent = 0.01)
Putting it all together
# excluding var1
df_2_b=select(df_2, -var1)

# profiling
profiling_num(df_2_b, print_results = F) %>% select(variable, mean, std_dev, variation_coef, kurtosis, range_98)
##       variable   mean std_dev variation_coef kurtosis     range_98
## 1         var2 1.5649   21.36             14    223.8 [-2.32, 2.4]
## 2   tukey_var2 0.0178    1.02             58      4.6 [-2.32, 2.4]
## 3  hampel_var2 0.0083    0.98            118      3.2 [-2.32, 2.4]
## 4 bot_top_var2 0.0083    0.97            116      2.9 [-2.32, 2.4]
All three methods show very similar results with these data.
# First we need to convert the dataset into long format
df_2_m=reshape2::melt(df_2_b) %>% filter(value<100)
plotar(df_2_m, str_target= "variable", str_input = "value", plot_type = "boxplot")
Important: The two points above the value 100 (only present in var2) were excluded; otherwise, it was impossible to appreciate the difference between the methods.
We've covered the outliers issue from both philosophical and technical perspectives, thereby inviting the reader to improve her/his critical thinking skills when defining the boundaries (thresholds). It is easy to stand in the extremes, but a tough task to find the balance.
In technical terms, we covered three methods to spot outliers whose bases are different:
- Top/Bottom X%: This will always detect points as outliers as there is always a bottom and top X%.
- Tukey: Based on the classical boxplot, which uses the quartiles.
- Hampel: Quite restrictive if default parameter is not changed. It's based on the median and MAD values (similar to standard deviation, but less sensitive to outliers).
After we've spotted the outliers, the next step is to decide what to do with them. It may be the case that no treatment is necessary at all. In really small datasets, they can be seen at first glance.
The rule of "only modify what is necessary" (which can also apply to the human-nature relationship) tells us not to treat or exclude extreme outliers blindly. With every action we take, we introduce some bias. That's why it's so important to know the implications of every method. Whether a treatment is a good decision or not depends on the nature of the data under analysis.
In predictive modeling, models that use some type of internal resampling technique, or that create several small models to get a final prediction, are more stable against extreme values. More on resampling and error in the Knowing the Error chapter.
In some cases when the predictive model is running on production, it's recommended to report or to consider the preparation of any new extreme value, i.e., a value that was not present during the model building. More on this topic, but with a categorical variable, can be found at High Cardinality Variable in Predictive Modeling, section: Handling new categories when the predictive model is on production.
One nice test for the reader is to pick a dataset, treat the outliers, and then compare some performance metrics like Kappa, ROC, accuracy, etc. Did the data preparation improve any of them? Or, in reporting, see how much the mean changes, or whether a plot of some variable now tells us anything new. In this way, the reader will create new knowledge based on her/his experience 😉.