Treatment of outliers

What is this about?

The concept of extreme values, much like other topics in machine learning, is not a concept exclusive to this area. What it is an outlier today may not be tomorrow. The boundaries between normal and abnormal behavior are fuzzy; on the other hand, to stand in the extremes is easy.


What are we going to review in this chapter?

  • What is an outlier? Philosophical and practical approaches
  • Outliers by dimensionality and data type (numerical or categorical)
  • How to detect outliers in R (bottom/top X%, Tukey and Hampel)
  • Outliers preparation for profiling in R
  • Outliers preparation for predictive modeling in R



The intuition behind outliers

For example, consider the following distribution:

## Loading ggplot2 to visualize the distribution
library(ggplot2)

## Creating a sample dataset
set.seed(31415)
df_1=data.frame(var=round(10000*rbeta(1000,0.15,2.5)))

## Plotting
ggplot(df_1, aes(var, fill=var)) + geom_histogram(bins=20) + theme_light()

plot of chunk dealing_with_outliers

The variable is skewed to the left, showing some outlier points on the right. We want to deal with them (😎). So, the question arises: Where to set the thresholds of extreme? Based on intuition, it can be at the highest 1%, or we can analyze the mean change after removing the top 1%.

Both cases could be right. In fact, taking another number as the threshold (i.e., 2% or 0.1%), may be right too. Let's visualize them:

## Calculating the percentiles for the top 3% and top 1%
percentile_var=quantile(df_1$var, c(0.98, 0.99, 0.999), na.rm = T)
df_p=data.frame(value=percentile_var, percentile=c("a_98th", "b_99th", "c_99.9th"))

## Plotting the same distribution plus the percentiles
ggplot(df_1, aes(var)) + geom_histogram(bins=20) + geom_vline(data=df_p, aes(xintercept=value,  colour = percentile), show.legend = TRUE, linetype="dashed") + theme_light()

plot of chunk dealing_with_outliers_4

To understand more about percentiles, please go to the Annex 1: The magic of percentiles chapter.

For now, we'll keep with the top 1% (99th percentile), as the threshold to flag all the points after it as outliers.

plot of chunk dealing_with_outliers_2

One interesting conceptual element arises here: when we define abnormal (or an anomaly), the normal concept emerges as its opposite.

This “normal” behavior is shown as the green area:

plot of chunk dealing_with_outliers_3

The hard thing to do is to determine where the normal and abnormal separate. There are several approaches to deal with this. We are going to review a few of them.



Where is the boundary between hot and cold weather?

Let's make this section more philosophical. Some good mathematicians were also philosophers such as the case of Pythagoras and Isaac Newton.

Where can we put the threshold to indicate where the hot weather begins or, conversely, where does the cold weather end?

Finding the cutpoint

Near the Equator, probably a temperature around 10°C (50°F) is an extremely low value; however, in Antarctica, it's a beach day! ⛄ 🏖

👺: "Oh! But that is taking an extreme example with two different locations!"

No problem! Like a fractal, let's zoom into any city, the boundary when one starts (and the other ends) will not have one unique value to state the following: "Ok, the hot weather starts at 25.5°C (78°F)."

It's relative.

However, it's quite easy to stand in the extremes, where the uncertainty decreases to almost zero, for example, when we consider a temperature of 60°C (140°F).

🤔: "Ok. But how are these concepts related to machine learning?"

We're exposing here the relativity that exists when considering a label (hot/cold) as a numeric variable (temperature). This can be considered for any other numeric, such as income and the labels “normal” and “abnormal.”

To understand extreme values is one of the first tasks in exploratory data analysis. Then we can see what the normal values are. This is covered in the Profiling chapter.

There are several methods to flag values as outliers. Just as we might analyze the temperature, this flag is relative and all the methods can be right. The quickest method may be to treat the top and bottom X% as outliers.

More robust methods consider the distribution variables by using quantiles (Tukey's method) or the spread of the values through standard deviation (Hampel's method).

The definition of these boundaries is one of the most common tasks in machine learning. Why? When? Let's point out two examples:

  • Example 1: When we develop a predictive model which returns a probabilty for calling or not certain client, we need to set the score threshold to assign the final label: "yes call!"/"no call". More info about it in Scoring chapter.

  • Example 2: Another example is when we need to discretize a numerical variable because we need it as categorical. The boundaries in each bin/segment will affect the overall result. More info about it in Data Types chapter, section Discretizing numerical variables.

📌 Going back to the original issue (where does the cold weather end?), not all the questions need to have an answer: some of them just help us simply to think.



The impact of outliers

Model building

Some models, such as random forest and gradient-boosting machines, tend to deal better with outliers; however, “noise” may affect the results anyway. The impact of outliers in these models is lower than others, such as linear regressions, logistic regressions, kmeans, and decision trees.

One aspect that contributes to the decrease in impact is that both models create many sub-models. If any of the models takes one outlier as information, then other sub-models probably won't; thus, the error is canceled. The balance yields in the plurality of voices.

Communicating results 🌍 📣

If we need to report the variables used in the model, we'll end up removing outliers not to see a histogram with only one bar and/or show a biased mean.

It's better to show a nonbiased number than justifying that the model will handle extreme values.

Types of outliers by data type

  • Numerical 📏: Like the one we saw before:

plot of chunk numerical_outliers_1

  • Categorical 📊: Having a variable in which the dispersion of categories is quite high (high cardinality): for example, postal code. More about dealing with outliers in categorical variables in the High Cardinality Variable in Descriptive Stats chapter.

plot of chunk categorical_outliers_1

##       var frequency percentage cumulative_perc
## 1  France       288      68.74           68.74
## 2   China        65      15.51           84.25
## 3 Uruguay        63      15.04           99.29
## 4    Peru         2       0.48           99.77
## 5 Vietnam         1       0.24          100.00

Peru and Vietnam are the outlier countries in this example as their share in the data is less than 1%.



Types of outliers by dimensionality

So far, we have observed one-dimensional univariate outliers. We also can consider two or more variables at a time.

For instance, we have the following dataset, df_hello_world, with two variables: v1 and v2. Doing the same analysis as before:

plot of chunk outlier_analysis_in_r

##          v1 frequency percentage cumulative_perc
## 1   Uruguay        80       59.7            59.7
## 2 Argentina        54       40.3           100.0

plot of chunk outlier_analysis_in_r

##      v2 frequency percentage cumulative_perc
## 1 cat_A        83      61.94           61.94
## 2 cat_B        51      38.06          100.00
## [1] "Variables processed: v1, v2"

No outlier so far, right?

Now we build a contingency table that tells us the distribution of both variables against each other:

##            v2
## v1          cat_A cat_B
##   Argentina 39.55  0.75
##   Uruguay   22.39 37.31

Oh 😱! The combination of Argentina and cat_B is really low (0.75%) in comparison with the other values (less than 1%), whereas the other intersections are above 22%.


Some thoughts...

The last examples show the potential of extreme values or outliers and are presented as considerations we must make with a new dataset.

We mention 1% as a possible threshold to flag a value as an outlier. This value could be 0.5% or 3%, depending on the case.

In addition, the presence of this kind of outlier may not pose a problem.



How to deal with outliers in R

The prep_outliers function present in the funModeling package can help us in this task. It can handle from one to 'N' variables at a time (by specifying the str_input parameter).

The core is as follows:

  • It supports three different methods (method parameter) to consider a value as an outlier: bottom_top, Tukey, and Hampel.
  • It works in two modes (type parameter) by setting an NA value or by stopping the variable at a particular value. Besides the explanation below, prep_outliers is a well-documented type: help("prep_outliers").


Step 1: How to detect outliers 🔎

The following methods are implemented in the prep_outliers function. They retrieve different results so the user can select the one that best fits her or his needs.

Bottom and top values method

This considers outliers based on the bottom and top X% values, based on the percentile. The values are commonly 0.5%, 1%, 1.5%, 3%, among others.

Setting the parameter top_percent in 0.01 will treat all values in the top 1%.

The same logic applies for the lowest values: setting the parameter bottom_percent to 0.01 will flag as outliers the lowest 1% of all values.

The internal function used is quantile; if we want to flag bottom and top 1%, we type:

quantile(heart_disease$age, probs = c(0.01, 0.99), na.rm = T)
##  1% 99% 
##  35  71

All values for those aged less than 35 or more than 71 years will be considered outliers.

For more information about percentiles, check the chapter: The magic of percentiles.


Tukey's method

This method flags outliers considering the quartiles values, Q1, Q2, and Q3, where Q1 is equivalent to the percentile 25th, Q2 equals to percentile 50th (also known as the median), and Q3 is the percentile 75th.

The IQR (Inter-quartile range) comes from Q3 − Q1.

The formula:

  • The bottom threshold is: Q1 − 3*IQR. All below are considered as outliers.
  • The top threshold is: Q1 + 3*IQR. All above are considered as outliers.

The value 3 is to consider the "extreme" boundary detection. This method comes from the box plot, where the multiplier is 1.5 (not 3). This causes a lot more values to be flagged as shown in the next image.

Boxplot, it shows the outliers using the 1.5 multiplier

The internal function used in prep_outliers to calculate the Tukey's boundary can be accessed:

tukey_outlier(heart_disease$age)
## bottom_threshold    top_threshold 
##                9              100

It returns a two-value vector; thus, we have the bottom and the top thresholds: all below nine and all above 100 will be considered as outliers.

A subtle visual and step-by-step example can be found in Ref. [1].


Hampel's method

The formula:

  • The bottom threshold is: median_value − 3*mad_value. All below are considered as outliers.
  • The top threshold is: median_value + 3*mad_value. All above are considered as outliers.

The internal function used in prep_outliers to calculate the Hampel's boundary can be accessed:

hampel_outlier(heart_disease$age)
## bottom_threshold    top_threshold 
##          29.3132          82.6868

It returns a two-value vector; thus, we have the bottom and the top thresholds. All below 29.31 and all above 82.68 will be considered as outliers.

It has one parameter named k_mad_value, and its default value is 3. The value k_mad_value can be changed, but not in the prep_outliers function by now.

The higher the k_mad_value, the higher the threshold boundaries will be.

hampel_outlier(heart_disease$age, k_mad_value = 6) 
## bottom_threshold    top_threshold 
##           2.6264         109.3736



Step 2: What to do with the outliers? 🛠

We've already detected which points are the outliers. Therefore, the question now is: What to do with them? 🤔

There are two scenarios:

  • Scenario 1: Prepare outliers for data profiling
  • Scenario 2: Prepare outliers for predictive modeling

There is a third scenario in which we don't do anything with the spotted outliers. We just let them be.

We propose the function prep_outliers from the funModeling package to give a hand on this task.

Regardless the function itself, the important point here is the underlying concept and the possibility of developing an improved method.

The prep_outliers function covers these two scenarios through the parameter type:

  • type = "set_na", for scenario 1
  • type = "stop", for scenario 2

Scenario 1: Prepare outliers for data profiling

The initial analysis:

In this case, all outliers are converted into NA, thus applying most of the characteristic functions (max, min, mean, etc.) will return a less-biased indicator value. Remember to set the na.rm=TRUE parameter in those functions. Otherwise, the result will be NA.

For example, let's consider the following variable (the one we saw at the beginning with some outliers):

## To understand all of these metrics, please go to the Profiling Data chapter: http://livebook.datascienceheroes.com/exploratory_data_analysis/profiling.html
profiling_num(df_1$var)
##   variable mean std_dev variation_coef p_01 p_05 p_25 p_50 p_75 p_95 p_99
## 1      var  548    1226            2.2    0    0    0   24  370 3382 5467
##   skewness kurtosis iqr     range_98    range_80
## 1      3.3       16 370 [0, 5467.33] [0, 1791.1]

Here we can see several indicators that give us some clues. The std_dev is really high compared with the mean, and it is reflected on the variation_coef. In addition, the kurtosis is high (16) and the p_99 is almost twice the p_95 value (5767 vs. 3382).

This last task of looking at some numbers and visualize the variable distribution is like imaging a picture by what another person tells us: we convert the voice (which is a signal) into an image in our brain. 🗣 🙄 ... => 🏔


Using prep_outliers for data profiling

We need to set type="set_na". This implies that every point flagged as an outlier will be converted into NA.

We will use the three methods: Tukey, Hampel, and the bottom/top X%.

Using Tukey's method:

df_1$var_tukey=prep_outliers(df_1$var, type = "set_na", method = "tukey")

Now, we check how many NA values are there before (the original variable) and after the transformation based on Tukey.

# before
df_status(df_1$var, print_results = F) %>% select(variable, q_na, p_na)
##   variable q_na p_na
## 1      var    0    0
# after
df_status(df_1$var_tukey, print_results = F) %>% select(variable, q_na, p_na)
##   variable q_na p_na
## 1      var  120   12

Before the transformation, there were 0 NA values, whereas afterwards 120 values (around 12%) are spotted as outliers according to the Tukey's test and replaced by NA.

We can compare the before and after:

profiling_num(df_1, print_results = F) %>% select(variable, mean, std_dev, variation_coef, kurtosis, range_98)
##    variable mean std_dev variation_coef kurtosis     range_98
## 1       var  548    1226            2.2     15.6 [0, 5467.33]
## 2 var_tukey  163     307            1.9      8.4 [0, 1358.46]

The mean decreased by almost the third part while all the other metrics decreased as well.

Hampel's method:

Let's see what happens with Hampel's method (method="hampel"):

df_1$var_hampel=prep_outliers(df_1$var, type = "set_na", method="hampel")

Checking...

df_status(df_1, print_results = F) %>% select(variable, q_na, p_na)
##     variable q_na p_na
## 1        var    0    0
## 2  var_tukey  120   12
## 3 var_hampel  364   36

This last method is much more severe in spotting outliers, identifying 36% of values as outliers. This is probably because the variable is quite skewed to the left.

More info can be found at Ref. [2].


Bottom and top X% method

Finally, we can try the easiest method: to remove the top 2%.

df_1$var_top2=prep_outliers(df_1$var, type = "set_na", method="bottom_top", top_percent = 0.02)

Please note that the 2% value was arbitrarily chosen. Other values, like 3% or 0.5%, can be tried as well.

Time to compare all the methods!


Putting it all together

We'll pick a few indicators to make the quantitative comparison.

df_status(df_1, print_results = F) %>% select(variable, q_na, p_na)
##     variable q_na p_na
## 1        var    0    0
## 2  var_tukey  120   12
## 3 var_hampel  364   36
## 4   var_top2   20    2
prof_num=profiling_num(df_1, print_results = F) %>% select(variable, mean, std_dev, variation_coef, kurtosis, range_98)
prof_num
##     variable mean std_dev variation_coef kurtosis     range_98
## 1        var  548    1226            2.2     15.6 [0, 5467.33]
## 2  var_tukey  163     307            1.9      8.4 [0, 1358.46]
## 3 var_hampel   17      31            1.8      6.0   [0, 118.3]
## 4   var_top2  432     908            2.1     10.9 [0, 4364.29]

Plotting

# First we need to convert the dataset into wide format
df_1_m=reshape2::melt(df_1) 
plotar(df_1_m,  str_target= "variable", str_input = "value", plot_type = "boxplot")

plot of chunk comparing_outliers_methods


When selecting the bottom/top X%, we will always have some values matching that condition, whereas with other methods this may not be the case.

Conclusions for dealing with outliers in data profiling

The idea is to modify the outliers as least as possible (for example, if we were interested only in describing the general behavior).

To accomplish this task, for example when creating an ad hoc report, we can use the mean. We could choose the top 2% method because it only affects 2% of all values and causes the mean to be lowered drastically: from 548 to 432, or 21% less.

"To modify or not to modify the dataset, that is the question". William Shakespeare being a Data Scientist.

The Hampel method modified the mean too much, from 548 to 17! That is based on the standard value considered with this method, which is 3-MAD (kind of robust standard deviation).

Please note that this demonstration doesn't mean that neither Hampel nor Tukey are a bad choice. In fact, they are more robust because the threshold can be higher than the current value; indeed, no value is treated as an outlier.

On the other extreme, we can consider, for example, the age variable from heart_disease data. Let's analyze its outliers:

# Getting outliers threshold
tukey_outlier(heart_disease$age)
## bottom_threshold    top_threshold 
##                9              100
# Getting min and max values
min(heart_disease$age)
## [1] 29
max(heart_disease$age)
## [1] 77
  • The bottom threshold is 9, and the minimum value is 29.
  • The top threshold is 100, and the minimum value is 77.

Ergo: the age variable has not outliers.

If we were to have used the bottom/top method, then the input percentages would have been detected as outliers.

All the examples so far have been taking one variable at a time; however, prep_outliers can handle several at the same time using the parameter str_input as we will see in next section. All that we have seen up to here will be equivalent, except for what we do once we detect the outlier, i.e., the imputation method.


Scenario 2: Prepare outliers for predictive modeling

The previous case results in spotted outliers being converted to NA values. This is a huge problem if we are building a machine learning model as many of them don't work with NA values. More about dealing with missing data at Analysis, Handling, and Imputation of Missing Data.

To deal with outliers in order to use a predictive model, we can adjust the parameter type='stop' so all values flagged as outliers will be converted to the threshold value.

Some things to keep in mind:

Try to think of variable treatment (and creation) as if you're explaining to the model. By stopping variables at a certain value, 1% for example, we are telling to the model: Hey model, please consider all extreme values as if they are in the 99% percentile as this value is already high enough. Thanks.

Some predictive models are more noise tolerant than others. We can help them by treating some of the outlier values. In practice, to pre-process data by treating outliers tends to produce more accurate results in the presence of unseen data.


Imputing outliers for predictive modeling

First, we create a dataset with some outliers. Now the example has two variables.

# Creating data frame with outliers
options(scipen=999) # deactivating scientific notation
set.seed(10) # setting the seed to have a reproducible example
df_2=data.frame(var1=rchisq(1000,df = 1), var2=rnorm(1000)) # creating the variables
df_2=rbind(df_2, 135, rep(400, 30), 245, 300, 303, 200) # forcing outliers

Dealing with outliers in both variables (var1 and var2) using Tukey's method:

df_2_tukey=prep_outliers(data = df_2, str_input = c("var1", "var2"), type='stop', method = "tukey")

Checking some metrics before and after the imputation:

profiling_num(df_2, print_results = F) %>% select(variable, mean, std_dev, variation_coef)
##   variable mean std_dev variation_coef
## 1     var1  2.6      21            8.3
## 2     var2  1.6      21           13.6
profiling_num(df_2_tukey, print_results = F) %>% select(variable, mean, std_dev, variation_coef)
##   variable  mean std_dev variation_coef
## 1     var1 0.997     1.3            1.3
## 2     var2 0.018     1.0           57.5

Tukey worked perfectly this time, exposing a more accurate mean in both variables: 1 for var1 and 0 for var2.

Note that this time there is no one NA value. What the function did this time was "to stop the variable" at the threshold values. Now, the minimum and maximum values will be the same as the ones reported by Tukey's method.

Checking the threshold for var1:

tukey_outlier(df_2$var1)
## bottom_threshold    top_threshold 
##             -3.8              5.3

Now checking the min/max before the transformation:

# before:
min(df_2$var1)
## [1] 0.0000031
max(df_2$var1)
## [1] 400

and after the transformation...

# after
min(df_2_tukey$var1)
## [1] 0.0000031
max(df_2_tukey$var1)
## [1] 5.3

The min remains the same (0.0000031), but the maximum was set to the Tukey's value of ~5.3.

The top five highest values before the pre-partition were:

# before
tail(df_2$var1[order(df_2$var1)], 5)
## [1] 200 245 300 303 400

but after...

# after:
tail(df_2_tukey$var1[order(df_2_tukey$var1)], 5)
## [1] 5.3 5.3 5.3 5.3 5.3

And checking there is no one NA:

df_status(df_2_tukey, print_results = F) %>% select(variable, q_na, p_na)
##   variable q_na p_na
## 1     var1    0    0
## 2     var2    0    0

Pretty clear, right?


Now let's replicate the example we did in the last section with only one variable in order to compare all three methods.

df_2$tukey_var2=prep_outliers(data=df_2$var2, type='stop', method = "tukey")
df_2$hampel_var2=prep_outliers(data=df_2$var2, type='stop', method = "hampel")
df_2$bot_top_var2=prep_outliers(data=df_2$var2, type='stop', method = "bottom_top", bottom_percent=0.01, top_percent = 0.01)


Putting it all together

# excluding var1
df_2_b=select(df_2, -var1)

# profiling
profiling_num(df_2_b, print_results = F) %>% select(variable, mean, std_dev, variation_coef, kurtosis, range_98)
##       variable   mean std_dev variation_coef kurtosis     range_98
## 1         var2 1.5649   21.36             14    223.8 [-2.32, 2.4]
## 2   tukey_var2 0.0178    1.02             58      4.6 [-2.32, 2.4]
## 3  hampel_var2 0.0083    0.98            118      3.2 [-2.32, 2.4]
## 4 bot_top_var2 0.0083    0.97            116      2.9 [-2.32, 2.4]

All three methods show very similar results with these data.

Plotting

# First we need to convert the dataset into wide format
df_2_m=reshape2::melt(df_2_b) %>% filter(value<100) 
plotar(df_2_m,  str_target= "variable", str_input = "value", plot_type = "boxplot")

plot of chunk outliers_method_comparison

Important: The two points above the value 100 ( only for var1) were excluded, otherwise it was impossible to appreciate the difference between the methods.



Final thoughts

We've covered the outliers issue from both philosophical and technical perspectives, thereby inviting the reader to improve her/his critical thinking skills when defining the boundaries (thresholds). It is easy to stand in the extremes, but a tough task to find the balance.

In technical terms, we covered three methods to spot outliers whose bases are different:

  • Top/Bottom X%: This will always detect points as outliers as there is always a bottom and top X%.
  • Tukey: Based on the classical boxplot, which uses the quartiles.
  • Hampel: Quite restrictive if default parameter is not changed. It's based on the median and MAD values (similar to standard deviation, but less sensitive to outliers).

After we've got the outliers, the next step is to decide what to do with them. It would be the case that the treatment is not necessary at all. In really small datasets, they can be seen at first glance.

The rule of: "Only modify what is necessary", (which can also apply to the human beingnature relationship), tells us not to treat or exclude all the extreme outliers blindly. With every action we took, we introduced some bias. That's why it's so important to know the implications of every method. Whether it is a good decision or not is dependent on the nature of the data under analysis.

In predictive modeling, those who have any type of internal resampling technique, or create several tiny models to get a final prediction, are more stable to extreme values. More on resampling and error in the Knowing the error chapter.

In some cases when the predictive model is running on production, it's recommended to report or to consider the preparation of any new extreme value, i.e., a value that was not present during the model building. More on this topic, but with a categorical variable, can be found at High Cardinality Variable in Predictive Modeling, section: Handling new categories when the predictive model is on production.

One nice test for the reader to do is to pick up a dataset, treat the outliers, and then compare some performance metrics like Kappa, ROC, Accuracy, etc.; did the data preparation improve any of them? Or, in reporting, to see how much the mean changes. Even if we plot some variable, does the plot now tell us anything?. In this way, the reader will create new knowledge based on her/his experience 😉.


References

results matching ""

    No results matching ""