Treatment of outliers



⚠️ Attention! ⚠️ This chapter is being rewritten from scratch and will be updated around July 15th. Follow @dataSciHeroes and @pabloc_ds on Twitter to get the latest news. Thanks 🙂.



What is this about?

The prep_outliers function tries to automate outlier preparation as much as possible. It focuses on the values that heavily influence the mean. For the desired variables, it either sets all outliers to NA or stops them at a certain value.


Model building: Some models such as random forest and gradient boosting machines tend to deal better with outliers, but some noise will affect results anyway.

Communicating results: If we need to report the variables used in the model, we'll end up removing outliers so that we don't show a histogram with only one bar and/or a biased mean.

It's better to show a non-biased number than to argue that the model will handle extreme values.

Type of outliers:

  • Numerical: For example, the ones which bias the mean.
  • Categorical: A variable in which the dispersion of categories is quite high (high cardinality). For example: postal code.
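
One way to spot the categorical case is to count the distinct values per column. The following is a minimal base R sketch; the `customers` data frame and its columns are made up for illustration and are not part of funModeling.

```r
# Hypothetical data: a high-cardinality column next to a low-cardinality one
customers <- data.frame(
  postal_code = sprintf("%05d", 1:50),   # 50 different postal codes
  gender = rep(c("F", "M"), 25),         # only 2 categories
  stringsAsFactors = FALSE
)

# Number of distinct values per column
cardinality <- sapply(customers, function(col) length(unique(col)))
cardinality
# postal_code has 50 distinct values, gender has 2
```

A column whose number of distinct values is close to the number of rows, as `postal_code` is here, is a candidate for grouping before modeling.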
## Loading funModeling !
library(funModeling)
data(heart_disease)

Outlier threshold: The method to detect them is based on percentiles, flagging a value as an outlier if it is in the top X% (commonly 0.5%, 1%, or 2%). Setting the parameter top_percent to 0.01 will flag all values in the top 1%.

The same logic goes for the lowest values: setting the parameter bottom_percent to 0.01 will flag the lowest 1% of all values as outliers.
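
The percentile rule can be sketched in base R with `quantile`. The helper `flag_outliers` below is hypothetical and is not how prep_outliers is necessarily implemented; it only illustrates the idea of flagging values beyond the top and bottom percentile thresholds.

```r
# Hypothetical helper (not part of funModeling): returns TRUE for flagged values
flag_outliers <- function(x, top_percent = NULL, bottom_percent = NULL) {
  flagged <- rep(FALSE, length(x))
  if (!is.null(top_percent)) {
    # e.g. top_percent = 0.01 uses the 99th percentile as threshold
    top_threshold <- quantile(x, 1 - top_percent, na.rm = TRUE, names = FALSE)
    flagged <- flagged | (!is.na(x) & x > top_threshold)
  }
  if (!is.null(bottom_percent)) {
    # e.g. bottom_percent = 0.01 uses the 1st percentile as threshold
    bottom_threshold <- quantile(x, bottom_percent, na.rm = TRUE, names = FALSE)
    flagged <- flagged | (!is.na(x) & x < bottom_threshold)
  }
  flagged
}

x <- c(1:99, 1000)                                   # 1000 is a forced outlier
top_flagged <- which(flag_outliers(x, top_percent = 0.01))
top_flagged                                          # position 100, the value 1000
both_flagged <- which(flag_outliers(x, top_percent = 0.01, bottom_percent = 0.01))
both_flagged                                         # positions 1 and 100
```

Note that the exact threshold depends on the quantile interpolation method; this sketch relies on the default used by `quantile`.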

These models are highly affected by a biased mean 😢: linear regressions, logistic regressions, kmeans, decision trees. Random forest deals better with outliers.

Automation: prep_outliers skips all factor/char columns, so it can receive a whole data frame, treat the outliers in its numeric columns, and return the cleaned data.
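
The idea of processing a whole data frame while skipping non-numeric columns can be sketched as follows. The function `treat_numeric_columns` and the `example` data are hypothetical names used only for illustration; this is an assumption about the general approach, not the package's source code.

```r
# Hypothetical sketch: set top outliers to NA in numeric columns only
treat_numeric_columns <- function(data, top_percent = 0.01) {
  is_num <- vapply(data, is.numeric, logical(1))
  for (col in names(data)[is_num]) {
    threshold <- quantile(data[[col]], 1 - top_percent, na.rm = TRUE, names = FALSE)
    # which() drops NA comparisons, so existing NAs are left untouched
    data[[col]][which(data[[col]] > threshold)] <- NA
  }
  data
}

example <- data.frame(
  amount = c(1:99, 1000),         # numeric: will be treated
  id = as.character(1:100),       # character: will be skipped
  stringsAsFactors = FALSE
)
result <- treat_numeric_columns(example)
sum(is.na(result$amount))         # 1: only the value 1000 was set to NA
identical(result$id, example$id)  # TRUE: the character column is unchanged
```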


This function covers two typical scenarios (parameter type):

  • Case 1: Descriptive statistics / data profiling
  • Case 2: Data for the predictive model


Case 1: type='set_na'

In this case all outliers are converted into NA, so most descriptive functions (max, min, mean) will return a less biased value, provided they are called with the parameter na.rm=TRUE.
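
To see why na.rm=TRUE matters, here is a small base R sketch of the same idea on a toy vector. It does not call prep_outliers; the threshold is computed by hand with `quantile`.

```r
x <- c(1:99, 1000)                         # 1000 is a forced outlier
threshold <- quantile(x, 0.99, names = FALSE)

# Convert values above the 99th percentile into NA
x_na <- replace(x, x > threshold, NA)

mean(x)                      # 59.5, biased by the outlier
mean(x_na)                   # NA, because na.rm defaults to FALSE
mean(x_na, na.rm = TRUE)     # 50, the mean of 1 to 99
```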


Case 2: type='stop'

The previous case causes all rows with NA values to be lost when a machine learning model is trained. To avoid this while still keeping the outliers under control, all values flagged as outliers are converted to the threshold value.
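
Stopping values at a threshold can be sketched in base R with `pmin`. This is only an illustration of the concept on a toy vector, under the assumption that the threshold is the 99th percentile; it is not the package's implementation.

```r
x <- c(1:99, 1000)                               # 1000 is a forced outlier
top_threshold <- quantile(x, 0.99, names = FALSE)

# Every value above the threshold is replaced by the threshold itself
x_stopped <- pmin(x, top_threshold)

max(x_stopped) == top_threshold   # TRUE: the outlier now sits at the threshold
anyNA(x_stopped)                  # FALSE: no values were lost
length(x_stopped)                 # 100: same number of cases as before
```

For the lowest values, `pmax(x, bottom_threshold)` would play the same role.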

Key notes:

  • Try to think of variable treatment (and creation) as if you were explaining things to the model. By stopping variables at a certain value, the top 1% for example, you are telling the model: consider all extreme values as if they were at the 99th percentile; this value is already high enough.
  • Models try to be noise tolerant, but you can help them by treating some common issues.


Examples

########################################
# Creating data frame with outliers
########################################
set.seed(10)
df=data.frame(var1=rchisq(1000,df = 1), var2=rnorm(1000))
df=rbind(df, 1135, 2432) # forcing outliers
df$id=as.character(seq_len(1002))

# for var1: mean is ~ 4.56, and max 2432
summary(df)
##       var1                var2                id           
##  Min.   :   0.0000   Min.   :  -3.2282   Length:1002       
##  1st Qu.:   0.0989   1st Qu.:  -0.6304   Class :character  
##  Median :   0.4455   Median :  -0.0352   Mode  :character  
##  Mean   :   4.5666   Mean   :   3.5512                     
##  3rd Qu.:   1.3853   3rd Qu.:   0.6242                     
##  Max.   :2432.0000   Max.   :2432.0000


Case 1: type='set_na'

########################################################
### CASE 1: Treatment outliers for data profiling
########################################################

#### EXAMPLE 1: Removing top 1% for a single variable

# checking the value for the top 1% of highest values (percentile 0.99), which is ~ 7.05
quantile(df$var1, 0.99)
##      99% 
## 7.052883
# Setting type='set_na' sets the highest values to NA
var1_treated=prep_outliers(data = df,  str_input = 'var1',  type='set_na', top_percent  = 0.01)

# now the mean (~ 0.94) is less biased; note that the 1st quartile, median and 3rd quartile remain very similar to those of the original variable.
summary(var1_treated)
##       var1               var2                id           
##  Min.   :0.000003   Min.   :  -3.2282   Length:1002       
##  1st Qu.:0.095676   1st Qu.:  -0.6304   Class :character  
##  Median :0.438830   Median :  -0.0352   Mode  :character  
##  Mean   :0.940909   Mean   :   3.5512                     
##  3rd Qu.:1.326450   3rd Qu.:   0.6242                     
##  Max.   :6.794558   Max.   :2432.0000                     
##  NA's   :11
#### EXAMPLE  2: if 'str_input' is missing, then it runs for all numeric variables (which have 3 or more distinct values).
df_treated2=prep_outliers(data = df, type='set_na', top_percent  = 0.01)
summary(df_treated2)
##       var1               var2               id           
##  Min.   :0.000003   Min.   :-3.22817   Length:1002       
##  1st Qu.:0.095676   1st Qu.:-0.64758   Class :character  
##  Median :0.438830   Median :-0.05779   Mode  :character  
##  Mean   :0.940909   Mean   :-0.05862                     
##  3rd Qu.:1.326450   3rd Qu.: 0.57706                     
##  Max.   :6.794558   Max.   : 1.99101                     
##  NA's   :11         NA's   :23
#### EXAMPLE  3: Removing top 1% (and bottom 1%) for 'N' specific variables.
vars_to_process=c('var1', 'var2')
df_treated3=prep_outliers(data = df, str_input = vars_to_process, type='set_na', bottom_percent = 0.01, top_percent  = 0.01)
summary(df_treated3)
##       var1               var2               id           
##  Min.   :0.000003   Min.   :-1.98803   Length:1002       
##  1st Qu.:0.095676   1st Qu.:-0.60871   Class :character  
##  Median :0.438830   Median :-0.03522   Mode  :character  
##  Mean   :0.940909   Mean   :-0.00420                     
##  3rd Qu.:1.326450   3rd Qu.: 0.58415                     
##  Max.   :6.794558   Max.   : 1.99101                     
##  NA's   :11         NA's   :45


Case 2: type='stop'

########################################################
### CASE 2: Treatment outliers for predictive modeling
########################################################
#### EXAMPLE 4: Stopping outliers at the top 1% value for all variables. For example if the top 1% has a value of 7, then all values above will be set to 7. Useful when modeling because outlier cases can be used.
df_treated4=prep_outliers(data = df, type='stop', top_percent = 0.01)

# before
summary(df$var1)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0989    0.4455    4.5670    1.3850 2432.0000
# after, the max value is 7
summary(df_treated4$var1)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000003 0.098870 0.445500 1.007000 1.385000 7.000000


Plots

Note that when type='set_na', the last points disappear

library(ggplot2) # loaded explicitly in case it is not already attached
ggplot(df_treated3, aes(x=var1)) + geom_histogram(binwidth=.5) + ggtitle("Setting type='set_na' (var1)")
## Warning: Removed 11 rows containing non-finite values (stat_bin).

[Plot: histogram of var1 with type='set_na']

ggplot(df_treated4, aes(x=var1)) + geom_histogram(binwidth=.5) + ggtitle("Setting type='stop' (var1)")

[Plot: histogram of var1 with type='stop']

