# Treatment of outliers

⚠️ **Atention!**⚠️ This chapter is being rewriting from scratch! And it will be updated around July 15th. Follow in twitter @dataSciHeroes and @pabloc_ds. to get the latests news. Thanks 🙂.

## What is this about?

`prep_outliers`

function tries to automatize as much as it can be outliers preparation. It focus on the values that heavily influence the mean.
It sets an `NA`

or stops at a certain value all outliers for the desired variables.

**Model building**: Some models such as random forest and gradient boosting machines tend to deal better with outliers, but some noise will affect results anyway.

**Communicating results:** If we need to report the variables used in the model, we'll end up removing outliers to not see an histogram with only one bar, and/or show not a biased mean.

It's better to show a non-biased number than justifying the model *will handle* extreme values.

**Type of outliers:**

- Numerical: For example the ones which bias the mean.
- Categorical: Having a variable in which the dispersion of categories is quite high (high cardinallity). For example: postal code.

```
## Loading funModeling !
library(funModeling)
data(heart_disease)
```

**Outlier threshold**: The method to detect them is based on the percentile, flagging as an outlier if the value is on the top X % (commonly 0.5%, 1%, 2%). Setting the parameter `top_percent`

in `0.01`

will flag all values on the top 1%.

Same logic goes for the lowest values, setting parameter `bottom_percent`

in 0.01 will flag as an outlier the lowest 1% of all values.

**These models are highly affected by a biased mean** : linear regressions, logistic regressions, kmeans, decision trees. Random forest deals better with outliers.

**Automatization**: `prep_outliers`

skips all factor/char columns, so it can receive a whole data frame, removing outliers by finally, returning a the *cleaned* data.

This function covers two typical scenarios (parameter `type`

):

- Case 1: Descriptive statistics / data profiling
- Case 2: Data for the predictive model

## Case 1: `type='set_na'`

In this case all outliers are converted into `NA`

, thus applying most of the descriptive functions (max, min, mean) will return a **less-biased mean** value - with the proper `na.rm=TRUE`

parameter.

## Case 2: `type='stop'`

The previous case will cause that all rows with `NA`

values will be lost when a machine learning model is trained. To avoid this, but keep the outliers controlled, all values flagged as outlier will be converted to the threshold value.

**Key notes**:

- Try to think variables treatment (and creation) as if you're explaining to the model. Stopping variables at a certain value, 1% for example, you are telling to the model:
*consider all extremes values as if they are on the 99% percentile, this value is already high enough* - Models try to be noise tolerant, but you can help them by treat some common issues.

## Examples

```
########################################
# Creating data frame with outliers
########################################
set.seed(10)
df=data.frame(var1=rchisq(1000,df = 1), var2=rnorm(1000))
df=rbind(df, 1135, 2432) # forcing outliers
df$id=as.character(seq(1:1002))
# for var1: mean is ~ 4.56, and max 2432
summary(df)
```

```
## var1 var2 id
## Min. : 0.0000 Min. : -3.2282 Length:1002
## 1st Qu.: 0.0989 1st Qu.: -0.6304 Class :character
## Median : 0.4455 Median : -0.0352 Mode :character
## Mean : 4.5666 Mean : 3.5512
## 3rd Qu.: 1.3853 3rd Qu.: 0.6242
## Max. :2432.0000 Max. :2432.0000
```

## Case 1: `type='set_na'`

```
########################################################
### CASE 1: Treatment outliers for data profiling
########################################################
#### EXAMPLE 1: Removing top 1% for a single variable
# checking the value for the top 1% of highest values (percentile 0.99), which is ~ 7.05
quantile(df$var1, 0.99)
```

```
## 99%
## 7.052883
```

```
# Setting type='set_na' sets NA to the highest value)
var1_treated=prep_outliers(data = df, str_input = 'var1', type='set_na', top_percent = 0.01)
# now the mean (~ 0.94) is less biased, and note that: 1st, median and 3rd quartiles remaining very similar to the original variable.
summary(var1_treated)
```

```
## var1 var2 id
## Min. :0.000003 Min. : -3.2282 Length:1002
## 1st Qu.:0.095676 1st Qu.: -0.6304 Class :character
## Median :0.438830 Median : -0.0352 Mode :character
## Mean :0.940909 Mean : 3.5512
## 3rd Qu.:1.326450 3rd Qu.: 0.6242
## Max. :6.794558 Max. :2432.0000
## NA's :11
```

```
#### EXAMPLE 2: if 'str_input' is missing, then it runs for all numeric variables (which have 3 or more distinct values).
df_treated2=prep_outliers(data = df, type='set_na', top_percent = 0.01)
summary(df_treated2)
```

```
## var1 var2 id
## Min. :0.000003 Min. :-3.22817 Length:1002
## 1st Qu.:0.095676 1st Qu.:-0.64758 Class :character
## Median :0.438830 Median :-0.05779 Mode :character
## Mean :0.940909 Mean :-0.05862
## 3rd Qu.:1.326450 3rd Qu.: 0.57706
## Max. :6.794558 Max. : 1.99101
## NA's :11 NA's :23
```

```
#### EXAMPLE 3: Removing top 1% (and bottom 1%) for 'N' specific variables.
vars_to_process=c('var1', 'var2')
df_treated3=prep_outliers(data = df, str_input = vars_to_process, type='set_na', bottom_percent = 0.01, top_percent = 0.01)
summary(df_treated3)
```

```
## var1 var2 id
## Min. :0.000003 Min. :-1.98803 Length:1002
## 1st Qu.:0.095676 1st Qu.:-0.60871 Class :character
## Median :0.438830 Median :-0.03522 Mode :character
## Mean :0.940909 Mean :-0.00420
## 3rd Qu.:1.326450 3rd Qu.: 0.58415
## Max. :6.794558 Max. : 1.99101
## NA's :11 NA's :45
```

## Case 2: `type='stop'`

```
########################################################
### CASE 2: Treatment outliers for predictive modeling
########################################################
#### EXAMPLE 4: Stopping outliers at the top 1% value for all variables. For example if the top 1% has a value of 7, then all values above will be set to 7. Useful when modeling because outlier cases can be used.
df_treated4=prep_outliers(data = df, type='stop', top_percent = 0.01)
# before
summary(df$var1)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0989 0.4455 4.5670 1.3850 2432.0000
```

```
# after, the max value is 7
summary(df_treated4$var1)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000003 0.098870 0.445500 1.007000 1.385000 7.000000
```

## Plots

Note that when `type='set_na'`

, the last points disappear

`ggplot(df_treated3, aes(x=var1)) + geom_histogram(binwidth=.5) + ggtitle("Setting type='set_na' (var1)")`

`## Warning: Removed 11 rows containing non-finite values (stat_bin).`

`ggplot(df_treated4, aes(x=var1)) + geom_histogram(binwidth=.5) + ggtitle("Setting type='stop' (var1)")`