Profiling Target with BoxPlots

What is this about?

Boxplots used in variable importance analysis give a quick view of how much the quartiles of a numeric input differ across the values of a binary target variable.

library(funModeling)
## Loading funModeling !
plotar(data=heart_disease, str_input="age", str_target="has_heart_disease", plot_type = "boxplot")

[Figure: boxplot of age by has_heart_disease]

The rhomboid near the median line represents the mean.

When to use boxplots? When you need to analyze different percentiles across the classes to predict. This is a robust technique, since outliers bias percentiles much less than they bias the mean.
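The robustness point can be checked with a small sketch in base R. The ages below are made-up values (not taken from heart_disease), and the last one is an artificial outlier added only for illustration.

```r
# Hypothetical ages; these values are invented for illustration.
ages <- c(41, 44, 47, 50, 52, 55, 58)

# Same vector plus one artificial extreme value.
ages_with_outlier <- c(ages, 250)

# How far does each statistic move when the outlier is added?
mean_shift   <- mean(ages_with_outlier) - mean(ages)
median_shift <- median(ages_with_outlier) - median(ages)

cat("Shift in mean:  ", mean_shift, "\n")
cat("Shift in median:", median_shift, "\n")
```

The mean moves by tens of units while the median moves by a single unit, which is why the box and its median line stay stable in the presence of a few extreme values.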

Boxplot: Good vs. Bad variable

Passing more than one input variable makes it easy to compare boxplots quickly and so spot the best variables.

plotar(data=heart_disease, str_input=c('max_heart_rate', 'resting_blood_pressure'),  str_target="has_heart_disease", plot_type = "boxplot")

[Figures: boxplots of max_heart_rate and of resting_blood_pressure by has_heart_disease]

max_heart_rate is clearly a better predictor than resting_blood_pressure.

As a general rule, a variable will rank as more important when the boxplots of the target classes are not aligned horizontally, that is, when the boxes sit at clearly different heights.
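A rough numeric counterpart of this visual rule is to compare the quartiles of each input per target class. The sketch below uses only base R and a hypothetical data frame; the column names strong and weak, the helper functions, and all values are assumptions made for illustration, not part of funModeling.

```r
# Hypothetical data: two numeric inputs and a binary target.
df <- data.frame(
  target = rep(c("no", "yes"), each = 6),
  strong = c(160, 165, 170, 172, 175, 180, 120, 125, 130, 135, 140, 145),
  weak   = c(120, 125, 130, 135, 140, 145, 121, 126, 131, 134, 141, 146)
)

# Quartiles (25th, 50th and 75th percentiles) of an input for each class.
# Returns a matrix: one row per percentile, one column per class.
quartiles_by_class <- function(x, target) {
  sapply(split(x, target), quantile, probs = c(0.25, 0.5, 0.75))
}

q_strong <- quartiles_by_class(df$strong, df$target)
q_weak   <- quartiles_by_class(df$weak, df$target)

# Distance between the class medians: large when the boxes are not aligned.
median_gap <- function(q) abs(q["50%", "yes"] - q["50%", "no"])

print(q_strong)
print(q_weak)
cat("Median gap (strong):", median_gap(q_strong), "\n")
cat("Median gap (weak):  ", median_gap(q_weak), "\n")
```

The input whose class medians are far apart corresponds to boxes drawn at different heights, while a gap near zero corresponds to boxes that line up.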

Statistical tests: percentiles are also a feature that some statistical tests use to determine, for example, whether the means across groups are the same or not.
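As one possible illustration, the sketch below runs a rank-based test, wilcox.test from base R's stats package, on two hypothetical inputs. This test is chosen here only as an example of a method built on ranks; it checks for a shift in location between the two groups rather than comparing means directly, and the text above does not name a specific test. All values are invented.

```r
# Hypothetical values of two inputs, split by the binary target.
strong_no  <- c(160, 165, 170, 172, 175, 180)
strong_yes <- c(120, 125, 130, 135, 140, 145)

weak_no  <- c(120, 125, 130, 135, 140, 145)
weak_yes <- c(121, 126, 131, 134, 141, 146)

# Wilcoxon rank-sum test: works on the ranks of the pooled values,
# so a few extreme observations have limited influence on the result.
p_strong <- wilcox.test(strong_no, strong_yes)$p.value
p_weak   <- wilcox.test(weak_no, weak_yes)$p.value

cat("p-value (strong input):", p_strong, "\n")
cat("p-value (weak input):  ", p_weak, "\n")
```

A small p-value for the first input and a large one for the second agree with what the boxplots would show: separated boxes in one case, aligned boxes in the other.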

Exporting plots

plotar and cross_plot can handle from 1 to N input variables, and the plots they generate can easily be exported in high quality with the path_out parameter.

plotar(data=heart_disease, str_input=c('max_heart_rate', 'resting_blood_pressure'),  str_target="has_heart_disease", plot_type = "boxplot", path_out = "my_awsome_folder")

  • Keep this in mind when using histograms and boxplots. They are nice to look at when the variable:
    • Has a good spread, that is, it is not concentrated on just a few (3, 4, ... 6) distinct values, and
    • Has no extreme outliers (this point can be treated with the prep_outliers function present in this package).
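Both conditions can be checked quickly before plotting. The sketch below uses base R only and a made-up vector; it flags outliers with Tukey's 1.5 * IQR rule through boxplot.stats, which is a stand-in chosen for illustration and not the logic of prep_outliers.

```r
# Hypothetical input with one extreme value; numbers are invented.
x <- c(52, 54, 55, 57, 58, 60, 61, 63, 64, 300)

# Spread: how many distinct values does the variable take?
n_distinct <- length(unique(x))

# Outliers: points beyond 1.5 * IQR from the hinges (Tukey's rule),
# the same points a default boxplot would draw individually.
outliers <- boxplot.stats(x)$out

cat("Distinct values:", n_distinct, "\n")
cat("Flagged outliers:", outliers, "\n")
```

A variable with very few distinct values, or with flagged points far from the rest, is a candidate for treatment before relying on its histogram or boxplot.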
