Profiling Target with Density Histograms
What is this about?
Density histograms are a standard tool for plotting distributions and appear in most books and resources. Used for variable selection, they give a quick view of how well a given variable separates the classes.
## Loading funModeling !
library(funModeling)
data(heart_disease)
plotar(data=heart_disease, str_input="age", str_target="has_heart_disease", plot_type = "histdens")
The dashed line represents the variable mean.
Density histograms are helpful to visualize the general shape of a numeric distribution.
This general shape is computed with a technique called kernel smoothing. The general idea is to reduce the high and low peaks (noise) in nearby points or bars by estimating the function that describes the points. Some pictures illustrating the concept: https://en.wikipedia.org/wiki/Kernel_smoother
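As a rough illustration of the smoothing idea, the sketch below computes a Gaussian kernel density estimate by hand on simulated data. The data, the helper name `kde_manual`, and the use of `bw.nrd0` (the rule-of-thumb bandwidth that base R's `density()` uses by default) are assumptions made for this example; it is not necessarily how plotar draws its curve.

```r
# Simulated numeric variable (an assumption; stands in for a column such as age)
set.seed(42)
x <- c(rnorm(150, mean = 50, sd = 8), rnorm(100, mean = 60, sd = 6))

# Hypothetical helper: at each grid point, average Gaussian bumps
# centred on every observation and scaled by the bandwidth
kde_manual <- function(x, grid, bw) {
  sapply(grid, function(g) mean(dnorm((g - x) / bw)) / bw)
}

bw <- bw.nrd0(x)  # rule-of-thumb bandwidth
grid <- seq(min(x) - 3 * bw, max(x) + 3 * bw, length.out = 200)
dens_manual <- kde_manual(x, grid, bw)

# The area under an estimated density should be close to 1
area <- sum(dens_manual) * (grid[2] - grid[1])
```

Overlaying `lines(grid, dens_manual)` on `hist(x, freq = FALSE)` shows the smooth curve on top of the bars. A larger `bw` gives a smoother curve, while a smaller one follows the bars more closely.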
Relationship with statistical test
A statistical test sees something similar: it measures how different the curves are and summarizes that difference in a statistic such as the p-value used in the frequentist approach. This gives the analyst reliable information to determine whether the curves have, for example, the same mean.
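To make the link with a test concrete, here is a minimal sketch using base R's `t.test` (Welch two-sample t-test) on simulated data. The variable names, group labels, and distribution parameters are assumptions chosen for illustration, not values taken from heart_disease.

```r
set.seed(123)
n <- 150
group <- factor(rep(c("no", "yes"), each = n))

# Well-separated curves: the class means differ by 15 units
separated <- c(rnorm(n, mean = 160, sd = 10), rnorm(n, mean = 145, sd = 10))

# Overlapping curves: both classes are drawn from the same distribution
overlapped <- c(rnorm(n, mean = 130, sd = 15), rnorm(n, mean = 130, sd = 15))

# Welch t-test of "same mean in both classes" for each variable
test_separated <- t.test(separated ~ group)
test_overlapped <- t.test(overlapped ~ group)

p_separated <- test_separated$p.value
p_overlapped <- test_overlapped$p.value
```

The separated variable yields a very small p-value, matching what the eye sees in the plot. With funModeling loaded, the same kind of call could be tried on the package data, for instance `t.test(max_heart_rate ~ has_heart_disease, data = heart_disease)`.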
Good vs. bad variable
plotar(data=heart_disease, str_input=c('resting_blood_pressure', 'max_heart_rate'), str_target="has_heart_disease", plot_type = "histdens")
The model will see the same thing: if the curves overlap heavily, as they do for resting_blood_pressure, the variable is not as good a predictor as one whose curves are more spread apart, like max_heart_rate.
Keep this in mind when using histograms and box plots. They work well when the variable:
- Has a good spread, not concentrated on a handful of 3, 4..6 distinct values, and
- Has no extreme outliers (this can be treated with the prep_outliers function present in this package)
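Both conditions can be checked quickly before plotting. The sketch below uses only base R on simulated vectors: it counts distinct values and flags outliers with the 1.5 x IQR box plot rule via `boxplot.stats`. This rule is a stand-in chosen for the example, not the method implemented by prep_outliers, and all variable names and values are assumptions.

```r
set.seed(1)

# Simulated continuous variable with two injected extreme values
x <- c(round(rnorm(200, mean = 130, sd = 15)), 400, 420)

# Simulated low-cardinality variable: only three distinct values
y <- sample(c(1, 2, 3), size = 200, replace = TRUE)

# Condition 1: spread, measured as the number of distinct values
n_distinct_x <- length(unique(x))
n_distinct_y <- length(unique(y))

# Condition 2: extreme outliers, flagged with the 1.5 * IQR box plot rule
outliers_x <- boxplot.stats(x)$out
```

Here `n_distinct_y` is 3, so a density histogram of `y` would say little, and `outliers_x` contains the two injected values, which would stretch the x-axis and squash the rest of the curve.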