Gain and Lift Analysis
What is this about?
Gain and lift are both very useful for validating the quality of a predictive model with a binary outcome. More info about scoring data here.
Make sure you have the latest funModeling version (>= 1.3).
    ## Loading funModeling
    suppressMessages(library(funModeling))
    data(heart_disease)

    fit_glm = glm(has_heart_disease ~ age + oldpeak, data = heart_disease, family = binomial)
    heart_disease$score = predict(fit_glm, newdata = heart_disease, type = 'response')
    gain_lift(data = heart_disease, str_score = 'score', str_target = 'has_heart_disease')
    ##    Population   Gain Lift Score.Point
    ## 1          10  20.86 2.09   0.8185793
    ## 2          20  35.97 1.80   0.6967124
    ## 3          30  48.92 1.63   0.5657817
    ## 4          40  61.15 1.53   0.4901940
    ## 5          50  69.06 1.38   0.4033640
    ## 6          60  78.42 1.31   0.3344170
    ## 7          70  87.77 1.25   0.2939878
    ## 8          80  92.09 1.15   0.2473671
    ## 9          90  96.40 1.07   0.1980453
    ## 10        100 100.00 1.00   0.1195511
How to interpret it?
First, the cases are ordered from highest to lowest score, where the score is the predicted likelihood of belonging to the less representative (positive) class.
The Gain column then accumulates the positive class for each 10% of rows (the Population column).
So for the first row, it can be read as:
"The first 10 percent of the population, ordered by score, collects 20.86% of total positive cases"
For example, if we are sending emails based on this model and we have a budget to reach only 20% of our users, what share of the total positive responses should we expect to get? Answer: 35.97%.
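The reading above can be reproduced by hand. The following is a minimal base R sketch, assuming toy data (the `score` and `target` vectors are hypothetical) and a simple "top N rows" slicing rule; it is not necessarily how `gain_lift` computes its bins internally.

```r
# Toy data (hypothetical): 20 cases with a model score and a 0/1 target.
score  <- c(0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50,
            0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.01)
target <- c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1,
            0, 0, 1, 0, 0, 0, 0, 1, 0, 0)

# Order the cases from highest to lowest score.
ord <- order(score, decreasing = TRUE)
target_sorted <- target[ord]

# Number of rows contained in the top 10%, 20%, ..., 100% of the population.
population <- seq(10, 100, by = 10)
n_rows <- ceiling(population * length(target_sorted) / 100)

# Gain: share of all positives captured within each cumulative slice.
gain <- 100 * cumsum(target_sorted)[n_rows] / sum(target_sorted)

gain_table <- data.frame(Population = population, Gain = round(gain, 2))
print(gain_table)
```

In this toy example the top 10% of cases (2 rows) holds 2 of the 8 positives, so the first Gain value is 25%.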
What about not using a model?
If we don't use a model and randomly select 20% of the users, what share of the positive cases should we expect to reach? Well, 20%. That is the meaning of the dashed line, which starts at 0% and ends at 100%. Hopefully, with the predictive model we'll beat randomness.
The Lift column represents the ratio between the Gain and the gain by chance. Taking Population=20% as an example, the model is 1.8 times better than randomness 💪.
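As a quick check of that ratio, the Lift column can be recomputed from the Gain column of the output shown earlier. This is a small sketch, assuming the gain by chance equals the Population value (random selection captures positives in proportion to the share of cases reached).

```r
# Values copied from the gain_lift output shown earlier.
population <- seq(10, 100, by = 10)
gain <- c(20.86, 35.97, 48.92, 61.15, 69.06, 78.42, 87.77, 92.09, 96.40, 100.00)

# Assumption: selecting at random captures positives in proportion to the
# population reached, so the gain by chance equals the Population value.
gain_by_chance <- population

# Lift: how many times better the model is than random selection.
lift <- round(gain / gain_by_chance, 2)
print(data.frame(Population = population, Gain = gain, Lift = lift))
```

For Population=20 this gives 35.97 / 20, which rounds to the 1.8 quoted above.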
Using the cut point
What value of the score reaches 30% of the population? According to the Score.Point column, it is about 0.57 (0.5657817).
The cut point allows us to segment the data.
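One way to use a cut point for segmentation is sketched below in base R. The `score` vector and the `reach_pct` variable are hypothetical, and the rule used (score of the last case inside the top share) is an assumption, not necessarily the exact definition behind Score.Point.

```r
# Toy scores (hypothetical), one per case.
score <- c(0.91, 0.15, 0.67, 0.42, 0.88, 0.30, 0.55, 0.73, 0.08, 0.49)

# Percentage of the population we want to reach.
reach_pct <- 30

# Sort scores from highest to lowest; the cut point is the score of the
# last case that still falls inside the top `reach_pct` percent.
sorted_scores <- sort(score, decreasing = TRUE)
n_selected <- ceiling(reach_pct * length(score) / 100)
cut_point <- sorted_scores[n_selected]

# Segment the data: cases at or above the cut point are selected.
selected <- score >= cut_point
print(cut_point)
print(which(selected))
```

With tied scores, the `>=` comparison may select slightly more cases than the requested percentage.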
In a good model, the gain reaches 100% "at the beginning" of the population, which means the model separates the classes well.
When comparing models, a quick check is whether the gain at the beginning of the population (10-30%) is higher.
The model with the higher gain at the beginning has captured more information from the data.
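This comparison can be sketched in base R as follows. The helper `gain_at` and the toy vectors `target`, `score_1` and `score_2` are hypothetical names introduced only for illustration.

```r
# Share of all positives captured in the top `pct` percent of cases by score.
gain_at <- function(score, target, pct) {
  ord <- order(score, decreasing = TRUE)
  n_top <- ceiling(pct * length(target) / 100)
  100 * sum(target[ord][seq_len(n_top)]) / sum(target)
}

# Toy data (hypothetical): one 0/1 target and the scores of two models.
target  <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
score_1 <- c(0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.6, 0.5, 0.1, 0.05)
score_2 <- c(0.5, 0.9, 0.8, 0.3, 0.7, 0.2, 0.6, 0.4, 0.1, 0.05)

# Gain of each model at the beginning of the population.
pcts <- c(10, 20, 30)
comparison <- data.frame(
  Population   = pcts,
  Gain_model_1 = sapply(pcts, function(p) gain_at(score_1, target, p)),
  Gain_model_2 = sapply(pcts, function(p) gain_at(score_2, target, p))
)
print(comparison)
```

In this toy case the first model ranks three of the four positives within the top 30%, while the second ranks only one there, so the first model would be preferred under this quick check.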
Let's illustrate it...
Cumulative Gain Analysis: Model 1 reaches ~20% of the positive cases at around 10% of the population, while Model 2 reaches a similar proportion only when approaching 20% of the population. Model 1 is better.
Lift analysis: Same as before, but it is also suspicious that not every lift number follows a decreasing pattern. Maybe the model is not ordering the first percentiles of the population correctly.
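The decreasing-pattern check can be automated with a couple of lines of base R. In this sketch, `lift_ok` is the Lift column from the output shown earlier, while `lift_suspicious` and the two helper functions are hypothetical and exist only to illustrate the idea.

```r
# Lift column copied from the gain_lift output shown earlier.
lift_ok <- c(2.09, 1.80, 1.63, 1.53, 1.38, 1.31, 1.25, 1.15, 1.07, 1.00)

# Hypothetical lift values where the second decile beats the first.
lift_suspicious <- c(1.60, 1.85, 1.63, 1.53, 1.38, 1.31, 1.25, 1.15, 1.07, 1.00)

# TRUE when every value is less than or equal to the previous one.
is_non_increasing <- function(x) all(diff(x) <= 0)

# Positions where the lift goes up with respect to the previous row.
rising_positions <- function(x) which(diff(x) > 0) + 1

print(is_non_increasing(lift_ok))
print(is_non_increasing(lift_suspicious))
print(rising_positions(lift_suspicious))
```

A rise in the first rows suggests that the highest scores are not the cases most likely to be positive, which is worth investigating before trusting the ranking.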
Same ordering concepts as seen in