Gain and Lift Analysis

Both metrics are extremely useful to validate the predictive model (binary outcome) quality. More info about scoring data here.

Make sure you have the latest `funModeling` version (>= 1.3).

``````## Loading funModeling
suppressMessages(library(funModeling))
data(heart_disease)``````
``````fit_glm=glm(has_heart_disease ~ age + oldpeak, data=heart_disease, family = binomial)
heart_disease\$score=predict(fit_glm, newdata=heart_disease, type='response')
gain_lift(data=heart_disease,str_score='score',str_target='has_heart_disease')``````

``````##    Population   Gain Lift Score.Point
## 1          10  20.86 2.09   0.8185793
## 2          20  35.97 1.80   0.6967124
## 3          30  48.92 1.63   0.5657817
## 4          40  61.15 1.53   0.4901940
## 5          50  69.06 1.38   0.4033640
## 6          60  78.42 1.31   0.3344170
## 7          70  87.77 1.25   0.2939878
## 8          80  92.09 1.15   0.2473671
## 9          90  96.40 1.07   0.1980453
## 10        100 100.00 1.00   0.1195511``````

How to interpret it?

First, each case is ordered according to the likelihood of being the less representative class, aka, score value.

Then `Gain` column accumulates the positive class, for each 10% of rows - `Population` column.

So for the first row, it can be read as:

"The first 10 percent of the population, ordered by score, collects 20.86% of total positive cases"

For example, if we are sending emails based on this model, and we have a budget to reach only 20% of our users, how many responses we should expect to get? Answer: 35.97%

What about not using a model?

If we don't use a model, and we select randomly 20%, how many users do we have to reach? Well, 20%. That is the meaning of the dashed line, which starts at 0% and ends at 100%. Hopefully, with the predictive model we'll beat the randomness.

The Lift column represents the ratio, between the `Gain` and the gain by chance. Taking as an example the Population=20%, the model is 1.8 times better than randomness 💪.

Using the cut point

What value of the score reaches 30% of the population? Answer: `0.56`

The cut point allows us to segment the data.

Comparing models

In a good model, the gain will reach the 100% "at the beginning" of the population, representing that it separates the classes.

When comparing models, a quick metric is to see if the gain at the beginning of the population (10-30%) is higher.

As a result, the model with a higher gain at the beginning will have captured more information from data.

Let's illustrate it...

Cumulative Gain Analysis: Model 1 reaches the ~20% of positive cases around the 10% of the population, while model 2 reaches a similar proportion approaching the 20% of the population. Model 1 is better.

Lift analysis: Same as before, but also it is suspicious that not every lift number follow a decreasing pattern. Maybe the model is not ordering the first percentiles of the population. Same ordering concepts as seen in `cross_plot`