# Gain and Lift Analysis

## What is this about?

Both metrics are extremely useful for validating the quality of a predictive model with a binary outcome. More info about scoring data here.

Make sure you have the latest `funModeling` version (>= 1.3).

```
## Loading funModeling
suppressMessages(library(funModeling))
data(heart_disease)
```

```
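# Fit a logistic regression, then score every row with its predicted probability
fit_glm=glm(has_heart_disease ~ age + oldpeak, data=heart_disease, family = binomial)
heart_disease$score=predict(fit_glm, newdata=heart_disease, type='response')
# Compute gain and lift for each 10% step of the population
gain_lift(data=heart_disease,str_score='score',str_target='has_heart_disease')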
```

```
##    Population   Gain Lift Score.Point
## 1          10  20.86 2.09   0.8185793
## 2          20  35.97 1.80   0.6967124
## 3          30  48.92 1.63   0.5657817
## 4          40  61.15 1.53   0.4901940
## 5          50  69.06 1.38   0.4033640
## 6          60  78.42 1.31   0.3344170
## 7          70  87.77 1.25   0.2939878
## 8          80  92.09 1.15   0.2473671
## 9          90  96.40 1.07   0.1980453
## 10        100 100.00 1.00   0.1195511
```

## How to interpret it?

First, cases are sorted from highest to lowest score, where the score is the predicted likelihood of belonging to the less represented (positive) class.

Then the `Gain` column accumulates the percentage of positive cases captured at each 10% step of the rows, which is given in the `Population` column.

So for the first row, it can be read as:

*"The first 10 percent of the population, ordered by score, collects 20.86% of total positive cases"*

For example, if we are sending emails based on this model and we have the budget to reach only **20%** of our users, what share of the total responses should we expect to get? **Answer: 35.97%**
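To make the accumulation concrete, here is a minimal base R sketch of how such a table could be built. The helper `gain_by_bucket` is hypothetical, the data is simulated, and this is not necessarily how `gain_lift` is implemented internally.

```r
# Hypothetical helper (not part of funModeling): cumulative gain per bucket.
# score:  numeric vector of predicted probabilities
# target: logical vector, TRUE for the positive class
gain_by_bucket <- function(score, target, n_buckets = 10) {
  # Sort cases from highest to lowest score
  ord <- order(score, decreasing = TRUE)
  target_sorted <- target[ord]
  n <- length(target_sorted)

  # Last row of each bucket (10%, 20%, ..., 100% of the rows)
  cut_rows <- ceiling(n * seq_len(n_buckets) / n_buckets)

  # Percentage of all positives captured up to each bucket
  gain <- 100 * cumsum(target_sorted)[cut_rows] / sum(target_sorted)
  population <- 100 * seq_len(n_buckets) / n_buckets

  data.frame(
    Population = population,
    Gain = round(gain, 2),
    Lift = round(gain / population, 2),
    Score.Point = score[ord][cut_rows]
  )
}

# Simulated data (an assumption, not the heart_disease dataset)
set.seed(1)
score <- runif(1000)
target <- rbinom(1000, size = 1, prob = score) == 1

res <- gain_by_bucket(score, target)
print(res)
```

By construction the last row always shows a gain of 100% and a lift of 1, since the whole population contains every positive case.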

## What about not using a model?

If we **don't use a model** and we select 20% of the users at random, what share of the positive cases should we expect to capture? Well, 20%. That is the meaning of the **dashed line**, which starts at 0% and ends at 100%. Hopefully, with the predictive model we'll beat randomness.

The **Lift** column represents the ratio between the `Gain` and the *gain by chance*. Taking Population=20% as an example, the model is **1.8 times better** than randomness 💪.
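As a quick check of that ratio, the sketch below recomputes the lift from the `Population` and `Gain` values printed in the table above, assuming the gain by chance equals the percentage of the population reached.

```r
# Population and Gain columns copied from the output table above
population <- seq(10, 100, by = 10)
gain <- c(20.86, 35.97, 48.92, 61.15, 69.06, 78.42, 87.77, 92.09, 96.40, 100.00)

# Random selection captures positives in proportion to the population reached,
# so the gain by chance is taken to be the population percentage itself.
gain_by_chance <- population

lift <- round(gain / gain_by_chance, 2)
print(data.frame(Population = population, Gain = gain, Lift = lift))
```

For Population=20 this gives 35.97 / 20, which rounds to 1.8.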

### Using the cut point

What score value marks off the top 30% of the population?
Answer: about `0.57` (the `Score.Point` for Population=30 is `0.5657817`)

The cut point allows us to segment the data.
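A minimal sketch of such a segmentation follows. The `scores` vector and the segment labels are hypothetical, chosen only for illustration; the cut point is the `Score.Point` value at Population=30 from the table above.

```r
# Hypothetical scores, for illustration only
scores <- c(0.91, 0.15, 0.62, 0.48, 0.57, 0.33, 0.80, 0.22, 0.70, 0.41)

# Score.Point at Population = 30 in the table above
cut_point <- 0.5657817

# Cases at or above the cut point fall into the segment we act on
segment <- ifelse(scores >= cut_point, "contact", "skip")
print(table(segment))
```

On real data, the same comparison against `heart_disease$score` would flag roughly the top 30% of the rows.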

### Comparing models

In a good model, the gain approaches 100% "at the beginning" of the population, which indicates that the model separates the classes well.

When comparing models, a quick check is to see which one has the higher gain at the beginning of the population (10-30%).

As a result, the model with the higher gain at the beginning will have captured more information from the data.
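This check can be sketched on simulated data as follows. Both "models" here are assumptions made for illustration: the first scores carry real signal, the second are pure noise, and `gain_at` is a hypothetical helper rather than a funModeling function.

```r
# Hypothetical helper: cumulative gain (%) at the given population percentages
gain_at <- function(score, target, pct = c(10, 20, 30)) {
  target_sorted <- target[order(score, decreasing = TRUE)]
  rows <- ceiling(length(target_sorted) * pct / 100)
  round(100 * cumsum(target_sorted)[rows] / sum(target_sorted), 2)
}

# Simulated data (an assumption): model 1 scores carry signal, model 2 is noise
set.seed(42)
n <- 2000
p_true <- runif(n)
target <- rbinom(n, size = 1, prob = p_true) == 1
score_model_1 <- p_true
score_model_2 <- runif(n)

comparison <- data.frame(
  Population = c(10, 20, 30),
  Gain.Model1 = gain_at(score_model_1, target),
  Gain.Model2 = gain_at(score_model_2, target)
)
print(comparison)
```

The noise model should stay close to the gain by chance, while the informative one should sit clearly above it in every row.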

Let's illustrate it...

**Cumulative Gain Analysis**: Model 1 reaches ~20% of the positive cases at around 10% of the population, while model 2 reaches a similar proportion only when approaching 20% of the population. *Model 1 is better.*

**Lift analysis**: Same as before, but it is also suspicious when the lift values do not follow a decreasing pattern. This may mean that the model is not ordering the first percentiles of the population correctly.
The same ordering concepts apply as seen in `cross_plot`.
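A simple way to spot this pattern is to test whether the lift values ever increase from one row to the next. In the sketch below, `lift_ok` is the `Lift` column from the table earlier in this section, while `lift_suspicious` is a made-up vector used only to show a failing case.

```r
# Lift column from the gain_lift output earlier in this section
lift_ok <- c(2.09, 1.80, 1.63, 1.53, 1.38, 1.31, 1.25, 1.15, 1.07, 1.00)

# Hypothetical lift values where the second bucket beats the first
lift_suspicious <- c(1.60, 1.85, 1.63, 1.53, 1.38, 1.31, 1.25, 1.15, 1.07, 1.00)

# TRUE when the lift never increases along the population buckets
is_non_increasing <- function(lift) all(diff(lift) <= 0)

print(is_non_increasing(lift_ok))
print(is_non_increasing(lift_suspicious))

# Buckets where the lift went up compared with the previous bucket
print(which(diff(lift_suspicious) > 0) + 1)
```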