# The Importance of Scoring Data

### The intuition behind

Events can occur, or not... Although we don't have tomorrow's newspaper, we can make a good guess about what it is going to say.

The future is undoubtedly attached to *uncertainty*, and this uncertainty can be estimated.

### And there are different targets...

For now this book will cover the classical `Yes`/`No` target, also known as binary (two-class) prediction.

So, this estimation is the *value of truth* of an event happening, and therefore a probabilistic value between 0 and 1.

#### Two-label vs. multi-label outcome

Please note this chapter is written for a binary outcome (two-label outcome), but a **multi-label** target can be seen as a generalization of the binary case.

For example, given a target with 4 different values, there can be 4 models, each predicting the likelihood of belonging to a certain class or not, and then a higher-level model which takes the results of those 4 models and predicts the final class.
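As a rough illustration of that idea, the sketch below assumes the 4 per-class models have already produced their scores (the `scores` matrix and the class names `a` to `d` are made up). It uses the simplest possible stand-in for the higher-level model: picking the class with the highest score. A real higher-level model could be something more elaborate.

```r
# Hypothetical scores for 3 cases: one column per class, each column
# coming from a separate "this class or not" model.
scores <- matrix(
  c(0.10, 0.70, 0.15, 0.30,
    0.60, 0.20, 0.55, 0.10,
    0.05, 0.10, 0.80, 0.25),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("a", "b", "c", "d"))
)

# Stand-in for the higher-level model: keep the class with the highest score.
predicted_class <- colnames(scores)[max.col(scores, ties.method = "first")]
print(predicted_class)
```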

### Say what?

Some examples:

- Is this client going to buy this product?
- Is this patient going to get better?
- Is a certain event going to happen in the next few weeks?

The answers to these last questions are True or False, but **the essence is to have a score**, or a number indicating the likelihood of a certain event happening.

### But we need more control...

Many machine learning resources show the simplified version, which is good to start with, getting the final class as an output. Let's say:

Simplified approach:

- Question: *Is this person going to have a heart disease?*
- Answer: "No"

But there is something else before the "Yes/No" answer, and this is the score:

- Question: *What is the likelihood for this person of having heart disease?*
- Answer: "25%"

So first you get the score, and then according to your needs you set the **cut point**. And this is **really** important.
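A minimal sketch of that two-step idea, using the 25% score from the question above. The alternative cut point of `0.2` is only an assumption for illustration:

```r
# Step 1: the model gives a score (likelihood of having heart disease).
score <- 0.25

# Step 2: the cut point turns the score into the final class.
class_default <- ifelse(score >= 0.5, "yes", "no")  # default cut point
class_strict  <- ifelse(score >= 0.2, "yes", "no")  # hypothetical, lower cut point

print(c(default = class_default, strict = class_strict))
```

The same score leads to a different answer depending on the cut point, which is why the cut point has to follow the needs of the project.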

### Let's see an example

Example table showing the following:

- `id` = identity
- `x1`, `x2` and `x3` = input variables
- `target` = variable to predict

Forgetting about the input variables... After creating the predictive model, such as a random forest, we are interested in the **scores**, even though our final goal is to deliver a `yes`/`no` predicted variable.

For example, the following 2 sentences express the same: *The likelihood of being `yes` is `0.8`* <=> *The likelihood of being `no` is `0.2`*.

It may be obvious, but the score usually refers to the less represented class: `yes`.

**R Syntax** -*skip it if you don't want to see code*-

The following statement will return the score:

`score = predict(randomForestModel, data, type = "prob")[, 2]`

Please note that for other models this syntax may vary a little, but the concept **will remain the same**, even in other languages.

Here `prob` indicates we want the probabilities (or scores).

The `predict` function with the `type = "prob"` parameter returns a matrix of 15 rows and 2 columns: the 1st indicates the likelihood of being `no`, while the 2nd one indicates the same for class `yes`.

Since the target variable can be `no` or `yes`, the `[, 2]` returns the likelihood of being, in this case, `yes` (which is the complement of the `no` likelihood).
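To make the shape of that matrix concrete without fitting a model, the sketch below builds a small, made-up probability matrix by hand (3 rows instead of 15). It assumes the columns are named after the classes, in the order `no`, `yes`; the values are invented.

```r
# Hand-made stand-in for the matrix returned by predict(..., type = "prob").
prob <- matrix(
  c(0.20, 0.80,
    0.90, 0.10,
    0.55, 0.45),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("no", "yes"))
)

# The score is the 2nd column: the likelihood of being "yes".
score <- prob[, 2]

# Each row sums to 1, so the "yes" score is the complement of the "no" one.
print(rowSums(prob))
print(all.equal(score, 1 - prob[, "no"]))
```

Selecting the column by name, `prob[, "yes"]`, is a bit safer than `[, 2]` when the order of the classes is not known in advance.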

### It's all about the cut point

Now the table is ordered by descending score.

This is meant to show how to extract the final class, with the cut point at `0.5` by default. Tweaking the cut point can lead to a better classification.

Accuracy metrics or the confusion matrix are always attached to a certain cut point value.

After assigning the cut point, we can see the classification results, getting the famous:

- **True Positive** (TP): It's *true* that the classification is *positive*, or, "the model correctly hit the positive (`yes`) class".
- **True Negative** (TN): Same as before, but with the negative class (`no`).
- **False Positive** (FP): It's *false* that the classification is *positive*, or, "the model missed: it predicted `yes` but the result was `no`".
- **False Negative** (FN): Same as before, but with the negative class: "the model predicted negative, but it was positive", or, "the model predicted `no`, but the class was `yes`".

### The best and the worst scenario

The analysis of the extremes will help to find the middle point.

The best scenario is when the **TP** and **TN** rates are 100%. That means the model correctly predicts all the `yes` and all the `no` *(as a result, the FP and FN rates are 0%)*.

But wait! If you find a perfect classification, it's probably because of overfitting!

The worst scenario, the opposite of the last example, is when the **FP** and **FN** rates are 100%. Not even randomness can achieve such an awful scenario.

*Why?* If the classes are balanced, 50/50, flipping a coin will get around half of the results right. This is a common baseline to test whether the model is better than randomness.

In the example provided, the class distribution is 5 for `yes` and 10 for `no`; so 33.3% (5/15) is `yes`.
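The four outcomes can be counted with a few lines of base R. The sketch below uses made-up toy data, not the chapter's example file: it only shares the same class distribution (5 `yes`, 10 `no`), and the scores are invented.

```r
# Toy data (hypothetical): 5 "yes" and 10 "no", with invented scores.
data <- data.frame(
  target = rep(c("yes", "no"), times = c(5, 10)),
  score  = c(0.95, 0.90, 0.85, 0.83, 0.45,
             0.82, 0.60, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.08, 0.05)
)

# Final class with the default cut point.
data$predicted_target <- ifelse(data$score >= 0.5, "yes", "no")

# Count each of the four outcomes.
tp <- sum(data$predicted_target == "yes" & data$target == "yes")
tn <- sum(data$predicted_target == "no"  & data$target == "no")
fp <- sum(data$predicted_target == "yes" & data$target == "no")
fn <- sum(data$predicted_target == "no"  & data$target == "yes")

print(c(TP = tp, TN = tn, FP = fp, FN = fn))
print(mean(data$target == "yes"))  # share of "yes" in the data: 5/15
```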

### Comparing classifiers

#### Comparing classification results

**Trivia**: Is a model which correctly predicts this 33.3% (TP rate=100%) a good one?

*Answer*: It depends on how many `yes` the model predicted.

A classifier that always predicts `yes` will have a TP rate of 100%, but is absolutely useless since lots of the predicted `yes` will actually be `no`. As a matter of fact, the FP rate will be high.
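A quick check of that claim, as a sketch on a made-up target vector with the same 5 `yes` / 10 `no` distribution:

```r
# Hypothetical target with 5 "yes" and 10 "no".
target <- rep(c("yes", "no"), times = c(5, 10))

# A "classifier" that always answers "yes".
predicted <- rep("yes", length(target))

# TP rate: share of the real "yes" that were predicted as "yes".
tp_rate <- sum(predicted == "yes" & target == "yes") / sum(target == "yes")
# FP rate: share of the real "no" that were predicted as "yes".
fp_rate <- sum(predicted == "yes" & target == "no") / sum(target == "no")

print(c(tp_rate = tp_rate, fp_rate = fp_rate))
```

Both rates come out at 100%: every real `yes` is captured, but so is every real `no`.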

#### Comparing ordering label based on score

A classifier must be trustworthy, and this is what the **ROC** curve measures when plotting the TP rate against the FP rate. The higher the proportion of TP over FP, the higher the Area Under the ROC Curve (AUC) is.

The intuition behind the ROC curve is to get a **sanity measure** regarding the **score**: how well it orders the label. Ideally all the positive labels must be at the top, and the negative ones at the bottom.

`model 1` will have a higher AUC than `model 2`.
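One way to see the "ordering" intuition in numbers is to compute the AUC from the ranks of the scores (the rank-sum formulation). This is only a sketch: the helper `auc_by_rank` and the two score vectors are hypothetical, and they are not the models from the chapter's example.

```r
# AUC from ranks: the probability that a random "yes" gets a higher
# score than a random "no" (ties count as one half).
auc_by_rank <- function(score, target, positive = "yes") {
  r     <- rank(score)                 # average ranks handle ties
  n_pos <- sum(target == positive)
  n_neg <- sum(target != positive)
  (sum(r[target == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

target <- c("yes", "yes", "yes", "no", "no", "no")

# Hypothetical model 1: every "yes" is scored above every "no".
score_model_1 <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
# Hypothetical model 2: one "yes" is buried below two "no".
score_model_2 <- c(0.9, 0.3, 0.7, 0.8, 0.4, 0.2)

auc_1 <- auc_by_rank(score_model_1, target)
auc_2 <- auc_by_rank(score_model_2, target)
print(c(model_1 = auc_1, model_2 = auc_2))
```

The perfectly ordered scores reach an AUC of 1, while the mixed ordering gets a lower value, without choosing any cut point.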

Wikipedia has an extensive and good article on this: https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Here is the comparison of 4 models, given a cut point of 0.5:

### Hands on R!

We'll be analyzing 3 scenarios based on 3 cut-points.

```
# install.packages("rpivotTable")
# rpivotTable: creates a pivot table dynamically and also supports plots; more info at: https://github.com/smartinsightsfromdata/rpivotTable
library(rpivotTable)

## reading the data (tab-separated file with a header row)
data = read.delim(file = "example.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
```

**Scenario 1** Cut point @ `0.5`

Classical confusion matrix, indicating how many cases fall in the intersection of real vs predicted value:

```
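# label each case using the 0.5 cut point
data$predicted_target = ifelse(data$score >= 0.5, "yes", "no")
# rows = predicted class, columns = real class, cells = number of cases
rpivotTable(data = data, rows = "predicted_target", cols = "target", aggregatorName = "Count", rendererName = "Table", width = "100%", height = "400px")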
```

Another view, where each column now sums to **100%**:

`rpivotTable(data = data, rows = "predicted_target", cols="target", aggregatorName = "Count as Fraction of Columns", rendererName = "Table", width="100%", height="400px")`

It is good for answering the following questions:

- *What is the percentage of real `yes` values captured by the model?* Answer: 80%. Also known as **Recall** (sensitivity, or TP rate).
- *What is the percentage of `yes` thrown by the model?* Answer: 40%.

So, from the last two sentences:

**The model throws 4 out of 10 predictions as `yes`, and it captures 80% of the real `yes` cases.**

Another view: the model is right in about 2 out of every 3 `yes` predictions *(80% of the 5 real `yes` is 4 hits, out of the 6 cases, 40% of 15, predicted as `yes`: 4/6=66.7%)*.

Note: This last kind of analysis also appears when building association rules (market basket analysis) and decision tree models.
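For readers without `rpivotTable`, the same views can be approximated with base R's `table` and `prop.table`. The sketch below runs on made-up toy data, not on `example.txt`; the scores are invented so that the counts match the figures discussed in this scenario.

```r
# Toy data (hypothetical): 5 "yes" and 10 "no", with invented scores.
data <- data.frame(
  target = rep(c("yes", "no"), times = c(5, 10)),
  score  = c(0.95, 0.90, 0.85, 0.83, 0.45,
             0.82, 0.60, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.08, 0.05)
)
data$predicted_target <- ifelse(data$score >= 0.5, "yes", "no")

# Confusion matrix: rows = predicted class, columns = real class.
counts <- table(predicted = data$predicted_target, real = data$target)
print(counts)

# Same table as a fraction of each column (each column sums to 1).
by_column <- prop.table(counts, margin = 2)
print(round(by_column, 2))

captured_yes  <- by_column["yes", "yes"]                      # share of real "yes" captured
share_of_yes  <- mean(data$predicted_target == "yes")         # share of "yes" thrown by the model
precision_yes <- counts["yes", "yes"] / sum(counts["yes", ])  # correct "yes" among predicted "yes"
```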

**Scenario 2** Cut point @ `0.4`

Time to change the cut point to `0.4`, so the amount of `yes` will be higher:

```
# lower cut point: more cases get labeled as "yes"
data$predicted_target = ifelse(data$score >= 0.4, "yes", "no")
rpivotTable(data = data, rows = "predicted_target", cols = "target", aggregatorName = "Count as Fraction of Columns", rendererName = "Table", width = "100%", height = "400px")
```

Now the model captures `100%` of the `yes` (TP), and the total amount of `yes` produced by the model increased to `46.7%`, but at no cost, since the *TN and FP remained the same*.

**Scenario 3** Cut point @ `0.8`

Want to decrease the FP rate? Set the cut point to a higher value, for example `0.8`, which will cause the amount of `yes` produced by the model to decrease:

```
# higher cut point: fewer cases get labeled as "yes"
data$predicted_target = ifelse(data$score >= 0.8, "yes", "no")
rpivotTable(data = data, rows = "predicted_target", cols = "target", aggregatorName = "Count as Fraction of Columns", rendererName = "Table", width = "100%", height = "400px")
```

Now the FP rate decreased to `10%` (from `20%`), and the model still captures `80%` of the `yes` (TP rate), which is the same rate as the one obtained with a cut point of `0.5`.

**Increasing the cut point to 0.8 improved the model at no cost.**
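The three scenarios can also be compared side by side. The sketch below is an assumption-laden illustration: `rates_at` is a hypothetical helper, and the toy data is invented (it is not `example.txt`), with scores chosen so that the rates match the ones discussed in the scenarios.

```r
# Toy data (hypothetical): 5 "yes" and 10 "no", with invented scores.
data <- data.frame(
  target = rep(c("yes", "no"), times = c(5, 10)),
  score  = c(0.95, 0.90, 0.85, 0.83, 0.45,
             0.82, 0.60, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.08, 0.05)
)

# TP rate and FP rate for a given cut point.
rates_at <- function(cut_point, score, target) {
  predicted <- ifelse(score >= cut_point, "yes", "no")
  c(cut_point = cut_point,
    tp_rate = sum(predicted == "yes" & target == "yes") / sum(target == "yes"),
    fp_rate = sum(predicted == "yes" & target == "no") / sum(target == "no"))
}

# One row per cut point.
comparison <- t(sapply(c(0.4, 0.5, 0.8), rates_at,
                       score = data$score, target = data$target))
print(comparison)
```

Reading the table row by row shows the trade-off directly: each cut point is a different balance between the `yes` cases captured and the false alarms accepted.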

#### Conclusions

This chapter has focused on the essence of predicting a binary variable: to produce a score or likelihood number which **orders** the target variable.

A predictive model maps the input to the output.

There is no unique, best **cut point value**; it relies on the project needs, and is constrained by the rates of `False Positive` and `False Negative` we can accept.

This live book addresses model performance through ROC curves and lift & gain charts.