The Importance of Scoring Data
The intuition behind
Events can occur, or not... Although we don't have tomorrow's newspaper, we can make a good guess about how it is going to be.
The future is undoubtedly attached to uncertainty, and this uncertainty can be estimated.
And there are different kinds of targets...
For now this book will cover the classical one:
Yes/No target -also known as binary or multiclass prediction.
So, this estimation is the truth value of an event happening, and therefore a probabilistic value between 0 and 1.
Two-label vs. multi-label outcome:
Please note this chapter is written for a binary outcome (two-label outcome), but a multi-label target can be seen as a generalization of the binary case.
For example, with a target that has 4 different values, there can be 4 models that each predict the likelihood of belonging to a certain class, or not, and then a higher-level model that takes the results of those 4 models and predicts the final class.
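As a rough sketch of the idea above, assume the 4 per-class models have already produced their scores. The numbers and class names (`a` to `d`) are made up, and the "higher model" is replaced here by the simplest possible combiner, picking the class with the highest score, which is only one of several ways to do it:

```r
# Hypothetical scores from 4 one-vs-rest models (made-up numbers):
# each column answers "how likely is this case to belong to this class?"
scores <- matrix(
  c(0.70, 0.10, 0.15, 0.05,
    0.20, 0.25, 0.50, 0.05,
    0.05, 0.05, 0.10, 0.80),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("a", "b", "c", "d"))
)

# Simplest combiner (an assumption, not a trained second-stage model):
# for each row, keep the class whose model gave the highest score
final_class <- colnames(scores)[apply(scores, 1, which.max)]
final_class
```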
- Is this client going to buy this product?
- Is this patient going to get better?
- Is certain event going to happen in the next few weeks?
The answers to these last questions are True or False, but the essence is to have a score, or a number indicating the likelihood of a certain event happening.
But we need more control...
Many machine learning resources show the simplified version -which is good to start with- getting the final class as an output. Let's say:
- Question: Is this person going to have a heart disease?
- Answer: "No"
But there is something else before the "Yes/No" answer, and this is the score:
- Question: What is the likelihood for this person of having heart disease?
- Answer: "25%"
So first you get the score, and then according to your needs you set the cut point. And this is really important.
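A minimal sketch of these two steps in R, using made-up scores (the values and the `0.7` alternative cut point are only illustrative):

```r
# Made-up scores for five people (likelihood of having heart disease)
score <- c(0.25, 0.62, 0.48, 0.91, 0.07)

# Step 1 is the score; step 2 is choosing a cut point for the project
cut_point <- 0.5
predicted <- ifelse(score >= cut_point, "yes", "no")
predicted

# A stricter cut point changes the final class, not the score
predicted_strict <- ifelse(score >= 0.7, "yes", "no")
predicted_strict
```

The same score vector produces different final classes depending on the cut point, which is why it is worth keeping the score and not only the yes/no answer.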
Let's see an example
Example table showing the following: target = variable to predict.
Forgetting about the input variables... After creating the predictive model, such as a random forest, we are interested in the scores, even though our final goal is to deliver a yes/no predicted variable.
For example, the following 2 sentences express the same thing: the likelihood of being yes is 0.8 <=> the likelihood of being no is 0.2.
It may go without saying, but the score usually refers to the less represented class (yes in this example).
R Syntax -skip it if you don't want to see code-
The following statement will return the score:
score = predict(randomForestModel, data, type = "prob")[, 2]
Please note for other models this syntax may vary a little, but the concept will remain the same. Even for other languages.
type = "prob" indicates we want the probabilities (or scores).
The predict function with the type = "prob" parameter returns a matrix of 15 rows and 2 columns: the 1st column indicates the likelihood of being no, while the 2nd one indicates the same for class yes.
Since the target variable can be no or yes, [, 2] returns the likelihood of being -in this case- yes (which is the complement of the no likelihood).
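To make the shape of that output concrete without fitting a model, here is a hand-built stand-in for the probability matrix. The values are made up, and the sketch assumes the columns are named after the class levels (worth checking with `colnames()` on your own output):

```r
# Hand-built stand-in for what predict(..., type = "prob") returns:
# one row per case, one column per class (values are made up)
prob <- matrix(
  c(0.2, 0.8,
    0.9, 0.1,
    0.6, 0.4),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("no", "yes"))
)

score_by_position <- prob[, 2]      # relies on "yes" being the 2nd column
score_by_name     <- prob[, "yes"]  # does not depend on the column order

# Each row sums to 1, so the "no" likelihood is the complement of the score
all.equal(prob[, "no"], 1 - score_by_name)
```

Selecting the column by name is a small safeguard in case the class levels are ordered differently from what you expect.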
It's all about the cut point
Now the table is ordered by descending score.
This shows how to extract the final class using the default cut point of 0.5. Tweaking the cut point can lead to a better classification.
Accuracy metrics or the confusion matrix are always attached to a certain cut point value.
After assigning the cut point, we can see the classification results, getting the famous:
- True Positive (TP): It's true that the classification is positive, or, "the model correctly hit the positive (yes) class".
- True Negative (TN): Same as before, but with the negative class (no).
- False Positive (FP): It's false that the classification is positive, or, "the model missed: it predicted yes but the result was no".
- False Negative (FN): Same as before, but with the negative class: "the model predicted negative, but it was positive", or, "the model predicted no, but the class was yes".
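The four counts can be computed directly from the score and the real target. The data below is made up (5 yes and 10 no, with invented scores) and is not the author's example file:

```r
# Made-up data: 5 "yes" and 10 "no", chosen only for illustration
target <- c(rep("yes", 5), rep("no", 10))
score  <- c(0.95, 0.90, 0.85, 0.81, 0.45,
            0.82, 0.60, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.02)

# The counts only exist once a cut point has been chosen
predicted <- ifelse(score >= 0.5, "yes", "no")

TP <- sum(predicted == "yes" & target == "yes")  # predicted yes, was yes
TN <- sum(predicted == "no"  & target == "no")   # predicted no, was no
FP <- sum(predicted == "yes" & target == "no")   # predicted yes, was no
FN <- sum(predicted == "no"  & target == "yes")  # predicted no, was yes

c(TP = TP, TN = TN, FP = FP, FN = FN)
```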
The best and the worst scenario
The analysis of the extremes will help to find the middle point.
The best scenario is when TP and TN rates are 100%. That means the model correctly predicts all the yes and all the no (as a result, FP and FN rates are 0%).
But wait! If you find a perfect classification, it's probably because of overfitting!
The worst scenario -the opposite of the last example- is when FP and FN rates are 100%. Not even randomness can achieve such an awful scenario.
Why? If the classes are balanced, 50/50, flipping a coin will get around half of the results right. This is a common baseline to test whether the model is better than randomness.
In the example provided, the class distribution is 5 for yes and 10 for no; so 33.3% (5/15) is yes.
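The class distribution is quick to check in R. This sketch rebuilds only the target column described above (5 yes, 10 no):

```r
# Target column with the class distribution described in the text
target <- c(rep("yes", 5), rep("no", 10))

# Share of each class; table() orders the labels alphabetically
class_share <- prop.table(table(target))
round(class_share, 3)

# Accuracy obtained by always predicting the most frequent class,
# another simple baseline to compare a model against
majority_baseline <- max(class_share)
```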
Comparing classification results
Trivia: Is a model which correctly predicts this 33.3% (TP rate=100%) a good one?
Answer: It depends on how many 'yes' the model predicted.
A classifier that always predicts yes will have a TP rate of 100%, but it is absolutely useless since lots of its yes predictions will actually be no. As a matter of fact, its FP rate will also be 100%.
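A short sketch of that degenerate classifier, using the same class distribution as the example (5 yes, 10 no):

```r
target <- c(rep("yes", 5), rep("no", 10))

# A "model" that answers yes for every single case
predicted <- rep("yes", length(target))

# TP rate: share of real yes predicted as yes
tp_rate <- sum(predicted == "yes" & target == "yes") / sum(target == "yes")
# FP rate: share of real no predicted as yes
fp_rate <- sum(predicted == "yes" & target == "no") / sum(target == "no")

c(tp_rate = tp_rate, fp_rate = fp_rate)
```

Both rates come out at 100%, which is why the TP rate alone says little about a model.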
Comparing label ordering based on score
A classifier must be trustworthy, and this is what ROC curves measure when plotting the TP rate against the FP rate. The higher the proportion of TP over FP, the higher the Area Under the ROC Curve (AUC).
The intuition behind the ROC curve is to get a sanity measure regarding the score: how well it orders the label. Ideally all the positive labels should be at the top, and the negative ones at the bottom.
model 1 will have a higher AUC than model 2.
Wikipedia has an extensive and good article on this: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
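The "how well does the score order the label" idea can be sketched with the rank-based formula for AUC (equivalent to the Mann-Whitney U statistic). The helper `auc_from_scores` and both score vectors are made up for illustration:

```r
# AUC as the probability that a random "yes" outranks a random "no"
auc_from_scores <- function(score, target, positive = "yes") {
  r <- rank(score)  # average ranks, so ties count as half
  n_pos <- sum(target == positive)
  n_neg <- sum(target != positive)
  (sum(r[target == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

target  <- c("yes", "yes", "yes", "no", "no", "no")
score_1 <- c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1)  # every yes above every no
score_2 <- c(0.9, 0.3, 0.7, 0.8, 0.2, 0.1)  # a no at 0.8 outranks two yes

auc_1 <- auc_from_scores(score_1, target)
auc_2 <- auc_from_scores(score_2, target)
c(auc_1 = auc_1, auc_2 = auc_2)
```

The first score orders the labels perfectly (AUC of 1); the second misorders 2 of the 9 yes/no pairs, so its AUC drops to 7/9.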
Here is a comparison of 4 models, given a cut point of 0.5:
Hands on R!
We'll be analyzing 3 scenarios based on 3 cut-points.
# install.packages("rpivotTable")
# rpivotTable: it creates a pivot table dynamically, it also supports plots, more info at: https://github.com/smartinsightsfromdata/rpivotTable
library(rpivotTable)

## reading the data
data=read.delim(file="example.txt", sep="\t", header = TRUE, stringsAsFactors=FALSE)
Scenario 1 Cut point @ 0.5
Classical confusion matrix, indicating how many cases fall in the intersection of real vs predicted value:
data$predicted_target=ifelse(data$score>=0.5, "yes", "no")
rpivotTable(data = data, rows = "predicted_target", cols="target", aggregatorName = "Count", rendererName = "Table", width="100%", height="400px")
Another view, now each column sums to 100%. Good for answering the following questions:
rpivotTable(data = data, rows = "predicted_target", cols="target", aggregatorName = "Count as Fraction of Columns", rendererName = "Table", width="100%", height="400px")
- What is the percentage of real yes values captured by the model? Answer: 80%. Also known as recall, sensitivity, or true positive rate (TPR).
- What is the percentage of yes thrown by the model? 40%.
So, from the last two sentences:
The model throws 4 out of 10 predictions as yes, and it captures 80% of the real yes cases.
Other view: with 15 cases and 5 real yes, that means 6 yes predictions (40% of 15), of which 4 are correct (80% of 5), so the model hits 2 out of every 3 yes predictions (4/6=0.667, the precision or PPV).
Note: This last kind of analysis can be found when building association rules (market basket analysis) and decision tree models.
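Reading the same confusion matrix by column and by row gives two different rates, which are easy to mix up. A base R sketch with `table()` and `prop.table()`, on made-up data built to give the 80% and 40% figures quoted above (it is not the author's example.txt):

```r
# Made-up data: 5 "yes" and 10 "no", with invented scores
target <- c(rep("yes", 5), rep("no", 10))
score  <- c(0.95, 0.90, 0.85, 0.81, 0.45,
            0.82, 0.60, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.02)
predicted <- ifelse(score >= 0.5, "yes", "no")

counts <- table(predicted = predicted, target = target)

by_column <- prop.table(counts, margin = 2)  # each target column sums to 1
by_row    <- prop.table(counts, margin = 1)  # each predicted row sums to 1

recall    <- by_column["yes", "yes"]   # share of real yes captured
precision <- by_row["yes", "yes"]      # share of yes predictions that hit
yes_share <- mean(predicted == "yes")  # share of yes thrown by the model

round(c(recall = recall, precision = precision, yes_share = yes_share), 3)
```

`margin = 2` corresponds to the "each column sums to 100%" view, while `margin = 1` answers "when the model says yes, how often is it right?".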
Scenario 2 Cut point @ 0.4
Time to change the cut point to 0.4, so the amount of yes will be higher:
data$predicted_target=ifelse(data$score>=0.4, "yes", "no")
rpivotTable(data = data, rows = "predicted_target", cols="target", aggregatorName = "Count as Fraction of Columns", rendererName = "Table", width="100%", height="400px")
Now the model captures 100% of yes (TP), so the total amount of yes produced by the model increased to 46.7%, but at no cost since the TN and FP remained the same.
Scenario 3 Cut point @ 0.8
Want to decrease the FP rate? Set the cut point to a higher value, for example 0.8, which will cause the number of yes produced by the model to decrease:
data$predicted_target=ifelse(data$score>=0.8, "yes", "no")
rpivotTable(data = data, rows = "predicted_target", cols="target", aggregatorName = "Count as Fraction of Columns", rendererName = "Table", width="100%", height="400px")
Now the FP rate decreased (from 20%), and the model still captures 80% of TP, which is the same rate as the one obtained with a cut point of 0.5.
Increasing the cut point to 0.8 improved the model at no cost.
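The three cut points can also be compared side by side without a pivot table. This sketch uses made-up data (not the author's example.txt) and a hypothetical helper, `rates_at`, so the exact rates are only illustrative:

```r
# Made-up data: 5 "yes" and 10 "no", with invented scores
target <- c(rep("yes", 5), rep("no", 10))
score  <- c(0.95, 0.90, 0.85, 0.81, 0.45,
            0.82, 0.60, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.02)

# Rates obtained for a given cut point
rates_at <- function(cut_point, score, target) {
  predicted <- ifelse(score >= cut_point, "yes", "no")
  c(cut_point = cut_point,
    tp_rate   = mean(predicted[target == "yes"] == "yes"),
    fp_rate   = mean(predicted[target == "no"] == "yes"),
    yes_share = mean(predicted == "yes"))
}

# One row per scenario
result <- t(sapply(c(0.4, 0.5, 0.8), rates_at, score = score, target = target))
round(result, 3)
```

Each row is one scenario, which makes it easier to see what a change of cut point buys and what it costs.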
This chapter has focused on the essence of predicting a binary variable: to produce a score or likelihood number which orders the target variable.
A predictive model maps the input with the output.
There is no unique best cut point value; it depends on the project needs, and is constrained by the rate of False Negative we can accept. This live book addresses model performance with ROC curves and lift & gain charts.
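One possible way to turn such a constraint into a cut point, sketched on made-up data: try every observed score as a candidate and keep the highest one whose False Negative rate stays within the limit. The limit `max_fn_rate` is a hypothetical project requirement, and picking the highest admissible value is a design choice (a higher cut point produces fewer yes predictions, and therefore fewer False Positives):

```r
# Made-up data: 5 "yes" and 10 "no", with invented scores
target <- c(rep("yes", 5), rep("no", 10))
score  <- c(0.95, 0.90, 0.85, 0.81, 0.45,
            0.82, 0.60, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.02)

max_fn_rate <- 0.25  # hypothetical limit: miss at most 25% of the real yes

# Every observed score is a candidate cut point
candidates <- sort(unique(score), decreasing = TRUE)

# FN rate for a cut point: share of real yes scored below it
fn_rate <- sapply(candidates, function(cp) mean(score[target == "yes"] < cp))

# Highest cut point that still respects the limit
chosen <- max(candidates[fn_rate <= max_fn_rate])
chosen
```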