Methodological Aspects of Model Validation: Knowing the Error

Error in predictive models

What's this about?

Once we've built a predictive model, how sure are we about its quality? Did it capture the general patterns (the information), while excluding the noise?

What sort of data?

This takes a different approach from the one covered in Out-of-Time Validation. It can be used even when it is not possible to filter cases by date, for example when we only have a snapshot of the data at a certain point in time and no new information will be generated.

For example, health research based on a small group of people, a survey, or a data set available on the internet for practice purposes. In these cases it is expensive, impractical, unethical, or even impossible to add new cases. The heart_disease data set that comes with the funModeling package is one such example.

Reducing unexpected behavior

When a model is trained, it just sees a part of reality. It's a sample from a population that cannot be entirely seen.

There are lots of ways to validate a model (accuracy / ROC curves / lift / gain / etc.). All of these metrics are subject to variance, which means we will get different values on different samples. If you remove some cases and then fit a new model, you'll see a slightly different value.

Imagine we build a model and achieve an accuracy of 81%. Now remove 10% of the cases and fit a new one: the accuracy is now 78.4%. Which is the real accuracy, the one obtained with 100% of the data or the one based on 90%? Moreover, if the model runs live in a production environment, it will see other cases and the accuracy will move again.

So what is the real value, the one to report? Re-sampling and cross-validation techniques average the metric over different sampling and testing criteria in order to approximate the most trustworthy value.

But why remove cases?

Removing cases like that makes no sense in practice, but it gives an idea of how sensitive the accuracy metric is. Remember that you're working with a sample from an unknown population.

If we had a fully deterministic model, one that contained 100% of the cases we are studying, and its predictions were 100% accurate in all cases, we wouldn't need any of this.

Since we always analyze samples, we can only get closer to the real and unknown truth of the data through repetition, re-sampling, cross-validation, and so on.

Let's illustrate this with Cross-Validation (CV)


Image credit: Sebastian Raschka Ref. [1]

CV short summary

  • Split the data into random, equally sized groups, let's say 10. These groups are commonly called folds, and their number is represented by the letter k.
  • Take 9 folds, build a model, and then apply the model to the remaining fold (the one which was left out). This returns the metric you want: accuracy, ROC area, Kappa, etc. We're using accuracy in this example.
  • Repeat this k times (10 in our example), so we get 10 different accuracies. The final result is the average of all of them.

This average is the number used to evaluate whether a model is good or not, and the one to include in a report.
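The three steps above can be sketched in code. Below is a minimal, self-contained Python sketch (the chapter uses R's caret, but the mechanics are the same); the majority-class "model" and the labels are hypothetical stand-ins for a real learner and real data:

```python
import random

def k_fold_cv(n_rows, k, train_and_eval, seed=42):
    # Step 1: split row indices into k random, roughly equal folds
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    # Steps 2-3: hold out each fold once, train on the rest, score on it
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        accuracies.append(train_and_eval(train, held_out))
    # Final result: the average of the k accuracies
    return sum(accuracies) / k

# Hypothetical stand-in for a real model: predict the majority class
# seen in training, then measure accuracy on the held-out fold
labels = ["yes"] * 70 + ["no"] * 30

def majority_eval(train_idx, test_idx):
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

avg_acc = k_fold_cv(n_rows=100, k=10, train_and_eval=majority_eval)
```

In a real setting, `train_and_eval` would fit the actual model on the training indices and score it on the held-out ones; everything else stays the same.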

Practical example

There are 150 rows in the iris data frame. Building a random forest with the caret package using 10-fold cross-validation results in the internal construction of 10 random forests, each one based on 135 rows (9/10 of 150) and reporting an accuracy based on the remaining 15 (1/10 of 150) cases. This procedure is repeated 10 times.

This part of the output:


Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... Each 135 represents a training sample; there are 10 in total, but the output is truncated.

Rather than a single number (the average), we can look at a distribution:

Accuracy on predictive models


  • The minimum and maximum accuracies fall between ~0.8 and ~1.
  • The mean is the value reported by caret.
  • 50% of the time, the accuracy ranges between ~0.93 and ~1.
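For illustration, a summary like the one above can be computed from the per-fold accuracies. A sketch with made-up fold values (not the actual caret output):

```python
import statistics

# Hypothetical per-fold accuracies, similar in shape to the ten
# values caret reports for iris (not the actual output)
fold_acc = [0.80, 0.87, 0.93, 0.93, 0.93, 1.0, 1.0, 1.0, 1.0, 1.0]

mean_acc = statistics.mean(fold_acc)   # the single number usually reported
lo, hi = min(fold_acc), max(fold_acc)  # min/max of the distribution
q1, q2, q3 = statistics.quantiles(fold_acc, n=4)  # 50% of folds fall in [q1, q3]
```

Looking at `lo`, `hi`, and the quartiles alongside `mean_acc` gives a feel for how stable the model is across folds, not just how good it is on average.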

But what is Error?

Error is the sum of bias, variance, and the unexplained error (the inherent noise in the data), the part that the model will never be able to reduce.

These three elements make up the error reported.

What is the nature of Bias and Variance?

When the model doesn't work well, there may be several causes:

  • Model too complicated: Let's say we have lots of input variables; this is related to high variance. The model will overfit the training data, showing poor accuracy on unseen data because it memorized particularities of the training sample.
  • Model too simple: On the other hand, the model may not be capturing all the information in the data due to its simplicity. This is related to high bias.
  • Not enough input data: Data forms shapes in an n-dimensional space (where n is the number of input plus target variables). If there are not enough points, this shape is not developed well enough.

More info in [4].

bias and variance

Image credit: Scott Fortmann-Roe [3]

Complexity vs Accuracy Tradeoff

Complexity vs accuracy balance in predictive models

Bias and variance are related: when one goes down, the other tends to go up, so there is a tradeoff between them. A practical example of this is the Akaike Information Criterion (AIC) measure of model quality.

AIC is used as a heuristic to pick the best time series model in the auto.arima function of the forecast package in R [6]: it chooses the model with the lowest AIC.

The lower, the better: a good fit lowers the value, while the number of parameters increases it.
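That balance can be made concrete with the AIC formula, AIC = 2k − 2·ln(L), where k is the number of parameters and L the model's likelihood. The log-likelihood values below are made up for illustration:

```python
def aic(log_likelihood, n_params):
    # AIC = 2k - 2*ln(L): a better fit lowers it, more parameters raise it
    return 2 * n_params - 2 * log_likelihood

# Two hypothetical models: the complex one fits slightly better
# (higher log-likelihood) but pays for three extra parameters
aic_simple = aic(log_likelihood=-120.0, n_params=3)   # 246.0
aic_complex = aic(log_likelihood=-119.0, n_params=6)  # 250.0

# The lower AIC wins: the small gain in fit doesn't justify the complexity
best = "simple" if aic_simple < aic_complex else "complex"
```

This is exactly the tradeoff auto.arima exploits when scanning candidate ARIMA orders: it keeps the model whose fit-versus-complexity balance yields the lowest AIC.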

Bootstrapping vs Cross-Validation

  • Bootstrapping is mostly used when estimating a parameter.
  • Cross-Validation is the choice when choosing among different predictive models.
  • A post explaining their differences will probably appear soon on the Data Science Heroes Blog.
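To make the first bullet concrete, here is a minimal bootstrap sketch for estimating a parameter — the standard error of a sample mean — using made-up measurements:

```python
import random
import statistics

def bootstrap_se(sample, stat=statistics.mean, n_boot=2000, seed=1):
    # Re-sample the sample with replacement many times and look at how
    # much the statistic moves: that spread estimates its standard error
    rng = random.Random(seed)
    replicates = [stat(rng.choices(sample, k=len(sample)))
                  for _ in range(n_boot)]
    return statistics.stdev(replicates)

# Hypothetical measurements
sample = [4.1, 5.0, 3.8, 4.4, 5.2, 4.9, 4.0, 4.6, 5.1, 4.3]
se_mean = bootstrap_se(sample)
```

The same machinery works for any statistic (median, a regression coefficient, etc.) by swapping the `stat` function, which is why bootstrapping is the usual choice for parameter estimation.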

Note: For a deeper coverage about bias and variance, please go to [3] and [4] at the bottom of the page.

Any advice on practice?

It depends on the data, but it's common to find schemes such as 10-fold CV plus repetition: 10-fold CV repeated 5 times. Another option: 5-fold CV repeated 3 times.

Then use the average of the desired metric. It's also recommended to use ROC, as it is less biased by unbalanced target variables.
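A repeated scheme simply re-runs the k-fold split with a different shuffle each time and averages all k × repeats held-out scores. A minimal sketch, with a toy evaluation function standing in for a real model:

```python
import random

def repeated_cv(n_rows, k, repeats, evaluate):
    # Each repeat reshuffles the rows, splits them into k folds,
    # and scores the model on every held-out fold
    scores = []
    for rep in range(repeats):
        idx = list(range(n_rows))
        random.Random(rep).shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for fold in folds:
            held = set(fold)
            train = [i for i in idx if i not in held]
            scores.append(evaluate(train, fold))
    # Average over all k * repeats held-out scores
    return sum(scores) / len(scores)

# Toy evaluation: pretend even-numbered rows are always classified correctly
toy_eval = lambda train_idx, test_idx: (
    sum(i % 2 == 0 for i in test_idx) / len(test_idx))

avg = repeated_cv(n_rows=100, k=10, repeats=5, evaluate=toy_eval)
```

Repetition smooths out the luck of any single shuffle, which is why a 10-fold × 5-repeat average is more stable than a single 10-fold run.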

Since these validation techniques are time-consuming, consider choosing a model that runs fast, allowing model tuning, testing different configurations, and trying different variables in a "short" amount of time. Random forests are an excellent option, giving fast and accurate results [2].

Another good option is gradient boosting machines; they have more parameters to tune than random forests, but at least in R the implementation runs fast.

Going back to bias and variance

  • Random forests focus on decreasing variance, by averaging many deep trees, while...
  • Gradient boosting machines focus on decreasing bias, by sequentially correcting the previous trees' errors. [5]

Don't forget: Data Preparation

Tweaking the input data by transforming and cleaning it will impact model quality, sometimes more than optimizing the model through its parameters. The Data Preparation chapter of this book is under heavy development. Coming soon.

Final thoughts

  • Validating models through re-sampling / cross-validation helps us estimate the "real" error present in the data. If the model runs in the future, that is the error we should expect.
  • Another advantage is model tuning: avoiding overfitting when selecting the best parameters for a given model. See the example in caret; the equivalent in Python is included in Scikit-Learn.
  • The best test is the one made by you, suited to your data and needs. Try different models and analyze the tradeoff between time consumption and any accuracy metric.

These re-sampling techniques could be among the powerful ideas behind collaborative sites and open-source software: gathering many opinions in order to produce a less-biased solution.

But each opinion has to be reliable; imagine asking several different doctors for a medical diagnosis.

Further reading

