# Methodological Aspects on Model Validation

## Knowing the Error

### What's this about?

Once we've built a predictive model, how sure are we about its quality? Did it capture general patterns (the *information*) while excluding the *noise*?

#### What sort of data?

This chapter takes a different approach from the one covered in Out-of-Time Validation. It can be used even when it is not possible to filter cases by date, for example when you have a snapshot of the data at a certain point in time and no new information will be generated.

Examples include health research data from a small group of people, a survey, or data available on the internet for practice purposes. It's either expensive, impractical, unethical or even impossible to add new cases. The `heart_disease` data that comes with the `funModeling` package is such an example.

### Reducing unexpected behavior

When a model is trained, it sees only a part of reality: a sample from a population that cannot be seen in full.

There are lots of ways to validate a model (Accuracy / ROC curves / Lift / Gain / etc). All of these metrics are **subject to variance**, which implies **getting different values**. If you remove some cases and then fit a new model, you'll see a *slightly* different value.

Imagine we fit a model and achieve an accuracy of `81`. Now remove 10% of the cases and fit a new model: the accuracy is now `78.4`. **What is the real accuracy?** The one obtained with 100% of the data, or the one based on 90%? If the model runs live in a production environment, for example, it will see **other cases** and the accuracy will move to yet another value.

*So what is the real value? The one to report?* **Re-sampling** and **cross-validation** techniques average the metric over different sampling and testing schemes in order to approximate the most trustworthy value.

**But why remove cases?**

Removing cases like that makes no sense in itself, but it gives an idea of how sensitive the accuracy metric is. Remember that you're working with a sample from an **unknown population**.
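To make the sensitivity concrete, here is a minimal sketch in plain Python (not the `caret` workflow this chapter relies on). Everything in it is an assumption made for illustration: the synthetic one-dimensional data, the toy threshold model, and the helper names `make_data`, `fit_threshold` and `accuracy`. It drops a random 10% of the training cases several times, refits, and scores each refit on the same held-out cases.

```python
import random
import statistics

def make_data(n, rng):
    """Synthetic 1-D data: class 0 centered at 0, class 1 centered at 1.5."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.5 * label, 1.0)
        data.append((x, label))
    return data

def fit_threshold(rows):
    """Toy model: cut halfway between the two class means."""
    m0 = statistics.mean(x for x, y in rows if y == 0)
    m1 = statistics.mean(x for x, y in rows if y == 1)
    return (m0 + m1) / 2

def accuracy(threshold, rows):
    """Share of rows where 'x above the threshold' matches the label."""
    hits = sum((x > threshold) == bool(y) for x, y in rows)
    return hits / len(rows)

rng = random.Random(42)
train = make_data(200, rng)
holdout = make_data(200, rng)

accuracies = []
for _ in range(20):
    # Drop a random 10% of the training cases, refit, and re-score.
    subset = rng.sample(train, k=int(0.9 * len(train)))
    accuracies.append(accuracy(fit_threshold(subset), holdout))

print(f"min={min(accuracies):.3f} max={max(accuracies):.3f} "
      f"mean={statistics.mean(accuracies):.3f}")
```

The gap between the minimum and the maximum is the kind of movement described above; how wide it is depends on the data and on the model.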

If we had a fully deterministic model, one that contained 100% of the cases we are studying and whose predictions were 100% accurate in all of them, we wouldn't need any of this.

Since we always analyze samples, we just need to get closer to the real and unknown truth of the data through repetition, re-sampling, cross-validation, and so on.

### Let's illustrate this with Cross-Validation (CV)

*Image credit: Sebastian Raschka* Ref. [1]

#### CV short summary

- Splits the data into random, equally sized groups, let's say `10`. These groups are commonly called `folds`, and their number is represented by the letter `k`.
- Take `9` folds, build a model, and then apply the model to the remaining fold (the one that was left out). This returns the metric you want: accuracy, ROC, Kappa, etc. We're using accuracy in this example.
- Repeat this `k` times (`10` in our example), so we get `10` different accuracies. The final result is the average of all of them.

This average is the number used to judge whether a model is good or not, and also the one to include in a report.
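The three steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not how `caret` implements it: the data is synthetic, the model is a toy nearest-centroid classifier, and the names `k_fold_indices`, `fit_centroids`, `predict` and `cross_validate` are made up for this example.

```python
import random
import statistics

def k_fold_indices(n_rows, k, rng):
    """Shuffle row indices and deal them into k near-equal folds."""
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_centroids(rows):
    """Toy model: the mean of x for each class."""
    by_class = {}
    for x, label in rows:
        by_class.setdefault(label, []).append(x)
    return {label: statistics.mean(xs) for label, xs in by_class.items()}

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def cross_validate(rows, k, rng):
    """Return one accuracy per fold, each computed on the left-out fold."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, rng):
        held_out_set = set(held_out)
        train = [row for i, row in enumerate(rows) if i not in held_out_set]
        test = [rows[i] for i in held_out]
        model = fit_centroids(train)
        hits = sum(predict(model, x) == label for x, label in test)
        scores.append(hits / len(test))
    return scores

rng = random.Random(0)
# Synthetic data: 75 rows per class, 150 rows in total.
rows = [(rng.gauss(2.0 * label, 1.0), label) for label in (0, 1) for _ in range(75)]
scores = cross_validate(rows, k=10, rng=rng)
print("per-fold accuracy:", [round(s, 2) for s in scores])
print("reported CV accuracy:", round(statistics.mean(scores), 3))
```

Every row is used for testing exactly once and for training `k - 1` times, which is why the averaged score is less tied to any single split.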

#### Practical example

There are 150 rows in the `iris` data frame. Building a `random forest` with the `caret` package using `cross-validation` results in the internal construction of 10 random forests, one per fold: each is trained on 135 rows (9/10 * 150) and reports an accuracy based on the remaining 15 cases (1/10 * 150).

This part of the output:

`Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...`

shows the training sample sizes: each `135` is one training sample. There are 10 in total, but the output is truncated.

Rather than a single number (the average), we can look at a distribution:

- The min/max accuracy will be between `~0.8` and `~1`.
- The mean is the one reported by `caret`.
- 50% of the time it will range between `~0.93` and `~1`.
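A summary like the one above can be computed from the per-fold accuracies. The sketch below uses hypothetical values invented for illustration (they are not the numbers behind this chapter's results), and the variable names are assumptions.

```python
import statistics

# Hypothetical per-fold accuracies, invented for illustration only.
fold_accuracies = [0.87, 0.93, 1.00, 0.93, 0.80, 1.00, 0.93, 1.00, 0.93, 1.00]

# Quartiles split the sorted accuracies into four equal parts;
# the middle 50% of the folds lies between q1 and q3.
q1, median, q3 = statistics.quantiles(fold_accuracies, n=4, method="inclusive")

summary = {
    "min": min(fold_accuracies),
    "q1": q1,
    "median": median,
    "mean": statistics.mean(fold_accuracies),
    "q3": q3,
    "max": max(fold_accuracies),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

Reporting the spread next to the mean shows how much the result depends on which cases ended up in each fold.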

### But what is Error?

The error is the sum of **Bias**, **Variance** and the **unexplained error** (the inner noise in the data), which is the part that the model will never be able to reduce.

These 3 elements make up the reported error.

#### What is the nature of Bias and Variance?

When the model doesn't work well, there may be several causes:

- **Model too complicated**: Let's say we have lots of input variables, which is related to **high variance**. The model will overfit the training data and have low accuracy on unseen data, because it is too specific to the cases it was trained on.
- **Model too simple**: On the other hand, the model may not capture all the information in the data due to its simplicity. This is related to **high bias**.
- **Not enough input data**: Data forms shapes in an n-dimensional space (where `n` is the number of input plus target variables). If there are not enough points, this shape is not developed well enough.

More info here in [4].

*Image credit: Scott Fortmann-Roe* [3]

#### Complexity vs Accuracy Tradeoff

Bias and variance are related in the sense that when one goes down the other tends to go up, so there is a **tradeoff** between them. A practical example of this is the Akaike Information Criterion (AIC) model quality measure.

**AIC** is used as a heuristic to pick the best **time series model** in the `auto.arima` function of the `forecast` package in `R` [6]. It chooses the model with the lowest AIC.

The lower the better: a better fit to the data lowers the value, while each additional parameter increases it.
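For reference, the standard textbook definition of AIC, which this chapter does not spell out, is:

$$
\mathrm{AIC} = 2k - 2\ln(\hat{L})
$$

Here $k$ is the number of estimated parameters and $\hat{L}$ is the maximized value of the model's likelihood. A better fit raises $\hat{L}$ and lowers the AIC, while every extra parameter adds 2 to it, which is the complexity penalty described above.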

#### Bootstrapping vs Cross-Validation

- **Bootstrapping** is mostly used when estimating a parameter.
- **Cross-Validation** is the choice when choosing among different predictive models.
- Coming soon: a post on the Data Science Heroes Blog explaining their differences.
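As a minimal sketch of the first point, the snippet below bootstraps the mean of a sample: it resamples with replacement many times and reads a rough percentile interval off the resampled means. The data is synthetic, and the function name `bootstrap_means` and the choice of 1000 resamples are assumptions made for this example.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples, rng):
    """Mean of each resample, drawn with replacement at the original size."""
    return [
        statistics.mean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    ]

rng = random.Random(7)
# Hypothetical measurements, generated only for illustration.
sample = [rng.gauss(50.0, 10.0) for _ in range(100)]

means = sorted(bootstrap_means(sample, n_resamples=1000, rng=rng))
# Rough 95% percentile interval: cut about 2.5% of the resamples on each side.
lower, upper = means[25], means[974]

print(f"sample mean={statistics.mean(sample):.2f}, "
      f"bootstrap interval=({lower:.2f}, {upper:.2f})")
```

Unlike cross-validation, nothing is held out here: the goal is the uncertainty around one estimated quantity, not a comparison of predictive models.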

Note: For deeper coverage of bias and variance, please see [3] and [4] at the bottom of the page.

### Any advice on practice?

It depends on the data, but it's common to find examples using `10 fold CV`, sometimes with repetition: `10 fold CV, repeated 5 times`. Another option: `5 fold CV, repeated 3 times`.
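Repetition simply means re-running the whole k-fold procedure with a fresh random partition each time. A minimal sketch of the index bookkeeping, assuming a hypothetical helper named `repeated_k_fold` (this is not `caret`'s implementation):

```python
import random

def repeated_k_fold(n_rows, k, repeats, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)  # a fresh random partition on every repeat
        for fold in range(k):
            test_idx = indices[fold::k]
            test_set = set(test_idx)
            train_idx = [i for i in indices if i not in test_set]
            yield repeat, fold, train_idx, test_idx

# "10 fold CV, repeated 5 times" on a data set of 150 rows.
splits = list(repeated_k_fold(n_rows=150, k=10, repeats=5))
print(len(splits), "train/test splits")
```

With 10 folds and 5 repeats the model is fitted 50 times, which is why the running time of the chosen model matters.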

In each case, report the average of the desired metric. It's also recommended to use the `ROC` metric, since it is less biased by unbalanced target variables.
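The sketch below illustrates why. It uses a hypothetical target with 95 negatives and 5 positives and a useless model that treats every case the same; the data and the helper names `accuracy` and `roc_auc` are assumptions made for this example. The area under the ROC curve is computed as the share of (positive, negative) pairs that the scores rank correctly.

```python
def accuracy(labels, predictions):
    """Share of cases where the predicted class equals the label."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical unbalanced target: 95 negatives and 5 positives.
labels = [0] * 95 + [1] * 5

# A useless model: the same score for every case, always predicting class 0.
constant_scores = [0.1] * 100
always_negative = [0] * 100

naive_accuracy = accuracy(labels, always_negative)
naive_auc = roc_auc(labels, constant_scores)
print(f"accuracy={naive_accuracy:.2f}, AUC={naive_auc:.2f}")
```

The accuracy looks high only because negatives dominate, while the AUC stays at chance level and exposes that the model cannot separate the classes.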

Since these validation techniques are **time consuming**, consider choosing a model that runs fast, allowing model tuning, testing different configurations and trying different variables in a "short" amount of time. Random forests are an excellent option that gives **fast** and **accurate** results. Ref. [2].

Another good option is **gradient boosting machines**: they have more parameters to tune than random forest, but at least in R the implementation runs really fast.

#### Going back to bias and variance

- Random forest focuses on decreasing variance, by averaging many fully grown trees, while...
- Gradient boosting machine focuses on decreasing bias, by adding weak learners one after another. [5]

### Don't forget: Data Preparation

Tweaking the input data by transforming and cleaning it will impact model quality, sometimes more than optimizing the model through its parameters. The Data Preparation chapter of this book is under heavy development. Coming soon.

### Final thoughts

- Validating models through re-sampling / cross-validation helps us estimate the "real" error. If the model will run in the future, that is the error we should expect to see.
- Another advantage is **model tuning**: it avoids overfitting when selecting the best parameters for a certain model. Example in `caret`. The equivalent in **Python** is included in Scikit Learn.
- The best test is the one made by you, suited to your data and needs. Try different models and analyze the tradeoff between time consumption and any accuracy metric.
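Model tuning with cross-validation can be sketched as follows: evaluate each candidate parameter value with k-fold CV and keep the one with the best average score. This is plain Python written for illustration, not the `caret` or Scikit Learn API. The toy 1-D nearest-neighbors model, the synthetic data, and the names `knn_predict` and `cv_accuracy` are all assumptions.

```python
import random
import statistics

def knn_predict(train, x, n_neighbors):
    """Majority vote among the n_neighbors closest training points (1-D)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:n_neighbors]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > n_neighbors)

def cv_accuracy(rows, n_neighbors, k_folds):
    """Average accuracy over k folds; rows are assumed to be pre-shuffled."""
    scores = []
    for fold in range(k_folds):
        test = rows[fold::k_folds]
        train = [row for i, row in enumerate(rows) if i % k_folds != fold]
        hits = sum(knn_predict(train, x, n_neighbors) == y for x, y in test)
        scores.append(hits / len(test))
    return statistics.mean(scores)

rng = random.Random(3)
# Synthetic data: 60 rows per class, shuffled before splitting into folds.
rows = [(rng.gauss(1.5 * y, 1.0), y) for y in (0, 1) for _ in range(60)]
rng.shuffle(rows)

candidates = [1, 5, 15]  # odd values avoid tied votes
results = {n: cv_accuracy(rows, n, k_folds=5) for n in candidates}
best = max(results, key=results.get)

for n, score in results.items():
    print(f"n_neighbors={n:>2}: CV accuracy={score:.3f}")
print("selected:", best)
```

Because every candidate is scored on cases it was not trained on, the selection is less likely to reward a setting that merely memorizes the training data.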

These re-sampling techniques could be among the powerful ideas behind sites like stackoverflow.com or collaborative open-source software: gathering many opinions in order to produce a less-biased solution.

But each opinion has to be reliable: imagine asking different doctors for a medical diagnosis.

#### Further reading

- Tutorial: Cross validation for predictive analytics using R
- Tutorial by Max Kuhn (caret's creator): Comparing Different Species of Cross-Validation
- The cross-validation approach can also be applied to time-dependent models; see the other chapter: Out-of-Time Validation.

**References:**

- [1] Image source: Machine Learning FAQ by Sebastian Raschka
- [2] More on Random Forest overall performance: Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
- Why every statistician should know about cross-validation? by Rob Hyndman, creator of the `forecast` package.
- [3] Image source: Understanding the Bias-Variance Tradeoff. It contains an intuitive way of understanding error through bias and variance, using an animation.
- [4] In Machine Learning, What is Better: More Data or better Algorithms
- [5] Gradient boosting machine vs random forest
- [6] ARIMA modelling in R