Methodological Aspects on Model Validation
What's this about?
Once you've built a predictive model, how sure are you that it captured general patterns, and not just the data it has seen (overfitting)?
Will it perform well once it is in production, running live? What is the expected error?
What sort of data?
If the data is generated over time and, let's say, every day you have new cases like "page visits on a website" or "new patients arriving at a medical center", one strong validation is the Out-Of-Time approach.
Out-Of-Time Validation Example
Imagine you are building the model on Jan-01, and to build it you use all the data up to Oct-31. There are 2 months between these two dates.
When predicting a binary (two-class) or multi-class variable, it's quite straightforward: with the model we've built, using data <= Oct-31, we score the cases as of that exact day, and then we measure how the users/patients/persons/cases evolved during those two months.
Since the output of a binary model should be a number indicating the likelihood of each case belonging to a certain class (see the Scoring Data chapter), you test what the model "said" on Oct-31 against what really happened by Jan-01.
So the validation workflow looks something like...
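A minimal sketch of that workflow in Python, using only the standard library. The years, the record layout, and the helper names `out_of_time_split` and `pair_scores_with_outcomes` are assumptions made for illustration. Treating a case with no recorded outcome as "the event did not happen" is also an assumption.

```python
from datetime import date

# Dates from the example above; the years are arbitrary (assumption).
SCORING_DATE = date(2023, 10, 31)  # last day of data used to build the model
OUTCOME_DATE = date(2024, 1, 1)    # day on which we check what really happened

def out_of_time_split(events, scoring_date=SCORING_DATE, outcome_date=OUTCOME_DATE):
    """Split (event_date, record) pairs into model-building data and later data."""
    history = [rec for d, rec in events if d <= scoring_date]
    later = [rec for d, rec in events if scoring_date < d <= outcome_date]
    return history, later

def pair_scores_with_outcomes(scores, outcomes):
    """Pair each score given on the scoring date with the outcome observed later.

    scores:   dict case_id -> likelihood returned by the model on the scoring date
    outcomes: dict case_id -> 1 if the event happened by the outcome date, else 0
    Cases never scored are ignored; a missing outcome is treated as 0 (assumption).
    """
    return [(cid, score, outcomes.get(cid, 0)) for cid, score in sorted(scores.items())]

events = [
    (date(2023, 9, 15), "u1 visit"),
    (date(2023, 10, 31), "u2 visit"),
    (date(2023, 12, 2), "u1 purchase"),
]
history, later = out_of_time_split(events)

scores = {"u1": 0.9, "u2": 0.2, "u3": 0.6}   # what the model "said" on Oct-31
outcomes = {"u1": 1, "u3": 0, "u4": 1}       # what happened by Jan-01
validation = pair_scores_with_outcomes(scores, outcomes)
for row in validation:
    print(row)
```

The `validation` list holds one (case, score, actual outcome) row per scored case, which is the input any later performance metric needs.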
Using Gain and Lift Analysis
This analysis, explained in another chapter of the book, can be applied after the out-of-time validation.
Keeping only the cases that were scored on Oct-31, we take the score returned by the model on that date, and the target variable is the value that those cases actually had on Jan-01.
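One way to compute cumulative gain and lift from such score/target pairs is sketched below. The function name `gain_and_lift`, the bucket count, and the sample numbers are assumptions for illustration. It is not necessarily the procedure used in the gain and lift chapter.

```python
def gain_and_lift(scores, targets, n_buckets=10):
    """Return (population %, cumulative gain %, lift) for each bucket.

    Cases are ranked by score, highest first. Gain is the share of all
    positive cases captured so far; lift is gain divided by population share.
    """
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total_positives = sum(t for _, t in ranked)
    if total_positives == 0 or n < n_buckets:
        raise ValueError("need at least one positive case and n >= n_buckets")

    rows = []
    for bucket in range(1, n_buckets + 1):
        cutoff = (n * bucket) // n_buckets          # cases included so far
        captured = sum(t for _, t in ranked[:cutoff])
        population_pct = 100 * cutoff / n
        gain_pct = 100 * captured / total_positives
        rows.append((population_pct, gain_pct, gain_pct / population_pct))
    return rows

# Scores given on Oct-31 and the actual outcome on Jan-01 (made-up values).
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
targets = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

rows = gain_and_lift(scores, targets, n_buckets=5)
for population_pct, gain_pct, lift in rows:
    print(f"top {population_pct:.0f}%: gain {gain_pct:.0f}%, lift {lift:.2f}")
```

In this toy data the top 20% of scores capture 50% of the positive cases, a lift of 2.5 over picking cases at random.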
How about a numerical target variable?
Here common sense and/or the business need play a bigger role. A numerical outcome can take any value and can increase or decrease over time, so we may have to consider the following 2 scenarios to help us decide what we consider success.
Example scenario: you are measuring the usage of a certain app, and the normal behavior is that users use it more as the days pass.
Case A: Convert the numerical target into categorical?
An app user can become more active over time, measured in page views, so for an out-of-time validation we would predict whether the user visits more than the average, or ends up in the top 10%, or doubles the page views she/he had up to the model's creation day, etc.
Examples of this case can be:
- Binary: "yes/no" above average.
- Multi-class: "low increase"/"mid increase"/"high increase"
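Both conversions can be sketched in a few lines of Python. The function names and the 10% / 50% cut-offs are hypothetical choices for illustration. In practice the thresholds come from the business need.

```python
from statistics import mean

def above_average(values):
    """Binary target: "yes" if the value is above the average of all values."""
    avg = mean(values)
    return ["yes" if v > avg else "no" for v in values]

def increase_level(before, after, low=0.10, high=0.50):
    """Multi-class target based on relative growth between two dates.

    `before` must be greater than 0. The `low` and `high` thresholds are
    hypothetical. Decreases also fall into "low increase" in this sketch.
    """
    change = (after - before) / before
    if change < low:
        return "low increase"
    if change < high:
        return "mid increase"
    return "high increase"

page_views = [10, 40, 25, 5]                 # made-up page views per user
binary_target = above_average(page_views)    # average is 20
levels = [increase_level(100, 105), increase_level(100, 130), increase_level(100, 200)]
print(binary_target)
print(levels)
```

Once the target is categorical, the validation is the same as in the binary or multi-class case: score on the earlier date, then compare against the category observed later.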
Case B: Leave it numerical (linear regression)?
- Predicting the concentration of a certain substance in blood.
- Predicting page visits.
- Time series analysis.
In these cases we also have the difference between "what was expected" and "what it is".
This difference can take any value. It is the error, or residual.
If the model is good, this error should be white noise (see the note at the end). It follows a normal curve when, mainly, these logical properties hold:
- The error should be centered around 0; the model's error must tend to 0.
- The standard deviation of this error must be finite, to avoid unpredictable outliers.
- There must be no correlation between the errors.
- Normal distribution: expect the majority of errors to be around 0, with bigger errors appearing in a smaller proportion as the error increases (the likelihood of finding bigger errors decreases exponentially).
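These properties can be checked roughly with a few summary numbers. The sketch below is a quick diagnostic, not a formal statistical test. The function name `residual_summary` and the sample values are assumptions. The lag-1 autocorrelation is only one simple way to look for correlation between consecutive errors.

```python
from statistics import mean, pstdev

def residual_summary(expected, actual):
    """Quick checks on the residuals (actual minus expected)."""
    residuals = [a - e for e, a in zip(expected, actual)]
    avg = mean(residuals)
    sd = pstdev(residuals)

    # Lag-1 autocorrelation: how strongly each error relates to the next one.
    centered = [r - avg for r in residuals]
    denom = sum(c * c for c in centered)
    lag1 = 0.0
    if denom:
        lag1 = sum(centered[i] * centered[i + 1] for i in range(len(centered) - 1)) / denom

    # For a normal curve, roughly 68% of errors fall within 1 standard deviation.
    within_1sd = sum(abs(c) <= sd for c in centered) / len(residuals)
    return {"mean": avg, "sd": sd, "lag1_autocorr": lag1, "share_within_1sd": within_1sd}

expected = [10, 12, 14, 16, 18, 20]   # what the model predicted (made-up values)
actual = [11, 11, 15, 15, 19, 19]     # what was observed later
summary = residual_summary(expected, actual)
print(summary)
```

In this toy data the mean error is 0, but the errors alternate in sign, so the lag-1 autocorrelation is strongly negative. The residuals are therefore not white noise, even though they are centered on 0.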
Out-of-Time Validation is a strong validation tool for simulating how the model will run in production, using data that does not necessarily depend on sampling.
Error analysis is a big topic in data science. Time to move on to the next chapter, which will try to cover the key concepts: Knowing the error.
- See the "Time series analysis and regression" section in: White noise (Wikipedia).