
What Is Cross Validation?


Transcript

- [Instructor] In this lecture we are going to study the definition of cross-validation, the methods of cross-validation, and examples of cross-validation. At the end, we will make a summary.

Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset, and validating the analysis on the other subset. To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. Cross-validation is important in guarding against testing hypotheses suggested by the data, especially where further samples are hazardous, costly, or impossible to collect.

There are two types of cross-validation: leave-p-out cross-validation is exhaustive, and k-fold cross-validation is non-exhaustive. Exhaustive cross-validation learns and tests on all possible ways to divide the original sample into a training set and a validation set. Leave-p-out cross-validation uses p observations as the validation set and the remaining observations as the training set. This is repeated on all ways to cut the original sample into a validation set of p observations and a training set, so it requires learning and validating n! / (p! (n − p)!) times, where n is the number of observations in the original sample. For example, with 10 observations in total and three drawn at a time, there are 120 possible combinations (the first sketch below verifies this count). Even this small case requires considerable CPU time and effort to repeat the estimation procedure; therefore, leave-p-out is not commonly used.

Leave-one-out cross-validation is the particular case of leave-p-out cross-validation with p equal to 1, and it is practical. Here is an example of leave-one-out cross-validation for 10 data points, where yellow marks the test data and blue marks the training data. As one point is left out each time, there are 10 iterations. In iteration one, all but the first point are training data and the first point is the test data. In iteration two, all but the second point are training data and the second point is the test data. Iterations three through nine follow the same pattern. In iteration 10, all but the tenth point are training data and the tenth point is the test data. (The second sketch below writes these splits out.)

Non-exhaustive cross-validation approximates leave-p-out cross-validation: it does not compute all ways of splitting the original sample. K-fold cross-validation is non-exhaustive. In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimate. When k is equal to n, k-fold cross-validation is exactly leave-one-out cross-validation.

Two-fold cross-validation is the particular case of k-fold cross-validation with k equal to two. For each fold, we randomly assign data points to two sets, d0 and d1, so that both sets are of equal size. This is usually implemented by shuffling the data array and then splitting it in two. We then train on d0 and test on d1, followed by training on d1 and testing on d0. This has the advantage that our training and test sets are both large, and each data point is used for both training and validation on each fold. Here is an example of 10 data points split into two groups, d0 and d1: in iteration one, d0 is the training data and d1 is the test data; in iteration two, d1 is the training data and d0 is the test data. (The third sketch below implements this splitter.)
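As a quick check of the n! / (p! (n − p)!) count of leave-p-out splits, here is a minimal Python sketch (the lecture itself shows no code) using the standard library's math.comb:

```python
from math import comb

# Number of leave-p-out train/validation splits: n! / (p! * (n - p)!)
n, p = 10, 3
print(comb(n, p))    # 120, matching the lecture's example

# Leave-one-out is the special case p = 1: only n rounds are needed.
print(comb(n, 1))    # 10

# The count grows very quickly with n, which is why leave-p-out
# is rarely used in practice.
print(comb(100, 3))  # 161700
```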
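The ten-point leave-one-out example can be written out directly. A minimal sketch, where the integers 0 through 9 stand in for the lecture's ten data points:

```python
data = list(range(10))  # stand-ins for the 10 data points

# Leave-one-out: 10 iterations, each point serving as the test set once.
for i, test_point in enumerate(data, start=1):
    train = [x for x in data if x != test_point]
    print(f"Iteration {i:2d}: test = [{test_point}], train = {train}")
```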
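And here is a small k-fold splitter along the lines just described: shuffle the data, partition it into k near-equal folds, and hold each fold out once. With k = 2 this is the two-fold scheme (train on d0, test on d1, then swap); with k = n it reduces to leave-one-out. A sketch only, assuming a fixed seed for reproducibility:

```python
import random

def k_fold_splits(data, k, seed=0):
    """Shuffle, partition into k near-equal folds, and yield
    (train, validation) pairs with each fold held out exactly once."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

data = list(range(10))

# Two-fold (k = 2): d0 and d1 swap roles across the two iterations.
for train, validation in k_fold_splits(data, k=2):
    print("train:", train, "validation:", validation)

# k = n reproduces leave-one-out: 10 rounds for 10 data points.
print(sum(1 for _ in k_fold_splits(data, k=len(data))))  # 10
```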
Here is an example of leave-one-out cross-validation using a linear model. There are nine data points of x versus y, so n equals nine. For k equal to 1 through n, remove data point k. When k equals one, remove data point one, train on the remaining n − 1 points, and get a model y = k1·x + b1, where k1 and b1 are constants. Then use the model y = k1·x + b1 to estimate data point one; it is shown as the blue point one. Compute its error, that is, the estimated value of point one minus its true value. Go to the next k, k equal to two, and do the same thing as for k equal to one to get the error for data point two. Continue with the next k until all nine points are finished, and we get nine estimated values. Plot the estimated values versus the true values in a cross plot, as shown in the figure. The estimated values are on the x axis, because in practice we only have access to the estimates. The units and scales of both axes should be the same. The statistics are:

- Number of data: nine.
- True mean: 5.2; estimated mean: 4.9. Comparing them indicates any systematic bias.
- True standard deviation: 0.88; estimated standard deviation: 0.82. Comparing them indicates smoothing effects.
- Covariance: 3.1. The higher, the better the estimation.
- Correlation: 0.8. It is sensitive to smoothing and to a low variance of the estimates; a smooth estimator will have a higher correlation with the truth, but that is not necessarily a good thing.
- Slope of regression: 0.93, the yellow line. A slope less than one would indicate conditional bias. The black line is the 45-degree reference line; if all data fall on this line, the estimated values are the same as the true values. The red lines are the errors.
- Mean squared error: 1.5. This is a common summary of prediction performance, and it should be small.

(A sketch of this whole procedure follows the transcript.) We can also try different models, for example a quadratic regression, based on the same steps as for the linear regression. We can also try a variogram model, again based on the same steps as for the linear regression.

Let's make a summary. Cross-validation combines measures of fit to derive a more accurate estimate of model prediction performance. Leave-one-out cross-validation, which is simply named cross validation in geostatistics, is commonly used because it is practical. Leave-p-out cross-validation, which is named the jackknife in geostatistics, is not commonly used because the number of possible combinations is huge, which requires considerable CPU time and effort. Outlier data can be identified from the plot of estimated versus true values. This technique is used for checking estimation; a different technique will be used for simulation, as a number of stochastic results are generated there. This is the end of the lecture.
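Below is a minimal Python sketch of the leave-one-out linear-model example above. The lecture does not give the nine (x, y) values, so the data here are hypothetical stand-ins, and the printed statistics will not match the lecture's numbers; the procedure and the cross-plot statistics are the same.

```python
import numpy as np

# Hypothetical (x, y) data: the lecture's nine values are not given,
# so these stand in purely to illustrate the procedure.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([1.2, 2.1, 2.8, 4.2, 4.9, 6.3, 6.8, 8.1, 9.0])
n = len(x)

estimates = np.empty(n)
for k in range(n):
    keep = np.arange(n) != k                      # leave data point k out
    k1, b1 = np.polyfit(x[keep], y[keep], deg=1)  # fit y = k1*x + b1
    estimates[k] = k1 * x[k] + b1                 # re-estimate the left-out point

errors = estimates - y

# Cross-plot statistics, mirroring the lecture's list:
print("number of data:     ", n)
print("true mean:          ", y.mean())           # vs estimated mean -> bias
print("estimated mean:     ", estimates.mean())
print("true sdev:          ", y.std())            # vs estimated sdev -> smoothing
print("estimated sdev:     ", estimates.std())
print("correlation:        ", np.corrcoef(estimates, y)[0, 1])
slope, _ = np.polyfit(estimates, y, deg=1)        # truth regressed on estimates
print("slope of regression:", slope)
print("mean squared error: ", np.mean(errors ** 2))
```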