K-Fold Cross Validation is a widely used technique in machine learning for assessing model performance and estimating how well a model generalizes to unseen data. In K-Fold Cross Validation, the dataset is divided into K equal-sized folds (subsets). The model is trained on K-1 folds and validated on the remaining fold, and the process is repeated K times so that each fold serves as the validation set exactly once. The K results are then averaged to provide a robust estimate of model performance.
A special case of K-Fold Cross Validation is Leave-One-Out Cross Validation (LOOCV), where K equals the number of data points, so each fold consists of a single sample. Compared to LOOCV, K-Fold Cross Validation with a smaller K offers a better balance between computational cost and the bias-variance tradeoff of the performance estimate.
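As a quick illustration of that equivalence, scikit-learn's LeaveOneOut produces exactly the same splits as KFold with K set to the number of samples. The toy data below is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(10).reshape(-1, 1)  # toy data with n = 10 samples

kf = KFold(n_splits=len(X))  # K = n: each fold holds exactly one sample
loo = LeaveOneOut()

kf_tests = [tuple(test) for _, test in kf.split(X)]
loo_tests = [tuple(test) for _, test in loo.split(X)]
assert kf_tests == loo_tests  # the test folds are identical
```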
A simplified version of the K-Fold Cross Validation approach is as follows (a code sketch of these steps appears after the list):
- Split the sample data (of size n) into K equal sets (folds)
- Use K-1 sets as training data, leaving one set out, and train the model
- Use the left-out set as test data and measure accuracy
- Repeat the previous two steps with each of the K sets as test data (and the other K-1 sets as training data)
- By the end of the K iterations, each of the K sets has been used as test data, giving K measures of accuracy
- Compute the average of the accuracies across these K iterations
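Here is a minimal sketch of those steps in Python. It assumes a scikit-learn-style model with fit and score methods and that n is divisible by K; the function name and seeding are illustrative, not from the original notebook:

```python
import numpy as np

def k_fold_accuracy(model, X, y, k, seed=0):
    """Average test accuracy over k folds, following the steps above.

    Assumes len(X) is divisible by k and that `model` exposes
    scikit-learn-style fit() and score() methods.
    """
    n = len(X)
    fold_size = n // k
    idx = np.random.default_rng(seed).permutation(n)  # shuffle once, then carve into folds
    accuracies = []
    for i in range(k):
        test = idx[i * fold_size:(i + 1) * fold_size]   # the held-out fold
        train = np.concatenate([idx[:i * fold_size], idx[(i + 1) * fold_size:]])
        model.fit(X[train], y[train])                   # train on the other K-1 folds
        accuracies.append(model.score(X[test], y[test]))  # accuracy on the held-out fold
    return float(np.mean(accuracies))
```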
Key features of this K-Fold Cross Validation method are:
- Each data sample is used K-1 times for training
- Each data sample is used exactly once for testing
- In this simplified version, the sample size n must be divisible by K, the number of folds (standard implementations relax this by letting some folds be one sample larger)
What is a good value of K? How many folds are needed to reach reasonable accuracy without excessive computational cost? To get some insight, let us examine the impact of various values of K on accuracy.
The Diabetes dataset is a widely used medical dataset containing diagnostic measurements used to predict the likelihood of diabetes onset. We take this dataset and evaluate the accuracies for multiple values of K.
This Jupyter Notebook code outlines the steps to measure the accuracy of the K-Fold Cross Validation method for multiple values of K. The Diabetes dataset, with its 768 samples, is a good candidate for iterating through many values of K.
The code computes the factors of the dataset size and discards 1 and the size itself, so K is iterated through all factors of n from 2 to n/2, where n is the size of the dataset.
In each iteration, the accuracy is calculated and saved. At the end of all iterations, the accuracies are printed against the values of K.
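A minimal sketch of that notebook logic using scikit-learn is shown below. It assumes the 768-sample diabetes dataset is available locally as diabetes.csv with the label in an Outcome column; the file name, column name, and random seed are assumptions for illustration, not taken from the original notebook:

```python
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Diabetes dataset: 768 rows, eight feature columns and an 'Outcome'
# label. The file name and column name are assumptions; adjust them
# to match your copy of the dataset.
df = pd.read_csv("diabetes.csv")
X = df.drop(columns=["Outcome"]).values
y = df["Outcome"].values

n = len(X)  # 768 for this dataset

# All factors of n except 1 and n itself, i.e. K from 2 up to n/2.
ks = [k for k in range(2, n // 2 + 1) if n % k == 0]

# A shallow tree (depth 2) keeps each of the many fits cheap.
model = DecisionTreeClassifier(max_depth=2, random_state=42)

accuracies = []
for k in ks:
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv)  # accuracy per fold
    accuracies.append(scores.mean())

for k, acc in zip(ks, accuracies):
    print(f"K = {k:4d}  mean accuracy = {acc:.6f}")
```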
Accuracy of K-Fold Cross Validation for each value of K:

| K | Accuracy |
|----:|---------:|
| 2 | 0.746094 |
| 3 | 0.734375 |
| 4 | 0.729167 |
| 6 | 0.743490 |
| 8 | 0.753906 |
| 12 | 0.756510 |
| 16 | 0.752604 |
| 24 | 0.743490 |
| 32 | 0.747396 |
| 48 | 0.752604 |
| 64 | 0.761719 |
| 96 | 0.765625 |
| 128 | 0.764323 |
| 192 | 0.765625 |
| 256 | 0.772135 |
| 384 | 0.772135 |
As we can see, the best accuracy values are obtained (for this specific dataset) at high values of K. However, the accuracies for K around 8 to 16 already seem reasonable. Here is a graph that depicts the accuracies for various values of K.
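If you want to reproduce such a graph, a short matplotlib sketch along these lines works. It reuses the ks and accuracies lists from the sketch above; the log-scale x-axis is a choice of mine, since the factors of 768 grow roughly geometrically:

```python
import matplotlib.pyplot as plt

# 'ks' and 'accuracies' come from the loop in the earlier sketch.
plt.figure(figsize=(8, 4))
plt.plot(ks, accuracies, marker="o")
plt.xscale("log", base=2)  # spreads out the factor-of-768 K values
plt.xlabel("K (number of folds)")
plt.ylabel("Mean accuracy")
plt.title("Accuracy of K-Fold Cross Validation vs. K")
plt.grid(True)
plt.tight_layout()
plt.show()
```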
You can download the code and try out a few variations, including the choice of classifier (I used a Decision Tree Classifier) and the depth of the Decision Tree (I used a tree depth of 2 to simplify the calculations).
Food for thought: what is your take on the tradeoff between the accuracy gained at higher values of K and the additional computational cost it demands?
The idea of testing K-Fold Cross Validation across multiple values of K came from one of the lab assignments from IIIT-Hyderabad’s flagship AIML program with Talentsprint.