Cross-Validation and its types

Before building any Machine Learning model, we split the dataset into a train set and a test set: the model is trained on the train set and validated on the test set.

In many scenarios, this leads to overfitting: the model performs well on the train set but fails miserably on the test data.

This happens because not all the records are used for training the model, and whenever we change the random_state parameter of the split, the accuracy of the model fluctuates a lot; as a result, the model is not trained properly.
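To see this fluctuation concretely, here is a minimal sketch assuming scikit-learn; the make_classification dataset and LogisticRegression model are placeholders used purely for illustration:

```python
# Minimal sketch of the random_state problem (illustrative dataset/model).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# The test accuracy shifts depending on which rows land in each split.
for seed in [0, 1, 42, 99]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"random_state={seed}: accuracy={model.score(X_test, y_test):.3f}")
```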

In order to ensure that the model learns from all the records and performs well on unseen data, we use a technique known as Cross-Validation.

In Cross-Validation, one portion (fold) of the dataset is used as the validation set and the rest is used for training, and these test and train folds keep rotating on each iteration until every fold has served as the validation set.
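A minimal sketch of this idea, assuming scikit-learn's cross_val_score helper with an illustrative synthetic dataset and model:

```python
# Minimal cross-validation sketch (illustrative dataset/model).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds serves once as the validation set while the
# remaining 4 folds train the model; the mean score summarises them.
scores = cross_val_score(model, X, y, cv=5)
print("fold scores:", scores)
print("mean accuracy:", scores.mean())
```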

Let’s understand it along with its various types.

LOOCV (Leave One Out Cross Validation)

In this type of validation, only one record out of N records is used for validation and the remaining N-1 records are used for training the model, so N iterations have to be performed.

Example: Let’s say we have 1000 records; then 999 records will be used for training the model and 1 record will be used as the test set, and this repeats for 1000 iterations, which is computationally expensive.

This type of validation is suitable only when we have small datasets.
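A minimal LOOCV sketch, assuming scikit-learn's LeaveOneOut splitter; the small synthetic dataset is deliberate, since N records mean N model fits:

```python
# Minimal LOOCV sketch (illustrative dataset/model; small on purpose).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)

loo = LeaveOneOut()  # each iteration holds out exactly one record
scores = cross_val_score(model, X, y, cv=loo)
print("iterations run:", len(scores))  # 100, one per record
print("LOOCV accuracy:", scores.mean())
```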

K-Fold Cross-Validation

This is the most commonly used type of validation. Here the dataset is divided into k folds; in each iteration, one fold is used for validating the model and the remaining k-1 folds are used for training.

Example: Let’s say we have 1000 records and we take k=5. Then the number of folds and the number of iterations will both be 5, and each fold will contain 200 records (i.e. 1000/5). In each iteration, one fold (200 records) is used as the test set and the remaining 4 folds (800 records) are used for training the model. So in this scenario, 5 models are trained and their average accuracy is taken.
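A minimal sketch mirroring this example, assuming scikit-learn's KFold splitter with an illustrative dataset of 1000 records and k=5:

```python
# Minimal K-Fold sketch: 1000 records, k=5, so each fold holds
# 200 records and 5 models are trained (illustrative dataset/model).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

accuracies = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    acc = model.score(X[test_idx], y[test_idx])
    accuracies.append(acc)
    print(f"fold {fold}: {len(test_idx)} test records, accuracy={acc:.3f}")

print("average accuracy:", sum(accuracies) / len(accuracies))
```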

Disadvantages:

While creating the k folds, a fold might end up imbalanced, i.e. with an unequal distribution of classes.

If records of the same class are clustered together in the dataset, a fold may contain mostly one class, and the model fails on it.

Stratified K-Fold Cross-Validation

This is similar to K-Fold Cross-Validation, but it ensures that each fold preserves the same proportion of majority and minority class instances as the full dataset.

Example: Let’s say we have 100 records and we split the dataset randomly in a 70:30 ratio. There might be a scenario where almost all the 1’s or 0’s end up in the train set, which will definitely give bad accuracy when validating the model on the test data. To avoid that, stratified sampling is done instead of random sampling, which ensures that both the train and test data contain the same proportion of 1’s and 0’s.

Stratification keeps the number of instances of each class in proportion, but if the dataset is heavily imbalanced to begin with, it should be balanced first, for example through resampling or feature engineering techniques.
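A minimal sketch, assuming scikit-learn's StratifiedKFold and a synthetic imbalanced dataset (roughly 90:10 class ratio), showing that every fold preserves the class proportions:

```python
# Minimal stratified sketch: an imbalanced dataset (about 90% class 0,
# 10% class 1) split so each fold keeps that ratio (illustrative data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    counts = np.bincount(y[test_idx])
    # every fold shows roughly the same 90:10 class distribution
    print(f"fold {fold}: class counts in test fold = {counts}")
```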
