Resampling with k-fold Cross Validation

Statistics · Machine Learning · R

A high-level recap of k-fold cross validation as a resampling method.

Javier Orraca (Scatter Podcast)
02-01-2020

k-fold cross validation “randomly divides the training data into k groups (or folds) of approximately equal size, [with the model being] fit on k-1 folds and the remaining fold used to compute model performance.” I typically use k=5 or k=10 because that’s standard industry practice, without really questioning why.
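The procedure described above can be sketched in a few lines of base R. This is a minimal illustration, not code from the book: the `mtcars` data and the `mpg ~ wt + hp` linear model are arbitrary choices made here for demonstration.

```r
set.seed(123)

k <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of approximately equal size
folds <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k-1 folds, score on the held-out fold
cv_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  preds <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - preds)^2))  # RMSE on the held-out fold
})

mean(cv_rmse)  # average performance estimate across the k folds
```

Averaging the k held-out RMSE values gives a single performance estimate that uses every observation for both training and validation, just never at the same time.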

In reading through Hands-On Machine Learning with R by Brad Boehmke, Ph.D. and Brandon Greenwell, I was surprised to learn that studies have shown k=10 performs similarly to leave-one-out cross validation (LOOCV), where k=n.
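LOOCV is just the k=n special case: each "fold" is a single observation. A compact base-R sketch, using the same illustrative `mtcars` model assumed above, makes the comparison concrete:

```r
n <- nrow(mtcars)

# Leave one row out at a time: fit on the other n-1 rows,
# record the held-out residual
loocv_errors <- sapply(1:n, function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit, newdata = mtcars[i, ])
})

sqrt(mean(loocv_errors^2))  # LOOCV RMSE, comparable to a k-fold estimate
```

Note the cost difference: LOOCV refits the model n times versus k times for k-fold, which is part of why k=10 is such an attractive default when the two give similar estimates.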

Without realizing it, I sometimes get carried away optimizing code and drift from statistics and the core “science” in data science. k-fold CV is a widely used technique and is agnostic to your statistical programming language of choice, but if you’re an #R user, I can’t recommend this book enough (free in full online, link below)!

Source:

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Orraca (2020, Feb. 1). Javier Orraca: Resampling with k-fold Cross Validation. Retrieved from https://www.javierorraca.com/posts/2020-02-01-k-fold-cross-validation/

BibTeX citation

@misc{orraca2020resampling,
  author = {Orraca, Javier},
  title = {Javier Orraca: Resampling with k-fold Cross Validation},
  url = {https://www.javierorraca.com/posts/2020-02-01-k-fold-cross-validation/},
  year = {2020}
}