
Lecture Applied data science: Validation


Lecture "Applied data science: Validation" includes content: validation set approach; overfitting; cross-validation; data leakage; nested cross-validation; bootstrapping;... We invite you to consult!


Content: Lecture Applied data science: Validation

  1. Validation
  2. Overview 1. Introduction 2. Application 3. EDA 4. Learning Process 5. Bias-Variance Tradeoff 6. Regression (review) 7. Classification 8. Validation 9. Regularisation 10. Clustering 11. Evaluation 12. Deployment 13. Ethics
  3. Lecture outline - Validation set approach - Overfitting - Cross-validation - Data leakage - Nested cross-validation - Bootstrapping
  4. Validation set approach
  5. Validation set approach - Randomly split the original data into two parts, a training set and a validation (test) set - Fit the OLS model on the training set and predict the responses in the validation set - Calculate the test MSE (the MSE from applying the fitted model to the validation set)
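A minimal sketch of the three steps above in Python with scikit-learn (the library choice and the synthetic data are assumptions for illustration; the slides do not prescribe an implementation):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 1))         # synthetic predictor
    y = 3 * X[:, 0] + rng.normal(0, 2, size=200)  # synthetic response

    # 1. Randomly split the data into a training set and a validation (test) set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=0)

    # 2. Fit OLS on the training set
    ols = LinearRegression().fit(X_train, y_train)

    # 3. Calculate the test MSE on the held-out set
    print(mean_squared_error(y_test, ols.predict(X_test)))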
  6. Validation set approach
  7. Overfitting … is the tendency of data mining procedures to tailor models to the training data, at the expense of generalisation to previously unseen data. … as a model gets more complex it is allowed to pick up harmful false correlations (noise)... The harm occurs when these false correlations produce incorrect generalisations in the model.
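A small illustration of this idea (an assumed example, not from the slides): as polynomial degree grows, the training MSE keeps shrinking while the test MSE eventually rises, which is overfitting in action.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(60, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=60)  # noisy nonlinear signal

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

    for degree in (1, 3, 10, 15):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_tr, y_tr)
        print(degree,
              mean_squared_error(y_tr, model.predict(X_tr)),  # keeps falling
              mean_squared_error(y_te, model.predict(X_te)))  # eventually rises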
  8. Validation set approach (continued) Pros: simple and easy to implement Cons: - Highly variable (across multiple runs) - Tends to overestimate the test error (because only roughly half of the original dataset is used for training)
  9. Cross-validation Cross-validation is a resampling method. Resampling methods - Repeatedly and randomly draw subsets of data from a sample - Refit a model (e.g. OLS) on these subsets to reveal information a single fit would not, e.g. the variability of the fitted model - Are computationally expensive - Come in 2 common forms - Cross-validation: for model selection and model evaluation - Bootstrapping: for evaluating the variability of a parameter estimate (a bootstrap sketch follows below)
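A hedged sketch of the bootstrap idea, estimating the variability of an OLS slope (the data and the number of resamples are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(2)
    X = rng.uniform(0, 10, size=(100, 1))
    y = 2.5 * X[:, 0] + rng.normal(0, 3, size=100)

    slopes = []
    for _ in range(1000):
        idx = rng.integers(0, len(y), size=len(y))  # resample rows with replacement
        slopes.append(LinearRegression().fit(X[idx], y[idx]).coef_[0])

    print("bootstrap SE of the slope:", np.std(slopes))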
  10. Leave one out cross validation (LOOCV) - Use only 1 observation for testing, and fit the OLS regression on the remainder of the original data - Repeat the procedure n times so that each observation is used for testing once - Calculate the CV error (the average of the n test errors)
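One way (assumed, with scikit-learn) to run LOOCV; each of the n rounds trains on n-1 points and tests on the remaining one:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(3)
    X = rng.uniform(0, 10, size=(50, 1))
    y = 3 * X[:, 0] + rng.normal(0, 2, size=50)

    scores = cross_val_score(LinearRegression(), X, y,
                             cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    print("LOOCV error:", -scores.mean())  # average of the n squared errors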
  11. Leave one out cross validation (LOOCV) Pros - Approximately unbiased estimate of the test error (because almost all data points are used for training) - Very stable (identical CV error from multiple runs)! Cons - Very time-consuming (especially when n is large and/or fitting a complex model)
  12. K-fold cross validation - Randomly divide the original data into k groups (folds) - Train the model on k-1 folds, use the held-out fold for testing - Repeat the procedure k times, so that each fold is used for testing once - (Repeat the above 3 steps multiple times) - Calculate the CV error
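A k-fold sketch (k = 10 is an assumed choice; RepeatedKFold repeats the random split several times, matching the optional repeat step above):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import RepeatedKFold, cross_val_score

    rng = np.random.default_rng(4)
    X = rng.uniform(0, 10, size=(200, 1))
    y = 3 * X[:, 0] + rng.normal(0, 2, size=200)

    cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=4)
    scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    print("10-fold CV error (5 repeats):", -scores.mean())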
  13. K-fold cross validation Pros - Less computationally demanding than LOOCV Cons - The test error estimate tends to be more biased than LOOCV's (but much less so than the validation set approach's)
  14. Cross validation for model selection Model selection chooses the model with the smallest test MSE => LOOCV (being the most stable) allows for an unambiguous choice. … or we can use the one standard error rule (Occam's razor principle): pick the simplest model whose CV error is within one standard error of the minimum
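A sketch of both selection rules (the candidate models, assumed here to be polynomial degrees, are illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(5)
    X = rng.uniform(-3, 3, size=(150, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=150)

    cv = KFold(n_splits=10, shuffle=True, random_state=5)
    means, ses = [], []
    for degree in range(1, 11):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        mse = -cross_val_score(model, X, y, cv=cv,
                               scoring="neg_mean_squared_error")
        means.append(mse.mean())
        ses.append(mse.std(ddof=1) / np.sqrt(len(mse)))

    best = int(np.argmin(means))          # model with the smallest CV error
    threshold = means[best] + ses[best]
    # One-SE rule: simplest model whose CV error is within one SE of the minimum
    one_se = next(d for d in range(len(means)) if means[d] <= threshold)
    print("best degree:", best + 1, "| one-SE degree:", one_se + 1)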
  15. Cross validation for model evaluation Model evaluation estimates the expected range of error in real life applications - LOOCV: least biased error estimate but with the largest variance - Validation set approach: most biased error estimate but with the smallest variance (only 1 value of the test error) - K-fold CV: a balance between bias and variance of the error estimate [Figure: test error estimates from different CV strategies for the regression model mpg ~ horsepower + horsepower^2]
  16. Cross validation for time series data Must preserve the chronological order of the data… https://medium.com/@soumyachess1496/cross-validation-in-time-series-566ae4981ce4 https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
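scikit-learn's TimeSeriesSplit (an assumed implementation choice) preserves chronological order: every split trains on the past and tests on the future.

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(100).reshape(-1, 1)  # observations already in time order
    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
        # training indices always precede the test indices
        print(f"train up to t={train_idx[-1]}, "
              f"test t={test_idx[0]}..{test_idx[-1]}")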
  17. Data leakage [Figure: mpg ~ horsepower + horsepower^2]
  18. Data leakage … is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know, and in turn invalidate the estimated performance of the model being constructed. https://machinelearningmastery.com/data-leakage-machine-learning/ Column-wise leakage Any feature whose value would not actually be available in practice at the time you'd want to use the model to make a prediction can introduce leakage into your model. https://en.wikipedia.org/wiki/Leakage_(machine_learning) Example: 'MinutesLate' is included in the training of a model to predict 'IsLate' Row-wise leakage … is caused by improper sharing of information between rows of data. https://en.wikipedia.org/wiki/Leakage_(machine_learning) Examples: data rows used for both training and validation/testing, or premature featurisation
  19. Data leakage Minimising risks of leakage - Perform data preparations within each CV fold - Column-wise: centering, standardising, one-hot encoding, etc. of columns in each fold - Row-wise: oversampling and undersampling - Completely separate a hold-out dataset for final testing of the model (a fold-safe pipeline sketch follows below). https://stats.stackexchange.com/questions/351638/random-sampling-methods-for-handling-class-imbalance
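A fold-safe preparation sketch (the scaler and classifier are assumed examples): putting the scaler inside a Pipeline means it is re-fitted on the training rows of every CV fold and never sees the held-out rows.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(6)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

    # Leaky: StandardScaler().fit(X) on ALL rows before cross-validating.
    # Fold-safe: scaling is part of the estimator, so it happens inside each fold.
    pipe = make_pipeline(StandardScaler(), LogisticRegression())
    print(cross_val_score(pipe, X, y, cv=5).mean())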
  20. Nested cross validation In order to overcome the bias in performance evaluation, model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to prevent selection bias and because it reflects best practice in operational use. https://www.jmlr.org/papers/volume11/cawley10a/cawley10a.pdf CV should be used for either model selection or model evaluation, not both at once. To do both, a k-fold CV for model selection is nested inside a k'-fold CV for model evaluation (k may differ from k'). Nested CV keeps a hold-out subset of the data completely separate from the training dataset and from any preparations performed on the training set. https://www.analyticsvidhya.com/blog/2021/03/a-step-by-step-guide-to-nested-cross-validation/
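A hedged sketch of nested CV: an inner GridSearchCV (model selection) wrapped in an outer cross_val_score (model evaluation); the estimator, grid, and fold counts are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(7)
    X = rng.uniform(-3, 3, size=(150, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=150)

    inner = GridSearchCV(  # k-fold CV for model selection
        make_pipeline(PolynomialFeatures(), Ridge()),
        param_grid={"polynomialfeatures__degree": [1, 2, 3, 5],
                    "ridge__alpha": [0.01, 0.1, 1.0]},
        cv=KFold(n_splits=5, shuffle=True, random_state=7),
        scoring="neg_mean_squared_error")

    # k'-fold CV for model evaluation; each outer fold re-runs the inner search
    outer = cross_val_score(inner, X, y,
                            cv=KFold(n_splits=5, shuffle=True, random_state=8),
                            scoring="neg_mean_squared_error")
    print("nested CV error:", -outer.mean())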