Lecture Applied Data Science: Learning Process and Bias–Variance Tradeoff
Lecture "Applied data science: Learning process and Bias – variance tradeoff" includes content: learning process, bias – variance tradeoff, variance tradeoff, bias variance,... We invite you to consult!
Text content: Lecture Applied Data Science: Learning Process and Bias–Variance Tradeoff
- Applied Data Science (Sonpvh, 2022)
- Course outline:
1. Introduction
2. Application
3. EDA
4. Learning Process
5. Bias – Variance TradeOff
6. Regression
7. Classification
8. Validation
9. Regularization
10. Clustering
11. Evaluation
12. Deployment
13. Ethics
- [Diagram: the basic learning setup, Learning From Data – Yaser [1]] An UNKNOWN TARGET FUNCTION f: 𝒳 → Υ generates the training sample (x₁, y₁), (x₂, y₂), …. A learning algorithm 𝓐 searches a hypothesis set ℋ and selects a final hypothesis g: 𝒳 → Υ such that g(x) ≈ f(x).
- [Diagram: the learning setup applied to a lending funnel: AWARENESS → INTEREST → LEAD FORM → TELESALE → ELIGIBILITY (GOOD vs BAD) → DISBURSED] Inputs x₁, x₂, …, x_N are drawn from a probability distribution P on 𝒳, with features such as age, salary, job status, household size, etc. The target f: 𝒳 → Υ labels each customer GOOD vs BAD (y = 1/0); a key question is what eligibility means, i.e. the label definition. From the training sample (x₁, y₁), (x₂, y₂), … the learning algorithm 𝓐 selects from the hypothesis set ℋ a final hypothesis g: 𝒳 → Υ with g(x) ≈ f(x).
- Learning purpose: g(x) ≈ f(x). But what does "g(x) ≈ f(x)" mean? ⟹ we need an ERROR MEASURE e(f, g):
✓ Binary error: e(f, g) = ⟦f(x) ≠ g(x)⟧
✓ Squared error: e(f, g) = (f(x) − g(x))²
The right error measure depends on the use case, e.g. a supermarket verifying customers for a discount vs. the CIA verifying identity for security. Learning From Data – Yaser [1]
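A minimal sketch (my own example, not from the slides; the function names are hypothetical) of computing the two error measures over a sample:

```python
import numpy as np

def binary_error(f_x, g_x):
    """Average binary error: fraction of points where the prediction disagrees with the target."""
    return np.mean(np.asarray(f_x) != np.asarray(g_x))

def squared_error(f_x, g_x):
    """Average squared error between target values and predictions."""
    f_x = np.asarray(f_x, dtype=float)
    g_x = np.asarray(g_x, dtype=float)
    return np.mean((f_x - g_x) ** 2)

# Toy example: target values vs. predictions
f_vals = [1, 0, 1, 1]
g_vals = [1, 1, 1, 0]
print(binary_error(f_vals, g_vals))   # 0.5
print(squared_error(f_vals, g_vals))  # 0.5
```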
- The target is an UNKNOWN TARGET DISTRIBUTION: a target function f: 𝒳 → Υ plus noise, with inputs drawn from a probability distribution P on 𝒳.
✓ In-sample error: E_in = 𝔼[e(f, g)] for x in the training set (x_train)
✓ Out-of-sample error: E_out = 𝔼[e(f, g)] for x in the test set (x_test)
g(x) ≈ f(x) ⟹ E_in ≈ 0; learning is feasible when E_in ≈ E_out. Learning From Data – Yaser [1]
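A small sketch (my own example, not from the slides) of estimating E_in and E_out by holding out a test set and fitting a simple least-squares line:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown target plus noise: f(x) = sin(pi * x) + Gaussian noise
X = rng.uniform(-1, 1, size=200)
y = np.sin(np.pi * X) + rng.normal(scale=0.1, size=X.shape)

# Training sample and held-out test set
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Hypothesis set H1: h(x) = a*x + b, fitted by least squares
a, b = np.polyfit(X_train, y_train, deg=1)
g = lambda x: a * x + b

E_in = np.mean((g(X_train) - y_train) ** 2)   # squared error on the training set
E_out = np.mean((g(X_test) - y_test) ** 2)    # squared error on the held-out test set
print(f"E_in  = {E_in:.3f}")
print(f"E_out = {E_out:.3f}")
```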
- Questions to answer before learning: What are the purposes and measurement metrics? What is the target population? How are target / non-target defined? What are the use cases and exclusions? What/where/when/how should data be collected?
[Diagram: the learning-from-data setup (unknown target distribution, target function f: 𝒳 → Υ plus noise, probability distribution P on 𝒳, training sample, error measure e(), hypothesis set ℋ, learning algorithm 𝓐, final hypothesis g) overlaid on the process stages Business Understanding, Data Understanding, Data Preparation, Data Unify, Modeling, Evaluation, Deployment, with Data Governance underneath: the learning process.]
- Learning purpose:
✓ E_in ≈ 0 (Approximation)
✓ E_in ≈ E_out (Generalization)
With probability ≥ 1 − δ: E_out(g) − E_in(g) ≤ Ω(ℋ, N, δ), where ℋ reflects model complexity, N is the sample size, and 1 − δ is the confidence requirement.
Approximation–generalization trade-off: a more complex ℋ gives a better chance of approximating f; a less complex ℋ gives a better chance of generalizing out of sample. Learning From Data – Yaser [1]
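For reference, one concrete form of Ω(ℋ, N, δ) is the VC generalization bound from Learning From Data [1]; it is restated here from memory as a sketch, so check it against the book:

$$
E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\frac{8}{N}\,\ln\frac{4\,m_{\mathcal{H}}(2N)}{\delta}} \quad \text{with probability} \ge 1-\delta,
$$

where $m_{\mathcal{H}}$ is the growth function of the hypothesis set: richer hypothesis sets have a larger growth function, hence a looser bound.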
- Toy example: target y = f(x) = sin(πx). Compare two hypothesis sets:
H0: h(x) = b (a constant) vs H1: h(x) = ax + b (a line).
Which is better, given the learning purpose E_in ≈ 0 and E_in ≈ E_out? (In the textbook example each training dataset contains just two points.) Learning From Data – Yaser [1]
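A small sketch (my own, using the two-point sample size of the textbook example in [1]) of fitting both hypothesis sets to one dataset:

```python
import numpy as np

rng = np.random.default_rng(1)

# One dataset D: two points sampled from the target f(x) = sin(pi * x)
x = rng.uniform(-1, 1, size=2)
y = np.sin(np.pi * x)

# H0: h(x) = b  -> the best constant under squared error is the mean of the targets
b0 = y.mean()

# H1: h(x) = a*x + b  -> the line through the two points
a1, b1 = np.polyfit(x, y, deg=1)

print(f"H0 fit: h(x) = {b0:.3f}")
print(f"H1 fit: h(x) = {a1:.3f} * x + {b1:.3f}")
```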
- [Figure: for H0 (h(x) = b) and H1 (h(x) = ax + b), two different datasets k1 and k2 produce two different fitted hypotheses g1 and g2; the final hypothesis depends on which dataset was drawn. Learning From Data – Yaser [1]]
- "Approximation" of ℋ: BIAS. [Figure: for H0 (h(x) = b) and H1 (h(x) = ax + b), averaging the fitted hypotheses over many datasets gives the average hypothesis ḡ; its error against f is E0 = 0.5 for H0 and E1 = 0.2 for H1. Learning From Data – Yaser [1]]
- Who is the winner? Learning From Data – Yaser [1]
- Bias–variance decomposition (squared error, noiseless target):
E_out = 𝔼_D[(g^(D) − f)²] = 𝔼[((g^(D) − ḡ) + (ḡ − f))²] = …
E_out = 𝔼[(g^(D) − ḡ)²] + 𝔼[(ḡ − f)²] = VARIANCE + BIAS
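The elided step is standard bias–variance algebra (not shown on the slide): expanding the square, the cross term vanishes because ḡ = 𝔼_D[g^(D)],

$$
\mathbb{E}_D\big[(g^{(D)} - f)^2\big]
= \mathbb{E}_D\big[(g^{(D)} - \bar g)^2\big] + (\bar g - f)^2
+ 2(\bar g - f)\,\underbrace{\mathbb{E}_D\big[g^{(D)} - \bar g\big]}_{=\,0},
$$

and taking the expectation over x then gives E_out = variance + bias.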
- Learning purpose: E_in ≈ 0 and E_in ≈ E_out. ERROR = BIAS + VARIANCE, and the balance depends on the model complexity of ℋ.
[Figure: learning curves of expected error vs. number of data points. SIMPLE MODEL: E_in and E_out start close together and converge quickly. COMPLEX MODEL: E_in is lower, but the gap between E_out and E_in is larger and closes more slowly. Learning From Data – Yaser [1]]
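A sketch of how such learning curves can be produced empirically (my own example, assuming scikit-learn is available; the slide itself shows the textbook figure):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.sin(np.pi * X).ravel() + rng.normal(scale=0.2, size=300)

for name, degree in [("simple (degree 1)", 1), ("complex (degree 10)", 10)]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    sizes, train_scores, test_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
        scoring="neg_mean_squared_error")
    E_in = -train_scores.mean(axis=1)    # in-sample error vs. training-set size
    E_out = -test_scores.mean(axis=1)    # cross-validated out-of-sample error
    print(name, "E_in:", np.round(E_in, 3), "E_out:", np.round(E_out, 3))
```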
- With a noisy target, y = f(x) + noise where f(x) = sin(πx):
ERROR = BIAS + VARIANCE + NOISE
E_out = 𝔼[(ḡ − f)²] + 𝔼_D[(g^(D) − ḡ)²] + 𝔼[(y − f)²]
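A Monte Carlo sketch (my own, following the two-point textbook experiment; the noise level sigma is an assumed value, not given on the slide) that estimates bias, variance, and noise for the two hypothesis sets:

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(np.pi * x)
sigma = 0.1                                  # assumed noise level
n_datasets, n_points = 5000, 2
x_grid = np.linspace(-1, 1, 201)             # points where errors are measured

fits = {"H0": [], "H1": []}
for _ in range(n_datasets):
    x = rng.uniform(-1, 1, size=n_points)
    y = f(x) + rng.normal(scale=sigma, size=n_points)
    fits["H0"].append(np.full_like(x_grid, y.mean()))   # constant fit h(x) = b
    a, b = np.polyfit(x, y, deg=1)
    fits["H1"].append(a * x_grid + b)                   # line through the two points

for name, g_d in fits.items():
    g_d = np.array(g_d)                  # shape: (n_datasets, len(x_grid))
    g_bar = g_d.mean(axis=0)             # average hypothesis g-bar(x)
    bias = np.mean((g_bar - f(x_grid)) ** 2)
    variance = np.mean((g_d - g_bar) ** 2)
    noise = sigma ** 2                   # E[(y - f)^2] for Gaussian noise
    print(f"{name}: bias={bias:.3f}  variance={variance:.3f}  noise={noise:.3f}")
```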
- Methodology? Hypotheses and suggested proxy variables; which kind of data should be collected?
[Diagram: the learning-from-data setup (unknown target distribution, target function f: 𝒳 → Υ plus noise, probability distribution P on 𝒳, training sample, error measure e(), hypothesis set ℋ, learning algorithm 𝓐, final hypothesis g) overlaid on the process stages Business Understanding, Data Understanding, Data Preparation, Data Unify, Data Analysis, Modeling, Evaluation, Deployment: the learning process.]
- [Figure: two process diagrams compared side by side: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment vs. a variant that adds Data Unify and Data Analysis stages. Data Science for Business book, 2013 [2]]
- Key takeaways:
1. The business team should be involved in almost every part of the ML lifecycle.
2. Business problems vs. data problems vs. data unification.
3. Data unification provides flexibility.
4. Non-hypothesis-driven data analysis ("boiling the ocean") vs. hypothesis-driven analysis; Hypothesis ⇌ Data analysis.
5. EDA is conducted at each stage of the ML lifecycle.
6. Big data improves learning quality, but business understanding is the key.
- References:
[1] Yaser S. Abu-Mostafa, Learning From Data, California Institute of Technology.
[2] Foster Provost & Tom Fawcett, Data Science for Business, 2013.