Lecture notes, Advanced Optimization: Chapter 9 - Hoàng Nam Dũng
The lecture notes "Advanced Optimization - Chapter 9: Stochastic gradient descent" cover: proximal gradient descent, stochastic gradient descent, convergence rates, early stopping, and mini-batches.
- Stochastic Gradient Descent. Hoàng Nam Dũng, Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội
- Last time: proximal gradient descent

Consider the problem

  min_x g(x) + h(x)

with g, h convex, g differentiable, and h "simple" insofar as

  prox_t(x) = argmin_z (1/(2t)) ||x - z||_2^2 + h(z)

is computable. Proximal gradient descent: let x^(0) ∈ R^n, repeat

  x^(k) = prox_{t_k}(x^(k-1) - t_k ∇g(x^(k-1))),  k = 1, 2, 3, ...

Step sizes t_k are chosen to be fixed and small, or via backtracking. If ∇g is Lipschitz with constant L, then this has convergence rate O(1/ε). Lastly, we can accelerate this to the optimal rate O(1/√ε).
- Outline

Today:
• Stochastic gradient descent
• Convergence rates
• Mini-batches
• Early stopping
- Stochastic gradient descent

Consider minimizing an average of functions

  min_x (1/m) Σ_{i=1}^m f_i(x).

Since ∇ Σ_{i=1}^m f_i(x) = Σ_{i=1}^m ∇f_i(x), gradient descent would repeat

  x^(k) = x^(k-1) - t_k · (1/m) Σ_{i=1}^m ∇f_i(x^(k-1)),  k = 1, 2, 3, ...

In comparison, stochastic gradient descent or SGD (also called incremental gradient descent) repeats

  x^(k) = x^(k-1) - t_k · ∇f_{i_k}(x^(k-1)),  k = 1, 2, 3, ...

where i_k ∈ {1, ..., m} is some chosen index at iteration k.
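The two updates above differ only in which gradient they use. A minimal NumPy sketch (with hypothetical least-squares components f_i(x) = (1/2)(a_i^T x - b_i)^2, not data from the slides) makes the cost difference concrete:

```python
import numpy as np

# Hypothetical finite-sum problem: f(x) = (1/m) * sum_i 0.5*(a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
m, n = 50, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def grad_fi(x, i):
    """Gradient of the i-th component fi(x) = 0.5*(a_i^T x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

def gd_step(x, t):
    """Full-gradient step: average all m component gradients, O(mn) work."""
    g = np.mean([grad_fi(x, i) for i in range(m)], axis=0)
    return x - t * g

def sgd_step(x, t, i):
    """Stochastic step: one component gradient, O(n) work."""
    return x - t * grad_fi(x, i)

x = np.zeros(n)
x_gd = gd_step(x, 0.1)
x_sgd = sgd_step(x, 0.1, rng.integers(m))
```

Averaging the stochastic step over all i recovers the batch step, which is exactly the unbiasedness property of the randomized rule.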
- Stochastic gradient descent

Two rules for choosing index i_k at iteration k:
• Randomized rule: choose i_k ∈ {1, ..., m} uniformly at random.
• Cyclic rule: choose i_k = 1, 2, ..., m, 1, 2, ..., m, ...

The randomized rule is more common in practice. For the randomized rule, note that

  E[∇f_{i_k}(x)] = ∇f(x),

so we can view SGD as using an unbiased estimate of the gradient at each step.

Main appeal of SGD:
• Iteration cost is independent of m (the number of functions).
• Can also be a big savings in terms of memory usage.
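A quick sketch of the two index rules (illustrative only; 0-based indices rather than the slides' 1-based ones):

```python
import random

m = 4  # number of component functions (illustrative)

def cyclic_indices(num_steps):
    """Cyclic rule: 0, 1, ..., m-1, 0, 1, ..., m-1, ..."""
    return [k % m for k in range(num_steps)]

def randomized_indices(num_steps, seed=0):
    """Randomized rule: each i_k uniform on {0, ..., m-1}, drawn independently."""
    rng = random.Random(seed)
    return [rng.randrange(m) for _ in range(num_steps)]

print(cyclic_indices(6))  # -> [0, 1, 2, 3, 0, 1]
```

Only the randomized rule gives an unbiased gradient estimate at every step, which is why the convergence analysis below assumes it.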
- Example: stochastic logistic regression

Given (x_i, y_i) ∈ R^p × {0, 1}, i = 1, ..., n, recall logistic regression

  min_β f(β) = (1/n) Σ_{i=1}^n [ -y_i x_i^T β + log(1 + exp(x_i^T β)) ],

where the i-th summand is f_i(β). The gradient computation

  ∇f(β) = (1/n) Σ_{i=1}^n (p_i(β) - y_i) x_i,  with p_i(β) = exp(x_i^T β) / (1 + exp(x_i^T β)),

is doable when n is moderate, but not when n is huge.

Full gradient (also called batch) versus stochastic gradient:
• One batch update costs O(np).
• One stochastic update costs O(p).

Clearly, e.g., 10K stochastic steps are much more affordable.
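A sketch of the per-example logistic gradient on hypothetical data: each stochastic evaluation touches one row of X (O(p) work), while the batch gradient touches all n rows (O(np)). Here ∇f_i(β) = (p_i(β) - y_i) x_i with p_i(β) = 1/(1 + exp(-x_i^T β)):

```python
import numpy as np

# Hypothetical data for illustration.
rng = np.random.default_rng(1)
n, p = 1000, 10
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_fi(beta, i):
    """Per-example gradient (p_i(beta) - y_i) * x_i: O(p) work, independent of n."""
    return (sigmoid(X[i] @ beta) - y[i]) * X[i]

def full_grad(beta):
    """Batch gradient (1/n) * sum_i (p_i(beta) - y_i) * x_i: O(np) work."""
    return X.T @ (sigmoid(X @ beta) - y) / n

beta = np.zeros(p)
beta_new = beta - 0.1 * grad_fi(beta, 0)  # one O(p) stochastic step
```

The average of the n per-example gradients equals the batch gradient, confirming the unbiasedness of a uniformly sampled stochastic step.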
- Batch vs. stochastic gradient descent

Small example with n = 10, p = 2 to show the "classic picture" for batch versus stochastic methods:

[Figure: iterate paths over the criterion's contours. Blue: batch steps, O(np) each. Red: stochastic steps, O(p) each.]

Rule of thumb for stochastic methods:
• they generally thrive far from the optimum,
• they generally struggle close to the optimum.
- Step sizes

Standard practice in SGD is to use diminishing step sizes, e.g., t_k = 1/k, for k = 1, 2, 3, ...

Why not fixed step sizes? Here is some intuition. Suppose we take the cyclic rule for simplicity. Setting t_k = t for m updates in a row, we get

  x^(k+m) = x^(k) - t Σ_{i=1}^m ∇f_i(x^(k+i-1)).

Meanwhile, full gradient descent with step size t would give

  x^(k+1) = x^(k) - t Σ_{i=1}^m ∇f_i(x^(k)).

The difference here is t Σ_{i=1}^m [∇f_i(x^(k+i-1)) - ∇f_i(x^(k))], and if we hold t constant, this difference will not generally go to zero.
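A toy illustration of diminishing steps, with hypothetical one-dimensional components f_i(x) = (1/2)(x - c_i)^2 whose average is minimized at mean(c): with t_k = 1/k and x^(0) = 0, the iterate is exactly the running average of the sampled targets, so it settles down despite the sampling noise:

```python
import random

# Hypothetical components: fi(x) = 0.5*(x - c_i)^2, so grad fi(x) = x - c_i.
# The average f is minimized at x* = mean(c) = 2.5.
random.seed(0)
c = [1.0, 2.0, 3.0, 4.0]
m = len(c)

x = 0.0
for k in range(1, 5001):
    i = random.randrange(m)   # randomized rule
    t_k = 1.0 / k             # diminishing step size
    x = x - t_k * (x - c[i])  # equivalently x = (1 - 1/k)*x + (1/k)*c[i]
```

With a fixed step size t instead, the iterate would keep bouncing in a band of width proportional to t around the optimum, matching the intuition on this slide.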
- Convergence rates

Recall: for convex f, (sub)gradient descent with diminishing step sizes satisfies

  f(x^(k)) - f* = O(1/√k).

When f is differentiable with Lipschitz gradient, gradient descent with suitable fixed step sizes satisfies

  f(x^(k)) - f* = O(1/k).

What about SGD? For convex f, SGD with diminishing step sizes satisfies^1

  E[f(x^(k))] - f* = O(1/√k).

Unfortunately, this does not improve when we further assume f has a Lipschitz gradient.

1 E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming".
- Convergence rates

Even worse is the following discrepancy! When f is strongly convex and has a Lipschitz gradient, gradient descent satisfies

  f(x^(k)) - f* = O(c^k),  where c < 1.

But under the same conditions, SGD gives us^2

  E[f(x^(k))] - f* = O(1/k).

So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity. What can we do to improve SGD?

2 E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming".
- Mini-batches

Also common is mini-batch stochastic gradient descent, where we choose a random subset I_k ⊆ {1, ..., m} of size |I_k| = b ≪ m and repeat

  x^(k) = x^(k-1) - t_k · (1/b) Σ_{i∈I_k} ∇f_i(x^(k-1)),  k = 1, 2, 3, ...

Again, we are approximating the full gradient by an unbiased estimate:

  E[(1/b) Σ_{i∈I_k} ∇f_i(x)] = ∇f(x).

Using mini-batches reduces the variance of our gradient estimate by a factor of 1/b, but is also b times more expensive.
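A mini-batch step sketch (hypothetical quadratic components, illustrative only): sampling I_k uniformly without replacement and averaging b component gradients costs O(bn) here, versus O(n) for a pure stochastic step:

```python
import numpy as np

# Hypothetical components fi(x) = 0.5*(a_i^T x - c_i)^2.
rng = np.random.default_rng(2)
m, n, b = 100, 3, 10
A = rng.standard_normal((m, n))
c = rng.standard_normal(m)

def component_grads(x, idx):
    """Stack of gradients grad fi(x) = (a_i^T x - c_i) * a_i for i in idx."""
    r = A[idx] @ x - c[idx]
    return A[idx] * r[:, None]

def minibatch_step(x, t):
    """Sample I_k of size b without replacement, average the b gradients: O(bn)."""
    idx = rng.choice(m, size=b, replace=False)
    return x - t * component_grads(x, idx).mean(axis=0)

x = minibatch_step(np.zeros(n), 0.05)
```

Averaging over the full index set recovers the batch gradient, so the mini-batch estimate is unbiased; its variance shrinks by the factor 1/b stated above.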
- Batch vs. mini-batches vs. stochastic

Back to logistic regression, let's now consider a regularized version:

  min_{β∈R^p} (1/n) Σ_{i=1}^n [ -y_i x_i^T β + log(1 + exp(x_i^T β)) ] + (λ/2) ||β||_2^2.

Write the criterion as

  f(β) = (1/n) Σ_{i=1}^n f_i(β),  f_i(β) = -y_i x_i^T β + log(1 + exp(x_i^T β)) + (λ/2) ||β||_2^2.

The full gradient computation is ∇f(β) = (1/n) Σ_{i=1}^n (p_i(β) - y_i) x_i + λβ.

Comparison between methods:
• One batch update costs O(np).
• One mini-batch update costs O(bp).
• One stochastic update costs O(p).
- Batch vs. mini-batches vs. stochastic

Example with n = 10,000, p = 20; all methods use fixed step sizes.

[Figure: criterion value f^(k) versus iteration number k, for Full, Stochastic, Mini-batch b = 10, and Mini-batch b = 100.]
- Batch vs. mini-batches vs. stochastic

What's happening? Now let's parametrize by flops^3.

[Figure: criterion value f^(k) versus flop count (log scale), for Full, Stochastic, Mini-batch b = 10, and Mini-batch b = 100.]

3 flops: floating point operations.
- Batch vs. mini-batches vs. stochastic

Finally, looking at the suboptimality gap (on a log scale):

[Figure: criterion gap f^(k) - f* versus iteration number k, on a log scale, for Full, Stochastic, Mini-batch b = 10, and Mini-batch b = 100.]
- End of the story?

Short story:
• SGD can be super effective in terms of iteration cost and memory.
• But SGD is slow to converge, and can't adapt to strong convexity.
• And mini-batches seem to be a wash in terms of flops (though they can still be useful in practice).

Is this the end of the story for SGD?

For a while, the answer was believed to be yes. Slow convergence for strongly convex functions was believed inevitable, as Nemirovski and others established matching lower bounds ... but this was for a more general stochastic problem, where

  f(x) = ∫ F(x, ζ) dP(ζ).

A new wave of "variance reduction" work shows we can modify SGD to converge much faster for finite sums (more later?).
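As one concrete instance of the variance-reduction idea, here is a minimal SVRG-style sketch (illustrative, on hypothetical least-squares components; not an algorithm from these slides): keep a snapshot x_snap together with its full gradient mu, and replace ∇f_i(x) by ∇f_i(x) - ∇f_i(x_snap) + mu, which is still unbiased but has variance that vanishes as both points approach the optimum:

```python
import numpy as np

# Hypothetical strongly convex finite sum: f(x) = (1/m) * sum_i 0.5*(a_i^T x - c_i)^2.
rng = np.random.default_rng(4)
m, n = 50, 4
A = rng.standard_normal((m, n))
c = rng.standard_normal(m)
x_star = np.linalg.lstsq(A, c, rcond=None)[0]  # the minimizer, for reference

def grad_fi(x, i):
    return (A[i] @ x - c[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - c) / m

x, t = np.zeros(n), 0.01
for epoch in range(30):
    x_snap, mu = x.copy(), full_grad(x)              # snapshot once per outer loop
    for _ in range(2 * m):                           # inner stochastic loop
        i = rng.integers(m)
        v = grad_fi(x, i) - grad_fi(x_snap, i) + mu  # variance-reduced estimate
        x = x - t * v
```

Unlike plain SGD with a fixed step, the noise floor shrinks with the distance to the snapshot, which is what restores fast convergence on finite sums.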
- SGD in large-scale ML

SGD has really taken off in large-scale machine learning:
• In many ML problems we don't care about optimizing to high accuracy; it doesn't pay off in terms of statistical performance.
• Thus (in contrast to what classic theory says) fixed step sizes are commonly used in ML applications.
• One trick is to experiment with step sizes using a small fraction of the training data before running SGD on the full data set ... many other heuristics are common^4.
• Many variants provide better practical stability and convergence: momentum, acceleration, averaging, coordinate-adapted step sizes, variance reduction ...
• See AdaGrad, Adam, AdaMax, SVRG, SAG, SAGA ... (more later?).

4 E.g., Bottou (2012), "Stochastic gradient descent tricks".
- Early stopping

Suppose p is large and we wanted to fit (say) a logistic regression model to data (x_i, y_i) ∈ R^p × {0, 1}, i = 1, ..., n.

We could solve (say) ℓ2-regularized logistic regression:

  min_{β∈R^p} (1/n) Σ_{i=1}^n [ -y_i x_i^T β + log(1 + exp(x_i^T β)) ]  subject to ||β||_2 ≤ t.

We could also run gradient descent on the unregularized problem

  min_{β∈R^p} (1/n) Σ_{i=1}^n [ -y_i x_i^T β + log(1 + exp(x_i^T β)) ]

and stop early, i.e., terminate gradient descent well short of the global minimum.
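A sketch of early stopping on hypothetical data and a hypothetical train/validation split: run plain gradient descent on the unregularized training loss and keep the iterate with the smallest validation loss, rather than iterating to the global minimum:

```python
import numpy as np

# Hypothetical data: a sparse true signal plus noise, split into train/validation.
rng = np.random.default_rng(5)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.concatenate([np.ones(5), np.zeros(p - 5)])
y = (X @ beta_true + rng.standard_normal(n) > 0).astype(float)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

def loss(beta, X, y):
    z = X @ beta
    return np.mean(-y * z + np.logaddexp(0.0, z))  # -y*x'b + log(1 + exp(x'b))

def grad(beta, X, y):
    pvec = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (pvec - y) / len(y)

beta = np.zeros(p)
best_beta, best_val = beta.copy(), loss(beta, X_va, y_va)
for k in range(500):
    beta = beta - 0.1 * grad(beta, X_tr, y_tr)       # GD on the unregularized loss
    val = loss(beta, X_va, y_va)
    if val < best_val:                               # keep the best iterate so far
        best_beta, best_val = beta.copy(), val
```

Keeping the best validation iterate is one common way to realize "terminate well short of the global minimum"; the path of early-stopped iterates is often compared to an ℓ2-regularization path.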