Advanced Optimization Lecture Notes, Chapter 9 - Hoàng Nam Dũng

Bài giảng "Tối ưu hóa nâng cao - Chương 9: Stochastic gradient descent" cung cấp cho người học các kiến thức: Proximal gradient descent, stochastic gradient descent, convergence rates, early stopping, mini-batches. Mời các bạn cùng tham khảo.

Text content: Advanced Optimization Lecture Notes, Chapter 9 - Hoàng Nam Dũng

1. Stochastic Gradient Descent

Hoàng Nam Dũng
Faculty of Mathematics, Mechanics and Informatics, VNU University of Science, Vietnam National University, Hanoi
2. Last time: proximal gradient descent

Consider the problem
$$\min_x \; g(x) + h(x)$$
with $g, h$ convex, $g$ differentiable, and $h$ "simple" in so much as
$$\mathrm{prox}_t(x) = \operatorname{argmin}_z \; \frac{1}{2t}\|x - z\|_2^2 + h(z)$$
is computable.

Proximal gradient descent: let $x^{(0)} \in \mathbb{R}^n$, repeat:
$$x^{(k)} = \mathrm{prox}_{t_k}\big(x^{(k-1)} - t_k \nabla g(x^{(k-1)})\big), \quad k = 1, 2, 3, \ldots$$

Step sizes $t_k$ are chosen to be fixed and small, or via backtracking. If $\nabla g$ is Lipschitz with constant $L$, then this has convergence rate $O(1/\epsilon)$. Lastly, we can accelerate this, to the optimal rate $O(1/\sqrt{\epsilon})$.
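To make the recap concrete, here is a minimal sketch of proximal gradient descent (my illustration, not from the slides), assuming $g$ is a least-squares loss and $h = \lambda\|\cdot\|_1$, so that $\mathrm{prox}_t$ is soft-thresholding; all names and the problem instance are illustrative.

```python
import numpy as np

def soft_threshold(z, tau):
    # prox of tau * ||.||_1: shrink every coordinate toward zero by tau
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient(A, b, lam, t, iters=500):
    """Minimize g(x) + h(x) with g(x) = 0.5 * ||Ax - b||_2^2 (differentiable)
    and h(x) = lam * ||x||_1 ("simple": its prox is soft-thresholding)."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                    # gradient of g at x
        x = soft_threshold(x - t * grad, t * lam)   # prox step
    return x

# Fixed step t = 1/L, with L = ||A||_2^2 the Lipschitz constant of grad g.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
x_hat = proximal_gradient(A, b, lam=0.1, t=1.0 / np.linalg.norm(A, 2) ** 2)
```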
3. Outline

Today:
- Stochastic gradient descent
- Convergence rates
- Mini-batches
- Early stopping
4. Stochastic gradient descent

Consider minimizing an average of functions
$$\min_x \; \frac{1}{m} \sum_{i=1}^m f_i(x).$$
As $\nabla \sum_{i=1}^m f_i(x) = \sum_{i=1}^m \nabla f_i(x)$, gradient descent would repeat
$$x^{(k)} = x^{(k-1)} - t_k \cdot \frac{1}{m} \sum_{i=1}^m \nabla f_i(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$

In comparison, stochastic gradient descent or SGD (or incremental gradient descent) repeats:
$$x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f_{i_k}(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
where $i_k \in \{1, \ldots, m\}$ is some chosen index at iteration $k$.
5. Stochastic gradient descent

Two rules for choosing the index $i_k$ at iteration $k$:
- Randomized rule: choose $i_k \in \{1, \ldots, m\}$ uniformly at random.
- Cyclic rule: choose $i_k = 1, 2, \ldots, m, 1, 2, \ldots, m, \ldots$

The randomized rule is more common in practice. For the randomized rule, note that
$$\mathbb{E}[\nabla f_{i_k}(x)] = \nabla f(x),$$
so we can view SGD as using an unbiased estimate of the gradient at each step.

Main appeal of SGD:
- Iteration cost is independent of $m$ (the number of functions).
- Can also be a big savings in terms of memory usage.
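As an illustrative sketch of these rules (not part of the lecture), plain SGD with the randomized rule might look as follows in Python; `grad_fi` is a hypothetical callable returning $\nabla f_i(x)$.

```python
import numpy as np

def sgd(grad_fi, x0, m, steps=1000, t0=1.0):
    """Plain SGD with the randomized rule.

    grad_fi(i, x) -- hypothetical callable returning the gradient of f_i at x
    m             -- number of component functions f_1, ..., f_m
    """
    rng = np.random.default_rng(0)
    x = x0.copy()
    for k in range(1, steps + 1):
        i_k = rng.integers(m)                 # randomized rule: uniform over {1,...,m}
        x = x - (t0 / k) * grad_fi(i_k, x)    # one update: cost independent of m
    return x
```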
6. Example: stochastic logistic regression

Given $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$, $i = 1, \ldots, n$, recall logistic regression:
$$\min_\beta \; f(\beta) = \frac{1}{n} \sum_{i=1}^n \underbrace{\left(-y_i x_i^T \beta + \log(1 + \exp(x_i^T \beta))\right)}_{f_i(\beta)}.$$

The gradient computation $\nabla f(\beta) = \frac{1}{n} \sum_{i=1}^n (p_i(\beta) - y_i)\, x_i$, where $p_i(\beta) = \exp(x_i^T \beta)/(1 + \exp(x_i^T \beta))$, is doable when $n$ is moderate, but not when $n$ is huge.

Full gradient (also called batch) versus stochastic gradient:
- One batch update costs $O(np)$.
- One stochastic update costs $O(p)$.

Clearly, e.g., 10K stochastic steps are much more affordable.
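For concreteness, a single stochastic update for this problem touches only one row of the data and costs $O(p)$; the sketch below is my illustration (synthetic data, illustrative names), not code from the slides.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def stochastic_update(beta, X, y, i, t):
    # grad f_i(beta) = (p_i(beta) - y_i) * x_i, p_i(beta) = sigmoid(x_i^T beta);
    # only row i of X is touched, so one update is O(p)
    p_i = sigmoid(X[i] @ beta)
    return beta - t * (p_i - y[i]) * X[i]

# Synthetic usage: n is large, but 10K stochastic steps cost only O(10000 * p).
rng = np.random.default_rng(1)
n, p = 100_000, 50
X = rng.standard_normal((n, p))
y = (rng.random(n) < 0.5).astype(float)
beta = np.zeros(p)
for k in range(1, 10_001):
    i = rng.integers(n)
    beta = stochastic_update(beta, X, y, i, t=1.0 / k)
```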
7. Batch vs. stochastic gradient descent

Small example with $n = 10$, $p = 2$ to show the "classic picture" for batch versus stochastic methods:

[Figure: iterate paths on the contours of the criterion. Blue: batch steps, $O(np)$ each. Red: stochastic steps, $O(p)$ each.]

Rule of thumb for stochastic methods:
- they generally thrive far from the optimum
- they generally struggle close to the optimum
8. Step sizes

Standard in SGD is to use diminishing step sizes, e.g., $t_k = 1/k$, for $k = 1, 2, 3, \ldots$

Why not fixed step sizes? Here's some intuition. Suppose we take the cyclic rule for simplicity. Setting $t_k = t$ for $m$ updates in a row, we get
$$x^{(k+m)} = x^{(k)} - t \sum_{i=1}^m \nabla f_i(x^{(k+i-1)}).$$
Meanwhile, full gradient descent with step size $t$ would give
$$x^{(k+1)} = x^{(k)} - t \sum_{i=1}^m \nabla f_i(x^{(k)}).$$
The difference here is $t \sum_{i=1}^m \big[\nabla f_i(x^{(k+i-1)}) - \nabla f_i(x^{(k)})\big]$, and if we hold $t$ constant, this difference will not generally go to zero.
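This effect is easy to see numerically. The toy experiment below is my own illustration (not from the slides), with $f_i(x) = (x - a_i)^2/2$, so the optimum of the average is the mean of the $a_i$: a fixed step stalls in a noise ball around the optimum, while $t_k = 1/k$ keeps improving.

```python
import numpy as np

# Toy finite sum: f_i(x) = 0.5 * (x - a_i)^2, so grad f_i(x) = x - a_i and
# the minimizer of (1/m) * sum_i f_i is the mean of the a_i.
rng = np.random.default_rng(2)
a = rng.standard_normal(100)
x_star = a.mean()

def run_sgd(step, iters=10_000):
    x = 0.0
    for k in range(1, iters + 1):
        i = rng.integers(a.size)
        x -= step(k) * (x - a[i])       # stochastic gradient step
    return abs(x - x_star)

print("fixed t = 0.1   :", run_sgd(lambda k: 0.1))      # stalls in a noise ball
print("diminishing 1/k :", run_sgd(lambda k: 1.0 / k))  # keeps shrinking the error
```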
9. Convergence rates

Recall: for convex $f$, (sub)gradient descent with diminishing step sizes satisfies
$$f(x^{(k)}) - f^\star = O(1/\sqrt{k}).$$
When $f$ is differentiable with Lipschitz gradient, gradient descent with suitable fixed step sizes satisfies
$$f(x^{(k)}) - f^\star = O(1/k).$$

What about SGD? For convex $f$, SGD with diminishing step sizes satisfies¹
$$\mathbb{E}[f(x^{(k)})] - f^\star = O(1/\sqrt{k}).$$
Unfortunately this does not improve when we further assume $f$ has Lipschitz gradient.

¹ E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming"
10. Convergence rates

Even worse is the following discrepancy! When $f$ is strongly convex and has a Lipschitz gradient, gradient descent satisfies
$$f(x^{(k)}) - f^\star = O(c^k), \quad c < 1.$$
But under the same conditions, SGD gives us²
$$\mathbb{E}[f(x^{(k)})] - f^\star = O(1/k).$$
So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity.

What can we do to improve SGD?

² E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming"
11. Mini-batches

Also common is mini-batch stochastic gradient descent, where we choose a random subset $I_k \subseteq \{1, \ldots, m\}$ of size $|I_k| = b \ll m$ and repeat
$$x^{(k)} = x^{(k-1)} - t_k \cdot \frac{1}{b} \sum_{i \in I_k} \nabla f_i(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
Again, we are approximating the full gradient by an unbiased estimate:
$$\mathbb{E}\left[\frac{1}{b} \sum_{i \in I_k} \nabla f_i(x)\right] = \nabla f(x).$$
Using mini-batches reduces the variance of our gradient estimate by a factor $1/b$, but is also $b$ times more expensive.
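A mini-batch variant of the earlier SGD sketch (again my illustration, with the same hypothetical `grad_fi` callable) changes only the sampling and averaging step:

```python
import numpy as np

def minibatch_sgd(grad_fi, x0, m, b, steps=1000, t0=1.0):
    """Mini-batch SGD: average gradients over a random subset I_k, |I_k| = b << m.
    grad_fi(i, x) is a hypothetical callable returning grad f_i(x)."""
    rng = np.random.default_rng(0)
    x = x0.copy()
    for k in range(1, steps + 1):
        I_k = rng.choice(m, size=b, replace=False)    # random subset of indices
        g = sum(grad_fi(i, x) for i in I_k) / b       # unbiased gradient estimate
        x = x - (t0 / k) * g                          # b times the work per step,
                                                      # 1/b times the variance
    return x
```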
12. Batch vs. mini-batches vs. stochastic

Back to logistic regression, let's now consider a regularized version:
$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n \left(-y_i x_i^T \beta + \log(1 + e^{x_i^T \beta})\right) + \frac{\lambda}{2}\|\beta\|_2^2.$$
Write the criterion as
$$f(\beta) = \frac{1}{n} \sum_{i=1}^n f_i(\beta), \quad f_i(\beta) = -y_i x_i^T \beta + \log(1 + e^{x_i^T \beta}) + \frac{\lambda}{2}\|\beta\|_2^2.$$
The full gradient computation is $\nabla f(\beta) = \frac{1}{n} \sum_{i=1}^n (p_i(\beta) - y_i)\, x_i + \lambda \beta$.

Comparison between methods:
- One batch update costs $O(np)$.
- One mini-batch update costs $O(bp)$.
- One stochastic update costs $O(p)$.
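To connect this cost comparison to code, the sketch below (my illustration, not from the slides) computes the gradient estimate over a set of row indices for the regularized problem: passing all $n$ indices recovers the $O(np)$ batch update, $b$ indices the $O(bp)$ mini-batch update, and a single index the $O(p)$ stochastic update.

```python
import numpy as np

def gradient_estimate(beta, X, y, lam, idx):
    """Gradient estimate over rows `idx` for regularized logistic regression:
    (1/b) * sum_{i in idx} (p_i(beta) - y_i) * x_i + lam * beta, b = len(idx).
    Cost is O(b p): idx of size n, b, or 1 gives batch, mini-batch, or stochastic."""
    Xb = X[idx]                                   # b x p submatrix
    p = 1.0 / (1.0 + np.exp(-(Xb @ beta)))        # fitted probabilities p_i(beta)
    return Xb.T @ (p - y[idx]) / len(idx) + lam * beta
```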
13. Batch vs. mini-batches vs. stochastic

Example with $n = 10{,}000$, $p = 20$; all methods use fixed step sizes:

[Figure: criterion value $f_k$ (roughly 0.50 to 0.65) versus iteration number $k$ (0 to 50) for the full, stochastic, and mini-batch ($b = 10$, $b = 100$) methods.]
14. Batch vs. mini-batches vs. stochastic

What's happening? Now let's parametrize by flops³:

[Figure: the same criterion values plotted against flop count (log scale, 1e+02 to 1e+06).]

³ flops: floating point operations
15. Batch vs. mini-batches vs. stochastic

Finally, looking at the suboptimality gap (on a log scale):

[Figure: criterion gap $f_k - f^\star$ (log scale, 1e−12 to 1e−03) versus iteration number $k$ (0 to 50) for the full, stochastic, and mini-batch ($b = 10$, $b = 100$) methods.]
16.-18. End of the story?

Short story:
- SGD can be super effective in terms of iteration cost and memory.
- But SGD is slow to converge and can't adapt to strong convexity.
- And mini-batches seem to be a wash in terms of flops (though they can still be useful in practice).

Is this the end of the story for SGD?

For a while, the answer was believed to be yes. Slow convergence for strongly convex functions was believed inevitable, as Nemirovski and others established matching lower bounds ... but this was for a more general stochastic problem, where
$$f(x) = \int F(x, \zeta)\, dP(\zeta).$$
A new wave of "variance reduction" work shows we can modify SGD to converge much faster for finite sums (more later?).
19. SGD in large-scale ML

SGD has really taken off in large-scale machine learning:
- In many ML problems we don't care about optimizing to high accuracy; it doesn't pay off in terms of statistical performance.
- Thus (in contrast to what classic theory says) fixed step sizes are commonly used in ML applications.
- One trick is to experiment with step sizes on a small fraction of the training data before running SGD on the full data set ... many other heuristics are common⁴.
- Many variants provide better practical stability and convergence: momentum, acceleration, averaging, coordinate-adapted step sizes, variance reduction ... (a momentum sketch follows below).
- See AdaGrad, Adam, AdaMax, SVRG, SAG, SAGA ... (more later?).

⁴ E.g., Bottou (2012), "Stochastic gradient descent tricks"
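As one example of such a variant, here is a heavy-ball style momentum update, sketched in illustrative form (this exact variant is my addition, not given in the lecture):

```python
import numpy as np

def sgd_momentum(grad_fi, x0, m, t=0.01, mu=0.9, steps=1000):
    """SGD with (heavy-ball style) momentum: a decayed running sum of past
    stochastic gradients damps the noise; mu = 0 recovers plain fixed-step SGD.
    grad_fi(i, x) is a hypothetical callable returning grad f_i(x)."""
    rng = np.random.default_rng(0)
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        i = rng.integers(m)               # randomized rule
        v = mu * v - t * grad_fi(i, x)    # update velocity
        x = x + v                         # move along the velocity
    return x
```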
20. Early stopping

Suppose $p$ is large and we wanted to fit (say) a logistic regression model to data $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$, $i = 1, \ldots, n$.

We could solve (say) $\ell_2$-regularized logistic regression:
$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n \left(-y_i x_i^T \beta + \log(1 + e^{x_i^T \beta})\right) \quad \text{subject to} \quad \|\beta\|_2 \le t.$$
We could also run gradient descent on the unregularized problem
$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n \left(-y_i x_i^T \beta + \log(1 + e^{x_i^T \beta})\right)$$
and stop early, i.e., terminate gradient descent well short of the global minimum.
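A sketch of the early-stopping idea (my illustration; the validation split and the patience rule are assumptions, not from the slides): run gradient descent on the unregularized logistic loss and stop once held-out loss stops improving.

```python
import numpy as np

def logistic_loss(beta, X, y):
    # (1/n) * sum_i (-y_i x_i^T beta + log(1 + exp(x_i^T beta))), computed stably
    u = X @ beta
    return np.mean(-y * u + np.logaddexp(0.0, u))

def gd_early_stopping(X_tr, y_tr, X_val, y_val, t=0.1, max_iters=10_000, patience=20):
    """Gradient descent on the unregularized loss, terminated well short of the
    global minimum: stop when validation loss hasn't improved for `patience` steps."""
    beta = np.zeros(X_tr.shape[1])
    best_loss, best_beta, since_best = np.inf, beta.copy(), 0
    for _ in range(max_iters):
        p = 1.0 / (1.0 + np.exp(-(X_tr @ beta)))            # fitted probabilities
        beta = beta - t * X_tr.T @ (p - y_tr) / len(y_tr)   # full gradient step
        val = logistic_loss(beta, X_val, y_val)
        if val < best_loss:
            best_loss, best_beta, since_best = val, beta.copy(), 0
        else:
            since_best += 1
            if since_best >= patience:                      # early stop
                break
    return best_beta
```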