intTypePromotion=1
zunia.vn Tuyển sinh 2024 dành cho Gen-Z zunia.vn zunia.vn
ADSENSE

Bài giảng Tối ưu hóa nâng cao: Chương 5 - Hoàng Nam Dũng

Chia sẻ: Minh Vũ | Ngày: | Loại File: PDF | Số trang:31

49
lượt xem
3
download
 
  Download Vui lòng tải xuống để xem tài liệu đầy đủ

Bài giảng "Tối ưu hóa nâng cao - Chương 5: Gradient descent" cung cấp cho người học các kiến thức: Gradient descent, gradient descent interpretation, fixed step size, backtracking line search,... Mời các bạn cùng tham khảo.

Chủ đề:
Lưu

Nội dung Text: Bài giảng Tối ưu hóa nâng cao: Chương 5 - Hoàng Nam Dũng

  1. Gradient Descent Hoàng Nam Dũng Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội
  2. Gradient descent Consider unconstrained, smooth convex optimization min f (x) x with convex and differentiable function f : Rn → R. Denote the optimal value by f ∗ = minx f (x) and a solution by x ∗ . 1
  3. Gradient descent Consider unconstrained, smooth convex optimization min f (x) x with convex and differentiable function f : Rn → R. Denote the optimal value by f ∗ = minx f (x) and a solution by x ∗ . Gradient descent: choose initial point x (0) ∈ Rn , repeat: x (k) = x (k−1) − tk · ∇f (x (k−1) ), k = 1, 2, 3, . . . Stop at some point. 1
  4. ● ● ● ● ● 4 2
  5. ● ● ● ●● 53
  6. Gradient descent interpretation At each iteration, consider the expansion 1 2 f (y ) ≈ f (x) + ∇f (x)T (y − x) + ky − xk2 . 2t Quadratic approximation, replacing usual Hessian ∇2 f (x) by 1t I . f (x) + ∇f (x)T (y − x) linear approximation to f 1 2 2t ky − xk 2 proximity term to x, with weight 1/2t Choose next point y = x + to minimize quadratic approximation x + = x − t∇f (x). 4
  7. Gradient descent interpretation ● ● Blue point Blue pointisisx,x, redred point is is point ∗ T x = argminy f (x) + ∇f (x) (y − x) + 1 ky −1 xk22 + T ky − xk22 2t x = argmin f (x) + ∇f (x) (y − x) + y 2t 5
  8. Outline I How to choose step sizes I Convergence analysis I Nonconvex functions I Gradient boosting 6
  9. Fixed step size Fixed step size Simply take ttkk ==t tfor Simply take forallallk k==1,1, 3, .3,. .,. .can 2, 2, diverge ., can if t is diverge if too t is big. too big. 2 2 2 2 Consider f (x) = (10x Consider f (x) = (10x + x )/2, gradient descent after 1 1 +2 x2 )/2, gradient descent after 8 steps: 8 steps: 20 ● 10 ● ● * 0 −10 −20 −20 −10 0 10 20 7 9
  10. Fixed step size Can be Can be slow slow ifif tt isis too toosmall. small.Same example, Same gradient example, descent gradient after after descent 100 steps: 100 steps: 20 ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 ● ● ● ● ● ● ● ● ● ● ● * 0 −10 −20 −20 −10 0 10 20 8
  11. Fixed step size Converges Converges nicely whent tisis“just nicely when “just right”. right”. Same Same example, example, 40 steps: 40 steps: 20 ● ● ● ● ● ● ● 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● * 0 −10 −20 −20 −10 0 10 20 Convergence analysis Convergence analysislater laterwill willgive us us give a precise ideaidea a precise of “just right”.right” of “just 9
  12. Backtracking line search One way to adaptively choose the step size is to use backtracking line search: I First fix parameters 0 < β < 1 and 0 < α ≤ 1/2. I At each iteration, start with t = tinit , and while 2 f (x − t∇f (x)) > f (x) − αt k∇f (x)k2 shrink t = βt. Else perform gradient descent update x + = x − t∇f (x). Simple and tends to work well in practice (further simplification: just take α = 1/2). 10
  13. Backtracking interpretation Backtracking interpretation escent methods 465 f (x + t∆x) f (x) + t∇f (x)T ∆x f (x) + αt∇f (x)T ∆x t t=0 t0 For us ∆x = −∇f (x) gure 9.1 Backtracking line search. The curve shows f , restricted to the line For usline ver which we search. The lower dashed ∆x = −∇f shows (x) extrapolation the linear f , and the upper dashed line has a slope a factor of α smaller. The acktracking condition is that f lies below the upper dashed line, i.e., 0 ≤ 11 ≤ t0 . 13
  14. Backtracking line search Setting α= Setting α = ββ==0.5, 0.5,backtracking backtrackingpicks up roughly picks the right up roughly step size the right step (12 outer steps, 40 steps total). size (12 outer steps, 40 steps total), 20 ● ● 10 ●● ● ●● ● ● ● ● ● * 0 −10 −20 −20 −10 0 10 20 12
  15. Exact line search We could also choose step to do the best we can along direction of negative gradient, called exact line search: t = argmins≥0 f (x − s∇f (x)). Usually not possible to do this minimization exactly. Approximations to exact line search are typically not as efficient as backtracking and it’s typically not worth it. 13
  16. Convergence analysis Assume that f : Rn → R convex and differentiable and additionally k∇f (x) − ∇f (y )k2 ≤ L kx − y k2 for any x, y i.e., ∇f is Lipschitz continuous with constant L > 0. Theorem Gradient descent with fixed step size t ≤ 1/L satisfies 1 f (x (k) ) − f ∗ ≤ kx (0) − x ∗ k22 2tk and same result holds for backtracking with t replaced by β/L. 14
  17. Convergence analysis Assume that f : Rn → R convex and differentiable and additionally k∇f (x) − ∇f (y )k2 ≤ L kx − y k2 for any x, y i.e., ∇f is Lipschitz continuous with constant L > 0. Theorem Gradient descent with fixed step size t ≤ 1/L satisfies 1 f (x (k) ) − f ∗ ≤ kx (0) − x ∗ k22 2tk and same result holds for backtracking with t replaced by β/L. We say gradient descent has convergence rate O(1/k), i.e., it finds ε-suboptimal point in O(1/ε) iterations. 14
  18. Convergence analysis Assume that f : Rn → R convex and differentiable and additionally k∇f (x) − ∇f (y )k2 ≤ L kx − y k2 for any x, y i.e., ∇f is Lipschitz continuous with constant L > 0. Theorem Gradient descent with fixed step size t ≤ 1/L satisfies 1 f (x (k) ) − f ∗ ≤ kx (0) − x ∗ k22 2tk and same result holds for backtracking with t replaced by β/L. We say gradient descent has convergence rate O(1/k), i.e., it finds ε-suboptimal point in O(1/ε) iterations. Chứng minh. Slide 20-25 in http://www.seas.ucla.edu/~vandenbe/236C/ lectures/gradient.pdf 14
  19. Convergence under strong convexity m 2 Reminder: strong convexity of f means f (x) − 2 kxk2 is convex for some m > 0. Assuming Lipschitz gradient as before and also strong convexity: Theorem Gradient descent with fixed step size t ≤ 2/(m + L) or with backtracking line search search satisfies L f (x (k) ) − f ∗ ≤ c k kx (0) − x ∗ k22 2 where 0 < c < 1. 15
  20. Convergence under strong convexity m 2 Reminder: strong convexity of f means f (x) − 2 kxk2 is convex for some m > 0. Assuming Lipschitz gradient as before and also strong convexity: Theorem Gradient descent with fixed step size t ≤ 2/(m + L) or with backtracking line search search satisfies L f (x (k) ) − f ∗ ≤ c k kx (0) − x ∗ k22 2 where 0 < c < 1. Rate under strong convexity is O(c k ), exponentially fast, i.e., we find ε-suboptimal point in O(log(1/ε)) iterations. 15
ADSENSE

CÓ THỂ BẠN MUỐN DOWNLOAD

 

Đồng bộ tài khoản
2=>2