Bài giảng Tối ưu hóa nâng cao: Chương 5 - Hoàng Nam Dũng
lượt xem 3
download
Bài giảng "Tối ưu hóa nâng cao - Chương 5: Gradient descent" cung cấp cho người học các kiến thức: Gradient descent, gradient descent interpretation, fixed step size, backtracking line search,... Mời các bạn cùng tham khảo.
Bình luận(0) Đăng nhập để gửi bình luận!
Nội dung Text: Bài giảng Tối ưu hóa nâng cao: Chương 5 - Hoàng Nam Dũng
- Gradient Descent Hoàng Nam Dũng Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội
- Gradient descent Consider unconstrained, smooth convex optimization min f (x) x with convex and differentiable function f : Rn → R. Denote the optimal value by f ∗ = minx f (x) and a solution by x ∗ . 1
- Gradient descent Consider unconstrained, smooth convex optimization min f (x) x with convex and differentiable function f : Rn → R. Denote the optimal value by f ∗ = minx f (x) and a solution by x ∗ . Gradient descent: choose initial point x (0) ∈ Rn , repeat: x (k) = x (k−1) − tk · ∇f (x (k−1) ), k = 1, 2, 3, . . . Stop at some point. 1
- ● ● ● ● ● 4 2
- ● ● ● ●● 53
- Gradient descent interpretation At each iteration, consider the expansion 1 2 f (y ) ≈ f (x) + ∇f (x)T (y − x) + ky − xk2 . 2t Quadratic approximation, replacing usual Hessian ∇2 f (x) by 1t I . f (x) + ∇f (x)T (y − x) linear approximation to f 1 2 2t ky − xk 2 proximity term to x, with weight 1/2t Choose next point y = x + to minimize quadratic approximation x + = x − t∇f (x). 4
- Gradient descent interpretation ● ● Blue point Blue pointisisx,x, redred point is is point ∗ T x = argminy f (x) + ∇f (x) (y − x) + 1 ky −1 xk22 + T ky − xk22 2t x = argmin f (x) + ∇f (x) (y − x) + y 2t 5
- Outline I How to choose step sizes I Convergence analysis I Nonconvex functions I Gradient boosting 6
- Fixed step size Fixed step size Simply take ttkk ==t tfor Simply take forallallk k==1,1, 3, .3,. .,. .can 2, 2, diverge ., can if t is diverge if too t is big. too big. 2 2 2 2 Consider f (x) = (10x Consider f (x) = (10x + x )/2, gradient descent after 1 1 +2 x2 )/2, gradient descent after 8 steps: 8 steps: 20 ● 10 ● ● * 0 −10 −20 −20 −10 0 10 20 7 9
- Fixed step size Can be Can be slow slow ifif tt isis too toosmall. small.Same example, Same gradient example, descent gradient after after descent 100 steps: 100 steps: 20 ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 ● ● ● ● ● ● ● ● ● ● ● * 0 −10 −20 −20 −10 0 10 20 8
- Fixed step size Converges Converges nicely whent tisis“just nicely when “just right”. right”. Same Same example, example, 40 steps: 40 steps: 20 ● ● ● ● ● ● ● 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● * 0 −10 −20 −20 −10 0 10 20 Convergence analysis Convergence analysislater laterwill willgive us us give a precise ideaidea a precise of “just right”.right” of “just 9
- Backtracking line search One way to adaptively choose the step size is to use backtracking line search: I First fix parameters 0 < β < 1 and 0 < α ≤ 1/2. I At each iteration, start with t = tinit , and while 2 f (x − t∇f (x)) > f (x) − αt k∇f (x)k2 shrink t = βt. Else perform gradient descent update x + = x − t∇f (x). Simple and tends to work well in practice (further simplification: just take α = 1/2). 10
- Backtracking interpretation Backtracking interpretation escent methods 465 f (x + t∆x) f (x) + t∇f (x)T ∆x f (x) + αt∇f (x)T ∆x t t=0 t0 For us ∆x = −∇f (x) gure 9.1 Backtracking line search. The curve shows f , restricted to the line For usline ver which we search. The lower dashed ∆x = −∇f shows (x) extrapolation the linear f , and the upper dashed line has a slope a factor of α smaller. The acktracking condition is that f lies below the upper dashed line, i.e., 0 ≤ 11 ≤ t0 . 13
- Backtracking line search Setting α= Setting α = ββ==0.5, 0.5,backtracking backtrackingpicks up roughly picks the right up roughly step size the right step (12 outer steps, 40 steps total). size (12 outer steps, 40 steps total), 20 ● ● 10 ●● ● ●● ● ● ● ● ● * 0 −10 −20 −20 −10 0 10 20 12
- Exact line search We could also choose step to do the best we can along direction of negative gradient, called exact line search: t = argmins≥0 f (x − s∇f (x)). Usually not possible to do this minimization exactly. Approximations to exact line search are typically not as efficient as backtracking and it’s typically not worth it. 13
- Convergence analysis Assume that f : Rn → R convex and differentiable and additionally k∇f (x) − ∇f (y )k2 ≤ L kx − y k2 for any x, y i.e., ∇f is Lipschitz continuous with constant L > 0. Theorem Gradient descent with fixed step size t ≤ 1/L satisfies 1 f (x (k) ) − f ∗ ≤ kx (0) − x ∗ k22 2tk and same result holds for backtracking with t replaced by β/L. 14
- Convergence analysis Assume that f : Rn → R convex and differentiable and additionally k∇f (x) − ∇f (y )k2 ≤ L kx − y k2 for any x, y i.e., ∇f is Lipschitz continuous with constant L > 0. Theorem Gradient descent with fixed step size t ≤ 1/L satisfies 1 f (x (k) ) − f ∗ ≤ kx (0) − x ∗ k22 2tk and same result holds for backtracking with t replaced by β/L. We say gradient descent has convergence rate O(1/k), i.e., it finds ε-suboptimal point in O(1/ε) iterations. 14
- Convergence analysis Assume that f : Rn → R convex and differentiable and additionally k∇f (x) − ∇f (y )k2 ≤ L kx − y k2 for any x, y i.e., ∇f is Lipschitz continuous with constant L > 0. Theorem Gradient descent with fixed step size t ≤ 1/L satisfies 1 f (x (k) ) − f ∗ ≤ kx (0) − x ∗ k22 2tk and same result holds for backtracking with t replaced by β/L. We say gradient descent has convergence rate O(1/k), i.e., it finds ε-suboptimal point in O(1/ε) iterations. Chứng minh. Slide 20-25 in http://www.seas.ucla.edu/~vandenbe/236C/ lectures/gradient.pdf 14
- Convergence under strong convexity m 2 Reminder: strong convexity of f means f (x) − 2 kxk2 is convex for some m > 0. Assuming Lipschitz gradient as before and also strong convexity: Theorem Gradient descent with fixed step size t ≤ 2/(m + L) or with backtracking line search search satisfies L f (x (k) ) − f ∗ ≤ c k kx (0) − x ∗ k22 2 where 0 < c < 1. 15
- Convergence under strong convexity m 2 Reminder: strong convexity of f means f (x) − 2 kxk2 is convex for some m > 0. Assuming Lipschitz gradient as before and also strong convexity: Theorem Gradient descent with fixed step size t ≤ 2/(m + L) or with backtracking line search search satisfies L f (x (k) ) − f ∗ ≤ c k kx (0) − x ∗ k22 2 where 0 < c < 1. Rate under strong convexity is O(c k ), exponentially fast, i.e., we find ε-suboptimal point in O(log(1/ε)) iterations. 15
CÓ THỂ BẠN MUỐN DOWNLOAD
-
Hệ thống tự động xử lý nước thải mạ Crom – Niken
2 p | 213 | 54
-
HyperChem v8.0.6
4 p | 153 | 14
-
Các chất trợ dệt và các chất xử lý hoàn tất vải công năng cao mang lại lợi ích chi phí
4 p | 128 | 12
-
Bài giảng Tối ưu hóa nâng cao: Chương 4 - Hoàng Nam Dũng
54 p | 56 | 5
-
Bài giảng Tối ưu hóa nâng cao: Chương 3 - Hoàng Nam Dũng
47 p | 35 | 4
-
Bài giảng Tối ưu hóa nâng cao: Chương 1 - Hoàng Nam Dũng
30 p | 48 | 4
-
Bài giảng Tối ưu hóa nâng cao: Chương 2 - Hoàng Nam Dũng
76 p | 50 | 3
-
Bài giảng Tối ưu hóa nâng cao: Chương 6 - Hoàng Nam Dũng
36 p | 44 | 3
-
Bài giảng Tối ưu hóa nâng cao: Chương 7 - Hoàng Nam Dũng
34 p | 26 | 3
-
Bài giảng Tối ưu hóa nâng cao: Chương 8 - Hoàng Nam Dũng
50 p | 24 | 3
-
Bài giảng Tối ưu hóa nâng cao: Chương 9 - Hoàng Nam Dũng
24 p | 51 | 3
-
Bài giảng Tối ưu hóa nâng cao: Chương 10 - Hoàng Nam Dũng
22 p | 40 | 3
Chịu trách nhiệm nội dung:
Nguyễn Công Hà - Giám đốc Công ty TNHH TÀI LIỆU TRỰC TUYẾN VI NA
LIÊN HỆ
Địa chỉ: P402, 54A Nơ Trang Long, Phường 14, Q.Bình Thạnh, TP.HCM
Hotline: 093 303 0098
Email: support@tailieu.vn