
Advanced Optimization Lecture, Chapter 5 - Hoàng Nam Dũng

Shared by: Minh Vũ | Date: | File type: PDF | Pages: 31


Bài giảng "Tối ưu hóa nâng cao - Chương 5: Gradient descent" cung cấp cho người học các kiến thức: Gradient descent, gradient descent interpretation, fixed step size, backtracking line search,... Mời các bạn cùng tham khảo.


Text content: Advanced Optimization Lecture, Chapter 5 - Hoàng Nam Dũng

  1. Gradient Descent. Hoàng Nam Dũng, Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội.
  2. Gradient descent. Consider the unconstrained, smooth convex optimization problem
         min_x f(x)
     with convex and differentiable f : R^n → R. Denote the optimal value by f* = min_x f(x) and a solution by x*.
  3. Gradient descent: choose an initial point x^(0) ∈ R^n and repeat
         x^(k) = x^(k-1) − t_k · ∇f(x^(k-1)),   k = 1, 2, 3, ...
     Stop at some point.
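A minimal Python sketch of this iteration, assuming a fixed step size; the stopping test on ‖∇f(x)‖_2 and all default values are illustrative choices, not prescribed by the slides:

    import numpy as np

    def gradient_descent(grad_f, x0, step_size, num_iters=1000, tol=1e-8):
        """x^(k) = x^(k-1) - t * grad_f(x^(k-1)), with a fixed step size t."""
        x = np.asarray(x0, dtype=float)
        for _ in range(num_iters):
            g = grad_f(x)
            if np.linalg.norm(g) <= tol:      # "stop at some point": gradient is small
                break
            x = x - step_size * g             # gradient descent update
        return x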
  4. [Figure slide.]
  5. [Figure slide.]
  6. Gradient descent interpretation. At each iteration, consider the expansion
         f(y) ≈ f(x) + ∇f(x)^T (y − x) + (1/(2t)) ‖y − x‖_2^2,
     a quadratic approximation that replaces the usual Hessian ∇^2 f(x) by (1/t)·I:
     - f(x) + ∇f(x)^T (y − x) is the linear approximation to f;
     - (1/(2t)) ‖y − x‖_2^2 is a proximity term to x, with weight 1/(2t).
     Choose the next point y = x^+ to minimize the quadratic approximation: x^+ = x − t∇f(x).
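The update follows by minimizing the quadratic model over y. A one-line derivation (a standard computation added here for completeness, written in LaTeX):

    \nabla_y \Big[ f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2t}\,\|y - x\|_2^2 \Big]
        = \nabla f(x) + \tfrac{1}{t}\,(y - x) = 0
    \quad\Longrightarrow\quad y = x - t\,\nabla f(x) = x^+ .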
  7. Gradient descent interpretation. [Figure.] Blue point is x, red point is
         x^+ = argmin_y  f(x) + ∇f(x)^T (y − x) + (1/(2t)) ‖y − x‖_2^2.
  8. Outline:
     - How to choose step sizes
     - Convergence analysis
     - Nonconvex functions
     - Gradient boosting
  9. Fixed step size. Simply take t_k = t for all k = 1, 2, 3, ...; this can diverge if t is too big.
     Consider f(x) = (10x_1^2 + x_2^2)/2; gradient descent after 8 steps: [Figure: iterates after 8 steps.]
  10. Fixed step size. Can be slow if t is too small. Same example, gradient descent after 100 steps: [Figure: iterates after 100 steps.]
  11. Fixed step size. Converges nicely when t is “just right”. Same example, 40 steps: [Figure: iterates after 40 steps.]
      Convergence analysis later will give us a precise idea of “just right”.
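A small Python sketch of the experiment on f(x) = (10x_1^2 + x_2^2)/2; the starting point and the three step sizes below are illustrative guesses, not necessarily the values behind the slides' figures:

    import numpy as np

    def f(x):
        return (10.0 * x[0]**2 + x[1]**2) / 2.0

    def grad_f(x):
        return np.array([10.0 * x[0], x[1]])

    def run(t, num_steps, x0=(10.0, 10.0)):
        x = np.array(x0)
        for _ in range(num_steps):
            x = x - t * grad_f(x)
        return f(x)

    # grad f is Lipschitz with L = 10 here, so t <= 1/L = 0.1 behaves well,
    # while t > 2/L = 0.2 makes the x_1 coordinate oscillate and diverge.
    for t, k in [(0.25, 8), (0.01, 100), (0.1, 40)]:   # too big, too small, "just right"
        print(f"t = {t:.2f}, {k:3d} steps: f = {run(t, k):.3e}")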
  12. Backtracking line search. One way to adaptively choose the step size is to use backtracking line search:
      - First fix parameters 0 < β < 1 and 0 < α ≤ 1/2.
      - At each iteration, start with t = t_init, and while
            f(x − t∇f(x)) > f(x) − αt ‖∇f(x)‖_2^2
        shrink t = βt. Else perform the gradient descent update x^+ = x − t∇f(x).
      Simple and tends to work well in practice (further simplification: just take α = 1/2).
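A sketch of one backtracking step in Python; the defaults α = 0.5, β = 0.8, t_init = 1 are common illustrative choices, not values fixed by the slides:

    import numpy as np

    def backtracking_step(f, grad_f, x, alpha=0.5, beta=0.8, t_init=1.0):
        """One gradient descent step with the backtracking rule described above."""
        g = grad_f(x)
        t = t_init
        # Shrink t while f(x - t*g) > f(x) - alpha * t * ||g||_2^2.
        while f(x - t * g) > f(x) - alpha * t * np.dot(g, g):
            t *= beta
        return x - t * g, t

Each outer iteration of gradient descent would call this once, so the total work is the sum of the inner shrinking steps.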
  13. Backtracking interpretation. [Figure: the curve shows f restricted to the line over which we search, i.e., f(x + t∆x); the lower dashed line is the linear extrapolation f(x) + t∇f(x)^T ∆x, and the upper dashed line f(x) + αt∇f(x)^T ∆x has a slope a factor of α smaller. The backtracking condition is that f lies below the upper dashed line, i.e., 0 ≤ t ≤ t_0.]
      For us ∆x = −∇f(x).
  14. Backtracking line search. Setting α = β = 0.5, backtracking picks up roughly the right step size (12 outer steps, 40 steps total). [Figure: iterates on the same example.]
  15. Exact line search. We could also choose the step to do the best we can along the direction of the negative gradient, called exact line search:
          t = argmin_{s ≥ 0} f(x − s∇f(x)).
      Usually it is not possible to do this minimization exactly. Approximations to exact line search are typically not as efficient as backtracking, and it is usually not worth it.
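If one nevertheless wants to approximate exact line search numerically, a one-dimensional solver can be applied to s ↦ f(x − s∇f(x)); a sketch using scipy, where the bracket [0, s_max] is an arbitrary illustrative choice:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def exact_line_search_step(f, grad_f, x, s_max=1.0):
        """Approximately solve t = argmin_{0 <= s <= s_max} f(x - s * grad_f(x))."""
        g = grad_f(x)
        res = minimize_scalar(lambda s: f(x - s * g), bounds=(0.0, s_max), method="bounded")
        return x - res.x * g, res.x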
  16. Convergence analysis. Assume that f : R^n → R is convex and differentiable, and additionally
          ‖∇f(x) − ∇f(y)‖_2 ≤ L ‖x − y‖_2   for any x, y,
      i.e., ∇f is Lipschitz continuous with constant L > 0.
      Theorem. Gradient descent with fixed step size t ≤ 1/L satisfies
          f(x^(k)) − f* ≤ ‖x^(0) − x*‖_2^2 / (2tk),
      and the same result holds for backtracking with t replaced by β/L.
  17. We say gradient descent has convergence rate O(1/k), i.e., it finds an ε-suboptimal point in O(1/ε) iterations.
  18. Proof. Slides 20-25 in http://www.seas.ucla.edu/~vandenbe/236C/lectures/gradient.pdf
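As a sanity check (not part of the slides), the bound can be verified numerically on the quadratic example from earlier, for which x* = 0, f* = 0 and L = 10:

    import numpy as np

    f = lambda x: (10.0 * x[0]**2 + x[1]**2) / 2.0
    grad_f = lambda x: np.array([10.0 * x[0], x[1]])

    t = 1.0 / 10.0                     # fixed step size t = 1/L, with L = 10
    x0 = np.array([10.0, 10.0])        # x* = 0 and f* = 0 for this f
    x = x0.copy()
    for k in range(1, 101):
        x = x - t * grad_f(x)
        bound = np.dot(x0, x0) / (2.0 * t * k)    # ||x^(0) - x*||_2^2 / (2tk)
        assert f(x) <= bound                      # the theorem's guarantee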
  19. Convergence under strong convexity. Reminder: strong convexity of f means that f(x) − (m/2) ‖x‖_2^2 is convex for some m > 0. Assuming Lipschitz gradient as before and also strong convexity:
      Theorem. Gradient descent with fixed step size t ≤ 2/(m + L) or with backtracking line search satisfies
          f(x^(k)) − f* ≤ c^k (L/2) ‖x^(0) − x*‖_2^2,
      where 0 < c < 1.
  20. The rate under strong convexity is O(c^k), exponentially fast, i.e., we find an ε-suboptimal point in O(log(1/ε)) iterations.
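On the same quadratic example, which is strongly convex with m = 1 and L = 10, the geometric decay can be observed empirically; the sketch below uses the largest admissible fixed step t = 2/(m + L):

    import numpy as np

    f = lambda x: (10.0 * x[0]**2 + x[1]**2) / 2.0   # strongly convex: m = 1, L = 10
    grad_f = lambda x: np.array([10.0 * x[0], x[1]])

    t = 2.0 / (1.0 + 10.0)             # fixed step size t = 2/(m + L)
    x = np.array([10.0, 10.0])
    prev = f(x)
    for k in range(1, 11):
        x = x - t * grad_f(x)
        # successive ratios f(x^(k)) / f(x^(k-1)) settle near a constant c < 1
        print(f"k = {k:2d}, f(x) = {f(x):.3e}, ratio = {f(x) / prev:.3f}")
        prev = f(x)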