intTypePromotion=1
zunia.vn Tuyển sinh 2024 dành cho Gen-Z zunia.vn zunia.vn
ADSENSE

Bài giảng Tối ưu hóa nâng cao: Chương 7 - Hoàng Nam Dũng

Chia sẻ: Minh Vũ | Ngày: | Loại File: PDF | Số trang:34

27
lượt xem
3
download
 
  Download Vui lòng tải xuống để xem tài liệu đầy đủ

Bài giảng "Tối ưu hóa nâng cao - Chương 7: Subgradient method" cung cấp cho người học các kiến thức: Last last time - gradient descent, subgradient method, step size choices, convergence analysis, lipschitz continuity, convergence analysis - Proof,... Mời các bạn cùng tham khảo.

Chủ đề:
Lưu

Nội dung Text: Bài giảng Tối ưu hóa nâng cao: Chương 7 - Hoàng Nam Dũng

  1. Subgradient Method Hoàng Nam Dũng Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội
  2. Last last time: gradient descent Consider the problem min f (x) x for f convex and differentiable, dom(f ) = Rn . Gradient descent: choose initial x (0) ∈ Rn , repeat: x (k) = x (k−1) − tk · ∇f (x (k−1) ), k = 1, 2, 3, . . . Step sizes tk chosen to be fixed and small, or by backtracking line search. If ∇f Lipschitz, gradient descent has convergence rate O(1/ε). Downsides: I Requires f differentiable — addressed this lecture. I Can be slow to converge — addressed next lecture. 1
  3. Subgradient method Now consider f convex, having dom(f ) = Rn , but not necessarily differentiable. Subgradient method: like gradient descent, but replacing gradients with subgradients, i.e., initialize x (0) , repeat: x (k) = x (k−1) − tk · g (k−1) , k = 1, 2, 3, . . . where g (k−1) ∈ ∂f (x (k−1) ) any subgradient of f at x (k−1) . 2
  4. Subgradient method Now consider f convex, having dom(f ) = Rn , but not necessarily differentiable. Subgradient method: like gradient descent, but replacing gradients with subgradients, i.e., initialize x (0) , repeat: x (k) = x (k−1) − tk · g (k−1) , k = 1, 2, 3, . . . where g (k−1) ∈ ∂f (x (k−1) ) any subgradient of f at x (k−1) . Subgradient method is not necessarily a descent method, so we (k) keep track of best iterate xbest among x (0) , ...x (k) so far, i.e., (k) f (xbest ) = min f (x (i) ). i=0,...,k 2
  5. Outline Today: I How to choose step sizes I Convergence analysis I Intersection of sets I Projected subgradient method 3
  6. Step size choices I Fixed step sizes: tk = t all k = 1, 2, 3, . . . I Fixed step length, i.e., tk = s/kg (k−1) k2 , and hence ktk g (k−1) k2 = s. I Diminishing step sizes: choose to meet conditions X∞ X∞ tk2 < ∞, tk = ∞, k=1 k=1 i.e., square summable but not summable. Important here that step sizes go to zero, but not too fast. There are several other options too, but key difference to gradient descent: step sizes are pre-specified, not adaptively computed. 4
  7. Convergence analysis Assume that f convex, dom(f ) = Rn , and also that f is Lipschitz continuous with constant L > 0, i.e., |f (x) − f (y )| ≤ L kx − y k2 for all x, y . Theorem For a fixed step size t, subgradient method satisfies (k) kx (0) − x ∗ k22 L2 t f (xbest ) − f ∗ ≤ + . 2kt 2 For fixed step length, i.e., tk = s/kg (k−1) k2 , we have (k) Lkx (0) − x ∗ k22 Ls f (xbest ) − f ∗ ≤ + . 2ks 2 For diminishing step sizes, subgradient method satisfies kx (0) − x ∗ k22 + L2 ki=1 ti2 P (k) ∗ f (xbest ) − f ≤ , 2 ki=1 ti P (k) i.e., lim f (xbest ) = f ∗ . 5 k→∞
  8. Lipschitz continuity Before the proof let consider the Lipschitz continuity assumption. Lemma f is Lipschitz continuous with constant L > 0, i.e., |f (x) − f (y )| ≤ L kx − y k2 for all x, y , is equivalent to kg k2 ≤ L for all x and g ∈ ∂f (x). Chứng minh. ⇐=: Choose subgradients gx and gy at x and y . We have gxT (x − y ) ≥ f (x) − f (y ) ≥ gyT (x − y ). Apply Cauchy-Schwarz inequality get Lkx − y k2 ≥ f (x) − f (y ) ≥ −Lkx − y k2 . 6
  9. Lipschitz continuity Before the proof let consider the Lipschitz continuity assumption. Lemma f is Lipschitz continuous with constant L > 0, i.e., |f (x) − f (y )| ≤ L kx − y k2 for all x, y , is equivalent to kg k2 ≤ L for all x and g ∈ ∂f (x). Chứng minh. =⇒: Assume kg k2 > L for some g ∈ ∂f (x). Take y = x + g /kg k2 we have ky − xk2 = 1 and f (y ) ≥ f (x) + g T (y − x) = f (x) + kg k2 > f (x) + L, contradiction. 7
  10. Convergence analysis - Proof Can prove both results from same basic inequality. Key steps: I Using definition of subgradient kx (k) − x ∗ k22 = kx (k−1) − tk g (k−1) − x ∗ k22 = kx (k−1) − x ∗ k22 − 2tk g (k−1) (x (k−1) − x ∗ ) + tk2 kg (k−1) k22 ≤ kx (k−1) − x ∗ k22 − 2tk (f (x (k−1) ) − f (x ∗ )) + tk2 kg (k−1) k22 . 8
  11. Convergence analysis - Proof Can prove both results from same basic inequality. Key steps: I Using definition of subgradient kx (k) − x ∗ k22 = kx (k−1) − tk g (k−1) − x ∗ k22 = kx (k−1) − x ∗ k22 − 2tk g (k−1) (x (k−1) − x ∗ ) + tk2 kg (k−1) k22 ≤ kx (k−1) − x ∗ k22 − 2tk (f (x (k−1) ) − f (x ∗ )) + tk2 kg (k−1) k22 . I Iterating last inequality kx (k) − x ∗ k22 k X k X ≤ kx (0) − x ∗ k22 − 2 ti (f (x (i−1) ) − f (x ∗ )) + ti2 kg (i−1) k22 . i=1 i=1 8
  12. Convergence analysis - Proof I Using kx (k) − x ∗ k2 ≥ 0 and letting R = kx (0) − x ∗ k2 , we have k X k X 0 ≤ R2 − 2 ti (f (x (i−1) ) − f (x ∗ )) + ti2 kg (i−1) k22 . i=1 i=1 9
  13. Convergence analysis - Proof I Using kx (k) − x ∗ k2 ≥ 0 and letting R = kx (0) − x ∗ k2 , we have k X k X 0 ≤ R2 − 2 ti (f (x (i−1) ) − f (x ∗ )) + ti2 kg (i−1) k22 . i=1 i=1 (k) I Introducing f (xbest ) = mini=0,...k f (x (i) ) and rearranging, we have the basic inequality Pk (i−1) k2 (k) ∗ R2 + 2 i=1 ti kg 2 f (xbest ) − f (x ) ≤ Pk . 2 i=1 ti 9
  14. Convergence analysis - Proof I Using kx (k) − x ∗ k2 ≥ 0 and letting R = kx (0) − x ∗ k2 , we have k X k X 0 ≤ R2 − 2 ti (f (x (i−1) ) − f (x ∗ )) + ti2 kg (i−1) k22 . i=1 i=1 (k) I Introducing f (xbest ) = mini=0,...k f (x (i) ) and rearranging, we have the basic inequality Pk (i−1) k2 (k) ∗ R2 + 2 i=1 ti kg 2 f (xbest ) − f (x ) ≤ Pk . 2 i=1 ti For different step sizes choices, convergence results can be directly obtained from this bound. E.g., theorems for fixed and diminishing step sizes follow. 9
  15. Convergence analysis - Proof The basic inequality tells us that after k steps, we have R 2 + ki=1 ti2 kg (i−1) k22 P (k) ∗ f (xbest ) − f (x ) ≤ . 2 ki=1 ti P With fixed step size t, this and the Lipschitz continuity assumption give (k) R2 L2 t f (xbest ) − f ∗ ≤ + . 2kt 2 10
  16. Convergence analysis - Proof The basic inequality tells us that after k steps, we have R 2 + ki=1 ti2 kg (i−1) k22 P (k) ∗ f (xbest ) − f (x ) ≤ . 2 ki=1 ti P With fixed step size t, this and the Lipschitz continuity assumption give (k) R2 L2 t f (xbest ) − f ∗ ≤ + . 2kt 2 I Does not guarantee convergence (as k → ∞). (k) 2 I For large k, f (xbest ) is approximately L2t -suboptimal. I To make the gap ≤ ε, let’s make each term ≤ ε/2. So we can choose t = ε/L2 , and k = R 2 /t · 1/ε = R 2 L2 /ε2 . 10
  17. Convergence analysis - Proof The basic inequality tells us that after k steps, we have R 2 + ki=1 ti2 kg (i−1) k22 P (k) ∗ f (xbest ) − f (x ) ≤ . 2 ki=1 ti P With fixed step size t, this and the Lipschitz continuity assumption give (k) R2 L2 t f (xbest ) − f ∗ ≤ + . 2kt 2 I Does not guarantee convergence (as k → ∞). (k) 2 I For large k, f (xbest ) is approximately L2t -suboptimal. I To make the gap ≤ ε, let’s make each term ≤ ε/2. So we can choose t = ε/L2 , and k = R 2 /t · 1/ε = R 2 L2 /ε2 . I.e., subgradient method guarantees the gap ε in k = O(1/ε2 ) iterations ... compare this to O(1/ε) rate of gradient descent. 10
  18. Convergence analysis - Proof The basic inequality tells us that after k steps, we have R 2 + ki=1 ti2 kg (i−1) k22 P (k) ∗ f (xbest ) − f (x ) ≤ . 2 ki=1 ti P With fixed step length, i.e., ti = s/kg (i−1) k2 , we have (k) R 2 + ks 2 R 2 + ks 2 LR 2 Ls f (xbest )−f (x ∗ ) ≤ ≤ = + . 2s ki=1 kg (i−1) k−1 2s ki=1 L−1 2ks 2 P P 2 11
  19. Convergence analysis - Proof The basic inequality tells us that after k steps, we have R 2 + ki=1 ti2 kg (i−1) k22 P (k) ∗ f (xbest ) − f (x ) ≤ . 2 ki=1 ti P With fixed step length, i.e., ti = s/kg (i−1) k2 , we have (k) R 2 + ks 2 R 2 + ks 2 LR 2 Ls f (xbest )−f (x ∗ ) ≤ ≤ = + . 2s ki=1 kg (i−1) k−1 2s ki=1 L−1 2ks 2 P P 2 I Does not guarantee convergence (as k → ∞). (k) Ls I For large k, f (xbest ) is approximately 2 -suboptimal. I To make the gap ≤ ε, let’s make each term ≤ ε/2. So we can choose s = ε/L, and k = LR 2 /s · 1/ε = R 2 L2 /ε2 . 11
  20. Convergence analysis - Proof The basic inequality tells us that after k steps, we have R 2 + ki=1 ti2 kg (i−1) k22 P (k) ∗ f (xbest ) − f (x ) ≤ . 2 ki=1 ti P From this and the Lipschitz continuity, we have R 2 + L2 ki=1 ti2 P (k) ∗ f (xbest ) − f (x ) ≤ . 2 ki=1 ti P With diminishing step size, ∞ P P∞ 2 i=1 ti = ∞ and i=1 ti < ∞, there holds (k) lim f (xbest ) = f ∗ . k→∞ 12
ADSENSE

CÓ THỂ BẠN MUỐN DOWNLOAD

 

Đồng bộ tài khoản
2=>2