Lecture: Advanced Optimization, Chapter 8 - Hoàng Nam Dũng
Bài giảng "Tối ưu hóa nâng cao - Chương 8: Proximal gradient descent (and acceleration)" cung cấp cho người học các kiến thức: Subgradient method, decomposable functions, proximal mapping, proximal gradient descent,.... Mời các bạn cùng tham khảo.
- Proximal Gradient Descent (and Acceleration)
Hoàng Nam Dũng
Faculty of Mathematics, Mechanics and Informatics, VNU University of Science, Vietnam National University, Hanoi
- Last time: subgradient method
Consider the problem
$$\min_x f(x)$$
with $f$ convex and $\mathrm{dom}(f) = \mathbb{R}^n$.
Subgradient method: choose an initial $x^{(0)} \in \mathbb{R}^n$ and repeat
$$x^{(k)} = x^{(k-1)} - t_k \cdot g^{(k-1)}, \quad k = 1, 2, 3, \ldots$$
where $g^{(k-1)} \in \partial f(x^{(k-1)})$. We use pre-set rules for the step sizes (e.g., the diminishing step size rule).
If $f$ is Lipschitz, then the subgradient method has convergence rate $O(1/\varepsilon^2)$.
Upside: very generic. Downside: can be slow; we address this today.
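To make the recap concrete, here is a minimal Python sketch of the subgradient method with diminishing step sizes; the test function $f(x) = \|x\|_1$ and the rule $t_k = 1/k$ are assumptions chosen for this illustration, not part of the slides.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps=1000):
    """Subgradient method with diminishing step sizes t_k = 1/k.

    Tracks the best iterate, since f(x^(k)) need not decrease monotonically.
    """
    x = best = x0.copy()
    for k in range(1, steps + 1):
        g = subgrad(x)           # any g in the subdifferential of f at x
        x = x - (1.0 / k) * g    # diminishing step size rule
        if f(x) < f(best):
            best = x
    return best

# Example: f(x) = ||x||_1, whose subdifferential contains sign(x)
f = lambda x: np.sum(np.abs(x))
x_best = subgradient_method(f, np.sign, np.array([3.0, -2.0, 1.5]))
print(x_best)  # approaches the minimizer x = 0
```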
- Outline
Today:
▶ Proximal gradient descent
▶ Convergence analysis
▶ ISTA, matrix completion
▶ Special cases
▶ Acceleration
- Decomposable functions
Suppose
$$f(x) = g(x) + h(x)$$
where
▶ $g$ is convex, differentiable, $\mathrm{dom}(g) = \mathbb{R}^n$,
▶ $h$ is convex, not necessarily differentiable.
If $f$ were differentiable, then the gradient descent update would be
$$x^+ = x - t \cdot \nabla f(x).$$
Recall the motivation: minimize the quadratic approximation to $f$ around $x$, replacing $\nabla^2 f(x)$ by $\frac{1}{t}I$:
$$x^+ = \operatorname{argmin}_z \underbrace{f(x) + \nabla f(x)^T (z - x) + \frac{1}{2t}\|z - x\|_2^2}_{\tilde f_t(z)}.$$
- Decomposable functions
In our case $f$ is not differentiable, but $f = g + h$ with $g$ differentiable. Why don't we make a quadratic approximation to $g$ and leave $h$ alone? I.e., update
$$\begin{aligned}
x^+ &= \operatorname{argmin}_z \; \tilde g_t(z) + h(z) \\
&= \operatorname{argmin}_z \; g(x) + \nabla g(x)^T (z - x) + \frac{1}{2t}\|z - x\|_2^2 + h(z) \\
&= \operatorname{argmin}_z \; \frac{1}{2t}\big\|z - \big(x - t\nabla g(x)\big)\big\|_2^2 + h(z).
\end{aligned}$$
The term $\frac{1}{2t}\|z - (x - t\nabla g(x))\|_2^2$ keeps $z$ close to the gradient update for $g$, while $h(z)$ also makes $h$ small.
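The last step follows by completing the square (a short derivation added for clarity; it is not spelled out on the slide):
$$g(x) + \nabla g(x)^T (z - x) + \frac{1}{2t}\|z - x\|_2^2 = \frac{1}{2t}\big\|z - \big(x - t\nabla g(x)\big)\big\|_2^2 + g(x) - \frac{t}{2}\|\nabla g(x)\|_2^2,$$
and the terms that do not involve $z$ do not change the argmin.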
- Proximal mapping
The proximal mapping (or prox-operator) of a convex function $h$ is defined as
$$\operatorname{prox}_h(x) = \operatorname{argmin}_z \; \frac{1}{2}\|x - z\|_2^2 + h(z).$$
Examples:
▶ $h(x) = 0$: $\operatorname{prox}_h(x) = x$.
▶ $h(x)$ is the indicator function of a closed convex set $C$: $\operatorname{prox}_h$ is the projection onto $C$,
$$\operatorname{prox}_h(x) = \operatorname{argmin}_{z \in C} \frac{1}{2}\|x - z\|_2^2 = P_C(x).$$
▶ $h(x) = \|x\|_1$: $\operatorname{prox}_h$ is the 'soft-threshold' (shrinkage) operation,
$$\operatorname{prox}_h(x)_i = \begin{cases} x_i - 1 & x_i \ge 1 \\ 0 & |x_i| \le 1 \\ x_i + 1 & x_i \le -1. \end{cases}$$
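A minimal Python sketch of these three prox-operators; the box $C = [lo, hi]^n$ used for the projection example is an illustrative assumption (any closed convex set with a computable projection would do):

```python
import numpy as np

def prox_zero(x):
    """prox of h = 0 is the identity."""
    return x

def prox_box(x, lo, hi):
    """prox of the indicator of the box C = [lo, hi]^n is the projection onto C."""
    return np.clip(x, lo, hi)

def prox_l1(x, lam=1.0):
    """prox of h(x) = lam * ||x||_1 is soft-thresholding at level lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([2.0, -0.5, -3.0])
print(prox_zero(x))        # [ 2.  -0.5 -3. ]
print(prox_box(x, -1, 1))  # [ 1.  -0.5 -1. ]
print(prox_l1(x))          # [ 1.   0.  -2. ]
```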
- Proximal mapping
Theorem
If $h$ is convex and closed (has closed epigraph), then
$$\operatorname{prox}_h(x) = \operatorname{argmin}_z \; \frac{1}{2}\|x - z\|_2^2 + h(z)$$
exists and is unique for all $x$.
Proof.
See http://www.seas.ucla.edu/~vandenbe/236C/lectures/proxop.pdf. Uniqueness holds since the objective function is strictly convex.
Optimality condition:
$$z = \operatorname{prox}_h(x) \;\Leftrightarrow\; x - z \in \partial h(z) \;\Leftrightarrow\; h(u) \ge h(z) + (x - z)^T(u - z), \;\; \forall u.$$
- Properties of proximal mapping
Theorem
Proximal mappings are firmly nonexpansive (co-coercive with constant 1):
$$\big(\operatorname{prox}_h(x) - \operatorname{prox}_h(y)\big)^T (x - y) \ge \|\operatorname{prox}_h(x) - \operatorname{prox}_h(y)\|_2^2.$$
Proof.
With $u = \operatorname{prox}_h(x)$ and $v = \operatorname{prox}_h(y)$ we have $x - u \in \partial h(u)$ and $y - v \in \partial h(v)$. From the monotonicity of the subdifferential we get
$$\big((x - u) - (y - v)\big)^T (u - v) \ge 0,$$
which rearranges to the claimed inequality.
From firm nonexpansiveness and the Cauchy-Schwarz inequality we get nonexpansiveness (Lipschitz continuity with constant 1):
$$\|\operatorname{prox}_h(x) - \operatorname{prox}_h(y)\|_2 \le \|x - y\|_2.$$
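As a quick numerical sanity check (added here, not part of the slides), we can test the firm nonexpansiveness inequality for the soft-thresholding prox from the earlier example on random pairs of points:

```python
import numpy as np

def prox_l1(x, lam=1.0):
    """Soft-thresholding: the prox of lam * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    px, py = prox_l1(x), prox_l1(y)
    # firm nonexpansiveness: <px - py, x - y> >= ||px - py||^2
    assert (px - py) @ (x - y) >= (px - py) @ (px - py) - 1e-12
print("firm nonexpansiveness verified on 1000 random pairs")
```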
- Proximal gradient descent
Proximal gradient descent: choose an initial $x^{(0)}$ and repeat
$$x^{(k)} = \operatorname{prox}_{t_k h}\big(x^{(k-1)} - t_k \cdot \nabla g(x^{(k-1)})\big), \quad k = 1, 2, 3, \ldots$$
To make this update step look familiar, we can rewrite it as
$$x^{(k)} = x^{(k-1)} - t_k \cdot G_{t_k}(x^{(k-1)}),$$
where $G_t$ is the generalized gradient of $f$,
$$G_t(x) = \frac{x - \operatorname{prox}_{th}\big(x - t\nabla g(x)\big)}{t}.$$
For $h = 0$ it is gradient descent.
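A minimal Python sketch of this iteration (the fixed step size and the function signatures are assumptions for illustration):

```python
import numpy as np

def proximal_gradient(grad_g, prox_th, x0, t, steps=500):
    """Proximal gradient descent with fixed step size t.

    grad_g(x):     gradient of the smooth part g
    prox_th(x, t): proximal mapping of t*h evaluated at x
    """
    x = x0.copy()
    for _ in range(steps):
        x = prox_th(x - t * grad_g(x), t)  # gradient step on g, then prox of t*h
    return x
```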
- Examples
Minimize $g(x) + h(x)$.
Gradient method: special case with $h(x) = 0$,
$$x^+ = x - t\nabla g(x).$$
Gradient projection method: special case with $h(x) = \delta_C(x)$ (indicator of $C$),
$$x^+ = P_C\big(x - t\nabla g(x)\big).$$
[Figure: the gradient step $x - t\nabla g(x)$ may leave $C$; projecting it back onto $C$ gives $x^+$.]
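As a usage example (an added illustration reusing the `proximal_gradient` sketch above), the gradient projection method is obtained by taking the prox to be a projection; here we minimize a simple quadratic over a box:

```python
import numpy as np
# assumes proximal_gradient from the sketch above is in scope

# Minimize g(x) = 0.5*||x - c||^2 over the box C = [-1, 1]^3
c = np.array([2.0, 0.3, -5.0])
x = proximal_gradient(grad_g=lambda x: x - c,                      # gradient of g
                      prox_th=lambda x, t: np.clip(x, -1.0, 1.0),  # projection onto C
                      x0=np.zeros(3), t=0.5)
print(x)  # the projection of c onto the box: [ 1.   0.3 -1. ]
```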
- What good did this do?
You have a right to be suspicious... it may look like we just swapped one minimization problem for another. The key point is that $\operatorname{prox}_h(\cdot)$ can be computed analytically for a lot of important functions $h$ (see http://www.seas.ucla.edu/~vandenbe/236C/lectures/proxop.pdf).
Note:
▶ The mapping $\operatorname{prox}_h(\cdot)$ doesn't depend on $g$ at all, only on $h$.
▶ The smooth part $g$ can be complicated; we only need to compute its gradients.
Convergence analysis will be in terms of the number of iterations of the algorithm. Each iteration evaluates $\operatorname{prox}_h(\cdot)$ once, and this can be cheap or expensive depending on $h$.
- Example: ISTA (Iterative Shrinkage-Thresholding Algorithm)
Given $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, recall the lasso criterion
$$f(\beta) = \underbrace{\frac{1}{2}\|y - X\beta\|_2^2}_{g(\beta)} + \underbrace{\lambda\|\beta\|_1}_{h(\beta)}.$$
The proximal mapping is now
$$\operatorname{prox}_{th}(\beta) = \operatorname{argmin}_z \; \frac{1}{2t}\|\beta - z\|_2^2 + \lambda\|z\|_1 = S_{\lambda t}(\beta),$$
where $S_\lambda(\beta)$ is the soft-thresholding operator,
$$[S_\lambda(\beta)]_i = \begin{cases} \beta_i - \lambda & \text{if } \beta_i > \lambda \\ 0 & \text{if } -\lambda \le \beta_i \le \lambda \\ \beta_i + \lambda & \text{if } \beta_i < -\lambda, \end{cases} \quad i = 1, \ldots, p.$$
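A minimal ISTA sketch assembling the pieces above; the synthetic data and the step size $t = 1/\|X\|_2^2$ (the reciprocal of the Lipschitz constant of $\nabla g$) are assumptions for this illustration:

```python
import numpy as np

def ista(X, y, lam, steps=500):
    """ISTA for the lasso: min 0.5*||y - X b||^2 + lam*||b||_1."""
    t = 1.0 / np.linalg.norm(X, 2) ** 2   # step 1/L, with L = sigma_max(X)^2
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ beta - y)       # gradient of the smooth part g
        z = beta - t * grad               # gradient step
        # prox of t*h: soft-threshold at level lam*t, i.e. S_{lam*t}(z)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam * t, 0.0)
    return beta

# Synthetic example with a sparse ground truth
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + 0.1 * rng.normal(size=100)
print(np.round(ista(X, y, lam=0.5), 2))  # recovers a sparse estimate
```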