Proximal Gradient Descent (and Acceleration)

Hoàng Nam Dũng

Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội

Last time: subgradient method

Consider the problem

$$\min_x f(x)$$

with $f$ convex and $\mathrm{dom}(f) = \mathbb{R}^n$.

Subgradient method: choose an initial $x^{(0)} \in \mathbb{R}^n$, and repeat:

$$x^{(k)} = x^{(k-1)} - t_k \cdot g^{(k-1)}, \quad k = 1, 2, 3, \ldots,$$

where $g^{(k-1)} \in \partial f(x^{(k-1)})$. We use pre-set rules for the step sizes (e.g., a diminishing step size rule).

If $f$ is Lipschitz, then the subgradient method has a convergence rate of $O(1/\varepsilon^2)$.

Upside: very generic. Downside: can be slow — addressed today.

Outline

Today:

▸ Proximal gradient descent
▸ Convergence analysis
▸ ISTA, matrix completion
▸ Special cases
▸ Acceleration

Decomposable functions

Suppose

$$f(x) = g(x) + h(x)$$

where

▸ $g$ is convex, differentiable, $\mathrm{dom}(g) = \mathbb{R}^n$
▸ $h$ is convex, not necessarily differentiable.

If $f$ were differentiable, then the gradient descent update would be

$$x^+ = x - t \cdot \nabla f(x).$$

Recall the motivation: minimize the quadratic approximation to $f$ around $x$, replacing $\nabla^2 f(x)$ by $\frac{1}{t} I$:

$$x^+ = \operatorname*{argmin}_z \; \underbrace{f(x) + \nabla f(x)^T (z - x) + \frac{1}{2t}\|z - x\|_2^2}_{\tilde f_t(z)}.$$

Decomposable functions

In our case $f$ is not differentiable, but $f = g + h$ with $g$ differentiable. Why don't we make a quadratic approximation to $g$ and leave $h$ alone?

I.e., update

$$\begin{aligned}
x^+ &= \operatorname*{argmin}_z \; \tilde g_t(z) + h(z) \\
&= \operatorname*{argmin}_z \; g(x) + \nabla g(x)^T (z - x) + \frac{1}{2t}\|z - x\|_2^2 + h(z) \\
&= \operatorname*{argmin}_z \; \underbrace{\frac{1}{2t}\big\|z - (x - t\nabla g(x))\big\|_2^2}_{\text{stay close to gradient update for } g} + \underbrace{h(z)}_{\text{also make } h \text{ small}}.
\end{aligned}$$

Proximal mapping

The proximal mapping (or prox-operator) of a convex function $h$ is defined as

$$\mathrm{prox}_h(x) = \operatorname*{argmin}_z \; \frac{1}{2}\|x - z\|_2^2 + h(z).$$

Examples:

▸ $h(x) = 0$: $\mathrm{prox}_h(x) = x$.
▸ $h(x)$ is the indicator function of a closed convex set $C$: $\mathrm{prox}_h$ is the projection onto $C$,
  $$\mathrm{prox}_h(x) = \operatorname*{argmin}_{z \in C} \frac{1}{2}\|x - z\|_2^2 = P_C(x).$$
▸ $h(x) = \|x\|_1$: $\mathrm{prox}_h$ is the 'soft-threshold' (shrinkage) operation,
  $$\mathrm{prox}_h(x)_i = \begin{cases} x_i - 1 & x_i \ge 1 \\ 0 & |x_i| \le 1 \\ x_i + 1 & x_i \le -1. \end{cases}$$
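A minimal NumPy sketch of the three example prox operators above; for the projection example we assume a box $C = [\text{lo}, \text{hi}]^n$ so the projection is a clip (the names and test values are illustrative):

```python
import numpy as np

def prox_identity(x):
    # h = 0: the prox is the identity.
    return x

def prox_box(x, lo, hi):
    # h = indicator of the box C = [lo, hi]^n: prox is projection onto C.
    return np.clip(x, lo, hi)

def prox_l1(x, lam=1.0):
    # h = lam * ||x||_1: componentwise soft-thresholding at level lam.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([2.0, -0.5, -3.0])
print(prox_l1(x))           # [ 1.  -0.  -2. ]
print(prox_box(x, -1, 1))   # [ 1.  -0.5 -1. ]
```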

Proximal mapping

Theorem. If $h$ is convex and closed (has closed epigraph), then

$$\mathrm{prox}_h(x) = \operatorname*{argmin}_z \; \frac{1}{2}\|x - z\|_2^2 + h(z)$$

exists and is unique for all $x$.

Proof. See http://www.seas.ucla.edu/~vandenbe/236C/lectures/proxop.pdf. Uniqueness holds since the objective function is strictly convex.

Optimality condition:

$$z = \mathrm{prox}_h(x) \;\Leftrightarrow\; x - z \in \partial h(z) \;\Leftrightarrow\; h(u) \ge h(z) + (x - z)^T (u - z), \;\; \forall u.$$

Properties of proximal mapping

Theorem. Proximal mappings are firmly nonexpansive (co-coercive with constant 1):

$$(\mathrm{prox}_h(x) - \mathrm{prox}_h(y))^T (x - y) \ge \|\mathrm{prox}_h(x) - \mathrm{prox}_h(y)\|_2^2.$$

Proof. With $u = \mathrm{prox}_h(x)$ and $v = \mathrm{prox}_h(y)$ we have

$$x - u \in \partial h(u) \quad \text{and} \quad y - v \in \partial h(v).$$

From the monotonicity of the subdifferential we get

$$\big((x - u) - (y - v)\big)^T (u - v) \ge 0.$$

From firm nonexpansiveness and the Cauchy-Schwarz inequality we get nonexpansiveness (Lipschitz continuity with constant 1):

$$\|\mathrm{prox}_h(x) - \mathrm{prox}_h(y)\|_2 \le \|x - y\|_2.$$
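A quick numerical sanity check of firm nonexpansiveness on the soft-thresholding prox from above (a sketch, not a proof; random pairs only):

```python
import numpy as np

rng = np.random.default_rng(0)

def prox_l1(x, lam=1.0):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Check (prox(x) - prox(y))^T (x - y) >= ||prox(x) - prox(y)||_2^2
# on a few random pairs.
for _ in range(5):
    x, y = rng.normal(size=10), rng.normal(size=10)
    d = prox_l1(x) - prox_l1(y)
    assert d @ (x - y) >= d @ d - 1e-12
print("firm nonexpansiveness holds on sampled pairs")
```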

Proximal gradient descent

Proximal gradient descent: choose an initial $x^{(0)}$, repeat:

$$x^{(k)} = \mathrm{prox}_{t_k h}\big(x^{(k-1)} - t_k \cdot \nabla g(x^{(k-1)})\big), \quad k = 1, 2, 3, \ldots$$

To make this update step look familiar, we can rewrite it as

$$x^{(k)} = x^{(k-1)} - t_k \cdot G_{t_k}(x^{(k-1)}),$$

where $G_t$ is the generalized gradient of $f$,

$$G_t(x) = \frac{x - \mathrm{prox}_{th}\big(x - t\nabla g(x)\big)}{t}.$$

For h = 0 it is gradient descent.
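A minimal generic sketch of this iteration; the callables `grad_g` and `prox` are placeholders to be supplied for the problem at hand:

```python
def prox_gradient(x0, grad_g, prox, t, n_iter=1000):
    """Proximal gradient descent with fixed step size t.

    grad_g(x): gradient of the smooth part g
    prox(x, t): evaluates prox_{t h}(x)
    """
    x = x0
    for _ in range(n_iter):
        x = prox(x - t * grad_g(x), t)
    return x
```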

What good did this do?

You have a right to be suspicious ... it may look like we just swapped one minimization problem for another.

Key point is that $\mathrm{prox}_h(\cdot)$ can be computed analytically for a lot of important functions $h$.¹

Note:

▸ The mapping $\mathrm{prox}_h(\cdot)$ doesn't depend on $g$ at all, only on $h$.
▸ The smooth part $g$ can be complicated; we only need to compute its gradients.

Convergence analysis will be in terms of the number of iterations of the algorithm. Each iteration evaluates $\mathrm{prox}_h(\cdot)$ once, and this can be cheap or expensive depending on $h$.

¹See http://www.seas.ucla.edu/~vandenbe/236C/lectures/proxop.pdf

Example: ISTA (Iterative Shrinkage-Thresholding Algorithm)

Given $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, recall the lasso criterion

$$f(\beta) = \underbrace{\frac{1}{2}\|y - X\beta\|_2^2}_{g(\beta)} + \underbrace{\lambda\|\beta\|_1}_{h(\beta)}.$$

The proximal mapping is now

$$\mathrm{prox}_{th}(\beta) = \operatorname*{argmin}_z \; \frac{1}{2t}\|\beta - z\|_2^2 + \lambda\|z\|_1 = S_{\lambda t}(\beta),$$

where $S_\lambda(\beta)$ is the soft-thresholding operator,

$$[S_\lambda(\beta)]_i = \begin{cases} \beta_i - \lambda & \text{if } \beta_i > \lambda \\ 0 & \text{if } -\lambda \le \beta_i \le \lambda \\ \beta_i + \lambda & \text{if } \beta_i < -\lambda, \end{cases} \quad i = 1, \ldots, p.$$

Example: ISTA (Iterative Shrinkage-Thresholding Algorithm)

Recall $\nabla g(\beta) = -X^T(y - X\beta)$, hence the proximal gradient update is

$$\beta^+ = S_{\lambda t}\big(\beta + t X^T (y - X\beta)\big).$$

Often called the iterative soft-thresholding algorithm (ISTA).² Very simple algorithm.

²Beck and Teboulle (2008), "A fast iterative shrinkage-thresholding algorithm for linear inverse problems"
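A minimal sketch of ISTA for the lasso; the fixed step size $t = 1/L$ with $L = \sigma_{\max}(X)^2$ (the Lipschitz constant of $\nabla g$) and the iteration count are illustrative choices:

```python
import numpy as np

def soft_threshold(b, lam):
    # Vector soft-thresholding S_lam(b).
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def ista(X, y, lam, n_iter=500):
    # Fixed step size t = 1/L, where L = sigma_max(X)^2.
    t = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # beta+ = S_{lam t}(beta + t X^T (y - X beta))
        beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
    return beta
```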

Example of proximal gradient (ISTA) vs. subgradient method convergence rates

[Figure: convergence curves for proximal gradient (ISTA) vs. the subgradient method.]

Backtracking line search

Backtracking for prox gradient descent works similarly as before (in gradient descent), but operates on $g$ and not $f$.

Choose parameter $0 < \beta < 1$. At each iteration, start at $t = t_{\text{init}}$, and while

$$g\big(x - tG_t(x)\big) > g(x) - t\nabla g(x)^T G_t(x) + \frac{t}{2}\|G_t(x)\|_2^2,$$

shrink $t = \beta t$. Else perform the proximal gradient update.

(Alternative formulations exist that require less computation, i.e., fewer calls to prox.)
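One way this loop might look in code, a sketch with the same placeholder callables `g`, `grad_g`, and `prox` as in the generic proximal gradient sketch:

```python
def prox_grad_step_backtracking(x, g, grad_g, prox, t_init=1.0, beta=0.5):
    """One proximal gradient step with backtracking on g.

    Shrinks t until the quadratic upper-bound condition holds, then
    returns the update x - t * G_t(x).
    """
    t = t_init
    while True:
        Gt = (x - prox(x - t * grad_g(x), t)) / t  # generalized gradient
        lhs = g(x - t * Gt)
        rhs = g(x) - t * (grad_g(x) @ Gt) + (t / 2) * (Gt @ Gt)
        if lhs <= rhs:
            return x - t * Gt
        t *= beta
```

Note that each pass through the loop evaluates the prox at a new $t$, which matters when the prox itself is expensive (see the matrix completion discussion later).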

Convergence analysis

For criterion $f(x) = g(x) + h(x)$, we assume

▸ $g$ is convex, differentiable, $\mathrm{dom}(g) = \mathbb{R}^n$, and $\nabla g$ is Lipschitz continuous with constant $L > 0$.
▸ $h$ is convex, and $\mathrm{prox}_{th}(x) = \operatorname*{argmin}_z \{\|x - z\|_2^2/(2t) + h(z)\}$ can be evaluated.

Theorem. Proximal gradient descent with fixed step size $t \le 1/L$ satisfies

$$f(x^{(k)}) - f^* \le \frac{\|x^{(0)} - x^*\|_2^2}{2tk},$$

and the same result holds for backtracking with $t$ replaced by $\beta/L$.

Proximal gradient descent thus has convergence rate $O(1/k)$, or $O(1/\varepsilon)$. Same as gradient descent! (But remember, the prox cost matters ...)

Proof: See http://www.seas.ucla.edu/~vandenbe/236C/lectures/proxgrad.pdf

Example: matrix completion

Given a matrix $Y \in \mathbb{R}^{m \times n}$, we only observe entries $Y_{ij}$, $(i, j) \in \Omega$. Suppose we want to fill in the missing entries (e.g., for a recommender system), so we solve a matrix completion problem:³

$$\min_B \; \frac{1}{2}\sum_{(i,j)\in\Omega} (Y_{ij} - B_{ij})^2 + \lambda\|B\|_{\mathrm{tr}}.$$

Here $\|B\|_{\mathrm{tr}}$ is the trace (or nuclear) norm of $B$,

$$\|B\|_{\mathrm{tr}} = \sum_{i=1}^r \sigma_i(B),$$

where $r = \mathrm{rank}(B)$ and $\sigma_1(B) \ge \cdots \ge \sigma_r(B) \ge 0$ are the singular values.⁴

³Wikipedia: In the case of the Netflix problem the ratings matrix is expected to be low-rank since user preferences can often be described by a few factors, such as the movie genre and time of release.
⁴https://math.berkeley.edu/~hutching/teach/54-2017/svd-notes.pdf

Example: matrix completion

Define $P_\Omega$, the projection operator onto the observed set:

$$[P_\Omega(B)]_{ij} = \begin{cases} B_{ij} & (i, j) \in \Omega \\ 0 & (i, j) \notin \Omega. \end{cases}$$

Then the criterion is

$$f(B) = \underbrace{\frac{1}{2}\|P_\Omega(Y) - P_\Omega(B)\|_F^2}_{g(B)} + \underbrace{\lambda\|B\|_{\mathrm{tr}}}_{h(B)}.$$

Two ingredients needed for proximal gradient descent:

▸ Gradient calculation: $\nabla g(B) = -(P_\Omega(Y) - P_\Omega(B))$.
▸ Prox function: $\mathrm{prox}_{th}(B) = \operatorname*{argmin}_Z \frac{1}{2t}\|B - Z\|_F^2 + \lambda\|Z\|_{\mathrm{tr}}$.

Example: matrix completion

Claim:

$$\mathrm{prox}_{th}(B) = S_{\lambda t}(B),$$

matrix soft-thresholding at the level $\lambda t$. Here $S_\lambda(B)$ is defined by

$$S_\lambda(B) = U\Sigma_\lambda V^T,$$

where $B = U\Sigma V^T$ is an SVD, and $\Sigma_\lambda$ is diagonal with $(\Sigma_\lambda)_{ii} = \max\{\Sigma_{ii} - \lambda, 0\}$.

Proof: note that $\mathrm{prox}_{th}(B) = Z$, where $Z$ satisfies

$$0 \in Z - B + \lambda t \cdot \partial\|Z\|_{\mathrm{tr}}.$$

Helpful fact: if $Z = U\Sigma V^T$, then

$$\partial\|Z\|_{\mathrm{tr}} = \big\{UV^T + W : \|W\|_{\mathrm{op}} \le 1,\; U^T W = 0,\; WV = 0\big\}.$$

Now plug in $Z = S_{\lambda t}(B)$ and check that we can get 0.

Example: matrix completion

Hence the proximal gradient update step is

$$B^+ = S_{\lambda t}\big(B + t(P_\Omega(Y) - P_\Omega(B))\big).$$

Note that $\nabla g(B)$ is Lipschitz continuous with $L = 1$, so we can choose fixed step size $t = 1$. The update step is now

$$B^+ = S_\lambda\big(P_\Omega(Y) + P_\Omega^\perp(B)\big),$$

where $P_\Omega^\perp$ projects onto the unobserved set, $P_\Omega(B) + P_\Omega^\perp(B) = B$.

This is the soft-impute algorithm,⁵ a simple and effective method for matrix completion.

⁵Mazumder et al. (2011), "Spectral regularization algorithms for learning large incomplete matrices"
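A compact NumPy sketch of this $t = 1$ update (dense SVD, illustrative only; we assume a boolean array `mask` encodes $\Omega$, and unobserved entries of `Y` are simply ignored):

```python
import numpy as np

def matrix_soft_threshold(B, lam):
    # S_lam(B): soft-threshold the singular values of B.
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def soft_impute(Y, mask, lam, n_iter=100):
    # mask: True on observed entries (i, j) in Omega.
    B = np.zeros_like(Y)
    for _ in range(n_iter):
        # With t = 1: B+ = S_lam(P_Omega(Y) + P_Omega_perp(B)).
        B = matrix_soft_threshold(np.where(mask, Y, B), lam)
    return B
```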

Special cases

Proximal gradient descent is also called composite gradient descent, or generalized gradient descent.

Why "generalized"? This refers to the several special cases, when minimizing $f = g + h$:

▸ $h = 0$: gradient descent
▸ $h = I_C$: projected gradient descent
▸ $g = 0$: proximal minimization algorithm.

Therefore these algorithms all have $O(1/\varepsilon)$ convergence rate.

Projected gradient descent

Given a closed, convex set $C \subseteq \mathbb{R}^n$,

$$\min_{x \in C} g(x) \iff \min_x g(x) + I_C(x),$$

where

$$I_C(x) = \begin{cases} 0 & x \in C \\ \infty & x \notin C \end{cases}$$

is the indicator function of $C$.

We have

$$\mathrm{prox}_{tI_C}(x) = \operatorname*{argmin}_z \frac{1}{2t}\|x - z\|_2^2 + I_C(z) = \operatorname*{argmin}_{z \in C} \|x - z\|_2^2,$$

i.e., $\mathrm{prox}_{tI_C}(x) = P_C(x)$, the projection operator onto $C$.

Projected gradient descent

Therefore the proximal gradient update step is

$$x^+ = P_C\big(x - t\nabla g(x)\big),$$

i.e., perform the usual gradient update and then project back onto $C$. This is called projected gradient descent.
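A small sketch for the box-constrained case, where the projection is a clip (the objective and set are illustrative assumptions):

```python
import numpy as np

def projected_gradient(x0, grad_g, project, t, n_iter=1000):
    # Gradient step followed by projection back onto C.
    x = x0
    for _ in range(n_iter):
        x = project(x - t * grad_g(x))
    return x

# Example: minimize (1/2)||x - a||^2 over the box C = [0, 1]^3.
a = np.array([2.0, -0.3, 0.7])
x_star = projected_gradient(
    np.zeros(3),
    grad_g=lambda x: x - a,
    project=lambda x: np.clip(x, 0.0, 1.0),
    t=0.5,
)
print(x_star)  # approx [1.0, 0.0, 0.7]
```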

Proximal minimization algorithm

Consider, for $h$ convex (not necessarily differentiable),

$$\min_x h(x).$$

The proximal gradient update step is just

$$x^+ = \operatorname*{argmin}_z \frac{1}{2t}\|x - z\|_2^2 + h(z).$$

This is called the proximal minimization algorithm. Faster than the subgradient method, but not implementable unless we know the prox in closed form.

What happens if we can't evaluate prox?

Theory for proximal gradient, with $f = g + h$, assumes that the prox function can be evaluated, i.e., assumes the minimization

$$\mathrm{prox}_{th}(x) = \operatorname*{argmin}_z \frac{1}{2t}\|x - z\|_2^2 + h(z)$$

can be done exactly. In general, it is not clear what happens if we just minimize this approximately.

But, if you can precisely control the errors in approximating the prox operator, then you can recover the original convergence rates.⁶ In practice, if the prox evaluation is done approximately, then it should be done to decently high accuracy.

⁶Schmidt et al. (2011), "Convergence rates of inexact proximal-gradient methods for convex optimization"

Acceleration

Turns out we can accelerate proximal gradient descent in order to achieve the optimal $O(1/\sqrt{\varepsilon})$ convergence rate. Four ideas (three acceleration methods) by Nesterov:

▸ 1983: original acceleration idea for smooth functions
▸ 1988: another acceleration idea for smooth functions
▸ 2005: smoothing techniques for nonsmooth functions, coupled with the original acceleration idea
▸ 2007: acceleration idea for composite functions⁷

We will follow Beck and Teboulle (2008), an extension of Nesterov (1983) to composite functions.⁸

⁷Each step uses the entire history of previous steps and makes two prox calls.
⁸Each step uses information from the two last steps and makes one prox call.

Accelerated proximal gradient method

As before, consider

$$\min_x g(x) + h(x),$$

where $g$ is convex and differentiable, and $h$ is convex.

Accelerated proximal gradient method: choose an initial point $x^{(0)} = x^{(-1)} \in \mathbb{R}^n$, and repeat for $k = 1, 2, 3, \ldots$:

$$v = x^{(k-1)} + \frac{k-2}{k+1}\big(x^{(k-1)} - x^{(k-2)}\big)$$
$$x^{(k)} = \mathrm{prox}_{t_k h}\big(v - t_k \nabla g(v)\big).$$

▸ The first step $k = 1$ is just the usual proximal gradient update.
▸ After that, $v = x^{(k-1)} + \frac{k-2}{k+1}(x^{(k-1)} - x^{(k-2)})$ carries some "momentum" from previous iterations.
▸ $h = 0$ gives the accelerated gradient method.
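A generic sketch of this iteration, with the same placeholder callables `grad_g` and `prox` as in the proximal gradient sketch earlier:

```python
def accel_prox_gradient(x0, grad_g, prox, t, n_iter=1000):
    # Accelerated proximal gradient with fixed step size t.
    x_prev, x = x0, x0  # x^{(-1)} = x^{(0)}
    for k in range(1, n_iter + 1):
        v = x + (k - 2) / (k + 1) * (x - x_prev)  # momentum step
        x_prev, x = x, prox(v - t * grad_g(v), t)
    return x
```

Since `x_prev == x` initially, the $k = 1$ step reduces to the plain proximal gradient update, matching the first bullet above.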

Accelerated proximal gradient method

[Figure: momentum weights $(k-2)/(k+1)$ as a function of $k$.]

Accelerated proximal gradient method

Back to the lasso example: acceleration can really help!

[Figure: convergence on the lasso example, with and without acceleration.]

Note: accelerated proximal gradient is not a descent method.

Backtracking line search

Backtracking can be combined with acceleration in different ways.

Simple approach: fix $\beta < 1$, $t_0 = 1$. At iteration $k$, start with $t = t_{k-1}$, and while

$$g(x^+) > g(v) + \nabla g(v)^T (x^+ - v) + \frac{1}{2t}\|x^+ - v\|_2^2,$$

shrink $t = \beta t$ and let $x^+ = \mathrm{prox}_{th}(v - t\nabla g(v))$. Else keep $x^+$.

Note that this strategy forces us to take decreasing step sizes ... (more complicated strategies exist which avoid this).

Convergence analysis

For criterion $f(x) = g(x) + h(x)$, we assume as before:

▸ $g$ is convex, differentiable, $\mathrm{dom}(g) = \mathbb{R}^n$, and $\nabla g$ is Lipschitz continuous with constant $L > 0$.
▸ $h$ is convex, and $\mathrm{prox}_{th}(x) = \operatorname*{argmin}_z \{\|x - z\|_2^2/(2t) + h(z)\}$ can be evaluated.

Theorem. The accelerated proximal gradient method with fixed step size $t \le 1/L$ satisfies

$$f(x^{(k)}) - f^* \le \frac{2\|x^{(0)} - x^*\|_2^2}{t(k+1)^2},$$

and the same result holds for backtracking, with $t$ replaced by $\beta/L$.

Achieves the optimal rate $O(1/k^2)$, or $O(1/\sqrt{\varepsilon})$, for first-order methods.

FISTA (Fast ISTA)

Back to the lasso problem

$$\min_\beta \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1.$$

Recall ISTA (Iterative Soft-thresholding Algorithm):

$$\beta^{(k)} = S_{\lambda t_k}\big(\beta^{(k-1)} + t_k X^T (y - X\beta^{(k-1)})\big), \quad k = 1, 2, 3, \ldots,$$

with $S_\lambda(\cdot)$ being vector soft-thresholding. Applying acceleration gives us FISTA (F is for Fast):⁹ for $k = 1, 2, 3, \ldots$,

$$v = \beta^{(k-1)} + \frac{k-2}{k+1}\big(\beta^{(k-1)} - \beta^{(k-2)}\big)$$
$$\beta^{(k)} = S_{\lambda t_k}\big(v + t_k X^T (y - Xv)\big).$$

⁹Beck and Teboulle (2008) actually call their general acceleration technique (for general $g$, $h$) FISTA, which may be somewhat confusing.
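A sketch of FISTA for the lasso with fixed step size $1/L$ (the same illustrative choices as in the ISTA sketch above):

```python
import numpy as np

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def fista(X, y, lam, n_iter=500):
    t = 1.0 / np.linalg.norm(X, 2) ** 2  # fixed step size 1/L
    beta_prev = beta = np.zeros(X.shape[1])
    for k in range(1, n_iter + 1):
        v = beta + (k - 2) / (k + 1) * (beta - beta_prev)  # momentum
        beta_prev, beta = beta, soft_threshold(
            v + t * X.T @ (y - X @ v), lam * t
        )
    return beta
```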

ISTA vs. FISTA

[Figure: lasso regression, 100 instances (with $n = 100$, $p = 500$).]

ISTA vs. FISTA

[Figure: lasso logistic regression, 100 instances ($n = 100$, $p = 500$).]

Is acceleration always useful?

Acceleration can be a very effective speedup tool ... but should it always be used?

In practice, the speedup from acceleration is diminished in the presence of warm starts. E.g., suppose we want to solve the lasso problem for tuning parameter values

$$\lambda_1 > \lambda_2 > \cdots > \lambda_r.$$

▸ When solving for $\lambda_1$, initialize $x^{(0)} = 0$, record the solution $\hat x(\lambda_1)$.
▸ When solving for $\lambda_j$, initialize $x^{(0)} = \hat x(\lambda_{j-1})$, the recorded solution for $\lambda_{j-1}$, as sketched below.

Over a fine enough grid of $\lambda$ values, proximal gradient descent can often perform just as well without acceleration.
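A sketch of this warm-start pathwise scheme using plain ISTA (the grid and iteration counts are illustrative assumptions):

```python
import numpy as np

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def lasso_path(X, y, lams, n_iter=200):
    # lams: decreasing sequence lam_1 > lam_2 > ... > lam_r.
    t = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])  # cold start for lam_1
    path = []
    for lam in lams:
        for _ in range(n_iter):  # ISTA, warm-started from previous lam
            beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
        path.append(beta.copy())
    return path
```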

Is acceleration always useful?

Sometimes backtracking and acceleration can be disadvantageous! Recall the matrix completion problem: the proximal gradient update is

$$B^+ = S_\lambda\big(B + t(P_\Omega(Y) - P_\Omega(B))\big),$$

where $S_\lambda$ is the matrix soft-thresholding operator ... requires an SVD.

▸ One backtracking loop evaluates the generalized gradient $G_t(x)$, i.e., evaluates $\mathrm{prox}_t(x)$, across various values of $t$. For matrix completion, this means multiple SVDs ...
▸ Acceleration changes the argument we pass to the prox: $v - t\nabla g(v)$ instead of $x - t\nabla g(x)$. For matrix completion (and $t = 1$),

$$B - \nabla g(B) = \underbrace{P_\Omega(Y)}_{\text{sparse}} + \underbrace{P_\Omega^\perp(B)}_{\text{low rank}} \;\Rightarrow\; \text{fast SVD},$$

$$V - \nabla g(V) = \underbrace{P_\Omega(Y)}_{\text{sparse}} + \underbrace{P_\Omega^\perp(V)}_{\text{not necessarily low rank}} \;\Rightarrow\; \text{slow SVD}.$$

References and further reading

Nesterov's four ideas (three acceleration methods):

Y. Nesterov (1983), "A method for solving a convex programming problem with convergence rate O(1/k²)"

Y. Nesterov (1988), "On an approach to the construction of optimal methods of minimization of smooth convex functions"

Y. Nesterov (2005), "Smooth minimization of non-smooth functions"

Y. Nesterov (2007), "Gradient methods for minimizing composite objective function"

Extensions and/or analyses:

A. Beck and M. Teboulle (2008), "A fast iterative shrinkage-thresholding algorithm for linear inverse problems"

S. Becker, J. Bobin and E. Candes (2009), "NESTA: a fast and accurate first-order method for sparse recovery"

P. Tseng (2008), "On accelerated proximal gradient methods for convex-concave optimization"

Helpful lecture notes/books:

E. Candes, Lecture notes for Math 301, Stanford University, Winter 2010-2011

Y. Nesterov (1998), "Introductory lectures on convex optimization: a basic course", Chapter 2

L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012