Subgradient Method

Hoàng Nam Dũng

Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội

Last time: gradient descent

Consider the problem

    min_x f(x)

for f convex and differentiable, dom(f) = R^n.

Gradient descent: choose initial x^(0) ∈ R^n, repeat:

    x^(k) = x^(k-1) - t_k · ∇f(x^(k-1)),   k = 1, 2, 3, . . .

Step sizes t_k chosen to be fixed and small, or by backtracking line search.

If ∇f is Lipschitz, gradient descent has convergence rate O(1/ε).

Downsides:

▶ Requires f differentiable (addressed this lecture).
▶ Can be slow to converge (addressed next lecture).


Subgradient method

Now consider f convex, having dom(f) = R^n, but not necessarily differentiable.

Subgradient method: like gradient descent, but replacing gradients with subgradients, i.e., initialize x^(0), repeat:

    x^(k) = x^(k-1) - t_k · g^(k-1),   k = 1, 2, 3, . . .

where g^(k-1) ∈ ∂f(x^(k-1)) is any subgradient of f at x^(k-1).

The subgradient method is not necessarily a descent method, so we keep track of the best iterate x^(k)_best among x^(0), . . . , x^(k) so far, i.e.,

    f(x^(k)_best) = min_{i=0,...,k} f(x^(i)).
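The update above can be sketched in a few lines (an illustrative toy, not from the slides: f(x) = |x|, for which sign(x) is a valid subgradient everywhere, taking the value 0 at x = 0):

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, iters):
    """Subgradient method with a pre-specified step schedule step(k).

    Tracks the best iterate, since the method is not a descent method.
    subgrad(x) may return any element of the subdifferential at x.
    """
    x = x0
    x_best, f_best = x0, f(x0)
    for k in range(1, iters + 1):
        g = subgrad(x)              # any g in the subdifferential at x
        x = x - step(k) * g         # x^(k) = x^(k-1) - t_k * g^(k-1)
        if f(x) < f_best:
            x_best, f_best = x, f(x)
    return x_best, f_best

# f(x) = |x| with diminishing steps t_k = 1/k
x_best, f_best = subgradient_method(
    f=abs, subgrad=np.sign, x0=5.0,
    step=lambda k: 1.0 / k, iters=2000)
```

The iterates overshoot the minimum and oscillate around it, which is exactly why the best iterate is tracked.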

Outline

Today:

▶ How to choose step sizes
▶ Convergence analysis
▶ Intersection of sets
▶ Projected subgradient method

Step size choices

▶ Fixed step sizes: t_k = t, all k = 1, 2, 3, . . .
▶ Fixed step length, i.e., t_k = s/‖g^(k-1)‖_2, and hence ‖t_k g^(k-1)‖_2 = s.
▶ Diminishing step sizes: choose to meet conditions

    Σ_{k=1}^∞ t_k^2 < ∞,   Σ_{k=1}^∞ t_k = ∞,

i.e., square summable but not summable. Important here that step sizes go to zero, but not too fast.

There are several other options too, but the key difference to gradient descent: step sizes are pre-specified, not adaptively computed.
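A small experiment (illustrative, not from the slides) shows why the schedule matters: on f(x) = |x| (so L = 1), a fixed step t stalls at an O(t) floor, while a diminishing, square-summable-but-not-summable schedule keeps improving:

```python
import numpy as np

def best_value(x0, step, iters):
    """Run the subgradient method on f(x) = |x|; return the best value."""
    x, f_best = x0, abs(x0)
    for k in range(1, iters + 1):
        x = x - step(k) * np.sign(x)   # sign(x) is a subgradient of |x|
        f_best = min(f_best, abs(x))
    return f_best

best_fixed = best_value(5.05, step=lambda k: 0.1, iters=5000)      # t_k = t
best_dimin = best_value(5.05, step=lambda k: 1.0 / k, iters=5000)  # t_k = 1/k
# best_fixed stalls at the t/2 = 0.05 floor; best_dimin goes well below it
```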

Convergence analysis

Assume that f is convex, dom(f) = R^n, and also that f is Lipschitz continuous with constant L > 0, i.e.,

    |f(x) - f(y)| ≤ L‖x - y‖_2   for all x, y.

Theorem
For a fixed step size t, the subgradient method satisfies

    f(x^(k)_best) ≤ f* + ‖x^(0) - x*‖_2^2 / (2kt) + L^2 t / 2.

For fixed step length, i.e., t_k = s/‖g^(k-1)‖_2, we have

    f(x^(k)_best) ≤ f* + L‖x^(0) - x*‖_2^2 / (2ks) + Ls / 2.

For diminishing step sizes, the subgradient method satisfies

    f(x^(k)_best) - f* ≤ (‖x^(0) - x*‖_2^2 + L^2 Σ_{i=1}^k t_i^2) / (2 Σ_{i=1}^k t_i),

i.e., lim_{k→∞} f(x^(k)_best) = f*.
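The fixed-step bound can be checked numerically on a toy problem (illustrative; f(x) = |x| is 1-Lipschitz with f* = 0 and x* = 0, so R = |x^(0)|):

```python
import numpy as np

L, t, x0 = 1.0, 0.01, 5.0        # f(x) = |x|: L = 1; fixed step size t
R = abs(x0)                      # R = ||x^(0) - x*||_2 with x* = 0

x, f_best, violation = x0, abs(x0), float("-inf")
for k in range(1, 2001):
    x = x - t * np.sign(x)                       # subgradient step
    f_best = min(f_best, abs(x))                 # f(x^(k)_best), here f* = 0
    bound = R**2 / (2 * k * t) + L**2 * t / 2    # theorem's right-hand side
    violation = max(violation, f_best - bound)   # should stay <= 0
```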

Lipschitz continuity

Before the proof, let us consider the Lipschitz continuity assumption.

Lemma
f is Lipschitz continuous with constant L > 0, i.e.,

    |f(x) - f(y)| ≤ L‖x - y‖_2   for all x, y,

is equivalent to

    ‖g‖_2 ≤ L   for all x and g ∈ ∂f(x).

Proof.
(⇐) Choose subgradients g_x and g_y at x and y. We have

    g_x^T (x - y) ≥ f(x) - f(y) ≥ g_y^T (x - y).

Applying the Cauchy-Schwarz inequality gives

    L‖x - y‖_2 ≥ f(x) - f(y) ≥ -L‖x - y‖_2.

(⇒) Assume ‖g‖_2 > L for some g ∈ ∂f(x). Take y = x + g/‖g‖_2, so that ‖y - x‖_2 = 1 and

    f(y) ≥ f(x) + g^T (y - x) = f(x) + ‖g‖_2 > f(x) + L,

a contradiction.

Convergence analysis - Proof

Can prove both results from the same basic inequality. Key steps:

▶ Using the definition of subgradient,

    ‖x^(k) - x*‖_2^2 = ‖x^(k-1) - t_k g^(k-1) - x*‖_2^2
                     = ‖x^(k-1) - x*‖_2^2 - 2 t_k (g^(k-1))^T (x^(k-1) - x*) + t_k^2 ‖g^(k-1)‖_2^2
                     ≤ ‖x^(k-1) - x*‖_2^2 - 2 t_k (f(x^(k-1)) - f(x*)) + t_k^2 ‖g^(k-1)‖_2^2.

▶ Iterating the last inequality,

    ‖x^(k) - x*‖_2^2 ≤ ‖x^(0) - x*‖_2^2 - 2 Σ_{i=1}^k t_i (f(x^(i-1)) - f(x*)) + Σ_{i=1}^k t_i^2 ‖g^(i-1)‖_2^2.

▶ Using ‖x^(k) - x*‖_2^2 ≥ 0 and letting R = ‖x^(0) - x*‖_2, we have

    0 ≤ R^2 - 2 Σ_{i=1}^k t_i (f(x^(i-1)) - f(x*)) + Σ_{i=1}^k t_i^2 ‖g^(i-1)‖_2^2.

▶ Introducing f(x^(k)_best) = min_{i=0,...,k} f(x^(i)) and rearranging, we have the basic inequality

    f(x^(k)_best) - f(x*) ≤ (R^2 + Σ_{i=1}^k t_i^2 ‖g^(i-1)‖_2^2) / (2 Σ_{i=1}^k t_i).

For different step size choices, convergence results can be obtained directly from this bound. E.g., the theorems for fixed and diminishing step sizes follow.

Convergence analysis - Proof

The basic inequality tells us that after k steps, we have

    f(x^(k)_best) - f(x*) ≤ (R^2 + Σ_{i=1}^k t_i^2 ‖g^(i-1)‖_2^2) / (2 Σ_{i=1}^k t_i).

With fixed step size t, this and the Lipschitz continuity assumption give

    f(x^(k)_best) ≤ f* + R^2 / (2kt) + L^2 t / 2.

▶ Does not guarantee convergence (as k → ∞).
▶ For large k, f(x^(k)_best) is approximately L^2 t / 2-suboptimal.
▶ To make the gap ≤ ε, let's make each term ≤ ε/2. So we can choose t = ε/L^2, and k = R^2/t · 1/ε = R^2 L^2 / ε^2.

I.e., the subgradient method guarantees a gap ≤ ε in k = O(1/ε^2) iterations . . . compare this to the O(1/ε) rate of gradient descent.

Convergence analysis - Proof

The basic inequality tells us that after k steps, we have

    f(x^(k)_best) - f(x*) ≤ (R^2 + Σ_{i=1}^k t_i^2 ‖g^(i-1)‖_2^2) / (2 Σ_{i=1}^k t_i).

With fixed step length, i.e., t_i = s/‖g^(i-1)‖_2, we have

    f(x^(k)_best) - f(x*) ≤ (R^2 + k s^2) / (2s Σ_{i=1}^k ‖g^(i-1)‖_2^(-1)) ≤ (R^2 + k s^2) / (2s · k L^(-1)) = L R^2 / (2ks) + Ls / 2.

▶ Does not guarantee convergence (as k → ∞).
▶ For large k, f(x^(k)_best) is approximately Ls/2-suboptimal.
▶ To make the gap ≤ ε, let's make each term ≤ ε/2. So we can choose s = ε/L, and k = LR^2/s · 1/ε = R^2 L^2 / ε^2.

Convergence analysis - Proof

The basic inequality tells us that after k steps, we have

    f(x^(k)_best) - f(x*) ≤ (R^2 + Σ_{i=1}^k t_i^2 ‖g^(i-1)‖_2^2) / (2 Σ_{i=1}^k t_i).

From this and the Lipschitz continuity, we have

    f(x^(k)_best) - f(x*) ≤ (R^2 + L^2 Σ_{i=1}^k t_i^2) / (2 Σ_{i=1}^k t_i).

With diminishing step sizes, Σ_{i=1}^∞ t_i = ∞ and Σ_{i=1}^∞ t_i^2 < ∞, so there holds

    lim_{k→∞} f(x^(k)_best) = f*.

Example: regularized logistic regression

Given (x_i, y_i) ∈ R^p × {0, 1} for i = 1, . . . , n, the logistic regression loss is

    f(β) = Σ_{i=1}^n ( -y_i x_i^T β + log(1 + exp(x_i^T β)) ).

This is smooth and convex, with

    ∇f(β) = Σ_{i=1}^n (p_i(β) - y_i) x_i,

where p_i(β) = exp(x_i^T β)/(1 + exp(x_i^T β)), i = 1, . . . , n.

Consider the regularized problem

    min_β f(β) + λ · P(β),

where P(β) = ‖β‖_2^2 (ridge penalty) or P(β) = ‖β‖_1 (lasso penalty).
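For the lasso penalty the objective is nonsmooth, but a subgradient is easy to write down: add λ·s to ∇f(β), where s_j = sign(β_j) for β_j ≠ 0 and s_j is any value in [-1, 1] at β_j = 0 (below, 0 is chosen). A sketch on synthetic data (the data, dimensions, and step sizes are assumptions for illustration):

```python
import numpy as np

def objective(beta, X, y, lam):
    """f(beta) + lam * ||beta||_1 with the logistic loss f."""
    z = X @ beta
    return np.sum(-y * z + np.log1p(np.exp(z))) + lam * np.abs(beta).sum()

def subgrad(beta, X, y, lam):
    """grad f(beta) = sum_i (p_i(beta) - y_i) x_i, plus lam * sign(beta)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y) + lam * np.sign(beta)

rng = np.random.default_rng(0)
n, d, lam = 1000, 20, 0.1
X = rng.standard_normal((n, d))
beta_true = np.r_[np.ones(5), np.zeros(d - 5)]        # sparse ground truth
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

beta = np.zeros(d)
obj0 = best_obj = objective(beta, X, y, lam)
for k in range(1, 501):
    beta = beta - (1e-3 / k) * subgrad(beta, X, y, lam)  # diminishing steps
    best_obj = min(best_obj, objective(beta, X, y, lam))
```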

Example: regularized logistic regression

Ridge: use gradients; lasso: use subgradients. Example here has n = 1000, p = 20.

Step sizes hand-tuned to be favorable for each method (of course the comparison is imperfect, but it reveals the convergence behaviors).

Polyak step sizes

Polyak step sizes: when the optimal value f* is known, take

    t_k = (f(x^(k-1)) - f*) / ‖g^(k-1)‖_2^2,   k = 1, 2, 3, . . .

Can be motivated from the first step in the subgradient proof:

    ‖x^(k) - x*‖_2^2 ≤ ‖x^(k-1) - x*‖_2^2 - 2 t_k (f(x^(k-1)) - f(x*)) + t_k^2 ‖g^(k-1)‖_2^2.

The Polyak step size minimizes the right-hand side.

With Polyak step sizes, one can show the subgradient method converges to the optimal value:

    f(x^(k)_best) - f(x*) ≤ L R / √k.

The convergence rate is still O(1/ε^2).

(Proof: see slide 11, http://www.seas.ucla.edu/~vandenbe/236C/lectures/sgmethod.pdf.)
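A minimal sketch of the Polyak rule (illustrative choice: f(x) = ‖x‖_1, where f* = 0 is known and sign(x) is a subgradient; the guard avoids dividing by ‖g‖_2^2 = 0 once the optimum is reached):

```python
import numpy as np

def polyak_subgradient(f, subgrad, fstar, x0, iters, tol=1e-12):
    """Subgradient method with Polyak steps t_k = (f(x) - f*) / ||g||_2^2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        gap = f(x) - fstar
        if gap <= tol:
            break                     # numerically optimal: stop
        g = subgrad(x)
        x = x - (gap / (g @ g)) * g   # Polyak step size
    return x

x = polyak_subgradient(f=lambda v: np.abs(v).sum(), subgrad=np.sign,
                       fstar=0.0, x0=[3.0, -2.0], iters=100)
```

On this toy problem the Polyak steps land exactly on the optimum after two iterations.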

Example: intersection of sets

Suppose we want to find x* ∈ C_1 ∩ · · · ∩ C_m, i.e., find a point in the intersection of closed, convex sets C_1, . . . , C_m.

First define

    f_i(x) = dist(x, C_i),   i = 1, . . . , m,
    f(x) = max_{i=1,...,m} f_i(x)

and now solve

    min_x f(x).

Check: is this convex?

Note that f* = 0 ⟺ x* ∈ C_1 ∩ · · · ∩ C_m.

Example: intersection of sets

Recall the distance function dist(x, C) = min_{y∈C} ‖y - x‖_2. Last time we computed its gradient

    ∇dist(x, C) = (x - P_C(x)) / ‖x - P_C(x)‖_2,

where P_C(x) is the projection of x onto C.

Also recall the subgradient rule: if f(x) = max_{i=1,...,m} f_i(x), then

    ∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) ).

So if f_i(x) = f(x) and g_i ∈ ∂f_i(x), then g_i ∈ ∂f(x).

Example: intersection of sets

Put these two facts together for the intersection of sets problem, with f_i(x) = dist(x, C_i): if C_i is the farthest set from x (so f_i(x) = f(x)), and

    g_i = ∇f_i(x) = (x - P_{C_i}(x)) / ‖x - P_{C_i}(x)‖_2,

then g_i ∈ ∂f(x).

Now apply the subgradient method, with Polyak step size t_k = f(x^(k-1)). At iteration k, with C_i farthest from x^(k-1), we perform the update

    x^(k) = x^(k-1) - f(x^(k-1)) · (x^(k-1) - P_{C_i}(x^(k-1))) / ‖x^(k-1) - P_{C_i}(x^(k-1))‖_2 = P_{C_i}(x^(k-1)),

since

    f(x^(k-1)) = dist(x^(k-1), C_i) = ‖x^(k-1) - P_{C_i}(x^(k-1))‖_2.

Example: intersection of sets

For two sets, this is the famous alternating projections algorithm¹, i.e., just keep projecting back and forth.

¹ von Neumann (1950), "Functional operators, volume II: The geometry of orthogonal spaces"
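A sketch of alternating projections for two sets with closed-form projections (an assumed example: C_1 the unit disk, C_2 a halfplane; their intersection is nonempty):

```python
import numpy as np

def proj_disk(x, c, r):
    """Projection onto the disk {x : ||x - c||_2 <= r}."""
    d = x - c
    nrm = np.linalg.norm(d)
    return x if nrm <= r else c + (r / nrm) * d

def proj_halfplane(x, a, b):
    """Projection onto the halfplane {x : a^T x <= b}."""
    viol = a @ x - b
    return x if viol <= 0 else x - (viol / (a @ a)) * a

c, r = np.zeros(2), 1.0              # C1: unit disk
a, b = np.array([1.0, 1.0]), 1.0     # C2: x1 + x2 <= 1

x = np.array([3.0, 2.0])             # start outside both sets
for _ in range(50):                  # just keep projecting back and forth
    x = proj_halfplane(proj_disk(x, c, r), a, b)
```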

Projected subgradient method

To optimize a convex function f over a convex set C,

    min_x f(x)   subject to x ∈ C,

we can use the projected subgradient method.

Just like the usual subgradient method, except we project onto C at each iteration:

    x^(k) = P_C(x^(k-1) - t_k · g^(k-1)),   k = 1, 2, 3, . . .

Assuming we can do this projection, we get the same convergence guarantees as the usual subgradient method, with the same step size choices.
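As a sketch (an illustrative problem, not from the slides): minimize the nonsmooth f(x) = ‖x - z‖_1 over the unit Euclidean ball, whose projection has a simple closed form:

```python
import numpy as np

def proj_ball(x, r=1.0):
    """Projection onto the Euclidean ball {x : ||x||_2 <= r}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

z = np.array([2.0, 0.0])                  # target lies outside the ball
f = lambda x: np.abs(x - z).sum()         # nonsmooth convex objective
subgrad = lambda x: np.sign(x - z)        # a subgradient of f at x

x = np.zeros(2)
x_best, f_best = x.copy(), f(x)
for k in range(1, 1001):
    x = proj_ball(x - (1.0 / k) * subgrad(x))   # step, then project onto C
    if f(x) < f_best:
        x_best, f_best = x.copy(), f(x)
```

Here the constrained minimizer is x* = (1, 0) with f* = 1, which the iterates reach quickly.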

Projected subgradient method

What sets C are easy to project onto? Lots, e.g.,

▶ Affine images: {Ax + b : x ∈ R^n}
▶ Solution set of a linear system: {x : Ax = b}
▶ Nonnegative orthant: R^n_+ = {x : x ≥ 0}
▶ Some norm balls: {x : ‖x‖_p ≤ 1} for p = 1, 2, ∞
▶ Some simple polyhedra and simple cones.

Warning: it is easy to write down a seemingly simple set C for which P_C turns out to be very hard! E.g., it is generally hard to project onto an arbitrary polyhedron C = {x : Ax ≤ b}.

Note: projected gradient descent works too, more next time . . .

Can we do better?

Upside of the subgradient method: broad applicability.

Downside: the O(1/ε^2) convergence rate over the problem class of convex, Lipschitz functions is really slow.

Nonsmooth first-order methods: iterative methods updating x^(k) in

    x^(0) + span{g^(0), g^(1), . . . , g^(k-1)},

where the subgradients g^(0), g^(1), . . . , g^(k-1) come from a weak oracle.

Theorem (Nesterov)
For any k ≤ n - 1 and starting point x^(0), there is a function in the problem class such that any nonsmooth first-order method satisfies

    f(x^(k)) - f* ≥ RG / (2(1 + √(k+1))).

Improving on the subgradient method

In words, we cannot do better than the O(1/ε^2) rate of the subgradient method (unless we go beyond nonsmooth first-order methods).

So instead of trying to improve across the board, we will focus on minimizing composite functions of the form

    f(x) = g(x) + h(x),

where g is convex and differentiable, and h is convex and nonsmooth but "simple".

For a lot of problems (i.e., functions h), we can recover the O(1/ε) rate of gradient descent with a simple algorithm, having important practical consequences.

References and further reading

▶ S. Boyd, Lecture notes for EE 364b, Stanford University, Spring 2010-2011
▶ Y. Nesterov (1998), Introductory lectures on convex optimization: a basic course, Chapter 3
▶ B. Polyak (1987), Introduction to optimization, Chapter 5
▶ L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012