Subgradient Method
Hoàng Nam Dũng
Faculty of Mathematics, Mechanics and Informatics, University of Science, Vietnam National University, Hanoi
Last time: gradient descent
Consider the problem

    min_x f(x)

for f convex and differentiable, dom(f) = R^n.

Gradient descent: choose initial x^(0) ∈ R^n, repeat:

    x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),  k = 1, 2, 3, ...

Step sizes t_k chosen to be fixed and small, or by backtracking line search.

If ∇f Lipschitz, gradient descent has convergence rate O(1/ε).

Downsides:
▶ Requires f differentiable — addressed this lecture.
▶ Can be slow to converge — addressed next lecture.
Subgradient method

Now consider f convex, having dom(f) = R^n, but not necessarily differentiable.

Subgradient method: like gradient descent, but replacing gradients with subgradients, i.e., initialize x^(0), repeat:

    x^(k) = x^(k−1) − t_k · g^(k−1),  k = 1, 2, 3, ...

where g^(k−1) ∈ ∂f(x^(k−1)) is any subgradient of f at x^(k−1).

The subgradient method is not necessarily a descent method, so we keep track of the best iterate x_best^(k) among x^(0), ..., x^(k) so far, i.e.,

    f(x_best^(k)) = min_{i=0,...,k} f(x^(i)).
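As a concrete illustration, here is a minimal sketch of the update above on a toy nonsmooth function; the function f(x) = |x − 2|, the starting point, and the step schedule are illustrative choices, not from the slides.

```python
# Subgradient method on f(x) = |x - 2| (nonsmooth at x = 2),
# with diminishing steps t_k = 1/k and tracking of the best iterate.
def subgradient_method(f, subgrad, x0, step, iters):
    x = x0
    x_best, f_best = x0, f(x0)
    for k in range(1, iters + 1):
        x = x - step(k) * subgrad(x)      # x^(k) = x^(k-1) - t_k * g^(k-1)
        if f(x) < f_best:                 # keep track of the best iterate
            x_best, f_best = x, f(x)
    return x_best, f_best

f = lambda x: abs(x - 2.0)
# sign(x - 2) is a subgradient of f; 0 is a valid choice at the kink.
g = lambda x: 1.0 if x > 2.0 else (-1.0 if x < 2.0 else 0.0)

x_best, f_best = subgradient_method(f, g, x0=10.0, step=lambda k: 1.0 / k, iters=2000)
```

With t_k = 1/k (square summable but not summable), f(x_best^(k)) approaches f* = 0, even though the individual iterates keep oscillating around the kink.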
Outline

Today:
▶ How to choose step sizes
▶ Convergence analysis
▶ Intersection of sets
▶ Projected subgradient method
Step size choices
2, and hence
(cid:73) Fixed step sizes: tk = t all k = 1, 2, 3, . . . (cid:73) Fixed step length, i.e., tk = s/ 2 = s.
g (k−1) (cid:107) (cid:107)
(cid:73) Diminishing step sizes: choose to meet conditions
∞ (cid:88)
∞ (cid:88)
tk g (k−1) (cid:107) (cid:107)
k=1
k=1
tk = t2 k < , ∞ , ∞
i.e., square summable but not summable. Important here that step sizes go to zero, but not too fast.
4
There are several other options too, but key difference to gradient descent: step sizes are pre-specified, not adaptively computed.
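A quick numerical sanity check (an illustration, not a proof) that the classic choice t_k = 1/k meets both conditions: the partial sums of t_k keep growing, while the partial sums of t_k² settle near their limit π²/6.

```python
import math

def partial_sums(n):
    """Partial sums of t_k = 1/k and t_k^2 = 1/k^2 up to n."""
    s = sq = 0.0
    for k in range(1, n + 1):
        s += 1.0 / k
        sq += 1.0 / k ** 2
    return s, sq

s4, sq4 = partial_sums(10 ** 4)
s6, sq6 = partial_sums(10 ** 6)
grew = s6 - s4                         # ~4.6: partial sums of 1/k keep growing
settled = sq6 - sq4                    # ~1e-4: partial sums of 1/k^2 have settled
gap_to_limit = math.pi ** 2 / 6 - sq6  # tiny: the squared series converges to pi^2/6
```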
Convergence analysis

Assume that f is convex, dom(f) = R^n, and also that f is Lipschitz continuous with constant L > 0, i.e.,

    |f(x) − f(y)| ≤ L‖x − y‖₂  for all x, y.

Theorem. For a fixed step size t, the subgradient method satisfies

    f(x_best^(k)) − f* ≤ ‖x^(0) − x*‖₂² / (2kt) + L²t/2.

For fixed step length, i.e., t_k = s/‖g^(k−1)‖₂, we have

    f(x_best^(k)) − f* ≤ L‖x^(0) − x*‖₂² / (2ks) + Ls/2.

For diminishing step sizes, the subgradient method satisfies

    f(x_best^(k)) − f* ≤ (‖x^(0) − x*‖₂² + L² ∑_{i=1}^k t_i²) / (2 ∑_{i=1}^k t_i),

i.e., lim_{k→∞} f(x_best^(k)) = f*.
Lipschitz continuity

Before the proof, let us consider the Lipschitz continuity assumption.

Lemma. f is Lipschitz continuous with constant L > 0, i.e.,

    |f(x) − f(y)| ≤ L‖x − y‖₂  for all x, y,

is equivalent to

    ‖g‖₂ ≤ L  for all x and g ∈ ∂f(x).

Proof.
⇐: Choose subgradients g_x and g_y at x and y. We have

    g_x^T(x − y) ≥ f(x) − f(y) ≥ g_y^T(x − y).

Applying the Cauchy-Schwarz inequality gives

    L‖x − y‖₂ ≥ f(x) − f(y) ≥ −L‖x − y‖₂.

⇒: Assume ‖g‖₂ > L for some g ∈ ∂f(x). Take y = x + g/‖g‖₂, so that ‖y − x‖₂ = 1 and

    f(y) ≥ f(x) + g^T(y − x) = f(x) + ‖g‖₂ > f(x) + L,

a contradiction.
Convergence analysis - Proof

Can prove both results from the same basic inequality. Key steps:

▶ Using the definition of subgradient,

    ‖x^(k) − x*‖₂² = ‖x^(k−1) − t_k g^(k−1) − x*‖₂²
                   = ‖x^(k−1) − x*‖₂² − 2t_k (g^(k−1))^T (x^(k−1) − x*) + t_k² ‖g^(k−1)‖₂²
                   ≤ ‖x^(k−1) − x*‖₂² − 2t_k (f(x^(k−1)) − f(x*)) + t_k² ‖g^(k−1)‖₂².

▶ Iterating the last inequality,

    ‖x^(k) − x*‖₂² ≤ ‖x^(0) − x*‖₂² − 2 ∑_{i=1}^k t_i (f(x^(i−1)) − f(x*)) + ∑_{i=1}^k t_i² ‖g^(i−1)‖₂².

▶ Using ‖x^(k) − x*‖₂² ≥ 0 and letting R = ‖x^(0) − x*‖₂, we have

    0 ≤ R² − 2 ∑_{i=1}^k t_i (f(x^(i−1)) − f(x*)) + ∑_{i=1}^k t_i² ‖g^(i−1)‖₂².

▶ Introducing f(x_best^(k)) = min_{i=0,...,k} f(x^(i)) and rearranging, we have the basic inequality

    f(x_best^(k)) − f(x*) ≤ (R² + ∑_{i=1}^k t_i² ‖g^(i−1)‖₂²) / (2 ∑_{i=1}^k t_i).

For different step size choices, convergence results can be directly obtained from this bound. E.g., the theorems for fixed and diminishing step sizes follow.
Convergence analysis - Proof

The basic inequality tells us that after k steps, we have

    f(x_best^(k)) − f(x*) ≤ (R² + ∑_{i=1}^k t_i² ‖g^(i−1)‖₂²) / (2 ∑_{i=1}^k t_i).

With fixed step size t, this and the Lipschitz continuity assumption give

    f(x_best^(k)) − f* ≤ R²/(2kt) + L²t/2.

▶ Does not guarantee convergence (as k → ∞).
▶ For large k, f(x_best^(k)) is approximately L²t/2-suboptimal.
▶ To make the gap ≤ ε, let's make each term ≤ ε/2. So we can choose t = ε/L², and k = R²/t · 1/ε = R²L²/ε².

I.e., the subgradient method guarantees a gap ≤ ε in k = O(1/ε²) iterations ... compare this to the O(1/ε) rate of gradient descent.
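The L²t/2 floor is easy to see numerically. A toy check under illustrative assumptions: for f(x) = |x| we have L = 1, and with a fixed step t the best gap plateaus near t/2 = L²t/2 instead of going to zero.

```python
def best_gap(t, iters):
    """Best value of f(x) = |x| reached by fixed step size t."""
    x = 1.0 + t / 2.0                # offset start so iterates never land exactly on 0
    best = abs(x)
    for _ in range(iters):
        g = 1.0 if x > 0 else -1.0   # a subgradient of |x|
        x -= t * g
        best = min(best, abs(x))
    return best

gap_big = best_gap(0.1, 5000)        # plateaus near 0.05
gap_small = best_gap(0.01, 5000)     # plateaus near 0.005: smaller t, smaller floor
```

Halving t halves the floor but, per the bound, doubles the iterations needed to reach it.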
Convergence analysis - Proof

The basic inequality tells us that after k steps, we have

    f(x_best^(k)) − f(x*) ≤ (R² + ∑_{i=1}^k t_i² ‖g^(i−1)‖₂²) / (2 ∑_{i=1}^k t_i).

With fixed step length, i.e., t_i = s/‖g^(i−1)‖₂, we have

    f(x_best^(k)) − f(x*) ≤ (R² + ks²) / (2s ∑_{i=1}^k ‖g^(i−1)‖₂⁻¹) ≤ (R² + ks²) / (2s · kL⁻¹) = LR²/(2ks) + Ls/2.

▶ Does not guarantee convergence (as k → ∞).
▶ For large k, f(x_best^(k)) is approximately Ls/2-suboptimal.
▶ To make the gap ≤ ε, let's make each term ≤ ε/2. So we can choose s = ε/L, and k = LR²/s · 1/ε = R²L²/ε².
Convergence analysis - Proof

The basic inequality tells us that after k steps, we have

    f(x_best^(k)) − f(x*) ≤ (R² + ∑_{i=1}^k t_i² ‖g^(i−1)‖₂²) / (2 ∑_{i=1}^k t_i).

From this and the Lipschitz continuity, we have

    f(x_best^(k)) − f(x*) ≤ (R² + L² ∑_{i=1}^k t_i²) / (2 ∑_{i=1}^k t_i).

With diminishing step sizes, ∑_{i=1}^∞ t_i² < ∞ and ∑_{i=1}^∞ t_i = ∞, so there holds

    lim_{k→∞} f(x_best^(k)) = f*.
Example: regularized logistic regression

Given (x_i, y_i) ∈ R^p × {0, 1} for i = 1, ..., n, the logistic regression loss is

    f(β) = ∑_{i=1}^n ( −y_i x_i^T β + log(1 + exp(x_i^T β)) ).

This is smooth and convex, with

    ∇f(β) = ∑_{i=1}^n (p_i(β) − y_i) x_i,

where p_i(β) = exp(x_i^T β)/(1 + exp(x_i^T β)), i = 1, ..., n.

Consider the regularized problem

    min_β f(β) + λ · P(β),

where P(β) = ‖β‖₂² (ridge penalty) or P(β) = ‖β‖₁ (lasso penalty).
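For the lasso penalty the objective is nonsmooth, and a subgradient is ∇f(β) + λs with s ∈ ∂‖β‖₁ (s_j = sign(β_j), any value in [−1, 1] when β_j = 0). A sketch on synthetic data; the data, random seed, step length, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 5, 0.1
X = rng.standard_normal((n, p))
beta_star = np.array([1.0, -1.0, 0.0, 0.0, 0.5])   # hypothetical ground truth
# draw labels with P(y=1) = sigmoid(x^T beta), via sigmoid(z) = (1 + tanh(z/2))/2
y = (rng.random(n) < 0.5 * (1.0 + np.tanh(0.5 * (X @ beta_star)))).astype(float)

def obj(beta):
    z = X @ beta
    # logistic loss (stable via logaddexp) plus lasso penalty
    return np.sum(-y * z + np.logaddexp(0.0, z)) + lam * np.sum(np.abs(beta))

def subgrad(beta):
    z = X @ beta
    p_vec = 0.5 * (1.0 + np.tanh(0.5 * z))   # p_i(beta) = exp(z_i)/(1 + exp(z_i))
    return X.T @ (p_vec - y) + lam * np.sign(beta)   # sign(0) = 0 is a valid choice

beta = np.zeros(p)
f_best = obj(beta)
for k in range(1, 1001):
    g = subgrad(beta)
    beta = beta - (0.05 / np.linalg.norm(g)) * g     # fixed step length s = 0.05
    f_best = min(f_best, obj(beta))
```

Starting from β = 0, where the loss is n·log 2 ≈ 138.6, the best objective drops steadily even though individual iterates need not be descent steps.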
Example: regularized logistic regression
Ridge: use gradients; lasso: use subgradients. Example here has n = 1000, p = 20.
Step sizes were hand-tuned to be favorable for each method (of course the comparison is imperfect, but it reveals the convergence behaviors).
Polyak step sizes

Polyak step sizes: when the optimal value f* is known, take

    t_k = (f(x^(k−1)) − f*) / ‖g^(k−1)‖₂²,  k = 1, 2, 3, ...

Can be motivated from the first step in the subgradient proof:

    ‖x^(k) − x*‖₂² ≤ ‖x^(k−1) − x*‖₂² − 2t_k (f(x^(k−1)) − f(x*)) + t_k² ‖g^(k−1)‖₂².

The Polyak step size minimizes the right-hand side.

With Polyak step sizes, one can show the subgradient method converges to the optimal value. The convergence rate is still O(1/ε²):

    f(x_best^(k)) − f(x*) ≤ LR/√k.

(Proof: see slide 11, http://www.seas.ucla.edu/~vandenbe/236C/lectures/sgmethod.pdf.)
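A tiny worked instance with hypothetical numbers: for f(x) = |x| we know f* = 0 and ‖g‖₂ = 1 away from 0, so the Polyak step is t_k = |x^(k−1)| and the very first update lands exactly on the minimizer.

```python
x = 5.0                          # current iterate, an illustrative value
f_star = 0.0                     # known optimal value of f(x) = |x|
g = 1.0 if x > 0 else -1.0       # a subgradient of |x|
t = (abs(x) - f_star) / g ** 2   # Polyak step size t = (f(x) - f*)/||g||^2
x_next = x - t * g               # lands exactly on the minimizer 0
```

Landing on the minimizer in one step is special to this one-dimensional example; in general Polyak steps only guarantee the O(1/ε²) rate above.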
Example: intersection of sets

Suppose we want to find x* ∈ C₁ ∩ ··· ∩ C_m, i.e., find a point in the intersection of closed, convex sets C₁, ..., C_m.

First define

    f_i(x) = dist(x, C_i),  i = 1, ..., m,
    f(x) = max_{i=1,...,m} f_i(x),

and now solve

    min_x f(x).

Check: is this convex?

Note that f* = 0 ⟺ x* ∈ C₁ ∩ ··· ∩ C_m.
Example: intersection of sets
y x (cid:107) (cid:107)2. Last − Recall the distance function dist(x, C ) = miny ∈C time we computed its gradient
dist(x, C ) = x x ∇ PC (x) PC (x) (cid:107) − − (cid:107)2 where PC (x) is the projection of x onto C .
Also recall subgradient rule: if f (x) = maxi=1,...m fi (x), then (cid:91) ∂f (x) = conv ∂fi (x) .
i:fi (x)=f (x) ∂fi (x), then gi
17
∂f (x). So if fi (x) = f (x) and gi ∈ ∈
Example: intersection of sets

Put these two facts together for the intersection of sets problem, with f_i(x) = dist(x, C_i): if C_i is the farthest set from x (so f_i(x) = f(x)), and

    g_i = ∇f_i(x) = (x − P_{C_i}(x)) / ‖x − P_{C_i}(x)‖₂,

then g_i ∈ ∂f(x).

Now apply the subgradient method, with Polyak step size t_k = f(x^(k−1)). At iteration k, with C_i farthest from x^(k−1), we perform the update

    x^(k) = x^(k−1) − f(x^(k−1)) · (x^(k−1) − P_{C_i}(x^(k−1))) / ‖x^(k−1) − P_{C_i}(x^(k−1))‖₂ = P_{C_i}(x^(k−1)),

since f(x^(k−1)) = dist(x^(k−1), C_i) = ‖x^(k−1) − P_{C_i}(x^(k−1))‖₂.
Example: intersection of sets

For two sets, this is the famous alternating projections¹ algorithm, i.e., just keep projecting back and forth.

¹ von Neumann (1950), "Functional operators, volume II: The geometry of orthogonal spaces"
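A sketch of alternating projections for two sets with closed-form projections; the particular sets, starting point, and iteration count are illustrative assumptions: C₁ the unit Euclidean ball and C₂ the halfspace {x : x₁ + x₂ ≥ 1} in R².

```python
import numpy as np

def proj_ball(x):
    """Projection onto C1 = {x : ||x||_2 <= 1}: rescale if outside."""
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

def proj_halfspace(x, a=np.array([1.0, 1.0]), b=1.0):
    """Projection onto C2 = {x : a^T x >= b}: move along a if violated."""
    gap = b - a @ x
    return x if gap <= 0.0 else x + (gap / (a @ a)) * a

x = np.array([3.0, -2.0])
for _ in range(100):              # just keep projecting back and forth
    x = proj_halfspace(proj_ball(x))
# x now lies (to numerical accuracy) in the intersection C1 ∩ C2
```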
Projected subgradient method

To optimize a convex function f over a convex set C,

    min_x f(x)  subject to  x ∈ C,

we can use the projected subgradient method. Just like the usual subgradient method, except we project onto C at each iteration:

    x^(k) = P_C(x^(k−1) − t_k · g^(k−1)),  k = 1, 2, 3, ...

Assuming we can do this projection, we get the same convergence guarantees as the usual subgradient method, with the same step size choices.
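A sketch under illustrative assumptions (the target vector c, step schedule t_k = 1/k, and iteration count are not from the slides): minimize the nonsmooth f(x) = ‖x − c‖₁ over C = R³₊, whose projection is the componentwise maximum with 0; the solution is max(c, 0).

```python
import numpy as np

c = np.array([2.0, -1.0, 0.5])                   # hypothetical problem data
x = np.array([5.0, 5.0, 5.0])
x_best, f_best = x.copy(), np.sum(np.abs(x - c))
for k in range(1, 3001):
    g = np.sign(x - c)                           # a subgradient of ||x - c||_1
    x = np.maximum(x - (1.0 / k) * g, 0.0)       # subgradient step, then project onto C
    fx = np.sum(np.abs(x - c))
    if fx < f_best:
        x_best, f_best = x.copy(), fx
# optimum: x* = max(c, 0) = [2, 0, 0.5], with f* = |0 - (-1)| = 1
```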
Projected subgradient method

What sets C are easy to project onto? Lots, e.g.,

▶ Affine images: {Ax + b : x ∈ R^n}
▶ Solution set of linear system: {x : Ax = b}
▶ Nonnegative orthant: R^n_+ = {x : x ≥ 0}
▶ Some norm balls: {x : ‖x‖_p ≤ 1} for p = 1, 2, ∞
▶ Some simple polyhedra and simple cones.

Warning: it is easy to write down a seemingly simple set C for which P_C turns out to be very hard! E.g., it is generally hard to project onto an arbitrary polyhedron C = {x : Ax ≤ b}.

Note: projected gradient descent works too; more next time ...
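Sketches of a few of the listed projections in closed form (small, illustrative implementations, not a general-purpose library):

```python
import numpy as np

def proj_linear_system(x, A, b):
    """Project onto {x : Ax = b}, assuming A has full row rank:
    x - A^T (A A^T)^{-1} (Ax - b)."""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def proj_nonneg(x):
    """Project onto the nonnegative orthant {x : x >= 0}."""
    return np.maximum(x, 0.0)

def proj_l2_ball(x, r=1.0):
    """Project onto {x : ||x||_2 <= r}: rescale if outside."""
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def proj_linf_ball(x, r=1.0):
    """Project onto {x : ||x||_inf <= r}: clip each coordinate."""
    return np.clip(x, -r, r)
```

Each of these is O(n) or one linear solve; contrast this with projecting onto a general polyhedron {x : Ax ≤ b}, which is itself a quadratic program.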
Can we do better?

Upside of the subgradient method: broad applicability.

Downside: the O(1/ε²) convergence rate over the problem class of convex, Lipschitz functions is really slow.

Nonsmooth first-order methods: iterative methods updating x^(k) in

    x^(0) + span{g^(0), g^(1), ..., g^(k−1)},

where the subgradients g^(0), g^(1), ..., g^(k−1) come from a weak oracle.

Theorem (Nesterov). For any k ≤ n − 1 and starting point x^(0), there is a function in the problem class such that any nonsmooth first-order method satisfies

    f(x^(k)) − f* ≥ RG / (2(1 + √(k+1))).
Improving on the subgradient method
In words, we cannot do better than the O(1/ε²) rate of the subgradient method (unless we go beyond nonsmooth first-order methods).
So instead of trying to improve across the board, we will focus on minimizing composite functions of the form
f (x) = g (x) + h(x)
where g is convex and differentiable, h is convex and nonsmooth but “simple”.
For a lot of problems (i.e., functions h), we can recover the O(1/ε) rate of gradient descent with a simple algorithm, having important practical consequences.
References and further reading

S. Boyd, Lecture notes for EE 264B, Stanford University, Spring 2010-2011.
Y. Nesterov (1998), Introductory lectures on convex optimization: a basic course, Chapter 3.
B. Polyak (1987), Introduction to optimization, Chapter 5.
L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012.