Subgradients

Hoàng Nam Dũng

Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội

Last time: gradient descent

Consider the problem

    min_x f(x)

for f convex and differentiable, dom(f) = R^n. Gradient descent: choose an initial x^(0) ∈ R^n and repeat

    x^(k) = x^(k-1) - t_k · ∇f(x^(k-1)),   k = 1, 2, 3, ...

Step sizes t_k are chosen to be fixed and small, or by backtracking line search.

If ∇f is Lipschitz, gradient descent has convergence rate O(1/ε).

Downsides:

- Requires f differentiable
- Can be slow to converge
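As a quick recap illustration (not part of the original slides), here is a minimal gradient-descent sketch in Python with a fixed step size; the quadratic test objective and the step size are hypothetical choices for demonstration.

```python
import numpy as np

def gradient_descent(grad_f, x0, t=0.1, num_iters=100):
    """Minimal gradient descent with a fixed step size t (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - t * grad_f(x)   # x^(k) = x^(k-1) - t * grad f(x^(k-1))
    return x

# Hypothetical test problem: f(x) = 1/2 ||x - a||_2^2, so grad f(x) = x - a
a = np.array([1.0, -2.0])
x_hat = gradient_descent(lambda x: x - a, x0=np.zeros(2))
print(x_hat)  # approaches a
```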

Outline

Today:

- Subgradients
- Examples
- Properties
- Optimality characterizations

Basic inequality

Recall that for convex and differentiable f,

    f(y) ≥ f(x) + ∇f(x)^T (y - x),   ∀ x, y ∈ dom(f).

- The first-order approximation of f at x is a global lower bound.
- ∇f defines a non-vertical supporting hyperplane to epi(f) at (x, f(x)):

      ⟨(∇f(x), -1), (y, t) - (x, f(x))⟩ ≤ 0,   ∀ (y, t) ∈ epi(f).

Subgradients

A subgradient of a convex function f at x is any g ∈ R^n such that

    f(y) ≥ f(x) + g^T (y - x),   ∀ y ∈ dom(f).

- Always exists (on the relative interior of dom(f))
- If f is differentiable at x, then g = ∇f(x) uniquely
- The same definition works for nonconvex f (however, subgradients need not exist).

[Figure: g_1 and g_2 are subgradients at x_1; g_3 is a subgradient at x_2.]


Examples of subgradients

Consider f : R → R, f(x) = |x|.

- For x ≠ 0, the unique subgradient is g = sign(x)
- For x = 0, the subgradient g is any element of [-1, 1].

Examples of subgradients

Consider f : R^n → R, f(x) = ‖x‖_2.

- For x ≠ 0, the unique subgradient is g = x / ‖x‖_2
- For x = 0, the subgradient g is any element of {z : ‖z‖_2 ≤ 1}.

Examples of subgradients

Consider f : R^n → R, f(x) = ‖x‖_1.

- For x_i ≠ 0, the unique i-th component is g_i = sign(x_i)
- For x_i = 0, the i-th component g_i is any element of [-1, 1].
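To make the three examples concrete, here is a small Python sketch (not from the original slides) of subgradient oracles for |x|, ‖x‖_2 and ‖x‖_1; each returns one valid subgradient, choosing 0 on the nondifferentiable set, and the test points at the end are arbitrary.

```python
import numpy as np

def subgrad_abs(x):
    """One subgradient of |x|: sign(x) for x != 0, and 0 (an element of [-1, 1]) at x = 0."""
    return np.sign(x)

def subgrad_l2(x):
    """One subgradient of ||x||_2: x/||x||_2 for x != 0, and 0 (inside the unit ball) at x = 0."""
    nrm = np.linalg.norm(x)
    return x / nrm if nrm > 0 else np.zeros_like(x)

def subgrad_l1(x):
    """One subgradient of ||x||_1: componentwise sign, with 0 where x_i = 0."""
    return np.sign(x)

# Sanity check of the subgradient inequality f(y) >= f(x) + g^T (y - x) for the l1 norm
x, y = np.array([1.0, 0.0, -2.0]), np.array([0.5, -1.0, 3.0])
g = subgrad_l1(x)
assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x)
```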

Examples of subgradients

Consider f(x) = max{f_1(x), f_2(x)}, for f_1, f_2 : R^n → R convex and differentiable.

- For f_1(x) > f_2(x), unique subgradient g = ∇f_1(x)
- For f_2(x) > f_1(x), unique subgradient g = ∇f_2(x)
- For f_1(x) = f_2(x), the subgradient g is any point on the line segment between ∇f_1(x) and ∇f_2(x).

Subdifferential

The set of all subgradients of a convex function f is called the subdifferential:

    ∂f(x) = {g ∈ R^n : g is a subgradient of f at x}.

Properties:

- Nonempty for convex f at x ∈ int(dom(f))
- ∂f(x) is closed and convex (even for nonconvex f)
- If f is differentiable at x, then ∂f(x) = {∇f(x)}
- If ∂f(x) = {g}, then f is differentiable at x and ∇f(x) = g.

Proof: see http://www.seas.ucla.edu/~vandenbe/236C/lectures/subgradients.pdf


Monotonicity

Theorem. The subdifferential of a convex function f is a monotone operator:

    (u - v)^T (x - y) ≥ 0,   ∀ u ∈ ∂f(x), v ∈ ∂f(y).

Proof.
By definition we have

    f(y) ≥ f(x) + u^T (y - x)   and   f(x) ≥ f(y) + v^T (x - y).

Combining these two inequalities shows monotonicity (the addition step is written out below).

Question: What does monotonicity say for a differentiable convex function?

    (∇f(x) - ∇f(y))^T (x - y) ≥ 0,

which follows directly from the first-order characterization of convex functions.
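For completeness, here is the combining step from the proof written out as a short worked derivation (not on the original slide):

```latex
% Add the two subgradient inequalities
%   f(y) \ge f(x) + u^T(y-x)  and  f(x) \ge f(y) + v^T(x-y):
f(y) + f(x) \;\ge\; f(x) + f(y) + u^T(y-x) + v^T(x-y)
\;\Longrightarrow\; 0 \;\ge\; (u-v)^T(y-x)
\;\Longrightarrow\; (u-v)^T(x-y) \;\ge\; 0.
```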


Examples of non-subdifferentiable functions

The following functions are not subdifferentiable at x = 0:

- f : R → R, dom(f) = R_+,

      f(x) = 1 if x = 0,   f(x) = 0 if x > 0.

- f : R → R, dom(f) = R_+,

      f(x) = -√x.

The only supporting hyperplane to epi(f) at (0, f(0)) is vertical.

Connection to convex geometry

Convex set C ⊆ R^n; consider the indicator function I_C : R^n → R,

    I_C(x) = I{x ∈ C} = 0 if x ∈ C,   ∞ if x ∉ C.

For x ∈ C, ∂I_C(x) = N_C(x), the normal cone of C at x; recall

    N_C(x) = {g ∈ R^n : g^T x ≥ g^T y for any y ∈ C}.

Why? By definition of a subgradient g, I_C(y) ≥ I_C(x) + g^T (y - x) for all y.

- For y ∉ C, I_C(y) = ∞.
- For y ∈ C, this means 0 ≥ g^T (y - x).

(A componentwise instance for a box is sketched below.)
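As a concrete instance (not from the original slides), for a box C = {z : l ≤ z ≤ u} the normal cone at x has a simple componentwise description. The sketch below uses it to test membership g ∈ N_C(x) = ∂I_C(x); the box bounds and the test point are hypothetical.

```python
import numpy as np

def in_normal_cone_box(g, x, l, u, tol=1e-12):
    """Check g ∈ N_C(x) for the box C = {z : l <= z <= u}, using the componentwise rule:
    g_i <= 0 if x_i = l_i, g_i >= 0 if x_i = u_i, g_i = 0 if l_i < x_i < u_i."""
    at_lower = np.isclose(x, l)
    at_upper = np.isclose(x, u)
    interior = ~at_lower & ~at_upper
    return bool(np.all(np.abs(g[interior]) <= tol)
                and np.all(g[at_lower & ~at_upper] <= tol)
                and np.all(g[at_upper & ~at_lower] >= -tol))

l, u = np.zeros(3), np.ones(3)
x = np.array([0.0, 0.5, 1.0])
print(in_normal_cone_box(np.array([-2.0, 0.0, 3.0]), x, l, u))  # True
```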


Subgradient calculus

Basic rules for convex functions:

- Scaling: ∂(af) = a · ∂f, provided a > 0.
- Addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2.
- Affine composition: if g(x) = f(Ax + b), then ∂g(x) = A^T ∂f(Ax + b).
- Finite pointwise maximum: if f(x) = max_{i=1,...,m} f_i(x), then

      ∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) ),

  the convex hull of the union of the subdifferentials of the active functions at x (a small sketch follows below).
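As an illustration (not from the original slides), the following Python sketch returns one subgradient of a finite pointwise maximum by picking the gradient of an active function; the two affine component functions at the end are hypothetical.

```python
import numpy as np

def subgrad_pointwise_max(fs, grads, x):
    """Return one subgradient of f(x) = max_i f_i(x).

    fs, grads: lists of callables f_i and their gradients (f_i assumed differentiable).
    Any gradient of an active function (f_i(x) = f(x)) is a valid subgradient;
    more generally, any convex combination of active gradients works.
    """
    vals = np.array([f(x) for f in fs])
    i = int(np.argmax(vals))          # an active index: f_i(x) = max_j f_j(x)
    return grads[i](x)

# Hypothetical example: f(x) = max(a^T x, b^T x)
a, b = np.array([1.0, 2.0]), np.array([2.0, 1.0])
fs = [lambda x: a @ x, lambda x: b @ x]
grads = [lambda x: a, lambda x: b]
print(subgrad_pointwise_max(fs, grads, np.array([1.0, 1.0])))
```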

Subgradient calculus

- General pointwise maximum: if f(x) = max_{s ∈ S} f_s(x), then

      ∂f(x) ⊇ cl( conv( ∪_{s : f_s(x) = f(x)} ∂f_s(x) ) ).

  Under some regularity conditions (on S, f_s), we get equality.

- Norms: important special case, f(x) = ‖x‖_p. Let q be such that 1/p + 1/q = 1; then

      ‖x‖_p = max_{‖z‖_q ≤ 1} z^T x

  and

      ∂f(x) = argmax_{‖z‖_q ≤ 1} z^T x.
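For instance (a short worked case, not on the original slide), taking p = 1 and q = ∞ recovers the earlier coordinatewise description of ∂‖x‖_1:

```latex
\|x\|_1 = \max_{\|z\|_\infty \le 1} z^T x, \qquad
\partial \|x\|_1 = \operatorname*{argmax}_{\|z\|_\infty \le 1} z^T x
= \{ z : z_i = \operatorname{sign}(x_i) \text{ if } x_i \neq 0,\;
        z_i \in [-1, 1] \text{ if } x_i = 0 \}.
```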

Why subgradients?

Subgradients are important for two reasons:

- Convex analysis: optimality characterization via subgradients, monotonicity, relationship to duality.
- Convex optimization: if you can compute subgradients, then you can minimize any convex function (a sketch of such a method follows below).
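To make the second point concrete, here is a minimal subgradient-method sketch in Python (not from the original slides; the subgradient method itself is treated in a later lecture). The diminishing step-size rule and the ℓ1 test problem are illustrative assumptions.

```python
import numpy as np

def subgradient_method(subgrad_f, x0, num_iters=1000):
    """Minimal subgradient method: x^(k) = x^(k-1) - t_k * g^(k-1), with g^(k-1) ∈ ∂f(x^(k-1)).

    Uses the diminishing step size t_k = 1/k (an illustrative choice).
    Returns the last iterate; in practice one tracks the best iterate seen so far.
    """
    x = np.asarray(x0, dtype=float)
    for k in range(1, num_iters + 1):
        g = subgrad_f(x)
        x = x - (1.0 / k) * g
    return x

# Hypothetical example: minimize f(x) = ||x - a||_1 using the subgradient sign(x - a)
a = np.array([1.0, -3.0, 2.0])
x_hat = subgradient_method(lambda x: np.sign(x - a), x0=np.zeros(3))
print(x_hat)  # close to a
```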

Optimality condition

Subgradient optimality condition: For any f (convex or not),

    f(x*) = min_x f(x)  ⟺  0 ∈ ∂f(x*),

i.e., x* is a minimizer if and only if 0 is a subgradient of f at x*.

Why? Easy: g = 0 being a subgradient means that for all y

    f(y) ≥ f(x*) + 0^T (y - x*) = f(x*).

Note the implication for a convex and differentiable function f with ∂f(x) = {∇f(x)}.

Derivation of first-order optimality

Example of the power of subgradients: we can use what we have learned so far to derive the first-order optimality condition.

Theorem. For f convex and differentiable and C convex, the problem

    min_x f(x) subject to x ∈ C

is solved at x if and only if

    ∇f(x)^T (y - x) ≥ 0 for all y ∈ C.

(For a direct proof see, e.g., http://www.princeton.edu/~amirali/Public/Teaching/ORF523/S16/ORF523_S16_Lec7_gh.pdf. A proof using subgradients is on the next slide.)

Intuitively: this says that f cannot decrease, to first order, as we move from x toward any point of C. Note that for C = R^n (the unconstrained case) it reduces to ∇f(x) = 0.

Derivation of first-order optimality

Proof.
First recast the problem as

    min_x f(x) + I_C(x).

Now apply subgradient optimality: 0 ∈ ∂(f(x) + I_C(x)). Observe

    0 ∈ ∂(f(x) + I_C(x))
    ⟺ 0 ∈ {∇f(x)} + N_C(x)
    ⟺ -∇f(x) ∈ N_C(x)
    ⟺ -∇f(x)^T x ≥ -∇f(x)^T y for all y ∈ C
    ⟺ ∇f(x)^T (y - x) ≥ 0 for all y ∈ C,

as desired.

Note: the condition 0 ∈ ∂f(x) + N_C(x) is a fully general condition for optimality in convex problems. But it's not always easy to work with (KKT conditions, later, are easier).
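As a quick numerical illustration of the theorem (not from the original slides), consider the hypothetical instance of minimizing f(x) = (1/2)‖x - a‖_2^2 over the box C = [0, 1]^n; the solution is the clipped point, and we verify the first-order condition at it on randomly sampled feasible points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: minimize f(x) = 1/2 ||x - a||_2^2 over C = [0, 1]^n.
# The minimizer is x = clip(a, 0, 1); check grad f(x)^T (y - x) >= 0 for sampled y in C.
a = np.array([2.0, -0.5, 0.3])
x = np.clip(a, 0.0, 1.0)
grad = x - a                      # grad f(x) = x - a
for _ in range(1000):
    y = rng.uniform(0.0, 1.0, size=a.shape)
    assert grad @ (y - x) >= -1e-12
```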

Example: lasso optimality conditions

Given y ∈ R^n, X ∈ R^{n×p}, the lasso problem can be parametrized as

    min_β (1/2) ‖y - Xβ‖_2^2 + λ ‖β‖_1,

where λ ≥ 0. Subgradient optimality:

    0 ∈ ∂( (1/2) ‖y - Xβ‖_2^2 + λ ‖β‖_1 )
    ⟺ 0 ∈ -X^T (y - Xβ) + λ ∂‖β‖_1
    ⟺ X^T (y - Xβ) = λ v

for some v ∈ ∂‖β‖_1, i.e.,

    v_i ∈ {1}        if β_i > 0
    v_i ∈ {-1}       if β_i < 0,     i = 1, ..., p.
    v_i ∈ [-1, 1]    if β_i = 0

Example: lasso optimality conditions

Write X_1, ..., X_p for the columns of X. Then our condition reads

    X_i^T (y - Xβ) = λ · sign(β_i)   if β_i ≠ 0
    |X_i^T (y - Xβ)| ≤ λ             if β_i = 0.

Note: the subgradient optimality conditions don't lead to a closed-form expression for a lasso solution. However, they do provide a way to check lasso optimality (see the sketch below).

They are also helpful in understanding the lasso estimator; e.g., if |X_i^T (y - Xβ)| < λ, then β_i = 0 (used by screening rules, later?).
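The following Python sketch (not from the original slides) checks these conditions numerically for a candidate β; the tolerance is an arbitrary choice.

```python
import numpy as np

def is_lasso_optimal(X, y, beta, lam, tol=1e-8):
    """Check the lasso subgradient optimality conditions at a candidate beta.

    Requires X_i^T (y - X beta) = lam * sign(beta_i) when beta_i != 0,
    and |X_i^T (y - X beta)| <= lam when beta_i == 0.
    """
    r = X.T @ (y - X @ beta)          # vector of X_i^T (y - X beta)
    nonzero = np.abs(beta) > tol
    cond_nonzero = np.all(np.abs(r[nonzero] - lam * np.sign(beta[nonzero])) <= tol)
    cond_zero = np.all(np.abs(r[~nonzero]) <= lam + tol)
    return bool(cond_nonzero and cond_zero)
```

For example, is_lasso_optimal(X, y, np.zeros(p), lam) returns True exactly when ‖X^T y‖_∞ ≤ λ, matching the fact that β = 0 is a lasso solution for large enough λ.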

Example: soft-thresholding

Simplified lasso problem with X = I:

    min_β (1/2) ‖y - β‖_2^2 + λ ‖β‖_1.

This we can solve directly using subgradient optimality. The solution is β = S_λ(y), where S_λ is the soft-thresholding operator

    [S_λ(y)]_i = y_i - λ    if y_i > λ
                 0          if -λ ≤ y_i ≤ λ,     i = 1, ..., n.
                 y_i + λ    if y_i < -λ

Check: from the last slide, the subgradient optimality conditions are

    y_i - β_i = λ · sign(β_i)   if β_i ≠ 0
    |y_i - β_i| ≤ λ             if β_i = 0.

Example: soft-thresholding

Now plug in β = S_λ(y) and check that these are satisfied:

- When y_i > λ, β_i = y_i - λ > 0, so y_i - β_i = λ = λ · 1.
- When y_i < -λ, the argument is similar.
- When |y_i| ≤ λ, β_i = 0, and |y_i - β_i| = |y_i| ≤ λ.

[Figure: soft-thresholding in one variable.]
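A small sketch (not from the original slides) of the soft-thresholding operator, together with a numerical check against the optimality conditions above; the vector y and λ are arbitrary test values.

```python
import numpy as np

def soft_threshold(y, lam):
    """Soft-thresholding operator S_lambda(y), applied componentwise."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Check the subgradient optimality conditions for the simplified lasso (X = I)
y = np.array([3.0, 0.2, -1.5, -0.05])
lam = 0.5
beta = soft_threshold(y, lam)
r = y - beta
nonzero = beta != 0
assert np.allclose(r[nonzero], lam * np.sign(beta[nonzero]))
assert np.all(np.abs(r[~nonzero]) <= lam + 1e-12)
```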

Example: distance to a convex set

Recall the distance function to a closed, convex set C:

    dist(x, C) = min_{y ∈ C} ‖y - x‖_2.

This is a convex function. What are its subgradients?

Write dist(x, C) = ‖x - P_C(x)‖_2, where P_C(x) is the projection of x onto C. It turns out that when dist(x, C) > 0,

    ∂ dist(x, C) = { (x - P_C(x)) / ‖x - P_C(x)‖_2 }

has only one element, so in fact dist(x, C) is differentiable and this is its gradient.

Example: distance to a convex set

We will only show one direction, i.e., that

    (x - P_C(x)) / ‖x - P_C(x)‖_2 ∈ ∂ dist(x, C).

Write u = P_C(x). Then by the first-order optimality conditions for a projection,

    (x - u)^T (y - u) ≤ 0 for all y ∈ C.

Hence

    C ⊆ H = {y : (x - u)^T (y - u) ≤ 0}.

Claim:

    dist(y, C) ≥ (x - u)^T (y - u) / ‖x - u‖_2 for all y.

Check: first, for y ∈ H, the right-hand side is ≤ 0, while dist(y, C) ≥ 0.

Example: distance to a convex set

Now for y ∉ H, we have

    (x - u)^T (y - u) = ‖x - u‖_2 ‖y - u‖_2 cos θ,

where θ is the angle between x - u and y - u. Thus

    (x - u)^T (y - u) / ‖x - u‖_2 = ‖y - u‖_2 cos θ = dist(y, H) ≤ dist(y, C),

as desired.

Using the claim, we have for any y

    dist(y, C) ≥ (x - u)^T (y - x + x - u) / ‖x - u‖_2
               = (x - u)^T (y - x) / ‖x - u‖_2 + ‖x - u‖_2
               = dist(x, C) + g^T (y - x).

Hence g = (x - u) / ‖x - u‖_2 is a subgradient of dist(x, C) at x.
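As a final illustration (not from the original slides), here is a Python sketch computing this subgradient when C is a Euclidean ball, for which the projection has a simple closed form; the center, radius and test points are hypothetical choices.

```python
import numpy as np

def project_ball(x, center, radius):
    """Projection of x onto the Euclidean ball {z : ||z - center||_2 <= radius}."""
    d = x - center
    nrm = np.linalg.norm(d)
    return x if nrm <= radius else center + radius * d / nrm

def subgrad_dist(x, center, radius):
    """Subgradient of dist(x, C) for C a Euclidean ball."""
    u = project_ball(x, center, radius)
    gap = np.linalg.norm(x - u)
    if gap == 0:
        return np.zeros_like(x)    # x in C: 0 is a subgradient of dist(., C) at x
    return (x - u) / gap           # (x - P_C(x)) / ||x - P_C(x)||_2, the gradient

# Numerical check of the subgradient inequality dist(y, C) >= dist(x, C) + g^T (y - x)
c, r = np.zeros(2), 1.0
x, y = np.array([3.0, 0.0]), np.array([0.0, 5.0])
g = subgrad_dist(x, c, r)
dist = lambda z: np.linalg.norm(z - project_ball(z, c, r))
assert dist(y) >= dist(x) + g @ (y - x) - 1e-12
```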

References and further reading

S. Boyd, Lecture notes for EE 264B, Stanford University, Spring 2010-2011

R. T. Rockafellar (1970), Convex Analysis, Chapters 23-25

L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012