Stochastic Gradient Descent

Hoàng Nam Dũng

Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội

Last time: proximal gradient descent

Consider the problem

$$\min_x \; g(x) + h(x)$$

with $g, h$ convex, $g$ differentiable, and $h$ "simple" in so much as

$$\mathrm{prox}_t(x) = \mathrm{argmin}_z \; \frac{1}{2t}\|x - z\|_2^2 + h(z)$$

is computable.

Proximal gradient descent: let $x^{(0)} \in \mathbb{R}^n$, repeat:

$$x^{(k)} = \mathrm{prox}_{t_k}\!\big(x^{(k-1)} - t_k \nabla g(x^{(k-1)})\big), \quad k = 1, 2, 3, \dots$$

Step sizes $t_k$ chosen to be fixed and small, or via backtracking.


If $\nabla g$ is Lipschitz with constant $L$, then this has convergence rate $O(1/\varepsilon)$. Lastly we can accelerate this, to optimal rate $O(1/\sqrt{\varepsilon})$.
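To make the "simple" $h$ concrete, here is a minimal sketch (not from the slides) of one proximal gradient step for the lasso penalty $h(x) = \lambda\|x\|_1$, whose prox is coordinate-wise soft-thresholding; `grad_g` is an assumed user-supplied gradient of $g$.

```python
# Minimal sketch of one proximal gradient step for h(x) = lam * ||x||_1.
# Assumes grad_g(x) returns the gradient of the smooth part g at x.
import numpy as np

def soft_threshold(z, tau):
    """prox of tau*||.||_1: shrink each coordinate toward zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_gradient_step(x, grad_g, t, lam):
    """x^(k) = prox_{t_k}(x^(k-1) - t_k * grad g(x^(k-1))), with h = lam*||.||_1."""
    return soft_threshold(x - t * grad_g(x), t * lam)
```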

Outline

Today:

▶ Stochastic gradient descent
▶ Convergence rates
▶ Mini-batches
▶ Early stopping

Stochastic gradient descent

Consider minimizing an average of functions

$$\min_x \; \frac{1}{m}\sum_{i=1}^m f_i(x).$$

As $\nabla \sum_{i=1}^m f_i(x) = \sum_{i=1}^m \nabla f_i(x)$, gradient descent would repeat

$$x^{(k)} = x^{(k-1)} - t_k \cdot \frac{1}{m}\sum_{i=1}^m \nabla f_i(x^{(k-1)}), \quad k = 1, 2, 3, \dots$$

In comparison, stochastic gradient descent or SGD (or incremental gradient descent) repeats

$$x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f_{i_k}(x^{(k-1)}), \quad k = 1, 2, 3, \dots$$

where $i_k \in \{1, \dots, m\}$ is some chosen index at iteration $k$.
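To make the update concrete, a minimal SGD loop might look as follows (a sketch, not from the slides); `grad_fi(x, i)` is an assumed helper returning $\nabla f_i(x)$, and `step(k)` returns the step size $t_k$.

```python
# Minimal SGD sketch with a randomly chosen index at each iteration.
# grad_fi(x, i) is assumed to return the gradient of f_i at x; step(k) returns t_k.
import numpy as np

def sgd(x0, grad_fi, m, step, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(1, n_iters + 1):
        i = rng.integers(m)                # i_k drawn uniformly from {0, ..., m-1}
        x = x - step(k) * grad_fi(x, i)    # x^(k) = x^(k-1) - t_k * grad f_{i_k}(x^(k-1))
    return x

# Hypothetical usage with f_i(x) = 0.5*||x - a_i||^2 and diminishing steps t_k = 1/k:
# a = np.random.randn(20, 5)
# x_hat = sgd(np.zeros(5), lambda x, i: x - a[i], m=20, step=lambda k: 1.0 / k)
```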

Stochastic gradient descent

Two rules for choosing index $i_k$ at iteration $k$:

▶ Randomized rule: choose $i_k \in \{1, \dots, m\}$ uniformly at random.
▶ Cyclic rule: choose $i_k = 1, 2, \dots, m, 1, 2, \dots, m, \dots$

Randomized rule is more common in practice. For the randomized rule, note that

$$\mathbb{E}[\nabla f_{i_k}(x)] = \nabla f(x),$$

so we can view SGD as using an unbiased estimate of the gradient at each step (a quick numerical check follows after this slide).

Main appeal of SGD:

▶ Iteration cost is independent of m (number of functions).
▶ Can also be a big savings in terms of memory usage.
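A quick numerical check of the unbiasedness claim (a toy illustration, not from the slides), using $f_i(x) = \tfrac{1}{2}\|x - a_i\|^2$ so that $\nabla f_i(x) = x - a_i$: averaging the stochastic gradient over many uniformly random indices approaches the full gradient.

```python
# Monte Carlo check that E[grad f_{i_k}(x)] = grad f(x) under the randomized rule,
# for the toy components f_i(x) = 0.5*||x - a_i||^2 (so grad f_i(x) = x - a_i).
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(10, 3))                  # m = 10 component functions in R^3
x = rng.normal(size=3)

full_grad = np.mean(x - a, axis=0)            # grad f(x) = (1/m) sum_i grad f_i(x)
draws = [x - a[rng.integers(len(a))] for _ in range(100_000)]
print(np.mean(draws, axis=0) - full_grad)     # close to zero: the estimate is unbiased
```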

Example: stochastic logistic regression

Given $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$, $i = 1, \dots, n$, recall logistic regression

$$\min_\beta \; f(\beta) = \frac{1}{n}\sum_{i=1}^n \underbrace{\Big(-y_i x_i^T\beta + \log(1 + \exp(x_i^T\beta))\Big)}_{f_i(\beta)}.$$

Gradient computation $\nabla f(\beta) = -\frac{1}{n}\sum_{i=1}^n (y_i - p_i(\beta))x_i$, with fitted probabilities $p_i(\beta) = \exp(x_i^T\beta)/(1 + \exp(x_i^T\beta))$, is doable when n is moderate, but not when n is huge.

Full gradient (also called batch) versus stochastic gradient:

▶ One batch update costs O(np).
▶ One stochastic update costs O(p).

Clearly, e.g., 10K stochastic steps are much more affordable.
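For instance (a sketch, not from the slides), a single stochastic gradient for logistic regression touches only one row of the data matrix, which is where the O(p) versus O(np) gap comes from; `X` is the assumed n × p matrix with rows $x_i^T$.

```python
# Batch vs. stochastic gradients for logistic regression (toy sketch).
# X has shape (n, p) with rows x_i^T, y is a 0/1 vector of length n.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_grad(beta, X, y):
    """Full gradient: O(np) work, touches every row of X."""
    p = sigmoid(X @ beta)                     # fitted probabilities p_i(beta)
    return -X.T @ (y - p) / len(y)

def stochastic_grad(beta, X, y, i):
    """Gradient of a single f_i: O(p) work, touches only row i."""
    p_i = sigmoid(X[i] @ beta)
    return -(y[i] - p_i) * X[i]
```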

Batch vs. stochastic gradient descent

Small example with n = 10, p = 2 to show the “classic picture” for batch versus stochastic methods:

[Figure: blue = batch steps, O(np); red = stochastic steps, O(p)]

Rule of thumb for stochastic methods:

▶ generally thrive far from optimum
▶ generally struggle close to optimum

Step sizes

Standard in SGD is to use diminishing step sizes, e.g., $t_k = 1/k$, for $k = 1, 2, 3, \dots$

Why not fixed step sizes? Here’s some intuition.

Suppose we take the cyclic rule for simplicity. Set $t_k = t$ for $m$ updates in a row; we get

$$x^{(k+m)} = x^{(k)} - t\sum_{i=1}^m \nabla f_i(x^{(k+i-1)}).$$

Meanwhile, full gradient descent with step size $t$ would give

$$x^{(k+1)} = x^{(k)} - t\sum_{i=1}^m \nabla f_i(x^{(k)}).$$

The difference here: $t\sum_{i=1}^m \big[\nabla f_i(x^{(k+i-1)}) - \nabla f_i(x^{(k)})\big]$, and if we hold $t$ constant, this difference will not generally be going to zero.
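A small numeric check of this intuition (toy example, not from the slides), with components $f_i(x) = \tfrac{1}{2}(x - a_i)^2$: m cyclic updates with a fixed step t keep drifting away from the corresponding full-gradient step, no matter how many passes we make.

```python
# Cyclic SGD pass vs. full-gradient step with the same fixed step size t,
# for the toy components f_i(x) = 0.5*(x - a_i)**2 (so grad f_i(x) = x - a_i).
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)                        # m = 5 component functions
t = 0.1                                       # fixed step size

def cyclic_pass(x):
    """m cyclic SGD updates in a row, each with step t."""
    for ai in a:
        x = x - t * (x - ai)
    return x

def full_step(x):
    """One full-gradient step x - t * sum_i grad f_i(x), as in the comparison above."""
    return x - t * np.sum(x - a)

x = 5.0
for k in range(100):
    gap = cyclic_pass(x) - full_step(x)
    x = cyclic_pass(x)
print(gap)   # stays bounded away from zero as long as t is held constant
```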

Convergence rates

Recall: for convex f , (sub)gradient descent with diminishing step sizes satisfies

$$f(x^{(k)}) - f^* = O(1/\sqrt{k}).$$

When f is differentiable with Lipschitz gradient, gradient descent with suitable fixed step sizes satisfies

$$f(x^{(k)}) - f^* = O(1/k).$$

What about SGD? For convex f, SGD with diminishing step sizes satisfies¹

$$\mathbb{E}[f(x^{(k)})] - f^* = O(1/\sqrt{k}).$$

Unfortunately this does not improve when we further assume f has Lipschitz gradient.

¹E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming"

Convergence rates

Even worse is the following discrepancy!

When f is strongly convex and has a Lipschitz gradient, gradient descent satisfies

$$f(x^{(k)}) - f^* = O(c^k)$$

where $c < 1$. But under the same conditions, SGD gives us²

$$\mathbb{E}[f(x^{(k)})] - f^* = O(1/k).$$

So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity.

What can we do to improve SGD?

²E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming"

Mini-batches

Also common is mini-batch stochastic gradient descent, where we choose a random subset $I_k \subseteq \{1, \dots, m\}$ of size $|I_k| = b \ll m$ and repeat

$$x^{(k)} = x^{(k-1)} - t_k \cdot \frac{1}{b}\sum_{i \in I_k} \nabla f_i(x^{(k-1)}), \quad k = 1, 2, 3, \dots$$

Again, we are approximating the full gradient by an unbiased estimate

$$\mathbb{E}\Big[\frac{1}{b}\sum_{i \in I_k} \nabla f_i(x)\Big] = \nabla f(x).$$

Using mini-batches reduces the variance of our gradient estimate by a factor 1/b, but is also b times more expensive.
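A mini-batch version of the earlier SGD sketch might look as follows (again an illustration, not from the slides); `grad_fi(x, i)` is the same assumed per-component gradient helper.

```python
# Mini-batch SGD sketch: sample a random subset I_k of size b and average
# the corresponding gradients (an unbiased estimate of grad f with variance ~1/b).
import numpy as np

def minibatch_sgd(x0, grad_fi, m, b, step, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(1, n_iters + 1):
        I = rng.choice(m, size=b, replace=False)           # random mini-batch I_k
        g = np.mean([grad_fi(x, i) for i in I], axis=0)    # (1/b) sum_{i in I_k} grad f_i(x)
        x = x - step(k) * g                                 # mini-batch update
    return x
```

Each iteration now costs roughly b times a plain stochastic update, in line with the variance/cost trade-off above.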

Batch vs. mini-batches vs. stochastic

Back to logistic regression, let’s now consider a regularized version:

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n \Big(-y_i x_i^T\beta + \log(1 + e^{x_i^T\beta})\Big) + \frac{\lambda}{2}\|\beta\|_2^2.$$

Write the criterion as

$$f(\beta) = \frac{1}{n}\sum_{i=1}^n f_i(\beta), \quad f_i(\beta) = -y_i x_i^T\beta + \log(1 + e^{x_i^T\beta}) + \frac{\lambda}{2}\|\beta\|_2^2.$$

Full gradient computation is $\nabla f(\beta) = -\frac{1}{n}\sum_{i=1}^n (y_i - p_i(\beta))x_i + \lambda\beta$. Comparison between methods:

▶ One batch update costs O(np).
▶ One mini-batch update costs O(bp).
▶ One stochastic update costs O(p).

Batch vs. mini-batches vs. stochastic


Example with n = 10,000, p = 20; all methods use fixed step sizes.

Batch vs. mini-batches vs. stochastic

What's happening? Now let's parametrize by flops³.

³flops: floating point operations

Batch vs. mini-batches vs. stochastic


Finally, looking at suboptimality gap (on log scale)

End of the story?

Short story:

▶ SGD can be super effective in terms of iteration cost, memory.
▶ But SGD is slow to converge, can't adapt to strong convexity.
▶ And mini-batches seem to be a wash in terms of flops (though they can still be useful in practice).

Is this the end of the story for SGD?

For a while, the answer was believed to be yes. Slow convergence for strongly convex functions was believed inevitable, as Nemirovski and others established matching lower bounds ... but this was for a more general stochastic problem, where $f(x) = \int F(x, \zeta)\, dP(\zeta)$.

New wave of "variance reduction" work shows we can modify SGD to converge much faster for finite sums (more later?).


SGD in large-scale ML

SGD has really taken off in large-scale machine learning

▶ In many ML problems we don't care about optimizing to high accuracy, it doesn't pay off in terms of statistical performance.
▶ Thus (in contrast to what classic theory says) fixed step sizes are commonly used in ML applications.
▶ One trick is to experiment with step sizes using a small fraction of the training data before running SGD on the full data set ... many other heuristics are common⁴.
▶ Many variants provide better practical stability, convergence: momentum, acceleration, averaging, coordinate-adapted step sizes, variance reduction ... (a momentum sketch follows after this slide).
▶ See AdaGrad, Adam, AdaMax, SVRG, SAG, SAGA ... (more later?).

⁴E.g., Bottou (2012), "Stochastic gradient descent tricks"
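As one concrete illustration of the momentum variant mentioned above (a generic heavy-ball sketch with made-up hyperparameters, not a recipe from the slides):

```python
# SGD with (heavy-ball) momentum: accumulate a velocity of past stochastic
# gradients and move along it instead of the raw gradient.
import numpy as np

def sgd_momentum(x0, grad_fi, m, t=0.01, mu=0.9, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for k in range(n_iters):
        i = rng.integers(m)                  # randomized index i_k
        v = mu * v - t * grad_fi(x, i)       # decay old velocity, add new gradient step
        x = x + v                            # update along the accumulated velocity
    return x
```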

Early stopping

Suppose p is large and we wanted to fit (say) a logistic regression model to data $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$, $i = 1, \dots, n$.

We could solve (say) $\ell_2$ regularized logistic regression

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n \Big(-y_i x_i^T\beta + \log(1 + e^{x_i^T\beta})\Big) \quad \text{subject to } \|\beta\|_2 \le t.$$

We could also run gradient descent on the unregularized problem

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n \Big(-y_i x_i^T\beta + \log(1 + e^{x_i^T\beta})\Big)$$

and stop early, i.e., terminate gradient descent well short of the global minimum.

Early stopping

Consider the following, for a very small constant step size ε:

▶ Start at $\beta^{(0)} = 0$, the solution to the regularized problem at t = 0.
▶ Perform gradient descent on the unregularized criterion
$$\beta^{(k)} = \beta^{(k-1)} + \varepsilon \cdot \frac{1}{n}\sum_{i=1}^n (y_i - p_i(\beta^{(k-1)}))x_i, \quad k = 1, 2, 3, \dots$$
(we could equally well consider SGD).
▶ Treat $\beta^{(k)}$ as an approximate solution to the regularized problem with $t = \|\beta^{(k)}\|_2$.


This is called early stopping for gradient descent. Why would we ever do this? It’s both more convenient and potentially much more efficient than using explicit regularization.
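A minimal early-stopping sketch (illustration only, not from the slides): run gradient descent on the unregularized logistic loss from β = 0 with a small step ε, record the iterates, and keep the one chosen by whatever stopping rule is used in practice (the fixed iteration budget `K` here is just a placeholder; a held-out validation set is a common choice).

```python
# Early stopping for gradient descent on unregularized logistic regression.
# X has shape (n, p), y is a 0/1 vector; returns the iterate path and its norms,
# each beta^(k) playing the role of a regularized solution with t = ||beta^(k)||_2.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def early_stopped_path(X, y, eps=1e-3, K=200):
    n, p = X.shape
    beta = np.zeros(p)                              # beta^(0) = 0, i.e. t = 0
    path = [beta.copy()]
    for k in range(K):
        p_hat = sigmoid(X @ beta)                   # fitted probabilities p_i(beta)
        beta = beta + eps * X.T @ (y - p_hat) / n   # one gradient descent step
        path.append(beta.copy())
    norms = [np.linalg.norm(b) for b in path]       # the implicit regularization path in t
    return path, norms
```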

An intriguing connection

When we solve the $\ell_2$ regularized logistic problem for varying t, the solution path looks quite similar to the gradient descent path!


Example with p = 8, solution and grad descent paths side by side

Lots left to explore

▶ Connection holds beyond logistic regression, for arbitrary loss.
▶ In general, the grad descent path will not coincide with the $\ell_2$ regularized path (as ε → 0). Though in practice, it seems to give competitive statistical performance.
▶ Can extend the early stopping idea to mimic a generic regularizer (beyond $\ell_2$)⁵.
▶ There is a lot of literature on early stopping, but it's still not as well-understood as it should be.
▶ Early stopping is just one instance of implicit or algorithmic regularization; many others are effective in large-scale ML, and they all should be better understood.

⁵Tibshirani (2015), "A general framework for fast stagewise algorithms"

References and further reading

D. Bertsekas (2010), Incremental gradient, subgradient, and proximal methods for convex optimization: a survey

A. Nemirovski, A. Juditsky, G. Lan and A. Shapiro (2009), Robust stochastic approximation approach to stochastic programming


R. Tibshirani (2015), A general framework for fast stagewise algorithms