Complexity of first, second and zero order optimization

























I am currently reading Bishop, 'Pattern Recognition and Machine Learning' (2006), where he writes about why using gradient information for optimization is superior to not using it (p. 239).



Unfortunately, I have no real background in optimization and I can't follow what he means. He makes a local quadratic approximation of an error function $E(\mathbf{w})$ around a point $\bar{\mathbf{w}}$ (here $\mathbf{w}$ is a vector of arbitrary size):
$$E(\mathbf{w}) = E(\bar{\mathbf{w}}) + (\mathbf{w}-\bar{\mathbf{w}})^T \nabla E \bigr|_{\mathbf{w}=\bar{\mathbf{w}}} + \frac{1}{2}(\mathbf{w}-\bar{\mathbf{w}})^T H (\mathbf{w}-\bar{\mathbf{w}})$$



where $H$ is the Hessian matrix. Note that the error function $E(\mathbf{w})$ is highly nonlinear; it is the loss function of a neural network. We define $\dim \mathbf{w} = W$, and he then states that since the Hessian has $O(W^2)$ elements, locating the minimum requires gathering the same order of pieces of information.
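To make this counting concrete, here is a minimal sketch (my own illustration, not from the book) of how many independent numbers the local quadratic model contains:

    # Independent parameters of the quadratic model
    #   E(w) = E(w0) + (w - w0)^T g + 1/2 (w - w0)^T H (w - w0)
    # for a weight vector of dimension W.
    def quadratic_model_size(W: int) -> int:
        constant = 1                 # E(w0)
        gradient = W                 # g has W entries
        hessian = W * (W + 1) // 2   # H is symmetric: diagonal plus upper triangle
        return constant + gradient + hessian

    for W in (10, 100, 1000):
        print(W, quadratic_model_size(W))   # grows like W^2 / 2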



He states: 'Without use of gradient information we would expect to have to perform $O(W^2)$ function evaluations, each of which would require $O(W)$ steps.' Thus the total effort would be of order $O(W^3)$.



Now using gradient information:



He continues: 'Because each evaluation of $\nabla E$ brings $W$ items of information, we might hope to find the minimum of the function in $O(W)$ gradient evaluations. As we shall see, by using error backpropagation, each such evaluation takes only $O(W)$ steps and so the minimum can be found in $O(W^2)$ steps.'



I have absolutely no idea how to derive the insights I quoted above.



I would also appreciate any literature that explains the number of steps needed for zero-, first-, and second-order optimization.






























Tags: optimization, nonlinear-optimization, machine-learning, neural-networks

asked Mar 27 at 16:25 by EpsilonDelta

  • Paul Sinclair (Mar 28 at 0:56): This 3Blue1Brown series may prove helpful, particularly the second episode.















1 Answer


You asked: "I have absolutely no idea how to derive the insights I quoted above."




Here's what I think he means: we have (by assumption) some quadratic error surface
$$E(w) = \widehat{E} + (w-\widehat{w})^T b + \frac{1}{2}(w-\widehat{w})^T H (w-\widehat{w})$$
where we do not know $b$ or $H$. Altogether, these two parameters have $O(W^2)$ independent values, so we would expect to need at least $O(W^2)$ independent pieces of information to determine $b$ and $H$ exactly. A function evaluation (one forward pass) yields a single number and costs $O(W)$, so gathering enough information this way costs $T_C = O(W^3)$ in total. Each evaluation of $\nabla E$, on the other hand, gives $O(W)$ pieces of information (it is a vector with $W$ independent components) at a cost of $O(W)$ as well. Hence gathering $O(W^2)$ pieces of information takes only $O(W)$ evaluations of $\nabla E$, each costing $O(W)$, for a total of only $T_C = O(W^2)$.
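A trivial tally of the two estimates (my own sketch; the value of $W$ is chosen arbitrarily):

    # Bishop's counting argument, for W = 1000.
    W = 1000
    pieces_needed = W**2          # independent unknowns in (b, H)
    cost_per_eval = W             # one forward (or backward) pass costs O(W)

    zero_order_total = pieces_needed * cost_per_eval           # 1 number per eval  -> O(W^3)
    first_order_total = (pieces_needed // W) * cost_per_eval   # W numbers per eval -> O(W^2)

    print(zero_order_total, first_order_total)   # 10^9 vs 10^6 "steps"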




I think the above is sufficient, but we can give it some extra thought from an optimization perspective. Notice that we can rewrite $E$ by expanding it:
$$E(w) = c + w^T b + \frac{1}{2} w^T H w$$
This is an unconstrained quadratic form (more generally, see quadratic programming). Let's assume $H$ is positive definite; this is true if the error surface is "nice" and convex, or if the expansion point $\widehat{w}$ is actually at the minimum, in which case the Hessian must be SPD (since it is a minimum). Then
$$\nabla E = b + Hw$$
so if we knew $H$ and $b$, we could get the minimum by solving the linear system $Hw = -b$. This means finding the $O(W^2)$ numbers within $H$ and $b$. Can we do this via measurements of $E$ (i.e., running forward passes with different $w$)? Sure. This corresponds to gathering samples $(w_1, E_1), \ldots, (w_n, E_n)$ and then solving
$$H^*, b^*, c^* = \operatorname{argmin}_{H,b,c} \sum_i \left(E_i - c - w_i^T b - \tfrac{1}{2} w_i^T H w_i\right)^2$$
which looks like a nonlinear least-squares problem but is actually a polynomial (quadratic) least-squares regression problem, since $b$ and $H$ are just the coefficients of a second-order polynomial once you write out the matrix multiplications. We can solve it by ordinary least squares: minimize $\|\widetilde{E} - X\beta\|^2$, where $\widetilde{E}$ is the vector of measured errors, $\beta$ holds the unfolded parameters ($b$, $H$, $c$), and $X$ is a design matrix computed from the sampled $w$. Ultimately the solution comes from solving the linear system $\widetilde{E} = X\beta$, or the normal equations $X^T X \beta = X^T \widetilde{E}$. You cannot expect a reasonable solution without at least $|\beta|$ rows in $X$; in fact, with noiseless samples you need exactly that many to pin down the solution (otherwise the linear system is under-determined). But $|\beta| \in O(W^2)$, so you need at least $O(W^2)$ samples to estimate $H$ and $b$, after which you can easily solve for the minimum.
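To check that this recovery really works, here is a minimal numpy sketch (my own construction, not part of the original answer): it samples a synthetic quadratic at exactly $1 + W + W(W+1)/2$ points, solves the resulting square linear system for the coefficients, and recovers the minimizer from $Hw = -b$.

    import numpy as np

    rng = np.random.default_rng(0)
    W = 5

    # Synthetic ground truth: E(w) = c + b^T w + 0.5 * w^T H w, with H SPD.
    A = rng.normal(size=(W, W))
    H_true = A @ A.T + W * np.eye(W)
    b_true = rng.normal(size=W)
    c_true = rng.normal()

    def E(w):
        return c_true + b_true @ w + 0.5 * w @ H_true @ w

    # One feature per unknown: [1, w_1..w_W, products w_i w_j for i <= j].
    # Diagonal features carry the 1/2 factor so beta holds H_ii directly.
    iu = np.triu_indices(W)
    diag_scale = np.where(iu[0] == iu[1], 0.5, 1.0)

    def features(w):
        return np.concatenate(([1.0], w, diag_scale * np.outer(w, w)[iu]))

    n_params = 1 + W + W * (W + 1) // 2   # |beta|, which is O(W^2)
    ws = rng.normal(size=(n_params, W))   # exactly |beta| noiseless samples
    X = np.array([features(w) for w in ws])
    E_tilde = np.array([E(w) for w in ws])

    beta = np.linalg.solve(X, E_tilde)    # square system: unique solution
    c_hat, b_hat = beta[0], beta[1:1 + W]
    H_hat = np.zeros((W, W))
    H_hat[iu] = beta[1 + W:]
    H_hat = H_hat + H_hat.T - np.diag(np.diag(H_hat))   # symmetrize

    w_star = np.linalg.solve(H_hat, -b_hat)   # minimizer: gradient b + Hw = 0
    print(np.allclose(H_hat, H_true), np.allclose(b_hat, b_true))
    print(np.allclose(w_star, np.linalg.solve(H_true, -b_true)))

With noisy measurements you would instead take more than $|\beta|$ samples and use np.linalg.lstsq, but the $O(W^2)$ sample count is the floor either way.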



In other words, if we want to get to the minimum of $E$, there isn't an obvious way to do so without having at least $O(W^2)$ pieces of information.




Incidentally, I'm not a fan of explaining it this way. I think it's more insightful to consider that if an objective evaluation $L(x)$ (a forward pass) costs $cW$ in runtime, then evaluating $\nabla L(x)$ via backpropagation costs only $rW$, where $W \gg r > c$ and $r$ depends on your model (it reflects the backward pass). But $L(x)$ is only a single number, while $\nabla L$ is $W$ numbers. So we clearly get much more information by repeatedly evaluating $\nabla L$ than $L$: each evaluation yields $W$ times more information, yet computing $\nabla L$ costs far less than $W$ times the cost of $L$. Notice also that a naive search (as in finite-difference derivative approximations) has to compute $L$ at least $O(W)$ times to figure out how to perturb each weight, costing $O(W^2)$ in total for a single step, whereas backpropagation delivers one step for only $O(W)$ computation. This does require being able to compute $\nabla L$ efficiently via automatic differentiation, which usually makes the backward pass cost only a small constant times the forward pass.
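Here is a minimal sketch of that gap (my own toy objective, not from Bishop): finite differences spend $W + 1$ forward passes on one gradient, while the analytic gradient costs the same order as a single forward pass.

    import numpy as np

    rng = np.random.default_rng(1)
    W = 200
    A = rng.normal(size=(W, W))
    y = rng.normal(size=W)

    def L(w):
        # One "forward pass": L(w) = 0.5 * ||A w - y||^2
        r = A @ w - y
        return 0.5 * r @ r

    def grad_fd(w, eps=1e-6):
        # Zero-order route: W + 1 objective evaluations per gradient.
        base, g = L(w), np.empty(W)
        for i in range(W):
            wp = w.copy()
            wp[i] += eps
            g[i] = (L(wp) - base) / eps
        return g

    def grad_analytic(w):
        # "Backprop" route: one pass of comparable cost gives all W components.
        return A.T @ (A @ w - y)

    w = rng.normal(size=W)
    err = np.max(np.abs(grad_fd(w) - grad_analytic(w)))
    print(err)   # small, dominated by finite-difference truncation error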




OK, finally, you asked:

"I would also appreciate any literature that explains the number of steps needed for zero-, first-, and second-order optimization."




The order of an optimization method refers to the highest order of derivatives it uses. In general, the higher the order, the more powerful the method. For instance, a second-order method uses the Hessian (or, e.g., the natural gradient), giving $O(W^2)$ pieces of information per step. Why don't we use these for deep learning? The main reason is that computing and storing the Hessian is too expensive, especially in memory. Another reason is the stochastic nature of the error signal: we cannot compute $E$ itself, only a random estimate of it, so the derivative estimates are noisy, and the Hessian estimate is even noisier than the gradient (derivative operators amplify noise in signals).
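A quick back-of-the-envelope sketch of the memory cost (my own numbers, assuming 4-byte floats):

    # Dense Hessian storage vs gradient storage, float32.
    for W in (10**5, 10**6, 10**8):
        gradient_gb = 4 * W / 1e9
        hessian_gb = 4 * W**2 / 1e9
        print(f"W={W:.1e}: gradient {gradient_gb:g} GB, Hessian {hessian_gb:g} GB")

Even a mid-sized model's dense Hessian dwarfs accelerator memory, which is why practical second-order-flavored methods rely on Hessian-vector products or diagonal/low-rank approximations instead.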



Regarding the number of steps needed, don't be misled by this toy example. For the full error surface defined by a nonlinear, non-convex function (like a neural network), there is no general way to estimate the number of steps needed in advance; otherwise we would know how long to train our networks for. :) In some simple cases one can establish convergence theorems, but that is a topic for optimization theory.






answered Mar 30 at 19:37 by user3658307





















