Complexity of first, second and zero order optimization

























I am currently reading Bishop, 'Pattern Recognition and Machine Learning' (2006), where he writes about why using gradient information for optimization is superior to not using it (p. 239).



Unfortunately, I have no real background in optimization and I can't follow what he means. He makes a local quadratic approximation of an error function $E(\mathbf{w})$ around a point $\bar{\mathbf{w}}$ (here $\mathbf{w}$ is a vector of arbitrary size):
$$E(\mathbf{w}) = E(\bar{\mathbf{w}}) + (\mathbf{w}-\bar{\mathbf{w}})^T \nabla E \bigr|_{\mathbf{w}=\bar{\mathbf{w}}} + \frac{1}{2}(\mathbf{w}-\bar{\mathbf{w}})^T H (\mathbf{w}-\bar{\mathbf{w}})$$



where $H$ is the Hessian matrix. Note that the error function $E(\mathbf{w})$ is highly nonlinear; it is the loss function of a neural network. We define $\dim \mathbf{w} = W$, and he then states that since the Hessian has $O(W^2)$ elements, locating the minimum requires gathering the same order of pieces of information.
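To make this counting concrete, here is a minimal sketch (my own illustration, not from the book) of how many independent numbers the local quadratic model contains:

    # Independent parameters of the quadratic model
    #   E(w) = E(w0) + (w - w0)^T g + 1/2 (w - w0)^T H (w - w0)
    # for a weight vector of dimension W.
    def quadratic_model_size(W: int) -> int:
        constant = 1                 # E(w0)
        gradient = W                 # g has W entries
        hessian = W * (W + 1) // 2   # H is symmetric: diagonal plus upper triangle
        return constant + gradient + hessian

    for W in (10, 100, 1000):
        print(W, quadratic_model_size(W))   # grows like W^2 / 2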



He states: 'Without use of gradient information we would expect to have to perform $O(W^2)$ function evaluations, each of which would require $O(W)$ steps.' Thus the total effort would be of order $O(W^3)$.



Now using gradient information:



He continues: 'Because each evaluation of $\nabla E$ brings $W$ items of information, we might hope to find the minimum of the function in $O(W)$ gradient evaluations. As we shall see, by using error backpropagation, each such evaluation takes only $O(W)$ steps and so the minimum can be found in $O(W^2)$ steps.'



I have absolutely no idea how to derive the insights I quoted above.



I would also appreciate any literature that explains the number of steps needed for zero-, first-, and second-order optimization.






























Tags: optimization, nonlinear-optimization, machine-learning, neural-networks

asked Mar 27 at 16:25 by EpsilonDelta

  • Paul Sinclair (Mar 28 at 0:56): This 3Blue1Brown series may prove helpful, particularly the second episode.















1 Answer


You asked: "I have absolutely no idea how to derive the insights I quoted above."




Here's what I think he means: we have (by assumption) some quadratic error surface
$$E(w) = \widehat{E} + (w-\widehat{w})^T b + \frac{1}{2}(w-\widehat{w})^T H (w-\widehat{w})$$
where we do not know $b$ or $H$. Altogether, these two parameters have $O(W^2)$ independent values, so we would expect to need at least $O(W^2)$ independent pieces of information to determine $b$ and $H$ exactly. A function evaluation (one forward pass) yields a single number and costs $O(W)$, so gathering enough information this way costs $T_C = O(W^3)$ in total. Each evaluation of $\nabla E$, on the other hand, gives $O(W)$ pieces of information (it is a vector with $W$ independent components) at a cost of $O(W)$ as well. Hence gathering $O(W^2)$ pieces of information takes only $O(W)$ evaluations of $\nabla E$, each costing $O(W)$, for a total of only $T_C = O(W^2)$.
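A trivial tally of the two estimates (my own sketch; the value of $W$ is chosen arbitrarily):

    # Bishop's counting argument, for W = 1000.
    W = 1000
    pieces_needed = W**2          # independent unknowns in (b, H)
    cost_per_eval = W             # one forward (or backward) pass costs O(W)

    zero_order_total = pieces_needed * cost_per_eval           # 1 number per eval  -> O(W^3)
    first_order_total = (pieces_needed // W) * cost_per_eval   # W numbers per eval -> O(W^2)

    print(zero_order_total, first_order_total)   # 10^9 vs 10^6 "steps"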




I think the above is sufficient, but we can give it some extra thought from an optimization perspective. Notice that we can rewrite $E$ by expanding it:
$$E(w) = c + w^T b + \frac{1}{2} w^T H w$$
This is an unconstrained quadratic form (more generally, see quadratic programming). Let's assume $H$ is positive definite; this is true if the error surface is "nice" and convex, or if the expansion point $\widehat{w}$ is actually at the minimum, in which case the Hessian must be SPD (since it is a minimum). Then
$$\nabla E = b + Hw$$
so if we knew $H$ and $b$, we could get the minimum by solving the linear system $Hw = -b$. This means finding the $O(W^2)$ numbers within $H$ and $b$. Can we do this via measurements of $E$ (i.e., running forward passes with different $w$)? Sure. This corresponds to gathering samples $(w_1, E_1), \ldots, (w_n, E_n)$ and then solving
$$H^*, b^*, c^* = \operatorname{argmin}_{H,b,c} \sum_i \left(E_i - c - w_i^T b - \tfrac{1}{2} w_i^T H w_i\right)^2$$
which looks like a nonlinear least-squares problem but is actually a polynomial (quadratic) least-squares regression problem, since $b$ and $H$ are just the coefficients of a second-order polynomial once you write out the matrix multiplications. We can solve it by ordinary least squares: minimize $\|\widetilde{E} - X\beta\|^2$, where $\widetilde{E}$ is the vector of measured errors, $\beta$ holds the unfolded parameters ($b$, $H$, $c$), and $X$ is a design matrix computed from the sampled $w$. Ultimately the solution comes from solving the linear system $\widetilde{E} = X\beta$, or the normal equations $X^T X \beta = X^T \widetilde{E}$. You cannot expect a reasonable solution without at least $|\beta|$ rows in $X$; in fact, with noiseless samples you need exactly that many to pin down the solution (otherwise the linear system is under-determined). But $|\beta| \in O(W^2)$, so you need at least $O(W^2)$ samples to estimate $H$ and $b$, after which you can easily solve for the minimum.
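To check that this recovery really works, here is a minimal numpy sketch (my own construction, not part of the original answer): it samples a synthetic quadratic at exactly $1 + W + W(W+1)/2$ points, solves the resulting square linear system for the coefficients, and recovers the minimizer from $Hw = -b$.

    import numpy as np

    rng = np.random.default_rng(0)
    W = 5

    # Synthetic ground truth: E(w) = c + b^T w + 0.5 * w^T H w, with H SPD.
    A = rng.normal(size=(W, W))
    H_true = A @ A.T + W * np.eye(W)
    b_true = rng.normal(size=W)
    c_true = rng.normal()

    def E(w):
        return c_true + b_true @ w + 0.5 * w @ H_true @ w

    # One feature per unknown: [1, w_1..w_W, products w_i w_j for i <= j].
    # Diagonal features carry the 1/2 factor so beta holds H_ii directly.
    iu = np.triu_indices(W)
    diag_scale = np.where(iu[0] == iu[1], 0.5, 1.0)

    def features(w):
        return np.concatenate(([1.0], w, diag_scale * np.outer(w, w)[iu]))

    n_params = 1 + W + W * (W + 1) // 2   # |beta|, which is O(W^2)
    ws = rng.normal(size=(n_params, W))   # exactly |beta| noiseless samples
    X = np.array([features(w) for w in ws])
    E_tilde = np.array([E(w) for w in ws])

    beta = np.linalg.solve(X, E_tilde)    # square system: unique solution
    c_hat, b_hat = beta[0], beta[1:1 + W]
    H_hat = np.zeros((W, W))
    H_hat[iu] = beta[1 + W:]
    H_hat = H_hat + H_hat.T - np.diag(np.diag(H_hat))   # symmetrize

    w_star = np.linalg.solve(H_hat, -b_hat)   # minimizer: gradient b + Hw = 0
    print(np.allclose(H_hat, H_true), np.allclose(b_hat, b_true))
    print(np.allclose(w_star, np.linalg.solve(H_true, -b_true)))

With noisy measurements you would instead take more than $|\beta|$ samples and use np.linalg.lstsq, but the $O(W^2)$ sample count is the floor either way.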



In other words, if we want to get to the minimum of $E$, there isn't an obvious way to do so without having at least $O(W^2)$ pieces of information.




Incidentally, I'm not a fan of explaining it this way. I think it's more insightful to consider that if an objective evaluation $L(x)$ (a forward pass) costs $cW$ in runtime, then evaluating $\nabla L(x)$ via backpropagation costs only $rW$, where $W \gg r > c$ and $r$ depends on your model (it reflects the backward pass). But $L(x)$ is only a single number, while $\nabla L$ is $W$ numbers. So we clearly get much more information by repeatedly evaluating $\nabla L$ than $L$: each evaluation yields $W$ times more information, yet computing $\nabla L$ costs far less than $W$ times the cost of $L$. Notice also that a naive search (as in finite-difference derivative approximations) has to compute $L$ at least $O(W)$ times to figure out how to perturb each weight, costing $O(W^2)$ in total for a single step, whereas backpropagation delivers one step for only $O(W)$ computation. This does require being able to compute $\nabla L$ efficiently via automatic differentiation, which usually makes the backward pass cost only a small constant times the forward pass.
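Here is a minimal sketch of that gap (my own toy objective, not from Bishop): finite differences spend $W + 1$ forward passes on one gradient, while the analytic gradient costs the same order as a single forward pass.

    import numpy as np

    rng = np.random.default_rng(1)
    W = 200
    A = rng.normal(size=(W, W))
    y = rng.normal(size=W)

    def L(w):
        # One "forward pass": L(w) = 0.5 * ||A w - y||^2
        r = A @ w - y
        return 0.5 * r @ r

    def grad_fd(w, eps=1e-6):
        # Zero-order route: W + 1 objective evaluations per gradient.
        base, g = L(w), np.empty(W)
        for i in range(W):
            wp = w.copy()
            wp[i] += eps
            g[i] = (L(wp) - base) / eps
        return g

    def grad_analytic(w):
        # "Backprop" route: one pass of comparable cost gives all W components.
        return A.T @ (A @ w - y)

    w = rng.normal(size=W)
    err = np.max(np.abs(grad_fd(w) - grad_analytic(w)))
    print(err)   # small, dominated by finite-difference truncation error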




OK, finally, you asked:

"I would also appreciate any literature that explains the number of steps needed for zero-, first-, and second-order optimization."




The order of an optimization method refers to the highest order of derivatives it uses. In general, the higher the order, the more powerful the method. For instance, a second-order method uses the Hessian (or, e.g., the natural gradient), giving $O(W^2)$ pieces of information per step. Why don't we use these for deep learning? The main reason is that computing and storing the Hessian is too expensive, especially in memory. Another reason is the stochastic nature of the error signal: we cannot compute $E$ itself, only a random estimate of it, so the derivative estimates are noisy, and the Hessian estimate is even noisier than the gradient (derivative operators amplify noise in signals).
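A quick back-of-the-envelope sketch of the memory cost (my own numbers, assuming 4-byte floats):

    # Dense Hessian storage vs gradient storage, float32.
    for W in (10**5, 10**6, 10**8):
        gradient_gb = 4 * W / 1e9
        hessian_gb = 4 * W**2 / 1e9
        print(f"W={W:.1e}: gradient {gradient_gb:g} GB, Hessian {hessian_gb:g} GB")

Even a mid-sized model's dense Hessian dwarfs accelerator memory, which is why practical second-order-flavored methods rely on Hessian-vector products or diagonal/low-rank approximations instead.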



Regarding the number of steps needed, don't be misled by this toy example. For the full error surface defined by a nonlinear, non-convex function (like a neural network), there is no general way to estimate the number of steps needed in advance; otherwise we would know how long to train our networks for. :) In some simple cases one can establish convergence theorems, but that is a topic for optimization theory.






answered Mar 30 at 19:37 by user3658307





















