Complexity of first, second and zero order optimization



I am currently reading Bishop, 'Pattern Recognition and Machine Learning' (2006), where he writes about why using gradient information for optimization is superior to not using it (p. 239).

Unfortunately, I have no real background in optimization and I can't understand what he means. He makes a local quadratic approximation of an error function $E(\mathbf{w})$ around $\bar{\mathbf{w}}$ ($\mathbf{w}$ is a vector of arbitrary size):
$$E(\mathbf{w})=E(\bar{\mathbf{w}}) + (\mathbf{w}-\bar{\mathbf{w}})^T\nabla E\bigr|_{\mathbf{w}=\bar{\mathbf{w}}} +\frac{1}{2}(\mathbf{w}-\bar{\mathbf{w}})^T H(\mathbf{w}-\bar{\mathbf{w}})$$

where $H$ is the Hessian matrix. Note that the error function $E(\mathbf{w})$ is highly nonlinear; it is the loss function of a neural network. We define $\dim \mathbf{w}=W$, and he then states that since the Hessian has $O(W^2)$ elements, the location of the minimum also depends on $O(W^2)$ quantities.

Quote: 'Without use of gradient information we would expect to have to perform $O(W^2)$ function evaluations each of which would require $O(W)$ steps.' Thus the effort needed would be of order $O(W^3)$.

Now using gradient information:

Quote: 'Because each evaluation of $\nabla E$ brings $W$ items of information, we might hope to find the minimum of the function in $O(W)$ gradient evaluations. As we shall see, by using error backpropagation, each such evaluation takes only $O(W)$ steps and so the minimum can be found in $O(W^2)$ steps.'

I have absolutely no idea how to obtain these insights which I cited.

I would also appreciate any literature which explains the number of steps needed for zero, first, second order optimization.










optimization nonlinear-optimization machine-learning neural-networks






asked Mar 27 at 16:25









EpsilonDelta

• This 3Blue1Brown series may prove helpful, particularly the second episode. – Paul Sinclair, Mar 28 at 0:56
















1 Answer
I have absolutely no idea how to obtain these insights which I cited.




Here's what I think he means: we have (by assumption) some quadratic error surface
$$
E(w) = \widehat{E} + (w-\widehat{w})^T b + \frac{1}{2}(w-\widehat{w})^T H (w-\widehat{w})
$$

where we do not know $b$ or $H$. Altogether, these two parameters hold $O(W^2)$ independent values, so we would expect to need at least $O(W^2)$ independent pieces of information to determine $b$ and $H$ exactly. However, each function evaluation (forward pass) costs $O(W)$ and yields only a single number, so it will cost $T_C=O(W^2)\cdot O(W)=O(W^3)$ to gather enough information to completely specify $b$ and $H$. In contrast, each evaluation of $\nabla E$ gives $O(W)$ pieces of information (it is a vector with $W$ components), also at a cost of $O(W)$. Hence gathering $O(W^2)$ pieces of information takes only $O(W)$ evaluations of $\nabla E$, each costing $O(W)$, for a total of $T_C=O(W^2)$.
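To make the counting concrete, here is a rough Python sketch (my own, not from Bishop; I assume $H$ is symmetric, so it holds $W(W+1)/2$ independent entries) that tallies the unknowns against the information supplied per evaluation:

    # Illustrative counting only. Assumes a symmetric H.

    def unknowns(W: int) -> int:
        """Independent values in b (W of them) plus symmetric H (W*(W+1)/2)."""
        return W + W * (W + 1) // 2

    def info_per_function_eval() -> int:
        return 1       # E(w) is a single scalar

    def info_per_gradient_eval(W: int) -> int:
        return W       # grad E(w) is a vector with W entries

    for W in (10, 100, 1000):
        n = unknowns(W)
        print(f"W={W:4d}: unknowns={n}  "
              f"function evals needed ~ {n}  gradient evals needed ~ {(n + W - 1) // W}")

The point is simply that the number of unknowns grows like $W^2$ while each gradient evaluation delivers $W$ numbers, which is Bishop's heuristic for needing only $O(W)$ gradient evaluations.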




I think the above is sufficient, but we can give this some extra thought from an optimization perspective. Notice that we can rewrite $E$ by expanding it:
$$ E(w) = c + w^T b + \frac{1}{2}w^T H w $$
This is an unconstrained quadratic form (see e.g. here or here; or, more generally, look at quadratic programming). Let's assume $H$ is positive definite; this holds if the error surface is "nice" and convex, or if the expansion point $\widehat{w}$ is actually at a strict minimum, in which case the Hessian must be symmetric positive definite. Then
$$ \nabla E = Hw + b $$
so if we knew $H$ and $b$, we could get the minimum by solving the linear system $Hw=-b$. This means finding the $O(W^2)$ numbers within $H$ and $b$. Can we do this via measurements of $E$ (i.e. running forward passes with different $w$)? Sure. This corresponds to collecting samples $(w_1,E_1),\ldots,(w_n,E_n)$ and then solving
$$ H^*,b^*,c^* = \operatorname{argmin}_{H,b,c} \sum_i \left(E_i - c - w_i^T b - \tfrac{1}{2} w_i^T H w_i\right)^2 $$
which looks like a non-linear least squares problem, but is actually a polynomial (quadratic) least squares regression problem: $b$ and $H$ are just the coefficients of a second-order polynomial once you write out the matrix multiplications, so the model is linear in the unknowns. See here for an example. We can therefore solve it by ordinary least squares, minimizing $\|\widetilde{E} - X\beta\|^2$, where $\widetilde{E}$ is the vector of measured errors, $\beta$ holds the unfolded parameters ($b$, $H$, $c$), and $X$ is a design matrix computed from the sampled $w$'s. See this question. Ultimately, the solution comes from solving the linear system $\widetilde{E} = X\beta$, or
the normal equations $X^T X\beta = X^T \widetilde{E}$. You cannot expect a reasonable solution without at least $|\beta|$ rows in $X$; in fact, with noiseless samples you need exactly that many (otherwise the system is under-determined). But $|\beta| \in O(W^2)$. Hence you need at least $O(W^2)$ samples to estimate $H$ and $b$, after which you can easily solve for the minimizer.



In other words, if we want to get to the minimum of $E$, there isn't an obvious way to do so without having at least $O(W^2)$ pieces of information.
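As a sanity check on the regression argument, here is a minimal numpy sketch (my own construction, not from the book: `H_true`, `b_true`, `features`, and the noiseless sampling are all assumptions) that recovers a hidden quadratic's coefficients from exactly $|\beta|$ function values and then solves $Hw=-b$ for the minimizer:

    # Hedged sketch: fit E(w) = c + w.b + 0.5 w.H.w by ordinary least squares
    # from noiseless function values, then locate the minimum via H w = -b.
    import numpy as np

    rng = np.random.default_rng(0)
    W = 5                                        # toy dimension
    A = rng.standard_normal((W, W))
    H_true = A @ A.T + W * np.eye(W)             # symmetric positive definite
    b_true = rng.standard_normal(W)
    c_true = 1.7

    def E(w):
        return c_true + w @ b_true + 0.5 * w @ H_true @ w

    iu = np.triu_indices(W)

    def features(w):
        # [1, w_i, w_i*w_j (i<=j)] -- the model is linear in (c, b, H).
        return np.concatenate(([1.0], w, np.outer(w, w)[iu]))

    n_params = 1 + W + W * (W + 1) // 2          # |beta|, which is O(W^2)
    samples = rng.standard_normal((n_params, W))
    X = np.stack([features(w) for w in samples])
    y = np.array([E(w) for w in samples])

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Unfold beta back into (c, b, H); diagonal quadratic features carry 0.5*H_ii.
    c_hat, b_hat, quad = beta[0], beta[1:1 + W], beta[1 + W:]
    Hu = np.zeros((W, W)); Hu[iu] = quad
    H_hat = Hu + Hu.T                            # doubles the diagonal: 2*(0.5*H_ii) = H_ii

    w_star = np.linalg.solve(H_hat, -b_hat)      # minimizer of the fitted quadratic
    w_true = np.linalg.solve(H_true, -b_true)
    print("samples used:", n_params)             # grows like O(W^2)
    print("minimizer error:", np.linalg.norm(w_star - w_true))

With $W=5$ this needs only $21$ samples, but the same construction at $W=10^6$ would need on the order of $5\times 10^{11}$ samples, which is the practical content of the $O(W^2)$-evaluations claim.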




Incidentally, I'm not a fan of explaining it this way. I think it's more insightful to consider that if an objective evaluation $L(x)$ (forward pass) costs $cW$ in runtime, then evaluating $\nabla L(x)$ via back-propagation costs only $rW$ in practice, where $W \gg r > c$, with $r$ depending on your model (backward pass). But $L(x)$ is only a single number, while $\nabla L$ is $W$ numbers. So we clearly get much more information per evaluation of $\nabla L$ than of $L$: $W$ times more information, while computing $\nabla L$ costs far less than $W$ times the cost of $L$. Notice also that a naive search (as in finite-difference derivative approximations) has to compute $L$ at least $O(W)$ times just to decide how to perturb each weight, costing $O(W^2)$ for a single step, whereas backpropagation lets us take one step in only $O(W)$ time. Note that this requires being able to compute $\nabla L$ efficiently via automatic differentiation, which usually makes the backward pass cost only a small constant multiple of the forward pass.
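To illustrate the per-step cost gap, here is a small sketch (the toy quadratic loss and the evaluation counter are my own assumptions; the analytic gradient merely stands in for a backprop pass): a forward-difference gradient needs $W+1$ function evaluations per step, while the analytic gradient needs none.

    # Hedged sketch: evaluations needed for one gradient, finite differences
    # vs. an analytic gradient (a stand-in for one backprop pass).
    import numpy as np

    W = 200
    H = np.diag(np.linspace(1.0, 10.0, W))   # toy SPD Hessian
    b = np.ones(W)
    n_evals = 0

    def L(w):
        global n_evals
        n_evals += 1
        return 0.5 * w @ H @ w + b @ w

    def grad_finite_difference(w, eps=1e-6):
        base = L(w)                           # 1 evaluation
        g = np.empty(W)
        for i in range(W):                    # + W evaluations
            e = np.zeros(W); e[i] = eps
            g[i] = (L(w + e) - base) / eps
        return g

    def grad_analytic(w):
        return H @ w + b                      # O(W) work here, no calls to L()

    w = np.zeros(W)
    n_evals = 0; grad_finite_difference(w)
    print("finite differences:", n_evals, "evaluations of L")   # W + 1
    n_evals = 0; grad_analytic(w)
    print("analytic gradient: ", n_evals, "evaluations of L")   # 0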




Ok, finally you asked




I would also appreciate any literature which explains the number of steps needed for zero, first, second order optimization.




The optimization order refers to the highest order of derivatives the method uses.
In general, the higher the order, the more powerful the method. For instance, a second-order method uses the Hessian (or, e.g., the natural gradient), giving $O(W^2)$ pieces of information per step. Why don't we use these for deep learning? The main reason is that computing and storing the Hessian is too expensive in terms of memory.
Another reason is the stochastic nature of the error signal (we cannot compute $E$ exactly, only a random estimate of it), which makes the derivative estimates noisy; the Hessian estimate is even noisier than the gradient estimate, since derivative operators amplify noise in signals.
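A back-of-the-envelope sketch of the memory point (the parameter count and float32 storage are assumptions chosen only for illustration):

    # Hedged sketch: storage for a dense Hessian vs. a gradient, in float32.
    W = 10_000_000                      # e.g. a modest 10M-parameter network
    bytes_per_float = 4                 # float32
    print(f"gradient: {W * bytes_per_float / 1e9:.2f} GB")        # ~0.04 GB
    print(f"hessian:  {W * W * bytes_per_float / 1e12:.0f} TB")   # ~400 TB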



Regarding the number of steps needed, don't be confused by this toy example. For the full error surface defined by a non-linear, non-convex function (like a neural network), there is no general way to predict the number of steps needed in advance.
Otherwise we would know ahead of time how long to train our networks for. :)
In some simple (e.g. convex) cases one can establish convergence theorems, but that is a topic for optimization theory.






        answered Mar 30 at 19:37









user3658307
