


Confused about Nesterov momentum gradient descent algorithm


I've found several different ways of writing Nesterov momentum, but I cannot understand why they cannot simply be expanded into a one-liner.

Here is one I found that can just be rearranged. Can someone explain why I am wrong?



$$\begin{aligned}
\theta_t &= y_t - \gamma \nabla f(y_t) \\
y_{t+1} &= \theta_t + \rho (\theta_t - \theta_{t-1})
\end{aligned}$$



Plugging the first equation into the second gives

$$y_{t+1} = y_t - \gamma \nabla f(y_t) + \rho (\theta_t - \theta_{t-1})$$



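Let $\Delta y_t = y_{t+1} - y_t$. Substituting the first equation for $\theta_t$ and $\theta_{t-1}$, it simply becomes

$$\begin{aligned}
\Delta y_t &= - \gamma \nabla f(y_t) + \rho \big(y_t - \gamma \nabla f(y_t) - y_{t-1} + \gamma \nabla f(y_{t-1})\big) \\
&= - \gamma \nabla f(y_t) + \rho \big(\Delta y_{t-1} + \gamma (\nabla f(y_{t-1}) - \nabla f(y_t))\big)
\end{aligned}$$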



So what am I doing wrong?
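
As a sanity check on the algebra above, the following is a minimal sketch (not taken from any reference) that runs the two-step scheme and the one-line $\Delta y_t$ recursion side by side. The test function $f(x) = x^4/4 + x^2$, the values of $\gamma$ and $\rho$, and the initialisation $\theta_{-1} = y_0$ (equivalently $\Delta y_{-1} = 0$ with a zero previous gradient) are assumptions chosen purely for illustration.

```python
def grad_f(x):
    # Assumed test function f(x) = x**4 / 4 + x**2, so f'(x) = x**3 + 2*x.
    return x**3 + 2.0 * x


def two_step(y0, gamma, rho, steps):
    """theta_t = y_t - gamma*f'(y_t);  y_{t+1} = theta_t + rho*(theta_t - theta_{t-1})."""
    y, theta_prev = y0, y0  # assumed initialisation: theta_{-1} = y_0
    ys = [y]
    for _ in range(steps):
        theta = y - gamma * grad_f(y)
        y = theta + rho * (theta - theta_prev)
        theta_prev = theta
        ys.append(y)
    return ys


def one_line(y0, gamma, rho, steps):
    """Delta y_t = -gamma*g_t + rho*(Delta y_{t-1} + gamma*(g_{t-1} - g_t))."""
    y, dy_prev, g_prev = y0, 0.0, 0.0  # consistent with theta_{-1} = y_0
    ys = [y]
    for _ in range(steps):
        g = grad_f(y)
        dy = -gamma * g + rho * (dy_prev + gamma * (g_prev - g))
        y += dy
        dy_prev, g_prev = dy, g
        ys.append(y)
    return ys


ys_a = two_step(1.0, gamma=0.05, rho=0.9, steps=50)
ys_b = one_line(1.0, gamma=0.05, rho=0.9, steps=50)
max_gap = max(abs(a - b) for a, b in zip(ys_a, ys_b))
print(f"largest difference between the two forms: {max_gap:.2e}")
```

Under these assumptions the two sequences should agree up to floating-point rounding, which is consistent with the rearrangement being a valid rewrite of the same iteration rather than a different algorithm.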



In fact, I've found a similar form in a paper: Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.



where gradient descent with momentum is defined as



$$\Delta \theta_t = - \gamma \nabla f(\theta) + \rho \Delta \theta_{t-1}$$ (I'm also not sure why it's $f(\theta)$ and not $f(\theta_t)$.)
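
To make the comparison with the classical momentum update concrete, here is a hedged sketch that writes both updates in the same $\Delta$ form. In this notation the only difference is the extra term $\rho\gamma(\nabla f(y_{t-1}) - \nabla f(y_t))$ from the rearranged Nesterov recursion. The function name `run`, the `correction` flag, the test function, and the zero initial step and gradient are all illustrative assumptions, not taken from Qian's paper.

```python
def grad_f(x):
    # Assumed test function f(x) = x**4 / 4 + x**2, so f'(x) = x**3 + 2*x.
    return x**3 + 2.0 * x


def run(x0, gamma, rho, steps, correction):
    """Iterate Delta x_t = -gamma*g_t + rho*Delta x_{t-1} [+ rho*gamma*(g_{t-1} - g_t)].

    correction=False: classical momentum, gradient taken at the current iterate.
    correction=True:  the rearranged Nesterov form with the gradient-difference term.
    """
    x, dx_prev, g_prev = x0, 0.0, 0.0  # assumed start: no previous step or gradient
    xs = [x]
    for _ in range(steps):
        g = grad_f(x)
        dx = -gamma * g + rho * dx_prev
        if correction:
            dx += rho * gamma * (g_prev - g)
        x += dx
        dx_prev, g_prev = dx, g
        xs.append(x)
    return xs


classical = run(1.0, gamma=0.05, rho=0.9, steps=50, correction=False)
nesterov = run(1.0, gamma=0.05, rho=0.9, steps=50, correction=True)
print("first steps, classical:", [round(v, 4) for v in classical[:4]])
print("first steps, nesterov: ", [round(v, 4) for v in nesterov[:4]])
```

With these settings the two sequences differ, which suggests that the gradient-difference term is what separates the two updates when both are written as a recursion on $\Delta$.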










  • Why do you think it’s wrong to write it in one line? – David M., Mar 17 at 16:24










  • What about the update for the momentum term in the one-liner? – user3658307, 2 days ago















optimization numerical-optimization gradient-descent














asked Mar 17 at 12:44 by Alexis Drakopoulos (edited Mar 17 at 12:49)



























