Directional Derivatives and Jacobian of a Linear Neural Network





I have to compute the following mixed second derivative:



$$\partial_{x_i} \nabla_W \sigma(f(W,x))$$



where $W = (W_1, W_2, \dots, W_L)$ is the set of weight matrices, $f(W,x)$ is a linear neural network, hence $f(W,x) = xW_1W_2 \dots W_L$, the value $x_i$ is the $i$-th entry of the vector $x$, and $\sigma$ is the softmax activation function,



$$\sigma(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}.$$



I know that $\frac{\partial \sigma}{\partial x}$ is the matrix $J_{i,j}(x) = \sigma(x)_i(\delta_{i,j} - \sigma(x)_j)$, hence



$$\nabla_W \sigma(f(W,x)) = J(f(W,x))\cdot x.$$



First of all, is it right?
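At least the softmax Jacobian formula above I can sanity-check numerically. Here is a minimal NumPy sketch I use for that (the helper names `softmax` and `softmax_jacobian` are my own, not from any library); it compares $J$ against central finite differences:

    import numpy as np

    def softmax(v):
        # sigma(v)_i = exp(v_i) / sum_j exp(v_j), shifted for numerical stability
        e = np.exp(v - v.max())
        return e / e.sum()

    def softmax_jacobian(v):
        # J_{i,j}(v) = sigma(v)_i * (delta_{i,j} - sigma(v)_j)
        s = softmax(v)
        return np.diag(s) - np.outer(s, s)

    # central finite differences as a reference
    rng = np.random.default_rng(0)
    v = rng.normal(size=5)
    eps = 1e-6
    J_fd = np.stack([(softmax(v + eps * e) - softmax(v - eps * e)) / (2 * eps)
                     for e in np.eye(5)], axis=1)
    print(np.max(np.abs(softmax_jacobian(v) - J_fd)))  # tiny if the formula is right

(Of course this only checks the Jacobian of the softmax itself, not the $\nabla_W$ expression.)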



So, now, how can I compute



$$\partial_{x_i}\left(J(f(W,x))\cdot x\right)\,?$$



Let's write $z = f(W,x)$ to ease notation. My guess would be to compute the matrix $A = \partial_{x_i} J(z)$ such that



$$A_{i,j} = \sigma(z)_i\,(1-\sigma(z)_i)(1-2\sigma(z)_i) \quad \text{if } i=j,$$



$$A_{i,j} = \sigma(z)_i\,\sigma(z)_j\,(\sigma(z)_i + \sigma(z)_j) \quad \text{if } i\neq j,$$



where $A$ is the second derivative of the softmax function.



Hence, $$\partial_{x_i}\left(J(z)\cdot x\right) = J(z)\cdot x^{(i)} + A\cdot \partial_{x_i}z \cdot x.$$
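Whatever the correct closed form turns out to be, I can at least generate a reference value to compare it against, by finite-differencing the map $x \mapsto J(f(W,x))\cdot x$ directly. A rough NumPy sketch (the variable names, and the assumption of square weight matrices so that $J(z)\cdot x$ is defined, are mine):

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    def softmax_jacobian(v):
        # J_{i,j}(v) = sigma(v)_i * (delta_{i,j} - sigma(v)_j)
        s = softmax(v)
        return np.diag(s) - np.outer(s, s)

    def g(x, M):
        # g(x) = J(f(W,x)) @ x, with the linear network collapsed to M = W_1 ... W_L
        return softmax_jacobian(x @ M) @ x

    rng = np.random.default_rng(1)
    n, L = 4, 3
    Ws = [rng.normal(size=(n, n)) for _ in range(L)]   # square so the shapes match
    M = np.linalg.multi_dot(Ws)
    x = rng.normal(size=n)

    i, eps = 0, 1e-6
    e_i = np.eye(n)[i]
    # reference value for d/dx_i ( J(f(W,x)) . x ), to test any closed-form guess against
    print((g(x + eps * e_i, M) - g(x - eps * e_i, M)) / (2 * eps))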



I don't know whether this is right, and I also don't know how to proceed. Does anyone have suggestions?

Thank you very much!










Tags: derivatives, neural-networks






asked Mar 23 at 15:17 by Alfred (edited Mar 26 at 21:44)











  • A linear neural network doesn't make much sense, as all the layers become equivalent to a single layer (with $\left( \prod_{l=1}^{L} W_l \right)$ as the equivalent weight). ReLU is usually employed to introduce non-linearity at every layer.
    – Balakrishnan Rajan, Mar 24 at 19:59











  • Yes, I know this. My question is part of a bigger problem I'm facing, but I'm having trouble with some computations like this one.
    – Alfred, Mar 24 at 21:19










  • Umm. What is $\delta_{i,j}$? Why are you computing $\frac{\partial \sigma}{\partial x}$? $\nabla_W \sigma(f(W,x))$ is akin to differentiating (partially?) $\sigma(f(W,x))$ with respect to $W$; I lost you there. Also, if you are trying to understand backpropagation, bear in mind that there is a separate loss function $L(y, \hat{y})$, which is actually minimized w.r.t. the weights, not the activation.
    – Balakrishnan Rajan, Mar 27 at 8:07










  • Perhaps you can add more info about the dimensionality of your tensors.
    – Balakrishnan Rajan, Mar 27 at 8:08















