Floating point representation in 8 bit





A computer has 8 bits of memory for floating point representation.



The first bit is assigned to the sign, the next four bits to the exponent, and the last three to the mantissa.



The computer has no representation for $\infty$, and $0$ is represented as in IEEE 754. Assume that the mantissa starts with $\text{base}^{-1}$ and that to the left of the mantissa there is an implied 1 that does not consume a place value.



  1. What is the smallest positive number that can be represented?


  2. What is the machine epsilon?


  3. How many numbers in base 10 can be represented?




  1. In general the number of exponent values is $2^{\text{bits}}-2$, so in this case we have $2^4-2=14$; the exponent therefore ranges from $-6$ to $7$, and the smallest positive number is $1.001 \times 2^{-6}=2^{-6}+2^{-9}=0.017578125$.


  2. To find machine epsilon we take $\text{base}^{-(p-1)}$ where $p$ is the number of significant bits in the mantissa, which is $2^{-(3-1)}=2^{-2}=0.25$.


How should I approach 3, and are my solutions to 1 and 2 correct?










numerical-methods floating-point

asked Dec 27 '16 at 16:33 by gbox; edited Mar 14 at 7:52 by Winfield Chen

  • Regarding part 3, there are 8 bits available, so there are 256 possible bit patterns. If all of these represent distinct numbers, then 256 is the answer. So the question boils down to whether there are any cases where two distinct bit patterns represent the same number. Can this happen? Offhand I would think the only possible candidate would be 0 (with the sign bit set or not set), but I don't know the details of IEEE754 representation. I'm not sure what base 10 has to do with this.
    – Bungo, Mar 5 '17 at 18:30











  • The quoted question is unfortunately incompletely specified. Since some of the encoding is "like IEEE-754," I would guess the exponent is probably meant to use excess-7 encoding, but could it be excess-6 or excess-8? $2^{\text{bits}} - 2$ is the number of possible exponents when we have infinities, NaN, and denormals; without infinities and NaN, another exponent is possible, and if 00000001 is treated as a normalized positive number it has yet another exponent.
    – David K, Feb 12 '18 at 2:25
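One way to make these comments concrete is simply to enumerate all 256 bit patterns. The small Python sketch below is only one possible reading of the format: it assumes an excess-7 exponent bias, the implied leading 1 from the question, an IEEE-style zero when the exponent field is all zeros, no subnormals, and it leaves the all-ones exponent field unused; none of these choices is pinned down by the question itself, and the `decode` helper is ad hoc.

    # Enumerate an assumed 8-bit format: 1 sign bit, 4 exponent bits (excess-7),
    # 3 mantissa bits with an implied leading 1, IEEE-style zero, no subnormals.
    def decode(bits):
        """Decode an 8-bit pattern (int in 0..255) into a float, or None if unused."""
        sign = -1.0 if (bits >> 7) & 1 else 1.0
        E = (bits >> 3) & 0b1111          # 4-bit exponent field
        frac = bits & 0b111               # 3-bit mantissa field
        if E == 0:
            return 0.0 if frac == 0 else None   # only +/-0 here; no subnormals assumed
        if E == 15:
            return None                   # left unused; another reading could make this e = +8
        mantissa = 1.0 + frac / 8.0       # implied leading 1; fraction bits worth 2**-1 .. 2**-3
        return sign * mantissa * 2.0 ** (E - 7)   # e = E - 7 ranges over -6 .. 7

    values = {v for b in range(256) if (v := decode(b)) is not None}
    positives = sorted(v for v in values if v > 0)

    print("smallest positive:", positives[0])    # 2**-6 = 0.015625
    print("largest:", positives[-1])             # 1.875 * 2**7 = 240.0
    print("gap just above 1:", positives[positives.index(1.0) + 1] - 1.0)   # 0.125
    print("distinct reals:", len(values))        # 224 nonzero values plus zero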















1 Answer

Due to the finite precision of the computer, numbers used in calculations must conform to the format imposed by the machine, so only real numbers with a finite number of digits can be represented. A normalized floating point system $\mathbb{F}=F(\beta,p,e_{\text{min}},e_{\text{max}})$ consists of a set of real numbers written in normalized floating point form $x=\pm m \times \beta^{e}$, where $m$ is the mantissa of $x$ and $e$ is the exponent.

If $x \neq 0$ then the mantissa $m$ can be written as
\begin{equation}
m = a_N + a_{N-1}\beta^{-1}+\dots+a_{-p}\beta^{-p-N}
\end{equation}
with $a_N \neq 0$ and $e_{\text{min}} \leq e \leq e_{\text{max}}$. If $x=0$ then the mantissa is $m=0$ while the exponent $e$ can take any value.

In the above expressions, $p$ is the precision of the system, $\beta$ the base, and $[e_{\text{min}},e_{\text{max}}]$ the exponent range, with $e_{\text{min}}<0$ and $e_{\text{max}}=|e_{\text{min}}|+1$.

According to this definition the mantissa $m$ belongs to the range $[1,\beta)$. The machine epsilon is $\beta^{1-p}$ and represents the difference between the mantissae of two successive positive numbers. A nonzero number $x$ in $\mathbb{F}$ has magnitude in the range $[x_{\text{min}}, x_{\text{max}}]$, where
\begin{equation}
x_{\text{min}} = \beta^{e_{\text{min}}}
\end{equation}
and
\begin{equation}
x_{\text{max}} = (\beta-1)\left(1+\beta^{-1}+\beta^{-2}+\dots+\beta^{-(p-1)}\right)\beta^{e_{\text{max}}} < \beta^{e_{\text{max}}+1}.
\end{equation}

We now prove the statement above. The general representation of $x \in \mathbb{R}$ in base $\beta$ is
\begin{equation}
x=\pm \left(a_N \beta^{N}+a_{N-1}\beta^{N-1}+\dots+a_1 \beta+a_0+a_{-1}\beta^{-1}+\dots+a_{-p}\beta^{-p}\right) = \pm m \times \beta^{e}.
\end{equation}

When we collect the factor $\beta^{N}$ we have
\begin{equation}
x=\pm \left(a_N +a_{N-1}\beta^{-1}+\dots+a_1 \beta^{-N+1}+a_0 \beta^{-N}+a_{-1}\beta^{-1-N}+\dots+a_{-p}\beta^{-p-N}\right) \times \beta^{N} = \pm m \times \beta^{e}.
\end{equation}

We can therefore identify $N$ with $e$ ($N=e$), and
\begin{equation}
m=\sum_{i=-p}^{N} a_i\, \beta^{i-N}.
\end{equation}

The minimum value of $m$ is reached when the leading digit $a_N=1$ and all the other digits are $0$; in this case $m=1$ and $x_{\text{min}} = \beta^{e_{\text{min}}}$. The maximum value of $m$ is obtained when every digit equals $\beta-1$, which gives the expression for $x_{\text{max}}$ above.
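As a concrete illustration with $\beta=10$ and $p=5$: the number $x=123.45$ has the expansion $1\cdot 10^{2}+2\cdot 10^{1}+3\cdot 10^{0}+4\cdot 10^{-1}+5\cdot 10^{-2}$; collecting $10^{2}$ gives $x=1.2345\times 10^{2}$, so $m=1.2345\in[1,10)$ and $e=N=2$.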



The machine epsilon is defined as $\epsilon_M=\beta^{1-p}$. It is a measure of the precision of the system, since it is an upper bound on the relative distance between two consecutive numbers; it also equals the difference between the mantissae of two successive positive numbers. In a normalized floating point system, only numbers that fit this finite format can be represented.



The total number of elements in $\mathbb{F}$ is given by
\begin{equation}
2 (\beta-1)\, \beta^{p-1} (e_{\text{max}}-e_{\text{min}}+1)+2.
\end{equation}
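(Applied to the 8-bit system in the question, and assuming the implied leading 1 makes the precision $p=4$, so that $\mathbb{F}=F(2,4,-6,7)$: $x_{\text{min}}=2^{-6}=0.015625$, $x_{\text{max}}=1.875\cdot 2^{7}=240$, $\epsilon_M=2^{1-4}=0.125$, and the element count is $2\cdot 1\cdot 2^{3}\cdot 14+2=226$.)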

Computers can work in single or double precision. IEEE standard single-precision floating point numbers belong to the normalized floating point system $F(2, 24, -126, +127)$, while IEEE standard double-precision floating point numbers belong to the normalized floating point system $F(2, 53, -1022, +1023)$.
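For what it is worth, the last two formulas can be sanity-checked numerically against NumPy's float parameters; the helper names `eps_machine` and `count` below are ad hoc, and applying them to the question's 8-bit format again assumes $p=4$.

    # Check eps_M = beta^(1-p) and the element-count formula against NumPy's
    # published float32/float64 parameters (and the assumed 8-bit system).
    import numpy as np

    def eps_machine(beta, p):
        return beta ** (1 - p)

    def count(beta, p, emin, emax):
        return 2 * (beta - 1) * beta ** (p - 1) * (emax - emin + 1) + 2

    print(eps_machine(2, 24), np.finfo(np.float32).eps)   # both equal 2**-23
    print(eps_machine(2, 53), np.finfo(np.float64).eps)   # both equal 2**-52
    print(count(2, 4, -6, 7))        # 226 elements for the 8-bit system (if p = 4)
    print(count(2, 24, -126, 127))   # element count for IEEE single precision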






answered Mar 4 '17 at 20:37 by Upax; edited Mar 14 at 8:01 by Winfield Chen


























