Floating point representation in 8 bit
A computer has 8 bits of memory for floating point representation.
The first bit is assigned to the sign, the next four bits to the exponent, and the last three to the mantissa.
The computer has no representation for $\infty$, and $0$ is represented as in IEEE 754. Assume that the mantissa starts at the $\text{base}^{-1}$ place and that to its left there is an implied $1$ that does not consume a place value.

1. What is the smallest positive number that can be represented?
2. What is the machine epsilon?
3. How many numbers in base 10 can be represented?

In general the number of available exponents is $2^{\text{bits}}-2$, so in this case we have $2^4-2=14$ and the exponent ranges from $-6$ to $7$. The smallest positive number is therefore $1.001_2 \times 2^{-6}=2^{-6}+2^{-9}=0.017578125$.

To find the machine epsilon we take $\text{base}^{-(p-1)}$, where $p$ is the number of significant bits in the mantissa, which gives $2^{-(3-1)}=2^{-2}=0.25$.
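To make the check concrete, here is a minimal Python sketch (a brute-force enumeration of all 256 bit patterns) under the same assumptions as above (excess-7 exponent bias, exponent fields 1 through 14 for normalized numbers, field 0 reserved for zero as in IEEE 754, field 15 unused); the exponent-encoding assumptions are the part I am least sure of:

```python
# Brute-force sketch: enumerate every 8-bit pattern under one reading of the format
# (1 sign bit | 4-bit exponent, excess-7 | 3 stored mantissa bits, implied leading 1).
# Assumed, not given: exponent fields 1..14 encode normalized numbers (exponents -6..7),
# field 0 with a zero mantissa encodes zero (no denormals), and field 15 is unused.

values = set()
for bits in range(256):
    sign = -1.0 if bits >> 7 else 1.0
    efield = (bits >> 3) & 0b1111          # 4-bit exponent field
    mfield = bits & 0b111                  # 3 stored mantissa bits
    if efield == 0:
        if mfield == 0:
            values.add(0.0)                # +0 and -0 collapse to a single value here
        continue
    if efield == 15:
        continue                           # unused under this reading
    mantissa = 1 + mfield / 8              # implied leading 1: mantissa in [1, 2)
    values.add(sign * mantissa * 2.0 ** (efield - 7))

positives = sorted(v for v in values if v > 0)
print(positives[0])                                  # smallest positive number
print(positives[positives.index(1.0) + 1] - 1.0)     # gap just above 1 (machine epsilon?)
print(len(values))                                   # how many distinct numbers (part 3)
```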
How should I approach 3, and are my solutions to 1 and 2 correct?
numerical-methods floating-point
asked Dec 27 '16 at 16:33 by gbox, edited Mar 14 at 7:52 by Winfield Chen
Regarding part 3, there are 8 bits available, so there are 256 possible bit patterns. If all of these represent distinct numbers, then 256 is the answer. So the question boils down to whether there are any cases where two distinct bit patterns represent the same number. Can this happen? Offhand I would think the only possible candidate would be 0 (with the sign bit set or not set), but I don't know the details of IEEE754 representation. I'm not sure what base 10 has to do with this.
– Bungo, Mar 5 '17 at 18:30
The quoted question is unfortunately incompletely specified. Since some of the encoding is "like IEEE-754," I would guess the exponent is probably meant to use excess-7 encoding, but could it be excess-6 or excess-8? $2^{\text{bits}} - 2$ is the number of possible exponents when we have infinities, NaN, and denormals; without infinities and NaN, another exponent is possible, and if 00000001 is treated as a normalized positive number it has yet another exponent.
– David K, Feb 12 '18 at 2:25
1 Answer
Due to the finite precision of the computer, numbers used in calculations must conform to the format imposed by the machine, so only real numbers with a finite number of digits can be represented. A normalized floating point system $\mathbb{F}=F(\beta,p,e_{\text{min}},e_{\text{max}})$ consists of a set of real numbers written in normalized floating point form $x=\pm\, m \times \beta^e$, where $m$ is the mantissa of $x$ and $e$ is the exponent.
If $x \neq 0$ then the mantissa $m$ can be written as
\begin{equation}
m = a_N + a_{N-1}\,\beta^{-1} + \dots + a_{-p}\,\beta^{-p-N}
\end{equation}
with $a_N \neq 0$ and $e_{\text{min}} \leq e \leq e_{\text{max}}$. If $x=0$ then the mantissa is $m=0$ while the exponent $e$ can take any value.
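For instance, in base $\beta=10$ with precision $p=4$, the number $2718$ is written as $2.718 \times 10^{3}$, so the mantissa is $m=2.718$ (leading digit nonzero) and the exponent is $e=3$.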
In the above expressions, $p$ is the precision of the system, $\beta$ the base, and $[e_{\text{min}},e_{\text{max}}]$ the exponent range, with $e_{\text{min}}<0$ and $e_{\text{max}}=|e_{\text{min}}|+1$.
According to the definition, the mantissa $m$ belongs to the range $[1,\beta)$. The machine epsilon is $\beta^{1-p}$ and represents the difference between the mantissae of two successive positive numbers.
Now a positive number $x$ in the system belongs to the range $[x_{\text{min}}, x_{\text{max}}]$, where
\begin{equation}
x_{\text{min}} = \beta^{e_{\text{min}}}
\end{equation}
and
\begin{equation}
x_{\text{max}} = (\beta-1)\left(1+\beta^{-1}+\beta^{-2}+\dots+\beta^{-(p-1)}\right)\beta^{e_{\text{max}}} < \beta^{e_{\text{max}}+1}.
\end{equation}
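For instance, under one reading of the 8-bit system in the question ($\beta=2$, $p=4$ counting the implied leading bit, $e_{\text{min}}=-6$, $e_{\text{max}}=7$), these bounds evaluate to
\begin{equation}
x_{\text{min}} = 2^{-6} = 0.015625, \qquad x_{\text{max}} = (2-1)\left(1+2^{-1}+2^{-2}+2^{-3}\right)2^{7} = 1.875 \times 128 = 240 < 2^{8}.
\end{equation}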
We now prove the statement above. The general representation of $x \in \mathbb{R}$ in base $\beta$ is
\begin{equation}
x=\pm\left(a_N\,\beta^N + a_{N-1}\,\beta^{N-1} + \dots + a_1\,\beta + a_0 + a_{-1}\,\beta^{-1} + \dots + a_{-p}\,\beta^{-p}\right) = \pm\, m \times \beta^e.
\end{equation}
When we collect the factor $\beta^N$ we have
\begin{equation}
x=\pm\left(a_N + a_{N-1}\,\beta^{-1} + \dots + a_1\,\beta^{-N+1} + a_0\,\beta^{-N} + a_{-1}\,\beta^{-1-N} + \dots + a_{-p}\,\beta^{-p-N}\right)\times\beta^N = \pm\, m \times \beta^e.
\end{equation}
We can identify $N$ with $e$ ($N=e$). Then
\begin{equation}
m=\sum_{i=-p}^{N} a_i\,\beta^{i-N}.
\end{equation}
The minimum value of $m$ is reached when the leading digit $a_N=1$ and all the remaining digits are $0$; in this case $m=1$ and $x_{\text{min}} = \beta^{e_{\text{min}}}$.
The maximum value of $m$ is obtained when every digit equals $\beta-1$, which gives the expression for $x_{\text{max}}$ above.
The machine epsilon is defined as $\epsilon_M=\beta^{1-p}$. It is a measure of the precision of the system, since it is an upper bound on the relative distance between two consecutive numbers. It also represents the difference between the mantissae of two successive positive numbers. In a normalized floating point system, any real number that does not fit this finite format cannot be represented exactly.
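Applied to the 8-bit system in the question, this gives $\epsilon_M = 2^{1-4} = 2^{-3} = 0.125$ if the implied leading bit is counted in the precision ($p=4$), or $2^{1-3} = 2^{-2} = 0.25$ if only the three stored bits are counted ($p=3$); which value is intended depends on how the exercise defines the precision.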
The total number of elements in $\mathbb{F}$ is given by the following expression:
\begin{equation}
2\,(\beta-1)\,\beta^{p-1}\,(e_{\text{max}}-e_{\text{min}}+1)+2.
\end{equation}
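For the 8-bit system in the question, taking $\beta=2$, $p=4$ and the assumed exponent range $-6 \leq e \leq 7$, this evaluates to $2 \cdot 1 \cdot 2^{3} \cdot 14 + 2 = 226$; presumably the trailing $+2$ counts $+0$ and $-0$ separately, so if the two signed zeros are regarded as one number the count is $225$. This is one way to attack part 3.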
Computers can work with single- or double-precision. IEEE standard single-precision floating point numbers belong to the normalized floating point system $F(2, 24, −126, +127)$, while IEEE standard double-precision floating point numbers belong to the normalized floating point system $F(2, 53, −1022, +1023)$.
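As a quick cross-check of the double-precision parameters, here is a small Python sketch comparing $F(2, 53, -1022, +1023)$ against what the runtime reports for its IEEE 754 binary64 floats:

```python
import sys

# Sketch: compare F(2, 53, -1022, +1023) with the host's IEEE 754 binary64 floats.
print(sys.float_info.mant_dig)                               # 53 -> the precision p
print(sys.float_info.epsilon == 2.0 ** (1 - 53))             # True: eps_M = beta**(1-p) = 2**-52
print(sys.float_info.min == 2.0 ** -1022)                    # True: x_min = beta**e_min
print(sys.float_info.max == (2 - 2.0 ** -52) * 2.0 ** 1023)  # True: largest finite value
```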
answered Mar 4 '17 at 20:37 by Upax, edited Mar 14 at 8:01 by Winfield Chen