Moving brute-force search to FPGA The 2019 Stack Overflow Developer Survey Results Are Infpga internal metastabilityVHDL Fpga debouncingPattern Recogniser on FPGATransmitting HDMI/DVI over an FPGA with no support for TMDSFPGA Test Equipmentfpga clock muxingswapping char VHDL FPGAFPGA input synchronisationReverse Search NTEBrute-force convolution reverb in FPGA
If the Wish spell is used to duplicate the effect of Simulacrum, are existing duplicates destroyed?
How can I fix this gap between bookcases I made?
On the insanity of kings as an argument against monarchy
Should I write numbers in words or as numerals when there are multiple next to each other?
How come people say “Would of”?
What is the steepest angle that a canal can be traversable without locks?
In microwave frequencies, do you use a circulator when you need a (near) perfect diode?
Should I use my personal or workplace e-mail when registering to external websites for work purpose?
Monty Hall variation
How was Skylab's orbit inclination chosen?
Lethal sonic weapons
How to reverse every other sublist of a list?
How are circuits which use complex ICs normally simulated?
Landlord wants to switch my lease to a "Land contract" to "get back at the city"
How to create dashed lines/arrows in Illustrator
What are the motivations for publishing new editions of an existing textbook, beyond new discoveries in a field?
CiviEvent: Public link for events of a specific type
How to manage monthly salary
Why do UK politicians seemingly ignore opinion polls on Brexit?
Limit to 0 ambiguity
Deadlock Graph and Interpretation, solution to avoid
Does a dangling wire really electrocute me if I'm standing in water?
How long do I have to send my income tax payment to the IRS?
Protecting Dualbooting Windows from dangerous code (like rm -rf)
Moving brute-force search to FPGA
The 2019 Stack Overflow Developer Survey Results Are Infpga internal metastabilityVHDL Fpga debouncingPattern Recogniser on FPGATransmitting HDMI/DVI over an FPGA with no support for TMDSFPGA Test Equipmentfpga clock muxingswapping char VHDL FPGAFPGA input synchronisationReverse Search NTEBrute-force convolution reverb in FPGA
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;
$begingroup$
I am currently working on a scientific hobby project about computing the error detection capabilities of CRCs. Unfortunately the C++ code used for such computations has up to years of run time on normal x64 CPUs, even on multi core systems. Also the power consumption of such systems is a pain.
It came to my mind that the common way of x64 brute-force-searching isn't the best. I would like to move the algorithm to an FPGA. Alas I have worked very little with FPGAs and I lost the minimal knowledge after working in C/C++ software engineering for decades. So I need a little help about the feasibility of my idea before burying myself into the technology.
The algorithm I want to run in hardware is a specialized ~1000 line C++ code that could easily be ported to C. No floating point operations. No standard libraries required. High frequent loops. Lots of basic 64 bit integer arithmetic. Even more binary operations (shift, or, xor, bit-counting, etc.) and some array operations. A few kB of RAM and ROM should be sufficient. No peripherals required. Very few memory allocations are used that could be removed by adapting the code. The computation results can be easily filtered internally so a serial interface should be enough to pass the results to a PC.
I would like to compile the C++ or C code into VHDL code and let it run on a FPGA as fast as possible. Also, since this is a hobby project, the FPGA (including software and a developer board) should be affordable.
My questions:
- Can I expect a significant speedup? By which order of magnitude?
- Is there a C/C++ compiler suited for the purpose?
- Which FPGAs are suitable?
fpga vhdl component-selection compiler
$endgroup$
|
show 12 more comments
$begingroup$
I am currently working on a scientific hobby project about computing the error detection capabilities of CRCs. Unfortunately the C++ code used for such computations has up to years of run time on normal x64 CPUs, even on multi core systems. Also the power consumption of such systems is a pain.
It came to my mind that the common way of x64 brute-force-searching isn't the best. I would like to move the algorithm to an FPGA. Alas I have worked very little with FPGAs and I lost the minimal knowledge after working in C/C++ software engineering for decades. So I need a little help about the feasibility of my idea before burying myself into the technology.
The algorithm I want to run in hardware is a specialized ~1000 line C++ code that could easily be ported to C. No floating point operations. No standard libraries required. High frequent loops. Lots of basic 64 bit integer arithmetic. Even more binary operations (shift, or, xor, bit-counting, etc.) and some array operations. A few kB of RAM and ROM should be sufficient. No peripherals required. Very few memory allocations are used that could be removed by adapting the code. The computation results can be easily filtered internally so a serial interface should be enough to pass the results to a PC.
I would like to compile the C++ or C code into VHDL code and let it run on a FPGA as fast as possible. Also, since this is a hobby project, the FPGA (including software and a developer board) should be affordable.
My questions:
- Can I expect a significant speedup? By which order of magnitude?
- Is there a C/C++ compiler suited for the purpose?
- Which FPGAs are suitable?
fpga vhdl component-selection compiler
$endgroup$
2
$begingroup$
Ultimately running an algorithm in an FPGA instead of software can be faster but it really depends on the algorithm details. Essentially you will gain speed if you can parallelize or pipeline the data flow. If 64 simple operations need to be applied to a single point before it can be fully processed, the FPGA can pipeline them so that a new result comes out every clock cycle. But I don't know if your algorithm is like that.
$endgroup$
– mkeith
Mar 22 at 23:39
$begingroup$
Almost the entire algorithm can be highly parallelized. The algorithm processes a single dataword. For a 32 bit CRC there are ~2^30 datawords á 32 bit that can easily be processed independently (except final comparing/filtering of the results that needs to be serialized).
$endgroup$
– Silicomancer
Mar 22 at 23:43
2
$begingroup$
Have you considered using GPU acceleration for this? Implementation will be a lot simpler, as well as less expensive.
$endgroup$
– duskwuff
Mar 22 at 23:56
$begingroup$
You can probably speed up your algorithm a lot if you take advantage of the linearity properties of CRC, i.e. that crc(a ^ b ^ c) = crc(a) ^ crc(b) ^ crc(c) for any odd number of equal-length messages
$endgroup$
– jpa
Mar 23 at 6:45
1
$begingroup$
Well, then, here is the code: users.ece.cmu.edu/~koopman/crc/hdlen.html, this is more or less the results I want to achieve: users.ece.cmu.edu/~koopman/crc/index.html and here is one interesting paper from the same author: users.ece.cmu.edu/~koopman/roses/dsn04/…
$endgroup$
– Silicomancer
Mar 23 at 20:42
|
show 12 more comments
$begingroup$
I am currently working on a scientific hobby project about computing the error detection capabilities of CRCs. Unfortunately the C++ code used for such computations has up to years of run time on normal x64 CPUs, even on multi core systems. Also the power consumption of such systems is a pain.
It came to my mind that the common way of x64 brute-force-searching isn't the best. I would like to move the algorithm to an FPGA. Alas I have worked very little with FPGAs and I lost the minimal knowledge after working in C/C++ software engineering for decades. So I need a little help about the feasibility of my idea before burying myself into the technology.
The algorithm I want to run in hardware is a specialized ~1000 line C++ code that could easily be ported to C. No floating point operations. No standard libraries required. High frequent loops. Lots of basic 64 bit integer arithmetic. Even more binary operations (shift, or, xor, bit-counting, etc.) and some array operations. A few kB of RAM and ROM should be sufficient. No peripherals required. Very few memory allocations are used that could be removed by adapting the code. The computation results can be easily filtered internally so a serial interface should be enough to pass the results to a PC.
I would like to compile the C++ or C code into VHDL code and let it run on a FPGA as fast as possible. Also, since this is a hobby project, the FPGA (including software and a developer board) should be affordable.
My questions:
- Can I expect a significant speedup? By which order of magnitude?
- Is there a C/C++ compiler suited for the purpose?
- Which FPGAs are suitable?
fpga vhdl component-selection compiler
$endgroup$
I am currently working on a scientific hobby project about computing the error detection capabilities of CRCs. Unfortunately the C++ code used for such computations has up to years of run time on normal x64 CPUs, even on multi core systems. Also the power consumption of such systems is a pain.
It came to my mind that the common way of x64 brute-force-searching isn't the best. I would like to move the algorithm to an FPGA. Alas I have worked very little with FPGAs and I lost the minimal knowledge after working in C/C++ software engineering for decades. So I need a little help about the feasibility of my idea before burying myself into the technology.
The algorithm I want to run in hardware is a specialized ~1000 line C++ code that could easily be ported to C. No floating point operations. No standard libraries required. High frequent loops. Lots of basic 64 bit integer arithmetic. Even more binary operations (shift, or, xor, bit-counting, etc.) and some array operations. A few kB of RAM and ROM should be sufficient. No peripherals required. Very few memory allocations are used that could be removed by adapting the code. The computation results can be easily filtered internally so a serial interface should be enough to pass the results to a PC.
I would like to compile the C++ or C code into VHDL code and let it run on a FPGA as fast as possible. Also, since this is a hobby project, the FPGA (including software and a developer board) should be affordable.
My questions:
- Can I expect a significant speedup? By which order of magnitude?
- Is there a C/C++ compiler suited for the purpose?
- Which FPGAs are suitable?
fpga vhdl component-selection compiler
fpga vhdl component-selection compiler
edited Mar 22 at 23:38
Silicomancer
asked Mar 22 at 23:27
SilicomancerSilicomancer
1236
1236
2
$begingroup$
Ultimately running an algorithm in an FPGA instead of software can be faster but it really depends on the algorithm details. Essentially you will gain speed if you can parallelize or pipeline the data flow. If 64 simple operations need to be applied to a single point before it can be fully processed, the FPGA can pipeline them so that a new result comes out every clock cycle. But I don't know if your algorithm is like that.
$endgroup$
– mkeith
Mar 22 at 23:39
$begingroup$
Almost the entire algorithm can be highly parallelized. The algorithm processes a single dataword. For a 32 bit CRC there are ~2^30 datawords á 32 bit that can easily be processed independently (except final comparing/filtering of the results that needs to be serialized).
$endgroup$
– Silicomancer
Mar 22 at 23:43
2
$begingroup$
Have you considered using GPU acceleration for this? Implementation will be a lot simpler, as well as less expensive.
$endgroup$
– duskwuff
Mar 22 at 23:56
$begingroup$
You can probably speed up your algorithm a lot if you take advantage of the linearity properties of CRC, i.e. that crc(a ^ b ^ c) = crc(a) ^ crc(b) ^ crc(c) for any odd number of equal-length messages
$endgroup$
– jpa
Mar 23 at 6:45
1
$begingroup$
Well, then, here is the code: users.ece.cmu.edu/~koopman/crc/hdlen.html, this is more or less the results I want to achieve: users.ece.cmu.edu/~koopman/crc/index.html and here is one interesting paper from the same author: users.ece.cmu.edu/~koopman/roses/dsn04/…
$endgroup$
– Silicomancer
Mar 23 at 20:42
|
show 12 more comments
2
$begingroup$
Ultimately running an algorithm in an FPGA instead of software can be faster but it really depends on the algorithm details. Essentially you will gain speed if you can parallelize or pipeline the data flow. If 64 simple operations need to be applied to a single point before it can be fully processed, the FPGA can pipeline them so that a new result comes out every clock cycle. But I don't know if your algorithm is like that.
$endgroup$
– mkeith
Mar 22 at 23:39
$begingroup$
Almost the entire algorithm can be highly parallelized. The algorithm processes a single dataword. For a 32 bit CRC there are ~2^30 datawords á 32 bit that can easily be processed independently (except final comparing/filtering of the results that needs to be serialized).
$endgroup$
– Silicomancer
Mar 22 at 23:43
2
$begingroup$
Have you considered using GPU acceleration for this? Implementation will be a lot simpler, as well as less expensive.
$endgroup$
– duskwuff
Mar 22 at 23:56
$begingroup$
You can probably speed up your algorithm a lot if you take advantage of the linearity properties of CRC, i.e. that crc(a ^ b ^ c) = crc(a) ^ crc(b) ^ crc(c) for any odd number of equal-length messages
$endgroup$
– jpa
Mar 23 at 6:45
1
$begingroup$
Well, then, here is the code: users.ece.cmu.edu/~koopman/crc/hdlen.html, this is more or less the results I want to achieve: users.ece.cmu.edu/~koopman/crc/index.html and here is one interesting paper from the same author: users.ece.cmu.edu/~koopman/roses/dsn04/…
$endgroup$
– Silicomancer
Mar 23 at 20:42
2
2
$begingroup$
Ultimately running an algorithm in an FPGA instead of software can be faster but it really depends on the algorithm details. Essentially you will gain speed if you can parallelize or pipeline the data flow. If 64 simple operations need to be applied to a single point before it can be fully processed, the FPGA can pipeline them so that a new result comes out every clock cycle. But I don't know if your algorithm is like that.
$endgroup$
– mkeith
Mar 22 at 23:39
$begingroup$
Ultimately running an algorithm in an FPGA instead of software can be faster but it really depends on the algorithm details. Essentially you will gain speed if you can parallelize or pipeline the data flow. If 64 simple operations need to be applied to a single point before it can be fully processed, the FPGA can pipeline them so that a new result comes out every clock cycle. But I don't know if your algorithm is like that.
$endgroup$
– mkeith
Mar 22 at 23:39
$begingroup$
Almost the entire algorithm can be highly parallelized. The algorithm processes a single dataword. For a 32 bit CRC there are ~2^30 datawords á 32 bit that can easily be processed independently (except final comparing/filtering of the results that needs to be serialized).
$endgroup$
– Silicomancer
Mar 22 at 23:43
$begingroup$
Almost the entire algorithm can be highly parallelized. The algorithm processes a single dataword. For a 32 bit CRC there are ~2^30 datawords á 32 bit that can easily be processed independently (except final comparing/filtering of the results that needs to be serialized).
$endgroup$
– Silicomancer
Mar 22 at 23:43
2
2
$begingroup$
Have you considered using GPU acceleration for this? Implementation will be a lot simpler, as well as less expensive.
$endgroup$
– duskwuff
Mar 22 at 23:56
$begingroup$
Have you considered using GPU acceleration for this? Implementation will be a lot simpler, as well as less expensive.
$endgroup$
– duskwuff
Mar 22 at 23:56
$begingroup$
You can probably speed up your algorithm a lot if you take advantage of the linearity properties of CRC, i.e. that crc(a ^ b ^ c) = crc(a) ^ crc(b) ^ crc(c) for any odd number of equal-length messages
$endgroup$
– jpa
Mar 23 at 6:45
$begingroup$
You can probably speed up your algorithm a lot if you take advantage of the linearity properties of CRC, i.e. that crc(a ^ b ^ c) = crc(a) ^ crc(b) ^ crc(c) for any odd number of equal-length messages
$endgroup$
– jpa
Mar 23 at 6:45
1
1
$begingroup$
Well, then, here is the code: users.ece.cmu.edu/~koopman/crc/hdlen.html, this is more or less the results I want to achieve: users.ece.cmu.edu/~koopman/crc/index.html and here is one interesting paper from the same author: users.ece.cmu.edu/~koopman/roses/dsn04/…
$endgroup$
– Silicomancer
Mar 23 at 20:42
$begingroup$
Well, then, here is the code: users.ece.cmu.edu/~koopman/crc/hdlen.html, this is more or less the results I want to achieve: users.ece.cmu.edu/~koopman/crc/index.html and here is one interesting paper from the same author: users.ece.cmu.edu/~koopman/roses/dsn04/…
$endgroup$
– Silicomancer
Mar 23 at 20:42
|
show 12 more comments
1 Answer
1
active
oldest
votes
$begingroup$
Can I expect a significant speedup? By which order of magnitude?
Sure, by quite a lot. CRCs can be computed on data a byte a a time using a straightforward table lookup. A moderate-sized FPGA (say, a Xilinx XC6SLX75) will have a hundred or more blocks of internal dual-port RAM that allow 200 data streams to be processed in parallel at a rate of one byte per clock cycle, where the clock could be 200 MHz or more. That's a throughput of at least 40 GB/s. How fast is your "x64" CPU?
Is there a C/C++ compiler suited for the purpose?
Not really. If you want to get the most out of your FPGA, you'll want to use an HDL to define the hardware datapath directly. Implementations derived from programming languages are possible, but the performance ranges from lousy to useless.
Which FPGAs are suitable?
That's bordering on a product recommendation, which would be off-topic for this site, but look at the midrange offerings from Xilinx (such as the Spartan-6 series) or Intel (formerly Altera, such as their Cyclone IV series). Inexpensive development boards for these families are readily available from places like Digilent.
$endgroup$
3
$begingroup$
A table lookup is really not the best way to do it on an FPGA because it only works for small inputs. Since CRC is just a bunch of XOR gates, what you can do is run a wide data bus (and get a really high data rate) and then do an unrolled parallel CRC. You can relatively easily do a CRC 32 over data 64 bits at a time at 400 MHz, which gives you 25 Gbps, then drop a bunch of instances on the FPGA to run in parallel.
$endgroup$
– alex.forencich
Mar 23 at 4:36
$begingroup$
Please note that this is not about CRC calculation. This is about computing the error detection performance of CRCs. The algorithm is pretty much different.
$endgroup$
– Silicomancer
Mar 23 at 10:03
$begingroup$
@Dave Tweed: So you would go with C/C++ under no circumstances? VHDL will be a rough path for me. Any recommendations which ease break-in a little?
$endgroup$
– Silicomancer
Mar 23 at 10:15
2
$begingroup$
The syntax of Verilog is similar to that of C/C++, while VHDL is more like ADA. Therefore, you might be more comfortable with the former. But the real hurdle is learning to think in terms of parallel hardware operations instead of sequential software operations. And if computing the CRCs is not the issue, then you haven't described the actual workload at all.
$endgroup$
– Dave Tweed♦
Mar 23 at 11:12
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");
StackExchange.ifUsing("editor", function ()
return StackExchange.using("schematics", function ()
StackExchange.schematics.init();
);
, "cicuitlab");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "135"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2felectronics.stackexchange.com%2fquestions%2f428626%2fmoving-brute-force-search-to-fpga%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Can I expect a significant speedup? By which order of magnitude?
Sure, by quite a lot. CRCs can be computed on data a byte a a time using a straightforward table lookup. A moderate-sized FPGA (say, a Xilinx XC6SLX75) will have a hundred or more blocks of internal dual-port RAM that allow 200 data streams to be processed in parallel at a rate of one byte per clock cycle, where the clock could be 200 MHz or more. That's a throughput of at least 40 GB/s. How fast is your "x64" CPU?
Is there a C/C++ compiler suited for the purpose?
Not really. If you want to get the most out of your FPGA, you'll want to use an HDL to define the hardware datapath directly. Implementations derived from programming languages are possible, but the performance ranges from lousy to useless.
Which FPGAs are suitable?
That's bordering on a product recommendation, which would be off-topic for this site, but look at the midrange offerings from Xilinx (such as the Spartan-6 series) or Intel (formerly Altera, such as their Cyclone IV series). Inexpensive development boards for these families are readily available from places like Digilent.
$endgroup$
3
$begingroup$
A table lookup is really not the best way to do it on an FPGA because it only works for small inputs. Since CRC is just a bunch of XOR gates, what you can do is run a wide data bus (and get a really high data rate) and then do an unrolled parallel CRC. You can relatively easily do a CRC 32 over data 64 bits at a time at 400 MHz, which gives you 25 Gbps, then drop a bunch of instances on the FPGA to run in parallel.
$endgroup$
– alex.forencich
Mar 23 at 4:36
$begingroup$
Please note that this is not about CRC calculation. This is about computing the error detection performance of CRCs. The algorithm is pretty much different.
$endgroup$
– Silicomancer
Mar 23 at 10:03
$begingroup$
@Dave Tweed: So you would go with C/C++ under no circumstances? VHDL will be a rough path for me. Any recommendations which ease break-in a little?
$endgroup$
– Silicomancer
Mar 23 at 10:15
2
$begingroup$
The syntax of Verilog is similar to that of C/C++, while VHDL is more like ADA. Therefore, you might be more comfortable with the former. But the real hurdle is learning to think in terms of parallel hardware operations instead of sequential software operations. And if computing the CRCs is not the issue, then you haven't described the actual workload at all.
$endgroup$
– Dave Tweed♦
Mar 23 at 11:12
add a comment |
$begingroup$
Can I expect a significant speedup? By which order of magnitude?
Sure, by quite a lot. CRCs can be computed on data a byte a a time using a straightforward table lookup. A moderate-sized FPGA (say, a Xilinx XC6SLX75) will have a hundred or more blocks of internal dual-port RAM that allow 200 data streams to be processed in parallel at a rate of one byte per clock cycle, where the clock could be 200 MHz or more. That's a throughput of at least 40 GB/s. How fast is your "x64" CPU?
Is there a C/C++ compiler suited for the purpose?
Not really. If you want to get the most out of your FPGA, you'll want to use an HDL to define the hardware datapath directly. Implementations derived from programming languages are possible, but the performance ranges from lousy to useless.
Which FPGAs are suitable?
That's bordering on a product recommendation, which would be off-topic for this site, but look at the midrange offerings from Xilinx (such as the Spartan-6 series) or Intel (formerly Altera, such as their Cyclone IV series). Inexpensive development boards for these families are readily available from places like Digilent.
$endgroup$
3
$begingroup$
A table lookup is really not the best way to do it on an FPGA because it only works for small inputs. Since CRC is just a bunch of XOR gates, what you can do is run a wide data bus (and get a really high data rate) and then do an unrolled parallel CRC. You can relatively easily do a CRC 32 over data 64 bits at a time at 400 MHz, which gives you 25 Gbps, then drop a bunch of instances on the FPGA to run in parallel.
$endgroup$
– alex.forencich
Mar 23 at 4:36
$begingroup$
Please note that this is not about CRC calculation. This is about computing the error detection performance of CRCs. The algorithm is pretty much different.
$endgroup$
– Silicomancer
Mar 23 at 10:03
$begingroup$
@Dave Tweed: So you would go with C/C++ under no circumstances? VHDL will be a rough path for me. Any recommendations which ease break-in a little?
$endgroup$
– Silicomancer
Mar 23 at 10:15
2
$begingroup$
The syntax of Verilog is similar to that of C/C++, while VHDL is more like ADA. Therefore, you might be more comfortable with the former. But the real hurdle is learning to think in terms of parallel hardware operations instead of sequential software operations. And if computing the CRCs is not the issue, then you haven't described the actual workload at all.
$endgroup$
– Dave Tweed♦
Mar 23 at 11:12
add a comment |
$begingroup$
Can I expect a significant speedup? By which order of magnitude?
Sure, by quite a lot. CRCs can be computed on data a byte a a time using a straightforward table lookup. A moderate-sized FPGA (say, a Xilinx XC6SLX75) will have a hundred or more blocks of internal dual-port RAM that allow 200 data streams to be processed in parallel at a rate of one byte per clock cycle, where the clock could be 200 MHz or more. That's a throughput of at least 40 GB/s. How fast is your "x64" CPU?
Is there a C/C++ compiler suited for the purpose?
Not really. If you want to get the most out of your FPGA, you'll want to use an HDL to define the hardware datapath directly. Implementations derived from programming languages are possible, but the performance ranges from lousy to useless.
Which FPGAs are suitable?
That's bordering on a product recommendation, which would be off-topic for this site, but look at the midrange offerings from Xilinx (such as the Spartan-6 series) or Intel (formerly Altera, such as their Cyclone IV series). Inexpensive development boards for these families are readily available from places like Digilent.
$endgroup$
Can I expect a significant speedup? By which order of magnitude?
Sure, by quite a lot. CRCs can be computed on data a byte a a time using a straightforward table lookup. A moderate-sized FPGA (say, a Xilinx XC6SLX75) will have a hundred or more blocks of internal dual-port RAM that allow 200 data streams to be processed in parallel at a rate of one byte per clock cycle, where the clock could be 200 MHz or more. That's a throughput of at least 40 GB/s. How fast is your "x64" CPU?
Is there a C/C++ compiler suited for the purpose?
Not really. If you want to get the most out of your FPGA, you'll want to use an HDL to define the hardware datapath directly. Implementations derived from programming languages are possible, but the performance ranges from lousy to useless.
Which FPGAs are suitable?
That's bordering on a product recommendation, which would be off-topic for this site, but look at the midrange offerings from Xilinx (such as the Spartan-6 series) or Intel (formerly Altera, such as their Cyclone IV series). Inexpensive development boards for these families are readily available from places like Digilent.
answered Mar 23 at 0:09
Dave Tweed♦Dave Tweed
124k10153267
124k10153267
3
$begingroup$
A table lookup is really not the best way to do it on an FPGA because it only works for small inputs. Since CRC is just a bunch of XOR gates, what you can do is run a wide data bus (and get a really high data rate) and then do an unrolled parallel CRC. You can relatively easily do a CRC 32 over data 64 bits at a time at 400 MHz, which gives you 25 Gbps, then drop a bunch of instances on the FPGA to run in parallel.
$endgroup$
– alex.forencich
Mar 23 at 4:36
$begingroup$
Please note that this is not about CRC calculation. This is about computing the error detection performance of CRCs. The algorithm is pretty much different.
$endgroup$
– Silicomancer
Mar 23 at 10:03
$begingroup$
@Dave Tweed: So you would go with C/C++ under no circumstances? VHDL will be a rough path for me. Any recommendations which ease break-in a little?
$endgroup$
– Silicomancer
Mar 23 at 10:15
2
$begingroup$
The syntax of Verilog is similar to that of C/C++, while VHDL is more like ADA. Therefore, you might be more comfortable with the former. But the real hurdle is learning to think in terms of parallel hardware operations instead of sequential software operations. And if computing the CRCs is not the issue, then you haven't described the actual workload at all.
$endgroup$
– Dave Tweed♦
Mar 23 at 11:12
add a comment |
3
$begingroup$
A table lookup is really not the best way to do it on an FPGA because it only works for small inputs. Since CRC is just a bunch of XOR gates, what you can do is run a wide data bus (and get a really high data rate) and then do an unrolled parallel CRC. You can relatively easily do a CRC 32 over data 64 bits at a time at 400 MHz, which gives you 25 Gbps, then drop a bunch of instances on the FPGA to run in parallel.
$endgroup$
– alex.forencich
Mar 23 at 4:36
$begingroup$
Please note that this is not about CRC calculation. This is about computing the error detection performance of CRCs. The algorithm is pretty much different.
$endgroup$
– Silicomancer
Mar 23 at 10:03
$begingroup$
@Dave Tweed: So you would go with C/C++ under no circumstances? VHDL will be a rough path for me. Any recommendations which ease break-in a little?
$endgroup$
– Silicomancer
Mar 23 at 10:15
2
$begingroup$
The syntax of Verilog is similar to that of C/C++, while VHDL is more like ADA. Therefore, you might be more comfortable with the former. But the real hurdle is learning to think in terms of parallel hardware operations instead of sequential software operations. And if computing the CRCs is not the issue, then you haven't described the actual workload at all.
$endgroup$
– Dave Tweed♦
Mar 23 at 11:12
3
3
$begingroup$
A table lookup is really not the best way to do it on an FPGA because it only works for small inputs. Since CRC is just a bunch of XOR gates, what you can do is run a wide data bus (and get a really high data rate) and then do an unrolled parallel CRC. You can relatively easily do a CRC 32 over data 64 bits at a time at 400 MHz, which gives you 25 Gbps, then drop a bunch of instances on the FPGA to run in parallel.
$endgroup$
– alex.forencich
Mar 23 at 4:36
$begingroup$
A table lookup is really not the best way to do it on an FPGA because it only works for small inputs. Since CRC is just a bunch of XOR gates, what you can do is run a wide data bus (and get a really high data rate) and then do an unrolled parallel CRC. You can relatively easily do a CRC 32 over data 64 bits at a time at 400 MHz, which gives you 25 Gbps, then drop a bunch of instances on the FPGA to run in parallel.
$endgroup$
– alex.forencich
Mar 23 at 4:36
$begingroup$
Please note that this is not about CRC calculation. This is about computing the error detection performance of CRCs. The algorithm is pretty much different.
$endgroup$
– Silicomancer
Mar 23 at 10:03
$begingroup$
Please note that this is not about CRC calculation. This is about computing the error detection performance of CRCs. The algorithm is pretty much different.
$endgroup$
– Silicomancer
Mar 23 at 10:03
$begingroup$
@Dave Tweed: So you would go with C/C++ under no circumstances? VHDL will be a rough path for me. Any recommendations which ease break-in a little?
$endgroup$
– Silicomancer
Mar 23 at 10:15
$begingroup$
@Dave Tweed: So you would go with C/C++ under no circumstances? VHDL will be a rough path for me. Any recommendations which ease break-in a little?
$endgroup$
– Silicomancer
Mar 23 at 10:15
2
2
$begingroup$
The syntax of Verilog is similar to that of C/C++, while VHDL is more like ADA. Therefore, you might be more comfortable with the former. But the real hurdle is learning to think in terms of parallel hardware operations instead of sequential software operations. And if computing the CRCs is not the issue, then you haven't described the actual workload at all.
$endgroup$
– Dave Tweed♦
Mar 23 at 11:12
$begingroup$
The syntax of Verilog is similar to that of C/C++, while VHDL is more like ADA. Therefore, you might be more comfortable with the former. But the real hurdle is learning to think in terms of parallel hardware operations instead of sequential software operations. And if computing the CRCs is not the issue, then you haven't described the actual workload at all.
$endgroup$
– Dave Tweed♦
Mar 23 at 11:12
add a comment |
Thanks for contributing an answer to Electrical Engineering Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2felectronics.stackexchange.com%2fquestions%2f428626%2fmoving-brute-force-search-to-fpga%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
$begingroup$
Ultimately running an algorithm in an FPGA instead of software can be faster but it really depends on the algorithm details. Essentially you will gain speed if you can parallelize or pipeline the data flow. If 64 simple operations need to be applied to a single point before it can be fully processed, the FPGA can pipeline them so that a new result comes out every clock cycle. But I don't know if your algorithm is like that.
$endgroup$
– mkeith
Mar 22 at 23:39
$begingroup$
Almost the entire algorithm can be highly parallelized. The algorithm processes a single dataword. For a 32 bit CRC there are ~2^30 datawords á 32 bit that can easily be processed independently (except final comparing/filtering of the results that needs to be serialized).
$endgroup$
– Silicomancer
Mar 22 at 23:43
2
$begingroup$
Have you considered using GPU acceleration for this? Implementation will be a lot simpler, as well as less expensive.
$endgroup$
– duskwuff
Mar 22 at 23:56
$begingroup$
You can probably speed up your algorithm a lot if you take advantage of the linearity properties of CRC, i.e. that crc(a ^ b ^ c) = crc(a) ^ crc(b) ^ crc(c) for any odd number of equal-length messages
$endgroup$
– jpa
Mar 23 at 6:45
1
$begingroup$
Well, then, here is the code: users.ece.cmu.edu/~koopman/crc/hdlen.html, this is more or less the results I want to achieve: users.ece.cmu.edu/~koopman/crc/index.html and here is one interesting paper from the same author: users.ece.cmu.edu/~koopman/roses/dsn04/…
$endgroup$
– Silicomancer
Mar 23 at 20:42