> We also need to revisit the approach of compiling a high-level language to VHDL/Verilog and then using vendor-specific synthesis tools to generate bitstreams. If FPGA vendors open up FPGA architecture details, third parties could develop new tools that compile a high-level description directly to a bitstream without going through the intermediate step of generating VHDL/Verilog. This is attractive because current synthesis times are too long to be acceptable in the mainstream.
This is both an ideological and a practical matter. Until the whole process INCLUDING bitstream generation is open, I don't see FPGAs as a viable alternative to general purpose processors.
It boils down to economics: 3rd party companies that make EDA tools for the semiconductor industry can't tweak and sell their tools at a premium to the embedded designers that use FPGAs. The reason is the FPGA designers already get their tools for free from the FPGA vendor.
Moreover, the article states FPGA vendors spend more money on development of tools than on the development of the FPGAs themselves!
The tool-chains for FPGAs have become so complex that the vendors themselves could not recreate them if asked to, in much the same way that Microsoft could not have written, from scratch, software that opens the .doc format and is fully compatible with existing .doc files.
Free tools imply that third-party companies cannot readily enter the game, because customers would not be ready to pay (they are effectively already paying for the tools, so why pay twice?). Again, this is the same as Microsoft bundling Windows with laptops: people buy them without realizing that those laptops could have been cheaper if they were loaded with Linux and there were no political moves from somewhere to prevent that from happening.
To me, these are signs of a pending disruption. Watch out, however, since many companies have attempted this in the past and have not survived.
How many supporters would a Kickstarter need to design such an FPGA, with all specs open, and ship devices to backers in the $200-$500 range?
Let's assume you wouldn't need to write the bitmungers [1] and other parts of the toolchain; the open-source community would likely produce one for you pro bono if you're shipping physical hardware to the general public at a reasonable price with open specs.
If anyone is actually planning on doing this, don't forget you'll need FCC certification because you'll probably pick up customers who are non-hardware people.
[1] Or VHDL compilers, or netlisteners, or place-and-route, or whatever the tools are called. A hardware guru I am certainly not!
Also I suspect you do not have a copy of the firmware inside the microcontroller that runs your mouse, or your PC 8042 keyboard controller, or innumerable microcontrollers throughout the machine, so either you are already OK with using "some" closed source or don't have a full picture of your whole system. Pessimism is your friend in this case and the situation is probably much worse than you think. I ALWAYS prioritize open source above all other criteria for obvious reasons, but when it's not available I do my best with what I can get... It's an engineering background thing... Much as I'd love to ideally make everything outta aerospace carbon fiber and titanium, let's face it, my house is made mostly outta wood and my car outta steel.
The problem is not just closed source software (which I don't like and try not to use) but a lack of documentation of the architecture and bitstream formats. I program for a living, but wouldn't for the life of me work with a closed source compiler for an unknown architecture with no/incomplete documentation. Many people may be ok with this though (see for example, some shader compilers for embedded GPUs).
There's a reason processor manufacturers release all info about instruction sets, and that in turn allows a lot of projects like gcc, llvm, v8, luajit, etc. to support them.
I would think that opening up the details of this technology would spur a lot of innovation regarding FPGAs and HDLs, both in academia and in the industry.
That has to be carefully defined. My point with the eprom anecdote was that you're stuck with closed source if you put a Z80 core and some rom/ram/decode logic in an FPGA and do a synthetic SoC, and merely avoiding the FPGA and doin' it with a discrete Z80 and eprom and sram and some glue isn't going to save you, because in 2013, using PC hardware, you pretty much need a closed source Windows-only USB eprom programmer. So if you're stuck with closed source doing a hobby project, at least do something "cool" and use the FPGA.
To do "about the same thing" but open source there are several completely open source alternatives that would result in CPU controlled I/O ports being wiggled. Avoiding closed source is not as simple as saying "no FPGAs", nor is closed source solely a disease of FPGAs. For example, it took years, as I recall, before the Propeller had a non-Windows/DOS stack, although BST works pretty well now.
Confusing the issue, if you just want to play around with Verilog as a language or abstract technology, Icarus is free software, part of the GnuEDA project. Obviously that's playing with simulation, not silicon, but it is possible. So you don't have to entirely avoid FPGA technology.
This paper will explain how to reverse engineer the Xilinx bitstream by examining the XDL intermediary format. The source code from this project is here (not updated since 2008):
I think that in addition to open source tools, FPGAs will need cloud-hosted simulation, synthesis and place-and-route tools that can handle scaling up to designs with many millions of logic gates. Even in a well-partitioned design, a small change to the logic takes forever to recompile to FPGA.
DrDreams [dead]
Speaking as an embedded developer, I see a number of other embedded devs hobbying around with FPGAs. However, I very rarely see convincing use cases for FPGAs. This article seems to lean toward the belief many of my colleagues have, that FPGAs are right around the corner in terms of general usefulness. However, I disagree strongly. I find that they are highly-specialized devices.
Before reading the rest of my writing, consider that at this time, brilliant hardware designers are putting similar amounts of work into both general purpose CPUs and into FPGAs. However, CPUs are composed of dense blocks of special-purpose silicon for common purposes, such as floating-point math. FPGAs always have to match that dense silicon through configurable silicon, which is less dense. Furthermore, the routing in CPUs is a known entity at manufacturing time. On FPGAs, the routing is highly variable and must be re-negotiated at nearly every compile cycle. That's a huge time sink, both in terms of build time and in predicting performance, especially since those short routes that you get at the beginning of a project typically end up being longer by the end of it. Nowadays, we are seeing more FPGAs with dedicated, pre-made hardware blocks inside of them, such as FPUs and even CPU cores. These have more of a chance of catching on for general purpose computing. Notice however, that on these devices, it's the general-purpose CPU dominating, leaving the FPGA as a configurable peripheral, subordinate to the dense, pre-designed silicon.
Although one may be able to match GPU performance with an FPGA, it's usually just not worth it. It will take dozens of hours of FPGA coding and simulation. Compiling and fitting and the rest of the FPGA dev chain is very time-consuming and resource-intensive, compared to the speed and elegance of gcc. Speaking of standard development practices, FPGA code is not nearly as portable as C. It often has special optimizations done for the sake of the target device. http://opencores.org has a number of more generic modules available, but still, FPGA code does not scale as well as C code. There are add-on packages that help write FPGA synthesis code (code synthesizers), but they make matters especially complicated. The syntax of Verilog and VHDL is not well-designed for scaling. Speaking of these languages, if you are used to languages written to be parsed easily, such as lisp, python, or even C and Java to some extent, you will be appalled at the structure of Verilog and VHDL. There are many redundant entities, lots of excess verbiage and all kinds of special cases. They really have evolved very little since the days of Programmable Array Logic (PAL).
Another problem with FPGAs is the additional hardware on board needed to configure them. It's one more component or interface that is not needed when using CPUs. It's an additional software image to maintain, revision, store in source control, etc. FPGAs also often require more power supplies and better power supply conditioning than a regular CPU and often a separate clock crystal. They are high-maintenance.
FPGAs do shine though in a few specific instances: 1. When there is a particular, uncommon high-speed bus protocol you need to communicate with and can not buy pre-designed silicon for. This does not mean, e.g. USB. It means something like a non-standard digital camera interface or embedded graphic display. 2. Software Radio. 3. Obscure, but computationally-intensive algorithms like Bitcoin.
I hope my words have convinced some people to cool their lust for FPGAs, because I feel they're a bit of a dead-end or distraction for many who are attracted to the idea of "executing their algorithm extremely fast" or "becoming a chip designer." I have seen many students and professionals burn up hours and hours of their time getting something to run on an FPGA which could just as easily have been CPU-based. For example, one student implemented a large number of PWM oscillators on an FPGA where it would have been much simpler to use the debugged, verified, PWM peripherals on microcontrollers. Another guy I work with is intent on running CPU cores on FPGAs. This is an especially perverse use of the FPGA. Unless you've got some subset of the CPU which adds incredible value to the process, you're exchanging the density of the VLSI/ASIC version of the chip for the flexible, less dense version on FPGA. This may be useful in rare situations, such as adding an out-of-order address generator to an existing core for speeding up an FFT, but it suffers an incredible performance and developer time hit to get to this point.
> Furthermore, the routing in CPUs is a known entity at manufacturing time.
The 'routing' of a CPU is much more variable than the routing of an FPGA. Data moves around a CPU based on the program that is executing. The control logic of a CPU is the equivalent of the routing logic of an FPGA.
> On FPGAs, the routing is highly variable and must be re-negotiated at nearly every compile cycle.
The 'routing' of a CPU is the same. The compiler has to perform register allocation afresh on every compilation. There's obviously a tradeoff between fast compilation vs most efficient use of resources. Both problems are NP-complete I believe.
> Nowadays, we are seeing more FPGAs with dedicated, pre-made hardware blocks inside of them, such as FPUs and even CPU cores.
You just contradicted yourself. Previously, you said "FPGAs always have to match that dense silicon through configurable silicon".
Your next paragraph talks about toolchain issues. This is hardly an insurmountable problem. Someone just needs to design a high level language that can be synthesised; something akin to a python of the FPGA world if you will.
> Another problem with FPGAs is the additional hardware on board needed to configure them.
I don't quite understand. Do you mean the hardware that reads the bitstream, etc., or the hardware that is required in order for the FPGA to be configurable, like routing, LUTs, etc.?
> Another guy I work with is intent on running CPU cores on FPGAs.
I do agree with you here, this is a weird perversion if the purpose is not eventually to create an ASIC.
I also don't believe that future processors will be FPGAs, but I do believe they will be a lot closer to FPGAs than CPUs.
The advantage of FPGAs is that they allow nontrivial parallelism. On a CPU with 4 cores, you can run 4 instructions at a time (ignoring the pipelining). On the FPGA, you can run any number of operations at the same time, as long as the FPGA is big enough. The problem is not the low-level nature of hardware description languages, the problem is that we still don't have a smart compiler that can release us from the difficulty of writing nontrivial massively-parallel code.
Want a system on a chip with 2 cores leaving plenty of space for an ethernet accelerator, or 3 cores without space for the ethernet accelerator? It's only an include and some minor configuration away.
"the problem is that we still don't have a smart compiler that can release us from the difficulty"
We still don't have a smart programmer either... it's hard to spec. Erlang looking elegant doesn't magically make it easy to map a non-technical description of requirements to Erlang.
Thanks for clearing up some of my blurrier points :) I should not have included "always" in that statement about matching the density of silicon. That breaks my own rule about absolutes.
I hadn't considered the routing of a CPU at compile time similar to the routing of an FPGA. They're at different scales and have different challenges... I guess it's because I have mainly seen FPGAs in time-critical situations (where CPUs can't be used) in which it was very difficult to predict their performance and they required lots of hand-tweaking. That was in SONET routing, btw. CPUs on the other hand, usually have some time to spare, whether it is because they are overspecced, or because they are used in applications which tolerate variation in execution time (user interfaces) in comparison to FPGAs used for e.g. translating data between communications busses. It is simple to measure, reproduce and predict specs for common algorithms like the FFT, convolution, etc. for CPUs. I believe this is because the operations in the "routing" of CPU algorithms are based on the sum of known, discrete, orthogonal events like memory fetches, ALU operations, etc.
Inside FPGAs though, the timing is not as uniformly discretized. It's about taking resources of routing distance from a large pool with a large geographical component, making it highly non-linear and tricky to predict.
I think we all agree a better language would be the key to getting more out of FPGA technology! And I would like to see more FPGA-elements on traditional CPUs like ARM, x86, AVR, PIC, etc. I wonder what elements an improved hardware description language would use? It could certainly be trivially parseable by tools like antlr while still giving bit-level access...
W/r/t additional hardware, I meant hardware that configures the bitstream, such as a header or CPU interface, as well as power supplies. For example, this schematic which I grabbed as an example: http://upload.wikimedia.org/wikipedia/en/3/3f/WillWare_Usb_f... . You can see the FPGA has THREE power supplies (1.2V, 2.5V, 3.3V). This particular design doesn't need additional clock sources, fortunately, but needing one is rather common. Understand that this isn't a fatal flaw with FPGAs, it's simply a disadvantage: one doesn't add an FPGA alone, one adds an FPGA, power supplies, possibly a crystal, and a programming interface. It means using an FPGA incurs a bunch of overhead.
As for getting my account banned due to my reaction to a bogus article about a way of multiplication fraudulently claimed to be taught to Japanese school kids: Well if you folks in this little community don't need me around and don't feel what I have to say counts for anything and you don't like the way I say what I have to say, and you can't tolerate differences in people who are different from you, then go ahead and enjoy gassing each other up and blowing smoke up each other's asses in your sealed-off little echo chamber without me. I don't need you, either. I will still profit and benefit from the information here without contributing anything back if you're all so sure you know all the answers. And I'll be sure to let everyone I encounter know what an open and swell bunch of folks they can expect here, after everyone with a differing or seasoned opinion is silently silenced and banned. Keep slurping up your daily vomit from Jacques Mattheij. http://24.media.tumblr.com/tumblr_lfiferFT7c1qa55edo1_400.jp... .
I suspect that if you used FPGAs for less time critical applications, you'd have more room for productivity tradeoffs.
Such details can be abstracted over. If you only create synchronous circuits, for example, these subtle timing considerations can be handled automatically.
> W/r/t additional hardware, I meant hardware that configures the bitstream, such as a header or CPU interface, as well as power supplies.
I don't really understand why such additional hardware is necessary, so I'm not able to comment.
Ah yes, exactly. I would not translate machine language or C into raw asynchronous VHDL. I'd stick a real or virtual core in the FPGA and create the world's most amazing accelerator peripheral, to make the CPU little more than a UI and a DMA controller to keep the accelerator fed.
You don't implement your video game's title page in discrete (emulated) FPGA logic. You put your 3-d graphics engine in the FPGA and leave the author credits in plain old CPU memory.
"For example, one student implemented a large number of PWM oscillators on an FPGA where it would have been much simpler to use the debugged, verified, PWM peripherals on microcontrollers"
Totally missing the point. You want 768 PWMs to run 768 servos? You instantiate 768 of them in VHDL plus an onboard proc to set them up and/or do the UI, all done. Not all that much harder than running 1. You are not going to find 768 PWM outputs on any single chip microcontroller that I'm aware of. Actually you aren't going to find a 768+ pin microcontroller out there anywhere; although that's kinda fat, it's not unheard of for an FPGA. You want one PWM output, yeah, you use an off the shelf PIC, but that doesn't scale. Also don't make a plain old generic PWM when you need a servo controller. Make the world's best hardware dedicated servo controller (which is, at its core, basically a PWM, but add extensive offset and calibration registers, and maybe limit registers, and sensible hardware default-on-boot positions and all that, and maybe integrated hardware backlash correction). That's how you make a servo controller, not just "make a PWM" that later gets turned into a servo controller.
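To make the scaling point concrete, here is a rough software model (Python, not HDL) of one servo channel built around a PWM with offset and limit registers, then instantiated 768 times. The class name, register names and tick counts are all made up for illustration; a real design would be a VHDL or Verilog module replicated with a generate loop.

```python
from dataclasses import dataclass

PERIOD_TICKS = 20_000  # assumed: 20 ms servo frame at a 1 MHz tick


@dataclass
class ServoChannel:
    """Hypothetical model of one hardware servo channel (a PWM plus registers)."""
    position: int = 1_500   # requested pulse width in ticks
    offset: int = 0         # per-servo calibration offset register
    min_ticks: int = 1_000  # lower limit register
    max_ticks: int = 2_000  # upper limit register

    def pulse_width(self) -> int:
        # Apply calibration, then clamp to the limit registers.
        return max(self.min_ticks, min(self.max_ticks, self.position + self.offset))

    def output(self, tick: int) -> int:
        # The pin is high for the first pulse_width() ticks of each frame.
        return 1 if (tick % PERIOD_TICKS) < self.pulse_width() else 0


# "Instantiate 768": in HDL this would be a generate loop; here it is a list.
channels = [ServoChannel() for _ in range(768)]
channels[5].position = 2_400   # out of range, so the limit register clamps it
channels[7].offset = -100      # calibration shortens the pulse


def high_ticks(channel: ServoChannel) -> int:
    """Count how many ticks of one frame the output pin is high."""
    return sum(channel.output(t) for t in range(PERIOD_TICKS))
```

Adding the offset and limit registers costs a few lines per channel here, which is roughly the argument above: once one channel is right, replicating it is the easy part.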
"often a separate clock crystal. They are high-maintenance"
This is a bit WTF; the guy isn't too far off until he gets to this kind of stuff.
I guess the lesson is - don't talk about gently fucking math-hipsters with chainsaws on HN.
There's more to life than self-denial.
My idea of cool stuff can't be done without them, unfortunately.
So the "inner loop" which needs optimizing is a crazy deep complicated DSP pipeline; obviously you implement that in FPGA "hardware" directly in an HDL. On the other hand, you'd be crazy to implement your UI or a generic protocol like TCP/IP in hardware (unless you're building a router or switch...). Something like I2C is right about on the cusp, where you could either write it in plain ole C or implement it as a "hardware" peripheral in the FPGA.
Peripheral ... of what you ask? Well, depending on your license requirements and personal feelings there are a zillion options like microblaze/picoblaze from the FPGA mfgr, or visit the opencores website and download a Z80 or a 6502 or a PDP-10 or whatever floats your boat for the high level. Yes, a full PDP-10 will fit easily in one of the bigger hobby size Spartan FPGAs. It's not 1995 anymore; you've got enough space to put dozens (hundreds?) of picoblaze cores on a single FPGA if you want nowadays.
There's no point in hand optimized HDL to output "hello world", just like there's no point in the antique technique of software driven "bit banged" serial ports. Just "include" an off the shelf opencore UART to simplify your UI code.
I've been in this game a long time and this is the future of microcontrollers and possibly general purpose computing. The engineering "skill" of searching a feature matrix to find which PIC microcontroller has 3 I2C hardware and 7 timers and 2 UARTs in your favorite core family is going to be dead. You'll just "include uart.h" and instantiate it 2 times, and you pick your favorite core, be it a Z80 or a microblaze or an ARM or a SPARC.
In the future I think very few people "programming" FPGAs are going to be writing anything other than a bunch of includes and then doing everything in the embedded synthesized Z80. The "old timers" who actually write in HDLs are going to look down on the noob FPGA progs much like the old assembly coders used to look down on the visual basic noobs, etc.
The same cost argument applies to unused blocks. If you can save 5 cents per device by not using FPGAs then you'll save 10 cents by selecting a minimal old fashioned controller, but that boxes you into a corner.
Another issue is the infinite variety of peripherals. An infinite number of an infinite variety of peripherals rapidly becomes very expensive compared to "here's 1M identical gates to implement whatever you want"
It's like the extremely early history of computing: will reconfigurable and infinitely reprogrammable von Neumann processors ever beat hard wired logic "computers" based on physical plugboards and unit record equipment in the marketplace? Well, yeah, that's exactly how it turned out.
"unless the designer gets something very compelling from that flexibility" Yes: almost infinitely fast hardware-accelerated peripherals and the dev's choice of CPU core. If your app has no inner loop that requires optimization (perhaps an automobile transmission, or a tamagotchi), or you have no particular needs WRT core selection, then, in the style of "if you've only got a hammer, the whole world looks like a nail", there would be little advantage in those situations.
Then there's the element of surprise. If, for example, I was developing an FPGA-based board for a drone or a medical device, I would, more than likely, require that 100% of the design be done in house (or crazy extensive testing be done to outside modules).
Anyone in software has had the experience of using some open-source module to save time, only to end up paying for it dearly when something doesn't work correctly and help isn't forthcoming. If the software you are working on is for a life support device, it is very likely that taking this approach is actually prohibited, and for good reason.
While I fully understand your point of view, this is one that reduces software and hardware development to simply wiring together a bunch of includes. In my experience this isn't even reality in the most trivial of non-trivial real-world projects.
FPGAs are not software.
I see these "FPGAs for the masses" articles pop up every so often. Here's what's interesting to me. If you are an engineer schooled in digital circuit design, developing with FPGAs is a piece of cake. There's nothing difficult about it at all, particularly when compared to the old days of wire-wrapping prototypes out of discrete chips. Sure, there can be a bit of tedium and repetition here and there. At the same time, one person can be fully responsible for a ten million logic element design...which was impossible just a couple of decades ago.
If you don't understand logic circuits, FPGAs are voodoo. Guess what? A carburetor is voodoo too if you don't understand it.
Let's invert the roles: Ask a seasoned FPGA engineer without (or with superficial) web coding experience to code a website --server and client side-- using JS, JQuery, HTML5, CSS3, PHP, Zend and MySQL. Right.
Then let's write an article about how difficult web programming is and how it ought to be available to the masses. Then let's further suggest that you can do nearly everything in web development via freely available includes.
I happen to be equally at home with hardware and software (web, embedded, system, whatever) and I can't see that scenario (development-by-includes) playing out in any of these domains.
In general terms, yes, FPGA work can and does usually take longer than the equivalent work in the software domain. It doesn't have to be that way though.
For me it starts with language choices. I suppose that if you work in VHDL all the time you probably rock. I have an intense dislike for VHDL. I don't see a reason to type twice as much to do the same thing. Fifteen years ago VHDL had advantages with such constructs as "generate"; this is no longer the case. I realize that this can easily turn into an argument of religious nature, so we'll have to leave it at that.
One approach that I have used with great success with complex modules is to write them in software first and then port to the FPGA. Going between C and Verilog is very natural.
The key is to write C code keeping in mind that you are describing hardware all along. Don't do anything that you would not be able to easily replicate on the FPGA. You are, effectively, authoring a simulation of what you might implement in the FPGA. The beauty of this approach is that you get the advantage of immediate execution and visualization in software. Debug initial structures and assumptions this way to save tons of time.
Maybe the best way to put it is that I try not to use the FPGA HDL coding stage to experiment and create but rather to simply enter the implementation. Then my goal is to go through as few Modelsim simulation passes as possible to verify operation.
If you've done non-trivial FPGA work you have probably experienced the agony of waiting an hour and a half for a design to compile and another N hours for it to simulate before discovering problems. The write-compile-simulate-evaluate-modify-repeat loop in FPGA work takes orders of magnitude longer than with software. I've had projects where you can only reasonably make one to half-a-dozen code changes per 18 hour day. That's the way it goes.
This is why I've resorted to extensive software-based validation before HDL coding. I've done this with, for example, challenging custom high-performance DDR memory controllers where there was a need to fiddle with a number of parameters and be able to visualize such conditions as FIFO fill/drain levels, etc. A nice GUI on top of the simulation made a huge difference. The final implementation took far less time to code in HDL and worked as required from the very start.
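As a rough idea of what such a software-first model can look like, here is a minimal cycle-by-cycle sketch of a FIFO sitting between a bursty writer and a steady reader, recording the fill level every cycle. It is written in Python for brevity; the function name, the traffic pattern and every parameter value are assumptions made up for this example, not details of the memory controller described above.

```python
def simulate_fifo(depth, cycles, burst_len, burst_gap, read_every):
    """Cycle-by-cycle model of a FIFO between a bursty writer and a steady reader.

    All parameters are illustrative assumptions, not taken from a real design.
    Returns (fill level after each cycle, overflow count, underflow count).
    """
    level = 0
    history = []
    overflows = underflows = 0
    for cycle in range(cycles):
        # Writer: burst_len writes, then burst_gap idle cycles, repeating.
        writing = (cycle % (burst_len + burst_gap)) < burst_len
        # Reader: one read attempt every read_every cycles.
        reading = (cycle % read_every) == 0

        # Decide both operations from the level at the start of the cycle,
        # the way registered full/empty flags would behave in hardware.
        do_write = writing and level < depth
        do_read = reading and level > 0
        if writing and not do_write:
            overflows += 1   # writer was blocked by a full FIFO
        if reading and not do_read:
            underflows += 1  # reader found the FIFO empty
        level += int(do_write) - int(do_read)
        history.append(level)
    return history, overflows, underflows


history, overflows, underflows = simulate_fifo(
    depth=16, cycles=200, burst_len=8, burst_gap=8, read_every=2
)
print("peak fill:", max(history), "overflows:", overflows, "underflows:", underflows)
```

Sweeping `depth` or the burst pattern in a loop like this takes seconds, whereas each equivalent experiment in HDL costs a compile and a simulation run. Plotting `history` is the cheap version of the GUI mentioned above.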
Another general comment. When it comes to image processing in FPGAs you don't really pay a penalty for modularizing your code to a relatively fine-grained degree. This is because module interfaces don't necessarily create any overhead (the best example of this being interconnect wires). In that sense FPGAs are vastly different from software, in that function or class+method interfaces generally come at a price.
Modularization can produce benefits during synthesis and placement. If you can pre-place portions of your design and do your floor planning in advance you can save tons of time. Incremental compilation has been around for a while. Still, nothing beats getting into the chip and locking down structures when it makes sense.
To circle back to the recurring theme of "FPGA for the masses" that pops up every so often: I maintain that FPGAs are, fundamentally, still about electrical engineering and not about software development. These, at certain levels, become vastly different disciplines. Once FPGA compilers become 100 to 1,000 times faster and FPGAs come with 100 to 1,000 times more resources for the money the two worlds will probably blur into one very quickly for most applications.
> I have an intense dislike for VHDL.
I have yet to meet an engineer who likes it!
I hate it with passion, but it lets me write circuits in the way I want.
Luckily, emacs VHDL mode makes me type less.
> If you've done non-trivial FPGA work you have probably experienced the agony of waiting an hour and a half for a design to compile and another N hours for it to simulate before discovering problems.
My simulations never took hours.
I use GHDL (an open source VHDL simulator) to simulate my code, and it is much slower than running Modelsim in a virtual machine.
So I guess that you are working on much larger problems than I do.
I have tried using a high level language before writing my circuits in VHDL.
But the results were not very good, apart from learning a lot more about the actual algorithm/circuit.
Either I coded at too high a level, which would be impossible in an FPGA (e.g., accessing a true dual port block RAM at 3 different addresses in a clock cycle), or I ended up simulating a lot of hardware just to make sure that it will work.
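One way to stop a high-level model from quietly doing things the hardware cannot is to build the constraint into the model itself. The sketch below is a toy Python stand-in for a true dual port block RAM that refuses a third access within one clock cycle. The class and method names are hypothetical, and it deliberately ignores read latency and read-during-write behaviour, which a real block RAM would impose.

```python
class DualPortRAM:
    """Toy model of a true dual port block RAM: at most two accesses per clock.

    Hypothetical helper for illustration only; timing details are not modelled.
    """
    PORTS = 2

    def __init__(self, size):
        self.mem = [0] * size
        self.accesses_this_cycle = 0

    def _use_port(self):
        # Fail loudly if the model asks for more ports than the RAM has.
        if self.accesses_this_cycle >= self.PORTS:
            raise RuntimeError("more than 2 accesses in one clock cycle")
        self.accesses_this_cycle += 1

    def read(self, addr):
        self._use_port()
        return self.mem[addr]

    def write(self, addr, value):
        self._use_port()
        self.mem[addr] = value

    def tick(self):
        """Advance to the next clock cycle, freeing both ports."""
        self.accesses_this_cycle = 0
```

The algorithm code then calls `tick()` once per modelled clock, and any step that needs three addresses at once raises immediately instead of surfacing later during the VHDL port.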
But the point is, no matter which approach I tried, it was painful, so I ended up choosing the workflow that is less painful.
I'd have to know more specifics to be able to comment beyond a certain level.
I am developing a marker detection system that runs at 100fps, with 640x480 8-bit grayscale images.
First I am doing CCL to find anything in the image that could be a marker.
At the same time, some features are accumulated for each detected component (potential marker).
Then the features are used to find which component is a real marker and what's its ID.
And finally, the markers have some spatial information that allows me to find out the position and orientation of the camera.
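For readers who have not met CCL (connected component labelling), here is a textbook two-pass version in Python with a small union-find table and a few per-component features. It is only a sketch of the general technique: it is not the implementation described above, it assumes a binary image and 4-connectivity, and it accumulates features in the second pass for clarity, where a streaming hardware design would accumulate them on the fly. The feature set (area and bounding box) is an assumption.

```python
def label_components(image):
    """Two-pass 4-connected component labelling with per-component features.

    `image` is a list of rows of 0/1 pixels. Returns {root_label: features},
    where features holds the pixel count and the bounding box.
    """
    parent = [0]  # union-find forest; index 0 is the background

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]

    # Pass 1: raster scan, looking only at the pixel above and to the left.
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            up = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if up and left:
                a, b = find(up), find(left)
                parent[max(a, b)] = min(a, b)  # record the equivalence
                labels[y][x] = min(a, b)
            elif up or left:
                labels[y][x] = up or left
            else:
                parent.append(len(parent))     # new provisional label
                labels[y][x] = len(parent) - 1

    # Pass 2: resolve labels and accumulate features per component.
    features = {}
    for y in range(height):
        for x in range(width):
            if labels[y][x]:
                root = find(labels[y][x])
                f = features.setdefault(
                    root, {"area": 0, "min_x": x, "max_x": x, "min_y": y, "max_y": y}
                )
                f["area"] += 1
                f["min_x"], f["max_x"] = min(f["min_x"], x), max(f["max_x"], x)
                f["min_y"], f["max_y"] = min(f["min_y"], y), max(f["max_y"], y)
    return features
```

The equivalence table (`parent`) and the per-label feature records are exactly the kind of state that ends up competing for registers and block RAM on a small device.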
Even though the FPGA that I use is the largest of all Cyclone II FPGAs with 70k LEs, I have to juggle registers and block RAM because it's too small to store all data in the registers, and using up too many registers substantially increases the time to place&route the design.
> I maintain that FPGAs are, fundamentally, still about electrical engineering and not about software development. These, at certain levels, become vastly different disciplines. Once FPGA compilers become 100 to 1,000 times faster and FPGAs come with 100 to 1,000 times more resources for the money the two worlds will probably blur into one very quickly for most applications.
I agree, and I would add that the compilers need to be smarter about parallelizing the code.
So while being able to perform better than the alternatives, the FPGAs are still a pain to develop for.
Even if the compilers are faster, and FPGAs are bigger, writing code for FPGAs still feels more like writing assembly code than code that is easily accessible "for the masses".
But I would be happy if the compilers become just 10x faster!
Can you explain what you are doing? I am wondering if you might be making your work more difficult by not taking advantage of inference. Are you doing logic-element level hardware description? In other words, are you wiring the circuits by hand, if you will, by describing everything in VHDL?
I've done that of course, but I don't think it's necessary unless you really have to squeeze a lot out of a design. Where it works well is in doing your own hand-placement and hand-routing through switch boxes, etc. to get a super-tight design that runs like hell. I've done that mostly with adders and multipliers in the context of filter structures.
My guess is that you have setup several delay lines in order to process a kernel of NxM pixels at a time?
It's been a while but I recall doing a fairly complex shallow diagonal edge detector that had to look at 16 x 16 pixel blocks in order to do its job. This ended up taking the form of using internal storage in a large FPGA to build a 16 line FIFO with output taps every line. Now you could read a full 16 lines vertical chunk-o-pixels into the shallow edge processor and let it do its thing.
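The multi-line FIFO with a tap after every line can be modelled in software as one long shift register. The Python sketch below streams pixels in raster order and, once enough rows have arrived, hands out the vertical column of pixels ending at the current one. The function name and the use of a `deque` are choices made for this example; in an FPGA the delay lines would be block RAM or shift-register primitives.

```python
from collections import deque


def column_taps(pixels, width, lines):
    """Stream pixels in raster order through `lines - 1` line delays.

    For each incoming pixel, once enough rows have arrived, yield
    (x, y, column) where column is the vertical strip of `lines` pixels
    ending at that pixel, oldest row first. This mimics a multi-line FIFO
    with an output tap after every line.
    """
    # One long shift register holding the last (lines - 1) rows plus one pixel.
    delay = deque(maxlen=(lines - 1) * width + 1)
    for index, pixel in enumerate(pixels):
        delay.append(pixel)
        if len(delay) == delay.maxlen:
            x, y = index % width, index // width
            # Taps sit exactly one line (width pixels) apart.
            yield x, y, [delay[i * width] for i in range(lines)]
```

Collecting 16 consecutive columns from `column_taps(stream, width, 16)` gives a 16 x 16 block like the one the edge detector needed. Handling of the left and right image borders is left out of this sketch.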
The fact that you are working on a 70k LE Cyclone imposes certain limits, not the least of which is internal memory availability. I haven't used a Cyclone in a long time, I'd have to look and see what resources you might have. That could very well be the source of much of your pain. Don't know.
The Wikipedia entry also has a link to a parallelizable algo from 20+ years ago for CCL. FPGAs certainly parallelize pretty easily. I wonder if your simplified optimum solution is to calculate one cell and replicate it into a 20x20 matrix, or whatever you can fit on your FPGA, and then have a higher level CPU sling work units and stitch overlapping parts together.
More practically, I'd suggest your quick prototype would be to slap a SoC on an FPGA that does it in your favorite low-ish level code, since it only takes hours, then very methodically and smoothly create an acceleration peripheral that begins to do the grunt-iest of the grunt work one little step at a time.
So let's start with just: are there any connections at all? That seems a blindingly simple optimization. Well, that's a bitwise comparison, so replace that in your code with a hardware detection and flag. Next thing you know you've got a counter that automatically, in hardware, skips past all blank space to the first possible pixel... But that's an optimization, maybe not the best place to start.
Next, I suppose if you're doing 4-connected you have some kind of inner loop that looks a lot like the Wikipedia list of 4 possible conditions. Now, rather than having the on-FPGA CPU compare whether you're in the same region one direction at a time, do all 4 dirs at once in parallel in VHDL and output the result in hardware to your code; your code reads it all in and decides which step (if any) was the lowest/first success.
The next step is obviously to move the "what's the first step to succeed?" question outta the software and into the VHDL, so the embedded proc thinks: OK, just read one register to see if it's connected and, if so, in which direction.
Then you start feeding in a stream and setting up a (probably painful) pipeline.
This is a solid bottom-up approach. One painful low level detail at a time, only one at a time, never more than one at a time. Often this is a method to find a local maximum; it's never going to improve the algo (although it'll make it faster...)
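The "check all 4 directions at once, then report the first hit in one register" steps above can be modeled in software roughly like this. It is a sketch under assumptions: the bit order (left, up, right, down), the function names, and the convention that 0 means background are all made up for illustration, and the Python loop stands in for what would be four parallel comparators in VHDL.

```python
# Assumed bit order of the status "register": bit0=left, bit1=up,
# bit2=right, bit3=down. This ordering is an illustrative choice.
OFFSETS = [(-1, 0), (0, -1), (1, 0), (0, 1)]


def neighbor_flags(img, x, y):
    """Return a 4-bit mask of neighbors that share the pixel's value.

    In hardware all four comparisons would happen in the same clock cycle.
    """
    h, w = len(img), len(img[0])
    me = img[y][x]
    mask = 0
    for bit, (dx, dy) in enumerate(OFFSETS):
        nx, ny = x + dx, y + dy
        in_bounds = 0 <= nx < w and 0 <= ny < h
        if in_bounds and me != 0 and img[ny][nx] == me:
            mask |= 1 << bit
    return mask


def first_connected(mask):
    """Priority encoder: index of the lowest set bit, or -1 if none."""
    if mask == 0:
        return -1
    return (mask & -mask).bit_length() - 1


img = [[0, 1, 0],
       [1, 1, 0],
       [0, 0, 0]]
print(neighbor_flags(img, 1, 1))                   # 3 (left and up match)
print(first_connected(neighbor_flags(img, 1, 1)))  # 0 (left wins)
```

The software side then only reads the output of `first_connected`, which mirrors the "just read one register" step.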
"because on FPGAs you are forced to optimize right from the start" Don't do that. Emulate something that works from the start, then create an acceleration peripheral to simplify your SoC code. Eventually remove your onboard FPGA cpu if you're going to interface externally to something big, once the "accelerator" is accelerating enough.
Imagine building your own floating point mult instead of using an off the shelf one ... you don't write the control blocks and control code in VHDL and do the adders later... your first step should be writing a fast adder only later replacing control code and simulated pipelining with VHDL code. You write the full adder first, not the fast carry, or whatever.
The paper that you reference divides the image into regions, so that the merging can start earlier, because labels used in one region are independent of the other regions.
If it starts earlier, it also ends earlier, so that new data can be processed.
In my case, there is no need for such high performance, just a real time requirement of 100fps for 640x480 images, where CCL is used for feature extraction.
The work by Bailey and his group is good enough, and the approach from the referenced paper can be implemented in the future if there is a need for more throughput!
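For a sense of scale, the 100fps at 640x480 requirement works out as below. This is back-of-the-envelope arithmetic only: it ignores blanking intervals, and the 100 MHz clock in the second call is a made-up example value, not something from the actual design.

```python
def pixel_rate_hz(width, height, fps):
    """Active pixels per second, ignoring any blanking intervals."""
    return width * height * fps


def cycles_per_pixel(clock_hz, width, height, fps):
    """Clock cycles available per pixel at a given (assumed) clock."""
    return clock_hz / pixel_rate_hz(width, height, fps)


print(pixel_rate_hz(640, 480, 100))  # 30720000 pixels per second
# Hypothetical 100 MHz processing clock: a little over 3 cycles per pixel.
print(cycles_per_pixel(100_000_000, 640, 480, 100))
```

So a pipeline that accepts one pixel per clock needs to run at roughly 31 MHz or more to keep up.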
My workflow is a lot different from the one that you describe.
I don't use any soft cores, and write everything in VHDL!
I have used soft cores before, but they were kind of not to my liking.
I miss the short feedback loop (my PC is a Mac and the synthesis tools run in a VM).
After trying out a couple of environments, I ended up using open source tools: GHDL for VHDL compilation and simulation, and GTKWave for waveform inspection.
Usually, I start with a testbench that instantiates my empty design under test.
The testbench reads some test image that I draw in Photoshop.
It prints some debugging values, and the wave inspection helps to figure out what's going on.
If it works in the simulator, it usually works on the FPGA!
But the biggest advantage is that it takes just a few seconds to do all that.
I will give the softcore approach another chance once my deadline is over!
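One thing that can complement a testbench flow like the one described above is a small software reference model to compare the simulator's output against. The sketch below is a plain two-pass, 4-connected labeling with union-find in Python. It is not the streaming single-pass design being discussed and not the poster's code; the function names and the 0-means-background convention are assumptions.

```python
def label_components(img):
    """Two-pass 4-connected labeling; returns a label image (0 = background)."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    parent = [0]  # parent[i] is the union-find parent of label i

    def find(a):
        # Follow parents to the root, halving the path as we go.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # First pass: assign provisional labels and record equivalences.
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            left = labels[y][x - 1] if x > 0 else 0
            up = labels[y - 1][x] if y > 0 else 0
            candidates = [l for l in (left, up) if l]
            if not candidates:
                parent.append(len(parent))  # new label, its own root
                labels[y][x] = len(parent) - 1
            else:
                root = min(find(l) for l in candidates)
                labels[y][x] = root
                for l in candidates:
                    parent[find(l)] = root  # merge equivalent labels

    # Second pass: replace each provisional label by its root.
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                labels[y][x] = find(labels[y][x])
    return labels


def count_components(img):
    """Number of distinct foreground regions."""
    return len({l for row in label_components(img) for l in row if l})


u_shape = [[1, 0, 1],
           [1, 0, 1],
           [1, 1, 1]]
print(count_components(u_shape))  # 1
```

The U-shaped test image is the classic case where two provisional labels have to be merged late, so it is a handy first image to feed both the model and the VHDL.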
If your data is coming in at 13.5MHz and you can run your internal evaluation core at 500MHz there's a lot you can do that, all of a sudden, appears "magical".
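The "magic" is mostly a cycle budget. As a quick sanity check of the numbers above (plain arithmetic, with the function name invented for illustration):

```python
def cycles_per_sample(core_hz, sample_hz):
    """Whole core clock cycles available for each incoming sample."""
    return core_hz // sample_hz


# 500 MHz evaluation core, 13.5 MHz input data rate.
print(cycles_per_sample(500_000_000, 13_500_000))  # 37
```

That is 37 core cycles to spend on every incoming sample, which is room for a fair amount of time-multiplexed logic.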
FPGAs do cellular automata pretty well because you can create an ever larger matrix of them until you run into some hardware limit.
This is not exactly what you're trying to do, but it sure is simple and a possible start. I'm guessing when you're done you'll end up with a really smart peripheral that looks like a CA accelerator.
Probably some combo of your pixel's X/Y coord and/or just a (very large) random number.
I would go with X/Y because it requires less memory than a random number.
Besides, random numbers on FPGAs need extra (though not much!) logic to produce them in LFSRs.
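Both options are cheap to model. Below is a sketch in Python of a coordinate-derived label and of a 16-bit Fibonacci LFSR using the commonly cited taps 16, 14, 13, 11. The function names, the +1 offset that reserves 0 for background, and the 640-pixel width are assumptions for illustration; nothing here is taken from an actual design in this thread.

```python
def xy_label(x, y, width):
    """Unique label derived from the pixel position (0 stays free)."""
    return y * width + x + 1


def lfsr16_step(state):
    """One step of a 16-bit Fibonacci LFSR, taps 16, 14, 13, 11.

    Polynomial x^16 + x^14 + x^13 + x^11 + 1; the state must be non-zero.
    """
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


# The largest coordinate label for a 640x480 image fits in 19 bits.
print(xy_label(639, 479, 640).bit_length())  # 19

# The LFSR walks through every non-zero 16-bit state before repeating.
start = 0xACE1
state, period = lfsr16_step(start), 1
while state != start:
    state = lfsr16_step(state)
    period += 1
print(period)  # 65535
```

In hardware the coordinate label costs nothing beyond the X/Y counters that are usually there anyway, while the LFSR adds a shift register and a few XOR gates.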
To a greater or lesser extent it's just fear of the unknown. I could subject your post to copy and paste conversion and it would ring true with the conversion from mech timers to electronic control, or discretes to ICs, or microcode based CPUs, or SBC microprocessors to single chip SoC microcontrollers, etc. The industry will adjust, over time.
"Anyone in software has had the experience of using some open-source module to save time only to end-up paying for it dearly when something doesn't work correctly and help isn't forthcoming."
LOL write it yourself merely means you reinvent the wheel complete with having to discover and patch all the obvious bugs first, before you even begin to catch up to the hard bugs.
"simply wiring together a bunch of includes"
What I'm getting at, is much as no one would be crazy enough to write their own homemade Perl database driver instead of using the world's universal standard to do the job, no one in FPGA land is crazy enough to write their own Z80 core when the T80 core at opencores has about a decade of R+D, and more importantly debugging, behind it. Plus or minus crazy regulatory/licensing requirements of course.
"If you don't understand logic circuits FPGA's are voodoo"
Yes, insane race conditions and clocking issues are "fun". Fast digital is very much analog that is hidden behind the curtain... "...ignore the man behind the curtain..." Then again my car transmission, my wife's coffee maker, my clothes dryer, my microwave, and my dishwasher will never, ever test the boundaries of modern digital logic speeds, so we're back at the 99.9% vs 0.1% argument again.
"you can do nearly everything in web development via freely available includes."
Well, yeah. What I am getting at is writing your own homemade clone of script.aculo.us or buying an expensive clone of it would be a complete disaster compared to just including "the real thing" and using it.
That is not my world at all. My applications have never had the luxury of being able to simply import an 8 bit processor core and a few peripherals and off we go into software land. Nearly everything I've done has been in two domains: real time image processing in hardware or beam-forming applications. In all cases virtually no use of canned modules could be made or justified. Sure, there's the SPI's and I2C's and a few other knick-knacks, but that's about it.
In fact, most of the applications I've done would end-up with external physical embedded processors because the high-speed FPGA resources could not be spared for low speed "command and control" work.
Maybe I'm living in that 0.1% you referred to?
Surely there are people doing more mundane things such as motor control, glue logic or battery chargers who might benefit from wiring together a complete custom embedded system within a single Spartan FPGA and life is great. I can see that being a possibility. It just hasn't been part of my reality, for better or worse.
Like being reminded of an old girl friend you'll never really get over. If I get time, I should get a hobby...
It's a bit more complicated than that. True, C defines two environments called "freestanding" (without OS) and "hosted" (with OS). As per the ISO specification (sect. 5.1.2.2.1 for both 9899:1999 and :2011), main() must return int for hosted implementations, whereas anything goes in freestanding.
Meanwhile, while it is true that most embedded systems are indeed "freestanding" as per the C standard, an increasing number are not. So one needs to be careful about these things.
On the other hand, it only takes a day of writing low-level Verilog to realize that the problem of correctly and efficiently parallelizing algorithms is a hard one. We were using a very early C to Verilog (C2H) compiler from Altera and it worked but was very inefficient in terms of logic element use. I'm sure there's a lot of R&D going on in that space because without significant progress general purpose CPUs or at least cores will remain dominant for some time.