If you want to use it in your closed source project, give me a holler. I'm sure we could work something out. Even if we don't, I'd love to have the opportunity to hear from you, meet you, say hello, and encourage you in any way I can.
Realistically, those who've wanted to use the ZipCPU tend to just chat with me about it and we work out a handshake agreement or some such.
Were someone to want to use it in a commercial product, I'd be more interested in some form of compensation. For now it's filling the education role quite nicely -- which is somewhat of a surprise for me since I never saw that coming when I started working on it in the first place.
Basically, I was shocked at how many bugs I found when I applied SymbiYosys to the ZipCPU. Bug lists like this are typically company secrets, since no one wants to reveal how many bugs their product has. In my case, part of my success has been my openness.
Next week's project will be programming the software for a PicoRV inside the bus structure created by AutoFPGA, so I am very aware of the RISC-V architectures ... now.
I would very much like to build a RISC-V 64-bit variant (there are already many 32-bit variants), but so far the task isn't funded. Well, neither was the ZipCPU when I built it, but now such an upgrade would need to compete with other paying projects I'm working on whereas I didn't have an income when I started working on the ZipCPU.
Think of it this way: what do you need an FPGA for? Now, when budgets are tight, how do you get more bang for your buck? Rather than placing more and more state machines on an FPGA, you can offload such tasks onto a CPU. Indeed, the more (slow, non-time-sensitive, complex, etc.) tasks you can offload onto the CPU, the more fabric is available for ... whatever task you actually wanted to place onto the FPGA.
Hm, but then why don't they use a discrete CPU built for this purpose? Many FPGA development boards already come with an ARM MCU.
I'd guess that when the CPU core is on the FPGA, the communication between the CPU and whatever the custom logic does is faster and more in-sync.
Thank you for answering. I have seen a couple of resources on using FPGAs, with those being increasingly accessible to "hobbyists". I often fail to see the "why" in the examples.
This is a CPU made out of pure open source software thin air, which is a special and incalculable open-ended value all its own.
You still have to buy an FPGA to run it, but you don't have to buy any particular FPGA from any single particular vendor. It can be any FPGA, including ones like the iCE40 which at least do not require any closed vendor software to program them.
This also includes theoretical future FPGAs that are themselves some open hardware design that any fab could make to order, i.e. even more open than Lattice, the way ordering a PCB is today.
Also, unlike any hardware CPU, a user can modify this CPU in whatever unknown, unpredictable ways they might want. Say I like this CPU, but I need to add a counter register, some weird tri-state interrupt line, or a magic instruction. Or ... let's try a real idea: how about making the CPU hide its activity from external analysis for security reasons, by having it always execute some sort of instruction at all times, with no no-op (no-op becomes rand-op)? This burns power, but it also means you can't tell when it is decrypting an SSL packet. A user can do something unpredictably kooky like that the minute they merely have the idea or the wish. They don't have to beg Arm or Intel to make them a CPU that does it, or live without the odd thing they want.
It's a completely worthy project, whether or not it has any immediate commercial application.
As Mithro of SymbiFlow points out, you would be insane to use a proprietary vendor C compiler today. The open source ones outclassed them all decades ago.
So it is with practically everything, whether it knows it yet or not, including in this case, cpu design.
I assume such projects are worthy. When I don't understand why, though, I've found this a very valuable question to ask.
Clearly, when people invest so much time in a project, they have a good reason (for them), even when they are emulating an Apollo guidance computer in Minecraft redstone.
I have yet to place more than two ZipCPUs on a board, mostly because I'm still working out the various bus details associated with making multiple CPUs masters of the same bus. For example, should they all have the same perspective of the address map? This I'm still working out.
Oh cool, good to hear from you! If you happen to read this, I would like to give you a vote on the direction I'm hoping for. My background is in video games, blitters, MATLAB, basically swizzling large amounts of data in various DSP-like ways. A cache miss can easily cost 100 cycles, so processors can end up sitting around waiting for data 99% of the time. IMHO pretty much all software today is RAM-bus bound, not CPU bound. And this was true in the late 90s, so it is even more true today by about 2-4 orders of magnitude (that's why single-threaded CPU performance hasn't changed much since the early 2000s).
In the beginning, caching was a nice idea to get around bus latency. But the last processor design I saw in 1999 when I graduated college had about 75% of the chip devoted to cache. I would have much preferred to have 3 more cores. But the industry kept doubling down on that, so today we're still stuck with CPUs with maybe 4, 8 or 16 cores, but only one bus to RAM.
So to me, hardware caching has been a waste of time when it comes to parallel computation. I think a better model is a software cache running at a privileged level above userland; in other words, a smart MMU per CPU, written in microcode or software, that presents a uniform address space to the application. Each MMU could/should keep a hash of each (say, 4k) page's data, and each CPU/MMU core would act like a router: either returning the contents of a page for a given hash, or asking its 4 or 8 neighbors whether they contain the page (returning it if so, otherwise asking the next neighbors):
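The router-like lookup could be sketched like this (a toy Python illustration, not an existing design; the class, the recursive flood, and the TTL bound are all my own assumptions):

```python
import hashlib

class MeshNode:
    """One CPU/MMU core: serves pages it holds, else floods the
    request to its mesh neighbors, caching any answer on the way back."""
    def __init__(self):
        self.pages = {}      # sha1 hex digest -> 4k page data
        self.neighbors = []  # the 4 or 8 adjacent MeshNode objects

    def store(self, data: bytes) -> str:
        """Insert a page locally; its hash is its address."""
        h = hashlib.sha1(data).hexdigest()
        self.pages[h] = data
        return h

    def lookup(self, h: str, ttl: int = 3):
        """Return the page for hash h, searching up to ttl hops out."""
        if h in self.pages:
            return self.pages[h]
        if ttl == 0:
            return None
        for n in self.neighbors:
            data = n.lookup(h, ttl - 1)
            if data is not None:
                self.pages[h] = data   # cache locally for next time
                return data
        return None
```

Wiring three nodes in a line and looking up a page stored two hops away shows the "ask your neighbors" diffusion, with the page ending up cached at every node along the return path.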
From the software side, this would look like a single deduplicated virtual memory where each unique page would appear once in the address space, usually out in the mesh somewhere (not resident) unless its core had cached the page locally while its code operated on it. A page would be at most an x+y hop count away from any core, and a thread in each core would block until the page is resident (just like with any other cache or virtual memory).
The next step is to deal with mutation via a copy-on-write mechanism:
So if an app has the same string allocated 10 times, only one page would hold the original data. The other 9 would be mapped to that first page via virtual memory. Everything could/should be immutable, so when a string is modified, it allocates a new page for the data with its own unique hash. The original string would eventually be deallocated through a least recently used (LRU) strategy, or perhaps each MMU could have some notion of garbage collection that preserves a single unique page somewhere in the mesh until all cores no longer reference it. A quick and dirty way to deal with this later would be to recommend that the mesh be hooked up to a large flash memory (over 1 GB) where old pages could be stored indefinitely until storage runs out (the same as with traditional virtual memory).
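The dedup-plus-LRU idea above can be sketched in a few lines of Python (standing in for MMU microcode; the class and capacity number are mine, and only the SHA-1 hashing and LRU policy come from the text):

```python
import hashlib
from collections import OrderedDict

class DedupStore:
    """Content-addressed, immutable pages: identical data shares one
    stored page, and old pages fall out via LRU eviction."""
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.pages = OrderedDict()   # hash -> page data, oldest first

    def put(self, data: bytes) -> str:
        h = hashlib.sha1(data).hexdigest()
        if h in self.pages:
            self.pages.move_to_end(h)          # duplicate: share one copy
        else:
            self.pages[h] = data               # new content, new page
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False) # evict least recently used
        return h

    def modify(self, data: bytes) -> str:
        """Copy-on-write: pages are immutable, so modified data
        simply becomes a new page under a new hash."""
        return self.put(data)
```

Allocating the same string 10 times yields 10 identical hashes but only one stored page, which is the deduplication property being described.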
The most elegant way I know to implement this would be to run something like Clojure in each MMU:
So basically the MMU would treat every read/write like a transaction and insulate the userland code from having to deal with routing and deduplication. Then each (say 32 bit) userland memory address would map to one of the CAM data hashes so that ordinary compiled code can run unmodified. All of the CAM and STM stuff should happen entirely in the MMU. If we're mapping 32 bit addresses to data hashed using a 160 bit (20 byte) SHA-1, then the MMU can treat that 20 byte hash like a pointer and utilize standard garbage collection algorithms. The easiest one I know is to refcount allocations, freeing the data when it's no longer referenced:
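A minimal refcounting sketch along those lines (toy Python; the 4k pages, 20-byte SHA-1 hashes, and refcount GC come from the description above, while the class and method names are my own):

```python
import hashlib

PAGE_SHIFT = 12  # 4 KiB pages

class HashMMU:
    """Maps 32-bit virtual addresses to content hashes, treating each
    hash like a pointer and refcounting it: when no page-table entry
    references a hash anymore, its data is freed."""
    def __init__(self):
        self.page_table = {}  # virtual page number -> sha1 hex digest
        self.storage = {}     # digest -> page data (the CAM)
        self.refs = {}        # digest -> number of mappings

    def map_page(self, addr: int, data: bytes) -> None:
        """Point a virtual page at new content, unref'ing the old."""
        vpn = addr >> PAGE_SHIFT
        h = hashlib.sha1(data).hexdigest()
        old = self.page_table.get(vpn)
        self.page_table[vpn] = h
        self.storage[h] = data
        self.refs[h] = self.refs.get(h, 0) + 1
        if old is not None:
            self._unref(old)

    def read(self, addr: int) -> int:
        vpn = addr >> PAGE_SHIFT
        offset = addr & ((1 << PAGE_SHIFT) - 1)
        return self.storage[self.page_table[vpn]][offset]

    def _unref(self, h: str) -> None:
        self.refs[h] -= 1
        if self.refs[h] == 0:   # nothing maps this content anymore
            del self.refs[h], self.storage[h]
```

Two virtual pages holding identical data share one stored page with a refcount of 2; remapping both of them away frees it.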
The hardest part will be handling circular references. Unfortunately this is an open problem, but maybe you can borrow established techniques from something like Java:
A proper MMU implementation should be able to run as something like a kernel extension so that the whole chip can reach out to other chips in a cluster or the internet, using a convention similar to IPFS:
The goal I want to accomplish is to get away from the constraining SIMD that's been hyped by video cards, TensorFlow, etc., and replace it with the general-purpose MIMD of vector languages like MATLAB, but in a parallelized way. From a user standpoint, we should be able to write code from vanilla C all the way up to rapid application development languages like Elixir or Go, where everything is synchronous and blocking, and pipe data to/from other processes via channels and the Actor model (the same as UNIX executables), so it looks like we're running in a single computer with, say, 256 cores and ~256 buses to small local RAMs, letting us use high-throughput techniques like MapReduce. Here are some breadcrumbs for what I'm getting at:
Anyway, sorry this got too long. It's a lot to digest, but I've been thinking about it for over 20 years. The takeaway is that perhaps you can punt on the MMU for now, and research adding something like a Lisp machine (a simple stack machine) as a coprocessor to each ZipCPU core, ideally able to run a subset of Clojure directly. It doesn't have to be fast or efficient, because it will mostly only be called on cache misses, similarly to how virtual memory reads from slower storage. Except in this case, that slow storage is the mesh itself. The MMU just needs to be able to read/write content-addressable pages to each of its 4 or 8 neighbors in the mesh, maybe with dedicated data and hash lines, or else through a packet protocol similar to UDP or Rambus serialized to a smaller bus, raising interrupts when userland tries to read data that isn't resident. The rest can be implemented purely in software. To just "get it done" without an MMU, you could run your own microkernel on each CPU with a software MMU that maps the 32 bit address space to the CAM and handles the hash routing.
Or like, do whatever you want man :-) Hope this sparks some inspiration.
I'm not sure I followed all of that, but let me at least try to look up the links and see what I can learn. Maybe they'll help me follow the concepts you are suggesting.
I realized after sleeping on it that I should clarify the core idea. With computer architecture, it's easy to get hung up thinking about things like I/O, busses, caching, etc. But I think the future is distributed content-addressable computing, where the state is just "out there" in the internet somewhere, and the runtime handles all the data locality issues so it looks like one continuous memory to the application. If we have that, then it's trivial to add processors either in the local mesh or even out on the internet somewhere.
Thinking in terms of cores connected to their 4 neighbors with a routing protocol like "do you have the data for hash ABC, otherwise ask your neighbors" or "store this data with hash XYZ" reduces the problem space to a key-value store similar to Redis, so you don't have to worry about hardware caching.
For now, the network protocol could probably just be something like multicast. Packets would go to all neighbors, and they would be connectionless and unreliable, like UDP. The header would either be a REQUEST with an N bit hash, or a RESPONSE with the N bit hash and the data associated with it. There might also be a TTL field that goes up to roughly sqrt(# of cores) so that packets diffuse through the network once but don't bounce around endlessly. I think cores would broadcast new [hash][data] packets for every RAM write, and other cores would save all of them and evict their oldest hashes (which lets cores pre-cache data that's likely to be used soon). Then, upon requesting a RESPONSE for a certain hash, a core would block until it arrived from the cluster in roughly sqrt(# of cores) cycles.
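The packet format described above might be encoded something like this (a toy Python sketch; the one-byte type and TTL fields and the SHA-1 hash width are arbitrary choices of mine):

```python
import hashlib
import struct

REQUEST, RESPONSE = 0, 1
HASH_BYTES = 20  # SHA-1 digest width

def make_request(h: bytes, ttl: int) -> bytes:
    """[type:1][ttl:1][hash:20] -- ask the mesh for a page."""
    return struct.pack("BB", REQUEST, ttl) + h

def make_response(h: bytes, data: bytes) -> bytes:
    """[type:1][ttl:1][hash:20][data] -- answer with the page."""
    return struct.pack("BB", RESPONSE, 0) + h + data

def parse(pkt: bytes):
    kind, ttl = struct.unpack_from("BB", pkt)
    h = pkt[2:2 + HASH_BYTES]
    data = pkt[2 + HASH_BYTES:] if kind == RESPONSE else b""
    return kind, ttl, h, data

def forward(pkt: bytes):
    """Decrement TTL on rebroadcast; None means stop diffusing."""
    kind, ttl, h, data = parse(pkt)
    if ttl == 0:
        return None
    return struct.pack("BB", kind, ttl - 1) + h + data
```

The TTL decrement in forward() is what bounds diffusion to roughly sqrt(# of cores) hops, so a request floods outward once rather than bouncing around the mesh forever.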
From a code standpoint, a page in the 32 bit address space would have a real pointer, but it would map through the virtual memory manager (VMM) to one of these [hash][data] pages. So two processes in the cluster wouldn't have the same address for the same piece of data. This can be looked at as a feature, not a bug, because it enforces process separation similar to Erlang. When a page's data is updated, the page gets swapped out by the VMM and re-hashed to a new content-addressable page out in the cluster. So it might make sense to have a per-CPU coprocessor that hashes pages on every RAM write. Or maybe the hashing could happen in the RAM-write interrupt and we'd settle for the speed hit, not sure. Also, it might be good to choose a hash that lets you re-compute a partial range rather than having to redo the whole thing when a single byte changes. Maybe a custom hasher could use a divide-and-conquer strategy: hash each half of the page, then each quarter, and so on down to some granularity, combining the sub-hashes at each level, so a small write only requires recomputing one piece per level. (Block-based hashes already behave a bit like this, which is what the MD5 appending flaw exploits: two documents that differ only in middle blocks can end up with the same overall hash.) That would optimize the hashing from O(n) to something like O(n/64) plus a few combining steps. Also, after writing this out, I realized that the VMM would store (for SHA-1) a 20 byte hash for every 4k block, so roughly 1/200 of each core's RAM would go to this storage scheme.
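The divide-and-conquer hashing being described is essentially a Merkle tree over the page. A toy sketch (Python; the 64-byte leaf size and the class are my own assumptions) where a one-byte write invalidates only the O(log n) subtree hashes covering it:

```python
import hashlib

LEAF = 64  # rehash granularity in bytes (arbitrary choice)

class MerklePage:
    """Hashes a page as a binary tree of subranges, caching every
    subtree digest so a small write only forces O(log n) rehashes."""
    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.cache = {}  # (lo, hi) -> cached subtree digest

    def root(self) -> bytes:
        return self._hash(0, len(self.data))

    def _hash(self, lo: int, hi: int) -> bytes:
        if (lo, hi) in self.cache:
            return self.cache[(lo, hi)]
        if hi - lo <= LEAF:
            d = hashlib.sha1(bytes(self.data[lo:hi])).digest()
        else:
            mid = (lo + hi) // 2
            d = hashlib.sha1(self._hash(lo, mid) +
                             self._hash(mid, hi)).digest()
        self.cache[(lo, hi)] = d
        return d

    def write(self, offset: int, value: int) -> None:
        self.data[offset] = value
        # Drop only the cached digests whose range covers the write.
        self.cache = {k: v for k, v in self.cache.items()
                      if not (k[0] <= offset < k[1])}
```

For a 4k page with 64-byte leaves, the tree has 127 nodes and a single-byte write invalidates just 7 of them (one per level), which is the O(n) to roughly O(n/64) saving mentioned above.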
Would this VMM be easier to build than an MMU? I don't know, but I do know that software is cheaper than hardware so it's maybe possible.
The ZipCPU and several peripherals can fit within an iCE40 8k using only about 5k LUTs--and that's with the multiply unit, the divide unit, the pipelined fetch unit, and a debugging bus infrastructure. While that isn't the full pipelined implementation, and the caches don't fit, it is sufficient to be competitive with the PicoRV.