
ZipCPU – A small, lightweight, RISC soft core in Verilog
https://github.com/ZipCPU/zipcpu
======
GeorgeTirebiter
Jan Gray is, I think, more-or-less the Godfather of this approach. I especially enjoyed his Circuit Cellar 3-part series where he describes and builds the xr16 -- a 16-bit machine whose instruction set is designed to match the LUT capabilities of an FPGA well. I implemented it, and with the LCC C compiler, I was able to run plenty of code. Start here: [https://github.com/BigEd/XSOC-xr16/blob/master/doc/xsoc-talk.pdf](https://github.com/BigEd/XSOC-xr16/blob/master/doc/xsoc-talk.pdf), which is the presentation; then check out the rest of the repo docs: [https://github.com/BigEd/XSOC-xr16/tree/master/doc](https://github.com/BigEd/XSOC-xr16/tree/master/doc). Although it's not maintained these days, his old site is also useful: [http://www.fpgacpu.org/xsoc/xr16.html](http://www.fpgacpu.org/xsoc/xr16.html). For more modern stuff, see [http://fpga.org/](http://fpga.org/)

~~~
zipcpu
Jan Gray is one of my heroes. I definitely look up to him, and I've enjoyed my
interactions with him and the encouragement and advice he's offered me.

------
eterps
I always wonder why Niklaus Wirth's RISC5 architecture[1] (not to be confused
with RISC-V) doesn't get more love. From my perspective (but I'm no expert on
this) it seems exceptionally simple and elegant while still being able to run
a desktop OS [2].

[1] [https://inf.ethz.ch/personal/wirth/FPGA-relatedWork/RISC.pdf](https://inf.ethz.ch/personal/wirth/FPGA-relatedWork/RISC.pdf)

[2] [https://i.redd.it/ukagn6imtiv01.png](https://i.redd.it/ukagn6imtiv01.png)

~~~
tyingq
Probably doesn't help that googling for "risc5" brings back RISC-V results.

~~~
eterps
True, the name is unfortunate, but it predates RISC-V quite a bit.

Using 'risc wirth' as a query yields more accurate results.

~~~
kristianp
Or google "risc5" in quotes.

------
zackmorris
Found some info in the repo's issues about how much room ZipCPU takes on an
FPGA:

[https://github.com/ZipCPU/zipcpu/issues/8](https://github.com/ZipCPU/zipcpu/issues/8)

PDFs:

[https://github.com/ZipCPU/zipcpu/blob/master/doc/orconf.pdf](https://github.com/ZipCPU/zipcpu/blob/master/doc/orconf.pdf)

[https://github.com/ZipCPU/zipcpu/blob/master/doc/orconf2017.pdf](https://github.com/ZipCPU/zipcpu/blob/master/doc/orconf2017.pdf)

[https://github.com/ZipCPU/zipcpu/blob/master/doc/orconf2018.pdf](https://github.com/ZipCPU/zipcpu/blob/master/doc/orconf2018.pdf)

Also a list of FPGA dev boards as of 2018:

[https://www.reddit.com/r/FPGA/comments/9yutk8/best_100300_fpga_development_board_in_2018/](https://www.reddit.com/r/FPGA/comments/9yutk8/best_100300_fpga_development_board_in_2018/)

So it looks like ZipCPU takes between 2400 and 7372 LUTs, depending on how optimized the routing is. An Ultra96 with 154k LUTs could hold between about 20 and 64 ZipCPUs, although buses would be an issue.

I'd like to see more of these projects start from a stance of connecting to
other CPUs on the same chip rather than out to main memory. Or rather, a spec
should only be considered complete when it has designs for both an in-mesh
core to be placed beside its neighbors in a grid, and a bus-facing core for
connecting out to main memory.

I think the future is going to be a series of basic blocks like CPUs, RAMs,
maybe even GPUs that can be placed arbitrarily and run as clusters for
tackling big data problems using techniques like map reduce. Or better yet,
maybe FPGAs that look like multiple FPGAs (each with their own RAM) running on
the same chip.

~~~
zipcpu
I have yet to place more than two ZipCPUs on a board, mostly because I'm still working out the various bus details associated with making multiple CPUs masters of the same bus. For example, should they all have the same perspective of the address map? That part I haven't settled yet.

~~~
zackmorris
Oh cool, good to hear from you! If you happen to read this, I would like to
give you a vote on the direction I'm hoping for. My background is in video
games, blitters, MATLAB, basically swizzling large amounts of data in various
DSP-like ways. A cache miss can easily cost 100 cycles, so processors can end up sitting around waiting for data 99% of the time. IMHO pretty much all software today is RAM bus bound, not CPU bound. And this was true in the late 90s, so it's even more true today by about 2-4 orders of magnitude (that's
why single-threaded CPU performance hasn't changed much since the early
2000s).

In the beginning, caching was a nice idea to get around bus latency. But the
last processor design I saw in 1999 when I graduated college had about 75% of
the chip devoted to cache. I would have much preferred to have 3 more cores.
But the industry kept doubling down on that, so today we're still stuck with
CPUs with maybe 4, 8 or 16 cores, but only one bus to RAM.

So to me, hardware caching has been a waste of time when it comes to parallel
computation. I think a better model is a software cache running at a
privileged level above userland. In other words, a smart MMU per CPU, written
in microcode or software that presents a uniform address space to the
application. Each MMU could/should have a hash for each (say 4k) page's data,
and then each CPU/MMU core would act like a router, either returning the
contents of a page for a given hash, or asking its 4 or 8 neighbors if they
contain the page (returning it if so, otherwise asking the next neighbors):

[https://en.wikipedia.org/wiki/Content-addressable_memory](https://en.wikipedia.org/wiki/Content-addressable_memory)
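
To make that lookup concrete, here's a minimal sketch in C. Everything in it is hypothetical: the helper names (local_store_get, send_request, wait_for_response) and the mesh plumbing behind them are invented for illustration, not anything from ZipCPU.

```c
/* Minimal sketch (not ZipCPU code) of the per-core MMU lookup described
 * above. All helper names are hypothetical. */
#include <stdint.h>

#define HASH_LEN      20   /* SHA-1 sized content hash */
#define NUM_NEIGHBORS 4    /* mesh links: N, S, E, W */

typedef struct { uint8_t bytes[HASH_LEN]; } page_hash_t;

/* Assumed to exist elsewhere in this hypothetical MMU runtime. */
void *local_store_get(const page_hash_t *h);             /* NULL on miss */
void  send_request(int neighbor, const page_hash_t *h, int ttl);
void *wait_for_response(const page_hash_t *h);           /* blocks */

/* Resolve a content hash to a resident page, asking the mesh on a miss. */
void *mmu_lookup(const page_hash_t *h)
{
    void *page = local_store_get(h);
    if (page)
        return page;                 /* hit: page already resident */

    /* Miss: ask each neighbor; they answer if they hold the page,
     * otherwise they forward the request with a decremented TTL. */
    for (int n = 0; n < NUM_NEIGHBORS; n++)
        send_request(n, h, /*ttl=*/16);

    /* Block the faulting thread until the page arrives, just like a
     * demand-paged virtual memory miss. */
    return wait_for_response(h);
}
```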

From the software side, this would look like a single deduplicated virtual
memory where each unique page would appear once in the address space, usually
out in the mesh somewhere (not resident) unless its core had cached the page
locally while its code operated on it. At most, a page would only be x+y hops away from any core in the mesh, and a thread in each core should block until the page is resident (just like with any other cache or virtual memory).
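
Here's a toy sketch of the deduplication idea: a content-addressed page store where identical pages hash to the same key and get stored once. A 64-bit FNV-1a stands in for a real cryptographic hash, and all the names are made up for illustration.

```c
/* Toy content-addressed page store showing deduplication: identical
 * pages hash to the same key and are stored once. FNV-1a stands in
 * for a real hash; everything here is hypothetical. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096
#define STORE_SLOTS 1024

static uint64_t fnv1a(const uint8_t *p, size_t n)
{
    uint64_t h = 1469598103934665603ULL;
    while (n--) { h ^= *p++; h *= 1099511628211ULL; }
    return h;
}

static struct {
    uint64_t hash;
    uint8_t  data[PAGE_SIZE];
    int      used;
} store[STORE_SLOTS];

/* Insert a page; duplicate content maps to the same slot, so ten copies
 * of the same string page consume one entry. (Toy: collisions between
 * different pages are not handled.) */
uint64_t store_page(const uint8_t *page)
{
    uint64_t h = fnv1a(page, PAGE_SIZE);
    size_t slot = h % STORE_SLOTS;
    if (!store[slot].used) {
        store[slot].hash = h;
        memcpy(store[slot].data, page, PAGE_SIZE);
        store[slot].used = 1;
    }
    return h;   /* virtual pages record this hash, not a private copy */
}
```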

The next step is to deal with mutation via a copy-on-write mechanism:

[https://en.wikipedia.org/wiki/Copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write)

[https://en.wikipedia.org/wiki/Software_transactional_memory](https://en.wikipedia.org/wiki/Software_transactional_memory)

So if an app has the same string allocated 10 times, only one page would hold
the original data. The other 9 would be mapped to that first page via virtual
memory. Everything could/should be immutable, so when a string is modified, it
allocates a new page for the data with its own unique hash. The original
string would eventually be deallocated through a least recently used (LRU)
strategy, or perhaps each MMU could have some notion of garbage collection
that preserves a single unique page somewhere in the mesh until all cores no
longer reference it. A quick and dirty way to deal with this later would be to recommend that the mesh be hooked up to a large flash memory (over 1 GB) where old pages could be written out and stored indefinitely until storage runs out (the same as with traditional virtual memory).
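
A hedged sketch of that copy-on-write path: since pages are immutable, a write produces a new page with a new content hash instead of mutating in place. The helpers here are assumed, not real interfaces.

```c
/* Sketch of the copy-on-write path: pages are immutable, so a write
 * produces a new page with a new content hash. Helpers are assumed. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
typedef struct { uint8_t bytes[20]; } page_hash_t;    /* SHA-1 sized */

page_hash_t    hash_page(const uint8_t *data);          /* assumed */
const uint8_t *fetch_page(const page_hash_t *h);        /* mesh lookup */
void           store_new_page(const page_hash_t *h, const uint8_t *data);
void           incref(const page_hash_t *h);            /* see GC sketch below */
void           decref(const page_hash_t *h);

/* "Modify" one byte of an immutable page: copy, patch, re-hash. */
page_hash_t cow_write(const page_hash_t *old, size_t off, uint8_t val)
{
    uint8_t copy[PAGE_SIZE];
    memcpy(copy, fetch_page(old), PAGE_SIZE);
    copy[off] = val;

    page_hash_t fresh = hash_page(copy);
    store_new_page(&fresh, copy);   /* new page enters the mesh */
    incref(&fresh);
    decref(old);                    /* old page can be LRU'd once unreferenced */
    return fresh;                   /* caller remaps its page-table entry */
}
```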

The most elegant way I know to implement this would be to run something like
Clojure in each MMU:

[https://clojure.org/about/state](https://clojure.org/about/state)

[https://clojure.org/reference/refs](https://clojure.org/reference/refs)

A more "hands on" way to think of this with ugly syntax is Redux:

[https://www.freecodecamp.org/news/an-intro-to-redux-and-how-state-is-updated-in-a-redux-application-839c8334d1b1/](https://www.freecodecamp.org/news/an-intro-to-redux-and-how-state-is-updated-in-a-redux-application-839c8334d1b1/)

So basically the MMU would treat every read/write like a transaction and
insulate the userland code from having to deal with routing and deduplication.
Then each (say 32 bit) userland memory address would map to one of the CAM
data hashes so that ordinary compiled code can run unmodified. All of the CAM
and STM stuff should happen entirely in the MMU. If we're mapping 32 bit
addresses to data hashed using a 160 bit (20 byte) SHA-1, then the MMU can
treat that 20 byte hash like a pointer and utilize standard garbage collection
algorithms. The easiest one I know is to refcount allocations, freeing the
data when it's no longer referenced:

[https://en.wikipedia.org/wiki/Automatic_Reference_Counting](https://en.wikipedia.org/wiki/Automatic_Reference_Counting)
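
A minimal sketch of what refcounting keyed by the content hash might look like, treating the 20-byte hash like a pointer as described. The table layout, sizes, and the free_page hook are all assumptions, and hash collisions are ignored for brevity.

```c
/* Minimal refcounting keyed by a 20-byte content hash. Table layout
 * and the free_page hook are assumptions; collisions are ignored. */
#include <stdint.h>
#include <string.h>

#define HASH_LEN    20
#define TABLE_SLOTS 4096

typedef struct { uint8_t bytes[HASH_LEN]; } page_hash_t;

static struct { page_hash_t h; uint32_t refs; int used; } table[TABLE_SLOTS];

void free_page(const page_hash_t *h);   /* assumed: evict from the store */

static size_t slot_of(const page_hash_t *h)
{
    uint32_t s;
    memcpy(&s, h->bytes, sizeof s);     /* first 4 hash bytes as index */
    return s % TABLE_SLOTS;
}

void incref(const page_hash_t *h)
{
    size_t s = slot_of(h);
    if (!table[s].used) {
        table[s].h = *h;
        table[s].refs = 0;
        table[s].used = 1;
    }
    table[s].refs++;
}

void decref(const page_hash_t *h)
{
    size_t s = slot_of(h);
    if (table[s].used && --table[s].refs == 0) {
        table[s].used = 0;
        free_page(h);                   /* nothing references it anymore */
    }
}
```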

The hardest part will be handling circular references. Unfortunately this is
an open problem, but maybe you can borrow established techniques from
something like Java:

[https://www.google.com/search?q=garbage+collection+circular+reference](https://www.google.com/search?q=garbage+collection+circular+reference)

A proper MMU implementation should be able to run as something like a kernel
extension so that the whole chip can reach out to other chips in a cluster or
the internet, using a convention similar to IPFS:

[https://en.wikipedia.org/wiki/InterPlanetary_File_System](https://en.wikipedia.org/wiki/InterPlanetary_File_System)

[https://hackernoon.com/a-beginners-guide-to-ipfs-20673fedd3f](https://hackernoon.com/a-beginners-guide-to-ipfs-20673fedd3f)

The goal I want to accomplish is to get away from the constraining SIMD that's
been hyped by video cards, TensorFlow, etc and replace it with the general
purpose MIMD of vector languages like MATLAB but in a parallelized way. From a
user standpoint, we should be able to write code from vanilla C all the way up
to rapid application development languages like Elixir or Go, where everything
is synchronous-blocking, and pipe data to/from other processes via channels
and the Actor model (the same as UNIX executables), so it looks like we're
running in a single computer with say 256 cores and ~256 buses to small local
RAMs so we can use high throughput techniques like MapReduce. Here are some
breadcrumbs for what I'm getting at:

[https://news.ycombinator.com/item?id=17419917](https://news.ycombinator.com/item?id=17419917)

[https://news.ycombinator.com/item?id=18294939](https://news.ycombinator.com/item?id=18294939)

[http://aggregate.org/MOG/](http://aggregate.org/MOG/)

Anyway, sorry this got too long. It's a lot to digest, but I've been thinking
about it for over 20 years. The takeaway is that perhaps you can punt on the
MMU for now, and research adding something like a Lisp machine (a simple stack
machine) as a coprocessor to each ZipCPU core, ideally able to run a subset of
Clojure directly. It doesn't have to be fast or efficient, because it will
mostly only be called for cache misses, similarly to how virtual memory reads
from slower storage. Except in this case, that slow storage is the mesh
itself. The MMU just needs to be able to read/write content-addressable pages
to each of its 4 or 8 neighbors in the mesh, maybe with dedicated data and
hash lines, or else through a packet protocol similar to UDP or Rambus
serialized to a smaller bus, raising interrupts when userland tries to read
data that isn't resident. The rest can be implemented purely through software.
To just "get it done" without an MMU, you could run your own microkernel on
each CPU with a software MMU that maps the 32 bit address space to the CAM and
handles the hash routing.

Or like, do whatever you want man :-) Hope this sparks some inspiration.

~~~
zipcpu
I'm not sure I followed all of that, but let me at least try to look up the
links and see what I can learn. Maybe they'll help me follow the concepts you
are suggesting.

Dan

~~~
zackmorris
I realized after sleeping on it that I should clarify the core idea. With
computer architecture, it's easy to get hung up thinking about things like
I/O, busses, caching, etc. But I think the future is distributed content-
addressable computing, where the state is just "out there" in the internet
somewhere, and the runtime handles all the data locality issues so it looks
like one continuous memory to the application. If we have that, then it's
trivial to add processors either in the local mesh or even out on the internet
somewhere.

Thinking in terms of cores connected to their 4 neighbors with a routing
protocol like "do you have the data for hash ABC, otherwise ask your
neighbors" or "store this data with hash XYZ" reduces the problem space to a
key-value store similar to Redis, so you don't have to worry about hardware
caching.

For now, the network protocol could probably just be something like multicast.
Packets would go to all neighbors, and they would be unconnected, unreliable
like UDP. The header would either be a REQUEST with an N bit hash, or a
RESPONSE with the N bit hash and the data associated with it. There might also
be a TTL field that goes up to roughly sqrt(# of cores) so that packets
diffuse through the network once but don't bounce around endlessly. I think
that cores would be broadcasting new [hash][data] packets for every RAM write,
and other cores would save all of them and evict their oldest hashes (which
allows the cores to pre-cache data that's likely to be used soon). Then, after sending a REQUEST for a certain hash, a core would block until the RESPONSE arrived from the cluster in roughly sqrt(# of cores) cycles.
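
One possible wire format for this, sketched in C; the field names, sizes, and helper functions are all invented for illustration.

```c
/* One possible wire format for the REQUEST/RESPONSE protocol above.
 * Field names, sizes, and helpers are invented for illustration. */
#include <stdint.h>

#define HASH_LEN  20
#define PAGE_SIZE 4096

enum pkt_type { PKT_REQUEST = 0, PKT_RESPONSE = 1 };

struct mesh_pkt {
    uint8_t type;               /* PKT_REQUEST or PKT_RESPONSE */
    uint8_t ttl;                /* ~sqrt(#cores); decremented per hop */
    uint8_t hash[HASH_LEN];     /* content address of the page */
    uint8_t data[PAGE_SIZE];    /* valid only for PKT_RESPONSE */
};

int  have_page(const uint8_t *hash);                       /* assumed */
void reply_with_page(const uint8_t *hash);
void forward_to_neighbors(struct mesh_pkt *p);
void cache_page(const uint8_t *hash, const uint8_t *data);
void wake_waiters(const uint8_t *hash);

/* Per-hop handling: answer if we hold the page, else flood onward. */
void on_packet(struct mesh_pkt *p)
{
    if (p->type == PKT_REQUEST) {
        if (have_page(p->hash)) {
            reply_with_page(p->hash);
        } else if (p->ttl > 0) {
            p->ttl--;
            forward_to_neighbors(p);    /* unreliable, UDP-like flood */
        }                               /* TTL exhausted: drop silently */
    } else {
        cache_page(p->hash, p->data);   /* also pre-caches broadcasts */
        wake_waiters(p->hash);          /* unblock threads waiting on it */
    }
}
```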

From a code standpoint, a page in the 32 bit address space would have a real
pointer but it would map through the virtual memory manager (VMM) to one of
these [hash][data] pages. So two processes in the cluster wouldn't have the
same address for the same piece of data. This can be looked at as a feature,
not a bug, because it enforces process separation similar to Erlang. So when a
page's data is updated, the page gets swapped out by the VMM and re-hashed to
a new content-addressable page out in the cluster. So it might make sense to
have a per-CPU coprocessor that hashes pages for every RAM write. Or maybe the hashing could happen in the RAM write interrupt and we'd settle for the speed hit, not sure. Also, it might be good to choose a hash that can recompute a partial range rather than the whole page when a single byte changes. Maybe a custom hasher could use a divide-and-conquer strategy: find which half of the page changed, then which half of that half, down to some level, and recompute and append the sub-hashes (much like the MD5 appending flaw, where two documents with different data in the middle hash to the same value because the hash is built up block by block, so the documents can have very different behavior). That way a write would only recalculate a half, or a quarter, or an eighth, down to some granularity, optimizing the hashing from O(n) to O(n/64) or whatever. Also, after writing this out, I realized that the VMM would store (for SHA-1) a 20 byte hash for every 4k page, so roughly 1/200 of each core's RAM would go to this storage scheme.
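
For what it's worth, this partial-rehash idea is essentially a hash tree (Merkle tree): hash fixed-size chunks, then hash the chunk hashes together, so a one-byte write only re-hashes its chunk plus the combine step. A toy two-level sketch, with FNV-1a standing in for a real hash and all names made up:

```c
/* Toy two-level hash tree over a 4k page: 64 chunks of 64 bytes, hashed
 * individually, then combined into a root. Changing one byte re-hashes
 * one chunk plus the combine pass instead of the whole page. */
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE  4096
#define CHUNKS     64
#define CHUNK_SIZE (PAGE_SIZE / CHUNKS)

static uint64_t fnv1a(const uint8_t *p, size_t n)
{
    uint64_t h = 1469598103934665603ULL;
    while (n--) { h ^= *p++; h *= 1099511628211ULL; }
    return h;
}

struct tree_hash {
    uint64_t leaf[CHUNKS];   /* cached per-chunk hashes */
    uint64_t root;           /* the page's content address */
};

static uint64_t combine(const uint64_t *leaves)
{
    return fnv1a((const uint8_t *)leaves, sizeof(uint64_t) * CHUNKS);
}

/* Full O(n) hash, done once when a page is first stored. */
void tree_hash_full(struct tree_hash *t, const uint8_t *page)
{
    for (int i = 0; i < CHUNKS; i++)
        t->leaf[i] = fnv1a(page + i * CHUNK_SIZE, CHUNK_SIZE);
    t->root = combine(t->leaf);
}

/* Incremental update after a write at offset off: one chunk + combine. */
void tree_hash_update(struct tree_hash *t, const uint8_t *page, size_t off)
{
    size_t i = off / CHUNK_SIZE;
    t->leaf[i] = fnv1a(page + i * CHUNK_SIZE, CHUNK_SIZE);
    t->root = combine(t->leaf);
}
```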

Would this VMM be easier to build than an MMU? I don't know, but I do know
that software is cheaper than hardware so it's maybe possible.

------
dang
A thread from 10 months ago:
[https://news.ycombinator.com/item?id=18138174](https://news.ycombinator.com/item?id=18138174)

Half a dozen blog post discussions as well:
[https://hn.algolia.com/?query=ZipCPU%20points%3E10&sort=byDate&dateRange=all&type=story&storyText=false&prefix=false&page=0](https://hn.algolia.com/?query=ZipCPU%20points%3E10&sort=byDate&dateRange=all&type=story&storyText=false&prefix=false&page=0)

------
rwmj
Of course developers are free to do what they like, but I think it's a bit of
a shame that it's not RISC-V. He's missing out on all the toolchain support in
GCC and Clang, plus the ability to run various embedded operating systems
including, these days, seL4. RISC-V has an embedded profile (RV32E) and a tiny
implementation of that (PicoRV32) so I'm fairly sure it could be suitable for
this.

~~~
tyingq
He points out the reasons it is the way it is here:
[http://zipcpu.com/about/zipcpu.html](http://zipcpu.com/about/zipcpu.html)

~~~
debatem1
I don't see anything specific about RISC-V here. Am I missing something?

~~~
tyingq
Reasons like learning, targeting very small (low LUT count) devices, etc., which wouldn't be satisfied by a 32-bit, Linux-capable RISC-V.

For example _" I wanted to know how to build a CPU. This includes learning how
to build not only a CPU, but also backends for the C-library, GCC and binutils
(GDB soon to come)."_

------
bjt2n3904
This guy posts some amazing content on Twitter, his account is really fun to
follow.

------
bayesian_horse
Just out of curiosity: What exactly would you need this for?

Is it just about learning how to design a CPU? Is it about solving a specific
task a soft core on an FPGA can do better? What kind of use cases are there
where an FPGA with a RISC core is better suited than some kind of
microprocessor?

~~~
q3k
Sometimes when designing for an FPGA/ASIC, the easiest way to implement some
sort of complex control logic is to use a small CPU.

For example, the Nvidia Tegra X1 SoC contains a handful of Falcon CPUs (proprietary ISA and core), responsible for, among other things:

- USB 3.0 command/event processing

- Video Image Compositor 'inverse telecine' processing

- HDMI security

The same goes for Nvidia's PC GPUs - they have Falcons sprinkled across the entire thing.

~~~
rcxdude
Heck, one SoC I know of has an entire Cortex-M3 core just to manage entering and exiting sleep modes.

------
sehugg
I didn't know yosys did formal verification now:
[http://zipcpu.com/blog/2017/10/19/formal-intro.html](http://zipcpu.com/blog/2017/10/19/formal-intro.html)

Cool.

~~~
zipcpu
If that article interests you, you might find this one even more interesting:

[http://zipcpu.com/blog/2018/04/02/formal-cpu-bugs.html](http://zipcpu.com/blog/2018/04/02/formal-cpu-bugs.html)

Basically, I was shocked when I applied SymbiYosys to the ZipCPU at how many
bugs I found. Bug lists like this are typically company secrets, since no one
wants to reveal how many bugs their product has. In my case, part of my
success has been my openness.

------
jacobush
GPLv3 may cause issues when embedding it in your own SoC or whatever,
especially given the very proprietary nature of toolchains. Not that I mind,
just sayin'

~~~
jsjohnst
> Because Gisselquist Technology, LLC, owns all of the code for the ZipCPU and
> its peripherals, proprietary licenses may be purchased.

~~~
zipcpu
Realistically, those who've wanted to use the ZipCPU tend to just chat with me
about it and we work out a handshake agreement or some such.

Were someone to want to use it in a commercial product, I'd be more interested
in some form of compensation. For now it's filling the education role quite
nicely -- which is somewhat of a surprise for me since I never saw that coming
when I started working on it in the first place.

~~~
jsjohnst
Awesome! Love that you are flexible in that regard.

I was merely pointing out to GP that it wasn’t “only GPL licensed”; there are other options you provide.

