The MOnSter 6502 (tubetime.us)
383 points by mmastrac on May 16, 2016 | 74 comments

Ok, wow. That is pretty stunning. That is months of painstaking layout and research. I am in absolute awe.

Windell and I talked about building a discrete 7400 (quad NAND gate). In theory you can build any logic you want out of enough NAND gates :-) but implementing the PDP-8's CPU that way was going to be a very, very big arrangement. At 13 in on a side, this is pretty manageable. A lot of boards for the MicroVAX are larger than that.
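To make the "any logic out of enough NAND gates" remark concrete, here is a minimal Python sketch. The function names (`nand`, `not_`, `and_`, `or_`, `xor_`, `full_adder`) are just illustrative; nothing here comes from the MOnSter 6502 design itself. Every gate is defined only in terms of `nand`, and a one-bit full adder is then built on top.

```python
def nand(a: int, b: int) -> int:
    """Two-input NAND on single bits (0 or 1)."""
    return 1 - (a & b)

# Everything below is composed purely from nand().
def not_(a: int) -> int:
    return nand(a, a)

def and_(a: int, b: int) -> int:
    return not_(nand(a, b))

def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))

def xor_(a: int, b: int) -> int:
    # Classic four-NAND XOR.
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Return (sum_bit, carry_out) for three input bits."""
    partial = xor_(a, b)
    sum_bit = xor_(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return sum_bit, carry_out
```

Chaining eight such adders gives an 8-bit ripple-carry adder, which is the flavor of thing a "discrete 7400" build would be wiring up by hand.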

Now the question is if you make a proportionally sized 40 pin package for it, how big is the Apple II motherboard? :-)

Nicely done, congrats on an awesome project. I must go to Makerfaire now to see this in person.

He addresses the Apple ][ idea. Since his board will be limited to 100s of kHz (a PCB has signalling issues that an IC doesn't, which limit the max clock rate) and the Apple ][ had HW with dependencies on the CPU being at a certain rate, just building an adapter will not work.

"the apple ][ had HW with dependencies on the CPU being at a certain rate"

On the other hand, something like a KIM-1 would be totally doable, although there would have to be changes to the cassette interface and TTY interface software timing loops. Oh, and I suppose if you clocked below 10 kHz or something then the keypad debounce routines might need some work. But from memory there would be no hardware changes necessary to downclock a KIM-1. There is a hardware PLL in the cassette interface input, but all it does is sniff whether the incoming tone is above or below some frequency.

A KIM-1 was a rather simple 6502 single board computer from the dawn of the microcomputer era.

ChuckMcM doesn't propose hooking it up to an Apple ][ motherboard, he suggests building a replica motherboard at the same scale. At about 75:1 scale (6502 was about 4mm), that motherboard would be over twenty meters wide.

I'm in. Just need to set our watches back, right?

I also want to check this out in person. What day are you demoing it? Is there a booth, table, etc.?

My experience presenting at maker faire is that you don't know where your booth is going to be until you show up on Friday morning. What would help is to know what section it's in and what the name of the booth will be.

this is correct. it'll probably be in the expo hall. the exhibit is called "Making Dis-Integrated Circuits", see http://makerfaire.com/maker/entry/55149/

A curiosity question for people in this thread who understand circuit design: How efficient is the design of the 6502 given modern knowledge of CPU design? If we designed a similar CPU from scratch in 2016, could we make one that works much better (by whatever metric makes sense) and uses far fewer transistors?

All in all, its design is not that bad. It shares some slight similarity with Patterson's later RISC design ideas in that it is a very reduced, simple design with zero fancy instructions. Just like in RISC chips, for anything beyond the very basics of math/boolean operations you have to write code, because the chip does not have a 'custom instruction' for that. It also has a slight overlap in its processing and memory fetches that can be viewed as a slight pipelining. Nothing impressive vs. an IBM 360/91 mainframe or any modern CPU with pipelining, however.

As for a 2016 design in fewer transistors, that's unlikely. You need a certain number of transistors for each basic function, so there's a floor where if you want all the features of a 6502 (what little there were) you can't possibly do it without X transistors minimum.

But what 2016's tech would bring is a 6502 that, instead of being clocked at 1.5-2 MHz, might be clocked at a couple of GHz. Performance-wise such a chip would be nearly infinitely faster than an original 6502, but would pale in comparison to a modern Intel chip with all the extras (cache, instruction translation, out of order issue, branch predictor, etc.) included in those chips. The 6502 would also be quite acutely sensitive to the speed of memory, so the chip would not be able to run any faster than RAM could feed it data (it has only three user accessible 8-bit registers, and only one of those can be used for computations). It worked with 1970s tech because 1970s memories were as fast as it was, so it was not slowed down by a huge memory vs. CPU speed differential. This lack of registers is where its design diverges from RISC tech as detailed by Patterson.

What a modern 6502 design might do, however, is be extremely power efficient. A modern 2016 CMOS design clocked at 1970's speeds might use very little power. Whether it would beat ARM in that market is an unknown.

Its biggest limitation for a 2016 design that is true to the original is being only an 8-bit chip with only a 16-bit address bus. Having 64k of RAM max on one's CPU in 2016 is going to crimp what solutions it might be useful for vs. using an ARM chip for the same solution.

The 6502 is still being produced in 2016, mostly not as an individual processor but as a core for various controllers in things like mice, keyboards, toys[1], monitors, and other electronics[2]

[1] http://hackaday.com/2013/05/24/tamagotchi-rom-dump-and-rever...

[2] http://electronics.stackexchange.com/questions/168867/how-to...

They (WDC) still sell the 65c02 (CMOS) variant and 65c816 in enough volume to justify producing them, and they are sold through Mouser, among other places (along with microcontroller variants of both that include integrated peripherals, etc.)

A new 65c02 clocks up to 20 MHz (apparently) without any real issues. And like you said, WDC can provide custom cores that do much more.

> But, what 2016's tech would bring is a 6502 that instead of being clocked at 1.5-2Mhz might be clocked instead at a couple Ghz.

For most applications in 2016 that could actually use a 6502, they probably don't need it to run that fast (and more speed would mean more power usage). New pin-compatible 6502s are still sold today for embedded applications: http://www.tomshardware.com/news/mouser-6502-motorola-6800-c...

But if not for pin-compatibility, instead of being 4.3mm x 4.7mm, a 6502 could be a speck of dust, with the only size constraints being connections to other components.

> What a modern 6502 design might do, however, is be extremely power efficient. A modern 2016 CMOS design clocked at 1970's speeds might use very little power. Whether it would beat ARM in that market is an unknown.

The 6502 design from 2012 that I linked above uses 300µA.

> The 6502 would also be quite acutely sensitive to the speed of memory, so the chip would not be able to run any faster than RAM could feed it data (it has only three user accessible 8-bit registers, and only one of those can be used for computations). It worked with 1970's tech because 1970's memory's were as fast as it was so it was not slowed down by a huge memory vs cpu speed differential.

The 6502 also only addressed 64kB of memory (though variants like the 6509 supported up to 1MB via multiple banks). With 2016 technology, you could easily supply all the RAM a 6502 could ever want or use as on-die SRAM that matches the CPU speed.

> what 2016's tech would bring is a 6502 that instead of being clocked at 1.5-2Mhz might be clocked instead at a couple Ghz. Performance wise such a chip would be nearly infinitely faster

Well, a thousand times faster.

Ok, so what integer would qualify as "nearly infinite"?

Well, none.

Thanks, what an awesomely comprehensive response!

Not 6502, but I have for a long time wondered what alternative ISA than the Z80 one could have produced assuming the same 8500 transistor budget and technology. Notably, something that was vastly more compiler friendly.

"Efficient", as measured by what? It's a couple thousand gates or so, which is enough to do an 8 bit microcontroller that doesn't have to worry about having a protected mode and such. Also, there are not many registers, so the latch count is pretty small. As measured by die-area per delivered functionality, it does pretty well, and if you compare 6502 code with 6800 or 1802 code, it is pretty dense for the functionality that you get out of it. That said, I always had my complaints about the 6502, mainly lack of 16 bit index registers that could reach the entire memory space. That forced a coding idiom that relied on indirect addressing through page 0. 6800 had a 16 bit index register.
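For anyone who hasn't written 6502 code, the page-0 idiom mentioned above is the indirect indexed mode, usually written `(zp),Y`: a 16-bit pointer lives in two consecutive zero-page bytes (low byte first) and the 8-bit Y register is added to it. Here is a rough Python model of just the address calculation. The memory contents and the function name `indirect_indexed_address` are made up for illustration, and cycle timing and page-crossing penalties are ignored.

```python
def indirect_indexed_address(memory: bytearray, zp_addr: int, y: int) -> int:
    """Effective address for the 6502 (zp),Y addressing mode."""
    lo = memory[zp_addr & 0xFF]
    hi = memory[(zp_addr + 1) & 0xFF]  # pointer fetch wraps within page 0
    return (((hi << 8) | lo) + (y & 0xFF)) & 0xFFFF

# Illustrative setup: a pointer at $20/$21 that points to $4000.
memory = bytearray(65536)
memory[0x20] = 0x00  # pointer low byte
memory[0x21] = 0x40  # pointer high byte
memory[0x4005] = 0xAB

addr = indirect_indexed_address(memory, 0x20, 5)  # like LDA ($20),Y with Y = 5
value = memory[addr]
```

Because X and Y are only 8 bits wide, walking a buffer larger than 256 bytes means bumping the high byte of the zero-page pointer by hand, which is the coding idiom being complained about.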

If you look at the instruction set, you can see that it is relatively horizontal and pretty straightforward to decode and implement with a simple sequencer.

Of course, it was implemented in NMOS like most of the contemporary microcontrollers, so it was a power hog by today's standards, but the transistor count per gate was lower. The 1802 was CMOS, but it was aimed at applications that required space-grade parts, so it was low power and had transistors the size of a cow turd in order to be more resistant to alpha-particle hits. It was also slow.

For comparison, at the time I was playing with 6502s as a hobby, my day job was mainframe CPU logic design in 100K ECL. I was working on a machine roughly equivalent to the Cray-1, and our gate count was roughly 250,000 gates.

The logic design for a CPU the size of a 6502 is really not too bad -- sort of the scale of a largish homework assignment as a semester final -- constructing something like it in a gate level simulator might be an interesting exercise for the motivated.

> It's a couple thousand gates or so

You're around an order of magnitude off. It's 3.5K transistors, and slightly less than half of those are NMOS pullups, with many of the gates containing 2-3 or more transistors, so it's less than 1K gates. That of course depends on what you count as a "gate", since many of them are compound-gates like AND-OR-INVERTs and there's also transmission-gate logic too. The decode PLA also has tons of large-input gates (hence many transistors-per-gate).

This article may be related:


'order of magnitude'? Are you implying the 6502 has the equivalent of only ~200 gates?

Perhaps he means a binary order of magnitude.

One of the major differences is that back then you could have complete single-cycle random access to any byte of memory, and the 6502 relied on that. Memory is a complex ball of wax today, with wide bursty buses and caches. While a 1+GHz 6502 could certainly be built today, it would have to be paired with on-chip SRAM for its main memory, which would dominate the CPU in transistor count, power draw, and silicon area.

The transistor count itself was very low even for its day, so while that isn't my forte, I would guess that gains in that metric wouldn't be all that great.

Gee, good question. I know people replicate them in FPGAs and Verilog [1]. Improving on the design could potentially be done in software, with Verilog. I don't know of anyone who has improved on it, because people focus on replicating it (it's a really fun chip to play with).

Why fewer transistors though?


I think the GreenArrays F18A cores are similar in transistor count to the 6502, but the instruction set is arguably better, and the logic is asynchronous, leading to lower power consumption and no need for low-skew clock distribution. In 180nm fabrication technology, supposedly, it needs an eighth of a square millimeter (http://www.greenarraychips.com/home/documents/greg/PB003-110...), which makes it almost 4 million square lambdas. If we figure that a transistor is about 30 square lambdas and that wires occupy, say, 75% of the chip, that's about 32000 transistors per core, the majority of which is the RAM and ROM, not the CPU itself; the CPU is probably between 5000 and 10 000 transistors. The 6502 was 4528 transistors: http://www.righto.com/2013/09/intel-x86-documentation-has-mo...
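The back-of-the-envelope estimate above can be reproduced in a few lines of Python. One assumption to flag loudly: the "almost 4 million square lambdas" figure only comes out if lambda is taken to be the full 180 nm feature size. Under the other common convention (lambda = half the feature size) the same area would be four times as many square lambdas. The constant names are mine; the input numbers are the ones quoted in the comment.

```python
FEATURE_NM = 180.0             # process node quoted in the comment
LAMBDA_NM = FEATURE_NM         # ASSUMPTION: lambda = feature size (matches "almost 4 million")
CORE_AREA_MM2 = 1 / 8          # "an eighth of a square millimeter"
LAMBDA2_PER_TRANSISTOR = 30    # "about 30 square lambdas" per transistor
WIRE_FRACTION = 0.75           # "wires occupy, say, 75% of the chip"

NM_PER_MM = 1_000_000

core_area_nm2 = CORE_AREA_MM2 * NM_PER_MM ** 2
square_lambdas = core_area_nm2 / LAMBDA_NM ** 2
transistors = square_lambdas * (1 - WIRE_FRACTION) / LAMBDA2_PER_TRANSISTOR

print(f"{square_lambdas:,.0f} square lambdas")
print(f"{transistors:,.0f} transistors per core (rough estimate)")
```

This is only a sanity check of the arithmetic, not a statement about the real F18A layout.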

The F18A is a very eccentric design, though: it has 18-bit words (and an 18-bit-wide ALU, compared to the 6502's 8, which is a huge benefit for multiplies in particular), with four five-bit instructions per word. You'll note that this means that there are only 32 possible instructions, which take no operands; that is correct. Also you'll note that two bits are missing; only 8 of the 32 instructions are possible in the last instruction slot in a word.

Depending on how you interpret things, the F18(A) has 20 18-bit registers, arranged as two 8-register cyclic stacks, plus two operand registers which form the top of one of the stacks, a loop register which forms the top of the other, and a read-write register that can be used for memory addressing. (I'm not counting the program counter, write-only B register, etc.)

Each of the 144 F18A cores on the GA144 chip has its own tiny RAM of 64 18-bit words. That, plus its 64-word ROM, holds up to 512 instructions, which isn't big enough to compile a decent-sized C program into; nearly anything you do on it will involve distributing your program across several cores. This means that no existing software or hardware development toolchain can easily be retargeted to it. You can program the 6502 in C, although the performance of the results will often make you sad; you can't really program the GA144 in C, or VHDL, or Verilog.

The GreenArrays team was even smaller than the lean 6502 team. Chuck Moore did pretty much the entire hardware design by himself while he was living in a cabin in the woods, heated by wood he chopped himself, using a CAD system he wrote himself, on an operating system he wrote himself, in a programming language he wrote himself. An awesome feat.

I don't think anybody else in the world is trying to do a practical CPU design that's under 100 000 transistors at this point. DRAM was fast enough to keep up with the 6502, but it isn't fast enough to keep up with modern CPUs, so you need SRAM to hold your working set, at least as cache. That means you need on the order of 10 000 transistors of RAM associated with each CPU core, and probably considerably more if you aren't going to suffer the apparent inconveniences of the F18A's programming model. (Even the "cacheless" Tera MTA had 128 sets of 32 64-bit registers, which works out to 262144 bits of registers, over two orders of magnitude more than the 1152 bits of RAM per F18A core.)

So, if you devote nearly all your transistors to SRAM because you want to be able to recompile existing C code for your CPU, but your CPU is well under 100k transistors like the F18A or the 6502, you're going to end up with an unbalanced design. You're going to wish you'd spent some of those SRAM transistors on multipliers, more registers, wider registers, maybe some pipelining, branch prediction, that kind of thing.

There are all kinds of chips that want to embed some kind of small microprocessor using a minimal amount of silicon area, but aren't too demanding of its power. A lot of them embed a Z80 or an 8051, which have lots of existing toolchains targeting them. A 6502 might be a reasonable choice, too. Both 6502 and Z80 have self-hosting toolchains available, too, but they kind of suck compared to modern stuff.

If you wanted to build your own CPU out of discrete components (like this delightful MOnSter!) and wanted to minimize the number of transistors without regard to the number of other components involved, you could go a long way with either diode logic or diode-array ROM state machines.

Diode logic allows you to compute arbitrary non-inverting combinational functions; if all your inputs are from flip-flops that have complementary outputs, that's as universal as NAND. This means that only the amount of state in your state machine costs you transistors. Stan Frankel's Librascope General Precision LGP-21 "had 460 transistors and about 300 diodes", but you could probably do better than that.
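A small Python sketch of why complementary flip-flop outputs make non-inverting logic universal: XOR, which normally needs an inversion somewhere, becomes a plain OR of ANDs once both `a` and `a_bar` are available as inputs. The names `and_gate`, `or_gate` and `xor_from_complementary` are illustrative, and the functions model ideal logic levels, not real diode voltage drops.

```python
def and_gate(*inputs: int) -> int:
    """Non-inverting AND, as a diode AND gate would compute it."""
    return int(all(inputs))

def or_gate(*inputs: int) -> int:
    """Non-inverting OR, as a diode OR gate would compute it."""
    return int(any(inputs))

def xor_from_complementary(a: int, a_bar: int, b: int, b_bar: int) -> int:
    """XOR using only AND/OR, given each input together with its complement."""
    return or_gate(and_gate(a, b_bar), and_gate(a_bar, b))

# A flip-flop supplies both Q and not-Q, so the complements cost no extra gates.
truth_table = {
    (a, b): xor_from_complementary(a, 1 - a, b, 1 - b)
    for a in (0, 1)
    for b in (0, 1)
}
```

No inverter appears anywhere in the combinational part; the only "inversion" is the free complementary output of the state element.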

Diode-array ROM state machines are a similar, but simpler, approach: you simply explicitly encode the transition function of your state machine into a ROM, decode the output of your state flip-flops into a selection of one word of that ROM, and then the output data gives you the new state of your state machine. This costs you some more transistors in the address-decoding logic, and probably costs you more diodes, too, but it's dead simple. The reason people do this in real life is that they're using an EPROM or similar chip instead of wiring up a diode array out of discrete components. (The Apple ][ disk controller was a famous example of this kind of thing.)
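Here is a toy version of the ROM state machine idea in Python, with a list standing in for the diode array. The machine itself (a 2-bit up/down counter with one input bit) is a made-up example, not anything from the Apple ][ disk controller. The address is formed from the state bits plus the input bit, and the word stored at that address is the next state.

```python
STATE_BITS = 2   # two flip-flops: states 0..3
INPUT_BITS = 1   # one external input: 1 = count up, 0 = count down

def build_rom() -> list[int]:
    """Tabulate the transition function: ROM[address] = next state."""
    rom = []
    for address in range(1 << (STATE_BITS + INPUT_BITS)):
        state = address >> INPUT_BITS
        up = address & 1
        next_state = (state + 1) % 4 if up else (state - 1) % 4
        rom.append(next_state)
    return rom

ROM = build_rom()

def step(state: int, up: int) -> int:
    """One clock tick: decode (state, input) into an address, read the word."""
    return ROM[(state << INPUT_BITS) | (up & 1)]

def run(inputs: list[int], state: int = 0) -> list[int]:
    """Return the sequence of states visited, including the start state."""
    trace = [state]
    for up in inputs:
        state = step(state, up)
        trace.append(state)
    return trace

trace = run([1, 1, 1, 1, 0])  # four ticks up, then one tick down
```

In hardware, `ROM` would be eight 2-bit words, the list index would be the address decoder, and `state` would be the pair of flip-flops; all the "logic" lives in the table contents.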

6502 was about 0.02 DMIPS/MHz.

Example 8051. Original had ~50K transistors, about 30K in cpu core. 50K transistors = ~17-25K NMOS gates. Performance was ~0.0095 Dmips/MHz, 12 steps per clock (microcode).

Current state of the art is http://www.dcd.pl/ipcores/56/

fast DQ80251 13K/20K gates (cpu core/whole microcontroller), 0.70579 Dmips/MHz = 75 times faster per MHz. Can be clocked >300MHz when implemented in ASIC.

small DT8051 5600 gates = ~12-17K transistors in NMOS (of course nobody does that anymore), ~23-34K transistors in CMOS, 0.0763 Dmips/MHz = 8 times faster per MHz. Can be clocked >200MHz when implemented in ASIC.

But this is a legacy, inefficient design. ARM laughs at it with the M0 at 12K gates and >1 Dmips/MHz at 1/10 the power. Cortex-M4 at 65K gates reaches 1.9 Dmips/MHz.
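For reference, the per-MHz ratios quoted above can be checked with a few lines of Python, using only the Dmips/MHz figures given in this comment (the Cortex-M0 is left out because only a lower bound, ">1", is given). The dictionary and variable names are just for illustration.

```python
BASELINE_8051 = 0.0095  # original 8051, Dmips/MHz, as quoted above

dmips_per_mhz = {
    "6502": 0.02,
    "DT8051": 0.0763,
    "DQ80251": 0.70579,
    "Cortex-M4": 1.9,
}

# Speedup per MHz relative to the original 8051.
speedup = {name: value / BASELINE_8051 for name, value in dmips_per_mhz.items()}

for name, factor in sorted(speedup.items(), key=lambda item: item[1]):
    print(f"{name:10s} {factor:7.1f}x the original 8051 per MHz")
```

The DQ80251 comes out a little over 74x, which the comment rounds to 75, and the DT8051 at almost exactly 8x.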

For those who are not familiar with the 6502, it's a chip that powered a lot of people's first computer experiences in the 80s. The Atari 2600, Commodore 64, BBC micro etc all ran 6502 or at least slightly modified versions of that chip.

In those days if you were really interested in computers you tended to go lower level and learn assembler.

> The Atari 2600, Commodore 64, BBC micro etc all ran 6502 or at least slightly modified versions of that chip

Let's not forget the Apple I

EDIT: and the Apple II, for that matter...

> BBC micro etc all ran 6502

Mine still runs, not ran :)

> Along the way, I noticed that the Visual6502 netlist had three extra transistors, T1088, T1023, and T3037.

I'm confused by this bit - does this mean that there's a bug in Visual6502?

in a way, yes, but those 3 transistors do not affect the simulation. the drains and sources are shorted together and tied to a random data line. the gates are grounded. they just sit there, doing nothing.

With source/drain shorted together and tied to a data line, and the gate grounded, they might just be high resistance pull down resistors. I.e., they would rely on the gate leakage current to create a high ohm resistor. Creating real resistors on silicon IC's is area expensive, so most need for real resistors is first designed around and what's left as absolutely necessary is usually created using tricks. This might be one of those tricks.

Knowing nothing about IC or CPU design, this was my first thought. They're there to fix some weirdness that happens when they're not there, and nobody could figure out exactly why it was happening so they just left them there. Kind of like the "magic" switch on the front of the old computer case that shouldn't affect anything, but the computer crashes when it's flipped. :P

Hardware is weird.

Very high ohm resistor, that. And hopefully that's not it, because the pull-down function might even be needed in a discrete board.

But I was thinking of gate capacitance, some voodoo impedance matching (or even mismatching) to prevent trouble from some reflection somewhere. In that case, the discrete board can't be expected to have the same set of problems.

But even if the transistors played

It's likely they were left-overs from a previous revision, maybe as part of fixing a different bug.

There are signs of this patching in other parts of the chip too, like this one:


Are they there for some meta feature like spares/patching or part identification, perhaps?

Perhaps they are the circuit equivalent of a trap street [1]?

[1]: https://en.wikipedia.org/wiki/Trap_street

I love this so much.

(sorry for the useless comment, but really, I love this so much)

me too! :)

In my experience [building my own Commodore PET on an FPGA](https://gergo.erdi.hu/blog/2015-03-02-initial_version_of_my_...), the PET would be a very promising candidate for using this as a plug-in replacement, since the (text-only) video subsystem can be fully isolated from the CPU with just dual-port video RAM between the two sides (and a 60Hz spike from the video side to the CPU IRQ leg).

As the link to the actual project page is buried somewhere in the comments: http://monster6502.com/

Awesome project. This is how processors used to be constructed too, with discrete components:


What would be really nice to see is a photo of it sitting next to a real 6502 for size comparison.

https://twitter.com/TubeTimeUS/status/732023168303435777 for a photo next to a real 6502, and an IBM SMS card for good measure. :)

Very clean layout.

If this actually works I wonder if he'd ever consider doing a kit. Would make an awesome display piece.

we're considering it. see http://monster6502.com/. there is a mailing list for those interested in such a kit.

This is awesome! I want one.

> Is it expensive?

>It is definitely not cheap to make one of these. If we had to ballpark what one of these would sell for — assembled and working — it would certainly be larger than $1k and smaller than $5k.

I think this is out of range for many hobbyists and even schools and the like. Projects like the ErgoDox show that kits in the range of a few hundred bucks can sell well.

>While the circuit board itself is large and a little bit expensive, the cost is actually dominated by the component and assembly costs of an extremely large number of tiny components, each of which is individually quite inexpensive. Add to that the setup and test costs of building complex things like these in small batches, and you'll immediately see how it adds up.

So, the only way to bring the price down below USD 1000 is, besides (possibly community driven) bulk buying, a kit version.

> Is there going to be a soldering kit version of this?

> No. (But on the other hand, "Anything is a soldering kit if you're brave enough!")

This brings me to my question: Is soldering this even realistic? Did you solder the prototype yourself? How long did that take? I soldered the SMD diodes of a few Ergodoxen (76 for a board) and it gets boring quickly. Can't imagine doing 4304 parts.

> "Is soldering this even realistic? How long did that take?"

Disclaimer, I'm getting old. In the old days we soldered our S100 computers, and something like an 18-slot backplane had 1800 connections, and generally worked. Not unusual for a single card to have maybe 500 or so IC pins, 50 decoupling caps (100) and let's say 100 pins for jumpers and connectors. So it would be equivalent to making an entire S100 computer. I would estimate many tens of hours total.

Also I never did it but no small number of people soldered up IBM PC clone motherboards. There were also clones of Apple-II and TRS-80 model 1 in kit form.

Surface mount is a lot easier because once you learn how (and after 1000 or so components you'll be pretty good) there is no more flipping the board upside down over and over or snipping off wires. After 0204 RF chokes and microwave capacitors it's nice to slum with giant digital logic parts so big you can pick them up with your fingers. Why, some of the larger IC packages are so large that the device is no longer affected by solder surface tension (around say 100 pin TQFP size).

You rapidly learn tricks like using the same brand of IC socket thru the whole board and keeping a wooden board around the size of the PCB so you can stuff and cover and flip and solder all the IC sockets simultaneously. Another trick is to always remove the flux, not because it electrically matters but because you can't do it without 100% inspection of each joint, and you'll probably find one or two to clean up per board. Because you probably don't own a wave solder machine, it also saves time to solder bypass caps on the solder side, but beware of clearance issues: the board might not fit anymore LOL.

On my infinite list of things to build is the transistor clock, at around 2700 solder joints. Totally doable. That might be a good place to start.

It would be awesome to (re-)build something like an Atari or NES around it just for all the blinkenlights as you run a game.

If you have doubts about it just budget it out and do it as a kickstarter. If it's not popular enough, no worries, if it is then hey, you get paid up front.

P.S. I'd love to see a 4004 kit too.

It is totally totally impressive that a hardware version of a software simulation of the actual 6502 was built.

What about an 8080 or 6800? Or Z-80.

Zapple V1.1 >g200



Welcome to BASIC, Ver. 1.3 <TDL Z-80 8-K VERSION>


Soon: how to win at Lunar Lander.

"Hardware version of software simulation" seems not to quite convey this. It looks like the software simulation actually simulated the original circuitry, using a netlist. The hardware is based on the same netlist; thus that simulation is of this hardware as much as of the original 6502. This new hardware isn't a simulation, though, but a discrete implementation of the real circuit.


A 6800 or 8080 at the same density would be slightly larger, and a Z80 is more than twice as large. Even the first Pentium has almost 1000x more transistors.

Well http://www.visual6502.org/ has the visual sim for the 6502 but also the 6800 and the ARM-1. I guess that means the full netlist for those two must be done too? If so I guess it'd be possible to do them in this way. Though I don't think they have the same obsessive fan audience that the 6502 has.

It sounds like the process of analyzing the Z80 in a similar way is starting: http://www.visual6502.org/wiki/index.php?title=Z8400

I would pay for a Giant Z-80.

One of the many interesting things about this project is that it's a reminder how people often take for granted the miniaturization of these things (the micro- part in microprocessor). Had we not got the realized size down into the sub-micrometer scale, we would have been forced to contend with heat dissipation making it impractical to increase the performance to what "we're used to" these days.

Now that Intel and others are facing issues at ~5-10nm scale, we will be facing a similar "clock" problem again. There are a few paths forward including: i) smarter microprocessor design tuned to intended application use, and ii) increased parallelization of tasks across multiple cores/cpus/machines.

Edit: nanometer -> sub-micrometer

Someone want to calculate the area needed to make a discrete current-gen processor?

According to https://en.wikipedia.org/wiki/Transistor_count#Microprocesso... a Skylake K has 1.75G transistors, which is nearly exactly 500000 times more than the 3.5K in this one. Take the square root of that and the linear dimensions grow by approximately 707 times, so the 32cm x 32cm of the discrete 6502 would become something closer to 23m x 23m. It's roughly a 6 orders of magnitude increase in density, or 3 orders of magnitude increase in linear dimensions.

And 6502 was apparently first produced on 8 microns versus today's 22nm process, which is a linear scale factor of 363, which aligns fairly well with those numbers.

That's smaller than I expected it to be -- certainly feels printable, if someone really wanted to for some arty reason.

You have a calculation error:

707 x .32m ~= 230m
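The scaling arithmetic from the two comments above, redone in Python with the corrected multiplication. All input figures are the ones quoted in the thread (1.75G transistors, 3.5K transistors, a 32 cm board, 8 micron vs 22 nm processes); the variable names are mine.

```python
import math

SKYLAKE_TRANSISTORS = 1.75e9   # figure quoted from the Wikipedia transistor-count page
MOS6502_TRANSISTORS = 3.5e3    # figure used in the comment above
DISCRETE_6502_SIDE_M = 0.32    # 32 cm board edge

count_ratio = SKYLAKE_TRANSISTORS / MOS6502_TRANSISTORS
linear_scale = math.sqrt(count_ratio)           # same density, so area scales with count
discrete_skylake_side_m = DISCRETE_6502_SIDE_M * linear_scale

# Cross-check: ratio of process feature sizes, 8 microns vs 22 nm.
process_scale = 8e-6 / 22e-9

print(f"transistor ratio: {count_ratio:,.0f}")
print(f"linear scale:     {linear_scale:.0f}x")
print(f"board edge:       {discrete_skylake_side_m:.0f} m")
print(f"process scale:    {process_scale:.0f}x")
```

So the board edge comes out in the low hundreds of meters, in line with the correction, rather than 23 m.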

I'd love to try this processor in my Acorn System 1 (1MHz 6502)


This reminds me of another discrete machine. Does anyone recall the clockless, discrete x86 (probably 8088) that was made way, way back? As I recall it ran many times faster than the highest clocked version of same at the time.

i love the 555


evilmadscientist's datasheets are fantastic tools for learning

Very nice. But I would be more interested in making the transistors themselves at home. Imagine if that process can be automated (in a 3d printer?), that would be really cool.

Chris Gammell has been preaching transistor printers for over 5 years now :)

Fascinating! Just curious if anyone can provide any more details? Specifically, what design software, which manufacturer?

the schematic capture and layout was done in Altium. proto board spin from a vendor in China.

Have you found the critical path that caused clock slowdown?

Are there higher drive strength transistor available that could make it faster?

PS. love this project.

because it's NMOS, only the pulldowns can work quickly. the pullups are resistors (depletion mode FETs in the IC) which take some time to charge the downstream gate capacitance. i wrote a quick and dirty python program to calculate the optimal value for each of the 1,019 pullup resistors to hit a target clock of 500khz. board bringup is ongoing so we'll see how fast it ends up being...

500kHz is not too far from the Apple ]['s 1MHz. Come on, you can do it!

So then the only way to get it going faster is to raise voltage for the pullups?

well, it has to run on 5v (standard ttl voltage) so the only way to speed it up is to decrease the pullup resistor values, which also increases the current consumption. it's a tradeoff.
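To illustrate the tradeoff described above, here is a first-order RC sketch in Python. To be clear, this is NOT the actual sizing program mentioned earlier in the thread; it is a textbook estimate under made-up assumptions. The load capacitance, the logic depth per half cycle, and the 2.5 V switching threshold are all hypothetical values, and the pulldown's on-resistance is ignored.

```python
import math

def max_pullup_ohms(t_rise_s: float, load_farads: float,
                    v_supply: float = 5.0, v_threshold: float = 2.5) -> float:
    """Largest R that charges the load from 0 V to v_threshold within t_rise_s.

    From v(t) = V * (1 - exp(-t / (R * C))), solved for R.
    """
    return t_rise_s / (load_farads * math.log(v_supply / (v_supply - v_threshold)))

def static_current_amps(r_ohms: float, v_supply: float = 5.0) -> float:
    """Current through the pullup while the node is held low."""
    return v_supply / r_ohms

TARGET_CLOCK_HZ = 500e3        # target quoted in the thread
LEVELS_PER_HALF_CYCLE = 10     # HYPOTHETICAL logic depth on the slowest path
LOAD_FARADS = 20e-12           # HYPOTHETICAL gate plus trace capacitance

t_rise = (1 / TARGET_CLOCK_HZ) / 2 / LEVELS_PER_HALF_CYCLE
pullup = max_pullup_ohms(t_rise, LOAD_FARADS)
current = static_current_amps(pullup)

print(f"rise-time budget: {t_rise * 1e9:.0f} ns")
print(f"max pullup:       {pullup:.0f} ohms")
print(f"static current:   {current * 1e3:.2f} mA per low node")
```

Halving the rise-time budget halves the allowed resistance and doubles the static current per pulled-low node, which is the speed vs. power tradeoff in a nutshell.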

If the FETs can take it, you could run it from 12V or whatever and put level shifters on the I/Os...

Agreed, would seem to make sense too; however, it would increase power consumption and heat, so it might "age" the whole circuit a bit faster than keeping the voltage (and thus clock speed) down. Mind you, he could just pop in replacement components at will :)
