My first real programming job entailed writing a blazingly fast macro assembler from scratch for the 6502, using a much slower (and non-macro) assembler from the vendor (Ohio Scientific, for those of a certain age).
I simulated the proposed hashing algorithm in FORTRAN at the community college I was attending, and found that it led to a lot of collisions for the base 6502 opcode set. When I pointed this out to my manager (who's still my friend 32+ years on), he made sure that I got my first raise.
I still have those listings and the original design documents around somewhere.
And this is 30-year-old tech, the 6502. The processor in your cell phone is about 1000x faster than it (and much more capable).
To me the hardest part (apparently) is converting the electronic circuit to the actual chip drawings. Not sure how this is done (how do you route it?). For the 6502 this was done by hand: the drawings were made the size of a desk and reduced photographically. (IIRC)
Here's an article about it too: http://research.swtch.com/6502
Not only did they route / layout by hand-drawing, and then cut the Rubylith photomask by hand (it's not a drawing they use to reduce it photographically - they cut out holes..), but Bill Mensch got it right the first time. 3,510 transistors according to the article.
Bill Mensch left MOS to found Western Design Center quite early, and so is covered less in that book than Chuck Peddle, who was instrumental in Commodore for a long time after they acquired MOS. The amazing thing is WDC still sell variations of the 6502 design, including various replacements such as a 16 bit version.
Peddle (the other main designer) and Mensch deserve a lot more attention than they've gotten.
(EDIT: From WDC's website: "Annual volumes in the hundreds (100’s) of millions of units keep adding in a significant way to the estimated shipped volumes of five (5) to ten (10) billion units. " Yikes...)
I've seen that story before that the 6502 worked perfectly the first time, but I think there's some mythologizing going on. The ROR instruction was totally broken on the first release of the 6502 and wasn't fixed until months later. 
The quote above says that the 6502 has 3510 transistors, and this number appears many other places. It turns out that the 6502 has 3510 enhancement transistors and 1018 depletion transistors, for a total of 4528 transistors, according to the visual6502 analysis.
And if you're interested in the inner workings of the 6502, you should definitely check out the huge transistor-level schematic at: http://www.downloads.reactivemicro.com/Public/Electronics/CP...
 http://en.wikipedia.org/wiki/6502#Bugs_and_quirks and details at http://www.pagetable.com/?p=406
In any case, the main thing is that a lot of MOS' early success came from saving a tremendous amount of money by getting to market quickly and cheaply compared to the competition, thanks to actually having guys like Mensch and Peddle coupled with a superior process (though that did not last all that long).
I did all kinds of crazy things with my C64 and got away with it without breaking stuff, like powering it off batteries (the C64 takes multiple different voltages in, but you can get away with just one - don't remember if it was 5V or 9V - just that some things like the user port and realtime clock won't work) and attaching LEDs and relays to the user port without a clue what I was doing, replacing the IO chips (CIA) with a different version from my 1541 disk drive when one of the ones in the C64 broke (standard troubleshooting: if the CIA chips were hot right after turning the machine on, they had short-circuited - they were the source of breakage on C64s and Amigas...), and at a later point with one from my Amiga (they're all pin compatible, but some functionality is different, e.g. the Amiga version has a 32 bit timer instead of the realtime clock nobody used...), or replacing the 6510 in my C64 with a 6502 from a 1541 just to see what would happen (the 6510 has 8 general purpose IO lines, mapped to the tape drive IO and I think bank switching - I believe the machine will still start but...)
On my Amiga I at one point made a pause switch by soldering stuff straight onto a pin on the CPU...
Now, normally you have to worry about increasing clockspeeds having diminishing returns, since memory latency remains constant despite a faster CPU clock. But anything that could run on the amount of RAM the 6502 could handle would fit in a modern processor's L1 cache, and the scheduler is perfectly able to hide L1 latency so I think ignoring this factor is fair in this case.
This all gets a bit complicated in modern days because of memory access costs and the caches which try to alleviate those costs, but the idea is that modern CPUs are likely to be around 10 times as fast per-clock as the 6502, and because of multiple cores and threads that value goes to something like 40-60. Add the huge increase in clock speed and you're a bit south of 100,000x in the optimal case.
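As a rough back-of-the-envelope version of that estimate, the sketch below multiplies the 40-60x per-clock and multi-core factor quoted above by a clock ratio. The 1 MHz and 2 GHz clock speeds are assumptions picked for illustration, not figures from this thread, and `estimated_speedup` is a hypothetical helper name.

```python
# Back-of-the-envelope speedup estimate.
# The clock speeds below are assumptions for illustration only; the
# 40-60x factor is the per-clock and multi-core figure quoted above.
CLOCK_6502_HZ = 1_000_000          # assumed: a 1 MHz 6502
CLOCK_MODERN_HZ = 2_000_000_000    # assumed: a 2 GHz modern CPU


def estimated_speedup(per_clock_and_cores: int) -> int:
    """Combine the per-clock/multi-core factor with the clock ratio."""
    clock_ratio = CLOCK_MODERN_HZ // CLOCK_6502_HZ
    return per_clock_and_cores * clock_ratio


low = estimated_speedup(40)
high = estimated_speedup(60)
print(f"{low:,}x to {high:,}x")
```

With these assumed clocks the range straddles 100,000x; a slower assumed modern clock pulls it below that, which is roughly where the comment lands.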
I would hope every programmer would write some code on a C64 to really learn how much RAM 64 KB really is. You can actually waste some of it and in some cases it really is "enough so that I don't have to optimize". :) Real hard-core people would go with the VIC-20, which has only 5120 bytes of RAM, or the Atari 2600 with 128 bytes of RAM. One could imagine there's nothing you can do with them but oh boy how wrong one would be! Heck, a single tweet is 140 characters. And you can fit that in 128 bytes. You really can... :)
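One way the tweet claim works out, assuming the tweet is plain 7-bit ASCII (an assumption, the comment doesn't say how): 140 characters at 7 bits each is 980 bits, which rounds up to 123 bytes. The sketch below packs and unpacks such a string; `pack7` and `unpack7` are hypothetical names for illustration.

```python
def pack7(text: str) -> bytes:
    """Pack 7-bit ASCII characters into a contiguous bit stream."""
    bits = 0
    nbits = 0
    for ch in text:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError("only 7-bit ASCII is supported in this sketch")
        bits = (bits << 7) | code
        nbits += 7
    pad = (-nbits) % 8          # zero bits needed to reach a whole byte
    bits <<= pad
    return bits.to_bytes((nbits + pad) // 8, "big")


def unpack7(data: bytes, count: int) -> str:
    """Recover `count` characters from a stream produced by pack7."""
    bits = int.from_bytes(data, "big")
    total = len(data) * 8
    chars = []
    for i in range(count):
        shift = total - 7 * (i + 1)
        chars.append(chr((bits >> shift) & 0x7F))
    return "".join(chars)


tweet = "x" * 140
packed = pack7(tweet)
print(len(packed))  # bytes needed for 140 seven-bit characters
```

That leaves a handful of bytes spare out of 128, though on a real Atari 2600 the same RAM also has to hold the stack and everything else.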
CLC ; 2
LDA&70 ADC&74 STA&70 ; 3 3 3 = 9
LDA&71 ADC&75 STA&71 ; 3 3 3 = 9
LDA&72 ADC&76 STA&72 ; 3 3 3 = 9
LDA&73 ADC&77 STA&73 ; 3 3 3 = 9
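For anyone who doesn't read 6502 assembly, here is a minimal Python sketch of what the listing above does: a 32-bit add carried out one byte at a time, low byte first, with the carry flag passed from each byte to the next. The function name `add32_bytewise` is made up for illustration, and it models only the arithmetic, not the timing.

```python
from typing import Tuple


def add32_bytewise(a: int, b: int) -> Tuple[int, int]:
    """Add two 32-bit values byte by byte, low byte first.

    Returns (32-bit result, final carry flag).
    """
    carry = 0                                  # CLC
    result = 0
    for i in range(4):                         # one LDA/ADC/STA group per byte
        byte_a = (a >> (8 * i)) & 0xFF         # LDA: load a byte of the first operand
        byte_b = (b >> (8 * i)) & 0xFF         # operand byte for ADC
        total = byte_a + byte_b + carry        # ADC: add with carry
        carry = total >> 8                     # carry out feeds the next byte
        result |= (total & 0xFF) << (8 * i)    # STA: store the low 8 bits
    return result, carry


value, carry_out = add32_bytewise(0x000000FF, 0x00000001)
print(hex(value), carry_out)
```

The single CLC at the top matters: only the first ADC should start with a clear carry, and every later one has to pick up whatever the previous byte produced.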
By comparison, for a modern Pentium, according to Intel's docs, a 32-bit add (again, on data you're using) takes 1 cycle, end to end.
This ignores the fact my modern computer has 2 cores.
But, "ADD ESI,EDX" is adding two registers isn't it? So I think you need to include the loading/storing of those registers back to memory for a more fair comparison.
I haven't touched 6502 assembly in over 20 years. Brings back memories. :-)
CLC ; 2
ADC #b1 STA&70 ; 0 2 3 = 5
LDA #a2 ADC #b2 TAY ; 2 2 2 = 6
LDA #a3 ADC #b3 TAX ; 2 2 2 = 6
LDA #a4 ADC #b4 ; 2 2 = 4
; total 23 cycles
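To double-check the cycle totals for the two listings in this thread, here is a small tally sketch. The per-instruction cycle counts are the standard NMOS 6502 figures for these addressing modes; the table layout and the `total_cycles` helper are just illustrative choices.

```python
# Cycle counts for the instructions used in the two listings
# (standard NMOS 6502 timings for these addressing modes).
IMP, IMM, ZP = "implied", "immediate", "zeropage"

CYCLES = {
    ("CLC", IMP): 2,
    ("TAX", IMP): 2,
    ("TAY", IMP): 2,
    ("LDA", IMM): 2,
    ("ADC", IMM): 2,
    ("LDA", ZP): 3,
    ("ADC", ZP): 3,
    ("STA", ZP): 3,
}


def total_cycles(program):
    """Sum the cycle cost of a list of (mnemonic, addressing mode) pairs."""
    return sum(CYCLES[op] for op in program)


# Zero-page version: CLC, then LDA/ADC/STA for each of the four bytes.
memory_version = [("CLC", IMP)] + [("LDA", ZP), ("ADC", ZP), ("STA", ZP)] * 4

# Immediate version as listed above: results end up in &70, Y, X and A.
immediate_version = [
    ("CLC", IMP),
    ("ADC", IMM), ("STA", ZP),
    ("LDA", IMM), ("ADC", IMM), ("TAY", IMP),
    ("LDA", IMM), ("ADC", IMM), ("TAX", IMP),
    ("LDA", IMM), ("ADC", IMM),
]

print(total_cycles(memory_version), total_cycles(immediate_version))
```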
Perhaps the code is intended to be modified at runtime, but then you'd still want one of the operands loaded from memory, I think (otherwise why not just precalculate the results?), and I've generally found the (fairly substantial) fixed expense not to be worth it anyway.
Anyway, overall I think you're being a bit unfair to the x86 with this comparison.
You're even playing nice with the 6502: you're using a simple add. Now compare with SIMD instructions.
Things felt faster in the 80s than they do now, even doing the same tasks.
Waiting a few seconds every time I hit save was fun. Didn't stop me from developing a ferocious ^S reflex. Fortunately, save is fast enough not to be noticeable these days, to the extent that it usually happens automatically now.
Watching a WYSIWYG font menu draw each individual entry was fun. We certainly don't get that pleasure now.
But yes, things certainly felt faster in the 80s.... /s
One order of magnitude: 10x
Two orders of magnitude: 100x
Three orders of magnitude: 1000x
 Of course, that's ignoring multiple cores, better microcode, caches, etc.
How MOS 6502 Illegal Opcodes really work
I just finished a 6502 emulator, and I recommend doing it to everyone. Lots of fun and very interesting.