The F18A is a very eccentric design, though: it has 18-bit words (and an 18-bit-wide ALU, compared to the 6502's 8, which is a huge benefit for multiplies in particular), with four five-bit instructions per word. You'll note that this means that there are only 32 possible instructions, which take no operands; that is correct. Also you'll note that two bits are missing; only 8 of the 32 instructions are possible in the last instruction slot in a word.
Depending on how you interpret things, the F18(A) has 20 18-bit registers, arranged as two 8-register cyclic stacks, plus two operand registers which form the top of one of the stacks, a loop register which forms the top of the other, and a read-write register that can be used for memory addressing. (I'm not counting the program counter, write-only B register, etc.)
Each of the 144 F18A cores on the GA144 chip has its own tiny RAM of 64 18-bit words. That, plus its 64-word ROM, holds up to 512 instructions, which isn't big enough to compile a decent-sized C program into; nearly anything you do on it will involve distributing your program across several cores. This means that no existing software or hardware development toolchain can easily be retargeted to it. You can program the 6502 in C, although the performance of the results will often make you sad; you can't really program the GA144 in C, or VHDL, or Verilog.
The GreenArrays team was even smaller than the lean 6502 team. Chuck Moore did pretty much the entire hardware design by himself while he was living in a cabin in the woods, heated by wood he chopped himself, using a CAD system he wrote himself, on an operating system he wrote himself, in a programming language he wrote himself. An awesome feat.
I don't think anybody else in the world is trying to do a practical CPU design that's under 100 000 transistors at this point. DRAM was fast enough to keep up with the 6502, but it isn't fast enough to keep up with modern CPUs, so you need SRAM to hold your working set, at least as cache. That means you need on the order of 10 000 transistors of RAM associated with each CPU core, and probably considerably more if you aren't going to suffer the apparent inconveniences of the F18A's programming model. (Even the "cacheless" Tera MTA had 128 sets of 32 64-bit registers, which works out to 262144 bits of registers, over two orders of magnitude more than the 1152 bits of RAM per F18A core.)
So, if you devote nearly all your transistors to SRAM because you want to be able to recompile existing C code for your CPU, but your CPU is well under 100k transistors like the F18A or the 6502, you're going to end up with an unbalanced design. You're going to wish you'd spent some of those SRAM transistors on multipliers, more registers, wider registers, maybe some pipelining, branch prediction, that kind of thing.
There are all kinds of chips that want to embed some kind of small microprocessor using a minimal amount of silicon area, but aren't too demanding of its power. A lot of them embed a Z80 or an 8051, which have lots of existing toolchains targeting them. A 6502 might be a reasonable choice, too. Both 6502 and Z80 have self-hosting toolchains available, too, but they kind of suck compared to modern stuff.
If you wanted to build your own CPU out of discrete components (like this delightful MOnSter!) and wanted to minimize the number of transistors without regard to the number of other components involved, you could go a long way with either diode logic or diode-array ROM state machines.
Diode logic allows you to compute arbitrary non-inverting combinational functions; if all your inputs are from flip-flops that have complementary outputs, that's as universal as NAND. This means that only the amount of state in your state machine costs you transistors. Stan Frankel's Librascope General Precision LGP-21 "had 460 transistors and about 300 diodes", but you could probably do better than that.
Diode-array ROM state machines are a similar, but simpler, approach: you simply explicitly encode the transition function of your state machine into a ROM, decode the output of your state flip-flops into a selection of one word of that ROM, and then the output data gives you the new state of your state machine. This costs you some more transistors in the address-decoding logic, and probably costs you more diodes, too, but it's dead simple. The reason people do this in real life is that they're using an EPROM or similar chip instead of wiring up a diode array out of discrete components. (The Apple ][ disk controller was a famous example of this kind of thing.)