As a result, Intel is able to ship consumer-hostile features like the Management Engine, and consumers don't feel they have an alternative. A reasonable alternative like Talos costs $3,700.
If startups could reasonably compete with traditional volumes of funding, we'd probably have better hardware.
ARM chips are cheap not (only) because the R&D is especially cheap, but because the chips are produced in the hundreds of millions. Were it not for the explosion of mobile phones, they'd have stayed in the "nice, but expensive for the performance" niche.
That's about two CPUs for everyone on earth and likely two more this year.
It's not economies of scale that give Intel its sustained advantage so much as huge margins, which let it keep re-investing in its fabrication-process lead. That lead in turn protects the margins by keeping Intel on top in the only segments of the CPU market that are simultaneously high-price, high-margin (at least for the top SKUs), and high-volume.
Reportedly only 1.5B phones were sold in 2015, compared to 15B ARM cores shipped.
There is apparently a huge market for small, cheap ARM microcontrollers, to the point that I've heard they are sometimes cheaper than simple 8-bit micros.
A huge part of the cost of a chip is the design. You can get custom fab jobs for under $10,000. But if the design requires you to implement hundreds of op codes, the chip will be nowhere near that cheap.
With simpler specs, a lot of this design cost is reduced. You may not be able to get 11nm chips, but you don't need that to be competitive. You just need to be close enough that your performance hit is not a deal breaker, at a price that's not a deal breaker.
> When first released in 2001, Itanium's performance was disappointing compared to better-established RISC and CISC processors. - Wikipedia
* performance per dollar
* performance per watt
* highest single threaded performance
* lowest possible power draw
Itanium AFAIK improved none of these, or did so only for very narrow use cases. What Intel needed was either (a) an ARM competitor and/or (b) a processor with large vectors and a SIMT-like parallelisation model (i.e. a Tesla competitor). In both cases they didn't think twice before just throwing x86 at the problem until the pain went away... which it never did.
Why try to win people over on compatibility if they have to rewrite their software anyway for other reasons, especially after ARM took the largest market share in mobile?
It took Intel a while to accept that mobile-app providers wouldn't invest the same effort in porting their apps from ARM to x86 given the market penetration, but I don't believe the outcome surprised them.
Intel's problem is a case of optimization targets, I think: their design teams have been optimizing toward a processor with an effective power budget of whatever the cooler could dissipate, rather than optimizing for low power, and those are very different tasks.
There's no one processor that will fit all workloads.
On the other hand, it's a completely useless situation for everybody who actually wants to use hardware, instead of getting rich in the hardware business.
Intel's got a lock on top-end performance, but don't think for a second they're in control of the CPU market.
You can get a custom ASIC fabricated for under $10K in low volumes using a process that's a generation behind. You can get things fabbed on the current process if you're willing to pay more, but it's on the order of $1-2M or so. It's not often people make new CPUs, but look at how the Bitcoin space took advantage of low ASIC prototyping and production costs to produce high-performance hashing devices.
The cost of producing a 6502 chip in that era was basically the cost of building your own foundry. Now we have places like TSMC that will make chips for anyone who can afford the tape-out costs.
In a heartbeat!
I'd rather live here where if you wanted you could tape out some 60nm chips for less than it costs to print some posters.
But in terms of documentation that relates directly to the transistors, it would be interesting to compare the number of lines of VHDL to the number of inferred transistors. I know you get a report after place and route in a typical workflow, but has anyone rolled that up into "2.5 lines of VHDL per 100 transistors" or something?
x86 has 4,181 pages of documentation. Quad-core Skylake has 1.75 billion transistors. This is 418,560 transistors per page.
The 6502 had 12 pages of documentation. It had 3,510 transistors. This is 292 transistors per page.
Yes, Skylake contains GPU stuff in its transistors and also in its documentation.
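For anyone who wants to re-derive the ratios, a throwaway C check using the figures quoted above:

```c
#include <stdio.h>

int main(void) {
    /* Figures from the comment above: 1.75B transistors over 4,181 pages,
       3,510 transistors over 12 pages. Integer division, as quoted. */
    printf("Skylake: %lld transistors/page\n", 1750000000LL / 4181); /* 418,560 */
    printf("6502:    %d transistors/page\n", 3510 / 12);             /* 292 */
    return 0;
}
```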
If Intel were to manufacture a chip as big as that, it would have 20-30 billion transistors.
As you can see, caches are merely half of the chip, and half of the rest is dynamic instruction reordering, branch prediction, register renaming and the like.
These things have pipelined execution units, so they can start a new instruction before the previous one has finished executing; enough duplicated hardware to start two or three instructions per cycle (sometimes even of the same kind, say two floating-point SIMD additions); and logic to schedule instructions according to data dependencies rather than program order, so that an instruction whose input data isn't ready yet can wait a few cycles while later instructions execute.
And all of this has to preserve some appearance of executing instructions serially, so if, say, some instruction causes a page fault and a jump to the OS fault-handling code, the CPU has to cancel all later instructions, which may already have finished executing :)
And, btw, this is not in any way specific to x86. POWER, high-end ARM, they all do it.
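A toy C fragment (mine, purely illustrative, not from the comment above) showing the kind of independence that out-of-order scheduling exploits:

```c
#include <stdio.h>

int main(void) {
    double a = 1.5, b = 2.5, c = 3.5;
    /* No data dependencies between these three multiplies, so a
       superscalar out-of-order core can execute them in parallel,
       in any order. */
    double x = a * a;
    double y = b * b;
    double z = c * c;
    /* The final add depends on all three results and must wait; if,
       say, x is delayed (a cache miss feeding a), the CPU can still
       complete y and z in the meantime. */
    double sum = x + y + z;
    printf("%f\n", sum);
    return 0;
}
```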
Maybe Agner's site, in particular his microarchitecture manual, would be a reasonable place to start:
There are "software optimization manuals" from CPU vendors, but these may not be particularly novice-friendly. I think I've used Wikipedia at times for general CPU-agnostic concepts, though it has a tendency to use jargon with tittle explanation. Occasionally somebody submits something to HN.
On the lowest level, it may be helpful to know some simple digital circuits (decoders, multiplexers, adders, flip-flops, ...?) just to have an idea of what kind of things can be done in hardware.
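For a taste of what those primitives look like as logic, a hedged sketch in C of a 2-to-1 multiplexer and a 1-bit full adder (real hardware is gates and wires, but the boolean equations are the same):

```c
#include <stdio.h>

/* 2-to-1 multiplexer: picks a when sel == 0, b when sel == 1.
   All inputs are single bits (0 or 1). */
unsigned mux2(unsigned a, unsigned b, unsigned sel) {
    return (a & (sel ^ 1)) | (b & sel);
}

/* 1-bit full adder: sum = a XOR b XOR cin;
   carry out when at least two inputs are set. */
unsigned full_adder(unsigned a, unsigned b, unsigned cin, unsigned *cout) {
    *cout = (a & b) | (cin & (a ^ b));
    return a ^ b ^ cin;
}

int main(void) {
    unsigned cout, s = full_adder(1, 1, 1, &cout);
    printf("1+1+1 = sum %u, carry %u\n", s, cout);      /* sum 1, carry 1 */
    printf("mux2(0, 1, sel=1) = %u\n", mux2(0, 1, 1));  /* 1 */
    return 0;
}
```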
At this rate (1,419 pages over three years), the Intel x86 documentation's page count will exceed the 6502's transistor count around 2020.
Shatner plays it pretty straight here, and he does a great job of making the subject matter interesting to audiences of the day. What's interesting about the video now is that every time he promises us "Thousands of transistors on the head of a pin," you can remind yourself that what we actually got was billions of transistors.
It's humbling from a software engineering perspective to contemplate how poorly we're taking advantage of the semiconductor industry's Promethean gift. My computer still looks a lot like the Apple IIs and TRS-80s did in that video, and the same is true for my workflow.
(This is actually true. I asked x86 architects when I met some).
When I remember all the stuff you could do on an Apple II (and a BBC Micro) with their tiny four-thousand-odd-transistor CPUs and 4 whole KB of RAM, and consider how much time I spend waiting for this laptop with its billion-or-so-transistor CPU and 16 GB of RAM, it's almost enough to make you weep about the profligate waste of resources of the entire software engineering profession... ;-)
I do realise that GUI progress has stalled over the last 20 years, and that you could take a Mac from yesteryear or a PC from ~1991 and know your way around it without any trouble at all.
Of course software development strategies have changed and languages now let us express ourselves in previously unimaginable ways, but we've come so far and yet not far at all.
I am particularly struck by the craze over the last 5+ years for "cloud" and shoving data to the other side of the world, particularly given the microcomputer revolution and the lack of any need to shove your data elsewhere. That's what the microcomputer is for!
The LGP-21 has the fewest transistors of any mass-market computer I've read about: 460, plus 300 diodes.
You will love this paragraph:
"Unlike the 286 LOADALL, the 386 LOADALL is still an Intel top secret. l do not know of any document that describes its use, format, or acknowledges its existence. Very few people at Intel wil1 acknowledge that LOADALL even exists in the 80386 mask. The official Intel line is that, due to U.S. Military pressure, LOADALL was removed from the 80386 mask over a year ago. However, running the program in Listing-2 demonstrates that LOADALL is alive, well, and still available on the latest stepping of the 80386."
Just imagine what's in Intel chips now due to NSA pressure :/
"LOADALL restores the microprocessor state from the State Save Map that is saved during the transition from user mode to ICE mode. LOADALL loads enough of the microprocessor state to ensure return to any processor operating mode."
There certainly are undocumented debug facilities in modern CPUs. For one example, the leaked Socket AM3 datasheet clearly shows a JTAG interface, though I don't know if it's operational in production silicon.
Hopefully, debug capabilities cannot be used to pwn the CPU from unprivileged code without external debug hardware, which could pwn the CPU by itself anyway. It's not even clear whether they are enabled in production chips at all.
LOADALL, for example, worked only in ring 0 and was ultimately removed early in the 486 days, so it seems Intel cared somewhat about security (and probably also about future compatibility; to be honest, it's not fun when software relies on features you want to change in the next generation).
Nowadays they should care even more - if software backdoors were available and leaked to the public, the magnitude of shit happening in all those cloud companies would be monumental.
conveniently "discovered" by a 3 letter agency favorite principle contractor (Batelle Memorial Institute - have fun researching them) employee just after everybody switched to the next(fixed) cpu generation.
It also doesn't work from userspace, so pretty much all you can do with it is hack SMM from a kernel running on bare metal. Maybe useful for rootkits, but truth be told, 3LAs seem to have no problem making non-SMM malware undetectable by commercial AVs. See Stuxnet :)
> conveniently "discovered" by an employee of a favorite principal contractor of a three-letter agency
Not sure what you are alluding to. 3LAs wouldn't want this to be known if it was their job, methinks.
It'd be interesting, but I don't think it'd save all that much unless you stripped 32-bit compatibility as well, and even then it might be less than you think, or they probably would have tried to see if the market would want it...
Or a mobile dumbphone came with a manual explaining all the menus and options. Now the only paperwork with a smartphone is legal and warranty.
First, we have to acknowledge that most texts (it works for law too) are much more diluted now than a few decades ago. There is a lot of blah-blah that doesn't carry information. Information density has decreased.
Then there are docs so big (many thousands of pages) that I am sure no editor can read them fully. They pile up copy-paste from older or similar models without checking whether it applies to the chip. They don't write a clean doc specifically for the chip. So as a user you can throw out parts of the doc; the problem is that you don't know which ones.
Since they don't print manuals any more, they don't have to care about fitting the doc into a book; there's no limit.
(Which I used as a ~12-year-old to work out how to connect a home-built four-microswitch, two-push-button joystick to the printer port, so I could write blocky text/graphics versions of video games I wanted to play.)
Source: I've got the x86 and x64 instruction set manuals from there, which run to thousands of pages of PDF. Rootkits ain't gonna write themselves =)
It's quite a surprise to go to HN and see my post from 2013 here, by the way.
That didn't happen in the past and I doubt it will be true in the future; in fact I'd say one of the reasons ARM remained competitive is because of conditional execution, the "free" barrel shifter, and Thumb mode, which really help with code density (directly related to cache usage) and avoiding branches.
AArch64 looks very much like MIPS, an ISA that hasn't really been known for anything other than being cheap and a good simple pedagogical aid (despite plenty of people being convinced it would easily outperform x86 at a fraction of the cost.) I'd guess any perceived performance increases over AArch32 are primarily due to the widening to 64 bits, and in any case much of the benchmark performance rests on the special functional units (SIMD, crypto, etc.)
No compiler developer would agree with you. The conditional execution wreaks havoc with dependencies, and branches are very cheap if correctly predicted. The barrel shifter is not as useful as you would think (what fraction of instructions are shl or shr?)
Thumb mode does help code density, but not as much as you might think due to Thumb-1 not being practical and Thumb-2 being fairly large. AArch64 is quite a bit denser than x86-64 already.
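For what it's worth, where the "free" shift does earn its keep is scaled addressing rather than standalone shifts; a sketch (the AArch32 instruction in the comment is standard syntax a compiler might emit):

```c
/* On AArch32 a compiler can fold the *4 scaling into the load itself,
   e.g.  ldr r0, [r1, r2, lsl #2]  - no separate shift instruction.
   As the parent notes, that's addressing, not general shl/shr traffic. */
int get(const int *a, int i) {
    return a[i];   /* address = a + (i << 2) */
}
```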
It is true that the ISA doesn't matter too much from a performance point of view. But why not take advantage of the necessary compatibility break to clean things up? There's a lot of needless complexity in our ISAs from the programmer's point of view, and cleaning it up is just good engineering practice. Let's not saddle future generations with the mistakes of the 1980s.
Personally, I prefer ARM's move, because while it's a complexity increase now, it paves the way to someday drop support for 32-bit ARM, which would be a major architectural simplification. It also means that, as a programmer, when you're in 64-bit mode you aren't burdened by all of that weird backwards compatibility stuff going back decades—you get a relatively clean ISA.
(I do have some gripes with AArch64, to be fair: there are too many addressing modes and the condition codes are unnecessary. But I'll take anything that moves us in a more RISC-y direction.)
Granted, I didn't read the full documentation provided online for my hardware before I powered it on. Honestly, I didn't read any documentation and it just works, kind of.
They also probably figured that they could reuse decoding logic.
And it is just a very small piece of the cruft that x86-64 has.
UEFI is an overcomplicated, buggy monstrosity, but that's just the tail end of the "boot process". Nowadays, to get an x86 CPU to execute a single opcode, you need to have a Management Engine (or Platform Security Processor, in AMD-speak) firmware blob resident in the firmware flash chip. More modern Intel CPUs, say, oblige you to use Intel-provided "memory reference code" and other "firmware support package" blobs just to initialize the CPU in the early stage. AFAIK, Intel isn't even bothering to document the details of its CPU and chipset initialization sequences anymore, in favour of just making people use unexplained blobs. These are just some of the issues the coreboot project is having to deal with. It really feels like, at least in the world of x86, the window is rapidly closing on projects like coreboot being able to accomplish anything useful, although there are at least some major users like Chromebooks.
And then of course we have things like SMM, and the way in which secure firmware updates are facilitated (which relies on things like flash write protect functionality)...
The ME firmware is loaded by the CPU itself before anything begins executing; there's a header in the firmware image stored in the flash to let the CPU find it. These blobs are cryptographically signed, so all projects like Coreboot can do is incorporate the binaries provided by Intel.
The MRC/FSP blobs are executed by the x86 firmware; they're x86 code which runs very early. Theoretically projects like Coreboot could replace these blobs with their own code, but it would require reverse engineering them to figure out what they're doing. The fact that this would be a major effort is a testament to the complexity of the initialization routines implemented in these blobs.
The order is basically something along the lines of:
1. CPU loads ME firmware, verifies signature, starts it running on the ME coprocessor.
2. First x86 opcode is executed; this is part of the 3rd party firmware (Coreboot, AMI, etc.)
3. The 3rd party firmware will probably start by executing the Intel MRC/FSP blob. (Possibly this blob even expects to be the reset vector now, wouldn't surprise me; I'm not an expert on this.)
4. The memory controllers/chipset/etc. are now set up. The 3rd party firmware can do what it likes at this point.
5. Typically, firmware will implement a standard boot protocol like MBR boot or UEFI boot. Coreboot executes a payload at this stage.
I should add that microcode is another (signed, encrypted) blob. Modern x86 CPUs are so buggy out of the factory that they're often unable to even boot an OS unless a microcode upgrade is applied, so 3rd party firmware often performs a microcode upgrade before booting. Historically I don't believe it was uncommon for the OS kernel to perform a microcode upgrade, if configured to do so because a newer microcode was available than was incorporated in the firmware; Linux has functionality to do this. However I seem to recall that late (kernel boot or later) microcode application is being phased out; recent x86 CPUs want microcode updates to be completed very early, before kernel boot.
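For reference, the late-load path mentioned above is exposed by Linux through sysfs; a minimal sketch, assuming the kernel's microcode driver is present (root required, and as noted the kernel has been moving away from this path):

```c
#include <stdio.h>

int main(void) {
    /* Writing "1" here asks the kernel to reload CPU microcode late,
       i.e. after boot - the mechanism the comment says is being phased
       out in favour of early loading from the initramfs. */
    FILE *f = fopen("/sys/devices/system/cpu/microcode/reload", "w");
    if (!f) { perror("microcode reload"); return 1; }
    fputs("1\n", f);
    fclose(f);
    return 0;
}
```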
The really weird initial state of SMM is a bigger deal since it happens at runtime.
- the old floating point register stack and its 80-bit registers
- I don't know how many iterations of SIMD instructions (MMX, a few iterations of SSE, a few iterations of AVX, various prefixes to make older instructions use newer registers)
If you got rid of those, you could also get rid of quite a few instruction prefixes, maybe a configuration bit here and there, etc. (see the byte-level example below).
It also doesn't help that, at times, Intel and AMD independently added stuff to the x86.
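To make the prefix point concrete, a small illustration; the byte sequences are standard x86-64 encodings of the same ADD opcode (0x01 /r) at three operand sizes:

```c
#include <stdio.h>

int main(void) {
    /* Same opcode, three operand sizes, selected purely by prefixes. */
    unsigned char add32[] = {0x01, 0xd8};          /* add eax, ebx          */
    unsigned char add16[] = {0x66, 0x01, 0xd8};    /* add ax, bx   (0x66)   */
    unsigned char add64[] = {0x48, 0x01, 0xd8};    /* add rax, rbx (REX.W)  */
    printf("encodings: %zu, %zu, %zu bytes\n",
           sizeof add32, sizeof add16, sizeof add64);
    return 0;
}
```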
Power on reset, shift, decode, execute, repeat.
Intel loves complexity, which is why they invented USB: another tree killer.
The processor doesn't do anything. In all that silicon and its pages of documentation, you can't even find a parser for assembly language; you need software for that.
In spite of 4000+ pages of documentation, printing "Hello, world" on a screen requires additional hardware, and a very detailed program. Want a linked list, or regex pattern matching? Not in the 4000 pages; write the code.
And this is just the architecture manual for software developers. It is not documentation of the actual silicon. What it contains:
This document contains all seven volumes of the Intel 64 and IA-32 Architectures Software Developer's Manual: Basic Architecture, Instruction Set Reference A-L, Instruction Set Reference M-Z, Instruction Set Reference, and the System Programming Guide, Parts 1, 2 and 3. Refer to all seven volumes when evaluating your design needs.
Instruction set references and system programming guide; that's it!
Note also that this is not the programming documentation for a system on a chip (SoC). There is nothing in this 4000+ page magnum opus about any peripheral. No serial ports, no real-time clocks, no Ethernet PHYs, no A/D or D/A converters; nothing. Just the CPU.
Intel loves performance, because people want performance. Complexity is the cost of increased performance. As an example, I would guess that of the ~2000 pages of the instruction set reference, at a minimum 1000 pages document the various SIMD instructions. You don't need those, or the floating point operations, or SHA instructions, but I don't see any harm done by making them available.
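To give a flavour of what fills those pages: each SIMD operation exists across several widths and element types, and every variant gets its own reference entry. Here's one of them through the standard SSE intrinsics (assumes an x86 compiler; addps does four float additions in one instruction):

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    /* _mm_set_ps takes elements from highest to lowest lane. */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(a, b));  /* one addps instruction */
    printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]);
    return 0;   /* prints: 11 22 33 44 */
}
```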
I think the single instruction which can do the entire job without any magic assist is subneg x, y, z:
Subtract location x from location y; store the result in location y; and branch to location z if result is less than 0; else proceed to next.
Or various trivial variations of the same idea.
Any complication beyond this is no more than syntactical sugar and performance optimization.
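To make this concrete, here's a minimal interpreter sketch for exactly that machine (my own toy code, following the definition above; the memory layout and the tiny B = B + A demo program are made up for illustration):

```c
#include <stdio.h>

int main(void) {
    /* Memory: two three-word instructions (x, y, z), then data.
       subneg: mem[y] -= mem[x]; branch to z if the result < 0. */
    int mem[] = {
        6, 8, 3,   /* T -= A: T becomes -A                */
        8, 7, 6,   /* B -= T: B becomes B - (-A) = B + A  */
        42,        /* mem[6] = A          */
        10,        /* mem[7] = B          */
        0,         /* mem[8] = T, scratch */
    };
    int pc = 0;
    while (pc >= 0 && pc < 6) {            /* 2 instructions * 3 words */
        int x = mem[pc], y = mem[pc + 1], z = mem[pc + 2];
        mem[y] -= mem[x];
        pc = (mem[y] < 0) ? z : pc + 3;    /* branch if result negative */
    }
    printf("B = %d\n", mem[7]);            /* prints 52 */
    return 0;
}
```

Note that each z field simply points at the next instruction, which is the usual trick for making a subneg/subleq branch effectively unconditional.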
If you focus only on direct memory addressing (no indirect or indexed), mine does still work, but mov doesn't. I think.
It's just some CPU instructions!
Maybe they aren't doing a good job of describing them succinctly?
These things aren't written to be brief, they are written to be complete.
It's difficult to argue with someone starting from this level of wrong.