Does anyone have a guide to x86 assembly that covers the basics as well as more advanced stuff like SSE, how the caches work (especially wrt multiple cores), and a model of how fast the instructions are in which circumstances? I'd like something that is pretty comprehensive and modern (you shouldn't use x86 floating point anymore? great, then skip that and go straight to SSE) while not being ridiculously enormous like the Intel manuals. How do experts learn this stuff? Really by reading Intel's and AMD's manuals?
For a basic description of the architecture, and detailed descriptions of the opcodes, read the Intel architecture manuals themselves. They're big because there's a lot to know. You don't have to read them all at once.
For very detailed information on the memory hierarchy, read Ulrich Drepper's "What every programmer should know about memory" series on LWN. Though even this is probably going to fall out of date soon, as it was written in 2007. http://lwn.net/Articles/250967/
For a model of how fast the instructions are, the best I know of is Agner Fog's instruction tables:
I'm not an expert, but in my experience architecture experts seem to learn through experience and by absorbing knowledge from the people around them. You have to be willing to conduct your own experiments, but you also have to be cautious in the conclusions you draw because there are so many interrelated factors.
There is a lot of tribal knowledge floating around about how to optimize assembly. The Intel manuals capture a lot of it, particularly the optimization reference manual. By reading that you'll learn the details of branch prediction, things to avoid like partial register stalls, etc.
At Google (where I work) there are guys who are optimizing at the assembly level. Some guy will post a specific routine he's trying to optimize to a mailing list, and it will go back and forth between a handful of expert-level people throwing in ideas. There was recently a thread about the performance difference between movdqa and movdqu. It seems to vary based on the architecture (Core2 vs Atom, etc.), whether the read is crossing a cache line, and so on.
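To make the movdqa/movdqu distinction concrete, here is a minimal sketch in C using the SSE2 intrinsics that compilers normally turn into those two instructions. This is not code from that thread; the function names sum16_aligned and sum16_unaligned are made up for illustration, and it assumes an x86 compiler with SSE2 and 64-byte cache lines.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Add up the 16 bytes held in an XMM register.
   psadbw against zero yields two partial sums, one per 64-bit half. */
static uint32_t sum_bytes(__m128i v) {
    __m128i s = _mm_sad_epu8(v, _mm_setzero_si128());
    return (uint32_t)_mm_cvtsi128_si32(s)      /* sum of bytes 0..7  */
         + (uint32_t)_mm_extract_epi16(s, 4);  /* sum of bytes 8..15 */
}

/* Aligned load (movdqa): p must be 16-byte aligned, otherwise the
   instruction faults. The assert documents that requirement. */
static uint32_t sum16_aligned(const uint8_t *p) {
    assert(((uintptr_t)p & 15) == 0);
    return sum_bytes(_mm_load_si128((const __m128i *)p));
}

/* Unaligned load (movdqu): any address works, including one where the
   16 bytes straddle a cache-line boundary. */
static uint32_t sum16_unaligned(const uint8_t *p) {
    return sum_bytes(_mm_loadu_si128((const __m128i *)p));
}
```

Both functions return the same answer for the same bytes; only the load instruction differs. With a 64-byte-aligned buffer buf, sum16_unaligned(buf + 56) reads bytes 56 through 71 and so crosses a cache line, which is the case the thread was arguing about. To see the actual cost you'd have to time each variant in a loop on each CPU you care about, since the sketch itself says nothing about speed.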
There's just a lot to know, and I doubt anybody knows it all.
I briefly covered ASM in my first year of university, but I don't think it counts since we got all input with C.