The counters vary per Intel CPU, though the most useful ones are universal (e.g. cycle counts). AMD has similar counters.
Includes really easy-to-use performance counter support.
pcm is also quite nice for monitoring what an entire system is doing in terms of memory bandwidth, NUMA link traffic, and other package-level concerns, but it doesn't give any kernel- or application-level tracing like the other tools do.
2) Not sure about this one, though I can think of some crappy hacks:
--A) Timed LBR mentioned in that LWN article (somewhat indirect, but might get the job done)
--B) Use perf counter overflow interrupts (for cache misses) and set the counter's initial value high, i.e. close to overflow, so that it interrupts after a fixed number of misses (which should let you sample the cache miss locations). This can only tell you whether a particular load makes up a large fraction of your overall cache misses (which is probably not super useful).
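
To make the "set the initial value high" trick in B) concrete, here is a rough simulation of the arithmetic, not real PMU code. The 48-bit counter width is an assumption (the actual width is reported by CPUID leaf 0xA), and `preload_value` and `sample_addresses` are hypothetical helper names made up for this sketch.

```python
COUNTER_BITS = 48  # assumed counter width; real hardware reports it via CPUID leaf 0xA


def preload_value(period, bits=COUNTER_BITS):
    """Initial counter value that makes the counter overflow after `period` events."""
    if not 0 < period < (1 << bits):
        raise ValueError("period must be positive and fit in the counter")
    return (1 << bits) - period


def sample_addresses(miss_addresses, period, bits=COUNTER_BITS):
    """Simulate overflow sampling: record the miss seen at each counter overflow."""
    counter = preload_value(period, bits)
    samples = []
    for addr in miss_addresses:
        counter += 1                 # one counted cache-miss event
        if counter >> bits:          # counter ran past its width: overflow interrupt
            samples.append(addr)     # the handler would log the location here
            counter = preload_value(period, bits)  # re-arm for the next sample
    return samples
```

With a period of 3, `sample_addresses(range(1, 11), 3)` records every third miss, i.e. `[3, 6, 9]`. On real hardware the interrupt usually arrives a few instructions late, so the logged location is only approximate.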
Edit: Forgot about PEBS, which is really what you want for 2).