A few years ago I made a (non-interactive) visualization similar to this, that made its way to hackernews [1]. It puts this on a linear scale, so you can get a feel for the baffling difference between your most local caches and your hard drive:
The limitation with a decent RAID array of SSDs will be your controller and/or your PCI interface. I have servers at work that we get ca. 1200MB/sec reads from without any real tuning, while we get ca. 6000MB/sec from a RAM disk (which of course still costs us the overhead of the filesystem layer). You might be able to beat that with a suitably designed setup, but it's likely to be expensive.
Getting 500-600MB/sec is "cheap". Getting 1200MB/sec gets a bit pricy but isn't hard (the setups we've tested "just" involved us popping 4x OCZ Vertex drives of various models into a server with SATA III, without much regard for anything else).
In practice, instruction-level parallelism and data access patterns make a big difference. If each memory access takes 100 ns but you can do 100 at a time, most of the time there is no practical difference from each memory access taking 1 ns where you can only do 1 at a time; an observer (either a human or another computer over a network) will not notice 100 ns. The numbers listed in this chart are pure latency numbers. What most people care about when talking about cache/memory speed is bandwidth.
On an Intel Sandy Bridge processor, each L1 access takes 4 cycles, each L2 access takes 12 cycles, each L3 access takes about 30 cycles, and each main memory access takes about 65 ns. Assuming you are using a 3 GHz processor, this would make you think that you can do 750 MT/s from L1, 250 MT/s from L2, 100 MT/s from L3, and 15 MT/s from RAM.
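That conversion from latency to single-access throughput is just arithmetic; a minimal sketch, using only the cycle counts and the 3 GHz clock quoted above:

```python
GHZ = 3.0  # assumed clock speed, as in the comment above

cache_cycles = {"L1": 4, "L2": 12, "L3": 30}  # Sandy Bridge latencies in cycles

# One access at a time: throughput (in MT/s) = clock / latency
throughput = {lvl: GHZ * 1000 / cyc for lvl, cyc in cache_cycles.items()}
throughput["RAM"] = 1000 / 65  # main memory is quoted directly as ~65 ns

for lvl, mts in throughput.items():
    print(f"{lvl}: {mts:.0f} MT/s")
```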
Now imagine 3 different data access patterns on a 1 GB array:
1) sequentially reading through an array
2) reading through an array with a random access pattern
3) reading through an array in a pattern where the next index is determined by the value at the current index.
If you benchmark these 3 access patterns, you will see that:
1) sequential access can do 3750 MT/s
2) random access can do 75 MT/s
3) data-dependent access can do 15 MT/s
You might guess that sequential access is fast because it is a very predictable access pattern, but it is still 5x faster than the speed indicated by the latency of L1. Maybe you'd think it's prefetching into registers or something? But notice the difference between random access and data-dependent access: this is probably not what you expected at all! Why is random access 5x faster than data-dependent access? Because on Sandy Bridge, each hyperthreaded core can do 5 memory accesses in parallel. This also explains why sequential access seems to be 5x faster than the latency of L1 would allow.
What does this mean in practice? That to do anything practical with these latency numbers, you also need to know the parallelism of your processor and the parallelizability of your access patterns. The pure latency number only matters if you are limited to one access at a time.
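The three patterns can be sketched as follows. This is illustrative Python, not a real benchmark: interpreter overhead will swamp the cache effects and the MT/s figures above won't be reproduced, but it shows how the patterns, and especially the data-dependent chain, are constructed:

```python
import random
import time

# Small for demo purposes; the comment above used a 1 GB array so the
# working set would blow out every cache level.
N = 1 << 18

seq = list(range(N))                   # 1) sequential scan
rand_idx = random.sample(range(N), N)  # 2) random, but indices known up front

# 3) data-dependent: each element holds the index of the next one,
#    forming a single cycle, so every load waits on the previous load.
perm = random.sample(range(N), N)
chain = [0] * N
for a, b in zip(perm, perm[1:] + perm[:1]):
    chain[a] = b

def chase(arr, start, steps):
    """Walk the chain; access i+1 cannot start until access i finishes."""
    i = start
    for _ in range(steps):
        i = arr[i]
    return i

t0 = time.perf_counter()
total = sum(seq)                        # pattern 1
t1 = time.perf_counter()
total2 = sum(seq[i] for i in rand_idx)  # pattern 2
t2 = time.perf_counter()
end = chase(chain, perm[0], N)          # pattern 3
t3 = time.perf_counter()

print(f"sequential {t1-t0:.3f}s, random {t2-t1:.3f}s, dependent {t3-t2:.3f}s")
```

A compiled language and a working set larger than L3 are needed to actually see the 3750/75/15 MT/s split described above.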
With timings that differ by several orders of magnitude, I'd just ignore the constant factors, as they change too frequently. Also, there is a difference between latency and bandwidth, and the chart is simply inconsistent.
CPU Cycle ~ 1 time unit
Anything you do at all. The cost of doing business.
CPU Cache Hit ~ 10 time units
Something that was located close to something else that was just accessed by either time or location.
Memory Access ~ 100 time units
Something that most likely has been accessed recently, but not immediately previously in the code.
Disk Access ~ 1,000,000 time units
It's been paged out to disk because it's accessed too infrequently or is too big to fit in memory.
Network Access ~ 100,000,000 time units
It's not even here. Damn. Go grab it from that long series of tubes. Roughly the same amount of time it takes to blink your eye.
It's probably worthwhile to differentiate between local network access and remote network access, and also differentiating between spinning disks and SSD.
Local gigabit network ping time is 200 μs and an SSD typically reads a 4K block in 25 μs. I imagine that converts to 200,000 time units and 25,000 time units, respectively.
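The conversion into the ladder's units is straightforward, assuming 1 time unit ≈ 1 ns as the ladder above implies:

```python
NS_PER_UNIT = 1  # the ladder above implicitly uses ~1 ns per "time unit"

latencies_us = {
    "local gigabit ping": 200,  # microseconds
    "SSD 4K read": 25,
}

units = {name: us * 1000 // NS_PER_UNIT for name, us in latencies_us.items()}
for name, u in units.items():
    print(f"{name}: {u:,} time units")
```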
"there is a difference between latency and bandwidth the chart is simply inconsistent".
Both transmission delay and propagation delay are measured in units of time -- I don't understand how it's inconsistent. If you'd like to change the payload size, I've made it very easy to do so -- simply change the html at the bottom of the page and it will be reflected immediately up top.
Thank you. This is much more useful for the sake of creating code. The raw numbers are irrelevant in the overwhelming majority of cases. What matters are the ratios. They impact the decisions we make every day.
(1) Mutex performance isn't constant across OSes. My own measurements tell me FreeBSD is around 20x slower than Linux at locking and unlocking a mutex in the uncontended case.
(2) Contention makes mutexes a lot slower. If a mutex is already locked, you have to wait until it unlocks before you can proceed. It doesn't matter how fast mutexes are if there's a thread locking the one you're waiting on for seconds at a time. And even if other threads aren't holding the mutex long, putting the waiting thread to sleep involves a syscall, which is relatively expensive.
A good part of contention's cost can be the time you wait for the thread scheduler. When the lock-holding thread releases, your thread is moved from 'sleeping' to 'ready', and then it's gotta wait for a CPU. This is usually fast, but the tail can be awful.
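A rough way to see the uncontended/contended gap, sketched in Python. Note the caveats: `threading.Lock` is not a raw pthread mutex, and the GIL skews contention behaviour, so treat the numbers as illustrative only:

```python
import threading
import time

lock = threading.Lock()
N = 50_000

# Uncontended: a single thread acquires and releases repeatedly.
t0 = time.perf_counter()
for _ in range(N):
    lock.acquire()
    lock.release()
uncontended_ns = (time.perf_counter() - t0) / N * 1e9

# Contended: two threads fight over the same lock and a shared counter,
# so some acquisitions have to park the thread and wait for the scheduler.
counter = 0

def worker():
    global counter
    for _ in range(N):
        with lock:
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(2)]
t0 = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
contended_ns = (time.perf_counter() - t0) / (2 * N) * 1e9

print(f"uncontended: {uncontended_ns:.0f} ns/op, contended: {contended_ns:.0f} ns/op")
```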
Is that still the case for FreeBSD? I thought they did some work this year to help speed it up. Also, the Linux emulation under FreeBSD was always slow for mutex (did you test under Linux emulation or native?), but they have implemented futexes now and pushed uncontended mutex to userspace to speed it up.
What implication is knowing that mutex lock/unlock time is 17ns supposed to have on my RoR programming? Especially if the web site is hosted on Amazon? Should really every programmer know it?
The idea is more about knowing the scale of time things take, to help compare different solutions. With those numbers you can see that it's faster to get data from memory on a distant server than to read the same data from disk. In common applications that means using disk storage is less efficient than storing it in a database on another server, since the database usually keeps almost everything in memory. It also tells a lot about why CDNs are needed if you are serving clients worldwide. A ping from North America to Europe will always take 100+ ms, so having geographically distributed content is a good idea.
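As a concrete comparison, using round figures from the classic latency chart (order-of-magnitude assumptions; real hardware varies):

```python
# Round figures in nanoseconds, taken from the classic latency chart;
# treat these as order-of-magnitude assumptions, not measurements.
DISK_SEEK          = 10_000_000  # one spinning-disk seek
SAME_DC_ROUND_TRIP = 500_000     # network round trip within a datacenter
READ_1MB_FROM_RAM  = 250_000

remote_ram = SAME_DC_ROUND_TRIP + READ_1MB_FROM_RAM
print(remote_ram < DISK_SEEK)  # fetching from a nearby server's RAM beats one disk seek
```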
>In common applications that means using disk storage is less efficient than storing it in a database on another server, since the database usually keeps almost everything in memory.
Depends on the usage pattern. Database servers generally have more memory than end user machines, but not proportionally more: A database serving a million users won't have a million times more memory. (This matters primarily when each user stores unique or low popularity data, otherwise the shared database has the advantage.) Moreover, data stored "on disk" on user machines will, with modern long-uptime large-memory desktops (and before long mobile devices), have a high probability of having the data cached and retrieved from memory rather than disk.
Thus, if the data is accessed a relatively short period of time after it is written (i.e. within a few hours), storing it locally on disk may be faster. (And storing it nominally on disk even though it's all cached may be preferable to storing it in strictly memory either because some minority of users have insufficient memory to store all necessary data in memory, or because the data should survive a loss of power or other program termination.)
Edit: It's also worth pointing out that latency isn't the sole performance criterion. Local disk generally has more bandwidth than the network. If you're bandwidth constrained at either endpoint (the database or the user device), or either is getting charged by the byte for network access, that can be an important consideration.
This lame comment comes up every time I see something like this posted. You're right. Knowing exact latency numbers is probably not going to change how you program, or how I program, or how most people program. But why does it hurt to know? Why refuse to learn a few numbers that give some perspective on your system's limitations?
The other end of this is, for example, a programmer who is unaware that disk seeks take something like 100000x as long as accessing main memory.
(Anyway, the posted link is interesting since it shows these numbers over time.)
It is probably a useful exercise for someone who thinks knowing these numbers is useless to understand why other people don't think so. And in that sense asking the question is valid.
If you're never going to leave your RoR comfort zone then I suppose you don't need to know this. But I'd have no qualms saying that "every programmer should know" these numbers, and that if you don't do anything where they matter, you occasionally should.
mutex lock/unlock time is important for anyone writing code that depends on code that is accessing the same data structures at the same time. If you don't know that there is a considerable cost to writing such code, now you do; and this number can quantify it.
Most of the time it won't matter much, if at all, but if you have app that is not running as fast as you would like then understanding those operation latencies can help a lot in optimising it.
It amazes me that something happening locally (in the ~10ms range, say a database query) is even in the same scale as something traveling halfway around the world. The internet wouldn't really work without this fact, but I'm very thankful for it. Simply incredible.
The internet would still work if you had a faster database. 19ms is your hard drive seek; SSDs are much faster. That's the big change of the last 10 years.
Sorry, I meant to imply that the data travelling such far distances was the most impressive part, given that it's on the same time-scale (milliseconds) as a transaction happening 1 inch away on your own computer. To me that's what's most incredible about that graph. It almost seems the traveling portion would need to be 100x longer than something happening locally, but obviously this isn't true.
Speed of light is really fast. Once you have something on a wire traveling close to it, you are in pretty good shape time wise.
The number on a round trip is actually somewhat misleading, because the number of router hops a packet must traverse dramatically impacts the round-trip time, so it isn't just a fixed value. It is a good ballpark though. I think the really impressive thing is how network hardware has been designed well enough to process billions of packets per second, where it has to read the IP destination header, compute the next hop, and send the packet off. You can often end up doing that a half dozen times in relatively small trips, but the individual time commitments are so much smaller than disk IO :P
Without a good feel for a billion, or a billionth, it's hard to get a feel for a nanosecond.
Visualization for a nanosecond: a piece of wire 11.8 inches long, which is the maximum distance light/electricity can travel in 1 ns. A microsecond is a coil of wire 984 feet long. So if you're trying to understand why it takes so dang long to send a message via satellite or whatever, just understand that there's a whole bunch of nano/micro/milli-seconds between here and there. This also explains why computers must be small to be fast.
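The wire-length visualization is easy to recompute from the speed of light:

```python
C = 299_792_458  # speed of light in vacuum, m/s

def light_distance_inches(seconds):
    """Distance light travels in the given time, converted metres -> inches."""
    return C * seconds / 0.0254

ns_wire = light_distance_inches(1e-9)          # ~11.8 inches per nanosecond
us_wire_ft = light_distance_inches(1e-6) / 12  # ~984 feet per microsecond
print(f"1 ns: {ns_wire:.1f} in of wire, 1 us: {us_wire_ft:.0f} ft")
```

(Signals in real copper or fiber travel noticeably slower than c, so Hopper's wire is an upper bound.)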
(Background: Rear Admiral Grace Hopper started programming on the Harvard Mark I in 1944, and wrote the first ever compiler in 1952 on the UNIVAC I. She earned the nickname "Grandma Cobol". Info via https://en.wikipedia.org/wiki/Grace_Hopper , which also summarizes this video in the "anecdotes" section.)
It's not quite clear what the first ever compiler was, since "compiler" is kind of a fuzzy category, but Rear Admiral Hopper might have given that credit to Betty Holberton.
>> in 1997, she received the IEEE Computer Pioneer Award from the IEEE Computer Society for developing the sort-merge generator which, according to IEEE, "inspired the first ideas about compilation."
Whereas Hopper coined the term "compiler" and wrote the first one that's recognized as such:
Indeed. But A-0 isn't what we'd call a "compiler" today. It was a gradual evolution from Holberton's work (1948? 1949?) through A-0 in 1952, up to, say, 1957, when the first optimizing FORTRAN compiler was shipped, which is definitely what we would call a compiler today.
54 ms for neutrinos taking the chord through the Earth (using 0.22 from your result). (That's only 4ms faster than light through vacuum along the great circle. Disappointing.)
It would be nice if the source code of the page were not shown in a frame below though. That screen space is served better to show the rest of the actual diagram. You can already see the source code of a page with your browser, why waste screenspace and annoying extra scrollbars on that?
The intention was to make it really easy to play with the parameters (note that if you make a change to the html, it will be immediately reflected in the screen above).
Oh, very interesting. You could add a button, though, to show the frame, or just hide it below the fold (increase the height of the visualization div). It's not like vertical space is a scarce resource, so you don't really need to hide it or to have the main div short.
I would beware of just extrapolating numbers on a curve. There are some physical constraints which make it impossible to improve certain numbers by much. And sometimes there are also logical constraints doing the same.
We need somewhat accurate measurements on modern hardware as a base. I would not be surprised if some values have gotten worse over the years.
GP has a point, though: if you got a CS degree without understanding this stuff, your CS department failed you. Still, it is nice to review, and as you say, good for those who don't know.
Much like the speed of spawning new processes under Linux (and other UNIXen), I think it's something that has been very well tuned because it needed to be (ie, it is so widely used).
There's not much room for improvement. It's a little over 5000 miles from CA to the Netherlands, so light takes about 27ms to travel that far in a vacuum. That means a ~55ms round trip minimum. Light in fiber goes at 0.6-0.7c, so the fastest round trip you could have with a straight fiber line is 80ish milliseconds.
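That floor falls out of first principles; a sketch using the distance and the 0.6-0.7c fiber figure from above:

```python
C_KM_PER_S = 299_792.458
MILES_TO_KM = 1.609344

def min_rtt_ms(miles, fraction_of_c=1.0):
    """Round-trip floor for a straight-line path at a given signal speed."""
    one_way_s = miles * MILES_TO_KM / (C_KM_PER_S * fraction_of_c)
    return 2 * one_way_s * 1000

vacuum_rtt = min_rtt_ms(5000)        # vacuum, straight line
fiber_rtt = min_rtt_ms(5000, 0.66)   # light in fiber at ~0.66c
print(f"vacuum: {vacuum_rtt:.0f} ms, fiber: {fiber_rtt:.0f} ms")
```

Real routes aren't straight lines and add queueing and router hops, so observed pings are higher still.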
Figure out how to make "fiber" that transmits light closer to c, perhaps? This of course would bring other challenges- the refractive index is part of why light stays "inside" fiber in the first place.
The key to improving the noticeable latency is to improve our protocols to minimise the number of round trips; i.e. the per-packet latency is pretty much a physical quantity, but if we can reduce the number of times we incur it, we improve the overall experience.
My (ever-so-slightly more than basic undergrad) understanding is that you could send qubits to the Netherlands, have processing be performed by a quantum computer there, and have the result returned faster-than-light (as qubits), but no-one in the Netherlands could read your data or the result (or even the power meter) before at least a light-speed delay.
Take a look at the CERN Neutrinos to Gran Sasso (CNGS) neutrino beam experiments. They start with a proton beam at CERN. Aim it at a carbon target. The results produces pions and kaons, which decay into muons and neutrinos. The beamline, and hence the neutrinos, was directed towards Italy.
In the experiment, they can generate pulses of 3 nanoseconds, each separated by 524 nanosecond gaps. This gives a roughly 2MHz signal. However, not all of those pulses were detected.
If I read this right, "We report here on results based on part of the data taken during 2008 and 2009 runs, which amounted respectively to 1.78 and 3.52 × 10^19 [protons on target], respectively. Among the 10122 and 21428 on-time events taken in 2008 and 2009, respectively 1698 and 3693 events were identified as candidate neutrino interactions in the target bricks."
Elsewhere, I found that the experiment ran from 27 May - 23 Nov 2009. I don't know if the beam was in continuous use for all 6 months. Let's assume it was full-time for only 1 month of that, which is unlikely.
There's a signal of some 3000 events per month, giving a transmission rate of 100 bits per day, or about 1 bit every 15 minutes. Rather a lot lower than 2MHz.
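The arithmetic, with the ~3000 events/month figure assumed above:

```python
events_per_month = 3000  # detected candidate events, assuming one full-time month
days = 30

bits_per_day = events_per_month / days    # one detected event = at best one bit
minutes_per_bit = 24 * 60 / bits_per_day  # i.e. roughly one bit every 15 minutes

print(f"{bits_per_day:.0f} bits/day, one bit every {minutes_per_bit:.1f} minutes")
```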
We would need a way to detect neutrinos that is at least about 1,000,000x better before some company is willing to fund the Sydney/Wall St. neutrino beamline.
I have only basic theoretical knowledge about networks, but I guess not, because:
1. Adding more nodes/routes = shorter physical path
2. Processing speed of nodes has improved over the years (?)
http://i.imgur.com/X1Hi1.gif
Note: This is a huge gif, zoom in to the top.
[1] HN discussion: http://news.ycombinator.com/item?id=702713