Here's a mantra to repeat until you prove otherwise: "My application is not CPU bound, it's IO bound." Even most applications that max out the CPU are still IO bound because of their IO access patterns. Seriously, think about how your app is IO bound, not CPU bound.
It's the cache thrashing / memory IO / disk IO constraints, not the number of threads. People just increase the number of threads to increase the chance that there is data in the cache that the CPU can use.
Trim your data structures so they fit in a CPU cache line, make sure your data structures are properly aligned, and make sure all the RAM slots on your server are in use. Put more RAM in your servers. Throw out your spindle disks and replace them with SSDs. Check your switches: can they actually handle a full 1Gbps burst, or do they start dropping packets? Are you using all the network ports on your server? Are your switch interconnects 10Gbps? Is close-to-open cache coherence on NFS killing your cache hit rates? Are your indexes fragmented? Is your memory fragmented? Are you splitting your IO so that sequential IO goes to spindles and random IO goes to SSDs? Are your disk partitions aligned to your RAID block size? (Hint: if you're using RAID and your partition starts at sector 1, you're losing 15% of your IO performance.) Do you have a cache in front of your app servers that will cache the response so the server thinks it's communicating with the client at a full 1Gbps? I'm not talking about something like Varnish, just a simple store-and-forward cache.
The GIL is just a red herring: something people can point to that they have no control over, so they can give up hunting for performance.
It's not the GIL, it's your IO subsystem. And turn your random IO into sequential IO. Here's an example: let's say you have to store a whole bunch of 40K JPEGs. Instead of writing them as individual files, put 1000 of them into an uncompressed ZIP file. Now your IO is sequential instead of random and you've taken a whole bunch of load off the kernel. If you're running on a 7200 RPM disk, your throughput just went from 5-10MB/sec to ~100-200MB/sec.
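Here's a rough sketch of the packing step using Python's zipfile module (the paths are made up; ZIP_STORED is the no-compression mode):

    import glob
    import zipfile

    # Pack up to 1000 small JPEGs into one uncompressed archive so the writes
    # (and later reads) hit the disk sequentially. Paths are placeholders.
    with zipfile.ZipFile("photos-0001.zip", "w", zipfile.ZIP_STORED) as archive:
        for path in sorted(glob.glob("incoming/*.jpg"))[:1000]:
            archive.write(path)  # ZIP_STORED = a small header plus the raw bytes, no compression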
I'd upvote this 100x if I could. From my experience dealing with large-scale build systems (building 10k - 100m SLOC every 12 hours or so), nothing matters more than IO throughput. Better network architecture, fatter pipes, faster drives, more RAM, and bigger L3/L2/L1 caches will almost always have more impact than more or faster cores.
Most of the time, as you point out, multithreading only ameliorates the IO cost because it allows something in RAM or cache to be processed. If those additional threads need to hit network or disk, however, you're boned and get significantly less benefit. In many cases, you can get a bigger win by not multithreading and simply being smarter about your IO.
Also worth reiterating, something raised both by you and the article: MEASURE! You know nothing about how to optimize your app or system until after you've very thoroughly measured, questioned the results, and re-assessed your assumptions. So often, design assumptions made when writing your app or building your system turn out not to be true after measurement.
Sounds like something that would benefit highly from putting your builds on a RAM disk.
If you assume 100M SLOC @ 100 bytes per line you're looking at only ~10GB of RAM. That's only an extra $160 per build machine. (2x4GB DDR3 sticks @ $80 each)
I shaved ~10 minutes off a ~15 minute build-and-unit-test time by putting the SQL DB and the source tree in RAM. SQL was by far the biggest win; the source tree only saved us about 30 seconds.
If you're on Windows, just tick the Advanced Write Optimization box on your drive and all your writes will go to RAM. I think there's something similar on Linux. It sounds like you're the type of person who probably already knows this :)
L2/L3 cache is absolutely huge, but the biggest win by far IMHO is a battery-backed write cache. The really shitty thing is that most BBWC controllers use 50% of the cache as read cache by default, which is a total waste IMO. I find about 10% read cache is optimal on the controller.
If you can't afford SSDs and still need to guarantee consistency.
That's a false dichotomy, especially for most Rails applications, and one that can easily be shown false using tools like NewRelic.
Of course YMMV depending on the nature of your Ruby application, but in general (and particularly for Rails applications) they are quite CPU hungry. Assembling HTML, XML, and JSON can be quite CPU intensive, especially for large documents. This can be mitigated to a certain extent with techniques like fragment caching (or page caching, which prevents requests from ever hitting the Rails stack).
Sure, a large part of the Rails request cycle is going to be spent in I/O wait on databases or other external services. But after that wait is over, the app gets to work synthesizing that data into a response to send back to the client, and that process is typically CPU intensive.
I use Ruby in a research lab for scientific computing, and in our case most of our applications are definitely CPU bound. We end up rewriting many of them in C to speed them up and use threading, but if we could keep them in Ruby while still using concurrency, the faster development times would improve the lives of everyone in the lab.
Ruby is not Rails. I for one would benefit immensely from being able to run threaded Ruby code.
The easiest move might be to NumPy, assuming it can express your computations in its methods, because if you're reduced to doing the math in pure Python you're hardly any better off.
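To make that last point concrete, here's a toy comparison (the array size and the operation are invented for illustration):

    import numpy as np

    values = np.random.rand(1000000)

    # Pure-Python loop: every element goes through the interpreter.
    total = 0.0
    for v in values:
        total += v * v

    # Vectorized: the same reduction as a single NumPy call, with the loop running in C.
    total_np = np.dot(values, values)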
If you're willing to pay the price of moving away from Ruby, there are certainly other viable options that aren't C. Learning the minimum Haskell needed to do decently fast math wouldn't be that hard and wouldn't involve too much of the "weird stuff", for instance.
Much of our code is not CPU bound, and the PI is something of a Ruby evangelist. If everyone jumps away from Ruby it will never improve.
That being said, I don't think Ruby will ever be the fastest language, but speeding things up is always a good goal.
I'd highly suggest you consider JRuby + Mirah, or IronRuby + F#. F#, especially with units of measure, is a really nice way to write your math-intensive code. See F# vs. C on the Burrows-Wheeler Transform.
Mirah has familiar Ruby syntax but would be perfect for your scientific code since, IIRC, function calls are early-bound rather than late-bound.
http://www.mirah.org/
Another huge one on Windows, and especially .NET: if you see a MemoryStream, replace it with a FileStream that uses DELETE_ON_CLOSE semantics. It bypasses the .NET GC and lets the kernel decide whether it actually wants to write the file out. So if you've got a 1GB file that you need temporarily, only part of it will be written to disk, depending on how much RAM is available.
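The closest Python stdlib analogue I know of is tempfile.TemporaryFile; it's not the same mechanism as the Win32 DELETE_ON_CLOSE trick, but the effect is similar (rough sketch, sizes made up):

    import tempfile

    # The file is unlinked up front on POSIX and opened with O_TEMPORARY on
    # Windows, so the OS is free to keep it in the page cache and may never
    # flush much of it to disk if it's short-lived.
    with tempfile.TemporaryFile() as scratch:
        scratch.write(b"\x00" * (64 * 1024 * 1024))  # pretend this is the big intermediate blob
        scratch.seek(0)
        header = scratch.read(4096)                  # read pieces back as needed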
But read load generally follows a square law: your most popular files are going to be used an order of magnitude more than your least popular files. If you have 1TB of files you can probably get away with 1GB of RAM for caching with a 90% hit rate. This means that your read speed is something on the order of 11GB/sec (i.e. more than your 1Gbps network). Because ZIP files are essentially append-only, you can read from a ZIP file while it's being written to, as long as you cache the file table. This means that you can have hundreds of readers of the ZIP file and only one writer. UNIX/Windows file locking semantics support this beautifully: readers don't lock the file and writers obtain a write lock.
The basic design of the system is that the readers register an event handler with the writer, and when the writer finishes writing a file it updates the readers with that file's location inside the ZIP. Once you have that information, it's a simple read call into the ZIP file.
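A very rough sketch of that handoff in Python (it assumes ZIP_STORED members with ASCII names, and the offset arithmetic relies on the fixed 30-byte local file header, so treat it as an illustration rather than production code):

    import threading
    import zipfile

    class ArchiveWriter:
        def __init__(self, path):
            self.path = path
            self.zip = zipfile.ZipFile(path, "a", zipfile.ZIP_STORED)
            self.listeners = []           # reader callbacks registered below
            self.lock = threading.Lock()  # one writer at a time

        def register(self, callback):
            # Readers register an event handler; the writer tells them where
            # each new member lands inside the archive.
            self.listeners.append(callback)

        def add(self, name, data):
            with self.lock:
                self.zip.writestr(name, data)
                self.zip.fp.flush()  # .fp is private; a real implementation would own the handle
                info = self.zip.getinfo(name)
                # Stored data starts right after the local header: 30 fixed
                # bytes plus the filename plus the extra field.
                offset = info.header_offset + 30 + len(info.filename) + len(info.extra)
                size = info.file_size
            for notify in self.listeners:
                notify(name, offset, size)

    def read_member(path, offset, size):
        # A reader only needs the (offset, size) the writer published; it never
        # has to parse the central directory, just seek and read.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(size)

A reader would call register() once, cache the (name, offset, size) tuples it receives, and call read_member() whenever it needs the bytes.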
I want to, but it's hard to make an artificial test for this. You need a dataset that's larger than available RAM and a read queue that follows real-life access patterns but also touches enough distinct data to force disk reads to happen.
If the theory checked out I would go ahead and implement it - but the theory says it would be slower.
This is a great article, raising many valid points around concurrent programming. I can't speak to Ruby, but Python supports concurrent programming and has never presumed linear execution.
Python has had POSIX threads since at least 1.5.1. The standard library contains a set of tools for concurrent programming (see the threading module, queue, etc.) and many included libraries are thread-safe.
The GIL's behavior is complex, but even so it's still possible to write performant, multi-threaded code in Python, especially for IO-bound tasks. See David's excellent write-up: http://www.dabeaz.com/GIL/gilvis/index.html.
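For what it's worth, the IO-bound case really can be as simple as this (the URLs and thread count are placeholders):

    import threading
    import urllib.request
    from queue import Queue, Empty

    # The GIL is released while each thread blocks on the network,
    # so the downloads overlap.
    urls = Queue()
    for u in ("http://example.com/a", "http://example.com/b", "http://example.com/c"):
        urls.put(u)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = urls.get_nowait()
            except Empty:
                return
            body = urllib.request.urlopen(url).read()  # blocking read; GIL released here
            with results_lock:
                results[url] = len(body)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()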
Everybody wants the GIL to be removed, and several attempts have been made, but the collateral damage seems just too great at the moment: C extensions and tons of libraries, frameworks, and programs would need to be rewritten, at the cost of greater complexity, for questionable performance gains in the average case.
Backwards compatibility is a core philosophy of Python, one which I think makes it a great programming environment. This makes removal of the GIL more a philosophical problem than a technical one. Python 3 has made some progress in removing GIL contention, but it's certainly here to stay for the foreseeable future.
Perhaps Python will be left behind in the future due to these decisions, but given all the work being done, I find it hard to believe any argument claiming that the Python community isn't actively invested in concurrent programming.
I'm the author of cool.io and Revactor (the latter was mentioned in this article), and as much fun as it is to experiment with these ideas, the path of least surprise, especially for Rails applications, is to use threads. I wrote about that here:
Have I gotten burned by Ruby libraries that aren't thread safe? Sure. I found a thread safety bug in my application just yesterday, and it was a pain in the ass to track down. However, once I tracked it down, it was extremely easy to fix.
With JRuby (and soon with Rubinius Hydra) you can deploy one VM and take advantage of all of the CPU cores in your server. Debugging a multithreaded application isn't an impossible task, and if you're a JRuby user you have access to a whole slew of JVM tools to assist you.
Pushing a language, its libraries, and extensions which have traditionally presumed linear execution into a threaded/parallel world is a minefield. I'd like to offer a few thoughts on some factors which might have contributed to this complexity, to reframe the question of what might constitute the "best" concurrency strategy in terms of tradeoffs, and conclude with a call for us to take up the task of reasoning about our programs in a parallel context.
We've seen an explosion of interest in non-threaded, single-process approaches to concurrency in the Ruby and Python communities in the past couple years. Much of the difficulty here lies in frameworks, libraries, and development paradigms which were not designed to be threadsafe (or to teach threadsafe programming) from the start. Developers and library authors have for years been able to operate under the assumption that Ruby code in a single process is executed serially and not in parallel. In environments which permit parallel, concurrent execution inside the boundaries of a single process, authors must return to the foundations of what they've created and scrutinize every bit, reconceptualizing their programs in terms of shared state and mutation.
Individual developers relying on these libraries must also scrutinize each and every gem they require to ensure that the code is threadsafe as well. This is not an easy task, as reasoning about state is inherently difficult, especially when the original program may not have been designed with concurrency in mind, and even more so when one was not the original author. Further, we haven't seen a significant chasm emerge in the Ruby community between things which "are" and "are not" threadsafe, and it is not common to certify one's code as "threadsafe" on release. That said, it's commendable that the Rails team has worked diligently to piece through each component of the framework and certify it clean.
I do not mean to suggest that multithreaded code written in Ruby is uncommon, unsafe, or a bad idea in general. Many applications running in environments without a GIL/GVL such as JRuby utilize this functionality effectively today. What I'm suggesting is that process-based concurrency (Mongrel/Passenger/Unicorn, et al) is often favored by developers because it eliminates a large swath of potential pitfalls (or rather, trades them for increased usage of system resources).
In light of this, we've seen developers in the Ruby and Python communities experimenting with and popularizing alternate concurrency models, not the least of which include cooperatively-scheduled fibers/co-routines and evented programming. These approaches avoid a specific subset of the challenges in concurrent multithreaded execution, while enabling limited concurrency in one process. Matt is quick to point out that these models don't achieve true concurrent parallelism, but do offer significant benefit over standard serial execution. At the same time, they impose a different sort of complexity upon the programmer -- either requiring her to reason about a request or operation in terms of events and callbacks, or to use a library to cooperatively schedule multiple IO-bound tasks within a single VM (which may hide some complexity, but introduce uncertainty by clouding what is actually happening behind the scenes).
I wish Ruby and Python the best here. I have a significant investment in both. But as long as developers must ask of libraries "Is it threadsafe?" with fear and negative presumption, traditional multithreaded concurrency might not be ideal given the investment required to achieve it safely and correctly.
More than anything, I hope for continued complex reasoning and thought on this question. We simply must move beyond reductionist statements such as "threads are hard!" to progress. I'd suggest that threaded programming in and of itself is not hard per se -- reasoning about shared state is, and the developer bears a responsibility to either eliminate, or minimize and encapsulate it properly in her code. Evented and coroutine-oriented programming brings its own challenges. Actor models are interesting and may be appropriate for some contexts. STM's pretty cool too, but it would not be proper to hang one's hopes upon something which is not likely to cure all ills.
There is no panacea for concurrency. There will always be challenges and tradeoffs involved in developing performant, efficient applications within the constraints of both hardware and programmer resources. We would do well to push ourselves toward a greater understanding of the challenges in developing such programs, as well as the tradeoffs involved with each approach. I think this article is a step in the right direction.
@cscotta as the author of the blog post, I can only agree with the concerns you raised and thank you for your kind words.
Concurrency is something both Ruby and Python have to take seriously pretty quickly if both communities want to mature and play a more important role in the coming years. As you pointed out, there are different ways to get some sort of concurrency and they all present challenges and tradeoffs while none will cure all ills. But at the end of the day what matters is a community of people embracing these approaches and pushing the language further.
By removing the Global Interpreter Lock, most of the alternative Ruby implementations are making a statement, and the community seems to be reacting. If we educate the community, commonly used code will become threadsafe over time and developers will learn what it means to write threadsafe code. If made easy, co-routines and non-blocking IO will be used. If implemented well and properly explained, actors will be used more often when it makes sense to do so.
My goal was to try to simplify the concurrency problem so the community as a whole can discuss the topic. I think the concurrency situation can be improved by increasing awareness, getting people motivated, and having an open discussion. There is a reason why Ruby and Python still have a GIL and why removing it to improve concurrency would have its downsides. I want to make sure people understand that before blaming Matz for not removing the GIL already. The fragmentation of Ruby implementations also hurts progress in MRI; if you look at it, the alternative implementations often have more people working on them than MRI does!
Let's hope that this article is indeed a step in the right direction.
Absolutely - thanks, Matt. This is a very well-written, thorough article, and it provides a clear, thoughtful survey of the landscape for folks who are interested in parallelizing their programs and learning more about the advantages and tradeoffs implicit in different approaches. This piece has a lot of potential to catalyze thought within the community about how best to move forward.
The explosion of interest in the subject intrigues me. The work that's been going on in terms of making mainstream Ruby libraries threadsafe is great to see, along with fibers and evented approaches. MacRuby's exposure of Grand Central is especially unique. But interest and upvotes must also be followed by action. There will be a lot of false starts. We'll probably see a handful of approaches which bloom and fade from popularity. While the "fragmentation" issue is there, I tend to think of it in terms of spirited experimentation. In any case, people really care now, and it's that drive that forces a community forward. That's good to see.
As much as this was informative... "However these people often don’t mention that the GIL makes single threaded programs faster" <- how did they ever get that idea?
Sure, I understand the idea of GIL being faster than fine-grained locking for single-threaded apps. This is not how I read the article though (maybe that's what they meant...).
In most cases the speed ranks, fastest to slowest: plain single-threaded code with no locking, then single-threaded with a GIL, then single-threaded with fine-grained locks.
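A toy illustration of why per-operation locking hurts the single-threaded case (the counter and iteration counts are arbitrary; the absolute numbers don't matter, only the gap):

    import threading
    import timeit

    lock = threading.Lock()

    def bump_unlocked(n=100000):
        x = 0
        for _ in range(n):
            x += 1

    def bump_locked(n=100000):
        # A lock around every increment, standing in for fine-grained
        # per-object locking inside an interpreter.
        x = 0
        for _ in range(n):
            with lock:
                x += 1

    print(timeit.timeit(bump_unlocked, number=100))  # no locking
    print(timeit.timeit(bump_locked, number=100))    # a lock per operation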