What's significant to me is that you can do this stuff today on stock Linux. No need to run weird single-purpose kernels, strange hypervisors, etc.
You can SSH into your box. You can debug with gdb. Valgrind. Everything is normal...except the performance, which is just insane.
Given how easy it is, there isn't really a good excuse anymore to not write data plane applications the "right" way, instead of jamming everything through the kernel like we've been doing. Especially with Intel's latest E5 processors, the performance is just phenomenal.
If you want a fun, accessible project to play around with these concepts, Snabb Switch[0] makes it easy to write these kinds of apps with LuaJIT, which also has a super easy way to bind to C libraries. It's fast too: 40 million packets a second using a scripting language(!).
The project is new: I and other open source contributors are currently under contract to build a Network Functions Virtualization platform for Deutsche Telekom's TeraStream project [1] [2]. This is called Snabb NFV [3] and it's going to be totally open source and integrated with OpenStack.
Currently we are doing a lot of virtualization work from the "outside" of the VM: implementing Intel VMDq hardware acceleration and providing zero-copy Virtio-net to the VMs. So the virtual machine will see normal Virtio-net but we will make that operate really fast.
Inside the VMs we can either access a hardware NIC directly (via IOMMU "PCI passthrough") or we can write a device driver for the Virtio-net device.
So, early days, first major product being built, and lots of potential both inside and outside VMs, lots of fantastic products to build with nobody yet building them :-)
How about this use case: I have a Chromecast on my home network, but I want to sandbox/log its traffic. I would want to write some logic to ignore video data, because that's big. But I want to see the metadata and which servers it's talking to. I want to see when it's auto-updating itself with new binaries and record them.
Is that a good use case for Snabb Switch, or is there an easier way to accomplish what I want?
If you can express what you want to filter with a fancy pcap-filter expression, then tcpdump is the easy answer. Otherwise you might want to code it up in Lua with snabbswitch.
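To make that concrete, a bare-bones libpcap sketch of the capture side might look like the following. This is only a sketch: the interface name "eth0", the Chromecast address 192.168.1.50, and the crude "not port 443" stand-in for "skip the video" are all assumptions you'd adjust for your own LAN.

    #include <pcap.h>
    #include <stdio.h>

    /* Print one line per captured packet; real logging would parse headers. */
    static void on_packet(u_char *user, const struct pcap_pkthdr *h,
                          const u_char *bytes)
    {
        (void)user; (void)bytes;
        printf("captured %u bytes at %ld.%06ld\n",
               h->caplen, (long)h->ts.tv_sec, (long)h->ts.tv_usec);
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_open_live("eth0", 256 /* headers only */, 1, 1000, errbuf);
        if (!p) { fprintf(stderr, "pcap_open_live: %s\n", errbuf); return 1; }

        /* Hypothetical filter: everything to/from the Chromecast except the
           bulk HTTPS video fetches, leaving mostly metadata and updates. */
        const char *expr = "host 192.168.1.50 and not port 443";
        struct bpf_program prog;
        if (pcap_compile(p, &prog, expr, 1, PCAP_NETMASK_UNKNOWN) == -1 ||
            pcap_setfilter(p, &prog) == -1) {
            fprintf(stderr, "filter: %s\n", pcap_geterr(p));
            return 1;
        }
        return pcap_loop(p, -1, on_packet, NULL);
    }

That only captures, of course; dropping or rewriting packets is where a snabbswitch-style data plane comes in.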
OK and I forgot to say I might want to deny some traffic... like disable auto updates but still allow it to contact other servers to play video. AFAIK tcpdump doesn't let you do that.
Thanks for the very cool project! I will have to learn more about it.
No. Honestly, software defined networking is the idea of replacing all of your networking staff with a very clever distributed Java program. Half the world thinks this is genius and the other half thinks it's a facepalm.
Network Functions Virtualization is the idea of replacing networking boxes (Cisco, Juniper, Ericsson, F5, ...) with virtual machines running on your own PCs. This is basically a "private cloud" but with emphasis on doing networking in a way that doesn't annoy ISPs.
Rest assured that Software Defined Networking will be redefined to mean whatever the next good technology turns out to be :-)
Another downside is that if you do networking in userspace, portability becomes your problem. If you do TCP/IP in the kernel, it works on nearly everything: just about any brand of Ethernet card, and even exotic non-Ethernet stuff (ISDN, whatever). If you do it in Snabb Switch, according to the wiki it currently supports exactly one class of interface: Intel-branded ethernet cards. Of course, I expect that will expand, but each one of these user-space networking stacks will have to ship with its own complete driver set to reach the same level of portability.
True. Or we might end up sharing drivers between projects too.
In my experience high-end systems choose hardware to match software, so you only need to support one suitable option. The main reason we would add support for a new NIC is if it turns out to be better in some way e.g. as Mellanox support 40G now and Intel don't.
Realistically that isn't much of an issue. If you are doing something at this level, you pick your hardware for the task. You don't need to support every random old Dell PC someone wants to use, like a general-purpose OS has to (Linux, the BSDs, etc). And of course, Intel cards are not hard to find.
Some NICs support virtualizing raw hardware access, so there's no fundamental reason why Snabb couldn't support it. Just a simple matter of programming (and last I heard it was high on the priority list, so the support might even be there by now).
Not really. Last I checked, stock Ubuntu was choking around 60k concurrent connections for no reason, and Fedora could handle a lot more. This was a couple years back, but I'd demand numbers before assuming the situation has changed.
The entire purpose is to NOT route stuff through the kernel.
TFA explains how to do it, and I've done it myself. You can set a flag on the Linux kernel when it boots limiting it to the first N cores. I usually use 2. The remaining cores are completely idle—Linux will not schedule any threads on those cores.
Then you build an app that works more or less like Snabb Switch, which talks directly to the Ethernet adaptor, bypassing the kernel (so Ubuntu vs. Fedora isn't relevant in the least).
So, you launch your app as a normal userland app. For each of your app's threads, schedule them on the remaining CPU cores however you want (I schedule one thread per core). Linux will not schedule its own threads or threads from any other process on those cores, so you own them completely—it'll never context switch to another thread.
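Here's a minimal sketch of that pinning step, assuming cores 2-5 were the ones isolated at boot (e.g. with the isolcpus=2-5 kernel parameter) and with a placeholder comment where the real packet loop would go:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to a single core. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *worker(void *arg)
    {
        int core = *(int *)arg;
        if (pin_to_core(core) != 0)
            fprintf(stderr, "failed to pin to core %d\n", core);
        /* ...poll the NIC rings / run the data plane from here... */
        return NULL;
    }

    int main(void)
    {
        /* Assumption: cores 2-5 were isolated at boot (e.g. isolcpus=2-5),
           so the kernel schedules nothing else on them. */
        int cores[] = { 2, 3, 4, 5 };
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, &cores[i]);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }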
That means when you SSH in, it's running on a thread on cores 1 or 2 only. Same with every other Linux process but your own. Other than sucking up available memory bandwidth and potentially trashing your L2/L3 cache, these other processes don't impact your own app at all.
Thus, even though you're running stock Linux, and SSH and gdb works, and you've got a normal userland app, your app is the ONLY app running on the remaining cores, and you're talking directly to the hardware. It's just as fast as doing everything without a kernel, except it cost you 2 cores. IMO, it's more than worth it for the convenience.
This approach is so easy that there's really no reason not to do it. There are so many situations in the past where I wanted the performance of those single-app kernels, but it just wasn't worth the dev effort. That's no longer true.
Linux will often still schedule kernel threads to run on those cores, so they are not totally isolated. There are also cache effects: if your architecture shares caches between cores, it can be worth "wasting" a neighboring core to avoid ssh and gdb thrashing your cache.
Oh, and also don't forget to set up IRQ affinity so that none of those cores has to handle interrupts.
There is interesting research done by Siemens that takes this kind of isolation a step further and uses virtualization extensions to isolate resources (cores, for example).
> Linux will often still schedule kernel threads to run on those cores so they are not totally isolated.
Have a look at cpuset[0]. You can forcibly move kernel threads into a given set (even pinned threads), and then force that set onto whatever cores you want.
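Roughly, setting one up looks like the sketch below. The /sys/fs/cgroup/cpuset mount point and the "dataplane"/"2-5" names are assumptions for the legacy cgroup-v1 layout; cgroup-v2 systems arrange this differently.

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Assumption: a legacy cgroup-v1 cpuset hierarchy is mounted at
       /sys/fs/cgroup/cpuset (cgroup-v2 systems lay this out differently). */
    #define CS "/sys/fs/cgroup/cpuset/dataplane"

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s\n", val);
        return fclose(f);
    }

    int main(void)
    {
        mkdir(CS, 0755);                      /* create the set */
        write_str(CS "/cpuset.cpus", "2-5");  /* CPUs this set owns */
        write_str(CS "/cpuset.mems", "0");    /* memory node 0 */

        /* Move the current process into the set; movable kernel threads
           can be herded the same way by writing their PIDs to "tasks". */
        char pid[32];
        snprintf(pid, sizeof pid, "%d", (int)getpid());
        return write_str(CS "/tasks", pid) == 0 ? 0 : 1;
    }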
> Oh, and also don't forget to set up IRQ affinity so that none of those cores has to handle interrupts.
Perhaps this is what you meant, but it's straightforward: disabling 'irqbalance' is one simple way to do it. Alternatively, you can configure it to cooperate with 'isolcpus' by using 'FOLLOW_ISOLCPUS'.
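If you'd rather steer individual IRQs by hand, something like this sketch works against /proc. The IRQ number 24 and the 0x3 housekeeping-core mask are placeholders; check /proc/interrupts for your NIC's real IRQ numbers.

    #include <stdio.h>

    /* Steer one IRQ onto specific cores by writing a CPU bitmask to
       /proc/irq/<irq>/smp_affinity (needs root). */
    static int set_irq_affinity(int irq, unsigned mask)
    {
        char path[64];
        snprintf(path, sizeof path, "/proc/irq/%d/smp_affinity", irq);
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%x\n", mask);
        return fclose(f);
    }

    int main(void)
    {
        /* Assumption: IRQ 24 is the device we care about, and cores 0-1
           (mask 0x3) are the housekeeping cores that should service it. */
        return set_irq_affinity(24, 0x3) == 0 ? 0 : 1;
    }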
Yes, if you don't want all interrupts to be handled by CPU0 you'll need a different approach. But doing anything else may be more difficult than it sounds. Do you know if the bug mentioned here is fixed?
This sounds really cool, but I didn't see where the original post talks about this hybrid setup with the Linux kernel (I read the first part and skimmed through the rest, which I had already read back when it came out).
Do you know of other resources for this approach? I have seen cpusets but not used them.
On the one hand, I love this. There's an old-school, down-to-the-metal, efficiency-is-everything angle that resonates deeply with me.
On the other hand, I worry that just means I'm old. There are a lot of perfectly competent developers out there who have very little idea about the concerns that motivate the thinking behind this C10M manifesto.
I sometimes wonder: is my urge toward efficiency something like my grandmother's Depression-era tendency to save string? Is this kind of efficiency effectively obsolete for general-purpose programming? I hope not, but I'm definitely not confident.
It really just depends on how much of a computational surplus you have. Some tasks have become so incredibly easy relative to today's computational horsepower that we can afford to waste 90% of it if it means we can be 10% more productive.
But this isn't true everywhere. Imagine if Google's servers cost 100x more to run, if they didn't spend time to make their code efficient. Imagine if your video games ran at a frame rate 100x slower than they do now. Imagine if your phone sucked 100x more battery power, and lagged 100x longer after each tap.
Sounds like I'm making extreme hypothetical examples? Not really, when languages like Python and Ruby can be over 100x slower than other high level languages (e.g. Java, not to mention C/C++).
And of course, the counterpoint is usually "But most of the time is spent in the database/[some highly optimized library], not the high-level logic." To that I reply: You're absolutely right. Guess what those subsystems were written in? C/C++/Java, most likely.
If you're fortunate enough to have all your performance-critical components already highly optimized and written for you by experts at bare-to-the-metal efficiency, you probably don't have to think about this. But if you're working on a technology truly new and unique, chances are you're going to have to "get your hands dirty" at some point if you want to avoid paying 100x more than is necessary in server fees.
"languages like Python and Ruby can be over 100x slower than other high level languages"
Notice the "can be"? It's important.
And I don't think arguing if Java is a high level language or not is very useful in this context. Notice that C++ was also included as a "high level" language. Clearly if C++ is high level then Java is too.
Anyway, the point of that comment wasn't which languages are high level - it's that efficiency is important, because some languages have areas where performance suffers.
Clearly that is true (at least to some extent) as the use of Numpy in Python shows.
I'd argue that writing Numpy (and other high-performance libraries for use in high-level languages) isn't "general purpose programming" (which is what the OP was talking about), but that is also a different discussion.
JavaScript is arguably a high-level language, however, and meaningless benchmarks (aka "damn lies") attest that V8 is within spitting distance of Java.
There are far better examples. E.g. Haskell is very high-level and, as axman6 mentions elsewhere in this thread, has been shown to be able to handle 20m new connections per second.
I'm reminded of a recent article that crossed the front page of HN, about a group writing some sort of fluid layout engine for their mobile app. They were able to optimize it from tens of layouts per second to thousands, which enabled their strategy of trying a bunch of different layouts and picking the best one. This wouldn't have been possible if they'd stuck with the naive solution. Making it fast enabled a new feature that simply couldn't exist with the slow solution.
I believe it's what used to be called 'craftsmanship'. Taking pride in creating things that are efficient and not wasteful for no other reason than the desire to make the best product possible.
A beautifully carved chair may display excellent craftsmanship, but if you spend three months carving the world's most beautiful chair when the client asked for a dozen basic ladderbacks for a dinner party next Friday, you're not a very wise craftsman.
Similarly, if your software runs in 1% of CPU on a typical customer's machine, spending 10x the time and resources to make it run in 0.1% is not laudable. Context is everything.
>Similarly, if your software runs in 1% of CPU on a typical customer's machine, spending 10x the time and resources to make it run in 0.1% is not laudable.
No, but spending 1.2x the time and resources might be. That's not an unreasonable proposition; something written in Go, C# or Scala can run 10 to 100 times faster than the equivalent code in Ruby, but it certainly doesn't take 10 to 100 times longer to write the code.
It also conserves resources; you're reducing the amount of the world's non-renewable energy that your app consumes by up to 90%. You're also freeing up your users' machines to do more things simultaneously while running your program, potentially increasing the user's enjoyment of the system in the case of a desktop app.
True! I'm just saying that good craftsmanship lies in understanding the situation and finding an appropriate balance. Ultra-efficient code is virtuous, but it's not the only virtue.
I think the "just good enough for business" attitude contributed to the demise of the American car and the rise of the Japanese ones. The American tradition was to use engineering tolerances that would maximize throughput under the constraint that it produced a pretty functional car. The Japanese tradition on the other hand was to use tighter tolerances, well because there was room for tightening the tolerance. At the surface level, or the MBA level it seems that the Japanese way is just dumb. It does not make monetary sense when measured in sales per year. But it turns out that the benefits express themselves at a different timescale: better brand and a culture of devotion to improvement, rather than to being "just good enough to work so that I can move to the next thing".
Part of this applies to the craft of software too. One can be sloppy and churn out functional websites by the dozens. At the superficial level, the goal of extracting the most from a 500MHz Pentium III might seem brain-dead, with little or no payoff. But it pays back by instilling an attitude of deeply learning your craft; that learning does pay back, even though that specific well-tuned web server might not. You don't have to build all your servers that way, but if it doesn't hurt you a little when you know exactly what you could have done to extract some more juice out of it, you will not reach that level of understanding. If you are impatient or sloppy, you will build impatience into your product, and it will show.
Besides, it is always easy to overvalue one's time; it sometimes correlates with conceit.
@sitkack, what makes you think I was talking about the 50s? Going by the downvote I seem to have touched a nerve. You are right that Japanese products around that time were synonymous with bad quality, and not just in the US. Much water has flowed under the bridge since then.
I think you are actually exactly opposite on this.
There are actually two books written about this very topic: "Decline and Fall of the American Programmer", which was essentially making your argument (that declining American quality meant we would get eclipsed), and "Rise and Resurrection of the American Programmer", written by the same author, saying "oops, I was wrong", because "good enough" is the winning strategy, not quality at all costs.
Note that at the time of those writings, the 'sloppy' guys were Microsoft, which is still way way more process than your typical RoR shop today has. In the days of shipping new software to websites three times a day, pragmatism rules all.
As the saying goes: "Premature optimization is the root of all evil."
I think you are conflating the demise of the American car, "just good enough for business" and the fetishization of over tolerancing. All of those things are much more complex than "a tradition" of quality. I would assert that Japanese quality is a rather new thing and that it ebbs and flows with the time and the needs.
By your measure, the Zero was a low-quality plane, but it served the MBAs in the Japanese military well for meeting its objectives. The AK-47, one of the most successful machines in the world, has extremely sloppy tolerances, and it is exactly those lenient tolerances that enabled its success.
We always have to be cognizant of economics: over-building, over-polishing, or over-engineering a system is a waste. Calling it a craft doesn't make it any more acceptable. Going pro and being good at what you do doesn't mean you need to put burnished wood knobs on your software.
As far as the AK-47 goes, I think you are conflating two different engineering concepts, perhaps because you are not an engineer. One concept is "clearance" as a feature for the designed functionality of a product; the other is loose tolerance as a result of sloppiness.
I think you think tolerance means fits that rattle; that's not what it means in engineering. Loose tolerances may just as easily lead to interference fits.
Forget the AK-47; there are more egregious examples. Consider the SR-71 Blackbird, one of the fastest military aircraft ever to have graced the skies and a marvel of engineering: it had wide gaps between the panels of its metal skin, at the joints. Going by your logic, one might think the SR-71 became successful because of loose tolerances and "good enough". It was quite the opposite: the clearances were deliberate and necessary for its success. They were there to account for thermal expansion. For the AK, they were necessary for rough use and low-to-zero maintenance.
In fact, not having clearance on the SR-71 or the AK would have constituted intellectual sloppiness: going with cookie-cutter decisions and not optimizing the product for the use case. What looks sloppy to an amateur might actually be the result of perfection by an expert.
May I recommend Zen and the Art of Motorcycle Maintenance to you.
Further, I think you are being obtuse and defensive and railing against something I have not said. I am not quite sure why. I have never claimed all products need to be finished to the point of "burnished wood knobs". My commentary was on personal growth as an engineer. If it doesn't bother you somewhere to turn in a product that you know you could improve with little effort, you are not going to be a quality engineer, and will not be able to produce a quality product when one such is desired. I am categorically not saying that each item that you deliver has to be the epitome of some arbitrary quality standard.
The cultural tradition that I was talking about was not about adding cost to the product by unnecessary finish. Often people do not finish the product even when it would not have taken much effort. This is rationalized with the logic that the finish would have little immediate value, because it is already good enough for the job, but far from "good". The other reason is sometimes the craftsman just does not have the skill and "good enough" is a good argument to take cover under.
Your recollection of Japanese manufacturing is clouded by mythology. Japanese quality sucked in the 1950s, they had a horrible world wide reputation for shoddy goods.
...which got better because Japanese companies had a culture of continuous improvement. That's very relevant to the discussion: the point the parent posters were making is that rather than settle for "good enough" you should continually move the bar for what "good enough" is, because then you'll eventually end up overtaking fat & lazy competitors even if they have a large head start.
Exactly. When people complain about the quality of Chinese goods as a proxy for China and Chinese people as a whole, I ask them where their MacBook Pro is made and they quickly shut up.
All of these lazy, low tolerance sweeping generalizations need to go!
The point I was trying to make is that if you spend your time making something that works really efficiently, and you take pride in it, then you haven't wasted your time. It does benefit someone: it benefits you, because you can take pride in your work. At least, I believe this idea is part of what's implied in the traditional notion of 'craftsmanship'.
An example might be working nine hours per day rather than eight to produce something that only uses 10% of the resources (and that extra hour is unpaid). For some people, that extra hour would be a waste, but for others, it would bring them greater satisfaction knowing they'd delivered a more efficient product, even if it brought them no extra revenue.
Except the difference at the margin is never just an extra hour in a day. It's almost always weeks instead of days, or years instead of months. And that is fine when you are doing things for yourself and have no other demands on your time, but when you're being paid, spending two months on something instead of a week just pisses everybody around you off.
I mean don't get me wrong, I'm working on a project right now that I'd love to do 'right'. Spend weeks doing a literature study to catch up on and really understand the current state of the art, then implement three or four different approaches to really get a feel how they work under real world conditions. After that I would try to pick the best aspects of those algorithms and try to write some really smart code that analyses the input and picks the best algorithm based on the data. Then I would tune that code until it is as fast, stable and memory efficient as possible and finally put a really slick UI on top of it. Trust me, nothing would give me greater satisfaction.
Unfortunately I got the project late last week, the deadline is in less than two weeks, and I have two other projects I'm working on in parallel. So I'm going to end up grabbing some off-the-shelf solution, tweaking it until it works well enough, wrapping it up in some hacky shell script, and making up for the inefficiency by throwing more hardware at the problem. It's not optimal, but deadlines are deadlines, and I'll have to content myself with the satisfaction of getting a job done well enough on time, rather than a job done perfectly.
> Except the difference at the margin is never just an extra hour in a day. It's almost always weeks instead of days, or years instead of months.
Maybe we're thinking of different things here. What I had in mind was more along the lines of for instance choosing Go or Scala instead of Rails or *.js and getting a 10x speedup at the cost of simply adding a few type declarations. There are situations like yours where deadlines leave little choice, but there are also situations where a bit more leeway is available.
I think both situations are highly context-sensitive.
If you're founding a startup in an emerging consumer market, you'd be making a mistake to use something like Go or Scala rather than Rails or Node.js, because your whole success is dependent upon finding out what particular combination of design, features, and experience emotionally resonates with fickle consumers' minds. That takes a lot of trial and error; anything that slows down that experimentation process is going to cost you the market. Once you've found the market you can get VC and hire experienced technologists to curse out your technology choices and rewrite it in Go or Scala or Java or C++ or whatever.
If you're CloudFlare and billing yourself as "the web performance and security company", however, building out your architecture in a performant and reliable language makes a lot of sense. You know your value proposition: you want to use whichever technology stack lets you execute against that value proposition most effectively.
Twitter got a lot of flack for being built on a "dumb" Rails architecture, but I'm also certain they would not have succeeded had they done anything else. Remember the context of their founding: Twitter grew out of an idea lab that grew out of Odeo, and at the time of their founding they were an idea that was so marginal that nobody would've bothered with it had it taken more than a weekend or so to prototype. When it turned out that people liked it, then they could afford to hire people to rewrite the software into something scalable - but those people would not have jobs had the initial concept not been proven out first.
You perhaps imagine that when Twitter was created the developers sat down and said "should we write this in Rails or in Java?" I very much doubt a conversation anything like that took place. I'm almost 100% certain the developers used the tool they knew best and never considered any other alternatives. Quite possibly they had little or no experience with alternatives.
I work with both a RoR code base and a Scala code base. There is no difference in productivity that I can tell between the two. I think it is an outdated assumption that dynamic languages like Ruby or Python are more productive than modern statically typed languages. The only cost is one of learning; many more people know Ruby than Scala. This is an artifact of history (i.e. Ruby is 10+ years older than Scala; crappy CS educations that only teach Java).
You're right, I suppose. I think as someone who's not a webdev I just have trouble imagining the productivity benefits of something like Rails or Node.js. I assumed that nowadays there'd be frameworks in statically typed languages with type inference that were almost as productive as Node and Rails, and hence could provide both fast development speed and efficient executables.
> Is this kind of efficiency effectively obsolete for general-purpose programming?
I think it is obsolete in that area, and has been for years (decades? Unoptimized SQL is almost always slower than key-value stores). General-purpose programming is really about meeting user & business requirements. Speed and efficiency are second-level concerns, which only become an issue when you don't have enough of them.
BUT efficiency is even more important now for systems-level programming ("systems" in the sense of building libraries for use by other systems, as opposed to user-facing applications).
A good example is SnappyLabs[1] who developed a faster JPEG encoder and ended up being a subject of a Google/Apple bidding war.
I think there has been renewed interest around 10Gb Ethernet (and SSDs). Basically, computers were grossly limited by I/O with gigabit Ethernet and hard drives: you could buy a pretty low-end box and saturate them, so instead people bought high-end boxes and ran slow code. But your average slow code won't perform well at 10Gb.
Indeed. I've got an architecture right now that I could actually do everything I need to do in a single rack with 40GigE adaptors, except...I can't (easily) get that kind of bandwidth to so few boxes using third parties, and more importantly, I can't get the bandwidth cheaply that way.
So instead, I'm scaling out with mostly idle Intel E5-1620v2 4 core (8HT) boxes and single 1GbE connections because that's about 70% cheaper for access to the same bandwidth. In fact, bandwidth is actually the only cost we even monitor anymore, because that's what we're limited by (the machines come with far more bandwidth allocated than you'd get by adding that same bandwidth to an existing machine, so the machine itself is essentially free).
So now we're swimming in CPUs, RAM, and SSD storage, and we're renting—no joke—the cheapest possible hardware we can at our provider (OVH). And no, we're not serving images, audio, or video—our CDN handles that (although see below for why I dropped S3). I actually do public-key authenticated encryption on every single packet and I've still got tons of CPU to spare.
Honestly, I feel a little bad about the whole thing. Why should I use 600 boxes when I really could do it in about 12? It just feels incredibly wasteful.
To partially make up for it, I put the 2x3TB of rusty metal storage we get with each of those tiny 600 boxes to good use, by moving us off S3 (which'll save another $100K/year, and a lot more as we continue to grow), so it's not been a total waste. Maybe I should look into Bitcoin mining next? :)
> But your average slow code won't perform well at 10Gb.
Yup, that really does require looking at things differently, measuring things differently. You actually need to have some kind of mechanical sympathy to remain efficient. Or build off of a project that does that for you, like Snabb Switch[0].
> To partially make up for it, I put the 2x3TB of rusty metal storage we get with each of those tiny 600 boxes to good use, by moving us off S3 (which'll save another $100K/year, and a lot more as we continue to grow), so it's not been a total waste. Maybe I should look into Bitcoin mining next? :)
Not necessary if you can displace your own purchases, but my first thought was that maybe you could do something like "Amazon Glacier", offering distributed but rarely used storage for a fee. I presume that Amazon offers this because they have a similar surplus of live empty disk space.
I know you are joking about the Bitcoin mining, but my second thought was that maybe there is a parallel 'proof of storage' idea. There could be a public market for backup storage, where instead of being paid for 'proof of work' one is randomly tested for 'proof of backup'. You say you'll store something, and you are paid based on your ability to answer random 'challenge' requests in a timely manner.
And then I noticed another front page article on HN from someone writing software that could do something very similar:
http://hypered.io/blog/2014-02-17-building-reesd/
Perhaps you could be his backend.
Having dealt with applications spanning thousands of machines, I'd say there will always be value to optimizing programs on a single machine. The prerequisite to distributed computing is being able to program a single computer.
If you can save an order of magnitude in machines, you will always avoid some scaling problem. 10 machines, 100 machines, and 1000 machines all have different problems.
Not at all. I think it's just that our perspective has shifted, because our everyday computers have become so fast, with more CPU and more RAM, that the average programmer doesn't feel the pain of inefficient code anymore.
I've seen enough software that is "very fast and scalable" only because it was never tested at really large deployments.
Of course, that's not always what your software needs or wants, but in the case of internet backend software and hardware, it's a huge difference if your servers have 1, 10 or 100 times the throughput.
Also, many developers are client developers, where this is not the problem it used to be.
In 1990, when a game didn't fit into a few kB of memory or didn't run fast enough, the developer assembler-optimized the hell out of it; today most of this doesn't matter, except for the software everyone uses but no one writes (your ISP's webmail, Dropbox, Twitter, and whatever internet site has enough users).
As someone who learned to program on a machine with five kilobytes of RAM, and later had to put a fair bit of effort into unlearning some of the resulting habits:
Yes, that kind of efficiency (in terms of machine resources) and inefficiency (in terms of human resources) is effectively obsolete for general-purpose programming. What of that? If your heart's desire is to work in a context where machine resources are scarce, just work in an area like high-performance technical computing where that's still true. We do not, and never will, have any shortage of jobs where you always need more computing power.
There's plenty of room for both. Many useful, worthwhile applications won't ever really push a machine to its limits, rather they primarily involve bringing structure and order to mounds of complex business logic. And that's OK. Not everything needs to be optimized.
On the other side of the same token, many useful, worthwhile applications will absolutely depend on this level of optimization. There are some problems which simply can't be solved in a practical manner without it. And that's OK too.
Most of the time your time is better spent adding features to an application than trying to tune performance in the ways he talks about on the site. Most of the time it is easier to scale across multiple machines and pay for the hardware than it is to put in the time to make a program this efficient.
But there are applications like Snort (which he seems to reference a lot) that can't be split onto multiple boxes easily, so tuning for max performance is very important and is a key feature of the application.
It seems we've already passed this problem: "We also show that with Mio, McNettle (an SDN controller written in Haskell) can scale effectively to 40+ cores, reach a throughput of over 20 million new requests per second on a single machine, and hence become the fastest of all existing SDN controllers."[1] (reddit discussion at [2])
This new IO manager was added to GHC 7.8 which is due for final release very soon (currently in RC stage). That said, I'm not sure if it can be said if all (or even most) of the criteria have been met. But hey, at least they're already doing 20M connections per second.
WhatsApp is achieving ~3M concurrent connections on a single node. [1][2]
The architecture is FreeBSD and Erlang.
It does make me wonder, and I've asked this question before [3], why can WhatsApp handle so much load per node when Twitter struggled for so many years (e.g. Fail Whale)?
The problem of 1:1 messaging is slightly different to Twitter, which is more m:n. 1:1 messaging can be handled reasonably easily with a mailbox per user, and there is no shared state. Messaging with m:n has different optimal patterns depending on the relative ratios of m and n. Twitter has many users with millions of followers; if Twitter used a 1:1 mailbox approach like a chat app, these users would be whole countries worth of load on their own.
That's not to say that Twitter's scaling issues were wholly forgivable. They weren't fatal to the service, but I don't think they were necessary with good design from the start. High popularity is a good problem to have, though.
Haven't you answered your own question? FreeBSD + Erlang vs Rails. Not hating on Rails but it wasn't remotely designed for this use case, Erlang was and there's arguably nothing better in the world at it. Twitter got better after they moved to the JVM, another battle-hardened platform designed for scale.
If you are going to write a big article on a 'problem', then it would be a good idea to spend some time explaining the problem, perhaps with some scenarios (real world or otherwise) to solve. Instead, this article just leaps ahead with a blind-faith 'we must do this!' attitude.
That's great if you are just toying with this sort of thing for fun, but perhaps worthless if you are advocating a style of server design for others.
Also, the decade-ago 10k problem could draw some interesting parallels. First of all, are machines today 1000 times faster? If they are, then even if you hit the 10M magic number, you will still only be able to do the same amount of work per-connection that you could have done 10 years ago. I am guessing that many internet services are much more complicated than a decade ago...
And if you can achieve 10M connections per server, you really should be asking yourself whether you actually want to. Why not split it down to 1M each over 10 servers? No need for insane high-end machines, and the failover when a single machine dies is much less painful. You'll likely get a much improved latency per-connection as well.
It shows good practical tricks and pitfalls. It was 3 years ago so I can only assume it got better, but who knows.
Here is the thing, though: do you need to solve the C*M problem on a single machine? Sometimes you do, but sometimes you don't. But if you don't, and you distribute your system, you have to fight against sequential points in your system. So you put up a load balancer and spread your requests across 100 servers at 100K connections each. Feels like a win, except if all those connections have to live at the same time and then access a common ACID DB back-end. So now you have to think about your storage backend: can that scale? If your existing DB can't handle it, now you have to think about your data model. And then if you redesign your data model, now you might have to redesign your application's behavior, and so on.
If by "connection", you mean TCP, probably not. But that's not the only way to maintain connections, and it's certainly not the only reliable network protocol.
My latest project keeps every "connection" open at all times, but it's a custom UDP based reliable messaging protocol, not TCP. At Facebook's scale, we'd have the equivalent of one billion connections "open". It's easy to keep them open, despite changing IP addresses, because every packet is public-key authenticated and encrypted, so you don't have to rely on IP addresses to know who you're talking to...
It also means you only pay for a connection setup time once. For mobile devices, the improvement in latency is palpable.
Can I ask some questions about that? Extremely interested..
I've considered doing something similar for our messaging/signalling protocol (currently standard TCP sockets established to several million mobile devices.)
I had concerns about what the deliverability of UDP would be on mobile networks; many carriers are moving towards NAT'ing everything, (forced) transparent proxies, etc.
Are you only using UDP from device->infrastructure? If not, do you rely on the devices providing an ip:port over a heartbeat of sorts (to keep up with IP changes?)
Do you have any issues with deliverability, in either direction? (not due to UDP's inherent properties, but because of carrier network behavior)
thanks very much for anything you're able to answer.
I'm using UDP in both directions, and I do have a heartbeat (currently set at 30 seconds, but I think we could go to 60 seconds without any problems). We do use that to keep track of IP:PORT changes, but also (mainly?) to keep the UDP hole punched, due to carriers NAT'ing everything.
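Stripped of the encryption and the reliability layer, the device side of that heartbeat is basically the loop below. The 203.0.113.10:4500 address and the "HEARTBEAT" payload are placeholders for illustration, not our real protocol.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* Assumption: the messaging server lives at 203.0.113.10:4500; the
           real protocol would authenticate and encrypt every datagram. */
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in srv = {0};
        srv.sin_family = AF_INET;
        srv.sin_port = htons(4500);
        inet_pton(AF_INET, "203.0.113.10", &srv.sin_addr);

        const char hb[] = "HEARTBEAT";  /* placeholder payload */
        for (;;) {
            /* Each datagram refreshes the carrier's NAT mapping, and the
               server notes whichever ip:port it arrived from. */
            sendto(fd, hb, sizeof hb, 0, (struct sockaddr *)&srv, sizeof srv);
            sleep(30);                  /* the 30-second heartbeat */
        }
    }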
It works, and it works well. It's the same idea behind WebRTC, except instead of going peer-to-peer, you go client<->server.
All I've seen so far is the usual UDP stuff: dropped packets, re-ordered packets, and duplicate packets. Nothing out of the ordinary. Our network protocol handles those things without any difficulties.
We did it specifically because all of our clients are mobile devices, and we didn't want to have to do the lengthy TCP connection setup (or worse, SSL setup) each time the network changed—which is often.
The biggest downside of UDP at the moment is that Apple only allows TCP connections in the background. That seems like a silly decision, but whatever, it's what they've done. I may, someday, set up a bunch of TCP forwarders for iOS devices running in the background. Our messages can be decoded just fine over byte-oriented streams, so it wouldn't change much.
It's a tough call. On the one hand, our UDP clients do not need to reconnect, since the connection is always set up. So when they wake, they send a packet to the server (Hello), and the server immediately sends back any immediate updates, such as new chat messages.
Our read path is perhaps over-optimized, so it's exactly the network latency of one round trip to get the updates since you last opened the app. It takes longer to get the UI up in some cases, so that's why we haven't done the TCP background thing. For others, that might be a much more important consideration.
That paper (as well as CurveCP) was definitely a huge inspiration for what I'm doing.
A big difference is I'm not running a packet scheduler (e.g. Chicago). Frankly, our data rate on a per device basis is just minuscule, and our internal protocol has back pressure built into it anyway, so I just skipped that part. If it becomes an issue (unlikely), I'll of course actually add an explicit scheduler so our UDP traffic plays nicely with others.
I'm not doing MLT's puzzle step (although I really like the concept). At the moment, all I can do is deny connections from unknown devices if we're under attack (I can drop packets from an unknown device with a single hash lookup). We're also in the process of moving to OVH, which is able to block DDoS stuff at the network edge, should it happen.
There's more stuff I've got planned, but that's it for now.
Projects such as the Erlang VM running right on top of xen seem like promising initiatives to get the kind of performance mentioned (http://erlangonxen.org/).
I wish they would open source it and let others look at the code and experiment with it. For a lot of developers, if it isn't open, it doesn't exist. It is their code and they can do whatever they want with it, but that is my view of the project.
> There is no way for the primary service (such as a web server) to get priority on the system, leaving everything else (like the SSH console) as a secondary priority.
Just for the record -- the SSH console is the primary priority. If the web server always beats the SSH console and the web server is currently chewing 100% CPU due to a coding bug, you'll never get a shell to fix it.
I think the Cheetah web server on the MIT exokernel proved this, and HaLVM by Galois does pretty well with the network speed that Xen provides, but I forget by how much.
The netmap FreeBSD/Linux interface is awesome! I'm looking forward to seeing more examples of its use.
I would just love to see that exokernel from MIT in practice some day in some OS. I think the research is from the nineties, isn't it?
Also, netmap from FreeBSD was the first thing that came to mind, as relief from the I/O bottleneck in modern systems.
As with the original C10K, FreeBSD to the rescue here: it was the first OS with the kqueue interface, and now it's netmap. The speedup numbers in the original paper are astounding.
Just what are these resources that we are using more efficiently? CPU? RAM?
Are they that important? Should we not be trying to use electricity more efficiently, since that is a real-world consumable resource? How many connections can you handle per kilowatt-hour?
Generally, electricity use is roughly proportional to CPU and RAM usage, as they're powered by electricity. If you have two otherwise identical programs, one of which uses 50% of the CPU and the other of which uses 10%, chances are the latter will use less electricity.
Port numbers only need to be unique per ip:port pair. A TCP connection is identified by the four-tuple source_ip:source_port, dest_ip:dest_port. A server can hold as many connections as it wants on the same local IP and port 80, as long as no single remote IP accounts for more than 65,535 of them (i.e. as long as each four-tuple is unique).
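A quick way to see that the limit is the single client's ephemeral source ports, not the server, is a toy loop that keeps connecting to one server. In this sketch, 192.0.2.1:80 is a placeholder address, and you'd need to raise the open-file limit well past 65,535 or you'll hit the fd limit first.

    #include <arpa/inet.h>
    #include <errno.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        /* Assumption: something is listening on 192.0.2.1:80. Every connect()
           grabs a fresh ephemeral source port, so only the source half of the
           four-tuple changes, and the client runs out somewhere near 64K. */
        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port = htons(80);
        inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);

        int n = 0;
        for (;;) {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            if (fd < 0) { perror("socket (fd limit?)"); break; }
            if (connect(fd, (struct sockaddr *)&dst, sizeof dst) < 0) {
                if (errno == EADDRNOTAVAIL)
                    printf("source ports exhausted after %d connections\n", n);
                else
                    perror("connect");
                break;
            }
            n++;
        }
        return 0;
    }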
Cheers for the info! I guess that's where I got the wrong idea from - attempting to stress test one machine from another machine, I'd always hit that limit, but now I understand why.
What's the current state of internet switches? Back when I used to run the Napster backend, one of our biggest problems was that switches, regardless of whether or not they claimed "line-speed" networking, would blow up once you pumped too many pps at them. We went through every single piece of equipment Cisco sold (all the way to having two fully loaded 12K BFRs) and still had issues.
Mind you, this was partially because of the specifics of our system - a couple million logged in users with tens of thousands of users logging in every second pushing large file lists, a widely used chat system which meant lots of tiny packets, a very large number of searches (small packets coming in, small to large going out) and a huge number of users that were on dialup fragmenting packets to heck (tiny MTUs!).
I imagine a lot of the kind of systems you'd want 10M simultaneous connections for would hit similar situations (games and chat for instance), though I'm not sure I'd want to (imagine power knocking out the machine, or an upgrade, and having all 10 million users auto-reconnect at once).
"There is no way for the primary service (such as a web server) to get priority on the system, leaving everything else (like the SSH console) as a secondary priority" - Can't we use the nice command (nice +n command) when these process are started to change its priority? I am sorry if it is so naive question
He probably meant that from a TCP point of view: there is no way to give a higher priority to incoming TCP connections going to the server than to those going to SSH, even if you could use nice to assign more CPU time to your server.
Or perhaps he meant that the infrastructure used to do multitasking still has to interrupt both his server and SSH, and he then described how the kernel can be told to leave some cores free of work, so the server can be set to use only those cores and run absolutely uninterrupted.
Not the only bizarre and confusing statement he wrote, anyways.
So, he is trying to suggest that a pthread-mutex-based approach won't scale (what news!) and, consequently, that the JVM is crap after all?)
The next step would be to admit that the very idea of "parallelizing" sequential code that imperatively processes sequential data by merely wrapping it in threads is nonsense too?)
Where this world is heading to?
One more problem is the cloud. We host in the cloud, and cloud service providers might be using old hardware. The newest hardware or a specific OS might be the winner, but those options aren't available in the cloud. How do you tackle that?
What about the academic operating system research done years ago? Exokernel, SPIN: they all aimed to solve the "OS is the problem" issue. Why don't we see more in that direction?
The two bottom-most articles (protocol parsing and commodity x86) are seriously pure dump, but fortunately the ones about multi-core scaling are pretty damn interesting.
I wrote a little bit about a recent project I completed that used these principles here: https://news.ycombinator.com/item?id=7231407
[0] https://github.com/SnabbCo/snabbswitch