You can SSH into your box. You can debug with gdb. Valgrind. Everything is normal...except the performance, which is just insane.
Given how easy it is, there isn't really a good excuse anymore to not write data plane applications the "right" way, instead of jamming everything through the kernel like we've been doing. Especially with Intel's latest E5 processors, the performance is just phenomenal.
If you want a fun, accessible project to play around with these concepts, Snabb Switch makes it easy to write these kinds of apps with LuaJIT, which also has a super easy way to bind to C libraries. It's fast too: 40 million packets a second using a scripting language(!).
I wrote a little bit about a recent project I completed that used these principles here: https://news.ycombinator.com/item?id=7231407
Is there a list of "things" (not sure of the terminology?) people have built with it?
I guess the downside is that you can't virtualize it (I realize that is kind of the point, but it does reduce the accessibility of it).
The project is new: I and other open source contributors are currently under contract to build a Network Functions Virtualization platform for Deutsche Telekom's TeraStream project. This is called Snabb NFV, and it's going to be totally open source and integrated with OpenStack.
Currently we are doing a lot of virtualization work from the "outside" of the VM: implementing Intel VMDq hardware acceleration and providing zero-copy Virtio-net to the VMs. So the virtual machine will see normal Virtio-net but we will make that operate really fast.
Inside the VMs we can either access a hardware NIC directly (via IOMMU "PCI passthrough") or we can write a device driver for the Virtio-net device.
So, early days, first major product being built, and lots of potential both inside and outside VMs, lots of fantastic products to build with nobody yet building them :-)
Is that a good use case for Snabb Switch, or is there is an easier way to accomplish what I want?
If you can express how you want to filter with a fancy pcap-filter expression, then tcpdump is the easy answer. Otherwise you might want to code it up in Lua with snabbswitch.
Here is our basic trace store/replay library today: https://github.com/SnabbCo/snabbswitch/blob/master/src/lib/p...
Thanks for the very cool project! I will have to learn more about it.
Network Functions Virtualization is the idea of replacing networking boxes (Cisco, Juniper, Ericsson, F5, ...) with virtual machines running on your own PCs. This is basically a "private cloud" but with emphasis on doing networking in a way that doesn't annoy ISPs.
Rest assured that Software Defined Networking will be redefined to mean whatever the next good technology turns out to be :-)
In my experience high-end systems choose hardware to match software, so you only need to support one suitable option. The main reason we would add support for a new NIC is if it turns out to be better in some way e.g. as Mellanox support 40G now and Intel don't.
In Snabb Switch we have a lot of 10G/40G NICs online for people to play with now: https://groups.google.com/d/msg/snabb-devel/PXOsv0uLQCE/HjPj...
TFA explains how to do it, and I've done it myself. You can set a flag on the Linux kernel when it boots limiting it to the first N cores. I usually use 2. The remaining cores are completely idle—Linux will not schedule any threads on those cores.
Then you build an app that works more-or-less like Snabb Switch, which talks directly to the Ethernet adaptor, bypassing the kernel (Ubuntu, Fedora, etc. isn't relevant in the least).
So, you launch your app as a normal userland app. For each of your app's threads, schedule them on the remaining CPU cores however you want (I schedule one thread per core). Linux will not schedule its own threads or threads from any other process on those cores, so you own them completely—it'll never context switch to another thread.
That means when you SSH in, it's running on a thread on cores 1 or 2 only. Same with every other Linux process but your own. Other than sucking up available memory bandwidth and potentially thrashing your L2/L3 cache, these other processes don't impact your own app at all.
Thus, even though you're running stock Linux, and SSH and gdb works, and you've got a normal userland app, your app is the ONLY app running on the remaining cores, and you're talking directly to the hardware. It's just as fast as doing everything without a kernel, except it cost you 2 cores. IMO, it's more than worth it for the convenience.
This approach is so easy that there's really no reason not to do it. There are so many situations in the past where I wanted the performance of those single-app kernels, but it just wasn't worth the dev effort. That's no longer true.
Oh, and also don't forget to set up IRQ affinity so that none of those cores handle interrupts.
There is some interesting research by Siemens that takes this kind of isolation a step further and uses virtualization extensions to isolate resources (cores, for example):
Have a look at cpuset. You can forcibly move kernel threads into a given set (even pinned threads), and then force that set onto whatever cores you want.
In the past I tried migrating kernel threads; the migration did work, but the system became unstable, so I gave up on that.
Perhaps this is what you meant, but this is straightforward: disabling 'irqbalance' is the simplest way to do it. Alternatively, you can configure it to cooperate with 'isolcpus' by using 'FOLLOW_ISOLCPUS'.
http://code.google.com/p/irqbalance is 403 for me and I haven't been able to check.
I usually do stuff described here:
And sometimes letting interrupts follow the threads ends up better (same cache), sometimes isolating them ends up better.
Do you know of other resources for this approach? I have seen cpusets but not used them.
On the other hand, I worry that just means I'm old. There are a lot of perfectly competent developers out there who have very little idea about the concerns that motivate thinking like this C10M manifesto.
I sometimes wonder if my urge toward efficiency is something like my grandmother's Depression-era tendency to save string. Is this kind of efficiency effectively obsolete for general-purpose programming? I hope not, but I'm definitely not confident.
But this isn't true everywhere. Imagine if Google's servers cost 100x more to run because they hadn't spent the time making their code efficient. Imagine if your video games ran at a frame rate 100x slower than they do now. Imagine if your phone sucked 100x more battery power and lagged 100x longer after each tap.
Sounds like I'm making extreme hypothetical examples? Not really, when languages like Python and Ruby can be over 100x slower than other high level languages (e.g. Java, not to mention C/C++).
And of course, the counterpoint is usually "But most of the time is spent in the database/[some highly optimized library], not the high-level logic." To that I reply: You're absolutely right. Guess what those subsystems were written in? C/C++/Java, most likely.
If you're fortunate enough to have all your performance-critical components already highly optimized and written for you by experts at bare-to-the-metal efficiency, you probably don't have to think about this. But if you're working on a technology truly new and unique, chances are you're going to have to "get your hands dirty" at some point if you want to avoid paying 100x more than is necessary in server fees.
You've got one thing wrong: Java is a far cry from a high-level language.
Scala is much higher level than Java, compiles to the same run-time, and is at rough parity. http://benchmarksgame.alioth.debian.org/u32/scala.php
Notice the "can be"? It's important.
And I don't think arguing if Java is a high level language or not is very useful in this context. Notice that C++ was also included as a "high level" language. Clearly if C++ is high level then Java is too.
Anyway, the point of that comment wasn't which languages are high level - it's that efficiency is important, because some languages have areas where performance suffers.
Clearly that is true (at least to some extent) as the use of Numpy in Python shows.
I'd argue that writing Numpy (and other high-performance libraries for use in high-level languages) isn't "general purpose programming" (which is what the OP was talking about), but that is also a different discussion.
There are far better examples. E.g. Haskell is very high-level and, as axman6 mentions elsewhere in this thread, has been shown to be able to handle 20m new connections per second.
- First generation: binary opcodes entered by hand.
- Second generation: Assembler languages.
- Third generation: Compiled languages like C++.
- Fourth generation: Interpreted high-level languages like Python and Ruby. Also SQL.
- Fifth generation: There's no such thing. Or maybe Apple's Siri is.
Similarly, if your software runs in 1% of CPU on a typical customer's machine, spending 10x the time and resources to make it run in 0.1% is not laudable. Context is everything.
No, but spending 1.2x the time and resources might be. That's not an unreasonable proposition; something written in Go, C# or Scala can run 10 to 100 times faster than the equivalent code in Ruby, but it certainly doesn't take 10 to 100 times longer to write the code.
It also conserves resources; you're reducing the amount of the world's non-renewable energy that your app consumes by up to 90%. You're also freeing up your users' machines to do more things simultaneously while running your program, potentially increasing the user's enjoyment of the system in the case of a desktop app.
Part of this applies to the craft of software too. One can be sloppy and churn out functional websites by the dozen. At a superficial level, the goal of extracting the most from a 500MHz Pentium-III might seem brain-dead, with little or no payoff. But it pays back by instilling an attitude of deeply learning your craft; that learning does pay back, even though that specific well-tuned web server might not. You don't have to build all your servers that way, but if it doesn't hurt you a little when you know exactly what you could have done to extract some more juice out of it, you will not reach that level of understanding. If you are impatient or sloppy, you will build impatience into your product, and it will show.
Besides, it is always easy to overvalue one's time; it sometimes correlates with conceit.
@sitkack what makes you think I was talking about the '50s? Going by the downvote I seem to have touched a nerve. You are right that Japanese products around that time were synonymous with bad quality, and not just in the US. Much water has flowed under the bridge since then.
There are actually two books that were written about this very topic, one called "The decline and fall of the American Programmer", which was essentially making your argument, that American quality going down meant we would get eclipsed, and "The Rise and Resurrection of the American Programmer", written by the same author, saying "oops, I was wrong", because "good enough" is the winning strategy, not quality at all costs.
Note that at the time of those writings, the 'sloppy' guys were Microsoft, which is still way way more process than your typical RoR shop today has. In the days of shipping new software to websites three times a day, pragmatism rules all.
As the saying goes: "Premature optimization is the root of all evil."
By your measure, the Zero was a low-quality plane, but it served the MBAs in the Japanese military well in meeting its objectives. The AK-47, one of the most successful machines in the world, has extremely sloppy tolerances, and it is exactly those lenient tolerances that enabled its success.
We always have to be cognizant of economics, over building, polishing or engineering a system is a waste. Calling it a craft doesn't make it any more acceptable. Proing-up and being good at what you do doesn't mean you need to put burnished wood knobs on your software.
I think you think tolerance means fits that rattle; that's not what it means in engineering. Loose tolerances may just as likely lead to interference fits.
Forget the AK-47; there are more egregious examples. Consider the Blackbird, the SR-71, one of the fastest military aircraft ever to grace the skies and a marvel of engineering: it had wide gaps between the panels of its metallic skin, at the joints. Going by your logic, one might think the SR-71 became successful because of loose tolerances and "good enough". It was quite the opposite: the clearances were deliberate and necessary for its success. They were there to account for thermal expansion. For the AK they were necessary for rough use and low-to-zero maintenance.
In fact, not having clearances on the SR-71 or the AK would have constituted intellectual sloppiness: going with cookie-cutter decisions and not optimizing the product for the use case. What looks sloppy to an amateur might actually be the result of perfection by an expert.
May I recommend Zen and the Art of Motorcycle Maintenance to you.
Further, I think you are being obtuse and defensive and railing against something I have not said. I am not quite sure why. I have never claimed all products need to be finished to the point of "burnished wood knobs". My commentary was on personal growth as an engineer. If it doesn't bother you somewhere to turn in a product that you know you could improve with little effort, you are not going to be a quality engineer, and will not be able to produce a quality product when one such is desired. I am categorically not saying that each item that you deliver has to be the epitome of some arbitrary quality standard.
The cultural tradition that I was talking about was not about adding cost to the product by unnecessary finish. Often people do not finish the product even when it would not have taken much effort. This is rationalized with the logic that the finish would have little immediate value, because it is already good enough for the job, but far from "good". The other reason is sometimes the craftsman just does not have the skill and "good enough" is a good argument to take cover under.
Quality comes and goes.
All of these lazy, low tolerance sweeping generalizations need to go!
An example might be working nine hours per day rather than eight to produce something that only uses 10% of the resources (and that extra hour is unpaid). For some people, that extra hour would be a waste, but for others, it would bring them greater satisfaction knowing they'd delivered a more efficient product, even if it brought them no extra revenue.
I mean don't get me wrong, I'm working on a project right now that I'd love to do 'right'. Spend weeks doing a literature study to catch up on and really understand the current state of the art, then implement three or four different approaches to really get a feel how they work under real world conditions. After that I would try to pick the best aspects of those algorithms and try to write some really smart code that analyses the input and picks the best algorithm based on the data. Then I would tune that code until it is as fast, stable and memory efficient as possible and finally put a really slick UI on top of it. Trust me, nothing would give me greater satisfaction.
Unfortunately I got the project late last week, the deadline is in less than two weeks, and I have two other projects I'm working on in parallel. So I'm going to end up grabbing some off-the-shelf solution, tweaking it until it works well enough, wrapping it up in some hacky shell script, and making up for inefficiency by throwing more hardware at the problem. It's not optimal, but deadlines are deadlines, and I'll have to content myself with the satisfaction of getting a job done well enough on time, rather than a job done perfectly.
Maybe we're thinking of different things here. What I had in mind was more along the lines of for instance choosing Go or Scala instead of Rails or *.js and getting a 10x speedup at the cost of simply adding a few type declarations. There are situations like yours where deadlines leave little choice, but there are also situations where a bit more leeway is available.
If you're founding a startup in an emerging consumer market, you'd be making a mistake to use something like Go or Scala rather than Rails or Node.js, because your whole success is dependent upon finding out what particular combination of design, features, and experience emotionally resonates with fickle consumers' minds. That takes a lot of trial and error; anything that slows down that experimentation process is going to cost you the market. Once you've found the market you can get VC and hire experienced technologists to curse out your technology choices and rewrite it in Go or Scala or Java or C++ or whatever.
If you're CloudFlare and billing yourself as "the web performance and security company", however, building out your architecture in a performant and reliable language makes a lot of sense. You know your value proposition: you want to use whichever technology stack lets you execute against that value proposition most effectively.
Twitter got a lot of flak for being built on a "dumb" Rails architecture, but I'm also certain they would not have succeeded had they done anything else. Remember the context of their founding: Twitter grew out of an idea lab that grew out of Odeo, and at the time of their founding they were an idea that was so marginal that nobody would've bothered with it had it taken more than a weekend or so to prototype. When it turned out that people liked it, then they could afford to hire people to rewrite the software into something scalable - but those people would not have jobs had the initial concept not been proven out first.
I work with both a RoR code base and a Scala code base. There is no difference in productivity that I can tell between the two. I think it is an outdated assumption that dynamic languages like Ruby or Python are more productive than modern statically typed languages. The only cost is one of learning; many more people know Ruby than Scala. This is an artifact of history (i.e. Ruby is 10+ years older than Scala, and crappy CS educations only teach Java).
I think it is obsolete in that area, and has been for years (decades? Unoptimized SQL is almost always slower than key-value stores). General-purpose programming is really about meeting user and business requirements. Speed and efficiency are second-level concerns, which only become an issue when you don't have enough of them.
BUT efficiency is even more important now for systems-level programming ("systems" in the sense of building libraries for use by other systems, as opposed to user-facing code).
A good example is SnappyLabs, which developed a faster JPEG encoder and ended up the subject of a Google/Apple bidding war.
So instead, I'm scaling out with mostly idle Intel E5-1620v2 4 core (8HT) boxes and single 1GbE connections because that's about 70% cheaper for access to the same bandwidth. In fact, bandwidth is actually the only cost we even monitor anymore, because that's what we're limited by (the machines come with far more bandwidth allocated than adding that same bandwidth to an existing machine, so the machine is essentially free).
So now we're swimming in CPUs, RAM, and SSD storage, and we're renting—no joke—the cheapest possible hardware we can at our provider (OVH). And no, we're not serving images, audio, or video—our CDN handles that (although see below for why I dropped S3). I actually do public-key authenticated encryption on every single packet and I've still got tons of CPU to spare.
Honestly, I feel a little bad about the whole thing. Why should I use 600 boxes when I really could do it in about 12? It just feels incredibly wasteful.
To partially make up for it, I put the 2x3TB of rusty metal storage we get with each of those tiny 600 boxes to good use, by moving us off S3 (which'll save another $100K/year, and a lot more as we continue to grow), so it's not been a total waste. Maybe I should look into Bitcoin mining next? :)
> But your average slow code wont perform well on 10Gb.
Yup, that really does require looking at things differently, measuring things differently. You actually need to have some kind of mechanical sympathy to remain efficient. Or build off of a project that does that for you, like Snabb Switch.
Not necessary if you can displace your own purchases, but my first thought was that maybe you could do something like "Amazon Glacier", offering distributed but rarely used storage for a fee. I presume that Amazon offers this because they have a similar surplus of live empty disk space.
I know you are joking about the Bitcoin mining, but my second thought was that maybe there is a parallel 'proof of storage' idea. There could be a public market for backup storage, where instead of being paid for 'proof of work' one is randomly tested for 'proof of backup'. You say you'll store something, and you are paid based on your ability to answer random 'challenge' requests in a timely manner.
And then I noticed another front page article on HN from someone writing software that could do something very similar:
Perhaps you could be his backend.
If you can save an order of magnitude in machines, you will always avoid some scaling problem. 10 machines, 100 machines, and 1000 machines all have different problems.
Of course, that's not always what your software needs or wants, but in the case of internet backend software and hardware, it's a huge difference if your servers have 1, 10 or 100 times the throughput.
Also, many developers are client developers, where this is not the problem it used to be.
In 1990, when a game didn't fit into a few kB of memory or didn't run fast enough, the developer assembler-optimized the hell out of it. Today most of this doesn't matter, except for the software everyone uses but no one writes (your ISP's webmail, Dropbox, Twitter, and whatever internet site has enough users).
Yes, that kind of efficiency (in terms of machine resources) and inefficiency (in terms of human resources) is effectively obsolete for general-purpose programming. What of that? If your heart's desire is to work in a context where machine resources are scarce, just work in an area like high-performance technical computing where that's still true. We do not, and never will, have any shortage of jobs where you always need more computing power.
On the other side of the same token, many useful, worthwhile applications will absolutely depend on this level of optimization. There are some problems which simply can't be solved in a practical manner without it. And that's OK too.
But there are applications like Snort (which he seems to reference a lot) that can't be split onto multiple boxes easily, so tuning for max performance is very important and is a key feature of the application.
This new IO manager was added to GHC 7.8, which is due for final release very soon (currently in RC stage). That said, I'm not sure whether all (or even most) of the criteria have been met. But hey, at least they're already doing 20M connections per second.
High Scalability Post: http://highscalability.com/blog/2013/5/13/the-secret-to-10-m...
Original Shmoocon Presentation: http://www.youtube.com/watch?v=73XNtI0w7jA
The architecture is FreeBSD and Erlang.
It does make me wonder, and I've asked this question before, why can WhatsApp handle so much load per node when Twitter struggled for so many years (e.g. the Fail Whale)?
[2, slide 16] http://www.erlang-factory.com/upload/presentations/558/efsf2...
That's not to say that Twitter's scaling issues were wholly forgivable. They weren't fatal to the service, but I don't think they were necessary with good design from the start. High popularity is a good problem to have, though.
That's great if you are just toying with this sort of thing for fun, but perhaps worthless if you are advocating a style of server design for others.
Also, the decade-ago 10k problem could draw some interesting parallels. First of all, are machines today 1000 times faster? If they are, then even if you hit the 10M magic number, you will still only be able to do the same amount of work per-connection that you could have done 10 years ago. I am guessing that many internet services are much more complicated than a decade ago...
And if you can achieve 10M connections per server, you really should be asking yourself whether you actually want to. Why not split it down to 1M each over 10 servers? No need for insane high-end machines, and the failover when a single machine dies is much less painful. You'll likely get a much improved latency per-connection as well.
It shows good practical tricks and pitfalls. It was 3 years ago so I can only assume it got better, but who knows.
Here is the thing, though: do you need to solve the C*M problem on a single machine? Sometimes you do, but sometimes you don't. But if you don't, and you distribute your system, you have to fight against sequential points in your system. So you put up a load balancer and spread your requests across 100 servers at 100K connections each. Feels like a win, except if all those connections have to live at the same time and then access a common ACID DB back-end. So now you have to think about your storage back-end: can that scale? If your existing DB can't handle it, now you have to think about your data model. And then if you redesign your data model, you might have to redesign your application's behavior, and so on.
My latest project keeps every "connection" open at all times, but it's a custom UDP based reliable messaging protocol, not TCP. At Facebook's scale, we'd have the equivalent of one billion connections "open". It's easy to keep them open, despite changing IP addresses, because every packet is public-key authenticated and encrypted, so you don't have to rely on IP addresses to know who you're talking to...
It also means you only pay for a connection setup time once. For mobile devices, the improvement in latency is palpable.
I've considered doing something similar for our messaging/signalling protocol (currently standard TCP sockets established to several million mobile devices.)
I had concerns about what the deliverability of UDP would be on mobile networks; many carriers are going towards NAT'ing everything, (forced) transparency proxies, etc.
Are you only using UDP from device->infrastructure? If not, do you rely on the devices providing an ip:port over a heartbeat of sorts (to keep up with IP changes?)
Do you have any issues with deliverability, in either direction? (not due to UDP's inherent properties, but because of carrier network behavior)
thanks very much for anything you're able to answer.
I'm using UDP in both directions, and I do have a heartbeat (currently set at 30 seconds, but I think we could go to 60 seconds without any problems). We do use that to keep track of IP:PORT changes, but also (mainly?) to keep the UDP hole punched, due to carriers NAT'ing everything.
It works, and it works well. It's the same idea behind WebRTC, except instead of going peer-to-peer, you go client<->server.
All I've seen so far is the usual UDP stuff: dropped packets, re-ordered packets, and duplicate packets. Nothing out of the ordinary. Our network protocol handles those things without any difficulties.
We did it specifically because all of our clients are mobile devices, and we didn't want to have to do the lengthy TCP connection setup (or worse, SSL setup) each time the network changed—which is often.
The biggest downside of UDP at the moment is that Apple only allows TCP connections in the background. That seems like a silly decision, but whatever, it's what they've done. I may, someday, set up a bunch of TCP forwarders for iOS devices running in the background. Our messages can be decoded just fine over byte-oriented streams, so it wouldn't change much.
It's a tough call. On the one hand, our UDP clients do not need to reconnect, since the connection is always set up. So when they wake, they send a packet to the server (Hello), and the server immediately sends back any immediate updates, such as new chat messages.
Our read path is perhaps over-optimized, so it's exactly the network latency of one round trip to get the updates since you last opened the app. It takes longer to get the UI up in some cases, so that's why we haven't done the TCP background thing. For others, that might be a much more important consideration.
Do you have explicit DoS protections?
A big difference is I'm not running a packet scheduler (e.g. Chicago). Frankly, our data rate on a per device basis is just minuscule, and our internal protocol has back pressure built into it anyway, so I just skipped that part. If it becomes an issue (unlikely), I'll of course actually add an explicit scheduler so our UDP traffic plays nicely with others.
I'm not doing MLT's puzzle step (although I really like the concept). At the moment, all I can do is deny connections from unknown devices if we're under attack (I can drop packets from an unknown device with a single hash lookup). We're also in the process of moving to OVH, which is able to block DDoS stuff at the network edge, should it happen.
There's more stuff I've got planned, but that's it for now.
A good talk about it by one of the developers/researchers: http://vimeo.com/16189862
Just for the record -- the SSH console is the primary priority. If the web server always beats the SSH console, and the web server is currently chewing 100% CPU due to a coding bug, you can't get in to fix it.
Having no experience with writing Java servers, I wonder if any of you guys have an opinion on this.
The netmap freebsd/linux interface is awesome! I'm looking forward to seeing more examples of its use.
Also, netmap from FreeBSD was the first thing that came to mind as a relief from the IO bottleneck of modern systems.
As in the original C10k, FreeBSD to the rescue here, since it was the first OS with the kqueue interface, and now it's netmap. The speedup numbers in the original paper are astounding.
Are they that important? Should we not be trying to use electricity more efficiently since that is a real world consumable resource. How many connections can you handle per kilowatt hour?
>Content Category: "Piracy/Copyright Concerns"
I'm starting to use these blocks at my workplace as a measure of site quality (this will be a high quality article). Can someone dump the text for me?
Thanks for the link.
Mind you, this was partially because of the specifics of our system - a couple million logged in users with tens of thousands of users logging in every second pushing large file lists, a widely used chat system which meant lots of tiny packets, a very large number of searches (small packets coming in, small to large going out) and a huge number of users that were on dialup fragmenting packets to heck (tiny MTUs!).
I imagine a lot of the kinds of systems you'd want 10M simultaneous connections for would hit similar situations (games and chat, for instance), though I'm not sure I'd want to (imagine power knocking out the machine, or an upgrade, and having all 10 million users auto-reconnect at once).
Or perhaps he meant that the multitasking infrastructure still has to interrupt both his server and SSH, but then described how the kernel can be set to leave some cores free of work, with the server set to use only those cores and so run absolutely uninterrupted.
Not the only bizarre and confusing statement he wrote, anyways.
I could only hope OSv development would move faster.