About 13 years ago, I wrote my own programming language expressly for the purpose of implementing network stacks, and had a complete TCP in it; I didn't have this problem. But I do have this problem all the time when I write direct network code and forget about buffering.
Similarly: "Python is so slow that Google reset the network connection" seems a bit unlikely too. Google, and TCP in general, deals with slower, less reliable senders than you. :)
What's the time between your SYN and their SYN+ACK?
There's an iptables module called tarpit, which takes advantage of some peculiarities of the TCP protocol to essentially prevent the remote host from closing the connection, which can force (conforming?) TCP clients to take 12-24 minutes to timeout every connection. It can make portscanning unpleasantly expensive and time-consuming.
You can probably also come up with real-world applications for it (as a security tester, there are lots of applications for having full control over a TCP stack, regardless of how performant it is), but just having done it offers a huge learning return on a modest investment.
You probably don't fully grok what TCP even is until you've made congestion control work.
And sometimes you even want your stack to be slow, e.g. in a slowloris attack.
Being implemented around Twisted, it actually allows you to fiddle with low-level TCP stuff while e.g. offloading the SSL to the existing stack. It saved my bacon a few times when I wanted to reproduce complicated network breakage scenarios.
https://www.youtube.com/watch?v=BEAKtqiL0nM - Video about muXTCP from 22C3
https://github.com/enki/muXTCP - github repo with it.
Python: fast to write, slow to run.
C/C++: slow to write, fast to run.
To put it succinctly, you can write fast programs, and you can write programs fast, but you can't write fast programs fast.
Haskell, Clojure, and OCaml do pretty well at writing fast programs fast for what I consider appealing values of fast.
I wouldn't be surprised if Google is sending the packets all at once and ignoring the ACKs altogether.
Heck, there's even a 2010 paper from Google on the subject of sending a bunch of packets at the beginning of the connection: An Argument for Increasing TCP's Initial Congestion Window
When you send ACKs, not only do you send the acknowledgement number indicating which byte you expect next, but you also send a window size indicating how many bytes you're willing to receive before the remote end has to wait for another acknowledgement. Normally you want this to be somewhat large so you don't spend lots of idle time waiting around for ACKs. (But not so large that packets get dropped). This is the key to TCP flow control, which was kinda glossed over in the blog post in the interest of keeping things simple.
But perhaps by default, you're advertising a too-large window considering the circumstances. I bet you could make this a lot more reliable just by advertising something smaller.
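To make the window field concrete, here is a minimal sketch (field layout per RFC 793, names are my own) of pulling the advertised receive window out of a 20-byte TCP header with Python's struct module:

```python
import struct

def advertised_window(tcp_header):
    # Unpack the fixed 20-byte TCP header: src port, dst port, seq,
    # ack, data offset, flags, window, checksum, urgent pointer.
    fields = struct.unpack('!HHIIBBHHH', tcp_header[:20])
    return fields[6]  # the receive window, in bytes

# Build a sample header advertising a 1460-byte window (checksum 0).
sample = struct.pack('!HHIIBBHHH', 80, 54321, 0, 1, 5 << 4, 0x10, 1460, 0, 0)
assert advertised_window(sample) == 1460
```

Shrinking that one field in your outgoing ACKs is all it takes to tell the remote end to slow down.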
Good TCP implementations have overcome a lot more than some unwanted buffering.
Is your ACK sequence number the sum of all received data lengths? I think that's how it works?
So yes, if a C64 can handle it Python should have plenty of power.
Google's webservers, including the TCP stacks themselves, may be very aggressively tuned to make sure you get the response absolutely as fast as possible, at the expense of re-sending packets more quickly than specified.
Here's my two cents on the experiment:
1. You don't really have to ack every packet, you have to order them, drop duplicates and ack the last one.
2. Google ignores the TCP flow control algorithm and sends the first few hundred packets very quickly without waiting for ACKs. They do this to beat the latency of the flow control algorithm. That's why you end up with so many packets on your side. You could try just about anything but Google, and you would probably see a less insane packet storm.
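Point 1 above, cumulative ACKing, can be sketched in a few lines. This is a toy illustration (the function and data layout are my own, not from any real stack): buffer segments that may arrive out of order, and acknowledge only the highest in-order byte.

```python
def next_ack(expected_seq, segments):
    """Advance over contiguous buffered segments and return the
    cumulative ACK value (the next byte we expect).

    segments: dict mapping sequence number -> payload bytes,
    possibly arrived out of order.
    """
    while expected_seq in segments:
        expected_seq += len(segments.pop(expected_seq))
    return expected_seq

# Bytes 0-4 arrived in order; byte 10 arrived early, leaving a gap.
segs = {0: b'abc', 3: b'de', 10: b'x'}
assert next_ack(0, segs) == 5   # one ACK covers both in-order segments
assert 10 in segs               # the out-of-order segment stays buffered
```

One ACK for the whole contiguous run is exactly why you don't need to ACK every packet.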
You'd need something substantially faster (Nyquist and such).
But if you use TCP offloading, then you can communicate using TCP as cheaply as doing any traditional serial communication. I've seen those crude TCP stack implementations, and they're barely LAN-capable. Using those over the Internet wouldn't be a great idea. Often these TCP stacks are used just as serial-to-TCP converters: using an RWIN of at most 16 bytes due to UART buffering limits, not sending an ACK before the whole 16-byte buffer is cleared, and so on.
So with really low-end devices, it's better to offload TCP to custom hardware. Many embedded devices use that approach. In such cases you only connect to an IP address and port, and when it connects, you have your TCP/IP connection in pure serial. TCP modems aren't any different from the good old Hayes modems. It's just dial-up over the Internet and TCP.
Btw. If the system is able to run CPython, Linux, or something similar, it's already a pretty powerful system.
Sadly 10BASE-T Ethernet uses Manchester coding, so you need 20 MHz :( You could pull it off with a slower processor if you had a FIFO that can reach 20 Mbaud on the serial end, but the USART in AVR micros runs at a fraction of the system clock.
For a surprise, google for / research how WIZnet-based Ethernet controllers are used on Arduino shields, or just wired up by hand (I did a hand-wired WIZnet W5100 once upon a time...). Yes, yes, it's possible to run it in full TCP or UDP termination mode, which is the way most people use it, but it can also terminate at the IP level and, with some limitations, at the raw Ethernet packet level. If you pull that data sheet and want to actually try this, look for documentation about "MACRAW" mode. This basically turns the controller into an Ethernet packet FIFO as you describe, with Ethernet on one side and SPI bit-banging on the other side.
Disclaimer: I've never done anything with MACRAW mode on WIZnet controllers other than read about it in the data sheet and wonder why I'd want to do that. If I was going to blackhat a weird portable appliance or otherwise do heavy Ethernet weirdness, I'd use a Raspberry Pi or similar SBC and Linux, not Arduino-compiled C, but whatever.
The author's problems are because she is not using raw IP sockets.
Making TCP packets using Python's struct module is a breeze. I can post specific examples in code if anyone is interested.
Finally, you can write a proper TCP stack in Python; there is no reason not to. Your user-space interpreted stack won't be as fast as a compiled kernel-space one, but it won't be short on features.
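To back up the "breeze" claim, here is a hedged sketch of packing a bare-bones TCP header with struct (helper name and defaults are mine; the checksum is left at 0 for brevity, but a real stack must compute it over an IPv4 pseudo-header):

```python
import struct

def tcp_header(src_port, dst_port, seq, ack, flags=0x02, window=8192):
    # 5 32-bit words (no options), shifted into the data-offset nibble.
    data_offset = 5 << 4
    return struct.pack('!HHIIBBHHH',
                       src_port, dst_port,
                       seq, ack,
                       data_offset, flags,
                       window,
                       0,   # checksum placeholder
                       0)   # urgent pointer

syn = tcp_header(54321, 80, seq=1000, ack=0)  # flags=0x02 is SYN
assert len(syn) == 20
```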
PS: I guess Google is probably sending her an SSL/TLS handshake which she isn't handling.
Edit: Corrected author's gender as mentioned by kind poster.
And you shouldn't; see the Guidelines under "In Comments".
If you're interested in writing a TCP/IP stack in Python I would recommend you use Python raw sockets, or possibly dnet or pcapy. The Scapy level of abstraction is too high for your needs.
I agree with other posters who mention buffering in libpcap. Read the man page for pcap_dispatch to get a better idea of how buffering works in libpcap. Also try capturing packets with tcpdump with and without the '-l' switch. You'll see a big difference if your pkts/sec is low.
Don't do ARP spoofing. If you're writing a TCP/IP stack then you also need to code up ARP. If you don't want to do that, then use raw sockets and query the kernel's ARP cache.
On second thought you really need to use raw sockets if you want this to work. Using anything pcap based will still leave the kernel's TCP/IP stack involved, which is not what you want.
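For reference, the raw-socket setup being described looks roughly like this in Python (a sketch, not a complete sender): with IPPROTO_RAW plus IP_HDRINCL, you hand the kernel fully-formed IP packets and its own TCP stack stays out of the send path. It requires root (CAP_NET_RAW).

```python
import socket

def open_raw_socket():
    # SOCK_RAW at the IP layer; IP_HDRINCL tells the kernel we will
    # supply the IP header ourselves rather than having it prepended.
    s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_HDRINCL, 1)
    return s

try:
    open_raw_socket().close()
except PermissionError:
    pass  # expected when not running as root
```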
I've not used it yet, but I've read over the documentation and am itching for an opportunity to do so.
Anyone know why the Python program is so slow? I'm looking at the code and my first guess would be this part but I can't explain why, overall, it would be so slow that a remote host would close the connection.
---- SYN ---->
<-- SYN/ACK --
---- ACK ---->
rather than having the client send two SYNs to the server?
An alternative method I've used in the past is to add an iptables rule to silently drop the incoming packets. libpcap sees the packets before they are dropped, so you'll still be able to react to them, but the kernel will ignore them (and therefore won't attempt to RST them).
In Uni we had a networking course where we got to build a network web server software stack from the bottom up, starting with Ethernet/MAC, TCP/IP, and then on the top HTTP, all being flashed onto a small network device (IIRC it was originally a CD-ROM server). It was an extremely enlightening exercise. I recommend you go deeper instead of just using a premade Python library for TCP!
Somebody's oversubscribed $3/month shared PHP hosting might not ramp up the speed as quickly.
It's really surprising to me that lots of people are using scapy for things that require performance, but then again, if you look at the scapy website or the docs, it's not immediately apparent that their tool is not meant for this. Which I guess says a lot about the scapy developers rather than the scapy users.
tl;dr Scapy is a joke, performance-wise.
In terms of using Scapy for your packet crafting, here are some guides with examples that may help you work around your issues. (Hint: use the Scapy-defined sending and receiving routines and don't implement your own, or stop using Scapy and implement your own raw packet sockets.)
http://securitynik.blogspot.com/2014/05/building-your-own-tc...
https://github.com/yomimono/client-fuzzball/blob/master/fuzz...
https://www.sans.org/reading-room/whitepapers/detection/ip-f...
http://www.lopisec.com/2011/12/learning-scapy-syn-stealth-po...
As the outbound TCP SYN is manually crafted and sent over a raw socket, without any corresponding state table entry on the sender's kernel, incoming TCP responses will be rejected by the kernel with a RST.
I suggested to Julia that she manually publish an ARP entry for another IP which she could send and receive on. The kernel not having an interface with that IP assigned to it would ignore responses while also passing them to the raw socket. An alternative would be to use an iptables rule to drop incoming packets for the relevant flow - although that may be more difficult to manage depending on what you're doing.
Tangent: One of my favorite interview questions is to ask how traceroute works. The question works best when the candidate doesn't actually know. Then we can start to discuss bits and pieces of how TCP/IP works, until they can puzzle out the answer.
The bottleneck for such processes is typically network I/O, and I can imagine that taking control of the network in user space might offer some modest to significant wins. For Hadoop in particular, network packets need to traverse quite a few layers before they are accessible to the application.
Has anyone done this sort of thing for MapReduce? Pointers to any writeup would be awesome.
In fact, TCP itself might be overkill for MapReduce. The reduce functions used are typically associative and commutative, so as long as the keys and values are contained entirely within a data packet, proper sequencing is not even needed. Any sequence would suffice.
Here are some slides for one version: http://www.bsdcan.org/2014/schedule/attachments/260_libuinet...
Edit: with regards to motivation it's always been something along the lines of a network appliance that I've seen. The mainline linux network stack is more than capable of doing many millions of packets per second over hundreds of thousands of concurrent streams. The network stack will not be the limitation in something like batch processing.
Is it because of a bad choice of internal algorithms? A bad choice of internal data structures? Bad I/O design? Given the popularity it enjoys, and given its age, it's a little frightening how much worse Hadoop is in comparison to some proprietary implementations. My hunch is that Hadoop's slowdown is in the shuffle phase, which is where faster network data transfer can help.
I like the abstractions that Spark exposes, but it still needs a lot of engineering to catch up. I have anecdotes where Spark is slower than Hadoop by quite a bit.
All my experience is with Hadoop 1.x. Is Hadoop 2.x much better?
I also learned a lot about networking by writing a TCP/IP stack in Common Lisp. http://lukego.livejournal.com/4993.html if you are interested.
C++, but very detailed articles.
sshuttle is a pure-python one-way TCP VPN solution that is very well behaved and quite efficient. The source is highly readable as well. +1 to everything Avery Pennarun has released (including wvdial, bup)
IPv6 has enough address space. Object storage takes care of disk access. Generally it might be way less efficient? What would the OS look like? It seems like a lot of OS services would disappear? You'd have a cloud of processes. Each process VM would be like a cell in a body. Maybe each process VM would load an auth module. Or not.
Hopefully someone will blog a response post that gets popular on HN proving just how wrong it is.
The best solution doesn't have anything to do with python, it's to implement window size (flow control) and only ack when you need to.
PS: It's been several years since I've worked on a TCP stack, so please correct anything I've remembered incorrectly.
As the other post suggests, this is not a problem in all high-level languages; it's possible to write very high-performance code in Haskell or OCaml, for example. But Python's semantics are, e.g., that every object has its own methods that can be overridden arbitrarily, which means that every single method call pretty much has to involve a hash table lookup. 99% of Python code never uses this extreme dynamism (usually you define methods on classes, and if you want an instance with a different implementation for a particular method then you make a subclass), but it's there, and a Python interpreter that didn't respect that would behave incorrectly.
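The dynamism in question looks like this (a toy class of my own invention): any individual instance can shadow a method through its own dict, so the interpreter must check that dict on every call.

```python
class Greeter:
    def hello(self):
        return "hello"

g = Greeter()
assert g.hello() == "hello"

# Per-instance override: attribute lookup probes g.__dict__ before
# the class, so every method call pays for that dict lookup.
g.hello = lambda: "patched"
assert g.hello() == "patched"

# Other instances are untouched -- the override lives only in g.
assert Greeter().hello() == "hello"
```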
Keep in mind that in current computer architectures memory requests are very slow. And any data structure that randomizes memory access means that you have a good chance of a cache miss and now have to hit even slower ram.
In this particular case, I don't think it's the problem, though. Just something to keep in mind.
You can do sophisticated things where you compile objects assuming they won't be overridden and then back out the compile if something touches an object (the JVM does similar things where it will compile a never-overridden method as non-virtual and then detect when the class hierarchy changes), but that requires a lot of complexity that goes against the goals of CPython.
Nowadays I mostly work in Scala, and anywhere where I would have used such a technique I find there's a way to do it "statically". So I'd be interested if there are good examples of what makes this a "powerful" technique, and to see if I can't replicate them "statically" with enough typeclasses etc.
Not impossible, as lmm said, just hard.
Also, it's likely more a Python problem than a "high level" language problem.
It's pointless to people who understand how TCP works in depth, but the majority of programmers don't.
I for one found it interesting/fun.
Did you even read the link?