

Ask HN: EC2 network environment and UDP? (debugging a project) - swolchok

I have written a crawler for a pair of major file-sharing DHTs (Azureus/Vuze and Mojito, which is used by Limewire) for a graduate-level distributed systems project, and I'm at a bit of a loss as to why it's failing only on EC2 and only for Mojito.

The crawling process is very similar to the crawler described in http://www.eurecom.fr/util/publidownload.en.htm?id=2495 . Briefly, it works by breadth-first search: the crawler sends 8-16 find-node packets to a node it has not yet interrogated; the response to such a packet is a list of more nodes in the network. It adds the nodes that it has not yet seen to its list of nodes to crawl, and repeats the process. The scan is limited to sending 9000 packets per second, and terminates when no packets have been sent for 30 seconds. netstat -su shows <100 UDP receive errors on the offending EC2 box.

On a local box, it sees 1,163,777 nodes, and gets responses from 300,638 of them (the difference can be attributed to stale routing table entries, NATed nodes, etc.). On EC2, where I was hoping for *greater* network visibility, it sees only ~200,000 nodes, and gets responses from only ~2700. EC2 does in fact seem to have greater visibility for Vuze.

I am running on two high-CPU extra-large instances, one for Vuze and one for Mojito. They are both in us-east, but in different availability zones. The Vuze one has been working great for hours; I am wondering if there is some kind of EC2 global bandwidth limit or DPI that could be whitelisting the Vuze traffic but not Mojito, or perhaps some other EC2 networking quirk? Any help would be appreciated, thanks!
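
For readers who want the crawl loop in concrete terms, here is a minimal sketch of the breadth-first search described above. It is not the actual crawler: `find_node` is a hypothetical callable standing in for the DHT-specific packet exchange (it returns a list of peers, or `None` when no response arrives), the loop is synchronous, and it stops when the queue drains rather than after 30 idle seconds.

```python
import time
from collections import deque


def crawl(bootstrap, find_node, packets_per_node=8, max_pps=9000,
          sleep=time.sleep, clock=time.monotonic):
    """Breadth-first crawl of a DHT (illustrative sketch).

    find_node(node, i) is a hypothetical callable: it sends the i-th
    find-node packet to `node` and returns a list of peers, or None
    if the node did not respond.
    """
    seen = set(bootstrap)       # every node we have heard about
    responded = set()           # nodes that answered at least once
    queue = deque(bootstrap)    # not yet interrogated; FIFO gives BFS order
    interval = 1.0 / max_pps    # minimum spacing between packets
    next_send = clock()

    while queue:
        node = queue.popleft()
        for i in range(packets_per_node):
            # Pace outgoing packets so we never exceed max_pps.
            delay = next_send - clock()
            if delay > 0:
                sleep(delay)
            next_send = max(next_send, clock()) + interval

            reply = find_node(node, i)
            if reply is None:
                continue
            responded.add(node)
            for peer in reply:
                if peer not in seen:
                    seen.add(peer)
                    queue.append(peer)
    return seen, responded


# Toy network: "c" is known to others but never answers (e.g. stale or NATed).
graph = {"a": ["b", "c"], "b": ["d"], "c": None, "d": ["a"]}
seen, responded = crawl(["a"], lambda node, i: graph.get(node))
```

The gap between `seen` and `responded` in the toy run corresponds to the gap between nodes seen and nodes that responded in the numbers reported here.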
======
eklitzke
You're not using IP multicast, are you? I heard (a while ago) that it doesn't
work at all on EC2, i.e. their routers have no or only limited support for
IGMP. If your protocol mixed unicast and multicast, that might explain why
some nodes aren't visible.

~~~
swolchok
Nope, this has nothing to do with multicast, sadly. My current solution is to
just get off EC2 and run the crawler locally.

I have a DHT-layer ping utility, and observed that the failing node could not
ping some peers that the successful node (crawling the other network) could,
so I hypothesize that either EC2 or the file-sharing network was doing some
rate limiting. The puzzling thing is that the local scanner has been running
for hours without showing any signs of failure.
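
The vantage-point comparison described above can be approximated with a generic UDP probe. This is only a sketch under stated assumptions: `payload` is a placeholder for a properly encoded DHT ping (the Mojito and Vuze wire formats are not shown here), peers are `(ip, port)` tuples with literal IPv4 addresses, and any datagram coming back from the peer counts as a response.

```python
import socket


def probe(peer, payload, timeout=2.0):
    """Return True if a datagram comes back from `peer` within `timeout`.

    `peer` is an (ip, port) tuple; `payload` is a placeholder for a
    real DHT-layer ping message.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, peer)
            while True:
                _, addr = sock.recvfrom(65535)
                if addr == peer:   # ignore stray datagrams from other hosts
                    return True
        except OSError:            # covers timeouts and ICMP-induced errors
            return False


def reachable(peers, payload, timeout=2.0):
    """Probe each peer in turn and return the set that answered."""
    return {p for p in peers if probe(p, payload, timeout)}
```

Running `reachable` over the same peer list from each machine and taking the set difference gives the peers that one vantage point can reach and the other cannot.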

