Identifying botnets with Hadoop and Cassandra

A1kmm · on April 19, 2011

A centralised system like that, however, has a very high privacy cost: one server knows every IP address that accesses every IP address monitored by the system.

The article isn't clear on exactly how trust is computed, but if all that is required is to detect the total number of connections between given IPs, it may be possible to use a peer-to-peer algorithm in which only limited information is shared with neighbours, and only about IPs that are specifically queried.

If the aim is to obtain a list of IPs that have contacted at least m peers out of n (only counting the last k incoming connections, per peer), without disclosing the entire list to any party, that could be done by having each peer broadcast its number of connections per /8 to all peers. Each peer checks that the total reported by every other peer does not exceed k, and also records the number from each peer for each /8. Any /8s which have seen a total of fewer than m connections are rejected, and the counts for each /9 in the remaining /8s are then broadcast (for each peer these must add up to the number it previously reported for the enclosing /8, and must not exceed the number of IPs in the range). The same step repeats with progressively longer prefixes until individual addresses are reached. This system means that at least (m-1) peers need to collude to find out whether someone has been contacted by an IP address in a range that rarely contacts people. If some mechanism stopped (m-1) of the peers from being controlled by the same person, this system could work.
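
To make the refinement concrete, here is a rough sketch in Python that simulates the rounds in a single process. It is only an illustration under some assumptions of mine: the function names are made up, the "broadcast" is just a list of per-peer counters rather than real network traffic, the per-peer bound is taken to be "at most k", and the final /32 step keeps an IP only if at least m distinct peers report it, which is my reading of the goal rather than something stated above.

```python
from collections import Counter
from ipaddress import IPv4Address


def prefix_of(ip, length):
    """Return the top `length` bits of an IPv4 address as an integer."""
    return int(IPv4Address(ip)) >> (32 - length)


def peer_counts(recent_ips, length, surviving_parents):
    """Counts one peer would broadcast for prefixes of the given length.

    Only prefixes whose parent (one bit shorter) survived the previous
    round are reported; in the first round every prefix is reported.
    """
    counts = Counter()
    for ip in recent_ips:
        p = prefix_of(ip, length)
        if surviving_parents is None or (p >> 1) in surviving_parents:
            counts[p] += 1
    return counts


def common_contacts(peers, m, k, start=8):
    """Refine /8 -> /9 -> ... -> /32 and return IPs seen by >= m peers."""
    surviving, previous, reports = None, None, []
    for length in range(start, 33):
        # Each peer only considers its last k incoming connections.
        reports = [
            peer_counts(ips[max(0, len(ips) - k):], length, surviving)
            for ips in peers
        ]
        for i, report in enumerate(reports):
            # Checks every peer would run on the others' broadcasts.
            assert sum(report.values()) <= k
            if previous is not None:
                per_parent = Counter()
                for p, c in report.items():
                    per_parent[p >> 1] += c
                # Child counts must add up to the earlier parent count.
                assert all(per_parent[q] == previous[i][q] for q in surviving)
        totals = Counter()
        for report in reports:
            totals.update(report)
        # Reject prefixes that saw fewer than m connections in total.
        surviving = {p for p, c in totals.items() if c >= m}
        previous = reports
    # Assumption: at /32, require m distinct peers, not just m connections.
    return [
        str(IPv4Address(p))
        for p in sorted(surviving)
        if sum(1 for r in reports if r[p] > 0) >= m
    ]
```

For example, with three peers whose recent connections are `["10.0.0.1", "192.168.1.5"]`, `["10.0.0.1", "172.16.0.9"]` and `["10.0.0.1", "192.168.1.5"]`, calling `common_contacts(peers, m=3, k=10)` keeps only `10.0.0.1`, and the 172.0.0.0/8 range is dropped in the very first round without any peer revealing which address inside it was seen.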