Hacker News new | past | comments | ask | show | jobs | submit login

Author here. The algorithm used here is based on Google's 2007 paper "Scaling Up All Pairs Similarity Search." Since then I am sure they have started to look at billions of sets. Generally speaking exact algorithms like the one presented here max-out around 100M on not-crazy hardware, but going over a billion probably requires some approximate algorithms such as Locality Sensitive Hashing. You maybe interested in the work by Anshumali Shrivastava.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact