
Ask HN: Storing millions and billions of URLs? - gerenuk
Hello everyone!

We currently use Elasticsearch to store metadata and other raw data, but only at a small scale: around 500,000 domains.

I have been tasked with scaling it to 20-40 million domains and storing their internal/external links, while building a page rank/domain authority score for each domain we add to our database.

What would you suggest for storing this data at very large scale? Storing each web page's internal and external links will push the database to somewhere between 100M and 1B links.

Any feedback or suggestions would be appreciated.

Thanks.
======
nik736
I don't think that any proper database technology will have issues with that
amount of data. It all depends on how you use it.

------
sharemywin
Found this:

[https://dba.stackexchange.com/questions/38793/which-database...](https://dba.stackexchange.com/questions/38793/which-database-could-handle-storage-of-billions-trillions-of-records)

There's a nice little triangle diagram here:
[https://stackoverflow.com/questions/2794736/best-data-store-...](https://stackoverflow.com/questions/2794736/best-data-store-for-billions-of-rows)

------
girishso
I have personally used CouchDB to store tens of millions of documents. If you
can find a way to get the data you want using CouchDB views, the number of
documents simply doesn't matter (apart from disk usage growing with additional
documents/views), and performance stays excellent.

------
drizzle87
Elasticsearch should be easily able to handle your scaling needs. Why do you
think that it would not? What are your concerns?

------
jjirsa
The answer will depend primarily on how you expect to query it.

Cassandra can do many orders of magnitude more than 1B, but would limit you in
your query patterns.
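
To make the query-pattern point concrete: in a store where reads are only cheap by key, each question you want to ask tends to need its own table. Below is a toy in-memory sketch of that idea, not Cassandra code. The class `LinkStore` and its method names are made up for illustration; the two dictionaries stand in for two tables, one keyed by source domain and one keyed by target domain.

```python
from collections import defaultdict


class LinkStore:
    """Toy stand-in for two tables, each keyed by the column it is queried on."""

    def __init__(self):
        self.by_source = defaultdict(set)  # "which domains does X link to?"
        self.by_target = defaultdict(set)  # "which domains link to X?"

    def add_link(self, source, target):
        # Every link is written twice, once per query pattern.
        self.by_source[source].add(target)
        self.by_target[target].add(source)

    def outbound(self, domain):
        return sorted(self.by_source.get(domain, ()))

    def inbound(self, domain):
        return sorted(self.by_target.get(domain, ()))


if __name__ == "__main__":
    store = LinkStore()
    store.add_link("a.com", "b.com")
    store.add_link("c.com", "b.com")
    print(store.outbound("a.com"))
    print(store.inbound("b.com"))
```

Without the second mapping, finding inbound links would mean scanning every source key, which is the kind of query that stops being practical at a billion rows.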

------
mr__y
Have you considered sharding the data across multiple independent ES instances?
Each of them would then only hold an amount of data that does not cause
problems.
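
A minimal sketch of what that routing could look like, assuming you shard by domain name. The function `shard_for` and the instance names are hypothetical; the only real library call is `hashlib.sha1`, used here because Python's built-in `hash()` is not stable across processes.

```python
import hashlib


def shard_for(domain, num_shards):
    """Map a domain to a shard index in [0, num_shards) with a stable hash."""
    digest = hashlib.sha1(domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    # Hypothetical instance names, for illustration only.
    instances = ["es-0", "es-1", "es-2", "es-3"]
    for domain in ["example.com", "news.ycombinator.com", "Example.com"]:
        print(domain, "->", instances[shard_for(domain, len(instances))])
```

One design caveat: with plain modulo routing, changing the number of instances remaps most domains, so you would either pick the shard count up front or use something like consistent hashing.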

------
cimmanom
We've found Elasticsearch to be quite performant with hundreds of millions of
documents. What are your concerns with scaling it?

------
dchuk
Building an ahrefs/moz/majestic competitor?

~~~
gerenuk
BuzzSumo competitor with a different set of features.

~~~
dchuk
I'm actually interested in hearing more about this if you're willing to share
it.

