
Ask HN: Storage for Raw HTML? ElasticSearch or Something Else? - gerenuk
Hi,

I am working on a project where we need to store the response of a web page and later run data processing on it, such as NLP tasks, topic modeling, sentiment analysis, etc.

We expect around 1-2 million documents per day on average, and that will soon increase to 10-20 million.

What do you suggest for this kind of data storage? Is ElasticSearch better for this kind of workload, or should we use HDFS/Ceph etc. for the storage?

Currently, we are using MongoDB for persistent storage and ElasticSearch for indexing the data and serving it to our frontend, but if a better option is available we can look into it, as we are restructuring most of our architecture and data pipeline.

Any kind of help/suggestion will be appreciated.

Thanks

P.S. If someone can share some insights into what kind of architecture brandwatch.com, mention.com or ahrefs.com use for their data pipeline, that would be really helpful.
======
twobyfour
For that volume, I would dump the raw data to disk somewhere (even if
"somewhere" is S3). Then load it to extract only the relevant data, and index
that into whatever data store is most efficient for the types of queries
you'll be doing.
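
A minimal sketch of that dump-then-extract flow, using only the Python
standard library. A local directory stands in for S3 here, and the
date-partitioned key layout, the function names, and the tag-stripping
extractor are all assumptions for illustration, not anything from the
original setup.

```python
import gzip
import hashlib
from datetime import datetime, timezone
from html.parser import HTMLParser
from pathlib import Path


def raw_key(url: str, fetched_at: datetime) -> str:
    """Build a date-partitioned key (hypothetical layout) for one fetched page."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{fetched_at:%Y/%m/%d}/{digest}.html.gz"


def store_raw(root: Path, url: str, html: str, fetched_at: datetime) -> Path:
    """Write the raw HTML, gzip-compressed, under root (a stand-in for a bucket)."""
    path = root / raw_key(url, fetched_at)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write(html)
    return path


def load_raw(path: Path) -> str:
    """Read a stored page back as text."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return fh.read()


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Reduce a page to the text you would actually index or feed to NLP."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Keeping the raw pages immutable and compressed means the extraction step can
be rerun later (for example, when the NLP pipeline changes) without
re-crawling anything.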

Relational databases are most effective for aggregates; MongoDB is good for
stable document storage; Elasticsearch is less stable but good for fast
search, especially full-text, stemming, and "fuzzy" or weighted searches. It
may also be sensible to index your data into multiple databases so you can
query it efficiently in different ways for different purposes.
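
As a rough illustration of indexing the same extracted data into more than
one store, the sketch below loads documents into an in-memory SQLite table
(for aggregates) and into a toy inverted index (a stand-in for a real search
engine such as Elasticsearch, not a replacement for one). The tuple schema
and all function names are hypothetical.

```python
import sqlite3
from collections import defaultdict


def build_stores(docs):
    """Index (doc_id, domain, text) tuples into a relational table and an inverted index."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE pages (doc_id TEXT PRIMARY KEY, domain TEXT, n_words INTEGER)"
    )
    inverted = defaultdict(set)  # term -> set of doc ids containing it
    for doc_id, domain, text in docs:
        words = text.lower().split()
        db.execute("INSERT INTO pages VALUES (?, ?, ?)", (doc_id, domain, len(words)))
        for word in words:
            inverted[word].add(doc_id)
    db.commit()
    return db, inverted


def pages_per_domain(db):
    """Aggregate query: the kind of question the relational store answers well."""
    rows = db.execute(
        "SELECT domain, COUNT(*) FROM pages GROUP BY domain ORDER BY domain"
    )
    return dict(rows.fetchall())


def search(inverted, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(inverted.get(terms[0], set()))
    for term in terms[1:]:
        result &= inverted.get(term, set())
    return result
```

The point is only that each store answers a different kind of question over
the same extracted records; in practice the search side would handle
stemming, scoring, and fuzzy matching, which this toy index does not.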

