
Ask HN: Storing and processing less than 1TB of unstructured data? - johnnycarcin
Many HN folks deal with data problems, so I thought I'd ask: if you had to store and index less than 1TB of unstructured (plain text) data, what would you use?

I have a bunch of text files and HTML pages that I'd like to dump into something and then search over, maybe even find relationships (common terms, phrases, etc.) between the various docs. I've heard of things like Hadoop, but that seems to be overkill for the amount of data I have. I'd also like to keep things as low-cost as possible, as this is just for personal use. I've looked at a few of the cloud providers but am honestly not sure what I'm looking for, so I find myself walking away more confused than when I started.

This seems like an easy problem, but for whatever reason I'm getting wrapped around the axle on it.
======
dekhn
I recommend the book "Managing Gigabytes", which, while dated, is still
relevant. The title doesn't indicate this, but it's heavily focused on data
structures for indexing text documents.
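
To give a feel for the kind of structure meant here, below is a minimal sketch
of an inverted index (term -> set of documents containing it) with AND-style
search. It is only an illustration: the function names, the tokenizer, and the
sample documents are all made up for this example, and a real engine adds
compression, ranking, and on-disk storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents, purely for illustration.
docs = {
    "a.txt": "Elasticsearch stores an inverted index",
    "b.txt": "Postgres can index JSON documents",
    "c.txt": "An inverted index maps terms to documents",
}
index = build_index(docs)
```

Calling `search(index, "inverted index")` intersects the posting sets for both
terms, so only documents containing both words come back.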

But Elasticsearch running on a cloud VM with an attached EBS volume would be a
fast way to get work done.

------
1e10
1 TB is nothing these days. If you insist on cloud, Hetzner could be the best
bang for the buck. Otherwise, a comparable desktop system can be had for less
than 1000 USD.

I’d start with Solr or Elasticsearch and a simple indexing script (a
home-rolled Python script).
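
A rough sketch of what such a script could look like, assuming Elasticsearch
and its newline-delimited `_bulk` request format. It only builds the request
body with the standard library and does not talk to a server; the index name
`docs`, the field names `path` and `body`, and the helper names are
assumptions for illustration.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def bulk_lines(root, index_name="docs"):
    """Yield NDJSON lines: one action line, then one document line, per file."""
    for path in sorted(Path(root).rglob("*")):
        suffix = path.suffix.lower()
        if not path.is_file() or suffix not in {".txt", ".html", ".htm"}:
            continue
        raw = path.read_text(encoding="utf-8", errors="replace")
        body = raw if suffix == ".txt" else html_to_text(raw)
        doc_id = path.relative_to(root).as_posix()
        yield json.dumps({"index": {"_index": index_name, "_id": doc_id}})
        yield json.dumps({"path": doc_id, "body": body})


def build_bulk_payload(root, index_name="docs"):
    """Join the lines; a bulk request body has to end with a newline."""
    return "".join(line + "\n" for line in bulk_lines(root, index_name))
```

The resulting string would be POSTed to the cluster's `/_bulk` endpoint with
`Content-Type: application/x-ndjson`. For anything close to 1 TB you'd want to
send it in batches of a few thousand documents rather than as one payload.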

Then you can use the Solr admin UI or something like Jupyter for iterative
querying.

I’m not an expert on index tuning, but you might even be able to dump it all
into Postgres with JSON types.

Best of luck!

~~~
johnnycarcin
Yeah, the amount of data is pretty small in the grand scheme of things; maybe
that is why I'm getting so hung up, haha. Elasticsearch was actually the first
thing I thought of, so maybe I'll just go with that and see what happens...

------
johnnycarcin
Coming back to this: while looking at options, I stumbled over
[https://docs.alephdata.org/](https://docs.alephdata.org/). It is a bit more
heavyweight than plain Elasticsearch, but it has some nice additions that
might make it worth it depending on your situation.

