
Ask HN: How do you index large unknown files? - k4ch0w
I want to index about a terabyte of unknown data. I'd like to be able to get emails, usernames, phone numbers, etc. out of it using regexes or some other technique. Is anyone able to point me in the right direction? I was looking at using Elasticsearch, as it seems like the best option at the moment.
======
rpedela
For tabular data, I recommend Duckling [1], a regex/rules-based approach to
data classification that supports multiple languages. If you only need
English, then Stanford CoreNLP's SUTime [2] will also work. You can extend
either tool with your own regexes.
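
To make that concrete, here is a rough Python sketch of what calling a locally
running Duckling server might look like; the port (8000 is Duckling's default),
the chosen dimensions, and the sample text are all illustrative:

    # Minimal sketch, assuming a Duckling server is already running locally
    # (e.g. via the project's Docker image) on its default port 8000.
    import json
    import requests

    def duckling_parse(text, dims=("email", "phone-number", "time")):
        """Ask Duckling to classify spans of `text` into the given dimensions."""
        resp = requests.post(
            "http://localhost:8000/parse",
            data={
                "locale": "en_US",
                "text": text,
                "dims": json.dumps(list(dims)),
            },
        )
        resp.raise_for_status()
        # Each entity carries the matched span ("body"), its dimension ("dim"),
        # and a structured "value".
        return resp.json()

    for entity in duckling_parse("Call me at 650-123-4567 or mail foo@example.com"):
        print(entity["dim"], "->", entity["body"])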

For documents, I recommend Apache Tika, which does a decent job out of the box
of extracting the raw text. The raw text can then be indexed in ES. In my
experience, it is the best tool for detecting a document's type. It is not
necessarily the best tool for extracting text, depending on the document type.
For example, it is better to use a paid solution for OCRing PDFs if you can
afford it. However, Tika is often good enough and very easy to use.
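
As a rough sketch of that pipeline in Python (using the tika and elasticsearch
client packages against a local Tika server and ES node; the index name, paths,
and field names are placeholders, and the document= keyword assumes a recent
elasticsearch-py):

    # Sketch: extract raw text with Tika, index it into Elasticsearch.
    from pathlib import Path

    from elasticsearch import Elasticsearch
    from tika import parser  # talks to a local Tika server, started on demand

    es = Elasticsearch("http://localhost:9200")

    for path in Path("/data/unknown").rglob("*"):
        if not path.is_file():
            continue
        parsed = parser.from_file(str(path))  # detects type, extracts text
        es.index(
            index="raw-docs",
            document={
                "path": str(path),
                "content_type": parsed["metadata"].get("Content-Type"),
                "text": parsed["content"] or "",
            },
        )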

If you need to extract tabular data from documents, then that is a much harder
problem and you are on the cutting edge of NLP/ML research. One tool available
in Stanford CoreNLP that may help is SPIED [3]. If the tool isn't exactly what
you need, then I would recommend using the paper as a starting point for a
literature review.

1. [https://github.com/facebookincubator/duckling](https://github.com/facebookincubator/duckling)

2. [https://nlp.stanford.edu/software/sutime.html](https://nlp.stanford.edu/software/sutime.html)

3. [https://nlp.stanford.edu/software/patternslearning.html](https://nlp.stanford.edu/software/patternslearning.html)

------
mindcrime
I'm doubtful that Elasticsearch, out of the box, is going to do much in the
way of getting you what you want. It's fine for building an inverted text
index and letting you query it, but that assumes you have some ability to
parse the data and submit it for indexing. If the data is truly "unknown",
then your problem is how to get it into ES in the first place. And that would
be true whether you used ES, Solr, or just native Lucene.

Do you know if the data is at least in a text format? If it is, you could code
up a job to walk through the data and extract usernames, phone numbers, etc.
Something like OpenNLP would probably work well for you. If you want something
a little fancier, like NER with automatic entity linking, you could use Apache
Stanbol.
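
For the simple case where the files really are text, a minimal sketch of that
kind of job might look like the following (the patterns are deliberately naive
and the paths are made up):

    # Sketch: walk a directory tree and pull out emails / phone numbers by regex.
    import re
    from pathlib import Path

    PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    }

    def extract_entities(root):
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            try:
                text = path.read_text(errors="ignore")  # skip undecodable bytes
            except OSError:
                continue  # unreadable file, move on
            for kind, pattern in PATTERNS.items():
                for match in pattern.finditer(text):
                    yield {"file": str(path), "type": kind, "value": match.group()}

    for hit in extract_entities("/data/unknown"):
        print(hit)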

Keep in mind, though, that none of these tools are magic and none of them have
human-level AGI. You probably won't get 100% perfect matching for every single
entity you care about. This is especially true when using the pre-trained
models. To get the best results, you'll need to hand-label some training data
(a subset of your original data) and train your own model on the labeled data.

Once you're able to match all your various entities, you might want to stick
them in ES, Solr, etc. so you can run convenient queries, depending on what
your use case is.
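
As a sketch of that last step (Python, assuming a recent elasticsearch-py
client; the index name, field names, and values are invented), you could index
one document per matched entity and query it later:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Index one document per extracted entity, keeping a pointer to the source file.
    es.index(
        index="entities",
        document={"type": "email", "value": "foo@example.com", "file": "/data/unknown/a.txt"},
    )

    # Later: which files mentioned a particular email address?
    resp = es.search(
        index="entities",
        query={"term": {"value.keyword": "foo@example.com"}},
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_source"]["file"])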

Note that if your data isn't in a text format to begin with (e.g., Word docs,
PDFs, etc.), you might be able to use Tika to extract the content as the first
step in your processing pipeline.

