
Ask HN: Storing and Searching Volumes of Text - Xeoncross
Many companies have a firehose of text they need to store and search, even if it's just log files. From PostgreSQL/MySQL + full-text search, to distributed filesystems + multi-node grep instances, to TiDB + Sphinx, to Cassandra + Elasticsearch, and many more in-between and mixed, there are a lot of options for both storage and search of text. Yet it's something we all find more need for as we store social network reactions, technical documents, forum posts, emails, company documents, error logs, and many other text-based data points.

Can you share your company's database design and why this design works for you (or why it was chosen)?
======
iDemonix
This seems a bit of an 'I have an essay due' question, so I don't want to break
it down into a full walkthrough, but if it's any use to you, here's roughly what
we do without revealing anything:

We have two big sources of logs, proxy logs (user browsing, tens of thousands
of users), and firewall/network logs. This is for a network with an average
transit of about 15Gbps to the internet during working hours.

Proxy logs appear at an alarming rate on a big network; every page load calls 20
other URLs, and so on. For this we have a piece of software (an agent) on each
proxy that aggregates logs and forwards them to a set of central processing
servers. These servers do some processing, then hand them on to storage
servers (replicated). Recent data is held on disk; older data is stored in an
Azure data warehouse.
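
As a rough sketch of the agent side (the URL, batch size, and file path below
are all illustrative, not our actual stack): tail the local log, batch lines,
and forward each batch to a central processor.

    # Minimal log-forwarding agent sketch: tail a file, batch lines,
    # and POST each batch to a central processing server.
    import time
    import requests

    CENTRAL = "http://log-processor.internal:8080/ingest"  # hypothetical
    BATCH_SIZE = 500

    def follow(path):
        # Yield new lines appended to a log file (like `tail -f`).
        with open(path) as f:
            f.seek(0, 2)  # start at the end of the file
            while True:
                line = f.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                yield line.rstrip("\n")

    batch = []
    for line in follow("/var/log/proxy/access.log"):
        batch.append(line)
        if len(batch) >= BATCH_SIZE:
            requests.post(CENTRAL, json={"lines": batch}, timeout=5)
            batch = []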

Firewall logs you can set to lots of levels, from only critical errors all the
way to a mind-boggling amount per second; it depends what you want. We send
these to a clustered set of rsyslog forwarders; all logs are sent to disk
storage (with tape backups), and selected logs are sent to Graylog for
analysis. You can do all sorts with Graylog, from alerting when root logs in
to a Linux box or watching for hardware/memory faults, to graphically
displaying your firewall attacks using a geo-IP plugin.
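
For a concrete example of getting an event into Graylog, its native GELF input
accepts JSON (optionally zlib-compressed) over UDP; here's a minimal Python
sketch, with the hostname and field values made up:

    # Send one GELF message to a Graylog UDP input (12201 is the default port).
    import json
    import socket
    import time
    import zlib

    msg = {
        "version": "1.1",
        "host": "fw-edge-01",          # originating device (illustrative)
        "short_message": "root login via ssh",
        "timestamp": time.time(),
        "level": 4,                    # syslog severity: warning
        "_src_ip": "198.51.100.7",     # custom fields are prefixed with "_"
    }
    payload = zlib.compress(json.dumps(msg).encode())
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, ("graylog.internal", 12201))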

~~~
Xeoncross
Thank you for sharing. If you look at my profile I think it will quell any
fear you have about this being some kind of homework-related question.

While your log storage looks fine, Graylog's search
([http://docs.graylog.org/en/2.2/pages/queries.html](http://docs.graylog.org/en/2.2/pages/queries.html))
appears to be plain string matching, which would be really slow if it weren't
for the filtering by attributes before the search.

~~~
iDemonix
Graylog is just an interface for Elasticsearch, and you can do all sorts with
Grok filters etc. to parse incoming data.
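
Under the hood, a Grok pattern is essentially a library of named regular
expressions. A hand-rolled Python equivalent of a pattern like
%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:status}
(field names made up) would be:

    # What a Grok pattern extracts, expressed as a named-group regex.
    import re

    LINE = "203.0.113.9 GET /index.html?x=1 200"
    GROKISH = re.compile(
        r"(?P<client>\d{1,3}(?:\.\d{1,3}){3}) "
        r"(?P<method>\w+) "
        r"(?P<request>\S+) "
        r"(?P<status>\d+)"
    )
    m = GROKISH.match(LINE)
    if m:
        print(m.groupdict())
        # {'client': '203.0.113.9', 'method': 'GET',
        #  'request': '/index.html?x=1', 'status': '200'}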

------
stephancoral
Just use Elasticsearch. It's easy to set up, fast, customizable, and has a lot
of dev / community support.

At my last company, we used Kafka as our message log to handle the incoming
messages, and then we just forwarded everything to Elasticsearch with Logstash
(super easy to create and read from new topics).
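
In outline, that Kafka-to-Elasticsearch leg looks like this (Logstash does
this step for real; the sketch below uses the kafka-python and elasticsearch
client libraries instead, and the topic, hosts, and index names are
placeholders):

    # Sketch of the Kafka -> Elasticsearch leg of the pipeline.
    import json
    from kafka import KafkaConsumer
    from elasticsearch import Elasticsearch, helpers

    consumer = KafkaConsumer(
        "app-logs",                              # placeholder topic
        bootstrap_servers=["kafka:9092"],
        value_deserializer=lambda v: json.loads(v.decode()),
    )
    es = Elasticsearch(["http://elasticsearch:9200"])

    def actions():
        # One bulk-index action per Kafka record.
        for record in consumer:
            yield {"_index": "logs", "_source": record.value}

    helpers.bulk(es, actions())  # chunks documents into ES _bulk requests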

For your storage, Elasticsearch can store the source of all documents and
basically act as a persistent DB (with Kafka as a backup for replay). Use well-
defined mappings for the data you know is structured (logs, forum posts,
already-schematized data, etc.), dump the rest into their own indices, and let
ES's dynamic mappings do their work. You can use dynamic templates to drill
down on any specific things you want to transform, such as converting dates or
not tokenizing certain strings.
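
For instance, creating an index with an explicit mapping for the known fields
plus a dynamic template that keeps any unmapped string as an untokenized
keyword might look like this (modern typeless-mapping syntax; the index and
field names are made up):

    # Explicit mapping for known fields, plus a dynamic template so any
    # unmapped string field is stored as a non-analyzed keyword.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://elasticsearch:9200"])
    es.indices.create(index="forum-posts", body={
        "mappings": {
            "dynamic_templates": [{
                "strings_as_keywords": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"},  # exact match only
                }
            }],
            "properties": {
                "posted_at": {"type": "date"},   # structured, known up front
                "body":      {"type": "text"},   # full-text searched
            },
        }
    })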

I've done a lot of search and text-extraction work, and this has been my go-to
for the past three years.

