
Chat database for millions of messages per day - antcar
I need to store messages for a chat system with more than 15 million messages per day. Which database should I use? The chat system has multi-room messages and private messages. I am thinking of Elasticsearch but I don't know if there is a better database for this. Thanks.
======
nivertech
Hi antcar,

You need to split your system into the following parts:

  1. metadata database (i.e. users, groups/chatrooms, etc.)
  2. hot chat history (recent messages)
  3. cold chat history (archived messages)
  4. a messaging queue in front of the cold chat DB

For the metadata, any fast K/V store with rich data structures or basic
secondary indices will work. You can even use Redis if you use something like
Redis Labs with persistence.

For the hot DB, you can use Redis (cluster or sharded), Aerospike, ScyllaDB or
ElasticSearch.
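A common pattern for the hot DB is a per-room store of recent messages, trimmed so the working set stays bounded. Here is a minimal in-memory sketch of that pattern (a real deployment would use Redis sorted sets with ZADD/ZREMRANGEBYRANK; the class, cap, and method names here are illustrative):

```python
from collections import defaultdict, deque

HOT_LIMIT = 1000  # keep only the most recent N messages per room (assumed cap)

class HotHistory:
    """In-memory stand-in for a Redis-style per-room recent-message store."""

    def __init__(self, limit=HOT_LIMIT):
        # deque(maxlen=...) evicts the oldest entry automatically,
        # mimicking a trim after each insert
        self.rooms = defaultdict(lambda: deque(maxlen=limit))

    def append(self, room_id, ts, payload):
        self.rooms[room_id].append((ts, payload))

    def recent(self, room_id, n=50):
        # newest-first, like ZREVRANGE key 0 n-1
        return list(self.rooms[room_id])[-n:][::-1]
```

Evicted messages are not lost: they have already been sent down the queue toward the cold store, so the hot DB only ever serves the "scrollback" window.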

For the cold DB, you can use Riak, Cassandra/ScyllaDB, DynamoDB, Aerospike.

You can even use the same database for both paths, but it needs to be
optimized for mostly-write workloads.

15M msg/day ~ 200 msg/sec - any K/V database will do.
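The ~200 msg/sec figure is just the day's total averaged over 86,400 seconds; a sketch of the arithmetic (the 5x peak-to-average factor is an assumed rule of thumb, not from the thread):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86_400

msgs_per_day = 15_000_000
avg_rate = msgs_per_day / SECONDS_PER_DAY  # ~173.6 msg/sec on average
peak_rate = avg_rate * 5                   # assumed 5x burst factor, ~870 msg/sec
```

Even the assumed peak is well within what a single node of any of the stores above can handle; the hard part is retention and data volume, not write rate.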

For the messaging queue, you can use RabbitMQ or, for high throughput, Kafka
(or a managed cloud alternative such as AWS Kinesis, Google Cloud Pub/Sub,
Azure Event Hubs, etc.)
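The point of the queue in front of the cold DB is that archive writes can be batched into bulk inserts instead of hitting the cold store once per message. A toy sketch of that consumer pattern, with Python's stdlib queue standing in for Kafka/RabbitMQ (the batch size and flush logic are illustrative):

```python
import queue

BATCH_SIZE = 500  # illustrative; tune to the cold store's bulk-write sweet spot

def drain_batches(q, batch_size=BATCH_SIZE):
    """Pull messages off the queue and yield them in batches for bulk insert."""
    batch = []
    while True:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the partial final batch
        yield batch

# usage: each yielded batch becomes one bulk write to the cold DB
q = queue.Queue()
for i in range(1200):
    q.put({"msg_id": i})
batches = list(drain_batches(q, batch_size=500))
```

The queue also absorbs bursts and lets the cold store be taken down for maintenance without losing messages.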

My email is in my profile:
[https://news.ycombinator.com/user?id=nivertech](https://news.ycombinator.com/user?id=nivertech)

Hope this helps.

~~~
iurisilvio
I'm not related to OP, but thanks for this answer! It is not easy to find a
clear answer like this for a complex problem.

The first time I had to build something that needed to scale from day one, it
took a long research effort into databases, clustering/sharding, architecture
and everything else you described. It worked, but it would have been easier
with an answer like this one!

------
proudboffin
I guess it depends on your needs. Elasticsearch is a great option, and if
you're looking for visualizing the data then it can be used together with
Kibana to create cross-data graphs and visualizations. For the kind of data
volume you're talking of, you'll need a fully scalable stack, with queuing and
other buffers to handle traffic. You might find this piece we wrote useful:
[http://logz.io/blog/deploy-elk-production/](http://logz.io/blog/deploy-elk-production/)

------
borplk
PostgreSQL. Avoid exotic databases; they will come back to bite you. Throw
some RAM at Postgres and tweak the config values and it will fly.

------
JoachimSchipper
PostgreSQL, or just keep everything in-process on a bigger server. You're
looking at a few hundred to a thousand qps, right?

