
Ask HN: What database solution would you choose for this type of workload? - psankar
We have a telemetry gathering system. We get about 250 rps to our backend systems, which dump this data into a database. We get ~25 million records per day, occupying ~12.5 GB of disk space per day (inclusive of the Elastic index). We use Elasticsearch as the backend.

We chose Elasticsearch because we did not know in advance what fields we might get. There are ~40 fields per record, and some fields may be missing in some records. As for reads, the saved data is queried in a time-series fashion: we will mostly access data for the last 1 hour, 1 week, or 1 month.

The data will also be queried with a lot of complex aggregations and joins. For example: what is the average response time (a field) of all the requests that went from node (another field) appServer1 to node appServer2 last month?

With Elasticsearch we are quite happy until there are ~10 million records, beyond which response times lag badly: each query takes 10+ seconds. We thought it could be due to reader starvation (because of incoming writes), so we created a read-only replica and ran our queries there, but the slowness is felt even on the replica.

In terms of memory, we have given about 1 GB to esmaster, 2 GB to esdata, and another 1 GB to esclient. All persistent disks are ordinary HDDs, not SSDs. We are not worried about disk space and can add more (but not SSDs). However, we will not be able to provision more RAM.

Also, we want to be able to add columns/fields without any stop-the-world migrations. This is what makes me nervous about choosing databases that require a schema up front.

The writes can take time (updating multiple indexes on each write is fine), but the reads have to be fast. We are also fine with a hybrid approach where one system receives the writes and another system does the analytics. The system has to work on Kubernetes across clouds, so single-cloud solutions (like DynamoDB) are ruled out.

Given the load and the above constraints, what kind of database solution would you choose?
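For concreteness, the example aggregation can be sketched in plain Python over in-memory dicts (the field names `ts`, `src_node`, `dst_node`, and `response_time` are assumptions for illustration, not the actual schema):

```python
from datetime import datetime, timezone

# Hypothetical telemetry records; field names are assumptions, not the real schema.
records = [
    {"ts": datetime(2023, 5, 10, tzinfo=timezone.utc), "src_node": "appServer1",
     "dst_node": "appServer2", "response_time": 120},
    {"ts": datetime(2023, 5, 11, tzinfo=timezone.utc), "src_node": "appServer1",
     "dst_node": "appServer2", "response_time": 80},
    {"ts": datetime(2023, 5, 12, tzinfo=timezone.utc), "src_node": "appServer1",
     "dst_node": "appServer3", "response_time": 300},
]

def avg_response_time(records, src, dst, since):
    """Average response_time for requests src -> dst at or after `since`."""
    matching = [r["response_time"] for r in records
                if r["src_node"] == src and r["dst_node"] == dst
                and r["ts"] >= since]
    return sum(matching) / len(matching) if matching else None

since = datetime(2023, 5, 1, tzinfo=timezone.utc)
print(avg_response_time(records, "appServer1", "appServer2", since))  # 100.0
```

Any database considered for this workload needs to run this shape of filter-plus-aggregate over a month of data (hundreds of millions of rows) in well under 10 seconds.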
======
jlgaddis
You say you can't add RAM or SSDs but that sounds like exactly what you need.
If your data doesn't fit in RAM (and, with 25MM records/day, it doesn't), then
you're gonna be hitting those slow HDDs when you run queries.

I'd add RAM first and SSDs second but you say you can't do either of those?
Well, good luck, I guess -- any "database solution" you choose is going to
perform like crap given these resource constraints.

------
speedgoose
Have you tried column-oriented databases such as Cassandra, HPE Vertica, or
ClickHouse?

Have you tried a "big data" approach like Apache Spark on parquet files?

Did you consider saving less data? Do you really need to save that much data?
Can't you sample the data and save a lot less while keeping roughly the same
information overall?

~~~
psankar
We have yet to try any other technology; this post is primarily to seek such
candidates.

We cannot get away with saving less data, for non-technical reasons beyond the
scope of this post.

~~~
speedgoose
I would first try these column-oriented databases, but I think you should still
consider saving less data. Doing what you want is definitely possible, but it
may be expensive; if the only reason is that one non-technical co-worker loves
to query that much data for his reports, perhaps you should show him the cost
once you have tried a few technologies.

Querying less data while saving everything is also an alternative. With
ClickHouse, for example, you can specify a sampling rate.
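To illustrate why read-time sampling helps here, a small self-contained sketch of the statistical idea (not ClickHouse itself; synthetic data, seeded for determinism): a 10% random sample estimates the mean closely while scanning 10x fewer rows.

```python
import random

random.seed(42)  # deterministic for illustration

# Synthetic response times (mean ~100, stddev ~15).
response_times = [random.gauss(100, 15) for _ in range(100_000)]

full_avg = sum(response_times) / len(response_times)

# Aggregate over a 10% random sample instead of the full data set.
sample = random.sample(response_times, k=len(response_times) // 10)
sample_avg = sum(sample) / len(sample)

print(f"full={full_avg:.2f} sample={sample_avg:.2f}")
# The sample average lands within a fraction of a percent of the full
# average, because the standard error shrinks with the square root of
# the sample size.
```

This is the trade ClickHouse's `SAMPLE` clause makes explicit: approximate answers from a fraction of the data, at a fraction of the read cost.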

~~~
psankar
This turned out to be good advice. I am evaluating whether it is possible to
intelligently sample the data as it comes into ELK, something like a 1-minute
average via a Logstash filter. We could do this from our backend programs too,
but for now I am trying to do it in ELK during the writes.

Also, we do not use ClickHouse, but I will see if I can somehow apply a
sampling rate during my read queries. Thanks.
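The 1-minute averaging idea could be prototyped outside Logstash first; a minimal sketch in plain Python (field names and minute-bucketing are assumptions), which collapses raw events into one averaged record per minute:

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical raw telemetry events; in practice these arrive at ~250 rps.
events = [
    {"ts": datetime(2023, 5, 10, 12, 0, 5, tzinfo=timezone.utc), "response_time": 90},
    {"ts": datetime(2023, 5, 10, 12, 0, 40, tzinfo=timezone.utc), "response_time": 110},
    {"ts": datetime(2023, 5, 10, 12, 1, 10, tzinfo=timezone.utc), "response_time": 200},
]

def downsample_per_minute(events):
    """Collapse raw events into one averaged record per minute bucket."""
    buckets = defaultdict(list)
    for e in events:
        minute = e["ts"].replace(second=0, microsecond=0)
        buckets[minute].append(e["response_time"])
    return {m: sum(v) / len(v) for m, v in sorted(buckets.items())}

for minute, avg in downsample_per_minute(events).items():
    print(minute.isoformat(), avg)
```

At 250 rps, one averaged record per minute per series turns ~15,000 raw writes into a single row, which is what makes the downstream read queries cheap.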

------
ianamartin
When you say you have 25 million records per day, taking 25 GB of disk space,
can you clarify what you mean by that?

Does each day of 25 million records take 25GB of disk space? Or are you saying
that it cumulatively takes 25 GB up until now?

1 million records per GB would indicate a really, really large record.

~~~
psankar
Sorry, I was wrong. It takes about 12.5 GB of disk space for 25 million
records; I had taken two days' worth of disk usage. I have updated the post
now, thanks for spotting it. Also, the disk space is what Elasticsearch
occupies, so it may include extra index sizes too, not just the size of the
actual data.

