
Storing 50M events per second in Elasticsearch - Benfromparis
https://datadome.co/store-50-million-event-per-second-in-elasticsearch/
======
jungletime
Three years ago, I made a simple calendar app in Django, and I wanted to use
Elasticsearch so users could search and find events, and to populate an
upcoming events list. There are only about 10,000 events in the database.

I quickly realized what a pain it is to use Elasticsearch, for a simple app
like mine.

Pain points:

1) You have to set up and recreate part of your database in Elasticsearch, so
you essentially end up with two databases, which you now have to keep in sync.

2) I was getting unpredictable query results from Elasticsearch, which after a
few days and much head scratching turned out to be because I was running out
of memory.

3) When a user added a new event, it was not being added to the Elasticsearch
index automatically, and I could not figure out how to do that reliably (see
the sketch after this list for roughly the per-save hook I was after). I could
only make it work by re-syncing the entire Elasticsearch index, but since I
only wanted to do that once a day, the index was next to useless for the
Upcoming Events List, and users were confused as to why their event was not
showing up. I gave up and just ended up implementing the Upcoming Events List
directly from my database in Python.

4) Elasticsearch came with some security settings not enabled by default, and
after a few months it was hacked. I had to download a new version and wasted
more time.
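
For reference, this is roughly the kind of per-save hook I was trying to get
working (a sketch, not my actual code; the model and field names are made up,
and it assumes the official elasticsearch-py client):

    from django.db.models.signals import post_save
    from django.dispatch import receiver
    from elasticsearch import Elasticsearch

    from myapp.models import Event  # hypothetical Event model

    es = Elasticsearch(["http://localhost:9200"])

    @receiver(post_save, sender=Event)
    def index_event(sender, instance, **kwargs):
        # Use the database primary key as the document id, so saving the same
        # event twice overwrites the document instead of duplicating it.
        es.index(
            index="events",
            id=instance.pk,
            document={  # 7.x clients take body= instead of document=
                "title": instance.title,
                "starts_at": instance.starts_at.isoformat(),
                "description": instance.description,
            },
        )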

I still use Elasticsearch, but only for search, and not the upcoming event
list. And I don't think it was worth the complexity that it added to my
project.

~~~
steve19
I have been put off by Elasticsearch's complexity a number of times. Can I ask
why, for searching a limited amount of text, you didn't just use Postgres'
full-text search?
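
Roughly what I mean, as a sketch (table and column names are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=calendar")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, title,
                   ts_rank(to_tsvector('english', title || ' ' || description),
                           plainto_tsquery('english', %s)) AS rank
            FROM events
            WHERE to_tsvector('english', title || ' ' || description)
                  @@ plainto_tsquery('english', %s)
            ORDER BY rank DESC
            LIMIT 20;
            """,
            ("jazz concert", "jazz concert"),
        )
        for row in cur.fetchall():
            print(row)

An expression index on the to_tsvector() call (or a stored tsvector column)
would keep this fast enough for a table of 10,000 events.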

~~~
neuronexmachina
I'm not the person you're replying to, but does Postgres nowadays have a
straightforward way to do tf-idf or BM25-style information retrieval?

~~~
leetrout
Not op but not that I know of.

I commented below - I highly recommend Xapian for small projects to test the
waters. It's the SQLite of search.

~~~
mpcjanssen
Or you could use the FTS extension of SQLite.
[https://sqlite.org/fts3.html](https://sqlite.org/fts3.html)
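
A tiny sketch using Python's standard-library sqlite3 module (this uses FTS5;
the linked page covers the older FTS3/FTS4 modules, which are queried the same
way with MATCH, and it assumes the bundled SQLite was built with FTS enabled,
which is typical):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE VIRTUAL TABLE events USING fts5(title, description)")
    con.executemany(
        "INSERT INTO events(title, description) VALUES (?, ?)",
        [
            ("Jazz night", "Live jazz at the park"),
            ("Django meetup", "Monthly Django user group"),
        ],
    )
    for (title,) in con.execute(
        "SELECT title FROM events WHERE events MATCH ? ORDER BY rank", ("jazz",)
    ):
        print(title)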

------
mistrial9
it appears in this document:

* DataDome is a security company that gets web traffic in near real-time for clients; a lot of traffic in some cases, with very specific numbers given, like daily peak loads.

* DataDome only retains records for 30 days, and the most attention is given to the most recent traffic, to detect attacks

* an Elasticsearch deployment records all of the traffic downstream from Apache Flink; a new feature added to ES this year improves the management of ES indexing, and that solved problems DataDome was having. Things are better, so write an engineering blog post!

* re-indexing is done nightly, implemented in a cloud environment that can handle the (heavy) work of rebuilding the indexes.

These numbers are impressive. Earlier criticisms of ES are being addressed,
and ES is stable and a cornerstone of the architecture. A company called
DataDome is providing real services in near real-time. Congratulations to the
team and an interesting read.

~~~
frumiousirc
I was curious about their numbers:

> Storing 50 million of events per second

> A few numbers: our cluster stores [...], 15 trillion of events

> We provide up to 30 days of data retention to our customers. The first seven
> days of data were stored in the hot layer, and the rest in the warm layer.

15e12 / 50 MHz is 3.5 days.

I guess 50 MHz is the peak ingest rate.
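
Spelling that division out:

    events_stored = 15e12        # "15 trillion of events"
    ingest_per_s = 50e6          # the headline 50M events/second
    seconds = events_stored / ingest_per_s    # 300,000 s
    print(seconds / 86400)                    # ~3.47 days, nowhere near 30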

------
FBISurveillance
I don't think writing a clickbaity title like this is fair. You just write
200k large documents per second, period. Good for you but to be blatantly
honest it's actually not a lot.

I'm not saying you shouldn't have written this post, but rather suggest you be
fair to your readers (and yourself). Otherwise you could just make up random
titles like "Writing 1 trillion log lines per second" (by storing 1,000,000
1-byte, newline-separated log lines per document).

------
outworlder
This part left me scratching my head:

> We have set “replica 0” in our indexes settings

> Now let’s assume that node 3 goes down:

> As expected, all shards from node 3 are moved to node 1 and node 2

No; there are no shards that can be moved, since the number of replicas was
set to zero and one node went down. Not sure what they are trying to explain
here.

> In order to resolve this issue, we introduced a job which runs each day in
> order to update the mapping template and create the index for the day of
> tomorrow, with the right number of shards according to the number of hits
> our customer received the previous day.

This is a very common use case (e.g. logging), but it's surprising that
Elastic has nothing to automate this.

~~~
rexer
> This is a very common use case (e.g. logging), but it's surprising that
> Elastic has nothing to automate this.

You can set an index template to be used on new indices that match a pattern,
which is a very common thing to do. It sounds like what they did was modify
the template daily, which is less common IME. It's not clear why they had to
manually create the index, though. That should happen automatically.
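
Something like this, as a sketch (the composable _index_template API from ES
7.8+; the index pattern, settings and field names are illustrative):

    import requests

    template = {
        "index_patterns": ["traffic-*"],
        "template": {
            "settings": {
                "number_of_shards": 6,
                "number_of_replicas": 0,
            },
            "mappings": {
                "properties": {
                    "timestamp": {"type": "date"},
                    "customer_id": {"type": "keyword"},
                }
            },
        },
    }

    # Any new index named traffic-YYYY.MM.DD picks these settings up
    # automatically the first time something is written to it.
    resp = requests.put(
        "http://localhost:9200/_index_template/traffic",
        json=template,
    )
    resp.raise_for_status()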

~~~
outworlder
> You can set an index template to be used on new indices that match a
> pattern, which is a very common thing to do

It is, but how can you tell in your template you want to keep shard sizes
under 50GB? You can't.

The best thing you can do (as they did) is to update the template based on
historical data, so that the new index will (hopefully) have shards under
50GB.
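
i.e. a nightly job that does something like this (the 50GB target is the one
discussed above; the numbers and the template update itself are only
illustrative):

    import math

    TARGET_SHARD_GB = 50

    def shards_for(expected_index_gb: float) -> int:
        # One shard per ~50GB of expected data, never fewer than one.
        return max(1, math.ceil(expected_index_gb / TARGET_SHARD_GB))

    # e.g. yesterday's index for a customer came to 820GB:
    print(shards_for(820))   # -> 17 shards of roughly 48GB each

and then writes that number_of_shards back into the template for tomorrow's
index.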

------
altmind
>> Each day, during peak charge, our Elasticsearch cluster writes more than
>> 200 000 documents per second

What is this 50M in the title?

~~~
gatorcode
They state each document has 250 events, 200,000 document/sec x 250
events/document = 50m events/sec.

~~~
kodyo
"Each document stores 250 events in a seperate field."

Curiouser and curiouser.

------
dazoot
Reading this reminds me of the pains of running an ElasticSearch cluster. We
just moved to Elassandra. No more red status.

~~~
PixyMisa
Hadn't seen Elassandra before. Thanks for the tip.

------
philip1209
I've seen elasticsearch clusters like this have consistency problems. Turns
out it's a problem in a security setting to have an off-by-one error.

