
Using Elasticsearch as the Primary Data Store in an ETL Pipeline - kn7
https://vlkan.com/blog/post/2018/11/14/elasticsearch-primary-data-store/
======
billoneil
> The reason PostgreSQL results were omitted is no matter what kind of
> optimization we throw at it, the benchmark always took more than 2 hours,
> regardless of partitioning, whereas MongoDB and Elasticsearch took a couple
> of minutes.

This seems like something was done incorrectly; the comparison shouldn't be
that drastic.

> Just one PostgreSQL 9.6.10 instance (shared_buffers = 128MB)

This looks way too low. The PostgreSQL docs say a good starting point for
shared_buffers is 25% of the server's memory. In this case that would be 32GB.

[https://www.postgresql.org/docs/9.1/runtime-config-resource.html](https://www.postgresql.org/docs/9.1/runtime-config-resource.html)
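For reference, applying that 25% starting point to the 128GB machine implied by the benchmark would look roughly like this in postgresql.conf (sizing is illustrative, not taken from the article):

```ini
# postgresql.conf -- hypothetical sizing for a 128 GB server,
# following the docs' 25% starting point quoted above
shared_buffers = 32GB          # instead of the 128MB default
effective_cache_size = 96GB    # planner hint; ~75% of RAM is a common rule of thumb
```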

~~~
skullborg
They set out on a 4-year journey to improve their ETL, but didn't take one
second to change a conservative global config default. I can barely take the
rest of the article seriously after a blunder like that.

~~~
kn7
See my response above, it was indeed a typo from my side. I am sorry to hear
that it spoiled the rest of the post for you.

------
JohnBooty
I'm certain I'm missing something _very_ obvious about Elasticsearch and other
NoSQL data stores.

But in my brief experience with an Elasticsearch-backed web application I
found it difficult to write integration tests for the portions of code that
dealt with ES. Thanks to ES's "eventually consistent" nature, tests would fail
intermittently because we'd be e.g. querying some data that was written to ES
but hadn't been fully persisted yet. ES gives you some "flush the data to disk
right now, please" commands but they're merely suggestions and cannot be
relied upon.

Obviously, you want to stub/mock out as many of those actual physical database
reads/writes as possible, but sometimes what you want to test _is the ES
queries themselves_ and I don't know what on earth the best practice is there.

Just to reiterate, this is an issue that any non-ACID datastore would
experience. I'm not criticizing them, I'm just sort of wondering how people
typically solve that...
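One way to see why such tests flake: Elasticsearch's search side is near-real-time, so a write only becomes searchable after a refresh. A toy model of that behavior (a didactic sketch, not the actual ES API):

```python
class NearRealTimeStore:
    """Toy model of ES's near-real-time search: writes land in a
    buffer and only become searchable after a refresh."""

    def __init__(self):
        self._buffer = {}      # indexed but not yet searchable
        self._searchable = {}  # visible to queries

    def index(self, doc_id, doc, refresh=False):
        self._buffer[doc_id] = doc
        if refresh:
            self.refresh()

    def refresh(self):
        # In real ES this happens automatically about once per second,
        # which is exactly the window where an integration test can flake.
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def search(self, doc_id):
        return self._searchable.get(doc_id)
```

A test that queries between `index` and `refresh` sees nothing, which is the intermittent failure described above.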

~~~
jacobr1
I use `refresh=True` on my insert/update/delete, which forces the writes to
complete. Then all the reads work as you would expect.

~~~
kn7
That still doesn't guarantee that a subsequent read is gonna get the latest
state.

~~~
haggy
Direct from ES documentation on the `refresh` flag: "Refresh the relevant
primary and replica shards (not the whole index) immediately after the
operation occurs, so that the updated document appears in search results
immediately."

Do you have other information about the refresh flag? Their documentation
clearly states that forcing a refresh applies to the primary and replica
shards, meaning the document will be available for query directly after the
call returns.
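For what it's worth, the REST API documents three values for the `refresh` parameter: `true` (refresh affected shards immediately), `wait_for` (block the request until the next scheduled refresh), and `false` (the default; return immediately). A small helper, purely illustrative, to make the choice explicit when building the request URL:

```python
def index_url(host: str, index: str, doc_id: str, refresh: str = "false") -> str:
    """Build a document-index URL with an explicit ?refresh value.

    Documented values:
      "true"     -> refresh affected shards immediately (costly)
      "wait_for" -> block the request until the next scheduled refresh
      "false"    -> return immediately; doc searchable after ~1s (default)
    """
    if refresh not in ("true", "wait_for", "false"):
        raise ValueError(f"unsupported refresh value: {refresh}")
    return f"{host}/{index}/_doc/{doc_id}?refresh={refresh}"
```

For tests, `wait_for` is often the better trade-off than `true`, since it gives read-your-writes visibility without forcing an extra refresh per write.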

~~~
JohnBooty
My experience was that this was not reliable. It was nearly always true, but
not always, and tests would sporadically fail some small percentage of the
time.

However, this was back in ~2015 with Elasticsearch 1.3 or so, which is of
course a now-ancient version. Perhaps things are different now.

edit: Perhaps we were using the refresh command and not the refresh flag. It
was a few years ago and I don't have access to the code any more, and my
memory may be failing here. If the refresh flag works as advertised (enforces
an index update and guarantees a consistent view of the data for the next
query, which the command did not seem to) then that of course solves my
initial problem W.R.T. writing tests.

------
simonw
If Elasticsearch is at the end of an ETL pipeline, does that mean that if
Elasticsearch gets corrupted you can rebuild it by re-running the pipeline?

If so I wouldn't call this a "primary data store", since durability isn't
critical.

The article says:

> After drafting many blueprints, we went for a Java service backed by
> Elasticsearch as the primary storage! This idea brought shivers to even the
> most senior Elasticsearch consultants hired

I'll shiver if Elasticsearch corruption irreversibly loses data, but if it can
be rebuilt from another source I don't see any problems with it at all.

~~~
kn7
We store the real-time content stream in a separate bulk storage unit (e.g.,
BigQuery) with a certain retention window, but the ETL'ed documents live only
on ES. Given that a plain event (i.e., not an ETL'ed document) is of little
value for search, I would not call the stream storage the primary storage. It
just helps us rebuild the ETL state in case of an emergency.

------
debarshri
> "Enough blaming the former engineer."

I was one of them. I don't work there anymore.

I believe it is not the actual situation at bol.com. If it is, I would be
disappointed.

Last I remember, bol.com has a really good set of ops and dev tooling on
Hadoop, HBase, Spark, Flink, etc. for scheduling and running jobs.

I don't know why they replicated data across HBase, Elasticsearch, etc.
Having read the blog, I don't see how this fits the event sourcing pattern
that bol.com was trying to implement, nor the idea of self-service BI that
they envisioned.

~~~
kn7
Hey Debarsh! First, thanks for taking the time to read such a lengthy post.

If I am not mistaken the majority of the PL/SQL glue is owned by Gert, though
you might recall better. Quite some VCS history was lost while migrating from
SVN to Git. ;-)

The reason we are "replicating" the entire data set is to 1) determine the
affected products and 2) re-execute the relevant configurations (facets,
synonyms, etc.) when making retroactive changes. (For instance, say someone
has changed the PL/SQL of the "leeftijd" facet.) Here, the storage is required
to allow querying on every field, for (1), and on id, for (2). While id-based
bulk querying is (almost) supported by every ETL source, querying on every
field is not. Hence, we "replicate" the sources on our side to satisfy these
needs. Actually, the entire point of the post was to explain this problem, but
apparently it was not clear enough.
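If it helps, those two access patterns map to different ES request shapes; sketched below with a hypothetical field name (a term query for the `_search` endpoint for (1), and an `_mget` body for (2)):

```python
def affected_products_query(field: str, value) -> dict:
    # (1) Query on an arbitrary ETL'ed field to find affected products.
    # Body for POST /<index>/_search; most ETL sources can't serve this.
    return {"query": {"term": {field: value}}}

def reexecute_fetch(ids: list) -> dict:
    # (2) Bulk fetch by id, to re-run configurations retroactively.
    # Body for POST /<index>/_mget.
    return {"ids": ids}
```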

For your remarks on event sourcing and BI, I am a little puzzled and will need
some elaboration. We do have event sourcing on our side (that is how we can
replay in case of need), and BI is not really interested in ETL data. Maybe I
misunderstood you?

I am also confused by how you relate scheduling/running PL/SQL jobs to
Hadoop, Spark, Flink, etc. Did you see the link to Redwood Explorer I shared
in the post?

~~~
barbecue_sauce
I am not Debarsh, and I am not a data engineer, but isn't the purpose of ETL
to transform data into a more accessible/palatable form for BI?

~~~
kn7
Bol has plenty of other ETL pipelines for BI. What I meant is that the data
cooked for search is not (much) of interest to BI, yet. Though we do have
other means to feed BI with search-relevant content.

~~~
debarshri
In all fairness, you are right about the Oracle stuff ingrained in bol.com. I
am not sure if I should go into detail, but the whole thing used to work like
this: maintain event states with a "versions" table, run Hadoop and Spark
jobs on them, and snapshot the latest computed state to Oracle so that they
could run BI on it.

But I understand now what you actually mean. I wouldn't call it ETL, though,
as ETL is more about prepping data for BI than cooking data for search.

Yeah, I remember they used to have Redwood for scheduling PL/SQL queries, but
I think the majority of ETL jobs for BI were in Hadoop/Spark/Flink.

Having said all that, I think it is quite some neat and cool engineering
work; I hope you guys are successful in implementing the solution.

------
buckhx
ES is currently the main data store for zagat.com, where it ends up being a
sink for a data pipeline and is more or less used as a key-value store on the
query side. It has worked OK for our current use case, but definitely came
with some pain points. Primary-key fetches were way too slow for what we
needed, especially with some in-memory joins happening, and we ended up
sticking a cache in front of ES to satisfy our performance requirements.
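That cache-in-front approach can be as simple as a read-through wrapper around the by-id fetch (a minimal sketch; `fetch` here stands in for whatever ES GET-by-id call is used, and the eviction policy is omitted):

```python
class ReadThroughCache:
    """Minimal read-through cache in front of a slow primary-key store.

    On a miss, the value is fetched from the backing store and memoized;
    subsequent reads for the same key never hit the store again.
    """

    def __init__(self, fetch):
        self._fetch = fetch   # e.g. a function wrapping GET /<index>/_doc/<id>
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._fetch(key)
        return self._cache[key]
```

A production version would bound the cache size and handle invalidation on writes, but the shape is the same.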

We had a tight deadline on implementation (3 months to extract from Google)
and chose ES in order to satisfy a kv store as well as TF-IDF corpus search.

~~~
simonw
If it's a sink from a data pipeline then presumably it's not your primary
point-of-truth data store because if Elasticsearch gets corrupted you can
rebuild it from the rest of the pipeline?

~~~
buckhx
Yep, it's a bit more symbiotic than that unfortunately, but in general most of
the data can be restored from an upstream source.

------
syastrov
Is Elasticsearch really the primary data store if they can replay events to
reconstruct the data in Elasticsearch?

It sounds like the primary data store would be the stream of events.

------
th0ma5
... If your primary data store should be Elasticsearch. Nice write up of
everything, though, for sure.

------
luord
Would love to do/see similar benchmarks on Postgres using more performant
setups. Like someone else said, the disparity is downright shocking.

~~~
kn7
Please see my reply to the "someone else" you mentioned. If there is anything
else you think I might have misconfigured in the benchmarks, I am all ears.

~~~
luord
Yeah, I saw your reply. To be honest, I was mostly fishing for the possibility
of finding a dataset with which I could perform the test myself.

