
AWS Elasticsearch: a fundamentally-flawed offering - bifrost
https://spun.io/2019/10/10/aws-elasticsearch-a-fundamentally-flawed-offering/
======
lflux
Heaven forbid you make a configuration change that triggers a blue-green
deployment and during the deploy one of the AZs runs out of that instance SKU.
Your deployment is halted until you can get hold of AWS support to unstick it
(this takes a couple of days even with enterprise support). There's no way to
know whether the AZ has capacity or not; it's something you find out during
the deploy.

The workaround AWS support proposed was to reserve 2x capacity so we wouldn't
run into this issue on subsequent deploys.

~~~
SubuSS
I am probably missing something here: if I understand what you're calling a
blue/green deploy correctly, you essentially want the ability to run 2x
capacity for at least a little time (deploy time). So why wouldn't you reserve 2x?

Or switch to an A/B-like deploy? (Deploy 5% or so, test against 5% from the
original deploy, and decide on the future.)

~~~
coder543
> you essentially want the ability to run 2x capacity for at least a little
> time

This isn't the customer's choice. The customer _does not want this._

As the article talks about, AWS Elasticsearch isn't actually elastic. On
standard Elasticsearch, you can add and remove nodes at will and it will
automatically handle rebalancing. AWS Elasticsearch can't do that. It has to
spin up a new cluster of the desired size, copy everything over, and then turn
off the old cluster. That is a form of blue/green deploy.

> So why wouldn't you reserve 2x?

Why would you _want_ to pay double all the time because AWS can't use
Elasticsearch correctly? AWS should foot the bill to ensure that everything
works properly within their broken implementation when something requires the
cluster to be duplicated and redeployed, not the customers.

> Or switch to a AB like deploy?

To reiterate: this isn't their choice. AWS forces this inefficient methodology
on users of AWS Elasticsearch, which is why the article strongly recommends
against using AWS Elasticsearch.

~~~
runamok
Right. It should be trivial, when instituting a change that would trigger an
event like this, to first check that the required instances are available.
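Where an API check is possible at all, it's only for whether the SKU is
*offered* in an AZ, not whether capacity is currently free. A hedged sketch
using boto3's `describe_instance_type_offerings` (the function name and fake
client are illustrative):

```python
def azs_offering_instance_type(ec2_client, instance_type):
    """List the AZ names where `instance_type` is offered.

    Caveat: "offered" only means the SKU exists in that AZ; AWS exposes
    no API for live capacity, which is the real problem described above.
    """
    resp = ec2_client.describe_instance_type_offerings(
        LocationType="availability-zone",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    return sorted(o["Location"] for o in resp["InstanceTypeOfferings"])


# With a real client this would look like:
#   import boto3
#   azs_offering_instance_type(boto3.client("ec2"), "r5.xlarge")
```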

------
reilly3000
It's worth noting that AWS's lack of security features for Elasticsearch
Service has been the root cause of some gigantic breaches:
[https://www.infosecurity-magazine.com/infosec/why-do-
elastic...](https://www.infosecurity-magazine.com/infosec/why-do-
elasticsearch-databases-1-1/)

~~~
jit_hacker
Lack of security? AWS offers very granular, per-index authorizations that are
tied into IAM in the same way you would configure S3 or DynamoDB. If users
are failing to implement good policies, AWS is not to blame.

~~~
mixedCase
When your authorization system is a usability shitfest, you're partly
responsible for bad usage.

~~~
bifrost
Preach on!

The only way I've really been able to understand it well is by writing code
that uses boto and seeing what errors out lol.

I've found some really interesting bugs/inconsistencies too. Nothing horrible
but its def unintuitive sometimes.
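That trial-and-error loop can be wrapped in a tiny helper; a sketch (the
helper name is mine, and it leans on the fact that botocore's `ClientError`
carries the error code in `err.response`):

```python
def probe_permission(action, *args, **kwargs):
    """Call a bound boto3 client method and report whether IAM allowed it.

    Returns (True, None) on success, or (False, error_code) when the call
    raises -- botocore's ClientError exposes the code via `err.response`.
    """
    try:
        action(*args, **kwargs)
        return True, None
    except Exception as err:
        response = getattr(err, "response", None) or {}
        code = response.get("Error", {}).get("Code", type(err).__name__)
        return False, code


# e.g. probe_permission(s3.get_object, Bucket="my-bucket", Key="x")
# would return (False, "AccessDenied") under an insufficient policy.
```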

~~~
ajb
That's the right way of doing it IMO. I've got a PoC script which finds the
minimum subset of permissions to allow some action:
[https://github.com/KanoComputing/aws-
tools/blob/master/bin/a...](https://github.com/KanoComputing/aws-
tools/blob/master/bin/aws-policy-minimize)

Haven't had time to productise it yet. I think doing this makes you quite a
bit safer, because it means you don't end up giving up and allowing more than
you need. However, you still need to understand which actions _shouldn't_ be
allowed, so it's not the whole solution.
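The experiment-driven idea can be sketched as a greedy loop; `works` is a
hypothetical callback that runs the real task under a candidate policy (in
practice each probe means updating the policy, waiting for IAM to propagate,
and retrying the operation):

```python
def minimize_actions(actions, works):
    """Greedily drop IAM actions, keeping one only if removing it breaks
    the workload.

    `works(actions)` must run the real task with exactly `actions`
    granted and return True on success (hypothetical callback).
    """
    kept = list(actions)
    for action in list(kept):
        trial = [a for a in kept if a != action]
        if works(trial):  # the task still succeeds without this action
            kept = trial
    return kept
```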

~~~
all_usernames
Netflix open sourced a similar tool that watches API calls for a Role and then
suggests minimum privilege changes to the attached policy document:
[https://github.com/Netflix/repokid](https://github.com/Netflix/repokid)

~~~
ajb
That's interesting. That can only work if there's some way of introspecting
permissions - which I didn't realise existed. Mine works by experiment. I
wonder how fine grained their way is.

------
alienreborn
Their "managed" Kafka service is an even bigger clusterfuck:

\- Doesn't expose metrics via JMX port

\- Doesn't support version upgrade

\- Doesn't support schema versioning afaik

\- Doesn't support adding a node or two to an existing cluster

I could go on...

~~~
minitoar
Do you have experience with Confluent? How does that compare?

~~~
alienreborn
Confluent Cloud is pay-by-usage (per GB in, per GB out, per GB stored), so it's
much pricier depending on your org's usage. However, it is definitely
feature-rich.

\- Doesn't expose metrics via JMX, but does provide a nice tool called
Confluent Control Center for monitoring and managing the Kafka cluster

\- Built-in and managed Schema Registry

but

\- AFAIK they won't reveal cluster size or cluster version (except client
compatibility). So they do scaling and upgrades automatically. It's a
double-edged sword, but should work for a lot of orgs.

\- Overall, definitely a better product than AWS MSK.

------
scarface74
The better question: if you need advanced ES functionality, you don't want to
manage it yourself, and AWS's offering isn't up to par, why not just use
ElasticCo's managed version that also runs on AWS?

~~~
inssein
ElasticCo's offering is no better, and has its own shortcomings. The biggest
shortcoming is that it's actually not multi-AZ. We were down 3 times in 2018,
and unfortunately had to switch to AWS ESS. The majority of their outages were
also faults of their load balancer.

We've had a few of the problems described here, but so far it's better than
being down.

~~~
webo
They don't even do multi-node for node sizes smaller than 58GB RAM. Dedicated
masters are not available until the cluster is at least 6 nodes of 58GB RAM.
Multi-master is not available.

I'm not sure if this is based on research or optimal configuration size, but
it seems very expensive to get a 3-node cluster going on ElasticCo.

Does anybody have experience with compose?
([https://compose.com/databases/elasticsearch](https://compose.com/databases/elasticsearch))

~~~
dangoldin
There's also [https://www.instaclustr.com](https://www.instaclustr.com)

We don't use them but had a pretty in depth conversation with them about Kafka
and they seemed sharp.

------
manigandham
Why is anyone using AWS Elasticsearch anyway? AWS is a first-mover and is
great when you need managed services not available elsewhere, but it's usually
not the best product.

Since most vendors have finally caught on to the demand for managed offerings,
I don't see much reason to go with AWS for services like Elasticsearch. The
cloud hosting plans directly from Elastic are much better. Easier to manage,
same cost, better performance, and more reliable. And support is included.

~~~
philliphaydon
People like myself don’t want to manage 50 services from 50 vendors. We want
to manage it all in 1 place.

~~~
manigandham
Is it actually 50 services? Or a few that come with a much better experience?
The trade-off seems clearly worth it when not exaggerated.

------
jalopy
This is great intel - sorry it had to come with such painful experience.

Does anyone know if Open Distro for ElasticSearch
([https://opendistro.github.io/for-
elasticsearch/](https://opendistro.github.io/for-elasticsearch/)) has these
problems? Or is it related to how AWS configures/maintains ES on their
platform?

~~~
ctvo
Open Distro is the Amazon hard fork mentioned [1]. The missing features will
most likely be a 1:1 issue.

The author does a disservice to their audience:

> As has happened before, Amazon took the open-source side of Elasticsearch,
> did a hard fork, and has been selling it as a hosted service, slowly
> implementing their own versions of features that have been available in one
> form or fashion in mainline Elasticsearch for years.

When what happened was that Elasticsearch changed its licensing model, after
benefiting from the open source community for years, to be more restrictive,
forcing the fork.

1 - Amazon blog post announcing fork
[https://aws.amazon.com/blogs/opensource/keeping-open-
source-...](https://aws.amazon.com/blogs/opensource/keeping-open-source-open-
open-distro-for-elasticsearch/)

~~~
phd514
There's been no license change. For years, Elastic has had a set of
commercially-licensed features on top of Apache2-licensed Elasticsearch.
Within the last year or two, they made those commercially-licensed features
source-available. Ironically enough, making the source available seems to have
prompted a bunch of claims that they changed their licensing model.

~~~
arpinum
Elastic mixed open source code with source available code, possibly as a
landmine to sue large hosting providers, like Amazon. Amazon's fork includes
removing these landmines. Elastic's hands aren't clean.

~~~
CameronNemo
I have a hard time believing that Amazon's motives are pure. Companies that
develop open source software have an incredibly hard time with profitability
-- it is no surprise that elastic wants to reduce their own workload by
maintaining a single repo for open source and source available code. The "land
mines" are cordoned off in one directory. Hard to miss that.

~~~
femto113
For me the real frustration is Elastic's close tying of client features to
server versions, making it impossible e.g. to buy the latest version of Kibana
from Elastic and run it against a server managed by Amazon.

------
bifrost
I've been following this issue for quite some time and I totally agree with
the author. I've had a lot of odd issues with AWS's productized Elasticsearch,
enough that I gave up on it entirely.

~~~
dlojudice
my experience shows that this would be even more expensive with not that good
support (very enterprise sales / customer success contact)

~~~
bifrost
I'm not sure I follow. Running Elasticsearch on your own isn't that hard, but
yeah, it's more than a couple of clicks. Now that you can use Elastic.co's
cloud solution, it's probably even easier.

~~~
outworlder
> I'm not sure I follow. Running Elasticsearch on your own isn't that hard

Hah! Please do try, at any non-trivial scale. Once you have got your battle
scars, report back.

~~~
bifrost
I had a 32-node on-prem cluster in 2014: 512 cores, 8TB of RAM, all local SSDs.

It was fine.

------
bifrost
Looks like someone from Elasticsearch picked this up and tweeted about it too.
I suspect one way or another this could get Amazon to pay attention...

[https://twitter.com/kimchy/status/1182745675617824768](https://twitter.com/kimchy/status/1182745675617824768)

~~~
nodesocket
Completely agree: use the offering from Elastic, the company. After previously
being bitten by the lack of VPC support in AWS Elasticsearch (though it's
supported now), and by blue-green deploys even for small configuration changes
(it used to be even for security policy updates) that can take 12-24 hours, I
generally don't recommend AWS Elasticsearch.

~~~
goldenkey
Since I worked for Amazon and saw how the sausage was made, I am reluctant to
ever use AWS again. Random grab bags of teams using all different tech with
different coding standards and methodology. Jerry-rigged micro-services turned
into APIs through total object instantiation rather than efficient mutable
updates.

------
forty
I like elasticsearch, but it's not really fun to manage. We are still running
latest 5.x because upgrading means we would lose access to our oldest
snapshots (unless we reindex them all). I would love it to be managed for me
but at the same time I don't really want to give access to my data to another
third party, so managed by AWS is the only option (our data is there anyway).
I was also under the impression that their ES offering is not great so for now
I'm still handling things myself.

------
time0ut
It is also wildly more expensive than just running your own on EC2.

~~~
bifrost
Most people using the cloud aren't really that cost conscious. EC2 is about 8x
the cost of bare metal...

If you factor in operational cost, AWS can be cheaper, but often isn't really.

~~~
StreamBright
>> EC2 is about 8x the cost of bare metal...

Source?

~~~
bifrost
One of my old budgets, but there are a ton of blogs out there about it that
are easily found via search engine.

~~~
StreamBright
Ok so nothing concrete.

Let me share my anecdotal evidence, then, with numbers. I migrated an on-prem
cluster of 150 nodes which ran Hadoop, Elasticsearch, and Docker apps running
the UI. We achieved a 30% saving in the year-over-year budget for the company,
which is ~600,000 USD.

This is not about EC2 vs. a Dell server in a datacenter, because that
comparison would be apples to oranges. This is the sum of all the costs
on-prem vs. the sum of all the costs in the cloud. When people compare purely
EC2 to a node running on-prem, the only thing that is 100% crystal clear is
that they do not understand how cost is structured for an infrastructure.
Quite often they forget that we need networking, electricity, and cooling in
the datacenter. They also forget that datacenter capacity cannot be given back
when not needed (auto-scaling), and a few other minor things.

This results in the conclusion that the cloud is more expensive than on-prem,
which in my experience of moving several Fortune 500s to the cloud is not
true; quite the opposite, significant cost savings can be achieved.

~~~
bifrost
I was talking about "in a datacenter", but yes, you have almost zero
elasticity with datacenter buildouts. You can get metered power, but that can
be a mixed bag since it's usually more expensive per kWh.

Most companies that "move to the cloud" also make a lot of changes to the way
they do things so they can scale up/down dynamically; that's not an
insignificant cost in development time.

If you have a static load AWS is really expensive.

Also, GPUs are still insanely bad to do in AWS/Azure. Let's say you need the
equivalent of 60 x p3.16xlarge for a whole year; that's well over a million
USD a month. You break even in month 3 in a datacenter, even with all the
overhead. Maybe some of that is my ability to get good deals, but even if you
break even in month 4, that's crazytown.

~~~
StreamBright
>> If you have a static load, AWS is really expensive.

Again, without details this is a meaningless claim. My own company's only
infrastructure is a website that "runs" on AWS using the free tier of
CloudFront and a little bit of the paid tier of S3. This is a static workload.
It is really cheap. Without adding all the details on both sides, and the
workload, you cannot claim that AWS (or for that matter any cloud vendor) is
more expensive.

~~~
bifrost
If you can find cases where it's cheap, that's great, but that's not the case
for a lot of people with static compute load.

One of my example datacenters is: private servers, access controlled, etc.
12 servers (E5-2438L, 128GB of RAM, 6TB of RAID in each, 10Gbps
interconnects, etc.)

It costs $1200/month to run. Just the storage cost in AWS is around
$3k/month. The equivalent EC2 cost is around $14k/month. It requires around
1-2 hours/month of oversight, and the costs are generally fixed except for
the bandwidth, which is billed at a fraction of AWS pricing.

------
bratao
Shameless promo (no affiliation, just a happy user who migrated from
Elasticsearch). You should check [https://vespa.ai](https://vespa.ai) \- app
containers and first-class tensor support are a blessing.

~~~
atombender
I've looked into Vespa a bit lately. It looks pretty good!

I'm a little disappointed in its data type support, though. With ES you can
throw deeply nested data structures (maps, arrays, arrays of maps, maps of
arrays, etc. ad nauseam) at it and have them be fully indexed and searchable.
But Vespa doesn't really do indexing of nested structures.

This means that if your application's schema is already dependent on such
nested data structures, you need a mapping layer that flattens your
structures. For example, if you have:

    
    
      address: {
        streetAddress: "1 Bone Way",
        city: "Boneville",
        state: "WA"
      }
    

then you have to flatten it to something like:

    
    
      {
        address_streetAddress: "1 Bone Way",
        address_city: "Boneville",
        address_state: "WA"
      }
    

And then, of course, you have to unflatten when you get the results (unless
you only use Vespa for the IDs and look up the original data in your main data
store).

Same thing with arrays. Vespa doesn't really support arrays, whereas in ES,
all attributes are technically arrays. (I.e., a "term" query/filter doesn't
distinguish between the two: {term: {foo: "bar"}} will match both documents
{foo: "bar"} and {foo: ["bar"]}.)

Another oddity is the system for updating your schema, which includes not just
data model definitions, but a whole bunch of files which you upload as a
batch. The programmatic API for updating the schema is a little impractical,
much less practical than with ES where you can just do "curl -d
@mappings.json" and you're done. Also not at all a fan of their use of XML.

Overall, Vespa feels more than a little antiquated. It's an old project, after
all. That said, I'm probably willing to deal with the warts if it's more
solid. I like that the core server is written in C++, not Java.

What has your experience been in terms of clustering? With ES you can just
boot up a bunch of nodes and, on a good day, it will self-organize into a
pretty nice and scalable setup. (On a bad day, your cluster will become "red"
for unpredictable reasons.) Is Vespa as seamless here?

~~~
mdaniel
That's my big complaint with Solr, too, which I want to like because on paper
it seems a ton more sane, but realistically the ability to throw a random JSON
document at ES -- without having to sit down and pre-define the schema -- is
invaluable.

I stopped by to check on Solr before posting this, and even their "schemaless"
document is like 8 pages long and filled with XML settings:
[https://lucene.apache.org/solr/guide/8_1/schemaless-
mode.htm...](https://lucene.apache.org/solr/guide/8_1/schemaless-mode.html)

~~~
arafalov
That's because that page explains how to turn the mode on and off, how to
fine-tune it (e.g. different date formats) and how to index formats other than
JSON. Elasticsearch does not support a good chunk of this, so no need to
document.

If you want a more streamlined version, you can check the example instead:
[https://github.com/apache/lucene-
solr/blob/master/solr/examp...](https://github.com/apache/lucene-
solr/blob/master/solr/example/films/README.txt)

And even then, it will already discuss the problem with auto-guessing the
content types, something that Elasticsearch mentions only later. Solr is just
more upfront and explicit about the issues.

Still, you do have a point, Solr documentation tries to be comprehensive
rather than ease-of-use oriented. That sometimes obscures the easy things.

------
Esthrowaway123
Used to work in AWS. ES was always a bad org with crazy attrition, none of
this surprises me.

------
hardwaresofton
Until people disentangle providing bare metal and providing the managed
services on top of the bare metal you received, it just _won't_ matter that
AWS is worse, because customers are already in the garden, and they'll pivot
their way to being better while developers suffer in silence (but pad their
resumes).

------
jillesvangurp
A couple of misconceptions about open distro that Amazon seems to be
advertising to overstate what they are doing:

1) It's not a fork. There's no such thing as an Amazon specific fork of the
Elasticsearch git repo in the opendistro github account. There are no Amazon
specific patches to Elasticsearch.

2) Instead, Amazon redistributes a vanilla OSS build of Elastic. As is and
unmodified. Producing these builds is and always has been a feature of the
elasticsearch build scripts. With every release they produce OSS binaries and
OSS docker containers (both without the closed source plugins) in addition to
the ones that include their x-pack components. All Amazon does is take those
builds and bundle their own OSS plugins.

3) What is and is not open source is clearly documented in the Elastic
repository. There is zero ambiguity here (legal or otherwise) unlike what
Amazon implies in their marketing. If it's in the x-pack directory, it may be
closed sourced (some plugins are OSS). If that bothers you, use the before
mentioned OSS builds. Everything outside the x-pack directory is OSS. OSS here
qualifies as Apache 2.0 licensed or compatible. It's that simple.

The OSS plugins that amazon provides are of course nice if you need them. Less
nice is that they seem to be perpetually several releases behind Elastic with
both the plugins and opendistro. So if you use this, you are running with
known & fixed bugs that may or may not affect you. You could argue that Amazon
maybe does a lot of testing. If so, those tests don't appear to be part of
their OSS repos. The other explanation is of course that they only update
their cloud service a couple of times a year and simply ignore bug-fixes or
even patch releases to what they shipped, other fixes and improvements that
happen upstream, etc. Or even any documentation for that (refer to the Elastic
official release notes and documentation for what was actually fixed). If
these bugs happen to affect you, you are on your own. The release notes are
"whatever Elastic said a few months ago". Refer to the Amazon release notes
here if you don't believe me: [https://opendistro.github.io/for-elasticsearch-
docs/version-...](https://opendistro.github.io/for-elasticsearch-docs/version-
history/). The last few releases were basically "bump the version number" and
absolutely nothing else that Amazon considered worth reporting.

So, if you are comfortable running that in production use it at your own
peril. I'd argue it's probably better to take the latest Elastic oss build,
fork the amazon plugins you actually need (if any) and simply bump the version
numbers to match the current elastic version. Amazon seems to do little more
than that between releases; so you are not really missing out on any
meaningful QA, support, or other stuff Amazon implies they are doing that they
are clearly not doing.

~~~
busterarm
> 3) What is and is not open source is clearly documented in the Elastic
> repository. There is zero ambiguity here (legal or otherwise) unlike what
> Amazon implies in their marketing. If it's in the x-pack directory, it may
> be closed sourced (some plugins are OSS). If that bothers you, use the
> before mentioned OSS builds. Everything outside the x-pack directory is OSS.
> OSS here qualifies as Apache 2.0 licensed or compatible. It's that simple.

While this statement is true, it is not clearly documented in their actual
documentation! For many years the pricing model around X-Pack was incredibly
opaque, and the documentation did everything possible to encourage you to use
it, while the warnings about licensing issues were buried deep in the
appendices.

I certainly didn't read elasticsearch's source tree when learning how to
operate it -- I started in the docs like most everyone else.

Also, your super responsible sysadmins probably aren't pulling elastic's
source code to run on your servers, they're using the distro packages (which
do mix free and non-free) which is also what the docs tell you to do.

~~~
jillesvangurp
They've long addressed all of that. There are helpful x-pack tags on the
documentation for features that aren't in the OSS release. Check here for
example on the documentation page for index life cycle management:
[https://www.elastic.co/guide/en/elasticsearch/reference/curr...](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-
lifecycle-management.html)

The pricing model for their platinum features is indeed opaque (as in most
small companies can't afford this, and you'd have to talk to a sales rep to
find out). Those features are also clearly marked. X-pack features are free to
use. Also, if you try to use these features without the proper license key, it
won't work for obvious reasons. There's zero risk of using this accidentally
without first agreeing to some license.

As for the repo, if you bother to open:
[https://github.com/elastic/elasticsearch/blob/master/LICENSE...](https://github.com/elastic/elasticsearch/blob/master/LICENSE.txt),
it spells it out in 10 lines of text. They've iterated on this a bit but this
was always the place where they clearly outlined what is what.

Also, each single source file includes details on how it is licensed. There's
zero chance of a developer not seeing that if they are preparing some code
patch. This is intentional; it's not optional to document stuff like this if
you are serious about enforcing your copyright; which of course they are.

Sysadmins pulling elasticsearch from a linux distro repo of course happens.
Presumably they'd be getting an OSS build and not bundle proprietary
components because they tend to care about not shipping proprietary code. If
you go to the Elastic download pages, there are convenient links to both.

------
romski
Granted the rebalancing is a real missing feature but what if you had used EBS
volumes for data storage, to grow the disk if one node approaches its limit?
I'm guessing the author is using local SSD/NVMe to get the most query
performance but that does come at the cost of flexibility.

Of course EBS is not as fast, and I'm sure others have run into availability
issues, but one has to look at their requirements; in a fast changing
environment when data needs are not certain having the ability to expand with
a push-button is powerful.

~~~
DominoTree
Our experience was that the more data you store, the more IOPS you need to
search through it at a reasonable pace, and with a large dataset things can
really start to crawl.

That said, it's worth pointing out that EBS is the default data store in AWS
Elasticsearch, and for people without a ton of data it might actually end up
working "as intended"

------
cloakandswagger
The whole shard rebalancing problem is solved if you have an acceptable
disaster recovery solution in place, which I'm assuming the author doesn't,
given they're complaining about having to keep around the raw copy of their
data to be indexed.

My team has an automated workflow that runs once a week. It creates a new
cluster, re-indexes from source, starts taking customer traffic and then
deletes the old cluster. The shards stay balanced, and we can recover from a
total cluster failure within about 6 hours.

~~~
DominoTree
Can I ask how large your clusters are and how many records you’re reindexing?
I haven’t been able to scale this to a cluster taking in several TB a day

------
rawoke083600
Ja, never liked the offering (ok, fine, not much reason given). I also just
spun up an instance and installed Solr. Lol, until my new team decided we
needed 3 Solr instances in 2 Docker containers (master/slave)... Lol, wasted
2 weeks.

------
taf2
Yeah, unless your database is not a critical component of your application
(like an experiment, or a feature that doesn't have to be available), you
should almost always self-operate. It pays to have control of your data.

------
ignoramous
Some valid points and some relevant real-world AWS support nightmare
scenarios, though I think there is a chance the author might be wrong about a
few things, or maybe I misunderstood them. My 2 cents:

> _Amazon’s implementation is missing a lot of things like RBAC and auditing._

Open-distro (which AWS uses for elasticsearch deployments) supports this:
[https://opendistro.github.io/for-
elasticsearch/features/secu...](https://opendistro.github.io/for-
elasticsearch/features/security.html)

> _Shard rebalancing, a central concept to Elasticsearch working as well as it
> does, does not work on AWS’s implementation._

Not sure why the author says AWS doesn't support it, but I have seen that it
does rebalance shards just like vanilla Elasticsearch would. In fact, it
wouldn't rebalance only when the shard allocator is unable to find a suitable
home for the unassigned shards (and that's vanilla behaviour, iirc):
[https://aws.amazon.com/blogs/opensource/open-distro-
elastics...](https://aws.amazon.com/blogs/opensource/open-distro-
elasticsearch-shard-allocation/)

> _...if a single node in your Elasticsearch cluster runs out of space, the
> entire cluster stops ingesting data, full stop. Amazon’s solution to this is
> to have users go through a nightmare process of periodically changing the
> shard counts in their index templates and then reindexing their existing
> data into new indices, deleting the previous indices, and then reindexing
> the data again to the previous index name if necessary._

I think the author should employ alerts for cluster health
[https://docs.aws.amazon.com/elasticsearch-
service/latest/dev...](https://docs.aws.amazon.com/elasticsearch-
service/latest/developerguide/cloudwatch-alarms.html) or write their own
[https://github.com/opendistro-for-
elasticsearch/alerting](https://github.com/opendistro-for-
elasticsearch/alerting), and definitely read about best practices for running
petabyte-scale clusters on AWS (I am sure they've read about it already,
given they're in touch with SMEs and TAMs):
[https://aws.amazon.com/blogs/database/run-a-petabyte-
scale-c...](https://aws.amazon.com/blogs/database/run-a-petabyte-scale-
cluster-in-amazon-elasticsearch-service/)
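For the disk-space case specifically, a hedged sketch of such an alert via
boto3's `put_metric_alarm` on the `AWS/ES` `FreeStorageSpace` metric; note
this metric is cluster-wide rather than per-node, and the alarm name and
threshold below are illustrative, not from the article:

```python
def low_storage_alarm(cloudwatch, domain, account_id, threshold_mb=20480):
    """Sketch: alarm when a domain's FreeStorageSpace (cluster-wide, in
    MB) drops to or below a threshold.  Takes any CloudWatch-like client
    so it can be exercised without AWS credentials.
    """
    cloudwatch.put_metric_alarm(
        AlarmName=f"{domain}-low-free-storage",
        Namespace="AWS/ES",
        MetricName="FreeStorageSpace",
        Dimensions=[
            {"Name": "DomainName", "Value": domain},
            {"Name": "ClientId", "Value": account_id},
        ],
        Statistic="Minimum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=float(threshold_mb),
        ComparisonOperator="LessThanOrEqualToThreshold",
    )
```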

> _Hope you had a backup of what you needed to dump._

Amazingly, AWS Elasticsearch does automated hourly backups and retains them
for 14 days, for free: [https://aws.amazon.com/about-aws/whats-
new/2019/07/amazon-el...](https://aws.amazon.com/about-aws/whats-
new/2019/07/amazon-elasticsearch-service-increases-data-protection-with-
automated-hourly-snapshots-at-no-extra-charge/)

> _The second option is to add more nodes to the cluster or resize the
> existing ones to larger instance types._

AWS Elasticsearch doesn't yet scale out (change the instance count) without
resorting to a blue-green deployment. They should have implemented that by
now, like they did for policy updates: [https://aws.amazon.com/about-aws/whats-
new/2018/03/amazon-el...](https://aws.amazon.com/about-aws/whats-
new/2018/03/amazon-elasticsearch-service-now-supports-instant-access-policy-
updates/) I hope fixing this is on their roadmap.

\----

Also, I believe the real problem with any managed Elasticsearch offering is
that end-users still have to worry about the servers, as it isn't truly
hands-off in the way that Lambda or DynamoDB are. This is complicated by the
fact that Elasticsearch exposes innumerable ways to configure cluster and
index setups (read: shoot yourself in the foot).

I guess AWS Elasticsearch needs something like an Aurora Serverless Data API,
as the current offering takes too much control away from power users (you
can't SSH into the nodes to fix anything at all, and the constant reliance on
the oft-incompetent AWS support to do the firefighting, while having to
frustratingly wait on the fringes with little to no transparency, is a big
red flag):
[https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/data-
api.html)

With perf-analyzer and automatic-index-management included in open-distro,
they might already be half-way there:
[https://news.ycombinator.com/item?id=19361847](https://news.ycombinator.com/item?id=19361847)

~~~
DominoTree
Couple things -

It's worth pointing out that AWS Elasticsearch is not simply the open distro -
and a lot of things you see in the open distro are not currently available on
AWS Elasticsearch. Beyond that, many features are simply forcibly disabled in
Amazon's offering, just as many cluster settings and APIs are untouchable in
the AWS service (even read-only ones that would be super helpful).

I still can't touch any of the rebalancing settings on my clusters and
everything looks forcibly disabled. If rebalancing worked as expected, the
whole blue-green thing shouldn't be necessary, and over time, I wouldn't
generally end up with a single full data node while every other node in the
cluster has 300GB free. Am I missing something?
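For context, on self-managed Elasticsearch these are ordinary dynamic cluster
settings. A minimal sketch of the call that stock ES accepts (the setting
names are from upstream Elasticsearch; whether your AWS domain accepts it is
exactly the problem described above):

```python
import json

def rebalance_settings_body(enable="all", concurrent=2):
    """Payload for PUT _cluster/settings that turns shard rebalancing on.

    "all" enables rebalancing for every shard type; the
    cluster_concurrent_rebalance cap limits how many shards relocate at
    once so rebalancing can't swamp the cluster.
    """
    return {
        "persistent": {
            "cluster.routing.rebalance.enable": enable,
            "cluster.routing.allocation.cluster_concurrent_rebalance": concurrent,
        }
    }

# On stock ES you'd PUT this body to http://<cluster>:9200/_cluster/settings;
# on the AWS service these settings are simply not accepted.
print(json.dumps(rebalance_settings_body(), indent=2))
```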

None of the CloudWatch alarms you linked to have much relevance to the issues
in the article (other than the ClusterIndexWritesBlocked alarm, which will
only start firing after everything breaks). As of the last time I looked, you
cannot monitor disk space on individual nodes in CloudWatch, only on the
cluster as a whole. Alerting on a single node starting to fill up is basically
the one alert that would let me know things are about to be in a bad state.
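A hedged sketch of the usual workaround: since CloudWatch only exposes
cluster-wide free storage, poll the cluster's own allocation stats and alert
per node. This assumes `GET _cat/allocation?format=json&bytes=b` is reachable
on your domain (not guaranteed, given how many APIs AWS locks down):

```python
# Rows from _cat/allocation with bytes=b carry byte counts as strings.
def nodes_low_on_disk(allocation_rows, min_free_gb=50):
    """Return names of data nodes whose free disk is below min_free_gb."""
    low = []
    for row in allocation_rows:
        avail = row.get("disk.avail")
        if avail is None:  # dedicated master nodes report no disk figures
            continue
        if int(avail) < min_free_gb * 1024 ** 3:
            low.append(row["node"])
    return low

# Rows shaped like the API response (other fields omitted):
sample = [
    {"node": "data-0", "disk.avail": str(10 * 1024 ** 3)},   # 10 GB free
    {"node": "data-1", "disk.avail": str(300 * 1024 ** 3)},  # 300 GB free
]
print(nodes_low_on_disk(sample))  # ['data-0']
```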

Their service seems to work well for small implementations that use EBS-backed
storage, and I bet that's what most of their customers are using, but I'm
running 60+ node clusters and the problems only seem to be worse as capacity
goes up.

Someone in the comments here mentioned they destroy and rebuild their cluster
weekly just to keep the shards balanced. How ridiculous is it for that to be
the best option offered?

------
truth_seeker
PostgreSQL full-text search is good enough. Also, PG12 recently shipped many
perf optimizations (table partitioning, indexing, vacuum) which will help in
this regard.

[http://rachbelaid.com/postgres-full-text-search-is-good-
enou...](http://rachbelaid.com/postgres-full-text-search-is-good-enough/)

Text search functions - [https://www.postgresql.org/docs/12/functions-
textsearch.html](https://www.postgresql.org/docs/12/functions-textsearch.html)

~~~
bpicolo
If you need basic text filtering and ranking, ES is overkill. But when you
need more powerful ranking, highlighting, etc., Postgres isn't sufficient.

~~~
truth_seeker
Nope. PG12 is just fine for any kind of ranking and highlighting. Of course,
you may have to use extensions like Citus DB to scale horizontally.
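For what it's worth, the ranking and highlighting in question are one query in
stock PostgreSQL. A sketch (the `articles` table and its precomputed tsvector
column `tsv` are hypothetical; the functions themselves are built in):

```python
# Not executable here without a PG server: a plain PostgreSQL query showing
# built-in ranking (ts_rank_cd) and hit highlighting (ts_headline).
QUERY = """
SELECT title,
       ts_rank_cd(tsv, query)              AS rank,     -- relevance ranking
       ts_headline('english', body, query) AS snippet   -- hit highlighting
FROM articles,
     websearch_to_tsquery('english', %(q)s) AS query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT 10;
"""
```

With a driver like psycopg2 you'd execute it with a parameter dict such as
`{"q": "user search terms"}`.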

------
greatjack613
TBH this sounds like growing pains. I would imagine that with the might of
Amazon engineering behind the hard fork, it will eventually surpass the
original.

Question is: when?

~~~
__blockcipher__
Don’t bet on it. AWS Elasticsearch is really as bad as this article indicates.

~~~
tedivm
Yeah, we ran a minor upgrade on our cluster earlier this week and it knocked
out the entire cluster for over two hours, and we were getting AWS-specific
errors regarding hostname headers that we'd never seen before. I managed to
get a developer advocate on Twitter to lend us a hand, but if I had actually
waited on support, things would likely still be down.

Unfortunately that seems to be a trend with AWS: there are a lot of new
services that feel like they're 80% ready for production, but which are being
sold as complete solutions.

~~~
journalctl
Finishing things isn’t cool or sexy, don’tchaknow! Gotta get that MVP out the
door and move onto the next project. It’s fine, you can iterate on it after
re:Invent!

------
gshakir
Given the lawsuit going on between Elastic and AWS in the background, can
someone confirm whether shard rebalancing is a flaw in the AWS offering?

~~~
DominoTree
It exists in mainline Elasticsearch as well as the open fork and appears to
exist in AWS's offering as well - AWS appears to have forcibly disabled it
across the board for unknown reasons. I think it's probably an issue with the
overall back-end architecture/implementation of their managed service.

------
pearjuice
Could be entirely translated as "This very generic offering doesn't suit my
very specific demand".

------
peterwwillis
Some would say Elasticsearch is fundamentally flawed.

~~~
tlynchpin
If someone would say that on HackerNews then they would expect to hear
"citation needed".

Anecdotally what I hear is a bunch of bitching and moaning about ES yet it
clearly does work and has generally all of the difficulties of any CAP
problem. This indicates to me that ES is addressing a Hard Problem and to the
extent that it is long lived and quite popular, it's likely not substantially
worse than any reasonable alternative.

Please tell us what you view as ElasticSearch's fundamental flaws and give
some proposed alternatives, either as revisions to ES or as entirely different
solution components.

~~~
jng
It is not necessary to have a valid alternative to validly declare something
as fundamentally flawed.

ElasticSearch is as brittle as you can get. If you don't size Java heaps
properly, nodes crash all the time and uncontrollable, ultra-expensive shard
relocation happens. Their openly available monitoring tools have the nice side
effect of overloading the cluster and bringing it down (!). The result of it
being a hodgepodge of repurposed Java-based Lucene shows in poor performance
and very poor stability.
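For reference, the heap-sizing footgun described here is the standard one, and
the usual self-managed mitigation is a fixed heap pinned below the
compressed-oops threshold. A sketch of a `config/jvm.options` fragment (the
64 GB node size is an assumed example):

```
## config/jvm.options (assumes a ~64 GB data node)
## min and max heap must match so the heap never resizes under load
-Xms30g
-Xmx30g
## keep the heap at or below ~half of RAM and under ~31 GB so the JVM
## retains compressed object pointers
```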

I've spent many a weekend trying to bring up a fallen ElasticSearch cluster,
in some cases brought down just by monitoring. We had a use case that wasn't
that easy, but not massive (hundreds of thousands of concurrent users, not
concurrent millions), and a properly developed C++ or even Python distributed
solution would be more than able to handle it quite easily (source: I ended up
having to write one myself, and it didn't require anything massive to handle
the load properly).

Frankly I admire Elastic because I have no idea how you can turn such a piece
of software into ~$90MM yearly revenues, and, mainly, how you can turn that
~$90MM yearly revenue into a publicly traded company with a nearly $7bn market
cap. So much to learn from them!

~~~
tlynchpin
This is what I'm talking about wrt bitching and moaning. In summary: you tried
to use ES but didn't rtfm, or didn't know about jvm tuning, or didn't scale
test, and found out the weekend is a bad time to come up to speed on those;
you had a bad time several times, plus you slashdotted yourself with
monitoring. Then you did a custom implementation for your vertical use case,
which didn't have the rtfm problem because you wrote it, but also only
satisfied your case as opposed to the wide applicability of ES. Ultimately,
cool story bro, because ES is freely available for anyone to use (many people
do this) and modify (some people do this too), and your alternative is
unknown.

What are the fundamental flaws of ES and what alternatives avoid those flaws,
or how do you propose ES could address those flaws?

For example:

\- "Algolia is so much better because it is a managed service." (hey whatsup
ycombi)

\- "Solr is also lucene but necessarily requires significant customization to
the workload which avoids the common ES problem of it appearing to work so
well out of the box that people neglect the details until it becomes an
incident."

\- "ES fundamental flaw is that zen disco mcast nonsense, people please stop
being clever using mcast it never works in practice because igmp snoop". (hey
whatsup we out here using ES since a while now)

------
jit_hacker
This article amounts to someone whining because they didn't take the necessary
action to prevent a dumpster fire. AWS's Managed Elasticsearch has tradeoffs,
and you should understand them before choosing it, but AWS is not to blame if
you've under-provisioned your cluster and imbalanced your shards.

------
tus88
The very, very poor tradesman blames the tool.

> However, so many fundamental features are either disabled or missing in AWS
> Elasticsearch that it’s exacerbated almost every other issue we face.

No, your _choice_ to use AWS ElasticSearch is compounding all the other poor
decisions you have already made (and admitted to). This is just another one of
them. Petabyte-scale ElasticSearch clusters are approaching edge-case usage
scenarios, and ones most people would expect an average SaaS solution not to
be optimized for.

You could have spun up - and managed - your own cluster on EC2 but instead
decided to make bad decisions probably without any research or beta testing,
got badly burned as a result, and are now trying to unload your sour grapes on
to the interwebs.

~~~
outworlder
> The very, very poor tradesman blames the tool.

But a good tradesman will not use inferior tools.

~~~
tus88
A good tradesman uses the right tool for the task at hand.

