AWS Elasticsearch: a fundamentally-flawed offering (spun.io)
278 points by bifrost on Oct 11, 2019 | 150 comments

Heaven forbid you make a configuration change that triggers a blue-green deployment and, during the deploy, one of the AZs runs out of that instance SKU. Your deployment is halted until you can get hold of AWS support to unstick it (this takes a couple of days even with enterprise support). There's no way to know in advance whether the AZ has capacity; you only find out during the deploy.

The workaround AWS support proposed was to reserve 2x capacity so we wouldn't run into this issue on subsequent deploys.

> Heaven forbid you make a configuration change that triggers a blue-green deployment

The main usability problem is that they don't tell you when that will be the case. It used to be the case when scaling up instances which was completely surprising. It's not the case anymore for scale up, so it's improving.

We bitch and moan about Google labeling everything "beta" (or even "alpha"), but the beta moniker would be appropriate here.

> The workaround AWS support proposed was to reserve 2x capacity so we wouldn't run into this issue on subsequent deploys.

That's pretty lame of them to make you pay them more money instead of functioning properly.

That's exactly what we told our TAM, SA and the AWS ES PM cc'd on that thread.

Did they ever give you an explanation as to why it happened?

1. We changed a configuration parameter, which we thought would be a no-op but caused a deploy

2. We ran a fairly large cluster with a large number of i3.xlarge instances

3. One AZ in us-west-1 didn't have enough capacity of that SKU. The other AZ did, but this doesn't help if you have a multi-AZ deployment.

4. We switched to EBS-based instances after this

Reach out to your TAM again to talk about CR and RI combinations which can help to mitigate this problem.

Migrating to a different hosting platform than AWS for ES also mitigates this problem, which is more likely in our case.

I am probably missing something here: if I understand what you're calling a blue/green deploy correctly, you essentially want the ability to run 2x capacity for at least a little time (deploy time). So why wouldn't you reserve 2x?

Or switch to an A/B-like deploy? (Deploy to 5% or so, test against 5% from the original deploy, and decide on the future.)

> you essentially want the ability to run 2x capacity for at least a little time

This isn't the customer's choice. The customer does not want this.

As the article talks about, AWS Elasticsearch isn't actually elastic. On standard Elasticsearch, you can add and remove nodes at will and it will automatically handle rebalancing. AWS Elasticsearch can't do that. It has to spin up a new cluster of the desired size, copy everything over, and then turn off the old cluster. That is a form of blue/green deploy.

> So why wouldn't you reserve 2x?

Why would you want to pay double all the time because AWS can't use Elasticsearch correctly? AWS should foot the bill to ensure that everything works properly within their broken implementation when something requires the cluster to be duplicated and redeployed, not the customers.

> Or switch to an A/B-like deploy?

To reiterate: this isn't their choice. AWS forces this inefficient methodology on users of AWS Elasticsearch, which is why the article strongly recommends against using AWS Elasticsearch.

Right. It should be trivial when instituting a change that would trigger an event like this to take inventory that the required instances are available.

Reserving 2x costs 2x.

Doing a blue/green deployment costs about 1.003x, assuming a 5-minute switchover once a day. Being able to provision an extra server instance for a few minutes is kinda the whole point of moving to "the cloud".
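For anyone who wants to check that figure, here is a back-of-the-envelope sketch. It assumes the duplicate cluster is billed per minute, only for the switchover window, and at the same rate as the original; the function name is made up for illustration and the billing model is an assumption, not something AWS documents for this service.

```python
def blue_green_cost_multiplier(switchover_minutes: float, deploys_per_day: float = 1.0) -> float:
    """Relative cost of running a duplicate cluster only during switchovers.

    Assumption (hypothetical billing model): the duplicate is billed per
    minute at the same rate as the original cluster.
    """
    minutes_per_day = 24 * 60
    extra_fraction = (switchover_minutes * deploys_per_day) / minutes_per_day
    return 1.0 + extra_fraction


if __name__ == "__main__":
    # One 5-minute switchover per day.
    print(f"{blue_green_cost_multiplier(5):.4f}x")
    # Permanently reserving a second cluster is the 24-hour case.
    print(f"{blue_green_cost_multiplier(24 * 60):.4f}x")
```

The 5-minute case works out to 5/1440 of a day of extra capacity, which is where the roughly 1.003x comes from; the always-reserved case is exactly 2x.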

OK, I mixed up some GCP terms with AWS ones then. At least in GCP you can 'reserve' stuff that you want to use and pay only for what you actually use. I am assuming they cap the reserve; I have yet to run into a situation where the reserved cores were unavailable.

I remember doing capacity planning like that for DynamoDB, but Elasticsearch might be different.

It's worth noting that AWS's lack of security features for Elasticsearch Service has been the root cause of some gigantic breaches: https://www.infosecurity-magazine.com/infosec/why-do-elastic...

Lack of security? AWS offers very granular, per-index authorizations that are tied into IAM in the same way you would configure S3 or DynamoDB. If users are failing to implement good policies, AWS is not to blame.

When your authorization system is a usability shitfest, you're partly responsible for bad usage.

Preach on!

The only way I've really been able to understand it well is by writing code that uses boto and seeing what errors out lol.

I've found some really interesting bugs/inconsistencies too. Nothing horrible but it's def unintuitive sometimes.

That's the right way of doing it IMO. I've got a PoC script which finds the minimum subset of permissions to allow some action: https://github.com/KanoComputing/aws-tools/blob/master/bin/a...

Haven't had time to productise it yet. I think doing this makes you quite a bit safer, because it means you don't end up giving up and allowing more than you need. However, you still need to understand which actions shouldn't be allowed, so it's not the whole solution.
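A minimal sketch of the "by experiment" idea, not a reproduction of the linked script: start from a permission set known to be sufficient and greedily drop each permission the action turns out not to need. `action_succeeds` is a hypothetical callback; a real probe would apply the candidate policy and attempt the actual call, which this sketch deliberately does not do.

```python
from typing import Callable, FrozenSet, Iterable, Set


def minimise_permissions(
    candidates: Iterable[str],
    action_succeeds: Callable[[FrozenSet[str]], bool],
) -> Set[str]:
    """Greedily shrink a sufficient permission set to a minimal one.

    `action_succeeds(perms)` is a hypothetical probe returning True when the
    action works with exactly `perms` granted. Assuming access is monotonic
    (granting more never breaks the action), removing any single permission
    from the result breaks the action. It is not guaranteed to be the
    smallest possible set overall.
    """
    needed = set(candidates)
    if not action_succeeds(frozenset(needed)):
        raise ValueError("the starting permission set is not sufficient")
    for perm in sorted(needed):
        trial = needed - {perm}
        if action_succeeds(frozenset(trial)):
            needed = trial  # the action did not need this permission
    return needed


if __name__ == "__main__":
    # Stand-in for a real probe: pretend the action needs exactly these two.
    required = {"es:ESHttpGet", "es:ESHttpPost"}
    start = ["es:ESHttpGet", "es:ESHttpPost", "es:ESHttpDelete", "es:ESHttpPut"]
    print(sorted(minimise_permissions(start, lambda perms: required <= perms)))
```

It needs one probe per candidate permission, and, as noted above, it only tells you what an action needs, not which actions should be allowed in the first place.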

That's awesome!

That said, if a customer has to fuzz a platform's settings to discern their effect, the UX definitely needs work.

Netflix open sourced a similar tool that watches API calls for a Role and then suggests minimum privilege changes to the attached policy document: https://github.com/Netflix/repokid

That's interesting. That can only work if there's some way of introspecting permissions - which I didn't realise existed. Mine works by experiment. I wonder how fine grained their way is.

Ooooh I gotta check that out!

If you don't have prior experience in networking or permissions, it will take some time for you to understand the concepts to properly secure a standalone server. The same concept applies to AWS. You are paying for the hardware, not someone to hold your hand through the process.

And if you can't figure security out by yourself, pay someone to hold your hand.

>And if you can't figure security out by yourself, pay someone to hold your hand.

This. Security is as much a tradition as it is a set of technologies. It's better to learn from a master than from a costly mistake, and it's better to learn how to do it rather than to pay to have it done for you.

I feel like this attitude (which is very common) holds us back from developing more reliable systems. When something fails, we don't ask what we can do to improve the system; instead we point the blame at users. It's the easy way out: instead of designing better systems, we just tell the user to 'do better next time'.

The more difficult a system is to use properly, the more we should demand an alternative. If your users keep making the same mistake over and over again, then at some point you have to start asking yourself what you need to improve.

On the other hand, I always go back to something my grandpa (who himself was an industrial engineer) used to say: "It is impossible to make things foolproof, because fools are always so ingenious."

Unlike every other AWS managed service I had used up to that point, when I was using Amazon ES a ~year ago there was no integration with any sort of VPC offering, and there was no clear published guide on how to establish such a connection. I ended up doing so with a hacky bastion-based architecture, but most other teams I saw using ES at the time just didn't bother.

I don't think you're wrong, it's a complete offering, and yet:

- If you want to ingest data with Kinesis Firehose, you can't deploy the cluster in a VPC.

- You can enable API access to an IP whitelist, IAM role or an entire account. You can attach the policy to the resource, or to an identity, or call from an AWS service with a service-linked role. That's all good, perhaps a little complex but as you said, nothing too different from S3 or DynamoDB, except for the addition of IP policy. Why not security groups? Is DENY worth the added complexity?

- However, you can't authenticate to Kibana with IAM as a web-based service. Recently they added support for Cognito for Kibana; otherwise one would have to set up a proxy service and whitelist Kibana to that proxy's IP, then manage implementing signed IAM requests if you want index-level control. Cognito user pools can be provisioned to link to a specific role, but you can't grant multiple roles to a user pool, so you have to create a role and user pool for every permutation of index access you want to grant. You also have to delegate ES cluster access to Cognito, and deploy them in the same region.

All told, even a relatively simple but proper implementation of ES+Kibana with access control to a few indexes using CloudFormation or Terraform would require at least a dozen resources, and at least a day of a competent developer's time researching, configuring, and testing the deployment. Probably more to get it right.

Ultimately there is nothing wrong with the controls AWS provides, but plenty that can go wrong with them.
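For illustration, the IP-whitelist flavour of a resource-based policy mentioned above looks roughly like the document this sketch builds. The helper name is made up, and the account ID, region, domain name and CIDR are placeholders; treat the exact shape as my understanding of the usual IAM policy format rather than a verified copy of the AWS docs.

```python
import json
from typing import Any, Dict, List


def ip_restricted_domain_policy(domain_arn: str, allowed_cidrs: List[str]) -> Dict[str, Any]:
    """Build a resource-based policy allowing ES HTTP calls only from given CIDRs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "es:ESHttp*",
                # Sub-resources of the domain (indices, APIs) live under "/*".
                "Resource": f"{domain_arn}/*",
                "Condition": {"IpAddress": {"aws:SourceIp": allowed_cidrs}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN and CIDR; substitute your own values.
    arn = "arn:aws:es:us-west-1:123456789012:domain/example-domain"
    print(json.dumps(ip_restricted_domain_policy(arn, ["203.0.113.0/24"]), indent=2))
```

Note that this one statement gives anyone behind those addresses full HTTP access to the domain, which is part of why it is easy to get wrong.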

For the curious:

- https://aws.amazon.com/blogs/security/how-to-control-access-...

- https://docs.aws.amazon.com/elasticsearch-service/latest/dev...

> - If you want to ingest data with Kinesis Firehose, you can't deploy the cluster in a VPC.

why not?


Check out the 2nd to last line of that post. They make the same statement in the docs. Lots of services are getting VPC endpoints so traffic never has to hit the public web, but Firehose isn't one of them (yet).

> Reports usually involve instances where individuals or organizations have actively configured their installations to allow unauthorized and authenticated users to access their data over the internet.

Why is it a problem to let authenticated users access data through the internet?

Their "managed" Kafka service is an even bigger clusterfuck:

- Doesn't expose metrics via JMX port

- Doesn't support version upgrade

- Doesn't support schema versioning afaik

- Doesn't support adding a node or two to an existing cluster

I could go on...

Yes it's pretty basic. Last I looked it didn't have interbroker SASL, no TLS, can't expand node storage, and the API didn't have configurable broker parameters ...?

Also Kafka and Zookeeper are just about the last things I'd want sealed away in a black box.

Do you have experience with Confluent? How does that compare?

Confluent Cloud is pay by usage (per GB in, per GB out, per GB stored), so it can be much pricier depending on your org's usage. However, it is definitely feature rich.

- Doesn't expose metrics via JMX, but does provide a nice tool called Confluent Control Center for monitoring and managing a Kafka cluster

- Built-in and managed Schema Registry

- AFAIK they won't reveal cluster size or cluster version (except client compatibility), so they do scaling and upgrades automatically. It's a double-edged sword but should work for a lot of orgs.

- Overall, definitely a better product than AWS MSK.

Considering that Confluent actually develops the Open Source Kafka project, I expect these problems to be AWS only. Go for the real thing and don't give Bezos money for a crippled product.

A similar article for Kafka would be great :)

The better question: if you need advanced ES functionality, you don't want to manage it yourself, and AWS's offering isn't up to par, why not just use ElasticCo's managed version that also runs on AWS?

For us it was roughly 2x the cost for the same instance sizes. Granted you get a lot more functionality, but in our case we didn't need it, so it didn't make sense for us to stay on elastic.co. I've been running 15ish ES domains on AWS (multiple regions) for the past 3 years. Our API is microservices based, so we split up our Elasticsearch clusters so we could scale each microservice independently. Yes, it's a lot to manage, but blue-greens are quicker with the smaller domains, and once you automate snapshots and index cleanup it's not all that bad.

That's really the crux of why the AWS offering is "meh". It just needs to be good enough because they don't need to make any extra margin on it. elastic.co has to make money in addition to channeling money Amazon's way.

ElasticCo's offering is no better, and has its own shortcomings. The biggest shortcoming is that it's actually not multi-AZ. We were down 3 times in 2018, and unfortunately had to switch to AWS ESS. The majority of their outages were also faults of their load balancer.

We've had a few of the problems described here, but so far it's better than being down.

Elastic Cloud is a train wreck.

We found a bug in their Web UI that caused the RAM slider to slide but not change the value it submitted.

We tried to upgrade the RAM to the max, but the field defaults to smallest. Slider moved, but value didn’t change. Upgraded a 192GB cluster to 1GB. Shit blew up. Took three days for support to respond.

They don't even do multi-node for node sizes smaller than 58GB RAM. Dedicated master is not available until the cluster size is at least 6 nodes of 58GB RAM. Multi-master is not available.

I'm not sure if this is based on research or optimal configuration size, but it seems very expensive to get a 3-node cluster going on ElasticCo.

Does anybody have experience with compose? (https://compose.com/databases/elasticsearch)

There's also https://www.instaclustr.com

We don't use them but had a pretty in depth conversation with them about Kafka and they seemed sharp.

Fully agree. Rather than digressing, change the way you operate, or adhere to standards. There is no magic box for all your asks, no cloud company does this. AWS is pretty darn good with customers IMHO. Companies have to trade off stability, elasticity, features. This is coming from someone who has managed a billion users, and has had to punt features for the other 2.

That probably will eventually be what most people do if they keep running into this.

We're happy with http://bonsai.io, which in my experience has really good support.

I don't use bonsai but had a detailed conversation with their tech a while ago, they seemed really sharp and tuned on customer success.

Not really sure how it is now though

Why is anyone using AWS Elasticsearch anyway? AWS is a first-mover and is great when you need managed services not available elsewhere, but it's usually not the best product.

Since most vendors have finally caught on to the demand for managed offerings, I don't see much reason to go with AWS for services like Elasticsearch. The cloud hosting plans directly from Elastic are much better. Easier to manage, same cost, better performance, and more reliable. And support is included.

People like myself don’t want to manage 50 services from 50 vendors. We want to manage it all in 1 place.

Is it actually 50 services? Or a few that come with a much better experience? The trade-off seems clearly worth it when not exaggerated.

AWS managed services are usually subpar. We run our own ES clusters for this very reason.

This is great intel - sorry it had to come with such painful experience.

Does anyone know if Open Distro for ElasticSearch (https://opendistro.github.io/for-elasticsearch/) has these problems? Or is it related to how AWS configures/maintains ES on their platform?

ODFE has a security plugin and supports auditing, RBAC and node-to-node encryption[1][2]. The security plugin is based on Search Guard[3]. I have run Elasticsearch clusters using both Elastic Co's open-source versions and ODFE, and have personally found the security plugin of ODFE preferable to Elastic Co's X-Pack. ODFE also includes some additional features like PerfAnalyzer and a SQL interface. In terms of managing them, open-source Elasticsearch and ODFE are pretty much the same. I have not used the managed AWS Elasticsearch offering, but I read the blog post and felt it had very little to do with ODFE. There is a good feature comparison matrix here:


[1] https://opendistro.github.io/for-elasticsearch-docs/docs/sec...

[2] https://opendistro.github.io/for-elasticsearch-docs/docs/sec...

[3] https://search-guard.com/

Open Distro is the Amazon hard fork mentioned [1]. The missing features will most likely be a 1:1 issue.

The author does a disservice to their audience:

> As has happened before, Amazon took the open-source side of Elasticsearch, did a hard fork, and has been selling it as a hosted service, slowly implementing their own versions of features that have been available in one form or fashion in mainline Elasticsearch for years.

When what happened was Elasticsearch changed its licensing model after benefiting off of the open source community for years to be more restrictive, forcing the fork.

1 - Amazon blog post announcing fork https://aws.amazon.com/blogs/opensource/keeping-open-source-...

There's been no license change. For years, Elastic has had a set of commercially-licensed features on top of Apache2-licensed Elasticsearch. Within the last year or two, they made those commercially-licensed features source-available. Ironically enough, making the source available seems to have prompted a bunch of claims that they changed their licensing model.

Elastic mixed open source code with source available code, possibly as a landmine to sue large hosting providers, like Amazon. Amazon's fork includes removing these landmines. Elastic's hands aren't clean.

I have a hard time believing that Amazon's motives are pure. Companies that develop open source software have an incredibly hard time with profitability -- it is no surprise that elastic wants to reduce their own workload by maintaining a single repo for open source and source available code. The "land mines" are cordoned off in one directory. Hard to miss that.

For me the real frustration is Elastic's close tying of client features to server versions, making it impossible e.g. to buy the latest version of Kibana from Elastic and run it against a server managed by Amazon.

As I argued elsewhere in this thread this is completely untrue. Amazon ships an Elastic provided OSS build of Elasticsearch unmodified and as is. There is no fork repo in the opendistro account. There is no ambiguity in the upstream repo (it's a nicely documented code base).

Amazon is shipping and forking the elastic suite of tools, you are describing Elasticsearch itself. When licensing is at a source file level that is ambiguous for sysadmins. I understand you are approaching this as a developer, where it might be very clear.

There is no Amazon fork of the Elasticsearch repository. I looked at their repositories. They are copying the OSS build of Elasticsearch unmodified, without patches.

As for the licensing; licensing is documented in a 10 line LICENSE.txt. Also, it would be hard to miss the license settings, when you set this up. Hard to miss unless you are a seriously dyslexic and negligent person. In which case, I'd argue the Elastic basic license is the least of your problems.

I've been following this issue for quite some time and I totally agree with the author. I've had a lot of odd issues with AWS's productized Elasticsearch, enough that I gave up on it entirely.

my experience shows that this would be even more expensive with not that good support (very enterprise sales / customer success contact)

I'm not sure I follow. Running Elasticsearch on your own isn't that hard but yeah, it's more than a couple of clicks. Now that you can use Elastic.co's cloud solution it's probably even easier.

> I'm not sure I follow. Running Elasticsearch on your own isn't that hard

Hah! Please do try, at any non-trivial scale. Once you have got your battle scars, report back.

I had a 32 node on-prem 512 core 8TB of RAM cluster in 2014, all local SSDs.

It was fine.

Looks like someone from Elasticsearch picked this up and tweeted about it too. I suspect one way or another this could get Amazon to pay attention...


Completely agree, use Elasticsearch the company's offering. After being bitten by the earlier lack of VPC support for AWS Elasticsearch (though it's supported now), and by blue-green deploys that can take 12-24 hours even for small configuration changes (it used to be even security policy updates), I generally don't recommend AWS Elasticsearch.

Since I worked for Amazon and saw how the sausage was made, I am reluctant to ever use AWS again. Random grab bags of teams using all different tech with different coding standards and methodology. Jerry-rigged micro-services turned into APIs through total object instantiation rather than efficient mutable updates.

not just 'someone', the creator of Elasticsearch

Color me embarrassed heh.

Pretty awesome though.

Someone should point him to https://news.ycombinator.com/item?id=21228693 too.

Their ESCloud is yet another unit. :D

Also, based on the previous incident regarding Search Guard copying X-Pack code, I am guessing that there is maybe some intent to get AWS to just buy out ES, thus carving a wonderful exit. Just thinking!

I like elasticsearch, but it's not really fun to manage. We are still running latest 5.x because upgrading means we would lose access to our oldest snapshots (unless we reindex them all). I would love it to be managed for me but at the same time I don't really want to give access to my data to another third party, so managed by AWS is the only option (our data is there anyway). I was also under the impression that their ES offering is not great so for now I'm still handling things myself.

It is also wildly more expensive than just running your own on EC2.

You clearly have never dealt with an Elasticsearch cluster that has entered the “red” state. It’s not just about cost, it is about not having to wrangle and lasso a big cluster.

There is a certain scale where the extra cost of the AWS managed offering surpasses the cost of hiring a full time engineer to manage your own. My company is way past that point. Like most things cloud, the answer is "it depends".

Thanks for the civil response, it has been a long week.

My take on it is that for some companies using hosted Elastic (not AWS’ horror) is more probable if their business is not maintaining large clusters in house. This is along the lines of having a live-in nanny more than an occasional housekeeper. It is a luxury, and one in which having an Elastic nanny (PaaS/DBaaS) might make business sense.

You're welcome. I hear ya.

Most people using the cloud aren't really that cost conscious. EC2 is about 8x the cost of bare metal...

If you factor in operational cost AWS can be cheaper but often isn't really.

At $work we run a raytracing renderfarm on AWS spot instances that are only spun up as needed. This is far cheaper than keeping the same amount of physical machines around to serve peak use. The caveat is that we could most likely do with a smaller pool of machines and queue up jobs over night yielding higher utilization.

So the price advantage only exists when the lower time-to-completion is a requirement.

Yeah, absolutely right.

However most people do not do this.

> Most people using the cloud aren't really that cost conscious. EC2 is about 8x the cost of bare metal...

I've heard these claims before, but I think people are not doing the correct math. I would love to be proven wrong. It is not about computing the costs of hardware + depreciation + redundant power, adding them together and calling it a day.

You have to add:

- All the engineers that will be maintaining your bare-metal datacenter. If you are using a colo, use their costs.

- Downtime caused by mundane issues that you simply do not see on AWS. Yes, a single hypervisor will go bad. You fix that in _minutes_ by stopping and starting your instance. Yes, VMware or OpenStack (with networked storage!) could do that, but know what you also won't have to do? Deal with suppliers.

- The ability to quickly scale up and down according to your load. I can fire up 30 servers with a single Terraform script and tear them down before your hardware supplier has even returned your call. There's no one on my end racking and stacking anything.

- Downtime when shit breaks. If you have 5 thousand engineers impacted because a maintenance mishap knocked out power to a datacenter and cascaded to the backups, you'll quickly burn through your savings.

I could go on. There are scenarios where on prem makes sense, but one should not dismiss it outright without accounting for the additional risks you are incurring and quantifying those.

One thing that people often forget (and I have to keep reminding people about) is that a single Availability Zone on AWS is composed of multiple datacenters, not just one. And you have multiple AZs. Heck, if you are deliberate about it you can even recreate your entire stack in a different region entirely. It is not apples to apples.

You may not need them, but if you do, cloud providers offer capabilities you are unlikely to replicate yourself. Especially if it's not your core business.

More to the point: even if your 8x figure is correct, the AWS Elastic offering adds a premium _on top of that_ . So they have to offer better capabilities than you can build yourself to account for the premium – in my experience the extra cost is hard to justify (it can make sense, but it is not as clear as EC2)

8x is close to the top end, somewhere between 3x and 10x depending on a number of factors, mostly scale.

For the record, for on premise I factor in:

  - Hardware depreciation (36mo)
  - Power
  - People
  - DC rent + power
  - software licences / support

What we see now is that compute is dropping in price for on prem every year and density is improving. AMD Rome brings incredible bang for your buck when buying at significant scale.

But it's not comparing Apples with Apples.

It's virtually impossible to accurately factor in the opportunity cost of doing all this yourself but you can potentially hire a bunch of engineers with the savings of going on prem, ymmv

You can never ever recreate the developer experience on prem regardless of your scale; only you can tell if on prem is good enough

It's difficult to put a value on being able to pay as you go, or to suddenly be serving workloads out of a geo close to your users in Cloud, where on prem there is always a lead time

Finally, whilst the developer experience is better, suddenly having to deal with new challenges in Cloud takes a while to adjust to: outages out of your control, unpredictable performance, poor support, no access to your hardware

TLDR: Cost is hard to define and isn't a zero sum game

Lots of (most?) companies will eat some cost to simplify their company/organization.


I've seen this happen a lot. Basically it continues until it starts to eat the company and they struggle to rein it in.

I did some work at $company, they were basically profitable if not for their 7 figure AWS bill. I handed them a plan that would have cut their bill in half with a one time $75k spend. They also had static load so moving to dedicated instances would've cut a huge amount off their bill as well with basically zero effort.

AWS must make a ton of money on small deploys of dead code because it would cost more to have an eng confirm that it's safe to decom.

The managed ES service makes sense for small things where the operational costs of self managed dwarf the overhead. But at scale, it just doesn't make any sense.

EC2 costs are so variable too due to reserve, spot, and elasticity (depending on workload). It can be hard to compare.

Is this figure for an infrastructure that requires failover capability (potentially remotely) and does it assume 100% utilization of the bare metal? That seems crazy high, but I'd also believe it.

No, with failover capacity it'd only be 4x and it's def not assuming 100% utilization.

Sadly, it's not crazy high.

>> EC2 is about 8x the cost of bare metal...


Someone who doesn't value their time.

One of my old budgets, but there are a ton of blogs out there about it that are easily found via search engine.

Ok so nothing concrete.

Let me share my anecdotal evidence, then, with numbers. I migrated an on-prem cluster of 150 nodes running Hadoop, Elasticsearch and Docker apps serving the UI. We achieved a 30% saving in the company's year-over-year budget, which is ~600,000 USD. This is not about EC2 vs., say, a Dell server in a datacenter, because that would be an apples-to-oranges comparison. This is the sum of all the costs on-prem vs. the sum of all the costs in the cloud. When people compare purely EC2 to a node running on-prem, the only thing that is 100% crystal clear is that they do not understand how the cost of an infrastructure is structured. Quite often they forget that we need networking, electricity and cooling in the datacenter. They also forget that datacenter capacity cannot be given back when not needed (auto-scaling), plus a few minor things. This leads to the conclusion that the cloud is more expensive than on-prem, which in my experience of moving several Fortune 500 companies to the cloud is not true; quite the opposite, significant cost savings can be achieved.

I was talking about "in a datacenter", but yes, you have almost zero elasticity with datacenter buildouts. You can get metered power but that can be a mixed bag since it's usually more expensive per kWh.

Most companies that "move to the cloud" also make a lot of changes to the way they do things so they can scale up/down dynamically; that's not an insignificant cost in development time.

If you have a static load AWS is really expensive.

Also, GPUs are still insanely bad to do in AWS/Azure. Let's say you need the equivalent of 60 x p3.16xlarge for a whole year; that's well over a million USD a month. You break even in month 3 in a datacenter, even with all the overhead. Maybe some of that is my ability to get good deals, but even if you break even in month 4, that's crazytown.

>> If you have a static load, AWS is really expensive.

Again, without details, this is a meaningless claim. My own company's only infrastructure is a website that "runs" on AWS using the free tier of CloudFront and a little bit of the paid tier of S3. This is a static workload. It is really cheap. Without adding all the details on both sides and the workload, you cannot claim that AWS (or for that matter any cloud vendor) is more expensive.

If you can find cases where it's cheap, that's great, but it's not the case for a lot of people with static compute load.

One of my example datacenters is: private servers, access controlled, etc. 12 servers (E5-2438L, 128GB of RAM, 6TB of RAID in each, 10Gbps interconnects, etc.)

It costs $1200/month to run. Just the storage cost in AWS is around $3k/month. The equivalent EC2 cost is around $14k/month. It requires around 1-2 hours/month of oversight, and the costs are generally fixed except for the bandwidth, which is billed at a fraction of the cost of AWS pricing.

Personally, I don't care all that much about the extra cost, as long as it works. But it doesn't.

Shameless promo (no affiliation, just a happy user who migrated from Elasticsearch). You should check https://vespa.ai - app containers and first-class tensor support are a blessing.

I've looked into Vespa a bit lately. It looks pretty good!

I'm a little disappointed in its data type support, though. With ES you can throw deeply nested data structures (maps, arrays, arrays of maps, maps of arrays, etc. ad nauseam) at it and have them be fully indexed and searchable. But Vespa doesn't really do indexing of nested structures.

This means that if your application's schema is already dependent on such nested data structures, you need a mapping layer that flattens your structures. For example, if you have:

  address: {
    streetAddress: "1 Bone Way",
    city: "Boneville",
    state: "WA"
  }

then you have to flatten it to something like:

    address_streetAddress: "1 Bone Way",
    address_city: "Boneville",
    address_state: "WA"
And then, of course, you have to unflatten when you get the results (unless you only use Vespa for the IDs and look up the original data in your main data store).
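
The mapping layer described above can be sketched in a few lines of Python. This is an illustration, not anything Vespa ships: the `flatten` and `unflatten` names and the `_` separator are assumptions, and the round trip only works if your real field names never contain the separator.

```python
from typing import Any, Dict

SEP = "_"  # ASSUMPTION: field names themselves never contain this character


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested maps into a single level of prefixed keys."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{SEP}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def unflatten(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Rebuild the nested structure from prefixed keys."""
    doc: Dict[str, Any] = {}
    for name, value in flat.items():
        parts = name.split(SEP)
        node = doc
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return doc


if __name__ == "__main__":
    original = {"address": {"streetAddress": "1 Bone Way",
                            "city": "Boneville",
                            "state": "WA"}}
    flat = flatten(original)
    print(flat)
    print(unflatten(flat) == original)
```

Note that empty maps disappear on the way through and arrays are passed along untouched, so a real mapping layer would need explicit rules for both.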

Same thing with arrays. Vespa doesn't really support arrays, whereas in ES, all attributes are technically arrays. (I.e., a "term" query/filter doesn't distinguish between the two: {term: {foo: "bar"}} will match both documents {foo: "bar"} and {foo: ["bar"]}.)
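
To make the "all attributes are technically arrays" point concrete, here is a toy model in plain Python. It is not Elasticsearch code and ignores analyzers and mappings entirely; `term_matches` is a hypothetical function that only mimics the matching behaviour described above.

```python
from typing import Any, Dict


def term_matches(doc: Dict[str, Any], field: str, value: Any) -> bool:
    """Treat a scalar field as a one-element array, then test membership."""
    stored = doc.get(field)
    if stored is None:
        return False
    values = stored if isinstance(stored, list) else [stored]
    return value in values


if __name__ == "__main__":
    # Both shapes match the same term filter.
    print(term_matches({"foo": "bar"}, "foo", "bar"))
    print(term_matches({"foo": ["bar", "baz"]}, "foo", "bar"))
```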

Another oddity is the system for updating your schema, which includes not just data model definitions, but a whole bunch of files which you upload as a batch. The programmatic API for updating the schema is a little impractical, much less practical than with ES where you can just do "curl -d @mappings.json" and you're done. Also not at all a fan of their use of XML.

Overall, Vespa feels more than a little antiquated. It's an old project, after all. That said, I'm probably willing to deal with the warts if it's more solid. I like that the core server is written in C++, not Java.

What has your experience been in terms of clustering? With ES you can just boot up a bunch of nodes and, on a good day, it will self-organize into a pretty nice and scalable setup. (On a bad day, your cluster will become "red" for unpredictable reasons.) Is Vespa as seamless here?

That's my big complaint with Solr, too, which I want to like because on paper it seems a ton more sane, but realistically the ability to throw a random JSON document at ES -- without having to sit down and pre-define the schema -- is invaluable.

I stopped by to check on Solr before posting this, and even their "schemaless" document is like 8 pages long and filled with XML settings: https://lucene.apache.org/solr/guide/8_1/schemaless-mode.htm...

That's because that page explains how to turn the mode on and off, how to fine-tune it (e.g. different date formats) and how to index formats other than JSON. Elasticsearch does not support a good chunk of this, so no need to document.

If you want a more streamlined version, you can check the example instead: https://github.com/apache/lucene-solr/blob/master/solr/examp...

And even then, it will already discuss the problem with auto-guessing the content types, something that Elasticsearch mentions only later. Solr is just more upfront and explicit about the issues.

Still, you do have a point, Solr documentation tries to be comprehensive rather than ease-of-use oriented. That sometimes obscures the easy things.

Used to work in AWS. ES was always a bad org with crazy attrition, none of this surprises me.

Until people disentangle providing bare metal and providing the managed services on top of the bare metal you received, it just won't matter that AWS is worse, because customers are already in the garden, and they'll pivot their way to being better while developers suffer in silence (but pad their resumes).

A couple of misconceptions about open distro that Amazon seems to be advertising to overstate what they are doing:

1) It's not a fork. There's no such thing as an Amazon specific fork of the Elasticsearch git repo in the opendistro github account. There are no Amazon specific patches to Elasticsearch.

2) Instead, Amazon redistributes a vanilla OSS build of Elastic. As is and unmodified. Producing these builds is and always has been a feature of the elasticsearch build scripts. With every release they produce OSS binaries and OSS docker containers (both without the closed source plugins) in addition to the ones that include their x-pack components. All Amazon does is take those builds and bundle their own OSS plugins.

3) What is and is not open source is clearly documented in the Elastic repository. There is zero ambiguity here (legal or otherwise) unlike what Amazon implies in their marketing. If it's in the x-pack directory, it may be closed source (some plugins are OSS). If that bothers you, use the aforementioned OSS builds. Everything outside the x-pack directory is OSS. OSS here qualifies as Apache 2.0 licensed or compatible. It's that simple.

The OSS plugins that Amazon provides are of course nice if you need them. Less nice is that they seem to be perpetually several releases behind Elastic with both the plugins and opendistro. So if you use this, you are running with known and fixed bugs that may or may not affect you. You could argue that Amazon maybe does a lot of testing. If so, those tests don't appear to be part of their OSS repos. The other explanation is of course that they only update their cloud service a couple of times a year and simply ignore bug fixes, patch releases to what they shipped, and other fixes and improvements that happen upstream. Nor do they document any of it: you have to refer to the official Elastic release notes and documentation for what was actually fixed. If these bugs happen to affect you, you are on your own. The release notes are "whatever Elastic said a few months ago". Refer to the Amazon release notes here if you don't believe me: https://opendistro.github.io/for-elasticsearch-docs/version-.... The last few releases were basically "bump the version number" and absolutely nothing else that Amazon considered worth reporting.

So, if you are comfortable running that in production, use it at your own peril. I'd argue it's probably better to take the latest Elastic OSS build, fork the Amazon plugins you actually need (if any) and simply bump the version numbers to match the current Elastic version. Amazon seems to do little more than that between releases, so you are not really missing out on any meaningful QA, support, or other stuff Amazon implies they are doing but clearly are not.

> 3) What is and is not open source is clearly documented in the Elastic repository. There is zero ambiguity here (legal or otherwise) unlike what Amazon implies in their marketing. If it's in the x-pack directory, it may be closed source (some plugins are OSS). If that bothers you, use the aforementioned OSS builds. Everything outside the x-pack directory is OSS. OSS here qualifies as Apache 2.0 licensed or compatible. It's that simple.

While this statement is true, it is not clearly documented in their actual documentation! For many years the pricing model around X-Pack was incredibly opaque, and the documentation did everything possible to encourage you to use it while keeping the warnings around licensing issues buried deep in the appendices.

I certainly didn't read elasticsearch's source tree when learning how to operate it -- I started in the docs like most everyone else.

Also, your super responsible sysadmins probably aren't pulling elastic's source code to run on your servers, they're using the distro packages (which do mix free and non-free) which is also what the docs tell you to do.

They've long addressed all of that. There are helpful x-pack tags on the documentation for features that aren't in the OSS release. Check here for example on the documentation page for index life cycle management: https://www.elastic.co/guide/en/elasticsearch/reference/curr...

The pricing model for their platinum features is indeed opaque (as in most small companies can't afford this, and you'd have to talk to a sales rep to find out). Those features are also clearly marked. X-pack features are free to use. Also, if you try to use these features without the proper license key, it won't work for obvious reasons. There's zero risk of using this accidentally without first agreeing to some license.

As for the repo, if you bother to open: https://github.com/elastic/elasticsearch/blob/master/LICENSE..., it spells it out in 10 lines of text. They've iterated on this a bit but this was always the place where they clearly outlined what is what.

Also, each single source file includes details on how it is licensed. There's zero chance of a developer not seeing that if they are preparing some code patch. This is intentional; it's not optional to document stuff like this if you are serious about enforcing your copyright; which of course they are.

Sysadmins pulling elasticsearch from a linux distro repo of course happens. Presumably they'd be getting an OSS build and not bundle proprietary components because they tend to care about not shipping proprietary code. If you go to the Elastic download pages, there are convenient links to both.

This is the best, most informative and helpful comment I've seen. Thank you.

Granted the rebalancing is a real missing feature but what if you had used EBS volumes for data storage, to grow the disk if one node approaches its limit? I'm guessing the author is using local SSD/NVMe to get the most query performance but that does come at the cost of flexibility.

Of course EBS is not as fast, and I'm sure others have run into availability issues, but one has to look at their requirements; in a fast changing environment when data needs are not certain having the ability to expand with a push-button is powerful.

Our experience was that the more data you store, the more IOPS you need to search through it at a reasonable pace, and with a large dataset things can really start to crawl.

That said, it's worth pointing out that EBS is the default data store in AWS Elasticsearch, and for people without a ton of data it might actually end up working "as intended"

The whole shard rebalancing problem is solved if you have an acceptable disaster recovery solution in place, which I'm assuming the author doesn't if they're complaining about having to keep around the raw copy of their data to be indexed.

My team has an automated workflow that runs once a week. It creates a new cluster, re-indexes from source, starts taking customer traffic and then deletes the old cluster. The shards stay balanced, and we can recover from a total cluster failure within about 6 hours.

Yes, you should have a DR scenario.

No, you should not need to use it because an aws product is a piece of shit.

Can I ask how large your clusters are and how many records you’re reindexing? I haven’t been able to scale this to a cluster taking in several TB a day

Yeah, never liked the offering (OK, fine, not much reason given). I also just spun up an instance and installed Solr. Lol, until my new team decided we need 3 Solr in 2 Docker containers (master / slave)... Lol, wasted 2 weeks.

Yeah, unless your database is not a critical component of your application, like an experiment or a feature that doesn’t have to be available, you should almost always self-operate. It pays to have control of your data.

Some valid points and some relevant real-world AWS support nightmare scenarios, though I think there is a chance the author might be wrong about a few things, or maybe I misunderstood them. My 2 cents:

> Amazon’s implementation is missing a lot of things like RBAC and auditing.

Open-distro (which AWS uses for elasticsearch deployments) supports this: https://opendistro.github.io/for-elasticsearch/features/secu...

> Shard rebalancing, a central concept to Elasticsearch working as well as it does, does not work on AWS’s implementation.

Not sure why the author says AWS doesn't support it, but I have seen that it does rebalance shards just like vanilla elasticsearch would. In fact, it wouldn't rebalance only when the shard-allocator is unable to find suitable home for the unassigned shards (and that's vanilla behaviour, iirc): https://aws.amazon.com/blogs/opensource/open-distro-elastics...

> ...if a single node in your Elasticsearch cluster runs out of space, the entire cluster stops ingesting data, full stop. Amazon’s solution to this is to have users go through a nightmare process of periodically changing the shard counts in their index templates and then reindexing their existing data into new indices, deleting the previous indices, and then reindexing the data again to the previous index name if necessary.

I think the author should employ alerts for cluster health https://docs.aws.amazon.com/elasticsearch-service/latest/dev... or write them https://github.com/opendistro-for-elasticsearch/alerting and definitely read about best practices for offloading petabyte-scale clusters to AWS (I am sure they've read about it already, given they're in touch with SMEs and TAMs): https://aws.amazon.com/blogs/database/run-a-petabyte-scale-c...

> Hope you had a backup of what you needed to dump.

Amazingly, AWS Elasticsearch does automated hourly backups and retains them for 14 days, for free: https://aws.amazon.com/about-aws/whats-new/2019/07/amazon-el...

> The second option is to add more nodes to the cluster or resize the existing ones to larger instance types.

AWS Elasticsearch doesn't yet scale out (change in instance count) without resorting to blue-green deployments. They should have implemented that by now, like they did for policy updates: https://aws.amazon.com/about-aws/whats-new/2018/03/amazon-el... I hope fixing this is on their roadmap.


Also, I believe the real problem with the managed Elasticsearch offering is that end users still have to worry about the servers, as it isn't truly hands-off in the way that Lambda or DynamoDB are. This is complicated by the fact that Elasticsearch exposes innumerable ways to configure cluster and index setups (read: shoot yourself in the foot).

I guess AWS Elasticsearch needs something like an Aurora Serverless Data API, as the current offering takes too much control away from power users (you can't ssh into the nodes to fix anything at all, and the constant reliance on the oft-incompetent AWS support to do the firefighting, whilst having to wait frustratingly on the sidelines with little to no transparency, is a big red flag): https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...

With perf-analyzer and automatic-index-management included in open-distro, they might already be half-way there: https://news.ycombinator.com/item?id=19361847

Couple things -

It's worth pointing out that AWS Elasticsearch is not simply the open distro - and a lot of things you see in the open distro are not currently available on AWS Elasticsearch. Beyond that, many features are simply forcibly disabled in Amazon's offering, just as many cluster settings and APIs are untouchable in the AWS service (even read-only ones that would be super helpful).

I still can't touch any of the rebalancing settings on my clusters and everything looks forcibly disabled. If rebalancing worked as expected, the whole blue-green thing shouldn't be necessary, and over time, I wouldn't generally end up with a single full data node while every other node in the cluster has 300GB free. Am I missing something?

None of the CloudWatch alarms you linked to have much relevance to the issues in the article (Other than the ClusterIndexWritesBlocked alarm which will only start firing after everything breaks). As of the last time I looked, you cannot monitor disk space on individual nodes in CloudWatch, only the cluster as a whole. Alerting on a single node starting to fill up is basically the one alert that would let me know things are about to be in a bad state.
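
For the per-node alert described above, one possible workaround is to poll the cluster yourself. The sketch below assumes that `GET /_cat/allocation?format=json` is reachable on your domain and returns the `node` and `disk.percent` columns that stock Elasticsearch does; the 80% threshold and the `nodes_over_threshold` name are arbitrary choices, and fetching, auth, and scheduling are left out.

```python
from typing import Any, Dict, List, Tuple


def nodes_over_threshold(rows: List[Dict[str, Any]],
                         max_percent: int = 80) -> List[Tuple[str, int]]:
    """Return (node, disk percent) pairs at or above the threshold,
    fullest node first. `rows` is the parsed JSON from _cat/allocation."""
    hot: List[Tuple[str, int]] = []
    for row in rows:
        pct = row.get("disk.percent")
        if pct is None:
            # The UNASSIGNED pseudo-row carries no disk statistics.
            continue
        if int(pct) >= max_percent:
            hot.append((row.get("node", "?"), int(pct)))
    return sorted(hot, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hand-written sample rows; the _cat JSON format reports values as strings.
    sample = [
        {"node": "data-1", "disk.percent": "97", "disk.avail": "40gb"},
        {"node": "data-2", "disk.percent": "61", "disk.avail": "300gb"},
        {"node": "UNASSIGNED", "disk.percent": None},
    ]
    for node, pct in nodes_over_threshold(sample):
        print(f"ALERT: {node} is at {pct}% disk")
```

Wiring the result into whatever pager or chat hook you already use would give you the "one node is filling up" warning before the cluster-wide write block kicks in.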

Their service seems to work well for small implementations that use EBS-backed storage, and I bet that's what most of their customers are using, but I'm running 60+ node clusters and the problems only seem to be worse as capacity goes up.

Someone in the comments here mentioned they destroy and rebuild their cluster weekly just to keep the shards balanced. How ridiculous is it for that to be the best option offered?

PostgreSQL Full-text search is good enough. Also recently in PG12 they did many perf optimizations (Table Partitioning, Indexing, Vacuum) which will help in this regard.


Text search functions - https://www.postgresql.org/docs/12/functions-textsearch.html

If you need basic text filtering and ranking, ES is overkill. When you need more powerful ranking, highlighting, etc., Postgres isn't sufficient.

Nope. PG12 is just fine for any kind of ranking and highlighting. Of course you may have to use extensions like Citus DB to scale horizontally

TBH sounds like growing pains, I would imagine with the might of amazon engineering behind the hard-fork it will eventually surpass the original.

Question is when??

Don’t bet on it. AWS Elasticsearch is really as bad as this article indicates.

Yeah, we ran a minor upgrade to our cluster earlier this week and it knocked out the entire cluster for over two hours, and we were getting AWS specific errors regarding hostname headers that we've never gotten before. I managed to get a developer advocate on twitter to lend us a hand, but if I had actually waited on support things would likely still be down.

Unfortunately that seems more like a trend with AWS: there are a lot of new services that feel like they're 80% ready for production, but which are being sold as complete solutions.

Finishing things isn’t cool or sexy, don’tchaknow! Gotta get that MVP out the door and move onto the next project. It’s fine, you can iterate on it after re:Invent!

Agreed, this article captures my experience accurately as well.

That's a weird assumption to make.

AWS Elasticsearch is focused on providing minimum necessary functionality to sell instances.

Opensource Elasticsearch is focused on improving Elasticsearch.

That's a valid point.

But don't you think the more features it has the more instances of it will sell?

Generally yes, but they can wait for Elasticsearch to implement them and port them over with less time/money spent.

I don't think they have a ton of incentive to surpass Elasticsearch, but just to lag behind at some reasonable rate

It's been this bad for a while. I think it's just a stalled project inside of AWS, or the people who work on it think it functions as designed...

I would have believed that. I was told by someone that "Amazon has a practice of launching early [with bugs] and then iterating"

But all the things in this article were true when I tried the Amazon Elasticsearch Service in 2016, so if they haven't fixed it in 3 years, what makes you think they'll fix it soon?

Is the might of Amazon engineering really behind it though, or is that might behind a bunch of different things of varying importance? How much might can Amazon really bring to bear on any one project?

ES revenue can cover a lot of devs. But rephrasing: is ES the logging and visibility solution they are betting on, or is ES an 'ok to be 2nd for folks insisting on ES' and they plan to win with other cloud-native log/doc analytics tooling?

(Their crocodile smile on ethical open source practices through better operational excellence reads even thinner when seeing stuff like this tho.)

They've definitely been positioning it as the preferred solution. Our SA steered us towards it as a general search solution over their seemingly abandoned cloud search offering. The (also abandoned) AWS landing zone has an addon to stand up an ELK stack for you and no alternative afaik.

Really not obvious to me... just as MS released internal Kusto as a product, and Google did BigQuery, I've been curious if some internal more integrated Amazon tool would surface. E.g., see the direction of CloudWatch to go from metrics queries to log queries (https://aws.amazon.com/about-aws/whats-new/2019/07/cloudwatc...) and how GuardDuty is marching towards a modern SIEM. What the cloud vendors can do vs. splunk/elk/cisco makes it feel more like a "when" not "if". Zero judgement here on technical quality, this is just the nature of big tech co's.

Do most AWS teams use ES, or something else? Is CloudWatch Insights on a scaling & parity path?

Haven't used many AWS services have you? :D hahah

Remember the EBS garbage fest a couple years ago?

Given the lawsuit that is going on between Elastic and AWS in the background, can someone confirm whether shard rebalancing is a flaw in AWS's offering?

It exists in mainline Elasticsearch as well as the open fork and appears to exist in AWS's offering as well - AWS appears to have forcibly disabled it across the board for unknown reasons. I think it's probably an issue with the overall back-end architecture/implementation of their managed service.

Could be entirely translated as "This very generic offering doesn't suit my very specific demand".

Some would say Elasticsearch is fundamentally flawed.

If someone would say that on HackerNews then they would expect to hear "citation needed".

Anecdotally what I hear is a bunch of bitching and moaning about ES yet it clearly does work and has generally all of the difficulties of any CAP problem. This indicates to me that ES is addressing a Hard Problem and to the extent that it is long lived and quite popular, it's likely not substantially worse than any reasonable alternative.

Please tell us what you view as ElasticSearch's fundamental flaws and give some proposed alternatives, either as revisions to ES or as entirely different solution components.

It is not necessary to have a valid alternative to validly declare something as fundamentally flawed.

ElasticSearch is as brittle as you can get. If you don't dimension Java heap sizes properly, nodes crash all the time and uncontrollable ultra-expensive shard relocation happens. Their open source available monitoring tools have the nice side effect of overloading the cluster and bringing it down (!). The result of it being a whole hodgepodge of Java-based repurposed Lucene does show in poor performance and very poor stability.

I've spent many a weekend trying to bring up a fallen ElasticSearch cluster, in some cases brought down just from monitoring. We had a use case that wasn't that easy, but not massive (100ks concurrent users, but not concurrent millions), and a properly developed C++ or even Python distributed solution would be more than able to handle it quite easily (source: ended up having to write it myself, didn't require massive anything to handle properly).

Frankly I admire Elastic because I have no idea how you can turn such a piece of software into ~$90MM yearly revenues, and, mainly, how you can turn that ~$90MM yearly revenue into a publicly traded company with a nearly $7bn market cap. So much to learn from them!
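
On the heap sizing point above: the rule of thumb usually quoted from Elastic's own guidance is to give the JVM no more than half of the machine's RAM and to stay under the roughly 32 GB compressed-pointer threshold. A minimal sketch of that rule follows; the 31 GB cap is an assumed safe value, `suggested_heap_gb` is a hypothetical helper, and none of this replaces load testing.

```python
from typing import List


def suggested_heap_gb(ram_gb: int, oops_cap_gb: int = 31) -> int:
    """Half of physical RAM, capped below the compressed-oops threshold."""
    if ram_gb < 2:
        raise ValueError("need at least 2 GB of RAM for this rule of thumb")
    return min(ram_gb // 2, oops_cap_gb)


def jvm_options(ram_gb: int) -> List[str]:
    """Matching -Xms/-Xmx flags; min and max heap are set to the same value."""
    heap = suggested_heap_gb(ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


if __name__ == "__main__":
    for ram in (16, 64, 128):
        print(ram, "GB RAM ->", " ".join(jvm_options(ram)))
```

The other half of the RAM is not wasted: Lucene leans heavily on the OS page cache, which is the usual argument for the 50% split.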

This is what I'm talking about wrt bitching and moaning, in summary you tried to use ES but you didn't rtfm or didn't know about jvm tuning or didn't scale test and found out the weekend is a bad time to come up to speed on those, you had a bad time several times, plus you slashdotted yourself with monitoring; then you did a custom implementation for your vertical use case which didn't have the rtfm problem because you wrote it, but also only satisfied your case as opposed to the wide applicability of ES. Ultimately cool story bro because ES is freely available for anyone to use (many people do this) and modify (some people do this too) and your alternative is unknown.

What are the fundamental flaws of ES and what alternatives avoid those flaws, or how do you propose ES could address those flaws?

For example:

- "Algolia is so much better because it is a managed service." (hey whatsup ycombi)

- "Solr is also lucene but necessarily requires significant customization to the workload which avoids the common ES problem of it appearing to work so well out of the box that people neglect the details until it becomes an incident."

- "ES fundamental flaw is that zen disco mcast nonsense, people please stop being clever using mcast it never works in practice because igmp snoop". (hey whatsup we out here using ES since a while now)

Elasticsearch may not be fundamentally flawed, but it sure is flawed!

It's operationally unpredictable, even if you know all the corner cases (like field cache sizes) and JVM flag tuning voodoo. It's notoriously memory-hungry, and its networking is notoriously unstable. I've had issues where minor network blips throw the entire cluster into a weird quantum state where nodes are up but the cluster is down.

One particular annoyance with ES is that, once it starts having issues, it often becomes completely unresponsive, and its actual status becomes difficult to understand. You often can't access endpoints like /_cluster/health, /_cat/shards, etc. to diagnose, and meanwhile the logs are spewing inscrutable Java stack traces that are of no help. There are clearly weird bottlenecks inside ES which fail in extreme circumstances.

It's gotten better. The consensus protocol was brittle and unsound for many years, and has slowly been patched for robustness, but I wouldn't say it's been fixed. ES is much more unreliable than many other clustered systems. The only one that comes to mind as being as unreliable is RabbitMQ.

Even if it is, people still use it.

Flaws don't make something unusable, you just have to know your limitations and work within them. MySQL has come a very long way for something so initially flawed.

This article amounts to someone whining because they didn't take necessary action to prevent a dumpster fire. AWS's Managed Elasticsearch has tradeoffs and you should understand them before choosing it, but AWS is not to blame if you've under provisioned your cluster and imbalanced your shards.

The very, very poor tradesman blames the tool.

> However, so many fundamental features are either disabled or missing in AWS Elasticsearch that it’s exacerbated almost every other issue we face.

No, your choice to use AWS Elastic Search is compounding all the other poor decisions you have already made (and admitted to). This is just another one of them. Petabyte-scale ElasticSearch clusters are approaching edge-case territory, which most people would expect an average SaaS solution not to be optimized for.

You could have spun up - and managed - your own cluster on EC2 but instead decided to make bad decisions probably without any research or beta testing, got badly burned as a result, and are now trying to unload your sour grapes on to the interwebs.

This sort of experience report is extremely valuable for people evaluating AWS services or trying to make a case to switch to a self-managed service.

> The very, very poor tradesman blames the tool.

But a good tradesman will not use inferior tools.

A good tradesman uses the right tool for the task at hand.
