
Hazelcast 3.8.3 - aphyr
https://jepsen.io/analyses/hazelcast-3-8-3
======
Twirrim
"In addition, the names of Hazelcast’s datatypes, and the functions provided
on those types, imply a certain fitness-for-purpose, e.g. that users can use
these types and functions in a meaningful way. What is the point of an ID
Generator which emits duplicate IDs? A lock that doesn’t lock? Who wants an
AtomicReference which is not atomic? Of what possible use is a queue which
doesn’t, well, queue?"

Wow. I wonder if the documentation and naming schemes are bad because they're
aspirational, or because the developers don't know what they're doing?

~~~
im_down_w_otp
Not with respect to Hazelcast specifically, but it's often the latter, though not
maliciously so. It's fairly difficult to implement a lot of those things
correctly, and fairly easy to convince yourself you were thorough.

~~~
noncoml
That's why we have QA, no?

A developer can only test up to the point that their code works as they
thought it should work. QA takes over from that point.

~~~
im_down_w_otp
Probably not. You as a developer could write a Jepsen harness for your system
and find the same results seen here. You could also write property-based tests
to improve your testing model and input space coverage. You could also use
mutation testing to see how good your tests are, etc. etc. etc.
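
To make the property-based idea concrete, here is a minimal Python sketch. It is not a Jepsen harness and uses no Hazelcast API: `run_history`, `ids_are_unique`, and `check_property` are hypothetical names, and "split brain" is modeled crudely as each node incrementing its own copy of a counter. A real project would more likely use a library such as Hypothesis for input generation and shrinking.

```python
import random


def run_history(ops, split_brain):
    """Replay a list of node indices (0 or 1); return every ID handed out.

    Assumption: a toy two-node ID generator backed by a single counter.
    """
    shared = {"next": 0}
    # Under split brain each node works from its own copy of the counter.
    counters = [dict(shared), dict(shared)] if split_brain else [shared, shared]
    issued = []
    for node in ops:
        counter = counters[node]
        issued.append(counter["next"])
        counter["next"] += 1
    return issued


def ids_are_unique(ids):
    """The property under test: no ID is ever issued twice."""
    return len(ids) == len(set(ids))


def check_property(split_brain, trials=200, seed=0):
    """Try random histories; return a failing history, or None if none found."""
    rng = random.Random(seed)
    for _ in range(trials):
        ops = [rng.randrange(2) for _ in range(rng.randint(1, 20))]
        if not ids_are_unique(run_history(ops, split_brain)):
            return ops
    return None
```

With a shared counter, `check_property(split_brain=False)` should find nothing. With `split_brain=True`, any history in which both nodes allocate an ID fails, since each side hands out 0 from its own copy.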

At my last job we held off running a chaos test suite against our platform in
the certification environment until _after_ the QA department came back with
their report and gave the platform all green checkmarks. This was an exercise
in both validating the value of the chaos engineering/testing work as well as
partially discrediting the idea that QA is capable of exercising enough of the
system in enough ways to expose non-obvious/nuanced edge cases.

I mean I guess there's an argument to be made that QA could be engineering and
building their own rich tooling like a chaos testing framework, but that seems
extremely unlikely to happen in the current typical formulation of QA.

------
nl
Is there some award Jepsen and Aphyr can be nominated for? Kyle's work has
improved so many open source projects it needs acknowledgement.

Also, his Twitter is a great thing for confusing people who want to read about
lock schemes and relationship management etc...

~~~
emmelaich
... and bears dressed up as Wonder Woman :-)

------
hazelcast
Hazelcast has officially responded to the Jepsen Analysis here.
[https://blog.hazelcast.com/jepsen-analysis-hazelcast-3-8-3/](https://blog.hazelcast.com/jepsen-analysis-hazelcast-3-8-3/)

~~~
fulafel
You write that Hazelcast belongs to a category of software that
characteristically chooses speed/availability over consistency in case of
network partitions[1]. A web search leads me to your competitor who
sounds like they are tackling the issue by putting consistency first:
[https://docs.gridgain.com/v8.1/docs/network-segmentation](https://docs.gridgain.com/v8.1/docs/network-segmentation)
[https://www.gridgain.com/products/software/enterprise-editio...](https://www.gridgain.com/products/software/enterprise-edition/network-segmentation-protection)

"When there is a network partition we favour availability over consistency.
Which is what the Jepsen test shows. [...] It is also the PACELC contract of
the entire category of IMDG systems. IMDGs are used for maximum speed and low
latency. They represent a set of design trade-offs to achieve this primary
purpose."

------
jodah
For anyone wondering if there is a set of Hazelcast-like data structures and
primitives that actually have safe, strong consistency (based on Raft
consensus), check out Atomix:

[http://atomix.io/atomix/](http://atomix.io/atomix/)

~~~
dyu-
> that actually have safe, strong consistency

Based on what data? Even Hazelcast made that same claim before they got
exposed.

I've only seen 2 systems that did rigorous in-house fault-tolerant testing
(foundationdb and cockroachdb) and later when "jepsen-verified", they actually
backed their claim or had few modifications to uphold their claim.

~~~
jodah
Atomix has a Jepsen test suite[1], which while not perfect, has helped fix the
bugs that existed in the Raft implementation.

1: [https://github.com/atomix/atomix-jepsen](https://github.com/atomix/atomix-jepsen)

------
eternalban
Ouch. (Someone please give this guy an industry medal.)

Talk of Hazelcast reminded me of Coherence (now Oracle, before Tangosol):
[https://www.javalobby.org//java/forums/t78008.html](https://www.javalobby.org//java/forums/t78008.html)

Oracle docs sensibly call it a distributed cache:
[http://www.oracle.com/technetwork/middleware/coherence/distr...](http://www.oracle.com/technetwork/middleware/coherence/distributed-caching-100021.html)

------
runT1ME
We use Hazelcast at Verizon Labs for a couple services, and one of the issues
we ran into is how much _work_ it takes to replicate split brain to test
various and custom merge policies.

When reading that the distributed map implements the ConcurrentHashMap
interface we were immediately skeptical, but it took more work to prove our
skepticism was founded.

Hazelcast has a _fantastic_ foundation for building clustered applications,
but some of the design choices they made both internally and API wise are not
choices we would have made.

We are moving to a custom merge policy because of some of these choices.
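
Why the choice of merge policy matters can be sketched with a toy model. This is an illustrative simulation only, not Hazelcast's merge policy API: `heal`, `last_write_wins`, and `union_merge` are hypothetical names, and each entry is assumed to be a `(frozenset, timestamp)` pair.

```python
def last_write_wins(a, b):
    """Timestamp-based policy: keep the newer entry, discard the other."""
    return a if a[1] >= b[1] else b


def union_merge(a, b):
    """Hypothetical custom policy for set-valued entries: keep both sides."""
    return (a[0] | b[0], max(a[1], b[1]))


def heal(side_a, side_b, policy):
    """Merge the two halves of a split cluster, key by key."""
    merged = {}
    for key in side_a.keys() | side_b.keys():
        if key in side_a and key in side_b:
            merged[key] = policy(side_a[key], side_b[key])
        else:
            merged[key] = side_a[key] if key in side_a else side_b[key]
    return merged


# Both sides start from {"book"} and diverge while partitioned.
side_a = {"cart": (frozenset({"book", "pen"}), 2)}   # "pen" added at t=2
side_b = {"cart": (frozenset({"book", "lamp"}), 3)}  # "lamp" added at t=3

lww = heal(side_a, side_b, last_write_wins)
merged = heal(side_a, side_b, union_merge)
```

Under the timestamp policy the "pen" written on side A silently disappears once the partition heals; the union policy keeps it, though a plain union cannot represent removals, so it is not a general fix either.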

------
amelius
If a system is so "broken by design", shouldn't it include a flag for running
under worst-case mode, so that users can more easily detect faults in their
assumptions about the system?

------
lisa_henderson
Wait, is this Kyle Kingsbury? If so, where is his name? Who are the authors? I
notice this text doesn't mention the authors:

"We wish to thank Jordan Halterman for his discussion of Hazelcast use cases.
Luigi Dell’Aquila & Luca Garulli from OrientDB, and Denis Sukhoroslov from
BagriDB, were instrumental in understanding those systems’ use of Hazelcast.
Thanks also to Julia Evans, Sarah Huffman, Camille Fournier, Moishe Lettvin,
Tim Kordas, André Arko, Allison Kaptur, Coda Hale, and Peter Alvaro for
reading and offering comments on initial drafts. This research was performed
independently by Jepsen, without compensation, and conducted in accordance
with the Jepsen ethics policy."

I'd like to know who stands behind this research

~~~
aphyr
Hahaha I suppose that's an entirely legitimate question, isn't it? Just to
reassure you, Lisa, yes, this is me, and I'm the only person who works at
Jepsen. I'm still exploring voices for the Jepsen brand--since at some point I
might hire more researchers, more recent stuff is written with an
organizational "we". Suppose I should start adding bylines. :)

~~~
agacera
Hey, Kyle. Thanks for the awesome work you do.

completely off-topic: have you ever thought about giving your training
([https://jepsen.io/training](https://jepsen.io/training)) in an online way
(mooc, udemy, whatever)? I would love to learn about distributed systems from
you, but today I think it would be almost impossible since you only seem to
give in-organization trainings.

~~~
aphyr
Thank you. Unfortunately, it's just not a good fit for a MOOC, in terms of
class format and cost.

~~~
eternalban
Have you explored high level consulting in inception phase of distributed
system design? (and p.s. thank you for your exceptional work and knowledge
sharing.)

~~~
aphyr
You're welcome, and yes, I do design consulting as well:
[https://jepsen.io/services](https://jepsen.io/services).

------
manigandham
Apache Ignite seems to be doing things right, they don't have jepsen results
yet though...

~~~
ajmurmann
Another solution in the same space is Apache Geode. Its commercial version has
been around for a long time and has been well proven under very large loads in
systems that require low response times.

Disclaimer: I work on Geode.

~~~
manigandham
What is the current state of Gemfire/Geode vs GridGain/Ignite? Any big
differences?

~~~
ddorian43
Also vs SnappyData, which is based on GemFire?

~~~
plamb
Worked on GemFire prior to Pivotal acquisition (before Geode) and currently
work for SnappyData.

You can imagine GemFire/Gridgain as an apples-to-apples comparison. Both are
"enterprise" in-memory data grids originally intended for managing data in
low-latency OLTP applications which later added analytics/OLAP features.
Geode/Ignite are the open source options for these two IMDGs and also a good
apples-to-apples comparison. (Hazelcast also has enterprise/OSS versions I
would compare accordingly.)

I can't speak to the current comparison between these systems, but I can
compare them to SnappyData. SnappyData deeply integrates GemFire with Spark to
bring high concurrency, high availability and mutability to Spark
applications. In the world of combining Spark with a datastore over a
connector (cassandra, hive, mysql, mongo etc) to enable "database-like"
features in Spark, SnappyData has taken the next step of integration. In
Snappy, the database (GemFire) and the Spark executors share the same block
manager and VM so the systems no longer communicate over a "connector." This,
along with our database optimizations, provides the best performance for Spark
applications in what I like to call the "Spark Database Ecosystem."

As such, comparing SnappyData to GemFire/Hazelcast/Gridgain does not make much
sense unless you are trying to use Spark in conjunction with these systems. In
that case, the main difference I would point out is that SnappyData will
necessarily perform better as any of them would need to use a connector to
interact with Spark. The better comparison would be between SnappyData and
Ignite, as Ignite contains a direct Spark abstraction called "IgniteRDD." That
said, the majority of the comparisons/benchmarks we've run have been against
MemSQL+Spark and Cassandra+Spark, so I don't have much to say about Ignite vs
SnappyData.

User manigandham mentions SnappyData's Approximate Query Processing features
(called Synopses Data Engine), which are unique within this space, but
discussing them would take this too far afield.

------
djhworld
It's going to take me a while to go through all of this, so I'll miss the boat
on the HN comments

However, I'd really like to credit the author on the design of the site, a
pleasure to read.

------
IgorPartola
Question: why do we still use sequential IDs for identifying objects in data
stores? There are many known problems with using numeric IDs (anyone remember
Twitter's overflow issue?), so when they are sent to a remote system, they
should probably be strings anyway. Sorting or doing math on IDs is basically
useless. Memory savings are the only thing I can think of, and while yes
that's a nice feature, when a cluster of N nodes is generating IDs that are
supposed to be globally unique, aren't UUID-type things better?

~~~
traxtech
Because (B-tree) indexes on sequential IDs are more efficient.
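
A rough way to see the effect, using a sorted Python list as a stand-in for an ordered index (an assumption: a real B-tree splits pages rather than shifting a flat array, but the locality argument is similar). The helper `elements_shifted` is a hypothetical name that counts how many existing entries sit to the right of each insertion point.

```python
import bisect
import random
import uuid


def elements_shifted(keys):
    """Insert keys into a sorted list; count entries that must move to make room."""
    index, shifted = [], 0
    for key in keys:
        pos = bisect.bisect_right(index, key)
        shifted += len(index) - pos  # everything right of pos has to move
        index.insert(pos, key)
    return shifted


n = 1000
sequential_ids = list(range(n))

# Seeded so the run is repeatable; these are random version-4 style UUID strings.
rng = random.Random(42)
random_ids = [str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(n)]

sequential_cost = elements_shifted(sequential_ids)
random_cost = elements_shifted(random_ids)
```

Sequential keys always land at the end, so `sequential_cost` is 0, while random keys land all over the index. In a B-tree the analogous cost shows up as writes scattered across leaf pages instead of appends to the rightmost one.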

------
throwaway43987
Is this use of Hazelcast in Apereo CAS safe? It is built on IMap.
[https://apereo.github.io/cas/5.1.x/installation/Hazelcast-Ti...](https://apereo.github.io/cas/5.1.x/installation/Hazelcast-Ticket-Registry.html)

~~~
aphyr
I assume that if you lose your ticket you can always reauthenticate to get a
new one, so... that doesn't seem like it'd be too bad, right?

------
t1o5
We have been using Hazelcast for a while now. Currently evaluating K tables to
see if we could get the same performance. Any lessons or advice would be
welcome if anyone has made the switch.

~~~
SliderUp
What are K tables?

~~~
jermo
Presumably Kafka Streams KTables

------
grabcocque
Fairly brutal for a Jepsen report.

What fascinates me about the now-known-to-be-untrue claims in the Hazelcast
documentation is that the claims were made despite the developers having no
way of knowing if they were true because they’d not conducted these kinds of
tests themselves.

Documentation that reflects what the developers wish were true rather than
what is actually true is not a new phenomenon, but is potentially fatal for
this kind of software.

~~~
ealexhudson
For someone listed as a partner on their homepage to be actively attempting to
remove the system entirely is also pretty bad news. Presumably a number of
users of Hazelcast have been bitten by problems, and enough of them are able
to attribute them to Hazelcast...

