
Show HN: Jet – in-memory, fault-tolerant, distributed stream processing - cangencer
https://github.com/hazelcast/hazelcast-jet
======
davewritescode
Jet looks really cool, but I think we'll stick with Flink for the time
being.

I say this as someone who got burned hard with weird bugs using Hazelcast 2.X
as a distributed lock manager. I'll have a hard think before adopting any part
of the Hazelcast ecosystem in the future after that experience. When the
analysis of Hazelcast 3.x was posted on jepsen.io
([https://jepsen.io/analyses/hazelcast-3-8-3](https://jepsen.io/analyses/hazelcast-3-8-3))
I had a good laugh because a number of issues that were exposed, we had seen
in production in older versions. Locks claimed on both sides of a cluster
partition, locks never getting released when a node crashed while running,
memory leaks, etc. In the end, we had the option of upgrading to 3.X or
dumping it entirely in favor of ZooKeeper + Curator. We chose the latter and
haven't had issues with our locking system once and nobody has gotten paged in
the middle of the night because of a ZooKeeper issue.

After that experience, I'll take every guarantee made by Hazelcast with a
giant grain of salt. I've heard good things about later versions so I'm going
to assume things have improved but I implore people to look very closely at
solutions like these and in particular, the guarantees they make before
picking any of them.

~~~
jerrinot
The truth is the original Hazelcast replication protocol was not a good fit
for some data-structures. We took the analysis seriously. I know every project
and vendor claims that. Here is what we did in recent years:

1\. Re-implemented concurrency primitives on top of Raft protocol. This
includes Distributed Locks, Semaphores, AtomicLong, etc. Raft provides
linearizability and that's what you usually want for concurrency primitives.
See our epic blog post about locking: [https://hazelcast.com/blog/long-live-
distributed-locks/](https://hazelcast.com/blog/long-live-distributed-locks/)
or our Jepsen testing story: [https://hazelcast.com/blog/testing-the-cp-
subsystem-with-jep...](https://hazelcast.com/blog/testing-the-cp-subsystem-
with-jepsen/)

2\. Added a FlakeID generator. This is on the opposite side of the consistency
spectrum - it's a k-ordered Available (wrt CAP) ID generator. It won't
generate duplicates even when there is a split-brain. See:
[https://docs.hazelcast.org/docs/4.0.2/manual/html-
single/ind...](https://docs.hazelcast.org/docs/4.0.2/manual/html-
single/index.html#flakeidgenerator)

3\. PNCounter - CRDT-based eventually consistent data structure, suitable for
.. well, counting things:) See: [https://en.wikipedia.org/wiki/Conflict-
free_replicated_data_...](https://en.wikipedia.org/wiki/Conflict-
free_replicated_data_type)

4\. Significantly extended documentation, to be more explicit about Hazelcast
replication models and guarantees. The goal is clear: Avoid Surprises. See:
[https://docs.hazelcast.org/docs/4.0.2/manual/html-
single/ind...](https://docs.hazelcast.org/docs/4.0.2/manual/html-
single/index.html#consistency-and-replication-model)
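
For readers unfamiliar with the flake-ID idea in point 2, here is a minimal single-process sketch in Python. It is not Hazelcast's implementation: the class name, the bit widths and the injectable clock are all assumptions made for illustration. The point it shows is that giving each node its own id bits lets nodes hand out IDs without talking to each other (so a split-brain cannot cause duplicates), while the timestamp prefix keeps IDs roughly time-ordered.

```python
import time


class FlakeIdGenerator:
    """Toy k-ordered ID generator with layout [timestamp ms | node id | sequence]."""

    NODE_BITS = 16      # assumed width: up to 65,536 nodes
    SEQUENCE_BITS = 6   # assumed width: 64 ids per node per millisecond

    def __init__(self, node_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= node_id < (1 << self.NODE_BITS):
            raise ValueError("node_id out of range")
        self.node_id = node_id
        self.clock = clock          # injectable so the sketch can be tested
        self.last_ms = -1
        self.sequence = 0

    def next_id(self):
        now = self.clock()
        if now < self.last_ms:
            # Clock went backwards (or we borrowed ahead): reuse the last timestamp.
            now = self.last_ms
        if now == self.last_ms:
            self.sequence += 1
            if self.sequence >= (1 << self.SEQUENCE_BITS):
                # Sequence exhausted for this millisecond: borrow the next one.
                now += 1
                self.sequence = 0
        else:
            self.sequence = 0
        self.last_ms = now
        timestamp_part = now << (self.NODE_BITS + self.SEQUENCE_BITS)
        node_part = self.node_id << self.SEQUENCE_BITS
        return timestamp_part | node_part | self.sequence
```

Two generators with different `node_id` values never produce the same ID, even if their clocks agree exactly, because the node bits differ. IDs from a single generator are strictly increasing.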

Disclaimer: Obviously I am biased as I work for Hazelcast.
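
As a rough illustration of the PN-counter idea in point 3, here is a minimal state-based sketch in Python. It is a textbook-style CRDT, not Hazelcast's PNCounter code; the class and method names are hypothetical. Each replica only ever grows its own entries, and merging takes the element-wise maximum, which is why replicas converge no matter how often or in what order they exchange state.

```python
class PNCounter:
    """State-based PN-counter: per-replica grow-only tallies of adds and subtracts."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.increments = {}  # replica id -> total added by that replica
        self.decrements = {}  # replica id -> total subtracted by that replica

    def add(self, delta):
        # A replica only ever touches its own slot, and only ever grows it.
        target = self.increments if delta >= 0 else self.decrements
        target[self.replica_id] = target.get(self.replica_id, 0) + abs(delta)

    def value(self):
        return sum(self.increments.values()) - sum(self.decrements.values())

    def merge(self, other):
        # Element-wise max is commutative, associative and idempotent,
        # so merge order and repeated merges do not matter.
        pairs = (
            (self.increments, other.increments),
            (self.decrements, other.decrements),
        )
        for mine, theirs in pairs:
            for rid, count in theirs.items():
                mine[rid] = max(mine.get(rid, 0), count)
```

For example, if replica "a" adds 5 while replica "b" adds 3 and subtracts 2 during a partition, both report 6 once they have merged each other's state.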

------
loremipsium
spark, storm, flink, beam, hazelcast... and then there are all the vendor
locked choices: confluent, kinesis, azure probably has something in that space
too

The whole cloud computing space got me confused. I don't know what horse to
bet on and don't have the time to get familiar with every new framework. Is
this the new javascript world? If so I'd like to skip the next couple of years
until we've found our react equivalent.

edit: Not to be read as an invitation to discuss how react is not the de-facto
standard of ui web frameworks

~~~
imglorp
Streaming Systems (the O'Reilly trout book) has a nice overview of the
streaming landscape (the first four you mentioned). The first several chapters
are a general tech background of stream processing: events, watermarks,
redundancy etc.

[http://streamingsystems.net/fig/10-36](http://streamingsystems.net/fig/10-36)

~~~
nwsm
I already have Designing Data-Intensive Applications (2017), do you think I
would get much more out of that book?

~~~
imglorp
Yeah I have that one too and yes it's a good general overview of the broad
space and yes there is some overlap. If you're thinking of selecting a
streaming solution in particular--and it's definitely not for everyone--
Streaming Systems is more in depth into the workings and tradeoffs and might
be helpful understanding your problem. I'd check the TOC on the link above to
decide.

------
jmnicolas
>High-throughput, large-state stream processing. For example, tracking GPS
locations of millions of users, inferring their velocity vectors.

It baffles me they're so casual about it ...

~~~
rswail
Inferring velocity vectors would be very useful for analyzing traffic flows,
impacts of lane widening/reducing, signal timing, ML for adaptive traffic
management, etc.
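
To make "inferring velocity vectors" concrete, here is a small Python sketch that estimates an (east, north) velocity from two consecutive GPS fixes. It is only an illustration, not anything Jet ships: the function name and the `(lat, lon, timestamp)` tuple layout are assumptions, and it uses a flat-earth (equirectangular) approximation that is only reasonable for fixes a short distance apart.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def velocity_vector(fix_a, fix_b):
    """Return (east, north) velocity in m/s between two (lat_deg, lon_deg, unix_seconds) fixes."""
    lat_a, lon_a, t_a = fix_a
    lat_b, lon_b, t_b = fix_b
    dt = t_b - t_a
    if dt <= 0:
        raise ValueError("fixes must be in increasing time order")
    # Longitude lines converge towards the poles, hence the cos(latitude) factor.
    mean_lat = math.radians((lat_a + lat_b) / 2)
    east_m = math.radians(lon_b - lon_a) * math.cos(mean_lat) * EARTH_RADIUS_M
    north_m = math.radians(lat_b - lat_a) * EARTH_RADIUS_M
    return east_m / dt, north_m / dt
```

In a streaming job this would run per user over a sliding pair of fixes, which is where the "large state" in the quoted use case comes from: the last fix of every tracked user has to be kept around.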

None of those things are nefarious, and they don't necessarily provide
additional knowledge, as long as care is taken to fully anonymize the data,
fuzz start/stop/end locations of trips, and not associate trips together.

People agree to provide this information to services like Waze etc for exactly
these tasks.

~~~
kgraves
Which is still a disgusting and unethical use of technology.

~~~
t0mas88
I use Waze, I'm choosing to give them my driving data to get information on
traffic in return. That's not unethical, that's a choice I actively make.

Them selling my data to others without telling me is unethical, but that's not
the case that Jet describes.

~~~
kgraves
Would you be interested in giving them your phone number? How about your
contacts as well? Maybe your own voice? The apps you've installed? And when
you're not using the app, keep them posted on your location while you're at it
too.

> Them selling my data to others without telling me is unethical.

Don't you think Google does this to you already?

~~~
greggman3
Can you link to some evidence showing where google is selling data to others?

~~~
kgraves
Ever heard of 'Real Time Bidding?', Google sells your data to advertisers to
the highest bidder [0]. Planning on becoming an activist? I've got news for
you, law enforcement also want your data and Google sells it to them upon
request at any time. [1]

[0] [https://www.eff.org/deeplinks/2020/03/google-says-it-
doesnt-...](https://www.eff.org/deeplinks/2020/03/google-says-it-doesnt-sell-
your-data-heres-how-company-shares-monetizes-and)

[1] [https://www.nytimes.com/2020/01/24/technology/google-
search-...](https://www.nytimes.com/2020/01/24/technology/google-search-
warrants-legal-fees.html)

~~~
greggman3
I don't think either of those links say what you think they say.

~~~
kgraves
Do you believe Google when they say 'We do not sell personal information to
anyone?'

------
grillorafael
This has ways to handle all the problems I currently implement manually. Any
idea of getting a Python API?

~~~
haxen
Hazelcast Jet will get an SQL API soon, and we're actively considering first-
class support from other languages as well.

------
victor106
I am new to this space, so sorry if this is not a valid comparison. But how
does this compare to Kafka?

~~~
tyingq
It's in this "Dataflow Programming" category:
[https://en.m.wikipedia.org/wiki/Dataflow_programming](https://en.m.wikipedia.org/wiki/Dataflow_programming)

So, more comparable to Apache Beam, like a fancy ETL. Programming via pipes,
transformations, etc.

It would hook to a Kafka (or other) stream.
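
The "pipes and transformations" style can be sketched with plain Python generators. This is neither Jet's nor Beam's API; the stage names and the `user,amount` record format are made up for illustration, and an in-memory list stands in for the Kafka topic. Each stage consumes the previous one lazily, so records flow through one at a time.

```python
def source(lines):
    """Source stage: emit raw records one at a time."""
    yield from lines


def parse(records):
    """Transform stage: split 'user,amount' text into typed tuples."""
    for record in records:
        user, amount = record.split(",")
        yield user, float(amount)


def large_only(events, threshold):
    """Filter stage: pass through only events at or above the threshold."""
    return (event for event in events if event[1] >= threshold)


def total_per_user(events):
    """Sink stage: aggregate amounts per user."""
    totals = {}
    for user, amount in events:
        totals[user] = totals.get(user, 0.0) + amount
    return totals


raw = ["alice,12.5", "bob,3.0", "alice,7.5", "bob,20.0"]
result = total_per_user(large_only(parse(source(raw)), threshold=5.0))
print(result)  # {'alice': 20.0, 'bob': 20.0}
```

A real engine adds what this sketch lacks: distribution across nodes, windowing over unbounded input, and fault tolerance.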

------
liminal
I'm a bit surprised all these systems continue to be built on the JVM. For
these sorts of tasks I'd expect something without a VM like Rust to be a
better choice

------
drej
Regarding the two licences, one for the library itself, one for the connectors
- what does it mean for users, in practice? Thanks.

~~~
cangencer
The license is meant to prevent service-wrapping by cloud providers, other
than that it doesn't have any implications for standard usage. The core
library / server is Apache 2 and the rest of the connectors are community
license. You can use and embed both the core module and the connectors for
free.

The license itself is similar to the licenses from Confluent, Elastic among
many others. You can read more about it here:
[https://hazelcast.org/blog/announcing-the-hazelcast-
communit...](https://hazelcast.org/blog/announcing-the-hazelcast-community-
license/)

------
forgotmyhnacc
How does this compare to Apache beam?

~~~
haxen
An Apache Beam Runner is already implemented in Jet:
[https://beam.apache.org/documentation/runners/jet/](https://beam.apache.org/documentation/runners/jet/)

Beam is just an API layer with different backing implementations. But you
don't typically use Beam to work with Jet, instead you use its own Pipeline
API which is mostly like Java Streams. Jet will also soon get an SQL API.

~~~
netgusto
Very cool! Is it possible to mix apis in a single project with Jet Beam
Runner? This would make it easier to port Beam projects to Jet, as the
migration could be progressive.

~~~
cangencer
Do you mean for a single job/pipeline? This wouldn't be possible at the
moment. Our current focus has shifted from Beam a little bit - as we found out
the beam threading model didn't play nicely at all with Jet's green threads
(there is no way to distinguish between blocking and non-blocking calls).

------
KptMarchewa
Why not use Apache Flink?

~~~
cangencer
While Flink is a fully-featured stream processing framework I think there
are some notable differences. Off the top of my head:

\- Flink uses Zookeeper for metadata and coordination, Jet doesn't require any
external systems for resilience.

\- Flink uses RocksDB and HDFS for checkpointing/snapshotting, Jet stores it
in a distributed, replicated in-memory store.

\- Flink allocates operators to slots, while Jet uses green
threads/cooperative multi-threading. This means you can run many concurrent
streaming jobs on the same cluster, with very low overhead.

\- Jet is basically a single, self-contained JAR. It's all you need to run a
production-grade service (+ some connectors, if you'd like)

\- Jet can scale up/down with very little friction. You start a couple of
processes and they will form a cluster automatically. Kill a couple of the
processes, and the cluster goes on.
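
The green-thread point in the list above can be illustrated with a toy cooperative scheduler in Python. This is not Jet's execution engine: generators stand in for tasklets, and all names are hypothetical. What it shows is the general technique, where many small tasks from different jobs share one worker thread by each doing a little work and then voluntarily handing control back.

```python
from collections import deque


def tasklet(job_name, items, log):
    """A cooperative task: handle one item, then yield control back to the worker."""
    for item in items:
        log.append((job_name, item))
        yield  # voluntary yield point; a blocking call here would stall every other tasklet


def run_worker(tasklets):
    """Round-robin over tasklets on a single thread until all are exhausted."""
    queue = deque(tasklets)
    while queue:
        current = queue.popleft()
        try:
            next(current)
            queue.append(current)  # not finished: go to the back of the queue
        except StopIteration:
            pass                   # finished: drop it


log = []
run_worker([tasklet("job-A", [1, 2, 3], log), tasklet("job-B", [10, 20], log)])
print(log)
# [('job-A', 1), ('job-B', 10), ('job-A', 2), ('job-B', 20), ('job-A', 3)]
```

Adding another job costs one more entry in the queue rather than another OS thread, which is the low-overhead property being claimed. The flip side is visible too: a tasklet that never yields starves everything else on that worker.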

That said, Flink has a great set of overall features, especially around
persistence and huge states. This is another area we're currently investing in
as well as SQL support.

~~~
abeppu
> Flink allocates operators to slots, while Jet uses green threads/cooperative
> multi-threading. This means you can run many concurrent streaming jobs on
> the same cluster, with very low overhead.

How does the shift to cooperative multi-threading change the way that the
cluster is used? In the "slot" approach, Alice and Bob can run concurrent jobs
with relatively little coordination needed to "share" effectively -- e.g. they
might use different branches of the same shared repo. In exchange for the
lower-overhead, does Jet's approach require that multiple use cases are more
carefully planned?

~~~
cangencer
This is indeed a question that we get asked a lot. We have so far not thought
about adding more advanced scheduling capabilities for the cooperative
threads. With the slot system, if you have 48 cores available in the cluster
and are running 8 jobs, each job will only use 6 cores. With cooperative
threading, each job runs on all 48 cores. We have tested something like
5,000 concurrent jobs on the same cluster, but essentially they may be competing
for the same resources, so you'll need to do your capacity planning
accordingly. A simple way to work around that would be to create separate Jet
nodes (a Jet node is very lightweight) so you could have separate execution
pools.
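
The arithmetic in the comment above can be written down as a back-of-the-envelope model in Python. It is a deliberate simplification with hypothetical function names: it assumes slots split cores evenly and exclusively, and that cooperative jobs with work to do share the whole cluster fairly. Real schedulers are messier on both sides.

```python
def slot_cores_per_job(total_cores, jobs):
    """Slot model as described above: cores are split evenly and exclusively."""
    return total_cores // jobs


def cooperative_cores_per_job(total_cores, busy_jobs):
    """Cooperative model: any job may use any core, so busy jobs split the whole cluster."""
    return total_cores / max(busy_jobs, 1)


print(slot_cores_per_job(48, 8))         # 6, even if the other 7 jobs are idle
print(cooperative_cores_per_job(48, 1))  # 48.0 when only one job has work
print(cooperative_cores_per_job(48, 8))  # 6.0 when all eight are busy
```

Under this model the cooperative approach never does worse on average, but it gives no isolation: one hungry job can eat into everyone else's share, which is why separate nodes are suggested as the workaround.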

