
AthenaX – Uber Engineering’s Open Source Streaming Analytics Platform - vquemener
https://eng.uber.com/athenax/
======
haohui
Hello, Haohui here. I'm one of the developers of AthenaX. I'm really glad that
the project has attracted so much interest.

Before AthenaX, most of our real-time analytics pipelines ran on Samza, which
to some extent can be seen as a predecessor of Kafka Streams.

A migration from Samza to Kafka Streams might seem more natural, and we
actually took a very close look at Kafka Streams, but we decided to move
to Flink, given that:

(1) Kafka Streams lacks important features like exactly-once delivery and
distributed consistent snapshots. These are essential for use cases that
require high fidelity.

(2) SQL is a must-have feature in order to empower our users, many of whom
are non-technical, to run large-scale streaming analytics in production. Kafka
Streams provides no support for that. Arguably Kafka Streams provides simple
APIs, but simple APIs by themselves are insufficient to bring analytics
applications to production. You can't ignore continuous integration and
deployment, monitoring, etc., especially since many of our users come from
non-CS backgrounds.

(3) The Apache Flink community seems more open and committed compared to the
Kafka community, particularly on the SQL side. We have collaborated with Data
Artisans, Alibaba and Huawei. All of these parties, as well as Uber, are
committed and have adequate resources to bring SQL to their respective users.
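
As a sketch of what this looks like for those users, here is a minimal windowed aggregation in the Flink/Calcite SQL dialect that AthenaX builds on; the table and column names are invented for illustration:

```sql
-- Hypothetical example: trip counts per city over a one-minute
-- tumbling window. "trips", "city_id" and "event_time" are made up;
-- TUMBLE/TUMBLE_START are Flink SQL group-window functions.
SELECT
  city_id,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS trip_count
FROM trips
GROUP BY
  city_id,
  TUMBLE(event_time, INTERVAL '1' MINUTE)
```

A non-technical user who knows basic SQL can write this without touching any streaming API.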

~~~
ako
How do you see this compared to Prestodb, which also provides SQL for Kafka
topics? Being able to join Kafka data with other tables seems like a useful
benefit of Prestodb?

~~~
haohui
It would reduce the latency of the data, but we have not tried it in production.

Given that Kafka neither provides secondary indexes nor organizes data in a
columnar format, Presto essentially has to scan through the Kafka topics to
execute the queries, resulting in a lot of disk I/O.

Our Kafka infrastructure handles more than one trillion messages per day and
guarantees second-level latency SLAs. Reading aggressively could easily
saturate the I/O bandwidth of the nodes and lead to outages. We actually had
several incidents in the past when we did backfills, so I'm more conservative
on this.

------
nevi-me
Interesting project! I only got to know about Apache Calcite a few
months/weeks ago, but it seems like it's becoming a de facto way of supporting
SQL on analytics products. I can think of Dremio [1] as an example.

From an implementation perspective, I wonder how AthenaX will compare to the
Apache Beam SQL DSL [2] (I haven't tried it out; I'm still learning DoFn,
ParDo, etc.). Beam also leverages Calcite, with the difference that it runs on
Flink, amongst other runners.

I think it's safe to say that the streaming analytics ecosystem remains
interesting and exciting! Hope I'll be able to try out AthenaX,
congratulations on the release.

[1] [https://www.dremio.com](https://www.dremio.com) [2]
[https://github.com/apache/beam/tree/DSL_SQL](https://github.com/apache/beam/tree/DSL_SQL)

~~~
haohui
I'm not fully familiar with the Apache Beam SQL DSL, so I can't really provide
anything insightful.

But yes, streaming analytics is a very exciting area. Your feedback is
appreciated.

~~~
nevi-me
I think it'll only be released in the next minor version. We currently need to
build from source to try it, so I'm waiting for the next version. I'm still
learning the Java API, which is a lot for a few hours here and there.

In the blog post, you talk about running AthenaX on YARN, so am I correct in
saying that I might not be able to run it in single-machine mode? I'd like to
try it out (SQL window functions are easier to understand than the Beam Java
ones).

Maybe Gitter or a public Slack channel could help, more users will have
queries that aren't suitable for HN nor GitHub issues.

~~~
haohui
You can definitely start a single node Flink-on-YARN cluster. That's how all
the integration tests are set up.

What do you think of posting the questions on the mailing list [1]?

[1] [https://groups.google.com/forum/#!forum/athenax-users](https://groups.google.com/forum/#!forum/athenax-users)

------
cosmie
This is super awesome!

Although it's slightly frustrating how closely named it is to the Amazon
Athena[1] service, which _also_ is a SQL-based analytics service, but quite a
bit different from Uber's project.

[1][https://aws.amazon.com/athena/](https://aws.amazon.com/athena/)

~~~
haohui
Agree :-)

Proposals on names are always welcomed.

~~~
dsnuh
What about River? Looks like there is an R package and an abandoned Apache
project that use the name, but maybe not that confusing given context.

~~~
frankmcsherry
It's a stream processing language:

[http://hirzels.com/martin/papers/spe16-river.pdf](http://hirzels.com/martin/papers/spe16-river.pdf)

~~~
dsnuh
Okay...

Tribute Tributary Brook Flow Rapid Affluent Alluvial Aquaduct Niagara Iguaza
Aquifer Flood Flashflood Headwater Tailwater Lotic Hydro Meander Outfall
Spillway Watershed Weir

------
spilk
Putting "X" and "Athena" together just makes me think of the Athena Widgets
(Xaw) toolkit for X11.

------
DandyDev
This looks like an amazing piece of software! I have a couple of questions:

\- Does it (out of the box) support mapping JSON messages on a Kafka Topic to
a table accessible from SQL?

\- How would you support the following use case: run SQL queries on one or
more Kafka Topics and expose the results in (near) real-time through a JDBC
supporting BI tool? I assume you'd use, AthenaX to run the queries and store
the results in a standard RDBMS like PostgreSQL?

~~~
haohui
(1) Yes. AthenaX provides a thin wrapper around KafkaJsonTableSource[1]
out of the box. As long as you specify the schema, it should be good to go.

(2) Your guess is correct. Internally we have AthenaX streaming data to
MemSQL, Pinot or MySQL to support the use case you described. Some of these go
through the JDBCTableSink [2], the others go through customized connectors.
AthenaX is designed to allow plugging in connectors when required.

[1] [https://github.com/apache/flink/blob/master/flink-connectors...](https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka-base/src/main/java/org/apache/flink/streaming/connectors/kafka/KafkaJsonTableSource.java)

[2] [https://github.com/apache/flink/blob/master/flink-connectors...](https://github.com/apache/flink/blob/master/flink-connectors/flink-jdbc/src/main/java/org/apache/flink/api/java/io/jdbc/JDBCAppendTableSink.java)
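
To make the flow concrete, here is a rough sketch of such a pipeline in Flink SQL: a tumbling-window aggregation over a JSON-backed Kafka source, written into a JDBC-backed sink. Every table and column name here is invented for illustration:

```sql
-- Sketch only; assumes a Kafka topic whose JSON payload has been
-- registered as a table "rider_events" with a rowtime attribute, and
-- a JDBC-backed sink table "event_counts". All names are invented.
INSERT INTO event_counts
SELECT
  event_type,
  COUNT(*) AS cnt
FROM rider_events
GROUP BY
  event_type,
  TUMBLE(rowtime, INTERVAL '5' MINUTE)
```

The BI tool then simply queries `event_counts` over plain JDBC, with no streaming awareness needed on its side.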

------
scaleout1
Interesting project, although I can't say I am happy to see SQL being used in
streaming systems like this. In my last two jobs I had to write frameworks and
tools to enable "data scientists" and "analysts" to write production jobs, and
the problem I have run into with exposing SQL to this class of user is that
every job ends up being its own special snowflake, with deeply nested SQL and
custom UDFs mixed in for good measure. The "unique" nature of each job
significantly increases the support and maintainability cost. I have come to
the conclusion that a typesafe API with map/filter/flatmap is a much better
API to expose than stringly typed SQL. I am curious to know whether Uber is
running into similar support issues?

~~~
nevi-me
From my experience teaching some graduates in a BI shop: SQL is more common,
and tools that support SQL tend to be used better.

I've "taught" them how to use Spark, but being a team of varying prior
experience, the Scala API meant learning Scala; the Python one was a bit
better, but they did much better with the SQL DSL.

Regarding your concern about maintainability: UDFs tend to be the problem. I'm
also curious to know about their support issues, and also: can anyone write
their own UDF (the code requires registering a .jar), or is there a team that
helps business users in that regard?

~~~
haohui
For UDFs we support both -- users can write their own UDFs for their queries,
and we provide a number of well-tested UDFs to them as well.
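
As an illustration, a query using a custom UDF might look like the following sketch; `geo_cell` is a hypothetical scalar UDF (say, mapping coordinates to a grid cell id) that a user would register before querying, and the table and columns are invented:

```sql
-- Sketch only; "geo_cell" is a hypothetical user-registered scalar
-- UDF, and "pickup_events" with its columns is an invented table.
SELECT
  geo_cell(lat, lon) AS cell,
  COUNT(*) AS pickups
FROM pickup_events
GROUP BY
  geo_cell(lat, lon),
  TUMBLE(event_time, INTERVAL '10' MINUTE)
```

Once registered, the UDF is called like any built-in SQL function, which is what keeps the query itself accessible to non-engineers.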

------
lindauer
Can anyone comment on how this compares to KSQL[1]?

[1]
[https://www.confluent.io/product/ksql/](https://www.confluent.io/product/ksql/)

~~~
haohui
It is great to see that SQL is becoming one of the de facto standards for
building streaming analytics applications.

Personally I'm not a big fan of KSQL, given that:

(1) KSQL is built on top of Kafka Streams. There are use cases for which we
don't think Kafka Streams is a good fit. Please see the explanations above.

(2) Inventing yet another SQL dialect is a bad idea in practice. Not only does
it incur an additional learning curve, but more importantly you have little
hope of winning on development velocity. Calcite has been used by Flink,
Storm, Beam, Dremio, etc. That community is simply way bigger than even the
total number of engineers at Confluent.

Again, this is just my personal take; it does not reflect the stance of Uber.

~~~
nevi-me
I share your sentiment here. Having a common base to work from also has the
benefit that end-users -- the BI or data science practitioners (whom we
empower as data engineers) -- can expect a stable SQL dialect that they'll be
able to use everywhere.

I haven't followed Spark in recent months, but what I recall was that the SQL
DSL had some caveats at first, because certain things weren't yet implemented.

My experience has been that projects that implement SQL typically don't
deliver the whole thing on the initial release. I imagine it's often quite a
lot of work.

The more projects that use Calcite, the more upstream contributions there
would be ...

~~~
rxin
Spark actually has probably the best standard SQL support among all open
source big data frameworks. It can run all of the TPC-DS queries without
modifications.

~~~
haohui
TPC-DS evaluates batch processing, and it is great to hear that Spark
supports all of it.

I think the streaming-processing space is still quickly evolving. Many
features like stream-table joins, CTEs, and streaming joins with late-arriving
data are unimplemented or do not even have clear semantics yet. It would be
great to see a benchmark like TPC-DS in this domain.
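
For instance, a time-windowed stream-to-stream join in Flink SQL looks roughly like this sketch (all names invented); the open semantic question is precisely what should happen when a matching row arrives after the window has closed:

```sql
-- Illustrative only: join orders to shipments that arrive within one
-- hour, by event time. "orders", "shipments" and their columns are
-- invented; rowtime is each stream's event-time attribute.
SELECT o.order_id, s.shipped_at
FROM orders o
JOIN shipments s
  ON o.order_id = s.order_id
 AND s.rowtime BETWEEN o.rowtime AND o.rowtime + INTERVAL '1' HOUR
```

Bounding the join by a time interval is what lets the engine eventually discard state; without a standard answer for late rows, different engines can return different results for the same query.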

------
sandGorgon
This uses Kafka as a core piece of infrastructure. I realize that Kafka
Streams was not available when AthenaX was built, but would it be a good
framework to move to?

