
Show HN: Interactive map for architecting big data pipelines - ddrum001
http://xyz.insightdataengineering.com/blog/pipeline_map/
======
lobster_johnson
This is very useful.

I wish it had some information about supported languages. Most of the
processing systems are JVM-based and require that you write your program in a
JVM language. Some have Python support. But I have yet to encounter one that
allows you to write your pipelines in Go, Rust or JavaScript, for example. One
notable exception is Storm, which supports pluggable runners, including one
that talks to an external program over standard I/O. My impression is that,
aside from Python, today's pipelines require a large amount of JVM buy-in,
something I'm personally not interested in.
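
To make the standard I/O idea concrete, here is a rough sketch of how an external program could frame messages for such a runner. It assumes JSON messages, each terminated by a line containing only `end`, which is my understanding of Storm's multi-lang protocol. The field names (`id`, `tuple`, `command`, `anchors`) and the `run_uppercase_bolt` helper are illustrative assumptions, and the initial handshake, heartbeats and task-id replies are left out entirely.

```python
import io
import json


def read_message(stream):
    """Read one JSON message terminated by a line containing only 'end'."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    return None  # stream closed before a complete message arrived


def write_message(stream, message):
    """Write one JSON message followed by the 'end' terminator line."""
    stream.write(json.dumps(message) + "\nend\n")
    stream.flush()


def run_uppercase_bolt(stdin, stdout):
    """Hypothetical bolt: emit each tuple's first field uppercased, then ack."""
    while True:
        msg = read_message(stdin)
        if msg is None:
            break
        # Field names below are assumptions, not a verified protocol reference.
        write_message(stdout, {"command": "emit", "anchors": [msg["id"]],
                               "tuple": [msg["tuple"][0].upper()]})
        write_message(stdout, {"command": "ack", "id": msg["id"]})


if __name__ == "__main__":
    # In-memory streams stand in for the process's real stdin/stdout.
    fake_in = io.StringIO('{"id": "1", "tuple": ["hello"]}\nend\n')
    fake_out = io.StringIO()
    run_uppercase_bolt(fake_in, fake_out)
    print(fake_out.getvalue())
```

The same loop could be written in Go, Rust or JavaScript, since all it needs is line-based reads and writes plus a JSON library.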

I'd also love some kind of metric for "aliveness". For example, my impression
is that Storm was hot for about a week, and then Spark and Flink happened, and
now nobody is talking about it, and Twitter itself has apparently replaced it
with Heron.

~~~
pixelmonkey
Storm is very much alive. Many of its users are simply running it reliably in
production now. At my company, we are well past our trillionth production
tuple running through Storm.

Also note that unlike Spark, Storm is a pure open source project that does not
have a major commercial entity marketing its use cases. Hortonworks has put a
little marketing effort behind it, but otherwise, it's just a mature & active
Apache infrastructure project. Storm 2.0 is coming out soon and features a
slew of performance- and reliability-improving enhancements.

But as for marketing buzz, Google has commercial reasons for you to use Beam
and Dataflow, for example. And likewise Databricks for Spark.

It's probably a good idea to pick production large-scale data infrastructure
on a metric other than recency of marketing buzz.

-$0.02 from one of the original authors of streamparse, the Python API for Storm

~~~
lobster_johnson
Thanks, that's helpful. Is building a pipeline with Java, consisting entirely
of shell spouts, a viable option? Are there downsides to not using the Java
API?

------
dsacco
Wow, this is awesome. What a simple yet useful idea.

This format lends itself to data processing, but I think it would be really
nice to apply it to a variety of workflows. For example, you could model the
software deployment process across different languages and frameworks. It
could be a good complement to StackShare.

A bit of constructive feedback: I'm not a stickler for UX or design, but maybe
spruce up the gray boxes a bit. I've never been a designer though, so take
that for what it's worth.

------
vosper
If you're aiming to be comprehensive, then you may want to add Onyx under
streaming processors. It's not as popular as the options you've listed though,
so I understand why it might be left off.

[http://www.onyxplatform.org](http://www.onyxplatform.org)

~~~
ddrum001
Thanks, we've included it in the Unified Batch processing because of its
ability to have large window sizes that enable batch processing.
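
As a rough illustration of that reasoning (plain Python, not Onyx's API; the `tumbling_window_sums` helper and the sample events are made up for the example): once the window is wide enough to cover every event, a windowed streaming aggregation collapses into a single batch result.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_size):
    """Group (timestamp, value) events into fixed-size windows and sum each."""
    sums = defaultdict(int)
    for timestamp, value in events:
        # Each event belongs to the window starting at the nearest lower
        # multiple of window_size.
        window_start = (timestamp // window_size) * window_size
        sums[window_start] += value
    return dict(sums)


# Hypothetical events as (timestamp, value) pairs.
events = [(0, 1), (4, 2), (7, 3), (12, 4)]

# Small windows give several streaming-style partial results.
small = tumbling_window_sums(events, window_size=5)

# One window spanning all timestamps gives a single batch-style total.
large = tumbling_window_sums(events, window_size=100)
```

Here `small` holds one sum per 5-unit window, while `large` holds a single entry covering the whole data set.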

------
lolptdr
This is awesome. Great aggregation of so many buzzwords and brand names that
I've heard over the years. Nice job!

Keep it simple and hierarchical. I suggest additional filters for each
component of the data engineering flow that can discern unique features or
commonalities.

~~~
ddrum001
Thanks, we tried to keep it simple and streamlined. If you know of any
features or common patterns that would be helpful, let us know.

------
greggyb
Interesting that Microsoft's only showing in this map is for Azure Blob
Storage.

~~~
theatraine
Batch processing: Azure Data Lake
([https://azure.microsoft.com/en-us/solutions/data-lake/](https://azure.microsoft.com/en-us/solutions/data-lake/))

Stream processing: Azure Stream Analytics
([https://azure.microsoft.com/en-us/services/stream-analytics/](https://azure.microsoft.com/en-us/services/stream-analytics/))

SQL Server is mentioned, but Azure Cosmos DB should also be mentioned
([https://azure.microsoft.com/en-us/services/cosmos-db/](https://azure.microsoft.com/en-us/services/cosmos-db/))

~~~
ddrum001
Fair point - the set of technologies is based on the teams we work closest
with, which admittedly have a bias towards open source and Linux. So far, our
map is far from comprehensive, so we appreciate the suggestions (exactly what
we're looking for from a Show HN).

To that point, just added CosmosDB, and plan to add others soon.

~~~
greggyb
Email me if you want to talk architectural patterns or Microsoft products or
both.

Details are in my profile.

------
jnatkins
StreamSets Data Collector is another useful open-source ingest tool. I'm
biased, but people seem to like it.

------
trwoway
Strange that Apache Flink and Google Dataflow don't figure in the Stream
Processing list

~~~
ddrum001
We put Dataflow and Flink in "Unified Processing" since they can handle both
batch and streaming (as opposed to tools that only handle streaming).

We might add them explicitly to Streaming as well though.

------
rahilb
Storm also has a Scala API, but it is filtered out when selecting Stream
Processing and Scala.

------
Aegeaner
Why is there no Flink under the stream processing frameworks?

~~~
drfloob
Flink is listed under `Unified Processing` as it supports both batch and
streaming (Kappa Architecture)

------
Faaak
What do you think of Arctic for the Data point?

------
rjbwork
Kind of cool, but only 2 entries from Azure that aren't on other places.

Kind of useless for us on Azure.

------
lima
Citus DB is missing.

