Controversial opinion here, but all of these distributed streaming architectures are massively overused. They certainly have their place, but you probably don't need them. I see it all the time with ML work: you wind up using a cluster to overcome the memory inefficiency of Spark when you could have just used a single machine. For example, I've run huge graph clustering models on a single machine just by being smart about memory consumption; doing the same job in Spark would have taken an enormous, expensive cluster.
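For a flavor of what "being smart about memory" can look like - a hypothetical sketch, not the parent's actual setup - here's degree counting over a binary edge list that never loads the whole graph into RAM; the file name, layout, and node count are all made up:

    import numpy as np

    # Hypothetical layout: edges.bin is a flat array of (src, dst) uint32 pairs.
    NUM_NODES = 50_000_000  # assumed known in advance

    edges = np.memmap("edges.bin", dtype=np.uint32, mode="r").reshape(-1, 2)

    # Stream the edge list in chunks; the OS pages in only what we touch,
    # so peak RAM is one chunk plus the degree array, not the whole graph.
    degrees = np.zeros(NUM_NODES, dtype=np.uint32)
    CHUNK = 10_000_000
    for start in range(0, len(edges), CHUNK):
        chunk = edges[start:start + CHUNK]
        np.add.at(degrees, chunk.ravel(), 1)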
This has been my experience, too. I worked at just one place that had a really good handle on high-volume, high-velocity streaming data, and they didn't use Flink or Storm or Kafka or anything like that. They mostly just used the KISS principle and a protobuf-style wire format.[1]
There is definitely a point where these sorts of scale-out-centric solutions are unavoidable. Short of that point, though, they're probably best avoided.
[1]: (It's truly amazing how many CPU cycles you can reclaim just by removing branch instructions from your message deserialization code.)
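A toy sketch of the footnote's point, with a made-up three-field message layout. In a compiled language the effect is much starker, but the contrast is the same: a fixed wire layout decodes in straight-line code, while a self-describing format pays a branch per field:

    import struct

    # Hypothetical fixed wire layout: msg_id (u64), timestamp (u64), value (f64).
    # One precompiled Struct = one straight-line decode, no per-field branching.
    RECORD = struct.Struct("<QQd")

    def decode(buf: bytes, offset: int = 0):
        return RECORD.unpack_from(buf, offset)

    # Versus a tag-based format, where every field costs a branch per message:
    def decode_tagged(buf: bytes):
        fields, offset = {}, 0
        while offset < len(buf):
            tag = buf[offset]
            if tag == 1:
                fields["msg_id"], = struct.unpack_from("<Q", buf, offset + 1)
            elif tag == 2:
                fields["timestamp"], = struct.unpack_from("<Q", buf, offset + 1)
            else:
                fields["value"], = struct.unpack_from("<d", buf, offset + 1)
            offset += 9
        return fields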
That's a smart way of doing it. In much ML work, I've found that the hash trick can be used to train many more things in a constant-memory, out-of-core fashion than most people think.
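A minimal sketch of that, assuming scikit-learn and a made-up line-oriented file format: FeatureHasher pins the feature space to a fixed width, so memory stays constant no matter how many distinct tokens stream past, and SGDClassifier.partial_fit consumes the data batch by batch.

    from sklearn.feature_extraction import FeatureHasher
    from sklearn.linear_model import SGDClassifier

    def iter_batches(path, batch_size=10_000):
        # Hypothetical reader: each line is "label<TAB>space-separated tokens".
        tokens, labels = [], []
        with open(path) as f:
            for line in f:
                label, text = line.rstrip("\n").split("\t", 1)
                tokens.append(text.split())
                labels.append(int(label))
                if len(tokens) == batch_size:
                    yield tokens, labels
                    tokens, labels = [], []
        if tokens:
            yield tokens, labels

    hasher = FeatureHasher(n_features=2**20, input_type="string")
    clf = SGDClassifier(loss="log_loss")

    for tokens, labels in iter_batches("huge_dataset.txt"):
        X = hasher.transform(tokens)  # sparse, fixed-width, no vocabulary kept
        clf.partial_fit(X, labels, classes=[0, 1])

The key property is that no vocabulary dictionary is ever materialized; collisions in the 2^20-wide hash space are the price, and in practice it's usually a small one.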
We have also seen this. The huge memory capacities now offered in the cloud make single-node processing very capable, with entire TB-scale datasets fitting into RAM and running quickly enough to offset the hourly cost. Spark clusters are more efficient for longer-running or continuous background processes, though.
I've seen the opposite in our company. We did some data science on data that wasn't that big, but big enough not to fit on one machine. Everyone was reluctant to move to Spark, so the data scientists computed their models over a subset of the data: one day instead of one week. And after the project finished and was handed off to the client, they realized that the subset was not representative enough to meet the criteria.
Of course, it's kinda their fault for assuming the subset was representative, but if you start checking every assumption you make in data science, you won't get far.
Had they used Spark from the beginning, they would not have had that last-moment surprise, because they would have worked on a properly sized dataset from the start.
First, the case you describe is one that's covered by using Spark for batch processing. Parent was criticizing using it and other tools for stream processing. The two are very different use cases.
Second, just gotta call out that 2nd paragraph. A qualified data scientist should have solid training in statistics. And someone with solid training in statistics should rarely if ever assume that one day is representative of the whole week in the first place, regardless of whether they subsequently check that assumption. They would probably start from the presumption that one day is not representative of the whole week, because that is self-evidently going to be near-universally the case for any data that measures something about human behavior.
Re: assumptions -- it all depends on the data. If your data is, say, a collection of bird songs on the Galapagos, then assuming there's a day-of-week correlation is counterproductive. If a data scientist tells me they need to spend a day or two checking whether such an assumption is correct, I would look for a more productive scientist. Time to market is critical.
And there are many such assumptions in every project; just try to do anything with data science while checking every assumption you make. As I said above, you won't get far.
By the way, the majority of theorems in statistics start with "take N independent, identically distributed variables." In the real world nothing is truly independent, but we still make useful predictions.
Sorry, wasn't clear enough on the "human behavior" bit. Where data scientists are dealing with business problems, you can pretty much always assume that the data describes human behavior at least in part. Doubly so if you're talking about data that's big enough to not be processable on a single computer, and also needs to be handled in short order, since that implies we're in Big Data (emphasis on the capital B and D) territory.
The counter-hypothetical you're giving just isn't plausible for the situation at hand. Analyzing bird songs on the Galapagos is scientist work, not data scientist work. Analyzing bird songs in an urban area might more plausibly be data scientist work, in that it gets you back toward business applications, but also gets us back into a spot where there's no way you could just assume there's no day-of-week effect.
I'm gonna stand firm on that one. That sort of thing is worrisome - I don't care if it's someone who holds a PhD or someone who just took a MOOC or two, data scientists have a professional responsibility to be way less sloppy than that.
Birdsong actually isn't a good example, because birds sing less the more noise there is. If there is more activity during certain days of the week, the sample will be biased. You really can't predict these kinds of correlations ahead of time, which is why it is always critical that your data is not collected in a biased way. The data scientists should have randomly subsampled from each day of the week instead. That is extremely common practice, and there's nothing wrong with it if the subsampled data gives acceptably credible results.
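In pandas terms, a minimal sketch (the file name and column are placeholders): sample the same fraction from every day of the week, rather than taking whole days.

    import pandas as pd

    df = pd.read_parquet("events.parquet")  # hypothetical source
    df["ts"] = pd.to_datetime(df["ts"])

    # 1% sample stratified by day of week, so no weekday is
    # over- or under-represented in the subset.
    sample = df.groupby(df["ts"].dt.day_name()).sample(frac=0.01, random_state=42)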
I've never heard of people only using certain days of the week. As a practicing and productive data scientist, that sounds completely insane to me. I would never hire a data scientist who does not understand why such practices are bad.
Data scientist is just a buzzy name for statistician. Data scientists who do not understand basic good statistical techniques are not competent.
What's the easiest way to get started with trying Apache Kafka/Spark/Flink on the cloud? If I want to try out Redis there's RedisLabs, CloudAMQP for RabbitMQ, Compose for Postgres/Redis/RabbitMQ, offerings like Google Cloud SQL/MemoryStore and AWS RDS/ElastiCache, etc. Where do I go for some easy Apache deployments?
Yeah, or any other input. I don't think it's tied explicitly to Kinesis. This is definitely easier than other ways to deploy! EMR also has Flink as an option.
Pretty much all of these Hadoop-adjacent things are more or less Linux-only, and certainly unix-y-thing-only. What other platform do you want to run them on?
"Make a PR" actually means "This seems not to be relevant enough for the current maintainers to do it, and if it feels important to you, you should write it yourself and post a patch", which is a good answer.
> "Make a PR" actually means "This seems not to be relevant enough for the current maintainers to do it, and if it feels important to you, you should write it yourself and post a patch", which is a good answer.
So we just assume that a desire to do something translates to an _ability_ to do it?
Apache Flink, Flume, Storm, Samza, Spark, Apex, and Kafka all do basically the same thing. I feel like this is a bit overboard. And this is before we talk about the non-Apache stream-processing frameworks out there.
* Apache Flink is an open-source stream processing framework.
* Apache Flume is distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data.
* Apache Storm is a distributed stream processing computation framework.
* Apache Samza is an open-source near-realtime, asynchronous computational framework for stream processing.
* Apache Spark is an open-source distributed general-purpose cluster-computing framework.
* Apache Apex is a YARN-native platform that unifies stream and batch processing.
* Apache Kafka is an open-source stream-processing software platform.
> Apache Flink, Flume, Storm, Samza, Spark, Apex, and Kafka all do basically the same thing.
Well, no, you went too far.
Kafka is primarily used for communication & data transport by most people. (It can be used in other ways, and it has the Kafka Streams library that lets you do some computation on said data, but it is primarily a transport & communication mechanism; also maybe storage, if you squint right.)
Spark and Flink might look similar at first sight, but if you look a bit closer you realize Spark is primarily geared towards batch workloads, and Flink towards realtime. Sure, you can do micro-batching in Spark and pretend that's realtime stream processing, but its focus is fairly clear - as is Flink's. So both have a legitimate right to exist.
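To make the micro-batch point concrete, a minimal Spark Structured Streaming sketch (the host/port and the interval are placeholders): the trigger is what makes the batch-driven model explicit.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("microbatch-demo").getOrCreate()

    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    counts = lines.groupBy("value").count()

    # The trigger is the micro-batch model made visible: Spark wakes up
    # every 10 seconds and processes whatever accumulated, rather than
    # reacting per-event the way Flink does.
    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .trigger(processingTime="10 seconds")
                   .start())
    query.awaitTermination()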
I'm not sure about the others; I haven't used them. There may indeed be considerable overlap - but I'm sure they take different approaches. What's wrong with that?
On the contrary, I didn't go far enough. I didn't talk about Apache Gearpump, NiFi, Beam, Ignite, or Trident.
I know there are subtle differences in each specific technology, and those are probably the same justifications used to support building yet another very similar framework.
> What's wrong with that?
I believe it drastically reduces adoption of these tools, because many of us avoid what appear to be bandwagon technologies: we don't want to consciously add layers of future technical debt when the majority of these projects will be abandoned.
In the case of Spark and Flink, I wouldn't say that batch processing versus realtime stream processing are "subtle differences". That's akin to arguing that relational databases vs. document stores vs. timeseries databases just "muddy the waters".
Hacker News and Reddit have a lot of interesting discussion, but the audience skews toward client-side webdev and students or younger developers - an audience accustomed to libraries and frameworks with a fairly low learning curve, the kind you can spin up in a CodePen and see working right away.
Heavy-lifting server-side tools, especially those that only earn their keep at scale, are a different beast. And that's OKAY. Quite frankly, if you're "not sure" whether you need a stream processing platform in your architecture, then YOU DON'T. Aside from some consultants and salespeople, no one's really going to push you toward adopting this stuff.
In the overwhelming majority of use cases, what you need is a tiny microservice (in your language of choice) that reads from a Kafka or Rabbit topic and stores state in your cache system of choice. By the time you reach the scale where that's not suitable, your organization probably won't need a web forum thread to educate you on what the vendor landscape looks like.
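Something like this, as a sketch - the topic name, event shape, and Redis key scheme are all made up, and kafka-python/redis-py stand in for whichever clients you prefer:

    import json

    import redis
    from kafka import KafkaConsumer  # kafka-python

    cache = redis.Redis(host="localhost", port=6379)

    consumer = KafkaConsumer(
        "events",                                # hypothetical topic
        bootstrap_servers="localhost:9092",
        group_id="counter-service",
        value_deserializer=lambda b: json.loads(b),
    )

    # The entire "stream processor": consume, update state in the cache, repeat.
    for msg in consumer:
        event = msg.value
        cache.hincrby(f"user:{event['user_id']}", event["action"], 1)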
To be a bit more future-proof, you should give Apache Beam a try. The same code should (theoretically) work with any of the supported runners[0], so you could deploy it on top of the most suitable framework/technology for your specific workload (see the sketch below). Moreover, at this point the community has several examples of how to add an additional runner.
Edit: I didn't see you mentioned Beam in your second salvo :)
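A minimal Beam sketch of that portability claim, assuming the standard apache_beam Python SDK: the only thing that changes between engines is the runner option.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Swap DirectRunner for FlinkRunner, SparkRunner, DataflowRunner, ...
    opts = PipelineOptions(runner="DirectRunner")

    with beam.Pipeline(options=opts) as p:
        (p
         | beam.Create(["to be or not to be"])
         | beam.FlatMap(str.split)
         | beam.combiners.Count.PerElement()
         | beam.Map(print))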
>I believe it drastically reduces adoption of these tools, because many of us avoid what appear to be bandwagon technologies: we don't want to consciously add layers of future technical debt when the majority of these projects will be abandoned.
It doesn't seem to be impacting adoption of the major ones (Spark, Kafka, Flink) from what I can see - assuming you're at a scale where you need these technologies, which most companies are not, so it's a good thing they don't implement them just for kicks or as "future-proofing." If you have a business problem that the main frameworks can't solve, then trying out some of the other ones may be worth the cost. There's also ongoing support for older frameworks (Storm), so it's not like your code becomes useless.
The approach open source is taking to resolve this issue is to embrace the diversity but create unified APIs on top of it: Apache Beam for defining data workflows and Arrow for data exchange formats. There's also always SQL, which works with a lot of the data warehouse solutions out there (with some tweaks per solution, unfortunately).
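On the Arrow side, a minimal sketch of what a unified exchange format buys you (the file name is a placeholder): write once, and any Arrow-speaking tool - pandas, Spark, DuckDB - can map it back without a bespoke converter.

    import pyarrow as pa
    import pyarrow.ipc as ipc

    table = pa.table({"user_id": [1, 2, 3], "clicks": [10, 20, 30]})

    # Write once in Arrow's columnar IPC file format...
    with pa.OSFile("events.arrow", "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # ...and any Arrow-aware consumer can memory-map it back, zero-copy.
    with pa.memory_map("events.arrow", "r") as src:
        loaded = ipc.open_file(src).read_all()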
> I believe it drastically reduces adoption of these tools
I really don't see why. From afar they might appear to perform the same task, but once you take a closer look you quickly realize they follow fundamentally different architectures and have significantly different deployment and performance characteristics.
It's just like there are all kinds of hammers because there are plenty of different uses - and even in the subset involving driving nails there are significantly different requirements.
I said "you went to far" to claim Kafka, Spark and Flink do basically the same thing. It's reasonable to use all 3 of them in the same team - so they clearly don't do the same thing.
> I believe it drastically reduces adoption of these tools
As does any competition. Tons of smartphone makers = less adoption for any one of them - and many will close down. Still not a bad thing.
Those are open-source frameworks, which is antithetical to competition. The whole point of open source is that the contributions of one entity can benefit everyone, while in this situation having so many frameworks "dilutes" the effort, because the same problems need to be solved in each framework. Smartphone makers aren't here to share their technical advances with everyone; they're here to make money.
I'm not saying we should purposefully agree on killing all but one framework, but the ideal situation is for one or two to come out on top as the best in their category, so that everyone just uses them and all efforts converge toward those.
I don't think that's how it works; "best" is sooo hard to define. Humans are very diverse and have different needs/preferences. Also, what is best today may not be best tomorrow (not because of the system itself, but because of changes in the environment), so it pays off to have "sub-par" systems evolving in parallel, with slightly different goals or implementations.
I don't see "open-source" as being antithetical to competition, at all. In fact, e.g. Databricks only exists because of Spark. There is competition in the open-source world too, and IMO that's a good thing.
Since when is there no competition and diversity in open source? Isn't too much choice usually a major complaint with Linux, programming languages, frameworks, etc.?
>I'm not saying we should purposefully agree on killing all but one framework, but the ideal situation is for one or two to come out on top as the best in their category, so that everyone just uses them and all efforts converge toward those.
That assumes you can have a single framework with all the features people want without causing issues (conflicting basic designs, high cost to maintain, deployment costs, configuration costs, etc.).
In the beginning, there was Hadoop, which was just a MapReduce clone, more or less.
And then there was Hadoop 2, which was kind of meant to do everything.
And that didn't really work out, so there was specialisation.
I don't really see the problem. Many of these do different things, others do similar things different ways. Some are essentially dead ends (I doubt there are many new users of Storm, say). This all seems pretty normal.
Hadoop MapReduce does what Google MapReduce does, and Hadoop Distributed File System does what the Google Filesystem does.
And Apache Spark does the same thing as that other Google product whose name I forget.
And Drill vs Dremel, etc etc etc.
I think that the OP is plenty right for the purposes of the point they were making. And I think (from a non-Googler's perspective) that it's a worthwhile point, since the FOSS ecosystem has since evolved along similar lines to Google's internal tools.
I'm not sure I'd want to replace an existing, working Kafka setup with it, but if Pulsar catches on, I would very strongly consider using it for newer projects.
I haven't used Pulsar extensively in production for anything, but all the setup and testing I've done on my own has been a pleasure. Hoping Pulsar continues to grow.