Controversial opinion here, but all of these distributed streaming architectures are massively overused. They certainly have their place, but you probably don't need them. I see it all the time with ML work: you wind up using a cluster to overcome the memory inefficiency of Spark when you could have just used a single machine. For example, I've run huge graph clustering models on a single machine just by being smart about memory consumption; doing the same job in Spark would have taken an enormous, expensive cluster.
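For a flavor of what "being smart about memory" can look like - a hypothetical sketch, not the parent's actual setup - here's degree counting over a binary edge list that never loads the whole graph into RAM; the file name, layout, and node count are all made up:

    import numpy as np

    # Hypothetical layout: edges.bin is a flat array of (src, dst) uint32 pairs.
    NUM_NODES = 50_000_000  # assumed known in advance

    edges = np.memmap("edges.bin", dtype=np.uint32, mode="r").reshape(-1, 2)

    # Stream the edge list in chunks; the OS pages in only what we touch,
    # so peak RAM is one chunk plus the degree array, not the whole graph.
    degrees = np.zeros(NUM_NODES, dtype=np.uint32)
    CHUNK = 10_000_000
    for start in range(0, len(edges), CHUNK):
        chunk = edges[start:start + CHUNK]
        np.add.at(degrees, chunk.ravel(), 1)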
This has been my experience, too. I worked at just one place that had a really good handle on high-volume, high-velocity streaming data, and they didn't use Flink or Storm or Kafka or anything like that. They mostly just used the KISS principle and a protobuf-style wire format.[1]
There is definitely a point where these sorts of scale-out-centric solutions are unavoidable. Short of that point, though, they're probably best avoided.
[1]: (It's truly amazing how many CPU cycles you can reclaim just by removing branch instructions from your message deserialization code.)
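A toy sketch of the footnote's point, with a made-up three-field message layout. In a compiled language the effect is much starker, but the contrast is the same: a fixed wire layout decodes in straight-line code, while a self-describing format pays a branch per field:

    import struct

    # Hypothetical fixed wire layout: msg_id (u64), timestamp (u64), value (f64).
    # One precompiled Struct = one straight-line decode, no per-field branching.
    RECORD = struct.Struct("<QQd")

    def decode(buf: bytes, offset: int = 0):
        return RECORD.unpack_from(buf, offset)

    # Versus a tag-based format, where every field costs a branch per message:
    def decode_tagged(buf: bytes):
        fields, offset = {}, 0
        while offset < len(buf):
            tag = buf[offset]
            if tag == 1:
                fields["msg_id"], = struct.unpack_from("<Q", buf, offset + 1)
            elif tag == 2:
                fields["timestamp"], = struct.unpack_from("<Q", buf, offset + 1)
            else:
                fields["value"], = struct.unpack_from("<d", buf, offset + 1)
            offset += 9
        return fields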
That's a smart way of doing it. In much ML work, I've found that the hash trick can be used to train many more things in a constant-memory, out-of-core fashion than most people think.
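A minimal sketch of that, assuming scikit-learn and a made-up line-oriented file format: FeatureHasher pins the feature space to a fixed width, so memory stays constant no matter how many distinct tokens stream past, and SGDClassifier.partial_fit consumes the data batch by batch.

    from sklearn.feature_extraction import FeatureHasher
    from sklearn.linear_model import SGDClassifier

    def iter_batches(path, batch_size=10_000):
        # Hypothetical reader: each line is "label<TAB>space-separated tokens".
        tokens, labels = [], []
        with open(path) as f:
            for line in f:
                label, text = line.rstrip("\n").split("\t", 1)
                tokens.append(text.split())
                labels.append(int(label))
                if len(tokens) == batch_size:
                    yield tokens, labels
                    tokens, labels = [], []
        if tokens:
            yield tokens, labels

    hasher = FeatureHasher(n_features=2**20, input_type="string")
    clf = SGDClassifier(loss="log_loss")

    for tokens, labels in iter_batches("huge_dataset.txt"):
        X = hasher.transform(tokens)  # sparse, fixed-width, no vocabulary kept
        clf.partial_fit(X, labels, classes=[0, 1])

The key property is that no vocabulary dictionary is ever materialized; collisions in the 2^20-wide hash space are the price, and in practice it's usually a small one.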
We have also seen this. The huge memory capacities now offered in the cloud make single-node processing very capable, with entire TB-scale datasets fitting into RAM and running quickly enough to offset the hourly cost. Spark clusters are more efficient for longer-running or continuous background processes, though.
I've seen the opposite in our company. We did some data science on data that wasn't that big, but big enough not to fit on one machine. Everyone was reluctant to move to Spark, so the data scientists computed their models over a subset of the data: one day instead of one week. And after the project finished and was handed off to the client, they realized that the subset was not representative enough to meet the criteria.
Of course, it's kinda their fault for assuming the subset was representative, but if you start checking every assumption you make in data science, you won't get far.
Had they used Spark from the beginning, they would not have had that last-moment surprise, because they would have worked on a properly sized dataset from the start.
First, the case you describe is one that's covered by using Spark for batch processing. Parent was criticizing using it and other tools for stream processing. The two are very different use cases.
Second, just gotta call out that 2nd paragraph. A qualified data scientist should have solid training in statistics. And someone with solid training in statistics should rarely if ever assume that one day is representative of the whole week in the first place, regardless of whether they subsequently check that assumption. They would probably start from the presumption that one day is not representative of the whole week, because that is self-evidently going to be near-universally the case for any data that measures something about human behavior.
Re: assumptions -- it all depends on the data. If your data is, say, a collection of bird songs on the Galapagos, then assuming there's a day-of-week correlation is counterproductive. If a data scientist tells me they need to spend a day or two checking whether such an assumption is correct, I would look for a more productive scientist. Time to market is critical.
And there are many such assumptions in every project; just try to do anything with data science while checking every assumption you make. As I said above, you won't get far.
By the way, the majority of theorems in statistics start with "take N independent, identically distributed variables." In the real world nothing is truly independent, but we still make useful predictions.
Sorry, wasn't clear enough on the "human behavior" bit. Where data scientists are dealing with business problems, you can pretty much always assume that the data describes human behavior at least in part. Doubly so if you're talking about data that's big enough to not be processable on a single computer, and also needs to be handled in short order, since that implies we're in Big Data (emphasis on the capital B and D) territory.
The counter-hypothetical you're giving just isn't plausible for the situation at hand. Analyzing bird songs on the Galapagos is scientist work, not data scientist work. Analyzing bird songs in an urban area might more plausibly be data scientist work, in that it gets you back toward business applications, but also gets us back into a spot where there's no way you could just assume there's no day-of-week effect.
I'm gonna stand firm on that one. That sort of thing is worrisome - I don't care if it's someone who holds a PhD or someone who just took a MOOC or two, data scientists have a professional responsibility to be way less sloppy than that.
Birdsong actually isn't a good example, because birds sing less the more noise there is. If there is more activity during certain days of the week, the sample will be biased. You really can't predict these kinds of correlations ahead of time, which is why it is always critical that your data is not collected in a biased way. The data scientists should have randomly subsampled from each day of the week instead. That is extremely common practice, and there's nothing wrong with it if the subsampled data gives acceptably credible results.
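In pandas terms, a minimal sketch (the file name and column are placeholders): sample the same fraction from every day of the week, rather than taking whole days.

    import pandas as pd

    df = pd.read_parquet("events.parquet")  # hypothetical source
    df["ts"] = pd.to_datetime(df["ts"])

    # 1% sample stratified by day of week, so no weekday is
    # over- or under-represented in the subset.
    sample = df.groupby(df["ts"].dt.day_name()).sample(frac=0.01, random_state=42)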
I've never heard of people only using certain days of the week. As a practicing and productive data scientist, that sounds completely insane to me. I would never hire a data scientist who does not understand why such practices are bad.
Data scientist is just a buzzy name for statistician. Data scientists who do not understand basic good statistical techniques are not competent.
What's the easiest way to get started with trying Apache Kafka/Spark/Flink on the cloud? If I want to try out Redis there's RedisLabs, CloudAMQP for RabbitMQ, Compose for Postgres/Redis/RabbitMQ, offerings like Google Cloud SQL/MemoryStore and AWS RDS/ElastiCache, etc. Where do I go for some easy Apache deployments?
Yeah, or any other input. I don't think it's tied explicitly to Kinesis. This is definitely easier than other ways to deploy! EMR also has Flink as an option.
Pretty much all of these Hadoop-adjacent things are more or less Linux-only, and certainly unix-y-thing-only. What other platform do you want to run them on?
"Make a PR" actually means "This seems not to be relevant enough for the current maintainers to do it, and if it feels important to you, you should write it yourself and post a patch", which is a good answer.
> "Make a PR" actually means "This seems not to be relevant enough for the current maintainers to do it, and if it feels important to you, you should write it yourself and post a patch", which is a good answer.
So we just assume that a desire to do something translates to an _ability_ to do it?
Apache Flink, Flume, Storm, Samza, Spark, Apex, and Kafka all do basically the same thing. I feel like this is a bit overboard. And this is before we talk about the non-Apache stream-processing frameworks out there.
* Apache Flink is an open-source stream processing framework.
* Apache Flume is distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data.
* Apache Storm is a distributed stream processing computation framework.
* Apache Samza is an open-source near-realtime, asynchronous computational framework for stream processing.
* Apache Spark is an open-source distributed general-purpose cluster-computing framework.
* Apache Apex is a YARN-native platform that unifies stream and batch processing.
* Apache Kafka is an open-source stream-processing software platform.
> Apache Flink, Flume, Storm, Samza, Spark, Apex, and Kafka all do basically the same thing.
Well, no, you went too far.
Kafka is primarily used for communication & data transport by most people. (It can be used in other ways, and it has the Kafka Streams library that lets you do some computation on said data, but it is primarily a transport & communication mechanism; also maybe storage, if you squint right.)
Spark and Flink might look similar at first sight, but if you look a bit closer you realize Spark is primarily geared towards batch workloads, and Flink towards realtime. Sure, you can do micro-batching in Spark and pretend that's realtime stream processing, but its focus is fairly clear - as is Flink's. So both have a legitimate right to exist.
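To make the micro-batch point concrete, a minimal Spark Structured Streaming sketch (the host/port and the interval are placeholders): the trigger is what makes the batch-driven model explicit.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("microbatch-demo").getOrCreate()

    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    counts = lines.groupBy("value").count()

    # The trigger is the micro-batch model made visible: Spark wakes up
    # every 10 seconds and processes whatever accumulated, rather than
    # reacting per-event the way Flink does.
    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .trigger(processingTime="10 seconds")
                   .start())
    query.awaitTermination()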
I'm not sure about the others; I haven't used them. There may indeed be considerable overlap - but I'm sure they take different approaches. What's wrong with that?
On the contrary, I didn't go far enough. I didn't talk about Apache Gearpump, NiFi, Beam, Ignite, or Trident.
I know there are subtle differences in each specific technology, and those are probably the same justifications used to support building yet another very similar framework.
> What's wrong with that?
I believe it drastically reduces adoption of these tools, because many of us avoid what appear to be bandwagon technologies: we don't want to consciously add layers of future technical debt when the majority of these projects will be abandoned.
In the case of Spark and Flink, I wouldn't say that batch processing versus realtime stream processing are "subtle differences". That's akin to arguing that relational databases vs. document stores vs. timeseries databases just "muddy the waters".
Hacker News and Reddit have a lot of interesting discussion, but the audience skews toward client-side webdev and students or younger developers - an audience accustomed to libraries and frameworks with a fairly low learning curve, the kind you can spin up in a CodePen and see working right away.
Heavy-lifting server-side tools, especially those that only earn their keep at scale, are a different beast. And that's OKAY. Quite frankly, if you're "not sure" whether you need a stream processing platform in your architecture, then YOU DON'T. Aside from some consultants and salespeople, no one's really going to push you toward adopting this stuff.
In the overwhelming majority of use cases, what you need is a tiny microservice (in your language of choice) that reads from a Kafka or Rabbit topic and stores state in your cache system of choice. By the time you reach the scale where that's not suitable, your organization probably won't need a web forum thread to educate you on what the vendor landscape looks like.
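Something like this, as a sketch - the topic name, event shape, and Redis key scheme are all made up, and kafka-python/redis-py stand in for whichever clients you prefer:

    import json

    import redis
    from kafka import KafkaConsumer  # kafka-python

    cache = redis.Redis(host="localhost", port=6379)

    consumer = KafkaConsumer(
        "events",                                # hypothetical topic
        bootstrap_servers="localhost:9092",
        group_id="counter-service",
        value_deserializer=lambda b: json.loads(b),
    )

    # The entire "stream processor": consume, update state in the cache, repeat.
    for msg in consumer:
        event = msg.value
        cache.hincrby(f"user:{event['user_id']}", event["action"], 1)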
To be a bit more future-proof, you should give Apache Beam a try. The same code should (theoretically) work with any of the supported runners[0], so you could deploy it on top of the most suitable framework/technology for your specific workload (see the sketch below). Moreover, at this point the community has several examples of how to add an additional runner.
Edit: I didn't see you mentioned Beam in your second salvo :)
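A minimal Beam sketch of that portability claim, assuming the standard apache_beam Python SDK: the only thing that changes between engines is the runner option.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Swap DirectRunner for FlinkRunner, SparkRunner, DataflowRunner, ...
    opts = PipelineOptions(runner="DirectRunner")

    with beam.Pipeline(options=opts) as p:
        (p
         | beam.Create(["to be or not to be"])
         | beam.FlatMap(str.split)
         | beam.combiners.Count.PerElement()
         | beam.Map(print))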
>I believe it drastically reduces adoption of these tools, because many of us avoid what appear to be bandwagon technologies: we don't want to consciously add layers of future technical debt when the majority of these projects will be abandoned.
It doesn't seem to be impacting adoption of the major ones (Spark, Kafka, Flink) from what I can see - assuming you're at a scale where you need these technologies, which most companies are not, so it's a good thing they don't implement them just for kicks or as "future-proofing." If you have a business problem that the main frameworks can't solve, then trying out some of the other ones may be worth the cost. There's also ongoing support for older frameworks (Storm), so it's not like your code becomes useless.
The approach open source is taking to resolve this issue is to embrace the diversity but create unified APIs on top of it: Apache Beam for defining data workflows and Arrow for data exchange formats. There's also always SQL, which works with a lot of the data warehouse solutions out there (with some tweaks per solution, unfortunately).
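On the Arrow side, a minimal sketch of what a unified exchange format buys you (the file name is a placeholder): write once, and any Arrow-speaking tool - pandas, Spark, DuckDB - can map it back without a bespoke converter.

    import pyarrow as pa
    import pyarrow.ipc as ipc

    table = pa.table({"user_id": [1, 2, 3], "clicks": [10, 20, 30]})

    # Write once in Arrow's columnar IPC file format...
    with pa.OSFile("events.arrow", "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # ...and any Arrow-aware consumer can memory-map it back, zero-copy.
    with pa.memory_map("events.arrow", "r") as src:
        loaded = ipc.open_file(src).read_all()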
> I believe it drastically reduces adoption of these tools
I really don't see why. From afar they might appear to perform the same task, but once you take a closer look you quickly realize they follow fundamentally different architectures and have significantly different deployment and performance characteristics.
It's just like there are all kinds of hammers because there are plenty of different uses - and even in the subset involving driving nails there are significantly different requirements.
I said "you went to far" to claim Kafka, Spark and Flink do basically the same thing. It's reasonable to use all 3 of them in the same team - so they clearly don't do the same thing.
> I believe it drastically reduces adoption of these tools
As does any competition. Tons of smartphone makers = less adoption for any one of them - and many will close down. Still not a bad thing.
Those are open-source frameworks, which is antithetical to competition. The whole point of open source is that the contributions of one entity can benefit everyone, while in this situation having so many frameworks "dilutes" the effort, because the same problems need to be solved in each framework. Smartphone makers aren't here to share their technical advances with everyone; they're here to make money.
I'm not saying we should purposefully agree on killing all but one framework, but the ideal situation is for one or two to come out on top as the best in their category, so that everyone just uses them and all efforts converge toward those.
I don't think that's how it works; "best" is sooo hard to define. Humans are very diverse and have different needs/preferences. Also, what is best today may not be best tomorrow (not because of the system itself, but because of changes in the environment), so it pays off to have "sub-par" systems evolving in parallel, with slightly different goals or implementations.
I don't see "open-source" as being antithetical to competition, at all. In fact, e.g. Databricks only exists because of Spark. There is competition in the open-source world too, and IMO that's a good thing.
Since when is there no competition and diversity in open source? Isn't too much choice usually a major complaint with Linux, programming languages, frameworks, etc.?
>I'm not saying we should purposefully agree on killing all but one framework, but the ideal situation is for one or two to come out on top as the best in their category, so that everyone just uses them and all efforts converge toward those.
That assumes you can have a single framework with all the features people want without causing issues (conflicting basic designs, high cost to maintain, deployment costs, configuration costs, etc.).
In the beginning, there was Hadoop, which was just a MapReduce clone, more or less.
And then there was Hadoop 2, which was kind of meant to do everything.
And that didn't really work out, so there was specialisation.
I don't really see the problem. Many of these do different things, others do similar things different ways. Some are essentially dead ends (I doubt there are many new users of Storm, say). This all seems pretty normal.
Hadoop MapReduce does what Google MapReduce does, and Hadoop Distributed File System does what the Google Filesystem does.
And Apache Spark does the same thing as that other Google product whose name I forget.
And Drill vs Dremel, etc etc etc.
I think that the OP is plenty right for the purposes of the point they were making. And I think (from a non-Googler's perspective) that it's a worthwhile point, since the FOSS ecosystem has since evolved along similar lines to Google's internal tools.
I'm not sure I'd want to replace an existing, working Kafka setup with it, but if Pulsar catches on, I would very strongly consider using it for newer projects.
I haven't used Pulsar extensively in production for anything, but all the setup and testing I've done on my own has been a pleasure. Hoping Pulsar continues to grow.