Twitter open-sources a high-performance replicated log service (github.com)
296 points by spullara on May 10, 2016 | 116 comments

Several people asked how this compares to Kafka (I'm one of the people who created Kafka at LinkedIn). Here's my take:

I think the motivations they list are: 1. Different I/O model 2. Was started before Kafka had replication (the first release of Kafka with replication was in late 2013 I think)

The I/O model I'm less sure about, we looked at similar things for Kafka and they didn't seem worth it (basically you're doing a ton of stuff at the app level that the OS does pretty well--namely caching and buffering linear I/O), we'd have to look at actual benchmarks to know.

Here is my take on the pros and cons of the core tech.

Pros: - Seems to have better built in support for fencing/idempotence - Better geo placement?

Cons: - Lots more moving pieces. Already people are irritated that there are both Kafka nodes and ZK to set up. This system seems to split this over separate physical tiers for serving, core, storage, and zookeeper. My experience has been that lots of tiers is generally a big headache.

Neutral: - There seems to be a built-in archival to HDFS. I think if the consumer is fast and efficient then you don't need to reach around your consumer API to the archive, which will be high latency (since you have to wait for files to be closed out).

There is also a bunch of stuff Kafka does that I'm just not sure about how complete it is in DistributedLog:

- Clients in a bunch of languages
- Integration with all the major stream processing frameworks
- Log compaction http://kafka.apache.org/documentation.html#compaction
- Connector management http://www.confluent.io/blog/announcing-kafka-connect-buildi...
- Quotas/throttling
- Security/ACLs

People talk about Zookeeper as a negative, but in a quorum it's been one of the most stable and reliable pieces of software I've deployed, despite being a bit frustrating to set up / configure. Netflix's Exhibitor[1] is an indispensable addition to it.

Also, once you're on the Big Data™ train, a lot of things like to plug into Zookeeper, so it becomes more of a convenience.

Kafka, and presumably DL, are at their most useful when you're pushing the limits of NIC and/or HDD performance for throughput. Zookeeper's configuration is a footnote in the complexity of managing one of these systems, and lets them avoid implementing their own byzantine coordination system. Also, folks seem to appreciate Aphyr's opinion, and he states it pretty plainly: Use Zookeeper. It’s mature, well-designed, and battle-tested. [2]

[1] https://github.com/Netflix/exhibitor

[2] https://aphyr.com/posts/291-jepsen-zookeeper

I found Zookeeper to be pretty easy to learn about and deploy. I also found it got a semi-bad reputation with others because they did not want to learn anything. Two major examples being:

* attempting thousands (tens of thousands?) of simultaneous writes across data centres/continents at the same instant and then saying it was slow. Subscribing all clients to updates on all parts of the tree. The architecture had been grown by people who didn't understand the guarantees and constraints.

* calling it unreliable after arbitrarily moving nodes around without changing connect strings and generally mis-configuring it. Essentially, pointing clients at machines that no longer contained nodes and blaming ZK for this not working.

It could be easier to set up. People often don't want to think about such things at all. But it's also not the hardest, and I found it to be very resilient to node and network failures when deployed correctly.

Agreed, some of the problems I encountered were because I did not carefully read the documentation. Some are because a lot of tooling has become easier to deploy in the years since Zookeeper was released, so the bar has been raised.

Which is just to say there's room for improvement. A dependency on Zookeeper is fine if you've already got a configured cluster, and a cognitive speed bump if not.

This could be an interesting competitor to Apache Kafka, which is singularly unique in this space as far as I'm aware.

On another note, I find it somewhat funny that these are called "log" services, logging is probably the least interesting use case for these things I can think of. A better description in my mind would be as a distributed event processing framework, since what they are really doing is distributing discrete events in a reliable manner.

I'm pretty sure that this post inspired DL. It was written by Jay, one of the 3 founders of kafka (Jay, Jun, and Neha) and should be recommended reading for every software engineer if you've not read it:


An ordered append only datastructure is rightfully called a log. The fact that text based files are called logs is just an annoying feature in common english usage.

IIRC, we started working on Distributed Log at least a year before that post.

The Twitter blog post in other comments says that "At design time we had concerns about Kafka’s I/O model...", which means that at the least Kafka was extensively studied during the design phase of DistributedLog, which may have started earlier than Kafka's release of course.

If you read the CS papers surrounding distributed systems, you will often see the notion of a 'journal' or a 'log', meaning an append-only structure, which typically contains numerous agreed-upon facts.
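To make "append-only structure" concrete, here is a minimal in-memory sketch. The class and method names are made up for illustration and are not taken from DistributedLog or Kafka; a real service adds replication and durability on top of this shape.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AppendOnlyLog:
    """In-memory stand-in for a log: records can only be added at the end."""
    _records: List[bytes] = field(default_factory=list)

    def append(self, record: bytes) -> int:
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int) -> List[bytes]:
        """Return every record at or after `offset`, in append order."""
        return list(self._records[offset:])


log = AppendOnlyLog()
first = log.append(b"user 42 signed up")
second = log.append(b"user 42 changed email")
```

There is deliberately no update or delete: readers only ever ask for "everything from offset N onward", which is what makes the structure easy to replicate and replay.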

I agree that term describes the foundational data structure these services provide, but in common parlance that is not really what it means at all, and it is confounded by a common (and boring) use case being to synchronize processing of actual, text-based logs.

Jay Kreps, architect of Kafka, calls it a log.


What does "boring" have to do with any use-case?

aka atomic broadcast

I don't think "distributed event processing framework" describes this system at all. It's exactly what it says on the tin: a replicated log (i.e. event log) storage system that one can build other systems on top of. It's not processing events as far as I can see?

You put events in one end, and it distributes them, reliably and consistently, to distributed end-points that can consume them. That is far more powerful than simply being used to log events for later retrieval, though it can be used for that. I have seen systems using Kafka that process tens of thousands of time-sensitive events a second and distribute them all over the world for independent processing as part of a larger system.

"real-time event journaling and dissemination system"?

Yes I like that, much more accurate descriptor than "log", though it is more verbose. In the end it is really just nitpicking I guess, but the discoverability of calling these systems "log services" is very low compared to what they are really capable of.

Somehow I associate "journal" with temporary/transitional data, but that's probably just me thinking of journaling file-systems.

You might prefer 'ledger' to emphasize the immutability of the journal.

When I think journal, I think a permanent record of rdbms transactions or perhaps even a blockchain. I agree with you though that the persistence of data needs to be defined when discussing a journal of any sorts.

"Log-structured file system" or "journaling file system" is often how such things are described. The notion being that the primary operation of interest is appending new data records, rather than modifying existing data.


The use of the term "log" in some contexts implies concepts like lossy, unimportant, non-durable, etc. However, in the context of log-structured file systems or journaling file systems, the sense that's intended is append-only as a primitive, atomic, consistent, durable (replicated) operation.

Kafka isn't that unique. There are several hosted options as well as commercial software that can do what Kafka does.

Kafka has the open-source/java community and ecosystem that fits in well with the rest of the current big data processing stuff though.

This looks potentially fantastic. If I could beg one wish from the developers of this (and almost every other project anywhere near this space), though, it would be one tiny piece of documentation:

What's your unique ID scheme?

Let's say I'm willing to believe[1] that you've got Durable and Consistent down, once messages are committed into the system. What's the story for messages on their way in? My application logs are buffered to the local disk, now I'm streaming them into central storage, and halfway through a TCP connection that's shuffled 2 MB of thousands of messages into storage, the connection terminates -- unexpectedly, mid-message. Could the service have committed more messages than it acknowledged? Or many less than I've sent? Both could be true from the network standpoint.[2]

So, what I need to know, and what should be very easy to answer, front-and-center in your docs, pretty please:

1) Where should my log uploader resume?

2) Is there any danger of repeatedly entering some lines?

3) If I have log lines that are legitimately duplicates, will they be stored at the correct count?

These are questions that may have a different answer than the durability after data makes it fully into the system. It also may provide useful information about how complex the code in a submitting client is, because good answers tend to require some kind of ID sequence being assigned on submitting clients, afaict. And it's really just plain critical to sanity.
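One way the "ID sequence assigned on submitting clients" idea could look, as a rough sketch. This is a generic dedup scheme, not something the DistributedLog docs describe; `DedupingLog`, `submit`, and the `(client_id, seq)` key are all hypothetical names.

```python
from typing import Dict, List, Tuple


class DedupingLog:
    """Hypothetical ingest endpoint that drops re-sent (client_id, seq) pairs."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, int, str]] = []
        self._last_seq: Dict[str, int] = {}  # highest seq stored per client

    def submit(self, client_id: str, seq: int, line: str) -> bool:
        """Store the line unless this client already delivered `seq`.

        Returns True if stored, False if it was a retry of an old record.
        Assumes each client sends its records in increasing seq order.
        """
        if seq <= self._last_seq.get(client_id, -1):
            return False
        self.records.append((client_id, seq, line))
        self._last_seq[client_id] = seq
        return True


log = DedupingLog()
log.submit("web-1", 0, "GET /health 200")
log.submit("web-1", 1, "GET /health 200")   # same text, new seq: kept
log.submit("web-1", 1, "GET /health 200")   # retry after a dropped ACK: ignored
```

Because identity comes from the sequence number rather than the line text, legitimately duplicate lines are stored at the correct count, while a blind retry after a lost ACK is harmless.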


[1] well, no, I'm not, "trust but verify" in all things etc etc; but let's suppose that's more believable and something I have to mechanically verify anyway, and doesn't have an obviously observable boolean at the protocol level as to whether it's going to work well or not, and system internals simply don't have such a sordid history of being over-simplified until they're broken like client interfaces so often are, so...! We'll handwave that to a later and more involved step of quality investigation.

[2] https://en.wikipedia.org/wiki/Two_Generals'_Problem

I stumbled upon this page while reading the docs: https://twitter.github.io/distributedlog/html/design/main.ht...

It looks like they use fencing and a two-phase commit to prevent duplicate writes. Whether that covers all failure scenarios I'm not sure.
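For anyone unfamiliar with fencing, the general idea can be sketched with epoch numbers. This is the textbook version under my own assumptions, not DistributedLog's actual implementation; `FencedLog`, `acquire`, and `StaleWriterError` are invented names.

```python
class StaleWriterError(Exception):
    """Raised when a writer with an outdated epoch tries to append."""


class FencedLog:
    def __init__(self) -> None:
        self.records = []
        self._epoch = 0  # epoch of the writer that currently owns the log

    def acquire(self) -> int:
        """Hand out a new, higher epoch; this fences every earlier writer."""
        self._epoch += 1
        return self._epoch

    def append(self, epoch: int, record: str) -> int:
        """Append only if the caller still holds the latest epoch."""
        if epoch != self._epoch:
            raise StaleWriterError(f"epoch {epoch} was fenced by {self._epoch}")
        self.records.append(record)
        return len(self.records) - 1


log = FencedLog()
old_writer = log.acquire()
log.append(old_writer, "a")
new_writer = log.acquire()      # e.g. a failover; the old writer is now fenced
log.append(new_writer, "b")
```

If the old writer wakes up after a long pause and tries to append with its old epoch, the storage layer rejects the write instead of letting two writers interleave records.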

> Could the service have committed more messages than it acknowledged?

Yes of course. It may have committed records but not had time to ACK success.

> Or many less than I've sent?

Yes again. However it will NEVER ACK success for writes which it hasn't committed.

So the persistence story is pretty straightforward (I'll see if we can update docs if this is missing).

If you want to avoid duplication on the write side, you would have to retrieve the last log record-- with id x-- and start adding records again from id x+1.

Fencing + locking combine to provide efficient exclusive access and the fat client (for now) guarantees write ordering.
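The "retrieve the last record with id x and resume from x+1" advice can be sketched as follows. `FlakyLog` is a toy stand-in I made up to simulate a commit whose ACK is lost; it is not the DistributedLog client API, and it assumes a single writer whose line i lands at record id i.

```python
from typing import List, Optional


class FlakyLog:
    """Toy log whose connection can drop after a commit but before the ACK."""

    def __init__(self) -> None:
        self.records: List[str] = []
        self.drop_ack_at: Optional[int] = None  # record id whose ACK gets lost

    def append(self, record: str) -> int:
        self.records.append(record)            # the write is committed first
        record_id = len(self.records) - 1
        if record_id == self.drop_ack_at:
            self.drop_ack_at = None
            raise ConnectionError("connection lost before ACK")
        return record_id

    def last_id(self) -> int:
        """Id of the last committed record, or -1 if the log is empty."""
        return len(self.records) - 1


def upload(log: FlakyLog, lines: List[str]) -> None:
    """Send lines in order; after a failure, resume from last committed id + 1.

    Assumes a single writer and that line i is stored under record id i.
    """
    next_index = 0
    while next_index < len(lines):
        try:
            log.append(lines[next_index])
            next_index += 1
        except ConnectionError:
            next_index = log.last_id() + 1     # x + 1, where x is the last id


log = FlakyLog()
log.drop_ack_at = 1
upload(log, ["a", "b", "c"])
```

The uploader never learns that "b" was committed from an ACK, but asking the log for its last id tells it to continue with "c" rather than send "b" again.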

I was interested until I saw the Zookeeper dependency.

I have had too many deployment nightmares with Zookeeper. I would prefer to avoid it as much as possible, plus systems software in Java, sigh.

For better or for worse, Java runs a big chunk of this world. Not running a Java application because "ewww Java is gross" will prevent you from enjoying some of the most exciting database technologies out there.

Would also be interested to hear more about the "deployment nightmares with Zookeeper". For us, it has to be one of the most stable pieces of 3rd party server software we run.

General things: when your session fails to migrate, when log compaction can't keep up with writes, when logs fill up your disk (ZK used to require external cron jobs to prune logs), when watcher notifications arrive late, when the leader's GC exceeds its heartbeat timeout, and so on.

Lots of things can and do go wrong with Zookeeper. I suspect it depends on the use case, but building a Zookeeper dependency into any system is potentially asking a lot of users/operators.

Well, like many things, if you deploy it on sensible hardware and put some thought into how its operated, zookeeper isn't particularly problematic. It's certainly not a 'nightmare'.

As an operator I do feel that pain, but the alternative of every piece of software that needs distributed coordination reimplementing it, likely poorly, isn't great either.

There are (IMHO, superior) alternatives like Etcd and Consul that could be used.

Ideally, these kinds of systems should be built such that the distributed coordination piece is pluggable and the implementation can be chosen based on deployment concerns.

I have similar worries with Kafka, that’s why I’ve been working on and off on my own log-based message queue for a while now [0]. So far it is nothing but a humble beginning to something targeting very simple use cases, but AFAIK there aren't any lightweight solutions in this space.

[0] https://github.com/ninibe/netlog

Exactly this.

When I saw the header I thought to myself, "Yay! Now I can run something like Kafka, without depending on Zookeeper or having to install JVM!! Give me stable drivers for Python and Go and I am sold!"

If I have to install Zookeeper and JVM, why not use Kafka?

'plus systems software in Java', mind sharing an explanation?

Java at first is great. Once you get in to large and complex loads, the GC tuning becomes load specific. While tuning GC can be fun (for the first dozen times), the bigger problem happens when your load shifts and your GC tuning becomes subtly wrong.

Also even IF you are tuned properly, there is a world of 99-percentile you'll never get to.

Additionally, the difference between openJDK and Oracle JDK becomes something admins learn to hate you for. Complex shell scripts to invoke java, and lots of standard unix tools just don't work super well. eg: pgrep and pkill. you can tweak it, but it takes a little while to learn the many tips and tricks.

It's not all bad, most Java programs are deployed in a static-all-batteries-included fashion, so you rarely worry about system-installed library versions. So that makes deployment a little less hassle. You never have to recompile for cross platform. The profiling tooling and other stuff is pretty good, and the more you're willing to pay the better the tools get.

You are wrong about never getting to the 99th. I did not work on this service at Twitter, but many services at Twitter are built on the JVM, and if this service is like most there, then it probably has SLA targets that are sub-millisecond at the 99th and sub 10ms at four 9's. The only caveat to this is that it has to write to disk, which may keep it from achieving truly high performance, but this is not the fault of the JVM. I worked on one service at Twitter that had to use microseconds on its dashboards just to see any variation at all throughout the day.

I find that 'jps' works as a good replacement for pgrep.

Sure, but jps isn't pkill.

The point is you can't use the binary name to make your way around anymore. The binary name is 'java'. Standard unix tools just don't work. Yes, there are workarounds, but over time things end up being just a little more complex than they should be.

Which means if someone is comparing zookeeper and etcd, well etcd wins major points for being unixy and easy to deploy. Copy 1 binary, done. ZK loses major points here. Gotta make sure the JVM is installed, but do you need the OpenJDK or the Oracle one? if the latter, well apt-get and yum are less helpful.

It's all just little globs of annoying details that add up to be a small pain. Nothing horrible, but if you could make a choice to avoid that, why not?

Basically I guess what I'm saying is there is probably a market for replacing all the Javay distsys stuff with Rust/Go versions. I mean look at etcd!

> Basically I guess what I'm saying is there is probably a market for replacing all the Javay distsys stuff with Rust/Go versions. I mean look at etcd!

Or in the other direction. If your main system runs on the JVM and your sysadmins are used to the JVM tools then having a piece of infrastructure that's just another .jar is wonderful, and C/Rust/Go/Ruby/etc. infrastructure elicits groans. Mixing platforms will always be harder than a common platform. So the infrastructure market depends on where you think the future of applications is.

Java has garbage collection and is about two times slower than C

For good C, which is hard to write - for me and other mortal humans.

On the other hand, as a mortal, I can write a-grade-above-code-that-an-idiot-would-write-just code in Java at about 10 times the speed I can write dire-useless-risible C code.

The comparison is pointless though, good modern languages like Rust and Julia are developing and LLVM is enabling further development.

You can call the comparison pointless, but it is the answer to the question of why Java is not a systems language

C doesn't have garbage collection. In some cases Java can outperform C.

Yeah, that's what I said. C doesn't have gc so it is a systems language. Java has gc so it is not a systems language

Agree re: Zookeeper. It looks like DistributedLog uses BookKeeper, which I'm guessing is where the Zookeeper usage comes from.

In terms of simplifying the deployment, they could have gone embedded with Atomix[1] instead. Perhaps next time.

[1]: http://atomix.io/

Mind sharing a bit more?

The classic problem of ZooKeeper is the client and the server upon initial startup will DNS resolve and permanently cache the IP addresses of the ZK cluster. Thus any cluster migrations require you to restart _EVERYTHING_.

So running zookeeper in a dynamic environment becomes very risky. Aka AWS. Perhaps on GCE it's less dangerous, because the host migration is pretty good.

There are other operational issues people have noted above as well.

Then configure it using IP addresses. I've been using Zookeeper like this for 4 years in public EC2, swapping machines in and out and it always worked. What's the problem.

How is this an improvement over using DNS? When you swap in a new EC2 instance it has different IP addresses from the old instance. So now not only do you have to restart your other Zookeeper servers in the cluster, you have to update their configuration then restart them.

He's probably using ENIs to decouple the IP address from the instance.

I think you do know the problem, which is that not everyone has that set up. Yes, if you are in the cloud and can roll with static IPs, then great for you. Some people still have to be on-prem, and reusing IPs isn't always an option.

We run ZooKeeper (and Kafka) within Docker containers and were able to solve this problem with a patch to the client: https://issues.apache.org/jira/browse/ZOOKEEPER-2184

You also need to disable the JVM's insane DNS-caching behavior (config change only).

You're right in that Zookeeper can be a hassle in a dynamic environment, but in my experience Exhibitor largely removes that concern.

This behavior should be fixed in Zk 3.4.7+ and 3.5.0+

Anyone know how this is similar to and different from Kafka? Or why Twitter decided to build their own instead of using and contributing to Kafka?

Which says " At design time we had concerns about Kafka’s I/O model and its lack of strong durability guarantees‐a non-starter for an application like a distributed transaction log[3]"

Seems reasonable, right?

Except "[3] Kafka addressed these durability concerns in version 0.8"

So they built a whole thing, because they didn't bother to ask or say "hey, if we help fix the durability, would that be welcome?" or even "do you guys have a plan and timeline to fix it?"

That's .... not great.

Now, maybe there are other reasons, but they aren't elucidated in this blog post :P.

Even then, my general view would be "did you approach the community and discuss your concerns or just dismiss them out of hand as infeasible", mainly because my experience is that if you do this, you often find they have exactly the same set of concerns/goals, and just need more resources to make it happen.

The desire to build shiny objects is very large. Outside of the paper plans of engineering teams, these things rarely end up more shiny than what already exists or will be built by the time you are done.

Arguably there are benefits to developing in-house expertise, and no better way to develop expertise than to architect and build a solution end-to-end. Twitter now has several domain experts on staff who can continue maintaining DistributedLog and/or weigh its benefits against Kafka's and make more informed decisions going forward.

Not saying that they couldn't have worked more closely with Kafka's team in the first place, but, hey, now we have two Kafkaesque log services instead of just one. Seems like a win to me.

> Arguably there are benefits to developing in-house expertise, and no better way to develop expertise than to architect and build a solution end-to-end

There are also drawbacks to consider: Those experts might just decide to leave and then you have an in-house solution that you have basically no chance to find experts on. At least with an open source solution you may have an easier time.

> hey, now we have two Kafkaesque log services instead of just one. Seems like a win to me.


Or even "we are ready to pay 3 people for a year to build that, if we don't and give you the money instead what can you do ?".

The software industry has a real problem contributing to open source, the stuff, you know, allowing them to make money in the first place.

They did open source their solution. It sounds like you are arguing against parallel development. That happens all the time in open source.

Contributing to other open source projects is very different from open sourcing your own project. NIH is one of the worst problems of open source: companies are spending a lot of money to develop new software instead of improving existing ones.

Seriously this seems like a common pattern in open source: [big company] could just improve [X] but instead builds something from the ground up.

Sometimes you need to let talented engineers build things from the ground up, because it's good for them, makes them happy, and stops them from going to work somewhere else. Keeping people with the skills to solve these types of problems around and happy is also great for recruiting and for helping your less capable engineers learn and grow.

Sure, but duplicating Kafka? How many man-months did they put into building this, proving it out, and dealing with fallout from any bugs or production issues?

What's the point of retaining an engineer who's doing nothing for the business but re-inventing existing successful software?

"Sometimes you need to let talented engineers build things from the ground up, because it's good for them, makes them happy, and stops them from going to work somewhere else. "

actually, trying to let people do work you don't need done, in order to keep them happy, is a pretty rookie manager mistake.

If they aren't passionate about it, and you can't persuade them to do the things you need doing, they aren't the right person for the job.

That is always true, even if they were the right person in the past.

Your goal in that case should be to try to find stuff the company needs done that they want to do, and push them to work on that. But if you find nothing, ...

Sometimes building something from the ground up allows you to make fundamental design decisions not available in the alternatives.

Or forks it and maintains an incompatible, non-contributable internal version based on a release from five years ago.

Speaking of logs, I want to put some logging in place for my web server. I log every single request with extensive details, so I can debug things later if needed. It's several gigabytes per day now, so I can no longer just dump it on disk as I did for the last couple of years.

Since I'm on AWS EC2, I want to try this:

  - Write the logs to local SSD, asynchronously 
    so as not to hold back the http request.
  - Have a separate cron job that loops through 
    the log directory and scoops up all the files.
  - The job will then stuff those files into a Kinesis Firehose.
    AFAIK, Kinesis Firehose does not require any capacity provisioning, 
    unlike the Kinesis Streams, so I'm set "for life" (up to 5MB/second)
  - The firehose will accumulate the logs and put them into S3. 
    Hurray unlimited storage!
  - S3 will trigger a Lambda.
  - Lambda will parse through the log from S3, pull out 
    interesting properties (IP address, user id, session id, etc) and 
    stuff them into a DynamoDb table.
  - If I need to see data from one user/ip/session I will use DynamoDb 
    to find the right S3 blobs.
  - If I need to reprocess the logs to extract a new piece 
    of data that I did not foresee earlier, I can run a 
    map-reduce task
Except the last piece, this looks like something I can half-ass in a couple of days and forget about it for another couple of years.
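The indexing step (pull interesting properties out of a log blob so they can be looked up later) could look roughly like this. The JSON-lines log format and the field names `ip`, `user_id`, and `session_id` are assumptions for illustration; the actual DynamoDB write and the Lambda handler wiring are left out.

```python
import json
from typing import Iterable, Set, Tuple

INDEXED_FIELDS = ("ip", "user_id", "session_id")  # assumed field names


def index_entries(blob_key: str, lines: Iterable[str]) -> Set[Tuple[str, str, str]]:
    """Return (field, value, blob_key) triples for one log blob.

    Each line is assumed to be a JSON object; lines that fail to parse are
    skipped here, though a real job would probably want to count them.
    """
    entries = set()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        for name in INDEXED_FIELDS:
            if name in event:
                entries.add((name, str(event[name]), blob_key))
    return entries


sample = [
    '{"ip": "203.0.113.7", "user_id": 42, "path": "/cart"}',
    "not json",
    '{"ip": "203.0.113.7", "session_id": "abc"}',
]
entries = index_entries("logs/2016-05-10/part-0001", sample)
```

Each triple maps one property value to the blob that contains it, which is the lookup described in the list above: find the user, IP, or session in the index, then fetch only the matching S3 objects.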

Any opinions? I don't really want to use a SaaS log service because gigabytes per day.

I've heard some bad things about Kinesis in general. Why not have your cron job just put the logs onto S3 directly?

I've used Lambda a bit. The debugging process can be a pain, since you're forced to upload a ZIP file, and if your code times out Lambda doesn't give you any traceback to indicate what happened. There's also a maximum run time for each Lambda invocation, which I believe is 5 minutes. Is there a chance your parsing may run longer than that? Also, what will you do if you upload some bad code and the parsing fails? Will it be the end of the world if you lose data while you fix the parsing?

Oh, I see you plan on doing map-reduce to re-parse the logs, so maybe that part isn't as big a deal.

You could also consider doing something like rsyslog -> db-of-choice while also rotating the files off to S3 for long-term map-reduces. This is all to ignore the obvious ELK cluster solution, which will give you good data visualization and investigation options, but may be more of a headache to set up and maintain than you are looking for.

Anyway, those are my thoughts. Hope they help.

Thanks for sharing your thoughts, esp on Lambda.

For Kinesis, I planned to use Firehose, not Streams (the latter have to be provisioned, which I was hoping to avoid). The firehose could put data into S3 for me, and S3 would trigger lambda. However I just realized that S3 will only make 3 attempts to invoke the Lambda, so that pretty much rules out this part of my design - the data will not get lost, but it will not get indexed either. I may run map/reduce later, but I don't want to be dependent on doing that to pick the loose ends.

These are just server logs, they don't affect business continuity. Still I wouldn't want to be sitting there and wondering "is this user having connectivity problems, or did I just lose a pile of logs?".

I could probably put the files directly into S3, got carried away stacking my AWS features together. :) I'd need to be more careful with batching, so as not to create batches too small or too large. Perfectly doable, though Kinesis Firehose already does that for me. Plus in case of the EC2 instance death I will lose the current batch with the hand-rolled solution, but not with Firehose.

So I guess I should just put a batch of data directly to S3 and send an SQS message to make sure that it's indexed properly, then delete the local files.
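The "batches not too small or too large" part of a hand-rolled uploader is simple enough to sketch. This only covers the grouping; the S3 upload, the SQS message, and the local delete are left out, and `batch_by_size` and the 100-byte limit are made up for the example.

```python
from typing import Iterable, Iterator, List, Tuple


def batch_by_size(
    files: Iterable[Tuple[str, int]], max_bytes: int
) -> Iterator[List[str]]:
    """Group (path, size) pairs into batches of at most `max_bytes` each.

    A single file larger than `max_bytes` still gets a batch of its own.
    """
    batch: List[str] = []
    batch_bytes = 0
    for path, size in files:
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(path)
        batch_bytes += size
    if batch:
        yield batch


files = [("a.log", 40), ("b.log", 50), ("c.log", 30), ("d.log", 200)]
batches = list(batch_by_size(files, max_bytes=100))
```

For each batch, the order of operations matters: upload, then send the indexing message, and delete the local files only after both have succeeded, so a crash in between leads to a retry rather than lost logs.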

I really don't want to pick up another thing that I have to understand and manage. Like ELK. Someone who knows ELK will probably have no problem managing it, but my head is full with business domain problems.

We just went through an extensive ecosystem survey and reached an identical design, which we're currently implementing. So far, so good! AWS has aws-kinesis-agent, which you can deploy for the log aggregation bit, which is very easy to use.

I just found out that an s3-triggered lambda is only retried 3 times and then it gives up.

How does this compare to Kafka?

Here is a PDF going briefly into the Kafka comparison and design motivation: http://goo.gl/J9XdsG . During my time at Twitter I remember when Twitter switched from Kafka to DistributedLog. They have an internal layer that adapts the Kafka API to DistributedLog; I am not sure if they have open sourced that.

Very nice, now that we have some good competition in this space - kafka, ddl and wormhole. If only Fb can open src wormhole, there will be some good related systems software to choose from and contribute to.

By checking out the code quickly it looks like it's built on top of Apache Bookkeeper http://bookkeeper.apache.org/ ?

This blogpost https://blog.twitter.com/2015/building-distributedlog-twitte... mentioned somewhere else in the discussion confirms it's built on bookkeeper.

Nice docs, thanks!

For those who like to watch talks, I found one on youtube: https://www.youtube.com/watch?v=QW1OEQxcjZc

I am not that into this topic, but could you compare DL to SoundCloud's Roshi[0], and if not, what's different? Thanks! [0] https://github.com/soundcloud/roshi

I wish they had kept some of the commit history - would be tough to parse the project without it.

Does DL have a limit on the number of partitions like Kafka?

A DL stream is a single partition if you are thinking in terms of kafka. We stitch them together as a partitioned stream a la Kafka using a different system. So, to answer your question, no, there is no limit to the number of DL streams you can put together into a bigger logical stream, we have some very large ones.

Ah! That actually explains the strange syntax I was seeing in the tutorials:

  // Create Stream `basic-stream-{3-7}`
  // dlog tool create -u ${distributedlog-uri} -r ${stream-prefix} -e ${stream-regex}
  ./distributedlog-core/bin/dlog tool create -u distributedlog:// -r basic-stream- -e 3-7
You use a regex to programmatically create N streams, which would be N shards in Kinesis or N partitions in Kafka.
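A sketch of how a prefix plus a range could turn into N streams that act like partitions. I'm assuming the `3-7` range is inclusive, and the hash-based routing is my own guess at how one might stitch streams together; the comment above only says Twitter uses a different system for that.

```python
import zlib
from typing import List


def stream_names(prefix: str, first: int, last: int) -> List[str]:
    """Expand a prefix and an inclusive numeric range into stream names."""
    return [f"{prefix}{i}" for i in range(first, last + 1)]


def pick_stream(key: str, streams: List[str]) -> str:
    """Route a key to one stream using a hash that is stable across runs."""
    return streams[zlib.crc32(key.encode("utf-8")) % len(streams)]


streams = stream_names("basic-stream-", 3, 7)
target = pick_stream("user-42", streams)
```

`zlib.crc32` is used instead of the built-in `hash()` because Python randomizes string hashes per process, and a partitioner needs the same key to land on the same stream every time.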

100% Java. I'm wondering: why doesn't this have any Scala?

Rumors of Scala's dominance at Twitter are slightly exaggerated. While it's true that the non-revenue-related backend is almost all Scala, on the ads eng side, it's almost 100% Java. Which language a new project is written in has more to do with who's writing it than anything else.

Actually I just wondered because of Finagle. I thought that their projects use the RPC service heavily.

Ah, Finagle has a Java API too.

I think over the years they decided to use Scala. I'm not even sure if most of their codebase is still in Scala today.

Because Scala adds overhead.

Where? In operational cost? In development time? In performance?


Depending on use-case, of course, there are lots of places where Scala is preferable to Java. A byte-shuttling service where performance is important, mutability reigns and expressiveness is irrelevant is not one of those places.


Joke aside, at least in hiring people it should.

What? No bird related name for this project?!

Bird projects got outlawed during the interregnum.

Wonder how it compares to Facebook Scribe?

> This is an archived project and is no longer supported or updated by Facebook

Yes but you can still use it, or fork it, it's actually a really cool federated logging system.

I hear, from people that would know, that it has a lot of bugs that were fixed internally but never made it into an open-source release.

Scribe is a log aggregation system; it doesn't really do topics/streams or have a good durability/streaming performance story. It's mostly used for aggregating data into a log storage system like DL, or Kafka, or into HDFS.

No, actually Java is a bane to the database world.

Cassandra doesn't work, and Hadoop is a complete waste of hosts for most companies (hence the move to Spark.)

Please keep programming language (and other technology) flamewars off HN. You're welcome to make a substantive critique.

We detached this subthread from https://news.ycombinator.com/item?id=11669189 and marked it off-topic.

Blanket statements like "Cassandra doesn't work" and "Hadoop is a complete waste of hosts for most companies" are unproductive and contribute nothing, unless you can back them with data and real world examples.

So, what data do you base these assertions on ? Also, not to burst your bubble but a lot of businesses (if not the majority) run Spark on YARN. And Spark is built on the JVM.

If they had data and examples they would almost certainly have enough experience not to say things like "______ doesn't work" and "______ is a complete waste of hosts."

Disagree on two fronts:

- By Hadoop I assume you mean MapReduce? There are other engines like Samza and Flink, and Kafka makes a great event store. Anyway, MR is super for massive-throughput batch jobs, for example huge Hive queries or Pig jobs, and if you are reading, breaking the heap size, doing one thing and then writing, there is no bonus from doing it in Spark.
- Spark is written in Scala, which runs on the JVM. And it has a nice Java API as well!

I'd be interested to know how you can assert that Hadoop is a 'complete waste of hosts for most companies'. Also, don't underestimate the many, many people successfully running Spark on YARN at scale. Hadoop is actually quite helpful to some workloads.

Most companies simply don't have the data volume to make Hadoop worthwhile. You can process tens of TB in an RDBMS on a beefy machine cheaper than a Hadoop cluster.

Hadoop is slow, but on huge data volume the overheads are dwarfed by the parallelism gained. Most companies don't have huge volume though.

For example recently I saw someone propose using Hadoop for a sub-TB dataset...

> hence the move to Spark

...which also runs on the Java Virtual Machine and is subject to the same pros and cons.

heh - you beat me to it!

Spark, Kafka, Flink, Storm, YARN, Samza, etc... Good luck staying out of the JVM. "The database world" is a bane to the big data processing world.

... logstash, kibana, elasticsearch, lucene, solr ... yeah pretty hard to not run java if you're doing distributed, scalable systems.
