My understanding is that Google has recently deviated from this strategy. The result of the strategy you mention is that the industry standardized on other companies' implementations of ideas that came from Google: Hadoop (MapReduce), HDFS (GFS), ZooKeeper (Chubby), and more. For examples of newer open source projects that see more active maintenance from Google, see Kubernetes and TensorFlow.

But K8s is not a development of Google software. It is developed specifically for the public, it throws out all of the interesting parts of Borg, and Google themselves don't use it, or barely do. As for that other stuff it seems to have worked out fine for Google: they describe obsolete technologies and the outside world develops hideous analogs of those and uses them for decades. Hadoop for example is just an unbelievably bad implementation of map reduce as it was ten years ago and is laughable compared to what Google replaced it with. HDFS is a joke of GFS which Google turned off eight years back. It's really remarkable the way the industry is essentially self-disabling in this regard. Meanwhile Google does not burden itself with trying to adopt every idea they read in a paper, and maintain a significant cost and efficiency advantage by doing so.

> it throws out all of the interesting parts of Borg

This is not true. It throws out the Google-specific parts of Borg (like integration with Google's service discovery, load balancing, and monitoring systems) and improves a number of things compared to Borg. For a good reference on the evolution of Borg into Kubernetes, I recommend the recent Kubernetes Podcast interview with Brian Grant: https://kubernetespodcast.com/episode/043-borg-omega-kuberne...

> Google themselves don't use it

This is not true, and the reasons why it hasn't replaced Borg are related to the integrations I mentioned above (which will take time to integrate or replace) and the zillions of lines of Borg config that have built up over the years, rather than concerns that people outside of Google would have (production-worthiness, reliability, etc.).

(Disclaimer: I worked on Borg at Google, and now work on Kubernetes at Google.)

Unfortunately we can't discuss the parts of Google's platform that aren't in Kubernetes on this forum. If we could, I think I could defend my statement reasonably well. But perhaps you just don't think that the pieces I would mention qualify as interesting.

go/-link or it didn't happen.

well my secret document says you're wrong and i'm right.


Partial information is better than none.

aphorism rejoinder:

disinformation is worse than no information.

Implementation matters to Google more than to, say, the average company that uses Hadoop. At "Google-scale" small imperfections become huge imperfections.

What's good for the bottom 90% of tech companies probably isn't for the top 10%.

I am a consultant working on Hadoop installations across the globe. On average I am usually able to save 70% of disk usage and 30% of overall cost by changing the defaults to reasonable values and by migrating companies off HDFS to something like S3. I have spent the majority of my career (10 years) working on Hadoop and I can tell you that it is a terrible piece of software with insane inefficiency all over the place. If you switched all the Hadoop installations on Earth at once to something more reasonable, it would be quite visible on the global CO2 production chart. What is good for the bottom 90% of companies is a financial question, not a technological one. It is a bit unfortunate that Hadoop is so popular and nobody cares about efficiency, not even the Hadoop vendors (maybe with the exception of MapR, which is not open source).

> This find | xargs mawk | mawk pipeline gets us down to a runtime of about 12 seconds, or about 270MB/sec, which is around 235 times faster than the Hadoop implementation.
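For readers who don't speak awk, here is a rough Python sketch of the shape of such a pipeline. It assumes the first mawk stage produces partial counts per batch of files and the second stage merges them; the quote itself doesn't spell that out. The function names and the "count by first field" rule are made up for illustration, and threads are used only to keep the sketch short (CPU-bound work in CPython would want a process pool instead).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def count_shard(lines: Iterable[str]) -> Counter:
    """Map step: partial counts for one shard, keyed by the first field."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def merge(partials: Iterable[Counter]) -> Counter:
    """Reduce step: add the partial counts together."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_all(shards: List[List[str]], workers: int = 4) -> Counter:
    """Count every shard concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return merge(pool.map(count_shard, shards))
```

Calling count_all with one list of lines per input file gives the merged totals. The point is only that "partial aggregate per shard, then merge" fits on one machine for small inputs.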


Well this is great until you need more nodes. :) I am talking about the same scalability while maintaining a much lower ecological and financial footprint.

Using hadoop/spark for <2gb of data seems like a terrible idea.

When all you have is a hammer everything starts to look like a nail.

Is there a good open source alternative that meets the HDFS use-case (i.e. file or blob storage, rather than a KV store designed for point lookups)? Or is tuning the HDFS defaults the best you can do without migrating onto someone's cloud platform?

I'd argue Kubernetes isn't the best choice for the bottom 90%. There's a lot of companies you could describe as tech and many are doing just fine in the old world of manual application provisioning.

I don't agree at all. While the large majority does not need the scalability it offers, it can benefit from all the other stuff that applying the 'best practices' offers. The problem is that many people do not stick to the best practices and do not know how to build containerised applications.

Implementing the whole "DevOps" idea just becomes a whole lot easier when developers don't even have the concept of their snowflake server anymore. And yes - k8s has a ton of overhead and is pretty complex to get into at first, but it all makes sense. There are many points that could be criticised about it, it's far from perfect, but having a standardised way of deploying whatever application has been a massive game changer in the development environments I've been thrown in.

Source/disclaimer: I'm a consultant that has seen quite a few k8s/openshift fuckups and success stories, both on large and small scale.

>I'd argue Kubernetes isn't the best choice for the bottom 90%.

Exactly. Instead of having something simple that scales to 90% of everyone's needs, we get the solution that most enterprises want, and it filters down from top to bottom. And that is true of almost everything in tech.

Google offers their most recent infrastructure as a service in GCP, e.g. Cloud Dataflow, and it hasn't exactly taken the world by storm. Industry standards matter, even if they are inferior implementations; the differential is just not that big.

Dataflow is based on OSS Apache Airflow. I don’t know how well it’s doing in the wild, but every IT admin I’ve worked with is super excited to use it.

Airflow and Dataflow are not related.

Google open sourced dataflow as Beam - https://beam.apache.org

Dataflow itself isn't open source. Beam is not an open source Dataflow; however, you can use the Beam SDK with Dataflow as a runner.

Ah, apologies, my bad.

Can you mention some of the interesting parts of Borg that are missing in k8s?

Among other things, I miss Autopilot and, more generally, the extensive machinery to help with massive capacity planning.

Think of Autopilot as an automation that tweaks a pod's request/limits according to what it actually needs in order to reduce waste and thus improve cluster utilization.

(I _think_ this no longer qualifies as secret after https://github.com/kubernetes/kubernetes/issues/44095)

That said, k8s is quite extensible and it would definitely be possible to add such a component as a controller.
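To make the idea concrete, here is a minimal Python sketch of the kind of calculation such a component might do: look at observed usage and derive a request/limit from it. This is not how Autopilot or any Kubernetes component actually works; the percentile rule, the headroom factor, the limit ratio, and all names are assumptions made up for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Recommendation:
    request: float  # e.g. CPU millicores (unit is hypothetical)
    limit: float


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of samples, with pct in (0, 100]."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def recommend(
    usage: Sequence[float],
    pct: float = 95.0,         # assumed: size the request to p95 usage
    headroom: float = 1.15,    # assumed: 15% safety margin on top
    limit_ratio: float = 2.0,  # assumed: limit is a fixed multiple of request
) -> Recommendation:
    """Derive a request/limit pair from observed usage samples."""
    request = percentile(usage, pct) * headroom
    return Recommendation(request=request, limit=request * limit_ratio)
```

For example, a pod that asked for 500 but mostly uses around 100 would get a request of about 115 from recommend([100.0] * 19 + [400.0]), and the difference is capacity the scheduler can hand to something else.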

Well, nobody should be using MapReduce now, and HDFS now is a lot better than 8 years ago.

Without Google using, validating and releasing those designs, we might be stuck with MPI and NFS for a lot longer.

How many GMM can MapReduce do? What about lattice quantum chromodynamics performance? MPI and Lustre exist for a reason: MapReduce isn't great for all problems.

MR never claimed to be great for all problems. Its main selling points were big data and ease of use.

Sure, MPI might blow MR out of the water in terms of number-crunching, but it is also way harder to use.

I know some people who develop on top of Tensorflow; from my conversations with them, Tensorflow's moat is Google making a lot of breaking changes by incorporating a lot of new functionality. I've also heard complaints that the online documentation isn't terribly great for triaging not-happy paths, to the point where you kind of have to just dig through the source code to figure out what's going on. Also, if you want to poach the maintainers, you would somehow have to poach them away from Google (which isn't happening, since ML at scale is something Google does best, and is an ongoing field of research). You can't become as good at using Tensorflow as Google is by simply forking the project.

Does that deviation line up with Google's entry into offering cloud computing?

Pretty much. That's when it became clear that Google would have to support whatever is popular outside Google.

Gonna just chime in and mention Yahoo for Hadoop (yes, I know Bigtable was Google) and ZooKeeper. Great tech started at Yahoo, but they didn’t (unlike today) make too much of a fuss/self-pat-on-the-back about it.

Those projects were initiated by Yahoo, but the designs are taken directly from Google papers. Hadoop (HDFS and Map/Reduce) was based on the GFS and Map/Reduce papers. ZooKeeper was based on the Chubby paper.
