mrry's comments

"Project Nessie: Transactional Catalog for Data Lakes with Git-like Semantics"

https://projectnessie.org/

This supports lightweight branches, and transactional commits and merges. I haven't used it—and it seems cool—but it also seems a little heavyweight to get cross-table transactions on top of these table formats (which would be my primary use case).


Project Nessie also powers Dremio's Arctic service, so you get all the branching and merging benefits of Nessie along with an intuitive UI for browsing, creating, and merging branches. It's also a cloud-managed service with a free tier.



[Disclosure: I designed and wrote most of the docs for TensorFlow’s Dataset API.]

I’m sorry to hear that you’ve not had a pleasant experience with tf.data. One of the doc-related criticisms we’ve heard is that the docs aim for broad coverage, rather than being examples you can drop into your project and run straight away. We’re trying to address that with more tutorials and blog posts, and it’d be great if you started a blog on that topic to help out the community!

If there are other areas where we could improve, I’d be delighted to hear suggestions (and accept PRs).


No offense, but TensorFlow's Dataset API documentation also sucks. Combined with bad API design (which could actually be used as a case study of bad design in classrooms), it's a disaster in the making. For example, shuffle() takes a mysterious argument. Why? It's not explained in the docs, except that it should be larger than the number of items in the dataset. Why can't shuffle() just be shuffle(), and why do I now have to remember to pass the correct parameter for the rest of my life? Whatever. I still don't get what exactly repeat() does. Does it rewind back to the start when you reach past the end? Why do you need it? Why not just stick to epochs? Why make things complicated with steps vs. epochs anyway? The docs give zero clue. Then there's a whole bunch of mysteriously named, unexplained methods like make_one_shot_iterator() or from_tensor_slices(). Why is make_one_shot_iterator() not just iterator()? Why do I have to rebuild the dataset using from_tensor_slices()? The docs are written from the point of view of "take all this code calling mysteriously designed APIs, copy-paste it, and don't bother too much about understanding what those APIs really do". It really sucks.


IMO, shuffle is something they actually did really well. Unlike PyTorch datasets, TF allows streaming unbounded data. For something like that to work with shuffle, it has to buffer some data before passing it down the pipeline, and the argument specifies how much.

This may not seem useful for conventional training, where you usually work with a fixed set of samples you know beforehand. But there are cases where that isn't true (for instance, some special kinds of augmentation); the streaming part is useful there, but then you need this buffering trick.
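
For a concrete picture, here's a minimal TF 1.x-style sketch (the in-memory arrays are made up and stand in for whatever source you actually stream from) showing how the shuffle buffer, repeat(), and make_one_shot_iterator() fit together:

    import numpy as np
    import tensorflow as tf

    # Hypothetical in-memory data; stands in for whatever source you actually stream from.
    features = np.random.rand(1000, 28, 28).astype(np.float32)
    labels = np.random.randint(0, 10, size=1000).astype(np.int64)

    dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
               .shuffle(buffer_size=1000)  # fill a 1000-element buffer and sample from it
               .repeat()                   # restart from the beginning once exhausted
               .batch(32))

    iterator = dataset.make_one_shot_iterator()
    next_batch = iterator.get_next()

    with tf.Session() as sess:
        batch_x, batch_y = sess.run(next_batch)
        print(batch_x.shape, batch_y.shape)  # (32, 28, 28) (32,)

With buffer_size equal to the dataset size you get a full shuffle; with a smaller buffer you trade randomness for memory, which is the whole point when the data doesn't fit in memory (or never ends).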

But I agree the API naming is not stellar, or at least it should come with better documentation.


I think tf.data is amazing; it's far, far better than the previous queue and string_input_producer style approach.

More than documentation, I would argue that TF, and especially tf.data, lacks a tracing tool that would let a user quickly debug how data is being transformed and whether there are any obvious ways to speed things up. E.g., image_load -> cast -> resize vs. image_load -> resize -> cast had different behavior and led to hard-to-identify bugs. And tf.data's prefetch, which ends up being key to improving speed, is not documented; the only way I actually found out about it was by reading your tf.data presentation.
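
For anyone reading along, here's a rough sketch of the kind of pipeline I mean (the filenames and sizes are placeholders), with prefetch at the end to overlap input processing with training:

    import tensorflow as tf

    # Placeholder file list; substitute your own images.
    filenames = tf.data.Dataset.from_tensor_slices(["img0.jpg", "img1.jpg"])

    def load_and_preprocess(path):
        image = tf.image.decode_jpeg(tf.read_file(path), channels=3)  # uint8
        image = tf.cast(image, tf.float32)                            # cast here...
        image = tf.image.resize_images(image, [224, 224])             # ...or resize first: different cost and rounding
        return image / 255.0

    dataset = (filenames
               .map(load_and_preprocess, num_parallel_calls=4)
               .batch(32)
               .prefetch(1))  # prepare the next batch while the current one is training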


I guess you misread me. The Dataset API is somewhat fine, much better than queues for instance. However, it's not clear from the documentation how to do more complex stuff, or how to integrate it with the rest of the TF stack, especially the new Estimator API.


Yes. The prebuilt binaries can be used with GPUs that support CUDA 8.0 and have compute capability 3.5 or 5.2. The package is now in PyPI, so you can install the latest version with `pip install tensorflow-gpu`.


We've made good progress on single-machine performance and memory usage in the latest release, especially for convolutional models (such as adding support for NCHW data layouts), and we'll continue to aim for best-in-class performance in that setting.
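
To illustrate what the layout option looks like at the op level, here is a minimal sketch (arbitrary shapes; NCHW convolutions generally need a GPU/cuDNN to run):

    import tensorflow as tf

    # The same convolution expressed in NHWC (default) and NCHW layouts.
    # On NVIDIA GPUs with cuDNN, NCHW is typically faster.
    images_nhwc = tf.random_normal([32, 224, 224, 3])
    images_nchw = tf.transpose(images_nhwc, [0, 3, 1, 2])
    filters = tf.random_normal([3, 3, 3, 64])

    conv_nhwc = tf.nn.conv2d(images_nhwc, filters, strides=[1, 1, 1, 1], padding='SAME')
    conv_nchw = tf.nn.conv2d(images_nchw, filters, strides=[1, 1, 1, 1],
                             padding='SAME', data_format='NCHW')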

The cool thing about distributed TensorFlow is that it supports efficient synchronous optimizers, so you can scale up the effective batch size by using multiple GPUs, to get increased throughput without losing accuracy.
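
As an illustration of the idea (not a recipe for the distributed setup), here is a single-process sketch of synchronous data parallelism across two GPUs, where the gradients from each tower are averaged before a single update is applied; the model and sizes are placeholders:

    import tensorflow as tf

    NUM_GPUS = 2  # assumption: adjust to your machine
    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.placeholder(tf.int64, [None])
    x_splits = tf.split(x, NUM_GPUS)
    y_splits = tf.split(y, NUM_GPUS)

    opt = tf.train.GradientDescentOptimizer(0.1)
    tower_grads = []
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
            logits = tf.layers.dense(x_splits[i], 10)  # toy model; variables shared across towers
            loss = tf.reduce_mean(
                tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_splits[i], logits=logits))
            tower_grads.append(opt.compute_gradients(loss))

    # Average each variable's gradient across towers, then apply one synchronous update.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = tf.stack([g for g, _ in grads_and_vars])
        averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
    train_op = opt.apply_gradients(averaged)

The update is mathematically the same as a single large-batch step; it's just computed in parallel across the GPUs, which is why throughput scales without giving up accuracy.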


Thanks for the shout out!


Derek is also extremely available to his colleagues at Google. He's always friendly when I ask questions, and very thoughtful. I feel lucky to work with him, however distantly! :)


A lot of the differences between the systems arise from the implementation choice of how to do aggregation in Hadoop 2.4.0 and Spark 1.3. There's nothing inherent in the RDD model, for example, that says the aggregation has to be done eagerly at the mapper; nor in the MapReduce model that says it has to be done at the reducer. Either system could support the other aggregation mechanism, and the only challenge would be in choosing which one to use.
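
A toy word count in plain Python (two hypothetical input partitions) shows the two strategies side by side: both compute the same answer, and the choice only changes how much data has to cross the shuffle boundary:

    from collections import Counter, defaultdict

    partitions = [["a", "b", "a"], ["b", "b", "c"]]  # two hypothetical mapper inputs

    # Reduce-side only: every occurrence is shipped to the reducer, which does all the counting.
    def reduce_only(parts):
        shuffled = defaultdict(list)
        for part in parts:
            for word in part:
                shuffled[word].append(1)  # one shuffled record per occurrence
        return {w: sum(ones) for w, ones in shuffled.items()}

    # Eager map-side aggregation: each mapper pre-counts its own partition,
    # so only one partial count per (mapper, key) is shipped and merged.
    def map_side_partial(parts):
        merged = Counter()
        for part in parts:
            merged.update(Counter(part))  # merge per-partition partial counts
        return dict(merged)

    assert reduce_only(partitions) == map_side_partial(partitions)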

Some former colleagues wrote a nice paper about the performance trade-offs for different styles of distributed aggregation in DryadLINQ (a MapReduce-style system), and evaluated it at scale:

http://sigops.org/sosp/sosp09/papers/yu-sosp09.pdf


> Either system could support the other aggregation mechanism, and the only challenge would be in choosing which one to use.

Hive implements something similar to the paper mentioned: partial aggregation on the mappers, with the reducer doing a sorted final aggregation.

You'll find Hive beating MapReduce[1], even though it is implemented using MR.

[1] - https://www.cl.cam.ac.uk/research/srg/netos/musketeer/eurosy...


Malte Schwarzkopf is the first author of the Omega paper:

http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/S...

...so not entirely blind.


One of the reasons Omega failed was because its authors went up in an ivory tower and ignored everything about what was actually happening in Google datacenters for years. That a diagram this full of WTF could be produced by one of the Omega authors does not surprise me.


It's interesting to compare this with the Facebook stack, drawn by the same author:

http://malteschwarzkopf.de/research/assets/facebook-stack.pd...

…and it would be doubly interesting to see the "related work" arrows between the different companies' infrastructure platforms. We'd need a 3D visualization for that though.


I'd take this chart with a grain of salt, given the FB paper. Peregrine is no longer in service, IIRC. It was a friendly rival to Scuba, and was eventually replaced by Presto, which is not even mentioned. Also not mentioned are several important things like the pub/sub ecosystem built on Scribe. Haystack is long dead, except for maybe some archived stuff. Lastly, PHP-to-C++ has not been a thing since early 2012.


Original author here.

This is interesting -- I did my best to find out what's still in use and what is deprecated based on public information; happy to amend if anything is incorrect. (If you have publicly accessible written sources saying so, that'd be ideal!)

Note that owing to its origin (as part of my PhD thesis), this chart only mentions systems about which scientific, peer-reviewed papers have been published. That's why Scribe and Presto are missing; I couldn't find any papers about them. For Scribe, the Github repo I found explicitly says that it is no longer being maintained, although maybe it's still used internally.

Re Haystack: I'm surprised it's deprecated -- the f4 paper (2014) heavily implies that it is still used for "hot" blobs.

Re HipHop: ok, I wasn't sure if it was still current, since I had heard somewhere that it's been superseded. Couldn't find anything definite saying so, though. If you have a pointer, that'd be great.

BTW, one reason I posted these on Twitter was the hope to get exactly this kind of fact-checking going, so I'm pleased that feedback is coming :-)


HipHop's replacement was pretty widely reported: http://www.wired.com/2013/06/facebook-hhvm-saga/

This link has papers on pub/sub, HHVM, and so on: https://research.facebook.com/publications/

Re Haystack: possible I am misremembering, or the project to completely replace Haystack stalled since I left.

If you want to gather a more complete picture of infrastructure at these companies I suggest, well, not imposing the strange limitation of only reading peer-reviewed papers. Almost none of the stuff I worked on ended up in conference proceedings.


Thanks, I've added HHVM and marked HipHop as superseded.

I also added Wormhole, which I think is the pub/sub system you're referring to (published in NSDI 2015: https://www.usenix.org/system/files/conference/nsdi15/nsdi15...).

Updated version at original URL: http://malteschwarzkopf.de/research/assets/facebook-stack.pd...

Regarding the focus on academic papers: I agree that this does not reflect the entirety of what goes on inside companies (or indeed how open they are; FB and Google also release lots of open-source software). Certainly, only reading the papers would be a bad idea. However, a peer-reviewed paper (unlike, say, a blog post) is a permanent, citable reference and is part of the scientific literature. That sets a quality bar (enforced through peer review, which deemed the description plausible and accurate) and keeps the amount of information manageable. The sheer number of other sources makes them impractical to write up concisely, and it is hard to say what ought to be included and what should not once you go beyond published papers.

I don't think anyone should base their perception of how Google's or Facebook's stack works on these charts and bibliographies -- not least because they will quickly be out of date. However, I personally find them useful as a quick reference to comprehensive, high-quality descriptions of systems that are regularly mentioned in discussions :-)


Other ways to get information:

1. Ask the developers per email.

2. Fly out to SF, visit the campus, have lunch.

This will work for FB (I do it all the time); Google people won't tell you anything.


Re HipHop: I think HHVM + Hack (Facebook's internal "improved PHP") has superseded it, but while HHVM is open-sourced, Hack isn't public.


Hack is just a piece of HHVM, it's open: http://hacklang.org/


Ah thanks, I didn't realize it was out. I read a separate article that said it hadn't been released yet -- it was probably outdated.


Does Spanner talk to Bigtable? From reading the paper, I thought it was built directly on Colossus.


Ah, you're right!

I misread "This section [...] illustrate[s] how replication and distributed transactions have been layered onto our Bigtable-based implementation." in the paper as meaning that Spanner is partly layered upon BigTable, but what it really means is that the implementation is based upon (as in, inspired by) BigTable.

Spanner actually has its own tablet implementation as part of the spanserver (stored in Colossus) and does not use BigTable. I've amended the diagram to reflect this.


Carlos, you are mistaken re: Haystack. I don't think it was ever planned to be replaced, it's always been in use as a hot blob storage engine since it was launched. There are three storage layers for blobs at Facebook: hot (Haystack @ 3.6x replication), warm (F4 @ 2.1x), cold (Cold Storage @ 1.4x). We have published papers describing each layer.

Haystack still handles the initial writes and the burst of heavy read traffic when an object is fresh. It has a higher space overhead of 3.6x because it's optimized for throughput and latency versus cost savings. The Haystack tier's size is now dwarfed by the size of the F4 and Cold Storage tiers, but it still handles almost all of the reads for user blobs on Facebook due to the age/demand exponential curve.

After a Haystack volume is locked and has cooled down enough, it moves to F4 to bump its replication factor way down for cheaper, longer-term online storage. And then Cold Storage is used for the older stuff that gets barely any reads but that you still want to keep online in perpetuity.

That is why the team that works on media storage is called Everstore; they take storing people's photos and videos seriously and view it as a promise to keep your memories available forever. It feels really good to see photos from 5+ years ago and have them still work, and someday there will be 30 year old photos and videos on Everstore as well.

Source: I built Haystack with a couple other people and founded the Everstore team. :-)


Presto is a copy of dremel. Putting it all together:

gfs -> hdfs

bigtable -> hbase

google mapreduce -> hadoop

dremel -> presto

protocol buffers -> swift


protocol buffers -> thrift



An interesting comparison point: a single core on a late-2014 MacBook Pro can achieve runtimes for the same graph that are within a factor of 4 for WCC (461 seconds for FlashGraph versus 1700 seconds for the laptop).

http://www.frankmcsherry.org/graph/scalability/cost/2015/02/... (previously on HN: https://news.ycombinator.com/item?id=9001618)

There are also results for PageRank on that graph, which make the difference more pronounced. FlashGraph runs PageRank in 2041 seconds (I'm assuming for 30 iterations, per Section 4 of the paper), whereas the laptop takes 46000 seconds for 20 iterations, i.e. roughly 68 versus 2300 seconds per iteration.


Absolutely spot on - FlashGraph and Frank McSherry's COST work have really pushed the envelope on efficient large-scale graph analysis.

Frank McSherry wrote a "call to arms" for the broader graph community at [1]. The main point of interest is that academia generally compared their work against existing distributed graph processing systems, celebrating whenever any gains were made, while remaining unaware of the significant overheads brought on by the distributed approach. Both Frank's work (run on a single laptop) and FlashGraph (run on a single powerful machine) run far faster than the distributed approaches and have very few disadvantages.

Note: I'm a data scientist at Common Crawl and Frank's graph computation discussion article was a guest post at our blog.

[1]: http://blog.commoncrawl.org/2015/04/evaluating-graph-computa...


This is absolutely true! However, I'm not aware of anybody doing this for the graph processing systems being discussed in the post, nor am I sure what that would mean. Admittedly, if something takes 100 times longer to run on multiple machines than it does on a single laptop (as in the previous COST post [1]), then you might care about fault tolerance, but that's a contrived problem if you could achieve the same results faster on your laptop.

[1] http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...


I was answering the parent's questions. There are usually good reasons for clusters and big iron.

