Big Data: principles and best practices (new book) (manning.com)
132 points by nathanmarz on Jan 9, 2012 | 65 comments

Tangent: has anyone built a startup that does technical books this way (serialized subscriptions to books in progress, with on-demand print-and-ship at completion for the small subset of customers who want that)?

Now that giant book stores are on their way out, it seems like we should be ready to end the pretense of retail channel relationships and marketing as being worth virtually all the money in the tech book production value chain.

Also, do any companies exist that are trying to make the whole lifecycle of ebook self-publishing dead simple for non-techies? It seems that it's "not that easy"(1) even for techies and that non-techie authors can be overwhelmed with the amount of work they have to do that isn't writing a book(2). The only relevant company I know of is http://leanpub.com, but they state in their manifesto that they are only interested in the very specific and (I'm guessing here) relatively uncommon style of incremental self-publishing.

(1) http://www.whattofix.com/blog/archives/2012/01/e-book-publis...

(2) http://www.npr.org/2012/01/08/144804084/a-self-published-aut... around 5:30 in the audio

davidw's got a little startup to help people make Kindle books.

We've added a share/preview feature at eBookBurn.com (https://ebookburn.com/help.html#share) which lets authors do this with early drafts.

It's all digital, though, with no print-on-demand option.

What's the split between author/"publisher"? I know Safari does more than ORA books now, but don't you need a relationship with a real publisher to post books there? I'm talking about replacing the publisher altogether.

I'm one of the authors of the book. If you have any questions about the book, I'm happy to answer them here.

Nathan, I looked for the preorder button as soon as I saw your name on the landing page.

For anyone who's not familiar with his work, look these over:




I just ordered a copy of the EAP. Looks like it's PDF-only for now. Do you know if they plan on offering Kindle/EPUB versions later in the MEAP or on release? Some Manning books seem to have them and some don't. Device-specific formats are often much easier to read than PDFs.

I asked Manning about this for another EAP book, and this was their response:

At present all of our books are released as pdfs. Once the meaps are published they will be converted to mobile format epub and mobi. We understand the desire for mobile formats and we are looking to in the future, hopefully near future, to have all books available in mobile formats, meaps included. You can find all titles we have available in mobile format here: http://www.manning.com/catalog/mobile

Each book is converted manually to ensure that everything transfers to the new format as the Author intended it to appear. This is a painstaking process and does take time. Since each book is different in number of pages and images we do not have a set time frame for when each book will be available but know that as soon as the final ebook is complete it is sent to be converted.

I confirmed with Manning that this will be available. From Manning:

"When the book is finished and the ebook is created, a kindle epub version will be created at that time as well."

I don't think anyone has mentioned the discount code yet; it might still be active (bd50 for 50% off; it worked for me about 6 hours ago): https://twitter.com/#!/nathanmarz/status/156459481864220672

Just worked for me (at 2:43AM PST). Thanks for pointing that out!

As much as I'm a huge fan of Nathan Marz and his work with Clojure and Cascalog... I am hoping this book is about how to make Big Data accessible to programmers across multiple languages.

Nathan, do you think you'll be including pseudo-code, or will one need to be a Clojure programmer to best leverage your book?

There's no Clojure in the book (we don't think Clojure should be a prerequisite to learning this important subject). Most of the examples will be in Java.

There is a big emphasis in the book on using multiple languages together. This is reflective of how I myself have architected systems, with our team using Ruby, Python, Clojure, and Java for the same product. Chapter 2 is about creating a schema for your data using Thrift, for example.

Fantastic. I am very much looking forward to the book!

As someone who has never worked in this field or with statistics, and has only dealt with RDBMSs, would the book be useful for a beginner like me?

The material in the book is most useful when you're working at very large scale (where the RDBMS breaks down). You won't necessarily need the techniques if you're working at smaller scale, but the material will certainly expand your mind on ways to manage and work with data.

How well does a system like this work for bootstrappers on a tight budget? By the nature of the design, it seems you're going to need quite a few more servers than a simple LAMP-like setup. Between Hadoop, Cassandra, Storm, web servers, and the like, you're looking at roughly 10 server instances right out of the gate.

I ask because I'm intrigued by this kind of design, but not the server cost that seems to be associated with it for a newly launched (and potentially unproven) product.

If you have big data, you're going to need lots of servers anyway, and I think there's no better way to manage that data than with the techniques I talk about in the book.

While I think these techniques can scale down, the current crop of Big Data technologies (esp. Hadoop) don't scale down very well. That is, they have a lot of overhead for small amounts of data. So while these techniques can work for "small data", it's going to be relatively more costly. For big data, the overhead is amortized. In the future, I do see scaling down as an important evolution for these technologies.

Can you recommend some tools for someone starting down this path? I'm comfortable with apt-get and mildly capable with the AWS console, but I'm a bit daunted by the idea of attempting to automatically spin up 2-3 servers, have them configure themselves, and then have them form up a little Hadoop cluster. The "set up your own single-node Hadoop cluster on Ubuntu" guides I've skimmed have a sizeable amount of configuration details that are completely opaque to an outsider.

Not being huge into Java isn't helping either. Would I be better served by biting the bullet and doing things in Java initially, or can I skip right to Jython, JRuby, Clojure, or something similar?

I'm a big fan of Pallet for infrastructure management ( https://github.com/pallet/pallet ). That's what we used for all our infrastructure on AWS at BackType, and my team has continued to use it to manage our machines within the Twitter datacenter. Pallet has a steep learning curve, but it's worth it.

Sam wrote the pallet-hadoop tool which can spin up Hadoop clusters at the click of a button ( https://github.com/pallet/pallet-hadoop ). Although if you're on AWS you're better off just using EMR.

You don't need to use Java. I do everything in Clojure (using Cascalog and Storm's Clojure DSL).

The one thing that makes me mildly uncomfortable about Pallet is that, in the end, it's just another "run these shell scripts to set up your server" system. I find I prefer tools like Puppet or Chef, extended to deal with AWS (cluster-chef, for example).

There's nothing preventing you from installing all these components on a single box, and using cluster sizes of 1.

Of course, this won't net you any benefit (in fact, performance will be slightly worse), except that it will be relatively easy to scale out and add servers later on.

From the first paragraph:

"In the past decade the amount of data being created has skyrocketed. More than 30000 gigabytes of data are generated every second, and the rate of data creation is only accelerating"

How could you even hope to put a number on the rate at which data is being generated? What does it even mean to generate data?

Would make for a fun (& meaningless) interview question!

I agree. Where I work, we run weather and climate models, and these models generate hundreds of gigabytes of output in a short amount of time. Regarding "data being generated", these models output 3D data sets of weather/atmosphere-related variables (4D if you include time).

Perhaps it is an estimate of the volume of data that is archived and hence available for future analysis.

If you're in San Francisco this Thursday, come check out Nathan give a talk on Storm & realtime processing at Airbnb HQ.

Free food and drinks.

Signup here: http://www.airbnb.com/meetups/zjw9ks5q9-nathan-marz-of-twitt...

Looks like an interesting talk. I'll definitely try to be there. Thanks!

Thanks, but there are too many meetups this Thursday in the Bay Area. Curious: why doesn't Airbnb use meetup.com?

Thanks for the comment Suyash.

We needed a meetup tool for all of the awesome Community Meetups that we throw around the world, so two of our engineers (Raph: https://github.com/Raphomet & Horace: https://github.com/warpdude) built an Airbnb Meetup tool. We like to dogfood, so we use it for our nerd meetups as well.

As for the meetup being on a busy Thursday, it just happened to be a time that worked for everybody.

If you're interested in this subject and are looking for training in Cascalog, one of the authors (Sam Ritchie) is teaching a 3 day class prior to the Clojure/West conference - more info here: http://clojurewest.org/training-cascalog

I am looking forward to this book.

One of the things I would like is recommended naming conventions for the various objects in Storm. For example, what's the best way to name a StreamID? Should it include information about the spout/bolt it originates from and the bolt it is going to? I spend a lot of time fretting over these names and I still feel like I'm not getting it right.

I make my stream ids descriptive enough that I understand what each one is in the context of that spout or bolt. For wrapper bolts (like CoordinatedBolt) that add streams in addition to those of the wrapped bolt, I'll prefix the stream id with the name of the wrapper class to avoid naming conflicts (essentially namespacing the stream).
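To make the namespacing idea concrete, here is a minimal sketch. It is not Storm's API: the helper names, the "/" separator, and the "ack" stream are all hypothetical, chosen only to show how prefixing with the wrapper's class name keeps a wrapper's streams from colliding with the wrapped bolt's.

```python
def namespaced_stream_id(wrapper_name: str, stream_name: str) -> str:
    """Prefix a stream id with the wrapper's name (hypothetical "/" separator)."""
    return f"{wrapper_name}/{stream_name}"


def declare_streams(wrapped_streams, wrapper_name, wrapper_streams):
    """Combine the wrapped bolt's stream ids with the wrapper's own,
    namespacing the wrapper's ids so the two sets cannot clash."""
    declared = set(wrapped_streams)
    for name in wrapper_streams:
        stream_id = namespaced_stream_id(wrapper_name, name)
        if stream_id in declared:
            raise ValueError(f"stream id already declared: {stream_id}")
        declared.add(stream_id)
    return declared


# The wrapped bolt already uses "ack"; the wrapper's own "ack" stream
# still gets a distinct id because of the prefix.
all_streams = declare_streams({"words", "ack"}, "CoordinatedBolt", ["ack"])
```

Here `all_streams` contains both `"ack"` and `"CoordinatedBolt/ack"`, so the wrapper never shadows a stream the wrapped bolt declared.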

In general I think of streams as not going to a particular bolt, but something that is provided that anyone can subscribe to. So in the WordCountTopology, the stream of words isn't "intended" for the word counting bolt, it's just data that can be used by anyone else in the topology. This is a consumer-focused way of looking at it (consumers know their inputs) rather than producer-focused (producers know their outputs).
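A toy sketch of that consumer-focused view, under loud assumptions: `ToyTopology`, `subscribe`, and `emit` are made-up names for illustration and do not mirror Storm's real classes. The producer only names the stream it emits to; any number of consumers can attach to it without the producer knowing.

```python
from collections import defaultdict


class ToyTopology:
    """Hypothetical in-process stand-in for a topology: streams are just
    names, and consumers declare which streams they read from."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, stream_id, handler):
        # The consumer knows its input; the producer is not involved.
        self._subscribers[stream_id].append(handler)

    def emit(self, stream_id, value):
        # The producer publishes to a stream, not to a particular consumer.
        for handler in self._subscribers[stream_id]:
            handler(value)


counts = defaultdict(int)
seen_lengths = []


def count_word(word):
    counts[word] += 1


topology = ToyTopology()
topology.subscribe("words", count_word)  # the word-counting consumer
topology.subscribe("words", lambda w: seen_lengths.append(len(w)))  # a second, unrelated consumer

for word in "the cat saw the dog".split():
    topology.emit("words", word)
```

Adding the second consumer required no change to the code that emits words, which is the point of naming streams after what they carry rather than where they go.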

Beware if you plan to buy electronic books from Manning. Unlike O'Reilly and Pragmatic one may only download purchased electronic books from Manning for a short period of time.


That's not true (at least not any more). You now get an account on Manning the first time you buy a book and you can download the electronic versions of those books any time from your account.

Screenshot of my account page - http://imgur.com/Hu7qT

The URL is - http://beta.manning.com

To clarify, the time limit applies only to the links that are sent via email to you. The latest version of the MEAP and the published versions are always available for download.

Try: http://beta.manning.com/

Lists all of your ebooks with quick download links and last updated times for meap books. Been using it for a while now with no issues.

That made me extremely hesitant to order from them, but it turns out that I can download all my Manning books without time restrictions. (Clearly inconsistent with the policy you point out; dunno what it all means.)

They might have changed this recently. Several months ago I wanted to download a copy of a book that I had purchased in the past and discovered this policy. I was not able to download it again.

Are you using the original links that were emailed to you? I no longer had the download message and I see no way to login on their site.

That is definitely a usability flaw IMO. I forget the exact dance I did, but I think in the end it was simply to create an account, and they associate all books you've previously ordered with that email address with your new account.

The emailed links expire, but the accounts page is permanent as far as I know.

I keep hoping that the next Big Data book I see will serve as a relatively gentle introduction for us non-DB types. I work in incident response and SIEM, and log analysis (for things other than web analytics) seems like a natural fit for this approach.

This past week I'd been looking all over Amazon/Apress/PragProg for a good data-practices book and couldn't find one. I skimmed over the summary page for Big Data and decided that this was what I was looking for. Bye bye $40.

Will there be a chapter about how your preferred architecture compares to in-memory database technologies? SAP has been creating a lot of marketing hype around this in the last few years. Indeed, I think a lot of business problems can be solved by cramming more RAM into centralized servers. Yesterday's big-data problems are now routine if you have a machine with dozens of gigabytes of RAM available.

Or is this question not applicable at all, either because the architecture makes no assumptions about the type of data storage or because the requirements and usage scenarios are completely different?

The architecture described in the book is fully distributed and horizontally scalable, and I won't be looking at scale-up techniques. The chapters on Storm and distributed RPC do, though, emphasize using lots of RAM for certain tasks by partitioning data appropriately across the nodes.
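As a rough illustration of partitioning data across nodes so that each node keeps only its share in memory, here is a minimal hash-partitioning sketch. It is an assumption-laden toy, not the book's method: the function names are hypothetical, and plain dicts stand in for the RAM of separate machines.

```python
import hashlib


def partition_for(key: str, num_nodes: int) -> int:
    """Map a key to a node index using a stable hash, so every process
    agrees on which node owns which key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition_records(records, num_nodes):
    """Spread (key, value) pairs across per-node in-memory dicts.
    Each dict is a stand-in for one node's RAM (hypothetical setup)."""
    nodes = [dict() for _ in range(num_nodes)]
    for key, value in records:
        nodes[partition_for(key, num_nodes)][key] = value
    return nodes


nodes = partition_records([("a", 1), ("b", 2), ("c", 3)], num_nodes=4)

# A lookup only has to touch the single node that owns the key.
value_for_a = nodes[partition_for("a", 4)]["a"]
```

A stable hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would make different machines disagree about key ownership.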

In chapter 2, you describe writing the master dataset to a distributed filesystem (presumably HDFS). If you create a new file with a new record for each update to the dataset, wouldn't this result in lots of small files on the filesystem, and hence fragmentation? HDFS seems well suited to large files, not small ones (http://hadoop.apache.org/common/docs/r1.0.0/hdfs_design.html...).

Radical. Just ordered the MEAP of this. Looks like great stuff, and I - for one - can't wait for the full book.

MEAP, btw, is a great program... for any of you guys who haven't ever bought a book this way, it's pretty cool. Getting updates as they're delivered and being able to provide feedback as the book is being developed, is pretty gnarly.

What do you think about "grid computing" concepts? Are they too academic? outdated? more/less general? Is your architecture a different approach or just a variation on the theme, a special case?

Technically, the architecture promoted in the book is "grid computing" -- that is, a fully distributed set of resources that work together to accomplish a common task.

Many commercial grid computing products try to be all in one -- that is, handle storage and computation. They don't apply to every problem because they only have one kind of storage meant for certain kinds of tasks.

The architecture in Big Data is a general-purpose way to compute arbitrary functions on arbitrary data, at scale and in realtime. Every data problem you'd ever want to do can be described as a function on data, which is why this architecture is so general-purpose. I recommend reading Chapter 1 in the book (which is free to download from the webpage for the book) where we explain these ideas much further.

Is there a way to get a notification when the full book is released?

You certainly will if you buy the MEAP. I'm getting a clarification on this from Manning to see if there's a way to get a notification without buying the book.

I'll also be making announcements about the book on Twitter, of course.

I got clarification from Manning. Currently there's no way to get notification without buying the MEAP. However, they think this is an interesting idea and there might be a way to do this in the future.

Thanks for checking on that. Bookmarked the book instead :)

How has working at Twitter changed your perspective on big data? Has it been more challenging working within an existing framework or starting from scratch (at BackType)?

Actually, it's reinforced my beliefs about the proper way to build these systems. We've continued using all the tools and technologies we were using at BackType (Cascalog, ElephantDB, Storm, our schema, etc.).

Really enjoyed the first chapter. Quick question: how do you deal with filtering out duplicate records (e.g. blog posts/comments) when saving to the batch layer?

Chapter 2 talks about forming a data model for the master dataset. The core idea is that each record should be a "fact" that stands on its own as something true at a moment in time. When you write your batch computations, you should make them work on any set of valid facts. There's nothing wrong with saying the same record twice, as logically "A and A" is the same as "A". So by formulating batch computations to work on any valid set of facts, it doesn't matter if facts are duplicated.
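A small sketch of what "works on any valid set of facts" can look like in practice. The `PageviewFact` record and the pageview count are hypothetical examples, not taken from the book; the only point is that the computation treats its input as a set, so a repeated fact cannot change the answer.

```python
from typing import Iterable, NamedTuple


class PageviewFact(NamedTuple):
    """Hypothetical fact: 'this user viewed this URL at this moment'."""
    user_id: str
    url: str
    timestamp: int


def pageviews_per_url(facts: Iterable[PageviewFact]) -> dict:
    """Count pageviews per URL over a *set* of facts.
    Identical facts collapse, mirroring "A and A" being the same as "A"."""
    unique_facts = set(facts)
    counts = {}
    for fact in unique_facts:
        counts[fact.url] = counts.get(fact.url, 0) + 1
    return counts


facts = [
    PageviewFact("alice", "/home", 100),
    PageviewFact("bob", "/home", 105),
    PageviewFact("alice", "/home", 100),  # same fact written twice
]

once = pageviews_per_url(facts)
twice = pageviews_per_url(facts + facts)  # replaying every fact changes nothing
```

This only works because each fact carries enough detail (here, the timestamp) to stand on its own; two genuinely different pageviews by the same user remain distinct facts.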

How was the work split up between you and the co-author?

I brought Sam on as a co-author at the 1/3 mark of the book (when Manning does the first round of reviews). So far Sam has helped with the revisions necessary to go from review -> MEAP, and we'll be splitting up the future chapters.

Will this be available as a well formatted pdf?

Yes. Check out the first chapter to see what it looks like: http://manning.com/marz/BD_meap_ch01.pdf

When do you expect to complete the book?

End of this year. We'll be releasing chapters to the MEAP as we go.

From the site estimate: Summer 2012
