

Big Data: principles and best practices (new book) - nathanmarz
http://manning.com/marz/

======
tptacek
Tangent: has someone done the startup to do technical books this way
(serialized subscriptions to books in progress, on-demand print-and-ship at
completion for the small subset of customers that want that)?

Now that giant book stores are on their way out, it seems like we should be
ready to end the pretense of retail channel relationships and marketing as
being worth virtually all the money in the tech book production value chain.

~~~
prostoalex
<http://safaribooksonline.com/> ?

~~~
tptacek
What's the split between author/"publisher"? I know Safari does more than ORA
books now, but don't you need a relationship with a real publisher to post
books there? I'm talking about replacing the publisher altogether.

------
nathanmarz
I'm one of the authors of the book. If you have any questions about the book,
I'm happy to answer them here.

~~~
tednaleid
I just ordered a copy of the MEAP. Looks like it's PDF-only for now. Do you know if
they plan on offering Kindle/EPUB versions later in the MEAP or on release?
Some Manning books seem to have them and some don't. Device-specific formats
are often much easier to read than PDFs.

~~~
samstokes
I asked Manning about this for another MEAP book, and this was their response:

 _At present all of our books are released as pdfs. Once the meaps are
published they will be converted to mobile format epub and mobi. We understand
the desire for mobile formats and we are looking to in the future, hopefully
near future, to have all books available in mobile formats, meaps included.
You can find all titles we have available in mobile format
here: <http://www.manning.com/catalog/mobile>

Each book is converted manually to ensure that everything transfers to the new
format as the Author intended it to appear. This is a painstaking process and
does take time. Since each book is different in number of pages and images we
do not have a set time frame for when each book will be available but know
that as soon as the final ebook is complete it is sent to be converted._

------
mark_h
I don't think anyone has mentioned the discount code yet; it might still be
active (bd50 for 50% off; it worked for me about 6 hours ago):
<https://twitter.com/#!/nathanmarz/status/156459481864220672>

~~~
mattyb
Just worked for me (at 2:43AM PST). Thanks for pointing that out!

------
mwexler
As much as I'm a huge fan of Nathan Marz and his work with Clojure and
Cascalog... I am hoping this book is about how to make Big Data accessible to
programmers across multiple languages.

Nathan, do you think you'll be including pseudo-code, or will one need to be a
Clojure programmer to get the most out of your book?

~~~
nathanmarz
There's no Clojure in the book (we don't think Clojure should be a
prerequisite to learning this important subject). Most of the examples will be
in Java.

There is a big emphasis in the book on using multiple languages together. This
is reflective of how I myself have architected systems, with our team using
Ruby, Python, Clojure, and Java for the same product. Chapter 2 is about
creating a schema for your data using Thrift, for example.

~~~
pknerd
As someone who has never worked in this field or with stats, and has only dealt
with RDBMSs, would this book be useful for a beginner like me?

~~~
nathanmarz
The material in the book is most useful when you're working at very large
scale (where the RDBMS breaks down). You won't necessarily need the techniques
if you're working at smaller scale, but the material will certainly expand
your mind on ways to manage and work with data.

------
blakesmith
How well does a system like this work for bootstrappers on a tight budget? It
seems like by nature of the system design, you're going to need quite a few
more servers than a simple LAMP-like setup. Between Hadoop, Cassandra, Storm,
web servers and the like, you're looking at ~10-ish server instances right out
of the gate.

I ask because I'm intrigued by this kind of design, but not the server cost
that seems to be associated with it for a newly launched (and potentially
unproven) product.

~~~
nathanmarz
If you have big data, you're going to need lots of servers anyway, and I think
there's no better way to manage that data than with the techniques I talk
about in the book.

While I think these techniques can scale down, the current crop of Big Data
technologies (esp. Hadoop) don't scale down very well. That is, they have a
lot of overhead for small amounts of data. So while these techniques can work
for "small data", it's going to be relatively more costly. For big data, the
overhead is amortized. In the future, I do see scaling down as an important
evolution for these technologies.
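
To make the amortization point concrete, here is a rough cost sketch. The numbers (a fixed 30-second startup cost per job and 2 seconds of work per gigabyte) are invented purely for illustration; they are not measurements of Hadoop or any other system.

```python
def overhead_fraction(fixed_overhead_s: float, seconds_per_gb: float, data_gb: float) -> float:
    """Fraction of total job time that goes to fixed, size-independent overhead."""
    total_s = fixed_overhead_s + seconds_per_gb * data_gb
    return fixed_overhead_s / total_s


# Hypothetical numbers, chosen only to show the shape of the curve.
FIXED_OVERHEAD_S = 30.0
SECONDS_PER_GB = 2.0

for data_gb in (0.1, 10, 10_000):
    share = overhead_fraction(FIXED_OVERHEAD_S, SECONDS_PER_GB, data_gb)
    print(f"{data_gb:>8} GB -> {share:.1%} of job time is overhead")
```

Under these assumed numbers the fixed cost dominates small jobs and becomes negligible for large ones, which is the sense in which the overhead is amortized.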

~~~
dpritchett
Can you recommend some tools for someone starting down this path? I'm
comfortable with apt-get and mildly capable with the AWS console, but I'm a
bit daunted by the idea of attempting to automatically spin up 2-3 servers,
have them configure themselves, and then have them form up a little Hadoop
cluster. The "set up your own single-node Hadoop cluster on Ubuntu" guides
I've skimmed have a sizeable amount of configuration details that are
completely opaque to an outsider.

Not being huge into Java isn't helping either. Would I be better served by
biting the bullet and doing things in Java initially or can I skip right to
jython or jruby or clojure or something?

~~~
nathanmarz
I'm a big fan of Pallet for infrastructure management (
<https://github.com/pallet/pallet> ). That's what we used for all our
infrastructure on AWS at BackType, and my team has continued to use it to
manage our machines within the Twitter datacenter. Pallet has a high learning
curve, but it's worth it.

Sam wrote the pallet-hadoop tool which can spin up Hadoop clusters at the
click of a button ( <https://github.com/pallet/pallet-hadoop> ). Although if
you're on AWS you're better off just using EMR.

You don't need to use Java. I do everything in Clojure (using Cascalog and
Storm's Clojure DSL).

~~~
gchpaco
The one thing that makes me mildly uncomfortable about pallet is that, in the
end, it's just another "run these shell scripts to set up your server" system.
I find I prefer tools like puppet or chef and then extending them to deal with
AWS (cluster-chef, for example).

------
aquark
From the first paragraph:

"In the past decade the amount of data being created has skyrocketed. More
than 30000 gigabytes of data are generated every second, and the rate of data
creation is only accelerating"

How could you even hope to put a number on the rate at which data is being
generated? What does it even mean to generate data?

Would make for a fun (& meaningless) interview question!

~~~
metaobject
I agree. Where I work, we run weather and climate models, and these models
generate hundreds of gigabytes of output in a short amount of time. Regarding
"data being generated", these models output 3D data sets of
weather/atmospheric related variables (4D if you include time).

------
Airbnb_Nerds
If you're in San Francisco this Thursday, come check out Nathan give a talk on
Storm & realtime processing at Airbnb HQ.

Free food and drinks.

Signup here: [http://www.airbnb.com/meetups/zjw9ks5q9-nathan-marz-of-
twitt...](http://www.airbnb.com/meetups/zjw9ks5q9-nathan-marz-of-twitter-on-
storm-and-realtime-processing)

~~~
suyash
Thanks, but there are too many meetups this Thursday in the Bay Area. Curious:
why doesn't Airbnb use meetup.com?

~~~
_harry
Thanks for the comment Suyash.

We needed a meetup tool for all of the awesome Community Meetups that we throw
around the world, so two of our engineers (Raph: <https://github.com/Raphomet>
& Horace: <https://github.com/warpdude>) built an Airbnb Meetup tool. We like
to dogfood, so we use it for our nerd meetups as well.

As for the meetup being on a busy Thursday, it just happened to be a time that
worked for everybody.

------
puredanger
If you're interested in this subject and are looking for training in Cascalog,
one of the authors (Sam Ritchie) is teaching a 3-day class prior to the
Clojure/West conference - more info here:
<http://clojurewest.org/training-cascalog>

------
ThePig
I am looking forward to this book.

One of the things I would like is recommended naming conventions for the
various objects in Storm. For example, what's the best way to name a StreamID?
Should it include information about the spout/bolt it originates from and the
bolt it is going to? I spend a lot of time fretting over these names and I still
feel like I'm not getting it right.

~~~
nathanmarz
I make my stream ids descriptive enough that I understand what each one is in the
context of that spout or bolt. For wrapper bolts (like CoordinatedBolt) that
add streams in addition to those of the wrapped bolt, I'll prefix the stream id
with the name of the wrapper class to avoid naming conflicts (essentially
namespacing the stream).

In general I think of streams as not going to a particular bolt, but something
that is provided that anyone can subscribe to. So in the WordCountTopology,
the stream of words isn't "intended" for the word counting bolt, it's just
data that can be used by anyone else in the topology. This is a consumer-
focused way of looking at it (consumers know their inputs) rather than
producer-focused (producers know their outputs).
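
Both ideas (namespacing a wrapper's streams, and consumers subscribing to streams rather than producers targeting bolts) can be sketched in a few lines. This is a plain-Python toy model, not Storm's API: `namespaced_stream_id`, `Topology`, `declare_stream`, `subscribe`, and the component and stream names are all hypothetical and made up for illustration.

```python
from collections import defaultdict


def namespaced_stream_id(wrapper_name: str, stream_id: str) -> str:
    """Prefix a stream id with the wrapper's name so it cannot collide
    with a stream id declared by the wrapped component."""
    return f"{wrapper_name}/{stream_id}"


class Topology:
    """Toy registry: producers declare streams, consumers subscribe to them."""

    def __init__(self):
        self.streams = set()                  # {(component, stream_id)}
        self.subscribers = defaultdict(list)  # (component, stream_id) -> [consumer]

    def declare_stream(self, component: str, stream_id: str) -> None:
        key = (component, stream_id)
        if key in self.streams:
            raise ValueError(f"stream already declared: {key}")
        self.streams.add(key)

    def subscribe(self, consumer: str, component: str, stream_id: str) -> None:
        key = (component, stream_id)
        if key not in self.streams:
            raise KeyError(f"no such stream: {key}")
        self.subscribers[key].append(consumer)


topology = Topology()

# The producer only declares what it provides; it names no consumer.
topology.declare_stream("split-sentence", "words")

# A wrapper around the same component adds its own stream under a prefixed id
# ("ids" is a hypothetical stream name).
topology.declare_stream("split-sentence", namespaced_stream_id("CoordinatedBolt", "ids"))

# Any number of consumers can subscribe to the same stream.
topology.subscribe("word-count", "split-sentence", "words")
topology.subscribe("word-logger", "split-sentence", "words")
```

Note that the "words" stream never mentions "word-count"; the wiring lives entirely on the consumer side.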

------
absconditus
Beware if you plan to buy electronic books from Manning. Unlike O'Reilly and
Pragmatic one may only download purchased electronic books from Manning for a
short period of time.

<http://manning.com/about/ebook_support.html#downloadtime>

~~~
calibraxis
That made me extremely hesitant to order from them, but it turns out that I
can download all my Manning books without time restrictions. (Clearly
inconsistent with the policy you point out; dunno what it all means.)

~~~
absconditus
Are you using the original links that were emailed to you? I no longer had the
download message and I see no way to login on their site.

~~~
mark_h
That is definitely a usability flaw IMO. I forget the exact dance I did, but I
think in the end it was simply to create an account, and they associate all
books you've previously ordered with that email address with your new account.

The emailed links expire, but the accounts page is permanent as far as I know.

------
kylemaxwell
I keep hoping that the next Big Data book I see will serve as a relatively
gentle introduction for us non-DB types. I work in incident response and SIEM,
and log analysis (for things other than web analytics) seems like a natural
fit for this approach.

------
knb
Will there be a chapter about how your preferred architecture compares to
in-memory database technologies? SAP has been creating a lot of marketing hype
around this in the last few years. Indeed, I think a lot of business problems
can be solved by cramming more RAM into centralized servers. Yesterday's
big-data problems are now routine if you have a machine with dozens of
gigabytes of RAM available.

Or is this question not applicable at all, because the architecture makes no
assumptions about the type of data storage and the requirements and usage
scenarios are completely different?

~~~
nathanmarz
The architecture described in the book is fully distributed and horizontally
scalable, and I won't be looking at scale-up techniques. The chapters on Storm
and distributed RPC do have an emphasis on using lots of RAM for certain
tasks, though, by partitioning data appropriately across the nodes.
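
As a rough illustration of partitioning data across nodes so that each node holds only its own share in RAM, here is a hash-partitioning sketch. The node names and the `node_for`, `put`, and `get` helpers are hypothetical; this is a generic technique and not a description of how Storm or distributed RPC is implemented.

```python
import zlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster


def node_for(key: str, nodes=NODES) -> str:
    """Pick the node that owns a key.

    crc32 is used because it is stable across processes, unlike Python's
    built-in hash() for strings.
    """
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


# Each node keeps only its own partition in memory (simulated here as dicts).
in_memory = {node: {} for node in NODES}


def put(key: str, value: int) -> None:
    in_memory[node_for(key)][key] = value


def get(key: str):
    return in_memory[node_for(key)].get(key)


for url, count in [("example.com/a", 3), ("example.com/b", 7), ("example.com/c", 1)]:
    put(url, count)
```

Because every reader and writer computes the same owner for a key, a lookup touches exactly one node's memory.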

------
bm3780
In chapter 2, you describe writing the master dataset to a distributed
filesystem (presumably HDFS). If you create a new file with a new record for
each update to the dataset, wouldn't this result in lots of small files on the
filesystem, and therefore fragmentation? HDFS seems well suited to large
files, not small ones
([http://hadoop.apache.org/common/docs/r1.0.0/hdfs_design.html...](http://hadoop.apache.org/common/docs/r1.0.0/hdfs_design.html#Data+Organization)).

------
mindcrime
Radical. Just ordered the MEAP of this. Looks like great stuff, and I - for
one - can't wait for the full book.

MEAP, btw, is a great program... for any of you guys who haven't ever bought a
book this way, it's pretty cool. Getting updates as they're delivered and
being able to provide feedback as the book is being developed is pretty
gnarly.

------
knb
What do you think about "grid computing" concepts? Are they too academic?
outdated? more/less general? Is your architecture a different approach or just
a variation on the theme, a special case?

~~~
nathanmarz
Technically, the architecture promoted in the book is "grid computing" -- that
is, a fully distributed set of resources that work together to accomplish a
common task.

Many commercial grid computing products try to be all in one -- that is,
handle storage and computation. They don't apply to every problem because they
only have one kind of storage meant for certain kinds of tasks.

The architecture in Big Data is a general-purpose way to compute arbitrary
functions on arbitrary data, at scale and in realtime. Every data problem
you'd ever want to do can be described as a function on data, which is why
this architecture is so general-purpose. I recommend reading Chapter 1 in the
book (which is free to download from the webpage for the book) where we
explain these ideas much further.

------
whichdan
Is there a way to get a notification when the full book is released?

~~~
nathanmarz
You certainly will if you buy the MEAP. I'm getting a clarification on this
from Manning to see if there's a way to get a notification without buying the
book.

I'll also be making announcements about the book on Twitter, of course.

~~~
nathanmarz
I got clarification from Manning. Currently there's no way to get notification
without buying the MEAP. However, they think this is an interesting idea and
there might be a way to do this in the future.

~~~
whichdan
Thanks for checking on that. Bookmarked the book instead :)

------
jblomo
How has working at Twitter changed your perspective on big data? Has it been
more challenging working within an existing framework or starting from scratch
(at BackType)?

~~~
nathanmarz
Actually it's reinforced my beliefs about the proper way to build these systems.
We've actually continued using all the tools and technologies we were using at
BackType (Cascalog, ElephantDB, Storm, our schema, etc.)

------
bmfg
Really enjoyed the first chapter. Quick question: how do you deal with
filtering out duplicate records (e.g. blog posts/comments) when saving to the
batch layer ?

~~~
nathanmarz
Chapter 2 talks about forming a data model for the master dataset. The core
idea is that each record should be a "fact" that stands on its own as
something true at a moment in time. When you write your batch computations,
you should make them work on any set of valid facts. There's nothing wrong
with saying the same record twice, as logically "A and A" is the same as "A".
So by formulating batch computations to work on any valid set of facts, it
doesn't matter if facts are duplicated.
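
A minimal sketch of that idea, assuming facts are modeled as immutable, hashable records. The `PageviewFact` type, its fields, and the `pageviews_per_url` function are made up for illustration and are not the schema or code from the book.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class PageviewFact(NamedTuple):
    """One fact: this user viewed this URL at this moment in time."""
    user_id: str
    url: str
    timestamp: int


def pageviews_per_url(facts: Iterable[PageviewFact]) -> Counter:
    """Batch computation defined over the *set* of facts, so repeating a
    fact ("A and A") gives the same answer as stating it once ("A")."""
    return Counter(fact.url for fact in set(facts))


facts = [
    PageviewFact("alice", "example.com/a", 1000),
    PageviewFact("bob", "example.com/a", 1005),
    PageviewFact("alice", "example.com/b", 1010),
]

# The same dataset with one fact written three times.
with_duplicates = facts + [facts[0], facts[0]]

assert pageviews_per_url(facts) == pageviews_per_url(with_duplicates)
```

This only works because the timestamp is part of the fact: two genuinely separate pageviews by the same user are different facts, while a re-written record is the same fact.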

------
knb
How was the work split up between you and the co-author?

~~~
nathanmarz
I brought Sam on as a co-author at the 1/3 mark of the book (when Manning does
the first round of reviews). So far Sam has helped with the revisions
necessary to go from review -> MEAP, and we'll be splitting up the future
chapters.

------
lowglow
Will this be available as a well-formatted PDF?

~~~
nathanmarz
Yes. Check out the first chapter to see what it looks like:
<http://manning.com/marz/BD_meap_ch01.pdf>

------
infynyxx2
When do you expect to complete the book?

~~~
nathanmarz
End of this year. We'll be releasing chapters to the MEAP as we go.

------
danso
This past week I'd been looking all over Amazon/Apress/PragProg for a good
data-practices book and couldn't find one. I skimmed over the summary page for
Big Data and decided that this was what I was looking for. Bye bye $40.

