

Why Node.js streams are awesome - dmmalam
http://blog.dump.ly/post/19819897856/why-node-js-streams-are-awesome

======
masklinn
> The only downside is that it’s conceptually more complicated, and requires
> some understanding of underlying components (zip files, http responses,
> streams).

There's at least one more downside: the user loses all indication of
progress, as the Content-Length is unknown when the headers are sent.

~~~
dmmalam
Nice catch, we are working on it!

dumply knows the exact size of each image, as it is saved in the DB on
upload, and all the zip byte headers are fixed, so the zip file size should
be deterministic and calculable even before the first byte is sent.
Remember, we don't compress the already-compressed images.
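
For the curious, the arithmetic is straightforward when every entry is
stored rather than deflated. A minimal sketch (assuming no extra fields,
comments, data descriptors or ZIP64 records; the files array and its shape
are made up):

    
    
        // Exact Content-Length of a zip whose entries are all stored
        // (method 0), given each file's name and byte size from the DB.
        function zipContentLength(files) {
            var LOCAL = 30, CENTRAL = 46, EOCD = 22; // fixed header sizes
            return files.reduce(function (total, f) {
                var nameLen = Buffer.byteLength(f.name); // name appears twice
                return total + LOCAL + nameLen + f.size + CENTRAL + nameLen;
            }, EOCD);
        }
    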

If you don't know the file sizes, for example if you have raw input streams
of unknown length, or compressible data, you can still guesstimate the
Content-Length so the user gets some progress bar, even if it isn't 100%
accurate.

~~~
tiles
The IEInternals blog covers each browser's behavior for underrunning the
purported Content-Length of a download:
[http://blogs.msdn.com/b/ieinternals/archive/2011/03/09/brows...](http://blogs.msdn.com/b/ieinternals/archive/2011/03/09/browsers-accommodate-incorrect-http-content-length-and-sites-depressingly-depend-on-it.aspx)

It looks like, using browser sniffing, you can deliver an exaggerated
Content-Length to everyone but Opera, and browsers will deal with it
gracefully. Pretty neat. (Obviously not desirable, since it violates the
HTTP spec, but the UX gains might be worth it.)

~~~
pornel
You're not always dealing with the client that is specified in the UA string.
Clients can use proxies, including _transparent_ proxies.

For example, major mobile operators pipe HTTP connections through a proxy
that recompresses images. In that case you see, e.g., Safari's or Opera's
UA string, but you're actually dealing with the proxy's HTTP behavior.

~~~
tiles
In dump.ly's use case, they would probably want client-side detection in
JavaScript, not the UA string. You definitely have to be conservative
implementing such an unexpected feature.

An older article (2008) that also talks about misreporting content-length
for fun and profit: [http://tech.hickorywind.org/articles/2008/05/23/content-leng...](http://tech.hickorywind.org/articles/2008/05/23/content-length-mostly-does-not-matter-the-reverse-bob-barker-rule)

~~~
gilini
Client-side detection in JavaScript and the UA string aren't mutually
exclusive, since the UA string is exposed in the DOM.

Are you suggesting they would rather guess the browser by inferring it from
the presence of DOM properties and methods?
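
Both are one-liners on the client, for what it's worth (an illustrative
sketch; Presto-era Opera exposed a window.opera object):

    
    
        var isOperaByUA  = /Opera/.test(navigator.userAgent);    // UA sniffing
        var isOperaByDOM = typeof window.opera !== 'undefined';  // inference
    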

In the end it narrows down to: would you rather a) break the functionality
for some users in exchange for giving the best possible experience to
others, or b) give an OK experience to everyone?

I usually go with "b". I think that frustration is much more powerful than
awe.

------
colinmarc
I was playing around and did something similar with video encoding. The server
code starts a running ffmpeg process, and then the handler code just looks
like this:

    
    
        var http = require('http');
        var ffmpeg = require('child_process')
            .spawn('ffmpeg', ['-i', 'pipe:0', '-f', 'mpegts', 'pipe:1']); // args illustrative

        var server = http.createServer(function(request, response) {
            request.pipe(ffmpeg.stdin);   // piped upload -> ffmpeg's stdin
            ffmpeg.stdout.pipe(response); // transcoded output -> client
        });
        server.listen(9599);
    

What a nice interface! The end result is that you can do weird stuff like:

    
    
        $ curl -T my_video.mp4 http://localhost:9599 | mplayer

~~~
timc3
I think you'll find VLC, PS3/Xbox streaming servers, and a whole load of
others do the same thing.

------
EvanMiller
Not to poop in your cocoa puffs, but I wrote an Nginx module to do the same
thing in 2007.

<https://github.com/evanmiller/mod_zip>

The module is quite mature at this point, and is used in production on many
websites (including Box.net, which commissioned the initial work). The module
supports the Content-Length header, Range and If-Range requests, ZIP-64 for
large archives, and filename transcoding with iconv. Being written in C, it
will probably use much less RAM than an equivalent Node.js module.
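
For reference, the module works by having the upstream app reply with an
X-Archive-Files: zip header and a plain-text manifest, one "crc-32 size
location name" line per file; nginx then fetches each location and streams
out the archive. Roughly (values invented):

    
    
        X-Archive-Files: zip
        Content-Disposition: attachment; filename=photos.zip
        
        1a2b3c4d 128534 /images/1.jpg photo-one.jpg
        9f8e7d6c 981203 /images/2.jpg photo-two.jpg
    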

I have found that the hardest part of generating ZIP files on the fly has
nothing to do with network programming; it's producing files that open
correctly on all platforms, including Mac OS X's busted-ass
BOMArchiveHelper.app.

~~~
dmmalam
The point wasn't that creating on-the-fly zips is new; it was that pipeable
stream abstractions are a composable way to build network servers, and
Node.js is just what we found easiest to express this with.

Having a large number of stream primitives means you can easily wire up
endpoints: say you wanted to output a large DB query as XML, or consume and
edit gigabytes of JSON, or consume, transcode and output a video.

You can by all means write an nginx module in C for each use case, and that
is probably the right solution for very heavy, specific loads.

But writing a C module is probably too high a barrier for many, whereas
implementing a Node.js stream isn't. Respond to a few events, emit a few
events, and you have a module that can work with the hundreds of other
stream abstractions available (npm search stream).
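
For a sense of scale, a classic pass-through stream is about this much code
(a sketch using the old-style Stream API):

    
    
        var Stream = require('stream').Stream,
            util = require('util');
        
        // Minimal pass-through that upper-cases text as it flows by:
        // respond to write()/end(), emit 'data'/'end', and pipe() works.
        function Upcase() {
            Stream.call(this);
            this.readable = this.writable = true;
        }
        util.inherits(Upcase, Stream);
        
        Upcase.prototype.write = function (chunk) {
            this.emit('data', String(chunk).toUpperCase());
            return true; // no backpressure handling in this sketch
        };
        Upcase.prototype.end = function () {
            this.emit('end');
        };
        
        // ...and it composes with anything that speaks streams:
        process.stdin.pipe(new Upcase()).pipe(process.stdout);
    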

You still need the specific domain knowledge (e.g. how zip headers work),
and this is usually the complicated bit. mod_zip looks excellent, and I
wonder if some of its zip-handling domain knowledge can be reused in
zipstream.

------
chrisacky
Nice approach.

This is how we handle it currently.

> User adds images to a virtual lightbox.

> User decides that he wants to download all the images in this lightbox, so
> presses "Download Folder". The user is then presented with a list of
> possible dimensions that they can request.

> The user selects "Large" and "Small" and hits "Download"

> This request gets added to our Gearman job queue.

> The job gets handled and all the files are downloaded from Amazon S3 to a
> temporary location on the local file server.

> A Zip object is then created and each file is added to the Zip file.

> Once complete, the file is then uploaded back to Amazon S3 in a custom
> "archives" bucket.

> Before this batch job finishes, I fire off a message to Socket.io / Pusher
> which sends the URL back to the client, who has been waiting patiently for
> X minutes while his job has been processing (sketched below).
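
That last push might look something like this (a sketch; the event name,
room and variables are invented, Socket.io-style API):

    
    
        // in the worker, once the archive is back on S3:
        io.sockets.in('user:' + job.userId)
            .emit('archive:ready', { url: s3Url });
    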

This works okay for us because when users create "Archives" of their
lightboxes, they generally do so because they want to share the files with
other people. This means that they attach the URL to emails to provide to
other people.

So for us, it's actually necessary to save the file back to S3... however,
I'm sure that not everyone needs to share the file... it would definitely
be worth investigating whether the user plans to return to the archive, in
which case implementing streams could potentially save us on storage and
complexity.

~~~
dmmalam
I think you have pretty much described our original ('ghetto') solution with
caching ('lipstick').

With streams, there is no need to cache, as recreating the download is dirt
cheap: essentially just a few extra header bytes to pad the zip container,
on top of the image content bytes that you always have to send.
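
The shape of the streaming handler, for anyone curious, is roughly this (a
sketch only: createZip/addFile/finalize stand in for whatever your zip
module exposes, and files/s3Stream are invented):

    
    
        var http = require('http');
        var zipstream = require('zipstream'); // API names illustrative
        
        http.createServer(function (req, res) {
            res.writeHead(200, { 'Content-Type': 'application/zip' });
            var zip = zipstream.createZip();
            zip.pipe(res); // archive bytes stream straight to the client
        
            (function addNext(i) {
                if (i === files.length) return zip.finalize();
                zip.addFile(s3Stream(files[i]), { name: files[i].name },
                    function () { addNext(i + 1); }); // entries go in
            })(0);                                    // one at a time
        }).listen(8080);
    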

The use case you mentioned, of sharing the download link, works exactly the
same. You send the link, and whoever clicks on it gets an instant download.

True, you are buffering data through your app instead of letting S3 take
care of it. But if you're on AWS, S3 to EC2 is free and fast (200mb/s+),
and bandwidth out of EC2 costs the same as out of S3. If it goes over an
Elastic IP, it's a cent more per GB. Your app servers also handle some
load, but Node.js (or any other evented framework) lives to multiplex IO,
with only a few objects' worth of overhead per connection.

In return, you can delete a whole load of cache and job control code. Less
code to write, test and maintain.

~~~
masklinn
> With streams, there is no need to cache, as recreating the download is dirt
> cheap.

The cost when streaming and not streaming should be pretty much the same,
unless your non-streaming case works on-disk (in which case you're
comparing extremely different things, and the comparison is anything but
fair).

------
timc3
Alternately, you could hand it off to the web server, which is probably a
better, more elegant solution.

<http://wiki.nginx.org/X-accel> and mod_zip for instance.
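
With X-accel the app only emits a header and nginx serves the bytes;
something like this (a sketch; the /protected/ prefix is whatever location
you mark internal in nginx.conf):

    
    
        // hand the transfer off to nginx instead of streaming it yourself
        res.writeHead(200, {
            'X-Accel-Redirect': '/protected/archives/1234.zip',
            'Content-Type': 'application/zip'
        });
        res.end();
    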

Why do people keep reinventing the wheel, thinking node is the be-all and
end-all, when this is nothing new at all?

------
nevinera
I'm pretty sure any evented framework in any language can do the same thing.

~~~
masklinn
Even non-evented ones; the interesting part really is not node itself
(despite what the blog says) but the ability to pipeline streams without
having to touch every byte yourself.

It should be possible to do something similar using e.g. generators (in
Python) or lazy enumerators (in Ruby).

In fact, Python's WSGI handlers return an arbitrary iterable which will be
consumed, so that pattern is natively supported (string iterators and
generators together, then return the complete pipe, which will perform the
actual processing as WSGI serializes and sends the response). Ruby would
require an adapter to a Rack response of some sort, as I don't think you
can reply with an enumerable OOTB.

~~~
timc3
It's possible with Python generators, but in my tests the performance
sucks, particularly if the client can't receive fast enough.

Using eventlet or gevent was much kinder to the system.

------
latchkey
I used to have this same exact issue while working for a large porn company.
We needed to make zips of hundreds of megs of images. We were creating them on
the fly to start with, which sucked for all the same reasons mentioned in the
blog post. After doing a ton of analysis and not finding a good streaming
library that didn't require either C or Java (this is long before Node came
along), we realized that as part of the publishing process, we could just
create the zip and upload it to the CDN. Problem solved with the minimal
amount of complexity.

------
chubot
This is really cool. How are errors handled, though? What if you have a
transient error on 1 of 50 images -- does that bork the whole download? The
user could get a corrupted file.

~~~
georgefox
I'm curious about this as well. While it's all very neat and improves the
user experience when everything is working, what happens if things break?
If you can't connect to S3 or something, but you've already sent HTTP
headers for the ZIP download, what do you do? Throw an error message in a
text file inside the ZIP? Send the user an empty ZIP? A corrupted ZIP, as
chubot mentions, seems like it would be the worst-case scenario in terms of
UX.

~~~
sirclueless
Abruptly ending with an RST packet causes a failed download in every
browser except Chrome:
[http://blogs.msdn.com/b/ieinternals/archive/2011/03/09/brows...](http://blogs.msdn.com/b/ieinternals/archive/2011/03/09/browsers-accommodate-incorrect-http-content-length-and-sites-depressingly-depend-on-it.aspx)
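
In Node terms that means tearing the connection down on an upstream failure
rather than ending the response cleanly; a sketch (imageStream and response
are stand-in names from context):

    
    
        // if a source stream dies mid-archive, kill the socket so the
        // browser reports a failed download, not a truncated "good" zip
        imageStream.on('error', function (err) {
            response.connection.destroy(); // abrupt close, not end()
        });
    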

~~~
georgefox
Good to know, thanks!

I feel like, from a UX perspective, it'd be ideal to be able to give some
friendly error message to at least acknowledge that the failure is on the
server end. A page that says, "Sorry, we're having trouble accessing your
files right now. Please try again in a minute," seems more user-friendly to
me than a download that suddenly fails with no explanation. Nevertheless,
this is very cool.

------
Benvie
The important bit is that this is THE core abstraction used in node and the
node community. If for no other reason, you should do it (if you're using
node) because it's how you hook into the existing libraries.

The main benefit here isn't that it's possible to do this thing at all; as
many people have pointed out, there are myriad ways to accomplish it
elsewhere. The key point is that everything that manipulates data, node
core as well as the userland libraries, implements the same interface.

~~~
sirclueless
That's not really true. Most libraries expose a callback mechanism, where the
result of some IO is passed as a Javascript primitive to a callback function
that you provide. The Dumply guys used to use an API like that.

The notion of piping the output of some I/O (say, a request to S3) into the
input of some other I/O (say, an HTTP response currently being written)
without ever referencing the data is blessed by node, which has a stream
type as part of its standard library. But it's far from the most common
abstraction of asynchronous work.
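
The contrast in a nutshell (both are stock node APIs; res is a stand-in for
an HTTP response from context):

    
    
        var fs = require('fs');
        
        // callback style: the whole result is materialized, then handed over
        fs.readFile('photo.jpg', function (err, buf) {
            if (err) throw err;
            res.end(buf); // entire file sits in memory first
        });
        
        // stream style: bytes flow through, never held whole
        fs.createReadStream('photo.jpg').pipe(res);
    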

------
robfig
Using the Play! Framework (Java):

    
    
      public static void myEndpoint() {
        HttpResponse resp = WS.url("url-of-file").get();
        InputStream is = resp.getStream();
        renderBinary(is);
      }
    

Or am I missing something?

(EDIT: This doesn't do the zipping or multiple files -- I guess I need a
ZipOutputStream to take it the rest of the way)

~~~
simonw
Will that definitely stream from one to the other without buffering the full
file in memory? That's the main benefit of the Node.js streaming approach - it
doesn't need to hold the whole thing in memory at any time, it just has to use
a few KBs of RAM as a buffer.

~~~
robfig
Play! uses Apache HttpClient under the covers. I haven't tried it
experimentally, but the documentation explicitly says that it's streaming.

------
WiseWeasel
I like how this is done, but I do see one problem with this approach for users
connected through certain wireless ISPs, such as Verizon, who have all their
http image requests automatically degraded to a lower bitrate to save
bandwidth. They might think they're getting a usable local copy of their
project, when they've actually got ugly, butchered versions of all the assets.
That would not have been an issue with the server-side implementation.

~~~
icebraining
This is still server-side; it just streams instead of downloading and then
pushing.

~~~
WiseWeasel
The image requests are client side, no? The client requests the images, then
streams the responses to a zip file. Wouldn't Verizon Wireless' network
management software replace the images requested with low quality versions in
this case? If it does, then it may be advisable to keep the old method around
as an option when this method is impractical for whatever reason. Maybe there
could be client-side code to test whether images are being re-encoded by the
ISP (calculate checksum for a known image), and request a zip via the old
method if they are.

~~~
simonw
"The image requests are client side, no? The client requests the images, then
streams the responses to a zip file."

No - the image requests happen on the server, which then concatenates them
together into a zip file which is served to the client. The client never
sees the actual image files, just the resulting zip file.

I'd imagine (well, hope anyway) that the ISP proxies that downscale image
files do so based on the HTTP Content-Type header - since the images contained
within a zip file would be part of a file with a different Content-Type they
should be left alone.

~~~
WiseWeasel
OK, if the server is the one creating the zip file and sending it to the
client, then it probably avoids the issue. I was under the impression that the
point of all this was to have the client create the zip, not the server.

------
aioprisan
that's all dandy until you run out of RAM, as everything is done in RAM and
nothing to disk. you honestly don't see a scalability issue here? it may be ok
for a few thousand concurrent downloads but anything above that will kill it.
heck, you might not even get to 1k concurrents, depending on the file size..

~~~
atesti
He's streaming: Node will only buffer a few kb per connection and push it
right out to the downloader. There is absolutely no need to download complete
files. That's the beauty of streams and pipes!
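
That's because pipe() handles backpressure; stripped of edge cases it
amounts to this (a sketch of the classic streams behavior):

    
    
        // roughly what src.pipe(dest) does under the hood
        src.on('data', function (chunk) {
            if (dest.write(chunk) === false) src.pause(); // dest is full
        });
        dest.on('drain', function () { src.resume(); });  // dest caught up
        src.on('end', function () { dest.end(); });
    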

------
moonboots
Have you considered jszip and/or webworkers to perform the zipping on the
client?

~~~
dmmalam
The full-size originals are only stored on the server; the client just uses
thumbnails, so there's not much point in zipping on the client.

~~~
moonboots
Thanks. I thought the client might have already downloaded some of the
full-size photos, so zipping on the client would reduce download sizes. In
your UserVoice feedback there are a few votes for full-size zooming, so
client-side zipping may be useful if you implement that. This method also
simplifies the server architecture, as the frontend server then only needs
to reverse-proxy image requests to S3.

------
lucaspiller
...or you could use Erlang. :)

~~~
marcocampana
Sure, you could use Erlang for that, but what Dharmesh is saying is that
building this kind of solution in node.js would definitely be easier and
possibly faster to code than writing it with other languages/frameworks.

~~~
masklinn
I'm not sure about that, though: the gain here mostly seems to be the
"stream" standard abstraction, and its being implemented (via adapters if
needed) by many data-processing utilities, leading to high pipeability
(letting the developer define the chain, and the runtime handle all the
data flow within that chain).

Many other languages have similar abstractions — python and generators for
instance <http://www.dabeaz.com/generators/> — although their usage would
likely require more work as they probably are not as standardized as far as
usage goes.

I mean, in this case it's "faster" to write because somebody else had
already gone through the motions of creating a zipping stream (which they
still needed to fork); it's not like node magically did it.

tldr: the node community is re-discovering dataflow, and a few are trying
to pass it off as some sort of magical property of node.

~~~
cpr
Or, to rephrase your point, the node community has built some nice pipeable
abstractions in a way that's easier to use than Python (e.g.), and people are
making good use of those abstractions. ;-)

------
bluespice
Stop it already.

<http://teddziuba.com/2011/10/node-js-is-cancer.html>

~~~
tomgruner
I have to be honest, the writing quality of that article is so low and
aggressive that I could not even finish it.

