

Parsing file uploads at 500 mb/s with node.js - felixge
http://debuggable.com/posts/parsing-file-uploads-at-500-mb-s-with-node-js:4c03862e-351c-4faa-bb67-4365cbdd56cb

======
pjscott
This is nitpicking, but it always bothers me: "mb/s" is an abbreviation for
"millibits-per-second". You meant "MB/s", although this wasn't obvious; if you
hadn't mentioned maxing out a GigE connection, it could have been interpreted
as "Mb/s", which is short for "megabits-per-second".

</pedantry>

------
jules
Writing everything in callback style is not nice. Why aren't they using a
language with coroutines?

~~~
felixge2
This presentation has a tiny bit on coroutines:

<http://nodejs.org/jsconf2010.pdf>

> Coroutines complicate the mental model while adding only cheap syntactic
> pleasures.

~~~
makmanalp
Actually, that's not a very good argument. The answer to that is "to you,
maybe". The better argument is this one: "Must worry about I/O occurring in
all function calls. (They might call wait().) The user needs to make their
functions coroutine safe!" I think this is the reason why coroutines are more
popular in functional programming languages where side effects are limited by
style or by enforcement of the language itself.

~~~
jules
That is not a good argument either. If you are going to write your entire IO
library in asynchronous style like node.js then you could easily make all IO
routines coroutine safe. In fact you have exactly the same problem whether you
use coroutines or not. If you have a nice asynchronous program and I call wait
in the middle that's going to hurt you in the same way.

~~~
felixge2
You're right, it would be possible to write coroutine-safe code in
node.js.

The problem is just that coroutines are not natural in JS, and people will
shoot themselves in the foot all the time.

Some "features" are better left out to allow a bigger audience to write
reliable software. Your mileage may vary.

~~~
jules
Perhaps, but in my experience using coroutines is easier than using callbacks
because coroutines make code look the same as if it was synchronous.

    
    
        readAsync(function(x) {
          readAsync(function(y) {
            write(x+y)
          })
        })
    

vs

    
    
        x = read()
        y = read()
        write(x+y)
    

You don't really need to know how to use the full power of coroutines if you
just want asynchronous operations. You just need to know that read() may block
the current coroutine.
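In modern JavaScript this can be sketched with generators; `run` and `readAsync` below are made-up names for illustration, not node.js APIs:

```javascript
// Minimal coroutine runner: the generator yields node-style async
// operations and is resumed with their results, so the body reads
// top-to-bottom like the synchronous version above.
function run(genFn) {
  const gen = genFn();
  function step(value) {
    const next = gen.next(value);
    if (next.done) return;
    next.value((result) => step(result)); // resume on callback
  }
  step();
}

// Hypothetical async read delivering a fixed value via callback.
function readAsync(cb) {
  setImmediate(() => cb(21));
}

run(function* () {
  const x = yield readAsync; // suspends until the callback fires
  const y = yield readAsync;
  console.log(x + y);        // prints 42
});
```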

~~~
pmjordan
You could even be more explicit about the whole thing, use _futures_ and only
allow them to block. The above example would go something like this if the
reads can't be executed concurrently:

    
    
      future_x = read();
      x = future_x.wait();
      y = read().wait(); // shortcut
      write(x + y);
    

Or something like this if the reads are independent:

    
    
      future_x = read();
      future_y = read();
      // one of a handful of functions that can "block", all operating on futures:
      waitForAll(future_x, future_y);
      write(future_x.get() + future_y.get());
    

Using futures rather than implicit suspension has the added advantage of being
able to pipeline independent reads just as you can with callback-style
asynchronous I/O.
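Modern JavaScript Promises behave much like these futures; here is a sketch of the independent-reads case, with `read` as a hypothetical promise-returning operation:

```javascript
// Futures sketch with Promises: both reads are started immediately
// (pipelined), and Promise.all plays the role of waitForAll above.
// `read` is a hypothetical promise-returning async read.
function read(value) {
  return new Promise((resolve) => setImmediate(() => resolve(value)));
}

async function main() {
  const futureX = read(20); // both reads in flight before we wait
  const futureY = read(22);
  const [x, y] = await Promise.all([futureX, futureY]);
  return x + y;
}

main().then((sum) => console.log(sum)); // prints 42
```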

You can already implement[1] an approximation of this in terms of callbacks,
but it doesn't look quite as nice, e.g.:

    
    
      var handler = new AsyncHandler();
      // independent, pipelined reads
      readAsync(handler.cb());
      readAsync(handler.cb());
      
      handler.whenDone(function(x, y) {
        write(x+y);
      });
    

It gets substantially uglier than that if the dependencies aren't so
straightforward, e.g. A, B & C are independent, D depends on A & B having
completed and the last part of the code requires the results from C & D.
Futures do much better in that sort of situation.
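That dependency graph (A, B, C independent; D needs A and B; the final step needs C and D) stays readable with promise-style futures; the workloads below are invented placeholders:

```javascript
// A, B and C start concurrently; D is chained on A and B; the final
// step waits on C and D only. delay() stands in for real async work.
const delay = (ms, v) => new Promise((res) => setTimeout(() => res(v), ms));

async function pipeline() {
  const a = delay(10, "A");
  const b = delay(5, "B");
  const c = delay(20, "C");                 // all three in flight
  const d = Promise.all([a, b]).then(([ra, rb]) => delay(5, ra + rb + "D"));
  const [rc, rd] = await Promise.all([c, d]);
  return rc + "+" + rd;
}

pipeline().then((result) => console.log(result)); // prints C+ABD
```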

[1] I've built a basic but usable helper for this purpose:
<http://github.com/pmj/MultiAsync-js/tree/master/src/> I hear the Dojo toolkit
contains something similar.

~~~
jules
Excellent :)

Why do you have a separate wait & waitForAll? Couldn't get() wait
automatically?

~~~
pmjordan
I was just throwing ideas out there, not really thinking it through. :)
Although you're absolutely right about waitForAll() in that example, there is
a point to that sort of function, say if you wanted to add a timeout. Or you
could have a waitForAny() function - useful if you only need one of the
futures to finish before proceeding.

get() vs. wait() is admittedly a question of preference. Personally, I'd keep
them separate and even go so far as to have it warn you if you called get()
without a prior wait() or a successful isReady() on that future. It keeps the
suspensions explicit and the intentions clear. It's a bit like explicit vs.
implicit transactional systems.

~~~
bruceboughton
Out of interest, is this any different to WaitHandles + async delegates in
.NET? Doesn't look like it to me but I could be missing something:

    
    
        Func<string> reader = read;
        IAsyncResult future_x = reader.BeginInvoke(null, null);
        IAsyncResult future_y = reader.BeginInvoke(null, null);
    
        WaitHandle.WaitAll(new[] { future_x.AsyncWaitHandle, future_y.AsyncWaitHandle }); // takes optional timeout
    
        string x = reader.EndInvoke(future_x);
        string y = reader.EndInvoke(future_y);
    
        write(x + y);

~~~
pmjordan
Looks like it's essentially the same type of programming model, although I
have a feeling WaitHandle.WaitAll() just blocks the current thread. In the
ideal case, it would internally call the event loop coroutine and process
other events instead without sending the thread to sleep. Thread scheduling
involves system (kernel) calls, whereas coroutines are purely userspace, just
like node.js's async callback mechanism.

------
rythie
It's a pity that browsers couldn't just tell you the length of the part in the
header of each part.

~~~
felixge2
I think the reason multipart works that way is so you can stream data that
you don't know the full length of beforehand.

But afaik, that's pretty much never the case with file uploads, unless you are
uploading a file that is still growing in size - so yeah, it's annoying : ).

('felixge2' because my other account is in noprocrast mode: )

~~~
stingraycharles
Think of it the other way too: it allows the HTTP server to start writing the
file to disk without having to completely load the file into memory.

~~~
felixge2
You could still do that if the length was pre-announced, or am I missing
something?

~~~
irrelative
Yeah, you could -- I wonder what would happen if the client gets it wrong or
is deliberately dishonest? Not trusting the client is a big part of writing an
open server, and this seems like you would have to trust the client in a big
way.

~~~
felixge2
Well, it's not a big problem - you should have a timeout on incoming
connections, and node is pretty well-suited for having lots of "hanging"
connections (I ran some tests with 56k active connections).

If a connection is closed, either by a timeout, or EOF, you simply check if
the promised content length matches the count of received bytes - if not you
should probably discard the whole thing (unless you're specifically supporting
broken clients).
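The check described there might look like this; `verifyLength` is an illustrative helper, and `req` is assumed to be a node.js http.IncomingMessage:

```javascript
// Count received bytes and compare against the announced
// Content-Length when the stream ends; destroy idle connections
// via a timeout so broken clients can't hang forever.
function verifyLength(req, onDone) {
  const expected = parseInt(req.headers["content-length"], 10);
  let received = 0;
  req.setTimeout(30 * 1000, () => req.destroy()); // idle timeout
  req.on("data", (chunk) => { received += chunk.length; });
  req.on("end", () => onDone(received === expected, received, expected));
}
```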

------
marketer
When parsing boundaries you know the first character is going to be a
hyphen (-) and the last character is going to be a newline. Wouldn't it be
easier to search for hyphens, read until you see a newline, and then
compare to the boundary? Boundary characters are typically random printable
characters, so you might be doing more work than you need to.

~~~
inimino
The rub is "search for hyphens": you have to look at every character until you
find a hyphen; the point of Boyer-Moore is that you don't even have to look
at most of the characters.
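A Boyer-Moore-Horspool sketch makes the skip concrete: on a mismatch, a precomputed shift table lets the search jump ahead by up to the boundary's full length, so most text bytes are never examined. (Strings stand in here for the raw buffers a real parser would scan.)

```javascript
// Boyer-Moore-Horspool substring search: compare the pattern from
// its end; on a mismatch, shift by the table entry for the text
// character aligned with the pattern's last position.
function horspoolSearch(text, pattern) {
  const m = pattern.length;
  const shift = {}; // characters not in the table shift by m
  for (let i = 0; i < m - 1; i++) shift[pattern[i]] = m - 1 - i;
  let pos = 0;
  while (pos <= text.length - m) {
    let j = m - 1;
    while (j >= 0 && text[pos + j] === pattern[j]) j--;
    if (j < 0) return pos; // full match found at pos
    const c = text[pos + m - 1];
    pos += shift[c] !== undefined ? shift[c] : m;
  }
  return -1;
}

console.log(horspoolSearch("some data\r\n--boundary", "--boundary")); // prints 11
```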

~~~
marketer
Ah, that makes sense, thanks.

------
GrandMasterBirt
If I can get a revenue stream of approximately $1000 a month, my entire
website will literally be a call to a few different web services and some
fancy-shmancy CSS files. God, I love Hacker News! OK, fine, there will be some
of my code in there, hopefully not for long, because soon there will be a tool
to solve any other problem as well :P As long as these don't get too pricey.

