Four MLs (and a Python) (thebreakfastpost.com)
176 points by atombender on Apr 30, 2015 | 72 comments



    $ time wc big-data.csv                                                                     
     1252045  1252045 17805309 big-data.csv
    
    real    0m0.417s
    user    0m0.411s
    sys     0m0.004s

    $ time ./sum-ocaml big-data.csv                                                            
    6143290.,13087196.,14473656.,50128757.,3765822.,1290420.
    
    real    0m3.680s
    user    0m3.666s
    sys     0m0.008s


    $ time python3 sum.py big-data.csv          
    6143290.0,13087196.0,14473656.0,50128757.0,3765822.0,1290420.0
    
    real    0m7.186s
    user    0m7.131s
    sys     0m0.036s

    $ time python2 sum.py big-data.csv         
    6143290.0,13087196.0,14473656.0,50128757.0,3765822.0,1290420.0
    
    real    0m5.146s
    user    0m5.111s
    sys     0m0.012s

    $ time pypy2 sum.py big-data.csv            
    6143290.0,13087196.0,14473656.0,50128757.0,3765822.0,1290420.0
    
    real    0m2.754s
    user    0m2.697s
    sys     0m0.052s
I really like OCaml, but I have to admit PyPy is pretty amazing.


It completely depends on the application. I have yet to see a use case that benefits from PyPy. Here's an example I tried with CPython2, 3 and PyPy yesterday: It loads the most probable translations for all unigrams from a Moses phrase translation table (11GB) into a dict, and using that dict translates 73 MB of text word by word. Most of the time is spent filling the dict/reading the large file.

  $ time python literal_translation.py phrase-table en test4
  translation table contains 76885 elements
  ...................................................................... 070
  .......................................................
  125 files translated
  
  real	1m14.345s
  user	1m9.836s
  sys	0m4.157s
  
  
  $ time python3 literal_translation.py phrase-table en test5
  translation table contains 76885 elements
  ...................................................................... 070
  .......................................................
  125 files translated
  
  real	2m17.680s
  user	2m2.605s
  sys	0m4.987s
  
  
  $ time pypy literal_translation.py phrase-table en test6
  translation table contains 76885 elements
  ...................................................................... 070
  .......................................................
  125 files translated
  
  real	1m13.347s
  user	1m0.033s
  sys	0m4.734s

https://gist.github.com/andreasf/af6bdc00cf5a928712b5


If your problem is memory-bound then no compiler will help you.

Memory-mapping a serialisation of the dict would help, but I'm not sure how you would do that outside of an extension lib.


1m for 11GB sounds IO-bound, but yeah that's still not a use-case where switching to pypy will make a difference.

But did you try using it as a service (ie. avoid the startup time) and timing the actual translation?

At work, we switched to PyPy for the speed. For a spell checker I made in Python, PyPy made the dictionary compilation 3x faster and the spell checking 2x faster, so it can really matter. On the other hand, PyPy's memory usage is often a bit worse, and startup time can also be a bit slower, at least if your program otherwise has a "zero" startup time.


No, I didn't try to optimize this any further. I just checked whether any of the new interpreters is really faster. It's a simple one-off experiment, and compared to real statistical machine translation with Moses on the same data the translation is pretty much instant. Moses needs something on the order of 1600 CPU-hours (50 nodes * 8 threads * 4 hours).


NLP stuff tends to be more data-bound, which PyPy can't really help with.

If it's worth the effort, use Cython, design your data structures carefully, and you'll get C level performance.


Hey, thanks for spaCy! It's a great library.


This isn't an isolated test, and it tells us almost nothing, because you've lumped together IO with the bulk of the actual processing work.

Rather, when you do the test, time the "mapping" part and the "IO" part separately. That way you can clearly see the difference between the two.


I notice python3 is twice as slow.


To whoever downvoted this -- how about posting a more comprehensive set of benchmarks instead? My single benchmark is not in any way less meaningful than the parent's. But as I said, I often try PyPy on longer-running Python code, and so far I have seen no significant gains.


Your benchmark is terrible - it's impossible to run because you don't provide the dataset you used (or I'm missing some description). It also stresses one aspect of the language (processing strings, in a very specific way), and yet you claim this is representative.

However, we're more than happy to consider it a PyPy bug if we're not faster than CPython. Please provide clearer instructions on how to reproduce it, and ideally also other benchmarks that did not work for you.


Great, next time I'll hit your bug tracker directly ;-)

The script and data are available here: http://zentrale1.com/~an/literal-translation.tar.xz

The texts were scraped from the European Parliament web site. The phrase table was built with Moses from the Europarl parallel corpus [0] minus the texts.

The script itself was supposed to be the baseline input for another experiment; basically it's the worst way of translating something I could think of.

[0] http://statmt.org/europarl/


Can you please provide the data in some reproducible form (I have no clue how to use all the tools you mentioned, for example), ideally as a .tar.bz2 file somewhere I can download and run without any prior knowledge? I'm happy to improve PyPy for that particular use case.


"Terrible" is a little harsh. I found the benchmark to be interesting because it's similar to many NLP-style analysis tasks I have myself. Of course reproducible ones would be great too.


And ZipPy -- a Python implementation for the JVM based on Truffle/Graal -- supposedly performs even better than pypy.


[citation needed]. In the minute or so of idle browsing I did not find that claim on their website, nor a claim about how much Python they actually implement.


https://docs.google.com/spreadsheets/d/1fFMWcRIuPKt7wSAM5Ox9...

It seems like it's often much faster than PyPy (up to 4x) and almost never slower.


I feel like if a program runs for less than 10 seconds, I don't much care if it's 2 seconds or 7 seconds. And a benchmark for a program like that tells me very little about how the various runtimes affect larger programs.


Hmm, the OCaml could be a lot shorter:

    let rec fold_channel f acc chan =
      match input_line chan with
      | line -> fold_channel f (f line acc) chan
      | exception End_of_file -> acc

    let comma = Str.regexp ","

    let values_of_line line =
      List.map float_of_string (Str.split comma line)

    let sum_channel chan =
      let folder line tot = List.map2 (+.) tot (values_of_line line) in
      fold_channel folder (values_of_line (input_line chan)) chan

    let sum_and_print_file filename =
      let chan = open_in filename in
      sum_channel chan |> List.map string_of_float |> String.concat "," |> print_string;
      close_in chan

    let () =
      match Sys.argv with
      | [|_; filename|] -> sum_and_print_file filename
      | _ -> failwith "Exactly 1 filename must be given"
This is also a bit faster, and almost all of the difference is because I hoisted the Str.regexp ",". I suspect using a split-on-character operation would make a bit more difference there, but of course the spartan OCaml stdlib lacks such a function.


Oops, breaks on empty files. I should have matched on the input_line chan in sum_channel to catch End_of_file - damn exceptions.

print_string should also be print_endline to match the original.
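
For what it's worth, a minimal sketch of that fix, assuming the same fold_channel and values_of_line helpers as above:

    let sum_channel chan =
      match input_line chan with
      | exception End_of_file -> []   (* empty file: nothing to sum *)
      | first ->
        let folder line tot = List.map2 (+.) tot (values_of_line line) in
        fold_channel folder (values_of_line first) chan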


Also, is doing flow control using exceptions an acceptable pattern in the OCaml world?

Like in Java:

  try {
     print(awesomeObj.getSomeState());
  } catch(NullPointerException npe) {
     awesomeObj = AwesomeObj.getInstance();
  }

would be considered very bad. You would instead be expected to do:

   if (awesomeObj == null) awesomeObj = AwesomeObj.getInstance();
   print(awesomeObj.getSomeState());

(how do you <code-wrap> stuff on here?)


Yes. Various parts of the stdlib are intended to be used by catching Not_found or End_of_file or whichever exception happens to be relevant.

It's not what I would have done.

(Mark text as code by putting it on a fresh line with two or more spaces in front. I don't think there is a way to do this inline.)


It's always weird to me when people talk about using exceptions for control flow.

Exceptions are, pretty much explicitly, a control flow mechanism. That's what they do, they transfer control from one place to another. It's like people are getting hung up on the name.


Exceptions, in Java land at least, are very costly (think interrupts and context switching). To do flow control using them is shooting yourself in the foot.


While the standard library does this, several big alternative libraries (Core, Batteries...) prefer wrapping everything in an `option` type instead of raising an exception. Wrapping in an option avoids forgetting to catch the exception.
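
For example, a raising stdlib call can be wrapped so the absence becomes a value the type checker won't let you ignore; a minimal sketch (input_line_opt is a made-up name, not a stdlib function):

    let input_line_opt chan =
      try Some (input_line chan) with End_of_file -> None

    let first_line_or_default chan =
      match input_line_opt chan with
      | Some line -> line
      | None -> ""   (* the compiler warns if this case is missing *)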


Your OCaml code is not doing the same thing as the Python one: you are parsing a regexp in a loop, and then splitting by the regexp - not by a delimiter.

I moved the regexp instantiation outside the loop, see the results below ("orig" is your code, "proper" has the regexp parsed only once):

  > bash -c 'time ./orig bigdata.csv'
  15502592020.,15502065537.6,15519223046.6,15498884970.,15502078298.,15519530367.3,15510803256.2,15519590717.2,15511590976.

  real    0m3.110s
  user    0m3.107s
  sys     0m0.003s

  > bash -c 'time ./proper bigdata.csv'
  15502592020.,15502065537.6,15519223046.6,15498884970.,15502078298.,15519530367.3,15510803256.2,15519590717.2,15511590976.

  real    0m2.596s
  user    0m2.590s
  sys     0m0.003s
I'd say the rest of the difference is probably due to the standard library performance itself. BTW, compilation time for ocamlopt is 70ms.
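
The change is roughly this (a sketch, not the exact code from the post): compile the regexp once instead of re-parsing it for every line.

    (* "orig": Str.regexp "," is rebuilt on every call, i.e. once per line *)
    let values_of_line line =
      List.map float_of_string (Str.split (Str.regexp ",") line)

    (* "proper": the regexp is built once, outside the per-line work *)
    let comma = Str.regexp ","
    let values_of_line line =
      List.map float_of_string (Str.split comma line)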


Any idea why the compiler doesn't hoist the constant single-character regexp automatically?


1) It's not a constant expression, it's a function call.

2) The compiler does not know whether there are side effects or not.

3) OCaml is strictly evaluated - no memoisation

Therefore this would be an unsafe optimisation if the compiler tried to perform it, which could lead to unintended behavioural changes.

The only way to fix this properly would be to add an effects system in some sense, at least a way to mark pure/impure functions.
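
A contrived sketch of why: if the called function has an observable effect, hoisting the call out of the loop changes how often that effect happens, so the compiler can't do it without knowing the call is pure (noisy_regexp is hypothetical, just to make the effect visible).

    let noisy_regexp s =
      print_endline ("compiling " ^ s);   (* observable side effect *)
      Str.regexp s

    let values_of_line line =
      (* hoisting noisy_regexp "," out of here would change the program's
         output: one "compiling ," message instead of one per line *)
      List.map float_of_string (Str.split (noisy_regexp ",") line)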


I wanted to have a look at how much PyPy might speed up the task, so I generated a one-million-line x 20-column, 287 MB big-data.csv of my own with the following.

    import random
    with open("big-data.csv","w") as f:
        for b in range(1000000):
            f.write(','.join([str(random.random()) for x in range(20)])+"\n")  
Each line looks like: "0.47509825737,0.525866136528,0.167956183215...0.888687040645".

The script from the blog takes 7.3 seconds to chew through it on my machine with Python 2.7.8, and 4.9-5.5 seconds with PyPy, a reasonable improvement. Out of interest, stripping out all the boilerplate and checking, the following is 7 lines versus the 24 or so in the blog, and runs in a similar 4.9-5.4 second range on PyPy (although performance blows out under Python 2.7.8 to 13 seconds, or ~6 seconds slower than the blog). Somewhat more Pythonic, and more likely to be done live in the REPL without even bothering to write a script, to my mind.

    from collections import Counter
    c = Counter()
    lines = (rawline.strip().split(",") for rawline in open("big-data.csv"))
    for line in lines:
        for colname, colval in enumerate(line):
            c[colname]+=float(colval)
    print c  
Above on PyPy.

    time pypy sum5.py ./big-data.csv

    real	0m4.972s
    user	0m4.946s
    sys	        0m0.024s  

wc on the same machine.

    time wc big-data.csv  

    real	0m3.701s
    user	0m3.672s
    sys	        0m0.028s
Bonus one-liner (6.6 seconds under PyPy).

    print map(sum,zip(*([float(x) for x in rawline.strip().split(",")] for rawline in open("big-data.csv"))))


Bonus bonus easy to read one-liner (using numpy):

   import numpy
   numpy.loadtxt("big-data.csv",delimiter=',').sum(axis=0)


This is a nice data-processing trick, but obviously runs very little Python code.


Well over 99% of the run time is spent in loadtxt which is pure python. In fact this code is slower than the original code due to the overhead of the checks that loadtxt does.


No wonder OCaml compiled to native code is faster than the other MLs and Python. I remember stumbling on benchmarks elsewhere where compiled OCaml rivaled C++ in terms of performance on certain tests.

In my experience with OCaml, I found the syntax a bit unfavourable, as the author did, but the resulting code was very fast. With concurrency sorted out, I wonder if it could take on Rust, Nim, and Go as far as systems programming is concerned (especially since Go has the weight of Google behind it).

It also helps that the language designers have written a large tutorial on systems programming: https://ocaml.github.io/ocamlunix/

Then there's this epic Mirage OS built with Cloud Computing in mind by Anil and his team of OCaml zealots: http://openmirage.org/


I don't think anyone on the MirageOS team is a language zealot. Almost by definition, they'll be polyglots so understand the trade-offs you make when using different languages.

It turns out that OCaml is pretty good for systems programming, hence projects like MirageOS can survive and grow.


Agree. BTW, I meant "zealot" as a compliment.


Fun fact: the Rust compiler was originally written in OCaml.


It seems that many original compilers are written in OCaml.


What is it written in now? (Rust?)


Yup!


> With concurrency sorted out

The last time I looked at this (~ 2 years ago), there were a few efforts to solve this, but they all had lots of caveats and none were generally accepted as the solution. Has there been any progress since?

I tried to solve a performance issue with OCaml, and the lack of concurrency and libraries was too limiting.


There is a multicore runtime in progress, with an ETA of "when it's done".


Multi-core addresses parallelism; concurrency is a whole different can of worms.


Fair point. For concurrency, you have Lwt and Core.Async. Lwt is fairly nice; I haven't tried Core.Async. I'll point out that "there is no parallelism" is one of the traditional complaints about OCaml, not "there is no solution for concurrency".
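
For a flavour of the Lwt style, a minimal sketch (cooperative promises interleaved on a single core, i.e. concurrency without parallelism):

    open Lwt.Infix

    let task name delay =
      Lwt_unix.sleep delay >>= fun () ->
      Lwt_io.printl (name ^ " finished")

    let () =
      (* both tasks are in flight at once; "b" prints before "a" *)
      Lwt_main.run (Lwt.join [ task "a" 0.2; task "b" 0.1 ])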


It's worth mentioning that OCaml has a good, multi-threaded implementation on the JVM, too: http://www.ocamljava.org/


I absolutely agree with the author, in that I find SML the most pleasant to read (and write).

And I've said this before: If only SML actually had proper unicode support, I might use it for something real.


This is the start of a solution in Haskell:

    import System.Environment
    import qualified Data.Vector.Unboxed as V
    import Data.Vector.Unboxed (Vector)

    display :: Vector Double -> String
    display ds | V.null ds = ""
               | otherwise = tail . concatMap format . V.toList $ ds
        where format d = ',' : show d

    csv_sums :: String -> Vector Double
    csv_sums input = go (V.length $ h) 1 h t where
        h : t = map (V.fromList . read . bracket) $ lines input
        bracket s = '[' : s ++ "]"
        go s n summary [] = summary
        go s n summary (x:xs)
            | V.length x == s  = go s (n + 1) (V.zipWith (+) summary x) xs
            | otherwise        = if blankLastLine then summary
                                                else error "Inconsistent lengths."
                where blankLastLine = case xs of
                        [] -> V.length x == 0
                        _  -> False

    main = do
        paths <- getArgs
        if length paths /= 1
            then error "Exactly one filename must be given."
            else readFile (head paths) >>= putStrLn . display . csv_sums
It takes about 22x longer than Python on my platform though, which I'm not sure how to solve. (The vector keeps it nice and un-thunked so it doesn't seem like it's laziness that's killing it, maybe it's the use of strings over Data.ByteString.Lazy or so?)


The main problem is probably creating and traversing the strings, although I have not benchmarked it. This version https://gist.github.com/Codas/894694eea247aaacf35f runs about 4 times faster than the python version on my machine.

It does use some libraries that are not in the haskell platform though.


Very cool!


Can someone explain to me why a programmer would want to develop a web application using an ML derivative (e.g. F#) instead of Ruby/JavaScript/PHP or Python?

Apart from a totally different programming paradigm, what do ML derivatives have to offer?

Also, the author states that ML is "statically typed" and has "type inference"... Why does a statically typed language need "type inference"? As I understand this - I'm probably wrong but - dynamically typed == type inference.

If anyone takes the time to answer... Well, thanks a priori! :^)


Static typing helps catch bugs and helps with code refactoring. For example, if you add a parameter to a function in ML, the compiler will point out all the call sites that need to be updated to the new API. It will also be smart enough to notice when your function is called via aliases, which is something you can't do via textual find-and-replace.
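
A small sketch of that in OCaml (hypothetical names): change a function's signature and every stale call site is rejected at compile time.

    let area r = 3.14159 *. r *. r          (* val area : float -> float *)
    let total = area 2.0 +. area 3.0

    (* After adding a parameter, e.g.
         let area scale r = scale *. r *. r
       the unchanged call sites above no longer type-check:
         Error: This expression has type float -> float
                but an expression was expected of type float *)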

Type inference is needed because in languages with more advanced type systems, like ML, it would be very annoying to have to write down the types for everything by hand with no assistance from the compiler.

> As I understand this - I'm probably wrong but - dynamically typed == type inference.

No, this is a common misconception. The better analogy is "dynamically typed" == "everything belongs to a single type".

In Python every value has the type "object" and can be passed anywhere you want. The interpreter adds tags to the memory representation of the objects so it knows what to do with them, and it raises a runtime error if you try to use a value in the wrong place (for example, when you try to call a method on None).

On the other hand, in a language with a sound type system you have a guarantee that there will never be type errors at runtime ("well typed programs do not go wrong"). The type checker enforces that different types never get mixed together, which also removes the need for runtime tags in the memory representation.


Thanks for the comprehensive answer!


> Why does a statically typed language need "type inference"

   let suffix str = str ^ "mysuffix";;
   val suffix : bytes -> bytes = <fun>
Type inference in action: the OCaml toplevel displays the type of the 'suffix' function as something that takes a value of type bytes and returns a value of type bytes, without any type declaration (though obviously, the type inference system works in much more complex cases as well). You can write entire programs without a single type declaration.

Dynamic languages do not do type inference. They do duck-typing. The big difference is that OCaml will not let you write a program like:

    let suffixed_five = suffix 5
You will get:

    Error: This expression has type int but an expression was expected of type
           bytes
Python will not let you concatenate integers and strings, either, but will blow up at runtime.


Some dynamic languages, notably Ruby, use duck typing. Many tag all values with types; Lisps are an example. Tagged types allow a dynamic language to be strongly typed [dynamic/static is orthogonal to strong/weak typing, C being an example of weak static typing].

Type inference is syntactic sugar in most languages that allow it, and explicit type declaration is [almost?] invariably permitted. Of course, the price of omitting type declarations in code is the potential loss of clarity that comes anytime implicit communication is used in lieu of explicit statements of intent.
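
For instance, OCaml accepts both the inferred and the fully annotated form of the earlier suffix function; they mean the same thing:

    let suffix str = str ^ "mysuffix"                        (* type inferred *)
    let suffix (str : string) : string = str ^ "mysuffix"    (* spelled out *)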


A nit: Dynamically typed languages do not all use duck-typing. Duck-typing is primarily concerned with the question: Does this object respond to this message/method or have this field? If the answer is yes, then we'll act as though it's whatever type we needed it to be. So duck has a quack() method and decoy_duck has a quack() method. decoy_duck is not, in fact, a duck, but it has enough of the duck behavior for what we need.


There are of course the ordinary people-do-what-they-do reasons: familiarity with the language, consistency within a project, opinion and deep insight into the problem, and dumb mistakes.

Static typing offers some advantages in regard to tooling. Entire classes of errors can be caught by the editor/IDE as the programmer types code without waiting for compile time...or in the case of dynamic languages perhaps runtime.

ML languages are particularly suited for the classes of problems for which they were developed, e.g. fault tolerant high reliability systems, because they can efficiently express state machines via pattern matching semantics and the SOME|NONE construct.

So a person might choose to write a web application in F# because they are thinking about the problem in terms of a state machine. With code generation, a stack similar to Microsoft's can provide a lot of the Presentation and Storage layers in Javascript and SQL respectively.


> ML languages are particularly suited for the classes of problems for which they were developed, e.g. fault tolerant high reliability systems

Are you thinking of Erlang? Erlang is definitely not an ML, although they share some similarities (immutability, pattern matching). I'm not sure what ML was originally designed for, other than maybe building compilers.

> So a person might choose to write a web application in F# because they are thinking about the problem in terms of a state machine

Algebraic data types in ML allow you to encode program state within the type system, so the compiler can help ensure the program is in a consistent state at any point during run-time. That is the real benefit to having a strongly static type system in the vein of ML.
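
A minimal sketch of that (hypothetical states for a connection-like thing): the variant type lists the only states that can exist, and the compiler warns about any match that forgets one.

    type conn =
      | Disconnected
      | Connecting of float        (* when we started trying *)
      | Connected of string        (* session id *)

    let describe = function
      | Disconnected -> "not connected"
      | Connecting since -> Printf.sprintf "connecting since %.1f" since
      | Connected id -> "session " ^ id
      (* drop a case and the compiler reports the missing constructor *)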

Regarding the original question: the difference between static and dynamic types is, IMO, one of checking at compile-time vs run-time. (But the static/dynamic definition is fuzzy enough that there's often leeway for interpretation here.) Type inference is icing on the cake -- type annotations could be explicit (C, Java) or implicit (ML), but a language is not statically typed unless the types are known prior to execution.


> Are you thinking of Erlang? Erlang is definitely not an ML, although they share some similarities (immutability, pattern matching). I'm not sure what ML was originally designed for, other than maybe building compilers.

ML stands for metalanguage. It was originally developed for a theorem prover. I think the expressibility of the ML family of languages turned out to lend itself to many categories of programming problems, like compilers, especially since the code can so closely match the structure of the problem itself (via recursion, pattern matching and other features).

EDIT: Commenting on my point about code matching the problem structure. This is why I'm, in general, a fan of most functional and declarative programming languages I've tried: Prolog, Common Lisp, Scheme, SML, OCaml, Haskell, Erlang, etc. They may not be ideal for every problem (though the malleability of Scheme and CL makes them pretty damn close, IMHO), but there are a number of domains where they truly excel compared to the bog-standard enterprise languages. The only problem I have not solved is convincing bosses to accept a polyglot environment. They seem happy with Python and C++ and C# and Java, but throw in even F# and they reject it.


Erlang came from Ericsson, SML from Bell Labs. Erlang's contract-based programming idiom is another, and perhaps better, approach to implementing high-reliability, fault-tolerant systems. Though I was thinking about it, by virtue of thinking about software for high-reliability, fault-tolerant systems developed by telecoms, rest assured I was not writing about it. The reason I wasn't writing about it is that it would not have answered my comment's parent.


Sure, the only reason for mentioning Erlang is because "high reliability and fault tolerant systems" are its raison d'etre, but AFAIK that's not an expressly stated design goal for any of the ML languages. Any such benefits come as a side effect of ML's design. However, a lot of ideas from ML are being incorporated into (non-ML) languages that do have reliability and safety as a design goal, such as Rust.


With :pre and :post, Erlang's design by contract has made its way into Clojure... and, in the form of a full-featured contract system, into Racket as well.


Nice, thanks! :-)


> Apart from a totally different programming paradigm, what do ML derivatives have to offer?

The programming paradigm is itself a reason.

> Also, the author states that ML is "statically typed" and has "type inference"... Why does a statically typed language need "type inference"?

Static typing guarantees that certain aspects of a program are correct without testing; type inference gives you the benefits of static typing without the type-related boilerplate/overhead of, e.g., Java.

> As I understand this - I'm probably wrong but - dynamically typed == type inference.

Dynamic typing is not the same thing as type inference. Dynamic typing provides the low boilerplate that static typing with type inference provides but without the safety that static typing (with or without inference) provides.


Haskell might also be an interesting choice in this case? Worth looking into!


Wouldn't the OCaml example be much faster if it used arrays instead of lists? Doesn't seem like a fair comparison.
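
Something along these lines, say (a rough sketch; String.split_on_char postdates this discussion's stdlib, so treat it as an assumption):

    (* totals is a float array with one slot per column, updated in place,
       so no accumulator list is rebuilt for every line *)
    let sum_into totals line =
      List.iteri
        (fun i v -> totals.(i) <- totals.(i) +. float_of_string v)
        (String.split_on_char ',' line)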


I enjoyed reading this and I'm not bothered about tweaking the solutions into oblivion to eke out performance gains. This is a nice, simple comparison, with performance being a simple demonstration of code written by one author.

The only obvious omission seems to be Haskell.


I noticed that the author is comparing the size of the executables. Dunno about all the other MLs, but in OCaml the executables can blow up in size if you statically link them instead of letting them be dynamically linked, as is the default. The presence of debugging symbols also matters a lot.


Is a 300MB CSV really considered "big data" now? I routinely generate larger CSVs using a $10 sensor (rtl-sdr with rtl_power) and process them on a $70 ARM device (Odroid U3).


The author addressed that in a comment over there:

The choice of filename was tongue-in-cheek. (The main thing is that it’s big enough to blow the stack with non-tail-recursive code, and to take a non-trivial amount of time to run.)


For the Python test, why not use the csv module instead?


Or pandas or numpy... I think they wanted to compare the raw languages, in terms of syntax, size of the code, and ease of writing it for a given problem; as a matter of fact, look at their disclaimer:

> Do it with R instead .../... Or at least use an existing CSV-mangling library.

Having said that, I think your comment is to the point; one of the reasons I use Python a lot is its libraries. I once decided to use OCaml to solve a performance issue, and was surprised by the lack of libraries for what seemed common in more modern languages (even with opam).


Similarly for the other languages (and the OP mentions that).


You forgot the most important ML derivative: Haskell. I'd like to see how it measures up.



