OP here.

Speed is the main motivation, but total time is TimeToWriteCode + TimeToRunCode.

Python has the lowest TimeToWriteCode, but very high TimeToRunCode. C++ has the lowest TimeToRunCode, but high TimeToWriteCode. Haskell is often a good compromise for me.

Also, with Haskell, it can be very easy to take advantage of 20 CPU cores, while I don't have as much familiarity with high-level C++ threading libraries.

@ the OP - not to sound hostile, but you write code (like in the example here [1]) that is bound to be slow, just from a glance at it: vstacking, munging with pandas indices (and pandas in general), etc. For it to be fast, you want pure numpy, with as few allocations happening as possible. I help my coworkers “make things faster” with snippets like this all the time.

If you provide me with a self-contained code example (with data required to run it) that is “too slow”, I’d be willing to try and optimise it to support my point above.

Also, have you tried Numba? It may be a matter of just applying a “@jit” decorator and restructuring your code a bit, in which case it may get magically boosted a few hundred times in speed.

[1] https://git.embl.de/costea/metaSNV/blob/master/metaSNV_post....

That is the _FAST_ version of the code (people keep saying "of course, it's slow", when it's the fast version).

Here is an earlier version (intermediate speed): https://git.embl.de/costea/metaSNV/commit/ff44942f5f4e7c4d0e...

It's not so easy to post the data to reproduce a real use-case as it's a few Terabytes :)


Here's a simple piece of code that is incredibly slow in Python:

    interesting = set(line.strip() for line in open('interesting.txt'))
    total = 0
    for line in open('data.txt'):
        id,val = line.split('\t')
        if id in interesting:
            total += int(val)

This is not unlike a lot of code I write, actually.

I've also found that loops with dictionary (or set) lookups are a pain point in python performance. However, this example strikes me as a pretty-obvious pandas use-case:

    interesting = set(line.strip() for line in open('interesting.txt'))
    for c in chunks: # im lazy to actually write it
        df = pd.read_csv('data.txt', sep='\t', skiprows=c.start, nrows=c.length, names=['id','val'])
        total += df['val'][df['id'].isin(interesting)].sum()

I'm not certain, but I'm fairly sure that isin() doesn't use Python set lookups but some kind of internal implementation, and is thus really fast. I'd be quite surprised if disk IO wasn't the bottleneck in the above example.

`isin` is worse in terms of performance as it does linear iteration of the array.

Reading in chunks is not bad (and you can just use `chunksize=...` as a parameter to `read_csv`), but pandas `read_csv` is not so efficient either. Furthermore, even replacing `isin` with something like `df['id'].map(interesting.__contains__)` is still pretty slow.
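For reference, a minimal sketch of the chunked variant with `chunksize`, assuming a headerless two-column tab-separated file with integer values. The function name `sum_interesting` and the default chunk size are made up for illustration, and this says nothing about whether it is fast enough on real data:

```python
import pandas as pd


def sum_interesting(data_path, interesting_path, chunksize=1_000_000):
    """Sum the 'val' column over rows whose 'id' is listed in interesting_path."""
    with open(interesting_path) as handle:
        interesting = {line.strip() for line in handle}

    total = 0
    # With chunksize set, read_csv returns an iterator of DataFrames,
    # so the whole file never has to fit in memory at once.
    reader = pd.read_csv(
        data_path,
        sep="\t",
        names=["id", "val"],           # assumed column layout
        dtype={"id": str, "val": "int64"},
        na_filter=False,               # keep ids such as "NA" as plain strings
        chunksize=chunksize,
    )
    for chunk in reader:
        mask = chunk["id"].isin(interesting)
        total += int(chunk.loc[mask, "val"].sum())
    return total
```

Calling `sum_interesting('data.txt', 'interesting.txt')` would give the same total as the plain loop; only timing on the actual data can tell which one wins.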

Btw, deleting `interesting` (when it goes out of scope) might take hours(!) and there is no way around that. That's a bona fide performance bug.

In my experience, disk IO (even when using network disks) is not the bottleneck for the above example.

Ok, I said I wasn't sure about the implementation, so I looked it up. In fact `isin` uses either hash tables or np.in1d (for larger sets, since according to pandas authors it is faster after a certain threshold). See https://github.com/pandas-dev/pandas/blob/master/pandas/core...

Could you give a hint of what the data ("sample1", "sample2") looks like, or how to randomly generate it in order to benchmark it sensibly? I guess these are similarly-indexed float64 series where the index may contain duplicates? Maybe you could share a chunk of data (as input to the genetic_distance() function) as an example, if it's not too proprietary and if it's sufficient to run a micro benchmark.

There's also code in genetic_distance() function that IIUC is meant to handle the case when sample1 and sample2 are not similarly-indexed, however (a) you essentially never use it, since you only pass sample1 and sample2 that are columns of the same dataframe (what's the point then?), and (b) your code would actually throw an exception if you tried doing that.

P.S. I like the part where you've removed the comment "note that this is a slow computation" :)

Have you checked out scikit-allel? It is fairly comprehensive in terms of calculating basic population stats, and the developer is highly active.

scikit-allel: http://scikit-allel.readthedocs.io/en/latest/index.html

scikit-allel example: http://alimanfoo.github.io/2015/09/21/estimating-fst.html

zarr: https://github.com/zarr-developers/zarr

The speed could possibly be improved by using map. Also, not related to speed if this is all of the code, but it might matter in a larger program: you should make sure your file pointers are closed. Something like:

    with open('interesting.txt') as interesting_file:
        interesting = {line.strip() for line in interesting_file}
    with open('data.txt') as data_file:
        total = sum(int(val) for id, val in map(lambda line: line.split('\t'), data_file) if id in interesting)

`map` is not going to make it faster. `map` is a loop. Only vectorized code is faster.

Have you tried using Cython to compile code like the above? Python's sets / maps / reading data etc. should be fairly optimised, so Cython might let you bypass boxing of counter variables and use native C ints or whatever instead.

Also, if the data you're reading is numeric only - or at least non-unicode / character data - you might be able to get a speed boost reading the data as binary not as python text strings.
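A rough sketch of the binary-mode idea, assuming tab-separated ASCII data with one `id<TAB>val` pair per line. The function name `sum_interesting_bytes` is made up for illustration, and whether it actually beats the text-mode loop would need measuring:

```python
def sum_interesting_bytes(data_path, interesting_path):
    """Filter-and-sum loop working on bytes, so no text decoding happens."""
    with open(interesting_path, "rb") as handle:
        # The set holds bytes objects, matching the keys read below.
        interesting = {line.strip() for line in handle}

    total = 0
    with open(data_path, "rb") as handle:
        for line in handle:
            key, val = line.split(b"\t")
            if key in interesting:
                # int() accepts ASCII digits in bytes and ignores the newline.
                total += int(val)
    return total
```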

> you write code (like in the example here [1]) that is bound to be slow, just from a glance at it

> [1] https://git.embl.de/costea/metaSNV/blob/master/metaSNV_post.....

Given his code you referenced, could you elaborate on what makes it look slow at a glance, and how you might speed it up? :)

ln 221:

    if snp_taxID not in samples_of_interest.keys():#Check if Genome is of interest

Tracing through, it looks like samples_of_interest is a dict. If this runs under Python 2, `.keys()` builds a list, so the check is a linear scan; `snp_taxID not in samples_of_interest` would make the membership check constant time.
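To illustrate the difference, here is a small sketch with made-up stand-in data (the taxon names and both helper functions are hypothetical, not taken from metaSNV). The second function imitates what Python 2's `.keys()` does by building a list first:

```python
# Hypothetical stand-in for the real dict.
samples_of_interest = {"taxon_%d" % i: [] for i in range(5)}


def is_of_interest(snp_taxID, samples):
    # Hash lookup on the dict itself: constant time on average.
    return snp_taxID in samples


def is_of_interest_slow(snp_taxID, samples):
    # Builds a list of all keys and scans it, as Python 2's .keys() would,
    # so each check costs time proportional to len(samples).
    return snp_taxID in list(samples)
```

Both return the same answers; only the cost per lookup differs, which adds up inside a loop over many SNPs.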

> Also, have you tried Numba?

Numba does not support dictionaries and has limited support for pandas dataframes (only underlying arrays, when convertible to NumPy buffers, if I understand correctly). This limits usefulness for many non-array situations, as well as some existing code-bases (the dictionary is fundamental in Python and typically used everywhere -- often for performance).

Brian Moore's quip [0] about mod_rewrite comes to mind every time I use Numba:

"Despite the examples and docs, Numba is voodoo. Damned cool voodoo, but still voodoo"

0. https://httpd.apache.org/docs/2.0/rewrite/

Interesting assertion re: TimeToWriteCode, but I think there's TimeToWriteCode vs. TimeToWriteGoodCode.

I'm working on my first serious Python project right now, and I find it's super easy to throw together some code that more or less works; but for solid, readable, documented, properly unit-tested code I hope is production-ready, it's not any faster than Perl or Golang.

(Sure, if you're a Python expert it's faster for you than for me, but if it's about TimeForExpertsToWriteGoodCode I'm not any more convinced.)

Production-ready is so complex, it's hard to make any comparison. E.g. for a library, writing good documentation (with diagrams and decent technical writing) takes me way longer than coding anyway - probably by an order of magnitude.

Proper unit-testing is also going to take roughly the same time in any language, just because you have to think hard about sensible tests (although I still love mocking/patching in Python, so I'd give it an edge, plus pdb/ipdb for debugging tests is cool). Production-ready also includes deployment, which for anything non-trivial I'd say Golang > Python > Perl.

Finally, if we're talking "serious project", IMO tooling and how that tooling integrates into a CI pipeline are more important than development speed, because as a team or project grows, terrible CI will slow developers more than any language. Although again here I think Python does quite well with decent linting, unit test frameworks, and code coverage options, Golang's opinionated tools are simpler in this respect.

(I enjoyed C# for similar reasons, although I don't think it's kept up w.r.t. tooling - been ages since I used it though.)

Good points. So far I find I really like Python's mocking, "with self.some_useful_patch()" is really nice, and I like the idea of side effects especially with boto. Of course in some cases it's really difficult, but every language has its tricky unit-testing problems.
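For anyone who hasn't seen the patching and side-effect style, a minimal sketch using only the standard library's `unittest.mock`. The `Storage` class and `load_or_default` helper are invented for illustration and are not boto:

```python
from unittest import mock


class Storage:
    """Hypothetical client; imagine fetch() talks to a remote service."""

    def fetch(self, key):
        raise RuntimeError("network access not available in tests")


def load_or_default(storage, key, default=None):
    """Return storage.fetch(key), falling back to default on KeyError."""
    try:
        return storage.fetch(key)
    except KeyError:
        return default


def run_example():
    storage = Storage()
    # side_effect as a list: the first call returns b"payload",
    # the second call raises the KeyError instance.
    with mock.patch.object(
        Storage, "fetch", side_effect=[b"payload", KeyError("missing")]
    ) as fake:
        first = load_or_default(storage, "a")
        second = load_or_default(storage, "b", default=b"")
    # Outside the with block, Storage.fetch is the real method again.
    return first, second, fake.call_count
```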

One big point I would give to Golang, about which lots of people disagree with me, is the "opinionatedness" of it. It seems to me that Python, like Perl, has a "There's More Than One Way To Do It" mentality, and after many years of that I really appreciated Golang's emphasis on the "idiomatic." That goes for the tooling too.

I have also noticed that the Python ecosystem doesn't have a strong documentation culture, which I find annoying as a relative newbie. But that presumably matters less over time, and it seems to be part of the Python Way to use libraries that "just work" and not worry about the details.

>Interesting assertion re: TimeToWriteCode, but I think there's TimeToWriteCode vs. TimeToWriteGoodCode.

In lots of areas, "good code" doesn't matter much, if at all.

Scientific computing is full of those cases -- you write code to run a few times, and don't care for maintaining it and running it ever again (as long as the results are correct).

I often wonder about that, especially having written lots and lots of lousy, unmaintainable code in my own life.

It usually starts with "oh it's just a one-off thing" and then it turns out to be useful and the rest is messy history.

But sure, within that genre I could see Python being a faster language to write in than many others.

Sometimes even for a one-shot job you dive down and write passable code. Then, as you start to tackle the complexities of the problem at hand, you realise that the amount of ropy code has tied your hands, and it gets increasingly harder to wrap your head around your implementation and finally complete the one-shot job.

> In lots of areas, "good code" doesn't matter much, if at all.

This is the received wisdom in biological science but I’m convinced that it’s trivially wrong. I’ve seen a lot of research code, most of it bad. I have no idea how many bugs are in this code, and I know for a fact that the original authors also don’t know. And it would be truly exceptional if these pieces of code were bug-free (in fact, there’s enough software engineering know-how to categorically conclude that a very high percentage of such code has bugs). How many of these bugs affect the correctness of the results?

… since the code quality is so bad, this is impossible to quantify. So, yes, code quality does matter in science, since it affects the probability of publishing wrong results.

Incidentally, there are cases of retractions of high-impact papers due to errors in code. Of course this will also happen with better code quality; but if conventional software engineering wisdom is right then it will happen substantially less.

That’s easy with python, too, in a lot of number crunching cases. Numpy with MKL will use all your cores, as will e.g. dask and other libraries built on numpy. Farming out embarrassingly parallel work to threads or processes is also easy.
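As a sketch of the "farming out" part, assuming the work splits into independent chunks, something like this with the standard library's `concurrent.futures` would do. The function names and the sum-of-squares workload are placeholders:

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(chunk):
    """Placeholder CPU-bound work on one independent chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(chunks, workers=4, executor_cls=ProcessPoolExecutor):
    # Each chunk is handled independently; map yields results in input order.
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))
```

With the default process pool, call it from under `if __name__ == "__main__":` on platforms that spawn worker processes. Passing `executor_cls=ThreadPoolExecutor` swaps in threads, which only pays off when the work releases the GIL.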

If I can fit the code into numpy-like structure, then Python is typically fine.

The issue is when I cannot.

Then move the function to a pyx file and build it with Cython. Problem solved.

Also look into numba as a jit decorator for python functions.

Have you given dask a try? It gives you out-of-core arrays with numpy semantics and distributed computing.

Dask doesn't solve that problem since it's a wrapper around pandas functions.

If you can't make the core pandas code decently fast, dask won't save you.

dask.dataframe might not help but dask.distributed could in that case.

I've had success using it on non vanilla stuff (i.e. code that could not get converted to play natively with numpy/pandas structures)

As a bonus, the nice profiling tools (built within dask) have also helped me improve the performance of the code.

See https://distributed.readthedocs.io/en/latest/

There’s dask.array which works on numpy arrays instead of dataframes. Otherwise, your argument holds.

Or use languages where you don't have to resort to these extreme workarounds for what should happen by default.

If you write more C++ than python, it will have a lower TimeToWriteCode. Despite having spent years writing python I don't find it any more productive than C++.

C++11 has all the nice features you might expect from python with the only drawback being the lack of a REPL.

The lack of REPL compounds with long compilation times, which is practically a feature of C++ and not going to go away anytime soon. The effect is that, when you explore a new API or need to tune parameters to some function call deep in the call stack, you're an order of magnitude slower than with Python (or Lisp, Scala, F#, Haskell, or even Nim or plain C (b/c compilation times)).

If you know exactly what you need to write, you're just as quick in C++ as in Python, that's true. Programming is mostly about learning what to write, though, and here C++ loses.

If you develop your code as small functions that get stitched together later, then writing unit test cases will solve this problem too.

No, it will help with lack of REPL but not with long compilation times. Long compilation times are bad across the board. Go advertises "fast compilation" as one of its key features for a reason.

EDIT: Not to mention, if you write your code as a lot of tiny functions you could just as well write it in C. Once you go for classes and templates, that's where C++ power is visible, but that's also where its compile times suck.

Writing unit tests is nowhere near a replacement for a proper REPL.

+1 people shouldn't overlook things that are bundled in the C++ stdlib now (chrono, random, thread, algorithm, mutex, containers, etc)

They're great and incredibly useful. And one should not forget that you can easily use them in a Python extension written in C++14 and exported using Cython or SWIG.

There is a C++ REPL: https://root.cern.ch/cling

https://repl.it/ also provides C++ (and many more) support in an online version.

My time to debug code is usually smaller for C++, although I'm not familiar with the python tooling as much.

Is total time really that interesting as a metric? Factor in cost, both in terms of, say, what the employer pays you and what they pay for CPU time, sprinkle it with costs in terms of externalities (e.g. the cost of millions of clients executing poorly performing code vs the cost of millions of clients paying for the additional development overhead of well performing code) and the equation is a lot more complex and application-dependent.

Then weigh in the hard realities of some engineering problems. It won't matter that it takes 1% of the time to implement a video decoder in python if it can't deliver decoded frames in a timely manner. It won't matter that the C solution will run 1000x faster if you need a month to develop what should be delivered on Friday.

I'm sorry if this is already covered in the article. I had a brief look before but it won't currently load.

As for high level C++ threading you have OMP. It's incredibly easy to use. In the simplest case you just use a preprocessor directive before a loop to say it should run in parallel. It's probably not as nice as what you get in Haskell because it needs to be done explicitly but it is really easy to use.
