Why I Use Nim instead of Python for Data Processing (benjamindlee.com)
257 points by benjamin-lee 22 days ago | 179 comments



It's primarily a testament to how simply mind bogglingly slow Python is outside of its optimised numerical science ecosystem. Which is also why I don't use it that much: while numerical analysis is a big part of what I do, so is what I would call "symbolic manipulation", and unless you go to quite some effort to transform every problem into a numerical one, Python is just awful at that.

But Nim is only one of a whole suite of languages that easily cruise to a 10x performance win over Python. And that isn't counting multicore - if you count that you quickly get to a 100x improvement.

Personally I use Groovy for much of what I do for similar reasons (which is somewhat unusual), but it's just a placeholder for "use anything except Python".


> It's primarily a testament to how simply mind bogglingly slow Python is outside of its optimised numerical science ecosystem.

From my experience in using Python at my last job, I'll also add that Python is decent at tasks that aren't CPU-bound.

I wrote a lot of scripts that polled large numbers of network devices for information and then did something with it (typically upserting the data into a database, either via direct SQL or a REST API to whatever service owns the database). All these tasks were heavily network-bound. The amount of time the CPU was doing any work was minuscule compared to the amount of time it was waiting to get data back from the network. I doubt Nim or any other language would have been a significant performance improvement in this case.

For what it's worth, that made these scripts excellent candidates for multithreading. I'd run them with 20+ threads, and it was glorious. At first I did multiprocessing, because of all the GIL horror stories, but multiprocessing made it very difficult to cache data, so eventually I said "well, all this is network-bound so the GIL doesn't even apply" and switched over to multiprocessing.dummy (which implements pools using the same API as multiprocessing but with threads instead of processes), and I never looked back.
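
A minimal sketch of that pattern, with a hypothetical polling function and host list (multiprocessing.dummy.Pool is just a thread-backed Pool with the same API):

    from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but threads
    import urllib.request

    def poll_device(host):
        # hypothetical network-bound task: fetch a status page from a device
        with urllib.request.urlopen(f"http://{host}/status", timeout=10) as resp:
            return host, resp.read()

    hosts = [f"10.0.0.{i}" for i in range(1, 21)]

    # 20 threads: while one waits on the network, the others keep working
    with Pool(20) as pool:
        results = pool.map(poll_device, hosts)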

Edit: For what it's worth, Nim sounds like a really cool language, and it's right up my alley in several ways, I just don't think Python is particularly slow at network-bound tasks that use very little CPU.


I agree with your entire post but, and I'm saying this as a full-time Python dev, there's often a point where it starts being bothersome, and that usually comes only later in the lifecycle of an application, after it has had some organic growth. Some day e.g. a sales manager comes down to your lair and asks you if you couldn't just also parse this little 200MB Excel spreadsheet after it came over the network, such that your ETL process could save it into a new table. And boom, you're in CPU-bound land now. Often it's fine, you can wait those 1-2 min for a daily occurring process. But what if, for example, you put this whole component behind a REST API that is behind a load balancer that is set with a certain timeout? There are even strict upper limits if, for example, you chose AWS Lambda for your stuff.

And suddenly you need to introduce quite a bit more technical complexity into this story that's gonna be hard to explain to management - all they see is that you now can insert a couple of million DB rows and their Big Data consultants[TM] told them that this is nowadays not even worth thinking about.

Point being: If your performance ceiling is low, you're gonna hit it sooner.


> there's often a point where it starts being bothersome

My team and I have been using Python for web, scripting, and ETL development since 2007. I don't recall the last time Python wasn't "fast enough" for anything I needed to do. I'm sure it's legitimately too slow for plenty of use cases and classes of programming domains. But for a general purpose language that makes our developers incredibly productive (which is what we optimize for organizationally), I've not "often" found the point where it becomes bothersome. On the contrary, I'd say I've rarely found it. And in those cases, the workaround to make it fast enough is there.

That's not to dispute your experience. I just want to provide a counter example.


Does Python give you an actual productivity edge? I do not see how.

Python developers are easier to hire (certainly than Nim), have a monolanguage experience (meaning less onboarding), but are not magically more productive. Developer availability is still a great reason to choose Python.


YMMV. My background is in financial applications and there it's quite a common problem, popping up especially (but not exclusively) in analytics.


Is there a problem with binding C or C++ to your Python project in these situations?


The whole build gets quite complex. Numba is interesting but also still a bit limited in what it supports. There are no silver bullets and sometimes you can find a good fit, but IMO the JVM or C#/.NET overall have a better story when it comes to writing performance-critical applications (that don't need the last 20%, such as HFT or HPC sims), as you can also more easily find people on the market who understand how to build and maintain that. There are tons of Python devs, but in my experience, at least on the European market, their understanding of underlying systems tends to be limited.


>Point being: If your performance ceiling is low, you're gonna hit it sooner.

And that isn't the only culprit. Large code bases will be hard to structure, maintain and organize. And you will spend more time writing tests than writing productive code, because you don't have many defenses against errors and because once a bug hits production it will be very difficult to debug.


I somewhat disagree there. IMO you can structure a large application just as well in python as in Java, but it requires more architecture guidance as your degree of freedom is higher. Regarding testing, mypy and later python versions have been a big help there, I don't think it's much of a difference anymore and I'd take it over Node.js for that reason because the core language IMO is more consistent and strict and requires way way less outside dependencies. In fact the standard lib for me is the main reason to choose python today.

So for me it's really performance being the main thing to think about with python. If one anticipates tighter latency or throughput requirements I'd go for something else.


> "I'll also add that Python is decent at tasks that aren't CPU-bound"

IO-bound tasks are almost by definition outside of your Python application's control. You yield control to the system to execute the actual task, and from that point on - you're no longer in control of how long the task will take to complete.

In other words, Python "being fast" by waiting on a socket to complete receiving data isn't a particularly impressive feat.


The main point (I think) was that python is a viable language for many use cases that are not processing intensive, while also being very easy and quick to write which is often the most important thing.


In my experience, Python is so slow that it will make tasks CPU-bound that have no business being CPU-bound.


You have to be reaaaaaally slow to be beaten by a network. Also, as touched on by the OP, if it takes 3 hours to write some code that in Python takes only 1, or if the compile times are huge, Python can beat other languages in speed (that edge fades when the same program is used over and over again).

But as demonstrated, Nim is fast to write and fast to compile, so Python has little edge. Just its huge ecosystem.


> Just its huge ecosystem.

Self-contradiction at its best. Kindly be reminded that the same advantage was the only thing that kept Java alive for so long, until it finally started to enter the XXI century a couple of years ago.

you just can't discount an ecosystem, especially if it's huge.


Exactly. Also I'm not impressed with a lot of the ecosystem. After getting more acquainted with many very important libraries I can see the poor quality and it's scary. Grateful for the free tools, but IMO there are other languages like Go where more of the essential libraries and tooling are written by better programmers, actual professionals rather than hobby tier.


As other commenters point out, how can a language be fast in a way besides something CPU bound? You are saying it is fast when it's not doing anything. Not sure I understand.


it's a shorthand for 'fast enough'. is it slow if nobody would notice if it was 10x faster?


You might want to look into async (asyncio or anyio) instead of or in addition to threads for network-heavy code. Async coroutines I find can be much easier to debug and develop than OS-threaded code.
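
For example, a rough sketch of the same fan-out with asyncio (the choice of aiohttp as the HTTP client and the URLs are illustrative, not from the parent comment):

    import asyncio
    import aiohttp  # third-party async HTTP client, used here for illustration

    async def fetch(session, url):
        async with session.get(url) as resp:
            return await resp.text()

    async def main(urls):
        async with aiohttp.ClientSession() as session:
            # all requests run concurrently on one thread; the event loop
            # switches between coroutines while each waits on the network
            return await asyncio.gather(*(fetch(session, u) for u in urls))

    urls = [f"http://10.0.0.{i}/status" for i in range(1, 21)]
    results = asyncio.run(main(urls))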


The thing with Python is it's usually pretty easy to optimise quite impressively.

E.g. random example:

Sprinkle some cdef's in your python and suddenly you're faster than c++

https://github.com/luizsol/PrimesResult

https://github.com/PlummersSoftwareLLC/Primes/blob/drag-race...

25.8 seconds down to 1.5


I've tried Cython and it isn't competitive with C in real-world cases. Cherry-picking a case where it is competitive doesn't help your case. Its performance tends to be around the level of Java, another language people say is competitive with or faster than C++, but in practice C++ is still about twice as fast as Java in most cases.

Still, getting Java level performance out of python is a huge improvement and should be enough for most cases.


Ahh my point was more that you can score a 17x performance boost with minimal effort rather than any absolute performance advantage over c++.

For a C++ comparison, Python would be much better off pointing out the productivity advantage it has over the notoriously low productivity of C++ development, rather than competing on execution performance.


There is also numba, which is very impressive in its own right, and also pypy, which supports features up to Python 3.7.

Some may consider Jax, and its XLA compiler, but unless you require gradients, numba will be significantly faster, an instance of this is available here [1].

XLA runs on a higher level than LLVM and therefore can't achieve the same optimizations as numba does using the latter. IIRC numba also has a Python-to-CUDA compiler, which is also very impressive.

[1] https://github.com/scikit-hep/iminuit/blob/develop/doc/tutor...
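
For a sense of what using numba looks like, here is a minimal sketch (the GC-counting function is my own illustration, not from the linked tutorial):

    import numpy as np
    from numba import njit

    G, C = ord('G'), ord('C')

    @njit  # compiled to machine code via LLVM on first call
    def gc_fraction(seq):
        # seq is a uint8 array of ASCII codes; explicit loops are cheap under numba
        gc = 0
        for b in seq:
            if b == G or b == C:
                gc += 1
        return gc / seq.size

    seq = np.frombuffer(b"ATGCGCTT", dtype=np.uint8)
    print(gc_fraction(seq))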


Apparently mypyc does the same if you have types in your code, though I've never used it.


> It's primarily a testament to how simply mind bogglingly slow Python is outside of its optimised numerical science ecosystem.

CPython's slowness doesn't boggle my mind at all. It's a bytecode interpreter for an incredibly dynamic language that states simplicity of implementation as a goal. I would say performance is actually pretty impressive considering all that. What _does_ boggle my mind is the performance of cutting-edge optimizing compilers like LLVM and V8!

At least there is a benefit to a simple implementation: Someone like me can dive into CPython's source and find out how things work.


Nim isn’t unique in its performance, but it “feels” a lot like Python making it a nice gateway drug for Python users to start wasting less electricity.


IME Python is pretty great as cross-platform scripting-, automation- and glue-language up to a few thousand lines of code. Essentially replacing bash scripts and Windows batch files or for simple command line tools that don't need the performance. It should be just one language in one's language toolbox though.


> It's primarily a testament to how simply mind bogglingly slow Python is

No, Nim is truly among the top fastest languages when writing idiomatic code as shown in many benchmarks.

> But Nim is only one of a whole suite of languages that easily cruise to a 10x performance win over Python

...while also being very friendly to Python programmers, intuitive and expressive. Unlike many other languages.


> mind bogglingly slow Python is outside of its optimised numerical science ecosystem

Granted, but inside its optimised numerical science ecosystem, Python is, in fact, fast enough. If most of your program is calls into numpy, Python will get you where you need to go. In my experience, one scalar Python math operation takes about the same amount of time as the equivalent numpy operation on a million-element array. Linked against a recent libblas, numpy will even distribute work across multiple cores. So much for the GIL.
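
A quick, purely illustrative way to see that gap on your own machine (sizes and timings will vary):

    import timeit
    import numpy as np

    a = np.random.rand(1_000_000)

    # summing a million floats in a pure-Python loop...
    t_loop = timeit.timeit(lambda: sum(a.tolist()), number=10) / 10

    # ...versus one call into numpy's C implementation
    t_np = timeit.timeit(lambda: a.sum(), number=10) / 10

    print(f"python loop: {t_loop*1e3:.1f} ms   numpy: {t_np*1e3:.3f} ms")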


I often recommend PyPy for "non-numerical" data processing when performance matters.

Also, "awful" is too harsh. Probably 90% of Python code just doesn't need to be faster than it is.


I'd guess closer to 99%...


Yeah should really be "Why I don't use Python for Data Processing". I would consider Typescript as an alternative too which I'm sure would get a similar speedup.

Also I don't know how anyone could design a language in the 21st century and make basic mistakes like this:

> Nim treats identifiers as equal if they are the same after removing capitalization (except for the first letter) and underscore, which means that you can use whichever style you want.

If that's any indication of the sanity of the rest of Nim then I'd say steer well clear!


I wish this tired, boring argument didn't derail every conversation that remotely mentions Nim. See: literally every past HN post with "Nim" in the title.

Nim's underlying, perhaps understated philosophy is that it lets you write code the way you want to write code. If you like snake case, use it. If you want camel case, sure. Write your code base how you want to write it, keep it internally consistent if you want, or don't. Nim doesn't really care.

(That philosophy extends far beyond naming conventions.)

What this avoids is being stuck with antiquated standard libraries that continue to do things contrary to the language's standards for the sake of backward compatibility (arg Python!) and 3rd party libraries where someone chose a different standard because that's their preference (arg Python! JavaScript! Literally every language!). Now you're stuck with screaming linters or random `# noqa` lines stuffed in your code, and that one variable that you're using from a library sticks out like a sore thumb.

Your code is inconsistent because someone else's code was inconsistent - that's simply not a problem in Nim.

Could Nim have forced everyone to snake_case naming structures for everything from the start? Well, sure, but then the people that have never actually written code in Nim would be whining about that convention instead and we'd be in the same place. After having actually used Nim, my opinion, and I would venture to say the opinion of most, is that its identity rules were a good decision for the developers who actually write Nim code.


> I wish this tired, boring argument didn't derail every conversation that remotely mentions Nim. See: literally every past HN post with "Nim" in the title.

This is a serious design flaw. It absolutely should be front and center when Nim is discussed.

> Nim's underlying, perhaps understated philosophy is that it lets you write code the way you want to write code. If you like snake case, use it. If you want camel case, sure. Write your code base how you want to write it, keep it internally consistent if you want, or don't. Nim doesn't really care.

I want to write code where "myFoo != my_foo". Evidently, nim doesn't allow that, so this argument seems pretty hollow.

> Now you're stuck with screaming linters or random `# noqa` lines stuffed in your code, and that one variable that you're using from a library sticks out like a sore thumb.

This is because we have crappy linters. If a linter can't tell that an identifier is camel case because it is third party, it's a bad linter.

> Could Nim have forced everyone to snake_case naming structures for everything from the start?

This would have been preferable to this madness.


> then the people that have never actually written code in Nim would be whining about that convention

Spot on. I wrote plenty of Nim and the style-insensitivity is a feature and not a bug.

Not allowing the use of 3 different variables named userName, user_name and username only encourages readable and robust code.


> 3rd party libraries where someone chose a different standard because that's their preference

C & C++ don't have a standard style but Python and JavaScript definitely do. And Rust and Go and I would guess most projects. Find me a popular library in one of those languages that doesn't use the standard style.


> Yeah should really be "Why I don't use Python for Data Processing".

Not entirely. Nim's benefit here is that it's superficially similar enough to Python that it's easy for people from that world to pick up and start using Nim.

> Also I don't know how anyone could design a language in the 21st century and make basic mistakes like this: > If that's any indication of the sanity of the rest of Nim then I'd say steer well clear!

It may seem like a design mistake at first glance but it's surprisingly useful. Its intent is to allow a given codebase to maintain a consistent style (e.g. camel vs snake) even when making use of upstream libraries that use different styles. Not including the first letter avoids most of the annoyance of wantonly mixing all-caps constants with lowercase, and linters keep teams from mismatching internal styles. Though mostly I forget it's there, as most idiomatic Nim code sticks with camel case. I'd say not to knock it until you've tried it.

The rest of Nim’s design avoids many issues I consider actual blunders in a modern language such as Python’s treatment of if/else as statements rather than as expressions, and then adding things like the walrus operator etc to compensate.


> It may seem like a design mistake at first glance but it's surprisingly useful. > Its intent is to allow a given codebase to maintain a consistent style (e.g. camel vs snake) even when making use of upstream libraries that use different styles.

That doesn't sound right at all. It sounds like a design choice aimed at achieving the exact opposite: inconsistency without any positive tradeoff in return.

> Not including the first letter avoids most of the annoyance of wantonly mixing all cap constants or lower case and linters avoid teams mismatching internal styles.

That does not sound right at all. At most, it sounds like the compiler does not throw errors when stumbling on what would otherwise be syntax errors, but you still have all the mismatches in internal styles, linters complaining about code, teams wasting time with internal pissing matches, and more importantly a way to foster completely futile nitpicking discussions within a language community.


> Its intent is to allow a given codebase to maintain a consistent style (e.g. camel vs snake) even when making use of upstream libraries that use different styles.

This doesn't make sense. For an entirely new language you can just have the entire ecosystem use the same style, e.g. like Rust does. Or even Python!


I'd agree, if you want a consistent style, ship a formatter with your package manager, like Rust does with `cargo fmt`. If it's there by default, it'll quickly be a standard.


there are many packages and maintainers of such who just don't give a damn about the formatter or the style, some people just *want what they want*


Sure, but with a standard formatter they become a small minority.


and with style insensitivity they become a non-issue?


The compiler might not be sensitive to style, but *I* am.


I guess the best thing would be having a formatter that can switch the style for you locally. I don’t think Nim has this currently, would be neat if they did.


I don't think most people care about a specific style, as long as it's consistent. That's why I love Black, we all disagree with it equally, but each for different reasons.


It actually makes a lot of sense. I personally don't want to touch .NET langs because of their PascalCase and camelCase coding style, which I find a very unfortunate, arbitrary decision. And well, I didn't really want to get into Nim because of camelCase, but since I've learned about this feature I'm reconsidering again.


Do you really dislike camelCase that much to avoid an _entire_ ecosystem because of it?

I prefer snake_case too, but avoiding a language because of a minor convention like that seems weird.


Only language where I can stand it is Haskell. camelCase is relatively ok, but PascalCase for non-types (i.e. function names) is a big no-no :)


When there are so many excellent languages to choose among, a trivial distinction can almost be a relief.


I actually use TypeScript/JavaScript a lot for this reason, especially for biological algorithms that I want to run in the browser. The developer tooling is also as good as you can hope for, especially when using VS Code. I actually wrote a circular RNA sequence deduplication algorithm in it just recently [1].

With respect to the identifier resolution in Nim, it strikes me as more of a matter of preference. Especially given the universal function call syntax in Nim, at least it's consistent. For example, Nim treats "ATGCA".lowerCase() the same as lowercase("ATGCA"). I do appreciate the fact that you can use a chaining syntax instead of a nesting one when doing multiple function calls but this is also a matter of style more than substance.

[1] https://github.com/Benjamin-Lee/viroiddb/blob/main/scripts/c...


That's great. However, the method that you use to find the canonical representative [1] is quadratic (when the string has length N, there are N rotations, and for each rotation you need to check N characters to determine whether it is earlier than the best one that you have found so far). For large strings you would probably want to switch to one of the linear minimal string rotation algorithms [2], for example Booth's algorithm.

[1] https://github.com/Benjamin-Lee/viroiddb/blob/main/scripts/c...

[2] https://en.wikipedia.org/wiki/Lexicographically_minimal_stri...
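
For reference, a hedged Python sketch of Booth's algorithm, adapted from the standard formulation (shown in Python for brevity even though the script in question is TypeScript; worth validating against the naive rotate-and-min version before relying on it):

    def least_rotation(s: str) -> int:
        """Booth's algorithm: index of the lexicographically minimal rotation, O(n)."""
        s += s                 # doubling avoids modular arithmetic
        f = [-1] * len(s)      # failure function, as in KMP
        k = 0                  # start of the best rotation found so far
        for j in range(1, len(s)):
            sj = s[j]
            i = f[j - k - 1]
            while i != -1 and sj != s[k + i + 1]:
                if sj < s[k + i + 1]:
                    k = j - i - 1
                i = f[i]
            if sj != s[k + i + 1]:  # i == -1 here
                if sj < s[k]:
                    k = j
                f[j - k] = -1
            else:
                f[j - k] = i + 1
        return k

    def canonical(seq: str) -> str:
        k = least_rotation(seq) if seq else 0
        return seq[k:] + seq[:k]

    # sanity check against the naive O(n^2) approach
    assert canonical("GATTACA") == min("GATTACA"[i:] + "GATTACA"[:i] for i in range(7))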


You hit the nail on the head. This is just the lexicographically minimal string rotation with a canonicalization step. I have actually had this on my to-do list for a while. The truth is that the data set I'm applying this to is small (though I'm doing my best to change that by discovering new viroid-like agents) so the optimization has yet to become pressing. At some point, I'd like to spin this script out into a standalone tool but I'm not sure the demand is there yet.


The main beef I have with the javascript ecosystem for data analysis is lack of multicore. Yes lots of things can be solved by converting multi-core to multi-process or using other workarounds but there are a whole class of problems where shared memory access makes a huge amount of sense.


How is that a basic mistake? I think having two distinct identifiers that differ only by case sounds like a mistake!


What I found useful is that even if some author uses snake_case in their library, it doesn't spread to your code, so you can still adhere to the standard.


Various programming languages (and file systems!) have had some form of case insensitivity and it always turns out to be an absolutely terrible idea:

* It makes searching for identifiers harder. For Nim you can't even use case insensitive search because of the underscore thing! Better practice your regexes.

* The case insensitivity rules are usually super complicated and don't apply to everything, so now it's an extra thing you have to mentally compute when coding. This is probably the biggest problem and I'm sure it has led to bugs, e.g. in SQL.

* Do you enjoy the tabs vs spaces debate? How about single vs double quotes? Ugly inconsistently styled code? Well you'll love this!

* Unicode case insensitivity is actually really really complicated (this mostly applies to filesystems).

You're basically opening yourself up to an array of annoyances and gotchas for essentially no benefits.

I've literally never seen anyone use two identifiers that only differ by case, but if that were actually a big problem it could be solved just by making that illegal. You don't have to resort to the insanity of case insensitivity.


> Do you enjoy the tabs vs spaces debate? How about single quote Vs double quotes? Ugly inconsistently styled code?

Nim solved the tabs/spaces debate by allowing only spaces. Single and double quotes also have completely separate meanings, so there is no inconsistency there either. About "inconsistently styled code" - due to style insensitivity, the effects are not viral. If you depend on a library that uses `get_name` but your project adopted `getName()`, you don't need to suffer.

> Unicode case insensitivity is actually really really complicated

Which is why Nim only handles case insensitivity for ASCII identifiers and not Unicode. Which makes sense because 99.9% of code is written in ASCII.

> The case insensitivity rules are usually super complicated and don't apply to everything

Quoting from the manual - "only the first letters are compared in a case-sensitive manner. Other letters are compared case-insensitively within the ASCII range, and underscores are ignored." Unicode is not handled in a style-insensitive manner, so the rule is pretty simple.
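
In other words, the whole rule is small enough to write down; a rough Python model of that comparison (for illustration only, based on the quoted manual text):

    def nim_ident_eq(a: str, b: str) -> bool:
        # first character compared case-sensitively; the rest compared
        # case-insensitively (ASCII identifiers) with underscores ignored
        def norm(s: str) -> str:
            return s[0] + s[1:].replace("_", "").lower()
        return norm(a) == norm(b)

    assert nim_ident_eq("getName", "get_name")
    assert not nim_ident_eq("getName", "GetName")  # first letter still matters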

> It makes searching for identifiers harder. For Nim you can't even use case insensitive search because of the underscore thing!

Technically true, but in reality this comes up so rarely, I didn't have any issues with this ever. Nim projects usually adopt camel/pascal case.

For some reason, people often assume that allowing style insensitivity instantly throws the whole language ecosystem in complete disarray and everyone starts writing code mixing every possible style of writing at once, swapping styles on every second identifier just for their amusement.


> so the rule is pretty simple

So it's only identifiers, what about keywords, compiler directives? And only ASCII characters... And it randomly doesn't affect the first character. Sure, very simple.

> If you depend on a library that uses `get_name` but your project adopted `getName()` your don't need to suffer.

Yeah because nobody ever imports code from other projects, copy/pastes from StackOverflow etc. /s

> Nim projects usually adopt camel/pascal case.

So why do you need case insensitivity??!

Trust me this is a decision they will regret.


It does affect identifiers, built-in keywords and type names in the same manner, though it is completely beyond me why anyone would write 'tY_PE Position = tUPLE[x, y: iNt]'.

If we are going to dismiss any nontrivial behavior as "sure very simple" then sure, having two distinct incompatible ways to write get_name (or getName) looks better.

It does not "randomly" not affect first character. Most style guides for camel/pascal case treat first character differently, with types being capitalized and regular variables starting in lowercase. Making whole identifier style-insensetive would significantly reduce number of possible names. So this rule makes 'building' and 'Building' different, but at the same time does not differentiate between 'get_name' and 'getName'.

> Yeah because nobody ever imports code from other projects, copy/pastes from StackOverflow etc. /s

I don't understand your point (even considering /s) - if I do import code from my other projects or copy-paste it from somewhere it simply doesn't matter to me what style they used - I'm not required to fix copied code, or adapt to the style my dependencies use.

> So why do you need case insensitivity??!

If their code gets into the larger ecosystem it will not affect anything else, or at least its effects will be minimized. It is a win-win solution - the library author uses their preferred style, I (and everyone else in the ecosystem) stick to the common convention, and everyone is happy.

> Trust me this is a decision they will regret.

I've been using nim for multiple years now, and I have never seen anyone actually regretting this. Of course, when new people come to the language they sometimes are really surprised by this, but rarely ever complain about this in the long run.

EDIT: forgot to mention that nim is more than a decade old language, and "decision they will regret" should have happened years ago, yet it does not seem to be the case.


> For Nim you can't even use case insensitive search because of the underscore thing! Better practice your regexes.

Not at all, unless you decide to mix different styles in the same codebase.


While I trust the author on this, I don’t think DNA datasets and string analysis was a great example.

One of the big, big things for improving performance on DNA analysis of ANY kind is converting these large text files into binary (the 4 letters easily convert to a 2-bit encoding), which massively improves basically any analysis you're trying to do.

Not only does it compress your dataset (2 bits vs 16 bits), it allows absurdly faster numerical libraries to be used in lieu of string methods.
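
As a rough illustration of the packing step in Python with numpy (the mapping, layout, and length assumption are just one possible choice):

    import numpy as np

    # map A/C/G/T bytes to 2-bit codes via a 256-entry lookup table
    lut = np.zeros(256, dtype=np.uint8)
    for code, base in enumerate(b"ACGT"):
        lut[base] = code

    def encode(seq: bytes) -> np.ndarray:
        codes = lut[np.frombuffer(seq, dtype=np.uint8)]
        # pack 4 bases per byte (assumes len(seq) is a multiple of 4, for brevity)
        codes = codes.reshape(-1, 4)
        return (codes[:, 0] << 6) | (codes[:, 1] << 4) | (codes[:, 2] << 2) | codes[:, 3]

    packed = encode(b"ACGTACGT")  # 8 bases -> 2 bytes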

There’s no real point in showing off that a compiled language is faster at doing something the slow way…


You make a fair point that using optimized numerical libraries instead of string methods will be ridiculously fast because they're compiled anyway. For example, scikit-bio does just this for their reverse complement operation [1]. However, they use an 8 bit representation since they need to be able to represent the extended IUPAC notation for ambiguous bases, which includes things like the character N for "aNy" nucleotide [2]. One could get creative with a 4 bit encoding and still end up saving space (assuming you don't care about the distinction between upper versus lowercase characters in your sequence [3]). Or, if you know in advance your sequence is unambiguous (unlikely in DNA sequencing-derived data) you could use the 2 bit encoding. When dealing with short nucleotide sequences, another approach is to encode the sequence as an integer. I would love to see a library—Python, Nim, or otherwise—that made using the most efficient encoding for a sequence transparent to the developer.

[1] https://github.com/biocore/scikit-bio/blob/b470a55a8dfd054ae...

[2] https://en.wikipedia.org/wiki/Nucleic_acid_notation

[3] https://bioinformatics.stackexchange.com/questions/225/upper...


Yeah, this is why my comment led with “I trust the author”…

I’m surprised you need the full 4 bits to deal with ambiguous bases, but it probably makes sense at some lower level I don’t understand.


This is because there are four bases and each can either be included or excluded from a given combination. So there are 2^4 = 16 combinations, each of which has its own letter. In all honesty, these are pretty rarely used in practice these days except for N (any base), although they do sometimes show up when representing consensus sequences.


What do you mean that each base can be included or excluded? Isn't only one extra value needed? Sort of like nil?


Because there are notations for any combination of bases. There's a way to indicate "C or G", "A or T", "C, G, or A", etc.


Oh. So what's the grand total of all possible permutations of single and multiple (connected with an "or") values?

I'll also read through your links, thanks for posting them.


Aren't the reads emitting a set of size greater than 4 bases per position, with a wildcard or "?" perhaps one option?

(As in GATTACA might be read as is, but might be read as GAT?ACA.)

Still, that's a minimum of 3 bits versus much longer.

[Edit : i see another commenter with the same observation, more thoroughly explained! ]


Why do we use Python for data processing?

Because we use it as a nice syntactic frontend to numpy, a large and highly optimized library written in C++ and Fortran (sic). That is, we actually don't use "Python-native" code much, and numpy is essentially APL-like array-oriented thing where e.g. you don't normally need loops.

For native-language data processing, Python is slow; Nim or Julia would easily outperform it, while being comparably ergonomic.


Apparently there's also a data processing library for Nim called Arraymancer[0] that's inspired by Numpy and PyTorch. It claims to be faster than both.

[0] https://mratsim.github.io/Arraymancer/


Please add D language to the mix as well. Interestingly, you can simply replace Nim with D in the blog article and most of the contents will still make sense!

The funny thing is that Nim and Julia libraries are still wrapping the Fortran numerical library, while D beat the old and trusted Fortran library on its home turf five years back:

http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...


>Julia libraries are still wrapping Fortran numerical library

You say that, but Julia is rapidly acquiring native numerical libraries that outperform OpenBLAS:

https://discourse.julialang.org/t/realistically-how-close-is...


There’s been a tremendous amount of work optimizing blas _and_ ensuring it’s numerically stable. Julia made a good choice to use blas first. Though it’s good to see new native implementations.

For Nim, there’s also NimTorch which is interesting in that it builds on Nim’s C++ target to generate native PyTorch code. Even Python is technically a second class citizen for the C++ code. Most ML libraries are C++ all the way down.

https://github.com/sinkingsugar/nimtorch


Not necessarily true with Julia. Many libraries like DifferentialEquations.jl are Julia all of the way down because the pure Julia BLAS tools outperform OpenBLAS and MKL in certain areas. For example see:

https://github.com/YingboMa/RecursiveFactorization.jl/pull/2...

So a stiff ODE solve is pure Julia, LU-factorizations and all. This is what allows it to outperform the common C and Fortran libraries very consistently. See https://benchmarks.sciml.ai/html/MultiLanguage/wrapper_packa... and https://benchmarks.sciml.ai/html/Bio/BCR.html


Julia is in the process of replacing many C and Fortran numerical libraries with pure Julia implementations, because they have similar performance.


Another Nim & Python thread that has not been mentioned here yet:

https://news.ycombinator.com/item?id=28506531 - the project allows creating Pythonic bindings for your Nim libraries pretty easily, which can be useful if you still want to write most of your top-level code in Python but leverage Nim's speed when it matters.

If you want to make your Nim code even more "pythonic" there is https://github.com/Yardanico/nimpylib, and for calling Python code from Nim there is https://github.com/yglukhov/nimpy


The author makes a fair point; however, that is a rather non-optimal implementation in Python. You could likely use chunked Pandas to speed up the code, or at least replace some of the for loops with list comprehension syntax.

However, in any case I would never replace Python with Nim, as it is too niche of a language and you would struggle with recruiting. I could consider Julia if its popularity keeps growing.

That is the ultimate challenge of a language. It either needs a large backer (Go and Google) or has to be so good that it gets natural market adoption (Julia). As a manager I am reluctant to adopt yet another language unless there is a healthy job market for it.


There are objectively better niche solutions for niche problems out there. But we pick up things that can be applied to solve a number of different problems, that are more versatile, and that have a community behind them.


There's a class of technologies falling under what I'd call "most of your engineers would pick that in a weekend".

Not all technologies require the full cycle and the normal risk management.


Agreed. I am certainly inclined to believe that Nim is a better language than Python, but it's not so much better that it would justify moving off of the ecosystem.


This is reasonably idiomatic Python and 10x faster than the implementation in the original post:

  with open("orthocoronavirinae.fasta") as f:
      text = ''.join((line.rstrip() for line in f.readlines() if not line.startswith('>')))
      gc = text.count('G') + text.count('C')
      total = len(text)


Or if you want to be explicit, this is just as fast (and might scale better for particularly long genomes):

  gc = 0
  total = 0
  
  with open("orthocoronavirinae.fasta") as f:
      for line in f.readlines():
          if not line.startswith('>'):
              line = line.rstrip()
              gc += line.count('C') + line.count('G')
              total += len(line)

I didn't test Nim but the author reports Nim is 30x faster than his Python implementation, so mine would be about 3x slower than his Nim.


I think this is missing the point of the article.

Yes, you can implement a faster Python version, but notice also:

* This faster version is reading the whole file into memory (except comment lines). The article mentions the data being 150MB, which should fit in memory, but for larger datasets this approach would be unfeasible

* The faster version is actually delegating a lot of work to Python's C internals by using text.count('G'). All the internal looping and comparing is done in C, while in the original version it goes through Python

So yes, you can definitely write faster Python by delegating most of the work to C.

The point of the article is not about how to optimize Python, but about how, given almost identical implementations in Python and Nim, Nim can outperform Python by 1 or 2 orders of magnitude without resorting to C internals for basic things like looping or comparing characters.


I didn't try to write optimized code, but idiomatic Python. Which also happens to be 10x faster.

To make it streaming, take the second version and remove the readlines (directly iterate over f).
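
Concretely, something like this (the same loop as above, just iterating the file object directly):

    gc = 0
    total = 0

    with open("orthocoronavirinae.fasta") as f:
        for line in f:  # no readlines(): the file is consumed line by line
            if not line.startswith('>'):
                line = line.rstrip()
                gc += line.count('C') + line.count('G')
                total += len(line)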

Delegating work to Python's C internals is fine IMO because "batteries included" is a key feature of Python. "Nim outperforms unidiomatic Python that deliberately ignores key language features" is perhaps true, but less flashy of a headline.

And to be honest, I mainly wrote this because the other top level Python implementations for this one were terrible at the time of the post.


One liner to count gc, without buffering.

    import io
    f = io.StringIO(
    """
    AB
    CD
    EF
    GH
    """
    )

    total = sum(map(lambda s: 0 if s[0]==">" else s.count('G') + s.count('C'), f.readlines()))

    print(total)


And reading the file as binary. There's a lesson about the overhead of unicode strings here ;)

Your first example takes 3.1 seconds, my previous comment takes 2.3 seconds, this one takes 1.4 seconds.

    import time

    start = time.perf_counter()

    with open("orthocoronavirinae.fasta", "rb") as f:
        # indexing a bytes object yields ints; 62 == ord('>') marks FASTA header lines
        total = sum(map(lambda s: 0 if s[0] == 62 else s.count(b"G") + s.count(b"C"), f.readlines()))

    end = time.perf_counter()

    print(total, " total")
    print(end - start, " seconds")


This is good too. I would use a generator expression instead of the map though probably.


map is lazy like a generator (it returns an iterator in Python 3)


As a Data Engineer, I mainly use Python in conjunction with PySpark. So essentially I just use Python as an easy-to-read wrapper around Spark, and it works great when working together with Data Scientists, who are mostly used to Pandas, Keras, Tensorflow, etc.

In my use case, I don't really see how Nim would make my life easier right now.


One pitfall that I ran into when trying out Nim for scientific computing is that Nim follows more of a computer science convention than a mathematics convention for the exponentiation and negation operators. That is, in Nim -2^3 = (-2)^3, unlike more scientific-computing-oriented languages where -2^3 = -(2^3). To someone like myself who mainly does scientific work this was quite unexpected, and it causes a surprising amount of mental overhead to avoid mistakes. I did like Nim quite a bit otherwise, but essentially found that I was missing some important numerical libraries, so I did not continue using it.


To be honest, regardless of the behavior of the language, I would be wrapping things in parentheses to make it explicit anyway.


Although Nim using weird order of operations is unfortunate, your example is not well chosen. Replace the exponent 3 with an even integer and your point will be clear.


Maybe not that weird, although not the most common syntax. In Nim 0 - 2^2 evaluates to -4, while -2^2 evaluates to +4. So in -2, the - is treated as a unary minus and binds tightly to the 2. It would be bad if Nim were doing + or - before ^, or before *. I have a feeling some other languages treat the unary minus the same way, but don’t know of any examples off hand.
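
For comparison, in Python ** binds tighter than unary minus, so (quick illustration):

    print(-2 ** 2)     # -4: parsed as -(2 ** 2)
    print((-2) ** 2)   #  4: what -2^2 reportedly evaluates to in Nim
    print(0 - 2 ** 2)  # -4: binary minus, same result in both languages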


Yes, when I asked a question about this on their Gitter channel, people brought up several languages; I think bc is one of them, for example. I understand the reasoning, but I still find it extremely unintuitive personally.


I don't think that is a mathematics convention vs computer science convention thing. Most languages, whether aimed at mathematics or computer science, have exponentiation at higher precedence than unary minus. (JavaScript does not, but it also does not allow unary minus directly in front of the base of an exponentiation so you have to add parenthesis no matter which interpretation you want).

The main places you find it the other way are spreadsheets and shells.

Is there an explanation from the Nim authors as to why they made such an odd choice?


IIRC the argument is that unary negation is the operator with the highest precedence and one should be consistent across logical and mathematical operations. I think this is not a completely unreasonable stance to take (and several other languages take the same), but it is unintuitive for someone who does significant scientific computing.


While Nim is for certain interesting and even pleasant to write code in, its small user base and environment discourage people from using it.

I don't write code only for myself.

How would I convince my employer to let me use Nim instead of a better known language?

And even if I could convince my employer: if we want to start a new project, how could we find programmers well-versed in Nim?

And even if we can find those people, it would mean we would have to write many things ourselves, which in other languages we can take for granted as they have libraries for almost anything.

So having a nice, performant and good language is just a small part of achieving your goals. You also need the people and the ecosystem.

Go, Rust, Kotlin, Swift and even Julia have the luck of having some industry heavyweights behind them, pushing the ecosystem and contributing with money and developers. Nim has only a bunch of passionate people behind it.


>if we want to start a new project how could we find programmers well-versed in Nim?

If a programmer can't pick up a language like Nim in a few weekends (from what I gather, it's similar to Python and not much different from most common languages, i.e. not something relatively exotic like Haskell) then I don't know. Our mainly PHP shop transitioned to Go quite effortlessly. Today we hire PHP juniors without any Go experience (easier to find), we teach them, and then they work on Go codebases already after a month of internship. So lack of "professional Nim programmers" doesn't look like a problem to me.

Lack of libraries is a good point but from what I read, Nim compiles to C, so I understand they can have access to tens (hundreds?) of thousands C libraries without writing everything from scratch.

However, indeed, if you are to choose between, for example, Nim and Go for a new project, then I am not sure why would anyone prefer Nim. I'm really interested to know.


I think it is true for people who have a computer science/software engineering or any programming background, but my statistician/data scientist friends skip a beat when they think they might need to learn a different language.


> However, indeed, if you are to choose between, for example, Nim and Go for a new project, then I am not sure why would anyone prefer Nim. I'm really interested to know.

Same here, curious to know what HN crowd recommends between Nim vs Go for new projects.


Nim has hygienic AST macros, so it has really good metaprogramming capabilities. An example of the sort of thing you can do with it: use pattern matching to implement a functional programming DSL:

https://nim-lang.org/blog/2021/03/10/fusion-and-pattern-matc...

This makes it really easy to reduce boilerplate and create low- or 0-cost abstractions for your problem domain. Example of this done in a microcontroller project here:

https://www.youtube.com/watch?v=j0fUqdYC71k

Async/await is also implemented by metaprogramming, rather than as a "core" part of the language:

https://www.youtube.com/watch?v=i0RB7UqxERE

Unrelated to the above, Nim also compiles to Javascript. So you can use the same language for both the backend and frontend.


>use pattern matching to implement a functional programming DSL

DSLs have some use cases but generally I'd avoid being too clever and creating basically entire sublanguages for the sake of it - unless I really need to. Most projects don't need it, it's harder to pick up for novices (especially if there's a zoo of DSLs, different between projects). In Go the last resort is usually code generation, there are built-in tools for it.

>Async/await is also implemented by metaprogramming, rather than as a "core" part of the language

I guess most programmers don't really care if a feature is implemented in the language or in the library, provided it's easy to use. In Go, it's not extendable however; but so far, I've never had a need to extend the default goroutine scheduler.

>Unrelated to the above, Nim also compiles to Javascript. So you can use the same language for both the backend and frontend.

Go can be compiled to WASM, but practically I've never seen code that could be reused between backend/frontend because they solve different problems using different principles, so the idea of using one language for frontend/backend was never compelling to me. Maybe it's just me.


Nim is also faster than Go if we look at most benchmarks. Nim's memory management can be tuned quite a bit. Garbage collector can be even turned off. You can better tune to the underlying hardware if you need to.

https://nim-lang.org/docs/gc.html


> How would I convince my employer

Turn the tables: if you were an employer, what would convince you to use Nim?

Hiring for Nim skills can be a signal that a company has people who learn languages beyond the run-of-the-mill ones. A bunch of passionate people you might say. That would make the company promising to work for.


Interestingly this was exactly the place Python was in back in the day: niche enough that anyone knowing it must've learned it because they thought it was cool rather than because it would get them a job. Those days are long gone of course.


In the case of Python, the "industry heavyweights" do very little and have a negative influence now.

Why the phrase "only a bunch of passionate people"? This is how software gets written, parasitical corporations and their unproductive developers who are installed in existing OSS projects come later and mainly associate themselves with the result (speaking of Python again).


> How would I convince my employer to let me use Nim instead of a better known language?

This is a rephrasing of "nobody ever got fired for buying IBM".

Some organizations prioritize innovation and technical acumen.

> So having a nice, performant and good language is just a small part of achieving your goals. You also need the people and the ecosystem.

Many applications don't need a large ecosystem. People can learn.

> Go, Rust, Kotlin, Swift and even Julia have the luck of having some industry heavyweights behind them

Python was never corporate-driven, thankfully, and it is successful.


> And even I would convince my employer, if we want to start a new project how could we find programmers well-versed in Nim?

Nim's easy to learn if you have any experience with any compiled language and can understand anything along the line of C#, Kotlin or Python syntax. Also because it compiles to C and JS it makes it easy to add it to a project incrementally in many cases.


You could make the same argument about Rust right now, but Rust is further along the 'programming language life path'.


Not really, Rust is way more established than Nim. Lots of people use it, lots of companies use it, the community is very large. Maybe 5 years ago, Rust was at the same point as Nim, but even then I doubt it. Nim seems to be one of those languages that will always remain low-profile, because it's an incremental improvement and isn't pushed by a platform.


So many Python speed apologists. Yes, you can pore over your Python code and given enough time and effort you can eke out another 25% improvement, but it's still much slower than the alternatives.


Counter argument, how many times does one need to do data processing and what's the most expensive process in the equation?

The answer to the latter is programmer time, and some things can be scaled easily using `joblib` or `dask`. Now, it isn't as trivial as importing parallel iterators with Rust and changing `.into_iter` to `.into_par_iter`, but it still needs less time, and once it is done, I don't need to think about it again.
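
A minimal joblib sketch of that kind of fan-out (the work function and chunking are placeholders):

    from joblib import Parallel, delayed

    def process(chunk):
        # placeholder for whatever per-chunk work the pipeline does
        return sum(chunk)

    chunks = [range(i, i + 1000) for i in range(0, 100_000, 1000)]

    # run the chunks across 8 worker processes; joblib handles the plumbing
    results = Parallel(n_jobs=8)(delayed(process)(c) for c in chunks)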


Python gets WAY too much work done in WAY too many fields to just handwave away "waaah, it's faster to use blah-lang"


> Nim compilation process took an additional 702 ms

That's horrifyingly slow for a compiler. The author mentioned "modern languages look like Python but run as fast as C", which is a common promise those languages make that never really materializes except for a few very happy-path cases they heavily optimised the language for. Julia, for example, makes this promise too, but compiles even slower than that and takes ridiculous amounts of RAM even for hello world.

Did the author post the data set they used for the examples? Would be nice to try it out on a few languages to see how fast that can compile and run on a mature language like Common Lisp (which is just as easy to write) or even node.js.


Nim is actually one of the fastest to compile out of the compiled languages out there, on par with Go. Although this is a bit subjective, I think a second of compilation is good enough for light scripting tasks. (And being a statically-typed language it catches a good chunk of errors before compilation is finished.)

Nim's advantage is that it uses a good old C compiler for the backend (which has been hyperoptimized for decades), but the frontend (transpiler) is also pretty fast. Nim's compilation speed should improve a bit when incremental compilation support is added (which would probably solve a lot of other current issues for Nim, for example better IDE tooling)


I didn't post it because it's quite big (150M) but readily available from the NCBI Virus portal [1]. I would love to see how well other languages compete both for speed and simplicity.

[1] https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType...


I couldn't get your 150M file, so I used one of the smaller files I could get by clicking on the first set shown in the table (the FASTA file was only 30KB) and duplicated it until it was around 150MB.

Here's a comparison with Common Lisp:

    ~/fasta-dna $ time python3 run.py
    0.3797277865097147
    21.828 secs

    ~/fasta-dna $ time sbcl --script run.lisp
    0.37972778
    2.415 secs

    ~/fasta-dna $ ls -al nc_045512.2.fasta
    -rw-r--r-- 1 156095639 2021-09-25 11:15 nc_045512.2.fasta

So, almost as fast as Nim (the time includes compilation time)?

Here's the Common Lisp code:

    (with-open-file (in "nc_045512.2.fasta")
      (loop for line = (read-line in nil)
            while line
            with gc = 0 with total = 0 do
              (unless (eql (aref line 0) #\>)
                (loop for i from 0 below (length line)
                      for ch = (char line i) do
                        (setf total (1+ total))
                        (when (or (eql ch #\C) (eql ch #\G))
                          (setf gc (1+ gc)))))
            finally (format t "~f~%" (/ gc total))))
With a top-level function and some type declarations it could run even faster, I think.

EDIT: compiling the Lisp code to FASL and annotating the types brings the total runtime to 2.0 seconds. Running it from source increases the time very slightly, to 2.08 seconds, showing how the SBCL compiler is incredibly fast. Taking 0.7 seconds to compile a few lines of code is crazy, imagine when your project grows to many thousands of lines.

The Lisp code still can't really match Nim, which is really C at runtime, in speed when excluding compile-time, but if you need a scripting language, CL is great (specially when used with the REPL and SLIME).


@brabel - The Nim compiler actually builds a relatively large `system` package every time. (They are also working on speeding up compiles.) So, compile time does not scale as badly as you think. E.g., you might have to 50..100x the "user level" source code to double the time.

Also, @benjamin-lee this version of the Nim program is a bit lower level, but probably much faster:

    import memfiles as mf
    var gc = 0
    var total = 0

    var f = mf.open("orthocoronavirinae.fasta")
    for line in memSlices(f):
        let n = line.size
        let cs = cast[cstring](line.data)
        if n > 0 and cs[0] == '>': # ignore comment lines
            continue
        for i in 0 ..< n:
            let letter = cs[i]
            if letter == 'C' or letter == 'G':
                gc += 1
            total += 1

    echo(gc.float / total.float)
    mf.close(f) # not really needed; process about to end
Compile with -d:danger and so on, of course. { On a small 30kB test file I got about a 1.7x speed-up over that of the blog post. I also could not find the 150 MB file. Multiplying up the tiny 30 KB file like @brabel, I got only a 1.25x speed-up down to 0.5 seconds. So, might not be worth the low levelness, but a real file might tilt more towards the 1.7x end. }


I clicked on the big Download button and selected "all records", it downloaded over 3.5GB before I gave up... which file exactly should I use??


I'm sorry, I completely forgot that the file I used was from six months ago when I wrote the blog post (and then promptly forgot to publish it). In the last half year, the number of coronavirus sequences has increased dramatically. One thing that you could do to drop the file size down is to filter for only complete and unambiguous sequences, which drops the number down from 1.6 million to ~100k [1].

Alternatively, the exact file I used for the post is available for one week here with MD5 sum 3c33c3c4c2610f650c779291668450c9 [2]. Anyone who wants the file is free to reach out to me directly (email is on site).

[1] https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType...

[2] https://file.io/nUNc7cG5i8gj


The file at [2] is already gone :(


Can you upload your 150M file somewhere? If I follow the link in your comment there are a bunch of small files; did you concatenate them?


Nim also has a very good JavaScript backend. You can generate both C and JavaScript code from a Nim program.

Last time I used it, I liked it but didn't use it long enough to have a strong opinion.


Python sometimes runs slowly, because it's not designed to run fast. It's designed to be readable and easy to write, which in turn makes developing python faster.

It's a compromise, but I always prioritise _my_ time over my computers time, so if I can write something quickly and just go and get a coffee while it runs - I will do that. I won't spend twice as long writing a single-run script just because it'll finish before the kettle has boiled.


That’s where Nim can shine. For simple scripts both Python and Nim are about as easy to write. But the Nim version usually runs a lot faster.

Static types help for basic data munging when you haven’t used a script for months to get up to speed and make tweaks.


Sadly, I think you’re spot-on about Nim’s future as the realization of the alternative timeline where Python didn’t make several stupid design choices (e.g. the GIL, Python 3).

It’s a shame because I think Nim has some neat features that allow it to present as a serious competitor to Rust but it will ultimately have to compete against Python instead to secure its niche.


Oh yes, Nim definitely feels like an alternate reality where Python 2 became static and dumped some poor design choices.

Well I believe there's room between Rust and Python where Nim can grow. It made the TIOBE top 50 lately even. Likely it can eat enough market share from the edges of both Rust & Python to become more well known (more libs, tools, etc).

Rust is fantastic but tedious to program (to me at least), and its community focuses on more formal type traits, etc., making "scripting" trickier. Python is great for a mix of quick scripts, web dev, and data science, but it's slow enough (and getting complex enough!) for many to want something faster and more stable yet still easy to write. Nim lives in between them and is more enjoyable to write than either for many. Also, Nim _could_ add Rust as a backend target and be relevant even if Rust displaces C/C++. ;)

Nim is also great for embedded systems! I've been using it a fair bit and it's really nice [1]. There's a lot of room to grow in that field.

1: https://github.com/elcritch/nesper


There are plenty of compiled languages that are also designed to be readable and easy to write.


You can have Python syntax without Python's slowness.


And functional and static with F#.


You can also drop in PyPy and get a significant speedup on "loopy" string processing tasks with no changes in your code.


>It's designed to be readable and easy to write

So is Golang.


Go is precisely what I switched to for my test/automation efforts.


For me the corollary of this post should be, try PyPy. You may get a 10x speedup for free.


For me the corollary of this comment should be, try reading the post. You may learn that the author did get a 10x speedup for free, but it was still 3.3x slower than Nim.


I did, I don't think the snarkiness was necessary.

My point, which apparently wasn't evident enough, is that you can get most of the benefits by doing nothing more than trying a different Python implementation, without the hassle of learning a niche language, as easy as it might be.

BTW, if you take into account compilation times the difference is even more meager, and in all fairness the PyPy warmup period should have been discounted.


Do you feel better about yourself by being snarky to people on the internet? How does it work? I'm curious


Missing : on the first line of code. When speed matters, people use libraries that are considerably faster than plain Python. It's these libraries that made Python so popular in data science. Giving them up maybe makes sense, but that means a whole lot of learning and development to replace already pretty well established tools.


> When speed matters, people use libraries that are considerably faster than plain Python.

This.

The general guideline has always been that Python is ideal for glue code and non-performance-critical code, and when performance becomes an issue the Python code is simply used as glue to invoke specialized libraries. Perhaps the most popular example of this approach is NumPy, which uses BLAS and LAPACK internally to handle linear algebra.
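For illustration, a rough sketch of what pushing this particular GC-counting loop into NumPy could look like (the filename is from the post; the rest is just a sketch of the approach, not the post's code):

    import numpy as np

    gc = 0
    total = 0
    with open("orthocoronavirinae.fasta", "rb") as f:
        for line in f:
            if line.startswith(b">"):  # skip FASTA header lines
                continue
            # View the line's bytes as a uint8 array so the comparisons and
            # the counting run inside NumPy's C loops, not the interpreter.
            arr = np.frombuffer(line.rstrip(b"\r\n"), dtype=np.uint8)
            gc += int(np.count_nonzero((arr == ord("G")) | (arr == ord("C"))))
            total += arr.size

    print(gc / total)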

This Nim advertisement sounds awfully desperate with the way it resorts to what feels like a poorly assembled strawman, while giving absolutely nothing in return.


I notice that the Python version calls `rstrip` while the Nim version doesn't. Python's `rstrip()` is allocating a whole new line there. Does the Nim iterator skip over whitespace or something? Do those bits of code output the same thing? The presence or absence of the whitespace will affect the `total` count.


This is explained a bit further into the article:

> A nice feature of the lines function is that it automatically strips newline characters such as LF and CRLF so we no longer need to do line.rstrip().


Oh, nice! Missed that. Thanks.


Nice writeup, glad the author has a language and environment that they like.

Python has never been one of my favorite languages, but easy support in Google Colab, AWS SageMaker, etc. as well as most of my professional deep learning work using TensorFLow + Keras, it makes Python a go-to language for me. If you want a Lisp syntax on top of Python, you can try Hy (and get a free copy of my Hy book at https://leanpub.com/hy-lisp-python by setting the price to $0.00).

That said, for unpaid experiments I like Julia + Flux, which also solves the author's preference to avoid slow programming languages. Julia is really a nice language but no one has ever paid me to use it.


I'm an experienced Python developer who also dabbles with some applications written in C++. In my experience Python is much faster when it comes to development speed but it's way more demanding when it comes to optimization.

When you write C++, you kind of cheat because even code with high computational complexity is pretty fast. Whereas the equivalent code in Python will be awfully slow.

So, while it's true that Python requires less development time, that statement doesn't hold in general. I have spent hours optimizing Python code when in C++ I would have just moved on to my next task.


How does it compare to Julia? Anyone with experience in both Nim and Julia?


I have tinkered with both. Basically if you need to write a FORTRAN application but don't want to use FORTRAN, Julia is the correct choice. If ease of use is more important than having a diverse collection of pre-existing numerical libraries to play with, I'd go with Nim.

If Nim had cloud SDKs I would use it as my default language for pretty much everything.


I would have thought Cython would be the closest analogue for comparison.


If anyone is interested to see how Nim fares against some other programming languages, here are some benchmarks: https://github.com/kostya/benchmarks


I'll probably always dislike Python. I do like Nim, but for strictly data processing I'd take R over either in a heartbeat. R has too many libraries, its semantics are perfect for data processing, RStudio is too nice and while pure R is slow as shit, in practice it's fast because it's basically a scripting language for a bunch of C and Fortran bits that are doing the real work.

If I were writing something from scratch that dealt with data, I would probably use Nim though. It's super easy to write something fast in and is more pleasant than pretty much any other compiled language.


related: Biofast benchmark ( Nim, Julia, Go, Pypy, C, Crystal, .. )

"Benchmarking programming languages/implementations for common tasks in Bioinformatics"

https://github.com/lh3/biofast#fqcnt

https://lh3.github.io/2020/05/17/fast-high-level-programming...

HN: https://news.ycombinator.com/item?id=23229657


I have seldom seen data engineers write raw Python loops the way you did in your examples. They usually use numpy, scipy, etc.


That can be a good approach where the computation maps cleanly onto some C or Fortran code prepared earlier and wrapped as a numpy or scipy primitive. But sometimes expressing what you want to do as a bunch of numpy operations becomes a lot harder to read, and perhaps also slower, than if you just directly wrote the raw loops over some arrays in C.

In cases where what you want to do doesn't exactly fit standard operations, Cython can be pretty nice, e.g. 200x-1000x speedups for translating C-oriented number-crunching code from Python to Cython. But if you do want performance, you have to think about it while writing the code (avoid needlessly allocating memory in tight loops, use data-oriented programming with simple arrays, statically type all of your variables, ...).


I don't doubt Nim; it looks like a great language. But that is just an awful Python implementation. I'd do it this way:

    lines = (line for line in lines("orthocoronavirinae.fasta") if not line.startswith(">"))
    gc_lines = (1 if ('G' in line or 'C' in line) else 0 for line in lines)
    gc = sum(gc_lines)

    total = len(list(gc_lines))

    # Alternatively, a more "memory efficient" total would be:
    total = sum(1 for _ in lines)
Edit: my code is not perfect (I’m typing from my phone, I’m surprised I could even match parentheses).

My point is: this is a highly I/O bound program. The implementation matters. With the correct implementation there shouldn’t be much difference between the languages.


>total = len(list(gc_lines))

That won't work properly; you've already exhausted the gc_lines generator in the previous line.


True, you'd need to re-create the generator expression. Still, the other implementation seemed too naive.


There's probably some itertools trick to do it with only one iteration.

EDIT: you can do it with functools.reduce and a generator of tuples:

    from functools import reduce

    with open("orthocoronavirinae.fasta") as f:
        lines = (line.strip() for line in f if not line.startswith(">"))
        sums = ((len(line), sum(1 for ch in line if ch in "CG")) for line in lines)
        total, gc = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), sums)
Besides, I really don't think that any of our solutions will be that much faster than the one in the OP. All of them use lazy iteration on the file object, just written differently. The differences amount to micro-optimizations. The real way to speed it up would be to use something like pandas, where the loading and summing call into fast C implementations.


> There's probably some itertools trick to do it with only one iteration.

Or an arithmetic trick. Since there's the `1 if ('G' in line or 'C' in line) else 0 for line in lines` construct in the code, if you simply replace 1 with (len(line)<<K)+1 and 0 with len(line)<<K (for a suitable value of K), you can extract the two results using div/mod.


That's not good Python; it's unreadable and less optimized than the original code. Your "1 if ... else 0" condition serves no purpose and you iterate over the same data twice for no reason. Use normal loops and save everyone who comes after you a lot of reading time and bugs.


Given that there is a fixed alphabet, I would rather use a Counter. It avoids for loops (the biggest source of inefficiency in Python) and is imho much more idiomatic.
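A minimal sketch of that approach, assuming the same filename as the post (the per-line loop stays in Python, but the per-character counting is pushed into Counter's update):

    from collections import Counter

    counts = Counter()
    with open("orthocoronavirinae.fasta") as f:
        for line in f:
            if not line.startswith(">"):  # skip FASTA header lines
                counts.update(line.strip())

    gc = counts["G"] + counts["C"]
    total = sum(counts.values())
    print(gc / total)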


Isn't that going to be even slower, since a 156 MB file would probably make gc_lines a very big tuple before summing it?


You could do gc = sum(1 for _ in lines if ...) or use a Counter.


gc_lines isn't a tuple here, it's a generator expression. It will be lazily evaluated.


Isn’t this counting the lines with any G/C in them vs the total number of G/C literals?


The (…) return an iterator, right?


Some context might be useful.

From what I gather, the author is a researcher in a bioinformatics-related field. This may indicate that they tend to work either alone or in a relatively small group. The domain is small-scope data processing/manipulation, research/exploratory code, likely short-lived or even one-off.

Progress in this context will likely be governed by sheer processing speed (e.g. it's unlikely anyone will delve deep into the code; lots of iterations to 'just get it done' instead of testing, etc.).

If this is more or less correct, the point that Nim might be more useful than Python for the author sounds very sensible to me. It’s a nice spot between command line tools and more functionality-loaded languages.


Author here. This is spot on. The majority of the code I write is either piping data around to existing tools using shell scripting and Snakemake or writing the data processing code myself when there isn't a tool that does what I need. Usually, I'm working alone or with a few other computational biologists. Many of my scripts are one-off but they have the distinct tendency of growing in complexity and scope if they are useful. That's one of the big advantages of Nim in my mind: you can write a quick and dirty script and have it be pretty fast, and then go back later and optimize it to within a few percent of C without having to rewrite your code in another language. In this sense, it's quite like Julia (another really good language).


What's up with all the shouting in the code?


It's because the `font-feature-settings` of the main font leak to the code font. The feature 'case' turns the code all uppercase.


Ha, I thought the author was intentionally writing the Python in all caps to highlight the similarity in syntax to the alternative language (not being familiar with Nim, I figured it must be conventional there). I was going to post a comment expressing my surprise that that was even legal.


Why not do both?

  cat test.py | py2many --nim=1 -
  http://dpaste.com//5ALVT7MK4


Does Nim have anything like scikit yet?


Closest I can find is https://nimble.directory/pkg/science


TLDR: Because Python is slow

Yes, that is the Achilles heel of Python.

I am always torn between Python and PHP for new projects because of this.

The Python syntax plus its import system are huge advantages over PHP. On the other hand, you suffer a 6x slowdown if you go with Python. Decisions, decisions. I so dearly wish I could have the good parts of both worlds.


Python is indeed slow, but why would anyone ever use PHP as a comparison point?

And for data processing of all things ...

PHP is also very slow, on top of being unpleasant and broken in many other ways.


I had a nightmare where I was writing an operating system in PHP, which ran in a hypervisor written in PHP, which ran in a virtual machine written in PHP, which ran in a browser written in PHP, which ran in an operating system written in PHP...


PHP has a proper JIT compiler on its reference implementation.


> PHP has a proper JIT compiler on its reference implementation.

That doesn't make it fast at all. Just faster than if it wasn't jitted.

And that certainly does not eradicate the vast ocean of other problems PHP has, the first of which being that it was never, ever "thought out" and instead grew like a cancerous mushroom.


> that is the Achilles heel of Python

I think that is pip.


Yes, Python is painfully slow, but it shouldn't matter. If you are using Python where performance, speed, correctness, testability, or maintainability matter, you are doing it wrong.

Python is good where speed of development matters, where you write throw-away code testing some ideas and you want to do it fast, where you write glue code, for prototypes, for small code bases.

Once you are getting outside of that area, you better should use a language more suited for the task.

As for myself, even if I could use Python in some cases, I can churn out C# code almost as fast, so I prefer doing it that way in case I want to grow the code later or use it somewhere else. Being lazy, I dislike rewriting code.



