
Are people actually doing science with Python or are they talking about doing science?

There's so much buggy, low-quality stuff in that space that I'd write a serious application in C or C++ from scratch.

It would be a custom application, sure, but not everything needs to be general.

Also, I find Lisp much more natural for mathematical reasoning.


Yeah, they're very much doing it.

Pandas is huge, and libraries like spaCy, NetworkX, etc. exist. It's a massive and good ecosystem. I'd hazard a guess that Python is the go-to for scientific computing in most of the sciences for newer students, over the older R and over Julia.

This will be blindingly obvious if you work in that area. Yes, you can do it in another language, but you're missing out on a lot of stuff that is already done, is state of the art, and is fast, because the speedy parts aren't in Python. The complaints about parens in Lisp are superficial, and in my experience the same goes for whitespace in Python. They just don't matter.


> and is fast because the speedy parts aren't in Python.

Having worked for months with a slew of senior data scientists, this was a bit painful. Python is so slow, and those data scientists were very good at coming up with solutions to the company's problems, but the implementations (using spaCy, Pandas, and other libs) had enough Python in them to make them impractical for the company use case. Nice prototypes, which I then had to fix or even rewrite in C/C++ (we tried Rust as well) to make them usable in the company data pipeline.

I think companies are burning millions (billions in total?) on depressingly slow solutions in this space, throwing massive compute at them just so they complete their computations before the sun dies out.

Example: we needed a specific keyword extraction algorithm for multiple languages; my colleague used spaCy and Python to create it. It took a couple of seconds per page of text; we needed at most a few ms on modern hardware. He spent quite a lot of time rewriting and changing it, but never got it under 1 s per page on xlarge AWS instances. My version takes a few ms on average, executing the same algorithm but in optimised C/C++.

Sure we could've spun up a lot more instances, but my rewrite was far cheaper than that, even in the first month.


(I'm the creator of spaCy)

If you want to email me at matt@explosion.ai, I'd be interested in the specifics of the algorithm and why the implementation was slow.

For something like that keyword extraction algorithm, the idea would be: if the Python API is slow, you should just use Cython. The Cython API of spaCy is really fast because the `Doc` is just a `TokenC*`, and the tokens just hold a pointer to their lexeme struct, which has the various attributes encoded as integers.

I've never really done a good job of teaching people to use the Cython API, though. I completely agree that it's not productive to have slow solutions, and using too many libraries can be a problem. The issue is that Python loops are just too slow; you need to be able to write a loop in C/Cython/etc. Thinking through data structures is also very important.

I get very frustrated that there's this emphasis on parallelism to "solve" speed for Python. Very often the inputs and outputs of the function calls are large enough that you cannot possibly outrace the transfer, pickle and function call overheads, so the more workers you add, the slower it is. Meanwhile if you just write it properly it's 200x faster to start with, and there's no problem.
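To put a rough, hypothetical number on the loop-overhead point (a sketch of my own, not from the thread, assuming NumPy is available): the same reduction written as a Python-level loop and as one vectorized call gives identical answers, but the loop pays interpreter overhead on every iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100_000)

def mean_square_loop(xs):
    # Python-level loop: every iteration pays interpreter and boxing overhead
    total = 0.0
    for x in xs:
        total += x * x
    return total / len(xs)

def mean_square_vec(xs):
    # One call into compiled code; the loop never touches the interpreter
    return float(np.mean(xs * xs))

# Same result either way; the vectorized version is typically orders of
# magnitude faster, with no multiprocessing (and its pickle/transfer
# overhead) needed at all.
assert abs(mean_square_loop(data) - mean_square_vec(data)) < 1e-7
```

Timing the two (e.g. with `timeit`) on a given machine is what makes the "just write it properly first" argument concrete.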


Sorry if you think I blamed spaCy for anything; that was not intended. I know it was due to the way Python was used, which I tried to convey. Your product is excellent, and yes, I probably should've reached out anyway; I just know how to solve things my way and did not want to waste more time (there was an investor deadline).

The Cython part sounds good; I will try it out and email you if I get totally stuck. Thanks!


Oh, I didn't take it as a pointed criticism or anything. It just seemed like it would be an instructive example.

The underlying point I often make to people is that Python's slowness introduces a lot of incidental complexity, and you find yourself fiddling with numpy or something instead of just writing normal code and expecting it to perform normally.


But that's fine, no? I mean, it's a pretty common workflow: the people close to the science part of something write a prototype in their language/ecosystem of choice, and then the engineering side is in charge of taking the prototype implementation and making it performant enough for production use. Finding people who know both data science and low-level programming languages well enough to implement data science applications directly for production is pretty hard, I'm sure.

In either case, I much prefer prototypes in Python to, say, Matlab. To speed things up, I once rewrote an internal SciPy function into a version that allowed me to use it in vectorized code on my end. If the prototype is in Matlab, the optimization and integration possibilities are much more limited due to licensing, toolboxes, and the closed ecosystem in general.


Also, I think it is good to be able to use the Python code as test cases/validation on smaller datasets for the C code.


Yes, I guess it is fine if that is the flow. I just didn't expect it upfront (my bad).


Yeah, if it's actually OK or not depends a lot on the particulars. Like, if it's not actually your job, and the data people were supposed to produce production ready stuff themselves, and then you have to go out of your way to actually make it work, then it's not OK. But that's more to do with how organizations function, not technical merits of the involved programming languages.


Python performance is something I see about 15-20x as often in discussions about Python as I do when wrestling with real-life problems.


Agreed. The only time in practice (working on datasets consisting of millions of rows) that Python has been too slow for me was when I was taking courses in college and their online code runner timed out on some specific graph problems; I rewrote the algo in C# without any fuss.


All analysis we run covers hundreds of millions of objects (each object can be a few megabytes large) every time it runs. The slightest increase or decrease for one object obviously makes a huge difference overall, in cost, time, or both.


Absolutely, but I don't feel the vast majority of corporations are doing that type of computation. I feel that even with hundreds of millions of rows Python can be a great solution (I have done multiple projects generating fairly complex projections from a few hundred million rows) for most projects.


> Nice prototypes, which I then had to fix or even rewrite in C/C++

Even as someone who 'knows' C and C++, I still find it faster and easier overall to do the exploratory and 'science' part in Python, making sure it works and gives me the answers I need in the format I want, etc., only then rewriting and optimizing the slow parts in C or C++ if necessary.


You could probably use something like Chez Scheme for that, especially the new Racket fork of it with unboxed flonums and flonum vectors, except with significantly decreased need to rewrite stuff in C for performance. Also with proper threads to boot.


You're missing the point. What makes Python 'fast' is that it comes with 'out of the box' support for everything I might need. Reading and writing obscure file formats. Every sparse matrix, image processing, and graph algorithm I might need has already been implemented. Do I all of a sudden need to solve an optimization problem or a differential equation? It's already nicely integrated with the libraries I'm using. If I need some obscure domain-specific algorithm, there is almost certainly a library for that already. I don't have to worry about any of that stuff and can focus on solving my problem.


There are quite a few C/C++ libraries for those things, right? Integrating them with a proper FFI (of the kind that Chez has, for example, or LuaJIT, for that matter) seems hardly difficult... although differential equations or optimization problems may be better served without any language interfaces whatsoever, since they involve functional parameterization. That generally sucks with mixed-language solutions, to the extent that GSLShell, the LuaJIT interface to GSL, completely skipped the DE solvers in GSL and reimplemented them in Lua for higher performance. I imagine that 'fast' in the case of Python really needs to be quoted the way you did.
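For what it's worth, the same no-wrapper FFI pattern exists in Python's own stdlib via ctypes. A minimal sketch (assuming a Unix-like system where the C math library, or the C runtime containing it, can be located):

```python
import ctypes
import ctypes.util
import math

# Locate and load the C math library (libm); on some platforms the math
# functions live in the C runtime itself, so fall back to libc.
libm_path = ctypes.util.find_library("m") or ctypes.util.find_library("c")
libm = ctypes.CDLL(libm_path)

# Declare the C signature: double cos(double)
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

# The C function and Python's own math.cos agree
assert abs(libm.cos(0.5) - math.cos(0.5)) < 1e-12
```

Of course, per-call overhead through ctypes is high, which is exactly why the useful pattern is one call that does a lot of work, not a C call inside a Python loop.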


"seems hardly difficult" and "could probably use" is seldom faster than "out of the box" and "already nicely integrated"


Maybe, but the latter definitely isn't the case with Python, especially in case of PyPy (or at least it wasn't the case the last time I tried that), so there's that.


Have you considered how much time the "senior data scientists" saved, and how much better algorithmic solution they were able to develop by being able to iteratively explore the problem space and refine the algorithm in a convenient-to-them environment?

This development process allowed creation of a good solution that you were then able to quickly port to a performant production platform.


I've worked in this industry for a bit and had the opposite experience. Most of the slow solutions have been in other languages due to poor design and algorithm choice. There have been several projects I've been able to rewrite from R to Python where the runtime went from days with R on a workstation to seconds or minutes with Python in a tiny tiny VM (like a $10 DO box).

Sure maybe a 3 minute task in python that reconciles a few million transactions and builds some very useful projections is too slow for some pipelines, but it worked for my clients.


So the R code was pretty bad and your solution was more optimal - that's the only bit of information in here. The point is that rewriting something means you have extra domain knowledge, bottleneck knowledge, etc. that you didn't have during the initial write. It might have gone the other way too: initial write in Python is too slow, rewrite in R is faster.


Oh yeah, it definitely was - my point wasn't that R is slower than Python. My point is the only time I've encountered "slow apps" in practice was when they were written in a much faster language, and rewriting them in a slower language resulted in really good performance.


The vast majority of the time that I've seen slow Python code, it's because people are not leveraging the libraries in the right way, e.g. by using groupby-apply in Pandas or not vectorising while using NumPy.
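A tiny, hypothetical illustration of the non-vectorising failure mode (my own sketch, assuming NumPy is available): an element-wise conditional written as a Python loop versus a single `np.where` call.

```python
import numpy as np

xs = np.linspace(-5.0, 5.0, 10_001)

def relu_loop(arr):
    # Per-element Python branching: correct, but easy to write by accident
    # and paid for one interpreted iteration at a time
    out = np.empty_like(arr)
    for i, x in enumerate(arr):
        out[i] = x if x > 0 else 0.0
    return out

def relu_vec(arr):
    # The same conditional expressed once, evaluated over the whole array
    # in compiled code
    return np.where(arr > 0, arr, 0.0)

assert np.allclose(relu_loop(xs), relu_vec(xs))
```

The groupby-apply problem in Pandas is the same shape: `apply` drops back into a Python-level call per group, where a built-in vectorised aggregation would stay in compiled code.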

I can't speak as to the specific use cases that you've encountered, but performance wise, I have found Python to be a fine choice for several ML services.


> Are people actually doing science with Python or are they talking about doing science?

People are actually doing it. And a lot of it too. Both in terms of data science (as a broad term that can mean a bunch of different things) and in terms of computation for specific scientific fields like physics or biology.


Very much actively doing so on my end where nearly all work in the industry is in Python, with some Matlab, C, C++, and Julia sprinkled in.

Python is a great high-level language for basically everything but hardcore low-latency apps. I can parse text, connect to databases, do sparse matrix computations on massive matrices, calculate network flows, generate large node-graph diagrams, use a Python-based API to connect to any vendor software I've seen, and do any kind of statistical analysis I need with Pandas; an amazing, free IDE gives me a REPL, code editor, and data structure viewer with ease; Python notebooks work for education... etc., etc. I've frequently found that I can rewrite a vendor's 10k-line C++ program in a few pages of Python, as the built-in Python data structures make text parsing extremely flexible and simple.
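A toy sketch of the text-parsing point using only built-in data structures (the input, stopword list, and function name are all hypothetical, not from any actual vendor program); incidentally, this is the naive shape of a frequency-based keyword pass like the one discussed upthread:

```python
from collections import Counter

# Tiny illustrative stopword list; a real one would be per-language
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}

def top_keywords(text, n=3):
    # Lowercase, strip surrounding punctuation, drop stopwords, count the rest
    words = (w.strip(".,;:!?").lower() for w in text.split())
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

doc = "The cat sat on the mat. The cat is a cat, and the mat is a mat."
print(top_keywords(doc, 2))  # 'cat' and 'mat' dominate the counts
```

A generator expression, a set, and a `Counter` replace what would be a fair amount of boilerplate in C++.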


> do sparse matrix computations on massive matrices

This is completely impossible to do in the Python language itself, unless you resort to external tooling written in C or Fortran. Sure, you can call that code from Python, as you can call it from any other language.


Nobody cares about what you're saying though (with respect) in this area. It's all about the ecosystem or Anaconda distribution itself rather than just the core language. I agree that what you're saying is accurate, but it also happens to be irrelevant in this particular case.

Numerical methods and data science are mostly done by engineers, mathematicians, and other random STEM folks. I've yet to meet someone who is even cognizant that NumPy is really calling out to some low-level C, C++, or Fortran library. They just know that you call it like any other library and the code works.

If you're trying to say that any language with FFI capabilities can do that, you'd be right, but it also doesn't matter much. Python has somehow found a sweet spot where it's easy to learn and onboard people and there is support for a lot of stuff with relatively low hassle. It certainly isn't lisp, but somehow seems to be orders of magnitude more successful.

I've been searching for a tool/language/ecosystem to replace Python for ages, but nothing ends up becoming close. I spent a significant amount of time learning lisp, but a lot of what I saw (besides the power of macros and restarting) was just a less intuitive way of doing things I could easily do in Python, Ruby, or Perl. Lisp is secret alien technology if you're coming from C or C++, but coming from Python it seems closer to a wash.


> I've yet to meet someone who is even cognizant that Numpy is really calling out to some low level C

then you've never met anybody who builds the tools that you use. Which is alright. But if you disparage their point of view, you sound a bit funny.


In this particular niche, yes. But again, that is almost entirely irrelevant to the vast majority of the millions of scientific Python users.

I'd love to write my own solution in assembly or C where I give birth to every function, but nobody has time for that level of monumental effort. Low level matrix libraries have a lot of inertia for a reason.

I'm not disparaging anybody's point of view. Yours is certainly valid for a small group of elite users. I'm just trying to point out that it is only a valid point for a very small group. Most simply view these things from the perspective of the entire ecosystem. Even scientists well aware of the C internals will not always use that knowledge.


The ecosystem matters. I'm a developer and not a scientist, but having everything inside an environment that's at least workable is a huge boon.

Of course you could call the same functions from the ffi of any other language, but nobody does that for the same reason that nobody writes web applications in C.

I hate python, as far as I'm concerned it's a nightmare hell of a language that does everything wrong, and yet it's probably the language I use the most due to its sheer convenience and massive ecosystem.


This is really splitting hairs isn't it? Plenty of languages are not bootstrapped. Isn't that essentially the same thing?


No, nothing to do with bootstrapping; this is completely different. My point is that you cannot develop the very algorithms that you are using. Numerical math is not only about using ready-made algorithms; it is mostly about implementing new ones. For example, if you invent a new matrix factorization algorithm, it is very likely that you cannot implement it in Python (or if you can, it will be either very slow or very cumbersome). Python+numpy is not a natural way to write many numerical algorithms, which are built on explicit loops with conditions inside them. Whereas in Fortran or C, the implementation is likely to be much simpler, more natural, and fast.
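To make that concrete, a hedged illustration (a textbook algorithm of my choosing, not the parent's): a Cholesky factorization is naturally written with explicit loops and an in-loop conditional. That maps directly to C or Fortran, but in pure Python it runs at interpreter speed, and it fits NumPy's whole-array style poorly.

```python
import math

def cholesky(a):
    # Factor a symmetric positive-definite matrix A into L with A = L L^T.
    # The nested loops are natural in C/Fortran; in pure Python every
    # iteration pays full interpreter overhead.
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

A = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky(A)  # L = [[2, 0], [1, sqrt(2)]]
```

The code is clear, but there is no single NumPy call that expresses this dependency structure, which is exactly the parent's point about inventing new factorizations.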


Nobody is arguing that, but they're saying it doesn't matter to the majority of scientists who just want to invert a matrix for some study and don't need to implement a new matrix inversion algorithm. I would use C or C++ for that most likely. That is a valid use case for some scientists, but I would expect it to be a very small number compared to those that just need to use the existing tools in the ecosystem.

I think we may be speaking past each other a bit.


Every single bit of Python relies on code written in C. What is it that you are trying to refute here?


> Every single bit of Python relies on code written in C.

There's PyPy, a JIT Python interpreter written entirely in Python, and it does not depend on C. It is also much faster than the common interpreter, CPython. Unfortunately, it is still not appropriate for numerical computation, as the language itself makes working directly with numbers very cumbersome (and this was the point I wanted to make).


Yes. Python is pretty much the main tool in biology, for example. C or C++ would be abysmal for similar exploratory scientific tasks.

