There is a ton of conflation in this post, yesterday's post, and the comments.
* Comparing parallel implementations (Julia) to single-threaded ones (the Python csv module). This is still quite relevant because in places like Python it is impossible to parallelize the csv module without essentially updating its native implementation. That's not true elsewhere, where some simple wrapper code may be all that's required to get that "10x faster" number.
* Comparing concrete object decoding (Python csv module) to array decoding (PyArrow). Producing concrete objects representing every row (e.g. tuples of strings) will always be slower because of the pressure it puts on the heap and memory. Storing 10 million rows with 10 cols of 8-byte floats in an array might only generate 800 MB of memory bandwidth, while the equivalent concrete list-of-floats PyObjects would come out at 6160 MB of bandwidth plus a ton of CPU burned in the allocator (a back-of-the-envelope sketch follows this list). There are use cases where either representation is preferred, and if you use e.g. PyArrow's parser and then simply iterate the result as concrete objects, in the worst case it will be slower than decoding directly to PyObjects to begin with.
* Lumping floating point decoding (a problem with large performance-correctness tradeoffs) in with CSV parsing. It's hard to decode floats both quickly and precisely, and it's impractical, in the context of a comparison of CSV parsers, to describe which language/implementation decodes floats better and why that is better.
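To put some rough numbers on the second point, here is a back-of-the-envelope sketch (pure Python/NumPy; the exact sizes vary by CPython version and platform, so treat the output as an order-of-magnitude illustration):

    import sys
    import numpy as np

    rows, cols = 10_000_000, 10

    # Columnar storage: one contiguous buffer of 8-byte floats.
    array_mb = rows * cols * np.dtype(np.float64).itemsize / 1e6
    print(array_mb)  # 800.0

    # Boxed storage: every value is a heap-allocated PyFloat reached
    # through a pointer slot in a per-row tuple.
    per_float = sys.getsizeof(1.0)           # ~24 bytes per boxed float on 64-bit CPython
    per_row = sys.getsizeof((None,) * cols)  # tuple header plus one pointer per column
    object_mb = rows * (per_row + cols * per_float) / 1e6
    print(object_mb)  # several thousand MB, before counting allocator churn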
You may be missing the point of the comparison, which is this: to compare the performance of the de facto standard CSV parsers of each system when used the way that a typical user would use them, i.e. to parse a CSV file into a data frame.
Regarding your specific points:
1. Yes, it's possible to rig up some way of parsing in parallel in Python, like splitting the CSV file into multiple files, parsing them in separate processes, and then joining the results (see the first sketch after this list). But no matter how you do it, you have to jump through extra hoops to do it. And that is not something most users do.
2. I don't entirely follow this point. Perhaps using PyArrow's parser would be faster than what is timed here (see the second sketch after this list), but is that what the typical Python data science user would do? Most Python users use Pandas and the CSV parsing that it provides. If that changes, then it would be good to do a new benchmark.
3. Yes, floating-point decoding is hard, but Julia's is fully accurate, so if it's also faster than Python's how is it an unfair comparison? If Julia was cheating by doing imprecise float parsing, sure, that would be a valid complaint, but it's not.
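Concretely, the hoop-jumping in point 1 looks something like this sketch (it assumes the big file has already been split into hypothetical part-*.csv pieces, each with its own header line, and the parsed frames still have to be pickled back to the parent process, which eats into the win):

    import glob
    import multiprocessing as mp
    import pandas as pd

    paths = sorted(glob.glob("parts/part-*.csv"))  # hypothetical pre-split chunks

    if __name__ == "__main__":
        with mp.Pool() as pool:
            frames = pool.map(pd.read_csv, paths)  # one worker process per chunk
        df = pd.concat(frames, ignore_index=True)  # join the pieces back together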
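And for point 2, the PyArrow route would look roughly like this (hypothetical filename; pyarrow.csv.read_csv uses multiple threads by default):

    import pyarrow.csv as pv

    table = pv.read_csv("data.csv")  # multithreaded native parser
    df = table.to_pandas()           # hand the result to pandas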
These "conflations" sound suspiciously like excuses. If float parsing is slow, make it faster. If concrete object decoding is slow, stop doing that. If threading would be faster, bite the bullet and use threading. If the language doesn't provide good support for these things, then maybe the language is at fault.
First import took >10s just to precompile, and then CSV.File took 5.21s to read the file. A fresh pandas install takes ~10s to import and then reads this file in 81.2 ms ± 1.25 ms. CSV.jl takes 38 ms on a second run, so it is twice as fast, but especially for someone just starting to use Julia, these >100x slowdowns happen constantly, particularly when using libraries that rely heavily on multiple dispatch specialisation for performance.
I tried out the dataset you mention. With the latest Julia 1.5.2 and CSV.jl 0.7.7, I load the CSV package in 1s and read the file in 5s the first time, then 0.03s the second time. The precompile time you mention happens only the very first time you download a new version of a package; the result is then cached for subsequent sessions. Your generalization from a one-time precompilation of the CSV.jl package to >100x slowdowns happening constantly is simply not true.
julia> @time using CSV
1.161851 seconds (2.05 M allocations: 131.757 MiB, 0.46% gc time)
So I want to plot the data I just loaded. Since writing that comment I had closed the Julia window (not to make a point, btw).
julia> import CSV (takes ~1-2 seconds)
julia> data = CSV.File("/User/orbifold/Downloads/FL_insurance_sample.csv") (takes seconds again...)
I want to explore the data. I vaguely remember there is Gadfly to do that. For some reason it depends on FFTW and a whole bunch of other packages, but the dependency story in Python is equally insane, so whatever. As I'm typing this, I'm waiting for Gadfly to precompile.
julia> @time import Gadfly (104.460307 seconds)
python> import matplotlib.pyplot as plt (~10s if you are unlucky and it has to generate the font cache)
Maybe it's a good idea to do a scatter plot of two of the fields? Let's find out
but that number is not accurate; in fact it took so long for the browser to open the result that I had time to check the documentation on backends and see whether I was missing something.
python> %time plt.scatter(data['eq_site_limit'], data['hu_site_limit']) 41.9 ms, a bit more for the window to open
Turns out that is not a useful plot... maybe a histogram?
I've had similar problems and so have many of my colleagues in silicon valley. No matter how you spin it and whatever benchmarks you show - Julia is a very slow fast language. :)
The UX of Julia needs major work - everything is just slow. If you've used Python for 10 years, Julia feels like running through molasses. Benchmarks are meaningless for the most part. This is what bothers me about Julia's marketing - it claims to be fast, but in reality precompilation time alone is a no-go unless you're doing high-compute, large-batch processing. Julia cannot become a general purpose language unless you fix these issues.
> These "conflations" sound suspiciously like excuses.
I think you're the one conflating things: what matters is the total end-to-end time between being given a task and going home early to see my kids - which is what I'd get if I used Python.
Nobody is spinning anything here. Julia's compilation time is a known quantity. See my comment in https://news.ycombinator.com/item?id=24750559 about the different phases of Julia's compilation and execution (from a user's perspective).
Obviously, if you are working on a problem that doesn't need Julia's speed (and doesn't make it worth paying the compilation cost), and you are more comfortable with a different tool, you should use that. You may in fact be better off using Python, or shell scripting, or even Excel.
Precompilation time in Julia is akin to `make` in a project with C code. You only compile the library once and use it repeatedly, until you update it. Is it a dealbreaker? Perhaps it is for you (assuming you are using precompilation in the right context). It is not for many who work with Julia day in and day out, and it is something that we continue to improve.
Thanks for the insight. I still think that for most engineering development, it's the one-off launch time that predominates. This is exactly why I don't use C++ for general purpose use.
There is something to be said about how easy Python makes development look - despite the package management warts.
Btw, is there a reason why packages can't be precompiled and even distributed? I am sure you guys have thought about that.
IIUC, in principle it should be possible to do much better if you knew for sure at compile time which methods you would need to dispatch to for the data you ultimately want to run on.
Yes, you can use PackageCompiler for your collection of packages. We are also exploring doing this in the background when you add more packages.
The reason why we can't distribute precompiled .ji files at a package level is because of the way the package resolver works. It depends on the exact versions of dependencies of every package - and even for the same version of an end-user package, there can be slightly different versions of the dependencies installed depending on constraints imposed by other packages.
One major improvement coming in 1.6 is multi-threaded precompilation, and it will leverage all the cores you have.
While I hope they'll keep up the pace of improving the JIT lag (perhaps even adding a fast optional interpreter or AoT compilation of parts of the code), the benchmarks are fine considering compilation is a one-time ticket while runtime is paid per use. It's not just high compute: if I set up a website in Julia in a Docker container, then after the "cold start" (which can be handled ahead of time with PackageCompiler if I don't want any client to feel the slow JIT compilation, as I obviously should) it should run at the benchmarked runtime speed even if each request is very short, since the compilation price was already paid. That applies to most services in general, and even during development Julia programmers eventually learn not to cold start all the time (keeping the session open until you stop programming, so you again pay the cost only once).
Of course, any short task that must cold start, like a bash script, isn't convenient at all (PackageCompiler will generate a fat binary that runs fast, but it's still inconveniently fat for something simple). And Julia's first impression is really important, and the "time to first plot" seems to affect it the most (as you need to go deeper into the language before adapting to its unusual workflow), so I agree that it's a UX priority.
1. Precompilation time: Only once when you install or upgrade a package.
2. First running time (that includes compilation time): When you call a function like `plot` the first time in a session
3. Second and subsequent running times: The compiled code is now in the cache, and what you see is the actual running time.
Thus what you are reporting here as precompilation (1) is not something that a user will see every time. What they will see is (2) for the first time in a session and (3) on subsequent plots. Could you confirm this is what you see - to ensure that those following along understand the terminology?
Here are the usage times I see when I use Plots on a daily basis in a Julia session:
julia> @time using Plots
11.341632 seconds (20.21 M allocations: 1.108 GiB, 2.87% gc time)
julia> @time plot(1:10,1:10)
2.291147 seconds (3.24 M allocations: 165.569 MiB, 7.80% gc time)
(Another 2 seconds for the window to be drawn the first time on mac)
julia> @time plot(1:10,1:10)
0.000763 seconds (2.80 k allocations: 166.023 KiB)
> 2. I don't entirely follow this point. Perhaps using PyArrow's parser would be faster than what is timed here, but is that what the typical Python data science user would do?
I am a Python data science user. If data gets big enough such that loading time is a bottleneck, I use parquet files instead of CSV, and PyArrow to load them into pandas. It’s a one line change. The creator of Pandas is now leading the Arrow project. It’s very seamless. Don’t know if I’m typical but that’s me.
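For readers unfamiliar with that workflow, the switch really is about one line (filenames are hypothetical):

    import pandas as pd

    # One-time conversion from CSV to Parquet:
    pd.read_csv("data.csv").to_parquet("data.parquet")

    # Afterwards, the only change to the loading code:
    df = pd.read_parquet("data.parquet")  # PyArrow engine, much faster than read_csv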
Thanks Viral. To be clear, I'm a Python user who's cheering for Julia, because I live with the problems of Python and do see the potential of Julia as a better path. But unfortunately I'm not prepared to be the early adopter (at least in my day job), and will wait until other, braver users have sanded off the rough edges. Godspeed and good luck.
That's a completely reasonable viewpoint. Many users of Julia and contributors start out experimenting with it and then end up bringing it into their work when they feel comfortable with it. I hope you will have the same experience one day.
Regarding 2), code like "for row in pandas.DataFrame(...).iterrows():" where the DataFrame was populated by some array-based reader is creating an indirection. Most good pandas code of course doesn't look like this, but it's regularly unavoidable, hence "There are use cases where either representation is preferred". Without specifying our use case, it is meaningless to talk about "faster" or "slower".
For 3) there inherently is no optimal solution to "a problem with large performance-correctness tradeoffs". Additionally, users do not care only about speed: for a huge number of applications, absolute reproduction of precision may be the difference between a good simulation and data science quackery. We saw elsewhere in the thread that Julia's float parsing code at present handles certain syntaxes incorrectly. Ignoring performance, this is a criterion for deciding whether a solution is fit for purpose at all.
The overall point otherwise makes sense: most DS folk don't and maybe shouldn't care about this stuff, however this is HN, and elsewhere in the comments there are plenty of examples of people who do.
> there inherently is no optimal solution to "a problem with large performance-correctness tradeoffs"
You seem to be under the misapprehension that mainstream CSV libraries parse float values with reduced precision in order to go faster. They do not. A decimal floating-point string represents a precise mathematical value and the only acceptable value for a CSV parser to produce is that value correctly rounded to a Float64 (or some other type if requested). Any CSV parser that does anything else is simply incorrect.
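To make "correctly rounded" concrete, here is what that contract means in Python terms (the same contract any conforming CSV parser must honor):

    from decimal import Decimal

    x = float("0.1")   # the unique binary64 value nearest to the decimal string
    print(Decimal(x))  # 0.1000000000000000055511151231257827021181583404541015625
    # A parser that returns any other bit pattern for "0.1" is simply incorrect.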
> We saw elsewhere in the thread that Julia's float parsing code at least presently handles certain syntaxes incorrectly.
You may be referring to a bug mentioned elsewhere that fields like `-123.` with a leading minus sign and a trailing decimal point are misinterpreted as strings. That is a bug, plain and simple. It has nothing to do with float parsing speed/accuracy tradeoffs.
A typical user will rarely ever use the built in CSV library in Python, and would choose among the many third party options for much greater performance, ease of use, parity with rectangular data structure tools like pandas, and so forth.
Comparing anything against the built-in CSV reader in Python is pretty much an exercise in nothing.
In 20 years of programming Python professionally, I have used the built in CSV parser about 5 total times, and all were use cases where 10-20x improvement of CSV parser performance did not matter whatsoever to the application.
I disagree completely. It’s perfectly good enough for non-specialized use cases, which is exactly right for a stdlib offering. For anything else, factor it out as separate third party options (even if maintained by PSF itself) so users only install what they need for their special application.
One person may install something that’s blazingly fast. Someone else may install something because it can convert the data to an in memory parquet format, someone else may use something that can virtualize SQL queries over top of disk-backed csvs.
They should all be different for the nuance of each application.
Interestingly, CSV.jl can do all of those. This seems like a classic case of lack of composability in Python: you need a different CSV parser for each and every application even though the core logic of parsing the CSV format is the same. This is because anything fast has to be implemented in very concrete, specialized C code. In Julia, you can write something generic and let the compiler specialize it for you, so you get many different ways of applying the same thing for free.
Lack of composability? What? Of course you can use one parser for everything - what's stopping you from doing that? Separating IO from business logic and composing the two is a pretty standard practice. Not sure I follow the argument here.
The argument is simple: Python is by design made so that if you want things to go fast then you need to implement it in C. This creates a constant need to use specialized libraries (such as pandas), where in other languages they can simply do a better job in optimizing the standard library.
This is untrue, you can compartmentalize your C implementations using tools like Cython.
The fact that you write these as isolated special implementations is a good thing and not a deficiency of generics.
This way it can be gradual. You only target certain functions or modules for optimized C implementation, you don’t waste static typing overhead or compliance to a generic interface on the 99% of the code that will have no material gain from any optimization.
It’s similar with gradual typing in Python as well. Annotate what you need, omit annotations for what you don’t. Implement an extension module or multiple dispatch for what you need, use plain Python for everything else.
> “ In Julia, you can write something generic and let the compiler specialize it for you, so you get many different ways of applying the same thing for free.”
Python also lets you easily achieve this in many ways - notably fused typing in Cython.
Just adding types doesn't give you that much extra performance in Cython (2-3x at most in most cases, in my experience); if you want a 10x+ speedup you generally have to rewrite your code in a more 'cythonic' way.
I've not had the same experience. Simply adding types and avoiding CPython object types (which results in simpler "basic C" loops and so on) gets 100x speedups in most situations, and Cython has an annotation mode that highlights CPython bottlenecks to make these very easy to spot interactively.
With numba it can be even more dramatic. Zero code changes, solely type annotation in the decorator, and you can often get 1000x speedup over pure Python, because numba can optimize away many of the extra CPython data model features that tight loop code may occasionally not need to use.
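As a sketch of what "zero code changes, solely type annotation in the decorator" means in numba (the actual speedup factor obviously depends on the workload):

    import numpy as np
    from numba import njit

    @njit("float64(float64[:])")   # eager compilation for 1-D float64 arrays
    def total(xs):
        s = 0.0
        for x in xs:               # compiled to a native loop, no boxed floats
            s += x
        return s

    print(total(np.random.rand(10_000_000)))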
Do you have any code you can share showing this speedup? Because I have written a lot of Cython over the past 5 years, and unless my Python code is monumentally stupid (in which case I can get an order of magnitude or two speedup by just writing sane Python), I have never seen anything close to a 100x speedup in real world code. Even getting a 5x speedup over a baseline python/numpy implementation requires a fair amount of refactoring to make the code more Cython friendly.
Same with Numba, despite its promises, I've never seen anything close to 100x speedups in the real world, and I've even seen a fair few cases where numba resulted in slower code compared to 'pure' python/numpy (although admittedly that was 2-3 years ago)
Sure I could probably design some pathological code where you might see 100x speed up by just adding type annotations, but that would probably require starting with some really inefficient python code.
In this example they give a very clear, normal case (not pathological, not concocted) and show you get between a 100x and 200x speedup with extremely simple Cython, and the same with Numba.
That sounds like a poorly designed library to me. Instead of doing one thing well, it forces users to consume something meant as a jack of all trades, which may not meet their use case.
I wish such a thing wasn’t forced in a standard package - rather split into different ones. Then I can easily compose them according to my needs, instead of having the stdlib force feed me some set of tradeoffs I might not want.
I'm sure these extra third party libraries exist or will exist in Julia just as they do in Python, and users will have lots of good options - but what a shame for people to take the needlessly myopic view that core libraries must be obsessed with either performance or breadth of use-case coverage coupled into one library, which seems to be purely for macho bragging rights or something.
That you believe this violates the principle of "do one thing and do it well" indicates a certain "abstraction blindness". Do you really believe that "parse CSV to a data frame" and "parse CSV to a parquet file" and "parse CSV and apply a query to it in-place" are completely separate, unrelated things? It seems clear to me in any case that with the right abstraction, those are the same thing composed with different choices of what to do with the parsed data.
At a higher level, this seems like a classic Julia versus Python dynamic. In the Python ecosystem there are a dozen different versions of each thing that make different trade-offs and specialize in different ways. When you're picking one, you have to spend a week evaluating all the possible CSV parsers (for example).
In the Julia ecosystem, the ideal is to have one really high-quality, fast, flexible implementation that everyone collaborates on to make great for all kinds of use cases. And the language design and the compiler make this possible, since you can write pretty abstract code, compose it with various types, and get a really fast implementation. The annoying and tricky logic of parsing the CSV format only needs to be written once, and is reused in all the different situations.
This isn't done for "macho bragging rights" but because it is a waste of time and effort to implement a dozen half-baked versions of the same thing. When there are a dozen separate implementations of something like this, they also end up disagreeing on their behavior, which causes no end of headaches.
> In the Python ecosystem there are a dozen different versions of each thing that make different trade-offs and specialize in different ways. When you're picking one, you have to spend a week evaluating all the possible CSV parsers (for example).
It saddens me to read this weird statement. It saddens me as I was a rather early adopter in Julia (v0.3.2, if I remember correctly) and really believed its mission. I don't quite understand why various ecosystems cannot peacefully coexist without these untrue and completely avoidable statements.
> “ Do you really believe that "parse CSV to a data frame" and "parse CSV to a parquet file" and "parse CSV and apply a query to it in-place" are completely separate, unrelated things?“
Yes, absolutely. Very high level design goals are completely different between those use cases, and you would be forced to make mutually exclusive trade-offs to solve any one of those problems - which directly leads to concessions in terms of the other use cases. No one generic CSV parser can be optimal for all three.
CSV.jl seems to be proof that you're wrong about that. More concretely, the only difference between parsing to a data frame, parsing to a parquet file, and executing a query over data is a) where you put the data in memory and b) optionally calling some code to operate on the data as you go. It's hard to see how those applications require radically different designs.
This is not accurate. CSV.jl optimizes for speed but does not optimize for data transfer to other systems like Spark (which, by necessity, causes worse performance in order to add features around type marshaling).
Design for general case speed is directly at odds with designing either for executing SQL over top of the file (in which case the trade offs are about indexing the data for record-wise or column-wise operations) or type marshaling for an extremely narrow use case like reading into Spark.
For example, parsing for speed vs parsing for Spark interop are very likely to handle datetime parsing or null parsing significantly differently.
Well, the promise of overhead-free multiple dispatch is that you don't have to choose between a general solution with poor performance and a specific solution with good performance. You can have all the specific solutions you need in one place and automatically dispatch to the right one without any extra effort.
I know it sounds too good to be true, but my limited experience so far is that it really is true, with surprisingly few pitfalls (mostly, you have to think about type instability, which will be new to most, but is really not that hard to avoid).
Multiple dispatch based on static typing only solves a very narrow set of optimization problems - most complex trade offs that would lead to the need for eg multiple different third party csv readers are not addressable at all by multiple dispatch.
I'd also point out that with fused typing in Cython, it's trivially easy to get overhead-free multiple dispatch in Python too. Unlike Julia, this has the benefit of being a gradual trade-off: you only bother writing the Cython module (with autogenerated multiple-dispatch implementations for different types) for the ~1% of use cases where it has any impact on the application, and you get the benefit of super easy to develop, flexible Python for everything else, with none of the code liability that comes from describing multiple-dispatch generics in the 99% of the code that does not benefit from it.
I think you are misinformed. Runtime dispatch on multiple argument types is still using static typing under the hood (Julia is using this, just as Cython is).
The fact that input types are dynamic and resolved at runtime (which works identically in both Julia and Python using a Cython extension module) does not mean the multiple dispatch “is dynamic” (it’s still based on a registry of types that determine which overloaded implementation to select).
The only trade-off is whether you want to be able to extend this registry of static types mapping to implementations on the fly (similar to type classes in Haskell) which Julia supports natively and Python supports via tools like numba, or you need to ahead-of-time compile it (Cython).
This is a trade-off though, between AOT resolver speed vs JIT flexibility. It’s not definitely better one way or the other, and Cython gives you a level of control over explicit language features to enable or disable (eg Exception disabling) that is much better for some use cases.
No. You're confusing static dispatch with static typing.
Julia can do both static and dynamic dispatch. The latter, combined with function barriers, makes it possible to maintain codebases that would be impossible with just one or the other.
There's no resolver trade-off, because when Julia knows the types at call time (or a union of them) it can inline function calls.
Anyway, neither numba nor Cython has the combination of features that allows one package to define a fast custom abstract array, another package to seamlessly define a subtype that works fast with other data structures by overloading only the necessary functions, and then a third package to come along and build on top using traits from the first and some weird solver.
No, you are still misunderstanding static type dispatch within a dynamic language. The registry of overloaded specializations works based on static type information. That is not a requirement to have statically typed values (neither in Python nor Julia) but it is still statically typed.
> Multiple dispatch based on static typing only solves a very narrow set of optimization problems - most complex trade offs that would lead to the need for eg multiple different third party csv readers are not addressable at all by multiple dispatch.
How about multiple dispatch with dynamic (runtime) typing?
I think the parent's point is that a fast CSV parser can be written once and then used to implement many different features. Meanwhile in Python every library has to reimplement its own CSV parser, because switching from C to Python and then back to C would ruin performance. To avoid this context switch, the CSV parser has to be written again and again.
This is not true at all. Most libraries do not reimplement csv parsing - I’m saying only a select few libraries do this because they have very different design goals. For example optimizing for maximum rows per second parsed is a totally different design goal from optimizing to read into Spark DataFrame types to connect to Spark. Prioritizing one of these might come at the cost of the other, which is totally fine and having more than one library to choose from is a good thing.
> “ switching from C to python and then back to C would ruin performance.”
This is explicitly wrong in regards to Python. Optimizing Python often involves writing an implementation directly in C (or better, Cython) and autogenerating bindings that marshal between CPython native C types and the types of your extension module. This is extremely fast - as fast as calling native C code, there is no performance penalty for this.
Doesn't that validate the applicability of the benchmark then? GP said:
> A typical user will rarely ever use the built in CSV library in Python
as a criticism of the benchmark, but if a typical user does use stdlib’s csv after all, then it seems like your disagreement is perhaps with GP and not with Stefan.
It does validate the benchmark but that doesn't make the benchmark particularly useful. In both Python and Julia if loading your data from CSV is a bottleneck you'll do something else. If it's not a bottleneck it doesn't matter.
If you've got a Python code base it would cost far more in terms of effort and dollars to convert it all to Julia than to just find a better way to load data. If you've got a green field project and CSV reading is a big portion of the task load then it makes sense to pick Julia I guess.
1. Python’s std csv parser is very fast & versatile, perfectly good for a range of use cases that don’t have unusual needs.
2. For any production use cases whose needs aren’t met by the std csv parser, nobody would use it and nobody has to - plenty of other options exist (rightfully) as separate packages that allow the user to choose a different point of trade-off properties if they don’t like the std module’s defaults.
Nothing says the std isn’t good enough for production - it is. Just most people will use something application-specific and they have tons of options.
Use a binary data format or hexfloats instead. Decimal floats are a pain and especially pointless if they are being generated from and converted back to IEEE 754.
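Hexfloats round-trip binary64 exactly and are cheap to parse; in Python that looks like:

    x = 0.1
    s = x.hex()                    # '0x1.999999999999ap-4'
    assert float.fromhex(s) == x   # lossless round-trip, no rounding logic needed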
>You may be missing the point of the comparison, which is this: to compare the performance of the de facto standard CSV parsers of each system when used the way that a typical user would use them, i.e. to parse a CSV file into a data frame.
Well, in Python typically one would use pandas "to parse a CSV file into a data frame", not the standard csv parser (which doesn't deal with data frames anyway), no?
I would also say, for those not familiar with Julia, that the main advantage of Julia vs Python or R is that you can write performant code in Julia that will be fast enough for most scenarios.
For example, there are efforts to write a pure BLAS in Julia that is still performant [1]. If you are into numerical computing, you will quickly understand this is crazy cool.
A consequence of that is composability. Most libraries are pure Julia, small, and play nicely with each other.
This is also very easy to achieve in Python by using Cython to selectively optimize code. That way for the 99% of code where such optimizations are meaningless and their requirement for static typing is a liability, you can ignore them, and only focus on the 1% of use cases where it matters.
... but Cython is not Python, is it? The argument that "Python can do it, you just have to use something other than Python" is very weird to me.
I began looking into Julia because I was forced into Cython during a project and I loathed having to add a separate, statically compiled section of my code that did not integrate well with the rest of Python (e.g. no stack traces through Cython), and kept breaking (because of some linker issues in Conda)
Cython is a superset of Python. If it’s valid Python, it’s valid Cython. Cython supports the full range of Exceptions and Python stack traces, and you can annotate on a per-function basis whether you want this in Cython.
Managing Cython implementations as separate extension modules is a feature, not a bug - it separates concerns and allows only spending any effort on type system constructs in the small amount of cases where those optimizations will help you.
Cython is extremely easy to compile and standard Python packaging tools work easily with it out of the box, conda included.
Binary will always beat the pants off human-readable formats.
I doubt Excel can export to Flatbuffer, Protobuf, MessagePack, Parquet, etc. So CSV is pretty much here to stay, even though correctly parsing it is more difficult than it seems.
To even gain from parallelism, make all your rows a fixed size. Arbitrary-length fields aren't your friend if you want max speed.
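A sketch of that fixed-size-row idea in Python (the binary layout and filename are hypothetical): with every record exactly 80 bytes, worker i can seek straight to its byte offset instead of scanning for delimiters.

    import numpy as np

    # Each record: one 8-byte id plus nine 8-byte floats = exactly 80 bytes.
    record = np.dtype([("id", "<i8"), ("values", "<f8", 9)])
    data = np.fromfile("records.bin", dtype=record)  # no parsing, just a memory read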
I'm surprised to have not seen a mention of the python library Dask in this thread. Whilst I agree that reading CSVs in Python could be a lot quicker, most people do tend to use third party libraries to do so. Dask is able to multi-core multi-thread almost all operations, including reading in a CSV. You can go from a Dask data frame to a pandas dataframe.
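For reference, the Dask version of a CSV load is a sketch like this (hypothetical filename pattern):

    import dask.dataframe as dd

    ddf = dd.read_csv("data-*.csv")  # lazily splits the input into partitions
    df = ddf.compute()               # parses partitions in parallel, returns pandas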
While Julia has become much more stable than it used to be, it is still in many places an evolving system and you (in my limited experience) should be prepared to hit issues.
As the primary author of CSV.jl, I can help clarify on the posted issues:
* #720: ended up not being an issue at all, but a misconfigured environment
* #714: there was indeed a corner case when automatically detecting float values where the float started with '-' sign and only had a trailing decimal (e.g. '-123.'). Not super common, but indeed a bug
* #749, #734 are related to a new "beta" feature (CSV.Chunks) which allows iterating over large CSV files in batches. I've been trying to track down the issues 2 people have reported, but haven't been able to reproduce on the same large file. Once we iron out some of those issues, we'll mark the feature as "out of beta".
I agree that Julia packages in general are still evolving and you might run into issues, but at the same time, I strongly believe the ecosystem has reached a maturity similar in most ways to other language package ecosystems. For example, I use a lot of Javascript/Python frameworks/packages and, at least for me, I tend to run into corner case bugs/issues as often there as I do in the most common/mature Julia packages.
Compared with, say, data.table in R, or pandas in Python, one of the things I enjoy most about Julia packages is that they're almost exclusively written in pure Julia. Having to dive into the data.table/pandas C/C++ source plus language-binding glue code is a huge pain when tracking down bugs, so I feel like my knowledge of Julia "goes further": if I run into a package bug, it's relatively much easier to track down what's going on and even submit a pull request to fix it!
I should have been a bit more careful with this message and not mentioned issues that might be in beta. I don't want to specifically pick on CSV.jl, but my experience with Julia has been that every time I've worked on a significant program we've hit an issue, either in core or in a common library - certainly more often than with Python (I avoid Javascript).
I agree that Julia is progressing quickly, but I think a lot of people (certainly myself) got burnt back in the earlier days of Julia when things were much less stable and changing rapidly and the language was (in many blogs) sold as ready for use.
No worries. The CSV.jl package has been around for 5 years now, has some 700 issues opened/resolved, has built up a pretty extensive test suite, and is used in production by a number of companies, so I just wanted to clarify its level of maturity :D
I can definitely understand getting burnt in Julia early days (I was around back then as well), but since 1.0, the package ecosystem has matured quite a bit IMO; most of the really popular packages are very stable and work similarly to popular packages in other languages.
Same here. Burned quite a few times trying out Julia stuff.
Last time it was because cold start times were unbearable (compared to Python/R/Stata/Matlab). My goal was to compare how regressions in a few software packages behaved with difficult datasets, so I had to open each of those, run a snippet of code, and log the output. Here, Julia's cold start (and importing the CSV, GLM, etc libraries) took way longer than all of the other tools together.
The biggest problem with Julia is its marketing[1]:
- Julia lures newcomers by way of attractive marketing. (At one point in my career, I was such a huge fan of Julia that I dismissed many people and their concerns. It was like I was following a cult.) This was pre-v0.4 days.
- Too much focus on optimization. As a seasoned engineer, speed has become less and less of a concern: the primary packages that need to be fast in Python are already quite fast. The majority of the tasks most engineers (not academicians) do is plumbing and connecting various pieces together. Julia just isn't good at this; perhaps it will get better.
- Julia markets itself boldly as a general purpose language but doesn't have a good systems library, a real competitive web framework, or many other things necessary for a true general purpose language - e.g. GUI toolkits. At best, it is a scientific computing language like MATLAB but free (which is a huge thing for academicians, but not for engineers working in Top 500 companies). At worst, it lacks features and documentation that MATLAB has.
- Julia isn't humble in its approach to marketing. It over-extends and over-reaches. This leads to people like myself being lured in and then completely burned by the experience. For example, it flaunts the ability to inspect inline assembly instructions - in my entire 15 year career as a software engineer, I've never needed to do this. It is an almost meaningless feature for the vast majority of users, but marketing wise it is pretty attractive.
- Julia recruits inexperienced software engineers and non-software "script junkies". It has lost its vision and original premise since v0.4, I feel. Syntax has gotten worse. Bad decisions have been made. I think the founders of Julia are very, very smart folks; it's just that the community is full of people who haven't built anything remotely large or of medium complexity. Are there exceptions? Yes, Julia has been adopted in a few large scale projects, but I dispute whether that was a good decision.
Note: My experience is based on engineering mindset, not academics and research. Julia would probably be pretty good on a professor's macintosh and can gain wide adoption in those areas.
Wouldn't this statement be true for any new technology? I don't doubt what you say about being burnt when adopting in the early days. However, it is simultaneously true that people have successfully used Julia in several commercial applications and significant research codebases for many years now.
We also request people to please file an issue against Julia or a package if it doesn't do what you expect it to, or post on discourse. That way, even though you may go ahead with a different tool, someone is likely to fix it by the time you come back the next time. :-)
I've generally found the Julia community less welcoming than some (Rust, Python), although much better than the "classics" (C, C++, Java and friends).
For example, picking the first question I can see on discourse that isn't a clear technical question (it was number 3): "What is the status of the debugger?". The first answer is "I've used Julia for 3 years and I don't need a debugger", whereas searching for similar questions on the Rust discourse tends to find actual answers.
This isn't a complete investigation, it might just be me generalising from my small number of interactions with the Julia and Rust communities.
The second answer in that thread, however, links directly to the docs, the repository, and a blog post describing the debugger. I would say the question was very well answered.
Your post gives the impression that the question was dismissed, maybe even that there is no work on a debugger, which is quite misleading. There are several debuggers, and lots of work.
There is some controversy in the community about the _need_ for a debugger, with people with strong opinions on both sides.
As for the general tone, I don't know what to say, except that I find it quite helpful, with lots of people putting in a lot of effort to help whomsoever comes along with questions.
> Compared with, say, data.table in R, or pandas in python, one of the things I enjoy most about Julia packages is that they're almost exclusively written in pure Julia.
R is eventually going to use Julia in its packages. It's one of the best glue languages I've ever learned.
That's actually R's strength: it doesn't have to compete when it can assimilate stuff in the backend while leveraging its existing users. Just look at Stan.
With all due respect to all the people working on Julia, I see a stark difference in rigor between, say, Python's PEPs/RFCs and how Julia's core developers operate. The former is a strict formal process with a lot of thought given to each PEP proposal, whereas the latter seems like informal, chaotic hashing/bikeshedding between a few key people in the Julia community. I've contributed to Julia and people are really nice, but IMO it is missing structure, discipline, a hardline regimented approach to confronting difficult problems, etc. One feels like a supreme court hearing; the other, a street brawl to decide important issues.
One advantage I've noticed at least (having been a contributor to the Julia language itself) is that it makes contributing to the language very approachable. Granted, you see lower quality proposals from time to time, but I question whether I myself would have gotten involved or been brave enough to propose language features if I had had to do such a rigorous, formal PEP writeup. In some respects, you're restricting the pool of people who actually contribute ideas to the language.
I also have appreciated the "meritocratic" approach of Julia language features; if someone can show that an idea/approach/algorithm is fundamentally faster, more flexible, etc. it's generally been accepted, regardless if the proposer is a first-time contributor or not. That also makes contributing feel approachable and very fair, as opposed to just trying to please a single core developer or playing some politics game.
> if someone can show that an idea/approach/algorithm is fundamentally faster, more flexible, etc. it's generally been accepted, regardless if the proposer is a first-time contributor or not.
If you can submit a measurably faster csv implementation to cpython (without breaking existing code of course) I can guarantee it will be welcome whether you’re a first time contributor or not. Not sure about your definition of “more flexible”, if that entails breaking existing code then it may rightfully meet resistance.
Not sure why you’re implying other languages’ proposal processes are about pleasing a single core developer or playing some politics game. Sounds like either you have a lot more experience than I do and hence have been exposed to politics games or anything else non-meritocratic (that I haven’t sensed in CPython development, at least), or you just don’t have as much experience contributing to other languages.
(I'm a bit cranky today. Sorry if I sound aggressive.)
No worries, it doesn't come off that aggressive ;)
Yes, I've definitely been involved in other projects where a single core developer can do a little too much "imposing" of their will, i.e. if they happen to have a personal preference one way or another on things, they may not accept or even entertain certain changes.
I wouldn't expect most large, mature projects to be this way (otherwise they probably wouldn't have grown as large or attracted enough of a community!), but it's certainly a problem with some projects.
Projects, of course. Ages ago I quit a notable project partly because of maintainer team politics, so I'm no stranger. But I thought we were specifically talking about language improvement processes of non-niche programming languages (a specific one, even).
> If you can submit a measurably faster csv implementation to cpython (without breaking existing code of course) I can guarantee it will be welcome whether you’re a first time contributor or not.
I wouldn't be so sure. Search for `site:bugs.python.org "maintenance overhead"` on your favourite search engine.
8 results total, with exactly one entry that’s remotely relevant (i.e. performance related, patch provided) — bpo-21955. You could have just linked that. And this happens to be a patch from a core dev. And it’s not comparable to submitting a performant implementation of a PSL module like csv. You have chosen a pretty convincing example, I don’t know what I’m supposed to be convinced of though.
On the other hand, many important things move at a grinding pace in Python. Packaging and performance are two major issues for just about everyone I’ve spoken with who uses Python in a serious capacity, and these have been notorious problems for decades with no end in sight. Most of this seems to boil down to an unwillingness to encourage the ecosystem towards a narrower packaging format (“sorry, downloading a package and running setup.py just to determine the dependency set is insane and we’re going to deprecate that going forward”) or a narrower C extension interface that would be compatible with more aggressive optimizations. There’s value in stability, but if the governance model can’t pass these kinds of slow, steady changes for paramount issues, then perhaps new languages shouldn’t look to Python as a model for a well-governed language.
I would suggest Rust—I’m not a fanboy by any means, but Rust is a large, rapidly growing project that is unparalleled with respect to its rate of improvement as far as I’m concerned. I know recommending Rust for things is a cliche (esp “rewrite it in Rust”), but I would defy anyone to find a better governed language.
I find Rust really difficult to get into :( Python, you're right, has a bunch of problems with datetime and packaging, etc., but when it comes to productivity it is insane how amazing Python is. Being able to hack the internals of a Python library, plus the "batteries included" philosophy, is why it is now the most popular language in the world, overtaking Java.
Rust does have a steep learning curve, and Python is a very productive language... until you are the one responsible for builds and deploys, or performance, or concurrency, or quality, etc. I think you'll find that Go hits a nice sweet spot between all of those competing concerns even if it isn't the best at any of them (in general; probably not for data science in particular, though).
Is this process actually fundamentally better? There are so many PEPs that are up to the reader to interpret. There are so many PEPs that contradict each other (just read some of the early PEPs). The whole thing feels basically meaningless, and toothless at this point.
I think Go's proposal process nails balancing formalities and discussion/bikeshedding. They seem to be very productive at making the language better, thinking critically and avoiding regrets, rather than pandering to the very vocal and often misaligned minority of the community.
I can’t even tell if this is a bad faith attack or genuine cluelessness.
No, language features, standard library modules, infrastructure implementations, standard APIs, development and community processes, etc. etc. aren’t “up to the reader to interpret”, aren’t “basically meaningless”, aren’t “toothless”, especially not “at this point”. Informational PEPs are informational.
python-committers also aren’t a misaligned minority of the community. If you feel committers don’t serve your interest, step up and start contributing.
Can you provide some meaningless and toothless examples? Which PEPs disagree with which ones?
I think PEPs, RFCs, and formal processes are designed to maintain structure and discipline. It may sound like unnecessary bureaucracy, but you see this in the PostgreSQL development process as well. A vast amount of critical software is developed this way - from Linux to proprietary software in large companies such as Apple, Morgan Stanley, and many more.
Maybe I'm a cranky old man, but I think there's immense value in the fact that Python's (scientific and non-scientific) ecosystem has matured and has become standard over the last 1-2 decades.
Starting again from the ground up in a shiny new language (even if the language is perhaps slightly better) is a waste of so much effort. Transitioning from Python 2 to 3 was already a nightmare.
I know there's some interoperability between Julia and Python, but a lot of stuff is not interoperable, and there are annoying differences in array indexing etc.
Why can't people just settle on some tech, let it mature, and enjoy reaching higher and higher in capability? Why do we have to always tear things down and start from scratch?
This causes a lot of artificial bit rot, there's a never ending treadmill of just porting well-developed concepts and mature code to slightly changed representations. It's the busywork of tech.
Does the scientific community also want to follow web dev in their insane churn of frameworks and libraries where last week's best practice is laughable ancient stone-age primitiveness today?
I love python and work mostly in it, and I have some issues with the current state of Julia, but I think you are underestimating how much the python science stack has changed recently in a very unstable manner. There is so much cutting edge science that simply can not be done with numpy/scipy. A very big part of this has been the acceptance of reverse design and automatic differentiation in many engineering fields, for which you simply have to use theano/jax/tensorflow/pytorch (I am not talking about machine learning here). Of these, I had to switch my simulations away from Theano when it was deprecated, I could not use pytorch because it does not support complex numbers, and I had to switch from Tensorflow 1 to Tensorflow 2, and I had to write by myself ridiculously basic things like ODE solvers.
The existence of ad-hoc solutions like numba or cython likewise speaks to the limitations of the language. Most specialized science packages (qutip for instance, for quantum optics) reimplement tons of solver/integration/optimization logic and require plenty of Cython (or C) code. Scipy does not play nice with most of the automatic differentiation libraries.
In that context (a pretty universal context), Julia is looking relatively stable and future-proof in comparison. E.g. you can not really interoperate with scipy/tensorflow in a meaningful high-performance way. The value proposition of Julia is that this kind of interoperability is possible today, and hopefully will remain possible when future new paradigms appear.
Also, on a tangent, Julia already has ODE solvers and Probabilistic Programming packages lightyears ahead of what Python has.
We need more decades-stable stuff similar to LAPACK and BLAS, but for one step higher-level things. Just imagine if people had to rewrite all that numerical computation for every new language.
Leaving some performance critical stuff in C is okay. We can expose them to a variety of languages then.
> away from Theano [...] I had to switch from Tensorflow 1 to Tensorflow 2
I feel you. But I think this has more to do with Google's deprecation culture. TF1 was nice for its time, but I will never adopt TF2, as they will drop it any time as well. I have higher hopes for PyTorch though.
One of the neighboring comments mentioned something about changing assumptions. This is what limits the creation of decades-stable stuff. If you leave the performance crucial stuff in C, that stuff will not operate with the novel code you wrote (autodiff, probabilistic programming, GPU/TPU hardware, etc).
Julia is now experimenting with pure-Julia code that is faster than many LAPACK/BLAS implementations. But I actually care little about the fact that they are a bit faster. Same with the new ODE package in Julia, which is unsurpassed in terms of functionality. The actual reason these pure-Julia packages excite me is that they are generic enough to run relatively easily on different hardware (GPUs for instance), they are on top of modern compilers (LLVM), and most importantly, they are written in a language that permits much higher level of interoperability.
To rephrase it much more simply: The fact that you can not use scipy special functions efficiently inside of any of the autodiff frameworks is exemplary of how incredibly limiting python is, when it comes to developing new numerical techniques. Julia does not have that problem as far as we know.
>The actual reason these pure-Julia packages excite me is that they are generic enough to run relatively easily on different hardware (GPUs for instance)
Only if they were specifically written for GPUs. GPUs are incredibly limited. Expecting to run arbitrary code on them is only going to disappoint you. If you could just run regular code on GPUs then everyone would have done so. Programming language support is not a barrier. The GPU itself is. In fact, you should consider yourself lucky if you have problems that don't need GPUs.
IMO this is worth very little; as long as there is a heavily optimized version written in C/assembly (ie MKL), the fact that Julia might be faster than standard-but-slower versions like OpenBLAS doesn't actually gain me anything as a user.
Mostly the bits about how you have to work around the Julia GC.
So it doesn't seem like a great option for libraries that might have lots of different bindings, where C/C++/Fortran (and maybe Rust?) seem to be the only reasonable choices.
> Leaving some performance critical stuff in C is okay.
I disagree on account of the glovebox problem.
If you can't touch your data without losing perf, your API becomes awkward like a glovebox. All meaningful low-level operations must be represented in your high-level API which bloats the heck out of it. To add insult to injury, it still usually doesn't get you 100% functionality coverage, so projects that veer even slightly off the beaten path are all but guaranteed to involve C and bridge crossings and explode in complexity. So the end result is unnecessarily bloated libraries that unnecessarily constrain your creativity.
Yes, I know about numba -- I've used it heavily and even contributed a few bugfixes -- but it's often not even a workable solution, let alone a good solution, let alone a good solution that's fully integrated into the ecosystem (i.e. debug and in other people's code).
Most of the code I write is still python, but I need perf, so I suffer at the python/C bridge regularly. Like any other programming language bridge, it has trolls under it, so if you only need to cross once or twice you're probably fine but if you frequent the bridge they're going to periodically pop up and beat the shit out of you. This won't affect people following tensorflow tutorials but it greatly affects people in the deep end of the ecosystem and it really sucks.
Next time the trolls come out, I'm going to try to make the case for Julia again. Hopefully one of these times we make the right choice, but in the meantime it would be nice if the people using my libraries understood that I was taking troll-beatings on their behalf and listened to my requests that maybe we should do something about it one of these days.
By that logic people should have kept using Fortran instead of Python (which was the new guy a couple of decades ago). But Fortran will never be Python, and Python will never be Fortran, no matter how much maturity/improvement. That's because maturity means getting close to a language's potential (a local maximum) without the ability to drastically move to a hypothetical global maximum - without breaking all its legacy (like a more drastic Python 2 -> 3) or adding things indefinitely until you effectively get many languages/dialects in one (like C++).
New languages, on the other side, don't have as much baggage (and can start with the accumulated knowledge from previous attempts), so they are free to target better local maxima that old languages no longer can. Plus, in this particular case, Python was not really designed to target the scientific domain (nor ML, which did not even exist in its current form); it was repeatedly retrofitted, with possibly a lot more effort than would be necessary to create something better from scratch.
> By that logic people should have kept using Fortran instead of Python
There were very real pain points with Fortran, C++, MATLAB etc. compared to Python. Now this is hard for me to argue in a few sentences as it's a distillation of having worked with many languages, but I feel that Python strikes a great balance for productivity. It feels like pouring your thoughts directly onto the screen, with no need to fight the language. The syntax is very clear, concise, "Pythonic" is a word of praise. As I said, this kind of debate is rarely productive and devolves to religious arguments, but that's my stance.
The only issue with Python is speed, but if you use Numpy properly, even that isn't a big issue.
> and can start with increased knowledge from previous attempts
They often don't make use of that out of a sense of pride and not invented here.
I read your view that Python has few real pain points in data science as either a lack of knowledge of other languages or a lack of imagination/ambition. I work with Python for data science/engineering in a production environment and I find many such pain points all the time. Python does not feel like pouring my thoughts onto the screen, because I cannot write plain Python without it taking hours to handle a few million data points, so I have to juggle multiple dialects (pandas, numpy, pytorch, tensorflow), and Python ends up being one of the languages for which I need documentation open at all times. And even then, the documentation often doesn't make the inputs obvious: do I need to pass a tuple, or perhaps a dict, or even a list? Typing is very recent, so the ecosystem isn't up to date, and sometimes I can't even type my code properly because I legitimately don't know what a library function returns. Even with mypy I keep finding errors that only surface after deploying (I don't blame Python for this one, and Julia isn't really better, but I can still dream of something that is dynamic when I want it but still safe). Also, for a dynamic language, Python has a pretty mediocre interactive story (I use ptpython since the default REPL is unusable): even simple things like copy-pasting into the REPL can be a pain because of indentation, and the REPL is far from those of Lisp, Clojure, and even Julia and Elixir. Elixir also makes me especially disappointed with Python's multithreading story (in Elixir it's so natural that it really is "pouring your thoughts onto the screen" for distributed code).
This ended up being a rant. I have many pain points with Julia as well, but it's still a new language that has more room to evolve and find ways to solve them, and if the solution is yet another language that solves all of them, I'll quickly jump. I spend a lot of time programming, so any significant improvement in the use of my time is worth the effort of learning.
> so I have to juggle with multiple dialects (pandas, numpy, pytorch, tensorflow)
NumPy should be enough for general computation. If you need autodiff or GPU then add in PyTorch. Pandas is more about various metadata than the actual numerical computing. If you want those types of features, the complexity doesn't disappear if you go to a different language.
There's an effect where a new generation of developers sees the complexity built by the earlier generation, says "this is too complicated and messed up, we don't need all that," and starts over clean, and it all looks so easy. But it's deceptive, because it will get complicated again once you put in all the features; it will just look familiar to this generation of developers as they grow side-by-side with the new language/framework. After a few years the cycle repeats, and a new generation says "what's all this mess, why do I need to juggle all this, I just need XY."
> Also Python for a dynamic language has pretty mediocre interactive story (using ptpython since the default repl is unusable), even simple things like copy pasting to a REPL can end up being a pain because of indenting, and the repl is far away from Lisp, Clojure and even Julia and Elixir.
Use Jupyter Notebooks (or IPython if you don't want to leave the shell).
> disappointed with Python's multithreading story
Thread pools (executors, futures etc.) and process pools (multiprocessing module) work quite nicely.
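For example, a minimal sketch of the process-pool route (file names hypothetical): parse one file per worker, then concatenate.

    # Parse several CSV files in parallel, one worker process per file.
    from concurrent.futures import ProcessPoolExecutor

    import pandas as pd

    def parse(path):
        return pd.read_csv(path)

    if __name__ == "__main__":
        paths = ["a.csv", "b.csv", "c.csv"]
        with ProcessPoolExecutor() as pool:
            frames = list(pool.map(parse, paths))
        df = pd.concat(frames, ignore_index=True)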
I'll have to disagree with your defeatist generalization here: creating something with better knowledge of the problem and a lot of hindsight will not necessarily get as complicated for the same feature set. In fact the resets are probably necessary so you can get to even higher feature sets before getting so obtuse that it's a pain to move another step.
Julia doesn't even need full parity with numpy because you can trivially write your needs in straightforward Julia (in fact Julia does not have numpy, only Julia arrays). And Pytorch equivalent? Also uses Julia arrays and straightforward methods (as you can just differentiate Julia code directly). Pandas equivalent? It's a wrapper over Julia arrays. Need named tensors? Just pick another wrapper over Julia arrays and it will work on Julia's Pytorch equivalent without changing anything.
I do have to deal with Jupyter Notebooks, but they have a lot of "real pain points" as well (it's always a pain when a data scientist hands one over for deployment; half the time it doesn't work, because the notebook doesn't keep track of the whole execution history and needs a major rewrite). The main issue, though, is the disconnect between exploratory code and production code. Julia's workflow (Revise + REPL) means I still write structured code as usual, but every change in the code is reflected in the REPL, as if I were interacting with the production code directly (and the Julia language supports a lot of commands to interface with the compiler, from finding source code to dumping the full structure of any object and even all inferred types). And of course, Julia also has Jupyter in the first place, as its name suggests, plus a very cool alternative for some workflows in Pluto.jl, which improves on reproducibility.
You might also want to try Erlang/Elixir; it's really amazing how you can trivially spawn hundreds of thousands of tasks that are incredibly robust even though it's a dynamic language (since you spawn not only tasks that do the work but also tasks that monitor them and handle errors). And you can even connect to and hot-swap parts of the application live (as an example of superior interactivity).
I'm not saying that Python is not good, it wouldn't get where it is if it weren't. I'm saying that things can be way better, and we already have examples of it in most particular areas, but none that covers all of Python's strengths, which is why it will still be king for the foreseeable future. But I can only hope that we will get better and better tools to tackle more and more complex problems in the most simple and optimal way that we can given all that we learned as a community.
> The only issue with Python is speed, but if you use Numpy properly, even that isn't a big issue.
Numpy leaves a couple of orders of magnitude of performance on the table, especially if you use small arrays and a lot of intermediary computation. It is terrible in terms of memory allocation overhead. Cython or @tensorflow.function fix much of that, but then you are not really using python anymore.
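To make the allocation complaint concrete, here's a small sketch: every vectorized expression materializes fresh temporaries unless you explicitly reuse buffers with out=.

    import numpy as np

    x = np.random.rand(1_000_000)
    y = np.empty_like(x)

    z = 2.0 * x + 1.0            # allocates two temporaries (2*x, then +1)

    np.multiply(x, 2.0, out=y)   # writes into the preallocated buffer
    np.add(y, 1.0, out=y)        # in place, no further allocation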
> They often don't make use of that out of a sense of pride and not invented here.
Agree, but to add, there are many transcendental functions (e.g. I needed to use the Mittag-Leffler function many times in my work) which are near-impossible to implement in Numpy in a way that gives anywhere near usable performance. It's not a common enough function to be pre-implemented in Scipy special functions.
In my case, I wrote the algorithm in Numpy/python myself which was almost unusably slow. I then outsourced it to a precompiled fortran program. Finally I just switched to Julia. There was already a library, MittagLeffler.jl which did everything I needed and was written in pure Julia. In the end everything was much faster than the Python/Fortran frankenstein code I had.
Edit: For more info on Julia and computationally difficult mathematical functions, see:
And, unlike in many other languages, that MittagLeffler Julia function can probably already work efficiently on special arrays (distributed arrays, GPU arrays, etc.), can be fused into inner-loop kernels, and supports automatic differentiation, all without extra work from the author of the package.
There are very real pain points in Python, too. Developers tend to internalize these pain points to the extent that they're completely blind to them, and that limits innovation. Not all interesting and fast algorithms can be vectorized. The idea behind Julia is that you can write code that should be fast, like loops, and it will be. You don't have to bend over backwards to jam your algorithm into a C wrapper. It's fair to question whether the trade-offs of things like JIT compilation were worth it in Julia, but it should also be understood that NumPy is insufficient for supporting all of the interesting things you can do with arrays of numbers on a computer.
>They often don't make use of that out of a sense of pride and not invented here.
You are handwaving away a big reason why there is research into non-Python solutions. Unless you work at FAANG, or some other institution where developer time is much, much more valuable than compute time, simply using a more performant solution can save either days of waiting or thousands of dollars a month in compute.
>The only issue with Python is speed, but if you use Numpy properly, even that isn't a big issue.
This is literally the pain point that Julia set out to solve. Nobody uses Python for real programs, they use an ad-hoc kludge of Python and C/C++. This is basically fine if you're doing bread-and-butter things that are well covered by existing libraries, but it's a serious drag if you're trying to do anything really interesting.
Python still suits a lot of people and that's fine, but "mature" is very much a double-edged sword. We're in the middle of a scientific computing renaissance right now and Python really isn't keeping up; it was designed by and for hackers, but Julia is a multi-disciplinary effort.
> Python really isn't keeping up; it was designed by and for hackers
What does "hackers" here even mean? Is it just a throw-away disparaging term?
Hacker has at least three distinct meanings: deeply knowledgeable software developers, shallowly knowledgeable programmers, and people who break into computer security systems.
I assume you are not using the first of these, else you would likely say the Julia developers are hackers too. And you are definitely not using the third.
van Rossum's training, for example, came through the language design and implementation process for ABC.
So already there it was a multi-disciplinary effort.
Why not say Python was designed for an era of single-core computing and for tasks where raw performance took second place to usability instead of going hand-in-hand? That seems rather more correct and less needlessly antagonistic.
But you seem rather fond of needless antagonisms, as people certainly do write "real" programs all in Python.
>What does "hackers" here even mean? Is it just a throw-away disparaging term?
No, my point is that the Julia team includes a lot of people who aren't primarily software developers. Many of the top committers to Julia are working scientists who have first-hand experience of running HPC jobs with Julia; their input is a crucial factor in the success of Julia as a numerical computing language.
Python is a very good scripting language, but it's not built for numerical computing and it's not built by people who understand the needs of scientific users. It's not a bad tool, it's just the wrong tool for the particular jobs that Julia was built for.
> Python is used or proposed as an application development language and as an extension language for non-expert programmers by several commercial software vendors. It has also been used successfully for several large non-commercial software projects (one a continent away from the author’s institute). It is also in use to teach programming concepts to computer science students
"non-expert programmers" and "computer science students" are not hackers by your definition, resulting in an internal inconsistency in your claim.
Python indeed wasn't built for doing numerical computing in Python. But it was definitely influenced by the needs of people who 'have first-hand experience of running HPC jobs' - under the "steering" model.
> We have described current approaches and future plans for steering C++ application, running Python on parallel platforms, and combination of Tk interface and Python interpreter in steering computations. In addition, there has been significant enhancement in the Gist module. Tk mega widgets has been implemented for a few physics applications. We have also written Python interface to SIJLO, a data storage package used as an interface to a visualization system named MeshTv. Python is being used to control large-scale simulations (molecular dynamics in particular) running on the CM-5 and T3D at LANL as well. A few other code development projects at LLNL are either using or considering Python as their steering shells. In summary, the merits of Python have been appreciated by more and more people in the scientific computation community.
One of the authors of that presentation, Paul Dubois, was lead developer for Numerical Python, which evolved into NumPy.
The needs of numeric computing, under the steering model, contributed to Python-the-language. Most recently with "@", but also the multiple slice notation a[1:3, 5:11:2] and the change to remove the three-state compare cmp(). Plus under-the-covers support for a wide range of HPC architectures.
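Both of those language-level additions show up in everyday NumPy code:

    import numpy as np

    A = np.random.rand(4, 12)
    B = np.random.rand(12, 3)

    C = A @ B              # the PEP 465 matrix-multiplication operator
    sub = A[1:3, 5:11:2]   # multidimensional slice notation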
In turn, Python's support for steering numeric computing helped it make in-roads in HPC, where it remains to this day.
You may call the steering model 'an ad-hoc kludge of Python and C/C++', but as I've pointed out, you have a tendency to use needless antagonisms, and that's another one of them.
You're the one who claims I don't do "real" programs, only "bread-and-butter things" simply because I don't do bare-metal numeric computing software. Just how much of a multi-disciplinary effort is Julia if bare-metal numeric computing software is the only discipline it focuses on?
That's a big one. Memory usage and typing are other issues. That it doesn't really run in the browser is another. Or that it has more issues than C++ when trying to write a portable GUI.
I see what you mean, but from my perspective what this argument is missing is the context that a large and important subset of the scientific community has never moved to Python to begin with.
In other words, there is an even crankier and older community for which Julia may be the first actual change in a very long time.
This is particularly the case in HPC, where everyone still uses Fortran, C, or C++, and has never moved nor ever will move to Python because Python is fundamentally unsuited to these workloads. But in some cases, Julia is [1].
The best differential equation solvers (e.g., SUNDIALS [2] for a modern example) have been written in FORTRAN for the last 50 years. If Julia can challenge Fortran as the go-to language for this type of work (e.g. DifferentialEquations.jl [3] and SciML), that would hardly count as excessive churn.
Python still has an enormous amount of baggage from its single-threaded and fully-interpreted past (and present, really -- but I'd expect fans to argue). Should we live with the kludgey, limited workarounds? Or switch to an ecosystem built on a more modern foundation? Only time will tell -- but I won't shed any tears for python if we decide on the second option.
There is immense value in most of the ecosystem having converged on Python over the last decade, especially in leaving MATLAB behind in the dust. I don't think decade-long transitions from old tech to new tech are a bad time scale, or even close to the churn of web technologies. Julia offers enough of an improvement over Python to warrant a switch over the next 5-10 years and to leave Python behind in the same way. I believe Julia offers the same level of improvement over Python as Python did over MATLAB.
I also believe that it has drawn too much from MATLAB purely for the sake of being familiar to that group rather than to the computer science and software engineering communities, such as the one-based indexing you mentioned. Unfortunately that ship has already sailed, and we will likely have to wait another 10 years for the next one to hopefully fix those problems.
Back 5.5 years ago, I used to complain about the 1-based indexing and the column-major layout of matrices in Julia (both like Fortran); however, those issues have since been solved by OffsetArrays and PermutedDimsArrays, giving far more flexibility than is possible in most other languages.
It's silly to keep bringing up the issue of one-based indexing when you can use any integer as the base, just like in Fortran 90 (so you can index by -5..5, for example).
For some things, 0 based does make things easier, sure, but you can do that easily in Julia (and more!)
Yes, it is possible to make those changes, and even if it weren't, that wouldn't detract from the impressive and useful innovations Julia has made. That is why I usually refer to them as cosmetic, but that is possibly not the correct word either, because it doesn't convey the implications of having that default beyond it simply being unattractive. "Cultural" may be the correct term.
Julia claims that it wants to solve the two-language problem. To do that it must win over both the scientific computing, HPC, data science, machine learning, etc. group and the CS, software engineering, dev-ops group. I've shown Julia to quite a few members of the latter, and they always love the idea and technical capabilities of Julia but visibly cringe when they find out it uses one-based indexing. They also almost universally find the syntax ugly.
Unfortunately, the available packages to change that don't help either, because most people will never change the defaults, and IMHO it would be bad to do so. When writing Python you follow the PEP-8 style guide; it is just best practice to follow the language's conventions so everyone is on the same page. This is the same problem C++ has with adding all these new features and wanting a solid package manager like Rust has: it will never fly, people just won't use it everywhere, it isn't in the standard library, it isn't the default, and it isn't part of the culture.
> Starting again from the ground up in a shiny new language (even if the language is perhaps slightly better) is a waste of so much effort. Transitioning from Python 2 to 3 was already a nightmare.
This is a sunk cost argument. Python's stack got us a long way, but it still falls short in a lot of ways. First of all, it is eager, so you can't build up an entire query, have the system optimize it, and then execute it for greater performance. Secondly, you can't easily parallelize operations on the same dataframe. Thirdly, it's poorly suited to running in a web process. Fourthly, it's all written in C/C++, so debugging, building/packaging/installation (esp. on niche systems), etc. is a pain, and it prevents many of the optimizations Julia and others can do natively. Fifthly, the APIs were horribly designed (especially matplotlib, good grief). Etc.
I don’t know about Julia in general, but in most domains there’s a middle ground where people can have nice, thoughtfully designed tools (albeit with minor issues), and endless churn. It certainly feels like the scientific toolchain is one of those areas—fix a few of these enormous, glaring problems and then enjoy some stability.
> Fifthly, the APIs were horribly designed (especially matplotlib, good grief)
I've seen comments like this fairly often, but am not really sure what about matplotlib's API makes it so bad. Hoping to learn why.
The only equal-basis competitor to matplotlib that I'm familiar with is Matlab, which is definitely worse. Declarative plotting libraries like ggplot2 have a nicer API, but the grammar of graphics approach is an entirely different paradigm, which gets as thorny as anything else if you have to do something unexpected. For the kind of medium-level plotting where you're given prepared plot elements, but have to explicitly specify where to draw them in data or figure coords, what would be an example of a "good" API (better than matplotlib)?
The API is stateful and the args it accepts are arbitrary. It just tries to guess what you want it to do based on the types of the args you pass and their order. I'm not sure which plotting API is better, but there's nothing unique about plotting in all of computing that requires an insane API.
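To illustrate the statefulness point, here's roughly the same plot written against the implicit-state pyplot interface and against the explicit object-oriented one:

    import matplotlib.pyplot as plt

    # Stateful interface: calls mutate an implicit "current" figure/axes.
    plt.plot([1, 2, 3], [1, 4, 9])
    plt.title("implicit current axes")

    # Object-oriented interface: the state is explicit.
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 9])
    ax.set_title("explicit axes")
    fig.savefig("plot.png")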
Disruption is the price of progress. Like cars with electric motors vs combustion engines, sometimes you have to start from scratch. You can't just keep improving combustion engines forever, you eventually reach a technological limit.
To your point though, I do see a lot of unnecessary disruption particularly in the web dev world. I think people like working on new stuff. It's exciting to take the first steps, see order of magnitude improvements every commit.
As an aside, I think that's a poor analogy. Electric cars have been around since the late 1800s, only a decade or two after the first internal combustion engine, and for a while set land speed records. The last 30 years of electric cars did not start from scratch.
Electric motors reached a technological plateau. As https://en.wikipedia.org/wiki/Electric_car points out, it required MOSFET power converters and lithium-ion batteries. Before then, it could appear that it reached a technological limit.
> Why can't people just settle on some tech, let it mature, and enjoy reaching higher and higher in capability? Why do we have to always tear things down and start from scratch?
At what level of language improvement does it warrant rebuilding things? E.g.: what if there has actually been a change as revolutionary as multiple dispatch based on a good type system, which cannot be backported to Python, and which provides huge advantages for a diverse set of applications? For many users, it would be crazy not to switch... especially given that it's actually much easier to write performant and flexible libraries in the Julia ecosystem (writing Julia code) than in the Python ecosystem (writing C/C++/Cython code).
You might be succumbing to a bit of the blub paradox here... Python might be good enough for all your needs, but people are justifiably enamored with Julia because it brings some really exciting capabilities to the table which allows them to solve serious impediments they experience, not just because it is a shiny new thing.
Rhetorical question: Now that we have all this infrastructure for gasoline based cars, is that reason enough to hold back the “churn” to electric cars?
This argument resonates with me completely. Every time we had a performance issue with Python, instead of rewriting from scratch, I wrote the hot part in C and created Python wrappers for it. It's much better to extend the existing ecosystem than to rewrite things from scratch over and over again.
Yes, there is a very long tail of libraries scattered across GitHub and the web in general. Snippets of obscure algorithms, APIs for various things, like ML datasets from 5 years ago etc.
I've heard there's some interop between Python and Julia, but I guess it comes with plenty of caveats: it works for most cases, but will give you enough headaches of the "oh, of course this use case is not supported like that" kind.
Once you're in Python you can do anything. Spin up a web server, render something in 3D, read obscure data formats...
When I see academic code from 10 years ago written in Matlab I cringe, because I know it will be a pain to work with that code. There was a big reason people moved away from Matlab: it cannot support large applications. In Python you can build huge code bases with modular structure, sane classes, etc., as opposed to Matlab. And it's free and open source, unlike Matlab. So that switch was worth it. But I don't want all of today's great Python long-tail ecosystem to be lost to obsolescence in 5 or 10 years. It's just sad to see so much effort go down the drain.
Because people want to? I think a big part of it is that people like trying new things and trying to differentiate in some way. In something as competitive as web dev, it makes sense that the churn in frameworks is so high, because everyone is looking for just that little edge or unique special sauce that makes development a bit easier or makes the competitor a bit more obsolete, or attracts talent thats just a little bit better. For someone out there, a little edge in CSV parsing is going to make their lives just a bit easier and help them do just a bit more than the competing lab, and grad students/labor is cheap, so why not have a grad student play around with Julia and see if you can get anything out of it?
From my perspective, the fact that so much of computing has consolidated around a language that cannot support even the most rudimentary code completion or static analysis tools is deeply unfortunate. As a result of Python's mass adoption, uncountable thousands of hours of developer time have been wasted hunting down bugs that would have been caught by tooling in any statically-typed language.
Unless Python can fix this, by adopting static typing, it's not entirely fit for purpose in the ecosystem niche it has come to dominate. Replacing Python with a better language would be a net productivity boon for developers and scientists almost everywhere.
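For concreteness, this is the class of bug meant here; tooling in a statically-typed language rejects it before it runs, while stock Python happily executes it and fails at runtime:

    def mean(values):
        return sum(values) / len(values)

    mean("not a list")  # runs, then raises TypeError deep inside sum()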
I think you're hitting on the main pain point of Julia. My bet is that we'll see strong typing dominate in the coming decade, with assistance from linters/static analysis becoming default.
Unfortunately, this is not too easy to create for Julia. Much easier than for Python, sure, but not too easy. Tellingly, the Julia community apparently does not care for static analysis and has not begun to develop the capacity for it. As a consequence, much Julia code is written in a way that is un-analyzable, which digs it deeper into weak typing.
If you want to read csvs fast in Python, you should consider using the Apache Arrow[0] csv reader. Depending on your number of CPU cores it can be 10x-20x as fast as the native pandas reader. [1]
More broadly, because Arrow is cross platform it can give you similar performance in many languages. And once the dataframe is in memory, you can share it between languages with no need for serialisation and deserialisation.
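In case it's useful, basic usage is only a couple of lines (the file name is hypothetical):

    import pyarrow.csv as pacsv

    table = pacsv.read_csv("data.csv")   # multithreaded by default
    df = table.to_pandas()               # convert only if you need pandas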
I agree; if I needed to parse CSVs in python and could utilize the arrow format, I would definitely use pyarrow.
I actually recently finished support for reading/writing the arrow format in Julia (https://github.com/JuliaData/Arrow.jl), and it's automatically integrated with the CSV.jl package; so you can do `Arrow.write("data.arrow", CSV.File("data.csv"))` and convert a csv file to arrow format directly. I'm very bullish on arrow as a standard binary data format for the future.
Pandas is pretty slow, and since it loads everything into memory it can be totally infeasible for even relatively small data sets. The csv module is what one should compare it to, imo.
Apache Arrow reads csvs into memory in Arrow format, not pandas format. They are independent libraries that do different things. In Arrow, it is possible to read a csv in batches, obviating memory problems. See 'incremental reading' - http://arrow.apache.org/docs/python/csv.html
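A sketch of the incremental path; handle() is a placeholder for whatever per-batch work you do:

    import pyarrow.csv as pacsv

    reader = pacsv.open_csv("big.csv")        # streaming reader
    while True:
        try:
            batch = reader.read_next_batch()  # one pyarrow.RecordBatch
        except StopIteration:
            break
        handle(batch)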
And fast! I got easily 3 orders of magnitude speedup. I used to load CSVs with pandas and do typical filtering operations... oh boy, some of the stuff would take a whole morning, not to mention the memory usage. I do all that in minutes now.
The vroom R package is likely the fastest R package in read speed, not the ones used in this article. vroom was 13.1x faster than R's fread/data.table in the read-performance benchmark at https://cran.r-project.org/web/packages/vroom/vignettes/benc... so it may be similar to or faster than Julia in read speed.
vroom is lazy, though: it doesn't parse until you access the data. That's a very different approach, which is why it's not a direct comparison. You'll note that data.table, which is compared in the article, is faster than vroom's default configuration when it comes to whole-table scans, which would be the appropriate comparison. If the data is mostly numeric, then data.table is faster in all configurations. That said, if you don't plan to access most of the data in a data set, then vroom's lazy approach can be a big win.
> The tools used for benchmarking were BenchmarkTools.jl for Julia, microbenchmark for R, and timeit for Python.
Is this a meaningful comparison? These benchmarks repeatedly call the function (with the same input file) many times and then take the average of the time spent. However, in the real world, we only read a specific file ONCE -- if we call the csv reader several times, it is almost always for different files.
There are two potential issues that this might disregard:
1. cold file cache
2. JIT compile time
The cold/hot file cache issue affects all languages equally, so it doesn't invalidate the comparison. It is possible that the CSV file is not in cache and a system's disk I/O is slow enough to become the bottleneck, which would make all parsers equally slow. However, this is not very realistic, because CSV files (especially ones large enough that you care about speed) are almost always compressed, so you're not going to be bottlenecked on disk reads. Assuming compressed disk read + decompression is fast enough to keep the data flowing, you're back to CSV parsing being the bottleneck.
The JIT compile time only affects Julia. The reason it is not included in the benchmark results is that it is a small, fixed overhead: it is only paid on the first CSV file read (no matter the size), and if you read a larger data set, the compile time does not increase. Since the point of benchmarks is typically to project from a smaller case how long even larger tasks would take, you don't want to include small, fixed overheads.
For some concrete numbers, I just timed reading a tiny CSV file on my MacBook Pro 2018 and the first read, including compile time took 4 seconds. The second read took 0.000347 seconds. So that's the fixed overhead we're talking about here for reading a CSV file: about 4 seconds on the very first CSV file you read. People can be the judge of whether that's a showstopper for them or not.
4 seconds if one doesn't use PackageCompiler, but I'd argue using PackageCompiler for something like this (and Plots) is fairly standard now. I probably update my basic package compiles once every month (which is often because I work on a lot of packages of course), and many of the SciML users report updating PackageCompiler sysimages every few months. With that, the basic compile times are gone. Given that as a "new standard", we might as well include some times with compile times in a user-improved environment.
I think this is an important point: if you look at the warmup run in [1], the Julia CSV readers are slower than many of the R and Python packages. It makes the claims of 10-20x faster quite disingenuous in my opinion. A lot of data science work will only read a file once per session, and what these benchmarks actually suggest for that sort of work is not reflected in the title.
Frankly, all these implementations are disappointingly slow. Daniel Lemire and I wrote simdjson to use SIMD to read JSON quickly - CSV is strictly easier than that. I made a start on simdcsv but got bored, but the same principles would apply. IMO this task should be doable at 1-2GB/s on a single core; it's not something that should really need multiple cores.
Agreed. But doing the SIMD optimizations could quickly become a 2-3 week project for me, just for one platform. Adding NEON would be another week or two. It needs much more dedication, I guess.
Your work on simdjson is awesome! Thanks for saving the world so much unnecessary power consumption :)
Thanks for the kind words. I suppose this confirms my general suspicion that principles of simdjson need to be lifted out into a more general tool that allows people that don't live and breathe SIMD (strange that such people exist :-) ) to still get the benefit of these techniques for formats that aren't JSON.
I built a pagerank implementation in Julia many years ago. It was operating on an adjacency list with half a billion edges, and I had to resort to writing a custom reader to get reasonable read speeds over the csv. In case anyone is interested: https://github.com/purzelrakete/Pagerank.jl/blob/070a193826f...
Java made things fast at the macro level by making threading possible. Remember trying to do any kind of portable multi-core in C/C++ in 2000? It was no surprise JVM languages took over and became what we've built Big Data on (Hadoop, Hive, Spark, etc.)
But they leave so much on the table for micro performance! Take a CSV or JSON parser for example - likely spends all its time extracting string objects!
Now the Big Data community mostly doesn’t notice; for example, the Parquet code is about 5x slower than the Orc code, apparently, but we all adopt Parquet anyhow.
I've been going through my Java codebases doing the basic stuff like "let's write a simple fast JSON parser that doesn't make a lot of garbage", and you can easily halve the number of boxes you need, etc.; sometimes you get order-of-magnitude improvements by writing Java code that is more like C.
But there’s this level you cannot reach. You look longingly at the raw mechanical sympathy C enabled.
Perhaps Rust is the future? I expect we’ll just keep using more cores than necessary though.
Take a deep look at Julia! Adding multithreaded parsing capabilities to CSV.jl was really a joy; basically just chunking up the file and spawning threaded tasks to process each chunk with the existing parsing code.
My favorite favorite thing about developing in Julia is the ability to write "high-level" type code and usually get decent performance, BUT THEN have the ability to fine tune for that extra boost of performance. The code inspection tools (see lowered, typed, LLVM, or native IR code levels) and super nice Task-based multithreading really make performance tuning a delight.
> Julia will be the killer lang for building web apps
That would be fun, but Julia's community aren't web devs. Julia grew up around the very specific needs of scientific computing, which is characterised by short-running jobs or script-style interpreter sessions. A web server is a long-running process. I'm not knowledgeable enough about Julia to tell how well it lends itself to server use, but I've heard hearsay that it's problematic.
It is quite common for scientists and engineers to want to present the findings of their work or share them with others who are non-coders. Thus you get tools like Jupyter, Shiny, Pluto, etc.
There are several long running Julia processes in real applications. While there's no doubt that there is work to be done, the set of these applications is not the null set either.
I don't think you can define scientific computing in a way that excludes quantum mechanics, molecular dynamics, fluid dynamics, Monte Carlo simulations... There are lots of long-running applications. We don't want them to be long-running, but it's not unusual for a run to take weeks and not unheard-of for it to take months.
Seconded. If you look at the Julia website (https://julialang.org), in the Ecosystem section, they list these things:
"Visualization, General Purpose, Data Science, Machine Learning, Scientific Domains, Parallel Computing."
They already positioned themselves in scientific computing battleground. Web development (or desktop programming) is not their focus (for better or worse).
> Once ecosystem for web dev matures Julia will be the killer...
Looking back at history, javascript was put together in haste, almost like an afterthought. Even the name was a stupid marketing tack-on.
Yet it stuck and kept going even when people became sick of it. It has reinvented itself several times over, and in spite of the emergence of multiple superior technologies specifically targeted at web development, it still stands at the top of the hill.
I guess it would be properly ironic that something designed for mathematics would take down javascript for web-dev, but I doubt it. It would be amusing though.
Julia is a good bit easier to pick up than Rust. I like Rust, btw, but at my job the chances of us adopting Rust would be close to 0. Julia is really easy to pick up, so it would be a much easier sell once the ecosystem is mature.
I'd have thought all these languages would hit the bottleneck of disk read speed. So how can Julia be that much faster? Are the others really that inefficient?
The biggest difference in these benchmarks comes down to how multiple threads are leveraged (disclaimer: primary CSV.jl author here).

In CSV.jl, it was relatively straightforward to add multithreaded parsing support: we chunk up the file, find row starts, then spawn a threaded task to process each chunk. In data.table (fread), they are constrained by R having a global string intern store, which fundamentally restricts multithreading by requiring strings to be interned sequentially. In pandas, there are similar nontrivialities in jumping between the C++ source and the Python object world, which adds a lot of complexity (and probably explains why no one has made the effort to do multithreaded parsing).

It's interesting to note, however, that the Apache Arrow project (the pyarrow package in Python) has a multithreaded csv parser integrated with the project. It provides similar performance to CSV.jl because it was built from the ground up to process chunks on multiple threads into "arrow tables". Unfortunately, it's tied very specifically to the arrow project, so pandas doesn't benefit from its work!

It's one of my favorite features of Julia that CSV.jl automatically integrates with other "data table formats"; i.e. ODBC.jl, SQLite.jl, MySQL.jl, Arrow.jl, DataFrames.jl, etc. They can all leverage CSV.jl as "their csv parser" because it's seamlessly integrated via well-established "table" APIs.
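For readers who want the shape of that idea outside Julia, here is a deliberately naive Python sketch of chunked parallel parsing: split the byte range, snap each split forward to a newline, and parse the chunks in worker processes. It ignores newlines inside quoted fields (exactly the subtlety discussed elsewhere in this thread) and assumes a hypothetical headerless file.

    import io
    import os
    from concurrent.futures import ProcessPoolExecutor

    import pandas as pd

    def chunk_offsets(path, nchunks):
        # Split [0, size) into ranges, snapping each split forward to the
        # next newline so no row straddles two chunks. Deliberately naive:
        # a newline inside a quoted field will corrupt the split.
        size = os.path.getsize(path)
        bounds = [0]
        with open(path, "rb") as f:
            for i in range(1, nchunks):
                f.seek(size * i // nchunks)
                f.readline()                 # skip forward to a row start
                pos = f.tell()
                if bounds[-1] < pos < size:  # drop empty/duplicate chunks
                    bounds.append(pos)
        bounds.append(size)
        return list(zip(bounds[:-1], bounds[1:]))

    def parse_chunk(args):
        path, (start, stop) = args
        with open(path, "rb") as f:
            f.seek(start)
            raw = f.read(stop - start)
        return pd.read_csv(io.BytesIO(raw), header=None)

    if __name__ == "__main__":
        path = "data.csv"  # hypothetical headerless file
        jobs = [(path, rng) for rng in chunk_offsets(path, 4)]
        with ProcessPoolExecutor() as pool:
            df = pd.concat(pool.map(parse_chunk, jobs), ignore_index=True)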
Because it is not a bottleneck. Modern SSDs can read data at more than 500 megabytes per second, which, on a single thread, gives you only about 6-8 clock cycles to process each byte (a 3-4 GHz core delivers 3-4 billion cycles per second; divided by 500 million bytes per second, that's 6-8 cycles per byte). That's not a lot of CPU instructions, so it will take quite a number of threads to handle this bandwidth.
The SSD in my current workstation tops out at 5GB/s sequential reads and 680kIOPS. That's a modestly high-end consumer M.2 drive, not enterprise exotica. SSDs are really, really fast.
Without having looked at it, it might also be a case that the other libraries spend more time handling more obscure corner cases and doing more robust error handling.
I think there's a strong chance that Swift or Rust take a lot of Python's data science cake. Both of them are extremely fast, concrete, and have lots of investment being poured into numerics and fitting into the Python/ML ecosystem.
I don't think Julia or R are going to steal this away. In fact, I think Julia is a major turn off to engineers with some of the bizarre choices they made (eg. 1-based indexing to appease math-centric backgrounds). If you don't excite engineers, you're not going to get as many library and optimization contributions.
Julia feels like it came too late in the game and the newer entrants are making quick strides in the mathematics space.
I could be totally wrong. This is just my take, and I have no stake in this fight other than I want a good, safe, highly-compatible, and fast ecosystem to work with.
Not sure what you mean about being a turn-off to engineers. I'm an electronic engineer currently working on production data engineering pipelines (which has no relation to the engineering in my degree) and I quite like the language. 1-based index is such a small "problem" compared to actual problems you encounter when developing software that is not even worth commenting in this context.
I work at a polyglot company, and parts of the pipeline have different demands. For low latency and fast development speed we use Elixir, for high throughput we use Scala, and for data analytics in batch jobs we use Python. As I see it, Julia is a language that is more concise and faster than Python for the number crunching I do, and it has the potential to match Scala's throughput, all while having multithreading/distributed support that could eventually rival the Erlang VM (not in latency, but in speed and ease of use) and Scala's Akka.
So as an engineer I'm excited about the potential (a single language you can develop in as fast as Python that could compete with some of the top tiers in different areas of data engineering), even if it's not there yet. It needs to complete its multithreading story (structured parallelism so you can safely monitor and restart threads, helping reliability with no risk of leaks, plus some library that works at the same level of abstraction as OTP/Akka), it needs native support for the infrastructure (Kafka, Prometheus), and it needs solid support for the web (the last two can be done right now with Julia 1.5). For now it's an amazing language for exploratory data analysis and research, but I can't wait for the moment I can safely recommend it as the backend of my company's data infrastructure.
I guess we're talking about a different kind of engineer, but I work mostly with mechanical/aerospace engineers and they're more comfortable with 1-based indexing like in Matlab.
Complaining about 1-based indexing in Julia is so... 4 years old! Just use OffsetArrays, and you can use 0 based, or whatever base floats your boat (start underwater with -5, for example!)
It’s doing parallel parsing, but I’m pretty sure their technique won’t work for all inputs. Namely, they try to hop around the CSV data and chunk it up, and then parse each chunk in a separate thread AIUI. But you can’t do this in general because of quoting. If you read the code around where I linked, you can see they try to be a bit speculative and avoid common failures (“now we read the next 5 rows and see if we get the right # of columns”), but that isn’t going to be universally correct.
It might be a fair trade off to make, since CSV data that fails there is probably quite rare. But either I’m misunderstanding their optimization or they aren’t being transparent about it. I don’t see this downside anywhere in the README or the benchmark article.
I wouldn't call that cheating; how else would you multithread this? You basically have to chunk it up and try to find a boundary, and from what I can tell, they try to detect whether they're in a quoted block. If they are, they find the end of the quoted block (or EOF, which shouldn't happen => broken file) or ask the user to report the bug. This is failsafe behavior, not cheating.
Multi-threading is a means, not an end. It's cheating if the optimization sacrifices correctness, especially in a meaningful way that is not documented clearly.
You cannot parallelize csv parsing by chunking like this. The format prohibits it, full stop. This is why xsv comes with an `xsv index` command for example.
Like I said, either I'm misunderstanding the optimization they are doing or they aren't being transparent. From a glance at the code, it looks like the latter to me.
(As others have mentioned, you can parallelize csv parsing, but it requires multiple passes. Which is basically what `xsv index` is. It lets you choose when to do the first pass, and then benefit in performance in all subsequent runs.)
Use a fast "frontier" thread to determine quote state, and split the work at safe boundaries.
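The core of that check is just quote parity. A sketch of the idea (a real scanner must also handle escaped "" quotes and configurable quote characters):

    def split_is_safe(data: bytes, offset: int) -> bool:
        # True if `offset` lies outside any quoted field, judged by
        # counting quote characters from the start of the buffer.
        in_quotes = False
        for byte in data[:offset]:
            if byte == ord('"'):
                in_quotes = not in_quotes
        return not in_quotes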
I'm personally okay with the speculative execution as long as its result is correct, and I wouldn't consider it cheating. But it is concerning that the benchmark doesn't list all the caveats (as all proper benchmarks should do, like [1], which is incidentally written by the person you are replying to). I can't even determine from the OP whether the machine is I/O bound!
You can write slow programs in all languages. There are many cases where Julia's language features enable higher performance than in comparable C libraries.
Python's CSV reader is actually decently fast. It's just that speed wasn't a priority.
I dislike the whole "faster than C" comparisons. Almost nothing is. If you want speed you chose C. If you want something that is plenty fast with a much better speed-to-effort ratio, Julia is a strong contender.
Exactly. Nothing that is also high-level is faster than C. If something appears to be, then the C version is not equivalent to the code it is being compared with. Or are there any examples where this is not the case? If you want performance in your language, you typically write those parts in C, and if it is in C and still not fast enough, you typically go for inline assembly.
Would you sum it up for me and give me examples? Which programming language is generally faster than C that is also high-level? Why do you think people go for C if they want performance? In practice it seems like it is the fastest, popular, stable high-level language out there that has been around for decades.
Maybe Forth, whose compiler is written in assembly, but given its type system (or lack thereof), I doubt the same optimizations can be performed.
The overhead in all Python JSON parsers isn't parsing the JSON, it's building the Python objects that represent each element. If Julia has lower overhead objects, or JITs with basic optimizations, it would be trivial to be faster than pysimdjson.
Julia has multi-stage compilation: first the source is lowered to the AST (all macros are resolved), then it is lowered to an IR, then types are inferred, then it's lowered to an SSA-form IR, then to LLVM IR, and finally to machine code [1] (and its JIT will try to pre-compile as much code as it can; in some situations it might even compile the entire program in one go, which is the cause of the delay when starting a program). Everything that runs is always machine code, as there is no interpreter or VM (though you might say that Julia's bytecode is the LLVM IR).
Python is "compiled" to python bytecode, which is interpreted by the python runtime (as in, the python runtime looks up what a function should do, does that thing, then checks the next piece of code).
Julia is compiled to first julia-IR, then LLVM IR and finally raw machine code (assembly, x86, those things) which is just run like any other compiled language. It's not interpreted.
That's not true; Julia is compiled down to assembly like any regular C program would be. It just goes through LLVM, and most code is compiled interactively when you use the REPL, which is why you might think it's interpreted.
The problem with CSV files is the endless edge cases.
What I'll typically do for larger ones is that I'll write a program that does the reading and the rudimentary processing in C, and then saves the data as a binary .npy file in /dev/shm (or ImDisk if I'm on windows) and load that up in Python. Massage the data more if necessary and finally do the plotting. I found that this helps, hope it helps you too
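On the Python side, loading is then a cheap memory-map rather than a parse (here the producer is simulated with np.save; in my setup it's the C program):

    import numpy as np

    # Producer side (stand-in for the C program writing the .npy file).
    np.save("/dev/shm/data.npy", np.random.rand(10_000_000))

    # Consumer side: memory-map straight out of tmpfs, no parsing.
    data = np.load("/dev/shm/data.npy", mmap_mode="r")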
This is great. I'm definitely going to keep Julia in mind the next time my languages self-review comes up, in about 2030. Could even happen as early as 2029.
If you only learn Julia in 2029 you will miss out on top-notch productivity for another 9 years! Please reconsider, for your own sake. If not that, then for your whisky.
This is neat. That said, to me, this is like comparing car startup times. Even if a Tesla starts faster than a Toyota, they're both fast enough not to notice 99.999% of the time. This isn't a factor in a purchasing decision -- I'd never even thought of it before contriving this metaphor.
I'm glad Julia does exist. It's great to have a free and open-source programming language focused on scientific computing.
What I dislike from the Julia community is the unhealthy fixation they have with Python and R. Virtually no comment or blog post is written without mentioning how slow, inefficient, inappropriate, or inelegant Python and R are. Somehow, the Julia community convinced itself that the best way to attract more users is to constantly put down Python and R. A sad state of affairs, definitely.
Completely agree. I was turned off by two things when Julia started getting hyped. The first was the benchmarks they reported, which were, shall I say, carefully selected. This is a perfect example. As someone else pointed out, R has much faster ways to read csv files than what they used. The thing is, most of the time it doesn't make any difference at all. The other was the nonstop attacks on other languages, many of which represented the ignorance of the critic (not unrelated to the first point) rather than deep insight.
For the historical record, when we picked those benchmarks [1] we were terrible at them. Like 10,000x slower than C. We picked them because they were representative of the kinds of things that we felt a language should be good at: iteration, recursion, shuffling data in arrays, floating-point operations, complex number arithmetic, string operations, small matrix operations and large matrix operations. Then we worked on the language until it was no worse than 2x slower than C. So claiming that we cherry-picked those benchmarks because they made Julia look good simply isn't historically accurate: we picked them because they made Julia look bad, and then we improved Julia until it didn't look bad anymore.
The only faster way to read CSV files in R that I'm aware of is vroom, which is faster because it's lazy: it doesn't parse until you actually access the data. According to vroom's own benchmarks, if you access all the data, it is slower than data.table, which is compared here. So yes, if you want to load a CSV file and not access most of the data, then vroom is faster; if you want to access all of the data (which seems pretty typical), it is not faster.
Regarding "nonstop attacks on other languages", honestly, we don't care and would be happy to stop talking about other languages. However, there is a constant barrage of people making comparisons to other languages and demanding that we justify Julia's existence. Like it is literally offensive to people that Julia exists — "why didn't you just use Python/R/Matlab?!??!" I cannot express how continual this is. So many people have asked to see these exact benchmarks comparing CSV.jl to read.table and Pandas' csv reader. So here are the benchmarks. If you see a mildly worded benchmark comparison showing that one CSV library is faster than some others as "an attack", then you are too good for the internet.
Unfortunately, you are right, but mostly when discussing programming languages on the Internet. In forums like this, we (Julians) often want to make a case for Julia. Since most soon-to-be Julians come from Python, R, or Matlab, it's natural to argue for Julia by arguing why Python and R are deficient.
It's similar to how it's hard to argue for Rust without mentioning that C is unsafe. If C users kept insisting that the unsafe behaviour of C was not a real issue, Rustaceans would seem a lot more negative towards C.
It can make huge amounts of difference in a production system; where I work, we process terabytes of csv data every day; saving minutes per file can add up to enormous differences in CPU cost/time for a production system running 24/7.
I agree that for a data scientist doing exploratory analysis locally on their computer, it doesn't make nearly as much a difference (also because they're usually not working on crazy large files).
The performance work in the CSV.jl package (that the article is about) was very much geared towards these kinds of production scenarios.
Right - sounds like you have more of a production support role vs. a data analysis workflow kind of task. Tacking on "exploratory" is helpful, but I'm still concerned that you misuse the overall concept of analysis. It's a decision-making task, which is practically the opposite of production support.
>Tacking on "exploratory" is helpful but I'm still concerned that you misuse the overall concept of analysis. It's decision-making task, which is practically the opposite of production support.
Why should the exploratory and production teams be using completely different tools? That seems like it would cause frictions in productivity and make there be gaps that introduce translation errors. I would venture to say that just having the exploratory and production teams working using the same code base is a very strong productivity gain, and we've seen this is true in many companies.
One is R&D, and one is operations. Obviously there are situations where it makes sense to combine them, but most successful companies with any real technology to sell will not treat R&D this way.
Unfortunately, almost exclusively via HTTP REST APIs. It's not great, but it's the lowest common denominator across the vast "ingestion" surface we've built (connectors to web APIs, a local application for local file upload, raw API endpoints, etc.).
We've started exploring the apache arrow format as a compressible binary format with a dedicated wire format just to cut down on parsing processing costs.
Then it goes both ways, right? I think we also need to say "For most people Python is fine; if you need that extra speed, try Julia" instead of blanket praise of Julia as a true competitor to Python (it doesn't even come close IMO).