Hacker News

I agree with your conclusion but want to add that switching from Julia may not make sense either.

According to these benchmarks: https://h2oai.github.io/db-benchmark/, DF.jl is the fastest library for some things, data.table for others, polars for others. Which is fastest depends on the query and whether it takes advantage of the features/properties of each.

For what it's worth, data.table is my favourite to use and I believe it has the nicest ergonomics of the three I spoke about.




Indeed, DataFrames.jl isn't and won't be the fastest way to do many things. It trades some performance for flexibility. The columns of a DataFrame can be any indexable array, so while most examples use 64-bit floating point numbers, strings, and categorical arrays, the nice thing about DataFrames.jl is that arbitrary-precision floats, pointers to binaries, etc. are all fine inside a DataFrame without any modification. Compare this to the restricted set of Pandas datatypes (https://pbpython.com/pandas_dtypes.html). I'm quite impressed by the DataFrames.jl developers given how they've kept it dynamic yet still achieved pretty good performance. Most of that comes from smart use of function barriers to keep the dynamism out of the core algorithms. But from that knowledge it's clear that systems could exist that outperform it even with the same algorithms - in some cases by only tens of nanoseconds, but in theory that bump is always there.
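To make the Pandas comparison concrete, here's a quick pandas sketch (my own illustration, not from any DF.jl docs) of how a non-native column type falls back to the generic object dtype rather than staying a first-class typed column:

```python
import pandas as pd
from decimal import Decimal

# Native numeric data gets a fast, fixed-width dtype...
floats = pd.Series([1.0, 2.0, 3.0])
print(floats.dtype)  # float64

# ...but arbitrary-precision values fall back to the generic "object"
# dtype: each element is a boxed Python object, checked at runtime.
decimals = pd.Series([Decimal("1.0"), Decimal("2.0")])
print(decimals.dtype)  # object
```

In DF.jl terms, the claim above is that a column of arbitrary-precision numbers stays a normally typed column instead of degrading like this.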

In the Julia world, the package that optimizes for being fully non-dynamic is TypedTables (https://github.com/JuliaData/TypedTables.jl), where all column types are known at compile time, removing the dynamic-dispatch overhead. But the minor performance gain of TypedTables versus the major flexibility loss is why you pretty much never hear about it. Probably not even worth mentioning, but it's a fun tidbit.
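A loose Python analogy for that trade-off (purely illustrative, not TypedTables.jl's actual design): a column whose element type is fixed up front can use homogeneous, unboxed storage, while a fully dynamic column boxes every element and re-checks its type on access:

```python
import numpy as np

# Type fixed up front, like a TypedTables column: homogeneous, unboxed storage.
typed_col = np.array([1.0, 2.0, 3.0])  # dtype float64, no per-element checks

# Fully dynamic column: every element is a boxed object of arbitrary type.
dynamic_col = np.array([1.0, "two", 3], dtype=object)

print(typed_col.dtype, dynamic_col.dtype)
```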

> For what it's worth, data.table is my favourite to use and I believe it has the nicest ergonomics of the three I spoke about.

I would be interested to hear what you find useful about the ergonomics of data.table. If there are ideas that DataFrames.jl could learn from data.table directly, I'd be happy to share them with the devs. Generally when I hear about R, people talk about the tidyverse. Tidier (https://github.com/TidierOrg/Tidier.jl) is making big strides in bringing a tidy syntax to Julia, and I hear it has had some rapid adoption and happy users, so there are ongoing efforts to apply the lessons of R APIs, but I'm not sure if anyone is looking directly at the data.table parts.


> Indeed DataFrames.jl isn't and won't be the fastest way to do many things

Agreed, and the DF.jl developers are aware of and very open about this fact - the core design trades speed for flexibility and user-friendliness (while of course trying to be as performant as possible within those constraints).

One thing that hasn't been mentioned so far is InMemoryDatasets.jl, which as far as I know is the closest thing to polars in Julia-land, in that it sits at a different point on the flexibility-performance curve, closer to the performance end. It's not very widely used as far as I can tell, but it could be interesting for users who need more performance than DF.jl can deliver - some benchmarks from early versions suggested performance on par with polars: https://discourse.julialang.org/t/ann-a-new-lightning-fast-p...


> Tidier

I have not tried it. I like that the project makes broadcasting invisible; I dislike that it tries to completely replicate R's semantics and the tidyverse's syntax. Two examples: firstly, the tuples-vs-scalars thing doesn't seem very Julia to me. Secondly, I love that DF.jl has :column_name and variable_name as separate syntax; Tidier.jl drops this convention (from what I see in the readme).

> I'm not sure if someone is looking directly at the data.table parts

I believe there was some effort to make an i-j-by syntax in Julia but it fell through or stopped getting worked on. By this syntax I mean something like:

  # An example of using i, j, and by
  @dt flights [
    carrier == "AA",
    (mean(:arr_delay), mean(:dep_delay)),
    by = (:origin, :dest, :month)]

  # An example of expressions in by
  @dt flights [_, nrows, by = (:dep_delay > 0, :arr_delay > 0)]

The idea of i-j-by (as I understand it) is that it has a consistent structure: row selection/filtering comes before column selection/filtering, and is optionally followed by "by" and then other keyword arguments which augment the data that the core "i-j" operations act upon.
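For readers who don't know data.table, a rough pandas equivalent of the first example above (the `flights` table and its columns are assumed here; the mapping of i/j/by to pandas operations is my own illustration):

```python
import pandas as pd

flights = pd.DataFrame({
    "carrier":   ["AA", "AA", "UA"],
    "origin":    ["JFK", "JFK", "LGA"],
    "dest":      ["LAX", "LAX", "ORD"],
    "month":     [1, 1, 2],
    "arr_delay": [10.0, 20.0, 5.0],
    "dep_delay": [3.0, 7.0, 1.0],
})

# i: row filter; by: grouping keys; j: columns to aggregate
result = (
    flights[flights["carrier"] == "AA"]                    # i
    .groupby(["origin", "dest", "month"], as_index=False)  # by
    [["arr_delay", "dep_delay"]]
    .mean()                                                # j
)
print(result)
```

The point of the i-j-by form is that all three pieces live in one bracketed expression instead of a method chain.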

data.table also has some nifty syntax like

  data[, x := x + 1]                 # update in place
  data[, x := x / nrow(.SD), by = y] # .SD refers to the subset of data for the current group

which makes it more concise than dplyr.
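In pandas terms (again a rough analogue of my own, not a claim about data.table internals), those two lines correspond to an in-place column update and a per-group transform:

```python
import pandas as pd

data = pd.DataFrame({"x": [2.0, 4.0, 6.0], "y": ["a", "a", "b"]})

# data[, x := x + 1] -- update the column in place
data["x"] = data["x"] + 1

# data[, x := x / nrow(.SD), by = y] -- divide x by the size of its y-group
data["x"] = data["x"] / data.groupby("y")["x"].transform("size")
print(data["x"].tolist())  # [1.5, 2.5, 7.0]
```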

The conciseness and structure of data.table - and its tendency, thanks to some well-informed syntax choices, to need much less code than comparable tidyverse transformations - make it nicer for me to use.


> I would be interested to hear what about the ergonomics of data.table you find useful. if there are some ideas that would be helpful for DataFrames.jl to learn from data.table directly I'd be happy to share it with the devs.

Personally, my main usability gripe is that it's difficult to do row-wise transformations that combine multiple columns by name. I know one can do

  transform(df, AsTable(:) => foo ∘ Tables.NamedTupleIterator)

But this is 1) kind of wordy and 2) can come with enormous compile times (making it unusable) for wide tables.
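For comparison, a pandas sketch of the same pattern - applying a row-wise function that sees columns by name (`foo` is a made-up stand-in):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# foo sees each row as a named tuple, so it can combine columns by name.
def foo(row):
    return row.a + row.b

df["c"] = [foo(row) for row in df.itertuples(index=False)]
print(df["c"].tolist())  # [11, 22]
```

This is the convenience the Julia one-liner above buys, at the cost of verbosity and compile time.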


I really hope people don't come from R to Julia. People who use R are not good programmers, and they will degrade the core of the language and its principles. It would be a shame to see the equivalent of tacking on 6 different object-oriented systems to a base language and fragmenting the community completely.


I'm not sure I'd have the same take. Yes, R as a language is kind of wonky, and people who use R tend not to be good programmers. However, the APIs of some packages are designed well enough that even with all of those barriers R can still be easy for many scientists to use. I wouldn't copy the language - 6 different object systems and non-standard evaluation are weird - but there is a lot to learn from the APIs of the tidyverse and how it has somehow been able to cover for all of those shortcomings. It would be great to see those aspects in the data science libraries of the Julia ecosystem.


It might surprise you to learn that Julia actively relies on code written in/for R to perform computations. You might be surprised to find that people who can write R can also write C++, C, and other languages of their choosing. You also might be surprised to learn that some of the most vetted statistical code exists in the R ecosystem. If I were recruiting for a niche language with a weak ecosystem, I'd personally take all the help I could get. You can learn Julia with a background in any other programming language in a few weeks... The same can't be said about martingales... But you get to choose your strategy here...


And thus we who transitioned to Julia from R and know a bit about martingales and less about programming have long been trying to degrade the core of the language and its principles by making `mean` a Base function.


R users in the form of statisticians should definitely come around to Julia - more high-quality packages never hurt. I agree with the concern about fragmentation and 'object systems', but I don't think this is a huge danger for Julia.


duckdb's fork of the benchmark, updated 2023.04 (the h2oai version is from 2021.06): https://duckdblabs.github.io/db-benchmark/

repo: https://github.com/duckdblabs/db-benchmark



