Hacker News

I think the answer to your question is that, until relatively recently, it was not possible to run arbitrary SQL against data frames in either language. sqldf and duckdf pass data off to another program (albeit in-process), run the query there, and pull the result set back to R, paying data movement costs along the way. Tidyquery is neat and avoids that issue, but it's also only about 18 months old.



I agree that Tidyquery and duckdf are quite new, but sqldf has been around for a long time. The first package upload to CRAN was in 2007.

Also, duckdf uses duckdb[1] to register a virtual in-memory table, so there's very little overhead for data transport.

[1] https://duckdb.org/docs/api/r


While sqldf has been around for a while, it's not been a performant choice, since it requires transforming and moving data into a database, albeit one that's in process. So with something like

    df <- data.frame(a=rnorm(5e8), b=runif(5e8))
    sqldf::sqldf("select count(*) from df where a > 3")
works, but takes ~200 seconds and doubles the memory footprint of the process, whereas

    nrow(df[df$a > 3, ])
    sum(df$a > 3)
take ~1.5s and ~1s, respectively, on my machine.

I appear to have been too pessimistic about duckdf/duckdb, though. Its docs[1] claim that registering an R data.frame as a table

> does not actually transfer data into DuckDB yet.

That implied to me that it just deferred the transfer until the user ran a query. However, in a basic test just now:

    library(DBI)
    con <- dbConnect(duckdb::duckdb(), ":memory:")
    duckdb::duckdb_register(con, "iris_view", iris)
    duckdb::duckdb_register(con, "df_view", df)
    dbGetQuery(con, "select count(*) from df_view where a > 0")

it appears to execute the query directly against the data frame. At least, it runs in ~2.2s and doesn't allocate more memory. Cool! As you've noted, though, it's very new - looks like they released 0.1.0 last May?

I think the point stands: until very recently, SQL-on-dataframes was not a viable choice for anyone working at even moderate scale in R or Python, so the preference has been for APIs exposed by libraries (pandas, data.table, dplyr, ...) that offered users reasonable performance, even if SQL would have been a more ergonomic interface.

[1] https://duckdb.org/docs/api/r


Yeah, I agree sqldf is quite slow. Fair point.

As you've seen, duckdb registers an R data frame as a "virtual table." I'm not sure what they mean by "yet" either.

Of course, it is possible to write an R data frame to an on-disk duckdb table, if that's what you want to do.
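In case it's useful, here's a minimal sketch of that on-disk route (assuming the DBI and duckdb packages; the table name is my own invention, and I use a temp file for illustration):

```r
library(DBI)

# Connect to a duckdb database file on disk (temp file here, just for illustration)
con <- dbConnect(duckdb::duckdb(), dbdir = tempfile(fileext = ".duckdb"))

# Materialize the data frame as a real table inside duckdb (copies the data once)
dbWriteTable(con, "iris_tbl", iris)

# Query it like any other table; note the quoted column name containing a dot
n_long <- dbGetQuery(con, 'select count(*) from iris_tbl where "Sepal.Length" > 5')

dbDisconnect(con, shutdown = TRUE)
```

With a real file path instead of a temp file, the table would persist across R sessions, unlike a duckdb_register virtual table.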

There are some simple benchmarks at the bottom of the duckdf README[1]. Essentially, I found that for basic SQL SELECT queries dplyr is quicker, but for much more complex queries the duckdf/duckdb combination performs better.
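For a concrete sense of what's being compared, here's a hedged sketch (synthetic data and a query of my own, not the README's actual benchmark) of a basic filter-and-count in dplyr:

```r
library(dplyr)

# Made-up data, seeded for reproducibility
set.seed(1)
df <- data.frame(a = rnorm(1e5), b = runif(1e5))

# dplyr version of: select count(*) from df where a > 1
res <- df %>% filter(a > 1) %>% count()
```

For a one-off query like this, the pipeline reads about as clearly as the SQL; the README's point is about what happens as the queries get more complex.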

If you really want speed of course, just use data.table.
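For the record, a minimal sketch of the same count in data.table (again with made-up data):

```r
library(data.table)

# Made-up data, seeded for reproducibility
set.seed(1)
dt <- data.table(a = rnorm(1e5), b = runif(1e5))

# Filter-and-count evaluated in optimized C, with no intermediate copy of the data
n <- dt[a > 1, .N]
```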

[1] https://github.com/phillc73/duckdf


Didn't realize duckdf was your package. Cool! Definitely something I'll consider for future work, though I spend more time on the Spark-and-Python side of the fence these days.

If you wanted to add corresponding memory benchmarks, the value-prop of duckdf might be clearer to those of us who have been scarred by sqldf :).


Sounds like it could be an interesting comparison. I'll look into it.



