Show HN: Work with CSV files using SQL. For data scientists and engineers (superintendent.app)
99 points by tanin on July 18, 2021 | 61 comments

The way I love doing these kinds of things is by literally using three lines of R code:


    library(sqldf)
    tab1 = read.csv("file1.csv")
    sqldf("select * from tab1")

sqldf [1] is a library with which you can access R dataframes just like tables in SQL (it actually uses SQLite in the background). I do not have much experience with programming in R itself, but you barely need it. This approach has the advantage that it is very flexible, and you get the power of (IMHO) the best plotting library, ggplot, with it. Of course, you can also do more complicated stuff like joining multiple dataframes and basically everything else you can do with SQLite, and then store the results back in R dataframes. This workflow works if you use an IDE that lets you execute single lines or blocks of code, like RStudio. Then you also get a nice GUI with it, but there are also plugins for Vi/Emacs that work very well.

[1] https://cran.r-project.org/web/packages/sqldf/index.html

EDIT: code formatting

Check out the R bindings for DuckDB[0]. You should find that it does the same thing (i.e. run SQL against a dataframe/file on disk) much faster for many SQL operations.

[0] https://duckdb.org/docs/api/r

Came here just to recommend DuckDB! :-) Huge fan. It's unreasonably fast for how easy it is to use.

If you’re going to use R you might as well learn a bit of dplyr or data.table.

You get that autocomplete goodness, and the vast majority of SQL can be reduced down to single lines.

Great to see SQL being used more for such use cases! (I really like SQL)

This looks nice, especially with the charting!

It's good to know though that there are a few similar open source tools already:

- q[0] - SQL on CSV and TSV

- TextQL[1] - SQL on CSV and TSV

- Datasette[2] - Running SQL on various datasets in various formats, by (if I understand it correctly) importing it to sqlite and running the query there. This one's the closest one to the posted project I think, with interactive data exploration and publishing built in.

- OctoSQL[3] - SQL on files like CSV, JSON as well as non-file databases, endless streams with temporal SQL extensions and joining everything together in a single query.

The performance numbers of this tool, however, are impressive, if they check out with real data. 1GB in 10 seconds is fast. Looking at the documentation, though, it seems most of the heavy lifting is done by SQLite under the hood. From the docs:

> Under the hood, Superintendent loads CSVs into a SQLite database.

Disclaimer: I'm the author of OctoSQL. Still actively developing a rewrite on the redesign branch.

[0]: https://github.com/harelba/q

[1]: https://github.com/dinedal/textql

[2]: https://datasette.io

[3]: https://github.com/cube2222/octosql

I've been looking at calling ".import" in SQLite directly as an optimization for my sqlite-utils tool - I'm seeing a big speed-up compared to parsing the CSV in Python: https://github.com/simonw/sqlite-utils/issues/297

Hi! I'm the OP here.

The performance gain is mainly from reading/writing files in C and/or by SQLite (which is also C).

Any other tool that uses a dynamic language like Python will immediately take minutes to handle a GB-sized file (initializing strings alone is slow).

The main difference is essentially GUI vs command-line. Datasette is the closest, but it seems to require the command line (based on the usage page). I haven't tried it yet, so I may be wrong here.

OctoSQL looks very interesting.

Initially, I was trying to use DuckDB but couldn't make it work on Windows.

SQLite's SQL dialect is somewhat lacking (hello, WITH RECURSIVE). It is one of the gaps I would like to close.

Odd that DuckDB didn't work for you on Windows! I only use it on Windows and love it!

My use case is a bit more complex. I have to compile it with an Electron app, etc.

I filed a few GitHub issues with them as well.

This has great potential. The dialect is much more comprehensive than SQLite's.

Impressive work and a clear README with good usage examples. Kudos!

Seems useful, but at the same time its comparison with loading CSV into SQL seems kind of disingenuous: if you're familiar with SQL, you're probably familiar enough with the command line to load a CSV into SQLite (which is free in both senses).

I probably model it off myself too much.

I know PostgreSQL very well... but I rarely use psql. I still use pgAdmin.

That said, I never load CSVs into PostgreSQL. I still try very hard to load CSVs into Excel.

My personal opinion is that people have their preferences. If they prefer a command-line tool, Superintendent is probably not for them.

So it's a GUI wrapper around SQLite. I guess I expect data scientists and engineers to be comfortable with the command line and scripting, and there are GUIs that work with SQLite already, so I'm having a hard time figuring out the target audience for this.

It's the in-betweeners who are the ideal target audience: analysts (including me a few years back) who know SQL (through Tableau/BigQuery) but aren't comfortable enough with the CLI to set up their own DB. I already sent this link to someone I know who can use it.

Ah, makes sense. I guess I'm surprised that people would know SQL and not be comfortable with the CLI. But then I haven't worked with the software you describe (Tableau/BigQuery, etc.).

Yeah, I don't know whether this is new or not, but there are a lot of positions where people are supported by their data teams to run SQL queries for their business needs without otherwise being technical: web BigQuery, PowerBI, Tableau, Metabase, etc.

In some cases, they still need to do stuff outside of the prepared tables/platforms (further analysis, other data sources) but have no terminal/CLI knowledge, let alone how to import CSVs into SQLite and query them. This app is great for extending their SQL skills to CSV files.

This open source tool on GitHub allows you to quickly import a CSV and write SQL queries against it. The CSV is imported into a SQLite database. It also does a lot more; highly recommended.


Thanks for this. I've been doing it manually all along.

Perl's DBD::CSV module is a classic in this genre (the first version is from 1998) - https://metacpan.org/pod/DBD::CSV

> A python script loading a 1GB file is likely to take more than 1 minutes on any machine.

Since when? I just tested it, and I can process a 1GB CSV file on my machine using the csv module in Python in ~10 seconds. My machine isn't a world-record-setting machine, either.
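(For reference, a streaming read with the stdlib csv module never materializes the whole file, which is how this kind of pass stays fast; a minimal sketch, with the filename being hypothetical:)

```python
import csv

def count_rows(path):
    """Stream a CSV row by row; csv.reader never holds the whole file
    in memory, which is why this stays fast even for GB-sized files."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return sum(1 for _ in reader)
```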

This is OP. That is amazing.

I tried reading a 1GB CSV and adding all the rows into SQLite, and it definitely takes minutes. This happens in both JS and Python.

Constructing a giant INSERT statement should already take more than a minute (as would transforming any 1GB-sized string).

Would you mind sharing your Python snippet?
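(For reference, the usual fast path in Python avoids building a giant INSERT string entirely: a parameterized executemany inside a single transaction. A minimal sketch, with the file path and table name being illustrative:)

```python
import csv
import sqlite3

def load_csv(path, db=":memory:", table="tab1"):
    """Load a CSV into SQLite without ever building a giant INSERT string."""
    con = sqlite3.connect(db)
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        con.execute(f'CREATE TABLE "{table}" ({cols})')
        # executemany reuses one prepared statement; wrapping it in a
        # transaction avoids a disk sync per row.
        with con:
            con.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', reader
            )
    return con
```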

Not sure how you'd do this from Python, but you can fire up sqlite3 from the command line and run

    .mode csv
    .import filename.csv table_name
On my two-year-old Windows laptop (i7, Lenovo X1), importing a 2M-row, 1.4GB file takes about 70 seconds.

So not 10 seconds, but no need to create an insert statement. (I'm sure you can access this functionality from Python if you need to.)
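(You can; a sketch that drives the sqlite3 CLI from Python over stdin, so the CSV parsing happens in C. Paths are illustrative and assumed not to contain spaces:)

```python
import subprocess

def import_csv(db_path, csv_path, table):
    """Feed dot-commands to the sqlite3 CLI so the .import runs in C."""
    commands = f".mode csv\n.import {csv_path} {table}\n"
    subprocess.run(["sqlite3", db_path], input=commands,
                   text=True, check=True)
```

If the table does not exist yet, the sqlite3 CLI creates it and uses the CSV's first row as column names.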

I use a 2020 MacBook Pro, so it is probably faster. Also, 1.4GB is quite different from 1GB. These factors can multiply the run time.

Superintendent uses the same approach (i.e. use the CSV extension).

But I don't think we should call that Python... since it essentially just calls C code in SQLite.

It seems interesting, but what is the advantage over just \copy in Postgres from a CSV file - or better yet, import into SQLite without even providing the table definition?

The advantage would be that you do not have your data in two places and can avoid all the problems and confusion that can result.

How does this avoid having data in two places? You still have to have a CSV file to import the data...

You're saying that viewing your csv data via a SQL query is having it in two places. I don't think that's the normal concept of having multiple copies of data.

Not me. Are you responding to parent poster?

The difference is between GUI and command-line.

I work with a lot of CSV files, and for some reason I rarely load them into Postgres or use a Python script. I still try very hard to use Excel.

There is some degree of convenience that GUI offers over command-line, though some people still prefer command line.

I use the command line all day, it definitely has its place. But trying to manipulate large CSV files has never felt like a good fit for the CLI.

I'm already using and enjoying this tool a lot. I used to do this in DBeaver (also a great tool), but I had to manually specify the types in a JDBC string to get all the goodies of SQL. This so far has been much nicer.

I have already made this a daily tool. Thanks!

Any feedback or suggestion is very welcome! Please don't hesitate to email me at tanin@superintendent.app

Which is likely how this works under the hood.

OP here.

Yes! We use SQLite under the hood with modified C extensions, so it is as fast as it can be. (We just try not to touch JS code for this.)

However, I have recently learned that SIMD-based parsing is even faster than plain C, but I haven't explored it yet.

The datasette author offers this tool for conversion: https://github.com/simonw/csvs-to-sqlite

Also https://sqlite-utils.datasette.io/en/stable/cli.html#inserti... which comes with a lot of other useful tools for manipulating the schema, adding indexes, and so on after you've imported the data.

The tools in these replies also offer similar functionality (in terms of running sql queries against csv data) https://news.ycombinator.com/item?id=27763129

Cool tool!

pandasql is great for doing SQL on pandas dataframes. https://pypi.org/project/pandasql/

DuckDB is a newer but much faster alternative that lets you run SQL on CSV files (and also directories of Parquet files, etc.)

It's so fast/good that I've actually been using it for a lot of data cleaning and transformation that previously I'd have done in Pandas.


Really cool and smooth! One killer feature here would be auto-translating queries to pandas for people to copy snippets over to their Jupyter notebooks.

You don't want to know how many times I've been asked to do exactly that.

Other than that, one can do this (somewhat clumsily) with Postgres using file_fdw, like:

    CREATE FOREIGN TABLE mytable (
        datetime timestamp(3) with time zone,
        name text,
        score integer
    ) SERVER myserver OPTIONS (
        filename 'data/file1.csv',
        format 'csv'
    );

and then query as usual. More docs here: https://www.postgresql.org/docs/13/file-fdw.html

You can query CSV directly without a DBMS backend using clickhouse-local. [1] It's often used to clean data exactly as this article describes. You can also load CSV very easily into tables.

Superintendent.app seems to bring automatic schema definition to the table. We've discussed adding this to ClickHouse. Would be a great PR if anyone is interested.

[1] https://altinity.com/blog/2019/6/11/clickhouse-local-the-pow...

(This note brought to you by the ClickHouse evangelization task force.)

Interested, but writing and compiling a large new C++ codebase when I haven't written C++ in years isn't something I quite have time for at the moment :)

For anyone else with more time to roll up their sleeves, https://github.com/ClickHouse/ClickHouse/blob/b0ddc4fb30f1a0... would be the place to split the header row, and perhaps use some heuristics on the first data row to identify datatypes!
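(The heuristic could be as simple as trying progressively stricter parses on a sample value. An illustrative Python sketch, not ClickHouse code, with ClickHouse-style type names assumed:)

```python
def guess_type(value: str) -> str:
    """Guess a column type from one sample value by trying the
    strictest parse first: Int64, then Float64, falling back to String."""
    try:
        int(value)
        return "Int64"
    except ValueError:
        pass
    try:
        float(value)
        return "Float64"
    except ValueError:
        return "String"
```

A real implementation would look at more than the first data row, but this is the shape of the heuristic.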

OP here. It is not that automatic. We simply import every column as TEXT, which seems to work well enough :)

This area interests me, but being completely honest, there are a few off-putting grammatical/spelling errors. Might be worth getting them cleaned up.

Thank you for your feedback.

I'm a non-native English speaker, so could you point out a few places that you feel are off-putting?

I would really, really love to fix those.

Very neat! I'm also working on an (open-source/core) app for running SQL against any data (copy-pasted stuff, data files, logs, HTTP responses, other SQL requests) and being able to script and graph as well.

A difference might be that the goal I have for DataStation is to aid more in data exploration (but without being limited to smaller datasets, as this site points out Google Sheets and Excel are). Some optimizations, like streaming all data (and streaming through scripts), are currently out of scope, and that will definitely limit the scale DataStation can handle.

It's primarily to help out the case where you have a bunch of different data and you want to join/filter/group and graph the result for some business/product/engineering executive team.


If your work is more OLAP-focused, DuckDB has good tools for loading CSV files. https://duckdb.org/docs/data/csv

DuckDB is like a columnar SQLite. It also has great support for Pandas DataFrames.

Loading and running SQL on structured text files (.CSV) is definitely an underserved data tools niche, but this statement threw me: "No more custom and slow Python script. No more vlookup, no more pivot tables." Wow, I'll keep my slow Python/Pandas script thank you.

No license keys required with H2 http://www.h2database.com/html/tutorial.html#csv

Looks really nice. In what language is the app itself written? If it's a web app, I'm curious to know what table component you are using.

Also, for anyone looking to edit simple CSV files in an Excel-like manner, please check out https://editcsvonline.com, a free tool that I made, powered by DataGridXL (which I also made).

You can do this with bit.io, although it's not local: import your CSV (drag and drop, or browse to upload) and get a Postgres database in the cloud. Full disclosure: I'm a founder at bit.io.

I love to see anything that makes it easier to use data!

I do this already with sqlite3

If I open a CSV with pandas in debug mode, I'm also pretty much free to play with whatever transformations I want, without being restricted by SQL syntax. Saved as Parquet, reloads are also pretty much instantaneous.

Not everyone is sold on, and proficient in, those tools. Honestly, whenever I think I need pandas, I try to get into it and spend an hour in the atrocious docs before giving up and realizing I could have done that "simple" grouping or filtering or graphing with other tools and been done already by now, even though it would have "taken longer".

The same goes for all these NoSQL DBs. E.g., I realize afterwards that I spent 2 hours researching how to do an effing join between two different documents; time I could have spent deciding on a relational entity model and putting down the ORM boilerplate for it. /minirant

Is this similar to https://github.com/harelba/q ?

This reminds me of https://lnav.org

I like that this lets you join CSVs, which a lot of tools in this space don’t.

To do this in Easy Data Transform:

- drag 2 CSV files onto the Center pane

- click 'Join'

- select the column in each CSV to join on in the right pane


Depending on the size of the CSV files, you can do this in a few seconds (plus ~1 minute to install).

Why isn't this open-sourced?

I'm old enough to know I should give a pass to anything "free while in beta"...
