    library(sqldf)  # sqldf lets you run SQL queries against R data frames (via SQLite)
    tab1 <- read.csv("file1.csv")
    sqldf("select * from tab1")
sqldf is a library with which you can access R data frames just like tables in SQL (it actually uses SQLite in the background). I don't have much experience with programming in R itself, but you barely need it. This approach has the advantage that it is very flexible, and you get the power of the (IMHO) best plotting library, ggplot, along with it. Of course, you can also do more complicated things like joining multiple data frames, and basically everything else you can do with SQLite, and then store the results back in R data frames. This workflow works best if you use an IDE that lets you execute single lines or blocks of code, like RStudio. Then you also get a nice GUI with it, but there are also plugins for Vi/Emacs that work very well.
EDIT: code formatting
You get that autocomplete goodness, and the vast majority of SQL can be reduced down to single lines.
This looks nice, especially with the charting!
It's good to know though that there are a few similar open source tools already:
- q - SQL on CSV and TSV
- TextQL - SQL on CSV and TSV
- Datasette - Running SQL on various datasets in various formats by (if I understand it correctly) importing them into SQLite and running the query there. This one's the closest to the posted project, I think, with interactive data exploration and publishing built in.
- OctoSQL - SQL on files like CSV and JSON, as well as non-file databases and endless streams, with temporal SQL extensions and joining everything together in a single query.
The performance numbers of this tool, however, are impressive, if they check out with real data. 1GB in 10 seconds is fast. That said, looking at the documentation, it looks like most of the heavy lifting is done by SQLite under the hood. From the docs:
> Under the hood, Superintendent loads CSVs into a SQLite database.
Disclaimer: I'm the author of OctoSQL. Still actively developing a rewrite on the redesign branch.
The performance gain comes mainly from reading/writing the files in C and/or in SQLite (which is also C).
Any tool that uses a dynamic language like Python will immediately take minutes to handle a GB-sized file (initializing the strings alone is already slow).
The main difference is essentially GUI vs. command line. Datasette is the closest, but it seems to require the command line (based on the usage page). I haven't tried it out yet, so I may be wrong here.
OctoSQL looks very interesting.
Initially, I was trying to use DuckDB but couldn't make it work on Windows.
SQLite's SQL dialect is somewhat lacking (hello, WITH RECURSIVE). That's one of the gaps I would like to close.
I filed a few GitHub issues with them as well.
This has great potential. The dialect is much more comprehensive than SQLite's.
I know PostgreSQL very well... but I rarely use psql. I still use pgAdmin.
That said, I never load CSVs into PostgreSQL. I've still tried very hard to load CSVs into Excel instead.
My personal opinion is that people have their preferences. If they prefer a command-line tool, Superintendent is probably not for them.
In some cases, they still need to do work outside of the prepared tables/platforms (further analysis, other data sources) but have no terminal/CLI knowledge, let alone the knowledge to import CSVs into SQLite and query them there. This app is great for letting them extend their SQL skills to CSV files.
Since when? I just tested it, and I can process a 1GB CSV file on my machine using the csv module in Python in ~10 seconds. My machine isn't a world-record-setting machine, either.
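A pass over a large CSV with the csv module looks roughly like this (a minimal sketch; the file name and the counting are assumptions, just to show the shape of it):

    import csv

    # Stream the file once, counting rows and cells; "data.csv" is an assumed name.
    rows = cells = 0
    with open("data.csv", newline="") as f:
        for row in csv.reader(f):
            rows += 1
            cells += len(row)
    print(rows, "rows,", cells, "cells")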
I tried reading a 1GB CSV and adding all the rows into SQLite, and it definitely takes minutes. This happens in both JS and Python.
Constructing a giant INSERT should already take more than a minute (as would transforming any 1GB-sized string).
Would you mind sharing your Python snippet?
    .import filename.csv table_name
So not 10 seconds, but there's no need to create an insert statement. (I'm sure you can access this functionality from Python if you need to.)
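For example, one way to reach that from Python is simply to shell out to the sqlite3 CLI (a rough sketch; the database, file, and table names here are assumptions):

    import subprocess

    # Let the sqlite3 shell do the fast CSV import; names are illustrative.
    subprocess.run(
        ["sqlite3", "data.db", ".mode csv", ".import file1.csv tab1"],
        check=True,
    )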
Superintendent uses the same approach (i.e., it uses the CSV extension).
But I don't think we should call that Python... since it essentially just calls C code in SQLite.
I work with a lot of CSV files, and for some reason I rarely load them into Postgres or use a Python script. I still try very hard to use Excel.
There is some degree of convenience that a GUI offers over the command line, though some people still prefer the command line.
I'm already using and enjoying this tool a lot. I used to do this in DBeaver (also a great tool), but I had to manually specify the types in a JDBC string to get all the goodies of SQL; this has been much nicer so far.
I have already made this a daily tool, thanks!
Yes! We use SQLite under the hood with modified C extensions, so it's as fast as it can be. (We just try not to touch the JS code for this.)
However, I have recently learned that SIMD code can be even faster than plain C, but I haven't explored it yet.
pandasql is great for doing SQL on pandas DataFrames.
It's so fast/good that I've actually been using it for a lot of the data cleaning and transformation that I'd previously have done in Pandas.
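For anyone who hasn't seen it, usage looks roughly like this (a small sketch; the DataFrame and column names are made up):

    import pandas as pd
    from pandasql import sqldf

    # Hypothetical example data.
    orders = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 25.5, 7.25]})

    # Run SQL directly against the DataFrame; table/column names are illustrative.
    result = sqldf("SELECT id, amount FROM orders WHERE amount > 10", locals())
    print(result)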
You don't want to know how many times I've been asked to do exactly that.
Other than that, one can do this (somewhat clumsily) with Postgres using file_fdw, like:
    CREATE EXTENSION file_fdw;
    CREATE SERVER myserver FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE mytable (
      datetime timestamp(3) with time zone
      -- ...remaining columns...
    ) SERVER myserver
    OPTIONS ( filename 'data/file1.csv', format 'csv' );
and then query as usual. More docs here: https://www.postgresql.org/docs/13/file-fdw.html
Superintendent.app seems to bring automatic schema definition to the table. We've discussed adding this to ClickHouse. Would be a great PR if anyone is interested.
(This note brought to you by the ClickHouse evangelization task force.)
For anyone else with more time to roll up their sleeves, https://github.com/ClickHouse/ClickHouse/blob/b0ddc4fb30f1a0... would be the place to split the header row, and perhaps use some heuristics on the first data row to identify datatypes!
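To make the idea concrete, a toy sketch of that kind of header-plus-first-row heuristic (illustrative Python, not ClickHouse code; the column names and types are made up):

    # Guess a column type from a sample value by trying progressively looser parses.
    def guess_type(value: str) -> str:
        for caster, name in ((int, "Int64"), (float, "Float64")):
            try:
                caster(value)
                return name
            except ValueError:
                pass
        return "String"

    header = "id,price,name"
    first_row = "1,9.99,widget"
    schema = {col: guess_type(val)
              for col, val in zip(header.split(","), first_row.split(","))}
    print(schema)  # {'id': 'Int64', 'price': 'Float64', 'name': 'String'}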
I'm a non-native English speaker, so I would like to ask if you can point out a few places that you feel are off-putting.
I would really, really love to fix those.
A difference might be that the goal I have for DataStation is to aid more in data exploration (without being limited to smaller datasets, as this site points out Google Sheets and Excel are). Some optimizations like streaming all data (and streaming through scripts) are currently out of scope, and that will definitely limit the scale DataStation can handle.
It's primarily to help out the case where you have a bunch of different data and you want to join/filter/group and graph the result for some business/product/engineering executive team.
DuckDB is like a columnar SQLite. It also has great support for Pandas DataFrames.
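For example, querying a DataFrame might look something like this (a small sketch; the data and column names are made up):

    import duckdb
    import pandas as pd

    # Hypothetical DataFrame; DuckDB can query it once it's registered as a view.
    df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "sales": [3, 5, 7]})

    con = duckdb.connect()
    con.register("df", df)
    print(con.execute("SELECT city, sum(sales) AS total FROM df GROUP BY city").fetchdf())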
Also, for anyone looking to edit simple CSV files in an Excel-like manner, please check out https://editcsvonline.com, a free tool that I made, powered by DataGridXL (which I also made).
I love to see anything that makes it easier to use data!
The same goes for all these NoSQL DBs. E.g., I realized afterwards that I'd spent 2 hours researching how to do an effing join between two different documents. Time I could have spent deciding on a relational entity model and putting down the ORM boilerplate for it. /minirant
- drag 2 CSV files onto the center pane
- select the column in each CSV to join on in the right pane
Depending on the size of the CSV files, you can do this in a few seconds (plus ~1 minute to install).