One-liner for running queries against CSV files with SQLite (simonwillison.net)
747 points by jdblair on June 21, 2022 | 127 comments



SQLite's virtual table API (https://www.sqlite.org/vtab.html) makes it possible to access other data structures through the query engine. You don't need to know much, if anything, about how the database engine executes queries; you only need to implement the callbacks it needs to do its job. A few years ago I wrote an extension to let me search through serialized Protobufs which were stored as blobs in a regular database.

https://github.com/rgov/sqlite_protobuf


I like the virtual table API a lot, but it has some serious drawbacks. You don't need to know much, and indeed you can't know much, about the execution engine, even if that knowledge would help you. Many parts of the query are not pushed down into the virtual table.

For instance, if the user query is:

SELECT COUNT(*) FROM my_vtab;

... the query your virtual table will effectively see is:

SELECT * FROM my_vtab;

SQLite does the counting. That's great, unless you already know the count and could have reported it directly rather than actually returning every row in the table. You're forced to retrieve and return every row because you have no idea that it was actually just a count.

As another example, if the user query includes a join, you won't see the join. Instead, you will receive a series of N queries for individual IDs, even if you could have more efficiently retrieved them in a batch.
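A rough sketch of the effect (table and column names here are hypothetical):

    -- What the user writes:
    SELECT * FROM local_orders o JOIN my_vtab v ON v.id = o.customer_id;

    -- What the virtual table effectively sees, once per row of local_orders,
    -- with ? bound to each customer_id in turn:
    SELECT * FROM my_vtab WHERE id = ?;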

The join one is particularly nasty. If you're writing a virtual table that accesses a remote resource with some latency, any join will absolutely ruin your performance as you pay a full network roundtrip for each of those N queries.

I wrote a module that exposes remote SQL Server/PostgreSQL/MySQL servers as SQLite virtual tables, and joins basically don't work at all if your server is not on your local network. There's nothing I can do about it (other than heuristically guessing what IDs might be coming and request them ahead of time) because SQLite doesn't provide enough information to the virtual table layer. It's my understanding that PostgreSQL's foreign data wrappers (a similar feature to SQLite's virtual tables) push much more information about the query down to the wrapper layer, but I haven't used it myself.


> I wrote a module that exposes remote SQL Server/PostgreSQL/MySQL servers as SQLite virtual tables, and joins basically don't work at all if your server is not on your local network. There's nothing I can do about it (other than heuristically guessing what IDs might be coming and request them ahead of time) because SQLite doesn't provide enough information to the virtual table layer.

Is it possible to wait until all of the queries are in and then do some query planning at your end to resolve this? Or is it entirely synchronous with no way to do that?


Unfortunately it's synchronous. The interface requires you to return the rows for the first key before it gives you the second one, and so on.


You also can't add new columns using ALTER statements. I really like virtual tables in SQLite, but it would be nice if the documentation included some limitations and reasons not to use them.


And in fact there is a CSV virtual table available from SQLite, but it's not built into the normal client: https://www.sqlite.org/csv.html

It really should be, as the code is tiny and this functionality is not overly exotic.
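For anyone who hasn't tried it, usage looks roughly like this once you've compiled the extension (paths and file names are placeholders, per the csv.html page):

    sqlite> .load ./csv
    sqlite> CREATE VIRTUAL TABLE temp.t1 USING csv(filename='thefile.csv');
    sqlite> SELECT * FROM t1;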


In my experience [1], the CSV virtual table was really slow. Doing some analysis on a 1,291 MB file, a query took 24.7 seconds using the virtual table vs 3.4 seconds if you imported the file first.

The CSV virtual table source code is a good pedagogical tool for teaching how to build a virtual table, though.

[1]: https://cldellow.com/2018/06/22/sqlite-parquet-vtable.html


https://github.com/liquidaty/zsv/blob/main/app/external/sqli... modifies the sqlite3 virtual table engine to use the faster zsv parser. I have not quantified the difference, but in all tests I have run, `zsv sql` runs faster (sometimes much faster) than other sqlite3-on-CSV solutions mentioned in this entire discussion (unless you include those that cache their indexes and then measure against a post-cached query). Disclaimer: I'm the main zsv author.


What are the differences between the zsv and csv parsers?

Is csvw with linked data URIs also doable?


Not sure what you mean by csvw. But, zsvlib is a CSV parser, and zsv is a CLI that uses zsvlib. zsv also uses the sqlite3 vtable based on the example in the original sqlite3 code, but it modifies it to use the zsvlib parser instead of the original CSV parser. The zsv parser is different from most CSV parsers in how it uses SIMD operations and minimizes memory copying. zsvlib parses CSV based on the same spec that Excel implements, so having data URIs in the CSV is fine, but if you wanted a compound value (e.g. text + link), you would need to overlay your own structure into the CSV text data (for example, embed JSON inside a column of CSV data). Not sure that answers your question but if not, feel free to add further detail and I'll try again...


zsvlib does SIMD CSV like simdjson? Sweet.

Zsvlib on cloudfuzz would be good, if that's not already the case.

Yeah linked data schema support is distinct from the parser primitives and xsd data type uris, for example.


Dang, that's a bummer -- thanks for the intel, as well as the article / Parquet code.


## /? sqlite arrow

- "Comparing SQLite, DuckDB and Arrow with UN trade data" (2021) https://news.ycombinator.com/item?id=29010103 ; partial benchmarks of query time and RAM requirements [relative to data size] would be

- "Introducing Apache Arrow Flight SQL: Accelerating Database Access" (2022) https://arrow.apache.org/blog/2022/02/16/introducing-arrow-f... :

> Motivation: While standards like JDBC and ODBC have served users well for decades, they fall short for databases and clients which wish to use Apache Arrow or columnar data in general. Row-based APIs like JDBC or PEP 249 require transposing data in this case, and for a database which is itself columnar, this means that data has to be transposed twice—once to present it in rows for the API, and once to get it back into columns for the consumer. Meanwhile, while APIs like ODBC do provide bulk access to result buffers, this data must still be copied into Arrow arrays for use with the broader Arrow ecosystem, as implemented by projects like Turbodbc. Flight SQL aims to get rid of these intermediate steps.

## "The Virtual Table Mechanism Of SQLite" https://sqlite.org/vtab.html :

> - One cannot create a trigger on a virtual table.

Just posted about eBPF a few days ago; opcodes have costs that may or may not be accounted for: https://news.ycombinator.com/item?id=31688180

> - One cannot create additional indices on a virtual table. (Virtual tables can have indices but that must be built into the virtual table implementation. Indices cannot be added separately using CREATE INDEX statements.)

It looks like e.g. sqlite-parquet-vtable implements shadow tables to memoize row group filters. How does JOIN performance vary amongst sqlite virtual table implementations?

> - One cannot run ALTER TABLE ... ADD COLUMN commands against a virtual table.

Are there URIs in the schema? Mustn't there thus be a meta-schema that does e.g. nested structs with portable types [with URIs], (and jsonschema, [and W3C SHACL])? #nbmeta #linkedresearch

## /? sqlite arrow virtual table

- sqlite-parquet-vtable reads parquet with arrow for SQLite virtual tables https://github.com/cldellow/sqlite-parquet-vtable :

  $ sqlite/sqlite3
  sqlite> .eqp on
  sqlite> .load build/linux/libparquet
  sqlite> CREATE VIRTUAL TABLE demo USING parquet('parquet-generator/99-rows-1.parquet');
  sqlite> SELECT * FROM demo;
  sqlite> SELECT * FROM demo WHERE foo = 123;
  sqlite> SELECT * FROM demo WHERE foo = '123'; -- incurs a severe query plan performance regression without immediate feedback

## Sqlite query optimization

`EXPLAIN QUERY PLAN` https://www.sqlite.org/eqp.html :

> The EXPLAIN QUERY PLAN SQL command is used to obtain a high-level description of the strategy or plan that SQLite uses to implement a specific SQL query. Most significantly, EXPLAIN QUERY PLAN reports on the way in which the query uses database indices. This document is a guide to understanding and interpreting the EXPLAIN QUERY PLAN output. [...] Table and Index Scans [...] Temporary Sorting B-Trees (when there's not an `INDEX` for those columns) ... `.eqp on`
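For example, with the taxi table from the article loaded, you could turn it on and re-run the query (the plan output varies by SQLite version, so none is shown here):

    sqlite> .eqp on
    sqlite> SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count;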

The SQLite "Query Planner" docs https://www.sqlite.org/queryplanner.html list Big-O computational complexity bound estimates for queries with and without prexisting indices.

## database / csv benchmarks

- https://h2oai.github.io/db-benchmark/


  sqlite3 :memory: -cmd '.mode csv' ...
It should be a war crime for programs in 2022 to use non-UNIX/non-GNU style command line options. Add it to the Rome Statute's Article 7 list of crimes against humanity. Full blown tribunal at The Hague presided over by the international criminal court. Punishable by having to use Visual Basic 3.0 for all programming for the rest of their life.


The initial release of sqlite was in 2000. Yes, well after GNU-style command line options existed but not by much.


22 years is a long time to deprecate bad command line arguments.


The SQLite team idolizes backwards compatibility. Which is admirable, but at the same time it means we are not getting strong static typing of tables before 2035.

Shame. Even bigger shame is that nobody else is taking the torch.


They added strong typing of tables a few months ago as an opt-in feature, hence keeping things backwards compatible: https://www.sqlite.org/stricttables.html


Yep, I know. Wish they went even further but it seems like a good start.


The documentation should focus on the 'best' way to accomplish the goal, reducing the number of people who encounter and repeat the old way. Doing anything else is a self-fulfilling prophecy. Of course people are still using the old way, when some of them didn't even start using it until after you added the new way.


> well after ... but not by much.

This seems a contradiction to me. Was it "well after", or after "but not by much"?


Those -cmd flags are actually unnecessary; you can do:

    sqlite3 :memory: '.mode csv' '.import taxi.csv taxi' \
    'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count'


Let me introduce you to bazel ...

bazel build //:--foobar --//::\\


I'm not sure what that does, but it reminds me of GNU Parallel's interface, which, although smart, is not very intuitive or memorable in my experience.


Go has been doing the same thing for a decade now:

https://godocs.io/flag

Personally I'm fine with it. The whole "let's combine five single-letter options into one string" thing always smacked of excess code golf to me.


My VMS users say the same thing. Wow, do I not want that to happen. I am not a fan of DCL.


Would :memory: even parse in some shells?


Why wouldn't it? I'm not aware of colons being a special character in any shell I can think of


Colons are a little bit special in Bournish shells, since they're a delimiter in variable assignments after which tilde expansion is performed, and some shells extend this behavior to command-line arguments.

Frankly, I don't find it outside the realm of possibility that there's some combination of options that will make :memory: misparse on a popular shell, I just don't know of any...


Using ClickHouse you can also process local files in one line, using the clickhouse-local command-line tool. And it will look a lot easier:

clickhouse local -q "SELECT passenger_count, COUNT(*), AVG(total_amount) FROM file(taxi.csv, 'CSVWithNames') GROUP BY passenger_count"

And ClickHouse supports a lot of different file formats both for import and export (you can see all of them here https://clickhouse.com/docs/en/interfaces/formats/).

There is an example of using clickhouse-local with taxi dataset mentioned in the post: https://colab.research.google.com/drive/1tiOUCjTnwUIFRxovpRX...


Clickhouse-local is incredible. It performs the best of any similar tool I've benchmarked. But the reason I took it out of the linked benchmarks in the OP's post is that it's 2+ GB. That's a massive binary; it's the whole server. I'm not sure you want to be distributing this all over the place in general. It's just not in the same category IMO. Disclaimer: I build another tool that does similar things.


It is shipped with debug info and symbol tables just in case.

Without debug info, it will be 350 MB, and compressed can fit in 50 MB: https://github.com/ClickHouse/ClickHouse/issues/29378

It is definitely a worthwhile improvement.


Aha, thanks for the clarification and link. I'll be following that issue.


It is that big only with debug symbols. If you take packaged version of ClickHouse it will be much smaller. Or just strip large binary manually if you already have it.


Lately I've been using Visidata for any text file that looks like a table or other squarish data source, including JSON.

https://www.visidata.org/


Visidata is wonderful.

Also for querying large CSV and Parquet files, I use DuckDB. It has a vectorized engine and is super fast. It can also query SQLite files directly. The SQL support is outstanding.

https://duckdb.org/

Just start the DuckDB REPL and start querying, e.g.:

    Select * from ‘bob.CSV’ a
    Join ‘Mary.parquet’ b
    On a.Id = b.Id
Zips through multi GB files in a few seconds.


Are the capitalized keywords a requirement? ew (IMO)


No, that’s an artifact of me typing on my phone.

Also if you know SQL you’ll know that SQL keywords are case insensitive in most DBMSes.

Don’t be too quick to ew.


I was ew'ing at capitalized keywords, whether or not they're required. I'll be as quick as I like, thanks.


Yes and they’re not a requirement. Expressing premature disgust without investigation leads to low quality comments. You can do it but it goes against an ethos of curiosity and unfairly diminishes the object of discussion.


Beware that visidata phones home by default:

https://github.com/saulpw/visidata/discussions/940


This looks like my new best friend


Just wait until you discover the file browser ))


One of my all-time favorite (and somehow still-obscure / relatively unknown) tools is called `lnav` ^1. It's a mini-ETL powertool with embedded SQLite, perfect for wrangling log files or other semi-structured data (a few million rows are no problem); it's intuitive and flexible...

1. https://lnav.org


lnav is super cool, but as its name says, it's a log navigator: more of a less/tail/etc. supercharged with sqlite under the hood.

Of course, because it has a flexible format definition it can deal with csv files as well, but its true power is getting sql queries out of nginx log files and the like without the intermediate step of exporting them to csv.


I am a data scientist. I have used a lot of tools/libraries to interact with data. SQLite is my favorite. It is hard to beat the syntax/grammar.

Also, when I use SQLite I do not output using column mode. I pipe to `tv` (tidy-viewer) to get a pretty output.

https://github.com/alexhallam/tv

transparency: I am the dev of this utility


Here is an example of how I would pipe with headers to `tv`.

sqlite3 :memory: -csv -header -cmd '.import taxi.csv taxi' 'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count' | tv


What a nice tool. I love how Rust has reinvigorated command line utilities.


Piping to jq (using JSON mode of course) also works well for this.


Or `jless`, which shows you a tree and doesn't flood your terminal.


Nice, that's definitely better than `jq | less`


Just want to add that Snowflake (IMO) is better. You don't have to suffer SQLite's lack of data types, and honestly Snowflake is the best tool for working with messy data.

Just FYI, you can set up a Snowflake account with a minimum monthly fee of 25 bucks. It'll be very hard to actually use 25 bucks if your data isn't in the 100s of GBs, and you literally use as little compute as is needed, so it's perfect.


This is in no way a relevant good-faith reply. It is spam. A web-based cloud-based data analytics platform isn't in the same category as piping command-line programs together.


We can’t pay for anything that doesn’t have Jira integration, sorry!


Since many people are sharing one-liners with various tools...

OctoSQL[0]:

  octosql 'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi.csv GROUP BY passenger_count'
It also infers everything automatically and typechecks your query for errors. You can use it with csv, json, parquet but also Postgres, MySQL, etc. All in a single query!

[0]:https://github.com/cube2222/octosql

Disclaimer: author of OctoSQL


Hi cube2222!

Just today I tried octosql for the first time when I wanted to correlate a handful of CSV files based on identifiers present in roughly equal form. Great idea but I immediately ran into many rough edges in what I think was a simple use case. Here are my random observations.

Missing FULL JOIN (this was a dealbreaker for me). LEFT/RIGHT join gave me "panic: implement me".

It took me a while to figure out how to quote CSV column names with non-ASCII characters and spaces. It's not documented as far as I've seen (please document quoting rules). This worked:

  octosql 'SELECT `tablename.Mötley Crüe` FROM tablename.csv'
replace() is documented [1] as replace(old, new, text) but actually is replace(text, old, new) just like in postgres and mysql.

index() is documented [1] as index(substring, text)

  (postgresql equivalent: position ( substring text IN string text ) → integer)
  octosql "SELECT index('y', 'Mötley Crüe')"
  Error: couldn't parse query: invalid argument syntax error at position 13 near 'index'
  octosql "SELECT index('Mötley Crüe', 'y')"
  Error: couldn't parse query: invalid argument syntax error at position 13 near 'index'
Hope this helps and I wish you all the best.

[1] https://github.com/cube2222/octosql/wiki/Function-Documentat...


Hey!

Thanks a lot for this writeup!

> but I immediately ran into many rough edges

OctoSQL is definitely not in a stable state yet, so depending on the use case, there definitely are rough edges and occasional regressions.

> Missing FULL JOIN (this was a dealbreaker for me). LEFT/RIGHT join gave me "panic: implement me".

Indeed, right now only inner join is implemented. The others should be available soon.

> replace() is documented [1] as replace(old, new, text) but actually is replace(text, old, new) just like in postgres and mysql. index() is documented [1] as index(substring, text)

I've removed the offending docs pages, they were documenting a very old version of OctoSQL. The way to browse the available functions right now is built-in to OctoSQL:

  octosql "SELECT * FROM docs.functions"
> invalid argument syntax error at position 13 near 'index'

Looks like a parser issue which I can indeed replicate, will look into it.

Thanks again, cheers!


Just to update this response, I've just released a new version that contains a new `position` function, as well as the capability to execute SELECT statements without a FROM part. So, for instance, the following now works:

  octosql "SELECT position('hello', 'ello')"


To add another update here: left, right and outer joins have been added.


Heads up: cube2222 is the original author of this benchmark. :) I copied it and Simon copied my copy of it.


Thanks for the acknowledgement!

Though btw., I think I personally prefer the SPyQL benchmarks[0], as they test a bigger variety of scenarios. This benchmark is mostly testing CSV decoding speed - because the group by is very simple, with just a few keys in the grouping.

[0]:https://colab.research.google.com/github/dcmoura/spyql/blob/...


Actually I have a big issue with that benchmark in that it doesn't ORDER BY. I don't believe all those tools will produce the same result and it's not required by SQL for them to do so.

That doesn't change the poor performance of dsq but it does change the relative and absolute scores in that benchmark.


I love the simplicity. Is there support for joins / use of multiple tables?


Yes, certainly! And those multiple tables can come from different data sources, so files or databases.


Btw, am I alone in thinking that DataFrame abstractions in OOP languages (like Pandas in Python) are oftentimes simply inferior to relational algebra? I'm not sure that many Data Scientists are aware of the expressive power of SQL.


There are loads of things that are not possible or are very cumbersome to write in SQL, but that pandas and many other dataframe systems allow. Examples are dropping null values based on some threshold, one-hot encoding, covariance, and certain data cleaning operations. These are possible in SQL but very cumbersome to write. There are also things related to metadata manipulation that are outright impossible in a relational database.

SQL is super expressive, but I think pandas gets a bad rap. At its core, the data model and language can be more expressive than relational databases (see [1]).

I co-authored a paper that explained these differences with a theoretical foundation[1].

[1] https://arxiv.org/abs/2001.00888


Thanks for sharing this. I believe we essentially agree: chaining method calls is inexpressive compared to composing expressions in an algebraic language.


I'm not defending Pandas but just want to point out that the inability to conveniently compose expressions is one of the biggest problems with SQL, since it was designed to be written as a sort of pseudo-English natural language, in an era when people imagined that it would be used by non-programmers. To be clear, that's a problem with SQL, not with the idea of a language based on relational algebra. There are various attempts to create SQL-alternatives which behave like real programming languages in terms of e.g. composability. This blog post makes the point better than I can:

https://opensource.googleblog.com/2021/04/logica-organizing-...


I absolutely agree - one of the biggest shortcomings of SQL is that its primary programming interface is based on text and intended for humans, instead of being based on data structures and intended for programs.


SQL does not exactly implement relational algebra in its pure form.

SQL implements a kind of set theory with relational elements and a bunch of practical features like pivots, window functions etc.

Pandas does the same. Most data frame libraries, like dplyr etc., implement a common set of useful constructs. There's not much difference in expressiveness. LINQ is another language for manipulating sets that was designed with the help of category theory, and it arrives at the same constructs.

However, SQL is declarative, which provides a path for query optimizers to parse it and create optimized plans. Whereas with chained methods, unless one implements lazy evaluation, one misses out on look-aheads and opportunities to do rewrites.


> There’s not much difference in expressiveness

> However SQL is declarative

Pick one :) The way I see it, if declarativeness is not a factor in assessing expressiveness, then expressiveness reduces to the uninteresting notion of Turing-equivalence.


Expressiveness and declarativeness are different things, no?

Are you talking about aesthetics? I've used SQL for 20 years and it's elegant in parts, but it also has warts. I talk about this elsewhere, but SQL gets repetitive and requires multi-layer CTEs to express certain simple aggregations.


Agree. I've completed data pipelines for several projects and have found that the cleanest, and often fastest solution is to use SQL to structure the data as needed. This is anecdotal and I'm not an expert with SQL, but I haven't come across a situation where R or Pandas dataframes worked better than a well written query for data manipulation. This has the benefit of simplifying collaboration across teams because within my company not everyone uses the same toolset for analysis, but we all have access to the same database. Other tools are better suited to analysis or expansion of the data with input from other sources, but within our own data SQL wins.


Often -- yes. Always -- no.

For example, let's try changing/fixing the sampling rate of a dataset (.resample() in Pandas).

Or something like .cumsum() -- easy with SQL windowing functions, but man they are cumbersome.
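(For reference, a running total in SQLite 3.25+ looks something like this, using a hypothetical pickup_datetime column on the article's taxi table for ordering:)

    SELECT pickup_datetime,
           total_amount,
           SUM(total_amount) OVER (ORDER BY pickup_datetime) AS running_total
    FROM taxi;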

Or quickly store the result in .parquet.

But all the above doesn't matter, because I feel like 99% of Pandas work involves quickly drawing charts on the data to look at it or show to teammates.


Sqawk: an Awk-like program that uses SQL. It can parse, format, filter, and combine data from multiple files. Powered by SQLite.

https://wiki.tcl-lang.org/page/Sqawk


In the past I've used https://github.com/BurntSushi/xsv to query some large CSVs


Over all these CLIs, I prefer zsv ( https://github.com/liquidaty/zsv ) -- then again, I wrote it, so my preference should come as no surprise. On my Mac it was 340% faster than the OP's command, and it does a lot more than just SQL/sqlite3-related operations.


I like using q for querying CSVs on the command line: https://github.com/harelba/q


Same. And I believe q uses SQLite under the hood, so you can use the same SQL syntax as the one supported by SQLite. Joining multiple CSV files is also possible without too much setup. http://harelba.github.io/q/
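A typical invocation against the article's taxi.csv would look something like this (-H tells q the file has a header row, -d sets the delimiter):

    q -H -d , "SELECT passenger_count, COUNT(*), AVG(total_amount) FROM ./taxi.csv GROUP BY passenger_count"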


It's nice that q has caching.... then again, it kind of needs it to solve its performance inefficiency.

Running "select *" on a 1mm-row worldcitiespop_mil file, q takes 27 seconds compared to `zsv sql` which takes 1.7 seconds ( https://github.com/liquidaty/zsv ) and also supports multiple file joins. I'm sure q is faster once cached, but taking a 16x performance hit up-front is not for me


How smart is SQLite at detecting column types from CSV data?

I once wrote a Python script to load csv files into SQLite. It had a whole hierarchy of rules to determine the data type of each column.


It doesn't detect column types automatically-- they are imported as text. You can, however, use math functions on them and sqlite3 will dynamically convert where possible (e.g. "select number_1_stored_as_text + 1 from mytable" will output 2)
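A quick way to see this (table and column names here are hypothetical): typeof() reports the stored type, while arithmetic coerces on the fly:

    sqlite> SELECT typeof(amount), amount + 1 FROM mytable LIMIT 3;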


Using DuckDB [1]:

  duckdb -c "SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi.csv GROUP BY ALL"
DuckDB will automatically infer you are reading a CSV file from the extension, then automatically infer column names from the header, together with various CSV properties (data types, delimiter, quote type, etc). You don't even need to quote the table name as long as the file is in your current directory and the file name contains no special characters.

DuckDB uses the SQLite shell, so all of the commands that are mentioned in the article with SQLite will also work for DuckDB.

[1] https://github.com/duckdb/duckdb

Disclaimer: Developer of DuckDB


A bit clunky, but works.

   ps | awk '$1=$1' OFS=, | duckdb :memory: "select PID,TTY,TIME from read_csv_auto('/dev/stdin')"


I can't figure out how to make this work in bash; it just prints out "select PID,TTY,TIME from read_csv_auto('/dev/stdin')", but split into columns... I'm using CLI v0.4.0 da9ee490d, which seems to be the latest.


How does the column data type inference work? I've run into that challenge myself in the past.


The CSV auto-detector was implemented by Till Döhmen, who did his master's thesis on the subject [1] and has actually written a paper about it [2].

Essentially we have a list of candidate types for each column (starting with all types). We then sample a number of tuples from various parts of the file, and progressively reduce the number of candidate types as we detect conflicts. We then take the most restrictive type from the remaining set of types, with `STRING` as a last resort in case we cannot convert to any other type. After we have figured out the types, we start the actual parsing.

Note that it is possible we can end up with incorrect types in certain edge cases, e.g. if you have a column that has only numbers besides one row that is a string. If that row is not present in the sampling an error will be thrown and the user will need to override the type inference manually. This is generally rather rare, however.

You could also use DuckDB to do your type-inference for you!

  duckdb -c "DESCRIBE SELECT * FROM taxi.csv"
And if you want to change the sample size:

  duckdb -c "DESCRIBE SELECT * FROM read_csv_auto('taxi.csv', sample_size=9999999999999)"
[1] https://homepages.cwi.nl/~boncz/msc/2016-Doehmen.pdf

[2] https://ir.cwi.nl/pub/26416/SSDM1111.pdf


This is fantastic, thanks.

My solution is a lot less smart - I loop through every record and keep track of which potential types I've seen for each column: https://sqlite-utils.datasette.io/en/latest/python-api.html#...

Implementation here: https://github.com/simonw/sqlite-utils/blob/3fbe8a784cc2f3fa...


That works and is similar to what DuckDB does for small CSV files. We have the sampling step primarily for larger CSV files, when you might not want to do two passes over the file. This way we can keep the benefits of streamed file reading while offering type inference that "just works" most of the time without a major performance penalty.


I had to do something very similar for analysing CVE information recently, but I don't remember having to use the :memory: option. I suspect it defaults to that if no .db file is specified.

Slightly tangentially, when doing aggregated queries, SQLite has a very useful group_concat(..., ',') function that will concatenate the expression in the first arg for each row in the group, separated by the separator in the 2nd arg.
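For example, against the taxi table from the article, something like this collapses each group's amounts into a single comma-separated field:

    SELECT passenger_count, group_concat(total_amount, ',')
    FROM taxi
    GROUP BY passenger_count;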

In many situations SQLite is a suitable alternative to jq for simple tabular JSON.


I just tried it without :memory: and it dropped me into the SQLite shell without executing the query:

    % sqlite3 -cmd '.mode csv' -cmd '.import taxi.csv taxi' \
      'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count'
    SQLite version 3.36.0 2021-06-18 18:58:49
    Enter ".help" for usage hints.
    sqlite>


Ok, I just went back to see what I did. In fact I echoed a script into sqlite via stdin, not by using `-cmd ''`. In that case, I didn't need to use the :memory: table, and sqlite exited normally when done.


Another great tool written in Go is CSVQ [0][1] that can be used as a command line or a library.

  csvq 'select id, name from `user.csv`'

[0] https://github.com/mithrandie/csvq

[1] https://mithrandie.github.io/csvq/


I've become a fan of using sqlite-utils to work with CSV or JSON files.

It's a two step process though. One to create and insert into a DB and a second to select from and return.

https://sqlite-utils.datasette.io/en/stable/index.html


I added a feature last year that lets you do this as a one-step process - "sqlite-utils memory": https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/
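Roughly, the taxi query from the article becomes something like this (the table name is derived from the file name):

    sqlite-utils memory taxi.csv \
      'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count'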


Those steps still exist though surely, just managed by the one command. If you're querying via SQLite you have to ingest first.


Yup - 'sqlite-utils memory' works by creating an in-memory database, importing various different file formats into it, running the query and then throwing everything away again. It's generally not a good fit for anything over a few dozen MBs of data.


Very cool!


This comes at the most opportune time, as I am cranking through selectstarsql, sqlbolt, schemaverse, and pgexercises to bone up on my SQL skills for my upcoming data engineering interviews. SQL IS the secret superpower that devs don't know they possess.


   MODE is one of:
     ascii     Columns/rows delimited by 0x1F and 0x1E
Yes!
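A quick way to see those 0x1F/0x1E separators in action with the article's taxi.csv (od just dumps the raw bytes):

    sqlite3 :memory: -cmd '.mode csv' -cmd '.import taxi.csv taxi' \
      -cmd '.mode ascii' 'SELECT passenger_count, COUNT(*) FROM taxi GROUP BY passenger_count' | od -c | head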


If you need to query against multiple CSVs, e.g. using joins, you could use QHS: https://github.com/itchyny/qhs


csvsql from csvkit[0] can do that too.

[0] https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html
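For the same taxi example from the article, the csvsql invocation looks something like:

    csvsql --query "SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count" taxi.csv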


I used to daydream about adding an ALLOW DOING IT LIVE option to Cassandra's cql client. In the event that your WHERE clause was incompatible with your table's key, it would just wholesale dump the table in question into SQLite while indexing the appropriate columns, run the query in question, actually return the result, and then properly clean up.


Miller is also a great tool for doing computations and transformations on CSV/TSV/JSON/SQLite files...

Here is the way to pretty-print the same result with mlr:

mlr --icsv --opprint --barred stats1 -a count,mean -f total_amount -g passenger_count then sort -f passenger_count taxi.csv


I'm genuinely curious about the use case for this. Whenever I have CSVs, I either import them into R/Python if they're small, or into some DBMS first if they're large. Can somebody tell me what niche this is filling?


There is also our somewhat older sqlet.py at http://www.sqlet.com, with multiple input files, column index shortcuts... (and probably overdue for an update).


The .import command used for actually loading the CSV is kinda picky about your CSVs being well-formatted. I don't think it supports embedded newlines at all.


I just tested it against a CSV file with newlines that were wrapped in double quotes and it worked correctly. I used this CSV file: https://til.simonwillison.net/tils/til.csv?_stream=on&_size=...

And this query:

    sqlite3 :memory: -cmd '.mode csv' -cmd '.import til.csv til' \
      -cmd '.mode json' 'select * from til limit 1' | jq


Neat! I guess it got fixed in the, uh, decade or so since I last tried


You can also prefix the filename in a SQLite .import command with |, so it runs a command which hopefully produces text. At the system level, some import scripts can be entirely in one SQL file.
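If I'm reading the sqlite3 shell docs right, that looks something like this (the URL is just a placeholder; a filename starting with | is run as a command and its output is imported):

    sqlite> .mode csv
    sqlite> .import "|curl -s https://example.com/data.csv" remote_data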


This is far more useful for SQL users than chaining several sed/awk/sort commands in a pipe (although a bit against *nix principles).


For a *nixy approach, try lnav (https://lnav.org)


I've been doing this. But I hate it. CSVs need to die; they're terrible data formats. But here we are. And SQLite makes things amazing.


I'm willing to bet CSV will still be around in 200 years. It's ugly, but exceedingly effective.


I wouldn't even assume the concept of text files as we know them today will still exist in any meaningful way in 100 years. It's just as likely all digital data will be stored in something like neural networks with no obvious textual representation. But yes, CSV has had remarkable persistence (the most recent major feature addition I made to our product had to read from a supposedly modern web API response that was in CSV format, despite all the other endpoints returning JSON).


CSV sucks, yes, but for moving/exchanging/aggregating data between various independent sources, it's the least terrible option everyone can process easily.


Can somebody post the equivalent in Windows cmd?


Just replace the ' with " and lose the \ . For some SQL queries you may have to use more exotic quoting (generally, in cmd you can always quote the next character with ^ ).
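Applying that to the article's one-liner, something like this should work in cmd:

    sqlite3 :memory: -cmd ".mode csv" -cmd ".import taxi.csv taxi" "SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count"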


> sqlite3 :memory: -cmd '.mode csv' ...

csv mode can also be set with a switch:

    sqlite3 :memory: -csv


I'm looking through this guy's website for 'today I learned' posts and at first I'm impressed by how many of them there are. But then I start thinking: when you're trying to solve a problem you search for a lot of data. None of his posts are attributed. He's getting all his information from somewhere and then he goes and posts these articles just ripping off other sources.

I can understand when it's based on your original work, but this website reads more like basic questions posted on Stack Overflow. E.g. 'how to connect to a website with IPv6.' Tell me he didn't just Google that and post the result. 0/10


You should look harder! I attribute in plenty of these pieces, where appropriate.

Here's a query showing the 23 posts that link to StackOverflow, for example: https://til.simonwillison.net/tils?sql=select+*+from+til+whe...

And 41 where I credit someone on Twitter: https://til.simonwillison.net/tils?sql=select+*+from+til+whe...

More commonly I'll include a link from the TIL back to a GitHub Issue thread where I figured something out - those issue threads often link back to other sources.

For that IPv6 one: https://til.simonwillison.net/networking/http-ipv6

I had tried and failed to figure this out using Google searches in the past. I wrote that up after someone told me the answer in a private Slack conversation - saying who told me didn't feel appropriate there.

My goal with that page was to ensure that future people (including myself) who tried to find this with Google would get a better result!

(I'm a bit upset about this comment to be honest, because attributing people is something of a core value for me - the bookmarks on my blog have a "via" mechanism for exactly that reason: https://simonwillison.net/search/?type=blogmark )


I tried to reproduce the OP's complaint. Of the 5 most recent TILs, only 1 did not reference some other source as inspiration. One literally gave "thanks" to tips obtained elsewhere.

I'm offended on your behalf! :)


I've been following you for some months now, and you always put links to your sources.

So please, feel 100% free to ignore that person


Hm, I guess most half-assed techy blogs are 0/10 then.

Honestly I don't see a problem with the few posts I looked at. It's like recipes. You can't copyright recipes. At least it's not AI-generated blogspam, and at least a modicum of curation went in here.


Remember what this is.

This is the author's notebook, which happens to be public. It's okay.


That's actually allowed, if you run your own personal website.



