
TextQL: Execute SQL Against CSV or TSV - TAForObvReasons
https://github.com/dinedal/textql
======
kibwen
If you work with CSV a lot, then also check out xsv, "a command line program
for indexing, slicing, analyzing, splitting and joining CSV files", from
Andrew Gallant (burntsushi of Go and Rust fame):
[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv)

~~~
jhoechtl
Another great tool for wrangling CSV files is Miller:

[http://johnkerl.org/miller/doc/](http://johnkerl.org/miller/doc/)

~~~
walshemj
You can also use Log Parser Lizard to parse CSV files.

------
ravinizme
Seems very similar to q.

q has been out there for years, with a very large community.

q is a command line tool that allows direct execution of SQL-like queries on
CSVs/TSVs (and any other tabular text files).

[http://harelba.github.io/q/](http://harelba.github.io/q/)

~~~
ams6110
Yes, I use q all the time for slicing and dicing delimited files. The only
problem I have with it is that the name can make it a little harder to find
if you don't remember the repo.

Since q will read stdin and write CSV to stdout, you can chain several queries
on the command line, or use it in series with other commands such as cat,
grep, sed, etc.
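
For example, a minimal sketch of chaining (hypothetical people.csv with name
and age columns; per the q docs, -d sets the delimiter, -H reads a header row,
-O writes one):

      cat people.csv \
        | q -d, -H -O "SELECT name, age FROM - WHERE age > 30" \
        | q -d, -H "SELECT count(*) FROM -"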

Highly recommended if you like SQL and deal with delimited files.

------
ConceitedCode
This looks very simple and easy to use.

Are there any differences between using this and CSVKit?

[https://csvkit.readthedocs.io/en/1.0.3/](https://csvkit.readthedocs.io/en/1.0.3/)

It includes a tool called csvsql.

Example usage -

    
    
      csvsql --query "select name from data where age > 30" data.csv > new.csv

~~~
jillesvangurp
Was wondering the same. I've been using csvkit a lot lately. This looks like a
subset of its functionality using more or less the same approach. csvsql uses
SQLite underneath, and you can do some nice things with it, like joins and SQL
functions. There is also Amazon Athena, which lets you do similar things in S3
at scale.

csvkit is great with pipes, and you can also easily convert between CSV, TSV,
and even formats like JSON.
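
For instance, a rough sketch (hypothetical people.csv and orders.csv; csvsql
derives table names from the file names, and csvjson does the format
conversion):

      # join two CSVs; csvsql loads them into SQLite behind the scenes
      csvsql --query "SELECT p.name, sum(o.total) AS total
                      FROM people p JOIN orders o ON p.id = o.person_id
                      GROUP BY p.name" people.csv orders.csv

      # convert CSV to JSON
      csvjson people.csv > people.json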

------
hk__2
I’ve worked on a similar tool in the past called Charlatan [1] (no I didn’t
choose that name), but it implemented its own SQL-like language and was
extensible to support any file type.

Its key difference with TextQL and similar alternatives is it works on streams
instead of importing everything in a SQLite then querying it. It seems strange
to read a whole file in memory then perform a `SELECT` query on it when you
could just run that query _while reading the file_. That means a much lower
memory footprint and faster execution, but on the other hand you can only use
the subset of SQL that’s implemented.

[1]:
[https://github.com/BatchLabs/charlatan#charlatan](https://github.com/BatchLabs/charlatan#charlatan)

------
lowleveldesign
If you are on Windows you may consider using Log Parser Studio [1]. It is a
GUI over logparser.exe, which understands a wide variety of file formats and
uses SQL to query them [2].
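
A rough sketch of a logparser.exe invocation (hypothetical data.csv with name
and age columns; -i and -o select the input and output formats):

      LogParser.exe -i:CSV -o:CSV "SELECT name, age INTO out.csv FROM data.csv WHERE age > 30"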

[1] [https://gallery.technet.microsoft.com/office/Log-Parser-Studio-cd458765](https://gallery.technet.microsoft.com/office/Log-Parser-Studio-cd458765)

[2]
[https://en.m.wikipedia.org/wiki/Logparser](https://en.m.wikipedia.org/wiki/Logparser)

~~~
noveltyaccount
Came here to say this. Log Parser is amazing. Check out Log Parser Lizard for
a good IDE.

------
zmmmmm
This is one of those problems that is just hard enough that everyone feels it
is useful to solve and just easy enough that everyone can solve it. Hence the
dozens or even hundreds of independent solutions. For some reason I never end
up using any of them!

------
cplat
Apache Drill ([https://drill.apache.org/](https://drill.apache.org/)) is also
a great alternative. Although advertised as a query engine for Hadoop, it
works perfectly fine on local CSV and JSON files.
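
For example, a minimal sketch against a local file (hypothetical
/tmp/data.csv; with Drill's default config, a headerless CSV is exposed
through the dfs plugin as a columns array):

      $ drill-embedded
      > SELECT columns[0] AS name FROM dfs.`/tmp/data.csv` WHERE columns[1] = 'x';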

------
berta
How is this different from
[https://github.com/harelba/q](https://github.com/harelba/q)?

------
striking
For anyone looking to do this with plain SQLite, one can import a CSV by
running (in the REPL)

    
    
        .mode csv
        .headers on
        .import my.csv tablename
    

This does look extremely convenient, though; being able to use UNIX pipes will
be a huge improvement to my workflow.

~~~
mingodad
To get an idea of what can be done with SQLite, look at this solution to
Google Hash Code 2018,
[https://github.com/mingodad/sqlite3-hashcode-2018](https://github.com/mingodad/sqlite3-hashcode-2018),
using only the sqlite3 command line.

------
__david__
I usually appreciate an animated GIF screenshot (especially for curses-based
programs), but this particular animated GIF would have been better off as just
plain text. There's too much interesting data that can't be copied, it's
really hard to go back to look at something you missed, and it's frustrating
to have to wait while fake typing happens. I really wish there were just a
transcript of the GIF that I could read at my own pace.

------
bobivl
There is also BigBash [0], which converts an SQL statement into a bash
one-liner (using sed, grep, awk, ...) that can then be run to execute the
query. The advantage is that you can let it run on very large files thanks to
the streaming nature of the standard Linux tools.
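
The generated one-liners have roughly this shape (not BigBash's literal
output, just the general idea for SELECT name FROM data WHERE age > 30 over a
hypothetical data.csv):

      awk -F',' '$2 > 30 { print $1 }' data.csv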

[0] [http://bigbash.it](http://bigbash.it) or the corresponding GitHub repo.

------
blacksqr
Sqawk [1] is an Awk-like program that uses SQL and can combine data from
multiple files. It is powered by SQLite.

[1] [https://github.com/dbohdan/sqawk](https://github.com/dbohdan/sqawk)

------
otoburb
I like VisiData [1][2] for CSV/TSV data exploration. TextQL is a good
command-line follow-up for running quick, direct queries against the data
when you don't need to see its visual/spatial structure.

[1] [https://jsvine.github.io/intro-to-visidata/](https://jsvine.github.io/intro-to-visidata/)

[2] Previous HN VisiData thread
[https://news.ycombinator.com/item?id=16515299](https://news.ycombinator.com/item?id=16515299)

------
daotoad
Yawn. We've had this in Perl for _decades_ using the DBI shell (dbish). DBI
is a common database interface. To connect to a data source, one uses an
appropriate driver (by convention, in the DBD namespace, such as DBD::Pg or
DBD::mysql) to get a database handle that can run SQL commands. There are
drivers for things like CSV files, in-memory databases, etc.

Plus you can access all of this programmatically and not just through the
dbish command. Since this is a central part of Perl infrastructure, there are
a ton of tools that extend the DBI, from a variety of ORMs like DBIx::Class
to data export and even DDL delta management.
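
A minimal sketch, assuming the standard DBD::CSV DSN syntax (f_dir points at a
directory of CSV files, each exposed as a table; the prompt shown is
illustrative):

      $ dbish dbi:CSV:f_dir=/path/to/csvs
      > SELECT name FROM data WHERE age > 30;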

Generalized database interface: [https://metacpan.org/pod/DBI](https://metacpan.org/pod/DBI)

DBI shell program: [https://metacpan.org/pod/DBI::Shell](https://metacpan.org/pod/DBI::Shell)

CSV driver: [https://metacpan.org/pod/DBD::CSV](https://metacpan.org/pod/DBD::CSV)

------
kamac
Ha. Shameless plug, but just about a week ago I started working on a somewhat
analogous thing: executing SQL on XML [1]. I came up with the idea after
trying to use the Stack Overflow data dump and finding out it was stored as
XML. I wanted to run queries with the LIKE operator on it.

[1]: [https://github.com/kamac/AskXML](https://github.com/kamac/AskXML)

~~~
_mhr_
This is a great idea. There are so many APIs and text dumps from websites and
databases that are XML, and it would be nice to be able to query them with
regular relational operations. I'll be following your project! I will say that
I believe a streaming backend would be more suitable, since most of the XML
dumps I'm thinking of are humongous and won't always fit in memory.

~~~
kamac
Thanks! About fitting into memory: the way the library currently works should
technically handle even very big files, as the XML is iteratively converted
into an SQL database. You can choose whether you want that database to be in a
file or in memory. The only thing that could break right now is synchronizing
back to XML after making changes with UPDATE or INSERT statements, because I'm
not paging results.

------
Nitrado
ClickHouse ships with a command line tool which does this (without the actual
database server):

    
    
        ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' | \
            clickhouse-local -S "user String, mem Float64" \
                -q "SELECT user, round(sum(mem), 2) as memTotal FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"
    
        
        ┏━━━━━━━━━━┳━━━━━━━━━━┓
        ┃ user     ┃ memTotal ┃
        ┡━━━━━━━━━━╇━━━━━━━━━━┩
        │ clickho+ │      0.7 │
        ├──────────┼──────────┤
        │ root     │      0.2 │
        ├──────────┼──────────┤
        │ netdata  │      0.1 │
        ├──────────┼──────────┤
        │ ntp      │        0 │
        ├──────────┼──────────┤
        │ dbus     │        0 │
        ├──────────┼──────────┤
        │ nginx    │        0 │
        ├──────────┼──────────┤
        │ polkitd  │        0 │
        ├──────────┼──────────┤
        │ nscd     │        0 │
        ├──────────┼──────────┤
        │ postfix  │        0 │
        └──────────┴──────────┘
    

It has the advantage of being really fast.

~~~
spullara
The additional requirement to set up the schema is kind of onerous.

~~~
networked
Shameless plug: Sqawk can do nearly the same thing without you having to
define a schema.

    
    
      $ ps aux | sqawk -output table \
                       'select user, round(sum("%mem"), 2) as memtotal
                        from a
                        group by user
                        order by memtotal desc' \
                       header=1
      ┌────────┬────┐
      │dbohdan │67.1│
      ├────────┼────┤
      │  root  │3.5 │
      ├────────┼────┤
      │ avahi  │0.0 │
      ├────────┼────┤
      │ daemon │0.0 │
      ├────────┼────┤
      │message+│0.0 │
      ├────────┼────┤
      │ nobody │0.0 │
      ├────────┼────┤
      │  ntp   │0.0 │
      ├────────┼────┤
      │ rtkit  │0.0 │
      ├────────┼────┤
      │ syslog │0.0 │
      ├────────┼────┤
      │ uuidd  │0.0 │
      ├────────┼────┤
      │whoopsie│0.0 │
      └────────┴────┘
    

Link: [https://github.com/dbohdan/sqawk](https://github.com/dbohdan/sqawk)

------
emmelaich
Looks pretty good!

Just for historical interest: similar things have been done before, e.g.
[http://quisp.sourceforge.net/shsqlhome.html](http://quisp.sourceforge.net/shsqlhome.html).

See also Facebook's osquery, of course, which turns even complex structures
into SQL tables, helped by Augeas.

------
chrisweekly
Does this have any benefits over lnav [1]?

[1] [https://www.lnav.org](https://www.lnav.org) -- a feature-rich, powerful,
flexible "mini-ETL" with an embedded SQLite engine -- is one of my all-time
favorite CLI tools that somehow remains under the radar.

~~~
vdm
[http://innolitics.com/10x/graduate-from-sed-and-sort-to-lnav-the-logfile-navigator/](http://innolitics.com/10x/graduate-from-sed-and-sort-to-lnav-the-logfile-navigator/)

------
daamien
PostgreSQL does this out of the box with:

    
    
      CREATE EXTENSION file_fdw;
      CREATE SERVER import FOREIGN DATA WRAPPER file_fdw;
      CREATE FOREIGN TABLE foo (
        col1 text,
        col2 text,
        ...
      ) SERVER import 
      OPTIONS ( filename '/path/to/foo.csv', format 'csv' );
      SELECT col1 FROM foo WHERE col2='x';

~~~
banku_brougham
Having to define the schema is the major hangup for me in my workflow. I
would prefer a wrapper that could run within Unix pipelines around arbitrary
text data files, but they all use SQLite.

I personally have been using harelba's q (the un-googleable utility), which is
just fine.

It would be great not to have to shift gears into SQLite syntax and date
formatting all the time. Does anyone know of a similar tool that runs over
Postgres?

~~~
chrisweekly
> "prefer a wrapper that could run within unix pipelines around arbitrary text
> data files"

That sounds like [lnav]([https://www.lnav.org](https://www.lnav.org)).

For queries it does use SQLite under the hood, but before querying your
"arbitrary text data" files, you can use regexes to define a custom format
(e.g. with named groups and backreferences), giving the data structure that
standard SQL is then an ideal tool for querying.

It's simpler than I'm probably making it sound. Highly recommended.

------
owlninja
These responses are so HackerNews. Everyone asking 'why not use x project?',
and each x is different...

------
kbenson
In the past, I've used fsql[1] when I needed this. This looks interesting
though.

1: [https://metacpan.org/pod/distribution/App-fsql/bin/fsql](https://metacpan.org/pod/distribution/App-fsql/bin/fsql)

------
srhngpr
I'm a big fan of SQL Notebook [1]. It is open source and has a great UI for
quickly working with a number of importable formats (CSV, Excel, JSON) as well
as server-based DBs.

[1] [https://sqlnotebook.com/](https://sqlnotebook.com/)

------
Corrado
One thing that I would like to add to my projects is an animated GIF shell
demo. I've looked at several ways to create them, but they all seem really
clunky and force me to type correctly the first time. What do people use to
create these helpful animations?

------
ilitirit
I'm pretty sure many people have written, or attempted to write, similar
tools at least once before. It's as simple as programmatically importing into
an in-memory SQLite instance and then executing SQL queries.
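
A minimal sketch of that whole idea with nothing but the sqlite3 CLI
(assuming a build recent enough to have the -cmd flag, and a hypothetical
data.csv with name and age columns):

      sqlite3 -cmd '.mode csv' -cmd '.import data.csv t' :memory: \
              'SELECT name FROM t WHERE age > 30;'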

------
rhacker
Apache Spark can do this too, similar to the Apache Drill comment. This
thread is starting to feel a bit like Stack Overflow.

~~~
mekazu
If only this comments page would show up first when googling “sql on csv”.

------
aembleton
This is cool. How does it work? I'd guess cleaning the data and copying it
into SQLite?

~~~
hk__2
It imports everything into an in-memory SQLite database.

------
th0ma5
Apache MetaModel is great for this in Java.

------
pX0r
While dealing with CSV files using SQL statements is cool and all, SQL queries
are still SQL queries.

~~~
IMTDb
[https://en.wikipedia.org/wiki/Truism](https://en.wikipedia.org/wiki/Truism)

------
partycoder
I use R for this.

~~~
Myrmornis
Can you show an example of how to use R to read tabular data on stdin (e.g.
in a UNIX pipeline) and perform SQL queries?

~~~
goatlover
Why would you be running SQL queries in R, a language with built-in support
for handling tabular data?

~~~
Myrmornis
I was asking because the topic of this discussion is specifically running SQL
queries (that's what TextQL does).

But there certainly are good reasons for using SQL in R. Packages like sqldf
exist (see the sketch after the list below), so their authors could probably
give the best answer to your question.

Some reasons:

- SQL is a very widely known DSL for working with tabular data, so it may
make sense to make that interface available within R.

- For certain operations (e.g. specific types of joins) it may be more
natural / easier to express the operation in SQL.

- For large data sets, some operations in R have large memory footprints, and
doing the operations in a database may have lower memory requirements.
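
For example, a minimal sketch with sqldf that also answers the stdin question
above (hypothetical data.csv with name and age columns; sqldf loads the
referenced data frame into SQLite behind the scenes):

      cat data.csv | Rscript -e '
        library(sqldf)
        d <- read.csv(file("stdin"))
        print(sqldf("SELECT name FROM d WHERE age > 30"))
      '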

------
gaius
In SQLite you can just do

    
    
        sqlite> .mode csv
        sqlite> .import myfile.csv mytable
        sqlite> select ...
    

What does this tool give you that SQLite doesn't do out of the box?

~~~
jssmith
As a command-line tool it could serve as a more powerful replacement for the
Unix utilities I often use for simple analysis jobs: sort, uniq, wc, awk,
sometimes cut and sed. I'm going to give it a try.

It looks like textql loads everything into an in-memory SQLite instance,
whereas I'd really like to see an approach that uses SQLite's virtual table
mechanism ([https://sqlite.org/vtab.html](https://sqlite.org/vtab.html)),
which would avoid the loading step and perhaps make streaming processing
possible.
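
For the curious, SQLite ships a sample CSV virtual table in its source tree
(ext/misc/csv.c) that scans the file at query time instead of importing it. A
rough sketch, assuming you have compiled that extension as a loadable module:

      sqlite> .load ./csv
      sqlite> CREATE VIRTUAL TABLE temp.t USING csv(filename='data.csv', header=yes);
      sqlite> SELECT count(*) FROM t;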

~~~
MR4D
What you describe would be a super nice utility. Great for prototyping and
development, as the code could be copied and pasted from such a tool into
application code.

The streaming would make it memory efficient, and possibly able to handle some
big data - maybe not true "Big Data", but certainly 10s of gigabytes.

Anyone want to take this idea into a GoFundMe site?

~~~
ams6110
SQL is set-oriented. How would that work on a potentially indefinite stream,
other than as a simple filter, which you could just do with a tool such as
awk?

~~~
barrkel
Many relational operations don't require a whole stream to compute, and many
of those that do don't need it all at once.

Projection (mapping), filtering, and a join against a fully loaded other side
all work.

Aggregation can consume an indefinite stream with a limited working set if
the cardinality of the grouping key isn't large.
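
For example, a streaming GROUP BY in awk keeps only one accumulator per
distinct key, no matter how long the stream runs (output still has to wait
for END, but the working set stays bounded):

      # sum column 2 grouped by column 1: one pass, memory ~ distinct keys
      awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }'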

And of course you can combine these in nested and unioned operations,
computing across multiple indefinite streams concurrently and with limited
working set.

It would be tricky to make this work effectively without hinting for things
like joins, for sure; join order is one of the hardest things a query engine
optimizes.

------
clatan
we should be aiming to kill SQL as a language

~~~
walshemj
Why? I'm not surprised you got downvoted, but I am curious.

~~~
orionblastar
It is a negative comment. We should make alternatives, not kill off
technologies. Even the Commodore 64 has uses with the right modern add-ons;
now that is hacking.

