
TSV Utilities: Command line tools for large, tabular data files - bryanrasmussen
https://github.com/eBay/tsv-utils
======
JimmyRuska
I've been noticing more Swiss-army-knife-like CLI tools in the last few years.
It would be cool if there were some that could support Avro/Parquet/ORC
formats. This one is notable because it's written in D by a mega corp.

Some useful CLI data-wrangling tools --

[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv)

[https://github.com/dinedal/textql](https://github.com/dinedal/textql)

[https://github.com/n3mo/data-science](https://github.com/n3mo/data-science)

[https://stedolan.github.io/jq/](https://stedolan.github.io/jq/)

[https://gitlab.redox-os.org/redox-os/parallel](https://gitlab.redox-os.org/redox-os/parallel)

[https://github.com/willghatch/racket-rash](https://github.com/willghatch/racket-rash)

Would you have any others you recommend?

~~~
sstephenson
jwalk parses JSON into a stream of TSV records:
[https://github.com/shellbound/jwalk/](https://github.com/shellbound/jwalk/)
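
Usage is something like this, if I read the README right (a sketch; the exact
record layout is documented there):

    # hypothetical invocation: jwalk emits one tab-separated record per
    # leaf (path components, then the value)
    echo '{"user": {"name": "Sam", "age": 7}}' | jwalk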

~~~
DonHopkins
> parses large documents slowly, but steadily, in memory space proportional
to the key depth of the document

If parsing JSON with shell scripts and awk is your idea of the ideal way to
"slowly, but steadily" get the job done, sure.

[https://github.com/shellbound/jwalk/blob/master/lib/jwalk/co...](https://github.com/shellbound/jwalk/blob/master/lib/jwalk/commands/parse.awk#L137)

I know that everything looks like a nail if your only tool is a hammer, and
it's fun to nail together square wheels out of plywood, but there are actually
other tools out there with built-in, compiled, optimized, documented, tested,
well maintained, fully compliant JSON parsers.

------
gibba999
Seems like a waste to do CSV->TSV without going all the way:

[http://www.tsvx.org/](http://www.tsvx.org/)

The problem with TSVs and CSVs is that you might get an odd datatype 1TB into
a file. For example, what you expect to be an integer value is somehow a
string.

TSVx extends TSV with standard formats for things like headers, which allow
for strict typing. You can do things like export a database table and import
it losslessly. You can even export from MySQL and import into PostgreSQL,
most of the time without pain.

The strict typing also avoids a lot of potential security issues. And in an
environment where you control both ends (so you don't need to worry about
security of where the file came from), it leads to much nicer APIs: you can
refer to things by names rather than column numbers. It's more readable, and
if the order or number of columns changes, nothing breaks.

~~~
enriquto
It wouldn't seem that tsvx is "going all the way"; it's a completely different
thing that dirties your numbers with silly metadata.

~~~
woofie11
TSVs aren't just numbers. A TSV might be numbers for the first terabyte, and
then include a string, just for the heck of it.

I'd call that "dirtying numbers," rather than a removable header which allows
you to programmatically validate that your numbers have not, in fact, been
dirtied.

------
jondegenhardt
Hi all, primary tsv-utils author here. Thanks for checking out the tools! I
hope some of you find them beneficial.

~~~
Tharkun
Thanks for these. I've rolled my own implementations of many of these over
the last 20-odd years, most of them living in ugly shell scripts embedded as
aliases in my env. This will make my life easier!

------
cbsmith
It makes me cry how much time has been invested in formats like CSVs, TSVs,
etc. ASCII (and UTF-8) has characters reserved for column, row, and even group
separation. Just use them and save a lot of pain.
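
For instance, awk handles them out of the box (a minimal sketch using the
ASCII unit separator, 0x1F, between fields and the record separator, 0x1E,
between records):

    printf 'alice\03730\036bob\03725\036' |
      awk 'BEGIN { FS = "\037"; RS = "\036" } NF { print $1 " is " $2 }'
    # alice is 30
    # bob is 25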

~~~
dev_dull
We’ve discussed this at length and I’m squarely in the DO NOT camp. The
drawbacks of using non-readable meta characters exceed the benefits:

1. TSV can be read and imported by almost everything.

2. People can add and adjust TSV files from any editor.

3. What’s the way to insert meta characters again in vim? And in nano? Argh,
I’ll just try to copy and paste it. Ugh, that doesn’t work.

Just use CSV/TSV, folks. If you need anything more complicated, reach for a
better serialization format (JSON, YAML), _not_ a better delimiter.

~~~
Annatar
Both JSON and YAML are very difficult to construct parsers for.

To make matters worse, both formats are full of pitfalls:

[http://seriot.ch/parsing_json.php](http://seriot.ch/parsing_json.php)
[https://arp242.net/yaml-config.html](https://arp242.net/yaml-config.html)

Using either of these is, in my opinion, extremely misguided.

~~~
dev_dull
The flip side to that is that JSON and YAML parsers exist in every language,
and are more than capable of replacing any logic you’d find in a CSV.

~~~
Annatar
CSVs are a data structure. They do not contain any logic.

------
dima55
There're lots of tools in this space that are similar in a very general way,
but have widely different design choices. My toolkit is

[https://github.com/dkogan/vnlog](https://github.com/dkogan/vnlog)

It's tsv-utils-like, but is strictly a wrapper around existing tools. So
filtering and transformations are interpreted literally as awk (or perl)
expressions. And the various cmdline options match the standard tool options
because they ARE the standard tools. So you get a very friendly learning
curve, but something like tsv-utils is probably faster and probably more
powerful. And it looks like tsv-utils references fields by number instead of
by name. Many of the others (mine included) use the field names, which is a
MAJOR usability improvement.
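
For flavor, a vnl-filter call looks something like this (a sketch; vnlog
files are TSV with a leading "# col1 col2 ..." legend line, and the
expression is handed to awk with column names mapped to fields):

    # keep rows where temperature > 20, print only the time column
    vnl-filter -p time 'temperature > 20' < data.vnl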

Other tools in no particular order:

[https://csvkit.readthedocs.io/](https://csvkit.readthedocs.io/)

[https://github.com/johnkerl/miller](https://github.com/johnkerl/miller)

[https://github.com/eBay/tsv-utils-dlang](https://github.com/eBay/tsv-utils-dlang)

[http://harelba.github.io/q/](http://harelba.github.io/q/)

[https://github.com/BatchLabs/charlatan](https://github.com/BatchLabs/charlatan)

[https://github.com/dinedal/textql](https://github.com/dinedal/textql)

[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv)

[https://github.com/dbohdan/sqawk](https://github.com/dbohdan/sqawk)

[https://stedolan.github.io/jq/](https://stedolan.github.io/jq/)

[https://github.com/benbernard/RecordStream](https://github.com/benbernard/RecordStream)

~~~
cube2222
Adding another one which I'm one of the authors of:
[https://github.com/cube2222/octosql/](https://github.com/cube2222/octosql/)

It tries to provide ergonomic data munching of various formats, but using a
SQL interface, which most will probably feel immediately at home with.
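
For example (roughly; see the README for how datasources are configured in
your version):

    # query a CSV with plain SQL
    octosql "SELECT email, COUNT(*) AS cnt FROM invoices.csv GROUP BY email"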

~~~
bradknowles
With respect, who actually feels at home with SQL?

It's about the most alien and obtuse language I've ever had the misfortune of
encountering, and in that category I rate it worse than COBOL, FORTRAN,
Assembly, C, Prolog, Sendmail re-write rules, BASH, and every other language
I've ever encountered and had to use, but which I cannot recall at the moment.

~~~
he_the_great
SQL gets rather mystifying when you move towards some complex joins and data
mapping. PSQL can also feel sadly out of place.

But there is real nice clarity in what you want at its foundation.

------
yfiapo
Cool that there are some specialty tools in this space; I'll use them if I
hit a performance challenge.

Minor nit: I found the performance table
([https://github.com/eBay/tsv-utils/blob/master/docs/Performan...](https://github.com/eBay/tsv-utils/blob/master/docs/Performance.md))
confusing at first and second glance. Alternating colors indicate... different
OSes? That doesn't seem to be the important message to convey, as you are
trying to show the speed of your tool and not the OS. I'd recommend using the
coloration to differentiate between tools instead of OSes.

~~~
jondegenhardt
Thanks for the feedback. I agree, the performance table would benefit from
better formatting.

------
zX41ZdbW
Worth noting clickhouse-local: a full-featured ClickHouse SQL engine for
files in CSV/TSV/JSON Lines, whatever...

[https://www.altinity.com/blog/2019/6/11/clickhouse-local-the...](https://www.altinity.com/blog/2019/6/11/clickhouse-local-the-power-of-clickhouse-sql-in-a-single-command)
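
Roughly, going from the linked post (stdin is exposed as a table named
"table" when you pass a --structure):

    # aggregate a TSV from stdin without loading it into a server
    clickhouse-local --input-format TSV \
        --structure "item String, price Float64" \
        --query "SELECT item, avg(price) FROM table GROUP BY item" < prices.tsv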

~~~
paulryanrogers
Nice! Or even just SQLite.
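
A one-shot sketch with the sqlite3 CLI (assuming data.tsv has a header row,
which .import uses for column names when the table doesn't exist yet):

    sqlite3 -cmd '.mode tabs' -cmd '.import data.tsv t' :memory: \
        'SELECT category, COUNT(*) FROM t GROUP BY category'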

------
Annatar
With AWKA, in my tests I got an instant 100% speed boost (i.e., my scripts
ran twice as fast) without modifying any of my AWK code. So using AWKA would
likely beat the tools presented here, without needing to invest the time to
learn a new tool, which for me at least is of paramount importance.

[https://github.com/noyesno/awka](https://github.com/noyesno/awka)

------
oehtXRwMkIs
I remember there was a TUI app that could handle CSV, HDF, etc. I can't find
it for the life of me, though. I remember the author even made their own TUI
library for it. I've been looking for it for a while now; if anyone knows,
please let me know.

~~~
flas9sd
must be visidata

~~~
oehtXRwMkIs
I think that's the one, thanks so much.

------
visarga
I've had something like this for 10 years, self-made. I use cut, grep, sort,
and uniq from the system, but added tools to compute counts per field, do
value processing like awk (using a one-line Perl script because I didn't want
to learn awk), randomise order, split files by percentage, join multiple
files by column, and pretty-format the data.
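
The per-field processing is just a Perl one-liner in the spirit of (an
illustrative sketch, not my exact script):

    # -a/-F split each line on tabs into @F; double column 2 and reprint
    perl -F'\t' -lane 'print join "\t", $F[0], $F[1] * 2' data.tsv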

Sometimes I make a chain of them 3-4 lines long, at which point I switch to a
script.

------
zmmmmm
I always get excited by these types of tools and then end up going back to
awk. How do you beat something that's already pre-installed on every Unix
system for the last 40 years?

Having said that ... these do look like they have some useful extras, in
particular, around the annoying part of retaining / manipulating header rows
...
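
(The usual workaround with plain coreutils, for reference: peel off the
header line, then run the real command on the rest.)

    { IFS= read -r header; printf '%s\n' "$header"; sort; } < data.tsv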

~~~
thewhitetulip
I totally agree. awk, sed, and grep are amazing, and it astonishes me how
many people can't fluently write scripts in them. I was in that camp until a
few months ago, writing everything in Python, but one day I resolved to learn
shell scripting and I loved it.

awk is to Python what Python is to Java: what we can do in 10 lines of Python
can be done in one line of awk, and it doesn't have to be unreadable!

The main pain in the neck is finding a good tutorial. As usual, I started
documenting what I learned in an end-to-end guide here:
[https://github.com/thewhitetulip/awk-anti-textbook](https://github.com/thewhitetulip/awk-anti-textbook)
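
The kind of thing I mean, in one line (sum column 2 of a TSV grouped by
column 1):

    awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }' data.tsv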

~~~
cf
The original awk book from the creators is in my opinion really good.
[https://archive.org/download/pdfy-MgN0H1joIoDVoIC7/The_AWK_P...](https://archive.org/download/pdfy-MgN0H1joIoDVoIC7/The_AWK_Programming_Language.pdf)

~~~
thewhitetulip
I found it to be a good introduction, but the issue is, it does not go in dept
and teach by example.

------
Mauricio_
Dear author: thank you for making multiple small programs instead of a single
command with 400 flags.

~~~
JoeAltmaier
Agreed!

My pet peeve: open source packages that have no separation of the actual IP
from its embodiment in an app or environment. Is it so hard to create a
library and put the real feature in there, instead of wrapping it up in some
run-on main module?

Kudos to this writer for doing it well.

~~~
jondegenhardt
Thank you both! I too like the Unix philosophy of small tools that do a
specific job. In a package like this, people may find they only care about one
or two tools. They can ignore the rest and mix and match with other tools as
they see fit.

------
londons_explore
I feel like this might explain the atrocious Unicode handling and special
character escaping in most eBay properties... Looking at you, Gumtree!

------
maxekman
Check out Nushell if you missed it a few days ago on HN:
[https://github.com/nushell/nushell](https://github.com/nushell/nushell)

Related HN thread:
[https://news.ycombinator.com/item?id=20783006](https://news.ycombinator.com/item?id=20783006)

------
jmsmistral
Looks cool! The presentation of the benchmark data could be improved by
charting the values, however.

------
j88439h84
Mario is a tool for manipulating data in your shell with Python functions. It
supports CSV, JSON, TOML, YAML, and XML.

[https://github.com/python-mario/mario](https://github.com/python-mario/mario)

------
mlueft
[http://www.lueftenegger.at/produkte/csvsplitter-gratis-progr...](http://www.lueftenegger.at/produkte/csvsplitter-gratis-programm-zum-splitten-von-csv-dateien/)

------
cben
`keep-header` is a superb gem of do-one-thing composability.
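
For example (the documented synopsis is `keep-header file -- command args`):

    # sort a TSV numerically by column 2 without disturbing the header row
    keep-header data.tsv -- sort -k2,2n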

------
QuadrupleA
Wouldn't a relational database, e.g. SQLite, be better: faster to query and
more space-efficient in most cases? I wonder what use cases would favor large
collections of text/TSV files.
collections of text / TSV files?

~~~
pletnes
If you’re getting TSV files from an external source and are interested in just
some aggregate results, you’ll get better performance by computing the smaller
results and storing these than if you first import, then process, then delete
the data in a SQL db. Also you don’t have to store the data twice.

------
ragerino
You might want to check out Apache Drill. It allows you to access structured
files as database tables through JDBC/ODBC.
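
In Drill's embedded shell (bin/drill-embedded), a header-less CSV comes back
as an array named `columns` (a sketch; paths and plugin config vary):

    -- count rows per value of the first column, querying the file in place
    SELECT columns[0] AS name, COUNT(*) AS cnt
    FROM dfs.`/data/users.csv`
    GROUP BY columns[0];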

~~~
riboflavin
Dremio replaces Drill for a lot of that too, some of the original team has
worked on it. Here's a tutorial from some folks who have used it:
[http://www.helicalinsight.com/technical-guide/connecting-
csv...](http://www.helicalinsight.com/technical-guide/connecting-csvexcel-via-
dremio/)

------
pfarnsworth
eBay has an API program that appears to have been half-completed and then
abandoned. Trying to understand exactly how to use their APIs is extremely
confusing because they don't have anything definitive, just a basically
Frankenstein-like API, so I've given up. I frankly wouldn't trust anything
coming from eBay at this point; they appear to have extremely poor developer
support and no investment in making their APIs better to use.

