Finding data items in one field that contradict data items in another field (polydesmida.info)
32 points by eaguyhn on Oct 3, 2018 | hide | past | favorite | 13 comments


Ok so the title of the blog is 'Data ops in the Linux command line'. Sounds fun, but in the same way that 'paintball with blindfolds' would be fun.

E.g. this is one of the examples - a kind of sanity check on two fields in a tab-separated file to see if one is less than the other. The fields are identified by their position (31, 33, etc.):

    awk -F"\t" 'NR>1 && $31!="" && $33!="" && $33>$31' fish | wc -l
Surely much better to just import it into a database and do the analysis in SQL. The SQL equivalent of the above would be something like:

    SELECT *
    FROM FishSpecimenData
    WHERE MinDepth > MaxDepth
      AND MinDepth is not null AND MaxDepth is not null
If you're worried about type conversions while importing into SQL, just import everything as a varchar. You've still got a fairly easy job to compare the numbers:

    SELECT *
    FROM FishSpecimenData
    WHERE Cast(MinDepth as int) > Cast(MaxDepth as int)
      AND MinDepth is not null AND MaxDepth is not null
      AND IsNumeric(MinDepth) = 1 and IsNumeric(MaxDepth) = 1
edit: To be fair, on this page https://www.polydesmida.info/cookbook/index.html the author explains the rationale for using command line tools:

I'm a retired scientist and I've been mucking around with data tables for nearly 50 years. I started with printed columns on paper (and a calculator) before moving to spreadsheets and relational databases (Microsoft Access, Filemaker Pro, MySQL, SQLite). In 2012 I discovered the AWK language and realised that every processing job I'd ever done with data tables could be done faster and more simply on the command line. Since then my data tables have been stored as plain text and managed with GNU/Linux command-line tools, especially AWK.

So I guess the point of the blog is to promote that approach. Fair enough.


I don't think it's much better to use SQL; it's killing flies with nuclear bombs. Awk does the job perfectly well, and you don't have to spin up a SQL database, import the data, and deal with whatever problems come up during the import. I don't know why so many people have this aversion to the console when it can do the job fairly well, especially in cases like this where you just want a quick check.


How would SQLite fare with regard to your concerns? You won't have to 'spin up' a database, and importing from various sources, AFAIK, is pretty easy these days (and OP even goes out of their way to deal with VARCHAR fields).
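For what it's worth, the no-server workflow is short enough to sketch. This is only an illustration — the file, table, and column names below are invented, not the real fish dataset:

```shell
# A tiny made-up tab-separated table (not the real museum data)
printf 'min_depth\tmax_depth\n10\t20\n50\t30\n' > fish.tsv

# Import and query in one go; sqlite3 creates fish.db as a plain file,
# and .import uses the first row as column names when the table is new
sqlite3 fish.db <<'EOF'
.mode tabs
.import fish.tsv fish
SELECT * FROM fish WHERE CAST(min_depth AS INT) > CAST(max_depth AS INT);
EOF
```

One wrinkle: .import brings everything in as text, so the CAST (or a numeric check) is still needed, much like the varchar approach above.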

It strikes me as much easier than having to figure out Awk, but that's assuming most people are more likely to know SQL and/or use it in the future.

I mean, I'd be happy to try and solve the problem with Awk because it's been high on my list of things to learn, but SQL does seem like an easier solution in most cases.


You still have to create the SQLite database and have the client handy. Awk is there by default in all *nix systems, and for these basic tasks it's incredibly easy (see http://www.hcs.harvard.edu/~dholland/computers/awk.html for a quick intro). Once you know it you can use it to process a lot of text files in the console without any kind of worry. Maybe it's a preprocess step before importing data to a database, maybe you're just exploring a dataset, maybe you want to modify some command's output in the middle of a pipe.
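A few throwaway examples of the kind of thing that makes awk painless for this (the sample file and fields here are invented, purely for illustration):

```shell
# A tiny made-up tab-separated sample to play with
printf 'name\tdepth\ncod\t120\ntuna\t200\n' > sample.tsv

# Explore a dataset: print just the depth column
awk -F"\t" '{print $2}' sample.tsv

# Pre-process before a database import: drop the header row
awk 'NR > 1' sample.tsv > sample_noheader.tsv

# Modify output in the middle of a pipe: uppercase the first field
cat sample.tsv | awk -F"\t" 'BEGIN{OFS="\t"} {$1 = toupper($1); print}'
```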

I learned awk a few years ago and it has no equivalent when the only thing you want to do is consume column-based text and do some simple processing: it's simple and incredibly fast.


Thanks for the link and the motivation to keep reading the AWK book I started a while ago!


> Awk does the job perfectly well

I've always been of the opinion that you're better served by learning Perl and using it. The syntax for quick things is very similar to Awk, but you also get all of CPAN to back you up in case you need to do something more complex.

For example, real CSV parsing, HTML parsing with queryselector access, complex record destructuring and/or inner-record decompression, and sprintf (among many others) for output.

I still do a lot of cat / grep / cut / sort / uniq type commands, but whenever I have need of a little complexity, Perl is definitely my go to tool. I think it's already installed in most common installation choices for most distros.


I don't think I've used awk scripts on more than one or two occasions; I use it mostly for one-liners, and there I like awk's simplicity. When I need something more complex I resort to Python, which is slower but usually manageable. You're right that if I knew Perl I'd have both tools in one language.


Identifying the problem is only half the battle, though. The end goal would be to go through and fix each record. Surely that's easier done in a database tool than on the command line.

Additionally, these data presumably live in a database at the museums they're from. They put them on the web as CSV (or similar) so that the public can see them, but the museum IT guy would end up making the eventual changes in their tracking system anyway.


Why not combine the approach and run SELECT statements from the command line?

https://github.com/harelba/q

I'm sure there are issues with this, but it seems pretty sane. Your sample query would read _something_ like:

    q 'SELECT * FROM FishSpecimenData.csv WHERE c33 > c31 AND c31 is not null AND c33 is not null'


Oh, I thought the article would be about scientific fields, not data fields. I became increasingly irritated by the pedantry before I realized what the topic actually was.

Once the confusion lifted I could enjoy the read.


Last week I signed a lease that stated "Mold have existed since the beginning of time." While scientifically inaccurate, I believe legally it is close enough.


Is that a waiver so you could agree to live in a moldy house? Do you have an option of not living there?

Fixing your health issues could cost more than the savings you get from living in that shack.


Mold can become a bad issue if your bathroom isn't ventilated after showers. I've rented for decades in WA State and every lease I've signed has had a mold clause. The building was built 4 years ago; I watched them build it.



