OK, so the title of the blog is 'Data ops in the Linux command line'. Sounds fun, but in the same way that 'paintball with blindfolds' would be fun.
E.g. this is one of the examples: a kind of sanity check on two fields in a tab-separated file, to see if one is less than the other. The fields are identified by their position (31, 33, etc.).
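The command itself isn't reproduced here, but assuming field 31 is MinDepth and field 33 is MaxDepth, the check is roughly this shape (a sketch, not necessarily the author's exact one-liner, and the filename is a placeholder):
# flag rows where both depth fields are non-empty and field 31 > field 33
awk -F'\t' '$31 != "" && $33 != "" && ($31 + 0) > ($33 + 0)' specimens.tsv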
Surely it's much better to just import it into a database and do the analysis in SQL. The SQL equivalent of the above would be something like:
SELECT *
FROM FishSpecimenData
WHERE MinDepth > MaxDepth
AND MinDepth IS NOT NULL AND MaxDepth IS NOT NULL
If you're worried about type conversions while importing into SQL, just import everything as a varchar. You've still got a fairly easy job to compare the numbers:
SELECT *
FROM FishSpecimenData
WHERE CAST(MinDepth AS int) > CAST(MaxDepth AS int)
AND MinDepth IS NOT NULL AND MaxDepth IS NOT NULL
AND IsNumeric(MinDepth) = 1 AND IsNumeric(MaxDepth) = 1
Edit: To be fair, on this page https://www.polydesmida.info/cookbook/index.html the author explains the rationale for using command-line tools: "I'm a retired scientist and I've been mucking around with data tables for nearly 50 years. I started with printed columns on paper (and a calculator) before moving to spreadsheets and relational databases (Microsoft Access, Filemaker Pro, MySQL, SQLite). In 2012 I discovered the AWK language and realised that every processing job I'd ever done with data tables could be done faster and more simply on the command line. Since then my data tables have been stored as plain text and managed with GNU/Linux command-line tools, especially AWK."
So I guess the point of the blog is to promote that approach. Fair enough.
I don't think SQL is much better; it's like killing flies with a nuclear bomb. Awk does the job perfectly well, and you don't have to spin up a SQL database, import the data, or deal with whatever problems come up during the import. I don't know why so many people have this aversion to the console when it can do the job fairly well, especially in cases like this where you just want a quick check.
How would SQLite fare with regard to those concerns? You wouldn't have to 'spin up' a database, and importing from various sources is, AFAIK, pretty easy these days (and OP even goes out of their way to deal with VARCHAR fields). Quick sketch below.
It strikes me as much easier than having to figure out Awk, but that's assuming most people are more likely to know SQL and/or use it in the future.
I mean, I'd be happy to try and solve the problem with Awk because it's been high on my list of things to learn, but SQL does seem like an easier solution in most cases.
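For what it's worth, the kind of thing I have in mind with the sqlite3 CLI, as a rough sketch (the file names are made up, and I'm assuming a reasonably recent sqlite3 where .import creates the table from the header row):
# import a tab-separated file and run the depth sanity check
sqlite3 specimens.db <<'SQL'
.mode tabs
.import specimens.tsv FishSpecimenData
SELECT * FROM FishSpecimenData
WHERE MinDepth <> '' AND MaxDepth <> ''
AND CAST(MinDepth AS REAL) > CAST(MaxDepth AS REAL);
SQL
SQLite has no IsNumeric; CAST of a non-numeric string just gives 0, which is good enough for a quick check.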
You still have to create the SQLite database and have the client handy. Awk is there by default on all *nix systems, and for these basic tasks it's incredibly easy (see http://www.hcs.harvard.edu/~dholland/computers/awk.html for a quick intro). Once you know it you can process a lot of text files in the console without any kind of worry. Maybe it's a preprocessing step before importing data into a database, maybe you're just exploring a dataset, maybe you want to modify some command's output in the middle of a pipe (quick example below).
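A throwaway example of that last case (nothing to do with the article, just to show the shape of it):
# add up the sizes (column 5 of ls -l output) of all the .log files
ls -l *.log | awk '{ total += $5 } END { print total, "bytes" }'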
I learned awk a few years ago and it has no equivalent when all you want to do is consume column-based text and do some simple processing: it's simple and incredibly fast.
I've always been of the opinion that you're better served by learning Perl and using it. The syntax for quick things is very similar to Awk, but you also get all of CPAN to back you up in case you need to do something more complex.
For example, real CSV parsing, HTML parsing with querySelector-style access, complex record destructuring and/or inner-record decompression, and sprintf (among many others) for output.
I still do a lot of cat / grep / cut / sort / uniq type commands, but whenever I need a little more complexity, Perl is definitely my go-to tool. I think it's already installed in the default install of most common distros.
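To give a flavour of the "very similar to Awk" point, the MinDepth/MaxDepth check from upthread might look like this as a Perl one-liner (a sketch with a made-up filename; @F is zero-indexed, so columns 31 and 33 become $F[30] and $F[32]):
# print rows where both depth fields are present and min > max
perl -F'\t' -lane 'print if $F[30] ne "" && $F[32] ne "" && $F[30] > $F[32]' specimens.tsv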
I don't think I've used awk scripts on more than one or two occasions; I use it mostly for one-liners, and there I like awk's simplicity. When I need something more complex I resort to Python, which is slower but usually manageable. You're right that if I knew Perl I'd have both tools in one language.
Identifying the problem is only half the battle, though. The end goal would be to go through and fix each record. Surely that's easier to do in a database tool than on the command line (sketch at the end of this comment).
Additionally, these data presumably live in a database at the museums they're from. They put them on the web as CSV (or similar) so that the public can see them, but the museum IT guy would end up making the eventual changes in their tracking system anyway.
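To the "easier in a database tool" point: if it turned out the bad rows were simply min and max entered the wrong way round (a big if), the fix in SQL could be as small as something like:
-- hypothetical bulk fix, only valid if the values really were swapped
UPDATE FishSpecimenData
SET MinDepth = MaxDepth, MaxDepth = MinDepth
WHERE MinDepth <> '' AND MaxDepth <> ''
AND CAST(MinDepth AS REAL) > CAST(MaxDepth AS REAL);
In standard SQL the right-hand sides see the old values, so the swap works; MySQL is a notable exception.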
Oh, I thought the article would be about scientific fields, not data fields. I was getting increasingly irritated by the pedantry before I realized what the topic actually was.
Last week I signed a lease that stated "Mold have existed since the beginning of time." While that's scientifically inaccurate, I believe it's legally close enough.
Mold can become a real problem if your bathroom isn't ventilated after showers. I've rented for decades in WA State and every lease I've signed has had a mold clause. The building was built 4 years ago; I watched them build it.