In an attempt to fix the bug, I opened mirexpress's code. And all my confidence in my programming ability vanished when I saw its innards. I understand that the code may have been written by scientists who had no experience in programming, but I have never been so utterly _disoriented_ by bad code. Anyway, after hacking away at the mess for about 3-4 hours, I realized that this was a fool's errand and decided I'd just phone in the next day to say I couldn't do it. I went to sleep thinking it was already late and I'd be late for work the next day.
- 5 minutes later -
I woke up with a start, recalling this nifty tool called awk. I had last used it maybe 3 years ago, and before that only in college. But I could see how awk could do some of the things that mirexpress was claiming to do. So I fired up my computer and wrote an awk script - 2 lines only! TWO FUCKING LINES! And it ran like a charm - ate away at megabytes of sample data and gave me results I could show. So then, like any rational person, I spent the remaining hours re-discovering awk and forgot to sleep. Pissed away the whole next day (and some part of the day after that too!) :-D
It's really fascinating that these nifty little tools invented DECADES ago are still going strong, and there's been no _evolutionary_ leap in the areas where tools like awk/grep/sed excel.
Don't write one-liners if they're not that simple. Write ten-liners! Add comments! Commit them to version control! Awk can be a nice, readable little language if you're not trying to win at code golf all the time.
Awk is a pretty terse language, and that gives you plenty of room to put in whitespace and comments and still have something short and sweet. I think that the idea of writing code to be nicely readable by your collaborators, or your future self, might not have been around in the early days of UNIX.
("But it's just a one-off thing I'm never going to do again, why should I save it to a file?", you may ask. Is what you do important? Then you'll probably have to do something like it again.)
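To illustrate the "ten-liner with comments" idea (this is a made-up example, not any script from the thread), here is what a readable, version-control-worthy awk script might look like, counting word frequencies in its input:

```shell
# count_words.sh -- hypothetical example of a readable awk "ten-liner":
# whitespace and comments cost nothing, and the script stays short.
awk '
    # Tally every whitespace-separated field in the input.
    {
        for (i = 1; i <= NF; i++)
            count[$i]++
    }
    # After all input has been read, print each word with its count.
    END {
        for (word in count)
            print count[word], word
    }
' "$@" | sort -rn    # most frequent words first
```

Saved to a file, it is self-documenting in a way the equivalent one-liner never is.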
It’s an invaluable asset and saves me tons of time on a regular basis. I’d rather search my OWN hard drive for an example of my OWN code for how to - for instance - use a CTE to recursively populate a date dimension table, than to search Google and see someone else’s code, to refresh my memory.
It’s why my Programmer’s Compendium may only ever be useful to me. If you try looking at some of the more complete pages there, you might think they’re useless, but they’re, in fact, all that I need to write down.
However, mine is public, and there’s not much harm in these things being public, because perhaps it can be useful to someone else one day. I would encourage people to share knowledge by default.
So while I also think making it public is a good idea, I would also say never be under pressure to make it accessible to others!
What’s good for learning is almost never good for reference.
Maybe. But stuff like The Perl Cookbook helped me a lot in the past.
The chapter on iterators and generators is excellent, and full of useful ideas and code snippets you can reuse and build upon. I think a large part of that chapter was written by Raymond Hettinger, who also designed and implemented many of those features in Python, IIRC. The book was written by many contributors, for the different sections and recipes.
Interestingly, the Cookbook for Python 3 takes a somewhat different approach. It still has recipes, of course, but they are presented without much discussion. The authors expect you (and say so explicitly up front) to spend more of your own time and thinking figuring out how they work, and to read the relevant Python and external library docs for the background, rather than being given detailed explanations (not that the explanations in the previous edition are very detailed, but they are there). David Beazley and Brian Jones are the main authors of this one, IIRC.
Both models have their merits, IMO. In fact, overall, I think the approach of the Python 3 Cookbook may be better, at least for experienced programmers, because it makes you think and do more on your own (using the book as a base), from which you grow more as a programmer.
I use it to save all sorts of clever snippets on the command line. Helps me revise some nifty commands periodically and thereby grow expertise in them with time.
Furthermore, a lot of it contains stuff that I'm doing for work at any given moment. Some queries are designed to run against databases at work. I don't know what kind of hot water I could get in for letting that code loose on the internet, but I'd rather not find out.
I may consider making some sort of "best of" project and putting some stuff out there, though. It would be fun to pick 5-7 concepts that I've struggled with, and spend some time writing about them and making them presentable for the public.
I also recall a more recent interview where, when asked about his language preferences, he admitted he does very little programming anymore. When the interviewer pressed for what language he would use now (sigh), he said he would just use Python.
By designing a terse language, they made people want to do things with it that are short and sweet, preventing it from feature-creeping into a full-featured language where you'd write thousands of lines of spaghetti. Is that the idea?
Either you won’t understand it later, or you’re just wasting effort.
It's not worth it. As soon as you go over about 3 lines, you're better off switching to Ruby, Python or even Perl.
I wonder if the problem is twofold: 1) a lack of education compounded by 2) the rapid evolution of computer systems.
Unix is a rare beast in that not just its philosophy but even its components survive to this day and remain relevant. People in unrelated fields rebuild tools that could just as well be assembled using Unix's basic components, but they're just not aware of them. And why would they look for these antiquated tools? They've been trained to reasonably expect old tools to have been replaced by newer, better, more featureful ones.
AWK was created before I was born, and I'm among the more senior engineers on my 20+ team.
From what I've seen, amid all the research, failed experiments, the constant fight for funding (in some cases), and "life" in general, people have much less time to learn critical things like programming and the tools so useful to them. They will do it at some point, but probably not until their life depends on it. On the other hand, if life depended on me recognizing those amino acids, we'd all be :-)
Compounded by the fact that in science, once you have the result, the code is probably just an artifact, so there's no real reason to refactor it.
Could you explain what you mean by this?
It's easier to break the mould if you're not bound by other people's software, but it starts to look awfully sciency if you explore too far :D So it's useful to keep wiping the slate clean, to not get bogged down by previous experiments.
pd, processing, chuck, cm/clm, or just old-school mod-tracking are some good ways to get going
It's a shame to see so much bloated, over-abstracted development these days, with more emphasis on delivery than on doing one thing well.
Seeing what the demoscene is able to pull off in very little code makes me hope that the art of fundamentals and single-responsibility principles, as in Unix/Linux, resonates with younger developers: truly understanding a problem's scope before implementing a solution. Above all else, have fun with it; be curious, even at a low level, about how your program functions at the OS and hardware levels. Things like: what is the von Neumann bottleneck? Even the people building boolean logic gates in Minecraft inspire you to want to know "how does it work? And why?"
But that article was taken off the IBM dW site some time ago, after being there for some years. However, I wrote to IBM and got the PDF of the article, and put it on my Bitbucket account, along with the C code for the utility.
This post on my blog describes the article:
And here is the Bitbucket project for selpg, the utility used as a case study, with the C code and the article text (as PDF):
The article and all the source files are here:
It may be of use to people who want to progress beyond using Unix / Linux command-line utilities (in C, but the principles and techniques can be adapted to other languages like Python, Ruby, etc.), to writing such utilities themselves, along with integrating them into shell scripts and pipelines.
One of these days, when the mood strikes, I'll upload the matlab script I inherited from my Ph.D supervisor.
Scientific programming is, by and large, in an abysmal state.
I will never forget the gut-wrenching feeling when I opened up one of the main provisioning scripts (in Python) to read the first two lines:
True = 0
False = -1
Could you please explain what is so egregious about those two lines for those people that may not see it? Asking for a friend.
This script is redefining them to be backwards (shell-style, where 0 means success), and Python is flexible enough to allow that. But I would expect this to break most Python libraries, because such a core language feature is being completely redefined.
(Python resolves names with a search path: it looks first in the current module's namespace, then in the builtins module, which is where the built-in constants and functions reside; so a module-level assignment to True shadows the real one.)
Is there a valid use case for changing the numeric values of True and False?
Python 2 has _many_ quirks like this one. Python 3 was not just "it's unicode".
No! Actually I take it back this _is_ definitely still a terrible idea and will lead to certain madness for all involved. :)
However, maybe the professor needed 'true/false' to represent something to be used in his formulas, so this made sense to him. But who knows.
>>> True = 0
>>> False = -1
>>> if False: print('oh no')  # False is now -1, which is truthy
...
oh no
For the curious, the workflow builds ML models to predict binding to unwanted proteins for new drug compounds, based on drug molecule-protein target interaction data, and is implemented with our Go-based SciPipe workflow library.
The raw data is an 18 GB TSV file (ExcapeDB), and AWK really shines for this.
I sometimes wish we had an SQL-like language for expressing all these computations, because we ended up doing some pretty complex join-like filtering and such. But in my tests with SQLite, I would not get a query answer for something like ~5 minutes, so I gave up. With AWK, I can do a "head -n 10" on the result of the AWK operation (or on the input data file) and immediately verify that it works as expected. This seems to be one of the strongest points of the Unix tool philosophy: the ability to check partial results in no time, and so iterate more quickly when developing long "pipelines" of chained commands.
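As a rough sketch of that join-like filtering pattern (file names and columns here are invented, not from the actual workflow), awk can load a smaller ID list into an associative array and then stream the big TSV against it, with head for instant sanity checks:

```shell
# Hypothetical sketch: keep only rows of a big TSV whose first column
# appears in a smaller list of wanted IDs -- a join-like filter in awk.
awk -F '\t' '
    NR == FNR { wanted[$1] = 1; next }   # first file: load the ID list
    $1 in wanted                         # second file: print matching rows
' wanted_ids.txt bigdata.tsv > filtered.tsv

# Inspect the first few rows of the result immediately:
head -n 10 filtered.tsv
```

Because awk streams the data, the first matching rows appear right away, which is what makes the quick-iteration style described above possible.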
It also fits perfectly as a way to build SciPipe workflow components: with the component code in a separate language (AWK), rather than as inline Go (which is also possible in SciPipe), it is super easy to send this component off to the HPC resource manager (given that we have a shared parallel filesystem): just prepend the command with the SLURM salloc command and its parameters.
you might be interested in datajoint:
Learning the basics of awk, sed, etc is almost a requirement.
Most of my time is spent on the terminal and without these I'd shoot myself.
Haha! I've been there!
Awk is amazing and when paired with sed it exponentially increases what one can achieve on the command line.
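A small made-up example of that sed + awk pairing (the file and its format are hypothetical): sed cleans the text, then awk does the fielded arithmetic:

```shell
# Hypothetical sed + awk pipeline: strip comments and blank lines with sed,
# then sum the second column with awk.
sed -e 's/#.*//' -e '/^[[:space:]]*$/d' data.txt \
    | awk '{ sum += $2 } END { print sum }'
```

Each tool does the one thing it is good at, and the pipe does the rest.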
I tend to one-line most of my things in bash, and write a script if I have a multiple-use case.
Such is the folly of bioinformatics. Most of the stuff that we use tends to be one-off. :/