
Recently, I was helping out a friend in analyzing some RNA samples for her work. These samples are huge - like nearly a gigabyte of data. There was this tool which was recommended for the job - mirexpress. It was a small job, perhaps 10 minutes worth of effort. To make my work easier, I provisioned a beefy (and costly) machine on Azure to do the job, took a quick look at the clock (it said 11 PM), ran the tool, and relaxed. The tool crashed while reading the file.

In an attempt to fix the bug, I opened mirexpress's code. And all my confidence in my programming ability vanished when I saw its innards. I understand that the code may have been written by scientists who had no experience in programming, but I have never been so utterly _disoriented_ by bad code. Anyways, after hacking away at the mess for about 3-4 hours, I realized that this was a fool's errand and thought I'd just phone it in the next day saying I couldn't do it. I went to sleep thinking that it was already late and I'd be late for work the next day.

- 5 minutes later -

I woke up with a start, recalling this nifty tool called awk. I had last used it maybe 3 years ago, and before that only in college. But I could see how awk could do some of the things which mirexpress was claiming to do. So I fire up my computer, write an awk script - 2 lines only! TWO FUCKING LINES! And it runs like a charm - eats away at megabytes of sample data and gives me results I can show. So then like any rational person, I spent the remaining hours re-discovering awk and forgot to sleep. Pissed away the whole next day (and some part of the day after that too!) :-D
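To give a flavor (the actual script isn't shown here, so this is a hypothetical sketch, assuming the reads come in FASTQ format, where the sequence sits on every fourth line starting at line 2), a two-liner like this tallies how often each read occurs, most frequent first:

    awk 'NR % 4 == 2 { count[$0]++ }
         END { for (seq in count) print count[seq] "\t" seq }' samples.fastq | sort -rn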

It's really fascinating that these nifty little tools invented DECADES ago are still going strong, and there's been no _evolutionary_ leap in the areas where tools like awk/grep/sed excel.




The one thing I find unfortunate about how people learn awk is the emphasis on making it terse and unreadable, especially trying to turn everything into a shell one-liner.

Don't write one-liners if they're not that simple. Write ten-liners! Add comments! Commit them to version control! Awk can be a nice, readable little language if you're not trying to win at code golf all the time.
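For example, here's a toy word-frequency counter written as a commented awk file instead of a golfed one-liner (the script name is just illustrative):

    #!/usr/bin/awk -f
    # wordfreq.awk -- print each word with how often it occurs.
    # Usage: awk -f wordfreq.awk input.txt | sort -rn

    {
        # Tally every whitespace-separated field on every line.
        for (i = 1; i <= NF; i++)
            count[$i]++
    }

    END {
        for (word in count)
            print count[word], word
    }

It does exactly what the equivalent one-liner does, but your future self can actually read it.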


bwk addressed this in a video at U Nottingham a few years ago. In a nutshell, he argues that they designed it to be terse to avoid giving people an excuse to write thousands of lines for no reason. Less code = better (at least to his mind).


I can't imagine what one would do with thousands of lines of awk, so that seems a little extreme.

Awk is a pretty terse language, and that gives you plenty of room to put in whitespace and comments and still have something short and sweet. I think that the idea of writing code to be nicely readable by your collaborators, or your future self, might not have been around in the early days of UNIX.

("But it's just a one-off thing I'm never going to do again, why should I save it to a file?", you may ask. Is what you do important? Then you'll probably have to do something like it again.)


I have a little repo called “cleversql”. It’s full of little scripts that represent hard problems that I’ve had to solve, clearly-named and well-documented. I fall back on it often, when I remember that I’ve solved a certain kind of problem before, but I cannot recall exactly how.

It’s an invaluable asset and saves me tons of time on a regular basis. I’d rather search my OWN hard drive for an example of my OWN code for how to - for instance - use a CTE to recursively populate a date dimension table, than search Google and see someone else’s code, to refresh my memory.


Have you considered making it (or part of it) publicly available?


I suspect a huge amount of the value of it is because it is personal, i.e. based heavily on his past experience. It means he is always just refreshing his memory rather than looking at something new.

It’s why my Programmer’s Compendium [0] may only ever be useful to me. If you try looking at some of the more complete pages there, you might think they’re useless, but they’re, in fact, all that I need to write down.

However, mine is public, and there’s not much harm in these things being public, because perhaps it can be useful to someone else one day. I would encourage people to share knowledge by default.

So while I also think making it public is a good idea, I would also say never be under pressure to make it accessible to others!

What’s good for learning is almost never good for reference.

[0] https://qasimk.gitbooks.io/programmers-compendium/content/


> What’s good for learning is almost never good for reference

Maybe. But stuff like The Perl Cookbook helped me a lot in the past.


The Perl Cookbook is just great. IIRC I have it, though I haven't looked at it lately. (I had read most of it earlier, though.) I also have the Python Cookbook for Python 2 [1] and for Python 3 [2]. Both of those are great too. All three are from O'Reilly Media. It's not only that you get ready-made recipes that you can use; you get to learn a lot about language and library features and software design too, from such books.

[1] The chapter on iterators and generators is excellent, and full of useful ideas and code snippets you can reuse and build upon. I think a large part of that chapter was written by Raymond Hettinger, who also has designed and implemented much of those features in Python, IIRC. This book is written by many contributors, for the different sections and recipes.

[2] Interestingly, the Cookbook for Python 3 takes a somewhat different approach. It still has recipes, of course, but they are presented without too much discussion. The authors expect you to (and say so explicitly up front) use more of your time and thinking to figure out how they work (and to read the relevant Python and external library docs for the background information needed), rather than giving detailed explanations (not that the explanations in the previous edition are very detailed, but they are there). David Beazley and Brian Jones are the main authors of this one, IIRC.

Both models have their merits, IMO. In fact, overall, I think the approach of the Python 3 Cookbook may be better, at least for experienced programmers, because it makes you think and do more on your own (using the book as a base), from which you grow more as a programmer.


Also recommend https://github.com/knqyf263/pet

I use it to save all sorts of clever snippets on the command line. Helps me revise some nifty commands periodically and thereby grow expertise in them with time.


Not really. I don't think it would be terribly useful to anyone else - it's a reflection of how I understand and process certain techniques... and I really don't want the mental overhead of knowing that there's the chance that strangers on the internet could be judging my work.

Furthermore, a lot of it contains stuff that I'm doing for work at any given moment. Some queries are designed to run against databases at work. I don't know what kind of hot water I could get in for letting that code loose on the internet, but I'd rather not find out.

I may consider making some sort of "best of" project and putting some stuff out there, though. It would be fun to pick 5-7 concepts that I've struggled with, and spend some time writing about them and making them presentable for the public.



"aaa" is a fitting name because that's what I said out loud when I followed the link.


I believe there is also an earlier reference where he said something similar. If I recall correctly, he said he was surprised when he became aware people were writing large programs in awk.

I also recall a more recent interview where, when asked about his language preferences, he admitted he does very little programming anymore. When the interviewer pressed for what language he would use now (sigh), he said he would just use Python.


Oh, I think I understand the point here now. I said in my other comment that I can't imagine what one would do with thousands of lines of awk -- but maybe this means their design succeeded.

By designing a terse language, they made people want to do things with it that are short and sweet, preventing it from feature-creeping into a full-featured language where you'd write thousands of lines of spaghetti. Is that the idea?


Pretty much. Another bwk-ism I recall hearing was “don’t be too clever or too dumb” about whatever you write.

Either you won’t understand it later, or you’re just wasting effort.


I've written hundred-line awk scripts.

It's not worth it. As soon as you go over about 3 lines, you're better off switching to Ruby, Python or even Perl.



The "strongly-hyped" and "lookalike" languages points are good, at around 8 to 9 mins in :)


Build it bit by bit in a wiki. When it gets too big to type in reliably put it in version control. Add a help function and more flags for all the corner cases your coworkers discover.


Nice story! I have a friend who worked in bioinformatics, but he was good at programming and went to better-paying pastures.

I wonder if the problem is twofold: 1) a lack of education compounded by 2) the rapid evolution of computer systems.

Unix is a rare beast in that not just its philosophy but even its components survive to this day and remain relevant[1]. People in unrelated fields rebuild tools that could just as well be assembled using Unix's basic components, but they're just not aware of them. And why would they look for these antiquated tools? They've been trained to reasonably expect old tools to have been replaced by newer, better, more featureful ones.

[1] AWK was created before I was born, and I'm among the more senior engineers on my team of 20+.


It's actually pretty common in all kinds of fields to see people reproduce things that have been known for decades. A fun (and infamous) example is this paper (http://care.diabetesjournals.org/content/17/2/152), which rediscovers the trapezoidal rule to calculate area under the curve and got published.


I think it's just that they are good at biology, not computers. I mean if they entrusted ME to tell two amino acids apart by looking at their behaviour, I'd take a month!

From what I've seen, amid all the research, failed experiments, constant fight for funding (in some cases), and "life" in general, people have much less time to learn these critical things like programming and the tools so useful to them. They will do it at some point, but probably not until their life depends on it. On the other hand, if life depended on me recognizing those amino acids, we'd all be :-)


I think part of it too is that it's exploratory. Sometimes I use javascript to write music, and as creative twists and turns happen, the code folds in on itself in ugly ways that don't happen when solving a clearer problem.

Compounded by the fact that in science, once you have the result, the code is probably just an artifact, so there's no real reason to refactor it.


> Sometimes I use javascript to write music

Could you explain what you mean by this?


Just messing around and nothing worth showing, but the web audio API gives all the low level stuff, and then hot module reload from webpack means you can change a function while the music is playing and it updates in real time. Not being bound by a GUI makes it trivial to do things like arpeggiate or automate filters, and then orchestrate that at higher levels

It's easier to break the mould if you're not bound by other people's software, but starts to look awfully sciency if you explore too far : D So it's useful to keep wiping the slate to not be bogged down by previous experiments


So are you programming on the sequence level (notes, parameter automation) instead of on the synthesis level? From a previous life as a music software geek I remember an abundance of programming environments for synthesis, and a shocking lack of options for programming one level higher.


I have never done music programming. Where can I read more about this?


not 100% focused on the programming/sound design side of things, but https://linuxaudio.org/resources.html has a lot of good pointers.

pd, processing, chuck, cm/clm, or just old-school mod-tracking are some good ways to get going


I am always impressed by the UNIX tools because of their simplicity and yet how easily you can build powerful systems due to their composability. But what I find even more amazing is that they "got it right" so many years ago. We have progressed so much in our understanding of computing and human-computer-interaction, but not many tools that have come out recently can claim similar success.


Although standard awk has the advantage of being installed on all POSIX systems, for bioinformatics there's a nifty bioawk (https://github.com/lh3/bioawk) that extends awk so that records aren't necessarily lines but can be whole fasta/fastq records. Pretty neat for one-liners.
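For instance, using bioawk's documented -c fastx mode (file name hypothetical), printing each sequence name with its length works the same whether the input is fasta or fastq:

    bioawk -c fastx '{ print $name, length($seq) }' input.fasta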


When you start looking at the code of the tools, and how older systems were designed, it really is a testament to good engineering practices, because it all scales so well. Small, simple, "one task well" utilities that process data in streams mean the megabyte files of the '80s and '90s are today's terabytes. Sure, they require some work to get things scripted correctly, but it really is pretty amazing what coreutils and a shell script can do.
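A minimal sketch of that streaming style (file name and column are made up):

    # Stream-decompress, keep rows where column 3 exceeds 100, count them.
    # Each stage works on a small rolling buffer, never the whole file.
    gzip -dc big.tsv.gz | awk -F '\t' '$3 > 100' | wc -l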


That's the Unix philosophy: https://en.wikipedia.org/wiki/Unix_philosophy

It's a shame to see so much bloated, over-abstracted development these days, with more emphasis on delivery than on doing one thing well.

https://en.wikipedia.org/wiki/Demoscene

Seeing what the demoscene is able to pull off in very little code makes me hope that the art of fundamentals and single-responsibility principles, as in Unix/Linux, resonates with younger developers: the ability to truly understand a problem's scope before implementing a solution. Above all else, have fun with it; be curious about how your program functions at the OS and hardware levels. What is the von Neumann bottleneck? Even people building boolean logic gates in Minecraft are inspiring: they make you want to know "how does it work, and why?"


Also a nice read: The art of writing Linux utilities (Developing small, useful command-line tools) [0]

[0] http://people.fas.harvard.edu/~lib113/reference/unix/writing...


A tutorial I had written for IBM developerWorks, titled "Developing a Linux command-line utility" (in C), is mentioned in the Resources section of that page ([0]) above.

But that article was archived off the IBM dW site some time ago, after being there for some years. So I wrote to IBM, got the PDF of the article, and put it on my Bitbucket account, along with the C code for the utility.

This post on my blog describes the article:

https://jugad2.blogspot.com/2014/09/my-ibm-developerworks-ar...

And here is the Bitbucket project for selpg, the utility used as a case study, with the C code and the article text (as PDF):

https://bitbucket.org/vasudevram/selpg

The article and all the source files are here:

https://bitbucket.org/vasudevram/selpg/src

It may be of use to people who want to progress beyond using Unix / Linux command-line utilities to writing such utilities themselves (in C, but the principles and techniques can be adapted to other languages like Python, Ruby, etc.), along with integrating them into shell scripts and pipelines.


You see this in malware development a lot as well, the search for smaller and more efficient executables that use a minimum of code to function.


The difference is that most of the things back in 60s, 70s and 80s were built by people who were actually passionate about computers and programming. Then the career types took over — the ones who spend more time planning sprints and following SCRUM practices and figuring out ways to exaggerate what they do than they spend writing actual code. Of course, some of the blame can be laid on the feet of the people who fancy themselves engineering manager because they happen to have an MBA — despite the fact that they can’t Engineer their way out of a paper bag.


You might appreciate the original implementation of awk. Good engineering everywhere.

https://github.com/RetroBSD/retrobsd/tree/master/src/cmd/awk


Updated, but still the original:

https://github.com/onetrueawk/awk


>but I have never been so utterly _disoriented_ by bad code

One of these days, when the mood strikes, I'll upload the matlab script I inherited from my Ph.D supervisor.

Scientific programming is, by and large, an abysmal state of affairs.


At one job I had, we got "help" from a local college professor on how to design our database. A lot of what he said made sense. And we were super green, and databases were not our expertise or focus. Short on time, high on cash, we asked if he'd implement it for us.

I will never forget my gut wrenching feeling when I opened up one of the main provisioning scripts (in Python) to read the first two lines:

  True = 0
  False = -1
The rest of it was an unmaintainable disaster that may have worked. I learned so much that summer, having been forced to learn it all myself.



> I will never forget my gut wrenching feeling when I opened up one of the main provisioning scripts (in Python) to read the first two lines:

  True = 0
  False = -1

Could you please explain what is so egregious about those two lines for those people that may not see it? Asking for a friend.


In most programming languages, including Python, true is represented by 1 and false by 0.

This script is changing it to be backwards, and Python is flexible enough to allow that. But I would expect this to break most Python libraries, because it's such a core language feature that's being completely restructured.

https://en.wikipedia.org/wiki/Boolean_data_type


Actually, Python 2 allows that; in Python 3, True, False, and None are reserved words [1].

[1] https://docs.python.org/3/whatsnew/3.0.html#changed-syntax


Luckily, assigning to True and False affects only the current module. If he had changed __builtins__.True and __builtins__.False, on the other hand...

(Python has a search path for names: it looks first in the current module, then in the built-in __builtins__ module, which is where the built-in constants and functions reside.)


When doing comparison checking, in most programming languages false is 0 and true is any other value.


> python is flexible enough to allow that

Is there a valid use case for changing the numeric values of True and False?


No, it's legacy behavior and the community took the opportunity to fix it in Python 3, where it raises a SyntaxError.

Python 2 has _many_ quirks like this one. Python 3 was not just "it's unicode".


It’s, IMHO, a terrible idea, but the one use I could imagine is if you are doing a lot of stuff with Unix exit codes, where exit code 0 is success. That would cover setting True = 0, maybe, but not False = -1, since failure is usually a positive exit code.

No! Actually I take it back this _is_ definitely still a terrible idea and will lead to certain madness for all involved. :)


True != Success


True = (True != Success)


No.


True is normally 1, while False is 0. These also look like explicit variable declarations, which you generally don't need, as booleans are built into most languages, e.g. "variable = false;"

However maybe the professor needed 'true/false' to represent something to be used in his formulas so this made sense to him. But who knows.


    >>> True = 0
    >>> False = -1
    >>> if False: print('oh no')
    ...
    oh no
False is now truthy and True is falsy; the block above prints 'oh no' because False is -1, which is truthy. Note this only works in Python 2.


I literally almost choked on my drink reading this.


That's a nice story, but True and False are keywords in Python. Such assignment attempts are a syntax error.


This was not true in Python 2, and redefining True to False used to be a jokey way to break the interpreter in a bunch of fun ways.


In fact, there's still lots of code in the wild that assigns to True/False, presumably for compatibility with very old Python versions:

https://codesearch.debian.net/search?q=%28%3Fm%29%5E%5Cs%2AT...


I counted at least 10 awk components in a recent bioinformatics/cheminformatics workflow we wrote (search for 'awk' in these lines: https://github.com/pharmbio/ptp-project/blob/aa91f1/exp/2018... ).

For the curious, the workflow builds ML models to predict binding of new drug compounds to unwanted proteins, based on drug molecule-protein target interaction data, and is implemented with our Go-based SciPipe workflow library [0].

The raw data is an 18 GB tsv file (ExcapeDB [1]), and AWK really, really shines for this.

I sometimes wish we had an SQL-like language for expressing all these computations, because we ended up doing some pretty complex join-like filtering and stuff. But in my tests with SQLite, I would not get a query answer until after ~5 minutes, so I gave up. With AWK, I can do a "head -n 10" on the result of the AWK operation (or on the input data file) and immediately verify that it works as expected. This seems to be one of the strongest points of the unix tool philosophy: the ability to check partial results in no time, and so iterate more quickly when developing long "pipelines" of chained commands.
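As a sketch of that join-like filtering (file names hypothetical): keep only the tsv rows whose first column appears in a whitelist file, and peek at the first few results right away:

    # NR == FNR is true only while reading the first file (the whitelist).
    awk 'NR == FNR { keep[$1]; next } $1 in keep' whitelist.txt data.tsv | head -n 10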

It also fits perfectly as a way to build SciPipe workflow components: with the component code in a separate language (AWK) rather than inline Go (which is also possible in SciPipe), it is super easy to send the component off to the HPC resource manager (given that we have a shared parallel filesystem). Just prepend the SLURM [2] salloc command, with its parameters, to the command.

[0] http://scipipe.org

[1] https://zenodo.org/record/173258

[2] https://slurm.schedmd.com/


Sounds like xsv might be a nice complement to your toolkit: https://github.com/BurntSushi/xsv


Looks very interesting, thanks!


> an SQL-like language for expressing all these computations

you might be interested in datajoint:

https://github.com/datajoint/

https://tutorials.datajoint.io/


Standard unix tools are so incredibly useful in bioinformatics. Almost all of our data formats are, or can be canonically transformed into, plain-text.

Learning the basics of awk, sed, etc is almost a requirement.


Sort, cut and uniq are also amazingly useful.


Don't forget paste. You need that to make your table from your column-oriented data store.


Don't forget join, tr and column. Makes formatting a breeze.

Most of my time is spent on the terminal and without these I'd shoot myself.
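For example, a quick frequency table over one column (file and column hypothetical), with column -t aligning the output:

    # Most common values in column 2 of a tsv, as a neatly aligned table.
    cut -f 2 data.tsv | sort | uniq -c | sort -rn | head | column -t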



Wow, thanks!


> So then like any rational person, I spent the remaining hours re-discovering awk and forgot to sleep. Pissed away the whole next day (and some part of the day after that too!) :-D

Haha! I've been there!

Awk is amazing and when paired with sed it exponentially increases what one can achieve on the command line.
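A tiny example of the pairing (file and column hypothetical): sed cleans the stream, awk aggregates what's left:

    # Drop comment lines with sed, then sum column 2 with awk.
    sed '/^#/d' counts.tsv | awk -F '\t' '{ sum += $2 } END { print "total:", sum }'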


I fucking love awk and sed so much that to my chagrin I haven't learnt to program properly.

I tend to one-line most of my things in bash, and write a script if I have a multiple-use case.

Such is the folly in bioinformatics. Most of the stuff that we use tends to be one-off. :/


I used mawk for many years and wrote numerous >1000 line programs for manufacturing automation. It is just about the fastest interpreter I have ever encountered, processing multi-megabyte files in just seconds.


ack-grep has been kinda revolutionary for some of my uses.


ag was an improvement over grep.



