I learned this from my doctor, who is a diabetes specialist. Some years ago we were talking about how to form good habits for kids; my wife and I were avoiding caffeinated beverages for the kids but letting them have apple juice and the like. His opinion, based on the evidence, was that he would much rather have our kids drink caffeine every day and stay away from the sugar. His specific point was that people on the internet point to unknown risks of aspartame or caffeine but ignore the mountain of known risks associated with sugar.
It was an eye-opening discussion, to say the least.
For me the takeaway is that non-trivial, independent replication of results still stands as the gold standard for experimentation. P-hacking shows up clearly when an experiment is replicated. In my experience, p-values are only a starting point for further experiments.
Take the cold-fusion debacle: when independent groups attempted to replicate the results, they were unable to do so.
Of course the other big takeaway is that people really don't understand statistics at a basic level, even the "scientific reporters".
> For me the takeaway is that non-trivial, independent replication of results still stands as the gold standard for experimentation.
Agreed. Too bad funding agencies rarely if ever give you the money to do it :(.
The takeaway here is that statistics is arguably one of the most nuanced quantitative fields out there. It's really easy to shoot yourself in the foot, particularly with p-values.
I think every statistical test has its place, but my personal favorite lambasting of p-value testing is Steiger and Fouladi's 1997 paper on non-centrality interval estimation.
As an aside, Steiger was my graduate statistics professor several years ago, and probably the primary reason I know this paper even exists. If you enjoy the harsh treatment of significance testing in the paper, just imagine hearing it straight from the horse's mouth during lecture :).
This also isn't a straight either/or proposition. I build local command-line pipelines for testing and/or processing. When the amount of data passes into the range where memory or network bandwidth makes processing more efficient on a Hadoop cluster, I make some fairly minimal conversions and run the same stream processing on the cluster in streaming mode. It hasn't been uncommon for my jobs to be much faster than the same jobs run on the cluster with Hive or some other framework; much of the speed boils down to the optimizer and the planner.
Overall I find it very efficient to use the same toolset locally and then scale it up to a cluster when and if I need to.
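As a sketch of that scale-up step (all file names, HDFS paths, and the streaming-jar location here are illustrative; the jar's name and path vary by Hadoop distribution), the same stdin/stdout filter scripts can move from local pipes to a streaming job with essentially no changes:

```shell
# Locally: plain Unix pipes, with "sort" playing the role of the shuffle
gzip -dc logs/*.gz | ./map.sh | sort | ./reduce.sh > counts.txt

# On the cluster: the same filter scripts, submitted as a streaming job
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input  /logs/2015-01-01 \
    -output /tmp/counts \
    -mapper map.sh \
    -reducer reduce.sh \
    -file map.sh \
    -file reduce.sh
```

The minimal conversion is exactly this: the filters themselves don't change, only the harness that feeds them.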
The vocabulary of the grandparent comment implies they are using Hadoop's streaming mode, and thus one can use a map-reduce streaming abstraction such as MRJob, or just plain stdin/stdout; both will work locally and in cluster mode.
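For the plain stdin/stdout route, a minimal sketch (the word-count task and the function names are hypothetical): filters written to streaming's tab-separated key/value convention can be tested locally by letting `sort` emulate the cluster's shuffle phase.

```shell
# Hypothetical word-count filters following hadoop streaming's
# "key<TAB>value" line convention; they only read stdin and write stdout.
mapper()  { tr -s ' ' '\n' | awk 'NF { print $1 "\t" 1 }'; }
reducer() { awk -F'\t' '{ c[$1] += $2 } END { for (k in c) print k "\t" c[k] }'; }

# Locally, plain "sort" stands in for the shuffle/sort between phases.
printf 'a b a\nb a\n' | mapper | sort | reducer | sort
# -> a	3
#    b	2
```

The same two scripts can then be handed to the cluster as `-mapper` and `-reducer` without modification.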
Or, if static typing is more agreeable to your development process, running Hadoop in "single machine cluster" mode is relatively painless. The same goes for other distributed processing frameworks like Spark.
Yeah that xkcd came to my mind as well. I'm amazed at the timeless truths he captures in those comics. Creating another version which may be better isn't the hard part. Gaining consensus and getting people to give up the other ones is the hard part.
wc -l is faster, but it only works on an uncompressed file, and the pipeline in the article was doing much more complicated work on the stream of output. I will often separate my pipeline logic from my "read this data and feed it into the pipeline" logic, since what to read in and where to write it out can change even when the logical pipeline stays the same.
As a simple example:
cat somefiles.txt | pipeline_script > stored_output.txt
Other times I'm reading gzipped files or remote log files; I don't want that input-handling logic mixed in with the pipeline logic. If I want to move the output files to a different directory on one set of servers, that change may not impact my generalized pipeline.
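A sketch of that separation (the stand-in pipeline and file names are made up): the pipeline reads only stdin, so the producer can be swapped without touching it.

```shell
# Hypothetical pipeline stage: count non-empty lines. It never opens
# files itself; it only reads stdin and writes stdout.
pipeline_script() { grep -c .; }

# Plain text input...
printf 'a\n\nb\n' > somefiles.txt
cat somefiles.txt | pipeline_script            # -> 2

# ...or gzipped input: the producer changes, the pipeline does not.
printf 'a\n\nb\n' | gzip -c > somefiles.txt.gz
gzip -dc somefiles.txt.gz | pipeline_script    # -> 2
```

A remote source (`ssh somehost 'cat /var/log/app.log' | pipeline_script`) slots in the same way.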
I work with trillions of lines of log files and there are many ways to scale up pipelines. I wouldn't start optimizing the difference between 1.6s and 1.275s unless it made an economic impact on the problem. How often is it run? Can you process more lines in an economic unit of time with the faster version? If this is a job that runs once an hour or even once a minute how many lines are collected during that time?
Intermixing the grep or wc -l logic into the pipeline logic can carry support and maintenance costs; it often saves machine time while spending programmer time. Which one dominates in this scenario?
It's like the old saying...there is more than one way to skin a "cat". :)
Interesting concept, but the flaw at its heart is the presupposition that a lack of progress right now means we are at the ultimate limit of what can be accomplished. Imagine if cavemen learning to paint on walls said, well we haven't improved in a few millennia, so this is probably the most complex thing that can be represented by drawings. Or what about math stopping with Euclid? It was thousands of years later that progress happened.
Technology comes in fits and starts. A lot of new things happened in the '50s and '60s, and we are still trying to figure out ways to use and apply them. Just because someone thought about and prototyped something then doesn't mean it's not new when the feature goes mainstream (e.g. garbage collection in Java, channels for concurrency in Go, etc.).
For a long time we couldn't break the sound barrier; that was a limit, but not a speed-of-light limit. Just because progress is stalled now doesn't mean there will never be progress in the future.
The author addresses this in the post. First by comparing major technological advances to earthquakes: "they occur at unpredictable intervals, with little prior warning before their emergence." At the end of the post he says that he doesn't think we've reached an ultimate limit and that the language design space needs to be explored more fully.
I think it's more of a thought experiment; just an interesting idea to entertain even if we haven't reached the limit. The more immediately applicable point is that of a cognitive limit, more so than the suggestion of a technological limit.
I think the author did address this, but you raise an interesting analogy. I remember Carl Sagan made a point in Cosmos that the natural sciences lost millennia of progress thanks to the dominance of Platonic thought and mysticism over empirical research and observation. I do not think it will take millennia to see it, but perhaps in a few decades we will see the present circumstances of programming-language application in a similar light.
> "Imagine if cavemen learning to paint on walls said, well we haven't improved in a few millennia, so this is probably the most complex thing that can be represented by drawings."
The analogy between painting and programming languages has some precedent around the HN community. But the idea that contemporary painted images are more advanced than those on the walls of Lascaux is suspect because it is premised upon our acceptance of a belief that the ancient paintings were not well suited to their purpose.
It is more likely that the opposite is the case. Today, most paintings carry little significant meaning for their author's larger community. The odds that Salvador Dali's butcher was impacted by the armlessness of the Venus de Milo are pretty low, never mind the $86 clown-on-steel collage at the Mission Thrift Store.
Art and programming languages have evolved, but that evolution is Darwinian, not teleological. High-level languages are better adapted to the humans who write them, not to the machines which run them.