Sin #1 - Unnecessary distributed parallelism: It does not always make sense to parallelize a computation. Doing so almost inevitably comes with an additional overhead, both in runtime performance and in engineering time. When designing a parallel implementation, its performance should always be compared to an optimized serial implementation in order to understand the overheads involved. If we satisfy ourselves that parallel processing is necessary, it is also worth considering whether distribution over multiple machines is required. The rapid increase in RAM and CPU cores can make local parallelism economical and worthwhile. Establish the need for distributed parallelism on a case-by-case basis (a sketch of this comparison follows after this list).
Sin #2 - Assuming performance homogeneity: Virtualized cloud environments exhibit highly variable and sometimes unpredictable performance. Base reported results on multiple benchmark runs, and report the variance (see the second sketch after this list).
Sin #3 - Picking the low-hanging fruit: It is unsurprising that it is easy to beat a general system by specializing it. Few researchers comment on the composability of their solutions.
Sin #4 - Forcing the abstraction: In some application domains, it is unclear if a MapReduce-like approach can offer any benefits, and indeed, some have argued that it is fruitless as a research direction. Ideally, future research should build on [iterative processing, stream processing and graph processing], rather than on the MapReduce paradigm.
Sin #5 - Unrepresentative workloads: The common assumption in academic research systems is that the cluster workload is relatively homogeneous. Most research evaluations measure performance by running a single job on an otherwise idle cluster.
Sin #6 - Assuming perfect elasticity: The cloud paradigm [...] its promise of an unlimited supply of computation. This is, of course, a fallacy. Workloads do not exhibit infinite parallel speedup. The scalability and supply of compute resources are not infinite. There are limits to the scalability of data center communication infrastructure. Another reason for diminishing returns from parallelization is the increasing likelihood of failures and vulnerability to "straggler" tasks (see the Amdahl's law sketch after this list).
Sin #7 - Ignoring fault tolerance: Many recent systems neglect to account for the performance implications of fault tolerance, or indeed of faults occurring. For each system, we should ask whether fault tolerance is relevant or required. If it is, it makes sense to check precisely what level is required and what faults are to be protected against, and to consider, and ideally quantify, the cost (see the last sketch below).
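On Sin #1, a minimal sketch (mine, not the paper's) of the comparison the authors call for: time an optimized serial run against local multiprocessing before reaching for a cluster. Here work() is a hypothetical CPU-bound stand-in for the real computation.

    import time
    from multiprocessing import Pool

    def work(chunk):
        # Hypothetical CPU-bound stand-in for the real computation.
        return sum(x * x for x in chunk)

    def timed(fn):
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    if __name__ == "__main__":
        data = list(range(2_000_000))
        chunks = [data[i::8] for i in range(8)]

        serial_t = timed(lambda: work(data))
        with Pool(8) as pool:
            parallel_t = timed(lambda: sum(pool.map(work, chunks)))

        print(f"serial {serial_t:.2f}s, 8-way local {parallel_t:.2f}s, "
              f"speedup {serial_t / parallel_t:.1f}x")

If the local speedup is already poor here, distributing the same job over a network will only add overhead.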
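On Sin #2, reporting variance costs almost nothing. A sketch, assuming run_benchmark() stands in for your workload:

    import statistics
    import time

    def run_benchmark():
        time.sleep(0.1)  # hypothetical workload; replace with the system under test

    samples = []
    for _ in range(10):
        start = time.perf_counter()
        run_benchmark()
        samples.append(time.perf_counter() - start)

    print(f"mean {statistics.mean(samples):.3f}s  "
          f"stdev {statistics.stdev(samples):.3f}s  "
          f"min {min(samples):.3f}s  max {max(samples):.3f}s  (n={len(samples)})")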
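On Sin #6, Amdahl's law makes the "no infinite speedup" point concrete. Assuming, purely for illustration, that 5% of a job is inherently serial:

    def amdahl_speedup(p, n):
        # p: parallelizable fraction of the job, n: number of workers
        return 1.0 / ((1.0 - p) + p / n)

    for n in (1, 8, 64, 512, 4096):
        print(f"{n:>5} workers: {amdahl_speedup(0.95, n):6.2f}x")

    # The speedup is capped at 1 / (1 - p) = 20x no matter how many
    # workers you add, and that is before stragglers and failures.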
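On Sin #7, one way to quantify a fault-tolerance cost is Young's approximation for the checkpoint interval. The numbers below are illustrative assumptions, not measurements from the paper:

    import math

    checkpoint_cost = 60.0  # seconds to write one checkpoint (assumed)
    mtbf = 24 * 3600.0      # mean time between failures, in seconds (assumed)

    interval = math.sqrt(2 * checkpoint_cost * mtbf)  # Young's formula
    overhead = checkpoint_cost / interval             # fraction of runtime

    print(f"checkpoint every {interval / 60:.1f} min, "
          f"~{overhead:.1%} runtime overhead from checkpointing alone")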
Note that this is targeted more at academic research. Not to undermine its importance, of course.
It's good to see something like this out there. There are literally thousands of published papers that try to demonstrate "speedups" but are usually neither very reproducible nor useful. The "publish or perish" mentality is responsible for a lot of this. The only way to stay up to date with cutting-edge, meaningful research seems to be to attend conferences, talk to other researchers in the field, and try to publish in the better journals.
This assumes most academics are more interested in good research than in publishing papers, or that they can look objectively at software they spent years writing without any fundamentally new ideas.
Systems is much more practical as an academic field. For example, Googlers are heavily involved in high-end systems research, and they really are interested in good research rather than increasing their paper counts (which doesn't help them much in their careers in industry). The tier-one systems conferences, OSDI and SOSP, also accept far fewer papers a year than most fields (~30 vs. 60+) with about the same number of active researchers. The culture of your field really helps in this regard.
Even if each individual researcher mostly cares about getting publications, the way to get published is to do work that other researchers think is interesting and important. Perhaps the mechanism by which a paper like this can change a field is not by causing a particular person to reevaluate their own research agenda, but by changing the opinion of reviewers (who were not necessarily invested in that agenda to begin with).
I wonder if 'dang' followed the same process I did: click the submitted link; read the interesting abstract; scroll to the bottom only to be scared by the embedded video; recover for a second from the panic of being without plain text; scroll back up scrutinizing the small print for a link to actual paper; then click with an awkward urgency, feeling relief at having found my way back to safe ground.
Then I asked myself, "Why did the submitter link to the container page rather than to the real content? Maybe 'dang' should change the link?" But then I wondered, "What if there are people who actually prefer the video? It's hard for me to picture, but presumably such people exist. How will they be able to watch the video if the link goes straight to the PDF?"
So as a survey for my sanity, I have to ask: How many people watched the video without looking at the paper first? How many looked at the paper without giving any consideration to the video? Second, does this correlate with the first 'sin' in the paper? Do those who prefer linear text also frequently wonder if unnecessary distributed parallelism tends to hamper performance?
Nate, we don't really have a convention or guideline for this situation.
The guidelines only tell us to add [pdf], [video], and similar to the
title if it's a link to a media file. Though it's not exactly stated, it
means a direct link to the media file itself, rather than a link to the
page where the media files can be downloaded. For example, we (almost?)
never see "[video]" tags in the titles of stories linking to youtube or
vimeo.
On previous USENIX submissions I've put some variation of
"[pdf/slides/video]" in the title to indicate multiple things are
available, but this just created unnecessary work for poor `dang` who
had to edit the title to remove it. I really do try to avoid causing
work for the moderators, so this time, I did not put the extra info in
the title. I was proud of myself for learning from my mistake... until I
saw `dang` adding a comment with a direct link to the paper pdf. (sigh,
I'll probably never learn ;)
If there is a "right" answer, then I still haven't figured it out yet.
As for my access pattern, I actually found the USENIX page from another
site listing various papers, so that's what I was looking for when I
loaded the page. I made sure the USENIX page actually provided a download
link to the full paper PDF (sometimes they don't/can't), and then
started downloading the video with wget in the background while I read
the paper... I still haven't watched the video yet, but I will.
I appreciate the insight. I wasn't trying to criticize your choice of link, just wondering if others shared my personal reaction. Similar to linking to the abstract page for arXiv papers, I think you chose the "right" option.
I think the FAQ is out of date with regard to adding an explicit [PDF], as the system now adds this automatically for links where the type is apparent. But I'm always impressed at how efficient the new 'dang' API is for adding publication dates to submitted papers!
While my reaction to content posted as a video (whether it be a tutorial, paper, how-to, etc.) is more what I'd call annoyance than panic, I rarely watch them. I find I can skim written content and within about 30 seconds decide if it's worth spending more time on. Not so easy with a video.
The paper is just excellent and at just 5 pages, including about one
page of references, it's well worth the time to read. If you do any
amount of reading on parallel and distributed computing, then you've
probably read something where at least one of the mentioned "sins" came
to mind. I see #1, #2, and #5 so often that I nearly expect them.
Though it's not really "cloudy" per se, I think there's an eighth deadly
sin, "Ignoring Architecture and Infrastructure." If you want to get
repeatable and comparable results, the system and network architecture,
as well as the supporting infrastructure, can make a huge difference.
So I have a question about papers. It seems to be the standard in comp-sci and much of science to use two-column layouts, even for pages that are entirely text. Why is this?
> For similar reasons, long lines (wide paragraphs) are slower and harder to read than narrower ones. Lines of around 100 characters present neat bite-size chunks of text that can easily be decoded, and also make it really easy to scan round to the start of the next line. That’s why newspapers and magazines use several columns on a page, and why books use the same common format.
> This study examined the effects of line length on reading performance. Reading rates were found to be fastest at 95 cpl. Readers reported either liking or disliking the extreme line lengths (35 cpl, 95 cpl). Those that liked the 35 cpl indicated that the short line length facilitated "faster" reading and was easier because it required less eye movement. Those that liked the 95 cpl stated that they liked having more information on a page at one time. Although some participants reported that they felt like they were reading faster at 35 cpl, this condition actually resulted in the slowest reading speed.
I'm not sure, however, if you would consider 95 cpl to be long. 10 pt font across a letter-sized page with reasonable margins... that seems quite painful, and it's one reason I dislike reading dissertations. However, if you are reading straight through and not skimming, it might be better, as the flow is broken less frequently (I rarely read papers straight through; they have to be very good for that).
But wait, there's more! Narrow multi-column formatting only works well with decent typesetting systems... aka not Microsoft Word! Fortunately, an MSR Turing Award winner, building on the work of another (non-MSR) Turing Award winner, has solved that problem for us (i.e., TeX and LaTeX). There are many fields where MS Word is the standard, and therefore... it would be nuts to try anything complicated.