The issue here is that those 160k insertions and 130k deletions don't really mean he wrote 160k lines of text.
The way VCS works, any edit would register as an insertion and deletion.
The way writing a book works is that you write some kind of draft and then mercilessly edit by rearranging things and changing words. But you don't change every single word 10 times.
Moving stuff around creates a lot of edits very quickly -- moving a paragraph creates many lines of insertions and deletions but I would never count that as rewriting all those lines.
The same goes for editing words -- changing a word will change the entire line (insertion/deletion). If this also rearranges the paragraph (due to line breaking), it may cause a lot of edits.
So I don't really buy into the whole premise of this article.
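To make the line-breaking point concrete, here is a minimal Python sketch. It is only an illustration: `line_churn` is a made-up helper, the 40-column width is arbitrary, and `difflib` stands in for whatever diff the VCS actually runs.

```python
import difflib
import textwrap


def line_churn(old_text, new_text, width=40):
    """Wrap both versions at `width` columns and count changed lines,
    the way a line-based diff would see them."""
    old_lines = textwrap.wrap(old_text, width)
    new_lines = textwrap.wrap(new_text, width)
    added = removed = 0
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added += 1
        elif line.startswith("- "):
            removed += 1
    return added, removed


old = ("The quick brown fox jumps over the lazy dog and then runs "
       "far away into the dark forest at night.")
# A single word is replaced, but the paragraph re-wraps afterwards.
new = old.replace("quick", "extraordinarily quick")
print(line_churn(old, new))
```

With this input the one-word edit shifts every later line break, so all three wrapped lines show up as both removed and added.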
Author here. I very much do understand that one line of "insertion" or "deletion" doesn't necessarily mean the entire line was rewritten. But as I wrote in the blog post, there are also many types of changes missing from the data:
1. I don't do a commit for every single line that I change. In fact, I may change a line 10 times, and commit only once.
2. This is actually even more pronounced for code. While doing a code-test cycle, I may change a few lines of code 50 times over, but only do one commit.
3. For my books, a lot of edit rounds and writing happened outside of Git (e.g., O'Reilly does copyediting in a PDF).
My guess is that these two factors roughly cancel out. It won't be exact, of course, and the actual ratio may be 8:1 or 12:1, but the order of magnitude is probably correct.
You have used some numbers to substantiate an idea. Then you made a fleeting mention that the numbers don't mean anything. Then you decided that you also do other stuff and "you guess" it all cancels out.
Anecdote from Alan Rinzler about editing novelist Tom Robbins:
"On sticking with it until it’s right
“Challenge every single sentence for lucidity, accuracy, originality, and cadence. If it doesn’t meet the challenge, work on it until it does.”
While I was editing Jitterbug Perfume, Tom would read me a passage aloud to see how it sounded. Sometimes I’d comment, sometimes I wouldn’t. But each time I heard it again, it had changed. I saw how many times he would rewrite a passage and how much he relished doing it.
“Sometimes 40 times,” he told me.
He took the process of conception, research, trial and error very seriously, moving things around, changing voices and pitch. He wrote slowly and carefully, revised constantly, refining and evolving the novel over the course of about two years."
I faintly recall an anecdote about a writer who, when asked what writing they did yesterday, said "I put in a comma" -- "and what did you do today?" -- "I took it out again." Maybe some HNer will recognize the story and remind me who it was about :-)
My experience with writing is that edits feel much larger while you're making them. But when you take the final version and actually diff it to the original, and take into account e.g. moving rather than deleting and rewriting text, then the differences are actually quite small.
The main exception is when you literally throw everything away and start from scratch. The odds of writing the exact same words twice are basically nil. But I'm not sure this should count as "editing" in the same sense as when I'm doing word-and-line copy editing.
I haven't done a rigorous comparison with code, but it would be interesting to take a tool like Moss and see how similar it thinks the code is after <N> revisions of the source.
Well, I guess it depends on how you work and the type of work you do.
When I worked as a translator (I had an episode translating books from English to Polish) I would more or less write the final version of the text and there would be very few edits. The plan for the text was set by the original so my job was basically to figure out the way to express it in Polish.
On the other hand, when I started writing my own texts I would spend much more time editing. I would write passages that I wanted to include, but then I would move them around a lot so that they make more sense for the reader, for example.
The same with programming. When I know very well what I am building there would be very few edits. I would start bottom up writing modules and tools and abstractions that I know I will need later and build on it. A website would be a good example. It is not rocket science. Once you design your UI, processes, it is fairly straightforward to translate it to code.
Again, if I work on something very tricky, there are A LOT of edits. My other project is an embedded controller for an espresso machine. I am building this controller from scratch, meaning I design the board, then I receive the PCB from the manufacturer, then place and solder all components. The code is for an ARM Cortex-M4 microcontroller which I am getting to know. I would say most of the code was edited many, many times over because of the many iterations I need to get stuff working and then to refactor it to integrate with the structure of the application.
Not to detract from the good overall point, but isn't the math here a bit wonky? In the first example, he shows 160k insertions and 130k deletions and claims that therefore he wrote 160k + 130k = 290k lines of prose. That's... wrong? He wrote 160k lines of prose. If you want to express it as a sum, it's 160k (total inserts) = 30k (current size) + 130k (total deletions.)
Compound this and the likelihood that many line counts are counting method artefacts, and the end ratio is probably more around 3:1.
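For what it's worth, the two ways of counting can be put side by side. A small Python sketch using the rounded figures from this comment (160k inserted, 130k deleted, 30k remaining); the function name and the dictionary keys are just labels I made up:

```python
def rewrite_ratios(insertions, deletions, final_lines):
    """Return both candidate 'lines written per line kept' ratios."""
    churn = insertions + deletions  # the article's numerator
    return {
        # counts every deletion as if it were writing
        "churn_ratio": churn / final_lines,
        # counts only lines that were actually added
        "insert_ratio": insertions / final_lines,
    }


# Rounded figures: 160k inserted, 130k deleted, 30k remaining.
ratios = rewrite_ratios(160_000, 130_000, 30_000)
print({name: round(value, 2) for name, value in ratios.items()})
```

That gives roughly 9.7:1 for insertions plus deletions and roughly 5.3:1 for insertions alone, before any discount for diff-counting artefacts.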
For anyone writing non-fiction, I would highly recommend the book "On Writing Well" by William Zinsser, who also stresses the importance of relentlessly editing your draft.
Agreed. I think for anyone considering writing you have to know there are two skills:
- Building the writing habit of words on a page and how you make that happen.
- Being comfortable removing all of the words that are unnecessary through the editing process.
I knew I crossed the bridge when I found myself editing my own work and saying: "this is an incredible sentence, but it's not the right one here, axe it."
Deleting prose and lines of code is still work. You have to read the text, think about it, figure out what would happen if you remove it, update all other places accordingly, and so on. The same goes for deleting code. To ignore that 160k lines were deleted as part of the writing process would be to ignore a massive amount of work.
Here's a super super hacky script that I wrote to calculate the ratio described in the article. I'll make it a lot better tonight and clean it up. Here's the bash script:
Right! And if each line of code is (re)written 16 times, imagine how many times it's read. (several times each rewrite, and PR review)
In other words, your work isn't done when the program works and makes the customer happy. It's done when other developers can read it. Otherwise you can't continue making the program work and make the customer happy.
It's not always 10:1, but it seems to be a good average. This happened to me numerous times — I sometimes start with a quick and dirty solution for my problem, then refine and refine until I'm happy with the result, the performance, the code looks beautiful and then I'm like “heck, it's only 100 lines and it took me a week”. In more pathologic cases, I've spent a week just thinking about it, and then I started writing code.
I've been programming for 20 years and I still can't provide good ETAs for non-trivial problems.
Exactly; you can only provide a more accurate estimation if you know all the details and the exact implementation and how fast you can type, and even there you have to take at least a 30% margin either way.
To go even further on that point - the chart itself seems overly optimistic. +-40% of estimate after product design? +-10% of estimate after design spec?! Countless number of times I have seen detailed design specs completely thrown out of whack by new engineering, process or business revelations/insights. Human thought is iterative - most people can imagine at most 80% of the future/product/design possibilities.
Yeah, the 10:1 ratio seems reasonable from my experience, and the order-of-magnitude difference is consistent with the 80:20 rule of software engineering, which states that the last 20% of work on a project takes as much time as the first 80% of the work.
There is a really cool tool called `gource` http://gource.io/ that allows you to visualize the changes in version control systems like git and hg. I highly recommend you try it on some of your projects---it can help you see who did what. I bet you'll find some surprising things when you watch the history of the project... which parts were done first, which things are new, etc. Is there a genius colleague on your team that wrote the whole thing? Who are your colleagues with good work ethic that consistently push solid bugfix commits and refactoring all over the code base?
There's also a similar rule of thumb in journalism: "A good journalist makes use of 10 percent of his source material; a bad journalist makes use of 110 percent."
Yeah, someone told me the secret to good photography is just taking a lot of pictures. Stop wasting time trying to manufacture perfect, just be ready when it happens.
And I think it comes back to the principles of Slack. By shooting more than you need, you have room to cut. Hopefully, you'll have to wind up cutting good stuff because you have too much good stuff.
> Yeah, someone told me the secret to good photography is just taking a lot of pictures.
I've gone down this road and I don't think that it's good advice any more. The secret to good photography is having a good editor, someone who can tell you which photos you have are good and bad, and why. Someone you can have a discussion with over a dozen 11x14" prints. Continuous improvement. Develop your eye and shoot neither too many nor too few photos. Every time you go through the review process you get a better sense of what you should be looking for in your photos.
It's so easy to fill up an SD card with nothing but garbage. I've done that. I've gone through bulk rolls (100' of 35mm film) and filled them with garbage I just don't want to see any more. By shooting more than you need, you can easily fall into the trap that because the photo will probably get cut during editing, there's no point to putting any special amount of effort into it. So there's no point in shooting more pictures if you're not putting enough effort into each one that it could be good. When I was at college I remember having discussions where we thought that there was not really any point to going through more than a single bulk roll per academic quarter (18 rolls in 10 weeks, or ~65 exposures per week). People who shot less than that much weren't showing as much improvement, and people who shot more than that were just ending up going through the process a bit too mechanically. That was just a rule of thumb for the classes at that particular place and time.
There are good, great, and amazing photographers who shoot such a small number of photos it would probably shock you, and others who shoot so much it makes you wonder how quickly they must go through equipment.
Personally, I've found that taking a lot of pictures worked for me, in that comparison of the multiple shots I took gradually taught me what made for better photos.
The downside is that I'm usually lazy/reluctant to delete the extras, especially when there's a tossup as to which is the "best", so I end up with tons of photos I don't want in my library.
As a side gig, I take photos of domestic animals (primarily dogs). Some of the animals can be posed like little dolls. Others can't sit still to save their lives. I also take action photos of them. I have to spend time manufacturing perfect, because I often can't make very many on-the-fly changes. But to capture that perfect, I have to take a lot of photos.
As a hobby, I practice general photography, and my usual method is to lock some variables, and take several photographs while floating other variables.
Also as a hobby I practice astrophotography, and here you absolutely have to manufacture perfect, because you only take a lot of pictures in order to stack them. A single photo can take hours to produce, so you do what you can to get it right in the first place.
When I'm taking photos of food and cocktails for my side gig, I generally have a shot setup in my head prior to food arriving at the table. I'll choose places with natural light, near large windows to diffuse, or some interesting backgrounds and pretty lights. But once the food arrives, it's very much shoot as much as possible from various angles. Even from a single angle, I shoot at high fps. That guarantees that at least one of the shots is the sharpest, and even slight variations of position will make some work better than others. Trying to be perfect just leads to missing shots. Some dishes deteriorate quickly as they cool off. Shoot lots, and choose what works best with your style. Oftentimes, you'll get something unexpected that you didn't see with your eyes initially.
Taking a lot helps, but you also want to be directed. Directed practice has been discussed on HN numerous times at this point. For photography I would go to a location and take a lot of pics of the same thing using auto, aperture priority, shutter priority, and then full manual. Most of these would get thrown away, but I can see how each came out and use that to further improve.
This method of practicing and improving used to be expensive with film, but with digital the process is now free.
The other thing you learn is your camera shortcuts so that when you do run into a photographic moment you can quickly take a bunch of shots with different settings.
Also with writing music. For many years I've repeatedly said, word for word, "you have to write about 10 songs for each really good one you end up with."
I'm really surprised that this isn't builtin to more NLEs. I've tried this in the past, but using SVN (long long ago). At CG/post houses with custom pipeline tools, they have an easy way to version up builtin. With NLEs, it's all up to the editor duplicating timelines. However, with literal EDLs, it's hard with modern editing as so much has to be simplified to satisfy the EDL. I think AE has a version up tool (memory is fuzzy). Even tried using Time Machine backups to roll back to prior versions. All sorts of hacks that I really feel should be built in to software that is known for multiple revisions.
It looks like older projects have higher lines-written to lines-in-production ratios:
terraform-aws-couchbase (2018) - 5:1
Terratest (2016) - 8:1
Terraform (2014) - 9:1
Express.js (2010) - 14:1
jQuery (2006) - 15:1
MySQL (1995) - 16:1
This is a small sample size but it seems easy enough to run on some popular open source projects and see if there's a statistically significant trend. It would also be cool to see the lowest and highest ratios on popular projects.
This is very, very interesting but I think it would be even better if it was somehow normalized; in the first N years (1, 2, 5, 10), what is the lines-written to lines-in-production ratio? Older, still active projects will almost for sure have a higher ratio just as a matter of fixing bugs.
So if it was normalized, we could see whether newer projects are rushed to production faster or it's just a matter of time passing.
+1 for computing some sort of normalization factors... here are some ideas:
N = number of API endpoints, then N/cloc would be something like efficiency (get more stuff done with less code)
M = number of lines of code for all code paths accessed during an average day on PROD, then (cloc-M)/cloc would represent the "dead code" ratio---how much of your code base is not used
X(c1,c2) = an arbitrary function computed on the diff between commits c1 and c2
And all of the above can be run using some sort of rolling window from git init to today.
I'm curious about the LOCs. I believe (but haven't bothered to verify) that code swells and then contracts in cycles. Expansion is due to the drunken sailor's walk thru the solution space, as everything is tried. Contraction is due to identifying best fit (good enough) paths, code deduplication, dead code (and feature) removal, generalizations hard earned thru experience.
I think 10:1 is a good rule of thumb, but it does feel like time is a function that needs to be accounted for, too.
It would also be interesting to think through how many changes there are _between releases_ of a project. MySQL is 23 years old [1], but what's the effort/change between major releases at this point? That's where the rule sort of falls down for me: a book has a few editions (if it's lucky); software, on the other hand, has lots of releases if it's successful.
It strikes me this must be wrong. The 90% churn rate is way too high, surely?
Looking at my own git logs (for code, not prose), I notice git is very happy to add and delete lines even if just some characters on those lines have changed. Is it not the case that the author of the article has simply edited 90% of the lines in his book, rather than rewritten them completely? That seems much more likely to me.
Yeah your assertion is correct, even when it's just a single character changed, it's marked as a whole line has changed.
Git doesn't really have good statistics (I think) for e.g. token changes (it would need a tokenizer for that). There is however --word-diff for showing a diff in words instead of whole lines; it's probably possible to count those as well, the output syntax looks parseable enough.
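As a rough sketch of what counting those could look like in Python: the function below takes the text of `git diff --word-diff=porcelain` and tallies added and removed words. It assumes the porcelain layout as I understand it from the git docs (runs prefixed with `+`, `-` or a space, and a lone `~` for each newline); the sample is hand-written in that shape, not captured from a real repo, and `count_word_diff` is a made-up name.

```python
def count_word_diff(porcelain_text):
    """Count added and removed words in `git diff --word-diff=porcelain` output."""
    added = removed = 0
    in_hunk = False
    for line in porcelain_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # file header follows; ignore it until the next hunk
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            added += len(line[1:].split())
        elif in_hunk and line.startswith("-"):
            removed += len(line[1:].split())
    return added, removed


# Hand-written sample (hypothetical file name and hashes).
sample = """diff --git a/ch1.md b/ch1.md
index 1111111..2222222 100644
--- a/ch1.md
+++ b/ch1.md
@@ -1 +1 @@
 The 
-quick
+very fast
  brown fox
~
"""
print(count_word_diff(sample))
```

Words are split on whitespace here, which is cruder than a real tokenizer but probably good enough for prose.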
I know git has a way to show diffs based on words instead of lines. Is there a way to use words instead of lines for these stats? When I'm writing markdown I usually turn on word wrap in my editor, which means there are entire paragraphs on one line.
On the other hand, editing takes a lot of time per word changed, because you have to consider the whole sentence or even paragraph to make sure the grammar is still right, you're not repeating things, and you haven't removed any necessary context. So maybe a bit of overestimation is good.
The one caveat is that git has a somewhat more expansive view of what counts as a rewrite of a line of code than what would be counted as a rewrite when writing a book.
Unindent 1000 lines of code after removing an if() statement. You just changed 1000 lines of code in 2 seconds. Reformat an entire file automatically with your editor and you just changed 4000 lines of code.
Stayed at that job for 4 months before realizing the other engineer was just going to keep writing the same crap. No tests, no comments.
The real kicker was that the “database” was several different instances of mongo. So doing joins was a pain, and having transactions was almost impossible.
I put some exaggerated examples in there to show situations where changed line counts are absurd (when whitespace is counted).
Such numbers of whitespace changes do not happen often, but have a big impact on LOC when measuring a long period of time. I realized that when counting my changed LOC for a past internship.
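A small Python sketch of how much the whitespace matters, using the unindent case. This is my own toy counter built on `difflib`, not how git counts; stripping all whitespace is meant to mimic the spirit of `git diff -w`.

```python
import difflib


def changed_lines(old, new, ignore_whitespace=False):
    """Count lines touched between two texts, optionally ignoring all whitespace."""
    def norm(line):
        return "".join(line.split()) if ignore_whitespace else line

    old_lines = [norm(line) for line in old.splitlines()]
    new_lines = [norm(line) for line in new.splitlines()]
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # a replaced block counts as the larger of its two sides
            changed += max(i2 - i1, j2 - j1)
    return changed


before = "if (ready) {\n    a();\n    b();\n}\n"
after = "a();\nb();\n"  # the if() was removed and its body unindented
print(changed_lines(before, after))
print(changed_lines(before, after, ignore_whitespace=True))
```

Counting whitespace, all four original lines look changed; ignoring it, only the two lines of the removed if() do.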
Yeah I wouldn't mind a new system that can analyze git history and accurately count changes - not counting package.lock changes, indentation / reformatting, snapshots, etc.
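A first step in that direction could be as simple as filtering `git log --numstat` output. A minimal Python sketch, assuming the usual numstat layout of `added<TAB>deleted<TAB>path` with `-` for binary files; the list of suffixes to skip is my own guess at what counts as generated, and the sample paths are invented.

```python
def count_numstat(numstat_text,
                  skip_suffixes=("package-lock.json", "yarn.lock", ".snap")):
    """Sum added/deleted lines from numstat text, skipping generated files."""
    added = deleted = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit headers, blank lines, etc.
        a, d, path = parts
        if not (a.isdigit() and d.isdigit()):
            continue  # binary files are reported as "-"
        if path.endswith(skip_suffixes):
            continue  # lock files and snapshots are not hand-written
        added += int(a)
        deleted += int(d)
    return added, deleted


sample = (
    "10\t2\tsrc/app.js\n"
    "500\t480\tpackage-lock.json\n"
    "-\t-\tlogo.png\n"
    "3\t1\tREADME.md\n"
)
print(count_numstat(sample))
```

This does nothing about indentation or reformatting; that would need a look at the diff content itself rather than the per-file totals.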
It's an interesting analysis, but I strongly suspect that a lot of changes aren't really rewrites of new content. Instead, they're moves of content within a file or between files that get counted as a deletion+insertion.
Just read Brikman's Terraform book and it was great! He's a clear writer and great explainer and I'd recommend it to anyone looking for a good resource to get started with infrastructure-as-code.
I'm somewhat skeptical that we can draw any conclusions from the total vs. finished based on gross output. Writers are often described as either those who produce prodigiously and then cut mercilessly or those who refine every line before committing and do very little editing. I think programmers are similar. Metrics on the entire set are going to be muddled at best.
I would however agree that:
* those who produce & then cull generate a lot more than we see in the end product (due to those who do little editing, probably more than 10:1)
* writers/programmers get better at their craft, reducing the chaff to wheat ratio as they practice (like the original author saw in his work)
* well understood domains or themes are also more efficient (Michael Crichton didn't throw away 90% of his output; your average CRUD app doesn't either)
Finally, the written word tends to remain unchanged once committed (revisions tend to be minor in the overall scope), while code that is used changes dramatically. Long-running projects that are successful will see their application increase and evolve accordingly, so it's likely less valuable to track change over time if you're looking for total vs. finished effort metrics.
It’s hard to consider the changes in files alone because files are so different.
For instance, I may have to edit a project or UI component and generate “churn” just because it decides to reformat some XML for a minor edit. Or, in code, some styles are clearly more “vertical” than others, comment paragraphs may or may not have changed, etc.
Then there are tiny edits that have a huge amount of “churn” in the actual project, such as “#if 0”.
While lines of code is not an entirely useless measurement, it would definitely be good to have at least a couple other things measured. For instance: compiled binary sizes before/after edits, number of bugs logged per month, or something. Even then of course, these additional measures can obviously be affected by external factors (how good is your compiler, what kinds of bugs actually appear in reports, etc.).
My takeaway is that measuring productivity accurately is hard, and if you want good accuracy then you have to put a lot of effort into the measurement process. There are no easy 10:1 rules for these things.
From the article, about the shelf-life of UI code:
"Does the amount of churn depend on the type of software? For example, Bill Scott found that at Netflix, only about 10% of the UI code lasted more than a year, and the other 90% had to be thrown away. What are the rates of churn in backend code, databases, CLI tools, and so on?"
I used to do a lot of UI. This jibes with my experience.
At the time, I resolved to divine concise ways to prototype and implement UIs.
20 years ago. I didn't get very far. But I am now making another run up that hill, so we'll see.
This would be really interesting to do with some sort of move-aware diff algorithm. Does anyone know if such a thing exists?
Instead of getting lines removed + lines added, there could be some sort of content aware diff algorithm that works with 'data chunks' at another granularity (lines = good for code, words ala diff --color-words = good for text) and reports counts of the form {'added':<int>, 'removed':<int>, 'changed':<int>, 'moved':<int>},
where `moved` represents the number of `-` lines in the diff that have an identical corresponding `+` line,
and `changed` represents lines where 60%+ of a `-` line matches the text in some `+` line.
Using such a fancy_diff, moving a 200-line chunk of code from one source file to another source file won't show up as -200s and +200s, but as {'added':0,'removed':0,'moved':200}
Looking at such "fancy diff" numbers will show more what's going on...
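Here is a rough Python sketch of such a `fancy_diff`, to show the idea is at least implementable. It is a greedy approximation under my own assumptions: `difflib.SequenceMatcher.ratio()` stands in for the "60%+ matches" test, pairs are matched first-come-first-served, and it works on two lists of lines rather than on a whole commit.

```python
import difflib
from collections import Counter


def fancy_diff(old_lines, new_lines, threshold=0.6):
    """Classify line changes as added / removed / changed / moved (greedy sketch)."""
    minus, plus = [], []
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            minus.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            plus.extend(new_lines[j1:j2])

    # moved: a "-" line with an identical "+" line somewhere else
    plus_pool = Counter(plus)
    moved, leftover_minus = 0, []
    for line in minus:
        if plus_pool[line] > 0:
            plus_pool[line] -= 1
            moved += 1
        else:
            leftover_minus.append(line)
    leftover_plus = list(plus_pool.elements())

    def similarity(a, b):
        return difflib.SequenceMatcher(None, a, b).ratio()

    # changed: a "-" line that is at least `threshold` similar to some "+" line
    changed = removed = 0
    for line in leftover_minus:
        best = max(leftover_plus, key=lambda p: similarity(line, p), default=None)
        if best is not None and similarity(line, best) >= threshold:
            leftover_plus.remove(best)
            changed += 1
        else:
            removed += 1

    return {"added": len(leftover_plus), "removed": removed,
            "changed": changed, "moved": moved}


print(fancy_diff(["intro", "p1", "p2", "outro"],
                 ["intro", "outro", "p1", "p2"]))
```

To catch moves between files you would have to feed it the `-` and `+` lines gathered across the whole commit, which this sketch does not attempt.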
Pretty sure git does this already by calculating a “similarity index”. So if you mv a file even without an explicit `git mv`, git will recognize it at commit time as a file in a different location. Perhaps this is presentational only, though; after all, a mv is fundamentally a delete and create.
Not 100% sure on that though, nor am I sure if the same applies at the line level.
Yeah, I think the git "chunks" are quite large though. A big enough move (e.g. an entire file) will be recognized as a rename and reuse a pointer to the underlying blob, but anything paragraph-sized will be recorded as remove and an add.
To be honest, I'm not sure if computing a fancy diff of the form {'added':<int>, 'removed':<int>, 'changed':<int>, 'moved':<int>} is a well defined problem because there are multiple ways one could interpret a given set of changes, e.g., is it a move with a small edit vs. an add and a remove. We might need to define some "economy of diff" objective to optimize... something like Kolmogorov complexity for diffs?
Code churn is an interesting measurement. I have come to realize I spend 90% of my time on 10% of the problem and I don't always know what the 10% is when I start. This is not just true for writing software, but for installing a hardwood floor -- the center is easy and is where I naively estimate the time -- the time was really spent on the nooks and crannies.
Really interesting. Thank you for sharing. I imagine that on average most things creative or scientific need 10 versions or so to get to the final stage. On a separate note, wonder whether you would classify programming as a science or creative work, a mix of both or something else...
> So, I added 163,756 lines and deleted 131,425 lines, for a total of 295,181 lines of code churn. That is, I wrote and deleted 295,181 lines of to produce a final output of 26,571 lines. That’s a ratio of more than 10:1! For every 1 line that got published, I actually had to write 10!
I'd believe that if the guy was talking about the book text only. Let's say you use bootstrap and then decide to use xxxStrap. You just added and deleted 100k lines of code. Congrats, you have a "churn" and stuff like that. But the reality is that you neither "wrote" nor "rewrote" these lines of code.
I might be wrong, but I'd like to hear arguments to the contrary. This happens a lot in projects I work on.
> You just added and deleted 100k lines of code. Congrats, you have a "churn" and stuff like that. But the reality is that you neither "wrote" nor "rewrote" these lines of code.
I've seen that in some projects too, usually where developers don't trust their package managers and/or just want to inflate the size of the project to manipulate the client.
Rule of thumb: don't commit external dependencies (like node_modules) when you can lock dependencies in place. If you really have to do it then have a downstream repo that will do that automagically.
This makes your strap situation far less problematic. You'll still touch everything that uses said strap but then it's actual programmer effort and not manual package management.
Getting a bit off topic here, but just to add to that (because checking in node_modules drives me nuts): The output of, e.g., npm install in node_modules can be different across machines in the case that a module has a build step that creates a binary dependent on the architecture of the host machine.
Checking node_modules into VCS is a code smell and footgun. Don’t do it. Check the lock file into source control and make sure you have a reliable cache in between build and registry (eg the yarn public registry, which caches everything, or a private registry and/or proxy).
I understand the author's overall point, but does a variable declaration really need to be rewritten 10 times to do its job? This feels like an apples to oranges situation. Maybe the 10:1 rule could apply to a component or a function or maybe the context of the line of code should be taken into account. Lots of lines of code are really the equivalent to punctuation in prose - important but ultimately an adornment.
Interesting. A few months ago, I wrote a blog post about 4 being the factor you should use to multiply rough estimates with [1]. Seeing 10 here as the input to output ratio makes me wonder how those two factors relate to each other ;-)
Off-topic: Atlas looks cool and I've been looking for something like this to create multiple output formats from one source document for ebooks and other longer documents, but I'd like something that's open-source. I see "2014" multiple places on the Atlas site, which makes me think it's been abandoned.
I wrote my master's thesis in markdown - tracked in git - and generated documents in PDF, HTML and docx with pandoc. Didn't try epub, though that's also possible.
There were some issues, but overall it was an awesome experience. Unlike my friends, I actually enjoyed writing my thesis.
As a developer I don't estimate work based on number of lines I expect to write, so while this metric is interesting I do not think it explains why we miss deadlines. It does indeed show that things usually are more complicated than we think. But with that knowledge, I'm pretty sure I'll keep missing deadlines.
In some old book I read that a good programmer writes one line of code per day. That seems about right.
Also my friend told me how to estimate the timeline of a project - double the estimate (yours or the supplier's) and increase the unit of measurement (so 1 week becomes 2 months, 2 months -> 4 quarters etc.). Works every time...
I've been using LeanPub for my book work. It hooks into GitHub and gives me a DevOps-kinda work environment.
Git's cool and all, but I wasn't a huge fan at first. One version control system was like another. But the impact it's had? Now that version control is free and ubiquitous, using git for stuff is everywhere. That's turned out to be extremely cool.
MySQL v1 or v2 were not unusable (or unreliable) products, despite the fact, that later requirements changed and code changed with them. This could significantly lower the ratio discussed for software products (not sure if it applies to books, too).
I don’t know if I agree with the author’s methodology (insertions + deletions seems like a weird choice for the numerator), but moreover, a big “work vs. final output” ratio doesn’t explain why estimation is so notoriously difficult.
Perhaps my explanation isn't perfect, but... Maybe in part because it is unthinkable for any "grown-up", "responsible", "engaged" estimator in charge of the project to put in 10x churn into the estimation of resources they request from their management? If you claim you would do that, imagine yourself justifying it to your manager, and then to senior management to which your manager drags you afterwards.
I think it’s worse than 10:1, because I often throw away written code before I commit. Also, even slightly complex lines may go through 2-3 revisions before even being written to the filesystem and compiled for the first time.
The important takeaway for me is that it is a reminder that rewriting/refactoring is a big part of quality, and should be in the back of a creator's mind when considering the ends.
This is interesting, but I'm not sure it's universal enough to be a "rule". Maybe it is in your realm, but I can't believe that this is the way everyone operates based off a single example.