Hacker News new | past | comments | ask | show | jobs | submit login

I’ve usually heard this phenomenon called “incidental duplication,” and it’s something I find myself teaching junior engineers about quite often.

There are a lot of situations where 3-5 lines of many methods follow basically the same pattern, and it can be aggravating to look at. “Don’t repeat yourself!” Right?

So you try to extract that boilerplate into a method, and it’s fine until the very next change. Then you need to start passing options and configuration into your helper method... and before long your helper method is extremely difficult to reason about, because it’s actually handling a dozen cases that are superficially similar but full of important differences in the details.

I encourage my devs to follow a rule of thumb: don’t extract repetitive code right away, try and build the feature you’re working on with the duplication in place first. Let the code go through a few evolutions and waves of change. Then one of two things are likely to happen:

(1) you find that the code doesn’t look so repetitive anymore,

or, (2) you hit a bug where you needed to make the same change to the boilerplate in six places and you missed one.

In scenario 1, you can sigh and say “yeah it turned out to be incidental duplication, it’s not bothering me anymore.” In scenario 2, it’s probably time for a careful refactoring to pull out the bits that have proven to be identical (and, importantly, must be identical across all of the instances of the code).

That can be boiled down to the “Rule of 3”.

My CTO often asks me to implement a feature to do X and make it “generic enough to handle future use cases”. My answer is always the same - either give me at least three use cases now or I am going to make it work with this one use case. If we have another client that needs the feature in the future then we will revisit it.

Of course, there are some features that we know in advance based on the industry how we can genericize it.

From (I think) an old Joshua Bloch talk on API design, paraphrased:

* If you generalise based on one example, you will get a flexible API that can handle only that example. * If you generalise based on two examples, you will get a flexible API that can switch between those two examples. * If you generalise based on three examples, you have a chance of abstracting over the common essence.

Hmm. I wonder if he got that from Simon while he (JB) was at CMU. He (HS) once said to me, jokingly, I think, “One makes an observation, two makes a generalization, three makes a proof”.

Sounds good but is wrong. Sometimes you need more than 3. Maybe 3 is for those who need to move fast and break things though.

Yeah. I think he (HS) was joking. He had a Nobel, and (co)invented AI. I’m pretty sure he knew what a proof is.

Yes, he mentions it in Effective Java. When designing APIs, collaborate with 3 clients if possible.

Until an example that “doesn’t fit” ?

And then if the example “doesn’t fit”. It’s by definition a different thing than you originally modeled

The Rule of 3 is a great rule, except when it isn't.

I had a colleague some time ago who wrote a couple of data importers for FAA airspace boundaries. There were two data feeds we cared about, "class airspace" and "special use airspace". These airspace feeds have nearly identical formats, with altitudes, detailed boundary definitions, and such. There are a few minor differences between the two, for example different instructions for when a special use airspace may be active. But they are about 95% the same.

The developer wrote completely separate data definitions and code for the two. The data definitions mostly looked the same and used the same names for corresponding fields. And the code was also nearly the same between the two (in fact exactly the same for most of it).

It was clear that one importer was written first, and then the code and data structures were copied and pasted and updated in minor ways to create the second.

Because the data structures were unique for each (even if they looked very similar in the source code), this impacted all the downstream code that used this data. If you saw a field called FAA_AIRSPACE_MIN_ALTITUDE, you had be sure to not confuse the class airspace vs. special use airspace, because each of these had a field of the same name, the compiler wouldn't catch you if you used the wrong one, and you may have the wrong offset into the data structure.

I asked the developer and they told me, "I have this philosophy that says when you have only two of something, it's better to just copy and paste the code, but of course when you get to the third one you want to start to think about refactoring and combining them."

Yep, the Rule of 3.

In this case there were only ever going to be two. And the two were nearly the same with only minor differences.

But because of blind obedience to the Rule of 3, there were many thousands of lines of code duplicated, both in the importer and in its downstream clients.

I still like the Rule of 3 as a general principle, and I have espoused it myself. But I think it is best applied in cases where the similarities are less clear, where it seems that there may be something that could be refactored and combined, but it's not yet clear what the similarities are.

I think it is a very bad rule in a case like this, where it should be obvious from the beginning that there are many more similarities than differences.

Every single rule or advice in programming is good until it isn't. OOP is good until it isn't, function programming is good until it isn't, premature optimization is the root of all evil until it is the root of all good.

For some reasons humans have this deep need to try and boil things down to bulleted lists which in the domain of programming are just incredibly not useful.

This is because programming is an art. It has fundamental components (objects, functions, templates, iterators, etc) like in visual design (point, line, shape, value, etc).

I think engineers should read the old and the new C++ books and master that language to know the evolution of all these paradigms and how to use them. There’s so much wisdom in the “Effective C++” series and Gang of Four and “C++ Templates: The complete guide“ to name a few.

Problem is in this “start up culture” to bang things out and get them working the art is left behind. Just like many other arts.

I was part of an uncredited team interviewed by Meyers for "Effective Modern C++". Always thought ir was somewhat ironic. At the firm (declined acknowledgments because they shun media attention) we werent even using a C++11 compiler on either Linux or Windows at the time. Yet, at least two of the patterns, I recognize as being at least a partial contributor to.

Myself, I lost track of C++ standards changes with C++17, and Ive not been using C++ for the last several years.

I still love the power and speed, but right now I'm dping more ETL work, and Python is a better and more productive language for that.

I think that you have a point but I often find myself citing guidelines or rules when I am evaluating code decisions or questioning code design. Maybe it depends on your interpretation of the phrases, some sayings should be followed religiously but others applied with discretion.

well said. for a while I started treating everything as a soft rule, more like guideline. it gets easier then :)

Consider them to be heuristics, not rules.

The rule of 3 usually is in reference to small scoped abstractions, not whole modules or subsystems. We're talking about extracting a short function, not significant and potentially thorny chunks of code.

But I guess no one explicitly spells this out, so I could see where someone could become confused.

Premature abstraction. Rule of 3 helps. But I found better principle for it:

Any abstraction MUST be designed to be as close as possible to be language-primitive-like. Language primitives are reliable, predictable, and non-breaking. If they do, they don't affect business logic written on top of it. If parameters are added, defaults are provided. They don't just abstract, they enable developers to express business logic more elegantly.

The challenge is to pick the part of the code abstractable to be primitive-like first and make it a top priority.

This is why language features like Rust async, Go channel-based comm, ES6, C++ smart pointer was such hype in their time and is used up until now. It also applies to enabling tools such as React, tokio, wasm-bindgen, express, TypeScript, jquery (even this which is not a thing anymore).

I find abstractable parts in code by thinking about potential names for it. If there is no good name for a potential common function, it's probably not a good idea to extract the section into a function. Maybe it can be extracted into two separate functions with good names?

This is a proxy for having a well-defined purpose for a function.

Yes, absolutely, but it's one that's easier to teach juniors.


Yes, exactly, and it is such a good rule to go by so much of the time. But like many rules, you need judgment to know when to apply it.

Rules are a poor substitute for actual thought.

> Rules are a poor substitute for actual thought.

This should be the guiding principle of life!

The guiding principle... a rule, so to speak!

Everything in moderation, even moderation itself.

> Everything in moderation, even moderation itself.

This was my guiding principle in life, and I thought I was pretty clever for coming up with it, but later found out that someone far more clever got a lock on the phrase in history: https://www.goodreads.com/quotes/22688-everything-in-moderat...

No, no, no; it's "Everything in moderation, especially moderation.". :)

Really? I'm sure you know that all generalities are false.

Only a Sith speaks in absolutes

You nailed it. I will print this and hang it on the wall in our office.

I thought rules were a way of organizing thought, like a language which is governed by rules.

The rule of 3 usually is in reference to small scoped abstractions, not whole modules or subsystems. We're talking about extracting a short function, not significant and potentially thorny chunks of code.

I would say it’s for entire features sometimes. For instance, we are a B2B company. We have features on the roadmap or a feature might be suggested by a client. Either way, you hit the jackpot if you can get one client to pay for a feature that doesn’t exist in your product that you can then sell to other clients.

The problem is that you don’t know whether the feature is generally useful to the market. But you think it might be, in that case, you build the feature in a way that is useful to the paying client and try not to do obvious things that are client specific. But until you have other clients you don’t know.

That's been my experience doing embedded stuff. Customers will tell you they want a feature. You'll implement it then find out they don't need it enough to devote resources to utilize it on their end. So then it just lingers in the code base. And never gets real world testing. Lately I've been pulling unused features and consigning the code to the land of misfit toys. Just so I don't have to maintain them.

If that’s the case and they were willing to fully pay for the feature, at professional services + markup hours, it wasn’t a loss. In our case, we still got unofficially professional services and recurring subscription revenue and now we have a “referenceable client”.

If it’s behind a per client feature flag, it doesn’t cause any harm.

Any of these rules comes with "use your best judgement" and not drive off a cliff blindly following it.

> I asked the developer and they told me, "I have this philosophy that says when you have only two of something, it's better to just copy and paste the code, but of course when you get to the third one you want to start to think about refactoring and combining them."

I would argue this is a good rule of thumb, but nothing should ever be a hard unbreakable rule. The usual refactoring rules should apply too: if 95% of the code is the same, you should probably go back and refactor even without a third use case because it sounds like the two use cases are really just one use case with slight variation.

If the compiler didn’t catch it, doesn’t that say it was modeled incorrectly?

Why not have an IAirSpace interface or an abstract AirSpace class with two specializations? If there were processes that could handle either it should take an AirSpace class, one that could only handle one or the other took the specialization.

If the steps were the same for handling both, have step1...step(n) defined in the concrete class and have a coordinating “service” that just calls the steps and takes in an IAirSpace.

To be honest I don't get it. What language was this in, that you couldn't duck type the data or make the data structures inherit from some shared interface that encapsulated the commonalities?

That is a reasonable question. You may have noted that I left some details vague, so all I ask is that you trust me that the overall story is true.

Of course if you don't trust me, that's cool too! ;-)

Hah! Totally do, I'm just nerd sniped by the details I guess :-)

Rules of thumb are just that. Developers still have to understand the underlying motivation (incidental vs inherent duplication).

> The Rule of 3 is a great rule, except when it isn't.

"Rules are for the guidance of wise men and the obedience of fools."

It's unfortunate that you had to deal with a fool, but that's not a indictment of the particular rule that they picked to follow off a proverbial cliff.

Edit: fixed ambiguous quote formatting.

Thanks for your reply. I just want to note that what appears to be a quote in your comment ("Rules are for the guidance of wise men and the obedience of fools") is not something I wrote at all. If you still have time to edit the comment, I might suggest changing that so it doesn't look like you quoted me.

> But because of blind obedience to the Rule of 3, there were many thousands of lines of code duplicated, both in the importer and in its downstream clients.

Well sure, blind obedience to anything at all can cause problems.

There’s a safety issue here. I can conceive of these code blocks diverging more with time. Plus is there really a cost of duplication.

but it was used in many places, so the rule of three applies.

Personally, I would have kept the code that he wrote and made something that handled the special cases you are talking about.

Everything should be replaceable, not extendable. Now if the special cases change, my code can be thrown away without changing the data feed code.

Are you sure merging code for different datafeeds would be better though? In such cases, what is identical and what is not, should be references to eachother in comments. But you don't know beforehand which approach would be better, unless you know the datafeeds will stay the same as now.

The sad story here is that if you know the datafeeds will stay pretty static, there's little to gain making an advanced abstraction over them. Which is why you often find duplicated code that haven't been touched for years.. The original target was met with a naive approach, and no new changes lead to stale codebases.

If you have a 95% match on something nontrivial (and it likely won't diverge significantly), I'd go for merging even with 2 cases. At least merge most of the common parts.

Reading a couple of ifs, and some not-quite duplicate procedures seems much better than having a complete 2-set in cross-refenenced files.

Why are you reading a couple of ifs instead of having the two similar things represented by separate classes with shared functionality in a common class? Or even if you prefer composition to inheritance you could still make it work cleaner without a bunch of if statements.

With a 95% match, you really only have one use case, with some minor variation.

No, you also need to ask not just whether they're similar, but whether they are the way they are for the same reason, and are thus likely to change together. It doesn't matter if there are 10, if they all might change independently in arbitrary ways.

"DRY" as coined in The Pragmatic Programmer spoke in terms of pieces of knowledge.

If you are combining code just because it looks similar, you're "Huffman coding."

Every time I have seen a feature that was written general to handle possible future uses, after a year of sitting unused there will certainly be modifications or surrounding code that doesn't support the planned extensibility. So it can never be used without a lot of work changing things anyway.

Future coding has lead to some of the most overcomplicated systems I've worked with. It's one of the reasons (among many) I quit my last job. I was constantly told code that had no use cases was "important to have" because "we needed it".

So does that same idea apply to all of the many abstractions thst geeks do just to stay vendor or cloud agnostic just in case one day AWS/Azure go out of business?

On the other hand, I worked on a multi-million line codebase that was deeply joined to oracle’s db, with a team who all really wanted to move away from it but couldn’t because in the beginning (a decade earlier) the choice had been made to not put in “unnecessary” abstractions.

It’s not about the abstractions. In the case of Oracle or any other database, if you’re only using standard SQL and not taking advantage of any Oracle specific features, why are you spending six figures a year using it?

The same can be said about your cloud provider. If you’re just using it for a bunch of VMs and not taking advantage of any of the “proprietary features” what’s the purpose? You’re spending more money than just using a colo on resources and you’re not saving any money on reducing staff or moving faster.

You’re always locked into your infrastructure decisions once you are at any scale. In the case of AWS for instance (only because that’s what I’m familiar with), even if you just used it to host VMs, you still have your network infrastructure (subnets, security groups, nails), user permissions, your hybrid network setup (site to site, client to site VPNs) your data etc.

In either case, it’s going to be a months long project triggering project management, migrations, regression tests, and still you have risks of regressions.

All of the abstractions and “repository patterns” are not going to make your transition effort seamless. Not to mention your company has spent over a decade building competencies in the peculiarities of Oracle that would be different than MySql.

After a decade, no one used a single stored procedure or trigger that would be Oracle specific? Dependencies on your infrastructure always creep in.

Yes. There's no point abstracting over a vendor API if you're not actually using an alternative implementation (even for testing). Otherwise, keep your code simple, and pull out an abstraction if and when you actually have a use case for doing so.

We had platform agnostic discussions with no concrete plan or action by anyone to actually escape our platform.

Vendor agnostic code doesn't anticipate AWS going out of business, just them raising prices significantly. It can be smart to be able to switch in a reasonable amount of time so that you can evaluate the market every few years. This way spending extra time to be vendor agnostic can also pay off. But there's no technical reason for that, it's a cost issue.

It’s often noted that in almost 15 years of existence, AWS has never increased prices on any service.

What are the chances that AWS will increase prices enough to make all of the cost in developer time and complexity in “abstracting your code” and the cost in project management, development, regression tests, risks, etc make it worthwhile to migrate?

The cost of one fully allocated developer+ qa+ project manager + the time taken by your network team + your auditors, etc and you’re already at $1 million.

Do you also make sure that you can migrate from all of the other dozen or so dependencies that any large company has - O365? Exchange? Your HR/Payroll/Time tracking system (Workday)? Windows? Sql Server? SalesForce? Your enterprise project management system? Your travel reimbursement system (Concur), your messaging system? Your IDP (Active Directory/Okta)?

Old boss of mine had a great line: "two points always make a line, but three usually makes a triangle"

Your boss made a good point. My boss also made the exact same good point. But between those two same good points, there isn't a great line. ;)

We had introspective bosses?

I think the catch all term for that is YAGNI.

YAGNI is about not adding functionality until it's needed. DRYing code isn't adding functionality.

I think GP is implying the thing you're not going to need is a generic solution to the repetition. i.e: DRY is in fact a feature.

DRYing existing code to handle exactly what is needed to support the current uses with no additional axes of variation isn't adding functionality (well, if it handles more than the existing values in the existing axes of variation, it's adding some but arguably de minimis functionality.)

Building code with support for currently-unused variation to support expected future similar-but-not-identical needs, that is, pre-emptively DRYing future code, is adding functionality.

WET - Write Everything Twice

What is YAGNI?

You Ain't Gonna Need It

Write once, copy twice, refactor after 3.

On one project we ended up with a series of feature that fell into groups of threes and we kept trying to make the 2nd feature in the series generic, and time after time #3 was a rewrite and significant rework of #2. So any extra time spent on #2 was time wasted.

I think that what Burlesona is suggesting is more nuanced (and effective) than the rule of three. We can easily imagine situations where code used twice warrants a refactor, and situations where code used three times does not.

The rule of three suffers the same problem as the pre-emptive refactor - it is totally context insensitive. The spirit of the rule is good, but the arbitrary threshold is not.

Similarly, 99% of your comment is bang-on! My only gripe is the numeral 3. But pithy rules tend to become dogma - particularly with junior engineers - so it's best to explain your philosophy of knowing when to abstract in a more in-depth way.

> My only gripe is the numeral 3. But pithy rules tend to become dogma

Agree, 3 is a pretty arbitrary number. If you have a function that needs to work on an array that you know will certainly be of size 2, it takes minimal effort and will probably be worthwhile to make sure it works on lengths greater than 2.

But the bigger point is valid: you need examples to know what to expect, and if you make an abstraction early without valid examples, you'll almost certainly fail to consider something, and also likely consider a number of things that will never occur.

> make it “generic enough to handle future use cases”.

The answer to this is usually YAGNI.

That is, don’t plan for a future you might never have. Code in a way that won’t back you into a corner, but you don’t know what the future’s cases might be (or if there even will be any) so you can’t possibly design in a generic way to handle them. Often you just end up with over-engineered generec-ness that doesn’t actually handle the future cases when they crop up. Better to wait until they do come up to refactor or redesign.

Some people argue to design in a way that lets you rewrite and replace parts of the system easily instead.

The repetition is what is YAGNI!

Repeating code 7 times in preparation for separate evolution of those 7 cases is YAGNI, unless the requirements are on the table now.

Merging repeated code into one is something that is demonstrably needed now, not later.

It's not so much that you're preparing for 7 separate cases, as a thing that works in one place and almost, but not quite, fits in 6 others. You rarely exactly duplicate code, but often duplicate and modify.

By the time you hit 7, you do clearly Need It. But now you've got 7 cases to work from in writing the generalization. When the number is 2, it's often reasonable to say, "I don't know how these will evolve and I'll probably guess wrong".

Yes, I agree. That’s not what I was replying to, though. I noted in another comment that I consider merging worthwhile even in the absence of three use cases, certainly if what you have now is very similar.

If all seven are the same except for one or two cases, it doesn’t mean that you have to have a bunch of if statements, you either use inheritance or composition to create special cases and judiciously apply “pull members up” and “push members down” refactoring, interfaces, abstract classes, virtual methods, etc. These are all solved problems.

Yes I know about the whole “a square is not a rectangle problem”.

"Premature optimization something something..." ;)

That's actually "premature generalization".

Isn't premature optimization a generalization of premature generalization? ;-)

Oliver Steele describes "Instance First Development", which the language he designed, OpenLaszlo, supported through the "Instance Substitution Principle". I've written about it here before, and here are some links and excerpts.


In the right context, prototypes can enable Instance-First Development, which is a very powerful technique that allows you to quickly and iteratively develop working code, while delaying and avoiding abstraction until it's actually needed, when the abstraction requirements are better understood and informed from experience with working code.

That approach results in fewer unnecessary and more useful abstractions, because they follow the contours and requirements of the actual working code, instead of trying to predict and dictate and over-engineer it before it even works.

Instance-First Development works well for user interface programming, because so many buttons and widgets and control panels are one-off specialized objects, each with their own small snippets of special purpose code, methods, constraints, bindings and event handlers, so it's not necessary to make separate (and myriad) trivial classes for each one.

Oliver Steele describes Instance-First Development as supported by OpenLaszlo here:

Instance-First Development




[...] The mantle of constraint based programming (but not Instance First Development) has been recently taken up by "Reactive Programming" craze (which is great, but would be better with a more homoiconic language that supported Instance First Development and the Instance Substitution Principle, which are different but complementary features with a lot of synergy). The term "Reactive Programming" describes a popular old idea: what spreadsheets had been doing for decades. [...]


Oliver Steele (one of the architects of OpenLaszlo, and a great Lisp programmer) describes how OpenLaszlo supports "instance first development" and "rethinking MVC":



[...] I've used OpenLaszlo a lot, and I will testify that the "instance first" technique that Oliver describes is great fun, works very well, and it's perfect for the kind of exploratory / productizing programming I like to do. (Like tacking against the wind, first exploring by creating instances, then refactoring into reusable building block classes, then exploring further with those...)

OpenLaszlo's declarative syntax, prototype based object system, xml data binding and constraints support that directly and make it easy.

OpenLaszlo's declarative syntax and compiler directly support instance first development (with a prototype based object system) and constraints (built on top of events and delegates -- the compiler parses the constraint expressions and automatically wires up dependences), in a way that is hard to express elegantly in less dynamic, reflective languages. (Of course it was straightforward for Garnet to do with Common Lisp macros!)


Instance-First Development:


>The equivalence between the two programs above supports a development strategy I call instance-first development. In instance-first development, one implements functionality for a single instance, and then refactors the instance into a class that supports multiple instances.

>[...] In defining the semantics of LZX class definitions, I found the following principle useful:

>Instance substitution principal: An instance of a class can be replaced by the definition of the instance, without changing the program semantics.

In OpenLaszlo, you can create trees of nested instances with XML tags, and when you define a class, its name becomes an XML tag you can use to create instances of that class.

That lets you create your own domain specific declarative XML languages for creating and configuring objects (using constraint expressions and XML data binding, which makes it very powerful).

The syntax for creating a bunch of objects is parallel to the syntax of declaring a class that creates the same objects.

So you can start by just creating a bunch of stuff in "instance space", then later on as you see the need, easily and incrementally convert only the parts of it you want to reuse and abstract into classes.

What is OpenLaszlo, and what's it good for?


Constraints and Prototypes in Garnet and Laszlo:


In OpenLaszlo, you can create trees of nested instances with XML tags, and when you define a class, its name becomes an XML tag you can use to create instances of that class. That lets you create your own domain specific declarative XML languages for creating and configuring objects (using constraint expressions and XML data binding, which makes it very powerful).

This gives me nightmares of over engineered xml programming that is infamous I the Java community. You lose all of the benefits of static type checking.

Generic enough code that can be adapted to future cases is code with clean architecture that follows standard OO principles.

On the other hand, trying to handle all the hypothetical cases because "that makes the code generic and future-proof" is usually a complete waste of time.

My view is to develop the simplest, well architected OO code that can handle the use cases at hand.

After witnessing the collateral damage of many software reuse projects (anyone remember "components"?), I came up with a different ruleset, useful for "compromising" with "software architects":

First, translate the data.

Second, divine a common format and share the data.

Third, create the libraries for this common format, to be reused amongst projects.

I have never reached #3 in my professional career. Sure, we wrote the libraries. But other teams, projects have never adopted before whole effort became moot.

So I kept my projects in tact and moving forward, while letting mgmt think they're doing something useful.

Sandi Metz wrote a blog post about this: https://www.sandimetz.com/blog/2016/1/20/the-wrong-abstracti...

> The moral of this story? Don't get trapped by the sunk cost fallacy. If you find yourself passing parameters and adding conditional paths through shared code, the abstraction is incorrect. It may have been right to begin with, but that day has passed. Once an abstraction is proved wrong the best strategy is to re-introduce duplication and let it show you what's right. Although it occasionally makes sense to accumulate a few conditionals to gain insight into what's going on, you'll suffer less pain if you abandon the wrong abstraction sooner rather than later.

Sandi Metz's blog (and book) are an absolute gold mine. I'm a junior developer (<5 years), and the material gave me an entirely new outlook on design (object-oriented and otherwise)... and also showed me how un-maintainably most development is done nowadays :(

If there was a required reading list for professional developers, I would put her work on with zero hesitation, I feel it to be that important.

Which book are you referring to?

Not your parent, but https://www.poodr.com/

You call devs Junior until they have over 5 years experience? Wow, that’s harsh

> You call devs Junior until they have over 5 years experience? Wow, that’s harsh

Less than 3-4 years is absolutely junior dev. Up until around 7 or 8 years is mid-level. Passing beyond that would be senior dev and the other titles then follow.

With the obvious note that it's not strictly time bound - someone could easily get stuck at mid level for much longer if they aren't progressing. It's pretty hard to still be only junior after 7+ years

I agree.

I still very much consider myself junior but someone I know with less experience than me just recently got a job with "senior" in the title. I have seen senior job postings that say something like "3+ years of experience."

I absolutely do not consider myself senior and I don't think I will for at least seven more years.

One of the big reasons I realized I still _was_ a junior developer was because of Sandi Metz's content showing me how long of a journey I still have to go :-P

I don't personally think it's strictly about the time one's been programming, although in my experience that can be a good benchmark for e.g. how good one's abstraction, API design, etc. skills are.

> don’t extract repetitive code right away, try and build the feature you’re working on with the duplication in place first. Let the code go through a few evolutions and waves of change.

This ^^^.

I'm in my mid 50's now and worked as a software dev since my teens. I've learned over time that certain lumps of code need to be left alone for a while before jumping in and aggressively refactoring the perceived "duplication". I've been guilty of this kinda thing before, spot a wodge of code that looks like a duplication, only to find out later you hit some "special cases" as the project progresses and suddenly, as the article points out, your "helper" method balloons into a ball of spaghetti.

As an apropos to the article, and touched upon therein, checking in a fairly major change to de-duplicate some code without consulting the original author/team is a wee bit rude. Ask your colleagues first why such code still exists before barging in and making these changes, they may already have some concerns as to why refactoring to a helper method or some abstraction isn't in their game plan yet. It's a common courtesy.

This isn't about rudeness or courtesy unless they are responsible for that code, in which case they should be involved or leading the change.

As you mention it's about technical knowledge: They know exactly why the code is the way it is so consulting them is "due diligence".

> As an apropos to the article, and touched upon therein, checking in a fairly major change to de-duplicate some code without consulting the original author/team is a wee bit rude.

This sounds like the outcome of bad culture. Ownership of the code should be shared to the point where it should never be considered rude to improve the code. Any part of the code.

> Ask your colleagues first why such code still exists before barging in and making these changes,

If I had to synchronise with others all of my improvements to existing code (which was frequently written in a hurry to meet a deadline, so with shortcuts taken intentionally, or with incomplete knowledge of future use cases) I would get at most half as much done.

> they may already have some concerns as to why refactoring to a helper method or some abstraction isn't in their game plan yet.

If there are alluring "improvements" that don't work for such subtle reasons, this should be documented in the code. If it's not, one has only oneself to blame when someone goes in and changes it.

Edit: I realise now that I'm talking about teams where everyone is reasonably senior. It could be different with junior members on the team, to which many changes might look like improvements when a senior engineer would at most see the change as equivalent. In that case I think you're right, but for a different reason: junior engineers should always check in with senior engineers about things in order to learn more of the craft!

> This sounds like the outcome of bad culture.

For 90% of shops outside the cornucopia of SV with unlimited budgets and internal customers only, the client drives decision mercilessly and ruthlessly.

And for a good reason. The product doesn't exist because they're fun to develop. They exist because they solve a problem customers have. So ultimately, decisions should always be based on what customers (long-term) benefit most from.

One of the areas where I really like "incidental duplication" is in tests.

Tests can sometimes be very repetitive and identical, and it's tempting to want to refactor it in some clever way. That's almost never good.

On top of the reasons laid out in parent comment, tests also function as unofficial documentation. I like having everything explicit in there, it makes them easier to read and understand.

If you try to abstract away tests, you often just end up re-implementing the same abstractions used in the actual code, and you can end up not catching unfounded assumptions that your abstraction is making in both the tests and the code. There is a scope for having test helpers / utils to make tests easier to write, but you should be minimalist with these.

I agree, but some test helpers are reasonable. It's about balance and readability of the tests in question. Otherwise integration tests end up being a hundred lines of code.

The problem with abstractions in testing is - you should test it :)

Tests should also be treated as a form of documentation. A test should reflect the way you'd use the code in real life as closely as feasible.

I'm not sure I agree. I like to move all setup code to helper methods so that my tests are just a few lines

  // Given <setup>
  // When <thing happens>

  // Expect <thing> to be <such>
This allows the reader to easily see which workflows are actually tested. If the reader is interested in implementations the utility code is one click away and usually only needs to be looked at once to be completely understood. The test bodies themselves however have many flavors for many work-flows so getting rid of the repetition is critical to highlight the specific nature of individual tests.

I do think a lot of test frameworks would do well to support data-driven testing a la Spock: http://spockframework.org/spock/docs/1.0/data_driven_testing...

Terrible acronym for a good idea: DAMP


Tests rarely have bugs, I find, so generally dry isn’t critical. Also, dry is for security (see below)

Tests frequently have bugs, especially bugs that result in the test passing when it should fail.

That’s why I make the test fail before writing the code. If the code is already written, then I break it in the minimal way to test the test, and then fix it.

IME things never fail in the way you expect them to... You can build a fortification where you think you are weak only to find the enemy is already in the castle.

A green test does not equal bug free code. There may be a misimplementation / misunderstanding of the spec or your code passes beautifully under right conditions, with just this data setup.

That's a test for your test, so why only run it once transiently instead of running every time? "Mutant" testing helps with this. It's basically fuzzing your test code to make sure that every line is meaningful.

I can see how that would be useful, but I also think it's a matter of priorities.

I'm basically saying I rarely have bugs in my tests because I verify them first. In fact I can't think of a single bug in my tests over the last 4 years (or even 10 years), but I can think of dozens of bugs in my code.

For example here are some pretty exhaustive tests I've written for shell, which have exposed dozens of bugs in bash and other shells (and my own shell Oil):


I would rather spend time using the tests to improve the code than improving the tests themselves. But I don't doubt that technique could be useful for some projects (likely very old and mature ones)

ITYM "I rarely have bugs in my tests that I'm aware of". The amount of tests I've seen that look like they are working properly, have been "verified" and are buggy is huge. Usually we find they were buggy because someone moves some other tests around or changes functionality that should have broken the tests, but didn't.

Please, don't ever assume that your tests are beyond reproach just because you verified them. Tests are software as well and are as prone to bugs as anything.

And how do you do this in practice? I am struggling to think of a good way to keep the production code that fails the test and the production code that doesn't fail the test together. I might have my test check out an old version of the production code, compile it and test against that. But that is hard to get right.

Tests always have bugs, it's just that you don't know about the edge cases yet.

As a preventative measure, I write some tests for my tests. Also in TDD style of course. And on a very rare occasion, I have to write a test for those tests as well.

It’s time for TTDD. Start by writing tests for your tests :)

I do actually do that. I'll write some buggy code in order to learn how to test for it.

TDD for me is primarily a way to guide myself toward accomplishing a goal. So I sometimes write way more tests for myself than the business needs. I will then delete the scaffolding tests before I tag my PR for review.

Tests have fewer bugs if you write them before the system under test, and if they don't have mocks, and if you have enough of them that anything you get wrong the first time will get noticed by the results of the many similar tests.

> Tests have fewer bugs ... if they don't have mocks

100x this. I've repeatedly fail to convince my team members that mocks are unnecessary in most cases. I've reviewed code with mocks for classes like BigDecimals and built-in arrays. This is especially prevalent in Java teams/codebases.

Drives me insane.

> Tests rarely have bugs, I find, so generally dry isn’t critical

DRY is important for tests, but the areas where it is have probably (if you aren't writing a testing framework or using a language that doesn't have one that covers your use case) largely covered by your testing framework already.

Tests are there to catch issues. In a way, almost every bug that makes it to production is a test bug.

Not to mention the inconsistent tests that fail so randomly from timing issues that the team ignores them

I have 2 rules I use when determining whether to duplicate code or to refactor:

1. How many duplications are there? If the code is duplicated once, that's fine. If it's duplicated twice (so 3 instances of it), then it's time to consider refactoring, subject to the next rule.

2. Why is the code duplicated? If it's "incidental duplication", i.e. code that happens to look the same, don't refactor. Only refactor if there's actually a reason why the code is the same. Which is to say, attempt to predict the future: if a change is made to one instance of the code, do I expect it to be replicated to the other instances too?

It's also useful to look at not just the duplication but the code itself. In this case, it was code for geometry which is not like to change all too much.

Often the difference between harmless but ugly looking duplication and duplication that is actually harmful relies on the semantics of the code and not just its appearance.

It wasn't code for geometry, it was code for manipulating geometric shapes via UI. the article specifically mentions that custom behavior was eventually needed:

> we later needed many special cases and behaviors for different handles on different shapes. My abstraction would have to become several times more convoluted to afford that, whereas with the original “messy” version such changes stayed easy as cake.

> attempt to predict the future

Sounds very error prone…

It is, which is why you want to be conservative. If the duplicated code is obviously supposed to be identical, then predicting the future should be trivial. If it's not obvious, then it's a question of "can I conceive of a reason why I'd want to update one and not the others?". And if the answer is unclear, wait a while and see if anything comes up.

This. I wait until the feature is mature, and has been used by actual people. There's no point trying to clean up code that hasn't finished evolving, or that I don't understand fully.

It also means I have a tidy stack of refactoring to do when I'm bored or need a quick motivational win :)

And in the other direction, when adding flags etc to a method that was created to reduce duplication, consider if splitting it into two "duplicate" copies or copy it to the call-site requiring the flag isn't the better alternative.

Yes. If extracting and deduplicating is in your toolbox, so should be inlining and duplicating.

Or, if possible, split up the helper into smaller units of functionality that can be combined as appropriate for different requirements. That could possibly work for the original article's problem, too.

Refactoring duplicated code is a trade off: on the one hand you create abstraction and centralizing code logic at the cost of, first, the overhead of learning that abstraction, and second, by increasing coupling between functions. I've personally found that the coupling is by far the most important factor to consider. If A depends on B, and C is found in both A and B, then you should factor out C. If A and B share C because they are adjacent, without being fundamentally intertwined, then duplicate C, but consider creating a library or module that makes it easy to talk about things that are similar to that duplicated code, C. (I tend to think about modules/libraries as little, independent DSLs)

Another thing you can do when a function becomes overencumbered is to split the remaining similar logic into smaller functions which are composed into specialized functions.

This has benefit in that when analyzing modules, you can spot differences in procedure at a glance instead of needing to dig through 100 lines of somewhat similar imperative code.

Right. There are two kinds of such refactoring / DRY-ing up the code:

1) Specialized helper sub-functions/classes to make the codeblocks DRY.

2) Functions/classes to make separate features DRY.

Problem arises when the design or understanding of the code doesn't reflect the realities of changes over time, forcing you to restart/revert, or making spaghetti code with optionals and whatnot to accomodate the rising complexity.

The agile approach would be to make the code that is easiest to change either way, and prevent being locked in to only one approach.

Go error handling is a good example of this. So far all the attempts to reduce the repetitive `if err != nil { ... }` through some abstraction failed. Look at https://github.com/golang/go/issues/32825

In the language maybe but both Rust and Zig show that it's possible to have much less 'bloat' for error handling even without using exceptions.

I'd say that go designers have still work to do: Zig especially show that you can be a 'simple' language and yet have both sane error handling and generics.

If someone else is looking for examples too, I found those:


I thought all of this until I got used to Go's error handling.

There's a couple aspects to this:

1. After a while, the "if err != nil {" becomes a single statement in your mind, and you only notice it if it's different (like trapping things that should error with "if err == nil {"). In other words, it only feels verbose if you're not used to it. After a while, the regular rhythm of "statement, error check, statement, error check" becomes the routine pattern of your code and it looks weird if you don't check for errors (which is as it should be).

2. The point of Go's error handling is that it isn't magic. There's nothing special about error values, and they are handled exactly the same way as every other variable in the system. The only thing the language defines about errors is that they have a method called Error that returns a string. That's it. This means that you can create complex error handlers if you need it, entirely within the standard language. This is extremely powerful.

The Go team's examination of the language error handling is interesting because it seems there's a conflict between newer Gophers who don't like the verbosity of it (but don't realise the power it brings) and the older Gophers who are used to the verbosity and appreciate the power. Almost exactly like TFA. The repitition looks ugly if you don't appreciate the reasons for it.

"It isn't magic!" mantra is often heard in Go apologetics, but every time I see it, it occurs to me that Go's definition of "magic" is somewhat akin to a 15th century peasant seeing a lightbulb. Stuff like exceptions or error types isn't magic - they have been around for a long time, they're well understood, and they have significant advantages.

I kinda prefer this to the Rust apologetics, where every post about Go is replied to with three posts about how Rust does it so much better ;)

I've used lots of other languages, as have a lot of other Gophers. I'm not saying "Go's approach is good" out of some strange tribalism or a need to assert my preference. I'm saying this because, having spent over 35 years programming, I really appreciate the simplicity of Go and the lack of magic. I'm not apologising for Go's simplicity, I'm trying to explain why I like it.

A garbage collector of obviously lack of magic.

Exceptions are pretty much the perfect example of (bad) magic actually; unannotated, dynamically dispatched nonlocal control flow that can execute code (ie, destructors) in completely unrelated contexts on the way past. At least 0x5F3759DF can be boxed into a function and commented in one place.

> Go's definition of "magic" is somewhat akin to a 15th century peasant seeing a lightbulb.

This is very true and well put though.

Except exceptions are rarely understood and used correctly by most programmers. They can simplify program structure, but at the expense of proper errorhandling and error mitigation strategies.

Golang is still in the sort of niche that builds databases, queues, container-orchestration, etc., but can be built for other things given enough care for spending the extra effort simplifying the solutions.

> Except exceptions are rarely understood and used correctly by most programmers

That's pretty condescending.

The mechanism for exceptions has been around for more than 20 years, it is well understood by most programmers.

The problem is that error handling is hard.

Exceptions are an adequately sophisticated solution to that hard problem. Go's approach only encourages ignoring errors (since the compiler never enforces that you handle them) and boiler plate.

> That's pretty condescending.

Nope! Exceptions are, in fact, rarely understood and used correctly by most programmers. Including loopz. And me. And the authors of approximately every nontrivially-exception-using piece of code I've had to work with. And presumably also of the code loopz has had to work with.

The difference is that some of us have the good sense to rarely use exceptions at all.

Is it really: Are programmers omniscient then that they can trap all kinds of exceptions correctly from external code? It's a sophisticated method that dumps the problem on the user instead.

Golang also output stack traces and even supports panic() if one wants to have something similar to handling exceptions. The difference is that this is used for classes of errors that ideally are programmer error, and not for all kinds of business logic states. I'm not saying the Go Way is perfect either, but it's at least a small step acknowledging the difference, rather than defaulting to dumping random programmer errors on unsuspecting users.

Errorhandling is easier when improving design. The problem is this takes time, thinking and effort.

Go didn't discover anything. Java's runtime and checked exceptions are already the direct consequence of errors being of two kinds: recoverable and non recoverable.

Go's approach is inferior to Java's in all ways.

Exceptions are basically error types with dynamic typing in an otherwise statically typed language. So if you can deal with dynamic typing in Python, you can handle exceptions in Java.

The main issue with any discussions on exception is the elephant in the room, Java. Java has a worst model of exception mixing weird typechecking rules + error handling not forcing to recover the exception.

I really like the exception model of Erlang, recovery is only possible from another routine. It's is in my opinion the best exception model.

Go code is nice because everything is fully explicit but it's hard to read, trying to follow the control flow when half of the statements are error recovery code is just very unpleasant. And i still don't know how to express that an error is the result of another error without forgetting the context.

This is kinda my point: it's hard to read until you get used to it. I can totally see why someone used to Python has a hard time with Go's error handling, because they're not used to the rhythm of it (logical statement, error check, logical statement, error check, logical statement, error check); they're more used to (exception handling setup, logical statement, logical statement, logical statement, exception handling completion). It's different, and therefore strange and weird. But after a while you get used to it, and expect to see an error check after each logical statement, and it sort of merges into one structure in your mind.

There are packages for wrapping errors, and I believe some form of error wrapping (using the fmt package for some reason) is being adopted. More than that is up to the coder to implement.

> Java has a worst model of exception mixing weird typechecking rules + error handling not forcing to recover the exception.

Unless you have a specific complaint about Java's model (which I'd love to read), I strongly suspect that your beef is with a few standard Java library functions misusing checked exceptions than a statement against exceptions in general.

The combination of runtime and checked exceptions offers the most complete solution to the difficult problem of handling errors, with the compiler guaranteeing that error paths are always handled.

The big problem with Java's model is that exceptions aren't part of the type system: they're that whole separate thing that is applied to methods, but it's not a part of that method's type in any meaningful sense. This means that it's impossible to write a higher-order function along the lines of "map" or "filter" that would properly propagate exceptions, because there's no way to capture the throws-clause of the predicate and surface it on the HOF itself.

`> The big problem with Java's model is that exceptions aren't part of the type system

They are.

To the extent that the checked exceptions that a method can throw are part of the method signature (as all error cases should).

I have no idea what you mean by "exceptions aren't part of the type system".

Sounds nice, and of course it is possible to build solutions with exceptions that do recover all errors elegantly and cleanly. However, the correct judge on this would be your own users. Given enough care, the discussion becomes rather philosophical.

Though, having to confront errors through the callstack makes one review where handling would be most prudent, in real-life the time-pressures are just too strong making such efforts largely unrewarded.

>I really like the exception model of Erlang,

Or the exception model of Common Lisp, which is designed for always being able to recover.

>Except exceptions are rarely understood and used correctly by most programmers.

That's just your opinion.

When applications and services still defaults to dumping full stacktrace and reporting programmer's errors, it's from longer-term experience also.

What are they supposed to do? And how is that different from panics?

> The point of Go's error handling is that it isn't magic. There's nothing special about error values, and they are handled exactly the same way as every other variable in the system

The error maybe, but not the result of the call. The multiple-value return x, err is not a first-class value. It cannot be handled like any other variable.

This was demonstrated very clearly with proposal for try. try would have automatically returned with err when err != nil. But what if you wanted to change the error, say create an error message? Then try was completely useless. In Rust, where the result actually is just a regular value, you can transform the error however you like just like any other value and try is just as useful as before.

Sorry, I don't understand what you're saying. Are you saying that because there's a proposal in v2 for error values to not be 1st class values, therefore they're not in v1?

I think it’s that in Go multiple return values aren’t a first class value. It’s just two separate values. Whereas in Rust or Haskell they’d be a single, first-class Result<a> (or whatever) value.

> 1. After a while, the "if err != nil {" becomes a single statement in your mind, and you only notice it if it's different (like trapping things that should error with "if err == nil {"). In other words, it only feels verbose if you're not used to it.

The whole point of programming is to abstract away repetitive work. Yes, a human will spot that the pattern is the same, but this is both fallible and a waste of human effort. And even if you can see the difference, those extra characters are still filling up your lines and making it hard to keep functions on a single screen where you can comprehend them more easily.

> 2. The point of Go's error handling is that it isn't magic. There's nothing special about error values, and they are handled exactly the same way as every other variable in the system. The only thing the language defines about errors is that they have a method called Error that returns a string. That's it. This means that you can create complex error handlers if you need it, entirely within the standard language. This is extremely powerful.

There's nothing magic about something like https://fsharpforfunandprofit.com/rop/ either. Just plain old functions and values written in the language (that talk literally gives the definitions of all the types and functions it uses, written in plain old F# code). You need functions as values, but everyone agrees that's a good idea anyway, and you need proper sum types, but you need those anyway if you're ever going to be able to model a domain properly.

> The point of Go's error handling is that it isn't magic.

The problem is first, that sum types are also not magic. There is nothing special about the error type or value in a `Either<T, Err>`. Go's type system is just too crappy to make such things, or make good use of them even after you tried to shove them into an interface{}.

The second problem is that Go's error values, like every error handling system that pretends it doesn't need sum types, have picked up more magic (%w) or impacted the usability of other interfaces (context.Err, separate error channels) bit by bit.

The downside I see with go's error handling is that you can forget to check. With rust, if the function being called returns Result, you have to deal with the error (even if dealing with it just means propagating it out). Missing error handling is such a common source of bugs that go really turns me off here.

> you can forget to check

Linters can help with this.

Specifically for this issue, linters also have many false positives. Some Go libraries trying to encourage a fluent style will accept an error for some logic also return it, so you can `return x.HandleError(err)` - but if you don't want to return it, you obviously don't care it returns what you just passed it. (I personally consider fluent methods a bad idiom in Go, but I also don't get to write all the Go code in the world or even in my project.)

There are also a lot of functions that return errors because Go's type system demands that if the interface returns two values `T, error`, every implementation must also - it won't auto-create a nil value for the second result. That's reasonable if you are committed to errors just being normal values. But such a restriction would not be necessary if the interface could be declared with a sum type - promotion of a `T` to a `Either<T, ...>` or `Just T` or so on would be fine for all types, not just error handling . Lots of infallible Writer implementations like bytes.Buffer and hash.Hasher suffer from this, and linters can't be aware of all such cases.

Sure, but given a choice, I'd rather work with primitives that are correct-by-construction, not correct-if-I-use-an-extra-tool-and-actually-act-on-its-advice.

If you do use a linter and have it set up so linter issues are fatal to the build, then you run into the issue that if the linter throws false positives, you have to add exclusion rules (if the linter even supports that) or downgrade linter issues back to warnings, and lose the benefit entirely.

One aspect of go error handling that still really bothers me is how easy it is to accidentally drop an error on the floor.

If you strap all the linters you can find onto your build system, you can catch the most obvious cases. But I still frequently find mishandled errors that sneak past the linters in ways that can't be solved without restricting the way devs use the language.

By making error handling something you have to deal with every time you call a function, you massively increase the number of opportunities you have for screwing it up.

I would love something like rusts ? Operator for go. You could choose to not use it when you need special handling. But it would be rare and exciting and developers would use it with care.

On the error branch, how do you do code coverage?

I don't really think Go error handling is a good example. A nil test paired with a return cannot be extracted to a function. A macro could perhaps centralize the logic, but otherwise it's no more amenable to deduplication than plain addition. Moreover, it's completely trivial, if verbose in the Go language.

The case where it becomes interesting is when there are four or five statements that are repeated. If it's two statements, especially if the statements are both control flow, one of which is tied to the frame of the executing function, and the other is a trivial nil test, that falls firmly on the "not refactorable" side of the line.

> Then you need to start passing options and configuration into your helper method... and before long your helper method is extremely difficult to reason about

In which case, you should split the helper function ( extract sub-part common to all cases, and report the differences where the helper function is called).

I think I would most of the time go with de-duplicating as early as possible, as long as the helper function only has few parameters and can be described in plain english rather easily.

To me the cost of having to refactor the helper function later in the process is less than dealing with duplicated code.

Duplicated code causes many issues. You mentioned introducing bugs, but it also makes the code harder to read. Every person who reads the code has to make the de-duplication effort mentally (check that the duplicated parts are indeed duplicated, figure out what they do, and on what parameters they differ...).

Premature optimization. Duplicated code is only evil when there's a bug you only fix in one place and forget about the duplicates; in almost every other case, it's easier to reason about and is more resilient in the face of local changes.

Abstraction is often like compression, and compressed data is easier to corrupt. Change the implementation of an abstraction, and you put all consumers of it at risk. It's not an absolute good.

> it's easier to reason about

Consider you have 4 times a block of 10 lines of code, they are identical except for a couple of parameters. The person who reads the code has to 1. figure out what the code does 2. see if the duplicated parts differ in some subtle way.

The alternative is to replace the duplicated parts with a function that has a meaningful name. This makes the code easier to read. It's not a premature optimization. It would make sense to keep it duplicated if you're in an explorative phase and not sure yet what the final design will be, but I wouldn't submit a patch where parts are obviously similar. I'm not sure it would pass review.

It really depends. For test setup, I'm inclined to leave alone; duplication is often easier to reason about when a test breaks. For code with control flow changes in the middle, abstraction is honestly dubious. For mere value differences, maybe, but if the values are complex and not merely scalar, maybe not. More duplication is needed to justify when the abstraction needs more edge cases to cover them all, especially control flow more than different values.

I generally find it pretty easy to reason about code structured like:



(bunch of code)


(bunch of code)


(bunch of code)


Even if the function is long it's pretty easy to skip over the irrelevant parts.

When you get in trouble is when you discover a bug (or have changed requirements) in something that gets duplicated several times and have to remember to hit all of them. The last one especially, it's the one that seems to be missed the most often.

Overall the tradeoff is generally worth it though, because you only need to care about one case at a time.

> Even if the function is long it's pretty easy to skip over the irrelevant parts.

> Overall the tradeoff is generally worth it though, because you only need to care about one case at a time.

Which part is irrelevant? As a programmer, I don't generally know which value `object` has, so if I need to understand the whole statement, I need to look at every case, so I often need to check whether they are identical or slightly different.

Duplicate code like this is a well-known source of bugs, one of the cases most often highlighted by static analysis tools.

It kind of depends what you're doing, but a lot of the time you are in a situation like "there's a bug when you do X", and when you find code like this you scroll through the options until you get to the one labeled as function X.

Inside of there you may run into stuff that was set up earlier in the function, but it's pretty easy to backtrack to the top of the switch statement to see what was done before then.

There's nothing quite like the joy of being asked to fix something and discovering that it's a big old top down project without excessive abstraction/compartmentalization. Just skim through the code in order till you get to the part that's misbehaving and start investigating. No need to jump around a dozen factory template engines or custom generic template wrappers to find the 3 lines of business logic that have an off-by-one error or something.

I am such a huge fan of Big 'Ol Switch Statements over using polymorphism/types/generics/whatever. So easy to understand, and it's all there in a huge scrolling list of cases. If something needs to change, it's easy to change it, and you know what else is affected.

That works until you have 20 different Big Ol' Switch Statements, each switching on (mostly) the same cases, essentially implementing a set of related behaviors in 20 different places instead of grouping the 20 behaviors under the same umbrella.

Overall, I think there is an equilibrium between the number of cases in the switch and the number of different switches with the mostly the same cases. The fewer cases ypu have and the more times you handle the same cases, the more it will help to group these different behaviors in separate places: each case would correspond to a different class, each switch statement to a different method in each class. The fewer classes and the larger they are, the more I think it helps to apply this transformation.

By the way, this has nothing to do with the discussion above. The alternative to the switch that GP commenter presented isn't polymorphism, it is simply extracting the common lines into a separate function.

you're entirely right, of course. I just wanted to express my enthusiasm for the approach :)

> Duplicated code causes many issues. You mentioned introducing bugs, but it also makes the code harder to read

That is I believe contestable. Yes, it can end up easier to read but there is a big tradeoff - when you remove code from its context it is much harder for a reader to reason about it. Once something is extracted into a function you can't see, you have to mentally ask the questions like "could this return null / None?", "how will it behave if the input is negative?", "does it throw exceptions", "is it thread safe?" etc etc. All this is directly observable in context when the code is inline, not so when it is removed to a central place.

Ideally, these questions should be answered by the function name, type, and its documentation. And if not, one can always jump to the function definition and the above elements (missing if the piece of code is inlined) will give additional hints to what the code does.

Upvoted. We'd probably come up with different implementations, I strongly prefer composition and "interpreter" style code, but whenever I've seen casual mutations in different dataflows, it was because of doing one off changes while ignoring the whole.

I think about it in terms of bugs. If your abstraction causes a bug, I have to go in and work out wtf your wonky abstraction is doing and also risk breaking other cases. If there are 6 duplications and there is a bug because one of them is missing a change applied to all of them, that takes 5 mins to fix and risks breaking nothing.

When you make an abstraction think not only "Will this create bugs?" But also "If this abstraction does create bugs will they be easy to identify and fix?".

Extracting common function was right move in your example.

The bad move was to add options to common functio instead of changing one caller to call new different method.

Even worst move was add more and more options to originally simple method. Just because something was rightfully extracted to common place 10 months ago dies not mean it have to remain in common place. And the need for split does not mean original extraction was wrong.

I’m just wondering where you draw the line.

As with everything, it often becomes a big grey area on what is acceptable and what is not.

Example: my (fictional) company sells a B2B platform which provides companies with an online marketplace or some other type of online application. Each installation requires different, though often similar, integrations with the customer back-ends - think Postgres vs MySQL, but some customers may also use TotallyEffingObscureDB, which doesn’t even speak SQL. Those backends are usually the source of truth for the entire stack, storing user data, etc, the local data is just a mirror.

So, given this scenario, how should we approach user registration processes? Is that a product component or custom (not DRY) code? What about all the other (similar, common, and basic) features of our platform that every customer wants but invariably needs to be implemented slightly differently?

I’m only posing these questions because I worked in a situation where something similar happened, and it wasn’t dealt with well, and I’m wondering what other HN coders have to say...

Ironically, we're handling this with the concept called "clean code". We do have a core, which does implement the base logic. Everything in there is domain driven, but only using DTOs and providing interfaces for input and output, using repositories and presenters. When the data source changes, we only need to add the new repositories and set those within the context. If we have to implement some specialized logic just for one client, we add this as a plug-in on top, so it is handled outside of the core logic. Of course all of this is very abstract and since I've worked mostly with simple MVC concepts until now, I'm still struggling to get my head around this approach, but so far it's looking very well. It may look overcomplicated at first, but there is a very strict while at the same time very flexible logic behind it, which can handle all the "creative" ideas our clients have so far, while still keep being maintainable. It is a long process to get to this method of programming, but my initial scepticism has completely changed to "why haven't I worked like this before?".

That’s pretty much the approach I would suggest and think works best. Thanks for your reply.

Looking back, I think the place I worked had a lot of issues with how code was actually stored and maintained - different repositories for each client and each feature, for example, meant a lot of common code was simply copy/pasted when reused and that obviously made sharing bug fixes much more difficult or even impossible. Real dirty stuff. The solution isn’t all that complicated but when you’re on a services team with several hundred clients and hundreds of issues in your backlog, management focuses less on paving the way for the future and more on getting things done immediately. A few of us did implement a single code-bass / configurable framework for one big feature, but it was hard to get buy-in and convince people to use it - even if it reduced the workload from days or weeks to hours. The concepts from that were eventually re-packaged by management and sold by a more creative manager as an “SDK”, but I didn’t have the privilege of working on that team.

Interesting. I'm curious how long did your core system with basic-logic take to reach that maturity, and how many people involved? What kind of development model did you use?

Also you state that you "still struggling to get my head around this approach," does it mean that the system somehow violates the principle of least astonishment? (https://en.wikipedia.org/wiki/Principle_of_least_astonishmen... )

My first job at a bigco had a very heavy read-the-code culture, and being able to read code and write readable code is one of the most valuable skills I learned there. The lack of this skill is one of the things I've found most frustrating about working with less talented or, now that I'm a bit later in my career, more junior engineers. There's a tendency to glom onto black-and-white, superficial rules without understanding them, instead of appreciating the principles underlying them and having a sense of how to balance trade-offs. This creates an unfortunate cycle: everyone writes unreadable code, so nobody practices reading code, so nobody internalizes what readable code looks like and continue to write bad code.

I tend to have a strong reaction to duplicated code, but DRY is risky if whatever you're pulling out isn't logically concise and coherent[1]. Some of the helper functions I've seen in code reviews (as well as the one in the OP) strike me as written by unsophisticated language AIs, catching textual similarities that aren't semantically linked and pulling them out.

The engineers I've mentored over the years, including ones starting with no eng experience, go on to write fantastic code (and no, this isn't just my opinion). But it's a very labor-intensive, hands-on process of thorough reviews and feedback until they internalize the underlying principles, and it can be tough to swing in very short-term-oriented company environments. Now that I'm running larger teams, I've been noodling over how to encapsulate this deeper understanding of what makes good code in a way that scales better. But the fact that Google et al haven't already done this makes me think it's not something you can teach with eg a bullet point list.

> I encourage my devs to follow a rule of thumb: don’t extract repetitive code right away, try and build the feature you’re working on with the duplication in place first. Let the code go through a few evolutions and waves of change.

(Note: this isn't a case of what I describe in the earlier part of my comment, as I don't think this is a superficial, black-and-white rule)

I disagree pretty strongly here, especially since you're #2 involves waiting til you hit a bug. IME, there are many cases in which a solid understanding of the code allows pulling repeated code out in a way that's principled enough that it's more likely to adapt well to future changes. Indeed, this is true of most changelists to some degree, or you'd end up with no functions or classes other than main().

[1] A good rule of thumb is whether the free function has a reasonably readable name that captures its logic, without having to abuse handleFoo() or processBar().

I'm about 70% down the page, and this is the best comment I've read so far. I think you should write a book, and/or a blog. I'd read it, anyway. :)

Not only DRY but YAGNI.

Unless you are sure you will need to separately evolve the initially duplicated cases, assume YAGNI: You're Not Gonna Need It (the repetitions).

Avoid repetition like the plague; when the new requirements arise that conflict with the reduced repetition, confront it at that time.

It's possible they they are bad requirements, in which case the clean, DRY code can support arguments against the requirements.

Treat the computer as more of an Analytical Engine, and less of a Jacquard Loom.

The special case behaviors for various shapes hinted at in the article:

> For example, we later needed many special cases and behaviors for different handles on different shapes.

This sounds like it might be a bad experience for the end user who has to learn annoying shape-specific handle quirks.

From my experience, I would start with a more compact implementation first in hopes that it would save me time. This usually ends with me having to do more complicated operations in my head to plan the code. It's tempting. It's like trying to do math in your head to save time - often leads to a mistake. Now I believe there's no shame in trying out a solution first and rewriting it later when I get a feel what the problem is really about.

John Carmack once said something like "If you're not sure which version is better, code it both ways and compare".

The only downside of later optimization is that in commercial environment they often don't let you clean up.

> The only downside of later optimization is that in commercial environment they often don't let you clean up.

Exactly (speaking for custom industrial automation projects). The follow up next requirements for a project might be in 5 or 7 years or never. You would be cleaning up a 80% dead project. In these environments, it's not uncommon to no longer have a dev environment because the system is a singular testbench that costs upwards of 2 million and only exists in 2 factories, so all bugfixes must be extremely conservative and any refactoring is stricly forbidden because there's no way to locally test for regressions.

It's important for developers to understand that programming is at least as much about expressing yourself clearly with language, as it is about maths and compsci views of functions. Language that's more verbose, but also adds clarity, is a good thing.

If I had an instruction book on building a cabinet, it wouldn't help to re-list every screw, tool, and their sizes on every single step if I could put a parts list at the front. But it also wouldn't help to collapse every matching group of steps with one or two different parameters together.

Reusing a function like this is not clean code. It violates several principals.

1. Functions should have as few parameters as possible and almost never have flag parameters. This is a basic thing and costs very little to follow. As soon as you want to add a flag to a function you need to make a new function.

2. Minimize coupling.

3. Single responsibility principal. A unit of code should have one reason to change.

Of course in order to follow principles 2 and 3 here you may well need to consider the business logic.

Good generics can be really helpful here.

Can't be specific without knowing the exact helper function but I find sitting down with a cuppa (away from the computer) and planning an interface (If it's C++) that accepts (say) a policy class with a default and a callable object ("Functor") very helpful.

Sometimes it can't be done but it's better to have a bugfree building block (Macros don't count!) that can accept buggy user defined tasks than lots of repitition.

> Macros don't count!

Well, C++ macros don't count.

One big way I prevent this from happening is to treat classes as interfaces to data structures and keep everything that isn't about accessing the data elsewhere. Conversions to other data types go somewhere else. In fact I don't want my data types depending on any other data types at all.

When doing this any of this repetition or evolution can stay out of the data structures themselves so that they can be reused without irrelevant baggage.

Have you not simply abandoned OOP at that point? A core point of OOP is that objects manage their own state, and provide an interface for accessing/mutating it.

If classes are only used as data structures, and everything is done through (presumably pure) utility methods, it sounds like you're writing procedural code in an OOP language.

That's not inherently a bad thing, but OOP provides benefits and you may be making a trade-off without thinking about it.

I try to worry much more about what works rather than labels.

If you stuff all your functionality and data transformations into class definitions, your class definitions will be full of dependencies. Now all the fundamental elements of your program depend on each other and can't be separated.

It's like trying to pile up concrete until it forms a cave instead of laying it down to make the floor and building on top of it.

What am I missing?

The best way I’ve had this explained is that refactoring != DRY. Some methods have a similar structure, but if the problem they’re solving isn’t fundamentally similar then it’s not truly refactoring and you’re setting yourself up for more work in the future.

> don’t extract repetitive code right away, try and build the feature you’re working on with the duplication in place first.

I have a similar rule of thumb: when you do a thing the first time, just get it working. When you do a similar thing a second time, just get it working. When you do a thing a third time, pull it together as a cohesive abstraction.

This is a silly generalization, but the point is that you generally need to see a handful of examples of a thing before you can make any useful abstraction. What's more common is people creating abstractions during the first iteration, leading to features that are never used, and others that were never considered.

And the abstractions are not abstracting over the correct things so they are actually a net negative. One has to reason about the abstraction AND the problem domain.

I find it helpful to work with ADTs of values, lists and maps. No OO, just functions for selection and projection. The majority of programming is figuring out the nuances of the domain and getting something working. Code is actually an impediment to that.

I think that people often make these choices based on latent social and work-related contextual factors as much as there is any misunderstanding or blind adherence to the wrong set of guidelines. There are human factors as well that go into people defensively citing guidelines for their decisions after a criticism comes up in code review. It's somewhat likely that the choice of a premature deduplication wasn't deeply considered. The programmer just did it because they felt like it made sense, and when questioned about it, they claim whatever "rule" exists to justify it.

There's a strong pressure in work environments to get shit done, get it through review with as little fuss as possible, and move on to the next thing. That's work after all. Choosing to leave something in a state that's guaranteed to need attention later can be viewed as lazy. If you at least make an attempt to do something in a clean way, it will be perceived as "work" even if it turns out to be the non-optimal choice.

I find for myself, at least, that I'm far more willing to let things evolve in my personal projects and to discover the right abstractions over time than I am in my work projects. I don't know if that's true for everyone though. My personal projects are always about the learning process. Maybe for a lot of people that's not the case? The focus is more on the product than the process?

I write the first version of every side project in a language I don't know or don't know well and rewrite the actual version I'm going to use in a language I do know well. So I'm really happy to let things evolve, and the more I know the domain in one of my main languages, the more time I spend in the new language because I want to distance myself from what I think I know and examine it fresh when I come back to it and see if what I think I know is actually still (or was ever) valid.

Different people have different teams and motivations both at work and at home, so while I agree with you that taking a wait-and-see approach is often a really good idea, there are often human factors involved in these decisions where the same person will make very different choices depending on their environments and motivations.

I am working on a large project and I also realized this. I worked really hard to not repeat myself. I tried to reuse as much as possible as not to create extra code. Then, I quickly realized the things are NOT EXACTLY the same, and the methods to handle things were similar they were not the same. So, I had to refactor a lot of other helper methods.

I agree with the approach you are describing. It would be easier to refactor later, as one would probably have a better understanding of the application in the wild.

Loved the "incidental duplication" term. I will def start using it from now on.

But to add to this, I want to say that the moment you find your self writing a "helper" function, that's the moment when you must realize that you didn't understand the problem quite good.

As a rule of thumb and what I am always trying to pass to my peers at work etc, is to always try and think in terms of your domain and the models in it.

- What models does this helper function is acting upon or on behalf?

- Does it make more sense to create a model and put that function in it?

- Should that function be in a specific model or more that one?

> I encourage my devs to follow a rule of thumb: don’t extract repetitive code right away, try and build the feature you’re working on with the duplication in place first. Let the code go through a few evolutions and waves of change.

... alright buddy, I’m going to need you to remove your spyware on my computer!

I just did this last week with a CBOR message interpreter in C. I did it the long messy way, saw a bunch of repetition, “cleaned” it, changed the way the parser is called, changed some of the validator that runs after it... then did that entire process two more times as requirements changed.

I’m near certain you wrote that ABOUT me! ;)

I ask myself a couple of questions to decide when to abstract code to de duplicate it.

1. Is this domain logic or plumbing code?(for instance would this logic be described in a spec for the application)

2. How many times will these be duplicated?

3. How abstractable is it?

3.a. to abstract it will I have to do anything tricky like pass functions around use reflection?

3.b If I abstract this logic into a function is there a good concise name for it?

Agreed. And more importantly, (2) is waaaaaay easier to fix than the problem you outlined in your preamble.

you can also try to reason about the situation to figure out if the duplication is incidental or inherent.

I am coming to the conclusion that code reuse is over pushed as an ideal in universities. Often the desire to create reusable code means that we create overly complicated code with less duplication.

I wonder if we could have IDE features that will make #2 less likely. Eg marking some lines of code that they're similar to lines of code elsewhere. And if you then change it in one place the IDE will remind you about the others.

JetBrains has something like this in their tools. It pointed me in the right direction a couple of times when touching code that is similar to something elsewhere.


Misunderstood the question (I just realized now ...). However. The tool in JetBrains is good at finding duplicates ...

Relying on your IDE to maintain program semantics is terrible. IDE should help you write good code that you commit, not be a crutch for writing bad code.

People in the thread are arguing that it's not necessarily bad code though. Also, the point of the IDE would be to help you just not forget to change it.

Another approach with statically compiled language and good IDE might be to reduce duplication whenever you see it and inline back when you feel that this abstraction is no longer useful.

Code duplication is a suggestion of meaning identity. But can be coincidential.

Consider LZ78 (represents duplication with backreferences). Not very human readable.

I call these "spaghetti abstractions", and they're the worst kind of spaghetti.

I really like this way of looking at it.

the rule is "do the simplest thing that could possibly work"

inversion of control

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact