The Wrong Abstraction (2016) (sandimetz.com)
717 points by LopRabbit 8 months ago | 207 comments

I found the following comment very insightful in a past discussion:


I reproduce the relevant part:

Dependencies (coupling) is an important concern to address, but it's only 1 of 4 criteria that I consider and it's not the most important one. I try to optimize my code around reducing state, coupling, complexity and code, in that order. I'm willing to add increased coupling if it makes my code more stateless. I'm willing to make it more complex if it reduces coupling. And I'm willing to duplicate code if it makes the code less complex. Only if it doesn't increase state, coupling or complexity do I dedup code.

State > Coupling > Complexity > Duplication. I find that to be a very sensible ordering of concerns to keep in mind when addressing any of those.
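The last trade-off in that ordering (accepting duplication to avoid complexity) can be sketched roughly like this; the functions and formats below are invented purely for illustration:

```python
# Two near-duplicate functions, kept separate on purpose: each is
# trivial on its own and free to evolve independently.

def format_invoice(total: float) -> str:
    return f"Invoice total: ${total:.2f}"

def format_receipt(total: float) -> str:
    return f"Receipt total: ${total:.2f}"

# The "deduplicated" alternative removes the repeated line but adds a
# mode flag and a branch, and couples both call sites to one function.
def format_document(kind: str, total: float) -> str:
    label = "Invoice" if kind == "invoice" else "Receipt"
    return f"{label} total: ${total:.2f}"
```

Under the ordering above, the merged version only wins if it adds no state, coupling, or complexity, which here it arguably does.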

Interesting indeed but the part that stuck with me the most is:

>> Existing code exerts a powerful influence. Its very presence argues that it is both correct and necessary.

I read the article in 2016 and that phrase has stuck with me ever since. I had never thought about it before, but it's such a simple and self-evident fact, and so easy to miss. It's a powerful concept to know, both when writing and when refactoring code.

If you're working to get a certain task done at your job, yes I can see wanting to minimally touch the code.

If something gets bad enough, I will refactor the whole damn thing, but only at jobs where there are unit tests. If there are no unit tests, this truly becomes an impossible task and it's best not to touch anything you don't need to.

You have to have good tests if you ever want to tackle technical debt.

I've heard this called "bunny code." It doesn't matter if it's good or bad, it'll reproduce.

This too is one of the few pieces of programming wisdom that I find unforgettable.

Deleted not to please downvoters.

The point is not whether it is correct and necessary, but that its existence makes the argument that it is indeed so. What you mention is actually the insight that "it may not be". But you reach that idea _against_ the argument its presence is making.

That is, you find some code. Its presence says "I'm here for a reason", your answer of "maybe it's just because..." comes as, precisely, an _answer_ to that argument.

Neither the argument itself nor your answer is necessarily and always correct. This doesn't argue that point, just that the presence of a piece of code makes such a statement.

The article explains what is meant.

> We know that code represents effort expended, and we are very motivated to preserve the value of this effort. And, unfortunately, the sad truth is that the more complicated and incomprehensible the code, i.e. the deeper the investment in creating it, the more we feel pressure to retain it (the "sunk cost fallacy"). It's as if our unconscious tell us "Goodness, that's so confusing, it must have taken ages to get right. Surely it's really, really important. It would be a sin to let all that effort go to waste."

I think it is a mistake to call this the sunk cost fallacy. The rest of it is true, but it's not the same thing: the sunk cost fallacy is about continuing to invest because of what you've already spent, not about overvaluing the artifact itself.

I really enjoyed your insightful comment. I can't understand all those furious downvoters. Perhaps they are a bad abstraction.

I really agree with both of you.

I would guess hamandchess and trufa are very young people who are just beginning in paid software work. That would explain why they overvalue experienced programmers' blogs.

The only thing experienced programmers truly value is profiling: write it, run it, measure it.

Advice is nice and all, but at the end of the day even that shiny new way of writing code has to perform on your manager's profiling table. OO? Use it, run it against a procedural or functional approach, measure it, decide.

Everything else is politics, religion, and that one irresponsible guy who gets high on new things and somehow gets away with it while touring companies.

There is a quote by Linus Torvalds that is relevant here:

"Bad programmers worry about the code. Good programmers worry about data structures and their relationships."

"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they’ll be obvious." -- Fred Brooks, The Mythical Man Month (1975)

Yep. And this is why our industry's decision to focus on essentially procedural abstractions for our interfaces is...problematic.

Let me offer a different interpretation: That's why it doesn't matter so much if you're doing code in a procedural or functional way. If your data structures are wrong, the code will be bad, period.


To clarify: when I wrote "procedural abstractions", that included functional.

Ok, I guess I misunderstood you then. Maybe your comment was more in the spirit of "APIs should be less secretive about the shape of data they are maintaining internally?"

Oh, it goes further than that. :-)

The assumption that data is something to be maintained internally, at best hidden behind an interface (a procedural one) and at worst "exposed" is so ingrained that it's hard to think of it any other way.

However, why can't we have "data" as the interface? The WWW and REST show that we can, and that it works quite swimmingly. With a REST interface you expose a "data" interface (resources that you can GET, PUT, DELETE, POST), but these resources may be implemented procedurally internally, the client has no way of knowing.
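That "data as the interface" idea can be sketched in a few lines of Python (the class and paths here are invented for illustration, not any real framework): clients address resources by path, and cannot tell whether a value is stored or computed behind the scenes.

```python
# A toy uniform "data" interface: the only operations are get/put/delete
# on resources identified by path, mirroring REST's GET/PUT/DELETE.

class ResourceStore:
    def __init__(self):
        self._resources = {}  # implementation detail, invisible to clients

    def get(self, path: str):
        # Could equally be computed on demand; the client can't tell.
        return self._resources.get(path)

    def put(self, path: str, value) -> None:
        self._resources[path] = value

    def delete(self, path: str) -> None:
        self._resources.pop(path, None)

store = ResourceStore()
store.put("/carts/42", {"items": 3})
```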

The interface is the data. Have a look at Data Distribution Service or Eve programming (you may have seen them before). They go further than rest in that you can react to changes in the data model (rest is only half a protocol)

Yeah, I am aware of Eve.

I've also done my bit, with In-Process REST[1] and Polymorphic Identifiers[2] for the "REST" half, and Constraint Connectors[3] for the "reacting" half.

[1] https://link.springer.com/chapter/10.1007/978-1-4614-9299-3_...

[2] https://www.hpi.uni-potsdam.de/hirschfeld/publications/media...

[3] https://www.hpi.uni-potsdam.de/hirschfeld/publications/media...

Except that functional programming completely eliminates (yet still allows) concern no. 1 in the mentioned order -- state > coupling > complexity > code.

Not to mention the better expressive power for describing data structures with algebraic data types (just + and * for types really).

That's just not true. Functional programming does not eliminate state. You can't do computation without state. What FP does differently is that it pushes state around like a hot potato. In my eyes that is about as problematic as OO (where you cut state into a thousand pieces, cram it into dark corners, and hope nobody will see it).

If you make global arrays instead you will always have a wonderful idea of what your program's state is, and you can easily use and transform it with simple table iteration.

> That's just not true. Functional programming does not eliminate state.

And yet it says so in the first sentence of the Wikipedia page for functional programming: https://en.wikipedia.org/wiki/Functional_programming

>a style of building the structure and elements of computer programs—that treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data.

But I'll take it that you don't have much functional programming experience.

Of course one can still go with a big global array and keep updating it in-place. A good programmer can write Fortran (or C in that case) in any language.

At some point, you need to modify some state, otherwise your program/language is useless. And that's not me saying that, I am just quoting, or at least paraphrasing, Simon Peyton Jones:


And of course a lot of so-called "functional" programs just outsource their state management to some sort of relational database. And the people talking about their creation will praise the state-less-ness of their creation. Unironically.

What can you do? ¯\_(ツ)_/¯

Anyway, more practically, the vast majority of workloads do not have computation as their primary function. They store, load and move data around. Computers, generally, don't compute. Much. For those workloads, a paradigm that tries to eliminate the very thing that the problem domain is about, and can only get it back by jumping through hoops, is arguably not ideal.


> avoids changing-state and mutable data.

This doesn't mean that functional programming eliminates state. Avoiding changing-state and mutable data is different and the Wikipedia article is referring to how functional programming doesn't mutate existing data, so you avoid the stale reference problems that can occur in OO languages.

Instead, the state is the current execution state of the program. Function calls are side-effect free (except when interacting with the external world, which is a special case I'm not covering here). Because of this, the only way data can be communicated from one function to another is by passing it in when calling the function, or by returning it. This means the state is just the data local to the currently executing function and to any calling functions (though the data in that part of the state isn't available to the current function, it's still in memory until the calling function returns).

Contrast this with procedural programming languages, where state can also be maintained in global variables, or object oriented languages, where objects maintain their own state with the system state being spread around the whole system.
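The contrast can be sketched in a few lines (a toy example; the cart names are invented):

```python
# Functional style: state flows only through arguments and return
# values. The input tuple is never mutated.
def add_item(cart: tuple, item: str) -> tuple:
    return cart + (item,)

# Procedural counterpart: state lives in a module-level variable that
# any function may read or write, a hidden dependency for every caller.
GLOBAL_CART = []

def add_item_global(item: str) -> None:
    GLOBAL_CART.append(item)

cart = ()
cart = add_item(cart, "book")  # the new state is explicit in the call
```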

Again, you can't do computation without state. The only question is how honest you want to be about it. And whether you put things in global tables or not is completely orthogonal to whether you mutate data or make new data.

And please, no beaten up buzz words and selling pitches needed.

I looked up the full text of the book, but couldn't figure out what tables mean in this context.

The quote comes usually with "data" instead of "tables" and "algorithms" instead of "flowcharts".

Row/column data such as you would find on mainframe accounting software

I would assume database tables

I'm guessing s/tables/data/g

Yes, and sometimes I feel this also holds in a way for user manuals.

"Algorithms + Data Structures = Programs" is a classic book by Niklaus Wirth.

It was one of the first to emphasize data structures in addition to code.

There is a free pdf online[0] discussed on HN in 2010.[1]

(from wikipedia) "The Turbo Pascal compiler written by Anders Hejlsberg was largely inspired by the "Tiny Pascal" compiler in Niklaus Wirth's book."[2]

[0] http://www.ethoberon.ethz.ch/books.html

[1] https://news.ycombinator.com/item?id=1921125

[2] https://en.wikipedia.org/wiki/Algorithms_%2B_Data_Structures...

To add a little poetry, another quote: "In fact, I'll take real toilet paper over standards any day, because at least that way I won't have splinters and ink up my arse."

From Linus rants: https://www.reddit.com/r/linusrants/comments/8ou2ah/in_fact_...

I think you're just misusing that quote... At the end of the day, for any given data-structure setup, an algorithm still needs to be implemented. There are many ways to slice that pie, and depending on how you do it, that can mean a lot of saved time or misery for the engineers you work with.

Data structures and their relationships cannot express everything. SQL is not Turing complete if you don't use CTEs to introduce recursion. And I think we all agree that SQL is perfect for working on data structures and their relationships.

The original comment is on a completely different level of analysis. I think that if you know about Linus Torvalds, you will agree that he knows what SQL is and how it differs from a Turing-complete language. The point being made is much deeper and philosophical, and makes a lot of sense in complex systems.

does he know how SQL works?

I've "known" who Linus Torvalds is since I installed my first Linux distribution on my 386SX-25 when I was 14 or so, in 1995/6. I think he is very smart, but sometimes I disagree with him and with his harsh way of relating to the rest of the world. Now, after that useless appeal to authority, would you mind explaining what is wrong with my opinion about data structures and relationships? I don't really think that you can do everything just with data structures and relationships. If you think the opposite, then please explain how you can do everything with something that is not Turing complete.

> I don’t really think that you can do everything just with data structures and relationships.

No, you can't do everything with just data structures. Everyone knows this. A first-year junior programmer knows this. It's obvious. The original question did not talk about this; you misunderstood the level of analysis it was aiming at.

The fact that SQL is not turing complete is a meaningless truism here, because Linus obviously did not mean that we should all start using SQL instead of C. The point he is making is that data structures are of much bigger importance to get right in order for the program to be good. Not just fast or just maintainable, or just easy to understand. But all of those things and many others.

"SQL is perfect to work on data structures" if and only if relational tables are the only data structure that you know.

Try to look at it that way: what isn't a relational table? Any data structure you can make is essentially a tuple of primitive elements. It may point to further data items, but still. Now, put equally shaped tuples in common tables, and you have a database.

Trees, graphs?

Of course one can force anything into a relational database. The data analog of "Turing tarpit".

Ironically graph databases are way better for describing relations than relational databases.

> Trees, graphs?

Easily represented as a vertex array and an edge array. It's conventional to index the (directed) edges to optimize iteration over all edges from a given vertex. If you're being "sloppy", you can also represent edges as vector<vector<int>> (one array of destinations per source). This is more convenient but comes with the slight disadvantage that these arrays are physically separated (for example, you'll need two nested loops instead of only one to iterate over all edges).
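As a rough sketch of those two representations (the vertex labels and edges are invented for illustration):

```python
# Vertex array plus a flat edge array of (src, dst) pairs, kept sorted
# by source vertex so all out-edges of a vertex are contiguous.
vertices = ["a", "b", "c"]
edges = [(0, 1), (0, 2), (1, 2)]

# The "sloppy" adjacency-list form: one list of destinations per
# source vertex. More convenient per-vertex, but iterating *all*
# edges now needs a loop over sources and a loop over destinations.
adjacency = [[1, 2], [2], []]

def out_edges(v: int):
    # Out-neighbors of vertex v in the adjacency-list form.
    return adjacency[v]
```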

In the same way, you can force everything into a deterministic or non-deterministic Turing machine, depending on the problem. But something that just looks at data and relationships, akin to a relational database, while extremely powerful, can't solve every problem in the world. There are much better tools for that, and they have something more than just data and relationships.

Of course you can force anything into a graph database. But then you have to make special collection objects to iterate over all Foo items in your process. I guess you'll also need some kind of garbage collector.

> Ironically graph databases are way better for describing relations than relational databases.

How so?

You can force pretty much every data structure that I know of into a table. That doesn't mean that you can solve everything with a non-Turing-complete language. So, unless I'm badly mistaken, you'll need something more than data and relationships to solve everything.

The Linux kernel, Linus's lifetime project, is full of data structures and contains no SQL. Because it's completely inappropriate in that context.

I would say they are all related concerns but my order would probably be Complexity > Duplication > State > Coupling. In nearly all the refactorings I've done, reducing complexity and duplication will tend to automatically take care of the other concerns.

100% agreed on state. It is just my personal experience: keeping track of all the states takes O(n^2) of my brain power, and the same order of magnitude of tests to cover it.

Excellent quote! Saw this comment a couple of years ago, agreed strongly, and have been looking for it since. Thanks for resurfacing it.

When referring to "state" are we talking about just mutable state? Or are we talking about both mutable and immutable state as being equally undesirable? Because in my experience immutable state is fine, and often desirable, whereas mutable state is almost always toxic. I could be convinced otherwise, for sure, but it might be worthwhile to make the distinction between the two.

All state is a problem; it's something you need to keep in mind when analyzing code, because it may be used in a given computation. You need to keep it all in your head in order to comprehend what is happening.

Local state here is better than global state, especially if you consider the advice to write shorter functions - if your functions are small, so are the scopes in them, and the local state is easy to trace and memorize. Global state is not bounded, there could be hundreds of constants, enums and global objects to keep in mind.

Immutable data structures are easier to comprehend than mutable ones because there are fewer points where state is modified. If you take Redux as an example, you still need to know what is in the "immutable store" at any point in order to understand how the code uses it; Redux tries to minimize the pain by limiting changes to the store to actions/reducers and by giving you access only to part of the store (local state vs. global). However, you still need to understand what changes a sequence of actions performs on the store, so that's still state you need to be concerned with.

Not the OP, but as someone who has also found it to ring true: ideally, the less state the better, immutable or mutable. But if I had to choose which kind of state I'm managing, I'll take a bunch of immutable structures over a few mutable ones any day.

We all know the issues that arise out of mutable state: values getting changed for seemingly no reason, race conditions literally sapping every little bit of will to live you had. Mutable state doesn't scale well (at least from a complexity standpoint); now you've got to worry about locks and all that fun stuff if you try to do any kind of non-trivial concurrent programming.

Now, I'm not saying immutable data structures are literally the silver bullet, but they do almost completely solve all the above-mentioned issues with mutable state. They too, though, have their own issues. Working with immutable structures can be significantly slower, especially as the amount of state grows: any modification means you have to create a new structure and copy data, so you're also going to use a lot more memory. And that's not to mention the conceptual differences you have to adjust to if you've never worked with strictly immutable structures before ("what do you mean I can't just update this array index with arr[i] = 2?").

But in my experience, debugging can be orders of magnitude easier, and concurrency becomes something that is actually enjoyable to work with rather than a chore of mutex checking and hoping some random thread isn't going to fuck up all the data. And given the power of modern computers, the memory bloat that comes along with immutability isn't really an issue anymore.
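A tiny illustration of that copy-on-modify point, using Python tuples as a stand-in for an immutable structure:

```python
# "Updating" an immutable structure means building a new one; anyone
# still holding the old version sees it unchanged.
old = (1, 2, 3)

# old[1] = 9 would raise TypeError: tuples reject in-place assignment.
# Instead, construct a new tuple around the changed element:
new = old[:1] + (9,) + old[2:]

assert old == (1, 2, 3)   # original is untouched
assert new == (1, 9, 3)   # the "update" lives in a separate object
```

Persistent data structures (as in Erlang or Clojure) make this cheaper than a full copy via structural sharing, but the principle is the same.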

But, I’m also one of those people that thinks functional programming is the one true path, so I may be a bit biased/misinformed on some crazy mutability benefits that make the bullshit worth it.

I heard that since the future is more and more cores, instead of faster ones, functional programming is going to become increasingly necessary.

Oh definitely. While I don't think it will ever evolve into everyone using Haskell, we're definitely going to keep seeing more and more functional concepts creep into all programming languages. Hell, even king OOP, Java, is breaking down and adding some functional things last I heard, finally getting lambdas (I think), right? And I imagine that was much to the outcry of thousands of "enterprise" developers. What's going to happen when big bad functional programming comes to town and shuts down all their abstract factories?! Think of their child classes!

But I digress; honestly, I think a nice balance between concepts, using what works best for the task at hand, is the way to go. Still, I'm super excited for the future of functional languages. I love Elixir/Erlang: once you get the hang of OTP, the actor model, and functional programming, it's absolutely mind-blowing how easy it is to write highly concurrent programs that are super reliable and easy to restart from failures ("just let it die" is what they say, right?). Nothing like the headaches I experienced when I first learned concurrency with pthreads in C++.

And what's exciting is that the Erlang VM is legendary for building highly reliable, super concurrent, complex software. One of its biggest dings, however, was that it's far slower than something like C when it comes to raw computation. This is largely because of its functional nature: since its data structures are immutable, any modification results in making new copies of data, while C can just do it right in place. Now that raw computing power is becoming cheaper and faster, this is becoming much less of an issue. And the Erlang VM has things like clustering servers to run processes across several computers built right in. I don't want to imagine what it'd be like to set that up with our old friend C out of the box (though C also doesn't carry the overhead of a VM or a garbage collector, so it's not like it doesn't have plenty of advantages over Erlang; I just wouldn't want to use it to build concurrent software across multiple servers).

Java has had lambdas since 8 (we're on 11 now?). They are a great addition to the language, and streams (essentially FP) are amazing.

Also, my FactoryFactoryProviders are alive and well :)

Jose Valim goes into this when explaining why he created Elixir: https://news.ycombinator.com/item?id=17513812

Immutability is a property of a data structure. It can help prevent some unexpected errors, mainly accidental side effects, but make no mistake: you can still use immutable structures to create bad and stateful code.

Think of a function where everything is immutable but which is instead full of if/switch statements and complicated branching behavior. Even if it is deterministic, it will become intractable to reason about once it reaches a certain scale.

> When referring to "state" are we talking about just mutable state?

I don't think you should restrict yourself to thinking only about mutability and immutability in your program, but also that of the entire system. If your program is completely self-contained, that's good, but often they need to integrate with outside services and communicate over the network and write data to disk etc. Those dependencies also result in state that might affect the behaviour of your software and you need to consider it when designing and writing code.

I disagree; simpler code can be better if the library is well known. Otherwise we would never use utility libraries. Though yes, coupling indiscriminately is problematic.

I don't fully understand the state part. Could someone provide an example of what the OP is talking about? thanks!

Say you write some software that manages a shopping cart.

a) You can "store" (even if it's in-memory) just the products and their quantities. Then each time you need to display the cart you go and compute the corresponding prices, taxes, discounts, whatever.

b) You can store each cart line, whether it has discount(s) or not, as well as its taxes and the cart's global taxes and discounts and whatever else you can imagine.

Option "b)" is probably more efficient (you are not constantly recomputing stuff) but you will be better off in the long term by going with option "a)":

- Your cart management and your discount/tax computation are less coupled now (the cart doesn't really need to know anything about them)

- You have fewer opportunities for miscalculation because everything is in one "logical flow" (computeDiscounts()/computeTaxes()) instead of being scattered (some stuff is computed when you add an item, or when you remove it, or when you change the quantity, or when you specify your location, etc.). The code will most probably just be simpler with option "a)".

The article argues that you should sacrifice the performance in cases like this. I agree.
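Option "a)" can be sketched in a few lines; the price table and the 10% tax rate below are made-up assumptions, not from the thread:

```python
# The cart stores only products and quantities; every derived value
# (subtotal, tax) is recomputed on demand in one logical flow.

PRICES = {"apple": 1.50, "book": 12.00}  # illustrative price table
TAX_RATE = 0.10                          # illustrative flat tax

def cart_total(cart: dict) -> float:
    # No cached prices or taxes anywhere in the cart itself.
    subtotal = sum(PRICES[item] * qty for item, qty in cart.items())
    return round(subtotal * (1 + TAX_RATE), 2)

cart = {"apple": 2, "book": 1}
# cart_total(cart) == 16.5
```

The cart itself knows nothing about pricing, which is the decoupling point made above.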

Hah, I get where you're going with this example, but shopping carts in particular do want to keep the cart line as "local state": the desired behaviour is that once a customer has added something to their cart, that is the price they pay (within a reasonable time limit), even if there is some price flux. So probably not the best of examples.

Anyway, I myself wholeheartedly agree with the minimizing-state idea.

Yes, it is annoying when prices change in your shopping cart at time of checkout. That has happened to me more than once after keeping items there past a store's midnight.

Well, more state in code usually makes it more difficult to do things like run the code concurrently. You have to worry about managing data races when there is a lot of shared state, whereas in stateless code no complex managing is needed

Although this is true of stateful code, I think an even more fundamental, but related, reason to reduce state is this: code that is stateless always behaves the same way so it can be characterized and reused more easily than code that changes behavior depending on the state. This is the reason it is good for concurrent programming, but it also means it has a more concrete/consistent nature.


On the one hand, there can often be shared lines of code without a shared idea; these shouldn't be candidates for being factored out into some new abstraction.

On the other hand, you may want to introduce an abstraction and factor out code into a common library / dependency / framework when there's a shared well-defined concern/responsibility.

That said, on the gripping hand, I say may because even if there's the opportunity to introduce a clean future-proof abstraction, introducing it may be at the cost of coupling two things that were not previously coupled. If you've got very good automated test coverage and CI for the new common dependency and its impact upon the places it is consumed, then perhaps this is fine. If the new common dependency is consumed by different projects with different pressures for change / different rates of change / different levels of software quality then factoring out a common idea that then introduces coupling may cause more harm than good.

I am continually fighting the DRY cult at work. I explain that we need to focus on whether the code shares requirements or merely implementation. With the former, the duplicated code should evolve hand-in-hand, while with the latter the code can evolve separately. The problem is that it requires a tipping point before you split the code up again; instead, you extend the existing code to handle disparate requirements, turning it into a god function/object.

To help weed out requirements, I tell people that on their third copy/paste they might begin to consider reducing the duplication. By that point, they have both had time to think about the code and gained enough experience with it to discover what the requirements are.

Another problem with bad code reuse is code locality. Just as instruction and memory locality help improve runtime performance via caching, code locality helps improve mental processing. The further you separate related pieces of code, the better the abstraction needs to be for you to reason about the code correctly. Without a good abstraction, you have to jump between distant areas of code to figure out what your function does.

Beautifully put.

Programmers have had DRY drummed into them so hard that it is almost heretical to even consider the tradeoff of increased coupling that arises from it. Coupling is good if things should change together because they are linked in some fundamental way such that it would be wrong for them to be different. Coupling is bad when things should evolve independently and the shared link is incidental.

The problem is that it is surprisingly hard to tell the difference up front. In the moment of writing the code, the evidence for the shared abstraction seems overwhelming, but the evidence of the cost of coupling is completely absent. It exists only in a hypothetical future. Unless there is strong evidence for a shared underlying conceptual link, I often consider only the 3rd repetition of a shared piece of code evidence for the existence of a true abstraction. Two times could just be chance, three is unlikely to be so.

> In the moment of writing the code, the evidence for the shared abstraction seems overwhelming, but the evidence of the cost of coupling is completely absent

This is actually representative of a problem in the industry as a whole I think. A lot of things have short term benefits but long term drawbacks. Because of the drastic, recent growth, orgs are bottom heavy (very few people have experienced long term drawbacks of X compared to how many people who just learnt X). Additionally, because of the extremely quick turnover of people, it's even rare that people who implement X are there when X blows up in people's face. They went on to implement Y...and will be gone before Y blows up.

So most tools, libraries, frameworks and abstractions are HEAVILY optimized for the short term. Optimized for getting a project set up quick. Optimized for the initial "Hello world". Optimized to get an API and a form in seconds. Very few tools/patterns are optimized for ease of long term (hell...these days long term means a year or two) maintenance. The ones that are generally get a bad rep.

And building stuff that's both good short AND long term is very, very hard.

Having worked with some Go programs, where one of the sayings is:

  A little copying is better than a little dependency.

This holds even when the two copies share the same correct abstraction. It's different from the npm world. I've taken this and applied it in microservices written with other languages/frameworks and have no regrets. Sometimes some versions are a bit less complete or featureful, but each works fine. If a bug is discovered, it's fairly easy to find and patch them all.

It's usually the intermediate developers who like having rules to follow to know that they're doing well that tend to over-apply DRY and other principles. Only experience (aka pain over time) seems to show when to break (or just not apply) the rules.

Perhaps it's just the way things are taught/learned. Instead of just showing what's good and have them interpreted as rules, each should be shown as a rule of thumb with a concrete example of when it should not be applied. Even if they don't clearly understand the difference at the time, they'll always recall that there are exceptions and not feel so motivated to apply it in every instance.

I think it just takes a few times having to back out of an abstraction because of changes to requirements to become wary of premature abstraction.

It is not just that 3 indicates the existence of an abstraction, but seeing 3 examples improves your odds of identifying the correct abstraction to use.

This comment is more insightful than the original post.

The real problem is when engineers abhor duplication, and in order to reuse existing code, they simply call the same code from multiple places without thinking through the basis for this reuse.

Mindless deduplication is not an act of abstraction at all! This is a very important point, because a "wrong" abstraction that is conceptually sound is not that hard to evolve, and if the code is called from N places then you get to look at those places to understand how to evolve the abstraction. Improvements to one part of the code benefit N parts, and you save work.

The only other factor to keep in mind is the dependency graph and coupling, as my parent mentions.

Mindless deduplication is more common than you'd think, especially with bits of code like utility functions and React components. For example, you end up with a component called Button that has no defined scope of responsibility; it's just used in one way or another in 17 different places that render something that looks or acts sort of like a button. This is not the "wrong abstraction," it is code reuse without abstraction.

I know what you mean, but you need to find a different or more nuanced term. Deduplication is abstraction, it just isn't an abstraction mapped to the domain problem. Even a compression algorithm abstracts:

An abstraction can be seen as a compression process, mapping multiple different pieces of constituent data to a single piece of abstract data; [1]

There are wrong abstractions.

[1] https://en.wikipedia.org/wiki/Abstraction#Compression

Conceptual or semantic compression, yes, as the rest of that section makes clear. The very problem with deduplication without abstraction is not thinking at the conceptual level, only at the literal code level. There are lots of ways to compress code, e.g. minifying it :)

Quoting the start of the article: Abstraction in its main sense is a conceptual process where general rules and concepts are derived from the usage and classification of specific examples, literal ("real" or "concrete") signifiers, first principles, or other methods.

For the Button case, you'd have to come up with some concept of what a Button is and does, beyond what code lives in its file (e.g. an onClick handler that calls handleClick, etc.) in order to have an abstraction.

There are "wrong" abstractions (in the sense of abstractions that turn out to need to be changed later, like any code), but if you lump all deduplication into abstraction then you will have a skewed sense of the cost of changing an abstraction.

The cost of changing an abstraction also depends on your programming language; if you spend a lot of time in a dynamically-typed language, you may internalize that refactoring is tedious and error-prone and often intractable.

I got some flak from some other students when, while learning, I had some duplicated code in a few places. I tried to explain that while the pieces shared some common code, the code that used it didn't have the same goal and/or might change independently of the rest. So while I was sharing code in some places, I chose not to share in others, to allow each area some level of independence if/when we changed it. They were all zombies: "Barrrrrrgh look at me reusing code all efficient like!"

Granted we were all n00bs and nobody will see that code again so it wasn't a big deal... but the intent, direction, and possible future of the code seems like something that should be considered once you start sharing.

> but the intent, direction, and possible future of the code seems like something that should be considered once you start sharing.

Yes. Dare I say, intent is one of the most important things here. Two new pieces of code may be structurally identical, and yet represent completely different concepts. Such code should not be DRYed. Not just because there's a high chance you'll have to split the abstraction out later on - but also because any time two operations use a common piece of code, there's an implicit meaning conveyed here, that the common piece of code represents a concept that's shared by the two operations.
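A contrived sketch of the idea, with made-up domain names: two functions that are structurally identical today but represent distinct concepts that will change for different reasons.

```python
# Hypothetical example: sales tax and tip happen to share a formula today.

def add_sales_tax(amount):
    """Sales tax is set by law and changes when the law changes."""
    return amount * 1.15

def add_tip(amount):
    """The default tip is a product decision and changes independently."""
    return amount * 1.15

# DRYing these into one add_fifteen_percent(amount) would encode a false
# claim: that tax and tip are the same concept and must change together.
```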

People often forget "copy-on-write". Coupling doesn't have to be permanent. If you refactor to create a shared component, and then later want to modify that shared component to help one client, you can fork it -- it's no worse than never having created the shared component in the first place.

In my experience people will most likely just hack the shared component by adding awful arbitrary if-statements or other such hacks, rather than fork the shared component. This is the path of least resistance. Once this happens a few times that shared component begins to be seen as a central component and is quite a complicated mess.

Well, after enough settings are added, take a look at your components, and define a clearer 2.0 version of them.

When systems need to use newer functionality, port them to the new components.

I've had a mixed experience with this, but at some point you get the API right, and then it works.

People tend not to do this when the original already handles many cases. If half the copied code is dead on arrival, it tends not to be copied.

But often the fork happens too late, after the first few differences have been creatively shoehorned into the shared code. The resulting mess then tends to live on twice after the fork.

In the end, almost every conceptual way to slice up software can be viable if you are good at whatever you do, and terrible if not.

My jam is to wait for a few use cases before creating a new abstraction or process. I want to see how they are similar, and how they are different, in order to form a generalized solution that serves all the use cases at once. Dealing more and more with tooling for other developers, this applies especially to the tooling APIs.

I apply this to DRY, coupling, encapsulation, APIs, etc. Also, I prefer to focus on consistency and readability over most other concerns. I mentally, or physically!, note areas of code I want to improve but don't feel the time is right for just yet. During future work that touches that code I will refactor it if a solution has presented itself.

These days I prefer languages with bomber language services and tooling to make future refactoring as painless as possible (types). I prefer explicit over implicit (sorry, Ruby and Chef), and configuration over convention (looking at you, Gradle).

> when there's a shared well-defined concern/responsibility

I think a good test for this is if you can write a reasonable unit test for the code in question. If your unit tests essentially become two separate sets of tests, testing the different branches of code, it's probably the wrong abstraction. If your tests work and you've built a reasonable standalone library (even if it's not useful to anything but your exact product), that's at least a signal the abstraction is sustainable.
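A hedged illustration of that test-based signal (all names here are invented): a shared function that grew a boolean flag tends to end up with two disjoint sets of tests, one per branch, which is the smell described above.

```python
# Hypothetical shared function that grew a mode flag.
def format_total(amount, for_invoice=False):
    if for_invoice:
        return f"Total due: ${amount:.2f}"
    return f"${amount:.2f}"

# The tests split cleanly into two branches that never overlap -- a hint
# that these are really two functions (or two abstractions) wearing one name.
def test_receipt_branch():
    assert format_total(9.5) == "$9.50"

def test_invoice_branch():
    assert format_total(9.5, for_invoice=True) == "Total due: $9.50"
```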

Now we simply need a formal objective definition of "reasonable code" and the industry should never have this problem again.

Even if some shared code is currently branch-free though, it may be unlikely to remain that way if the abstraction is fragile.

A red flag is vague function names like "processThing" or "afterThingHappens". If a function can't be summarized concisely, it's probably doing too many things, and the abstraction is likely to break down later when the callers' needs diverge.
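For instance (hypothetical names), a vague `process` can usually be split into functions that each admit a one-sentence summary:

```python
# Before: one vaguely named function doing several unrelated things.
def process(order):
    order["total"] = sum(item["price"] for item in order["items"])
    order["status"] = "confirmed"
    return order

# After: each function can be summarized concisely, so callers whose
# needs diverge later can stop sharing code without surgery.
def compute_total(items):
    return sum(item["price"] for item in items)

def confirm(order):
    order["status"] = "confirmed"
    return order
```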

I'm going to copy and save this.

As a senior engineer who recently became an engineering manager I always caution my devs about abstracting too liberally. Junior engineers are particularly bad about this. They see a handful of functions that are duplicated across a few (unrelated) projects and they want to create a new repo/library and share it. Then I direct them to the Slack channel for our platform services, which has a sizable shared library across dozens of services. That shared library is a frequent source of problems.

It takes a while, but I usually beat that primal impulse out of them.

> They see a handful of functions that are duplicated across a few (unrelated) projects and they want to create a new repo/library and share it

Here is what I think happens when you do that.

You just created yet another internal API. Designing, creating and _documenting_ good APIs is _hard_. The most likely result is an undocumented, dodgy, half-finished API that doesn't fully encapsulate the thing it's supposed to deal with. So you end up with code that both uses and bypasses the API you just wrote.

If you do that and later decide that you want to move some functionality from one side of the API to the other you've just set yourself up for a hella lot of work.

The other thing is, when you want to make changes to duplicated code, you can limit the risk to the code base you're actually working on, and not a bunch of unrelated programs.

> If the new common dependency is consumed by different projects with different pressures for change / different rates of change / different levels of software quality then factoring out a common idea that then introduces coupling may cause more harm than good.

This reminds me of the recent post on HN about a company migrating from microservices back to a monolith, for this exact reason.

The biggest antipattern I encounter is factoring out an abstraction when the idea hasn't even been shared yet.

But being able to see the future, and reducing technical debt around it, is a required skill of an experienced developer.

Just because something hasn’t been shared yet isn’t good justification in my opinion, if you know it will be, especially if it’s a library/API.

I guess what I have run into is a lot of code that is aggressively, needlessly abstracted for a future that will likely never come. Abstractions that would perhaps be worth it to save hundreds of copies or permutations, while I'm looking at one.

I'm all for not repeating myself, but there is a difference between "usually avoid" and "never".

Copying and pasting in many situations would seem a breeze compared to the nest of abstractions required to avoid it.

Seeing the future? Picking the right macro strategy, programming language, database, etc ahead of time, sure, an experienced developer can usually do that. But correctly predicting the boundaries of systems, libraries, APIs—and guessing who will maintain which system and figuring out dependencies—that never goes according to plan. So in my experience YAGNI and reducing coupling are more important principles than sharing code.

I suppose it depends on your problem space.

If I'm writing an API to move a robot, my problem space is fairly bounded, and I know that someday I will want force control at some end effector. I know that there's a 6 axis robot I've been eyeing, etc.

Maybe I'm being downvoted by web devs?

The Go community has always embraced this. On the Go Proverbs page[1], it's expressed as “A little copying is better than a little dependency” — true, it's not precisely the same idea, but close.

[1] https://go-proverbs.github.io/

Hmmm...I feel like these ideas can be refactored into one idea. BRB

I experienced a crystal clear example of this about 15 years ago. I developed a prototype Perl script that read data from one folder, filtered and enhanced it, and wrote it out to another folder, where it was picked up by another process. The script was configured by another Perl file that contained the paths to read and write from. The initial (pre-production) deployment had the two folders adjacent, so the config script looked something like

    $input_folder = "/some/annoyingly/long/path/my-cool-project/input/"
    $output_folder = "/some/annoyingly/long/path/my-cool-project/output/"
At some point the script was handed off to some other dev who looked at those paths and apparently thought "that's not DRY!", and changed the code so that the config file just had

    $project_folder = "/some/annoyingly/long/path/my-cool-project/"
and actually appended the "input" and "output" in code where needed (fairly elegantly leveraging some existing keys that already defined those two strings).

The problem was that when I developed the script the actual consumer hadn't been finalized, so that output folder path was just a placeholder. When it comes time to deploy we get the actual path which is now some NFS thing like

At this point I naively go to the config to update my $output_folder variable and discover the code changes made by the other dev, which have made it impossible to separate the two folders, and because of the "elegant reuse of existing keys" made it a huge pain to even change the code back, since the assumption that the last segment of the path had to match the intended use was deeply baked in. I think I just started swearing for a week.

Too much abstraction?... Or not enough abstraction?

  $project_folder = "/whatever";
  $input_folder = $project_folder . "/input";
  $output_folder = $project_folder . "/output";

This does not solve the problem. The input and output folders had different roots in the production app.

However, that abstraction is very localized and thus easy to remove (once the new understanding has been gained), so I'd say it is better.

Now you've just obfuscated that string for no apparent reason. Nobody can grep for that literal string anymore and you gained an extra line of code.

Those paths are not the same data repeated twice just because they share common substrings. They are two paths that serve distinct purposes. The developer likely chose that syntax because it looks like a setting that can be changed. It could just as well have been read from a settings file.

In the spirit of the original article I think the point is not whether it was too much or not enough abstraction--it was the wrong abstraction. His abstraction eases one potential change (changing the project folder, which would mean changing two path strings in the original code but only one in his), but at the same time makes a whole bunch of other potential changes much harder (or even impossible) to handle. We all would have been better off if he'd just left the duplication in place.

Yeah that looks nice. An overabstracted example would be

  $project_folder = "/whatever";
  $in = "in";
  $out = "out";
  $suffix = "put";
  $directory_separator = "/";
  $input_folder = $project_folder . $directory_separator . $in . $suffix;
  $output_folder = $project_folder . $directory_separator . $out . $suffix;

    $in = "in";
    $out = "out";
Those are not really abstractions, just extracted variables. An abstraction would change the concept language.

    $suffix = "put";
This is also not an abstraction, as the language of the term "suffix" comes from its role in string concatenation, which is the same as the role of the original string literal. It doesn't change the "level of abstraction".

This isn't an over-abstraction, it's an over-extraction. Each abstraction should be non-trivial.

    $project_folder = "/whatever";
is a good abstraction. Looking at the string literal "/whatever", I cannot determine its "role", but $project_folder is a good name and changes the concept language from being about string concatenation or arbitrary names into a concept language about projects and folders.

That's why you make it a parameter.

One might contemplate the question of why people use config files in some cases and command-line parameters in others. Having a section for constants at the beginning of a script resembles a config file to some extent.

Implicit assumption that other people's code is off-limits?

Most software on my machine has probably arrived from apt-get at some point or another. Even if it's all Perl scripts, I'm not going to overwrite them directly, only to have my changes either removed on update, or blowing up the package manager's consistency checksums, or [insert random reason I can't conceive, because I'm not familiar with internals of apt]. So it's either config files, command line, environmental variables, or I'm going to build a wrapper that bends installed software to my will.

Dijkstra also beautifully sums up this concept in his 1972 ACM Turing Lecture [1]:

> The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise

[1] https://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340...

Just copy-paste code and add a comment pointing to the source (starting with TODO helps you find those later).

It will become obvious which duplicated code to abstract when you find yourself changing all/many at the same time or fearing you’ll forget/break something if you don’t change all instances. Writing tests is also a good motivator as it means even more code per duplication (and reduces the fear of breaking something).

It really takes a lot of duplication for this to get out of hand. Wait til it happens, you’re a software engineer you’ll figure out how to get rid of duplicated code just fine. Coming up with a great abstraction is extremely difficult before seeing at least a few examples.

I recall when ESR tried to school Linus Torvalds about the "curse of the gifted" and claimed that Linux would collapse under the complexity of device driver code duplication. Linus didn't care, because he was more concerned about introducing code dependencies that would block developers and make them less productive than about some duplicated code in drivers maintained by different people.

  Date:	Tue, 22 Aug 2000 16:00:52 -0400
  From:	"Eric S. Raymond" <esr@thyrsus.com>
  To:	Linus Torvalds <torvalds@transmeta.com>

  Linus Torvalds <torvalds@transmeta.com>:
  > On Tue, 22 Aug 2000, Eric S. Raymond wrote:
  > >
  > > Linus Torvalds <torvalds@transmeta.com>:
  > > > But the "common code helps" thing is WRONG. Face it. It can hurt. A lot.
  > > > And people shouldn't think it is the God of CS.
  > > 
  > > I think you're mistaken about this. 
  > I'll give you a rule of thumb, and I can back it up with historical fact.
  > You can back up yours with nada.

  Yes, if twenty-seven years of engineering experience with complex
  software in over fourteen languages and across a dozen operating
  systems at every level from kernel out to applications is nada :-).
  Now you listen to grandpa for a few minutes.  He may be an old fart,
  but he was programming when you were in diapers and he's learned a few
More here:


https://news.ycombinator.com/item?id=11077799 (2016 discussion)

I still regularly fall for this trap. My instinct to "DRY" is so ingrained into me that I always find myself deduping similar looking blocks of code. I think this culture is a knee jerk reaction to dealing with code bases on the opposite end of the spectrum where entire classes are "copy and pasted" with only a single change. I've had the misfortune of dealing with these kinds of projects.

I now try to find the middle ground by remembering to "do the simple thing" even if it appears less elegant. This makes it easier to refactor in the future (if required) at which point more information will be available to design a more appropriate abstraction than would have been possible before.

However, let's not go too far: I think DRY is a good default for new programmers. They can learn to break that rule as they gain experience.

One of my coworkers won’t be happy until we are all living in Death Valley. Almost nobody can follow his code, and given that we aren’t trying to save the world, pretty much nobody tries anymore.

So now he’s indispensable, a situation I assiduously avoid (you can’t work on the cool stuff when you have to be there to maintain the old thing).

Removing duplication != abstraction.

Many copy & paste scenarios can be avoided without creating any meaningful abstraction. Generally this is best done with a stateless (pure) function that has few if any conditionals and does not involve design patterns such as inheritance, overriding, or even creating new compositions. It should feel easy and boring when you are doing this.
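A minimal sketch of what that boring deduplication looks like (the names are illustrative): a stateless helper extracted from two call sites, with no inheritance, no design pattern, and no claim that the callers share a domain concept.

```python
# Copy-pasted clamping logic from two call sites, extracted into one
# pure function. It should feel easy and boring, as described above.
def clamp(value, lo, hi):
    return max(lo, min(hi, value))

def set_volume(level):
    return clamp(level, 0, 100)

def set_brightness(level):
    return clamp(level, 0, 255)
```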

Abstraction is a new representation that calls for deep thought and I agree is easy to mess up.

The need for DRY also depends on type safety. Type-safe boilerplate generally adds verbosity, but few bugs. This article is coming from the Ruby world, where a bug lurks behind every untested line of code: this can create a lot of pressure to write boilerplate-free code, which can end up turning into abstractions. But every code change (without code coverage) in dynamic languages is also an enormous risk. In type-safe languages the compiler can help ensure that the process of removing duplication is correct.

This is why the article says "wrong" abstraction. Replacing duplication with abstraction often leads to the wrong abstraction. That's the point of the article.

That being said, good thoughts on solving duplication correctly.

I agree. I suspect the author is coming from a purely-oop background, where making a subclass/trait is a very costly and clumsy way to reuse logic.

In a functional paradigm, even the most trivial of abstraction pays off handsomely. This is exactly what "map" is for example.

Trivial doesn't mean it wasn't well thought out. On the contrary, the more trivial the abstraction, the more likely it's general, since it has fewer constraints.

Conversationally I would say that calling a sequence of functions was composing them.

I’m pretty sure you’re talking about composition in heavyweight enterprisey terms, but I think that is a pretty fine line.

> When dealing with the wrong abstraction, the fastest way forward is back. ... Re-introduce duplication by inlining the abstracted code back into every caller.

This can be tricky if the person who wrote the abstraction takes it the wrong way. At my previous company, I've been yelled at for doing this. Some developers get emotional about their code, in which case undoing their abstraction causes offense. How do you get around this?

I find that the best approach is to talk to them first, explain your use case, ask them how to solve the problem.

Most of the time you will either be welcomed for your proper deference, or find out that there's a jealous guardian of that code and no matter what you do it won't alleviate that tension anyway.

Yep, always talk to the person who made a WTF-inducing change. It's even possible they've thought of something you didn't.

You don't need to undo the abstraction, just opt out: Fork the shared artifact and modify or rewrite the copy (prune useless parts) as you need.

Opportunity for leadership. Explain to them that you are not your code. That code is never done, and like writing, it can improve with each draft. Also, what specific behavior of theirs makes you feel like they are yelling (raised voice, standing close, interrupting). People aren't perfect, but we can help each other get closer.

What I go to war against lately is abstractions that are only there to save people some typing. I strongly believe abstractions are there to abstract -complexity-, not to save you some typing (if the bottleneck doing your job is your typing speed, ask your boss to give you harder problems, hehe).

Being able to evolve 2 pieces of code separately is powerful, and realistically is a more common case than wanting a change to pop up in 2 different places.

There's another angle to this which I think is quite important. 4 years back, I joined a team where the tech lead would always warn developers about the abstraction problem described here.

Very regularly I'd hear "Code duplication is fine, do not use an abstraction here". What he meant was "In this case the abstraction might be incorrect. Use abstractions only when they actually relate to business logic, not just because two pieces of code happen to look the same". Unfortunately, while that was very obvious to him, to developers new to the team (like me) it sounded like "Do NOT use abstractions, they are evil". Over the years I developed a habit of never thinking about abstractions because they were evil. I duplicated code that should have been abstracted, and today we pay the maintenance cost for that.

tl;dr: Experienced folks, be careful when you caution your peers against abstractions. Be very explicit and assertive that they _CAN_ be used correctly and one shouldn't avoid them.

This is good advice. I've come to the conclusion that, when suggesting almost anything to do with development, I really need to be at pains to prevent it being received as though the answer is always either Black or White at all times.

It's hard to get across that the answer is usually one of the Greys and, even then, the shade will probably vary a little from time to time.

Rules are the children of Principles. They're important handrails as you're learning, but to progress from there you have to understand the Principles behind them and how they confirm or contradict each other.

Agreed. To me, the real point is not to be afraid to undo an abstraction that has proven to be more trouble than it's worth. DRY is still a good default mode of thinking.

When in doubt, I use this rule of thumb:

- It is ok to have the code duplicated twice. Add a comment to track the conscious decision.

- But when I find myself doing it a 3rd time, it is time to think about whether I can factor it out.

Three use cases tell me more than two about how things can be abstracted and if it makes sense.

I think the wording of the second part is crucial. Thinking if you can factor it out is a way different beast than "you must refactor to obey DRY". Implied in your statement is whether or not the abstraction can be made simply, efficiently, cleanly, etc. which all serve to make it a good abstraction.

If your only priority is removing duplication at all costs, you'll end up with worse code than if you just let code be duplicated.

But the whole point of this discussion is that you should probably duplicate as many times as needed if it helps to avoid wrong abstraction.

No. The point of the article is that it's ok to undo an abstraction that has done more harm than good. Abstraction and DRY is still a good default mode of thinking.

The question is whether updates can be made against both copies of the duplicate, requiring synchronization. It's the difference between a replicated slave copy of a database and a real distributed one. The first is easy. The second is very hard.

I'd argue if you find yourself needing to do this, it actually might be a hint that abstracting is appropriate, as it's proof there is more than superficial commonality. Like so many things, I don't think there's a way to make any hard and fast rules, and figuring this out is more art/experience/taste than science.

Not including cross-cutting concerns that modify all usages (eg, changing your logging or dependency injection library).

The "wrong" (less than optimal) abstraction might be better as long as it's rigorously documented, which is rare, of course. In the right hands, it can be an important stepping stone to a better abstraction. I'm dreaming already.

Duplication has its own set of dangers, leading down the road to a verbose mess of convoluted crap code in most cases.

Most established code bases make me want to puke before long.

The older I get the more I appreciate XP. The Rule of Three in particular becomes a bigger presence in my life as time goes on.

With two copies of the code you can’t be sure if the similarities are factual or coincidental. At three the situation begins to crystallize quite rapidly.

After all the hemming and hawing about this, I'd really like to see a real case study on it. It's never been an issue for me. I've had functions where I added parameters later on for new use cases to fit the function to them, yes, but I never felt like I was suffering for it. Maybe DRY isn't so bad after all.

I agree with this so much. The way I've put it before is, the difference between under-engineering and over-engineering is that you can fix under-engineering.

Personally I don't think that's entirely true. Both under- and over-engineering can be hard to fix.

What definitely does make things easier is simply having less code to fix. Although measuring the 'amount' of code is at least somewhat subjective.

Yes! I think what I've come to is: Spend more time thinking about how to do something more concisely, rather than spending time thinking of how to abstract the code. Sometimes the former leads to the latter. But less code--or rather more concise code (not to be misrepresented as "fewer characters" or "everything in one line"!)--is almost always better for maintaining the code into the future.

Under-engineering makes it hard to fix common issues because there is no common code to fix. You can find yourself scouring over thousands of lines of code never sure that you found all the places you need to find. This is why DRY is a solid principle.

I think there is always the problem that programmers want to apply all these principles, including DRY, without actually thinking about it. Once it's applied illogically you are in strange and awful territory.

You can fix under-engineering only when it is under control. Too much under-engineered code can lead to a bloated codebase that quickly becomes unmanageable.

I prefer disposable code over reusable code [0], i.e. code that's easy to delete instead of easy to extend [1].

The nature of software development is change. Extensive use of abstractions can make your code base rigid and averse to change.

[0] https://bjoernkw.com/2016/11/06/writing-disposable-code-not-...

[1] https://programmingisterrible.com/post/139222674273/write-co...

I love that post from @tef_ebooks. I actually come to HN/Reddit every day hoping to find another post like that one, but it only happens once or twice a year.

Eric Evans maybe said it best: code should be rewritten again and again as new insights come from the domain, to get closer and closer to the real thing. Every time we write the code we gain new insight, so the original should essentially be rewritten closer to the real thing with those new insights. I've been taking this approach coupled with DDD, where the aggregate root's pure domain logic is relatively easy to actually discard and recreate.

I can't find the snippet but it's somewhere in the "DDD is not for perfectionists" vein

While unexpected features can definitely complicate something that was merged into an abstraction, bug “fixes” also matter and they can be worse:

- Programmers may look at a bug in a simple shared function and conclude that it “obviously” should be fixed, and do so quickly without really understanding what else could go wrong. (As a completely contrived example: You “fix” something that previously couldn’t return a negative value, and move on; turns out this “fix” allows a bug somewhere else to crop up, a catastrophic improper cast from signed to unsigned, blowing up your -1 into an iteration over billions of expensive operations.)

- Bug priority levels vary between features, even if code is shared. Your abstraction may make it effectively impossible to fix just one high-priority feature, if your deployment is (say) set up to run hours or days of regression tests on all affected parts. Generally, the more segregated things actually are, the easier it is to set priorities well.
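The contrived signed-to-unsigned blow-up from the first point can be sketched concretely. A minimal Python illustration (function names are hypothetical), using ctypes to mimic the C-style cast:

```python
import ctypes

# A shared helper that previously could never return a negative value.
# A quick "fix" elsewhere now lets it return -1 as an error sentinel...
def item_count():
    return -1  # the innocent-looking "fix"

# ...and a distant caller reinterprets the result as unsigned, turning
# -1 into ~4.3 billion: the catastrophic iteration described above.
n = ctypes.c_uint32(item_count()).value
print(n)  # 4294967295
```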

Just because something is duplicated doesn’t mean that it’s that way forever, either. At a good branch point, such as a new project, you can aggressively prune out things that won’t apply to that branch even if they helped keep things stable on the previous project.

My own read in this situation is that this problem is almost always less of an issue in code that is very explicit about what it’s doing. A function that uses nouns and verbs with very precise meanings survives these changes better than wishy-washy code.

“Generic” functions make it difficult to find all uses or even understand what scenarios they belong to. With bland say-nothing nouns and meaningless verbs like execute() or process() that appear everywhere in the code, you’re just crossing your fingers and hoping for the best.

I've done exactly this, where I'm just extending my code in an ugly way. Two alternative patterns to extending a function with a parameter switch are to pass a function into the function, or to split the function into multiple functions around the bit that's different.

Definitely a lot harder to fix when you've done this though.
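Both alternatives can be sketched in a few lines (all names here are hypothetical):

```python
# Option 1: instead of a boolean switch, pass the varying behavior in.
def save_record(record, validate):
    validate(record)
    return dict(record, saved=True)

def validate_user(record):
    assert "email" in record

def validate_order(record):
    assert "items" in record

# Option 2: split into separate functions around the differing bit,
# keeping only the genuinely common core shared.
def save_user(record):
    validate_user(record)
    return dict(record, saved=True)

def save_order(record):
    validate_order(record)
    return dict(record, saved=True)
```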

I think "the wrong abstraction" is too lofty a title; this is just about oversized functions.

The "wrong" abstraction is still a nexus of control and understanding - a point you know the code will return to under certain circumstances

This is not to say that if you roll back the code and then commit to finding and implementing new abstractions you won't win - but the second part is necessary, or you are just digging a deeper hole.

Think of it as a wrapped transaction - you have to do the second part as well.


I think this sometimes gets used as an excuse for people who don't want to deal with another layer of indirection added by an abstraction. The gist of this touchstone blog post is that we need to be willing to abandon our abstractions as soon as we start special casing them. It's not wrong to abstract, it's wrong to cling to an abstraction that is broken.

I can't agree with this beyond the high-level sentiment. Perhaps the example is just not good. In the listed example the error clearly lies with programmer B who decided to extend an abstraction with a branch. That's not extending an abstraction - that's squashing 2 different abstractions into one function!

Which is the point of the article!

but the article argues against creating the abstraction when the functionality was the same in both places, which I disagree with. To be honest, the whole topic disregards whether the two pieces of functionality are actually doing the same thing in a domain-logic sense or simply a code sense, which is the heart of the issue. If the similarity is coincidental (e.g. validation on two model types temporarily happens to be the same) then the abstraction should not occur. If it is not (e.g. two distinct operations on a model type happen to have an identical subroutine) then the abstraction should occur.

It is easier to introduce the right abstraction later. The article argues that if the premature abstraction did not exist, programmer B would create the right abstraction instead of hacking on top of the existing one. This becomes more obvious when the application is in maintenance and everyone is risk averse.

I really like Sandi Metz as a speaker, and am pretty sure she has thought about these problems much deeper than I have. However, I often find myself wondering if most of the problems she is trying to solve would go away if programmers spent more time thinking about their code.

The case filled code that she is describing seems to be a result of the programmers not fully grasping the purpose of the code, and being unable to tell if the current abstraction is fitting. I understand that deadlines and the sunk cost fallacy play a factor, but, at least for me, finding the right abstractions / architecture is most satisfying part of coding! Shouldn’t that be what these programmers are focusing on in the first place?

I think people tend to focus on specific symptoms of bad code (duplication, in this case) without thinking about what makes good code.

Ideally, we'd like our code to be:

- Mutable (i.e. easy to modify)

- Understandable

- Good at doing what it's supposed to do.

- Other stuff that I'm forgetting.

The general recommendation against duplicate code is intended to promote mutability (by avoiding multiple implementations that need to be changed). If you apply it blindly without keeping mutability in mind, you can get situations like the one the author describes.

I see some of the same myopia when people talk about testing. Testing is there to ensure that your code is correct, and that it's easy to make changes without affecting correctness. As soon as you find yourself writing tests that aren't for those two reasons, consider whether it's worth the effort.

What if the programming language could choose the right level of abstraction for you automatically? For instance, a language like Rust forces you to structure your code so as to avoid race conditions etc.; it's extra effort, but once you've done it, it should work. What if a language forced you to make the right abstractions or else it won't compile? We have self-balancing trees; couldn't we have a self-balancing programming language? I keep thinking about some way of programming which is more visual, perhaps involving graphs, where the problems of over/under-abstraction would be more obvious and the graph could somehow balance/normalize itself.

Just off to prototype this now, should have it done by the end of the day... :)

You would need a working AGI (Artificial General Intelligence) for that to work.

And at that point, you don't really need to worry about compilers, you just have AGI looking at the code.

What you are proposing requires a technological miracle to implement. That's why it doesn't make sense. When we can do miracles, we will obviously use them in the mentioned and in many other areas. The problem is to create AGI.

I don't get this post and I don't get all the comments in support of copy pasting code, and against DRY. I am going to need a real life example of when copy pasting is a good idea, because I've never seen it. Giving some 'shared code' a name really doesn't seem like it's a dangerous path.

Programmer B feels honor-bound to retain the existing abstraction, but since it isn't exactly the same for every case, they alter the code to take a parameter, and then add logic to conditionally do the right thing based on the value of that parameter.

Programmer B's poor decision doesn't mean you should reach for ctrl-v, in my humble opinion. But I'm willing to change my mind, if there's a compelling case.

I had this in mind recently when doing the Cryptopals challenge, which requires you to implement the Counter mode of block encryption[1]. In that case, encryption is the same mathematical operation as decryption. I still figured I should have two different functions for encryption vs decryption to make it more obvious which variables are intended to hold a plaintext vs ciphertext.

[1] https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation...
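A minimal sketch of that naming approach (a toy hash-based keystream stands in for a real block cipher here, and all names are hypothetical, but the shape is the same: one shared operation, two intention-revealing wrappers):

```python
import hashlib

def _ctr_keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Toy CTR-style transform: derive one keystream block per counter
    # value and XOR it with the data. (A real implementation would
    # encrypt the counter with a block cipher such as AES.)
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = offset // 32
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)

# Same operation, two names: the names document which direction the
# caller thinks it is going, and which variables hold what.
def ctr_encrypt(key, nonce, plaintext):
    return _ctr_keystream_xor(key, nonce, plaintext)

def ctr_decrypt(key, nonce, ciphertext):
    return _ctr_keystream_xor(key, nonce, ciphertext)
```

The duplication is one line per wrapper, and in exchange each call site says what it means.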

Off topic, but I'd highly recommend that Cryptopals set of challenges for anyone that likes a coding challenge and wants to dip their hairy toes in encryption. The Eudyptula one is also cool; I think last time it came up there was a list of some similar ones.

I find that a particularly good way of learning (at least over plain reading/coding).

Depends on language, but if the type system supports encrypted and plaintext types/traits, this is the way to go.

Good point, I didn't think of that! That would be a better way to do it! (At least in some respects.)
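In Python, for instance, `typing.NewType` can give plaintext and ciphertext distinct static types over the same runtime `bytes` (a sketch under that assumption; names hypothetical, checked by a tool like mypy rather than at runtime):

```python
from typing import NewType

# Distinct static types over the same runtime representation: a type
# checker will flag a Ciphertext passed where a Plaintext is expected,
# even though both are plain bytes when the program runs.
Plaintext = NewType("Plaintext", bytes)
Ciphertext = NewType("Ciphertext", bytes)

def _xor_with_keystream(data: bytes, keystream: bytes) -> bytes:
    # The one shared mathematical operation.
    return bytes(a ^ b for a, b in zip(data, keystream))

def encrypt(pt: Plaintext, keystream: bytes) -> Ciphertext:
    return Ciphertext(_xor_with_keystream(pt, keystream))

def decrypt(ct: Ciphertext, keystream: bytes) -> Plaintext:
    return Plaintext(_xor_with_keystream(ct, keystream))
```

In languages with nominal types (Rust newtypes, Haskell `newtype`) the mix-up becomes a hard compile error rather than a lint.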

Good idea in general. Not sure I agree with the idea of inlining a bunch of code in order to refactor an abstraction.

I've always found the easiest way to refactor is to get really good code coverage on the outermost layer of code that uses the abstraction, remove/ignore unit tests if there are any on the abstraction itself, then refactor the abstraction with vigor until it feels appropriate and hopefully elegant or is removed if necessary. As long as all your tests still pass, you should be good to go.

I usually end up with quite a bit of duplication before I even think about creating any type of abstractions.

There's no point trying to think about abstractions before you know what the problem is.

There’s also a solid YAGNI argument against abstraction, especially when circumstances or requirements can change.

You don’t yet know what abstraction you need or what extensibility or generalizability you need, and prematurely extending in these directions can either paint you into a corner where you have to do terrible things to avoid throwing things away and rework, or else you have to bite the bullet and do a bunch of rework.

There can be a lot of benefits to just duplicating things like config, occasional pieces of function bodies, classes with modified functionality, or even using whole projects as templates, like quickly getting a collection of inter-related web services going with copy/paste code and factoring out common code later.

Watched the talk a few years ago. Several times. It really hit home. Before, I would try to introduce abstractions as soon as I had one instance of duplication. After, my code has more duplications, less nesting, less abstractions. It's easier for a newbie to understand (or myself down the line), it's easier to delete and modify. I've shared the mantra "duplication is better than the wrong abstraction" with colleagues on many occasions.

Imagine in 6 months you have survived a traumatic brain injury and you're required to maintain the awful crap that younger/smarter you thought, at the time, was clever.

Abstraction is bad and is the price you pay for being able to move lots of things around at once. WET (Write Everything Twice) > DRY (Don't Repeat Yourself) because you might be able to grok what the heck you meant at the time. It kills me that my colleague wants to be clever. NO! Clever is bad!

This just circles the problem, I think. It is a symptom of programmers not taking responsibility for their own code. To cover your ass in the best possible way you have to write as little code as possible, and that is what happens when a person sees a function that sort of does what they want. For some reason we think this is deduplicating code and that it is good. But at least I have never been taught this in school or by anyone at work. We should treat it for what it is: an antipattern used by programmers to avoid taking responsibility.

Another side to this coin is the "enablers" who sometimes create unnecessary abstractions. These are people who play at being library designers when they are supposed to be developing applications as fast and safely as possible - the people who create complex tools and unnecessary integrations inside the application code they write.

Now, we are all risk averse and enablers to different degrees and at different times. To counteract these antipatterns I think it is important that we teach a few things: that functions/methods should have a single purpose, and that if you have a boolean argument and an if, you should refactor. Rabbit theory of code. And at a higher level: always keep the focus on the business requirement, do cost-benefit analysis (lightweight, in our heads), and take time to learn the domain and tools you are working with. This also requires experience. But if we focus more on teaching people these things, I think it may improve matters.

In a similar vein, see "Premature Flexibilization is the Root of Whatever Evil is Left":


Shameless plug: Building White-Box Abstractions by Program Refinement (https://mehrdad.afshari.me/publications/building-white-box-a...)

I'm currently working with a driver developer. He doesn't have access to simple collections. (Everything is a linked list.)

Every single lookup was a copy & paste while loop with business logic inside the loop, and then a break statement.

This is a textbook example of when not to copy and paste.

I use the "dueling sins" model of "copying versus coupling." Early lifecycle code benefits from the flexibility of copying. Mature code benefits from the coupling forces of abstraction. Both have the capacity to do harm.

Do you follow Jim Highsmith at all? His philosophy is that there are no answers. That we are minmaxing a bunch of competing criteria and trying to do the best we can.

In his words: we are a solving problems, we are resolving paradoxes.

Too late to fix typo: we are not solving problems, we are resolving paradoxes

Personally, I'd still err on the side of creating or thinking about an abstraction rather than duplicating code. While creating the wrong abstraction is a problem, it should be a deliberate choice, and duplication should be picked only in the rarest of cases, or where the logic is trivial. If a future requirement renders the current abstraction wrong, it should be refactored to fix the abstraction. If folks are adding edge cases to the current abstraction instead, that is a culture issue that must be fixed. And you should always have good test coverage anyway. This article makes an assumption that future developers are lazy or incompetent and will not fix the abstraction, and I think that we should strive for a culture where such laziness is not tolerated, instead of living with duplicate logic everywhere.

This is analogous to confirmation bias. Existing code is the current narrative. New requirements are like new evidence. With each piece of new evidence, you must reconsider the narrative to fit all the evidence.

When you find yourself at his step 8 you don’t necessarily have to go back and fix all the sins of the past. There’s a cost to this which may or may not make sense to pay. You could simply not use the bad abstraction.

It should also be pointed out that when you find yourself at steps 6 and 7 you don't have to sin, and when you find yourself at step 1, you can obey more complicated heuristics than DRY, like "I will apply the rule of three if the duplicated code is pretty short and not inherently self-contained" or "if this big method looks like it will change, instead of fully abstracting it, I'll just break out the bits that look like neat little functions".

You only find yourself at step 8 after a suite of bad decisions, and possibly even bad decisions that you signed off on during code review.

Isn't that very close to the suggested solution - which is inlining the correct "duplicate" code that's necessary to solve the problem?


Premature abstraction. Especially endemic to large enterprise projects.

One of the easiest ways to fix this is with lambda functions or callbacks. Each caller passes a lambda function or call back that is the specific case that's unique to the caller.
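A minimal sketch of that callback style (all names hypothetical): the shared function keeps the common workflow, and each caller passes in only the step that is unique to it.

```python
# Shared workflow; the caller-specific part comes in as a function.
def process_order(items, apply_discount):
    subtotal = sum(item["price"] * item["qty"] for item in items)
    return apply_discount(subtotal)

cart = [{"price": 10.0, "qty": 2}, {"price": 5.0, "qty": 1}]

# Each call site owns its special case, instead of a flag plus a
# branch living inside process_order.
retail_total = process_order(cart, lambda s: s)        # no discount
member_total = process_order(cart, lambda s: s * 0.9)  # 10% off
```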

In the case they're describing this should almost surely not be a lambda function. Those should be reserved for when an arbitrary function makes sense, but it's unlikely that someone would have abstracted a function that could have done anything.

It's much more likely that the code in question is responsible for one particular thing, and switches between several different ways to achieve some sub-goal. Those parts should be lifted into some kind of interface, where the different variations are lifted into different implementations of that interface. A lambda function is the most general interface possible, so it's probably not the best choice, you'd eventually end up with callbacks calling each other without it being entirely clear which callback does what.

> A lambda function is the most general interface possible

A typed lambda function (i.e., with a defined arguments + return signature) is exactly as specific as any other typed interface. An arbitrary lambda function isn't, sure, but there are few languages where a static interface and an arbitrary lambda function are both available tools.

No, when each different way is unique to the caller, it needs to be defined at the caller.

This is what lambdas are for. Going to interfaces just adds needless syntax sugar.

...This way lies callback hell.

I am seriously going to start doing this. Great suggestion and the authority and insight with which it was presented gives me a lot of confidence it'll work out. Thanks

Nature obviously agrees. Look at the structure of the genome with all its gene duplication with minor variation. Copy-paste-hack is one of the primary mechanisms of evolution.

Nature takes billions of years though, as it has to operate in blind watchmaker mode. We don't have that luxury. Not often, anyway.

One problem with duplication is that it's tempting to copy the code you need and replace the appropriate parameters or values. It's pretty easy to mess that up.

Sometimes it's good to let some duplication proliferate before trying to condense it. You don't always know which way it is headed.

The real problem is that very few people know how to tell which abstractions are "wrong".

I also like what Rob Pike used to say: "A little copying is better than a little dependency."

When optimizing for complexity, this is a reasonable argument. What about optimizing for performance? It might be cheaper to make a network call in one place and have multiple consumers each use the result, even if each of them needs to use the result in slightly different ways.
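One hedged way to get both (a sketch; all names hypothetical, and the fetch below stands in for an expensive network call): make the call once, memoized, and let each consumer reshape the shared result in its own way.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fetch_user_record(user_id):
    # Imagine an HTTP request here; memoization means every consumer
    # shares one result per user_id instead of re-fetching.
    return {"id": user_id, "name": "Ada", "email": "ada@example.com"}

# Consumers stay decoupled: each adapts the shared result locally,
# rather than the fetch growing flags for every caller's needs.
def display_name(user_id):
    return fetch_user_record(user_id)["name"]

def contact_line(user_id):
    rec = fetch_user_record(user_id)
    return f'{rec["name"]} <{rec["email"]}>'
```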

Premature abstraction...

About half way through I realized this was true of mythology.

TL;DR: don’t put switches in functions.

No, just don't switch on a new kludgy ad hoc argument:


int countlines(file afile, bool hasfuckeduplineendings, bool needtobusywaituntilreadswillsucceed)

My preferred solution, and I don't claim that this is correct, is just to put a global variable

bool nexthasfuckeduplineendings = false; //set to true before counting lines in a file that needs preprocessing for fucked up line endings

bool needtobusywaituntilreadswillsucceed = false; // this is a hack. Certain specific files will just fail to read for an unspecified period of time, they will fail and fail and fail and then succeed. For cases that we know this will happen, set this to true.

See how awful and fucked up this is?

It is "obvious" that this hack is just so wrong.

But is it really? It's clear, gets shit done, and is super transparent about how wrong it is.

Should every reader hang in a busy loop?

Should every reader preprocess line endings?

Maybe "no" and "no".

What do you all think?

I think you should copy/paste the whole function, change one of the copies to suit your needs, and then factor out any subroutines the functions have in common.

This is what OP recommends too.
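Applied to the countlines example above, that copy/specialize/factor sequence might end up looking something like this (hypothetical names, toy line-counting logic):

```python
# The common subroutine both variants share, factored out after the
# copies stabilized.
def _count_newlines(text):
    return text.count("\n")

# The plain case stays plain.
def countlines(text):
    return _count_newlines(text)

# The specialized copy keeps its quirk local, instead of a boolean
# flag threaded through the shared function.
def countlines_normalizing_endings(text):
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return _count_newlines(normalized)
```

Each variant is independently readable, and the shared core is still deduplicated.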

That works okay for the first bool, but for the second duplication wouldn't there be 4 versions already?

You can have 1000 versions. As long as the function is mostly free of side effects, except whatever side effects are documented in a public interface, then you can scale the repository linearly without any real increase in complexity.

This is because while the namespace is wide, in practice you work within a “working set” of your daily use packages.

What happens when Rails and Javascript are the wrong abstraction?

