An Experiment on Code Structure (pboyd.io)
84 points by danielepolencic on Nov 10, 2019 | 59 comments



According to GitHub, the totals are:

backendA: 11 files, 1 directory, 799 lines (676 sloc), 23.56KB

backendB: 23 files, 5 directories, 1578 lines (1306 sloc), 42.26KB

It's approximately twice as big for the same functionality, and I had to spend a lot more time "digging" through the second one to get an overall idea of how everything works. Jumping around between lots of tiny files is a big waste of time and overhead, and it's one of my pet peeves with how a lot of "modern" software is organised. If you believe that the number of bugs is directly proportional to the number of lines of code, i.e. "less code, fewer bugs", then backendA is far superior.

> backendB required a bit more work

I'm not surprised that it did. This experiment reminds me of the "enterprise Hello World" parodies, and although backendB isn't quite as extreme, it has some indications of going in that direction. The excessive bureaucracy of Enterprise Java (and to a lesser extent, C#) leads to even simple changes requiring lots of "threading the data" through many layers. I've worked with codebases like that before, many years ago, and don't ever wish to do it again.

I really don't get this fetish for lots of tiny files and nested directories, which seems to be a recent trend; "maintainability" is often dogmatically quoted as the reason, but when it comes time to actually do something to the code, I much prefer a few larger files in a flat structure, where I can scroll through and search, instead of jumping around lots of tiny files nested several directories deep. It might look simpler at the micro level if each file is tiny, or the functions in them are also very short, but all that means is the complexity of the system has increased at the macro level and largely become hidden in the interaction of the parts.


> I really don't get this fetish for lots of tiny files and nested directories, which seems to be a recent trend;

I suspect it is the same kind of thinking that says all functions should be very small (without reference to whether each function provides a single meaningful behaviour). Locally, this keeps things relatively simple, but it ignores the global issue that now there are potentially many more connections to follow around and everything becomes less cohesive. As far as I’m aware, such research as we have available on this still tends to show worse results (in particular, higher bug frequencies) in very short and very long functions, but that doesn’t stop a lot of people from making an intuitive argument for keeping individual elements very small.

A similar issue comes up in designing APIs: do you go for minimal but complete, or do you also provide extra help in common cases even if it is technically redundant? The former is “cleaner”, but in practice the latter is often easier to use for those writing a client for that API. Smaller isn’t automatically better.


The book "A Philosophy of Software Design" should interest you then: https://www.amazon.com/t/dp/1732102201 It argues, among other things, that deep interfaces matter more than code complexity inside a module.


That was the first software book I'd read in a while where I got to the end and felt that, if I wrote a book myself, it would say something very close to this. I highly recommend it to anyone who has built up a bit of practical programming experience and wants to improve further.


> The excessive bureaucracy of Enterprise Java (and to a lesser extent, C#) leads to even simple changes requiring lots of "threading the data" through many layers. I've worked with codebases like that before, many years ago, and don't ever wish to do it again.

Yeah I tend to like something like a semantic compression approach: I'll start in a single file, and then split it into separate files organized by domain as the length of the file starts to get unwieldy. And so on into more files and later subdirectories as the program grows.

In my opinion it's much better to let the "needs of the program" dictate code and filesystem structure rather than some academic ideas about how a program should be organized. As you say, when I've worked on projects which are very strict about adopting a particular structure, a lot of time ends up being wasted figuring out how to map my intent to that structure rather than just writing the damn code.


> excessive bureaucracy

I like to call this mountain of abstractions forced on you (as opposed to coming from your domain): gratuitous object astronautics.


When we are talking about 500-1500 sloc I completely agree this kind of structure is overkill. But when dealing with medium to large codebases (anything beyond, say, 100kloc) I much prefer the second approach, bonus points if you can get a fractal-like hierarchy.

Digging through files manually (i.e. using a mouse) is painful, but your IDE is your friend. It takes me less than 3 seconds to search for and open any file in the codebase I currently work in (it has a bit more than 2k files). And having a sane hierarchy means I type the folder / file name as I remember it, and filter the search results on demand.


Splitting things up into multiple independent translation units enables incremental compilation. One function per file is the most extreme version of this. For example:

https://git.musl-libc.org/cgit/musl/tree/src/stdio


That seems like a problem for compiler optimizers to solve, not programmers.


It's actually the domain of build systems. Splitting code into as many independent files as possible gives the build system more data to work with, allowing it to compile parts of the program in parallel and to recompile only the parts that actually changed.

If a file contains two functions and the developer changes one of them, both functions will be recompiled. If two files contain one function each, only the file with the changed function will be recompiled.

Build times increase with language power and complexity as well as the size of the project. Avoiding needless work is always a major victory.
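A rough sketch (the file, function, and command names here are made up) of what that per-file granularity buys with a conventional separate-compilation toolchain:

    // math_add.cpp -- one translation unit
    int add(int a, int b) { return a + b; }

    // math_mul.cpp -- another translation unit
    int mul(int a, int b) { return a * b; }

    // A typical build compiles each unit to its own object file:
    //   c++ -c math_add.cpp -o math_add.o
    //   c++ -c math_mul.cpp -o math_mul.o
    //   c++ main.o math_add.o math_mul.o -o app
    // Editing mul() invalidates only math_mul.o; math_add.o is reused,
    // and the two compiles can run on separate cores.

If both functions lived in a single math.cpp, any edit to either one would recompile both.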


In C++, the experience is the opposite - a "unity build", where everything is #included into a single translation unit, tends to be faster:

https://mesonbuild.com/Unity-builds.html

http://onqtam.com/programming/2018-07-07-unity-builds/

https://buffered.io/posts/the-magic-of-unity-builds/
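For illustration, a unity build just funnels every translation unit through one compiler invocation via the preprocessor (the file names here are hypothetical):

    // unity.cpp -- the only file handed to the compiler
    #include "parser.cpp"
    #include "codegen.cpp"
    #include "main.cpp"

    // Built with a single invocation:
    //   c++ -O2 unity.cpp -o app
    // Shared headers get parsed once instead of once per translation
    // unit, which is where most of the speedup comes from.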


Unity builds are useful too but they have limitations. They are equivalent to full rebuilds and can't be done in parallel. The optimizations they enable can also be achieved via link time optimization. Language features that leverage file scope can interact badly with this type of build. They require a lot of memory since the compiler reads and processes the entire source code of the project and its dependencies.

Unity builds improve compilation times because the preprocessor and compiler are invoked only once. They are most useful in projects with lots of huge dependencies that require the inclusion of complex headers. The effect is less pronounced in simpler projects, and they shouldn't be necessary at all in languages that have an actual module system instead of a preprocessor: Rust, Zig.


I have to clarify here a little bit and say that a unity build is faster on one core. If you have multiple cores, having your translation unit count in the same order of magnitude as your core count will be faster. There is a lot more redundant work going on, but the parallelism can make up for it.


> If a file contains two functions and the developer changes one of them, both functions will be recompiled. If two files contain one function each, only the file with the changed function will be recompiled.

Still sounds like a compiler problem


Multiple files also seems like a problem for IDEs to solve, not programmers.


That brings to mind an interesting idea for an IDE: having one big virtual file that you edit, which gets split into multiple physical files on disk (based on module/class/whatever). Although, thinking about it, there are some languages that would make such automatic restructuring rather difficult.


You've just described Leo - leoeditor.com - where you're effectively editing a gigantic single XML file hidden by a GUI. The structuring is only occasionally automatic - mostly manual. It has Python available the way Emacs has elisp.

Git conflict resolution of that single file is intractable, so I convert the representation into thousands of tiny files for git, which I reassemble into the XML for Leo.


Yes! Why can't OOP language editors (IDEs) simply represent the source code of classes, interfaces and other type definitions as they are, without even revealing anything about the files they reside in? The technical detail of source code being stored in files is mundane.


> That brings to mind an interesting idea for an IDE: having one big virtual file that you edit, which gets split into multiple physical files on disk

If you're going to work with it as one big file, then what's the point of multiple physical files anyway? Just store it as one big file then.


Why store it as a (text) file at all? Why not store the code in a database? Or as binary? Then you can store metadata pertaining to the code and not just the code itself. Unreal Blueprints are an interesting way of structuring code and providing a componentized API. It would be interesting if they were more closely integrated with the code itself. Then you could manipulate data flows, code and even do debugging from inside the same interface.

Yes, this is all pie in the sky stuff, but it's interesting to think about.


I have been toying with the idea of storing programming projects in a single sqlite3 database but have never seen enough value to actually pursue it.

As you mentioned though, it's interesting to think about.


There’s not one perfect answer; beauty is in the eye of the beholder.

I personally have a harder time coming up to speed on things that don’t break things down into fairly small chunks. I have an easier time dealing with abstraction, and would rather have the implementation details of what I’m looking at hidden until I drill in another level. IDEs make that latter part easy.

However I’ve come to realize that there’s not a one size fits all here. I’ve worked with people who are the exact opposite, and everything in between.

The best one can do is try to find the happiest medium for everyone involved and power on.


Even though the results weren't terribly illuminating, I have to give the author a lot of credit for even attempting to do a proper experiment like this. So much of our programming dogma is based on gut feelings ("it looks cleaner") rather than empirical data and peer-reviewed studies. We have very vague notions of what works, and even vaguer notions of why those things work.


I’ve come to the conclusion that half or more of the rules we have about “clean” code are about avoiding merge conflicts. Few things have been as consistently disappointing to me as the inability of coworkers and myself to reason about merges correctly. There are three hard things in software, and merges are #3.

If anyone ever figures out how to make merges Just Work, then I expect a lot of pressure toward decomposition over locality would be reduced, and much of the rest would be to facilitate testing.


An interesting point.

I think it would be better to merge ASTs rather than the text files that represent code. The annoying issues with merges are all about the text representation. When there are actually logic changes in two different directions, the merges cease to be merely annoying and start to require domain knowledge.

Of course getting from this hand-wavey thought to working software is difficult. Perhaps we first need to start focusing more on the tree nature of code even in the editing tools?


I wish I kept better bookmarks. There was a project years ago where the diff tool had a tokenizer per language so that it could diff the code similarly to what you suggest. Obviously it did not take over the world.

But yes, that should help.

It always annoys me that when I add a method, the diff tool says I inserted code before the last curly bracket of the previous function, instead of balancing the brackets.


Can you elaborate a little please? It's unclear to me if you are talking about merging data, code changes, or something else.


I’m assuming it was a reference to merging in source control. A lot of “noise” in diffs, and by extension in merges and the sometimes awkward job of resolving merge conflicts, comes from little details like whitespace and punctuation rather than substantial semantic changes in the code. Many a coding standard, and even a language change from time to time, has been made with this in mind, sometimes to the point of putting punctuation in odd places or avoiding aligning items using extra whitespace just to minimise the number and/or size of diffs to check.
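One small example of that kind of diff-minimising convention, assuming a language that tolerates trailing commas in initializer lists (C++ does):

    #include <string>
    #include <vector>

    // Without the trailing comma after "blue", appending "green" would be a
    // two-line diff (one line changed to add a comma, one line added).
    // With the convention, every addition is a clean one-line diff.
    std::vector<std::string> colors = {
        "red",
        "blue",
        "green",
    };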


Code changes are adding or correcting behavior. A lot of coding practices tend to help two things: reading comprehension, and keeping developers from bumping into each other (adding code to the same areas and then having to handle merge conflicts without introducing regressions). It’s much simpler to segregate the code into separate concerns so that new features do not intersect.

But too much decomposition also hurts reading comprehension. So if the specter of merge conflicts went away you’re left with readability, which will settle out to somewhere between the extremes of decomposition. I’m suggesting that would result in somewhat larger methods. Especially where crosscutting concerns intersect each other.


I wish for a future where we can have more than one concurrent view of the same code. Structure need not be derived from mere files and newlines and a handful of semantic organizational elements (function, class, module).

The current way of doing things forces us to make a compromise between prioritizing the forest over the trees, or vice versa. Programming languages are largely concerned with the trees' bark. But to make good software, you need to see and understand both, so the compromise is always a problem.

The solution probably needs large-scale re-imagining of how compilers, languages, version control, and editors/IDEs work (which also requires one to accept that working with a simple flat-file text editor won't work -- a bitter pill to swallow for someone like me who likes the simplicity of simple text editors).

I have some (very vague) ideas, but gosh, how do I find the time to experiment and refine or reject them...


I love this. Currently working on a file storage system that gets away from folders, and that's hard because everyone has folders hard-wired into their brains because history.

Functions shouldn't live in files, for a start. Files are an artefact of storing code in a file-based storage system, and have nothing to do with code architecture. Creating a code editor that stopped working with files and only worked with functions would be interesting as a start on this, I think...


File systems aren't without their advantages. One huge advantage is that text files are extremely un-opinionated about how they're used. If my project exists as a tree of directories with text files inside, there are a ton of tools which can operate on them without any knowledge of my program or even programming language. I can open them in vim or my favorite IDE, dump them to the console with cat, manage versions with git and so on. Basically text files are one of the fundamental building blocks of *nix so having my project represented as files means I can leverage decades of tooling.

It's not to say that it couldn't work to have a program represented as some kind of a database or API, but that would imply much tighter binding between tools and their storage representation.


Interesting. But if you assume that functions don't intrinsically live in files, that they only do so because we have a file-based storage system, and that functions actually live in, say, scopes, then what does that do to your tools?

Can we have a Vim that understands (e.g) scopes natively rather than files?


> Can we have a Vim that understands (e.g) scopes natively rather than files?

Sure we can have a vim that does that. But as I say, it would require tighter binding between the tooling and the code representation.

Right now vim only has to understand code as lines of text separated by spaces, newlines, and tabs. The semantics of that code are the business of the build system and the compiler. The same goes for git. As a result, tools like git and vim can operate on code of any language which is represented as text. That could be a popular language like Java or Go, or some weird experimental language you dream up yourself.

If, as you suggest, the storage representation of the language were tied to the semantics of the language, rather than some external format, then all the tools need to have a deeper understanding of the language itself in order to operate on that storage.

You could try to make it general: i.e. design an organizational structure based on "scopes" which should apply to all languages, but then what if a language comes along which doesn't fit neatly into the "scopes" paradigm? Now you've put yourself into a position where you might be making language design decisions based on what's possible with the tooling, rather than what's the best choice for the language.

Decoupling the storage method from the semantics of the language obviates these problems.


Thanks for the answer, that's interesting.

We do have this to a certain extent now, though - file scope is a thing in some languages.

I'll give up my plan to write a neovim plugin for scope management, though ;)


It would have to understand not just scopes, but their sequential relationship. In most languages, scopes don't just exist, they are loaded in a particular order. There are "top-level" effects that happen from the moment a piece of code is loaded into the compiler/runtime until the end. Maybe this is different in purely functional languages (though I suspect not at the compilation level).
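A concrete case of those loading-order effects, sketched here in C++ where they surface as static initialisation across translation units (the file names are invented):

    // config.cpp
    #include <string>
    std::string default_locale = "en_US";   // initialised when this unit is "loaded"

    // banner.cpp
    #include <string>
    extern std::string default_locale;
    // If this initialiser happens to run before config.cpp's, it copies an
    // empty string: the order in which the units are loaded matters, so a
    // tool that only modelled scopes would miss part of the picture.
    std::string banner = "locale: " + default_locale;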


All these ideas have been implemented in JetBrains MPS. Terms to look up are structural/projectional editing and language workbenches.

Here's a concise demo (although you should read the original paper and the documentation to really grasp this concept): https://youtu.be/pVIywLXDuRo

Papers: https://confluence.jetbrains.com/display/MPS/MPS+publication...


Future programming languages will be graphical in one way or the other, I think. As you said, the programmer needs to have a clear way of visualizing the big picture. I think this can be achieved without forcing people to go visual. You could have the code on the one hand, and the metadata for the presentation of the code on the other, in a separate file. You could also just hide the graphical metadata for the code view.


Separate metadata/markup for presentation + code sounds like a straightforward choice, but I'm concerned that it would incur a lot of maintenance overhead, and the programmer working with the code still needs to keep it up to date and relevant somehow. Dunno, I feel like it'd end up like code + doxygen boilerplate comments (a pain in the butt if you ask me), but worse.

I'm thinking that we need language level support for higher level semantic constructs and relations. Right now code is somewhat analogous to raster graphics or very simple vector graphics. You can construct anything with it, but it is very rigid and there's only so much high level structure that tools can try to infer and dump out of it. (Think call graphs, dependency graphs, flow charts, index of class hierarchies.. all of them somewhat useful for certain purposes, but none of them really good for high level design work or reasoning about systems at a level above the plain code).

We could slap some metadata on vectors or raster images, but I think that's a far cry from ideal. I think that, with sufficient support from the language, we can provide most of the visual structure for alternate views simply by graphing with the help of the semantics that are laid bare in the code. I wouldn't mind some additional hints for presentation, but if we're adding lots of markup and metadata, I think we're going in the wrong direction.


https://darklang.com/ appears to share some of those goals.


I've become a big fan of not worrying about architecture until the rewrite. The first version is always an exploration of the problem domain, and treating it as that has always made my projects go quicker.

This is going to trigger some people, so here are some caveats:

- there's always a rewrite. Even with perfect architecture. Usually because nobody understands the problem domain until there's been an exploration of it with a first attempt (occasionally for other reasons). A few have two rewrites. And that's not a bad thing. Starting again with better knowledge can make the whole project go quicker, because there's less chance of ending up in the situation TFA talks about ("we have to refactor because tech debt").

- architecture needs to be shaped by the problem domain. There isn't a "best" architecture, so picking one requires knowledge of what the code needs to do. And that needs an understanding of the problem. No-one understands the problem from a technical point of view until/unless they've tried writing a program to solve it.

- a lot of features of architecture (like choosing to DI the database engine, instead of picking an engine because it's clearly the right choice) exist because the devs don't have enough knowledge to make an architectural decision when they write the code. It's interesting to see how many of these disappear on the rewrite. It's always more efficient (in both performance and development time) to make these decisions, but making them is difficult without enough information about the problem.

- never underestimate the power of a monolith with good file structure.


The issue that I see with this is that, even if they say otherwise during the first version, when it comes time for the rewrite the powers that be often (usually?) aren't willing to support it.


The "powers that be" are non-tech-aware. They care about results, not nerds pushing the nerd buttons (I paraphrase).

They literally have no clue about what they're asking for, and just have to hope that the people doing the coding can deliver what they want. There's no backup, no "plan B", no way of delivering this without relying on the devs to deliver. So, who cares what they think?

You can literally say to them "we can continue like this, but because of tech debt it'll take 6 months, or we can rewrite in 3 months". And who's to say you're wrong? I've had more than one project do that.

The truth is that no-one knows how long any of this takes. Not the devs, not the project manager, not the CEO. It's always a rough guesstimate, and the estimates only get better with more information. Smart non-tech managers get this, and deal with it. Stupid non-tech managers try to control it and create deterministic outcomes from the non-deterministic process that is software dev. That always fails.

So, yeah, the "powers that be" need to grok the nature of the thing they're trying to do before saying "you can't do a rewrite even if you think that'll be quicker"


"...until the rewrite"

Old school me understood we always create three versions: understand the problem, understand the solution, do it right.

I'm poorly adapted to today's world where projects don't mature past the first stage. Because of fashion, re-orgs, acquisitions, general purpose chaos.


Closely related to the "get it working, get it quick, get it pretty" process of experienced devs doing new stuff.


"architecture needs to be shaped by the problem domain"

(Belated response, sorry. Reviewing my comments and replies received.)

Applying Use Cases deeply influenced me. TLDR: Architecture is derived from use cases.

https://www.amazon.com/Applying-Use-Cases-Practical-Guide/dp...

At the time (of the 1st edition) I was still doing UI. Stuff like direct manipulation graphic design apps. Basically domain specific knockoffs of Illustrator.

I call this strategy "outside in architecture". (I'll have to read the book again to see if I stole that phrase.) Whereas pretty much every other dev I've ever worked with started with the building blocks and worked towards the user.

Per the book Design Rules: The Power of Modularity, architecture is the visible interface of a system, and all the design choices captured by that interface. In other words: What the user (client) sees. Even though I now do mostly services and backend stuff, I still have a user interface designer's sensibility. Where I figure out how something should look and feel before figuring out how to implement it. (There's still an iterative back & forth dance, of course.)


> TLDR: Architecture is derived from use cases.

This, completely.

I always try to explain to startups that they don't understand the problem until they've built the first version and launched it, and until they understand the problem they can't spec an architecture to solve it.

Needless to say, it's not a popular opinion ;)


I have a pet theory that there are two different ways that people think about and approach programs.

Group 1 likes highly decomposed programs, which they feel result in clearer code, since hiding the details makes it easier to focus on the behavior.

Group 2 likes to keep code together, which they feel results in clearer code, since the details of the implementation are readily apparent.

I suspect that these groups may correspond to the Artist versus Hacker groups in this article https://josephg.com/blog/3-tribes/. I.e. do you view writing code as primarily about expressing intent or primarily about controlling technology?

The conclusion that I draw from all of this is that these are likely fundamental differences that may even result from how different people are genetically wired to think. Therefore, I think that any solution should find a way to satisfy both groups. On the other hand, problems arise when, for example, people in group 2 dismiss the needs of people in group 1 by declaring that organizing the code is premature optimization and YAGNI.


I am not so sure that pitting two tendencies against each other is such a good idea. The thing is that good programming is somewhere in the middle of all of these things, because if any of these tendencies goes too far we run into problems. I think we should all be able to belong to each of these three tribes depending on the circumstances.


I definitely agree! Going too far in one direction or the other is likely to both result in poorer code and to antagonize whichever side isn't compatible with that approach.

I think what I was trying to get at is that one of the reasons that teams often don't find balance is because the differences are dismissed as being just differences of opinion. I was trying to show that they are often much more significant than that since they can make it difficult for one side or the other to understand and work with the codebase.


Not everyone has a master craftsman in them. Some people will show up in a new code base and need to do something; their first instinct will be to look around and try to fit their change in with the established conventions.

They are the minority.

Most will show up and handjam their change in the only way they know how. There will be no concern for the forest. Their job is processing trees after all.

This is something that was on my mind in the Google PR review thread. Not everyone is a "peer" in code reviews. There will be a certain cabal on equal footing, but there will be many more people who are simply contributors.

This is where people like the author come in; Project leads.


What gets me is that people are willing to write the same code dozens of times. It’s just a tool in their toolbox. It never seems to occur to them that our job is substantially about automating predictable things.


These days, when I start a new project, I think of my code as a tree. I start at the trunk and write the branches.

Each kind of state change needs to flow through the code in a consistent direction to avoid unexpected state mutations (like sap flows through a tree).

Another developer should be able to understand all the main parts of my program just by looking at the main entry point/file (the trunk of the tree).

Also, no dependency injection should be used; all dependencies need to be listed explicitly and be traceable to their source files. Dependencies need to either be explicitly imported where they are used or passed down through the branches explicitly via method or constructor arguments. Traceability is very important.

About classes/abstractions, they should be easy to explain to a non-technical person. If you can't explain a class or module to a non-technical person, it shouldn't exist because it is a poor abstraction.


> Also, no dependency injection should be used; all dependencies need to be listed explicitly and be trackable to its source file. Dependencies need to either be explicitly imported where they are used or passed down through the branches explicitly via method or constructor arguments.

Isn't the latter precisely dependency injection?

https://en.wikipedia.org/wiki/Dependency_injection#Construct...


Yep, I think GP meant DI frameworks.
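A minimal sketch of the distinction, with made-up types: passing a dependency through the constructor is still textbook dependency injection, just with the wiring written out by hand instead of resolved by a container:

    #include <memory>

    struct Database {};   // stand-in for a real dependency

    class UserService {
    public:
        // Explicit constructor injection: the dependency is visible at the
        // call site and traceable to wherever it was constructed.
        explicit UserService(std::shared_ptr<Database> db) : db_(std::move(db)) {}
    private:
        std::shared_ptr<Database> db_;
    };

    int main() {
        auto db = std::make_shared<Database>();   // wiring is ordinary code
        UserService users(db);                    // no framework or container involved
    }

A DI framework would build the same object graph from configuration or annotations instead of the explicit calls in main().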


“I’d really like to get away from the opinions and be able to say with confidence that one design is better than another. Or, at the very least, understand the trade-offs being made.” As I’ve taken on more leadership in architectural decisions, this is one of the skills that’s helped the most. Having most of the data regarding trade-offs before making a commitment has steered projects away from disaster.


How does your code structure help you guard against these situations:

- Version control conflicts: if developers are editing the same files all the time, there will be more conflicts and therefore more tasks related to resolving them, such as merging, re-testing, fixing bugs related to a bad merge, re-attempting the merge, etc.

- Code so complicated that it becomes easy to misunderstand, and a source of an unusually large number of bugs.

- Code so complicated that it cannot be reliably tested without spending an unreasonable amount of time or relying on opaque testing methods.

- Code so complicated that it increases the dependency on specific team members, usually the authors, so that the team cannot function optimally if they're unavailable or unwilling to collaborate.

- Code so complex that it is impossible for an engineer to determine if the system is in a healthy state, diagnose a problem, or obtain a reproduction step from a bug report...

- Code so poorly organized that developers fail to find implementations for a particular problem, causing them to implement the same thing again.

- Having multiple variations of the same code, so when a bug is found you may have to refactor multiple versions of the same code to fix the problem, if you manage to find them all.

And the list goes on and on. Solutions to these problems often have to do with how code is structured, and with conventions and good practices.

If I see a piece of code that needs to know about 40 classes and 50 methods to produce a result, I know that it is likely going to be a pain to maintain. It's not subjective.

If I see a function with 1000 lines of code and a cyclomatic complexity of 500, I know that it may take at least 500 test cases to test it and will be a pain to maintain in a way that doesn't break. That is not subjective.


There is a huge artisan streak in programming, vanity even.

Having bet incredible amounts of effort and time, we tend to double down as long as we can before considering alternatives.

There's also the tendency to choose our favourite hammer, it worked so well in the past!


One thing that I've noticed in my time working in large Java server codebases is that there seem to be a number of broad categories of code (not mutually exclusive, just one breakdown, and not exhaustive):

* framework - dictating how people should do things like request handlers, how work is scheduled

* feature - making something new work

* wiring - the binary had this information in it in this codepath, but we also need it in this other place...

And I have found that DI tends to be the magical "will write code for you" thing that mostly replaces the third one.



