Sometimes decomposition results in problems at the other end of the scale, such as poor communication performance, data duplication, deeply nested abstractions, messaging complexity, and contract and API versioning hell.
Getting the sweet spot between monolithic coupled blobs and fragmented latent deathtraps is an art which can't be puked out in a blog post. It takes literally years of experience and some guesswork and testing and thinking.
Ultimately, lots of small programs are just as painful as a single large one if they have to talk to each other or do IO.
My instincts are similar to the article's author's: a preference for small, discrete pieces of software rather than a giant monolithic application. More Web Services!, if you will.
But you are correct, getting this sweet spot is hard. Truth be told, I am not even sure experience guarantees a successful design first time around.
There is a tendency among developers to want to keep everything nice and clean. For example, app A is responsible for a data set; any time other apps want to access it, they have to talk to A. If they are asking for that data a lot, you might be better off caching it or periodically copying the data over to the other parts of the system. I always try to decide whether to segment something by thinking about how many calls it is likely to receive as a web service; more than a couple in short succession and I start having doubts.
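A minimal sketch of that trade-off, assuming a hypothetical `fetch_from_service_a` standing in for a remote call to app A: rather than hitting the service on every request, cache its answer for a short TTL.

```python
import time

def make_cached_fetcher(fetch, ttl_seconds=60.0, clock=time.monotonic):
    """Wrap a remote fetch in a tiny TTL cache so repeated short-succession
    calls hit local memory instead of the owning service."""
    cache = {}  # key -> (expiry_time, value)

    def cached_fetch(key):
        entry = cache.get(key)
        if entry is not None and clock() < entry[0]:
            return entry[1]            # still fresh: no remote call
        value = fetch(key)             # stale or missing: go to service A
        cache[key] = (clock() + ttl_seconds, value)
        return value

    return cached_fetch

# Hypothetical stand-in for a call to service A; counts remote round-trips.
calls = []
def fetch_from_service_a(key):
    calls.append(key)
    return key.upper()

get = make_cached_fetcher(fetch_from_service_a, ttl_seconds=60.0)
get("user-42")
get("user-42")  # second call is served from the cache: one remote trip total
```

The point isn't the cache itself, it's that a copy with a staleness bound is often an acceptable substitute for "always ask A".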
Ultimately, what makes the entire system work best for users is the correct thing to do, and sometimes it is very difficult to come up with something which does that and is also pleasing to the discerning coder's eye.
When you have a big system with disjoint parts written in different languages, re-use and refactoring is a pain, and redundancy is almost certain to creep in (and with redundancy often comes inconsistency).
Different languages are just different forms of integration, and the mantra "integration is hell" should be at the forefront of everyone's mind, always.
> Getting the sweet spot between monolithic coupled blobs and fragmented latent deathtraps is an art which can't be puked out in a blog post. It takes literally years of experience and some guesswork and testing and thinking.
I agree with the blog post: proper decomposition is the key if you want to write a good system, and to be honest it is really hard to achieve.
If your comment consisted simply of its second and fourth paragraphs it would be better in every way and you would have contributed something of value.
Apologies if you are personally offended, but my point still stands.
If the system-of-small-programs doesn't perform, then you're in a state where larger programs might make sense. If the problem is well-understood and the pieces have been built and refined by competent programmers, but it's impossible to go any further without some coupling and integration, then a large program isn't the worst thing in the world. Really, that's what most "optimization" is: the use of knowledge about the whole system to make changes that improve the performance of the used cases, while creating couplings that exclude the unused cases (by which I mean, horrible things may happen in them, but that's irrelevant).
For example, with databases, you have requirements that are both technically challenging but also need to work together: concurrency, persistence, performance, transactional integrity. These involve an ability to reason about "the whole world" that can't be achieved with a system-of-small-things approach. That's a case where "bigness" actually imposes complexity reduction. But it has taken some very smart people decades to get this stuff right.
The problem with ad-hoc corporate big-program systems is that the one benefit of largeness-- complexity reduction-- never occurs because there is no conceptual integrity, but only a heterogeneous list of "requirements" that pile on and don't work together. You get the ugliness of "lots of small programs" but the APIs aren't even documented. Instead of reading crappy APIs to work on such systems, programmers have to read crappy code, which is even harder.
Small is the way to start. If you need to make a program large, there are intelligent ways of doing it, but it's best to start small and build enough knowledge so that, when largeness becomes necessary, the problem is actually well-understood.
For example, the program I work on has to support a million row database that can be sorted and filtered both on the server and client with subsecond response time. The program is incredibly configurable based on data in the system, so many of the features depend on reading data and reacting to it.
The problem with "many small programs" is the cost of communication. I can pass a pointer to a list of 100,000 items to be sorted and filtered in a trivial amount of time. If I have to serialize that list to json to pass to a separate program that then has to deserialize that list and perform the function, then reserialize the sorted/filtered list, send it back, re-deserialize.... it'll take longer to do the communication than it does to do the sort.
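A rough sketch of that cost, simulating the cross-program call with an in-process JSON round trip (the data and sizes here are made up for illustration):

```python
import json
import time

items = [{"id": i, "score": (i * 7919) % 1000} for i in range(100_000)]

# In-process: the "pass a pointer" case -- just sort the list directly.
t0 = time.perf_counter()
sorted_in_process = sorted(items, key=lambda r: r["score"])
in_process_s = time.perf_counter() - t0

# Simulated cross-program call: serialize, "send", deserialize, sort,
# reserialize, "send back", deserialize again.
t0 = time.perf_counter()
wire = json.dumps(items)
remote = json.loads(wire)
remote.sort(key=lambda r: r["score"])
wire_back = json.dumps(remote)
result = json.loads(wire_back)
round_trip_s = time.perf_counter() - t0

print(f"in-process: {in_process_s:.3f}s, json round trip: {round_trip_s:.3f}s")
```

On typical hardware the serialization round trip dwarfs the sort itself, and this doesn't even include the actual network or IPC transfer.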
However, that's not to say that the idea of separation of concerns can't still be applied to a large program. And in fact, most enterprise devs do exactly that: that's what all these "services" in the program are. Except that instead of having to serialize data, I can just pass them a pointer.
Just because you can't see all the different programs, doesn't mean they're not there.
This company is ostensibly doing the right thing: they have developed a large number of "single purpose" programs. They also have some applications which attempt to integrate some of their technology into single packages. The problems, however, are exactly as you describe. From an end user perspective, having the various programs send data to one another is a crap shoot. Some applications are very tightly integrated while others seem to have been developed in a vacuum. The company has even developed an entire application that tries to fix this by allowing data to be automatically exported to and imported from excel. End users could try to use the COM interface to get and send data where they want it to go, but we have to remember that the target audience is engineers, not programmers.
Not where I work.
And even then, this is no excuse to stick to a zeroth order heuristic, and make big programs every time. Some systems can be cleanly separated in simple components. Failing to see that is a waste.
Right, but there's a different process to it.
Databases solved a problem, and the requirements grew organically as people used them to solve harder problems. With product companies or with open-source software, the project owners can say, "We aren't doing that shit".
Enterprise projects accumulate requirements based on who has power within the organization. Each person who has the power to stop the project asks for a hand-out, and "We aren't doing that shit" isn't an option. It's like how businesses that want to operate in corrupt countries need to have a separate "bribe fund" for local officials. Over time, the result is an incoherent mess of requirements that make no sense together.
The requirement list for a typical enterprise project is the bribe trail.
Sure, but when you have a multi-developer project without an explicit API, what you end up with is an undocumented and implicit API between people's code. This devolves into the software-as-spec situation where it's not clear what the rules are.
I think it's better to start with the inefficient service-oriented program, get that working, and then optimize with the merged, larger program if needed (and to document the API that has now become an implicit within-program beast).
I think this is purely a stereotype.
The behavior experienced is largely down to the fact that a large body of humans can't come up with a single consistent view of a large set of problems. You need singular control and ownership by someone with technical and business domain expertise. Some of this is politics (particularly from the MBA and psychotic corporate climber faction) but it's at least 80% standard human idiocy and ignorance.
I think from an architecture perspective (I'm an "enterprise architect" [whatever that is] by trade), clean service APIs are a good idea, but not necessarily the distribution model or fully decoupled integration path.
There are certainly advantages to having smaller components. It allows you to rewrite components in a different language should you want to, for example. But there are disadvantages too: smaller components mean dealing with failure at a much finer granularity.
In my opinion, the reason large programs become complicated is that there has been no emphasis on simplicity. Breaking components into smaller pieces forces you to adopt robust interfaces, but there are better ways of creating simpler programs.
My personal approach is to reason about parts of a program in terms of what they mean rather than what they do. I also have a strict rule that says, "don't change the meaning of a component, create a new one". This methodology works for me.
A good example that many people are familiar with is parsing data. Compare these two approaches. You could write functions that manipulate the input and build up an output or you could create a structure that represents a grammar for the data you are parsing (perhaps by using a parser combinator library).
In the first approach, the only feasible way of reasoning about the program is operationally; when it gets to here, this function is called, causing ... . In the second approach, you can reason about the program by considering the grammar that you created. You don't need to know exactly how the parsing happens in order to understand a grammar. I argue that this is because the grammar has been given a meaning.
Parsing makes a good toy example, but this same technique of finding ways of giving meaning to components is applicable to software in the real world.
I am really curious why you see this as a disadvantage. At my current job, I've had the experience of moving from a relatively small backend system that was broken up into discrete message-passing parts to a larger frontend system that was mostly one monolithic project. The former was by far much easier to debug, despite it being much older, less sophisticated, and having less logging, simply because it was easier to isolate and reproduce problems. The queues also made it very easy for us to bring down, diagnose, or scale individual components as they ran in production.
I couldn't see this as anything but an advantage, and one that's well worth the added complexity once you pass a size threshold.
And sure, you can just go cowboy and say "I'll make it simple anyway!", but firstly, interface-enforced simplicity is not the only reason you would go with queues and RPC (other reasons have been mentioned in the article or by myself), and secondly, it is much more difficult than you think to enforce such abstract common disciplines on a large project.
> interface-enforced simplicity is not the only reason you would go with queues and RPC
No, certainly not. I shouldn't have implied that that was the case. However, if reliability, maintainability or simplicity is the reason why you are considering breaking up your large components then perhaps there is a better way of achieving your goal.
> it is much more difficult than you think to enforce such abstract common disciplines on a large project.
I'm speaking from my experience working on medium-large programs written exclusively by myself. I don't know that my methodology would scale to a team, but I also don't know that it or something similar to it couldn't.
This tension exists at any scale, from a single-developer hobby project up to massive enterprise projects and OSS giants, so I challenge the original premise of this blog post that having a large code base is the root cause of the problem. Going too far in either direction can result in absurdity, whether that’s “enterprise software” levels of boilerplate (too much tight coupling) or DLL hell and typical Linux package management (not enough cohesion).
1. While working on module 1, you realize you need something from module 2
2. Open module 2, add new feature and publish changes
3. Go back to module 1, test new feature and resume work
This process is fine once both modules 1 and 2 have matured but painful to deal with while the APIs are still taking shape. Hence it makes sense to keep a good abstraction between potential components and spin them off as an individual service only when they're stable enough.
But often the plumbing required in the form of web services becomes really painful to leverage. For instance they require creating complex WSDLs and workarounds to prevent timeouts.
Instead of making small isolated services they do one single gigantic WAR file.
Instead of using right tool to do the job everything is written in Java.
Instead of having services with implemented business logic they do services that convert one DTO to another.
Solving complex problems in the physical world usually results in complexity in the source code world.
It is always overwhelming to jump into a new gigantic code base. Talk to someone who's been on it a while and they won't have the same drowning outlook.
What are you supposed to do? Find the int main() and then make the program run in your head?
I can make an analogy with a car - I don't know every single piece of it, but I can infer from the context. The scale is evident.
That's a fact that still didn't change, but user-level code is getting more powerful (mainly thanks to virtual machines), and computers are still getting faster. So, it's still too early to declare the race finished.
Anyway, none of that has any relevance to how one should organize user level code.
Is it "good" or "bad"? I'd argue that it's both. It's an important piece of software but a bad architecture, and it's grown in complexity over the years. I think you could write a piece of functional wiki software in about 500 lines of code, but complexity and features win in the marketplace, even if it's just for "mindshare".
I joke, but I've seen some PHP before with lines at least 600 chars long before any line break; I have no idea how they wrote it like that.
Pretty much all good code that I've read or written was compartmentalized into units of roughly 500 LOC. A big program may be composed of many such units, but it was almost always a bad sign when a divisible part would exceed the "magic" number.
What comprises a divisible part of course also varies by language; at the least it'd be the LOC-per-file, but usually it'd be a self-contained and separately tested module.
In a moment someone will probably come up with a great piece of software where this doesn't hold true, I'd actually be curious to see it.
I mean an isolated unit that could be ripped out at any time and would be immediately useful on its own.
Anyway, abecedarius (below) has phrased it better than I could.
Also, a library can be thought of as a collection of modules.
Most of the "it's too hard and there's too much to do!" crowd doesn't understand the benefits of working clean.
Does anyone have real experience with this (open sourcing a core piece of infrastructure and finding that others have found it, used it and provided feedback)?
It can sometimes be beneficial from a distribution/deployment standpoint to have everything in one self-contained file. But you can't conclude much about the code quality of e.g. a computer game engine based on how many megabytes of graphics, music and sound effects a particular game based on that engine uses.
Constraints like this can really shape a piece of software, for better or for worse. My inspiration is having worked with a really powerful firmware system that had a hardware constraint to fit on a 1MB flash chip, everything included, and was done so well that it looked easy. Give yourself unlimited space and it's much easier to end up with UEFI...
I suspect quite a few programs out there would have turned out better if their authors had picked a semi-arbitrary maximum value for lines of code / bytes of RAM / bytes of disk / etc.
The actual rule I'm using for now is:
- 1 second compile excluding dependencies.
- 1 minute compile including dependencies (excl C compiler).
- 1 MB executable including everything except libc and base OS.
The program-to-programmer relationship deserves to be many-to-one. It's a rewarding way to do things. You solve a problem. You add value. It's Done. You may have to go back to a program later to add features, but you don't end up with massive codeballs.
When the program-to-programmer relationship is inverted and becomes one-to-many, you get the enterprise hell with no feedback cycle, terrible code, and unnecessary complexity. It's not rewarding. Problems are never solved and software is never Done. Requirements are "collected", bundled into an incoherent mess, and delivered to bored, underachieving developers who never get to see their programs actually do anything.
Large problems that require more than one person need to be solved with systems and given the respect that systems deserve. Single-program approaches are a denial of the complexity (that comes whenever people have to work together) and a premature optimization.
I wrote about the political degeneracy that this creates: http://michaelochurch.wordpress.com/2012/04/13/java-shop-pol... . But it's unfair to associate it with one language. It's not that Java is any more evil than C# or C++. Any company that calls itself an X Shop is doomed.
There are cases where large single programs deliver value. For example, most people experience a relational database as a single entity. There are a lot of requirements (performance, persistence, transactional integrity, concurrency) that are technically very difficult to meet and all have to work together. I will also note that it has taken some very smart, very well-compensated, people decades to get that stuff right. The quality of programmers who tend to stick around on corporate big-program projects is not high enough to even attempt it, though.
So why is big-program development winning? There are a couple reasons for that. First, it gives managerial dinosaurs the illusion of control. If programs are Giant Things that can be measured in importance by "headcount", then executives can direct the programming efforts... which they can't do if the programmer's job is to go off and independently solve technical problems they deem to be important. Second, big-program design gives a home to mediocre programmers who wouldn't be able to build something from scratch if their lives depended on it but who, in teams of 50, might be as effective as 0.37 good developers. It's about control and a failed attempt to commoditize programmer talent, but it doesn't actually work.
So why do larger programs 'win' in the open source world as well? Pop psychology about management doesn't seem sufficient to explain the phenomenon (although I'm sure it is a good way to sell a '101 habits of highly effective managers' book or get paid to give talks about management). Large systems are large; breaking them up into smaller pieces doesn't change that, but it does make navigating the code base harder (although I assume you don't care about that since, judging from your blog posts, you don't think tooling is important). It makes your interfaces less malleable (can be good, can be bad), and moves a lot of communication to places where the compiler can't warn you about mistakes (again, if you don't care about tooling I guess this doesn't matter… but I would argue that this is bad).
It seems to me like systems of many small separate processes are basically dynamic OOP. Everything is late bound, dynamically typed and async. It's easy to make changes and also easy to break things. You can argue that this is better for certain problems, but I don't think it's universally better, and the community seems pretty divided on the issue too: look at the popularity of Go, statically typed and building concurrency into the language rather than using the OS like in the older C world.
As an aside: surely the web developer community is eventually going to grow tired of talking about how terrible Java is and how $idea_of_the_moment is good because it's 'not java'? As an outsider the obsession seems extremely unhealthy, and leads you to bizarre places like arguing against automated refactoring or interactive debugging or static type systems just because those things are associated with Java. I guess to maintain credibility I also need to point out that I don't and have never used Java…
 Firefox/Chromium vs uzbl, gcc/llvm/clang vs pcc, gdb vs printf debugging, sqlite/mysql vs directories and plain text files, perl vs sed/awk/grep shell scripts, emacs/vim vs ed/notepad etc etc.
When you set out to architect a customized, typed, async architecture, you end up with the "flow-based programming" style which has been captivating me recently. It tends to reduce to two events per component: "start" and "stop." The "integrated program" appears in a tiny top-level definition that sets up the components and connections. Components are relatively small and reminiscent of "pure" algorithmic code. Where synchronized behavior becomes essential, flows can be split into stages of processing and kicked off in a sequence.
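A toy sketch of that shape (my own illustration, not any particular flow-based framework): components are plain functions wired together by queues, the "stop" event is a sentinel packet flowing through the network, and the integrated program is just the top-level wiring.

```python
import queue
import threading

SENTINEL = object()  # the "stop" signal flowing through the network

def component(fn, inbox, outbox):
    """Run fn over every incoming packet until the stop sentinel arrives."""
    def run():
        while True:
            packet = inbox.get()
            if packet is SENTINEL:
                outbox.put(SENTINEL)   # propagate "stop" downstream
                return
            outbox.put(fn(packet))
    t = threading.Thread(target=run)
    t.start()
    return t

# Top-level wiring: source -> double -> add_one -> sink
a, b, c = queue.Queue(), queue.Queue(), queue.Queue()
threads = [component(lambda x: x * 2, a, b),
           component(lambda x: x + 1, b, c)]

for x in [1, 2, 3]:
    a.put(x)
a.put(SENTINEL)  # "stop" event enters the network

results = []
while True:
    packet = c.get()
    if packet is SENTINEL:
        break
    results.append(packet)
for t in threads:
    t.join()
```

Each component stays small and "pure"; everything about the program's overall shape lives in the few wiring lines at the top level.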
This particular style has had a lengthy history of reappearing in numerous domains under various guises, and it has demonstrable effectiveness, but it can also feel alien and more "mathematical." The main issue is that it has a lengthy design/prototyping time of weeks to months, and the initial complexity of the system looks high because you need a decent number of components to do anything substantial. This really, really goes against the "move fast and break things" mainstream, even within open source - everyone wants their project to just _instantly_ accelerate from 0 to 100 in terms of progress, and we've put most of our efforts into a toolchain that makes it easier and easier to do that.
The best remedy at present seems to be to embrace asynchrony, embrace static types, and maintain faith in both - i.e. to have a lot of discipline.
I'm actually a pretty big fan of static typing. You don't get static typing's main benefits in C++ or Java, though. You have to use a language like Haskell or Ocaml, or the right subset of Scala, to see the major benefits of that.
Open-source is a bit different because people choose whether they contribute to a project. The quality of code in the active open-source world is leagues above what you find in typical enterprise codeballs, because of survivor bias. No one has the authority to mandate that code be maintained by others, so the messes are cleaned up by people who actually care, not people slogging through it to keep a paycheck coming.
The big-program methodology of the corporate world is the evil. In FOSS, the major projects are an unusual set-- code-quality at a high level just not seen in the for-paycheck commodity-engineer world and large because of success-- rather than the reverse. There's a survivor bias that occurs because the best projects are the only ones people pay attention to.
The corporate world is screwy because projects become large or small based on political reasons that have nothing to do with code quality. In the FOSS world, code-quality problems related to growth will be self-limiting because no one has the authority to "force" the program to grow.
Also yes static typing in Java does look pretty cumbersome. Personally I'm hoping that Rust takes off, I've enjoyed playing about with it over the last couple of weeks, although it has made me less happy using the more dynamic languages I normally use to do real work.