1. Business level constraints (time, human, fiscal and other resources, stakeholders) trump technical constraints every time. Identifying these should be step zero in any design process.
2. A business-level risk model assists with appropriate design with respect to both security and availability and should ultimately drive component selection.
3. Content seems very much focused on public IP services provided through multiple networked subsystems. While this is a very popular category of modern systems design, not all systems fall into this category (e.g. embedded), and even when they do, many complex systems are internal, and public-facing interfaces are partly shielded/outsourced (Cloudflare, AWS, etc.).
4. Existing depth in areas such as database replication could perhaps be grouped generically as examples of fault tolerance and failure/issue-mitigation strategies.
5. Asynchronicity and communication could be grouped together under architectural paradigms (e.g. state, consistency and recovery models), since they tend to define at least subsystem-local architectural paradigms. (Ever tried restoring a huge RDBMS backup, or performing a backup between major RDBMS versions where downtime is a concern? What about debugging interactions between numerous message queues, or disparate views of a shared database (e.g. blockchain, split-capable orchestration systems) with supposed eventual consistency?)
6. Legal and regulatory considerations are often very powerful architectural concerns. In a multinational system with components owned by disparate legal entities in different jurisdictions, potential regulatory ingress (e.g. the power to halt, seize, or shut down national operations) can become a significant consideration.
7. The new/greenfield systems design perspective is a valid and common one. However, just as commonly, established organizations' subsystems are (re-)designed or upgraded, and in that case system interfaces may be internal or otherwise highly distinct from public service design. These projects are often harder because of downtime concerns, migration complexity and organizational/technical inertia.
Sometimes I feel it's a pity that HN doesn't support bold text and increased font sizes.
Yes, to some extent this means you are "shipping your org chart", but that's mostly inevitable at scale anyway.
Best to give it some time, figure out where the important interactions are and separate things that don't interact heavily.
E.g. slashdot used to run with, IIRC, only 4 roles: memcached, reverse proxy, apache, and mysql.
Your interviewer might have just failed in explaining the problem constraints properly.
The simpler and more common explanation is that the interviewer was looking to have his biases reflected by the interviewee. In my experience it is difficult for a certain kind of mind to distinguish between "This person is stupid/crazy/wrong for the job" and "This person knows something I don't".
If you're not a large employer, I can elaborate, but I'll have to tread a bit carefully, hence my reticence to do so.
You've asked the right question though. Is independent scaling "necessarily" better? No, it's a trade-off.
* To get this feature you've sacrificed the simplicity of having a single codebase leading to a single binary that can be deployed with ease.
* You've added extra components within your system like load balancers.
* What would earlier have been a function call is now an RPC, with network and serialisation overhead (see the sketch below).
* Your system might be a little more difficult to debug unless you collect logs in one place and can follow a single user across multiple components.
* It's possible for your system to fail in ways that weren't possible before. For instance there might be a network issue between two specific nodes and it's hard to figure that out unless you have proper monitoring in place.
Of course there are advantages other than failure isolation and independent scaling but I haven't gone into those.
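To make the third bullet concrete, here's a minimal Python sketch of the same operation as an in-process call versus an RPC over HTTP. The endpoint, names and payload are all made up for illustration:

    import json
    import requests  # assumes the 'requests' package is available

    # Monolith: a plain function call. No network, no serialisation,
    # and it cannot half-fail.
    def render_invoice(order_id: int) -> dict:
        return {"order_id": order_id, "total": 42.0}

    invoice = render_invoice(1234)

    # Microservice: the same operation becomes an HTTP round trip.
    # (Hypothetical internal endpoint; assumes an invoice service lives there.)
    resp = requests.post(
        "http://invoice-service.internal/render",
        data=json.dumps({"order_id": 1234}),         # serialisation overhead
        headers={"Content-Type": "application/json"},
        timeout=2.0,                                  # new failure mode: timeouts
    )
    resp.raise_for_status()                           # new failure mode: remote errors
    invoice = resp.json()                             # deserialisation overhead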
I could maybe see it if you have some services that take huge amounts of memory and others that take huge amounts of CPU. If you have a standard "monolithic app instance" and had to scale up your CPU by 10x but memory only by 1x because of your PDF renderer, you'd be wasting large amounts of memory. But unless you have huge disparities in memory vs. CPU use (and services don't vary in use together), I don't really see what sort of cost savings you can get here. Isn't that the actual value-add of independent scaling - less cost, because you can more accurately hit your resource (CPU vs. memory vs. disk and so on) requirements for your project?
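Back-of-the-envelope, with made-up instance shapes, the waste looks like this:

    # All numbers invented for illustration.
    base_cpu, base_mem = 4, 16        # one monolith instance: 4 vCPU, 16 GB

    needed_cpu = base_cpu * 10        # PDF renderer forces 10x CPU: 40 vCPU
    needed_mem = base_mem * 1         # but memory needs stay flat: 16 GB

    # Scaling the monolith horizontally to hit the CPU target drags
    # memory along with it:
    instances = needed_cpu // base_cpu           # 10 instances
    provisioned_mem = instances * base_mem       # 160 GB provisioned
    wasted_mem = provisioned_mem - needed_mem    # 144 GB sitting idle

    print(f"{instances} instances, {wasted_mem} GB of RAM doing nothing")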
In contrast, from my experience, separating your web/app/database/cache layers from each other tends to be extremely beneficial for independent scaling, because they almost always vary widely in which resources they consume (memory vs. CPU vs. disk and so on). They also tend to be written with this in mind, so it is basically free to do.
An aside, but many of your downsides apply not just to services but also to scaling an application horizontally. If the name of the game is scaling, you will pay many of these costs regardless of the monolith-vs.-services question.
All that aside, I definitely feel the other benefits. But I really don't get it - every article I read about services mentions independent scaling, when it seems like a fairly suspect benefit. Maybe I just haven't worked on the right project.
- One needed a large in-process cache in order to deliver good performance; it would have consumed too much memory on each instance of the monolith.
- Some services used large ML models, which also would have consumed too much memory on monolith instances.
- A lot of our payment-related services had hourly or daily batch jobs. Anything with big resource spikes probably shouldn't share a machine with latency-sensitive code (like online payment processing or just web handlers).
- Related to the above, some jobs had to be done by a master instance. If the monolith did them, they would have disproportionately affected a single instance of the monolith.
This is a contrived example. I mostly agree with you. I think you do hit a point at the application level where the big monolithic app is a little bit too big, and that's a tough spot to be in.
If you can keep things small, light and efficient, a monolith will work forever. But always keep in mind that it can outgrow your instance sizes, so prepare either to move up to the next instance size (tough if you're running your own hardware) or to start thinking about splitting out parts of the app.
One possible argument is that you can reserve capacity by service, so a computationally expensive but unimportant endpoint doesn't accidentally gobble up the resources you added to alleviate the starvation that was slowing down a different endpoint.
But you could also do that with smarter load balancing - deploy a monolith to all hosts, but partition traffic between separate pools depending on endpoint at the load balancer level.
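A rough sketch of that idea in Python (pool membership and path prefixes are invented; a real setup would express this in the load balancer's config rather than application code):

    import random

    # Same monolith binary on every host; the balancer just partitions
    # traffic into pools by endpoint.
    POOLS = {
        "/reports": ["10.0.1.1", "10.0.1.2"],              # expensive endpoint, fenced off
        "/":        ["10.0.2.1", "10.0.2.2", "10.0.2.3"],  # everything else
    }

    def pick_backend(path: str) -> str:
        # Longest matching prefix wins, so /reports/* stays in its own
        # pool and can't starve the rest of the site.
        prefix = max((p for p in POOLS if path.startswith(p)), key=len)
        return random.choice(POOLS[prefix])

    assert pick_backend("/reports/monthly") in POOLS["/reports"]
    assert pick_backend("/login") in POOLS["/"]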
I don't think performance isolation is a good argument for microservices. I don't think failure isolation is either - the interactions are likely to create more, trickier failures than the isolation will prevent.
The real argument is about scaling the organization. It's much easier to work on and frequently deploy small codebases with small numbers of commits and committers per day, communicating across team boundaries via Thrift IDL files, than to have thousands of engineers on one codebase and thousands of potentially breaking changes introduced between every deploy.
It's in my own interest to spread the knowledge around, because I like holidays-that-are-holidays.
I think your post is missing "Show HN" in the title.
Like I said, I'm only starting out and I don't know how real Erlang systems are built. However, I suspect that they tend to eschew the orthodoxy of treating a relational database as a single source of truth fronted by stateless app servers (those two features are at the core of all the systems in the OP's post) and instead embrace distributed, redundant statefulness. If this can be done without becoming impossible to reason about, I suspect it represents an optimal server system in terms of resource usage, availability, and probably even performance, in a world where most organizations' datasets can fit entirely in RAM from the day they're a twinkle in someone's eye to their eventual dissolution.
I should stress that "questioning orthodoxy" is something of a hobby, which probably biases me.
http://learnyousomeerlang.com keeps coming up as a good learning resource.
Which tool is used for creating the diagrams?
A missing area is identity management. Most likely this should be separate from your system (e.g. don't have a table somewhere with username and password columns in it).
In consumer-facing systems, it's OpenID Connect (the better option, as practiced by Google), with plain OAuth used by most others.
In enterprise software, SAML is the norm.
That leads naturally to questions about API authorization (are API calls made on behalf of system users? If not, start probing further).
It's always enlightening to start asking questions about identity management very early on when designing systems.
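For the API-authorization question above, here's a minimal sketch of what "calls made on behalf of a user" often looks like on the resource side: validating an identity-provider-issued JWT on each request. This uses the PyJWT package; the issuer, audience and key are placeholders you'd get from your identity provider's metadata:

    import jwt  # pip install PyJWT

    # Placeholder: in practice you'd fetch the signing key from the
    # provider's JWKS endpoint rather than hard-coding it.
    PUBLIC_KEY = "-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----"

    def caller_identity(bearer_token: str) -> str:
        """Verify the token and return the user the call is made on behalf of."""
        claims = jwt.decode(
            bearer_token,
            PUBLIC_KEY,
            algorithms=["RS256"],                # pin the algorithm; never accept 'none'
            audience="https://api.example.com",  # placeholder audience
            issuer="https://idp.example.com",    # placeholder issuer
        )
        return claims["sub"]  # 'sub' identifies the end user

If decode raises, reject the request; and if there's no user-level token at all, that's the cue to start probing further, as the parent comment says.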