At least as important as designing something that can scale up is designing something that can scale down. You never know when the organization will need to deprioritize the project and still keep it running without burning a couple of million in resources every year.
See: microservices (as in, an example of the problem, not the solution).
Over-complicating things is endemic, though. An aside to illustrate: our work straddles multiple non-tech industries, and there's a common theme in their software: a thin veneer of modern tech companies on an ocean of legacy systems, mostly running off a single PHP server in a back room somewhere.
Everyone, startups and users alike, wants to replace them. But time and time again we see startups limited by opinionated choices in their architecture, i.e. a focus on fanciness vs. providing the functionality that's needed. Not just distributed systems, but things like teams struggling with React front ends, building apps where websites would do, and custom CSS where a template would do.
It stems from a common misunderstanding: it's not your tech that makes a great product; a great product is enabled by great tech. SaaS systems that displace legacy enterprise systems do so mostly because of business models and functionality, not amazing technology. Netflix wouldn't need their architecture if they didn't have the users and the content, etc.
It kind of boils down to simple things. Put assets in S3 or GCS. Keep the DB separate from your app, and if it's in prod and you have paying clients, run at least 3 replicas, so if one goes down, or you need to do some upgrades, everything goes smoothly.
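For the assets piece, a minimal sketch of the idea (boto3, with an invented bucket name; credentials assumed to come from the environment):

    import boto3

    # Hypothetical bucket name; in practice this comes from config.
    BUCKET = "my-app-assets"
    s3 = boto3.client("s3")

    def store_asset(local_path, key):
        # Assets live in object storage, so app servers stay stateless
        # and any replacement instance can serve the same files.
        s3.upload_file(local_path, BUCKET, key)
        return "https://%s.s3.amazonaws.com/%s" % (BUCKET, key)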
You probably want to dockerize your app so you can deploy the same thing to stage and prod. It's scary how few companies have a proper staging environment.
But none of this matters without the first part, which is dead simple to state but hard to do: "build something that people want". Everything else is secondary and useless if you're building stuff that nobody wants.
Few companies will take a product that actually needs large scale systems and hire someone that has no prior experience.
If you want to actually build large scale systems, you have to start somewhere.
Even if you just want to be an entry-level person on a team that builds large scale systems to learn by experience, they are likely going to ask you questions about that topic.
You may not need that many people to build large scale systems, but you still need a pipeline of new people as others leave that particular niche.
No, the company that I work for isn't Netflix, but it still has tons of customers. One of our services regularly pushes past 100k rps, and knowing much of what is covered in this guide has been incredibly helpful over my career.
When I interview people, I put as much focus, if not more, on the ability to come up with a sane design as I do on coding, especially for a senior engineer.
No, I think most people end up hiring those who have experience creating big, complicated systems but haven't stuck around long enough for their chickens to come home to roost.
Of course, averages (even if true) are like stereotypes.
It would be interesting to see the tenure data on the experts (consultants/implementers) of large-scale systems, other than at the iconic ones (e.g. Google, Netflix).
We don't even know if the tenure is shorter than average.
Regardless, neither the primary motivation for a short tenure, nor even any average would be particularly meaningful with regard to what I believe to be ris's implied accusation:
Absent at least one tenure long enough to see through the consequences of creating the large-scale system, such a creator cannot truly be considered experienced with large-scale systems, no matter how many such creations are on the résumé (even though the market values/hires the latter).
Sure, a few groups in each F500 need epic skills, but I think that's a vanishingly small fraction of the (actual) work being done. The term "Enterprise" and what it stands for earned its laughable reputation for a reason.
As an example, any kind of analytics could generate terabytes of data a day... per customer.
A side project I am building will have to handle billions of events per day. Per customer. There are zero customers (this is for fun, not profit), but as soon as it hit even one customer I would need to consider an approach that scales.
How many companies have similar requirements?
But that's actually beside the point.
Microservice architecture, or any architecture that focuses on isolated, asynchronous components, adds complexity. Of course.
But it also reduces work in other areas. If you build async, isolated services, you no longer have to deal with catastrophic service failure. Cascading failures go away at the async boundary.
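To make that concrete, here's a minimal sketch of an async boundary, with an in-process queue standing in for a real broker like SQS or Kafka:

    import queue, threading, time

    events = queue.Queue(maxsize=1000)  # the async boundary

    def handle_request(payload):
        # The producer's only failure mode is backpressure ("queue full"),
        # not "the downstream service is on fire" - failures don't cascade.
        try:
            events.put_nowait(payload)
            return "accepted"
        except queue.Full:
            return "overloaded, retry later"

    def consumer():
        while True:
            payload = events.get()
            try:
                process(payload)      # downstream work that may fail
            except Exception:
                events.put(payload)   # naive retry; real systems use DLQs
                time.sleep(1)

    def process(payload):
        pass  # stand-in for the isolated service's actual work

    threading.Thread(target=consumer, daemon=True).start()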
For many of us, I imagine we've spent a lot of time fighting fires at organizations where one service going down was a serious problem, causing other services to fail, and setting your infrastructure ablaze. Hence a bias towards solving that problem upfront.
Wait, what? I've never worked anywhere where one customer generated terabytes of data per day, and I've worked on very large commercial enterprise software.
The only thing I have experience with that produces anything close to that kind of data per customer is in genetic sequencing, and you only do a customer once. (Even that isn't a TB in its bulkiest, raw data form, and the formats used for cross-customer analysis are orders of magnitude smaller).
> For many of us, I imagine we've spent a lot of time fighting fires at organizations where one service going down was a serious problem, causing other services to fail, and setting your infrastructure ablaze.
The reason so many of us have worked in places like that is that those places 'got stuff done' and survived and grew.
Both companies are just part of the large tech scene, hence my skepticism about the claim that there aren't many engineers who have to manage tons and tons of data or distributed systems; there are probably hundreds of thousands, if not millions, of engineers thinking about these problems outside the two companies I have worked for.
Still, we probably designed more system than needed (a cluster). But scarier than that was seeing some DC/OS apps with $5,000/month in server costs even without user load.
In the grand scheme of things this doesn't have to mean microservices across a million hosts, only that you've decomposed the problem into its elemental parts. Those parts can now be considered separately as their own elements, rather than having to contend with the entire architecture in your head when a problem arises.
It allows everyone to focus on their specific components without leaping ahead in assumptions about how each developer will use each piece in the future. Lots of those kinds of problems are more easily solved in a room together, planned out, and done together. At least, that's what I've learned from how NASA developed their most important, complex parts.
It's very easy to get ahead of oneself. Complexity grows by factors that are incredibly difficult to manage. Being able to simplify down to a context of parts that are moving and parts that are stable is a serene state of coding. Everything flows much easier that way.
There will likely always be bugs and issues, but minimizing them is an ideal worth maintaining in software development.
In general, the 'get it done' mentality is the one that makes economic sense, because once you add the pile of software that doesn't need to scale to the other piles where this long view doesn't matter, you have almost everything built. The other piles, for the record, include software that:
- is designed wrong, so it needs to be rewritten
- is obsoleted by changes in business direction (a project canceled, for example)
- gets replaced by something off-the-shelf or open-source
- is built for a startup that won't survive, or that gets acqui-hired, or that pivots to a wildly different thing
On the other hand, I sometimes see the opposite thing in heavily analytical work, where data science work is done in Python because it's "easy", and then a team of engineers builds a crazily complex pipeline to make the Python perform in some reasonable time frame. (Hi, Spark!) In my workplace, one example allocates bits of a job to roughly 100 machines, moving data to each, in a cloud environment where the data-movement overhead is constantly fighting the benefits of distribution.
Having seen at least a couple of similar setups, I remain skeptical that this isn't, at its core, just ignorance of how "big" a single server can be made or bought before you even pay a premium.
However, even for the "largest" commodity servers, last I looked, the premium at the highest end (over linear price:performance) was only something like 4x.
There was some relevant discussion of single server versus distributed in subthreads of https://news.ycombinator.com/item?id=17492234 a few days ago.
> In my workplace, one example allocates bits of a job to roughly 100 machines, moving data to each, in a cloud environment where the data movement overhead is constantly fighting the benefits of distribution.
I'm confident that cloud environments contribute to hardware ignorance, since cloud providers offer a very limited choice of options, and I have yet to see anything high end.
This is especially a frustration for me with networking options, where high bandwidth (beyond 10Gb/s on AWS until recently, and still only 40Gb/s max, AFAIK) is nonexistent or otherwise expensive, and low-latency options like InfiniBand don't seem to exist either, even at the now low/obsolete bandwidths of 16 or 32Gb/s.
The architecture is in general 'fine'. But the communication paths of subsystems are probably the easiest part of the problem. And in general, re-organizing the architecture of a system is usually possible if, and only if, the underlying data model is sane.
The more important questions are:
- What is the convention for addressing assets and entities? Is it consistent and useful for informing both security and data routing?
- What is the security policy for any specific entity in your system? How can it be modified? How long does it take to propagate that change? How centralized is the authentication?
- How can information created from failed events be properly garbage collected?
- How can you independently audit consistency between all independent subsystems?
- If a piece of "data" is found, how complex is it to find the origin of this data?
- What is the policy/system for enforcing subsystems have a very narrow capability to mutate information?
If you get these questions answered correctly (amongst others not on the tip of my tongue), you can grow your architecture from a monolith to anything you want.
I can answer the above for systems I've built, but I've spent quite a bit of time with those systems. How do I get better at doing this during the planning phases, or, even better, for a system I'm unfamiliar with? (I.e., are there tools you lean on here?)
But in general, the cliché "great artists steal" applies here. If AWS/GCE/Azure (or any other major software vendor) is offering a service or a feature, then it is almost certainly solving a problem somebody has. If you don't understand what problem is being solved, then you cannot possibly account for that problem in your design. Today, the manuals for these software features are documented with unprecedented accuracy. Read them, and try to reverse engineer in your head how you would build them.
For example: AWS's IAM roles seem like a problem that could be solved by far more trivial solutions. Just put permissions in a DB and query it when a user wants to do something. Why do we need URNs for users, resources, services, operations, etc.? And why do those URNs need to map to URIs? Well, if you look at the problem, it ends up being a big graph, immutable in the general case, over namespaced assets. So reverse engineer that: how would you build it?
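One toy way to make that shape concrete (invented URN-style names; a guess at the problem, not AWS's implementation):

    from fnmatch import fnmatch

    # Policies are edges: (principal, action pattern, resource pattern)
    # over URN-like namespaced names.
    POLICIES = [
        ("urn:toy:iam::user/alice", "s3:GetObject", "urn:toy:s3:::logs/*"),
        ("urn:toy:iam::role/etl",   "s3:*",         "urn:toy:s3:::warehouse/*"),
    ]

    def is_allowed(principal, action, resource):
        # A linear scan works at toy scale; the real problem is making this
        # lookup fast and consistent across an effectively immutable graph
        # of namespaced assets.
        return any(principal == p and fnmatch(action, a) and fnmatch(resource, r)
                   for p, a, r in POLICIES)

    print(is_allowed("urn:toy:iam::user/alice", "s3:GetObject",
                     "urn:toy:s3:::logs/2018/07/13.gz"))  # True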
I agree with you about reverse engineering the giants, it is one way of acquiring knowledge.
However, I disagree with:
> If AWS/GCE/Azure (or any other major software vendor) is offering a service or a feature, then it is almost certainly solving a problem somebody has.
AWS/GCE/Azure have industrialized the process of proposing new building blocks. The cost for them to propose and maintain a new service is lower than it was a few years ago, so they are logically able to experiment more with users, and eventually shut down services that meet no actual need (or that overlap with another service they offer). This is especially true for Google.
My intuition is that it also works as a marketing process: the more time you spend reading their documentation, the more you accept their brand, and the more likely you are, statistically, to buy something from them.
I suspect you have a misadjusted notion of "usually". "Usually", as in for the majority of systems designed and in use in the world, a well-tuned, reliable RDBMS will be able to do this absolutely fine. The scale of systems that the world needs vs. the quantity of them is an extremely long-tailed curve.
Anki is an open source application (desktop + mobile) for spaced repetition learning (aka flashcards). It's a very popular tool among people who want to learn languages (and basically anything else you want to remember). There are many shared decks (https://ankiweb.net/shared/decks/). Creating and formatting cards is also possible and pretty easy.
If you are planning to learn a language, or anything else, give Anki a try. I used it for all of my language-learning efforts; thanks to it, at least my vocabulary is rock solid.
The first day was really tough; I missed cards over and over again. 20 new cards (the default) is probably too many for this type of study. But I kept at it, once per day, and today (a week later) I can recognize nearly every font in the deck, and the ones that I have trouble with are very similar to other fonts (which is a useful thing to know in its own right; you can start to group fonts into "families" with a common ancestor). Pretty cool!
There's just one problem: so far, this hasn't translated into any ability to recognize fonts in the real world. I can think of a few reasons why. First, there are a LOT of fonts out there; even the "most popular" ones don't show up all that frequently. This is especially true for business logos, which like to use unusual fonts to make themselves stand out. Secondly, I think studying by memorizing a single sentence has caused me to "overfit!"
For example, there's a font that I can instantly recognize (Minion Pro) by how the 'T' and 'h' look together at the start of the sentence. I don't pay attention to anything else about the font, because that single feature is enough to distinguish it from the rest of the deck. And this turns out to be true for most fonts: Today Sans has a funny-looking 'w', Syntax has a funny-looking 'x', etc. So if I see a logo written in Today Sans, but it doesn't contain a 'w', I can't recognize it! Similarly, because the cards only contain the one sentence, which is entirely lowercase except for the 'T', I can't identify any fonts from an uppercase writing sample. What I can do is say, "Hmm, I don't know what that font is, but it definitely has a lot in common with Myriad..." and then I look it up and find out that the actual font is Warnock, which was designed by the same guy (Robert Slimbach) who designed Myriad.
So yeah, Anki is pretty cool, but an unintended side effect is that it can give you a striking sense of how a classification algorithm "feels" from the inside. :)
My personal favorite webfonts for modern web typesetting are Proxima Nova (commercial) and Roboto (free, from Google Fonts).
There are a lot of fonts out there (~50,000 families according to random Quora people). Their distribution is probably power-law-like even if you discount the ones that are preinstalled on major platforms. It might make sense to recognize a few if you want to be able to really deeply discuss the difference in how they are used for design, but just recognizing them doesn't seem like the right way to gain that understanding. If you repeatedly perform a task where you have to recognize a font, learning only the top 100 won't help you much since it will eventually become pretty obvious. If you don't do that task, then why train for it instead of looking it up as necessary?
My thinking on "what's worth flashcarding" is that there are two major categories where it makes sense. First, if you need to remember a bunch of specific facts and you will need to recall them more quickly than they can be looked up. This is the case for things like tests, but there are also reasonable possibilities for this in real work (for example, if you are a programmer you may know you are going to need to look up the parameter ordering of a standard library method that you use only once a month, or you could memorize it).
The second is where you are using the flashcards as a scaffold, but the actual knowledge is something that references or brings together the facts that are contained in the flashcards. Recognizing fonts fits into this category, but I have a hard time imagining that actually recognizing them is the knowledge that is most efficient. Instead maybe you should be studying the major categories of fonts, features of fonts, or something that would help you make quicker decisions for whatever the real task is. I used to be able to recognize a lot of fonts and it's basically only useful as a parlor trick.
Although if you are new to design then learning the top 20 or whatever could be helpful to just have a basic fluency with Arial vs Times New Roman vs Comic Sans, so you have a shared vocabulary to discuss with others. "It's like Times New Roman but more suited for headlines and all caps" for example.
This is the weird confluence of work I've done at multiple companies (in one case I basically implemented a SRS like Anki with applications to finance exams, and in another I did a lot of work with fonts for a laser cutting design editor).
I highly recommend it if this kind of thing interests you. It gives a solid overview of the theory and practical lessons from his daily use of Anki over the last few years:
Thank you so much!
Some quite nice deck ideas:
+ Facts about friends
+ Standard Library of your programming language
+ Bird Voices
Additionally I'd say even if you succeeded in memorizing it this way, it's not making you a better problem solver, which is what actually matters for that particular subject; you're just (temporarily) better at regurgitating some lines of code.
I wouldn't really know for sure; I took the long way 'round (time and experience) for gaining that skill. But it at least seems plausible. Of course, the linked article isn't this.
This is mostly for computer science / programming. If I used anki for learning a language, I would just mass create all my cards at one time or use a curated deck.
Don't get me wrong, I love using Anki and AnkiDroid, but adding cards is a PITA. I only add cards on desktop Anki because neither AnkiDroid nor the Anki web app supports easy image formatting. But I do add cards from AnkiDroid if it's a picture of some handwritten/whiteboard drawings I've made.
E.g. you learn and retain the information far more efficiently if you take the route of read -> write note -> create flashcards from the note. As opposed to autogenerating the flashcards from the source material, for example.
The process of formulating the note first and then formulating the flashcards means you have to actually think about the material in two stages instead of just performing data entry.
In my experience it doesn't take all that much time, and if you're really interested in learning a topic is it unreasonable to expect that you have to spend say 10% longer with any given book or article to perform this review process? If you spend a bit of time up front creating a structured approach and stick to it, it'll become very quick. Here's an interesting article as a reference: https://robertheaton.com/2018/06/25/how-to-read/
The real answer, though: it's a Microsoftian's (that's not a word) website: https://lamport.azurewebsites.net/
In case anybody is interested, there is a nice talk by Hillel Wayne on YouTube (https://www.youtube.com/watch?v=_9B__0S21y8) that provides a high-level overview of what TLA+ is about.
I feel TLA+ would be too much to ask
Two sender/receivers (sending messages back and forth) and a data bus. For some reason, we seemed to be getting bad data (but not consistently). My "debugging" was actually more like "bugging". I took a correct TLA+ spec, and weakened constraints on different parts until I recreated the behavior we were seeing (it was the data bus). But the nice thing was being able to show that the particular bug couldn't happen from the sender/receivers. Their interaction with the data bus was correct (per the specification), and they were the only parts I could directly inspect (as I didn't have the files on how they implemented the data bus itself).
Once I found the right constraints on the data bus to weaken, I recreated the errors we were seeing in the model itself. This led to devising several more test programs that could more reliably produce the error. From there we were able to better communicate the problem with the contractors involved and get things corrected ("proof" that we weren't the problem).
A particularly nice thing was being able to model abstract versions of the system. I didn't need the fine details (what message specifically is being sent, didn't matter). But I also found I needed more details than my first pass and was able to refine the data bus specification (in TLA+) as needed to provide the necessary level of detail and extend it to add new capabilities.
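For flavor, the "weakening" step looked roughly like this; a made-up TLA+ fragment in that spirit, not the actual spec:

    ---- MODULE BusSketch ----
    EXTENDS Sequences
    VARIABLE bus

    \* Original assumption: the bus delivers messages intact.
    ReliableSend(m) == bus' = Append(bus, m)

    \* Weakened version: the bus may also corrupt a message, which lets
    \* the model checker reproduce the bad behavior we observed.
    LossySend(m) == \/ bus' = Append(bus, m)
                    \/ bus' = Append(bus, [msg |-> m, ok |-> FALSE])
    ====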
Is also a good introduction. That was enough to make me dangerous. Since work didn’t have an interest in training me, and I had other obligations, that’s all I’ve done so far.
1. Helping us design features whose requirements are vague. The more hand-waving required to explain a particular feature the more likely we are to use TLA+ to model our assumptions and verify our understanding. This has led us to ask some interesting questions of our design team to help us build a better feature.
2. Requirements that are really hard that we need to ensure are implemented correctly. We use TLA+ to ensure the properties and invariants are correct with respect to the requirements and validate our model. This is really helpful in the case of concurrency and consistency. For our application we're using event-sourced data and it's imperative that our event store is consistent in the face of concurrent writers, can be replayed in a deterministic and consistent order, and that our assumptions will hold between versions of the events.
Thanks! Please do consider writing a blogpost, I'm sure many here would appreciate it.
"The key to performance is elegance, not battalions of special cases."
Elasticsearch - for searching/recommendations
Redis - hot data (certain data is only kept in redis)
Postgres - for the rest of data
Clickhouse - analytics
Most of the system is written in Go.
The whole system was tuned for performance from day one.
As to latency, data from the last 21 million requests today:
p99: 17.37 ms
p95: 6.86 ms
avg: 2.37 ms
I'm not actually sure, just guessing...
Balancing between user experience and cost.
Why should adding features make an app crumble on a single server?
I think the point is that good software is able to serve a lot of users on a single server.
A great example imho is Blender. Features are added constantly but because the software is modular it doesn't have any impact on the overall performance.
Today the problem is that adding features means: adding the latest and greatest lib while having absolutely no idea about the inner workings.
Yes it takes time to write your own libs. But when performance is an issue you will either have to write your own lib or take one that is good and tested.
It's not inherent, but obviously as you have more developers working on more and more things independently, each with different needs, tolerances, and deadlines, it becomes increasingly unreasonable to presume it can all be managed well on a single box.
If they went and added chat, or Twitter-like features, or subreddits, or similar, it might be a lot tougher to keep it all on a single box. It's a lot easier when we're all looking at the same top 30 stories, and pretty limited in how we interact with them and each other.
As someone who used to hack on Blender all I can say is it's a big ball of inter-dependent modules all with interlocking dependencies. The only thing that really keeps it manageable is the strict adherence to MVC which, I suppose, does make it modular.
Most of the work is server side, e.g. voting ring detection. We only notice indirectly, when the quality of the site goes up.
Some feature requests were a matter of days, like user profiles: https://news.ycombinator.com/item?id=481
A brief essay on some HN design decisions: https://pastebin.com/bSW5dfRQ (from https://news.ycombinator.com/item?id=8424502)
I think the Arc codebase is worth studying and understanding, primarily so that you can extend its simplicity into your own projects. The reason HN was such a success is that it handles so many cases in the same way: stories, comments, and polls are all the same thing: items. If you want to add a new thing, you just create a new item and add whatever fields you want.
These rapid prototyping techniques have downsides, but the cure is to keep in mind what you can't do. (For example, you can't rename item keys without breaking existing items, so be sure to choose good names.)
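A rough sketch of that "everything is an item" idea (in Python here, since the real thing is Arc):

    import itertools

    _ids = itertools.count(1)
    items = {}  # one flat store; no per-type tables or schemas

    def new_item(kind, **fields):
        # Stories, comments, and polls differ only in which fields they
        # carry; renaming a key later would break existing items.
        item = {"id": next(_ids), "type": kind, **fields}
        items[item["id"]] = item
        return item

    story   = new_item("story", title="Show HN: ...", url="http://example.com")
    comment = new_item("comment", parent=story["id"], text="Neat!")
    poll    = new_item("poll", title="Tabs or spaces?", parts=[])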
Like emacs, HN's design is born of simplicity and generality. It's what you get when you write the next most important feature as quickly as possible, then cut as much code as possible, every day. Both halves are equally important.
It's fine to say that our modern applications are so much more complicated that the old lessons don't apply. And in extreme cases, that may be true. I don't think SpaceX has the luxury of rapid prototyping their software.
But the typical app is CRUD. For those, data processing flexibility is perhaps the most important factor in whether you can write new features quickly. And since code is data, a lisp master can write systems with a shocking number of features in shockingly few lines of code. (See Jak and Daxter: https://all-things-andy-gavin.com/2011/03/12/making-crash-ba...)
Are you talking about the source code for Arc, or for Hacker News?
It would be interesting to see how Arc is being employed on such a high-profile site. I expect that some algorithms won't be freely available, so as not to enable people to game the site, but the rest would be interesting to see.
I couldn't find any source for Hacker News, though.
> And since code is data, a lisp master can write systems with a shocking number of features in shockingly few lines of code
You don't even have to be a master. So much time has been spent bikeshedding over the decades. We are still going back and forth on data interchange formats...
The current HN code (as in, the actual code that delivers this comment) isn't open, AFAIK (partly because of the shadow-banning, filtering, etc. code).
But there's a full "news" site in Arc source, both the old version and a more maintained/evolved one:
pg's and rtm's original arc3.1 is on the "official" branch: https://github.com/arclanguage/anarki/tree/official
news.arc is the old HN source code. You can run it by following the steps in how-to-run-news.
(Run it with "mzscheme -f as.scm" though, not mzscheme.)
That they keep changing things (often against the express complaints of their users) is no reason to excuse them a bad experience.
Incomplete. They are fronted by one of the largest CDNs in the world, on whom they rely for most traffic.
Well, I actually don't know the numbers, but I do know that for popular posts HN admins (a) try to break the conversation up over multiple posts and (b) plead with users to log out so as to allow Cloudflare to handle the load.
The biggest lesson HN teaches for designing large scale systems is "use a large scale system someone else has already designed".
They now have a few more bits:
I guess the advantage of not relying on your site for profit is that you can afford to not worry about a bit of down time so much...
When HN breaks into the top-100 for the world, let us know what it runs on.
Racket 6.1.1 with some HN and FreeBSD specific patches.
2x 3GHz Intel Xeon-IvyBridge (E5-2690-V2-DecaCore)
8x 16GB Kingston DDR3 2Rx4
9x 1000GB Western Digital WD RE4
2x 200GB Smart XceedIOPS SSD
CPU: Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz
You can do all of it locally (except multi-zone and multi-region reliability), but it would take a lot of configuration and skill, and you wouldn't have the same GUI for everything.
AWS is not cheap but companies don't care, because without it, you have to pay to find experts and keep them happy, which is a lot harder.
Some older discussion on the topic here: https://news.ycombinator.com/item?id=3165095
Nobody is confused as to what a "system administrator" is, even though technically the word "system" itself can have a much broader range of meaning.
Large scale systems come in many different shapes and forms; this is an instance of one of them. Its learnings are interdisciplinary and cross-functional, but this isn't the roadmap for other types of systems, especially asynchronous reactive systems.
Large scale usually means some aspect of the business is focused on catering to developers, because the systems have become so complex that they require some form of automating the existing automation.
Are there cases this design does not work for?
The "only catches" are the developer(s) need experience working in multi-threaded C++, and they need to understand the traditional web stack they are eliminating.
Again, what are the traffic and bandwidth loads like? Peak and average values? What kind of data are you planning to store: small values in huge volumes, or the opposite? A lot will change based on your system requirements.
I won't have millions (realistically not even thousands) of users and the database will be comparatively small.
I've looked at NDB Cluster, but it feels quite complicated to set up and maintain.
All of the online and print material about such things focuses on how to achieve massive scale correctly. Don't get me wrong; this is valuable and, generally, sound advice.
However, it also ignores the majority of use cases for software.
I would love to see a blog post here from someone who has solved a very specific problem for a very small audience, and gotten a very enthusiastic response. That would be meaningful on a larger scale for me.
Real world system design is dirty. Mostly this is due to constraints (time, cost, etc). And no one starts with zero architecture and 10 million users.
Guides like this serve no purpose other than to fatten vocabularies and promote the "brand" of people who aren't actually doing the work (speakers, educators, etc).
Surfacing things like this elevates the entire practice, since it illuminates what that "dirty" work looks like.
Yep. They're often ghostwritten, too.
Start out small, make efficient systems, and keep scalability in the back of your head while doing so. Don't do what so many others do: "Oh, this lib seems popular, let's just use that! Heck, the cart sometimes takes 8 minutes to load, we need to add more nodes on AWS!"
Yeah, stuff like that happens.
At least in my book optimization usually beats scalability as the place to start for more performance.
I still recommend people read Fielding's REST thesis, as it demonstrates a lot of possible architectures (e.g. fat clients, or what we today call SPAs), not simply REST, along with some trade-offs. (REST is mainly motivated by the simplicity of a hypertext application coupled with easy multi-level caching.)
And keep in mind the text is from 2000. Early Ajax (XMLHttpRequest) was introduced in IE in 1999, and in Mozilla in late 2000, but it took a while for Ajax to become standardized...
Today's post is way more in-depth. Good follow-up indeed.
Using they/their to refer to a single person doesn't come naturally to me, as English is my second language and we're taught it's plural. (It can indeed be used as a third-person singular, according to the Oxford dictionary.)
Since it's the "least bad" (to my ears) of the gender-neutral pronouns on the wiki page, I'll try to use they/their instead.
At first look, seems like these are fairly general questions, which is great.
It also gives you a way of controlling which queries the API servers use, preventing a developer from doing silly things and creating a production outage.
> It also gives you a way of controlling which queries the API servers use, preventing a developer from doing silly things and creating a production outage
Also called database roles. Do your DALs have full database admin credentials?!
> mode where it queues all the API requests, allowing you to do updates/upgrades to the database with no downtime.
This is just a bad idea. Better idea: unless you are rewriting your entire schema from scratch, you should be able to use database views, database triggers, extra/duplicated columns and tables as you make schema swaps.
Is that a performance burden? Yes, though it is temporary and a lot less of a burden than a whole 'nother layer of indirection. Does this also allow the really nice feature of not stopping your entire system to change schemas? Yes. How about allowing testing new schemas in production piecemeal? Yes.
After a while of slowly modifying a database and lumping more crap onto it, it becomes a slow PITA that everyone is too afraid to touch, and the usual result is to lift everything onto a new DB. DALs make this easier, but I agree that this should in no way be the point of a DAL. The point should just be to simplify and improve access to the database.
The complexity of the modern stack is ridiculous. You run Java containers inside Docker containers inside virtual machines and call it optimized.
Even with tools like Liquibase, the more functionality you put in the database (views, stored procedures, triggers, etc.) the harder it is to do deployments and rollbacks and keep the code and the database functionality in sync.
> keep the code and the database functionality in sync.
At any time there should be only a fixed number of versions of the code (ideally two: Production and Stage; and maybe a half more, Development, if things go really sour). Hence the overhead for supporting that fixed number of versions should be relatively constant. When a system is done and moved to maintenance mode, you remove all of your temporary functionality and get the database back to its optimal form for the current code.
Obviously it gets tricky when you are running multiple products on the same database, or a very large database. But I don't see what a DAL delivers in those cases that clean, well-documented code and existing database features don't.
Stable App w/ old schema -> add functionality to support new schema -> add new code w/ new schema -> add functionality to support old schema, migrate to new schema -> remove old app w/ old schema -> remove now vestigial functionality supporting old schema. Supports multiple versions of the code and supports rolling updates. If it's hard to keep track of that I don't know how to help you.
I LOVE using views early on in a new application's schema as it allows me to evolve the logical model separately from the physical model, and once I've coalesced on something I like it's easy enough to swap the view with a real table and my application code higher in the stack is none the wiser.
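A tiny sqlite3 illustration of that view-first approach (all names made up):

    import sqlite3

    db = sqlite3.connect(":memory:")
    # Physical model I'm not yet sure about.
    db.execute("CREATE TABLE raw_users (id INTEGER, full_name TEXT, plan TEXT)")
    db.execute("INSERT INTO raw_users VALUES (1, 'Ada', 'pro')")

    # Logical model the application codes against.
    db.execute("CREATE VIEW users AS SELECT id, full_name AS name FROM raw_users")
    print(db.execute("SELECT name FROM users").fetchall())  # [('Ada',)]

    # Later: promote the view to a real table; the query above is unchanged.
    db.execute("DROP VIEW users")
    db.execute("CREATE TABLE users AS SELECT id, full_name AS name FROM raw_users")
    print(db.execute("SELECT name FROM users").fetchall())  # [('Ada',)]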
Even Facebook at one point relied on MySQL triggers to keep its memcache fleet synced.
And you never know who's using what.
In one of the cases where I had to switch, we swapped from Cassandra to S3 for 100x OpEx savings, since C* couldn't scale cost-effectively to our needs; we rolled a database on top of S3 instead that well outperformed C* for our use case (e.g., need to export a 3B-row CSV in a minute?).
If it's easy to do this, then you are using a tiny fraction of Postgres.
If you want it to be easy to switch your database then you need to code to the lowest common denominator. I would rather use my databases to their fullest potential, rather than purposefully handicap myself because I might have to change it in the future.
I used to design systems so this was possible, but eventually realised it just wasn't needed - I was adding more abstraction and complexity for no reason.
I took it as changing the DBMS under the hood.
Anything that requires a fleet of (relational) databases to ensure consistency will not work on a global scale.
One well-designed, fast app server can serve 1,000 requests per second per processor core, and you might have 50 processor cores in a 2U server, for 50,000 requests per second. For database access, you now have fast NVMe disks that can push 2 million IOPS to serve those 50,000 accesses.
50,000 requests per second is good enough for a million concurrent users, maybe 10-50 million users per day.
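Spelling out the arithmetic (the per-user rates are assumptions, just to show the orders of magnitude):

    req_per_core_per_s = 1_000          # well-designed app server
    cores = 50                          # one 2U box
    rps = req_per_core_per_s * cores    # 50,000 requests/second

    # If an active user sends roughly 1 request every 20 seconds:
    concurrent_users = rps * 20                       # ~1,000,000

    # If a daily user generates ~100-400 requests/day:
    requests_per_day = rps * 86_400                   # ~4.3 billion
    daily_users = (requests_per_day // 400, requests_per_day // 100)
    print(rps, concurrent_users, daily_users)         # ~10-43 million/day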
If you have 50 million users per day, then you're already among the largest websites in the world. Do you really need this sort of architecture for your startup system?
If anything, you'd probably need a more distributed system that reduces network latencies around the world, instead of a single scale-out system.
I'm seeing about twice that on highly dynamic PHP pages with ~10 reads/writes from/to MariaDB (running on the same machine).