
Couldn't agree with this article more.

I built the biggest social network to come out of India from 2006-2009. It was like Twitter but over text messaging. At its peak it had 50M+ users and sent 1B+ text messages in a day.

When I started, the app was on a single machine. I didn't know a lot about databases and scaling. Didn't even know what database indexes were or what their benefits were.

Just built the basic product over a weekend and launched. Timeline after that whenever the web server exhausted all the JVM threads trying to serve requests:

1. 1 month - 20k users - learnt about indexes and created indexes.

2. 3 months - 500k users - Realized MyISAM is a bad fit for mutable tables. Converted the tables to InnoDB. Increased the number of JVM threads for Tomcat.

3. 9 months - 5M users - Realized that the default MySQL config is for a desktop and allocates just 64MB RAM to the database. Set up the MySQL configs. 2 application servers now.

4. 18 months - 15M users - Tuned MySQL even more. Optimized JDBC connector to cache MySQL prepared statements.

5. 36 months - 45M users - Split database by having different tables on different machines.
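To illustrate the first step in the timeline above, here is a minimal sketch of what adding an index changes. It uses Python's built-in SQLite as a stand-in for MySQL, and the `messages` table, its columns, and the index name are hypothetical, not taken from the original service.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO messages (user_id, body) VALUES (?, ?)",
    [(i % 1000, f"msg {i}") for i in range(10_000)],
)


def query_plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


lookup = "SELECT body FROM messages WHERE user_id = 42"

# Without an index the planner has to scan the whole table.
plan_before = query_plan(lookup)

# With an index on the filtered column it can search the index instead.
conn.execute("CREATE INDEX idx_messages_user_id ON messages (user_id)")
plan_after = query_plan(lookup)

print(plan_before)
print(plan_after)
```

The exact plan text differs between database engines and versions, but the shift from a full scan to an index search is the effect described in step 1.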

I had no idea or previous experience about any of these issues. However I always had enough notice to fix issues. Worked really hard, learnt along the way and was always able to find a way to scale the service.

I know of absolutely no service which failed because it couldn't scale. First focus on building what people love. If people love your product, they will put up with the growing pains (e.g. Twitter used to be down a lot!).

Because of my previous experience, I can now build and launch a highly scalable service at launch. However the reason I do this is that it is faster for me to do it - not because I am building it for scale.

Launch as soon as you can. Iterate as fast as you can. Time is the only currency you have which can't be earned and only spent. Spend it wisely.

Edited: formatting

> I know of absolutely no service which failed because it couldn't scale. First focus on building what people love. If people love your product, they will put up with the growing pains (e.g. Twitter used to be down a lot!).

They haven't failed (yet), but I think gitlab.com would be a lot bigger if they scaled faster. Lots of people have rejected using it because it was too slow.

Would it be as big if they had stopped to consider all scaling issues at the beginning?

In 2015 we only had the people to focus on one thing. We opted to focus on the downloadable product. This grew our revenue much faster than GitLab.com. We started to focus on .com in 2016 but, in hindsight, not aggressively enough. Now it is one of our top three priorities.

You could argue it would be bigger. Instead of working around Travis to connect to repositories, the repositories are there. That's the biggest sell for me, anyway.

(Note: I agree w/ the OP + commenter whole-heartedly)

The other counter-example I can think of is MySpace, which frequently had tons of server errors and page load errors, alongside failing to iterate on its product rapidly.

gitlab.com is not their core business. It's mostly a demo.

In 2015 we saw it as our demo. In 2017 we see it as one of our two main products and are improving performance accordingly. 95%+ of our revenue is still coming from licenses instead of GitLab.com. But over time this will change.

That's not how people perceive it, though.

How people perceive it is not relevant if it actually is a side business

How people perceive it may be impacting your main business.

Only they would have the data to be able to determine that, if anyone would have that data at all.

I built a service (~10mm users at peak) which was designed from the ground up to scale, around 2002. When the numbers did grow we just sat back and watched it pretty much.

Even given this, I completely agree with you. That's why I now develop in ruby. I'll take developer productivity over performance any day.

It's the old saying - 'nice problem to have'.

> That's why I now develop in ruby. I'll take developer productivity over performance any day.

In my experience, dynamically typed languages don't do well for developer productivity.

> In my experience, dynamically typed languages don't do well for developer productivity.

At least one empirical study suggests static typing does not improve programmer productivity. Source: http://courses.cs.washington.edu/courses/cse590n/10au/hanenb...

This is just one of the many common industry "wisdoms" that collapse under scientific scrutiny.

This study looks bad across the board. For one, they're writing a parser for a context-free grammar. That's not a good test for the benefits of static typing which often happen with very different types of data mixing, modifying large programs with less breakage, and so on. Contrary to your statement, there's a lot of reports and articles out there where switching to formal specs or a stronger type system reduced the number of defects. The first were in high-assurance where formal specs caught inconsistencies left and right. Later on, military and safety-critical firms looked at defect rates of C, C++, and Ada to find Ada's type system knocked them to half. This counter-example is about something so straight-forward there's many tools that can automate it with one formally-verified for correctness (w/ strongly typed language).

Next, I look at the languages themselves. Java's type system is so weak as to barely have any benefit. Terrible choice to test static typing. The control language is then a language based on Smalltalk... the original, great language for dynamic, OOP, and complex programs... that's even simpler than Java & could possibly boost productivity just due to easier review/revision. A proper comparison would be with something like Ocaml, Haskell, Ada/SPARK, or Rust that really uses a type system with the high-level & composition benefits of something like Smalltalk. Makes me lean toward Ocaml or Haskell rather than the others, which slow productivity through fighting with the compiler.

This paper doesn't prove anything for or against the topic. It's quite weak. The authors should do their next one on the kind of program that others in the field reported benefited from static, strong typing. Then, attempt an experiment on same type of software between a productive language with powerful, type system and a productive, dynamic language of similar complexity. Alternatively, try to modify similarly large applications in parts that touch a lot of the code base without breakage for both static and dynamic languages. Then, we might learn something.

That's a cool experiment, but from a quick read they are creating a compiler which is a bit different than web development. Looking at the related work all seem to contradict this finding and suggest preference to statically typed languages, such as one that looked at API development.

From my own experience with dynamic types it's not the dev time that increases but maintenance time. When you have a team working on an evolving codebase with data types distributed in caches, databases, and code it seems inevitable that you will have an error if you don't do any type checks. Even good linters can miss something that would have been easy to catch with static type checks.

"That's a cool experiment, but from a quick read they are creating a compiler which is a bit different than web development."

They're doing a lot of things wrong. See my reply.

Having worked on a corporate Python platform with tens of thousands of developers and hundreds of applications running hundreds of thousands of instances on a many-million-line common core code base, I beg to differ.

We were sometimes able to go from concept to deployed application in a single day.

Having worked on corporate Python platform with tens of thousands of developers and thousands of applications running on thousands of instances on millions of lines common core code base, I beg to differ.

It is easy to ship any proof of concept. It is hard to maintain, it is riddled with performance issues and it is almost impossible to refactor because Python has zero refactoring support and no compiler to help you catch any error.

What you save on the initial rollout, you lose tenfold in the long term.

But that is basically the point of the whole article and this thread, that it's great for proof of concept that morphs into production when needed. If you can deploy in one day that means you can test 5 ideas/week until one catches on (hopefully not that long). Once one catches on, you can actually put more effort into it and optimize it as needed and not the other way around.

You can't try 5 products or features a week on real users. They don't keep up.

If you're looking to Facebook for your user experience guidelines, you're going to end up angering a lot of actual users.

While making millions to billions dominating a market I own. So, sure. Other circumstances it would just teach you can experiment more than a few things a week.

Facebook have thousands of products and billions of users. You do not ;)

Now that's a better point. Obviously, the amount of changes are tied to company's circumstances.

> We were sometimes able to go from concept to deployed application in a single day.

This just makes me shudder a little bit. I've seen such things before, even recently at my current job. These kinds of applications tend to be a nightmarish mix of unmaintainably overengineered and underdesigned terrible code.

It's not ideal certainly, most such cases would be either a variation of an existing application or something that isn't business facing such as a compliance or operational metrics report.

The key thing is that the testing, release and deployment tooling were not a significant bottleneck. You could develop the code, perform testing in a production like environment and sign it off with a minimum of down time in between waiting for compilation, package building, etc. So a lot of that was down to Python being interpreted.

They do well for single-person code bases, but velocity scales badly with the number of developers in my experience.

I have the opposite experience and I did a lot of back and forth between statically typed and dynamically typed languages over the past decade so I'm not biased in either direction.

And in my experience the exact opposite is true.

That's why it's called an anecdote; not data.

Exactly, I never understand the static vs dynamic flamewars. Most of the issues usually presented stem from poor developers or poor development processes/environments/tools.

Poor code written in JavaScript won't become pretty just by translating to Java, and vice versa. That I've seen anyway.

Many of the "poor development processes/environments/tools" issues are related to choosing dynamic languages in a "ship first, design later (maybe)" paradigm, though.

"Ship first, design later" is completely orthogonal to what language one's using. Using a language that forces one to write int or str before the variables isn't preventing anyone from writing software without having a design.

It's not even part of the "static vs dynamic" argument; one can rush things in Java or in C++ just as easily. The only difference is that Java might allow doing it slightly faster.

As someone smarter than me put it: "Bad code is language agnostic"

Yes, catching type errors eliminates one category of error but, generally, the type of programmers that make that kind of error frequently will not constrain themselves to just one category of error.

In many situations these days, I'd argue you don't have to trade one for the other.

>I built a service (~10mm users at peak) which was designed from the ground up to scale, around 2002. When the numbers did grow we just sat back and watched it pretty much.

>Even given this, I completely agree with you. That's why I now develop in ruby.

Ouch that is a pretty big burn.

> I know of absolutely no service which failed because it couldn't scale.

I would say this is because of the simple fact of visibility and adoption. You've probably never heard of these services because they ground to a halt with a mere 1000 users, so they never got mainstream enough to be recognised as a viable service. It is a bit like how no one remembers the dozens of people who failed to achieve sustained powered flight before the Wright brothers. That doesn't mean there weren't any; scalability, technical or planning issues killed those efforts before anyone knew of them.

Some 'scalability' issues are inherent to your initial design, and not just the choice or configuration of your hardware/software platforms.

For instance, what if you were building a contact database of some sort. At first, you may have things like 'Phone Number' and 'Email Address' as part of the 'Person' database. Then, as your service gets popular, you notice people asking for extra contact like Twitter handles, LinkedIn pages etc. So you start adding those to your Person table as extra columns.

Eventually, you realise that you should have thought more about this at the outset and have contact details stored in another data table altogether, linked to a 'Contact Type' table and related back to the Person table. This would have been mitigated at the start via better database design and catering for eventualities that you might never have foreseen. Migrating the original database to normalise it is a massive effort in its own right, and probably will take more time, cause more outage time, and cause more bugs in existing code than designing for that eventuality in the first place.

Even if 99% of your users only ever enter Phone and Email contact details, the second option, designed for scalability, will still handle that without a sweat, and 'scaling' to meet additional demands later is merely a matter of adding new contact types in the 'Contact Type' data table so that they become an extra option for all your users.
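The normalised layout described above can be sketched roughly as follows. This is only an illustration under assumptions: it uses Python's built-in SQLite, and the table names, column names, and helper functions (`add_contact`, `contacts_for`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE contact_type (id INTEGER PRIMARY KEY, label TEXT NOT NULL UNIQUE);
    -- One row per contact detail instead of one column per kind of contact.
    CREATE TABLE contact_detail (
        person_id INTEGER NOT NULL REFERENCES person(id),
        contact_type_id INTEGER NOT NULL REFERENCES contact_type(id),
        value TEXT NOT NULL
    );
    INSERT INTO contact_type (label) VALUES ('Phone Number'), ('Email Address');
    INSERT INTO person (id, name) VALUES (1, 'Alice');
""")


def add_contact(person_id, label, value):
    """Attach a contact detail of an existing contact type to a person."""
    type_id = conn.execute(
        "SELECT id FROM contact_type WHERE label = ?", (label,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO contact_detail VALUES (?, ?, ?)", (person_id, type_id, value)
    )


def contacts_for(person_id):
    """Return {contact type label: value} for one person."""
    rows = conn.execute(
        """SELECT t.label, d.value
           FROM contact_detail d
           JOIN contact_type t ON t.id = d.contact_type_id
           WHERE d.person_id = ?
           ORDER BY t.label""",
        (person_id,),
    )
    return dict(rows.fetchall())


add_contact(1, "Email Address", "alice@example.com")

# Supporting a new kind of contact is a data change, not a schema change.
conn.execute("INSERT INTO contact_type (label) VALUES ('Twitter handle')")
add_contact(1, "Twitter handle", "@alice")

print(contacts_for(1))
```

The trade-off is an extra join on every read, which is part of why other commenters in this thread argue for starting with the simpler table.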

I am willing to bet that 9 out of 10 'weekend projects' have had to be thrown out completely and redeveloped from scratch when the number of users became significant. Of those rebuilds, I would be interested to see some research into how many users abandoned the said platform when (a) the original one started to grind to a halt or constantly fell over with errors and (b) the new platform came out with new features or a different UX that broke the 'look and feel' of the original.

I mentioned something similar in a separate comment.

My initial DB schema was pretty bad. We did at least 2 schema rewrites and migrations from the launch to 5M users. Each time it took 2 weeks of sleepless nights.

The machines today are really powerful. You can do a lot with 244 GB RAM machines backed by SSD.

Someone who doesn't have the skill set to be able to scale once they get traction - it's likely they will not have the skills to design for scale at start.

My recommendation to everyone would be pick a language and db you are most comfortable with and get started as soon as you can. You will fail on the product side a lot more times before you will fail on the technical side.

And if you are failing on the technical side, reach out to me. I will definitely be able to help you find a way out. I am not sure if there is any product guy in the world who can make a similar claim on the product front. However there are at least dozens of technical guys in the world who can make a claim like I did.

So focus on launching the product as soon as possible. Work hard, reach out for help if needed. You will eventually get success.

>> You can do a lot with 244 GB RAM machines

Is this a typo of "244 GB" instead of "24 GB"? Nearly any company that has a single machine provisioned with 244 GB of RAM is doing something severely wrong, likely putting the company's ability to grow at risk. Such a machine screams of trying to vertically scale a poorly performing legacy product instead of figuring out how to scale out horizontally with 16-64 GB servers.

That much memory on a single server is a huge red flag for 95-99%+ of companies. It takes a very specialized system (ie: you probably don't fit the mold, no matter what your excuses are) to require such a server.

Pretty sure it's not a typo. Amazon i3.16xlarge is 488 GB of Ram, 64 vCPUs, and a 20 Gigabit connection. Postgres/Mysql can absolutely scream perf-wise on that.

AWS also has 2 TB instances: https://aws.amazon.com/ec2/instance-types/x1/

Outside AWS, you can put 3-6 TB in a normal-ish server from Dell or HP, or 64 TB in a big iron box from Fujitsu or IBM.

+1 - it wasn't a typo. DB machines benefit a lot from huge RAM.

I know Basecamp uses a server with 2TB of RAM for their single MySQL server. FYI CTO of Basecamp is the creator of Ruby on Rails -- smart people.

Check out Stack Overflow's stack too.

Their system is faster too.

That is actually insane. The day their MySQL cluster experiences a severe downtime incident, it is likely going to take an enormous amount of time to recover.

That is my intuition. If you know of a video or blog post (conference, talk, article, etc.) where someone explains the benefits of a 2 TB MySQL server and how it is not a crazy bad idea, I would love to see it. Because my 15 years of professional experience screams "NO WAY IN HELL" at that one.

Keeping all data cached in RAM is something that eats a bunch of memory, but is not usually wrong.

> I am willing to bet that 9 out of 10 'weekend projects' have had to be thrown out completely and redeveloped from scratch when the number of users became significant.

9 out of 10 weekend projects never get to a significant number of users.

Probably more like 99/100 or 999/1000 but yeah. Also, you can get pretty damn far with PHP and MySQL. Just look at Facebook. Just don't rely on Drupal or Wordpress to get you there and it'll be fine. KISS and solve the scaling issues as they come up. Far too many projects focus way too much on infrastructure and architecture and micro services instead of building something that solves a real problem for users. Focus on that first. It doesn't matter if the tech behind it is a bunch of bash scripts if it does something that is really useful to a lot of people.

An alternative view on history: Facebook succeeded because the competitors failed to scale.

Do you remember MySpace? The big social network with hundreds of millions of users that came before Facebook.

They had massive scaling and performance issues. At the peak when everyone was moving to social media (almost a decade ago) the site could take an entire minute to load (if loading at all). They lost a lot of users, who went straight to facebook and never recovered.

I was a big MySpace user and late FB user, and I can tell you that it had nothing to do with [tech] scale. Back then I didn't know about programming (my first lines of code were some CSS in MySpace!) and the real reason is that users were becoming more hostile in a site that was a niche on its own. You had to start filtering out and removing SPAM all the time from your public wall. The fact that people could personalize their site and the community (heavy on music and alternative people) put most of the people away.

You can say that it failed to scale, but I wouldn't say that it failed to scale technically; it failed to scale the community. Then at some point when it was already dying they started to change the layout big time I guess to attract more people, alienating the initial community.

MySpace and Friendster have been widely discussed in other comments in this thread.

I agree, they were different from facebook enough, that it's not all about scaling. Yet, don't underestimate the impact of having your site unreachable, when the competitor is coming strong. Not a good position to be in.

Drupal is for scaling, learn the architecture and you see it.

indeed, I would go so far as to say 9/10 weekend projects never get finished, let alone released, let alone getting enough users

You're refusing to learn from someone who lived it. I also know of zero services which failed because they couldn't scale their technology. But I know of 100s that failed because they couldn't get enough users or usage.

Your example about the DB tables is exactly the trap to avoid while you are iterating rapidly to try and find product/market fit.

You are assuming that I haven't "lived it". Over the past 2 or 3 years, I have built around 8 web apps. Some which got virtually no traction at all, and some which have reached a happy medium of users and income. None which have reached mega scale or millions of users (yet).

Some of the (real world) feedback I got from the web apps that failed to get off the ground were due mainly to our customers complaining that:

* pages were taking too long to reload.

* not enough fields on a particular data table to store information in

* missing API

* pages that would refresh in entirety instead of just updating the changed portions

Most customers didn't stick around to wait for us to address or fix the issues - there were plenty of other competitive products that fitted them better that they could start using that same day.

A couple of those sites also got negative feedback on Reddit or HN because I hosted the front end website on a $5 VPS server and when the site suffered the inevitable 'hug of death' from posting to these sites, they immediately went down due to inability to scale under the sudden deluge of visitors, and I received uncomplimentary feedback on that (where people actually elected to post feedback rather than just close the browser tab and completely forget and move on from my app).

Yes, over thinking scaling and design is bad, but putting it aside as a 'totally not important now' factor is also just as bad, if not worse, in my 'lived it' experience.

It is great that you got a few which got traction out of 8 tries. This is an incredible success rate.

Most people see much less success. The original suggestion is geared towards first few tries that people make. Once they have made a few attempts, they learn from it and naturally build more scalable products without spending extra effort. You have already read my story about my first try which had a very poor start. My subsequent efforts have scaled in that order without needing a single rewrite.

However, if they spend too long procrastinating, worrying about and investing in scale - they are wasting precious time which would be better spent finding product market fit.

Thanks, and also thank you for not reading my posts the wrong way. I am in no way having a go at you or refuting what you say. You've built something and got it to a level that many of us can only dream of, and for that you have my admiration and respect.

Like you, each subsequent app that I built contained the lessons learned from the previous efforts - better front end frameworks, better data table normalising, better hosting infrastructure etc.

Part of me always thinks back to the ones that didn't work, and I always ask myself - would they have worked if they had faster page refreshes, or more flexible data entry? Perhaps if I had spent more time in the planning and design before I launched them, they may not have sunk as quickly? Those 1000 users that time who came from my HN post and saw an 'Error 500' page - could a fraction of them have been the ones who would have signed up and made us profitable if they hadn't seen the error page as their first introduction to my app?

For that reason alone, I am always skeptical of any post that promotes 'iterate often and fast' or 'fail quickly' rather than 'spend time on design and a unique experience'...

It's all in good spirit of sharing and learning.

When you are thinking back, also consider the alternative, where you could have missed opportunities and product launches because of extra effort spent on scaling upfront.

It's all good learning in the end. It's great to be in your situation. Keep at it and you will see great success in future!

> It is great that you got a few which got traction out of 8 tries. This is an incredible success rate.

That seems fairly reasonable to me.

It all depends on what you call "success". Running a startup from zero to a billion dollar business is an incredible and rare achievement. Running a small site with a few hundreds/thousands users and a decent income is fairly reasonable.

If it was slow for a small number of users, that is a fundamental design or operations flaw, not a scaling problem.

Nobody is saying that performance doesn't matter. But if it performs well with 100 users, you can worry about the performance with 10k users later. And if it doesn't perform well with 100 users, it doesn't matter if the scaling is O(log n) or even O(1).

Performance under a sudden flood of users matters, but a lot less than day-to-day performance. And most of the criticism I see for sites dying is when they're serving static content with ideal O(1) scaling, but in a very unoptimized way.

None of those problems sound like scaling issues.

Absence of evidence is not evidence of absence.

I've been part of many projects that couldn't scale because lack of forethought meant there was a nasty implicit O(n^2) thing going on when it could have easily been designed to be O(n log n).
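The commenter doesn't say what the implicit O(n^2) step was in those projects. As a purely hypothetical stand-in, duplicate detection shows how the same result can be reached either way:

```python
def has_duplicate_quadratic(items):
    """Compare every pair of items: O(n^2) comparisons in the worst case."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicate_sorted(items):
    """Sort once, then compare neighbours: O(n log n) overall."""
    ordered = sorted(items)
    return any(a == b for a, b in zip(ordered, ordered[1:]))


sample = [3, 1, 4, 1, 5]
print(has_duplicate_quadratic(sample), has_duplicate_sorted(sample))
```

Both functions return the same answer; only the growth in work differs as the input gets larger, which is why this kind of cost tends to stay hidden until the data set grows.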

Almost all of these projects were killed with CYA hand waving about "just not fast enough".

Can you please share some examples. I would love to learn from your experience.

Thanks for putting across your point in such a straight forward way.

I didn't directly call out the DB tables point. This kind of premature optimization is a huge red flag and I would tell anyone I am advising to avoid it.

Friendster and MySpace failed because of scaling issues, though it was not scaling per se but bad execution: they chose the wrong technology and brought in the wrong CTOs, who bought very expensive hardware (a few expensive high-end servers with expensive storage instead of many more low-cost commodity servers) and software with very high license costs (ColdFusion on dotNet (experimental) on Windows (high license costs)). That's why both lost the race: they stalled development for a year because of rewrites, which was too long.

Edit: you can read about both project histories, and learn about them. Of course one gets down-voted for mentioning it.

As a counterpoint stack overflow is hosted on windows and did just fine. If your service is printing money you can easily fix tech problems (e.g. Facebook effectively rewriting php), if it's not you may flail around trying rewrites etc but the real problem lies elsewhere.

Tech people are quick to find technical reasons for failure but the reasons are usually elsewhere.

1) Friendster had scaling issues. They insisted on keeping the x-degree-of-friendship calculation, which is quite resource-expensive and doesn't scale. Instead of making the software scalable, they brought in an expensive exec who decided to throw more and more very high-end servers and enterprise storage at it. It ate their startup money, and for at least one year they were constantly firefighting very high page load times like 12+ sec instead of advancing the site. MySpace took over Friendster because of this.

2) MySpace's culture and management were in trouble after Murdoch's News Corporation bought MySpace's parent company. Instead of investing, News Corp chose the wrong path. Instead of improving the site, they decided to do a (in the end) very costly deal with MS to switch their ColdFusion stack from a Java stack to a dotNet stack, incl. very expensive MSSQL licenses and various other license costs. It turned out ColdFusion on dotNet was still in a very experimental phase, and MySpace suffered a lot from the resulting troubles and bled money like never before (which made News Corp very unhappy). The MySpace website got few updates for at least one year while MySpace devs were busy with firefighting and switching the backend. Facebook took over.

There are books worth reading.

3) Stackoverflow (like 2 years ago) ran on fewer than a dozen servers. It's several orders of magnitude smaller than social network services like (former) MySpace, Facebook, Twitter, etc. License costs often don't scale; you bleed through your startup money for few competitive advanced features or little return. That's why successful startups often choose an open source stack; look at Amazon, Google, (former) Yahoo, Facebook, Twitter, etc. - Perl, Python, PHP, Java, Ruby, MySQL, Postgres, Hadoop, etc. Stackoverflow shows it can be done with off-the-shelf COTS as well, if you keep the server license count low by wise decisions and performance tuning. But it's not like I could name dozens of successful startups that have a software stack like Stackoverflow. And even Hotmail, Linkedin, Bing run or used to run for the majority of their service life on a mainly open source software stack.

"MySpace took over Friendster because of this."

Friendster was mostly focused on Asia plus lost a ton of users around the same time Facebook gained a ton of users. I don't think it was scaling architecture that did them in. It was the market choosing the competition for the user experience plus what their friends were on. That's for most of the world. I have no idea what contributed to their failure in Asia since I don't study that market when it comes to social media.

EDIT to add: I recall the founder did say they had serious technology problems for a few years that affected them. I'm just thinking Facebook spreading through all the colleges & moving faster on features was their main advantage.

Asia? You came a few years too late... Friendster was US focused, at least until it lost. Then they probably pivoted, but I wrote about the first phase of Friendster (the one everyone in the US who was active online in the early 2000s remembers).

Hmm. It peaked in Asia with most users in Asia operating from Asia. You don't think the Asian focus could affect its marketing strategy for Americans? Facebook's college angle worked really well after MySpace's express-yourself-in-cluttered-pages approach worked before that. I'm not sure Friendster understood our userbase enough to keep it.

Stack Overflow seems very relevant to this conversation given how unremarkable their architecture is – their level of traffic is comfortably running on 4 MS SQL Server boxes:


I've seen too many breathless posts which would have you believe they'd need a clustered NoSQL database or it wouldn't scale.

>>> how unremarkable their architecture is – their level of traffic is comfortably running on 4 MS SQL Server boxes:

Wrong, and wrong.

First, their architecture is extremely remarkable. They are doing and mastering vertical scaling, down to every little detail. Terabytes of RAM, C# instead of Ruby/Python, FusionIO drives, MS SQL instead of MySQL/Postgres, etc...

Second, that's at least 6 database servers, each one being more expensive than 10 usual commodity servers:

First cluster: Dell R720xd servers, 384GB of RAM, 4TB of PCIe SSD space, and 2x 12 cores. It hosts the Stack Overflow, Sites (bad name, I’ll explain later), PRIZM, and Mobile databases.

Second cluster: Dell R730xd servers, 768GB of RAM, 6TB of PCIe SSD space, 2x 8 cores. This cluster runs everything else. That list includes Careers, Open ID, Chat, our Exception log, and every other Q&A site (e.g. Super User, Server Fault, etc.).

I think you misunderstand my use of unremarkable. It's not wrong or bad but key parts are the same stack you'd have picked 1-2 decades ago – well implemented, values scaled up, yes, but something you could have shown people in 2000 and not had to explain more than that the hardware costs so much less.

I think that's good: SQL databases are very mature and you don't want to be exciting for your core business data if you don't get some major benefit to defray the cost. Boring is a delightful characteristic for data storage.

My bad then. I think that using a proven technology perfectly tuned and the way it's meant to be done, is remarkable. This should be promoted more, instead of the new and shiny.

I'm not sure you could get 1TB of memory and multi-TB SSD drives anywhere in the 2000s, even for a million dollars. That makes a major difference in the ability to scale up. Data didn't grow: storing 1M user accounts has always taken the same space.

Agreed that data sizes would have been far more of a challenge – far more people used clustered services just because a single box only supported so much RAM, RAID arrays for IOPs and size as well as redundancy, etc. – so it's arguably become much easier for a growing chunk of the industry.

>believe they'd need a clustered NoSQL database or it wouldn't scale.

While for many developers, NoSQL might be overkill - Stackoverflow is a bad example. If you were any fast growing startup in the cloud, and you wanted to go the SO route it would have meant going CoLo. SO has a set of machines with nearly a terabyte of RAM - GCE doesn't even offer cloud machines with the same specs.

And even then their setup is far more fiscally expensive than something you could get done with 20 cloud nodes on some NoSQL solution.

One: run the numbers on the dozens of cloud nodes you mentioned. Is that really cheaper than renting some servers in a colo?

Two: how much time would they have spent rolling all of the data integrity, reporting, etc. features they'd have needed to add? I'm inclined to take them at their word when they say this was safer and cheaper given their resources.

>I'm inclined to take them at their word when they say this was safer and cheaper given their resources.

I never meant to claim that cloud was safer and cheaper. What I meant was, for the majority of operations, staying in the cloud with some distributed setup is likely more feasible than moving to CoLo (see GitLab).

What you describe is more of a design flaw than a scaling issue.

Thinking ahead that custom fields could be needed in the future is a design choice and can apply to 5k users or 500 000k users. It's not the scaling itself that's causing you pain; it just exacerbates the difficulty of a bad design choice.

So OP's advice still applies: at the start you should spend most of your time developing the best design for your application and less on scaling, because scaling a good design is way less painful.

Do you know of resources to learn more about good database design? I've been looking for ways to optimize my current db usage and we're absolutely adding a new field to the Person model every time we feel we need it...

myspace.com failed because it could not scale

I personally switched to facebook for two reasons:

got sick of repeated errors every time I browse myspace

facebook has better album permissions, (myspace has none)

as another comment said these look like bad design not bad scaling.

Many problems including not filtering user input and comments for CSS, JS, etc.

But main problem, they crumbled under heavy traffic

Since I am an Indian I know of a service which is hated by most people - IRCTC - because of its scaling issues. It's a lot better now, but I remember those horrible days when you had to book a tatkal ticket.

I also know of some government services (my state has paperless administration, so everything is digital) that are used on a day-to-day basis by government employees (but only hundreds of users) and go down for almost 2 hours every day, which stops work for everyone (both the employees and the people who came for that service). These services are developed by companies like TCS and Infosys. If only they had thought of scaling before.

In case of government, it's often another issue. While the social network of the parent was probably a lean stack, the government one was probably over-engineered. Some webapps are front-ends that communicate with a back-end in XML, with massive flaws like serialization and lack of transactions, and the backend syncs with a business solution backed by dBase, all of that for 11 web pages (I know, I've been on govt apps before – France). Scaling was a major requirement from the beginning, but those engineers just don't know what they're doing and follow the XML frontend/ESB bus patterns, assuming that's what gives performance. And committee design (12 to 24 ppl) does the rest of the bloat. 11 screens, $2m, sluggish result.

Had they started with only $300 like the parent comment, their webapp would have often performed much better. Big projects are a difficult curse, our most difficult question in IT engineering.

> These services are developed by companies like TCS and Infosys

This is the real problem: lack of in-house expertise and ongoing incentives to maintain performance, while contractors are usually paid for features and, if they also do hosting, even have a financial disincentive for efficiency.

I would bet it's at least as likely that had someone thought of scaling it would have added a significant cost and delay to the actual project and the first major scalability bottleneck would still have been something unanticipated.

> I know of absolutely no service which failed because it couldn't scale.

If by scaling you mean increasing the number of page views that a given version of a web app can serve, then this is mostly true.

But things like conceptual simplicity, unit economics, codebase readability, strategy, etc. are also dimensions of scalability. E.g. the only reason that startups can even exist at all is because large companies have a lower marginal output per employee.

"scale" here, as with all conversations on HN about "scale", refers to the technology required to service a large number of users or usage.

I think you'll find there are plenty of conversations here where "scale" is used in reference to stuff like the business models of "gig economy" apps, or at what point your startup has enough scale to need an HR department, and so on and so forth.

> "I know of absolutely no service which failed because it couldn't scale."

Zooomr did. They came out of nowhere in 2006 and were a real threat to Flickr. They had AJAX-powered edits, geotagging, various other unique features and they were starting to pull in some highly followed photographers.

But the site kept crashing as traffic grew, and some scaling problems even led to data loss. I wanted to see them win but they just couldn't keep up with traffic and eventually Yahoo cloned their features and they became irrelevant.


The main question here is whether they would have done better if they had started with scale in mind. When you start you don't know what your product will look like when it has millions of users. You will likely make major product changes quite a few times.

If a team cannot scale a product that already has adoption, they likely wouldn't have had the insights, problem definition, capabilities, skills and resources to do it at launch either.

If I could edit my original post, I would add a rider: the service not dying assumes a capable team.

> However I always had enough notice to fix issues.

Thank you. This is the one major point I try to get across to people when 'scaling' comes up. "Oh, that won't be scalable" or "yeah, but when we have 2 million users, XYZ won't work". I've been on projects where weeks and months (calendar time, part time efforts) were spent on things that were 'scalable' instead of just shipping something that worked earlier.

I keep trying to tell folks on these projects (have been involved in a couple now) - man, we're not going to go from 20 users to 2 million users overnight. We'll notice the problems and can adapt.

Trying to whet their appetite, I've even argued that a new 'flavor of the month' will be out before our real scaling needs hit, and we can then waste time chasing that fad, which will be even cooler than the current fad we're chasing (although... I try to be slightly more diplomatic than that).

> Time is the only currency you have which can't be earned and only spent.

I would argue that you can earn lifetime with a healthy lifestyle. I remember reading a study that put the average difference between healthy and unhealthy lifestyles at 14 years.

If you spend 1 day a week more for a healthy lifestyle (exercise, self cooking, enough sleep, little stress, enough recreation time) for 40 years you have earned ~9 years life time.

And I would argue that those 40 years are more enjoyable.

You make a great point. Health should be the topmost priority and I need to get better at it.

However to my original point - you can lose fitness (within reasonable limits) and then gain it back and not notice a difference later on. However time lost is lost forever.

Yes you can have a life style debt. But do not overload it or forget to pay it back.

You're omitting a lot of details. The number of users isn't as important, for example, as the number of users simultaneously using services and which services they were using concurrently. Also, no offense here, but the way you describe your experience makes it seem like you were very inexperienced in general. Your "not noticing" services that failed could mean, for example, that you weren't measuring properly and probably never learned to. In which case your anecdote isn't necessarily a useful one.

It might be a good idea to read my post again. When I talk about 'service which failed' it was other products (e.g. Facebook, Twitter etc) - not components of my product.

My post says 50M+ users and 1B+ messages in a day. Almost 50% daily active users. And this was early 2000 hardware. Last year WhatsApp was doing 60 billion messages per day (60x of what we did).

However our messages had a lot more logic, because we delivered on unreliable Telco texting pipes. Pipes had a set throughput. Some pipes delivered to just some geographies. They would randomly fail downstream so you had to do health monitoring and management. Some pipes cost per message and some cost based on throughput, so you had to pick the lowest-cost pipe in real time. Different messages had different priority. In summary - there was a lot more business logic per message sent.
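As a rough illustration of the routing decision described above, here is a minimal Python sketch. It is not the actual system: `Pipe`, `pick_pipe`, the field names and all the numbers are hypothetical, only per-message pricing is modelled, and message priority is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pipe:
    name: str
    regions: frozenset       # geographies this pipe can deliver to
    cost_per_message: float  # hypothetical flat per-message price
    capacity_left: int       # messages still allowed in the current window
    healthy: bool = True     # would be flipped by an external health monitor


def pick_pipe(pipes: list, region: str) -> Optional[Pipe]:
    """Return the cheapest healthy pipe that serves `region` and has capacity."""
    usable = [
        p for p in pipes
        if p.healthy and region in p.regions and p.capacity_left > 0
    ]
    if not usable:
        return None  # a caller would queue the message and retry later
    best = min(usable, key=lambda p: p.cost_per_message)
    best.capacity_left -= 1  # consume one slot of the pipe's throughput
    return best


pipes = [
    Pipe("pipe_a", frozenset({"north", "south"}), 0.010, capacity_left=1),
    Pipe("pipe_b", frozenset({"north"}), 0.008, capacity_left=5, healthy=False),
    Pipe("pipe_c", frozenset({"north"}), 0.012, capacity_left=5),
]

first = pick_pipe(pipes, "north")   # pipe_b is cheaper but marked unhealthy
second = pick_pipe(pipes, "north")  # pipe_a has no capacity left by now
```

Here `first` is `pipe_a` and `second` falls back to the more expensive `pipe_c`, which is the kind of per-message decision that makes this heavier than a plain send.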

I was of course very inexperienced when I started. I was just two years out of college. However in the end the product was 50+ services running on 200+ servers. This experience helped me scale user communication infra at Dropbox to 100x scale within 3 months.

Your message was flame-bait at best and didn't seem to contribute to the discussion. You were passing judgements based on incomplete information. You could have asked nicely and still gotten a response.

Edit: fixed typo

> It might be a good idea to read my post again. When I talk about 'service which failed' it was other products (e.g. Facebook, Twitter etc) - not components of my product.

This wasn't clear at all from your comment.

> Your message was flame-bait at best and didn't seem to contribute to the discussion. You were passing judgements based on incomplete information. You could have asked nicely and still gotten a response.

My message was not flame bait; that's your interpretation. I'm sorry you took offense, but please do consider that your original comment does omit lots of important details, as you apparently (should) know from your experience at Dropbox.

Also, per your quoted numbers WhatsApp was going 60x what your app was, not 1/60. Typo?

This is a delightful retelling; thanks for sharing.

> I know of absolutely no service which failed because it couldn't scale.

Genuine question: what about Friendster?

Friendster definitely failed because they couldn't pull off scale. I was CTO of the Tagged social network at the time and the issue loomed large during our 2005 fundraising. Investors had seen $50m go down the drain when Friendster got slow. Despite network effects, users moved over to products that were up and running (mostly MySpace but also hi5, etc.).

I seem to recall that the Friendster team had more than one aborted rewrite before they got it right, and by then it was too late. Scaling social networks required a different bag of tricks than was popular at the time (e.g., database replication was well accepted as a best practice, but sharding worked much better). Talented people could still get it wrong.
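To illustrate the difference: replication copies every row to every replica, which helps reads but not writes, while sharding splits the rows so each machine holds only a slice. Below is a minimal Python sketch of hash-based shard routing. The function name, the user ids and the shard count are made up for illustration, and this is not a description of what Tagged or Friendster actually ran.

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user id to a shard index in [0, num_shards) using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


NUM_SHARDS = 4  # hypothetical cluster size
placement = {uid: shard_for(uid, NUM_SHARDS) for uid in ("alice", "bob", "carol")}
```

One caveat with plain modulo routing: changing the shard count moves most keys, so real deployments often use a lookup table or consistent hashing instead.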

Still, social networking illustrates how failing to scale is rare, and should practically never be the concern early on. Almost every social network that gained traction struggled to scale at times, but the vast majority of them overcame their challenges.

We poured sweat and tears and many sleepless nights into scaling the Tagged back-end from 10 machines to over 1,000, over time serving hundreds of millions of users. It was a tremendous amount of work, much more than building the initial product, and I don't think there's much we could have done early on to make later scaling easier.

More on the story: http://highscalability.com/blog/2011/8/8/tagged-architecture...

Thanks for sharing your experience.

There is always scope for a botched execution or just pure bad luck. I wish someone shared more candid details of what happened at Friendster.

The Startup podcast[1] has an upcoming series about Friendster

1 - https://gimletmedia.com/startup/

I remember reading that Friendster's founder was complaining that they were facing a lot of technical issues and investors were not letting him invest effort in fixing them.

IMHO, if your investors need to be involved in seeking approval for handling scaling pains and on top of that if they reject it - there are deeper issues within the company (management issues, politics, technical incompetence, leadership issues).

So to clarify my point - any team could still screw up technical execution. However there is no reason that given a good team and support from within the company, a product can't be scaled.

Friendster's problem was that instead of fixing its problems, it decided to re-write.

You started with Java... a language designed to handle credit card transactions world wide. Your story would be very different if you started with say Rails.

If you build a stateless, horizontally scalable service (and you should) - you are almost always going to be blocked on the DB. On the app server front you just keep adding more machines behind a load balancer.

As I look back, if we had used RoR or Python we might have moved faster with no negative impact on scale. We would have spent more money on extra servers though!

Awesome story and takeaways. What happened to your product and what did you learn you could not keep up with or grow into?

I built this product as a side project within another startup I was working in. We had a lot of money, had been around for 3 years and did not have a product.

Because of this I had multiple levels of bosses above me in the company. They all got really interested in starting to manage me and the product once we hit 5M users. They had a different vision, attachment and passion about the product.

Around 2008, we were spending a lot of money on text messaging. Most people were using the product to send messages for cheap to their friends and family (essentially like WhatsApp vs Twitter).

I wanted to slowly phase out text messaging in favor of Data (essentially become WhatsApp!). CEO didn't believe that Mobile Data will get adoption in India for a long time. He instead wanted to focus on monetizing - by sending ads on text messages, making people pay for premium content etc. We tried 8-10 things - nothing worked (this can be a really big post in itself).

Soon I quit and moved to US. It was like a bad breakup!

My biggest learnings were:

1. If you do not have ownership like a founder, do not take responsibility like a cofounder.

2. Personally for me: do not work where I don't have veto rights. But bills need to be paid and family has to be supported. Took me 5 years of really hard work to get there!

Thanks for sharing your experience. People are getting caught up in Ruby v Java v whatever and not seeing the forest for the trees - you have a valuable perspective and an effective way of communicating it. Appreciate it!

May I ask you about your first 'biggest learning': how do you eschew cofounder responsibilities in an early company? Someone who genuinely cares and/or is really interested in the core business (pretty common among first hires of startups) may have a hard time turning work away or willfully not participating in major discussions/planning when they believe their contributions could be valuable. In other words, situations when you know you can contribute but the company will get the 'better end of the deal' as you take on more responsibility without taking on more compensation.

It sounds difficult to ride that out - were you able to do that successfully or are you saying "This happened to me, don't let it happen to you?"

Just saw this comment. Not sure if you will see the response. You raise a really good point.

If someone is joining a startup at an early stage, work like it is your own baby. You are spending your most valuable currency by being here - which is your time!

With time, hopefully, you will get rewarded for your effort and results. You will likely not get a founder level say and stake but your role should gradually expand. In the end founders take almost the same risk as the first employee, but get a lot more equity because founder was there at the start. Join a startup as an early employee only if you are able to accept this fact - otherwise you are setting yourself up for a lot of resentment and pain.

The above advice applied to me too for the first two years of my employment, before I started working on the ultimately successful product. We were building an offline search engine, and as a part of that I built distributed file systems and crawled a billion pages, and I was appropriately rewarded with career growth.

However with the new social product that I built, the situation was dramatically different. For that product I was there from day 0. I was owning the whole engineering (sole engineer at start and head engineering till I left) and significant portion of product. Till we got 5M users, I had a co-founder level say (but not the stake of course) and I worked with a co-founder level of involvement and dedication till that time. However once we got 5M users, multiple levels above me started doing meetings and taking critical decisions without involving me. I still had a co-founder level passion and it hurt to see the product flounder and being unable to do anything about it.

So I was in a uniquely bad situation and it is not common to be in this situation. However if someone really is and there is no way to get a founder level control - move out as soon as you can. And this is what I did - albeit 2 years too late.

I can name one: Darcs. It was my favorite version control system, apart from the lack of scalability.

Darcs is a distributed version control system. It is inherently not a hosted solution that "needs to scale" the way the article is talking about.

I'd definitely say that Darcs failed to catch on because it didn't scale with the size of the repo, the larger your history was the more likely you were to run into performance issues, and there was little you could do about it because the underlying theory is flawed.

The scaling problem with Darcs wasn't about hosting, but in the implementation of their theory of patches and how it handled conflicts. See http://darcs.net/FAQ/ConflictsDarcs1#problems-with-conflicts for an overview of the issues in Darcs 1 and 2.

This seems to be also one of the driving forces behind the development of Pijul, which is also patch-based instead of snapshot-based, which makes it easier to understand and use, but all implementations so far had major performance issues once repositories grow. For more on that, see https://pijul.org/faq.html

I was one of the early users of Darcs 1, back before Git existed. I wanted to use version control, but the alternatives were pretty hard to understand and use. While Darcs was really nice to use, and fast on small repos, after a few years I had to convert everything to git because exponential times on most operations were just not sustainable, and fixing that required constant vigilance and altering history, which is not very friendly for new contributors.

Even GHC moved from Darcs to git in 2011 because of this: https://mail.haskell.org/pipermail/glasgow-haskell-users/201...

I love this example. Thanks for sharing! It's very encouraging to remind me to just make something and get it out there.

One prominent counterexample that comes to mind is Friendster which was a big social network before MySpace and Facebook. They had terrible performance issues but to be fair to your point, their decline was more mismanagement than a true inability to scale.



> I know of absolutely no service which failed because it couldn't scale.

Well, Friendster failed for exactly that reason. But, those were the early days.

p.s Completely agree with the article.

> I know of absolutely no service which failed because it couldn't scale.

You must not be looking around at all, right?

>I know of absolutely no service which failed because it couldn't scale

So what are some examples of services which failed because they tried too hard to scale too early?

For example, when your network could handle 20k users, was there another network that could have already handled 500k, but they failed because they started a month after yours?

>I know of absolutely no service which failed because it couldn't scale.

These days there would be hardly any way to fail to scale given a large budget. The technology to do so is all so good. However, back in the day it wasn't always so. I'd say Friendster was an example of a company that failed due to scaling issues.

Some things simply don't scale. You can't do some things in real time; you have to do them in batch jobs or avoid them. If you know Big O notation, you will know that certain algorithms don't scale. Friendster insisted on calculating the friends-of-friends-of-friends relationship in real time and was not able to scale it. MySpace and Facebook did the same calculation in the early days too, but both scrapped the feature as they were not able to scale it. Other examples are in big data (e.g. log analysis): if you choose the wrong software stack (database) or the wrong indexes, you won't be able to scale it (especially if it's real time). You will bleed money in no time on cloud servers with TBs of RAM, when you could do it with a couple of servers, a normal sharded SQL database and one hand-tuned index.
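To see why the friends-of-friends-of-friends calculation gets expensive, here is a small Python sketch of a breadth-first search limited to three hops. The graph and the function name `within_hops` are invented for illustration, and this is not how Friendster actually implemented it. With an average of k friends per user, a three-hop lookup can touch on the order of k^3 profiles, so 100 friends each means up to 1,000,000 candidates per page view.

```python
from collections import deque


def within_hops(graph: dict, start: str, max_hops: int = 3) -> set:
    """Breadth-first search: everyone reachable from `start` in <= max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        user, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    seen.discard(start)  # the user is not their own connection
    return seen


# Tiny toy graph; a real social graph has millions of nodes.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d", "f"],
    "f": ["e"],
}
reachable = within_hops(graph, "a")
```

In this toy graph `f` is four hops from `a`, so it is excluded. The work per request grows with the size of the three-hop neighbourhood, which is why doing this on every page load is the kind of thing you precompute in a batch job or drop.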

> The technology to do so is all so good.

Mistake #1: Took the wrong technologies.

For every technology that can scale, there are a bunch of others that will give you a lot of trouble.

Depends on what you're doing, easy to fail when you do analytics or something like datadog.

How much money did Twitter waste in its migration to Java? I think it's an important factor to consider.

They have thousands of people sitting around doing nothing. They don't really optimize for expenses.

Can you share a link to the site?

I think in your particular case, you had a good intuition about performance/scalability and this is why it was not hard for you to scale. However this is not the case for everyone and I have seen many counterexamples in my career.

There have been a bunch of discussions about language choices and impact on productivity. Wanted to touch upon this here based on my experience.

TLDR: Depends a lot on individual situation. Pick the language you are fastest and most comfortable with.


SMS Gupshup - 6 years - Java and some C++ - Built distributed filesystems. Crawled 1B pages with early 2000 hardware. Built map-reduce framework before Hadoop came. Built a search engine on top of it. Built an infra which would send 1B+ text messages with prioritization and monitoring. Social network with 50M+ users.

LinkedIn main stack - ~ 1 year - Java

LinkedIn mobile stack - ~ 1 year - Javascript - Was one of the early engineers.

Founder Startup - 1.5 years - Clojure

Dropbox - ~ 1 year - Python

Head Technology Tech incubator - 1.5 years - Node.js

As you can see my experience has been all over the spectrum. While primarily using one language, I would keep experimenting with other languages like Scala, Go.

Following are my learnings and the reasonings:

I love Clojure. Love it so much that I feel like quitting everything, going to a mountain cabin and just coding Clojure. However I would most likely not use Clojure for my startup. The learning curve is really-really steep. It would make it very difficult for me to build a team. It's really fun if you understand it. But the first few weeks for a new person might be a nightmare and very emasculating. So even though I really love it, my current strategy is to bring Clojure patterns to other languages.

User facing logic

I would not use a statically typed Object Oriented language for any user facing logic. User facing logic and behavior tends to be very fluid and I feel that it gets strangled by the constraints of statically typed languages. Once you start building things with Class, Inheritance and Polymorphism etc - soon these patterns start driving business logic rather than the other way around. For this reason you will see that the most arcane and slow moving popular sites today are Java based - (LinkedIn, Yahoo, Amazon, Ebay) vs (Facebook, Twitter, Instagram, Pinterest). For this purpose I like using dynamically typed languages, building code in as functional a way as possible. Pick Node.js, Python, Ruby - whatever you are comfortable with.

In some cases people recommend statically typed languages for catching errors because of type safety etc. However I would solve that problem completely with a test suite rather than solving it part of the way using language features while being forced to pick a statically typed language.

Another big benefit of dynamically typed languages is the ability to just create a dictionary and start using it. Most CRUD apps work with JSON requests and responses, and I would rather just dynamically pass objects than build a class for the request and response objects for each and every route. This was a nightmare to create and maintain in Scala and Java.

Fast developing Infrastructure

E.g. Hadoop, Hive, Distributed FS or Queue. 5 years back I would have built them in Java. However I would use Go for this today. However you couldn't go wrong either way. Pick what you are most comfortable with.

There is a decent amount of talent available. These languages are easy to learn. And you get close to C level performance if you build it right. Development is fast paced.

Mission critical performance - File System or Database

Most likely C, because you want to squeeze out last of the performance. However be ready for really slow development cycles. And most smart kids out of college wouldn't know C. So you will need to set aside some time till they are really productive.

This is probably because implicitly made available decisions from the get go. For example, if you had chosen MongoDB instead of MySQL, the story would be different. Ditto if say, Meteor instead of Java.

I am guessing your point is that if I had chosen Meteor and MongoDB - we wouldn't have been able to scale.

My initial DB schema was pretty bad. Had to do quite a few DB migrations - which took weeks of work to execute.

I have used MongoDB extensively since, and I am confident that it would have helped us scale to 5M users comfortably. I might have migrated to something else at that point.

There are very few products that reach even 5M users. Which is why developers should focus on launching fast.

Another thing is that the best and most successful products have a really simple core product (remember Facebook, Twitter, Instagram etc when they had 5M users). It is not that much work to migrate and rewrite. It is when products are not getting traction that they start getting overcomplicated.

Isn't Meteor a framework? I'm assuming you meant Javascript instead of Meteor

Yes, I should have said JavaEE vs JavaScript

Available --> scalable (typo)

> I know of absolutely no service which failed because it couldn't scale.

I know HN skews young when no one remembers Friendster.

Would need a much longer story, if you could share. I'm also Indian, and mine died at about 500k due to my focus on scalability.

Would love to know more about your story.

My biggest learning is Functional->Fast->Pretty from the product POV and Launch as soon as you can from an Engineer's POV.

Some more details on my blog: https://anandprakash.net/2016/10/28/how-to-build-a-business-...

Most of this was written in sleepless nights when we had a baby. So do not expect a super artistic flair ;)

At step 3, how many messages per day were you pumping with 64MB allocated to MySQL?

We were using MySQL like a persistent queue. As soon as a message was spawned, we would send it to downstream SMS gateways and delete them. So essentially that volume didn't use a lot of memory.
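A minimal sketch of that "database as a persistent queue" pattern, in Python. Assumptions to flag loudly: `sqlite3` stands in for MySQL so the example runs anywhere, and the `outbox` table, `enqueue`, `drain` and the `send` callback are hypothetical names, not the original code.

```python
import sqlite3

# sqlite3 stands in for MySQL here so the sketch is self-contained.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, recipient TEXT, body TEXT)"
)


def enqueue(recipient: str, body: str) -> None:
    """Persist a message so it survives a crash before it is handed off."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO outbox (recipient, body) VALUES (?, ?)",
            (recipient, body),
        )


def drain(send) -> int:
    """Hand each pending row to `send`; delete the row once it is accepted."""
    rows = db.execute(
        "SELECT id, recipient, body FROM outbox ORDER BY id"
    ).fetchall()
    delivered = 0
    for msg_id, recipient, body in rows:
        if send(recipient, body):  # hypothetical gateway call, True on success
            with db:
                db.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))
            delivered += 1
    return delivered


sent = []
enqueue("user-1", "hello")
enqueue("user-2", "world")
# The lambda records the message and reports success to drain().
count = drain(lambda to, text: sent.append((to, text)) is None)
remaining = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

Because rows are deleted as soon as the gateway accepts them, the table stays small no matter how many messages pass through, which fits the point about the volume not needing much memory.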

At stage 3 we had 50% daily active users who would receive an average of 8 messages per day. So at step 3, we would send 20M messages per day.
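The back-of-envelope arithmetic, spelled out in Python. The 5M user count is taken from step 3 of the timeline in the original post; the per-second figure is an average derived here, so real peaks would have been higher.

```python
users = 5_000_000          # step 3 of the timeline: 5M users
daily_active_share = 0.5   # 50% daily active users
messages_per_active = 8    # average messages received per active user per day

messages_per_day = int(users * daily_active_share * messages_per_active)
messages_per_second = messages_per_day / 86_400  # 86,400 seconds in a day
```

That works out to 20M messages per day, or roughly 230 per second on average.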

Arguably Friendster failed due to its interminable scaling issues

> I know of absolutely no service which failed because it couldn't scale.

Bullshit. What about the site (voat something?) that tried to be an alternative to reddit when there was some scandal there but couldn't keep users because they kept crashing?

Bullshit. You can go on voat right now just fine and there are people using it just fine. It hasn't failed, it's just not a big deal because it's an almost exact replica of a site that already exists and has traction. The difference being one allows you to hate on fat people all the time and the other only lets you hate on fat people most of the time. It's just too niche of a USP for most people.

I don't think that's a fair analysis of what happened. Though it seems obvious that they have lost some potential users by being down so many times, ultimately what has obstructed Voat’s success is its core community: the fringe of the fringe of reddit.

It's still going and they post about the scaling issues being due to lack of funding available that would allow them to continue their business model.


But is that a true scaling problem like "2x servers can't handle 2x users" or is it a business model problem like "every user means losing money and we can't afford more even if the cost was O(n^0.9)"?

I can't find the posts you're talking about.

What application did you build?

Obligatory "Mongo DB Is Web Scale" https://www.youtube.com/watch?v=b2F-DItXtZs

There is scaling and there is scaling.

A messaging service is the most trivial service one can make, it's a well known easy problem that's been solved for decades with current technologies. There is no challenge in that. For comparison, WhatsApp had people with experience and they could handle 1 billion users with 50 people.

The fact that you can handle 10M users with a single untuned MySQL database is not a demonstration that scaling is overrated. It's an expression that you are running a trivial service that doesn't do much.

Almost any problem will be more challenging than that. There are endless companies that have 1/100th the customer base and yet require 100 times the data volume and engineering.
