
A lot of complex “scalable” systems can be done with a simple, single C++ server
https://twitter.com/ID_AA_Carmack/status/1210997702152069120
======
IndrekR
A site as proof: it keeps amusing me what hardware/software Stack
Overflow/Stack Exchange runs on:
[https://stackexchange.com/performance](https://stackexchange.com/performance)

This is far less hardware than most people in the trade (from web devs to
devops) seem to expect when asked about it.

SO ranks #36 in Alexa right now:
[https://www.alexa.com/siteinfo/stackoverflow.com](https://www.alexa.com/siteinfo/stackoverflow.com)

~~~
perl4ever
At the risk of exposing my ignorance - look at all those "Peak 5%-20%" labels.
Doesn't that mean they have a lot more than they need?

~~~
dahfizz
Stack Overflow hosts all (most?) of their own bare-metal servers in their own
data center.

Looking at the specs of the machines, they are actually pretty basic as far as
servers go. A server is barely worth the cost of its chassis and motherboard
if you put less than 64 GB of RAM and 24 CPU cores in it.

In other words, these are about the lowest-specced proper servers you can get.
So yeah, even their modest hardware is still over-specced for running their
website.

~~~
latch
"Their own data center" implies (to me at least) that they built their own
data center. That doesn't sound right, so I looked it up, and it seems like
they're colocating. That might be what you meant, but the data centers they
use certainly weren't built or owned by SO.

~~~
dahfizz
As far as I can tell, you are correct. They manage their own bare-metal
servers/racks inside a colocation data center. Sorry if this caused confusion.

------
jandrewrogers
Many developers severely underestimate how much workload can be served by a
single modern server and high-quality C++ systems code. I've scaled
distributed workloads 10x by moving them to a single server and a different
software architecture more suited for scale-up, dramatically reducing system
complexity as a bonus. The number of compute workloads I see that actually
need scale-out is vanishingly small even in industries known for their data
intensity. You can often serve millions of requests per second from a single
server even when most of your data model resides on disk.

We've become so accustomed to extremely inefficient software systems that
we've lost all perspective on what is possible.

~~~
hnews_account_1
Can you expand on this? I have some pretty massive compute loads that need to
be scaled onto a cluster with 100+ workers for most computations. This is
after I use a library called dask, which builds a task graph and does its own
map-reduce-style optimisation inside its modules. This is all for a relatively
small 250GB raw data file that I keep in a CSV (and need to convert to SQL at
some point).

Are you saying this can be optimised to fit inside a single 10-core server in
terms of compute load?

~~~
sfifs
Don't know why you're being downvoted but I'll assume your question is
genuine.

You use a cluster when your data and compute requirements are large and
parallel enough that the gains from parallelism outweigh the tax paid on
network latency - otherwise you're better served by the 10-20x speedup you get
from SSDs and the ~1000x speedup you get from just keeping data in RAM.

250 GB is tiny enough that you could probably get much better performance
running on a high-memory instance in AWS or GCP. You'll generally have to
write your own multiprocessing code, though, which is fairly simple - your
existing library may also be able to support it.

I once actually ran this kind of workload on just my laptop using a compiled
language, and it performed better than PySpark on a cluster.
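
Even without multiprocessing, a chunk-and-combine pass over the file covers
most aggregations. A rough sketch with pandas (the file and column names are
made up):

    import pandas as pd

    partial_sums = []
    # Stream the CSV in 1M-row chunks so only one chunk is in RAM at a time.
    for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
        partial_sums.append(chunk.groupby("key")["value"].sum())

    # Combine the per-chunk partial aggregates into the final groupby result.
    result = pd.concat(partial_sums).groupby(level=0).sum()
    print(result.head())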

~~~
hnews_account_1
I'd love to keep it in RAM if I could. The problem is, the library I'm
familiar with (pandas) typically seems to take more memory than the original
CSV file once it loads it into memory. I know this is due to bad data types,
but in certain cases I cannot get around those.

However, even if I could load it all into memory at once, and assuming it
takes 200 GB, I'm still using a master's student's access to a cluster. So I
get preempted like it's nobody's business. Hence why I prefer a smaller memory
footprint, even if I take up CPUs at variable rates through a single execution.

I did try to write my own multiprocessing code for this, but the operations
are sometimes too complicated (like groupby) for me to rewrite everything from
the ground up. If I'm not reliant on serial data communication between
processes (like you'd need to sort a column), I can get it done pretty easily.
In fact, I wrote my data cleaning code with this and cleaned up the entire
file in half an hour because single chunks didn't rely on others.

However, if you have some idea of how to run these computational loads in
parallel in python or any other language on single compute instances (like the
size of a laptop's memory of 16 gb), I'd really love to see it. Thanks.

~~~
giantrobot
Numpy supports memory mapping `ndarrays` which can back a DataFrame in pandas.
This lets you access a dataset far larger than will fit in RAM as if it lived
in RAM. Provided it's on fast SSD storage you'll have speedy access to the
data and can process huge chunks at once.

~~~
hnews_account_1
Can you provide a link to this please? My current understanding is that all
numpy data lives in memory, and pandas itself has a feature to fragment any
data into iterables so I can read up to my memory limit. I cannot use this
feature due to the serial nature of some of the operations that I alluded to
(I'd have to almost rewrite the entire library for some of these complicated
operations like groupby and sorting).

I do have fast SSD storage because it's on the scratch drive of a cluster and
from what I've seen it can do ~300-400 MB/s easily. I haven't had a chance to
test more than that since I'm mostly memory constrained in much of my testing.

My current attempt is to push this data into a proper database system so that
I can query it with SQL. But like I said, I work with a less-than-stellar set
of tools, and I have to literally set up a Postgres server from the ground up
to write to it. Which shouldn't be a big deal, except it's on a non-root user
and I have to keep remapping dependencies (took 5-6 hours to set it up on the
instance I have access to).

My other option was to write the entire 250 GB to an SQLite database using the
SQLAlchemy library in Python, but that seems to fail whether I do it with
parallel writes or serial writes. In both cases, it fails after I create
~64-70 tables.

~~~
giantrobot
[https://docs.scipy.org/doc/numpy/reference/generated/numpy.m...](https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html)

You can create memory-mapped ndarrays; these act like normal numpy arrays but
don't need to fit into RAM. Numpy maps the array to a binary file on disk. The
array otherwise acts like an ndarray, so you can build a DataFrame with it.
Whenever you access an array index, Numpy in the background (essentially)
seeks that many values into the file to grab the value at that index.

Since you're on a fast SSD and Numpy is fairly smart, you'll be able to access
your arrays at close to your drive's speed. It's slower than if the whole
database were in RAM, but far faster than distributing the data over a network
to a bunch of worker nodes. Memory-mapped files let you have array-like access
to data on disk as if it lived in RAM. When building a pandas DataFrame from a
memmapped ndarray, I believe you just need to set copy=False in the constructor
for it to Just Work.
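
A minimal sketch of that flow (the file name, shape and dtype are made up;
depending on your pandas version some operations may still copy data into RAM):

    import numpy as np
    import pandas as pd

    # Create a memory-mapped array backed by a file on disk.
    arr = np.memmap("values.dat", dtype=np.float64, mode="w+", shape=(1_000_000,))
    arr[:] = np.random.random(1_000_000)  # writes go through to the file
    arr.flush()

    # Re-open it read-only later without loading the whole file into RAM.
    arr_ro = np.memmap("values.dat", dtype=np.float64, mode="r", shape=(1_000_000,))

    # Build a DataFrame column on top of it; copy=False asks pandas not to copy.
    df = pd.DataFrame({"value": arr_ro}, copy=False)
    print(df["value"].mean())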

I don't know what your data looks like but I doubt loading it into SQLite is
going to improve your performance.

~~~
hnews_account_1
Unfortunately my data isn't all numbers; it has text too. The sparse examples
in that link only show this for reading in numbers. Do you know offhand if it
translates well? There is a dtype parameter, but it'll take me a few days to
get back to this code, so I figured I'd check beforehand.

Your second paragraph is essentially what I want. I'm willing to wait a day
for code that may run in 1 hour from memory, so time isn't entirely an issue
unless it's starting to bleed into weeks. The read_csv function in pandas has
a parameter called memory_map, but when I tried using it on a smaller 7GB
dataset, it read the whole thing into memory (32GB instance) even when I set
it to True.

SQLite is definitely not my best option here. It was the only server-less
implementation I could find, so I tried to use it and it didn't work. However,
a database-like implementation will be helpful, because each operation I need
to do requires data that satisfies certain timestamp and arithmetic
conditions. I figured it'd be best to load the whole thing into a DB and query
it for every operation to train my model.

------
sriram_malhar
This is precisely the point made by McSherry, Isard and Murray in their lovely
paper, "Scalability! But at what COST?" (Usenix HotOS '15). They demonstrate
how much performance headroom there is in modern CPU and memory, and show how
simple cache-sensitive batch algorithms running on a single core can
outperform hundreds of cores running distributed map-reduce style jobs.

[https://www.usenix.org/system/files/conference/hotos15/hotos...](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf)

~~~
jpz
Big data is about I/O, not CPU.

I’m a C++ veteran, btw, and I understand the point, but big data is about how
to process petabytes of I/O, not how to consume CPU.

~~~
bane
This is true in a "water is wet" kind of way. The point is that a great many
problems that can fit neatly into a single machine are being turned into I/O
problems by being distributed onto clusters.

There's an incredible number of gigabyte- and even terabyte-scale problems
that are consuming racks of blades when, with a little thinking and an
understanding of the problem being solved, they could be handled pretty nicely
on far fewer resources.

What's really happening is that many people think it's "easier" to simply rack
more equipment into the cluster, and they end up shifting the complexity into
cluster administration rather than programmer time.

~~~
mannykannot
>...and end up shifting the complexity into cluster administration rather than
programmer time.

Sometimes, that is the right thing to do. The problem is when the cluster
solution also adds to the programmer-time complexity.

------
8fingerlouie
As someone who has implemented a complex system in C++ in this decade, I’d say
he’s not wrong, but you need to carefully weigh the pros and cons.

In our case latency and real-time demands mattered a lot (NASDAQ feed parser),
to the point that the (potential) slowdown of a garbage collector kicking in
was enough to rule out Java and .NET. It runs entirely in memory and on 64+
cores.

We implemented our own reference counting system to keep some of our sanity,
and to at least try to avoid the worst memory leaks.

This was an edge case, and for almost everything else you’re probably better
off implementing it in something that handles memory for you. If performance
is an issue, at least try it in Go or Rust with a quick PoC before jumping on
the C++ wagon.

~~~
davidcbc
Memory management in modern C++ is considerably easier than it used to be.
Bespoke memory managers aren't really needed, you can do almost anything you
need to without ever using new or delete.

~~~
stevenwoo
Out of curiosity, what is the field for which you find this to be true?

~~~
wffurr
If you are writing C++, any field! Just use std::shared_ptr and
std::unique_ptr from the standard library, along with std::make_shared and
std::make_unique.

~~~
sillysaurusx
This is the quickest way to kill your performance in C++. I guarantee it
wasn’t what Carmack was talking about.

I used shared_ptr extensively in a game engine. Whoops: suddenly 20% of the
frame time was gone, never to be recovered. Once that performance is gone it’s
almost impossible to get it back, short of rewriting every system.

~~~
wffurr
Only use shared_ptr when you actually want shared ownership. If you stick to
unique_ptr for owning references and then "borrow" that raw pointer in
functions, then you get speed without sacrificing too much safety and still no
new/delete.

~~~
slavik81
Additionally, shared ownership should be rare. Most objects should have a
single owner that is responsible for them. This is not only better for
performance, but also makes the system easier to understand.

------
bsenftner
I'm 15 years into writing high-performance Internet servers in C++, and I can
confirm higher-level languages provide an illusion of capability, but once
you're talking high performance with high compute requirements and scaling
your service, the cost efficiency of C++ is exponentially better than any
other language. The higher-level language ecosystems are bloated beyond
repair. I was able to use one 32-core physical server running a C++ HTTP
server I wrote, providing a rich media web service, and replace an AWS server
stack that cost my client $120K per month. The client purchased one $8K
32-core server and co-located it behind a firewall at a cost of $125 per
month. And the C++ server ran at 30% utilization, plenty of room for user
growth. Their AWS stack of a dozen C#, Python, PHP and Node apps was peaking
its capacity too. Of course, my solution caused existential questioning by the
non-geek CEO and the CTO, but they were in crisis and needed to radically
revise how they provided their service or close.

~~~
cthalupa
The thing that worries me about stories like this is that there is frequently
(as is the case here) no mention of any sort of HA or backups. No details on
what disaster recovery looks like. Those are business-critical considerations
that cost money, and they just disappear from the discussion when people say
"hey, I saved all this money dropping everything down to a single server!"

~~~
bsenftner
Well, in my case described above, single-server solutions include an automated
backup sub-system, and my servers expect multiple instances of themselves to
be running on the client network; these multiple instances synchronize with
one another via additional endpoints specific to that purpose. The whole issue
of HA and backups is critical and one of the areas where my approach shines.

~~~
cthalupa
You're still not actually answering the question of how you are doing HA and
backing things up with only a single physical server and nothing else.

If you're backing things up to the same server, that's not enough. If the HA
instances are running on the same server, that's not enough.

If there are other things besides that one physical server and its
power/network, you didn't include them in the cost, so the comparison is
disingenuous there.

~~~
bsenftner
I do say the deployed system ends up being multiple instances of my single
server, which synchronize with one another. Those are separate physical
devices, each running my one server. Additionally, when a backup runs, the
data is stored locally as well as on a physical storage device separate from
the hardware it is running on. Typically clients already have a
firewall/router which is used to distribute requests to the various instances.
My deployed systems are not one server; they become a server mesh.

~~~
cthalupa
> Those are separate physical devices, each running my one server.
> Additionally, when a backup runs, the data is stored locally as well as on a
> physical storage device separate from the hardware it is running on.
> Typically clients already have a firewall/router which is used to distribute
> requests to the various instances.

Awesome! Really glad to hear this is the case. But those are all added costs
beyond the single physical server you gave the price of.

Server cost * number of physical servers you have deployed
Cost of your off-server storage
Cost of your network appliance doing load balancing

It's still probably way less than the AWS bill, but it's not really fair to
compare the total price of infrastructure in one environment vs. just a
portion of the other.

------
lmm
Horizontal scalability carries a lot of overhead. Probably a factor of 10,
easily. But the clue is in the name: eventually you'll get to a point where
you have to scale.

Back in 2010 I worked for a company whose system, in Java, ran on a single web
server (with one identical machine for failover). We laughed at our nearest
rivals, who were using Ruby, and apparently needed 60(!) machines to run their
system, which had about 5x the average request latency of ours.

Then traffic doubled, and suddenly we were having to buy six-figure servers
and carefully performance-optimize our code, and our rivals with the 120 Ruby
servers didn't look so funny any more. And then traffic doubled again.

~~~
paulddraper
Web servers are usually trivially horizontally scalable.

You must have had significant in-memory shared state to encounter that
problem. Right?

Had you adopted a less stateful model, you'd have looked rather pretty with
two Java servers.

~~~
lmm
> You must have had significant in-memory shared state to encounter that
> problem. Right?

"Significant" is in the eye of the beholder. The core of the system was easy
to make shardable. But you'd be surprised how many implicit assumptions creep
in, how easy it is for ancillary parts to end up sharing state when it's easy.
Also note that just because your state's in a database doesn't mean having two
instances of the thing that accesses it will work, you can easily end up with
an access pattern that assumes only one reader (for example) even though on
paper there's no in memory state.

> Had you adopted a less stateful model, you'd have looked rather pretty with
> two Java servers.

Maybe. At the point where we're running 4 or 8 servers we'd have been facing
much the same ops problems that they were. Java bought us an extra year or two
of not having to deal with that, but also a significant amount of migration
work when it just became impossible to stick to the single process model. At
the end of the day we still kept the 5x latency advantage, which is definitely
not nothing. But there were also definitely features that they brought to
market quicker, and I'm pretty sure Ruby played a part in that.

Tradeoffs, tradeoffs everywhere. I left before the final outcome of that fight
(for all I know it's still ongoing), but I don't think either company was
being dumb.

~~~
paulddraper
> At the point where we're running 4 or 8 servers we'd have been facing much
> the same ops problems that they were.

Yes and no.

You'd need some ops work, but you'd need to worry a lot less about managing
your infrastructure provisioning to keep costs low, e.g. reserved instances,
dynamic scaling, etc., and about putting out fires when you inevitably exceed
your tight perf margins.

You could overprovision 24/7 by 50% and write it off. Your competitors
couldn't.

------
ummonk
Yes, I’m always shocked by just how much performance overhead most languages
have compared to C and similar lower level languages. It is a price worth
paying for better language ergonomics, but I do wonder whether Rust might be
able to give us the best of both worlds here.

~~~
twic
I semi-seriously think the entire modern shape of the cloud is a result of
Ruby being really slow.

Back when people were writing their backend business apps in C++, COBOL, Java,
etc, if there was ever a performance problem, you could usually just get a
slightly bigger machine and grow your thread pools a bit. But once the web
took off and Ruby exploded onto it, you couldn't do that, because it's an
order of magnitude slower, and doesn't really do multithreading. But, as long
as you follow twelve-factor discipline, it scales horizontally like a champ.
So, we took to horizontal scaling over multiple VMs (and caching things in
Redis instead of local memory or Hazelcast or whatever), and that's been the
unquestioned way to do scaling ever since.

~~~
ww520
The push for the need of scaling out started with Ruby and Python's lack of
performance. The reason being pushed at the time was, "developer time was more
expensive than hardware." Well, that didn't count the amortization of
developer time over the lifetime of the product once the product was
developed.

~~~
shrimpx
It's mostly a fallacy that a demanded product becomes "developed". Maybe a
game that gains cult status and therefore a long tail end of life. But popular
web services are in constant churn and in that space it's valid to trade
hardware for programmer productivity.

~~~
ww520
Backend web development doesn't change much once developed. How many ways can
one do CRUD on the backend?

~~~
filleokus
In my experience this is only true if the system never gets any more
user-facing features.

Every non-trivial new feature requires a new REST API route, or a database
table, or a modification of the GraphQL schema. And depending on how you
designed the backend, even small redesigns of the frontend might require
changes on the backend. Consider a simple app showing car rentals, where you
initially have something like /car/[id]; then the frontend guys realise that
we need to show cars rented by each customer and it's necessary to have
/customer/[id]/rentals.

(Of course it's possible to design a system without any schemas or
normalisations, which would make your statement truer, but that's rarely
attractive for other reasons.)

~~~
ww520
But all those are just extensions using the established frameworks and
development practices for that backend application. Once those are in place,
it’s pretty easy to add new features. I would argue that using scripting
languages like Ruby and Python brings no significant benefit in development
time. In fact, they make it worse when working on an existing app, with the
extra unit tests, difficulty in refactoring, and extra performance work.

------
blondin
not how i read his tweet. he was lamenting that python is too slow for some
server-side development use cases, and he gave cpp as an alternative that
would be simpler and faster. he even followed up citing java and csharp.

totally agree. if all your backend server does is mostly complex serialization
& de-serialization, and pushing bytes to other sub-systems, i think many other
languages have advantages over python.

wondering why no mention of go or rust though...

------
twic
What's that, Frank McSherry? For horizontally scaled systems, we should ask
what the Configuration that Outperforms a Single Thread is?

[https://blog.acolyer.org/2015/06/05/scalability-but-at-what-...](https://blog.acolyer.org/2015/06/05/scalability-but-at-what-cost/)

~~~
Silhouette
IMO, the linked article is much more insightful than the flippant comment here
might suggest. It's a solid argument, backed by real world data, about how
easy it is to make bad assumptions equating better scalability with better
performance.

------
rossmohax
Carmack is moving to AI and inevitably has to deal with a lot of Python, which
still bottlenecks things here and there despite all the effort to move
computation to C extensions. I have really high hopes that he detours a bit
and creates very, very good non-Python tooling for ML.

~~~
ospider
Maybe he could make Python fast once and for all :p

~~~
stefano
Python has some fundamental language semantics that make it really hard, if
not impossible, to create an implementation that can match Java. PyPy is
probably the best you can do to optimize Python, and it shines in tight
numerical loops, but it gets less effective as code gets more complex.

------
kitsuac
I've been programming C++ and assembly for 23 years. A few years ago I became
a huge fan of Python. In my opinion Python is amazingly well suited for a
rapid first revision that can then be swapped out for C++/asm.

~~~
davidcbc
This is fine as long as you can convince management to spend the money to
rewrite your software. That's usually a hard sell though. In my experience
this plan usually ends up with a python monstrosity that everyone hates but is
forced to deal with forever.

~~~
j88439h84
Type hints and dataclasses are a game-changer for Python. Much easier to
reason about programs that use them.
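
A minimal sketch of what that buys you (the names are made up):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Order:
        order_id: int
        amount_cents: int
        currency: str = "USD"

    def total_cents(orders: List[Order]) -> int:
        return sum(o.amount_cents for o in orders)

    # mypy (or a type-aware editor) flags a call like total_cents(["oops"])
    # before the code ever runs, and the dataclass gives you __init__,
    # __repr__ and __eq__ for free.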

~~~
adev_
> Type hints and dataclasses are a game-changer for Python. Much easier to
> reason about programs that use them.

Type hints made large Python codebases switch from "any change will blow up in
my face" to "I can touch it carefully with protective gloves".

Though it is still pretty easy to fool mypy, and it is very far from the
compile-time guarantees that most statically typed, AOT-compiled languages
provide.

And unfortunately, it also does not bring any advantage in terms of
performance.

------
anovikov
My good friend built a whole career doing exactly the same: "Replace a cluster
of 10 Elasticsearch servers with 1 running a custom-built C app and an
in-memory database".

Of course, it won't work out to replace 1000 Elasticsearch servers - that's
where the advantage of a true "big data" tool will show - but none of the
clients really have data "that big".

------
fxtentacle
Many people don't know about Python's GIL
[https://wiki.python.org/moin/GlobalInterpreterLock](https://wiki.python.org/moin/GlobalInterpreterLock)

That's the reason why you need to go multi-process if you want to reach a
similar level of concurrency in Python as with multi-threading in C++. And
that surely adds a lot of complexity.
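
For CPU-bound work, the multi-process route looks roughly like this; a minimal
sketch (the chunking and worker count are arbitrary):

    import random
    from concurrent.futures import ProcessPoolExecutor

    def chunk_sum(chunk):
        # Runs in a separate worker process with its own interpreter and GIL.
        return sum(chunk)

    if __name__ == "__main__":
        items = [random.random() for _ in range(10 ** 7)]
        n = 4
        step = len(items) // n
        chunks = [items[i * step:(i + 1) * step] for i in range(n)]
        # Each chunk gets pickled and shipped to a worker process, which is
        # exactly the extra complexity (and copying) threads wouldn't need.
        with ProcessPoolExecutor(max_workers=n) as ex:
            total = sum(ex.map(chunk_sum, chunks))
        print(total)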

As a very practical example of this, TensorFlow has a dedicated page with
advice on how to make the Python part that reads the files from disk less
slow. Think about that: The bottleneck for training a highly advanced AI with
millions of parameters is in the 20 lines of Python code to read in a binary
file...

[https://www.tensorflow.org/guide/data_performance](https://www.tensorflow.org/guide/data_performance)

~~~
avip
Many other people "know" about the GIL, to the extent of believing there's no
point using threads in python "because of the GIL".

I had a funny such experience lately in a job interview. I told the
interviewer his misconception could be falsified with ~10 LOC summing a list
with 2 threads.

~~~
avip
Ok, I see some comments (rightfully) asking for less talk and more code.

# main.py

    
    
        import random
        from concurrent.futures import ThreadPoolExecutor as Pool
    
        items = [random.random() for _ in range(10 ** 7)]
    
    
        def run(items, n):
            step = len(items) // n
            with Pool(max_workers=n) as ex:
                res = [ex.submit(sum, items[i*step : (i+1)*step]) for i in range(n)]
            return sum(r.result() for r in res)
    
    
        if __name__ == '__main__':
            import timeit
            import sys
            n = sys.argv[1] if len(sys.argv) > 1 else 1
            time = timeit.timeit('run(items, %s)' % n, 'from __main__ import run, items', number=10)
            print("%s\t%.3f" % (n, time / 10))
    
    
    
      $ for x in `seq 1 16` ; do python3 -m main $x ; done
    
    
        1       0.172
        2       0.170
        3       0.166
        4       0.155
        5       0.149
        6       0.142
        7       0.144
        8       0.140
        9       0.136
        10      0.135
        11      0.135
        12      0.137
        13      0.135
        14      0.136
        15      0.136
        16      0.136

~~~
ginko
That’s not exactly what I would call a good speedup.

~~~
avip
The point of the code is not to speed up the execution of summing a list of
random numbers, but rather to speed up the acknowledgement by N random Python
developers that they have some misconceptions about the GIL.

I think it does that pretty well but, well, that's just like my opinion.

~~~
pnako
I'm a casual Python programmer (just a bit of scripting) and I had no
particular misconception about the GIL; I don't care because I write
single-threaded cookie-cutter scripts.

Your example, I think, is demonstrating the opposite of what you want to show.
Those figures are atrocious.

~~~
avip
The opposite of what? All I want is the GIL "discussion" to be based on facts
rather than emotions, hyperboles and FUD.

I'm not even sure what the opposite of that is.

~~~
deanmoriarty
What were the misconceptions of your interviewer you were trying to prove
wrong with your POC of multiple threads summing a list?

The way I read your post, it seemed like your interviewer told you that
threads in Python are not effective for parallel computation because of the
GIL, and your example proves exactly that. The performance of your threads is
absolutely horrible; if you were to do that in C++/Java/Go you would likely
see a speedup on the order of min(16, cores). Your example proves that your
threads are effectively serialized during the computation, which I assume was
the point of your interviewer (but please clarify my assumption).

Perhaps your interviewer was instead proposing that threads in Python work
like a charm?

~~~
avip
2 things:

1\. There is a common misconception that any threaded Python code is slower
than sequential. This is trivially false for IO-bound work, and _possibly,
potentially, in some cases_ false for CPU-bound work. Said interviewer had
that very misconception.

2\. More importantly - you and another parent are right. The code does not
demonstrate what I thought it does. This is proven by removing the ThreadPool
and leaving the rest of the code intact:

    
    
        import random
        import math
        from concurrent.futures import ThreadPoolExecutor as Pool
    
        items = [random.random() for _ in range(math.factorial(11))]
    
    
        def run(items, n):
            step = len(items) // n
            res = [sum(items[i*step : (i+1)*step]) for i in range(n)]
            return sum(res)
    
        if __name__ == '__main__':
            import timeit
            import sys
            n = sys.argv[1] if len(sys.argv) > 1 else 1
            time = timeit.timeit('run(items, %s)' % n, 'from __main__ import run, items', number=10)
            print("%s\t%.3f" % (n, time / 10))
    
    
        $ for x in `seq 1 11`; do python3 -m main $x ; done
        1       0.799
        2       0.749
        3       0.671
        4       0.715
        5       0.730
        6       0.704
        7       0.649
        8       0.631
        9       0.689
        10      0.613
        11      0.616
    
    

This is a result I'd have to silently contemplate before making any further
comments.

------
BalinKing
From the next tweet in the thread: “JAVA [sic] or C# would also be close, and
there are good reasons to prefer those over C++ for servers.”

~~~
paulddraper
Yep. C++ is the nuclear option, when you _know_ you need to keep memory
reasonable, or capture the last bit of compute performance.

Neither of these are usually the case for web application servers.

------
jacobsenscott
You can choose a language that optimizes your hardware, or you can choose a
language that optimizes your programmers. 99% of the time optimizing the
programmers is the right call.

~~~
otabdeveloper4
Yes, if by "optimizing programmers" you mean "optimizing the manager's
corporate structure footprint and bonus incentives".

If the choice is between hiring one good C++ programmer or 15 really dumb
Python "backend engineers" (and a team of QAs and sysadmins to support them),
what do you think your pointy-haired corporate boss would choose?

~~~
cthalupa
This is a false dichotomy. You don't have to choose between 1 C++ developer
and a package of 15+ Python developers.

Personal productivity is generally going to be lower in C++ than it is in
Python. In most situations you would gain more productivity out of a similarly
skilled Python dev than out of a C++ dev, so you'd probably need to hire more
C++ developers.

Unless your argument is "C++ devs are smart and Python devs are dumb", in
which case, let's not start calling people names over the language they use.

~~~
otabdeveloper4
> Personal productivity is generally going to be slower in C++ than it is in
> Python.

No, it entirely depends on the skill and experience level of the programmer.

> In most situations you would gain more productivity out of a similarly
> skilled Python dev as you are a C++ dev

No, a good programmer with equally good knowledge of both languages will code
equally fast in both.

> This is a false dichotomy. You don't have to choose between 1 C++ developer
> and 15+ package of Python developers.

You failed to see the point. Large teams of clueless programmers doing things
slowly and badly is a feature of the system, not a bug. KPI's for managers
don't include lowering headcount and cost cutting as an incentive. (And trust
me, you really wouldn't like it if they did.)

> Unless your argument is "C++ devs are smart and Python devs are dumb", in
> which case, let's not start calling people names over the language they use.

Yet that's effectively what you just did in your specious 'productivity'
argument.

~~~
cthalupa
>No, it entirely depends on the skill and experience level of the programmer.

We're comparing programming languages. You control for the other variables -
otherwise the comparison is meaningless. My argument is equally skilled
programmers will generally be more productive in Python than C++.

>No, a good programmer with equally good knowledge of both languages will code
equally fast in both.

Care to explain how? Huge numbers of the features in these higher-level
languages exist explicitly to increase productivity, and it's generally pretty
well accepted that they're successful. In this very comment section there are
multiple sets of people talking about having to implement their own reference
counting systems, and all sorts of other things. Implementing those systems
eats into productivity.

If any one language was the best at everything, we would only have one
language. There's trade offs made, and that's why there frequently is a right
language (or set of languages) for one job vs. another.

>You failed to see the point. Large teams of clueless programmers doing things
slowly and badly is a feature of the system, not a bug. KPI's for managers
don't include lowering headcount and cost cutting as an incentive. (And trust
me, you really wouldn't like it if they did.)

This simply hasn't been the case anywhere I've worked at for a meaningful
amount of time. Empire building was plenty discouraged at all of the places I
have been employed long term and there were KPIs for reducing headcount, or
maintaining it while taking on additional responsibility, and accounting was
always happy to step in if costs were increasing without solid justification.
Reducing headcount doesn't even have to mean firing people or managing them
out - it can be not backfilling spots, helping people find teams with open
headcount and transferring, etc. It's never bothered me any - if we have too
many people for the amount of work, I'm more likely to get bored.

>Yet that's effectively what you just did in your specious 'productivity'
argument.

You started this whole tangential discussion by taking shots at Python
developers for no apparent reason. I'm not calling anyone names, or anything
even remotely similar - different languages are frequently differently suited.
There's obviously places where C++ makes a lot of sense.

I don't really have a horse in this race - right this moment, I'd do basically
any serious project in rust, and cargo-script has me even doing small scripts
in rust as well. Maybe Erlang or Elixir if OTP makes a lot of sense for the
project.

------
useful
PCI-e 4.0 and 5/6 enable consuming absolutely insane amounts of data. I can't
think of many applications that need that.

128GB/sec? Sure! And 256GB/sec is coming.

With 10Gb, 40Gb and 100Gb internet links becoming mainstream, the roadblock to
having cool stuff is developers to build it.

~~~
PedroBatista
Hold my beer, I’m firing up my [ insert dev stack ].

------
cryptica
Scalability and performance are very different.

Opting for a performant but not scalable solution is basically an
acknowledgement that:

\- The project will only succeed up to a certain point.

\- After the initial launch, no further changes will be made to the project
since new additions are likely to cost performance and lower the upper bound
on how many users the system can support.

Not many projects are willing to accept either of these premises. Nobody wants
to set an upper bound on their success.

Also, it should be noted that no language is 'much faster' than any other
language. Benchmarks which compare the basic operations across multiple
languages rarely find more than 50% performance difference. The more
significant performance differences are usually caused by implementation
differences in more advanced operations and in any given language, different
libraries can offer different performance characteristics so it's not fair to
say that a language is slow because some of its default advanced operations
are slow.

Usually, performance problems come down to people not choosing the best
abstract data type for the problem. Some kinds of problems perform better with
a linked list, others perform better with arrays or maps or binary trees.

Time complexity of any given algorithm is way more significant than the
baseline performance of the underlying language.

------
slx26
the title is misleading. sure, it advocates for C++, or Java, or C#, or others
above Python, but clarifying the context: "a lot (not all!) of complex
“scalable” systems can be done with a simple, single C++ server.", which I
take as: "sometimes it's better to write a simple server in a low level
language, than a complex server in a higher level language".

~~~
dang
Submitted title was "John Carmack advocates C++ for server development". We've
replaced it with a shortened version of what he said.

------
arcticbull
I'm not sure there's any such thing as a simple C++ server ;)

------
buboard
people will bike to work to "save the environment" but won't use C directly on
the metal to reduce the carbon footprint of their code.

For a guy like Carmack, it may be quite frustrating working with the
constraints of PyTorch etc. He'll probably end up making his own PyTorch
frontend in C, which, as a bonus, people will use to deploy models.

~~~
Aperocky
I don’t think it’s that big of a language issue. Java isn’t that much slower
(logarithmically speaking). But design choices have arbitrarily incorporated
huge amounts of bloat and inefficiency.

~~~
buboard
yep, logarithmic. Well, the question is: what's the motive behind those design
choices? Today's e.g. web app choices seem to be basically aesthetic, with an
eye for extreme novelty and total disregard for how many times a buffer is
being copied back and forth before reaching the user.

~~~
Aperocky
That is compared to Python/Ruby, which would be multiple logs slower, or to
scaling choices that make the service resemble microkernels. Usually
dedicated, simple engineering can be very fast using either C or Java.

------
qaq
Keep in mind a modern commodity x86 server with 128 physical cores, 4 TB of
RAM, a decent amount of SSD storage, and dual 100GbE NICs is about $70K. The
ability to use something like Rust also changes the equation significantly.

~~~
eb0la
I believe the key is being able to use something _native_, not interpreted
code or bytecode running on a virtual machine, which runs in a userspace
process inside a virtualized server inside a bare-metal server.

That said, I believe Rust has a great future in data processing. Just take a
look at the Apache Arrow Rust bindings and Ballista
([https://news.ycombinator.com/item?id=20456273](https://news.ycombinator.com/item?id=20456273))

------
fatbird
Has Carmack looked at Rust? I'd be very curious for someone like him to look
at it in light of his experience with C/C++ and high performance systems.

~~~
pizlonator
Good question, not sure why you were voted down. And I’m not even a Rust fan.

~~~
fatbird
I was curious mainly because I remember some long comments he made about the
relative value of linting and static analysis tools, specifically going into
Coverity's analysis of the Doom 3 code (I think), fixing everything it flagged
for some subset of the code, and then asking himself whether it was really
better or whether it was more of a hindrance that obfuscated the code.

IIRC, his conclusion was mixed: a lot of it was obviously beneficial and worth
having turned on, but much of it wasn't, and going forward he intended to make
it a limited but continual part of his toolchain.

So my interest in whether he'd tried Rust was whether he'd compared Rust's
changes like the borrow checker against his earlier conclusions on writing
good C/C++. Cute that he's tried and liked it, but I'd really like to see a
more in-depth comparison from him.

~~~
pizlonator
Right. He’s got some kind of notorious programming style with who knows what
kinds of object graphs. It would be quite a data point to know if he thinks
that he can comfortably map it to Rust.

------
wruza
_Morals of this story: 1\. Always use the most performant language available
to you. (What if your program gets a few million users?)

2\. Horizontal scalability is too much complexity/work. Just apply the correct
amount of optimization when you initially write the code._

If you, like me, read this on mobile and did not click [more answers] link
under that, then do it. That may save you a minute or two of derealization
time.

~~~
choeger
I really had a hard time to decide whether I agree with that statement or not.
If you design a whole service it is obviously not true. But if you develop
something like a specialized Backend, say a database, you might want to
reconsider the complexity of horizontal scalability.

------
api
This has been a repeated point since "enterprise" Java (a.k.a. Jabba) became a
thing in the late 90s and early 2000s. A ton of enterprise code is comically
inefficient and held together with scotch tape and used chewing gum.

It ends up boiling down to the fact that compute power is a lot cheaper than
developer time and really good developers are more expensive and harder to
find than inexperienced ones.

------
gregdoesit
From this thread, I feel a lot of people misunderstand what a scalable system
means for a scale-up startup or BigTechCo. It doesn’t mean cost-efficiency. It
means the ability to solve scalability issues at 10x, 100x, 1000x load by
throwing money at it (aka buying more instances/machines).

Yes, John Carmack is right that a bunch of scalable systems that see
reasonable load, with well-understood traffic requirements, could be rewritten
in more efficient ways. But how long would it take? How much would the extra
work cost? And what would happen if traffic went up another 10x, 100x? Could
you still throw money at that problem, with this new and efficient system
already in place?

One of my memorable stories comes from an engineer I worked with when we
rewrote our payments system[1]. He told me how, at a previous payments
company, they had a system that was written in this kind of efficient way and
needed to run on one machine. As the scale went up, the company kept buying
bigger hardware, at one point buying the largest commercially available
mainframe (we’re talking a cost in the multi-millions for the hardware). But
the growth was faster and they couldn’t keep up. Downtime after downtime
followed at peak times, causing hundreds of thousands in losses per downtime.

They split into two teams. Team #1 kept making performance tweaks on the
existing hardware to try to get performance wins and increase reliability
during peak load. Team #2 rewrote their system in a vertically scalable way.
Team #1 struggled and the outages kept getting worse. Team #2 delivered a new
system quickly and the company transitioned over to it.

What they gained was not cost savings: it was the ability to (finally!) throw
money at their growth problems. Now, when traffic went up, they could
commission new machines and scale with traffic. And they could eventually
throw away their mainframe.

Years later, they started to optimise the performance of the system, saving
millions of dollars. The new system - like the old one - cost millions to run.
But that was beside the point. Finally, they stopped bleeding tens of millions
in lost revenue per year due to their inability to handle sudden, high load.

It’s all about trade-offs. Is cost your #1 priority, and are you okay spending
a lot of development time on optimising your system? Go with C++, Erlang or
some other, similarly efficient language. Is product-market fit more
important, along with the ability to buy high uptime? Use the classic,
horizontally scalable distributed-systems stack, and worry about optimising
later, when you have stable traffic and optimisation is more profitable than
further product development.

[1] [https://blog.pragmaticengineer.com/distributed-architecture-...](https://blog.pragmaticengineer.com/distributed-architecture-concepts-i-have-learned-while-building-payments-systems/)

~~~
tlarkworthy
(You meant they rewrote it in a horizontally scaling way... add more machines
to a pool to scale.)

~~~
gregdoesit
Yes - thank you! Corrected.

------
coderunner
What about in the context of trying to get to an MVP? Is the dev-time speedup
of using a dynamic programming language and stack significant over using a C++
backend? You wouldn't care much about performance when you're trying to figure
out if you'll get traction.

~~~
otabdeveloper4
No, it's not significant. Dev time depends on programmer skill, not the
toolset. A good C++ programmer will develop your MVP many times faster than an
average Python programmer.

Python programmers are much easier to hire, though - you already need a good
C++ programmer on the team to hire another one, because HR and corporate
management can't into proper hiring process.

This last factor is the overarching most important one for BigCorp Enterprise
Inc., not development speed or cost.

~~~
filleokus
> Dev time depends on programmer skill, not the toolset.

This is obviously not strictly true, always. A skilled programmer will use the
proper tools for the job.

If you, for example, are tasked with writing a backend service exposing a
GraphQL API, I think it would be foolish to do this in C++, and I would bet
that the average Python programmer would do it quicker than even a top-tier
C++ programmer (if the latter were hellbent on doing it in C++).

Especially when working with MVPs (or new projects in general), the ability to
leverage already existing tools and frameworks is key to rapid progress. This
doesn't necessarily have to mean scripting languages, but the
Python/Node/Go/etc developer would have a working GraphQL server up and
running, connected to a database of choice, within an afternoon, while the
skilled C++ developer would have to spend at least a few days implementing a
GraphQL server mostly from scratch [0].

[0]: A quick Google shows that schema parsers exist for C++, but nothing
matching the frameworks/libraries available for more web-fashionable languages.

~~~
otabdeveloper4
Two points:

a) Writing a schema parser is not rocket science. In fact, for a good
programmer implementing their own GraphQL library would be quicker than
integrating some third-party library. So your first point ("average Python
programmer would do it quicker than even a top-tier C++ programmer") is
absolutely wrong.

b) There's no value in an MVP that does something generic that is already
available in off-the-shelf libraries. Your GraphQL example is pretty pointless
because it doesn't actually do anything.

> ...the Python/Node/Go/etc developer would have a working GraphQL server up
> and running connected to a database of choice within an afternoon

Well, no. By the end of the week they'll still be arguing about which package
manager to use, whether TDD is a good idea, what makes a microservice 'micro'
and how to configure Kubernetes.

------
lonelappde
A single server cannot be made reliably available, though.

At single-node scale, development costs dwarf hardware costs of scaling
horizontally.

------
zem
vibe.d is well worth a look: [https://vibed.org/](https://vibed.org/)

~~~
zerr
Isn't D practically dead?
[https://news.ycombinator.com/item?id=21902953](https://news.ycombinator.com/item?id=21902953)

~~~
kal31dic
And yet dicebot continues to work in D, including for me for a while.

I'm hiring 25 D programmers, so I suppose it very much depends on what you
mean by practically!

~~~
zerr
What company is it?

------
rbanffy
OTOH, we can assume one doesn't need to worry about using C++ unless they have
many millions of users.

------
ungerik
This is why Google developed Go. Python like productivity with C++ like
performance...

------
tyuioyjj
What if those "scalable" systems were designed to "just work" and then, after
they succeeded, were changed into pseudo-"scalable"?

------
easytiger
Been saying this for years.

Modern SW trends are horrific

------
fnord77
wow, so much speculation, zero empirical evidence.

~~~
nbevans
His entire comment was based on empirical evidence. Which is why people are
disputing it - because there is room to do so.

------
amelius
The catch is: until you can't.

------
kyledrake
The analogy here is one of the best ice climbers in the world proposing an
ascent of the Matterhorn. Please use testable, easy to prototype with, memory
managed languages for production servers unless you are solving a very
specific problem and _really_ know what you are doing.

~~~
mberning
Very apt. I think people are also forgetting “the bad old days” where you
worked on a large-ish cpp or java app for a year, nothing worked right,
schedules slipped, and then the whole thing was scrapped and teams disbanded
to work on other stuff. That was very common. You can’t count on having a team
of Carmacks work on your blub app.

~~~
otabdeveloper4
Replace "cpp or java" with "Python or Ruby" and how is it any different today?

------
rvz
> ...My bias is that a lot (not all!) of complex “scalable” systems can be
> done with a simple, single C++ server.

The second tweet of the discussion:

> JAVA or C# would also be close, and there are good reasons to prefer those
> over C++ for servers. Many other languages would also be up there, the
> contrast is with Really Slow (but often productive and fun) languages.

I'm afraid that Carmack has sided with his own anecdotal experiences of the
1990s to justify the use of C++ for server-side development in the 21st
century. This probably made sense at the time due to the availability of more
C++ devs and fewer language choices, but today, in the 2020s? I remain totally
unconvinced by his argument.

He goes on to suggest Java or C#, which still makes sense for many companies
for generic server-side development if you are after a more secure backend;
Kotlin is pretty much the most sensible choice for this. Given Carmack's
engineering background, however, it is unsurprising that Java/C#/Kotlin would
be considered technically unsuitable for a high-performance gaming platform,
if one were to create one. So what credible languages could be used to compete
with C++? I hear Discord is having a great time using Elixir (Erlang could
also be used), and another gaming platform called 'Hadean' is using Rust for
their platform.

~~~
jariel
"But for the sake of Carmack's engineering background however, it is
unsurprising why Java/C#/Kotlin are technically unsuitable for high-
performance gaming platforms if one was to create one."

This really is just not true.

Financial, real-time style High Frequency Trading apps are often written in
Java - not C++.

Much of the JVM is not really a VM; it compiles to machine code - in an
optimised manner. For starters.

Given how difficult it is to develop safely in C++ I can hardly think of a
reason to ever use it on the backend.

~~~
jiggawatts
Trading apps generally process a small amount of data. Graphs are downright
lightweight compared to what a 3D game pumps through.

Generally, for a 3D game manual memory management and explicit data layout are
critical. For example, it's common to use custom memory allocators with a
region for each frame, a region for each loaded level, etc... This is then
much cheaper to simply drop on the floor than _any_ kind of object-by-object
cleanup, whether that is reference counting, garbage collecting, a traditional
heap, or whatever. Even Rust can't yet compete with this!

Similarly, many game engines use ECS systems or in-memory columnar data
layouts (structure of arrays instead of arrays of structures) to enable SIMD
instruction sets such as AVX.

Java can be coerced into doing much of the above, but it generally takes a
ridiculous effort to approach what comes nearly effortlessly with a language
like C++ or Rust.

Even C# is a better choice than Java, as it monomorphises more code and
recently had a range of extensions[1] added to reduce GC pressure such as
stackalloc, Span, Memory, MemoryPool, SequenceReader, ValueTask, etc...

I've bought and played several games written in C#, but other than Minecraft
I'm not aware of any popular real-time games written in Java in the last 15
years or so. Meanwhile, Minecraft is not at all smooth on my very high-end
gaming PC in 2019, despite being 8 years old. (I'm sure this can be eliminated
by tweaking some settings, but it's indicative of the problem.)

I say this as someone who used to professionally develop browser-based Java
games back in the early 2000s and had to personally jump through hoops to
reduce heap allocations to avoid GC pauses.

[1] [https://docs.microsoft.com/en-us/archive/msdn-magazine/2017/...](https://docs.microsoft.com/en-us/archive/msdn-magazine/2017/connect/csharp-all-about-span-exploring-a-new-net-mainstay)

~~~
jariel
Yes thanks for that but I think author was referring to game synch server not
actual games. Synching minimal state data can use small data structures etc.
But thanks though great comment.

~~~
jiggawatts
If anything the server-side coding is _harder_. Many games do the full
physics/simulation on the server to minimise cheating, and have to simulate
from every player's perspective. Meanwhile the clients have a single
perspective and most of the computation effort is offloaded to the GPU.

Additionally, most multiplayer games have the same codebase for the server and
the client for the obvious reasons. Single player is literally "online play"
with an in-memory channel to a local server.

All Quake-based games work this way, including the derivative Source-based
games and a bunch of other engine variants. Unreal works this way too if I
remember correctly.

You can't realistically write a client in C++ and a server in Java. You'd be
practically doubling your development time!

