Well... we have a 3-node MongoDB cluster and are processing up to a million trades... per second. And a trade is way more complex than a chat message. It has tens to hundreds of fields, may require enriching with data from multiple external services, then needs to be stored, be searchable with unknown, arbitrary bitemporal queries, and may need multiple downstream systems to be notified, depending on a lot of factors, when it is modified.
All this happens on the aforementioned MongoDB cluster and just two server nodes. And the two server nodes are really only for redundancy, a single node easily fits the load.
What I want to say is:
-- processing a hundred million simple transactions per day is not difficult at all on modern hardware,
-- modern servers have stupendous potential to process transactions, which is 99.99% wasted by "modern" application stacks,
-- if you are willing to spend a little bit of learning effort, it is easily possible to run millions of non-trivial transactions per second on a single server,
-- most databases (even ones as bad as MongoDB) have the potential to handle much more load than people think they can. You just need to understand, more or less, how the database works and what its strengths are, and play to them rather than against them.
And if you think we are running Rust on bare metal and some super large servers -- you would be wrong. It is a normal Java reactive application running on OpenJDK on an 8-core server with a couple hundred GB of memory. And the last time I needed to look at the profiler was about a year ago.
A million sounds impressive, but this is clearly not serialized throughput based on other comments here. Getting a million of anything to fast NVMe is trivial if there is no contention and you are a little clever with IO.
I have written experimental datastores that can hit in excess of 2 million writes per second on a Samsung 980 Pro. 1 KB object size, fully serialized throughput (~2 gigabytes/second, saturates the disk). I still struggle to find problem domains this kind of perf can't deal with.
If you just care about going fast, use 1 computer and batch everything before you try to put it to disk. It doesn't matter what fancy branding is on it. You just need to play by some basic rules.
The primary advantage of 1 computer is that you can much more easily enforce a total global ordering of events (serialization) without resorting to round-trip or PTP error-bound delays.
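A minimal sketch of the batch-then-write idea in Java (the record framing, file layout, and fsync policy here are illustrative assumptions, not the parent's actual datastore): collect a few thousand records, write them with a single call, and fsync once per batch rather than per record.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.List;

    // Append a whole batch of records with one write + one fsync,
    // amortizing syscall and flush costs over thousands of records.
    final class BatchedLog implements AutoCloseable {
        private final FileChannel channel;

        BatchedLog(Path path) throws IOException {
            this.channel = FileChannel.open(path,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        void appendBatch(List<byte[]> records) throws IOException {
            int total = records.stream().mapToInt(r -> r.length + 4).sum();
            ByteBuffer buf = ByteBuffer.allocate(total);
            for (byte[] r : records) {
                buf.putInt(r.length);   // simple length-prefixed framing
                buf.put(r);
            }
            buf.flip();
            while (buf.hasRemaining()) {
                channel.write(buf);     // one (or a few) syscalls for the whole batch
            }
            channel.force(false);       // one fsync per batch, not per record
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }
    }

The batch size becomes the knob that trades latency for throughput: bigger batches mean fewer syscalls and flushes per record, at the cost of a few extra milliseconds before any single record is durable.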
Trading is ideal for this for multiple reasons; one is that total global ordering is a key feature (and requirement) of the domain, so the "one big fast server" approach is a good fit. It is also quite widely known that several of the big exchanges operate this model: a single sequencer application, with multicast used to transmit the outcomes of what it sees.
The other thing that helps a lot here compared to Discord: trading is very neatly organized into trading days and shuts down for hours between each trading day. So you don't have the issue Discord had where some channels have low message volumes and others have high ones, leading to data scattered all over the place. You can naturally partition data by day, and you know at query time which data you want.
The decentralized Solana network is currently capable of over 50k TPS 24/7 in the live beta, with a target transaction finality of 30ms round trip from your browser/device. Their unofficial motto is "DeFi at Nasdaq Speed." Solana is nascent and will likely reach 10M+ TPS and 10ms finality within a couple years' time.
Directed-acyclic-graph-based networks (e.g. Hashgraph, which are not technically blockchains) can reach effectively infinite TPS but suffer in time to finality.
Solana is a blockchain with zero downtime (and a Turing-complete smart contract chain), mind you -- not a centralized exchange.
You can handle the load or not, right? A built-in maintenance window is super nice, but servers crash all the time. So either that's a problem, or you've got a system in place. And if you can handle failover, you've got free maintenance windows anyway, so it doesn't seem any more difficult?
It is wise if you mean "be ready for servers to crash at any time by thinking they are going to crash at the worst possible moment".
But it is stupid, because people think they need massive parallel deployments just because servers will be constantly crashing, and that is just not true. The cost they pay is having a couple of times more nodes than they would really need if they got their focus right (making the application efficient first, scalable later).
The reality is, servers do not crash. At least not the kind of hardware I am working on.
I was responsible for maintaining communication with a stock exchange for about 3 years in one of my past jobs, and during that time we didn't lose a single packet.
And aside from some massively parallel workloads that used tens of thousands of nodes, and the one time my server room boiled over due to failed AC (and no environmental monitoring), I haven't had a server crash on me in the past 20 years.
So you can reasonably assume that your servers will be functioning properly (if you bought quality), and that helps a lot at the design stage.
This is the regime we operate in as well. For our business, a failure, while really bad, is not catastrophic (we still maintain non-repudiation). We look at it like any other risk model in the market.
For many in our industry, the cost of not engineering this way and eating the occasional super rare bag of shit is orders of magnitude higher than otherwise tolerable. One well-managed server forged of the highest binned silicon is usually the cheapest and most effective engineering solution over the long term.
Another super important thing to remember is that the main goal of this is to have super simple code and very simple but rock-solid guarantees.
The main benefit is writing application code that is simple, easy to understand and simple to prove it works correctly, enabled by reliable infrastructure.
When you are not focusing on various ridiculous technologies that each require PhDs to understand well, you can focus on your application stack, domain modeling, etc. to make it even more reliable.
> When you are not focusing on various ridiculous technologies that each require PhDs to understand well, you can focus on your application stack, domain modeling, etc. to make it even more reliable.
This is 100% our philosophy. I honestly don't understand why all high-stakes software isn't developed in the same way that we build these trading/data systems.
I think this is the boundary between "engineering" and "art". In my experience, there are a lot of developers who feel like what they do is not engineering because they believe it to be so subjective and open to interpretation. Perhaps there is a mentality that it can't ever be perfect or 100% correct, so why even try to uphold such a standard as realistic? It is certainly more entertaining to consume new shiny technology than to sit down with business owners in boring meetings for hours every week...
In reality, you can build software like you build nuclear reactors. It is all a game of complexity management and discipline. Surprisingly, it usually costs less when you do it this way, especially after accounting for the total lifecycle of the product/service. If you can actually build a "perfect" piece of software, you can eliminate entire parts of the org chart. How many developer hours are spent every day at your average SV firm fighting bugs and other regressions? What if you could take this to a number approximating zero?
The classic retort I hear from developers when I pose comments like these is "Well, the business still isn't sure exactly what the app should do or look like". My response to that is "Then why are you spinning up Kubernetes clusters when you should be drawing wireframes and schema designs for the customer to review?"
Every time I write something like "Yes, you really can write reliable applications. No, if it breaks you can't blame everybody and the universe around you. You made a mistake and you need to figure out how it happened and how to prevent it from happening in the future." I just get downvoted to hell.
I suspect in large part it is because when people fail at something, they feel a need to find some external explanation for it. And that is all too easy when "business" actually is part of the problem.
The best people I worked with, let's just say I never heard them blaming business for their bugs. They own it, they solve it and they learn from it.
What I am not seeing is people actually taking a hard look at what they have done and how they could have avoided the problems.
For example, the single biggest cause of failed projects I have seen, by far, is unnecessary complication stemming from easily avoidable technical debt.
Easily avoidable technical debt is something that could reasonably have been predicted at an early stage and solved by just making better decisions. Maybe don't split your application into 30 services and then run it on Kubernetes? Maybe, rather than separate services, pay attention to having proper modules and APIs within your application, and your application will fit on a couple of servers? Maybe having function calls rather than a cascade of internal network hops is a cheap way to get good performance, rather than (ignoring Amdahl's law and) trying to incorporate some exotic database that nobody knows and everyone will have to start learning from scratch?
Then people rewrite these projects and, rather than understanding what caused the previous version to fail, just repeat the same process with a new application stack.
We know because that communication happens over UDP and each packet at the app layer has a sequence number. It is used on the receiving side to rebuild the sequence of packets (events from the exchange can only be understood correctly when processed in the same order as they were generated, and only if you have the complete stream -- you can't process a packet until you have processed the one preceding it). It is trivial to detect whether we have missed a packet.
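A rough sketch of that gap detection in Java, assuming each datagram carries a monotonically increasing sequence number (field names and the delivery hook are made up for illustration):

    import java.util.SortedMap;
    import java.util.TreeMap;

    // Packets that arrive ahead of a hole are parked until the gap is filled;
    // anything still parked after a timeout means a packet was missed.
    final class SequenceReassembler {
        private long nextExpected = 1;
        private final SortedMap<Long, byte[]> pending = new TreeMap<>();

        void onPacket(long seq, byte[] payload) {
            if (seq < nextExpected) {
                return;                                // duplicate or stale, drop it
            }
            pending.put(seq, payload);
            while (pending.containsKey(nextExpected)) {
                deliver(pending.remove(nextExpected)); // deliver strictly in order
                nextExpected++;
            }
        }

        boolean hasGap() {
            return !pending.isEmpty();                 // something arrived past a missing number
        }

        private void deliver(byte[] payload) {
            // hand the event to the processing pipeline
        }
    }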
We had a special alert for a missing packet. To my knowledge it has never fired in production except during exchange-mandated tests (the exchange runs tests regularly to ensure every brokerage house can handle faults in communications, maximum loads, etc.).
If a packet was missed, the same data is replicated on another, independent link through another carrier.
And if that weren't enough, if your system is down (which shouldn't happen during trading) you can contact the exchange's TCP service and request the missing sequences. But that never happened, either.
As we really liked this pattern, we built a small framework and used it for internal communication as well including data flowing to traders' workstations.
Mind you, neither the carrier links, the networking devices, nor the people who maintain them are cheap.
In regular trading (not crypto, see the other comment about the volume differences) it is common to tune Java, for example, to run GC only outside of trading hours. That works if you don't allocate new heap memory in every transaction/message but instead use only the stack + pre-allocated pools.
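A toy illustration of that allocation-free style, assuming a hypothetical fixed-size message type; the point is that the steady state recycles pre-allocated objects instead of producing garbage for the GC:

    import java.util.ArrayDeque;

    // A fixed pool of mutable message objects, allocated once at startup.
    final class MessagePool {
        static final class Msg {
            long price;
            long quantity;
            void clear() { price = 0; quantity = 0; }
        }

        private final ArrayDeque<Msg> free = new ArrayDeque<>();

        MessagePool(int size) {
            for (int i = 0; i < size; i++) {
                free.push(new Msg());
            }
        }

        Msg acquire() {
            Msg m = free.poll();
            return m != null ? m : new Msg();   // pool exhausted: fall back and log it
        }

        void release(Msg m) {
            m.clear();
            free.push(m);                       // recycle instead of letting the GC collect it
        }
    }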
That's not really what this article is about. Their problem wasn't throughput. What's the size of all the data in your MongoDB instance? And what's the latency of your reads?
In the big data world the "complexity" of the data doesn't really mean much. It's just bytes.
> What's the size of all the data in your MongoDB instance?
3x12TB
> In the big data world the "complexity" of the data doesn't really mean much.
Oh how wrong you are.
It is much easier to deal with data when the only thing you need to do is to just move it from A to B. Like "find who should see this message, make sure they see it".
It is much different when you have a large, rich domain model that runs tens of thousands of business rules on incoming data, and each entity can have very different processing depending on its state and the event that came in.
I am writing whole applications just to data-mine our processing flow, to be able to understand a little bit of what is happening there.
At that traffic you can't even log anything for each of the transactions. You have to work indirectly through various metrics, etc.
Nice, that's a pretty decent size; still curious about the latency. That's the primary problem for a real-time chat app.
Complexity of data and running business rules on it is not a data store problem though, that's a compute problem. It's highly parallelizable and compute is cheap.
For reference, my team runs transformations on about 1 PB of (uncompressed) data per day with 3 spark clusters, each with 50 nodes. We've got about 70ish PB of (compressed) data queryable. All our challenges come from storage, not compute.
In order to be able to run so much stuff on MongoDB, we almost never run single queries to the database. If I fetch or insert trade data, I probably run a query for 10 thousand trades at the same time.
So what happens is, as data comes in from multiple directions it is batched (for example 1-10 thousand records at a time), split into groups that can be processed together in a roughly similar way, and then travels the pipeline as a single batch, which is super important as it allows amortizing some of the costs.
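For the fetch side, a hedged sketch with the MongoDB Java driver of what "one query for 10 thousand trades" can look like (the collection and field names are invented for the example):

    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    import java.util.ArrayList;
    import java.util.List;

    final class TradeRepository {
        private final MongoCollection<Document> trades;

        TradeRepository(MongoCollection<Document> trades) {
            this.trades = trades;
        }

        // One round trip for the whole batch instead of thousands of point lookups.
        List<Document> fetchBatch(List<String> tradeIds) {
            return trades.find(Filters.in("tradeId", tradeIds))
                         .into(new ArrayList<>());
        }
    }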
Also, the processing pipeline has many, many steps in it. A lot of them have buffers in between so that steps don't get starved for data.
All this causes latency. I try to keep it subsecond but it is a tradeoff between throughput and latency.
It could have been implemented better, but the implementation would be complex and inflexible. I think having clear, readable, flexible implementation is worth a little bit of tradeoff in latency.
As to storage being the source of most woes, I fully agree. In our case it is trying to deal with the bloat of data caused by the business wanting to add this or that. All this data makes database caches less effective, requires more network throughput, more CPU for parsing/serializing, needs to be replicated, etc. So half the effort is constantly trying to figure out why they want to add this or that and whether it is really necessary or can be avoided somehow.
I thought you were doing millions of QPS on a 3-node MongoDB cluster, from the top-level comment. That would be impressive.
By batching 1-10 thousand records at a time, your use case is very different from Discord's, which needs to deliver individual messages as fast as possible.
Data doesn't come in or leave batched. This is just an internal mechanism.
Think in terms of Discord: their database probably already queues and batches writes. Or maybe they could decide to fetch details of multiple users with a single query by noticing there are 10k concurrent asks for user details. So why have 10k queries when you could have 10 queries for 1k user objects each?
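A rough sketch of that kind of request coalescing in Java: individual lookups are parked as futures, and a periodic flush answers a whole batch with one bulk query. The bulk loader and the 2 ms flush interval are assumptions for illustration.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Function;

    final class UserLookupCoalescer {
        private final Map<Long, CompletableFuture<String>> pending = new HashMap<>();
        private final Function<List<Long>, Map<Long, String>> loadUsers; // one bulk query for many ids
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        UserLookupCoalescer(Function<List<Long>, Map<Long, String>> loadUsers) {
            this.loadUsers = loadUsers;
            scheduler.scheduleAtFixedRate(this::flush, 2, 2, TimeUnit.MILLISECONDS);
        }

        // Callers get a future immediately; it completes when the next batch is flushed.
        synchronized CompletableFuture<String> getUser(long id) {
            return pending.computeIfAbsent(id, k -> new CompletableFuture<>());
        }

        private void flush() {
            Map<Long, CompletableFuture<String>> batch;
            synchronized (this) {
                if (pending.isEmpty()) return;
                batch = new HashMap<>(pending);
                pending.clear();
            }
            Map<Long, String> users = loadUsers.apply(new ArrayList<>(batch.keySet()));
            batch.forEach((id, future) -> future.complete(users.get(id)));
        }
    }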
If you complain that my process is different because I refuse to run it inefficiently when I can spot an occasion to optimize then yes, it is different.
Of course, cassandra/mongodb/etc can perform their own batching when writing to the commit log, and can also benefit from write combining by not flushing out the dirty data immediately. That's beside the point.
Your use case allows you to perform batching for writes at the *application layer*, while discord's use case doesn't.
Why couldn't others with lots of traffic use a similar approach? I assume they do. It seems like a pretty genius idea to batch things like that; especially when QPS is very high, batching (maybe waiting a few ms to fill a batch) makes a lot of sense.
I don't see why Discord's case can't use the same tricks. If they have a lot of stuff happening at the same time and their application is relatively simple (in terms of the number of different types of operations it performs), at any point in time it is bound to have many cases of the same operation being performed.
Then it is just a case of structuring your application properly.
Most applications are immediately broken, by design, by having a thread dedicated to the request/response pair. It then becomes difficult to have parts of that processing from different threads be selected and processed together to take advantage of amortized costs.
The alternative I am using is funneling all requests into a single pipeline and having that pipeline split into stages distributed over CPU cores. So a request comes in (by way of Kafka or a REST call, etc.), it is queued, it goes to CPU core #1, gets some processing there, then moves to CPU core #2, gets some other processing there, gets published to CPU core #3, and so on.
Now, each of these components can work on a huge number of tasks at the same time. For example, when the step is to enrich the data, it might be necessary to shoot a message to another REST service and wait for the response. During that time the component picks up other items and does the same.
As you can see, this architecture practically begs for batching and amortizing costs.
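A minimal Project Reactor sketch of that staged pipeline; the stage functions, batch size, and timeout are placeholders. Events are funneled into one Flux, buffered into batches, and each stage is published onto its own scheduler (roughly, its own core):

    import reactor.core.publisher.Flux;
    import reactor.core.scheduler.Schedulers;

    import java.time.Duration;
    import java.util.List;

    public class StagedPipeline {
        public static void main(String[] args) throws InterruptedException {
            Flux<String> incoming = Flux.interval(Duration.ofMillis(1))
                                        .map(i -> "event-" + i);        // stand-in for Kafka/REST input

            incoming
                .bufferTimeout(1000, Duration.ofMillis(5))              // batch: up to 1000 events or 5 ms
                .publishOn(Schedulers.newSingle("enrich"))              // stage 1 on its own thread
                .map(StagedPipeline::enrichBatch)
                .publishOn(Schedulers.newSingle("persist"))             // stage 2 on another thread
                .map(StagedPipeline::persistBatch)
                .subscribe(batch -> System.out.println("processed " + batch.size()));

            Thread.sleep(5000);                                         // keep the demo alive briefly
        }

        static List<String> enrichBatch(List<String> batch) { return batch; }   // placeholder stage
        static List<String> persistBatch(List<String> batch) { return batch; }  // placeholder stage
    }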
What you're describing sounds like vanilla async concurrency. I seriously doubt 'most applications' use the one-thread-per-request model at this point in time, most major frameworks are async now. And it's not a silver bullet either, plenty of articles on how single-thread is sometimes a better fit for extremely high-performance apps.
After reading all of your responses, I still don't see how you think your learnings apply to Discord. They would not be able to fit the indexes in memory on MongoDB. They can't batch reads or writes at the application server level (the latency cost for messaging is not acceptable). Millions of queries happen every second, not one-off analytical workloads. It seems these two systems are far enough apart that there is really no meaningful comparison to be made here.
Well on one hand you've got engineers at a billion dollar company explaining how they've solved a problem. On the other hand you've got some random commentor on HN over-simplifying a complex engineering solution.
I think you're reading into it. They are stating that the solution in the post was overengineered, and describing an alternate solution that doesn't require as much abstraction or resources, but is manageable for data with a much higher dimensional structure
The fact that you read that as "I am very smart" and that that was a reason to downvote the post, tells more about you than it does the person you're supposedly describing.
As an example, there are bitemporal queries like "for the given population of trades specified by the following rules, find the set of trades that met the rules at a particular point in time, based on our knowledge at another given point in time". Also, trades are versioned (they are a stream of business events from the trading system), then have amendments (each event may be amended in the future, but the older version must be preserved). Our system can also amend the data (for example to add some additional data to the trade later). All this causes trades to be a tree of immutable versions you need to comb through. A trade can have anywhere from 1 to 30k versions.
This takes about 20 seconds. The process opens about 200 connections to the cluster and transfers data at about 2-4GB/s.
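For the shape of those bitemporal lookups, here is a hedged illustration with the MongoDB Java driver; the field names (validFrom/validTo for business time, recordedAt/supersededAt for knowledge time) are invented for the example, not the actual schema:

    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;
    import org.bson.conversions.Bson;

    import java.time.Instant;

    final class BitemporalQueries {
        // "Which version of each trade was in effect at businessTime,
        //  according to what we knew at knowledgeTime?"
        static Bson asOf(Instant businessTime, Instant knowledgeTime) {
            return Filters.and(
                Filters.lte("validFrom", businessTime),
                Filters.gt("validTo", businessTime),        // business-validity interval
                Filters.lte("recordedAt", knowledgeTime),
                Filters.gt("supersededAt", knowledgeTime)   // knowledge (system-time) interval
            );
        }

        static long countMatching(MongoCollection<Document> trades, Bson populationRules,
                                  Instant businessTime, Instant knowledgeTime) {
            return trades.countDocuments(
                    Filters.and(populationRules, asOf(businessTime, knowledgeTime)));
        }
    }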
Are you really not sure that financial data "with hundreds of fields" is more complex than chat data, which has relatively linear threading and only a handful of fields?
I'm asking about how your system scales to the number of queries, but you seem to be taking every question personally. You seem to really want to make sure everyone knows that you think Discord's problems are easy to solve. I'm not saying Discord is more complicated, but you're not really giving enough information to prove that Discord's problems are a subset of yours.
Do you support more simultaneous queries than Discord?
These days they added an extra thread_id field, sure. But the data itself is blisteringly uncomplex and there is only a single way to display it (ordered by time, i.e. the 'thread').
Just recently Symphony switched the order in which I received two messages from a colleague. This completely changed the meaning, and we got into an argument that was only resolved after I posted a screenshot of what he wrote.
It seems the threading might not be that simple after all.
> -- if you are willing to spend a little bit of learning effort, it is easily possible to run millions of non-trivial transactions per second on a single server,
I got into programming through the private server (gaming) scene. You learn that the more you optimize and refactor your code to be more efficient, the more you can handle on less hardware, including embedded systems. So yeah, it's amazing how much is wasted. I'm kind of holding out hope that things like Rust and Go focus on letting you get more out of less hardware.
Go is slower because it optimizes for compile time.
Rust is slower (usually) because Rust does not revolve around making the most use of the hardware it has.
both are fine choices; nothing wrong with either direction.
Zig performs very well by default because it was designed to be efficient and fast from the start, without compromise. It has memory safety, too, but in a way that few seem to understand, myself included, so it's difficult for me to describe with my rudimentary understanding.
Maybe if you use no_std. I evaluated a bunch of Rust libraries for some server software, but I could not use any of them because they pervasively assume that it is OK to make syscalls to allocate memory. If you'd like to write software that makes few syscalls in the steady state, you can do it in Rust, but you can't use libraries. Or String or Vec, I guess.
Why is memory allocation via syscalls bad? I get it for embedded (which was mentioned above so perhaps that's what this targets), but I kind of assumed malloc was a syscall underneath on an actual OS and that was fine.
Actually, you don't need to run any functions to cause a trip into the kernel. For example, at the very lowest level, even trying to access a memory page that is present in physical memory but does not yet have an entry in the page tables causes the CPU to fault into the OS to fix up the mapping.
This is my experience too. Millions of persisted and serialized TPS on regular NVMe with ms latency. Though it took a bit of effort to get to these numbers.
Curious as to how many days of data you have in your cluster. It seems like it could be ~1/2 billion records per day, 125 billion per year-ish. In a few years your 3 node Mongo cluster would be getting towards volumes I associate with a 'big data' kind of solution like BigTable.
I'm not "we" but I have some experience in this area.
Computers are fast, basically. ACID transactions can be slow (if they write to "the" disk before returning success), but just processing data is alarmingly speedy.
If you break down things into small operations and you aggregate by day, you can always have big numbers. The monitoring system that I wrote for Google Fiber ran on one machine and processed 40 billion log lines per day, with only a few seconds of latency from upload start -> dashboard/alert status updated. (We even wrote to Spanner once-per-upload to store state between uploads, and this didn't even register as an increase in load to them. Multiple hundred thousand globally-consistent transactional writes per minute without breaking a sweat. Good database!)
apenwarr wrote a pretty detailed look into the system here: https://apenwarr.ca/log/20190216 And like him, I miss having it every day.
I have a plan to write a book on how to write reactive applications like that. Mostly a collection of observations, tips, tricks, patterns for reactive composition, some very MongoDB-specific solutions, etc.
Not sure how many people would be interested. Reactor has quite a steep learning curve, and there is very little literature on how to use it for anything non-trivial.
The aim is not just to enable good throughput, but also to achieve this without compromising on clarity of implementation. Which is where I think reactive, and specifically ReactiveX/Reactor, shines.
I'm interested in getting your book published. I've had a career in publishing and specialist media, a lot of it spent on problems related to your subject. Semi-retired, with risk capital to get the right distribution while maintaining well above industry-standard terms. Email in profile.
Thanks. I will try to self publish. I want to keep freedom over content and target and I am not looking for acclaim for having my name on a book from a well known publisher. I am just hoping to help people solve their problems.
Considering that a CPU can do 3 billion things a second, and a typical laptop can store 16 billion things in memory, it shouldn't take more than 5 of these to handle "billions of messages". I agree with you that modern frameworks are inefficient.
By 16 billion things you mean 16 billion bytes? If you are talking about physical memory, then no, you can't occupy the entire memory. If you are talking about virtual memory, then you can store more data.
Actually, the CPU processes things in words, not bytes. On a 64-bit architecture the word is 64 bits, or 8 bytes.
But there are a lot of things the CPU can do even faster than that, because this limitation only relates to actual instruction execution (and even then there are instructions that can process multiple words at a time).
I've used cassandra quite a bit and even I had to go back and figure out what this primary key means:
((channel_id, bucket), message_id)
The primary key consists of partition key + clustering columns, so this says that channel_id & bucket are the partition key, and message_id is the one and only clustering column (you can have more).
They also cite the most common cassandra mistake, which is not understanding that your partition key has to limit partition size to less than 300MB, and no surprise: They had to craft the "bucket" column as a function of message date-time because that's usually the only way to prevent a partition from eventually growing too large. Anyhow, this is incredibly important if you don't want to suffer a catastrophic failure months/years after you thought everything was good to go.
They didn't mention this part: Oh, I have to include all partition key columns in every query's "where" clause, so... I have to run as many queries as are needed for the time period of data I want to see, and stitch the results together... ugh... Yeah it's a little messy.
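A rough sketch of that "one query per bucket, stitch the results" pattern with the DataStax Java driver; the bucket function (a fixed 10-day window derived from the snowflake timestamp), the epoch constant, and the table/column names are illustrative assumptions:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.Row;

    import java.util.ArrayList;
    import java.util.List;

    final class MessageReader {
        private static final long ASSUMED_EPOCH_MS = 1420070400000L;        // assumed snowflake epoch
        private static final long BUCKET_MS = 10L * 24 * 60 * 60 * 1000;    // assumed 10-day buckets

        static int bucketOf(long snowflakeId) {
            long timestampMs = (snowflakeId >>> 22) + ASSUMED_EPOCH_MS;     // snowflake -> milliseconds
            return (int) (timestampMs / BUCKET_MS);
        }

        // Walk buckets from newest to oldest until we have enough rows.
        static List<Row> recentMessages(CqlSession session, long channelId, long beforeId, int limit) {
            List<Row> out = new ArrayList<>();
            int bucket = bucketOf(beforeId);
            while (out.size() < limit && bucket >= 0) {
                for (Row row : session.execute(
                        "SELECT * FROM messages WHERE channel_id = ? AND bucket = ? "
                      + "AND message_id < ? ORDER BY message_id DESC LIMIT ?",
                        channelId, bucket, beforeId, limit - out.size())) {
                    out.add(row);
                }
                bucket--;   // older messages live in earlier buckets
            }
            return out;
        }
    }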
The bigger catch is that when your partition grows too big and your nodes are hit by the OOM killer, you have very few options other than creating a new table and replaying the data, or using a CLI tool to manually re-partition your data while the node is offline.
Using Cassandra tends to mean pushing costs to your developers instead of spending more money on storage resources, and your devs will almost certainly spend a ton of time fixing downed nodes.
Apple supplied some of the biggest contributors to Cassandra who were optimizing things like how to read data in a partition without fully reading the partition into memory to avoid the terrible GC cost. They put in a ton of engineering effort that probably could have been better spent elsewhere if they’d used a different database.
Cassandra won't try to load a partition into memory. It doesn't work that way. The only way you would get behavior like that is by setting "allow filtering" on. Allow filtering is a dedicated keyword for "I know I shouldn't do this but I'm going to anyway". If you're trying to run those types of queries, use a different database. If someone is making you use a transactional database without joins for analytical load, get a different job, because that's a nightmare.
Also, your partitions should never get that large. If you're designing your tables in such a way that the partitions grow unbounded, there's an issue. There are lots of ways to ensure that the cardinality of partitions grows as the dataset grows. And you actually control this behavior by managing the partitioning. It's really easy to grok the distribution of data on disk if you think about how it's keyed.
You've basically listed a bunch of examples of what happens when you don't use a wide columnar store correctly. If you're constantly fixing downed nodes, you're probably running the cluster on hardware from Goodwill.
A huge partition is often spread across multiple SSTables, and often has tombstone issues if it consists of a large number of column keys or sees any regular update cycle, which is often the case for hot rows.
In that case the overhead of collecting all the parts of the data you need from across different SSTables, and then processing the tombstones, can lead to a lot of memory pressure.
Kind of the million dollar question. People like to complain about cassandra, but that just brings up the adage about C++: people complain about the things they use.
But let's not pretend that cassandra isn't almost always a bear. The other problem is that cassandra keeps things up (and never gets the credit for it) but that creates a host of edge cases and management headaches (which makes management hate it).
Most competitors abandon AP for CP (HBase and Cockroach and I think FoundationDB) in order to get joins and SQL, but the BFD on cassandra is the AP design.
Scylla did a C++ rewrite to address tail latency due to JVM GC, but after an explosive release cycle, they basically stalled at partial 2.2 compatibility. Rocksandra isn't in mainline and doesn't appear to be worked on anymore.
I follow the Jepsen tests a lot: they don't seem to have found a magic solution.
I think Cassandra stopped short of some key OSS deliverables, and I think they could simplify the management as well, both with a UI for admin and with some re-jiggering of how some things work on nodes. The devs are simply swamped with stability and features right now.
And DataStax won't help that much; what admin UI Cassandra had was abandoned, and I half think the reason they acquired TLP was that TLP was producing/sponsoring useful admin tooling.
I would love to try something new. What appeals to me about Cassandra is the fundamentals of the design, and the fair amount of transparency there is (although there is still some marketing bullcrap that surrounds it, like "CQL is like SQL" and other big lies).
So many other NoSQL databases have bolt-on capabilities for handling distribution that Jepsen exposes (MongoDB famously), and there is sooo much bullcrap in their claims. All the NoSQLs are desperate for market share, so they all lie about CAP and the edge cases.
Purely distributed databases are VERY HARD and are open to exaggeration, handwaving, and false demonstrations by the salesmen, but those people won't be around when you need a database like this to shine: when the shit hits the fan, you lose an entire datacenter, or similar things.
> Scylla did a C++ rewrite to address tail latency due to JVM GC, but after an explosive release cycle, they basically stalled at partial 2.2 compatibility
Could you explain this more? Because Scylla has had pretty steady major release updates over the past few years. See the timeline of updates here:
We have long since passed C* 3.11 compatibility. In fact, if anything, Scylla, while maintaining a tremendous amount of Cassandra compatibility, now offers better implementations of features present in Cassandra (Materialized Views, Secondary Indexes, Change Data Capture, Lightweight Transactions), plus unique features of its own — incremental compaction and workload prioritization.
But if there's something in particular you're thinking of, I'm open to hear more on how you see it.
Honestly, I find Discord super frustrating. Can't have multiple chats open at the same time, can't close the right rail, etc. Its UX is subpar in almost every way that matters to me. I use it because _everyone_ uses it, not because I want to.
It's not bad per se but there's plenty of crap in there.
The shortcuts situation is absolutely dreadful for one, I don't understand how gamers can cope with it:
* there are all of 5 actions you can bind to custom shortcuts
* discord defines dozens of built-in shortcuts you cannot rebind or disable; if any of those conflicts with something you need, you'd better hope the OS has a way of overriding it
Large chatrooms as well: the moderation tools seem rather limited. Maybe it's better for administrators, but as a user all you can do is block someone, and you still have to see that they're posting comments. It's incredibly frustrating.
Then the linking and jumping to old messages works half the time, maybe, search is absolute dogshit, and I've rarely seen a less reliable @-autocompletion; half the time I have to find old messages of the person I'm trying to ping before Discord remembers they exist and lets me actually @ them.
And I don't think support actually exists. You just post into the black hole that is the support forum thing.
Slack sold for approximately $30B and it has a lot of the same limitations (or maybe I'm just missing something extremely simple). I think Discord is doing pretty great job all things considered. By focusing on gamers, they unintuitively created a great product for casual and business users.
Going to be the dissenting voice here and say I find the UI usable, and have no major gripes. The Member List can be hidden by the user icon to the left of the search field, and there's a convenient "Compact mode" to make things more dense information wise.
I can't think of many reasons Discord "ate the whole market" besides smart marketing, honestly. It does audio rooms incredibly well, but everything else (even their developer support team) is just terrible.
Even years later it's still the only platform I know of that combines text chat rooms, voice chat rooms, and video streaming into one place, all accessible from your 'server' as they call it.
It also has clients for many platforms, including a web client, all of which look and function the same.
Any alternative out there does one of those things decently well, but either completely lacks or is utterly awful at the other things.
>combines text chat rooms, voice chat rooms, and video streaming into one place
And unlike Skype (and probably Teams too), it supports push-to-talk, which gaming-oriented voice chats had close to 2 decades ago and which is even more useful now, during WFH.
I think the most direct competitor to discord (for gaming and related communities) is guilded.gg which is basically a clone of discord that offers additional features as well, including offering most of the features that discord has paywalled behind "nitro" for free.
The big issue they have is building up a large enough network effect. I really can't see the discord communities I am a part of moving over there any time soon. Also, they were recently acquired by roblox, and nobody knows for sure what the new ownership will end up doing to the platform.
Because most people used Ventrilo, TeamSpeak or Mumble, which required running or paying for a server and had limited chat/social network functionality, or Skype, which was just terrible. It came at the right time, as gaming hit its stride in the mainstream and people needed a place for 'local' communities that was basically frictionless, and Discord was there to capture that audience because there was basically nothing else.
All the features afterwards are mostly just them throwing stuff at the wall and seeing what sticks.
Discord ate the market because it is free and has good enough voice quality. Before that, everyone was paying for voice server hosting or doing it themselves. Is that smart marketing or just the standard operating procedure for startups around that time?
I agree, but also worth mentioning that Discord really did Chatrooms correctly.
The ability to easily create servers, invite users to your server, and then make that server your homebase with its own channels and emojis, is pretty novel and perfectly fit into the gaming community which is basically a loosely connected graph of friend groups.
Technically Skype existed, but it lost with the new SMS-looking UI that everyone hated, and it was hardly suitable for anything larger than a friend group. Then there were long-standing issues of voice chats being P2P and thus allowing users to find the IP of other users, enabling DDoS attacks on routers.
Yeah I think people forget that Skype actually owned the video game voiceserver market for a few years.
Ventrilo and TeamSpeak and Mumble were all good. But you had to assign someone in your friend group to manage the server. This meant paying for hosting to do it "the right way", and in turn one friend either paid for the hosting themselves or you had to figure out how to split the cost. Then if someone else joined the group you had to split it with them, etc. Some people would self-host TeamSpeak or Ventrilo at their houses so you could avoid those costs, but then you were reliant on an unreliable system of one friend hosting your voice server on their desktop computer. This meant that router mishaps could send it offline, them turning off their computer could send it offline, or if the TeamSpeak/Vent daemon wasn't running then your whole server was offline.
Skype solved a lot of those problems because it was always online, no one had to manage a server, and it was free. It sucked in just about every other way as a game chat option, but the benefits of no-server-management, always-available, and no-cost, made an objectively inferior product dominate the world of game chat.
Discord simply took the features of TeamSpeak/Mumble/Ventrilo and combined them with the service benefits that Skype offered. No more server cost sharing and no more server administration. But you still got the benefits of actual game chat servers, like voice lobbies (as opposed to initiating calls like Skype).
I really don't think marketing is what made Discord successful. This is truly an example of someone who solved a need. We needed a product like TeamSpeak/Ventrilo/Mumble combined with a service like Skype. Discord was that creation. It truly solved a problem for gamers. Gamers were not looking to cling to Skype, but they were all using it. Discord created a product that fit into the market perfectly, and the masses ran to it because the need was so big and Discord solved the problem that gamers had. The ease of setup also helped. Sending a single share link that someone simply clicked was all it took to join a server and start talking. I think that ease of setup is also an incredibly under-rated strength of Discord. In fact, I would venture to guess that most gamers joined their first Discord server by clicking a Discord share link that was sent to them via Skype.
I don't think they did much marketing. They provided, for free, a replacement for a chat room, forum, and voice chat, which was in total easier to set up than any of the previous options any gaming community had for those, with similar levels of functionality.
The only bad parts of Mumble calls for voice come from:
* Where the server is hosted / quality of server
* Poor client UI
The client UI issue is about how easy it is to work around bad audio from other users. It's possible to do; the UI just completely sucks.
User interface and end user fulfillment just aren't great generally for OSS. I think it would take a commons improvement project with either government grants (infrastructure) paying for results AND/OR a university spearheading the development project.
I totally disagree. The worst part about Mumble is that it was a pain in the ass to set up. Creating a Discord server is trivial: all of my nontechnical friends have used the software just fine. Mumble is terribly fiddly in comparison.
I don't see how that's relevant? I don't even prefer those over Discord, but I don't think it's enough of an improvement to warrant the market share it has now.
So what do you think happened? That people were manipulated in to using discord? Or that they don't know what the alternatives are?
Everyone I have spoken to loves discord and thinks it is one of the best programs they have. It's only a select group of hacker news style users who complain about minute details the average person does not care about.
I know it's hard for most people on this site to understand but the average user has very different priorities. Being able to create a "server" with the click of a button is worth more than every other issue listed in this thread. Having to pay or self host to create a group is a total non starter in 2021.
What I think happened, as someone who's been using Discord daily since 2015, is that they came up with a slightly better product than the alternatives, spent enough in marketing (to gamers specifically) to convince investors that it was a platform worth investing in, and only then slowly started improving their faulty software.
To say people were manipulated into using Discord is obviously not true, but it's also disingenuous to deny the massive amount of marketing Discord pushed back when it first started, not only in advertisement but just branding in general.
I'm not going to address the latter part of your comment because I don't understand what you're trying to say. I'm of the belief that I'm allowed to voice the legitimate issues I have with the software that impact not only myself and other developers but users in general.
The last part of the comment was probably a reply to this: "I don't see how that's relevant? I don't even prefer those over Discord, but I don't think it's enough of an improvement to warrant the market share it has now."
You don't see how valuable those improvements are, but the average person does, and that's why it has a massive market share.
I think you misread my comment. I definitely prefer Discord over the others, but I still think it has a long way to go before it becomes a chat experience that's not insufferable to use.
Small amounts of friction make a big difference. Back when my gaming friends were using Mumble, half of the group wouldn't bother joining voicechat (and we were lucky to have someone technical enough to run the server in the first place); with Discord it's easy enough that everyone does it.
Slack was/is pretty terrible too. Having every workspace require a new user is the pinnacle of idiocy. So annoying and even worse if you have different emails for different workspaces.
There are plenty of reasons to do this, all of which have to do with, say, privacy, and Slack has managed to make it dramatically less annoying.
1) They send magic links. Pretty easy.
2) They make all known workspaces you've logged into before discoverable and allow for a one-click "add to desktop Slack" option, which makes dealing with the whole "different users" issue much less painful. And to the extent that I use different emails for different workspaces, Slack accommodates that and allows me to do so within the same desktop instance, so I'm not really sure what the concern is there.
That's true, but it doesn't change the user story going from "I click a link, I join the workspace" to "I click a link, I fill out yet another registration form, decide which email to use, add another password to my password safe, then join the workspace". Minor differences but friction does matter.
To add to this, you also have no cross-workspace user consistency, so if I DM a user and later want to search my DMs, I might have to search my DMs in 10+ workspaces for the specific message I want. There are a million other problems with this model, but this example is definitely one that I frequently ran into before people I knew switched to Discord.
People love it, marketing has nothing to do with it.
I first heard of Discord when I followed an open source programming project (Leela Chess Zero), and it was obvious after a few minutes why it's a fantastic fit. I moved my project there shortly after as well, and it's fantastic.
It's definitely improved over the years, but every remotely populated server I'm in uses bots for basic moderation features like banned words, proper bans/kicks (for example, temporary bans), warnings, etc. There's still a long way to go in my opinion.
I'm not aware of anything that does a better job than discord. So they can be doing a fantastic job relative to the competition while still leaving stuff to be desired. Although bots are not really a bad solution and they leave the tools in the hands of the users who can now do just about anything.
I disagree. In my opinion those tools are the bare minimum for effective moderation, and while I love that Discord gives developers an API that allows them to implement those systems, I think it's something that should be handled by Discord themselves.
In indie game dev, everyone's on Discord, and so are your extremely important potential players. Whether you enjoy Discord or dislike it, you're going to be on there. It has really strong network effects.
You may enjoy Ripcord if you're not happy with Discord's UI. I've been using it for a few months, and it's made Discord enjoyable to use.
I do have to open the official client whenever I do voice calls though, because there's currently an issue that can cause incoming audio to sound terrible. But for text chat, it's great.
I've ditched the client for web-only with custom CSS.
It also allows me to block every kind of tracking (e.g. which programs you have open).
I only use it as a chat client and still run a TS3 server, because Discord's audio is just garbage.
Personally, I find that Discord feels extremely bloated compared to almost any other program on my computer. Probably has to do with the fact that Discord has still not released a native ARM version of its client for the Apple M1. Nearly a year after its release there really is no excuse for any Electron application to stick to x86_64+Rosetta 2.
Discord is literally the only x86 application that is still installed on my MacBook Pro M1.
Discord recently downgraded from 64-bit to 32-bit on Windows as well, I imagine there were issues with the interconnects with other, native parts of Discord like screen-sharing that were easily 'solved' by only distributing 32 bit binaries.
Every time I open an IRC client I feel lighter... Something about the Discord web client is so dreadful. The Android app focuses more on notifications and quick switching... a bit better somehow.
> There are alternate Discord clients, and — unlike Slack — Discord doesn't try to actively prevent people from writing alternate clients against their API.
> Hey, so I know this is somewhat of a bummer, but I got banned because of ToS violation today. This seemed to be connected to creating a new PM channel via the /users/@me endpoint. As that's basically a confirmation for what we've believed would never be enforced, I decided to not work on the cordless project anymore. I'll be taking down cordless in package managers in hope that no new users will install it anymore without knowing the risks. I believe that if you manage to build it yourself, you've probably read the README and are aware of the risks. I'll keep the repository up, but might archive it at some point. And yes, you'll still be able to use existing binaries for as long as discord doesn't introduce any more breaking changes. However, be aware that the risk of getting a ban will only get higher with time!
> Disclaimer: So-called "self-bots" are against Discord's Terms of Service and therefore discouraged. I am not responsible for any loss or restriction whatsoever caused by using self-bots or this software. That being said, there's no one stopping you from risking using an account, so go ahead!
> There are alternate Discord clients, and — unlike Slack — Discord doesn't try to actively prevent people from writing alternate clients against their API.
It's against the ToS, and people have copped bans for using alternate clients.
"All 3rd party apps or client modifiers are against our ToS, and the use of them can result in your account being disabled. I don't recommend using them."
Well, it solved a lot of pain points for the target market.
I remember my friends and I kept bickering about who would pay this month's bill for the Vent/Mumble servers. That went on for years until I had enough and hosted my own on a DigitalOcean droplet. None of my friends knew how to do that since they're not very technical.
With Discord you just had to click a couple of buttons, and it's free.
It's free for the same reason everything is free these days. VC funds anything that will attract a lot of users to mine data from so they can sell the data. Discord didn't do anything that was groundbreaking or even solve a problem that had no solution; they just came along during a time when investors are willing to fund a company operating at a loss for a decade until FANG buys them.
Discord's a pretty good product, and they've got the engineers and money to get better, but the only reason they won is because of timing. Same for Slack; there were identical products to Slack that tried for decades to gain traction, but they weren't free, because that business model didn't exist at the time.
The UX of Slack is essentially screen+IRC implemented in JS with emotes. It enabled technical and non-technical people to use the same tool. The key to success is not technical; it's that they tailored the product to a specific group that would then lock itself in.
I didn't understand Discord's success, but comments here point out that gamers couldn't find free group-voice apps at a critical time. Here again, they tailored the product to a group that would then voluntarily lock itself in.
Later, they sell the companies with valuations based on the captured user bases.
I wasn't referring to things like IRC. When Slack was initially released, it was no different from Campfire and a whole string of other web-based chat systems that came and went going all the way back to the dawn of AJAX in the late 90s. Slack's improved a lot since then, with app integrations and other features, but fundamentally it wasn't any different than its predecessors. It's easy to think that Slack did something groundbreaking, or figured out the magic solution to the problem that sank its predecessors, but just like Discord, the reason Slack won is because it came along at a time when companies can raise tens or hundreds of millions of dollars to float them for years while offering a free product. Then they can upsell later, and/or commoditize their users' personal information. Those business models weren't as easy to come by in the past, so a lot of products failed. None of this is to bash Slack; it's an adequate product for what it does.
Another big thing that the current crop of winners has going for it is that cloud hosting allows applications to launch literally for free and scale quite a bit without paying much of anything in infrastructure costs. That also wasn't an option 10-20 years ago.
I wonder about this a lot. I wonder if they have some big 'whales' that help sustain their business OR they're just selling all of our data (is that enough to make money at Discord's scale??).
One of the founders and current CEO, Jason Citron, had a previous company, OpenFeint, which:
> was party to a class action suit with allegations including computer fraud, invasion of privacy, breach of contract, bad faith and seven other statutory violations. According to a news report "OpenFeint's business plan included accessing and disclosing personal information without authorization to mobile-device application developers, advertising networks and web-analytic vendors that market mobile applications" [1]
Of course that doesn't mean anything about the current model of Discord, but good to be aware of.
IMO it's a good competitor to Slack. They probably make money from businesses too. They have lots of options for permissions/roles and all kinds of API access for writing bots.
Unless businesses are paying for server boosts[0] (which would only be useful for 1080p60 screen-share or a 50mb upload limit), there's no way to use Discord for business or pay extra for business use outside of creating a free server like any other; there's no real reason to choose Discord for business either, since it has no real retention policy (other than storing messages forever, for now), DLP is non-existent, there's no SSO/SAML, etc. The only reason to use Discord for business is if you really like Discord and/or other parts of your business are on Discord, like if you run a video game.
There are "businesses" that have communities, and want to own/manage them. Discord works much better than Slack as a platform for "official" managed open-membership communities; it's seemingly a use-case the Discord staff have put a lot of thought into.
Think: every content-creator or streamer.
But also: regular corporations that provide platform services that people build their own stuff on top of, such that people want to talk to each-other about the service rather than just talking to the corporation about the service. (The sort of thing you used to stand up a hosted forum for.)
Yep, I agree the industry fit is limited, but it does work in that respect. I see it used a lot by content creators as a way to organize and tier out their fans as well.
we're trying to use Discord for our multi-site grant-funded healthcare project... it's pretty messy to use. Would love to pay for some decent support... People are getting locked out of their accounts for some reason and working with their support team is very painful.
Our only source is this WSJ article (excerpt from qz[0]):
Discord declined to share how many Nitro subscribers it has, but the Wall Street Journal reported that Discord generated $130 million in revenue last year, up from $45 million in 2019. In the same time period, its monthly user base doubled.
We use Discord for work, and to "boost" your server to a level where you get reasonable streaming and a higher upload limit you need to pay ~$60, or ~$110 for the max. Which is pretty good in that it applies to all users.
It's a bit of an odd model for paying for businesses, but works well in the gaming world where multiple people can essentially help pay for a server (if you want the extra toys)
Why did you pick Discord over Slack or Teams? I'd be driven crazy if there wasn't any SSO or fancy admin features, but then again I'm a nerd who cares about things like that (and also why I really want Discord for Enterprise to happen). Is it just because it's easier to use?
I pay for Nitro. I don't use Discord non-stop, but it's great for a bunch of niche channels I'm on. I'm happy to pay $10 to a platform that makes it easier for me to find information, and I know a bunch of my colleagues pay as well. All in, we still support individual projects as well, but truthfully it's the cost of a beer a month.
I don't get it; you can run mumble on any random Linux box in your house, you don't need to pay to have it hosted somewhere. Works fine running on any box on your desk.
Discord makes you the product. It's gratis in exchange for letting them spy on you. If you don't know why that's bad...
> you can run mumble on any random Linux box in your house
That seems easy to you. That would be easy for me too and most likely 90% of the people on HackerNews.
But the average person doesn't have a "random Linux box" in their house. Most people don't even know what Linux is. Most people would be overwhelmed just looking for the terminal emulator on their computer, before they even typed a command into it.
Most people don't want to manage an always-on linux box for a voice server. Most people don't want to manage port-forwarding on their firewall/router. Most people don't have static IPs at their house and wouldn't know how to setup dynamic dns to solve the problem. Most people don't even know what DNS is.
MOST PEOPLE just want a program they can launch when they want to talk to their friends. That is why Discord has been successful.
I'm not saying that's good. I am just saying that its the way the world is.
I find it astounding that people here can not even grasp the concept of why Discord is popular. I am perfectly capable of hosting my own server and doing everything manually. But it is clear as day why discord wiped out the competition while most of the comments here seem dazed by the fact and are left wondering why people don't just use IRC.
It's no wonder so many projects and FOSS tools fail to gain large userbases when it seems that most developers seem to be living on another planet entirely.
This is an important point. I write that with no disrespect to all those FOSS developers who have devoted themselves to the work of creating new, interesting and useful things. But the fact is that usability, like intuition, is usually a very subjective matter. That's why QA and UAT were such an integral part of traditional software development, and why community engagement needs to be a two-way street.
The best comparison for Discord is to think of it as a social media site, like Facebook or Reddit. Find or get an invite to a server, and everything is trivially there. Creating your own space is simple, easy, and fast. Everyone is already there.
IRC is pretty similar, but much more fragmented and not really very user-friendly.
somehow the only real challenge for the tech community is how to wrap all that self-hosting complexity so that most people could just use it with the click of a button.
it's not a technical challenge, it's a moral challenge: it means doing what is good for the users even if they don't really know it
I vouched for your comment. A lot of the rise of Discord can be attributed to convenience, network effects, and pretty features like animated reactions, but ultimately it is still surveillance capitalism. Unfortunately, it appears that the masses don't care about things like privacy, as they're more than willing to sign up for these kinds of services.
Discord's smart move was emulating the concept of servers (including all the teenage drama coming from having administrators and moderators more interested in (ab)using their power than community building) while making them accessible to anyone without technical knowledge.
But it's important to remember that Discord is not that. Discord holds all your data, in luxurious detail, with no option to delete. They go as far as ignoring GDPR when people ask for their messages to be deleted. "Deleting" your account will not even anonymize your ID, it unsets your avatar, renames you, kicks you from all guilds and disables logging in. That's it. And if they ban you there is no place to move on to.
That used to be a huge issue, less with client-server software and more with the P2P mechanisms in Skype.
These days? Well, most server providers have some sort of basic flood mitigations in place now, and even more advanced protection has become affordable.
>These days? Well, most server providers have some sort of basic flood mitigations in place now, and even more advanced protection has become affordable.
Hmm
I didn't mean your server being DDoSed, but you yourself being DDoSed (though that's probably what you meant with the Skype P2P example?)
Well yeah. Though Vent, Teamspeak and Mumble never had these issues (if you could trust the server admin).
Skype (at the time, no idea now) was a very shoddily written piece of software. It was trivial to query the IP of any online user, even if they were not on your contact list or appearing offline.
You had to use a VPN or carefully conceal your Skype ID. I worked with a somewhat popular live streamer back then (so a VPN wasn't feasible), and their ID was a very random string that was not to be shared under any circumstances.
I've also noticed that in a lot of tech-related social circles people are increasingly choosing Discord over Slack. That's a trend I totally didn't expect: at least until a few years ago it was clear that Discord was for gamers and Slack was for work and everyone else. That changed quickly. Impressive indeed!
At my work we use Discord to have virtual "desks" (really just audio channels) so people can drop by and chat while you are at your desk. If you're busy or don't wish to be disturbed you can 'lock' your desk to prevent people from joining it (it limits the room size to 1, aka just you).
It really has helped the social factor of moving nearly everyone in the office to remote working. Every department that has adopted the "virtual office" Discord setup loves it over Slack and basically never uses Slack anymore. It's way less awkward to call people, it's easier to not incidentally disturb them when they're busy, during breaks/lunch you can go to the "breakroom" and hang out and chat with everyone else. And it was all very easy to setup and with the Discord server template stuff we can even clone it for each department with very minimal work (renaming channels to that departments' people).
Slack implemented something similar called Huddles, but I think it's for paid plans only. I personally think Slack call quality in general is much worse than other services and platforms like Discord or Meet, so I don't know if it'll really help reach people and companies that are using alternatives for voice.
Huddles uses Amazon's Chime backend for audio, so it should perform much better than the current "audio calls" that Slack had, though I haven't tried it yet.
> Slack does not support syntax highlighting of code blocks.
It does, but only if you make your code block into its own post as a "text snippet." (I assume this is because Slack's internal markup doesn't allow regions to have parameterized metadata, but there is parameterized metadata at the chat-post-event level.)
You also get other benefits of doing this, e.g. being able to collapse the snippet, download it, etc. Code pasted into Slack should really always be pasted as a snippet. I just wish it auto-detected you were trying to do that and offered to make a snippet.
Yeah, I have done that. It’s really clunky. I’ve also done it to be able to use headings and other things that I’d prefer just worked in the main chat.
It doesn't support: headings, paragraphs, lists (ordered or not), labelled links, tables, footnotes, or inline images (you can only use the image upload feature, which puts a single image below a comment).
It also has a limit of 2,000 characters (4,000 with Nitro), which can be rather low when posting code snippets.
I hadn't noticed some of those. I don't use Discord a ton, just enough to know it works better than Slack, I guess; at least what they do support uses Markdown syntax.
The most frustrating thing is that somewhere on their website they mentioned that most people are not familiar with Markdown syntax, so they chose not to use it. But instead they created their own syntax that even fewer people are familiar with...
The free discord tier is better than the free slack tier. That’s honestly why 90% of ppl use discord over slack.
Also paid discord is 100x cheaper than paid slack, for non-corporate entities. You can get top tier discord for like $100/m while slack price goes up with each user. Not to mention that discord allows users to easily assist in upgrading your server while slack doesn’t have that functionality at all.
They also pivoted their marketing message away from being "for gamers" and towards anyone who wanted "a place to hang out," like developer groups or high schoolers.
Kinda makes perfect sense: Discord is a very good addition to a subreddit or Facebook group. And the barrier to entry is low, just like those two. It fills a niche for audio and chat for large communities.
Discord hasn't gotten bloated yet the way Slack has, which makes it much more pleasant when all you actually want to do is chat and maybe hang out on voice and sometimes with a screen broadcast.
Once you're in an enterprise space, Slack's features become actually useful.
To be fair, Mumble is FOSS, and Ventrilo and Teamspeak have literally not iterated since 2005. Discord is pretty mediocre software (remember when they accidentally allowed iframe XSS RCE attacks? A very amateurish mistake), but the incumbents were an absolute dumpster fire.
For Mumble in particular, the devs had their heads in the clouds for so long that it is no surprise that it is no longer relevant.
If you had a mic that had issues in any way (buzzing, volume, balance), "The Wizard" and "AGC" were supposed to fix it for you. Do not fret little one, for you do not need nor want to manually fiddle with settings, The Wizard will make everything right [1]!
The pivotal feature that was the reason so many people I know stopped using it is the ability to change the volume of an individual person [2]. It has been a requested feature since the beginning of time, yet it took until 2016 to implement in dev branch and didn't actually make it into a release version until 2020! Too little, too late.
That issue is pretty amazing, especially the developer straight-up accusing those saying 'this is a killer feature stopping us from moving to mumble' of lying (considering the substantial overlap between the features of the various options competing in the space at the time this is really obviously the kind of thing which could be a deal-breaker: I remember the arguments over which option to use based on far less substantial differences than this).
it's just Slack for gaming; the UI is blatantly ripped off. It is better than Ventrilo, but it's not like they are that much better; they just recognized a good concept and ran with it.
This is like a masterclass in how to answer system design questions. Maybe a bit verbose.
They cover requirements, how to answer those requirements, relevant tech for the problem, implementation, and techniques for maintenance
No?
But I am saying this is like an extremely articulate, over the top, excessively detailed answer for a system design question. I'm not saying this is what people should aim for, just that it's a good example of the types of things you should discuss during a system design interview.
I mean that's what it IS. It's a system design.
2) But this isn't an interview situation. This was "system design questions" as in how to solve a problem as a company using the whole team (4 backend engineers at that time).
Well, that's true for any task that you're asked to complete at an interview: you're doing the fast, draft version that is obviously not comparable in quality to what you would do in a real setting. But still, even this draft could be illuminating.
We took a big bet on Cassandra, and then on an opinionated wrapper around Cassandra at $PASTJOB. The use case was a text search engine for syslog-type stuff.
The product we built using Cassandra was widely known as our buggiest and least maintainable, and it died a merciful death after several years of being inflicted on customers.
We didn't have a good handle on the exact perf implications of different values of read/write replication. Writing product code to handle a range of eventual consistency scenarios is challenging. The memory consumption and duration of compactions and column/node repair jobs is hard to model and accommodate. It's hard to tell what the cluster is doing at any given moment. Our experience with support plans from Datastax was also pretty dismal.
Maybe the situation has changed since 2016. In my experience with several employers since then, it seems like every enterprise architect fell in love with Cassandra around 2014-2015 and then had a long, painful, protracted breakup.
However, we also compared it to Scylla's latest release, and though C4 is better*, you can still find other CQL-compatible databases that outperform it. Especially around compactions and topology changes:
I like Scylla. I've been working with it for a while now, and it's a good alternative for transactional loads. It's a hell of a lot faster than Cassandra, and much much much much cheaper than DynamoDB. Cassandra has always felt like improvements came in fits and starts. I work at a Fortune 50 company, and Amazon quoted us ~$2-3 million a year to run our load on Dynamo (we were paying $350k/year for Aurora). With Scylla, we're looking at > $100k to get an order of magnitude better performance and a nice stable system that doesn't wake me up in the middle of the fucking night.
I was following Scylla since the beginning (only because I like their mascot), and it's actually sort of interesting to see what's going on with the company. I've spent the past few years designing systems around transactional workloads backed by Cassandra. This is the first time I've been able to use Scylla on someone else's dime, though. The unpleasantly big company I'm at right now is looking to replace a bunch of infrastructure with ScyllaDB (Couchbase, Cassandra, Elasticsearch, DynamoDB). It's catching on for sure, but it still doesn't return any results when I search Dice. It looks like Discord is hiring, though...
I'd love to hear how Dynamo would end up being $2-3 million a year. They sure do a great job of convincing people that it's cheap, so I'm curious where the cost seems to blow up.
If you are doing north of a million ops on DynamoDB you can quickly run into the $2-3 million a year range.
In this 2018 benchmark, we were able to calculate that a sustained, provisioned workload of only 160k write ops / 80k read ops for DynamoDB would cost >$500k per year:
That was a few years ago. These days, according to our most current pricing you could do DynamoDB provisioned, 1 year reserved for $38,658/month, which is "only" $463,896 annually (pop up the "Details" button and choose "vs. DynamoDB"):
The same workload on Scylla Cloud would run $29,768/month reserved, or $357,216 per annum, roughly 77% of the DynamoDB price (about 23% cheaper).
Of course, all of this is just pure list price. Depending on volume you might be able to negotiate better pricing. However, you'd need a really steep discount for DynamoDB just to get back to Scylla Cloud's list price.
Let me know if you spot any math errors or omissions on my part.
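If it helps, here's a minimal back-of-the-envelope sketch using the list prices quoted above; it assumes flat monthly pricing with no negotiated or volume discounts.

```python
# Rough annual cost comparison from the list prices quoted above.
# Assumes flat monthly pricing with no volume discounts.

dynamodb_monthly = 38_658   # DynamoDB provisioned, 1-year reserved (list price)
scylla_monthly = 29_768     # Scylla Cloud reserved (list price)

dynamodb_annual = dynamodb_monthly * 12   # 463,896
scylla_annual = scylla_monthly * 12       # 357,216

savings = dynamodb_annual - scylla_annual
print(f"DynamoDB annual: ${dynamodb_annual:,}")
print(f"Scylla annual:   ${scylla_annual:,}")
print(f"Difference:      ${savings:,} ({savings / dynamodb_annual:.0%} cheaper)")
```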
> Maybe the situation has changed since 2016. In my experience with several employers since then, it seems like every enterprise architect fell in love with Cassandra around 2014-2015 and then had a long, painful, protracted breakup.
I think 2012-2014 was peak marketing from DataStax. There would be some new major feature with every new blog post, and it would mostly never work as expected. Between 2017 and now, things have settled down.
I've used Cassandra at two companies, and had the exact same experience as you at the first company. At a much bigger company that had some very, very highly paid Cassandra DBAs it was actually a relatively smooth experience.
I don't think it would be appropriate for me to say very specifically, but I suspect about double what a software engineer with the same amount of experience would earn.
Data is big $$$. Slap a couple of NoSQL databases and Spark on your resume and watch the money roll in. DBAs are disappearing with managed services, though.
Yeah, I don't know about "slap." We want you to have deep production experience with these systems. Designing them, deploying them at significant scale, predicting their pitfalls and avoiding them proactively. Diagnosing systemic problems and finding reasonable solutions.
If you can't magically put out production fires, on huge high-throughput systems, potentially in the dead of night, we are unlikely to pay you $300-400K.
The intent was to be a little hyperbolic and self-effacing. In terms of competent and capable developers, I think it's hard to get a better return on your skill set than adding "data" stuff. And honestly I think it's one of the most critical skill sets that is lacking across the board. So many big companies have great data engineering teams, but generally other dev teams are left to design their own databases, which is a shit show. And even then, it's amazing to me how difficult it is for data engineering teams to move from framework to framework without just mapping old solutions onto new technology.
My career has been primarily focused on something like "bringing modern data-driven solutions" to big companies. The one constant challenge is that most teams (and leadership) aren't prepared to handle the responsibilities of data engineering and stewardship in transactional, operational systems. I feel like a critical responsibility when I come on as a consultant is to impart knowledge about managing their data.
To elaborate, I meant 2x compared to SWEs at the same company. I'd prefer not to post exact dollar amounts as they are relative based on location, company, and several other things.
Do you mean Bigtable? That's Google's "HBase", insofar as HBase is based on the Bigtable paper.
From what I recall from using it a few years ago, it's pretty damn fast, very low latency. HBase had speedy p50s as well but tended to get quite slow at p99 due to GC.
If you're paying for Discord every month, it's actually fairly expensive. A lot of the good features unlock once people start boosting servers with Nitro, and those boosts aren't cheap either. So I'd assume they aren't bleeding cash left and right on infra costs. They might actually be breaking even on the infra costs at least.
For years I've had a little bet going with friends about who ends up buying them to subsidize all this. My money was on amazon, because it could work so well with twitch + amazon prime.
> we knew we were not going to use MongoDB sharding because it is complicated to use and not known for stability
But then goes on to describe using Cassandra and overcoming sharding and stability issues. I.e., changing the key, changing TTL knobs, adding anti-entropy sweepers, and considering switching to a different cassandra impl entirely.
Are these issues significantly harder to solve in MongoDB than Cassandra?
At the time MongoDB's sharding story wasn't great. They've gotten better since, but still have a primary-replica set model that has a single point of failure/failover. Cassandra (and Scylla) are leaderless, peer-to-peer clustering. Any node can go offline and the cluster keeps humming. Cassandra shards per node. Scylla goes beyond that and shards per core.
Cassandra and Scylla also use hinted handoffs: if a node is temporarily unavailable (up to a few hours), the other replicas store "hints" for it and replay them when it comes back online. Handy for short admin windows.
MongoDB has the equivalent of hinted handoffs. Changes are streamed to secondary nodes via the oplog, and the secondary just resumes where it was once it is back online. There is a limit to how long it can be offline (based on the size of the oplog), but that is the same limitation as hinted handoffs.
A MongoDB shard isn't necessarily a single-point-of-failure since a shard is usually deployed as a replica set. If a shard's primary node goes down, a secondary node in the replica set is elected as a primary and takes reads + writes. Similar to what you mentioned for Scylla - a node can go offline on a shard in a MongoDB cluster and it keeps humming.
It's hard to say because they're not explicit about this, but despite being a decade-long Mongo apologist myself, I'd totally believe that they liked the linear-scaling story for Cassandra more from an infrastructure/config perspective.
Increasing top-end write throughput or replication in Cassandra is just adding more nodes, whereas in Mongo it's not just adding nodes, it's adding replica sets (which consist of 3 or more nodes). So there are a few more layers of complexity to that story: you need more replica sets to increase write throughput and more nodes per replica set to increase replication.
I'm hand-waving some details here, but I've worked with both platforms and can definitely understand the choice, at least from a pure infra lens.
KKV databases (Cassandra and DynamoDB are good examples) have a common problem with hotspots or "hot partitions". The most common mistake is to use a timestamp of any kind in the range (clustering) column. Then, whatever partition represents "today" or "this hour" ends up being the hot partition.
The article mentions hot partitions becoming a problem with max partition size, but they're also a problem for scalability. Say you're writing a very high throughput of logs into the table (contrived example): then you're bottlenecked by the rate at which you can write to one partition.
Adding a bucket id (say, the current day or hour) is a common solution; it solves the max partition size issue, but not the scalability issue of hot partitions (a sketch of the hashed-bucket variant follows below).
That said, hotspots are 100% the reason why Cockroach encourages UUID primary keys. The disadvantage of UUIDs is that if you want sequential data, you then need a secondary index, which you'll have to bucket anyway.
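Here is a minimal sketch of the hashed-bucket variant of that idea, assuming a hypothetical log-style workload: combine the time window with a small synthetic bucket derived from the source id, so writes for "right now" fan out over several partitions instead of one. The bucket count and key shape are purely illustrative, not anyone's production schema.

```python
import hashlib
from datetime import datetime, timezone

NUM_BUCKETS = 16  # illustrative; tune to the write rate you need to absorb

def partition_key(source_id: str, ts: datetime) -> tuple[str, int]:
    """Build a (time_window, bucket) partition key.

    Instead of writing every row for a given hour into one partition,
    derive a stable bucket from the source id so that writes for the
    same hour fan out across NUM_BUCKETS partitions.
    """
    window = ts.astimezone(timezone.utc).strftime("%Y-%m-%d-%H")
    digest = hashlib.md5(source_id.encode()).digest()
    bucket = digest[0] % NUM_BUCKETS
    return (window, bucket)

# Trade-off: readers that want "everything for this hour" must fan out
# across all NUM_BUCKETS partitions and merge the results.
print(partition_key("sensor-42", datetime.now(timezone.utc)))
```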
- No out of the box horizontal sharding, according to the post they had 4TB (compressed) data in the cluster in 2017. Looking at their growth I think it is safe to assume that today they would have >50TB which can't be done on a single node. You could use Citus but this is not exactly vanilla Postgres anymore. For such a simple data model wasting time implementing your own sharding solution and (more importantly) shard migration makes no sense.
- Discord is storing text data, in Postgres this will be stored in TOAST tables which has some drawbacks.
- Their workload is mostly inserts, almost no updates. Vacuum only operates on complete tables, so you would waste I/O and CPU processing data which you don't even touch. You can partition tables, but it's a manual process and you have to make compromises. In 2017, Postgres partitioning still had many performance drawbacks.
- No out of the box redundancy.
- Once your data doesn't fit in memory, Postgres performance becomes unpredictable.
Personally I would have chosen ElasticSearch for this project.
My understanding was PG only uses TOAST when the data is too large to fit in the row, and since PG compresses data before inserting wouldn't user messages be fine?
Do you have any case studies for ElasticSearch that you can recommend, in projects similar to this? Would be very interested in seeing what that option would look like.
Hmm... have not tried this myself, but just brainstorming. You could shard based on discord server or chat room, which would give "read before write" consistency since writes can lock, but then you'd have to manage shards to account for varying loads like servers/rooms which grow rapidly, and deal with hot shards which might outgrow the capacity of a single db server.
Given that they said their requirements were "linear scalability, automatic failover, low maintenance, predictable performance", I don't think I'd go that route.
They wouldn't need to store so many if they actually let people delete their messages on account deletion. Instead, they ban many people who attempt to do so via automated scripts.
Deletion of data at scale is a really difficult technical problem, unfortunately.
I'm not saying they shouldn't do that though - especially given regulations like GDPR. Designing systems for deletion is important! But it's also really hard, especially if you didn't design for it from the start.
There's also no way the tiny fraction of users who want to delete their data would make up a significant enough proportion of the messages that it would impact their scaling strategy.
One of the big reasons I refuse to use Discord. Deleting your messages is a right every user should have, whether individually or in bulk. The way it's done now just leaves users more open to malicious attacks. Whether someone archives your content before you delete it is beside the point; that can happen on any internet medium.
I love Discord's tech blog. There are a few corporate tech blogs that are just fantastic. Fly.io is another one that has great writing and interesting topics.
Can anyone share experiences with using Discord as a communications tool in a workplace? We're currently on Google Chat because it comes with the package that we pay for anyway, but it's pretty lame. So from time to time we consider jumping to Slack. But then, why not Discord?
We used Discord for our team at work for over a year. We stopped using it because company policy changed (we had no centralized chat program; some teams were on Skype, some on WhatsApp, then Slack was instituted). Context: a 10-person, mostly technical, project-oriented game development team.
---
It was a joy to use: we created channels left, right, and center and knew everyone who needed to be in them would be, thanks to the centralized "role-based" permission system. (We would create project-specific channels and an accompanying role, or client-specific roles for the few high-throughput clients that had lots of small projects.)
At the time it did not have threading, which was one of the biggest pain points on the text-chat front.
---
The voice chat is very good, and having dedicated voice channels means you can emulate meeting rooms or desks and have people join as desired/needed. You could be working and idle at "kroltan's desk" voice channel, but even if you weren't, joining one is trivial (a single click, can be done independently by many people) compared to Slack (find the call button somewhere different each time because they redesign the UI every week, then wait for your peer to join the call).
Screen sharing is 720p on the free plan, so for meetings, it was hard to read documents, requiring zooming and whatnot. At the time there was also no setting to optimize for framerate or definition, so even 720p felt closer to 480p. Nowadays you can lower the framerate and also select the desired optimization, so you can ask Discord to optimize the stream for quality which is much better for documents, even in 720p.
---
The client is also much more responsive than other Electron-based chat programs, especially with big workspaces with close to a hundred channels (yes, for a 10-person team, we sure type a lot), search is basically instant and has very useful filters, mentioning roles is great and the notification settings are fine-grained enough to please everyone.
Discord would be awesome for work ... I want to use it so bad. But the terms of service are completely unpalatable. Discord basically gets a perpetual license to anything you post.
"Huddles" pick up so much noice from the background I stopped using it. Whenever I had to take a call without headphones on it was basically unusable. I'm a single data point however, your milage may vary.
It's everything you could want, but with a lot of asterisks. A lot of things are limited (check my other comment), but most importantly, their policies force you to hand over a license to any text you write on their platform. But frankly, I'm not sure if there's anything near a perfect solution. A lot of companies use a big clump of services, including both self-hosted and "rented". I hope to one day see a better, more comprehensive solution in line with the future of online work.
I would have used CockroachDB; it has all the requirements listed, and you don't need to know the queries you will perform in advance when designing the database schema.
This would not work at scale for a company like Discord, with its volume of traffic. Cockroach, being consistency-oriented, would quickly become transaction-bound. You want a database like Cassandra or Scylla that is more performance/availability oriented. Otherwise you are going to see a lot of lag and latency in the Discord chat.
Cockroach is very, very good for a distributed SQL database. But it's still performance-limited in its very nature.
More here on the difference between NoSQL/NewSQL performance, using Scylla (a CQL-workalike) as a point of comparison:
I'm curious what the 2021 figure is for the total disk space Discord consumes. Servers that I'm in share images every few minutes, which must add up pretty quickly.
I don't know much about image de-duplication, but maybe they can get some sort of fingerprint/hash for an image, see if they already have it, and then serve that already existing image.
I'd imagine a hash like SHA256 would be tricky, because if that image was compressed an additional time at all throughout its internet journey, we'd get a different resulting hash, but maybe there is an effective way to fingerprint images. I have a utility on my machine (czkawka maybe?) that does really good image de-duplication with what seemed like a common algorithm (based on a quick look at the source).
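For illustration, here is a tiny average-hash sketch using Pillow; this kind of perceptual fingerprint tends to survive re-encoding and mild compression in a way an exact SHA-256 does not. It's just one simple approach, not necessarily what czkawka or Discord actually use.

```python
from PIL import Image  # pip install Pillow

def average_hash(path: str, size: int = 8) -> int:
    """Tiny perceptual hash: shrink, grayscale, threshold against the mean.

    Unlike SHA-256, small re-encodings usually leave this 64-bit
    fingerprint unchanged or only a few bits away from the original.
    """
    img = Image.open(path).convert("L").resize((size, size), Image.LANCZOS)
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for px in pixels:
        bits = (bits << 1) | (1 if px >= mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits; small distances suggest near-duplicates."""
    return bin(a ^ b).count("1")

# Usage sketch (hypothetical filenames): treat images within a few bits
# of each other as likely duplicates.
# h1 = average_hash("meme_original.png")
# h2 = average_hash("meme_reuploaded.jpg")
# print(hamming(h1, h2) <= 5)
```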
I think it might; spamming the same meme images over and over is quite common on some servers. On the other hand, bigger pictures might overwhelm those in sheer size.
Yeah, that's why I assumed it wouldn't help that much. People re-upload 100kB memes all the time, but the bulk would probably be 5MB phone pictures that won't typically be re-uploaded.
The plural of anecdote isn't data, but about 20% of the images I post on Discord come from Discord in the first place, cross-posting among different servers.
Yes. There are ways to group images that seem to be the same. TinEye and Google image search do that. So you'd have a collection of related hashes that equal "Bob's prom photo where he looks like a goofer."
Yes, definitely, I have seen it work in action, but you can't just tell a user "here, use this smaller and more pixelated version of your image that we think is kind of similar".
We use Scylla for our IoT stream, one bucket per day, with a date index for second-resolution data. The current day is a hot spot, but we throw that into Redis. It's running one of the largest reinsurance providers' IoT deployments.
That's an interesting point too. They talk about not being a blob store and not wanting the serialisation cycle to hamper performance, but it makes you wonder how exactly they're storing the data. I'd guess it's not encrypted at all.
ETA: Going back to the original thread, the whole question of encryption seems to be dodged and that usually means the answer isn't the one people are looking for: https://news.ycombinator.com/item?id=13440921
Scylla, Discord's replacement for Cassandra, supports both encryption in transit (server-to-server within the cluster; client-to-server) and encryption at rest for stored data.
If you can, use Scylla over Cassandra. The performance difference is tremendous in my experience, and replacing Cassandra can be trivial (easier still if you start with Scylla on day 0).
I'm very much in the "use PostgreSQL unless you have absolutely proven to yourself that it won't work for your project" camp but in this case it really does look like moving to Cassandra was a good choice.
NoSQL scalable stores like Cassandra basically only work well if you have a very strong model of the queries that you will need to make.
In this case, that's exactly what they had: they knew what their read/write patterns looked like and they knew that they would be growing at hundreds of millions of rows per month, so easy horizontal scalability was a hard requirement.
The biggest weakness of classical relational databases like PostgreSQL comes when you have super high volumes of inserts (as opposed to updates) which continue to grow your database over time, and you need to keep all of that data accessible for real-time queries.
They might have been able to achieve something like this using a PostgreSQL extension such as Citus, but it really does look like what they are doing fits Cassandra's sweet spot.
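To make the query-first modeling concrete, here's a hedged sketch (CQL via the DataStax Python driver) of a chat-messages table keyed around the one read pattern you know you need: partition by channel plus a time bucket, cluster by a time-sortable message id. The table, keyspace, and bucket scheme are illustrative, not Discord's actual schema.

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Assumes a keyspace named "chat" already exists (illustrative).
session = Cluster(["127.0.0.1"]).connect("chat")

# Everything is designed around the one query we know we need:
# "latest messages for a channel".
session.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        channel_id  bigint,
        bucket      int,      -- coarse time window, keeps partitions bounded
        message_id  bigint,   -- time-sortable id, e.g. a snowflake
        author_id   bigint,
        content     text,
        PRIMARY KEY ((channel_id, bucket), message_id)
    ) WITH CLUSTERING ORDER BY (message_id DESC)
""")

# The read path is a single-partition query, which is what Cassandra is good at.
recent = session.prepare(
    "SELECT message_id, author_id, content FROM messages "
    "WHERE channel_id = ? AND bucket = ? LIMIT 50"
)
rows = session.execute(recent, (1234567890, 20817))
for row in rows:
    print(row.message_id, row.content[:40])
```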
AFAIK you can, up until the data exceeds your capacity to fit it on one machine, at which point you have to figure out how to split your data up in a way that preserves all the strengths of SQL (strong consistency). At that point you run into a lot of complexity managing your shards.
Two horrible choices of databases... first they used MongoDB then they migrated to Cassandra. I've used tons of databases [1] in production and those two are the worst.
[1] I've used RethinkDB, Postgres, MongoDB, MySQL, Cassandra, CockroachDB, TimescaleDB, SSDB, and others
I'm not questioning your intellect, but each database has its own use cases. You might be expecting the wrong things from Mongo or Cassandra.
Using a gazillion databases for the wrong use cases doesn't mean anything.
From your child comment, you cooked up an in-house solution; it may be best suited for you, but for others it could be horrible too.
If Cassandra is suited for Facebook, I think the problem is with you choosing to use it for something it's not suited for, rather than with the database itself.
Cassandra has been successfully deployed at many companies. Would you care to provide some insight into your experience and why you consider it one of the worst?
1. Hinted Handoffs - if a node has a transient failure, the other nodes store up messages, like your buddy might take notes in class if you had to go to the bathroom. They'd pass you those notes when you got back. "Here's what you missed." When the node comes back online it processes all new operations and works through its backlog of hinted handoffs to get caught up. Because of the backlog it creates, hinted handoffs are only stacked up for a few hours. If the node never comes back up, or comes back after that window...
2. Repairs - in an eventually-consistent database you might miss an update or two over time. Or maybe you're a replacement node that has to fill in for a failed node. The replacement will get streamed data from the other replicas to get it started, or you might restore sstables from a backup, but then you should run a repair job to make sure all your replicas are properly in sync.
(That's my understanding. Let me know if that sounds correct from the hands-on experts.)
Ah, it gets rid of the non-time-related bits. Below is a description of the format (with a small decoding sketch after the list):
* id is composed of:
* time - 41 bits (millisecond precision w/ a custom epoch gives us 69 years)
* configured machine id - 10 bits - gives us up to 1024 machines
* sequence number - 12 bits - rolls over every 4096 per machine (with protection to avoid rollover in the same ms)
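Assuming that 41/10/12 layout (time in the high bits, then machine id, then sequence), here's a quick sketch of how you'd pull the fields back out or strip everything but the time bits. The custom-epoch constant is a placeholder, since the actual epoch isn't given above.

```python
# Decompose a snowflake-style id using the 41/10/12 layout described above.
# CUSTOM_EPOCH_MS is a placeholder; the actual custom epoch isn't specified here.
CUSTOM_EPOCH_MS = 0

TIME_BITS, MACHINE_BITS, SEQ_BITS = 41, 10, 12

def decompose(snowflake: int) -> dict:
    seq = snowflake & ((1 << SEQ_BITS) - 1)
    machine = (snowflake >> SEQ_BITS) & ((1 << MACHINE_BITS) - 1)
    timestamp_ms = (snowflake >> (SEQ_BITS + MACHINE_BITS)) + CUSTOM_EPOCH_MS
    return {"timestamp_ms": timestamp_ms, "machine_id": machine, "sequence": seq}

def time_only(snowflake: int) -> int:
    """'Getting rid of the non-time bits' is just masking out the low 22 bits."""
    return snowflake & ~((1 << (SEQ_BITS + MACHINE_BITS)) - 1)

# Example with a made-up id: timestamp 1,500,000,000,000 ms, machine 7, sequence 42.
print(decompose((1_500_000_000_000 << 22) | (7 << 12) | 42))
```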
MongoDB is great for developers. Very easy to get started with. However, it tends to fall over when it hits scale, which could be in total data set size (>TB scale), transaction scale (>100k ops), or in low latencies (sub-millisecond to single-digit millisecond).
In any of those domains, if you are trying to solve your problem with MongoDB you are in for a world of hurt.
That's generally when people start looking at other options. Whether an in-memory system for pure speed, or a horizontally scalable system for raw size or throughput.
Does Discord offer undergraduate internships? I use the platform daily, enjoyed reading the thought process that went in to this design, and would love the opportunity to intern there.
It's unfortunate Discord is still requiring relocation to SFO; the product is amazing, and it looks like there's some awesome engineering behind the scenes that would be fun to work on!
I applied the other month for a job that mentioned SFO or remote; halfway through the application it stated that they were allowing folks to work remote until COVID was better but then wanted folks onsite, and it prompted for a yes/no on whether I was willing to move to SFO at a later date. I didn't get a chance to talk to anyone and expect it was because of this, so it is a bit disappointing.
Ah yeah, just tested it and it says "IF APPLICABLE, WOULD YOU BE WILLING TO RELOCATE TO DISCORD'S SF HQ? WHILE DISCORD IS EMBRACING A HYBRID REMOTE APPROACH GOING FORWARD, SOME ROLES WILL REMAIN HQ-BASED."
It is not applicable for that role. It is not applicable for most (maybe all) engineering roles.
We've moved quite a few datasets from Cassandra to Scylla, but not messages. I think we're planning to make a blog post about our experience with Scylla at some point.
MongoDB is ranked #5 on the list at present; Cassandra comes in at #11. (And Scylla, to which they moved most of their workload from Cassandra, is currently #88.)
DB-engines also have specific rankings for what are known as 'NoSQL wide column stores' — which is what Cassandra and Scylla are classed as:
But what this means is that even though MongoDB, Cassandra, and Scylla are all "NoSQL", making this move required significant data modeling and migration work for Discord.
(Note that the difference between Cassandra and Scylla is far narrower: both use the same data model and the Cassandra Query Language (CQL).)
Hope that helps give you some orientation in the NoSQL database field.
That is a fair concern. However, as a customer - searching my message history is a desirable feature. I would rather see meaningful individual and corporate accountability for privacy breaches. The threat of jail and/or 100MM's in fines should motivate better data handling.
Sorry but no, I don't want my messages deleted. I use search history all the time even to search for things I personally said. I can get behind deleting messages if the account is deleted though, but as some type of automatic thing based on time in the past? No thanks.
That you would choose to "keep your messages" is a little beside the point for those who would opt for more privacy.
Nobody is making the argument you should be forced to delete your messages.
In any normal world, messages that are not used would be deleted as a matter of privacy. They're kept because they can be kept, and they can be monetized. That monetization has zero benefit to the user; it's just an artifact of our odd way of doing business, where we continue to externalize a lot of things. I think over the next 10 years we might see a regulatory shift, which also means costs get more directly exposed, meaning Discord may cost $1/month, i.e. the externalization "costed in" like a carbon tax on fuels.