
Pretty common attitude from folks who have never worked in one of the BigTech companies where Java rules (Amazon being a prime example). Since they never encounter Java in the "SF-style Startup" world, they assume it must be dead. Meanwhile, hundreds of thousands of engineers deal with hundreds of millions (billions?) of lines of Java every day.


DynamoDB is used *everywhere* in AMZN Retail, so this is absolutely not surprising. Plus, the vast majority of internal Services use EC2 in the form of Apollo/ECS. So OP probably hit some parts of the site that are hosted in us-west-2. For all I know, they started routing us-east-1 traffic to other DCs, figuring latency is a fine trade-off for availability.


To clarify, most of CDO (Consumer Devices Other) does run on AWS in the sense that NAWS is the target state; MAWS is legacy and actively (if slowly) being migrated away from. CDO (including Alexa) has been using DynamoDB/Lambda/Kinesis/SQS etc. forever; it's just the compute and (kind of) the network layers that are still MAWS. Even then, a large part of CDO has moved from Apollo to ECS/Fargate/whatever unholy Hex or DataPath thing they're pushing these days.

Source: Ex-AMZN


Def agree. Most people will never touch an Abstract Syntax Tree or even Expression Trees. Almost everyone working in back-end will use Cloud Services and will make mistakes based on assumptions about what they provide.


I was studying for my MSc in CS some 25 years ago. Our curriculum included both automata/formal languages (multiple courses over multiple semesters) and parallel programming.

The latter course (a) was built on a mathematical formalism that had been developed at the university proper and not used anywhere else, (b) used PVM: <https://www.netlib.org/pvm3/>, <https://en.wikipedia.org/wiki/Parallel_Virtual_Machine>, for labs.

Since then, I've repeatedly felt that I've seriously benefited from my formal languages courses, while the same couldn't be said about my parallel programming studies. PVM is dead technology (I think it must have counted as "nearly dead" right when we were using it). And the only aspect I recall about the formal parallel stuff is that it resembles nothing that I've read or seen about distributed and/or concurrent programming ever since.

A funny old memory regarding PVM. (This was a time when we used landlines with 56 kbit/s modems and pppd to dial in to university servers.) I bought a cheap second computer just so I could actually "distribute" PVM over a "cluster". For connecting both machines, I used Linux's PLIP implementation. I didn't have money for two ethernet cards. IIRC, PLIP allowed for 40 kbyte/s transfers! <https://en.wikipedia.org/wiki/Parallel_Line_Internet_Protoco...>


Sure, I did the same, BS/MS with a focus on Compilers/Programming Languages. It's been personally gratifying to understand programming "end-to-end" and to solve some tricky problems, but 99% of folks aren't going to hit those problems. There are tons of people interacting with Cloud Services every day who aren't aware of basic issues like:

- Consistency models (can I really count on data being there? What do I have to do to make sure that stale reads/write conflicts don't occur?)

- Transactions (this has really fallen off, especially in larger companies outside of BI/Analytics)

- Causality (how can I handle write conflicts at the App Layer? Are there data structures, e.g. CRDTs, that can help in certain cases? See the sketch below)
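To make the first and last bullets concrete, here is a rough sketch using the AWS SDK for Java v2: a strongly consistent read plus a version-checked conditional write, so concurrent writers can't silently clobber each other. The table and attribute names ("orders", "pk", "version", "order_status") are made up for illustration, not from any real schema.

    import java.util.Map;
    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.*;

    public class ConsistencyExamples {
        public static void main(String[] args) {
            DynamoDbClient ddb = DynamoDbClient.create();

            // Strongly consistent read: default GetItem reads are eventually consistent,
            // so without this flag a just-written item may not be visible yet.
            Map<String, AttributeValue> item = ddb.getItem(GetItemRequest.builder()
                    .tableName("orders")
                    .key(Map.of("pk", AttributeValue.builder().s("order-123").build()))
                    .consistentRead(true)
                    .build()).item();

            // Optimistic concurrency: only write if the version we read is still current,
            // so two concurrent writers can't silently overwrite each other.
            long readVersion = Long.parseLong(item.get("version").n());
            try {
                ddb.putItem(PutItemRequest.builder()
                        .tableName("orders")
                        .item(Map.of(
                                "pk", AttributeValue.builder().s("order-123").build(),
                                "order_status", AttributeValue.builder().s("SHIPPED").build(),
                                "version", AttributeValue.builder().n(Long.toString(readVersion + 1)).build()))
                        .conditionExpression("#v = :expected")
                        .expressionAttributeNames(Map.of("#v", "version")) // alias sidesteps reserved-word issues
                        .expressionAttributeValues(Map.of(
                                ":expected", AttributeValue.builder().n(Long.toString(readVersion)).build()))
                        .build());
            } catch (ConditionalCheckFailedException e) {
                // Someone else won the race: re-read, merge or retry, or surface the conflict.
            }
        }
    }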

Even basic things like "use monotonic clocks to measure elapsed time instead of wall-clock time" aren't well known; I've personally corrected dozens of CRs for this. Yes, this can be built into libs, AI agents, etc., but it never seems to actually be, and I see the same issues repeated over and over. So something is missing at the education layer.
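For the clock point specifically, the fix is tiny, which is why it's frustrating to keep seeing it in CRs. A minimal Java illustration:

    // System.nanoTime() is monotonic and meant for measuring durations;
    // System.currentTimeMillis() follows the wall clock, which can jump
    // backwards or forwards (NTP adjustments, manual changes, leap smearing).
    public class ElapsedTime {
        public static void main(String[] args) throws InterruptedException {
            long start = System.nanoTime();           // monotonic: safe for durations
            Thread.sleep(250);                        // stand-in for the work being timed
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("elapsed ~" + elapsedMs + " ms");

            // Anti-pattern: a wall-clock delta can be negative or wildly wrong if the clock steps.
            long wallStart = System.currentTimeMillis();
            Thread.sleep(250);
            long wallElapsed = System.currentTimeMillis() - wallStart; // not guaranteed meaningful
            System.out.println("wall-clock delta " + wallElapsed + " ms (unreliable for durations)");
        }
    }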


The fundamental problems are communication lag and lack of information about why issues occur (encapsulated by the Byzantine Generals problem). I like to imagine trying to build a fault-tolerant, reliable system for the Solar System. Would the techniques we use today (retries, timeouts, etc.) really be adequate given that lag is upwards of hours instead of milliseconds? But that's the crux of these systems: coordination (mostly) works because systems are close together (same board, at most same DC).


Good resources for understanding Distributed Systems:

- MIT course with Robert Morris (of Morris Worm fame): https://www.youtube.com/watch?v=cQP8WApzIQQ&list=PLrw6a1wE39...

- Martin Kleppmann (author of DDIA): https://www.youtube.com/watch?v=UEAMfLPZZhE&list=PLeKd45zvjc...

If you can work through the above (and DDIA), you'll have a solid understanding of the issues in Distributed Systems, like Consensus, Causality, Split Brain, etc. You'll also gain a critical eye for Cloud Services and be able to articulate their drawbacks (ex: did you know that replication to DynamoDB Global Secondary Indexes is eventually consistent? What effects can that have on your applications? Example below)
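To illustrate the GSI point: in the AWS SDK for Java v2, a Query against a Global Secondary Index can only be eventually consistent, so a just-written item on the base table may not show up yet. The table and index names here ("orders", "status-index", "order_status") are hypothetical.

    import java.util.Map;
    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
    import software.amazon.awssdk.services.dynamodb.model.QueryRequest;

    public class GsiQueryExample {
        public static void main(String[] args) {
            DynamoDbClient ddb = DynamoDbClient.create();

            // Queries against a Global Secondary Index are always eventually consistent;
            // consistentRead(true) is rejected for GSIs, so a write to the base table
            // may not be visible here yet. The application has to tolerate that lag.
            QueryRequest byStatus = QueryRequest.builder()
                    .tableName("orders")
                    .indexName("status-index")
                    .keyConditionExpression("order_status = :s")
                    .expressionAttributeValues(Map.of(":s", AttributeValue.builder().s("PENDING").build()))
                    .consistentRead(false) // the only valid choice for a GSI
                    .build();

            ddb.query(byStatus).items().forEach(it -> System.out.println(it.get("pk")));
        }
    }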


> Robert Morris (of Morris Worm fame)

(of Y Combinator fame, too)


As others have said, this is a solved problem in a lot of companies. The basic answers are: (1) queuing, (2) asynchronous APIs (don't wait for the 'real' response, just submit the transaction), and (3) call-backs to the client.

A good async setup can easily handle 100k+ TPS

If you want to go the synchronous route, it's more complicated, but it amounts to partitioning and creating separate swim lanes (copies of the system at both the compute and data layers); rough sketch below.
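A rough sketch of the swim-lane routing idea (the lane count, hash choice, and key are arbitrary placeholders): hash a stable key so each request consistently lands in one of N independent copies of the system, each with its own compute fleet and data partition.

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class SwimLaneRouter {
        private final int lanes;

        public SwimLaneRouter(int lanes) {
            this.lanes = lanes;
        }

        // A stable key (user id, order id) always maps to the same lane, so each
        // lane only ever sees ~1/N of the traffic and owns ~1/N of the data.
        public int laneFor(String key) {
            CRC32 crc = new CRC32();
            crc.update(key.getBytes(StandardCharsets.UTF_8));
            return (int) (crc.getValue() % lanes);
        }

        public static void main(String[] args) {
            SwimLaneRouter router = new SwimLaneRouter(8);
            System.out.println("user-42 -> lane " + router.laneFor("user-42"));
        }
    }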


Note that the client doesn't need to know about async operations; you don't need an async API at the HTTP layer. Put the queue in memory. Have your queue workers wait up to ~5 ms to build a batch, or run the transaction when a batch is big enough (at 100k RPS, you already have a batch of 100 every ms). You're adding ~1-5 ms of latency, so there's no reason not to respond synchronously to the client. Conceptually, the queue and workers are an implementation detail within the model. As far as the controller knows, the DB query just took an extra ms (or, under any real load, responded more quickly).
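A minimal Java sketch of that pattern (batch size, wait time, and the writeBatch stand-in are placeholders): request threads enqueue a job and block on a future for a few ms, while a single worker drains a batch and commits it in one round-trip, then completes every caller's future.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class BatchingWriter {
        record Job(String payload, CompletableFuture<Boolean> done) {}

        private static final int BATCH_SIZE = 100;
        private static final long MAX_WAIT_MS = 5;
        private final BlockingQueue<Job> queue = new LinkedBlockingQueue<>();

        public BatchingWriter() {
            Thread worker = new Thread(this::runWorker, "batch-worker");
            worker.setDaemon(true);
            worker.start();
        }

        // Called from the request thread; blocks ~1-5 ms, so the HTTP response stays synchronous.
        public boolean submit(String payload) throws Exception {
            Job job = new Job(payload, new CompletableFuture<>());
            queue.put(job);
            return job.done().get(1, TimeUnit.SECONDS);
        }

        private void runWorker() {
            while (true) {
                try {
                    List<Job> batch = new ArrayList<>(BATCH_SIZE);
                    batch.add(queue.take());              // block until there is work
                    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(MAX_WAIT_MS);
                    while (batch.size() < BATCH_SIZE) {
                        long remaining = deadline - System.nanoTime();
                        if (remaining <= 0) break;
                        Job next = queue.poll(remaining, TimeUnit.NANOSECONDS);
                        if (next == null) break;
                        batch.add(next);
                    }
                    boolean ok = writeBatch(batch);       // one DB round-trip for the whole batch
                    batch.forEach(j -> j.done().complete(ok));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        private boolean writeBatch(List<Job> batch) {
            // Stand-in for a single multi-row INSERT / batched write.
            return true;
        }
    }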


Sure, but no matter how many async requests you accept, you still only have 50k items available. You also presumably take people's money, having them input their personal and card information, so not waiting for the real response means what? Thank you for your money and the data, we'll be in touch soon; pinky promise?


> Thank you for your money and the data, we'll be in touch soon; pinky promise?

That's very much an option when it's something this popular - the Olympics I went to did an even more extreme version of that ("Thank you for putting in which events you wanted to see, your card may be charged up to x some time within the next month").

Or you can do it like plane seats: allocate 50k provisional tickets during the initial release (async but on a small timescale), and then if a provisional ticket isn't paid for within e.g. 3 days you put it back on sale.

Ultimately if it takes you x minutes to confirm payment details then you have to either take payment details from some people who then don't get tickets, or put some tickets back on sale when payment for them fails. But that's not really a scaling issue - you have the same problem trying to sell 1 thing to 5 people on an online shop.


You have 50,000 tickets to spread between one million people; if you partition people to tickets, you only have 20 people per ticket. You won't have strict ordering (e.g., someone who applied later may get a ticket where someone who applied earlier doesn't), but we'd be talking about errors on the order of milliseconds.


Absolutely not true in my experience. MySQL has its share of issues (all DBs do), but it is rock-solid when using the correct engine (InnoDB for most cases, RocksDB for high-throughput writes, Memory for caching). MySQL is very hard to beat for very high-volume OLTP workloads, both reads and writes. Its replication system was years ahead of other systems (SQL Server, Postgres; SQLite doesn't have replication). DuckDB is OLAP AFAIK, so they don't compete in the same space. Every DB system has things it's good at, and MySQL really shines at very high-volume OLTP spread across partitions.


That all depends on the setup. The "standard" setup (not specific to MySQL) is:

- Single Write Leader per partition

- Backup Write Leader that is set up with synchronous replication (so WL -> WLB and waits for commit)

- Read Followers all connected asynchronously using either binlog replication (not recommended anymore) or GTID-based row replication (recommended)

In the above scenario, the odds of loss are pretty small since the Write Leader has a direct backup, and any of the Read Followers can be promoted to a Write Leader/Backup. DDIA calls the above semi-synchronous replication, although MySQL now supports a similar-but-slightly different version out of the box: https://dev.mysql.com/doc/refman/8.4/en/replication-semisync...
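A rough JDBC sketch of how an application typically uses that topology (hostnames, credentials, and table are made up): writes go to the Write Leader, reads go to an async Read Follower, and anything that needs read-your-writes has to be routed back to the leader because follower reads can lag by the replication delay.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class LeaderFollowerRouting {
        private static final String LEADER   = "jdbc:mysql://write-leader.example:3306/app";
        private static final String FOLLOWER = "jdbc:mysql://read-follower.example:3306/app";

        public static void main(String[] args) throws Exception {
            // Write path: only the Write Leader (semi-sync to its backup) accepts writes.
            try (Connection leader = DriverManager.getConnection(LEADER, "app", "secret");
                 PreparedStatement upd = leader.prepareStatement(
                         "UPDATE accounts SET balance = balance - 10 WHERE id = ?")) {
                upd.setLong(1, 42L);
                upd.executeUpdate();
            }

            // Read path: Read Followers replicate asynchronously, so this read can be stale.
            // Route read-your-writes queries to the leader instead if that matters.
            try (Connection follower = DriverManager.getConnection(FOLLOWER, "app", "secret");
                 PreparedStatement sel = follower.prepareStatement(
                         "SELECT balance FROM accounts WHERE id = ?")) {
                sel.setLong(1, 42L);
                try (ResultSet rs = sel.executeQuery()) {
                    if (rs.next()) System.out.println("balance (possibly stale): " + rs.getLong(1));
                }
            }
        }
    }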


Terminology-wise, that's not quite right. "Binlog replication" is the general term for all built-in MySQL logical replication, including all formats (statement, mixed, row); positioning via coordinates or via GTID; async or semi-sync or group replication.

Among AWS users, "binlog replication" is often contrasted against Aurora's system, which uses physical replication instead of logical.

When using binlog replication, you are correct that GTID positioning and row-based replication are strongly recommended and widely used.

Regarding semi-sync replication: that's been a MySQL feature for over a decade now, and there are some indications from Oracle that it may become deprecated in the future. (Which is surprising, since many large MySQL users do leverage it to ensure writes cannot be lost. But it seems Group Replication is promoted more by Oracle.)

MySQL semi-sync doesn't necessarily involve the setup you've described as "standard" above. In my experience it's more common to see 2 replicas doing semi-sync ack'ing, not 1. And sometimes these are just binlog servers rather than full MySQL replicas; that's the setup Facebook adopted in 2015, although more recently they've moved to a home-grown Raft-based replication system.


Agree, Data Quality in the wild is a huge concern. I've led efforts to establish Lineage/Quality in large orgs, and doing this after the fact is a massive undertaking. Having this up front, before all the data pipelines (origination, transformation, pre-processing) calcify, saves a lot of headache down the road.

