
> MySQL supports multiple different replication modes:

> Statement-based replication replicates logical SQL statements (e.g., it would literally replicate literal statements such as: UPDATE users SET birth_year=770 WHERE id = 4)

Postgres has that too (using a 3rd party tool, but it's an officially supported tool). We were using it on reddit 10 years ago. It caused a lot of problems. I wouldn't call that an advantage for Mysql.

Honestly, reading this it seems like the summary is: "We don't follow great engineering practices so we need a database more forgiving". Which is fine if that's how you want to run your business, but isn't really the death knell for Postgres.

A specific example:

> This problem might not be apparent to application developers writing code that obscures where transactions start and end. For instance, say a developer has some code that has to email a receipt to a user. Depending on how it’s written, the code may implicitly have a database transaction that’s held open until after the email finishes sending. While it’s always bad form to let your code hold open database transactions while performing unrelated blocking I/O, the reality is that most engineers are not database experts and may not always understand this problem, especially when using an ORM that obscures low-level details like open transactions.

Your developer should understand database transactions. But you should make it easier for them by abstracting it so that they don't have to. And in this particular case, I'd say they shouldn't be using the database to do locking around sending a receipt. It should be put into a queue and that queue should be processed separately, which avoids the transaction problem altogether.
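
A minimal sketch of that split (not Uber's code; plain sqlite3 and queue from the standard library as stand-ins): the transaction covers only the database write, and the slow email I/O happens in a separate worker step.

    import queue, sqlite3, time

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE receipts (id INTEGER PRIMARY KEY, email TEXT, sent INTEGER DEFAULT 0)")
    email_jobs = queue.Queue()

    def record_receipt(email_addr):
        with db:                                  # short transaction: just the INSERT
            cur = db.execute("INSERT INTO receipts (email) VALUES (?)", (email_addr,))
        email_jobs.put(cur.lastrowid)             # enqueue AFTER the transaction commits

    def drain_email_queue():
        # Separate worker: sends emails and records success in its own short transactions.
        while not email_jobs.empty():
            receipt_id = email_jobs.get()
            time.sleep(0.1)                       # stand-in for slow SMTP I/O
            with db:                              # second short transaction: mark as sent
                db.execute("UPDATE receipts SET sent = 1 WHERE id = ?", (receipt_id,))

    record_receipt("rider@example.com")
    drain_email_queue()                           # in real life this runs in a separate process/worker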




The actual summary of the article is "The design of Postgres means that updating existing rows is inefficient compared to MySQL".

Yes, there were some other points that were just extra annoyances for them but clearly that point was the most important to them. It's what the header image and the first 60% of the article was talking about and yet nobody seems to be engaging with that point in this thread.

Is the design choice bad? They never said it was. It's just an engineering trade off. It's very possible that most workloads benefit from this design. But if your workload involves updating lots of existing rows at large scale, then MySQL is going to be a better choice for you.


But it's only an issue if you rely on lots of transactions for data consistency, and my point was that it sounds like they are relying on transactions too much, which is why they need a more "forgiving" database; that's the part I quoted.

Also they didn't mention anything about the auto vacuumer, which mostly solved the issue they are talking about.

Their lack of mention of the vacuumer and not seeming to know that Postgres supports statement level replication makes me wonder if they took a deep dive into the wrong part of the technology.


None of what you said addresses the issue that I (or they) are talking about.

On Postgres an update requires a rewrite of every index of the row.

On MySQL it only requires an update of the indexes that were touched by the update.

If you have a table with 10 indexes then this means doing 10 extra writes physically to the disk.


As far as I know, not every update rewrites indexes in PG. It has single-page cleanup and the HOT (heap-only tuple) update optimization.

Please refer to page ~64 of https://momjian.us/main/writings/pgsql/mvcc.pdf.
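
You can also measure this directly: pg_stat_user_tables tracks how many of a table's updates were HOT (i.e., skipped index maintenance). A rough sketch with psycopg2, using a placeholder connection string and table name:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")    # placeholder connection string
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT relname, n_tup_upd, n_tup_hot_upd
            FROM pg_stat_user_tables
            WHERE relname = %s
        """, ("users",))                              # "users" is a placeholder table name
        row = cur.fetchone()
        if row:
            name, updates, hot_updates = row
            print(f"{name}: {hot_updates} of {updates} updates were HOT (no index writes)")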


10 indexes on one table seems a bit much. It sounds like a table that hasn't been normalized.


Though that is quite proper for a data warehouse.


A data warehouse should use much fewer transactions though.


Both vacuuming and logical replication are discussed in the article. In particular, vacuuming is easier with InnoDB since the changed records all exist in the redo log whereas PostgreSQL needs to scan the whole table. pglogical is mentioned for people running PG9.4+ as a way of doing minimal downtime cross version upgrades, which wasn't an option back with PG9.2 unless you go with something like slony1 or londiste.


> vacuuming is easier with InnoDB ... PostgreSQL needs to scan the whole table

There have been quite a few improvements to VACUUM in 9.6 [1], including avoiding full-table scans.

[1] https://www.postgresql.org/docs/9.6/static/release-9-6.html#...


Agreed, they even admit Postgres is faster for querying because MySQL requires 2 index lookups. I like the immutable data design of Postgres; it is more robust and enables fast transactional DDL. My design never deletes or alters a tuple, so all of these problems that Uber is obsessing about go away. Question is: why do they need so many updates? Is there a problem with their data capture?


Try enforcing this on teams that use ORMs like hibernate with 500 developers.

Super, duper common issue; you will find this at every large shop at some point in its lifetime, usually around the time it's hiring people, expanding extremely fast, and taking on some tech debt.

All of these are functions of extreme scale, hypergrowth, and, yeah, not following the absolute best practices all the time. But tech debt is like any debt: you get something now and pay later. If they continue going up and to the right, they will be able to afford it.


ORMs have mostly been just painful at every shop I've been at. I've used ActiveRecord, Squirl, Hibernate, django.db... they're all various levels of suck.

The one huge advantage of an ORM is the ability to support multiple databases, but that only really works if you can do everything using the ORM. The moment you have something too complex and you need to write some SQL, now you need case statements and multiple integration tests for all the databases your product needs to support.

They remove some boilerplate while adding other boilerplate. In one of my own projects, I just created several files (pgsql.commands, mysql.commands, etc.), a basic set of classes around them, and a base set of commands that works for all the DBs I wanted to support (so the command files had an inheritance model, albeit only one layer).

With all that being said, most ORMs I've used do have explicit transaction support. I know Squirl had a `transaction {}` block you could wrap commands in. Transactions shouldn't be an excuse. They should be off by default and explicitly added around blocks of things that need to be atomic.
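
To make that concrete, a sketch with SQLAlchemy (one ORM where the transaction scope can be made explicit): the atomic block covers only the database work, and the email helper (hypothetical here) would be called after the block exits and the transaction has committed.

    from sqlalchemy import create_engine, text
    from sqlalchemy.orm import Session

    engine = create_engine("sqlite:///:memory:")       # placeholder backend

    with Session(engine) as session:
        with session.begin():                          # transaction starts here...
            session.execute(text("CREATE TABLE receipts (id INTEGER PRIMARY KEY, total REAL)"))
            session.execute(text("INSERT INTO receipts (total) VALUES (:t)"), {"t": 12.50})
        # ...and commits here, BEFORE any slow, unrelated I/O
        # send_receipt_email(...)  # hypothetical helper, runs outside the transaction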

> Try enforcing this on teams that use ORMs like hibernate with 500 developers.

I realize this is a hyperbole (I hope) because you really shouldn't have 500 developers all on the same monolithic project (unless you're developing like...the Linux kernel). Getting your team to at least try to implement best practices does take some effort, but with things like weekly demos and code reviewed commits, it's do-able.


I used to be a huge proponent of ORMs everywhere, but I've come to realize that if you're writing your app in such a way that a developer needs to be able to do any arbitrary data fetch or transformation whenever they want, that's your real problem. The set of retrievals and transformations you want to support should be well-defined, and abstracted into a layer whose interface allows the level above it to only think in terms of models and not their backing store.

After you have that, then it doesn't even matter on the backend. The models you present to the layer above can have complex and changing relationships to the actual storage -- maybe they contain data which is derived from what's in the database, but transformed after being fetched so that none of their properties actually correspond to a column or a field in a store. In my experience -- having seen the tragedy that is a Rails project gone full ActiveRecord -- this pattern enforces an excellent separation of concerns and constrains a problem which can otherwise grow unboundedly in complexity.


I don't really agree; I think you either need to have a very thin layer between your DB facts and your domain, or else use the DB as a kind of persistence layer for a complex graph.

The latter only really works if you've got a primary implementation language and aren't integrating lots of applications / libraries written in different languages communicating with the same database. You need to go down the SOA / distributed RPC / FFI route to integrate different languages, and that has its own complexities.

Personally I prefer treating the DB as a canonical store of facts. Models with a lot of code are pretty suspect. Retrievals can be tuned to just the facts required, ma'am - you don't accidentally drag in the banana + gorilla + whole forest. Doesn't stop you building a higher-level service layer if that's what you need, either. You'll need that when you scale up anyway; chatty models won't work at that level either.


Yeah, I'm saying that as your app grows out of being a simple CRUD app into something more useful and involved, there will be less of a relationship between what you need to store things efficiently, and what you need to present them well. Your model will become more graph-like, probably. For this reason, patterns designed around AR-style models will fail to scale. I disagree that this only works in a mono-lingual environment, although you will need tooling and infrastructure to support it; a model-centric architecture typically doesn't afford the possibility of multi-lingual support.

The code doesn't go in the models, it goes in the service/arbitration layer. DB as a store of facts is obvious -- DB as a 1:1 representation of what yet-unforseen features, UIs and platforms will need is a naive and limiting assumption. You have to build your application in a way that future product needs won't be constrained by storage and modeling decisions, which is a tension that Rails apps frequently encounter.


> The one huge advantage of an ORM is the ability to support multiple databases, but that only really works if you can do everything using the ORM.

It has several advantages: support for multiple databases (which is useful, sometimes), the ability to serialize/deserialize an object graph in one go, and sometimes a decent query builder which lets you compose queries as opposed to concatenating strings.

Unfortunately, it's also terribly easy to destroy performance by using lazy collections configured the wrong way for your use case and not notice it, to the point where I strongly advocate using query builders instead.
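
A sketch of that failure mode with SQLAlchemy and two hypothetical models: the lazy default quietly issues one extra query per row, while an explicit eager-load option (or a hand-built join) makes the cost visible in the code.

    from sqlalchemy import ForeignKey, create_engine, select
    from sqlalchemy.orm import (DeclarativeBase, Mapped, Session, mapped_column,
                                relationship, selectinload)

    class Base(DeclarativeBase): ...

    class User(Base):
        __tablename__ = "users"
        id: Mapped[int] = mapped_column(primary_key=True)
        orders: Mapped[list["Order"]] = relationship(back_populates="user")

    class Order(Base):
        __tablename__ = "orders"
        id: Mapped[int] = mapped_column(primary_key=True)
        user_id: Mapped[int] = mapped_column(ForeignKey("users.id"))
        user: Mapped["User"] = relationship(back_populates="orders")

    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        # Lazy default: one query for the users, plus one hidden query per user below.
        for user in session.scalars(select(User)):
            _ = len(user.orders)

        # Explicit eager load: two queries total, and the intent is visible.
        for user in session.scalars(select(User).options(selectinload(User.orders))):
            _ = len(user.orders)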

> I realize this is a hyperbole (I hope) because you really shouldn't have 500 developers all on the same monolithic project (unless you're developing like...the Linux kernel). Getting your team to at least try to implement best practices does take some effort, but with things like weekly demos and code reviewed commits, it's do-able.

The problem is that when you come in later, that the codebase grew way too fast and the deadlines are tight, retrofitting best practices on an existing ball-of-mud can be daunting.


> It has several advantages: [1] support for multiple databases (which is useful, sometimes), [2] the ability to serialize/deserialize an object graph in one go, and [3] sometimes a decent query builder which lets you compose queries as opposed to concatenating strings.

[1] was the point of the comment you replied to and it provided a very important constraint as well

[2] rarely needed & easy to implement with recursive queries in native SQL

[3] building queries is pretty straightforward; what ORMs usually bring to the table is knowledge of the schema and therefore compile-time error reporting - but this can be done in any language with a sufficient level of reflection, or, in the worst case, by a two-stage compilation process where stage 1 generates code from the schema that the compiler can then use for verification in stage 2 (see the sketch below)
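
A toy sketch of that two-stage idea, using sqlite3 purely for illustration: stage 1 introspects the live schema and emits a module of table/column constants; stage 2 is just normal compilation/import, where a renamed column becomes an AttributeError instead of a runtime SQL error.

    import sqlite3

    def generate_schema_module(conn, out_path):
        # Stage 1: read the schema and write out constants for stage 2 to import.
        lines = ["# auto-generated from the live schema; do not edit by hand"]
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        for table in tables:
            columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
            lines.append(f"class {table.upper()}:")
            lines.append(f"    _name = {table!r}")
            lines.extend(f"    {col.upper()} = {col!r}" for col in columns)
        with open(out_path, "w") as f:
            f.write("\n".join(lines) + "\n")

    # Application code then refers to schema.USERS.BIRTH_YEAR rather than a bare
    # string, so schema drift is caught at import/compile time, not in production.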


I don't see how recursive queries are going to help you serialize a new Foo with a new field of type Bar with a new field of type FooBar, each going in a different table. That's what I mean by serializing an object graph.

As for building queries, SQL is straightforward (mostly). The problem is that it composes very badly. Any time you need to implement something like an advanced search (ie, lookup the same information, but with a number of different search criteria only known at runtime), the best you can do is concatenating partial queries and hoping you got the parentheses count right. Not to mention that a query builder will help with stuff the SQL syntax is miserable for, like IN clauses.
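
As an illustration of what even a tiny query builder buys you for that case (runtime criteria, IN clauses), a minimal sketch with a hypothetical users table: each criterion contributes a fragment plus parameters, and one function assembles them, so nobody hand-balances parentheses.

    def build_user_search(name=None, min_age=None, city_ids=None):
        clauses, params = [], []
        if name is not None:
            clauses.append("name LIKE ?")
            params.append(f"%{name}%")
        if min_age is not None:
            clauses.append("age >= ?")
            params.append(min_age)
        if city_ids:                               # IN clause sized to the input
            clauses.append(f"city_id IN ({', '.join('?' * len(city_ids))})")
            params.extend(city_ids)
        where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
        return f"SELECT id, name FROM users{where}", params

    sql, params = build_user_search(name="ann", city_ids=[1, 2, 3])
    # -> "SELECT id, name FROM users WHERE name LIKE ? AND city_id IN (?, ?, ?)"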


> I don't see how recursive queries are going to help you

You construct a graph of objects with very few queries, transform it and write it back; with knowledge of the db-schema this will outperform any ORM-based solution and give much greater flexibility.

> As for building queries [...] The problem is that it composes very badly. [...] the best you can do is concatenating partial queries and hoping you got the parentheses count right.

There is an area between full-fledged ORMs and string concatenation. In a purely functional approach queries are composed by composing functions that compile to queries. Postmodern[1][2] is a good, open source example, though in most commercial projects we just built our own wrappers for the tasks and databases at hand. This also allows for much better performing code since for a lot of tasks hooking up the db-reader to the json emitter without going through object instantiation reduces memory and cpu consumption by an order of magnitude (or two), while in the same project, some code benefits from a OOP approach (for which you just use a reader that constructs the objects on the fly).

[1] http://marijnhaverbeke.nl/postmodern/

[2] http://marijnhaverbeke.nl/postmodern/s-sql.html (yes, it does string concatenation at run time, but it does it for you, you don't worry about getting the parenthesis count right)


This is why I like SQL Alchemy. It provides an ORM that acts as a veneer over a very powerful SQL expression library, and you can move back and forth between the two seamlessly, even mixing them in individual statements.
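
A short sketch of what that mixing looks like (hypothetical User model): an ORM entity and a Core expression in the same statement.

    from sqlalchemy import create_engine, func, select
    from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

    class Base(DeclarativeBase): ...

    class User(Base):
        __tablename__ = "users"
        id: Mapped[int] = mapped_column(primary_key=True)
        name: Mapped[str]

    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        # ORM attribute (User.id) and Core expression (func.upper) in one SELECT.
        stmt = select(User.id, func.upper(User.name)).where(User.name.like("a%"))
        for user_id, shouty_name in session.execute(stmt):
            print(user_id, shouty_name)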


That's what I've often wondered about ORMs. First they try to abstract away all the database internals so that the developer hardly even knows what's happening. However, ultimately the abstraction always ends up having some limitations. In the end the internals need to be exposed, but since your code base is still mostly based around the ORM it can end up being a mess.


If your team is 500 strong, you definitely should be at a point where you can do things right. If not, the 'debt' is not mainly technical, it's a clusterfuck of bad decision making.


Yeah, sure, some people in the sea of 500 will be doing elegant, clean things, working in the core part of the business (usually at or near the cash register program), but if you have reason to have 500 developers, you don't simply get the luxury of having every project written perfectly.

Landing somewhere in between, getting things done not perfect but shippable, and doing it as fast as possible, is par for the course IME. The best code is always written the 2nd or 3rd time, never the first.

Especially in the case of hypergrowth, when you're worrying that competitors like Lyft or Postmates or Amazon might get something pivotal out first.

People have to learn somewhere, usually things like this they learn at scale, on the job.


Something that hit me while researching how to build a relational language, and thinking about how hard it would be to remove SQL and put my own flavor in its place (super-oversimplification):

- SQL "bad"
- ORM "good"
- NoSQL APIs "good"

So the thing is that when faced with SQL, "everyone" tries to "abstract" it away.

Or they use NoSQL instead, because it's "easier".

Fine.

Then if a market demand exists for a better API for databases, why don't the database guys DO IT?

Yep, I know SQL is supposedly the way to do it, but bear with me: I lived through the FoxPro era, so I know what it is to code "to the metal" against a database without SQL (and it was fine and easy).

If the SQL layer could be optional and a more bare layer were provided (you can copy the dBASE ideas!), then the problem of ORM mappers could be solved far more easily (I imagine!).

How could this work?

With an AST API layer, for example. So I could send:

TABLE "Customer" SELECT "*"

So, imagine a kind of LLVM but for databases...


LINQ: https://msdn.microsoft.com/en-us/library/bb397926.aspx

Most of the time, the problem is how to "embed" a foreign language (SQL) into an existing one (Python, Ruby, etc.), but if you do it in a proper way, such as LINQ, then you get type-checking etc.


I wish your thinking were more widespread. I never understood how RESTful HTTP became so popular without any universal query language around, merely providing an HTTP client with each service built, while in the DB world everybody is obsessed with SQL. Why not just provide the developer with an API/library for the underlying algorithms/data structures?

For example, I hate it so much when I have to read the MySQL documentation just to find out what went wrong when I used ORDER BY and the optimization didn't kick in. I mean, if I have to keep the whole optimization mechanism of the DB engine and all the steps of the filesort algorithm in my head just to write a single wretched SQL statement, then what's the point of SQL in the first place?

And this was just one example. The truth is that the SQL abstraction is leaking. Most of the commands don't encapsulate anything at all.


It's very hard to abstract away from SQL. To be honest, many NoSQL stores have integrated concepts similar to grouping, having, and joins into their APIs as they matured, because that's what working with data requires.


But still everyone does it. That must tell us something, right?

And I'm not talking about NoSQL borrowing some relational concepts, but the opposite, and more directly, the API.

I know that most folks (including the ones downvoting!) have no clue what I'm talking about, because almost nobody (recently) has experience talking to a database without SQL. It's as if the only way to talk to a NoSQL store were JSON + REST. That badly constrains your thinking.

The underlying relational model is far richer, and simpler. SQL was not designed to be used by developers and it shows (especially when you end up with contrived syntax, as with CTEs).

And before you ask what the problem with CTEs is: it's exactly that the way to get that feature is to create a contrived syntax that obscures what is going on. SQL is too restricted, yet too broad, for interfacing.


Well, data transformations are much easier to read and write in SQL than e.g. Java, what with temporary collections built up in memory, random maps, lists, etc.

CTEs aren't required very often - you generally only need them for recursive CTEs, and that's iterative retrieval analogous to pointer-chasing. It's typically a sign of a data model that's poorly suited to relational storage, e.g. trees and graphs.

I have issues with the irregularity of SQL syntax - it deeply annoys me that we have both 'where' and 'having' for doing exactly the same thing, one pre-fold and one post-fold - and I don't like my lack of access to the potential the index has for related fetches (window functions are not pleasant to work with) - but mostly my problems come not from lack of access to the relational nature, but lack of power over poor implementation / query planner / etc. details.


Agreed, but those are problems at the end, when most developers have issues at the start (the basics). It baffles me when I read that relational databases are "hard" and that you need to teach the basics even to experienced developers. Plus, the thing here is that, yeah, everyone knows SQL has issues.

So they make an ORM.

Then it creates worse issues. It's like the database side and the app side are at war with each other (without intention!), and the database side only concerns itself with the kinds of issues that only come up in specialized tasks.

And the app side still tries to interface with the database, but poorly.

----

Articulating this better, I hope: it's like the current movement to improve the syntax and semantics of JS. Or like asm.js.

So why not have an SQL 2.0 with the fixes everyone has known about for 20 or more years? And libraries/APIs designed with the acknowledgment that databases are not primarily used by "end users" but by app developers?

But that is probably like asking why we can't have a clean C++... ;)

I know this is just asking for the moon (obviously there isn't one "SQL", only several more-or-less-alike implementations), but the trouble with ORMs and databases is more than a decade old, and the answer is mostly "yep, we have that problem. Move along."


I started using CTEs much more often for complex queries actually. IMHO, they're more clear and easier to read than sub-queries, especially when those sub-queries get nested 3 or 4 layers deep.

I agree with the ideas of irregularities. The WHERE vs HAVING doesn't bother me much, and I can't think of a better syntax off hand. My SQL pet peeves are that the SELECT list is at the beginning - I don't know what columns I want exactly until after I type out the JOINs, so I usually type SELECT * and then fill in later. I'd rather put it between WHERE and ORDER BY. I'd also like UPDATE to put the WHERE before the SET, so you don't risk blowing up a ton of data if you forget or miss the WHERE.

Probably a lost cause to get that in, but I would think it wouldn't be too hard to at least support those syntax changes in addition to the current standard.


Complex queries using CTEs may be easier for you to reason about when you are writing them, but are a nightmare to understand if you are not the one who wrote them. We have heaps of CTE-intensive technical debt and usually it takes longer to understand what they are doing than it would take to write them from scratch from a specification.

If you are using them to ease your understanding, chances are that you are doing the job in a very memory intensive way. It also tends to be slower because the generated intermediate results do not have indexes nor good statistics that could help the query planner to be efficient. Essentially, you are taking upon yourself the query planner job and assuming you'll do it better yourself.


> SQL was not designed to be used by developers

Sure it was. Who else would it have been designed to be used by?


> > > SQL was not designed to be used by developers

> Sure it was. Who else would it have been designed to be used by?

More accurately than the grandparent: SQL wasn't designed exclusively to be used by developers. It was designed to be accessible to analysts (domain-focused analysts, not systems analysts.)


Yes, but there is no point in blaming the database for that. It's seriously not a database issue. Postgres in particular has well-exposed internals for monitoring and killing long-running transactions (if you are not setting a statement timeout).
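
For example (a sketch, assuming psycopg2 and a placeholder connection string; the 5-minute threshold is arbitrary): pg_stat_activity shows how long each transaction has been open, and pg_terminate_backend kills an offender.

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")     # placeholder connection string
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("""
            SELECT pid, now() - xact_start AS open_for, query
            FROM pg_stat_activity
            WHERE xact_start IS NOT NULL
              AND now() - xact_start > interval '5 minutes'
        """)
        for pid, open_for, query in cur.fetchall():
            print(f"terminating pid {pid}, transaction open for {open_for}: {query[:80]}")
            cur.execute("SELECT pg_terminate_backend(%s)", (pid,))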


Sure, but the show must go on.


Experience with Mysql replication (even simple master/slave) leads me to believe Uber is going to have some rather nasty surprises at some point.


MySQL has had solid and flexible replication options for a long time. Postgres has only just started to catch up in the last couple of years.

Don't get me wrong, I would generally choose Postgres over MySQL for an RDBMS with replication requirements these days, but I'm not sure I would have made that same decision a few years ago.

There are valid reasons that long established companies such as Google, Twitter, Facebook and countless others chose MySQL as their primary data store.


It's not that "replication options just started to catch up" with PostgreSQL; it's that when the PostgreSQL team takes something into core, it has to be fully fleshed out and fully supported, with no surprises.

And with that in mind, I would take the slow-to-deliver replication options PostgreSQL supplies in base over the previous third-party options, which were comparable to the built-in options in MySQL... and I'd take it over MySQL's built-in options.

And I've used both PostgreSQL and MySQL in highly transactional, reliability-critical production environments for over 5 years.


I guess I'll be "that guy" who brings up non-technical concerns. I'd think very long and hard about using MySQL these days. Oracle is treating it pretty bizarrely and widening the feature gap between the open source and enterprise versions. Wikimedia is using MariaDB, so I'd guess it works well for certain high-scale use cases.

[I know there's different opinions but I'd rather have a fully open database if I have a choice. For the record I default to PostgreSQL]


> There are valid reasons that long established companies such as Google, Twitter, Facebook and countless others chose MySQL as their primary data store.

I think you're misrepresenting things here. While all do use MySQL for some specific tasks, it's clear that all also have many other central datastores which are not MySQL. In Google's case this is a gross overstatement.


Twitter's underlying data stores are and always have been MySQL.


Not really, YouTube runs on MySQL. Pretty sure it's still the largest video site in the world.


YouTube runs several MySQL instances behind Vitess (source: http://blog.vitess.io/2015/03/scaling-mysql-in-cloud-with-vi... )


At the same time, YouTube is only one of Google’s services, and there is no evidence that the remaining services run on MySQL as well.


Could you elaborate?


You can read it in their next blog post: "How we migrated back to PostgreSql from mySql after migrating from PostgreSql to mySql". My experience with mySql replication was OK, but I did nothing fancy - just master/slave standard stuff. However the database itself was allowing you to do stuff by default (at the time at least) that is really bad. Not to mention scaling while I was using it was pretty much "you're on your own buddy" kind of deal. Today I'm scared to touch mySql because I don't trust Oracle. I use MariaDB on an old project which used mySql but anything new is postgres.


Personally I prefer Postgres, but at the shop we use MySQL. One of the leads personally believes in Percona Server[0] and mentions that some of the old MySQL folks work there. I like their docs and the idea of a more tuned MySQL config.

[0]: https://www.percona.com/doc/percona-server/5.7/index.html


I can second using Percona; I wasn't much of a believer in MySQL until I inherited a medium sized application using Percona 5.7. Once you get past a few of the initial warts (like making sure the database doesn't silently truncate your data--yuck), it's been really performant, stable and impressive (and we're running a setup with a fairly complex schema with master operating at ~3000 QPS average, a slave actively replicating for read load balancing and backup, and Sphinx for FTS).

Knowing what to be careful with on MySQL, I would certainly consider using the Percona flavor again in a new application.


I hit an in-production bug with the silent field truncation thing the other day, on an old system (2009ish) that was being moved.

The table had a field that was varchar(128) and wasn't used (the original filename; the system generated a hashed filename and served the file under that name, for sharding, since there were lots of files), so for 5 years MySQL had been silently killing everything > 128 characters and it didn't matter.

Then along comes our hero (me in this story) who does all the upgrades, tests everything thoroughly (he thought), and it's all fine; deploy, test, everything fine.

Next morning, 9 bug reports: "Can't upload files at all FIXITNOW!!!"

Turns out the company was auto-generating PDFs from some report software that dumped all the data into the filename, and none of them were less than 128 chars.

MySQL 5.7 stopped silently dumping data into the bit bucket in the sky and instead threw an error (quite reasonably, since you were trying to shove 300 characters into a varchar(128)).

TLDR: MySQL doing the right thing broke the broken.


Heh, yes came here to say something rather similar...


I've tried to set up replicated Postgres with auto-failover, and it honestly is a PITA.

The only sources of failover logic are random scripts from around the internet that you have to download, hammer into your version's dialect, and hope you don't trigger one of the many uncovered failure modes.

Sure, log shipping works, but that's far, FAR from a complete solution; closing the gap requires a ton of development hours, testing, etc.

Can't really blame people for using things with mature tooling.

(But I didn't switch to MySQL, that'd be madness! There are plenty of good, replicated, redundant stores out there, both NoSQL and SQL.)


Does repmgr handle most of this?


Most of it. But in the master-slave configuration, clients can only work if they connect to the master, and when the master switches, scripts need to go to each client and update its config. Not immensely bad if you also use PgBouncer, so you can do it on the fly without restarting the whole client, but it's exceptionally vulnerable to split brain and the like.


I've handled this with a DNS CNAME record that points to the active master, though pgBouncer is a valid solution as well.


Uber engineering in general is extremely disappointing. They consistently oversell/overhype mundane processes and technologies both software and otherwise. I suspect this is mostly because they are driven more by marketing and business needs than actual engineering bottlenecks. That is the only way I can explain all the technological missteps they keep bragging about.


Sending an email to a local MTA should be pretty fast.

That is, if you cannot and will not do async stuff in your program, there are ready-made tools that will do that particular thing asynchronously for you, and have been doing so for years (or even decades).


True, but really your local MTA is just acting like a specialized queue, since the first thing it will do is send the message to your relay.


Yes, the whole point is to avoid building your own queue where a ready-made, purpose-made solution already exists.


Well, there are pros and cons. If you already have a queue structure set up in your environment, it may not be a good idea to have to maintain a second queue structure in the form of a bunch of local MTAs. You'll need monitoring (how many messages are queued in each local MTA?) and a way to keep those MTAs up to date.

If you just put an item in to a queue, and then have a specific cluster of machines that does nothing but grab items from that queue and push them through an email infrastructure, then it'll be a lot easier to maintain.


I was wondering how that could happen? It sounds like someone is trying to do two things in parallel.

I would expect

    def my_view(request):
        try:
            receipt = generate_receipt(...)
            receipt.send_email(...)
        except SomeKindOfEmailError as e:
            pass  # okay, do something else
and this should be synchronous and thus blocking. So to not block, they either wrote co-routines (asynchronous) or execute things in parallel. Have I interpreted their problem incorrectly?


Open db connection

get data for receipt

generate receipt

send email

write success to database

close connection

To a junior programmer this would probably look reasonable, and to be fair, it takes some experience and getting burned, or good training, to know it is not.
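
What that looks like in code (a sketch, using psycopg2 because it opens a transaction implicitly on the first execute and holds it until commit(); send_receipt_over_smtp is a hypothetical helper):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")          # placeholder connection string
    cur = conn.cursor()

    cur.execute("SELECT email, total FROM receipts WHERE id = %s", (42,))  # transaction opens here
    email_addr, total = cur.fetchone()

    send_receipt_over_smtp(email_addr, total)               # hypothetical helper: slow, unrelated I/O,
                                                            # performed while the transaction is still open

    cur.execute("UPDATE receipts SET sent = true WHERE id = %s", (42,))
    conn.commit()                                           # transaction only ends here
    conn.close()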


I was about to argue but then I realized: I've been working in the MS world for so long, I forgot that not everybody has connection pools.

The MS recommendation is the opposite: get the connection early, finish with it late, let us worry about the "real" connection. I've been working on multi-TB databases like that, with hundreds of concurrent requests, and never had problems.


This has nothing to do with pooling, you can't use a connection pool in this scenario because you're in a transaction. Even in the MS world, that requires a single connection to be used throughout without releasing it back to the pool. If one isn't in a transaction then each access to the db uses a fresh connection from the pool and this problem never comes up, but the OP here is assuming the time between open and close is a transaction.


So the correct version is, I guess:

Open db connection 1

get data for receipt

Close db connection 1

generate receipt

send email

open db connection 2

write success to database

close db connection 2

I guess you could speed this up a lot by doing it in bulk instead of opening and closing db connections twice for every email. Anyway, the version you wrote sounds reasonable for everything but really big operations to me, but then again I'm fairly junior.


You can still do it in one transaction, sort of. Replace "send email" with "queue email" and then either eliminate the step that writes the success to the database (pushing that off to the queue processor, who will open or already have open a database connection) or instead write to the DB that you queued the item.

The advantage you gain is that the queue is a nice buffer if something gets held up with email sending, and also, the queue processor can work in bulk, say sending 100 emails, and then opening a connection to the DB and writing them all in one statement.
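
A sketch of that queue-processor side (hypothetical job queue holding (receipt_id, email_addr) tuples, a sqlite3-style connection, and send_email standing in for the real mail call): grab a batch, send the emails, then record success for the whole batch in one short transaction.

    def process_email_batch(job_queue, db_conn, batch_size=100):
        batch = []
        while len(batch) < batch_size and not job_queue.empty():
            batch.append(job_queue.get())
        if not batch:
            return
        for receipt_id, email_addr in batch:
            send_email(email_addr, receipt_id)        # hypothetical sender; retries live in the queue layer
        placeholders = ", ".join("?" * len(batch))
        with db_conn:                                 # one short transaction for the whole batch
            db_conn.execute(
                f"UPDATE receipts SET sent = 1 WHERE id IN ({placeholders})",
                [receipt_id for receipt_id, _ in batch],
            )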


So if the "send email" step fails (temporarily), the next worker to come along will grab the same receipt and send it again?

I think a better solution would be to use a centralized queue that actually does the mail sending, and retries in case of failure.


That depends on the implementation of the queue. Most queues have a "failed/retry" concept, where they mark failed jobs to be retried in some set future point in time. So one failed job does not hold up your entire queue processing.


> a centralized queue that actually does the mail sending, and retries in case of failure.

A bit like SMTP?


Not necessarily; in the first write we can set the receipt status to "queued" so the queue system can update it to "sent" or retry.


Depending on the number of email sending workers, it could be closer to:

    transaction 1:
        mark an unsent receipt as being processed
        get the marked receipt
    send email
    transaction 2:
        mark the receipt as sent


You don't need to close the database connection to not have an open transaction, right?


No you don't. But depending on the ORM, some of them batch writes unless you explicitly tell them not to, but will flush when you close the connection. So some programmers will just close the connection because they don't understand the difference or because they've learned that closing the connection guarantees that the data is written.
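
For example (SQLAlchemy shown; other ORMs differ): add() only stages the write, flush() emits the SQL inside the still-open transaction, and only commit() persists it. Here, closing without committing actually discards the insert, which is exactly the kind of distinction that gets misunderstood.

    from sqlalchemy import create_engine
    from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

    class Base(DeclarativeBase): ...

    class Note(Base):
        __tablename__ = "notes"
        id: Mapped[int] = mapped_column(primary_key=True)
        body: Mapped[str]

    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)

    session = Session(engine)
    session.add(Note(body="hi"))     # staged in memory, no SQL emitted yet
    session.flush()                  # INSERT is emitted, but the transaction is still open
    session.close()                  # close without commit: the insert is rolled back

    with Session(engine) as check:
        print(check.query(Note).count())   # 0 -- relying on close() to "save" is a trap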


> So some programmers will just close the connection because they don't understand the difference or because they've learned that closing the connection guarantees that the data is written.

To be fair, this problem isn't limited to databases. Filesystems, even HDDs/SSDs, have been known to readily ignore flush() calls in order to achieve better benchmark results.


This still can be a pretty reasonable and correct course of action. Everything depends on the amount of data you are locking, and amount of transactions open simultaneously.

When there's noticeable contention due to the number of parallel transactions, one should consider ways to loosen the data guarantees (the status can stay unset for some time) and go for an explicitly asynchronous approach.


Thanks for the illustration. But if the Uber team was using an ORM, then I'd expect the ORM to take care of the connection pool.

"Get data for receipt" would be a SELECT, and only "write success to database" would do an INSERT or UPDATE. I'd expect a junior programmer to do the above in at least two SQL calls. I have a feeling they were trying something smart.

EDIT: hmm, reading the other commenter above, they were probably trying to lock the data for full data integrity. Okay, that makes sense then. I was looking at the problem from the wrong angle.


Grapevine says that they were holding a transaction open for the entire duration of the Uber ride.


The fact that Postgres didn't have an official replication feature prior to a recent version is just unacceptable.



