
According to this article [1], you plan to move Tumblr's backend onto WordPress. Considering that Tumblr's infra stores over 1 trillion distinct product objects, this would be one of the most technically ambitious migrations in history. Can you share any thoughts on how it will be approached? Will you be pruning/purging old content or inactive users?

[1] https://poststatus.com/automattic-has-purchased-tumblr/

That's an excellent question! I don't want to be so presumptuous as to define an exact approach before the technical exploration has started, besides saying it'll be done incrementally and in an easily revertable way to be invisible to users, just like the big datacenter migration Automattic just completed a few weeks ago.

At the point when we start this the Tumblr team will have been part of Automattic for the better part of a year if not more, so there will be a lot of learning and evolution of the products on both sides to make any migration easier.

I promise we'll write about it afterward for anyone who is curious.

    I don't want to be so presumptuous as to define 
    an exact approach before the technical exploration 
    has started
The same could be said of deciding such an absolutely massive migration is even beneficial/necessary in the first place, before a technical exploration! And yet apparently, that part has already been decided?

I am, of course, completely ignorant of how WP and Tumblr's infrastructure works. I'm not saying the migration would be a bad idea, technically or financially. Or good. Just honestly curious.

    I promise we'll write about it afterward for 
    anyone who is curious. 
Definitely looking forward to that. Best of luck. (That's not sarcasm. Honestly, best of luck!)

I've often made technical decisions with incomplete information. Here's an example of how it works (translated to this analogy):

1) You come up with a working (if expensive or otherwise imperfect) technical path
2) You define a half-dozen other potential (less expensive, more practical) paths
3) You announce the decision
4) You complete due diligence, and take the best path

There are other ways as well. Announcements and plans aren't binding; they occasionally change. You can make an announcement when you're 98% confident you'll do something. You can pivot if it doesn't work out. There are places this doesn't work (e.g. customer promises), but on something like an internal migration, this is a-okay.

If "Wordpress" (which I reluctantly use for our corporate branding site and blog--very carefully managed and controlled) is a better architecture than what Tumblr is using now, what they have now must be truly awful! Wordpress really doesn't scale very well, and you can easily have massive security problems.

Among Tumblr users, the basic incompetence of @staff in constructing a functional website is legendary. I would not be surprised if the backend were far worse than you're supposing.

The funny thing is, the incompetence of @staff is the value proposition for Tumblr, as a user - because Tumblr's backend is a rickety tower of matchsticks and paste and the devs couldn't program themselves out of a wet paper bag, it means that they haven't been able to implement - for instance - algorithmic non-chronological timeline ordering, or competent data harvesting / robomarketing. And the comically broken search tools actually give a reasonable approximation of privacy for discussions. The user experience is firmly stuck in the mid-2000s, when social sites were for communities and discussions instead of data farming.

Don't get me wrong, Tumblr's user experience is also awful - search sucks, tags suck, moderation EXTRA sucks, the website's still overrun by pornographic spambots even after the Great Titty Purge - but any development team competent enough to make real improvements would also be one competent enough to squeeze out what makes Tumblr work.

Generally before a company buys another company, there is some amount of due diligence done beforehand, so I wouldn’t presume there hasn’t been a technical exploration.

You should almost make it a documentary.

Yes! I’d pay more for real world business documentaries than anything on netflix.

Please actually do this.

Find a way to get Verizon to sign off on this, and then get in touch with an established documentary maker. Pair them with an engineer and follow the story of the migration efforts. It will take time, and it'll certainly have a narrative.

Nothing like this has been done before. I struggle with making what I do relatable to people, but having a technical or semi-technical documentary following this large project would be eye-opening.

We'll even crowd fund this if you give us the chance. I'm not kidding.

Please, please, please make this migration a documentary film.

From the documentaries I've seen, it's lots of people walking to a meeting, meetings themselves, etc. For that kind of documentary, probably people walking into server rooms, or having heated discussions.

In Automattic, we basically evolved to remove all that :) There would basically be Zoom calls and Slack discussions. The most ambitious project I worked on in Automattic was just me, looking at the code and trying to understand why something is happening. Or looking up Stripe documentation.

We get to sit in front of our laptops in nice places though :)

We don't need server rooms.

We need discussions about how to untangle integrations of your user model with Verizon/Yahoo's auth system, how you'll consolidate all the microservices, which ongoing migrations you'll halt, the puzzled looks you'll have at undocumented code that performs nested eager-loaded lazy migrations of data, etc.

I've been involved in a multi-year migration effort. I expect this may be the same for y'all. It'd be fun to have an account of something that is so prolific and well known.

This would be an interesting new type of documentary. A few shots of people with laptops on beaches around the world to establish characters, then just animated slack chats, terminal sessions and whiteboard sessions.

Follow-up: these are two of my most upvoted comments. A lot of people want to see this. :)

Actually, you could pitch the idea to Netflix. It sounds like something they might do. And, yes, I would watch it. :)

Verizon wouldn’t risk having negative publicity from a documentary. Not to be a buzzkill but seems too left of center for them.

As a software engineer and independent filmmaker, I fully support this idea of creating a documentary (if one hasn't already been started).

This is a fantastic idea. I imagine similar form to "Some Kind of Monster", the documentary about Metallica. It's mostly the band meeting, discussing ideas, playing some music, struggling with internal tensions and personal issues etc. I'm not even a huge fan of the band, but it was a very entertaining watch and I think it would be almost guaranteed that such a massive project will result in many interesting stories.

Or disaster movie.

I have no idea precisely what we're going to do or how, but if I were spitballing, I'd think something like ...

Currently I think Tumblr stores all posts across all sites in one big table? WP.com does different tables per site.

I also think Tumblr's post ids are often above PHP's int max for 32-bit systems ( 2,147,483,647 ) -- I know I've seen some issues trying to parse Tumblr's post ids to integers rather than strings on some old servers years ago.
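
A minimal sketch of the kind of issue described above, in Python for illustration only (the thread is about PHP). The post id is made up, and `fits_int32` and `to_int32` are hypothetical helper names. `to_int32` mimics what a fixed-width signed 32-bit field would store; that is an assumption about the failure mode, not a claim about how any particular PHP build behaved.

```python
INT32_MAX = 2**31 - 1   # 2,147,483,647, the limit quoted above
INT32_MIN = -2**31


def fits_int32(post_id: str) -> bool:
    """Return True if the id can be held in a signed 32-bit integer."""
    return INT32_MIN <= int(post_id) <= INT32_MAX


def to_int32(value: int) -> int:
    """Simulate two's-complement truncation to a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


post_id = "186728510617"  # hypothetical id, larger than INT32_MAX

print(fits_int32(post_id))                      # the id does not fit
print(to_int32(int(post_id)) == int(post_id))   # truncation changes the value
```

Keeping the id as a string end to end, as the comment suggests, sidesteps the problem on systems where the native integer is too small.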

For an overview of how our systems are run, here's Barry, our head sysadmin, talking about six years ago about how the wpcom infrastructure is structured:


It's changed somewhat, but not much conceptually. It's a really fascinating talk and I'd encourage anyone curious on massive scale data to give it a look, and see precisely what can be managed with determination and mysql and coffee.

Close. For the most part, Posts appear to the application to come from one table, since the lookup interface is the same (externally). In reality, at this massive scale a single MySQL table and database just wouldn't be possible, even with a huge number of replicas. Tumblr's posts are spread across MANY "shards", which are actually different servers, each holding a chunk of posts, sharded by blog owner, e.g. Blogs 1 - 10,000,000 on Shard 1, 10,000,001 to 20,000,000 on Shard 2, etc. There are more in-depth talks here http://highscalability.com/blog/2012/2/13/tumblr-architectur... and http://assets.en.oreilly.com/1/event/74/Massively%20Sharded%... which are from 2011/2012, but the overall ideas still hold. At the time of that post (7 years ago) there were already 200 database servers.
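
The range example above can be sketched as a simple lookup, in Python for illustration. The 10,000,000-blog range size comes from the example in the comment; the function name `shard_for_blog` and the fixed-size arithmetic are assumptions, and nothing here is meant to describe Tumblr's actual routing code.

```python
BLOGS_PER_SHARD = 10_000_000  # range size taken from the example above


def shard_for_blog(blog_id: int, blogs_per_shard: int = BLOGS_PER_SHARD) -> int:
    """Map a blog id to a 1-based shard number using fixed-size id ranges.

    Blogs 1..10,000,000 map to shard 1, 10,000,001..20,000,000 to shard 2,
    and so on.
    """
    if blog_id < 1:
        raise ValueError("blog ids are assumed to start at 1")
    return (blog_id - 1) // blogs_per_shard + 1


for blog_id in (1, 10_000_000, 10_000_001, 25_000_000):
    print(blog_id, "-> shard", shard_for_blog(blog_id))
```

Because every post for a blog lives on that blog's shard, the application only needs the blog id to know which server to query.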

Thankfully nothing is 32-bit, so no worries about integer overflows. That would cause huge headaches everywhere on the PHP side. In MySQL, a regular INT column does have that limitation (max roughly 4.3 billion unsigned, 2.1 billion signed), so BIGINT must be used there (Twitter had to do the same). Where it gets interesting is that PHP doesn't support unsigned integers, so on 64-bit your max integer in PHP is 9,223,372,036,854,775,807, whereas in MySQL an unsigned 64-bit int goes up to roughly double that. I think it's safe to say though that neither Tumblr nor WordPress, even combined, would ever have more posts than atoms on Earth =)
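
For reference, the limits mentioned above can be computed directly. This is a small Python sketch; the dictionary and constant names are just labels chosen for illustration.

```python
# Maximum values for MySQL's 4-byte INT and 8-byte BIGINT column types.
LIMITS = {
    "INT signed": 2**31 - 1,
    "INT unsigned": 2**32 - 1,
    "BIGINT signed": 2**63 - 1,
    "BIGINT unsigned": 2**64 - 1,
}

# Largest integer on a 64-bit PHP build (signed only, as noted above).
PHP_INT_MAX_64 = 2**63 - 1

for name, limit in LIMITS.items():
    print(f"{name:16} {limit:,}")

# An unsigned BIGINT can hold values a 64-bit PHP integer cannot.
print(LIMITS["BIGINT unsigned"] > PHP_INT_MAX_64)
```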

> I think it's safe to say though that neither Tumblr nor WordPress, even combined, would ever have more posts than atoms on Earth =)

If my math is correct, you're a wee bit off.

9,223,372,036,854,775,807 is 2^63 - 1, which is roughly 10^19. The Avogadro constant, which is about 6 * 10^23, is the number of particles (atoms / molecules) in a mole of substance, which for atoms amounts to (atomic mass of the element) grams of mass (so e.g. 12g of C12 is a mole).
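
To put numbers on that comparison, here is a quick back-of-the-envelope check in Python. Carbon-12 is used only because it is the example given above; the variable names are arbitrary.

```python
AVOGADRO = 6.02214076e23      # particles per mole
C12_GRAMS_PER_MOLE = 12.0     # one mole of carbon-12 weighs 12 g

atoms = 2**63 - 1             # the 64-bit signed maximum from the thread

moles = atoms / AVOGADRO
grams = moles * C12_GRAMS_PER_MOLE

print(f"{moles:.3e} mol")
print(f"{grams:.3e} g of carbon-12")
```

So 2^63 atoms of carbon come to well under a milligram, nowhere near the number of atoms on Earth.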

The fact that they are already planning on making changes seems to run counter to the idea that they would be taking a "Berkshire Hathaway" approach. In my understanding Warren Buffett's current philosophy is to buy companies that need his capital, potentially his name, but definitely not his intervention.

I'm curious: What approach, if any, would you take?

I'm sure you have an interesting perspective based on your experience working at MySQL-friendly shops, in addition to being an engineer at Tumblr once upon a time.

Thank you for asking, but I probably shouldn't get into that :) I'm inherently conflicted / biased due to designing a solid chunk of Tumblr's backend, and also previously worked for another Automattic competitor (Six Apart, the defunct company behind Movable Type).

Edit to avoid anyone misconstruing, I'm not trying to imply one thing or another, just that I can't approach this impartially. And in any case, I wish everyone well on both sides of this acquisition. I'm just genuinely curious how they plan to proceed from a technical standpoint, as it's a really interesting challenge.

Just curious about your just curious; what is the architecture of the current Tumblr back end?

This appears to be 6 years old. Is it still relevant?


Ehhh parts of that article were never accurate, especially the stuff about having an hbase-powered dashboard feed.

Primarily the product backend is monolithic PHP (custom framework) + services in various languages + sharded MySQL + Memcached + Gearman. Lots of other technologies in use though too, but I'll defer to current employees if they want to answer.

Fantasy big data: let's use Hadoop and Kafka!

Reality big data: Let's shard it across MySQL.

Not exactly. Tumblr has a pretty huge Hadoop fleet and decently large Kafka setup too. It's just a question of OLTP vs OLAP use-cases being powered by different tech stacks.

My answer above was limited to the product backend, i.e. technologies used in serving user requests in real-time. And even then I missed a bunch of large technologies in use there, especially around search and algorithmic ranking.

That's kind of the point though. Everyone has a Hadoop/Kafka, but when it comes to actually getting things done, good ole MySQL to the rescue.

I honestly don't see the draw for Kafka. And by all means I get it, I just don't buy it. Maybe I'm just holding it wrong or something.

It really depends on the task at hand. I'm one of the most vocally pro-MySQL commenters on HN, and have literally built my career around scaling and automating MySQL, but I still wouldn't recommend it for OLAP-heavy workloads. The query planner just isn't great at monstrous analytics queries, and ditto for the feature set (especially pre-8.0).

For high-volume OLTP though MySQL is an excellent choice.

Regarding Kafka: in many situations I agree. Personally I prefer Facebook's approach of just using the MySQL replication stream as the canonical sharded multi-region ordered event stream. But it depends a lot on the situation, i.e. a company's specific use-case, existing infrastructure and ecosystem in general.

I don't think you're quite getting my point.

Kafka is not going to replace MySQL specifically because it depends on the task at hand.

If you can't replace MySQL with Kafka, then why not just stick with whatever queue/jobs/stream infra you had before Kafka? At least those solutions are quite limited in scope and easily replaceable.

At this point Kafka is a solution looking for a problem.

My feeling about Kafka is that it's a useful tool to solve the "we MUST get this data to reliable storage IMMEDIATELY" problem. And to greatly mitigate the "each item must be processed and shown to be processed, exactly once" problem.

But there are relatively few situations where that's absolutely vital. And you can solve it with good ol' SQL.
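
One way to read "solve it with good ol' SQL" is to let a unique key plus a transaction do the deduplication. The sketch below uses Python's built-in `sqlite3` purely for illustration; the `processed` table, its columns, and the `process_once` helper are all hypothetical, and it only covers work whose side effects live in the same database.

```python
import sqlite3


def process_once(conn: sqlite3.Connection, item_id: str, payload: str) -> bool:
    """Record an item inside a transaction; return False if already seen.

    The primary key on item_id makes a second insert fail, so any work done
    in the same transaction is applied at most once per item.
    """
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO processed (item_id, payload) VALUES (?, ?)",
                (item_id, payload),
            )
            # Other writes that belong to this item would go here.
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed (item_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
)

print(process_once(conn, "event-1", "hello"))  # first delivery
print(process_once(conn, "event-1", "hello"))  # duplicate delivery
```

The row in `processed` doubles as the record that the item was handled, which is the "shown to be processed" part of the problem.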

You’re being generous. Most of that article was a pipe dream that never came to fruition.

That's a really underrated statement - as a lot of "scale blogs" are often referred to as fact. I'll have to reconsider a lot of those in hindsight.

Cool, thanks for the answer.

Wow, that sounds insane
