By the time we start this, the Tumblr team will have been part of Automattic for the better part of a year, if not more, so there will have been a lot of learning and evolution of the products on both sides to make any migration easier.
I promise we'll write about it afterward for anyone who is curious.
I don't want to be so presumptuous as to define an exact approach before the technical exploration.
I am, of course, completely ignorant of how WP and Tumblr's infrastructure works. I'm not saying the migration would be a bad idea, technically or financially. Or good. Just honestly curious.
I promise we'll write about it afterward for anyone who is curious.
1) You come up with a working (if expensive or otherwise imperfect) technical path
2) You define a half-dozen other potential (less expensive, more practical) paths
3) You announce the decision
4) You complete due diligence, and take the best path
There are other ways as well. Announcements and plans aren't binding; they occasionally change. You can make an announcement when you're 98% confident you'll do something. You can pivot if it doesn't work out. There are places this doesn't work (e.g. customer promises), but on something like an internal migration, this is a-okay.
The funny thing is that, as a user, the incompetence of @staff is Tumblr's value proposition: because Tumblr's backend is a rickety tower of matchsticks and paste and the devs couldn't program themselves out of a wet paper bag, they haven't been able to implement, for instance, algorithmic non-chronological timeline ordering or competent data harvesting / robomarketing. And the comically broken search tools actually give a reasonable approximation of privacy for discussions. The user experience is firmly stuck in the mid-2000s, when social sites were for communities and discussions instead of data farming.
Don't get me wrong, Tumblr's user experience is also awful - search sucks, tags suck, moderation EXTRA sucks, the website's still overrun by pornographic spambots even after the Great Titty Purge - but any development team competent enough to make real improvements would also be one competent enough to squeeze out what makes Tumblr work.
Find a way to get Verizon to sign off on this, and then get in touch with an established documentary maker. Pair them with an engineer and follow the story of the migration efforts. It will take time, and it'll certainly have a narrative.
Nothing like this has been done before. I struggle with making what I do relatable to people, but having a technical or semi-technical documentary following this large project would be eye-opening.
We'll even crowd fund this if you give us the chance. I'm not kidding.
Please, please, please make this migration a documentary film.
At Automattic, we've basically evolved to remove all that :) It would mostly be Zoom calls and Slack discussions.
The most ambitious project I worked on at Automattic was just me, looking at the code and trying to understand why something was happening.
Or looking up Stripe documentation.
We get to sit in front of our laptops in nice places though :)
We need discussions about how to untangle integrations of your user model with Verizon/Yahoo's auth system, how you'll consolidate all the microservices, which ongoing migrations you'll halt, the puzzled looks you'll have at undocumented code that performs nested eager-loaded lazy migrations of data, etc.
I've been involved in a multi-year migration effort. I expect this may be the same for y'all. It'd be fun to have an account of something so prominent and well known.
Currently I think Tumblr stores all posts across all sites in one big table? WP.com does different tables per site.
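To make the difference concrete, here's a rough sketch of the two layouts. It's illustrative only: the one-big-table shape is my guess at Tumblr's side, and the per-site naming is just how stock WordPress multisite works, with WP.com layering sharding on top of that:

    <?php
    // Illustrative sketch only -- not actual Tumblr or WP.com code.

    // "One big table" shape (my guess at Tumblr's side): every blog's posts
    // live in a single table, keyed by blog ID:
    //
    //   SELECT * FROM posts WHERE blog_id = ? AND id = ?

    // Per-site tables (how stock WordPress multisite does it, and what WP.com
    // builds on): each site gets its own tables, e.g. wp_2_posts, wp_3_posts.
    function site_posts_table(int $site_id, string $prefix = 'wp_'): string
    {
        // Site 1 keeps the un-numbered table name in a standard multisite setup.
        return $site_id === 1 ? "{$prefix}posts" : "{$prefix}{$site_id}_posts";
    }

    echo site_posts_table(12345); // wp_12345_posts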
I also think Tumblr's post IDs are often above PHP's int max on 32-bit systems (2,147,483,647) -- I know I've seen issues trying to parse Tumblr's post IDs as integers rather than strings on some old servers years ago.
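If it helps anyone hitting the same thing, the usual workaround is to never let the big IDs become native ints in the first place. A sketch, with a made-up post ID; the behavior described in the comments only bites on 32-bit builds:

    <?php
    // Why Tumblr post IDs are safer handled as strings in PHP. The ID below is
    // made up; on a 32-bit build PHP_INT_MAX is 2,147,483,647, so larger IDs
    // can't be native ints.
    $json = '{"id": 190123456789}';

    // Default decode: on a 32-bit build the ID silently becomes a float, which
    // breaks strict comparisons and exact round-tripping of the digits.
    $lossy = json_decode($json, true);

    // Safer: tell json_decode to keep out-of-range integers as strings.
    $safe = json_decode($json, true, 512, JSON_BIGINT_AS_STRING);

    var_dump($lossy['id'], $safe['id']);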
For an overview of how our systems are run, here's Barry, our head sysadmin, in a talk from about six years ago on how the WP.com infrastructure is structured:
It's changed somewhat, but not much conceptually. It's a really fascinating talk and I'd encourage anyone curious about massive-scale data to give it a look and see precisely what can be managed with determination, MySQL, and coffee.
Thankfully nothing is 32-bit, so no worries about integer overflows. That would cause huge headaches everywhere on the PHP side. In MySQL, a regular unsigned INT column does have that limitation (roughly 4.3 billion max unsigned, 2.1 billion signed), so BIGINT must be used there (Twitter had to do the same). Where it gets interesting is that PHP doesn't support unsigned integers, so with 64-bit your max integer in PHP is 9,223,372,036,854,775,807, whereas in MySQL an unsigned 64-bit int is double that. I think it's safe to say though that neither Tumblr nor WordPress, even combined, would ever have more posts than atoms on Earth =)
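A quick way to sanity-check those limits from PHP (assuming a 64-bit build):

    <?php
    // Sanity-checking the limits discussed above on a 64-bit PHP build.
    var_dump(PHP_INT_SIZE); // int(8) -> 64-bit native integers
    var_dump(PHP_INT_MAX);  // int(9223372036854775807), i.e. 2^63 - 1

    // MySQL's BIGINT UNSIGNED tops out at 18,446,744,073,709,551,615 (2^64 - 1).
    // PHP has no unsigned type, so a literal that large overflows to a float:
    var_dump(18446744073709551615); // float(1.8446744073709552E+19)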
If my math is correct, you're a wee bit off.
9,223,372,036,854,775,807 is 2^63 - 1, which is roughly 10^19. The Avogadro constant, about 6 * 10^23, is the number of particles (atoms / molecules) in a mole of substance, and a mole of an element weighs roughly (atomic mass of the element) grams (so e.g. 12g of C-12 is a mole). So a few grams of anything already holds far more atoms than 2^63, and the whole Earth, at about 6 * 10^27 grams, contains on the order of 10^50 atoms.
I'm sure you have an interesting perspective based on your experience working at MySQL-friendly shops, in addition to being an engineer at Tumblr once upon a time.
Edit, to avoid anyone misconstruing this: I'm not trying to imply one thing or another, just that I can't approach this impartially. In any case, I wish everyone well on both sides of this acquisition. I'm just genuinely curious how they plan to proceed from a technical standpoint, as it's a really interesting challenge.
This appears to be 6 years old. Is it still relevant?
Primarily the product backend is monolithic PHP (custom framework) + services in various languages + sharded MySQL + Memcached + Gearman. Lots of other technologies are in use too, but I'll defer to current employees if they want to answer.
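To give a feel for how the sharded MySQL + Memcached part fits together, here's a toy cache-aside read. The function and the shard-picking logic are made up for illustration; this is just the general pattern, not the actual framework code:

    <?php
    // Toy cache-aside read against sharded MySQL (made-up code, not Tumblr's
    // actual framework). Assumes the Memcached and PDO extensions.
    function get_post(Memcached $cache, array $shards, string $post_id): ?array
    {
        $key = "post:$post_id";
        $hit = $cache->get($key);
        if ($hit !== false) {
            return $hit; // served straight from Memcached
        }
        // Pick a shard deterministically from the post ID (real systems often
        // embed the shard in the ID or consult a lookup service instead).
        $pdo  = $shards[crc32($post_id) % count($shards)];
        $stmt = $pdo->prepare('SELECT * FROM posts WHERE id = ?');
        $stmt->execute([$post_id]);
        $row = $stmt->fetch(PDO::FETCH_ASSOC) ?: null;
        if ($row !== null) {
            $cache->set($key, $row, 300); // cache for five minutes
        }
        return $row;
    }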
The reality of big data: let's shard it across MySQL.
My answer above was limited to the product backend, i.e. technologies used in serving user requests in real-time. And even then I missed a bunch of large technologies in use there, especially around search and algorithmic ranking.
I honestly don't see the draw of Kafka. I mean, I get it, I just don't buy it. Maybe I'm just holding it wrong or something.
For high-volume OLTP, though, MySQL is an excellent choice.
Regarding Kafka: in many situations I agree. Personally I prefer Facebook's approach of just using the MySQL replication stream as the canonical sharded multi-region ordered event stream. But it depends a lot on the situation, i.e. a company's specific use-case, existing infrastructure and ecosystem in general.
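Just to illustrate what "the replication stream as an event stream" means in practice, here's a toy that peeks at binlog events over plain SQL. Real consumers (CDC daemons and the like) speak the binlog protocol directly rather than polling like this, and the credentials below are made up:

    <?php
    // Toy illustration of treating MySQL's binlog as an ordered event stream.
    // Real change-data-capture consumers speak the replication protocol
    // directly; this just peeks at events over plain SQL, and the connection
    // needs replication privileges.
    $pdo = new PDO('mysql:host=127.0.0.1', 'repl_reader', 'secret');

    // Find the newest binlog file...
    $logs    = $pdo->query('SHOW BINARY LOGS')->fetchAll(PDO::FETCH_ASSOC);
    $current = end($logs)['Log_name'];

    // ...and read a batch of ordered events from it.
    $events = $pdo->query("SHOW BINLOG EVENTS IN '" . addslashes($current) . "' LIMIT 20");
    foreach ($events as $event) {
        printf("%s pos=%d %s\n", $event['Log_name'], $event['Pos'], $event['Event_type']);
    }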
Kafka is not going to replace MySQL specifically because it depends on the task at hand.
If you can't replace MySQL with Kafka, then why not just stick with whatever queue/jobs/stream infra you had before Kafka? At least those solutions are quite limited in scope and easily replaceable.
At this point Kafka is a solution looking for a problem.
But there are relatively few situations where that's absolutely vital. And you can solve it with good ol' SQL.
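For one concrete example of the "good ol' SQL" route covering the usual queue case, here's a claim-next-job sketch using SKIP LOCKED (MySQL 8.0+). The table and function names are made up; it's the shape of the pattern, not anyone's production code:

    <?php
    // Sketch of a SQL-backed job queue (made-up schema, MySQL 8.0+ for
    // SKIP LOCKED support).
    //
    //   CREATE TABLE jobs (
    //     id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    //     payload    JSON NOT NULL,
    //     claimed_at DATETIME NULL
    //   );
    function claim_next_job(PDO $pdo): ?array
    {
        $pdo->beginTransaction();
        $job = $pdo->query(
            'SELECT id, payload FROM jobs
              WHERE claimed_at IS NULL
              ORDER BY id
              LIMIT 1
              FOR UPDATE SKIP LOCKED' // other workers skip rows we hold locked
        )->fetch(PDO::FETCH_ASSOC);
        if ($job === false) {
            $pdo->rollBack();
            return null; // queue is empty
        }
        $pdo->prepare('UPDATE jobs SET claimed_at = NOW() WHERE id = ?')
            ->execute([$job['id']]);
        $pdo->commit();
        return $job;
    }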