The tables looked fictional in the first screenshot, but after seeing the rest in the replies I’m not so sure. There are enough relevant tables for an app of this scale to make this seem pretty legit.
They are legit. But just not from TikTok. Think of a system where you can manage your ad spending and user growth with TikTok/WeChat (like product promotions, referring from a friend etc.)
It occurs to me that this could be the database for one of those shady "buy TikTok likes" services. That'd neatly explain a lot of the features we see in the dump.
Yes, this looks very much like scraped data, as much as some people here want it to be a real leak from TikTok so they can go after them for the millionth time. I wonder what the scrapers use this data for and how it's monetized. Is it just for buying likes and followers?
> Example: WeChat (Which is state owned) is within the same database as the TikTok DB (which claims not to give such information to their government).
That seems very material for the US population and ByteDance's verbal commitment to the govt. At this point, it's obvious that data of US citizens (a large chunk underage) has been heavily shared with what is perceived as a rival state.
It might not necessarily even be a direct integration. WeChat is often used as a primary means of communication within China -- it's comparable to the status of email in the US.
Literally “within the same database” is highly improbable as they are direct competitors just like Google and Microsoft are. On the other hand, countless companies will surely try to match user profiles across the apps, which sounds like a common practice everywhere.
This is one of the major problems with understanding how China operates. We view China through a Western lens, and just assume we are right - which has very bad policy outcomes. If you read a lot about how the CCP operates (I specify the CCP to differentiate the CCP running the PRC since 1949 from the ROC who fled to Taiwan), it is highly probable that those companies share the same database and have common state ties.
If you want data on this assertion research "capitalism with socialist characteristics." If you want the data straight from China read their 14th 5-year plan (I wish the US had these kinds of plans) translated by Georgetown University here [1]. China is very open about what they are doing in published writing.
If you want more information read "The World According to China" by Liz Economy. I can suggest about 10 other books too, but that book was published in 2022 and is the most relevant book I read. Unfortunately China's government from a Western lens is such a complex topic I can't simply link to one article about how China runs its economy and "private" organizations to prove my point. This complexity unfortunately leads to people making false assumptions about how China operates without the requisite knowledge - and then deciding how we should make policy decisions towards China.
Just spitballing, but this changing of the source url reminds me of an idea I had a while back. In cases like this, where the url is changed from the source, the old url should still be linked. Maybe as a subtitle, or perhaps there should be a separate "changed" link on posts like this. The changed link could show title or url source changes.
The old url is linked in the url-changed comment. Sometimes there's a pinned topcomment with the old url or the url of a merged thread, if the url change is likely to really confuse.
I'm curious if anyone's been keeping track of title and url changes. HN's small enough that one could easily hold the data. Someone has a page showing recent title changes,[1] but without retention. In that thread is a link to a browser extension that does the same, but I can't tell if its scraping is client-side (probably) or if there's a server.
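For what it's worth, keeping that history doesn't take much. Here is a minimal sketch, assuming you already have periodic snapshots of the front page as plain dicts; the snapshot shape, the `Change` record and the sample values are all hypothetical, and fetching the snapshots is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    item_id: int
    field: str
    old: str
    new: str


def diff_snapshots(previous, current):
    """Return title/url changes between two {item_id: {"title", "url"}} snapshots."""
    changes = []
    for item_id, now in current.items():
        before = previous.get(item_id)
        if before is None:
            continue  # newly seen submission, nothing to compare against
        for field in ("title", "url"):
            if before[field] != now[field]:
                changes.append(Change(item_id, field, before[field], now[field]))
    return changes


# Hypothetical snapshots; a real tracker would fill these from periodic fetches.
history = []  # append-only, so changes are retained instead of overwritten
snap1 = {1: {"title": "TikTok hacked", "url": "https://example.com/a"}}
snap2 = {1: {"title": "TikTok denies breach", "url": "https://example.com/a"}}
history.extend(diff_snapshots(snap1, snap2))
```

Appending every diff to `history` (or to a table) is what gives you retention; the page mentioned above would only need to persist that list.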
It all boils down to the fact that dang, the admin who does the edits, has a history of doing these changes quite rarely and within reason, plus he always leaves comments when he does so. That approach awarded him some general trust, including mine.
Part of the issue is that it isn't that rare. I guess because I'm reading HN all the time these changes seem fairly frequent, despite overall submission volume. I also have the same trust.
It takes one scandal stemming from a hostile link change to take this sort of reputation down. Since all of that information is public, to paraphrase Linus, with enough eyes, all HN link changes are shallow.
True, but blockchains only help if you have multiple copies of the chain. With one copy, there’s nothing stopping the owner from rewriting history. It’s why companies saying they’re “investing in blockchains” means nothing; DB replication/redundancy doesn’t stop a malicious owner.
So unless we’re all gonna start storing copies of HN’s database(s) on our home computers and phones, a blockchain won’t help stop abuse. Speaking of which, how big would said copies be?
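The single-copy point can be shown with a toy hash chain. This is only a sketch under simple assumptions: the chain format, the `witness_head` idea and the sample payloads are made up for illustration and have nothing to do with how HN stores anything.

```python
import hashlib

GENESIS = "0" * 64


def entry_hash(prev_hash, payload):
    """Hash an entry together with the hash of the entry before it."""
    return hashlib.sha256((prev_hash + "|" + payload).encode("utf-8")).hexdigest()


def build_chain(payloads):
    chain, prev = [], GENESIS
    for payload in payloads:
        prev = entry_hash(prev, payload)
        chain.append((payload, prev))
    return chain


def is_consistent(chain):
    """Check that every stored hash matches its payload and predecessor."""
    prev = GENESIS
    for payload, digest in chain:
        if entry_hash(prev, payload) != digest:
            return False
        prev = digest
    return True


original = build_chain(["url=a.example", "url=b.example"])
witness_head = original[-1][1]  # head hash kept by an independent party

# The sole owner rewrites the first entry and recomputes every later hash.
rewritten = build_chain(["url=evil.example", "url=b.example"])

internally_valid = is_consistent(rewritten)          # the rewrite still verifies
matches_witness = rewritten[-1][1] == witness_head   # but not against the witness
```

The rewritten chain passes its own consistency check, so the chain alone proves nothing. Only a copy held by someone else, even just the latest head hash, exposes the rewrite.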
> This is so far pretty inconclusive; some data matches production info, albeit publicly accessible info. Some data is junk, but it could be non-production or test data. It's a bit of a mixed bag so far.
The screenshots suggest that someone scraped public data from TikTok and then got hacked. Highly doubt any of this came directly from TikTok.
As a consultant/free-lancer I've seen this in plenty of SaaS businesses, even if they only have one product. The DB url would be "product-name.db.domain.tld/product-name", the database would be named "product-name" and the table would be called "product_name_users". Not really sure how it came to be, but it's not super uncommon.
For products built by outsourced teams (or if the initial prototype was built by an outsourced team but then taken over by in-house), it's more common, even when building products for others.
Prefixing table names with the app name is a simple way of letting multiple apps share the same database. Sharing the same database used to be really common, almost standard I’d say, when deploying on low-cost web hosts and also locally when working on multiple projects. I don’t let apps share databases anymore, but I still usually prefix table names anyway. Old habits die hard I guess, and for most ORMs it’s usually just a single config switch anyway.
More specifically, for people who don't use it, the general naming is "module_modelname" (so for example "payments_paypaltransaction"). Django's term for what one might call a module is "app", which is also a bit of a loaded term, but the simple way of seeing it is that Django projects have many subdirectories, each managing a set of models (which map to DB tables, generally).
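To make the naming rule concrete, here is a small sketch that mimics the default without importing Django. The helper `default_table_name` is hypothetical; it only reproduces the "app label, underscore, lowercased model class name" pattern and ignores the truncation Django applies for backends with short identifier limits.

```python
def default_table_name(app_label, model_class_name):
    """Mimic Django's default table name: <app_label>_<lowercased model class name>."""
    # Assumption: no explicit Meta.db_table override, and the resulting name is
    # short enough that no backend-specific truncation would apply.
    return f"{app_label}_{model_class_name.lower()}"


table = default_table_name("payments", "PayPalTransaction")
single_app_table = default_table_name("tiktok", "Users")
```

With a project that has a single app named after the company or product, every table ends up with that name as a prefix, which is the pattern people find odd in the dump.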
I think it's silly too, but we (not TikTok) have some $company_$sensiblename tables. I think it may have originated with Django, there being a single Django 'app' when it was first set up, with the name $company; Django then uses that as a prefix for all the tables it creates. (Not suggesting TikTok ever used Django, that's just my suspicion in our case. Point is, it looks weird but it happens.)
It could make sense because of the musical.ly merger, however in that case I would also expect finding TikTok’s original name somewhere (a.me or douyin), as well as references to musical.ly.
If all traces of the renames and mergers have been scrubbed over time, then I would also expect the namespacing to have been removed (unless they expect more merger?)
I don't believe the kids behind this Twitter account. I don't know why they're doing it exactly, probably some form of clout or to scam buyers on darknet marketplaces, but I know that many of their screenshots are faked. I know people at one of the companies they claimed to have hacked - they posted a Ruby on Rails directory structure as proof of hacking them but the company does not have Ruby code. So I would not trust any of their tweets.
I don't trust them (AgainstTheWest, not Troy Hunt) either — and frankly I'm surprised to see that they're still active.
Earlier this year they claimed to have discovered an NGINX 0-day RCE and tested it against a Canadian bank. Not only was it a big nothing-burger, but they ended up purging their Telegram channel afterwards with claims of infighting (screenshots for posterity: https://imgur.com/a/5AThvTv).
> they posted a Ruby on Rails directory structure as proof of hacking them but the company does not have Ruby code
I think it's extremely suspicious, but often times breaches like this aren't through the core platform itself. For example, Equifax was a support site that was hosted and built separately from their main platform.
This whole thing does smell like BS to me as well, though.
"We've downloaded the user information tables from the database. Looks like it's the system log now, which is 790GB.
Also, current user entries is 2.05 billion."
FWIW, I asked my question in good faith. For a hack such as this, which appears to include massive databases of one of the most highly-visible companies in the world, I'd expect a bit more meat.
I agree, it seems like if this hack is true, it'd possibly be one of the largest breaches of user data ever, behind only the Yahoo breach. I'd hope they'd include more info to corroborate their claim. So, I agree with being a bit cautious to begin with.
I am a bit wary of the claim/the validity of the hack. It seems rather short on details of what, and I haven't seen independent verification from someone reputable. The hack seems plausible though so who knows ¯\_(ツ)_/¯
EDIT:
In Troy Hunt I trust, and he is currently digging through it in this thread if anyone wants to follow along https://twitter.com/troyhunt/status/1566565409939427328. So far, the data seems legit, but publicly available/scrapable.
> Edit 3: It's been 11 hours since the contact, about 1.37 billion entries have been pulled. DBeaver has crashed multiple times and we've left it running. It's "fetching rows" still. So I guess there's still more.
Considering the entries are from all over the world, it is unlikely we will sell or release this. Lastly, this data contains a lot of under aged people. Releasing such information, along with the data that is being stored without user's knowledge is so dire that we think it could spark something dangerous. Example: WeChat (Which is state owned) is within the same database as the TikTok DB (which claims not to give such information to their government).
They have this data and aren't going to sell/release it? Their provided sample doesn't seem to have any critical user data like passwords and phone numbers. All the user data is publicly available. Not to mention there are very few sample rows for 2 billion records. Not sure how to confirm the PayPal data. So far, I'm skeptical.
The data could be very likely legit. But they are not TikTok's user database. It is more like TikTok's ad/activity log for some customers. The passwords are for the logins from those customers if any.
This also explains why WeChat data is in the same database. And if you are paying attention, the WeChat tables are more complex which makes sense because WeChat has a much longer history in business.
I don't know anything about what ad databases look like, but so far I haven't seen any evidence that the data is legit. Fake breaches happen every now and then so I wouldn't be surprised if this is one.
The sample has very little data for such a comparatively large claim of 2 billion.
To be clear this is not a TikTok ad database. This is a tracking database from a third party who manages their clients' business with TikTok (and others).
2 billion ad views are not that big. You are likely looking at logs not the core tables.
I'm not sure I can take them seriously if they're "extracting" the data using DBeaver as opposed to some command-line utility that could probably do chunk/batching without spinning up a UI. If it's "crashed multiple times", that'd be a clue to use a more dedicated tool or process even.
How many DBAs do you see starting up DBeaver or SSMS in order to do database backups or restores or any large-amount-of-data action?
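As a rough illustration of the chunked approach, here is a sketch that streams a query to CSV in fixed-size batches. It uses an in-memory SQLite database purely so it is self-contained; the table, the batch size and the `export_in_batches` helper are assumptions, and a real export of this size would more likely use the database's own dump tooling.

```python
import csv
import io
import sqlite3


def export_in_batches(conn, query, out, batch_size=1000):
    """Stream query results to CSV in fixed-size batches; return the row count."""
    cursor = conn.execute(query)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # header row
    total = 0
    while True:
        rows = cursor.fetchmany(batch_size)  # never holds more than one batch
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
    return total


# Hypothetical stand-in data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (name) VALUES (?)",
    [(f"user{i}",) for i in range(2500)],
)
buffer = io.StringIO()
exported = export_in_batches(
    conn, "SELECT id, name FROM users ORDER BY id", buffer, batch_size=1000
)
```

Memory use stays bounded by the batch size rather than the table size, which is the property a GUI result grid doesn't give you.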
So is this guy going to download and "leak" the source or not? Considering the current US-China relations, it wouldn't be a stretch to suggest they might not face consequences
Releasing private information about millions of Americans or Chinese is likely going to get you into trouble no matter where in the world you are. Maybe they'd avoid hacking charges, but if caught they'd face civil charges at least.
Wait, so they were downloading stuff from TikTok's servers/storage and giving info out live.
They literally said "It's fetching rows still". Why did TikTok not stop this once they knew someone had unauthorized access to their systems or storage?
I am not sure I care much about tiktok's source code really.
Most of it cannot be too hard. The interesting part of tiktok to me is how they scale video storage and then how it gets back to the user so fast. That part is amazing.
The most praised thing in TikTok is how they tailor videos to each user. While Insta or Facebook rely on virality, TikTok is able to show you videos barely liked and shared, and you somehow like them.
I remember reading about a guy experimenting with liking only long and high quality videos, skipping short and cheap ones. After a dozen or so views TikTok got the cues and started showing exactly what the guy had planned.
After five years on LinkedIn saying the plain truth, they still don't get me at all, and they just keep my feed filled with the most "interacted" content.
It fills my feed with inflammatory US based political tweets, and I'm not interested.
If a tech journalist decides to reply in a thread about the latest GoP nonsense, then as an Australian who is interested in tech news, I don't want to hear it!
Twitter's algorithm isn't front and center as long as you're set to "latest tweets". It's reduced to the sidebar where it guesses (badly) what trends I'll be interested in. On mobile, I don't even see the sidebar or trends unless I deliberately seek them out.
It's one thing I like about Twitter. All I see are the people I followed and the people they retweet. If one of them gets annoying, unfollow or mute. And it's all in chronological order. Unless I'm missing something, nothing is getting filtered or pushed by some annoying algorithm.
For your situation, you might want to create a list of words for Twitter to block for you. That might help you avoid certain topics.
I really like that about it too. I swipe all negative, political or polished content away and my feed feels more enjoyable than any other content before. No endless negative emotion trigger baits like Twitter/FB or overpolished bullshit world like Instagram. It delivers the weird and wonderful world that I like.
Everyone says this but except for the hot babes in their bikinis the TikTok algorithm does a TERRIBLE job of seeing what I want to look at, to the point where I use the app less than 15 minutes a month. There's always something absolutely fucking disgusting I see within ~25 swipes or so that makes me quit the app. If they just stuck to the babes I would probably watch a lot more.
You need to process the video before that, as not every phone/browser has the right codecs to stream from every other phone. Then you need to ship that to POPs and serve it. That’s some industrial strength bandwidth and computer power, not to mention the feature tagging the AI is doing to show the videos to the right audience.
Are you telling me that the compression is done on the device? I would think very little compression would be done on-device and instead sent as-is, with the server doing the compression.
CDNs are designed to easily handle relatively static content (e.g., Netflix-like streaming services). Scaling user-generated video serving is a much more challenging problem.
I bet most tiktok videos can be categorized into a few classes and all users just watch the videos in one of those classes. So basically, it is very much like Netflix. Also, the videos are much shorter.
The most interesting parts of the source code would be parts showing that they're collecting/using data that they either promised not to collect/use or that would be illegal for them to collect/use.
The only real existential threat TikTok faces (in the US) is the US government. I think as we get closer and closer to war (either cold or hot) with the Chinese that services like alibaba and tiktok won't stand up to scrutiny and be blanket banned. It's pretty clear that the CCP wholly controls them and is hoovering up the data on American citizens. I think the only reason it's on the back burner right now is all the Trump stuff stealing the headlines.
Yeah, it is unsurprising, since another security researcher already knew that all of TikTok's user data is being harvested on Alibaba Cloud, which has itself already been breached, resulting in a 900GB file being dumped somewhere. They also collect the IMEI numbers of many users, as found in the Android app. From [0]:
> We at Penetrum believe that everyone should have the right to know what data is being harvested by companies and would like to give our readers a clearer understanding of what happens when you download the mobile application TikTok. From our understanding and our analysis it seems that TikTok does an excessive amount of tracking on it’s users, and that the data collected is partially if not fully stored on Chinese servers with the ISP Alibaba.
For it to happen again is hardly shocking and very much expected. But as always, the TikTok fans rushing here in a panic will do anything to deny the breach, and after examining the contents and sample(s) of the breach, it looks highly legit.
Did you actually read what you linked? Alibaba cloud wasn't breached, one of the tens of thousands of Alibaba cloud customers had a database exposed to the internet. Same happens with AWS and Azure all the time.
I hope you read the guidelines: "Please don't comment on whether someone read an article." [0]
> one of the tens of thousands of Alibaba cloud customers had a database exposed to the internet.
Was exposed ON Alibaba Cloud and that database WAS breached. It really doesn't matter and there is little difference in that event or whatever the security researcher and I meant.
I'm assuming you must have read this before commenting: [1]
"We provide ongoing security guidelines and training to all our customers, and always advise them to protect their data by setting a secure password among other security recommendations," an Alibaba spokesperson stated.
Either way, if that is not a breach then I don't know what is.
You have misread the point of my comment. I already know it is a breach; the GP commenter was hairsplitting, and the difference in my comment is negligible: it says the same thing as the security researcher and the statement about the breach.
> It does matter. If you set up an unprotected server on AWS and someone hacks it, it is your fault, not AWS's.
Indeed it does matter. Whoever owns the server got breached and that is that.
But I didn't say the breach didn't matter. I said the difference in where it got breached doesn't matter.
It works if you rig it to work well at that scale. Google used a modified version of MySQL to host AdWords for a long time. A very large database with massive qps:
Pretty obvious if you look at the tables closely. And the "cabinet" means hosting cabinets (steel frames holding the machines).
Which means those dudes were basically downloading the ad logs..