
An Analysis of WeChat’s Realtime Image Filtering in Chats - bookofjoe
https://citizenlab.ca/2019/07/cant-picture-this-2-an-analysis-of-wechats-realtime-image-filtering-in-chats/
======
word-reader
It looks like something similar is going to be legally required in the EU [1].
For encrypted chat apps like WhatsApp, the content filtering database can
simply be baked into the client application [2].

[1] [https://torrentfreak.com/eu-members-approve-upload-filters-f...](https://torrentfreak.com/eu-members-approve-upload-filters-for-terrorist-content-181214/)

[2] [https://www.bloomberg.com/opinion/articles/2019-01-25/how-to...](https://www.bloomberg.com/opinion/articles/2019-01-25/how-to-stop-misinformation)

~~~
bin0
I'm not sure e2e-encrypted applications can be reasonably expected to handle
this. The bytes they are transmitting are effectively not copyrighted, and
cannot be deciphered into copyrighted information without something which the
company does not have. I could also see a huge increase in client-side
processing requirements due to this, meaning this could translate to
substantial financial loss and an undue burden.

~~~
kccqzy
On the contrary, client-side processing means it is burning the battery in
users' devices, not consuming any cloud resources. It could very well be
cheaper for the companies to implement.

~~~
mycall
If it is running on my device, software can be cracked. This will prove fun.

~~~
darkpuma
That some technically proficient people will find ways around it is beyond
doubt, however it may prove to be an effective system for information control
despite that, if the common user finds it difficult to circumvent.

------
whoevercares
I’m saddened by the fact that many of the brightest minds might be working on
this.

~~~
luminati
Considering the choice of MD5 as the hash function, not so bright I guess.

~~~
varenc
You're right.

Citizen Lab researchers exploited MD5's weakness to answer questions about the
system. While not a real problem in practice, it seems clear MD5 was not an
ideal choice.

From the article, the researchers generated forbidden and allowed images with
a colliding hash to prove WeChat was using MD5. The allowed image was banned
in the future as a result.

However, MD5 collision generation has some constraints. It's very hard to make an
image collide with a particular known hash, but it's feasible (5 hours with a
large GPU) to take two images and modify them until their hashes collide.
Practically this means exploitation opportunities are rather limited, but a
forced collision being possible at all seems non-ideal for an adversarial use
case. There's also the risk that future cryptanalysis will further weaken MD5.
Seems clear to me WeChat just should have used something like sha256.
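
To make the hash-lookup idea concrete, here is a minimal Python sketch of an exact-match blocklist keyed on SHA-256 instead of MD5. The `digest` and `is_blocked` helpers and the sample bytes are hypothetical stand-ins for illustration, not WeChat's code; only the standard `hashlib` module is used.

```python
import hashlib

def digest(data: bytes, algorithm: str = "sha256") -> str:
    """Hex digest of the raw bytes under the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()

# Stand-in bytes for a file that is on the blocklist (hypothetical).
blocked_image = b"stand-in bytes for a blocked image file"
blocklist = {digest(blocked_image)}

def is_blocked(data: bytes) -> bool:
    """Exact-match lookup: only byte-identical files hit the blocklist."""
    return digest(data) in blocklist

print(is_blocked(blocked_image))            # True
print(is_blocked(blocked_image + b"\x00"))  # False: one extra byte changes the hash
print(len(digest(blocked_image, "md5")) * 4, "bit MD5 digest")
print(len(digest(blocked_image)) * 4, "bit SHA-256 digest")
```

The lookup logic is the same for either algorithm. What changes is how hard it is for someone to craft two different files that share a digest.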

~~~
luminati
As you pointed out, since this is an adversarial use case, a robust
cryptographic hash function is the way to go. [For a non-cryptographic Hash
function, SeaHash [1] would be the best choice, which is violently fast!]

BLAKE2b would have been the perfect choice given the adversarial nature, as it
is much more secure and faster than MD5. [2]

MD5 is so broken, it's a really poor choice for any use case - cryptographic
(fundamentally broken) or not (fundamentally slow).

[1] [http://ticki.github.io/blog/seahash-explained/](http://ticki.github.io/blog/seahash-explained/)

[2] [https://leastauthority.com/blog/BLAKE2-harder-better-faster-...](https://leastauthority.com/blog/BLAKE2-harder-better-faster-stronger-than-MD5/)
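
The speed claim is easy to check on your own machine. The following is a rough sketch using Python's standard `hashlib` and `timeit`; the `time_hash` helper, the 1 MiB payload, and the repeat count are arbitrary choices for illustration, and the relative ordering will depend on the CPU and on how the local Python build implements each algorithm.

```python
import hashlib
import timeit

# 1 MiB of zero bytes as a stand-in payload (arbitrary size for illustration).
payload = bytes(1024 * 1024)

def time_hash(name: str, repeats: int = 20) -> float:
    """Total seconds taken to hash the payload `repeats` times with `name`."""
    return timeit.timeit(lambda: hashlib.new(name, payload).digest(), number=repeats)

for name in ("md5", "sha256", "blake2b"):
    bits = hashlib.new(name).digest_size * 8
    print(f"{name:8s} {bits:4d}-bit digest  {time_hash(name):.4f} s")
```

No expected timings are shown because they vary from machine to machine.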

------
Macuyiko
Interesting fragment:

> Moreover, we found that new accounts required approval from a second account
> that must have existed for over six months, be in good standing, and have
> not already approved any other accounts in the past month. Because of these
> requirements, we found that creating new WeChat accounts was prohibitively
> difficult.

I've noticed WeChat tightening up their accounts as well over the past months.
I have been "lucky" to have created an account years ago, with a wallet still
working as well (as a non-Chinese, that is) without having to link a Chinese
bank account/card. Friends of mine who visited recently were no longer able to
do so for their account.

------
perlwle
A tricky thing WeChat does in the UX is that a censored image looks like it
was sent successfully in the sender's chat history, but the other side never
actually receives it.

It's like a black hole eating your messages without telling you.

~~~
whytaka
The Memory Hole
[[https://en.wikipedia.org/wiki/Memory_hole](https://en.wikipedia.org/wiki/Memory_hole)]

------
Causality1
This kind of thing reminds you of the fact that the reason we don't do
business and trade with North Korea is because they're a geopolitical threat,
not because they abuse their citizens. So long as a country like China's evil
stops at the border the rest of the world could not give less of a shit.

~~~
anbop
China could easily go beyond its border without consequence. Look at how
easily Russia took Crimea.

~~~
deogeo
Or how easily China took Tibet.

~~~
hobs
Or the current claims in the South China Sea
[https://twitter.com/indopac_info/status/1102650429458931713](https://twitter.com/indopac_info/status/1102650429458931713)

------
fourier_mode
> Internet platform companies operating in China are required by law to
> control content on their platforms or face penalties

I recently learned that Signal[0] works in China, are they forced to do the
same?

[0]:
[https://en.wikipedia.org/wiki/Signal_Messenger](https://en.wikipedia.org/wiki/Signal_Messenger)

~~~
oppositelock
China manipulates data pretty much anywhere imaginable. See the Google Maps
link [1] and corresponding Baidu Maps [2] locations. Notice how the Google
Maps data has huge disagreement between the road network and satellite
imagery? It's because if you do mapping in China, the government hands you a
perturbation function to apply to each data layer. You have to warp your data
per their function, and they can audit it. Baidu doesn't have to do this.
However, both Google and Baidu maps are WAY off on GPS locations, 100+ meters
off. Everybody has to do that unless you have an accurate mapping license.

I realize this is a little off topic, but since I work on something which has
a big China presence, I'm always running into their BS, and censorship is just
one little piece of it. VPN connectivity to your non-China offices is also
problematic, running TLS over the Chinese internet is also problematic, unless
you use officially provided certs and keys, etc.

[1]
[https://www.google.com/maps/place/Beijing,+China/@39.7616007...](https://www.google.com/maps/place/Beijing,+China/@39.7616007,116.3928139,1130m/data=!3m1!1e3!4m5!3m4!1s0x35f05296e7142cb9:0xb9625620af0fa98a!8m2!3d39.9041999!4d116.4073963)
[2]
[https://map.baidu.com/@12957558.390456071,4804287.368277797,...](https://map.baidu.com/@12957558.390456071,4804287.368277797,17.04z/maptype%3DB_EARTH_MAP%26latlng%3D39.7616007%252C116.3928139%26title%3D%25E6%25A0%2587%25E9%25A2%2598%26content%3D%25E5%2586%2585%25E5%25AE%25B9%26autoOpen%3Dtrue)

~~~
kccqzy
About this perturbation function: almost the entire world uses the WGS84
datum. The Chinese use a datum that's similar but subtly different. If you
didn't account for this datum you got shifted roads and features. The
technical information about this datum has never officially been made public,
but only licensed to certain companies. I'm fairly certain Google doesn't have
such a license. You can find reverse engineered info online but there's no
guarantee that those are correct.

~~~
oppositelock
You're referring to GCJ-02. As best I can figure, it adds some form of
multi-frequency noise to a shifted WGS84 coordinate, but it also seems that
different companies are told to use different coefficients for the different
noise frequencies. That's how they can tell if you're doing what you're told.

Regardless, it's difficult to work with. If you have a mapping license, you
must also take serious precautions never to let the accurate map data leave
China, or your Chinese employees are in deep trouble.
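
For readers who want a feel for what "multi-frequency noise on a shifted coordinate" could look like, here is a toy Python sketch. It is NOT GCJ-02: the shift values, amplitudes, frequencies, and the `perturb` / `unperturb` names are all hypothetical and chosen only so the example is easy to follow. It shows a deterministic warp built from a constant offset plus a few sine terms, and that such a smooth warp can be numerically inverted if you know the function.

```python
import math

# Hypothetical coefficients for illustration only; these are NOT GCJ-02's.
SHIFT_LAT, SHIFT_LON = 0.0010, 0.0050  # constant offset, in degrees
NOISE = [                               # (amplitude in degrees, cycles per degree)
    (0.0004, 3.0),
    (0.0002, 7.0),
    (0.0001, 19.0),
]

def perturb(lat: float, lon: float) -> tuple:
    """Shift the coordinate, then add smooth position-dependent noise."""
    dlat = SHIFT_LAT + sum(a * math.sin(2 * math.pi * f * lon) for a, f in NOISE)
    dlon = SHIFT_LON + sum(a * math.cos(2 * math.pi * f * lat) for a, f in NOISE)
    return lat + dlat, lon + dlon

def unperturb(lat_p: float, lon_p: float, iterations: int = 10) -> tuple:
    """Invert perturb() by fixed-point iteration (works because the warp is gentle)."""
    lat, lon = lat_p, lon_p
    for _ in range(iterations):
        est_lat, est_lon = perturb(lat, lon)
        lat += lat_p - est_lat
        lon += lon_p - est_lon
    return lat, lon

true_point = (39.9042, 116.4074)        # roughly central Beijing
warped = perturb(*true_point)
recovered = unperturb(*warped)
print("warped:   ", warped)
print("recovered:", recovered)
```

With these made-up numbers the warp stays under about 0.002 degrees per axis, which is on the order of a couple hundred meters. If each licensee really were given different coefficients, as speculated above, the pattern of offsets in published data would act as a fingerprint.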

------
NKCSS
> _We found that the use of client re-encoding can make the image filtering
> more powerful in that it effectively results in an image’s hash representing
> all images with identical pixel values as opposed to merely a specific image
> encoding. The process of re-encoding extends the generalizability of hash-
> based filtering because any image containing the same pixel contents will be
> encoded to the same file and thus have an identical hash (see Figure 3).
> When this happens, any changes to the original image’s encoding or to its
> metadata will be ineffective at evading filtering, and some change to the
> image’s pixel values, if even a small one, will be required to change the
> resultant hash. When the client does not re-encode, then any change to an
> image’s file encoding, including to its metadata, is sufficient to change
> its hash._

So, just add a random off-color pixel in your image and the system will fail.
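
A toy Python sketch of the re-encoding effect described in the quote: the "image file" here is a made-up container (length-prefixed metadata, dimensions, raw pixels), not a real image format, and `encode`, `decode`, `reencode`, `file_hash`, and `canonical_hash` are hypothetical helpers. The point is only that hashing after a canonical re-encode ignores metadata changes but still reacts to a single changed pixel value.

```python
import hashlib
import struct

def encode(pixels: bytes, width: int, height: int, metadata: bytes) -> bytes:
    """Toy container: metadata length, metadata, dimensions, raw pixel bytes."""
    header = struct.pack(">I", len(metadata)) + metadata
    return header + struct.pack(">II", width, height) + pixels

def decode(blob: bytes):
    """Return (pixels, width, height), discarding the metadata."""
    (meta_len,) = struct.unpack_from(">I", blob, 0)
    offset = 4 + meta_len
    width, height = struct.unpack_from(">II", blob, offset)
    return blob[offset + 8:], width, height

def reencode(blob: bytes) -> bytes:
    """Canonical form: same pixels and dimensions, metadata stripped."""
    pixels, width, height = decode(blob)
    return encode(pixels, width, height, b"")

def file_hash(blob: bytes) -> str:
    return hashlib.md5(blob).hexdigest()

def canonical_hash(blob: bytes) -> str:
    return hashlib.md5(reencode(blob)).hexdigest()

pixels = bytes(range(12))                               # a 2x2 RGB "image"
original = encode(pixels, 2, 2, b"camera=A")
retagged = encode(pixels, 2, 2, b"camera=B")            # metadata changed only
tweaked = encode(bytes([pixels[0] ^ 1]) + pixels[1:], 2, 2, b"camera=A")

print(file_hash(original) == file_hash(retagged))            # False
print(canonical_hash(original) == canonical_hash(retagged))  # True
print(canonical_hash(original) == canonical_hash(tweaked))   # False
```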

~~~
elaus
No, according to the article this will only evade the hash-based lookup. All
images that aren't in the hash database are then analyzed by some other,
computationally more expensive system (OCR, …).
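
A minimal Python sketch of that two-stage idea, under stated assumptions: `is_filtered` and `toy_slow_check` are hypothetical names, the slow check is a trivial byte-string search standing in for OCR or visual analysis, and adding newly flagged images back into the hash set is an assumption of this sketch. Timing is also ignored here; a real system could run the slow stage after the message has already been delivered.

```python
import hashlib
from typing import Callable, Set

def is_filtered(image: bytes, blocked: Set[str],
                slow_check: Callable[[bytes], bool]) -> bool:
    """Stage 1: cheap hash lookup. Stage 2: expensive content analysis."""
    digest = hashlib.md5(image).hexdigest()
    if digest in blocked:
        return True                # fast path: exact file already known
    if slow_check(image):
        blocked.add(digest)        # assumption: remember it for next time
        return True
    return False

def toy_slow_check(image: bytes) -> bool:
    """Stand-in for OCR or visual similarity: look for a marker string."""
    return b"FORBIDDEN" in image

blocked: Set[str] = set()
original = b"pixels...FORBIDDEN...pixels"
tweaked = original + b"\x01"       # different hash, same "content"

print(is_filtered(original, blocked, toy_slow_check))  # True (slow path)
print(is_filtered(tweaked, blocked, toy_slow_check))   # True (slow path again)
print(is_filtered(b"harmless", blocked, toy_slow_check))  # False
```

Changing one byte only defeats the first stage. The content-based stage still flags the tweaked file, and in this sketch its hash then joins the fast-path set.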

------
netsec_burn
Fascinating analysis given such limited information. It's amazing how much can
be inferred.

------
whytaka
While the crisis with fake news certainly warrants suspicion of the
citizenry's ability to make informed decisions, how do pro-government citizens
of countries without even the cultural ideal of free speech attempt to claim
objective soundness for their political positions?

~~~
hkai
Would you also be open to question the existence of a "fake news crisis"
itself, or were you convinced by the news reports that it exists?

~~~
whytaka
Having seen its proliferation first-hand, I don’t think I need much
convincing.

