
HK social media users writing phonetically spelled posts to shut out trolls - s_Hogg
https://news.rthk.hk/rthk/en/component/k2/1475264-20190818.htm
======
hardmaru
Hi!

Here is some information about romanisation of Cantonese if you are
interested:

The romanisation of Cantonese has an interesting history! The Yale
romanisation system [1] is (IMO) the most readable, and was later refined
into Jyutping [2], another method used in more academic contexts, which IMO is
less readable (both are available in GBoard as Cantonese input methods).
However, most personal and place names in HK use an older system [3] developed
in the 1880s by Christian missionaries.

When people use Cantonese romanisation in casual text chats on instant
messaging or social media platforms, it’s usually a mix of the two systems
[1, 3] and only rarely [2], all without the tone information (so lots of
many-to-one mappings) and mixed in with bits of English. That makes it hard to
understand (even for a local Hong Kong person) without good prior context of
the entire conversation.
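
To make the many-to-one point concrete, here is a minimal Python sketch. It
assumes a tiny hand-picked sample of Jyutping syllables (the six tones of
"si", each paired with one common character), not a real dictionary, and the
helper names `strip_tone` and `toneless_index` are made up for illustration.

```python
from collections import defaultdict

# Hand-picked sample: one common character per tone of the syllable "si".
# This is illustrative only; a real lexicon has many characters per entry.
TONED = {
    "si1": "詩",
    "si2": "史",
    "si3": "試",
    "si4": "時",
    "si5": "市",
    "si6": "事",
}


def strip_tone(syllable):
    """Drop the trailing tone digit, as casual romanised chat usually does."""
    return syllable.rstrip("123456")


def toneless_index(toned):
    """Group characters by their tone-less spelling."""
    index = defaultdict(list)
    for syllable, char in toned.items():
        index[strip_tone(syllable)].append(char)
    return dict(index)


index = toneless_index(TONED)
# All six characters now share the single spelling "si".
print(index)
```

Once the digit is gone, one spelling stands for six different characters even
in this toy sample, so a reader (or a program) has to fall back on context.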

[1]
[https://en.wikipedia.org/wiki/Yale_romanization_of_Cantonese](https://en.wikipedia.org/wiki/Yale_romanization_of_Cantonese)

[2]
[https://en.wikipedia.org/wiki/Jyutping](https://en.wikipedia.org/wiki/Jyutping)

[3]
[https://en.wikipedia.org/wiki/Standard_Romanization_(Cantone...](https://en.wikipedia.org/wiki/Standard_Romanization_\(Cantonese\))

~~~
s_Hogg
Somewhat off topic: any chance you know why Google doesn't have an
explicitly Cantonese model for translation?

~~~
spacehunt
Not a Googler so I can only guess. But it seems like Google did try to treat
Cantonese as a Chinese variant in the past. Eventually they dropped it,
probably because they realised the two are too different.

I know Google is actively working on the Cantonese version of Google
Assistant, though not sure when it'll be officially released.

~~~
toastal
It is a variant of Chinese though. Chinese is a language family, not a
language, and it includes Mandarin, Cantonese, Hakka, et al.

~~~
spacehunt
Whatever it is, Cantonese has different pronunciation, vocabulary and even
grammar from Mandarin, which means it takes a non-trivial amount of work to
adapt a language model designed for one to the other.

Source: I'm a native speaker of one and fully fluent in the other.

------
probablybroken
I remember this being similar to the reason claimed for the use of leet speak
back in the day (i.e., that it prevented searching of messages by 'the man').
Interesting to see something so similar actually being practically applied in
the present.

------
koonsolo
As a side note: if that happened here in Flanders, not even Dutch speakers
would be able to translate it. We have so many local dialects with their own
pronunciation and region-bound words that only local people would
understand what is written.

------
johnzim
Very clever. That said, there are plenty of HKers happy to be govt stooges
who speak Cantonese and can write English.

~~~
ARandomerDude
True, but it no doubt greatly _reduces_ the amount of government involvement
in this area.

~~~
krasin
Only for a little while. It will take a few days to make a browser extension
to fix this.

~~~
realusername
There's nothing that will translate that as far as I know. It's an entirely
different problem from Chinese characters, since Cantonese is a separate
language and the phonetics are not even standardized in these messages, so
there's no 1 to 1 mapping to characters either.

~~~
robjan
If you type the phonetic words in using the Google Keyboard in JyutPing mode,
you will usually get the correct Chinese characters. The thing that would
defeat this, though, is the deliberate introduction of very colloquial
Chinglish puns.

------
s_Hogg
I tried feeding some of the phonetically spelled stuff into Google Translate
and it was completely lost at sea. So no worries about the claim this could
stuff up NLP/search, then.

~~~
throwawy123
No, you can actually type this straight into Google Translate and it will work
fine. After selecting Chinese as the language, you have to set the input
method as "Cantonese." Of course, Google Translate is banned in China. ;)

This isn't a secret code or anything; it's a standard romanization that almost
everyone who learns Cantonese _formally_ will learn. The thing is, formal
education in Cantonese and other non-Mandarin Chinese languages is banned in
schools in China. Mainlanders who speak Cantonese as a primary language often
don't even know how to use it (I talked to a Cantonese-speaking girl from
Zhuhai, and she was like "Can you show me what Jyutping looks like?" Bizarre).

It's pretty smart, and a bit of a slap in the face to the establishment, which
has been forcing Mandarin down people's throats for the past 50 years or so.

~~~
zhte415
Many things are banned in China, especially web services, in order to create
an internal (was possibly exportable incubation, but not since 'progression'
over the past 5 years after the government - hegemony of leaders and follows
on, rather than government workers, themselves the mules).

But translate.google.cn is not banned on the mainland and Google services work
just fine in Hong Kong.

And forcing Mandarin down people's throats is not all bad in terms of
literacy, considering that 70 years ago most of the country was illiterate,
not only in language but also in ideas, such as basic western ideas in
medicine, which led to a huge reduction in infant mortality once the doctors
and the literate went to the countryside.

And across the sea, river and swamp was Hong Kong, which figured this out
very much earlier and flourished.

But.. translate.google.cn is not blocked in China.

~~~
floatingatoll
It’s not about forcing Mandarin being bad, it’s about prohibiting non-Mandarin
being bad.

US equivalent to this would be forcing English (which we do) while jailing
anyone who teaches Creole (which we don’t).

~~~
yorwba
No one is going to get jailed for teaching Cantonese in China. There are
plenty of commercial offerings, e.g. this free introductory course I found:
[https://www.wanmen.org/courses/586d23485f07127674135d64](https://www.wanmen.org/courses/586d23485f07127674135d64)

Mandarin being the primary language of education doesn't mean that Cantonese
is prohibited; it's just not mandatory, so most native Cantonese speakers
aren't going to get a formal education in it unless they specifically seek it
out.

~~~
floatingatoll
Thank you for the correction; it’s too late to edit but I would if I could :(

------
loocsinus
Hong Kong is not the only place where people speak Cantonese. The entire
region of Guangdong (Canton) speaks this dialect. Do they believe all "trolls"
speak Mandarin?

~~~
robjan
The people of Guangdong will usually tell you that South Chinese people are
generally apolitical and tend to care more about money and good food.

------
caf
Aw hafta r'menbadis iffwea genvayed.

~~~
yomly
I'll have to remember this if we get fired?

~~~
tomthecreator
Probably "invaded".

------
charlesdaniels
I was lucky to have the chance to take a natural language processing course in
the spring from a professor who was very knowledgeable and passionate about
the subject.

Sentiment extraction, semantic meaning extraction, categorization... these are
all really hard problems (to do automatically) even on properly spelled and
grammatically correct text. I would imagine they are even harder in Chinese,
which as I understand it has several different writing systems.

The HK protesters are clearly quite clever. If they keep using different
obfuscation schemes for text, I could see it forcing the mainland to use human
beings to read every post. Which I'm sure they have the resources to do, but
it's still more expensive than using a machine.

Some strategies I would expect to be effective:

* Using alternative phonetic encoding (i.e. what is shown in the article, using Latin letters to spell out sounds rather than words)

* Homoglyph attacks

* Using deliberately incorrect or ambiguous grammatical structure

* Using deliberately incorrect spacing and punctuation (for example "m ee t. me;? b!y th e do.,c?s a!.t; m id ni;ght" will completely bewilder all the parsing packages I'm aware of)

* Convert the text to images and post those, possibly adding graphical text which will confuse OCR packages

Mix and match for even more fun!
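
As a rough illustration of two of the strategies above (spacing/punctuation
noise and homoglyphs), here is a small Python sketch. The helper names
`naive_tokens` and `squash` are hypothetical, the "filter" is just a keyword
lookup rather than any real NLP package, and the Cyrillic substitution is one
arbitrary choice of homoglyph.

```python
import re
import unicodedata

# The noisy sentence from the list above.
NOISY = "m ee t. me;? b!y th e do.,c?s a!.t; m id ni;ght"


def naive_tokens(text):
    """Whitespace tokenisation, roughly what a simple keyword filter sees."""
    return text.lower().split()


def squash(text):
    """Keep only ASCII letters, discarding injected spaces and punctuation."""
    return re.sub(r"[^a-z]", "", text.lower())


# The keyword never appears as a token in the noisy text...
print("midnight" in naive_tokens(NOISY))
# ...but it reappears once everything except letters is stripped out.
print("midnight" in squash(NOISY))

# Homoglyph variant: Cyrillic 'о' (U+043E) stands in for Latin 'o'.
HOMOGLYPH = "d\u043ecks"
# Looks the same on screen, but does not compare equal to the ASCII word.
print(HOMOGLYPH == "docks")
# Unicode NFKC normalisation alone does not fold the Cyrillic letter to Latin.
print(unicodedata.normalize("NFKC", HOMOGLYPH) == HOMOGLYPH)
```

The flip side is visible here too: the spacing trick is undone by a one-line
regex (at the cost of losing word boundaries), so it mostly raises the cost of
naive filtering rather than defeating a determined reader.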

There are also lots and lots of steganographic techniques, but those are a lot
less accessible to laypeople.

I'm not familiar with NLP tools and techniques available for Chinese, but most
parsers/taggers for English aren't really written with adversarial inputs in
mind. It would probably be possible to deliberately construct valid (or at
least decipherable to a human) English text that would crash the common tools
available.

As an aside, the articles that keep coming out about the HK protesters' tactics
are starting to seem a lot like Cory Doctorow's "Little Brother"[1], which is
available for free, and definitely worth a read.

1 - [https://craphound.com/littlebrother/Cory_Doctorow_-_Little_B...](https://craphound.com/littlebrother/Cory_Doctorow_-_Little_Brother.pdf)

------
qrbLPHiKpiux
There is state-sponsored warfare occurring on the internet right now. As with
all war, each side desires a specific outcome; they're influencing en masse -
the people they want - to achieve their outcome.

------
BerislavLopac
A mandatory link:
[https://en.wikipedia.org/wiki/Shibboleth](https://en.wikipedia.org/wiki/Shibboleth)

