
Automatic conversion between simplified and traditional Chinese - anewhnaccount2
https://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_traditional_Chinese
======
ksec
I asked about this in Wikimedia IRC before and never got a response: for
anyone whose system locale defaults to English, every Chinese page you visit
defaults to Simplified Chinese, even if you are in Taiwan or Hong Kong, where
everyone uses Traditional Chinese.

~~~
sarabande
I don't see a question in your statement. Is it, "Why does Wikipedia default
to simplified Chinese?"

Probably because > 90% of native Chinese speakers use the simplified script.
Now, I don't know what Wikipedia statistics are -- maybe usage skews heavier
in areas where traditional characters are used, so that would be a reason to
not do so, or to ask first. But without that information, defaulting to the
more-commonly used script seems like a reasonable measure.

~~~
rangibaby
zh-cn (at least) is blocked in the PRC. I assume that even if it is unblocked
at some point, it will be subject to heavy censorship.

~~~
seanmcdirmid
It is periodically blocked and unblocked. The GFW is extremely inconsistent on
Chinese wiki.

------
contingencies
Yeah, I always used the PHP implementation from MediaWiki to achieve this,
for over 15 years I think. However, last week we hit on
[https://github.com/BYVoid/OpenCC](https://github.com/BYVoid/OpenCC) ... not
sure which is superior yet.

~~~
anewhnaccount2
What's interesting to me about the MediaWiki implementation is that it has
some features of a fully fledged rule-based machine translation system. It
implements several layers of translation, listed here in order of increasing
precision, but also increasing effort and decreasing recall:

1\. character to character conversion - originally based on tables from
unicode

2\. word/phrase type structure level matching - based on their own lists

3\. domain specific word/phrase lists, (I'm not sure if this is based on
category, opt in by including a template or both) see
[https://zh.wikipedia.org/wiki/Category:%E5%85%AC%E5%85%B1%E8...](https://zh.wikipedia.org/wiki/Category:%E5%85%AC%E5%85%B1%E8%BD%89%E6%8F%9B%E7%B5%84%E6%A8%A1%E6%9D%BF)
(I can't read Chinese - figured it out with Google Translate)

4\. rules which apply for a whole article, see
[https://zh.wikipedia.org/wiki/Help:%E4%B8%AD%E6%96%87%E7%BB%...](https://zh.wikipedia.org/wiki/Help:%E4%B8%AD%E6%96%87%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91%E7%9A%84%E7%B9%81%E7%AE%80%E3%80%81%E5%9C%B0%E5%8C%BA%E8%AF%8D%E5%A4%84%E7%90%86#%E6%8E%A7%E5%88%B6%E8%87%AA%E5%8A%A8%E8%BD%AC%E6%8D%A2%E7%9A%84%E4%BB%A3%E7%A2%BC)

5\. rules which apply for a single word, see previous link

The whole thing is explained here:
[https://zh.wikipedia.org/wiki/Help:%E4%B8%AD%E6%96%87%E7%BB%...](https://zh.wikipedia.org/wiki/Help:%E4%B8%AD%E6%96%87%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91%E7%9A%84%E7%B9%81%E7%AE%80%E3%80%81%E5%9C%B0%E5%8C%BA%E8%AF%8D%E5%A4%84%E7%90%86#%E8%BD%AC%E6%8D%A2%E6%8A%80%E6%9C%AF)

The last 3 points address level 4 of Chinese character conversion mentioned in
this article:
[http://www.cjk.org/cjk/c2c/c2cbasis.htm](http://www.cjk.org/cjk/c2c/c2cbasis.htm)

I think OpenCC only implements levels 1 and 2. I've looked at the source code
and seen that OpenCC gets "words" by using greedy longest-prefix matching
rather than fancier "Chinese word segmentation"; I would guess MediaWiki does
the same.
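
To make the greedy longest-prefix idea concrete, here is a minimal Python
sketch. It is not OpenCC's or MediaWiki's actual code: the function name
`convert` and the two toy tables are made up for illustration, and real
conversion tables are far larger. At each position it tries the longest
phrase-table entry first and falls back to the character table.

```python
# Toy tables, hypothetical and tiny; real tables hold thousands of entries.
PHRASES = {"头发": "頭髮"}          # word/phrase layer (simplified -> traditional)
CHARS = {"发": "發", "头": "頭"}    # character layer

def convert(text, phrase_table=PHRASES, char_table=CHARS):
    """Greedy longest-prefix conversion with a per-character fallback."""
    max_len = max((len(k) for k in phrase_table), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in phrase_table:
                out.append(phrase_table[chunk])
                i += n
                break
        else:
            # No phrase matched: convert a single character, or keep it.
            ch = text[i]
            out.append(char_table.get(ch, ch))
            i += 1
    return "".join(out)

print(convert("我的头发"))  # phrase layer picks 髮 for 发
print(convert("发展"))      # character layer picks 發 for 发
```

The phrase table is what lets the ambiguous character 发 come out as 髮 in
one word and 發 in another; with the character table alone it would always
become 發.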

------
wodenokoto
They say you need to support this conversion in the markup, but they give no
examples.

~~~
anewhnaccount2
More in-depth info available at
[https://zh.m.wikipedia.org/wiki/Help:中文维基百科的繁简、地区词处理#转换技术](https://zh.m.wikipedia.org/wiki/Help:中文维基百科的繁简、地区词处理#转换技术)

~~~
lifthrasiir
Don't know why this was downvoted, probably because not in English?

Anyway IIRC (and as the help page confirms) there are multiple conversion
layers: letter-by-letter conversion tables (provided by MediaWiki itself),
customizable word-by-word conversion tables [1] and special markup `-{...}-`
for explicit variant selection [2].

[1]
[https://zh.m.wikipedia.org/wiki/MediaWiki:Conversiontable/zh...](https://zh.m.wikipedia.org/wiki/MediaWiki:Conversiontable/zh-cn)
etc. You can also put them into the article itself, allowing domain-specific
or article-specific conversion tables.

[2] Mentioned under the heading "控制自动转换的代碼" ("Code for controlling
automatic conversion"); it can also be used to turn off an incorrect automatic
conversion.
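
For anyone who wants an example of the `-{...}-` markup, here is a rough
Python sketch of how the two most common forms could be resolved: a variant
list such as `-{zh-cn:博客; zh-tw:部落格}-`, and plain `-{text}-`, which
leaves the text unconverted. This is a simplification, not MediaWiki's
parser: `resolve_markup` is a hypothetical name, flags and nesting are
ignored, and falling back to the first listed variant is my assumption
(MediaWiki has its own fallback rules).

```python
import re

# Non-greedy match of one -{ ... }- span; nesting is not handled.
MARKUP = re.compile(r"-\{(.*?)\}-", re.S)

def resolve_markup(text, variant):
    """Replace each -{...}- span with the text chosen for `variant`."""
    def repl(match):
        body = match.group(1)
        rules = {}
        for part in body.split(";"):
            key, sep, value = part.partition(":")
            # Only treat "zh...:" prefixes as variant codes.
            if sep and key.strip().startswith("zh"):
                rules[key.strip()] = value.strip()
        if not rules:
            # Plain -{text}-: emit the text verbatim.
            return body
        # Assumed fallback: first listed variant if the requested one is absent.
        return rules.get(variant, next(iter(rules.values())))
    return MARKUP.sub(repl, text)

print(resolve_markup("我的-{zh-cn:博客; zh-tw:部落格}-", "zh-tw"))
```

In a real converter the resolved spans would also have to be shielded from
the automatic table-based pass that runs over the rest of the article; this
sketch only handles the markup itself.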

