
Show HN: Cherry2.0 – text classification without ML knowledge needed - Windson
https://github.com/Windsooon/cherry/
======
Thorentis
The pre-built model is a fascinating insight into Chinese politics/society.

> Gamble / Porn / Political / Nomal (model='harmful')

Not sure what Nomal is, but 'Political' is considered harmful? I suppose they
mean politics has the potential to be harmful. Would be interesting to see
what material it was trained on (a cross section of pro and anti-CCP? pro-CCP
would still be political, but then why would that need to be classified under
a harmful model?)

> Lottery ticket / Finance / Estate / Home / Tech / Society / Sport / Game /
> Entertainment (model='news')

Interesting bit here is that Lottery Tickets are included in news and yet
gambling is included in harmful. Is the lottery not considered gambling in
China? Or gambling just has the potential to be harmful, despite also being
news?

~~~
Windson
> but 'Political' is considered harmful?

It's hard to explain why 'Political' is considered harmful in China. Most
Chinese companies try to avoid the political text classification problem by
using a hashtable of 'sensitive words'. One example: you can't search for
'64G RAM' on Xianyu, one of the biggest C2C second-hand online marketplaces
(run by Alibaba), because '64' hints at June 4th, which is a sensitive date
in China.
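The hashtable approach described above can be sketched in a few lines. This is a minimal illustration, not Xianyu's actual implementation; the word list and substring-matching rule are hypothetical stand-ins:

```python
# Minimal sketch of the "hashtable of sensitive words" filtering approach.
# The word list here is hypothetical; real deployments maintain much
# larger, frequently updated lists.

SENSITIVE_WORDS = {"64", "june 4th"}  # hypothetical examples

def contains_sensitive_word(text: str) -> bool:
    """Return True if any sensitive word appears as a substring."""
    lowered = text.lower()
    return any(word in lowered for word in SENSITIVE_WORDS)

print(contains_sensitive_word("64G RAM laptop for sale"))  # True
print(contains_sensitive_word("32G RAM laptop for sale"))  # False
```

Note how blunt substring matching is exactly why an innocent query like '64G RAM' gets blocked: the filter has no notion of context, only membership in the list.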

> Would be interesting to see what material it was trained on (a cross section
> of pro and anti-CCP? pro-CCP would still be political, but then why would
> that need to be classified under a harmful model?)

Good question. It's quite subjective and no one knows the answer. Companies
have to do a lot of self-censorship. However, unlike porn or gambling, there
is no clear standard here.

> Is the lottery not considered gambling in China?

The lottery is gambling, but in China we call it 福利彩票 ('welfare lottery'),
which indicates the money will be used for charity. It's treated quite
differently from betting on football or basketball.

~~~
have_faith
> June 4th which is a special day in China

Yes, that's one way to put it.

------
ngngngng
Hey Windson, this is super cool, can't wait to try this out. Wish I had it 6
months ago at my last job when I was starting to build my own text
classification system. I submitted a PR fixing the typos in the English
translation; no promises that I got them all, though.

~~~
Windson
Thank you @ngngngng, I will have a look at it soon :D

------
Windson
I have another pre-trained model for spam email classification in English. I
didn't add it to cherry because I don't know if anyone needs it. It would be
great if someone could tell me what kind of pre-trained models they need so I
can add them later.

------
mlthoughts2018
Why do people think any kind of “x for people with no knowledge of x” tool is
a worthwhile idea? It reminds me of various cloud vendor ML offerings too.
With ML, you _always_ need ML domain knowledge to (at minimum) understand the
performance characteristics of the black box you’re using, and (more often)
also how to re-train/fine-tune/incrementally update as your use case’s data
distribution changes over time or your required performance characteristics
change over time.

Take AWS Rekognition, where you're billed on raw usage. That pricing makes no
sense when what actually matters is the precision, recall, and false
detection rate on _your_ data distribution (not whatever Amazon's team trains
it on). How many true positives / false positives / etc. will you get on your
data?
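The evaluation the comment is asking for takes only a few lines once you hand-label a sample of your own data. A minimal sketch, where the labels and the black-box predictions are made-up illustrative values:

```python
# Sketch: measuring a black-box classifier on YOUR data distribution.
# `labels` is a hand-labeled sample of your own data; `predictions` are
# hypothetical outputs from some vendor API on the same items.

labels      = [1, 1, 1, 0, 0, 0, 0, 1]  # ground truth (hypothetical)
predictions = [1, 0, 1, 0, 1, 0, 0, 1]  # black-box outputs (hypothetical)

tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)

precision = tp / (tp + fp)  # of the items flagged, how many were right
recall    = tp / (tp + fn)  # of the true positives, how many were caught
fpr       = fp / (fp + tn)  # false detection rate on clean items

print(f"precision={precision:.2f} recall={recall:.2f} fpr={fpr:.2f}")
# prints "precision=0.75 recall=0.75 fpr=0.25"
```

These three numbers are exactly what a raw-usage bill tells you nothing about, and they have to be re-measured whenever your data distribution drifts.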

ML is uniquely poorly suited to be treated as just some API or just some black
box library. I really wish people would stop popularizing footgun approaches
to this!

