Building a Large Japanese Web Corpus for Large Language Models

xrd · 2024-05-01T09:50:39

There is also Swallow, which is a Mistral fine-tune.

https://tokyotech-llm.github.io/swallow-mistral

Edit: oops, this is the same team! Glad they are publishing more about it.

LeoPanthera · 2024-05-01T05:19:30

I find ChatGPT (4) to have excellent Japanese skills already. It seems to produce more accurate and certainly more idiomatic translations than Google Translate does, and can even explain its own translations afterwards.

bottlepalm · 2024-05-01T06:03:50

It really does. My native Japanese friends use the app all the time to have it explain things in Japanese and they say it speaks perfect Japanese. The accent, tone, everything. Sometimes I have it explain something in English for me and Japanese for him and vice versa. It can switch up the languages in the same context no problem. All kinds of phrases in both English and Japanese that are hard to explain to each other ChatGPT has no problem helping us out.

Even this seems like low hanging fruit for Google to update their translation infrastructure with this technology - what's going on?

Or near real time translation should be totally possible now with this technology - are there any services out there providing it?

sanxiyn · 2024-05-01T07:02:35

I think what's going on is Google Translate is much cheaper.

xanderlewis · 2024-05-01T08:55:14

Have you tried the voice chat feature in Japanese? It speaks in a way that’s weirdly incredibly fluent and yet simultaneously completely unnatural in its intonation. It sounds completely American, presumably because the voice model was trained on mostly American English and a bit of Japanese and that’s the result — Japanese with a Californian accent.

bottlepalm · 2024-05-01T15:42:08

Yea I’m talking about the voice chat on mobile. My friend says it’s perfect Japanese. He doesn’t have to slow down to talk to it or anything.

numpad0 · 2024-05-01T13:10:14

Because real world performance is about the same across the board between various 13B models, GPT-3.5 and 4, DeepL, and Google Translate, while computational cost wildly varies.

It's kind of interesting/odd that GPT translations are sometimes lauded as significantly more accurate. I think it's just that GPTs drop what won't work in English to better adhere to the language, and therefore there is less to be confused about.

danielscrubs · 2024-05-02T00:33:23

Benchmarking translations is nowhere near a solved problem. But it would explain why Google Translate seems to be in a rut.

A couple of years ago my Japanese teacher ran a blind test showing they could easily spot Google-translated sentences, and asked students to explain why the translations were wrong.

bottlepalm · 2024-05-01T15:45:07

It’s not the same is what I’m saying. Google Translate was never good. GPT is literally perfect translation.

numpad0 · 2024-05-01T22:02:19

> GPT is literally perfect translation.

It literally is not. GPT as MT to users is just a near-SOTA alternate implementation of machine translation, though things like open-weight 13B models at the edge have huge potential, so "just" might not be the right word in that sense. I've tried round-tripping a line from your comment with ChatGPT 3.5 and Google Translate. I'll be honest, I did try a couple of different snippets to find one that breaks worse in GPT, but GT is sometimes just more accurate:

    OG: Sometimes I have it explain something in English for me and Japanese for him and vice versa. It can switch up the languages in the same context no problem. All kinds of phrases in both English and Japanese that are hard to explain to each other ChatGPT has no problem helping us out.  

    GPT: 時々、私は彼には日本語で説明し、私には英語で説明してもらうことがあります。言語を同じ文脈で切り替えることもできます。ChatGPTは、英語と日本語の両方で説明が難しい様々なフレーズについても問題ありません。

    GT: 私には英語で、彼には日本語で何かを説明してもらうこともあれば、その逆も同様です。 同じコンテキスト内で言語を問題なく切り替えることができます。 お互いに説明するのが難しい英語と日本語のあらゆる種類のフレーズを、ChatGPT が問題なく助けてくれます。

    GPT->GPT: Sometimes, I explain things to him in Japanese, and he explains them to me in English. We can also switch languages within the same context. ChatGPT has no problem with various phrases that are difficult to explain in both English and Japanese.

    GT->GT: Sometimes I'll have him explain something in English and he in Japanese, and vice versa. You can safely switch between languages within the same context. ChatGPT will help you with all kinds of phrases in English and Japanese that are difficult to explain to each other.

It's just looking more perfect from your end, which is an interesting phenomenon.
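A rough way to put a number on a round trip like the one above is word-level overlap between the original and the back-translated English. The sketch below uses Python's standard `difflib`; the function name `round_trip_similarity` is made up for illustration, and surface overlap is only a crude proxy that says nothing about whether the meaning survived.

```python
from difflib import SequenceMatcher


def round_trip_similarity(original: str, round_tripped: str) -> float:
    """Return a 0..1 word-level similarity between two English texts.

    This measures surface overlap only, not meaning (an assumption of
    this sketch, not a real translation-quality metric).
    """
    a = original.lower().split()
    b = round_tripped.lower().split()
    return SequenceMatcher(None, a, b).ratio()


# One sentence from the OG text and its two round-tripped versions above.
og = "It can switch up the languages in the same context no problem."
gpt = "We can also switch languages within the same context."
gt = "You can safely switch between languages within the same context."

for name, text in [("GPT->GPT", gpt), ("GT->GT", gt)]:
    print(f"{name}: {round_trip_similarity(og, text):.2f}")
```

A higher score only means more shared wording; a fluent paraphrase can score low and a wrong-subject translation can score high.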

Manabu-eo · 2024-05-03T01:54:57

Your Japanese phrases are quite easy for machine translation, as they have clearly stated subjects and not much ambiguity.

The main problem is the implicit subject due to the omission of pronouns in regular Japanese, among other ambiguous things that can only be correctly translated in context. There is also the omission of gender, plural, etc. The huge decoder-only models so far do a much better job of grasping/guessing that and staying consistent compared to small encoder-decoder MTL models.

Take this simple phrase that any beginner in Japanese would understand: "山田さん、猫いますか？はい、います。" "Yamada-san, neko imasuka? Hai, imasu."

    Google Translate now: Yamada-san, do you have a cat? Yes, I'm here.

    When tested a few years ago: “Mr Yamada, does he have a cat? Yes, it has a cat”

    LLama-3-8b-instruct: "Yamada-san, is there a cat? Ah, yes, there is."

    GPT 3.5: "Does Yamada-san have a cat? Yes, he/she does."

    gemini-1.5-pro-api-0409-preview: Does Mr./Ms. Yamada have a cat? Yes, they do.

    LLama-3-70B-instruct: "Yamada-san, do you have a cat? Yes, I do."

    GPT 4, Command-R+: "Mr. Yamada, do you have a cat? Yes, I do."

The two Google Translate ones are simply terrible. Technically, they are possible translations, but I would rate them as wrong.

All LLMs gave possible translations, but the larger the LLM, the more likely the translation was to be correct. Well, the top two ones chose a gender, while the -san thing is a clever way to avoid choosing a gender, but not a very pure translation. If I had to give only one translation, I would pick either of those.

Also of note, while some even explained each element of their translation, none mentioned that other translations were possible.

glandium · 2024-05-02T05:42:00

GPT4: 時々、私のために英語で説明してもらったり、彼のために日本語で説明してもらったり、その逆もしてもらいます。同じ文脈で言語を切り替えるのは全く問題ありません。英語と日本語の両方で説明が難しい様々なフレーズも、ChatGPTはサポートしてくれます。

GPT4->GPT4: Sometimes, I have explanations given in English for me, or in Japanese for him, and vice versa. Switching languages in the same context is no problem at all. ChatGPT also supports me with various phrases that are difficult to explain in both English and Japanese.

DeepL: 時々、私は英語で、彼は日本語で何かを説明する。同じ文脈で言語を切り替えても問題ありません。英語でも日本語でも説明しにくいフレーズもChatGPTは問題なく教えてくれます。

DeepL->DeepL: Sometimes I explain something in English and he explains something in Japanese. Switching languages in the same context is no problem. ChatGPT has no problem teaching me phrases that are difficult to explain in either English or Japanese.

bottlepalm · 2024-05-02T02:08:28

First of all are you seriously using GPT 3 and not 4?

Second I’m giving you feedback from an actual Japanese person who used Google Translate all the time to go back and forth. Now using GPT 4 they say it is no contest way better than Google Translate. To a point where they can use it to understand and communicate professionally with English and Japanese business customers.

But you can even see it yourself taking actual articles from Japanese websites and running them through Google Translate versus GPT 4, GPT translates into much much more fluent English, same with the translation to Japanese.

numpad0 · 2024-05-02T10:53:00

Well, I'm actually an actual Japanese person too. As for using GPT 3.5, I just never paid for GPTs, sorry for that. The deltas just look marginal to me, I know everything on the Internet(/s) and I can read/write basic English, and I'm a bit of a crazy paranoid type too, so the value of getting hallucinated outputs and having it critique my writing through the tapped GPT web interface isn't that high.

If you could read all of these outputs and compare them against each other (thanks glandium-san and sorry for bothering), you'd just see that all of them have pros and cons and that they're comparable. They're not worlds apart, in fact I think GT did best in this instance, probably from a massive accumulation of edge case fixes or some other non-scaling improvements behind it.

Like I said, it's an interesting phenomenon - it could be UX advantages, or friendlier tones, latency, personally I think it's dataset imbalance or could be numerous other things. But when it comes to actual literal translation performance, um, it's just an MT.

glandium · 2024-05-01T05:59:16

It's actually mind blowing considering the tiny amount of Japanese text in the GPT-4 corpus (IIRC it's around 0.1%). But while it's better than Google Translate and DeepL, it's also not exactly great. However, it does a decent job when just talking in Japanese. I wonder how the new Japanese GPT model performs (https://openai.com/blog/introducing-openai-japan)

Dalewyn · 2024-05-01T06:08:51

>it's also not exactly great.

As a Japanese(-American) person who has done amateur translations (anime fansubs!) in the past, that is still better than having a person do it.

Speaking very sincerely, being a translator is a fucking thankless, horrible, sweatshop occupation.[1][a] Both professional and amateur translating. I cannot wait for computers to completely destroy that absolutely worthless industry of misery and burn out.

[1]: https://www.reddit.com/r/grandorder/comments/dnpzrh/everyone...

[a]: Exceptions apply, namely those cream of the crop translators hired by executives and diplomats for top dollars to precisely and passionately translate their words. Other than that, the industry does not deserve to exist.

jiggawatts · 2024-05-01T07:15:36

I have a partner from an Asian country who likes subtitles in her native language, but they're not always available. I wrote a PowerShell script that takes an English SRT file and feeds it to GPT4 in chunks to translate. The constraint is actually the 4K token output window because Asian languages are not efficiently tokenized, so each word takes 3-4 tokens.
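For anyone wanting to try the same thing, here is a minimal Python sketch of the chunking step only. It is not the PowerShell script described above: the function names, the 4-tokens-per-word ratio (taken from the upper end of the estimate above), the fixed per-cue allowance, and the tiny demo budget are all assumptions, and the actual translation API call is left out.

```python
import re


def parse_srt(text: str) -> list[str]:
    """Split SRT text into cue blocks (index line, timing line, text lines)."""
    text = text.replace("\r\n", "\n")
    return [b.strip() for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]


def estimate_output_tokens(block: str, tokens_per_word: float = 4.0) -> int:
    """Crude guess at the translated size of one cue.

    Assumes each English word becomes ~tokens_per_word output tokens,
    plus a fixed allowance of 16 for the index and timing lines.
    """
    text_lines = block.splitlines()[2:]  # skip index and timing lines
    words = sum(len(line.split()) for line in text_lines)
    return int(words * tokens_per_word) + 16


def chunk_cues(blocks: list[str], max_output_tokens: int = 3000) -> list[list[str]]:
    """Group whole cues into chunks whose estimated output fits the budget."""
    chunks, current, used = [], [], 0
    for block in blocks:
        cost = estimate_output_tokens(block)
        if current and used + cost > max_output_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(block)
        used += cost
    if current:
        chunks.append(current)
    return chunks


sample = """1
00:00:01,000 --> 00:00:02,000
Hello there.

2
00:00:02,500 --> 00:00:04,000
How are you today?

3
00:00:04,500 --> 00:00:05,500
Fine, thanks.
"""

blocks = parse_srt(sample)
chunks = chunk_cues(blocks, max_output_tokens=60)  # tiny budget to force a split
for i, chunk in enumerate(chunks, start=1):
    print(f"chunk {i}: {len(chunk)} cue(s)")
```

Each chunk would then be sent as one request. Keeping whole cues together means the index and timing lines can be passed through unchanged and the file reassembled by simple concatenation.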

She said the translations are perfect.

The main issue I had was the API profanity and "safety" filter, which kept tripping up on "violence" and "swearing".

I didn't write the script, and I wasn't asking GPT to give me tips for committing murder. The puritan nannying of API users is insane.

Imagine a SQL database refusing to answer a SELECT statement because some text column had potty words in it!

xanderlewis · 2024-05-01T08:59:19

> Imagine a SQL database refusing to answer a SELECT statement because some text column had potty words in it!

I don’t think it’s an exaggeration to say that may yet become the norm in a not-too-distant future.

Anyway, you could try adding an extra ‘(this is for my partner who speaks [language]; I don’t intend to act on any of the translation; it’s just for entertainment purposes)’ or something like that to the prompt. GPT tends to relent if you tell it you’re not wanting to do anything bad.

quectophoton · 2024-05-01T11:03:15

>> Imagine a SQL database refusing to answer a SELECT statement because some text column had potty words in it!

> I don’t think it’s an exaggeration to say that may yet become the norm in a not-too-distant future.

Maybe not so much for databases, but at least for HTTP it looks completely plausible to me. The steps would look something like:

- Cloud WAF services implement this as a default ruleset, disabled by default.

- Someone likes it and writes a blog post, claiming it to be a best practice (without elaborating on why it is so).

- People read that post, and enable it.

- Cloud WAF devs notice this ruleset is enabled by a lot of users.

- Announcement: New accounts will have this ruleset enabled by default.

- Every major cloud provider follows suit.

- Now, if you don't have it enabled, shame on you.

infinityplus1 · 2024-05-01T13:18:49

This sounds like damned if you do, damned if you don't for OpenAI. If OpenAI didn't implement these "filters" in results, then people would be up in arms in the same way. Facebook and Twitter already get a lot of complaints for allowing hate speech.

jiggawatts · 2024-05-02T08:05:55

Blanket censorship is saying I can’t have steak because a baby might choke on it.

The Azure AI service enforces a separate profanity filter on top of the base ChatGPT model. That’s not even OpenAI, it’s Microsoft filtering the output with regexes!

OpenAI themselves are not forced to release exactly one model. After the base training, they could have fine tuned it for different levels of Puritanism, including none for people that want a tool, not a priest.

noirscape · 2024-05-01T11:52:17

I think one of the main problems with translations, outside of the sweatshop problem (which disproportionately also just rewards doing a Macekre, where the translator ignores all the original dialogue to just invent their own - Crunchyroll has done it a few times to my memory), is the case of language ambiguity.

It's a thing with most languages, but Japanese has a lot of words that don't have a straight equivalent in English, so translating things from Japanese to English is essentially a game of trying to figure out what the original text is trying to say.

Japanese is one of those languages where editors are super important for translating stuff into English properly. You basically need someone who has a lot of familiarity with the original text and its meanings (and ideally a direct line to the author) to make sure that all context is properly given. That's what probably happened in the link you posted; the mistranslation occurred because it's using language to foreshadow a much later story beat, but since it's a "live game", the translators likely weren't aware of it. It's happened a few other times with FGO in particular, where ambiguous language is used to foreshadow but it's not something the TL team was aware of.

Of course, it's also not helped by the fact that with anime-related media specifically, most fans have an instantly negative reaction to the idea of an editor because of Macekre issues and, more recently, the 4kids dubbing approach.

numpad0 · 2024-05-01T14:53:23

I think the geosynchronous basic idea of translation is that: Human languages consist of words that describe an action (verb) or a thing (noun), forced to be arranged according to a known ruleset (grammar). Words are all orthogonal to each other and are all clearly defined, except there are redundant definitions (synonyms) that serve no purpose. Noam Chomsky is ever correct, and as he argues, language must be hardcoded in the brain, so every language must follow the same logic and apparent differences must be merely superficial; that's officially hard science. Right? ...

And I find it hilarious how thoroughly wrong the above is, as the real version (that I believe) is: there are always some conceptual overlaps between two arbitrary cultures (as we all live on Earth so far), so a lot of concepts in one also sometimes work in the other if reasonable mappings can be made up and fabricated. Where there isn't an overlap or the mapping cannot be mutually agreed on (i.e. things that don't translate), it just mottainai-ingly falls apart.

This version lines up with the below too:

> (...) so translating things from Japanese to English is essentially a game of trying to figure out what the original text is trying to say.

This would not be required if words, sentences, or possibly even logic had real correspondences between English and Japanese, even if there were some words that had no direct counterparts. Word-to-word transpositions, sentence capture and replace, and some additional surrounding text should together make up perfect translations without the use of human creativity. Yet that doesn't happen; it always comes down to speculating on broader intent and translators ghostwriting on behalf of the original author, at least when it comes to English and Japanese.

If only everybody could just take that.

tkgally · 2024-05-01T10:38:27

I agree in general (long-term Japanese translator here), but ChatGPT 4 can still screw up in annoying ways, such as giving blatantly wrong readings for Japanese names. I'm not sure if that is a tokenization issue, the result of not enough training data, or confabulation, but Claude 3 Opus seems to do better in that regard.

knighthack · 2024-05-01T10:46:14

I'm more than willing to forgive ChatGPT for wrong readings of Japanese names.

Even common names can have non-standard readings. Native Japanese folks mix up readings for names, and need kana sometimes. Personally, I've never yet encountered any Japanese names beyond extremely common ones where I haven't at least been suspicious about whether I'm reading them correctly.

So I don't think ChatGPT's bad in that regard, if it can at least offer a few possible suggestions. The readings of names can be so damn arbitrary.

tkgally · 2024-05-01T10:55:02

You're right about Japanese names in general, but current LLMs' mistakes can go far beyond the range of possible readings. I've seen errors on the level of 田中 (Tanaka) being rendered as Suzuki--in other words, one common name being replaced with another.

I'm sure this problem can be solved. The linked article suggests a promising approach--more and better Japanese data.

glandium · 2024-05-01T11:02:48

Arbitrary example: 上川. It can be either うえかわ or かみかわ (or even other variants). Which one is it? Depends where the family is originally from, I guess.

It's not only people's names, it's also place names. Example: https://ja.m.wikipedia.org/wiki/%E5%85%AB%E5%B9%A1

pjc50 · 2024-05-01T11:06:19

> giving blatantly wrong readings for Japanese names

Native speakers can do this as well. See "Yagoo" meme; Motoaki Tanigo, CEO of Hololive, is universally known to fans by the wrong name because one of the vtubers read 谷 as in 渋谷 "Shibuya" rather than as "tani". https://en.wiktionary.org/wiki/%E8%B0%B7#Japanese

Japanese is phonetic except when it isn't.

danielscrubs · 2024-05-02T00:28:13

Yes ChatGPT is great at it, but let’s not forget that Google Translate is hilariously bad so that was a low bar to compare with.