
HypheNN-De: German Hyphenation with Neural Networks - msiemens
https://blog.m-siemens.de/hyphenn-de-german-hyphenation-with-neural-networks/
======
microcolonel
Did you consider putting the "center" of the detector somewhere other than in
the middle of the vector? What would happen if you had 6 before and 2 after,
or 5 before and 3 after?
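
To make the question concrete, here is a minimal Python sketch of an off-center window. It assumes the detector looks at 8 characters around a candidate split point and one-hot encodes them; `window`, `one_hot`, `ALPHABET`, and the `_` padding character are illustrative names, not taken from the article's code.

```python
# Assumed alphabet: lowercase German letters only (an illustration, not the article's).
ALPHABET = "abcdefghijklmnopqrstuvwxyzäöüß"


def window(word, split_pos, before=4, after=4, pad="_"):
    """Return the characters around a candidate split point.

    split_pos is the index of the gap: the split would fall between
    word[split_pos - 1] and word[split_pos].
    """
    left = word[max(0, split_pos - before):split_pos].rjust(before, pad)
    right = word[split_pos:split_pos + after].ljust(after, pad)
    return left + right


def one_hot(chars, alphabet=ALPHABET):
    """Flatten a window into one 0/1 vector, one block per character."""
    vec = []
    for ch in chars:
        block = [0] * len(alphabet)
        if ch in alphabet:  # padding and unknown characters stay all-zero
            block[alphabet.index(ch)] = 1
        vec.extend(block)
    return vec
```

With these definitions, `window("silbe", 3)` gives `"_silbe__"`, while `window("silbe", 3, before=6, after=2)` gives `"___silbe"`. The input size stays the same, only the position of the split inside the vector moves.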

Another thought I had: for performance reasons, it might be nice to have
something more compact than a one-hot vector for each letter. Have you looked
at determining sets of characters which have a similar impact on hyphenation,
and encoding them together?
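
A rough sketch of what I mean, in Python. The groups below are hand-picked purely for illustration (an assumption, not something derived from data); a real grouping would have to come from analysing which letters behave alike around split points. `GROUPS`, `CLASS_OF`, and `encode_classes` are hypothetical names.

```python
# Hypothetical, hand-picked letter groups; a real grouping would be derived from data.
GROUPS = ["aeiouäöü", "bdgptk", "lmnr", "fvwszß", "chjqxy"]

# Map every letter to the index of the group it belongs to.
CLASS_OF = {ch: i for i, group in enumerate(GROUPS) for ch in group}


def encode_classes(chars):
    """One-hot encode each character by its group instead of by its identity."""
    vec = []
    for ch in chars:
        block = [0] * len(GROUPS)
        idx = CLASS_OF.get(ch)
        if idx is not None:  # padding and unknown characters stay all-zero
            block[idx] = 1
        vec.extend(block)
    return vec
```

For an 8-character window this yields 8 × 5 = 40 inputs instead of 8 × 30 = 240 with a per-letter one-hot encoding over a 30-letter alphabet. Whether accuracy survives the compression is exactly the open question.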

PS: do you have the extracted list of wiktionary hyphenations sitting in a
text file somewhere that you could put up? I'm fixin' to quickly compare the
accuracy to TeX's German hyphenation (once the 30+GiB TeXLive repository
finishes downloading).
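
The comparison I have in mind would look roughly like the Python sketch below. It assumes the reference list gives each word with `-` at the allowed split points, which may not match the actual file format, and it takes the hyphenator as a plain callable so that any engine can be plugged in. `split_points` and `word_accuracy` are names I made up for this.

```python
def split_points(hyphenated, sep="-"):
    """Return the set of character offsets at which the word is split."""
    points, offset = set(), 0
    for part in hyphenated.split(sep)[:-1]:
        offset += len(part)
        points.add(offset)
    return points


def word_accuracy(hyphenate, gold_words, sep="-"):
    """Fraction of words whose predicted split points match the reference exactly."""
    gold_words = list(gold_words)
    if not gold_words:
        return 0.0
    correct = 0
    for gold in gold_words:
        plain = gold.replace(sep, "")  # assumes words contain no literal separators
        if split_points(hyphenate(plain), sep) == split_points(gold, sep):
            correct += 1
    return correct / len(gold_words)
```

Exact match per word is a strict metric; counting correct and incorrect split points separately would give a finer picture.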

PPS: You could improve the display of code blocks on your site on desktop by
adding

    display: block;
    max-width: 710px;
    width: 80%;
    margin-left: auto;
    margin-right: auto;

to your `.post-content pre code` rule. Or maybe slightly indent it by reducing
the max width a small amount below that of the body text.

~~~
msiemens
_Did you consider putting the "center" of the detector somewhere other than in
the middle of the vector? What would happen if you had 6 before and 2 after,
or 5 before and 3 after?_

_Another thought I had: for performance reasons, it might be nice to have
something more compact than a one-hot vector for each letter. Have you looked
at determining sets of characters which have a similar impact on hyphenation,
and encoding them together?_

These are interesting suggestions! It would be worth doing actual research on
how to optimize the hyphenation further, and playing with the hyperparameters
and network architecture to see what impact they have on the hyphenation
accuracy. Alas, I'm a student, so time is rather scarce.

_PS: do you have the extracted list of wiktionary hyphenations sitting in a
text file somewhere that you could put up? I'm fixin' to quickly compare the
accuracy to TeX's German hyphenation (once the 30+GiB TeXLive repository
finishes downloading)._

Sure! The GitHub repository actually contains a Rust program to process a
Wiktionary XML dump into a word list for training, but if you want to skip
straight ahead, I've uploaded the dataset I used to
[https://gist.githubusercontent.com/msiemens/2aac63cf8d1b88c4...](https://gist.githubusercontent.com/msiemens/2aac63cf8d1b88c48d33c9c82f8f8e15/raw/d078a9dcad5e63afd8a5976184d973a02c5f591f/wordlist.txt)
[6 MB, licensed under CC BY-SA 3.0].

_PPS: You could improve the display of code blocks in your site on desktop by
adding [...]_

Thanks for the suggestion, I'll look into it!

------
Ciantic
If you love spelling and hyphenation, you should star this issue in
Chrome(ium):

[https://code.google.com/p/chromium/issues/detail?id=20667](https://code.google.com/p/chromium/issues/detail?id=20667)

There are lots of spelling and hyphenation libraries, e.g. for the Finnish
language, but it is not possible to get them working in Chrome because there
is no extension capability for it. It's a real shame, since these less common
languages will probably never get support from the Chrome team itself.

------
lindig
TeX implements a very good hyphenation engine that is driven by patterns
[1]. I would expect it to be very difficult to improve on this and, as far as
I can see, the article doesn't include a comparison.

[1]: [https://tex.stackexchange.com/questions/262588/how-are-hyphe...](https://tex.stackexchange.com/questions/262588/how-are-hyphenation-patterns-written)
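
For readers who haven't seen the pattern approach, here is a minimal Python sketch of the idea: every pattern assigns numeric weights to the gaps between its letters, the highest weight wins at each gap, and odd values permit a break. The two patterns used in the example afterwards are made up for illustration and are not real TeX patterns; `parse_pattern` and `hyphenate` are hypothetical names, and the `left_min`/`right_min` defaults are assumptions.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'l1b' into its letters and inter-letter weights."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)  # weight of the gap before letters[pos]
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Insert '-' wherever the winning (maximum) weight at a gap is odd."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[k] is the gap before padded[k]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights:
                for i, w in enumerate(weights):
                    scores[start + i] = max(scores[start + i], w)
    parts, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:  # gap before word[i]
            parts.append(word[last:i])
            last = i
    parts.append(word[last:])
    return "-".join(parts)
```

With the toy pattern `l1b`, `hyphenate("silbe", ["l1b"])` returns `"sil-be"`. Adding the made-up inhibiting pattern `il2b` raises the weight at that gap to an even value, so `hyphenate("silbe", ["l1b", "il2b"])` returns `"silbe"` unchanged.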

~~~
Semaphor
He talks about exactly this kind of pattern matching and mentions LaTeX using
it in the second section. He also says that this approach doesn't work as well
with German compound words, which is the whole premise.

~~~
lindig
Giving one example is not an evaluation that would convince me that NNs are
better. The German LaTeX community is one of the largest and I haven't heard
much about it being unhappy with TeX's hyphenation.

~~~
msiemens
That word would be _Nahrungsmittelunverträglichkeit_ again. I just tested it,
and LaTeX (with `\usepackage[ngerman]{babel}`) makes the same mistake as pyphen
in the article: it hyphenates the word as
_Nah-rung-smit-telun-ver-träg-lichkeit_.

To be fair, in day-to-day use problems like these will be corner cases: to my
knowledge, LaTeX tries to avoid hyphenation, and even when it has to split a
word, it has a good chance of getting it right. Also, for me this project's
focus was more on learning about neural networks than on creating a better
hyphenator.

------
tinyrick2
This blog post inspired me to hunt for some obscure machine learning papers
from the 80s and 90s that I could replicate and improve on. Any idea where to
start?

~~~
logicallee
(Serious reply.) A university library, so you can find old papers.

I assume you know some modern techniques. If not, I would start with a modern
textbook (such as the assigned textbook from a machine learning class) and
then try to apply its exercises to improve on the results of a classic paper.
It would make an interesting blog post.

