

Show HN: My project for this weekend (with source) - fortes
http://www.softhyphen.com/

======
fortes
I work with a lot of online publishing clients, who aren't always particularly
tech savvy (you'd be amazed by some of the workflows). Designers would often
like to justify text, which can cause some terrible whitespace rivers since
browsers don't provide hyphenation. Browsers do, however, support soft-
hyphens, which are hints that tell the browser where it _can_ break a word.

Adding these hyphens manually is quite tedious. So I wrote up this little
utility that takes HTML and adds soft hyphens automatically.

Background:

\- Took a day to write (started it yesterday afternoon)

\- Written in Python, deployed on AppEngine

\- Uses OpenOffice's hyphenation dictionaries

Source code: <http://github.com/fortes/softhyphen>
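
For anyone curious what the approach looks like in practice, here is a minimal sketch in Python. It is not the code from the repository: `TOY_BREAKS`, `soften`, `SoftHyphenator`, and `add_soft_hyphens` are hypothetical names, and the tiny break table stands in for a real hyphenation dictionary. It only illustrates the general idea of adding U+00AD to text nodes while leaving tags, attributes, and the contents of `<script>`, `<style>`, `<pre>`, and `<code>` alone.

```python
import re
from html.parser import HTMLParser

SHY = "\u00ad"  # U+00AD SOFT HYPHEN

# Assumption: a toy break table standing in for a real hyphenation dictionary.
TOY_BREAKS = {
    "hyphenation": ["hy", "phen", "ation"],
    "browser": ["brows", "er"],
}

SKIP_TAGS = ("script", "style", "pre", "code")


def soften(word):
    """Return the word with soft hyphens at the toy break points, if known."""
    parts = TOY_BREAKS.get(word.lower())
    if parts is None:
        return word
    out, pos = [], 0
    for part in parts:
        # Slice the original word so its capitalisation is preserved.
        out.append(word[pos:pos + len(part)])
        pos += len(part)
    return SHY.join(out)


class SoftHyphenator(HTMLParser):
    """Re-emit the HTML it is fed, changing only ordinary text nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a tag listed in SKIP_TAGS

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # raw tag, attributes untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            data = re.sub(r"[A-Za-z]+", lambda m: soften(m.group()), data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def add_soft_hyphens(html):
    parser = SoftHyphenator()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `add_soft_hyphens('<p class="browser">Browser hyphenation</p>')` leaves the `class` attribute as it was and only changes the visible text.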

~~~
jacobolus
This is neat. Some possible drawbacks:

(1) Browser layout engines are pretty messed up, and adding a bunch of shy
hyphens can lead to really weird layout issues. I wish I had some good
examples, but basically it has to do with taking into account the amount of
space a hyphen takes up, and using that to compute how long lines should be,
but then when the hyphens are in the middle of the line, not showing them.
Anyway, as I remember sometimes it can result in text being eaten (i.e. not
shown properly).

In other words, I’d try to do specific testing in a bunch of browsers,
resizing a window through all reasonable sizes it might be, before counting on
shy hyphens to work properly.

(2) This is still going to break search within pages. Pretty stupid that
browsers can’t do proper search, but so it goes.

(3) This is going to be effectively a line-by-line paragraph composer, because
browsers don’t do any kind of real paragraph layout. And it still won’t do any
adjustment of inter-letter space. The combination of these two things means
that using justified text is going to still usually end up looking like crap.
Better than without any hyphenation, but not all that much better.

\----

In short, this is kind of a half-way stopover workaround for the absolutely
stupid lack of real hyphenation and justification algorithms in current
browsers. Absolutely stupid because this is a _solved problem_: the
algorithm used by TeX and described by one of Knuth’s students in the late 70s
works pretty darn well (Adobe uses a modified version for InDesign), and on
modern hardware should be perfectly reasonably fast, as well.

It’s not quite as bad as the lack of real layout in Microsoft Word, given that
laying out text is the _only purpose_ of that application, but still....
pretty bad.
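
To make the "line-by-line versus whole-paragraph" distinction concrete, here is a rough sketch in Python. It is not the TeX algorithm: there is no hyphenation, no stretchable space, and no penalties. It only compares a greedy first-fit breaker with a dynamic-programming breaker that minimises the sum of squared leftover space over the whole paragraph. The function names and the cost function are assumptions chosen for illustration.

```python
def greedy_break(words, width):
    """Line-by-line: put each word on the current line if it still fits."""
    lines, cur = [], []
    for w in words:
        if cur and len(" ".join(cur + [w])) > width:
            lines.append(" ".join(cur))
            cur = [w]
        else:
            cur.append(w)
    if cur:
        lines.append(" ".join(cur))
    return lines


def optimal_break(words, width):
    """Whole-paragraph: minimise the sum of squared slack on all but the last line."""
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i] = minimal cost of setting words[i:]
    nxt = [n] * (n + 1)              # nxt[i] = index where the line starting at i ends
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1  # cancels the leading space counted for the first word
        for j in range(i + 1, n + 1):
            length += 1 + len(words[j - 1])
            # An overlong single word is allowed on a line of its own.
            if length > width and j > i + 1:
                break
            slack = max(width - length, 0)
            cost = 0 if j == n else slack ** 2  # the last line is free
            if cost + best[j] < best[i]:
                best[i] = cost + best[j]
                nxt[i] = j
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines
```

With `words = "aaa bb cc ddddd".split()` and `width = 6`, the greedy version fills the first line completely and leaves `cc` stranded on the second, while the whole-paragraph version accepts a shorter first line so that the lines come out more even.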

~~~
est
> Browser layout engines are pretty messed up

Did html5 address this?

~~~
jacobolus
It's not in HTML5’s scope. That’s the domain of CSS3.

------
tb
It'd be good if after you entered some text, the sample at the bottom showed
the text you entered un-hyphenated and hyphenated for easy comparison, rather
than keeping the Article 7 sample.

~~~
fortes
I had that originally, but was paranoid about XSS attacks
(needlessly?). Will re-enable this week.

~~~
jfarmer
XSS attacks against what? Unless that data can be accessed by multiple people,
there's no opportunity even if I can embed Javascript.

------
kes
A JavaScript project that does this - <http://code.google.com/p/hyphenator/>
- with an example - <http://j.mp/bakUUw>

A relevant Reddit thread - <http://j.mp/d8H3Tf>

------
callmeed
This is pretty cool. My suggestion: create a simple API and a WordPress
plugin. I think it would be handy for a lot of WP-powered sites and that would
make it easier for people to implement.

~~~
fortes
Good idea! I was just about to start it, and decided to do a search in case
someone had something similar. This plugin looks pretty good, actually:
<http://kingdesk.com/projects/wp-hyphenate/>

~~~
mortenjorck
Actually, I think there's still room for a WP plugin based on your project.
I've tried WP-Typography before, and while I love KingDesk's intentions, I
didn't like the complexity of using the package (not to mention that it
requires XHTML strict and my WP theme is all 4.01, a boat I imagine many WP
users are in).

------
voidpointer
Nice. I didn't realize today's HTML implementations were handling soft
hyphens. When I last tried that it didn't work, but that was a long time ago.

~~~
jacobolus
They sort-of work. It’s far from perfect. Do your own testing before you rely
on browsers.

------
chaosmachine
Cool project. Not your fault, but all these soft hyphens really destroy source
readability. I also wonder if this could cause keyword/seo problems.

~~~
romland
Absolutely a valid concern. I can imagine there are plenty of (non-global)
search-engines out there that trip over this issue (in fact, now that I think
about it, I don't think I -ever- took it into consideration. Hmm.).

I implemented something similar (soft hyphenation related) some years back and
ended up looking into just what you're asking about. My conclusion then was:
The big ones act nicely, and that's about where I stopped looking.

A quick Google for "soft hyphens and google" will give a good indication or
simply search this page for "google":
<http://www.cs.tut.fi/~jkorpela/shy.html>
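
For a site-local search index that does not handle this itself, one possible workaround is to strip soft hyphens before indexing or comparing text. A minimal Python sketch, where `strip_soft_hyphens` is a hypothetical helper and the set of spellings covered (the raw U+00AD character plus the `&shy;`, `&#173;`, and `&#xAD;` references) is an assumption about what the input may contain:

```python
import re

# The raw soft hyphen character and its common HTML spellings.
SHY_PATTERN = re.compile("\u00ad|&shy;|&#173;|&#x[aA][dD];")


def strip_soft_hyphens(text):
    """Remove soft hyphens so that words match their unhyphenated form."""
    return SHY_PATTERN.sub("", text)
```

The same normalisation could be applied to a user's query, so that text copied from a hyphenated page still matches.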

~~~
zackattack
First off, kickass site, very impressive that you did this, let alone over the
weekend. I look forward to browsing through the source.

OK, so general SEO question. Do search engines trip up after this? Why? Are
they parsing HTML straight up or do they have funky rules?

FOR EXAMPLE, I have heard that <h1> gives benefit to enclosed text. I have
wondered: can you just have jQuery automatically replace it on page load, and
then get credit for arbitrary keywords while actually presenting the viewer
with different phrasing? Wouldn't that work?

Anyway, couldn't you apply my idea similarly? This would be a really simple
jQuery plugin.

~~~
JimmyL
It's called "cloaking" and Google looks down heavily
([http://www.google.com/support/webmasters/bin/answer.py?hl=en...](http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=35769))
on it:

 _Make pages primarily for users, not for search engines. Don't deceive your
users or present different content to search engines than you display to
users, which is commonly referred to as "cloaking."_

~~~
zackattack
Well I think that if you cater to your users with soft hyphens but show the
unhyphenated content to Google, it wouldn't make sense for them to look down
on that. But a possible risk is that they do scan for cloaking and
automatically deduct for that. Though that would be a rather sophisticated
system. Then again, they are Google.

Oh yeah, Google's tech support system sucks btw. Someone would make big bucks
if they found a way to make that fun and efficient for their tech support
staff. Tech support in general. It's rarely fun to call in to tech support.
Shouldn't it be fun? Zappos style. Duuude: SMS text support. Text your initial
question, and someone calls you. Kinda like Aardvark except now.

<http://www.facebook.com/Eckharttolle>

------
mc
Lovely!

I actually did something similar..

Last year, I started looking at algorithms to count syllables. It turns out
that hyphenation and syllable detection have a lot in common.

Anyway, long story short: one of the approaches I use on Haikuist is a
hyphenation algorithm.

I'm glad you built softhyphen.com. It's such a fun idea.

(<http://haikuist.com>: poetic micro-blogging for haiku lovers)

------
jdunck
You should rewrite hyphenate_html as a walker rather than a recursion, because
Python has a stack, and arbitrary input could make you reach max recursion.

[http://github.com/fortes/softhyphen/blob/master/hyphenate_ht...](http://github.com/fortes/softhyphen/blob/master/hyphenate_html.py#L49)
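
A rough sketch of what such a walker could look like, in Python. This is not the repository's `hyphenate_html`: the `Node` class, `walk_text_nodes`, and `transform_text` are hypothetical stand-ins for whatever tree the real code traverses. The point is only that an explicit stack replaces the call stack, so nesting depth is limited by memory rather than by Python's recursion limit.

```python
import sys


class Node:
    """Hypothetical tree node: an element with children, or a text leaf."""

    def __init__(self, tag=None, text=None, children=None):
        self.tag = tag
        self.text = text
        self.children = children or []


def walk_text_nodes(root):
    """Yield text nodes depth-first, in document order, without recursion."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.text is not None:
            yield node
        # Push children reversed so the leftmost child is popped first.
        stack.extend(reversed(node.children))


def transform_text(root, fn):
    """Apply fn to the text of every text node in the tree."""
    for node in walk_text_nodes(root):
        node.text = fn(node.text)


# A tree nested deeper than the recursion limit is still handled.
deep = Node(text="leaf")
for _ in range(sys.getrecursionlimit() * 2):
    deep = Node(tag="div", children=[deep])
transform_text(deep, str.upper)
```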

~~~
fortes
Thanks for the tip, I'm still a python novice.

------
scorxn
Copy-paste in Snow Leopard: dis crim in a­tion (et al)

Maybe also provide a JS listener to remove the gaps before copy?

------
prgmatic
Great stuff man. I recommend you change the sample copy that you used to
something more relevant (i.e. programming/web dev related) so your google ads
are more relevant.

~~~
fortes
Good point! I just wanted something inoffensive there to illustrate, and
didn't think about the ads at all.

------
yannis
When I started out in computers over 25 years ago, I got very interested in
TeX and LaTeX and read a lot of the literature. One item that impressed me at
the time, and that I still remember, was that someone wrote a PhD thesis on
_hyphenation_.

Frank Liang wrote his Stanford Ph.D. thesis on a hyphenation algorithm that is
standard in TeX, and has been adapted to numerous languages. The Thesis is
available online at <http://www.tug.org/docs/liang/>
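
For readers who have not seen the pattern idea, here is a small Python sketch of how pattern-based hyphenation works. The pattern list is a tiny illustrative set, enough to handle one word, and should be treated as an assumption rather than as a real pattern file. `parse_pattern` and `hyphenate` are hypothetical names, and the `left_min`/`right_min` defaults are likewise a choice made for this sketch. Digits in a pattern sit between letters; for each gap in the word the highest digit from any matching pattern wins, and an odd result marks an allowed break.

```python
import re

# Illustrative toy patterns only (assumption), not a complete pattern set.
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pat):
    """Split 'hy3ph' into its letters 'hyph' and gap values [0, 0, 3, 0, 0]."""
    letters = re.sub(r"\d", "", pat)
    points = [int(d or 0) for d in re.split(r"[.a-z]", pat)]
    return letters, points


PATTERNS = dict(parse_pattern(p) for p in TOY_PATTERNS)


def hyphenate(word, patterns=PATTERNS, left_min=2, right_min=3):
    """Return the pieces of word between the allowed break points."""
    work = "." + word.lower() + "."   # dots mark the word boundaries
    points = [0] * (len(work) + 1)    # points[i] is the value before work[i]
    for start in range(len(work)):
        for end in range(start + 1, len(work) + 1):
            values = patterns.get(work[start:end])
            if values:
                for k, v in enumerate(values):
                    points[start + k] = max(points[start + k], v)
    pieces, last = [], 0
    # A break before word[j] corresponds to points[j + 1] because of the leading dot.
    for j in range(left_min, len(word) - right_min + 1):
        if points[j + 1] % 2:  # odd value: break allowed here
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return pieces
```

With these toy patterns, `hyphenate("hyphenation")` splits the word into `hy`, `phen`, and `ation`. Implementations often store the patterns in a trie; a plain dictionary lookup over all substrings keeps the sketch short.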

Your application is both useful and well written. It would be very useful,
though, to let the user specify the column width in pixels and the font, and
to show a sample rendered accordingly.

------
imd
Due to a bug report of mine, Calibre's[0] ebook reader uses hyphenator[1] to
hyphenate ebooks, but it's a useless feature for now because the WebKit it
uses doesn't insert hyphens at the ends of lines, just breaks the word in
half. Funny, a test just now shows that Chrome works as expected.

[0]: <http://calibre-ebook.com/>

[1]: <http://code.google.com/p/hyphenator/>

------
aw3c2
"violation" is hyphenated as "vi­-olation", that does not look correct to me.
Apart from that this seems fantastic. Nice work!

------
tmsh
nicely done (looks like nice code on github). there's another python
implementation of frank liang's hyphenation here, fyi.

<http://nedbatchelder.com/code/modules/hyphenate.html>

cool way to see tries in action. (though arguably a language with duck typing
makes tries pretty hard to read.)

------
asmosoinio
Typo: "Finish" should be "Finnish". BR, a Finnish guy :)

------
winter_blue
A kid's project. Please post only serious projects.

~~~
romland
I can't tell if you are being sarcastic or not. But the number of languages
supported told me that he took the project that extra mile.

But hey, your mileage may vary and all that.

(PS. As English is not my native language: Did the latter expression ("your
mileage may vary...") perhaps stem from the expression "that extra mile"? Just
realized they may be related! Oh well...)

~~~
mcgroob
Taking this WAY off topic ...

"Your mileage may vary" was a disclaimer on car advertisements (because you
never achieve the advertised miles-per-gallon).

"Go the extra mile" is biblical: Matthew 5:41

~~~
romland
_Taking this WAY off topic ..._

One _could_ argue that it's language related! Thanks for your answer.:)

