Hacker News
Show HN: My project for this weekend (with source) (softhyphen.com)
120 points by fortes on Feb 14, 2010 | 39 comments



I work with a lot of online publishing clients, who aren't always particularly tech savvy (you'd be amazed by some of the workflows). Designers would often like to justify text, which can cause some terrible whitespace rivers since browsers don't provide hyphenation. Browsers do, however, support soft-hyphens, which are hints that tell the browser where it can break a word.

Adding these hyphens manually is quite tedious. So I wrote up this little utility that takes HTML and adds soft hyphens automatically.

Background:

- Took a day to write (started it yesterday afternoon)

- Written in Python, deployed on AppEngine

- Uses OpenOffice's hyphenation dictionaries

Source code: http://github.com/fortes/softhyphen
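For readers who want the gist without opening the repository, here is a minimal sketch of the idea: walk the HTML, leave tags alone, and insert U+00AD (the soft hyphen) inside words in text nodes. It is not the code from the repository. `TOY_DICTIONARY` and `toy_hyphenate` are hypothetical stand-ins for a real dictionary-based hyphenator, and the list of skipped tags is an assumption.

```python
import re
from html.parser import HTMLParser

SOFT_HYPHEN = "\u00ad"
# Assumption: text inside these elements should never be hyphenated.
SKIP_TAGS = {"script", "style", "pre", "code", "textarea"}

# Hypothetical stand-in for real hyphenation dictionaries.
TOY_DICTIONARY = {"hyphenation": ["hy", "phen", "ation"], "browser": ["brows", "er"]}


def toy_hyphenate(word):
    """Join the known pieces of a word with soft hyphens, keeping its case."""
    pieces = TOY_DICTIONARY.get(word.lower())
    if pieces is None:
        return word
    out, pos = [], 0
    for piece in pieces:
        out.append(word[pos:pos + len(piece)])
        pos += len(piece)
    return SOFT_HYPHEN.join(out)


class SoftHyphenator(HTMLParser):
    """Re-emit the HTML as parsed, hyphenating words in text nodes only."""

    def __init__(self, hyphenate_word):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.hyphenate_word = hyphenate_word
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
            data = re.sub(r"[^\W\d_]+", lambda m: self.hyphenate_word(m.group()), data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def add_soft_hyphens(html, hyphenate_word=toy_hyphenate):
    parser = SoftHyphenator(hyphenate_word)
    parser.feed(html)
    parser.close()  # flush any trailing text
    return "".join(parser.out)
```

Calling `add_soft_hyphens('<p>Hyphenation</p>')` returns the same markup with invisible U+00AD characters inside the word; a real hyphenator would replace `toy_hyphenate`.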


This is neat. Some possible drawbacks:

(1) Browser layout engines are pretty messed up, and adding a bunch of shy hyphens can lead to really weird layout issues. I wish I had some good examples, but basically it has to do with taking into account the amount of space a hyphen takes up, and using that to compute how long lines should be, but then when the hyphens are in the middle of the line, not showing them. Anyway, as I remember sometimes it can result in text being eaten (i.e. not shown properly).

In other words, I’d try to do specific testing in a bunch of browsers, resizing a window through all reasonable sizes it might be, before counting on shy hyphens to work properly.

(2) This is still going to break search within pages. Pretty stupid that browsers can’t do proper search, but so it goes.

(3) This is going to be effectively a line-by-line paragraph composer, because browsers don’t do any kind of real paragraph layout. And it still won’t do any adjustment of inter-letter space. The combination of these two things means that using justified text is going to still usually end up looking like crap. Better than without any hyphenation, but not all that much better.

----

In short, this is a half-way workaround for the absolutely stupid lack of real hyphenation and justification algorithms in current browsers. Absolutely stupid because this is a solved problem: the algorithm used by TeX, described by one of Knuth’s students in the late 70s, works pretty darn well (Adobe uses a modified version for InDesign), and should be perfectly fast on modern hardware as well.

It’s not quite as bad as the lack of real layout in Microsoft Word, given that laying out text is the only purpose of that application, but still.... pretty bad.
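To make the line-by-line versus whole-paragraph distinction concrete, here is a rough sketch. It is not TeX's actual algorithm (no stretchable glue, no hyphenation penalties, monospaced character counts instead of real widths); it only contrasts a greedy breaker with a "total fit" breaker that minimizes the sum of squared leftover space over the whole paragraph. The function names and the cost function are assumptions made for illustration.

```python
def greedy_lines(words, width):
    """Line-by-line: put as many words on each line as will fit."""
    lines, current = [], []
    for word in words:
        if current and len(" ".join(current + [word])) > width:
            lines.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        lines.append(" ".join(current))
    return lines


def total_fit_lines(words, width):
    """Whole-paragraph: minimize the sum of squared slack over all lines.

    The last line is free (cost 0), and a single word longer than
    `width` is allowed on a line of its own.
    """
    n = len(words)
    best = [0.0] * (n + 1)   # best[i] = minimal cost of laying out words[i:]
    next_break = [n] * (n + 1)
    for i in range(n - 1, -1, -1):
        best[i] = float("inf")
        length = -1  # cancels the space counted before the first word
        for j in range(i, n):
            length += len(words[j]) + 1
            if length > width and j > i:
                break
            slack = max(width - length, 0)
            cost = 0 if j == n - 1 else slack ** 2
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                next_break[i] = j + 1
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:next_break[i]]))
        i = next_break[i]
    return lines


words = "aaa bb cc ddddd".split()
print(greedy_lines(words, 6))     # ['aaa bb', 'cc', 'ddddd']
print(total_fit_lines(words, 6))  # ['aaa', 'bb cc', 'ddddd']
```

The greedy version fills the first line perfectly and pays for it with a nearly empty second line; the total-fit version accepts a slightly looser first line to even things out.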


> Browser layout engines are pretty messed up

Did html5 address this?


It's not in HTML5’s scope. That’s the domain of CSS3.


One of the main issues here would be source-code readability (people will have to add another step to the build process, and folks serving static content will have some difficulty). An elegant solution I can think of is to implement this as Django middleware that hyphenates all content by default. I don't know what the performance impact would be.

But if there is an approximate solution you can create which doesn't require dictionaries, you can port that to javascript as a library which can then instrument all the elements in a class with these soft hyphens. This will also solve the performance issue and can be introduced very unobtrusively into virtually any application.

And also any potential SEO concerns are alleviated.
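A rough sketch of the middleware idea, written in the shape of a Django function-based middleware (a factory that takes `get_response` and returns a callable). It deliberately imports nothing from Django so it stays self-contained. `make_softhyphen_middleware` and the `hyphenate_html` callable are hypothetical names, and the performance question raised above is left open: every HTML response gets decoded, rewritten and re-encoded.

```python
def make_softhyphen_middleware(hyphenate_html):
    """Build a Django-style middleware factory around `hyphenate_html`.

    `hyphenate_html` is a hypothetical callable: str (HTML) -> str (HTML).
    """

    def middleware_factory(get_response):
        def middleware(request):
            response = get_response(request)
            content_type = response.get("Content-Type", "")
            # Leave streaming and non-HTML responses untouched.
            if getattr(response, "streaming", False):
                return response
            if not content_type.startswith("text/html"):
                return response
            charset = getattr(response, "charset", None) or "utf-8"
            html = response.content.decode(charset)
            response.content = hyphenate_html(html).encode(charset)
            # Soft hyphens add bytes, so keep Content-Length honest.
            if response.has_header("Content-Length"):
                response["Content-Length"] = str(len(response.content))
            return response

        return middleware

    return middleware_factory
```

In a Django project one would presumably bind this once at module level, for example `softhyphen_middleware = make_softhyphen_middleware(my_hyphenate_html)`, and list that dotted path in the `MIDDLEWARE` setting; the source templates then stay free of soft hyphens.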


It'd be good if after you entered some text, the sample at the bottom showed the text you entered un-hyphenated and hyphenated for easy comparison, rather than keeping the Article 7 sample.


I had that originally, but was paranoid about XSS attacks (needlessly?). Will re-enable this week.


XSS attacks against what? Unless that data can be accessed by multiple people, there's no opportunity even if I can embed Javascript.


Even better would be a resizable div that the user can drag around to see how the text reflows.


A JavaScript project that does this - http://code.google.com/p/hyphenator/ - with an example - http://j.mp/bakUUw

A relevant Reddit thread - http://j.mp/d8H3Tf


This is pretty cool. My suggestion: create a simple API and a WordPress plugin. I think it would be handy for a lot of WP-powered sites and that would make it easier for people to implement.


Good idea! I was just about to start it, and decided to do a search in case someone had something similar. This plugin looks pretty good, actually: http://kingdesk.com/projects/wp-hyphenate/


Actually, I think there's still room for a WP plugin based on your project. I've tried WP-Typography before, and while I love KingDesk's intentions, I didn't like the complexity of using the package (not to mention that it requires XHTML strict and my WP theme is all 4.01, a boat I imagine many WP users are in).


Nice. I didn't realize today's HTML implementations were handling soft hyphens. When I last tried that it didn't work, but that was a long time ago.


They sort-of work. It’s far from perfect. Do your own testing before you rely on browsers.


Cool project. Not your fault, but all these soft hyphens really destroy source readability. I also wonder if this could cause keyword/seo problems.


Absolutely a valid concern. I can imagine there are plenty of (non-global) search-engines out there that trip over this issue (in fact, now that I think about it, I don't think I -ever- took it into consideration. Hmm.).

I implemented something similar (soft hyphenation related) some years back and ended up looking into just what you're asking about. My conclusion then was: The big ones act nicely, and that's about where I stopped looking.

A quick Google for "soft hyphens and google" will give a good indication or simply search this page for "google": http://www.cs.tut.fi/~jkorpela/shy.html
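If you run your own site search (the non-global case above), the usual defence is to strip soft hyphens before indexing or matching. A minimal sketch, assuming the text may contain either the raw U+00AD character or one of its HTML spellings; the function names are made up for illustration:

```python
import re

# Raw soft hyphen plus the named, decimal and hexadecimal entity forms.
_SOFT_HYPHEN_RE = re.compile(r"\u00ad|&shy;|&#0*173;|&#x0*ad;", re.IGNORECASE)


def strip_soft_hyphens(text):
    """Remove every soft hyphen so words compare equal to their plain form."""
    return _SOFT_HYPHEN_RE.sub("", text)


def contains(haystack, needle):
    """Case-insensitive substring search that ignores soft hyphens."""
    return strip_soft_hyphens(needle).casefold() in strip_soft_hyphens(haystack).casefold()
```

The same normalization applied to both the indexed text and the query keeps a hyphenated page findable by the unhyphenated word.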


First off, kickass site, very impressive that you did this, let alone over the weekend. I look forward to browsing through the source.

OK, so general SEO question. Do search engines trip up after this? Why? Are they parsing HTML straight up or do they have funky rules?

FOR EXAMPLE, I have heard that <h1> gives benefit to enclosed text. I have wondered: can you just have jQuery automatically replace it on page load, and so get credit for arbitrary keywords while actually presenting the viewer with different phrasing? Wouldn't that work?

Anyway, couldn't you apply my idea similarly? This would be a really simple jQuery plugin.


It's called "cloaking" and Google looks down heavily (http://www.google.com/support/webmasters/bin/answer.py?hl=en...) on it:

Make pages primarily for users, not for search engines. Don't deceive your users or present different content to search engines than you display to users, which is commonly referred to as "cloaking."


Well I think that if you cater to your users with soft hyphens but show the unhyphenated content to Google, it wouldn't make sense for them to look down on that. But a possible risk is that they do scan for cloaking and automatically deduct for that. Though that would be a rather sophisticated system. Then again, they are Google.

Oh yeah, Google's tech support system sucks btw. Someone would make big bucks if they found a way to make that fun and efficient for their tech support staff. Tech support in general. It's rarely fun to call in to tech support. Shouldn't it be fun? Zappos style. Duuude: SMS text support. Text your initial question, and someone calls you. Kinda like Aardvark except now.

http://www.facebook.com/Eckharttolle


Lovely!

I actually did something similar..

Last year, I started looking at algorithms to count syllables. It turns out that hyphenation and syllable detection have a lot in common.

Anyway, long story short: one of the approaches I use on Haikuist is a hyphenation algorithm.

I'm glad you built softhyphen.com. It's such a fun idea.

(http://haikuist.com: poetic micro-blogging for haiku lovers)


You should rewrite hyphenate_html as an iterative walker rather than a recursive function: Python limits recursion depth, so arbitrarily deep input could hit the maximum recursion limit.

http://github.com/fortes/softhyphen/blob/master/hyphenate_ht...
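To illustrate the difference, here is a sketch on a toy tree rather than on the project's real data structures. A nested list stands in for the parsed HTML tree (an assumption for brevity): strings are text nodes, lists are elements. Both function names are hypothetical.

```python
def map_text_recursive(node, fn):
    """Apply fn to every string leaf; depth is bounded by Python's recursion limit."""
    for i, child in enumerate(node):
        if isinstance(child, str):
            node[i] = fn(child)
        else:
            map_text_recursive(child, fn)
    return node


def map_text_iterative(root, fn):
    """Same result, but with an explicit stack, so nesting depth does not matter."""
    stack = [root]
    while stack:
        node = stack.pop()
        for i, child in enumerate(node):
            if isinstance(child, str):
                node[i] = fn(child)
            else:
                stack.append(child)  # visit later instead of recursing now
    return root
```

With input nested tens of thousands of levels deep, the recursive version raises `RecursionError` under CPython's default limit, while the iterative one only grows a list on the heap.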


Thanks for the tip, I'm still a python novice.


Copy-paste in Snow Leopard: dis crim in a­tion (et al)

Maybe also provide a JS listener to remove the gaps before copy?


Great stuff man. I recommend you change the sample copy that you used to something more relevant (i.e. programming/web dev related) so your google ads are more relevant.


Good point! I just wanted something inoffensive there to illustrate, and didn't think about the ads at all.


When I started out in computers over 25 years ago, I got very interested in TeX and LaTeX and read a lot of the literature. One thing that impressed me at the time, and that I still remember, was that someone wrote a PhD thesis on hyphenation.

Frank Liang wrote his Stanford Ph.D. thesis on a hyphenation algorithm that is standard in TeX, and has been adapted to numerous languages. The Thesis is available online at http://www.tug.org/docs/liang/
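For anyone curious how the pattern-based approach works, here is a compact sketch. Patterns are letter strings with digits between the letters; every pattern that matches somewhere in the dot-padded word contributes its digits, the highest digit wins at each position, and odd values allow a break. The nine patterns below are the ones usually quoted for the word "hyphenation" and are only enough for that one word; a real hyphenator loads thousands per language. The function names and the 2/3 minimum fragment lengths are assumptions for illustration.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into ('hyph', [0, 0, 3, 0, 0])."""
    letters = re.sub(r"\d", "", pattern)
    digits = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            digits[pos] = int(ch)  # digit sits before letter number `pos`
        else:
            pos += 1
    return letters, digits


# Illustrative subset only: just enough for the word "hyphenation".
PATTERNS = dict(
    parse_pattern(p)
    for p in ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]
)


def hyphenate(word, patterns=PATTERNS, left_min=2, right_min=3):
    """Return the word split into pieces at the permitted break points."""
    padded = "." + word.lower() + "."
    points = [0] * (len(padded) + 1)  # points[i] sits before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            digits = patterns.get(padded[start:end])
            if digits:
                for k, d in enumerate(digits):
                    points[start + k] = max(points[start + k], d)
    pieces, last = [], 0
    # Only consider breaks that leave left_min / right_min letters.
    for i in range(left_min, len(word) - right_min + 1):
        if points[i + 1] % 2 == 1:  # odd value: break allowed before word[i]
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces


print(hyphenate("hyphenation"))  # ['hy', 'phen', 'ation']
```

The substring lookup here is a plain dict for readability; a trie is the more usual structure once the pattern list gets large.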

Your application is both useful and well written. It would be even more useful, though, to let the user specify the column width in pixels and the font, and to show a sample rendered accordingly.


Due to a bug report of mine, Calibre's[0] ebook reader uses hyphenator[1] to hyphenate ebooks, but it's a useless feature for now because the WebKit it uses doesn't insert hyphens at the ends of lines, just breaks the word in half. Funny, a test just now shows that Chrome works as expected.

[0]: http://calibre-ebook.com/

[1]: http://code.google.com/p/hyphenator/


"violation" is hyphenated as "vi­-olation", that does not look correct to me. Apart from that this seems fantastic. Nice work!


nicely done (looks like nice code on github). there's another python implementation of frank liang's hyphenation here, fyi.

http://nedbatchelder.com/code/modules/hyphenate.html

cool way to see tries in action. (though arguably a language with duck typing makes tries pretty hard to read.)


Typo: "Finish" should be "Finnish". BR, a Finnish guy :)


A kid's project. Please post only serious projects.


I can't tell if you are being sarcastic or not. But the number of languages supported told me that he took the project that extra mile.

But hey, your mileage may vary and all that.

(PS. As English is not my native language: Did the latter expression ("your mileage may vary...") perhaps stem from the expression "that extra mile"? Just realized they may be related! Oh well...)


The "extra mile" probably comes from Matthew 5:41 "If someone forces you to go one mile, go with him two miles." (or "And whosoever shall compel thee to go a mile, go with him twain.") You used the phrase perfectly in your comment -- it's about going above and beyond what was expected.

The mileage comment was more from the auto manufacturers / EPA who tested cars. A car will have been tested under specific conditions and found to achieve a certain miles per gallon. However, accelerate or brake hard, drive more in the city, or change other conditions, and you would get a completely different result. Hence the standard disclaimer of "your mileage may vary..." That made its way into the lingo as a standard phrase meaning "this is pretty subjective - it might be the same for you, it might not".

So aside from the word "mileage" being derived from the base word "mile", there isn't much overlap between the two phrases.

(Oh - and a related note for my American friends - as much as I hate imperial units - I will give you credit that mileage is a nice word. litres/100km is still a better unit than MPG, but mileage is a better term than fuel efficiency (or kilometreage).)


Taking this WAY off topic ...

"Your mileage may vary" was a disclaimer on car advertisements (because you never achieve the advertised miles-per-gallon).

"Go the extra mile" is biblical: Matthew 5:41


> Taking this WAY off topic ...

One could argue that it's language related! Thanks for your answer.:)


Nah, I think the expression came from car commercials that advertised "X miles per gallon (but your mileage may vary)" to avoid getting sued for false advertising or some such.


I'm certain you would gain an appreciation for this if you knew exactly how tricky hyphenation detection is.


It’s not exactly an unresearched problem. Here’s from 1983:

http://www.tug.org/docs/liang/

Here’s a javascript version:

http://code.google.com/p/hyphenator/



