
"River" detection in text - cpleppert
http://dsp.stackexchange.com/questions/374/river-detection-in-text
======
oofabz
It's really impressive how their sophisticated image processing algorithms are
only a few lines long. If I wrote an algorithm to detect rivers, it would be
hundreds of lines long and less effective. These guys tackle it mathematically
and it seems almost magical. I wish I could think that way.

~~~
Chirono
I find this comment interesting given the recent (ok, continuous) talk about
not using maths puzzles in programmer interviews. This is perhaps an
interesting real-world case that demonstrates how a mathematical outlook leads
to much cleaner code.

Ok, solving this kind of thing is rare, but still...

~~~
thomasz
The secret sauce here is not the ability to solve "maths puzzles", but
specialized domain knowledge.

~~~
ColinWright
That's exactly the wrong way round. You can acquire domain knowledge, you can
have a domain expert assigned to you to work with you. What you can't acquire,
on demand, is the ability to think in the ways that mathematics gives you.
That requires extensive training and practice, and recognizing when obscure
bits of theory are applied is something that doesn't come overnight.

~~~
kamaal
It's made up of both, actually. Somebody outside devops would consider
pervasive use of sed, awk and other Unix text processing utilities line noise.
Yet such a thing would come naturally to a person working with them daily.

If you can't get such stuff easily the very first time, does it mean you are a
bad programmer?

Not quite.

The way I look at it, I would consider somebody good if he can read the
documentation, theory, etc. and then come up with solutions to problems (read:
can write programs). That way you know the guy can actually get some work
done. It takes practice and time to get used to, and comfortable in, a new
domain. That's natural friction you have to wear down, regardless of whether
it's math, music, literature or whatever. Placing too much importance on
factual stuff isn't of much use. What's more important is whether the person
can work his way out of the problem.

I checked out your bio. I understand you do math for a living and are
obviously a little defensive when someone underplays the importance of your
area of expertise. But the fact of the matter is that though math is relevant,
the areas in which it's relevant are largely either rare or already solved and
presented to most programmers as libraries and frameworks. And where it's not,
simple analysis, a little reading and experimenting is sufficient to solve the
problem at hand.

Your problem is not with programmers but with a term called abstraction. And
fighting that is futile. The benefits vastly outweigh any intellectual
argument you can present against it.

EDIT: By the way, after reading your bio, I have developed the utmost respect
for all the work you have done in your life.

~~~
srean
Was a fine tangential comment, till you started an unsolicited diagnosis of
Colin's hypothetical disease.

Downvoted.

~~~
kintamanimatt
Huh? What hypothetical disease?

~~~
ColinWright
He said:

    
    
        Your problem is not with programmers but with a term
        called abstraction. And fighting that is futile. The
        benefits vastly outweigh any intellectual argument
        you can present against it.
    

Utterly bizarre, to be claiming that a pure mathematician has a problem with
abstraction.

~~~
kintamanimatt
Why would that be a _disease_ though? I thought he'd maybe edited his reply
and said something else. Maybe I'm just tangled up in the misuse of a word
though!

Side note, does the theory of juggling have a mathematical underpinning?

~~~
ColinWright
I have no idea why it would be a "disease", but he did seem to be diagnosing
my attitude or "problem."

Don't care.

And yes, there is a great deal of mathematics underneath the structures of
juggling patterns. Using them we predicted the existence of previously unknown
juggling patterns, and have now used the math to create a proof that of all
the juggling tricks of a certain type, we know them all (for some definition
of "to know").

~~~
srean
It was supposed to be a metaphor, intended to highlight that no such disease
is present, and that it was not necessary to go about trying to diagnose what
"the problem" with Colin was.

diagnose <-> disease

It clearly did not work as intended.

------
mistercow
I wonder how hard it would be, given the detection algorithm, to fix this
automatically. Seems like simulated annealing would be a pretty good fit,
where you perturb the word spacing on each iteration and the energy function
is the total river length. The fitness landscape is rich enough with good
solutions that you should rarely need more than a few iterations to arrive at
one.

Still, that's a pretty expensive fitness test, and I wonder if there's a more
elegant and efficient approach than Monte Carlo methods.
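
A minimal sketch of the annealing idea described above, in Python. It assumes
monospaced text on a character grid, and the energy is only a rough proxy for
"total river length": it counts spaces that have another space directly or
diagonally above them. The function names (`render`, `river_energy`,
`initial_gaps`, `anneal`), the linear cooling schedule and the assumption that
every line has enough slack for at least one space per gap are all
illustrative, not anything taken from the linked answers.

```python
import math
import random

def render(words, gaps):
    """Join words using the given number of spaces in each gap."""
    out = words[0]
    for word, gap in zip(words[1:], gaps):
        out += " " * gap + word
    return out

def river_energy(lines):
    """Count spaces that sit directly or diagonally below another space."""
    energy = 0
    for upper, lower in zip(lines, lines[1:]):
        for col, ch in enumerate(lower):
            if ch == " " and " " in upper[max(col - 1, 0):col + 2]:
                energy += 1
    return energy

def initial_gaps(words, width):
    """Justify one line: spread the slack over the gaps, left to right."""
    n_gaps = len(words) - 1
    if n_gaps == 0:
        return []
    slack = width - sum(len(w) for w in words)
    base, extra = divmod(slack, n_gaps)
    return [base + (1 if i < extra else 0) for i in range(n_gaps)]

def anneal(paragraph, width, steps=2000, t_start=2.0, seed=0):
    """Perturb word spacing; return (best energy, best rendered lines)."""
    rng = random.Random(seed)
    gaps = [initial_gaps(words, width) for words in paragraph]

    def energy(g):
        return river_energy([render(w, s) for w, s in zip(paragraph, g)])

    current = energy(gaps)
    best, best_gaps = current, [g[:] for g in gaps]
    for step in range(steps):
        temperature = t_start * (1 - step / steps) + 1e-9  # linear cooling
        row = rng.randrange(len(gaps))
        if len(gaps[row]) < 2:
            continue
        src, dst = rng.sample(range(len(gaps[row])), 2)
        if gaps[row][src] <= 1:
            continue  # every gap keeps at least one space
        gaps[row][src] -= 1
        gaps[row][dst] += 1
        candidate = energy(gaps)
        delta = candidate - current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if current < best:
                best, best_gaps = current, [g[:] for g in gaps]
        else:
            # Rejected: undo the move.
            gaps[row][src] += 1
            gaps[row][dst] -= 1
    return best, [render(w, g) for w, g in zip(paragraph, best_gaps)]
```

Each move shifts one space from one gap to another within a line, so every
line keeps its justified width. Every energy evaluation re-renders the whole
paragraph, which is the cost concern raised above in miniature.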

~~~
kordless
Thanks for this. I just spent 30 minutes reading up on calculating pi with a
simple Monte Carlo algo. Thinking about doing it in d3.js...

~~~
mistercow
It's an entertaining exercise, especially when you watch your computer chew
away for several minutes before telling you that π is probably somewhere
around 3.26.

~~~
graue
Huh? You should get as close as 3.14 in less than a second if your random
number source is decent.

~~~
mistercow
Perhaps. I may be misremembering, or may have been using a particularly crappy
RNG when I last did it. It was a long time ago.

~~~
Someone
How long ago? How much faster is current hardware?

If it was, say, a 4MHz machine, you both could be right.

~~~
petercooper
Just for fun, and because I'd never done it before, I just threw together some
Ruby to do this: <https://gist.github.com/peterc/5019760>

It gets to 3.14 within a second, about 30 seconds to 3.1415.. and Ruby is
surely doing this in 100x the time C or Java would.

Update: Also did it for JavaScript - <https://gist.github.com/peterc/5019804>

~~~
mistercow
You can optimize that JS code a lot by using the unit circle, limiting
yourself to the first quadrant, and using multiplication instead of
Math.pow(). Then you can make it branchless using a little bitwise trickery.
The result is this: <https://gist.github.com/osuushi/5022143> .

OK, so I was going to just do the unit circle thing. Then I got stuck in
optimization mode. God I miss performance graphics programming.
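
For readers who have not seen the exercise, here is a hedged Python sketch of
the Monte Carlo estimate using the first two suggestions above: the unit
circle restricted to the first quadrant, and plain multiplication instead of a
power function. It is not a port of any of the linked gists; `estimate_pi` and
its `seed` parameter are names chosen for illustration. Adding the boolean
result directly is only a loose analogue of the bitwise trick mentioned above.

```python
import random

def estimate_pi(samples, seed=0):
    """Estimate pi from random points in the unit square (first quadrant)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x = rng.random()
        y = rng.random()
        # True counts as 1, False as 0: the point is inside the quarter
        # circle when x^2 + y^2 <= 1.
        inside += (x * x + y * y <= 1.0)
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / samples

if __name__ == "__main__":
    print(estimate_pi(1_000_000))
```

The error shrinks roughly with the square root of the sample count, so each
extra correct digit costs about 100 times more samples.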

------
d23
I've been seeing these all my life and for some reason never thought to ask if
anyone else noticed them. I never found them particularly distracting and
didn't know they were a sign of bad typography. Still, cool to see an article
about them.

~~~
homosaur
If you start to study typography at a relatively deep level, there will be two
things that will absolutely kill you on a daily basis. One is rivers. The
second is bad kerning on signs/displays.

~~~
RyJones
Not stuff like this? <http://www.flickr.com/photos/ryjones/4047502122/>
<http://www.flickr.com/photos/ryjones/6836803380/>
<http://www.flickr.com/photos/ryjones/4047501878/>

~~~
spacemanaki
Not a font-nerd but I'll bite. Is the 'N' in MADISON and SECOND upside down?
(but correct in <http://www.flickr.com/photos/ryjones/4046758633/>)

~~~
RyJones
You are correct. It drives me nuts when I walk around Second and Madison in
Seattle; I would assume the workers were trolling when they did it, but I have
pointed it out to plenty of co-workers that didn't see a problem.

------
jtchang
Definitely cool. How about the reverse for an art project? Maybe the
Constitution or Bill of Rights with rivers forming the Statue of Liberty.

~~~
DeepDuh
Well, the reverse is already done in ASCII art isn't it?

~~~
ygra
There's lots of typographic art that doesn't use monospaced fonts.

------
shitlord
Something similar:
<http://stackoverflow.com/questions/8479058/how-do-i-find-waldo-with-mathematica>

These sorts of image processing questions are incredibly entertaining imo (and
I don't mean it in a bad way).

------
lifeisstillgood
I was thinking about developer continuing education yesterday, and this
underlines the need for some way of keeping sharp - without wasting time or
direction.

I simply don't know what a Hough Transform is, and rather than Wikipedia, I
would rather take a Coursera course on image processing. It's getting things
in context that matters.

As I get older I am not afraid of hard work, just afraid of exploring in every
direction. The waste of time and effort is the problem: time and effort spent
finding out what to learn rather than on the practice of learning.

Oh, just answered my own question: Coursera!

------
taltman1
TeX uses a dynamic programming algorithm to perform its advanced hyphenation,
which allows the text to fill the page "beautifully":
<http://en.wikipedia.org/wiki/TeX#Hyphenation_and_justification>

If TeX is already using dynamic programming in order to improve the visual
appearance of the words, I imagine that the same can be done for the space
between the words without having to resort to image processing of the TeX
document rendered as PDF.
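
To make the dynamic programming idea concrete, here is a deliberately
simplified Python sketch. It is not TeX's algorithm: the real Knuth-Plass
line breaker also handles hyphenation, stretchable glue and penalties, none of
which is modelled here. This version assumes monospaced text, assumes every
word fits within `width`, and charges each line except the last the square of
its unused width. `break_lines` is a hypothetical name.

```python
def break_lines(words, width):
    """Choose line breaks that minimise the sum of squared leftover space."""
    n = len(words)
    inf = float("inf")
    cost = [inf] * (n + 1)  # cost[i]: best cost of setting words[i:]
    nxt = [n] * (n + 1)     # nxt[i]: index of the first word on the next line
    cost[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space counted before the first word
        for j in range(i, n):
            length += len(words[j]) + 1
            if length > width:
                break
            slack = width - length
            line_cost = 0 if j == n - 1 else slack ** 2  # last line is free
            total = line_cost + cost[j + 1]
            if total < cost[i]:
                cost[i], nxt[i] = total, j + 1
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines
```

With the words `aaa bb cc ddddd` and a width of 6, a greedy breaker would
produce `aaa bb` / `cc` / `ddddd`, while this sketch prefers `aaa` / `bb cc` /
`ddddd` because it evens out the leftover space. A river-aware variant would
need an extra cost term that depends on where the spaces fall in adjacent
lines, which is the open question in this thread.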

~~~
limmeau
The thread at SO discusses this; both glyph positions and glyph shapes are
important for a river to become noticeable.

~~~
GhotiFish
I think a false positive or two is worth it for glyph based analysis.

You could even do some sensible analysis per glyph.

width of base: A

vs width of top: V

~~~
limmeau
Is glyph width at base and top available to the layout algorithm? I don't know
TeX that well.

------
Zolomon
I've always wondered if there exists any book that uses rivers as an
encryption technique for hidden messages. Like the author has hidden an easter
egg or something more exciting.

------
tekromancr
I love the fact that all three answers used totally different techniques and
each found the answer.

------
Bryan22
This seems to me like the least efficient method ever for dealing with the
problem. OK, you want to find 'rivers' in text. To do so, you have to turn
that text into an image and run it through your app, then go back and manually
correct the problem? What was the point of the app again? I don't care if it
took 1 line of code or 1000; this seems completely useless when the anomaly is
blatantly obvious when proofreading. Not to mention that if you wanted to fix
it dynamically, you'd end up with an image of a block of text instead of text:
no big deal for print, but devastating for SEO.

"OK, smart ass, what would you recommend?" Is that what you're thinking? Well,
since you asked: why not take an open source text editor and add an algorithm
that stores each line of text in an array, finds the index of every space, and
compares it to the previous and next lines of text. If the difference between
space indexes on adjacent lines is close to 0, you have a river. If the
indexes steadily increment or decrement from line to line, you have a river.
Now add an extra space somewhere or move the last word on the line to the next
line, whichever solution is most aesthetically pleasing. Now your rivers are
getting fixed on the fly and you don't need to take a screenshot of a block of
text to analyze it.

Maybe I missed the point of the article, but I think those of you praising
their solution aren't taking into account what the problem actually is, and
the fact that their app doesn't actually include a solution to the problem.
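
A rough Python sketch of the space-index comparison proposed above. It assumes
monospaced text, so that a character index corresponds to a horizontal
position; with proportional fonts the indexes would have to be replaced by
measured glyph positions. The names `river_lengths` and `find_rivers`, and the
`drift` and `min_length` parameters, are illustrative assumptions.

```python
def river_lengths(lines, drift=1):
    """For each space, the length of the longest chain of spaces ending there.

    A space extends a chain if the previous line has a space within
    `drift` columns of it.
    """
    previous = {}
    result = []
    for line in lines:
        current = {}
        for col, ch in enumerate(line):
            if ch == " ":
                above = [previous.get(c, 0)
                         for c in range(col - drift, col + drift + 1)]
                current[col] = 1 + max(above)
        result.append(current)
        previous = current
    return result

def find_rivers(lines, min_length=3, drift=1):
    """Return (last_line, column, length) for each chain that ends there."""
    lengths = river_lengths(lines, drift)
    rivers = []
    for row, current in enumerate(lengths):
        below = lengths[row + 1] if row + 1 < len(lengths) else {}
        for col, length in current.items():
            continues = any(c in below
                            for c in range(col - drift, col + drift + 1))
            if length >= min_length and not continues:
                rivers.append((row, col, length))
    return rivers

if __name__ == "__main__":
    text = ["aaa bbbb cc", "dd eee ffff", "ggg hh iiii"]
    print(find_rivers(text))
```

This only detects candidate rivers; deciding which space to widen or which
word to move is a separate step that the sketch does not attempt.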

------
pdeuchler
I love the variance in answers. You get several different sig-pro solutions
and then at the bottom also a machine learning answer.

IMHO the machine learning solution could be better, as rivers are defined more
aesthetically than functionally.

------
fnordfnordfnord
I'd like to have the TeX source, to translate the letters to blocks before
doing image processing, but it's almost as good to simply convert the image to
a binary mask and then "dilate" or "grow" the pixels. I think that makes it a
simpler problem, both in processing time and conceptually. I'm not sure how
I'd decide which vertical lines to pick out; there are a lot of choices for
that.
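
A small pure-Python sketch of the mask-and-dilate idea, under loud
assumptions: the "image" is a list of strings in which `#` is ink, the
structuring element is a square of side `2 * radius + 1`, and the final step
(the longest run of background pixels in a single column) is just one of the
many possible choices mentioned above. The names `to_mask`, `dilate` and
`longest_vertical_gap` are hypothetical; a real implementation would more
likely use an image library.

```python
def to_mask(rows):
    """Binary mask: True where the character is ink ('#')."""
    return [[ch == "#" for ch in row] for row in rows]

def dilate(mask, radius=1):
    """Grow every ink pixel into a (2*radius+1) square."""
    height, width = len(mask), len(mask[0])
    out = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            for ny in range(max(y - radius, 0), min(y + radius + 1, height)):
                for nx in range(max(x - radius, 0), min(x + radius + 1, width)):
                    out[ny][nx] = True
    return out

def longest_vertical_gap(mask):
    """Longest run of background pixels in one column, as (length, column)."""
    best = (0, -1)
    for x in range(len(mask[0])):
        run = 0
        for y in range(len(mask)):
            run = 0 if mask[y][x] else run + 1
            if run > best[0]:
                best = (run, x)
    return best

if __name__ == "__main__":
    image = ["##.##...##", "#.###...##", "##.##...##"]
    print(longest_vertical_gap(dilate(to_mask(image))))
```

Dilation closes the narrow gaps inside and between letters, so only the wide
word gaps survive as background. A straight column scan like this misses
slanted rivers, which is where a line detector such as the Hough transform
mentioned elsewhere in the thread would come in.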

------
mcav
Next, someone should put that in LaTeX to automatically fix it rather than
just detect it. (Or some sort of HTML/JS plugin, since using LaTeX again is
somewhat disconnected from my immediate life goals).

~~~
fusiongyro
I wouldn't hold my breath. This sounds like a huge, messy, heuristic coding
effort to solve a rare, über cosmetic problem. TeX and typography in general
seem to attract really obsessive folks, but this doesn't strike me as
harboring a hidden ideal that's computationally well-defined and feasible the
way other parts of TeX turned out to be. I guess we'll find out if somebody
does it.

~~~
jamesjporter
I like the idea that mistercow proposes above: use an algorithm to sample the
space of possible typesettings and find one with less than some threshold
amount of rivers or river length.

~~~
fusiongyro
I get that, but here's the problem: there's no guarantee that any perturbation
isn't going to produce more rivers. It's guess-and-check. For simulated
annealing to work, you have to believe that large perturbations are more
likely to produce large differences and small ones produce small differences.
I am not convinced that property holds here, because a river is an optical
phenomenon that doesn't directly relate to the spacing size, which is the one
variable you're interested in manipulating. A small change in space might have
a small effect on the river you noticed, but create an entirely new and worse
one. Another problem with simulated annealing is that you want to start with
large changes and work your way to small changes, but TeX has already
optimized the spacing, so you would really rather start with small changes.

Your next problem is that TeX doesn't know what glyphs look like, just how
much space they take up. If the "riverbanks" are made up of As and Vs, the
river will be extremely pronounced. But it may not be noticeable at all if
they're made up of Ms and Xs. Worse, it will also depend on the font in play.
But TeX doesn't know what the glyphs look like while it's adjusting the
whitespace, so you now need to introduce a feedback loop between the eject-
page phase and the paragraph layout phase. I'm under the impression you'd be
jumping over three or four phases to do that, because TeX lays out lines
first, then paragraphs, then emits pages.

This is a huge amount of work. How big a problem are rivers?

------
Kiro
I want to know if it can be done on HTML texts with JavaScript.

------
qscesz
Really impressed by the short length and high effectiveness of the code.
Thanks!

------
benatkin
Why are they using bogus examples?

~~~
mark-r
It's probably Lorem Ipsum, which is commonly used when you want to focus on
layout rather than meaning. Since the text is meaningless it won't be
distracting, yet it looks close enough to real to be useful. See
<http://en.wikipedia.org/wiki/Lorem_ipsum>

------
martinced
If there are any typesetting geeks on HN, any idea when this would be
incorporated in TeX / LaTeX and other typesetting programs?

Oh, the memories of when I used to write (and typeset) books using LaTeX and
Quark XPress: "rivers" had to be tracked down manually, by eyeball searching.
You'd basically "blur" your vision a bit and quickly scan through all the
pages of the book. It didn't take that long and I don't write books anymore,
but I'd still be curious as to when that technique is going to be implemented
(maybe in InDesign, which I never used!?).

~~~
Samuel_Michon
In Adobe InDesign, one chooses to set type using the Single-line Composer or
the Paragraph Composer. The latter analyzes the entire paragraph and tweaks
spacing and hyphenation to minimize rivers, ragged lines, orphans and widows.

So nowadays, I don't manually check each page for rivers anymore. When
designing the page layout, I decide on the 'color' (the density of the
paragraph) that works best, so that rivers are avoided, but I leave it at
that.

I understand how river detection can be a fun mathematical puzzle, but it's
moot. Adobe has built the Multi-line Composer into InDesign since version 1.0
(1999), and in my experience it gets better with every version.

From MacWorld's review of InDesign 1.0:

 _"InDesign's text features will appeal to designers and production people
tired of the drudgery of manual copyfitting and kerning. The Multi-line
Composer feature can calculate hyphenation and justification settings by
examining an entire paragraph (or as many lines as you specify), instead of
just a single line, to create better-looking text. In the process, it notably
reduces the amount of manual tweaking necessary and will be especially helpful
if you're trying to avoid multiple word breaks in a design with awkward text
wraps. Similarly, the program's Optical Kerning feature does its best to find
optimal character spacing, even if you've mixed different type sizes and
faces."_

<http://www.macworld.com/article/1014955/k2.html>

<http://blog.paragonpress.net/2010/08/04/adobe-indesigns-paragraph-composer/>

~~~
arrrg
Woah. InDesign’s Paragraph Composer detects rivers? I didn’t know that. Now I
wonder how it’s doing that.

------
drivebyacct2
This is massively distracting for me at certain times and I was always
considered a "good, fast" reader. The only thing I detest more in a block of
text than rivers is justified alignment. Ick.

~~~
homosaur
Good typography is about making the correct decisions regarding your text. If
justification is making a passage hard to read then it's simply bad typography
and doesn't mean justification is bad as a concept. Ragged-right isn't
necessarily going to save you from rivers either although they are far less
likely due to uniform spacing.

~~~
drivebyacct2
Oh, I know. I hate them as separate entities. I'm not even convinced
necessarily that justified text causes more rivers - they make awkward gaps
and ruin my pace, but I don't usually get distracted by rivers with justified
text.

~~~
barrkel
Justification without gaps works best with aggressive hyphenation. If the
hyphenation isn't good, the text is full of surprises at line wraps. It's hard
to do justification well as a result. It's usually restricted to newspapers
with limited space these days.

~~~
andreasvc
> It's usually restricted to newspapers with limited space these days

That's false. Almost any printed publication uses justification and
hyphenation by default: books, magazines, scientific papers. The big exception
is websites.

------
virb
Is this really a problem? In how many texts does this occur? It seems quite
unlikely. Show me a book where this has happened.

~~~
kenko
You are unlikely to see many books where this has happened, because
professionals try to make sure that it doesn't, and fiddle with things if it
does.

~~~
mistercow
I'm pretty sure I've seen it in plenty of books, and it's pretty distracting.

~~~
jamesjporter
Probably the result of someone at the publisher not being willing to pay for
good graphic design. My dad is a graphic designer who typesets a lot of books;
they really do go over every page, fiddling and optimizing to make it look
good.

