
Edge cases to keep in mind when working with text - submiter_dor
https://www.thedroidsonroids.com/blog/edge-cases-to-keep-in-mind-part-1-text
======
ewjordan
The Turkish situation referenced
([http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two...](http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail))
is not an indictment of bad tech, but of a fucked up honor-based patriarchal
culture.

"Ramazan went to the family's home to apologize, only to be greeted by the
father, Emine, two sisters and a lot of very sharp knives."

There's no technological way to fix people that would try to kill someone over
a text misunderstanding without figuring out the truth first. People like this
are garbage-people murderers, let's not blame tech mistakes for the fact that
some people are scum. Everyone involved knew damn well that a couple
characters would make the difference between a benign text and an offensive
one, and frankly, even if the text _was_ offensive, murder was not justified.
Scum.

~~~
nine_k
OK, an overreaction like this may be a cultural problem.

But even without a problem like that, imagine a case when a wrong drug is
administered to a patient, with deadly consequences, or a wrong turn is taken
by a motorist, leading to a collision.

To avoid this, things should be written in an unequivocal way. But for that,
one has to _realize_ how expensive a "negligibly small" mistake can be.

~~~
gumby
> ...imagine a case when a wrong drug is administered to a patient, with
> deadly consequences...

Out of band solutions are used to address these problems more comprehensively.

The FDA regulates brand names for drugs for exactly this reason: avoid
confusion and ambiguities.

This is also one of the reasons for pharmacists, especially in hospitals.
Mouse-clicking in a pull down list can easily select the wrong drug for the
patient.

------
brudgers
_Text (aka strings) exists in virtually all software projects_

For me, distinguishing between text as something that is intended to be read
by humans and strings as serial sequences of characters that may or may not be
human readable but will be processed by one or more computing automata is
useful. For example in C, the string "Hello World" is terminated by a null
character. The null character is not part of the text the string encodes.

Or to put it another way, I find that treating strings and text as two
different layers of abstraction clarifies my intent. Code that manipulates
text is built on code that manipulates strings and in between there's parsing
that has to occur.
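
A minimal sketch of that layering, in Python rather than C. The `parse_c_string` helper is hypothetical and only stands in for the parsing step; it assumes ASCII content and a single NUL terminator.

```python
# Text: what a human is meant to read.
text = "Hello World"

# String in the C sense: the bytes a program stores, terminator included.
c_string = text.encode("ascii") + b"\x00"


def parse_c_string(buf: bytes) -> str:
    """Hypothetical parsing step: recover the text from a NUL-terminated buffer."""
    end = buf.find(b"\x00")
    if end == -1:
        raise ValueError("missing NUL terminator")
    return buf[:end].decode("ascii")


print(len(text))                          # 11 characters of text
print(len(c_string))                      # 12 bytes of string
print(parse_c_string(c_string) == text)   # True
```

The extra byte belongs to the string layer only; it never shows up in the text.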

~~~
FabHK
Wouldn't you want to call that "string" versus "bytes" (instead of "text"
versus "string")? (That's the Python parlance, if I'm not mistaken, and it
seems good to me.)
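
For reference, a small sketch of how Python 3 draws that line. The sample word is only an illustration, and UTF-8 is an assumed encoding.

```python
s = "café"                # str: a sequence of Unicode code points
b = s.encode("utf-8")     # bytes: one particular serialization of it

print(len(s))             # 4 code points
print(len(b))             # 5 bytes, because "é" takes two bytes in UTF-8

# Round-tripping through the right codec gives the str back.
print(b.decode("utf-8") == s)   # True

# Cutting the bytes in the middle of a character is not valid text any more.
try:
    b[:4].decode("utf-8")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
print(truncated_ok)       # False
```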

~~~
rocqua
In that case, which is HTML? Because it's meant for machines to read, it
should be treated more carefully than pure 'text'. However, HTML shouldn't be
processed like bytes either.

I suppose with HTML the real issue is that there is human readable text in
there.

~~~
nothrabannosir
HTML is a serialisation of a DOM. It's not a sequence of text, but a semantic
tree of it. Both encoded as sequential bytes, but different concepts when
decoded.
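
A rough sketch of that distinction, using Python's standard `html.parser`. The `TreeBuilder` class and its dict-based nodes are made up for illustration; it ignores attributes and void elements such as `<br>`, so it is nowhere near a real DOM.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Illustrative only: builds a nested dict tree from well-formed markup."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1]["children"].append(data)


def parse(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


first = "<p>Hello <b>World</b></p>"
second = "<p>Hello <b >World</b></p>"   # different bytes, same structure

print(first == second)                  # False: the serialisations differ
print(parse(first) == parse(second))    # True: the decoded trees match
```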

------
dvfjsdhgfv
I wish more software developers kept these things in mind. At one of my
customers I worked on interfacing their online store with several other
software components. The store was the only piece able to handle the names of
customers (from different parts of the world) correctly. All the rest failed
at some point. There are so many additional aspects you discover only when you
actually work on these things.

~~~
jccalhoun
My first name is hyphenated. I run into online forms all the time that don't
accept my name. Or won't accept it because it is "too long."

~~~
gumby
Indeed when Brian Fox (of bash fame) implemented the name field in finger he
increased it specifically so my entire name would fit.

(After I married, his effort was wasted.)

------
bloaf
See also: ordering text with numbers in it.

I've worked with several programs that present results as

"Thing number 1"

"Thing number 10"

"Thing number 2"

...

Which IS of course alphabetical order, but it is also the _wrong_ order for
any user-facing list. This is most likely a symptom of prematurely
stringifying and passing around information as strings.
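
If the numbers are already baked into the strings, one common workaround is a "natural" sort key. A minimal sketch follows; `natural_key` is a name invented here, and it makes no attempt at locale-aware collation, negative numbers or decimals.

```python
import re


def natural_key(s):
    # re.split with a capturing group alternates non-digit and digit runs,
    # so the odd positions are always the digit runs. Compare those as ints.
    parts = re.split(r"(\d+)", s)
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]


items = ["Thing number 10", "Thing number 2", "Thing number 1"]

print(sorted(items))
# ['Thing number 1', 'Thing number 10', 'Thing number 2']

print(sorted(items, key=natural_key))
# ['Thing number 1', 'Thing number 2', 'Thing number 10']
```

Keeping the number as a number and only formatting it at display time avoids needing this at all.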

~~~
j_s
Raymond Chen's Old New Thing covered a similar issue, contrasting NTFS vs.
Windows Explorer:

[https://blogs.msdn.microsoft.com/oldnewthing/20050617-10/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20050617-10/?p=35293)

He referenced Michael Kaplan's MSDN blog about how Windows Explorer sorts,
which took me a while to find:

[http://archives.miloush.net/michkap/archive/2006/09/30/77834...](http://archives.miloush.net/michkap/archive/2006/09/30/778345.html)

And there is a KB article about differences between Windows versions:

[https://support.microsoft.com/en-us/help/319827/the-sort-ord...](https://support.microsoft.com/en-us/help/319827/the-sort-order-for-files-and-folders-whose-names-contain-numerals-is-d)

The function, StrCmpLogicalW, is available for use:

[https://msdn.microsoft.com/en-us/library/windows/desktop/bb7...](https://msdn.microsoft.com/en-us/library/windows/desktop/bb759947.aspx)

