
Making Wrong Code Look Wrong - ColinWright
http://www.joelonsoftware.com/articles/Wrong.html
======
raganwald
Folks, please do not fall into the simple error of arguing whether Haskell’s
typing beats Hungarian notation. This would be bikeshedding. The safe/unsafe
strings are given as an example, the larger point he is making is about using
coding conventions to highlight potential semantic errors.

Every program is going to have some kind of semantic problem that cannot
easily be fixed with compile-time analysis. Unsafe strings in web applications
are merely one example for one class of programs written in one class of
languages. If you grasp what he’s saying about coding conventions, ignore the
example.

So to his actual point: Do you agree with my contention that there are always
going to be some semantically wrong bits of code that cannot be automatically
detected by the compiler, or that it is awkward to do so?

If so, do you think coding conventions help more than they hinder? In other
words, do they increase code safety without imposing too much of a burden on
readability?

~~~
kisielk
Theoretically in a strongly and statically typed language like Haskell
shouldn't it be possible to fix the safe/unsafe string problem by having
separate a string type for unsanitized input?

Not that I've tried this approach myself, but I'd be curious to hear from
anyone who has.

~~~
eru
Yes, that's possible in Haskell. And it's actually one of the canonical
examples.

And even in C you could wrap your (char*)s up in different structs. The typing
overhead would be higher than in Haskell, though.

(There's even a language extension to allow you to overload string literals.
But you probably don't want that, because

    
    
        s = TaintedString "My string literal"
    

is easier to read than:

    
    
        s :: TaintedString
        s = "My string literal"
    

The overloading makes more sense for implementation details like using a
different String implementation like ByteStrings instead of a linked list of
Characters. You actually want to be able to see the semantic differences.)

~~~
masklinn
Re. second one, wouldn't it also be possible to write

    
    
        s = "My string literal" :: TaintedString
    

?

It's not shorter than the constructed version but it should work, you
sometimes need it to get the "right" kind of number out of a literal when the
type inference gets confused (or more commonly in the REPL)

~~~
eru
Yes, that's possible with the right compiler extensions. For this example it's
essentially the same thing as giving the type on its own line, so I chose the
more common syntax.

------
jws
My combinatorial explosion detector is going off.

Encoding a 'u' into variable names for "unsafe" strings and omitting it for
ones that have been encoded safely for HTML bodies is nice. If this is the
only problem to solve and you can universally apply this technique then it
solves the "NUL terminated string of the 2010s", and I may use it but…

What about encoded safely for URLs?

What about encoded safely for use as a Unix file name?

What about encoded safely for SQL? (yes, don't do that, I know).

How about encoded safely for 7 bit ASCII only applications?

How about the output of base64? It isn't unsafe for HTML, but it is for URLs.

What about that bit of HTML that the user entered using their WYSIWYG field
that needs to be sanitized but not encoded? 'u' seems right, but Encode(uVar)
isn't the proper handling.

How about that library that doesn't know the convention, so its function calls
all look like they are encoded safely for HTML since they don't start 'Us'?

What about the library that used 'u' for UTF-8?

(Oh, and…

    
    
      dosomething()
      cleanup()
    

… is broken. Requiring a caller of dosomething() to remember to cleanup() is
about as polite as leaving set bear traps in your living room furniture and
reminding guests to check under the cushions before they sit. Programmers are
used to that sort of abuse (Remember to fclose() what you successfully
fopen()), but the future should really provide use with better constructs, and
it not being 1980 anymore, we are the future.)

~~~
al_james
Yes, exactly, the meaning of 'safe string' depends on its context.

I personally believe that all code that is sensitive to escaping issues,
should be escaped on output (or storage / sending to the database / whatever)
by default unless you explicitly opt it out.

In our framework, we maintain 'already escaped' strings as a separate class,
forcing the developer to acknowledge this fact. The HTML output layer escapes
all strings unless they are instances of this safe class. Similarly for the
SQL layer, all code gets quoted unless it was explicitly marked as 'safe'.

~~~
eru
Do you treat SQL queries as strings in your code, or do you build something
like an abstract syntax tree?

~~~
al_james
Build them using a syntax builder framework, but it is _possible_ to pass raw
SQL strings in in _exceptional_ circumstances. Obviously, the developer needs
to be totally aware of the risks of this. Not ideal, but needed to solve a
couple of problems.

------
kolektiv
Mmmm, but there's more than one way to skin a cat. To take just the point
about safe and unsafe strings - well, you could do that with naming
conventions, sure. But then you've got a peer review issue. It helps, but it's
not _great_. So the next thing (in a language which lets you use types in this
way) is to create a type of UnsafeString, or SafeString etc. And your language
doesn't even have to be that great to let you do this - just having a simple
type system. And now you change your write method so it doesn't take a String
anymore, it takes a SafeString, and only a SafeString. And you make it so that
constructing a SafeString can only be done by Encoding at some point.

Now you have something which can still look clean (how clean may depend on
your language, type inference, etc.) without bringing in messy pseudo-
Hungarian-ness, which stops me doing the most important thing with my code -
reading it easily.

Obviously it's only a silly and trivial example, but I'm not sure this article
has dated that well in some regards (or perhaps only really applies to certain
types of languages - Visual Basic, for example, pretty much forces you to do
something like Hungarian to maintain any semblance of sanity long term).

~~~
eftpotrm
The downside of that approach though is that you end up wrapping the bulk of
the API functions you're using to make them conformant. You've got a huge
amount of extra work creating and maintaining these components, and a _major_
training workload for any new hires so they know why they can't use the
standard options they're used to and what the internal replacements are.

Personally, I'd rather go Hungarian.

Edit - well, yes, you may well be able to do this sort of thing nice and
easily in Haskell, but I'm not sure that going with Haskell to avoid the
training and maintenance overhead of complicating your Java (or whatever) is a
net reduction in workload and hiring difficulty...

~~~
Symmetry
In most languages that support this sort of things, can't you use SafeString
in places where String is expected, as long as you declare SafeString as a
sort of String?

~~~
prodigal_erik
I wouldn't want SafeString is-a String, I would want a conversion from
SafeString to String that removes whatever encoding or escaping was applied.
Otherwise you end up re-encoding values that unbeknownst to you don't need it,
which is why there are thousands of terrible PHP bulletin boards out there
which won't let you use quotes without mangling them.

------
CJefferson
I agree with all of this, except the operator overloading rant. Unless of
course we extend it to forbiding overloading of any function name for more
than one type.

I have done a bunch of mathematical programming in java, and you end up
writing ".add()" a lot. Maybe .add is on a finite field. Maybe it is some
matrices. Maybe complex numbers. I don't see what I am saving over operator
overloading.

Of course, if I used .addFiniteField, .addMatrix, .addComplex, then I can see
the (possible) benefit, of knowing how "heavy" the operation is at a glance.
However, I don't think it is worth the cost, given I lose genericness.

~~~
robfig
It sounds like you would use operator overloading for exactly what it was
meant for (and good at)!

I think the usual complaint is that people use operator overloading for non-
mathematical operations. For example, a guy doing a webapp might overload the
"+" operator for Group and User as a clever way to add a User to a Group. This
turns out to be a terrible idea.

In fact, my eyes glaze over as I read Scala code for this very reason, because
everyone and their mother defines a method with some random array of symbols
because it makes sense to them and they prefer shorthand. It makes the code
totally unreadable.

~~~
scott_s
That's the perennial complaint and worry, but in my experience, people use
operator overloading in C++ for mainly two things: function objects by
overloading operator(), and iterators by overloading the dereferencing
operators and increment operators.

~~~
xyzzyz
At any given place, people use at most 30% of C++ features -- the bad thing is
that these 30% subsets usually overlap only a little. People use operator
overloading in C++ for all the crazy reasons, like I/O, for instance -- for
even better (or maybe worse) examples, see boost, boost::lambda will be a good
starting point.

Good abstractions are good, but operator overloading combined with templates
and implicit casting is not one of them.

~~~
scott_s
_shrug_

People always talk about it what _can_ happen, but I've never seen instances
of operator overloading in C++ that I objected to in a fundamental way.

~~~
xyzzyz
iostream, for one. It's a real pain to deal simultaneously with this fancy
bit-shift I/O syntax on the one hand, and i18n on the other -- fancy syntax is
way inferior to format strings when you want to support multiple languages.

Also, please, take a look at the boost::lambda or other boost libraries --
they have large DSLs written on top of operator overloading which look nice
when you look at them, but are terribly complicated to use and ridiculously
hard to debug, thanks to the completely unintelligible messages that the
compilers produce.

Most probable reason you've never seen it, is that most sane, experienced
people know the danger and avoid it. When there's more than one person working
on a project, it's _way_ more important for the code to be easy to follow and
debug than to look fancy.

~~~
scott_s
I've made heavy use of Spirit, see:
<http://news.ycombinator.com/item?id=2912729>

You are correct in that Spirit is difficult to debug, but that has nothing to
do with operator overloading. That has to do with its extensive use of
template metaprogramming and C++'s lack of concepts
(<http://en.wikipedia.org/wiki/Concepts_(C%2B%2B)>). If they switched the
whole library over to using named member functions, the debugging problems
would still be there. And concepts would go a long way to preventing the
compiler problems.

I've looked at boost::lambda, and decided it was mostly a novelty. How far
towards a real lambada could you get without lambda support in the compiler?
That far, it turns out, which was not far enough. The rules I had to remember
to use it in non-trivial situations was more effort than it required to write
a one-off function object. While I agree boost::lambda is more trouble than
it's worth, I haven't seen or heard of terrible things happening because
people used it.

I also think it's worth noting that many Boost libraries are purposefully on
the edge of what is possible in C++. It's like a refereed sandbox for
experimenting.

I've also had no problems with << and >> as the stream read and write
operators. But, I also have no experience with internationalization. I don't
see any obvious problems, though, what are they?

It's possible I've seen about the same amount of C++ as you (perhaps in some
different areas) and come to a different conclusion.

~~~
xyzzyz
_While I agree boost::lambda is more trouble than it's worth, I haven't seen
or heard of terrible things happening because people used it._

And this is precisely my point: nothing bad could happen because people
haven't used it in serious projects, even though it is _possible_ \-- it's
just more trouble than it's worth. This is actually the whole point of
argument against operator overloading: it is troublesome for both implementors
and users, and the value provided is minor - most people use operator
overloading for math-like types (bignums, complex numbers, matrices) anyway.
On the other hand, it requires (sometimes considerably) more work on the
designers' and implementors' side, and more caution and experience on the
programmer's side.

That said, I'm not against operator overloading myself. However, I'm heartily
against operator overloading in C++, because it interacts really, really bad
with other C++ features (especially implicit casting rules (which are bad
anyway) and manual memory management). Please, see [1] for more information,
it's explained better and wider there than I could do it here.

Regarding bit-shifting I/O, I've never figured out why they even thought it's
a good idea at all. It isn't any more terse. If it's combined with output
formatting it's much harder to figure out what's actually going to be written
than with printf-like functions, because the formatting and expressions are
mixed together. The only advantage I can think of is that it _can_ be a little
faster than parsing the format string every time. But still, you _don't need_
to parse it every time. If the language design allowed it, they could have
introduced something like CL's define-compiler-macro, but I think the utter
mess created by presence of such tool in C++ would be overwhelming -- it's bad
enough as it is, with template "metaprogramming" instead of real
metaprogramming facilities.

The internationalization issues I mentioned make this completely useless: the
usual way to support multiple interface languages is to use something like
gettext for strings in code (see e.g. Qt Linguist for C++ implementation).
Basically, for every string that's supposed to be user visible, you use a
special translating function, i.e. instead of

 _printf("Hello, %s, it's %s!\n", name, day_of_week);_

you write

 _printf(tr("Hello, %s, it's %s!\n"), name, day_of_week);_

This simple approach does not map at all to the bit-shift I/O -- I'm fairly
sure that experienced C++ hackers could create something along the lines of

 _cout << (translate << "Hello, " << name << ", it's " << day_of_week <<
"!\n");_

but this is obviously useless because obviously not all (not even many)
natural languages has syntactical structure of English. Better option would
be:

 _cout << (translate("Hello, %{1}, it's %{2}!\n" << name << day_of_week);_

but again, it's clearly only printf with unnecessarily convoluted syntax.

[1] -- <http://yosefk.com/c++fqa/operator.html>

------
ori_b
Also know as "How to put a second type system into your language". Personally,
I prefer to have the language type system handle it.

------
perlgeek
Nonono, the real solution for the HTML escaping problem is to use a template
system which defaults to escaping all variables. Using a naming convention is
much less robust.

If you need to pass in something that doesn't need to be escaped, you switch
off the escaping explicitly (`<% var name ESCAPE=NONE %>`), and/or you put
HTML strings into a separate type which doesn't get escaped upon substitution.

For example the template engine would return a `HTMLString` instead of a
`String` object, so that it never accidentally encodes twice (as could happen
in the case of includes).

~~~
rlpb
Taking this even further, an HTML templating system that involves supplying
syntactic HTML as strings is fundamentally broken. Ideally you'd never provide
a string that ends up as syntactic HTML. You'd build up a data structure that
represents your output and a serializer would do the rest, escaping everything
correctly since there would be a type-based distinction between HTML structure
and element content. Genshi is an example of a templating system that works on
data rather than syntactic string snippets.

~~~
perlgeek
Live isn't always that simple.

For example sometimes you get a piece of HTML that you have to include
verbatimely for policital reasons (ad banners come to mind that need to be
included character by character because of the Terms of Service - see AdSense
for example).

In that case you really need an option for direct, unescaped HTML inclusion.

Or you might need to hilight the a syntax that's hard to parse, and the only
available hilighter spits out HTML pieces instead of parse trees.

My point is that a template system should be able to deal with such realities,
and it doesn't make it "fundamentally broken".

~~~
nasmorn
Also while you can build a company on the motto "People feeling uncomfortable
with a HTML generation library have no business adding a list tag somewhere"
there are certainly cases where it makes sense for them to do so.

------
olavk
Folks, you are all missing the context of the article. He is writing in
_VBScript_ \- a now obsolete stripped down version of VB. It was created as an
alternative to JavaScript, but was much more limited in its OO capabilities.
VBScript was kind-of-OO in that is allowed you to consume objects, but it
didn't initially allow you to define your own classes.

So you are working in a language which don't allow you to defined custom
types. You can work with built-in types like strings and integer, but if you
want to add "metadata" to these basic values, you can't do it in the type
system. Hungarian notation is a perfectly legitimate solution given those
constraints.

The distinction between "Systems" and "Apps" Hungarian also makes sense in
this context. Systems Hungarian encodes the name of the built-in types. This
made sense in BCPL (where it originated), but is cargo cult in VBScript, where
this is already expressed in the type system. Apps Hungarian allows you to
express "custom types" which is not expressible in the type system of VBScript
and therefore makes sense. But it is cargo-cult programming in any language
which allows you to create custom types or objects. Which is basically any
other modern language.

(The thing about exceptions being "invisible gotos" also makes sense in the
context of VBScript, where exception handling was done through statements like
"On Error Goto..." and "On Error Goto Next".)

------
lucisferre
How in the world are people still making the front page with years old Joel
posts? If you want to read this stuff go visit his blog and read it.

------
gwern
For those curious, the way one would do this in a language with a better type
system (or objects) would be some sort of opaque object/type which you can't
peer into, and which provides a function/method to extract the String within
and which also sanitizes it along the way.

So for Java, you might have a class with the String stored as a private
variable and the obvious getter/setter methods. In Haskell, you'd have a
newtype with similar functions. ( _Can_ this be done in C? I don't know.)

A close approximation to Joel's example (string sanitation) in Haskell with
newtypes: [http://blog.moertel.com/articles/2006/10/18/a-type-based-
sol...](http://blog.moertel.com/articles/2006/10/18/a-type-based-solution-to-
the-strings-problem)

------
Cushman
I've been doing something a little like this in CoffeeScript for a while. For
those who don't know, CS adopts Ruby's style of string interpolation
("formatted #{string}"), and like Ruby, only applies it for double-quoted
strings. My policy is to only ever compose strings using interpolation, and to
only ever write double-quoted strings when they contain interpolation. String
literals are strictly single-quoted.

That way, I know that single-quoted strings, which is most of them, won't ever
involve user input, while double quoted strings get a little extra scrutiny.
It's not XSS protection by itself, but it's a useful little way to increase
the odor of the code one way or another.

------
blakeperdue
Off the topic of unclean code, but there are some great Spolsky articles that
are more business-oriented on Inc's site: <http://www.inc.com/author/joel-
spolsky>

