

Should String Be An Abstract Class? - danbjson
http://appsandsecurity.blogspot.com/2013/05/should-string-be-abstract-class.html
Security researcher John Wilander makes a case that String should be an abstract class. His analysis comes from the crossroads of Domain-Driven Design and application security. He is not only doing research; he also writes production code on an everyday basis, so he knows what he is talking about.
======
DanielBMarkham
98% of this article is a discussion of the use of strings in HTTP headers. The
last 4 or 5 sentences try to generalize the conclusion to strings at large.

I feel a bit misled by the title. It would have been nice to have 3 or 4
examples, and a much larger argument that strings in general should be
abstract.

However, even with all of this underpinning, it doesn't work. This is like an
argument against integers because integers can express so many things: age,
number of arms, days of the year, and so forth. Base types can cover a lot of
things! Film at eleven.

The discussion should have been about proper API construction, and when to
subclass. Or even particular problems with HTTP Header APIs in Java -- I
really liked that part. If the author had stuck with it, I wouldn't have felt
a bit bamboozled.

~~~
dmethvin
> This is like an argument against integers because integers can express so
> many things: age, number of arms, days of the year, and so forth.

Yes, and just like the strings case it shows how useless most primitive types
are from a program correctness and documentation standpoint. Languages like
C++ and Java encourage programmers to treat them as mystical bags-of-holding
and forget about their limitations or relation to the data being processed.

------
haberman
Gah. The very interesting question that the author asks is whether we should
use the type system to help us enforce validation constraints, and in
particular, security constraints -- a great idea that I've seen discussed
elsewhere. But unfortunately he takes a wrong turn and assumes that this
implies a subclassing relationship, where these validated string classes _derive
from_ an abstract string class. This makes no sense from an OO perspective;
nothing about a string class is "abstract," and it's better to favor
composition over inheritance.

------
dougk16
Most "stringly-typed" systems seem inherently quirky to me. In something like
JavaScript, everything else is so loose, it's like why not use strings, but
I've always wondered why things like HTTP headers couldn't be handled using
enums. I think it would cover a lot of the cases mentioned in the article
without having a weird-at-first-glance abstract string paradigm to deal with.

As a practical matter, I think enums couldn't be used in the Java spec because
it was written before enums were available in the language. Enums also suffer
from extensibility issues, but that could be solved by having stringly-typed
overrides. I've also been contemplating the possible merits of extensible
enums for solving issues like this. So for example you could have an enum type
that extends the HttpHeader enum. The ordinality would be weird if there were
multiple sub-enums, but it might not matter for most practical cases. I can
think of a few language-level ways to deal with the ordinality issue too.
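
A rough Java sketch of the enum-plus-override idea described above. `HttpHeader` and `Headers` are hypothetical names invented for illustration and are not part of the servlet API; the overload taking a plain `String` is the stringly-typed escape hatch for headers the enum does not list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical enum of well-known header names (illustration only).
enum HttpHeader {
    CONTENT_TYPE("Content-Type"),
    CACHE_CONTROL("Cache-Control");

    private final String wireName;

    HttpHeader(String wireName) {
        this.wireName = wireName;
    }

    String wireName() {
        return wireName;
    }
}

// Hypothetical header collection with a typed path and a string override.
final class Headers {
    private final Map<String, String> entries = new LinkedHashMap<>();

    // Typed path: the compiler only accepts names the enum declares.
    void add(HttpHeader name, String value) {
        entries.put(name.wireName(), value);
    }

    // Stringly-typed override for custom headers missing from the enum.
    void add(String customName, String value) {
        entries.put(customName, value);
    }

    String get(String name) {
        return entries.get(name);
    }
}
```

With this shape, `headers.add(HttpHeader.CONTENT_TYPE, "text/plain")` cannot contain a misspelled name, while `headers.add("Custom-Language", "en")` still works for one-off headers.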

------
R_Edward
While I agree with the author's sentiment, I think it is a dangerous practice
to assert that "Nothing is any of 100,000 characters and anything between 0
and 2 billion in length." By all means, let's impose more rigorous structure
on data items that require it, like HTTP headers, but why impose artificial
restrictions on things that don't need it?

No single /thing/ might be any of 100,000 characters and between 0 and 2
billion in length, but a group of /thing/s that share enough similarities as
to be functionally identical might very well subsume most of those 100,000
characters and have no intrinsic limits on length. If I have learned anything
in my quarter century of developing software, it's that the moment I impose an
artificial restriction on my data, I will find an item that violates the
restriction and now requires special-case handling to do its job.

~~~
plorkyeran
There's also a really obvious example of a thing that might be any of 100,000
characters and between zero and 2 billion characters in length: the contents
of a plain text file.

~~~
lmm
Right, but is that a genuine use case? If you're writing an editor you
probably want it to have a stronger notion of the data representations people
might want to edit. If you're just considering "user-supplied free text"
that's probably constrained away from certain characters.

------
Camillo
I get the impression that the author doesn't really understand why he's doing
what he's doing. The obvious solution in this case is to have addHeader
validate its arguments. The reason why you might want to have a separate class
that does validation on construction is if you were going to end up validating
the string multiple times: then you can just validate once in the constructor,
and use the type system to ensure that each header string has been validated
at least once.

But is that a realistic concern in this case? The author proposes replacing
this:

    
    
        response.addHeader("Custom-Language", 
                       request.getParameter("lang"));
    

With this:

    
    
        response.addHeader(new HttpHeaderName("Custom-Language"), 
                       new HttpHeaderValue(request.getParameter("lang")));
    

This gives absolutely zero benefits over just doing the validation in
addHeader. Every string is still being validated the same number of times, but
now the programmer is less productive and the code is more cluttered because
of the added boilerplate.

There might be a slight advantage in saving the constant header name
somewhere, and reusing it:

    
    
        HttpHeaderName customLanguageHeader = new HttpHeaderName("Custom-Language");
        ...
        response.addHeader(customLanguageHeader, 
                       new HttpHeaderValue(request.getParameter("lang")));
    

But that's even uglier. What might be nice to have is a way to run the
validation at compile time for static strings, so that you can save validation
passes at runtime while still being able to use the literal "Custom-Language"
at the location of the addHeader call. Automatic casting would also be nice,
so that you can just pass the naked string for an HttpHeaderName parameter and
the compiler will insert the constructor. But I don't think you can do these
things in Java.

For this particular language and this particular problem, it may be best to
have the Response object validate all headers while it's being serialized.
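
For reference, a minimal sketch of what the validate-once-in-the-constructor wrappers discussed above might look like. The class names follow the article's `HttpHeaderName` and `HttpHeaderValue`, but the bodies and the validation rules are assumptions made for illustration (a simplified name pattern, and CR/LF rejection for values), not the article's actual code.

```java
// Hypothetical wrapper: an instance can only exist if the name passed validation.
final class HttpHeaderName {
    private final String value;

    HttpHeaderName(String value) {
        // Assumed, simplified rule: a non-empty run of letters, digits, and hyphens.
        if (value == null || !value.matches("[A-Za-z0-9-]+")) {
            throw new IllegalArgumentException("invalid header name");
        }
        this.value = value;
    }

    @Override
    public String toString() {
        return value;
    }
}

// Hypothetical wrapper for header values.
final class HttpHeaderValue {
    private final String value;

    HttpHeaderValue(String value) {
        // Assumed rule: reject CR and LF, the characters used to inject extra headers.
        if (value == null || value.indexOf('\r') >= 0 || value.indexOf('\n') >= 0) {
            throw new IllegalArgumentException("invalid header value");
        }
        this.value = value;
    }

    @Override
    public String toString() {
        return value;
    }
}
```

The validation still runs once per string, as the comment says; what the types add is that a method taking `HttpHeaderName` cannot be handed an unchecked `String` by mistake.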

~~~
lmm
If you do want to do this kind of thing in Java you can use the JSR 308
Checker Framework.

------
peter-fogg
This reminds me quite a bit of this article:
[http://blog.moertel.com/posts/2006-10-18-a-type-based-
soluti...](http://blog.moertel.com/posts/2006-10-18-a-type-based-solution-to-
the-strings-problem.html), although Haskell's type system is likely much
better for expressing this sort of constraint.

------
strictfp
String shouldn't be abstract, just like ArrayList shouldn't be abstract. Why?
Because you should use composition, not inheritance. You should introduce
types for your concepts in your code to avoid these types of problems. But it's
always tempting to avoid it in order to simplify the design. It's always a
tradeoff, but personally I think more use of typing would be good in most
projects that I've seen.

------
macspoofing
Every primitive type is a problem. General purpose programming languages are
too "general-purpose" to be useful. Java will happily let you add two integers
even though one may represent a quantity in metric and the other in imperial.

Invariably you should create your own DSL-like layer to organize your code and
enforce consistency and correctness.
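
A small sketch of that kind of layer for the metric/imperial example, assuming two hypothetical wrapper classes, `Meters` and `Feet`. Adding a `Feet` to a `Meters` directly does not compile; the caller has to convert explicitly.

```java
// Hypothetical unit wrapper: a length in meters.
final class Meters {
    final double value;

    Meters(double value) {
        this.value = value;
    }

    // Only another Meters can be added; passing a Feet is a compile-time error.
    Meters plus(Meters other) {
        return new Meters(value + other.value);
    }
}

// Hypothetical unit wrapper: a length in feet.
final class Feet {
    final double value;

    Feet(double value) {
        this.value = value;
    }

    // One international foot is defined as exactly 0.3048 meters.
    Meters toMeters() {
        return new Meters(value * 0.3048);
    }
}
```

Usage would look like `new Meters(1.0).plus(new Feet(10.0).toMeters())`, which makes the conversion visible at the call site.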

------
overgard
F# has this concept of allowing units on numbers
(<http://msdn.microsoft.com/en-us/library/dd233243.aspx>). So instead of just
having "5" you could have "5 meters" or "5 dollars". It doesn't fundamentally
change the number type, it just gives you some metadata to work with.

Making a subclass for every string type strikes me as being a bit heavy, since
usually you don't necessarily want new string behavior, you just want to
classify it and define conversions.

I think it would be neat if (with language support obviously) instead of
overloading the class, you could instead just specify "units" for a string,
and conversions between those units.

~~~
lmm
> Making a subclass for every string type strikes me as being a bit heavy,
> since usually you don't necessarily want new string behavior, you just want to
> classify it and define conversions.

Many languages can optimize such a construct away so that it only exists at
compile time, or have another construct that declares "represented as type X,
but treat it as a different type at compile time" (e.g. Haskell's newtype)

------
snprbob86
Making string abstract or creating new string subtypes does not solve the
underlying problem: Your type system isn't good enough.

Now when I say this, let me be clear: I'm talking about _everybody's_ type
systems. That goes for you Haskell-ers too. And the post-Haskell rocket
surgeons doing crazy advanced typing insanity.

Type systems are _abstractions_. The ones you are used to, like the Java-ish
OOP ones or the ML/Haskell functional ones, are designed to detect and prevent
a wide variety of programming errors, while also enabling analysis that will
improve execution performance. However, there are many sorts of "type systems"
that solve different problems.

- There are tree schemas for validation and data generation.

- There are database schemas for indexing and query planning.

- There are grammars for parsing and validating languages.

- There are contracts for ensuring preconditions, postconditions, and
invariants.

- The list goes on for quite a while.

The idea that you can assign a single named type in a single kind of type
system to a value and 100% verify the correctness of your software is just
bogus. You need optional and pluggable type systems, so that you can bring a
particular type system to meet a particular problem.

------
6ren
In Pascal, you could have user-typed primitives, with conversion code for
casts. For example, you could have _celsius_ and _fahrenheit_ float types. If
you used a _celsius_ when _fahrenheit_ was wanted, the conversion code would
automatically be called.

I saw an article years ago (by Joel?) about applying this mechanism to html-
encoding: so if you used an _unescaped_ type where an _escaped_ type was
required, it was automatically converted. This provided security against
injection attacks.

Similarly, you can have validated and unvalidated types. If your
libraries/frameworks already used these, there's almost no work left for you
to do.

Of course, Java doesn't have user-typed primitives, nor type-conversion code.
In XSD datatypes you can specify a string type in terms of a regular
expression [http://www.w3.org/TR/2004/REC-
xmlschema-2-20041028/datatypes...](http://www.w3.org/TR/2004/REC-
xmlschema-2-20041028/datatypes.html#regexs)

------
venomsnake
This reminded me of Joel's leaky abstractions. And you could hardly find
anything leakier than strings.

 _Amusingly, the history of the evolution of C++ over time can be described as
a history of trying to plug the leaks in the string abstraction. Why they
couldn't just add a native string class to the language itself eludes me at
the moment._

------
boomlinde
Integers are also rarely "just integers" -- checksums, hashes, bitmasks,
counters, numbers -- you name it.

------
benjiweber
You could use the CharSequence interface
[http://docs.oracle.com/javase/6/docs/api/java/lang/CharSeque...](http://docs.oracle.com/javase/6/docs/api/java/lang/CharSequence.html)
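
To illustrate the `CharSequence` route, here is a sketch of a validated type that implements the interface, so it can be passed to any API accepting `CharSequence` without being a `String`. The `ZipCode` class and its five-digit rule are assumptions chosen for illustration only.

```java
// Hypothetical validated type that is usable wherever a CharSequence is accepted.
final class ZipCode implements CharSequence {
    private final String digits;

    ZipCode(String digits) {
        // Assumed rule for this sketch: exactly five ASCII digits.
        if (digits == null || !digits.matches("[0-9]{5}")) {
            throw new IllegalArgumentException("not a five-digit ZIP code");
        }
        this.digits = digits;
    }

    @Override
    public int length() {
        return digits.length();
    }

    @Override
    public char charAt(int index) {
        return digits.charAt(index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        // Note: a slice is a plain CharSequence, not a validated ZipCode.
        return digits.subSequence(start, end);
    }

    @Override
    public String toString() {
        return digits;
    }
}
```

One limitation of this approach is that methods declared to take `String` still will not accept it, which covers a large part of the standard library.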

------
dllthomas
Should int be an abstract class?

Actually, in my recent C I've taken to wrapping most ints in one-member
structs, so the compiler will catch it when I pass a foo id where I meant to
pass a quantity...

~~~
mikeash
I love this technique. It can catch so many errors. It's also easily
extensible to slightly more complicated cases. For example, I was working on
some gnarly code that was working with a bunch of times with different epochs
(e.g. time since startup on the local computer, time since startup on a remote
computer, and time since the UNIX epoch). Rather than try to remember what was
what, or try to painfully encode it in variable names, I simply wrote a struct
that contained the number of seconds and an enum indicating what epoch it
used. Then any operation on a pair of times (deltas, comparisons) got factored
into a function that asserted the time bases of the two times were compatible.

It's interesting how little use this seems to get in C in general.

~~~
dllthomas
That sounds more like a tagged union, which are great for dynamic typing when
that's needed, but is a somewhat different technique.

~~~
mikeash
Not at all. It's just a struct with two members:

    
    
        struct Time {
            double val;
            enum Epoch epoch;
        };
    

It works just like one-member structs, in that it can be treated as a single
value, passed and returned by value when calling functions, and keeps you from
accidentally mixing different kinds of values. Having the "epoch" field along
for the ride just means you can add some additional smarts.

~~~
dllthomas
So yes, like I was thinking; a tagged union - the epoch field determines the
interpretation of the val field. The full application of "wrap values in
single element structs for better static guarantees" would be to have a
different time struct for every epoch. This is a (possibly quite useful) step
back from that, since C's lack of polymorphism would mean a need to implement
every time function for every epoch even when the logic is the same.

~~~
mikeash
I guess that makes sense now that you explain it. Conceptually, each epoch
value results in a different type for the other field, meaning it works like a
union, even though it's actually implemented using the same primitive type for
each.

~~~
dllthomas
Right, exactly. It's just the fact that all inhabitants happen to share a
representation that lets you avoid the syntactic union.

------
valtron
Shouldn't the spec define a way to _encode_ the name/value in such a way that
any string can be used as a header name/value? Then `addHeader` is responsible
for handling encoding. For example, if I had a method that takes a name/value
argument and converts them into "name=value" syntax to use in a URL query, I
would consider it a bug if it didn't url-encode name and value.
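
The query-string case can be sketched in Java with the standard `java.net.URLEncoder`. `QueryParam` and `encode` are hypothetical names; the `URLEncoder.encode(String, Charset)` overload assumes Java 10 or later, and it applies `application/x-www-form-urlencoded` rules (a space becomes `+`).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the method, not the caller, is responsible for encoding.
final class QueryParam {
    private QueryParam() {
    }

    // Builds "name=value" with both parts form-encoded, so any string is safe to pass.
    static String encode(String name, String value) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "="
                + URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

For example, `QueryParam.encode("lang", "en&admin=true")` yields `lang=en%26admin%3Dtrue`, so the injected `&` and `=` cannot start a second parameter.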

------
hcarvalhoalves
While I'm not sure String should be abstract, I see what the author is
implying. The idea is that having String abstract would force the programmer
to come up with appropriate data structures more often than not. There's
definitely some truth to it.

------
benmmurphy
if you have another string type that has a limited character set then it is
not a subclass of string because it breaks the Liskov substitution principle.
in java land allowing user code to subclass string would create massive
security issues in the sandbox. i think there is a good case for having
different types of strings and providing runtime support so this is cheap but
this is not subclassing.

------
bborud
Definitely not. If you need a specialized type: make one instead of bodging
existing basic types that are already complex enough as they are.

------
lifeisstillgood
But I am a bit mystified - I assume that in Java I cannot subclass String?
Which seems to be the whole argument.

Well, you know what my answer to that will be :-)

(To be fair I have just written a session-cache-management system for Python
and did not subclass anything much. I suspect I should look into that :-)

~~~
nightpool
No, the point of the argument was that people should be FORCED to subclass
String. Which is kinda pointless, as it's easy for lazy programmers to
circumvent, and there ARE things that are fit to be represented by just
Strings. But I do agree with the author that it should be either subclassed or
run through a validator function in a lot of cases.

------
jkulmala
String is immutable for good reasons. It's explained in Joshua Bloch's
Effective Java, which should be mandatory reading for all Java programmers.

~~~
danbjson
Agreed. I cannot see anyone in this thread (blogpost or comments) that has
argued for the opposite "string should be mutable".

~~~
jkulmala
Well, immutable classes are final. Thus, suggesting to make it abstract does
suggest the opposite, IMHO.

------
sultezdukes
The problem in languages like Java, C#, etc. is that best practice would be to
wrap up the string in a class, but that's a "heavyweight" solution and
developers are lazy. But these "stringy" bugs can be nasty when the wrong
string is passed into the wrong argument.

In some languages you can at least "typedef" or "type alias" strings so that
in your type declarations and argument types, it's the typedef and not just a
raw string that is being passed in.

e.g. (made up pseudocode)

    typedef ZipCode as string;
    State LookupState(ZipCode zip);

So with solutions that don't involve wrapping everything in a class, you can
incrementally develop your type by first doing the "typedef" and then adding
functions.

So that's the problem with languages like Java. You develop an "I don't need a
whole class for this" attitude because you can't incrementally develop your
type safety.

------
michaelochurch
Short answer: no.

Yes, the concept of _String_ is problematic. It's an overloaded one that
people have variously mapped to:

    
    
        a. An array of bytes. (C char is a byte.)
        b. "Words", from a (possibly fuzzy) set of 2-100k specific strings from natural language. 
        c. Arbitrary arrays of characters. 
        d. Arbitrary arrays of *printable* characters.
        e. Compact representations of abstractions, e.g. regexes which represent functions on strings. 
    

These have conflicting needs. For (a), most seasoned programmers have learned
the hard way of the need to separate byte[] from String as concepts, due to
Unicode and encoding and various nasty errors you get if you confuse UTF-8 and
UTF-16; but also because random access into a byte[] of known structure is
often a fast way of getting information while random access into a String is
generally inferior to regex matching.

Regarding (b), what you sometimes end up wanting is a symbol type (or, in
Clojure, keywords) that gives you fast comparison. You might also want
something that lives at a language level (rather than runtime strings) like an
enum or tagged union (see: Scala, OCaml) to get various validation properties.

Regarding (e), I think everyone agrees that regexes belong in their own type
(or class).

Where there's some controversy is (c)-(d). There are over a million supposedly
valid code points in non-extended Unicode, but only about 150,000 of them are
used, and some have special meanings (e.g. endian markers). UTF-8/16 issues
get nasty quick if you don't know what you're doing. What all this means is
that you can make very few assumptions about an arbitrary "string". You might
not even have random access (see: UTF-8/16)! (Although a strong argument can
be made that if you need random access into something, a string isn't what you
want, but a byte[]. Access into strings is usually done with regexes, not
positional indices, for obvious reasons.)

As messy as Strings are over _all_ use cases, the thing about them is that
they _work_ and also that they're a fundamental concept of modern computing in
practice. We can't get rid of them. We shouldn't. Making them an abstract
class I don't like, for the same reasons as most people would agree that
making Java's String _final_ was the right decision. (Short version:
inheritance mucks up .equals and .hashCode and breaks the world in
hard-to-detect ways.)

What we do however need to keep in mind is that when we have a String, we're
stuck with something that's meaningless without context. That's always true in
computing, but easy to forget. What do I mean by "meaningless without
context"? There's almost nothing that you know about something if it's a
String.

On the other hand, if you have a wrapper called SanitizedString (some static-
typing fu here) that _immutably_ holds a String and the only way to get a
value of that type is to pass a String through a SQLSanitize function, you
know that it's been sanitized (or, at least, that the sanitizing function was
_run_ ; whether it's correct is another matter). But this isn't a case of
inheritance; it's a wrapper. You can use this to strengthen your knowledge
about these objects (a function String -> Option[SanitizedString] returns
Some(ss) only if the input string makes sense for your SQL work).
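
A Java sketch of that wrapper, using `Optional` in place of the `Option` in the signature above. The private constructor means the factory is the only way to get an instance. The accepted character set is a placeholder assumption for illustration, not a real SQL-safety rule.

```java
import java.util.Optional;

// Hypothetical wrapper: immutably holds a String that passed the check below.
final class SanitizedString {
    private final String value;

    // Private, so the factory method is the only way to obtain an instance.
    private SanitizedString(String value) {
        this.value = value;
    }

    // Returns a value only when the input passes the (assumed) check.
    static Optional<SanitizedString> of(String raw) {
        // Placeholder rule: letters, digits, underscores, and spaces only.
        if (raw != null && raw.matches("[A-Za-z0-9_ ]*")) {
            return Optional.of(new SanitizedString(raw));
        }
        return Optional.empty();
    }

    String value() {
        return value;
    }
}
```

A function that accepts `SanitizedString` then knows the check was run, without inheriting anything from `String`.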

Inheritance I dislike because it tends to weaken knowledge. I think it's the
wrong model, except for a certain small class of problem. What good there is
in inheritance is being taken over by more principled programming paradigms
(see: type classes in Haskell, protocols in Clojure).

~~~
pornel
BTW: there is no such thing as generic sanitized string. There may be SQL-
escaped string, HTML-escaped, JS-escaped, JS-in-HTML-in-SQL-escaped, etc. It
always depends on context (I'm going to invent a format that uses ASCII 'a' in
an escape sequence: sanitize that! ;)

~~~
kruhft
Base64 encode a string and it's generically sanitized. They're a bit difficult
to read with the naked eye though.

~~~
damncabbage
Unless it's in a URL. (This is why URL-safe Base64 versions exist... Which can
then in turn be inappropriate for other places.)

~~~
mikeash
And base64 can use the / character, so it's unsafe for POSIX filenames.
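
The difference between the two alphabets is easy to see with the JDK's `java.util.Base64`. `Base64Demo` is a hypothetical helper name; the two input bytes used in the example were picked because they map onto the last two characters of the alphabet.

```java
import java.util.Base64;

// Hypothetical helper contrasting the standard and URL-safe Base64 alphabets.
final class Base64Demo {
    private Base64Demo() {
    }

    // Standard alphabet: may contain '+' and '/'.
    static String standard(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // URL-safe alphabet: uses '-' and '_' in place of '+' and '/'.
    static String urlSafe(byte[] data) {
        return Base64.getUrlEncoder().encodeToString(data);
    }
}
```

For the bytes `0xFB 0xFF`, `standard` returns `+/8=` and `urlSafe` returns `-_8=`: the first is a problem in URLs and POSIX filenames, and the second still carries `=` padding that some contexts dislike.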

------
sultezdukes
For me, xml and other types of config files can be hard to debug because
everything is a string. IDE/smart editors help out a lot when you have types.

~~~
epochwolf
XML has data types. You just have to load the schema.

~~~
sultezdukes
Yeah, the problem is that the values are strings.

