

The abominable strtok() - whackberry
http://zefonseca.com/blogs/zen/the-abominable-strtok/

======
__david__
First off, strtok() is dead, long live strsep().

Secondly, this is C code--it's pretty low-level. Yes, strsep() messes with
your input. But that is very well documented. If you don't want it to,
strdup() beforehand (or strndup() if you don't trust the input).
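
A minimal sketch of that strdup()-then-strsep() pattern, for anyone who wants to see it spelled out. The helper name `nth_field` and its return convention are made up for illustration; only strdup(), strsep(), and free() are real library calls. The feature-test macro at the top is an assumption about glibc, other platforms may not need it.

```c
#define _DEFAULT_SOURCE 1 /* assumption: asks glibc to declare strsep() and strdup() */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies field number `n` (0-based) of `input` into
 * `out`. Fields are separated by any single character in `delims`.
 * Returns 0 on success, -1 if there is no such field, the field does not
 * fit in `out`, or the working copy could not be allocated.
 * `input` itself is never modified. */
int nth_field(const char *input, const char *delims, size_t n,
              char *out, size_t outsz)
{
    char *copy = strdup(input);   /* strsep() writes NULs into this copy */
    if (copy == NULL)
        return -1;

    int rc = -1;
    char *cursor = copy;          /* strsep() advances this; `copy` is kept for free() */
    char *field;
    size_t i = 0;

    while ((field = strsep(&cursor, delims)) != NULL) {
        if (i++ == n) {
            if (strlen(field) < outsz) {
                strcpy(out, field);
                rc = 0;
            }
            break;
        }
    }

    free(copy);
    return rc;
}
```

Note that strsep() reports an empty field between two adjacent delimiters, so in "foo,bar,,baz" field 2 is the empty string and field 3 is "baz".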

His whole example of strtok() dying on a character constant is stupid--why on
earth would you do that? If you've got a string constant you may as well just
have the constant array and save yourself the parsing headache.

This isn't rocket science.

~~~
dfox
I tend to use and recommend str(c)spn, which although more low-level are more
flexible and often what you want. Also, you don't have to modify the input if
you don't want to.
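
A rough sketch of what that can look like: strspn() skips a run of delimiters, strcspn() measures the token that follows, and the input is only ever read. The `struct token` type and the `next_token` name are hypothetical, invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: a token is a pointer into the original string
 * plus a length. It is NOT NUL-terminated. */
struct token {
    const char *start;
    size_t len;
};

/* Finds the next token at or after *pos without writing to the string.
 * Returns 1 and fills *tok if a token was found, 0 at end of input.
 * *pos is advanced past the token so the call can be repeated. */
int next_token(const char **pos, const char *delims, struct token *tok)
{
    const char *p = *pos + strspn(*pos, delims);  /* skip leading delimiters */
    if (*p == '\0') {
        *pos = p;
        return 0;
    }

    size_t len = strcspn(p, delims);              /* length up to next delimiter */
    tok->start = p;
    tok->len = len;
    *pos = p + len;
    return 1;
}
```

Like strtok(), this version collapses runs of delimiters, so it never yields empty tokens. A token can be printed with `printf("%.*s\n", (int)tok.len, tok.start)`.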

~~~
tptacek
strcspn is really, really fussy. When what you want is the semantic equivalent
of "foo,bar, baz".split(','), I don't think it's worth the potential for
error, or the tangliness of the code.

------
tptacek
I've heard lots of dev teams gripe about strsep(3) because it "isn't standard"
or "isn't cross-platform". Preposterous! strsep is a ~20 line ANSI C function;
it will compile and run flawlessly on any platform that doesn't natively
provide it.
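
To give a sense of how small it is, here is one way such a fallback could be written in plain ISO C. This is a sketch that follows the documented strsep(3) behavior, not a copy of any particular libc's source, and the name `my_strsep` is made up so it cannot clash with a native strsep.

```c
#include <stddef.h>
#include <string.h>

/* Fallback with strsep(3) semantics: returns the next field of *stringp,
 * overwrites the delimiter that ended it with '\0', and advances *stringp
 * past that delimiter. When the last field is returned, *stringp becomes
 * NULL, and the following call returns NULL. */
char *my_strsep(char **stringp, const char *delims)
{
    char *start = *stringp;
    if (start == NULL)
        return NULL;

    char *end = start + strcspn(start, delims);  /* first delimiter or the NUL */
    if (*end != '\0') {
        *end = '\0';             /* terminate this field in place */
        *stringp = end + 1;      /* next call resumes after the delimiter */
    } else {
        *stringp = NULL;         /* that was the final field */
    }
    return start;
}
```

All of the state lives in the caller's `char *`, which is why two loops over two different strings can be interleaved freely.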

I agree strongly with the other comments on this thread that strsep(3) is just
peachy, even though it "alters its inputs". Unlike in functional programs,
reasonable destructive functions are to be _preferred_ in C programs; it's
easier to work around destructiveness than it is to track state (and/or to go
through contortions to pretend that you aren't tracking state).

------
jsolson
I began to have serious concerns about this article when I saw the author
allocate and then immediately leak memory in the first example.

I agree with the message: don't use strtok, it has unpleasant side effects
that you probably don't want. I do not feel this article does a good job of
presenting that message. It spends too much time on pathologically bad
examples of strtok usage while only briefly mentioning (and providing no
example code for) any of the alternatives.

~~~
scott_s
He also elided basic error checking. I have no problem if people don't follow
all good programming practices in pedagogical examples.

~~~
tptacek
What error checking are you referring to? If it's the alloc, I think the
explicit error check on it is usually an antipattern and should be avoided.

~~~
js2
Curious: you think it's better to crash attempting to use a NULL pointer? If
so, why is that better than checking for the NULL?

~~~
tptacek
I think regimes that involve manual checking introduce the possibility of
mistakes; when those mistakes involve array references, they can be
exploitable.

I've always, in both my dev career and in security research, felt that manual
checking is dumb. In 99.999% of cases, the only option given to code when
allocations fail is to start a chain of events that ends the program.
Inflating each and every allocation (or, more likely, missing the little
things that allocate silently, like strdup) into 4-5 lines of code seems
wasteful.

What I advocate is what my friend Danny told me he did at Juno in the mid
'90s: preload a wrapper on malloc (don't use silly wrappers like "xmalloc")
that catches faults and does something intelligent; centralize your handling
of allocation failures, instead of trying to graft it onto each and every
allocation point.

The nice thing about my philosophy: it means that when you write normal,
non-runtime code, you just pretend malloc never fails. Your code is cleaner,
and it's probably safer.

~~~
__david__
It depends what kind of program it is. A long running GUI program probably
doesn't want to completely abort if the user tries to open a file that it
doesn't have enough memory to process--especially if there are other unsaved
documents open.

Similarly an embedded program might not want to reboot itself if there aren't
enough memory resources to carry out a command (due to other commands being
processed at the same time).

But you are right, a lot of the time (perhaps even the majority of the time)
it really is perfectly acceptable to just die if you can't malloc().

And it really does take a lot of effort to manage memory like that, but if it
is done right it can really make things work robustly.

~~~
tptacek
Your allocation failure strategy is orthogonal to how individual allocations
should be handled. In a UI program, keep a reserve of memory, return valid
addresses to the caller, and pop up a warning.

It's actually _easier_ to do things like this if you aren't hand-crafting
individual little alloc-check-handle routines.
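
A very rough sketch of the reserve idea, under loud assumptions: every name here is hypothetical, it is written as an ordinary function rather than as a preloaded malloc replacement (purely to keep the example short), and `inject_failures` exists only so the failure path can be exercised without actually exhausting memory.

```c
#include <stdlib.h>

/* All names below are hypothetical, invented for this sketch. */
int inject_failures;        /* test hook: how many upcoming allocations to fail */
int low_memory;             /* set once the reserve is spent; a UI could poll
                               this and pop up its warning */
static void *reserve;       /* emergency block, given back on the first failure */

/* The underlying allocator, with simulated failures for demonstration. */
static void *raw_alloc(size_t size)
{
    if (inject_failures > 0) {
        inject_failures--;
        return NULL;
    }
    return malloc(size);
}

/* Sets aside `bytes` of emergency memory. Returns 0 on success, -1 on failure. */
int reserve_init(size_t bytes)
{
    reserve = raw_alloc(bytes);
    return reserve != NULL ? 0 : -1;
}

/* The single place that decides what an allocation failure means. */
void *reserve_alloc(size_t size)
{
    void *p = raw_alloc(size);
    if (p == NULL && reserve != NULL) {
        free(reserve);          /* hand the reserve back to the allocator */
        reserve = NULL;
        low_memory = 1;         /* remember to warn the user */
        p = raw_alloc(size);    /* retry once with the freed memory available */
    }
    return p;                   /* still NULL if even the retry failed */
}
```

Whether the retry succeeds in a real program depends on the allocator and on how big the reserve is relative to the request, so treat this as an illustration of centralizing the policy, not as a guarantee.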

~~~
__david__
That seems full of edge cases. If your reserve isn't big enough it doesn't
help and if it's too big then it's a waste. It seems much cleaner to be able
to gracefully abort whatever operation you were doing. Presumably there is
already error handling for fs/network errors. Usually you can hop onto those
handlers with memory errors and be just fine.
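
Something like the following is what I have in mind, as a sketch only. The `load_status` enum and `load_file` function are hypothetical. The point is that the out-of-memory case leaves through the same cleanup path and the same kind of return value as the I/O failures.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes for this sketch. */
enum load_status { LOAD_OK = 0, LOAD_IO_ERROR, LOAD_NO_MEMORY };

/* Reads all of `path` into a freshly malloc'd, NUL-terminated buffer.
 * On LOAD_OK the caller owns *out and must free() it. On any error,
 * *out and *out_len are left untouched. */
enum load_status load_file(const char *path, char **out, size_t *out_len)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return LOAD_IO_ERROR;

    enum load_status st = LOAD_IO_ERROR;
    char *buf = NULL;
    long size;

    if (fseek(f, 0, SEEK_END) != 0)
        goto done;
    size = ftell(f);
    if (size < 0 || fseek(f, 0, SEEK_SET) != 0)
        goto done;

    buf = malloc((size_t)size + 1);
    if (buf == NULL) {
        st = LOAD_NO_MEMORY;    /* reported just like any other failure */
        goto done;
    }
    if (fread(buf, 1, (size_t)size, f) != (size_t)size)
        goto done;

    buf[size] = '\0';
    *out = buf;
    *out_len = (size_t)size;
    buf = NULL;                 /* ownership has passed to the caller */
    st = LOAD_OK;

done:
    free(buf);                  /* no-op when buf is NULL */
    fclose(f);
    return st;
}
```

A GUI caller can then show "not enough memory to open this file" from the same place it already shows "could not read this file", and the other open documents are unaffected.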

Though if it's linux it'll just kill you anyway :-). (Yeah, I know you can
turn the OOM killer off).

------
ComputerGuru
I think anyone that's done even a little bit of C work on any platform is
aware of this issue.... but it's always worth griping over it some more. I
guess the poster just ran into it and couldn't help but express his
frustration :)

For Win32 developers who don't have the glib g_strsplit function, you can use
strtok_s which is detailed on MSDN here:
<http://msdn.microsoft.com/en-us/library/ftsafwz3(VS.80).aspx>

strtok_s is re-entrant, thread-safe, and uses no global data, but keep in mind
it still modifies your input string.
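
A small sketch of why the caller-supplied context matters, using POSIX strtok_r() so it can be tried outside Windows. As far as I know, Microsoft's strtok_s takes the same three arguments (string, delimiters, context pointer), but treat that mapping as an assumption and check MSDN. The function name `count_pairs` is made up for the example.

```c
#define _POSIX_C_SOURCE 200809L /* assumption: asks glibc to declare strtok_r() */
#include <string.h>

/* Counts the "key=value" items in a ';'-separated list. The outer and
 * inner loops each keep their own context pointer, so the inner
 * tokenization does not disturb the outer one. Plain strtok() has only
 * one hidden context and cannot be nested like this.
 * The buffer `s` is modified in place. */
int count_pairs(char *s)
{
    int pairs = 0;
    char *outer_ctx;
    char *inner_ctx;

    for (char *item = strtok_r(s, ";", &outer_ctx);
         item != NULL;
         item = strtok_r(NULL, ";", &outer_ctx)) {
        char *key = strtok_r(item, "=", &inner_ctx);
        char *value = strtok_r(NULL, "=", &inner_ctx);
        if (key != NULL && value != NULL)
            pairs++;
    }
    return pairs;
}
```

After the call the buffer is full of embedded NULs, which is the "still modifies your input string" part.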

------
jswinghammer
The man page covers the problems with strtok pretty well. I found it useful
enough, and while the code that uses it is pretty awkward, it works well
enough and gets the job done.

I think that I have used it in shipping code two or three times. These days
I don't do much in C so it's unlikely I'll ever use it again.

------
philwelch
strtok() altering the source string is a natural consequence of null
termination. The traditional Pascalian equivalent (put the string length at
the head of the string) creates a symmetric problem at the head of the string.
My (perhaps naive) idea is to use a struct containing the string length and
pointing to the beginning of the string data. Nondestructive tokenization
becomes simple.
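
Roughly, the idea looks like the sketch below. To be clear, this is a from-scratch illustration and not code from any existing library; `struct strview` and `strview_next` are names invented here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type: a string is a pointer plus a length, with no
 * reliance on a terminating NUL. */
struct strview {
    const char *data;
    size_t len;
};

/* Returns nonzero if `c` is one of the `ndelims` delimiter bytes. */
static int is_delim(char c, const char *delims, size_t ndelims)
{
    return memchr(delims, c, ndelims) != NULL;
}

/* Splits the next token off the front of *rest. `delims` is an ordinary
 * C string listing the delimiter characters. Returns 1 and fills *tok
 * when a token is found, 0 when only delimiters (or nothing) remain.
 * Nothing is written to the underlying characters. */
int strview_next(struct strview *rest, const char *delims, struct strview *tok)
{
    size_t nd = strlen(delims);
    size_t i = 0;

    while (i < rest->len && is_delim(rest->data[i], delims, nd))
        i++;                                  /* skip leading delimiters */
    if (i == rest->len) {
        rest->data += i;
        rest->len = 0;
        return 0;
    }

    size_t start = i;
    while (i < rest->len && !is_delim(rest->data[i], delims, nd))
        i++;                                  /* scan to the end of the token */

    tok->data = rest->data + start;
    tok->len = i - start;
    rest->data += i;                          /* the remainder is just a shorter view */
    rest->len -= i;
    return 1;
}
```

Because a token is only a (pointer, length) pair into the original array, there is nowhere a NUL needs to be written, and a token can be printed with `printf("%.*s", (int)tok.len, tok.data)`.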

This post actually gave me an impetus to go back and add tokenization to my
string library that's built around this idea.

~~~
tptacek
The big problem with _tok_ isn't that it alters the string; it's that it keeps
state, and so can't be used in multicontext code. _sep_ changes the interface
to delegate state tracking to the caller, which is the way it should be.

Just as map and reduce in Python is "unpythonic", interfaces that mint new
temp strings to process what should be a simple charstar are un-c-like, and
should be avoided.

~~~
philwelch
_Just as map and reduce in Python is "unpythonic", interfaces that mint new
temp strings to process what should be a simple charstar are un-c-like, and
should be avoided._

That's not what I'm doing at all, though. I'm just giving char *'s and
size_t's for portions of the original character array, not duplicating
subsections of the array at all.

~~~
tptacek
You are, in other words, reinventing strspn?

~~~
philwelch
strspn takes a set of characters to match against and tells you how long to go
until you reach a non-matching character. My tokenizer, like strtok, takes a
set of delimiter characters. So same thing except with complement sets, I
guess.

It's also part of a larger library that consistently uses pointer/length pairs
to designate strings as opposed to null termination, which admittedly is quite
non-C. No clue whether it's actually a good idea.

EDIT: If anyone's actually curious, I have it on github:
<http://github.com/philwelch/string>

