
String tokenization in C - throwaway2419
https://onebyezero.blogspot.com/2018/12/string-tokenization-in-c.html
======
kazinator
The actions of _strtok_ can easily be coded using _strspn_ and _strcspn_.

[https://groups.google.com/forum/message/raw?msg=comp.lang.c/...](https://groups.google.com/forum/message/raw?msg=comp.lang.c/ZhXAlw6VZsA/_Y5evTIkf6kJ)
[2001]

[https://groups.google.com/forum/message/raw?msg=comp.lang.c/...](https://groups.google.com/forum/message/raw?msg=comp.lang.c/ff0xFqRPH_Y/Cen0mgciXn8J)
[2011 repost]

 _strspn(s, bag)_ calculates the length of the prefix of string _s_ which
consists only of the characters in string _bag_. _strcspn(s, bag)_ calculates
the length of the prefix of _s_ consisting of characters _not_ in _bag_.

The _bag_ is like a one-character regex class; so that is to say _strspn(s,
"abcd")_ is like calculating the length of the token at the front of input _s_
matching the regex [abcd]* , and in the case of _strcspn_ , that becomes
[^abcd]* .
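
As a rough illustration of the point above, here is a minimal sketch of a tokenizer built only from _strspn_ and _strcspn_. The helper name `next_token` and its cursor-style interface are assumptions made for this example, not something taken from the linked posts. It never writes to the input, so it accepts a `const char *`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a library function): finds the next token at
   *cursor without modifying the string. Returns a pointer to the start of
   the token and stores its length in *len, or returns NULL when no tokens
   remain. *cursor is advanced past the token. */
static const char *next_token(const char **cursor, const char *bag, size_t *len)
{
    assert(cursor != NULL && *cursor != NULL && bag != NULL && len != NULL);

    const char *s = *cursor;
    s += strspn(s, bag);        /* skip delimiters, like matching [bag]*  */
    if (*s == '\0') {
        *cursor = s;
        return NULL;            /* only delimiters (or nothing) left */
    }
    *len = strcspn(s, bag);     /* measure the token, like matching [^bag]* */
    *cursor = s + *len;
    return s;
}
```

The caller gets a pointer and a length rather than a NUL-terminated string, so printing a token would use something like `printf("%.*s\n", (int)len, tok)`.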

~~~
saagarjha
And it’s nicer, since you can pass in a const char * and use it in concurrent
code.

------
jstimpfle
strtok is one of the silliest parts of the standard library (and there are
many bad ones). It's broken. It's not thread safe (yes, there is strtok_r).
It's needlessly hard to use. And it writes zeros to the input array. The
latter means it's unfit for most use cases, including non-trivial tokenization
where you want e.g. to split "a+1" into three tokens.

If you program in C please just write those four obvious lines yourself.

~~~
yason
_If you program in C please just write those four obvious lines yourself._

Those are not necessarily obvious lines; there are several pitfalls to avoid,
and for that reason strtok() is much longer than four lines. As standard
library functions go, strtok() has well-defined behaviour that is easy to
reason about, and it comes near-magically close to the string-splitting
convenience of scripting languages.

In contrast, an example of a truly sickening part of stdlib is converting
strings to numbers. The atoi()/atol() family doesn't check for errors at all, so
you want to use strtol(). But the way error checking works in strtol() is so
complex that the man page has a specific example of how to do it correctly.
All sane programmers quickly write a clean wrapper around strtol() to encode
the complexity once. Now, strtok() is nothing like that.
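
For reference, a wrapper of the kind described might look roughly like the sketch below. The name `parse_long` and the choice to reject trailing characters are assumptions for illustration; a real project may want different rules (other bases, trailing whitespace allowed, and so on).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical wrapper: parses the whole of `text` as a base-10 long.
   Returns true and stores the value in *out only when the string holds a
   valid, in-range number with nothing after it. */
static bool parse_long(const char *text, long *out)
{
    assert(text != NULL && out != NULL);

    char *end = NULL;
    errno = 0;                        /* strtol only sets errno on failure */
    long value = strtol(text, &end, 10);

    if (end == text)     return false; /* no digits were consumed */
    if (errno == ERANGE) return false; /* value overflowed or underflowed */
    if (*end != '\0')    return false; /* trailing characters after the number */

    *out = value;
    return true;
}
```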

In its simplicity, strtok() is quite versatile. A few strtok() calls can
easily parse lines like:

    
    
        keyword=value1, value2, value3
    

that you might find in configuration files. And I mean truly in just a few
lines which you might expect in Python but with C string handling? No.
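
A minimal sketch of what those few lines could look like, assuming the line sits in a writable buffer. The helper name `parse_config_line` and the fixed-size output array are assumptions for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits a writable line of the form
   "keyword=value1, value2, value3" in place using strtok().
   Stores the keyword in *keyword and up to max_values pointers in values[].
   Returns the number of values stored, or -1 if no keyword was found. */
static int parse_config_line(char *line, char **keyword, char **values, int max_values)
{
    assert(line != NULL && keyword != NULL && values != NULL);

    int count = 0;
    *keyword = strtok(line, "=");           /* everything before the first '=' */
    if (*keyword == NULL)
        return -1;

    /* Values are separated by any run of commas and whitespace. */
    for (char *v = strtok(NULL, ", \t\n");
         v != NULL && count < max_values;
         v = strtok(NULL, ", \t\n"))
        values[count++] = v;

    return count;
}
```

Because strtok() treats a run of delimiters as a single separator, an empty value between two commas is silently skipped by this sketch.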

~~~
jstimpfle
Here is the musl implementation.

>
> [https://github.com/esmil/musl/blob/master/src/string/strtok....](https://github.com/esmil/musl/blob/master/src/string/strtok.c)

It's a bit longer than 4 lines because strtok does things you should not want.
If you insist on parsing that configuration line with strtok, go ahead and
write that brittle code. It breaks as soon as you want empty strings (try
"keyword=value1, , value3" with strtok) or escape sequences or other
transformations, or as soon as you want to do something as basic as parsing
from a stream instead of a string that is completely in memory.

So to clarify, of course you are never done with parsing in 4 lines. But even
if it wasn't as braindead to overwrite the input string, the functionality
strtok provides would not be worth more than 4 lines.

~~~
yason
So, here's that implementation:

    
    
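        char *strtok(char *restrict s, const char *restrict sep)
        {
            static char *p;                    /* position saved between calls */
            if (!s && !(s = p)) return NULL;
            s += strspn(s, sep);               /* skip leading separators */
            if (!*s) return p = 0;
            p = s + strcspn(s, sep);           /* find the end of the token */
            if (*p) *p++ = 0;
            else p = 0;
            return s;
        }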
    

Instead of carrying that code, or something similar, with my source code or my
own utility library I'd much rather have the already debugged version from the
standard library.

Overwriting the input in C is more efficient than maintaining more internal
state and returning a pointer and the length of each token which you would
need to strncpy() to get the token into a C string. strtok() does not want to
do the initial strdup() for you because only you will know whether your input
can already be mutated or whether you need to use a copy.

As I pointed out in the other reply, strtok() _does not break_ on strings like
"keyword=value1,, , value3" unless you skipped RTFM and expect it to do
something completely different. And more often than not that's exactly what
you want when parsing non-computer readable input which you can expect to take
a specific form.

If you want to handle escape sequences, parse from a stream (without having
the option to fgets() the next line into memory), or parse CSV tables without
collapsing columns, then you will want to use something more specific to that.
Luckily, strtok() was not advertised as a Swiss army knife so it's off the
hook for specific parsing purposes like those.

~~~
jstimpfle
As someone else pointed out, what you really want is the implementation of
strspn/strcspn, which is where the loop is. You don't "carry" that code along.
You just write

    
    
        while (i < len && !is_token_begin(buf[i]))
            i++;
        if (i == len)
            error("End of input\n");
        start_token(tok);
        while (i < len && is_token_char(buf[i])) {
            add_to_token(tok, buf[i]);  /* append the current char, then advance */
            i++;
        }
        end_token(tok);
    

or something along those lines. Whatever you need. It's not rocket science.
Putting highly fluctuating and project-specific code like this in a library
would only have disadvantages. Not everything should be in a library. In fact,
most things should not be.

------
stochastic_monk
I recommend ksplit/ksplit_core from Heng Li’s excellent klib kstring.{h,c}[0].
It modifies the string in-place, adding null terminators, and provides a list
of offsets into the string. This gives you the flexibility of accessing tokens
by index without paying costs of copying or memory allocation.

[0]
[https://github.com/attractivechaos/klib](https://github.com/attractivechaos/klib)

------
lixtra
I have an obsession with unsafe example code:

    
    
      strcpy(str,"abc,def,ghi");
      token = strtok(str,",");
      printf("%s \n",token);
    

Even if the author knows how many tokens are returned, I would prefer a check
for NULL here, since a good fraction of readers might not read further than
this bad example.
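
A checked version of that snippet could look roughly like this. The wrapper function `print_first_token` is hypothetical and exists only to make the example self-contained.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints the first comma-separated token of `str`.
   Returns 1 if a token was printed, 0 if strtok() found none.
   Note that strtok() modifies `str` in place. */
static int print_first_token(char *str)
{
    assert(str != NULL);

    char *token = strtok(str, ",");
    if (token == NULL) {               /* empty string or only delimiters */
        fprintf(stderr, "no token found\n");
        return 0;
    }
    printf("%s\n", token);
    return 1;
}
```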

~~~
enriquto
> I have an obsession with unsafe example code:

It is perfectly OK for example code to be unsafe. You do not wear a parachute
when you learn to fly using a simulator. You realize that things will become
more serious and complicated in the future, but you have to start with
something simple and unsafe, no big deal. Otherwise you will never see the
consequences of unsafe code in simple cases.

~~~
bqe
I think you underestimate how many people blindly copy examples without
understanding them. Safe example code results in more correct programs.

~~~
enriquto
> I think you underestimate how many people blindly copy examples without
> understanding them. Safe example code results in more correct programs.

Even if this is true, the reasoning here is disturbingly short-sighted.
Copying code that you do not understand is unacceptable behavior, and I'd say
the sooner it blows up in your face, the better. The goal of code examples is
to illustrate how things work in a simplified way, and code without error
checks is often easier to understand at first. Imagine a hello world with all
the possible error checks. That would be incomprehensible.

------
jfries
Well, yes, using strtok works if the data happens to be structured in a
certain simple way. Very often you want to do something more advanced though,
and using regex for matching tokens is then necessary.

~~~
jstimpfle
You don't really use regex in C. You just write a few simple loops. Look up
the lexer of the programming language of your choice.

~~~
barrkel
Most lexers are state machines, either explicit with tables (like you get from
lex) or implicit with program counter (with loops and switches). Those state
machines implement matchers for regular languages; they're effectively hand-
coded implementations of regular expression matching.

Regular expressions don't show up outside the spec, sure; but if you're
writing the code (for implicit state machine), you need to know exactly where
you are in the regular language that defines the tokens to write good code.
Writing a regex matcher in code like this is like writing code in assembly -
mentally, you're mapping to a different set of concepts all the time.

~~~
jstimpfle
Yes. I don't think anyone is disagreeing here.

If you're implying that we should then use a regex implementation instead:
Coding up a lexer (for a mainstream programming language) using simple counter
increments and such is not a lot of work. It has the advantage that it results
in faster code (unless you're going for real heavy machinery) and that you can
easily code up additional transformations. For example, how would you parse
string literals (including escape sequences) with a regex?

~~~
mcguire
"(.|\\.)*"

~~~
jstimpfle
"Evaluation:\tWrong!\x07\x07\x07\r\n"

You want to convert the string literal to an internal buffer, interpreting the
escape sequences. In the same way, you want to parse integers. You cannot
really do that with a regular expression. RE is for _matching_ , not for
transforming.
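
To make the "transforming" part concrete, here is a minimal sketch of a hand-written loop that decodes a quoted literal into a buffer. The function name `unescape_literal`, the small set of supported escapes, and the limit of two hex digits after `\x` are all assumptions for this example; real C literals have more rules.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the value of a hex digit, or -1 if c is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decodes a double-quoted literal starting at src into
   out. Supports \n \t \r \\ \" and \x followed by one or two hex digits.
   Returns the number of bytes written (no terminator is added), or -1 on a
   missing quote, an unknown escape, or an output buffer that is too small. */
static long unescape_literal(const char *src, char *out, size_t out_size)
{
    assert(src != NULL && out != NULL);

    size_t n = 0;
    if (*src++ != '"') return -1;
    while (*src != '"') {
        char c = *src++;
        if (c == '\0') return -1;               /* unterminated literal */
        if (c == '\\') {
            char e = *src++;
            switch (e) {
            case 'n':  c = '\n'; break;
            case 't':  c = '\t'; break;
            case 'r':  c = '\r'; break;
            case '\\': c = '\\'; break;
            case '"':  c = '"';  break;
            case 'x': {
                int hi = hex_value(*src);
                if (hi < 0) return -1;          /* \x needs at least one digit */
                src++;
                int lo = hex_value(*src);
                if (lo >= 0) { src++; c = (char)(hi * 16 + lo); }
                else         { c = (char)hi; }
                break;
            }
            default: return -1;                 /* unknown escape or end of input */
            }
        }
        if (n >= out_size) return -1;
        out[n++] = c;
    }
    return (long)n;
}
```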

------
graycat
A lot of experience shows that the string tokenization in Open Object Rexx is
darned useful. E.g., for many years, IBM's internal computing was from about
3600 _mainframe_ computers around the world running VM/CMS with a lot of
_service machines_ written in Rexx. Rexx is no toy but a powerful, polished,
scripting language and really good at handling strings.

A little example of some Rexx code with some string parsing is in

[https://news.ycombinator.com/item?id=18648999](https://news.ycombinator.com/item?id=18648999)

------
pasokan
It used to be that gcc would warn against strtok and recommend strsep instead.
I do not know what the status is today.

~~~
tinus_hn
Strtok is not thread safe and can’t be made thread safe without changing the
API. You should not use it.

~~~
heinrichhartman
Well, strtok could use thread local variables to store intermediate state, to
make it threadsafe while maintaining the same API. Not saying this is a good
idea, but technically it would work, no?

~~~
Sharlin
Yes, as long as you can guarantee that there’s only one tokenization going on
per thread at a time.
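
The "one tokenization at a time" limit is what strtok_r removes by making the caller hold the state. A minimal sketch, assuming a POSIX system where strtok_r is available; the function `count_words` and the record format are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L  /* strtok_r is POSIX, not ISO C */
#include <assert.h>
#include <string.h>

/* Hypothetical helper: counts words in a writable string made of
   ';'-separated records, each holding space-separated words. Two
   tokenizations are in progress at once, each with its own save pointer,
   which a single hidden state (static or thread-local) cannot support. */
static int count_words(char *text)
{
    assert(text != NULL);

    int words = 0;
    char *outer_state = NULL;
    for (char *rec = strtok_r(text, ";", &outer_state);
         rec != NULL;
         rec = strtok_r(NULL, ";", &outer_state)) {
        char *inner_state = NULL;
        for (char *w = strtok_r(rec, " ", &inner_state);
             w != NULL;
             w = strtok_r(NULL, " ", &inner_state))
            words++;
    }
    return words;
}
```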

------
caf
Note though that strsep() is not as portable, because it is an extension to
standard C.

~~~
tptacek
It's a tiny function, written in ANSI C, so if you're really concerned about
this, just include it with your program. It's an extension to the standard C
_library_ , not to C itself.

~~~
beefhash
Except then you have the issue about compilers complaining about double-
declarations of the function, meaning you'll either have a lot of warning spam
on every #include or now hard require some kind of header defines for
HAVE_STRSEP. Once you go that way, there's no going back and it's only gonna
become more and more.

~~~
tptacek
People have been including strsep in packages since the 1990s (people used to
include their own snprintfs, ffs). If you're really this freaked out about it,
call your local copy "mystrsep" or something like that.
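
For a sense of scale, a local copy along those lines might look like the sketch below. This is an illustrative reimplementation following the behaviour described in the BSD strsep man page, not the code from any particular libc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for strsep(), under a different name so it cannot clash
   with a declaration in the system headers. Returns the next field from
   *stringp (which may be empty) and advances *stringp past the delimiter,
   or sets *stringp to NULL when the last field has been returned. */
static char *mystrsep(char **stringp, const char *delim)
{
    assert(stringp != NULL && delim != NULL);

    char *start = *stringp;
    if (start == NULL)
        return NULL;                      /* nothing left to split */

    char *end = start + strcspn(start, delim);
    if (*end != '\0') {
        *end = '\0';                      /* terminate this field */
        *stringp = end + 1;               /* resume after the delimiter */
    } else {
        *stringp = NULL;                  /* this was the last field */
    }
    return start;
}
```

Unlike strtok, this does not collapse adjacent delimiters, so "value1,,value3" yields three fields with an empty one in the middle.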

You know what else isn't _in POSIX_? All the rest of your C code.

------
satyenr
> Next, strtok is not thread-safe. That's because it uses a static buffer
> internally. So, you should take care that only one thread in your program
> calls strtok at a time.

I wonder why strtok() does not use an output parameter similar to scanf() —
and return the number of tokens. Something like:

    
    
      int strtok(char *str, char *delim, char **tokens);
    

Granted, it would involve dynamic memory allocation and the implementation
that immediately comes to mind would be less efficient than the current
implementation, but surely it’s worth eliminating the kind of bugs the current
strtok() can introduce?
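
One possible shape for such a function, sketched under the assumption that the caller supplies the token array (which sidesteps the dynamic allocation) and that overwriting the input is acceptable. The name `split_tokens` and the `max_tokens` parameter are hypothetical, not part of any standard API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical all-at-once splitter: writes up to max_tokens token pointers
   into tokens[] and returns how many were stored. It overwrites delimiters
   in str with '\0' but keeps no hidden state between calls. */
static int split_tokens(char *str, const char *delim, char **tokens, int max_tokens)
{
    assert(str != NULL && delim != NULL && tokens != NULL);

    int count = 0;
    char *s = str;
    while (count < max_tokens) {
        s += strspn(s, delim);        /* skip any run of delimiters */
        if (*s == '\0')
            break;
        tokens[count++] = s;
        s += strcspn(s, delim);       /* advance to the end of the token */
        if (*s == '\0')
            break;
        *s++ = '\0';                  /* terminate the token in place */
    }
    return count;
}
```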

Does anyone here have the historical perspective?

------
megous
Another approach, besides library calls and flex, is re2c. It preprocesses the
source code and inlines regular expression parsing where you need it. It's
very powerful in combination with goto.

------
saagarjha

      str = (char *) malloc(sizeof(char) * (strlen(TESTSTRING)+1));
    
      strcpy(str,TESTSTRING);
    

str = strdup(TESTSTRING)?

------
rurban
AFAIK strtok has restrict on both args since C99. And the safe variants
strtok_s and esp. wcstok_s are missing. Strings are unicode nowadays, not
ASCII.

[https://en.cppreference.com/w/c/string/byte/strtok](https://en.cppreference.com/w/c/string/byte/strtok)

------
bsenftner
...And then the application is required to implement variable length
characters, a la Unicode, and you start your strings logic all over...

~~~
syrrim
As long as you're fine with ascii delimiters, strtok et al. work fine for
utf-8 strings.

~~~
bsenftner
Would you happen to be aware of a good Unicode normalization function/lib in
C/C++?

------
the_clarence
Problem is that your token string is going to be quite large. Is there a
built-in solution for when tokens are just single chars?

------
setquk
I just use flex. You don’t have to ship flex as a dependency either.

------
alexandernst
How about just using a properly suited language for string manipulation?

