
A string processing rant - waffle_ss
http://fgiesen.wordpress.com/2013/01/30/a-string-processing-rant/
======
malkia
Nice article! Lots of deja vu there :) (I'm a tools programmer working at
Treyarch)

Here is another little thing you can optimize out (probably not saving much,
but still bugs me):

    
    
      SoftwareOcclusionCulling/CPUT/CPUT/CPUT.h:#define SAFE_DELETE(p)      {if((p)){HEAPCHECK; delete (p);     (p)=NULL;HEAPCHECK; }}
      SoftwareOcclusionCulling/CPUT/CPUT/CPUT.h:#define SAFE_DELETE_ARRAY(p){if((p)){HEAPCHECK; delete[](p);    (p)=NULL;HEAPCHECK; }}
    

There is no need to check for NULL before calling delete[] or delete; the
standard defines deleting a null pointer as a no-op.

Lots of stuff coming from Intel is just badly written for some reason. A few
years ago there was the SMOKE demo using Threading Building Blocks, and it
had O(N^2) and even O(N^3) algorithms for the particles.

Intel, you are supposed to show us how to code to the metal... Or maybe their
ICC compiler is specifically better if you write this way (lol, what a
conspiracy).

------
jlarocco
This guy's biggest problem is that he doesn't know the standard library very
well.

I admit it's ugly, but the code below removes whitespace using only standard
C++ calls. There's also an isspace() function from the <locale> header that
takes a specific locale, but I'm not sure it's important for whitespace.

    
    
        #include <algorithm>  // std::remove_if
        #include <cwctype>    // std::iswspace
        #include <string>

        std::wstring removeSpaces(std::wstring theStr) {
            return std::wstring(theStr.begin(),
                                std::remove_if(theStr.begin(),
                                               theStr.end(),
                                               std::iswspace));
        }
    

There's also codec conversion in the <locale> header [1]. It's also pretty
ugly, but it's easier to understand than his home grown solution, and if
nothing else, is already debugged. And to be honest, if you're using C++ you
should be used to the code being verbose and ugly.

[1] <http://en.cppreference.com/w/cpp/locale/codecvt>

~~~
to3m
The code in the blog post removes only leading or trailing white space, not
all of it.

~~~
pdhborges

        std::wstring remove_side_spaces(const std::wstring& str) {
            auto begin = std::find_if_not(str.begin(), str.end(), std::iswspace);
            if (begin == str.end()) {
                return std::wstring();
            } else {
                auto end = std::find_if_not(str.rbegin(), str.rend(), std::iswspace);
                return std::wstring(begin, end.base());
            }
        }

~~~
jlarocco
Believe it or not, that doesn't compile because begin and end have different
types due to begin/end returning different iterator types than rbegin/rend. It
could be a bug in both clang and g++.

This works, but I don't like the &*strt and &*endp:

    
    
        std::wstring trim(std::wstring theStr) {
            auto strt = std::find_if_not(theStr.begin(), theStr.end(), std::iswspace);
        
            if (strt == theStr.end()) return L"";
        
            auto endp = std::find_if_not(theStr.rbegin(), theStr.rend(), std::iswspace);
        
            return std::wstring(&*strt,&*endp+1);
        }
    

The easiest way is to use Boost's string algorithms library, though:

    
    
        std::wstring trimb(std::wstring theStr) {
            boost::trim(theStr);
            return theStr;
        }

~~~
pdhborges
I also corrected my example. Thanks!

------
skrebbel
It's really a C++ rant, not a string processing rant, though.

~~~
jessaustin
Is it also sort of a Windows rant? It seems like he would not have these
problems with iconv() or its various wrappers on POSIX.

~~~
dfox
The C/C++-level API used is essentially identical and mostly part of the
C/C++ standard (iconv() itself comes from POSIX). But the main difference is
that on Unix most people don't bother with any of this and either don't care
about encoding at all, or convert everything into utf-8 and process utf-8
strings byte by byte, or use some high-level library for string handling.

~~~
jessaustin
OK, that makes sense. So the reason people on Windows _do_ bother with this is
that the Windows system libraries use UTF-16 for everything, right?
(IANAWindowsDeveloper)

------
apaprocki
Also, C++11 adds std::u16string and std::u32string, which make life a bit
easier: you can write cross-platform code without worrying about whether
wchar_t is 2 bytes (e.g. Windows) or 4 bytes (e.g. Linux).

~~~
malkia
Then again, I would avoid anything non-8-bit for asset parsing. We have tons
of text files for our level maps (Quake map format), and we used to have
(we're transitioning) all models/animations in text format too. Not only does
syncing these off P4 take significant time, but they take more space on disk,
which means more I/O and memory later, and you are at the mercy of strtod,
atof, etc. to work fast and properly (which is usually not the case).

For assets where merging doesn't make much sense (polygon data), we are moving
to binary data. For the rest (level data), we might keep text files - some
merging is possible there.

The only time we'll ever need UTF-8, UTF-16, etc. is for the localized data -
game text, etc. There is absolutely no need for such encodings to be in your
assets.

~~~
lloeki
> _Then again I would avoid anything non-8 bit for asset parsing._

Nitpick: you probably meant non-7 bit (there's a reason why ruby and python2
default to non-ext ascii source code, and barf at you otherwise)

~~~
malkia
Yup. 7-bit, but the point is that even if characters above 127 were used it
would've been okay.

7-bit (8-bit) ASCII has a long life to live :) It's machine-optimal, and at
the same time not flexible enough for modern-day needs... just like "C", and I
love it.

I'm using Qt5 these days, and there are QString and QByteArray (and also
QLatin1String) - I need to keep lots of properties in memory for our editor,
and going to QString (wide chars) just uses more memory for no reason.
Sticking to QByteArray/QLatin1String makes more sense for them.

------
zokier
Wide chars/strings _are_ horribly broken and should be avoided at all costs.
While UTF-16 itself is not technically broken, many implementations of it are,
and UTF-16 has no benefits compared to the alternatives, so I'd steer clear of
it too. Once you avoid these two pitfalls, string processing becomes at least
somewhat sensible, although Unicode is a PITA in any case.

------
brazzy
Well, the C and C++ libraries are what we call "organically grown". I sure as
hell am glad to be using a language based on Unicode and length-encoded
strings right from the start.

------
bjhoops1
Well I guess it makes me feel better to know that there is stupidity to be
found even amongst the mighty C programmers!

------
VMG
So where is "The Definitive Guide to C++ String Processing"?

~~~
angersock
Next to the bourbon and pistol.

------
martinced
Can someone explain to me why something dealing with graphics needs to read
_"several megabytes worth of text files"_? Why aren't these binary data?

(I'm not familiar with graphical / texture code)

But I _am_ familiar with fast text processing. In Java the problem is even
worse: it's not so much that the methods are inefficiently written in
themselves. The issue is that string processing tends to create lots of
garbage, which then needs to be garbage collected.

The optimization when all these string processing methods are the bottleneck?
Do not create hundreds of thousands or millions of objects. Instead of doing,
say, "s2 = removeLeadingAndTrailingStuff();" and then "parse(s2);", you
write your own method directly:
"parseButDoNotTakeIntoAccountLeadingAndTrailingStuff()".

It's oversimplified and crazy camel case is used above to make a point ; )

It's always the same: when object creation becomes the bottleneck, make it so
you're not creating that many objects.

It's exactly what the LMAX Disruptor "pattern" is about: using gigantic arrays
of primitives and hardly any object creation at all (to process 12 million
events per second in Java on a single core!).

That said, in TFA I still don't understand why several megabytes of Unicode
text files have to be processed...

~~~
chubot
It would be interesting to see a comparison of Java (and Python) string
processing with Go. In Go you use slices that refer to the underlying string
data, which are basically (pointer, length) pairs. So you may still create
garbage in the form of these small slice headers, but that should cause less
of a problem than creating tons of temporary immutable strings.

~~~
brazzy
Actually, Java strings work _exactly_ like that. And if you use them
correctly, string processing is no problem at all on a modern VM using a
generational GC.

~~~
chubot
Ah OK, I didn't know that (not a Java programmer). What happens if you create
a huge 10MB string, then a 10-byte substring of it, and then remove all
references to the 10MB string? Does that memory ever get reclaimed?

I guess the difference in Go is that the string abstraction and the slice
abstraction are exposed to the programmer, so it will be apparent when that
happens.

There is some info here, but I was unable to find a good explanation:
<http://stackoverflow.com/questions/2909848/how-does-java-implement-flyweight-pattern-for-string-under-the-hood>

~~~
brazzy
Nope, that's one way to have a memory leak. But you can avoid the problem by
using the new String(oldString) constructor - it copies only the part of the
underlying char array that the String uses. You're absolutely right that it
would be much better if all that were part of the API rather than an
implementation detail.

The SO question and answers are confusing because they talk about two
completely separate issues: the above and the interning mechanism.

