
Strings Are Evil - bhalp1
https://dev.to/indy_singh_uk/strings-are-evil-9f9
======
dahart
For such a long deep dive, I'm surprised it made a big deal about memory while
overlooking the obvious: almost no memory was actually saved. The peak working
set was 16 MB at version 1, and it ended at 12 MB. The temporary strings that
are allocated are all immediately released. The timing went from 8.7 seconds
to 6.7 seconds. So all that effort bought roughly a 25% improvement in memory
usage and CPU time.

Great - 25% is nothing to sneeze at. Just noting that memory wasn't the
problem.

And of course there's a hidden development cost to doing fancy indexing on
strings rather than accepting that String.Split() makes your parser 25%
slower. The code at the start is more functional: it's easier to look at and
know it will do what's expected. The optimized code took longer to write,
it's harder to understand, and it's harder to modify.

~~~
commandlinefan
Yeah, you could easily argue (as I've had many, many bosses do) that the final
version is objectively worse: it's doubtful it will save as much time as it
cost to develop (and debug - I doubt he got it right on the first pass!), and
it will drive up ongoing maintenance costs. The first version is simple and
easy to understand (but inefficient), while the final version is almost
impenetrable unless you read through the whole blog post, which is probably
not part of source code control.

~~~
kemiller2002
The perfect is the enemy of the good. For a few hundred dollars I can buy more
memory etc. that I can use over and over again. The time that is sunk into
building it to make it more efficient is gone forever.

Admittedly, I liked the write-up. It's a fun exercise to go over, but I would
ask my devs to go with the original program.

~~~
cityhomesteader
In a similar vein.

Premature optimization is the root of all evil. - Knuth

~~~
girvo
Here is the larger quote:

“We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil. Yet we should not pass up our
opportunities in that critical 3%.”

------
WorldMaker
The article's comments point out heavily that they missed lazier
generators/iterables (LINQ) as the obvious place to start and a key to both
readability and minimizing memory footprint. Even if they preferred yield
return style generators to LINQ from/where/select, and even if the code was
otherwise the same within the generators, fewer of the allocations would
leave the GC "nursery", and thus overall memory needs would be lower (at the
cost of Gen 0 collections possibly thrashing some CPU, but this isn't a
CPU-bound process, so that's not a huge concern).

Sometimes the simpler, easier to read solution is the right one.

That said, the complex, advanced new solution that is also missing from their
efforts in this article is Span<T>, with its ability to immutably slice
strings in stack space (no GC allocations at all). (This is what parsers in
the ASP.NET HTTP stack have been moving to.)

------
benmanbs
Call me crazy, but for something as simple as taking a CSV file and shoving
its data into a database, why wouldn't you just use a combination of shell
tools? Don't get me wrong, I'm a Java programmer, but sometimes you don't need
the overhead of the JVM, and something like awk would save you lots of time
and energy (and memory).

~~~
rhombocombus
That was my first thought. I do ETL work at a large insurance company, and we
use shell tools almost exclusively unless we need to do more complex
transformations. Unix tools are really low-overhead and work extremely well
for this task. Also, you can parallelize a lot of them with xargs:
[https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)

------
iamed2
We had a similar task a year ago in a Python codebase and we ended up building
a function in Rust for it. Rust's splitting iterators over `&str` return
references to the same data as their own `&str`s, which is a huge performance
win. We would filter the split strings and conditionally store them in a
`HashMap<&str, &str>`, which we later drained, with both automatically
freeing memory behind the scenes where possible. We also shaved off a lot of
time in some cases by skipping the parse entirely and comparing the raw
strings where we had fixed-length integers (e.g., UNIX timestamps in a known
range).

~~~
srean
Yes, this is one aspect of Python that surprised me.

I expect a scripting language to be really good with strings and I/O, and
Python does a pretty good job at both. Strings are immutable, which IMO is a
good choice, but then I got this really disappointing surprise that slices
don't share content. I understand the motivation, though.

Memoryviews can substitute for immutable string slices at times, but since
these are not strings, it's annoying. One can get a string out of a
memoryview, but I believe Python copies at that point.
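A minimal Python sketch of that trade-off (memoryviews work over bytes, not
str, which is part of the annoyance):

```python
import sys

data = b"x" * 1_000_000
view = memoryview(data)

# Slicing the memoryview shares the underlying buffer: no copy is made.
head = view[:10]
assert head.obj is data  # still backed by the original bytes

# Slicing the bytes object itself copies those 10 bytes.
copied = data[:10]

# Materializing the view (bytes() here, or decoding for text) is the
# point where Python actually copies.
assert bytes(head) == copied

# The view slice is a small wrapper; the copy owns its own bytes.
print(sys.getsizeof(head), sys.getsizeof(copied))
```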

~~~
rndgermandude
If you'd been sharing slices naively, you might run into cases like

    a = "large" * 10000000
    self.b = a[1:10]

where this tiny slice would then "magically" keep the entire huge string
alive. It would be easy to de facto leak memory with that.

Even if you are less naive and introduce some kind of heuristic to determine
when to share and when to copy, you might still leak "lots" of memory if you
have a lot of small slices (into small-ish strings), which might hurt on
low-spec embedded systems with tiny amounts of memory available.
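For comparison, CPython's actual choice is to copy on slice, which is exactly
what avoids this leak (a quick sketch; the exact sizes are CPython
implementation details):

```python
import sys

a = "large" * 10_000_000   # roughly 50 MB of character data
b = a[1:10]                # CPython copies: b is an independent 9-char string

# Because the slice owns its own tiny buffer, it cannot pin the big one.
assert sys.getsizeof(b) < 200
assert sys.getsizeof(a) > 50_000_000

del a  # the ~50 MB buffer is now collectible even though b lives on
assert b == "argelarge"
```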

~~~
jerf
I would point out that "it would be easy to leak memory with that" understates
it: in reality, almost every runtime that has ever implemented this
"optimization" automatically has had to either pull it back out or make it
explicit whether or not you're taking a copy of the underlying string. I've
seen it at least three times. The probability of a given program having this
problem is lowish (not _that_ low, but reasonably low), but the probability
of one of a language community's flagship projects having this problem shoots
to 100% almost instantly for any non-trivial language or community.

~~~
srean
I like how Guile does it. There it is very explicit whether you are getting a
copy or a view.

------
asmosoinio
I agree with the comments here - I don't think the final solution should go
into production, because it is so complex. And optimising for total
allocations seems like an odd target.

The V02 with VERY simple optimisation is only 2% slower than the final version
- and infinitely more readable.

Nice article nevertheless.

------
CGamesPlay
This article is a good analysis! A few things I noticed while reading that may
be useful (mostly really basic stuff, since obviously I'm not in the profiler
working with the code).

Easy Win #1 could have been passing 2 as the second argument to the first call
to String.Split, which would have sidestepped the "2 calls to split" problem.

It's worth pointing out that Easy Win #2 has a bug: you said you wanted "MNO"
lines, but now you're also going to pull in "MNOX" lines. You should probably
use StartsWith("MNO,"), which would behave the same as the original check.
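Both fixes are easy to sketch outside C#; in Python, the analogous calls (a
sketch of the idea, not the article's code) would be:

```python
line = "MNO,field1,field2,field3"

# Easy Win #1 analog: cap the split so the tail of the line is never
# tokenized when only the leading field matters.
kind, rest = line.split(",", 1)
assert kind == "MNO"
assert rest == "field1,field2,field3"

# Easy Win #2 analog: include the delimiter in the prefix test so a
# hypothetical "MNOX" record type is not matched by accident.
assert line.startswith("MNO,")
assert not "MNOX,field1".startswith("MNO,")
```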

------
ocdtrekkie
I'm not in a scenario with any of my .NET code where this level of
optimization is actually useful/valuable, but this was a pretty interesting
read. It made me ponder a bit about how I handle strings in my code. I might
try a couple of things for fun.

~~~
megaman22
It's sometimes amazing what low-hanging fruit can be kicking around that you
don't notice until you actually hook a profiler up to something.

The other day I was trying to figure out why one web request in an application
was taking an absurd amount of time. I started looking at SQL queries and
other external calls, only to realize that it turned out to be a logging
statement that used an innocent-looking little helper function to stringify a
List<T>. Except it was using concatenation via Aggregate, e.g. Aggregate("",
(a, b) => a + b + ", ")... Replace that with a StringBuilder, and I went from
seconds to milliseconds on that call. It's one of those things where you go
"Eh, I'm just doing this on a few items, it'll be fine", and then months or
years later the wheels come off because something happens to try using it
with tens of thousands of items.

~~~
chriswarbo
Concatenating strings like this is known as "Shlemiel the Painter's
Algorithm", since each concatenation starts back at the beginning of the
string.

From [https://www.joelonsoftware.com/2001/12/11/back-to-basics](https://www.joelonsoftware.com/2001/12/11/back-to-basics)

> Shlemiel gets a job as a street painter, painting the dotted lines down the
> middle of the road. On the first day he takes a can of paint out to the road
> and finishes 300 yards of the road. “That’s pretty good!” says his boss,
> “you’re a fast worker!” and pays him a kopeck.

> The next day Shlemiel only gets 150 yards done. “Well, that’s not nearly as
> good as yesterday, but you’re still a fast worker. 150 yards is
> respectable,” and pays him a kopeck.

> The next day Shlemiel paints 30 yards of the road. “Only 30!” shouts his
> boss. “That’s unacceptable! On the first day you did ten times that much
> work! What’s going on?”

> “I can’t help it,” says Shlemiel. “Every day I get farther and farther away
> from the paint can!”

The more general problem is being "accidentally quadratic", e.g. assuming that
N concatenations will be O(N), but it's actually O(N^2) because each must
traverse the intermediate string, and (assuming each input string has O(1)
length) that intermediate string will have length O(N). There's a Tumblr blog
at
[https://accidentallyquadratic.tumblr.com](https://accidentallyquadratic.tumblr.com)
which collects examples of this, but I can't seem to get past Tumblr's GDPR
landing page to see it.
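A Python sketch of the pattern (note that CPython sometimes patches over
naive string appends with an in-place resize, so the quadratic cost is most
visible in languages like C# or Java, or when the string is shared):

```python
def shlemiel_join(items):
    # Each concatenation can copy everything accumulated so far, so N
    # appends cost O(N^2) character copies in the general case.
    out = ""
    for item in items:
        out = out + item + ", "
    return out

def linear_join(items):
    # join computes the final length once and copies each piece once: O(N).
    return "".join(item + ", " for item in items)

items = [str(i) for i in range(1000)]
assert shlemiel_join(items) == linear_join(items)
```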

------
viggity
Seems like a lot of work, a lot more code, and a lot more complexity to shave
2 seconds off the runtime. I pay my devs a lot of money and I pay Azure very
little money. I personally wouldn't want to optimize for the latter at the
expense of the former.

~~~
ohitsdom
My thought too, but he did mention that it was running in off hours due to
memory constraints. With these optimizations, maybe they can improve their
business process and offer this functionality in realtime for the customer.
If not, I agree that the extra complexity and maintenance burden might not be
worth it.

------
Noe2097
I am probably missing something here. Isn't the article just rediscovering
what a scanner is? Wasn't this solved eons ago by scanner generators like
flex?

------
monetus
In TCL, everything is a string, and it's... certainly useful.

[https://wiki.tcl.tk/47683](https://wiki.tcl.tk/47683)

~~~
jerf
It isn't, really. Everything has a transform to a string, and in a pinch,
everything can be converted to a string, have string operations performed on
it, and be converted back to the target type transparently, but under the
hood, TCL uses numbers and such whenever possible. If you add ten numbers
together, at most it'll convert ten strings to numbers; it doesn't convert
the first two strings to numbers, add them, convert the result back to a
string, then convert that partial total back to a number and the third
string to a number to add them, etc. The language implements a string-based
interface to everything, but the runtime is not literally dealing in nothing
but strings.

~~~
monetus
Yes, an important distinction, called shimmering, in case anyone was
wondering.

Despite being a seemingly double-edged sword, I feel like that structure is
fascinatingly useful. I've yet to use the C API or delve into the intricacies
of Tcl_Obj, so I still feel one edge is sharper than the other.

I mean, look at the diagram of how type inference may work on the tclQuadCode
wiki and try not to say 'whoa'.

[https://wiki.tcl.tk/3033](https://wiki.tcl.tk/3033) \- shimmering

[https://wiki.tcl.tk/40985](https://wiki.tcl.tk/40985) \- tclQuadCode

[https://wiki.tcl.tk/47683](https://wiki.tcl.tk/47683) \- type diagram

------
pwaivers
This is a great article. I did something very similar in C#, for reading in GB
of data in a daily process. It is very fun to use the profiler and continually
make optimizations. Two of my biggest wins were: 1) writing a custom DateTime
parser and 2) using an integer dictionary instead of Convert.ToInt32().
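The integer-dictionary trick translates to any language; a hedged Python
sketch (the table size and names here are my assumptions, not the poster's
code):

```python
# Precompute string -> int for the values seen on the hot path, turning
# repeated parsing into a single dictionary lookup.
INT_LOOKUP = {str(i): i for i in range(10_000)}

def fast_int(s):
    try:
        return INT_LOOKUP[s]
    except KeyError:
        # Fall back to normal parsing for anything outside the table.
        return int(s)

assert fast_int("42") == 42
assert fast_int("123456") == 123456
```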

------
magnushiie
If you are really after performance when writing parsers in .NET while
keeping the code structure manageable, you should look into the new Span<T>
type.

------
shmerl
_> Codeweavers is a financial services software company_

To avoid confusion: this is a different company from the Codeweavers who are
the primary Wine developers.

