
What I learned from my first C coding challenge - jasonmoo
http://blog.jasonmooberry.com/2012/09/what-i-learned-from-my-first-c-coding-challenge/
======
binarycrusader
(I realise the poster came up with a different solution ultimately, but this
bears repeating.)

Don't hand-roll standard library functions. You'll very likely regret it
later.

In particular, as hardware advances, OS vendors often update their
implementations with optimisations specific to new hardware platforms. This
allows improvements for all programs that consume those interfaces and, in
some cases, prevents regressions when the hardware changes.

In addition, you really don't want to be the one that introduces a security
bug into your program because it turns out your optimisation work didn't
account for a scenario that the standard library function does.

I'm aware that there may be cases where some believe it's appropriate; don't
give in to temptation. It's a bad idea to duplicate the standard library.

~~~
exDM69
Your advice is good but a bit too generalized here.

The reason the standard library functions, qsort in particular, were
performing worse in this case is not the implementation of the functions
themselves. It is because a call to a foreign function practically inhibits
all compiler optimizations that could happen around the call site. So the
"overhead" of the call itself costs a lot more than any difference in the
function's implementation. In the case of qsort with function pointer
callbacks, the effect is especially devastating. Benchmark qsort vs. C++
std::sort to see how much (a lot).

So when writing performance-critical C code, it's a perfectly good option to
re-write a standard library function, especially a simple one like this. Make
it an inline function in a header file to make sure the compiler can optimize.
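
For example, something like this in a header (a minimal sketch with made-up
names; the point is that the comparator's body is visible at every call site,
whereas qsort pays an indirect call per comparison):

    
    
      #include <stddef.h>
      
      static inline int cmp_int(int a, int b)
      {
          return (a > b) - (a < b);
      }
      
      static inline void insertion_sort_int(int *v, size_t n)
      {
          size_t i, j;
          for (i = 1; i < n; i++) {
              int key = v[i];
              /* cmp_int inlines away completely here */
              for (j = i; j > 0 && cmp_int(v[j - 1], key) > 0; j--)
                  v[j] = v[j - 1];
              v[j] = key;
          }
      }
    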

If your compiler can apply link-time optimization to these function calls,
then you don't have to re-write them. Until LTO becomes mainstream, sometimes
you might have to.

~~~
binarycrusader
But that's sort of my point; today's optimisation can be tomorrow's
performance regression or security bug, or it can be outdone by an update
from the OS vendor.

It's very rare that a program's performance is being held back by a standard
library function (speaking of libc here). I remain highly skeptical; the real
issue is usually the algorithm, not the implementation of the standard
library functions.

It's just not worth it.

~~~
cube13
> But that's sort of my point; today's optimisation can be tomorrow's
> performance regression or security bug, or it can be outdone by an update
> from the OS vendor.

> It's very rare that a program's performance is being held back by a standard
> library function (speaking of libc here). I remain highly skeptical; the
> real issue is usually the algorithm, not the implementation of the standard
> library functions.

In the last year, we encountered a bug with Sun Studio 12 on Solaris x64
where memcpy wasn't automatically inlined, forcing a full function jump every
time it was invoked. That was a major performance hit, and forced us to
switch to an internal implementation (one that is normally worse on Solaris).
IIRC, we didn't have much luck getting a patch out of Oracle for the issue.

So no, this really isn't true. In an ideal world, it would be.

~~~
spc476
How about, "use the standard library function until proven guilty"? Sure,
compiling with Sun Studio 12 on x86 may be a loss, but what about SPARC? Or
GCC on Linux?

~~~
cube13
SPARC is pretty much a lost cause for us. I believe we have had better
performance with the system memcpy than with anything we've written.

Linux GCC was all over the place, depending on the Red Hat Enterprise Linux
version. IIRC, on RHEL 4 our internal code worked better, but with 5 and 6
the included memcpy is generally better.

If you're writing code that has serious performance requirements,
experimentation is key. There's absolutely no guarantee that the
system-provided function will be better than a hand-rolled one.

------
tsahyt
It's interesting that he points out that testing for zero is cheaper than
comparing two numbers. Interesting, because this might not always be the case.

For a quick test I used the conditions (i = 0; i < NUMBER; i++) and
(i = NUMBER; i != 0; i--). Without optimizations turned on, gcc translates
both of them to a cmpl (compare long) operation followed by a jne (jump if
not equal). So on the assembly level they're absolutely identical, except
that in one case you're testing against zero whilst in the other you're
testing against the number. Fine, so we'll have to dig deeper.

And this is where it gets complicated. This optimization depends entirely on
the inner workings of the ALU. Theoretically one can test against zero with
just one subtraction, because 0-n is zero exactly when n-0 is zero, whereas
a-b == b-a iff a == b; otherwise the two sides differ in sign. On a hardware
level such operations might be parallelized inside the ALU though, so a
comparison of two numbers might take exactly as many clock cycles as the
comparison against zero.

It's nevertheless an interesting excursion into how such basic things work.
As for my test: they're both about equally fast on my machine. I've done
multiple runs with no clear winner.
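
(For reference, roughly the kind of test I mean, as a sketch; the volatile
sink is there to keep the loop bodies from being discarded, and in practice
you'd time each loop in its own binary:)

    
    
      #define NUMBER 1000000000L
      
      volatile long sink;  /* keeps the loop bodies observable */
      
      int main(void)
      {
          long i;
          for (i = 0; i < NUMBER; i++)   /* test against NUMBER */
              sink = i;
          for (i = NUMBER; i != 0; i--)  /* test against zero */
              sink = i;
          return 0;
      }
    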

~~~
csense
The second version (test against zero) benefits from the fact that the
subtraction instruction sets the Z flag automatically. So the end of the
second loop is something like "sub i,1" followed by "jnz top".

The most straightforward implementation of the first loop would be "add i,1"
followed by "cmp i,NUMBER" followed by "jb top" (or "jl top" if i is signed).
That's an extra instruction, which may be even slower, depending on the
specific CPU and surrounding code, if NUMBER is a literal or a memory access
(as opposed to a register value).

My guess is that your compiler will produce code that takes advantage of the
savings if you turn on optimizations, but you might have to actually do
something in the body of the loop to keep the compiler from optimizing the
loop completely out (or tweak flags to enable/disable specific optimization
techniques). Adding a loop body will make the timing difference less
noticeable, as the single-instruction savings of the zero-test version becomes
a smaller percentage of the total time per iteration.

Disclaimer: This is how I'd write assembly by hand. A compiler could
conceivably do something quite different, depending on the specific
application code, compiler, version and flags. I also haven't yet taught
myself 64-bit x86 assembly code, although I understand it's rather similar to
32-bit x86 with more registers available.

~~~
tsahyt
The compiler optimizing the loop out completely is one of the reasons why I
didn't turn on optimizations here, even though there was a loop body. gcc is
smart enough to optimize away loops with a fixed outcome in some cases. I
probably should have written a more complex loop body.

That said, it's very likely that the compiler _will_ make sense of such an
optimization and produce the quicker assembly, much as you just wrote it by
hand.

What I wanted to say with my post was really that things like these are
heavily dependent on the architecture, and that comparisons could be
optimized in hardware.

~~~
exDM69
> The compiler optimizing the loop out completely is one of the reasons why I
> didn't turn on optimizations here

Use a non-constant value for the loop bound and put a dummy load inside to
stop the compiler from doing loop unrolling and/or dead code elimination.

I often use time(), rand() or even argc to get dummy values when looking at
assembly from compiler optimizations.
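
Something along these lines (a sketch):

    
    
      #include <stdio.h>
      #include <stdlib.h>
      
      int main(int argc, char **argv)
      {
          (void)argv;
          long n = 1000000L * argc;  /* bound unknown at compile time */
          long acc = 0;
          long i;
          for (i = n; i != 0; i--)
              acc += rand();         /* dummy load that can't be elided */
          printf("%ld\n", acc);      /* use the result so it isn't dead */
          return 0;
      }
    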

------
icefox
Any more detailed information about the coding challenge?

For example, the repo references a 5Murls.txt file, but it isn't part of the
repo, and the blog says the program needs to "output in a standardized
format:" without specifying what "standardized" means; the current code
doesn't actually output anything (the printf is disabled).

- Does the output specifically need to go to stdout, or just exist somewhere
in memory?

- Does it require a char*, or will this char* be instantly hashed and tossed
away?

- Does the challenge forbid the use of threads/cpus/workers?

- The blog says: "In this particular context the plugin architecture we were
writing against allowed for returning the original string or a new malloc’d
string". Are you allowed to muck with the original string? The code on GitHub
has a function that takes a const char*, forcing a malloc, but it could
easily rewrite the string in place if allowed.

edit: As for sorting the params (based upon your comments about the common
usage), I'm pretty sure there is a way to do this without any string
comparisons at all. Post a sanitized 5Murls.txt file and I'll give it a go
and make a patch.

~~~
icefox
The best algorithm depends upon the data presented. What is:

- the average number of parameters

- the % that are already sorted

- the % that don't have keys that start with the same letter
~~~
csense
It also depends on the assumptions we make about the problem.

- Do we have to validate the URL strings in any way?

- What do we need to do with duplicate names?

- Are cases with small parameter names and/or small numbers of parameters
common?

- How common are parameters with lengthy common prefixes?

- Is there a bound on the length of the longest common prefix of two
different parameter names?

Tuning or special-casing with common cases in mind can often produce big wins.

------
cyberroadie
I made a varnish urlsort module a while ago:
<https://github.com/cyberroadie/varnish-urlsort>

Here's my blog post about it:
<http://cyberroadie.wordpress.com/2012/01/05/varnish-reordering-query-string/>

It parses the URL, and every token gets added to a binary tree which then
gets traversed in order to produce the parameters alphabetically.
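
In outline it does something like this (a minimal sketch, not the actual
module code; error handling omitted):

    
    
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      
      struct node {
          const char *param;  /* e.g. "a=3" */
          struct node *left, *right;
      };
      
      static struct node *insert(struct node *root, const char *param)
      {
          if (root == NULL) {
              struct node *n = malloc(sizeof *n);
              n->param = param;
              n->left = n->right = NULL;
              return n;
          }
          if (strcmp(param, root->param) < 0)
              root->left = insert(root->left, param);
          else
              root->right = insert(root->right, param);
          return root;
      }
      
      static void inorder(struct node *root)  /* alphabetical order */
      {
          if (root == NULL)
              return;
          inorder(root->left);
          printf("%s&", root->param);
          inorder(root->right);
      }
    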

It's been used in production in several companies I worked for, so tried and
tested :-)

Olivier

~~~
cageface
Interesting. Do you have any benchmarks handy to show how much of a difference
this makes? I'd wonder if the dynamic allocations necessary to build the tree
might undercut the lookup advantages.

------
jparishy
As a note to the author about separating the arrays instead of using a
structure, it was probably an alignment issue? Check out
<http://en.wikipedia.org/wiki/Data_structure_alignment>

~~~
twoodfin
More specifically, a pointer in x86-64 (and most other 64-bit architectures)
is going to be 8-byte aligned, which means that a struct of an 8-byte pointer
plus a 2-byte short will actually consume 16 bytes.

Alternatively, keeping them in separate arrays removes the alignment overhead,
and each pair of values will consume 10 bytes. That's a substantial savings
when you're trying to fit into something like a 32KB L1 data cache.

Another technique that can help you fit as much useful data as possible into
L1/L2 cache is to store offsets rather than raw pointers. Using a 2-byte
unsigned short for an offset rather than an 8-byte pointer would save another
6 bytes of cache per entry, assuming you're happy limiting your URL size to 64
kilobytes. The extra address arithmetic should be essentially free on a modern
pipeline.
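
Concretely (a quick sketch; the numbers assume a typical x86-64 ABI):

    
    
      #include <stdio.h>
      
      struct pair {
          const char *p;       /* 8 bytes */
          unsigned short len;  /* 2 bytes, plus 6 bytes of padding */
      };
      
      int main(void)
      {
          /* struct of pointer + short: 16 bytes per entry */
          printf("%zu\n", sizeof(struct pair));
          /* split arrays: 8 + 2 = 10 bytes per entry */
          printf("%zu\n", sizeof(const char *) + sizeof(unsigned short));
          /* 2-byte offset + 2-byte length: 4 bytes per entry */
          printf("%zu\n", sizeof(unsigned short) * 2);
          return 0;
      }
    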

~~~
jparishy
I was hoping someone would fill in the details; it's been quite a while since
I've dealt with anything like that. Thanks!

------
jfasi
> _Pointer arithmetic:_ I have always heard of pointer arithmetic but now I
> understand why it’s so useful. Taking a pointer to the first element in an
> array and incrementing it to iterate through its elements is elegant and
> more efficient than taking an integer i and using it as an array index. _I
> wish more languages afforded this._

Trust me, no you don't. Pointer arithmetic is one of the easiest things to
screw up in C, and the raw form can be very dangerous to work with.

For sure, treating a pointer as an integer and incrementing by a fixed number
of bytes each time is fine, but more often than not languages have this
implemented internally, as a JIT or compiler optimization.

~~~
nitrogen
_For sure, treating a pointer as an integer and incrementing by a fixed number
of bytes each time is fine, but more often than not languages have this
implemented internally, as a JIT or compiler optimization._

And here, you've inadvertently pointed out one of the major pitfalls (and
amazing virtues) of pointer arithmetic: in C, pointers increment by the size
of their data type. You don't add some number of bytes to the pointer, you add
the number of elements by which to increment.

In other words, incrementing a uint8_t* will add one byte to the pointer
address, but a uint32_t* will add four bytes, and a pointer to a 64-byte
struct will add 64 bytes.

Here's an example. This code:

    
    
      #include <stdio.h>
      #include <inttypes.h>
      
      struct foo {
      	char bar[67];
      };
      
      int main()
      {
      	struct foo monkey[60];
      	struct foo *zoochild;
      
      	printf("sizeof foo: %zu\n", sizeof(struct foo));
      	printf("sizeof monkey: %zu\n", sizeof(monkey));
      
      	zoochild = monkey; // child points at first monkey
      	zoochild++; // child points at second monkey
      	printf("(monkey + 1) - monkey: %zd\n", zoochild - monkey);
      
      	printf("(bytes) (monkey + 1) - monkey: %zd\n", (uint8_t *)zoochild - (uint8_t *)monkey);
      
      	return 0;
      }
    

produces this result:

    
    
      $ gcc -O4 -Wall -Wextra -pedantic -std=c99 ptr.c -o ptr
      $ ./ptr
      sizeof foo: 67
      sizeof monkey: 4020
      (monkey + 1) - monkey: 1
      (bytes) (monkey + 1) - monkey: 67

------
js2
Good read. Reminds me of the oldie but goodie
<http://ridiculousfish.com/blog/posts/old-age-and-treachery.html>

~~~
jasonmoo
Interesting find.

------
jamesaguilar
I'm trying to picture a non-synthetic program where normalizing a URL is a
significant part of the inner loop and I'm drawing a blank. Can someone help
me out?

~~~
T-hawk
Here's a similar case: Ranking of poker hands. That can involve many of the
same steps, like sorting a small list of inputs (5 or 7 cards) into a
normalized configuration (three-of-a-kind is represented as AAABC where B > C)
and taking a hash to memoize the result. And you sure might want that in a
tight inner loop if you're writing a poker simulator or AI where you want to
crunch a few billion iterations.

~~~
jefffoster
Actually, for a poker hand evaluator the last thing you want to do is sort
the hand. There's a small enough number of unique card combinations that you
can arrange things so that you look up pre-computed tables more often than
not.

One clever trick is assigning a prime to each card value. By multiplying the
values together you get a unique number representing each hand without needing
to sort.
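
E.g., a sketch (rank indices 0-12 for deuce through ace):

    
    
      #include <stdint.h>
      
      /* one prime per card rank, deuce through ace */
      static const uint64_t rank_primes[13] = {
          2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41
      };
      
      /* order-independent key: the same multiset of ranks always
         yields the same product, so no sorting is needed */
      static uint64_t hand_key(const int ranks[5])
      {
          uint64_t key = 1;
          int i;
          for (i = 0; i < 5; i++)
              key *= rank_primes[ranks[i]];
          return key;  /* at most 41^5, comfortably within 64 bits */
      }
    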

There's a great description of an algorithm at
<http://www.suffecool.net/poker/evaluator.html>.

------
raverbashing
Very interesting. The author didn't seem that familiar with C at first, but
in the end he got a good result.

It also depends on whether there's pre-validation (even with C you're never
sure; for example, reading that he used strcpy makes my head go "security
hole", but this is a proof of concept so it's OK).

The best thing to keep in mind is that "there is no such thing as the fastest
code".

Edit: strcmp to strcpy (or any str C function)

~~~
tptacek
Reading strcmp() makes you think "security hole"? Why?

~~~
jparishy
It doesn't take the length/number of characters to compare as an argument; it
relies on the null terminator, meaning it's susceptible to a buffer overflow
attack.

Moral of the story: if you're going to use C strings, use the strn* variants.

~~~
tptacek
No, strcmp is not susceptible to buffer overflow attacks.

~~~
jparishy
<https://buildsecurityin.us-cert.gov/bsi-rules/home/g1/847-BSI.html>

If passed an unterminated string, the function will at least fail. As for how
much you could exploit from that, I guess I exaggerated.

~~~
tptacek
That's not a buffer overflow.

This whole subthread of picking on the guy's implementation because of
"strcmp" is pretty silly. There are times where strcpy() is safe to use, but
most of the time it's a red flag. There are conceivably times when strcmp() is
unsafe to use, but to a professional reviewer, it is very rarely a red flag.

I should have just come right out and said that, rather than begging for the
rationale for picking on strcmp().

------
comex
The hand-rolled strcmp almost certainly performed better than the native one
merely because it can be inlined; consider using rep cmpsb, or directly
including another assembly version of strcmp (though since the strings are
short, the latter's overhead might not be worth it). Ditto for memcpy.

Perhaps also consider using a radix sort?

------
alexkus
Good work. One obvious speed-up is, if possible, removing the use of dynamic
memory. Simply supply a static output buffer to url_sort so that it doesn't
have to allocate/free memory for the resulting URL.
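
I.e. something along these lines (a hypothetical signature; the repo's actual
one differs):

    
    
      #include <stddef.h>
      
      /* caller supplies the buffer: no malloc/free per URL */
      void url_sort(const char *url, char *out, size_t out_len);
      
      static char outbuf[8192];  /* reused for every input line */
    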

That took my execution time for 5M lines of (not really representative, but
it's all I could be bothered to sort out at 11pm):

    
    
      /test.php?b=b&c=bb&a=3&a=4
    

down from 5.05s to 4.49s (so down ~10%) on an old 1.5GHz Athlon Linux box
that whirrs away in the corner of my flat. (Down to 2.81s when compiled with
-O4.)

(I assume you're compiling with -O4, as that's the only way I could make it
worthwhile using your str_compare() function over libc's strcmp().)

The only other thing I can think of that may speed it up more is a kind of
binary search on the insertion, as that is currently a flat loop that does a
comparison against each param, and is therefore O(n).

In a pessimal case where you've already got 30 params and your 31st goes just
before the tail (but not as the tail, so you can't short-cut straight to
adding it as the tail), it'll do a whole bunch of string comparisons on
entries 1, 2, 3, ..., 30 before finding the right place. It'd be better to do
them on, say, 1, 15, 23, 27, 29, 30. (This will only be beneficial when you
have more than a certain number of params.) An O(log n) algorithm will
hopefully outweigh the slight expense of having to compute the midpoints.
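
A sketch of the search half of that:

    
    
      #include <string.h>
      
      /* find the insertion position for key among n sorted params:
         O(log n) strcmp calls instead of a linear scan */
      static size_t insert_pos(char **params, size_t n, const char *key)
      {
          size_t lo = 0, hi = n;
          while (lo < hi) {
              size_t mid = lo + (hi - lo) / 2;
              if (strcmp(params[mid], key) < 0)
                  lo = mid + 1;
              else
                  hi = mid;
          }
          return lo;  /* memmove the tail up and insert key at lo */
      }
    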

But, given you've got to 2M urls/sec, do you really need to eke out any more
performance? That's an awful lot of traffic if you think each HTTP request
will be about 500 bytes (including TCP headers, HTTP headers, etc). 2M * 500
bytes * 8 bits = ~8Gbps, and that's just the incoming traffic from established
connections.

In return for the binary search suggestion, I get to point out some minor
nitpicks:

* Not all C compilers accept C++ style comments (gcc is way too lenient by default)

* Not all C compilers allow local variables to be defined mid-function

* unsigned long is not the same as size_t (fun fun fun when porting to different word sizes)

* If you ever have to port this code to work on a different CPU you may find yourself spending time adding lots of casts to (ptrdiff_t) to avoid it breaking due to unexpected sign extensions/etc.

These may seem OTT for a project like this, but it's not trivial when you've
continued those practices for years (and new devs have copied the house
style) and have a bunch of projects totalling 20MLOC+ that you then need to
port to a new compiler on a new architecture (with a different word size),
and the code is riddled with assumptions about the old compilers and
architectures. Some things can be automated, but spending _days_ moving
variable definitions to the top of functions and fixing the types of
variables that hold return values of strlen() in a seemingly endless list of
thousands of files gets boring after a very short period. Me? Bitter?

"gcc -Wall -ansi -pedantic" and lint should be your friends.

~~~
tedunangst
// comments and mid-function variables are C99 standard. The platforms that
don't support them are probably not worth supporting yourself, and are not
likely to be seen running varnish. This is the same mindset that held mozilla
back for ages and ages because the hpux compiler from 1987 didn't support
some c++ feature, so nobody was allowed to use it, even in 1997.

~~~
alexkus
True, but not using // comments and mid-function variables is not quite the
same thing as not using a whole language feature.

I just know how much work there is to do, at short notice, when a customer
comes along offering $$$ for a version of your product on an architecture
that you hadn't originally considered. It was definitely worth the hassle of
porting and cleaning up the code to provide them with it, but it would have
been so much easier if we'd kept it clean from the start.

------
gsg
Your insertion sort scans downwards from the highest populated index, bumping
elements up as it goes. Given that you have space on both sides, I'm surprised
that you didn't start in the middle and pick a scan direction based on the
comparison result there.

------
gbin
When I read that, I feel OLD!

------
kstop
We're getting ready to deploy Varnish at (non-devops) work. Not having
control over production, plus the potential for inline C in the config, was
already making me a bit nervous. This thread is not helping!

~~~
jasonmoo
Varnish has actually been a great tool for us. The 3rd-party url-sort plugin
that we tried was not; it had a memory leak that caused the server to crash
periodically. Once the url-sort module we're building as part of this code
challenge is tested and fully formed, we'll make it available as open source.

------
nicholassmith
I've never come across Duff's Device before, and after staring at it for a
few minutes it made horrifying sense. I can't tell whether it's a piece of
true evil or true genius.
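
For anyone else who hasn't seen it, it looks roughly like this (sketched from
memory, so treat it as such; the original sends shorts to a memory-mapped
output register, which is why "to" is never incremented, and it assumes
count > 0):

    
    
      void send(short *to, short *from, int count)
      {
          int n = (count + 7) / 8;  /* eight copies per pass */
          switch (count % 8) {
          case 0: do { *to = *from++;
          case 7:      *to = *from++;
          case 6:      *to = *from++;
          case 5:      *to = *from++;
          case 4:      *to = *from++;
          case 3:      *to = *from++;
          case 2:      *to = *from++;
          case 1:      *to = *from++;
                  } while (--n > 0);
          }
      }
    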

~~~
DLWormwood
More like evil genius… Variations of this trick have been a common
optimization in embedded systems for years. (Stuff like this is why “premature
optimization” is considered such a bad thing.)

On that note, I find it interesting that the article writer wanted to see more
languages support pointer arithmetic. Language mechanics like that have
developed a reputation for being “unsafe” because they enable common
programming errors like buffer overflows and the like. Depending on the level
of abstraction between the language and the underlying iron, it might not even
be possible to do stuff like this anymore. (Some modern
languages/VMs/compilers use the more abstract concept of “references” over
“pointers.” Even Apple’s flavor of Objective-C, which AFAICT still uses
pointers as an implementation detail, prefers to couch documentation and API
conventions in the more abstract terms, to deter some of the direct memory
access abuses that were discovered during the Classic-to-Carbon Mac migration
way back when…)

------
iopuy
Nice, I remember my first coding challenge in C. Good result for a first
attempt.

~~~
jasonmoo
Thanks!

------
hash_speed
A great read. I am curious about your double-wide insertion sort algorithm.

~~~
jasonmoo
Thanks. I haven't seen it anywhere else, but can't really be sure I came up
with it. Just seemed like a good optimization.

