Interesting that on a six-core machine there is very little gain after a degree of parallelism of 3. The problem does seem highly data-parallelizable (is that a word?). So why isn't it able to better utilize the 6 cores?
Law of diminishing returns - the overhead associated with additional parallelization starts creeping up on its benefits. The problem is parallelizable, but it might not be a big enough problem to need every core in order to reach maximum performance.
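One way to put numbers on that intuition is Amdahl's law (my illustration, not something from the article - the 90% parallel fraction below is a made-up figure):

```python
def amdahl(parallel_fraction: float, n_cores: int) -> float:
    """Amdahl's law: overall speedup on n cores when only a
    fraction of the work can be parallelized at all."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# With (hypothetically) 90% of the work parallelizable, 3 cores already
# buy you 2.5x, and going to 6 cores only gets you to 4x -- the serial
# 10% dominates more and more, which flattens the curve early.
for n in (1, 2, 3, 6):
    print(f"{n} cores: {amdahl(0.9, n):.2f}x")
```

That flattening happens even before any per-thread overhead is counted, so the real curve bends over sooner still.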
There are also some graph issues here. For example, on the multithreaded one it takes two seconds on 1 thread, so ideal scaling with full use of 6 cores would be 2 sec / 6 ≈ 0.33 sec. The points at 6-8 cores are clearly above 0.1 seconds and probably below 0.33, but it's hard to tell.
Most of the literature on parallelism (at least in the parallel-compilers space, where I work) plots speedup (http://en.wikipedia.org/wiki/Speedup) versus number of processors, instead of seconds versus threads/processors, to show how well a given algorithm and implementation scale. Of course, you have to be careful about what you use for the sequential (T_1) baseline, but it makes the data in the graphs much easier to understand.
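The conversion is trivial - just divide the T_1 baseline by each measurement. A quick sketch with made-up timings (not the article's actual numbers):

```python
# Hypothetical wall-clock times in seconds at each thread count.
timings = {1: 2.0, 2: 1.1, 3: 0.8, 4: 0.7, 6: 0.65}

t1 = timings[1]  # sequential baseline T_1
speedup = {p: t1 / tp for p, tp in timings.items()}

for p in sorted(speedup):
    # Plotting speedup against the ideal line y = p makes the
    # scaling knee obvious in a way a raw seconds plot doesn't.
    print(f"{p} threads: {speedup[p]:.2f}x (ideal {p}x)")
```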
> Each place name is represented by a UTF-8 (en.wikipedia.org/wiki/UTF-8) text line record (variable length) with more than 15 tab-separated data columns. Note: The UTF-8 encoding assures that a tab (0x9) or line feed (0xA) value won’t occur as part of a multi-byte sequence; this is essential for several implementations.
What? I guess they use a longer (2-byte?) encoding for those codepoints, but from the very same Wikipedia page that they link:
> a sequence that decodes to a value that should use a shorter sequence (an "overlong form") [is invalid]
...
> Implementations of the decoding algorithm MUST protect against decoding invalid sequences
Are they advising us to use an invalid and potentially broken UTF-8 encoding?
I thought the author meant that it is possible to use \n and \t inside values because UTF-8 would encode them in multi-byte sequences (the way Modified UTF-8 encodes \0 as 0xC0 0x80).
What he actually meant is that if a codepoint is > 127, every byte of its multi-byte encoding is >= 0x80, so the encoding can never contain a \t (0x09) or \n (0x0A) byte.
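That property is easy to check exhaustively - every lead byte of a multi-byte sequence is 0xC0-0xF4 and every continuation byte is 0x80-0xBF, so no byte of any multi-byte sequence ever falls in the ASCII range:

```python
# Verify that tab (0x09) and line feed (0x0A) bytes can only appear
# as the one-byte encodings of U+0009 and U+000A themselves: no byte
# of any multi-byte UTF-8 sequence is below 0x80.
for cp in range(0x80, 0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates are not encodable in UTF-8
    encoded = chr(cp).encode("utf-8")
    assert len(encoded) >= 2
    assert all(b >= 0x80 for b in encoded)

print("ok: multi-byte sequences never contain 0x09 or 0x0A")
```

So splitting the raw byte stream on 0x09/0x0A before decoding is safe, which is presumably the point the article was making.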