Data Processing: PLINQ, Parallelism and Performance (microsoft.com)
23 points by stsmytherie on Jan 7, 2011 | 8 comments



Interesting that on a six-core machine there is very little gain after a Degree of Parallelism of 3. The problem does seem highly data-parallelizable (is that a word?). So why isn't it able to better utilize the 6 cores?


Law of diminishing returns - the overhead associated with additional parallelization starts creeping up on the benefits of said parallelization. The problem is parallelizable, but it might not be a big enough problem to need access to every core in order to achieve maximum performance.
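Not the article's benchmark, but a minimal PLINQ sketch of the kind of measurement involved (the workload and sizes here are made up; the article processes a large place-name text file instead): sweep WithDegreeOfParallelism and time each run to see where the curve flattens.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

class DopSweep
{
    static void Main()
    {
        // Hypothetical stand-in workload, not the article's data set.
        int[] data = Enumerable.Range(1, 5_000_000).ToArray();

        for (int dop = 1; dop <= Environment.ProcessorCount; dop++)
        {
            var sw = Stopwatch.StartNew();
            double result = data
                .AsParallel()
                .WithDegreeOfParallelism(dop)             // cap PLINQ at 'dop' worker threads
                .Select(x => Math.Sqrt(x) * Math.Log(x))  // CPU-bound per-element work
                .Sum();
            sw.Stop();
            Console.WriteLine($"DOP {dop}: {sw.ElapsedMilliseconds} ms (checksum {result:F0})");
        }
    }
}
```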


There are also some graph issues here. For example, on the multithreaded one, it takes two seconds on 1 core. Ideal with full use of 6 cores would be (1 sec / 6 ≈ 0.167 sec). The points at 6-8 cores are clearly above 0.1 seconds and probably below 0.33, but it's hard to tell.

Most of the literature on parallelism (at least in the parallel compilers space, where I work) plots speedup (http://en.wikipedia.org/wiki/Speedup) versus number of processors, instead of seconds versus threads/processors, to show how well a given algorithm and implementation scale. Of course, you have to be careful about what you use for the sequential (T_1) baseline, but it's much easier to understand the data in the graphs.
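For concreteness, speedup is S(p) = T_1 / T_p and parallel efficiency is S(p) / p. A tiny sketch of the conversion, using made-up timings (the numbers below are hypothetical, not read off the article's graphs):

```csharp
using System;

class SpeedupTable
{
    static void Main()
    {
        // Hypothetical wall-clock times in seconds for p = 1..6 threads.
        double[] timeSec = { 2.00, 1.05, 0.75, 0.68, 0.66, 0.65 };
        double t1 = timeSec[0];

        for (int p = 1; p <= timeSec.Length; p++)
        {
            double speedup = t1 / timeSec[p - 1];   // S(p) = T_1 / T_p
            double efficiency = speedup / p;        // E(p) = S(p) / p; 1.0 is ideal linear scaling
            Console.WriteLine($"p={p}  S(p)={speedup:F2}  E(p)={efficiency:F2}");
        }
    }
}
```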


It may be hitting another bottleneck: all cores share the same L3 cache, memory bus, disk, etc.
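A rough way to probe that (my sketch, not from the article): compare how a memory-streaming loop and a compute-heavy loop scale as you add workers. If the memory-bound one stops improving early while the CPU-bound one keeps scaling, shared memory bandwidth is the likely culprit.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class BottleneckProbe
{
    const int N = 50_000_000;

    static void Main()
    {
        double[] data = new double[N];
        for (int i = 0; i < N; i++) data[i] = i;

        foreach (int workers in new[] { 1, 2, 4, 6 })
        {
            var opts = new ParallelOptions { MaxDegreeOfParallelism = workers };

            // Memory-bound: one multiply per element streamed from RAM.
            var sw = Stopwatch.StartNew();
            Parallel.For(0, N, opts, i => { data[i] *= 1.0000001; });
            long memMs = sw.ElapsedMilliseconds;

            // CPU-bound: lots of arithmetic per element, very little memory traffic.
            sw.Restart();
            Parallel.For(0, N / 10, opts, i =>
            {
                double x = i;
                for (int k = 0; k < 50; k++) x = Math.Sqrt(x + 1.0);
                if (x < 0) Console.WriteLine(x);   // keep the work from being optimized away
            });
            long cpuMs = sw.ElapsedMilliseconds;

            Console.WriteLine($"{workers} workers: memory-bound {memMs} ms, CPU-bound {cpuMs} ms");
        }
    }
}
```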


From the article:

> Each place name is represented by a UTF-8 (en.wikipedia.org/wiki/UTF-8) text line record (variable length) with more than 15 tab-separated data columns. Note: The UTF-8 encoding assures that a tab (0x9) or line feed (0xA) value won’t occur as part of a multi-byte sequence; this is essential for several implementations.

What? I guess that they use a longer (2-byte?) encoding for those codepoints, but from the very same Wikipedia page that they link:

> a sequence that decodes to a value that should use a shorter sequence (an "overlong form") [is invalid]

...

> Implementations of the decoding algorithm MUST protect against decoding invalid sequences

Are they advising the use of an invalid and potentially broken UTF8 encoding?


I think you misread - "assures", not "assumes". UTF8 guarantees that any valid ASCII char is NOT part of a UTF8-encoded multibyte char.


I think you are right.

I thought the author meant that it is possible to use \n and \t in values because UTF8 would encode them as multibyte sequences (like Modified UTF8 encodes \0 as 0xC0,0x80).

What he actually meant is that if a codepoint is > 127, its multibyte encoding won't contain any \n or \t bytes.

Sorry for the confusion.


They are describing a UTF8 feature - nothing invalid or broken about it.
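A quick sketch of why (mine, not from the article): in UTF-8, every byte of a multi-byte sequence has the high bit set (lead bytes 0xC2-0xF4, continuation bytes 0x80-0xBF), so the values 0x09 and 0x0A can only ever be literal ASCII tab/newline, and byte-level field splitting is safe.

```csharp
using System;
using System.Text;

class Utf8TabCheck
{
    static void Main()
    {
        // Made-up record with non-ASCII place names, tab-separated.
        string line = "Zürich\t47.37\t8.54\t日本\tСанкт-Петербург";
        byte[] bytes = Encoding.UTF8.GetBytes(line);

        int fields = 1;
        foreach (byte b in bytes)
        {
            if (b == 0x09) fields++;   // a 0x09 byte is always a real tab, never part of a multi-byte char
        }
        Console.WriteLine($"{fields} fields found by scanning raw bytes");
        // Prints 5, matching line.Split('\t').Length on the decoded string.
    }
}
```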



