
This is not an outlier. `mlr` is quite slow, off-the-charts slow for our purposes, when we benchmarked it against xsv and zsv (see https://github.com/liquidaty/zsv; disclaimer: I'm one of its authors).


For completeness (just one CPU/machine): a recent checkout of zsv 2tsv (built with -O3 -march=native) on that same file and computer seems to take 0.380 sec, almost 2X longer than c2tsv's 0.20 sec (built with -mm:arc -d:danger, gcc-11). That said, zsv 2tsv does seem a little faster than xsv cat rows.

OTOH, zsv count takes only 0.114 sec for me (though of course, as I'm sure you know, that only counts rows, not columns, which some might complain about). { EDIT: and I've never tried to time a "parse only" mode for c2tsv. }


BTW, does c2tsv handle multibyte UTF8, \r\n vs \n, and regular escapes (e.g. embedded dbl-quote/newline/comma), as well as CSV garbage that doesn't exist in theory but is abundant in the real world (e.g. a dbl-quote inside a cell that did not start with a dbl-quote, malformed UTF8, etc.)? Handling those in the same way Excel does added considerable overhead to zsv (and is the reason it could only perform a subset of the processing in SIMD and had to use regular branchy code for the rest).


It handles most cases, though maybe not arbitrary garbage that humans might be able to guess at; I don't think RFC 4180 covers all those anyway. c2tsv is UTF8/binary agnostic: it just keys off ASCII commas, newlines, etc. Beats me how one ensures anything is handled the "same" way Excel does without actually running Excel's code somehow. { The same way as Excel today, or next year, or 10 years ago? } The little state machine could be extended, but it's hard to guess what the speed impact might be until you actually write said extensions.
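To make the discussion concrete, here is a minimal sketch (not c2tsv's or zsv's actual code; the function name and shape are illustrative only) of the kind of little byte-at-a-time state machine being described: a CSV-to-TSV pass whose only state is whether we are inside a double-quoted field. It handles "" escapes and drops \r, but does nothing clever for garbage input, and it does not re-escape tabs or newlines that were inside quotes:

```c
#include <stddef.h>

/* Sketch of a one-state-bit CSV -> TSV DFA (illustrative, not c2tsv's code).
 * Writes at most n bytes to out; returns the number of bytes written. */
size_t csv_to_tsv(const char *in, size_t n, char *out) {
    size_t o = 0;
    int quoted = 0;                      /* are we inside a quoted field? */
    for (size_t i = 0; i < n; i++) {
        char c = in[i];
        if (quoted) {
            if (c == '"') {
                if (i + 1 < n && in[i + 1] == '"') {
                    out[o++] = '"';      /* "" -> literal double quote */
                    i++;
                } else
                    quoted = 0;          /* closing quote */
            } else
                out[o++] = c;            /* commas/newlines pass through */
        } else {
            if (c == '"')
                quoted = 1;              /* opening quote */
            else if (c == ',')
                out[o++] = '\t';         /* field separator becomes a tab */
            else if (c != '\r')
                out[o++] = c;            /* normalize \r\n to \n */
        }
    }
    return o;
}
```

Every byte takes at least one branch here, which is exactly the cost the quoted layer of CSV imposes and why the "messy part" resists SIMD.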

From a performance perspective, strictly delimiter-separated values { again, ironically redundant ;-) } can be parsed with memchr. On Linux, memchr should be SIMD-vectorized, at least on x86_64 glibc via ELF ifunc symbols. So, while you give up SIMD on the "messy part" with a byte-at-a-time DFA, you regain it on the other side. (I have no idea if Apple gives you a SIMD-vectorized memchr.)
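The memchr point can be sketched in a few lines: once data is strictly delimiter-separated (no quoting layer), finding each field boundary is just the next delimiter byte, so the hot loop is a series of memchr calls that glibc dispatches to a vectorized implementation. The helper below is illustrative, not from either library:

```c
#include <stddef.h>
#include <string.h>

/* Count fields in one row of strictly delimiter-separated data.
 * Each boundary is found by memchr, which glibc resolves (via ifunc)
 * to a SIMD implementation on x86_64. Assumes no quoting/escaping. */
size_t count_fields(const char *row, size_t n, char delim) {
    size_t fields = 1;                   /* k delimiters => k+1 fields */
    const char *p = row, *end = row + n;
    while ((p = memchr(p, delim, (size_t)(end - p))) != NULL) {
        fields++;
        p++;                             /* resume just past the delimiter */
    }
    return fields;
}
```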

Sending to a file and segmentation (for parallel handling of segments) are also simple applications of memchr, with no need for an index of where rows start: you just split by byte offset and scan to the next newline char (roughly). This can get you 16..128X speed-ups (today, anyway, on just one host) depending upon what you do.
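A rough sketch of that segmentation step, under the assumption that quoted newlines have already been converted away so every '\n' really is a row boundary: pick k evenly spaced byte offsets, then snap each one forward to the next newline so every segment starts at a row start. (Illustrative code, not from c2tsv or zsv.)

```c
#include <stddef.h>
#include <string.h>

/* Compute up to k segment start offsets into buf for parallel parsing.
 * Assumes '\n' only ever occurs as a row terminator (no quoted newlines).
 * Returns the number of segment starts written to starts[]. */
size_t segment(const char *buf, size_t n, size_t k, size_t *starts) {
    size_t nseg = 0;
    for (size_t i = 0; i < k; i++) {
        size_t off = n * i / k;          /* evenly spaced byte offset */
        if (i > 0) {                     /* snap forward to a row start */
            const char *nl = memchr(buf + off, '\n', n - off);
            if (!nl)
                break;                   /* no further rows to split at */
            off = (size_t)(nl - buf) + 1;
            if (off >= n)
                break;                   /* would be an empty segment */
        }
        if (nseg == 0 || off > starts[nseg - 1])
            starts[nseg++] = off;        /* skip duplicate boundaries */
    }
    return nseg;
}
```

Each thread then parses buf[starts[i] .. starts[i+1]) independently, which is where the "split by bytes, find the next newline" speed-up comes from.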

Conversion to something properly byte-delimited basically restores whatever charm you might have thought ?SV had. I can only imagine a few corner cases where running directly off a complex format like quoted CSV makes sense ("tiny" data, "cannot/will not spend 2X space and must save input", "cannot/will not spend time to recompress", "running on a network filesystem shared with those who refuse simplicity"). These cases are not common (for me), and when they do happen, perf is usually limited by other things like network IO, start-up overheads, etc. Usually that little extra bit of work to write buffers out to a pipeline will either not matter or be outright immediately repaid in parallelism, parsing simplicity, or both.

Converting from any ASCII format to even faster binary formats has a similar story, but usually with even more perf improvement (depending..) and more "choices", like how to represent strings [1]. Once data is fully pre-parsed, the performance of the conversion matters much less (by whatever the ratio of processings per initial parse is). Between parallelism and ASCII->binary, however fast you make your serial zsv parser/ETL stuff, actual data analysis may still run 10,000 times slower than it could on just 1 CPU (depending upon what throttles your workloads.. you may only get 10,000X for CPU-local, L1-resident nested-loop stuff). { But we veer now toward trying to cram a databases course into an HN comment. :) And I'm probably repeating myself/others. Direct email from here may work better. }

[1] https://github.com/c-blake/nio


> strictly delimiter-separated values

Sigh... If only everyone had used ASCII (and Unicode!) characters 30 and 31 for delimiters, since they are actual delimiter characters: https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text

I don't think I've ever seen them in the wild. :-(
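For the curious, ASCII-delimited text is trivial to emit: because 0x1f (unit separator) and 0x1e (record separator) never appear in ordinary text, no quoting or escaping layer is needed at all. A toy writer (illustrative only):

```c
#include <stddef.h>
#include <string.h>

#define US 0x1f  /* ASCII 31, unit separator: between fields      */
#define RS 0x1e  /* ASCII 30, record separator: ends each record  */

/* Join nf NUL-terminated fields into one ASCII-delimited record in out.
 * Assumes out is large enough; returns bytes written. No escaping is
 * needed since US/RS cannot occur in normal text. */
size_t write_record(char *out, const char *const *fields, size_t nf) {
    size_t o = 0;
    for (size_t i = 0; i < nf; i++) {
        size_t len = strlen(fields[i]);
        memcpy(out + o, fields[i], len);
        o += len;
        out[o++] = (char)((i + 1 < nf) ? US : RS);
    }
    return o;
}
```

Reading it back is the memchr story again: scan for RS to find records, then US to find fields.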


Thanks for mentioning; will try it out. Did you use the default build settings for zsv (i.e. just plain old "make install")? Also, do you have a copy or the location of the dataset you used to test on? And what hardware/OS, if I may ask?



