Those plots are confusing - plotting time and standard deviation as two bars next to each other is odd. It would be much better if the plot used error bars, e.g. like this: https://egret.psychol.cam.ac.uk/statistics/local_copies_of_s... . Otherwise it's very hard to tell whether the difference between two implementations is just noise or not.
And if the numbers are so large, adjust the scale. Don't plot thousands of nanoseconds—plot microseconds!
I'm not faulting the author for their plots—making good plots is hard!—but good articles suffer when the data is presented in such ways.
The dates are strictly formatted, the values are stored as UTF-8, and there are no cultures to consider.
I wonder what the speedup will be from omitting all the validation and associated branching and just picking the values out directly. Chances are that you won't see more than one date format in the same input either. Another strategy I'd use is to parallelise the digitisation; assuming the YYYY-MM-DDTHH:mm:ss format, it'd start off something like this (untested, but should be close if not completely correct):
Y = *(uint32_t*)s - 0x30303030;
M = *(uint16_t*)(s+5) - 0x3030;
D = *(uint16_t*)(s+8) - 0x3030;
Y = 100 * ((10 * (Y >> 24) + (Y >> 16)) & 255) +
((10 * (Y >> 8) + Y) & 255);
M = (10 * (M >> 8) + M) & 255;
D = (10 * (D >> 8) + D) & 255;
It should be possible to reduce the year computation to 2 multiplications, at the expense of several more masking operations. Around 50-100 clock cycles per full datetime should be possible, which is in the few-dozen-ns range on a GHz-clock CPU. Probably could go a bit faster still if you start bringing in the SSE...
This is quite clever. The catch is that it assumes implementation-specific details (like a little-endian architecture) while OP's code is (probably) portable C.
Which part are you referring to? Bit operations etc. aren't affected by endianness. It looks like it would only potentially be a problem if the input is coming from another system (I haven't looked at where the input comes from or what format/type it is).
By casting a string to a uint32_t, you are assuming the memory layout of uint32_t. Y will have a different value depending on the architecture. E.g. on a typical little-endian system you will get 0x06010002; on big-endian you will get 0x02000106.
I'm talking about C, however. I didn't notice the post is talking about C#. Things might be better defined there.
If there is no carry, then I don't understand. Four bytes will have the same value deducted, regardless of the byte order. If there is no carry implication (all are digits) then there is no issue.
ARM has had unaligned access since v6 (introduced in 2001); if you're on Linux, unaligned access will be fixed up by the kernel (as was the case prior to ARMv6, and even for MIPS AFAIK).
Anyway, the point of his post was about possible gains from removing validation, not about being portable or production code.
The subtraction isn't the issue, the cast is. The string "2016" is represented by the byte sequence [0x32, 0x30, 0x31, 0x36]. Casting this array to a uint32_t* on big endian gives you the integer 0x32303136 (or 842019126), while on little endian it gives you the integer 0x36313032 (or 909193266).
When you show the string in memory order, they are the same. It's the operation on the string that's important, not the way you print the hex byte-order-dependent value. Both become 02 00 01 06.
But we're trying to convert the string "2016" to the integer 2016.
We want to turn the sequence [0x32, 0x30, 0x31, 0x36] (the same on both architectures) into [0x00, 0x00, 0x07, 0xe0] on big endian or [0xe0, 0x07, 0x00, 0x00] on little endian. You can't simply perform the same procedure on both architectures, since it'll result in a reversed sequence on one of them...
Y>>24 will be the byte at +3 on a little-endian system, or +0 on a big-endian system. It's therefore the digit in the 10^0 column in the former case, or the 10^3 column in the latter.
So it actually looks like it currently assumes a big-endian system.
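To make this concrete, here's a quick sketch in Python (just my illustration, not anything from the article) of what the cast produces under each byte order, and which value the year formula actually works for:

import struct

s = b"2016"  # raw bytes [0x32, 0x30, 0x31, 0x36], identical on every architecture
le = struct.unpack("<I", s)[0] - 0x30303030  # what *(uint32_t*)s - 0x30303030 yields on little endian
be = struct.unpack(">I", s)[0] - 0x30303030  # ...and on big endian
print(hex(le), hex(be))  # 0x6010002 0x2000106

# The year formula from the snippet above only recovers 2016 from the big-endian value:
year = lambda y: 100 * ((10 * (y >> 24) + (y >> 16)) & 255) + ((10 * (y >> 8) + y) & 255)
print(year(le), year(be))  # 6102 2016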
I do a lot of log crunching with pypy, and datetime.strftime is a really big part of the cost. I wish someone could contribute faster datetime routines to the standard CPython and pypy.
Other parts that are always hot include split() and string concatenation. Java compilers can substitute StringBuffers when they see naive string concatenation, but in Python there's no easy way to build a string in a complex loop and you end up putting string fragments into a list and then finally join()ing them. Madness!
One trick that helped when I was doing log crunching (where time parsing was a good 10%) was to cache the parse.
All log lines began with a date+time like "2015-12-10 14:42:54.432" and there's maybe 100 lines per second. You can therefore just take the first 19 characters, parse that to a millisecond unix time and then separately parse the milliseconds to an int and add that. All you need is one cache entry (since logs are mostly in order) and then you can just do a string comparison (i.e. no hashmap lookup) to check the cache - instantly 100x fewer time parsing calls.
The best way to speed up a function is to not call it!
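Roughly, the trick looks like this in Python (a sketch of the idea as described, not the original code; the names and timezone handling are made up):

import datetime

_last_prefix = None  # last "YYYY-MM-DD HH:MM:SS" prefix seen
_last_millis = 0     # its parsed value, in milliseconds since the epoch

def parse_log_time(line):
    global _last_prefix, _last_millis
    prefix = line[:19]
    if prefix != _last_prefix:  # plain string compare, no hashmap lookup
        dt = datetime.datetime.strptime(prefix, "%Y-%m-%d %H:%M:%S")
        _last_prefix = prefix
        _last_millis = int(dt.timestamp() * 1000)
    return _last_millis + int(line[20:23])  # add the separately parsed milliseconds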
Well, it's half memoization, and half knowing that you can split the work into smaller parts, which allows you to take advantage of the memoization.
Just adding memoization to date and time parsing gets you very little when there's little duplication in the inputs, and without breaking the data apart it could very likely have yielded worse performance.
In [63]: timeit dateutil.parser.parse('2013-05-11T21:23:58.970460+07:00')
10000 loops, best of 3: 89.5 µs per loop
In [64]: timeit arrow.get('2013-05-11T21:23:58.970460+07:00')
10000 loops, best of 3: 62.1 µs per loop
In [65]: timeit numpy.datetime64('2013-05-11T21:23:58.970460+07:00')
1000000 loops, best of 3: 714 ns per loop
In [66]: timeit iso8601.parse_date('2013-05-11T21:23:58.970460+07:00')
10000 loops, best of 3: 23.9 µs per loop
> Other parts that are always hot include split() and string concatenation. Java compilers can substitute StringBuffers when they see naive string concatenation, but in Python there's no easy way to build a string in a complex loop and you end up putting string fragments into a list and then finally join()ing them. Madness!
The Python solution you describe is the same as in Java. If you have `String a = b + c + d;` then the compiler may optimize this using a StringBuffer, as you say[1]. In Python it's also pretty cheap to do `a = b + c + d` to concatenate strings (or `''.join([b, c, d])`; but you should run a little microbenchmark to see which works best). But if it's in a "complex loop", as you put it, then Java will certainly not do this. So you have to build a buffer using StringBuilder and then call toString(), which is basically the same process, except it's spelled `builder.toString()` instead of `''.join(builder)`.
Unless of course you have some interesting insights into the jvm internals about string concatenation optimizations.
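Concretely, the two spellings in Python look like this (a throwaway sketch; the function and data names are made up):

def render(rows):
    parts = []                 # collect fragments, join once at the end
    for name, value in rows:
        parts.append("%s=%s\n" % (name, value))
    return "".join(parts)

def render_naive(rows):
    out = ""                   # naive += in a loop
    for name, value in rows:
        out += "%s=%s\n" % (name, value)
    return out

CPython can sometimes do the += version in place; other implementations may not (more on that below).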
Of course you have a different machine, but the OP was getting 2.5 µs per parse in .NET versus your 89.5 µs in Python. I wouldn't have expected such a difference. No wonder it's a hot path.
Well that's dateutil (installed from pip) and not datetime (std). As part of log ingestion I would, of course, convert to UTC and drop the timezone distinctions, since Python slows down a lot when it has to worry about timezones. Working in the same units with no DST issues is much nicer/quicker.
Anyway, if you're installing packages from pip, you may as well just install iso8601 and get the best performance - possibly beating .NET (who knows? as you said, I have a different machine than the OP).
dateutil spends most of its time inferring the format; it's not really designed as a performance component, it's designed as an "if it looks like a date we'll give you a datetime" style component.
The Java story is slightly more complex. Historically, javac emitted code that makes a StringBuilder and calls append on it multiple times, and the JIT then spotted this construction and optimised it.
This is, as you can guess, somewhat fragile, especially when some of the parts may be constants. So JEP 280 has changed javac to emit an invokedynamic instruction with information about the constant and dynamic parts of the string, so the optimisation strategy can be chosen at run time and can change over time without requiring everyone to recompile their Java code.
If you are using CPython 2.4 or 2.5 or later, use the clearest code. In earlier versions using join on a list was faster, but the in-place add operation was optimized, turning the advice about string.join being faster mostly into mythlore.
So whilst pypy is otherwise much faster than CPython, the fact that it misses this kind of optimisation is why CPython can actually be faster for parsing my logs.
I know about this because I've been bitten by it :)
Forget parsing time, just taking the time takes an awful lot of time. Try measuring a very quick routine (a few nanoseconds) with clock_gettime: you have to do a number of calls and take a mean because of resolution issues, and also measure and reduce the overhead of the logging routine. This becomes a real issue because taking a mean over many calls is different from doing just one call (cache, branch prediction, etc). You could add something else inside the loop, but then your fast routine becomes a tiny part of what you are measuring and the relative error explodes.

It is simply not possible to benchmark routines that take just a few nanoseconds precisely and exactly with userspace routines. There are whitepapers from Intel on how to do this on their processors with a kernel module to disable preemption and IRQs and read the TSC directly with asm, but then you can't have userspace stuff... Benchmarking quick stuff is no fun.
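The usual userspace compromise is batching: time N calls, take the best mean, and subtract the measured overhead of an empty loop, accepting that warm caches and a trained branch predictor make it a slightly different experiment. A Python sketch of that idea (the same shape applies to a C harness around clock_gettime; the function names and batch sizes here are made up):

import time

def per_call_ns(fn, batch=100_000, repeats=7):
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter_ns()
        for _ in range(batch):
            fn()                      # runs with warm caches / trained predictor
        best = min(best, (time.perf_counter_ns() - t0) / batch)
    return best

overhead = per_call_ns(lambda: None)       # loop + call overhead, measured the same way
cost = per_call_ns(lambda: int("2016"))
print("~%.1f ns per call (overhead ~%.1f ns)" % (cost - overhead, overhead))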
Hardest thing I ever had to profile had to do with accessing Postgres from Python. I needed the cold-start time. I ended up kinda fuzzing it to find an estimate. Filled a table with proper data and accessed random keys to find an average. Even then some optimization occurred.
I don't get it. In a single millisecond, a processor does on the order of a million operations, and if it's being run that tightly the code should be close to the CPU, perhaps in L2 or L3 cache -- what on earth requires parsing 62x faster than that, for date parsing? 320 ns * 62 is 19 microseconds for the version that you said was too slow for you, which is 0.019 milliseconds. Like, what even generates formatted date data that fast, or why can't you wait 0.019 ms to parse it out again? Genuine question.
EDIT: I got downvoted, but it's just a question. If it's logs, is it old logs, or streaming from somewhere? Rather than speculate I'd like to hear the actual answer from thomas-st for their use case. 62x is just under 1.8 orders of magnitude, so I wouldn't think it matters in comparison with typical CPU and RAM speeds, disk, network, or other bottlenecks...
Good CSV parsers reach 200MB/s, and a formatted datetime is under 40 bytes, so assuming you only get dates your input is ~5 million dates per second, while at 19µs/date you parse ~50000 dates per second. The date parsing is roughly two orders of magnitude behind the data source.
Even allowing for other fields, e.g. a timestamp followed by some other data, at 19µs/datetime you can easily end up with that bottlenecking your entire pipeline if the datasource spews (which is common in contexts like HFT, aggregated logs and the like).
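Spelling out the arithmetic with the same figures as above (a trivial Python check):

bytes_per_s = 200e6                      # 200 MB/s from a good CSV parser
dates_per_s_source = bytes_per_s / 40    # ~5,000,000 formatted dates per second
dates_per_s_parsed = 1 / 19e-6           # ~52,600 dates per second at 19 µs each
print(dates_per_s_source / dates_per_s_parsed)  # ~95x, i.e. roughly two orders of magnitude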
> at 19µs/datetime you can easily end up with that bottlenecking your entire pipeline if the datasource spews (which is common in contexts like HFT, aggregated logs and the like)
+1
This is why a little ELT goes a long way.
>Good CSV parsers reach 200MB/s
By good (and open source) we're talking about libcsv, rust-csv, and rust quick-csv[1]. If you're doing your own custom parsing you can write your own numeric parsers to remove support for parsing nan, inf, -inf, etc and drop scientific notation which will claw back a lot of the time. If you also know the exact width of the date field then you can also shave plenty of time parsing datetimes. But at that point, maybe write data to disk as protobuf or msgpack or avro, or whatever.
> If you're doing your own custom parsing you can write your own numeric parsers to remove support for parsing nan, inf, -inf, etc and drop scientific notation which will claw back a lot of the time.
The 200MB/s, at least for rust-csv, is for "raw" parsing (handling CSV itself) not field parsing and conversions, so those would be additional costs.
> If you also know the exact width of the date field then you can also shave plenty of time parsing datetimes.
Yes, if you can have fixed-size fields and remove things like escaping and quoting, things get much faster.
Interesting. I'd wait to hear their answer - if it's a CSV bottlenecking at 2 MB/sec instead of 200 MB/sec it would be interesting (which is the two orders of magnitude you state). But with all that said, they reported a 62x improvement, so 1.8 orders of magnitude.
Parsing logs - or basically any file - at 2MB/s (rounding the 1.8 orders of magnitude up to 2) is pretty terrible. Hence why thomas-st wrote his own parser.
That's amazing, and incredibly surprising. It's one of the operations I would have thought was essentially free, and completely dwarfed by other bottlenecks and latencies. I mean, it's just a few characters, right? I'm really surprised by the result you report in your link - thanks for writing it up!
Parsing time is a huge cost when the Logfile Navigator (http://lnav.org) is parsing log files. The libc strptime() is terribly slow, so I wrote a simple compiler[1] that would spit out inlined C++ code for parsing a particular time format (e.g. "%Y-%m-%dT%H:%M:%S.%f%z"). Coupled with the fact that a time format is not going to differ across log lines in a single file, lnav is able to lock onto a date format and parse all the lines in a file fairly quickly.
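Not lnav's actual generated code, but the idea in a few lines of Python: once you know the exact layout, you can pick the fields out by position instead of re-interpreting a format string on every call (this sketch ignores the fractional seconds and timezone offset that the %f%z part would handle):

import datetime

def parse_iso_fixed(s):
    # s looks like "2016-01-02T03:04:05..."; slice known positions directly
    return datetime.datetime(
        int(s[0:4]), int(s[5:7]), int(s[8:10]),
        int(s[11:13]), int(s[14:16]), int(s[17:19]),
    )

print(parse_iso_fixed("2016-01-02T03:04:05.123456-0500"))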
Simple optimisations like this shave minutes off the runtime of log parsing and data import jobs I run.
I do lots of this in Ruby [0], and often write a custom parser when I know that dates or times will be of regular form. As a bonus I get to golf different methods of doing the parsing against each other, using microbenchmarks.