Those plots are confusing - plotting time and standard deviation as two bars next to each other is odd. It would be much better if the plot used error bars, e.g. like this: https://egret.psychol.cam.ac.uk/statistics/local_copies_of_s... . Otherwise it's very hard to tell whether the difference between two implementations is just noise or not.
And if the numbers are so large, adjust the scale. Don't plot thousands of nanoseconds—plot microseconds!
I'm not faulting the author for their plots—making good plots is hard!—but good articles suffer when the data is presented in such ways.
The dates are strictly formatted, the values are stored as UTF-8, and there are no cultures to consider.
I wonder what the speedup will be from omitting all the validation and associated branching and just picking the values out directly. Chances are that you won't see more than one date format in the same input either. Another strategy I'd use is to parallelise the digitisation; assuming the YYYY-MM-DDTHH:mm:ss format, it'd start off something like this (untested, but should be close if not completely correct):
Y = *(uint32_t*)s - 0x30303030;
M = *(uint16_t*)(s+5) - 0x3030;
D = *(uint16_t*)(s+8) - 0x3030;
Y = 100 * ((10 * (Y >> 24) + (Y >> 16)) & 255) +
((10 * (Y >> 8) + Y) & 255);
M = (10 * (M >> 8) + M) & 255;
D = (10 * (D >> 8) + D) & 255;
It should be possible to reduce the year computation to 2 multiplications, at the expense of several more masking operations. Around 50-100 clock cycles per full datetime should be possible, which is in the few-dozen-ns range on a GHz-clock CPU. Probably could go a bit faster still if you start bringing in the SSE...
This is quite clever. The catch is that it assumes implementation-specific details (like a little-endian architecture) while OP's code is (probably) portable C.
Which part are you referring to? Bit operations etc. aren't affected by endianness. It looks like it would only potentially be a problem if the input is coming from another system (I haven't looked at where the input comes from or what format/type it is).
By casting a string to a uint32_t, you are assuming the memory layout of uint32_t. Y will have a different value depending on the architecture. E.g. on a typical little-endian system you will get 0x06010002; on big-endian you will get 0x02000106.
I'm talking about C, however. I didn't notice the post is talking about C#. Things might be better defined there.
If there is no carry, then I don't understand. Four bytes will have the same value deducted, regardless of the byte order. If there is no carry implication (all are digits) then there is no issue.
ARM has had unaligned access since v6 (introduced in 2001); if you're on Linux, unaligned access will be fixed up by the kernel (as was the case prior to ARMv6, and even for MIPS AFAIK).
Anyway, the point of his post was about possible gains from removing validation, not about being portable or production code.
The subtraction isn't the issue, the cast is. The string "2016" is represented by the byte sequence [0x32, 0x30, 0x31, 0x36]. Casting this array to a uint32_t* on big endian gives you the integer 0x32303136 (or 842019126), while on little endian it gives you the integer 0x36313032 (or 909193266).
When you show the string in memory order, they are the same. It's the operation on the string that's important, not the way you print the hex byte-order-dependent value. Both become 02 00 01 06.
But we're trying to convert the string "2016" to the integer 2016.
We want to turn the sequence [0x32, 0x30, 0x31, 0x36] (the same on both architectures) into [0x00, 0x00, 0x07, 0xe0] on big endian or [0xe0, 0x07, 0x00, 0x00] on little endian. You can't simply perform the same procedure on both architectures, since it'll result in a reversed sequence on one of them...
Y>>24 will be the byte at +3 on a little-endian system, or +0 on a big-endian system. It's therefore the digit in the 10^0 column in the former case, or the 10^3 column in the latter.
So it actually looks like it currently assumes a big-endian system.
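To make this concrete, here's a quick sketch in Python (just my illustration, not anything from the article) of what the cast produces under each byte order, and which value the year formula actually works for:

import struct

s = b"2016"  # raw bytes [0x32, 0x30, 0x31, 0x36], identical on every architecture
le = struct.unpack("<I", s)[0] - 0x30303030  # what *(uint32_t*)s - 0x30303030 yields on little endian
be = struct.unpack(">I", s)[0] - 0x30303030  # ...and on big endian
print(hex(le), hex(be))  # 0x6010002 0x2000106

# The year formula from the snippet above only recovers 2016 from the big-endian value:
year = lambda y: 100 * ((10 * (y >> 24) + (y >> 16)) & 255) + ((10 * (y >> 8) + y) & 255)
print(year(le), year(be))  # 6102 2016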
I do a lot of log crunching with pypy, and datetime.strftime is a really big part of the cost. I wish someone could contribute faster datetime routines to the standard CPython and pypy.
Other parts that are always hot include split() and string concatenation. Java compilers can substitute StringBuffers when they see naive string concatenation, but in Python there's no easy way to build a string in a complex loop and you end up putting string fragments into a list and then finally join()ing them. Madness!
One trick that helped when I was doing log crunching (where time parsing was a good 10%) was to cache the parse.
All log lines began with a date+time like "2015-12-10 14:42:54.432" and there's maybe 100 lines per second. You can therefore just take the first 19 characters, parse that to a millisecond unix time and then separately parse the milliseconds to an int and add that. All you need is one cache entry (since logs are mostly in order) and then you can just do a string comparison (i.e. no hashmap lookup) to check the cache - instantly 100x fewer time parsing calls.
The best way to speed up a function is to not call it!
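Roughly, the trick looks like this in Python (a sketch of the idea as described, not the original code; the names and timezone handling are made up):

import datetime

_last_prefix = None  # last "YYYY-MM-DD HH:MM:SS" prefix seen
_last_millis = 0     # its parsed value, in milliseconds since the epoch

def parse_log_time(line):
    global _last_prefix, _last_millis
    prefix = line[:19]
    if prefix != _last_prefix:  # plain string compare, no hashmap lookup
        dt = datetime.datetime.strptime(prefix, "%Y-%m-%d %H:%M:%S")
        _last_prefix = prefix
        _last_millis = int(dt.timestamp() * 1000)
    return _last_millis + int(line[20:23])  # add the separately parsed milliseconds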
Well, it's half memoization, and half knowing that you can split the work into smaller parts, which allows you to take advantage of the memoization.
Just adding memoization to date and time parsing gets you very little when there's little duplication in the inputs, and without breaking the data apart it could very likely have yielded worse performance.
In [63]: timeit dateutil.parser.parse('2013-05-11T21:23:58.970460+07:00')
10000 loops, best of 3: 89.5 µs per loop
In [64]: timeit arrow.get('2013-05-11T21:23:58.970460+07:00')
10000 loops, best of 3: 62.1 µs per loop
In [65]: timeit numpy.datetime64('2013-05-11T21:23:58.970460+07:00')
1000000 loops, best of 3: 714 ns per loop
In [66]: timeit iso8601.parse_date('2013-05-11T21:23:58.970460+07:00')
10000 loops, best of 3: 23.9 µs per loop
> Other parts that are always hot include split() and string concatenation. Java compilers can substitute StringBuffers when they see naive string concatenation, but in Python there's no easy way to build a string in a complex loop and you end up putting string fragments into a list and then finally join()ing them. Madness!
The Python solution you describe is the same as in Java. If you have `String a = b + c + d;` then the compiler may optimize this using a StringBuffer, as you say[1]. In Python it's also pretty cheap to do `a = b + c + d` to concatenate strings (or `''.join([b, c, d])`; but you should run a little microbenchmark to see which works best). But if it's in a "complex loop", as you put it, then Java will certainly not do this. So you have to build a buffer using StringBuilder and then call toString(), which is basically the same process, except it's spelled `builder.toString()` instead of `''.join(builder)`.
Unless of course you have some interesting insights into the jvm internals about string concatenation optimizations.
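Concretely, the two spellings in Python look like this (a throwaway sketch; the function and data names are made up):

def render(rows):
    parts = []                 # collect fragments, join once at the end
    for name, value in rows:
        parts.append("%s=%s\n" % (name, value))
    return "".join(parts)

def render_naive(rows):
    out = ""                   # naive += in a loop
    for name, value in rows:
        out += "%s=%s\n" % (name, value)
    return out

CPython can sometimes do the += version in place; other implementations may not (more on that below).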
Of course you have a different machine, but the OP was getting 2.5 µs per parse in .NET versus your 89.5 µs in Python. I wouldn't have expected such a difference. No wonder it's a hot path.
Well that's dateutil (installed from pip) and not datetime (std). As part of log ingestion I would, of course, convert to UTC and drop the timezone distinctions, since Python slows down a lot when it has to worry about timezones. Working in the same units with no DST issues is much nicer/quicker.
Anyway, if you're installing packages from pip, you may as well just install iso8601 and get the best performance - possibly beating .NET (who knows? as you said, I have a different machine than the OP).
dateutil spends most of its time inferring the format; it's not really designed as a performance component, it's designed as an "if it looks like a date we'll give you a datetime" style component.
The Java story is slightly more complex. Historically, javac emitted code that makes a StringBuilder and calls append on it multiple times, and the JIT then spotted this construction and optimised it.
This is, as you can guess, somewhat fragile, especially when some of the parts may be constants. So JEP 280 has changed javac to emit an invokedynamic instruction with information about the constant and dynamic parts of the string, so the optimisation strategy can be chosen at run time and can change over time without requiring everyone to recompile their Java code.
If you are using CPython 2.4 or 2.5 or later, use the clearest code. In earlier versions using join on a list was faster, but the in-place add operation was optimized, turning the advice about string.join being faster mostly into mythlore.
So whilst pypy is otherwise much faster than CPython, the fact that it misses this kind of optimisation is why CPython can actually be faster for parsing my logs.
I know about this because I've been bitten by it :)
Forget parsing time, just taking the time takes an awful lot of time. Try measuring a very quick routine (a few nanoseconds) with clock_gettime: you have to do a number of calls and take a mean because of resolution issues, and also measure and reduce the overhead of the logging routine. This becomes a real issue because taking a mean over many calls is different from doing just one call (cache, branch prediction, etc). You could add something else inside the loop, but then your fast routine becomes a tiny part of what you are measuring and the relative error explodes.

It is simply not possible to benchmark routines that take just a few nanoseconds precisely and exactly with userspace routines. There are whitepapers from Intel on how to do this on their processors with a kernel module to disable preemption and IRQs and read the TSC directly with asm, but then you can't have userspace stuff... Benchmarking quick stuff is no fun.
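The usual userspace compromise is batching: time N calls, take the best mean, and subtract the measured overhead of an empty loop, accepting that warm caches and a trained branch predictor make it a slightly different experiment. A Python sketch of that idea (the same shape applies to a C harness around clock_gettime; the function names and batch sizes here are made up):

import time

def per_call_ns(fn, batch=100_000, repeats=7):
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter_ns()
        for _ in range(batch):
            fn()                      # runs with warm caches / trained predictor
        best = min(best, (time.perf_counter_ns() - t0) / batch)
    return best

overhead = per_call_ns(lambda: None)       # loop + call overhead, measured the same way
cost = per_call_ns(lambda: int("2016"))
print("~%.1f ns per call (overhead ~%.1f ns)" % (cost - overhead, overhead))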
Hardest thing I ever had to profile had to do with accessing Postgres from Python. I needed the cold-start time. I ended up kinda fuzzing it to find an estimate. Filled a table with proper data and accessed random keys to find an average. Even then some optimization occurred.
I don't get it. In a single millisecond, a processor does on the order of a million operations, and if it's being run that tightly the code should be close to the CPU, perhaps in L2 or L3 cache -- what on earth requires parsing 62x faster than that, for date parsing? 320 ns * 62 is 19 microseconds for the version that you said was too slow for you, which is 0.019 milliseconds. Like, what even generates formatted date data that fast, or why can't you wait 0.019 ms to parse it out again? Genuine question.
EDIT: I got downvoted, but it's just a question. If it's logs, is it old logs, or streaming from somewhere? Rather than speculate I'd like to hear the actual answer from thomas-st for their use case. 62x is just under 1.8 orders of magnitude, so I wouldn't think it matters in comparison with typical CPU and RAM speeds, disk, network, or other bottlenecks...
Good CSV parsers reach 200MB/s, and a formatted datetime is under 40 bytes, so assuming you only get dates your input is ~5 million dates per second, while at 19µs/date you parse ~50000 dates per second. The date parsing is roughly two orders of magnitude behind the data source.
Even allowing for other fields, e.g. a timestamp followed by some other data, at 19µs/datetime you can easily end up with that bottlenecking your entire pipeline if the datasource spews (which is common in contexts like HFT, aggregated logs and the like).
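Spelling out the arithmetic with the same figures as above (a trivial Python check):

bytes_per_s = 200e6                      # 200 MB/s from a good CSV parser
dates_per_s_source = bytes_per_s / 40    # ~5,000,000 formatted dates per second
dates_per_s_parsed = 1 / 19e-6           # ~52,600 dates per second at 19 µs each
print(dates_per_s_source / dates_per_s_parsed)  # ~95x, i.e. roughly two orders of magnitude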
> at 19µs/datetime you can easily end up with that bottlenecking your entire pipeline if the datasource spews (which is common in contexts like HFT, aggregated logs and the like)
+1
This is why a little ELT goes a long way.
>Good CSV parsers reach 200MB/s
By good (and open source) we're talking about libcsv, rust-csv, and rust quick-csv[1]. If you're doing your own custom parsing you can write your own numeric parsers to remove support for parsing nan, inf, -inf, etc and drop scientific notation which will claw back a lot of the time. If you also know the exact width of the date field then you can also shave plenty of time parsing datetimes. But at that point, maybe write data to disk as protobuf or msgpack or avro, or whatever.
> If you're doing your own custom parsing you can write your own numeric parsers to remove support for parsing nan, inf, -inf, etc and drop scientific notation which will claw back a lot of the time.
The 200MB/s, at least for rust-csv, is for "raw" parsing (handling CSV itself) not field parsing and conversions, so those would be additional costs.
> If you also know the exact width of the date field then you can also shave plenty of time parsing datetimes.
Yes, if you can have fixed-size fields and remove things like escaping and quoting, things get much faster.
Interesting. I'd wait to hear their answer - if it's a CSV bottlenecking at 2 MB/sec instead of 200 MB/sec it would be interesting (which is the two orders of magnitude you state). But with all that said, they reported a 62x improvement, so 1.8 orders of magnitude.
Parsing logs - or basically any file - at 2MB/s (rounding the 1.8 orders of magnitude up to 2) is pretty terrible. Hence why thomas-st wrote his own parser.
That's amazing, and incredibly surprising. It's one of the operations I would have thought was essentially free, and completely dwarfed by other bottlenecks and latencies. I mean, it's just a few characters, right? I'm really surprised by the result you report in your link - thanks for writing it up!
Parsing time is a huge cost when the Logfile Navigator (http://lnav.org) is parsing log files. The libc strptime() is terribly slow, so I wrote a simple compiler[1] that would spit out inlined C++ code for parsing a particular time format (e.g. "%Y-%m-%dT%H:%M:%S.%f%z"). Coupled with the fact that a time format is not going to differ across log lines in a single file, lnav is able to lock onto a date format and parse all the lines in a file fairly quickly.
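Not lnav's actual generated code, but the idea in a few lines of Python: once you know the exact layout, you can pick the fields out by position instead of re-interpreting a format string on every call (this sketch ignores the fractional seconds and timezone offset that the %f%z part would handle):

import datetime

def parse_iso_fixed(s):
    # s looks like "2016-01-02T03:04:05..."; slice known positions directly
    return datetime.datetime(
        int(s[0:4]), int(s[5:7]), int(s[8:10]),
        int(s[11:13]), int(s[14:16]), int(s[17:19]),
    )

print(parse_iso_fixed("2016-01-02T03:04:05.123456-0500"))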
Simple optimisations like this shave minutes off the runtime of log parsing and data import jobs I run.
I do lots of this in Ruby [0], and often write a custom parser when I know that dates or times will be of regular form. As a bonus I get to golf different methods of doing the parsing against each other, using microbenchmarks.