
Timing the time it takes to parse time - yread
https://ayende.com/blog/175714/timing-the-time-it-takes-to-parse-time-part-ii?Key=f65c7092-3f17-4ded-bb30-9dd479848d5c
======
lorenzhs
Those plots are confusing: plotting the mean time and the standard deviation as
two bars next to each other is odd. It would be much better if the plot used
error bars, e.g. like this:
[https://egret.psychol.cam.ac.uk/statistics/local_copies_of_s...](https://egret.psychol.cam.ac.uk/statistics/local_copies_of_sources_Cardinal_and_Aitken_ANOVA/errorbars_files/graph1.gif).
Otherwise it's very hard to tell whether the difference between two
implementations is just noise or not.

And if the numbers are that large, adjust the scale: don't plot thousands of
nanoseconds, plot microseconds!

I'm not faulting the author for their plots (making good plots is hard!), but
good articles suffer when the data is presented this way.
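
For example, with matplotlib (the numbers here are made up, purely to show the
shape of the plot):

    import matplotlib.pyplot as plt

    # Hypothetical timings, for illustration only.
    impls = ["v1", "v2", "v3"]
    means_us = [25.1, 8.4, 3.2]   # mean parse time, in microseconds
    stdevs_us = [1.9, 0.7, 0.3]   # standard deviation, same unit

    fig, ax = plt.subplots()
    ax.bar(impls, means_us, yerr=stdevs_us, capsize=4)
    ax.set_ylabel("time per parse (us)")  # microseconds, not thousands of ns
    plt.show()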

~~~
DougBTX
Reading further into the article, that chart is a setup for the "new times are
smaller than the old error bars" punch line.

------
userbinator
_The dates are strictly formatted, the values are stored as UTF8 and there are
no cultures to consider._

I wonder what the speedup will be from omitting all the validation and
associated branching and just picking the values out directly. Chances are
that you won't see more than one date format in the same input either. Another
strategy I'd use is to parallelise the digitisation; assuming the
YYYY-MM-DDTHH:mm:ss format, it'd start off something like this (untested, but
should be close if not completely correct):

    Y = *(uint32_t*)s - 0x30303030;   /* subtract ASCII '0' from 4 bytes at once */
    M = *(uint16_t*)(s+5) - 0x3030;
    D = *(uint16_t*)(s+8) - 0x3030;
    Y = 100 * ((10 * (Y >> 24) + (Y >> 16)) & 255) +  /* combine digit bytes */
        ((10 * (Y >> 8) + Y) & 255);
    M = (10 * (M >> 8) + M) & 255;
    D = (10 * (D >> 8) + D) & 255;

It should be possible to reduce the year computation to 2 multiplications, at
the expense of several more masking operations. Around 50-100 clock cycles per
full datetime should be possible, which is in the few-dozen-ns range on a GHz-
clock CPU. Probably could go a bit faster still if you start bringing in the
SSE...
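
A quick way to sanity-check the arithmetic is to run it in Python, with
struct.unpack making the byte-order assumption explicit (">" loads the bytes
big-endian, i.e. the first character lands in the most significant byte):

    import struct

    s = b"2017-06-23T12:34:56"
    Y = struct.unpack(">I", s[0:4])[0] - 0x30303030
    M = struct.unpack(">H", s[5:7])[0] - 0x3030
    D = struct.unpack(">H", s[8:10])[0] - 0x3030
    Y = 100 * ((10 * (Y >> 24) + (Y >> 16)) & 255) + ((10 * (Y >> 8) + Y) & 255)
    M = (10 * (M >> 8) + M) & 255
    D = (10 * (D >> 8) + D) & 255
    print(Y, M, D)  # 2017 6 23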

~~~
avian
This is quite clever. The catch is that it assumes implementation-specific
details (like a little-endian architecture), while OP's code is (probably)
portable C.

~~~
dimman
Which part are you referring to? Bit operations etc. aren't affected by
endianness. It looks like it would only be a problem if the input is coming
from another system (I haven't looked at where the input comes from or what
format/type it is).

~~~
avian
I was referring to this part:

    Y = *(uint32_t*)s - 0x30303030;

By casting a string pointer to uint32_t* and dereferencing it, you are
assuming the memory layout of uint32_t. Y will have a different value
depending on the architecture. E.g. for the string "2016", on a typical
little-endian machine you will get 0x06010002; on big-endian, 0x02000106.

I'm talking about C, however. I didn't notice the post is talking about C#.
Things might be better defined there.
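
You can see it without a big-endian machine by asking Python's struct module
to interpret the same four bytes both ways:

    import struct

    s = b"2016"
    le = struct.unpack("<I", s)[0] - 0x30303030  # little-endian load
    be = struct.unpack(">I", s)[0] - 0x30303030  # big-endian load
    print(hex(le))  # 0x6010002
    print(hex(be))  # 0x2000106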

~~~
JoeAltmaier
Regardless of layout, that subtraction will have the same effect on most
printable characters. An identical effect on digits.

~~~
avian
The cast is the problematic part, not the subtraction.

~~~
JoeAltmaier
If there is no carry, then I don't understand. Four bytes will have the same
value deducted. regardless of the byte order. If there is no carry implication
(all are digits) then there is no issue.

~~~
huhtenberg
On RISC platforms

    *(uint32_t*)s

will get you a SIGBUS if s is not aligned to a 4-byte boundary. Though that's
another issue, not what OP was referring to.

~~~
zeusk
ARM has supported unaligned access since v6 (introduced in 2001); and if
you're on Linux, unaligned accesses are handled by the kernel anyway (as was
the case prior to ARMv6, and even for MIPS afaik).

Anyway, the point of his post was the possible gains from removing validation,
not portability or production code.

------
willvarfar
I do a lot of log crunching with PyPy, and datetime.strftime is a really big
part of the cost. I wish someone would contribute faster datetime routines to
standard CPython and PyPy.

Other parts that are always hot include split() and string concatenation. Java
compilers can substitute StringBuffers when they see naive string
concatenation, but in Python there's no easy way to build a string in a
complex loop: you end up putting string fragments into a list and then finally
join()ing them. Madness!
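
The idiom you end up writing everywhere looks something like this (a sketch
with a made-up record format):

    from datetime import datetime

    def render(records):
        # Collect fragments in a list and join once at the end; repeated
        # `out += fragment` in a loop can degrade to O(n^2) copying,
        # especially on PyPy, where CPython's in-place append trick for
        # strings with a reference count of 1 doesn't apply.
        parts = []
        for ts, msg in records:
            parts.append(ts.strftime("%Y-%m-%dT%H:%M:%S"))
            parts.append(" ")
            parts.append(msg)
            parts.append("\n")
        return "".join(parts)

    print(render([(datetime(2017, 6, 23, 12, 0, 0), "hello")]), end="")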

~~~
fnord123
Here are some microbenchmarks:

    In [63]: timeit dateutil.parser.parse('2013-05-11T21:23:58.970460+07:00')
    10000 loops, best of 3: 89.5 µs per loop

    In [64]: timeit arrow.get('2013-05-11T21:23:58.970460+07:00')
    10000 loops, best of 3: 62.1 µs per loop

    In [65]: timeit numpy.datetime64('2013-05-11T21:23:58.970460+07:00')
    1000000 loops, best of 3: 714 ns per loop

    In [66]: timeit iso8601.parse_date('2013-05-11T21:23:58.970460+07:00')
    10000 loops, best of 3: 23.9 µs per loop

> Other parts that are always hot include split() and string concatenation.
> Java compilers can substitute StringBuffers when they see naive string
> concatenation, but in Python there's no easy way to build a string in a
> complex loop and you end up putting string fragments into a list and then
> finally join()ing them. Madness!

The Python solution you describe is the same as in Java. If you have `String a
= b + c + d;` then the compiler may optimize this using a StringBuffer, as you
say[1]. In Python it's also pretty cheap to do `a = b + c + d` to concatenate
strings (or `''.join([b, c, d])`; run a little microbenchmark to see which
works best). But if it's in a "complex loop", as you say, then Java will
certainly not do this. So you have to build a buffer using StringBuilder and
then call toString(), which is basically the same process except it's spelled
`builder.toString()` instead of `''.join(builder)`.

Unless of course you have some interesting insights into JVM internals about
string concatenation optimizations.

[1][http://docs.oracle.com/javase/specs/jls/se8/html/jls-15.html...](http://docs.oracle.com/javase/specs/jls/se8/html/jls-15.html#jls-15.18.1)
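
A microbenchmark along these lines settles it per interpreter (a sketch; the
numbers will differ between CPython and PyPy):

    import timeit

    setup = "words = ['x'] * 1000"
    concat = "s = ''\nfor w in words: s += w"
    join = "s = ''.join(words)"
    print(timeit.timeit(concat, setup=setup, number=10000))  # naive +=
    print(timeit.timeit(join, setup=setup, number=10000))    # list + join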

~~~
dsp1234
Is there a particular reason that all of the loops are 10k except numpy which
is 1M?

~~~
mrdmnd
That's IPython's %timeit magic: it picks the number of iterations based on how
long the snippet takes, so the 714 ns numpy call gets far more loops than the
~90 µs dateutil call.
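
Outside IPython, plain timeit exposes the same mechanism as autorange(), which
keeps growing the loop count until one batch takes at least 0.2 seconds:

    import timeit

    t = timeit.Timer("iso8601.parse_date('2013-05-11T21:23:58.970460+07:00')",
                     setup="import iso8601")
    # autorange() returns (loop count, total seconds for that many loops)
    number, elapsed = t.autorange()
    print("%.1f us per loop (%d loops)" % (elapsed / number * 1e6, number))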

------
necessity
Forget parsing time; just taking the time takes an awful lot of time. Try
measuring a very quick routine (a few nanoseconds) with clock_gettime: you
have to make many calls and take a mean because of resolution issues, and you
also have to measure and subtract the overhead of the timing calls themselves.
This becomes a real issue because taking the mean of many calls is different
from doing just one call (cache, branch prediction, etc.). You could put
something else inside the loop, but then your fast routine becomes a tiny part
of what you are measuring and the relative error explodes. It is simply not
possible to benchmark routines that take just a few nanoseconds precisely and
accurately with userspace routines. There are whitepapers from Intel on how to
do this on their processors, with a kernel module to disable preemption and
IRQs and read the TSC directly with asm, but then you can't have userspace
stuff... Benchmarking quick stuff is no fun.
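
You can see the floor from userspace by timing the timer itself: take
back-to-back readings and look at the deltas (a Python sketch of the same idea
as pairing clock_gettime calls in C):

    import time

    # The delta between two adjacent readings is the clock's own
    # overhead/resolution floor; anything you measure below this is noise.
    deltas = []
    for _ in range(100000):
        a = time.perf_counter_ns()
        b = time.perf_counter_ns()
        deltas.append(b - a)
    deltas.sort()
    print("min %d ns, median %d ns" % (deltas[0], deltas[len(deltas) // 2]))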

~~~
Beltiras
The hardest thing I ever had to profile was accessing Postgres from Python. I
needed the cold-start time. I ended up kind of fuzzing it to find an estimate:
filled a table with proper data and accessed random keys to find an average.
Even then, some optimization occurred.

------
thomas-st
I had similar issues in Python, where the available date parsers weren't fast
enough.

With my C module, I got 320ns, which was 62x faster than Python's less
flexible strptime. See
[https://github.com/closeio/ciso8601/](https://github.com/closeio/ciso8601/)
for my benchmarks.
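
Usage is a one-liner (parse_datetime returns a regular datetime):

    import ciso8601

    dt = ciso8601.parse_datetime("2013-05-11T21:23:58.970460+07:00")
    print(dt.isoformat())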

~~~
logicallee
I don't get it. In one millisecond a processor does on the order of a million
operations, and if it's being run that tightly the code should be close to the
CPU, perhaps in L2 or L3 cache -- what on earth requires parsing 62x faster
than that, for date parsing? 320 ns * 62 is about 19 microseconds for the
version that you said was too slow for you, which is 0.019 milliseconds. What
even generates formatted date data that fast, and why can't you wait 0.019 ms
to parse it out again? Genuine question.

EDIT: I got downvoted, but it's just a question. If it's logs, is it old logs,
or streaming from somewhere? Rather than speculate, I'd like to hear the
actual answer from thomas-st for their use case. 62x is just under 1.8 orders
of magnitude, so I wouldn't think it matters in comparison with typical CPU
and RAM speeds, disk, network, or other bottlenecks...

~~~
masklinn
Good CSV parsers reach 200MB/s, and a formatted datetime is under 40 bytes, so
even if the input were nothing but dates, the source delivers ~5 million dates
per second, while 19µs/date is ~50,000 dates per second. The date parsing is
two orders of magnitude behind the data source.

Even allowing for other data, e.g. a timestamp followed by some other fields,
at 19µs/datetime you can easily end up with date parsing bottlenecking your
entire pipeline if the data source spews (which is common in contexts like
HFT, aggregated logs and the like).
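
Spelled out:

    csv_bytes_per_s = 200e6       # fast CSV parser throughput
    bytes_per_date = 40           # size of one formatted datetime
    parse_s_per_date = 19e-6      # 19 us to parse one datetime

    source_rate = csv_bytes_per_s / bytes_per_date  # ~5,000,000 dates/s
    parse_rate = 1 / parse_s_per_date               # ~52,600 dates/s
    print(source_rate / parse_rate)                 # ~95x: a ~100x gap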

~~~
fnord123
> at 19µs/datetime you can easily end up with that bottlenecking your entire
> pipeline if the datasource spews (which is common in contexts like HFT,
> aggregated logs and the like)

+1

This is why a little ELT goes a long way.

> Good CSV parsers reach 200MB/s

By good (and open source) we're talking about libcsv, rust-csv, and rust
quick-csv[1]. If you're doing your own custom parsing you can write your own
numeric parsers to remove support for parsing nan, inf, -inf, etc. and drop
scientific notation, which will claw back a lot of the time. If you also know
the exact width of the date field then you can also shave plenty of time
parsing datetimes. But at that point, maybe write the data to disk as protobuf
or msgpack or avro, or whatever.

[1] [https://bitbucket.org/ewanhiggs/csv-game](https://bitbucket.org/ewanhiggs/csv-game)
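
On the fixed-width point: once you drop validation, a date parser really is
just a handful of subtractions (a sketch, assuming strictly formatted
YYYY-MM-DD bytes):

    def parse_fixed_date(s):
        # s is bytes like b"2013-05-11"; no validation at all, per the
        # premise above. 48 is ord('0').
        y = (s[0] - 48) * 1000 + (s[1] - 48) * 100 + (s[2] - 48) * 10 + (s[3] - 48)
        m = (s[5] - 48) * 10 + (s[6] - 48)
        d = (s[8] - 48) * 10 + (s[9] - 48)
        return y, m, d

    print(parse_fixed_date(b"2013-05-11"))  # (2013, 5, 11)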

~~~
masklinn
> If you're doing your own custom parsing you can write your own numeric
> parsers to remove support for parsing nan, inf, -inf, etc. and drop
> scientific notation, which will claw back a lot of the time.

The 200MB/s, at least for rust-csv, is for "raw" parsing (handling CSV
itself), not field parsing and conversions, so those would be additional
costs.

> If you also know the exact width of the date field then you can also shave
> plenty of time parsing datetimes.

Yes, if you have fixed-size fields and can remove things like escaping and
quoting, things get much faster.

------
tstack
Parsing time is a huge cost when the Logfile Navigator
([http://lnav.org](http://lnav.org)) is parsing log files. The libc strptime()
is terribly slow, so I wrote a simple compiler[1] that would spit out inlined
C++ code for parsing a particular time format (e.g. "%Y-%m-%dT%H:%M:%S.%f%z").
Coupled with the fact that a time format is not going to differ across log
lines in a single file, lnav is able to lock onto a date format and parse all
the lines in a file fairly quickly.

[1] -
[https://github.com/tstack/lnav/blob/master/src/ptimec.cc](https://github.com/tstack/lnav/blob/master/src/ptimec.cc)
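
The "lock on" part is roughly this shape (a Python paraphrase of the idea, not
lnav's actual code):

    from datetime import datetime

    def make_locked_parser(formats):
        # Try each known format until one matches, then lock onto it for
        # the rest of the file, since a single log won't switch formats.
        locked = None
        def parse(ts):
            nonlocal locked
            if locked is not None:
                return datetime.strptime(ts, locked)
            for fmt in formats:
                try:
                    result = datetime.strptime(ts, fmt)
                    locked = fmt
                    return result
                except ValueError:
                    continue
            raise ValueError("no known format matches %r" % ts)
        return parse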

------
douglasfshearer
Simple optimisations like this shave minutes off the runtime of log parsing
and data import jobs I run.

I do lots of this in Ruby [0], and often write a custom parser when I know
that dates or times will be of regular form. As a bonus I get to golf
different methods of doing the parsing against each other, using
microbenchmarks.

[0]
[https://gist.github.com/dougal/e972b26cd0293f99b41896e7d89c0...](https://gist.github.com/dougal/e972b26cd0293f99b41896e7d89c070e)

