

Show HN: A fast ISO8601 date-time parser for Python - thomas-st
https://hack.close.io/posts/ciso8601

======
deathanatos
A regex only seems to take ~1µs.

    
    
      In [7]: iso_regex = re.compile('(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}(?:\\.?\\d+))')
    
      In [8]: %timeit iso_regex.match('2014-01-09T21:48:00.921000')
      1000000 loops, best of 3: 1.05 µs per loop
    

But hey, once it's written in C, why go back?

I'm missing the timezone, but the OP left that out, so I did too. For
comparison, dateutil's parse takes ~76µs for me. Kinda makes me wonder why
aniso8601 is so slow. (It's also missing a few other things, depending on
whether you count all the non-time forms as valid input.)

That said, cool! I might use this. One of the things that makes dateutil's
parse slower is that it'll parse more than just ISO-8601: it parses many
things that look like dates, including some very non-intuitive ones that have
caused "bugs"¹. Usually in APIs, its "dates are always ISO-8601", and all I
really _need_ is an ISO-8601 parser. While I appreciate the theory behind "be
liberal in what you accept", sometimes, I'd rather error out than to build
expectations that sending garbage — er, stuff that requires a complicated
parse algorithm that I don't really understand — is okay.

¹dateutil.parser.parse('') is midnight of the current date. Why, I don't know.
Also, dateutil.parser.parse('noon') is "TypeError: 'NoneType' object is not
iterable".

~~~
ajanuary
The library has the following features your regex is missing:

* Every part from month onwards is optional

* Separator characters are optional

* Date/time separator can be a space as well as T

* Timezone information

* Parsing the strings into numbers

* Actually creates a datetime object

I expect adding all of those will bump up the time a bit.
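
For a sense of what the last two bullets involve, here is a rough sketch (my
own, not the library's code) of turning a regex match into a datetime,
ignoring timezones:

    
    
      import re
      from datetime import datetime
    
      iso_regex = re.compile(
          r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:\.(\d+))?')
    
      def regex_to_datetime(s):
          m = iso_regex.match(s)
          if m is None:
              raise ValueError('not an ISO 8601 date-time: %r' % s)
          year, month, day, hour, minute, second, frac = m.groups()
          # Pad/truncate the fractional part to microseconds.
          microsecond = int((frac or '0').ljust(6, '0')[:6])
          return datetime(int(year), int(month), int(day),
                          int(hour), int(minute), int(second), microsecond)
    
      regex_to_datetime('2014-01-09T21:48:00.921000')
      # datetime.datetime(2014, 1, 9, 21, 48, 0, 921000)
    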

~~~
ajanuary
I'm not much of a regex wizard, but I tried to add all the features listed
other than parsing the result and creating the datetime object.

    
    
        iso_regex = re.compile('([0-9]{4})-?([0-9]{1,2})(?:-?([0-9]{1,2})(?:[T ]([0-9]{1,2})(?::?([0-9]{1,2})(?::?([0-9]{1,2}(?:\\.?[0-9]+)?))?(?:(Z)|([+-][0-9]{1,2}):?([0-9]{1,2})))?)?)?')
    

It seems like it performs quite a bit worse than the library, which creates
the full object.

    
    
        In [82]: %timeit ciso8601.parse_datetime('2014-01-09T21:48:00.921000')
        1000000 loops, best of 3: 368 ns per loop
    
        In [83]: %timeit iso_regex.match('2014-01-09T21:48:00.921000')
        100000 loops, best of 3: 9.72 µs per loop
    

In the interest of intellectual pursuit, is there anything that can be done to
the regex to speed it up?
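
One idea (an untested sketch on my part): ISO 8601 fields are zero-padded, so
fixed-width {2} quantifiers give the engine fewer backtracking choices than
{1,2}, and the timezone group can be made optional:

    
    
      import re
    
      # Hedged sketch: same shape as my pattern above, but with fixed
      # two-digit fields and an optional timezone group; whether it is
      # actually faster is something to measure with %timeit.
      iso_regex_fixed = re.compile(
          r'(\d{4})-?(\d{2})'
          r'(?:-?(\d{2})'
          r'(?:[T ](\d{2})'
          r'(?::?(\d{2})'
          r'(?::?(\d{2}(?:\.\d+)?))?'
          r'(?:(Z)|([+-]\d{2})(?::?(\d{2}))?)?'
          r')?)?)?')
    
      iso_regex_fixed.match('2014-01-09T21:48:00.921000')
    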

------
birken
Pandas (a data analysis library for Python) has a lot of Cython and C
optimizations for datetime string parsing:

They have their own C function which parses ISO-8601 datetime strings:
[https://github.com/pydata/pandas/blob/2f1a6c412c3d1cbdf56610...](https://github.com/pydata/pandas/blob/2f1a6c412c3d1cbdf566103eabef4997274e4576/pandas/src/datetime/np_datetime_strings.c#L344)

They have a version of strptime written in Cython:
[https://github.com/pydata/pandas/blob/master/pandas/tslib.py...](https://github.com/pydata/pandas/blob/master/pandas/tslib.pyx#L1473)

I'm not saying these are better or worse than your solution (I haven't done
any benchmarks, and the pandas functions sometimes cut a few corners), but
perhaps there is something useful there for reference anyway. They also don't
deal directly in datetime.datetime objects (they use pandas-specific
intermediate objects), but they should be simple enough to grok.

Having done some work with dateutil, I will tell you that
dateutil.parser.parse is slow, but converting strings to datetimes when you
already know the format shouldn't be its main use case anyway. If you know
the format, use datetime.strptime or some faster variant (like the one above).
There is a nice feature of pandas where, given a list of datetime-y strings in
an arbitrary format, it will attempt to guess the format using dateutil's
lexer
([https://github.com/pydata/pandas/blob/master/pandas/tseries/...](https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py#L73))
combined with trial/error, and then try to use a faster parser instead of
dateutil.parser.parse to convert the array if possible. In the general case
this resulted in about a 10x speedup over dateutil.parser.parse if the format
was guessable.
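
To illustrate the "known format" point, the comparison I mean is roughly this
(a minimal sketch; run it under %timeit to see the difference on your
machine):

    
    
      from datetime import datetime
      import dateutil.parser
    
      s = '2014-01-09T21:48:00.921000'
    
      # If you already know the format, spell it out:
      dt = datetime.strptime(s, '%Y-%m-%dT%H:%M:%S.%f')
    
      # The flexible, general-purpose (and much slower) path:
      dt2 = dateutil.parser.parse(s)
    
      assert dt == dt2
    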

~~~
pekk
It would have been nice if Pandas had split this out into a separate package
so that you didn't need to pull down all of Pandas to use it. This is why
people are duplicating your efforts.

------
data_scientist
I tried to do a fair comparison between the main date implementations.
ciso8601 is really fast: 3.73 µs on my computer (MBA 2013). aniso8601,
iso8601, isodate and arrow are all between 45 and 100 µs. The dateutil parser
is the slowest (157 µs).

    
    
      >>> ds = u'2014-01-09T21:48:00.921000+05:30'
    
      >>> %timeit ciso8601.parse_datetime(ds)
      100000 loops, best of 3: 3.73 µs per loop
    
      >>> %timeit dateutil.parser.parse(ds)
      10000 loops, best of 3: 157 µs per loop
    

A regex[1] can be fast, but the parsing is just a small part of the time
spent.

    
    
      >>> %timeit regex_parse_datetime(ds)
      100000 loops, best of 3: 13 µs per loop
    
      >>> %timeit match = iso_regex.match(ds)
      100000 loops, best of 3: 2.18 µs per loop
    

Pandas is also slow for a single date. However, it is the fastest for a list
of dates: just 0.43 µs per date!

    
    
      >>> %timeit pd.to_datetime(ds)
      10000 loops, best of 3: 47.9 µs per loop
    
      >>> l = [u'2014-01-09T21:{}:{}.921000+05:30'.format(
            ("0"+str(i%60))[-2:], ("0"+str(int(i/60)))[-2:]) 
         for i in xrange(1000)]  # 1000 different dates
     
      >>> len(set(l)), len(l)
      (1000, 1000)
    
      >>> %timeit pd.to_datetime(l)
      1000 loops, best of 3: 437 µs per loop
    

NB: pandas is, however, very slow on ill-formed dates such as
u'2014-01-09T21:00:0.921000+05:30' (just one digit for the seconds): 230 µs,
with no speedup from vectorization.

So if you care about speed and your dates are well formatted, make a vector of
dates and use pandas. If you can't use it, go for ciso8601. For thomas-st: it
may be possible to speed up the parsing of lists of dates the way pandas does.
Another nice feature would be caching.
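
By caching I mean something like this rough sketch (only worth it if the same
strings come up repeatedly):

    
    
      >>> import ciso8601
    
      >>> _cache = {}
    
      >>> def cached_parse(s):
      ...     # Plain dict memoisation; datetime objects are immutable,
      ...     # so sharing the same object between callers is safe.
      ...     try:
      ...         return _cache[s]
      ...     except KeyError:
      ...         dt = _cache[s] = ciso8601.parse_datetime(s)
      ...         return dt
    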

[1]: [http://pastebin.com/ppJ4dzBP](http://pastebin.com/ppJ4dzBP)

------
userbinator
Extremely simple and straightforward C code too, which is also nice to read.
320 ns (on what processor?), assuming a 2-3 GHz x86 clock, is around 1K
instructions, several orders of magnitude less than what it was before. But
that still works out to a few dozen instructions _per character_ of the
string... so I'm inclined to believe that it could go an order of magnitude
faster if you really wanted it to, but at that point the Python overhead
(PyArg_ParseTuple et al.) is going to dominate.

I'm not sure this would be any better than just manually writing out both
trivial iterations of the loop:

    
    
        for (i = 0; i < 2; i++)

~~~
thomas-st
I did all the timeit benchmarks on the latest 13" retina MacBook Pro, 2.6 GHz
Intel Core i5. The profiler screenshot is from one of our servers on EC2.

Of course there is always potential for optimization, but at this point it's
fast enough for our purposes. If you can make it significantly faster please
don't hesitate to submit a PR though :)

EDIT: Wouldn't most C compilers unroll the simple "for" loops? Direct link to
the C code:
[https://github.com/elasticsales/ciso8601/blob/master/module....](https://github.com/elasticsales/ciso8601/blob/master/module.c)

~~~
randlet
I don't currently have a use for this library, but I'm going to bookmark it
anyway because it looks like a nice introduction to writing a module in C. It
does something non-trivial but is still simple enough to grok quickly. Thanks!

------
josephlord
Does it cover all of ISO8601? I'm sure it covers the common cases, so it's a
valuable library anyway, but I seem to remember that ISO8601 is quite
complicated.

~~~
thomas-st
It doesn't cover all of it, for example week dates or ordinal dates are not
currently supported. But feel free to submit any patches :)

~~~
josephlord
Sorry. I'm busy with other things at the moment.

It may also be better not to cover everything if that keeps the performance
and simplicity, but I just like to understand the trade-offs.

------
radikalus
My quick look at this shows that unless you Cython-wrap the call, this is
going to be slower than using pandas' to_datetime on anything with an array
layout.

I've never really spent much time looking at pandas' to_datetime, but I
believe it has to handle a lot of variety in what you pass to it (lists,
arrays, Series), which probably causes a bit of a perf hit.

[http://dl.dropboxusercontent.com/u/14988785/ciso8601_compari...](http://dl.dropboxusercontent.com/u/14988785/ciso8601_comparison.html)

------
wanghq
If you control the source data, store it as an epoch timestamp and you can
avoid this parsing altogether.
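
E.g. something like this (a small sketch, storing UTC seconds since the
epoch):

    
    
      from datetime import datetime
      import calendar
    
      dt = datetime(2014, 1, 9, 21, 48, 0, 921000)
    
      # Store: seconds since the epoch, interpreted as UTC.
      epoch = calendar.timegm(dt.utctimetuple()) + dt.microsecond / 1e6
    
      # Read back: no string parsing involved.
      dt2 = datetime.utcfromtimestamp(epoch)
    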

Not quite related: is there any Python library that can handle timezone
parsing, like Java's SimpleDateFormat
([http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDat...](http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html))?
The timezone could be in UTC-offset or short-name format (EST, EDT, ...). I am
surprised that I couldn't find one.

------
btbuilder
While profiling, I noticed the same thing about dateutil.parser.parse a few
years ago. We standardized all our interacting systems on UTC, so we have a
regex that matches the UTC format and, if it fails to match, we fall back to
dateutil. That way the vast majority of cases are optimized, but we still
support other timezones.
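
Roughly this pattern (a simplified sketch of the idea, not our exact code):

    
    
      import re
      from datetime import datetime
      import dateutil.parser
    
      _UTC_RE = re.compile(
          r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})'
          r'(?:\.(\d+))?(?:Z|\+00:00)?$')
    
      def parse(s):
          m = _UTC_RE.match(s)
          if m:
              y, mo, d, h, mi, sec, frac = m.groups()
              return datetime(int(y), int(mo), int(d), int(h), int(mi),
                              int(sec), int((frac or '0').ljust(6, '0')[:6]))
          # Anything else (other offsets, odd formats) takes the slow path.
          return dateutil.parser.parse(s)
    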

------
jnks
How many dates are you parsing at a time that optimizing this would make a
noticeable difference to users?

~~~
evmar
The post says: "For large object structures with thousands of date time
objects this can easily add up." At 0.1ms per parse, that's 100ms per thousand
dates, within the range of noticeable. (Their profiler screenshot has it
taking 589ms.)

~~~
imaginenore
0.1ms to parse a date???

Even the standard PHP string parser does 0.017 ms on my 3-year-old netbook.

    
    
        <?php
    
        $st = microtime(true);
        $cnt = 10000;
    
        for ($i=0; $i<$cnt; $i++)
        	strtotime('2014-01-09T21:48:00.921000');
    
        echo 1000 * (microtime(true) - $st) / $cnt;
    

Seems like this solves a non-existent issue.

~~~
anemitz
You can see the issue it solves pretty clearly here:
[https://github.com/elasticsales/ciso8601#benchmark](https://github.com/elasticsales/ciso8601#benchmark)

Python != PHP

~~~
imaginenore
Actually both Python and PHP are ridiculously slow languages. Though Python is
slower.

~~~
illumen
Some implementations of Python are slowish for some tasks. Many parts, like
the module being discussed, are written in C/assembly/Fortran/Java.

Python with a JIT is PyPy: [http://speed.pypy.org/](http://speed.pypy.org/)

There are also some fast _implementations_ of PHP.

------
rlpb
Other parsers already exist too. For example, did you try this one?
[https://pypi.python.org/pypi/iso8601](https://pypi.python.org/pypi/iso8601)

How do these all compare to each other?

------
daurnimator
I think you actually mean RFC3339. ISO8601 is probably a lot larger than you
think.

~~~
ajanuary
RFC3339 makes most of the fields mandatory, while this library leaves them
optional, so it is more accurately a subset of ISO8601 than an implementation
of RFC3339. That said, you could describe it as an extension of RFC3339.

------
jamesaguilar
This seems like the type of thing that's good to ffi out of you're using it a
lot. I highly doubt the c version would take this long.

~~~
dfc

      > good to ffi out of you're using it a lot
    

What does that mean?

~~~
michaelmior
s/of/if/

~~~
jamesaguilar
Spell correct is a helluva drug. I get bitten by of/if all the time.

------
Sir_Cmpwn
Would it make more sense to modify the core library and send off a patch?

~~~
ajanuary
Most of the speed comes from only parsing a frequently used subset of ISO8601.
For a core library, you probably want a more complete implementation.

