

Dateparser: Python parser for human readable dates - juanriaza
https://github.com/scrapinghub/dateparser

======
Negative1
In the same vein, check out Arrow for a big improvement over Python's standard
time/date libraries. As a bonus, it also generates human-readable dates (though
I don't think it parses them like this lib):
[http://crsmithdev.com/arrow/](http://crsmithdev.com/arrow/)

~~~
jdnier
Except that arrow is simple-minded when trying to parse date strings (compared
to dateparser or delorean[1]). By default, it only tries to match a few
patterns. You'll see a lot of this:

    
    
        arrow.parser.ParserError: Could not match input to any of
        ['YYYY-MM-DD', 'YYYY-MM', 'YYYY'] on '01-06-17'
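
To illustrate what "only tries to match a few patterns" means, here is a minimal stdlib-only sketch of pattern-list parsing. It is not arrow's implementation; `FORMATS` and `parse_date` are hypothetical names, and the list assumes US-style month-first input and English month names (the default C locale).

```python
from datetime import date, datetime

# Hypothetical format list (month-first, US style); anything not listed fails.
FORMATS = [
    "%Y-%m-%d", "%m-%d-%Y", "%m-%d-%y",
    "%Y/%m/%d", "%m/%d/%Y", "%m/%d/%y",
    "%b %d, %Y", "%b %d %Y", "%B %d, %Y", "%B %d %Y",
]


def parse_date(text: str) -> date:
    """Try each known format in order and return the first successful parse."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # this pattern did not fit; try the next one
    raise ValueError(f"Could not match {text!r} to any of {FORMATS}")


print(parse_date("01-06-17"))     # matched by "%m-%d-%y"
print(parse_date("Jan 6, 2017"))  # matched by "%b %d, %Y"
```

The more patterns you add, the more inputs parse, but anything outside the list (such as `01//06/2017`) still raises. That is the gap the more lenient parsers try to close.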
    

Here's a list of (US English–centric) test dates that dateparser
((ddp.get_date_data(date)['date_obj']).date()) and delorean
(delorean.parse(date, dayfirst=False, yearfirst=False).date) both parse
correctly, nearly all of which arrow fails on:

    
    
        01-06-2017
        01-06-17
        2017-01-06
    
        01/06/2017
        01/06/17
        2017/01/06
    
        Jan 6, 2017
        Jan 6 2017
    
        2017, Jan 6
        2017 Jan 6
    
        2017, January 6
        2017 January 6
    
        January 6, 2017
        January 6 2017
    
    
        January 6nd, 2017
        January 6rd, 2017
        January 6st, 2017
        January 6th, 2017
    
        January 6nd 2017
        January 6rd 2017
        January 6st 2017
        January 6th 2017
    
    
        2017, January 6nd
        2017, January 6rd
        2017, January 6st
        2017, January 6th
    
        2017 January 6nd
        2017 January 6rd
        2017 January 6st
        2017 January 6th
    
    
        01//06/2017
        01//06//2017
        01--06-2017
        01--06--2017
    
        01/06-2017
        01-06/2017
    

I like Delorean's API better than arrow's (strictly personal preference) but
think dateparser's language detection is interesting.

[1]
[http://delorean.readthedocs.org/en/latest/quickstart.html](http://delorean.readthedocs.org/en/latest/quickstart.html)

~~~
pbhjpbhj
> _01-06-17_ //

What's the correct parsing of that date? Is it 2001 or 2017 or 1917 or ... is
it June or January ...?
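
The ambiguity is easy to demonstrate with the standard library alone. This is a minimal sketch; the labels in `READINGS` and the function name are hypothetical, and the century comes from `strptime`'s `%y` rule (two-digit years 00 to 68 map to 2000 to 2068).

```python
from datetime import date, datetime

# Three plausible readings of the same string (labels are illustrative).
READINGS = {
    "month-day-year": "%m-%d-%y",
    "day-month-year": "%d-%m-%y",
    "year-month-day": "%y-%m-%d",
}


def interpretations(text: str) -> dict[str, date]:
    """Return every reading of `text` that parses, keyed by its label."""
    results = {}
    for label, fmt in READINGS.items():
        try:
            results[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # this reading is not valid for the given string
    return results


for label, value in interpretations("01-06-17").items():
    print(f"{label}: {value.isoformat()}")
```

All three readings succeed and give three different dates, so a parser has to be told (or has to guess) which convention applies.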

------
rafd
Huh! Just last week I did a survey of NLP Date Parsing libraries. If you're
looking for something similar in other languages, see:

[https://docs.google.com/spreadsheets/d/1dKt0R247B8Mx5sFXd7ht...](https://docs.google.com/spreadsheets/d/1dKt0R247B8Mx5sFXd7htSOQB-B5kMODM2ydmjp9cr80/edit?usp=sharing)

------
jmsdnns
Python also has dateutil, which can do similar things and has been around a
long time:
[https://pypi.python.org/pypi/python-dateutil](https://pypi.python.org/pypi/python-dateutil)

~~~
eliasdorneles
Yeah, dateutil is cool, but it has a few problems:

    
    
      >>> from dateutil import parser
      >>> parser.parse('')
      datetime.datetime(2014, 11, 24, 0, 0)
    

It gets worse with fuzzy parsing:

    
    
      >>> parser.parse('something meaningless', fuzzy=True)
      datetime.datetime(2014, 11, 24, 0, 0)

~~~
pekk
Can this not be reported as a bug on the project's issue tracker? I don't
understand why people trash things in public instead of at least filing a
polite issue.

~~~
eliasdorneles
I don't mean to trash anything; these are known bugs (lots of them are already
filed there:
[https://bugs.launchpad.net/dateutil](https://bugs.launchpad.net/dateutil)).

It seems that dateutil has just not been receiving much love from its
developers lately.

------
superchink
So quick question to anyone who's used this lib. The README cites an example:
it can give you the date for text like: '1 min ago', '2 weeks ago', '3 months,
1 weeks and 1 day ago', etc

Does it handle proper grammar for singular values (i.e., 1 week vs. 1 weeks)?

~~~
eliasdorneles
Well, it is meant to be very forgiving. Right now it outputs the same thing
for both "1 week ago" and "1 weeks ago" (even though the latter is
grammatically incorrect).
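
As a rough illustration of that kind of leniency (not dateparser's actual implementation), here is a stdlib-only sketch where an optional trailing "s" makes singular and plural forms equivalent. `parse_ago` is a hypothetical helper, and it assumes only units that `timedelta` supports, so months and years are rejected rather than guessed.

```python
import re
from datetime import datetime, timedelta

# The "s?" after the unit accepts both "1 week" and "1 weeks".
_PART = re.compile(r"(\d+)\s*(min(?:ute)?|hour|day|week)s?")
_UNITS = {"min": "minutes", "minute": "minutes",
          "hour": "hours", "day": "days", "week": "weeks"}


def parse_ago(text: str, now: datetime) -> datetime:
    """Parse strings like '2 weeks and 1 day ago' relative to `now`."""
    body = text.strip().lower()
    if not body.endswith("ago"):
        raise ValueError(f"not an 'ago' expression: {text!r}")
    delta = timedelta()
    matched = False
    # Split "2 weeks, 1 day and 3 hours" into its individual parts.
    for chunk in re.split(r",|\band\b", body[:-3]):
        chunk = chunk.strip()
        if not chunk:
            continue
        m = _PART.fullmatch(chunk)
        if m is None:
            raise ValueError(f"unsupported part: {chunk!r}")
        amount, unit = m.groups()
        delta += timedelta(**{_UNITS[unit]: int(amount)})
        matched = True
    if not matched:
        raise ValueError(f"no duration found in {text!r}")
    return now - delta


now = datetime(2014, 11, 24, 12, 0)
print(parse_ago("1 week ago", now) == parse_ago("1 weeks ago", now))
```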

Can you elaborate on what you mean by "proper grammar for singular values"?

------
foxhop
Sort of related, I'm the author of ago.py
([https://pypi.python.org/pypi/ago/0.0.6](https://pypi.python.org/pypi/ago/0.0.6))
which generates human readable timedeltas that this parser reverses.
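
For readers who haven't seen the generating direction, here is a minimal stdlib-only sketch of turning a timedelta into text. It is not ago.py's API; `humanize_delta` and its `precision` parameter are hypothetical, and it assumes a non-negative delta.

```python
from datetime import timedelta

# Unit names and their sizes in seconds, largest first.
_UNITS = (("week", 7 * 24 * 3600), ("day", 24 * 3600), ("hour", 3600), ("min", 60))


def humanize_delta(delta: timedelta, precision: int = 2) -> str:
    """Render a non-negative timedelta as e.g. '1 week, 1 day ago'."""
    seconds = int(delta.total_seconds())
    parts = []
    for name, size in _UNITS:
        count, seconds = divmod(seconds, size)
        if count:
            # Pluralize only when the count is not exactly 1.
            parts.append(f"{count} {name}{'' if count == 1 else 's'}")
    parts = parts[:precision]  # keep only the most significant units
    return ", ".join(parts) + " ago" if parts else "just now"


print(humanize_delta(timedelta(days=8, hours=3)))
```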

------
rmrfrmrf
I'm ashamed to say that, in the few Python projects I've done, I have resorted
to delegating date parsing out to PHP in the past given its amazing date
parser. Aside from how silly that sounds, it's actually a pretty fast
solution. I'll give this a look and see how it compares. I've found that a lot
of Python libraries seem to add an obscene amount of bloat for the
functionality I'm looking for.

~~~
pekk
If you're happy using PHP for this, I don't want to get in the way of your
happiness - but if you applied the same standard to that practice as you do to
Python libraries you'd certainly see that invoking a separate PHP process is
"an obscene amount of bloat for the functionality".

~~~
tobych
They needn't be invoking a separate process. Could be just calling out to an
API provided by a running PHP process, perhaps over HTTP.

------
callmeed
FYI the ruby equivalent is chronic:
[https://github.com/mojombo/chronic](https://github.com/mojombo/chronic)

------
brendano
Also see Heideltime:
[https://code.google.com/p/heideltime/](https://code.google.com/p/heideltime/)

------
mashematician
Similar project:
[https://github.com/bear/parsedatetime](https://github.com/bear/parsedatetime)

------
ar7hur
FYI in Clojure (with a live demo):
[http://duckling-lib.org](http://duckling-lib.org)

~~~
pekk
How would you propose using that from Python?

~~~
tedunangst
With a mini service. You feed it a line of text; it replies with the parsed
date in a standard format.
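
The shape of such a line protocol can be sketched in a few lines. In practice the service side would wrap the Clojure library; Python stands in here purely for illustration. `parse_line`, `serve`, the format list and the `ERROR` reply are all hypothetical choices.

```python
import io
from datetime import datetime

# Hypothetical set of accepted input formats for the stand-in parser.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y")


def parse_line(line: str) -> str:
    """Return the date in ISO 8601 form, or 'ERROR' if nothing matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(line.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return "ERROR"


def serve(infile, outfile) -> None:
    """One request per input line, one reply per output line."""
    for line in infile:
        outfile.write(parse_line(line) + "\n")
        outfile.flush()  # reply immediately so the caller is not left waiting


# Demo with in-memory streams instead of real pipes or sockets.
replies = io.StringIO()
serve(io.StringIO("2017-01-06\nJan 6, 2017\nnonsense\n"), replies)
print(replies.getvalue(), end="")
```

Pointing `serve` at `sys.stdin` and `sys.stdout` would turn this into a filter that any language can drive through a pipe.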

------
boyter
Interesting. However it doesn't solve what I would argue is the harder problem
of how to identify a time in the document.

For example, as I write this, the HN URL says that it is 8 hours old. Without
knowing the exact format, how can I extract this sort of date out of random
text/HTML documents?

~~~
brendano
This is a hard problem -- there's a bunch of research in NLP on it, where it's
sometimes called temporal tagging. HeidelTime is a system that does this; some
examples on their webpage,
[https://code.google.com/p/heideltime/](https://code.google.com/p/heideltime/)
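
For a sense of the simplest possible baseline, here is a stdlib-only regex sketch that pulls "N units ago/old" expressions out of text or HTML. It is far cruder than a real temporal tagger like HeidelTime; the pattern, the tag stripping and the name `find_relative_times` are all assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "8 hours ago" or "2 days old"; anything else is missed.
_AGO = re.compile(r"\b(\d+)\s+(minute|hour|day|week)s?\s+(?:ago|old)\b",
                  re.IGNORECASE)


def find_relative_times(document: str, now: datetime):
    """Return (matched text, resolved datetime) pairs found in `document`."""
    text = re.sub(r"<[^>]+>", " ", document)  # crude tag stripping
    hits = []
    for m in _AGO.finditer(text):
        amount, unit = int(m.group(1)), m.group(2).lower() + "s"
        hits.append((m.group(0), now - timedelta(**{unit: amount})))
    return hits


html = '<span class="age">8 hours ago</span> | 42 comments'
print(find_relative_times(html, datetime(2014, 11, 24, 20, 0)))
```

Note that "42 comments" is correctly ignored only because the pattern requires a time unit; distinguishing real time expressions from lookalikes in free text is where the research effort goes.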

------
thraxil
related, for parsing durations:
[https://github.com/thraxil/simpleduration/](https://github.com/thraxil/simpleduration/)

~~~
eliasdorneles
interesting. it seems to support only English dates, sadly.

~~~
thraxil
open a ticket.

