
Python Libraries you should know about - trueduke
http://doda.co/7-python-libraries-you-should-know-about
======
llambda
I'm only surprised to not see any of Kenneth Reitz' work on this list, e.g.
Requests. In fact in that vein is rauth, an OAuth client lib built on top of
Requests (<https://github.com/litl/rauth>). Full disclosure: I'm the author of
rauth. :)

~~~
kami8845
Hey. Lots of people suggested this so I added a notice to the blogpost to
further explain why I chose these libraries:

> I specifically excluded awesome libs like requests, SQLAlchemy, Flask,
> fabric etc. because I think they're already pretty "main-stream". If you
> know what you're trying to do, it's almost guaranteed that you'll stumble
> over the aforementioned. This is a list of libraries that in my opinion
> should be better known, but aren't.

------
mercuryrising
Since everyone is asking (and no one is doing), I put together a simple
benchmark for pyquery, bs4, and lxml (cssselect/xpath).

<https://gist.github.com/4061368>

All it does is grab paragraphs from python.org's html a couple thousand times.

    
    
        ==== Total trials: 100000 =====
        bs4 total time: 31.6
        pq total time: 9.3
        lxml (cssselect) total time: 5.4
        lxml (xpath) total time: 4.3
        regex total time: 8.9 (doesn't find all p)
    

What does it mean? Unless you're running thousands of queries for parsing, it
doesn't matter which library you choose. My computer is old and slow. Pick
whichever one is easiest, the one you'll fight with the least. Don't put energy into
unnecessary optimization. Using a good library is like choking someone,
they'll fight for a little while until they pass out. (you'll remember this
analogy next time you want to switch libraries, do you really need to choke
someone to get your job done?) After they pass out, it's smooth sailing and
you don't have to worry. Don't rock the boat unless you have to.

~~~
kami8845
Haha, thank you mercuryrising. That's quite the interesting comparison. A few
thoughts:

\- Python.org is quite a simple web app, it would be interesting to run this
against something a little more complex like a long wikipedia article or
<http://www.nytimes.com/> or even the Alexa Top 500

\- It would also be interesting to split the times between parsing and
selecting as I feel that's where the difference between pq and lxml comes in.

All in all it seems BS4 is quite a bit faster than I gave it credit for
(especially factoring in parsing+selecting instead of just the former)
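
A minimal sketch of timing parsing and selecting separately. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml and a synthetic, well-formed document instead of a live page; both are assumptions for illustration, not the setup from the gist.

```python
import timeit
import xml.etree.ElementTree as ET

# Synthetic, well-formed stand-in document (assumption: the thread used live pages).
DOC = "<html><body>" + "<div><p>text</p></div>" * 200 + "</body></html>"


def parse():
    """Build the element tree from the raw markup."""
    return ET.fromstring(DOC)


def select(root):
    """Find every <p> at any depth using ElementTree's limited XPath."""
    return root.findall(".//p")


TRIALS = 100
root = parse()  # parse once up front so selecting is timed on its own

parse_time = timeit.timeit(parse, number=TRIALS)
select_time = timeit.timeit(lambda: select(root), number=TRIALS)

print(f"parsing time:   {parse_time:.4f}s")
print(f"selecting time: {select_time:.4f}s")
```

Timing the two phases with separate `timeit` calls keeps a slow parser from hiding a slow selector, or the other way around.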

~~~
mercuryrising
Python.org was pretty simple. bs4 really hits the brakes on the NYT page.

    
    
        ==== Total trials: 100000 =====
        bs4 parsing time: 0.6
        bs4 selecting time: 146.3
        pq parsing time: 0.0
        pq selecting: 15.7
        lxml parsing time: 0.0
        lxml (cssselect) selecting time: 12.4
        lxml parsing time: 0.0
        lxml (xpath) selecting: 11.5

~~~
kami8845
That seems more in line with my experience :) This is the stuff lxml shines
at. Thank you for testing this out.

------
marcofucci
You should probably add requests <http://docs.python-requests.org>. Anyway
great list! I'm already using most of them and they are awesome.

~~~
kami8845
Hey. Author of the blog post here.

I specifically excluded awesome libs like requests, SQLAlchemy, Flask, fabric
etc. because I thought them too "main-stream". If you know what you're trying
to do, it's almost guaranteed that you'll stumble over the aforementioned. I
tried to compile a little bit of a list of libraries that SHOULD be better
known, but aren't.

~~~
marcofucci
I see your point but in my opinion, requests is quite different from
SQLAlchemy, Flask and fabric. I wouldn't put them at the same level.

------
bjourne
When compiling lists of the best Python libs, one definitely has to check out
Pocoo: <http://www.pocoo.org/> They are a bunch of dudes who are incredibly
skilled at putting together great APIs. All their libraries, from pygments,
jinja2 to sphinx are well-documented and extremely simple to use.

------
leftnode
One I'd like to add is Docopt: <http://docopt.org/> and
<https://github.com/docopt/docopt>

It makes it very simple and intuitive to build command line apps.

~~~
IgorPartola
And if you want to have both command line args _and_ config files, check out
my project: <https://github.com/ipartola/groper>. It even supports creating a
sample config file out of the options you have defined.

~~~
CJefferson
This might just be your examples, but it seems a bit verbose.

For example for:

      define_opt('server', 'daemon', type=bool, cmd_name='daemon', cmd_short_name='d')

Can I just write `define_opt('daemon')`?

~~~
IgorPartola
It is a bit verbose, you are right. Let me explain why all of these are
necessary. (BTW, argparse is almost as verbose [1]).

server - this is the section/module. I could grab this from the name of the
current module, but that means you must use unique module names, and not
change them. Otherwise, your config files would stop working.

daemon - required, obviously, since this is the name of the option.

type - I can't assume a type for you. By default it is a unicode instance, so
you can omit this parameter if that's your use case.

cmd_name/cmd_short_name - I can try to assume that cmd_name is the same as the
second positional argument, but once again, there could be a conflict. For
example, if you want to have options.db.filename and options.log.filename, you
can't use cmd_name = 'filename' for both. cmd_short_name is even worse, since
here you may want a specific letter to be used (such as an upper case D
instead of d). Note that these parameters are optional, since most of your
values will likely go in the config file, not on the command line.

An alternative API might look like this:

    
    
      define_opt_int('server', 'shutdown_timeout')
    
      define_opt_bool('server', 'daemon')
    
      define_opt_cmd_unicode('server', 'pid_file', cmd_name='pid', cmd_short_name='p')
    

With a fallback to:

    
    
      define_opt('log', 'level', type=lambda x: x if x in LOG_LEVELS else 'DEBUG', cmd_name='log-level', cmd_short_name='L')
    

Any feedback is greatly appreciated.

[1] <http://docs.python.org/dev/library/argparse.html>

~~~
MagerValp
So maybe just define_opt('server.daemon'), with the rest given sane defaults
and customizable as needed?

Interesting module nonetheless!
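
A rough sketch of what the dotted-name form with defaults could look like. This is not groper's API: `define_opt` here is a hypothetical helper built on the standard library's `argparse`, and the default flag naming is an assumption.

```python
import argparse

_parser = argparse.ArgumentParser()


def define_opt(dotted_name, type=str, default=None, cmd_name=None, cmd_short_name=None):
    """Hypothetical helper: 'section.option' plus optional overrides."""
    section, option = dotted_name.split(".", 1)
    # Default the long flag to the option name; override it to avoid clashes.
    flags = ["--" + (cmd_name or option.replace("_", "-"))]
    if cmd_short_name:
        flags.insert(0, "-" + cmd_short_name)
    dest = f"{section}_{option}"  # keeps db.filename and log.filename apart
    if type is bool:
        _parser.add_argument(*flags, dest=dest, action="store_true", default=bool(default))
    else:
        _parser.add_argument(*flags, dest=dest, type=type, default=default)


define_opt("server.daemon", type=bool, cmd_short_name="d")
define_opt("server.shutdown_timeout", type=int, default=30)
# Two options share the name 'filename', so each gets an explicit cmd_name.
define_opt("log.filename", cmd_name="log-file")
define_opt("db.filename", cmd_name="db-file")

args = _parser.parse_args(["-d", "--shutdown-timeout", "5"])
print(args.server_daemon, args.server_shutdown_timeout, args.log_filename)
```

The conflict described above still exists with this shape: the short form only works until two sections share an option name, at which point `cmd_name` has to be spelled out.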

------
takluyver
The very first suggestion complains that BeautifulSoup is too slow, but as of
version 4, it's actually just a navigation layer on top of your preferred
parser. So it's as fast as lxml, and as easy to use as, well, BeautifulSoup.

~~~
pmorici
I used BeautifulSoup for a project once because of all the accolades it gets
here but found it to be less than robust. It might be sufficient in a scenario
where you have a single site you need to scrape but I found it was totally
unreliable when used across a wide range of sites and esp. sites with foreign
language content.

~~~
ludwigvan
I started using PyQuery yesterday, after using BeautifulSoup for a long time.
It seems much easier to use.

    
    
      pq_page = pquery(url=PAGE_URL)
    

Note that PyQuery has some encoding issues too (or rather the sites I was
scraping were too broken, declaring two different encodings in their meta
tags!). Here are two things I have done to work around it:

    
    
      page = requests.get(PAGE_URL)
      pq_page = pquery(page.text)
    

If that doesn't cut it (because requests detects it wrong too), try forcing
the encoding in requests:

    
    
      page = requests.get(PAGE_URL)
      page.encoding = 'utf-8'
      pq_page = pquery(page.text)
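
If even a forced encoding is wrong for some pages, one option is to decode the raw bytes yourself and try a short list of candidates. A minimal standard-library sketch; `decode_page` and the candidate list are hypothetical and not part of requests or PyQuery:

```python
def decode_page(raw, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in order; fall back to lossy decoding."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep going, marking undecodable bytes.
    return raw.decode(candidates[0], errors="replace")


utf8_bytes = "café".encode("utf-8")
legacy_bytes = "café".encode("cp1252")

print(decode_page(utf8_bytes))
print(decode_page(legacy_bytes))
```

The order of the candidates matters: strict encodings such as UTF-8 should come first, because single-byte encodings accept almost any input and would mask a wrong guess.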

------
Permit
Is PyQuery fast? The whole premise of the first point was that BeautifulSoup
was too slow, but then he didn't provide a comparison between them.

~~~
kami8845
Hey Permit. This is a really good point. Looking at PyQuery's source code [0],
it really does nothing in terms of parsing other than calling lxml's
fromstring function and then works with the result of that when evaluating
queries. So it would probably be a tad slower than pure lxml since it also
does other bits (like checking for URLs and then fetching them for you) but
from looking at the source code, I'd think that the overhead is minimal.

[0]
[https://github.com/dsc/pyquery/blob/master/pyquery/pyquery.p...](https://github.com/dsc/pyquery/blob/master/pyquery/pyquery.py#L179)

~~~
Jabbles
Do you have any more up-to-date data than a 2008 comparison?

~~~
kami8845
Hey, I commented on this a bit further down in the topic:
<http://www.crummy.com/2012/1/22/0>

------
think-large
I <3 you OP. So much right now. I didn't even know I needed these until I read
this post and now I know and I'm so happy.

I know matplotlib comes with Python(x,y) but that's a pretty awesome one too.

------
JeffJenkins
The coolest part of dateutil isn't the parser, it's the recurrence rules and
recurrence rule sets. Doing that on your own is extremely error-prone if you
have a non-trivial recurrence.
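
To give a feel for why hand-rolled recurrences are fiddly, here is a standard-library sketch of a single rule, "the last Friday of every month". The helper names are hypothetical; with dateutil the same rule would be expressed declaratively with `rrule` rather than with date arithmetic like this.

```python
import calendar
from datetime import date


def last_weekday_of_month(year, month, weekday):
    """Return the last given weekday (0 = Monday) of a month."""
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    d = date(year, month, last_day)
    # Step back from the final day to the most recent matching weekday.
    offset = (d.weekday() - weekday) % 7
    return d.replace(day=last_day - offset)


def monthly_last_fridays(year):
    """One occurrence per month for the whole year."""
    return [last_weekday_of_month(year, m, calendar.FRIDAY) for m in range(1, 13)]


for occurrence in monthly_last_fridays(2012):
    print(occurrence.isoformat())
```

Even this one rule needs month lengths, leap years, and modular weekday arithmetic; combining several rules or adding exclusions by hand gets error-prone quickly.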

~~~
kami8845
Cool, thank you for making me aware of that. I haven't had to use something
like that but it's good to know. I'll put it in the blogpost.

EDIT: I added it below the `parse` example. Thanks again!

------
rabialam
Great list. In particular, fuzzywuzzy and pattern caught my eye in a "how-
have-I-not-heard-of-these" kind of way.

------
reinhardt
If you do any non-trivial work with decorators, the `decorator` module is a
must:
[https://micheles.googlecode.com/hg/decorator/documentation.h...](https://micheles.googlecode.com/hg/decorator/documentation.html).
Think of it as @functools.wraps on steroids (though this probably doesn't do
it justice). FWIW I think it should be in the standard library.
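
For reference, a small sketch of the standard-library baseline being compared against. It uses only `functools.wraps`; the `logged` decorator and `add` function are made up for illustration.

```python
import functools


def logged(func):
    """Print a line before every call to the wrapped function."""
    @functools.wraps(func)  # copies __name__, __doc__, etc. onto the wrapper
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


@logged
def add(a, b):
    """Return a + b."""
    return a + b


print(add(2, 3))
print(add.__name__, add.__doc__)
```

`functools.wraps` copies metadata, but the wrapper itself still accepts `*args, **kwargs`. As far as I understand it, the `decorator` module goes further and generates a wrapper with the original signature.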

------
polm23
No mention of pandas or nltk?

I'd never heard of pattern before, and while it looks like it's a nice bundle
of features, I'm concerned by the fact it references pyWordNet by name even
though it hasn't been an independent project since 2006
(<http://osteele.com/projects/pywordnet/>). Has anyone actually used it?

------
roryokane
Here are some Ruby libraries I’ve used that are similar to those Python
libraries:

pyquery equivalent: Nokogiri (<http://nokogiri.org/>). Lets you select
elements with jQuery-like selectors. Uses libxml2 as its parser.

watchdog equivalent: watchr (<https://github.com/mynyml/watchr>). Run code
when the filesystem changes.

path.py equivalent: rush (<http://rush.heroku.com/>). Provides a far better
API to the filesystem than the standard library.

I also found this equivalent to fuzzywuzzy, but I’ve never used it: amatch
(<http://flori.github.com/amatch/>)

------
dm8
Excellent collection! I will add one more.

Python Imaging Library - Today's web is full of images and PIL makes it easy
for image manipulation. Although, it's not extremely performance efficient at
very large scale.

~~~
joeshaw
In a similar vein, I recently discovered Wand (<http://dahlia.kr/wand/>) which
is a gorgeous API built on top of ImageMagick. It's a lot more limited in
functionality than what PIL offers, but for common operations like scaling,
cropping, or extracting EXIF data it's much nicer to work with.

------
mochizuki
I have a few scripts that could benefit from PyQuery, didn't know about that
one. Also path.py looks like it will save me some time too. Thanks!

~~~
klibertp
path.py is a must have, I am amazed how often it's being rediscovered, despite
the fact that I try to advertise it whenever it's relevant. But it's quite old
at this point, maybe we should be more excited about efforts like this one:
<http://www.python.org/dev/peps/pep-0428/>
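
PEP 428 was later accepted and shipped as `pathlib` in the standard library (Python 3.4+). A short sketch of the object-oriented style it proposes; `PurePosixPath` is used so the output is the same on every platform, and the path names are made up.

```python
from pathlib import PurePosixPath

# Join with the / operator, in the style path.py uses ...
p = PurePosixPath("a") / "b" / "c"
# ... or pass the parts to the constructor, with no operator overloading.
q = PurePosixPath("a", "b", "c")

print(p)
print(p == q)
print(p.parent, p.name, p.with_suffix(".txt"))
```

Pure paths do no filesystem access, so this runs without any of the directories existing.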

------
jnazario
for fuzzy date work i tend to rely on the parsedatetime module:

<http://code.google.com/p/parsedatetime/>

it seems to accept syntax similar to what the 'at' command accepts (and obviates the
need for my python C module to do that parsing based on the scheduler parser
for 'at'). examples include "1 day ago", "ten hours from now" and the like.
very useful.

------
Tloewald
Speaking as someone with approximately 4h of Python experience -- great list.
I'll be using the sh lib right away.

~~~
CodeMage
Unless you're using Windows :(

The docs got my hopes high, but they fail to mention sh isn't supported on
Windows -- I saw that by looking at its source on GitHub. I'll take a look at
pbs, but this kinda bummed me out.

~~~
pbreit
Do yourself a huge favor and develop on Linux. VirtualBox + Ubuntu 12.04 =
Free.

~~~
CodeMage
Sure, I have an Ubuntu box. Thing is, while I would be doing _myself_ a huge
favor, the same wouldn't apply to the users of what I'm trying to create.
Perhaps the world doesn't need yet another Git GUI, but if I leave out Windows
support, the world will need it even less ;)

------
andrewcooke
i have reservations about dateutil. it's certainly true that the built-in time
and date routines in python are ugly, but dateutil's parsing doesn't try to
give a single, consistent interpretation of an input date/time - instead it
parses it in chunks, where one chunk can effectively overwrite another. so you
can have input that is illogical or inconsistent and dateutil will happily
give a single "right" answer instead of flagging an error.

for example, see this bug on so -
[http://stackoverflow.com/questions/10575919/strange-date-par...](http://stackoverflow.com/questions/10575919/strange-date-parsing-results-in-python)

in short: it's ok for non-critical cases where you just need "some date" from
input. but don't use it if you would rather have an error than an incorrect
interpretation.
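
For the "would rather have an error" case, a minimal sketch of strict parsing with the standard library's `datetime.strptime`. The single accepted format and the `parse_strict` name are assumptions for illustration.

```python
from datetime import datetime


def parse_strict(text, fmt="%Y-%m-%d %H:%M"):
    """Parse text against exactly one format; raise ValueError otherwise."""
    return datetime.strptime(text, fmt)


print(parse_strict("2012-11-12 09:30"))

# Impossible dates, other layouts, and trailing junk are all rejected.
for bad in ("2012-02-30 09:30", "12/11/2012", "2012-11-12 9:30pm"):
    try:
        parse_strict(bad)
    except ValueError as exc:
        print(f"rejected {bad!r}: {exc}")
```

The trade-off is the reverse of dateutil's: input has to match the expected format exactly, but nothing is silently reinterpreted.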

------
denzil_correa
Thanks for sharing. These are a nice set of libraries I never used. I am
bookmarking this page.

------
chewxy
The latest BeautifulSoup uses lxml (version 4+). How does it compare to
PyQuery?

------
brunoqc
You have a typo. Search for 'PyQyery', twice.

~~~
kami8845
Thank you, fixed.

------
nathan_f77
This is quite a cool idea:

      >>> path('a') / 'b' / 'c'
      path('a/b/c')

It would be fun to have that in Ruby!

~~~
mattdeboard
Alright, let me say that I do not like this part of path. Operator overloading
is fine, I mean the + is overloaded all to hell. But seeing the division
symbol in weird places in a project I took over was distracting at first, then
irritating, and by the time I figured out what was going on I did not want it
despite seeing some benefit.

~~~
draegtun
I agree. It reminds me of FancyRoutes which I also found off putting -
<http://news.ycombinator.com/item?id=1100248>

_Path.py_ may already do it (docs are limited so I'm not sure) but something
like below would be better:

    
    
      my $foo = dir('a', 'b', 'c');
    

This is how Path::Class works in Perl
(<https://metacpan.org/module/Path::Class>). This way the path is a filesystem
agnostic directory object.

NB. Twisting operator overloading isn't all bad though. E.g. IO::All nearly
gets it all right and being a little mad I will sometimes use it :)
<https://metacpan.org/module/IO::All>

------
the1
watchdog is pretty buggy on linux at least.

~~~
kami8845
Hey, can you talk more on that? I found it to work quite well on my machine
and servers, it's also an integral part of another open source library that
I'm building. [0]

[0] <http://github.com/doda/imagy>

