The Python Standard Library - Where Modules Go To Die (leancrew.com)
122 points by b14ck on May 1, 2012 | hide | past | favorite | 72 comments



I'm not really a Python programmer, but a hacker who occasionally has cause to pick up Python scripts and do stuff with them. Perhaps I've been unlucky, but every time I've done this, it's turned into a profoundly frustrating exercise. There have always been dependencies outside the standard library, and those have had dependencies -- which more often than not, are incompatible with whatever version of Python my environment is set up for. I've frequently run across scripts with dependencies that somehow only execute in mutually incompatible versions of Python, which always makes for an exceedingly aggravating day of programming.

As much as people love to bash PHP -- and I agree that it's pretty awful as a language -- its standard library is so comprehensive, backwards-compatible, and superbly-documented that I have never had a comparably aggravating experience with it. The same is true of Javascript: a language with warts, but whenever I try something, it Just Works.

Like I say, perhaps I've just been unlucky, but my distinct impression of Python has been that it's a beautiful language surrounded by a particularly problematic ecosystem of incompatible libraries and sparse documentation. I suspect that the Python community would benefit from paying less attention to the purity of the language, and a lot more attention to the quality of everything surrounding it.


There have always been dependencies outside the standard library, and those have had dependencies -- which more often than not, are incompatible with whatever version of Python my environment is set up for.

Create a virtual environment for the Python version you want to use, and install the package with pip (pip will install all the dependencies in your project's local environment)...

  $ mkdir myproj
  $ cd myproj
  $ virtualenv --python=python2.7 env
  $ source env/bin/activate
  (env)$ pip install somepackage


Or you can install virtualenvwrapper (http://www.doughellmann.com/projects/virtualenvwrapper/) and after a little configuration...

  $ mkvirtualenv env1
  (env1)$ pip install somepackage

Creating self-contained Python projects has never been easier!


Yep, I learned this trick on my first go-around. The problem has occurred when I've tried to run scripts with dependencies which somehow are only available in mutually incompatible versions of Python. I'm not sure how this is possible (it's baffling -- it shouldn't be possible), but it's happened three out of four times that I've tried to use a Python script of any real consequence. Usually after half a day of futilely trying to find an environment which will actually accommodate all dependencies, I end up having to port everything to whatever version of Python appears to be the most common denominator. In all fairness to Python, this is relatively easy to do (except in one case, when I had to hire a Python-expert friend to do it), but it still means that what should've been a five-minute affair (less the dependency hell) turns into a full-day affair.

Like I say, I've probably just been unlucky. But at this point it's given me a pretty serious aversion to Python. Will probably have to get over that someday, I suppose.


Which packages? I'm curious, because unless you're trying to work in Python 2 and 3 at the same time, I can't think of any packages which don't support 2.5, 2.6 and 2.7. And I've been programming in Python about 10 years or so. Unless you're going super-bleeding-edge, it shouldn't be a problem.


I hate PHP's standard library - I find it to be weird and inconsistent, arguments are in a random order, etc. and Python to be not so bad. I suspect this is because you have a lot of PHP experience, and I have a lot of Python experience :)

Virtualenv and pip go a long way towards fixing these problems in Python, similar to how PEAR and CPAN work for PHP and Perl.


Having used Python in anger for some years, I have never encountered anything like this (scripts with mutually incompatible dependencies, really?). It sounds like you ran into some specific package that was poorly made or had a bad release, or perhaps just a package that should have had its dependencies pinned at specific versions.

I'm completely lost as to why you think this is a problem with Python or its standard library. (Except that this seems like an opportune venue to bash a perceived competitor to your favorites)

It was years before I even saw any need to use virtualenv (and that was because of misbehaving packages from Google and a desire never to modify PYTHONPATH again).

As for documentation, again I feel strongly that you must have used some specifically bad packages; I can't see how that is an inherent flaw of Python or its standard library.

"Pure" is one of the last things I'd call Python. For example, imperative constructs are jumbled with OO constructs and functional constructs.


Most frustrations I've had with Perl/Ruby/Python have been down to inconsistent, incomplete or out-of-date packaging by the software distribution.

I've found I can mitigate most of this by using perlbrew/cpanminus, pythonbrew/pip and rvm/gem.

This adds complications when it comes to deployment - you've added additional maintenance dependencies to your servers if you don't stick with what the package manager provides... Nothing insurmountable, but keeping an eye on updates (would usually use email to notify on dated versions) and maybe having your CI system always use the latest versions of everything are possible methods. I'd like to hear better ideas, to be honest.

My background is both sysadmin and development, I think there should be no reason we can't make this all simpler for everyone.


Yes, I agree completely.

At least in Python, the problem occurs because distros make several mistakes which compound each other: (a) insisting on generating their own packages for modules (and typically taking forever to update them); (b) not providing any kind of isolation; (c) building on top of these non-isolated modules.

What I do these days is leave the system Python 100% for the use of the distro (at most, dependencies for minor command line scripts which don't have import relationships with code I'm working on). Development occurs inside virtualenvs, production apps run inside virtualenvs. I use pip to install, remove, pin versions.

Dependency versions are really part of the app; the right versions should be pip installable inside a virtualenv, and this is just something which should be done automatically on a deploy. I don't mean you need to include the whole code for your dependencies in your project, but pinning versions is important to not having to put out fires.

You really don't want any versions to advance automatically until you have had a chance at least to run unit tests - if things are working then pin the new version. Of course, if you are using a library which never suffers regressions or API changes then you don't need to do this.

In short, the dependency list and versions should be included with the project and managed by the developers, so that deploys are really a matter of creating a virtualenv (with --no-site-packages) and running pip install -r requirements.txt to install the right stuff.
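A minimal sketch of that flow (the package names and versions below are placeholders, not any project's actual dependencies):

```shell
# Pin exact versions in a requirements.txt checked in with the project.
cat > requirements.txt <<'EOF'
requests==0.11.1
simplejson==2.5.0
EOF

# Deploy is then mechanical (shown as comments, since it assumes
# virtualenv and pip are installed on the target machine):
#   virtualenv --no-site-packages env
#   env/bin/pip install -r requirements.txt
```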


I think you've really just been unlucky.

I've had the same experience before, but chiefly with Ruby.

Interestingly, my problems with this in both Python and Ruby have evaporated once I got in the habit of using virtualenv/rb-env/rvm for my development environment.

Python is about the cleanest/nicest experience I have in any language, for the record. Only language that comes close is Clojure.

Leiningen is...legendary.


I would argue that Python's approach is FAR better than Clojure/Leiningen's "no batteries included" approach.

Suppose you want to do a very common task like parse some XML. In Clojure, the workflow is:

  1. Go to Github or Clojars, find the latest version number  of clojure.data.xml

  2. Add this version number to your project.clj

  3. Lein deps and restart the repl

  4. Re-acquire whatever REPL data you had

In Python, it's:

  1. import xml.{sax,dom,etree}

And, paradoxically, the availability of all these different versions of libraries in Clojure leads to MORE conflicts between libraries than would otherwise be the case, not less. In Python, you may not agree that, say, the "os" or "subprocess" modules are optimal -- but by golly, they're consistent.
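The Python step really is just an import; a small sketch with xml.etree (the XML snippet is made up for illustration):

```python
# Parse a string of XML with the stdlib alone -- nothing to declare,
# fetch, or version before you can start.
import xml.etree.ElementTree as ET

doc = ET.fromstring("<domainlist><domain name='example.org'/></domainlist>")
names = [d.get("name") for d in doc.findall("domain")]
print(names)
```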


Thanks to pip I often don't even bother with the Python stdlib for crusty things like one-off web scrapes or XML parsing. Here's a recent example where I wanted to read some attributes out of some remote XML and did it with requests and PyQuery rather than urllib and xml:

    import requests
    import pyquery

    _domains_text = requests.get(API_URL + "/domainlist.xml").content
    _domains_db = pyquery.PyQuery(_domains_text)

    DOMAINS = [d.values()[0] for d in _domains_db('domain')]


I like to set

    jQuery = pyquery.PyQuery(someHTMLDocumentString)

so I can use jQuery like I'm used to. At this point, you can do

    links = jQuery('a')

or whatever.


I've had the same experience before, but chiefly with Ruby.

Tell me about it: http://lee-phillips.org/badruby/


Is it reasonable to expect packages to maintain compatibility with a version of Ruby from 2004? I don't know many Python libraries that still work with Python 2.2. Not sure about Perl 5.8, but I couldn't even find a link to download a binary.


Is it reasonable to expect packages to maintain compatibility with a version of Ruby from 2004?

That in itself would not be reasonable. But it seems as if Ruby's own libraries broke going from 1.8.1 -> 1.8.7. Regardless of the number of years involved, that's kind of unexpected. But I'm not familiar with the Ruby world, and maybe that change in version number is considered major.

The whole experience left me with reduced confidence in Ruby stuff and I still avoid it and programs that use it. The comments below by people who know more about the Ruby ecosystem than I do don't give me any reason to change.


(Disclaimer: I am a professional Ruby developer)

The problem is that the various packaged versions of Ruby are a shambles. RVM has a number of significant problems and difficulties (rbenv is both better and worse).

Right now getting a good, modern Ruby version on a standard modern computer is a big pain in the ass.

As long as that stays true, we will be dinged for not supporting old (but common) versions.

Years from now, when everybody has good 1.9 compatibility, things may be better. At least, if 2.0 doesn't have the same problems...


The problem is that it's much, much harder to upgrade Ruby from 1.8.7 to 1.9.3 than it is to upgrade Perl from 5.8 to 5.14.

So many of the language features have changed that a Ruby programmer will have to edit every file in her project, and upgrade every single dependency, just to get her application running on a new interpreter. Add that to the culture of "let's use as much 3rd-party code as possible!" and the library writers emulating the core developers by changing their APIs all the time, and the process of upgrading an interpreter converges on "rewrite the entire application".

Which is what we do. And some percentage of those rewrites are in languages that don't have this problem.


If you know what you are doing, then you know what version of the interpreter you want and have the ability to install it. Once upon a time I wrestled hard to compile tarballs; these days it is really rare that I have to do more than ./configure; make; make install in the worst case.


What is Leiningen offering over Maven? Isn't it based on the same stuff?

That said: Leiningen & npm _are_ really nice solutions for me.


I think the problem is because when you used it the Python community probably still hadn't decided on a one true dependency system like Gems / rvm for Ruby, and Maven for Java. It's been about four years since I've heavily used Python, but hopefully this has been fixed by now.

I don't have this problem on either Ruby or Java.


Nope, pip and virtualenv are pretty much it. Use 'em


The fact that the standard library is well-maintained, carefully debugged, and backward-compatible is a far stronger indicator of Python's awesomeness than the existence of shiny, new libraries. Hackers naturally gravitate toward high-visibility projects with brave horizons and bold scopes. By contrast, it is incredibly hard to find the motivation to update, for the umpteenth time, a warty API -- and that's precisely the reason why contributions of the latter sort are the truer test of the vitality of a language's ecosystem.


It is stable, but there are some really nasty warts.

To pick just one that bites Python programmers all the time: by default, the ssl library does not validate the server certificate at all. Not validating the certificate makes SSL/TLS almost useless. But this is still the default (see http://docs.python.org/dev/library/ssl.html#socket-creation, "CERT_NONE"), because the standard library is "stable".
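For illustration, a sketch of what opting in to validation looks like with the 2.x-era ssl API (the CA bundle path is an assumption and varies by system, and ssl.wrap_socket has since been superseded by ssl.SSLContext):

```python
import socket
import ssl

def tls_connect(host, port=443,
                ca_bundle="/etc/ssl/certs/ca-certificates.crt"):
    # wrap_socket defaults to cert_reqs=ssl.CERT_NONE, i.e. no
    # validation at all; passing CERT_REQUIRED plus a CA bundle
    # turns validation on.
    sock = socket.create_connection((host, port))
    return ssl.wrap_socket(sock,
                           cert_reqs=ssl.CERT_REQUIRED,
                           ca_certs=ca_bundle)

# The insecure default, spelled out:
print(int(ssl.CERT_NONE))      # validation disabled
print(int(ssl.CERT_REQUIRED))  # what you almost always want
```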


On the contrary, the vast majority of the time, when one is using SSL, they're using it because they want encryption, rather than identification.

The certificate system surrounding SSL is a complete mess. It does virtually nothing other than trigger false positives for people who haven't paid the appropriate "security partner."

The very rare person who is actually using SSL for identification rather than just to establish an encrypted TCP connection, and therefore cares about certificates, can change the default.

PS: I know the standard response to this, that encryption without identification is useless, because without identification your counter-party might be Eve. In reality, in the real world, that doesn't happen. MITM attacks are extremely rare. And the real Eves on the net (phishers) can easily obtain signed certificates that will fool pretty much any end user.


Not to pick on you per se, but this reasoning is often seen in relation to SSL (or encryption in general) and is dangerously wrong.

Encryption without identification and authentication of your communication partners is useless. You may very well end up with a very secure link with the wrong communication partner (google 'man-in-the-middle-attack').

I agree that the (public) CA system is a mess, however especially with machine-to-machine communication it is very easy to generate, sign and use your own certificates. And contrary to popular belief, self-signed certificates are not any less secure than public CA signed ones. Both have their own use-case though.

If someone cares I'll be happy to explain the above points in more detail.


I now see your 'PS' (perhaps added while I was adding my comment?) - however I cannot follow the argument you make.

IF you assume that MITM-attacks are rare, you probably also assume that traffic snooping is rare (which is after all a form of a MITM-attack). If that's the case, why use encrypted communication channels at all?

Security is never perfect - it always is about adding layer upon layer to make the bar high enough that the remaining number of adversaries becomes more manageable.

Spoofing a site that is not using SSL is trivial. Using SSL with public CA signed certificates significantly raises the bar. Not to the 'perfect' level, but enough to make a real difference. Not checking the server certificate throws you back to the 'trivial' level.


I agree with the original poster

"IF you assume that MITM-attacks are rare, you probably also assume that traffic snooping is rare (which is after all a form of a MITM-attack)"

Well, no! You can snoop traffic without being the man in the middle (WiFi, local network snooping, etc.). Snooping is much easier.

As you said, security is never perfect, and 'security implementers' less so.

If the other party in the communication uses a self-signed certificate (or one signed by "Bob's SSL"), well, I can try to convince them to change, but it will be hard.

Sure, I'll never accept a self signed key from my bank or e-commerce, but there are several other uses.

And when using APIs to connect over HTTPS, you should be able to tell them to ignore the certificate; it doesn't matter far more often than the opposite, unless you don't trust your ISP.


If you are in the position to passively sniff the traffic, you're almost always also in the position to redirect and modify the traffic.

The only thing you're protected against with encryption sans authentication is passive sniffing. If that's all you care about, fine, do realize however how limited the protection is you gain.


If MitM attacks are so rare, why bother encrypting your traffic in the first place? Packet-snooping attacks are also "extremely rare" by most metrics, so why protect against one but not the other?

Either go all the way on security, or be obvious about not having any. Appearing secure when in actuality you're not is the worst option.


Packet-snooping attacks are also "extremely rare" by most metrics [...]

Really? NSA boxes in AT&T (and presumably other) switching stations suggest that for US traffic it's extremely common.


'Packet-snooping attacks are also "extremely rare"'

I think they're pretty common, even for fun and recreation (http://codebutler.com/firesheep). I know I could start reading people's emails in Starbucks with what's on my laptop now and the knowledge in my head, but if I wanted to mount a MITM attack I would need to do some research.


Well-maintained? Carefully debugged?

Hardly. Backward-compatible? Ok, I'll agree to that one.

But seriously, if you want to see an example of a not-so-well-maintained or carefully-debugged standard library module, go look at 'shutil'.

That's just one example of many in the standard library that need some serious TLC. Want another? Go look at tarfile or subprocess.

And don't even get me started on the lack of documentation for many parts of the standard library; the source code is the only real documentation.


> And don't even get me started on the lack of documentation for many parts of the standard library; the source code is the only real documentation.

I've rarely heard anyone say this. Do you have an example of an area that is severely lacking in documentation? Maybe I don't use a wide array of modules, but I can't remember the last time as a user I had to dive into the source code.

I know the stdlib lacks examples in a lot of areas, often going for purely API coverage, which is welcome to change.


Agreed with the OP. The following is a shameless plug:

Python's ConfigParser module is a pain to use. It provides no validation, only supports a limited number of types of data you can retrieve, etc. Similarly, getopt vs optparse vs argparse is a mess. getopt is universal: not only is it going to be in all versions of Python, but it is also the same library available in virtually every other language. The problem with it is that it is not declarative, so you will typically see a giant if/elif statement that goes with it. argparse/optparse are better, but aren't universal even between versions of Python, though argparse has been backported and is available via pypi.
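To make the declarative/imperative contrast concrete, a small argparse sketch (the option names are invented for the example):

```python
import argparse

# Each option is declared once; argparse derives parsing, type
# conversion, defaults, and the --help text from the declarations --
# no giant if/elif dispatch as with getopt.
parser = argparse.ArgumentParser(description="demo server")
parser.add_argument("--host", default="localhost")
parser.add_argument("--port", type=int, default=8080)

args = parser.parse_args(["--port", "9000"])
print(args.host, args.port)
```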

To unify all this into one convenient module, I ended up writing http://ipartola.github.com/groper/. groper lets you specify your parameters declaratively, and if you specify defaults, use them right away without having to create/modify a config file. It automatically figures out the priority of arguments: cmd > config > defaults. It also has some niceties such as the ability to automatically generate usage strings, give the user intelligent error messages, generate sample config files, etc.


I don't understand why I shouldn't be using argparse. Just using argparse means no mess of 'getopt vs optparse vs argparse', because I am not using all those other libraries. I don't see anything seriously wrong with argparse. How does it help me to use a third-party module rather than argparse?


Using argparse is probably the safest approach. However, argparse does not work with config files; groper does. So if you have more than a half-dozen options, you should use groper (or something similar).


The permalink for the article is http://www.leancrew.com/all-this/2012/04/where-modules-go-to... (the posted link is actually the home page of the blog).


I've heard core Python developers tell people not to worry about getting a module into the stdlib. The problem being that once the module is there it won't be able to change much. APIs have the exact same problem. If you change it, you're changing other peoples' software. Tight coupling.

Is it a terrible way to write software? Maybe... but perhaps that's a different discussion.

I think the requests library is amazing. It has a much more simple API than urllib/urllib2. Does it need to replace those modules in the stdlib? I hope not!

There are only three reasons I would write a module/package that depended solely on stdlib:

  1. The module/package would be distributed primarily through package management systems.

  2. The installation of my module needs to avoid depending on anything else outside of a base python installation.

  3. The module or package will need to be supported for a long time and will likely not be updated frequently.

The first case is because you can't control what versions of third-party libraries the package manager will make available. Some might run your setuptools script while others may not. It's just easier to live with the cruft/warts of the stdlib and be sure that they'll always be there.

The second case covers a very unique situation. Modules and libraries written with this constraint are typically targeting one of two different kinds of developers. The first are the beginners who may not know about development environments and versioning. The other are experienced developers who want a minimalist script for their little one-off utility. Both should require zero dependency installation if possible.

The final case is harder to define up front. If you're writing something that you expect to run for a long time and receive little maintenance (ie: cron scripts, tools, etc) then you don't want to deal with API updates breaking your code. Fire and forget is what a long-term stable API gets you.


> An overstatement, certainly, but with more than a germ of truth. Once a library is enshrined in the standard set, it can’t change radically because too many programs rely on it—and its bugs, idiosyncrasies, and complications—remaining stable.

That's a problem inherent in the standardization process, though - it's all but contradictory to have something be both 'standard' and 'continuously improving'.

Once something enters the standard, does anyone propose a better way of removing cruft without constantly deprecating everything, rendering the concept of a 'standard' somewhat meaningless?


Things come and go, and we have seen a number of deprecations in Python already. urllib predates urllib2, while subprocess itself replaced a number of things that really came from C. getopt was a port of the eponymous C library, which optparse was meant to replace, and optparse was in turn deprecated in favor of argparse.

I would really not be surprised to see envoy, requests and so on come up in the standard lib at some point.


The thing with the Python standard library is that it is crappily documented, IMHO. I often can't make heads or tails of it, while I have a much easier time with any other language (you name it: Java, Ruby, PHP, C, Scala, Lisp, ...).


I felt that way about some of the standard documentation but was relieved to find: http://www.doughellmann.com/PyMOTW/

I also highly recommend checking out Doug Hellmann's book 'The Python Standard Library by Example'. He presents every (or almost every) standard library module with simple explanations and plenty of examples.


I've never seen this before, but that is an AMAZING improvement on the standard docs


Nice post. This is way better than the standard docs. Having lots of examples really helps.


The Python standard library has great documentation.

Except for some of the "batteries included" stuff. urllib2? Ouch.


That was indeed the kind of stuff I had in mind.


Yes, for a language as widely deployed and used as Python, retaining backwards compatibility and stability is more important than adding new and shiny tools to the stdlib at a faster pace. Users rely on the fact that a module in stdlib will remain there and will remain stable for a long time. More modules means more maintainers, and Python is an open-source project developed by volunteers. It's that simple.

I'm not sure what the solution this article proposes is. The tradeoff between "coolness" and "stability" is inherently difficult, and I'm sure Python is not the only language "suffering" from it.

After all, it's quite easy to install a new Python module, and not much harder to distribute it with your application (for web apps it's even easier), so what is the problem?


It's funny, I've had the opposite problem. I was trying to write an IRC bot in Python, noted there didn't seem to be a standard library module for the IRC protocol, and so found myself looking at this:

http://pypi.python.org/pypi?%3Aaction=search&term=IRC

That's 400+ results - at least 20 of which are actually IRC protocol modules. There's no way of telling how mature each one actually is 'til you download it. It turned out the first three I tried were undocumented, buggy, incomplete, or otherwise no good.

So I gave up on PyPi and hacked it as an xchat plugin instead.

----------------

Perhaps the way forward would be styling your package repo after, say, addons.mozilla.org -- add just enough community functionality (as in ratings/reviews/"times downloaded" counters/etc) to allow the occasional gems to rise to the top of the muck. Once one solution for a given problem has been established as the best (well, most popular), that'll get more eyeballs on its internals as well, and it'll only increase its lead until it's de facto standard -- but the possibility is still there for a newcomer to dethrone it if it's genuinely better. And meanwhile, both can exist side-by-side without causing ugly compatibility issues.


I believe that PyPI used to have some kind of popularity contest functionality that got killed.

I have to say I'm not sure that selecting the package you want to use is really the problem which PyPI needs to solve. It isn't the app store. That said, PyPI does provide a 'weight' in searches, which seems to track with popularity and freshness somehow.


The title text on "weight" says "Occurrence of search term weighted by field (name, summary, keywords, description, author, maintainer)". So, not popularity/freshness, just a rough metric for how well it matches your search.

Indeed, PyPI might not be the right place for a community rating system -- perhaps a site could be built on top of it to provide that sort of functionality.


This happens in every language. I dont think it's that big of a deal.

In all honesty, you could continue to maintain a package outside of stdlib, and just require a newer version (which gets installed via the standard packaging tools). This type of behavior isn't well defined in Python, but it's not unrealistic to think it could happen.


How is it not well defined in Python?


Well it would work just fine, but you'd always end up requiring the external dependency even if you didn't need to.

For example, let's say there was a new urllib released (its still called urllib). It's now version 2.0, but the stdlib version is 1.0.

If your package said "I need urllib==1.0", it would have no way of knowing that the version was already included within the standard library.

That said, it would download the correct package (assuming it existed) and work just fine.


> it would have know way of understanding that the version was already included within the standard library

Other than by introspecting which packages are installed, that is. Most of them will have a VERSION, __version__ or _version attribute which tells you.
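A sketch of that introspection (json is just a convenient stdlib module that happens to carry a __version__ attribute; not every module has one):

```python
import json

# Probe a module for the common version-attribute spellings.
version = None
for attr in ("__version__", "VERSION", "_version"):
    version = getattr(json, attr, None)
    if version is not None:
        break
print(version)
```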


This is available via the PKG-INFO metadata in all installed packages; IIRC a version is required for distutils, and there is PEP 386 for the version number format, so it should be possible to determine version numbers as well as compare them for all well-behaved packages. There is even a package called pkginfo which will find and parse PKG-INFO for an installed package:

http://pypi.python.org/pypi/pkginfo


GP was talking about built in modules in the standard library. I don't think they use distutils, but many of them still have some sort of version number.


Since it would work just fine, what is the problem you are trying to solve? Having to download too much stuff?


This has kind of happened in Ruby, too.

Fortunately, Ruby gems are super easy to install and the standard library got some much-needed spring cleaning in 1.9.

Python could use the same. There have been many times where I've wanted to do some simple task that would be made easier with an external library (like Requests) but I'm not going to bother dealing with the Python module install pain for a one-off task.


> the Python module install pain for a one-off task.

"pip install requests" ?


Unfortunately, it seems the official documentation on installing Python modules[0] makes absolutely no mention of pip or even easy_install. Seems like something that should be there, right?

[0]: http://docs.python.org/install/


I guess that is because pip is not part of official Python. It is maybe best described as a front-end to distutils (which is what your link documents, and which is part of official Python).

edit: Python docs front-page[1] also notes following:

A new documentation project, which will be merged into the Python documentation soon, covers creating, installing and distributing Python packages: http://guide.python-distribute.org/

[1] http://www.python.org/doc/

So I guess that they have acknowledged that the docs are suboptimal currently in this part.


I'm glad that there's no mention of easy_install. I have no idea why someone would want to use a package manager that can't uninstall things.


Yeah, good point. That puzzled me as well before I learned of pip.


I tend to use pip, requirements file, and virtualenv. No real trouble.


I find the opposite. Installing a useful module for a once-off task is a no brainer, since I won't have to worry about whether I'm introducing some long-term dependency I'll have to maintain.

The real issue is in discovering that a better option exists in the first place.


And then you have to deploy it to the server or someone else's computer and it sucks.


pip freeze works even without virtualenv. Expunge things you know don't matter, and send to third party/deploy, which then just has to run pip install -r requirements.txt.


Python has had a lot of cleaning in 3, but everyone is screaming about how 3 doesn't work just like 2 so I guess you just can't satisfy everyone.


If anyone was wondering (like I was) what was dropped from the standard lib, this looks to be the list:

http://www.python.org/dev/peps/pep-3108/


Every language's standard library needs a "current best practices" concept, even if it's just a well-maintained document and not something structural like a special namespace.

I think the Python "decorator" concept goes a long way toward cleaning up code. Basically you can add a decorator to a routine that you've deprecated so that it will complain if it's actually used (you can even include advice on what would be a good replacement call).
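A minimal sketch of such a decorator (all names here are invented for the example):

```python
import functools
import warnings

def deprecated(replacement=None):
    """Flag a routine as deprecated; complain whenever it is called."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            msg = "%s is deprecated" % func.__name__
            if replacement:
                msg += "; use %s instead" % replacement
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@deprecated(replacement="shiny_new_fetch")
def old_fetch(url):
    return url  # stand-in body

# Calling the deprecated routine triggers the advice.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_fetch("http://example.com")
print(caught[0].message)
```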

As far as cleaning up what's installed as standard, it's not really practical to remove anything (the fact that it stays is one of the attractive things about Python in old code bases). What you can do though is define a preferred namespace, e.g. "preferred"; this would physically contain only those libraries that are recommended, and perhaps even forked copies of modules that only contain the functions that should be used. This gives programs the option to explicitly import from "preferred" and request purity over long-term stability.


Permalink to this post: http://www.leancrew.com/all-this/2012/04/where-modules-go-to...

(The current article link goes to the front page of the author's blog.)



