Hacker News new | past | comments | ask | show | jobs | submit login
Skip lists: fast PyPy-compatible ordered map in 89 lines of Python (pythonsweetness.tumblr.com)
77 points by _wmd on March 13, 2013 | hide | past | favorite | 25 comments



> As many will attest, it’s easy to live life without an ordered map in Python, but the moment you need one Python starts to suck really fucking hard. This should be built into the language somehow.

OrderedDict [0] was added with Python 3.1, and there's an equivalent (pure Python) class provided in the docs for backwards compatibility. Just in case the author reads this, it would make the code nicer to implement the appropriate "magic" method names that make it act like built-in containers [1].

0: http://docs.python.org/3.3/library/collections.html?highligh...

1: http://docs.python.org/3.3/reference/datamodel.html#emulatin...


OrderedDict tracks insertion order, not key order. Additionally it is not possible to start iteration from an arbitrary key.

The method and variable names here mostly reuse the names from the original skip lists paper, although in a real generic implementation, reusing the mapping protocol would be a nice touch. I posted this more due to the ridiculous simplicity of implementation: skip lists themselves are far more worthy of note than my horrid example code.


This implementation may look ghastly at first sight, however it's worth note that:

* The node pointer list is reused for the node structure to save memory, otherwise a minimum of 72 bytes is wasted per node on CPython, in addition to malloc slack for 2 allocations (the list object itself, and the array of pointers). As it stands, CPython burns about 100 bytes per record, so 'clean' code here would potentially double the memory requirements in addition to added runtime cost.

* Numeric, rather than symbolic indexing wins 5k lookups/second on CPython. It's debatable whether using speed hacks like "search(key, IDX_PREV=IDX_PREV, IDX_NEXT=IDX_NEXT)" is uglier than the bare numbers themselves. Using accessor functions to pretty the code also costs quite a lot.


Out of curiousity, what made you pick skip-lists over a trie structure, such as crit-bit trees?


Primarily because I knew from previous research that they were easy to implement. Secondarily because unlike most trees, they're amenable to concurrent update (although this is mostly irrelevant to a Python application) and I wanted to get some foundational experience. Locking a skip list is very straightforward, and slightly less intuitive lockless versions also exist.


I'm wondering if "slightly less intuitive" is a deliberate understatement or not :)

I recently read through the concurrent skip list map implementation in Java that is lockless. That code is definitely quite tricky, and the comments refer to three PhD thesis's one should read to understand how it works.


This is a conflict that I find very interesting. Python is a language very much about readability. You often find examples like this where "good code" is slower than "ugly code". I would imagine, however, that PyPy's JIT makes alot of these problems go away?


On PyPy it should be more efficient to have a 'class Node:', where when enough instances with similar attributes exist, it'll optimize to something like CPython `__slots__`. One issue is the list of pointers on each node. This could be kept as a separate list attribute (another 2 mallocs, slack, list overallocation etc), or perhaps arranged so that "$max_levels" types exist, each with attributes to cover a particular level. In any case since I wanted CPython compatibility, the horrid list trade-off is the best I could find.

You raise an interesting topic, I got into Python around the 2.2 era almost entirely due to the "cleanliness" factor, but features quickly began to appear that seemed redundant in some way, and the general popularity of the language meant its philosophy became increasingly dilute. Basically whatever "obviousness" the language had is eroded by varying degrees of pomp and ceremony, e.g. "idiomatic" use of crazy stuff like metaclasses, decorators everywhere, or nested list comprehensions containing perlish conditional expressions.

It seems healthier to look at Python less as some bastion of beauty and more a slightly prettier glue language, like a perl 2.0. From this angle it becomes an interesting macro language for combining chunks of C in interesting ways. You can write some awesomely fast, concise code with abuse of buffer(), itertools and functools.partial, but none of it looks particularly "Pythonic" or "beautiful". Nonetheless the resulting code is among the best in its class in terms of performance/size/maintainability compromise, yet the community at large would likely reject it as abhorrent.

I guess this code snippet is a good example of that.


Random note.

The "public domain dedication" is the wrong way to make code available for anyone to do what they want with it. The problem is that under US copyright law, you and your heirs actually own that copyright, whether you want to or not. Declaring otherwise has no legal force. Which means that if you change your mind, or if your heirs feel differently than you do, people can potentially be sued for copyright violations on that code. Unlikely, but possible.

By contrast releasing your code under an extremely permissive license has legal force, and neither you nor your heirs can unrelease it. (Assuming, that is, that you own copyright to your own work. Sometimes people do not, and do not realize this...)


Are you sure about this? Hasn't Daniel J. Bernstein been fighting this meme for years and years? Is he just wrong?

http://cr.yp.to/publicdomain.html


Given a legal dispute between informed developer and a copyright lawyer, caution indicates that I should pay attention to the lawyer.

As Bernstein indicates, Lawrence Rosen argues the other side of this. There is binding precedent on the question in the 9th circuit. However that precedent is not necessarily binding on other courts, the statute does not provide for such a mechanism, and we've already seen statutes bring works back under copyright which had been out of copyright. The most famous example being It's a Wonderful Life. Therefore it is possible that other courts could decide differently, and it is possible that future copyright legislation or treaties could alter the legal status of works that have been abandoned to the public domain.

And this is just the situation in the USA. There are about 200 countries in the world, with different legal systems. Most have some type of copyright law, and that law roughly follows international treaties. I have confidence that in the countries with "reasonable" copyright legislation/jurisprudence, that copyright licenses have force. Given that even the US situation is not entirely settled, I have no confidence that a public domain declaration has force in other countries. Therefore caution would indicate that a simple permissive license is preferable to a public domain declaration.

Therefore yes, I would say that the chances of Damiel Bernstein being wrong on this are good enough that a simple permissive license is preferable to a public domain declaration.


Public domain is explicitly part of the Berne Convention. Bernstein is, as CS professors go, atypically engaged with the law in general and copyright particularly. I wouldn't be too quick to dismiss him.


Berne Convention uses the term "public domain" as something that happens to works when copyright expires (absence of copyright), but it doesn't mean you can say an incantation like "i hereby release it to public domain" to make that happen. In many (most?) countries the incantation doesn't have a defined legal meaning as authors always have copyright to their works, the Berne Convention is implemented in national copyright law without talking about "public domain".

See here, Article 18 is the only place that mentions public domain: http://www.wipo.int/treaties/en/ip/berne/trtdocs_wo001.html#...


Bernstein may be right. He certainly does care about this issue, and has researched it. But when there are two almost equivalent approaches, one has 0 risk, and the other has minimal, why not follow the approach with 0 risk?

Incidentally your Berne Convention argument is rather weak according to my reading of the actual text of the Berne Convention. (See http://www.wipo.int/treaties/en/ip/berne/trtdocs_wo001.html for the text.) Public domain is only referred to in article 18, and the only type of public domain referred to is due to expiration of the term for copyright. Section 7 indicates the minimum terms in question, and those terms are both long and do not contain anything indicating that they can be shortened by the author's wish.

Therefore the fact that public domain is mentioned in the Berne Convention does nothing to reassure me that countries which sign the Berne Convention will necessarily pay any attention to a public domain declaration.


Berstein isn't the only one who dedicates software into the public domain. SQLite is perhaps the most widely used software delivered in that form.

http://www.sqlite.org/copyright.html

They recognize that the public domain might not exist in all legal domains, so a licensed version is available, for a fee.

Others have decided to decline to use copyright protection. There's a list of such software at http://unlicense.org/ .

I wouldn't look to the Berne treaty for some statement of the international existence of copyright law. You need to look towards national laws instead. For example, the US recognizes the public domain, and a work of the United States government is automatically in the public domain in the US. (Though it might not be in the public domain elsewhere.)

So like any social movement, if enough people release software and disclaim copyright protection, then those jurisdictions which don't recognize the public domain might change. If no software ever takes the risk, then it will never change.


> Public domain is explicitly part of the Berne Convention.

That doesn't mean the legal/IP system of Berne signatories has to allow putting things in the public domain before copyrights on it expire.

In fact, I know for certain that France does not allow it under any circumstance. And I believe (though without certainty) that this is the rule in all EU countries.


Thanks. I've slightly improved that paragraph, although I guess it should really be mentioned in the code itself. In any case what's worthy of note is that this can be done with so little code in the first place, rather than the ugly example I've provided


There are several MIT licenses. Just use the ISC license. It's the shortest license possible. http://en.wikipedia.org/wiki/ISC_license


With the "and distribute" or "and/or distribute" wording? Different factions have expressed preferences for each option.


I have no idea. haha ok I'll stop trying to give license advice. I do know the OpenBSD project uses "and/or distribute" on the advice of the FSF.


Dedicating code to the public domain has a nice side effect of stopping people who hire stupid lawyers from using your code.


How do you know the original poster is American and subject to US Copyright Law???


The nationality of the original poster is irrelevant.

As long as there are people who the poster would like to see free to use the code who are American and subject to US Copyright Law, the issue has potential relevance.

Ditto for French users and French law since we've been informed that the declaration is useless in France.


The Creative Commons 0 license is perhaps even closer to public domain.


FWIW, there has long been another Python skiplist recipe at: http://code.activestate.com/recipes/576930/ and it is indexable as well. It is under an MIT license and runs great under PyPy.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: