Why is Python so slow?

wahern · on Jan 12, 2019

Lua is just as dynamic as Python, yet PUC Lua is significantly faster. LuaJIT is faster still, even before it JITs the code, and doesn't have a startup penalty. LuaJIT doesn't sacrifice anything, either: it's 100% compatible with Lua 5.1, including Lua's C API.

The real difference is that PUC Lua is developed by a team of 2-3 people (1-2 for the VM, 1-2 for the compiler), and LuaJIT was written by a single individual who freed himself from both the need to design the language or interface semantics and the temptation to change them. Engineering complex, deeply interdependent systems doesn't scale well. While Lua is nearly as old as Python (1993 vs 1991), the Lua VM has been substantially refactored multiple times; LuaJIT at least 1.5 times. The Lua team doesn't accept outside contributions, ensuring that implementation complexity doesn't outgrow their capacity to refactor the internals as they see fit. LuaJIT is almost impervious to outside contributions. (It's not that LuaJIT can't survive without Mike Pall, it's that it can't survive without a Mike Pall-like maintainer.)

Also, the Lua C API and control flow semantics were very deliberately designed. Some people find the stack-based API to be verbose, but it's crucial in keeping the number and complexity of public interfaces low, while ensuring that C modules can fully and seamlessly leverage constructs like closures and coroutines. CPython is much more difficult to refactor because the module API and semantics are too leaky.

But ultimately it comes back to having a small, dedicated team. Larger teams tend to rely upon too many internal abstractions. It's a consequence of having to orchestrate so many cooks in the kitchen. It's not that abstractions are bad. It's that these are the wrong kind of abstractions in the wrong place in the pursuit of the wrong goals. Both PUC Lua and LuaJIT actually have beautifully consistent code with well-designed internal interfaces. Crucially, the interfaces aren't designed to avoid maintainer conflict but rather in service of the overall design and performance of the engines.

rasputinmachine · on Jan 12, 2019

>PUC Lua is significantly faster [than python]

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Damn, where are you getting your numbers from?

EDIT: If I've got the wrong version of Lua, you may proceed to call me an idiot.

wahern · on Jan 12, 2019

In the benchmarks where Python wins, it's multithreaded. And even with 4 CPUs against 1 CPU, Lua still hangs on.

EDIT: Binary trees is peculiar in that Lua loses by 10x. There's something going on with memory consumption and possibly GC behavior, but I've already wasted enough time and don't care to dig in. There's a reason the Computer Language Benchmarks Game has limited utility. It's pretty clear the core Lua VM (e.g. bytecode execution, function dispatch, etc.) is substantially faster. Once you move past the core interpreter loop it becomes difficult to compare apples-to-apples, if at all.

EDIT2: Ok, it was bugging me. Turns out that the Python binary tree program has an interesting (clever?) bug. make_tree isn't building a binary tree; it's building... well, I'm not sure what to call it... but it has substantially fewer nodes. It walks like a binary tree, though, and so the output of the program is the same. You can do the same thing in Lua with

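  -- Each level reuses one table for both children, so only depth+1
  -- tables are allocated, although a recursive walk still visits
  -- 2^(depth+1) - 1 nodes.
  local function BottomUpTree(depth)
    local t = {}
    for i=1,depth do
      t = { t, t }
    end
    return t
  end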

which roughly quadruples performance. That means that apples-to-apples (1-CPU Lua vs. 4-CPU Python), Lua is almost twice as fast.
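To make the difference concrete, here is a minimal Lua sketch (assuming Lua 5.1 or later; `FullTree`, `SharedTree`, `ItemCheck` and `CountDistinct` are hypothetical names chosen for illustration) that builds a genuine binary tree next to the shared-subtree version, then counts both the nodes a recursive walk visits and the distinct tables actually allocated. The walk counts match, which is why the program output doesn't change; the allocation counts do not.

```lua
-- Genuine binary tree: every call allocates fresh children,
-- so a tree of depth d has 2^(d+1) - 1 distinct tables.
local function FullTree(depth)
  if depth == 0 then
    return {}
  end
  depth = depth - 1
  return { FullTree(depth), FullTree(depth) }
end

-- Shared-subtree variant: one new table per level, both
-- children point at the same table.
local function SharedTree(depth)
  local t = {}
  for _ = 1, depth do
    t = { t, t }
  end
  return t
end

-- Counts nodes visited by a recursive walk (shared children are
-- visited once per path, so they are counted repeatedly).
local function ItemCheck(tree)
  if tree[1] then
    return 1 + ItemCheck(tree[1]) + ItemCheck(tree[2])
  end
  return 1
end

-- Counts distinct tables actually allocated, skipping any table
-- already seen.
local function CountDistinct(tree, seen)
  seen = seen or {}
  if seen[tree] then
    return 0
  end
  seen[tree] = true
  local n = 1
  if tree[1] then
    n = n + CountDistinct(tree[1], seen) + CountDistinct(tree[2], seen)
  end
  return n
end

print(ItemCheck(FullTree(10)), CountDistinct(FullTree(10)))     --> 2047, 2047
print(ItemCheck(SharedTree(10)), CountDistinct(SharedTree(10))) --> 2047, 11
```

Both walks report 2047 nodes at depth 10, but the shared version allocates only 11 tables, which is where the work (and the GC pressure) disappears.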

igouy · on Jan 13, 2019

> well, I'm not sure what to call it...

Here's where to complain (next time) — https://salsa.debian.org/benchmarksgame-team/benchmarksgame/...

As it happens, that program was still an open issue, awaiting review, and the review was not favourable: https://salsa.debian.org/benchmarksgame-team/benchmarksgame/...

igouy · on Jan 14, 2019

> In the benchmarks where Python wins it's multithreaded

For reverse-complement and binary-trees, the Python program uses less cpu time — it doesn't matter that it's multithreaded.

wahern · on Jan 14, 2019

For binary-trees it's because of the issue I mentioned. For reverse-complement I believe it's partly because the low-level transformation is happening through a Python C module: see reverse_translation = bytes.maketrans(...).
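For comparison, Lua's own string library can push a similar per-character mapping into C: `string.reverse` and `string.gsub` are implemented in C, and when `gsub` is given a table as the replacement it looks each match up in that table without calling any Lua function. This is a minimal sketch, not the benchmark program; `reverse_complement` is a hypothetical helper, and the complement table is deliberately simplified to the four DNA bases (the benchmark's real table also covers IUPAC ambiguity codes).

```lua
-- Simplified complement table (assumption): only A, C, G, T in both
-- cases; lowercase input maps to uppercase complements.
local complement = {
  A = "T", C = "G", G = "C", T = "A",
  a = "T", c = "G", g = "C", t = "A",
}

-- Hypothetical helper: reverse the sequence, then complement each
-- character. Characters with no table entry (e.g. "N") are kept
-- unchanged by gsub. Only gsub's first return value is kept.
local function reverse_complement(seq)
  local out = seq:reverse():gsub(".", complement)
  return out
end

print(reverse_complement("ACCG"))  --> CGGT
print(reverse_complement("acgN"))  --> NCGT
```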

Lua is notoriously batteries NOT included. I suspect only a fraction of Python programmers have written a Python C module, at least for production use, because they rarely need to. By contrast, almost all PUC Lua programmers I have met have written Lua C modules, as you can't even do simple things like directory manipulation or sockets without third-party modules (few of which are "standard") or by writing a custom module. Lua is somewhat handicapped in these benchmark games as the language is fundamentally designed to be extended with C code. It doesn't come with any real modules of its own beyond a simple (non-regex) string matcher and bindings to the ISO C library. The whole point of Lua is that it's super trivial to write C modules; compared to writing Python or Perl C modules, it's incomparably more straightforward, even for inexperienced C programmers. (The use of binding generators is usually discouraged, as 9 times out of 10 manually written bindings are both easier to write and easier to maintain.) But it obviously wouldn't be any more fair to permit custom-written Lua C modules in the benchmarks.
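The "simple (non-regex) string matcher" is Lua's pattern syntax in the string library. A small sketch of what it does and doesn't do: it has character classes, anchors, captures and single-class quantifiers, but no alternation, so `|` is just an ordinary character. The input strings here are made up for illustration.

```lua
-- Character classes (%w, %s, %d), anchors (^, $), captures, and the
-- quantifiers *, +, - and ?, each applied to a single character class.
local key, value = string.match("timeout = 30", "^(%w+)%s*=%s*(%d+)$")
print(key, value)  -- prints timeout and 30

-- No alternation: "|" is not a magic character, so this pattern only
-- matches the literal text "cat|dog".
print(string.find("dog", "cat|dog"))        --> nil
print(string.find("a cat|dog", "cat|dog"))  --> 3, 9
```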

If you look at all those benchmarks in just a little depth, I think it should be fairly clear that the core Lua VM is significantly faster--roughly 2x. OTOH, that's often irrelevant, especially if Python's core module ecosystem directly or indirectly helps you solve an issue. It's a fundamental limitation of the benchmark game, particularly regarding Lua. Lua isn't just a scripting language; it's thoroughly designed to be embedded in or extended by C. See the paper "Passing a Language through the Eye of a Needle: How the embeddability of Lua impacted its design", https://queue.acm.org/detail.cfm?id=1983083

igouy · on Jan 15, 2019

> It's a fundamental limitation of the benchmark game, particularly regarding Lua.

It's a fundamental limitation of language vs language comparison.

See "Apples and Oranges" —

view-source:https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

igouy · on Jan 15, 2019

> For binary-trees it's because of the issue I mentioned.

No, it is not: I removed that program before I replied.

(Refresh your web browser cache.)