Ruff – a fast Python Linter written in Rust (github.com/charliermarsh)
148 points by amrrs on Aug 31, 2022 | 46 comments



As much as I'd love the headline to be accurate (we use pylint and friends extensively), I think the comparison is premature.

The linters the author is comparing against have significantly more features and more overhead, given they need to support many more complex rules and situations. I'd also argue this linter is far from well organized, given that all checks appear to be implemented in a flat file [1].

I would not be surprised if a fully fledged linter written in rust outperforms any linter written in python by a factor of 20-25x though.

I'd be curious to see functionality & performance comparisons if this project continues.

[1] https://github.com/charliermarsh/ruff/blob/0b9e3f8b472dc3fd0...


Thanks, it's a fair comment :)

I've tried to be honest about ruff's limitations: it supports only a small set of rules, it's not extensible in any way, it's missing edge-case handling, it's significantly under-tested compared to existing tools, etc. I don't consider it production-ready -- it's a proof-of-concept.

(I _did_ try to build conviction that there weren't inherent reasons for ruff to slow down significantly as the set of checks extended, but I could definitely be proven wrong as the project grows...)

My goal with ruff was partly to build a fast linter, but that was really in service of testing a broader hypothesis I have about how Python developer tools could evolve. If that's interesting to you, I wrote about it a bit more here: https://notes.crmarsh.com/python-tooling-could-be-much-much-.... I'm certainly not here to say that the existing tools are bad or you shouldn't use them or that you should use ruff instead. I use those tools myself -- a lot!

With regards to code organization: oh well, this doesn't bother me given the state of the project. The code will evolve as the project grows in scope, I didn't see a need to over-abstract. I'd written a small amount of Rust prior to ruff, but much of it was a learning experience for me. (Funnily enough: not that I want ruff to be organized as such, but pyflakes, pycodestyle, and autoflake are also effectively single-file projects :))


As I mentioned in another comment [0], compiled languages are the future of interpreted languages' tooling.

> Whereas before one might have thought that a language's tooling should be written in said language, there might come a time when that's too slow, and we instead need to move to faster languages.

> That's what has happened to Python (numpy, pandas, scipy etc are all written in C and simply provide an interface in Python), and now it's happening to Javascript as well, with Deno, swc in Rust, Bun in Zig, esbuild in Go, and so on.

[0] https://news.ycombinator.com/item?id=32577837


Hmm, maybe this is a dumb question, but Numpy, Pandas, Scipy, etc are all libraries. Are libraries considered tooling? When I think of tooling, I think of linters, debuggers, IDEs, and so on.

I can definitely see why libraries should be written in faster languages, they are used all over the place. And anyway, something like Numpy is most importantly an interface to pre-existing high performance libraries (BLAS, LAPACK) which were already written in faster languages.

I think it is less obvious that tooling in the second sense should be written in a lower-level language. There's the obvious flexibility tradeoff, and the fact that a person interested in a linter for a language is probably most familiar with that language.

I'm not familiar with the computational problems in linters. Is getting better CPU performance a huge concern?


A linter like Prettier can take several seconds in large codebases while the one by Rome (rome.tools) takes fractions of a second, because the former is in JS and the latter is in Rust. If I'm a linter user, I really don't care what the linter is written in, because I'm never going to open up the source code to look at it. It likely exposes a configuration file which is agnostic to any programming language, and that is what I will primarily be editing.

Same thing with Webpack vs esbuild or swc, the latter two are simply instantaneous while Webpack can take minutes, even, to build an app. That it's in JS does not matter to me since I don't care about Webpack source code, I only care about what the tool itself does.


Is a compiler "toolchain" package tooling?

Is the standard library included with that "toolchain" tooling, or a library?


So, I don't think it is really necessary for there to be a perfectly bright line between the two. But for the sake of argument, I'll say that my opinion is that the compiler, linker, etc are part of the tooling, while the standard library is a library.

All of the libraries are linked to the project by some tooling, and a particular distribution of software could include any number of libraries (e.g., conda), I don't see why the standard library needs a special cutout.

It will get even more blurry if you start asking about runtimes, macros, and all-template libraries.


Often (parts of) the C standard library (those needed for a freestanding environment) are actually "baked in" to the compiler, and not actually separate header files at all (though equivalent headers may be provided as well).

In C89, those are float.h, limits.h, iso646.h, stdarg.h, and stddef.h.

C99 added stdbool.h, and stdint.h.

C11 added stdalign.h, and stdnoreturn.h.

So those headers are part of the standard library, but they're also usually part of the compiler. The line gets very blurry.


Is that a bad thing? Libraries like Tensorflow were never going to be written natively in Python, but it's a reasonable choice for deployment (whls can be installed anywhere), for user interface (Python can interop with anything), and for overall ergonomics/ecosystem (import antigravity).


No, I'm arguing that it's a good thing. There are some who argue that a language's tooling should be written in said language, but I argue that that's putting the cart before the horse, that the tooling should be fast and efficient first and extensible second, because there are many more who are tool users than tool creators, so we should optimize for the former. There is also no reason that someone who is so interested in tooling could not also learn the tool's language, like Rust, and contribute that way.


I guess there's kind of a vague philosophical argument to be made for dogfooding, but that's perhaps more relevant early on in the lifecycle of an ecosystem.

On the other hand, if Python had been really committed to dogfooding at the level of something like golang, then the primary implementation would look a lot more like pypy than it would like cpython.


A counterpoint is the typescript compiler, arguably one of the most complex pieces of dynamic language tooling out there. It's written in typescript itself, and reasonably fast given that it has to implement a hugely sophisticated type system.

This shows that it's entirely possible to write fast tooling in a dynamic language if you know how to.


Is the compiler expected to run only where a JS interpreter exists? (Which, at the time TS was created, consisted solely of Node?)


Python is kind of a special case. It's slow, widely used and often used as "glue" calling out to other programs and libraries that do the real work. There's also a lot of overlap for python and systems programmers, so there's tooling and that community knows how to do these things. There's more of a culture in python that to do anything serious (read fast) you write that part somewhere else and call into it.

If you look at JS tooling like eslint, tslint, uglifyjs, tsc, webpack, ~~esbuild~~, etc, it's all JavaScript (moving to typescript if anything).

EDIT: I stand corrected in that esbuild is not written in Javascript, but I think that's a disadvantage to its success if anything and would point out webpack is more popular.


> EDIT: I stand corrected in that esbuild is not written in Javascript, but I think that's a disadvantage to its success if anything and would point out webpack is more popular.

Webpack has existed since 2012, of course it's more popular. For a long time there were no other alternatives, so people had to deal with its madness.


As I mentioned above, new JS tooling is beginning to be written in non-JS/TS languages. esbuild which you mention specifically is in Go, not JS.


esbuild is written in Go, which is a large part of why it’s way faster than other dog slow tools especially webpack in your list.


Honestly, I believe the low-level core, dynamic shell pattern is as old as programming. It's a big part of "make it right, then make it fast".


As I learned during my Tcl days, languages like Python should stay as scripting languages, because as you pretty well describe, you end up writing everything in C anyway.

A lesson learned 20 years ago, before everything .com went bust.


I agree, but I do think the big question will be supporting plugins without recompiling. The answer is probably WebAssembly or one of the native-node interops but it's still not trivial.


I’ve wondered how much those speedups are due to Rust/Go/Zig/etc, versus how much is due to being able to design the system with fresh eyes and an eye for performance. JS is almost surely slower than Rust, for instance, but it’s not 10-100x in most cases, at least in this random set of benchmarks I found [0].

0: https://programming-language-benchmarks.vercel.app/rust-vs-j...


A major benefit of Rust/Go/Zig is control over data layout. If your language doesn't have value types, the CPU spends a lot of time pointer chasing and trashes the cache. In the worst case a cache miss could have the CPU sitting idle for hundreds of clock cycles while it waits for a RAM fetch to finish. That has a big effect in large programs with many layers of data structures, but it's not usually captured in microbenchmarks.


Benchmarks are often not a great indicator for real world performance, especially for Javascript.

Benchmarks are very constrained, repetitive and often numerical code that is executed a lot of times with similar input. This gives the JIT a lot of opportunity to optimise and specialise for specific types.

These tools are usually not long-running, so code has to be JITed all over again on each invocation. A lot of code will probably only be executed a few times, so it might only be executed by the interpreter or a low-tier JIT.

There will also be a lot of AST construction, walking and modification, which is really difficult to optimise without the guidance of types. A lower level language can use better data structures that don't require so much pointer chasing. Also lots of potential GC pressure if there are multiple internal representations to construct.
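To make the "AST construction and walking" concrete, here is a minimal sketch of the kind of check a linter performs, written with Python's standard `ast` module. The function name `unused_imports` is made up for illustration, and it is not how Ruff or any existing linter implements the rule; it ignores scoping, `__all__`, and re-exports.

```python
import ast


def unused_imports(source: str) -> list[str]:
    """Return names that are imported but never read (toy check, no scoping)."""
    tree = ast.parse(source)
    imported: dict[str, int] = {}  # bound name -> line number of the import
    used: set[str] = set()

    # ast.walk visits every node in the tree, in no particular order.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the top-level name "a"
                name = alias.asname or alias.name.split(".")[0]
                imported[name] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported[alias.asname or alias.name] = node.lineno
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            # Any read of a bare name counts as a use.
            used.add(node.id)

    return sorted(name for name in imported if name not in used)
```

Even this toy allocates one heap object per syntax node and follows a pointer for each child, which is the pattern the comment above describes.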

JS can be insanely fast for what it is, but a lot of the weaknesses are apparent with short-running tooling. There is also a reason why most complex, JS-heavy web apps seem to get laggy and slow eventually...


As already mentioned in a neighboring comment, these languages give you control over data layout. But what's really powerful is that control over layout opens up the possibility of using SIMD instructions, which by itself may give a 10x speedup over optimized scalar code in the same language; the speedup over JS is, of course, greater.


It was a bear to get set up (oh NixOS...), but I imagine it's easier on other platforms: `pytype` from Google is the God Linter for Python.

It:

- seamlessly supports everything from old-ass Python 2 code up to very recent (like 3.10 or something)

- it can and does exploit `mypy`-style annotations, but needs none to deeply analyze your code and actually find bugs

- it seems to do fairly serious deep PLT magicks. i admittedly haven't dived deep here but it's doing stuff that looks like escape analysis-level whole-program analysis

- it heavily caches and optimizes its (understandably) heavy lifting via Ninja

It demolishes everything else I've tried. I will definitely take `Ruff` for a spin, we've got a lot of Python and I'm always up for discovering there's a better tool, but the dark magic PLT wizards at Google have probably gotten pretty close to what can be done.


pytype dev here - thanks for the kind words :) whole-program analysis on unannotated or partially-annotated code is our particular focus, but there's surprisingly little dark PLT magic involved, and you certainly don't need to be an academic type theory wizard to understand how it works. our developer docs[1] have more info, but at a high level we have an interpreter that virtually executes python bytecode, tracking types where the cpython interpreter would have tracked values.
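A rough sketch of the idea of "tracking types where the interpreter would have tracked values", assuming a toy stack machine. The opcodes, `infer_type`, and `add_result` are all invented for illustration; this is not CPython bytecode and not pytype's implementation.

```python
# Toy stack machine: opcodes are made up for illustration, not CPython's.
def infer_type(program, env):
    """Run `program` abstractly: the stack holds types instead of values."""
    stack = []
    for op, arg in program:
        if op == "LOAD_CONST":
            stack.append(type(arg))      # keep only the constant's type
        elif op == "LOAD_NAME":
            stack.append(env[arg])       # env maps variable names to types
        elif op == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(add_result(left, right))
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()


def add_result(left, right):
    """Result type of `left + right` for a handful of built-in types."""
    if left is right and left in (int, float, str):
        return left
    if {left, right} == {int, float}:
        return float
    raise TypeError(
        f"unsupported operand types for +: {left.__name__} and {right.__name__}"
    )
```

With `x` known to be an `int`, the program `x + 1.5` evaluates to the type `float`, and `"a" + 1` is reported as an error without ever computing a value.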

it's worth exploring some of the other type checkers as well, since they make different tradeoffs - in particular, microsoft's pyright[2] (written in typescript!) can run incrementally within vscode, and tends to add new and experimentally proposed typing PEPs faster than we do.

[1] https://github.com/google/pytype/blob/main/docs/developers/i...

[2] https://github.com/microsoft/pyright


Just a minor note that Pyright is an LSP server, so will work with pretty much any IDE/editor.

Thanks for all your work on pytype!


Thank you for the amazing software!


I see it's not in nixpkgs, what's difficult about setting it up?


IIRC I had to patch the Ninja executable path-resolution to get around differences from how pip installs it. A sibling has asked for a patch, I’ll try to remember to post it here when I extract it from our thing.


if you can provide a write-up, we would be happy to add it to the pytype docs!


Can you share your nix package?


Probably! It would need a little detangling from the proprietary stuff but not much. Please drop me a line at ben@rocinanteresearch.net so it goes on my todo list (which I’m a bit behind on this week, I’m aiming to clear the inbox tasks this weekend).


I've not looked at the implementation details of Ruff, but I am of the opinion that lint tools should be based on an architecture like GitHub's stack graphs[1]. This is the technology that powers GitHub's code navigation.

Specifically: analysis is broken up into two phases:

1. one that only depends on the content of a file, and 2. a second one which deals with inter-file dependencies.

The advantage is that you can very effectively cache the results of (1). This makes that phase of the analysis not performance critical. Phase 2 has to be fast, but you can push as much expense into phase 1 as possible.

This provides scalability - both at the level of GitHub where people may view code from many different commits, but also for the live editing case where you may be editing a small number of files in a much larger project. The effects of your small changes may affect many of your dependencies and you want to know the lint errors immediately.

[1]: https://github.blog/2021-12-09-introducing-stack-graphs/


Interesting. I've been thinking about an architecture like this for a while, in the context of the Salsa crate (a query engine for compilers).

I wonder if the Salsa maintainers have considered this.


Black has been clearly unable to catch up to Prettier for years. We still don't have format selection, and oftentimes formatting doesn't work for various reasons. On the other hand, I love how consistent Prettier is. I hope for a similar experience and will definitely give Ruff a try.


> We still don't have format selection

And you never will. From https://github.com/psf/black:

Black is the uncompromising Python code formatter. By using it, you agree to cede control over minutiae of hand-formatting. In return, Black gives you speed, determinism, and freedom from pycodestyle nagging about formatting. You will save time and mental energy for more important matters.


Unless your back-and-forth flew over my head, GP is referring to being able to format a range of lines in a file (i.e. a selection), they're not talking about selecting a formatting style.


You are right, I probably misunderstood. But now I am curious, what is the use case for formatting just a range of lines and not the entire file?


Maybe formatting just the lines you've modified: on Emacs I'm a big fan of dtrt-indent, which fixes the indentation only on the lines you've modified, without breaking the style of the rest of the file. Great when collaborating on an open-source project. The default "format-on-save" setting would reformat the entire file, creating huge diffs for simple changes.
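As a sketch of how a tool might find "just the lines you've modified", the standard `difflib` module can report which line ranges of the new text differ from the old. The helper `changed_ranges` is hypothetical; it is not how dtrt-indent or any formatter does it.

```python
import difflib


def changed_ranges(old, new):
    """Return 1-based inclusive (start, end) line ranges of `new` that differ from `old`."""
    matcher = difflib.SequenceMatcher(
        a=old.splitlines(), b=new.splitlines(), autojunk=False
    )
    ranges = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" are the opcodes that contribute lines to `new`.
        if tag in ("replace", "insert") and j2 > j1:
            ranges.append((j1 + 1, j2))
    return ranges
```

A range-aware formatter could then be asked to touch only those ranges, keeping the diff limited to the lines the author actually changed.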


It's very handy for when you copy something in the editor and the file is not saved, you probably don't intend to save it, but you want to format the code, because it will be more readable.


>Black has been clearly unable to catch up to Prettier for years.

Does prettier work on python code?


Not yet, it's been on their roadmap for years


How does it matter what language it is coded in? Was there doubt it could be done in Rust, making it a daring choice, vs. e.g. C++ or Haskell? If Rust did not bring something unique to the table, does it deserve title billing?

Tooling and libraries for Python have always been coded in compiled languages, anywhere performance matters. In this Python might differ from many other interpreted languages.


The goal of this project is to demonstrate that Python tooling can be made faster by writing it in Rust instead of Python. C++ and Haskell could be reasonable choices, but Rust has a strong combination of speed and type safety that makes it a good choice for this application.


Is anybody worried about the security of a Python linter?

Did anybody have even a hint of doubt that a compiled linter would be at least tens of times faster than one in Python? (Either Python has got faster in recent years, or it really ought to have been hundreds. Or maybe the others rely on compiled libraries for their own heavy lifting.)



