Show HN: Python Source Code Refactoring Toolkit via AST (github.com/isidentical)
110 points by treesciencebot on Aug 1, 2021 | 33 comments



I've been thinking about a similar problem I run into on a regular basis: I audit a lot of code in many languages, and sometimes I'm not even able to build the projects. My toolset is pretty much git and rudimentary regexes. A recent example is a PHP function that is insecure when it's used at least three times in a string concatenation.

Any suggestions for an AST 'framework' that would help me parse code more easily? Language-specific or generic, even if it only sort of fits (I don't even know exactly what I want).

Automated code analysers exist, but I want something manual.


Maybe something like this:

https://semgrep.dev/

I found it last week. It may be able to detect potential issues in multiple languages, but I'm not sure if it supports refactoring.


I wrote a library that turns a language specification (BNF, ABNF, etc.) into a custom AST in Python. You can then define a set of visitors for the custom AST to transform the tree into whatever you want. I implemented some checks integrated into the Python type system to type-check the visitors, and I've used the library for some non-trivial manipulations before.

It's mostly a pet project and I just wanted to share since it could maybe at least inspire something.

https://github.com/chrisphilip322/prosodia


This is not a problem I've had, but it sounds like something tree-sitter might be able to help with.

I've been using tree-sitter for syntax highlighting, simple refactors, and custom text objects in NeoVim recently (and love it, thanks NeoVim devs!). Have a play with the syntax tree it generates: https://tree-sitter.github.io/tree-sitter/playground (it can generate a tree for most languages, not just the ones in this playground).


I wonder if something like this exists for other languages? Most rewriting/refactoring tools make too many assumptions about the surrounding project. Something simple that supports pattern matching against a CST subtree plus invertible parsing/unparsing would be the ideal polyglot macro system. Suppose I want to find and rename all variables with a certain type, or rewrite an expression of a certain shape, in multiple languages at once (e.g. Java/JavaScript/C/C++). Something between a regular expression (too brittle) and a full parser would be a great solution for this use case.


I actually have a prototype of something like that, built on top of Lark. It's still far from complete or user-friendly, but I'm already successfully transpiling a subset of Python to JavaScript.

It still needs a lot of handholding, but at the same time I get to use gems like this:

    ... = TemplateTranslator({
        'isinstance($a, str)': "typeof($a) == 'string'",
        'isinstance($a, bool)': "typeof($a) == 'boolean'",
        'len($a)': '$a.length',
        'str($a)': '$a.toString()',
        'getattr($a, $b)': '$a and $a[$b]',
        'getattr($a, $b, $c)': '($a and $a[$b]) or $c',
        'hasattr($a, $b)': '$b in $a',
        '$a.join($b)': '$b.join($a)',
        ... })


Nice! I like the syntax, this reminds me of the structural search and replace feature from IntelliJ IDEA. Is this the project you were referring to?

https://github.com/erezsh/py2js


A similar type of project, though I haven't seen much activity recently:

https://github.com/facebookincubator/Bowler


I haven't used Bowler, but from what I can see it uses lib2to3 (through fissix), which can't parse newer Python code (parenthesized context managers, the new pattern matching statement, etc.) because it is an LL(1) parser. The regular ast module, on the other hand, is always compatible with the current grammar, since it is what CPython itself consumes. It is also handier to work with, since the majority of linters already use it, so it would be very easy to port rules.
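
To illustrate (a minimal sketch of mine, assuming Python 3.10+ for the match statement), the stdlib ast module keeps up with the grammar because it uses CPython's own parser:

    import ast

    # lib2to3/fissix cannot parse this, but the stdlib ast module can,
    # since it always tracks CPython's current grammar.
    source = "match point:\n    case (0, 0): print('origin')"
    tree = ast.parse(source)
    print(ast.dump(tree, indent=2))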


(Author of Bowler/maintainer of fissix)

The main concern with using the ast module from the stdlib is that you lose all formatting/whitespace/comments in your syntax tree, so writing any changes back to the original sources requires a lot of extra work to preserve the original formatting, put back comments, etc. This is the entire point of lib2to3/fissix and LibCST: allowing large-scale CST manipulation while preserving all of those comments and formatting. We do recognize the limitations of lib2to3/fissix, though, so there have been some back-burner plans to move Bowler onto LibCST, as well as to build a PEG-based parser for LibCST specifically to enable support for 3.10 and future syntax/grammar changes. But of course, it's very difficult to give any ETA or target for release.
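
To make the trade-off concrete (a small stdlib-only sketch, assuming Python 3.9+ for ast.unparse), a plain ast round trip simply drops comments and formatting:

    import ast

    source = "x = {  'a': 1,  # keep me\n       'b': 2 }"
    print(ast.unparse(ast.parse(source)))
    # prints: x = {'a': 1, 'b': 2}
    # the comment and the original spacing are gone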


Indeed! I would also suggest people use a CST implementation (parso / LibCST) instead of refactor if they intend to do large-scale refactors, but from what I've seen in my previous attempts (e.g. teyit, a unittest assertion formatter), when you deal with small code fragments (a single expression or a small statement) you generally don't need to worry much about style. The only concern is literals (especially strings, where a few different source spellings map to the same AST), which you can resurrect from the token stream (something the CustomUnparser in refactor allows).
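
For example (plain stdlib here, not the refactor API), different spellings of the same string all collapse into one Constant node, so a naive unparse can't recover the original quoting:

    import ast

    # single quotes, double quotes and implicit concatenation all parse to
    # the same Constant node, so unparse() has to pick one canonical spelling
    for src in ['"hello"', "'hello'", "'hel' 'lo'"]:
        tree = ast.parse(src, mode="eval")
        print(src, "->", ast.unparse(tree.body))
    # all three unparse to: 'hello'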

The real starting point for this project was finding/replacing all type(<literal>)'s in the CPython codebase with type(type(<literal>)) (e.g. type('') would become type(str)), which is a very lightweight transformation, and I was able to write a script that did it on over 2000 files without any major style problems. Here it is for reference: https://github.com/isidentical/refactor/blob/master/examples...

One thing to note here: in the last couple of years, thanks to black (and yapf), the adoption of code formatters has really increased, which is very nice for custom refactoring tools like refactor, since the resulting code gets reformatted anyway. That means if you convert a multi-line call or a list to a single-line version, the formatter you use will probably reformat that segment anyway.

But thanks for authoring Bowler! It is a very cool project.


Did you consider PyCQA/RedBaron (which is based upon PyCQA/baron, an AST implementation which preserves comments and whitespace)? https://redbaron.readthedocs.io/en/latest/


It was considered, but the initial goals of Bowler were to a) build on top of lib2to3 in an attempt to get broader upstream support for maintaining it as an "official" CST module, which fizzled out, and b) use some of the more complex matching semantics that lib2to3 enables, which baron and other CST alternatives don't really attempt to cover.

Things like "find any function that uses kwargs" or "find any class that defines a method named `bar`" can be easily expressed with lib2to3's matching grammar, and no other CST that I'm aware of (that isn't itself based on lib2to3) has equivalent functionality. This is something we wanted to add to LibCST, but haven't had the time to focus on given other priorities. Meanwhile, we used LibCST to write a safer alternative to isort: https://usort.readthedocs.io


Rog. I think CodeQL (GitHub acquired Semmle and QL in 2019) supports those types of queries; probably atop lib2to3 as well. https://codeql.github.com/docs/writing-codeql-queries/introd...

From https://news.ycombinator.com/item?id=24511280 :

> Additional lists of static analysis, dynamic analysis, SAST, DAST, and other source code analysis tools […]


Would this be a good tool to attempt simple transpiling of TypeScript code to Python?

For my use cases, not much changes between the two: 1. Replace opening and closing braces with colons. 2. Remove semicolons. 3. Replace function with def. 4. Remove type definitions. 5. Replace array manipulation functions with Python equivalents. 6. Remove tokens like const, let and var.


It would probably be better to just write a TypeScript parser; then you can transpile the TypeScript AST directly to a Python AST and use ast.unparse(), without any additional libraries.
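
A rough sketch of what that last step could look like (hand-built nodes standing in for whatever a TypeScript front end would emit; Python 3.9+ for ast.unparse):

    import ast

    # Python AST for `print(a + b)`, constructed directly instead of parsed
    call = ast.Call(
        func=ast.Name(id="print", ctx=ast.Load()),
        args=[ast.BinOp(left=ast.Name(id="a", ctx=ast.Load()),
                        op=ast.Add(),
                        right=ast.Name(id="b", ctx=ast.Load()))],
        keywords=[])
    module = ast.Module(body=[ast.Expr(value=call)], type_ignores=[])
    print(ast.unparse(ast.fix_missing_locations(module)))
    # prints: print(a + b)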


Nice, I've always thought Python could do better by adopting source transformation tools like babel. I implemented something similar using lib2to3, which preserves whitespace and comments more accurately (than the ast module): https://github.com/banga/prefactor


Yes! I love doing transformations with lib2to3, but unfortunately it is now deprecated. As I've stated in the README, for complex refactorings I would also personally prefer working with a CST (probably through parso rather than plain lib2to3), but for refactoring small fragments I wasn't able to observe major differences in formatting in big codebases. Thanks for sharing this, btw!


Ah I didn’t know it was deprecated, but good to see new work happening in this area!


Would this be a good tool to attempt simple transpiling of TypeScript code to Python?

For my use cases, not much changes between the two:

1. Replace opening and closing braces with colons.

2. Remove semicolons

3. Replace function with def

4. Remove type definitions

5. Replace array manipulation functions with Python equivalents.

6. Remove tokens like const, let and var


It's a good start, but there are many other differences. a[i] can mean attribute access (getattr) in JavaScript (though usually not), and in Python it can throw a KeyError. Similarly, a.b may return undefined in JavaScript if it doesn't exist, while in Python it will throw an AttributeError. You might need to add uses of global or nonlocal. And there are a lot of other small details.
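
For instance (a toy Python-side illustration of mine, not from any transpiler), mimicking JavaScript's silent undefined means the generated code needs explicit fallbacks:

    class Obj:
        a = 1

    obj, d = Obj(), {"a": 1}
    print(getattr(obj, "b", None))  # None, where a bare obj.b raises AttributeError
    print(d.get("b"))               # None, where d["b"] raises KeyError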

However, I think it will be much easier than translating the other way around :)


The one thing I hope for when it comes to Python is that PyPy becomes the default runtime everywhere.

CPython is so unbelievably slow compared to languages like PHP and JavaScript.

When using PyPy, I keep hitting road bumps because many tools still expect you to be using CPython.

Dear world - please accelerate the conversion to PyPy!


I think it's more likely that people start using mypyc for performance-critical code. https://mypyc.readthedocs.io/en/latest/
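
In case it helps, mypyc consumes ordinary type-annotated modules; a toy example (the file name and numbers here are mine, not from their docs):

    # fib.py -- plain annotated Python; `mypyc fib.py` compiles it to a C
    # extension, and the annotations let the generated code use native ints
    # and early-bound calls instead of generic object operations.
    def fib(n: int) -> int:
        if n <= 1:
            return n
        return fib(n - 2) + fib(n - 1)

    if __name__ == "__main__":
        print(fib(32))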


I assume the performance profile is similar to compiling regular Python code with Cython (which it can do). There is a decent but not world-changing speed-up. This makes sense, because pretty much the only thing you are removing from the equation is the instruction dispatch and stack overhead of the interpreter; you still have to perform all the equivalent work that CPython does, otherwise you would end up with different semantics. And that work is substantial.

In contrast, a tracing JIT can dynamically elide most of this work without changing the semantics.


After switching to Cython, you can add Cython syntax, like type definitions, to gain big speed improvements.
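
Roughly like this (a hand-written sketch of the incremental path, not benchmarked):

    # step 1: plain Python, which Cython compiles unchanged
    def total(n):
        s = 0
        for i in range(n):
            s += i * i
        return s

    # step 2: the same function in a .pyx file with Cython type definitions,
    # so the loop runs on C integers instead of Python objects
    def total_typed(int n):
        cdef long long s = 0
        cdef int i
        for i in range(n):
            s += i * i
        return s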


Wow, this looks exciting. Looks like Cython without any extra work/syntax, for those using mypy.

I wish their readme gave some kind of estimate of how the speedups might compare to Cython and PyPy. But it's currently alpha, so there may be big differences by the time it's ready for production use.

Similar approach to the Sorbet Compiler for Ruby (albeit targeting C instead of llvm... I wonder how that might impact optimizability).


Author of this project here (and also a CPython core developer, FWIW). For many aspects (including the AST), PyPy follows CPython closely, even on implementation details, so `refactor` would probably work out of the box once PyPy releases 3.9 support (the latest stable version for PyPy is 3.7, and 3.8 is on the way!).


Projects that depend on CPython C extensions should consider migrating to HPy[1] for C extension compatibility across Python implementations.

[1] https://github.com/hpyproject/hpy


This looks interesting, but it also looks ambitious and nowhere near ready. Everyone would be well served by extensions not being aware of Python internals and instead exposing a clean handle-based API to all users. For this, cffi is sufficient and already available across nearly all Python implementations.

https://cffi.readthedocs.io/en/latest/
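
The canonical cffi ABI-mode example (quoted from memory; dlopen(None) works on Linux/macOS but not Windows) shows how little ceremony a handle-based API needs:

    from cffi import FFI

    ffi = FFI()
    ffi.cdef("int printf(const char *format, ...);")   # declare the C signature
    C = ffi.dlopen(None)                                # load the C runtime namespace
    C.printf(b"hello, %s!\n", ffi.new("char[]", b"world"))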

I do agree that a standardized Python should have an object model that is a clean public contract, but given that Python is CPython and there is no standardization, the fact that the library runtime is married to the core also makes this incredibly difficult politically.


Is performance even an issue with this tool? I regularly see people complain about Python speed, but I normally find it's algorithmic rather than interpreter-imposed.


You should check out parso which is a fully roundtrip-able AST. That way your tool won't throw away comments and formatting.


Thanks for the suggestion, though just to clarify: I am also one of the maintainers of parso :-) (and of the ast module in the standard library).


Extra clarification: I'm well aware of the concerns about tokens vs CST vs AST; that is why I decided to write a library that does refactors on small code fragments rather than just unparse()ing everything. A CST is definitely more sound in terms of preserving these details, but dealing with it is much more complicated than working with the regular AST. Refactor's README also mentions other libraries for complex refactors: parso, LibCST, Fixit.



