❗ tl;dr: In this blog post, I explain why AST-based autofix is better than text-based autofix and how Semgrep implements AST-based autofix to improve correctness
We have recently started rewriting the autofix engine to improve correctness. The previous engine uses a simple text replacement to synthesize a fix. While this works well in many cases, there are also cases where it generates incorrect code. To more consistently synthesize valid fixes, the new engine creates and then prints an Abstract Syntax Tree (AST) for each fix. Just like Semgrep itself achieves more advanced code searching than text-based search tools by searching ASTs instead of directly searching text, the new autofix engine achieves better results by manipulating ASTs instead of text. With this change, Semgrep can now be considered not just semantic grep, but also semantic sed.
There are many kinds of fixes that text-based autofix handles poorly, but I’ll use this one as a motivating example: (Note: Click “Open in Editor” or “Open in Playground” on the top right to try out autofix.)
This pattern tells Semgrep to search for any call to the do_something_risky function that includes secure=False as a keyword argument anywhere in the argument list. Then, it instructs Semgrep to replace secure=False within that function call with secure=True.
The previous text-based autofix takes the rule’s fix text and replaces the metavariables within it with the text in the matched target file. In this example, $...BEFORE matches zero arguments, so it is replaced with an empty string. The result is do_something_risky(, secure=True, 5), which has a leading comma ahead of the argument list. This is invalid Python code.
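The failure described above can be reproduced with a few lines of Python. This is only a minimal sketch of naive text substitution, not Semgrep's actual code; the function name text_autofix and the bindings dictionary are hypothetical, chosen for illustration.

```python
def text_autofix(fix_template: str, bindings: dict) -> str:
    """Naive text-based fix: replace each metavariable with its matched text."""
    result = fix_template
    for metavariable, matched_text in bindings.items():
        result = result.replace(metavariable, matched_text)
    return result


# The fix text from the rule, and the text each metavariable matched.
fix = "do_something_risky($...BEFORE, secure=True, $...AFTER)"
bindings = {
    "$...BEFORE": "",  # matched zero arguments, so it becomes an empty string
    "$...AFTER": "5",
}

fixed = text_autofix(fix, bindings)
print(fixed)  # do_something_risky(, secure=True, 5)
```

The comma after $...BEFORE is part of the fix text, so it survives even when there is nothing before it to separate.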
Instead of manipulating text directly in order to synthesize a fix, Semgrep can now manipulate the AST.
The first step is to parse the provided fix into an AST. Continuing the example above, Semgrep parses do_something_risky($...BEFORE, secure=True, $...AFTER) into the following AST (note: the trees shown here have been simplified for brevity):
The second step is to replace metavariables within the AST in order to produce an AST representation of the final fix. Here, $...BEFORE is bound to [] (an empty list) and $...AFTER is bound to [5] (a list containing only the element 5). So, Semgrep splices those lists into the argument list, changing the tree above into one whose call node has exactly two arguments: secure=True followed by 5.
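As a rough illustration of this substitution step, here is a sketch using a toy AST built from Python dataclasses. The node classes (Name, Literal, Keyword, EllipsisMetavar, Call) and the substitute function are assumptions made for this example; Semgrep's real AST is far richer and is not written in Python.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Name:
    id: str


@dataclass
class Literal:
    value: str


@dataclass
class Keyword:
    name: str
    value: "Node"


@dataclass
class EllipsisMetavar:
    name: str  # e.g. "$...BEFORE"


@dataclass
class Call:
    func: Name
    args: List["Node"]


Node = Union[Name, Literal, Keyword, EllipsisMetavar, Call]


def substitute(call: Call, bindings: Dict[str, List[Node]]) -> Call:
    """Splice the nodes bound to each ellipsis metavariable into the argument list."""
    new_args: List[Node] = []
    for arg in call.args:
        if isinstance(arg, EllipsisMetavar):
            new_args.extend(bindings[arg.name])  # zero or more nodes
        else:
            new_args.append(arg)
    return Call(call.func, new_args)


# Toy AST for: do_something_risky($...BEFORE, secure=True, $...AFTER)
fix_ast = Call(
    Name("do_something_risky"),
    [
        EllipsisMetavar("$...BEFORE"),
        Keyword("secure", Literal("True")),
        EllipsisMetavar("$...AFTER"),
    ],
)
bindings = {"$...BEFORE": [], "$...AFTER": [Literal("5")]}

result = substitute(fix_ast, bindings)
print(result.args)
```

Because an empty binding contributes no nodes at all, there is no stray separator to clean up afterwards: commas are not part of the tree.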
In this case, we can reuse secure=True from the fix, and 5 from the target source. The parentheses and the comma are the only parts that are printed from scratch.
To implement this approach, we have a printer that descends the tree, checking at each point whether it is about to print an AST node that has been reused, unchanged, from either the target source file or the fix in the rule. If so, rather than printing that node from scratch, it uses the original text from which that node was created.
In this case, the printer first looks at the Call node. That node is different from anything in the source or fix, so Semgrep has to print it. The first step in printing the Call node is to print the function name, do_something_risky. However, the name has been pulled unchanged from the fix, so Semgrep reuses the original text. Next, Semgrep prints the open parenthesis. The argument list is different from any original node, but the arguments themselves are both unchanged. Semgrep reuses the text from each argument and inserts a comma between them. Finally, Semgrep adds the close parenthesis.
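The walk-through above can be sketched as a small recursive printer. This is a simplified illustration under stated assumptions: the Node class, its original_text field, and print_node are hypothetical names, and the argument list is flattened into the Call node's children rather than being a node of its own.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str
    children: List["Node"] = field(default_factory=list)
    # Set only when the node was reused unchanged from the fix or the target.
    original_text: Optional[str] = None


def print_node(node: Node) -> str:
    # Reused nodes are never re-printed: their original text is emitted as-is.
    if node.original_text is not None:
        return node.original_text
    # New nodes must be printed from scratch; this sketch only knows calls.
    if node.kind == "Call":
        func, *args = node.children
        printed_args = ", ".join(print_node(arg) for arg in args)
        return print_node(func) + "(" + printed_args + ")"
    raise NotImplementedError("cannot print node kind: " + node.kind)


fixed_call = Node(
    "Call",
    [
        Node("Name", original_text="do_something_risky"),  # reused from the fix
        Node("Keyword", original_text="secure=True"),      # reused from the fix
        Node("Literal", original_text="5"),                # reused from the target
    ],
)

print(print_node(fixed_call))  # do_something_risky(secure=True, 5)
```

Only the parentheses and the comma are generated by the printer here; everything else is copied from text that already existed.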
Although this process of reusing the original text works quite well, and we have also implemented printing for some common constructs, the printer may run into a node that it does not know how to print. In this case, Semgrep simply aborts and falls back on the previous text-based autofix engine. This allows us to smoothly migrate onto AST-based autofix by incrementally adding more printing capabilities.
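The fallback behavior might look roughly like the following sketch. The names UnsupportedNode, autofix, ast_fix, and text_fix are all hypothetical; the only point illustrated is that an unprintable node causes the whole AST-based attempt to be abandoned in favor of the text-based result.

```python
from typing import Callable


class UnsupportedNode(Exception):
    """Raised by the (hypothetical) AST printer for a node it cannot print."""


def autofix(ast_fix: Callable[[], str], text_fix: Callable[[], str]) -> str:
    """Try the AST-based engine first; fall back to text-based on failure."""
    try:
        return ast_fix()
    except UnsupportedNode:
        return text_fix()


def failing_ast_fix() -> str:
    # Stand-in for a printer that hit a construct it does not support yet.
    raise UnsupportedNode("printing not implemented for this node")


def working_ast_fix() -> str:
    return "do_something_risky(secure=True, 5)"


def text_fix() -> str:
    return "do_something_risky(, secure=True, 5)"


print(autofix(working_ast_fix, text_fix))  # AST-based result
print(autofix(failing_ast_fix, text_fix))  # text-based fallback
```

With this shape, each newly supported construct simply stops raising, so coverage can grow without changing the callers.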
We will continue improving AST-based autofix by implementing printing in more cases and for more languages, fixing known issues, and building on this infrastructure to make autofix even smarter! Stay tuned!
To try out autofix:
- Using Semgrep App, add a GitHub or GitLab project and have Semgrep scan your codebase and suggest fixes every time a PR or MR is created!
- On the command line, upgrade to Semgrep v0.120.0 or higher (often using brew upgrade semgrep or pip install --upgrade semgrep) and fix your code with semgrep --config=auto --autofix. Be careful! The --autofix flag will modify your files in place. Make sure you have everything committed to a version control system first.
- To run an autofix over a codebase without explicitly writing a rule, run semgrep --lang python --pattern 'foo($X)' --replacement 'bar($X)' --autofix
Join the r2c Community Slack to say “hi” or ask questions — there’s a friendly and active community ready to help!