Powerfully autofixing code with Semgrep's new AST-based approach

by Nat Mote on November 03, 2022

❗ tl;dr: In this blog post, I explain why AST-based autofix is better than text-based autofix and how Semgrep implements AST-based autofix to improve correctness.


Introduction

Semgrep, a portmanteau of semantic grep, has proven itself to be a powerful code searching tool. With rules that look like regular code, you can easily find bugs. Semgrep has had experimental support for autofix since 2020, allowing you to not only find bugs but also to automatically fix them (with fixes that also look like code). Many tools can automatically fix issues with code. One popular example is ESLint, a comprehensive linting tool for JavaScript. However, Semgrep is unique in that it supports 20+ languages and the rules and fixes look like regular code.

We have recently started rewriting the autofix engine to improve correctness. The previous engine uses simple text replacement to synthesize a fix. While this works well in many cases, in others it generates incorrect code. To synthesize valid fixes more consistently, the new engine creates and then prints an Abstract Syntax Tree (AST) for each fix. Just as Semgrep itself achieves more advanced code searching than text-based search tools by searching ASTs instead of text, the new autofix engine achieves better results by manipulating ASTs instead of text. With this change, Semgrep can now be considered not just semantic grep, but also semantic sed.

The problem with text-based autofixes

Text-based autofix handles many kinds of fixes poorly. I'll use this rule as a motivating example:

Pattern:

do_something_risky($...BEFORE, secure=False, $...AFTER)

Fix:

do_something_risky($...BEFORE, secure=True, $...AFTER)

This pattern tells Semgrep to search for any call to the do_something_risky function that includes secure=False as a keyword argument anywhere in the argument list. Then, it instructs Semgrep to replace secure=False within that function call with secure=True.

The previous text-based autofix takes the rule's fix text and replaces each metavariable in it with the corresponding text from the matched target file. In this example, $...BEFORE matches zero arguments, so it is replaced with an empty string. The result is do_something_risky(, secure=True, 5), which has a leading comma at the start of the argument list. This is invalid Python code.
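The failure described above can be reproduced with a few lines of Python. This is a minimal sketch of naive text substitution, not Semgrep's actual implementation; the function name text_autofix and the regular expression used to spot metavariables are assumptions made for illustration.

```python
import re

def text_autofix(fix_template: str, bindings: dict) -> str:
    """Naive text-based fix: paste each metavariable's matched text into the template.

    Hypothetical helper for illustration only; Semgrep's real engine is written in OCaml.
    """
    def substitute(match: re.Match) -> str:
        return bindings[match.group(0)]

    # Matches ellipsis metavariables ($...NAME) as well as ordinary ones ($NAME).
    return re.sub(r"\$(?:\.\.\.)?[A-Z_][A-Z_0-9]*", substitute, fix_template)

fix = "do_something_risky($...BEFORE, secure=True, $...AFTER)"
# $...BEFORE matched zero arguments, so its text is the empty string.
bindings = {"$...BEFORE": "", "$...AFTER": "5"}

result = text_autofix(fix, bindings)
print(result)  # do_something_risky(, secure=True, 5)
```

The comma after $...BEFORE belongs to the fix template, not to the matched argument, so pure text substitution has no way of knowing it should be dropped.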

AST-based autofix to the rescue

Instead of manipulating text directly in order to synthesize a fix, Semgrep can now manipulate the AST.

Step one: parse the fix into an AST

The first step is to parse the provided fix into an AST. Continuing the example above, Semgrep parses do_something_risky($...BEFORE, secure=True, $...AFTER) into the following AST (Note: These trees here have been simplified for brevity):

[Figure: the fix parsed into an AST]

Step two: replace metavariables in the fix AST

The second step is to replace metavariables within the AST in order to produce an AST representation of the final fix. Here, $...BEFORE is bound to [] (an empty list) and $...AFTER is bound to [5] (a list containing only the element 5). So, Semgrep changes the tree above into this:

[Figure: AST of the final fix]
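To make step two concrete, here is a rough Python sketch of substituting ellipsis metavariables inside a call's argument list. The node classes (Call, Name, KeywordArg, Literal, MetavarEllipsis) and the substitute function are hypothetical stand-ins, far simpler than Semgrep's real AST, and the sketch only handles metavariables that appear directly as arguments.

```python
from dataclasses import dataclass

@dataclass
class Name:
    id: str

@dataclass
class Literal:
    text: str

@dataclass
class KeywordArg:
    name: str
    value: object

@dataclass
class MetavarEllipsis:
    name: str  # for example "$...BEFORE"

@dataclass
class Call:
    func: Name
    args: list

def substitute(call: Call, bindings: dict) -> Call:
    """Return a new Call with every ellipsis metavariable replaced by its bound nodes."""
    new_args = []
    for arg in call.args:
        if isinstance(arg, MetavarEllipsis):
            # Splice in zero or more nodes; an empty binding adds nothing,
            # so no stray separator can appear.
            new_args.extend(bindings[arg.name])
        else:
            new_args.append(arg)
    return Call(call.func, new_args)

# Simplified AST of the fix: do_something_risky($...BEFORE, secure=True, $...AFTER)
fix_ast = Call(
    Name("do_something_risky"),
    [
        MetavarEllipsis("$...BEFORE"),
        KeywordArg("secure", Literal("True")),
        MetavarEllipsis("$...AFTER"),
    ],
)
bindings = {"$...BEFORE": [], "$...AFTER": [Literal("5")]}

final_ast = substitute(fix_ast, bindings)
print(final_ast.args)
```

Because commas are not part of the tree, the empty binding for $...BEFORE simply contributes no arguments, and the final argument list has exactly two elements.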

Step three: print the AST to text

The last step is to print the AST to text. Doing this correctly even for a single language is a large undertaking. For example, Flow's printer for JavaScript, which like Semgrep is implemented in OCaml, is over 4,000 lines long. For the 20+ languages that Semgrep supports, that effort would be multiplied. We wanted to make AST-based autofix a reality without having to write and maintain a complete printer for each of those languages, and we also wanted to keep the original formatting and comments of target code where possible. To that end, we decided to recycle the original text of AST nodes that have been taken unchanged from either the target code or the rule's fix. This allows Semgrep to print many ASTs even if it does not know how to print every node within them.

In this case, we can reuse do_something_risky and secure=True from the fix, and 5 from the target source. The parentheses and the comma are the only parts that are printed from scratch.

Target:

do_something_risky(secure=False, 5)

Fix:

do_something_risky($...BEFORE, secure=True, $...AFTER)

Result:

do_something_risky(secure=True, 5)

To implement this approach, we have a printer that descends the tree, checking at each point whether the AST node it is about to print has been reused, unchanged, from either the target source file or the fix in the rule. If so, rather than attempting to print that node, it uses the original text from which that node was created.

In this case, the printer first looks at the Call node. That node is different from anything in the source or fix, so Semgrep has to print it. The first step in printing the Call node is to print the function name, do_something_risky. However, the name has been pulled unchanged from the fix, so Semgrep reuses the original text. Next, Semgrep prints the open parenthesis. The argument list is different from any original node, but the arguments themselves are both unchanged. Semgrep reuses the text from each argument and inserts a comma between them. Finally, Semgrep adds the close parenthesis.
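The walk described above can be sketched in Python as follows. This is an illustrative model only: the Node class, its original_text field, and the UnsupportedNode exception are assumptions invented for the example, and the sketch knows how to print just one construct (a call) from scratch.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    kind: str                              # e.g. "Call", "Name", "KeywordArg", "Literal"
    children: list = field(default_factory=list)
    original_text: Optional[str] = None    # set only when the node was reused unchanged

class UnsupportedNode(Exception):
    """Raised when the sketch printer meets a node it cannot print from scratch."""

def print_node(node: Node) -> str:
    # Nodes reused unchanged from the target or the fix are emitted verbatim.
    if node.original_text is not None:
        return node.original_text
    # Otherwise the node is new and must be printed from scratch.
    if node.kind == "Call":
        func, *args = node.children
        printed_args = ", ".join(print_node(arg) for arg in args)
        return f"{print_node(func)}({printed_args})"
    raise UnsupportedNode(node.kind)

# The final fix: the Call node is new, but all of its children carry original text.
final_fix = Node("Call", [
    Node("Name", original_text="do_something_risky"),   # from the fix
    Node("KeywordArg", original_text="secure=True"),    # from the fix
    Node("Literal", original_text="5"),                 # from the target
])

printed = print_node(final_fix)
print(printed)  # do_something_risky(secure=True, 5)
```

Only the parentheses and the comma are generated here; everything else is copied from text that already existed.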

Although this process of reusing the original text works quite well, and we have also implemented printing for some common constructs, the printer may run into a node that it does not know how to print. In this case, Semgrep simply aborts and falls back on the previous text-based autofix engine. This allows us to smoothly migrate onto AST-based autofix by incrementally adding more printing capabilities.
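The fallback strategy might be structured roughly like this. The names synthesize_fix and UnsupportedNode, and the three stand-in printer functions, are hypothetical; the sketch only shows the control flow of trying the AST-based path first and reverting to text substitution when printing fails.

```python
from typing import Callable, Tuple

class UnsupportedNode(Exception):
    """Hypothetical error raised when the AST printer cannot print a node."""

def synthesize_fix(
    ast_based: Callable[[], str],
    text_based: Callable[[], str],
) -> Tuple[str, str]:
    """Try the AST-based engine first; fall back to the text-based one on failure.

    Returns the fix text together with the name of the engine that produced it.
    """
    try:
        return ast_based(), "ast"
    except UnsupportedNode:
        return text_based(), "text"

# Stand-ins for the two engines, hard-coded for the running example.
def ast_printer_ok() -> str:
    return "do_something_risky(secure=True, 5)"

def ast_printer_stuck() -> str:
    raise UnsupportedNode("Lambda")

def text_printer() -> str:
    return "do_something_risky(, secure=True, 5)"

print(synthesize_fix(ast_printer_ok, text_printer))
print(synthesize_fix(ast_printer_stuck, text_printer))
```

With this shape, each new printing capability moves more fixes onto the AST-based path without changing behavior for the rest.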

The results

Right now, AST-based autofix is available for autofixes targeting expressions in Python and JavaScript/TypeScript. For Python, Semgrep can correctly synthesize autofixes in 96.4% of the test cases for fixes in semgrep-rules. For JavaScript, this figure is 100.0%.

We will continue improving AST-based autofix by implementing printing in more cases and for more languages, fixing known issues, and building on this infrastructure to make autofix even smarter! Stay tuned!

To try out autofix:

  • Using Semgrep App, add a GitHub or GitLab project and have Semgrep scan your codebase and suggest fixes every time a PR or MR is created!
  • On the command line, upgrade to Semgrep v0.120.0 or higher (typically with brew upgrade semgrep or pip install --upgrade semgrep) and fix your code with semgrep --config=auto --autofix. Be careful! The --autofix flag modifies your files in place. Make sure you have everything committed to a version control system first.
  • To run an autofix over a codebase without explicitly writing a rule, run semgrep --lang python --pattern 'foo($X)' --replacement 'bar($X)' --autofix

Join the r2c Community Slack to say “hi” or ask questions — there’s a friendly and active community ready to help!