Semgrep: a static analysis journey

by Yoann Padioleau and Emma Jin on November 09, 2021

Introduction

Semgrep (semantic grep) is a fast and lightweight static analysis tool to find bugs and enforce code standards. As its name suggests, it understands code at the semantic level. However, it is like grep in that it is easy to learn, can run on incomplete code, and can be run locally on the command line.

Today, Semgrep is an open-source tool supporting over seventeen languages, but it originated as a code transformation tool for C called Spatch (a.k.a. Coccinelle) used by academics, before being repurposed at Facebook to find PHP bugs as Sgrep. This history of practical, necessity-driven use has informed Semgrep’s lightweight design. Now maintained by r2c, Semgrep is becoming the tool of choice for modern security teams looking to eliminate classes of vulnerabilities.

The academic years: Spatch

Like many innovations in programming languages, the story of Semgrep starts in academia. Readers may be familiar with Coccinelle (see this LWN article on Coccinelle in 2009 and a more recent article), a code transformation tool for Linux. Coccinelle is the engine for Semgrep's precursor, Spatch, which was designed in 2006 by Julia Lawall's and Gilles Muller's research group. Their group focused on operating systems development, and in the process created many Domain Specific Languages (DSLs) to help in their work, such as Bossa, a DSL for writing custom schedulers.

When I joined the group, just after completing my PhD at Inria, they had become aware of an important problem in Linux development: the evolution of core APIs. Linux embraces evolution; developers aren’t scared to break internal backwards compatibility for long-term benefit. However, because the Linux kernel contains thousands of device drivers, any modification of a core API entails modifying thousands of files. These modifications are usually difficult to automate, time consuming, and error prone. We termed these modifications “collateral evolutions” and published a paper at Eurosys in 2006 describing the problem.

Consider the following API evolution in the Linux kernel. To allocate memory, the kernel provides the kmalloc() function. A recurring pattern is to allocate memory and later set its contents to zero with a subsequent call to memset(). Here is an example:

a = kmalloc(100);
memset(a, 0, 100);

Because this pattern is very common, a new API was designed to combine the two operations in one function call: kzalloc().

A patch to use kzalloc would look like this:

- a = kmalloc(100);
- memset(a, 0, 100);
+ a = kzalloc(100);

Though this patch is fairly rote, it is difficult to automate using traditional regex-based tools like grep or sed. The call to kmalloc() can be spread over multiple lines or contain comments, and the call to memset() is not necessarily on the next line. More complex program transformations pose even more problems. As it happens, it took years to perform the kzalloc() collateral evolutions, and new kernel developers who were not familiar with the latest APIs were constantly submitting code using the old kmalloc/memset pattern.
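To make the difficulty concrete, here is a toy illustration in Python (the regex and code snippets are invented for this example, not taken from the kernel): a grep-style, line-by-line search finds the allocation when it sits on one line, but misses the very same call once it is wrapped across lines with a comment:

```python
import re

# A grep-style, single-line pattern for "<var> = kmalloc(...)".
pattern = re.compile(r"\w+\s*=\s*kmalloc\(")

simple = "a = kmalloc(100);"
# The same call, legally spread over two lines with a comment.
wrapped = "a = /* zeroed below */\n    kmalloc(100);"

# Searching line by line, as grep does, finds the first form...
assert any(pattern.search(line) for line in simple.splitlines())
# ...but misses the second, because no single line contains the whole pattern.
assert not any(pattern.search(line) for line in wrapped.splitlines())
```

A tool that matches against the parsed program rather than raw lines sidesteps this entire class of problem.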

Because of Julia and Gilles's backgrounds in DSL design, we thought we could design a language to automate and document those program transformations. One of our first approaches was influenced by a paper by Oege de Moor describing how rewrite systems and temporal logic formalisms could be combined to express complex program transformations. We started to use the Stratego rewrite system to express transformations, but quickly abandoned the approach as the transformations were difficult to write and impossible for most kernel developers to understand (later on, Oege de Moor would start the company Semmle and design CodeQL, a query-based semantic analysis language).

After cataloging many collateral evolutions and studying hundreds of patches like the previous one, it struck me that we didn’t need to invent a brand-new syntax. We could hijack the patch syntax kernel developers already knew to express program transformations.

Here is the kzalloc program transformation for any amount of memory, expressed in what we called a “semantic patch”:

- $A = kmalloc($S);
+ $A = kzalloc($S);
...
- memset($A, 0, $S);

This semantic patch looks much like the patch we wrote earlier, except we used $A and $S to represent arbitrary variables (metavariables), and an ellipsis to indicate that other lines could be between the kmalloc and memset. (The Spatch metavariable syntax is actually a little more complicated than this; for simplicity this example uses a syntax closer to Semgrep’s.)

Spatch will apply this semantic patch to all C files in a given directory and fix every instance of the antipattern. The command to do so is as simple as:

$ spatch -f kzalloc.spatch linux/drivers/

Using Spatch, we sent patches to the Linux Kernel Mailing List (LKML) and received generally good feedback from kernel developers (Coccinelle has even become part of the kernel). We published our work on automating and documenting collateral evolutions with Spatch at the Eurosys conference in 2008, as well as at the Ottawa Linux Symposium in 2007, and released the source code of Spatch under the Coccinelle name.

Spatch’s success was due to its familiar syntax and powerful domain-specific features. Developers were already writing patches for C code and were fluent in these syntaxes. The only other thing they needed was a way to express their modifications more generally, so we allowed them to use metavariables to abstract away specific variable names and ellipses to ignore irrelevant sequences of statements, expressions, fields, etc. These features proved intuitive and effective, and are present today in Semgrep.
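As a purely illustrative sketch (in Python, over flat token lists; the real Spatch and Semgrep engines match ASTs, not tokens), the metavariable idea boils down to binding `$`-names consistently while requiring literal tokens to match exactly:

```python
def match(pattern, code):
    """Toy matcher: tokens starting with '$' are metavariables that
    bind to any single code token, consistently across the pattern.
    Returns the bindings on success, or None on failure."""
    if len(pattern) != len(code):
        return None
    env = {}
    for p, c in zip(pattern, code):
        if p.startswith("$"):
            if p in env and env[p] != c:
                return None  # a metavariable must match the same token everywhere
            env[p] = c
        elif p != c:
            return None  # literal tokens must match exactly
    return env

# "$A = kmalloc($S);" matches concrete code, binding $A and $S...
assert match("$A = kmalloc ( $S ) ;".split(), "a = kmalloc ( 100 ) ;".split()) \
    == {"$A": "a", "$S": "100"}
# ...and "$X == $X" only matches when both sides are identical.
assert match("$X == $X".split(), "a == b".split()) is None
```

Ellipsis matching is the harder half (it requires searching over intervening statements), but the consistent-binding rule above is exactly the behavior metavariables still have in Semgrep today.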

The Facebook years: Pfff and Sgrep

In 2008, I left the Coccinelle project, and after some searching found my way to Facebook. Back then, most of the Facebook codebase was written in PHP, which is not exactly known for being thoughtfully designed. Developers were constantly calling functions that did not exist, passing the wrong number of arguments to a function, performing operations with incompatible types, accessing undefined fields, etc. Virtually no tools existed to analyze PHP, either (for example, Coverity only added PHP support in 2016).

Being one of the few software engineers in the company with a background in program analysis, I wanted to improve the situation. I coded a simple PHP Frontend For Fun (Pfff) and during my time at Facebook wrote many tools for developers using Pfff: deadcode removers, linters, refactoring tools inspired by Spatch, code visualizers, dependency visualizers, code query tools, etc.

Facebook was becoming more serious about its security around the time I started Pfff, and in 2011 I joined a group of cryptographers, hackers, experts in the codebase, and other program analysis enthusiasts to improve the security of Facebook’s code. We designed new APIs that were secure by default for fetching data, escaping user data, etc. However, it was time-consuming to deprecate the old unsafe APIs and migrate code to the new ones, and it was also difficult to teach new developers the latest rules for writing safe code.

This was the exact problem Spatch had been created for, and I decided to solve it the same way. I adapted Spatch’s expression matching (limiting the scope to speed up the work) to PHP in a tool we called Sgrep (syntactic grep). Additionally, to make it easy for developers to organize and understand our rules, I made it possible to store multiple patterns in a configuration file and associate a message to each pattern. A configuration file for Sgrep might look like this:

- head(array_keys($X))
  ERROR: Do not use head(array_keys()), use head_key() instead.

- $X = array_concat($X, $Y)
  WARNING: This is inefficient because PHP has to copy the
  entire array. Use array_extend() instead.

Each configuration file thus formed a ruleset. Using Sgrep in our PHP linter, we were able to enforce our new APIs for every developer on commit.

Later on, Coverity and Fortify would add PHP support, and we evaluated some other tools as well, but we decided not to use any of them. They were too slow, reported errors too late in the development cycle, addressed only very general classes of bugs (e.g. null pointer dereferences), and had a high false positive rate. This was not because they were badly engineered, but because of the difficulty of the problem they were trying to solve. Detecting null pointer dereferences requires complex interprocedural, context-sensitive, and flow-sensitive analysis, whose necessary approximations cause false positives. No amount of engineering will solve the halting problem.

In contrast, Sgrep was fast, catching bugs before they were even committed, and high-signal. It didn’t try to find every potential null pointer dereference, just places where users violated best practices, so there were usually few to no false positives.

Additionally, with Sgrep, anyone could write rules targeted to our codebase. While some of the commercial tools allowed us to write custom rules, it was difficult to do so. When we evaluated Semmle’s static analysis tools in 2012, they sent a PhD in static analysis to help us. In contrast, though I wrote many of the first Sgrep rules, other developers without any program analysis background were able to add rules for their own teams, specific to their APIs. When I left Facebook in 2014, we had more than 200 Sgrep rules running on every pull request.

Since then, Facebook has invested even more in program analysis. It has hired domain experts to work on tools that find bugs before code reaches production, such as Infer, Zoncolan, and Pyre. With a feedback loop between developers and researchers, Facebook is able to write complex checks for its codebase that have low false positive rates. Sgrep is now retired, but the early years of Facebook's AppSec showed the real-world potential of a lightweight, easy-to-understand tool.

The r2c years: Semgrep

I was working on a crazy personal computer science education project when I was contacted by r2c, a software security startup, in 2019. I wasn't looking for a new job, but r2c's mission and the enthusiasm of its founders convinced me to get back in the game.

Being a startup, we experimented with several different ideas as we tried to build a static analysis tool developers would love. At some point, we decided to focus on implementing framework-specific checks for a few popular Python and JavaScript libraries.

Writing these checks was slow, and required setting up toolchains and learning the AST representation for each language we wanted to target. Seeing the kinds of rules we wanted to write, I wondered if we could modify Sgrep to work on Python and JavaScript to make the job easier. I had a JavaScript parser lying around from my Facebook days, but I needed a Python parser. I also wanted Sgrep to work on multiple languages generically, rather than porting it to JavaScript and then to Python. Inspired by the GitHub Semantic project, I started to design a generic AST that could represent a number of programming languages by taking advantage of the structural similarities between most of them. Over the course of two weeks, I wrote a Python parser and ported the Sgrep engine to work on my new generic AST instead of the PHP AST.

Like the original Sgrep, this new generic Sgrep was only able to match expressions, but we wanted to do much more than that. With the rest of the company also excited about Sgrep, I extended it to match statements, function definitions, classes, attributes, etc., to the point where you could match any construct of the original language. Isaac Evans (r2c’s CEO) and I also extended it to combine multiple patterns together with boolean operators, as shown in the example below:

rules:
- id: useless-eqeq
  patterns:
  - pattern-not-inside: assert(...)
  - pattern-either:
    - pattern: '$X == $X'
    - pattern: '$X != $X'
  - pattern-not: '1 == 1'
  message: |
    Useless comparison operator `$X == $X` or `$X != $X`.
  languages: [go]
  severity: ERROR

(You can see it run at https://semgrep.dev/s/emjin:go-useless-eqeq)

Moreover, we extended Sgrep to better understand the semantics of programming languages and to abstract away superficial differences between semantically equivalent pieces of code. For example, in Python, you can write keyword arguments in any order (foo(x = 1, y = 2) is equivalent to foo(y = 2, x = 1)), so Sgrep understood these two as equivalent. We also implemented limited forms of constant propagation, type inference, and module alias resolution. The goal was to match based on the semantics of the code, rather than on what the code looked like. Realizing that we were no longer a syntactic grep, in April 2020 we renamed Sgrep to Semgrep.
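As a quick illustration of that keyword-argument equivalence (with a stand-in foo defined just for this example):

```python
def foo(x, y):
    # A stand-in function; only how the arguments bind matters here.
    return (x, y)

# Keyword-argument order does not change the meaning of the call...
assert foo(x=1, y=2) == foo(y=2, x=1) == (1, 2)
# ...so a semantic matcher can safely treat both spellings as the same call,
# and one pattern covers both.
```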

All of these changes happened on the generic AST, so any improvement to the matching benefited every language that could be converted to the generic AST. At first this was just Python and JavaScript, but we soon added Java, Go, C, OCaml, PHP, and more.

As the expressivity of Semgrep grew, we realized that it would be useful to other security engineers in addition to our own. This was cemented when I presented Semgrep at a meetup for the first time: our attendees were from some of the best security teams in Silicon Valley, and they were all enthusiastic about Semgrep.

In a way, we stumbled upon Semgrep. I wasn’t hired to build it; Semgrep was an idea which I explored during a hackathon, with no expectation that it would become our core product. My lab had created Spatch to help evolve APIs, not find bugs; I later adapted it to Sgrep to write some checks. But our paradigm of security is changing, and security is about designing better APIs—safer APIs—that make it hard to introduce bugs in the first place. It’s about enforcing secure defaults and communicating them. That’s what Semgrep is perfect for.

There are many more features in Semgrep now—taint analysis, proper constant propagation, autofix—and many more languages. It doesn't stand alone as a command line tool anymore, either. There's a web-based playground, so people can try out and revise patterns, as well as over a thousand community-written rules and a dashboard that makes it easy to add Semgrep to CI and see results. The command line tool is completely open-source—in fact, recent languages like Rust have been added by community members—and we have a community Slack so anyone can ask questions. All of these serve the same idea: that security should be easy. Ultimately, that, rather than Semgrep, is the real core of r2c.

Conclusion

Semgrep and its predecessors have always existed to serve an immediate need, which has informed its design. My lab created Spatch with a specific goal in mind: performing collateral evolutions in Linux device drivers. Its syntax was changed minimally from C syntax, only adding what we needed to express the patches we wanted to make. When I then wanted a tool to enforce new PHP APIs, I ported the subset of Spatch necessary for Sgrep, and added messages and configuration files because the patterns were now part of a codebase. Neither of Semgrep’s predecessors was imagined as a potential generic tool; it just turned out that these specific use cases were instances of a general problem. As we change our understanding of Semgrep’s application, we are constantly reimagining it.

While we’ve added a lot of functionality to Semgrep since the Spatch days, arguably most of the work has been to help developers use the results they get. As we expand to broader audiences, Semgrep has grown with them, but always, we hope, only as much as it needs.

For a more lively presentation of the history of Semgrep, see also this PL talk with Jean Yang and Hongyi Hu at https://www.twitch.tv/videos/704349828