Demystifying Taint Mode

by Emily Fortuna on September 01, 2022

This blog post has been adapted from a video. But if you prefer to read a summary, you’re in the right place.

In this post I’ll explain all the ins and outs of Semgrep’s taint mode: what it is and when to use it, how to write a taint mode rule, and optional additions for rule refinement. With this knowledge you’ll be able to secure your codebase with confidence and panache!

What is taint mode?

Semgrep has a couple of ways to analyze code for vulnerabilities. You’re probably most familiar with search mode, which is the default for running rules. Search mode can literally look for a specific pattern, just like grep, but it can also use the semantics of your programming language. For example, if a variable x is assigned a value at the top of a function, search mode can propagate that value through the function and determine that a later line using x is a match.

But what if you want to track the flow of data beyond just that one variable x?

    unsafe_input = request.cookies['user_profile']
    decoded = b64decode(unsafe_input)
    'Hey {}!'.format(pickle.loads(decoded))

In the code above, we read some serialized user profile data stored in a cookie from an HTTP request. Then we deserialize it and use it without validating that the data is what we think it is. Depending on the language, deserialization itself could be exploitable, and using deserialized data sent over a network without checking its contents is certainly dangerous. Detecting unsafe code like this requires tracking the flow of data from one location to another, which is exactly the sort of scenario that taint mode was written for.

When should I use taint mode?

To determine if you should write a taint mode rule or not, ask yourself two questions:

  1. Do I want to track the flow of data potentially across multiple variables? And, possibly relatedly,
  2. Am I writing a rule to catch an injection vulnerability, like a cross-site scripting (XSS) attack or SQL injection?

If the answer to either of these questions is “yes”, try writing your rule in taint mode!

In some cases, it is possible to approximate taint mode by writing a rule as you normally would in search mode, but with some extra lines of code. In the deserialization example above, you would need to write out every line that modifies and then uses the potentially unsafe data.
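As a rough sketch of what that search mode approximation might look like for the deserialization example, the rule below spells out each step with metavariables. The rule id and message are made up for illustration, and the pattern assumes the code calls b64decode and pickle.loads just as in the snippet above.

```yaml
rules:
  - id: cookie-deserialization-search-mode   # hypothetical rule name
    # Search mode: every step of the data flow is written out by hand.
    pattern: |
      $INPUT = request.cookies[...]
      ...
      $DECODED = b64decode($INPUT)
      ...
      pickle.loads($DECODED)
    message: Cookie data is deserialized without validation.
    languages:
      - python
    severity: WARNING
```

This pattern only describes one exact chain of assignments, so an extra intermediate variable or a skipped decoding step would need its own pattern.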

However, if the code path has control flow, or you want to future-proof your rule for when the code inevitably gets refactored over time, search mode rules to detect these sorts of scenarios can get complex pretty quickly!

So in these situations, taint mode can help you make your rules more succinct and readable.

How to write a taint mode rule

Writing a taint mode rule is really easy. First, specify that this is a taint mode rule by setting mode: taint. Then, specify your sources with pattern-sources. You can have multiple if you want. Lastly, specify your sinks with pattern-sinks. Again, you can have multiple sinks if the situation calls for it. Aaaaand that’s it!
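Putting those three steps together for the deserialization example from earlier, a minimal rule might look something like the sketch below. The rule id and message are invented for illustration; the source and sink patterns are taken from the cookie and pickle.loads calls in that example.

```yaml
rules:
  - id: tainted-cookie-deserialization   # hypothetical rule name
    mode: taint                          # step 1: mark this as a taint mode rule
    pattern-sources:                     # step 2: where untrusted data enters
      - pattern: request.cookies[...]
    pattern-sinks:                       # step 3: where untrusted data must not end up
      - pattern: pickle.loads(...)
    message: Cookie data reaches pickle.loads without validation.
    languages:
      - python
    severity: WARNING
```

Unlike a search mode rule, nothing here describes the intermediate b64decode step; Semgrep is left to follow the data from source to sink.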

Rule refinement with sanitizers

Optionally, you can add pattern-sanitizers, which specify patterns that indicate we are validating and removing potentially unsafe code from our untrusted data. Since the code itself is already explicitly checking for malicious code injections, we’re telling Semgrep that it doesn’t need to flag code inside a sanitizer as a potentially unsafe finding.
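Here is a sketch of how a sanitizer could be added to a rule for the deserialization example. The function validate_profile is a hypothetical helper, assumed to check the data before it is used; it is not part of any real library, and you would substitute whatever validation your own codebase performs.

```yaml
rules:
  - id: tainted-cookie-with-sanitizer   # hypothetical rule name
    mode: taint
    pattern-sources:
      - pattern: request.cookies[...]
    pattern-sanitizers:
      # Assumption: validate_profile is our own (hypothetical) validation helper.
      # Data that has passed through it is no longer treated as tainted.
      - pattern: validate_profile(...)
    pattern-sinks:
      - pattern: pickle.loads(...)
    message: Cookie data reaches pickle.loads without validation.
    languages:
      - python
    severity: WARNING
```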

Rule refinement with taint propagators

Another optional addition for your rule is specifying taint propagators. Sometimes tainted data might taint other data structures, such as when untrusted data is added to a hash set. In the example below, in addition to not trusting the original tainted data, maybe you don’t want to trust anything in that set.

    user_input = request.cookies['user']
    untrusted = b64decode(user_input)
    cookies_set.add(untrusted)
    maybe_tainted = cookies_set.pop()
    return 'Hey {}!'.format(pickle.loads(maybe_tainted))

To express this, you can add the optional pattern-propagators key. A propagator is specified in three parts. First up is the pattern key. This is where you describe, with regular Semgrep syntax, what the code looks like when tainted data spreads elsewhere. In this case, we’re describing what adding to a set in Python looks like. The from key specifies where your tainted source needs to pass “through” to propagate the taint. In this specific example, it’s when the “untrusted” variable gets added to the set. The to key specifies what additional object gets tainted as a result, which in this case is the set itself, cookies_set. Now, when anything from cookies_set reaches a sink, Semgrep creates a finding.

    rules:
      - id: hooray-taint-mode-propagators
        mode: taint
        pattern-sources:
          - pattern: request.cookies[...]
        pattern-propagators:
          - pattern: $SET.add($TAINTED)
            from: $TAINTED
            to: $SET
        pattern-sinks:
          - pattern: pickle.loads(...)
        message: Found potentially unsafe input thanks to taint mode!
        languages:
          - python
        severity: WARNING

In conclusion

Taint mode offers a clean way to track tainted data through code and highlight potential vulnerabilities. It’s perfect for injection vulnerabilities, but we’re sure you’ll get even more creative. Go forth and secure your codebase!