Experimental feature: generic pattern matching

by Pablo Estrada on December 03, 2020

Recently we added a new experimental feature to Semgrep: generic pattern matching. This post outlines how to use it and what to expect when matching code patterns.

Generic pattern matching allows Semgrep to match code patterns in languages that don’t yet have a Semgrep parser, in configuration files, or in other structured data (e.g., HTML or XML). For example, you may want to find unwanted permissions enabled in Terraform files, insecure redirects in nginx, or misconfigured blog engine settings.

Consider this rule that searches for allowed_origins = ["*"] in Terraform files:

rules:
- id: terraform-all-origins-allowed
  patterns:
  - pattern-inside: cors_rule { ... }
  - pattern: allowed_origins = ["*"]
  languages:
  - generic
  severity: WARNING
  message: CORS rule on bucket permits any origin

The above rule matches this Terraform code snippet:

resource "aws_s3_bucket" "b" {
  bucket = "s3-website-test-open.hashicorp.com"
  acl    = "private"

  cors_rule {
    allowed_headers = ["*"]
    allowed_methods = ["PUT", "POST"]
    allowed_origins = ["*"]  # <--- Matches here
    expose_headers  = ["ETag"]
    max_age_seconds = 3000
  }
}

General properties

Generic pattern matching has the following properties:

  • A document is interpreted as a nested sequence of ASCII words, ASCII punctuation, and other bytes.
  • The ellipsis (...) skips non-matching elements, up to 10 lines past the last match.
  • $X (metavariable) matches any word.
  • The interpretation of a document can be inspected with the spacecat command.
  • Indentation determines primary nesting in the document.
  • Common ASCII braces (), [], and {} introduce secondary nesting, but only within single lines. Therefore, misinterpreted or mismatched braces don't disturb the structure of the rest of the document.
  • The document must be at least as indented as the pattern: any indentation specified in the pattern must be honored in the document.
  • Shorter matches are preferred over longer ones. This avoids matches like def bar def foo when the pattern is def ... foo, instead matching just def foo.
  • Leading dots must match at the beginning of a block, so a pattern like ... foo matches what comes before foo in that block.
  • In general, short patterns on structured data perform best.
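
To make the shortest-match preference concrete, here is a toy Python sketch. It is not Semgrep's matching algorithm: it works on whitespace-split tokens, ignores nesting and the 10-line limit, and the helper name shortest_match is made up for this illustration.

```python
def shortest_match(tokens, first, last):
    """Return the shortest (start, end) token span that begins with `first`
    and ends with `last`, or None if there is no such span.

    Toy model of the pattern `first ... last`; not how Semgrep works internally.
    """
    best = None
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j] == last:
                if best is None or (j - i) < (best[1] - best[0]):
                    best = (i, j)
                break  # any later `last` only gives a longer span from i
    return best


tokens = "def bar def foo".split()
span = shortest_match(tokens, "def", "foo")
print(tokens[span[0]:span[1] + 1])  # ['def', 'foo']
```

Both def bar def foo and def foo are candidate spans here; the sketch keeps the shorter one, which mirrors the preference described in the list above.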

Example rules

This Semgrep ruleset, based on generic pattern matching, performs security checks for nginx configuration files. You can also browse all the generic pattern matching rules in the Semgrep registry.

Caveats and limitations

Generic pattern matching should work fine with any human-readable text, as long as it’s primarily based on ASCII symbols. In practice, it might work great with some languages and less well with others. In general, it’s possible or even easy to write code in weird ways that will prevent matching.

Note that generic pattern matching is not well suited to detecting malicious code. For example, in HTML one can write &#x48;&#x65;&#x6C;&#x6C;&#x6F; instead of Hello. The pattern Hello will not match the encoded form, whereas it would if Semgrep had full HTML support.
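
The mismatch is easy to reproduce with a plain string comparison. The sketch below uses Python's standard html.unescape as a stand-in for an HTML-aware decoding step; it only illustrates the idea and says nothing about how Semgrep is implemented.

```python
import html

encoded = "&#x48;&#x65;&#x6C;&#x6C;&#x6F;"  # "Hello" written as HTML character references
pattern = "Hello"

# A purely textual comparison sees only the raw characters, so it finds nothing.
print(pattern in encoded)                 # False

# After decoding the character references, the same pattern is found.
print(pattern in html.unescape(encoded))  # True
```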

With respect to Semgrep operators and features:

  • metavariable support is limited to capturing a single “word”, which is a token of the form [A-Za-z0-9_]+. A metavariable can’t capture a sequence of tokens such as “hello, world”, since in this case there are 3 tokens:

    • hello
    • ,
    • world
  • the ellipsis operator is supported and spans at most 10 lines
  • pattern operators like either/not/inside are supported
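
The word-and-punctuation view described above can be approximated in a few lines of Python. This is a rough sketch that assumes every non-word, non-space character is its own punctuation token; the names TOKEN_RE, tokenize, and is_word are invented here and are not part of Semgrep.

```python
import re

# Assumption for this sketch: a "word" is [A-Za-z0-9_]+ (as stated in the post),
# and each remaining non-space character is treated as one punctuation token.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]")


def tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens, dropping whitespace."""
    return TOKEN_RE.findall(text)


def is_word(token: str) -> bool:
    """True if a metavariable such as $X could capture this token."""
    return re.fullmatch(r"[A-Za-z0-9_]+", token) is not None


tokens = tokenize("hello, world")
print(tokens)                        # ['hello', ',', 'world']
print([is_word(t) for t in tokens])  # [True, False, True]
```

Under this model, a metavariable could bind to hello or to world, but not to the three-token sequence as a whole.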

Try it out

In addition to running rules already available in the Semgrep Registry, you can write custom Semgrep rules using generic pattern matching in the Semgrep live editor. Look for the generic pattern matching item in the editor menu:

[Image: Generic pattern matching in Semgrep live editor]