Using CodeQL for security research


What is CodeQL

CodeQL is a declarative query language for code, currently maintained by GitHub. GitHub maintains a decently sized set of queries that companies can hook into continuous integration and continuous delivery (CI/CD) pipelines to keep trivial code issues from being merged in the first place. CodeQL has also found some traction with defensive security teams trying to prevent security issues from going into production. However, very little has been written about using CodeQL from an offensive security perspective. If developers can use queries to find bugs before they make it out the door, you can write queries to find the bugs that did. I’ve gotten good mileage out of using CodeQL for security research on C/C++ codebases and think it’s a worthwhile tool for others to explore.

One of the powerful things about CodeQL is that it exposes not only the syntax of the program, but also the semantic information that syntax turns into. With a tool like grep over C++ source, you can’t find all variables of a class, because grep restricts you to a purely textual search and a variable might be declared with auto. Grep might also miss a type defined as x but referred to as some_ns::x elsewhere. Other failure cases include a typedef or a decltype, or a template<typename T> that is instantiated with your desired class. CodeQL gives you powerful structured introspection: you can write a query to get all variables and then the types of all those variables, which are ordinary CodeQL objects. These objects represent the final known type from the compiler, so you don’t have to worry about name resolution or type inference yourself.
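
To make those failure cases concrete, here is a small C++ sketch. Every name in it (some_ns, x, handle_t, holder, sum_ids) is made up for illustration. All four locals involve the same class, yet only one declaration spells the type out in a way a text search for some_ns::x would hit directly.

```cpp
#include <type_traits>

namespace some_ns {
struct x { int id; };
x make_x() { return x{7}; }
}  // namespace some_ns

// The same type hiding behind a typedef.
typedef some_ns::x handle_t;

// A template whose field only becomes an x once it is instantiated.
template <typename T>
struct holder { T value; };

int sum_ids() {
    auto a = some_ns::make_x();      // the type is never written at this declaration
    handle_t b = {1};                // reached through the typedef
    decltype(a) c = {2};             // reached through decltype
    holder<some_ns::x> d = {{3}};    // d.value is an x after instantiation

    // The compiler agrees these are all the same class.
    static_assert(std::is_same<decltype(a), some_ns::x>::value, "auto resolves to x");
    static_assert(std::is_same<handle_t, some_ns::x>::value, "typedef resolves to x");
    static_assert(std::is_same<decltype(c), some_ns::x>::value, "decltype resolves to x");
    static_assert(std::is_same<decltype(d.value), some_ns::x>::value, "template field is x");

    return a.id + b.id + c.id + d.value.id;
}
```

The static_asserts are the point: the resolved types are facts the compiler already has, and those are the facts a compiler-backed database can hand back to a query.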

Because CodeQL constructs its information database via a compiler wrapper for C++, it also means you can look at details of the code that aren’t known until you compile it. Having information from the completed build makes queries easier to write and more correct. The database contains details such as field offsets in a struct and only the conditionally compiled code blocks associated with the desired build configuration. This is very relevant for security research, where often you want to limit queries to a certain subsystem or code that your target has enabled.
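
As a rough illustration of why build-time information matters, consider the sketch below. The struct and the CONFIG_DEMO_STATS option are hypothetical stand-ins for a kernel struct and a CONFIG_* macro. Where the callback field lives depends on how the build was configured, which you can’t tell from the source text alone.

```cpp
#include <cstddef>
#include <cstdint>

// CONFIG_DEMO_STATS is a made-up build option, not a real kernel config.
struct demo_conn {
    std::uint32_t flags;
#ifdef CONFIG_DEMO_STATS
    std::uint64_t packets_seen;   // only exists in builds with the option enabled
#endif
    void *callback;
};

// The answer depends on the build configuration and the target ABI.
std::size_t callback_offset() {
    return offsetof(demo_conn, callback);
}
```

On a typical x86-64 build I’d expect the offset to differ depending on whether the option is defined, but the real answer only exists once a compiler has laid the struct out for a specific configuration.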

CodeQL also lets you perform powerful queries that take into account the flow of data, allowing you to build up reasoning about operations in the language instead of only looking at their syntax. If you have an operator++ in a C++ class that calls abort(), and write a CodeQL query to find all functions that transitively call abort(), the query finds MY_VAR++ is a call to that overloaded operator function. Even trying to use more powerful tools like semgrep (rather than normal grep) to replicate this functionality leads to sadness. Many tools have restricted support for more complex context-sensitive queries, such as not letting the end-user modify those queries. CodeQL permits you to create extensible queries—for example, for arbitrary dataflow analysis.
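
Here is a minimal sketch of the operator++ situation. The checked_counter class and the bump function are invented for this example, and I’ve made the abort() conditional on overflow so the code is safe to run. Nothing in bump looks like a function call, but the c++ statement is one.

```cpp
#include <cstdlib>
#include <limits>

class checked_counter {
public:
    explicit checked_counter(unsigned start) : value_(start) {}

    // Post-increment that aborts instead of silently wrapping around.
    checked_counter operator++(int) {
        if (value_ == std::numeric_limits<unsigned>::max()) {
            std::abort();
        }
        checked_counter old = *this;
        ++value_;
        return old;
    }

    unsigned value() const { return value_; }

private:
    unsigned value_;
};

// Textually there is no call here, yet this function can reach abort()
// through the overloaded operator.
unsigned bump(checked_counter &c) {
    c++;
    return c.value();
}
```

A text search for abort( would find the operator body but would not tell you that bump reaches it; resolving c++ to checked_counter::operator++ needs the compiler’s view of the code.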

Getting Started

The canonical way of using CodeQL is to use Visual Studio Code, which has mostly one-click support for installing CodeQL and running queries. I find it useful to be able to use CodeQL from the command line (and use Vim instead of Visual Studio Code…), so let’s review the steps you need to go through to set up CodeQL on a brand-new codebase without using VSCode.

1) Download the CodeQL bundle for your platform (in this case, Linux) from https://github.com/github/codeql-action/releases
2) tar xzf codeql-bundle-linux64.tar.gz to unpack it to codeql
3) In your shell, run export PATH="$(realpath ./codeql):$PATH" to use the CLI tool
4) git clone https://github.com/github/codeql.git codeql-repo to clone the .ql standard library to codeql-repo

Now that we have CodeQL, we need a project to use it against. We’ll be running it against the Linux kernel, because it’s a big, complex, open source project:

1) git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git && cd linux
2) make x86_64_defconfig to configure the kernel for a basic x86 kernel

Now we can run codeql database create cpp-database --language=cpp --command="make -j8" to actually generate a CodeQL database to run queries against later. The codeql tool automagically traces the compiler invocations made by the build command, so all the code compiled during the build ends up in our database.

Here is a fun trick to limit your queries to only a subset of the codebase: Run make -j8 normally first, and then remove .o files only for the files you want in your database. Then, when you trigger a rebuild using the codeql compiler wrapper, it only builds the code you want visible to queries.

Setting Up Queries

We have a database, and now we want to extract actionable information out of it. Let’s create a folder to hold our queries we’ll be running:

1) mkdir codeql-workspace && cd codeql-workspace
2) touch qlpack.yml and then edit it so it contains

YAML
name: blog-post
version: 1.0.0
dependencies:
    codeql/cpp-all: "*"

3) export CODEQL=./path/to/codeql-workspace so we can refer to it easily from the shell later.

CodeQL queries are definitely unusual when compared with more traditional programming languages. Because it’s a declarative language like SQL, you describe what values you want to find and not how to find them. All queries really operate over sets of values, which you then filter or select down to only the ones you care about. Instead of asking for the function "foo" from some database object, for example, you write “give me all functions, filtered to only the ones named "foo"”. Classes likewise are ways of grouping all values that fulfill some predicate: you create a CodeQL class Foo, say in its constructor (the characteristic predicate) that anything that is a Function with the name "bar" is a Foo, and then using Foo operates on all values that meet the constraint. Similarly, for a function you don’t construct the return value; you say the return value is constrained to any possible value that fulfills some predicate, and then a caller operates over all those values at once. The CodeQL language reference is decent for getting a grasp of the basic language constructs and what features are available.
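
If the set-based model feels abstract, a loose imperative analogy may help. The C++ below is only an analogy and says nothing about how the CodeQL evaluator actually works; function_info, is_foo, and select_where are all names I made up. The idea is that a class is a membership test, and a query is "everything in the universe that passes the test".

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// A toy stand-in for the "Function" values a database might hold.
struct function_info {
    std::string name;
    int param_count;
};

// Rough analog of a CodeQL class: it does not construct anything,
// it only decides which existing values belong to it.
bool is_foo(const function_info &f) { return f.name == "foo"; }

// Rough analog of "from Function f where <pred holds for f> select f".
std::vector<function_info> select_where(const std::vector<function_info> &all,
                                        bool (*pred)(const function_info &)) {
    std::vector<function_info> out;
    std::copy_if(all.begin(), all.end(), std::back_inserter(out), pred);
    return out;
}
```

The caller never asks for "the" foo; it gets every value that satisfies the predicate, which is the habit of mind CodeQL queries need.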

Luckily, the CodeQL standard library is full of features and so a large amount of complex behavior can be captured by using types and predicates that are provided for you. Unfortunately, this standard library has nearly no documentation, so learning how to take advantage of it is mostly only possible via scrolling and trial and error (or looking at the open source queries that GitHub has).

But enough of an intro. Let’s show you the goods.

Recon

In vulnerability research, your first job is to figure out how the program works…which often isn’t exactly trivial. One thing I often do when looking at a new target is search for uses of classes that seem interesting—maybe a weird home-rolled memory buffer class or a manager service with a non-obvious lifetime and you want to look at all the code that interacts with it. This is the type of thing you can grep for, although it might not find everything—but CodeQL will.

Let’s search the Linux kernel for all variables that are of the type sk_buff.
Create a file test_query.ql in our codeql-workspace folder:

CodeQL
import cpp

from Type ty, Variable v
where
    ty.getName() = "sk_buff" and
    v.getType().refersTo(ty)
select v.getLocation(), "sk_buff uses"

Then run it from the linux folder:

Bash
codeql query compile $CODEQL/test_query.ql && 
codeql query run $CODEQL/test_query.ql -d cpp-database -j0

This gives you some results in a table like this:

Plaintext
|                           col0                                   |      col1    |
+------------------------------------------------------------------+--------------+
| file://linux/drivers/net/ethernet/broadcom/tg3.c:7788:29:7788:32 | sk_buff uses |
| file://linux/drivers/net/net_failover.c:358:71:358:74            | sk_buff uses |
| file://linux/kernel/taskstats.c:66:75:66:78                      | sk_buff uses |
| file://linux/net/core/datagram.c:243:23:243:26                   | sk_buff uses |

(There are options for exporting to a CSV or BQRS file if you want to programmatically process the outputs later—including importing those files back into future CodeQL queries as inputs!)

If we open one of the results in our text editor, we can see:

C
/* Workaround 4GB and 40-bit hardware DMA bugs. */
static int tigon3_dma_hwbug_workaround(struct tg3_napi *tnapi,
                    struct sk_buff **pskb,
                    u32 *entry, u32 *budget,
                    u32 base_flags, u32 mss, u32 vlan)
{

Which does indeed have a variable pskb of type struct sk_buff**.

Maybe I have too many results and don’t want all uses of struct sk_buff, struct sk_buff*, and struct sk_buff** into infinity, and only care about two levels of indirection. We could do that with grep, but it would be annoying. (Maybe PCRE negative lookahead…? Yuck.) Using CodeQL is much less annoying:

CodeQL
import cpp
from Type ty, Variable v
where
    ty.getName() = "sk_buff" and
    v.getType().refersTo(ty) and
    v.getType().getPointerIndirectionLevel() = 2
select v.getLocation(), "sk_buff uses"

And you could even go further than only variables and look to see if any intermediate expressions (or Expr in CodeQL-speak) compute a value of the type struct sk_buff** – say if there was some code like:

C
struct sk_buff *sk;
foo(&sk);

This is very difficult to find with grep because it has no way of knowing what type &sk is.
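
To see what the compiler knows here that grep doesn’t, the sketch below uses a hypothetical sk_buff_demo struct (not the kernel’s real sk_buff) and a small indirection_level trait I wrote for the example. It counts pointer levels at compile time, which is roughly the kind of answer a pointer-indirection query relies on.

```cpp
#include <type_traits>

// Hypothetical stand-in for the kernel's struct sk_buff.
struct sk_buff_demo { int len; };

// Counts pointer levels: T is 0, T* is 1, T** is 2, and so on.
template <typename T>
struct indirection_level { static const int value = 0; };

template <typename T>
struct indirection_level<T *> {
    static const int value = 1 + indirection_level<T>::value;
};

int take(sk_buff_demo **pskb) { return (*pskb)->len; }

int call_site() {
    sk_buff_demo buf = {64};
    sk_buff_demo *sk = &buf;

    // No variable of type sk_buff_demo** is declared in this function,
    // but the expression &sk has exactly that type.
    static_assert(std::is_same<decltype(&sk), sk_buff_demo **>::value,
                  "&sk is a pointer to pointer");
    static_assert(indirection_level<decltype(&sk)>::value == 2,
                  "two levels of indirection");

    return take(&sk);
}
```

The text sk_buff_demo ** never appears inside call_site, so a textual search for double pointers would skip it even though the expression passed to take has that type.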

Vulnerability Research

So maybe I’ve done some recon, building a picture of how the target system works, and now I’m trying to actually find vulnerabilities. Let’s look for a common bug class: I want to track down memcpy calls that use a dynamic-length value and thus might be vulnerable to integer overflows or underflows in some size calculation.
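
For readers who haven’t met this bug class, here is a small self-contained sketch of it. The header length and both function names are hypothetical. The unchecked helper shows how an unsigned subtraction wraps to a huge size, and the checked copy shows the kind of validation you hope to find in front of a memcpy.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical protocol header size, chosen only for this example.
const std::size_t HEADER_LEN = 8;

// Buggy size calculation: if total_len is smaller than HEADER_LEN,
// the unsigned subtraction wraps around to a very large value.
std::size_t payload_len_unchecked(std::size_t total_len) {
    return total_len - HEADER_LEN;
}

// Checked variant: refuses to copy when the arithmetic would wrap
// or the payload would not fit in the destination.
bool copy_payload(unsigned char *dst, std::size_t dst_len,
                  const unsigned char *pkt, std::size_t total_len) {
    if (total_len < HEADER_LEN) {
        return false;
    }
    std::size_t n = total_len - HEADER_LEN;
    if (n > dst_len) {
        return false;
    }
    std::memcpy(dst, pkt + HEADER_LEN, n);
    return true;
}
```

A memcpy fed by something like payload_len_unchecked is the kind of call site worth surfacing, since its size is computed at run time from values an attacker may influence.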

CodeQL
import cpp
from Function f, FunctionCall fc
where
    f.getName() = "memcpy" and
    fc.getTarget() = f and
    // Keep only calls whose size argument is not a literal.
    not fc.getArgument(2) instanceof Literal
select fc.getLocation(), "dynamic memcpy"

Can you see the problem with this script? You can run it, and it will spit out results: all memcpy calls that don’t use a numeric literal. But if you scroll through the output, you’ll see a lot of code like memcpy(x, y, sizeof(x)) that you don’t want! CodeQL is powerful, but it gives you exactly what you ask for, even if that isn’t what you thought you were requesting. (Or worse, has some constraint on the value that will, in hindsight, trivially never be true, and so the query results are empty.) We don’t only want to filter out numeric literals, but any values that were known at compile time to be constant: this could be 2+2, or sizeof(x), or maybe even auto x = 1; x+1. (Or we could just filter through the false positives manually! An important part of VR is learning when it’s just faster to triage the results, even if it’s more tedious.)

Let’s try again:

CodeQL
import cpp
from Function f, FunctionCall fc, Expr size
where
    f.getName() = "memcpy" and
    fc.getTarget() = f and
    fc.getArgument(2) = size and
    not size.isConstant()
select fc.getLocation(), "dynamic memcpy"

And yup, that cleared up the trivial sizeof(x) hits from our results.
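
As a reference for what the two filters are distinguishing, here is a sketch of the call shapes involved. The record struct and both functions are invented. My expectation, not something I’ve verified call by call, is that a literal-only filter drops just the first memcpy, while the isConstant() version drops all three calls in copy_constant_sizes and keeps the one in copy_dynamic_size.

```cpp
#include <cstddef>
#include <cstring>

struct record { char name[16]; };

// All three sizes are fixed at compile time, but only the first is a literal.
void copy_constant_sizes(record &dst, const record &src) {
    std::memcpy(dst.name, src.name, 16);                // numeric literal
    std::memcpy(dst.name, src.name, sizeof(dst.name));  // sizeof expression
    std::memcpy(dst.name, src.name, 8 + 8);             // constant arithmetic
}

// The size comes from the caller, so it is only known at run time.
// This is the shape worth auditing.
void copy_dynamic_size(record &dst, const record &src, std::size_t n) {
    if (n > sizeof(dst.name)) {
        n = sizeof(dst.name);   // clamp so the example itself is safe
    }
    std::memcpy(dst.name, src.name, n);
}
```

Note that the clamp does not make n a constant, so a call like this would still show up in the results and need a human to decide whether the check is sufficient.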

This script has another downside: it only checks specifically for calls to memcpy. If we wanted, we could incrementally make it better, such as by abstracting over what a “copy” is and saying memmove is also a copy operation:

CodeQL
import cpp
class Copy extends FunctionCall {
    Copy() {
        // A call is a copy if there exists a function, which is the target,
        // and that function has the name we want.
        exists(Function f |
            this.getTarget() = f and
            (f.getName() = "memcpy" or f.getName() = "memmove")
        )
    }
    // The size for both memcpy and memmove is in the same argument position,
    // but we could add more copy functions where that isn't the case!
    Expr getSize() {
        result = this.getArgument(2)
    }
}
from Copy c, Expr size
where
    c.getSize() = size and
    not size.isConstant()
select c.getLocation(), "dynamic copy"

This frequently comes in handy for vulnerability research, where sometimes you’ll see a codebase that has three different helper functions for the same behavior. Perhaps the first function is correctly performing internal bounds checks or what have you, but now you want to audit all the calls of the other two that maybe the developer forgot to keep up to date.

Exploit Development

Let’s say you find some kernel bug (with or without CodeQL’s help!) that permits you to double-free an allocation, and so have a primitive where you can have two pointers to the same memory block. One of the common things you then want to do is figure out which pointers are useful to you to overlap for an exploit. The way forward usually is to find some heap-allocated type, within a certain size so it goes in a specific heap bin, with a pointer at an offset you know you can modify with a user-controlled value via another type at the same offset.
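
To pin down what "useful to overlap" means, here is a sketch with two hypothetical structs, victim and spray, neither of which exists in the kernel. The bucket_for helper is a simplified power-of-two model of the general-purpose kmalloc size classes (the real allocator also has non-power-of-two caches such as 96 and 192 bytes), so treat it as an assumption rather than the kernel’s actual logic.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical victim object: holds a pointer we would like to corrupt.
struct victim {
    std::uint64_t refcount;
    void (*on_close)(void *);   // the interesting pointer
    unsigned char state[160];
};

// Hypothetical object whose second field is filled with a user-supplied value.
struct spray {
    std::uint64_t tag;
    std::uint64_t user_value;   // attacker-controlled data
    unsigned char body[160];
};

// Simplified model: smallest power-of-two bucket that fits `size` bytes.
std::size_t bucket_for(std::size_t size) {
    std::size_t bucket = 8;
    while (bucket < size) {
        bucket *= 2;
    }
    return bucket;
}

// Two types are candidates for overlapping when the controlled field lines up
// with the pointer and both allocations land in the same size bucket.
bool can_overlap() {
    return offsetof(victim, on_close) == offsetof(spray, user_value) &&
           bucket_for(sizeof(victim)) == bucket_for(sizeof(spray));
}
```

Checking one pair by hand like this is easy; the hard part is finding the candidates across a codebase the size of the kernel.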

I know unfortunate souls (myself included) who have resorted to custom clang passes or parsing out DWARF type information to try and locate useful allocations. Let’s find some with CodeQL instead:

CodeQL
import cpp
from Field f, Type ty, PointerType p, FunctionCall call
where
    f.getByteOffset() = 8 and
    f.getDeclaringType() = ty and
    ty.getSize() > 128 and
    ty.getSize() <= 256 and
    p.refersTo(ty) and
    call.getTarget().getName() = "kmalloc" and
    call.getActualType() = p
select call.getLocation(), "kmalloc", ty.getName(), "type allocated"

We can then expand our query to support kzalloc in addition to kmalloc, or to handle the case where an allocated struct contains another struct and a nested field sits at the correct offset, and so on.

The strength of CodeQL is being able to incrementally pare down the results from our query programmatically and express more and more constraints on what are actually useful types. (Or we might want to loosen constraints if we discover our bug is more powerful than we initially thought). The alternative is having to manually review way too many results that aren’t actually useful. Even better, CodeQL lets you write queries and share them with other people so they can replicate the results or collaborate on query development. This approach is obviously better than being restricted to emailing a .txt file with struct layouts from a clang pass back and forth.

Fin

CodeQL is a powerful tool, and I personally find it useful throughout all stages of security research. It provides a way to ask questions and get answers about your code through each step of the compilation process. Using CodeQL enables you to replace a lot of other bespoke or hacked-together scripts you might have used to replicate the same experience. It exposes data at several different abstraction levels, enabling you to write a range of queries, from simply throwing together a script looking at variable names, to performing a complex dataflow analysis. CodeQL is a useful tool to have when conducting security research on hard targets.

Zetier is currently growing our vulnerability research team. If using CodeQL like this interests you, please contact us at careers@zetier.com or check out our other open positions.
