What is CodeQL

CodeQL is a declarative query language for code, currently maintained by GitHub. GitHub maintains a sizable set of queries that companies can hook into continuous integration and continuous delivery (CI/CD) to keep trivial code issues from being merged in the first place. CodeQL has also found some traction with defensive security teams trying to keep security issues out of production. However, very little has been written about using CodeQL from an offensive security perspective. If developers can use queries to find bugs before they make it out the door, you can write queries to find the bugs that did. I’ve gotten good mileage out of using CodeQL for security research on C/C++ codebases and think it’s a worthwhile tool for others to explore.

One of the powerful things about CodeQL is that it doesn’t only expose the syntax of the program, but also the semantic information that syntax turns into. With a tool like grep over C++ source, you can’t find all variables of a class because a variable might be declared with auto, and grep restricts you to a purely textual search. Grep might also miss a type defined as x but referred to as some_ns::x elsewhere. Other failure cases exist, such as a typedef or a decltype, or a template<typename T> that is instantiated with your desired class. CodeQL gives you structured introspection: you can write a query to get all variables and then the types of all those variables, which are ordinary CodeQL objects. These objects represent the final type as known to the compiler, so you don’t have to worry about name resolution or type inference yourself.
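To make those grep failure cases concrete, here is a small C++ sketch. Everything in it (handle_t, box, make_x, sum_values) is invented for illustration; the point is only that five locals all have the type some_ns::x while just one declaration spells that name out.

```cpp
#include <cassert>
#include <type_traits>

namespace some_ns {
struct x {  // the class whose variables we want to find
    int value = 7;
};
}  // namespace some_ns

// Hypothetical helpers, invented for this illustration.
typedef some_ns::x handle_t;                 // a typedef hides the name
template <typename T> struct box { T item; };
inline some_ns::x make_x() { return some_ns::x{}; }

// Every local below holds a some_ns::x, yet only the first declaration
// contains the text "some_ns::x".
inline int sum_values() {
    some_ns::x plain;
    auto inferred = make_x();
    handle_t aliased;
    decltype(plain) mirrored;
    box<handle_t> boxed;

    // The compiler agrees these are all the same type.
    static_assert(std::is_same<decltype(inferred), some_ns::x>::value, "auto");
    static_assert(std::is_same<decltype(aliased), some_ns::x>::value, "typedef");
    static_assert(std::is_same<decltype(mirrored), some_ns::x>::value, "decltype");
    static_assert(std::is_same<decltype(boxed.item), some_ns::x>::value, "template");

    assert(inferred.value == plain.value);
    return plain.value + inferred.value + aliased.value +
           mirrored.value + boxed.item.value;
}
```

A textual search for "some_ns::x" sees one of these declarations; a query over resolved types can see all of them.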

Because CodeQL constructs its information database via a compiler wrapper for C++, it also means you can look at details of the code that aren’t known until you compile it. Having information from the completed build makes queries easier to write and more correct. The database contains details such as field offsets in a struct and only the conditionally compiled code blocks associated with the desired build configuration. This is very relevant for security research, where often you want to limit queries to a certain subsystem or code that your target has enabled.
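As a rough illustration of why build-time information matters, consider the sketch below. The struct and the CONFIG_EXAMPLE_TIMESTAMPS option are hypothetical stand-ins for a kernel CONFIG_* switch; the offset of length, and even whether timestamp exists at all, is only decided once a specific configuration is compiled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical header layout; CONFIG_EXAMPLE_TIMESTAMPS is an invented
// build option, left undefined here.
struct packet_hdr {
    unsigned char type;
#ifdef CONFIG_EXAMPLE_TIMESTAMPS
    unsigned long long timestamp;  // only exists in some builds
#endif
    unsigned int length;
};

// The offset of `length` is whatever the compiler chose for this build.
inline std::size_t length_offset() { return offsetof(packet_hdr, length); }

inline bool timestamps_enabled() {
#ifdef CONFIG_EXAMPLE_TIMESTAMPS
    return true;
#else
    return false;
#endif
}

inline void check_layout_matches_build() {
    // With the option off, `length` directly follows `type` plus padding.
    if (!timestamps_enabled()) {
        assert(length_offset() == alignof(unsigned int));
    }
}
```

Reading the source alone gives you two possible layouts; a database built from the actual compile only contains the one your target uses.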

CodeQL also lets you perform powerful queries that take into account the flow of data, allowing you to build up reasoning about operations in the language instead of only looking at their syntax. If you have an operator++ in a C++ class that calls abort(), and write a CodeQL query to find all functions that transitively call abort(), the query finds that MY_VAR++ is a call to that overloaded operator function. Even trying to use more powerful tools like semgrep (rather than normal grep) to replicate this functionality leads to sadness. Many tools have restricted support for more complex context-sensitive queries, such as not letting the end-user modify those queries. CodeQL permits you to create extensible queries, for example for arbitrary dataflow analysis.
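Here is a minimal sketch of the operator++ situation described above. BoundedCounter, bump_twice, and run_counter_demo are hypothetical names made up for this example; only the shape of the call graph matters.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical counter type invented for this illustration.
struct BoundedCounter {
    int value = 0;
    int limit = 3;

    // Post-increment is an ordinary member function under the hood.
    BoundedCounter operator++(int) {
        if (value >= limit) {
            std::abort();  // reachable from any `counter++` expression
        }
        BoundedCounter before = *this;
        ++value;
        return before;
    }
};

// Nothing here mentions abort() or operator++ by name, but the call graph
// is bump_twice -> BoundedCounter::operator++ -> abort.
inline int bump_twice(BoundedCounter& MY_VAR) {
    MY_VAR++;
    MY_VAR++;
    return MY_VAR.value;
}

inline int run_counter_demo() {
    BoundedCounter counter;
    int after = bump_twice(counter);
    assert(after == 2);  // both increments went through operator++
    return after;
}
```

Textually, bump_twice looks like plain integer arithmetic. Only a tool that resolves MY_VAR++ to the overloaded function can place it among the transitive callers of abort().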

Getting Started

The canonical way of using CodeQL is to use Visual Studio Code, which has mostly one-click support for installing CodeQL and running queries. I find it useful to be able to use CodeQL from the command line (and use Vim instead of Visual Studio Code…), so let’s review the steps you need to go through to set up CodeQL on a brand-new codebase without using VSCode.

1) Download the CodeQL bundle for your platform (in this case, Linux) from https://github.com/github/codeql-action/releases
2) tar xzf codeql-bundle-linux64.tar.gz to unpack it to codeql
3) In your shell, run export PATH="$(realpath ./codeql):$PATH" to use the CLI tool
4) git clone https://github.com/github/codeql.git codeql-repo to clone the .ql standard library to codeql-repo

Now that we have CodeQL, we need a project to use it against. We’ll be running it against the Linux kernel, because it’s a big, complex, open source project:

1) git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git && cd linux
2) make x86_64_defconfig to configure the kernel for a basic x86 kernel

Now we can run codeql database create cpp-database --language=cpp --command="make -j8" to actually generate a CodeQL database to run queries against later. The codeql tool runs the build for us and automatically observes the compiler invocations it makes, so all the compiled code ends up in our database.

Here is a fun trick to limit your queries to only a subset of the codebase: Run make -j8 normally first, and then remove .o files only for the files you want in your database. Then, when you trigger a rebuild using the codeql compiler wrapper, it only builds the code you want visible to queries.

Setting Up Queries

We have a database, and now we want to extract actionable information out of it. Let’s create a folder to hold the queries we’ll be running:

1) mkdir codeql-workspace && cd codeql-workspace
2) touch qlpack.yml and then edit it so it contains

YAML

name: blog-post
version: 1.0.0
dependencies:
    codeql/cpp-all: "*"

3) export CODEQL=./path/to/codeql-workspace so we can refer to it easily from the shell later.

CodeQL queries are definitely unusual when compared with more traditional programming languages. Because it’s a declarative language like SQL, you describe what values you want to find and not how to find them. All queries really operate over sets of values, which you then filter or select down to only the ones you care about. Instead of asking for the function "foo" from some database object, for example, you instead write “give me all functions, filtered to only ones named "foo"”.

Classes likewise are actually ways of grouping all values that fulfill some predicate: you create a CodeQL class Foo, and say in its constructor that anything that is a Function with the name "bar" is a Foo, and then using Foo operates on all values that meet the constraint. Similarly, for a function you don’t construct the return value. Instead, you say the return value is constrained to any possible value that fulfills some predicate, and then a caller is operating over all those values at once. The CodeQL language reference is decent for getting a grasp of the basic language constructs and what features are available for use.
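If the set-based model feels slippery, a loose analogy in C++ may help. This is only an analogy and says nothing about how CodeQL actually evaluates queries; FunctionInfo, is_foo, select_where, and example_database are all hypothetical names. The "class" is just a membership predicate, and the "query" is the whole set filtered by it.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical record standing in for one function in a database.
struct FunctionInfo {
    std::string name;
    int parameter_count;
};

// Analogue of a class: membership is decided by a predicate.
inline bool is_foo(const FunctionInfo& f) { return f.name == "foo"; }

// Analogue of a query: start from every value and keep those that match.
inline std::vector<FunctionInfo> select_where(
        const std::vector<FunctionInfo>& all,
        bool (*predicate)(const FunctionInfo&)) {
    std::vector<FunctionInfo> matches;
    std::copy_if(all.begin(), all.end(), std::back_inserter(matches), predicate);
    return matches;
}

inline std::vector<FunctionInfo> example_database() {
    return {{"foo", 1}, {"bar", 2}, {"foo", 3}};
}

inline void check_select() {
    // Two of the three records are named "foo".
    assert(select_where(example_database(), is_foo).size() == 2);
}
```

Note that nothing "looks up" foo; the result is simply every value for which the constraint holds, which is how it helps to read a from/where/select query.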

Luckily, the CodeQL standard library is full of features and so a large amount of complex behavior can be captured by using types and predicates that are provided for you. Unfortunately, this standard library has nearly no documentation, so learning how to take advantage of it is mostly only possible via scrolling and trial and error (or looking at the open source queries that GitHub has).

But enough of an intro. Let’s show you the goods.

Recon

In vulnerability research, your first job is to figure out how the program works…which often isn’t exactly trivial. One thing I often do when looking at a new target is search for uses of classes that seem interesting—maybe a weird home-rolled memory buffer class or a manager service with a non-obvious lifetime and you want to look at all the code that interacts with it. This is the type of thing you can grep for, although it might not find everything—but CodeQL will.

Let’s search the Linux kernel for all variables that are of the type sk_buff.
Create a file test_query.ql in our codeql-workspace folder:

CodeQL

import cpp

from Type ty, Variable v
where
    ty.getName() = "sk_buff" and
    v.getType().refersTo(ty)
select v.getLocation(), "sk_buff uses"

Then run it from the linux folder:

Bash

codeql query compile $CODEQL/test_query.ql && 
codeql query run $CODEQL/test_query.ql -d cpp-database -j0

This gives you some results in a table like this:

Plaintext

|                           col0                                   |      col1    |
+------------------------------------------------------------------+--------------+
| file://linux/drivers/net/ethernet/broadcom/tg3.c:7788:29:7788:32 | sk_buff uses |
| file://linux/drivers/net/net_failover.c:358:71:358:74            | sk_buff uses |
| file://linux/kernel/taskstats.c:66:75:66:78                      | sk_buff uses |
| file://linux/net/core/datagram.c:243:23:243:26                   | sk_buff uses |
…

(There are options for exporting to a CSV or BQRS file if you want to programmatically process the outputs later—including importing those files back into future CodeQL queries as inputs!)

If we open one of the results in our text editor, we can see:

/* Workaround 4GB and 40-bit hardware DMA bugs. */
static int tigon3_dma_hwbug_workaround(struct tg3_napi *tnapi,
                    struct sk_buff **pskb,
                    u32 *entry, u32 *budget,
                    u32 base_flags, u32 mss, u32 vlan)
{
…

Which does indeed have a variable pskb of type struct sk_buff**.

Maybe I have too many results and don’t want all uses of struct sk_buff, struct sk_buff*, and struct sk_buff** into infinity, and only care about two levels of indirection. We could do that with grep, but it would be annoying. (Maybe PCRE negative lookahead…? Yuck.) Using CodeQL is much less annoying:

CodeQL

import cpp
from Type ty, Variable v
where
    ty.getName() = "sk_buff" and
    v.getType().refersTo(ty) and
    v.getType().getPointerIndirectionLevel() = 2
select v.getLocation(), "sk_buff uses"

And you could even go further than only variables and look to see if any intermediate expressions (or Expr in CodeQL-speak) compute a value of the type struct sk_buff** – say if there was some code like:

C

struct sk_buff *sk;
foo(&sk);

This is very difficult to find with grep because it has no way of knowing what type &sk is.
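To see what the type system knows here that grep doesn't, the sketch below asks the compiler directly. The sk_buff struct is a stand-in (not the real kernel definition), and indirection_level is a hypothetical helper that counts pointer levels, loosely mirroring the idea behind getPointerIndirectionLevel(). It ignores const-qualified pointers, which is enough for this illustration.

```cpp
#include <cassert>
#include <type_traits>

struct sk_buff { int len; };  // stand-in; not the real kernel definition

// Counts how many `*` separate a type from its base type.
template <typename T>
struct indirection_level : std::integral_constant<int, 0> {};
template <typename T>
struct indirection_level<T*>
    : std::integral_constant<int, 1 + indirection_level<T>::value> {};

inline void foo(sk_buff** out) { *out = nullptr; }  // hypothetical callee

inline int level_of_address_expression() {
    sk_buff* sk;
    foo(&sk);  // the caller declares no sk_buff** variable, but `&sk` is one
    static_assert(std::is_same<decltype(&sk), sk_buff**>::value,
                  "&sk has type sk_buff**");
    assert(sk == nullptr);  // foo wrote through the double pointer
    return indirection_level<decltype(&sk)>::value;
}
```

The expression &sk never appears next to the text "sk_buff **", yet its type has two levels of indirection, which is exactly the kind of hit an Expr-based query could surface.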

Vulnerability Research

So maybe I’ve done some recon, building a picture of how the target system works, and now I’m trying to actually find vulnerabilities. Let’s look for a common bug class: I want to track down memcpy calls that use a dynamic-length value and thus might be vulnerable to integer overflows or underflows in some size calculation.

CodeQL

import cpp
from Function f, FunctionCall fc
where
    f.getName() = "memcpy" and
    fc.getTarget() = f and
    not fc.getArgument(2) instanceof Literal
select fc.getLocation(), "dynamic memcpy"

Can you see the problem with this script? You can run it, and it will spit out results: all memcpy calls that don’t use a numeric literal. But if you scroll through the output, you’ll see a lot of code like memcpy(x, y, sizeof(x)) that you don’t want! CodeQL is powerful, but it gives you exactly what you ask for, even if that isn’t what you thought you were requesting. (Or worse, your query has some constraint that, in hindsight, can trivially never be true, and so the results are empty.) We don’t only want to filter out numeric literals, but any value known at compile time to be constant: this could be 2+2, or sizeof(x), or maybe even auto x = 1; x+1. (Or we could just filter through the false positives manually! An important part of VR is learning when it’s just faster to triage the results, even if it’s more tedious.)

Let’s try again:

CodeQL

import cpp
from Function f, FunctionCall fc, Expr size
where
    f.getName() = "memcpy" and
    fc.getTarget() = f and
    fc.getArgument(2) = size and
    not size.isConstant()
select fc.getLocation(), "dynamic memcpy"

And yup, that cleared up the trivial sizeof(x) hits from our results.
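For a picture of what we're hunting for, here is a sketch of the bug class with hypothetical names throughout (HEADER_LEN, payload_len_unchecked, copy_payload, constant_copy). A size computed by unsigned subtraction wraps around when the input is shorter than expected, and a copy using that size is the kind of call worth triaging; the sizeof-based copy is the kind we just filtered out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t HEADER_LEN = 8;  // hypothetical protocol header size

// Unchecked size calculation: wraps to a huge value if total_len < HEADER_LEN.
inline std::size_t payload_len_unchecked(std::size_t total_len) {
    return total_len - HEADER_LEN;
}

// Checked variant; returns the number of bytes copied, or 0 if rejected.
inline std::size_t copy_payload(unsigned char* dst, std::size_t dst_cap,
                                const unsigned char* packet,
                                std::size_t total_len) {
    if (total_len < HEADER_LEN) return 0;  // guards against the underflow
    std::size_t n = total_len - HEADER_LEN;
    if (n > dst_cap) return 0;             // guards against overflowing dst
    std::memcpy(dst, packet + HEADER_LEN, n);  // dynamic size: worth a look
    return n;
}

inline void constant_copy(unsigned char (&dst)[16],
                          const unsigned char (&src)[16]) {
    std::memcpy(dst, src, sizeof(dst));  // compile-time constant size
}

inline std::size_t run_copy_demo() {
    unsigned char packet[12] = {0, 1, 2, 3, 4, 5, 6, 7, 'a', 'b', 'c', 'd'};
    unsigned char out[4] = {};
    std::size_t n = copy_payload(out, sizeof(out), packet, sizeof(packet));
    assert(n == 4 && out[0] == 'a');
    return n;
}
```

Remove either guard in copy_payload and the memcpy becomes a real bug, which is why every non-constant size is a lead rather than a finding.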

This script has another downside: it only checks specifically for calls to memcpy. If we wanted, we could incrementally make it better, such as by abstracting over what a “copy” is and saying memmove is also a copy operation:

CodeQL

import cpp
class Copy extends FunctionCall {
    Copy() {
        // A call is a copy if there exists a function, which is the target,
        // and that function has the name we want.
        exists(Function f |
            this.getTarget() = f and
            (f.getName() = "memcpy" or f.getName() = "memmove")
        )
    }
    // The size for both memcpy and memmove is in the same argument position,
    // but we could add more copy functions where that isn't the case!
    Expr getSize() {
        result = this.getArgument(2)
    }
}
from Copy c, Expr size
where
    c.getSize() = size and
    not size.isConstant()
select c.getLocation(), "dynamic copy"

This frequently comes in handy for vulnerability research, where sometimes you’ll see a codebase that has three different helper functions for the same behavior. Perhaps the first function is correctly performing internal bounds checks or what have you, but now you want to audit all the calls of the other two that maybe the developer forgot to keep up to date.

Exploit Development

Let’s say you find some kernel bug (with or without CodeQL’s help!) that permits you to double-free an allocation, and so have a primitive where you can have two pointers to the same memory block. One of the common things you then want to do is figure out which pointers are useful to you to overlap for an exploit. The way forward usually is to find some heap-allocated type, within a certain size so it goes in a specific heap bin, with a pointer at an offset you know you can modify with a user-controlled value via another type at the same offset.
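As a concrete picture of that overlap, here is a userspace sketch. Both structs are hypothetical and the layout assumes a typical LP64 target with 8-byte pointers; the byte array merely simulates one heap chunk that two owners believe is theirs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical types, assuming an LP64 target (8-byte pointers).
struct victim {
    std::uint64_t refcount;
    void (*callback)(void*);   // pointer we would like to control, offset 8
    unsigned char rest[176];
};

struct spray {
    std::uint64_t tag;
    std::uint64_t user_value;  // attacker-chosen bytes, also at offset 8
    unsigned char data[176];
};

// Same size window as the query in this section uses.
constexpr bool in_size_window(std::size_t n) { return n > 128 && n <= 256; }

static_assert(offsetof(victim, callback) == 8, "pointer sits at offset 8");
static_assert(offsetof(spray, user_value) == 8, "controlled data at offset 8");
static_assert(in_size_window(sizeof(victim)) && in_size_window(sizeof(spray)),
              "both types fall in the same size window");

// Simulates two objects sharing one chunk: write a spray object, then read
// the bytes where victim.callback would live.
inline std::uint64_t callback_bits_after_overlap(std::uint64_t user_value) {
    unsigned char chunk[256] = {};
    spray s = {};
    s.user_value = user_value;
    std::memcpy(chunk, &s, sizeof(s));

    std::uintptr_t bits = 0;
    std::memcpy(&bits, chunk + offsetof(victim, callback), sizeof(bits));
    assert(bits == user_value);  // holds under the LP64 assumption
    return bits;
}
```

The two constraints that make this pair useful (a matching field offset and a shared size window) are exactly the kind of thing you would otherwise dig out of DWARF by hand.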

I know unfortunate souls (myself included) who have resorted to custom clang passes or parsing out DWARF type information to try and locate useful allocations. Let’s find some with CodeQL instead:

CodeQL

import cpp
from Field f, Type ty, PointerType p, FunctionCall call
where
    f.getByteOffset() = 8 and
    f.getDeclaringType() = ty and
    ty.getSize() > 128 and
    ty.getSize() <= 256 and
    p.refersTo(ty) and
    call.getTarget().getName() = "kmalloc" and
    call.getActualType() = p
select call.getLocation(), "kmalloc", ty.getName(), "type allocated"

We can then expand our query to support kzalloc in addition to kmalloc, or to handle an allocated struct that contains another struct where a nested field is at the correct offset, etc.

The strength of CodeQL is being able to incrementally pare down the results from our query programmatically and express more and more constraints on what are actually useful types. (Or we might want to loosen constraints if we discover our bug is more powerful than we initially thought.) The alternative is having to manually review way too many results that aren’t actually useful. Even better, CodeQL lets you write queries and share them with other people so they can replicate the results or collaborate on query development. This approach is obviously better than being restricted to emailing a .txt file with struct layouts from a clang pass back and forth.

Fin

CodeQL is a powerful tool, and I personally find it useful throughout all stages of security research. It provides a way to ask questions and get answers about your code through each step of the compilation process. Using CodeQL enables you to replace a lot of other bespoke or hacked-together scripts you might have used to replicate the same experience. It exposes data at several different abstraction levels, enabling you to write a range of queries, from simply throwing together a script looking at variable names, to performing a complex dataflow analysis. CodeQL is a useful tool to have when conducting security research on hard targets.

Zetier is currently growing our vulnerability research team. If using CodeQL like this interests you, please contact us at careers@zetier.com or check out our other open positions.
