CodeQL for binaries: auditing them like source code

There is a wealth of source code auditing tools that search for known vulnerable patterns in code and flag potential bugs. But software is often integrated with closed-source or proprietary libraries, drivers, and firmware images. How do we apply source code analysis without source code?

Security researchers are often presented with black box binaries that only experienced reverse engineers and vulnerability researchers are able to analyze. Can we run code auditing tools on these binaries as well? Can we ingest a binary and dump a list of identified bugs?

Let’s explore the idea of performing a code audit of sorts on binaries through static analysis using Ghidra and Joern. This builds on work that was introduced at No Hat 2021; in the years since that presentation, code auditing tools, Joern among them, have come a long way. Together, we’ll work through a vulnerability that exists in the wild and investigate whether we can build a code auditing workflow to identify the bug in the binary.

Why audit code?

Users of source code analysis tools vary widely, ranging from software developers using SAST (static application security testing) all the way to security researchers who perform reverse engineering and vulnerability research. In this article, we’re focusing on the security researcher role since Joern is tuned towards users with a more advanced understanding of vulnerabilities.

SAST can be integrated into development pipelines through CI (continuous integration) to provide early bug detection so that bugs can be fixed early in the SDLC (software development life cycle). Early bug detection allows for fixes that are cheaper and more efficient, resulting in fewer vulnerabilities in the wild.

Many security researchers spend most of their time manually combing through collections of source code trying to spot bugs. Tools that perform source code analysis increase their efficiency, as they cover more ground through automated queries that can identify low-hanging fruit. Code auditing tools can become a force multiplier for vulnerability researchers.

As the adoption of AI-generated code has become widespread, we’re presented with new struggles. AI-assisted coding tools can generate code with security flaws, yet developers have misplaced confidence in AI-generated code as discussed in this study. AI-generated code is often difficult to understand and debug. SonarQube is one product that’s specifically marketed to “keep AI generated code clean.”

Source code analysis tools perform static analysis on source code, enabling bugs to be identified without actually executing the code. Many vulnerabilities follow recognizable patterns, making automated detection possible by searching for these patterns in code. Using control flow and dataflow analysis, bugs such as uninitialized use, double free, and out-of-bounds read can potentially be identified. Using taint analysis, a user can specify the attack surface of a binary and evaluate where it flows throughout the code to prioritize fixing bugs that are more likely to be exploited.
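To make the taint idea concrete, here is a minimal sketch in Python. It is not how any of the tools discussed here are implemented: the tuple-based program format, the variable names, and the find_tainted_calls helper are all hypothetical, and the program is straight-line code with no branches or function boundaries.

```python
# Toy straight-line program (hypothetical format). Each entry is either
#   ("assign", dest, [operands])  -> dest is computed from operands
#   ("call", func, [arguments])   -> func is called with arguments
PROGRAM = [
    ("assign", "buf", ["recv_data"]),            # buf holds network input
    ("assign", "size", ["buf"]),                 # size is parsed out of buf
    ("assign", "limit", []),                     # limit is a constant
    ("call", "memcpy", ["dst", "buf", "size"]),
    ("call", "memset", ["dst", "zero", "limit"]),
]


def find_tainted_calls(program, sources):
    """Return (function, tainted arguments) for every call that receives taint."""
    tainted = set(sources)
    hits = []
    for kind, name, operands in program:
        if kind == "assign":
            if any(op in tainted for op in operands):
                tainted.add(name)      # taint propagates through the assignment
            else:
                tainted.discard(name)  # overwritten with clean data
        elif kind == "call":
            tainted_args = [arg for arg in operands if arg in tainted]
            if tainted_args:
                hits.append((name, tainted_args))
    return hits


if __name__ == "__main__":
    # Only the memcpy call is reported: its source and size derive from recv_data.
    print(find_tainted_calls(PROGRAM, {"recv_data"}))
```

Real tools work on graphs rather than flat lists and must cope with branches, loops, and calls, but the core bookkeeping (mark the sources, propagate through assignments, report tainted arguments at interesting calls) is the same.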

Several static analysis tools aid in bug discovery. Among them are:

CodeQL: Query-based analysis of source code.
Coverity: Industry-grade static analysis with a focus on code quality.
Joern: Designed for vulnerability research.
SonarQube: Primarily used for code quality and security scanning.
Semgrep: Lightweight static analysis for pattern matching in source code.

In our previous blog post titled CodeQL for security research, another one of our engineers used CodeQL to find security flaws in C/C++ source code for the Linux kernel. He crafted queries to search for and identify potentially vulnerable calls to memcpy where the size argument is dynamic.

Can we do something similar with binaries… without having source code?

Code auditing without code

What happens when we’re working with a closed source library, driver, firmware image, or black box binary? Without source code, SAST tools come up short.

There are various reverse engineering tools that operate on binaries including:

Ghidra: Free, open-source tool with powerful decompiler and broad architecture support.
IDA Pro: Commercial disassembler/debugger known for deep analysis and plugin support.
Binary Ninja: User-friendly platform with strong automation and scripting capabilities.
Radare2: Highly scriptable toolkit for advanced users.

Additionally, symbolic execution tools, such as Angr, exist to enable deeper program analysis but require expert-level knowledge.
These tools equip vulnerability researchers to find bugs through manual analysis, but we want to find a way to search for bug patterns in an automated fashion.

Using Ghidra and Joern for binary analysis

Ghidra provides an architecture-agnostic intermediate language called p-code that facilitates cross-platform analysis. P-code is a register transfer language (expresses how data flows between registers and memory and how registers are modified) designed specifically for reverse engineering; it defines a set of generic operations that can model behavior across several different architectures. Machine instructions are lifted to this intermediate level, such that analysis can be performed by a common means rather than for each instruction set.
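The benefit of lifting can be sketched with a toy example in Python. This is not Ghidra's lifter or its API: the ToyOp class and the lift_arm, lift_x86, and count_additions helpers are hypothetical, and only a single instruction form per architecture is handled. INT_ADD is a genuine p-code opcode name, but real p-code for an x86 add also models the flag updates, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyOp:
    """One architecture-neutral operation: output = opcode(inputs)."""
    opcode: str
    output: str
    inputs: tuple


def _operands(text):
    """Split a comma-separated operand list into clean operand names."""
    return [part.strip() for part in text.split(",")]


def lift_arm(instruction):
    """Lift the three-operand ARM form `add rd, rn, rm` (toy subset)."""
    mnemonic, _, rest = instruction.partition(" ")
    ops = _operands(rest)
    if mnemonic == "add" and len(ops) == 3:
        return ToyOp("INT_ADD", ops[0], (ops[1], ops[2]))
    raise ValueError(f"unsupported ARM instruction: {instruction}")


def lift_x86(instruction):
    """Lift the two-operand x86 form `add dst, src`, where dst is also an input."""
    mnemonic, _, rest = instruction.partition(" ")
    ops = _operands(rest)
    if mnemonic == "add" and len(ops) == 2:
        return ToyOp("INT_ADD", ops[0], (ops[0], ops[1]))
    raise ValueError(f"unsupported x86 instruction: {instruction}")


def count_additions(ops):
    """An 'analysis' written once against the neutral form, not per architecture."""
    return sum(1 for op in ops if op.opcode == "INT_ADD")


if __name__ == "__main__":
    lifted = [lift_arm("add r0, r1, r2"), lift_x86("add eax, ebx")]
    for op in lifted:
        print(op)
    print(count_additions(lifted))
```

The point is that count_additions never needs to know which instruction set an operation came from.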

Joern traditionally operates on source code and generates a CPG (code property graph) through a fuzzy parser that operates on C/C++ code. The CPG combines AST (abstract syntax tree), CFG (control flow graph), and DFG (dataflow graph) to create a unified representation. This structure allows for efficient query-based vulnerability detection.
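As a rough mental model, a CPG can be pictured as one set of nodes connected by several kinds of labeled edges. The Python sketch below is a toy under that assumption, not Joern's schema: the ToyCpg class, the "AST", "CFG", and "DATA" edge labels, the node properties, and the read_len function name are all hypothetical.

```python
from collections import defaultdict


class ToyCpg:
    """One node set, several edge layers distinguished by label."""

    def __init__(self):
        self.nodes = {}                  # node id -> properties
        self.edges = defaultdict(list)   # (source id, label) -> target ids

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst):
        self.edges[(src, label)].append(dst)

    def out(self, src, label):
        return self.edges.get((src, label), [])


def build_example():
    """Model `n = read_len(); memcpy(dst, src, n);` as a tiny graph."""
    cpg = ToyCpg()
    cpg.add_node(1, kind="CALL", name="read_len")
    cpg.add_node(2, kind="CALL", name="memcpy")
    cpg.add_node(3, kind="IDENTIFIER", name="dst", order=1)
    cpg.add_node(4, kind="IDENTIFIER", name="src", order=2)
    cpg.add_node(5, kind="IDENTIFIER", name="n", order=3)
    for arg in (3, 4, 5):
        cpg.add_edge(2, "AST", arg)   # syntax: memcpy has three arguments
    cpg.add_edge(1, "CFG", 2)         # control flow: read_len runs before memcpy
    cpg.add_edge(1, "DATA", 5)        # dataflow: read_len's result reaches n
    return cpg


def memcpy_sizes_fed_by(cpg, producer):
    """Find memcpy calls whose third argument receives data from `producer`."""
    producers = [i for i, p in cpg.nodes.items()
                 if p["kind"] == "CALL" and p["name"] == producer]
    hits = []
    for node_id, props in cpg.nodes.items():
        if props["kind"] != "CALL" or props["name"] != "memcpy":
            continue
        args = sorted(cpg.out(node_id, "AST"), key=lambda a: cpg.nodes[a]["order"])
        if len(args) < 3:
            continue
        size_arg = args[2]
        # Combine layers: the producer must precede the call (CFG) and feed the size (DATA).
        if any(size_arg in cpg.out(p, "DATA") and node_id in cpg.out(p, "CFG")
               for p in producers):
            hits.append((node_id, size_arg))
    return hits


if __name__ == "__main__":
    print(memcpy_sizes_fed_by(build_example(), "read_len"))
```

The query touches all three layers in one pass, which is the appeal of keeping them in a single graph.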

In addition to its fuzzy C/C++ parser, Joern provides ghidra2cpg, which generates a CPG from p-code using Ghidra. By leveraging this, researchers can search for vulnerable code patterns by querying the CPG much like a user queries a database for specific information. Ultimately, this allows bug finders to be written as queries that can operate on binaries across multiple devices and architectures.

Challenges in Static Analysis

Before diving into a real-world example, let’s discuss some of the challenges encountered during static analysis.

Compiler optimizations, code inlining, and data packing introduce complexities that can make code difficult to analyze.
Some dynamic behavior can’t be captured when strictly performing static analysis. Some function pointers and vtable pointers can’t be resolved statically. Pointers in general may not be resolved without instrumented dynamic execution.
We may be stuck with incomplete information if libraries are loaded or linked dynamically.
Static analysis may result in false positives, where we flag a potential vulnerability, but upon manual analysis, it’s found to not actually be vulnerable. Imperfect dataflow and control flow may cause false negatives where an actual vulnerability isn’t found.
Static analysis can be resource-intensive, especially for complex binaries. Advanced techniques – such as interprocedural analysis (analysis across function boundaries) and context sensitivity (treat all call contexts as unique) – provide increased fidelity but result in increased memory usage.
Scalability of static analysis can be difficult. Analysis of highly complex binaries or firmware images may be limited to systems with enough compute and memory.

False positives in static analysis can become a big problem. For reverse engineers and vulnerability researchers who spend lots of time performing manual analysis of code, triaging through false positives isn’t a huge burden. But if we’re using static analysis in an automated fashion, false positives waste time on triaging bugs that aren’t actually bugs and lead to alert fatigue. When writing queries intended to be part of an automated process, we should keep this in mind. Oftentimes we can prune out false positives by manually triaging results for specific bug patterns, understanding characteristics specific to legitimate bugs, then crafting improved queries.

Tools such as Joern have greatly improved over the last several years, running more efficiently within a fixed set of compute resources. However, analysis of more complicated binaries or firmware images still presents a resource issue. With the recent popularity of cloud-based computing, resources are becoming increasingly available. In fact, scaling up compute and memory resources for things like static analysis was such a pain point that we created an in-house solution: WarpStations. WarpStations provides powerful Linux virtual desktop environments where resource constraints can be mitigated by increasing available compute and memory on the fly. If you’re analyzing a binary that causes your system to run out of memory, it can be increased rapidly, allowing work to continue.

Bug hunting in the wild

As an example use case for utilizing Joern to find vulnerable patterns in a binary, we pick a bug that has already been discovered and still exists in the wild. We pick on the Netgear R7000, a consumer-grade SOHO (small office/home office) router that has proven to be rife with security flaws. We focus on a simple stack overflow identified by GRIMM Cyber and detailed in this blog post. Firmware can be downloaded from Netgear’s website. For this example, we use version 1.0.9.88 (the same version featured in the blog post).

1. Processing the firmware

You can use Binwalk to extract the contents of the firmware image once it has been downloaded and unzipped:

$ wget https://www.downloads.netgear.com/files/GDC/R7000/R7000-V1.0.9.88_10.2.88.zip

R7000-V1.0.9.88_10.2.88.z 100%[==================================>]  30.18M  ‘R7000-V1.0.9.88_10.2.88.zip’ saved [31647028/31647028]

$ unzip R7000-V1.0.9.88_10.2.88.zip 
Archive:  R7000-V1.0.9.88_10.2.88.zip
  inflating: R7000-V1.0.9.88_10.2.88.chk  
  inflating: R7000-V1.0.9.88_10.2.88_Release_Notes.html

$ binwalk -e R7000-V1.0.9.88_10.2.88.chk

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
58            0x3A            TRX firmware header, little endian, image size: 31649792 bytes, CRC32: 0xF97175C3, flags: 0x0, version: 1, header size: 28 bytes, loader offset: 0x1C, linux kernel offset: 0x21E560, rootfs offset: 0x0
86            0x56            LZMA compressed data, properties: 0x5D, dictionary size: 65536 bytes, uncompressed size: 5436480 bytes

...

2. Loading into Joern

Let’s focus on the httpd binary which provides the HTTP server hosting the web interface for router configuration. This is a good place to start on routers, as the HTTP server provides an easy attack surface. The HTTP server accepts user requests through its web interface to allow for things like configuration of the router and software updates. The binary we’ll evaluate in Joern is located at: _R7000-V1.0.9.88_10.2.88.chk.extracted/squashfs-root/usr/sbin/httpd

Once Joern is installed, we’re ready to go.

First, use ghidra2cpg to generate a CPG that Joern can operate on later:

$ ghidra2cpg _R7000-V1.0.9.88_10.2.88.chk.extracted/squashfs-root/usr/sbin/httpd

...

------------------------------------------------

Linking the External Programs of 'httpd' to imported libraries...
  [libnat.so] -> not found in project
  [libnvram.so] -> not found in project
  [libacos_shared.so] -> not found in project
  [libcrypt.so.0] -> not found in project
  [libgcc_s.so.1] -> not found in project
  [libssl.so.1.0.0] -> not found in project
  [libcrypto.so.1.0.0] -> not found in project
  [libm.so.0] -> not found in project
  [libbdbroker.so] -> not found in project
  [libpthread.so.0] -> not found in project
  [libbdbroker_util.so] -> not found in project
  [libc.so.0] -> not found in project
------------------------------------------------

Resolving External Symbols of [/tmp/ghidra2cpg_tmp12154979022779944281/httpd] - 487 unresolved symbols, no external libraries configured - skipping

...

Applied data type archive: generic_clib
-----------------------------------------------------
    ARM Constant Reference Analyzer            9.579 secs
    ARM Symbol                                 0.002 secs
    ASCII Strings                              1.587 secs
    Apply Data Archives                        0.712 secs
    Call Convention ID                         0.967 secs
    Call-Fixup Installer                       0.053 secs
    Create Address Tables                      0.169 secs
    Create Address Tables - One Time           2.052 secs
    Create Function                            0.594 secs
    Data Reference                             0.631 secs
    Decompiler Switch Analysis                 6.649 secs
    Decompiler Switch Analysis - One Time      0.000 secs
    Demangler GNU                              0.030 secs
    Disassemble                                3.814 secs
    Disassemble Entry Points                   1.145 secs
    Embedded Media                             0.191 secs
    External Entry References                  0.005 secs
    Function Start Pre Search                  0.006 secs
    Function Start Search                      0.273 secs
    Function Start Search After Code           0.687 secs
    Function Start Search After Data           0.407 secs
    Function Start Search delayed - One Time   0.052 secs
    GCC Exception Handlers                     0.009 secs
    Non-Returning Functions - Discovered       0.640 secs
    Non-Returning Functions - Known            0.004 secs
    Reference                                  1.204 secs
    Shared Return Calls                        0.150 secs
    Stack                                      8.611 secs
    Subroutine References                      0.365 secs
-----------------------------------------------------
     Total Time   40 secs
-----------------------------------------------------

...

When ghidra2cpg is done, it will have generated a cpg.bin that contains the AST, CFG, and DFG for the httpd binary.

Launch Joern and load the CPG:

$ joern

     ██╗ ██████╗ ███████╗██████╗ ███╗   ██╗
     ██║██╔═══██╗██╔════╝██╔══██╗████╗  ██║
     ██║██║   ██║█████╗  ██████╔╝██╔██╗ ██║
██   ██║██║   ██║██╔══╝  ██╔══██╗██║╚██╗██║
╚█████╔╝╚██████╔╝███████╗██║  ██║██║ ╚████║
 ╚════╝  ╚═════╝ ╚══════╝╚═╝  ╚═╝╚═╝  ╚═══╝
Version: 4.0.267
Type `help` to begin
      
                                                                                                     
joern> importCpg(inputPath="cpg.bin", projectName="R7000")
Creating project `R7000` for CPG at `cpg.bin`
Creating working copy of CPG to be safe

...

val res0: Option[io.shiftleft.codepropertygraph.generated.Cpg] = Some(value = Cpg[Graph[537034 nodes]])

3. Initial investigation

As an aside, we load the httpd binary in Ghidra to locate the vulnerable code identified in the blog post. In Ghidra, the function FUN_0001cda4 corresponds to the function labeled abCheckBoardID in IDA in the aforementioned GRIMM blog post. Note that in the decompilation below, we have annotated some of the variable names, so they may not match the default Ghidra decompilation you see. The vulnerable call to memcpy is at address 0x1ce44. As we see in the decompilation from Ghidra below, the size argument for memcpy is computed from bytes of the buffer passed into this function (parameter user_input), which ultimately comes from user input received via recv.

undefined4 FUN_0001cda4(char *user_input)

{
  byte bVar1;
  byte bVar2;
  byte bVar3;
  byte bVar4;
  int iVar5;
  int iVar6;
  char *__s1;
  size_t size;
  undefined1 auStack_8c [40];
  char acStack_64 [64];
  
  memcpy(&DAT_00f42608,user_input,0x31);
  DAT_00f4263a = 0;
  iVar5 = strcmp(user_input,"*#$^");
  if (iVar5 == 0) {
    bVar1 = user_input[0x27];
    bVar2 = user_input[0x26];
    bVar3 = user_input[0x25];
    bVar4 = user_input[0x24];
    user_input[0x25] = '\0';
    user_input[0x24] = '\0';
    user_input[0x26] = '\0';
    size = (uint)(byte)user_input[7] + (uint)(byte)user_input[4] * 0x1000000 +
          (uint)(byte)user_input[6] * 0x100 + (uint)(byte)user_input[5] * 0x10000;
    user_input[0x27] = '\0';
    memset(auStack_8c,0,100);
    memcpy(auStack_8c,user_input,size);
    calculate_checksum(0,0,0);
    calculate_checksum(1,auStack_8c,size);
    iVar5 = calculate_checksum(2,0,0);
    iVar6 = FUN_0001cd84(acStack_64);
    if (iVar6 != 0) {
      DAT_001d0a9c = 1;
      strncpy(&DAT_001df0a4,acStack_64,0x3f);
      return 0;
    }
    DAT_001d0a9c = 0;
    acosNvramConfig_get("board_id");
    iVar6 = FUN_0001cd84();
    if (iVar6 == 0) {
      __s1 = (char *)acosNvramConfig_get("board_id");
      iVar6 = strcmp(__s1,acStack_64);
      if (iVar6 != 0) {
        return 0xffffffff;
      }
    }
    if (iVar5 == (uint)bVar1 + (uint)bVar4 * 0x1000000 + (uint)bVar2 * 0x100 + (uint)bVar3 * 0x10000
       ) {
      strncpy(&DAT_001df0a4,acStack_64,0x3f);
      return 0;
    }
  }
  return 0xffffffff;
}
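The size arithmetic in the decompilation above reads bytes 4 through 7 of user_input as a big-endian 32-bit value. The Python sketch below reproduces only that arithmetic and the magic check so the effect can be tried on sample input. The helper names are hypothetical, and treating the 100 bytes cleared by memset as the intended capacity of the destination is our assumption, not something the binary states.

```python
MAGIC = b"*#$^"
ASSUMED_CAPACITY = 100  # assumption: the 100 bytes that memset clears


def header_size(data: bytes) -> int:
    """Mirror the decompiled expression that computes `size`."""
    return data[7] + data[4] * 0x1000000 + data[6] * 0x100 + data[5] * 0x10000


def passes_magic_check(data: bytes) -> bool:
    """strcmp(user_input, "*#$^") == 0 needs the magic followed by a NUL byte."""
    return data[:5] == MAGIC + b"\x00"


def would_overflow(data: bytes) -> bool:
    """True when the copy would exceed the assumed destination capacity."""
    return passes_magic_check(data) and header_size(data) > ASSUMED_CAPACITY


if __name__ == "__main__":
    # Bytes 4..7 are 00 00 02 00, so the attacker-chosen size is 0x200.
    header = MAGIC + bytes([0x00, 0x00, 0x02, 0x00]) + b"A" * 32
    print(hex(header_size(header)), would_overflow(header))
```

Because strcmp requires byte 4 to be NUL, the most significant byte of the size is always zero on this path, but the remaining three bytes still allow sizes far larger than the local buffers.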

After performing some manual reverse engineering in Ghidra, we determine that data flows from recv to memcpy via the following function calls in the worker function FUN_000163a4:

Address	Function Called	Notes
`0x17868`	`FUN_00010d64`	Get data from `recv`
`0x17f28`	`memcpy`	Copies to dest + offset
`0x178fc`	`stristr`	Checks if user data contains “`mtenFWUpload`”
`0x1790c`	`stristr`	Strips trailing returns
`0x17934`	`FUN_0001cda4`	Calls `memcpy` with user influenced size

As an initial query in Joern, let’s list all function names in the CPG, and then verify that FUN_0001cda4 exists by querying for its method details:

joern> cpg.method.name.l
val res1: List[String] = List(
  "_init",
  "alphasort",
...
  "memcpy",
...
  "FUN_0001cda4",
...
  "<operator>.goto",
  "<operator>.compare"
)

joern> cpg.method.name("FUN_0001cda4").l
val res3: List[io.shiftleft.codepropertygraph.generated.nodes.Method] = List(
  Method(
    astParentFullName = "/home/user/Documents/joern_r7000/_R7000-V1.0.9.88_10.2.88.chk.extracted/squashfs-root/usr/sbin/httpd:<global>",
    astParentType = "NAMESPACE_BLOCK",
    code = """
undefined4 FUN_0001cda4(char *param_1)

{
  byte bVar1;
  byte bVar2;
  byte bVar3;
  byte bVar4;
  int iVar5;
  int iVar6;
  char *__s1;
  size_t __n;
  undefined1 auStack_8c [40];
  char acStack_64 [64];
  
  memcpy(&DAT_00f42608,param_1,0x31);
  DAT_00f4263a = 0;
  iVar5 = strcmp(param_1,"*#$^");
  if (iVar5 == 0) {
    bVar1 = param_1[0x27];
    bVar2 = param_1[0x26];
    bVar3 = param_1[0x25];
    bVar4 = param_1[0x24];
    param_1[0x25] = '\0';
    param_1[0x24] = '\0';
    param_1[0x26] = '\0';
    __n = (uint)(byte)param_1[7] + (uint)(byte)param_1[4] * 0x1000000 +
          (uint)(byte)param_1[6] * 0x100 + (uint)(byte)param_1[5] * 0x10000;
    param_1[0x27] = '\0';
    memset(auStack_8c,0,100);
    memcpy(auStack_8c,param_1,__n);
    calculate_checksum(0,0,0);
    calculate_checksum(1,auStack_8c,__n);
    iVar5 = calculate_checksum(2,0,0);
    iVar6 = FUN_0001cd84(acStack_64);
    if (iVar6 != 0) {
      DAT_001d0a9c = 1;
      strncpy(&DAT_001df0a4,acStack_64,0x3f);
      return 0;
    }
    DAT_001d0a9c = 0;
    acosNvramConfig_get("board_id");
    iVar6 = FUN_0001cd84();
    if (iVar6 == 0) {
      __s1 = (char *)acosNvramConfig_get("board_id");
      iVar6 = strcmp(__s1,acStack_64);
      if (iVar6 != 0) {
        return 0xffffffff;
      }
    }
    if (iVar5 == (uint)bVar1 + (uint)bVar4 * 0x1000000 + (uint)bVar2 * 0x100 + (uint)bVar3 * 0x10000
       ) {
      strncpy(&DAT_001df0a4,acStack_64,0x3f);
      return 0;
    }
  }
  return 0xffffffff;
}

""",
    columnNumber = Some(value = -1),
    columnNumberEnd = None,
    filename = "/home/user/Documents/joern_r7000/_R7000-V1.0.9.88_10.2.88.chk.extracted/squashfs-root/usr/sbin/httpd",
    fullName = "FUN_0001cda4",
    genericSignature = "<empty>",
    hash = None,
    isExternal = false,
    lineNumber = Some(value = 118180),
    lineNumberEnd = Some(value = -1),
    name = "FUN_0001cda4",
    offset = None,
    offsetEnd = None,
    order = 0,
    signature = "undefined FUN_0001cda4(void)"
  )
)

Note that we have “source code” for the function FUN_0001cda4 – we can dump it with the query cpg.method.name("FUN_0001cda4").dumpRaw. Apart from the variable names we annotated, this is the same decompilation we saw when we previously loaded the binary into Ghidra. Ghidra provides this decompilation by lifting to a C-like representation from the machine code in the binary. This is an important concept to understand since ghidra2cpg has captured this decompilation into the CPG we’re operating on in Joern. Decompilation provides an accurate-enough representation of the machine code, such that we can process it as C source code with Joern’s fuzzy parser.

Our flawed CPG

Dataflow analysis is a staple of static analysis tools such as Joern. It lets us track how data moves through a program without executing it. We can define a source as a point in the code where we know user-controlled data enters the program, such as a call to recv. From there, we can find every place where this user-controlled data propagates (taints the code), stopping where a variable is redefined in such a way that taint no longer propagates. We can also search backwards from a sink, which we define as a critical portion of the code such as a system call or a call to execve, and walk backwards to determine whether there is dataflow from a user-controlled source. Dataflow can sometimes be incomplete in static analysis because we lack execution details that may impact code flow, such as in the case of function pointers.
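The backward search described above can be sketched as a breadth-first walk over a "flows into" relation. This Python example is a toy, not Joern's dataflow engine: the reachable_by helper, the FLOWS_INTO map, and the node names in it are hypothetical.

```python
from collections import deque


def reachable_by(flows_into, sink, sources):
    """Walk backwards from `sink`; return one source-to-sink path, or None.

    `flows_into` maps a node to the nodes whose data flows directly into it.
    """
    queue = deque([[sink]])
    seen = {sink}
    while queue:
        path = queue.popleft()
        head = path[0]
        if head in sources:
            return path                    # path already reads source -> ... -> sink
        for pred in flows_into.get(head, []):
            if pred not in seen:
                seen.add(pred)
                queue.append([pred] + path)
    return None


# Hypothetical def-use facts for a small program.
FLOWS_INTO = {
    "execve.arg": ["cmd"],
    "cmd": ["request_field"],
    "request_field": ["recv.buf"],
    "log.arg": ["banner"],
}

if __name__ == "__main__":
    print(reachable_by(FLOWS_INTO, "execve.arg", {"recv.buf"}))
    print(reachable_by(FLOWS_INTO, "log.arg", {"recv.buf"}))
```

If an edge is missing from the relation, for instance because a call goes through an unresolved function pointer, the walk simply stops short. That is how incomplete dataflow turns into a false negative.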

We currently have the CPG loaded that was generated from ghidra2cpg. It’s possible to perform dataflow queries across this CPG to try to find a path between recv and memcpy, but in practice, dataflow on CPGs generated from Ghidra is imperfect. Why? Because it operates on p-code operations rather than decompilation (the C-like representation in Ghidra). The CPG contains decompilation in the code attribute of each method, but it exists strictly for reference. Joern analysis executes based on the p-code representation. A simple dataflow query – such as from a parameter of FUN_0001cda4 to an argument of memcpy – does find the path we’re looking for.

joern> def source = cpg.method.name("FUN_0001cda4").parameter
def source:
  Iterator[io.shiftleft.codepropertygraph.generated.nodes.MethodParameterIn]
                                                                                                                                        
joern> def sink = cpg.call.name("memcpy").argument
def sink: Iterator[io.shiftleft.codepropertygraph.generated.nodes.Expression]
                                                                                                                                        
joern> sink.reachableByFlows(source).p

When we expand the dataflow search across multiple functions (e.g., across FUN_000163a4), Joern fails to find the dataflow paths that we identified manually. This appears to happen because the p-code representation within the CPG doesn’t flow data correctly in all cases.

Another issue is that we don’t have flows for library calls such as stristr, recv, and memcpy from libc, which causes issues for dataflow. This is a common issue in static analysis, as binaries often link in other libraries that we may not have source code or binaries for. This can be overcome by defining flows based on Joern’s documentation, but this is a tedious process that requires expert knowledge.
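To make the idea of a library flow definition concrete, here is a minimal standalone Python sketch. It is not Joern’s semantics format; it only models each library call with a hand-written summary of which argument flows into which, and shows how a taint path breaks when those summaries are missing. The SOURCES and SUMMARIES tables, the propagate helper, and the call sequence are all hypothetical and exist purely for illustration.

```python
from typing import Dict, List, Set, Tuple

Call = Tuple[str, List[str]]  # (function name, argument variable names)

# Hypothetical source table: the listed argument (1-based) receives user data.
SOURCES: Dict[str, int] = {"recv": 2}

# Hypothetical flow summaries: (from_arg, to_arg), both 1-based.
SUMMARIES: Dict[str, List[Tuple[int, int]]] = {
    "strcpy": [(2, 1)],  # src flows into dest
    "memcpy": [(2, 1)],  # src flows into dest
}


def propagate(calls: List[Call], summaries: Dict[str, List[Tuple[int, int]]]) -> Set[str]:
    """Return the variables tainted after walking a straight-line call sequence."""
    tainted: Set[str] = set()
    for name, args in calls:
        if name in SOURCES:
            tainted.add(args[SOURCES[name] - 1])
        # Without a summary for this function, taint simply stops here.
        for src, dst in summaries.get(name, []):
            if args[src - 1] in tainted:
                tainted.add(args[dst - 1])
    return tainted


calls = [
    ("recv", ["sock", "buf", "len", "flags"]),
    ("strcpy", ["tmp", "buf"]),
    ("memcpy", ["out", "tmp", "n"]),
]

print(sorted(propagate(calls, SUMMARIES)))  # taint reaches the memcpy destination
print(sorted(propagate(calls, {})))         # no summaries: taint never leaves buf
```

With summaries in place the taint from recv reaches the memcpy destination; with an empty table it stops at the first library call, which mirrors the missing-flow problem described above.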

Basic queries designed for source code may also fall apart when run on our CPG generated from Ghidra. For example, a search for calls to memcpy where the third argument is non-literal returns every call to memcpy, because in p-code the argument is represented by the register that holds it (in this case, r2 for ARM). Joern interprets this as a non-literal even if previous p-code operations copy a literal into the register.
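The register problem can be illustrated with a small standalone Python sketch. It assumes a heavily simplified, p-code-like list of COPY operations (not Ghidra’s actual data structures) and walks backwards from the call to see whether a constant reaches the size register. The tuple layout, the register names, and the resolve_constant helper are assumptions made for this example, and the walk only handles straight-line code.

```python
from typing import List, Optional, Tuple, Union

# (opcode, output register, input), where input is a constant or another register.
Op = Tuple[str, str, Union[int, str]]


def resolve_constant(ops: List[Op], register: str) -> Optional[int]:
    """Walk backwards through ops looking for a constant that reaches register."""
    target = register
    for opcode, out, inp in reversed(ops):
        if opcode == "COPY" and out == target:
            if isinstance(inp, int):
                return inp  # a literal was copied into the register
            target = inp    # follow a register-to-register copy
    return None


# Hypothetical operations leading up to a call to memcpy on ARM.
ops_before_call: List[Op] = [
    ("COPY", "r0", "sp_dest"),
    ("COPY", "r1", "r4"),
    ("COPY", "r3", 8),
    ("COPY", "r2", "r3"),
]

print(resolve_constant(ops_before_call, "r2"))  # size register resolves to a literal
print(resolve_constant(ops_before_call, "r1"))  # source register does not
```

A query that only inspects the call’s argument sees r2 and reports a non-literal; recovering the constant needs this kind of backward resolution, which the decompiler has already done for us in its C-like output.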

4. Preparing for analysis

Given these issues, we’ll take an alternate approach that builds on the decompilation provided by the CPG generated from Ghidra. Remember that our currently loaded CPG has decompilation for all the functions in the httpd binary we’re analyzing. We can write a query that searches for all functions that call memcpy, and then dump their decompilation to a file that we can process as a C source file. The C-style decompilation that Ghidra provides isn’t strictly valid C and isn’t directly compilable out of the box. This is still acceptable for Joern, since its fuzzy parsing can process this loose C-style representation. Ghidra often has to infer function arguments and their types, so this process won’t be perfect in all scenarios.

Note that the query below can be modified to dump decompilation for callers of other interesting functions, such as strcpy.

joern> import java.io._
val memcpyCallers = cpg.method.where(_.call("memcpy"))
val pw = new PrintWriter(new File("memcpy_callers.c"))
memcpyCallers.dumpRaw.foreach(pw.println)
pw.close()

This results in a memcpy_callers.c source file that we can ingest into Joern.

joern> importCode(inputPath="./memcpy_callers.c", projectName="R7000-memcpy-c")

This generates and loads the CPG from the C-style source code we generated from the previous step. We’re now ready to query for vulnerable patterns in this binary as though we had its source code!

Querying for the known bug

Because we already understand the bug from prior investigation, we can work backwards and attempt to write a query that will identify the vulnerable pattern. From here we can do several things. We can use this query to search and see if the bug exists in other versions of firmware or other similar devices. Router manufacturers often use the same code base for various products and revisions of products, so bugs can exist across several different devices. We can also use this query to find vulnerable patterns in other binaries we work with in the future.

For this bug, data flows from the buffer parameter of recv ultimately to the size parameter of memcpy. This means that user data being read from recv can control, or at least influence, the size parameter of memcpy, and thus potentially cause a buffer overflow. Since the destination variable of the vulnerable call to memcpy is on the stack, this would result in a stack buffer overflow that could be very easy to exploit.

A prototype for memcpy is provided below.

void * memcpy(void * dest, const void * src, size_t n);

Our approach is modeled on querying for a basic pattern in code that would raise suspicion. We’ll write a query that searches for calls to memcpy where the third argument (size) isn’t a literal. We want to skip instances where the third argument is constant (e.g., 8), since those are unlikely to be the kind of call that leads to a buffer overflow.

joern> cpg.call("memcpy").whereNot(_.argument(3).isLiteral).map(c => s"${c.method.name} @ line ${c.lineNumber.getOrElse("??")}: ${c.code}").l
val res17: List[String] = List(
  "FUN_00010af8 @ line 47: memcpy(__dest_00,param_3,param_5)",
  "FUN_00010af8 @ line 48: memcpy((void *)((int)__dest_00 + param_5),param_2,param_4)",
  "FUN_00010af8 @ line 77: memcpy((void *)((int)__dest + 0x28),acStack_b4,__n)",
  "FUN_00010af8 @ line 87: memcpy((void *)((int)param_1 + 0x28),acStack_b4,__n)",
  "FUN_00010af8 @ line 88: memcpy((void *)((int)param_1 + iVar3),__dest_00,__n_00)",
  "FUN_000163a4 @ line 2911: memcpy(acStack_10ee4 + iVar20,local_ee0,sVar25)",
  "FUN_000163a4 @ line 3014: memcpy(pcVar26,local_ee0,sVar25)",
  "FUN_000163a4 @ line 3053: memcpy(local_8dc,__src,uVar39)",
  "FUN_000163a4 @ line 3189: memcpy(DAT_001df0a0,acStack_10ee4 + iVar27,(size_t)pcStack_10f18)",
  "FUN_000163a4 @ line 3223: memcpy(pcStack_10f04,local_ee0,sVar25)",
  "FUN_0001cda4 @ line 3775: memcpy(auStack_8c,param_1,__n)",
  "FUN_000277cc @ line 5787: memcpy((char *)((int)__dest + iVar7),pcVar5,sVar8)",
  "FUN_000277cc @ line 5796: memcpy((char *)((int)__dest + iVar7),acStack_11c,sVar8)",
  "FUN_000277cc @ line 5804: memcpy((char *)((int)__dest + iVar13),acStack_11c,sVar8)",
  "FUN_000277cc @ line 5817: memcpy((char *)((int)__dest + iVar13),acStack_11c,sVar8)",
  "FUN_000277cc @ line 5826: memcpy((char *)((int)__dest + iVar13),acStack_11c,sVar8)",
  "FUN_000277cc @ line 5834: memcpy((char *)((int)__dest + iVar7),acStack_11c,sVar8)",
  "FUN_000277cc @ line 5896: memcpy((char *)((int)__dest + iVar7),acStack_9c,sVar8)",
  "FUN_000277cc @ line 5907: memcpy((char *)((int)__dest + iVar7),acStack_5c,sVar8)",
  "FUN_00031394 @ line 6278: memcpy(auStack_384,param_1,param_2)",
  "FUN_000315cc @ line 6403: memcpy(auStack_4cc + sVar4,local_14cc,__n)",
  "FUN_000315cc @ line 6488: memcpy(acStack_38cc + sVar3,local_14cc,__n)",
  "getStatsFromFile @ line 6877: memcpy(param_1,acStack_38,__n)",
  "FUN_00092574 @ line 13883: memcpy(acStack_120,param_1,(int)pcVar3 - (int)param_1)",
  "FUN_000a5ac8 @ line 15043: memcpy(__s,param_4,param_5)",
  "FUN_000c526c @ line 15495: memcpy(acStack_228,acStack_1a8,(int)__src - (int)acStack_1a8)"
)

This results in 26 calls to memcpy that are potentially vulnerable, but we can prune the list even further. We’re interested in stack buffer overflows that are easily exploitable, so we can refine the search to find calls to memcpy where the first argument (destination) is a variable stored on the stack. Ghidra’s decompilation automatically assigns names to variables to differentiate them, and many variables stored on the stack are given names that contain Stack.

Note that below is a quick-and-dirty query that could result in false negatives (e.g., failing to find pointers to a stack variable). But for our purposes of identifying a known bug, it works.

joern> cpg.call("memcpy").whereNot(_.argument(3).isLiteral).where(_.argument(1).code(".*Stack.*")).map(c => s"${c.method.name} @ line ${c.lineNumber.getOrElse("??")}: ${c.code}").l
val res15: List[String] = List(
  "FUN_000163a4 @ line 2911: memcpy(acStack_10ee4 + iVar20,local_ee0,sVar25)",
  "FUN_000163a4 @ line 3223: memcpy(pcStack_10f04,local_ee0,sVar25)",
  "FUN_0001cda4 @ line 3775: memcpy(auStack_8c,param_1,__n)",
  "FUN_00031394 @ line 6278: memcpy(auStack_384,param_1,param_2)",
  "FUN_000315cc @ line 6403: memcpy(auStack_4cc + sVar4,local_14cc,__n)",
  "FUN_000315cc @ line 6488: memcpy(acStack_38cc + sVar3,local_14cc,__n)",
  "FUN_00092574 @ line 13883: memcpy(acStack_120,param_1,(int)pcVar3 - (int)param_1)",
  "FUN_000c526c @ line 15495: memcpy(acStack_228,acStack_1a8,(int)__src - (int)acStack_1a8)"
)

Now we have a list of eight calls to memcpy where the first argument is stored on the stack and the third argument is non-literal. The call to memcpy in FUN_0001cda4 is among them. There are also several others that we could investigate to see whether user input flows to, or at least influences, the size argument of memcpy.
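To see what the name-based heuristic catches and what it misses, the same two filters can be re-applied outside Joern. The following is a minimal Python sketch, not part of the Joern workflow: split_args, LITERAL, and is_suspicious are hypothetical helpers, the first three call strings are taken from the query output above, and the fourth is an invented call with a literal size.

```python
import re
from typing import List


def split_args(call: str) -> List[str]:
    """Split the arguments of one call expression on top-level commas."""
    inner = call[call.index("(") + 1 : call.rindex(")")]
    args, depth, current = [], 0, ""
    for ch in inner:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            args.append(current.strip())
            current = ""
        else:
            current += ch
    args.append(current.strip())
    return args


# Decimal or hexadecimal integer literal.
LITERAL = re.compile(r"^(0[xX][0-9a-fA-F]+|\d+)$")


def is_suspicious(call: str) -> bool:
    """Destination name contains 'Stack' and the size is not a literal."""
    dest, _src, size = split_args(call)
    return "Stack" in dest and LITERAL.match(size) is None


calls = [
    "memcpy(auStack_8c,param_1,__n)",                       # the known bug
    "memcpy((void *)((int)__dest + 0x28),acStack_b4,__n)",  # destination is a pointer
    "memcpy(local_8dc,__src,uVar39)",                       # no 'Stack' in the name
    "memcpy(auStack_40,param_1,0x10)",                      # hypothetical literal size
]

for call in calls:
    print(is_suspicious(call), call)
```

Only the first call is flagged. The third is worth a second look: names such as local_8dc are typically also stack locals under Ghidra’s default naming, so it is a plausible example of the false negatives this heuristic can produce.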

These queries may return results that contain false positives, so they need to be manually triaged. However, the query results provide us a set of suspicious calls to memcpy in an automated fashion.
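For manual triage it can help to group the hits by enclosing function, so each function is opened in Ghidra once. Below is a small Python sketch that parses result strings in the "function @ line N: code" format printed by the query above. The RESULT pattern and the group_by_function helper are assumptions made for this example; the input lines are copied from the output above.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches the "name @ line N: code" strings produced by the query's map step.
RESULT = re.compile(r"^(?P<func>\S+) @ line (?P<line>\d+|\?\?): (?P<code>.+)$")


def group_by_function(results: List[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (line, code) pairs by the function that contains the call."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for entry in results:
        match = RESULT.match(entry)
        if match is None:
            continue  # skip anything that is not a result line
        grouped[match.group("func")].append((match.group("line"), match.group("code")))
    return dict(grouped)


results = [
    "FUN_000163a4 @ line 2911: memcpy(acStack_10ee4 + iVar20,local_ee0,sVar25)",
    "FUN_000163a4 @ line 3223: memcpy(pcStack_10f04,local_ee0,sVar25)",
    "FUN_0001cda4 @ line 3775: memcpy(auStack_8c,param_1,__n)",
    "FUN_00031394 @ line 6278: memcpy(auStack_384,param_1,param_2)",
    "FUN_000315cc @ line 6403: memcpy(auStack_4cc + sVar4,local_14cc,__n)",
    "FUN_000315cc @ line 6488: memcpy(acStack_38cc + sVar3,local_14cc,__n)",
    "FUN_00092574 @ line 13883: memcpy(acStack_120,param_1,(int)pcVar3 - (int)param_1)",
    "FUN_000c526c @ line 15495: memcpy(acStack_228,acStack_1a8,(int)__src - (int)acStack_1a8)",
]

grouped = group_by_function(results)
for func, hits in grouped.items():
    print(f"{func}: {len(hits)} call(s) to review")
```

The eight hits collapse into six functions, which is the actual amount of decompilation that needs to be read.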

Summary

We’ve shown that source code analysis techniques can be applied to a black-box binary. By extracting C-style decompilation from a CPG generated with Ghidra, we were able to run Joern queries on the binary – queries that are typically used to find vulnerable patterns in source code. Using a real-world device with a known bug, we identified the vulnerable pattern and uncovered additional instances worth investigating.

Where do we go from here?

Extending this work

Homework

As a homework assignment, the reader can search to see if the same bug exists in a different version of Netgear R7000 firmware. Can you find the bug in the latest firmware release 1.0.12.216? Different versions of firmware may have slight modifications to the code, so locations and implementations may be different. Can you find the bug in the latest firmware release 1.0.2.26 for the Netgear R6700 device?

Other patterns

The vulnerable usage of memcpy we demonstrated here is only a brief sample from the wealth of bugs that exist out in the wild. Various other patterns can be identified using similar approaches.

We can search for vulnerable calls to strcpy and strcat that can result in similar buffer overflows. We can track how data flows from calls to functions like recv, read, and getenv to see if it reaches functions such as memcpy. We can search for command injection through improper use of exec, system, or popen. We can search for vulnerable calls to printf or sprintf, or for usage of strcpy or memcpy without bounds checking.

More advanced queries can find use-after-free bugs or dangling pointers. We can find double-free bugs by searching for paths in dataflow where multiple calls to free are possible. We can find cases where malloc isn’t checked for a null return.

Some basic searches for hard-coded secrets or credentials are easy to implement and run directly on the CPG generated from Ghidra.

joern> cpg.literal.code(".*password.*").location.l

joern> cpg.literal.code(".*secret.*").location.l
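As a cross-check outside Joern, a similar keyword search can be run over the raw bytes of the binary, in the spirit of running strings and filtering the output. This is a different technique from the CPG queries above, shown only as a rough Python sketch; the find_keyword_strings helper and the sample bytes are hypothetical.

```python
import re
from typing import List


def find_keyword_strings(data: bytes, keywords: List[str], min_len: int = 4) -> List[str]:
    """Return printable ASCII runs in data that contain any of the keywords."""
    # Runs of at least min_len printable ASCII characters.
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    hits = []
    for match in pattern.finditer(data):
        text = match.group().decode("ascii")
        if any(keyword in text.lower() for keyword in keywords):
            hits.append(text)
    return hits


# Hypothetical bytes standing in for a region of a firmware binary.
sample = b"\x00\x01http_password=%s\x00\xffabc\x00shared_secret_key\x00GET /\x00"

print(find_keyword_strings(sample, ["password", "secret"]))
```

To scan a real file, the sample could be replaced with the contents of the binary read in "rb" mode. Unlike the CPG query, this approach reports no code location, so it is best treated as a quick sanity check.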

Improper or unsafe use of APIs can be found, and for cryptographic APIs we can search for hard-coded keys or improper configuration.

Joern provides a query database that contains samples of queries, including some that can be run out of the box. This is a great starting point, but the possibilities are practically endless.

If you thrive on bug-hunting, be sure to check out the available cyber engineering jobs at Zetier.

Happy hunting!

Illustration by Rebecca DeField.
