Symbolic execution with SymCC: Don't interpret, compile!

This paper describes SymCC, a concolic execution engine that aims to improve concolic execution performance by compiling symbolic handling directly into the target binary. The authors claim that SymCC is, on average, 12 times faster than KLEE, and that this speed advantage lets it achieve higher coverage.

The paper starts with an overview of symbolic and concolic execution. In particular, it notes that symbolic execution's usability is limited by its performance, which is slow largely because of interpretation overhead. Then, it introduces "compilation-based symbolic execution," and explains IR-based and IR-less symbolic execution. I wrote about this in my post on QSym, but it's essentially a tradeoff between performance and implementation complexity: IR-less execution is faster but requires much more code.

Compilation Based Symbolic Execution

Next, the paper explains compilation-based symbolic execution, which aims to eliminate interpretation overhead by "compiling symbolic handling of computations into the target program." SymCC does this at the IR level instead of the source-language level to remain language agnostic. At compile time, SymCC searches the IR for operations that may involve symbolic variables. Then, it inserts calls to a runtime library that handles the symbolic reasoning. By doing this at compile time instead of at runtime, it reduces the overall work done and allows the compiler to further optimize the instrumented program.

These runtime library calls do add significant overhead compared to purely concrete execution. However, the result is still much faster overall than interpreting the IR.

There are some cases where it's impossible to know at compile time whether a variable is symbolic. In these cases, SymCC inserts a runtime conditional that only performs symbolic handling when it has to.

Implementation

The authors note that SymCC's codebase is relatively small: just around a thousand lines for the compiler pass and another thousand for the runtime library.

Compile-time instrumentation

SymCC's compiler pass runs in the "middle-end" of compilation, right after the source language is translated to LLVM IR and before any LLVM optimizations occur. The authors note that there is a tradeoff between implementation complexity and runtime speed: adding the compiler pass after LLVM's optimizations would result in more complexity but faster instrumentation and execution.

The paper also addresses the idea that symbolic/concolic execution should be run on the final binary rather than the IR in order to catch bugs introduced by compilation. The authors note that it would be possible to extend SymCC to run at the last LLVM optimization phase, just before binary code generation, but leave this to future work.

External Environment Boundary

The paper notes that most programs interact with their environment (e.g. creating files) in ways that are difficult to model with symbolic execution. To address this, SymCC takes the same approach as QSym, treating any value obtained from uninstrumented code as concrete. However, the authors implement wrappers around common functions like memset and memcpy, whose implementations rely on uninstrumentable code (inline assembly or syscalls), so that symbolic data flowing through them is not needlessly concretized.

Snapshotting?

The authors don't mention snapshotting as a way to avoid redundant execution. I believe they forgo it, as QSym does: since SymCC's execution is fast, snapshotting is probably not worth the overhead.

Evaluation

The rest of the paper shows that SymCC is faster than the state of the art, achieving higher coverage and finding more bugs. One interesting note is that it might be possible to combine SymCC with QSym, using SymCC when source code is available and QSym when only the machine code is.