Prototype Concept
Monster is a Symbolic Execution Engine for RISC-U binary code. Given a RISC-U binary and any pointer into the binaries code segment, Monster can compute a concrete input, where the program would reach that point (with a high probability). On top of that Monster can also compute a input for a constrained memory location at a specific point in the binary (with a high success probability).
Machine Model
In order to describe the engine, the following example of C* code will be used:
Pseudocode:
uint64_t input = ?;
uint64_t x1 = input + 5;
uint64_t x2 = 10;
uint64_t x3 = x2 - x1;
exit(x3);
RISC-U:
addi t0, zero, 10
addi t1, a0, 5
sub t1, t1, t0
addi a0, t1, 0
addi a7, zero, 93
ecall
This code should be evaluated partially, where we differentiate between Concrete and Symbolic values. Concrete values are of type unsigned 64-bit integer, whereas symbolic values only consist of a name (e.g. x0) for a SMT formula stored somewhere else. We will also use a Uninitialized value to model memory cells, which have not been initialized yet. Now we need to model the state of the machine to be able execute instructions for it. Keep in mind, that we defined, that our input is in register a0 and therefore this register has to contain already an symbolic value.
Registers | Memory | Formula | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
Every memory cell or register, which is not explicitly mentioned here is Uninitialized. With this the initial state of the machine is defined and we can start executing.
Symbolic Execution
Given the first instruction addi t0, a0, 5
, the symbolic value in register a0 has to be added with a concrete value 5 and the result has to be stored in t0.
For that we introduce a new label and store the formula
addi t0, zero, 10
:
Registers | Formula | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
addi t1, a0, 5
:
Registers | Formula | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
sub t1, t1, t0
:
Registers | Formula | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
addi a0, t1, 0
:
Registers | Formula | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
addi a7, zero, 93
:
Registers | Formula | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
ecall
:
Registers | Formula | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
After the exit ecall we have to decide, how the query to the SMT solver will look like.
In this case an exit_code > 0
is interesting for us, because that is an error condition for a program.
The exit_code
at this point should be in register a0.
Now that the constraint is defined and we know formula x2 represents exit_code
, the SMT formula can be written down as following:
x1 = x0 + 5 and x2 = 10 - x1 and x2 > 0
We can now write SMT-lib code for this formula, which would look like this:
(set-option :produce-models true)
(set-option :incremental true)
(set-logic QF_BV)
(declare-fun x0 () (_ BitVec 64)) ; symbolic input x0 has no constrains (assert)
(declare-fun x1 () (_ BitVec 64))
(assert (= x1 (bvadd x0 (_ bv5 64)))) ; x1 = x0 + 5
(declare-fun x2 () (_ BitVec 64))
(assert (= x2 (bvsub (_ bv10 64) x1))) ; x2 = 10 - x1
(assert (bvugt x2 (_ bv0 64))) ; x2 > 0
(check-sat)
(get-model)
(exit)
Of course that is a very explicit representation of this formula. Someone could easily rewrite that to the following formula and therefore omit many symbolic temporaries:
10 - (x0 + 5) > 0
For this reason, we do not use the tabular approach to represent the formula, but rather use a graph. To be precise, we are using a data flow graph.
Data Flow Graph
Using the example from before, we can show all the intermediate graphs during program execution:
addi t0, zero, 10
addi t1, a0, 5
sub t1, t1, t0
addi a0, t1, 0
addi a7, zero, 93
ecall
With this graph as a result, an SMT formula can trivially be constructed by building a formula for every "Constraint" node. An algorithm could start at a constraint and track all it's input edges to build the formula. This gives us a way more concise SMT formula:
(set-option :produce-models true)
(set-option :incremental true)
(set-logic QF_BV)
(declare-fun x0 () (_ BitVec 64))
(assert (bvugt (bvsub (_ bv10 64) (bvadd x0 (_ bv5 64))) (_ bv0 64)))
; 10 - (x0 + 5) > 0
(check-sat)
(get-model)
(exit)
Control Flow
Up until now, there was no control flow statement involved in our program. In this section, we will describe control flow in our symbolic execution engine. Monster generates a so called Control Flow Graph, which encaptures all possible execution paths of a program in one graph. Let's take a different example with non-trivial control flow:
Here is the C* pseudocode of the program:
uint64_t input = ?;
uint64_t x1 = input + 5;
uint64_t x2 = 10;
uint64_t exit_code;
if (x1 == x2)
exit_code = x2;
else
exit_code = x1;
exit(exit_code);
RISC-U
addi t0, zero, 10
addi t1, a0, 5
beq t0, t1, 2
sd sp(0), t0
jal zero, 2
sd sp(0), t1
ld sp(0), a0
addi a7, zero, 93
ecall
The Control Flow Graph for this program looks like this:
It is a straight forward Control Flow Graph, where every node is a RISC-U instruction and an edge encodes an optional constrain when branch decisions are made.
Candidate Path
With Monster we restrict our engine to execute one full path from program entry point to a specified exit point at a time. This implies, that we have to generate so called candidate paths, which are executed symbolically. Candidate paths are the results of computing all shortest paths in the control flow graph. We start by picking the shortest first.
In the given example the first candidate path to be executed is:
At this point, this path can be executed symbolically, similarly to the first example.
The only difference is, that we have encoded an additional constraint (reg(t0) == reg(t1)
) at one point in the path.
Conditional Symbolic Branching
So let's execute it symbolically and inspect the system state.
addi t0, a0, 5
addi t1, zero, 10
beq t0, t1, 2
Registers | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
|
Execution of this path begins to differ when a beq
instruction is reached.
For this instruction a constraint has to be modeled in the dataflow graph.
This constrain is part of the so called "path condition".
It reflects the decision made when branching.
After branching we have to symbolically execute a sd
instruction.
Memory Access
sd sp(0), t0
So far, I did not discuss the stack pointer at all, but the stack pointer in this case is clearly needed to calculate the address of the memory access. A big question here is, what kind of value (symbolic/concrete) the stack pointer has. For a concrete stack pointer value, the instruction has the same semantics as in normal execution. In Monster we initialize the stack pointer with a concrete value depending on the memory size. Also in RISC-U programs, if the stack pointer is intialized with a concrete value, it always stays concrete by construction (Selfie cstar).
Monster can not handle symbolic addresses for a memory access!
So for this example, sd
and ld
have the exact same semantics as in normal execution.
This gives us the following end result:
jal zero, 2
ld sp(0), a0
addi a7, zero, 93
ecall
Registers | Memory | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
Now that this results in 2 constraints, SMT-lib looks kind of different for this graph since 2 assertions have to be made.
One comes from the path condition (reg(t0) != reg(t1)
) and one is our output constraint (reg(a0) > 0
).
(set-option :produce-models true)
(set-option :incremental true)
(set-logic QF_BV)
(declare-fun x0 () (_ BitVec 64)) ; symbolic input x0 has no constrains (assert)
(assert (not (= (bvadd x0 (_ bv5 64)) (_ bv10 64)))) ; path constrain
(assert (bvugt (bvadd x0 (_ bv5 64)) (_ bv0 64))) ; output constrain
(check-sat)
(get-model)
(exit)