Learning from 1.2k bugs (part I: "It's a science experiment!")

Introduction

Welcome to a series of posts on vulnerability detection using Machine Learning with a large corpus of bugs. We are researching how to extract runtime information in order to learn to predict whether a program looks vulnerable just by looking at some of its execution events. A few of the events that we believe contain useful information are listed below (a sketch of how one such event could be recorded follows the list):

  • calls to functions in the standard C library (strcpy, printf, free, etc.), including:
    • arguments
    • return value
    • return address
  • signals (SIGSEGV, SIGFPE, etc.)
  • final state of a process (exit, abort, crash or timeout)
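
To make the idea concrete, here is a minimal sketch of how a single execution and its events could be recorded. The class and field names are hypothetical (they are not Ocean's actual internal representation), and the return address below is a made-up value:

from collections import namedtuple

# Hypothetical record for one intercepted libc call: the callee name,
# its arguments, the return value and the return address.
LibcCall = namedtuple("LibcCall", ["name", "args", "ret_val", "ret_addr"])

# Hypothetical record for a whole execution: the sequence of libc calls,
# any signals received and the final state of the process
# (exit, abort, crash or timeout).
Execution = namedtuple("Execution", ["calls", "signals", "final_state"])

example = Execution(
    calls=[LibcCall("strrchr", ["/usr/games/xteddy", "/"], "/xteddy", 0x8048000)],
    signals=["SIGSEGV"],
    final_state="crash",
)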

As you can see, this approach works directly on binary executions (no source code or debugging symbols are required). A ptrace-based tool called Ocean was developed to extract and process these events on a Linux system. Our goal is to mine them for 'vulnerable patterns'. Also, detecting such events with ptrace is very fast, since it does not require single-stepping the program to obtain them. Automatic exploitation is outside the scope of this project right now (and in fact it is a completely different problem).
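
As a rough illustration of the kind of information that can be gathered without source code, the following snippet (not Ocean itself, just a simplified stand-in using the Python 3 standard library) runs a binary and classifies its final state as exit, abort, crash or timeout:

import signal
import subprocess

def final_state(argv, timeout=5):
    """Run a binary and report its final state (illustrative only)."""
    try:
        proc = subprocess.run(argv, timeout=timeout,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode < 0:
        # A negative return code means the process was killed by a signal.
        sig = signal.Signals(-proc.returncode).name
        return "abort" if sig == "SIGABRT" else "crash ({})".format(sig)
    return "exit ({})".format(proc.returncode)

# For example, the XTeddy bug analyzed later should report a crash:
# print(final_state(["/usr/games/xteddy", "-geometry"]))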

Keep in mind that in these first posts, everything will be highly experimental and subject to change (in fact, we would love to hear your feedback!). We will start by preparing and examining the dataset. Later, we will try different models to make predictions from it. These are some of the ideas related to my PhD project.

Preparing the data

First, we need bugs to analyze. In this context, a bug is a set of inputs (arguments, files, etc.) that produces a crash in a certain binary program. A natural source of bugs is a bug tracker (of course!), but unfortunately we can't just use any bug tracker, since we need to be able to fetch, execute and reproduce the bugs automatically.

Luckily for us, a reliable source of reproducible bugs are the ones submitted to Debian by the Mayhem Team. Each of these reports contains all the data needed to reproduce it. So we will be analyzing 1.2k bugs discovered by them and submitted to the Debian Bug Tracker.

Since every package requires libraries, and sometimes bugs depend on specific versions of them, we need a controlled environment for testing. Therefore, in order to obtain reproducible results in our experiment, we are going to use a recent version of Vagrant to emulate Debian 7.3. A preinstalled VM with almost every package that was reported to be buggy (minus the incompatible ones) was created for this project. As you can imagine, this VM has a large number of obscure programs and services installed (most of them unconfigured or broken). We disabled unneeded services to speed up the boot of the VM. To get a VM ready for testing, just execute:

git clone https://github.com/neuromancer/ocean-data
cd ocean-data
./bootstrap.sh

If the download is very slow, an alternative mirror is available by tweaking bootstrap.sh. Also, re-running that script will resume an interrupted download. After executing it, the bug inputs and reports will be uncompressed inside the "bugs" folder. This folder is shared between the host and the VM. Finally, we can use:

vagrant up
vagrant ssh

to boot and get a shell inside the VM.

Playing with the data

To start playing with the data, we are going to use Ocean. Ocean is a ptrace-based tool created to collect and vectorize runtime information from different executions of a particular program. Also, given a particular input, simple fuzzing procedures are used to generate more than one execution per program when this is required (a sketch of such a mutation is shown below).
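
To give an idea of what such a fuzzing procedure might look like (this is a guess at the general technique, not Ocean's actual code), a simple strategy is to randomly mutate bytes of the original crashing input to obtain additional executions:

import random

def mutate(data, rate=0.01):
    """Flip a small fraction of bytes in a crashing input (illustrative only)."""
    out = bytearray(data)
    for i in range(len(out)):
        if random.random() < rate:
            out[i] = random.randrange(256)
    return bytes(out)

# Hypothetical usage: derive a few variants of an input file that triggers a bug.
# with open("crashing_input", "rb") as f:
#     seed = f.read()
# variants = [mutate(seed) for _ in range(10)]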

I copied the latest version of Ocean from its repository into the VM (and added it to the PATH), so it should be ready to use. Still, it is highly recommended to update Ocean before starting to play with it. All the following commands and examples are executed inside the VM shell. In this case,

cd ocean
git pull

will update the tool. In the next post, we will explain exactly how Ocean extracts data from the executions, but right now, let's play a little with a test case. For example, let's take our favourite teddy bear simulator, XTeddy. Unfortunately, XTeddy is very old and buggy. In particular, if we execute it with the "-geometry" parameter but without its value, it will crash. This is a known bug and one of the 1.2k crashes submitted by the Mayhem team. First, we can use ltrace to get some runtime information manually:

env -i ltrace /usr/games/xteddy -geometry

and the output should be:

__libc_start_main(0x8049510, 2, 0xbfdab944, 0x804a2c0, 0x804a2b0 <unfinished ...>
strrchr("/usr/games/xteddy", '/')                               = "/xteddy"
XParseGeometry(0, 0xbfdab868, 0xbfdab864, 0xbfdab860, 0xbfdab85c) = 0
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++
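
As a rough illustration of what processing such a trace involves (this is not Ocean's actual implementation, since Ocean relies on ptrace rather than on ltrace output), each line of the trace can be parsed into a (function, arguments, return value) event:

import re

# Matches lines such as: strrchr("/usr/games/xteddy", '/') = "/xteddy"
CALL_RE = re.compile(r'^(\w+)\((.*)\)\s+=\s+(.*)$')

def parse_ltrace(lines):
    """Turn ltrace output lines into (name, args, return value) tuples (illustrative only)."""
    events = []
    for line in lines:
        line = line.strip()
        if line.startswith(('---', '+++')):
            # Signal and process-termination markers.
            events.append(('signal/exit', line))
            continue
        m = CALL_RE.match(line)
        if m:
            name, args, ret = m.groups()
            events.append((name, args.split(', '), ret))
    return events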

Of course, we would like to automate all of this. Ocean can be used to parse, execute and print data from this test case very easily:

cd sync
ocean.py -n 0 --no-stdout --raw-mode xteddy-report 2> /dev/null

and the output should be:

SIGSEGV:addr=Top32    crashed:eip=GPtr32    strrchr:ret_addr=GPtr32    strrchr:ret_val=SPtr32    strrchr_0=SPtr32    strrchr_1=Num32B8

Although the analyzed events form a sequence, our first approach to mining them is to transform them into a 'bag of words' representation. In this case, XTeddy crashed with a SIGSEGV. The parameters, return value and return address of the call to strrchr are shown as 'types'. Basic typing information includes whether an address is a pointer to stack, heap or global memory, or an ordinary 32-bit integer. A detailed description of this categorization will be the topic of the next post. It is interesting to note that since XParseGeometry is not a standard libc function, Ocean won't process it. Support for functions in very popular libraries will be added later if the developed technique proves to be useful.
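
As a small sketch of what the 'bag of words' step could look like (using only the Python standard library; the exact features Ocean produces will be detailed in the next post), each token printed above can simply be counted, discarding its position in the sequence:

from collections import Counter

# Tokens as printed by Ocean for the XTeddy test case above.
tokens = ("SIGSEGV:addr=Top32 crashed:eip=GPtr32 strrchr:ret_addr=GPtr32 "
          "strrchr:ret_val=SPtr32 strrchr_0=SPtr32 strrchr_1=Num32B8").split()

# The bag-of-words representation: how many times each token occurs,
# regardless of where it appeared in the execution.
bag = Counter(tokens)
print(bag)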

Finally, once we finish playing with the VM, we should not forget to turn it off:

vagrant halt

This was a very quick introduction to our ongoing research. All the software and data used in this research are open source, and contributions and feedback are highly appreciated.