## PE File Classification Exercise

In this notebook we're going to explore, understand and classify PE (Portable Executable) files as either 'benign' or 'malicious'. http://en.wikipedia.org/wiki/Portable_Executable

The primary motivation is to explore the nexus of IPython, Pandas and scikit-learn, with PE file classification as a vehicle for that exploration. The exercise intentionally takes what machine learning experts might call a naive approach; this is for clarity and conciseness. Recommendations for deeper materials and resources are given in the conclusion.

**DISCLAIMER:** This exercise is for illustrative purposes and uses only about 100 samples, which is far too few for a generalizable model.

### Python Modules Used:

Pandas: Python Data Analysis Library (http://pandas.pydata.org)
Scikit-learn (http://scikit-learn.org). Citation: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
Matplotlib: Python 2D plotting library (http://matplotlib.org)

**All Code and IPython Notebooks for this talk: http://clicksecurity.github.io/data_hacking**

## Imports and plot defaults

In [190]:

import os
import sklearn.feature_extraction
sklearn.__version__

Out[190]:

'0.14.1'

In [191]:

import pandas as pd
pd.__version__

Out[191]:

'0.13.1'

In [192]:

import numpy as np
np.__version__

Out[192]:

'1.8.0'

In [193]:

# Plotting defaults
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 18.0
plt.rcParams['figure.figsize'] = 16.0, 5.0

In [194]:

def plot_cm(cm, labels):
    # Compute percentages: divide each row by its total (the number of true samples of that class)
    percent = (cm*100.0)/cm.sum(axis=1)[:, np.newaxis]
    print 'Confusion Matrix Stats'
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            print "%s/%s: %.2f%% (%d/%d)" % (label_i, label_j, (percent[i][j]), cm[i][j], cm[i].sum())

    # Show confusion matrix
    # Thanks kermit666 from stackoverflow :)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.grid(b=False)
    cax = ax.matshow(percent, cmap='coolwarm',vmin=0,vmax=100)
    plt.title('Confusion matrix of the classifier')
    fig.colorbar(cax)
    ax.set_xticklabels([''] + labels)
    ax.set_yticklabels([''] + labels)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

In [195]:

import os, warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Read in the Raw Data

For PE files we want to go quickly from the raw binary files to a feature vector (DataFrame). There are lots of great tools for PE files: there's the pefile Python module written by Ero Carrera, and there's also a nice new GitHub project called PEFrame (https://github.com/guelfoweb/peframe) by Gianni Amato at http://www.securityside.it. For this exercise we've provided a little wrapper class around the pefile module.

In [196]:

import pe_features
my_extractor = pe_features.PEFileFeatures()

# Open a PE File and see what features we get
filename = 'data/bad/0cb9aa6fb9c4aa3afad7a303e21ac0f3'
with open(filename,'rb') as f:
    features = my_extractor.execute(f.read())
features

Out[196]:

{'check_sum': 0,
 'compile_date': 1218437803,
 'datadir_IMAGE_DIRECTORY_ENTRY_BASERELOC_size': 0,
 'datadir_IMAGE_DIRECTORY_ENTRY_EXPORT_size': 0,
 'datadir_IMAGE_DIRECTORY_ENTRY_IAT_size': 468,
 'datadir_IMAGE_DIRECTORY_ENTRY_IMPORT_size': 100,
 'datadir_IMAGE_DIRECTORY_ENTRY_RESOURCE_size': 1048,
 'debug_size': 0,
 'export_size': 0,
 'generated_check_sum': 53913,
 'iat_rva': 9256,
 'major_version': 0,
 'minor_version': 0,
 'number_of_bound_import_symbols': 0,
 'number_of_bound_imports': 0,
 'number_of_export_symbols': 0,
 'number_of_import_symbols': 113,
 'number_of_imports': 4,
 'number_of_rva_and_sizes': 16,
 'number_of_sections': 4,
 'pe_char': 271,
 'pe_dll': 0,
 'pe_driver': 0,
 'pe_exe': 1,
 'pe_i386': 1,
 'pe_majorlink': 6,
 'pe_minorlink': 0,
 'pe_warnings': 0,
 'sec_entropy_data': 0.4421475832668401,
 'sec_entropy_rdata': 3.2064873564662046,
 'sec_entropy_reloc': 0,
 'sec_entropy_rsrc': 1.028676764457129,
 'sec_entropy_text': 4.852962403013336,
 'sec_raw_execsize': 16384,
 'sec_rawptr_data': 12288,
 u'sec_rawptr_rdata': 8192,
 'sec_rawptr_rsrc': 16384,
 'sec_rawptr_text': 4096,
 'sec_rawsize_data': 4096,
 u'sec_rawsize_rdata': 4096,
 'sec_rawsize_rsrc': 4096,
 'sec_rawsize_text': 4096,
 'sec_va_execsize': 7044,
 'sec_vasize_data': 468,
 u'sec_vasize_rdata': 2182,
 'sec_vasize_rsrc': 1048,
 'sec_vasize_text': 3346,
 'size_code': 4096,
 'size_image': 20480,
 'size_initdata': 12288,
 'size_uninit': 0,
 'std_section_names': 1,
 'total_size_pe': 20480,
 'virtual_address': 4096,
 'virtual_size': 3346,
 'virtual_size_2': 2182}
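
Several of the features above, such as `sec_entropy_text`, are per-section entropy values. How `pe_features` computes them is not shown in this notebook; the sketch below assumes plain Shannon entropy over the section's bytes, measured in bits per byte. The function name `byte_entropy` is hypothetical and is not part of the wrapper class.

```python
import math
from collections import Counter

def byte_entropy(data):
    """Shannon entropy of a byte string in bits per byte (0.0 up to 8.0).

    Assumption: this mirrors the kind of value stored in sec_entropy_*,
    but it is not taken from the pe_features source.
    """
    values = bytearray(data)              # iterates as ints on Python 2 and 3
    if not values:
        return 0.0
    total = float(len(values))
    probabilities = [n / total for n in Counter(values).values()]
    # Sum p * log2(1/p) over every byte value that actually occurs
    return sum(p * math.log(1.0 / p, 2) for p in probabilities)

constant = byte_entropy(b'\x00' * 4096)        # zero-filled section: no surprise at all
uniform = byte_entropy(bytearray(range(256)))  # every byte value once: maximal spread
print("constant=%.3f uniform=%.3f" % (constant, uniform))
```

Low values suggest padding or highly repetitive data, while values near 8 suggest compressed or encrypted content, which is one reason entropy is a popular PE feature.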

In [197]:

# Load up all our files (files come from various places: contagio, around the net...)
def load_files(file_list):
    features_list = []
    for filename in file_list:
        with open(filename,'rb') as f:
            features_list.append(my_extractor.execute(f.read()))
    return features_list

In [198]:

# Bad (malicious) files
file_list = [os.path.join('data/bad', child) for child in os.listdir('data/bad')]
bad_features = load_files(file_list)
print 'Loaded up %d malicious PE Files' % len(bad_features)

Loaded up 50 malicious PE Files

In [199]:

# Good (benign) files
file_list = [os.path.join('data/good', child) for child in os.listdir('data/good')]
good_features = load_files(file_list)
print 'Loaded up %d benign PE Files' % len(good_features)

Loaded up 50 benign PE Files

# Data Transformation:

**Going from a list of Python dictionaries to a Pandas DataFrame. Pandas has all sorts of different ways to create a DataFrame.**

In [200]:

# Putting the features into a pandas dataframe
import pandas as pd
df_bad = pd.DataFrame.from_records(bad_features)
df_bad['label'] = 'bad'
df_good = pd.DataFrame.from_records(good_features)
df_good['label'] = 'good'
df_good.head()

Out[200]:

	check_sum	compile_date	datadir_IMAGE_DIRECTORY_ENTRY_BASERELOC_size	datadir_IMAGE_DIRECTORY_ENTRY_IAT_size	datadir_IMAGE_DIRECTORY_ENTRY_IMPORT_size	datadir_IMAGE_DIRECTORY_ENTRY_RESOURCE_size	debug_size	generated_check_sum	iat_rva	major_version	minor_version	number_of_import_symbols	number_of_imports	number_of_rva_and_sizes	number_of_sections
0	97308	1383744221	3044	592	140	7368	28	97308	50424	0	0	142	6	16	5	...
1	103233	1383102953	60	1008	60	872	28	103233	53248	5	1	124	2	16	8	...
2	26573	1386271379	360	208	100	2588	28	25971	8804	0	0	48	4	16	5	...
3	0	1373925025	12	8	83	11904	28	54015	35064	0	0	1	1	16	4	...
4	50003	1378865704	360	208	100	2588	28	59485	8804	0	0	48	4	16	5	...

5 rows × 108 columns

# Let's look at the Data

We're going to use some nice functionality in the Pandas DataFrame to look at our processed data:

In [201]:

# Now we're set and we open up a whole new world!

# Gisting and statistics
df_bad.describe()

Out[201]:

	check_sum	compile_date	datadir_IMAGE_DIRECTORY_ENTRY_BASERELOC_size	datadir_IMAGE_DIRECTORY_ENTRY_EXPORT_size	datadir_IMAGE_DIRECTORY_ENTRY_IAT_size	datadir_IMAGE_DIRECTORY_ENTRY_IMPORT_size	datadir_IMAGE_DIRECTORY_ENTRY_RESOURCE_size	debug_size	export_size	generated_check_sum	iat_rva	major_version	minor_version	number_of_bound_import_symbols	number_of_bound_imports	number_of_export_symbols	number_of_import_symbols	number_of_imports	number_of_rva_and_sizes	number_of_sections
count	50.000000	5.000000e+01	50.000000	50.000000	50.000000	50.000000	50.00000	50.000000	50.000000	50.000000	50.000000	50.000000	50.000000	50.000000	50.000000	50.000000	50.000000	50.000000	50	50.00000	...
mean	25235.660000	1.035770e+09	415.280000	14.640000	126.720000	456.160000	9615.64000	3.920000	14.640000	86998.520000	43982.640000	0.960000	0.120000	0.140000	0.740000	0.240000	44.560000	3.740000	16	4.32000	...
std	45704.015095	3.202979e+08	1061.159532	55.908365	180.722252	1060.814846	19062.02003	9.814275	55.908365	30119.209943	44546.311213	2.137708	0.328261	0.404566	2.028471	0.893514	46.412595	3.445257	0	1.75476	...
min	0.000000	2.099200e+06	0.000000	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	26104.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	16	1.00000	...
25%	0.000000	9.372855e+08	0.000000	0.000000	0.000000	40.000000	0.00000	0.000000	0.000000	68094.000000	14593.000000	0.000000	0.000000	0.000000	0.000000	0.000000	9.250000	1.000000	16	3.00000	...
50%	0.000000	1.172916e+09	0.000000	0.000000	44.000000	100.000000	1152.00000	0.000000	0.000000	82579.000000	25835.000000	0.000000	0.000000	0.000000	0.000000	0.000000	26.000000	2.500000	16	4.00000	...
75%	36417.000000	1.219691e+09	14.000000	0.000000	231.000000	186.000000	5938.00000	0.000000	0.000000	108406.000000	55948.000000	0.000000	0.000000	0.000000	0.000000	0.000000	70.500000	5.000000	16	5.00000	...
max	150326.000000	1.382647e+09	4612.000000	313.000000	748.000000	6234.000000	84152.00000	28.000000	313.000000	164776.000000	189824.000000	10.000000	1.000000	2.000000	8.000000	4.000000	180.000000	18.000000	16	9.00000	...

8 rows × 199 columns

In [202]:

# Visualization I
df_bad['check_sum'].hist(alpha=.5,label='bad',bins=40)
df_good['check_sum'].hist(alpha=.5,label='good',bins=40)
plt.legend()

Out[202]:

<matplotlib.legend.Legend at 0x110fd4b10>

In [203]:

# Visualization II
df_bad['generated_check_sum'].hist(alpha=.5,label='bad',bins=40)
df_good['generated_check_sum'].hist(alpha=.5,label='good',bins=40)
plt.legend()

Out[203]:

<matplotlib.legend.Legend at 0x111e38f50>

In [26]:

# Concatenate the info into a big pile!
df = pd.concat([df_bad, df_good], ignore_index=True)
df.replace(np.nan, 0, inplace=True)
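
The `replace(np.nan, 0)` call matters because the per-file feature dictionaries do not all share the same keys (presumably because section names vary from file to file), so the combined frame has gaps wherever a file lacks a feature. A pandas-free sketch of the same alignment, using hypothetical records and a hypothetical helper name `records_to_table`:

```python
def records_to_table(records, fill=0):
    """Align dicts with differing keys into (columns, rows), filling the gaps."""
    # The column set is the union of every key seen in any record
    columns = sorted(set(key for rec in records for key in rec))
    rows = [[rec.get(col, fill) for col in columns] for rec in records]
    return columns, rows

# Hypothetical feature dicts: the second file has no .reloc section
records = [
    {'number_of_sections': 5, 'sec_entropy_reloc': 3.1},
    {'number_of_sections': 4},
]
columns, rows = records_to_table(records)
print(columns)
print(rows)
```

Filling with 0 is a convenience rather than a neutral choice: the classifier cannot tell "feature absent" from "feature present with value 0".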

In [27]:

# Boxplots show you the distribution of the data (spread).
# http://en.wikipedia.org/wiki/Box_plot

# Get some quick summary stats and plot it!
df.boxplot('number_of_import_symbols','label')
plt.xlabel('bad vs. good files')
plt.ylabel('# Import Symbols')
plt.title('Comparison of # Import Symbols')
plt.suptitle("")

Out[27]:

<matplotlib.text.Text at 0x11089fb10>

In [28]:

# Get some quick summary stats and plot it!
df.boxplot('number_of_sections','label')
plt.xlabel('bad vs. good files')
plt.ylabel('Num Sections')
plt.title('Comparison of Number of Sections')
plt.suptitle("")

Out[28]:

<matplotlib.text.Text at 0x1108a3850>

In [29]:

# Split the classes up so we can set colors, size, labels
cond = df['label'] == 'good'
good = df[cond]
bad  = df[~cond]
plt.scatter(good['number_of_import_symbols'], good['number_of_sections'], 
            s=140, c='#aaaaff', label='Good', alpha=.4)
plt.scatter(bad['number_of_import_symbols'], bad['number_of_sections'], 
            s=40, c='r', label='Bad', alpha=.5)
plt.legend()
plt.xlabel('Import Symbols')
plt.ylabel('Num Sections')

Out[29]:

<matplotlib.text.Text at 0x110914f50>

# Data Transformation:

**Going from a Pandas DataFrame to an X matrix and a y vector so we can utilize all of the great scikit-learn algorithms.**

In [131]:

# In preparation for using scikit learn we're just going to use
# some handles that help take us from pandas land to scikit land

# List of feature vectors (scikit learn uses 'X' for the matrix of feature vectors)
X = df.as_matrix(['number_of_import_symbols', 'number_of_sections'])

# Labels (scikit learn uses 'y' for classification labels)
y = np.array(df['label'].tolist())

In [132]:

# Random Forest is a popular ensemble machine learning classifier.
# http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html
#
import sklearn.ensemble
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=50, compute_importances=True)

In [133]:

# Now we can use scikit learn's cross validation to assess predictive performance.
import sklearn.cross_validation
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=5, n_jobs=4)
print scores

[ 0.8         0.75        0.7         0.65        0.78947368]
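
Each of the five scores above is the accuracy on one held-out fold after training on the other four. The sketch below shows the index bookkeeping behind that idea, assuming plain contiguous k-fold; scikit-learn's own splitter may shuffle or stratify by class, so this is an illustration and not its implementation. The helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=100, k=5))

# Scores copied from the output above; the mean is the usual one-number summary
scores = [0.8, 0.75, 0.7, 0.65, 0.78947368]
mean_score = sum(scores) / len(scores)
print("mean accuracy over %d folds: %.3f" % (len(scores), mean_score))
```

Every sample lands in exactly one test fold, so the mean uses all of the data for evaluation without ever scoring a sample the model was trained on.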

In [134]:

# Typically you train/test on an 80% / 20% split, meaning you train on 80%
# of the data and you test against the remaining 20%. In the case of this
# exercise we have so FEW samples (50 good/50 bad) that if we're going
# to play around with predictive performance it's more meaningful
# to train on 60% of the data and test against the remaining 40%.

my_seed = 123
my_tsize = .4 # 40%
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=my_tsize, random_state=my_seed)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
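
The reasoning behind the 60/40 choice is simple arithmetic: with 100 samples, a 20% test set holds 20 files, so a single misclassification moves accuracy by 5 points, while a 40% test set holds 40 files and each error moves it by 2.5 points. A seeded shuffle split can be sketched as follows; this is a stand-in that assumes a plain random permutation and is not scikit-learn's `train_test_split` itself (the helper name `shuffle_split` is hypothetical).

```python
import random

def shuffle_split(n_samples, test_size, seed):
    """Return (train_idx, test_idx) from a seeded random permutation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # same seed gives the same split
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(100, 0.4, seed=123)

# How much one misclassified test sample changes accuracy, in percentage points
for fraction in (0.2, 0.4):
    n_test = int(round(100 * fraction))
    print("test_size=%.1f -> %d test samples, %.1f points per error"
          % (fraction, n_test, 100.0 / n_test))
```

Fixing the seed, as the notebook does with `my_seed`, keeps the split identical across runs so that changes in the results come from the features or the model and not from a different shuffle.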

In [135]:

# Now plot the results of the 60/40 split in a confusion matrix
from sklearn.metrics import confusion_matrix
labels = ['good', 'bad']
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)

Confusion Matrix Stats
good/good: 72.73% (16/22)
good/bad: 27.27% (6/22)
bad/good: 33.33% (6/18)
bad/bad: 66.67% (12/18)
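
The four counts above can be condensed into the usual summary metrics. The sketch below treats 'bad' as the positive class (a choice made here for illustration) and hard-codes the counts from the output above, with rows as true labels and columns as predicted labels in the order good, bad.

```python
# Counts copied from the confusion matrix stats above
cm = [[16, 6],    # true good: predicted good, predicted bad
      [6, 12]]    # true bad:  predicted good, predicted bad

tn, fp = cm[0]    # benign files: correctly passed, falsely flagged
fn, tp = cm[1]    # malicious files: missed, correctly flagged

total = float(tn + fp + fn + tp)
accuracy = (tp + tn) / total                # fraction of all files classified correctly
precision = tp / float(tp + fp)             # of the files flagged bad, how many were bad
recall = tp / float(tp + fn)                # of the bad files, how many were flagged
false_alarm_rate = fp / float(fp + tn)      # of the good files, how many were flagged

print("accuracy=%.3f precision=%.3f recall=%.3f false_alarm_rate=%.3f"
      % (accuracy, precision, recall, false_alarm_rate))
```

With only two features the classifier gets 28 of 40 test files right, and more than a quarter of the benign files raise a false alarm.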

## Features, predictive performance and 'knobs'

Here we're going to explore some of the ways you can adjust the 'knobs' associated with either the features fed into your ML algorithm or the prediction probability methods that many classes in scikit-learn have.

In [140]:

# Okay now try putting in ALL the features (except the label, which would be cheating :)
no_label = list(df.columns.values)
no_label.remove('label')
X = df.as_matrix(no_label)

# 60/40 Split for predictive test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=my_tsize, random_state=my_seed)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)

Confusion Matrix Stats
good/good: 95.45% (21/22)
good/bad: 4.55% (1/22)
bad/good: 5.56% (1/18)
bad/bad: 94.44% (17/18)

In [141]:

# Feature Selection
# Which features best differentiated the two classes?
# Here we're going to grab the feature_importances_ from the classifier itself;
# you can also use a Chi Squared Test: sklearn.feature_selection.SelectKBest(chi2)
importances = sorted(zip(no_label, clf.feature_importances_),
                     key=lambda k: k[1], reverse=True)
importances[:10]

Out[141]:

[('compile_date', 0.087118104042058525),
 ('pe_majorlink', 0.059725488127989876),
 (u'sec_rawptr_reloc', 0.059172331241524503),
 ('debug_size', 0.036744505259105907),
 (u'sec_vasize_reloc', 0.035061138659616312),
 ('datadir_IMAGE_DIRECTORY_ENTRY_RESOURCE_size', 0.033765884515823991),
 (u'sec_rawptr_rdata', 0.033184786235755895),
 ('datadir_IMAGE_DIRECTORY_ENTRY_IAT_size', 0.03261159140279881),
 ('pe_char', 0.030081949321901769),
 ('sec_rawsize_text', 0.028226654611623891)]
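
A quick way to judge how much of the model these ten features carry is to add up their importances. The sketch below hard-codes the values from the output above and assumes the importances sum to 1 across all features, which is how scikit-learn's forest classifiers normalize them.

```python
# Top-10 importances copied from the output above
top10 = [
    0.087118104042058525, 0.059725488127989876, 0.059172331241524503,
    0.036744505259105907, 0.035061138659616312, 0.033765884515823991,
    0.033184786235755895, 0.03261159140279881, 0.030081949321901769,
    0.028226654611623891,
]

share = sum(top10)          # fraction of total importance held by the top 10
remainder = 1.0 - share     # what is left for all of the other features combined
print("top-10 share: %.1f%%, all other features: %.1f%%"
      % (share * 100, remainder * 100))
```

Under that assumption the ten features hold a bit under half of the total importance, which is consistent with a top-10 model performing about as well as the full one on this small sample.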

In [142]:

# Produce an X matrix with only the most important features
X = df.as_matrix([item[0] for item in importances[:10]])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=my_tsize, random_state=my_seed)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)

Confusion Matrix Stats
good/good: 95.45% (21/22)
good/bad: 4.55% (1/22)
bad/good: 0.00% (0/18)
bad/bad: 100.00% (18/18)

In [143]:

# Compute the prediction probabilities and use them to minimize our false positives
# Note: This is simply a trade-off; it means we'll miss a few of the malicious
# ones, but typically false alarms are a death blow to any new 'fancy stuff', so
# we definitely want to minimize the false alarms.
bad_col = list(clf.classes_).index('bad')  # column of predict_proba holding P(bad)
y_probs = clf.predict_proba(X_test)[:, bad_col]
thres = .8 # Only call a file 'bad' when P(bad) >= thres; set to whatever you'd like
y_pred[y_probs<thres] = 'good'
y_pred[y_probs>=thres] = 'bad'
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)

Confusion Matrix Stats
good/good: 100.00% (22/22)
good/bad: 0.00% (0/22)
bad/good: 16.67% (3/18)
bad/bad: 83.33% (15/18)
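
The trade-off described in the cell above can be seen on a toy example. The labels and probabilities below are hypothetical, made up purely for illustration, and the helper names are not from the notebook; the point is only that raising the threshold for calling a file 'bad' removes false alarms at the cost of misses.

```python
def classify(p_bad, threshold):
    """Label a sample 'bad' only when its P(bad) reaches the threshold."""
    return ['bad' if p >= threshold else 'good' for p in p_bad]

def count_errors(y_true, y_pred):
    """Return (false_alarms, misses) with 'bad' as the class being detected."""
    pairs = list(zip(y_true, y_pred))
    false_alarms = sum(1 for t, p in pairs if t == 'good' and p == 'bad')
    misses = sum(1 for t, p in pairs if t == 'bad' and p == 'good')
    return false_alarms, misses

# Hypothetical true labels and predicted probabilities of being malicious
y_true = ['good', 'good', 'good', 'bad', 'bad', 'bad']
p_bad = [0.10, 0.30, 0.60, 0.55, 0.85, 0.95]

for threshold in (0.5, 0.8):
    false_alarms, misses = count_errors(y_true, classify(p_bad, threshold))
    print("threshold=%.1f false_alarms=%d misses=%d"
          % (threshold, false_alarms, misses))
```

At 0.5 the benign sample scored 0.60 is flagged; at 0.8 it is cleared, but the malicious sample scored 0.55 now slips through, mirroring the shift from 1 false alarm to 3 misses in the notebook's own results.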

## Conclusions:

The combination of IPython, Pandas and Scikit Learn let us pull in PE files, extract features, plot them, understand them and slap them with some machine learning!

As mentioned in the disclaimer, the biggest issue with this particular exercise is the small number of samples.

There are some really great machine learning resources that cover this material on a deeper and more formal level. In particular we highly recommend the work and presentations of Olivier Grisel at INRIA Saclay. http://ogrisel.com/