R-CNN is a state-of-the-art detector that classifies region proposals by a finetuned Caffe model. For the full details of the R-CNN system and model, refer to its project site and the paper:
Rich feature hierarchies for accurate object detection and semantic segmentation. Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. CVPR 2014. Arxiv 2013.
In this example, we run detection with a pure Caffe edition of the R-CNN model for ImageNet. The R-CNN detector outputs class scores for the 200 detection classes of ILSVRC13. Keep in mind that these are raw one-vs-all SVM scores, so they are neither probabilistically calibrated nor exactly comparable across classes. Note that this off-the-shelf model is provided simply for convenience and is not the full R-CNN model.
Let's run detection on an image of a bicyclist riding a fish bike in the desert (from the ImageNet challenge—no joke).
First, we'll need region proposals and the Caffe R-CNN ImageNet model:
- Get the selective_search_ijcv_with_python module, run the demo in MATLAB to compile the necessary functions, then add it to your PYTHONPATH for importing. (If you have your own region proposals prepared, or would rather not bother with this step, detect.py accepts a list of images and bounding boxes as CSV.)
- Run ./scripts/download_model_binary.py models/bvlc_reference_rcnn_ilsvrc13 to get the Caffe R-CNN ImageNet model.
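If you take the CSV route mentioned above, a file of precomputed windows could be written along these lines. This is only a sketch: the column names are an assumption that mirrors the window columns in detect.py's output (filename, ymin, xmin, ymax, xmax), the image path and boxes are made up, and ./detect.py --help remains the authority on the expected layout and on which crop mode reads it.

```python
import csv
import os
import tempfile

# Assumed column layout, mirroring the window columns in detect.py's output.
FIELDS = ['filename', 'ymin', 'xmin', 'ymax', 'xmax']


def write_windows_csv(path, windows):
    """Write (filename, ymin, xmin, ymax, xmax) tuples as a CSV with a header row."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        writer.writerows(windows)


# Hypothetical image path and boxes, for illustration only.
example_windows = [
    ('/path/to/image.jpg', 0, 0, 100, 200),
    ('/path/to/image.jpg', 50, 40, 180, 260),
]
csv_path = os.path.join(tempfile.mkdtemp(), 'windows.csv')
write_windows_csv(csv_path, example_windows)
print(open(csv_path).read())
```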
With that done, we'll call the bundled detect.py to generate the region proposals and run the network. For an explanation of the arguments, run ./detect.py --help.
!mkdir -p _temp
!echo `pwd`/images/fish-bike.jpg > _temp/det_input.txt
!../python/detect.py --crop_mode=selective_search --pretrained_model=models/bvlc_reference_rcnn_ilsvrc13/bvlc_reference_rcnn_ilsvrc13.caffemodel --model_def=models/bvlc_reference_rcnn_ilsvrc13/deploy.prototxt --gpu --raw_scale=255 _temp/det_input.txt _temp/det_output.h5
WARNING: Logging before InitGoogleLogging() is written to STDERR I0610 10:12:49.299607 25530 net.cpp:36] Initializing net from parameters: name: "R-CNN-ilsvrc13" layers { bottom: "data" top: "conv1" name: "conv1" type: CONVOLUTION convolution_param { num_output: 96 kernel_size: 11 stride: 4 } } layers { bottom: "conv1" top: "conv1" name: "relu1" type: RELU } layers { bottom: "conv1" top: "pool1" name: "pool1" type: POOLING pooling_param { pool: MAX kernel_size: 3 stride: 2 } } layers { bottom: "pool1" top: "norm1" name: "norm1" type: LRN lrn_param { local_size: 5 alpha: 0.0001 beta: 0.75 } } layers { bottom: "norm1" top: "conv2" name: "conv2" type: CONVOLUTION convolution_param { num_output: 256 pad: 2 kernel_size: 5 group: 2 } } layers { bottom: "conv2" top: "conv2" name: "relu2" type: RELU } layers { bottom: "conv2" top: "pool2" name: "pool2" type: POOLING pooling_param { pool: MAX kernel_size: 3 stride: 2 } } layers { bottom: "pool2" top: "norm2" name: "norm2" type: LRN lrn_param { local_size: 5 alpha: 0.0001 beta: 0.75 } } layers { bottom: "norm2" top: "conv3" name: "conv3" type: CONVOLUTION convolution_param { num_output: 384 pad: 1 kernel_size: 3 } } layers { bottom: "conv3" top: "conv3" name: "relu3" type: RELU } layers { bottom: "conv3" top: "conv4" name: "conv4" type: CONVOLUTION convolution_param { num_output: 384 pad: 1 kernel_size: 3 group: 2 } } layers { bottom: "conv4" top: "conv4" name: "relu4" type: RELU } layers { bottom: "conv4" top: "conv5" name: "conv5" type: CONVOLUTION convolution_param { num_output: 256 pad: 1 kernel_size: 3 group: 2 } } layers { bottom: "conv5" top: "conv5" name: "relu5" type: RELU } layers { bottom: "conv5" top: "pool5" name: "pool5" type: POOLING pooling_param { pool: MAX kernel_size: 3 stride: 2 } } layers { bottom: "pool5" top: "fc6" name: "fc6" type: INNER_PRODUCT inner_product_param { num_output: 4096 } } layers { bottom: "fc6" top: "fc6" name: "relu6" type: RELU } layers { bottom: "fc6" top: "fc6" name: "drop6" type: 
DROPOUT dropout_param { dropout_ratio: 0.5 } } layers { bottom: "fc6" top: "fc7" name: "fc7" type: INNER_PRODUCT inner_product_param { num_output: 4096 } } layers { bottom: "fc7" top: "fc7" name: "relu7" type: RELU } layers { bottom: "fc7" top: "fc7" name: "drop7" type: DROPOUT dropout_param { dropout_ratio: 0.5 } } layers { bottom: "fc7" top: "fc-rcnn" name: "fc-rcnn" type: INNER_PRODUCT inner_product_param { num_output: 200 } } input: "data" input_dim: 10 input_dim: 3 input_dim: 227 input_dim: 227 I0610 10:12:49.300204 25530 net.cpp:77] Creating Layer conv1 I0610 10:12:49.300214 25530 net.cpp:87] conv1 <- data I0610 10:12:49.300220 25530 net.cpp:113] conv1 -> conv1 I0610 10:12:49.300283 25530 net.cpp:128] Top shape: 10 96 55 55 (2904000) I0610 10:12:49.300294 25530 net.cpp:154] conv1 needs backward computation. I0610 10:12:49.300302 25530 net.cpp:77] Creating Layer relu1 I0610 10:12:49.300308 25530 net.cpp:87] relu1 <- conv1 I0610 10:12:49.300314 25530 net.cpp:101] relu1 -> conv1 (in-place) I0610 10:12:49.300323 25530 net.cpp:128] Top shape: 10 96 55 55 (2904000) I0610 10:12:49.300328 25530 net.cpp:154] relu1 needs backward computation. I0610 10:12:49.300335 25530 net.cpp:77] Creating Layer pool1 I0610 10:12:49.300341 25530 net.cpp:87] pool1 <- conv1 I0610 10:12:49.300348 25530 net.cpp:113] pool1 -> pool1 I0610 10:12:49.300357 25530 net.cpp:128] Top shape: 10 96 27 27 (699840) I0610 10:12:49.300365 25530 net.cpp:154] pool1 needs backward computation. I0610 10:12:49.300372 25530 net.cpp:77] Creating Layer norm1 I0610 10:12:49.300379 25530 net.cpp:87] norm1 <- pool1 I0610 10:12:49.300384 25530 net.cpp:113] norm1 -> norm1 I0610 10:12:49.300393 25530 net.cpp:128] Top shape: 10 96 27 27 (699840) I0610 10:12:49.300400 25530 net.cpp:154] norm1 needs backward computation. 
I0610 10:12:49.300406 25530 net.cpp:77] Creating Layer conv2 I0610 10:12:49.300412 25530 net.cpp:87] conv2 <- norm1 I0610 10:12:49.300420 25530 net.cpp:113] conv2 -> conv2 I0610 10:12:49.300925 25530 net.cpp:128] Top shape: 10 256 27 27 (1866240) I0610 10:12:49.300935 25530 net.cpp:154] conv2 needs backward computation. I0610 10:12:49.300941 25530 net.cpp:77] Creating Layer relu2 I0610 10:12:49.300947 25530 net.cpp:87] relu2 <- conv2 I0610 10:12:49.300954 25530 net.cpp:101] relu2 -> conv2 (in-place) I0610 10:12:49.300961 25530 net.cpp:128] Top shape: 10 256 27 27 (1866240) I0610 10:12:49.300967 25530 net.cpp:154] relu2 needs backward computation. I0610 10:12:49.300974 25530 net.cpp:77] Creating Layer pool2 I0610 10:12:49.300981 25530 net.cpp:87] pool2 <- conv2 I0610 10:12:49.300987 25530 net.cpp:113] pool2 -> pool2 I0610 10:12:49.300994 25530 net.cpp:128] Top shape: 10 256 13 13 (432640) I0610 10:12:49.301000 25530 net.cpp:154] pool2 needs backward computation. I0610 10:12:49.301007 25530 net.cpp:77] Creating Layer norm2 I0610 10:12:49.301013 25530 net.cpp:87] norm2 <- pool2 I0610 10:12:49.301019 25530 net.cpp:113] norm2 -> norm2 I0610 10:12:49.301026 25530 net.cpp:128] Top shape: 10 256 13 13 (432640) I0610 10:12:49.301033 25530 net.cpp:154] norm2 needs backward computation. I0610 10:12:49.301041 25530 net.cpp:77] Creating Layer conv3 I0610 10:12:49.301048 25530 net.cpp:87] conv3 <- norm2 I0610 10:12:49.301054 25530 net.cpp:113] conv3 -> conv3 I0610 10:12:49.302455 25530 net.cpp:128] Top shape: 10 384 13 13 (648960) I0610 10:12:49.302467 25530 net.cpp:154] conv3 needs backward computation. I0610 10:12:49.302477 25530 net.cpp:77] Creating Layer relu3 I0610 10:12:49.302484 25530 net.cpp:87] relu3 <- conv3 I0610 10:12:49.302490 25530 net.cpp:101] relu3 -> conv3 (in-place) I0610 10:12:49.302496 25530 net.cpp:128] Top shape: 10 384 13 13 (648960) I0610 10:12:49.302503 25530 net.cpp:154] relu3 needs backward computation. 
I0610 10:12:49.302510 25530 net.cpp:77] Creating Layer conv4 I0610 10:12:49.302515 25530 net.cpp:87] conv4 <- conv3 I0610 10:12:49.302521 25530 net.cpp:113] conv4 -> conv4 I0610 10:12:49.303639 25530 net.cpp:128] Top shape: 10 384 13 13 (648960) I0610 10:12:49.303650 25530 net.cpp:154] conv4 needs backward computation. I0610 10:12:49.303658 25530 net.cpp:77] Creating Layer relu4 I0610 10:12:49.303663 25530 net.cpp:87] relu4 <- conv4 I0610 10:12:49.303670 25530 net.cpp:101] relu4 -> conv4 (in-place) I0610 10:12:49.303676 25530 net.cpp:128] Top shape: 10 384 13 13 (648960) I0610 10:12:49.303683 25530 net.cpp:154] relu4 needs backward computation. I0610 10:12:49.303691 25530 net.cpp:77] Creating Layer conv5 I0610 10:12:49.303697 25530 net.cpp:87] conv5 <- conv4 I0610 10:12:49.303704 25530 net.cpp:113] conv5 -> conv5 I0610 10:12:49.304410 25530 net.cpp:128] Top shape: 10 256 13 13 (432640) I0610 10:12:49.304420 25530 net.cpp:154] conv5 needs backward computation. I0610 10:12:49.304427 25530 net.cpp:77] Creating Layer relu5 I0610 10:12:49.304433 25530 net.cpp:87] relu5 <- conv5 I0610 10:12:49.304440 25530 net.cpp:101] relu5 -> conv5 (in-place) I0610 10:12:49.304446 25530 net.cpp:128] Top shape: 10 256 13 13 (432640) I0610 10:12:49.304471 25530 net.cpp:154] relu5 needs backward computation. I0610 10:12:49.304478 25530 net.cpp:77] Creating Layer pool5 I0610 10:12:49.304484 25530 net.cpp:87] pool5 <- conv5 I0610 10:12:49.304491 25530 net.cpp:113] pool5 -> pool5 I0610 10:12:49.304498 25530 net.cpp:128] Top shape: 10 256 6 6 (92160) I0610 10:12:49.304504 25530 net.cpp:154] pool5 needs backward computation. I0610 10:12:49.304512 25530 net.cpp:77] Creating Layer fc6 I0610 10:12:49.304517 25530 net.cpp:87] fc6 <- pool5 I0610 10:12:49.304523 25530 net.cpp:113] fc6 -> fc6 I0610 10:12:49.364333 25530 net.cpp:128] Top shape: 10 4096 1 1 (40960) I0610 10:12:49.364372 25530 net.cpp:154] fc6 needs backward computation. 
I0610 10:12:49.364387 25530 net.cpp:77] Creating Layer relu6 I0610 10:12:49.364420 25530 net.cpp:87] relu6 <- fc6 I0610 10:12:49.364429 25530 net.cpp:101] relu6 -> fc6 (in-place) I0610 10:12:49.364437 25530 net.cpp:128] Top shape: 10 4096 1 1 (40960) I0610 10:12:49.364444 25530 net.cpp:154] relu6 needs backward computation. I0610 10:12:49.364455 25530 net.cpp:77] Creating Layer drop6 I0610 10:12:49.364461 25530 net.cpp:87] drop6 <- fc6 I0610 10:12:49.364467 25530 net.cpp:101] drop6 -> fc6 (in-place) I0610 10:12:49.364480 25530 net.cpp:128] Top shape: 10 4096 1 1 (40960) I0610 10:12:49.364487 25530 net.cpp:154] drop6 needs backward computation. I0610 10:12:49.364495 25530 net.cpp:77] Creating Layer fc7 I0610 10:12:49.364501 25530 net.cpp:87] fc7 <- fc6 I0610 10:12:49.364507 25530 net.cpp:113] fc7 -> fc7 I0610 10:12:49.391316 25530 net.cpp:128] Top shape: 10 4096 1 1 (40960) I0610 10:12:49.391350 25530 net.cpp:154] fc7 needs backward computation. I0610 10:12:49.391361 25530 net.cpp:77] Creating Layer relu7 I0610 10:12:49.391369 25530 net.cpp:87] relu7 <- fc7 I0610 10:12:49.391377 25530 net.cpp:101] relu7 -> fc7 (in-place) I0610 10:12:49.391384 25530 net.cpp:128] Top shape: 10 4096 1 1 (40960) I0610 10:12:49.391391 25530 net.cpp:154] relu7 needs backward computation. I0610 10:12:49.391398 25530 net.cpp:77] Creating Layer drop7 I0610 10:12:49.391427 25530 net.cpp:87] drop7 <- fc7 I0610 10:12:49.391433 25530 net.cpp:101] drop7 -> fc7 (in-place) I0610 10:12:49.391440 25530 net.cpp:128] Top shape: 10 4096 1 1 (40960) I0610 10:12:49.391446 25530 net.cpp:154] drop7 needs backward computation. I0610 10:12:49.391454 25530 net.cpp:77] Creating Layer fc-rcnn I0610 10:12:49.391459 25530 net.cpp:87] fc-rcnn <- fc7 I0610 10:12:49.391466 25530 net.cpp:113] fc-rcnn -> fc-rcnn I0610 10:12:49.392812 25530 net.cpp:128] Top shape: 10 200 1 1 (2000) I0610 10:12:49.392823 25530 net.cpp:154] fc-rcnn needs backward computation. 
I0610 10:12:49.392829 25530 net.cpp:165] This network produces output fc-rcnn
I0610 10:12:49.392850 25530 net.cpp:183] Collecting Learning Rate and Weight Decay.
I0610 10:12:49.392868 25530 net.cpp:176] Network initialization done.
I0610 10:12:49.392875 25530 net.cpp:177] Memory required for Data 41950840
GPU mode
Loading input...
selective_search_rcnn({'/home/shelhamer/caffe/examples/images/fish-bike.jpg'}, '/tmp/tmpo7yOum.mat')
Processed 1570 windows in 35.012 s.
/home/shelhamer/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block1_values] [items->['prediction']]
  warnings.warn(ws, PerformanceWarning)
Saved to _temp/det_output.h5 in 0.035 s.
This run was in GPU mode. For CPU mode detection, call detect.py without the --gpu argument.
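To make the GPU/CPU switch explicit, here is a small sketch that assembles the same command as an argument list and only adds --gpu on request. The helper name build_detect_cmd and its parameters are made up for illustration; the flags and paths are the ones used in the command above.

```python
def build_detect_cmd(input_file, output_file, model_def, pretrained_model,
                     use_gpu=True, detect_script='../python/detect.py'):
    """Assemble the detect.py command line shown above as an argument list."""
    cmd = [
        detect_script,
        '--crop_mode=selective_search',
        '--pretrained_model=' + pretrained_model,
        '--model_def=' + model_def,
    ]
    if use_gpu:
        cmd.append('--gpu')  # leave this flag out for CPU mode
    cmd += ['--raw_scale=255', input_file, output_file]
    return cmd


MODEL_DIR = 'models/bvlc_reference_rcnn_ilsvrc13'
cpu_cmd = build_detect_cmd(
    '_temp/det_input.txt', '_temp/det_output.h5',
    model_def=MODEL_DIR + '/deploy.prototxt',
    pretrained_model=MODEL_DIR + '/bvlc_reference_rcnn_ilsvrc13.caffemodel',
    use_gpu=False,
)
print(' '.join(cpu_cmd))
```

The resulting list can be handed to subprocess.check_call(cpu_cmd) to run detection outside the notebook.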
Running this outputs a DataFrame with the filenames, selected windows, and their detection scores to an HDF5 file. (We only ran on one image, so the filenames will all be the same.)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_hdf('_temp/det_output.h5', 'df')
print(df.shape)
print(df.iloc[0])
(1570, 5)
prediction    [-2.64547, -2.88455, -2.85903, -3.17038, -1.92...
ymin          79.846
xmin          9.62
ymax          246.31
xmax          339.624
Name: /home/shelhamer/caffe/examples/images/fish-bike.jpg, dtype: object
1570 regions were proposed with the R-CNN configuration of selective search. The number of proposals will vary from image to image based on its contents and size -- selective search isn't scale invariant.
In general, detect.py is most efficient when running on a lot of images: it first extracts window proposals for all of them, batches the windows for efficient GPU processing, and then outputs the results.
Simply list an image per line in the images_file, and it will process all of them.
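For many images, the list file can be generated rather than typed. The sketch below writes one absolute path per line, matching the `pwd`/images/fish-bike.jpg form used earlier. The helper name write_image_list and the '*.jpg' pattern are illustrative assumptions, not part of detect.py.

```python
import glob
import os


def write_image_list(image_dir, list_path, pattern='*.jpg'):
    """Write the absolute path of each matching image, one per line."""
    search = os.path.join(os.path.abspath(image_dir), pattern)
    paths = sorted(glob.glob(search))
    with open(list_path, 'w') as f:
        for path in paths:
            f.write(path + '\n')
    return paths

# Example call, assuming the directories from this notebook exist:
# write_image_list('images', '_temp/det_input.txt')
```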
Although this guide gives an example of R-CNN ImageNet detection, detect.py is clever enough to adapt to different Caffe models’ input dimensions, batch size, and output categories. You can switch the model definition and pretrained model as desired. Refer to python detect.py --help for the parameters to describe your data set. There's no need for hardcoding.
Anyway, let's now load the ILSVRC13 detection class names and make a DataFrame of the predictions. Note you'll need the auxiliary ilsvrc2012 data fetched by data/ilsvrc12/get_ilsvrc12_aux.sh.
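with open('../data/ilsvrc12/det_synset_words.txt') as f:
    labels_df = pd.DataFrame([
        {
            'synset_id': l.strip().split(' ')[0],
            'name': ' '.join(l.strip().split(' ')[1:]).split(',')[0]
        }
        for l in f.readlines()
    ])
# sort_values returns a sorted copy; labels_df itself keeps the file order,
# which is the order used for the prediction columns below.
labels_df.sort_values('synset_id')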
predictions_df = pd.DataFrame(np.vstack(df.prediction.values), columns=labels_df['name'])
print(predictions_df.iloc[0])
name
accordion       -2.645470
airplane        -2.884554
ant             -2.859026
antelope        -3.170383
apple           -1.924201
armadillo       -2.493925
artichoke       -2.235427
axe             -2.378177
baby bed        -2.757855
backpack        -2.160120
bagel           -2.715738
balance beam    -2.716172
banana          -2.418939
band aid        -1.604563
banjo           -2.329196
...
trombone        -2.531519
trumpet         -2.382109
turtle          -2.378510
tv or monitor   -2.777433
unicycle        -2.263807
vacuum          -1.894700
violin          -2.797967
volleyball      -2.807812
waffle iron     -2.418155
washer          -2.429423
water bottle    -2.163465
watercraft      -2.803971
whale           -3.094172
wine bottle     -2.830827
zebra           -2.791829
Name: 0, Length: 200, dtype: float32
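The label parsing above packs two steps into one expression: the first space-separated token is the synset ID, and the name is whatever precedes the first comma in the rest of the line. A standalone sketch of the same logic follows. The function name is made up, and the sample line uses an invented ID in the assumed "ID name, alternate name" format.

```python
def parse_synset_line(line):
    """Split 'synset_id name one, name two' into (synset_id, first name)."""
    parts = line.strip().split(' ')
    synset_id = parts[0]
    # Re-join the remaining words, then keep only the first comma-separated name.
    name = ' '.join(parts[1:]).split(',')[0]
    return synset_id, name


# Hypothetical line in the assumed file format (the ID is made up).
print(parse_synset_line('n00000000 tv or monitor, television\n'))
# → ('n00000000', 'tv or monitor')
```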
Let's look at the activations.
plt.gray()
plt.matshow(predictions_df.values)
plt.xlabel('Classes')
plt.ylabel('Windows')
[Figure: matrix plot of the detection scores, with classes on the x-axis and windows on the y-axis.]
Now let's take max across all windows and plot the top classes.
max_s = predictions_df.max(0)
max_s = max_s.sort_values(ascending=False)
print(max_s[:10])
name
person          1.883164
bicycle         0.936994
unicycle        0.016907
banjo           0.013019
motorcycle     -0.024704
electric fan   -0.193420
turtle         -0.243857
cart           -0.289637
lizard         -0.307945
baby bed       -0.582180
dtype: float32
The top detections are in fact a person and bicycle. Picking good localizations is a work in progress; we pick the top-scoring person and bicycle detections.
# Find, print, and display the top detections: person and bicycle.
i = predictions_df['person'].argmax()
j = predictions_df['bicycle'].argmax()
# Show top predictions for top detection.
f = pd.Series(df['prediction'].iloc[i], index=labels_df['name'])
print('Top detection:')
print(f.sort_values(ascending=False)[:5])
print('')
# Show top predictions for second-best detection.
f = pd.Series(df['prediction'].iloc[j], index=labels_df['name'])
print('Second-best detection:')
print(f.sort_values(ascending=False)[:5])
# Show top detection in red, second-best detection in blue.
im = plt.imread('images/fish-bike.jpg')
plt.imshow(im)
currentAxis = plt.gca()
det = df.iloc[i]
coords = (det['xmin'], det['ymin']), det['xmax'] - det['xmin'], det['ymax'] - det['ymin']
currentAxis.add_patch(plt.Rectangle(*coords, fill=False, edgecolor='r', linewidth=5))
det = df.iloc[j]
coords = (det['xmin'], det['ymin']), det['xmax'] - det['xmin'], det['ymax'] - det['ymin']
currentAxis.add_patch(plt.Rectangle(*coords, fill=False, edgecolor='b', linewidth=5))
Top detection:
name
person             1.883164
swimming trunks   -1.136701
rubber eraser     -1.251888
plastic bag       -1.286928
snowmobile        -1.304962
dtype: float32

Second-best detection:
name
bicycle     0.936994
unicycle   -0.372841
scorpion   -0.812350
lobster    -1.041506
lamp       -1.118889
dtype: float32
[Figure: fish-bike.jpg with the top person detection outlined in red and the top bicycle detection outlined in blue.]
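The plotting code converts each window from corner coordinates to the anchor point, width, and height that plt.Rectangle expects. As a quick check of that arithmetic, here is the conversion applied to the first window printed earlier. The helper name window_to_rect is made up for illustration.

```python
def window_to_rect(xmin, ymin, xmax, ymax):
    """Convert corner coordinates to ((x, y), width, height)."""
    return (xmin, ymin), xmax - xmin, ymax - ymin


# First window from the DataFrame printed earlier.
xy, width, height = window_to_rect(xmin=9.62, ymin=79.846, xmax=339.624, ymax=246.31)
print(xy, round(width, 3), round(height, 3))
# → (9.62, 79.846) 330.004 166.464
```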
That's cool. Let's take all 'bicycle' detections and NMS them to get rid of overlapping windows.
def nms_detections(dets, overlap=0.3):
    """
    Non-maximum suppression: Greedily select high-scoring detections and
    skip detections that are significantly covered by a previously
    selected detection.

    This version is translated from Matlab code by Tomasz Malisiewicz,
    who sped up Pedro Felzenszwalb's code.

    Parameters
    ----------
    dets: ndarray
        each row is ['xmin', 'ymin', 'xmax', 'ymax', 'score']
    overlap: float
        maximum overlap ratio (intersection over union) allowed with a
        selected detection; windows above it are suppressed (0.3 default)

    Output
    ------
    dets: ndarray
        remaining after suppression.
    """
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]
    # Indices sorted by ascending score, so the best detection is last.
    ind = np.argsort(dets[:, 4])

    w = x2 - x1
    h = y2 - y1
    area = (w * h).astype(float)

    pick = []
    while len(ind) > 0:
        # Select the highest-scoring remaining detection.
        i = ind[-1]
        pick.append(i)
        ind = ind[:-1]

        # Intersection of the selected box with every remaining box.
        xx1 = np.maximum(x1[i], x1[ind])
        yy1 = np.maximum(y1[i], y1[ind])
        xx2 = np.minimum(x2[i], x2[ind])
        yy2 = np.minimum(y2[i], y2[ind])

        w = np.maximum(0., xx2 - xx1)
        h = np.maximum(0., yy2 - yy1)

        # Intersection over union with the selected box.
        wh = w * h
        o = wh / (area[i] + area[ind] - wh)

        # Keep only the boxes that do not overlap it too much.
        ind = ind[np.nonzero(o <= overlap)[0]]

    return dets[pick, :]
scores = predictions_df['bicycle'].values
windows = df[['xmin', 'ymin', 'xmax', 'ymax']].values
dets = np.hstack((windows, scores[:, np.newaxis]))
nms_dets = nms_detections(dets)
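The overlap ratio o computed inside nms_detections is the intersection over union of two boxes. A scalar, pure-Python version for a single pair of boxes may make the 0.3 threshold easier to picture. The function name iou and the two example boxes are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


# Two hypothetical 10x10 boxes, the second shifted right by 5:
# intersection 50, union 150, so the ratio is just above the 0.3 default.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
# Disjoint boxes have no overlap at all.
print(iou((0, 0, 10, 10), (20, 0, 30, 10)))  # → 0.0
```

With the default threshold, the lower-scoring box of the first pair would be suppressed, while both boxes of the second pair would survive.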
Show top 3 NMS'd detections for 'bicycle' in the image and note the gap between the top scoring box (red) and the remaining boxes.
plt.imshow(im)
currentAxis = plt.gca()
colors = ['r', 'b', 'y']
for c, det in zip(colors, nms_dets[:3]):
    currentAxis.add_patch(
        plt.Rectangle((det[0], det[1]), det[2]-det[0], det[3]-det[1],
                      fill=False, edgecolor=c, linewidth=5)
    )
print('scores:', nms_dets[:3, 4])
scores: [ 0.93699419 -0.65612102 -1.32907355]
The bicycle was an easy case because this image was in that class's training set. The person result, however, is a true detection, since the image was not in the training set for that class.
You should try out detection on an image of your own next!
(Remove the temp directory to clean up, and we're done.)
!rm -rf _temp