Start by creating a new conda
environment:
$ conda create -n pyannote python=3.6 anaconda
$ source activate pyannote
Then, install pyannote-video
and its dependencies:
$ pip install pyannote-video
Finally, download the sample video and the dlib
models:
$ git clone https://github.com/pyannote/pyannote-data.git
$ git clone https://github.com/davisking/dlib-models.git
$ bunzip2 dlib-models/dlib_face_recognition_resnet_model_v1.dat.bz2
$ bunzip2 dlib-models/shape_predictor_68_face_landmarks.dat.bz2
To execute this notebook locally:
$ git clone https://github.com/pyannote/pyannote-video.git
$ jupyter notebook --notebook-dir="pyannote-video/doc"
%pylab inline
Populating the interactive namespace from numpy and matplotlib
!pyannote-structure.py --help
Video structure

The standard pipeline is the following:

    shot boundary detection ==> shot threading ==> segmentation into scenes

Usage:
  pyannote-structure.py shot [options] <video> <output.json>
  pyannote-structure.py thread [options] <video> <shot.json> <output.json>
  pyannote-structure.py scene [options] <video> <thread.json> <output.json>
  pyannote-structure.py (-h | --help)
  pyannote-structure.py --version

Options:
  --height=<n_pixels>     Resize video frame to height <n_pixels> [default: 50].
  --window=<n_seconds>    Apply median filtering on <n_seconds> window [default: 2.0].
  --threshold=<value>     Set threshold to <value> [default: 1.0].
  --min-match=<n_match>   Set minimum number of matches to <n_match> [default: 20].
  --lookahead=<n_shots>   Look at up to <n_shots> following shots [default: 24].
  -h --help               Show this screen.
  --version               Show version.
  --verbose               Show progress.
!pyannote-structure.py shot --verbose ../../pyannote-data/TheBigBangTheory.mkv \
../../pyannote-data/TheBigBangTheory.shots.json
752frames [00:32, 23.2frames/s]
Detected shot boundaries can be visualized using pyannote.core
notebook support:
from pyannote.core.json import load_from
shots = load_from('../../pyannote-data/TheBigBangTheory.shots.json')
shots
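The help above also lists thread and scene subcommands covering the remaining steps of the pipeline (shot threading, then segmentation into scenes). Following the usage shown above, they could be chained as in this sketch (not executed in this notebook; the threads.json and scenes.json file names are illustrative):
!pyannote-structure.py thread --verbose ../../pyannote-data/TheBigBangTheory.mkv \
                                        ../../pyannote-data/TheBigBangTheory.shots.json \
                                        ../../pyannote-data/TheBigBangTheory.threads.json
!pyannote-structure.py scene --verbose ../../pyannote-data/TheBigBangTheory.mkv \
                                       ../../pyannote-data/TheBigBangTheory.threads.json \
                                       ../../pyannote-data/TheBigBangTheory.scenes.json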
!pyannote-face.py --help
Face detection and tracking

The standard pipeline is the following:

    face tracking ==> feature extraction ==> face clustering

Usage:
  pyannote-face track [options] <video> <shot.json> <tracking>
  pyannote-face extract [options] <video> <tracking> <landmark_model> <embedding_model> <landmarks> <embeddings>
  pyannote-face demo [options] <video> <tracking> <output>
  pyannote-face (-h | --help)
  pyannote-face --version

General options:
  -h --help                 Show this screen.
  --version                 Show version.
  --verbose                 Show processing progress.

Face tracking options (track):
  <video>                   Path to video file.
  <shot.json>               Path to shot segmentation result file.
  <tracking>                Path to tracking result file.
  --min-size=<ratio>        Approximate size (in video height ratio) of the smallest face
                            that should be detected. Default is to try and detect any
                            object [default: 0.0].
  --every=<seconds>         Only apply detection every <seconds> seconds. Default is to
                            process every frame [default: 0.0].
  --min-overlap=<ratio>     Associate face with tracker if overlap is greater than
                            <ratio> [default: 0.5].
  --min-confidence=<float>  Reset trackers with confidence lower than <float> [default: 10.].
  --max-gap=<float>         Bridge gaps with duration shorter than <float> [default: 1.].

Feature extraction options (extract):
  <video>                   Path to video file.
  <tracking>                Path to tracking result file.
  <landmark_model>          Path to dlib facial landmark detection model.
  <embedding_model>         Path to dlib feature extraction model.
  <landmarks>               Path to facial landmarks detection result file.
  <embeddings>              Path to feature extraction result file.

Visualization options (demo):
  <video>                   Path to video file.
  <tracking>                Path to tracking result file.
  <output>                  Path to demo video file.
  --height=<pixels>         Height of demo video file [default: 400].
  --from=<sec>              Encode demo from <sec> seconds [default: 0].
  --until=<sec>             Encode demo until <sec> seconds.
  --shift=<sec>             Shift result files by <sec> seconds [default: 0].
  --landmark=<path>         Path to facial landmarks detection result file.
  --label=<path>            Path to track identification result file.
!pyannote-face.py track --verbose --every=0.5 ../../pyannote-data/TheBigBangTheory.mkv \
../../pyannote-data/TheBigBangTheory.shots.json \
../../pyannote-data/TheBigBangTheory.track.txt
752frames [00:23, 32.0frames/s]
Face tracks can be visualized using demo
mode:
!pyannote-face.py demo ../../pyannote-data/TheBigBangTheory.mkv \
../../pyannote-data/TheBigBangTheory.track.txt \
../../pyannote-data/TheBigBangTheory.track.mp4
[MoviePy] >>>> Building video ../../pyannote-data/TheBigBangTheory.track.mp4
[MoviePy] Writing audio in TheBigBangTheory.trackTEMP_MPY_wvf_snd.mp3
100%|████████████████████████████████████████| 664/664 [00:01<00:00, 425.86it/s]
[MoviePy] Done.
[MoviePy] Writing video ../../pyannote-data/TheBigBangTheory.track.mp4
100%|████████████████████████████████████████▉| 752/753 [00:08<00:00, 87.38it/s]
[MoviePy] Done.
[MoviePy] >>>> Video ready: ../../pyannote-data/TheBigBangTheory.track.mp4
import io
import base64
from IPython.display import HTML
# read the tracking demo video and embed it in the notebook as a base64-encoded data URI
video = io.open('../../pyannote-data/TheBigBangTheory.track.mp4', 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''.format(encoded.decode('ascii')))
!pyannote-face.py extract --verbose ../../pyannote-data/TheBigBangTheory.mkv \
../../pyannote-data/TheBigBangTheory.track.txt \
../../dlib-models/shape_predictor_68_face_landmarks.dat \
../../dlib-models/dlib_face_recognition_resnet_model_v1.dat \
../../pyannote-data/TheBigBangTheory.landmarks.txt \
../../pyannote-data/TheBigBangTheory.embedding.txt
752frames [00:24, 30.4frames/s]
Once embeddings are extracted, we can apply hierarchical agglomerative clustering to the face tracks.
The distance between two clusters is defined as the average Euclidean distance between all pairs of embeddings (one embedding from each cluster).
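For intuition, here is a minimal sketch of that average-linkage distance (not part of pyannote-video; the function name and toy data are illustrative), assuming each cluster is a NumPy array with one embedding per row:
import numpy as np
from scipy.spatial.distance import cdist

def average_linkage_distance(cluster_a, cluster_b):
    # average Euclidean distance over all pairs (one embedding from each cluster)
    return cdist(cluster_a, cluster_b, metric='euclidean').mean()

# toy 128-dimensional vectors (dlib face embeddings are 128-dimensional)
rng = np.random.RandomState(0)
cluster_a = rng.randn(5, 128)
cluster_b = rng.randn(3, 128) + 1.0
print(average_linkage_distance(cluster_a, cluster_b))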
from pyannote.video.face.clustering import FaceClustering
clustering = FaceClustering(threshold=0.6)
face_tracks, embeddings = clustering.model.preprocess('../../pyannote-data/TheBigBangTheory.embedding.txt')
face_tracks.get_timeline()
result = clustering(face_tracks, features=embeddings)
from pyannote.core import notebook, Segment
notebook.reset()
notebook.crop = Segment(0, 30)
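Cluster labels are integers assigned by the clustering step and depend on this particular run. Before mapping them to character names below, one can check which clusters exist and how much screen time each one covers; a minimal sketch using the pyannote.core Annotation API (assuming result is the annotation returned by the clustering above):
for cluster in result.labels():
    print(cluster, result.label_duration(cluster))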
mapping = {9: 'Leonard', 6: 'Sheldon', 14: 'Receptionist', 5: 'False_alarm'}
result = result.rename_labels(mapping=mapping)
result
with open('../../pyannote-data/TheBigBangTheory.labels.txt', 'w') as fp:
for _, track_id, cluster in result.itertracks(yield_label=True):
fp.write(f'{track_id} {cluster}\n')
!pyannote-face.py demo ../../pyannote-data/TheBigBangTheory.mkv \
../../pyannote-data/TheBigBangTheory.track.txt \
--label=../../pyannote-data/TheBigBangTheory.labels.txt \
../../pyannote-data/TheBigBangTheory.final.mp4
[MoviePy] >>>> Building video ../../pyannote-data/TheBigBangTheory.final.mp4
[MoviePy] Writing audio in TheBigBangTheory.finalTEMP_MPY_wvf_snd.mp3
100%|████████████████████████████████████████| 664/664 [00:01<00:00, 411.21it/s]
[MoviePy] Done.
[MoviePy] Writing video ../../pyannote-data/TheBigBangTheory.final.mp4
100%|████████████████████████████████████████▉| 752/753 [00:08<00:00, 87.43it/s]
[MoviePy] Done.
[MoviePy] >>>> Video ready: ../../pyannote-data/TheBigBangTheory.final.mp4
import io
import base64
from IPython.display import HTML
# embed the final, labeled demo video in the notebook as a base64-encoded data URI
video = io.open('../../pyannote-data/TheBigBangTheory.final.mp4', 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''.format(encoded.decode('ascii')))