Start by creating a new conda
environment:
$ conda create -n pyannote python=3.6 anaconda
$ source activate pyannote
Then, install pyannote-video
and its dependencies:
$ pip install pyannote-video
Finally, download the sample video and the dlib
models:
$ git clone https://github.com/pyannote/pyannote-data.git
$ git clone https://github.com/davisking/dlib-models.git
$ bunzip2 dlib-models/dlib_face_recognition_resnet_model_v1.dat.bz2
$ bunzip2 dlib-models/shape_predictor_68_face_landmarks.dat.bz2
To execute this notebook locally:
$ git clone https://github.com/pyannote/pyannote-video.git
$ jupyter notebook --notebook-dir="pyannote-video/doc"
%pylab inline
Populating the interactive namespace from numpy and matplotlib
!pyannote-structure.py --help
Video structure

The standard pipeline is the following:

    shot boundary detection ==> shot threading ==> segmentation into scenes

Usage:
  pyannote-structure.py shot [options] <video> <output.json>
  pyannote-structure.py thread [options] <video> <shot.json> <output.json>
  pyannote-structure.py scene [options] <video> <thread.json> <output.json>
  pyannote-structure.py (-h | --help)
  pyannote-structure.py --version

Options:
  --height=<n_pixels>     Resize video frame to height <n_pixels> [default: 50].
  --window=<n_seconds>    Apply median filtering on <n_seconds> window [default: 2.0].
  --threshold=<value>     Set threshold to <value> [default: 1.0].
  --min-match=<n_match>   Set minimum number of matches to <n_match> [default: 20].
  --lookahead=<n_shots>   Look at up to <n_shots> following shots [default: 24].
  -h --help               Show this screen.
  --version               Show version.
  --verbose               Show progress.
!pyannote-structure.py shot --verbose ../../pyannote-data/TheBigBangTheory.mkv \
../../pyannote-data/TheBigBangTheory.shots.json
752frames [00:32, 23.2frames/s]
Detected shot boundaries can be visualized using pyannote.core
notebook support:
from pyannote.core.json import load_from
shots = load_from('../../pyannote-data/TheBigBangTheory.shots.json')
shots
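The help above also lists thread and scene subcommands covering the remaining steps of the pipeline (shot threading, then segmentation into scenes). Following the usage shown above, they could be chained as in this sketch (not executed in this notebook; the threads.json and scenes.json file names are illustrative):
!pyannote-structure.py thread --verbose ../../pyannote-data/TheBigBangTheory.mkv \
                                        ../../pyannote-data/TheBigBangTheory.shots.json \
                                        ../../pyannote-data/TheBigBangTheory.threads.json
!pyannote-structure.py scene --verbose ../../pyannote-data/TheBigBangTheory.mkv \
                                       ../../pyannote-data/TheBigBangTheory.threads.json \
                                       ../../pyannote-data/TheBigBangTheory.scenes.json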
!pyannote-face.py --help
Face detection and tracking

The standard pipeline is the following:

    face tracking ==> feature extraction ==> face clustering

Usage:
  pyannote-face track [options] <video> <shot.json> <tracking>
  pyannote-face extract [options] <video> <tracking> <landmark_model> <embedding_model> <landmarks> <embeddings>
  pyannote-face demo [options] <video> <tracking> <output>
  pyannote-face (-h | --help)
  pyannote-face --version

General options:
  -h --help                 Show this screen.
  --version                 Show version.
  --verbose                 Show processing progress.

Face tracking options (track):
  <video>                   Path to video file.
  <shot.json>               Path to shot segmentation result file.
  <tracking>                Path to tracking result file.
  --min-size=<ratio>        Approximate size (in video height ratio) of the smallest face
                            that should be detected. Default is to try and detect any
                            object [default: 0.0].
  --every=<seconds>         Only apply detection every <seconds> seconds. Default is to
                            process every frame [default: 0.0].
  --min-overlap=<ratio>     Associate face with tracker if overlap is greater than
                            <ratio> [default: 0.5].
  --min-confidence=<float>  Reset trackers with confidence lower than <float> [default: 10.].
  --max-gap=<float>         Bridge gaps with duration shorter than <float> [default: 1.].

Feature extraction options (extract):
  <video>                   Path to video file.
  <tracking>                Path to tracking result file.
  <landmark_model>          Path to dlib facial landmark detection model.
  <embedding_model>         Path to dlib feature extraction model.
  <landmarks>               Path to facial landmarks detection result file.
  <embeddings>              Path to feature extraction result file.

Visualization options (demo):
  <video>                   Path to video file.
  <tracking>                Path to tracking result file.
  <output>                  Path to demo video file.
  --height=<pixels>         Height of demo video file [default: 400].
  --from=<sec>              Encode demo from <sec> seconds [default: 0].
  --until=<sec>             Encode demo until <sec> seconds.
  --shift=<sec>             Shift result files by <sec> seconds [default: 0].
  --landmark=<path>         Path to facial landmarks detection result file.
  --label=<path>            Path to track identification result file.
!pyannote-face.py track --verbose --every=0.5 ../../pyannote-data/TheBigBangTheory.mkv \
../../pyannote-data/TheBigBangTheory.shots.json \
../../pyannote-data/TheBigBangTheory.track.txt
752frames [00:23, 32.0frames/s]
Face tracks can be visualized using demo
mode:
!pyannote-face.py demo ../../pyannote-data/TheBigBangTheory.mkv \
../../pyannote-data/TheBigBangTheory.track.txt \
../../pyannote-data/TheBigBangTheory.track.mp4
[MoviePy] >>>> Building video ../../pyannote-data/TheBigBangTheory.track.mp4
[MoviePy] Writing audio in TheBigBangTheory.trackTEMP_MPY_wvf_snd.mp3
100%|████████████████████████████████████████| 664/664 [00:01<00:00, 425.86it/s]
[MoviePy] Done.
[MoviePy] Writing video ../../pyannote-data/TheBigBangTheory.track.mp4
100%|████████████████████████████████████████▉| 752/753 [00:08<00:00, 87.38it/s]
[MoviePy] Done.
[MoviePy] >>>> Video ready: ../../pyannote-data/TheBigBangTheory.track.mp4
import io
import base64
from IPython.display import HTML
# read the tracking demo video and embed it in the notebook as a base64-encoded data URI
video = io.open('../../pyannote-data/TheBigBangTheory.track.mp4', 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''.format(encoded.decode('ascii')))
!pyannote-face.py extract --verbose ../../pyannote-data/TheBigBangTheory.mkv \
../../pyannote-data/TheBigBangTheory.track.txt \
../../dlib-models/shape_predictor_68_face_landmarks.dat \
../../dlib-models/dlib_face_recognition_resnet_model_v1.dat \
../../pyannote-data/TheBigBangTheory.landmarks.txt \
../../pyannote-data/TheBigBangTheory.embedding.txt
752frames [00:24, 30.4frames/s]
Once embeddings are extracted, we can apply hierarchical agglomerative clustering to the face tracks.
The distance between two clusters is defined as the average Euclidean distance between all pairs of embeddings (one embedding from each cluster).
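For intuition, here is a minimal sketch of that average-linkage distance (not part of pyannote-video; the function name and toy data are illustrative), assuming each cluster is a NumPy array with one embedding per row:
import numpy as np
from scipy.spatial.distance import cdist

def average_linkage_distance(cluster_a, cluster_b):
    # average Euclidean distance over all pairs (one embedding from each cluster)
    return cdist(cluster_a, cluster_b, metric='euclidean').mean()

# toy 128-dimensional vectors (dlib face embeddings are 128-dimensional)
rng = np.random.RandomState(0)
cluster_a = rng.randn(5, 128)
cluster_b = rng.randn(3, 128) + 1.0
print(average_linkage_distance(cluster_a, cluster_b))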
from pyannote.video.face.clustering import FaceClustering
clustering = FaceClustering(threshold=0.6)
face_tracks, embeddings = clustering.model.preprocess('../../pyannote-data/TheBigBangTheory.embedding.txt')
face_tracks.get_timeline()
result = clustering(face_tracks, features=embeddings)
from pyannote.core import notebook, Segment
notebook.reset()
notebook.crop = Segment(0, 30)
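Cluster labels are integers assigned by the clustering step and depend on this particular run. Before mapping them to character names below, one can check which clusters exist and how much screen time each one covers; a minimal sketch using the pyannote.core Annotation API (assuming result is the annotation returned by the clustering above):
for cluster in result.labels():
    print(cluster, result.label_duration(cluster))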
mapping = {9: 'Leonard', 6: 'Sheldon', 14: 'Receptionist', 5: 'False_alarm'}
result = result.rename_labels(mapping=mapping)
result
with open('../../pyannote-data/TheBigBangTheory.labels.txt', 'w') as fp:
for _, track_id, cluster in result.itertracks(yield_label=True):
fp.write(f'{track_id} {cluster}\n')
!pyannote-face.py demo ../../pyannote-data/TheBigBangTheory.mkv \
../../pyannote-data/TheBigBangTheory.track.txt \
--label=../../pyannote-data/TheBigBangTheory.labels.txt \
../../pyannote-data/TheBigBangTheory.final.mp4
[MoviePy] >>>> Building video ../../pyannote-data/TheBigBangTheory.final.mp4
[MoviePy] Writing audio in TheBigBangTheory.finalTEMP_MPY_wvf_snd.mp3
100%|████████████████████████████████████████| 664/664 [00:01<00:00, 411.21it/s]
[MoviePy] Done.
[MoviePy] Writing video ../../pyannote-data/TheBigBangTheory.final.mp4
100%|████████████████████████████████████████▉| 752/753 [00:08<00:00, 87.43it/s]
[MoviePy] Done.
[MoviePy] >>>> Video ready: ../../pyannote-data/TheBigBangTheory.final.mp4
import io
import base64
from IPython.display import HTML
# embed the final, labeled demo video in the notebook as a base64-encoded data URI
video = io.open('../../pyannote-data/TheBigBangTheory.final.mp4', 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''.format(encoded.decode('ascii')))