#!/usr/bin/env python
# coding: utf-8

# # Optimize your Docker Infrastructure with Python
#
# ## PyData NYC
#
# November 11, 2015
#
# * Ryan J. O'Neil (@ryanjoneil)
# * Day: Lead Engineer @ [Yhat](http://www.yhathq.com)
# * Night: PhD student in Operations Research @ GMU
# * Blog: http://ryanjoneil.github.io/blog.html
# * Slides: https://github.com/ryanjoneil/talks/

# ### Motivation
#
# Consider the DevOps Engineer.

# #### It's a noble profession...
#
# ![](images/scotty.jpg)

# #### ...with grave responsibilities.
#
# One of the important functions of DevOps is building environments for:
#
# * Software development
# * Testing & quality assurance
# * Running operational systems
# * Whatever else you might need a computing environment for

# #### There's even a Venn diagram about DevOps.
#
# And we all know how much we like Venn diagrams.
#
# ![](images/devops-venn.svg)

# #### A typical DevOps function
#
# You work on _Yet Another Enterprise Java Application (TM)_. YAEJA makes \$\$ for _Yet Another Enterprise Software Company_!
#
# You need to upgrade components of the environment where YAEJA runs.
#
# You can't upgrade production without testing. So you need two environments from your DevOps person.
#
# One for production:
#
# &nbsp;
#
# |     | Command                                                       |
# |-----|---------------------------------------------------------------|
# | $A$ | Install the Java compiler and runtime environment.            |
# | $B$ | Download a set of external dependencies.                      |
# | $C$ | Set up an EnterpriseDB (TM) schema and populate it with data. |

# And one for testing it with the upgrade:
#
# &nbsp;
#
# |     | Command                                                       |
# |-----|---------------------------------------------------------------|
# | $A$ | Install the Java compiler and runtime environment.            |
# | $B$ | Download a set of external dependencies.                      |
# | $C$ | Set up an EnterpriseDB (TM) schema and populate it with data. |
# | $D$ | Update the underlying system utility.                         |
#
# &nbsp;
#
# If YAEJA integrates with 3rd party apps, you need another test environment for each one.
#
# If those interact in interesting ways, you need test environments for different combinations of them.

# #### Looks like your DevOps person will be working all weekend
#
# ![](images/saturday.png)

# #### In the old days...
#
# At a big software shop, DevOps may set up and tear down dozens (or _hundreds_) of system configurations for a single software project.
#
# This used to be done with physical hardware on _(sometimes)_ fresh OS installs.
#
# Most medium-to-large software shops had an air-conditioned room like this.
#
# ![](images/cable-spaghetti.jpg)

# #### Environment setup was often pretty manual
#
# If you had to reproduce an environment, you had a few options:
#
# * Start with a fresh install of the OS
#
# * Set up the environment once and save the results to a CD
#
# * Hope that uninstalling and reinstalling is good enough...
#
#   + A _lot_ of people did this
#   + Probably not good enough
#   + Software leaves relics
#     - Logs...
#     - Data files...
#     - Configuration...

# #### Nowadays...
#
# ![](images/cloud-docker.png)

# ### What's a container?
#
# Docker is so hot right now.

# I've been getting two questions a lot this week:
#
# 1. What's Docker?
# 2. Should I be using it?

# #### A1: Docker is a lightweight virtualization platform
#
# Docker provides two main things: images & containers.
#
# Images represent the environments containers run in.
#
# A running container acts like it is:
#
# * In its own OS
# * On its own hardware
#
# But it doesn't load heavy stuff like a new kernel or device drivers.

# $${\Huge\text{DEMO TIME}}$$
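# In case you're reading these slides without the live demo: here's roughly the kind of thing it shows. This is a rough sketch, assuming a local Docker daemon and the `ubuntu:trusty` image are available.

# In[ ]:

# A rough sketch of the isolation story, assuming a local Docker daemon
# and the ubuntu:trusty image.
import subprocess

def sh(cmd):
    return subprocess.check_output(cmd, shell=True).decode().strip()

# Same kernel as the host -- containers don't boot their own...
print('host kernel:     ', sh('uname -r'))
print('container kernel:', sh('docker run --rm ubuntu:trusty uname -r'))

# ...but an isolated filesystem, process table, and hostname.
print('host hostname:     ', sh('hostname'))
print('container hostname:', sh('docker run --rm ubuntu:trusty hostname'))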
# #### A2: I don't know what you should be doing
#
# But Yhat uses a ${\large\text{LOT}}$ of Docker.
#
# * Development (MySQL containers)
# * Functional tests for our software
# * Building binary packages for Linux distros
# * Productionizing predictive models (ScienceOps)
# * Environment management in collaborative tools (ScienceCluster)

# $${\Huge\text{DEMO TIME}}$$

# #### So containers are about isolation...
#
# Processes and system resources behave as if they were on their own computer.

# #### ...but containers are also about sharing.
#
# And this is what we care about here.

# Specifically, saving and retrieving the results of a computation from the Docker image cache.

# Why?
#
# &nbsp;
#
# $$\text{SMART CACHE USE} = \text{DEVOPS TIME SAVED!}$$

# ![](images/scotty-approved.jpg)

# ### Docker cache mechanics
#
# Maybe I should call this section UnionFS mechanics, but then Docker is so hot right now.

# #### A Tale of Two Dockerfiles
#
# Let's say I collect old or non-standard revision control systems.

# Everyone needs a hobby.

# I set up an environment with a few RCSs in it to play around with.
#
# Dockerfile: `rcs1`
# &nbsp;
# ```
# FROM ubuntu:trusty
# RUN apt-get install -y bzr
# RUN apt-get install -y cvs
# ```

# #### Let's build out our non-standard RCS image!
#
# ```
# [ryan@localhost dockerfiles]$ docker build --file=rcs1 --tag=rcs1 .
# ```

# Step 0 would download the base Ubuntu image, but I've already got it.
#
# ```
# Sending build context to Docker daemon 24.58 kB
# Step 0 : FROM ubuntu:trusty
# ---> a5a467fddcb8
# ```

# `bzr` requires a number of dependencies...
#
# ```
# Step 1 : RUN apt-get install -y bzr
# ---> Running in 7679f0a1525e
# Reading package lists...
# Building dependency tree...
# Reading state information...
# The following extra packages will be installed:
#   ca-certificates dbus gir1.2-glib-2.0 libapparmor1 libassuan0
#   libdbus-glib-1-2 libgirepository-1.0-1 libglib2.0-0 libglib2.0-data libgmp10
# [...snip...]
# ```

# As does `cvs`.
#
# ```
# Step 2 : RUN apt-get install -y cvs
# ---> Running in 54229eef94d2
# Reading package lists...
# Building dependency tree...
# Reading state information...
# The following extra packages will be installed:
#   krb5-locales libedit2 libgssapi-krb5-2 libk5crypto3 libkeyutils1 libkrb5-3
#   libkrb5support0 libx11-6 libx11-data libxau6 libxcb1 libxdmcp6 libxext6
# [...snip...]
# ```

# At the end of this I have a tagged image. I can easily use it to spin up new containers.
#
# ```
# [ryan@localhost dockerfiles]$ docker run rcs1 bash -c "which cvs && which bzr"
# /usr/bin/cvs
# /usr/bin/bzr
# ```

# #### Oh snap.
#
# I found an old RCS repository from the 1980s I want to look at.
#
# ![](images/rcs-sccs.jpg)

# I don't have `rcs` installed. I guess I'll have to build a new image.
#
# Dockerfile: `rcs2`
# &nbsp;
# ```
# FROM ubuntu:trusty
# RUN apt-get install -y bzr
# RUN apt-get install -y cvs
# RUN apt-get install -y software-properties-common
# RUN add-apt-repository "deb http://archive.ubuntu.com/ubuntu trusty universe"
# RUN apt-get update
# RUN apt-get install -y rcs
# ```

# ```
# [ryan@localhost dockerfiles]$ docker build --file=rcs2 --tag=rcs2 .
# ```
#
# The first few steps have already been run before and can use the Docker cache!

# ```
# Step 0 : FROM ubuntu:trusty
# ---> a5a467fddcb8
#
# Step 1 : RUN apt-get install -y bzr
# ---> Using cache
# ---> ed8a8efd04cc
#
# Step 2 : RUN apt-get install -y cvs
# ---> Using cache
# ---> 61fe05c24a19
# ```

# After that it's business as usual.
#
# ```
# Step 3 : RUN apt-get install -y software-properties-common
# ---> Running in 4adc2bb63a73
# The following extra packages will be installed:
#   iso-codes libasn1-8-heimdal libcurl3-gnutls libgssapi3-heimdal
#   libhcrypto4-heimdal libheimbase1-heimdal libheimntlm0-heimdal
# [...snip...]
#
# Step 4 : RUN add-apt-repository "deb http://archive.ubuntu.com/ubuntu trusty universe"
# ---> Running in 5a023435448c
# ---> 02439f270099
# Removing intermediate container 5a023435448c
#
# Step 5 : RUN apt-get update
# ---> Running in 449c25ee3537
# Get:1 http://archive.ubuntu.com trusty-updates InRelease [64.4 kB]
# [...snip...]
#
# Step 6 : RUN apt-get install -y rcs
# ---> Running in 3574b94bbea3
# The following NEW packages will be installed:
#   rcs
# ```
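# If you want to see the layer sharing behind that cache reuse, `docker history` lists the layers an image is built from. A minimal sketch, assuming the `rcs1` and `rcs2` images built above: the layers they have in common show up with identical IDs in both listings.

# In[ ]:

# A minimal sketch, assuming the rcs1 and rcs2 images built above.
# Shared layers appear with identical IDs in both listings.
import subprocess

for tag in ('rcs1', 'rcs2'):
    print(subprocess.check_output('docker history %s' % tag, shell=True).decode())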
# #### Adding to my stable of random revision control systems.
#
# _Also, why are they all three letters long?_
#
# So now I have `rcs` in addition to `bzr` and `cvs`. Note that my `rcs2` image was built on top of the layers cached while building `rcs1`. I didn't have to do any extra work to recreate those shared parts of it.
#
# ```
# [ryan@localhost dockerfiles]$ docker run rcs2 bash -c "which bzr && which cvs && which rcs"
# /usr/bin/bzr
# /usr/bin/cvs
# /usr/bin/rcs
# ```

# #### What happens if I rebuild the image again?
#
# Omigosh everything is cached!
#
# ```
# [ryan@localhost dockerfiles]$ docker build --file=rcs2 --tag=rcs2 .
# Sending build context to Docker daemon 3.072 kB
# Step 0 : FROM ubuntu:trusty
# ---> a5a467fddcb8
#
# Step 1 : RUN apt-get install -y bzr
# ---> Using cache
# ---> ed8a8efd04cc
#
# Step 2 : RUN apt-get install -y cvs
# ---> Using cache
# ---> 61fe05c24a19
#
# Step 3 : RUN apt-get install -y software-properties-common
# ---> Using cache
# ---> 942207de6f7f
# ```

# ```
# Step 4 : RUN add-apt-repository "deb http://archive.ubuntu.com/ubuntu trusty universe"
# ---> Using cache
# ---> 02439f270099
#
# Step 5 : RUN apt-get update
# ---> Using cache
# ---> 220536dfc3ff
#
# Step 6 : RUN apt-get install -y rcs
# ---> Using cache
# ---> 1a8cd22cb14d
# Successfully built 1a8cd22cb14d
# ```

# ![](images/spock-logical-awesome.jpg)

# #### What if I change the order of that Dockerfile?
#
# I'd really rather have the commands that add `universe` at the top.
#
# Dockerfile: `rcs3`
# &nbsp;
# ```
# FROM ubuntu:trusty
# RUN apt-get install -y software-properties-common
# RUN add-apt-repository "deb http://archive.ubuntu.com/ubuntu trusty universe"
# RUN apt-get update
# RUN apt-get install -y bzr
# RUN apt-get install -y cvs
# RUN apt-get install -y rcs
# ```

# #### What happens when I build an image with the commands reordered?
#
# Nothing except the base image is cached!
#
# ```
# [ryan@localhost dockerfiles]$ docker build --file=rcs3 --tag=rcs3 .
# Sending build context to Docker daemon 4.096 kB
#
# Step 0 : FROM ubuntu:trusty
# ---> a5a467fddcb8
#
# Step 1 : RUN apt-get install -y software-properties-common
# ---> Running in 713e49f1f323
# Reading package lists...
# Building dependency tree...
# Reading state information...
# The following extra packages will be installed:
#   ca-certificates gir1.2-glib-2.0 iso-codes krb5-locales libasn1-8-heimdal
#   libcurl3-gnutls libdbus-glib-1-2 libgirepository-1.0-1 libglib2.0-0
#   libglib2.0-data libgssapi-krb5-2 libgssapi3-heimdal libhcrypto4-heimdal
#   libheimbase1-heimdal libheimntlm0-heimdal libhx509-5-heimdal libidn11
#   libk5crypto3 libkeyutils1 libkrb5-26-heimdal libkrb5-3 libkrb5support0
#   libldap-2.4-2 libroken18-heim
# [...etc...]
# ```

# #### So Docker can use the cache when...
#
# It has seen the exact same commands in _the same order_.
#
# If it sees something new, it _stops using the cache_ from that point on.
#
# Certain commands, like `COPY`, depend on more than the command text (the contents of the copied files, for instance), so they can break the cache even when the Dockerfile hasn't changed.
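# To make those rules concrete, here's a toy Python model of the cache. This is emphatically _not_ Docker's real implementation, just the lookup rule: a cached layer is keyed by its parent layer plus the exact command text.

# In[ ]:

# A toy model of the caching rule above -- not Docker's real
# implementation. A cached layer is keyed by (parent layer, command).
cache = {}

def build(base, commands):
    layer, hits = base, 0
    for cmd in commands:
        key = (layer, cmd)
        if key in cache:
            layer, hits = cache[key], hits + 1   # ---> Using cache
        else:
            layer = '%s/%s' % (layer, cmd)       # run cmd, commit a new layer
            cache[key] = layer
    return layer, hits

print(build('trusty', ['bzr', 'cvs']))           # no cache hits
print(build('trusty', ['bzr', 'cvs', 'rcs']))    # reuses the bzr and cvs layers
print(build('trusty', ['rcs', 'bzr', 'cvs']))    # reordered: no hits at all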
# #### Docker caching examples
#
# Building two images that share three commands:
#
# ![](images/fig01.png)

# If we structure them as such, they share nothing:
#
# ![](images/fig02.png)

# #### Once images diverge, they can never merge back together
#
# This won't work:
#
# ![](images/fig03.png)

# This is what would really happen:
#
# ![](images/fig04.png)

# This would be better:
#
# ![](images/fig05.png)

# ### Problem formulation
#
# Let's write this thing in LaTeX!

# #### What are we trying to do again?
#
# Given a set of Docker images and their commands, find the best order in which to run the commands, minimizing the required computing time.
#
# _Could also read: maximize use of the Docker cache._

# #### Say we know a priori the time required by each command
#
# We could also use a prior.
#
# $$C = \{\text{set of commands}\}$$

# $$t_c = \text{time required by command } c, \quad t_c > 0$$

# We also know the images we need to create and the commands each one requires.
#
# $$I = \{\text{set of images}\}$$

# $$C_i = \{\text{commands required for image } i\}$$

# #### We need some decision variables
#
# We define a set $P$ of (image set, command set) pairs, starting with a singleton pair $(\{i\}, \{c\})$ for each image $i \in I$ and command $c \in C_i$.

# In our first example we had two images, $I = \{1, 2\}$.
#
# The commands for the images were:
#
# $$C_1 = \{A, B, C\}$$
# $$C_2 = \{A, B, C, D\}$$

# So our set $P$ initially contains:
#
# $$P = \{(\{1\},\{A\}), (\{1\},\{B\}), (\{1\},\{C\}), \\ (\{2\},\{A\}), (\{2\},\{B\}), (\{2\},\{C\}), (\{2\},\{D\})\}$$

# Now define $x_p \in \{0,1\}$ for each $p \in P$.
#
# If $x_p = 1$, then we run the commands in $p$ for the images in $p$.

# #### A first, misguided stab at our image construction model
#
# Find the minimum cost schedule based on our set $P$ and the requirements of our images.
#
# $$
# \begin{array}{r l}
# \min & t_a x_{1,a} + t_b x_{1,b} + t_c x_{1,c} + t_a x_{2,a} + t_b x_{2,b} + t_c x_{2,c} + t_d x_{2,d} \\
# \text{s.t.} & x_{1,a} = 1 \\
# & x_{1,b} = 1 \\
# & x_{1,c} = 1 \\
# & x_{2,a} = 1 \\
# & x_{2,b} = 1 \\
# & x_{2,c} = 1 \\
# & x_{2,d} = 1 \\
# & x_p \in \{0,1\}\, \forall\, p \in P
# \end{array}
# $$

# Well that's just ridiculous.

# The solution is trivially:
#
# $$x_{1,a} = x_{1,b} = x_{1,c} = x_{2,a} = x_{2,b} = x_{2,c} = x_{2,d} = 1$$

# The cost of implementing this plan is actually the worst we could possibly do:
#
# $$2 \left(t_a + t_b + t_c\right) + t_d$$

# In fact it looks pretty much like we did this:
#
# ![](images/fig02.png)

# #### We need a variable that allows us to use the Docker cache
#
# In this problem, commands $\{A, B, C\}$ are shared between images $\{1,2\}$.
#
# So let's make a variable for that and add it to $P$.
#
# $$x_{12,abc} \in \{0,1\}$$

# Of course this changes the constraints.
#
# $$x_{1,a} + x_{12,abc} = 1$$
# $$x_{1,b} + x_{12,abc} = 1$$
# $$x_{1,c} + x_{12,abc} = 1$$
# $$x_{2,a} + x_{12,abc} = 1$$
# $$x_{2,b} + x_{12,abc} = 1$$
# $$x_{2,c} + x_{12,abc} = 1$$

# #### This time, with feeling
#
# This one actually allows us to share the time required to run commands between images.
#
# $$
# \begin{array}{r l}
# \min & t_a x_{1,a} + t_b x_{1,b} + t_c x_{1,c} + t_a x_{2,a} + t_b x_{2,b} + t_c x_{2,c} + t_d x_{2,d} + \left(t_a + t_b + t_c\right) x_{12,abc}\\
# \text{s.t.} & x_{1,a} + x_{12,abc} = 1 \\
# & x_{1,b} + x_{12,abc} = 1 \\
# & x_{1,c} + x_{12,abc} = 1 \\
# & x_{2,a} + x_{12,abc} = 1 \\
# & x_{2,b} + x_{12,abc} = 1 \\
# & x_{2,c} + x_{12,abc} = 1 \\
# & x_{2,d} = 1 \\
# & x_p \in \{0,1\}\, \forall\, p \in P
# \end{array}
# $$

# This is much better. It gives us the correct solution.
#
# $$x_{1,a} = x_{1,b} = x_{1,c} = x_{2,a} = x_{2,b} = x_{2,c} = 0$$
# $$x_{12,abc} = x_{2,d} = 1$$

# The minimum time is:
#
# $$t_a + t_b + t_c + t_d$$

# Visually this could be represented as:
#
# ![](images/fig01.png)
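# Here's a minimal PuLP sketch of that small model. The times are made up (the same values we'll assign to commands $A$ through $D$ later in these slides); the solver should pick $x_{12,abc} = x_{2,d} = 1$ for a total of $34$.

# In[ ]:

# A minimal PuLP sketch of the toy model above. The command times are
# assumptions (the same values used for commands A-D later on).
import pulp

t = {'a': 5, 'b': 10, 'c': 7, 'd': 12}
toy = pulp.LpProblem('toy', pulp.LpMinimize)
v = {n: pulp.LpVariable(n, cat=pulp.LpBinary)
     for n in ('1a', '1b', '1c', '2a', '2b', '2c', '2d', '12abc')}

# Objective: each variable pays for the commands it runs, once.
toy += (sum(t[n[1]] * v[n] for n in ('1a', '1b', '1c', '2a', '2b', '2c', '2d'))
        + (t['a'] + t['b'] + t['c']) * v['12abc'])

# Each (image, command) pair is covered exactly once.
for n in ('1a', '1b', '1c', '2a', '2b', '2c'):
    toy += v[n] + v['12abc'] == 1
toy += v['2d'] == 1

toy.solve()
print(pulp.value(toy.objective))   # 34 = t_a + t_b + t_c + t_d
print({n: int(pulp.value(var)) for n, var in v.items()})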
# #### But things are not always so simple.
#
# We created the variable $x_{12,abc}$ because it was obvious. $C_1 \cap C_2 = \{A, B, C\}$, so it was the best possible option.

# But consider this set of 4 images:
#
# $$C_3 = \{A, B, C\}$$
# $$C_4 = \{B, C, D\}$$
# $$C_5 = \{A, C, E\}$$
# $$C_6 = \{B, D, E\}$$
#
# In this problem instance, the values $t_a, \dots, t_e$ are what drive our final schedule.

# If $t_c$ is big enough, the optimal schedule might be:
#
# ```
# I3: C B A
# I4: C B D
# I5: C A E
# I6: B D E
# ```

# And if $t_a > t_b$ it might be:
#
# ```
# I3: C A B
# I4: C B D
# I5: C A E
# I6: B D E
# ```

# So if we set $x_{345,c} = 1$, we have the option to use two other variables to create our schedule:
#
# $$x_{345,c,34,b} \in \{0,1\}$$
# $$x_{345,c,35,a} \in \{0,1\}$$

# But this can only happen if $x_{345,c} = 1$. That is, $x_{345,c,34,b}$ and $x_{345,c,35,a}$ are **dependent** on it.
#
# $$x_{345,c,34,b} \le x_{345,c}$$
# $$x_{345,c,35,a} \le x_{345,c}$$

# We will represent these dependencies using another set, $R$.
#
# For all $\left(x_m, x_n\right) \in R$, any feasible solution requires that $x_m \le x_n$.
#
# Note that dependencies have structure: they can chain. We may have:
#
# $$x_m \le x_n$$
# $$x_n \le x_o$$

# #### We also have to worry about intersections...
#
# Recall our set of 4 images:
#
# $$C_3 = \{A, B, C\}$$
# $$C_4 = \{B, C, D\}$$
# $$C_5 = \{A, C, E\}$$
# $$C_6 = \{B, D, E\}$$
#
# In addition to the variables $x_{3,a}$, $x_{3,b}$, $\dots$, $x_{6,e}$, we create $x_{345,c}$, $x_{34,bc}$, $\dots$, $x_{56,e}$ and add the appropriate $\left(\text{images}, \text{commands}\right)$ pairs to $P$.

# For each image $i \in I$ and command $c \in C_i$, exactly one $x_p$ that provides $c$ for $i$ must be set to $1$. This restricts **intersections** in our schedule.
#
# $$\sum_{p = \left(p_i,p_c\right) \in P,\, i \in p_i,\, c \in p_c} x_p = 1\, \forall\, i \in I,\, c \in C_i$$

# Now we have everything we need to build our model!
#
# $$
# \begin{array}{r l l}
# \min & \sum_{p = \left(p_i,p_c\right) \in P} \left( \sum_{c \in p_c} t_c \right) x_p & \\
# \text{s.t.} & \sum_{p = \left(p_i,p_c\right) \in P,\, i \in p_i,\, c \in p_c} x_p = 1 & \forall\, i \in I,\, c \in C_i \\
# & x_m \le x_n & \forall\, \left(m, n\right) \in R \\
# & x_p \in \{0,1\} & \forall\, p \in P \\
# \end{array}
# $$

# All we have to do is generate $P$ and $R$!
#
# Right?

# ### Finding maximal cliques with NetworkX
#
# At some point, somebody is going to write some code.

# So how do we generate $P$ and $R$?

# Naive option:
#
# * Make a graph with all images and commands
# * Connect each image to its commands
# * Connect images to each other
# * Connect commands to each other
# * Find maximal cliques!

# **Q:** What's a maximal clique?

# **A:** A subgraph in which every pair of nodes is adjacent. It's maximal if it isn't contained in any larger clique.

# ![](images/clique.png)
#
# _A clique. Source: Wikipedia_
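# A quick sanity check of that definition with networkx, whose `find_cliques` iterates over the maximal cliques of a graph:

# In[ ]:

# A tiny example: a triangle with a pendant edge. The maximal cliques
# are {1, 2, 3} and {3, 4}; smaller cliques like {1, 2} don't count
# because they're contained in a bigger one. (Output order may vary.)
import networkx as nx

g = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])
print(list(nx.find_cliques(g)))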
# Create a graph and connect every image to its commands.
#
# ![](images/graph_1.png)

# Now connect every pair of images, and every pair of commands.
#
# ![](images/graph_2.png)

# In[79]:

# Define the commands we have available.
commands = ['A', 'B', 'C', 'D', 'E']

# And the images we need to construct from them.
images = [3, 4, 5, 6]
image_commands = {
    3: ['A', 'B', 'C'],
    4: ['B', 'C', 'D'],
    5: ['A', 'C', 'E'],
    6: ['B', 'D', 'E'],
}

# Make up some times we expect the commands to take.
command_times = {
    'A': 5,
    'B': 10,
    'C': 7,
    'D': 12,
    'E': 8,
}


# In[53]:

# Construct the graph shown above.
# Prefix images with 'i' and commands with 'c'.
from itertools import combinations
import networkx as nx

def build_graph(image_dict):
    g = nx.Graph()
    all_cmds = set()
    for i, cmds in image_dict.items():
        for c in cmds:
            g.add_edge('i%d' % i, 'c%s' % c)
            all_cmds.add(c)
    for i1, i2 in combinations(image_dict, 2):
        g.add_edge('i%d' % i1, 'i%d' % i2)
    for c1, c2 in combinations(all_cmds, 2):
        g.add_edge('c%s' % c1, 'c%s' % c2)
    return g

build_graph(image_commands).nodes()


# In[55]:

# Generate all maximal cliques with > 1 image and >= 1 command.
from networkx.algorithms import find_cliques

def command_cliques(image_dict):
    max_cliques = []
    g = build_graph(image_dict)
    for c in find_cliques(g):
        imgs = set(int(x.replace('i', '')) for x in c if x.startswith('i'))
        cmds = set(x.replace('c', '') for x in c if x.startswith('c'))
        if len(imgs) > 1 and cmds:
            max_cliques.append((imgs, cmds))
    return max_cliques

# Top-level cliques (these depend on nothing).
top_cliques = command_cliques(image_commands)
top_cliques


# Add each of these to $P$.
#
# Now consider each maximal clique. Let's look at $\left(\{3,4,5\},\{C\}\right)$.
#
# $x_{345,c} = 1$ if images $\{3,4,5\}$ share command $C$.
#
# Can they share anything after that?

# Repeat the process with the remaining commands for images $\{3,4,5\}$. Leave out $C$.
#
# ![](images/graph_3.png)

# ![](images/graph_4.png)

# Max cliques:
#
# $$\left(\{3,4\},\{B\}\right)$$
# $$\left(\{3,5\},\{A\}\right)$$
#
# Add variables to $P$:
#
# $$x_{345,c,34,b} \in \{0,1\}$$
# $$x_{345,c,35,a} \in \{0,1\}$$
#
# Both depend on their "parent" clique:
#
# $$x_{345,c,34,b} \le x_{345,c}$$
# $$x_{345,c,35,a} \le x_{345,c}$$
#
# So add dependencies to $R$:
#
# $$\left(x_{345,c,34,b}, x_{345,c}\right)$$
# $$\left(x_{345,c,35,a}, x_{345,c}\right)$$

# Maximal clique enumeration is expensive: finding a maximum clique is NP-hard, and a graph can have exponentially many maximal cliques.
#
# But it works _better than you might think!_ Why?
#
# &nbsp;
# &nbsp;
# $${\Huge\text{THEORY} \ne \text{PRACTICE}}$$
# &nbsp;
# &nbsp;
#
# There are more efficient ways, but that's for another day.

# ### Partitioning sets with PuLP
#
# "NP-Complete" is a term used to scare graduate students.

# #### We have a "set partitioning problem" with side constraints.
#
# Let's ignore dependencies. Say $R = \emptyset$.

# An optimal "schedule" is a set of $x_p$ variables we set to $1$ that:
#
# * Assigns each image all its commands **once**
# * Minimizes required compute time

# Mathematically, set partitioning looks like:
#
# $$
# \begin{array}{r l}
# \min|\max & c^\intercal x \\
# \text{s.t.} & Ax = \mathbf{1} \\
# & x \in \{0,1\}^n \\
# \end{array}
# $$
#
# where $c \in \mathbb{R}^n$ and $A \in \{0,1\}^{m \times n}$.

# #### Technically, set partitioning is NP-complete.
#
# But again...
#
# &nbsp;
# &nbsp;
# $${\huge\text{THEORY} \ne \text{PRACTICE}}$$
# &nbsp;
# &nbsp;
#
# Real world data is usually **NOT** pathological.
#
# We have _really_ good technology for solving set partitions:
#
# * Heuristics
# * Optimization (combinatorial, dynamic)
#
# So even with the dependency constraints, we're probably OK.
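# Automating the recursion from the last section might look something like the sketch below. It reuses `command_cliques` from above and is only meant to show the shape of the idea (it makes no attempt to deduplicate cliques); it is not the exact generator behind the results that follow.

# In[ ]:

# A sketch of generating P (clique variables) and R (dependencies) by
# recursing on each maximal clique's remaining commands. Reuses
# command_cliques() from above; illustrative only.
def generate(image_dict, parent=None, P=None, R=None):
    if P is None:
        P, R = [], []
    for imgs, cmds in command_cliques(image_dict):
        # Include the parent in the node's identity so chains are unambiguous.
        node = (frozenset(imgs), frozenset(cmds), parent)
        P.append(node)
        if parent is not None:
            R.append((node, parent))   # child clique depends on its parent
        # Recurse on what's left for the images in this clique.
        remaining = {i: [c for c in image_dict[i] if c not in cmds]
                     for i in imgs}
        generate(remaining, parent=node, P=P, R=R)
    return P, R

P, R = generate(image_commands)
print(len(P), 'clique variables,', len(R), 'dependencies')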
# In[76]:

# Let's build a model!
import pulp

m = pulp.LpProblem('Docker schedule', pulp.LpMinimize)
x = {}  # binary variables

# These variables are on when images do not share the results of commands.
for i, cmds in image_commands.items():
    for c in cmds:
        n = '%d-%s' % (i, c)
        x[n] = pulp.LpVariable(name=n, cat=pulp.LpBinary)


# In[77]:

# Create variables for cliques. (Normally we'd do this automatically.)
x['46-BD'] = pulp.LpVariable(name='46-BD', cat=pulp.LpBinary)
x['346-B'] = pulp.LpVariable(name='346-B', cat=pulp.LpBinary)
x['56-E'] = pulp.LpVariable(name='56-E', cat=pulp.LpBinary)
x['345-C'] = pulp.LpVariable(name='345-C', cat=pulp.LpBinary)
x['34-BC'] = pulp.LpVariable(name='34-BC', cat=pulp.LpBinary)
x['35-AC'] = pulp.LpVariable(name='35-AC', cat=pulp.LpBinary)

# These are child cliques we generate using subgraphs.
x['346-B-34-C'] = pulp.LpVariable(name='346-B-34-C', cat=pulp.LpBinary)
x['346-B-46-D'] = pulp.LpVariable(name='346-B-46-D', cat=pulp.LpBinary)
x['345-C-34-B'] = pulp.LpVariable(name='345-C-34-B', cat=pulp.LpBinary)
x['345-C-35-A'] = pulp.LpVariable(name='345-C-35-A', cat=pulp.LpBinary)

# Dependencies from parent to child cliques.
m += x['346-B-34-C'] <= x['346-B']
m += x['346-B-46-D'] <= x['346-B']
m += x['345-C-34-B'] <= x['345-C']
m += x['345-C-35-A'] <= x['345-C']

# Restrict intersecting cliques at the same level.
m += x['346-B'] + x['345-C'] + x['34-BC'] + x['35-AC'] <= 1  # image 3
m += x['46-BD'] + x['346-B'] + x['345-C'] + x['34-BC'] <= 1  # image 4
m += x['56-E'] + x['345-C'] + x['35-AC'] <= 1                # image 5
m += x['46-BD'] + x['346-B'] + x['56-E'] <= 1                # image 6
m += x['346-B-34-C'] + x['346-B-46-D'] <= 1                  # image 4
m += x['345-C-34-B'] + x['345-C-35-A'] <= 1                  # image 3


# In[78]:

# A variable provides a command to an image only via the *last*
# images-commands pair in its name: for a child clique like
# '345-C-34-B', that's images {3,4} and command B.
def provides(n):
    imgs, cmds = n.split('-')[-2:]
    return imgs, cmds

# Each image runs each command exactly once.
for i, cmds in image_commands.items():
    for c in cmds:
        m += sum(x[n] for n in x
                 if str(i) in provides(n)[0] and c in provides(n)[1]) == 1

# Objective: minimize the required compute time. Each variable costs
# only the time of the commands it actually runs.
obj = 0.0
for n, x_p in x.items():
    t = sum(command_times[c] for c in provides(n)[1])
    obj += t * x_p

# Use a variable so it's easy to pull out the objective value.
z = pulp.LpVariable('objective', lowBound=0)
m += z == obj
m.setObjective(z)

assert m.solve() == pulp.LpStatusOptimal

print('Time required:', pulp.value(z))
print('Schedule:')
for n, x_p in x.items():
    if pulp.value(x_p) > 0.5:
        print('\t', n)


# To translate that output into a real schedule, first add any chosen cliques to the schedule, parents before children:
#
# ```
# I3: A C
# I4: B D
# I5: A C
# I6: B D
# ```

# The remaining commands go at the end:
#
# ```
# I3: A C B
# I4: B D C
# I5: A C E
# I6: B D E
# ```
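# That bookkeeping is easy to mechanize. A sketch, assuming the naming scheme above: the last images-commands pair in a variable's name is what it actually runs, and a parent clique's name is always a prefix of its children's.

# In[ ]:

# A sketch of turning solved variables into per-image schedules.
# Parents sort before their (longer-named) children, and plain 'i-c'
# variables sort last, so shared layers land at the front. Ties among
# the leftover commands come out in arbitrary order.
def extract_schedule(x, image_commands):
    schedule = {i: [] for i in image_commands}
    chosen = [n for n, x_p in x.items() if pulp.value(x_p) > 0.5]
    for name in sorted(chosen, key=lambda n: (len(n.split('-')[0]) == 1, len(n))):
        imgs, cmds = name.split('-')[-2:]
        for i in (int(ch) for ch in imgs):
            schedule[i].extend(cmds)
    return schedule

extract_schedule(x, image_commands)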
# ### Computational example
#
# No, really. That's how we do this thing.

# #### An example with 15 images and 25 commands.
#
# A randomly generated instance. These are the images and their commands.
#
# ```json
# {
#     "images": {
#         "01": ["03", "05", "07", "08", "09", "11", "14", "15", "18", "19", "24"],
#         "02": ["02", "05", "07", "10", "15", "20", "22", "23", "25"],
#         "03": ["08", "12", "14", "15", "16", "17"],
#         "04": ["03", "05", "13", "14", "17"],
#         "05": ["03", "05", "09", "15", "17", "18", "20", "21", "22", "23", "24", "25"],
#         "06": ["04", "08", "12", "13", "16", "17"],
#         "07": ["01", "03", "04", "05", "07", "08", "12", "15", "19", "24", "25"],
#         "08": ["11", "13", "15", "16"],
#         "09": ["01", "03", "11", "13", "17", "22", "24"],
#         "10": ["08", "12", "13", "14", "15", "19", "24", "25"],
#         "11": ["04"],
#         "12": ["07", "12", "16", "19", "22", "25"],
#         "13": ["01", "06", "07", "16", "18", "19", "21", "25"],
#         "14": ["05", "10", "11", "13", "14", "22"],
#         "15": ["01", "02", "03", "04", "11", "15"]
#     },
# ```

# And the time required for each command.
#
# ```json
#     "commands": {
#         "01": 58,
#         "02": 98,
#         "03": 68,
#         "04": 152,
#         "05": 60,
#         "06": 79,
#         "07": 463,
#         "08": 95,
#         "09": 532,
#         "10": 51,
#         "11": 128,
#         "12": 88,
#         "13": 65,
#         "14": 155,
#         "15": 97,
#         "16": 148,
#         "17": 340,
#         "18": 115,
#         "19": 109,
#         "20": 14,
#         "21": 94,
#         "22": 61,
#         "23": 149,
#         "24": 43,
#         "25": 31
#     }
# }
# ```

# And the optimal image construction schedule!
#
# ```json
# {
#     "unique_images": 73,
#     "compute_time": 9414,
#     "schedule": {
#         "01": ["15", "08", "24", "19", "05", "07", "03", "11", "14", "09", "18"],
#         "02": ["15", "05", "20", "23", "22", "25", "07", "10", "02"],
#         "03": ["15", "08", "14", "12", "16", "17"],
#         "04": ["13", "17", "03", "05", "14"],
#         "05": ["15", "05", "20", "23", "22", "25", "18", "24", "17", "09", "03", "21"],
#         "06": ["13", "17", "16", "08", "04", "12"],
#         "07": ["15", "08", "24", "19", "05", "07", "03", "12", "04", "25", "01"],
#         "08": ["15", "11", "16", "13"],
#         "09": ["13", "17", "03", "01", "22", "11", "24"],
#         "10": ["15", "08", "24", "19", "12", "14", "25", "13"],
#         "11": ["04"],
#         "12": ["19", "07", "25", "16", "12", "22"],
#         "13": ["19", "07", "25", "16", "18", "21", "01", "06"],
#         "14": ["13", "14", "05", "10", "11", "22"],
#         "15": ["15", "11", "01", "04", "02", "03"]
#     }
# }
# ```

# #### How does this compare to a heuristic solution?
#
# We'll call this the `most-common` heuristic. It makes the greediest choice possible at each step. (A runnable sketch appears at the end of these slides.)
#
# * `most-common(images):`
#   + Choose the most common unscheduled command for `images`
#   + Add it to the schedule next for `images`
#   + `images2` = subset of `images` for which the command was scheduled
#   + `most-common(images2)`
#   + `most-common(images \ images2)`

# The heuristic is about 12% worse for this particular instance (10531 vs. 9414).
#
# ```json
# {
#     "unique_images": 80,
#     "compute_time": 10531,
#     "schedule": {
#         "01": ["15", "03", "24", "05", "11", "07", "19", "18", "08", "09", "14"],
#         "02": ["15", "02", "10", "25", "05", "20", "07", "22", "23"],
#         "03": ["15", "08", "12", "14", "17", "16"],
#         "04": ["13", "17", "03", "14", "05"],
#         "05": ["15", "03", "24", "05", "25", "20", "21", "18", "09", "17", "22", "23"],
#         "06": ["13", "17", "08", "12", "04", "16"],
#         "07": ["15", "03", "24", "05", "25", "12", "19", "08", "01", "07", "04"],
#         "08": ["15", "13", "11", "16"],
#         "09": ["13", "17", "03", "11", "24", "01", "22"],
#         "10": ["15", "13", "24", "19", "25", "08", "12", "14"],
#         "11": ["04"],
#         "12": ["25", "07", "16", "19", "12", "22"],
#         "13": ["25", "07", "16", "19", "18", "01", "06", "21"],
#         "14": ["13", "11", "10", "14", "22", "05"],
#         "15": ["15", "03", "02", "11", "01", "04"]
#     }
# }
# ```

# $${\Huge\text{?}}$$
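# As promised, a minimal sketch of the `most-common` heuristic, assuming the same `{image: [commands]}` structure used throughout (applied here to the small 4-image instance from earlier). Ties between equally common commands are broken arbitrarily, so its output may differ slightly from run to run.

# In[ ]:

# A minimal sketch of the most-common heuristic described above.
from collections import Counter

def most_common(image_cmds, schedule):
    counts = Counter(c for cmds in image_cmds.values() for c in cmds)
    if not counts:
        return
    # Schedule the most common remaining command next...
    cmd, _ = counts.most_common(1)[0]
    shared = {i for i, cmds in image_cmds.items() if cmd in cmds}
    for i in shared:
        schedule[i].append(cmd)
    # ...then recurse on the images that share it, and the rest separately.
    most_common({i: [c for c in image_cmds[i] if c != cmd] for i in shared},
                schedule)
    most_common({i: cmds for i, cmds in image_cmds.items() if i not in shared},
                schedule)

schedule = {i: [] for i in image_commands}
most_common(image_commands, schedule)
print(schedule)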