import utils
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
In this exercise we will simulate finding good locations for a company's production plants in order to minimize its logistical costs. In particular, we would like to place production plants near customers so as to reduce shipping costs and delivery time.
We assume that the probability of someone being a customer is independent of their geographical location, and that the overall cost of delivering products to customers is proportional to the squared Euclidean distance to the closest production plant. Under these assumptions, the K-Means algorithm is an appropriate method to find a good set of locations. Indeed, K-Means finds a spatial clustering of potential customers, and the centroid of each cluster can be chosen as the location of a plant.
Because there are potentially millions of customers, and it is not scalable to model each customer as a data point in the K-Means procedure, we instead consider as many points as there are geographical locations (in this exercise, only 72000 of them), and assign to each geographical location a weight $w_i$ corresponding to the number of inhabitants at that location. The resulting problem becomes a weighted version of K-Means where we seek to minimize the objective:
$$J(c_1,\dots,c_K) = \frac{\sum_{i} w_i \min_k \|x_i-c_k\|^2}{\sum_{i} w_i},$$
where $c_k$ is the $k$th centroid and $w_i$ is the weight of the geographical coordinate $x_i$. In order to minimize this cost function, we iteratively perform the following EM computations:
E-step: assign each location to its closest centroid,
$$k^\star(i) = \arg\min_k \|x_i - c_k\|^2,$$
M-step: recompute each centroid as the weighted mean of the locations assigned to it,
$$c_k = \frac{\sum_{i:\, k^\star(i) = k} w_i\, x_i}{\sum_{i:\, k^\star(i) = k} w_i},$$
until the objective $J(c_1,\dots,c_K)$ has converged.
In this exercise we will use data from http://sedac.ciesin.columbia.edu/, which we store in the files data/countries.pkl and data/population.pkl as part of the zip archive. The data contains, for each geographical coordinate (latitude and longitude), the number of inhabitants and the corresponding country. Several variables and methods are provided in the file utils.py:
utils.population: a 2D matrix with the number of inhabitants at each latitude/longitude.
utils.countries: a 2D matrix with the country indicator at each latitude/longitude.
utils.nx: the number of latitudes considered.
utils.ny: the number of longitudes considered.
utils.plot(latitudes, longitudes): plots a list of centroids, given as geographical coordinates, in overlay on the population density map.

We first define a helper that flattens a 2D latitude/longitude grid into a DataFrame with one row per location:

def pivot(dataset, column_name):
    df = pd.DataFrame(dataset)
    # stack the grid into a (row, col) -> value series, then flatten the index
    # so that each geographical cell becomes one row
    stacked = pd.DataFrame(df.stack())
    stacked.reset_index(inplace=True)
    stacked.columns = ['lat', 'long', column_name]
    return stacked
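As a quick illustration (a toy array, not the exercise data), pivot unrolls a 2D grid into one row per (lat, long) cell:

grid = np.array([[1, 2], [3, 4]])
print(pivot(grid, 'value'))
#    lat  long  value
# 0    0     0      1
# 1    0     1      2
# 2    1     0      3
# 3    1     1      4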
Let's see the codes for each country:
countries = pivot(utils.countries, 'country')
centers = countries.groupby(['country']).mean()
centers.drop(-9999, inplace=True)
centers.reset_index(inplace=True)
utils.plot([], [])
for idx in range(len(centers)):
    plt.annotate('%.0f' % centers.country[idx], (centers.long[idx] - 1, centers.lat[idx] - 1))
Because K-Means has a non-convex objective, choosing a good initial set of centroids is important. We consider the following heuristic for initialization: we assign a random score $s(x,y)$ to each geographical coordinate $(x,y)$. The score is computed as
$$s(x,y) = \log(1+\mathrm{population}(x,y)) + N(x,y),$$
where $N(x,y) \sim \mathrm{Exp}(\lambda)$ is a noise term that follows an exponential distribution with rate parameter $\lambda$ (i.e. scale $1/\lambda$) to be determined. The centroids are chosen to be the $K$ locations with the highest score.
Tasks: implement the initialization procedure below and visualize the initial centroids with utils.plot.

def init(population, K, lam=1.0):
    data = population.copy()
    # score each location: log-population plus exponential noise with rate lam
    data['score'] = np.log(1 + data.population) + np.random.exponential(scale=1.0/lam, size=data.population.shape)
    # keep the K highest-scoring locations as the initial centroids
    data.sort_values(by='score', ascending=False, inplace=True)
    return data[0:K]
data = pivot(utils.population, 'population')
centroids = init(data, 100)
centroids.head()
| | lat | long | population | score |
|---|---|---|---|---|
| 30766 | 85 | 166 | 19562.21818 | 20.169495 |
| 40568 | 112 | 248 | 26873.88390 | 20.077857 |
| 35762 | 99 | 122 | 285438.43500 | 20.036830 |
| 60104 | 166 | 344 | 176047.28500 | 19.738200 |
| 55389 | 153 | 309 | 32001.44500 | 19.720244 |
utils.plot(centroids.lat, centroids.long)
In order to minimize this cost function, we iteratively perform the E and M steps described above until the objective
$$J(c_1,\dots,c_K) = \frac{\sum_{i} w_i \min_k \|x_i-c_k\|^2}{\sum_{i} w_i}$$
has converged.

Tasks: implement the weighted K-means helpers below (distmat, closest, new_centers, J) and visualize the resulting centroids with utils.plot.

C = centroids[['lat', 'long']].to_numpy()
X = data[['lat', 'long']].to_numpy()
w = data.population
D = distmat(X, C): calculates the squared Euclidean distance matrix $D$, with $(D)_{ik} = \|x_i - c_k\|^2$ for each $x_i \in X$ and each $c_k \in C$.
def distmat(X, C):
    # squared Euclidean distances via the expansion ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
    X2 = np.sum(X * X, axis=1, keepdims=True)   # (n, 1) squared norms of the points
    C2 = np.sum(C * C, axis=1, keepdims=True)   # (K, 1) squared norms of the centroids
    XC = np.dot(X, C.T)                         # (n, K) inner products
    D = X2 - 2 * XC + C2.T
    return D
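A quick hand-checked sanity test (toy values, not from the dataset):

X_toy = np.array([[0., 0.], [1., 1.]])
C_toy = np.array([[2., 0.]])
print(distmat(X_toy, C_toy))  # [[4.], [2.]]: 2^2 and 1^2 + 1^2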
A = closest(D): returns the assignment matrix $A$, with $(A)_{ik} = 1$ if $c_k$ is the closest centroid to $x_i$ and $(A)_{ik} = 0$ otherwise.
def closest(D):
    # one-hot assignment of each point to its nearest centroid; argmin breaks
    # ties deterministically (D == D.min(...) could mark several columns at once)
    A = np.zeros_like(D, dtype=int)
    A[np.arange(D.shape[0]), D.argmin(axis=1)] = 1
    return A
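For example, on a toy distance matrix:

D_toy = np.array([[1., 4.], [9., 0.]])
print(closest(D_toy))
# [[1 0]
#  [0 1]]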
C = new_centers(X, A, w): calculates the new centroids. In the unweighted case, each new centroid is the mean of the points assigned to it:
$$c_k = \frac{\sum_i (A)_{ik}\, x_i}{\sum_i (A)_{ik}}.$$
For the weighted case we use a matrix $W$ instead of $A$, where $(W)_{ik} = w_i$ (the weight of data item $x_i$) if $x_i$ is assigned to $c_k$, and $0$ otherwise. The update rule is almost the same:
$$c_k = \frac{\sum_i (W)_{ik}\, x_i}{\sum_i (W)_{ik}}.$$
def new_centers(X, A, w):
    # per-cluster weights: W_ik = w_i if x_i is assigned to cluster k, else 0
    W = A * np.asarray(w).reshape(-1, 1)
    weighted_sum = np.dot(W.T, X)               # (K, 2) weighted coordinate sums
    weights = np.sum(W, axis=0).reshape(-1, 1)  # (K, 1) total weight per cluster
    return weighted_sum / weights
def J(X, D, w):
    # weighted average of each point's squared distance to its closest centroid
    D_min = D.min(axis=1)
    return (w * D_min).sum() / w.sum()
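Putting the pieces together, here is one full EM iteration on hand-made toy data (three points, two centroids, weights chosen so the numbers are easy to verify):

X_toy = np.array([[0., 0.], [2., 0.], [10., 10.]])
C_toy = np.array([[0., 0.], [10., 10.]])
w_toy = np.array([1., 3., 1.])
D_toy = distmat(X_toy, C_toy)   # squared distances to both centroids
A_toy = closest(D_toy)          # points 0 and 1 join centroid 0, point 2 joins centroid 1
print(J(X_toy, D_toy, w_toy))   # (1*0 + 3*4 + 1*0) / 5 = 2.4
print(new_centers(X_toy, A_toy, w_toy))  # [[1.5, 0.], [10., 10.]]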
First, let's find the best $\lambda$ for initialization
k = 100
repeats = 20
best_J = np.inf
best_lam = -1
for lam in [0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 1 / 0.5, 1 / 0.3, 1 / 0.2, 1 / 0.1, 1 / 0.01]:
    J_lambda = []
    for i in range(repeats):
        # draw a fresh random initialization and measure its initial objective
        centroids = init(data, k, lam=lam)
        C = centroids[['lat', 'long']].to_numpy()
        D = distmat(X, C)
        J_lambda.append(J(X, D, w))
    J_avg = np.array(J_lambda).mean()
    print('lambda=%.2f, J=%.4f' % (lam, J_avg))
    if best_J > J_avg:
        best_lam = lam
        best_J = J_avg
print()
print('best lambda=%.2f, J=%.4f' % (best_lam, best_J))
lambda=0.01, J=220.8801
lambda=0.05, J=216.1740
lambda=0.10, J=176.7742
lambda=0.20, J=147.7213
lambda=0.30, J=132.7847
lambda=0.50, J=118.6819
lambda=1.00, J=118.2946
lambda=2.00, J=153.5922
lambda=3.33, J=162.4063
lambda=5.00, J=171.6121
lambda=10.00, J=164.7151
lambda=100.00, J=172.0537

best lambda=1.00, J=118.2946
So the best $\lambda$ (the one that achieves the lowest initial $J$) is $\lambda^* = 1$; we will use it for our K-means runs.
Now we can run K-means:
def kmeans(w, k):
    data = pivot(utils.population, 'population')
    X = data[['lat', 'long']].to_numpy()
    centroids = init(data, k, lam=1.0)
    C = centroids[['lat', 'long']].to_numpy()
    J_old = -1
    converged = False
    it = 1
    while not converged:
        # E-step: distances and assignments; M-step: recompute the centroids
        D = distmat(X, C)
        A = closest(D)
        J_new = J(X, D, w)
        print('iteration %2d: J = %.4f' % (it, J_new))
        C_new = new_centers(X, A, w)
        # stop once the objective improves by at most 0.01
        converged = np.abs(J_new - J_old) <= 0.01
        C = C_new
        J_old = J_new
        it = it + 1
    return C
C = kmeans(data.population, 100)
iteration  1: J = 122.7488
iteration  2: J = 79.6622
iteration  3: J = 69.7379
iteration  4: J = 65.2335
iteration  5: J = 62.8094
iteration  6: J = 61.1282
iteration  7: J = 59.6757
iteration  8: J = 58.6360
iteration  9: J = 57.7970
iteration 10: J = 57.0818
iteration 11: J = 56.3696
iteration 12: J = 55.9072
iteration 13: J = 55.5855
iteration 14: J = 55.3961
iteration 15: J = 55.2264
iteration 16: J = 54.8011
iteration 17: J = 53.7585
iteration 18: J = 53.6047
iteration 19: J = 53.4834
iteration 20: J = 53.3694
iteration 21: J = 53.2671
iteration 22: J = 53.1886
iteration 23: J = 53.1249
iteration 24: J = 53.0709
iteration 25: J = 53.0024
iteration 26: J = 52.9178
iteration 27: J = 52.8260
iteration 28: J = 52.7474
iteration 29: J = 52.6965
iteration 30: J = 52.6658
iteration 31: J = 52.6494
iteration 32: J = 52.6414
utils.plot(C[:, 0], C[:, 1])
Market analysis has shown that people in German-speaking countries, more precisely Germany (country 111), Austria (country 104), and Switzerland (country 109), are 25 times more likely to be customers than people in other countries.
Tasks: adapt the weighting accordingly, rerun K-means, and plot the resulting centroids with utils.plot.

To account for this constraint, we just give more weight to these countries:
# Germany (111), Austria (104), Switzerland (109)
german_speaking = countries.country.isin([104, 109, 111])
w_modified = w.copy()
w_modified[german_speaking] *= 25
C_german = kmeans(w_modified, 100)
iteration  1: J = 56.0960
iteration  2: J = 33.4153
iteration  3: J = 29.8386
iteration  4: J = 28.1791
iteration  5: J = 26.8162
iteration  6: J = 25.7392
iteration  7: J = 25.0362
iteration  8: J = 24.6369
iteration  9: J = 24.3068
iteration 10: J = 24.0636
iteration 11: J = 23.8859
iteration 12: J = 23.7222
iteration 13: J = 23.5520
iteration 14: J = 23.4327
iteration 15: J = 23.3223
iteration 16: J = 23.2331
iteration 17: J = 23.1946
iteration 18: J = 23.1653
iteration 19: J = 23.1420
iteration 20: J = 23.1262
iteration 21: J = 23.1070
iteration 22: J = 23.0893
iteration 23: J = 23.0758
iteration 24: J = 23.0611
iteration 25: J = 23.0462
iteration 26: J = 23.0316
iteration 27: J = 23.0213
iteration 28: J = 23.0133
utils.plot(C_german[:, 0], C_german[:, 1])
We now suppose that deliveries across national borders are taxed heavily, and should be avoided as much as possible.
Tasks: design a procedure that respects national borders and plot the resulting centroids with utils.plot.

Our idea: allocate $k_i$ clusters to each country $i$ proportionally to the population living in that country, and then run K-means for each country separately.
Discussion: this method works well only if no country has an extreme shape (for example, a croissant-shaped country, where a centroid could end up outside the country itself), which we assume here.
data = pivot(utils.population, 'population')
countries = pivot(utils.countries, 'country')
data['country'] = countries.country
data.head()
| | lat | long | population | country |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.469012 | 68 |
| 1 | 0 | 1 | 0.487602 | 68 |
| 2 | 0 | 2 | 0.335035 | 68 |
| 3 | 0 | 3 | 0.390171 | 68 |
| 4 | 0 | 4 | 0.337061 | 68 |
g = data.groupby('country')
pop_by_country = g[['population']].sum()
pop_by_country.head()
| country | population |
|---|---|
| -9999 | 17587054.552094 |
| 68 | 84.240382 |
| 102 | 3145404.072854 |
| 103 | 30345.495150 |
| 104 | 8050197.563839 |
For some reason there are many people in country '-9999'. Let's have a look at where these people are:
weird_country = data[(data.country == -9999) & (data.population > 0.0)]
utils.plot(weird_country.lat, weird_country.long)
So it's not just a single country; these people are spread across many countries. Instead of figuring out to which exact country they belong, we might as well just ignore them for this exercise.
pop_by_country = g[['population']].sum()
pop_by_country.drop(-9999, inplace=True)
pop_by_country.reset_index(inplace=True)
total = pop_by_country.population.sum()
pop_by_country['fraction'] = pop_by_country.population / total
pop_by_country.head()
| | country | population | fraction |
|---|---|---|---|
| 0 | 68 | 84.240382 | 9.533338e-08 |
| 1 | 102 | 3145404.072854 | 3.559599e-03 |
| 2 | 103 | 30345.495150 | 3.434147e-05 |
| 3 | 104 | 8050197.563839 | 9.110269e-03 |
| 4 | 105 | 10965442.312100 | 1.240940e-02 |
There are 63 countries in total. We can use a "seat allocation" method to assign a number of clusters to each country. The simplest one is the largest remainder method (wiki), also known as the Hamilton method.
def hamilton_allocation(ratios, k):
    # split k * ratios into fractional and integer parts; each country
    # first receives the integer part of its proportional share
    frac, results = np.modf(k * np.asarray(ratios))
    # hand out the remaining seats to the largest fractional parts
    remainder = int(k - results.sum())
    indices = np.argsort(frac)[::-1]
    results[indices[0:remainder]] += 1
    return results.astype(int)
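A hand-checked toy allocation (hypothetical shares): with fractions 0.56, 0.29 and 0.15 of 10 seats, the integer parts [5, 2, 1] leave two seats, which go to the two largest fractional parts (0.9 and 0.6):

print(hamilton_allocation(np.array([0.56, 0.29, 0.15]), 10))  # [6 3 1]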
k_ind = hamilton_allocation(pop_by_country.fraction, 100)
k_ind.sum()
100
def kmeans2(data, k):
    # same loop as kmeans above, but run on an arbitrary subset of the data
    # (one country at a time) and without the per-iteration logging
    w = data.population
    X = data[['lat', 'long']].to_numpy()
    centroids = init(data, k, lam=1.0)
    C = centroids[['lat', 'long']].to_numpy()
    J_old = -1
    converged = False
    while not converged:
        D = distmat(X, C)
        A = closest(D)
        J_new = J(X, D, w)
        C_new = new_centers(X, A, w)
        converged = np.abs(J_new - J_old) <= 0.01
        C = C_new
        J_old = J_new
    return C
all_C = []
for idx, k in enumerate(k_ind):
    if k == 0:
        continue  # this country was allocated no plant
    country_id = pop_by_country.country[idx]
    country_data = data[data.country == country_id]
    C_k = kmeans2(country_data, k)
    all_C.append(C_k)
C = np.concatenate(all_C)
utils.plot(C[:, 0], C[:, 1])
Since we used the seat-allocation method, some countries were left without factories. Let's see what fraction of the population lives there.
100 * pop_by_country[k_ind == 0].population.sum() / total
3.8344646032882341
Only about 4% of the population lives there, which can be acceptable.
Explain in approximately two paragraphs the advantages and limitations of the K-Means model for the problem of optimal resource allocation. Examples of points that your answer could cover are:
Flexibility of the method with respect to various real-world constraints.
Validity of the squared Euclidean distance as a measure of shipping cost.