Notebook

Chapter 5¶

Would you rather lose an arm or a leg?¶

Statisticians can be a sour bunch. Instead of considering their winnings, they only measure how much they have lost. In fact, they consider their wins as negative losses. But what's interesting is how they measure their losses.

통계학자는 심술궂은 집단이다. 우승을 생각하기 보다는 잃게 되는 걸 잰다. 우승이란 "부정적 손실"일 뿐이다. 관심사는 손실을 어떻게 측정하는가 이다.

For example, consider the following example:

A meteorologist is predicting the probability of a possible hurricane striking his city. He estimates, with 95% confidence, that the probability of it not striking is between 99% - 100%. He is very happy with his precision and advises the city that a major evacuation is unnecessary. Unfortunately, the hurricane does strike and the city is flooded.

허리케인이 왔다. 기상학자는 95% 신뢰도로 허리케이인 강타하지 않을 확률이 99%~100%라고 추정한다. 아주 정밀했음에 흐뭇해 하며 대피는 필요 없다고 알리지만 불행히도 허리케인은 도시를 쑥대밭으로 만들고 만다

This stylized example shows the flaw in using a pure accuracy metric to measure outcomes. Using a measure that emphasizes estimation accuracy, while an appealing and objective thing to do, misses the point of why you are even performing the statistical inference in the first place: results of inference. The author Nassim Taleb of The Black Swan and Antifragility stresses the importance of the payoffs of decisions, not the accuracy. Taleb distills this quite succinctly: "I would rather be vaguely right than very wrong."

이런 류의 예는 결과를 측정하는 순수한 정확도를 사용함에 결점을 보여 준다. 추정 정확도만 강조하는 척도를 사용하면 우리가 통계적 추론을 하는 이유를 잊게된다. 작가 탤렙은 간결하게 농축했다: "심하게 잘못된 것 보다는 애매하게 옳은게 낫다."

Loss Functions¶

We introduce what statisticians and decision theorists call loss functions. A loss function is a function of the true parameter, and an estimate of that parameter

통계학자나 의결 이론가가 말하는 "손실 함수"를 소개하려 한다. 손실 함수는 참값과 그 추정치의 함수이다.

$$L( \theta, \hat{\theta} ) = f( \theta, \hat{\theta} )$$

The important point of loss functions is that it measures how bad our current estimate is: the larger the loss, the worse the estimate is according to the loss function. A simple, and very common, example of a loss function is the squared-error loss:

현재 추정이 얼마나 안 좋은가를 측정하는게 핵심이다. 제곱오류 손실이 손실함수의 한 예이다.

$$L( \theta, \hat{\theta} ) = ( \theta - \hat{\theta} )^2$$

The squared-error loss function is used in estimators like linear regression, UMVUEs and many areas of machine learning. We can also consider an asymmetric squared-error loss function, something like:

제곱오류 손실함수는 선형 회귀, UMVUEs나 기타 기계 학습에서도 추정자로 사용된다. 아래는 비대칭 제곱오류 손실함수이다.

$$L( \theta, \hat{\theta} ) = \begin{cases} ( \theta - \hat{\theta} )^2 & \hat{\theta} \lt \theta \\\\ c( \theta - \hat{\theta} )^2 & \hat{\theta} \ge \theta, \;\; 0\lt c \lt 1 \end{cases}$$

which represents that estimating a value larger than the true estimate is preferable to estimating a value below. A situation where this might be useful is in estimating web traffic for the next month, where an over-estimated outlook is preferred so as to avoid an underallocation of server resources.

참값보다 큰 추정치는 더 낮게 추정되길 선호한다. 웹 트래픽에 유용한데, 서버 자원을 낮게 추정하는 걸 피하여 높게 추정하는 걸 선호한다.

A negative property about the squared-error l손toss is that it puts a disproportionate emphasis on large outliers. This is because the loss increases quadratically, and not linearly, as the estimate moves away. That is, the penalty of being three units away is much less than being five units away, but the penalty is not much greater than being one unit away, though in both cases the magnitude of difference is the same:

제곱오류엔 부정적인 특성이 있는데 외곽으로 나갈수록 손실이 제곱으로 커지는 비선형성이다.

$$\frac{1^2}{3^2} \lt \frac{3^2}{5^2}, \;\; \text{although} \;\; 3-1 = 5-3$$

This loss function imposes that large errors are very bad. A more robust loss function that increases linearly with the difference is the absolute-loss

_이 손실함수는 큰 오류는 매우 나쁘다라는 의미를 내포한다.

$$L( \theta, \hat{\theta} ) = | \theta - \hat{\theta} |$$

Other popular loss functions include:

$L( \theta, \hat{\theta} ) = \mathbb{1}_{ \hat{\theta} \neq \theta }$ is the zero-one loss often used in machine learning classification algorithms.
$L( \theta, \hat{\theta} ) = -\hat{\theta}\log( \theta ) - (1-\hat{ \theta})\log( 1 - \theta ), \; \; \hat{\theta} \in {0,1}, \; \theta \in [0,1]$, called the log-loss, also used in machine learning.

다른 보편적인 손실함수는 0,1손실함수와 로그손실함수이다.

Historically, loss functions have been motivated from 1) mathematical convenience, and 2) they are robust to application, i.e., they are objective measures of loss. The first reason has really held back the full breadth of loss functions. With computers being agnostic to mathematical convenience, we are free to design our own loss functions, which we take full advantage of later in this Chapter.

손실함수는 수학적 편의와 강건한 적응에서 동기를 부여 받았다. 이 후, 우리만의 손실함수를 설계하겠다.

With respect to the second point, the above loss functions are indeed objective, in that they are most often a function of the difference between estimate and true parameter, independent of signage or payoff of choosing that estimate. This last point, its independence of payoff, causes quite pathological results though. Consider our hurricane example above: the statistician equivalently predicted that the probability of the hurricane striking was between 0% to 1%. But if he had ignored being precise and instead focused on outcomes (99% chance of no flood, 1% chance of flood), he might have advised differently.

다른 관점에서 보자면, 참값과 추정값 간에 차함수라는 점에서 위 손실함수는 객관적이라 할 수 있다. 마지막 관점에서 위의 허리케인 예를 보자면 통계학자는 휩쓸고 갈 확률을 0~1%로 동등하게 예측한다 하지만 정밀도는 포기하고 대신 결과에만 집착하여 완전 다르게 충고하고 만다.

By shifting our focus from trying to be incredibly precise about parameter estimation to focusing on the outcomes of our parameter estimation, we can customize our estimates to be optimized for our application. This requires us to design new loss functions that reflect our goals and outcomes. Some examples of more interesting loss functions:

우리의 초점을 추정값에 대한 못믿을 정밀도에서 모수 추정의 성과로 옮겨 놓으면 우리의 어플리에시션에 최적화된 추정치를 만들수 있다. 요건 새로운 손실함수를 설계하게 만든다. 일부 흥미로운 손실함수의 예이다.

$L( \theta, \hat{\theta} ) = \frac{ | \theta - \hat{\theta} | }{ \theta(1-\theta) }, \; \; \hat{\theta}, \theta \in [0,1]$ emphasizes an estimate closer to 0 or 1 since if the true value $\theta$ is near 0 or 1, the loss will be very large unless $\hat{\theta}$ is similarly close to 0 or 1.

This loss function might be used by a political pundit who's job requires him or her to give confident "Yes/No" answers. This loss reflects that if the true parameter is close to 1 (for example, if a political outcome is very likely to occur), he or she would want to strongly agree as to not look like a skeptic.

$L( \theta, \hat{\theta} ) = 1 - \exp \left( -(\theta - \hat{\theta} )^2 \right)$ is bounded between 0 and 1 and reflects that the user is indifferent to sufficiently-far-away estimates. It is similar to the zero-one loss above, but not quite as penalizing to estimates that are close to the true parameter.

Complicated non-linear loss functions can programmed:

 def loss(true_value, estimate):
     if estimate*true_value > 0:
         return abs(estimate - true_value)
     else:
        return abs(estimate)*(estimate - true_value)**2

Another example is from the book The Signal and The Noise. Weather forecasters have a interesting loss function for their predictions. [2]

People notice one type of mistake — the failure to predict rain — more than other, false alarms. If it rains when it isn't supposed to, they curse the weatherman for ruining their picnic, whereas an unexpectedly sunny day is taken as a serendipitous bonus.

[The Weather Channel's bias] is limited to slightly exaggerating the probability of rain when it is unlikely to occur — saying there is a 20 percent change when they know it is really a 5 or 10 percent chance — covering their butts in the case of an unexpected sprinkle.

As you can see, loss functions can be used for good and evil: with great power, comes great — well you know.

Loss functions in the real world¶

So far we have been under the unrealistic assumption that we know the true parameter. Of course if we knew the true parameter, bothering to guess an estimate is pointless. Hence a loss function is really only practical when the true parameter is unknown.

지금까지 우리는 모수를 알고 있다는 비현실적 상황을 가정했다. 모수를 알고 있다면 추정의 추측은 쓸모없다. 그래서 참값이 알려지지 않았을 때만 손실함수는 실용적이게 된다

In Bayesian inference, we have a mindset that the unknown parameters are really random variables with prior and posterior distributions. Concerning the posterior distribution, a value drawn from it is a possible realization of what the true parameter could be. Given that realization, we can compute a loss associated with an estimate. As we have a whole distribution of what the unknown parameter could be (the posterior), we should be more interested in computing the expected loss given an estimate. This expected loss is a better estimate of the true loss than comparing the given loss from only a single sample from the posterior.

베이지안 추론에서 모르는 모수를 확률 변수로 본다. 사후 분포에 따라 값은 모수가 갖을 수 있는 개연적인 값이 된다. 현실성이 주어졌을 때 우리는 추정치와 연계된 손실을 계산할 수 있다. 알려지지 않는 파라미터가 가지는 전체 분포를 알고 있다면 기대 손실을 계산하는데 더 흥미롭겠지. 이 기대 손실은 사후 확률의 단일 시료에서 오는 손실과 비교하는 것보다는 참 손실에 대한 더 나은 추정치가 된다.

First it will be useful to explain a Bayesian point estimate. The systems and machinery present in the modern world are not built to accept posterior distributions as input. It is also rude to hand someone over a distribution when all they asked for was an estimate. In the course of an individual's day, when faced with uncertainty we still act by distilling our uncertainty down to a single action. Similarly, we need to distill our posterior distribution down to a single value (or vector in the multivariate case). If the value is chosen intelligently, we can avoid the flaw of frequentist methodologies that mask the uncertainty and provide a more informative result.The value chosen, if from a Bayesian posterior, is a Bayesian point estimate.

먼저 베이지안 점추정을 설명하는 유용하겠다. 현대의 시스템과 기계류는 입력으로 사후 확률 분포를 인정하지 않고 구축되었다. 개인의 일상에서 불확실성과 마주했을 때 여전히 불확실성을 단일 행동을로 농축해 버린다. 마찬가지로 우리는 우리의 사후 확률 분포를 단일값으로 농축하는 것이 필요하다. 가치가 지적으로 선택되면 빈도주의자들의 방법론이 갖는 불확실성은 가려서 더 많은 정보를 제공하려는 결점을 피할 수 있다. 베이지안의 사후 확률에서 선택된 값이 베이지안 점추정이다.

Suppose $P(\theta | X)$ is the posterior distribution of $\theta$ after observing data $X$, then the following function is understandable as the expected loss of choosing estimate $\hat{\theta}$ to estimate $\theta$:

$$ l(\hat{\theta} ) = E_{\theta}\left[ \; L(\theta, \hat{\theta}) \; \right] $$

This is also known as the risk of estimate $\hat{\theta}$. The subscript $\theta$ under the expectation symbol is used to denote that $\theta$ is the unknown (random) variable in the expectation, something that at first can be difficult to consider.

요넘은 추정의 리스트로 알려져 있다. 시타의 아래첨자는 기대치에서 모르는 변수, 즉 확률 변수라는 의미이다.

We spent all of last chapter discussing how to approximate expected values. Given $N$ samples $\theta_i,\; i=1,...,N$ from the posterior distribution, and a loss function $L$, we can approximate the expected loss of using estimate $\hat{\theta}$ by the Law of Large Numbers:

$$\frac{1}{N} \sum_{i=1}^N \;L(\theta_i, \hat{\theta} ) \approx E_{\theta}\left[ \; L(\theta, \hat{\theta}) \; \right] = l(\hat{\theta} ) $$

Notice that measuring your loss via an expected value uses more information from the distribution than the MAP estimate which, if you recall, will only find the maximum value of the distribution and ignore the shape of the distribution. Ignoring information can over-expose yourself to tail risks, like the unlikely hurricane, and leaves your estimate ignorant of how ignorant you really are about the parameter.

기대값을 통해서 손실을 측정하게 되면 단순히 분포의 최대값만 찾고 분포의 외모를 무시하는 MAP측정 보다 분포에서 더 많은 정보를 사용하게 됨에 주목하라. 정보의 무시는 허리케인 사례처럼 꼬리쪽 위험에 과하게 노출된다.

Similarly, compare this with frequentist methods, that traditionally only aim to minimize the error, and do not consider the loss associated with the result of that error. Compound this with the fact that frequentist methods are almost guaranteed to never be absolutely accurate. Bayesian point estimates fix this by planning ahead: your estimate is going to be wrong, you might as well err on the right side of wrong.

유사하게 전통적으로 오류를 최소화하고 오류의 결과와 연결된 손실은 고려하지 않는 빈도주의 방법과 비교하라. 빈도주의자의 방식은 거의 절대적으로 정확하지 못하다고 보장할 수 있다는 사실과 혼합되어 있다. 베이지안 점추정은 추정치가 잘못되고 있어서 악화된 쪽에서 오류가 날 수 있다라는 계획하에 요넘을 바로잡는다.

Example: Optimizing for the Showcase on The Price is Right¶

Bless you if you are ever chosen as a contestant on the Price is Right, for here we will show you how to optimize your final price on the Showcase. For those who forget the rules:

https://en.wikipedia.org/wiki/The_Price_Is_Right

Two contestants compete in The Showcase.
Each contestant is shown a unique suite of prizes.
After the viewing, the contestants are asked to bid on the price for their unique suite of prizes.
If a bid price is over the actual price, the bid's owner is disqualified from winning.
If a bid price is under the true price by less than $250, the winner is awarded both prizes.

The difficulty in the game is balancing your uncertainty in the prices, keeping your bid low enough so as to not bid over, and trying to bid close to the price.

겜의 난이도는 가격 불확실성에 균형을 맞추는 것이다. 입찰가를 넘지 않게 충분히 낮게 응찰하면서 가격을 입찰가에 맞춘다.

Suppose we have recorded the Showcases from previous The Price is Right episodes and have prior beliefs about what distribution the true price follows. For simplicity, suppose it follows a Normal:

이전 가격 맞추기 에피소드에서 모든 쇼케이스를 기록했다고 가정하면 참값을 따르는 분포에 대한 사전 확률 믿음을 가지고 있는 꼴이다. 간략하게 정규성을 따른다고 보면

$$\text{True Price} \sim \text{Normal}(\mu_p, \sigma_p )$$

In a later chapter, we will actually use real Price is Right Showcase data to form the historical prior, but this requires some advanced PyMC use so we will not use it here. For now, we will assume $\mu_p = 35 000$ and $\sigma_p = 7500$.

이후 이력 사전 확률을 구축하기 위해 실제 가격 맞추기 쇼케이스 정보를 사용할 것이지만 추가적인 PyMC를 필요하게 되어 지금은 뮤 35,000과 시그마 7,500만 가정한다.

We need a model of how we should be playing the Showcase. For each prize in the prize suite, we have an idea of what it might cost, but this guess could differ significantly from the true price. (Couple this with increased pressure being onstage and you can see why some bids are so wildly off). Let's suppose your beliefs about the prices of prizes also follow Normal distributions:

쇼케이스에서 어떻게 경쟁할 것인가의 모델이 필요하다. 각 상금에 대해서 가격에 대한 생각은 있으나 이런 추축은 참가격과는 완전 다르다. 상금의 가격도 정규 분포를 따른다고 가정하자

$$ \text{Prize}_i \sim \text{Normal}(\mu_i, \sigma_i ),\;\; i=1,2$$

This is really why Bayesian analysis is great: we can specify what we think a fair price is through the $\mu_i$ parameter, and express uncertainty of our guess in the $\sigma_i$ parameter.

요게 바로 베이지안 분석이 대단한 이유이다. 우리는 공정한 가격은 평균 모수에서 기술하고 추측한 불확실성을 시그마로 표현한다.

We'll assume two prizes per suite for brevity, but this can be extended to any number. The true price of the prize suite is then given by $\text{Prize}_1 + \text{Prize}_2 + \epsilon$, where $\epsilon$ is some error term.

두 상금은 판마다 간결하다고 가정하겠지만 이는 어떤 수로든 확장될 수 있다. 상금의 참가격은 두 상금과 오류의 합으로 주어진다.

We are interested in the updated $\text{True Price}$ given we have observed both prizes and have belief distributions about them. We can perform this using PyMC.

Lets make some values concrete. Suppose there are two prizes in the observed prize suite:

A trip to wonderful Toronto, Canada!
A lovely new snowblower!

We have some guesses about the true prices of these objects, but we are also pretty uncertain about them. I can express this uncertainty through the parameters of the Normals:

\begin{align} & \text{snowblower} \sim \text{Normal}(3 000, 500 )\\\\ & \text{Toronto} \sim \text{Normal}(12 000, 3000 )\\\\ \end{align}

For example, I believe that the true price of the trip to Toronto is 12 000 dollars, and that there is a 68.2% chance the price falls 1 standard deviation away from this, i.e. my confidence is that there is a 68.2% chance the trip is in [9 000, 15 000].

We can create some PyMC code to perform inference on the true price of the suite.

In [1]:

%matplotlib inline
import scipy.stats as stats
from IPython.core.pylabtools import figsize
import numpy as np
import matplotlib.pyplot as plt

figsize(12.5, 9)

norm_pdf = stats.norm.pdf

plt.subplot(311)
x = np.linspace(0, 60000, 200)
sp1 = plt.fill_between(x, 0, norm_pdf(x, 35000, 7500),
                       color="#348ABD", lw=3, alpha=0.6,
                       label="historical total prices")
p1 = plt.Rectangle((0, 0), 1, 1, fc=sp1.get_facecolor()[0])
plt.legend([p1], [sp1.get_label()])

plt.subplot(312)
x = np.linspace(0, 10000, 200)
sp2 = plt.fill_between(x, 0, norm_pdf(x, 3000, 500),
                       color="#A60628", lw=3, alpha=0.6,
                       label="snowblower price guess")

p2 = plt.Rectangle((0, 0), 1, 1, fc=sp2.get_facecolor()[0])
plt.legend([p2], [sp2.get_label()])

plt.subplot(313)
x = np.linspace(0, 25000, 200)
sp3 = plt.fill_between(x, 0, norm_pdf(x, 12000, 3000),
                       color="#7A68A6", lw=3, alpha=0.6,
                       label="Trip price guess")
plt.autoscale(tight=True)
p3 = plt.Rectangle((0, 0), 1, 1, fc=sp3.get_facecolor()[0])
plt.legend([p3], [sp3.get_label()]);

In [2]:

import pymc as pm

data_mu = [3e3, 12e3]

data_std = [5e2, 3e3]

mu_prior = 35e3
std_prior = 75e2

true_price = pm.Normal("true_price", mu_prior, 1.0 / std_prior ** 2)

prize_1 = pm.Normal("first_prize", data_mu[0], 1.0 / data_std[0] ** 2)
prize_2 = pm.Normal("second_prize", data_mu[1], 1.0 / data_std[1] ** 2)
price_estimate = prize_1 + prize_2

@pm.potential
def error(true_price=true_price, price_estimate=price_estimate):
        return pm.normal_like(true_price, price_estimate, 1 / (3e3) ** 2)

mcmc = pm.MCMC([true_price, prize_1, prize_2, price_estimate, error])
mcmc.sample(50000, 10000)

price_trace = mcmc.trace("true_price")[:]

 [-----------------100%-----------------] 50000 of 50000 complete in 6.1 sec

In [3]:

figsize(12.5, 4)

import scipy.stats as stats

x = np.linspace(5000, 40000)
plt.plot(x, stats.norm.pdf(x, 35000, 7500), c="k", lw=2,
         label="prior dist. of suite price")

_hist = plt.hist(price_trace, bins=35, normed=True, histtype="stepfilled")
plt.title("Posterior of the true price estimate")
plt.vlines(mu_prior, 0, 1.1 * np.max(_hist[0]), label="prior's mean",
           linestyles="--")
plt.vlines(price_trace.mean(), 0, 1.1 * np.max(_hist[0]),
           label="posterior's mean", linestyles="-.")
plt.legend(loc="upper left")

Out[3]:

<matplotlib.legend.Legend at 0x7fa964e71ed0>

Notice that because of our two observed prizes and subsequent guesses (including uncertainty about those guesses), we shifted our mean price estimate down about $15 000 dollars from the previous mean price.

두개 관측 상금과 하위 추측때문에 평균 추정가가 이전 평균에서 15,000불 정도 내려갔다.

A frequentist, seeing the two prizes and having the same beliefs about their prices, would bid $\mu_1 + \mu_2 = 35000$, regardless of any uncertainty. Meanwhile, the naive Bayesian would simply pick the mean of the posterior distribution. But we have more information about our eventual outcomes; we should incorporate this into our bid. We will use the loss function above to find the best bid (best according to our loss).

두개 상금을 보고도 가격에 대해 믿음을 잃지 않는 빈도주의자는 불확실성에 관계없이 35,000에 응찰할 것이다. 그 동안에 나이브 베이지안은 단순히 사후 확률 분포의 평균을 집는다. 하지만결과에 대해서 더 많은 정보를 가지고 있다. 이 정보를 응찰에 잘 버무려 최고의 응찰가를 찾기 위해 위의 손실 함수를 사용할 것이다.

What might a contestant's loss function look like? I would think it would look something like:

def showcase_loss(guess, true_price, risk=80000):
    if true_price < guess:
        return risk
    elif abs(true_price - guess) <= 250:
        return -2*np.abs(true_price)
    else:
        return np.abs(true_price - guess - 250)

where risk is a parameter that defines of how bad it is if your guess is over the true price. A lower risk means that you are more comfortable with the idea of going over. If we do bid under and the difference is less than $250, we receive both prizes (modeled here as receiving twice the original prize). Otherwise, when we bid under the true_price we want to be as close as possible, hence the else loss is a increasing function of the distance between the guess and true price.

여기서 리스크는 추측값이 참값을 넘는데 얼마나 나쯔가를 정의한 모수이다. 낮은 리스크는 넘을 때에 비해 편안하다. 낮게 응찰하여 차가 250보다 작으면 두개 상금을 다 따먹는다. 아니라면 참값 아래에서 응찰했을 때 차가 250보다 작기를 바래야 한다. 그래서 else 손실은 참값과 추정값 간에 거리의 증가 함수이다.

For every possible bid, we calculate the expected loss associated with that bid. We vary the risk parameter to see how it affects our loss:

In [4]:

figsize(12.5, 7)
# numpy friendly showdown_loss


def showdown_loss(guess, true_price, risk=80000):
        loss = np.zeros_like(true_price)
        ix = true_price < guess
        loss[~ix] = np.abs(guess - true_price[~ix])
        close_mask = [abs(true_price - guess) <= 250]
        loss[close_mask] = -2 * true_price[close_mask]
        loss[ix] = risk
        return loss


guesses = np.linspace(5000, 50000, 70)
risks = np.linspace(30000, 150000, 6)
expected_loss = lambda guess, risk: \
    showdown_loss(guess, price_trace, risk).mean()
ㅣㅐ
for _p in risks:
    results = [expected_loss(_g, _p) for _g in guesses]
    plt.plot(guesses, results, label="%d" % _p)

plt.title("Expected loss of different guesses, \nvarious risk-levels of \
overestimating")
plt.legend(loc="upper left", title="Risk parameter")
plt.xlabel("price bid")
plt.ylabel("expected loss")
plt.xlim(5000, 30000);

Minimizing our losses¶

It would be wise to choose the estimate that minimizes our expected loss. This corresponds to the minimum point on each of the curves above. More formally, we would like to minimize our expected loss by finding the solution to

$$ \text{arg} \min_{\hat{\theta}} \;\;E_{\theta}\left[ \; L(\theta, \hat{\theta}) \; \right] $$

The minimum of the expected loss is called the Bayes action. We can solve for the Bayes action using Scipy's optimization routines. The function in fmin in scipy.optimize module uses an intelligent search to find a minimum (not necessarily a global minimum) of any uni- or multivariate function. For most purposes, fmin will provide you with a good answer.

기대 손실의 최소값을 베이지안 액션이라 부른다.

We'll compute the minimum loss for the Showcase example above:

In [5]:

import scipy.optimize as sop

ax = plt.subplot(111)


for _p in risks:
    _color = ax._get_lines.color_cycle.next()
    _min_results = sop.fmin(expected_loss, 15000, args=(_p,), disp=False)
    _results = [expected_loss(_g, _p) for _g in guesses]
    plt.plot(guesses, _results, color=_color)
    plt.scatter(_min_results, 0, s=60,
                color=_color, label="%d" % _p)
    plt.vlines(_min_results, 0, 120000, color=_color, linestyles="--")
    print "minimum at risk %d: %.2f" % (_p, _min_results)

plt.title("Expected loss & Bayes actions of different guesses, \n \
various risk-levels of overestimating")
plt.legend(loc="upper left", scatterpoints=1, title="Bayes action at risk:")
plt.xlabel("price guess")
plt.ylabel("expected loss")
plt.xlim(7000, 30000)
plt.ylim(-1000, 80000)

minimum at risk 30000: 14575.41
minimum at risk 54000: 13154.86
minimum at risk 78000: 12737.45
minimum at risk 102000: 11609.22
minimum at risk 126000: 10958.78
minimum at risk 150000: 10958.78

Out[5]:

(-1000, 80000)

As intuition suggests, as we decrease the risk threshold (care about overbidding less), we increase our bid, willing to edge closer to the true price. It is interesting how far away our optimized loss is from the posterior mean, which was about 20 000.

직감적으로 리스크 문턱을 줄일수록 참값에 가까워지는 테주리에 따라 응찰를 올린다. 최적화된 손실이 20,000인 사후 확률 평균에서 얼마나 멀리 왔는지가 흥미롭다

Suffice to say, in higher dimensions being able to eyeball the minimum expected loss is impossible. Hence why we require use of Scipy's fmin function.

눈 씻고 찾아 봐도 최소 기대손실은 불가능하다. 그래서 Scipy의 fmin 함수를 이용한다.

Shortcuts¶

For some loss functions, the Bayes action is known in closed form. We list some of them below:

If using the mean-squared loss, the Bayes action is the mean of the posterior distribution, i.e. the value

$$ E_{\theta}\left[ \theta \right] $$

minimizes $E_{\theta}\left[ \; (\theta - \hat{\theta})^2 \; \right]$. Computationally this requires us to calculate the average of the posterior samples [See chapter 4 on The Law of Large Numbers]

Whereas the median of the posterior distribution minimizes the expected absolute-loss. The sample median of the posterior samples is an appropriate and very accurate approximation to the true median.
In fact, it is possible to show that the MAP estimate is the solution to using a loss function that shrinks to the zero-one loss.

Maybe it is clear now why the first-introduced loss functions are used most often in the mathematics of Bayesian inference: no complicated optimizations are necessary. Luckily, we have machines to do the complications for us.

아마도 처음에 소개된 손실함수가 수학적으로 가장 자주 사용되는 이유가 명확해 졌다. 복잡하지 않은 최적화가 필요했고 운좋게도 우리 대신 복잡한 계산을 할 기계가 있었다.

Machine Learning via Bayesian Methods¶

Whereas frequentist methods strive to achieve the best precision about all possible parameters, machine learning cares to achieve the best prediction among all possible parameters. Of course, one way to achieve accurate predictions is to aim for accurate predictions, but often your prediction measure and what frequentist methods are optimizing for are very different.

빈도주의 기법은 가능한 모든 모수에 대해서 최고의 정밀도를 달성하기에 매진하는 반면 기게 학습은 가능한 모든 모수사아에서 최고의 예측을 얻기 위해 주위를 쏟는다. 물론 정확한 예측을 얻는 한 방법이 정확 예측을 추구하는 것이지만 우리의 예측 측정과 빈도주의자의 기법이 최적화 하는 것은 매우 다르다.

For example, least-squares linear regression is the most simple active machine learning algorithm. I say active as it engages in some learning, whereas predicting the sample mean is technically simpler, but is learning very little if anything. The loss that determines the coefficients of the regressors is a squared-error loss. On the other hand, if your prediction loss function (or score function, which is the negative loss) is not a squared-error, like AUC, ROC, precision, etc., your least-squares line will not be optimal for the prediction loss function. This can lead to prediction results that are suboptimal.

최소 제곱 선형 회귀는 거의 단일의 생생한 기계 학습 알고리즘이다. 시료의 평균 예측은 기술적으로 단순하지만 어떤 것이든 조금만 학습한다. 생생한이란 일정한 학습에 참여하는 의미에서 얘기했다. 회귀 계수를 결정하는 손실은 제곱 오류 손실이다. 다시말해 기대 손실함수는 AUC, ROC, precision등과 같은 제곱 오류가 아니라면 당신의 최소 제곱라인은 예측 손실 함수를 위한 최적점이 아니다. 이건 차최적점인 예측 결과를 이끌 수 있다.

Finding Bayes actions is equivalent to finding parameters that optimize not parameter accuracy but an arbitrary performance measure, however we wish to define performance (loss functions, AUC, ROC, precision/recall etc.).

베이지안 액션 찾기는 파모수 정확도를 최적화 하는 모수가 아닌 임의의 성능 측정을 최적화하는 모수를 찾는 과정이다. 하지만 우리는 성능(손실함수, AUC, ROC, precision/recall 등)을 정의하길 바라다.

The next two examples demonstrate these ideas. The first example is a linear model where we can choose to predict using the least-squares loss or a novel, outcome-sensitive loss.

다음 두개 예는 이런 사고 방식을 증명한다. 첫 예를 우리가 최소 제곱 손실이나 새롭고 결과에 민감한 손실을 사용한 예측을 선택할 수 있는 선형 모델이다.

The second example is adapted from a Kaggle data science project. The loss function associated with our predictions is incredibly complicated.

두번째 예는 캐글의 데이터 과학 프로젝트에 적용되었다. 우리의 예측과 연관된 손실 함수는 믿을 수 없을 만큼 복잡하다.

Example: Financial prediction¶

Suppose the future return of a stock price is very small, say 0.01 (or 1%). We have a model that predicts the stock's future price, and our profit and loss is directly tied to us acting on the prediction. How should we measure the loss associated with the model's predictions, and subsequent future predictions? A squared-error loss is agnostic to the signage and would penalize a prediction of -0.01 equally as bad a prediction of 0.03:

주가의 미래 회수가 1%로 낮다고 가정하자. 주가의 미래가격을 예측하는 모델을 가지고 수익과 손실은 직접적으로 우리의 예측 실행에 엮여 있다. 어떻게 모델의 예측과 하위 이후 예측과 연계된 손실을 측정해야 하나? 제곱 오류 손실은 알수없고 3% 예측이 좋지 않은 만큼 -1%의 예측에 벌금을 적용하려 한다.

$$ (0.01 - (-0.01))^2 = (0.01 - 0.03)^2 = 0.004$$

If you had made a bet based on your model's prediction, you would have earned money with a prediction of 0.03, and lost money with a prediction of -0.01, yet our loss did not capture this. We need a better loss that takes into account the sign of the prediction and true value. We design a new loss that is better for financial applications below:

만일 모델의 예측에 기반한 내기를 만들었다면 3%의 예측으론 돈을 딸 것이고 -1%의 예측으로 돈을 잃을 것이다. 아직도 우리의 손실은 이걸 잡아내지 못한다. 우리는 예측과 참값의 신호를 고려한 더 좋은 손실이 필요하다. 우리는 재정적으로 더 나은 새로운 손실을 설계한다.

In [6]:

figsize(12.5, 4)


def stock_loss(true_return, yhat, alpha=100.):
    if true_return * yhat < 0:
        # opposite signs, not good
        return alpha * yhat ** 2 - np.sign(true_return) * yhat \
            + abs(true_return)
    else:
        return abs(true_return - yhat)


true_value = .05
pred = np.linspace(-.04, .12, 75)

plt.plot(pred, [stock_loss(true_value, _p) for _p in pred],
         label="Loss associated with\n prediction if true value = 0.05", lw=3)
plt.vlines(0, 0, .25, linestyles="--")

plt.xlabel("prediction")
plt.ylabel("loss")
plt.xlim(-0.04, .12)
plt.ylim(0, 0.25)

true_value = -.02
plt.plot(pred, [stock_loss(true_value, _p) for _p in pred], alpha=0.6,
         label="Loss associated with\n prediction if true value = -0.02", lw=3)
plt.legend()
plt.title("Stock returns loss if true value = 0.05, -0.02");

Note the change in the shape of the loss as the prediction crosses zero. This loss reflects that the user really does not want to guess the wrong sign, especially be wrong and a large magnitude.

예측이 0을 교차하듯이 손실 모양에 변화에 주목하라. 이 손실은 사용자가 정말로 잘못된 신호, 특히 잘못되고 거대 크기를 추측하길 원하지 않는 걸 반영한다.

Why would the user care about the magnitude? Why is the loss not 0 for predicting the correct sign? Surely, if the return is 0.01 and we bet millions we will still be (very) happy.

사용자가 왜 크기에 관심을 갖는가? 올바른 신호를 예측하면 손실은 왜 0이 되지 않는가? 확실히 회수가 1%이고 백만불을 투자했다면 여전히 행복할 것인가?

Financial institutions treat downside risk, as in predicting a lot on the wrong side, and upside risk, as in predicting a lot on the right side, similarly. Both are seen as risky behaviour and discouraged. Hence why we have an increasing loss as we move further away from the true price. (With less extreme loss in the direction of the correct sign.)

잘못된 방향인 상향 리스크를 예측하듯이 제정적 규정은 하향 리스크를 다룬다. 두개 모두 리스키한 행위로 보이고 지양해야 한다. 그래서 우리가 참값으로 이동하듯이 증가 손실을 가지는 이유이다.

We will perform a regression on a trading signal that we believe predicts future returns well. Our dataset is artificial, as most financial data is not even close to linear. Below, we plot the data along with the least-squares line.

우리는 미래에 잘 회수되리라 예측하는 거래 신호에 회귀 분석을 수행할 것이다.데이터는 임의로 작성되었고 아래는 최소 제곱 선에 따라 그림을 그린 것이다.

In [7]:

# Code to create artificial data
N = 100
X = 0.025 * np.random.randn(N)
Y = 0.5 * X + 0.01 * np.random.randn(N)

ls_coef_ = np.cov(X, Y)[0, 1] / np.var(X)
ls_intercept = Y.mean() - ls_coef_ * X.mean()

plt.scatter(X, Y, c="k")
plt.xlabel("trading signal")
plt.ylabel("returns")
plt.title("Empirical returns vs trading signal")
plt.plot(X, ls_coef_ * X + ls_intercept, label="Least-squares line")
plt.xlim(X.min(), X.max())
plt.ylim(Y.min(), Y.max())
plt.legend(loc="upper left");

We perform a simple Bayesian linear regression on this dataset. We look for a model like:

$$ R = \alpha + \beta x + \epsilon$$

where $\alpha, \beta$ are our unknown parameters and $\epsilon \sim \text{Normal}(0, 1/\tau)$. The most common priors on $\beta$ and $\alpha$ are Normal priors. We will also assign a prior on $\tau$, so that $\sigma = 1/\sqrt{\tau}$ is uniform over 0 to 100 (equivalently then $\tau = 1/\text{Uniform}(0, 100)^2$).

In [8]:

import pymc as pm
from pymc.Matplot import plot as mcplot

std = pm.Uniform("std", 0, 100, trace=False)  # this needs to be explained.


@pm.deterministic
def prec(U=std):
    return 1.0 / (U) ** 2

beta = pm.Normal("beta", 0, 0.0001)
alpha = pm.Normal("alpha", 0, 0.0001)


@pm.deterministic
def mean(X=X, alpha=alpha, beta=beta):
    return alpha + beta * X

obs = pm.Normal("obs", mean, prec, value=Y, observed=True)
mcmc = pm.MCMC([obs, beta, alpha, std, prec])

mcmc.sample(100000, 80000)
mcplot(mcmc)

 [-----------------100%-----------------] 100000 of 100000 complete in 13.7 secPlotting prec
Plotting alpha
Plotting beta

It appears the MCMC has converged so we may continue.

For a specific trading signal, call it $x$, the distribution of possible returns has the form:

$$R_i(x) = \alpha_i + \beta_ix + \epsilon $$

where $\epsilon \sim \text{Normal}(0, 1/\tau_i) $ and $i$ indexes our posterior samples. We wish to find the solution to

$$ \arg \min_{r} \;\;E_{R(x)}\left[ \; L(R(x), r) \; \right] $$

according to the loss given above. This $r$ is our Bayes action for trading signal $x$. Below we plot the Bayes action over different trading signals. What do you notice?

In [9]:

figsize(12.5, 6)
from scipy.optimize import fmin


def stock_loss(price, pred, coef=500):
    """vectorized for numpy"""
    sol = np.zeros_like(price)
    ix = price * pred < 0
    sol[ix] = coef * pred ** 2 - np.sign(price[ix]) * pred + abs(price[ix])
    sol[~ix] = abs(price[~ix] - pred)
    return sol

tau_samples = mcmc.trace("prec")[:]
alpha_samples = mcmc.trace("alpha")[:]
beta_samples = mcmc.trace("beta")[:]

N = tau_samples.shape[0]

noise = 1. / np.sqrt(tau_samples) * np.random.randn(N)

possible_outcomes = lambda signal: alpha_samples + beta_samples * signal \
    + noise


opt_predictions = np.zeros(50)
trading_signals = np.linspace(X.min(), X.max(), 50)
for i, _signal in enumerate(trading_signals):
    _possible_outcomes = possible_outcomes(_signal)
    tomin = lambda pred: stock_loss(_possible_outcomes, pred).mean()
    opt_predictions[i] = fmin(tomin, 0, disp=False)


plt.xlabel("trading signal")
plt.ylabel("prediction")
plt.title("Least-squares prediction vs. Bayes action prediction")
plt.plot(X, ls_coef_ * X + ls_intercept, label="Least-squares prediction")
plt.xlim(X.min(), X.max())
plt.plot(trading_signals, opt_predictions, label="Bayes action prediction")
plt.legend(loc="upper left");

What is interesting about the above graph is that when the signal is near 0, and many of the possible returns outcomes are possibly both positive and negative, our best (with respect to our loss) prediction is to predict close to 0, hence take on no position. Only when we are very confident do we enter into a position. I call this style of model a sparse prediction, where we feel uncomfortable with our uncertainty so choose not to act. (Compare with the least-squares prediction which will rarely, if ever, predict zero).

신호가 0근처에서 다량의 회수 가능한 결과는 +/-가 가능할 때 0에 가까운 예측을 하는 것이 우리가 할 수 있는 최선이다. 단지 우리는 우주 신뢰할 수 있을 때 우리는 긍정에 서게 된다. 이런 모델의 모양을 "산발적 예측"이라 부르고 불확실성에 불편하게 느끼어 액트를 선택하지 않는다.

A good sanity check that our model is still reasonable: as the signal becomes more and more extreme, and we feel more and more confident about the positive/negativeness of returns, our position converges with that of the least-squares line.

좋은 온전성은 우리 모델이 여전히 타당한지를 확인한다. 신호가 더욱 극단에 치우치고 우리가 더욱 회수의 긍부정에 신뢰를 더하면서 우리의 위치는 최소 제곱선에 수렴한다.

The sparse-prediction model is not trying to fit the data the best (according to a squared-error loss definition of fit). That honor would go to the least-squares model. The sparse-prediction model is trying to find the best prediction with respect to our stock_loss-defined loss. We can turn this reasoning around: the least-squares model is not trying to predict the best (according to a stock-loss definition of predict). That honor would go the sparse prediction model. The least-squares model is trying to find the best fit of the data with respect to the squared-error loss.

산발적 예측 모델은 데이터를 최상으로 적합하지 않는다. 이는 최소 제곱 모델로 가는 것이다. 산발적 예측 모델은 주식 손실에 따라서 최고의 예측을 찾으려 애쓴다.최저 제곱 모델은 최고를 예측하려는 건 아니다.

Example: Kaggle contest on Observing Dark World¶

A personal motivation for learning Bayesian methods was trying to piece together the winning solution to Kaggle's Observing Dark Worlds contest. From the contest's website:

There is more to the Universe than meets the eye. Out in the cosmos exists a form of matter that outnumbers the stuff we can see by almost 7 to 1, and we don’t know what it is. What we do know is that it does not emit or absorb light, so we call it Dark Matter. Such a vast amount of aggregated matter does not go unnoticed. In fact we observe that this stuff aggregates and forms massive structures called Dark Matter Halos. Although dark, it warps and bends spacetime such that any light from a background galaxy which passes close to the Dark Matter will have its path altered and changed. This bending causes the galaxy to appear as an ellipse in the sky.

The contest required predictions about where dark matter was likely to be. The winner, Tim Salimans, used Bayesian inference to find the best locations for the halos (interestingly, the second-place winner also used Bayesian inference). With Tim's permission, we provided his solution [1] here:

Construct a prior distribution for the halo positions $p(x)$, i.e. formulate our expectations about the halo positions before looking at the data.
Construct a probabilistic model for the data (observed ellipticities of the galaxies) given the positions of the dark matter halos: $p(e | x)$.
Use Bayes’ rule to get the posterior distribution of the halo positions, i.e. use to the data to guess where the dark matter halos might be.
Minimize the expected loss with respect to the posterior distribution over the predictions for the halo positions: $\hat{x} = \arg \min_{\text{prediction} } E_{p(x|e)}[ L( \text{prediction}, x) ]$ , i.e. tune our predictions to be as good as possible for the given error metric.

후광 위치의 사전 확률 분포 구축, 예로 데이터를 보기 전의 후광 위치의 기대값
데이터(갤럭시의 관측 타원)의 확률적 모델 구축
후광 위치 사후 확률 분포의 베이즈 룰을 사용, 예로 암흑 물질 후광의 위치 추측을 위한 데이터 사용
후광 위치의 예측값에 대한 사후 분포에 따른 기대 손실값 최소화하기

The loss function in this problem is very complicated. For the very determined, the loss function is contained in the file DarkWorldsMetric.py in the parent folder. Though I suggest not reading it all, suffice to say the loss function is about 160 lines of code — not something that can be written down in a single mathematical line. The loss function attempts to measure the accuracy of prediction, in a Euclidean distance sense, such that no shift-bias is present. More details can be found on the metric's main page.

We will attempt to implement Tim's winning solution using PyMC and our knowledge of loss functions.

_ 본 문제의 손실함수가 복잡하다. 전부는 아니지만 수식은 함 봐라. _

The Data¶

The dataset is actually 300 separate files, each representing a sky. In each file, or sky, are between 300 and 720 galaxies. Each galaxy has an $x$ and $y$ position associated with it, ranging from 0 to 4200, and measures of ellipticity: $e_1$ and $e_2$. Information about what these measures mean can be found here, but for our purposes it does not matter besides for visualization purposes. Thus a typical sky might look like the following:

_ 데이터셑은 하나의 하늘을 나타내는 300개 파일에 나뉘어 있다. 각 파일, 즉 각 하늘은 300에서 720개의 갤럭시를 가진다. 각 은하는 0에서 4200 범위의 x, y 위치를 가지고 타원의 측정치(e1, e2)를 가진다. 이 측정치가 의미하는 바는 링크에서 확인해라. 하지만 우리의 목적상 가시화 빼고는 중요하지 않다. 전형적인 하늘은 아래와 같다. _

In [10]:

from draw_sky2 import draw_sky

n_sky = 3  # choose a file/sky to examine.
data = np.genfromtxt("data/Train_Skies/Train_Skies/\
Training_Sky%d.csv" % (n_sky),
                      dtype=None,
                      skip_header=1,
                      delimiter=",",
                      usecols=[1, 2, 3, 4])
print "Data on galaxies in sky %d." % n_sky
print "position_x, position_y, e_1, e_2 "
print data[:3]

fig = draw_sky(data)
plt.title("Galaxy positions and ellipcities of sky %d." % n_sky)
plt.xlabel("x-position")
plt.ylabel("y-position");

Data on galaxies in sky 3.
position_x, position_y, e_1, e_2 
[[  1.62690000e+02   1.60006000e+03   1.14664000e-01  -1.90326000e-01]
 [  2.27228000e+03   5.40040000e+02   6.23555000e-01   2.14979000e-01]
 [  3.55364000e+03   2.69771000e+03   2.83527000e-01  -3.01870000e-01]]

Priors¶

Each sky has one, two or three dark matter halos in it. Tim's solution details that his prior distribution of halo positions was uniform, i.e.

_ 각 하늘은 하나, 둘, 세개 정도의 암흑 물질 후광을 가진다. 팀의 해법은 후광 위치의 사전 확률 분포는 유니폼으로 봤다. _

\begin{align} & x_i \sim \text{Uniform}( 0, 4200)\\\\ & y_i \sim \text{Uniform}( 0, 4200), \;\; i=1,2,3\\\\ \end{align}

Tim and other competitors noted that most skies had one large halo and other halos, if present, were much smaller. Larger halos, having more mass, will influence the surrounding galaxies more. He decided that the large halos would have a mass distributed as a log-uniform random variable between 40 and 180 i.e.

_ 팀과 경쟁자는 대부분의 하늘은 하나의 거대 후광과 자잘한 후광을 갖는다고 봤다. 더 큰 질량을 갖는 거대 후광은 주별 은하계에 영향을 미친다. 거대 후광은 40~80 범위의 로그 유니폼의 확률 변수라고 봤다. _

$$ m_{\text{large} } = \log \text{Uniform}( 40, 180 ) $$

and in PyMC,

exp_mass_large = pm.Uniform("exp_mass_large", 40, 180)
@pm.deterministic
def mass_large(u = exp_mass_large):
   return np.log(u)

(This is what we mean when we say log-uniform.) For smaller galaxies, Tim set the mass to be the logarithm of 20. Why did Tim not create a prior for the smaller mass, nor treat it as a unknown? I believe this decision was made to speed up convergence of the algorithm. This is not too restrictive, as by construction the smaller halos have less influence on the galaxies.

_ 작은 은하계에 대해서 팀은 20의 로그아리씀, 작은 중량에는 사전 확률을 생성하지 않고 알려진지 않은 상태로 취급하지 않은 이유는 뭘까? 갠생각에 수렴 알고리즘의 속도를 올리기 위한 결정으로 보인다. 그리 제한적이진 않다. 구조에 따라 작은 후광은 은하계에 영향을 덜 준다. _

Tim logically assumed that the ellipticity of each galaxy is dependent on the position of the halos, the distance between the galaxy and halo, and the mass of the halos. Thus the vector of ellipticity of each galaxy, $\mathbf{e}_i$, are children variables of the vector of halo positions $(\mathbf{x},\mathbf{y})$, distance (which we will formalize), and halo masses.

_ 팀은 논리적으로 개별 은하계의 타원성은 후광의 위치에, 은하계와 후광간 거리에, 후광의 질량에 의존한다고 가정했다. 그래서 각 은하계의 타원 벡터는 후광 위치, 거리와 후광 질량 백터의 자식 변수이다. _

Tim conceived a relationship to connect positions and ellipticity by reading literature and forum posts. He supposed the following was a reasonable relationship:

_ 팀은 문헌과 포럼을 읽고 위치와 타원을 연결하는 관계를 구상했다. 다음이 타당한 관계라고 봤다. _

$$ e_i | ( \mathbf{x}, \mathbf{y} ) \sim \text{Normal}( \sum_{j = \text{halo positions} }d_{i,j} m_j f( r_{i,j} ), \sigma^2 ) $$

where $d_{i,j}$ is the tangential direction (the direction in which halo $j$ bends the light of galaxy $i$), $m_j$ is the mass of halo $j$, $f(r_{i,j})$ is a decreasing function of the Euclidean distance between halo $j$ and galaxy $i$.

Tim's function $f$ was defined:

$$ f( r_{i,j} ) = \frac{1}{\min( r_{i,j}, 240 ) } $$

for large halos, and for small halos

$$ f( r_{i,j} ) = \frac{1}{\min( r_{i,j}, 70 ) } $$

This fully bridges our observations and unknown. This model is incredibly simple, and Tim mentions this simplicity was purposefully designed: it prevents the model from overfitting.

_ 관측치와 알수없는 수치에 다리를 놓는다. 이 모델은 믿을 수 없을 정도로 간단하고 팀은 이런 단순성이 설계 의도라 했다. 모델의 과적합을 막기 위한! _

Training & PyMC implementation¶

For each sky, we run our Bayesian model to find the posteriors for the halo positions — we ignore the (known) halo position. This is slightly different than perhaps traditional approaches to Kaggle competitions, where this model uses no data from other skies nor the known halo location. That does not mean other data are not necessary — in fact, the model was created by comparing different skies.

_ 각 하늘에 대해 후광 위치를 위한 사후 확률 분포를 찾기 위해 베이지안 모델을 돌렸다. 헤일로의 위치는 무시했다. 이는 캐글 참가자들의 전통적인 접근법과는 다는 것이었다. 이 모델은 다른 하늘의 정보를 사용하지도 않았고 알려진 후광의 위치도 사용하지 않았다. 그렇다고 데이터가 필요없단 얘긴 아니다. 사실상 모델은 다른 하늘을 비교하기 위해서 생성되었다. _

In [11]:

def euclidean_distance(x, y):
    return np.sqrt(((x - y) ** 2).sum(axis=1))


def f_distance(gxy_pos, halo_pos, c):
    # foo_position should be a 2-d numpy array
    return np.maximum(euclidean_distance(gxy_pos, halo_pos), c)[:, None]


def tangential_distance(glxy_position, halo_position):
    # foo_position should be a 2-d numpy array
    delta = glxy_position - halo_position
    t = (2 * np.arctan(delta[:, 1] / delta[:, 0]))[:, None]
    return np.concatenate([-np.cos(t), -np.sin(t)], axis=1)

import pymc as pm

# set the size of the halo's mass
mass_large = pm.Uniform("mass_large", 40, 180, trace=False)

# set the initial prior position of the halos, it's a 2-d Uniform dist.
halo_position = pm.Uniform("halo_position", 0, 4200, size=(1, 2))


@pm.deterministic
def mean(mass=mass_large, h_pos=halo_position, glx_pos=data[:, :2]):
    return mass / f_distance(glx_pos, h_pos, 240) *\
        tangential_distance(glx_pos, h_pos)

In [12]:

ellpty = pm.Normal("ellipcity", mean, 1. / 0.05, observed=True,
                   value=data[:, 2:])
mcmc = pm.MCMC([ellpty, mean, halo_position, mass_large])
map_ = pm.MAP([ellpty, mean, halo_position, mass_large])
map_.fit()
mcmc.sample(200000, 140000, 3)

 [-----------------100%-----------------] 200000 of 200000 complete in 89.9 sec

Below we plot a "heatmap" of the posterior distribution. (Which is just a scatter plot of the posterior, but we can visualize it as a heatmap.)

In [13]:

t = mcmc.trace("halo_position")[:].reshape(20000, 2)

fig = draw_sky(data)
plt.title("Galaxy positions and ellipcities of sky %d." % n_sky)
plt.xlabel("x-position")
plt.ylabel("y-position")
plt.scatter(t[:, 0], t[:, 1], alpha=0.015, c="r")
plt.xlim(0, 4200)
plt.ylim(0, 4200);

The most probable position reveals itself like a lethal wound.

_ 가장 개연성있는 위치는 가장 위험하게 감긴 그 자리에서 나타난다. _

Associated with each sky is another data point, located in ./data/Training_halos.csv that holds the locations of up to three dark matter halos contained in the sky. For example, the night sky we trained on has halo locations:

_ 각 하늘과 연계된 다른 위치 정보는 하늘에 포함된 세개의 암흑 물질 후광의 위치를 담고 있는 csv에서 나와 있다. _

In [14]:

halo_data = np.genfromtxt("data/Training_halos.csv",
                          delimiter=",",
                          usecols=[1, 2, 3, 4, 5, 6, 7, 8, 9],
                          skip_header=1)
print halo_data[n_sky]

[  1.00000000e+00   1.40861000e+03   1.68586000e+03   1.40861000e+03
   1.68586000e+03   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00]

The third and fourth column represent the true $x$ and $y$ position of the halo. It appears that the Bayesian method has located the halo within a tight vicinity.

_ 세번째 네번째는 실제 후광의 x,y위치 이다. 베이지안 메소드는 꽉찬 인접 지역 내에 후광을 위치시킨다. _

In [8]:

fig = draw_sky(data)
plt.title("Galaxy positions and ellipcities of sky %d." % n_sky)
plt.xlabel("x-position")
plt.ylabel("y-position")
plt.scatter(t[:, 0], t[:, 1], alpha=0.015, c="r")
plt.scatter(halo_data[n_sky - 1][3], halo_data[n_sky - 1][4],
            label="True halo position",
            c="k", s=70)
plt.legend(scatterpoints=1, loc="lower left")
plt.xlim(0, 4200)
plt.ylim(0, 4200);

print "True halo location:", halo_data[n_sky][3], halo_data[n_sky][4]

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-8-f5defa241807> in <module>()
----> 1 fig = draw_sky(data)
      2 plt.title("Galaxy positions and ellipcities of sky %d." % n_sky)
      3 plt.xlabel("x-position")
      4 plt.ylabel("y-position")
      5 plt.scatter(t[:, 0], t[:, 1], alpha=0.015, c="r")

NameError: name 'draw_sky' is not defined

Perfect. Our next step is to use the loss function to optimize our location. A naive strategy would be to simply choose the mean:

In [16]:

mean_posterior = t.mean(axis=0).reshape(1, 2)
print mean_posterior

[[ 2324.09028399  1124.29876218]]

In [17]:

from DarkWorldsMetric import main_score

_halo_data = halo_data[n_sky - 1]

nhalo_all = _halo_data[0].reshape(1, 1)
x_true_all = _halo_data[3].reshape(1, 1)
y_true_all = _halo_data[4].reshape(1, 1)
x_ref_all = _halo_data[1].reshape(1, 1)
y_ref_all = _halo_data[2].reshape(1, 1)
sky_prediction = mean_posterior

print "Using the mean:"
main_score(nhalo_all, x_true_all, y_true_all,
           x_ref_all, y_ref_all, sky_prediction)

# what's a bad score?
print
random_guess = np.random.randint(0, 4200, size=(1, 2))
print "Using a random location:", random_guess
main_score(nhalo_all, x_true_all, y_true_all,
           x_ref_all, y_ref_all, random_guess)
print

Using the mean:
Your average distance in pixels you are away from the true halo is 43.1564419087
Your average angular vector is 1.0
Your score for the training data is 1.04315644191

Using a random location: [[4152  707]]
Your average distance in pixels you are away from the true halo is 1874.11082674
Your average angular vector is 1.0
Your score for the training data is 2.87411082674

This is a good guess, it is not very far from the true location, but it ignores the loss function that was provided to us. We also need to extend our code to allow for up to two additional, smaller halos: Let's create a function for automatizing our PyMC.

_ 아주 좋은 추측이다. 실제 위치에서 아주 멀지도 않고 우리에게 제공된 손실 함수도 무시할 정도다. 우리의 코드를 두개의 추가적인 작은 후광으로 확대할 필요가 있다. PyMC를 자동화하기 위해 함수를 만들자. _

In [18]:

from pymc.Matplot import plot as mcplot


def halo_posteriors(n_halos_in_sky, galaxy_data,
                    samples=5e5, burn_in=34e4, thin=4):
    # set the size of the halo's mass
    """
    exp_mass_large = pm.Uniform("exp_mass_large", 40, 180)
    @pm.deterministic
    def mass_large(exp_mass_large = exp_mass_large):
        return np.log(exp_mass_large)
    """

    mass_large = pm.Uniform("mass_large", 40, 180)

    mass_small_1 = 20
    mass_small_2 = 20

    masses = np.array([mass_large, mass_small_1, mass_small_2], dtype=object)

    # set the initial prior positions of the halos, it's a 2-d Uniform dist.
    halo_positions = pm.Uniform("halo_positions", 0, 4200,
                                size=(n_halos_in_sky, 2))  # notice this size

    fdist_constants = np.array([240, 70, 70])

    @pm.deterministic
    def mean(mass=masses, h_pos=halo_positions, glx_pos=data[:, :2],
             n_halos_in_sky=n_halos_in_sky):

        _sum = 0
        for i in range(n_halos_in_sky):
            _sum += mass[i] / f_distance(glx_pos, h_pos[i, :], fdist_constants[i]) *\
                tangential_distance(glx_pos, h_pos[i, :])

        return _sum

    ellpty = pm.Normal("ellipcity", mean, 1. / 0.05, observed=True,
                       value=data[:, 2:])

    map_ = pm.MAP([ellpty, mean, halo_positions, mass_large])
    map_.fit(method="fmin_powell")

    mcmc = pm.MCMC([ellpty, mean, halo_positions, mass_large])
    mcmc.sample(samples, burn_in, thin)
    return mcmc.trace("halo_positions")[:]

In [19]:

n_sky = 215
data = np.genfromtxt("data/Train_Skies/Train_Skies/\
Training_Sky%d.csv" % (n_sky),
                      dtype=None,
                      skip_header=1,
                      delimiter=",",
                      usecols=[1, 2, 3, 4])

In [20]:

# there are 3 halos in this file.
samples = 10.5e5
traces = halo_posteriors(3, data, samples=samples,
                         burn_in=9.5e5,
                         thin=10)

 [-----------------100%-----------------] 1050000 of 1050000 complete in 881.3 sec

In [21]:

fig = draw_sky(data)
plt.title("Galaxy positions and ellipcities of sky %d." % n_sky)
plt.xlabel("x-position")
plt.ylabel("y-position")

colors = ["#467821", "#A60628", "#7A68A6"]

for i in range(traces.shape[1]):
    plt.scatter(traces[:, i, 0], traces[:, i, 1], c=colors[i], alpha=0.02)


for i in range(traces.shape[1]):
    plt.scatter(halo_data[n_sky - 1][3 + 2 * i], halo_data[n_sky - 1][4 + 2 * i],
                label="True halo position",
                c="k", s=90)

# plt.legend(scatterpoints = 1)
plt.xlim(0, 4200)
plt.ylim(0, 4200);

This looks pretty good, though it took a long time for the system to (sort of) converge. Our optimization step would look something like this:

In [23]:

_halo_data = halo_data[n_sky - 1]
print traces.shape

mean_posterior = traces.mean(axis=0).reshape(1, 6)
print mean_posterior


nhalo_all = _halo_data[0].reshape(1, 1)
x_true_all = _halo_data[3].reshape(1, 1)
y_true_all = _halo_data[4].reshape(1, 1)
x_ref_all = _halo_data[1].reshape(1, 1)
y_ref_all = _halo_data[2].reshape(1, 1)
sky_prediction = mean_posterior


print "Using the mean:"
main_score([1], x_true_all, y_true_all,
           x_ref_all, y_ref_all, sky_prediction)

# what's a bad score?
print
random_guess = np.random.randint(0, 4200, size=(1, 2))
print "Using a random location:", random_guess
main_score([1], x_true_all, y_true_all,
           x_ref_all, y_ref_all, random_guess)
print

(10000, 3, 2)
[[ 3826.89753047  3966.53828134  2485.81641533  3864.15564067
   3421.23765245   688.51783207]]
Using the mean:
Your average distance in pixels you are away from the true halo is 132.739681751
Your average angular vector is 1.0
Your score for the training data is 1.13273968175

Using a random location: [[3851 2786]]
Your average distance in pixels you are away from the true halo is 1166.31648985
Your average angular vector is 1.0
Your score for the training data is 2.16631648985

References¶

Antifragile: Things That Gain from Disorder. New York: Random House. 2012. ISBN 978-1-4000-6782-4.
Tim Saliman's solution to the Dark World's Contest
Silver, Nate. The Signal and the Noise: Why So Many Predictions Fail — but Some Don't. 1. Penguin Press HC, The, 2012. Print.

In [25]:

from IPython.core.display import HTML


def css_styling():
    styles = open("../styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

Out[25]:

In [ ]: