#!/usr/bin/env python
# coding: utf-8

# # Chapter 4: Linear Regression
#
# # Exercises

# ## Exercise 1 (15 points)
#
# Suppose we have a set of $N$ non-negative integer measurements, $\{x_i\}$, from a process.
#
# 1. Demonstrate analytically that the maximum likelihood fit of a Poisson distribution to these data has mean and variance $\lambda = (1/N) \sum_i x_i$. State your assumptions.
#
# 2. You are then given additional (prior) information that $P(\lambda) \propto \exp(-\lambda/a)$ for $\lambda \geq 0$ and zero otherwise, for a known value $a$ ($> 0$). What is the maximum posterior solution for $\lambda$? Show your working.
#
# 3. What is the effect of the prior on the value of the estimate (compared to the maximum likelihood case), and what happens in the limit $a \rightarrow \infty$?

# ## Exercise 2 (10 points)
#
# Find the maximum likelihood estimate of the angle $\theta$, given that you have two independent noisy measurements,
# $c$ and $s$, where $c$ is a measure of the cosine, and $s$ is a measure of the sine, of the angle $\theta$.
# Assume each measurement has a known Gaussian standard deviation $\sigma$ (the same in both cases). Show your working.

# ## Exercise 3 (20 points)
#
# Examine the data set `rmr_ISwR.dat`.
#
# * Which is the explanatory variable ($x$) and which is the response ($y$)?
# * Perform a linear regression both ways around (i.e. $y$ vs. $x$ and $x$ vs. $y$) and overplot both fits on the same plot with the data.
# * Discuss the differences (if any) between the two ways of fitting. Is there any structure in the residuals in each case?
# * Can you improve the fit by adding higher-order terms?
# * Fit the metabolic rate as a function of the weight, this time including first a quadratic term, then also a cubic term. Plot your fits (together with the linear fit and the data) and calculate and compare the sum of squared residuals for all three fits.
# * Using the metabolic rate as the response, use the linear model to predict the metabolic rate and its standard deviation at (a) 70 kg, (b) 200 kg. Even though you will use a computer to do this, write down the equations which give rise to these estimates.
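# A minimal sketch of one way to set up the fits in Exercise 3, assuming `rmr_ISwR.dat`
# is whitespace-separated with a header row and two columns (body weight in kg, metabolic
# rate); adjust the loading step to the actual file layout.

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt("rmr_ISwR.dat", skiprows=1)   # skiprows=1 assumes one header line
weight, rate = data[:, 0], data[:, 1]

# Linear fits both ways around: rate vs. weight, and weight vs. rate (inverted for plotting).
b1, a1 = np.polyfit(weight, rate, 1)            # rate   ~ a1 + b1 * weight
b2, a2 = np.polyfit(rate, weight, 1)            # weight ~ a2 + b2 * rate

wgrid = np.linspace(weight.min(), weight.max(), 200)
plt.scatter(weight, rate, s=15, label="data")
plt.plot(wgrid, a1 + b1 * wgrid, label="rate vs. weight")
plt.plot(wgrid, (wgrid - a2) / b2, label="weight vs. rate (inverted)")
plt.xlabel("body weight [kg]")
plt.ylabel("metabolic rate")
plt.legend()

# Polynomial fits and their sums of squared residuals.
for deg in (1, 2, 3):
    coeffs = np.polyfit(weight, rate, deg)
    ssr = np.sum((rate - np.polyval(coeffs, weight)) ** 2)
    print(f"degree {deg}: SSR = {ssr:.1f}")

# Standard error of the predicted mean response at a new weight x0 for the simple
# linear model: var(yhat) = s^2 * (1/n + (x0 - mean(x))^2 / Sxx), with s^2 = SSR/(n-2).
n = len(weight)
s2 = np.sum((rate - (a1 + b1 * weight)) ** 2) / (n - 2)
Sxx = np.sum((weight - weight.mean()) ** 2)
for x0 in (70.0, 200.0):
    se = np.sqrt(s2 * (1.0 / n + (x0 - weight.mean()) ** 2 / Sxx))
    print(f"prediction at {x0:.0f} kg: {a1 + b1 * x0:.1f} +/- {se:.1f}")

plt.show()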
# ## Exercise 4: SDSS Standard Star catalog (25 points)
#
# We will explore a first subsample of the Sloan Digital Sky Survey catalog given in `sdss_chap4.csv`. (A further analysis of a larger sample will be done in a different chapter.)
#
# For a large number of stars in the Milky Way, this dataset gives their positions and photometry, i.e. a value related to the number of photons collected through a given passband filter (e.g., u, g, z). Of course these are measurements, and therefore each value comes with an uncertainty.
#
# The file therefore contains a large number of columns. Their descriptions are given below:
#
# | Column    | Description                 | Units   |
# |-----------|-----------------------------|---------|
# | RA        | right ascension             | degrees |
# | DEC       | declination                 | degrees |
# | Ar        | SFD ISM extinction          | mag     |
# | Teff      | effective temperature       | Kelvin  |
# | TeffErr   | standard temperature error  | Kelvin  |
# | logg      | surface gravity             | cgs     |
# | loggErr   | standard gravity error      | cgs     |
# | pm        | proper motion               | mas/yr  |
# | pmErr     | standard error in pm        | mas/yr  |
# | mjd       | reference time (Modified Julian Date) | - |
# | plate     | plate reference number      | -       |
# | fiber     | spectral fiber reference    | -       |
# | SNR       | signal-to-noise ratio       | -       |
# | FeH       | [Fe/H] abundance ratio      | dex     |
# | FeHErr    | [Fe/H] standard error       | dex     |
# | alphFe    | [alpha/Fe] abundance ratio  | dex     |
# | alphFeErr | [alpha/Fe] standard error   | dex     |
# | radVel    | radial velocity             | km/s    |
# | radVelErr | radial velocity error       | km/s    |
#
# In addition, for **each** band in (u, g, r, i, z) there are two more columns, where `*` below stands for the band letter:
#
# | Column  | Description                 | Units |
# |---------|-----------------------------|-------|
# | `*psf`  | mean magnitude in this band | mag   |
# | `*Err`  | standard error in this band | mag   |
#
# In this exercise we explore a high-dimensional dataset and the possible correlations between its dimensions. As a first exploratory case, we limit ourselves to what are called the colors of stars, i.e. differences of magnitudes (e.g., `r-i`), and to the temperature (`Teff`).
#
# The colors of stars are very good tracers of their temperature: red colors generally indicate a low surface temperature and blue colors a high one. However, how well this first-order approximation works depends on the wavelengths at which the stars are observed.
#
# 1. Build a dataset made of the colors $u-g$, $g-r$, $r-i$, $i-z$, in addition to the temperature and gravity. Plot them all against each other, and look also at their individual distributions. Be clever about how you visualize your data. Explain your thinking. What do you observe?
#
# 2. The different colors of a population of stars often show correlations. We propose to find a simple expression for the temperature as a function of all the colors at our disposal.
#     1. Explain simply why, thanks to the large number of measurements, we can neglect the uncertainties on each individual measurement.
#     2. Fit the relations $T_{eff} = f(u-g, g-r, r-i, i-z)$ and $\log_{10} T_{eff} = f(u-g, g-r, r-i, i-z)$ with the linear regression methods from the lectures. In this question we only consider terms that are linear in the colors.
#     3. For both relations, plot the scaled residuals in temperature, `(predicted - given temperature)/(given temperature)`, as a function of `temperature` (**linear scaled residuals**). Show that $\Delta T_{eff}/T_{eff} = \ln(10)\,\Delta(\log_{10}T_{eff})$. Compare the efficiency of fitting in linear temperature with fitting in log temperature and conclude.
#
# 3. Now select only the broadest baseline in wavelength, the $u-z$ color. Presumably it will give us the best leverage on temperature if we limit ourselves to a single color. Do a linear regression on $\log_{10} T_{eff} = \sum_{k=0}^{N} a_k (u-z)^k$, for $N$ from 1 to 4. Plot the fits and their residuals as before, and compare the efficiency with your previous result on $\log_{10} T_{eff} = f(u-g, g-r, r-i, i-z)$. Is there a gain in going to orders higher than degree 2 in the polynomial? Do the results improve with higher polynomial degree $N$? (Plot multiple degrees, compare, explain.)
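# A minimal sketch of how the multi-color fit of question 2 could be set up with ordinary
# least squares. It assumes the per-band magnitude columns are named `upsf`, `gpsf`, `rpsf`,
# `ipsf`, `zpsf` (band letter + `psf`); adapt the names to the actual header of `sdss_chap4.csv`.

import numpy as np
import pandas as pd

df = pd.read_csv("sdss_chap4.csv")

# Colors as differences of PSF magnitudes.
colors = np.column_stack([
    df["upsf"] - df["gpsf"],   # u-g
    df["gpsf"] - df["rpsf"],   # g-r
    df["rpsf"] - df["ipsf"],   # r-i
    df["ipsf"] - df["zpsf"],   # i-z
])

# Design matrix with a constant term (see the tip at the end of the exercise).
X = np.column_stack([np.ones(len(df)), colors])
y = np.log10(df["Teff"].to_numpy())

# Ordinary least squares; beta[0] is the constant, beta[1:] the color coefficients.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
T_pred = 10 ** (X @ beta)

# Linear scaled residuals: (predicted - given temperature) / (given temperature).
scaled_res = (T_pred - df["Teff"].to_numpy()) / df["Teff"].to_numpy()
print("rms scaled residual:", np.sqrt(np.mean(scaled_res ** 2)))

# For question 3, the same lstsq call can be reused with a polynomial design matrix in the
# single u-z color, e.g. X = np.vander(df["upsf"] - df["zpsf"], N + 1), which already
# includes the constant column.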
# 4. The previous analysis shows that a quadratic function of a single color is a decent model. What if we use two colors? Use the $(u-r, r-z)$ colors and attempt to predict a temperature for each star. Perform the linear regression of the quadratic polynomial in the two colors, **including their cross-term**: $T_{eff} = f(u-r, r-z, (u-r)^2, (r-z)^2, (u-r) \cdot (r-z))$. Compare with the model from question 2.
#
# 5. The results are still not satisfying. Given the previous tests and models, how could you improve them? What could limit the improvement? Explain in words; no calculation or plot is required.
#
# **Tip**: remember to include a non-zero constant term in your different relations when doing the regression.

# ## Exercise 5 (25 points)
#
# A murder has been committed in the city. A DNA sample is found at the crime scene, which is assumed to be from the criminal.
# A DNA test is performed which has a false negative rate of $1$ in $10\,000$ and a false positive rate of $1$ in $100\,000$.
#
# The criminal is known (from other evidence) to be a resident of the city (population half a million).
# A suspect is found and has a positive match to the DNA by this test.
#
# 1. Based on this information, what is the probability that the suspect is the perpetrator? (Give your interpretation of the question and show your working.)
#
# 2. Discuss why the probability in reality might differ from the value you find.
#
# 3. We would like to improve the procedure such that the probability that a suspect is the perpetrator following a positive test is 99%. A single test currently costs $100$ euros.
# You can improve the procedure in one of two ways:
#     1. Do the given test several times, assuming each test is independent. All individual tests must be positive for the overall result to be positive; otherwise it is negative.
#     2. Do an improved individual test which has a lower false positive rate. For each factor of two decrease in this rate (compared to what it currently is) the test costs $100$ euros more.
#
# Which of these two approaches is the cheaper way of achieving the required improvement, and how much will it cost? Show your working and explain the differences between your answers.

# ## Exercise 6 (5 points)
#
# I have carried out $n$ experiments and so far $r$ have been successful.
#
# * What is your best estimate of the probability that the next trial will be a success?
# * Does your estimate make sense for all values of $r$ and $n$? If not, how could you modify it?
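# A minimal numerical sketch of the Bayesian setup for Exercise 5, part 1, under one possible
# reading of the question: every resident of the city is equally likely a priori to be the
# perpetrator, and the quoted error rates are taken at face value. It only illustrates the
# mechanics of Bayes' theorem and is not a substitute for your own interpretation and working.

# Prior: uniform over the city's residents.
population = 500_000
p_guilty = 1.0 / population

# Test characteristics from the exercise statement.
p_pos_given_guilty = 1.0 - 1.0 / 10_000     # 1 - false negative rate
p_pos_given_innocent = 1.0 / 100_000        # false positive rate

# Bayes' theorem: P(guilty | positive) = P(positive | guilty) P(guilty) / P(positive).
p_pos = (p_pos_given_guilty * p_guilty
         + p_pos_given_innocent * (1.0 - p_guilty))
p_guilty_given_pos = p_pos_given_guilty * p_guilty / p_pos
print(f"P(perpetrator | positive test) = {p_guilty_given_pos:.3f}")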