Here we will explore a regression problem important in the field of astronomy. Data in astronomy most often comes in one of two forms: spectral data and photometric data. Spectra are high-resolution measurements of the energy of a source as a function of wavelength. Photometry, usually measured on a logarithmic scale called magnitudes, can be thought of as the integral of the spectrum through a broad filter.
Run the following code to see an example of a stellar spectrum (in particular, the spectrum of the star Vega) along with the five filters used in the Sloan Digital Sky Survey:
%pylab inline
from figures import plot_sdss_filters
plot_sdss_filters()
One interesting regression problem which appears often in Astronomy is the photometric determination of the galaxy redshift.
In the current standard cosmological model, the universe began nearly 14 billion years ago, in an explosive event commonly known as the Big Bang. Since then, the very fabric of space has been expanding, so that distant galaxies appear to be moving away from us at very high speeds. The uniformity of this expansion means that there is a relationship between the distance to a galaxy and the speed at which it appears to be receding from us (this relationship is known as Hubble's Law, named after Edwin Hubble). This recession speed leads to a shift in the frequency of photons, very similar to the more familiar Doppler shift that causes the pitch of a siren to change as an emergency vehicle passes by. If a galaxy or star were moving toward us, its light would be shifted to higher frequencies, or blue-shifted. Because the universe is expanding away from us, distant galaxies appear to be red-shifted: their photons are shifted to lower frequencies.
In cosmology, the redshift is measured with the parameter $z$, defined in terms of the observed wavelength $\lambda_{obs}$ and the emitted wavelength $\lambda_{em}$:
$\lambda_{obs} = (1 + z)\lambda_{em}$
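As a quick worked example of this relationship, suppose the hydrogen-alpha line, emitted at a rest-frame wavelength of about 6563 Angstroms, is observed at 7875.6 Angstroms (the observed value here is made up for illustration). Solving the equation above for $z$ gives $z = \lambda_{obs}/\lambda_{em} - 1$:

```python
# Solving lambda_obs = (1 + z) * lambda_em for z gives z = lambda_obs / lambda_em - 1
lam_em = 6563.0    # rest-frame wavelength of hydrogen-alpha, in Angstroms
lam_obs = 7875.6   # hypothetical observed wavelength, in Angstroms

z = lam_obs / lam_em - 1
print("z = %.2f" % z)   # z = 0.20
```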
When a spectrum can be obtained, determining the redshift is rather straightforward: if you can localize the spectral fingerprint of a common element, such as hydrogen, then the redshift can be computed using simple arithmetic. But the task becomes much more difficult when only photometric observations are available.
Because of the spectrum shift, an identical source at different redshifts will have a different color through each pair of filters. See the following figure:
from figures import plot_redshifts
plot_redshifts()
This again shows the spectrum of the star Vega ($\alpha$-Lyr), but at three different redshifts. The SDSS ugriz filters are shown in gray for reference.
At redshift z=0.0, the spectrum is bright in the u and g filters, but dim in the i and z filters. At redshift z=0.8, the opposite is the case. This suggests the possibility of determining redshift from photometry alone. The situation is complicated by the fact that each individual source has unique spectral characteristics, but nevertheless, these photometric redshifts are often used in astronomical applications.
The photometric redshift problem is very important. Future astronomical surveys hope to image billions of very faint galaxies, and use this data to inform our view of the universe as a whole: its history, its geometry, and its fate. Obtaining an accurate estimate of the redshift to each of these galaxies is a pivotal part of this task. Because these surveys will image so many extremely faint galaxies, there is no possibility of obtaining a spectrum for each one. Thus sophisticated photometric redshift codes will be required to advance our understanding of the Universe, including more precisely understanding the nature of the dark energy that is currently accelerating the cosmic expansion.
Here we will address the photometric redshift problem using 50,000 observations from the Sloan Digital Sky Survey. This example draws from examples available through the astroML Python package.
We'll start by downloading the data. This fetch function actually generates an SQL query and downloads the data from the SDSS database. The results will be cached locally so that subsequent calls to the function don't result in another download.
import numpy as np
from datasets import fetch_sdss_galaxy_mags
# This will download a ~3MB file the first time you call the function
data = fetch_sdss_galaxy_mags()
print data.shape
print data.dtype
Next we'll extract the data. Because relative magnitudes are easier to calibrate than absolute magnitudes, we'll work with what astronomers call colors: the difference of two magnitudes. Because a magnitude is related to the logarithm of a flux, a color can be thought of as the logarithm of a flux ratio.
redshift = data['redshift']
mags = np.vstack([data[f] for f in 'ugriz']).transpose()
colors = mags[:, :-1] - mags[:, 1:]
print colors.shape
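To see concretely why a color is a logarithmic flux ratio, here is a toy sketch; the flux values below are made up, and the shared zero-point constant in the magnitude definition $m = -2.5\log_{10}(F) + C$ cancels when two magnitudes are differenced:

```python
import numpy as np

# hypothetical fluxes of one source through two filters (arbitrary units)
F_u, F_g = 1.2, 3.0

# magnitudes, up to a zero-point constant shared by both filters
m_u = -2.5 * np.log10(F_u)
m_g = -2.5 * np.log10(F_g)

# the color depends only on the flux ratio, so the zero-point cancels
color_ug = m_u - m_g
print(color_ug)
```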
We'll split the data into training and validation sets, and do a decision tree fit to the data:
from sklearn import cross_validation
ctrain, ctest, ztrain, ztest = cross_validation.train_test_split(colors, redshift)
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor()
Let's start by running a 4-fold cross validation and see how we're doing. The cross validation here prints the r2 score: the closer the score is to 1, the better (a perfect fit gives exactly 1, while a very poor fit can even give a negative score):
print cross_validation.cross_val_score(clf, colors, redshift, cv=4)
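For reference, the r2 score reported above is the coefficient of determination, which is simple to compute by hand; the arrays below are made-up stand-ins for true and predicted redshifts:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)"""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([0.1, 0.4, 0.5, 0.9])
y_pred = np.array([0.2, 0.4, 0.6, 0.8])
print(r2_score_manual(y_true, y_pred))
```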
Another way we can visualize the results is to scatter-plot the input versus the output:
# We'll use this function several times below
def plot_redshifts(ztrue, zpred):
    """Scatter-plot the true vs. predicted redshifts"""
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.plot(ztrue, zpred, '.k')

    # plot the identity line, plus outlier boundaries at +/- 0.2 in z
    ax.plot([0, 3], [0, 3], '--k')
    ax.plot([0, 3], [0.2, 3.2], ':k')
    ax.plot([0.2, 3.2], [0, 3], ':k')

    ax.text(2.9, 0.1,
            "RMS = %.2g" % np.sqrt(np.mean((ztrue - zpred) ** 2)),
            ha='right', va='bottom')

    ax.set_xlim(0, 3)
    ax.set_ylim(0, 3)
    ax.set_xlabel('True redshift')
    ax.set_ylabel('Predicted redshift')
clf = DecisionTreeRegressor()
clf.fit(ctrain, ztrain)
zpred = clf.predict(ctest)
plot_redshifts(ztest, zpred)
We see several things here: first, there are some regions of redshift space where the results are not very precise: they can vary by $\pm 0.2$ or so.
Second, there are many catastrophic outliers: values where the prediction is completely wrong. Both these sources of error are important to minimize, or at the very least, statistically characterize. We'll explore some ways to do this below.
The question of how to improve the model goes back to the discussion of Learning Curves from earlier. We'll start by attempting to optimize the model itself through cross-validation.
The decision tree regressor has a few hyperparameters, but one of the most important is the maximum depth. This controls how many successive splits of the data can be made along any branch of the tree when computing a fit. Here we'll plot the validation curve for the maximum depth:
# we'll explore results for max_depth from 1 to 20
max_depth_array = np.arange(1, 21)
train_error = np.zeros(len(max_depth_array))
test_error = np.zeros(len(max_depth_array))

for i, max_depth in enumerate(max_depth_array):
    clf = DecisionTreeRegressor(max_depth=max_depth)
    clf.fit(ctrain, ztrain)
    ztrain_pred = clf.predict(ctrain)
    ztest_pred = clf.predict(ctest)
    train_error[i] = np.sqrt(np.mean((ztrain_pred - ztrain) ** 2))
    test_error[i] = np.sqrt(np.mean((ztest_pred - ztest) ** 2))

plt.plot(max_depth_array, train_error, label='training')
plt.plot(max_depth_array, test_error, label='validation')
plt.grid()
plt.legend(loc=3)
plt.xlabel('max_depth')
plt.ylabel('error')
We see a very clean curve which looks much like what we'd expect: the training error decreases consistently as the model fits the training data more and more closely, while the validation error turns over at some optimal value and then grows as the model over-fits.
Note that scikit-learn has a set of functions which automate this sort of calculation: they live in the grid_search module. As of scikit-learn version 0.13, the interface for grid search is still evolving, so the following code may have to be adjusted when newer scikit-learn versions are released:
from sklearn.grid_search import GridSearchCV
clf = DecisionTreeRegressor()
grid = GridSearchCV(clf, param_grid=dict(max_depth=max_depth_array))
grid.fit(colors, redshift)
print grid.best_params_
The grid search uses a full cross-validation rather than a single validation set: this is what leads to the larger optimal value of max_depth.
We can read from the above plot that the optimal RMS is obtained with a max depth of about 7. Let's see what this looks like:
clf = DecisionTreeRegressor(max_depth=7)
clf.fit(ctrain, ztrain)
zpred = clf.predict(ctest)
plot_redshifts(ztest, zpred)
Ugh... not pretty.
Even though by eye this looks like a much worse fit than we had above, it actually has a better RMS residual (0.21 vs. 0.27). This is a good illustration that the form of the loss or score function is very important.
We said above that along with being concerned about RMS error, we're also concerned about the level of outliers in the data. With that in mind, we could propose another loss function particular to this problem: the fraction of points with an absolute deviation greater than 0.2 (that is, outside the dotted lines in the scatter-plot).
We'll define the function to compute this, and then create the validation curve for this metric:
def outlier_fraction(y_pred, y_true, cutoff=0.2):
    """Compute the fraction of points with absolute error greater than cutoff"""
    return np.sum(abs(y_pred - y_true) > cutoff) * 1. / len(y_pred)
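A quick sanity check of this metric on made-up values (restating the function so the snippet stands on its own):

```python
import numpy as np

def outlier_fraction(y_pred, y_true, cutoff=0.2):
    return np.sum(abs(y_pred - y_true) > cutoff) * 1. / len(y_pred)

y_true = np.array([0.1, 0.5, 1.0, 2.0])
y_pred = np.array([0.15, 0.9, 1.05, 1.0])  # second and fourth points are outliers

print(outlier_fraction(y_pred, y_true))  # 0.5
```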
train_outfrac = np.zeros(len(max_depth_array))
test_outfrac = np.zeros(len(max_depth_array))

for i, max_depth in enumerate(max_depth_array):
    clf = DecisionTreeRegressor(max_depth=max_depth)
    clf.fit(ctrain, ztrain)
    ztrain_pred = clf.predict(ctrain)
    ztest_pred = clf.predict(ctest)
    train_outfrac[i] = outlier_fraction(ztrain_pred, ztrain)
    test_outfrac[i] = outlier_fraction(ztest_pred, ztest)

plt.plot(max_depth_array, train_outfrac)
plt.plot(max_depth_array, test_outfrac)
plt.grid()
This outlier-based loss function settles on a much deeper tree. Let's see what the result looks like with a max depth of 20:
clf = DecisionTreeRegressor(max_depth=20)
clf.fit(ctrain, ztrain)
zpred = clf.predict(ctest)
plot_redshifts(ztest, zpred)
Unfortunately, this is about as far as we can go with a simple decision tree trained on this data. There are two more possibilities we can pursue, though: diagnosing the model with learning curves, and moving to more sophisticated ensemble methods. These approaches will be exercises below.
Earlier we showed how learning curves can be used to determine the best course of action when a model is under-performing: should we gather more samples? Gather more features? Seek a more sophisticated model?
The goal of this exercise is to use learning curves to answer this question: how should astronomers spend their resources when trying to improve photometric redshifts?
One way to improve upon Decision Trees is to use Ensemble Methods. Ensemble methods (which include techniques such as boosting and bagging) combine many randomized estimators, which can prevent over-fitting and lead to very powerful learning algorithms.
It is interesting to see how small an RMS you can attain on the photometric redshift problem using a more sophisticated method. Try one of the following:
sklearn.ensemble.RandomForestRegressor
sklearn.ensemble.GradientBoostingRegressor
sklearn.ensemble.ExtraTreesRegressor
You can read more about each of these methods in the scikit-learn documentation.
Each method has hyperparameters which need to be determined using cross-validation steps like those above. Can you use ensemble methods to reduce the RMS error for the test set down below 0.1?
Here you can adjust several hyperparameters, but the important ones will be the number of estimators (n_estimators) as well as the maximum depth (max_depth) that we saw above.
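As a starting point for the exercise, here is a minimal sketch using RandomForestRegressor. The synthetic data below is a stand-in for the ctrain/ztrain/ctest/ztest arrays from above, and the hyperparameter values are just initial guesses to be tuned via cross-validation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# stand-in data: in the exercise, use ctrain, ztrain, ctest, ztest instead
rng = np.random.RandomState(0)
X = rng.rand(500, 4)                       # 500 points, 4 features (like our 4 colors)
y = X.sum(axis=1) + 0.1 * rng.randn(500)   # a simple target with some noise
Xtrain, Xtest = X[:400], X[400:]
ytrain, ytest = y[:400], y[400:]

# an ensemble of 100 randomized trees; n_estimators and max_depth
# are the hyperparameters to tune
clf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=0)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)

rms = np.sqrt(np.mean((ypred - ytest) ** 2))
print(rms)
```

Averaging over many randomized trees smooths out the high-variance predictions of any single deep tree, which is why ensembles typically reach a lower RMS than the single decision tree above.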