# from Coello Coello, Lamont and Van Veldhuizen (2007) Evolutionary Algorithms for Solving Multi-Objective Problems, Second Edition. Springer [Appendix E](http://www.cs.cinvestav.mx/~emoobook/apendix-e/apendix-e.html).
# New `toolbox` instance with the necessary components.
# In[23]:
toolbox = base.Toolbox()
# Define problem domain as $\vec{x}\in\left[0,1\right]^{30}$ and a two-objective DTLZ3 instance.
# In[24]:
NDIM = 30
BOUND_LOW, BOUND_UP = 0.0, 1.0
toolbox.register("evaluate", lambda ind: benchmarks.dtlz3(ind, 2))
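# For readers without DEAP at hand, the two-objective DTLZ3 can be transcribed directly from the Deb et al. definition. This is a sketch for reference only (the function name `dtlz3_2obj` is ours; `benchmarks.dtlz3` is what is actually registered above):

```python
import math

def dtlz3_2obj(individual):
    """Two-objective DTLZ3, transcribed from the Deb et al. definition (sketch).

    With m = 2 objectives, only x_1 is a "position" variable; the tail
    variables form the distance function g. On the global Pareto-optimal
    front g = 0, so f1^2 + f2^2 = 1.
    """
    xm = individual[1:]
    g = 100.0 * (len(xm) + sum((x - 0.5) ** 2 - math.cos(20.0 * math.pi * (x - 0.5))
                               for x in xm))
    f1 = (1.0 + g) * math.cos(individual[0] * math.pi / 2.0)
    f2 = (1.0 + g) * math.sin(individual[0] * math.pi / 2.0)
    return f1, f2
```

# The cosine term in $g$ is what produces the many local Pareto-optimal fronts DTLZ3 is known for.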
# Describing attributes, individuals and population and defining the selection, mating and mutation operators.
# In[25]:
toolbox.register("attr_float", uniform, BOUND_LOW, BOUND_UP, NDIM)
toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.attr_float)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("mate", tools.cxSimulatedBinaryBounded, low=BOUND_LOW, up=BOUND_UP, eta=20.0)
toolbox.register("mutate", tools.mutPolynomialBounded, low=BOUND_LOW, up=BOUND_UP, eta=20.0, indpb=1.0/NDIM)
toolbox.register("select", tools.selNSGA2)
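# As a reference for what `mutPolynomialBounded` does to each gene, here is a simplified sketch of Deb and Agrawal's bounded polynomial mutation (our own simplification, not DEAP's exact code, which uses a slightly more involved perturbation formula):

```python
import random

def poly_mutate_gene(x, low, up, eta=20.0):
    """Simplified bounded polynomial mutation of a single gene.

    A small eta produces large perturbations; a large eta (e.g. 20, as
    above) keeps offspring close to the parent. The result is clipped
    to the [low, up] bounds.
    """
    u = random.random()
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0   # delta in [-1, 0)
        x = x + delta * (x - low)
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))  # delta in [0, 1)
        x = x + delta * (up - x)
    return min(max(x, low), up)
```

# The `indpb=1.0/NDIM` setting above means that, on average, one gene per individual undergoes this perturbation.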
# Let's also use the `toolbox` to store other configuration parameters of the algorithm. This will prove useful when performing massive experiments.
# In[26]:
toolbox.pop_size = 50
toolbox.max_gen = 1000
toolbox.mut_prob = 0.2
# ## A compact NSGA-II implementation
#
# Storing all the required information in the `toolbox` and using DEAP's `algorithms.eaMuPlusLambda` function allows us to write a very compact implementation of NSGA-II (albeit not a 100% exact copy of the original).
# In[27]:
def run_ea(toolbox, stats=None, verbose=False):
pop = toolbox.population(n=toolbox.pop_size)
pop = toolbox.select(pop, len(pop))
return algorithms.eaMuPlusLambda(pop, toolbox, mu=toolbox.pop_size,
lambda_=toolbox.pop_size,
cxpb=1-toolbox.mut_prob,
mutpb=toolbox.mut_prob,
stats=stats,
ngen=toolbox.max_gen,
verbose=verbose)
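# For readers unfamiliar with `eaMuPlusLambda`, its core is a $(\mu+\lambda)$ loop in which parents compete with their offspring for survival. The following toy single-objective sketch (ours, not DEAP's code) illustrates the scheme:

```python
import random

def mu_plus_lambda(init_pop, evaluate, vary, mu, lambda_, ngen):
    """Generic (mu + lambda) loop: parents and offspring compete for survival."""
    pop = sorted(init_pop, key=evaluate)[:mu]
    for _ in range(ngen):
        offspring = [vary(random.choice(pop)) for _ in range(lambda_)]
        # Survival selection over the joint pool makes the scheme elitist.
        pop = sorted(pop + offspring, key=evaluate)[:mu]
    return pop

# Toy usage: minimize x^2 with Gaussian perturbations.
random.seed(0)
best = mu_plus_lambda([random.uniform(-10, 10) for _ in range(10)],
                      evaluate=lambda x: x * x,
                      vary=lambda x: x + random.gauss(0, 1),
                      mu=10, lambda_=10, ngen=50)[0]
```

# In NSGA-II the sort by a scalar fitness is replaced by non-dominated sorting plus crowding distance, which is what `toolbox.select = selNSGA2` provides.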
# ## Running the algorithm
# We are now ready to run our NSGA-II.
# In[28]:
get_ipython().run_line_magic('time', 'res,_ = run_ea(toolbox)')
# We can now get the Pareto fronts in the results (`res`).
# In[29]:
fronts = tools.emo.sortLogNondominated(res, len(res))
# ## Resulting Pareto fronts
# In[30]:
plot_colors = seaborn.color_palette("Set1", n_colors=10)
fig, ax = plt.subplots(1, figsize=(4,4))
for i,inds in enumerate(fronts):
par = [toolbox.evaluate(ind) for ind in inds]
df = pd.DataFrame(par)
df.plot(ax=ax, kind='scatter', label='Front ' + str(i+1),
x=df.columns[0], y=df.columns[1],
color=plot_colors[i])
plt.xlabel(r'$f_1(\mathbf{x})$'); plt.ylabel(r'$f_2(\mathbf{x})$');
# * It would be better to make an animated plot of the evolution as it takes place.
# ## Animating the evolutionary process
# We create a `stats` object that stores the whole population at each generation, not only the objective function values.
# In[31]:
stats = tools.Statistics()
stats.register("pop", copy.deepcopy)
# In[32]:
toolbox.max_gen = 1000 # we need more generations!
# Re-run the algorithm to get the data necessary for plotting.
# In[33]:
get_ipython().run_line_magic('time', 'res, logbook = run_ea(toolbox, stats=stats)')
# In[34]:
from matplotlib import animation
from IPython.display import HTML
# In[35]:
def animate(frame_index, logbook):
'Updates all plots to match frame _i_ of the animation.'
ax.clear()
fronts = tools.emo.sortLogNondominated(logbook.select('pop')[frame_index],
len(logbook.select('pop')[frame_index]))
for i,inds in enumerate(fronts):
par = [toolbox.evaluate(ind) for ind in inds]
df = pd.DataFrame(par)
df.plot(ax=ax, kind='scatter', label='Front ' + str(i+1),
x=df.columns[0], y=df.columns[1], alpha=0.47,
color=plot_colors[i % len(plot_colors)])
ax.set_title('$t=$' + str(frame_index))
ax.set_xlabel('$f_1(\mathbf{x})$');ax.set_ylabel('$f_2(\mathbf{x})$')
return []
# In[36]:
fig = plt.figure(figsize=(4,4))
ax = fig.gca()
anim = animation.FuncAnimation(fig, lambda i: animate(i, logbook),
frames=len(logbook), interval=60,
blit=True)
plt.close()
# In[37]:
HTML(anim.to_html5_video())
# Here it is clearly visible how the algorithm "jumps" from one local optimum to a better one as evolution takes place.
# # MOP benchmark problem toolkits
#
# Each problem instance is meant to test the algorithms with regard to a given feature: local optima, convexity, discontinuity, bias, or a combination of them.
#
# * [*ZDT1-6*](http://www.tik.ee.ethz.ch/sop/download/supplementary/testproblems/): Two-objective problems with a fixed number of decision variables.
# * E. Zitzler, K. Deb, and L. Thiele. Comparison of Multiobjective Evolutionary Algorithms: Empirical Results. Evolutionary Computation, 8(2):173-195, 2000. ([pdf](http://www.tik.ee.ethz.ch/sop/publicationListFiles/zdt2000a.pdf))
# * [*DTLZ1-7*](http://www.tik.ee.ethz.ch/sop/download/supplementary/testproblems/): $m$-objective problems with $n$ variables.
# * K. Deb, L. Thiele, M. Laumanns and E. Zitzler. Scalable Multi-Objective Optimization Test Problems. CEC 2002, p. 825 - 830, IEEE Press, 2002. ([pdf](http://www.tik.ee.ethz.ch/sop/publicationListFiles/dtlz2002a.pdf))
# * [*CEC'09*](http://dces.essex.ac.uk/staff/zhang/moeacompetition09.htm): Two- and three-objective problems with very complex Pareto sets.
# * Zhang, Q., Zhou, A., Zhao, S., & Suganthan, P. N. (2009). Multiobjective optimization test instances for the CEC 2009 special session and competition. In 2009 IEEE Congress on Evolutionary Computation (pp. 1–30). ([pdf](http://dces.essex.ac.uk/staff/zhang/MOEAcompetition/cec09testproblem0904.pdf.pdf))
# * [*WFG1-9*](http://www.wfg.csse.uwa.edu.au/publications.html#toolkit): $m$-objective problems with $n$ variables, very complex.
# * Huband, S., Hingston, P., Barone, L., & While, L. (2006). A review of multiobjective test problems and a scalable test problem toolkit. IEEE Transactions on Evolutionary Computation, 10(5), 477–506. doi:10.1109/TEVC.2005.861417
# How does our NSGA-II behave when faced with different benchmark problems?
# In[38]:
problem_instances = {'ZDT1': benchmarks.zdt1, 'ZDT2': benchmarks.zdt2,
'ZDT3': benchmarks.zdt3, 'ZDT4': benchmarks.zdt4,
'DTLZ1': lambda ind: benchmarks.dtlz1(ind,2),
'DTLZ2': lambda ind: benchmarks.dtlz2(ind,2),
'DTLZ3': lambda ind: benchmarks.dtlz3(ind,2),
'DTLZ4': lambda ind: benchmarks.dtlz4(ind,2, 100),
'DTLZ5': lambda ind: benchmarks.dtlz5(ind,2),
'DTLZ6': lambda ind: benchmarks.dtlz6(ind,2),
'DTLZ7': lambda ind: benchmarks.dtlz7(ind,2)}
# In[39]:
toolbox.max_gen = 1000
stats = tools.Statistics(lambda ind: ind.fitness.values)
stats.register("obj_vals", np.copy)
def run_problem(toolbox, problem):
toolbox.register('evaluate', problem)
return run_ea(toolbox, stats=stats)
# Running NSGA-II on all the problems; this takes longer.
# In[40]:
get_ipython().run_line_magic('time', 'results = {problem: run_problem(toolbox, problem_instances[problem]) for problem in problem_instances}')
# Creating this animation takes more programming effort.
# In[41]:
class MultiProblemAnimation:
def init(self, fig, results):
self.results = results
self.axs = [fig.add_subplot(3,4,i+1) for i in range(len(results))]
self.plots =[]
for i, problem in enumerate(sorted(results)):
(res, logbook) = self.results[problem]
pop = pd.DataFrame(data=logbook.select('obj_vals')[0])
plot = self.axs[i].plot(pop[0], pop[1], 'b.', alpha=0.47)[0]
self.plots.append(plot)
fig.tight_layout()
def animate(self, t):
'Updates all plots to match frame _i_ of the animation.'
for i, problem in enumerate(sorted(results)):
#self.axs[i].clear()
(res, logbook) = self.results[problem]
pop = pd.DataFrame(data=logbook.select('obj_vals')[t])
self.plots[i].set_data(pop[0], pop[1])
self.axs[i].set_title(problem + '; $t=' + str(t)+'$')
self.axs[i].set_xlim((0, max(1,pop.max()[0])))
self.axs[i].set_ylim((0, max(1,pop.max()[1])))
return self.axs
# In[42]:
mpa = MultiProblemAnimation()
# In[43]:
fig = plt.figure(figsize=(14,6))
anim = animation.FuncAnimation(fig, mpa.animate, init_func=mpa.init(fig,results),
frames=toolbox.max_gen, interval=60, blit=True)
plt.close()
# In[44]:
HTML(anim.to_html5_video())
# * It is interesting how the algorithm deals with each problem: clearly some problems are harder than others.
# * In some cases it "hits" the Pareto front and then slowly explores it.
#
# **Read more about this in the class materials.**
# # Experiment design and reporting results
#
# * Watching an animation of an EMO algorithm solve a problem is certainly fun.
# * It also allows us to understand many particularities of the problem being solved.
# * But, as [Carlos Coello](http://delta.cs.cinvestav.mx/~ccoello/) would say, we are *not* in an art appreciation class.
# * We should follow the key concepts provided by the [scientific method](http://en.wikipedia.org/wiki/Scientific_method).
# * I urge you to study the [experimental design](http://en.wikipedia.org/wiki/Design_of_experiments) topic in depth as it is an essential knowledge.
#
# **Evolutionary algorithms are stochastic algorithms, therefore their results must be assessed by repeating experiments until you reach a statistically valid conclusion.**
# ## An illustrative simple/sample experiment
#
# Let's make a relatively simple but very important experiment:
#
# * **Question**: In our NSGA-II applied to a two-objective DTLZ3 problem, is it more important to have a big population and run the algorithm for a few iterations, or is it better to have a small population and run it for a larger number of iterations?
# * **Procedure**: We must perform an experiment testing different population sizes and maximum number of iterations while keeping the other parameters constant.
# ### Notation
#
# As usual we need to establish some notation:
#
# * *Multi-objective problem* (or just *problem*): A multi-objective optimization problem, as defined above.
# * *MOEA*: An evolutionary computation method used to solve multi-objective problems.
# * *Experiment*: a combination of problem and MOEA and a set of values of their parameters.
# * *Experiment run*: The result of running an experiment.
# * We will use `toolbox` instances to define experiments.
# We start by creating a `toolbox` that will contain the configuration that will be shared across all experiments.
# In[45]:
toolbox = base.Toolbox()
# In[46]:
BOUND_LOW, BOUND_UP = 0.0, 1.0
NDIM = 30
# the reason for this function is explained a few lines below
def eval_helper(ind):
return benchmarks.dtlz3(ind, 2)
toolbox.register("evaluate", eval_helper)
# In[47]:
def uniform(low, up, size=None):
try:
return [random.uniform(a, b) for a, b in zip(low, up)]
except TypeError:
return [random.uniform(a, b) for a, b in zip([low] * size, [up] * size)]
toolbox.register("attr_float", uniform, BOUND_LOW, BOUND_UP, NDIM)
toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.attr_float)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("mate", tools.cxSimulatedBinaryBounded, low=BOUND_LOW, up=BOUND_UP, eta=20.0)
toolbox.register("mutate", tools.mutPolynomialBounded, low=BOUND_LOW, up=BOUND_UP, eta=20.0, indpb=1.0/NDIM)
toolbox.register("select", tools.selNSGA2)
toolbox.mut_prob = 0.15
# We add an `experiment_name` to each `toolbox`; we will fill it in later on. Here $n_\mathrm{pop}$ denotes the population size and $t_\mathrm{max}$ the maximum number of iterations.
# In[48]:
experiment_name = "$n_\mathrm{{pop}}={0};\ t_\mathrm{{max}}={1}$"
# In[49]:
total_evals = 500
# We can now replicate this toolbox instance and then modify the population sizes.
# In[50]:
pop_sizes = (10,50,100)
# In[51]:
toolboxes=list([copy.deepcopy(toolbox) for _ in range(len(pop_sizes))])
# Now `toolboxes` is a list of copies of the same toolbox. One for each experiment configuration (population size).
# ...but we still have to set the population sizes in the elements of `toolboxes`.
# In[52]:
for pop_size, toolbox in zip(pop_sizes, toolboxes):
toolbox.pop_size = pop_size
toolbox.max_gen = total_evals // pop_size
toolbox.experiment_name = experiment_name.format(toolbox.pop_size, toolbox.max_gen)
# In[53]:
for toolbox in toolboxes:
print(toolbox.experiment_name, toolbox.pop_size, toolbox.max_gen)
# ## Experiment design
#
# As we are dealing with stochastic methods, their results should be reported relying on a statistical analysis.
#
# * A given experiment (a `toolbox` instance in our case) should be repeated a *sufficient* amount of times.
# * In theory, the more runs the better, but how many are enough? In practice, about *30* runs is usually considered sufficient.
# * The non-dominated fronts produced by each experiment run should be compared to each other.
# * We have seen in class that a number of *performance indicators*, like the *hypervolume*, *additive* and *multiplicative epsilon indicators*, among others, have been proposed for that task.
# * We can use statistical visualizations like [box plots](http://en.wikipedia.org/wiki/Box_plot) or [violin plots](http://en.wikipedia.org/wiki/Violin_plot) to make a visual assessment of the indicator values produced in each run.
# * We must apply a set of [statistical hypothesis tests](http://en.wikipedia.org/wiki/Statistical_hypothesis_testing) in order to reach a statistically valid judgment of the results of the algorithms.
#
#
# _Note_: I personally like the number [42](http://en.wikipedia.org/wiki/42_%28number%29) as it is the [answer to The Ultimate Question of Life, the Universe, and Everything](http://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy#Answer_to_the_Ultimate_Question_of_Life.2C_the_Universe.2C_and_Everything_.2842.29).
# In[54]:
number_of_runs = 42
# ### Running experiments in parallel
# As we are now solving more demanding problems, it would be nice to make our algorithms run in parallel and profit from modern multi-core CPUs.
#
# * In DEAP it is very simple to parallelize an algorithm (if it has been properly programmed) by providing a parallel `map()` function via the `toolbox`.
# * Local parallelization can be achieved using Python's [`multiprocessing`](https://docs.python.org/2/library/multiprocessing.html) or [`concurrent.futures`](https://docs.python.org/3/library/concurrent.futures.html) modules.
# * Cluster parallelization can be achieved using IPython Parallel or [SCOOP](http://en.wikipedia.org/wiki/Python_SCOOP_%28software%29).
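# In DEAP this boils down to registering a parallel `map()` in the toolbox, something like the following sketch (the helper name `parallel_map` is ours, and `deap` itself is not needed to illustrate the idea):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(func, *iterables):
    """Drop-in replacement for the built-in map(), evaluated in parallel.

    With DEAP one would register it as: toolbox.register("map", parallel_map),
    after which the algorithms evaluate individuals through it. A
    ThreadPoolExecutor is used here so the sketch runs anywhere; for
    CPU-bound fitness functions a ProcessPoolExecutor is the better
    choice -- which is exactly what raises the pickling issue below.
    """
    with ThreadPoolExecutor() as executor:
        return list(executor.map(func, *iterables))
```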
# ### Progress feedback
#
# * Another issue with these long experiments has to do with being patient.
# * A little bit of feedback on the experiment execution would be cool.
# * We can use the integer progress bar from [IPython widgets](http://nbviewer.ipython.org/github/jvns/ipython/blob/master/examples/Interactive%20Widgets/Index.ipynb) and report every time an experiment run is finished.
# In[55]:
from ipywidgets import IntProgress
from IPython.display import display
# ### A side-effect of using process-based parallelization
#
# Process-based parallelization based on `multiprocessing` requires that the parameters passed to `map()` be [pickleable](https://docs.python.org/3.4/library/pickle.html).
#
# * The direct consequence is that `lambda` functions cannot be used directly.
# * This will certainly ruin the party for all `lambda` fans out there! *(me included)*
# * Hence we need to write wrapper functions instead.
# * But, that wrapper function can take care of filtering out dominated individuals in the results.
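# A quick check of the pickling limitation (the helper names here are ours, for illustration only):

```python
import pickle

def module_level_eval(x):
    """A top-level function: picklable, safe to send to worker processes."""
    return x + 1

# Top-level functions round-trip through pickle, because pickle stores
# them by qualified name and looks them up again on loading.
restored = pickle.loads(pickle.dumps(module_level_eval))

# Lambdas do not round-trip: there is no name to look them up by.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_is_picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_is_picklable = False
```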
# In[56]:
def run_algo_wrapper(toolbox):
result, _ = run_ea(toolbox)
local_pareto_set = tools.emo.sortLogNondominated(result, len(result), first_front_only=True)
return local_pareto_set
# # All set! Run the experiments...
# In[57]:
get_ipython().run_cell_magic('time', '', 'import concurrent.futures\nprogress_bar = IntProgress(description="000/000", max=len(toolboxes)*number_of_runs)\ndisplay(progress_bar)\n\nresults = {toolbox.experiment_name:[] for toolbox in toolboxes}\nwith concurrent.futures.ProcessPoolExecutor() as executor:\n # Submit all the tasks...\n futures = {executor.submit(run_algo_wrapper, toolbox): toolbox\n for _ in range(number_of_runs)\n for toolbox in toolboxes}\n \n # ...and wait for them to finish.\n for future in concurrent.futures.as_completed(futures):\n tb = futures[future]\n results[tb.experiment_name].append(future.result())\n progress_bar.value +=1\n progress_bar.description = "%03d/%03d:" % (progress_bar.value, progress_bar.max)\n')
# This is not a perfect implementation, but it works!
# As running the experiments sometimes takes a long time it is a good practice to store the results.
# In[58]:
import pickle
# In[59]:
pickle.dump(results, open('nsga_ii_dtlz3-results.pickle', 'wb'))
# In case you need it, this file is included in the GitHub repository.
#
# To load the results we would just have to:
# In[60]:
# loaded_results = pickle.load(open('nsga_ii_dtlz3-results.pickle', 'rb'))
# results = loaded_results # <-- uncomment if needed
# `results` is a dictionary, but a pandas `DataFrame` is a more handy container for the results.
# In[61]:
res = pd.DataFrame(results)
# Headers may come out unsorted. Let's fix that first.
# In[62]:
res.head()
# In[63]:
res = res.reindex(columns=[toolbox.experiment_name for toolbox in toolboxes])  # reindex_axis was removed from pandas
# In[64]:
res.head()
# ## A first glance at the results
# In[65]:
a = res.applymap(lambda pop: [toolbox.evaluate(ind) for ind in pop])
plt.figure(figsize=(11,3))
for i, col in enumerate(a.columns):
plt.subplot(1, len(a.columns), i+1)
for pop in a[col]:
x = pd.DataFrame(data=pop)
plt.scatter(x[0], x[1], marker='.', alpha=0.5)
plt.title(col)
# The local Pareto-optimal fronts are clearly visible!
# ## Calculating performance indicators
#
# * As already mentioned, we need to evaluate the quality of the solutions produced in every execution of the algorithm.
# * We will use the hypervolume indicator for that.
# * Larger hypervolume values are better.
# * We have already filtered each population to leave only the non-dominated individuals.
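# As a sanity check of what the indicator measures, the two-objective hypervolume (for minimization) can be computed with a simple rectangle sweep. This sketch is ours; DEAP's `hypervolume` is what the experiments use:

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a non-dominated 2-D front w.r.t. reference point `ref`.

    Assumes minimization: sorting by f1 ascending makes f2 descend along
    the front, so the dominated region decomposes into disjoint rectangles,
    one per point, all anchored at the reference point.
    """
    hv = 0.0
    prev_f2 = ref[1]
    for f1, f2 in sorted(front):
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv
```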
# Calculating the *reference point*: a point that is **“worse”** than any other individual in every objective.
# In[66]:
def calculate_reference(results, epsilon=0.1):
alldata = np.concatenate(np.concatenate(results.values))
obj_vals = [toolbox.evaluate(ind) for ind in alldata]
return np.max(obj_vals, axis=0) + epsilon
# In[67]:
reference = calculate_reference(res)
# In[68]:
reference
# We can now compute the hypervolume of the Pareto-optimal fronts yielded by each algorithm run.
# In[69]:
import deap.benchmarks.tools as bt
# In[70]:
hypervols = res.applymap(lambda pop: bt.hypervolume(pop, reference))
# In[71]:
hypervols.head()
# ## How can we interpret the indicators?
# ### Option A: Tabular form
# In[72]:
hypervols.describe()
# ### Option B: Visualization
# In[73]:
fig = plt.figure(figsize=(11,4))
plt.subplot(121, title='Violin plots of the NSGA-II configurations')
seaborn.violinplot(data=hypervols, palette='Set2')
plt.ylabel('Hypervolume'); plt.xlabel('Configuration')
plt.subplot(122, title='Box plots of the NSGA-II configurations')
seaborn.boxplot(data=hypervols, palette='Set2')
plt.ylabel('Hypervolume'); plt.xlabel('Configuration');
plt.tight_layout()
# ## Option C: Statistical hypothesis test
# * Choosing the correct statistical test is essential to properly report the results.
# * [Nonparametric statistics](http://en.wikipedia.org/wiki/Nonparametric_statistics) can lend a helping hand.
# * [Parametric statistics](http://en.wikipedia.org/wiki/Parametric_statistics) could be a better choice in some cases.
# * Parametric statistics require that *all* data follow a known distribution (frequently a normal one).
# * Some tests, like the [normality test](http://en.wikipedia.org/wiki/Normality_test), can be applied to verify that the data meet the parametric requirements.
# * In my experience, it is very unlikely that all your EMO results meet those requirements.
# We start by writing a function that helps us tabulate the results of applying a statistical hypothesis test.
# In[74]:
import itertools
import scipy.stats as stats
# In[75]:
def compute_stat_matrix(data, stat_func):
    '''Applies `stat_func` to all pairs of columns in `data`.
    Returns a square DataFrame with the p-values.'''
    p_values = pd.DataFrame(columns=data.columns, index=data.columns)
    for a, b in itertools.combinations(data.columns, 2):
        _, p = stat_func(data[a], data[b])
        p_values.loc[b, a] = p  # .loc replaces the long-removed .ix accessor
        p_values.loc[a, b] = p
    return p_values
# The [Kruskal-Wallis H-test](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kruskal.html) tests the null hypothesis that the population median of all of the groups are equal.
#
# * It is a non-parametric version of [ANOVA](http://en.wikipedia.org/wiki/Analysis_of_variance).
# * The test works on 2 or more independent samples, which may have different sizes.
# * Note that rejecting the null hypothesis does not indicate which of the groups differs.
# * Post-hoc comparisons between groups are required to determine which groups are different.
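# A toy illustration of the test (made-up samples, not our experiment data): two clearly separated groups yield a p-value below the usual 0.05 threshold, so the null hypothesis of equal medians would be rejected:

```python
import scipy.stats as stats

group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [10.0, 11.0, 12.0, 13.0, 14.0]

# kruskal() returns the H statistic and the p-value.
h_statistic, p_value = stats.kruskal(group_a, group_b)
# p_value < 0.05 -> the groups' medians are declared different.
```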
# In[76]:
stats.kruskal(*[hypervols[col] for col in hypervols.columns])
# We can now assert that the results are not all the same, but which ones are different and which are similar?
# In case that the null hypothesis of the Kruskal-Wallis is rejected the Conover–Inman procedure (Conover, 1999, pp. 288-290) can be applied in a pairwise manner in order to determine if the results of one algorithm were significantly better than those of the other.
#
# * Conover, W. J. (1999). *Practical Nonparametric Statistics*. John Wiley & Sons, New York, 3rd edition.
#
# _Note_: If you want to get an extended summary of this method check out my [PhD thesis](http://lmarti.com/pubs).
# In[77]:
def conover_inman_procedure(data, alpha=0.05):
    num_runs = len(data)
    num_algos = len(data.columns)
    N = num_runs * num_algos
    # We need the Kruskal-Wallis H statistic (the first return value), not the p-value.
    H, _ = stats.kruskal(*[data[col] for col in data.columns])
    ranked = stats.rankdata(np.concatenate([data[col] for col in data.columns]))
    ranksums = []
    for i in range(num_algos):
        ranksums.append(np.sum(ranked[num_runs*i:num_runs*(i+1)]))
    S_sq = (np.sum(ranked**2) - N*((N+1)**2)/4)/(N-1)
    # Critical difference: the t quantile (ppf, not cdf) times Conover's standard error.
    right_side = stats.t.ppf(1 - alpha/2, N - num_algos) * \
        math.sqrt(S_sq*((N - 1 - H)/(N - num_algos))*2/num_runs)
    res = pd.DataFrame(columns=data.columns, index=data.columns)
    for i, j in itertools.combinations(np.arange(num_algos), 2):
        diff = abs(ranksums[i] - ranksums[j])/num_runs  # difference of mean ranks
        res.loc[res.columns[j], res.columns[i]] = diff > right_side
        res.loc[res.columns[i], res.columns[j]] = diff > right_side
    return res
# In[78]:
conover_inman_procedure(hypervols)
# We now know in what cases the difference is sufficient as to say that one result is better than the other.
# Another alternative is the [Friedman test](http://en.wikipedia.org/wiki/Friedman_test).
#
# * Its null hypothesis is that repeated measurements of the same individuals have the same distribution.
# * It is often used to test for consistency among measurements obtained in different ways.
# * For example, if two measurement techniques are used on the same set of individuals, the Friedman test can be used to determine if the two measurement techniques are consistent.
# In[79]:
measurements = [list(hypervols[col]) for col in hypervols.columns]
stats.friedmanchisquare(*measurements)
# The [Mann–Whitney U test](http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) (also called the Mann–Whitney–Wilcoxon (MWW), Wilcoxon rank-sum test (WRS), or Wilcoxon–Mann–Whitney test) is a nonparametric test of the null hypothesis that two populations are the same, against the alternative that one population tends to have larger values than the other.
#
# It has greater efficiency than the $t$-test on non-normal distributions, such as a mixture of normal distributions, and it is nearly as efficient as the $t$-test on normal distributions.
# In[80]:
raw_p_values=compute_stat_matrix(hypervols, stats.mannwhitneyu)
raw_p_values
# The [familywise error rate](http://en.wikipedia.org/wiki/Familywise_error_rate) (FWER) is the probability of making one or more false discoveries, or [type I errors](http://en.wikipedia.org/wiki/Type_I_and_type_II_errors), among all the hypotheses when performing multiple hypotheses tests.
#
# _Example_: When performing a single test there is an $\alpha$ chance of making a type I error. If we make $m$ tests, the probability of making at least one type I error can be as high as $m\alpha$. Therefore, if $\alpha=0.05$ is used and 5 pairwise comparisons are made, we risk up to a $5\times0.05 = 0.25$ chance of making a type I error.
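# The arithmetic of that example in plain Python: $m\alpha$ is an upper bound, while for $m$ independent tests the exact probability of at least one false positive is $1-(1-\alpha)^m$, slightly below it:

```python
alpha, m = 0.05, 5

# Upper (Bonferroni) bound on the family-wise error rate.
fwer_bound = m * alpha                 # 0.25

# Exact FWER for m *independent* tests at level alpha.
fwer_exact = 1.0 - (1.0 - alpha) ** m  # ~0.2262
```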
#
# * FWER procedures (such as the [Bonferroni correction](http://en.wikipedia.org/wiki/Bonferroni_correction)) exert a more stringent control over false discoveries than [false discovery rate](http://en.wikipedia.org/wiki/False_discovery_rate) (FDR) controlling procedures.
# * FWER control seeks to reduce the probability of even one false discovery, as opposed to the expected proportion of false discoveries.
# * Thus, FDR procedures have greater power at the cost of increased rates of type I errors, i.e., rejecting the null hypothesis of no effect when it should be accepted.
# One of these corrections is the [Šidák correction](http://en.wikipedia.org/wiki/%C5%A0id%C3%A1k_correction) as it is less conservative than the [Bonferroni correction](http://en.wikipedia.org/wiki/Bonferroni_correction):
# $$\alpha_{SID} = 1-(1-\alpha)^\frac{1}{m},$$
# where $m$ is the number of tests.
#
# * In our case $m$ is the number of combinations of algorithm configurations taken two at a time,
# $$
# m = {\mathtt{number\_of\_experiments} \choose 2}.
# $$
# * There are other corrections that can be used.
# In[81]:
from scipy.special import comb  # scipy.misc.comb was removed in recent SciPy versions
alpha=0.05
alpha_sid = 1 - (1-alpha)**(1/comb(len(hypervols.columns), 2))
alpha_sid
# Let's apply the corrected alpha to `raw_p_values`. A cell with a `True` value means that the null hypothesis is rejected for that pair, i.e., those two results are significantly different.
# In[82]:
raw_p_values.applymap(lambda value: value <= alpha_sid)
# ## Further -and highly recommended- reading
#
# * Cohen, P. R. (1995). _Empirical Methods for Artificial Intelligence_ (Vol. 139). Cambridge: MIT press. [link](http://mitpress.mit.edu/books/empirical-methods-artificial-intelligence)
# * Bartz-Beielstein, Thomas (2006). _Experimental Research in Evolutionary Computation: The New Experimentalism_. Springer [link](http://link.springer.com/book/10.1007%2F3-540-32027-X)
# * García, S., & Herrera, F. (2008). _An Extension on “Statistical Comparisons of Classifiers over Multiple Data Sets” for all Pairwise Comparisons_. Journal of Machine Learning Research, 9, 2677–2694. [pdf](http://www.jmlr.org/papers/v9/garcia08a.html)
# # Final remarks
#
#
# In this class/notebook we have seen some key elements:
#
# 1. The Pareto dominance relation in action.
# 2. The NSGA-II algorithm.
# 3. Some of the existing MOP benchmarks.
# 4. How to perform experiments and draw statistically valid conclusions from them.
#
# Bear in mind that:
#
# * When working in EMO topics problems like those of the CEC'09 or WFG toolkits are usually involved.
# * The issue of devising a proper experiment design and interpreting the results is a fundamental one.
# * The experimental setup presented here can be applied with little modification to single-objective optimization, and even to other machine learning or stochastic algorithms.
#